- Would like to acquire Reviewed-by tags from Thomas for MSI and IMS related bits.
- Would like to acquire Reviewed-by tags from Alex and/or Kirti for the VFIO mdev driver bits.
- Would like to acquire a Reviewed-by tag from Bjorn for the PCI common bits.
- Would like to acquire 5.11 kernel acceptance through dmaengine (Vinod) with the review tags.
v4:
dev-msi:
- Make interrupt remapping code more readable (Thomas)
- Add flush writes to unmask/write and reset ims slots (Thomas)
- Interrupt Message Storm -> Interrupt Message Store (Thomas)
- Merge in pasid programming code. (Thomas)
mdev:
- Fixed up domain assignment (Thomas)
- Define magic numbers (Thomas)
- Move siov detection code to PCI common (Thomas)
- Remove duplicated MSI entry info (Thomas)
- Convert code to use ims_slot (Thomas)
- Add explanation of pasid programming for IMS entry (Thomas)
- Add interrupt handle release support due to DSA spec 1.1 update.
v3:
Dev-msi:
- No need to add support for 2 different dev-msi irq domains, a common
one can be used for both cases (with IR enabled/disabled)
- Add arch specific function to specify additions to msi_prepare callback
instead of making the callback a weak function
- Call platform ops directly instead of a wrapper function
- Make mask/unmask callbacks as void functions
- dev->msi_domain should be updated at the device driver level before
calling dev_msi_alloc_irqs()
- dev_msi_alloc/free_irqs() cannot be used for PCI devices
- Followed the generic layering scheme: infrastructure bits -> arch bits -> enabling bits
Mdev:
- Remove set kvm group notifier (Yan Zhao)
- Fix VFIO irq trigger removal (Yan Zhao)
- Add mmio read flush to ims mask (Jason)
v2:
IMS (now dev-msi):
- With recommendations from Jason/Thomas/Dan on making IMS more generic:
- Pass a non-PCI generic device (struct device) for IMS management instead of mdev
- Remove all references to mdev and symbol_get/put
- Remove all references to IMS in common code and replace with dev-msi
- Remove dynamic allocation of platform-msi interrupts: no groups, no
new msi list or list helpers
- Create a generic dev-msi domain with and without interrupt remapping enabled.
- Introduce dev_msi_domain_alloc_irqs and dev_msi_domain_free_irqs apis
mdev:
- Removed unrelated bits from SVA enabling that are not necessary for
the submission. (Kevin)
- Restructured entire mdev driver series to make reviewing easier (Kevin)
- Made rw emulation more robust (Kevin)
- Removed uuid wq type and added single dedicated wq type (Kevin)
- Locking fixes for vdev (Yan Zhao)
- VFIO MSIX trigger fixes (Yan Zhao)
This code series will match the support of the 5.6 kernel (stage 1) driver but on the guest.
The code has a dependency on Thomas’s MSI restructuring patch series:
https://lore.kernel.org/lkml/[email protected]/
The code has a dependency on Baolu’s mdev domain patches:
https://lore.kernel.org/lkml/[email protected]/
The code has a dependency on David Box’s dvsec definition patch:
https://lore.kernel.org/linux-pci/[email protected]/T/#m1d0dc12e3b2c739e2c37106a45f325bb8f001774
Stage 1 of the driver has been accepted in the v5.6 kernel. It supports dedicated workqueues (wq)
without Shared Virtual Memory (SVM) support.
Stage 2 of the driver supports shared wq and SVM. It is pending for 5.11 and is in
dmaengine/next.
The VFIO mediated device framework allows vendor drivers to wrap a portion of
device resources into virtual devices (mdev). Each mdev can be assigned
to a different guest using the same set of VFIO uAPIs as assigning a
physical device. Access to the mdev resources is served with mixed
policies. For example, vendor drivers typically mark the data-path interface
as pass-through for fast guest operations, and then trap-and-mediate the
control-path interface to avoid undesired interference between mdevs. Some
level of emulation is necessary behind vfio mdev to compose the virtual
device interface.
This series brings mdev to idxd driver to enable Intel Scalable IOV
(SIOV), a hardware-assisted mediated pass-through technology. SIOV makes
each DSA wq independently assignable through PASID-granular resource/DMA
isolation. It helps improve scalability and reduces mediation complexity
compared to purely software-based mdev implementations. Each assigned wq is
configured by the host and exposed to the guest in a read-only configuration
mode, which allows the guest to use the wq w/o additional setup. This
design greatly reduces the emulation bits to focus on handling commands
from guests.
There are two possible avenues to support virtual device composition:
1. VFIO mediated device (mdev) or 2. User space DMA through char device
(or UACCE). Given the small portion of emulation needed to satisfy our needs
and the fact that VFIO mdev already has the infrastructure to support device
passthrough, we feel that VFIO mdev is the better route. For a more in-depth
explanation, see the documentation in Documentation/driver-api/vfio/mdev-idxd.rst.
Introducing the “1dwq-v1” mdev type. This mdev type allows
allocation of a single dedicated wq from the available dedicated wqs. After
a workqueue (wq) is enabled, the user will generate a uuid. On mdev
creation, the mdev driver code will find a dwq depending on the mdev
type. When the create operation is successful, the user-generated uuid
can be passed to qemu. When the guest boots up, it should discover a
DSA device during PCI discovery.
For example, for the “1dwq-v1” type:
1. Enable a wq with the “mdev” wq type.
2. Generate a uuid.
3. Write the uuid to the mdev class sysfs path:
echo $UUID > /sys/class/mdev_bus/0000\:00\:0a.0/mdev_supported_types/idxd-1dwq-v1/create
4. Pass the following parameter to qemu:
"-device vfio-pci,sysfsdev=/sys/bus/pci/devices/0000:00:0a.0/$UUID"
The wq exported through mdev will have the read-only config bit set
for configuration. This means that the device does not require the
typical configuration. After enabling the device, the user must set the
WQ type and name. That is all that is necessary to enable the WQ and start
using it. The single wq configuration is not the only way to create the
mdev. Multi-wq support for mdev will come in future work.
The mdev utilizes Interrupt Message Store or IMS[3], a device-specific
MSI implementation, instead of MSIX for interrupts for the guest. This
preserves MSIX for host usages and also allows a significantly larger
number of interrupt vectors for guest usage.
The idxd driver implements IMS as on-device memory mapped unified
storage. Each interrupt message is stored as a DWORD size data payload
and a 64-bit address (same as MSI-X). Access to the IMS is through the
host idxd driver.
The idxd driver makes use of the generic IMS irq chip and domain which
stores the interrupt messages as an array in device memory. Allocation and
freeing of interrupts happens via the generic msi_domain_alloc/free_irqs()
interface. One only needs to ensure the interrupt domain is stored in
the underlying device struct.
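As an illustration only, a minimal sketch of the driver-side setup
(mirroring the example in the mdev-idxd.rst documentation below; the names
mmio_base, ims_offset, max_ims_size and num_vecs are placeholders):

	struct ims_array_info ims_info;
	struct device *dev = &pci_dev->dev;
	int rc;

	/* describe the on-device IMS slot array */
	ims_info.max_slots = max_ims_size;
	ims_info.slots = mmio_base + ims_offset;

	/* create the IMS irq domain and attach it to the device */
	dev->msi_domain = pci_ims_array_create_msi_irq_domain(pci_dev, &ims_info);

	/* later, allocate vectors against the mdev device */
	rc = msi_domain_alloc_irqs(dev->msi_domain, mdev_dev(mdev), num_vecs);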
[1]: https://lore.kernel.org/lkml/157965011794.73301.15960052071729101309.stgit@djiang5-desk3.ch.intel.com/
[2]: https://software.intel.com/en-us/articles/intel-sdm
[3]: https://software.intel.com/en-us/download/intel-scalable-io-virtualization-technical-specification
[4]: https://software.intel.com/en-us/download/intel-data-streaming-accelerator-preliminary-architecture-specification
[5]: https://01.org/blogs/2019/introducing-intel-data-streaming-accelerator
[6]: https://intel.github.io/idxd/
[7]: https://github.com/intel/idxd-driver idxd-stage2.5
---
Dave Jiang (15):
dmaengine: idxd: add theory of operation documentation for idxd mdev
dmaengine: idxd: add support for readonly config devices
dmaengine: idxd: add interrupt handle request support
PCI: add SIOV and IMS capability detection
dmaengine: idxd: add IMS support in base driver
dmaengine: idxd: add device support functions in prep for mdev
dmaengine: idxd: add basic mdev registration and helper functions
dmaengine: idxd: add emulation rw routines
dmaengine: idxd: prep for virtual device commands
dmaengine: idxd: virtual device commands emulation
dmaengine: idxd: ims setup for the vdcm
dmaengine: idxd: add mdev type as a new wq type
dmaengine: idxd: add dedicated wq mdev type
dmaengine: idxd: add new wq state for mdev
dmaengine: idxd: add error notification from host driver to mediated device
Megha Dey (1):
iommu/vt-d: Add DEV-MSI support
Thomas Gleixner (1):
irqchip: Add IMS (Interrupt Message Store) driver
.../ABI/stable/sysfs-driver-dma-idxd | 6 +
Documentation/driver-api/vfio/mdev-idxd.rst | 404 ++++++
MAINTAINERS | 1 +
drivers/dma/Kconfig | 9 +
drivers/dma/idxd/Makefile | 2 +
drivers/dma/idxd/cdev.c | 6 +-
drivers/dma/idxd/device.c | 294 ++++-
drivers/dma/idxd/idxd.h | 67 +-
drivers/dma/idxd/init.c | 86 ++
drivers/dma/idxd/irq.c | 6 +-
drivers/dma/idxd/mdev.c | 1121 +++++++++++++++++
drivers/dma/idxd/mdev.h | 116 ++
drivers/dma/idxd/registers.h | 38 +-
drivers/dma/idxd/submit.c | 37 +-
drivers/dma/idxd/sysfs.c | 52 +-
drivers/dma/idxd/vdev.c | 976 ++++++++++++++
drivers/dma/idxd/vdev.h | 28 +
drivers/iommu/intel/iommu.c | 31 +-
drivers/iommu/intel/irq_remapping.c | 34 +-
drivers/pci/Kconfig | 15 +
drivers/pci/Makefile | 2 +
drivers/pci/dvsec.c | 40 +
drivers/pci/siov.c | 50 +
include/linux/pci-siov.h | 18 +
include/linux/pci.h | 3 +
include/uapi/linux/idxd.h | 2 +
include/uapi/linux/pci_regs.h | 4 +
kernel/irq/msi.c | 2 +
28 files changed, 3352 insertions(+), 98 deletions(-)
create mode 100644 Documentation/driver-api/vfio/mdev-idxd.rst
create mode 100644 drivers/dma/idxd/mdev.c
create mode 100644 drivers/dma/idxd/mdev.h
create mode 100644 drivers/dma/idxd/vdev.c
create mode 100644 drivers/dma/idxd/vdev.h
create mode 100644 drivers/pci/dvsec.c
create mode 100644 drivers/pci/siov.c
create mode 100644 include/linux/pci-siov.h
--
From: Thomas Gleixner <[email protected]>
Generic IMS (Interrupt Message Store) irq chips and irq domain
implementations for IMS based devices which store the interrupt
messages in an array in device memory.
Allocation and freeing of interrupts happens via the generic
msi_domain_alloc/free_irqs() interface. No special purpose IMS magic
required as long as the interrupt domain is stored in the underlying
device struct.
Provide storage and a setter for an Address Space Identifier. The
identifier is stored in the top level irq_data and it only can be
modified when the interrupt is not active. Add the necessary storage
and helper functions and validate that interrupts which require an
ASID have one assigned.
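For illustration, a rough sketch of how a driver could program the PASID
into an IMS slot with the helpers added here (irq and pasid are assumed to
come from the caller):

	u32 auxval = ims_ctrl_pasid_aux(pasid, true);
	int rc;

	/* update the control word of the IMS slot backing this interrupt */
	rc = irq_set_auxdata(irq, IMS_AUXDATA_CONTROL_WORD, auxval);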
[Megha : Fixed compile time errors
Added necessary dependencies to IMS_MSI_ARRAY config
Fixed polarity of IMS_VECTOR_CTRL
Added reads after writes to flush writes to device
Tested the IMS infrastructure with the IDXD driver]
Signed-off-by: Thomas Gleixner <[email protected]>
Signed-off-by: Megha Dey <[email protected]>
Signed-off-by: Dave Jiang <[email protected]>
---
drivers/irqchip/Kconfig | 14 ++
drivers/irqchip/Makefile | 1
drivers/irqchip/irq-ims-msi.c | 204 +++++++++++++++++++++++++++++++++++
include/linux/interrupt.h | 2
include/linux/irq.h | 4 +
include/linux/irqchip/irq-ims-msi.h | 68 ++++++++++++
kernel/irq/manage.c | 32 +++++
7 files changed, 325 insertions(+)
create mode 100644 drivers/irqchip/irq-ims-msi.c
create mode 100644 include/linux/irqchip/irq-ims-msi.h
diff --git a/drivers/irqchip/Kconfig b/drivers/irqchip/Kconfig
index c6098eee0c7c..862ea81a69a0 100644
--- a/drivers/irqchip/Kconfig
+++ b/drivers/irqchip/Kconfig
@@ -597,4 +597,18 @@ config MST_IRQ
help
Support MStar Interrupt Controller.
+config IMS_MSI
+ depends on PCI
+ select DEVICE_MSI
+ bool
+
+config IMS_MSI_ARRAY
+ bool "IMS Interrupt Message Store MSI controller for device memory storage arrays"
+ depends on PCI
+ select IMS_MSI
+ select GENERIC_MSI_IRQ_DOMAIN
+ help
+ Support for IMS Interrupt Message Store MSI controller
+ with IMS slot storage in a slot array in device memory
+
endmenu
diff --git a/drivers/irqchip/Makefile b/drivers/irqchip/Makefile
index 94c2885882ee..a7d54605060a 100644
--- a/drivers/irqchip/Makefile
+++ b/drivers/irqchip/Makefile
@@ -114,3 +114,4 @@ obj-$(CONFIG_LOONGSON_PCH_PIC) += irq-loongson-pch-pic.o
obj-$(CONFIG_LOONGSON_PCH_MSI) += irq-loongson-pch-msi.o
obj-$(CONFIG_MST_IRQ) += irq-mst-intc.o
obj-$(CONFIG_SL28CPLD_INTC) += irq-sl28cpld.o
+obj-$(CONFIG_IMS_MSI) += irq-ims-msi.o
diff --git a/drivers/irqchip/irq-ims-msi.c b/drivers/irqchip/irq-ims-msi.c
new file mode 100644
index 000000000000..d54a54f5fdcc
--- /dev/null
+++ b/drivers/irqchip/irq-ims-msi.c
@@ -0,0 +1,204 @@
+// SPDX-License-Identifier: GPL-2.0
+// (C) Copyright 2020 Thomas Gleixner <[email protected]>
+/*
+ * Shared interrupt chips and irq domains for IMS devices
+ */
+#include <linux/device.h>
+#include <linux/slab.h>
+#include <linux/msi.h>
+#include <linux/irq.h>
+#include <linux/irqdomain.h>
+
+#include <linux/irqchip/irq-ims-msi.h>
+
+#ifdef CONFIG_IMS_MSI_ARRAY
+
+struct ims_array_data {
+ struct ims_array_info info;
+ unsigned long map[0];
+};
+
+static inline void iowrite32_and_flush(u32 value, void __iomem *addr)
+{
+ iowrite32(value, addr);
+ ioread32(addr);
+}
+
+static void ims_array_mask_irq(struct irq_data *data)
+{
+ struct msi_desc *desc = irq_data_get_msi_desc(data);
+ struct ims_slot __iomem *slot = desc->device_msi.priv_iomem;
+ u32 __iomem *ctrl = &slot->ctrl;
+
+ iowrite32_and_flush(ioread32(ctrl) | IMS_CTRL_VECTOR_MASKBIT, ctrl);
+}
+
+static void ims_array_unmask_irq(struct irq_data *data)
+{
+ struct msi_desc *desc = irq_data_get_msi_desc(data);
+ struct ims_slot __iomem *slot = desc->device_msi.priv_iomem;
+ u32 __iomem *ctrl = &slot->ctrl;
+
+ iowrite32_and_flush(ioread32(ctrl) & ~IMS_CTRL_VECTOR_MASKBIT, ctrl);
+}
+
+static void ims_array_write_msi_msg(struct irq_data *data, struct msi_msg *msg)
+{
+ struct msi_desc *desc = irq_data_get_msi_desc(data);
+ struct ims_slot __iomem *slot = desc->device_msi.priv_iomem;
+
+ iowrite32(msg->address_lo, &slot->address_lo);
+ iowrite32(msg->address_hi, &slot->address_hi);
+ iowrite32_and_flush(msg->data, &slot->data);
+}
+
+static int ims_array_set_auxdata(struct irq_data *data, unsigned int which,
+ u64 auxval)
+{
+ struct msi_desc *desc = irq_data_get_msi_desc(data);
+ struct ims_slot __iomem *slot = desc->device_msi.priv_iomem;
+ u32 val, __iomem *ctrl = &slot->ctrl;
+
+ if (which != IMS_AUXDATA_CONTROL_WORD)
+ return -EINVAL;
+ if (auxval & ~(u64)IMS_CONTROL_WORD_AUXMASK)
+ return -EINVAL;
+
+ val = ioread32(ctrl) & IMS_CONTROL_WORD_IRQMASK;
+ iowrite32_and_flush(val | (u32) auxval, ctrl);
+ return 0;
+}
+
+static const struct irq_chip ims_array_msi_controller = {
+ .name = "IMS",
+ .irq_mask = ims_array_mask_irq,
+ .irq_unmask = ims_array_unmask_irq,
+ .irq_write_msi_msg = ims_array_write_msi_msg,
+ .irq_set_auxdata = ims_array_set_auxdata,
+ .irq_retrigger = irq_chip_retrigger_hierarchy,
+ .flags = IRQCHIP_SKIP_SET_WAKE,
+};
+
+static void ims_array_reset_slot(struct ims_slot __iomem *slot)
+{
+ iowrite32(0, &slot->address_lo);
+ iowrite32(0, &slot->address_hi);
+ iowrite32(0, &slot->data);
+ iowrite32_and_flush(IMS_CTRL_VECTOR_MASKBIT, &slot->ctrl);
+}
+
+static void ims_array_free_msi_store(struct irq_domain *domain,
+ struct device *dev)
+{
+ struct msi_domain_info *info = domain->host_data;
+ struct ims_array_data *ims = info->data;
+ struct msi_desc *entry;
+
+ for_each_msi_entry(entry, dev) {
+ if (entry->device_msi.priv_iomem) {
+ clear_bit(entry->device_msi.hwirq, ims->map);
+ ims_array_reset_slot(entry->device_msi.priv_iomem);
+ entry->device_msi.priv_iomem = NULL;
+ entry->device_msi.hwirq = 0;
+ }
+ }
+}
+
+static int ims_array_alloc_msi_store(struct irq_domain *domain,
+ struct device *dev, int nvec)
+{
+ struct msi_domain_info *info = domain->host_data;
+ struct ims_array_data *ims = info->data;
+ struct msi_desc *entry;
+
+ for_each_msi_entry(entry, dev) {
+ unsigned int idx;
+
+ idx = find_first_zero_bit(ims->map, ims->info.max_slots);
+ if (idx >= ims->info.max_slots)
+ goto fail;
+ set_bit(idx, ims->map);
+ entry->device_msi.priv_iomem = &ims->info.slots[idx];
+ ims_array_reset_slot(entry->device_msi.priv_iomem);
+ entry->device_msi.hwirq = idx;
+ }
+ return 0;
+
+fail:
+ ims_array_free_msi_store(domain, dev);
+ return -ENOSPC;
+}
+
+struct ims_array_domain_template {
+ struct msi_domain_ops ops;
+ struct msi_domain_info info;
+};
+
+static const struct ims_array_domain_template ims_array_domain_template = {
+ .ops = {
+ .msi_alloc_store = ims_array_alloc_msi_store,
+ .msi_free_store = ims_array_free_msi_store,
+ },
+ .info = {
+ .flags = MSI_FLAG_USE_DEF_DOM_OPS |
+ MSI_FLAG_USE_DEF_CHIP_OPS,
+ .handler = handle_edge_irq,
+ .handler_name = "edge",
+ },
+};
+
+struct irq_domain *
+pci_ims_array_create_msi_irq_domain(struct pci_dev *pdev,
+ struct ims_array_info *ims_info)
+{
+ struct ims_array_domain_template *info;
+ struct ims_array_data *data;
+ struct irq_domain *domain;
+ struct irq_chip *chip;
+ unsigned int size;
+
+ /* Allocate new domain storage */
+ info = kmemdup(&ims_array_domain_template,
+ sizeof(ims_array_domain_template), GFP_KERNEL);
+ if (!info)
+ return NULL;
+ /* Link the ops */
+ info->info.ops = &info->ops;
+
+ /* Allocate ims_info along with the bitmap */
+ size = sizeof(*data);
+ size += BITS_TO_LONGS(ims_info->max_slots) * sizeof(unsigned long);
+ data = kzalloc(size, GFP_KERNEL);
+ if (!data)
+ goto err_info;
+
+ data->info = *ims_info;
+ info->info.data = data;
+
+ /*
+ * Allocate an interrupt chip because the core needs to be able to
+ * update it with default callbacks.
+ */
+ chip = kmemdup(&ims_array_msi_controller,
+ sizeof(ims_array_msi_controller), GFP_KERNEL);
+ if (!chip)
+ goto err_data;
+ info->info.chip = chip;
+
+ domain = pci_subdevice_msi_create_irq_domain(pdev, &info->info);
+ if (!domain)
+ goto err_chip;
+
+ return domain;
+
+err_chip:
+ kfree(chip);
+err_data:
+ kfree(data);
+err_info:
+ kfree(info);
+ return NULL;
+}
+EXPORT_SYMBOL_GPL(pci_ims_array_create_msi_irq_domain);
+
+#endif /* CONFIG_IMS_MSI_ARRAY */
diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h
index ee8299eb1f52..43a8d1e9647e 100644
--- a/include/linux/interrupt.h
+++ b/include/linux/interrupt.h
@@ -487,6 +487,8 @@ extern int irq_get_irqchip_state(unsigned int irq, enum irqchip_irq_state which,
extern int irq_set_irqchip_state(unsigned int irq, enum irqchip_irq_state which,
bool state);
+int irq_set_auxdata(unsigned int irq, unsigned int which, u64 val);
+
#ifdef CONFIG_IRQ_FORCED_THREADING
# ifdef CONFIG_PREEMPT_RT
# define force_irqthreads (true)
diff --git a/include/linux/irq.h b/include/linux/irq.h
index c54365309e97..fd162aea0c3f 100644
--- a/include/linux/irq.h
+++ b/include/linux/irq.h
@@ -491,6 +491,8 @@ static inline irq_hw_number_t irqd_to_hwirq(struct irq_data *d)
* irq_request_resources
* @irq_compose_msi_msg: optional to compose message content for MSI
* @irq_write_msi_msg: optional to write message content for MSI
+ * @irq_set_auxdata: Optional function to update auxiliary data e.g. in
+ * shared registers
* @irq_get_irqchip_state: return the internal state of an interrupt
* @irq_set_irqchip_state: set the internal state of a interrupt
* @irq_set_vcpu_affinity: optional to target a vCPU in a virtual machine
@@ -538,6 +540,8 @@ struct irq_chip {
void (*irq_compose_msi_msg)(struct irq_data *data, struct msi_msg *msg);
void (*irq_write_msi_msg)(struct irq_data *data, struct msi_msg *msg);
+ int (*irq_set_auxdata)(struct irq_data *data, unsigned int which, u64 auxval);
+
int (*irq_get_irqchip_state)(struct irq_data *data, enum irqchip_irq_state which, bool *state);
int (*irq_set_irqchip_state)(struct irq_data *data, enum irqchip_irq_state which, bool state);
diff --git a/include/linux/irqchip/irq-ims-msi.h b/include/linux/irqchip/irq-ims-msi.h
new file mode 100644
index 000000000000..a9e43e1f7890
--- /dev/null
+++ b/include/linux/irqchip/irq-ims-msi.h
@@ -0,0 +1,68 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/* (C) Copyright 2020 Thomas Gleixner <[email protected]> */
+
+#ifndef _LINUX_IRQCHIP_IRQ_IMS_MSI_H
+#define _LINUX_IRQCHIP_IRQ_IMS_MSI_H
+
+#include <linux/types.h>
+#include <linux/bits.h>
+
+/**
+ * ims_hw_slot - The hardware layout of an IMS based MSI message
+ * @address_lo: Lower 32bit address
+ * @address_hi: Upper 32bit address
+ * @data: Message data
+ * @ctrl: Control word
+ *
+ * This structure is used by both the device memory array and the queue
+ * memory variants of IMS.
+ */
+struct ims_slot {
+ u32 address_lo;
+ u32 address_hi;
+ u32 data;
+ u32 ctrl;
+} __packed;
+
+/*
+ * The IMS control word utilizes bit 0-2 for interrupt control. The remaining
+ * bits can contain auxiliary data.
+ */
+#define IMS_CONTROL_WORD_IRQMASK GENMASK(2, 0)
+#define IMS_CONTROL_WORD_AUXMASK GENMASK(31, 3)
+
+/* Bit to mask the interrupt in ims_hw_slot::ctrl */
+#define IMS_CTRL_VECTOR_MASKBIT BIT(0)
+
+/* Auxiliary control word data related defines */
+enum {
+ IMS_AUXDATA_CONTROL_WORD,
+};
+
+#define IMS_CTRL_PASID_ENABLE BIT(3)
+#define IMS_CTRL_PASID_SHIFT 12
+
+static inline u32 ims_ctrl_pasid_aux(unsigned int pasid, bool enable)
+{
+ u32 auxval = pasid << IMS_CTRL_PASID_SHIFT;
+
+ return enable ? auxval | IMS_CTRL_PASID_ENABLE : auxval;
+}
+
+/**
+ * struct ims_array_info - Information to create an IMS array domain
+ * @slots: Pointer to the start of the array
+ * @max_slots: Maximum number of slots in the array
+ */
+struct ims_array_info {
+ struct ims_slot __iomem *slots;
+ unsigned int max_slots;
+};
+
+struct pci_dev;
+struct irq_domain;
+
+struct irq_domain *pci_ims_array_create_msi_irq_domain(struct pci_dev *pdev,
+ struct ims_array_info *ims_info);
+
+#endif
diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c
index dc65d90108db..d7bf2ae67170 100644
--- a/kernel/irq/manage.c
+++ b/kernel/irq/manage.c
@@ -2752,3 +2752,35 @@ int irq_set_irqchip_state(unsigned int irq, enum irqchip_irq_state which,
return err;
}
EXPORT_SYMBOL_GPL(irq_set_irqchip_state);
+
+/**
+ * irq_set_auxdata - Set auxiliary data
+ * @irq: Interrupt to update
+ * @which: Selector which data to update
+ * @auxval: Auxiliary data value
+ *
+ * Function to update auxiliary data for an interrupt, e.g. to update data
+ * which is stored in a shared register or data storage (e.g. IMS).
+ */
+int irq_set_auxdata(unsigned int irq, unsigned int which, u64 val)
+{
+ struct irq_desc *desc;
+ struct irq_data *data;
+ unsigned long flags;
+ int res = -ENODEV;
+
+ desc = irq_get_desc_buslock(irq, &flags, 0);
+ if (!desc)
+ return -EINVAL;
+
+ for (data = &desc->irq_data; data; data = irqd_get_parent_data(data)) {
+ if (data->chip->irq_set_auxdata) {
+ res = data->chip->irq_set_auxdata(data, which, val);
+ break;
+ }
+ }
+
+ irq_put_desc_busunlock(desc, flags);
+ return res;
+}
+EXPORT_SYMBOL_GPL(irq_set_auxdata);
From: Megha Dey <[email protected]>
Add required support in the interrupt remapping driver for devices
which generate dev-msi interrupts and use the intel remapping
domain as the parent domain.
Reviewed-by: Ashok Raj <[email protected]>
Suggested-by: Thomas Gleixner <[email protected]>
Signed-off-by: Megha Dey <[email protected]>
Signed-off-by: Dave Jiang <[email protected]>
---
drivers/iommu/intel/irq_remapping.c | 34 ++++++++++++++++++++++------------
1 file changed, 22 insertions(+), 12 deletions(-)
diff --git a/drivers/iommu/intel/irq_remapping.c b/drivers/iommu/intel/irq_remapping.c
index 0cfce1d3b7bb..0e8d106d34c0 100644
--- a/drivers/iommu/intel/irq_remapping.c
+++ b/drivers/iommu/intel/irq_remapping.c
@@ -1260,6 +1260,16 @@ static struct irq_chip intel_ir_chip = {
.irq_set_vcpu_affinity = intel_ir_set_vcpu_affinity,
};
+static void irte_prepare_msg(struct msi_msg *msg, int index, int subhandle)
+{
+ msg->address_hi = MSI_ADDR_BASE_HI;
+ msg->data = subhandle;
+ msg->address_lo = MSI_ADDR_BASE_LO | MSI_ADDR_IR_EXT_INT |
+ MSI_ADDR_IR_SHV |
+ MSI_ADDR_IR_INDEX1(index) |
+ MSI_ADDR_IR_INDEX2(index);
+}
+
static void intel_irq_remapping_prepare_irte(struct intel_ir_data *data,
struct irq_cfg *irq_cfg,
struct irq_alloc_info *info,
@@ -1301,19 +1311,18 @@ static void intel_irq_remapping_prepare_irte(struct intel_ir_data *data,
break;
case X86_IRQ_ALLOC_TYPE_HPET:
+ set_hpet_sid(irte, info->devid);
+ irte_prepare_msg(msg, index, sub_handle);
+ break;
+
case X86_IRQ_ALLOC_TYPE_PCI_MSI:
case X86_IRQ_ALLOC_TYPE_PCI_MSIX:
- if (info->type == X86_IRQ_ALLOC_TYPE_HPET)
- set_hpet_sid(irte, info->devid);
- else
- set_msi_sid(irte, msi_desc_to_pci_dev(info->desc));
-
- msg->address_hi = MSI_ADDR_BASE_HI;
- msg->data = sub_handle;
- msg->address_lo = MSI_ADDR_BASE_LO | MSI_ADDR_IR_EXT_INT |
- MSI_ADDR_IR_SHV |
- MSI_ADDR_IR_INDEX1(index) |
- MSI_ADDR_IR_INDEX2(index);
+ set_msi_sid(irte, msi_desc_to_pci_dev(info->desc));
+ irte_prepare_msg(msg, index, sub_handle);
+ break;
+
+ case X86_IRQ_ALLOC_TYPE_DEV_MSI:
+ irte_prepare_msg(msg, index, sub_handle);
break;
default:
@@ -1358,7 +1367,8 @@ static int intel_irq_remapping_alloc(struct irq_domain *domain,
if (!info || !iommu)
return -EINVAL;
if (nr_irqs > 1 && info->type != X86_IRQ_ALLOC_TYPE_PCI_MSI &&
- info->type != X86_IRQ_ALLOC_TYPE_PCI_MSIX)
+ info->type != X86_IRQ_ALLOC_TYPE_PCI_MSIX &&
+ info->type != X86_IRQ_ALLOC_TYPE_DEV_MSI)
return -EINVAL;
/*
Add idxd vfio mediated device theory of operation documentation.
Provide a description of the mdev design, usage, and why vfio mdev was chosen.
Reviewed-by: Ashok Raj <[email protected]>
Reviewed-by: Kevin Tian <[email protected]>
Signed-off-by: Dave Jiang <[email protected]>
---
Documentation/driver-api/vfio/mdev-idxd.rst | 404 +++++++++++++++++++++++++++
MAINTAINERS | 1
2 files changed, 405 insertions(+)
create mode 100644 Documentation/driver-api/vfio/mdev-idxd.rst
diff --git a/Documentation/driver-api/vfio/mdev-idxd.rst b/Documentation/driver-api/vfio/mdev-idxd.rst
new file mode 100644
index 000000000000..c75b7d88ef6b
--- /dev/null
+++ b/Documentation/driver-api/vfio/mdev-idxd.rst
@@ -0,0 +1,404 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=============
+IDXD Overview
+=============
+IDXD (Intel Data Accelerator Driver) is the driver for the Intel Data
+Streaming Accelerator (DSA). Intel DSA is a high performance data copy
+and transformation accelerator. In addition to data move operations,
+the device also supports data fill, CRC generation, Data Integrity Field
+(DIF), and memory compare and delta generation. Intel DSA supports
+a variety of PCI-SIG defined capabilities such as Address Translation
+Services (ATS), Process address Space ID (PASID), Page Request Interface
+(PRI), Message Signalled Interrupts Extended (MSI-X), and Advanced Error
+Reporting (AER). Some of those capabilities enable the device to support
+Shared Virtual Memory (SVM), also known as Shared Virtual Addressing
+(SVA). Intel DSA also supports Intel Scalable I/O Virtualization (SIOV)
+to improve scalability of device assignment.
+
+
+The Intel DSA device contains the following basic components:
+* Work queue (WQ)
+
+ A WQ is an on device storage to queue descriptors to the
+ device. Requests are added to a WQ by using new CPU instructions
+ (MOVDIR64B and ENQCMD(S)) to write the memory mapped “portal”
+ associated with each WQ.
+
+* Engine
+
+ Operation unit that pulls descriptors from WQs and processes them.
+
+* Group
+
+ Abstract container to associate one or more engines with one or more WQs.
+
+
+Two types of WQs are supported:
+* Dedicated WQ (DWQ)
+
+  A single client owns this WQ exclusively and can submit work
+  to it. The MOVDIR64B instruction is used to submit descriptors to
+  this type of WQ. The instruction is a posted write, therefore the
+  submitter must ensure it does not exceed the WQ length for submission. The
+  use of PASID is optional with DWQ. Multiple clients can submit to
+  a DWQ, but synchronization is required because when the WQ is full,
+  the submission is silently dropped.
+
+* Shared WQ (SWQ)
+
+ Multiple clients can submit work to this WQ. The submitter must use
+  ENQCMDS (from supervisor mode) or ENQCMD (from user mode). These
+ instructions will indicate via EFLAGS.ZF bit whether a submission
+ succeeds. The use of PASID is mandatory to identify the address space
+ of each client.
+
+
+For more information about the new instructions, see [1][2].
+
+The IDXD driver is broken down into the following usages:
+* In kernel interface through dmaengine subsystem API.
+* Userspace DMA support through character device. mmap(2) is utilized
+ to map directly to mmio address (or portals) for descriptor submission.
+* VFIO Mediated device (mdev) supporting device passthrough usages.
+
+
+=================================
+Assignable Device Interface (ADI)
+=================================
+The term ADI is used to represent the minimal unit of assignment for
+Intel Scalable IOV device. Each ADI instance refers to the set of device
+backend resources that are allocated, configured and organized as an
+isolated unit.
+
+Intel DSA defines each WQ as an ADI. The MMIO registers of each work queue
+are partitioned into two categories:
+* MMIO registers accessed for data-path operations.
+* MMIO registers accessed for control-path operations.
+
+Data-path MMIO registers of each WQ are contained within
+one or more system page size aligned regions and can be mapped in the
+CPU page table for direct access from the guest. Control-path MMIO
+registers of all WQs are located together but segregated from data-path
+MMIO regions. Therefore, guest updates to control-path registers must
+be intercepted and then go through the host driver to be reflected in
+the device.
+
+Data-path MMIO registers of DSA WQ are portals for submitting descriptors
+to the device. There are four portals per WQ, each being 64 bytes
+in size and located on a separate 4KB page in BAR2. Each portal has
+different implications regarding interrupt message type (MSI vs. IMS)
+and occupancy control (limited vs. unlimited). It is not necessary to
+map all portals to the guest.
+
+Control-path MMIO registers of DSA WQ include global configurations
+(shared by all WQs) and WQ-specific configurations. The owner
+(e.g. the guest) of the WQ is expected to only change WQ-specific
+configurations. Intel DSA spec introduces a “Configuration Support”
+capability which, if cleared, indicates that some fields of WQ
+configuration registers are read-only and the WQ configuration is
+pre-configured by the host.
+
+
+Interrupt Message Store (IMS)
+=============================
+The ADI utilizes Interrupt Message Store (IMS), a device-specific MSI
+implementation, instead of MSIX for interrupts for the guest. This
+preserves MSIX for host usages and also allows a significantly larger
+number of interrupt vectors for a large number of guests.
+
+Intel DSA device implements IMS as on-device memory mapped unified
+storage. Each interrupt message is stored as a DWORD size data payload
+and a 64-bit address (same as MSI-X). Access to the IMS is through the
+host idxd driver.
+
+The idxd driver makes use of the generic IMS irq chip and domain which
+stores the interrupt messages in an array in device memory. Allocation and
+freeing of interrupts happens via the generic msi_domain_alloc/free_irqs()
+interface. The driver only needs to ensure the interrupt domain is stored in
+the underlying device struct.
+
+
+ADI Isolation
+=============
+Operations or functioning of one ADI must not affect the functioning
+of another ADI or the physical device. Upstream memory requests from
+different ADIs are distinguished using a Process Address Space Identifier
+(PASID). With the support of PASID-granular address translation in Intel
+VT-d, the address space targeted by a request from ADI can be a Host
+Virtual Address (HVA), Host I/O Virtual Address (HIOVA), Guest Physical
+Address (GPA), Guest Virtual Address (GVA), Guest I/O Virtual Address
+(GIOVA), etc. The PASID identity for an ADI is expected to be accessed
+or modified by privileged software through the host driver.
+
+=========================
+Virtual DSA (vDSA) Device
+=========================
+The DSA WQ itself is not a PCI device and thus must be composed into a
+virtual DSA device for the guest.
+
+The composition logic needs to handle four main requirements:
+* Emulate PCI config space.
+* Map data-path portals for direct access from the guest.
+* Emulate control-path MMIO registers and selectively forward WQ
+ configuration requests through host driver to the device.
+* Forward and emulate WQ interrupts to the guest.
+
+The composition logic tells the guest aspects of WQ which are configurable
+through a combination of capability fields, e.g.:
+* Configuration Support (if cleared, most aspects are not modifiable).
+* WQ Mode Support (if cleared, cannot change between dedicated and
+ shared mode).
+* Dedicated Mode Support.
+* Shared Mode Support.
+* ...
+
+The virtual capability fields are set according to the vDSA
+type. Following are examples of vDSA types and related WQ configurability:
+* Type ‘1DWQ_v1’
+ * One DSA gen 1 WQ dedicated to this guest
+ * Guest cannot share the WQ between its clients (no guest SVA)
+ * Guest cannot change any WQ configuration
+* Type ‘1SWQ_v1’
+ * One DSA gen 1 WQ shared between multiple VMs
+ * Guest can further share the WQ between its clients (guest SVA is required)
+ * Guest cannot change any WQ configuration
+* Type ‘1WQfull_v1’
+ * One DSA gen 1 WQ dedicated to this guest
+ * Guest is allowed to do limited WQ configurations (thru WQCFG
+ register), including WQ mode (dedicated/shared), privilege,
+ threshold, PASID enable, PASID value, etc.
+
+Besides, the composition logic also needs to serve administrative commands
+(thru virtual CMD register) through host driver, including:
+* Drain/abort all descriptors submitted by this guest.
+* Drain/abort descriptors associated with a PASID.
+* Enable/disable/reset the WQ (when it’s not shared by multiple VMs).
+* Request interrupt handle.
+
+With this design, vDSA emulation is **greatly simplified**. Most
+registers are emulated in simple READ-ONLY flavor, and handling limited
+configurability is required only for a few registers.
+
+===========================
+VFIO mdev vs. userspace DMA
+===========================
+There are two avenues to support vDSA composition.
+1. VFIO mediated device (mdev)
+2. Userspace DMA through char device
+
+VFIO mdev provides a generic subdevice passthrough framework. Unified
+uAPIs are used for both device and subdevice passthrough, thus any
+userspace VMM which already supports VFIO device passthrough would
+naturally support mdev/subdevice passthrough. The implication of VFIO
+mdev is putting emulation of device interface in the kernel (part of
+host driver) which must be carefully scrutinized. Fortunately, vDSA
+composition includes only a small portion of emulation code, due to the
+fact that most registers are simply READ-ONLY to the guest. The majority
+logic of handling limited configurability and administrative commands
+is anyway required to sit in the kernel, regardless of which kernel uAPI
+is pursued. In this regard, VFIO mdev is a nice fit for vDSA composition.
+
+IDXD driver provides a char device interface for applications to
+map the WQ portal and directly submit descriptors to do DMA. This
+interface provides only data-path access to userspace and relies on
+the host driver to handle control-path configurations. Expanding such
+interface to support subdevice passthrough allows moving the emulation
+code to userspace. However, quite some work is required to grow it from
+an application-oriented interface into a passthrough-oriented interface:
+new uAPIs to handle guest WQ configurability and administrative commands,
+and new uAPIs to handle passthrough specific requirements (e.g. DMA map,
+guest SVA, live migration, posted interrupt, etc.). And once it is done,
+every userspace VMM has to explicitly bind to IDXD specific uAPI, even
+though the real user is in the guest (instead of the VMM itself) in the
+passthrough scenario.
+
+Although some generalization might be possible to reduce the work of
+handling passthrough, we feel the difference between userspace DMA
+and subdevice passthrough is distinct in IDXD. Therefore, we choose to
+build vDSA composition on top of VFIO mdev framework and leave userspace
+DMA intact after discussion at LPC 2020.
+
+=============================
+Host Registration and Release
+=============================
+
+Intel DSA reports support for Intel Scalable IOV via a PCI Express
+Designated Vendor Specific Extended Capability (DVSEC). In addition,
+PASID-granular address translation capability is required in the
+IOMMU. During host initialization, the IDXD driver should check the
+presence of both capabilities before calling mdev_register_device()
+to register with the VFIO mdev framework and provide a set of ops
+(struct mdev_parent_ops). The IOMMU capability is indicated by the
+IOMMU_DEV_FEAT_AUX feature flag with iommu_dev_has_feature() and enabled
+with iommu_dev_enable_feature().
+
+On release, iommu_dev_disable_feature() is called after
+mdev_unregister_device() to disable the IOMMU_DEV_FEAT_AUX flag that
+the driver enabled during host initialization.
+
+The mdev_parent_ops data structure is filled out by the driver to provide
+a number of ops called by VFIO mdev framework::
+
+ struct mdev_parent_ops {
+ .supported_type_groups
+ .create
+ .remove
+ .open
+ .release
+ .read
+ .write
+ .mmap
+ .ioctl
+ };
+
+Supported_type_groups
+---------------------
+At the moment only one vDSA type is supported.
+
+“1DWQ_v1”:
+ Single dedicated WQ (DSA 1.0) with read-only configuration exposed to
+ the guest. On the guest kernel, a vDSA device shows up with a single
+ WQ that is pre-configured by the host. The configuration for the WQ
+ is entirely read-only and cannot be reconfigured. There is no support
+ of guest SVA on this WQ.
+
+ The interrupt vector 0 is emulated by the driver to support the admin
+ command completion and error reporting. A second interrupt vector is
+ bound to the IMS and used for I/O operation.
+
+
+create
+------
+API function to create the mdev. mdev_set_iommu_device() is called to
+associate the mdev device to the parent PCI device. This function is
+where the driver sets up and initializes the resources to support a single
+mdev device. This is triggered through sysfs to initiate the creation.
+
+remove
+------
+API function that mirrors the create() function and releases all the
+resources backing the mdev. This is also triggered through sysfs.
+
+open
+----
+API function that is called down from VFIO userspace to indicate to the
+driver that the upper layers are ready to claim and utilize the mdev. IMS
+entries are allocated and setup here.
+
+release
+-------
+The mirror function to open that releases the mdev by VFIO userspace.
+
+read / write
+------------
+This is where the Intel IDXD driver provides read/write emulation of
+PCI config space and MMIO registers. These paths are the “slow” path
+of the mediated device and emulation is used rather than direct access
+to the hardware resources. Typically configuration and administrative
+commands go through this path. This allows the mdev to show up as a
+virtual PCI device on the guest kernel.
+
+The emulation of PCI config space is nothing special, which is simply
+copied from kvmgt. In the future this part might be consolidated to
+reduce duplication.
+
+Emulating MMIO reads is simply a memory copy. There are no side effects
+to be emulated upon guest read.
+
+Emulating MMIO writes is required only for a few registers, due to the
+read-only configuration on the ‘1DWQ-v1’ type. The majority of the composition
+logic is hooked in the CMD register for performing administrative commands
+such as WQ drain, abort, enable, disable and reset operations. The rest of
+the emulation is about handling errors (GENCTRL/SWERROR) and interrupts
+(INTCAUSE/MSIXPERM) on the vDSA device. Future mdev types might allow
+limited WQ configurability, which then requires additional emulation of
+the WQCFG register.
+
+mmap
+----
+This is the function that provides the setup to expose a portion of the
+hardware, also known as portals, for direct access for “fast” path
+operations through the mmap() syscall. A limited region of the hardware
+is mapped to the guest for direct I/O submission.
+
+There are four portals per WQ: unlimited MSI-X, limited MSI-X, unlimited
+IMS, limited IMS. Descriptors submitted to limited portals are subject
+to threshold configuration limitations for shared WQs. The MSI-X portals
+are used for host submissions, and the IMS portals are mapped to the VM for
+guest submission.
+
+ioctl
+-----
+This API function does several things:
+* Provides general device information to VFIO userspace.
+* Provides device region information (PCI, mmio, etc).
+* Provides interrupt information.
+* Sets up interrupts for the mediated device.
+* Resets the mdev device.
+
+For the Intel idxd driver, Interrupt Message Store (IMS) vectors are being
+used for mdev interrupts rather than MSIX vectors. IMS provides additional
+interrupt vectors outside of PCI MSIX specification in order to support
+significantly more vectors. The emulated interrupt (0) is connected through
+kernel eventfd. When interrupt 0 needs to be asserted, the driver will
+signal the eventfd to trigger the MSIX vector 0 interrupt on the guest.
+The IMS interrupts are set up via eventfd as well. However, they utilize the
+irq bypass manager to directly inject the interrupt into the guest.
+
+To allocate IMS, we utilize the IMS array APIs. On host init, we need
+to create the MSI domain::
+
+ struct ims_array_info ims_info;
+ struct device *dev = &pci_dev->dev;
+
+
+ /* assign the device IMS size */
+ ims_info.max_slots = max_ims_size;
+ /* assign the MMIO base address for the IMS table */
+ ims_info.slots = mmio_base + ims_offset;
+ /* assign the MSI domain to the device */
+ dev->msi_domain = pci_ims_array_create_msi_irq_domain(pci_dev, &ims_info);
+
+When we are ready to allocate the interrupts::
+
+ struct device *dev = mdev_dev(mdev);
+
+ irq_domain = pci_dev->dev.msi_domain;
+ /* the irqs are allocated against device of mdev */
+ rc = msi_domain_alloc_irqs(irq_domain, dev, num_vecs);
+
+
+ /* we can retrieve the slot index from msi_entry */
+ for_each_msi_entry(entry, dev) {
+ slot_index = entry->device_msi.hwirq;
+ irq = entry->irq;
+ }
+
+ request_irq(irq, interrupt_handler_function, 0, “ims”, context);
+
+
+The DSA device is structured such that MSI-X table entry 0 is used for
+admin command completion, error reporting, and other misc commands. The
+remaining MSI-X table entries are used for WQ completion. For VM support,
+the virtual device also presents a similar layout. Therefore, vector 0
+is emulated by the software. Additional vector(s) are associated with IMS.
+
+The index (slot) for the per device IMS entry is managed by the MSI
+core. The index is the “interrupt handle” that the guest kernel
+needs to program into a DMA descriptor. That interrupt handle tells the
+hardware which IMS vector to trigger the interrupt on for the host.
+
+The virtual device presents an admin command called “request interrupt
+handle” that is not supported by the physical device. On probe of
+the DSA device on the guest kernel, the guest driver will issue the
+“request interrupt handle” command in order to get the interrupt
+handle for descriptor programming. The host driver will return the
+assigned slot for the IMS entry table to the guest.
+
+References
+==========
+[1] https://software.intel.com/content/www/us/en/develop/download/intel-architecture-instruction-set-extensions-programming-reference.html
+[2] https://software.intel.com/en-us/articles/intel-sdm
+[3] https://software.intel.com/sites/default/files/managed/cc/0e/intel-scalable-io-virtualization-technical-specification.pdf
+[4] https://software.intel.com/en-us/download/intel-data-streaming-accelerator-preliminary-architecture-specification
diff --git a/MAINTAINERS b/MAINTAINERS
index e73636b75f29..af04e674853c 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -8888,6 +8888,7 @@ INTEL IADX DRIVER
M: Dave Jiang <[email protected]>
L: [email protected]
S: Supported
+F: Documentation/driver-api/vfio/mdev-idxd.rst
F: drivers/dma/idxd/*
F: include/uapi/linux/idxd.h
Add support for requesting an interrupt handle from the device. The interrupt
handle is put in the interrupt handle field of a descriptor for the device
to determine which interrupt vector to use, be it MSI-X or IMS. On the host
device, the interrupt handle is indexed to the MSI-X table. This allows a
descriptor to program the interrupt handle 1:1 with the MSI-X index without
getting it from the request interrupt handle device command. For a guest
device, the index can be any index that the host assigned for the IMS
table, and therefore it must be requested from the virtual device during
MSI-X setup by the driver running on the guest.
On the actual hardware, MSIX vector 0 is the misc interrupt and handles
events such as administrative command completion, error reporting,
performance monitor overflow, etc. The MSIX vectors 1...N
are used for descriptor completion interrupts. On the guest kernel,
the MSIX interrupts are backed by the mediated device through emulation
or IMS vectors. Vector 0 is handled through emulation by the host vdcm.
It only requires the host driver to send the signal to qemu. Vector 1
(and more may be supported later) is backed by IMS.
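As an illustrative sketch, the host driver side of requesting a handle for
an MSI-X backed vector looks roughly like this (idx is the MSI-X index):

	int handle, rc;

	rc = idxd_device_request_int_handle(idxd, idx, &handle, IDXD_IRQ_MSIX);
	if (rc < 0)
		return rc;

	/* 'handle' is later programmed into the descriptor int_handle field */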
Signed-off-by: Dave Jiang <[email protected]>
---
drivers/dma/idxd/device.c | 58 ++++++++++++++++++++++++++++++++++++++++++
drivers/dma/idxd/idxd.h | 13 +++++++++
drivers/dma/idxd/init.c | 48 +++++++++++++++++++++++++++++++++++
drivers/dma/idxd/registers.h | 9 ++++++-
drivers/dma/idxd/submit.c | 29 ++++++++++++++++-----
5 files changed, 149 insertions(+), 8 deletions(-)
diff --git a/drivers/dma/idxd/device.c b/drivers/dma/idxd/device.c
index 7003884cd8ad..a9ae970db0a4 100644
--- a/drivers/dma/idxd/device.c
+++ b/drivers/dma/idxd/device.c
@@ -532,6 +532,64 @@ void idxd_device_drain_pasid(struct idxd_device *idxd, int pasid)
dev_dbg(dev, "pasid %d drained\n", pasid);
}
+int idxd_device_request_int_handle(struct idxd_device *idxd, int idx, int *handle,
+ enum idxd_interrupt_type irq_type)
+{
+ struct device *dev = &idxd->pdev->dev;
+ u32 operand, status;
+
+ if (!(idxd->hw.cmd_cap & BIT(IDXD_CMD_REQUEST_INT_HANDLE)))
+ return -EOPNOTSUPP;
+
+ dev_dbg(dev, "get int handle, idx %d\n", idx);
+
+ operand = idx & GENMASK(15, 0);
+ if (irq_type == IDXD_IRQ_IMS)
+ operand |= CMD_INT_HANDLE_IMS;
+
+ dev_dbg(dev, "cmd: %u operand: %#x\n", IDXD_CMD_REQUEST_INT_HANDLE, operand);
+
+ idxd_cmd_exec(idxd, IDXD_CMD_REQUEST_INT_HANDLE, operand, &status);
+
+ if ((status & IDXD_CMDSTS_ERR_MASK) != IDXD_CMDSTS_SUCCESS) {
+ dev_dbg(dev, "request int handle failed: %#x\n", status);
+ return -ENXIO;
+ }
+
+ *handle = (status >> IDXD_CMDSTS_RES_SHIFT) & GENMASK(15, 0);
+
+ dev_dbg(dev, "int handle acquired: %u\n", *handle);
+ return 0;
+}
+
+int idxd_device_release_int_handle(struct idxd_device *idxd, int handle,
+ enum idxd_interrupt_type irq_type)
+{
+ struct device *dev = &idxd->pdev->dev;
+ u32 operand, status;
+
+ if (!(idxd->hw.cmd_cap & BIT(IDXD_CMD_RELEASE_INT_HANDLE)))
+ return -EOPNOTSUPP;
+
+ dev_dbg(dev, "release int handle, handle %d\n", handle);
+
+ operand = handle & GENMASK(15, 0);
+ if (irq_type == IDXD_IRQ_IMS)
+ operand |= CMD_INT_HANDLE_IMS;
+
+ dev_dbg(dev, "cmd: %u operand: %#x\n", IDXD_CMD_RELEASE_INT_HANDLE, operand);
+
+ idxd_cmd_exec(idxd, IDXD_CMD_RELEASE_INT_HANDLE, operand, &status);
+
+ if ((status & IDXD_CMDSTS_ERR_MASK) != IDXD_CMDSTS_SUCCESS) {
+ dev_dbg(dev, "release int handle failed: %#x\n", status);
+ return -ENXIO;
+ }
+
+ dev_dbg(dev, "int handle released.\n");
+ return 0;
+}
+
/* Device configuration bits */
static void idxd_group_config_write(struct idxd_group *group)
{
diff --git a/drivers/dma/idxd/idxd.h b/drivers/dma/idxd/idxd.h
index 1afc34be4ed0..a506a16c83ee 100644
--- a/drivers/dma/idxd/idxd.h
+++ b/drivers/dma/idxd/idxd.h
@@ -140,6 +140,7 @@ struct idxd_hw {
union group_cap_reg group_cap;
union engine_cap_reg engine_cap;
struct opcap opcap;
+ u32 cmd_cap;
};
enum idxd_device_state {
@@ -205,6 +206,8 @@ struct idxd_device {
struct dma_device dma_dev;
struct workqueue_struct *wq;
struct work_struct work;
+
+ int *int_handles;
};
/* IDXD software descriptor */
@@ -218,6 +221,7 @@ struct idxd_desc {
struct list_head list;
int id;
int cpu;
+ unsigned int vector;
struct idxd_wq *wq;
};
@@ -253,6 +257,11 @@ enum idxd_portal_prot {
IDXD_PORTAL_LIMITED,
};
+enum idxd_interrupt_type {
+ IDXD_IRQ_MSIX = 0,
+ IDXD_IRQ_IMS,
+};
+
static inline int idxd_get_wq_portal_offset(enum idxd_portal_prot prot)
{
return prot * 0x1000;
@@ -318,6 +327,10 @@ int idxd_device_config(struct idxd_device *idxd);
void idxd_device_wqs_clear_state(struct idxd_device *idxd);
void idxd_device_drain_pasid(struct idxd_device *idxd, int pasid);
int idxd_device_load_config(struct idxd_device *idxd);
+int idxd_device_request_int_handle(struct idxd_device *idxd, int idx, int *handle,
+ enum idxd_interrupt_type irq_type);
+int idxd_device_release_int_handle(struct idxd_device *idxd, int handle,
+ enum idxd_interrupt_type irq_type);
/* work queue control */
int idxd_wq_alloc_resources(struct idxd_wq *wq);
diff --git a/drivers/dma/idxd/init.c b/drivers/dma/idxd/init.c
index 98b1091181bb..c136216e19e8 100644
--- a/drivers/dma/idxd/init.c
+++ b/drivers/dma/idxd/init.c
@@ -133,6 +133,22 @@ static int idxd_setup_interrupts(struct idxd_device *idxd)
}
dev_dbg(dev, "Allocated idxd-msix %d for vector %d\n",
i, msix->vector);
+
+ if (idxd->hw.cmd_cap & BIT(IDXD_CMD_REQUEST_INT_HANDLE)) {
+ /*
+ * The MSIX vector enumeration starts at 1 with vector 0 being the
+ * misc interrupt that handles non I/O completion events. The
+ * interrupt handles are for IMS enumeration on guest. The misc
+ * interrupt vector does not require a handle and therefore we start
+ * the int_handles at index 0. Since 'i' starts at 1, the first
+ * int_handles index will be 0.
+ */
+ rc = idxd_device_request_int_handle(idxd, i, &idxd->int_handles[i - 1],
+ IDXD_IRQ_MSIX);
+ if (rc < 0)
+ goto err_no_irq;
+ dev_dbg(dev, "int handle requested: %u\n", idxd->int_handles[i - 1]);
+ }
}
idxd_unmask_error_interrupts(idxd);
@@ -160,6 +176,13 @@ static int idxd_setup_internals(struct idxd_device *idxd)
int i;
init_waitqueue_head(&idxd->cmd_waitq);
+
+ if (idxd->hw.cmd_cap & BIT(IDXD_CMD_REQUEST_INT_HANDLE)) {
+ idxd->int_handles = devm_kcalloc(dev, idxd->max_wqs, sizeof(int), GFP_KERNEL);
+ if (!idxd->int_handles)
+ return -ENOMEM;
+ }
+
idxd->groups = devm_kcalloc(dev, idxd->max_groups,
sizeof(struct idxd_group), GFP_KERNEL);
if (!idxd->groups)
@@ -233,6 +256,12 @@ static void idxd_read_caps(struct idxd_device *idxd)
/* reading generic capabilities */
idxd->hw.gen_cap.bits = ioread64(idxd->reg_base + IDXD_GENCAP_OFFSET);
dev_dbg(dev, "gen_cap: %#llx\n", idxd->hw.gen_cap.bits);
+
+ if (idxd->hw.gen_cap.cmd_cap) {
+ idxd->hw.cmd_cap = ioread32(idxd->reg_base + IDXD_CMDCAP_OFFSET);
+ dev_dbg(dev, "cmd_cap: %#x\n", idxd->hw.cmd_cap);
+ }
+
idxd->max_xfer_bytes = 1ULL << idxd->hw.gen_cap.max_xfer_shift;
dev_dbg(dev, "max xfer size: %llu bytes\n", idxd->max_xfer_bytes);
idxd->max_batch_size = 1U << idxd->hw.gen_cap.max_batch_shift;
@@ -471,6 +500,24 @@ static void idxd_flush_work_list(struct idxd_irq_entry *ie)
}
}
+static void idxd_release_int_handles(struct idxd_device *idxd)
+{
+ struct device *dev = &idxd->pdev->dev;
+ int i, rc;
+
+ for (i = 0; i < idxd->num_wq_irqs; i++) {
+ if (idxd->hw.cmd_cap & BIT(IDXD_CMD_RELEASE_INT_HANDLE)) {
+ rc = idxd_device_release_int_handle(idxd, idxd->int_handles[i],
+ IDXD_IRQ_MSIX);
+ if (rc < 0)
+ dev_warn(dev, "irq handle %d release failed\n",
+ idxd->int_handles[i]);
+ else
+ dev_dbg(dev, "int handle requested: %u\n", idxd->int_handles[i]);
+ }
+ }
+}
+
static void idxd_shutdown(struct pci_dev *pdev)
{
struct idxd_device *idxd = pci_get_drvdata(pdev);
@@ -495,6 +542,7 @@ static void idxd_shutdown(struct pci_dev *pdev)
idxd_flush_work_list(irq_entry);
}
+ idxd_release_int_handles(idxd);
destroy_workqueue(idxd->wq);
}
diff --git a/drivers/dma/idxd/registers.h b/drivers/dma/idxd/registers.h
index d29a58ee2651..d02fd59a8e39 100644
--- a/drivers/dma/idxd/registers.h
+++ b/drivers/dma/idxd/registers.h
@@ -23,8 +23,8 @@ union gen_cap_reg {
u64 overlap_copy:1;
u64 cache_control_mem:1;
u64 cache_control_cache:1;
+ u64 cmd_cap:1;
u64 rsvd:3;
- u64 int_handle_req:1;
u64 dest_readback:1;
u64 drain_readback:1;
u64 rsvd2:6;
@@ -179,8 +179,11 @@ enum idxd_cmd {
IDXD_CMD_DRAIN_PASID,
IDXD_CMD_ABORT_PASID,
IDXD_CMD_REQUEST_INT_HANDLE,
+ IDXD_CMD_RELEASE_INT_HANDLE,
};
+#define CMD_INT_HANDLE_IMS 0x10000
+
#define IDXD_CMDSTS_OFFSET 0xa8
union cmdsts_reg {
struct {
@@ -192,6 +195,8 @@ union cmdsts_reg {
u32 bits;
} __packed;
#define IDXD_CMDSTS_ACTIVE 0x80000000
+#define IDXD_CMDSTS_ERR_MASK 0xff
+#define IDXD_CMDSTS_RES_SHIFT 8
enum idxd_cmdsts_err {
IDXD_CMDSTS_SUCCESS = 0,
@@ -227,6 +232,8 @@ enum idxd_cmdsts_err {
IDXD_CMDSTS_ERR_NO_HANDLE,
};
+#define IDXD_CMDCAP_OFFSET 0xb0
+
#define IDXD_SWERR_OFFSET 0xc0
#define IDXD_SWERR_VALID 0x00000001
#define IDXD_SWERR_OVERFLOW 0x00000002
diff --git a/drivers/dma/idxd/submit.c b/drivers/dma/idxd/submit.c
index efca5d8468a6..cdea5d37ef24 100644
--- a/drivers/dma/idxd/submit.c
+++ b/drivers/dma/idxd/submit.c
@@ -22,11 +22,17 @@ static struct idxd_desc *__get_desc(struct idxd_wq *wq, int idx, int cpu)
desc->hw->pasid = idxd->pasid;
/*
- * Descriptor completion vectors are 1-8 for MSIX. We will round
- * robin through the 8 vectors.
+ * Descriptor completion vectors are 1...N for MSIX. We will round
+ * robin through the N vectors.
*/
wq->vec_ptr = (wq->vec_ptr % idxd->num_wq_irqs) + 1;
- desc->hw->int_handle = wq->vec_ptr;
+ if (!idxd->int_handles) {
+ desc->hw->int_handle = wq->vec_ptr;
+ } else {
+ desc->vector = wq->vec_ptr;
+ desc->hw->int_handle = idxd->int_handles[desc->vector];
+ }
+
return desc;
}
@@ -79,7 +85,6 @@ void idxd_free_desc(struct idxd_wq *wq, struct idxd_desc *desc)
int idxd_submit_desc(struct idxd_wq *wq, struct idxd_desc *desc)
{
struct idxd_device *idxd = wq->idxd;
- int vec = desc->hw->int_handle;
void __iomem *portal;
int rc;
@@ -112,9 +117,19 @@ int idxd_submit_desc(struct idxd_wq *wq, struct idxd_desc *desc)
* Pending the descriptor to the lockless list for the irq_entry
* that we designated the descriptor to.
*/
- if (desc->hw->flags & IDXD_OP_FLAG_RCI)
- llist_add(&desc->llnode,
- &idxd->irq_entries[vec].pending_llist);
+ if (desc->hw->flags & IDXD_OP_FLAG_RCI) {
+ int vec;
+
+ /*
+ * If the driver is on host kernel, it would be the value
+ * assigned to interrupt handle, which is index for MSIX
+ * vector. If it's guest then can't use the int_handle since
+ * that is the index to IMS for the entire device. The guest
+ * device local index will be used.
+ */
+ vec = !idxd->int_handles ? desc->hw->int_handle : desc->vector;
+ llist_add(&desc->llnode, &idxd->irq_entries[vec].pending_llist);
+ }
return 0;
}
Add the support code for the "1dwq" mdev type. This mdev type follows the
standard VFIO mdev flow. The "1dwq" type will export a single dedicated wq
to the mdev. The dwq will have a read-only configuration that is configured
by the host. The mdev type does not support PASID and SVA and will match
the stage 1 driver in functional support. For backward compatibility, the
mdev will maintain the DSA spec definition of this mdev type once the
commit goes upstream.
Signed-off-by: Dave Jiang <[email protected]>
---
drivers/dma/idxd/mdev.c | 141 ++++++++++++++++++++++++++++++++++++++++++++---
1 file changed, 133 insertions(+), 8 deletions(-)
diff --git a/drivers/dma/idxd/mdev.c b/drivers/dma/idxd/mdev.c
index ed79c85e692e..16b56f8f7fc1 100644
--- a/drivers/dma/idxd/mdev.c
+++ b/drivers/dma/idxd/mdev.c
@@ -111,20 +111,58 @@ static void idxd_vdcm_release(struct mdev_device *mdev)
mutex_unlock(&vidxd->dev_lock);
}
+static struct idxd_wq *find_any_dwq(struct idxd_device *idxd)
+{
+ int i;
+ struct idxd_wq *wq;
+ unsigned long flags;
+
+ spin_lock_irqsave(&idxd->dev_lock, flags);
+ for (i = 0; i < idxd->max_wqs; i++) {
+ wq = &idxd->wqs[i];
+
+ if (wq->state != IDXD_WQ_ENABLED)
+ continue;
+
+ if (!wq_dedicated(wq))
+ continue;
+
+ if (idxd_wq_refcount(wq) != 0)
+ continue;
+
+ spin_unlock_irqrestore(&idxd->dev_lock, flags);
+ mutex_lock(&wq->wq_lock);
+		if (idxd_wq_refcount(wq)) {
+			mutex_unlock(&wq->wq_lock);
+			spin_lock_irqsave(&idxd->dev_lock, flags);
+			continue;
+		}
+
+ idxd_wq_get(wq);
+ mutex_unlock(&wq->wq_lock);
+ return wq;
+ }
+
+ spin_unlock_irqrestore(&idxd->dev_lock, flags);
+ return NULL;
+}
+
static struct vdcm_idxd *vdcm_vidxd_create(struct idxd_device *idxd, struct mdev_device *mdev,
struct vdcm_idxd_type *type)
{
struct vdcm_idxd *vidxd;
struct idxd_wq *wq = NULL;
+ int rc;
- /* PLACEHOLDER, wq matching comes later */
-
+ if (type->type == IDXD_MDEV_TYPE_1_DWQ)
+ wq = find_any_dwq(idxd);
if (!wq)
return ERR_PTR(-ENODEV);
vidxd = kzalloc(sizeof(*vidxd), GFP_KERNEL);
- if (!vidxd)
- return ERR_PTR(-ENOMEM);
+ if (!vidxd) {
+ rc = -ENOMEM;
+ goto err;
+ }
mutex_init(&vidxd->dev_lock);
vidxd->idxd = idxd;
@@ -135,14 +173,23 @@ static struct vdcm_idxd *vdcm_vidxd_create(struct idxd_device *idxd, struct mdev
vidxd->num_wqs = VIDXD_MAX_WQS;
idxd_vdcm_init(vidxd);
- mutex_lock(&wq->wq_lock);
- idxd_wq_get(wq);
- mutex_unlock(&wq->wq_lock);
return vidxd;
+
+ err:
+ mutex_lock(&wq->wq_lock);
+ idxd_wq_put(wq);
+ mutex_unlock(&wq->wq_lock);
+ return ERR_PTR(rc);
}
-static struct vdcm_idxd_type idxd_mdev_types[IDXD_MDEV_TYPES];
+static struct vdcm_idxd_type idxd_mdev_types[IDXD_MDEV_TYPES] = {
+ {
+ .name = "1dwq-v1",
+ .description = "IDXD MDEV with 1 dedicated workqueue",
+ .type = IDXD_MDEV_TYPE_1_DWQ,
+ },
+};
static struct vdcm_idxd_type *idxd_vdcm_find_vidxd_type(struct device *dev,
const char *name)
@@ -934,7 +981,85 @@ static long idxd_vdcm_ioctl(struct mdev_device *mdev, unsigned int cmd,
return rc;
}
+static ssize_t name_show(struct kobject *kobj, struct device *dev, char *buf)
+{
+ struct vdcm_idxd_type *type;
+
+ type = idxd_vdcm_find_vidxd_type(dev, kobject_name(kobj));
+
+ if (type)
+ return sprintf(buf, "%s\n", type->description);
+
+ return -EINVAL;
+}
+static MDEV_TYPE_ATTR_RO(name);
+
+static int find_available_mdev_instances(struct idxd_device *idxd, struct vdcm_idxd_type *type)
+{
+ int count = 0, i;
+ unsigned long flags;
+
+ if (type->type != IDXD_MDEV_TYPE_1_DWQ)
+ return 0;
+
+ spin_lock_irqsave(&idxd->dev_lock, flags);
+ for (i = 0; i < idxd->max_wqs; i++) {
+ struct idxd_wq *wq;
+
+ wq = &idxd->wqs[i];
+ if (!is_idxd_wq_mdev(wq) || !wq_dedicated(wq) || idxd_wq_refcount(wq))
+ continue;
+
+ count++;
+ }
+ spin_unlock_irqrestore(&idxd->dev_lock, flags);
+
+ return count;
+}
+
+static ssize_t available_instances_show(struct kobject *kobj,
+ struct device *dev, char *buf)
+{
+ int count;
+ struct idxd_device *idxd = dev_get_drvdata(dev);
+ struct vdcm_idxd_type *type;
+
+ type = idxd_vdcm_find_vidxd_type(dev, kobject_name(kobj));
+ if (!type)
+ return -EINVAL;
+
+ count = find_available_mdev_instances(idxd, type);
+
+ return sprintf(buf, "%d\n", count);
+}
+static MDEV_TYPE_ATTR_RO(available_instances);
+
+static ssize_t device_api_show(struct kobject *kobj, struct device *dev,
+ char *buf)
+{
+ return sprintf(buf, "%s\n", VFIO_DEVICE_API_PCI_STRING);
+}
+static MDEV_TYPE_ATTR_RO(device_api);
+
+static struct attribute *idxd_mdev_types_attrs[] = {
+ &mdev_type_attr_name.attr,
+ &mdev_type_attr_device_api.attr,
+ &mdev_type_attr_available_instances.attr,
+ NULL,
+};
+
+static struct attribute_group idxd_mdev_type_group0 = {
+ .name = "1dwq-v1",
+ .attrs = idxd_mdev_types_attrs,
+};
+
+static struct attribute_group *idxd_mdev_type_groups[] = {
+ &idxd_mdev_type_group0,
+ NULL,
+};
+
static const struct mdev_parent_ops idxd_vdcm_ops = {
+ .supported_type_groups = idxd_mdev_type_groups,
.create = idxd_vdcm_create,
.remove = idxd_vdcm_remove,
.open = idxd_vdcm_open,
The VFIO mediated device for the idxd driver will provide a virtual DSA
device backed by a workqueue. The virtual device is limited: its wq
configuration registers are set to read-only. Add support and helper
functions for handling a DSA device whose configuration registers are
marked read-only.
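The constraint enforced when reading the config back can be restated as the
following sketch (the helper name ro_wqcfg_supported() is hypothetical; the
actual check lives in idxd_wq_load_config() in the diff below): with
read-only configs, only dedicated, non-PASID wqs are accepted for now.

	/* Sketch: a wq config read back from HW is usable only if dedicated and not PASID enabled. */
	static bool ro_wqcfg_supported(union wqcfg *cfg)
	{
		return cfg->mode != 0 && !cfg->pasid_en;
	}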
Signed-off-by: Dave Jiang <[email protected]>
---
drivers/dma/idxd/device.c | 116 +++++++++++++++++++++++++++++++++++++++++++++
drivers/dma/idxd/idxd.h | 1
drivers/dma/idxd/init.c | 8 +++
drivers/dma/idxd/sysfs.c | 20 +++++---
4 files changed, 137 insertions(+), 8 deletions(-)
diff --git a/drivers/dma/idxd/device.c b/drivers/dma/idxd/device.c
index d6f551dcbcb6..7003884cd8ad 100644
--- a/drivers/dma/idxd/device.c
+++ b/drivers/dma/idxd/device.c
@@ -778,3 +778,119 @@ int idxd_device_config(struct idxd_device *idxd)
return 0;
}
+
+static int idxd_wq_load_config(struct idxd_wq *wq)
+{
+ struct idxd_device *idxd = wq->idxd;
+ struct device *dev = &idxd->pdev->dev;
+ int wqcfg_offset;
+ int i;
+
+ wqcfg_offset = WQCFG_OFFSET(idxd, wq->id, 0);
+ memcpy_fromio(wq->wqcfg, idxd->reg_base + wqcfg_offset, idxd->wqcfg_size);
+
+ wq->size = wq->wqcfg->wq_size;
+ wq->threshold = wq->wqcfg->wq_thresh;
+ if (wq->wqcfg->priv)
+ wq->type = IDXD_WQT_KERNEL;
+
+ /* The driver does not support shared WQ mode in read-only config yet */
+ if (wq->wqcfg->mode == 0 || wq->wqcfg->pasid_en)
+ return -EOPNOTSUPP;
+
+ set_bit(WQ_FLAG_DEDICATED, &wq->flags);
+
+ wq->priority = wq->wqcfg->priority;
+
+ for (i = 0; i < WQCFG_STRIDES(idxd); i++) {
+ wqcfg_offset = WQCFG_OFFSET(idxd, wq->id, i);
+ dev_dbg(dev, "WQ[%d][%d][%#x]: %#x\n", wq->id, i, wqcfg_offset, wq->wqcfg->bits[i]);
+ }
+
+ return 0;
+}
+
+static void idxd_group_load_config(struct idxd_group *group)
+{
+ struct idxd_device *idxd = group->idxd;
+ struct device *dev = &idxd->pdev->dev;
+ int i, j, grpcfg_offset;
+
+ /*
+ * Load WQS bit fields
+ * Iterate through all 256 bits, 64 bits at a time
+ */
+ for (i = 0; i < GRPWQCFG_STRIDES; i++) {
+ struct idxd_wq *wq;
+
+ grpcfg_offset = GRPWQCFG_OFFSET(idxd, group->id, i);
+ group->grpcfg.wqs[i] = ioread64(idxd->reg_base + grpcfg_offset);
+ dev_dbg(dev, "GRPCFG wq[%d:%d: %#x]: %#llx\n",
+ group->id, i, grpcfg_offset, group->grpcfg.wqs[i]);
+
+ if (i * 64 >= idxd->max_wqs)
+ break;
+
+ /* Iterate through all 64 bits and check for wq set */
+ for (j = 0; j < 64; j++) {
+ int id = i * 64 + j;
+
+ /* No need to check beyond max wqs */
+ if (id >= idxd->max_wqs)
+ break;
+
+ /* Set group assignment for wq if wq bit is set */
+ if (group->grpcfg.wqs[i] & BIT(j)) {
+ wq = &idxd->wqs[id];
+ wq->group = group;
+ }
+ }
+ }
+
+ grpcfg_offset = GRPENGCFG_OFFSET(idxd, group->id);
+ group->grpcfg.engines = ioread64(idxd->reg_base + grpcfg_offset);
+ dev_dbg(dev, "GRPCFG engs[%d: %#x]: %#llx\n", group->id,
+ grpcfg_offset, group->grpcfg.engines);
+
+ /* Iterate through all 64 bits to check engines set */
+ for (i = 0; i < 64; i++) {
+ if (i >= idxd->max_engines)
+ break;
+
+ if (group->grpcfg.engines & BIT(i)) {
+ struct idxd_engine *engine = &idxd->engines[i];
+
+ engine->group = group;
+ }
+ }
+
+ grpcfg_offset = GRPFLGCFG_OFFSET(idxd, group->id);
+ group->grpcfg.flags.bits = ioread32(idxd->reg_base + grpcfg_offset);
+ dev_dbg(dev, "GRPFLAGS flags[%d: %#x]: %#x\n",
+ group->id, grpcfg_offset, group->grpcfg.flags.bits);
+}
+
+int idxd_device_load_config(struct idxd_device *idxd)
+{
+ union gencfg_reg reg;
+ int i, rc;
+
+ reg.bits = ioread32(idxd->reg_base + IDXD_GENCFG_OFFSET);
+ idxd->token_limit = reg.token_limit;
+
+ for (i = 0; i < idxd->max_groups; i++) {
+ struct idxd_group *group = &idxd->groups[i];
+
+ idxd_group_load_config(group);
+ }
+
+ for (i = 0; i < idxd->max_wqs; i++) {
+ struct idxd_wq *wq = &idxd->wqs[i];
+
+ rc = idxd_wq_load_config(wq);
+ if (rc < 0)
+ return rc;
+ }
+
+ return 0;
+}
diff --git a/drivers/dma/idxd/idxd.h b/drivers/dma/idxd/idxd.h
index 7e54209c433a..1afc34be4ed0 100644
--- a/drivers/dma/idxd/idxd.h
+++ b/drivers/dma/idxd/idxd.h
@@ -317,6 +317,7 @@ void idxd_device_cleanup(struct idxd_device *idxd);
int idxd_device_config(struct idxd_device *idxd);
void idxd_device_wqs_clear_state(struct idxd_device *idxd);
void idxd_device_drain_pasid(struct idxd_device *idxd, int pasid);
+int idxd_device_load_config(struct idxd_device *idxd);
/* work queue control */
int idxd_wq_alloc_resources(struct idxd_wq *wq);
diff --git a/drivers/dma/idxd/init.c b/drivers/dma/idxd/init.c
index 45b0eac640c3..98b1091181bb 100644
--- a/drivers/dma/idxd/init.c
+++ b/drivers/dma/idxd/init.c
@@ -349,6 +349,14 @@ static int idxd_probe(struct idxd_device *idxd)
if (rc)
goto err_setup;
+ /* If the configs are readonly, then load them from device */
+ if (!test_bit(IDXD_FLAG_CONFIGURABLE, &idxd->flags)) {
+ dev_dbg(dev, "Loading RO device config\n");
+ rc = idxd_device_load_config(idxd);
+ if (rc < 0)
+ goto err_setup;
+ }
+
rc = idxd_setup_interrupts(idxd);
if (rc)
goto err_setup;
diff --git a/drivers/dma/idxd/sysfs.c b/drivers/dma/idxd/sysfs.c
index 6d292eb79bf3..304eb2cf532e 100644
--- a/drivers/dma/idxd/sysfs.c
+++ b/drivers/dma/idxd/sysfs.c
@@ -102,7 +102,7 @@ static int idxd_config_bus_match(struct device *dev,
static int idxd_config_bus_probe(struct device *dev)
{
- int rc;
+ int rc = 0;
unsigned long flags;
dev_dbg(dev, "%s called\n", __func__);
@@ -120,7 +120,8 @@ static int idxd_config_bus_probe(struct device *dev)
/* Perform IDXD configuration and enabling */
spin_lock_irqsave(&idxd->dev_lock, flags);
- rc = idxd_device_config(idxd);
+ if (test_bit(IDXD_FLAG_CONFIGURABLE, &idxd->flags))
+ rc = idxd_device_config(idxd);
spin_unlock_irqrestore(&idxd->dev_lock, flags);
if (rc < 0) {
module_put(THIS_MODULE);
@@ -207,7 +208,8 @@ static int idxd_config_bus_probe(struct device *dev)
}
spin_lock_irqsave(&idxd->dev_lock, flags);
- rc = idxd_device_config(idxd);
+ if (test_bit(IDXD_FLAG_CONFIGURABLE, &idxd->flags))
+ rc = idxd_device_config(idxd);
spin_unlock_irqrestore(&idxd->dev_lock, flags);
if (rc < 0) {
mutex_unlock(&wq->wq_lock);
@@ -328,12 +330,14 @@ static int idxd_config_bus_remove(struct device *dev)
idxd_unregister_dma_device(idxd);
rc = idxd_device_disable(idxd);
- for (i = 0; i < idxd->max_wqs; i++) {
- struct idxd_wq *wq = &idxd->wqs[i];
+ if (test_bit(IDXD_FLAG_CONFIGURABLE, &idxd->flags)) {
+ for (i = 0; i < idxd->max_wqs; i++) {
+ struct idxd_wq *wq = &idxd->wqs[i];
- mutex_lock(&wq->wq_lock);
- idxd_wq_disable_cleanup(wq);
- mutex_unlock(&wq->wq_lock);
+ mutex_lock(&wq->wq_lock);
+ idxd_wq_disable_cleanup(wq);
+ mutex_unlock(&wq->wq_lock);
+ }
}
module_put(THIS_MODULE);
if (rc < 0)
Add device support helper functions in preparation for adding VFIO
mdev support.
Signed-off-by: Dave Jiang <[email protected]>
---
drivers/dma/idxd/device.c | 61 ++++++++++++++++++++++++++++++++++++++++++
drivers/dma/idxd/idxd.h | 4 +++
drivers/dma/idxd/registers.h | 3 +-
3 files changed, 67 insertions(+), 1 deletion(-)
diff --git a/drivers/dma/idxd/device.c b/drivers/dma/idxd/device.c
index a9ae970db0a4..8aff07b1acb4 100644
--- a/drivers/dma/idxd/device.c
+++ b/drivers/dma/idxd/device.c
@@ -287,6 +287,30 @@ void idxd_wq_unmap_portal(struct idxd_wq *wq)
devm_iounmap(dev, wq->portal);
}
+int idxd_wq_abort(struct idxd_wq *wq)
+{
+ struct idxd_device *idxd = wq->idxd;
+ struct device *dev = &idxd->pdev->dev;
+ u32 operand, status;
+
+ dev_dbg(dev, "Abort WQ %d\n", wq->id);
+ if (wq->state != IDXD_WQ_ENABLED) {
+ dev_dbg(dev, "WQ %d not active\n", wq->id);
+ return -ENXIO;
+ }
+
+ operand = BIT(wq->id % 16) | ((wq->id / 16) << 16);
+ dev_dbg(dev, "cmd: %u operand: %#x\n", IDXD_CMD_ABORT_WQ, operand);
+ idxd_cmd_exec(idxd, IDXD_CMD_ABORT_WQ, operand, &status);
+ if (status != IDXD_CMDSTS_SUCCESS) {
+ dev_dbg(dev, "WQ abort failed: %#x\n", status);
+ return -ENXIO;
+ }
+
+ dev_dbg(dev, "WQ %d aborted\n", wq->id);
+ return 0;
+}
+
int idxd_wq_set_pasid(struct idxd_wq *wq, int pasid)
{
struct idxd_device *idxd = wq->idxd;
@@ -366,6 +390,32 @@ void idxd_wq_disable_cleanup(struct idxd_wq *wq)
}
}
+void idxd_wq_setup_pasid(struct idxd_wq *wq, int pasid)
+{
+ struct idxd_device *idxd = wq->idxd;
+ int offset;
+
+ lockdep_assert_held(&idxd->dev_lock);
+
+ /* PASID fields are 8 bytes into the WQCFG register */
+ offset = WQCFG_OFFSET(idxd, wq->id, WQCFG_PASID_IDX);
+ wq->wqcfg->pasid = pasid;
+ iowrite32(wq->wqcfg->bits[WQCFG_PASID_IDX], idxd->reg_base + offset);
+}
+
+void idxd_wq_setup_priv(struct idxd_wq *wq, int priv)
+{
+ struct idxd_device *idxd = wq->idxd;
+ int offset;
+
+ lockdep_assert_held(&idxd->dev_lock);
+
+ /* priv field is 8 bytes into the WQCFG register */
+ offset = WQCFG_OFFSET(idxd, wq->id, WQCFG_PRIV_IDX);
+ wq->wqcfg->priv = !!priv;
+ iowrite32(wq->wqcfg->bits[WQCFG_PRIV_IDX], idxd->reg_base + offset);
+}
+
/* Device control bits */
static inline bool idxd_is_enabled(struct idxd_device *idxd)
{
@@ -532,6 +582,17 @@ void idxd_device_drain_pasid(struct idxd_device *idxd, int pasid)
dev_dbg(dev, "pasid %d drained\n", pasid);
}
+void idxd_device_abort_pasid(struct idxd_device *idxd, int pasid)
+{
+ struct device *dev = &idxd->pdev->dev;
+ u32 operand;
+
+ operand = pasid;
+ dev_dbg(dev, "cmd: %u operand: %#x\n", IDXD_CMD_ABORT_PASID, operand);
+ idxd_cmd_exec(idxd, IDXD_CMD_ABORT_PASID, operand, NULL);
+ dev_dbg(dev, "pasid %d aborted\n", pasid);
+}
+
int idxd_device_request_int_handle(struct idxd_device *idxd, int idx, int *handle,
enum idxd_interrupt_type irq_type)
{
diff --git a/drivers/dma/idxd/idxd.h b/drivers/dma/idxd/idxd.h
index 549426bfb443..eb8552d32a0a 100644
--- a/drivers/dma/idxd/idxd.h
+++ b/drivers/dma/idxd/idxd.h
@@ -331,6 +331,7 @@ void idxd_device_cleanup(struct idxd_device *idxd);
int idxd_device_config(struct idxd_device *idxd);
void idxd_device_wqs_clear_state(struct idxd_device *idxd);
void idxd_device_drain_pasid(struct idxd_device *idxd, int pasid);
+void idxd_device_abort_pasid(struct idxd_device *idxd, int pasid);
int idxd_device_load_config(struct idxd_device *idxd);
int idxd_device_request_int_handle(struct idxd_device *idxd, int idx, int *handle,
enum idxd_interrupt_type irq_type);
@@ -348,6 +349,9 @@ void idxd_wq_unmap_portal(struct idxd_wq *wq);
void idxd_wq_disable_cleanup(struct idxd_wq *wq);
int idxd_wq_set_pasid(struct idxd_wq *wq, int pasid);
int idxd_wq_disable_pasid(struct idxd_wq *wq);
+int idxd_wq_abort(struct idxd_wq *wq);
+void idxd_wq_setup_pasid(struct idxd_wq *wq, int pasid);
+void idxd_wq_setup_priv(struct idxd_wq *wq, int priv);
/* submission */
int idxd_submit_desc(struct idxd_wq *wq, struct idxd_desc *desc);
diff --git a/drivers/dma/idxd/registers.h b/drivers/dma/idxd/registers.h
index d02fd59a8e39..acc071df48eb 100644
--- a/drivers/dma/idxd/registers.h
+++ b/drivers/dma/idxd/registers.h
@@ -345,7 +345,8 @@ union wqcfg {
u32 bits[8];
} __packed;
-#define WQCFG_PASID_IDX 2
+#define WQCFG_PASID_IDX 2
+#define WQCFG_PRIV_IDX 2
/*
* This macro calculates the offset into the WQCFG register
Add all the helper functions that support the emulation of the commands
submitted to the device command register.
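To show how these helpers hang together, here is a sketch of the flow for one
command (illustrative only; example_emulate_enable() is a made-up name, and
the bit layout of union idxd_command_reg -- operand in bits 0-19, command
code starting at bit 20 -- is assumed from the register definition, with
IDXD_CMD_INT_MASK defined in this patch): a trapped guest write to the
command register goes through vidxd_do_command(), which dispatches to the
per-command helper and ends in idxd_complete_command(), optionally injecting
MSI-X vector 0.

	/* Sketch: emulating a guest "Enable Device" command that requests a completion interrupt. */
	static void example_emulate_enable(struct vdcm_idxd *vidxd)
	{
		u32 val = (IDXD_CMD_ENABLE_DEVICE << 20) | IDXD_CMD_INT_MASK;

		/* the trapped MMIO write to IDXD_CMD_OFFSET lands here */
		vidxd_do_command(vidxd, val);
		/*
		 * vidxd_enable() flips the virtual GENSTS to ENABLED, then
		 * idxd_complete_command() writes CMDSTS and, because the
		 * interrupt request bit was set, injects MSI-X vector 0.
		 */
	}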
Signed-off-by: Dave Jiang <[email protected]>
---
drivers/dma/idxd/registers.h | 16 +-
drivers/dma/idxd/vdev.c | 427 ++++++++++++++++++++++++++++++++++++++++++
2 files changed, 438 insertions(+), 5 deletions(-)
diff --git a/drivers/dma/idxd/registers.h b/drivers/dma/idxd/registers.h
index 5a76fd0ab6ad..17f0d868e5a4 100644
--- a/drivers/dma/idxd/registers.h
+++ b/drivers/dma/idxd/registers.h
@@ -119,7 +119,8 @@ union gencfg_reg {
union genctrl_reg {
struct {
u32 softerr_int_en:1;
- u32 rsvd:31;
+ u32 halt_state_int_en:1;
+ u32 rsvd:30;
};
u32 bits;
} __packed;
@@ -141,6 +142,8 @@ enum idxd_device_status_state {
IDXD_DEVICE_STATE_HALT,
};
+#define IDXD_GENSTATS_MASK 0x03
+
enum idxd_device_reset_type {
IDXD_DEVICE_RESET_SOFTWARE = 0,
IDXD_DEVICE_RESET_FLR,
@@ -153,6 +156,7 @@ enum idxd_device_reset_type {
#define IDXD_INTC_CMD 0x02
#define IDXD_INTC_OCCUPY 0x04
#define IDXD_INTC_PERFMON_OVFL 0x08
+#define IDXD_INTC_HALT_STATE 0x10
#define IDXD_CMD_OFFSET 0xa0
union idxd_command_reg {
@@ -164,6 +168,7 @@ union idxd_command_reg {
};
u32 bits;
} __packed;
+#define IDXD_CMD_INT_MASK 0x80000000
enum idxd_cmd {
IDXD_CMD_ENABLE_DEVICE = 1,
@@ -227,10 +232,11 @@ enum idxd_cmdsts_err {
/* disable device errors */
IDXD_CMDSTS_ERR_DIS_DEV_EN = 0x31,
/* disable WQ, drain WQ, abort WQ, reset WQ */
- IDXD_CMDSTS_ERR_DEV_NOT_EN,
+ IDXD_CMDSTS_ERR_WQ_NOT_EN,
/* request interrupt handle */
IDXD_CMDSTS_ERR_INVAL_INT_IDX = 0x41,
IDXD_CMDSTS_ERR_NO_HANDLE,
+ IDXD_CMDSTS_ERR_INVAL_INT_IDX_RELEASE,
};
#define IDXD_CMDCAP_OFFSET 0xb0
@@ -351,6 +357,12 @@ union wqcfg {
u32 bits[8];
} __packed;
+enum idxd_wq_hw_state {
+ IDXD_WQ_DEV_DISABLED = 0,
+ IDXD_WQ_DEV_ENABLED,
+ IDXD_WQ_DEV_BUSY,
+};
+
#define WQCFG_PASID_IDX 2
#define WQCFG_PRIV_IDX 2
#define WQCFG_MODE_DEDICATED 1
diff --git a/drivers/dma/idxd/vdev.c b/drivers/dma/idxd/vdev.c
index b38bb676e604..6e7f98d0e52f 100644
--- a/drivers/dma/idxd/vdev.c
+++ b/drivers/dma/idxd/vdev.c
@@ -463,17 +463,438 @@ void vidxd_mmio_init(struct vdcm_idxd *vidxd)
static void idxd_complete_command(struct vdcm_idxd *vidxd, enum idxd_cmdsts_err val)
{
- /* PLACEHOLDER */
+ u8 *bar0 = vidxd->bar0;
+ u32 *cmd = (u32 *)(bar0 + IDXD_CMD_OFFSET);
+ u32 *cmdsts = (u32 *)(bar0 + IDXD_CMDSTS_OFFSET);
+ u32 *intcause = (u32 *)(bar0 + IDXD_INTCAUSE_OFFSET);
+ struct mdev_device *mdev = vidxd->vdev.mdev;
+ struct device *dev = mdev_dev(mdev);
+
+ *cmdsts = val;
+ dev_dbg(dev, "%s: cmd: %#x status: %#x\n", __func__, *cmd, val);
+
+ if (*cmd & IDXD_CMD_INT_MASK) {
+ *intcause |= IDXD_INTC_CMD;
+ vidxd_send_interrupt(vidxd, 0);
+ }
+}
+
+static void vidxd_enable(struct vdcm_idxd *vidxd)
+{
+ u8 *bar0 = vidxd->bar0;
+ union gensts_reg *gensts = (union gensts_reg *)(bar0 + IDXD_GENSTATS_OFFSET);
+ struct mdev_device *mdev = vidxd->vdev.mdev;
+ struct device *dev = mdev_dev(mdev);
+
+ dev_dbg(dev, "%s\n", __func__);
+ if (gensts->state == IDXD_DEVICE_STATE_ENABLED)
+ return idxd_complete_command(vidxd, IDXD_CMDSTS_ERR_DEV_ENABLED);
+
+ /* Check PCI configuration */
+ if (!(vidxd->cfg[PCI_COMMAND] & PCI_COMMAND_MASTER))
+ return idxd_complete_command(vidxd, IDXD_CMDSTS_ERR_BUSMASTER_EN);
+
+ gensts->state = IDXD_DEVICE_STATE_ENABLED;
+
+ return idxd_complete_command(vidxd, IDXD_CMDSTS_SUCCESS);
+}
+
+static void vidxd_disable(struct vdcm_idxd *vidxd)
+{
+ struct idxd_wq *wq;
+ union wqcfg *wqcfg;
+ u8 *bar0 = vidxd->bar0;
+ union gensts_reg *gensts = (union gensts_reg *)(bar0 + IDXD_GENSTATS_OFFSET);
+ struct mdev_device *mdev = vidxd->vdev.mdev;
+ struct device *dev = mdev_dev(mdev);
+ u32 status;
+
+ dev_dbg(dev, "%s\n", __func__);
+ if (gensts->state == IDXD_DEVICE_STATE_DISABLED) {
+ idxd_complete_command(vidxd, IDXD_CMDSTS_ERR_DIS_DEV_EN);
+ return;
+ }
+
+ wqcfg = (union wqcfg *)(bar0 + VIDXD_WQCFG_OFFSET);
+ wq = vidxd->wq;
+
+ /* If it is a DWQ, need to disable the DWQ as well */
+ if (wq_dedicated(wq)) {
+ idxd_wq_disable(wq, &status);
+ if (status) {
+ dev_warn(dev, "vidxd disable (wq disable) failed: %#x\n", status);
+ idxd_complete_command(vidxd, IDXD_CMDSTS_ERR_DIS_DEV_EN);
+ return;
+ }
+ } else {
+ idxd_wq_drain(wq, &status);
+ if (status)
+ dev_warn(dev, "vidxd disable (wq drain) failed: %#x\n", status);
+ }
+
+ wqcfg->wq_state = 0;
+ gensts->state = IDXD_DEVICE_STATE_DISABLED;
+ idxd_complete_command(vidxd, IDXD_CMDSTS_SUCCESS);
+}
+
+static void vidxd_drain_all(struct vdcm_idxd *vidxd)
+{
+ struct mdev_device *mdev = vidxd->vdev.mdev;
+ struct device *dev = mdev_dev(mdev);
+ struct idxd_wq *wq = vidxd->wq;
+
+ dev_dbg(dev, "%s\n", __func__);
+
+ idxd_wq_drain(wq, NULL);
+ idxd_complete_command(vidxd, IDXD_CMDSTS_SUCCESS);
+}
+
+static void vidxd_wq_drain(struct vdcm_idxd *vidxd, int val)
+{
+ struct mdev_device *mdev = vidxd->vdev.mdev;
+ struct device *dev = mdev_dev(mdev);
+ u8 *bar0 = vidxd->bar0;
+ union wqcfg *wqcfg = (union wqcfg *)(bar0 + VIDXD_WQCFG_OFFSET);
+ struct idxd_wq *wq = vidxd->wq;
+ u32 status;
+
+ dev_dbg(dev, "%s\n", __func__);
+ if (wqcfg->wq_state != IDXD_WQ_DEV_ENABLED) {
+ idxd_complete_command(vidxd, IDXD_CMDSTS_ERR_WQ_NOT_EN);
+ return;
+ }
+
+ idxd_wq_drain(wq, &status);
+ if (status) {
+ dev_dbg(dev, "wq drain failed: %#x\n", status);
+ idxd_complete_command(vidxd, status);
+ return;
+ }
+
+ idxd_complete_command(vidxd, IDXD_CMDSTS_SUCCESS);
+}
+
+static void vidxd_abort_all(struct vdcm_idxd *vidxd)
+{
+ struct mdev_device *mdev = vidxd->vdev.mdev;
+ struct device *dev = mdev_dev(mdev);
+ struct idxd_wq *wq = vidxd->wq;
+
+ dev_dbg(dev, "%s\n", __func__);
+ idxd_wq_abort(wq, NULL);
+ idxd_complete_command(vidxd, IDXD_CMDSTS_SUCCESS);
+}
+
+static void vidxd_wq_abort(struct vdcm_idxd *vidxd, int val)
+{
+ struct mdev_device *mdev = vidxd->vdev.mdev;
+ struct device *dev = mdev_dev(mdev);
+ u8 *bar0 = vidxd->bar0;
+ union wqcfg *wqcfg = (union wqcfg *)(bar0 + VIDXD_WQCFG_OFFSET);
+ struct idxd_wq *wq = vidxd->wq;
+ u32 status;
+
+ dev_dbg(dev, "%s\n", __func__);
+ if (wqcfg->wq_state != IDXD_WQ_DEV_ENABLED) {
+ idxd_complete_command(vidxd, IDXD_CMDSTS_ERR_WQ_NOT_EN);
+ return;
+ }
+
+ idxd_wq_abort(wq, &status);
+ if (status) {
+ dev_dbg(dev, "wq abort failed: %#x\n", status);
+ idxd_complete_command(vidxd, status);
+ return;
+ }
+
+ idxd_complete_command(vidxd, IDXD_CMDSTS_SUCCESS);
}
void vidxd_reset(struct vdcm_idxd *vidxd)
{
- /* PLACEHOLDER */
+ struct mdev_device *mdev = vidxd->vdev.mdev;
+ struct device *dev = mdev_dev(mdev);
+ u8 *bar0 = vidxd->bar0;
+ union gensts_reg *gensts = (union gensts_reg *)(bar0 + IDXD_GENSTATS_OFFSET);
+ struct idxd_wq *wq;
+
+ dev_dbg(dev, "%s\n", __func__);
+ gensts->state = IDXD_DEVICE_STATE_DRAIN;
+ wq = vidxd->wq;
+
+ if (wq->state == IDXD_WQ_ENABLED) {
+ idxd_wq_abort(wq, NULL);
+ idxd_wq_disable(wq, NULL);
+ }
+
+ vidxd_mmio_init(vidxd);
+ gensts->state = IDXD_DEVICE_STATE_DISABLED;
+ idxd_complete_command(vidxd, IDXD_CMDSTS_SUCCESS);
+}
+
+static void vidxd_wq_reset(struct vdcm_idxd *vidxd, int wq_id_mask)
+{
+ struct idxd_wq *wq;
+ u8 *bar0 = vidxd->bar0;
+ union wqcfg *wqcfg = (union wqcfg *)(bar0 + VIDXD_WQCFG_OFFSET);
+ struct mdev_device *mdev = vidxd->vdev.mdev;
+ struct device *dev = mdev_dev(mdev);
+ u32 status;
+
+ wq = vidxd->wq;
+ dev_dbg(dev, "vidxd reset wq %u:%u\n", 0, wq->id);
+
+ if (wqcfg->wq_state != IDXD_WQ_DEV_ENABLED) {
+ idxd_complete_command(vidxd, IDXD_CMDSTS_ERR_WQ_NOT_EN);
+ return;
+ }
+
+ idxd_wq_abort(wq, &status);
+ if (status) {
+ dev_dbg(dev, "vidxd reset wq failed to abort: %#x\n", status);
+ idxd_complete_command(vidxd, status);
+ return;
+ }
+
+ idxd_wq_disable(wq, &status);
+ if (status) {
+ dev_dbg(dev, "vidxd reset wq failed to disable: %#x\n", status);
+ idxd_complete_command(vidxd, status);
+ return;
+ }
+
+ wqcfg->wq_state = IDXD_WQ_DEV_DISABLED;
+ idxd_complete_command(vidxd, IDXD_CMDSTS_SUCCESS);
+}
+
+static void vidxd_alloc_int_handle(struct vdcm_idxd *vidxd, int operand)
+{
+ bool ims = !!(operand & CMD_INT_HANDLE_IMS);
+ u32 cmdsts;
+ struct mdev_device *mdev = vidxd->vdev.mdev;
+ struct device *dev = mdev_dev(mdev);
+ int ims_idx, vidx;
+
+ vidx = operand & GENMASK(15, 0);
+
+ dev_dbg(dev, "allocating int handle for %d\n", vidx);
+
+ /* vidx cannot be 0 since vector 0 is emulated and does not require an IMS handle */
+ if (vidx <= 0 || vidx >= VIDXD_MAX_MSIX_ENTRIES) {
+ idxd_complete_command(vidxd, IDXD_CMDSTS_ERR_INVAL_INT_IDX);
+ return;
+ }
+
+ if (ims) {
+ dev_warn(dev, "IMS allocation is not implemented yet\n");
+ idxd_complete_command(vidxd, IDXD_CMDSTS_ERR_NO_HANDLE);
+ return;
+ }
+
+ ims_idx = vidxd->irq_entries[vidx - 1].entry->device_msi.hwirq;
+ vidx--; /* MSIX idx 0 is a slow path interrupt */
+ cmdsts = ims_idx << IDXD_CMDSTS_RES_SHIFT;
+ dev_dbg(dev, "int handle %d:%d\n", vidx, ims_idx);
+ idxd_complete_command(vidxd, cmdsts);
+}
+
+static void vidxd_release_int_handle(struct vdcm_idxd *vidxd, int operand)
+{
+ struct mdev_device *mdev = vidxd->vdev.mdev;
+ struct device *dev = mdev_dev(mdev);
+ bool ims = !!(operand & CMD_INT_HANDLE_IMS);
+ int handle, i;
+ bool found = false;
+
+ handle = operand & GENMASK(15, 0);
+ dev_dbg(dev, "allocating int handle %d\n", handle);
+
+ if (ims) {
+ dev_warn(dev, "IMS allocation is not implemented yet\n");
+ idxd_complete_command(vidxd, IDXD_CMDSTS_ERR_INVAL_INT_IDX_RELEASE);
+ return;
+ }
+
+ for (i = 0; i < VIDXD_MAX_MSIX_ENTRIES - 1; i++) {
+ if (vidxd->irq_entries[i].entry->device_msi.hwirq == handle) {
+ found = true;
+ break;
+ }
+ }
+
+ if (!found) {
+ dev_warn(dev, "Freeing unallocated int handle.\n");
+ idxd_complete_command(vidxd, IDXD_CMDSTS_ERR_INVAL_INT_IDX_RELEASE);
+ return;
+ }
+
+ dev_dbg(dev, "int handle %d released.\n", handle);
+ idxd_complete_command(vidxd, IDXD_CMDSTS_SUCCESS);
+}
+
+static void vidxd_wq_enable(struct vdcm_idxd *vidxd, int wq_id)
+{
+ struct idxd_wq *wq;
+ u8 *bar0 = vidxd->bar0;
+ union wq_cap_reg *wqcap;
+ struct mdev_device *mdev = vidxd->vdev.mdev;
+ struct device *dev = mdev_dev(mdev);
+ struct idxd_device *idxd;
+ union wqcfg *vwqcfg, *wqcfg;
+ unsigned long flags;
+ int wq_pasid;
+ u32 status;
+ int priv;
+
+ if (wq_id >= VIDXD_MAX_WQS) {
+ idxd_complete_command(vidxd, IDXD_CMDSTS_INVAL_WQIDX);
+ return;
+ }
+
+ idxd = vidxd->idxd;
+ wq = vidxd->wq;
+
+ dev_dbg(dev, "%s: wq %u:%u\n", __func__, wq_id, wq->id);
+
+ vwqcfg = (union wqcfg *)(bar0 + VIDXD_WQCFG_OFFSET + wq_id * 32);
+ wqcap = (union wq_cap_reg *)(bar0 + IDXD_WQCAP_OFFSET);
+ wqcfg = wq->wqcfg;
+
+ if (vidxd_state(vidxd) != IDXD_DEVICE_STATE_ENABLED) {
+ idxd_complete_command(vidxd, IDXD_CMDSTS_ERR_DEV_NOTEN);
+ return;
+ }
+
+ if (vwqcfg->wq_state != IDXD_WQ_DEV_DISABLED) {
+ idxd_complete_command(vidxd, IDXD_CMDSTS_ERR_WQ_ENABLED);
+ return;
+ }
+
+ if (wq_dedicated(wq) && wqcap->dedicated_mode == 0) {
+ idxd_complete_command(vidxd, IDXD_CMDSTS_ERR_WQ_MODE);
+ return;
+ }
+
+ wq_pasid = idxd_mdev_get_pasid(mdev);
+ priv = 1;
+
+ if (wq_pasid >= 0) {
+ /* Clear pasid_en, pasid, and priv values */
+ wqcfg->bits[WQCFG_PASID_IDX] &= ~GENMASK(29, 8);
+ wqcfg->priv = priv;
+ wqcfg->pasid_en = 1;
+ wqcfg->pasid = wq_pasid;
+ dev_dbg(dev, "program pasid %d in wq %d\n", wq_pasid, wq->id);
+ spin_lock_irqsave(&idxd->dev_lock, flags);
+ idxd_wq_setup_pasid(wq, wq_pasid);
+ idxd_wq_setup_priv(wq, priv);
+ spin_unlock_irqrestore(&idxd->dev_lock, flags);
+ idxd_wq_enable(wq, &status);
+ if (status) {
+ dev_err(dev, "vidxd enable wq %d failed\n", wq->id);
+ idxd_complete_command(vidxd, status);
+ return;
+ }
+ } else {
+ dev_err(dev, "idxd pasid setup failed wq %d wq_pasid %d\n", wq->id, wq_pasid);
+ idxd_complete_command(vidxd, IDXD_CMDSTS_ERR_PASID_EN);
+ return;
+ }
+
+ vwqcfg->wq_state = IDXD_WQ_DEV_ENABLED;
+ idxd_complete_command(vidxd, IDXD_CMDSTS_SUCCESS);
+}
+
+static void vidxd_wq_disable(struct vdcm_idxd *vidxd, int wq_id_mask)
+{
+ struct idxd_wq *wq;
+ union wqcfg *wqcfg;
+ u8 *bar0 = vidxd->bar0;
+ struct mdev_device *mdev = vidxd->vdev.mdev;
+ struct device *dev = mdev_dev(mdev);
+ u32 status;
+
+ wq = vidxd->wq;
+
+ dev_dbg(dev, "vidxd disable wq %u:%u\n", 0, wq->id);
+
+ wqcfg = (union wqcfg *)(bar0 + VIDXD_WQCFG_OFFSET);
+ if (wqcfg->wq_state != IDXD_WQ_DEV_ENABLED) {
+ idxd_complete_command(vidxd, IDXD_CMDSTS_ERR_WQ_NOT_EN);
+ return;
+ }
+
+ /* If it is a DWQ, need to disable the DWQ as well */
+ if (wq_dedicated(wq)) {
+ idxd_wq_disable(wq, &status);
+ if (status) {
+ dev_warn(dev, "vidxd disable wq failed: %#x\n", status);
+ idxd_complete_command(vidxd, status);
+ return;
+ }
+ } else {
+ idxd_wq_drain(wq, &status);
+ if (status) {
+ dev_warn(dev, "vidxd disable drain wq failed: %#x\n", status);
+ idxd_complete_command(vidxd, status);
+ return;
+ }
+ }
+
+ wqcfg->wq_state = IDXD_WQ_DEV_DISABLED;
+ idxd_complete_command(vidxd, IDXD_CMDSTS_SUCCESS);
}
void vidxd_do_command(struct vdcm_idxd *vidxd, u32 val)
{
- /* PLACEHOLDER */
+ union idxd_command_reg *reg = (union idxd_command_reg *)(vidxd->bar0 + IDXD_CMD_OFFSET);
+ struct mdev_device *mdev = vidxd->vdev.mdev;
+ struct device *dev = mdev_dev(mdev);
+
+ reg->bits = val;
+
+ dev_dbg(dev, "%s: cmd code: %u reg: %x\n", __func__, reg->cmd, reg->bits);
+
+ switch (reg->cmd) {
+ case IDXD_CMD_ENABLE_DEVICE:
+ vidxd_enable(vidxd);
+ break;
+ case IDXD_CMD_DISABLE_DEVICE:
+ vidxd_disable(vidxd);
+ break;
+ case IDXD_CMD_DRAIN_ALL:
+ vidxd_drain_all(vidxd);
+ break;
+ case IDXD_CMD_ABORT_ALL:
+ vidxd_abort_all(vidxd);
+ break;
+ case IDXD_CMD_RESET_DEVICE:
+ vidxd_reset(vidxd);
+ break;
+ case IDXD_CMD_ENABLE_WQ:
+ vidxd_wq_enable(vidxd, reg->operand);
+ break;
+ case IDXD_CMD_DISABLE_WQ:
+ vidxd_wq_disable(vidxd, reg->operand);
+ break;
+ case IDXD_CMD_DRAIN_WQ:
+ vidxd_wq_drain(vidxd, reg->operand);
+ break;
+ case IDXD_CMD_ABORT_WQ:
+ vidxd_wq_abort(vidxd, reg->operand);
+ break;
+ case IDXD_CMD_RESET_WQ:
+ vidxd_wq_reset(vidxd, reg->operand);
+ break;
+ case IDXD_CMD_REQUEST_INT_HANDLE:
+ vidxd_alloc_int_handle(vidxd, reg->operand);
+ break;
+ case IDXD_CMD_RELEASE_INT_HANDLE:
+ vidxd_release_int_handle(vidxd, reg->operand);
+ break;
+ default:
+ idxd_complete_command(vidxd, IDXD_CMDSTS_INVAL_CMD);
+ break;
+ }
}
int vidxd_setup_ims_entries(struct vdcm_idxd *vidxd)
Create a mediated device through the VFIO mediated device framework. The
mdev framework allows the driver to create a mediated device from a
portion of the device's resources. The driver will emulate the slow path
such as the PCI config space, MMIO bar, and the command registers. The
descriptor submission portal(s) will be mmapped into the guest so that
descriptors can be submitted directly by the guest kernel or apps. The
mediated device support code in the idxd driver will be referred to as the
Virtual Device Composition Module (vdcm). Add basic plumbing to fill out
the mdev_parent_ops struct that VFIO mdev requires to support a mediated
device.
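For context, the fast path that this split enables looks roughly like the
sketch below (illustrative only, not part of the patch; guest_submit_example()
is a made-up name, while iosubmit_cmds512() from <asm/io.h> and
struct dsa_hw_desc from <uapi/linux/idxd.h> are existing kernel interfaces):
once the guest has mmapped the BAR2 portal, descriptors reach the hardware
directly with no VM exit.

	/* Sketch: guest-side submission once the BAR2 wq portal is mmapped. */
	static void guest_submit_example(void __iomem *portal, struct dsa_hw_desc *desc)
	{
		/* 64-byte atomic write (MOVDIR64B) of the descriptor to the dedicated wq portal */
		iosubmit_cmds512(portal, desc, 1);
	}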
Signed-off-by: Dave Jiang <[email protected]>
---
drivers/dma/Kconfig | 7
drivers/dma/idxd/Makefile | 2
drivers/dma/idxd/idxd.h | 14 +
drivers/dma/idxd/init.c | 11 +
drivers/dma/idxd/mdev.c | 968 +++++++++++++++++++++++++++++++++++++++++++++
drivers/dma/idxd/mdev.h | 115 +++++
drivers/dma/idxd/vdev.c | 75 +++
drivers/dma/idxd/vdev.h | 19 +
8 files changed, 1211 insertions(+)
create mode 100644 drivers/dma/idxd/mdev.c
create mode 100644 drivers/dma/idxd/mdev.h
create mode 100644 drivers/dma/idxd/vdev.c
create mode 100644 drivers/dma/idxd/vdev.h
diff --git a/drivers/dma/Kconfig b/drivers/dma/Kconfig
index 6a908785a5f7..c5970e4a3a2c 100644
--- a/drivers/dma/Kconfig
+++ b/drivers/dma/Kconfig
@@ -306,6 +306,13 @@ config INTEL_IDXD_SVM
depends on PCI_PASID
depends on PCI_IOV
+config INTEL_IDXD_MDEV
+ bool "IDXD VFIO Mediated Device Support"
+ depends on INTEL_IDXD
+ depends on VFIO_MDEV
+ depends on VFIO_MDEV_DEVICE
+ select PCI_SIOV
+
config INTEL_IOATDMA
tristate "Intel I/OAT DMA support"
depends on PCI && X86_64
diff --git a/drivers/dma/idxd/Makefile b/drivers/dma/idxd/Makefile
index 8978b898d777..30cad704a95a 100644
--- a/drivers/dma/idxd/Makefile
+++ b/drivers/dma/idxd/Makefile
@@ -1,2 +1,4 @@
obj-$(CONFIG_INTEL_IDXD) += idxd.o
idxd-y := init.o irq.o device.o sysfs.o submit.o dma.o cdev.o
+
+idxd-$(CONFIG_INTEL_IDXD_MDEV) += mdev.o vdev.o
diff --git a/drivers/dma/idxd/idxd.h b/drivers/dma/idxd/idxd.h
index eb8552d32a0a..ab28a1bffb7c 100644
--- a/drivers/dma/idxd/idxd.h
+++ b/drivers/dma/idxd/idxd.h
@@ -8,6 +8,7 @@
#include <linux/percpu-rwsem.h>
#include <linux/wait.h>
#include <linux/cdev.h>
+#include <linux/mdev.h>
#include "registers.h"
#define IDXD_DRIVER_VERSION "1.00"
@@ -123,6 +124,7 @@ struct idxd_wq {
char name[WQ_NAME_SIZE + 1];
u64 max_xfer_bytes;
u32 max_batch_size;
+ struct list_head vdcm_list;
};
struct idxd_engine {
@@ -155,6 +157,7 @@ enum idxd_device_flag {
IDXD_FLAG_CMD_RUNNING,
IDXD_FLAG_PASID_ENABLED,
IDXD_FLAG_SIOV_SUPPORTED,
+ IDXD_FLAG_MDEV_ENABLED,
};
struct idxd_device {
@@ -250,11 +253,17 @@ static inline bool device_pasid_enabled(struct idxd_device *idxd)
return test_bit(IDXD_FLAG_PASID_ENABLED, &idxd->flags);
}
+
static inline bool device_swq_supported(struct idxd_device *idxd)
{
return (support_enqcmd && device_pasid_enabled(idxd));
}
+static inline bool device_mdev_enabled(struct idxd_device *idxd)
+{
+ return test_bit(IDXD_FLAG_MDEV_ENABLED, &idxd->flags);
+}
+
enum idxd_portal_prot {
IDXD_PORTAL_UNLIMITED = 0,
IDXD_PORTAL_LIMITED,
@@ -375,4 +384,9 @@ int idxd_cdev_get_major(struct idxd_device *idxd);
int idxd_wq_add_cdev(struct idxd_wq *wq);
void idxd_wq_del_cdev(struct idxd_wq *wq);
+/* mdev */
+int idxd_mdev_host_init(struct idxd_device *idxd);
+void idxd_mdev_host_release(struct idxd_device *idxd);
+int idxd_mdev_get_pasid(struct mdev_device *mdev);
+
#endif
diff --git a/drivers/dma/idxd/init.c b/drivers/dma/idxd/init.c
index 4a21c2a17a62..ab91293aedb9 100644
--- a/drivers/dma/idxd/init.c
+++ b/drivers/dma/idxd/init.c
@@ -218,6 +218,7 @@ static int idxd_setup_internals(struct idxd_device *idxd)
wq->wqcfg = devm_kzalloc(dev, idxd->wqcfg_size, GFP_KERNEL);
if (!wq->wqcfg)
return -ENOMEM;
+ INIT_LIST_HEAD(&wq->vdcm_list);
}
for (i = 0; i < idxd->max_engines; i++) {
@@ -479,6 +480,14 @@ static int idxd_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
return -ENODEV;
}
+ if (IS_ENABLED(CONFIG_INTEL_IDXD_MDEV)) {
+ rc = idxd_mdev_host_init(idxd);
+ if (rc < 0)
+ dev_warn(dev, "VFIO mdev not setup: %d\n", rc);
+ else
+ set_bit(IDXD_FLAG_MDEV_ENABLED, &idxd->flags);
+ }
+
rc = idxd_setup_sysfs(idxd);
if (rc) {
dev_err(dev, "IDXD sysfs setup failed\n");
@@ -572,6 +581,8 @@ static void idxd_remove(struct pci_dev *pdev)
dev_dbg(&pdev->dev, "%s called\n", __func__);
idxd_cleanup_sysfs(idxd);
idxd_shutdown(pdev);
+ if (IS_ENABLED(CONFIG_INTEL_IDXD_MDEV) && device_mdev_enabled(idxd))
+ idxd_mdev_host_release(idxd);
if (device_pasid_enabled(idxd))
idxd_disable_system_pasid(idxd);
mutex_lock(&idxd_idr_lock);
diff --git a/drivers/dma/idxd/mdev.c b/drivers/dma/idxd/mdev.c
new file mode 100644
index 000000000000..3b6febe22a0e
--- /dev/null
+++ b/drivers/dma/idxd/mdev.c
@@ -0,0 +1,968 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright(c) 2019,2020 Intel Corporation. All rights rsvd. */
+#include <linux/init.h>
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/pci.h>
+#include <linux/device.h>
+#include <linux/sched/task.h>
+#include <linux/io-64-nonatomic-lo-hi.h>
+#include <linux/mm.h>
+#include <linux/mmu_context.h>
+#include <linux/vfio.h>
+#include <linux/mdev.h>
+#include <linux/msi.h>
+#include <linux/intel-iommu.h>
+#include <linux/intel-svm.h>
+#include <linux/kvm_host.h>
+#include <linux/eventfd.h>
+#include <linux/circ_buf.h>
+#include <linux/irqchip/irq-ims-msi.h>
+#include <uapi/linux/idxd.h>
+#include "registers.h"
+#include "idxd.h"
+#include "../../vfio/pci/vfio_pci_private.h"
+#include "mdev.h"
+#include "vdev.h"
+
+static u64 idxd_pci_config[] = {
+ 0x001000000b258086ULL,
+ 0x0080000008800000ULL,
+ 0x000000000000000cULL,
+ 0x000000000000000cULL,
+ 0x0000000000000000ULL,
+ 0x2010808600000000ULL,
+ 0x0000004000000000ULL,
+ 0x000000ff00000000ULL,
+ 0x0000060000015011ULL, /* MSI-X capability, hardcoded 2 entries, Encoded as N-1 */
+ 0x0000070000000000ULL,
+ 0x0000000000920010ULL, /* PCIe capability */
+ 0x0000000000000000ULL,
+ 0x0000000000000000ULL,
+ 0x0000000000000000ULL,
+ 0x0000000000000000ULL,
+ 0x0000000000000000ULL,
+ 0x0000000000000000ULL,
+ 0x0000000000000000ULL,
+};
+
+static int idxd_vdcm_set_irqs(struct vdcm_idxd *vidxd, uint32_t flags, unsigned int index,
+ unsigned int start, unsigned int count, void *data);
+
+int idxd_mdev_get_pasid(struct mdev_device *mdev)
+{
+ struct iommu_domain *domain;
+ struct device *dev = mdev_dev(mdev);
+
+ domain = iommu_get_domain_for_dev(dev);
+ if (!domain)
+ return -ENODEV;
+
+ return iommu_aux_get_pasid(domain, dev->parent);
+}
+
+static inline void reset_vconfig(struct vdcm_idxd *vidxd)
+{
+ memset(vidxd->cfg, 0, VIDXD_MAX_CFG_SPACE_SZ);
+ memcpy(vidxd->cfg, idxd_pci_config, sizeof(idxd_pci_config));
+}
+
+static inline void reset_vmmio(struct vdcm_idxd *vidxd)
+{
+ memset(&vidxd->bar0, 0, VIDXD_MAX_MMIO_SPACE_SZ);
+}
+
+static void idxd_vdcm_init(struct vdcm_idxd *vidxd)
+{
+ struct idxd_wq *wq = vidxd->wq;
+
+ reset_vconfig(vidxd);
+ reset_vmmio(vidxd);
+
+ vidxd->bar_size[0] = VIDXD_BAR0_SIZE;
+ vidxd->bar_size[1] = VIDXD_BAR2_SIZE;
+
+ vidxd_mmio_init(vidxd);
+
+ if (wq_dedicated(wq) && wq->state == IDXD_WQ_ENABLED)
+ idxd_wq_disable(wq);
+}
+
+static void idxd_vdcm_release(struct mdev_device *mdev)
+{
+ struct vdcm_idxd *vidxd = mdev_get_drvdata(mdev);
+ struct device *dev = mdev_dev(mdev);
+
+ dev_dbg(dev, "vdcm_idxd_release %d\n", vidxd->type->type);
+ mutex_lock(&vidxd->dev_lock);
+ if (!vidxd->refcount)
+ goto out;
+
+ idxd_vdcm_set_irqs(vidxd, VFIO_IRQ_SET_DATA_NONE | VFIO_IRQ_SET_ACTION_TRIGGER,
+ VFIO_PCI_MSIX_IRQ_INDEX, 0, 0, NULL);
+
+ vidxd_free_ims_entries(vidxd);
+
+ /* Re-initialize the VIDXD to a pristine state for re-use */
+ idxd_vdcm_init(vidxd);
+ vidxd->refcount--;
+
+ out:
+ mutex_unlock(&vidxd->dev_lock);
+}
+
+static struct vdcm_idxd *vdcm_vidxd_create(struct idxd_device *idxd, struct mdev_device *mdev,
+ struct vdcm_idxd_type *type)
+{
+ struct vdcm_idxd *vidxd;
+ struct idxd_wq *wq = NULL;
+
+ /* PLACEHOLDER, wq matching comes later */
+
+ if (!wq)
+ return ERR_PTR(-ENODEV);
+
+ vidxd = kzalloc(sizeof(*vidxd), GFP_KERNEL);
+ if (!vidxd)
+ return ERR_PTR(-ENOMEM);
+
+ mutex_init(&vidxd->dev_lock);
+ vidxd->idxd = idxd;
+ vidxd->vdev.mdev = mdev;
+ vidxd->wq = wq;
+ mdev_set_drvdata(mdev, vidxd);
+ vidxd->type = type;
+ vidxd->num_wqs = VIDXD_MAX_WQS;
+
+ idxd_vdcm_init(vidxd);
+ mutex_lock(&wq->wq_lock);
+ idxd_wq_get(wq);
+ mutex_unlock(&wq->wq_lock);
+
+ return vidxd;
+}
+
+static struct vdcm_idxd_type idxd_mdev_types[IDXD_MDEV_TYPES];
+
+static struct vdcm_idxd_type *idxd_vdcm_find_vidxd_type(struct device *dev,
+ const char *name)
+{
+ int i;
+ char dev_name[IDXD_MDEV_NAME_LEN];
+
+ for (i = 0; i < IDXD_MDEV_TYPES; i++) {
+ snprintf(dev_name, IDXD_MDEV_NAME_LEN, "idxd-%s",
+ idxd_mdev_types[i].name);
+
+ if (!strncmp(name, dev_name, IDXD_MDEV_NAME_LEN))
+ return &idxd_mdev_types[i];
+ }
+
+ return NULL;
+}
+
+static int idxd_vdcm_create(struct kobject *kobj, struct mdev_device *mdev)
+{
+ struct vdcm_idxd *vidxd;
+ struct vdcm_idxd_type *type;
+ struct device *dev, *parent;
+ struct idxd_device *idxd;
+ struct idxd_wq *wq;
+
+ parent = mdev_parent_dev(mdev);
+ idxd = dev_get_drvdata(parent);
+ dev = mdev_dev(mdev);
+
+ type = idxd_vdcm_find_vidxd_type(dev, kobject_name(kobj));
+ if (!type) {
+ dev_err(dev, "failed to find type %s to create\n",
+ kobject_name(kobj));
+ return -EINVAL;
+ }
+
+ vidxd = vdcm_vidxd_create(idxd, mdev, type);
+ if (IS_ERR(vidxd)) {
+ dev_err(dev, "failed to create vidxd: %ld\n", PTR_ERR(vidxd));
+ return PTR_ERR(vidxd);
+ }
+
+ wq = vidxd->wq;
+ mutex_lock(&wq->wq_lock);
+ list_add(&vidxd->list, &wq->vdcm_list);
+ mutex_unlock(&wq->wq_lock);
+ dev_dbg(dev, "mdev creation success: %s\n", dev_name(mdev_dev(mdev)));
+
+ return 0;
+}
+
+static int idxd_vdcm_remove(struct mdev_device *mdev)
+{
+ struct vdcm_idxd *vidxd = mdev_get_drvdata(mdev);
+ struct idxd_device *idxd = vidxd->idxd;
+ struct device *dev = &idxd->pdev->dev;
+ struct idxd_wq *wq = vidxd->wq;
+
+ dev_dbg(dev, "%s: removing for wq %d\n", __func__, vidxd->wq->id);
+
+ mutex_lock(&wq->wq_lock);
+ list_del(&vidxd->list);
+ idxd_wq_put(wq);
+ mutex_unlock(&wq->wq_lock);
+
+ kfree(vidxd);
+ return 0;
+}
+
+static int idxd_vdcm_open(struct mdev_device *mdev)
+{
+ struct vdcm_idxd *vidxd = mdev_get_drvdata(mdev);
+ int rc;
+ struct vdcm_idxd_type *type = vidxd->type;
+ struct device *dev = mdev_dev(mdev);
+
+ dev_dbg(dev, "%s: type: %d\n", __func__, type->type);
+
+ mutex_lock(&vidxd->dev_lock);
+ if (vidxd->refcount)
+ goto out;
+
+ /* allocate and setup IMS entries */
+ rc = vidxd_setup_ims_entries(vidxd);
+ if (rc < 0)
+ goto out;
+
+ vidxd->refcount++;
+ mutex_unlock(&vidxd->dev_lock);
+
+ return rc;
+
+ out:
+ mutex_unlock(&vidxd->dev_lock);
+ return rc;
+}
+
+static ssize_t idxd_vdcm_rw(struct mdev_device *mdev, char *buf, size_t count, loff_t *ppos,
+ enum idxd_vdcm_rw mode)
+{
+ struct vdcm_idxd *vidxd = mdev_get_drvdata(mdev);
+ unsigned int index = VFIO_PCI_OFFSET_TO_INDEX(*ppos);
+ u64 pos = *ppos & VFIO_PCI_OFFSET_MASK;
+ struct device *dev = mdev_dev(mdev);
+ int rc = -EINVAL;
+
+ if (index >= VFIO_PCI_NUM_REGIONS) {
+ dev_err(dev, "invalid index: %u\n", index);
+ return -EINVAL;
+ }
+
+ switch (index) {
+ case VFIO_PCI_CONFIG_REGION_INDEX:
+ if (mode == IDXD_VDCM_WRITE)
+ rc = vidxd_cfg_write(vidxd, pos, buf, count);
+ else
+ rc = vidxd_cfg_read(vidxd, pos, buf, count);
+ break;
+ case VFIO_PCI_BAR0_REGION_INDEX:
+ case VFIO_PCI_BAR1_REGION_INDEX:
+ if (mode == IDXD_VDCM_WRITE)
+ rc = vidxd_mmio_write(vidxd, vidxd->bar_val[0] + pos, buf, count);
+ else
+ rc = vidxd_mmio_read(vidxd, vidxd->bar_val[0] + pos, buf, count);
+ break;
+ case VFIO_PCI_BAR2_REGION_INDEX:
+ case VFIO_PCI_BAR3_REGION_INDEX:
+ case VFIO_PCI_BAR4_REGION_INDEX:
+ case VFIO_PCI_BAR5_REGION_INDEX:
+ case VFIO_PCI_VGA_REGION_INDEX:
+ case VFIO_PCI_ROM_REGION_INDEX:
+ default:
+ dev_err(dev, "unsupported region: %u\n", index);
+ }
+
+ return rc == 0 ? count : rc;
+}
+
+static ssize_t idxd_vdcm_read(struct mdev_device *mdev, char __user *buf, size_t count,
+ loff_t *ppos)
+{
+ struct vdcm_idxd *vidxd = mdev_get_drvdata(mdev);
+ unsigned int done = 0;
+ int rc;
+
+ mutex_lock(&vidxd->dev_lock);
+ while (count) {
+ size_t filled;
+
+ if (count >= 4 && !(*ppos % 4)) {
+ u32 val;
+
+ rc = idxd_vdcm_rw(mdev, (char *)&val, sizeof(val),
+ ppos, IDXD_VDCM_READ);
+ if (rc <= 0)
+ goto read_err;
+
+ if (copy_to_user(buf, &val, sizeof(val)))
+ goto read_err;
+
+ filled = 4;
+ } else if (count >= 2 && !(*ppos % 2)) {
+ u16 val;
+
+ rc = idxd_vdcm_rw(mdev, (char *)&val, sizeof(val),
+ ppos, IDXD_VDCM_READ);
+ if (rc <= 0)
+ goto read_err;
+
+ if (copy_to_user(buf, &val, sizeof(val)))
+ goto read_err;
+
+ filled = 2;
+ } else {
+ u8 val;
+
+ rc = idxd_vdcm_rw(mdev, &val, sizeof(val), ppos,
+ IDXD_VDCM_READ);
+ if (rc <= 0)
+ goto read_err;
+
+ if (copy_to_user(buf, &val, sizeof(val)))
+ goto read_err;
+
+ filled = 1;
+ }
+
+ count -= filled;
+ done += filled;
+ *ppos += filled;
+ buf += filled;
+ }
+
+ mutex_unlock(&vidxd->dev_lock);
+ return done;
+
+ read_err:
+ mutex_unlock(&vidxd->dev_lock);
+ return -EFAULT;
+}
+
+static ssize_t idxd_vdcm_write(struct mdev_device *mdev, const char __user *buf, size_t count,
+ loff_t *ppos)
+{
+ struct vdcm_idxd *vidxd = mdev_get_drvdata(mdev);
+ unsigned int done = 0;
+ int rc;
+
+ mutex_lock(&vidxd->dev_lock);
+ while (count) {
+ size_t filled;
+
+ if (count >= 4 && !(*ppos % 4)) {
+ u32 val;
+
+ if (copy_from_user(&val, buf, sizeof(val)))
+ goto write_err;
+
+ rc = idxd_vdcm_rw(mdev, (char *)&val, sizeof(val),
+ ppos, IDXD_VDCM_WRITE);
+ if (rc <= 0)
+ goto write_err;
+
+ filled = 4;
+ } else if (count >= 2 && !(*ppos % 2)) {
+ u16 val;
+
+ if (copy_from_user(&val, buf, sizeof(val)))
+ goto write_err;
+
+ rc = idxd_vdcm_rw(mdev, (char *)&val,
+ sizeof(val), ppos, IDXD_VDCM_WRITE);
+ if (rc <= 0)
+ goto write_err;
+
+ filled = 2;
+ } else {
+ u8 val;
+
+ if (copy_from_user(&val, buf, sizeof(val)))
+ goto write_err;
+
+ rc = idxd_vdcm_rw(mdev, &val, sizeof(val),
+ ppos, IDXD_VDCM_WRITE);
+ if (rc <= 0)
+ goto write_err;
+
+ filled = 1;
+ }
+
+ count -= filled;
+ done += filled;
+ *ppos += filled;
+ buf += filled;
+ }
+
+ mutex_unlock(&vidxd->dev_lock);
+ return done;
+
+write_err:
+ mutex_unlock(&vidxd->dev_lock);
+ return -EFAULT;
+}
+
+static int check_vma(struct idxd_wq *wq, struct vm_area_struct *vma)
+{
+ if (vma->vm_end < vma->vm_start)
+ return -EINVAL;
+ if (!(vma->vm_flags & VM_SHARED))
+ return -EINVAL;
+
+ return 0;
+}
+
+static int idxd_vdcm_mmap(struct mdev_device *mdev, struct vm_area_struct *vma)
+{
+ unsigned int wq_idx, rc;
+ unsigned long req_size, pgoff = 0, offset;
+ pgprot_t pg_prot;
+ struct vdcm_idxd *vidxd = mdev_get_drvdata(mdev);
+ struct idxd_wq *wq = vidxd->wq;
+ struct idxd_device *idxd = vidxd->idxd;
+ enum idxd_portal_prot virt_portal, phys_portal;
+ phys_addr_t base = pci_resource_start(idxd->pdev, IDXD_WQ_BAR);
+ struct device *dev = mdev_dev(mdev);
+
+ rc = check_vma(wq, vma);
+ if (rc)
+ return rc;
+
+ pg_prot = vma->vm_page_prot;
+ req_size = vma->vm_end - vma->vm_start;
+ vma->vm_flags |= VM_DONTCOPY;
+
+ offset = (vma->vm_pgoff << PAGE_SHIFT) &
+ ((1ULL << VFIO_PCI_OFFSET_SHIFT) - 1);
+
+ wq_idx = offset >> (PAGE_SHIFT + 2);
+ if (wq_idx >= 1) {
+ dev_err(dev, "mapping invalid wq %d off %lx\n",
+ wq_idx, offset);
+ return -EINVAL;
+ }
+
+ /*
+ * Check and see if the guest wants to map to the limited or unlimited portal.
+ * The driver will allow mapping to the unlimited portal only if the wq is a
+ * dedicated wq. Otherwise, it goes to limited.
+ */
+ virt_portal = ((offset >> PAGE_SHIFT) & 0x3) == 1;
+ phys_portal = IDXD_PORTAL_LIMITED;
+ if (virt_portal == IDXD_PORTAL_UNLIMITED && wq_dedicated(wq))
+ phys_portal = IDXD_PORTAL_UNLIMITED;
+
+ /* We always map IMS portals to the guest */
+ pgoff = (base + idxd_get_wq_portal_full_offset(wq->id, phys_portal,
+ IDXD_IRQ_IMS)) >> PAGE_SHIFT;
+
+ dev_dbg(dev, "mmap %lx %lx %lx %lx\n", vma->vm_start, pgoff, req_size,
+ pgprot_val(pg_prot));
+ vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
+ vma->vm_private_data = mdev;
+ vma->vm_pgoff = pgoff;
+
+ return remap_pfn_range(vma, vma->vm_start, pgoff, req_size, pg_prot);
+}
+
+static int idxd_vdcm_get_irq_count(struct vdcm_idxd *vidxd, int type)
+{
+ /*
+ * Even though the number of MSIX vectors supported is not tied to the number
+ * of wqs being exported, the current design allows 1 vector per wq for the
+ * guest. So we end up with the number of wqs plus 1 vector that handles the
+ * misc interrupts.
+ */
+ if (type == VFIO_PCI_MSI_IRQ_INDEX || type == VFIO_PCI_MSIX_IRQ_INDEX)
+ return VIDXD_MAX_MSIX_VECS;
+
+ return 0;
+}
+
+static irqreturn_t idxd_guest_wq_completion(int irq, void *data)
+{
+ struct ims_irq_entry *irq_entry = data;
+
+ /*
+ * WQ irq_entry 0 is actually MSIX vector 1 for guest. MSIX vector 0
+ * is emulated.
+ */
+ vidxd_send_interrupt(irq_entry->vidxd, irq_entry->id + 1);
+ return IRQ_HANDLED;
+}
+
+static int msix_trigger_unregister(struct vdcm_idxd *vidxd, int index)
+{
+ struct mdev_device *mdev = vidxd->vdev.mdev;
+ struct device *dev = mdev_dev(mdev);
+ struct ims_irq_entry *irq_entry;
+ int rc;
+
+ if (!vidxd->vdev.msix_trigger[index])
+ return 0;
+
+ dev_dbg(dev, "disable MSIX trigger %d\n", index);
+ if (index) {
+ u32 auxval;
+
+ irq_entry = &vidxd->irq_entries[index - 1];
+ if (irq_entry->irq_set) {
+ free_irq(irq_entry->entry->irq, irq_entry);
+ irq_entry->irq_set = false;
+ }
+
+ auxval = ims_ctrl_pasid_aux(0, false);
+ rc = irq_set_auxdata(irq_entry->entry->irq, IMS_AUXDATA_CONTROL_WORD, auxval);
+ if (rc)
+ return rc;
+ }
+ eventfd_ctx_put(vidxd->vdev.msix_trigger[index]);
+ vidxd->vdev.msix_trigger[index] = NULL;
+
+ return 0;
+}
+
+static int msix_trigger_register(struct vdcm_idxd *vidxd, u32 fd, int index)
+{
+ struct mdev_device *mdev = vidxd->vdev.mdev;
+ struct device *dev = mdev_dev(mdev);
+ struct ims_irq_entry *irq_entry;
+ struct eventfd_ctx *trigger;
+ int rc;
+
+ rc = msix_trigger_unregister(vidxd, index);
+ if (rc < 0)
+ return rc;
+
+ dev_dbg(dev, "enable MSIX trigger %d\n", index);
+ trigger = eventfd_ctx_fdget(fd);
+ if (IS_ERR(trigger)) {
+ dev_warn(dev, "eventfd_ctx_fdget failed %d\n", index);
+ return PTR_ERR(trigger);
+ }
+
+ /*
+ * The MSIX vector 0 is emulated by the mdev. Starting with vector 1
+ * the interrupt is backed by IMS and needs to be set up, but we
+ * will be setting up entry 0 of the IMS vectors. So here we pass
+ * in index - 1 to the host setup and irq_entries.
+ */
+ if (index) {
+ int pasid;
+ u32 auxval;
+
+ irq_entry = &vidxd->irq_entries[index - 1];
+ pasid = idxd_mdev_get_pasid(mdev);
+ if (pasid < 0)
+ return pasid;
+
+ /*
+ * Program and enable the pasid field in the IMS entry. The programmed pasid and
+ * enabled field is checked against the pasid and enable field for the work queue
+ * configuration and the pasid for the descriptor. A mismatch will result in blocked
+ * IMS interrupt.
+ */
+ auxval = ims_ctrl_pasid_aux(pasid, true);
+ rc = irq_set_auxdata(irq_entry->entry->irq, IMS_AUXDATA_CONTROL_WORD, auxval);
+ if (rc < 0)
+ return rc;
+
+ rc = request_irq(irq_entry->entry->irq, idxd_guest_wq_completion, 0, "idxd-ims",
+ irq_entry);
+ if (rc) {
+ dev_warn(dev, "failed to request ims irq\n");
+ eventfd_ctx_put(trigger);
+ auxval = ims_ctrl_pasid_aux(0, false);
+ irq_set_auxdata(irq_entry->entry->irq, IMS_AUXDATA_CONTROL_WORD, auxval);
+ return rc;
+ }
+ irq_entry->irq_set = true;
+ }
+
+ vidxd->vdev.msix_trigger[index] = trigger;
+ return 0;
+}
+
+static int vdcm_idxd_set_msix_trigger(struct vdcm_idxd *vidxd,
+ unsigned int index, unsigned int start,
+ unsigned int count, uint32_t flags,
+ void *data)
+{
+ int i, rc = 0;
+
+ if (count > VIDXD_MAX_MSIX_ENTRIES - 1)
+ count = VIDXD_MAX_MSIX_ENTRIES - 1;
+
+ /*
+ * The MSIX vector 0 is emulated by the mdev. Starting with vector 1
+ * the interrupt is backed by IMS and needs to be set up, but we
+ * will be setting up entry 0 of the IMS vectors. The index - 1
+ * mapping to the host setup and irq_entries happens in msix_trigger_register().
+ */
+ if (count == 0 && (flags & VFIO_IRQ_SET_DATA_NONE)) {
+ /* Disable all MSIX entries */
+ for (i = 0; i < VIDXD_MAX_MSIX_ENTRIES; i++) {
+ rc = msix_trigger_unregister(vidxd, i);
+ if (rc < 0)
+ return rc;
+ }
+ return 0;
+ }
+
+ for (i = 0; i < count; i++) {
+ if (flags & VFIO_IRQ_SET_DATA_EVENTFD) {
+ u32 fd = *(u32 *)(data + i * sizeof(u32));
+
+ rc = msix_trigger_register(vidxd, fd, i);
+ if (rc < 0)
+ return rc;
+ } else if (flags & VFIO_IRQ_SET_DATA_NONE) {
+ rc = msix_trigger_unregister(vidxd, i);
+ if (rc < 0)
+ return rc;
+ }
+ }
+ return rc;
+}
+
+static int idxd_vdcm_set_irqs(struct vdcm_idxd *vidxd, uint32_t flags,
+ unsigned int index, unsigned int start,
+ unsigned int count, void *data)
+{
+ int (*func)(struct vdcm_idxd *vidxd, unsigned int index,
+ unsigned int start, unsigned int count, uint32_t flags,
+ void *data) = NULL;
+ struct mdev_device *mdev = vidxd->vdev.mdev;
+ struct device *dev = mdev_dev(mdev);
+
+ switch (index) {
+ case VFIO_PCI_INTX_IRQ_INDEX:
+ dev_warn(dev, "intx interrupts not supported.\n");
+ break;
+ case VFIO_PCI_MSI_IRQ_INDEX:
+ dev_dbg(dev, "msi interrupt.\n");
+ switch (flags & VFIO_IRQ_SET_ACTION_TYPE_MASK) {
+ case VFIO_IRQ_SET_ACTION_MASK:
+ case VFIO_IRQ_SET_ACTION_UNMASK:
+ break;
+ case VFIO_IRQ_SET_ACTION_TRIGGER:
+ func = vdcm_idxd_set_msix_trigger;
+ break;
+ }
+ break;
+ case VFIO_PCI_MSIX_IRQ_INDEX:
+ switch (flags & VFIO_IRQ_SET_ACTION_TYPE_MASK) {
+ case VFIO_IRQ_SET_ACTION_MASK:
+ case VFIO_IRQ_SET_ACTION_UNMASK:
+ break;
+ case VFIO_IRQ_SET_ACTION_TRIGGER:
+ func = vdcm_idxd_set_msix_trigger;
+ break;
+ }
+ break;
+ default:
+ return -ENOTTY;
+ }
+
+ if (!func)
+ return -ENOTTY;
+
+ return func(vidxd, index, start, count, flags, data);
+}
+
+static void vidxd_vdcm_reset(struct vdcm_idxd *vidxd)
+{
+ vidxd_reset(vidxd);
+}
+
+static long idxd_vdcm_ioctl(struct mdev_device *mdev, unsigned int cmd,
+ unsigned long arg)
+{
+ struct vdcm_idxd *vidxd = mdev_get_drvdata(mdev);
+ unsigned long minsz;
+ int rc = -EINVAL;
+ struct device *dev = mdev_dev(mdev);
+
+ dev_dbg(dev, "vidxd %p ioctl, cmd: %d\n", vidxd, cmd);
+
+ mutex_lock(&vidxd->dev_lock);
+ if (cmd == VFIO_DEVICE_GET_INFO) {
+ struct vfio_device_info info;
+
+ minsz = offsetofend(struct vfio_device_info, num_irqs);
+
+ if (copy_from_user(&info, (void __user *)arg, minsz)) {
+ rc = -EFAULT;
+ goto out;
+ }
+
+ if (info.argsz < minsz) {
+ rc = -EINVAL;
+ goto out;
+ }
+
+ info.flags = VFIO_DEVICE_FLAGS_PCI;
+ info.flags |= VFIO_DEVICE_FLAGS_RESET;
+ info.num_regions = VFIO_PCI_NUM_REGIONS;
+ info.num_irqs = VFIO_PCI_NUM_IRQS;
+
+ if (copy_to_user((void __user *)arg, &info, minsz))
+ rc = -EFAULT;
+ else
+ rc = 0;
+ goto out;
+ } else if (cmd == VFIO_DEVICE_GET_REGION_INFO) {
+ struct vfio_region_info info;
+ struct vfio_info_cap caps = { .buf = NULL, .size = 0 };
+ struct vfio_region_info_cap_sparse_mmap *sparse = NULL;
+ size_t size;
+ int nr_areas = 1;
+ int cap_type_id = 0;
+
+ minsz = offsetofend(struct vfio_region_info, offset);
+
+ if (copy_from_user(&info, (void __user *)arg, minsz)) {
+ rc = -EFAULT;
+ goto out;
+ }
+
+ if (info.argsz < minsz) {
+ rc = -EINVAL;
+ goto out;
+ }
+
+ switch (info.index) {
+ case VFIO_PCI_CONFIG_REGION_INDEX:
+ info.offset = VFIO_PCI_INDEX_TO_OFFSET(info.index);
+ info.size = VIDXD_MAX_CFG_SPACE_SZ;
+ info.flags = VFIO_REGION_INFO_FLAG_READ | VFIO_REGION_INFO_FLAG_WRITE;
+ break;
+ case VFIO_PCI_BAR0_REGION_INDEX:
+ info.offset = VFIO_PCI_INDEX_TO_OFFSET(info.index);
+ info.size = vidxd->bar_size[info.index];
+ if (!info.size) {
+ info.flags = 0;
+ break;
+ }
+
+ info.flags = VFIO_REGION_INFO_FLAG_READ | VFIO_REGION_INFO_FLAG_WRITE;
+ break;
+ case VFIO_PCI_BAR1_REGION_INDEX:
+ info.offset = VFIO_PCI_INDEX_TO_OFFSET(info.index);
+ info.size = 0;
+ info.flags = 0;
+ break;
+ case VFIO_PCI_BAR2_REGION_INDEX:
+ info.offset = VFIO_PCI_INDEX_TO_OFFSET(info.index);
+ info.flags = VFIO_REGION_INFO_FLAG_CAPS | VFIO_REGION_INFO_FLAG_MMAP |
+ VFIO_REGION_INFO_FLAG_READ | VFIO_REGION_INFO_FLAG_WRITE;
+ info.size = vidxd->bar_size[1];
+
+ /*
+ * Every WQ has two areas for unlimited and limited
+ * MSI-X portals. IMS portals are not reported
+ */
+ nr_areas = 2;
+
+ size = sizeof(*sparse) + (nr_areas * sizeof(*sparse->areas));
+ sparse = kzalloc(size, GFP_KERNEL);
+ if (!sparse) {
+ rc = -ENOMEM;
+ goto out;
+ }
+
+ sparse->header.id = VFIO_REGION_INFO_CAP_SPARSE_MMAP;
+ sparse->header.version = 1;
+ sparse->nr_areas = nr_areas;
+ cap_type_id = VFIO_REGION_INFO_CAP_SPARSE_MMAP;
+
+ sparse->areas[0].offset = 0;
+ sparse->areas[0].size = PAGE_SIZE;
+
+ sparse->areas[1].offset = PAGE_SIZE;
+ sparse->areas[1].size = PAGE_SIZE;
+ break;
+
+ case VFIO_PCI_BAR3_REGION_INDEX ... VFIO_PCI_BAR5_REGION_INDEX:
+ info.offset = VFIO_PCI_INDEX_TO_OFFSET(info.index);
+ info.size = 0;
+ info.flags = 0;
+ dev_dbg(dev, "get region info bar:%d\n", info.index);
+ break;
+
+ case VFIO_PCI_ROM_REGION_INDEX:
+ case VFIO_PCI_VGA_REGION_INDEX:
+ dev_dbg(dev, "get region info index:%d\n", info.index);
+ break;
+ default: {
+ if (info.index >= VFIO_PCI_NUM_REGIONS)
+ rc = -EINVAL;
+ else
+ rc = 0;
+ goto out;
+ } /* default */
+ } /* info.index switch */
+
+ if ((info.flags & VFIO_REGION_INFO_FLAG_CAPS) && sparse) {
+ if (cap_type_id == VFIO_REGION_INFO_CAP_SPARSE_MMAP) {
+ rc = vfio_info_add_capability(&caps, &sparse->header,
+ sizeof(*sparse) + (sparse->nr_areas *
+ sizeof(*sparse->areas)));
+ kfree(sparse);
+ if (rc)
+ goto out;
+ }
+ }
+
+ if (caps.size) {
+ if (info.argsz < sizeof(info) + caps.size) {
+ info.argsz = sizeof(info) + caps.size;
+ info.cap_offset = 0;
+ } else {
+ vfio_info_cap_shift(&caps, sizeof(info));
+ if (copy_to_user((void __user *)arg + sizeof(info),
+ caps.buf, caps.size)) {
+ kfree(caps.buf);
+ rc = -EFAULT;
+ goto out;
+ }
+ info.cap_offset = sizeof(info);
+ }
+
+ kfree(caps.buf);
+ }
+ if (copy_to_user((void __user *)arg, &info, minsz))
+ rc = -EFAULT;
+ else
+ rc = 0;
+ goto out;
+ } else if (cmd == VFIO_DEVICE_GET_IRQ_INFO) {
+ struct vfio_irq_info info;
+
+ minsz = offsetofend(struct vfio_irq_info, count);
+
+ if (copy_from_user(&info, (void __user *)arg, minsz)) {
+ rc = -EFAULT;
+ goto out;
+ }
+
+ if (info.argsz < minsz || info.index >= VFIO_PCI_NUM_IRQS) {
+ rc = -EINVAL;
+ goto out;
+ }
+
+ switch (info.index) {
+ case VFIO_PCI_MSI_IRQ_INDEX:
+ case VFIO_PCI_MSIX_IRQ_INDEX:
+ break;
+ default:
+ rc = -EINVAL;
+ goto out;
+ } /* switch(info.index) */
+
+ info.flags = VFIO_IRQ_INFO_EVENTFD | VFIO_IRQ_INFO_NORESIZE;
+ info.count = idxd_vdcm_get_irq_count(vidxd, info.index);
+
+ if (copy_to_user((void __user *)arg, &info, minsz))
+ rc = -EFAULT;
+ else
+ rc = 0;
+ goto out;
+ } else if (cmd == VFIO_DEVICE_SET_IRQS) {
+ struct vfio_irq_set hdr;
+ u8 *data = NULL;
+ size_t data_size = 0;
+
+ minsz = offsetofend(struct vfio_irq_set, count);
+
+ if (copy_from_user(&hdr, (void __user *)arg, minsz)) {
+ rc = -EFAULT;
+ goto out;
+ }
+
+ if (!(hdr.flags & VFIO_IRQ_SET_DATA_NONE)) {
+ int max = idxd_vdcm_get_irq_count(vidxd, hdr.index);
+
+ rc = vfio_set_irqs_validate_and_prepare(&hdr, max, VFIO_PCI_NUM_IRQS,
+ &data_size);
+ if (rc) {
+ dev_err(dev, "intel:vfio_set_irqs_validate_and_prepare failed\n");
+ rc = -EINVAL;
+ goto out;
+ }
+ if (data_size) {
+ data = memdup_user((void __user *)(arg + minsz), data_size);
+ if (IS_ERR(data)) {
+ rc = PTR_ERR(data);
+ goto out;
+ }
+ }
+ }
+
+ if (!data) {
+ rc = -EINVAL;
+ goto out;
+ }
+
+ rc = idxd_vdcm_set_irqs(vidxd, hdr.flags, hdr.index, hdr.start, hdr.count, data);
+ kfree(data);
+ goto out;
+ } else if (cmd == VFIO_DEVICE_RESET) {
+ vidxd_vdcm_reset(vidxd);
+ }
+
+ out:
+ mutex_unlock(&vidxd->dev_lock);
+ return rc;
+}
+
+static const struct mdev_parent_ops idxd_vdcm_ops = {
+ .create = idxd_vdcm_create,
+ .remove = idxd_vdcm_remove,
+ .open = idxd_vdcm_open,
+ .release = idxd_vdcm_release,
+ .read = idxd_vdcm_read,
+ .write = idxd_vdcm_write,
+ .mmap = idxd_vdcm_mmap,
+ .ioctl = idxd_vdcm_ioctl,
+};
+
+int idxd_mdev_host_init(struct idxd_device *idxd)
+{
+ struct device *dev = &idxd->pdev->dev;
+ int rc;
+
+ if (!test_bit(IDXD_FLAG_SIOV_SUPPORTED, &idxd->flags))
+ return -EOPNOTSUPP;
+
+ if (iommu_dev_has_feature(dev, IOMMU_DEV_FEAT_AUX)) {
+ rc = iommu_dev_enable_feature(dev, IOMMU_DEV_FEAT_AUX);
+ if (rc < 0) {
+ dev_warn(dev, "Failed to enable aux-domain: %d\n", rc);
+ return rc;
+ }
+ } else {
+ dev_warn(dev, "No aux-domain feature.\n");
+ return -EOPNOTSUPP;
+ }
+
+ return mdev_register_device(dev, &idxd_vdcm_ops);
+}
+
+void idxd_mdev_host_release(struct idxd_device *idxd)
+{
+ struct device *dev = &idxd->pdev->dev;
+ int rc;
+
+ mdev_unregister_device(dev);
+ if (iommu_dev_has_feature(dev, IOMMU_DEV_FEAT_AUX)) {
+ rc = iommu_dev_disable_feature(dev, IOMMU_DEV_FEAT_AUX);
+ if (rc < 0)
+ dev_warn(dev, "Failed to disable aux-domain: %d\n",
+ rc);
+ }
+}
diff --git a/drivers/dma/idxd/mdev.h b/drivers/dma/idxd/mdev.h
new file mode 100644
index 000000000000..b474f2303ba0
--- /dev/null
+++ b/drivers/dma/idxd/mdev.h
@@ -0,0 +1,115 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/* Copyright(c) 2019,2020 Intel Corporation. All rights rsvd. */
+
+#ifndef _IDXD_MDEV_H_
+#define _IDXD_MDEV_H_
+
+/* two 64-bit BARs implemented */
+#define VIDXD_MAX_BARS 2
+#define VIDXD_MAX_CFG_SPACE_SZ 4096
+#define VIDXD_MAX_MMIO_SPACE_SZ 8192
+#define VIDXD_MSIX_TBL_SZ_OFFSET 0x42
+#define VIDXD_CAP_CTRL_SZ 0x100
+#define VIDXD_GRP_CTRL_SZ 0x100
+#define VIDXD_WQ_CTRL_SZ 0x100
+#define VIDXD_WQ_OCPY_INT_SZ 0x20
+#define VIDXD_MSIX_TBL_SZ 0x90
+#define VIDXD_MSIX_PERM_TBL_SZ 0x48
+
+#define VIDXD_MSIX_TABLE_OFFSET 0x600
+#define VIDXD_MSIX_PERM_OFFSET 0x300
+#define VIDXD_GRPCFG_OFFSET 0x400
+#define VIDXD_WQCFG_OFFSET 0x500
+#define VIDXD_IMS_OFFSET 0x1000
+
+#define VIDXD_BAR0_SIZE 0x2000
+#define VIDXD_BAR2_SIZE 0x20000
+#define VIDXD_MAX_MSIX_ENTRIES (VIDXD_MSIX_TBL_SZ / 0x10)
+#define VIDXD_MAX_WQS 1
+#define VIDXD_MAX_MSIX_VECS 2
+
+#define VIDXD_ATS_OFFSET 0x100
+#define VIDXD_PRS_OFFSET 0x110
+#define VIDXD_PASID_OFFSET 0x120
+#define VIDXD_MSIX_PBA_OFFSET 0x700
+
+struct ims_irq_entry {
+ struct vdcm_idxd *vidxd;
+ struct msi_desc *entry;
+ bool irq_set;
+ int id;
+};
+
+struct idxd_vdev {
+ struct mdev_device *mdev;
+ struct eventfd_ctx *msix_trigger[VIDXD_MAX_MSIX_ENTRIES];
+};
+
+struct vdcm_idxd {
+ struct idxd_device *idxd;
+ struct idxd_wq *wq;
+ struct idxd_vdev vdev;
+ struct vdcm_idxd_type *type;
+ int num_wqs;
+ struct ims_irq_entry irq_entries[VIDXD_MAX_MSIX_ENTRIES];
+
+ /* For VM use case */
+ u64 bar_val[VIDXD_MAX_BARS];
+ u64 bar_size[VIDXD_MAX_BARS];
+ u8 cfg[VIDXD_MAX_CFG_SPACE_SZ];
+ u8 bar0[VIDXD_MAX_MMIO_SPACE_SZ];
+ struct list_head list;
+ struct mutex dev_lock; /* lock for vidxd resources */
+
+ int refcount;
+};
+
+static inline struct vdcm_idxd *to_vidxd(struct idxd_vdev *vdev)
+{
+ return container_of(vdev, struct vdcm_idxd, vdev);
+}
+
+#define IDXD_MDEV_NAME_LEN 16
+#define IDXD_MDEV_DESCRIPTION_LEN 64
+
+enum idxd_mdev_type {
+ IDXD_MDEV_TYPE_1_DWQ = 0,
+};
+
+#define IDXD_MDEV_TYPES 1
+
+struct vdcm_idxd_type {
+ char name[IDXD_MDEV_NAME_LEN];
+ char description[IDXD_MDEV_DESCRIPTION_LEN];
+ enum idxd_mdev_type type;
+ unsigned int avail_instance;
+};
+
+enum idxd_vdcm_rw {
+ IDXD_VDCM_READ = 0,
+ IDXD_VDCM_WRITE,
+};
+
+static inline u64 get_reg_val(void *buf, int size)
+{
+ u64 val = 0;
+
+ switch (size) {
+ case 8:
+ val = *(u64 *)buf;
+ break;
+ case 4:
+ val = *(u32 *)buf;
+ break;
+ case 2:
+ val = *(u16 *)buf;
+ break;
+ case 1:
+ val = *(u8 *)buf;
+ break;
+ }
+
+ return val;
+}
+
+#endif
diff --git a/drivers/dma/idxd/vdev.c b/drivers/dma/idxd/vdev.c
new file mode 100644
index 000000000000..6cc097edc6e9
--- /dev/null
+++ b/drivers/dma/idxd/vdev.c
@@ -0,0 +1,75 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright(c) 2019,2020 Intel Corporation. All rights rsvd. */
+#include <linux/init.h>
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/pci.h>
+#include <linux/device.h>
+#include <linux/sched/task.h>
+#include <linux/io-64-nonatomic-lo-hi.h>
+#include <linux/mm.h>
+#include <linux/mmu_context.h>
+#include <linux/vfio.h>
+#include <linux/mdev.h>
+#include <linux/msi.h>
+#include <linux/intel-iommu.h>
+#include <linux/intel-svm.h>
+#include <linux/kvm_host.h>
+#include <linux/eventfd.h>
+#include <uapi/linux/idxd.h>
+#include "registers.h"
+#include "idxd.h"
+#include "../../vfio/pci/vfio_pci_private.h"
+#include "mdev.h"
+#include "vdev.h"
+
+int vidxd_send_interrupt(struct vdcm_idxd *vidxd, int msix_idx)
+{
+ /* PLACEHOLDER */
+ return 0;
+}
+
+int vidxd_mmio_read(struct vdcm_idxd *vidxd, u64 pos, void *buf, unsigned int size)
+{
+ /* PLACEHOLDER */
+ return 0;
+}
+
+int vidxd_mmio_write(struct vdcm_idxd *vidxd, u64 pos, void *buf, unsigned int size)
+{
+ /* PLACEHOLDER */
+ return 0;
+}
+
+int vidxd_cfg_read(struct vdcm_idxd *vidxd, unsigned int pos, void *buf, unsigned int count)
+{
+ /* PLACEHOLDER */
+ return 0;
+}
+
+int vidxd_cfg_write(struct vdcm_idxd *vidxd, unsigned int pos, void *buf, unsigned int size)
+{
+ /* PLACEHOLDER */
+ return 0;
+}
+
+void vidxd_mmio_init(struct vdcm_idxd *vidxd)
+{
+ /* PLACEHOLDER */
+}
+
+void vidxd_reset(struct vdcm_idxd *vidxd)
+{
+ /* PLACEHOLDER */
+}
+
+int vidxd_setup_ims_entries(struct vdcm_idxd *vidxd)
+{
+ /* PLACEHOLDER */
+ return 0;
+}
+
+void vidxd_free_ims_entries(struct vdcm_idxd *vidxd)
+{
+ /* PLACEHOLDER */
+}
diff --git a/drivers/dma/idxd/vdev.h b/drivers/dma/idxd/vdev.h
new file mode 100644
index 000000000000..baa30d98f9cb
--- /dev/null
+++ b/drivers/dma/idxd/vdev.h
@@ -0,0 +1,19 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/* Copyright(c) 2019,2020 Intel Corporation. All rights rsvd. */
+
+#ifndef _IDXD_VDEV_H_
+#define _IDXD_VDEV_H_
+
+#include "mdev.h"
+
+int vidxd_mmio_read(struct vdcm_idxd *vidxd, u64 pos, void *buf, unsigned int size);
+int vidxd_mmio_write(struct vdcm_idxd *vidxd, u64 pos, void *buf, unsigned int size);
+int vidxd_cfg_read(struct vdcm_idxd *vidxd, unsigned int pos, void *buf, unsigned int count);
+int vidxd_cfg_write(struct vdcm_idxd *vidxd, unsigned int pos, void *buf, unsigned int size);
+void vidxd_mmio_init(struct vdcm_idxd *vidxd);
+void vidxd_reset(struct vdcm_idxd *vidxd);
+int vidxd_send_interrupt(struct vdcm_idxd *vidxd, int msix_idx);
+int vidxd_setup_ims_entries(struct vdcm_idxd *vidxd);
+void vidxd_free_ims_entries(struct vdcm_idxd *vidxd);
+
+#endif
When a dedicated wq is enabled as an mdev, the wq must be disabled on the
device in order to program the pasid to the wq. Introduce a software-only
wq state, IDXD_WQ_LOCKED, to prevent the user from modifying the
configuration while the mdev wq is in this state. Because the wq is not in
the DISABLED state, configuration changes are rejected; because it is also
not in the ENABLED state, actions that are only permitted on an enabled wq
are rejected as well.
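As an illustration of the intent (not part of the patch): every existing
configuration-write check requires the wq to be in IDXD_WQ_DISABLED and every
runtime action requires IDXD_WQ_ENABLED, so a third software-only value is
refused by both sets of checks without adding new comparisons. A minimal
sketch, assuming the enum values from the idxd.h hunk below; the helper
names are hypothetical:

enum idxd_wq_state {
	IDXD_WQ_DISABLED = 0,
	IDXD_WQ_ENABLED,
	IDXD_WQ_LOCKED,		/* software-only state, never written to hardware */
};

/* Hypothetical helper: config attributes may only change while truly disabled. */
static bool example_wq_config_writable(struct idxd_wq *wq)
{
	return wq->state == IDXD_WQ_DISABLED;	/* LOCKED and ENABLED both refuse */
}

/* Hypothetical helper: runtime actions require an enabled wq. */
static bool example_wq_actions_allowed(struct idxd_wq *wq)
{
	return wq->state == IDXD_WQ_ENABLED;	/* LOCKED and DISABLED both refuse */
}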
Signed-off-by: Dave Jiang <[email protected]>
---
drivers/dma/idxd/idxd.h | 1 +
drivers/dma/idxd/mdev.c | 4 +++-
drivers/dma/idxd/sysfs.c | 2 ++
3 files changed, 6 insertions(+), 1 deletion(-)
diff --git a/drivers/dma/idxd/idxd.h b/drivers/dma/idxd/idxd.h
index 4e583fdd15d2..03275ad9e849 100644
--- a/drivers/dma/idxd/idxd.h
+++ b/drivers/dma/idxd/idxd.h
@@ -61,6 +61,7 @@ struct idxd_group {
enum idxd_wq_state {
IDXD_WQ_DISABLED = 0,
IDXD_WQ_ENABLED,
+ IDXD_WQ_LOCKED,
};
enum idxd_wq_flag {
diff --git a/drivers/dma/idxd/mdev.c b/drivers/dma/idxd/mdev.c
index 16b56f8f7fc1..3db7717a10c0 100644
--- a/drivers/dma/idxd/mdev.c
+++ b/drivers/dma/idxd/mdev.c
@@ -84,8 +84,10 @@ static void idxd_vdcm_init(struct vdcm_idxd *vidxd)
vidxd_mmio_init(vidxd);
- if (wq_dedicated(wq) && wq->state == IDXD_WQ_ENABLED)
+ if (wq_dedicated(wq) && wq->state == IDXD_WQ_ENABLED) {
idxd_wq_disable(wq, NULL);
+ wq->state = IDXD_WQ_LOCKED;
+ }
}
static void idxd_vdcm_release(struct mdev_device *mdev)
diff --git a/drivers/dma/idxd/sysfs.c b/drivers/dma/idxd/sysfs.c
index 5b79d9019f2e..3bbbd413980e 100644
--- a/drivers/dma/idxd/sysfs.c
+++ b/drivers/dma/idxd/sysfs.c
@@ -821,6 +821,8 @@ static ssize_t wq_state_show(struct device *dev,
return sprintf(buf, "disabled\n");
case IDXD_WQ_ENABLED:
return sprintf(buf, "enabled\n");
+ case IDXD_WQ_LOCKED:
+ return sprintf(buf, "locked\n");
}
return sprintf(buf, "unknown\n");
In preparation for VFIO mediated device support in the idxd driver, add
enabling for Interrupt Message Store (IMS) interrupts to the idxd driver.
With IMS support the idxd driver can dynamically allocate interrupts on a
per-mdev basis, based on how many IMS vectors are mapped to the mdev
device. This commit only provides the support functions in the base driver,
not their use by the VFIO mdev code.
The commit has some portal-related changes. A "portal" is a special
location within the MMIO BAR2 of the DSA device where descriptors are
submitted via the CPU command MOVDIR64B or ENQCMD(S). The offset for the
portal address determines whether the submitted descriptor is for MSI-X
or IMS notification.
See Intel SIOV spec for more details:
https://software.intel.com/en-us/download/intel-scalable-io-virtualization-technical-specification
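For reference, a minimal sketch (not part of the patch) of the portal
arithmetic described above; the enum numeric values are assumptions used only
for the worked numbers, while the formula mirrors
idxd_get_wq_portal_full_offset() in the idxd.h hunk of this patch:

#define EX_PAGE_SHIFT	12	/* assuming 4KB pages */

enum ex_portal_prot { EX_PORTAL_UNLIMITED = 0, EX_PORTAL_LIMITED = 1 };
enum ex_irq_type    { EX_IRQ_MSIX = 0, EX_IRQ_IMS = 1 };

static inline int ex_wq_portal_full_offset(int wq_id, int prot, int irq_type)
{
	/* each wq owns 4 pages of BAR2; prot selects +0x1000, irq type selects +0x2000 */
	return ((wq_id * 4) << EX_PAGE_SHIFT) + prot * 0x1000 + irq_type * 0x2000;
}

/*
 * wq 0, limited portal, MSI-X completion: 0x1000
 * wq 0, limited portal, IMS completion:   0x3000
 * wq 1, limited portal, IMS completion:   0x4000 + 0x3000 = 0x7000
 */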
Signed-off-by: Dave Jiang <[email protected]>
---
Documentation/ABI/stable/sysfs-driver-dma-idxd | 6 ++++++
drivers/dma/idxd/cdev.c | 4 ++--
drivers/dma/idxd/idxd.h | 13 +++++++++----
drivers/dma/idxd/init.c | 19 +++++++++++++++++++
drivers/dma/idxd/submit.c | 10 ++++++++--
drivers/dma/idxd/sysfs.c | 9 +++++++++
6 files changed, 53 insertions(+), 8 deletions(-)
diff --git a/Documentation/ABI/stable/sysfs-driver-dma-idxd b/Documentation/ABI/stable/sysfs-driver-dma-idxd
index 5ea81ffd3c1a..ed5aeecf7015 100644
--- a/Documentation/ABI/stable/sysfs-driver-dma-idxd
+++ b/Documentation/ABI/stable/sysfs-driver-dma-idxd
@@ -129,6 +129,12 @@ KernelVersion: 5.10.0
Contact: [email protected]
Description: The last executed device administrative command's status/error.
+What: /sys/bus/dsa/devices/dsa<m>/ims_size
+Date: Oct 15, 2020
+KernelVersion: 5.11.0
+Contact: [email protected]
+Description: The total number of vectors available for Interrupt Message Store.
+
What: /sys/bus/dsa/devices/wq<m>.<n>/block_on_fault
Date: Oct 27, 2020
KernelVersion: 5.11.0
diff --git a/drivers/dma/idxd/cdev.c b/drivers/dma/idxd/cdev.c
index 010b820d8f74..b774bf336347 100644
--- a/drivers/dma/idxd/cdev.c
+++ b/drivers/dma/idxd/cdev.c
@@ -204,8 +204,8 @@ static int idxd_cdev_mmap(struct file *filp, struct vm_area_struct *vma)
return rc;
vma->vm_flags |= VM_DONTCOPY;
- pfn = (base + idxd_get_wq_portal_full_offset(wq->id,
- IDXD_PORTAL_LIMITED)) >> PAGE_SHIFT;
+ pfn = (base + idxd_get_wq_portal_full_offset(wq->id, IDXD_PORTAL_LIMITED,
+ IDXD_IRQ_MSIX)) >> PAGE_SHIFT;
vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
vma->vm_private_data = ctx;
diff --git a/drivers/dma/idxd/idxd.h b/drivers/dma/idxd/idxd.h
index a506a16c83ee..549426bfb443 100644
--- a/drivers/dma/idxd/idxd.h
+++ b/drivers/dma/idxd/idxd.h
@@ -154,6 +154,7 @@ enum idxd_device_flag {
IDXD_FLAG_CONFIGURABLE = 0,
IDXD_FLAG_CMD_RUNNING,
IDXD_FLAG_PASID_ENABLED,
+ IDXD_FLAG_SIOV_SUPPORTED,
};
struct idxd_device {
@@ -181,6 +182,7 @@ struct idxd_device {
int num_groups;
+ u32 ims_offset;
u32 msix_perm_offset;
u32 wqcfg_offset;
u32 grpcfg_offset;
@@ -188,6 +190,7 @@ struct idxd_device {
u64 max_xfer_bytes;
u32 max_batch_size;
+ int ims_size;
int max_groups;
int max_engines;
int max_tokens;
@@ -262,15 +265,17 @@ enum idxd_interrupt_type {
IDXD_IRQ_IMS,
};
-static inline int idxd_get_wq_portal_offset(enum idxd_portal_prot prot)
+static inline int idxd_get_wq_portal_offset(enum idxd_portal_prot prot,
+ enum idxd_interrupt_type irq_type)
{
- return prot * 0x1000;
+ return prot * 0x1000 + irq_type * 0x2000;
}
static inline int idxd_get_wq_portal_full_offset(int wq_id,
- enum idxd_portal_prot prot)
+ enum idxd_portal_prot prot,
+ enum idxd_interrupt_type irq_type)
{
- return ((wq_id * 4) << PAGE_SHIFT) + idxd_get_wq_portal_offset(prot);
+ return ((wq_id * 4) << PAGE_SHIFT) + idxd_get_wq_portal_offset(prot, irq_type);
}
static inline void idxd_set_type(struct idxd_device *idxd)
diff --git a/drivers/dma/idxd/init.c b/drivers/dma/idxd/init.c
index c136216e19e8..4a21c2a17a62 100644
--- a/drivers/dma/idxd/init.c
+++ b/drivers/dma/idxd/init.c
@@ -16,6 +16,7 @@
#include <linux/idr.h>
#include <linux/intel-svm.h>
#include <linux/iommu.h>
+#include <linux/pci-siov.h>
#include <uapi/linux/idxd.h>
#include <linux/dmaengine.h>
#include "../dmaengine.h"
@@ -244,10 +245,27 @@ static void idxd_read_table_offsets(struct idxd_device *idxd)
dev_dbg(dev, "IDXD Work Queue Config Offset: %#x\n", idxd->wqcfg_offset);
idxd->msix_perm_offset = offsets.msix_perm * IDXD_TABLE_MULT;
dev_dbg(dev, "IDXD MSIX Permission Offset: %#x\n", idxd->msix_perm_offset);
+ idxd->ims_offset = offsets.ims * IDXD_TABLE_MULT;
+ dev_dbg(dev, "IDXD IMS Offset: %#x\n", idxd->ims_offset);
idxd->perfmon_offset = offsets.perfmon * IDXD_TABLE_MULT;
dev_dbg(dev, "IDXD Perfmon Offset: %#x\n", idxd->perfmon_offset);
}
+static void idxd_check_siov(struct idxd_device *idxd)
+{
+ struct pci_dev *pdev = idxd->pdev;
+
+ if (pci_ims_supported(idxd->pdev) && idxd->hw.gen_cap.max_ims_mult) {
+ idxd->ims_size = idxd->hw.gen_cap.max_ims_mult * 256ULL;
+ dev_dbg(&pdev->dev, "IMS size: %u\n", idxd->ims_size);
+ set_bit(IDXD_FLAG_SIOV_SUPPORTED, &idxd->flags);
+ dev_dbg(&pdev->dev, "IMS supported for device\n");
+ return;
+ }
+
+ dev_dbg(&pdev->dev, "SIOV unsupported for device\n");
+}
+
static void idxd_read_caps(struct idxd_device *idxd)
{
struct device *dev = &idxd->pdev->dev;
@@ -266,6 +284,7 @@ static void idxd_read_caps(struct idxd_device *idxd)
dev_dbg(dev, "max xfer size: %llu bytes\n", idxd->max_xfer_bytes);
idxd->max_batch_size = 1U << idxd->hw.gen_cap.max_batch_shift;
dev_dbg(dev, "max batch size: %u\n", idxd->max_batch_size);
+ idxd_check_siov(idxd);
if (idxd->hw.gen_cap.config_en)
set_bit(IDXD_FLAG_CONFIGURABLE, &idxd->flags);
diff --git a/drivers/dma/idxd/submit.c b/drivers/dma/idxd/submit.c
index cdea5d37ef24..f76d154d1dbd 100644
--- a/drivers/dma/idxd/submit.c
+++ b/drivers/dma/idxd/submit.c
@@ -30,7 +30,13 @@ static struct idxd_desc *__get_desc(struct idxd_wq *wq, int idx, int cpu)
desc->hw->int_handle = wq->vec_ptr;
} else {
desc->vector = wq->vec_ptr;
- desc->hw->int_handle = idxd->int_handles[desc->vector];
+ /*
+ * int_handles are only for descriptor completion. However for device
+ * MSIX enumeration, vec 0 is used for misc interrupts. Therefore even
+ * though we are rotating through 1...N for descriptor interrupts, we
+ * need to acquire the int_handles from 0..N-1.
+ */
+ desc->hw->int_handle = idxd->int_handles[desc->vector - 1];
}
return desc;
@@ -91,7 +97,7 @@ int idxd_submit_desc(struct idxd_wq *wq, struct idxd_desc *desc)
if (idxd->state != IDXD_DEV_ENABLED)
return -EIO;
- portal = wq->portal + idxd_get_wq_portal_offset(IDXD_PORTAL_LIMITED);
+ portal = wq->portal + idxd_get_wq_portal_offset(IDXD_PORTAL_LIMITED, IDXD_IRQ_MSIX);
/*
* The wmb() flushes writes to coherent DMA data before
diff --git a/drivers/dma/idxd/sysfs.c b/drivers/dma/idxd/sysfs.c
index 304eb2cf532e..17f13ebae028 100644
--- a/drivers/dma/idxd/sysfs.c
+++ b/drivers/dma/idxd/sysfs.c
@@ -1353,6 +1353,14 @@ static ssize_t numa_node_show(struct device *dev,
}
static DEVICE_ATTR_RO(numa_node);
+static ssize_t ims_size_show(struct device *dev, struct device_attribute *attr, char *buf)
+{
+ struct idxd_device *idxd = container_of(dev, struct idxd_device, conf_dev);
+
+ return sprintf(buf, "%u\n", idxd->ims_size);
+}
+static DEVICE_ATTR_RO(ims_size);
+
static ssize_t max_batch_size_show(struct device *dev,
struct device_attribute *attr, char *buf)
{
@@ -1548,6 +1556,7 @@ static struct attribute *idxd_device_attributes[] = {
&dev_attr_max_work_queues_size.attr,
&dev_attr_max_engines.attr,
&dev_attr_numa_node.attr,
+ &dev_attr_ims_size.attr,
&dev_attr_max_batch_size.attr,
&dev_attr_max_transfer_size.attr,
&dev_attr_op_cap.attr,
Intel Scalable I/O Virtualization (SIOV) enables sharing of I/O devices
across isolated domains through PASID based sub-device partitioning.
Interrupt Message Storage (IMS) enables devices to store the interrupt
messages in a device-specific optimized manner without the scalability
restrictions of the PCIe defined MSI-X capability. IMS is one of the
features supported under SIOV.
Move SIOV detection code from Intel iommu driver code to common PCI. Making
the detection code common allows supported accelerator drivers to query the
PCI core for SIOV and IMS capabilities. The support code will add the
ability to query the PCI DVSEC capabilities for the SIOV cap.
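For illustration (not part of the patch), a sketch of how an accelerator
driver is expected to consume these helpers; it uses the pci_find_dvsec() and
pci_ims_supported() signatures introduced below, and the wrapper name is
hypothetical:

#include <linux/pci.h>
#include <linux/pci-siov.h>

static bool example_device_can_use_ims(struct pci_dev *pdev)
{
	int pos;

	/* Locate the Intel SIOV DVSEC; a negative return means it is absent. */
	pos = pci_find_dvsec(pdev, PCI_VENDOR_ID_INTEL, PCI_DVSEC_ID_INTEL_SIOV);
	if (pos < 0)
		return false;

	/* The convenience helper additionally checks the IMS bit in the SIOV cap. */
	return pci_ims_supported(pdev);
}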
Suggested-by: Thomas Gleixner <[email protected]>
Cc: Baolu Lu <[email protected]>
Signed-off-by: Dave Jiang <[email protected]>
Reviewed-by: Ashok Raj <[email protected]>
---
drivers/iommu/intel/iommu.c | 31 ++-----------------------
drivers/pci/Kconfig | 15 ++++++++++++
drivers/pci/Makefile | 2 ++
drivers/pci/dvsec.c | 40 +++++++++++++++++++++++++++++++++
drivers/pci/siov.c | 50 +++++++++++++++++++++++++++++++++++++++++
include/linux/pci-siov.h | 18 +++++++++++++++
include/linux/pci.h | 3 ++
include/uapi/linux/pci_regs.h | 4 +++
8 files changed, 134 insertions(+), 29 deletions(-)
create mode 100644 drivers/pci/dvsec.c
create mode 100644 drivers/pci/siov.c
create mode 100644 include/linux/pci-siov.h
diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index 3e77a88b236c..d9335f590b42 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -36,6 +36,7 @@
#include <linux/tboot.h>
#include <linux/dmi.h>
#include <linux/pci-ats.h>
+#include <linux/pci-siov.h>
#include <linux/memblock.h>
#include <linux/dma-map-ops.h>
#include <linux/dma-direct.h>
@@ -5883,34 +5884,6 @@ static int intel_iommu_disable_auxd(struct device *dev)
return 0;
}
-/*
- * A PCI express designated vendor specific extended capability is defined
- * in the section 3.7 of Intel scalable I/O virtualization technical spec
- * for system software and tools to detect endpoint devices supporting the
- * Intel scalable IO virtualization without host driver dependency.
- *
- * Returns the address of the matching extended capability structure within
- * the device's PCI configuration space or 0 if the device does not support
- * it.
- */
-static int siov_find_pci_dvsec(struct pci_dev *pdev)
-{
- int pos;
- u16 vendor, id;
-
- pos = pci_find_next_ext_capability(pdev, 0, 0x23);
- while (pos) {
- pci_read_config_word(pdev, pos + 4, &vendor);
- pci_read_config_word(pdev, pos + 8, &id);
- if (vendor == PCI_VENDOR_ID_INTEL && id == 5)
- return pos;
-
- pos = pci_find_next_ext_capability(pdev, pos, 0x23);
- }
-
- return 0;
-}
-
static bool
intel_iommu_dev_has_feat(struct device *dev, enum iommu_dev_features feat)
{
@@ -5925,7 +5898,7 @@ intel_iommu_dev_has_feat(struct device *dev, enum iommu_dev_features feat)
if (ret < 0)
return false;
- return !!siov_find_pci_dvsec(to_pci_dev(dev));
+ return pci_siov_supported(to_pci_dev(dev));
}
if (feat == IOMMU_DEV_FEAT_SVA) {
diff --git a/drivers/pci/Kconfig b/drivers/pci/Kconfig
index 0c473d75e625..cf7f4d17d8cc 100644
--- a/drivers/pci/Kconfig
+++ b/drivers/pci/Kconfig
@@ -161,6 +161,21 @@ config PCI_PASID
If unsure, say N.
+config PCI_DVSEC
+ bool
+
+config PCI_SIOV
+ select PCI_PASID
+ select PCI_DVSEC
+ bool "PCI SIOV support"
+ help
+ Scalable I/O Virtualization enables sharing of I/O devices across isolated
+ domains through PASID based sub-device partitioning. One of the sub-features
+ supported by SIOV is Interrupt Message Storage (IMS). Select this option if
+ you want to compile the support into your kernel.
+
+ If unsure, say N.
+
config PCI_P2PDMA
bool "PCI peer-to-peer transfer support"
depends on ZONE_DEVICE
diff --git a/drivers/pci/Makefile b/drivers/pci/Makefile
index 522d2b974e91..653a1d69b0fc 100644
--- a/drivers/pci/Makefile
+++ b/drivers/pci/Makefile
@@ -20,6 +20,8 @@ obj-$(CONFIG_PCI_QUIRKS) += quirks.o
obj-$(CONFIG_HOTPLUG_PCI) += hotplug/
obj-$(CONFIG_PCI_MSI) += msi.o
obj-$(CONFIG_PCI_ATS) += ats.o
+obj-$(CONFIG_PCI_DVSEC) += dvsec.o
+obj-$(CONFIG_PCI_SIOV) += siov.o
obj-$(CONFIG_PCI_IOV) += iov.o
obj-$(CONFIG_PCI_BRIDGE_EMUL) += pci-bridge-emul.o
obj-$(CONFIG_PCI_LABEL) += pci-label.o
diff --git a/drivers/pci/dvsec.c b/drivers/pci/dvsec.c
new file mode 100644
index 000000000000..e49b079f0717
--- /dev/null
+++ b/drivers/pci/dvsec.c
@@ -0,0 +1,40 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * PCI DVSEC helper functions
+ * Copyright (C) 2020 Intel Corp.
+ */
+
+#include <linux/export.h>
+#include <linux/pci.h>
+#include <uapi/linux/pci_regs.h>
+#include "pci.h"
+
+/**
+ * pci_find_dvsec - return position of DVSEC with provided vendor and dvsec id
+ * @dev: the PCI device
+ * @vendor: Vendor for the DVSEC
+ * @id: the DVSEC cap id
+ *
+ * Return the offset of DVSEC on success or -ENOTSUPP if not found
+ */
+int pci_find_dvsec(struct pci_dev *dev, u16 vendor, u16 id)
+{
+ u16 dev_vendor, dev_id;
+ int pos;
+
+ pos = pci_find_ext_capability(dev, PCI_EXT_CAP_ID_DVSEC);
+ if (!pos)
+ return -ENOTSUPP;
+
+ while (pos) {
+ pci_read_config_word(dev, pos + PCI_DVSEC_HEADER1, &dev_vendor);
+ pci_read_config_word(dev, pos + PCI_DVSEC_HEADER2, &dev_id);
+ if (dev_vendor == vendor && dev_id == id)
+ return pos;
+
+ pos = pci_find_next_ext_capability(dev, pos, PCI_EXT_CAP_ID_DVSEC);
+ }
+
+ return -ENOTSUPP;
+}
+EXPORT_SYMBOL_GPL(pci_find_dvsec);
diff --git a/drivers/pci/siov.c b/drivers/pci/siov.c
new file mode 100644
index 000000000000..6147e6ae5832
--- /dev/null
+++ b/drivers/pci/siov.c
@@ -0,0 +1,50 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Intel Scalable I/O Virtualization support
+ * Copyright (C) 2020 Intel Corp.
+ */
+
+#include <linux/export.h>
+#include <linux/pci.h>
+#include <linux/pci-siov.h>
+#include <uapi/linux/pci_regs.h>
+#include "pci.h"
+
+/*
+ * A PCI express designated vendor specific extended capability is defined
+ * in the section 3.7 of Intel scalable I/O virtualization technical spec
+ * for system software and tools to detect endpoint devices supporting the
+ * Intel scalable IO virtualization without host driver dependency.
+ */
+
+/**
+ * pci_siov_supported - check if the device can use SIOV
+ * @dev: the PCI device
+ *
+ * Returns true if the device supports SIOV, false otherwise.
+ */
+bool pci_siov_supported(struct pci_dev *dev)
+{
+ return pci_find_dvsec(dev, PCI_VENDOR_ID_INTEL, PCI_DVSEC_ID_INTEL_SIOV) < 0 ? false : true;
+}
+EXPORT_SYMBOL_GPL(pci_siov_supported);
+
+/**
+ * pci_ims_supported - check if the device can use IMS
+ * @dev: the PCI device
+ *
+ * Returns true if the device supports IMS, false otherwise.
+ */
+bool pci_ims_supported(struct pci_dev *dev)
+{
+ int pos;
+ u32 caps;
+
+ pos = pci_find_dvsec(dev, PCI_VENDOR_ID_INTEL, PCI_DVSEC_ID_INTEL_SIOV);
+ if (pos < 0)
+ return false;
+
+ pci_read_config_dword(dev, pos + PCI_DVSEC_INTEL_SIOV_CAP, &caps);
+ return (caps & PCI_DVSEC_INTEL_SIOV_CAP_IMS) ? true : false;
+}
+EXPORT_SYMBOL_GPL(pci_ims_supported);
diff --git a/include/linux/pci-siov.h b/include/linux/pci-siov.h
new file mode 100644
index 000000000000..a8a4eb5f4634
--- /dev/null
+++ b/include/linux/pci-siov.h
@@ -0,0 +1,18 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef LINUX_PCI_SIOV_H
+#define LINUX_PCI_SIOV_H
+
+#include <linux/pci.h>
+
+#ifdef CONFIG_PCI_SIOV
+/* Scalable I/O Virtualization */
+bool pci_siov_supported(struct pci_dev *dev);
+bool pci_ims_supported(struct pci_dev *dev);
+#else /* CONFIG_PCI_SIOV */
+static inline bool pci_siov_supported(struct pci_dev *d)
+{ return false; }
+static inline bool pci_ims_supported(struct pci_dev *d)
+{ return false; }
+#endif /* CONFIG_PCI_SIOV */
+
+#endif /* LINUX_PCI_SIOV_H */
diff --git a/include/linux/pci.h b/include/linux/pci.h
index 22207a79762c..4710f09b43b1 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -1070,6 +1070,7 @@ int pci_find_next_ext_capability(struct pci_dev *dev, int pos, int cap);
int pci_find_ht_capability(struct pci_dev *dev, int ht_cap);
int pci_find_next_ht_capability(struct pci_dev *dev, int pos, int ht_cap);
struct pci_bus *pci_find_next_bus(const struct pci_bus *from);
+int pci_find_dvsec(struct pci_dev *dev, u16 vendor, u16 id);
u64 pci_get_dsn(struct pci_dev *dev);
@@ -1726,6 +1727,8 @@ static inline int pci_find_next_capability(struct pci_dev *dev, u8 post,
{ return 0; }
static inline int pci_find_ext_capability(struct pci_dev *dev, int cap)
{ return 0; }
+static inline int pci_find_dvsec(struct pci_dev *dev, u16 vendor, u16 id)
+{ return 0; }
static inline u64 pci_get_dsn(struct pci_dev *dev)
{ return 0; }
diff --git a/include/uapi/linux/pci_regs.h b/include/uapi/linux/pci_regs.h
index 8f8bd2318c6c..3532528441ef 100644
--- a/include/uapi/linux/pci_regs.h
+++ b/include/uapi/linux/pci_regs.h
@@ -1071,6 +1071,10 @@
#define PCI_DVSEC_HEADER1 0x4 /* Designated Vendor-Specific Header1 */
#define PCI_DVSEC_HEADER2 0x8 /* Designated Vendor-Specific Header2 */
+#define PCI_DVSEC_ID_INTEL_SIOV 0x5
+#define PCI_DVSEC_INTEL_SIOV_CAP 0x14
+#define PCI_DVSEC_INTEL_SIOV_CAP_IMS 0x1
+
/* Data Link Feature */
#define PCI_DLF_CAP 0x04 /* Capabilities Register */
#define PCI_DLF_EXCHANGE_ENABLE 0x80000000 /* Data Link Feature Exchange Enable */
On Fri, Oct 30, 2020 at 11:50:47AM -0700, Dave Jiang wrote:
> .../ABI/stable/sysfs-driver-dma-idxd | 6 +
> Documentation/driver-api/vfio/mdev-idxd.rst | 404 ++++++
> MAINTAINERS | 1 +
> drivers/dma/Kconfig | 9 +
> drivers/dma/idxd/Makefile | 2 +
> drivers/dma/idxd/cdev.c | 6 +-
> drivers/dma/idxd/device.c | 294 ++++-
> drivers/dma/idxd/idxd.h | 67 +-
> drivers/dma/idxd/init.c | 86 ++
> drivers/dma/idxd/irq.c | 6 +-
> drivers/dma/idxd/mdev.c | 1121 +++++++++++++++++
> drivers/dma/idxd/mdev.h | 116 ++
Again, a subsystem driver belongs in the directory hierarchy of the
subsystem, not in other random places. All this mdev stuff belongs
under drivers/vfio
Jason
On 10/30/2020 11:58 AM, Jason Gunthorpe wrote:
> On Fri, Oct 30, 2020 at 11:50:47AM -0700, Dave Jiang wrote:
>> .../ABI/stable/sysfs-driver-dma-idxd | 6 +
>> Documentation/driver-api/vfio/mdev-idxd.rst | 404 ++++++
>> MAINTAINERS | 1 +
>> drivers/dma/Kconfig | 9 +
>> drivers/dma/idxd/Makefile | 2 +
>> drivers/dma/idxd/cdev.c | 6 +-
>> drivers/dma/idxd/device.c | 294 ++++-
>> drivers/dma/idxd/idxd.h | 67 +-
>> drivers/dma/idxd/init.c | 86 ++
>> drivers/dma/idxd/irq.c | 6 +-
>> drivers/dma/idxd/mdev.c | 1121 +++++++++++++++++
>> drivers/dma/idxd/mdev.h | 116 ++
>
> Again, a subsytem driver belongs in the directory hierarchy of the
> subsystem, not in other random places. All this mdev stuff belongs
> under drivers/vfio
Alex seems to have disagreed last time....
https://lore.kernel.org/dmaengine/[email protected]/
And I do agree with his perspective. The mdev is an extension of the PF driver.
It's a bit awkward to be a stand alone mdev driver under vfio/mdev/.
>
> Jason
>
On Fri, Oct 30, 2020 at 12:13:48PM -0700, Dave Jiang wrote:
>
>
> On 10/30/2020 11:58 AM, Jason Gunthorpe wrote:
> > On Fri, Oct 30, 2020 at 11:50:47AM -0700, Dave Jiang wrote:
> > > .../ABI/stable/sysfs-driver-dma-idxd | 6 +
> > > Documentation/driver-api/vfio/mdev-idxd.rst | 404 ++++++
> > > MAINTAINERS | 1 +
> > > drivers/dma/Kconfig | 9 +
> > > drivers/dma/idxd/Makefile | 2 +
> > > drivers/dma/idxd/cdev.c | 6 +-
> > > drivers/dma/idxd/device.c | 294 ++++-
> > > drivers/dma/idxd/idxd.h | 67 +-
> > > drivers/dma/idxd/init.c | 86 ++
> > > drivers/dma/idxd/irq.c | 6 +-
> > > drivers/dma/idxd/mdev.c | 1121 +++++++++++++++++
> > > drivers/dma/idxd/mdev.h | 116 ++
> >
> > Again, a subsytem driver belongs in the directory hierarchy of the
> > subsystem, not in other random places. All this mdev stuff belongs
> > under drivers/vfio
>
> Alex seems to have disagreed last time....
> https://lore.kernel.org/dmaengine/[email protected]/
Nobody else in the kernel is splitting subsystems up anymore
> And I do agree with his perspective. The mdev is an extension of the PF
> driver. It's a bit awkward to be a stand alone mdev driver under vfio/mdev/.
By this logic we'd have gigantic drivers under drivers/ethernet
touching netdev, rdma, scsi, vdpa, etc just because that is where the
PF driver came from.
It is not how the kernel works. Subsystem owners are responsible for
their subsystem; drivers implementing their subsystem are under the
subsystem directory.
Jason
On Fri, Oct 30, 2020 at 04:17:06PM -0300, Jason Gunthorpe wrote:
> On Fri, Oct 30, 2020 at 12:13:48PM -0700, Dave Jiang wrote:
> >
> >
> > On 10/30/2020 11:58 AM, Jason Gunthorpe wrote:
> > > On Fri, Oct 30, 2020 at 11:50:47AM -0700, Dave Jiang wrote:
> > > > .../ABI/stable/sysfs-driver-dma-idxd | 6 +
> > > > Documentation/driver-api/vfio/mdev-idxd.rst | 404 ++++++
> > > > MAINTAINERS | 1 +
> > > > drivers/dma/Kconfig | 9 +
> > > > drivers/dma/idxd/Makefile | 2 +
> > > > drivers/dma/idxd/cdev.c | 6 +-
> > > > drivers/dma/idxd/device.c | 294 ++++-
> > > > drivers/dma/idxd/idxd.h | 67 +-
> > > > drivers/dma/idxd/init.c | 86 ++
> > > > drivers/dma/idxd/irq.c | 6 +-
> > > > drivers/dma/idxd/mdev.c | 1121 +++++++++++++++++
> > > > drivers/dma/idxd/mdev.h | 116 ++
> > >
> > > Again, a subsytem driver belongs in the directory hierarchy of the
> > > subsystem, not in other random places. All this mdev stuff belongs
> > > under drivers/vfio
> >
> > Alex seems to have disagreed last time....
> > https://lore.kernel.org/dmaengine/[email protected]/
>
> Nobody else in the kernel is splitting subsystems up anymore
>
> > And I do agree with his perspective. The mdev is an extension of the PF
> > driver. It's a bit awkward to be a stand alone mdev driver under vfio/mdev/.
>
> By this logic we'd have giagantic drivers under drivers/ethernet
> touching netdev, rdma, scsi, vdpa, etc just because that is where the
> PF driver came from.
What makes you think this is providing services like scsi/rdma/vdpa etc.?
For DSA this plays the exact same role, not a different function
as you highlight above. These mdevs are creating DSA for virtualization
use. They aren't providing a completely different role or subsystem per se.
Cheers,
Ashok
On Fri, Oct 30, 2020 at 12:23:25PM -0700, Raj, Ashok wrote:
> On Fri, Oct 30, 2020 at 04:17:06PM -0300, Jason Gunthorpe wrote:
> > On Fri, Oct 30, 2020 at 12:13:48PM -0700, Dave Jiang wrote:
> > >
> > >
> > > On 10/30/2020 11:58 AM, Jason Gunthorpe wrote:
> > > > On Fri, Oct 30, 2020 at 11:50:47AM -0700, Dave Jiang wrote:
> > > > > .../ABI/stable/sysfs-driver-dma-idxd | 6 +
> > > > > Documentation/driver-api/vfio/mdev-idxd.rst | 404 ++++++
> > > > > MAINTAINERS | 1 +
> > > > > drivers/dma/Kconfig | 9 +
> > > > > drivers/dma/idxd/Makefile | 2 +
> > > > > drivers/dma/idxd/cdev.c | 6 +-
> > > > > drivers/dma/idxd/device.c | 294 ++++-
> > > > > drivers/dma/idxd/idxd.h | 67 +-
> > > > > drivers/dma/idxd/init.c | 86 ++
> > > > > drivers/dma/idxd/irq.c | 6 +-
> > > > > drivers/dma/idxd/mdev.c | 1121 +++++++++++++++++
> > > > > drivers/dma/idxd/mdev.h | 116 ++
> > > >
> > > > Again, a subsytem driver belongs in the directory hierarchy of the
> > > > subsystem, not in other random places. All this mdev stuff belongs
> > > > under drivers/vfio
> > >
> > > Alex seems to have disagreed last time....
> > > https://lore.kernel.org/dmaengine/[email protected]/
> >
> > Nobody else in the kernel is splitting subsystems up anymore
> >
> > > And I do agree with his perspective. The mdev is an extension of the PF
> > > driver. It's a bit awkward to be a stand alone mdev driver under vfio/mdev/.
> >
> > By this logic we'd have giagantic drivers under drivers/ethernet
> > touching netdev, rdma, scsi, vdpa, etc just because that is where the
> > PF driver came from.
>
> What makes you think this is providing services like scsi/rdma/vdpa etc.. ?
>
> for DSA this playes the exact same role, not a different function
> as you highlight above. these mdev's are creating DSA for virtualization
> use. They aren't providing a completely different role or subsystem per-se.
It is a different subsystem, different maintainer, and different
reviewers.
It is a development process problem; it doesn't matter what it is
doing.
Jason
On Fri, Oct 30, 2020 at 11:51:32AM -0700, Dave Jiang wrote:
> Intel Scalable I/O Virtualization (SIOV) enables sharing of I/O devices
> across isolated domains through PASID based sub-device partitioning.
> Interrupt Message Storage (IMS) enables devices to store the interrupt
> messages in a device-specific optimized manner without the scalability
> restrictions of the PCIe defined MSI-X capability. IMS is one of the
> features supported under SIOV.
>
> Move SIOV detection code from Intel iommu driver code to common PCI. Making
> the detection code common allows supported accelerator drivers to query the
> PCI core for SIOV and IMS capabilities. The support code will add the
> ability to query the PCI DVSEC capabilities for the SIOV cap.
This patch really does not include anything related to SIOV other than
adding a little code to *find* the capability. It doesn't add
anything that actually *uses* it. I think this patch should simply
add pci_find_dvsec(), and it doesn't need any of this SIOV or IMS
description.
> Suggested-by: Thomas Gleixner <[email protected]>
> Cc: Baolu Lu <[email protected]>
> Signed-off-by: Dave Jiang <[email protected]>
> Reviewed-by: Ashok Raj <[email protected]>
> ---
> drivers/iommu/intel/iommu.c | 31 ++-----------------------
> drivers/pci/Kconfig | 15 ++++++++++++
> drivers/pci/Makefile | 2 ++
> drivers/pci/dvsec.c | 40 +++++++++++++++++++++++++++++++++
> drivers/pci/siov.c | 50 +++++++++++++++++++++++++++++++++++++++++
> include/linux/pci-siov.h | 18 +++++++++++++++
> include/linux/pci.h | 3 ++
> include/uapi/linux/pci_regs.h | 4 +++
> 8 files changed, 134 insertions(+), 29 deletions(-)
> create mode 100644 drivers/pci/dvsec.c
> create mode 100644 drivers/pci/siov.c
> create mode 100644 include/linux/pci-siov.h
>
> diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
> index 3e77a88b236c..d9335f590b42 100644
> --- a/drivers/iommu/intel/iommu.c
> +++ b/drivers/iommu/intel/iommu.c
> @@ -36,6 +36,7 @@
> #include <linux/tboot.h>
> #include <linux/dmi.h>
> #include <linux/pci-ats.h>
> +#include <linux/pci-siov.h>
> #include <linux/memblock.h>
> #include <linux/dma-map-ops.h>
> #include <linux/dma-direct.h>
> @@ -5883,34 +5884,6 @@ static int intel_iommu_disable_auxd(struct device *dev)
> return 0;
> }
>
> -/*
> - * A PCI express designated vendor specific extended capability is defined
> - * in the section 3.7 of Intel scalable I/O virtualization technical spec
> - * for system software and tools to detect endpoint devices supporting the
> - * Intel scalable IO virtualization without host driver dependency.
> - *
> - * Returns the address of the matching extended capability structure within
> - * the device's PCI configuration space or 0 if the device does not support
> - * it.
> - */
> -static int siov_find_pci_dvsec(struct pci_dev *pdev)
> -{
> - int pos;
> - u16 vendor, id;
> -
> - pos = pci_find_next_ext_capability(pdev, 0, 0x23);
> - while (pos) {
> - pci_read_config_word(pdev, pos + 4, &vendor);
> - pci_read_config_word(pdev, pos + 8, &id);
> - if (vendor == PCI_VENDOR_ID_INTEL && id == 5)
> - return pos;
> -
> - pos = pci_find_next_ext_capability(pdev, pos, 0x23);
> - }
> -
> - return 0;
> -}
> -
> static bool
> intel_iommu_dev_has_feat(struct device *dev, enum iommu_dev_features feat)
> {
> @@ -5925,7 +5898,7 @@ intel_iommu_dev_has_feat(struct device *dev, enum iommu_dev_features feat)
> if (ret < 0)
> return false;
>
> - return !!siov_find_pci_dvsec(to_pci_dev(dev));
> + return pci_siov_supported(to_pci_dev(dev));
> }
>
> if (feat == IOMMU_DEV_FEAT_SVA) {
> diff --git a/drivers/pci/Kconfig b/drivers/pci/Kconfig
> index 0c473d75e625..cf7f4d17d8cc 100644
> --- a/drivers/pci/Kconfig
> +++ b/drivers/pci/Kconfig
> @@ -161,6 +161,21 @@ config PCI_PASID
>
> If unsure, say N.
>
> +config PCI_DVSEC
> + bool
> +
> +config PCI_SIOV
> + select PCI_PASID
This patch has nothing to do with PCI_PASID. If you want to add this
select later in a patch that *does* add something that requires
PCI_PASID, that's OK.
> + select PCI_DVSEC
> + bool "PCI SIOV support"
> + help
> + Scalable I/O Virtualzation enables sharing of I/O devices across isolated
> + domains through PASID based sub-device partitioning. One of the sub features
> + supported by SIOV is Inetrrupt Message Storage (IMS). Select this option if
> + you want to compile the support into your kernel.
> + If unsure, say N.
> +
> config PCI_P2PDMA
> bool "PCI peer-to-peer transfer support"
> depends on ZONE_DEVICE
> diff --git a/drivers/pci/Makefile b/drivers/pci/Makefile
> index 522d2b974e91..653a1d69b0fc 100644
> --- a/drivers/pci/Makefile
> +++ b/drivers/pci/Makefile
> @@ -20,6 +20,8 @@ obj-$(CONFIG_PCI_QUIRKS) += quirks.o
> obj-$(CONFIG_HOTPLUG_PCI) += hotplug/
> obj-$(CONFIG_PCI_MSI) += msi.o
> obj-$(CONFIG_PCI_ATS) += ats.o
> +obj-$(CONFIG_PCI_DVSEC) += dvsec.o
> +obj-$(CONFIG_PCI_SIOV) += siov.o
> obj-$(CONFIG_PCI_IOV) += iov.o
> obj-$(CONFIG_PCI_BRIDGE_EMUL) += pci-bridge-emul.o
> obj-$(CONFIG_PCI_LABEL) += pci-label.o
> diff --git a/drivers/pci/dvsec.c b/drivers/pci/dvsec.c
> new file mode 100644
> index 000000000000..e49b079f0717
> --- /dev/null
> +++ b/drivers/pci/dvsec.c
> @@ -0,0 +1,40 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * PCI DVSEC helper functions
> + * Copyright (C) 2020 Intel Corp.
> + */
> +
> +#include <linux/export.h>
> +#include <linux/pci.h>
> +#include <uapi/linux/pci_regs.h>
> +#include "pci.h"
> +
> +/**
> + * pci_find_dvsec - return position of DVSEC with provided vendor and dvsec id
> + * @dev: the PCI device
> + * @vendor: Vendor for the DVSEC
> + * @id: the DVSEC cap id
> + *
> + * Return the offset of DVSEC on success or -ENOTSUPP if not found
s/vendor/Vendor/
s/dvsec/DVSEC/
s/id/ID/ twice above
Please put this function in drivers/pci/pci.c next to
pci_find_ext_capability(). I don't think it's worth making a new file
just for this.
> + */
> +int pci_find_dvsec(struct pci_dev *dev, u16 vendor, u16 id)
> +{
> + u16 dev_vendor, dev_id;
> + int pos;
> +
> + pos = pci_find_ext_capability(dev, PCI_EXT_CAP_ID_DVSEC);
> + if (!pos)
> + return -ENOTSUPP;
> +
> + while (pos) {
> + pci_read_config_word(dev, pos + PCI_DVSEC_HEADER1, &dev_vendor);
> + pci_read_config_word(dev, pos + PCI_DVSEC_HEADER2, &dev_id);
> + if (dev_vendor == vendor && dev_id == id)
> + return pos;
> +
> + pos = pci_find_next_ext_capability(dev, pos, PCI_EXT_CAP_ID_DVSEC);
> + }
> +
> + return -ENOTSUPP;
> +}
> +EXPORT_SYMBOL_GPL(pci_find_dvsec);
> diff --git a/drivers/pci/siov.c b/drivers/pci/siov.c
> new file mode 100644
> index 000000000000..6147e6ae5832
> --- /dev/null
> +++ b/drivers/pci/siov.c
> @@ -0,0 +1,50 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Intel Scalable I/O Virtualization support
> + * Copyright (C) 2020 Intel Corp.
> + */
> +
> +#include <linux/export.h>
> +#include <linux/pci.h>
> +#include <linux/pci-siov.h>
> +#include <uapi/linux/pci_regs.h>
> +#include "pci.h"
> +
> +/*
> + * A PCI express designated vendor specific extended capability is defined
> + * in the section 3.7 of Intel scalable I/O virtualization technical spec
> + * for system software and tools to detect endpoint devices supporting the
> + * Intel scalable IO virtualization without host driver dependency.
> + */
> +
> +/**
> + * pci_siov_supported - check if the device can use SIOV
> + * @dev: the PCI device
> + *
> + * Returns true if the device supports SIOV, false otherwise.
> + */
> +bool pci_siov_supported(struct pci_dev *dev)
> +{
> + return pci_find_dvsec(dev, PCI_VENDOR_ID_INTEL, PCI_DVSEC_ID_INTEL_SIOV) < 0 ? false : true;
> +}
> +EXPORT_SYMBOL_GPL(pci_siov_supported);
> +
> +/**
> + * pci_ims_supported - check if the device can use IMS
> + * @dev: the PCI device
> + *
> + * Returns true if the device supports IMS, false otherwise.
> + */
> +bool pci_ims_supported(struct pci_dev *dev)
> +{
> + int pos;
> + u32 caps;
> +
> + pos = pci_find_dvsec(dev, PCI_VENDOR_ID_INTEL, PCI_DVSEC_ID_INTEL_SIOV);
> + if (pos < 0)
> + return false;
> +
> + pci_read_config_dword(dev, pos + PCI_DVSEC_INTEL_SIOV_CAP, &caps);
> + return (caps & PCI_DVSEC_INTEL_SIOV_CAP_IMS) ? true : false;
> +}
> +EXPORT_SYMBOL_GPL(pci_ims_supported);
I don't really see the point of these *_supported() functions. If the
caller wants to use them, I would expect it to call
pci_find_dvsec(PCI_DVSEC_ID_INTEL_SIOV) itself anyway.
But there *are* no calls to pci_find_dvsec(PCI_DVSEC_ID_INTEL_SIOV).
So apparently all you care about is whether the capability *exists*,
and you don't need any information at all from the capability
registers except PCI_DVSEC_INTEL_SIOV_CAP_IMS? That seems a little
weird.
I don't think it's worth adding a whole new file just for this. The
only value the PCI core is adding here is a way to locate the
PCI_DVSEC_ID_INTEL_SIOV capability.
> diff --git a/include/linux/pci-siov.h b/include/linux/pci-siov.h
> new file mode 100644
> index 000000000000..a8a4eb5f4634
> --- /dev/null
> +++ b/include/linux/pci-siov.h
> @@ -0,0 +1,18 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef LINUX_PCI_SIOV_H
> +#define LINUX_PCI_SIOV_H
> +
> +#include <linux/pci.h>
> +
> +#ifdef CONFIG_PCI_SIOV
> +/* Scalable I/O Virtualization */
> +bool pci_siov_supported(struct pci_dev *dev);
> +bool pci_ims_supported(struct pci_dev *dev);
> +#else /* CONFIG_PCI_SIOV */
> +static inline bool pci_siov_supported(struct pci_dev *d)
> +{ return false; }
> +static inline bool pci_ims_supported(struct pci_dev *d)
> +{ return false; }
> +#endif /* CONFIG_PCI_SIOV */
> +
> +#endif /* LINUX_PCI_SIOV_H */
What's the benefit to putting these declarations in a separate
pci-siov.h as opposed to putting them in pci.h itself? That's what we
do for things like MSI, IOV, etc.
> diff --git a/include/linux/pci.h b/include/linux/pci.h
> index 22207a79762c..4710f09b43b1 100644
> --- a/include/linux/pci.h
> +++ b/include/linux/pci.h
> @@ -1070,6 +1070,7 @@ int pci_find_next_ext_capability(struct pci_dev *dev, int pos, int cap);
> int pci_find_ht_capability(struct pci_dev *dev, int ht_cap);
> int pci_find_next_ht_capability(struct pci_dev *dev, int pos, int ht_cap);
> struct pci_bus *pci_find_next_bus(const struct pci_bus *from);
> +int pci_find_dvsec(struct pci_dev *dev, u16 vendor, u16 id);
>
> u64 pci_get_dsn(struct pci_dev *dev);
>
> @@ -1726,6 +1727,8 @@ static inline int pci_find_next_capability(struct pci_dev *dev, u8 post,
> { return 0; }
> static inline int pci_find_ext_capability(struct pci_dev *dev, int cap)
> { return 0; }
> +static inline int pci_find_dvsec(struct pci_dev *dev, u16 vendor, u16 id)
> +{ return 0; }
>
> static inline u64 pci_get_dsn(struct pci_dev *dev)
> { return 0; }
> diff --git a/include/uapi/linux/pci_regs.h b/include/uapi/linux/pci_regs.h
> index 8f8bd2318c6c..3532528441ef 100644
> --- a/include/uapi/linux/pci_regs.h
> +++ b/include/uapi/linux/pci_regs.h
> @@ -1071,6 +1071,10 @@
> #define PCI_DVSEC_HEADER1 0x4 /* Designated Vendor-Specific Header1 */
> #define PCI_DVSEC_HEADER2 0x8 /* Designated Vendor-Specific Header2 */
>
> +#define PCI_DVSEC_ID_INTEL_SIOV 0x5
> +#define PCI_DVSEC_INTEL_SIOV_CAP 0x14
> +#define PCI_DVSEC_INTEL_SIOV_CAP_IMS 0x1
Convention in this file is to write constants in the register width,
e.g.,
#define PCI_DVSEC_ID_INTEL_SIOV 0x0005
#define PCI_DVSEC_INTEL_SIOV_CAP_IMS 0x00000001
You can learn this by looking at the surrounding definitions.
> /* Data Link Feature */
> #define PCI_DLF_CAP 0x04 /* Capabilities Register */
> #define PCI_DLF_EXCHANGE_ENABLE 0x80000000 /* Data Link Feature Exchange Enable */
>
>
On Fri, Oct 30 2020 at 11:51, Dave Jiang wrote:
> From: Megha Dey <[email protected]>
This conflicts with
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git x86/apic
Thanks,
tglx
On Fri, Oct 30, 2020 at 04:30:45PM -0300, Jason Gunthorpe wrote:
> On Fri, Oct 30, 2020 at 12:23:25PM -0700, Raj, Ashok wrote:
> > On Fri, Oct 30, 2020 at 04:17:06PM -0300, Jason Gunthorpe wrote:
> > > On Fri, Oct 30, 2020 at 12:13:48PM -0700, Dave Jiang wrote:
> > > >
> > > >
> > > > On 10/30/2020 11:58 AM, Jason Gunthorpe wrote:
> > > > > On Fri, Oct 30, 2020 at 11:50:47AM -0700, Dave Jiang wrote:
> > > > > > .../ABI/stable/sysfs-driver-dma-idxd | 6 +
> > > > > > Documentation/driver-api/vfio/mdev-idxd.rst | 404 ++++++
> > > > > > MAINTAINERS | 1 +
> > > > > > drivers/dma/Kconfig | 9 +
> > > > > > drivers/dma/idxd/Makefile | 2 +
> > > > > > drivers/dma/idxd/cdev.c | 6 +-
> > > > > > drivers/dma/idxd/device.c | 294 ++++-
> > > > > > drivers/dma/idxd/idxd.h | 67 +-
> > > > > > drivers/dma/idxd/init.c | 86 ++
> > > > > > drivers/dma/idxd/irq.c | 6 +-
> > > > > > drivers/dma/idxd/mdev.c | 1121 +++++++++++++++++
> > > > > > drivers/dma/idxd/mdev.h | 116 ++
> > > > >
> > > > > Again, a subsytem driver belongs in the directory hierarchy of the
> > > > > subsystem, not in other random places. All this mdev stuff belongs
> > > > > under drivers/vfio
> > > >
> > > > Alex seems to have disagreed last time....
> > > > https://lore.kernel.org/dmaengine/[email protected]/
> > >
> > > Nobody else in the kernel is splitting subsystems up anymore
> > >
> > > > And I do agree with his perspective. The mdev is an extension of the PF
> > > > driver. It's a bit awkward to be a stand alone mdev driver under vfio/mdev/.
> > >
> > > By this logic we'd have giagantic drivers under drivers/ethernet
> > > touching netdev, rdma, scsi, vdpa, etc just because that is where the
> > > PF driver came from.
> >
> > What makes you think this is providing services like scsi/rdma/vdpa etc.. ?
> >
> > for DSA this playes the exact same role, not a different function
> > as you highlight above. these mdev's are creating DSA for virtualization
> > use. They aren't providing a completely different role or subsystem per-se.
>
> It is a different subsystem, different maintainer, and different
> reviewers.
>
> It is a development process problem, it doesn't matter what it is
> doing.
So drawing that parallel, do you expect all drivers that call
pci_register_driver() to be located in drivers/pci? Aren't they scattered
all over the place: ata, scsi, platform drivers and such?
As Alex pointed out, i915 and a handful of s390 drivers that are mdev users
are not in drivers/vfio. Are you saying those drivers don't get reviewed?
This is no different than PF driver offering VF services. Its a logical
extension.
Reviews happen for mdev users today. What you suggest seems like cutting
the feet to fit the shoe. Unless the maintainers are asking for things
to be split just because they call mdev_register_device(), that practice
doesn't exist, and it would be totally weird to want to move all callers of
pci_register_driver().
Your argument seems interesting, even entertaining :-). But honestly I'm not finding it
practical :-). So every caller of mmu_register_notifier() needs to be in
mm?
What you mention for different functions makes absolute sense, not arguing
against that. But this ain't that.
And we just follow the asks of the maintainer.
I know you aren't going to give up, but there is little we can do. I want
the maintainers to make that call and I'm not going to add more noise to this.
Cheers,
Ashok
On Fri, Oct 30 2020 at 11:50, Dave Jiang wrote:
> The code has dependency on Thomas’s MSI restructuring patch series:
> https://lore.kernel.org/lkml/[email protected]/
which is outdated and no longer applicable.
Thanks,
tglx
On 10/30/2020 1:31 PM, Thomas Gleixner wrote:
> On Fri, Oct 30 2020 at 11:51, Dave Jiang wrote:
>> From: Megha Dey <[email protected]>
>
> This conflicts with
>
> git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git x86/apic
I'll get that fixed up. Thanks!
>
> Thanks,
>
> tglx
>
On 10/30/2020 1:48 PM, Thomas Gleixner wrote:
> On Fri, Oct 30 2020 at 11:50, Dave Jiang wrote:
>> The code has dependency on Thomas’s MSI restructuring patch series:
>> https://lore.kernel.org/lkml/[email protected]/
>
> which is outdated and not longer applicable.
Yes.... I wasn't sure how to point to these patches from you as a dependency.
irqdomain/msi: Provide msi_alloc/free_store() callbacks
platform-msi: Add device MSI infrastructure
genirq/msi: Provide and use msi_domain_set_default_info_flags()
genirq/proc: Take buslock on affinity write
platform-msi: Provide default irq_chip:: Ack
x86/msi: Rename and rework pci_msi_prepare() to cover non-PCI MSI
x86/irq: Add DEV_MSI allocation type
Do I need to include these patches in my series? Thanks!
>
> Thanks,
>
> tglx
>
On 10/30/2020 12:51 PM, Bjorn Helgaas wrote:
> On Fri, Oct 30, 2020 at 11:51:32AM -0700, Dave Jiang wrote:
>> Intel Scalable I/O Virtualization (SIOV) enables sharing of I/O devices
>> across isolated domains through PASID based sub-device partitioning.
>> Interrupt Message Storage (IMS) enables devices to store the interrupt
>> messages in a device-specific optimized manner without the scalability
>> restrictions of the PCIe defined MSI-X capability. IMS is one of the
>> features supported under SIOV.
>>
>> Move SIOV detection code from Intel iommu driver code to common PCI. Making
>> the detection code common allows supported accelerator drivers to query the
>> PCI core for SIOV and IMS capabilities. The support code will add the
>> ability to query the PCI DVSEC capabilities for the SIOV cap.
>
> This patch really does not include anything related to SIOV other than
> adding a little code to *find* the capability. It doesn't add
> anything that actually *uses* it. I think this patch should simply
> add pci_find_dvsec(), and it doesn't need any of this SIOV or IMS
> description.
>
Thanks for the review Bjorn! I'll carve out a patch with just find_dvsec() and
apply your comments and recommendations.
So the intel-iommu driver checks for the SIOV cap. And the idxd driver checks
for SIOV and IMS cap. There will be other upcoming drivers that will check for
such cap too. It is Intel vendor specific right now, but SIOV is public and
other vendors may implement to the spec. Is there a good place to put the common
capability check for that?
There are some other fields in the SIOV dvsec cap, but presently they are not
being utilized. The idxd driver is only interested in making sure that SIOV and
IMS (sub feature) support are present at this point.
- Dave
On Fri, Oct 30, 2020 at 02:20:03PM -0700, Dave Jiang wrote:
>
>
> On 10/30/2020 12:51 PM, Bjorn Helgaas wrote:
> > On Fri, Oct 30, 2020 at 11:51:32AM -0700, Dave Jiang wrote:
> > > Intel Scalable I/O Virtualization (SIOV) enables sharing of I/O devices
> > > across isolated domains through PASID based sub-device partitioning.
> > > Interrupt Message Storage (IMS) enables devices to store the interrupt
> > > messages in a device-specific optimized manner without the scalability
> > > restrictions of the PCIe defined MSI-X capability. IMS is one of the
> > > features supported under SIOV.
> > >
> > > Move SIOV detection code from Intel iommu driver code to common PCI. Making
> > > the detection code common allows supported accelerator drivers to query the
> > > PCI core for SIOV and IMS capabilities. The support code will add the
> > > ability to query the PCI DVSEC capabilities for the SIOV cap.
> >
> > This patch really does not include anything related to SIOV other than
> > adding a little code to *find* the capability. It doesn't add
> > anything that actually *uses* it. I think this patch should simply
> > add pci_find_dvsec(), and it doesn't need any of this SIOV or IMS
> > description.
>
> Thanks for the review Bjorn! I'll carve out a patch with just find_dvsec()
> and apply your comments and recommendations.
>
> So the intel-iommu driver checks for the SIOV cap. And the idxd driver
> checks for SIOV and IMS cap. There will be other upcoming drivers that will
> check for such cap too. It is Intel vendor specific right now, but SIOV is
> public and other vendors may implement to the spec. Is there a good place to
> put the common capability check for that?
Let's wait and see what that code looks like and figure it out then.
We can always move it to the PCI core if it turns out to be generic.
Right now the code only finds a capability and checks a bit in it.
None of that is anything the PCI core is interested in.
> There are some other fields in the SIOV dvsec cap, but presently they are
> not being utilized. The idxd driver is only interested in making sure that
> SIOV and IMS (sub feature) support are present at this point.
I'm a little dubious about code that checks whether support is present
but doesn't actually *do* anything with that support, but as long as
it's outside the PCI core, that's up to you :)
Bjorn
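For illustration, a caller-side check along the lines Bjorn suggests could look roughly like the sketch below. The constant names are taken from the patch under review; idxd_device_ims_supported() is a hypothetical helper for illustration, not code from the series.

#include <linux/pci.h>

/* Hypothetical driver-side probe: locate the Intel SIOV DVSEC and test
 * the IMS bit directly, with no *_supported() wrappers in the PCI core. */
static bool idxd_device_ims_supported(struct pci_dev *pdev)
{
	int pos;
	u32 caps;

	pos = pci_find_dvsec(pdev, PCI_VENDOR_ID_INTEL, PCI_DVSEC_ID_INTEL_SIOV);
	if (pos < 0)
		return false;

	pci_read_config_dword(pdev, pos + PCI_DVSEC_INTEL_SIOV_CAP, &caps);
	return caps & PCI_DVSEC_INTEL_SIOV_CAP_IMS;
}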
On Fri, Oct 30 2020 at 11:50, Dave Jiang wrote:
> --- a/include/linux/interrupt.h
> +++ b/include/linux/interrupt.h
> @@ -487,6 +487,8 @@ extern int irq_get_irqchip_state(unsigned int irq, enum irqchip_irq_state which,
> extern int irq_set_irqchip_state(unsigned int irq, enum irqchip_irq_state which,
> bool state);
>
> +int irq_set_auxdata(unsigned int irq, unsigned int which, u64 val);
....
> +EXPORT_SYMBOL_GPL(irq_set_auxdata);
Again: Read and follow the documentation. This does not belong in this
driver patch and wants to be a standalone preparatory patch.
Also the core change, the irq chip, the iommu support and the device msi
dependency have to be completely separate from this idxd series.
You cannot just dump a pile of patches touching several subsystems at
once plus having dependencies on stuff which is not even agreed on and
merged and then expect that everything just falls into place.
The various subsystems involved are not holding their breath and putting
a lock on development just because you have a series against some random
snapshot.
The dependencies, e.g. the device msi infrastructure, are not going to
make their way magically into the proper maintainer tree either.
If this ever goes into a mergeable state, then the merge logistics for
this whole thing need to be carefully sorted out and it's on you to make
that as simple as possible for every maintainer involved.
Thanks,
tglx
On Fri, Oct 30 2020 at 13:59, Dave Jiang wrote:
> On 10/30/2020 1:48 PM, Thomas Gleixner wrote:
>> On Fri, Oct 30 2020 at 11:50, Dave Jiang wrote:
>>> The code has dependency on Thomas’s MSI restructuring patch series:
>>> https://lore.kernel.org/lkml/[email protected]/
>>
>> which is outdated and not longer applicable.
>
> Yes.... I wasn't sure how to point to these patches from you as a dependency.
>
> irqdomain/msi: Provide msi_alloc/free_store() callbacks
> platform-msi: Add device MSI infrastructure
> genirq/msi: Provide and use msi_domain_set_default_info_flags()
> genirq/proc: Take buslock on affinity write
> platform-msi: Provide default irq_chip:: Ack
> x86/msi: Rename and rework pci_msi_prepare() to cover non-PCI MSI
> x86/irq: Add DEV_MSI allocation type
How can you point at something which is no longer applicable?
> Do I need to include these patches in my series? Thanks!
No. They are NOT part of this series. Prerequisites are separate
entities and your series can be based on them.
So for one you want to make sure that the prerequisites for your IDXD
stuff are going to be merged into the relevant maintainer trees.
To allow people working with your stuff you simply provide an
aggregation git tree which contains all the collected prerequisites.
This aggregation tree needs to be rebased when the prerequisites change
during review or are merged into a maintainer tree/branch.
It's not rocket science and a lot of people do exactly this all the time
in order to coordinate changes which have dependencies over multiple
subsystems.
Thanks,
tglx
On Fri, Oct 30, 2020 at 02:20:03PM -0700, Dave Jiang wrote:
> So the intel-iommu driver checks for the SIOV cap. And the idxd driver
> checks for SIOV and IMS cap. There will be other upcoming drivers that will
> check for such cap too. It is Intel vendor specific right now, but SIOV is
> public and other vendors may implement to the spec. Is there a good place to
> put the common capability check for that?
I'm still really unhappy with these SIOV caps. It was explained this
is just a hack to make up for pci_ims_array_create_msi_irq_domain()
succeeding in VM cases when it doesn't actually work.
Someday this is likely to get fixed, so tying platform behavior to PCI
caps is completely wrong.
This needs to be solved in the platform code,
pci_ims_array_create_msi_irq_domain() should not succeed in these
cases.
Jason
On 10/30/2020 3:45 PM, Jason Gunthorpe wrote:
> On Fri, Oct 30, 2020 at 02:20:03PM -0700, Dave Jiang wrote:
>> So the intel-iommu driver checks for the SIOV cap. And the idxd driver
>> checks for SIOV and IMS cap. There will be other upcoming drivers that will
>> check for such cap too. It is Intel vendor specific right now, but SIOV is
>> public and other vendors may implement to the spec. Is there a good place to
>> put the common capability check for that?
>
> I'm still really unhappy with these SIOV caps. It was explained this
> is just a hack to make up for pci_ims_array_create_msi_irq_domain()
> succeeding in VM cases when it doesn't actually work.
>
> Someday this is likely to get fixed, so tying platform behavior to PCI
> caps is completely wrong.
>
> This needs to be solved in the platform code,
> pci_ims_array_create_msi_irq_domain() should not succeed in these
> cases.
That sounds reasonable. Are you asking that the IMS cap check should gate the
success/failure of pci_ims_array_create_msi_irq_domain() rather than the driver?
>
> Jason
>
On Fri, Oct 30, 2020 at 01:43:07PM -0700, Raj, Ashok wrote:
> So drawing that parallel, do you expect all drivers that call
> pci_register_driver() to be located in drivers/pci? Aren't they scattered
> all over the place ata,scsi, platform drivers and such?
The subsystem is the thing that calls
device_register. pci_register_driver() doesn't do that.
> As Alex pointed out, i915 and handful of s390 drivers that are mdev users
> are not in drivers/vfio. Are you sayint those drivers don't get reviewed?
Past mistakes do not justify continuing to do it wrong.
ARM and PPC went through a huge multi year cleanup moving code out of
arch and into the proper drivers/ directories. We know this is the
correct way to work the development process.
> Your argument seems interesting even entertaining :-). But honestly i'm not finding it
> practical :-). So every caller of mmu_register_notifier() needs to be in
> mm?
mmu notifiers are not a subsystem, they are core library code.
You seem to completely not understand what a subsystem is. :(
> I know you aren't going to give up, but there is little we can do. I want
> the maintainers to make that call and I'm not add more noise to this.
Well, hopefully Vinod will insist on following kernel norms here.
Jason
Ashok,
On Fri, Oct 30 2020 at 13:43, Ashok Raj wrote:
> On Fri, Oct 30, 2020 at 04:30:45PM -0300, Jason Gunthorpe wrote:
>> On Fri, Oct 30, 2020 at 12:23:25PM -0700, Raj, Ashok wrote:
>> It is a different subsystem, different maintainer, and different
>> reviewers.
>>
>> It is a development process problem, it doesn't matter what it is
>> doing.
< skip a lot of non-sensical arguments>
> I know you aren't going to give up, but there is little we can do. I want
> the maintainers to make that call and I'm not add more noise to this.
Jason is absolutely right.
Just because there is historical precedent which does not care about
the differentiation of subsystems is not an argument at all for making
the same mistakes which were made years ago.
IDXD is just infrastructure which provides the base for a variety of
different functionalities, very similar to what multi-function devices
provide. In fact IDXD is pretty much an MFD facility.
Sticking all of it into dmaengine is sloppy at best. The dma engine
related part of IDXD is only a part of the overall functionality.
I'm well aware that it is convenient to just throw everything into
drivers/myturf/ but that makes it neither reviewable nor
maintainable.
What's the problem with restructuring your code in a way which makes it
fit into existing subsystems?
The whole thing - as I pointed out to Dave earlier - is based on 'works
for me' wishful thinking with a blissful ignorance of the development
process and the requirement to split a large problem into the proper
bits and pieces aka. engineering 101.
Thanks,
tglx
Hi Thomas,
On Sat, Oct 31, 2020 at 03:50:43AM +0100, Thomas Gleixner wrote:
> Ashok,
>
> < skip a lot of non-sensical arguments>
Ouch!.. Didn't mean to awaken you like this :-).. apologies.. profusely!
>
> Just because there is historical precendence which does not care about
> the differentiation of subsystems is not an argument at all to make the
> same mistakes which have been made years ago.
>
> IDXD is just infrastructure which provides the base for a variety of
> different functionalities. Very similar to what multi function devices
> provide. In fact IDXD is pretty much a MFD facility.
I'm only asking this to better understand the thought process.
I don't intend to be defensive, I have my hands tied back.. so we will do
whatever you say fits best, per your recommendation.
Not my intent to dig a deeper hole than I have already dug! :-(
IDXD is just a glorified DMA engine, a data mover. It also does a few other
things, so in that sense it's a multi-function facility. But it doesn't offer
distinct functional pieces the way a PCIe multi-function device does, i.e.
it doesn't also do storage or networking.
>
> Sticking all of it into dmaengine is sloppy at best. The dma engine
> related part of IDXD is only a part of the overall functionality.
dmaengine is the basic non-transformational data mover. Doing other operations
or transformations is just the glorified data-mover part, but fundamentally
it is not different.
>
> I'm well aware that it is conveniant to just throw everything into
> drivers/myturf/ but that does neither make it reviewable nor
> maintainable.
That's true when we add a lot of functionality in one place. IDXD doing
mdev support is not offering new functionality. SRIOV PF drivers that support
PF/VF mailboxes are part of PF drivers today. IDXD mdev is precisely playing
that exact role.
If we are doing this just to improve review effectiveness, we would now need
some parent driver, and having these sub-drivers register seemed like a bit of
over-engineering when the sub-drivers actually are an extension of the
base driver and offer nothing more than extending sub-device partitions
of IDXD for guest drivers. These look and feel like IDXD, not another device
interface. In that sense, moving the PF/VF mailboxes into separate drivers
felt a bit odd to me.
Please don't take it the wrong way.
Cheers,
Ashok
On Sat, Oct 31, 2020 at 04:53:59PM -0700, Raj, Ashok wrote:
> If we are doing this just to improve review effectiveness, Now we would need
> some parent driver, and these sub-drivers registering seemed like a bit of
> over-engineering when these sub-drivers actually are an extension of the
> base driver and offer nothing more than extending sub-device partitions
> of IDXD for guest drivers. These look and feel like IDXD, not another device
> interface. In that sense if we move PF/VF mailboxes as
> separate drivers i thought it feels a bit odd.
You need this split anyhow, putting VFIO calls into the main idxd
module is not OK.
Plugging in a PCI device should not auto-load VFIO modules.
Jason
On Fri, Oct 30, 2020 at 03:49:22PM -0700, Dave Jiang wrote:
>
>
> On 10/30/2020 3:45 PM, Jason Gunthorpe wrote:
> > On Fri, Oct 30, 2020 at 02:20:03PM -0700, Dave Jiang wrote:
> > > So the intel-iommu driver checks for the SIOV cap. And the idxd driver
> > > checks for SIOV and IMS cap. There will be other upcoming drivers that will
> > > check for such cap too. It is Intel vendor specific right now, but SIOV is
> > > public and other vendors may implement to the spec. Is there a good place to
> > > put the common capability check for that?
> >
> > I'm still really unhappy with these SIOV caps. It was explained this
> > is just a hack to make up for pci_ims_array_create_msi_irq_domain()
> > succeeding in VM cases when it doesn't actually work.
> >
> > Someday this is likely to get fixed, so tying platform behavior to PCI
> > caps is completely wrong.
> >
> > This needs to be solved in the platform code,
> > pci_ims_array_create_msi_irq_domain() should not succeed in these
> > cases.
>
> That sounds reasonable. Are you asking that the IMS cap check should gate
> the success/failure of pci_ims_array_create_msi_irq_domain() rather than the
> driver?
There shouldn't be an IMS cap at all.
As I understand, the problem here is the only way to establish new
VT-d IRQ routing is by trapping and emulating MSI/MSI-X related
activities and triggering routing of the vectors into the guest.
There is a missing hypercall to allow the guest to do this on its own,
presumably it will someday be fixed so IMS can work in guests.
Until the hypercall is added pci_ims_array_create_msi_irq_domain()
should simply fail in guests. No PCI cap check required.
Jason
Hi Jason
On Mon, Nov 02, 2020 at 09:20:36AM -0400, Jason Gunthorpe wrote:
> > of IDXD for guest drivers. These look and feel like IDXD, not another device
> > interface. In that sense if we move PF/VF mailboxes as
> > separate drivers i thought it feels a bit odd.
>
> You need this split anyhow, putting VFIO calls into the main idxd
> module is not OK.
>
> Plugging in a PCI device should not auto-load VFIO modules.
Yes, I agree that would be a good reason to separate them completely and
glue functionality with private APIs between the 2 modules.
- Separate mdev code from base idxd.
- Separate maintainers, so it's easy to review and include. (But remember
they are heavily inter-dependent. They have to move together.)
Almost all SRIOV drivers today are just configured with some form of Kconfig
and the relevant files are compiled into the same module.
I think in *most* applications idxd would be operating in that mode, where
you have the base driver and mdev parts (like VF) compiled in if so
configured.
Creating these private interfaces for intra-module use is just 1:1 and not
general purpose, and every accelerator needs to create these instances.
I wasn't sure forcibly creating this firewall between the PF/VF interfaces
is actually worth the work every driver is going to require. I can see
where this is required when they offer separate functional interfaces,
in the more confined definition of multi-function we use today.
idxd mdevs are purely a VF extension. They don't provide any different
function, unlike, e.g., an RDMA device that can provide iWARP, IPoIB or
even multiplex storage over IB. IDXD is a fixed-function interface.
Sure, having separate modules helps with that isolation. But I'm not
convinced whether this simplifies or complicates things beyond what is
required for these device types.
Cheers,
Ashok
On Mon, Nov 02, 2020 at 08:20:43AM -0800, Raj, Ashok wrote:
> Creating these private interfaces for intra-module are just 1-1 and not
> general purpose and every accelerator needs to create these instances.
This is where we are going, auxiliary bus should be merged soon, which
is specifically to connect these kinds of devices across subsystems.
Jason
On 11/2/2020 10:19 AM, Jason Gunthorpe wrote:
> On Mon, Nov 02, 2020 at 08:20:43AM -0800, Raj, Ashok wrote:
>> Creating these private interfaces for intra-module are just 1-1 and not
>> general purpose and every accelerator needs to create these instances.
>
> This is where we are going, auxillary bus should be merged soon which
> is specifically to connect these kinds of devices across subsystems
I think this resolves the aux device probe/remove issue via a common bus. But it
does not help with the mdev device needing a lot of the device handling calls
from the parent driver as it shares the same handling as the parent device. My
plan is to export all the needed calls via EXPORT_SYMBOL_NS() so the calls can be
shared in their own namespace between the modules. Do you have any objection to
that?
>
> Jason
>
On Mon, Nov 02, 2020 at 11:18:33AM -0700, Dave Jiang wrote:
>
>
> On 11/2/2020 10:19 AM, Jason Gunthorpe wrote:
> > On Mon, Nov 02, 2020 at 08:20:43AM -0800, Raj, Ashok wrote:
> > > Creating these private interfaces for intra-module are just 1-1 and not
> > > general purpose and every accelerator needs to create these instances.
> >
> > This is where we are going, auxillary bus should be merged soon which
> > is specifically to connect these kinds of devices across subsystems
>
> I think this resolves the aux device probe/remove issue via a common bus.
> But it does not help with the mdev device needing a lot of the device
> handling calls from the parent driver as it share the same handling as the
> parent device.
The intention of auxiliary bus is that the two parts will tightly
couple across some exported function interface.
> My plan is to export all the needed call via EXPORT_SYMBOL_NS() so
> the calls can be shared in its own namespace between the modules. Do
> you have any objection with that?
I think you will be the first to use the namespace stuff for this, it
seems like a good idea and others should probably do so as well.
Jason
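As a rough sketch of the EXPORT_SYMBOL_NS() arrangement discussed above: the parent idxd module exports its device-handling helpers into a private namespace, and only the mdev sub-driver imports that namespace. The helper name and the IDXD namespace below are assumptions for illustration, not code from the series.

/* In the parent idxd module: */
#include <linux/export.h>

struct idxd_wq;				/* owned by the parent driver */

int idxd_wq_request_int_handle(struct idxd_wq *wq, int *handle)
{
	/* ... act on the device on behalf of the mdev sub-driver ... */
	return 0;
}
EXPORT_SYMBOL_NS_GPL(idxd_wq_request_int_handle, IDXD);

/* In the mdev sub-driver module: */
#include <linux/module.h>
MODULE_IMPORT_NS(IDXD);		/* without this, modpost rejects use of the symbol */

Modules that do not import the IDXD namespace cannot link against these exports, which keeps the interface private to the two cooperating modules.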
On Mon, Nov 2, 2020 at 10:26 AM Jason Gunthorpe <[email protected]> wrote:
>
> On Mon, Nov 02, 2020 at 11:18:33AM -0700, Dave Jiang wrote:
> >
> >
> > On 11/2/2020 10:19 AM, Jason Gunthorpe wrote:
> > > On Mon, Nov 02, 2020 at 08:20:43AM -0800, Raj, Ashok wrote:
> > > > Creating these private interfaces for intra-module are just 1-1 and not
> > > > general purpose and every accelerator needs to create these instances.
> > >
> > > This is where we are going, auxillary bus should be merged soon which
> > > is specifically to connect these kinds of devices across subsystems
> >
> > I think this resolves the aux device probe/remove issue via a common bus.
> > But it does not help with the mdev device needing a lot of the device
> > handling calls from the parent driver as it share the same handling as the
> > parent device.
>
> The intention of auxiliary bus is that the two parts will tightly
> couple across some exported function interface.
>
> > My plan is to export all the needed call via EXPORT_SYMBOL_NS() so
> > the calls can be shared in its own namespace between the modules. Do
> > you have any objection with that?
>
> I think you will be the first to use the namespace stuff for this, it
> seems like a good idea and others should probably do so as well.
I was thinking either EXPORT_SYMBOL_NS, or auxiliary bus, because you
should be able to export an ops structure with all the necessary
callbacks. Aux bus seems cleaner because the lifetime rules and
ownership concerns are clearer.
On Mon, Nov 02, 2020 at 10:38:28AM -0800, Dan Williams wrote:
> > I think you will be the first to use the namespace stuff for this, it
> > seems like a good idea and others should probably do so as well.
>
> I was thinking either EXPORT_SYMBOL_NS, or auxiliary bus, because you
> should be able to export an ops structure with all the necessary
> callbacks.
'or'?
Auxiliary bus should not be used with huge arrays of function
pointers... The module providing the device should export a normal
linkable function interface. Putting that in a namespace makes a lot
of sense.
Jason
On Mon, Nov 2, 2020 at 10:52 AM Jason Gunthorpe <[email protected]> wrote:
>
> On Mon, Nov 02, 2020 at 10:38:28AM -0800, Dan Williams wrote:
>
> > > I think you will be the first to use the namespace stuff for this, it
> > > seems like a good idea and others should probably do so as well.
> >
> > I was thinking either EXPORT_SYMBOL_NS, or auxiliary bus, because you
> > should be able to export an ops structure with all the necessary
> > callbacks.
>
> 'or'?
>
> Auxiliary bus should not be used with huge arrays of function
> pointers... The module providing the device should export a normal
> linkable function interface. Putting that in a namespace makes a lot
> of sense.
True, probably needs to be a mixture of both.
> From: Jason Gunthorpe <[email protected]>
> Sent: Monday, November 2, 2020 9:22 PM
>
> On Fri, Oct 30, 2020 at 03:49:22PM -0700, Dave Jiang wrote:
> >
> >
> > On 10/30/2020 3:45 PM, Jason Gunthorpe wrote:
> > > On Fri, Oct 30, 2020 at 02:20:03PM -0700, Dave Jiang wrote:
> > > > So the intel-iommu driver checks for the SIOV cap. And the idxd driver
> > > > checks for SIOV and IMS cap. There will be other upcoming drivers that
> will
> > > > check for such cap too. It is Intel vendor specific right now, but SIOV is
> > > > public and other vendors may implement to the spec. Is there a good
> place to
> > > > put the common capability check for that?
> > >
> > > I'm still really unhappy with these SIOV caps. It was explained this
> > > is just a hack to make up for pci_ims_array_create_msi_irq_domain()
> > > succeeding in VM cases when it doesn't actually work.
> > >
> > > Someday this is likely to get fixed, so tying platform behavior to PCI
> > > caps is completely wrong.
> > >
> > > This needs to be solved in the platform code,
> > > pci_ims_array_create_msi_irq_domain() should not succeed in these
> > > cases.
> >
> > That sounds reasonable. Are you asking that the IMS cap check should gate
> > the success/failure of pci_ims_array_create_msi_irq_domain() rather than
> the
> > driver?
>
> There shouldn't be an IMS cap at all
>
> As I understand, the problem here is the only way to establish new
> VT-d IRQ routing is by trapping and emulating MSI/MSI-X related
> activities and triggering routing of the vectors into the guest.
>
> There is a missing hypercall to allow the guest to do this on its own,
> presumably it will someday be fixed so IMS can work in guests.
Hypercall is VMM specific, while IMS cap provides a VMM-agnostic
interface so any guest driver (if following the spec) can seamlessly
work on all hypervisors.
Thanks
Kevin
On Tue, Nov 03, 2020 at 02:49:27AM +0000, Tian, Kevin wrote:
> > There is a missing hypercall to allow the guest to do this on its own,
> > presumably it will someday be fixed so IMS can work in guests.
>
> Hypercall is VMM specific, while IMS cap provides a VMM-agnostic
> interface so any guest driver (if following the spec) can seamlessly
> work on all hypervisors.
It is a *VMM* issue, not PCI. Adding a PCI cap to describe a VMM issue
is architecturally wrong.
IMS *cannot work* in any hypervisor without some special
hypercall. Just block it in the platform code and forget about the PCI
cap.
Jason
> From: Jason Gunthorpe <[email protected]>
> Sent: Tuesday, November 3, 2020 8:44 PM
>
> On Tue, Nov 03, 2020 at 02:49:27AM +0000, Tian, Kevin wrote:
>
> > > There is a missing hypercall to allow the guest to do this on its own,
> > > presumably it will someday be fixed so IMS can work in guests.
> >
> > Hypercall is VMM specific, while IMS cap provides a VMM-agnostic
> > interface so any guest driver (if following the spec) can seamlessly
> > work on all hypervisors.
>
> It is a *VMM* issue, not PCI. Adding a PCI cap to describe a VMM issue
> is architecturally wrong.
>
> IMS *can not work* in any hypervsior without some special
> hypercall. Just block it in the platform code and forget about the PCI
> cap.
>
It's a per-device thing instead of a platform thing. If the VMM understands
the IMS format of a specific device and virtualizes it to the guest, the
guest can use IMS w/o any hypercall. If the VMM doesn't understand it, it
simply clears the IMS cap bit for this device, which forces the guest to
use the standard PCI MSI/MSI-X interface. On the VMM side the decision is
based on device virtualization knowledge, e.g. in VFIO, instead of
platform virtualization logic. Your platform argument is based on the
hypercall assumption, which is exactly what we want to avoid.
Thanks
Kevin
On Wed, Nov 04, 2020 at 03:41:33AM +0000, Tian, Kevin wrote:
> > From: Jason Gunthorpe <[email protected]>
> > Sent: Tuesday, November 3, 2020 8:44 PM
> >
> > On Tue, Nov 03, 2020 at 02:49:27AM +0000, Tian, Kevin wrote:
> >
> > > > There is a missing hypercall to allow the guest to do this on its own,
> > > > presumably it will someday be fixed so IMS can work in guests.
> > >
> > > Hypercall is VMM specific, while IMS cap provides a VMM-agnostic
> > > interface so any guest driver (if following the spec) can seamlessly
> > > work on all hypervisors.
> >
> > It is a *VMM* issue, not PCI. Adding a PCI cap to describe a VMM issue
> > is architecturally wrong.
> >
> > IMS *can not work* in any hypervsior without some special
> > hypercall. Just block it in the platform code and forget about the PCI
> > cap.
> >
>
> It's per-device thing instead of platform thing. If the VMM understands
> the IMS format of a specific device and virtualize it to the guest,
Please no! Adding device specific emulation is just going down deeper
into this bad architecture.
Interrupts are a platform issue. Using emulation of MSI to dynamically
insert vectors into a VM was a reasonable, but hacky thing. Now it needs
proper platform support.
Jason
> From: Jason Gunthorpe <[email protected]>
> Sent: Wednesday, November 4, 2020 8:40 PM
>
> On Wed, Nov 04, 2020 at 03:41:33AM +0000, Tian, Kevin wrote:
> > > From: Jason Gunthorpe <[email protected]>
> > > Sent: Tuesday, November 3, 2020 8:44 PM
> > >
> > > On Tue, Nov 03, 2020 at 02:49:27AM +0000, Tian, Kevin wrote:
> > >
> > > > > There is a missing hypercall to allow the guest to do this on its own,
> > > > > presumably it will someday be fixed so IMS can work in guests.
> > > >
> > > > Hypercall is VMM specific, while IMS cap provides a VMM-agnostic
> > > > interface so any guest driver (if following the spec) can seamlessly
> > > > work on all hypervisors.
> > >
> > > It is a *VMM* issue, not PCI. Adding a PCI cap to describe a VMM issue
> > > is architecturally wrong.
> > >
> > > IMS *can not work* in any hypervsior without some special
> > > hypercall. Just block it in the platform code and forget about the PCI
> > > cap.
> > >
> >
> > It's per-device thing instead of platform thing. If the VMM understands
> > the IMS format of a specific device and virtualize it to the guest,
>
> Please no! Adding device specific emulation is just going down deeper
> into this bad architecture.
>
> Interrupts is a platform issue. Using emulation of MSI to dynamically
The interrupt controller is a platform issue. The interrupt source is about the device.
> insert vectors to a VM was a reasonable, but hacky thing. Now it needs
> proper platform support.
>
Why is MSI emulation a hacky thing? Isn't it defined by the PCI-SIG? I guess
that I must misunderstand your real point here...
Thanks
Kevin
On Wed, Nov 04, 2020 at 01:34:08PM +0000, Tian, Kevin wrote:
> > From: Jason Gunthorpe <[email protected]>
> > Sent: Wednesday, November 4, 2020 8:40 PM
> >
> > On Wed, Nov 04, 2020 at 03:41:33AM +0000, Tian, Kevin wrote:
> > > > From: Jason Gunthorpe <[email protected]>
> > > > Sent: Tuesday, November 3, 2020 8:44 PM
> > > >
> > > > On Tue, Nov 03, 2020 at 02:49:27AM +0000, Tian, Kevin wrote:
> > > >
> > > > > > There is a missing hypercall to allow the guest to do this on its own,
> > > > > > presumably it will someday be fixed so IMS can work in guests.
> > > > >
> > > > > Hypercall is VMM specific, while IMS cap provides a VMM-agnostic
> > > > > interface so any guest driver (if following the spec) can seamlessly
> > > > > work on all hypervisors.
> > > >
> > > > It is a *VMM* issue, not PCI. Adding a PCI cap to describe a VMM issue
> > > > is architecturally wrong.
> > > >
> > > > IMS *can not work* in any hypervsior without some special
> > > > hypercall. Just block it in the platform code and forget about the PCI
> > > > cap.
> > > >
> > >
> > > It's per-device thing instead of platform thing. If the VMM understands
> > > the IMS format of a specific device and virtualize it to the guest,
> >
> > Please no! Adding device specific emulation is just going down deeper
> > into this bad architecture.
> >
> > Interrupts is a platform issue. Using emulation of MSI to dynamically
>
> Interrupt controller is a platform issue. Interrupt source is about device.
The interrupt controller is responsible to create an addr/data pair
for an interrupt message. It sets the message format and ensures it
routes to the proper CPU interrupt handler. Everything about the
addr/data pair is owned by the platform interrupt controller.
Devices do not create interrupts. They only trigger the addr/data pair
the platform gives them.
> > insert vectors to a VM was a reasonable, but hacky thing. Now it needs
> > proper platform support.
>
> why is MSI emulation a hacky thing? isn't it defined by PCISIG? I guess
> that I must misunderstand your real point here...
It means the interrupt controller in the VM's platform is a fiction,
the addr/data pairs it creates are not real.
A PCI device assigned to a VM is supposed to be fully contained by the
IOMMU, interrupts included, so there is no reason to do MSI emulation
if the VM's interrupt controller is aware of what addr/data pairs it
can use with the device - eg by getting them through a hypercall. This
is much cleaner and supports things like IMS
Trying to do IMS emulation is nutz, the entire point of IMS is that the
device can do what it likes, and emulating that is not going to be
feasible. For instance go read the discussion I had with Thomas about how
an object-centric device would manage interrupts.
Jason
> From: Jason Gunthorpe <[email protected]>
> Sent: Wednesday, November 4, 2020 9:54 PM
>
> On Wed, Nov 04, 2020 at 01:34:08PM +0000, Tian, Kevin wrote:
> > > From: Jason Gunthorpe <[email protected]>
> > > Sent: Wednesday, November 4, 2020 8:40 PM
> > >
> > > On Wed, Nov 04, 2020 at 03:41:33AM +0000, Tian, Kevin wrote:
> > > > > From: Jason Gunthorpe <[email protected]>
> > > > > Sent: Tuesday, November 3, 2020 8:44 PM
> > > > >
> > > > > On Tue, Nov 03, 2020 at 02:49:27AM +0000, Tian, Kevin wrote:
> > > > >
> > > > > > > There is a missing hypercall to allow the guest to do this on its own,
> > > > > > > presumably it will someday be fixed so IMS can work in guests.
> > > > > >
> > > > > > Hypercall is VMM specific, while IMS cap provides a VMM-agnostic
> > > > > > interface so any guest driver (if following the spec) can seamlessly
> > > > > > work on all hypervisors.
> > > > >
> > > > > It is a *VMM* issue, not PCI. Adding a PCI cap to describe a VMM
> issue
> > > > > is architecturally wrong.
> > > > >
> > > > > IMS *can not work* in any hypervsior without some special
> > > > > hypercall. Just block it in the platform code and forget about the PCI
> > > > > cap.
> > > > >
> > > >
> > > > It's per-device thing instead of platform thing. If the VMM understands
> > > > the IMS format of a specific device and virtualize it to the guest,
> > >
> > > Please no! Adding device specific emulation is just going down deeper
> > > into this bad architecture.
> > >
> > > Interrupts is a platform issue. Using emulation of MSI to dynamically
> >
> > Interrupt controller is a platform issue. Interrupt source is about device.
>
> The interrupt controller is responsible to create an addr/data pair
> for an interrupt message. It sets the message format and ensures it
> routes to the proper CPU interrupt handler. Everything about the
> addr/data pair is owned by the platform interrupt controller.
>
> Devices do not create interrupts. They only trigger the addr/data pair
> the platform gives them.
I guess that we may just view it from different angles. On the x86 platform,
an MSI/IMS-capable device directly composes interrupt messages, with the
addr/data pair filled in by the OS. If there is no IOMMU remapping enabled in
the middle, the message just hits the CPU. Your description is possibly
from the software side, e.g. describing the hierarchical IRQ domain
concept?
>
> > > insert vectors to a VM was a reasonable, but hacky thing. Now it needs
> > > proper platform support.
> >
> > why is MSI emulation a hacky thing? isn't it defined by PCISIG? I guess
> > that I must misunderstand your real point here...
>
> It means the interrupt controller in the VM's platform is a fiction,
> the addr/data pairs it creates are not real.
>
> A PCI device assigned to a VM is supposed to be fully contained by the
> IOMMU, interrupts included, so there is no reason to do MSI emulation
> if the VM's interrupt controller is aware of what addr/data pairs it
> can use with the device - eg by getting them through a hypercall. This
> is much cleaner and supports things like IMS
I agree with this point, which is just how pci-hyperv.c works. In concept a Linux
guest driver should be able to use IMS when running on Hyper-V. There
is no such thing for KVM, but possibly one day we will need similar stuff.
Before that happens the guest could choose to simply disallow devmsi
by default in the platform code (inventing a hypercall just for 'disable'
doesn't make sense) and ignore the IMS cap. One small open is whether
this can be done in one central place. The detection of running as a guest
is done in arch-specific code. Do we need to disable devmsi for every arch?
But when talking about virtualization it's not good to assume the guest
behavior. It's perfectly sane to run a guest OS which doesn't implement
any PV stuff (and thus doesn't know it is running in a VM) but does support IMS. In
such a scenario the IMS cap allows the hypervisor to educate the guest
driver to use MSI instead of IMS, as long as the driver follows the device
spec. In this regard I don't think that the IMS cap will be a short-term
thing, although Linux may choose not to use it.
>
> Trying to do IMS emulation is nutz, the entire point of IMS is the
> device can do what it likes, and emulating that is not going to
> feasible. For instance go read the discussion I had with Thomas how a
> object-centric device would manage interrupts.
>
Do you mind providing the link? There were lots of discussions between
you and Thomas. I failed to locate the exact mail when searching above
keywords.
Thanks
Kevin
On Fri, Nov 06, 2020 at 09:48:34AM +0000, Tian, Kevin wrote:
> > The interrupt controller is responsible to create an addr/data pair
> > for an interrupt message. It sets the message format and ensures it
> > routes to the proper CPU interrupt handler. Everything about the
> > addr/data pair is owned by the platform interrupt controller.
> >
> > Devices do not create interrupts. They only trigger the addr/data pair
> > the platform gives them.
>
> I guess that we may just view it from different angles. On x86 platform,
> a MSI/IMS capable device directly composes interrupt messages, with
> addr/data pair filled by OS.
Yes, all platforms work like that. The addr/data pair is *opaque* to
the device. Only the platform interrupt controller component
understands how to form those values.
> If there is no IOMMU remapping enabled in the middle, the message
> just hits the CPU. Your description possibly is from software side,
> e.g. describing the hierarchical IRQ domain concept?
I suppose you could say that. Technically the APIC doesn't form any
addr/data pairs, but the configuration of the APIC, IOMMU and other
platform components define what addr/data pairs are acceptable.
The IRQ domain stuff broadly puts the responsibility for forming these values
in the IRQ layer, which abstracts all the platform details. In Linux
we expect the platform to provide the IRQ domain that can specify
working addr/data pairs.
> I agree with this point, just as how pci-hyperv.c works. In concept Linux
> guest driver should be able to use IMS when running on Hyper-v. There
> is no such thing for KVM, but possibly one day we will need similar stuff.
> Before that happens the guest could choose to simply disallow devmsi
> by default in the platform code (inventing a hypercall just for 'disable'
> doesn't make sense) and ignore the IMS cap. One small open is whether
> this can be done in one central-place. The detection of running as guest
> is done in arch-specific code. Do we need disabling devmsi for every arch?
>
> But when talking about virtualization it's not good to assume the guest
> behavior. It's perfectly sane to run a guest OS which doesn't implement
> any PV stuff (thus don't know running in a VM) but do support IMS. In
> such scenario the IMS cap allows the hypervisor to educate the guest
> driver to use MSI instead of IMS, as long as the driver follows the device
> spec. In this regard I don't think that the IMS cap will be a short-term
> thing, although Linux may choose to not use it.
The IMS flag belongs in the platform, not in the devices.
For instance you could put a "disable IMS" flag in the ACPI tables, in
the config space of the emulated root port, or any other area that
clearly belongs to the platform.
The OS logic would be
- If no IMS information found then use IMS (Bare metal)
- If the IMS disable flag is found then
- If (future) hypercall available and the OS knows how to use it
then use IMS
- If no hypercall found, or no OS knowledge, fail IMS
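A minimal sketch of that decision tree in C, assuming placeholder helpers for wherever the platform records the "disable IMS" information and the (future) hypercall probe; none of these names exist in the kernel today.

#include <linux/types.h>

bool platform_has_ims_disable_flag(void);	/* placeholder */
bool ims_hypercall_available(void);		/* placeholder */

static bool platform_ims_usable(void)
{
	if (!platform_has_ims_disable_flag())
		return true;		/* no flag found: bare metal, IMS just works */

	if (ims_hypercall_available())
		return true;		/* guest, but real addr/data pairs can be obtained */

	return false;			/* guest without the hypercall: fail IMS */
}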
Our devices can use IMS even in pure no-emulation
configurations. Saying that we need to insert complicated,
security-sensitive emulation just to get IMS in the guest is absolutely crazy.
> Do you mind providing the link? There were lots of discussions between
> you and Thomas. I failed to locate the exact mail when searching above
> keywords.
Read through these two threads:
https://lore.kernel.org/linux-hyperv/[email protected]/
https://lore.kernel.org/dmaengine/159534734833.28840.10067945890695808535.stgit@djiang5-desk3.ch.intel.com/
Jason
Hi Jason
On Fri, Nov 06, 2020 at 09:14:15AM -0400, Jason Gunthorpe wrote:
> On Fri, Nov 06, 2020 at 09:48:34AM +0000, Tian, Kevin wrote:
> > > The interrupt controller is responsible to create an addr/data pair
> > > for an interrupt message. It sets the message format and ensures it
> > > routes to the proper CPU interrupt handler. Everything about the
> > > addr/data pair is owned by the platform interrupt controller.
> > >
> > > Devices do not create interrupts. They only trigger the addr/data pair
> > > the platform gives them.
> >
> > I guess that we may just view it from different angles. On x86 platform,
> > a MSI/IMS capable device directly composes interrupt messages, with
> > addr/data pair filled by OS.
>
> Yes, all platforms work like that. The addr/data pair is *opaque* to
> the device. Only the platform interrupt controller component
> understands how to form those values.
True, the addr/data pair is opaque. IMS doesn't dictate what the contents
of the addr/data pair are made of. That is still a platform attribute. IMS simply
controls where the pair is physically stored, which only the device dictates.
>
> > If there is no IOMMU remapping enabled in the middle, the message
> > just hits the CPU. Your description possibly is from software side,
> > e.g. describing the hierarchical IRQ domain concept?
>
> I suppose you could say that. Technically the APIC doesn't form any
> addr/data pairs, but the configuration of the APIC, IOMMU and other
> platform components define what addr/data pairs are acceptable.
>
> The IRQ domain stuff broadly puts responsibilty to form these values
> in the IRQ layer which abstracts all the platform detatils. In Linux
> we expect the platform to provide the IRQ Domain tha can specify
> working addr/data pairs.
>
> > I agree with this point, just as how pci-hyperv.c works. In concept Linux
> > guest driver should be able to use IMS when running on Hyper-v. There
> > is no such thing for KVM, but possibly one day we will need similar stuff.
> > Before that happens the guest could choose to simply disallow devmsi
> > by default in the platform code (inventing a hypercall just for 'disable'
> > doesn't make sense) and ignore the IMS cap. One small open is whether
> > this can be done in one central-place. The detection of running as guest
> > is done in arch-specific code. Do we need disabling devmsi for every arch?
> >
> > But when talking about virtualization it's not good to assume the guest
> > behavior. It's perfectly sane to run a guest OS which doesn't implement
> > any PV stuff (thus don't know running in a VM) but do support IMS. In
> > such scenario the IMS cap allows the hypervisor to educate the guest
> > driver to use MSI instead of IMS, as long as the driver follows the device
> > spec. In this regard I don't think that the IMS cap will be a short-term
> > thing, although Linux may choose to not use it.
>
> The IMS flag belongs in the platform not in the devices.
This support is mostly a SW thing, right? We don't need to muck with
platform/ACPI for that matter.
>
> For instance you could put a "disable IMS" flag in the ACPI tables, in
> the config space of the emuulated root port, or any other areas that
> clearly belong to the platform.
Maybe there is a different interpretation of IMS that I'm missing. Devices
that need more interrupt support than the PCIe standards provide, and how
a device has grouped the storage needs for the addr/data pairs, is a device
attribute.
I missed why ACPI tables should carry such information. If the kernel doesn't
want to support those devices, that is within the kernel's control, which means
the kernel will only use the available MSI-X interfaces. This is legacy support.
>
> The OS logic would be
> - If no IMS information found then use IMS (Bare metal)
> - If the IMS disable flag is found then
> - If (future) hypercall available and the OS knows how to use it
> then use IMS
> - If no hypercall found, or no OS knowledge, fail IMS
>
> Our devices can use IMS even in a pure no-emulation
This is true for IMS as well, but probably not implemented in the kernel as
such. From a HW point of view (take idxd for instance) the facility is
available to a native OS as well. The early RFC supported this for native.
Native devices can have both MSI-X and IMS capability. But as I understand it,
this isn't how we have partitioned things in SW today. We left IMS only for
mdevs. And I agree this would be very useful.
In cases where we want to support interrupt handles for user-space
notification (when the application specifies that in the descriptor), those
could be IMS. The device HW has support for it.
Remember the "Why PASID in IMS entry" discussion?
https://lore.kernel.org/lkml/[email protected]/
Cheers,
Ashok
On Fri, Nov 06, 2020 at 08:48:50AM -0800, Raj, Ashok wrote:
> > The IMS flag belongs in the platform not in the devices.
>
> This support is mostly a SW thing right? we don't need to muck with
> platform/ACPI for that matter.
Something needs to tell the guest OS platform what to do, so you need
a place to put it.
Putting it in a per-device PCI cap is horrible and hacky from an
architectural perspective.
> I missed why ACPI tables should carry such information. If kernel doesn't
> want to support those devices its within kernel control. Which means kernel
> will only use the available MSIx interfaces. This is legacy support.
The platform flag tells the guest that it can (or can't) support IMS
*at all*.
Primarily a guest would be blocked because the VMM provides no way for
the guest to create addr/data pairs.
This has nothing to do with individual devices.
> > The OS logic would be
> > - If no IMS information found then use IMS (Bare metal)
> > - If the IMS disable flag is found then
> > - If (future) hypercall available and the OS knows how to use it
> > then use IMS
> > - If no hypercall found, or no OS knowledge, fail IMS
> >
> > Our devices can use IMS even in a pure no-emulation
>
> This is true for IMS as well. But probably not implemented in the kernel as
> such. From a HW point of view (take idxd for instance) the facility is
> available to native OS as well. The early RFC supported this for native.
I can't follow what you are trying to say here.
Dave said the IMS cap was to indicate that the VMM supported emulation
of IMS so that the VMM can do the MSI addr/data translation as part of
the emulation.
I'm saying emulation will be too horrible for our devices that don't
require *any* emulation.
It is a bad architecture. The platform needs to handle this globally
for all devices, not special hacky emulation things custom-made for
every device out there.
> Native devices can have both MSIx and IMS capability. But as I
> understand this isn't how we have partitioned things in SW today. We
> left IMS only for mdev's. And I agree this would be very useful.
That split is just some decision idxd did, we are thinking about doing
other things in our devices.
Jason
On Fri, Nov 6, 2020 at 9:51 AM Jason Gunthorpe <[email protected]> wrote:
[..]
> > This is true for IMS as well. But probably not implemented in the kernel as
> > such. From a HW point of view (take idxd for instance) the facility is
> > available to native OS as well. The early RFC supported this for native.
>
> I can't follow what you are trying to say here.
I'm having a hard time following the technical cruxes of this debate.
I grokked your feedback on the original IMS proposal way back at the
beginning of this effort (pre-COVID even!), so maybe I can mediate
here as well. Although, SIOV is that much harder for me to spell than
IMS, so bear with me.
> Dave said the IMS cap was to indicate that the VMM supported emulation
> of IMS so that the VMM can do the MSI addr/data translation as part of
> the emulation.
>
> I'm saying emulation will be too horrible for our devices that don't
> require *any* emulation.
This part I think I understand, i.e. why spend any logic emulating IMS
as MSI since the IMS capability can be a paravirtualized interface
from guest to VMM with none of the compromises that MSI would enforce.
Did I get that right?
> It is a bad architecture. The platform needs to handle this globally
> for all devices, not special hacky emulations things custom made for
> every device out there.
I confess I don't quite understand the shape of what "platform needs
to handle this globally" means, but I understand the desired end
result of "no emulation added where not needed". However, would this
mean that the bare-metal idxd driver cannot be used directly in the
guest without modification? For example, as I understand from talking
to Ashok, idxd has some device events like error notification hard-wired
to MSI while data path interrupts are IMS. So even if the IMS side does
not hook up MSI emulation, doesn't idxd still need MSI emulation to
reuse the bare metal driver directly?
> > Native devices can have both MSIx and IMS capability. But as I
> > understand this isn't how we have partitioned things in SW today. We
> > left IMS only for mdev's. And I agree this would be very useful.
>
> That split is just some decision idxd did, we are thinking about doing
> other things in our devices.
Where does the collision happen between what you need for a clean
implementation of an IMS-like capability (/me misses his "dev-msi"
name that got thrown out in the Thomas rewrite), and emulation needed
to not have VF special casing in the idxd driver?
Also feel free to straighten me out (Jason or Ashok) if I've botched
the understanding of this.
On Fri, Nov 06, 2020 at 03:47:00PM -0800, Dan Williams wrote:
> Also feel free to straighten me out (Jason or Ashok) if I've botched
> the understanding of this.
It is pretty simple when you get down to it.
We have a new kernel API that Thomas added:
pci_subdevice_msi_create_irq_domain()
This creates an IRQ domain that hands out addr/data pairs that
trigger interrupts.
On bare metal the addr/data pairs from the IRQ domain are programmed
into the HW in some HW specific way by the device driver that calls
the above function.
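As a hedged example of that HW-specific step, a driver with an IMS-style message store might copy the composed addr/data pair into a per-interrupt slot roughly as below. The struct ims_slot layout and the way the slot is located are assumptions for illustration; the real callback plumbing belongs to Thomas's (unmerged) device MSI series and is not shown here.

#include <linux/io.h>
#include <linux/msi.h>

struct ims_slot {			/* assumed device-specific layout */
	u32 address_lo;
	u32 address_hi;
	u32 data;
	u32 ctrl;
};

static void ims_slot_write_msg(struct ims_slot __iomem *slot,
			       const struct msi_msg *msg)
{
	iowrite32(msg->address_lo, &slot->address_lo);
	iowrite32(msg->address_hi, &slot->address_hi);
	iowrite32(msg->data, &slot->data);
	ioread32(&slot->ctrl);		/* read back to flush the posted writes */
}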
On (kvm) virtualization the addr/data pair the IRQ domain hands out
doesn't work. It is some fake thing.
To make this work on normal MSI/MSI-X the VMM implements emulation of
the standard MSI/MSI-X programming and swaps the fake addr/data pair
for a real one obtained from the hypervisor IRQ domain.
To "deal" with this issue the SIOV spec suggests to add a per-device
PCI Capability that says "IMS works". Which means either:
- This is bare metal, so of course it works
- The VMM is trapping and emulating whatever the device specific IMS
programming is.
The idea being that a VMM can never advertise the IMS cap flag to the
guest unles the VMM provides a device specific driver that does device
specific emulation to capture the addr/data pair. Remeber IMS doesn't
say how to program the addr/data pair! Every device is unique!
On something like IDXD this emulation is not so hard, on something
like mlx5 this is completely unworkable. Further we never do
emulation on our devices, they always pass native hardware through,
even for SIOV-like cases.
In the end pci_subdevice_msi_create_irq_domain() is a platform
function. Either it should work completely on every device with no
device-specific emulation required in the VMM, or it should not work
at all and return -EOPNOTSUPP.
The only sane way to implement this generically is for the VMM to
provide a hypercall to obtain a real *working* addr/data pair(s) and
then have the platform hand those out from
pci_subdevice_msi_create_irq_domain().
All IMS device drivers will work correctly. No VMM device emulation is
ever needed to translate addr/data pairs.
Earlier in this thread Kevin said Hyper-V is already working this way,
even for MSI/MSI-X. To me this says it is fundamentally a KVM platform
problem and it should not be solved by PCI capability flags.
Jason
On Fri, Nov 06 2020 at 09:48, Kevin Tian wrote:
>> From: Jason Gunthorpe <[email protected]>
>> On Wed, Nov 04, 2020 at 01:34:08PM +0000, Tian, Kevin wrote:
>> The interrupt controller is responsible to create an addr/data pair
>> for an interrupt message. It sets the message format and ensures it
>> routes to the proper CPU interrupt handler. Everything about the
>> addr/data pair is owned by the platform interrupt controller.
>>
>> Devices do not create interrupts. They only trigger the addr/data pair
>> the platform gives them.
>
> I guess that we may just view it from different angles. On x86 platform,
> a MSI/IMS capable device directly composes interrupt messages, with
> addr/data pair filled by OS. If there is no IOMMU remapping enabled in
> the middle, the message just hits the CPU. Your description possibly
> is from software side, e.g. describing the hierarchical IRQ domain
> concept?
No. The device composes nothing. If the interrupt is raised in the
device then the MSI block sends the message which was composed by the OS
and stored in the device's message store. For PCI/MSI that's the MSI or
MSI-X table, and for IMS that's either on-device memory (as IDXD uses) or
some completely different location which Jason described.
This has absolutely nothing to do with the x86 platform. MSI is an
architecture-independent mechanism: send whatever the OS put into the
storage to raise an interrupt in the CPU. The device does not even know
whether that message is going to be intercepted by an interrupt
remapping unit or not.
Stop claiming that any of this has anything to do with x86. It has
absolutely nothing to do with x86 and looking at MSI from an x86
perspective instead of looking at it from the architecture agnostic
technical reality of MSI is the reason why we have this discussion at
all.
We had a similar discussion about the way IMS interrupts have to be
dealt with in terms of irq domains. Can you finally stop looking at
everything as a big x86/intel/platform lump and understand that things
are very well structured and separated both at the hardware and at the
software level?
> Do you mind providing the link? There were lots of discussions between
> you and Thomas. I failed to locate the exact mail when searching above
> keywords.
In this thread: [email protected] and you were on
Cc
Thanks,
tglx
On Fri, Nov 6, 2020 at 4:12 PM Jason Gunthorpe <[email protected]> wrote:
>
> On Fri, Nov 06, 2020 at 03:47:00PM -0800, Dan Williams wrote:
[..]
> The only sane way to implement this generically is for the VMM to
> provide a hypercall to obtain a real *working* addr/data pair(s) and
> then have the platform hand those out from
> pci_subdevice_msi_create_irq_domain().
Yeah, that seems a logical attach point for this magic. Appreciate you
taking the time to lay it out.
Hi Jason
Thanks, it's now clear what you had mentioned earlier.
I had a couple of questions/clarifications below. Thanks for working
through this.
On Fri, Nov 06, 2020 at 08:12:07PM -0400, Jason Gunthorpe wrote:
> On Fri, Nov 06, 2020 at 03:47:00PM -0800, Dan Williams wrote:
>
> > Also feel free to straighten me out (Jason or Ashok) if I've botched
> > the understanding of this.
>
> It is pretty simple when you get down to it.
>
> We have a new kernel API that Thomas added:
>
> pci_subdevice_msi_create_irq_domain()
>
> This creates an IRQ domain that hands out addr/data pairs that
> trigger interrupts.
>
> On bare metal the addr/data pairs from the IRQ domain are programmed
> into the HW in some HW specific way by the device driver that calls
> the above function.
>
> On (kvm) virtualization the addr/data pair the IRQ domain hands out
> doesn't work. It is some fake thing.
Is it really some fake thing? I thought the vCPU and vector are real
for a guest, and the VMM ensures that when interrupts are delivered they are either:
1. Handled by VMM first and then injected to guest
2. Handled in a Posted Interrupt manner, and injected to guest
when it resumes. It can be delivered directly if guest was running
when the interrupt arrived.
>
> To make this work on normal MSI/MSI-X the VMM implements emulation of
> the standard MSI/MSI-X programming and swaps the fake addr/data pair
> for a real one obtained from the hypervisor IRQ domain.
>
> To "deal" with this issue the SIOV spec suggests to add a per-device
> PCI Capability that says "IMS works". Which means either:
> - This is bare metal, so of course it works
> - The VMM is trapping and emulating whatever the device specific IMS
> programming is.
>
> The idea being that a VMM can never advertise the IMS cap flag to the
> guest unles the VMM provides a device specific driver that does device
> specific emulation to capture the addr/data pair. Remeber IMS doesn't
> say how to program the addr/data pair! Every device is unique!
>
> On something like IDXD this emulation is not so hard, on something
> like mlx5 this is completely unworkable. Further we never do
> emulation on our devices, they always pass native hardware through,
> even for SIOV-like cases.
So is that true for interrupts too? Possibly you have the interrupt
entries sitting in memory resident on the device? Don't we need the
VMM to ensure they are brokered in one of the two ways
above? If the guest creates some addr in the 0xfee... range,
how do we take care of interrupt remapping and such without any VMM
assist?
It's probably a gap in my understanding.
>
> In the end pci_subdevice_msi_create_irq_domain() is a platform
> function. Either it should work completely on every device with no
> device-specific emulation required in the VMM, or it should not work
> at all and return -EOPNOTSUPP.
>
> The only sane way to implement this generically is for the VMM to
> provide a hypercall to obtain a real *working* addr/data pair(s) and
> then have the platform hand those out from
> pci_subdevice_msi_create_irq_domain().
>
> All IMS device drivers will work correctly. No VMM device emulation is
> ever needed to translate addr/data pairs.
>
That's true. Probably this can work the same even for MSI-X types too then?
When we do interrupt remapping support in the guest, which would be required
if we support x2apic in the guest, I think this is something we should look into more
carefully to make this work.
One criterion we have generally tried to follow is that the driver running in the host
and in the guest is the same; if needed, the alternate path is selected through some
capability detection so that it can be plumbed in a generic way.
I agree with the overall idea and we should certainly take that into consideration
when we need IMS-in-guest support and in the context of interrupt remapping.
Hopefully I understood the overall concept. If I misunderstood any of this
please let me know.
Cheers,
Ashok
On Sun, 2020-11-08 at 10:11 -0800, Raj, Ashok wrote:
> Hi Jason
>
> Thanks, its now clear what you had mentioned earlier.
>
> I had couple questions/clarifications below. Thanks for working
> through this.
>
> On Fri, Nov 06, 2020 at 08:12:07PM -0400, Jason Gunthorpe wrote:
> > On Fri, Nov 06, 2020 at 03:47:00PM -0800, Dan Williams wrote:
> >
> > > Also feel free to straighten me out (Jason or Ashok) if I've botched
> > > the understanding of this.
> >
> > It is pretty simple when you get down to it.
> >
> > We have a new kernel API that Thomas added:
> >
> > pci_subdevice_msi_create_irq_domain()
> >
> > This creates an IRQ domain that hands out addr/data pairs that
> > trigger interrupts.
> >
> > On bare metal the addr/data pairs from the IRQ domain are programmed
> > into the HW in some HW specific way by the device driver that calls
> > the above function.
> >
> > On (kvm) virtualization the addr/data pair the IRQ domain hands out
> > doesn't work. It is some fake thing.
>
> Is it really some fake thing? I thought the vCPU and vector are real
> for a guest, and VMM ensures when interrupts are delivered they are either.
>
> 1. Handled by VMM first and then injected to guest
> 2. Handled in a Posted Interrupt manner, and injected to guest
> when it resumes. It can be delivered directly if guest was running
> when the interrupt arrived.
>
> >
> > To make this work on normal MSI/MSI-X the VMM implements emulation of
> > the standard MSI/MSI-X programming and swaps the fake addr/data pair
> > for a real one obtained from the hypervisor IRQ domain.
> >
> > To "deal" with this issue the SIOV spec suggests to add a per-device
> > PCI Capability that says "IMS works". Which means either:
> > - This is bare metal, so of course it works
> > - The VMM is trapping and emulating whatever the device specific IMS
> > programming is.
> >
> > The idea being that a VMM can never advertise the IMS cap flag to the
> > guest unles the VMM provides a device specific driver that does device
> > specific emulation to capture the addr/data pair. Remeber IMS doesn't
> > say how to program the addr/data pair! Every device is unique!
> >
> > On something like IDXD this emulation is not so hard, on something
> > like mlx5 this is completely unworkable. Further we never do
> > emulation on our devices, they always pass native hardware through,
> > even for SIOV-like cases.
>
> So is that true for interrupts too? Possibly you have the interrupt
> entries sitting in memory resident on the device? Don't we need the
> VMM to ensure they are brokered by VMM in either one of the two ways
> above? What if the guest creates some addr in the 0xfee... range
> how do we take care of interrupt remapping and such without any VMM
> assist?
>
> Its probably a gap in my understanding.
>
> >
> > In the end pci_subdevice_msi_create_irq_domain() is a platform
> > function. Either it should work completely on every device with no
> > device-specific emulation required in the VMM, or it should not work
> > at all and return -EOPNOTSUPP.
> >
> > The only sane way to implement this generically is for the VMM to
> > provide a hypercall to obtain a real *working* addr/data pair(s) and
> > then have the platform hand those out from
> > pci_subdevice_msi_create_irq_domain().
> >
> > All IMS device drivers will work correctly. No VMM device emulation is
> > ever needed to translate addr/data pairs.
> >
>
> That's true. Probably this can work the same even for MSIx types too then?
>
> When we do interrupt remapping support in guest which would be required
> if we support x2apic in guest, I think this is something we should look into more
> carefully to make this work.
No, interrupt remapping is not required for X2APIC in guests.
They can have X2APIC and up to 32768 CPUs without needing interrupt
remapping at all. Only if they want more than 32768 vCPUs, or to do
nested virtualisation and actually remap for the benefit of *their*
(L2+) guests would they need IR.
On Fri, Nov 06 2020 at 20:12, Jason Gunthorpe wrote:
> All IMS device drivers will work correctly. No VMM device emulation is
> ever needed to translate addr/data pairs.
>
> Earlier in this thread Kevin said hyper-v is already working this way,
> even for MSI/MSI-X. To me this says it is fundamentally a KVM platform
> problem and it should not be solved by PCI capability flags.
I mostly agree but want to add a few clarifications about the
terminology and the boundaries because I think there is where lot of the
confusion comes from.
Let me go back to the basic structure both at the hardware and at the
software level.
The basic structure is:
[CPU] -- [Bridge] -- Bus -- [Device]
This applies to all kind of buses where the bridge directly translates
into the CPUs address space. Now let's look at the boundaries:
              |
              |
[CPU] -- [Bri | dge] -- Bus -- [Device]
              |
              |
The boundary is in the middle of the bridge because the CPU side of the
bridge is obviously CPU and therefore architecture specific. The Bus
side of the bridge is architecture agnostic.
Now let's add an IOMMU:
[CPU] -- [IOMMU] -- [Bridge] -- Bus -- [Device]
and in theory the boundary moves now to:
             |
             |
[CPU] -- [IO | MMU] -- [Bridge] -- Bus -- [Device]
             |
             |
because with an IOMMU the bridge could become CPU and architecture
agnostic. In reality this is not the case as the bridge is still the
same thing.
Now let's look at MSI. As established above, the Bus and the Device are
CPU and architecture agnostic and the Device merely uses a composed
message which is stored at some place accessible to the device to send
that message when it raises an interrupt. So where is this message
composed?
The basic case:
              |
              |
[CPU] -- [Bri | dge] -- Bus -- [Device]
              |
Alloc +
Compose                        Store  Use
The Bridge is irrelevant here as it is just involved in the
transport; from the view of the interrupt subsystem the Bridge is
only transport.
The IOMMU case:
             |
             |
[CPU] -- [IO | MMU] -- [Bridge] -- Bus -- [Device]
             |
         Alloc +
Alloc    Compose                           Store  Use
That's exactly reflected in hierarchical irq domains:
              |
              |
[CPU] -- [Bri | dge] -- Bus -- [Device]
              |
Alloc +
Compose                        Store  Use

Vectordomain                   Busdomain
and:
             |
             |
[CPU] -- [IO | MMU] -- [Bridge] -- Bus -- [Device]
             |
         Alloc +
Alloc    Compose                           Store  Use

Vectordomain Remapdomain                   Busdomain
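(Minimal sketch of how this layering maps onto the kernel's hierarchical irq domain API; the names and empty ops below are placeholders, not the actual dev-msi code.)

#include <linux/irqdomain.h>

/* Placeholder ops; a real Busdomain would fill in .alloc/.free and
 * delegate allocation/composition to its parent (remap or vector) domain. */
static const struct irq_domain_ops bus_domain_ops = { };

static struct irq_domain *create_bus_domain(struct irq_domain *parent,
					    struct fwnode_handle *fwnode)
{
	/* The device-side domain only stores/masks the message; Alloc and
	 * Compose happen in the parent domains shown above. */
	return irq_domain_create_hierarchy(parent, 0, 0, fwnode,
					   &bus_domain_ops, NULL);
}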
Now if we look at the virtualization scenario and device pass-through
then the structure in the guest view is not any different from the basic
case. This works with PCI-MSI[X] and the IDXD IMS variant because the
hypervisor can trap the access to the storage and translate the message:
              |
              |
[CPU] -- [Bri | dge] -- Bus -- [Device]
              |
Alloc +
Compose                        Store  Use
                                 |
                                 | Trap
                                 v
                        Hypervisor translates and stores
But obviously with an IMS storage location which is software controlled
by the guest side driver (the case Jason is interested in) the above
cannot work for obvious reasons.
That means the guest needs a way to ask the hypervisor for a proper
translation, i.e. a hypercall. Now where to do that? Looking at the
above remapping case it's pretty obvious:
             |
             |
[CPU] -- [VI | RT] -- [Bridge] -- Bus -- [Device]
             |
Alloc        "Compose"                    Store  Use

Vectordomain HCALLdomain                  Busdomain
                 |     ^
                 |     |
                 v     |
               Hypervisor
            Alloc + Compose
Why? Because it reflects the boundaries and leaves the busdomain part
agnostic as it should be. And it works for _all_ variants of Busdomains.
Now the question which I can't answer is whether this can work correctly
in terms of isolation. If the IMS storage is in guest memory (queue
storage) then the guest driver can obviously write random crap into it
which the device will happily send. (For MSI and IDXD style IMS it
still can trap the store).
Is the IOMMU/Interrupt remapping unit able to catch such messages which
go outside the space to which the guest is allowed to signal to? If yes,
problem solved. If no, then IMS storage in guest memory can't ever work.
Coming back to this:
> In the end pci_subdevice_msi_create_irq_domain() is a platform
> function. Either it should work completely on every device with no
> device-specific emulation required in the VMM, or it should not work
> at all and return -EOPNOTSUPP.
The subdevice domain is a 'Busdomain' according to the structure
above. It does not and should never have any clue about the underlying
system. It's in the agnostic part and always works. It simply does not
care what's underneath. So it won't return -EOPNOTSUPP.
What it has to do is to transport the IMS in queue memory requirement to
the underlying parent domain.
So in case the HCALL domain is missing, the Vector domain needs to
return an error code on domain creation. If the HCALL domain is there
then the domain creation works and in case of actual interrupt
allocation the hypercall either returns a valid composed message or an
appropriate error code.
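(A hypothetical sketch of that compose step. hv_compose_msi_msg_hcall() is an invented stand-in for whatever paravirt interface would be defined; the real struct irq_chip ->irq_compose_msi_msg() callback returns void, so this is written as a helper it could call.)

#include <linux/irq.h>
#include <linux/kernel.h>
#include <linux/msi.h>

/* Hypothetical paravirt interface; does not exist today. */
int hv_compose_msi_msg_hcall(unsigned long hwirq, u64 *addr, u32 *data);

static int hcall_compose_msi_msg(struct irq_data *data, struct msi_msg *msg)
{
	u64 addr;
	u32 mdata;
	int ret;

	/* Ask the hypervisor for a working addr/data pair for this interrupt */
	ret = hv_compose_msi_msg_hcall(irqd_to_hwirq(data), &addr, &mdata);
	if (ret)
		return ret;	/* no hypercall, or the VMM refused to route it */

	msg->address_hi = upper_32_bits(addr);
	msg->address_lo = lower_32_bits(addr);
	msg->data = mdata;
	return 0;
}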
But there's a catch:
This only works when the guest OS actually knows that it runs in a
VM. If the guest can't figure that out, i.e. via CPUID, this cannot be
solved because from the guest OS view that's the same as running on bare
metal. Obviously on bare metal the Vector domain can and must handle
this.
So this needs some thought.
Thanks,
tglx
On Sun, 2020-11-08 at 19:47 +0100, Thomas Gleixner wrote:
> This only works when the guest OS actually knows that it runs in a
> VM. If the guest can't figure that out, i.e. via CPUID, this cannot be
> solved because from the guest OS view that's the same as running on bare
> metal. Obviously on bare metal the Vector domain can and must handle
> this.
>
> So this needs some thought.
The problem here is that Intel implemented interrupt remapping in a way
which is anathema to structured, ordered IRQ domains.
When a guest writes an MSI message (addr/data) to the MSI table of a
PCI device which has been assigned to that guest, it *doesn't* properly
inherit the MSI composition from a parent irqdomain which knows about
the (host-side) IOMMU.
What actually happens is the hypervisor *traps* the writes to the
device's MSI table, and translates them *then*. In *precisely* the
fashion which we're trying to avoid for IMS.
Now, you can imagine a world where it wasn't like this, where
Remappable Format MSI messages don't exist, and where we let guests
write native MSI message to the device without trapping — and where the
IOMMU then sees the incoming interrupt and has to map the APIC ID to a
*virtual* CPU for that guest, based on the PCI source-id of the device.
In that world, IMS would work naturally. But that isn't how Intel
designed interrupt remapping. They *designed* to have to trap and
translate as the message is written to the device.
So it does look like we're going to need a hypercall interface to
compose an MSI message on behalf of the guest, for IMS to use. In fact
PCI devices assigned to a guest could use that too, and then we'd only
need to trap-and-remap any attempt to write a Compatibility Format MSI
to the device's MSI table, while letting Remappable Format messages get
written directly.
We'd also need a way for an OS running on bare metal to *know* that
it's on bare metal and can just compose MSI messages for itself. Since
we do expect bare metal to have an IOMMU, perhaps that is just a
feature flag on the IOMMU?
That or Intel needs to fix the IOMMU to do proper virtualisation and
actually translate "Compatibility Format" MSIs for a guest too.
On Fri, Nov 06 2020 at 09:14, Jason Gunthorpe wrote:
> On Fri, Nov 06, 2020 at 09:48:34AM +0000, Tian, Kevin wrote:
> For instance you could put a "disable IMS" flag in the ACPI tables, in
> the config space of the emuulated root port, or any other areas that
> clearly belong to the platform.
>
> The OS logic would be
> - If no IMS information found then use IMS (Bare metal)
> - If the IMS disable flag is found then
> - If (future) hypercall available and the OS knows how to use it
> then use IMS
> - If no hypercall found, or no OS knowledge, fail IMS
That does not work because an older hypervisor would not have that
disable flag and the guest kernel would assume it is running on bare metal (if
no other indicators are there).
Thanks
tglx
On Sun, Nov 08 2020 at 19:36, David Woodhouse wrote:
> On Sun, 2020-11-08 at 19:47 +0100, Thomas Gleixner wrote:
>> So this needs some thought.
>
> The problem here is that Intel implemented interrupt remapping in a way
> which is anathema to structured, ordered IRQ domains.
>
> When a guest writes an MSI message (addr/data) to the MSI table of a
> PCI device which has been assigned to that guest, it *doesn't* properly
> inherit the MSI composition from a parent irqdomain which knows about
> the (host-side) IOMMU.
>
> What actually happens is the hypervisor *traps* the writes to the
> device's MSI table, and translates them *then*.
That's what I showed in the ascii art :)
> In *precisely* the fashion which we're trying to avoid for IMS.
At least for the IMS variant where the storage is not in trappable
device memory.
> Now, you can imagine a world where it wasn't like this, where
> Remappable Format MSI messages don't exist, and where we let guests
> write native MSI message to the device without trapping — and where the
> IOMMU then sees the incoming interrupt and has to map the APIC ID to a
> *virtual* CPU for that guest, based on the PCI source-id of the
> device.
That would be not convoluted enough and make too much sense.
> In that world, IMS would work naturally. But that isn't how Intel
> designed interrupt remapping. They *designed* to have to trap and
> translate as the message is written to the device.
>
> So it does look like we're going to need a hypercall interface to
> compose an MSI message on behalf of the guest, for IMS to use. In fact
> PCI devices assigned to a guest could use that too, and then we'd only
> need to trap-and-remap any attempt to write a Compatibility Format MSI
> to the device's MSI table, while letting Remappable Format messages get
> written directly.
Yes, if we have the HCALL domain then the message composed by the
hypervisor is valid for everything, not only IMS. That's why I left out
any specifics on the Busdomain side. It does not matter which kind of
bus that is. The only mechanics provided by the busdomain are
to store the precomposed message and eventually provide mask/unmask at
that level.
> We'd also need a way for an OS running on bare metal to *know* that
> it's on bare metal and can just compose MSI messages for itself. Since
> we do expect bare metal to have an IOMMU, perhaps that is just a
> feature flag on the IOMMU?
There are still CPUs w/o IOMMU out there and new ones are shipped.
So you would basically mandate that IMS with memory storage can only
work on bare metal when the CPU has an IOMMU.
Jason said in [1]: "For x86 I think we could accept linking this to
IOMMU, if really necessary."
OTOH, what's the chance that a guest runs on something which
1) Does not have X86_FEATURE_HYPERVISOR set in cpuid 1/ECX
and
2) Cannot be identified as Xen domain
and
3) Does not have a DMI vendor entry which identifies the
virtualization solution (we don't use that today, but
adding that table is trivial enough)
and
4) Has such an IMS device passed through?
Possible, yes. Likely, no. Do we care?
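(A rough sketch of that check; the DMI vendor strings below are invented examples and no such helper exists today.)

#include <linux/dmi.h>
#include <linux/types.h>
#include <xen/xen.h>
#include <asm/cpufeature.h>

static bool likely_running_in_vm(void)
{
	if (boot_cpu_has(X86_FEATURE_HYPERVISOR))
		return true;
	if (xen_domain())
		return true;
	/* A DMI vendor table would have to be agreed on; examples only */
	if (dmi_name_in_vendors("QEMU") || dmi_name_in_vendors("VMware, Inc."))
		return true;
	return false;
}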
> That or Intel needs to fix the IOMMU to do proper virtualisation and
> actually translate "Compatibility Format" MSIs for a guest too.
Is that going to happen before I retire?
Thanks,
tglx
[1] https://lore.kernel.org/r/[email protected]
On Sun, Nov 08 2020 at 22:09, David Woodhouse wrote:
>> On Fri, Nov 06 2020 at 09:14, Jason Gunthorpe wrote:
>>> On Fri, Nov 06, 2020 at 09:48:34AM +0000, Tian, Kevin wrote:
>>> For instance you could put a "disable IMS" flag in the ACPI tables, in
>>> the config space of the emuulated root port, or any other areas that
>>> clearly belong to the platform.
>>>
>>> The OS logic would be
>>> - If no IMS information found then use IMS (Bare metal)
>>> - If the IMS disable flag is found then
>>> - If (future) hypercall available and the OS knows how to use it
>>> then use IMS
>>> - If no hypercall found, or no OS knowledge, fail IMS
>>
>> That does not work because an older hypervisor would not have that
>> disable flag and the guest kernel would assume to be on bare metal (if
>> no other indicators are there).
>
> In the absence of a forward-thinking design from Intel perhaps we could
Just to be fair the AMD interrupt remapping is not any better in that
regard.
Thanks,
tglx
> On Fri, Nov 06 2020 at 09:14, Jason Gunthorpe wrote:
>> On Fri, Nov 06, 2020 at 09:48:34AM +0000, Tian, Kevin wrote:
>> For instance you could put a "disable IMS" flag in the ACPI tables, in
>> the config space of the emuulated root port, or any other areas that
>> clearly belong to the platform.
>>
>> The OS logic would be
>> - If no IMS information found then use IMS (Bare metal)
>> - If the IMS disable flag is found then
>> - If (future) hypercall available and the OS knows how to use it
>> then use IMS
>> - If no hypercall found, or no OS knowledge, fail IMS
>
> That does not work because an older hypervisor would not have that
> disable flag and the guest kernel would assume to be on bare metal (if
> no other indicators are there).
In the absence of a forward-thinking design from Intel perhaps we could
use the existence of an IOMMU with interrupt remapping and not caching
mode as the indication that it's bare metal?
--
dwmw2
On Sun, Nov 08, 2020 at 07:47:24PM +0100, Thomas Gleixner wrote:
>
> That means the guest needs a way to ask the hypervisor for a proper
> translation, i.e. a hypercall. Now where to do that? Looking at the
> above remapping case it's pretty obvious:
>
>
> |
> |
> [CPU] -- [VI | RT] -- [Bridge] -- Bus -- [Device]
> |
> Alloc "Compose" Store Use
>
> Vectordomain HCALLdomain Busdomain
> | ^
> | |
> v |
> Hypervisor
> Alloc + Compose
Yes, this describes what I have been thinking.
> Now the question which I can't answer is whether this can work correctly
> in terms of isolation. If the IMS storage is in guest memory (queue
> storage) then the guest driver can obviously write random crap into it
> which the device will happily send. (For MSI and IDXD style IMS it
> still can trap the store).
There are four cases of interest here:
1) Bare metal, PF and VF devices just deliver whatever addr/data pairs
to the APIC. IMS works perfectly with pci_subdevice_msi_create_irq_domain()
2) SRIOV VF assigned to the guest.
The guest can cause any MemWr TLP to any addr/data pair
and the iommu/platform/vmm is supposed to use the
Bus/device/function to isolate & secure the interrupt address
range.
IMS can work in the guest if the guest knows the details of the
address range and can make hypercalls to setup routing. So
pci_subdevice_msi_create_irq_domain() works if the hypercalls
exist and fails if they don't.
3) SIOV sub device assigned to the guest.
The difference between SIOV and SRIOV is the device must attach a
PASID to every TLP triggered by the guest. Logically we'd expect
when IMS is used in this situation the interrupt MemWr is tagged
with bus/device/function/PASID to uniquely ID the guest and the same
security protection scheme from #2 applies.
4) SIOV sub device assigned to the guest, but with emulation.
This SIOV device cannot tag interrupts with PASID so cannot do #2
(or the platform cannot receive a PASID tagged interrupt message).
Since the interrupts are being delivered with TLPs pointing at the
hypervisor the only solution is for the hypervisor to exclusively
control the interrupt table. MSI table like emulation for IMS is
needed and the hypervisor will use pci_subdevice_msi_create_irq_domain()
to get the real interrupts.
pci_subdevice_msi_create_irq_domain() needs to return the 'fake'
addr/data pairs which are actually an ABI between the guest and
hypervisor carried in the hidden hypercall of the emulation.
(ie it works like MSI works today)
IDXD is worrying about case #4, I think, but I didn't follow, in that
whole discussion about the IMS table layout, whether they PASID-tag the IMS
MemWr or not. Ashok, can you clarify?
> Is the IOMMU/Interrupt remapping unit able to catch such messages which
> go outside the space to which the guest is allowed to signal to? If yes,
> problem solved. If no, then IMS storage in guest memory can't ever work.
Right. Only PASID on the interrupt messages can resolve this securely.
> So in case that the HCALL domain is missing, the Vector domain needs
> return an error code on domain creation. If the HCALL domain is there
> then the domain creation works and in case of actual interrupt
> allocation the hypercall either returns a valid composed message or an
> appropriate error code.
Yes
> But there's a catch:
>
> This only works when the guest OS actually knows that it runs in a
> VM. If the guest can't figure that out, i.e. via CPUID, this cannot be
> solved because from the guest OS view that's the same as running on bare
> metal. Obviously on bare metal the Vector domain can and must handle
> this.
Yes
The flip side is that today, the way pci_subdevice_msi_create_irq_domain()
works, a VF using it on bare metal will succeed, and if that same VF is
assigned to a guest then pci_subdevice_msi_create_irq_domain()
still succeeds but the interrupt never comes - so the driver is broken.
Yes, if we add some ACPI/etc flag that is not going to magically fix
old kvm's, but at least *something* exists that works right and
generically.
If we follow Intel's path then we need special KVM level support for
*every* device, PCI cap mangling and so on. Forever. Sounds horrible
to me..
This feels like one of these things where no matter what we do
something is broken. Picking the least breakage is the challenge here.
> So this needs some thought.
I think your HCALL diagram is the only sane architecture.
If we go that way then case #4 will still work; in this case the HCALL
will return addr/data pairs that conform to what the emulation
expects, or fail if the VMM can't do emulation for the device.
Jason
On Sun, Nov 08, 2020 at 06:34:55PM +0000, David Woodhouse wrote:
> >
> > When we do interrupt remapping support in guest which would be required
> > if we support x2apic in guest, I think this is something we should look into more
> > carefully to make this work.
>
> No, interrupt remapping is not required for X2APIC in guests
>
> They can have X2APIC and up to 32768 CPUs without needing interrupt
How is this made available today without interrupt remapping?
I thought without IR, the destination ID is still limited to only 8 bits?
On native, even if you have fewer than 255 CPUs, if the APIC IDs are sparsely
distributed due to platform rules, the x2APIC ID could be more than 8 bits.
Which is why the spec requires IR when x2apic is enabled.
> remapping at all. Only if they want more than 32768 vCPUs, or to do
> nested virtualisation and actually remap for the benefit of *their*
> (L2+) guests would they need IR.
On Sun, Nov 08, 2020 at 11:47:13PM +0100, Thomas Gleixner wrote:
> OTOH, what's the chance that a guest runs on something which
>
> 1) Does not have X86_FEATURE_HYPERVISOR set in cpuid 1/EDX
>
> and
>
> 2) Cannot be identified as Xen domain
>
> and
>
> 3) Does not have a DMI vendor entry which identifies the
> virtualization solution (we don't use that today, but
> adding that table is trivial enough)
>
> and
>
> 4) Has such an IMS device passed through?
>
> Possible, yes. Likely, no. Do we care?
This is exactly my thinking too. IMS is still very new; if we add some
platform flag to disable it then yes there are broken cases, but there are enough
options for an unlucky user to deal with it:
- Have their VMM set X86_FEATURE_HYPERVISOR
- Update the VMM to set the global disable flag
- Add some "disable_subdevice_msi" kernel command line flag in the guest
In exchange we get a much cleaner architecture for the next 10 years..
Jason
Hi Jason,
On Sun, Nov 08, 2020 at 07:23:41PM -0400, Jason Gunthorpe wrote:
>
> IDXD is worring about case #4, I think, but I didn't follow in that
> whole discussion about the IMS table layout if they PASID tag the IMS
> MemWr or not?? Ashok can you clarify?
>
The PASID in the interrupt store is for the IDXD to verify the interrupt handle
that came with the ENQCMD. User applications can obtain an interrupt handle and
ask for interrupt to be generated for transactions submitted via ENQCMD.
IDXD will compare the PASID that came with ENQCMD and verify if the PASID matches
the one stored in the Interrupt Table before generating the MemWr.
So MemWr for interrupts remains unchanged for IDXD on the wire. PASID is present in the interrupt
store because the value was programmed by user space, and the OS/hardware needs to ensure
that the entity asking for interrupts has ownership of the interrupt handle.
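(As a conceptual illustration of that ownership check: pseudo-C for the hardware behaviour described above, not driver code; the struct layout and field names are invented.)

#include <linux/types.h>

/* Invented layout for illustration; the real IMS entry format is defined
 * by the IDXD spec, not by this struct. */
struct ims_entry {
	u32 pasid;	/* PASID that owns this interrupt handle */
	u64 addr;	/* message address composed by the OS */
	u32 data;	/* message data composed by the OS */
};

/* The device only generates the interrupt MemWr (which carries no PASID
 * on the wire) when the PASID that arrived with the ENQCMD matches the
 * owner recorded in the entry. */
static bool ims_handle_owned_by(const struct ims_entry *e, u32 enqcmd_pasid)
{
	return e->pasid == enqcmd_pasid;
}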
On Sun, Nov 08, 2020 at 10:11:24AM -0800, Raj, Ashok wrote:
> > On (kvm) virtualization the addr/data pair the IRQ domain hands out
> > doesn't work. It is some fake thing.
>
> Is it really some fake thing? I thought the vCPU and vector are real
> for a guest, and VMM ensures when interrupts are delivered they are either.
It is fake in the sense it is programmed into no hardware.
It is real in the sense it is an ABI contract with the VMM.
> > On something like IDXD this emulation is not so hard, on something
> > like mlx5 this is completely unworkable. Further we never do
> > emulation on our devices, they always pass native hardware through,
> > even for SIOV-like cases.
>
> So is that true for interrupts too?
There is no *mlx5* emulation. We ride on the generic MSI emulation KVM
is doing.
> Possibly you have the interrupt entries sitting in memory resident
> on the device?
For SRIOV, yes. The appeal of IMS is to move away from that.
> Don't we need the VMM to ensure they are brokered by VMM in either
> one of the two ways above?
Yes, no matter what, the VMM has to know the guest wants an interrupt
routed in and set up the VMM part of the equation. With SRIOV this is
all done with the MSI trapping.
> What if the guest creates some addr in the 0xfee... range how do we
> take care of interrupt remapping and such without any VMM assist?
Not sure I understand this?
> That's true. Probably this can work the same even for MSIx types too then?
Yes, once you have the ability to hypercall to create the addr/data
pair then it can work with MSI and the VMM can stop emulation. It
would be a nice bit of uniformity to close this, but switching the VMM
from legacy to new mode is going to be tricky, I fear.
> I agree with the overall idea and we should certainly take that into
> consideration when we need IMS in guest support and in context of
> interrupt remapping.
The issue with things, as they sit now, is SRIOV.
If any driver starts using pci_subdevice_msi_create_irq_domain() then
it fails if the VF is assigned to a guest with SRIOV. This is a real,
and important, use case for many devices today!
The "solution" can't be to go back and retroactively change every
shipping device to add PCI capability blocks, and ensure that every
existing VMM strips them out before assigning the device (including
Hyper-V!!) :(
Jason
Hi Thomas,
[-] Jing, she isn't working at Intel anymore.
Now this is getting compiled as a book :-).. Thanks a ton!
One question on the hypercall case that isn't immediately
clear to me.
On Sun, Nov 08, 2020 at 07:47:24PM +0100, Thomas Gleixner wrote:
>
>
> Now if we look at the virtualization scenario and device hand through
> then the structure in the guest view is not any different from the basic
> case. This works with PCI-MSI[X] and the IDXD IMS variant because the
> hypervisor can trap the access to the storage and translate the message:
>
> |
> |
> [CPU] -- [Bri | dge] -- Bus -- [Device]
> |
> Alloc +
> Compose Store Use
> |
> | Trap
> v
> Hypervisor translates and stores
>
In the above case, the VMM is responsible for writing to the message
store, in both cases, whether it's IMS or legacy MSI/MSI-X. The VMM handles
the writes to the device interrupt region and to the IRTE tables.
> But obviously with an IMS storage location which is software controlled
> by the guest side driver (the case Jason is interested in) the above
> cannot work for obvious reasons.
>
> That means the guest needs a way to ask the hypervisor for a proper
> translation, i.e. a hypercall. Now where to do that? Looking at the
> above remapping case it's pretty obvious:
>
>
> |
> |
> [CPU] -- [VI | RT] -- [Bridge] -- Bus -- [Device]
> |
> Alloc "Compose" Store Use
>
> Vectordomain HCALLdomain Busdomain
> | ^
> | |
> v |
> Hypervisor
> Alloc + Compose
>
> Why? Because it reflects the boundaries and leaves the busdomain part
> agnostic as it should be. And it works for _all_ variants of Busdomains.
>
> Now the question which I can't answer is whether this can work correctly
> in terms of isolation. If the IMS storage is in guest memory (queue
> storage) then the guest driver can obviously write random crap into it
> which the device will happily send. (For MSI and IDXD style IMS it
> still can trap the store).
The isolation problem is not just the guest memory being used as interrupt
store, right? If the store to the device region is not trapped and controlled by the
VMM, there is no guarantee the guest OS has done the right thing?
Thinking about it, guest memory might be more problematic since it's not
trappable and the VMM can't enforce what is written. This is something that
needs more attention. But for now, for the devices supporting memory on the device,
the trap and store by the VMM seems to satisfy the security properties you
highlight here.
>
> Is the IOMMU/Interrupt remapping unit able to catch such messages which
> go outside the space to which the guest is allowed to signal to? If yes,
> problem solved. If no, then IMS storage in guest memory can't ever work.
This can probably work for SRIOV devices where the guest owns the entire device.
Interrupt remap does have RID checks if an interrupt arrives at an interrupt handle
not allocated for that BDF.
But for SIOV devices there is no PASID filtering at the remap level since
interrupt messages don't carry PASID in the TLP.
>
> Coming back to this:
>
> > In the end pci_subdevice_msi_create_irq_domain() is a platform
> > function. Either it should work completely on every device with no
> > device-specific emulation required in the VMM, or it should not work
> > at all and return -EOPNOTSUPP.
>
> The subdevice domain is a 'Busdomain' according to the structure
> above. It does not and should never have any clue about the underlying
> system. It's in the agnostic part and always works. It simply does not
> care what's underneath. So it won't return -EOPNOTSUPP.
>
> What it has to do is to transport the IMS in queue memory requirement to
> the underlying parent domain.
>
> So in case that the HCALL domain is missing, the Vector domain needs
> return an error code on domain creation. If the HCALL domain is there
> then the domain creation works and in case of actual interrupt
> allocation the hypercall either returns a valid composed message or an
> appropriate error code.
>
> But there's a catch:
>
> This only works when the guest OS actually knows that it runs in a
> VM. If the guest can't figure that out, i.e. via CPUID, this cannot be
Precisely! It might work if the OS is new, but for legacy guests trap-emulate
seems both safe and workable as well?
Cheers,
Ashok
Hi Jason
On Sun, Nov 08, 2020 at 07:41:42PM -0400, Jason Gunthorpe wrote:
> On Sun, Nov 08, 2020 at 10:11:24AM -0800, Raj, Ashok wrote:
>
> > > On (kvm) virtualization the addr/data pair the IRQ domain hands out
> > > doesn't work. It is some fake thing.
> >
> > Is it really some fake thing? I thought the vCPU and vector are real
> > for a guest, and VMM ensures when interrupts are delivered they are either.
>
> It is fake in the sense it is programmed into no hardware.
>
> It is real in the sense it is an ABI contract with the VMM.
Ah.. it's clear now. That clears up my question below as well.
>
> Yes, no matter what the VMM has to know the guest wants an interrupt
> routed in and setup the VMM part of the equation. With SRIOV this is
> all done with the MSI trapping.
>
> > What if the guest creates some addr in the 0xfee... range how do we
> > take care of interrupt remapping and such without any VMM assist?
>
> Not sure I understand this?
>
My question was based on the misconception that interrupt entries are directly
written by the guest OS for mlx*. My concern was about security isolation if the guest OS
has full control of the device interrupt store.
I think you clarified it: interrupts are still marshalled by the VMM
and not in direct control of the guest OS. That makes my question moot.
Cheers,
Ashok
> -----Original Message-----
> From: Thomas Gleixner <[email protected]>
> Sent: Saturday, November 7, 2020 8:32 AM
> To: Tian, Kevin <[email protected]>; Jason Gunthorpe <[email protected]>
> Subject: RE: [PATCH v4 06/17] PCI: add SIOV and IMS capability detection
>
> On Fri, Nov 06 2020 at 09:48, Kevin Tian wrote:
> >> From: Jason Gunthorpe <[email protected]>
> >> On Wed, Nov 04, 2020 at 01:34:08PM +0000, Tian, Kevin wrote:
> >> The interrupt controller is responsible to create an addr/data pair
> >> for an interrupt message. It sets the message format and ensures it
> >> routes to the proper CPU interrupt handler. Everything about the
> >> addr/data pair is owned by the platform interrupt controller.
> >>
> >> Devices do not create interrupts. They only trigger the addr/data pair
> >> the platform gives them.
> >
> > I guess that we may just view it from different angles. On x86 platform,
> > a MSI/IMS capable device directly composes interrupt messages, with
> > addr/data pair filled by OS. If there is no IOMMU remapping enabled in
> > the middle, the message just hits the CPU. Your description possibly
> > is from software side, e.g. describing the hierarchical IRQ domain
> > concept?
>
> No. The device composes nothing. If the interrupt is raised in the
> device then the MSI block sends the message which was composed by the OS
> and stored in the device's message store. For PCI/MSI that's the MSI or
> MSIX table and for IMS that's either on device memory (as IDXD uses) or
> some completely different location which Jason described.
Sorry for being inaccurate here. I actually meant the same thing as
you described, since I did mention the addr/data pair being filled by the OS.
Unfortunately I mistakenly thought that 'compose' has a similar
meaning to 'send' in English, but clearly it does not; instead it's
just about the message content. And for sure I also agree with your
other clarifications regarding the architecture-independent manner.
Thanks
Kevin
>
> This has absolutely nothing to do with the X86 platform. MSI is a
> architecture independent mechanism: Send whatever the OS put into the
> storage to raise an interrupt in the CPU. The device does neither know
> whether that message is going to be intercepted by an interrupt
> remapping unit or not.
>
> Stop claiming that any of this has anything to do with x86. It has
> absolutely nothing to do with x86 and looking at MSI from an x86
> perspective instead of looking at it from the architecture agnostic
> technical reality of MSI is the reason why we have this discussion at
> all.
>
> We had a similar discussion vs. the way how IMS interrupts have to be
> dealt with in terms of irq domains. Can you finally stop looking at
> everything as a big x86/intel/platform lump and understand that things
> are very well structured and seperated both at the hardware and at the
> software level?
>
> > Do you mind providing the link? There were lots of discussions between
> > you and Thomas. I failed to locate the exact mail when searching above
> > keywords.
>
> In this thread: [email protected] and you were on
> Cc
>
> Thanks,
>
> tglx
>
> From: Jason Gunthorpe <[email protected]>
> Sent: Monday, November 9, 2020 7:24 AM
>
> On Sun, Nov 08, 2020 at 07:47:24PM +0100, Thomas Gleixner wrote:
> >
> > That means the guest needs a way to ask the hypervisor for a proper
> > translation, i.e. a hypercall. Now where to do that? Looking at the
> > above remapping case it's pretty obvious:
> >
> >
> > |
> > |
> > [CPU] -- [VI | RT] -- [Bridge] -- Bus -- [Device]
> > |
> > Alloc "Compose" Store Use
> >
> > Vectordomain HCALLdomain Busdomain
> > | ^
> > | |
> > v |
> > Hypervisor
> > Alloc + Compose
>
> Yes, this will describes what I have been thinking
Agree
>
> > Now the question which I can't answer is whether this can work correctly
> > in terms of isolation. If the IMS storage is in guest memory (queue
> > storage) then the guest driver can obviously write random crap into it
> > which the device will happily send. (For MSI and IDXD style IMS it
> > still can trap the store).
>
> There are four cases of interest here:
>
> 1) Bare metal, PF and VF devices just deliver whatever addr/data pairs
> to the APIC. IMS works perfectly with
> pci_subdevice_msi_create_irq_domain()
>
> 2) SRIOV VF assigned to the guest.
>
> The guest can cause any MemWr TLP to any addr/data pair
> and the iommu/platform/vmm is supposed to use the
> Bus/device/function to isolate & secure the interrupt address
> range.
>
> IMS can work in the guest if the guest knows the details of the
> address range and can make hypercalls to setup routing. So
> pci_subdevice_msi_create_irq_domain() works if the hypercalls
> exist and fails if they don't.
>
> 3) SIOV sub device assigned to the guest.
>
> The difference between SIOV and SRIOV is the device must attach a
> PASID to every TLP triggered by the guest. Logically we'd expect
> when IMS is used in this situation the interrupt MemWr is tagged
> with bus/device/function/PASID to uniquly ID the guest and the same
> security protection scheme from #2 applies.
Unfortunately no. Intel VT-d only treats MemWr w/o PASID to 0xFEExxxxx
as an interrupt request. MemWr w/ PASID, even to 0xFEE, is translated
normally through the DMA remapping page table. I don't know about other IOMMU
vendors. But at least on the Intel platform such a device would not get the
desired effect, since the IOMMU only guarantees interrupt isolation at the
BDF level.
Does your device already implement such capability? We can bring this
request back to the hardware team.
>
> 4) SIOV sub device assigned to the guest, but with emulation.
>
> This SIOV device cannot tag interrupts with PASID so cannot do #2
> (or the platform cannot recieve a PASID tagged interrupt message).
>
> Since the interrupts are being delivered with TLPs pointing at the
> hypervisor the only solution is for the hypervisor to exclusively
> control the interrupt table. MSI table like emulation for IMS is
> needed and the hypervisor will use
> pci_subdevice_msi_create_irq_domain()
> to get the real interrupts.
>
> pci_subdevice_msi_create_irq_domain() needs to return the 'fake'
> addr/data pairs which are actually an ABI between the guest and
> hypervisor carried in the hidden hypercall of the emulation.
> (ie it works like MSI works today)
>
> IDXD is worring about case #4, I think, but I didn't follow in that
> whole discussion about the IMS table layout if they PASID tag the IMS
> MemWr or not?? Ashok can you clarify?
>
> > Is the IOMMU/Interrupt remapping unit able to catch such messages which
> > go outside the space to which the guest is allowed to signal to? If yes,
> > problem solved. If no, then IMS storage in guest memory can't ever work.
>
> Right. Only PASID on the interrupt messages can resolve this securely.
>
> > So in case that the HCALL domain is missing, the Vector domain needs
> > return an error code on domain creation. If the HCALL domain is there
> > then the domain creation works and in case of actual interrupt
> > allocation the hypercall either returns a valid composed message or an
> > appropriate error code.
>
> Yes
>
> > But there's a catch:
> >
> > This only works when the guest OS actually knows that it runs in a
> > VM. If the guest can't figure that out, i.e. via CPUID, this cannot be
> > solved because from the guest OS view that's the same as running on bare
> > metal. Obviously on bare metal the Vector domain can and must handle
> > this.
>
> Yes
>
> The flip side is today, the way pci_subdevice_msi_create_irq_domain()
> works a VF using it on baremetal will succeed and if that same VF is
> assigned to a guest then pci_subdevice_msi_create_irq_domain()
> succeeds but the interrupt never comes - so the driver is broken.
Yes, this is the main worry here. While all agree that using a hypercall is
the proper way to virtualize IMS, how to disable IMS when a hypercall is
not available is the more urgent demand at the current stage.
>
> Yes, if we add some ACPI/etc flag that is not going to magically fix
> old kvm's, but at least *something* exists that works right and
> generically.
Agree. We can work together on this definition.
btw, in reality such an ACPI extension doesn't exist yet, and it will likely
take some time. In the meantime we already have pending usages
like IDXD. Do you suggest holding these patches until we get ASWG
to accept the extension, or accepting the use of the Intel IMS cap as a
vendor-specific mitigation to move forward while the platform flag is being
worked on? Anyway the IMS cap is already defined and can help fix
some broken cases.
>
> If we follow Intel's path then we need special KVM level support for
> *every* device, PCI cap mangling and so on. Forever. Sounds horrible
> to me..
>
> This feels like one of these things where no matter what we do
> something is broken. Picking the least breakage is the challenge here.
>
> > So this needs some thought.
>
> I think your HAOLL diagram is the only sane architecture.
>
> If go that way then case #4 will still work, in this case the HCALL
> will return addr/data pairs that conform to what the emulation
> expects. Or fail if the VMM can't do emulation for the device.
>
Thanks
Kevin
> From: Raj, Ashok <[email protected]>
> Sent: Monday, November 9, 2020 7:59 AM
>
> Hi Thomas,
>
> [-] Jing, She isn't working at Intel anymore.
>
> Now this is getting compiled as a book :-).. Thanks a ton!
>
> One question on the hypercall case that isn't immediately
> clear to me.
>
> On Sun, Nov 08, 2020 at 07:47:24PM +0100, Thomas Gleixner wrote:
> >
> >
> > Now if we look at the virtualization scenario and device hand through
> > then the structure in the guest view is not any different from the basic
> > case. This works with PCI-MSI[X] and the IDXD IMS variant because the
> > hypervisor can trap the access to the storage and translate the message:
> >
> > |
> > |
> > [CPU] -- [Bri | dge] -- Bus -- [Device]
> > |
> > Alloc +
> > Compose Store Use
> > |
> > | Trap
> > v
> > Hypervisor translates and stores
> >
>
> The above case, VMM is responsible for writing to the message
> store. In both cases if its IMS or Legacy MSI/MSIx. VMM handles
> the writes to the device interrupt region and to the IRTE tables.
>
> > But obviously with an IMS storage location which is software controlled
> > by the guest side driver (the case Jason is interested in) the above
> > cannot work for obvious reasons.
> >
> > That means the guest needs a way to ask the hypervisor for a proper
> > translation, i.e. a hypercall. Now where to do that? Looking at the
> > above remapping case it's pretty obvious:
> >
> >
> > |
> > |
> > [CPU] -- [VI | RT] -- [Bridge] -- Bus -- [Device]
> > |
> > Alloc "Compose" Store Use
> >
> > Vectordomain HCALLdomain Busdomain
> > | ^
> > | |
> > v |
> > Hypervisor
> > Alloc + Compose
> >
> > Why? Because it reflects the boundaries and leaves the busdomain part
> > agnostic as it should be. And it works for _all_ variants of Busdomains.
> >
> > Now the question which I can't answer is whether this can work correctly
> > in terms of isolation. If the IMS storage is in guest memory (queue
> > storage) then the guest driver can obviously write random crap into it
> > which the device will happily send. (For MSI and IDXD style IMS it
> > still can trap the store).
>
> The isolation problem is not just the guest memory being used as interrrupt
> store right? If the Store to device region is not trapped and controlled by
> VMM, there is no gaurantee the guest OS has done the right thing?
>
>
> Thinking about it, guest memory might be more problematic since its not
> trappable and VMM can't enforce what is written. This is something that
> needs more attension. But for now the devices supporting memory on device
> the trap and store by VMM seems to satisfy the security properties you
> highlight here.
>
Just want to clarify the trap part.
Guest memory is not trappable in Jason's example, which has queue/IMS
storage swapped between device/memory and requires special command
to sync the state.
But there are also other forms of in-memory IMS implementation. E.g. some
devices serve work requests based on command buffers instead of HW work
queues. The command buffers are linked in per-process contexts (both in
memory), thus similarly IMS could be stored in each context too. There is no
swap per se. The context is allocated by the driver and then registered to
the device through a mgmt. interface. When the mgmt. interface is mediated,
the hypervisor knows the IMS location and could mark it as read-only in the
EPT page table to enable trapping of guest writes. Of course this approach
is awkward if the complexity is paid just for virtualizing IMS.
Thanks
Kevin
On Sun, Nov 08 2020 at 15:58, Ashok Raj wrote:
> On Sun, Nov 08, 2020 at 07:47:24PM +0100, Thomas Gleixner wrote:
>>
>>
>> Now if we look at the virtualization scenario and device hand through
>> then the structure in the guest view is not any different from the basic
>> case. This works with PCI-MSI[X] and the IDXD IMS variant because the
>> hypervisor can trap the access to the storage and translate the message:
>>
>> |
>> |
>> [CPU] -- [Bri | dge] -- Bus -- [Device]
>> |
>> Alloc +
>> Compose Store Use
>> |
>> | Trap
>> v
>> Hypervisor translates and stores
>>
>
> The above case, VMM is responsible for writing to the message
> store. In both cases if its IMS or Legacy MSI/MSIx. VMM handles
> the writes to the device interrupt region and to the IRTE tables.
Yes, but that's just how it's done today and there is no real need to do
so.
>> Now the question which I can't answer is whether this can work correctly
>> in terms of isolation. If the IMS storage is in guest memory (queue
>> storage) then the guest driver can obviously write random crap into it
>> which the device will happily send. (For MSI and IDXD style IMS it
>> still can trap the store).
>
> The isolation problem is not just the guest memory being used as interrrupt
> store right? If the Store to device region is not trapped and controlled by
> VMM, there is no gaurantee the guest OS has done the right thing?
>
> Thinking about it, guest memory might be more problematic since its not
> trappable and VMM can't enforce what is written. This is something that
> needs more attension. But for now the devices supporting memory on device
> the trap and store by VMM seems to satisfy the security properties you
> highlight here.
That's not the problem at all. The VMM is not responsible for the
correctness of the guest OS at all. All the VMM cares about is that the
guest cannot access anything which does not belong to the guest.
If the guest OS screws up the message (by stupidity or malice), then the
MSI sent from the passed-through device has to be caught by the
IOMMU/remap unit if and _only_ if it writes to something which it is not
allowed to.
If it overwrites the guest's memory then so be it. The VMM cannot prevent
the guest OS doing so via a stray pointer either. So why would it worry
about the MSI going into guest-owned lala land?
>> Is the IOMMU/Interrupt remapping unit able to catch such messages which
>> go outside the space to which the guest is allowed to signal to? If yes,
>> problem solved. If no, then IMS storage in guest memory can't ever work.
>
> This can probably work for SRIOV devices where guest owns the entire device.
> interrupt remap does have RID checks if interrupt arrives at an Interrupt handle
> not allocated for that BDF.
>
> But for SIOV devices there is no PASID filtering at the remap level since
> interrupt messages don't carry PASID in the TLP.
PASID is irrelevant here.
If the device sends a message then the remap unit will see the requester
ID of the device and if the message it sends is not matching the remap
tables then it's caught and the guest is terminated. At least that's how
it should be.
>> But there's a catch:
>>
>> This only works when the guest OS actually knows that it runs in a
>> VM. If the guest can't figure that out, i.e. via CPUID, this cannot be
>
> Precicely!. It might work if the OS is new, but for legacy the trap-emulate
> seems both safe and works for legacy as well?
Again, trap emulate does not work for IMS when the IMS store is software
managed guest memory and not part of the device. And that's the whole
reason why we are discussing this.
Thanks,
tglx
On Mon, Nov 09 2020 at 12:14, Thomas Gleixner wrote:
> On Sun, Nov 08 2020 at 15:58, Ashok Raj wrote:
>> On Sun, Nov 08, 2020 at 07:47:24PM +0100, Thomas Gleixner wrote:
>> But for SIOV devices there is no PASID filtering at the remap level since
>> interrupt messages don't carry PASID in the TLP.
>
> Why do we need PASID for VMM integrity?
>
> If the device sends a message then the remap unit will see the requester
> ID of the device and if the message it sends is not
That made me look at patch 4/17 which adds DEVMSI support to the
remap code:
> + case X86_IRQ_ALLOC_TYPE_DEV_MSI:
> + irte_prepare_msg(msg, index, sub_handle);
> break;
It does not setup any requester-id filter in IRTE. How is that supposed
to be correct?
Thanks,
tglx
On Mon, Nov 09, 2020 at 07:37:03AM +0000, Tian, Kevin wrote:
> > 3) SIOV sub device assigned to the guest.
> >
> > The difference between SIOV and SRIOV is the device must attach a
> > PASID to every TLP triggered by the guest. Logically we'd expect
> > when IMS is used in this situation the interrupt MemWr is tagged
> > with bus/device/function/PASID to uniquly ID the guest and the same
> > security protection scheme from #2 applies.
>
> Unfortunately no. Intel VT-d only treats MemWr w/o PASID to 0xFEExxxxx
> as interrupt request. MemWr w/ PASID, even to 0xFEE, is translated
> normally through DMA remapping page table.
I've heard that current IOMMUs are limited as well, but IMHO, as I
describe, if you want full symmetry then you want to route interrupts
via PASID for SIOV. Otherwise the architecture is incomplete.
At least from a Linux and VMM perspective this should be planned
for. It is the only generic way to have a sub device assigned to a
guest and still have access to IMS.
> Does your device already implement such capability? We can bring this
> request back to the hardware team.
In some cases we can generate PASID tagged TLPs for interrupt
messages, if there was a reason to do that.
> Yes, this is the main worry here. While all agree that using hypercall is
> the proper way to virtualize IMS, how to disable it when hypercall is
> not available is a more urgent demand at current stage.
Hopefully Thomas's note about checking for virtualization will help..
> btw in reality such ACPI extension doesn't exist yet, which likely will
> take some time. In the meantime we already have pending usages
> like IDXD. Do you suggest holding these patches until we get ASWG
> to accept the extension, or accept using Intel IMS cap as a vendor
> specific mitigation to move forward while the platform flag is being
> worked on? Anyway the IMS cap is already defined and can help fix
> some broken cases.
I think you need to sort out something generic; these half-baked
architectures just make it some other team's problem.
Thomas's suggestion to check CPUID seems reasonably workable.
Jason
On Mon, Nov 09, 2020 at 12:21:22PM +0100, Thomas Gleixner wrote:
> >> Is the IOMMU/Interrupt remapping unit able to catch such messages which
> >> go outside the space to which the guest is allowed to signal to? If yes,
> >> problem solved. If no, then IMS storage in guest memory can't ever work.
> >
> > This can probably work for SRIOV devices where guest owns the entire device.
> > interrupt remap does have RID checks if interrupt arrives at an Interrupt handle
> > not allocated for that BDF.
> >
> > But for SIOV devices there is no PASID filtering at the remap level since
> > interrupt messages don't carry PASID in the TLP.
>
> PASID is irrelevant here.
>
> If the device sends a message then the remap unit will see the requester
> ID of the device and if the message it sends is not matching the remap
> tables then it's caught and the guest is terminated. At least that's how
> it should be.
The SIOV case is to take a single RID and split it to multiple
VMs and also to the hypervisor. All these things concurrently use the
same RID, and the IOMMU can't tell them apart.
The hypervisor security domain owns TLPs with no PASID. Each PASID is
assigned to a VM.
For interrupts, today, they are all generated, with no PASID, to the
same RID. There is no way for remapping to protect against a guest
without also checking PASID.
The relevance of PASID is this:
> Again, trap emulate does not work for IMS when the IMS store is software
> managed guest memory and not part of the device. And that's the whole
> reason why we are discussing this.
With PASID-tagged interrupts and an IOMMU interrupt remapping
capability that can trigger on PASID, the platform can provide
the same level of security as SRIOV - the above is no problem.
The device ensures that all DMAs and all interrupts programmed by the
guest are PASID tagged and the platform provides security by checking
the PASID when delivering the interrupt. Intel IOMMU doesn't work this
way today, but it makes a lot of design sense.
Otherwise the interrupt is effectively delivered to the hypervisor. A
secure device can *never* allow a guest to specify an addr/data pair
for a non-PASID tagged TLP, so the device cannot offer IMS to the
guest.
Jason
On Mon, Nov 09, 2020 at 03:08:17PM +0100, Thomas Gleixner wrote:
> On Mon, Nov 09 2020 at 12:14, Thomas Gleixner wrote:
> > On Sun, Nov 08 2020 at 15:58, Ashok Raj wrote:
> >> On Sun, Nov 08, 2020 at 07:47:24PM +0100, Thomas Gleixner wrote:
> >> But for SIOV devices there is no PASID filtering at the remap level since
> >> interrupt messages don't carry PASID in the TLP.
> >
> > Why do we need PASID for VMM integrity?
> >
> > If the device sends a message then the remap unit will see the requester
> > ID of the device and if the message it sends is not
>
> That made me look at patch 4/17 which adds DEVMSI support to the
> remap code:
>
> > + case X86_IRQ_ALLOC_TYPE_DEV_MSI:
> > + irte_prepare_msg(msg, index, sub_handle);
> > break;
>
> It does not setup any requester-id filter in IRTE. How is that supposed
> to be correct?
>
It's missing a set_msi_sid() equivalent for the DEV_MSI type.
On Mon, Nov 09, 2020 at 01:30:34PM -0400, Jason Gunthorpe wrote:
>
> > Again, trap emulate does not work for IMS when the IMS store is software
> > managed guest memory and not part of the device. And that's the whole
> > reason why we are discussing this.
>
> With PASID tagged interrupts and a IOMMU interrupt remapping
> capability that can trigger on PASID, then the platform can provide
> the same level of security as SRIOV - the above is no problem.
You mean even if it's stored in memory, as long as the MemWr comes with
PASID, and the hypercall has provisioned the IRTE properly?
That seems like a possibility.
>
> The device ensures that all DMAs and all interrupts program by the
> guest are PASID tagged and the platform provides security by checking
> the PASID when delivering the interrupt. Intel IOMMU doesn't work this
> way today, but it makes alot of design sense.
>
> Otherwise the interrupt is effectively delivered to the hypervisor. A
> secure device can *never* allow a guest to specify an addr/data pair
> for a non-PASID tagged TLP, so the device cannot offer IMS to the
> guest.
Right, it seems like that's a limitation today.
On Mon, Nov 09 2020 at 13:30, Jason Gunthorpe wrote:
> On Mon, Nov 09, 2020 at 12:21:22PM +0100, Thomas Gleixner wrote:
>> >> Is the IOMMU/Interrupt remapping unit able to catch such messages which
>> >> go outside the space to which the guest is allowed to signal to? If yes,
>> >> problem solved. If no, then IMS storage in guest memory can't ever
>> >> work.
> The SIOV case is to take a single RID and split it to multiple
> VMs and also to the hypervisor. All these things concurrently use the
> same RID, and the IOMMU can't tell them apart.
>
> The hypervisor security domain owns TLPs with no PASID. Each PASID is
> assigned to a VM.
>
> For interrupts, today, they are all generated, with no PASID, to the
> same RID. There is no way for remapping to protect against a guest
> without checking also PASID.
>
> The relavance of PASID is this:
>
>> Again, trap emulate does not work for IMS when the IMS store is software
>> managed guest memory and not part of the device. And that's the whole
>> reason why we are discussing this.
>
> With PASID tagged interrupts and a IOMMU interrupt remapping
> capability that can trigger on PASID, then the platform can provide
> the same level of security as SRIOV - the above is no problem.
>
> The device ensures that all DMAs and all interrupts program by the
> guest are PASID tagged and the platform provides security by checking
> the PASID when delivering the interrupt.
Correct.
> Intel IOMMU doesn't work this way today, but it makes alot of design
> sense.
Right.
> Otherwise the interrupt is effectively delivered to the hypervisor. A
> secure device can *never* allow a guest to specify an addr/data pair
> for a non-PASID tagged TLP, so the device cannot offer IMS to the
> guest.
Ok. Let me summarize the current state of supported scenarios:
1) SRIOV works with any form of IMS storage because it does not require
PASID and the VF devices have unique requester ids, which allows the
remap unit to sanity check the message.
2) SIOV with IMS when the hypervisor can manage the IMS store
exclusively.
So #2 prevents a device which handles IMS storage in queue memory from
utilizing IMS for SIOV in a guest, because the hypervisor cannot manage the
IMS message store and the guest can write arbitrary crap to it, which
violates the isolation principle.
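Expressed as a minimal sketch of the decision logic just summarized (the
helper and both parameters are invented purely to restate the two cases;
nothing here is an existing API):

#include <stdbool.h>

/*
 * Illustrative only: the two supported scenarios above as a decision
 * function.
 */
static bool ims_allowed_for_guest(bool is_sriov_vf, bool host_owns_ims_store)
{
	/*
	 * #1: SR-IOV VFs have a unique requester ID, so the remap unit can
	 * sanity check every message regardless of where it is stored.
	 */
	if (is_sriov_vf)
		return true;

	/*
	 * #2: SIOV shares one RID.  IMS is only acceptable when the
	 * hypervisor has exclusive control of the message store; guest
	 * writable storage (e.g. queue memory) breaks isolation.
	 */
	return host_owns_ims_store;
}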
And here is the relevant part of the SIOV spec:
"IMS is managed by host driver software and is not accessible directly
from guest or user-mode drivers.
Within the device, IMS storage is not accessible from the ADIs. ADIs
can request interrupt generation only through the device’s ‘Interrupt
Message Generation Logic’, which allows an ADI to only generate
interrupt messages that are associated with that specific ADI. These
restrictions ensure that the host driver has complete control over
which interrupt messages can be generated by each ADI.
On Intel 64 architecture platforms, message signaled interrupts are
issued as DWORD size untranslated memory writes without a PASID TLP
Prefix, to address range 0xFEExxxxx. Since all memory requests
generated by ADIs include a PASID TLP Prefix, it is not possible for
an ADI to generate a DMA write that would be interpreted by the
platform as an interrupt message."
That's the reductio ad absurdum for this sentence in the first paragraph
of the preceding chapter describing the concept of IMS:
"IMS enables devices to store the interrupt messages for ADIs in a
device-specific optimized manner without the scalability restrictions
of the PCI Express defined MSI-X capability."
"Device-specific optimized manner" is either wishful thinking or
marketing induced verbal diarrhoea.
The current specification puts massive restrictions on IMS storage which
do _not_ allow optimizing it in a device-specific manner, as
demonstrated in this discussion.
It also precludes obvious use cases like passing a full device to a
guest and letting the guest manage SIOV subdevices for containers or nested
guests.
TBH, to me this is just another hastily cobbled together, half thought
out misfeature cast in silicon. The proposed software support is
following exactly the same principle.
So before we go anywhere with this, I want to see a proper way forward
to support _all_ sensible use cases and to fulfil the promise of
"device-specific optimized manner" at the conceptual and specification
and also at the code level.
I'm not at all interested in rushing in support for a half-baked Intel
centric solution which other people have to clean up after the fact
(again).
IOW, it's time to go back to the drawing board.
Thanks,
tglx
Hi Thomas,
On Mon, Nov 09, 2020 at 11:42:29PM +0100, Thomas Gleixner wrote:
> On Mon, Nov 09 2020 at 13:30, Jason Gunthorpe wrote:
> >
> > The relavance of PASID is this:
> >
> >> Again, trap emulate does not work for IMS when the IMS store is software
> >> managed guest memory and not part of the device. And that's the whole
> >> reason why we are discussing this.
> >
> > With PASID tagged interrupts and a IOMMU interrupt remapping
> > capability that can trigger on PASID, then the platform can provide
> > the same level of security as SRIOV - the above is no problem.
> >
> > The device ensures that all DMAs and all interrupts program by the
> > guest are PASID tagged and the platform provides security by checking
> > the PASID when delivering the interrupt.
>
> Correct.
>
> > Intel IOMMU doesn't work this way today, but it makes alot of design
> > sense.
Our approach to IMS is a phased one.
#1 Allow the physical device to scale beyond the limits of PCIe MSI-X.
   This follows the current methodology for guest interrupt programming and
   makes evolutionary changes rather than drastic ones.
#2 Long term we should work together on enabling IMS in the guest, which
   requires changes in both the HW and SW ecosystems.
For #1, the immediate need is to find a way to prevent the guest from using
IMS due to the current limitations. We have a couple of options.
a) CPUID-based method to disallow IMS when running in a guest OS, limiting
   guest devices to the existing virtual MSI-X. (Both you and Jason alluded
   to this.)
b) Extend the DMAR table to have an opt-out flag. On a real platform this
   flag is clear, and in a guest the VMM will ensure the vDMAR has the flag
   set. Along the lines Jason alluded to: platform level and via ACPI
   methods. We have a similar use for x2apic_optout today.
I think a) is probably more generic.
For #2, the long-term goal is allowing IMS in the guest for devices that
require it. This needs extensive ecosystem enabling.
On the HW side:
- Extending HW to understand PASID-tagged interrupt messages.
- Appropriate extensions to the IOMMU to enforce such PASID-based isolation.
On the SW side:
- A hypercall to retrieve addr/data from the host.
- Ensure SW can guarantee that the interrupt address range will not
  be mapped in process space when SVM is in play. Otherwise it's hard to
  distinguish between DMA and interrupts. The OS needs to opt in to this
  behavior. Today we ensure the 0xFEExxxxx range is carved out of the
  IOVA space.
Devices such as idxd that do not place these entries on page boundaries
(which isolation would require to permit direct programming from the guest
OS) will continue to use trap-emulate as used today.
In the end, virtualizing IMS requires ecosystem collaboration, and we are
very open to changing hw when all the relevant pieces are in place.
Until then, IMS will be restricted to host VMM only, and we can use the
methods above to prevent IMS in guest and continue to use the legacy
virtual MSIx.
>
> Right.
>
> > Otherwise the interrupt is effectively delivered to the hypervisor. A
> > secure device can *never* allow a guest to specify an addr/data pair
> > for a non-PASID tagged TLP, so the device cannot offer IMS to the
> > guest.
>
> Ok. Let me summarize the current state of supported scenarios:
>
> 1) SRIOV works with any form of IMS storage because it does not require
> PASID and the VF devices have unique requester ids, which allows the
> remap unit to sanity check the message.
>
> 2) SIOV with IMS when the hypervisor can manage the IMS store
> exclusively.
Today this is true for all interrupt types, MSI/MSIx/IMS.
>
> So #2 prevents a device which handles IMS storage in queue memory to
> utilize IMS for SIOV in a guest because the hypervisor cannot manage the
> IMS message store and the guest can write arbitrary crap to it which
> violates the isolation principle.
>
> And here is the relevant part of the SIOV spec:
>
> "IMS is managed by host driver software and is not accessible directly
> from guest or user-mode drivers.
>
> Within the device, IMS storage is not accessible from the ADIs. ADIs
> can request interrupt generation only through the device’s ‘Interrupt
> Message Generation Logic’, which allows an ADI to only generate
> interrupt messages that are associated with that specific ADI. These
> restrictions ensure that the host driver has complete control over
> which interrupt messages can be generated by each ADI.
>
> On Intel 64 architecture platforms, message signaled interrupts are
> issued as DWORD size untranslated memory writes without a PASID TLP
> Prefix, to address range 0xFEExxxxx. Since all memory requests
> generated by ADIs include a PASID TLP Prefix, it is not possible for
> an ADI to generate a DMA write that would be interpreted by the
> platform as an interrupt message."
>
> That's the reductio ad absurdum for this sentence in the first paragraph
> of the preceding chapter describing the concept of IMS:
>
> "IMS enables devices to store the interrupt messages for ADIs in a
> device-specific optimized manner without the scalability restrictions
> of the PCI Express defined MSI-X capability."
>
> "Device-specific optimized manner" is either wishful thinking or
> marketing induced verbal diarrhoea.
No comment on the adjectives above :-)
>
> The current specification puts massive restrictions on IMS storage which
> are _not_ allowing to optimize it in a device specific manner as
> demonstrated in this discussion.
IMS doesn't restrict this optimization, but to allow it requires more OS support as
you had mentioned.
>
> It also precludes obvious use cases like passing a full device to a
> guest and let the guest manage SIOV subdevices for containers or nested
> guests.
>
> TBH, to me this is just another hastily cobbled together half thought
> out misfeature cast in silicon. The proposed software support is
> following the exactly same principle.
The current IMS support adds capability incrementally. It pretty much
follows everything that was created for MSI-X, but adds some device
flexibility.
Here are some reasons why PASID isn't used today for tagging interrupts.
Interrupt messages (as specified by MSI/MSI-X in PCI specification) are
currently defined as DWORD DMA writes to a platform/architecture specific
address (0xFEExxxxx on Intel platforms). Existing root-complexes detect
DWORD writes to 0xFEExxxxx (without a PASID in the transaction) as interrupt
messages and route them to interrupt-remapping logic (as opposed to other
DMA requests that are routed to IOMMU's DMA remapping logic).
There are multiple tools (such as logic analyzers) and OEM test validation
harnesses that depend on such DWORD-sized DMA writes with no PASID as
interrupt messages. One piece of feedback we received during the development
of the specification was to avoid impacting such tools irrespective of whether
MSI-X or IMS was used for interrupt message storage (on the wire they follow
the same format), and also to ensure interoperability of devices supporting
IMS across CPU vendors (who may not support the PASID TLP prefix). This is
one reason that led IMS-generated interrupts not to use PASID (and to match
the wire format of MSI/MSI-X generated interrupts).
The other problem was disambiguation between DMA to SVM vs. interrupts.
>
> So before we go anywhere with this, I want to see a proper way forward
> to support _all_ sensible use cases and to fulfil the promise of
> "device-specific optimized manner" at the conceptual and specification
> and also at the code level.
>
> I'm not at all interested to rush in support for a half baken Intel
> centric solution which other people have to clean up after the fact
> (again).
Intel published the specification almost 2 years back and has incorporated
all the feedback received from the ecosystem
(both open-source and others), while offering the specification
for implementation by any vendor (both device and CPU vendors).
A few device vendors are already implementing to the spec, and support by
other CPU vendors is being explored.
Cheers,
Ashok
Ashok,
On Mon, Nov 09 2020 at 21:14, Ashok Raj wrote:
> On Mon, Nov 09, 2020 at 11:42:29PM +0100, Thomas Gleixner wrote:
>> On Mon, Nov 09 2020 at 13:30, Jason Gunthorpe wrote:
> Approach to IMS is more of a phased approach.
>
> #1 Allow physical device to scale beyond limits of PCIe MSIx
> Follows current methodology for guest interrupt programming and
> evolutionary changes rather than drastic.
Trapping MSI[X] writes is there because it allows handing a device to an
unmodified guest OS and handling the case where the MSI[X] entry
storage cannot be mapped exclusively to the guest.
But aside of this, it's not required if the storage can be mapped
exclusively, the guest is hypervisor aware and can get a host composed
message via a hypercall. That works for physical functions and SRIOV,
but not for SIOV.
> #2 Long term we should work together on enabling IMS in guest which
> requires changes in both HW and SW eco-system.
>
> For #1, the immediate need is to find a way to limit guest from using IMS
> due to current limitations. We have couple options.
>
> a) CPUID based method to disallow IMS when running in a guest OS. Limiting
> use to existing virtual MSIx to guest devices. (Both you/Jason alluded)
> b) We can extend DMAR table to have a flag for opt-out. So in real platform
> this flag is clear and in guest VMM will ensure vDMAR will have this flag
> set. Along the lines as Jason alluded, platform level and via ACPI
> methods. We have similar use for x2apic_optout today.
>
> Think a) is probably more generic.
But incomplete as I explained before. If the VMM does not set the
hypervisor bit in CPUID then the guest OS assumes it runs on bare
metal. It needs more than just relying on CPUID.
Aside of that neither Jason nor myself said that IMS cannot be supported
in a guest. PF and VF IMS can and has to be supported. SIOV is a
different story due to the PASID requirement which obviously needs to be
managed host side and needs HW changes.
> From SW improvements:
>
> - Hypercall to retrieve addr/data from host
You need to have that even for the non SIOV case in order to hand in a
full device which has the IMS storage in queue memory.
> Devices such as idxd that do not have these entries on page-boundaries for
> isolation to permit direct programming from GuestOS will continue to use
> trap-emulate as used today.
That's a restriction of that particular hardware.
> Until then, IMS will be restricted to host VMM only, and we can use the
> methods above to prevent IMS in guest and continue to use the legacy
> virtual MSIx.
SIOV IMS.
But as things stand now not even PF/VF pass-through is possible. This
might not be an issue for IDXD, but it's an issue in general and it
wants to be thought of _now_ before we put a lot of infrastructure
into place which then needs to be ripped apart again.
>> The current specification puts massive restrictions on IMS storage which
>> are _not_ allowing to optimize it in a device specific manner as
>> demonstrated in this discussion.
>
> IMS doesn't restrict this optimization, but to allow it requires more
> OS support as you had mentioned.
Right, IMS per se does not put a restriction on it.
The specification and the HW limitations of the remapping unit put that
restriction into place.
OS support is an obvious requirement, but OS support cannot make
the restrictions of HW go away magically.
But again, we need to think about the path forward _now_.
Just slapping some 'works for IDXD' solution into place can severely
restrict the options for going beyond these limitations, simply because
we have to support that 'works for IDXD' thing forever.
Thanks,
tglx
Thomas,
With all these interrupt message storms ;-), I'm missing how to move towards
an end goal.
On Tue, Nov 10, 2020 at 11:27:29AM +0100, Thomas Gleixner wrote:
> Ashok,
>
> On Mon, Nov 09 2020 at 21:14, Ashok Raj wrote:
> > On Mon, Nov 09, 2020 at 11:42:29PM +0100, Thomas Gleixner wrote:
> >> On Mon, Nov 09 2020 at 13:30, Jason Gunthorpe wrote:
> > Approach to IMS is more of a phased approach.
> >
> > #1 Allow physical device to scale beyond limits of PCIe MSIx
> > Follows current methodology for guest interrupt programming and
> > evolutionary changes rather than drastic.
>
> Trapping MSI[X] writes is there because it allows to hand a device to an
> unmodified guest OS and to handle the case where the MSI[X] entries
> storage cannot be mapped exclusively to the guest.
>
> But aside of this, it's not required if the storage can be mapped
> exclusively, the guest is hypervisor aware and can get a host composed
> message via a hypercall. That works for physical functions and SRIOV,
> but not for SIOV.
It would greatly help if you could put down what you see as blocking
forward progress in the following areas.
Gaps in the spec:
Specs can accommodate change after review, as shown by the number of ECNs
that go on with PCIe ;-). Please add what you would like to see in the spec
if you believe there is a gap today.
Hardware Gaps?
- PASID tagged Interrupts.
- IOMMU Support for PASID based IR.
As I had called out, there are a lot of moving parts, and this requires more
attention.
OS Gaps?
- Lack of ability to identify if platform can use IMS.
- Lack of hypercall.
We will always have devices that need more interrupts but whose usage neither
requires IMS to be directly manipulated by the guest nor requires more than
what PCIe allows in a guest. These devices can scale by adding another
sub-device, and you get another block of 2048 if needed.
This isn't just for idxd, as I mentioned earlier, there are vendors other
than Intel already working on this. In all cases the need for guest direct
manipulation of interrupt store hasn't come up. From the discussion, it
seems like there are devices today or in future that will require direct
manipulation of interrupt store in the guest. This needs additional work
in both the device hardware providing the right plumbing and OS work to
comprehend those.
Cheers,
Ashok
Hi David
I didn't follow the support for 32768 CPUs in a guest without IR support.
Can you tell me how that is done?
On Sun, Nov 08, 2020 at 03:25:57PM -0800, Ashok Raj wrote:
> On Sun, Nov 08, 2020 at 06:34:55PM +0000, David Woodhouse wrote:
> > >
> > > When we do interrupt remapping support in guest which would be required
> > > if we support x2apic in guest, I think this is something we should look into more
> > > carefully to make this work.
> >
> > No, interrupt remapping is not required for X2APIC in guests
> >
> > They can have X2APIC and up to 32768 CPUs without needing interrupt
>
> How is this made available today without interrupt remapping?
>
> I thought without IR, the destination ID is still limited to only 8 bits?
>
> On native, even if you have less than 255 cpu's but the APICID are sparsly
> distributed due to platform rules, the x2apic id could be more than 8 bits.
> Which is why the spec requires IR when x2apic is enabled.
>
> > remapping at all. Only if they want more than 32768 vCPUs, or to do
> > nested virtualisation and actually remap for the benefit of *their*
> > (L2+) guests would they need IR.
On Mon, Nov 09, 2020 at 09:14:12PM -0800, Raj, Ashok wrote:
> There are multiple tools (such as logic analyzers) and OEM test validation
> harnesses that depend on such DWORD sized DMA writes with no PASID as interrupt
> messages. One of the feedback we had received in the development of the
> specification was to avoid impacting such tools irrespective of
> MSI-X or IMS
This is a really bad reason to make a poor decision for system
security. Relying on trapping/emulation increases the attack surface
and complexity of the VMM and the device which now have to create this
artificial split, which does not exist in SRIOV.
Hopefully we won't see devices get this wrong, but any path that
allows the guest to cause the device to create TLPs outside its IOMMU
containment is worrisome for security.
> was used for interrupt message storage (on the wire they follow the
> same format), and also to ensure interoperability of devices
> supporting IMS across CPU vendors (who may not support PASID TLP
> prefix). This is one reason that led to interrupts from IMS to not
> use PASID (and match the wire format of MSI/MSI-X generated
> interrupts). The other problem was disambiguation between DMA to
> SVM v/s interrupts.
This is a defect in the IOMMU, not something fundamental.
The IOMMU needs to know if the interrupt range is active or not for
each PASID. Process based SVA will, of course, not enable interrupts
on the PASID, VM Guest based PASID will.
> Intel had published the specification almost 2 years back and have
> comprehended all the feedback received from the ecosystem
> (both open-source and others), along with offering the specification
> to be implemented by any vendors (both device and CPU vendors).
> There are few device vendors who are implementing to the spec already and
> are being explored for support by other CPU vendors
Which is why it is such a shame that including PASID in the MSI was
deliberately skipped in the document; the ecosystem could have been
much more aligned to this solution by now :(
Jason
On Tue, Nov 10, 2020 at 06:13:23AM -0800, Raj, Ashok wrote:
> This isn't just for idxd, as I mentioned earlier, there are vendors other
> than Intel already working on this. In all cases the need for guest direct
> manipulation of interrupt store hasn't come up. From the discussion, it
> seems like there are devices today or in future that will require direct
> manipulation of interrupt store in the guest. This needs additional work
> in both the device hardware providing the right plumbing and OS work to
> comprehend those.
We'd want to see SRIOV's assigned to guests to be able to use
IMS. This allows a SRIOV instance in a guest to spawn SIOV's which is
useful.
SIOV's assigned to guests could use IMS, but the use cases we see in
the short term can be handled by using SRIOV instead.
I would expect in general for SIOV to use MSI-X emulation to expose
interrupts - it would be really weird for a SIOV emulator to do
something else and we should probably discourage that.
Jason
> Hi David
>
> I did't follow the support for 32768 CPUs in guest without IR support.
>
> Can you tell me how that is done?
Using bits 11-5 of the MSI address bits (the other 7 bits of "Extended
Destination ID" that aren't the Remappable Format indicator).
And physical addressing mode, which is no loss for external interrupts
since they're all unicast dest_Fixed these days anyway.
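As a rough sketch of the layout being described, assuming the x86 MSI
address format (illustrative only, not an existing kernel helper):

#include <stdint.h>

/*
 * A 15-bit physical APIC ID packed into an MSI address without interrupt
 * remapping: APIC ID[7:0] goes in the Destination ID field (bits 19:12),
 * APIC ID[14:8] in the Extended Destination ID field (bits 11:5).
 * Bit 2 stays clear for physical destination mode.
 */
static uint32_t msi_addr_for_apic_id(uint32_t apic_id)
{
	uint32_t addr = 0xFEE00000u;

	addr |= (apic_id & 0xffu) << 12;	/* Destination ID [7:0]    */
	addr |= ((apic_id >> 8) & 0x7fu) << 5;	/* Extended Dest ID [14:8] */
	return addr;
}

With 15 destination bits this addresses up to 32768 CPUs, matching the
number quoted above.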
--
dwmw2
> From: Jason Gunthorpe <[email protected]>
> Sent: Tuesday, November 10, 2020 10:24 PM
>
> On Tue, Nov 10, 2020 at 06:13:23AM -0800, Raj, Ashok wrote:
>
> > This isn't just for idxd, as I mentioned earlier, there are vendors other
> > than Intel already working on this. In all cases the need for guest direct
> > manipulation of interrupt store hasn't come up. From the discussion, it
> > seems like there are devices today or in future that will require direct
> > manipulation of interrupt store in the guest. This needs additional work
> > in both the device hardware providing the right plumbing and OS work to
> > comprehend those.
>
> We'd want to see SRIOV's assigned to guests to be able to use
> IMS. This allows a SRIOV instance in a guest to spawn SIOV's which is
> useful.
Does your VF support both MSI/IMS or IMS only? If it is the former can't
we adopt a phased approach or parallel effort between forcing guest
to use MSI and adding hypercall to enable IMS on VF? Finding a way
to disable IMS is anyway required per earlier discussion when hypercall
is not available, and it could still provide a functional though suboptimal
model for such VFs.
>
> SIOV's assigned to guests could use IMS, but the use cases we see in
> the short term can be handled by using SRIOV instead.
>
> I would expect in general for SIOV to use MSI-X emulation to expose
> interrupts - it would be really weird for a SIOV emulator to do
> something else and we should probably discourage that.
>
I agree with this point. It makes the hardware gaps in the IOMMU and root
complex less of an immediate blocker, to be addressed in the long term.
Thanks
Kevin
> From: Jason Gunthorpe <[email protected]>
> Sent: Tuesday, November 10, 2020 10:19 PM
> On Mon, Nov 09, 2020 at 09:14:12PM -0800, Raj, Ashok wrote:
>
> > was used for interrupt message storage (on the wire they follow the
> > same format), and also to ensure interoperability of devices
> > supporting IMS across CPU vendors (who may not support PASID TLP
> > prefix). This is one reason that led to interrupts from IMS to not
> > use PASID (and match the wire format of MSI/MSI-X generated
> > interrupts). The other problem was disambiguation between DMA to
> > SVM v/s interrupts.
>
> This is a defect in the IOMMU, not something fundamental.
>
> The IOMMU needs to know if the interrupt range is active or not for
> each PASID. Process based SVA will, of course, not enable interrupts
> on the PASID, VM Guest based PASID will.
>
Unfortunately it's more than that. The interrupt message is first recognized
at the root complex today and then routed to the IOMMU, unlike other DMA
requests. I'm not saying it's an unsolvable limitation, but I just want to
point out that to achieve such a goal there are more things to be considered
beyond the IOMMU.
Thanks
Kevin
> From: Raj, Ashok <[email protected]>
> Sent: Tuesday, November 10, 2020 10:13 PM
>
> Thomas,
>
> With all these interrupt message storms ;-), I'm missing how to move
> towards
> an end goal.
>
> On Tue, Nov 10, 2020 at 11:27:29AM +0100, Thomas Gleixner wrote:
> > Ashok,
> >
> > On Mon, Nov 09 2020 at 21:14, Ashok Raj wrote:
> > > On Mon, Nov 09, 2020 at 11:42:29PM +0100, Thomas Gleixner wrote:
> > >> On Mon, Nov 09 2020 at 13:30, Jason Gunthorpe wrote:
> > > Approach to IMS is more of a phased approach.
> > >
> > > #1 Allow physical device to scale beyond limits of PCIe MSIx
> > > Follows current methodology for guest interrupt programming and
> > > evolutionary changes rather than drastic.
> >
> > Trapping MSI[X] writes is there because it allows to hand a device to an
> > unmodified guest OS and to handle the case where the MSI[X] entries
> > storage cannot be mapped exclusively to the guest.
> >
> > But aside of this, it's not required if the storage can be mapped
> > exclusively, the guest is hypervisor aware and can get a host composed
> > message via a hypercall. That works for physical functions and SRIOV,
> > but not for SIOV.
>
> It would greatly help if you can put down what you see is blocking
> to move forward in the following areas.
>
Agree. We really need some guidance on how to move forward. I think all
people in this thread are now aligned that it's not an Intel- or IDXD-specific
thing (e.g. we need an architectural solution, enabling IMS on PF/VF is
important, etc.). But what we are not sure of is whether we need to complete
all requirements in one batch, or can evolve step by step as long as the
growth path is clearly defined.
IMHO finding a way to disable IMS in the guest is more important than
supporting IMS on PF/VF, since the latter requires a hypercall which is not
available in all scenarios. Even if Linux includes hypercall support for all
existing archs and hypervisors, it could run as an unmodified guest on a new
hypervisor before that hypervisor gets its enlightenments into Linux. So it
is more pressing to find a way to force using MSI/MSI-X inside the guest, as
it keeps such PFs/VFs functional even though they don't benefit from all the
scalability merits of IMS.
If such a two-step plan can be agreed upon, then the next open question is
how to disable IMS in the guest. We need a sane solution when checking in the
initial host-only IMS support. There are several options discussed in this thread:
1. Industry standard (e.g. a vendor-agnostic ACPI flag) followed by all
platforms, hypervisors and OSes. It will require collaboration beyond
Linux community;
2. IOMMU-vendor-specific standards (DMAR, IORT, etc.) to report whether
IMS is allowed, implying that IMS is tied to the IOMMU. This tradeoff is
acceptable since IMS alone cannot make SIOV work, which relies on the
IOMMU anyway. And this might be an easier path to move forward, and might
not even require waiting for all vendors to extend their tables together.
On physical platform the FW always reports IMS as 'allowed' and there is
time to change it. On virtual platform the hypervisor can choose to hide
IMS in three ways:
a) do not expose IOMMU
b) expose IOMMU, but using the old format
c) expose IOMMU, using the new format with IMS reported 'disallowed'
a/b can well support legacy software stack.
However, there is one potential issue with option 1/2. The construction
of the virtual ACPI table is at VM creation time, likely based on whether a
PV interrupt controller is exposed to this guest. However, in most cases the
hypervisor doesn't know which guest OS is running and whether it will
use the PV controller when the VM is being created. If IMS is marked as
'allowed' in the virtual DMAR table, an unmodified guest might just go and
enable it as if it were on the native platform. Maybe what we really require
is a flag to tell the guest that although IMS is available you cannot use it
with traditional interrupt controllers?
3. Use IOMMU 'caching mode' as the hint of running as a guest and disable
IMS by default as long as 'caching mode' is detected. IIRC all IOMMU vendors
provide such a capability for constructing a shadow IOMMU page table. Later,
when hypercall support is detected for a specific hypervisor/arch, that path
can override the IOMMU hint to enable IMS.
Unlike the first two options, this will be a Linux-specific policy but
self-contained. Other guest OSes may not follow this way though.
4. Using CPUID to detect running as guest. But as Thomas pointed out, this
approach is less reliable as not all hypervisors do this way.
Thoughts?
Thanks
Kevin
On Sun, Nov 08, 2020 at 07:36:34PM +0000, David Woodhouse wrote:
> So it does look like we're going to need a hypercall interface to
> compose an MSI message on behalf of the guest, for IMS to use. In fact
> PCI devices assigned to a guest could use that too, and then we'd only
> need to trap-and-remap any attempt to write a Compatibility Format MSI
> to the device's MSI table, while letting Remappable Format messages get
> written directly.
>
> We'd also need a way for an OS running on bare metal to *know* that
> it's on bare metal and can just compose MSI messages for itself. Since
> we do expect bare metal to have an IOMMU, perhaps that is just a
> feature flag on the IOMMU?
Have the platform firmware advertise if it needs native or virtualized
IMS handling. If it advertises neither don't support IMS?
On Wed, Nov 11, 2020 at 03:41:59PM +0000, Christoph Hellwig wrote:
> On Sun, Nov 08, 2020 at 07:36:34PM +0000, David Woodhouse wrote:
> > So it does look like we're going to need a hypercall interface to
> > compose an MSI message on behalf of the guest, for IMS to use. In fact
> > PCI devices assigned to a guest could use that too, and then we'd only
> > need to trap-and-remap any attempt to write a Compatibility Format MSI
> > to the device's MSI table, while letting Remappable Format messages get
> > written directly.
> >
> > We'd also need a way for an OS running on bare metal to *know* that
> > it's on bare metal and can just compose MSI messages for itself. Since
> > we do expect bare metal to have an IOMMU, perhaps that is just a
> > feature flag on the IOMMU?
>
> Have the platform firmware advertise if it needs native or virtualized
> IMS handling. If it advertises neither don't support IMS?
The platform hint can be easily accomplished via DMAR table flags. We could
have an IMS_OPTOUT flag (similar to the x2apic optout flag); when it is 0 we
are native and IMS is supported.
When a vIOMMU is presented to the guest, the virtual DMAR table will have
this flag set to 1, indicating to the guest OS that native IMS isn't supported.
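As a rough sketch of what the guest-side check could look like (the IMS
opt-out flag does not exist in any spec today; the bit position below is
invented purely for illustration, and 'flags' stands for the flags byte of
the possibly-virtual ACPI DMAR table):

#include <linux/types.h>

/* Hypothetical flag bit; position made up for illustration only. */
#define DMAR_FLAG_IMS_OPT_OUT	(1 << 3)

/* A VMM exposing a vDMAR would set the opt-out bit, a real platform
 * would leave it clear. */
static bool platform_allows_native_ims(u8 flags)
{
	return !(flags & DMAR_FLAG_IMS_OPT_OUT);
}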
On Wed, Nov 11 2020 at 08:09, Ashok Raj wrote:
> On Wed, Nov 11, 2020 at 03:41:59PM +0000, Christoph Hellwig wrote:
>> On Sun, Nov 08, 2020 at 07:36:34PM +0000, David Woodhouse wrote:
>> > So it does look like we're going to need a hypercall interface to
>> > compose an MSI message on behalf of the guest, for IMS to use. In fact
>> > PCI devices assigned to a guest could use that too, and then we'd only
>> > need to trap-and-remap any attempt to write a Compatibility Format MSI
>> > to the device's MSI table, while letting Remappable Format messages get
>> > written directly.
>> >
>> > We'd also need a way for an OS running on bare metal to *know* that
>> > it's on bare metal and can just compose MSI messages for itself. Since
>> > we do expect bare metal to have an IOMMU, perhaps that is just a
>> > feature flag on the IOMMU?
>>
>> Have the platform firmware advertise if it needs native or virtualized
>> IMS handling. If it advertises neither don't support IMS?
>
> The platform hint can be easily accomplished via DMAR table flags. We could
> have an IMS_OPTOUT(similart to x2apic optout flag) flag, when 0 its native
> and IMS is supported.
>
> When vIOMMU is presented to guest, virtual DMAR table will have this flag
> set to 1. Indicates to GuestOS, native IMS isn't supported.
These opt-out bits suck by definition. It comes all back to the fact
that the whole virt thing didn't have a hardware defined way to tell
that the OS runs in a VM and not on bare metal. It wouldn't have been
rocket science to do so.
And because that does not exist, we need magic opt-out bits for every
other piece of functionality which gets added. Can we please stop this
and provide a well defined way to tell the OS whether it runs on bare
metal or not?
The point is that you really want opt-in bits so that decisions come
down to
	if (!virt || virt->supports_X)
which is the obvious sane and safe logic. But sure, why am I asking for
sane and safe in the context of virtualization?
Thanks,
tglx
Ashok,
On Wed, Nov 11 2020 at 15:03, Ashok Raj wrote:
> On Wed, Nov 11, 2020 at 11:27:28PM +0100, Thomas Gleixner wrote:
>> which is the obvious sane and safe logic. But sure, why am I asking for
>> sane and safe in the context of virtualization?
>
> We can pick how to solve this, and are just waiting for you to tell us what
> mechanism you prefer that's less painful and architecturally acceptable for
> virtualization and Linux. We are all ears!
Obviously we can't turn back time. The point I was trying to make is
that the general approach of just bolting things on top of the existing
maze is bad in general.
Opt-out bits are error prone simply because anything which exists before
that point does not know that it should set that bit. Obvious, right?
CPUID bits are 'Feature available' and not 'Feature no longer
available' for a reason.
So with the introduction of VT this stringent road was left and the
approach was: Don't tell the guest OS that it's not running on bare
metal.
That's a perfectly fine approach for running existing legacy OSes which
do not care at all because they don't know about anything of this
newfangled stuff.
But it falls flat on its nose for anything which comes past that
point, simply because there is no reliable way to tell in which context
the OS runs.
The VMM can decide not to set, or may not even have support for setting, the
software CPUID bit which tells the guest OS that it does NOT run on bare
metal, and still hand in newfangled PCI devices for which the guest OS
happens to have a driver, which then falls flat on its nose because some
magic functionality is not there.
So we have the following matrix:
   VMM    Guest OS
   Old    Old       -> Fine, does not support any of that
   New    Old       -> Fine, does not support any of that
   New    New       -> Fine, works as expected
   Old    New       -> FAIL
To fix this we have to come up with heuristics again to figure out which
context we are running in and whether some magic feature can be
supported or not:
probably_on_bare_metal()
{
	if (CPUID(FEATURE_HYPERVISOR))
		return false;
	if (dmi_match_hypervisor_vendor())
		return false;
	return PROBABLY_RUNNING_ON_BARE_METAL;
}
Yes, it probably works in most cases, but it still works by chance and
that's what I really hate about this; indeed 'hate' is not a strong
enough word.
Why on earth did VT not introduce a reliable way (instruction, CPUID
leaf, MSR, whatever, which can't be manipulated by the VMM) to let the OS
figure out where it runs?
Just because the general approach to these problems is: We can fix that
in software.
No, you can't fix inconsistency in software at all.
This is not the first time that we tell HW folks to stop this 'Fix this
in software' attitude which has caused more problems than it solved.
And you can argue in circles until you are blue, that inconsistency is
not going away.
Every time new (mis)features are added which need the OS to be aware of
whether it runs on bare metal or in a VM, we have this unsolvable dance
of requiring that the underlying VMM tell the guest OS NOT to use it,
instead of having the guest OS make the simple decision:
	if (!definitely_on_bare_metal())
		return -ENOTSUPP;

or with a newer version of the guest OS:

	if (!definitely_on_bare_metal() && !hypervisor->supportsthis())
		return -ENOTSUPP;
I'm halfway content to go with the above probably_on_bare_metal()
function as a replacement for definitely_on_bare_metal() going forward,
but only for the very simple reason that this is the only option we
have.
Thanks,
tglx
On Wed, Nov 11, 2020 at 11:27:28PM +0100, Thomas Gleixner wrote:
> On Wed, Nov 11 2020 at 08:09, Ashok Raj wrote:
> >> > We'd also need a way for an OS running on bare metal to *know* that
> >> > it's on bare metal and can just compose MSI messages for itself. Since
> >> > we do expect bare metal to have an IOMMU, perhaps that is just a
> >> > feature flag on the IOMMU?
> >>
> >> Have the platform firmware advertise if it needs native or virtualized
> >> IMS handling. If it advertises neither don't support IMS?
> >
> > The platform hint can be easily accomplished via DMAR table flags. We could
> > have an IMS_OPTOUT(similart to x2apic optout flag) flag, when 0 its native
> > and IMS is supported.
> >
> > When vIOMMU is presented to guest, virtual DMAR table will have this flag
> > set to 1. Indicates to GuestOS, native IMS isn't supported.
>
> These opt-out bits suck by definition. It comes all back to the fact
> that the whole virt thing didn't have a hardware defined way to tell
> that the OS runs in a VM and not on bare metal. It wouldn't have been
> rocket science to do so.
I'm sure everybody dislikes them (hate being a strong word :-)).
The DVSEC capability: real hardware always sets it to 1 for the IMS capability.
By default the DVSEC is not presented to the guest even when the full PF is
presented to the guest. I believe VFIO only builds and presents known standard
capabilities and specific extended capabilities. I'm a bit weak on this, but
maybe @AlexWilliamson can confirm if I'm off track.
This tells the driver in the guest that IMS is not available, so it will not
make those new dev_msi calls.
Only if the VMM has built support to expose IMS for this device can guest SW
even see DVSEC.SIOV.IMS=1. This also means the required plumbing, say a
vIOMMU or a hypercall, has been provisioned, and the administrator knows the
guest is compatible with these options.
There may be better ways to do this. If this has to be done differently,
we certainly can and will do so.
>
> And because that does not exist, we need magic opt-out bits for every
> other piece of functionality which gets added. Can we please stop this
> and provide a well defined way to tell the OS whether it runs on bare
> metal or not?
>
> The point is that you really want opt-in bits so that decisions come
> down to
How would we opt in when the feature is not available? You need some way to
tell whether the capability is available in the guest, but then there is no
reason to opt in; it's ready for use, isn't it?
>
> if (!virt || virt->supports_X)
The closest thing that comes to mind is the CPUID bits; you had mentioned in
an earlier mail that they aren't reliable if the VMM didn't set them. If you
want platform-level generic support:
- DMAR table opt-outs: you had mentioned that's ugly.
- We could use caching mode, but it's not a platform-level thing, and it's
  vendor specific. I'm not sure if other vendors have a similar feature. If
  there were a generic capability, we could expose via the IOMMU APIs whether
  we are on a virtual or a real platform.
>
> which is the obvious sane and safe logic. But sure, why am I asking for
> sane and safe in the context of virtualization?
We can pick how to solve this, and are just waiting for you to tell us what
mechanism you prefer that's less painful and architecturally acceptable for
virtualization and Linux. We are all ears!
Cheers,
Ashok
On Wed, Nov 11, 2020 at 03:03:21PM -0800, Raj, Ashok wrote:
> By default the DVSEC is not presented to guest even when the full PF is
> presented to guest. I believe VFIO only builds and presents known standard
> capabilities and specific extended capabilities. I'm a bit weak but maybe
> @AlexWilliamson can confirm if I'm off track.
This also needs to work on Hyper-V and all other cases; you can't just
assume everything is VFIO and KVM.
It is horrible to ask people to go back and retroactively change the
config space in a device just to work around all the design failings
Thomas eloquently describes :(
Jason
On Wed, Nov 11, 2020 at 02:17:48AM +0000, Tian, Kevin wrote:
> > From: Jason Gunthorpe <[email protected]>
> > Sent: Tuesday, November 10, 2020 10:24 PM
> >
> > On Tue, Nov 10, 2020 at 06:13:23AM -0800, Raj, Ashok wrote:
> >
> > > This isn't just for idxd, as I mentioned earlier, there are vendors other
> > > than Intel already working on this. In all cases the need for guest direct
> > > manipulation of interrupt store hasn't come up. From the discussion, it
> > > seems like there are devices today or in future that will require direct
> > > manipulation of interrupt store in the guest. This needs additional work
> > > in both the device hardware providing the right plumbing and OS work to
> > > comprehend those.
> >
> > We'd want to see SRIOV's assigned to guests to be able to use
> > IMS. This allows a SRIOV instance in a guest to spawn SIOV's which is
> > useful.
>
> Does your VF support both MSI/IMS or IMS only?
Of course VF's support MSI..
> If it is the former can't we adopt a phased approach or parallel
> effort between forcing guest to use MSI and adding hypercall to
> enable IMS on VF? Finding a way to disable IMS is anyway required
> per earlier discussion when hypercall is not available, and it could
> still provide a functional though suboptimal model for such VFs.
Sure, I view that as the bare minimum
Jason
.monster snip..
> 4. Using CPUID to detect running as guest. But as Thomas pointed out, this
> approach is less reliable as not all hypervisors do this way.
Is that truly true? It is the first time I see the argument that extra
steps are needed and that checking for X86_FEATURE_HYPERVISOR is not enough.
Or is it more "Some hypervisor probably forgot about it, so lets make sure we patch
over that possible hole?"
Also is there anything in this spec that precludes this from working
on non-X86 architectures, say ARM systems?
On Thu, Nov 12 2020 at 14:32, Konrad Rzeszutek Wilk wrote:
>> 4. Using CPUID to detect running as guest. But as Thomas pointed out, this
>> approach is less reliable as not all hypervisors do this way.
>
> Is that truly true? It is the first time I see the argument that extra
> steps are needed and that checking for X86_FEATURE_HYPERVISOR is not enough.
>
> Or is it more "Some hypervisor probably forgot about it, so lets make sure we patch
> over that possible hole?"
Nothing enforces that bit to be set. The bit is a pure software
convention and was proposed by VMWare in 2008 with the following
changelog:
"This patch proposes to use a cpuid interface to detect if we are
running on an hypervisor.
The discovery of a hypervisor is determined by bit 31 of CPUID#1_ECX,
which is defined to be "hypervisor present bit". For a VM, the bit is
1, otherwise it is set to 0. This bit is not officially documented by
either Intel/AMD yet, but they plan to do so some time soon, in the
meanwhile they have promised to keep it reserved for virtualization."
The reserved promise seems to hold. AMDs APM has it documented. The
Intel SDM not so.
Also the kernel side of KVM does not enforce that bit, it's up to the user
space management to set it.
And yes, I've tripped over this with some hypervisors and even qemu KVM
failed to set it in the early days because it was masked with host CPUID
trimming as there the bit is obviously 0.
DMI vendor name is pretty good final check when the bit is 0. The
strings I'm aware of are:
QEMU, Bochs, KVM, Xen, VMware, VMW, VMware Inc., innotek GmbH, Oracle
Corporation, Parallels, BHYVE, Microsoft Corporation
which is not complete but better than nothing ;)
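Fleshed out with existing kernel helpers, the probably_on_bare_metal()
heuristic from earlier might look roughly like this (x86-only sketch; the
vendor list is the admittedly incomplete one above):

#include <linux/dmi.h>
#include <linux/kernel.h>
#include <asm/cpufeature.h>

/* Sketch: CPUID hypervisor bit first, DMI vendor strings as a fallback.
 * "Probably", not "definitely". */
static const char * const hv_vendors[] = {
	"QEMU", "Bochs", "KVM", "Xen", "VMware", "VMW", "VMware Inc.",
	"innotek GmbH", "Oracle Corporation", "Parallels", "BHYVE",
	"Microsoft Corporation",
};

static bool probably_on_bare_metal(void)
{
	unsigned int i;

	if (boot_cpu_has(X86_FEATURE_HYPERVISOR))
		return false;

	for (i = 0; i < ARRAY_SIZE(hv_vendors); i++)
		if (dmi_name_in_vendors(hv_vendors[i]))
			return false;

	return true;
}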
Thanks,
tglx
> From: Thomas Gleixner <[email protected]>
> Sent: Friday, November 13, 2020 6:43 AM
>
> On Thu, Nov 12 2020 at 14:32, Konrad Rzeszutek Wilk wrote:
> >> 4. Using CPUID to detect running as guest. But as Thomas pointed out, this
> >> approach is less reliable as not all hypervisors do this way.
> >
> > Is that truly true? It is the first time I see the argument that extra
> > steps are needed and that checking for X86_FEATURE_HYPERVISOR is not
> enough.
> >
> > Or is it more "Some hypervisor probably forgot about it, so lets make sure
> we patch
> > over that possible hole?"
>
> Nothing enforces that bit to be set. The bit is a pure software
> convention and was proposed by VMWare in 2008 with the following
> changelog:
>
> "This patch proposes to use a cpuid interface to detect if we are
> running on an hypervisor.
>
> The discovery of a hypervisor is determined by bit 31 of CPUID#1_ECX,
> which is defined to be "hypervisor present bit". For a VM, the bit is
> 1, otherwise it is set to 0. This bit is not officially documented by
> either Intel/AMD yet, but they plan to do so some time soon, in the
> meanwhile they have promised to keep it reserved for virtualization."
>
> The reserved promise seems to hold. AMDs APM has it documented. The
> Intel SDM not so.
>
> Also the kernel side of KVM does not enforce that bit, it's up to the user
> space management to set it.
>
> And yes, I've tripped over this with some hypervisors and even qemu KVM
> failed to set it in the early days because it was masked with host CPUID
> trimming as there the bit is obviously 0.
>
> DMI vendor name is pretty good final check when the bit is 0. The
> strings I'm aware of are:
>
> QEMU, Bochs, KVM, Xen, VMware, VMW, VMware Inc., innotek GmbH,
> Oracle
> Corporation, Parallels, BHYVE, Microsoft Corporation
>
> which is not complete but better than nothing ;)
>
> Thanks,
>
> tglx
Hi, Thomas,
CPUID#1_ECX is an x86 thing. Do we need to figure out
probably_on_bare_metal for every architecture altogether, or is it OK to
just handle it for the x86 arch at this stage? Based on previous discussions,
IMS is just one piece of multiple technologies to enable SIOV-like
scalability. Ideally arch-specific enablement beyond IMS (e.g. the
IOMMU part) will be required for such scaled usage, thus we
may just leave IMS disabled for non-x86 and wait until then to
figure out an arch-specific probably_on_bare_metal?
Thanks
Kevin
On Fri, Nov 13, 2020 at 02:42:02AM +0000, Tian, Kevin wrote:
> CPUID#1_ECX is a x86 thing. Do we need to figure out probably_on_
> bare_metal for every architecture altogether, or is it OK to just
> handle it for x86 arch at this stage? Based on previous discussions
> ims is just one piece of multiple technologies to enable SIOV-like
> scalability. Ideally arch-specific enablement beyond ims (e.g. the
> IOMMU part) will be required for such scaled usage thus we
> may just leave ims disabled for non-x86 and wait until that time to
> figure out arch specific probably_on_bare_metal?
At the very least you need to ensure that
pci_subdevice_msi_create_irq_domain() fails entirely on other
architectures until they can sort out these sorts of issues..
Jason
On Fri, Nov 13 2020 at 02:42, Kevin Tian wrote:
>> From: Thomas Gleixner <[email protected]>
> CPUID#1_ECX is a x86 thing. Do we need to figure out probably_on_
> bare_metal for every architecture altogether, or is it OK to just
> handle it for x86 arch at this stage? Based on previous discussions
> ims is just one piece of multiple technologies to enable SIOV-like
> scalability. Ideally arch-specific enablement beyond ims (e.g. the
> IOMMU part) will be required for such scaled usage thus we
> may just leave ims disabled for non-x86 and wait until that time to
> figure out arch specific probably_on_bare_metal?
Of course this is not only an x86 problem. Every architecture which
supports virtualization has the same issue. ARM(64) has no way to tell
for sure whether the machine runs on bare metal either. No idea about the
other architectures.
Thanks,
tglx
> Of course is this not only an x86 problem. Every architecture which
> supports virtualization has the same issue. ARM(64) has no way to tell
> for sure whether the machine runs bare metal either. No idea about the
> other architectures.
Sounds like a hypervisor problem. If the VMM provides perfect emulation
of every weird quirk of h/w, then it is OK to let the guest believe that it is
running on bare metal.
If it isn't perfect, then it should make sure the guest knows *for sure*, so that
the guest can take appropriate actions to avoid the sharp edges.
-Tony
On Fri, Nov 13, 2020 at 08:12:39AM -0800, Luck, Tony wrote:
> > Of course is this not only an x86 problem. Every architecture which
> > supports virtualization has the same issue. ARM(64) has no way to tell
> > for sure whether the machine runs bare metal either. No idea about the
> > other architectures.
>
> Sounds like a hypervisor problem. If the VMM provides perfect emulation
> of every weird quirk of h/w, then it is OK to let the guest believe that it is
> running on bare metal.
That's true, which is why there isn't an immutable bit in CPUID or
otherwise telling you that you are running under a hypervisor. Providing
something like that would make certain features not virtualizable.
Apparently, before we had faulting CPUID, what you had in the guest was the
real raw CPUID.
Waiver: I'm not saying this is perfect, I'm just replaying the reasoning
behind it. Not trying to defend it... flames > /dev/null
>
> If it isn't perfect, then it should make sure the guest knows *for sure*, so that
> the guest can take appropriate actions to avoid the sharp edges.
>
There are indeed 2 problems to solve.
1. How does the device driver know if the device is IMS capable?
   IMS is a device attribute. Each vendor can provide its own method to
   provide that indication. One such mechanism is the DVSEC.SIOV.IMS
   property. Some might believe this is for use only by Intel. For DVSEC I
   don't believe there is such a tie-in as with the device vendor id in the
   standard header. TBH, there are other device vendors using the exact
   same method to indicate the SIOV and IMS properties. What a DVSEC vendor
   ID states is "As defined by Vendor X".
   Why we chose config space vs. something in device-specific MMIO is
   because VFIO, being the one common mechanism today, only exposes known
   standard and some extended headers to the guest. When we expose a full
   PF, the guest doesn't see the DVSEC, so drivers know this isn't
   available. This is our mechanism to stop drivers from calling
   pci_ims_array_create_msi_irq_domain(). It may not be perfect for all
   devices; it is a device-specific mechanism. For devices under
   consideration following the SIOV spec it meets the spirit of the
   requirement even without #2 below. When devices have no way to detect
   this, #2 is required as a second way to block IMS.
2. How does platform component (IOMMU) inform if they can support all forms
of IMS. (On device, or in memory).
On device would require some form trap/emulate. Legacy MSIx already has
that solved, but for device specific store you need some additional
work.
When its system memory (say IMS is in GPA space), you need some form of
hypercall. There is no way around it since we can't intercept. Yes, you
can maybe map those as RO and trap, but its not pretty.
To solve this, rather than a generic platform capability, maybe we should
flip this to the IOMMU instead, because that's the component that offers
this capability today.
iommu_ims_supported()
	When the platform has no IOMMU or no hypervisor calls, it returns
	false. So the device driver can tell, even if the device advertises
	IMS, whether the platform supports IMS.
On platforms where the iommu supports the capability:
	Either there is a vIOMMU with a Virtual Command Register that can
	provide a way to get the interrupt handle, similar to what you
	would get from a hypercall for instance. Or there is a real
	hypercall that supports giving the guest OS the physical IRTE
	handle. (See the second sketch below.)
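For illustration, here is a minimal sketch of what the DVSEC based device
check from item 1 could look like on the driver side. The extended
capability walk and config accessors are standard PCI helpers; the SIOV
DVSEC id, the capability register offset, the IMS bit and the helper name
are assumptions for illustration only, not the final register layout:

/* Hypothetical values, for illustration only */
#define PCI_DVSEC_ID_INTEL_SIOV		0x5
#define PCI_DVSEC_INTEL_SIOV_CAP	0x14
#define PCI_DVSEC_INTEL_SIOV_CAP_IMS	0x1

static bool pci_dev_has_ims(struct pci_dev *pdev)
{
	u16 dvsec = 0;
	u32 hdr, cap;

	while ((dvsec = pci_find_next_ext_capability(pdev, dvsec,
						     PCI_EXT_CAP_ID_DVSEC))) {
		u16 id;

		/* DVSEC vendor id: "as defined by Vendor X" */
		pci_read_config_dword(pdev, dvsec + PCI_DVSEC_HEADER1, &hdr);
		if ((hdr & 0xffff) != PCI_VENDOR_ID_INTEL)
			continue;

		pci_read_config_word(pdev, dvsec + PCI_DVSEC_HEADER2, &id);
		if (id != PCI_DVSEC_ID_INTEL_SIOV)
			continue;

		/* Found the SIOV DVSEC, check the assumed IMS capable bit */
		pci_read_config_dword(pdev, dvsec + PCI_DVSEC_INTEL_SIOV_CAP,
				      &cap);
		return cap & PCI_DVSEC_INTEL_SIOV_CAP_IMS;
	}

	return false;
}

And a rough sketch of the iommu_ims_supported() idea from item 2. The
function does not exist today; only the X86_FEATURE_HYPERVISOR check is
real, the vIOMMU/hypercall part is the future work described above:

bool iommu_ims_supported(void)
{
	/* Bare metal: nothing needs to be trapped or translated. */
	if (!boot_cpu_has(X86_FEATURE_HYPERVISOR))
		return true;

	/*
	 * In a guest we would need either a vIOMMU virtual command
	 * interface or a hypercall to obtain a usable interrupt handle.
	 * Neither is plumbed yet, so report no support for now.
	 */
	return false;
}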
--
Cheers,
Ashok
[Forgiveness is the attribute of the STRONG - Gandhi]
On Thu, Nov 12, 2020 at 11:42:46PM +0100, Thomas Gleixner wrote:
> DMI vendor name is pretty good final check when the bit is 0. The
> strings I'm aware of are:
>
> QEMU, Bochs, KVM, Xen, VMware, VMW, VMware Inc., innotek GmbH, Oracle
> Corporation, Parallels, BHYVE, Microsoft Corporation
>
> which is not complete but better than nothing ;)
Which is why I really think we need explicit opt-ins for "native"
SIOV handling and for paravirtualized SIOV handling, with the kernel
not offering support at all without either or a manual override on
the command line.
On Sat, Nov 14, 2020 at 10:34:30AM +0000, Christoph Hellwig wrote:
> On Thu, Nov 12, 2020 at 11:42:46PM +0100, Thomas Gleixner wrote:
> > DMI vendor name is pretty good final check when the bit is 0. The
> > strings I'm aware of are:
> >
> > QEMU, Bochs, KVM, Xen, VMware, VMW, VMware Inc., innotek GmbH, Oracle
> > Corporation, Parallels, BHYVE, Microsoft Corporation
> >
> > which is not complete but better than nothing ;)
>
> Which is why I really think we need explicit opt-ins for "native"
> SIOV handling and for paravirtualized SIOV handling, with the kernel
> not offering support at all without either or a manual override on
> the command line.
opt-in by device or kernel? The way we are planning to support this is:
Device support for IMS - Can discover in device specific means
Kernel support for IMS. - Supported by IOMMU driver.
each driver can check
	if (dev_supports_ims() && iommu_supports_ims()) {
		/* Then IMS is supported in the platform. */
	}
Until we have vIOMMU support or a hypercall, iommu_supports_ims() will
check for X86_FEATURE_HYPERVISOR in addition to the platform IDs Thomas
mentioned, or on Intel platforms check for cap.caching_mode=1 and return false.
When we add support for getting a native interrupt handle then we will plumb that
appropriately.
Does this match what you wanted?
On Sat, Nov 14 2020 at 13:18, Ashok Raj wrote:
> On Sat, Nov 14, 2020 at 10:34:30AM +0000, Christoph Hellwig wrote:
>> On Thu, Nov 12, 2020 at 11:42:46PM +0100, Thomas Gleixner wrote:
>> Which is why I really think we need explicit opt-ins for "native"
>> SIOV handling and for paravirtualized SIOV handling, with the kernel
>> not offering support at all without either or a manual override on
>> the command line.
>
> opt-in by device or kernel? The way we are planning to support this is:
>
> Device support for IMS - Can discover in device specific means
> Kernel support for IMS. - Supported by IOMMU driver.
And why exactly do we have to enforce IOMMU support? Please stop looking
at IMS purely from the IDXD perspective. We are talking about the
general concept here and not about the restricted Intel universe.
> each driver can check
>
> if (dev_supports_ims() && iommu_supports_ims()) {
> /* Then IMS is supported in the platform.*/
> }
Please forget this 'each driver can check'. That's just wrong.
The only thing the driver has to check is whether the device supports
IMS or not. Everything else has to be handled by the underlying
infrastructure.
That's pretty much the same thing like PCI/MSI[X]. The driver does not
have to check 'device_has_msix() && platform_supports_msix()'. Enabling
MSI[X] will simply fail if it's not supported.
So for IMS, creating the underlying irqdomain has to fail when the
platform does not support it, and the driver can act upon the failure and
fall back to MSI[X] or just refuse to load when IMS is required for the
device to be functional.
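As a rough sketch of that pattern (pci_ims_array_create_msi_irq_domain()
is from the series under discussion and its exact signature may differ;
the driver side names here are made up for illustration):

static int mydev_setup_irqs(struct mydev *md, struct pci_dev *pdev)
{
	int ret;

	/* Try IMS first; the core refuses if the platform can't do it. */
	md->ims_domain = pci_ims_array_create_msi_irq_domain(pdev,
							     &md->ims_info);
	if (md->ims_domain) {
		/* Allocate from the IMS domain; driver specific, not shown */
		return mydev_alloc_ims_vectors(md);
	}

	/* No IMS on this platform: fall back to plain MSI-X. */
	ret = pci_alloc_irq_vectors(pdev, 1, md->num_vectors, PCI_IRQ_MSIX);
	return ret < 0 ? ret : 0;
}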
Thanks,
tglx
On Sun, Nov 15, 2020 at 12:26:22PM +0100, Thomas Gleixner wrote:
> On Sat, Nov 14 2020 at 13:18, Ashok Raj wrote:
> > On Sat, Nov 14, 2020 at 10:34:30AM +0000, Christoph Hellwig wrote:
> >> On Thu, Nov 12, 2020 at 11:42:46PM +0100, Thomas Gleixner wrote:
> >> Which is why I really think we need explicit opt-ins for "native"
> >> SIOV handling and for paravirtualized SIOV handling, with the kernel
> >> not offering support at all without either or a manual override on
> >> the command line.
> >
> > opt-in by device or kernel? The way we are planning to support this is:
> >
> > Device support for IMS - Can discover in device specific means
> > Kernel support for IMS. - Supported by IOMMU driver.
>
> And why exactly do we have to enforce IOMMU support? Please stop looking
> at IMS purely from the IDXD perspective. We are talking about the
> general concept here and not about the restricted Intel universe.
I think you have mentioned it almost every reply :-)..Got that! Point taken
several emails ago!! :-)
I didn't mean just for idxd, I said for *ANY* device driver that wants to
use IMS.
>
> > each driver can check
> >
> > if (dev_supports_ims() && iommu_supports_ims()) {
> > /* Then IMS is supported in the platform.*/
> > }
>
> Please forget this 'each driver can check'. That's just wrong.
Ok.
>
> The only thing the driver has to check is whether the device supports
> IMS or not. Everything else has to be handled by the underlying
> infrastructure.
That's pretty much the same thing.. I guess you wanted the "Does the
infrastructure support IMS" check to live someplace else, instead of the
device driver checking it. That's perfectly fine.
Until we support this natively via hypercall or vIOMMU we can use your
variant of finding out whether you are not on bare metal to decide support
for IMS, as you highlighted below:
https://lore.kernel.org/lkml/[email protected]/
probably_on_bare_metal()
{
	if (CPUID(FEATURE_HYPERVISOR))
		return false;
	if (dmi_match_hypervisor_vendor())
		return false;

	return PROBABLY_RUNNING_ON_BARE_METAL;
}
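For reference, the dmi_match_hypervisor_vendor() piece above could be as
simple as matching DMI_SYS_VENDOR (or other DMI fields) against the
admittedly incomplete hypervisor vendor strings Thomas listed earlier;
dmi_get_system_info() is an existing helper, the rest is just a sketch:

static const char * const hv_vendors[] = {
	"QEMU", "Bochs", "KVM", "Xen", "VMware", "VMW", "VMware Inc.",
	"innotek GmbH", "Oracle Corporation", "Parallels", "BHYVE",
	"Microsoft Corporation",
};

static bool dmi_match_hypervisor_vendor(void)
{
	const char *vendor = dmi_get_system_info(DMI_SYS_VENDOR);
	int i;

	if (!vendor)
		return false;

	for (i = 0; i < ARRAY_SIZE(hv_vendors); i++) {
		if (!strcmp(vendor, hv_vendors[i]))
			return true;
	}

	return false;
}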
The above is all we need for now and will work in almost all cases.
We will move forward with just the above in the next series.
Below is for future consideration.
Even the above isn't foolproof if both the HYPERVISOR feature flag isn't
set and the DMI string doesn't match, say on some new hypervisor. The only
way we can figure that out is:
- If there is no iommu support, or the iommu can tell that it is a
virtualized iommu. The presence of caching_mode is one such indication
for Intel.
PS: Other IOMMUs must have something like this to support virtualization.
I'm not saying this is an Intel only feature, just in case you interpret
it that way! I'm only saying there should be a mechanism to distinguish
a native vs an emulated platform.
When the vIOMMU supports getting the native interrupt handle via a virtual
command interface for Intel IOMMUs, or some equivalent when other vendors
provide such a capability, then even without a hypercall a virtualized
IOMMU can provide the same solution.
If we support a hypercall then it's more generic, so it would naturally
cover all platforms/vendors. Certainly the most scalable long term
solution.
Cheers,
Ashok
On Sun, Nov 15 2020 at 11:31, Ashok Raj wrote:
> On Sun, Nov 15, 2020 at 12:26:22PM +0100, Thomas Gleixner wrote:
>> > opt-in by device or kernel? The way we are planning to support this is:
>> >
>> > Device support for IMS - Can discover in device specific means
>> > Kernel support for IMS. - Supported by IOMMU driver.
>>
>> And why exactly do we have to enforce IOMMU support? Please stop looking
>> at IMS purely from the IDXD perspective. We are talking about the
>> general concept here and not about the restricted Intel universe.
>
> I think you have mentioned it almost every reply :-)..Got that! Point taken
> several emails ago!! :-)
You sure? I _try_ to not mention it again then. No promise though. :)
> I didn't mean just for idxd, I said for *ANY* device driver that wants to
> use IMS.
Which is wrong. Again:
A) For PF/VF on bare metal there is absolutely no IOMMU dependency
because it does not have a PASID requirement. It's just an
alternative solution to MSI[X], which allows optimizations like
storing the message in driver managed queue memory or lifting the
restriction of 2048 interrupts per device. Nothing else.
B) For PF/VF in a guest the IOMMU dependency of IMS is a red herring.
There is no direct dependency on the IOMMU.
The problem is the inability of the VMM to trap the message write to
the IMS storage if the storage is in guest driver managed memory.
This can be solved with either
- a hypercall which translates the guest MSI message
or
- a vIOMMU which uses a hypercall or whatever to translate the guest
MSI message
C) Subdevices ala mdev are a different story. They require PASID which
enforces IOMMU and the IMS part is not managed by the users anyway.
So we have a couple of problems to solve:
1) Figure out whether the OS runs on bare metal
There is no reliable answer to that, so we either:
- Use heuristics and assume that failure is unlikely and in case
of failure blame the incompetence of VMM authors and/or
sysadmins
or
- Default to IMS disabled and let the sysadmin enable it via
command line option.
If the kernel detects that it runs in a VM it yells and disables it
unless the OS and the hypervisor agree to provide support for
that scenario (see #2).
That fails as well if the sysadmin does so when the OS runs on
a VMM which is not identifiable, but at least we can rightfully
blame the sysadmin in that case.
or
- Declare that IMS always depends on IOMMU
I personally don't care, but people working on these kinds of
devices already said that they want to avoid it when possible.
If you want to go that route, then please talk to those folks
and ask them to agree in public.
You also need to take into account that this must work on all
architectures which support virtualization because IMS is
architecture independent.
2) Guest support for PF/VF
Again we have several scenarios depending on the IMS storage
type.
- If the storage type is device memory then it's pretty much the
same as MSI[X] just a different location.
- If the storage is in driver managed memory then this needs
#1 plus guest OS and hypervisor support (hypercall/vIOMMU)
3) Guest support for PF/VF and guest managed subdevice (mdev)
Depends on #1 and #2 and is an orthogonal problem if I'm not
missing something.
To move forward we need to make a decision about #1 and #2 now.
This needs to be well thought out as changing it after the fact is
going to be a nightmare.
/me grudgingly refrains from mentioning the obvious once more.
Thanks,
tglx
On Sun, Nov 15, 2020 at 11:11:27PM +0100, Thomas Gleixner wrote:
> On Sun, Nov 15 2020 at 11:31, Ashok Raj wrote:
> > On Sun, Nov 15, 2020 at 12:26:22PM +0100, Thomas Gleixner wrote:
> >> > opt-in by device or kernel? The way we are planning to support this is:
> >> >
> >> > Device support for IMS - Can discover in device specific means
> >> > Kernel support for IMS. - Supported by IOMMU driver.
> >>
> >> And why exactly do we have to enforce IOMMU support? Please stop looking
> >> at IMS purely from the IDXD perspective. We are talking about the
> >> general concept here and not about the restricted Intel universe.
> >
> > I think you have mentioned it almost every reply :-)..Got that! Point taken
> > several emails ago!! :-)
>
> You sure? I _try_ to not mention it again then. No promise though. :)
Hey.. anything that's entertaining go for it :-)
>
> > I didn't mean just for idxd, I said for *ANY* device driver that wants to
> > use IMS.
>
> Which is wrong. Again:
>
> A) For PF/VF on bare metal there is absolutely no IOMMU dependency
> because it does not have a PASID requirement. It's just an
> alternative solution to MSI[X], which allows optimizations like
> storing the message in driver manages queue memory or lifting the
> restriction of 2048 interrupts per device. Nothing else.
You are right.. my eyes were clouded by virtualization.. no dependency for
native absolutely.
>
> B) For PF/VF in a guest the IOMMU dependency of IMS is a red herring.
> There is no direct dependency on the IOMMU.
>
> The problem is the inability of the VMM to trap the message write to
> the IMS storage if the storage is in guest driver managed memory.
> This can be solved with either
>
> - a hypercall which translates the guest MSI message
> or
> - a vIOMMU which uses a hypercall or whatever to translate the guest
> MSI message
>
> C) Subdevices ala mdev are a different story. They require PASID which
> enforces IOMMU and the IMS part is not managed by the users anyway.
You are right again :)
The subdevices require PASID & IOMMU in native, but inside the guest there is no
need for IOMMU unless you want to build SVM on top. subdevices work without
any vIOMMU or hypercall in the guest. Only because they look like normal
PCI devices we could map interrupts to legacy MSIx.
>
> So we have a couple of problems to solve:
>
> 1) Figure out whether the OS runs on bare metal
>
> There is no reliable answer to that, so we either:
>
> - Use heuristics and assume that failure is unlikely and in case
> of failure blame the incompetence of VMM authors and/or
> sysadmins
>
> or
>
> - Default to IMS disabled and let the sysadmin enable it via
> command line option.
>
> If the kernel detects to run in a VM it yells and disables it
> unless the OS and the hypervisor agree to provide support for
> that scenario (see #2).
>
> That's fails as well if the sysadmin does so when the OS runs on
> a VMM which is not identifiable, but at least we can rightfully
> blame the sysadmin in that case.
cmdline isn't nice, best to have this functional out of the box.
>
> or
>
> - Declare that IMS always depends on IOMMU
As you had mentioned, IMS has no real dependency on IOMMU in native.
We just need to make sure that, if running in a guest, we have support
for it plumbed.
>
> I personaly don't care, but people working on these kind of
> device already said, that they want to avoid it when possible.
>
> If you want to go that route, then please talk to those folks
> and ask them to agree in public.
>
> You also need to take into account that this must work on all
> architectures which support virtualization because IMS is
> architecture independent.
What you suggest makes perfect sense. We can certainly get buy-in from
the iommu list and have this coordinated between all existing iommu variants.
>
> 2) Guest support for PF/VF
>
> Again we have several scenarios depending on the IMS storage
> type.
>
> - If the storage type is device memory then it's pretty much the
> same as MSI[X] just a different location.
True, but we still need some special handling for trapping those mmio
accesses. Unlike for MSI-X, where VFIO already traps them and everything
is pre-plumbed, it isn't as seamless as it is for MSI-X.
>
> - If the storage is in driver managed memory then this needs
> #1 plus guest OS and hypervisor support (hypercall/vIOMMU)
Violent agreement here :-)
>
> 3) Guest support for PF/VF and guest managed subdevice (mdev)
>
> Depends on #1 and #2 and is an orthogonal problem if I'm not
> missing something.
>
> To move forward we need to make a decision about #1 and #2 now.
Mostly in agreement. Except that mdev (the currently considered use case)
has no need for IMS in the guest. (Don't get me wrong, I'm not ruling out
that some odd device managing sub-devices would need IMS in addition to
the 2048 MSI-X emulation.)
>
> This needs to be well thought out as changing it after the fact is
> going to be a nightmare.
>
> /me grudgingly refrains from mentioning the obvious once more.
>
So this isn't an idxd and Intel only thing :-)...
Cheers,
Ashok
> From: Raj, Ashok <[email protected]>
> Sent: Monday, November 16, 2020 8:23 AM
>
> On Sun, Nov 15, 2020 at 11:11:27PM +0100, Thomas Gleixner wrote:
> > On Sun, Nov 15 2020 at 11:31, Ashok Raj wrote:
> > > On Sun, Nov 15, 2020 at 12:26:22PM +0100, Thomas Gleixner wrote:
> > >> > opt-in by device or kernel? The way we are planning to support this is:
> > >> >
> > >> > Device support for IMS - Can discover in device specific means
> > >> > Kernel support for IMS. - Supported by IOMMU driver.
> > >>
> > >> And why exactly do we have to enforce IOMMU support? Please stop
> looking
> > >> at IMS purely from the IDXD perspective. We are talking about the
> > >> general concept here and not about the restricted Intel universe.
> > >
> > > I think you have mentioned it almost every reply :-)..Got that! Point taken
> > > several emails ago!! :-)
> >
> > You sure? I _try_ to not mention it again then. No promise though. :)
>
> Hey.. anything that's entertaining go for it :-)
>
> >
> > > I didn't mean just for idxd, I said for *ANY* device driver that wants to
> > > use IMS.
> >
> > Which is wrong. Again:
> >
> > A) For PF/VF on bare metal there is absolutely no IOMMU dependency
> > because it does not have a PASID requirement. It's just an
> > alternative solution to MSI[X], which allows optimizations like
> > storing the message in driver manages queue memory or lifting the
> > restriction of 2048 interrupts per device. Nothing else.
>
> You are right.. my eyes were clouded by virtualization.. no dependency for
> native absolutely.
>
> >
> > B) For PF/VF in a guest the IOMMU dependency of IMS is a red herring.
> > There is no direct dependency on the IOMMU.
> >
> > The problem is the inability of the VMM to trap the message write to
> > the IMS storage if the storage is in guest driver managed memory.
> > This can be solved with either
> >
> > - a hypercall which translates the guest MSI message
> > or
> > - a vIOMMU which uses a hypercall or whatever to translate the guest
> > MSI message
> >
> > C) Subdevices ala mdev are a different story. They require PASID which
> > enforces IOMMU and the IMS part is not managed by the users anyway.
>
> You are right again :)
>
> The subdevices require PASID & IOMMU in native, but inside the guest there
> is no
> need for IOMMU unless you want to build SVM on top. subdevices work
> without
> any vIOMMU or hypercall in the guest. Only because they look like normal
> PCI devices we could map interrupts to legacy MSIx.
Guest managed subdevices on PF/VF requires vIOMMU. Anyway I think
Thomas was just pointing out that subdevices are the only category out
of the above three which may have business tied to the IOMMU.
>
> >
> > So we have a couple of problems to solve:
> >
> > 1) Figure out whether the OS runs on bare metal
> >
> > There is no reliable answer to that, so we either:
> >
> > - Use heuristics and assume that failure is unlikely and in case
> > of failure blame the incompetence of VMM authors and/or
> > sysadmins
> >
> > or
> >
> > - Default to IMS disabled and let the sysadmin enable it via
> > command line option.
> >
> > If the kernel detects to run in a VM it yells and disables it
> > unless the OS and the hypervisor agree to provide support for
> > that scenario (see #2).
> >
> > That's fails as well if the sysadmin does so when the OS runs on
> > a VMM which is not identifiable, but at least we can rightfully
> > blame the sysadmin in that case.
>
> cmdline isn't nice, best to have this functional out of box.
>
> >
> > or
> >
> > - Declare that IMS always depends on IOMMU
>
> As you had mentioned IMS has no real dependency on IOMMU in native.
>
> we just need to make sure if running in guest we have support for it
> plumbed.
>
> >
> > I personaly don't care, but people working on these kind of
> > device already said, that they want to avoid it when possible.
> >
> > If you want to go that route, then please talk to those folks
> > and ask them to agree in public.
> >
> > You also need to take into account that this must work on all
> > architectures which support virtualization because IMS is
> > architecture independent.
>
> What you suggest makes perfect sense. We can certainly get buy in from
> iommu list and have this co-ordinated between all existing iommu varients.
Does a hybrid scheme sound good here?
- Say a cmdline parameter: ims=[auto|on|off], with 'auto' as default
(a rough parsing sketch follows after this list);
- if ims=auto:
* If arch doesn't implement probably_on_bare_metal, disallow ims;
* If probably_on_bare_metal returns false, disallow ims;
# (future) if hypercall is supported, allow ims;
* If probably_on_bare_metal returns true, allow ims, with the caveat
that an old hypervisor might be misdetected as bare metal. Sysadmin
may need to double-confirm by other means;
# (future) if definitely_on_bare_metal is supported, no caveat;
- if ims=on:
* If probably_on_bare_metal returns false, yell and disable it until
hypercall is supported;
* In all other cases allow ims. The sysadmin is to blame for any
failure, as setting ims=on implies that extra confirmation has been done;
- if ims=off, then leave it off.
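A rough parsing sketch for the proposed knob (early_param() is the
standard mechanism; the policy enum, the variable names and the way
probably_on_bare_metal() feeds into the decision are illustrative
assumptions following the list above):

enum ims_policy { IMS_AUTO, IMS_ON, IMS_OFF };

static enum ims_policy ims_policy __ro_after_init = IMS_AUTO;

static int __init ims_setup(char *str)
{
	if (!str)
		return -EINVAL;

	if (!strcmp(str, "on"))
		ims_policy = IMS_ON;
	else if (!strcmp(str, "off"))
		ims_policy = IMS_OFF;
	else if (!strcmp(str, "auto"))
		ims_policy = IMS_AUTO;
	else
		return -EINVAL;

	return 0;
}
early_param("ims", ims_setup);

static bool ims_allowed(void)
{
	bool bare_metal = probably_on_bare_metal();

	switch (ims_policy) {
	case IMS_OFF:
		return false;
	case IMS_ON:
		/* Yell and disable until hypercall support exists */
		if (!bare_metal)
			pr_warn("ims=on, but not on bare metal; IMS disabled\n");
		return bare_metal;
	case IMS_AUTO:
	default:
		/* (future) also allow when a hypercall is available */
		return bare_metal;
	}
}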
It's not necessary to claim a strict dependency between ims and iommu.
Instead, we could leave the iommu check as an arch specific check where
it applies:
probably_on_bare_metal()
{
	if (CPUID(FEATURE_HYPERVISOR))
		return false;
	if (dmi_match_hypervisor_vendor())
		return false;
	if (iommu_existing() && iommu_in_guest())
		return false;

	return PROBABLY_RUNNING_ON_BARE_METAL;
}
>
> >
> > 2) Guest support for PF/VF
> >
> > Again we have several scenarios depending on the IMS storage
> > type.
> >
> > - If the storage type is device memory then it's pretty much the
> > same as MSI[X] just a different location.
>
> True, but still need to have some special handling for trapping those mmio
> access. Unlike for MSIx VFIO already traps them and everything is
> pre-plummbed. It isn't seamless as its for MSIx.
Yes. So what about tying guest IMS to the hypercall even when emulation
is possible on some devices? It's difficult for the guest to know that
its IMS is emulated by the hypervisor. Adopting a unified policy for all
IMS-capable devices might be an easier path.
>
> >
> > - If the storage is in driver managed memory then this needs
> > #1 plus guest OS and hypervisor support (hypercall/vIOMMU)
>
> Violent agreement here :-)
>
> >
> > 3) Guest support for PF/VF and guest managed subdevice (mdev)
> >
> > Depends on #1 and #2 and is an orthogonal problem if I'm not
> > missing something.
> >
> > To move forward we need to make a decision about #1 and #2 now.
>
> Mostly in agreement. Except for mdev (current considered use case) have no
> need for IMS in the guest. (Don't get me wrong, I'm not saying some odd
> device managing sub-devices would need IMS in addition and that the 2048
> MSIx emulation.
> >
> > This needs to be well thought out as changing it after the fact is
> > going to be a nightmare.
> >
> > /me grudgingly refrains from mentioning the obvious once more.
> >
>
> So this isn't an idxd and Intel only thing :-)...
>
> Cheers,
> Ashok
Thanks
Kevin
On Sat, Nov 14, 2020 at 01:18:37PM -0800, Raj, Ashok wrote:
> On Sat, Nov 14, 2020 at 10:34:30AM +0000, Christoph Hellwig wrote:
> > On Thu, Nov 12, 2020 at 11:42:46PM +0100, Thomas Gleixner wrote:
> > > DMI vendor name is pretty good final check when the bit is 0. The
> > > strings I'm aware of are:
> > >
> > > QEMU, Bochs, KVM, Xen, VMware, VMW, VMware Inc., innotek GmbH, Oracle
> > > Corporation, Parallels, BHYVE, Microsoft Corporation
> > >
> > > which is not complete but better than nothing ;)
> >
> > Which is why I really think we need explicit opt-ins for "native"
> > SIOV handling and for paravirtualized SIOV handling, with the kernel
> > not offering support at all without either or a manual override on
> > the command line.
>
> opt-in by device or kernel? The way we are planning to support this is:
opt-in by the platform. Not sure if an ACPI interface or something else
would be best. But basically the kernel needs to be able to query:
Does this platform claim to support IMS, and if yes, how? If there is no
answer we need to assume the platform doesn't.
On Mon, Nov 16, 2020 at 07:31:49AM +0000, Tian, Kevin wrote:
> > The subdevices require PASID & IOMMU in native, but inside the guest there
> > is no
> > need for IOMMU unless you want to build SVM on top. subdevices work
> > without
> > any vIOMMU or hypercall in the guest. Only because they look like normal
> > PCI devices we could map interrupts to legacy MSIx.
>
> Guest managed subdevices on PF/VF requires vIOMMU.
Why? I've never heard we need vIOMMU for our existing SRIOV flows in
VMs??
Jason
On Mon, Nov 16 2020 at 11:46, Jason Gunthorpe wrote:
> On Mon, Nov 16, 2020 at 07:31:49AM +0000, Tian, Kevin wrote:
>
>> > The subdevices require PASID & IOMMU in native, but inside the guest there
>> > is no
>> > need for IOMMU unless you want to build SVM on top. subdevices work
>> > without
>> > any vIOMMU or hypercall in the guest. Only because they look like normal
>> > PCI devices we could map interrupts to legacy MSIx.
>>
>> Guest managed subdevices on PF/VF requires vIOMMU.
>
> Why? I've never heard we need vIOMMU for our existing SRIOV flows in
> VMs??
Handing PF/VF into the guest does not require it.
But if the PF/VF driver in the guest wants to create and manage the
magic mdev subdevices which require PASID support then you surely need
it.
Thanks,
tglx
On Mon, Nov 16, 2020 at 06:56:33PM +0100, Thomas Gleixner wrote:
> On Mon, Nov 16 2020 at 11:46, Jason Gunthorpe wrote:
>
> > On Mon, Nov 16, 2020 at 07:31:49AM +0000, Tian, Kevin wrote:
> >
> >> > The subdevices require PASID & IOMMU in native, but inside the guest there
> >> > is no
> >> > need for IOMMU unless you want to build SVM on top. subdevices work
> >> > without
> >> > any vIOMMU or hypercall in the guest. Only because they look like normal
> >> > PCI devices we could map interrupts to legacy MSIx.
> >>
> >> Guest managed subdevices on PF/VF requires vIOMMU.
> >
> > Why? I've never heard we need vIOMMU for our existing SRIOV flows in
> > VMs??
>
> Handing PF/VF into the guest does not require it.
>
> But if the PF/VF driver in the guest wants to create and manage the
> magic mdev subdevices which require PASID support then you surely need
> it.
'magic mdevs' are only one reason to use IMS in a guest. On mlx5 we
might want to use IMS for VDPA devices. mlx5 can spawn a VDPA device
in a guest, against an 'ADI', without ever requiring an IOMMU to do it.
We don't even need an IOMMU in the hypervisor to create the ADI, mlx5 has
an internal secure IOMMU that can be used instead of the platform
IOMMU.
Not saying this is a major use case, or a reason not to link things to
IOMMU detection, but let's be clear that a hard need for IOMMU is
another IDXD thing, not general.
Jason
On Mon, Nov 16 2020 at 14:02, Jason Gunthorpe wrote:
> On Mon, Nov 16, 2020 at 06:56:33PM +0100, Thomas Gleixner wrote:
>> On Mon, Nov 16 2020 at 11:46, Jason Gunthorpe wrote:
>>
>> > On Mon, Nov 16, 2020 at 07:31:49AM +0000, Tian, Kevin wrote:
>> >
>> >> > The subdevices require PASID & IOMMU in native, but inside the guest there
>> >> > is no
>> >> > need for IOMMU unless you want to build SVM on top. subdevices work
>> >> > without
>> >> > any vIOMMU or hypercall in the guest. Only because they look like normal
>> >> > PCI devices we could map interrupts to legacy MSIx.
>> >>
>> >> Guest managed subdevices on PF/VF requires vIOMMU.
>> >
>> > Why? I've never heard we need vIOMMU for our existing SRIOV flows in
>> > VMs??
>>
>> Handing PF/VF into the guest does not require it.
>>
>> But if the PF/VF driver in the guest wants to create and manage the
>> magic mdev subdevices which require PASID support then you surely need
>> it.
>
> 'magic mdevs' are only one reason to use IMS in a guest. On mlx5 we
> might want to use IMS for VPDA devices. mlx5 can spawn a VDPA device
> in a guest, against a 'ADI', without ever requiring an IOMMU to do it.
>
> We don't even need IOMMU in the hypervisor to create the ADI, mlx5 has
> an internal secure IOMMU that can be used instead of the platform
> IOMMU.
>
> Not saying this is a major use case, or a reason not to link things to
> IOMMU detection, but lets be clear that a hard need for IOMMU is a
> another IDXD thing, not general.
Fair enough.
Thanks,
tglx
> From: Jason Gunthorpe <[email protected]>
> Sent: Tuesday, November 17, 2020 2:03 AM
>
> On Mon, Nov 16, 2020 at 06:56:33PM +0100, Thomas Gleixner wrote:
> > On Mon, Nov 16 2020 at 11:46, Jason Gunthorpe wrote:
> >
> > > On Mon, Nov 16, 2020 at 07:31:49AM +0000, Tian, Kevin wrote:
> > >
> > >> > The subdevices require PASID & IOMMU in native, but inside the guest
> there
> > >> > is no
> > >> > need for IOMMU unless you want to build SVM on top. subdevices
> work
> > >> > without
> > >> > any vIOMMU or hypercall in the guest. Only because they look like
> normal
> > >> > PCI devices we could map interrupts to legacy MSIx.
> > >>
> > >> Guest managed subdevices on PF/VF requires vIOMMU.
> > >
> > > Why? I've never heard we need vIOMMU for our existing SRIOV flows in
> > > VMs??
> >
> > Handing PF/VF into the guest does not require it.
> >
> > But if the PF/VF driver in the guest wants to create and manage the
> > magic mdev subdevices which require PASID support then you surely need
> > it.
>
> 'magic mdevs' are only one reason to use IMS in a guest. On mlx5 we
> might want to use IMS for VPDA devices. mlx5 can spawn a VDPA device
> in a guest, against a 'ADI', without ever requiring an IOMMU to do it.
>
> We don't even need IOMMU in the hypervisor to create the ADI, mlx5 has
> an internal secure IOMMU that can be used instead of the platform
> IOMMU.
>
> Not saying this is a major use case, or a reason not to link things to
> IOMMU detection, but lets be clear that a hard need for IOMMU is a
> another IDXD thing, not general.
>
I should use "may require" in original post. and one thing that I obviously
mixed is the requirement of PASID-granular interrupt isolation in the
physical IOMMU instead of virtual IOMMU. But anyway, I didn't attempt
to use above to build hard need for IOMMU, just the opposite when looking
at all three cases together.
btw Jason/Thomas, what do you think about the proposal down in this
thread (ims=[auto|on|off])? Does it sound like a good tradeoff to move forward?
Thanks
Kevin
On Mon, Nov 16 2020 at 23:51, Kevin Tian wrote:
>> From: Jason Gunthorpe <[email protected]>
> btw Jason/Thomas, what do you think about the proposal down in this
> thread (ims=[auto|on|off])? Does it sound like a good tradeoff to move forward?
What does it solve? It defaults to auto and then you still need to solve
the problem of figuring out whether it's safe to use it or not.
The command line option is not a solution per se. It's the last resort
when the logic which decides whether IMS can be used or not fails to do
the right thing. Nothing more.
We have clearly outlined what needs to be done, and you can come up with
as many magic bullets as you want; they won't make the real problems go
away.
Thanks,
tglx