2018-08-30 04:13:08

by Lu Baolu

Subject: [RFC PATCH v2 00/10] vfio/mdev: IOMMU aware mediated device

Hi,

The Mediated Device (mdev) framework enables fine-grained sharing of
a physical device across isolated domains. Currently, the mdev
framework is designed to be independent of the platform IOMMU
support. As a result, DMA isolation relies on the mdev parent device
in a vendor-specific way.

There are several cases where a mediated device can be protected
and isolated by the platform IOMMU. For example, Intel vt-d rev3.0
[1] introduces a new translation mode called 'scalable mode', which
enables PASID-granular translations. The vt-d scalable mode is the
key ingredient of Scalable I/O Virtualization [2] [3], which allows
sharing a device at the finest possible granularity (ADI - Assignable
Device Interface).

A mediated device backed by an ADI can be protected and isolated
by the IOMMU since 1) the parent device supports tagging a unique
PASID to all DMA traffic out of the mediated device; and 2) the DMA
translation unit (IOMMU) supports PASID-granular translation.
We can apply IOMMU protection and isolation to such devices
just as we do with an assignable PCI device.

In order to distinguish the IOMMU-capable mediated devices from those
which still need to rely on parent devices, this patch set adds a
domain type attribute to each mdev.

enum mdev_domain_type {
	DOMAIN_TYPE_NO_IOMMU,      /* Don't need any IOMMU support.
	                            * All isolation and protection
	                            * are handled by the parent
	                            * device driver with a device
	                            * specific mechanism.
	                            */
	DOMAIN_TYPE_ATTACH_PARENT, /* IOMMU can isolate and protect
	                            * the mdev, and the isolation
	                            * domain should be attached to
	                            * the parent device.
	                            */
};

The mdev parent device driver can opt in to making an mdev IOMMU
capable when the device is created, by invoking the interface below
within its @create callback:

	int mdev_set_domain_type(struct device *dev,
				 enum mdev_domain_type type);

In vfio_iommu_type1_attach_group(), a domain allocated through
iommu_domain_alloc() will be attached to the mdev parent device if
the domain types of the mdev devices in the group are all
ATTACH_PARENT; otherwise, the dummy external domain will be used and
all DMA isolation and protection are routed to the parent driver as
a result.

On the IOMMU side, a basic requirement is to allow attaching multiple
domains to a PCI device if the device advertises the capability
and the IOMMU hardware supports finer-granularity translations than
the normal PCI Source ID based translation.

For ease of discussion, we say 'a domain in auxiliary mode', or
simply 'an auxiliary domain', when a domain is attached to a device
for finer-granularity translations (than the Source ID based one).
But we need to keep in mind that this doesn't mean two types of
domains: the same domain could be bound to one device for Source ID
based translation, and bound to another device for finer-granularity
translation at the same time.

The following APIs are introduced in the IOMMU glue for device
drivers to use finer-granularity translation:

* iommu_capable(IOMMU_CAP_AUX_DOMAIN)
- Represents the ability of the IOMMU hardware to support multiple
domains per device (a.k.a. finer-granularity translations).

* iommu_en(dis)able_aux_domain(struct device *dev)
- Enable/disable the multiple-domains capability for the device
referenced by @dev.

* iommu_auxiliary_id(struct iommu_domain *domain)
- Return the index value used for finer-granularity DMA translation.
The device driver needs to feed the hardware with this value, so
that the device can issue DMA transactions tagged with it.

This patch series extends both the IOMMU and vfio components to
support passing through an mdev device when it can be isolated and
protected by the IOMMU units. The first part of the series (PATCH
1/10 ~ 6/10) adds the interfaces and implementation of multiple
domains per device. The second part (PATCH 7/10 ~ 10/10) adds the
domain type attribute to each mdev, determines the domain type
according to that attribute when attaching a group in the vfio type1
iommu module, and binds an auxiliary domain for any group of mediated
devices that requires its own domain.

This patch series depends on a patch set posted for discussion
here [4], which adds support for scalable mode in the Intel IOMMU
driver.

References:
[1] https://software.intel.com/en-us/download/intel-virtualization-technology-for-directed-io-architecture-specification
[2] https://software.intel.com/en-us/download/intel-scalable-io-virtualization-technical-specification
[3] https://schd.ws/hosted_files/lc32018/00/LC3-SIOV-final.pdf
[4] https://lkml.org/lkml/2018/8/30/27

Best regards,
Lu Baolu

Change log:
v1->v2:
- Rewrite the patches with the concept of auxiliary domains.

Lu Baolu (10):
iommu: Add APIs for multiple domains per device
iommu/vt-d: Add multiple domains per device query
iommu/amd: Add default branch in amd_iommu_capable()
iommu/vt-d: Enable/disable multiple domains per device
iommu/vt-d: Attach/detach domains in auxiliary mode
iommu/vt-d: Return ID associated with an auxiliary domain
vfio/mdev: Add mediated device domain type
vfio/type1: Add domain at(de)taching group helpers
vfio/type1: Determine domain type of an mdev group
vfio/type1: Attach domain for mdev group

drivers/iommu/amd_iommu.c | 2 +
drivers/iommu/intel-iommu.c | 208 ++++++++++++++++++++++++++++++-
drivers/iommu/iommu.c | 29 +++++
drivers/vfio/mdev/mdev_core.c | 36 ++++++
drivers/vfio/mdev/mdev_private.h | 2 +
drivers/vfio/vfio_iommu_type1.c | 144 +++++++++++++++++++--
include/linux/intel-iommu.h | 11 ++
include/linux/iommu.h | 13 ++
include/linux/mdev.h | 26 ++++
9 files changed, 455 insertions(+), 16 deletions(-)

--
2.17.1



2018-08-30 04:12:45

by Lu Baolu

Subject: [RFC PATCH v2 10/10] vfio/type1: Attach domain for mdev group

When attaching a domain to a group of mediated devices whose
domain type attributes are all set to ATTACH_PARENT, we
should attach the domain to the parent PCI device instead
of the mdev device itself.

Cc: Ashok Raj <[email protected]>
Cc: Jacob Pan <[email protected]>
Cc: Kevin Tian <[email protected]>
Cc: Liu Yi L <[email protected]>
Signed-off-by: Sanjay Kumar <[email protected]>
Signed-off-by: Lu Baolu <[email protected]>
---
drivers/vfio/vfio_iommu_type1.c | 22 +++++++++++++++++++++-
1 file changed, 21 insertions(+), 1 deletion(-)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 64bf55b91de1..c4231df44304 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -1411,6 +1411,18 @@ static int vfio_mdev_domain_type(struct device *dev, void *data)
return -EINVAL;
}

+static int vfio_parent_bus_type(struct device *dev, void *data)
+{
+ struct bus_type **bus = data;
+
+ if (*bus && *bus != dev->parent->bus)
+ return -EINVAL;
+
+ *bus = dev->parent->bus;
+
+ return 0;
+}
+
static int vfio_iommu_type1_attach_group(void *iommu_data,
struct iommu_group *iommu_group)
{
@@ -1458,6 +1470,7 @@ static int vfio_iommu_type1_attach_group(void *iommu_data,
enum mdev_domain_type type = 0;

symbol_put(mdev_bus_type);
+ mdev_bus = NULL;

/* Determine the domain type: */
ret = iommu_group_for_each_dev(iommu_group, &type,
@@ -1479,7 +1492,14 @@ static int vfio_iommu_type1_attach_group(void *iommu_data,

return 0;
case DOMAIN_TYPE_ATTACH_PARENT:
- /* FALLTHROUGH */
+ bus = NULL;
+ group->attach_parent = true;
+ /* Set @bus to bus type of the parent: */
+ ret = iommu_group_for_each_dev(iommu_group, &bus,
+ vfio_parent_bus_type);
+ if (ret)
+ goto out_free;
+ break;
default:
ret = -EINVAL;
goto out_free;
--
2.17.1


2018-08-30 04:13:02

by Lu Baolu

Subject: [RFC PATCH v2 05/10] iommu/vt-d: Attach/detach domains in auxiliary mode

When multiple domains per device have been enabled by the
device driver, the device will tag the domain's default
PASID onto all DMA traffic out of the corresponding subset
of the device, and the IOMMU should translate those DMA
requests at PASID granularity.

This extends the intel_iommu_attach/detach_device() ops
to manage PASID-granular translation structures when the
device driver has enabled multiple domains per device.

Cc: Ashok Raj <[email protected]>
Cc: Jacob Pan <[email protected]>
Cc: Kevin Tian <[email protected]>
Cc: Liu Yi L <[email protected]>
Signed-off-by: Sanjay Kumar <[email protected]>
Signed-off-by: Lu Baolu <[email protected]>
---
drivers/iommu/intel-iommu.c | 132 +++++++++++++++++++++++++++++++++++-
include/linux/intel-iommu.h | 10 +++
2 files changed, 139 insertions(+), 3 deletions(-)

diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
index 3606d25bc40c..784bd496f316 100644
--- a/drivers/iommu/intel-iommu.c
+++ b/drivers/iommu/intel-iommu.c
@@ -2502,6 +2502,7 @@ static struct dmar_domain *dmar_insert_one_dev_info(struct intel_iommu *iommu,
info->iommu = iommu;
info->pasid_table = NULL;
info->auxd_enabled = 0;
+ INIT_LIST_HEAD(&info->auxiliary_domains);

if (dev && dev_is_pci(dev)) {
struct pci_dev *pdev = to_pci_dev(info->dev);
@@ -5036,6 +5037,124 @@ static void intel_iommu_domain_free(struct iommu_domain *domain)
domain_exit(to_dmar_domain(domain));
}

+/*
+ * Check whether a @domain will be attached to the @dev in the
+ * auxiliary mode.
+ */
+static inline bool
+is_device_attach_aux_domain(struct device *dev, struct iommu_domain *domain)
+{
+ struct device_domain_info *info = dev->archdata.iommu;
+
+ return info && info->auxd_enabled &&
+ domain->type == IOMMU_DOMAIN_UNMANAGED;
+}
+
+static void auxiliary_link_device(struct dmar_domain *domain,
+ struct device *dev)
+{
+ struct device_domain_info *info = dev->archdata.iommu;
+
+ assert_spin_locked(&device_domain_lock);
+ if (WARN_ON(!info))
+ return;
+
+ domain->auxd_refcnt++;
+ list_add(&domain->auxd, &info->auxiliary_domains);
+}
+
+static void auxiliary_unlink_device(struct dmar_domain *domain,
+ struct device *dev)
+{
+ struct device_domain_info *info = dev->archdata.iommu;
+
+ assert_spin_locked(&device_domain_lock);
+ if (WARN_ON(!info))
+ return;
+
+ list_del(&domain->auxd);
+ domain->auxd_refcnt--;
+
+ if (!domain->auxd_refcnt && domain->default_pasid > 0)
+ intel_pasid_free_id(domain->default_pasid);
+}
+
+static int domain_add_dev_auxd(struct dmar_domain *domain,
+ struct device *dev)
+{
+ int ret;
+ u8 bus, devfn;
+ unsigned long flags;
+ struct intel_iommu *iommu;
+
+ iommu = device_to_iommu(dev, &bus, &devfn);
+ if (!iommu)
+ return -ENODEV;
+
+ spin_lock_irqsave(&device_domain_lock, flags);
+ if (domain->default_pasid <= 0) {
+ domain->default_pasid = intel_pasid_alloc_id(domain, PASID_MIN,
+ intel_pasid_get_dev_max_id(dev), GFP_ATOMIC);
+ if (domain->default_pasid < 0) {
+ pr_err("Can't allocate default pasid\n");
+ ret = -ENODEV;
+ goto pasid_failed;
+ }
+ }
+
+ spin_lock(&iommu->lock);
+ ret = domain_attach_iommu(domain, iommu);
+ if (ret)
+ goto attach_failed;
+
+ /* Setup the PASID entry for mediated devices: */
+ ret = intel_pasid_setup_second_level(iommu, domain, dev,
+ domain->default_pasid, false);
+ if (ret)
+ goto table_failed;
+ spin_unlock(&iommu->lock);
+
+ auxiliary_link_device(domain, dev);
+
+ spin_unlock_irqrestore(&device_domain_lock, flags);
+
+ return 0;
+
+table_failed:
+ domain_detach_iommu(domain, iommu);
+attach_failed:
+ spin_unlock(&iommu->lock);
+ if (!domain->auxd_refcnt && domain->default_pasid > 0)
+ intel_pasid_free_id(domain->default_pasid);
+pasid_failed:
+ spin_unlock_irqrestore(&device_domain_lock, flags);
+
+ return ret;
+}
+
+static void domain_remove_dev_aux(struct dmar_domain *domain,
+ struct device *dev)
+{
+ struct device_domain_info *info;
+ struct intel_iommu *iommu;
+ unsigned long flags;
+
+ spin_lock_irqsave(&device_domain_lock, flags);
+ info = dev->archdata.iommu;
+ iommu = info->iommu;
+
+ intel_pasid_tear_down_second_level(iommu, domain,
+ dev, domain->default_pasid);
+
+ auxiliary_unlink_device(domain, dev);
+
+ spin_lock(&iommu->lock);
+ domain_detach_iommu(domain, iommu);
+ spin_unlock(&iommu->lock);
+
+ spin_unlock_irqrestore(&device_domain_lock, flags);
+}
+
static int intel_iommu_attach_device(struct iommu_domain *domain,
struct device *dev)
{
@@ -5050,7 +5169,8 @@ static int intel_iommu_attach_device(struct iommu_domain *domain,
}

/* normally dev is not mapped */
- if (unlikely(domain_context_mapped(dev))) {
+ if (unlikely(domain_context_mapped(dev) &&
+ !is_device_attach_aux_domain(dev, domain))) {
struct dmar_domain *old_domain;

old_domain = find_domain(dev);
@@ -5097,13 +5217,19 @@ static int intel_iommu_attach_device(struct iommu_domain *domain,
dmar_domain->agaw--;
}

- return domain_add_dev_info(dmar_domain, dev);
+ if (is_device_attach_aux_domain(dev, domain))
+ return domain_add_dev_auxd(dmar_domain, dev);
+ else
+ return domain_add_dev_info(dmar_domain, dev);
}

static void intel_iommu_detach_device(struct iommu_domain *domain,
struct device *dev)
{
- dmar_remove_one_dev_info(to_dmar_domain(domain), dev);
+ if (is_device_attach_aux_domain(dev, domain))
+ domain_remove_dev_aux(to_dmar_domain(domain), dev);
+ else
+ dmar_remove_one_dev_info(to_dmar_domain(domain), dev);
}

static int intel_iommu_map(struct iommu_domain *domain,
diff --git a/include/linux/intel-iommu.h b/include/linux/intel-iommu.h
index 15981245796e..cbf3d250b521 100644
--- a/include/linux/intel-iommu.h
+++ b/include/linux/intel-iommu.h
@@ -408,9 +408,11 @@ struct dmar_domain {
/* Domain ids per IOMMU. Use u16 since
* domain ids are 16 bit wide according
* to VT-d spec, section 9.3 */
+ unsigned int auxd_refcnt; /* Refcount of auxiliary attaching */

bool has_iotlb_device;
struct list_head devices; /* all devices' list */
+ struct list_head auxd; /* link to device's auxiliary list */
struct iova_domain iovad; /* iova's that belong to this domain */

struct dma_pte *pgd; /* virtual address */
@@ -429,6 +431,11 @@ struct dmar_domain {
2 == 1GiB, 3 == 512GiB, 4 == 1TiB */
u64 max_addr; /* maximum mapped address */

+ int default_pasid; /*
+ * The default pasid used for non-SVM
+ * traffic on mediated devices.
+ */
+
struct iommu_domain domain; /* generic domain data structure for
iommu core */
};
@@ -478,6 +485,9 @@ struct device_domain_info {
struct list_head link; /* link to domain siblings */
struct list_head global; /* link to global list */
struct list_head table; /* link to pasid table */
+ struct list_head auxiliary_domains; /* auxiliary domains
+ * attached to this device
+ */
u8 bus; /* PCI bus number */
u8 devfn; /* PCI devfn number */
u16 pfsid; /* SRIOV physical function source ID */
--
2.17.1


2018-08-30 04:13:19

by Lu Baolu

Subject: [RFC PATCH v2 07/10] vfio/mdev: Add mediated device domain type

A parent device might create different types of mediated
devices. For example, a mediated device could be created
by the parent device with full isolation and protection
provided by the IOMMU. One usage case can be found on
Intel platforms, where a mediated device is an assignable
subset of a PCI device and the DMA requests on behalf of
it are all tagged with a PASID. Since the IOMMU supports
PASID-granular translations (scalable mode in vt-d 3.0),
such a mediated device can be individually protected and
isolated by the IOMMU.

This patch defines the domain types of a mediated device
and allows the parent driver to specify this attribute
when a mediated device is being created. The following
types are defined:

* DOMAIN_TYPE_NO_IOMMU
- Does not need any IOMMU support. All isolation and
protection are handled by the parent device driver
through the callbacks with a device-specific mechanism.
* DOMAIN_TYPE_ATTACH_PARENT
- The IOMMU can isolate and protect this mediated device,
and an isolation domain should be attached to the
parent device.

This also reserves a place in the mdev private data structure
to save the iommu domain, and adds interfaces to store and
retrieve the domain. The following APIs are introduced:

* mdev_set/get_domain_type(type)
- Set or query the domain type of a mediated device.
The parent device driver should set the domain type
(or keep DOMAIN_TYPE_NO_IOMMU by default) during
mediated device creation.

* mdev_set/get_domain(domain)
- An iommu domain which has been attached to the parent
device in order to protect and isolate the mediated
device will be kept in the mdev data structure and
can be retrieved later.

Cc: Ashok Raj <[email protected]>
Cc: Jacob Pan <[email protected]>
Cc: Kevin Tian <[email protected]>
Cc: Liu Yi L <[email protected]>
Suggested-by: Kevin Tian <[email protected]>
Signed-off-by: Lu Baolu <[email protected]>
---
drivers/vfio/mdev/mdev_core.c | 36 ++++++++++++++++++++++++++++++++
drivers/vfio/mdev/mdev_private.h | 2 ++
include/linux/mdev.h | 26 +++++++++++++++++++++++
3 files changed, 64 insertions(+)

diff --git a/drivers/vfio/mdev/mdev_core.c b/drivers/vfio/mdev/mdev_core.c
index 0212f0ee8aea..d45a829c5b11 100644
--- a/drivers/vfio/mdev/mdev_core.c
+++ b/drivers/vfio/mdev/mdev_core.c
@@ -390,6 +390,42 @@ int mdev_device_remove(struct device *dev, bool force_remove)
return 0;
}

+int mdev_set_domain_type(struct device *dev, enum mdev_domain_type type)
+{
+ struct mdev_device *mdev = to_mdev_device(dev);
+
+ mdev->domain_type = type;
+
+ return 0;
+}
+EXPORT_SYMBOL(mdev_set_domain_type);
+
+enum mdev_domain_type mdev_get_domain_type(struct device *dev)
+{
+ struct mdev_device *mdev = to_mdev_device(dev);
+
+ return mdev->domain_type;
+}
+EXPORT_SYMBOL(mdev_get_domain_type);
+
+int mdev_set_domain(struct device *dev, void *domain)
+{
+ struct mdev_device *mdev = to_mdev_device(dev);
+
+ mdev->domain = domain;
+
+ return 0;
+}
+EXPORT_SYMBOL(mdev_set_domain);
+
+void *mdev_get_domain(struct device *dev)
+{
+ struct mdev_device *mdev = to_mdev_device(dev);
+
+ return mdev->domain;
+}
+EXPORT_SYMBOL(mdev_get_domain);
+
static int __init mdev_init(void)
{
return mdev_bus_register();
diff --git a/drivers/vfio/mdev/mdev_private.h b/drivers/vfio/mdev/mdev_private.h
index b5819b7d7ef7..fd9e33fbd6e5 100644
--- a/drivers/vfio/mdev/mdev_private.h
+++ b/drivers/vfio/mdev/mdev_private.h
@@ -34,6 +34,8 @@ struct mdev_device {
struct list_head next;
struct kobject *type_kobj;
bool active;
+ int domain_type;
+ void *domain;
};

#define to_mdev_device(dev) container_of(dev, struct mdev_device, dev)
diff --git a/include/linux/mdev.h b/include/linux/mdev.h
index b6e048e1045f..3224587bda1e 100644
--- a/include/linux/mdev.h
+++ b/include/linux/mdev.h
@@ -15,6 +15,32 @@

struct mdev_device;

+enum mdev_domain_type {
+ DOMAIN_TYPE_NO_IOMMU, /* Don't need any IOMMU support.
+ * All isolation and protection
+ * are handled by the parent
+ * device driver with a device
+ * specific mechanism.
+ */
+ DOMAIN_TYPE_ATTACH_PARENT, /* IOMMU can isolate and protect
+ * the mdev, and the isolation
+ * domain should be attached to
+ * the parent device.
+ */
+};
+
+/*
+ * Called by the parent device driver to set the domain type.
+ * By default, the domain type is set to DOMAIN_TYPE_NO_IOMMU.
+ */
+int mdev_set_domain_type(struct device *dev, enum mdev_domain_type type);
+
+/* Check the domain type. */
+enum mdev_domain_type mdev_get_domain_type(struct device *dev);
+
+int mdev_set_domain(struct device *dev, void *domain);
+void *mdev_get_domain(struct device *dev);
+
/**
* struct mdev_parent_ops - Structure to be registered for each parent device to
* register the device to mdev module.
--
2.17.1


2018-08-30 04:13:43

by Lu Baolu

Subject: [RFC PATCH v2 06/10] iommu/vt-d: Return ID associated with an auxiliary domain

This adds support for returning the default pasid associated
with an auxiliary domain. The PCI device which is bound to this
domain should use this value as the pasid for all DMA requests
of the subset of the device which is isolated and protected by
this domain.

Cc: Ashok Raj <[email protected]>
Cc: Jacob Pan <[email protected]>
Cc: Kevin Tian <[email protected]>
Cc: Liu Yi L <[email protected]>
Signed-off-by: Lu Baolu <[email protected]>
---
drivers/iommu/intel-iommu.c | 9 +++++++++
1 file changed, 9 insertions(+)

diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
index 784bd496f316..1ba39f120a38 100644
--- a/drivers/iommu/intel-iommu.c
+++ b/drivers/iommu/intel-iommu.c
@@ -5544,6 +5544,14 @@ static void intel_iommu_disable_auxd(struct device *dev)
spin_unlock_irqrestore(&device_domain_lock, flags);
}

+static int intel_iommu_auxd_id(struct iommu_domain *domain)
+{
+ struct dmar_domain *dmar_domain = to_dmar_domain(domain);
+
+ return dmar_domain->default_pasid > 0 ?
+ dmar_domain->default_pasid : -EINVAL;
+}
+
const struct iommu_ops intel_iommu_ops = {
.capable = intel_iommu_capable,
.domain_alloc = intel_iommu_domain_alloc,
@@ -5560,6 +5568,7 @@ const struct iommu_ops intel_iommu_ops = {
.device_group = pci_device_group,
.enable_auxd = intel_iommu_enable_auxd,
.disable_auxd = intel_iommu_disable_auxd,
+ .auxd_id = intel_iommu_auxd_id,
.pgsize_bitmap = INTEL_IOMMU_PGSIZES,
};

--
2.17.1


2018-08-30 04:14:05

by Lu Baolu

Subject: [RFC PATCH v2 03/10] iommu/amd: Add default branch in amd_iommu_capable()

Otherwise, there will be a build warning:

drivers/iommu/amd_iommu.c:3083:2: warning: enumeration value
'IOMMU_CAP_AUX_DOMAIN' not handled in switch [-Wswitch]

There is no functional change.

Signed-off-by: Lu Baolu <[email protected]>
---
drivers/iommu/amd_iommu.c | 2 ++
1 file changed, 2 insertions(+)

diff --git a/drivers/iommu/amd_iommu.c b/drivers/iommu/amd_iommu.c
index 4e04fff23977..237ae6db4cfd 100644
--- a/drivers/iommu/amd_iommu.c
+++ b/drivers/iommu/amd_iommu.c
@@ -3077,6 +3077,8 @@ static bool amd_iommu_capable(enum iommu_cap cap)
return (irq_remapping_enabled == 1);
case IOMMU_CAP_NOEXEC:
return false;
+ default:
+ break;
}

return false;
--
2.17.1


2018-08-30 04:14:15

by Lu Baolu

Subject: [RFC PATCH v2 09/10] vfio/type1: Determine domain type of an mdev group

This adds support for determining the domain type of a group
of mediated devices according to the domain type attribute of
each device.

Cc: Ashok Raj <[email protected]>
Cc: Jacob Pan <[email protected]>
Cc: Kevin Tian <[email protected]>
Cc: Liu Yi L <[email protected]>
Signed-off-by: Sanjay Kumar <[email protected]>
Signed-off-by: Lu Baolu <[email protected]>
---
drivers/vfio/vfio_iommu_type1.c | 47 ++++++++++++++++++++++++++++++---
1 file changed, 43 insertions(+), 4 deletions(-)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index 89e2e6123223..64bf55b91de1 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -1390,6 +1390,27 @@ static void vfio_iommu_detach_group(struct vfio_domain *domain,
iommu_detach_group(domain->domain, group->iommu_group);
}

+static int vfio_mdev_domain_type(struct device *dev, void *data)
+{
+ enum mdev_domain_type new, *old = data;
+ enum mdev_domain_type (*fn)(struct device *dev);
+
+ fn = symbol_get(mdev_get_domain_type);
+ if (fn) {
+ new = fn(dev);
+ symbol_put(mdev_get_domain_type);
+
+ if (*old && *old != new)
+ return -EINVAL;
+
+ *old = new;
+
+ return 0;
+ }
+
+ return -EINVAL;
+}
+
static int vfio_iommu_type1_attach_group(void *iommu_data,
struct iommu_group *iommu_group)
{
@@ -1433,9 +1454,19 @@ static int vfio_iommu_type1_attach_group(void *iommu_data,

mdev_bus = symbol_get(mdev_bus_type);

- if (mdev_bus) {
- if ((bus == mdev_bus) && !iommu_present(bus)) {
- symbol_put(mdev_bus_type);
+ if (mdev_bus && bus == mdev_bus) {
+ enum mdev_domain_type type = 0;
+
+ symbol_put(mdev_bus_type);
+
+ /* Determine the domain type: */
+ ret = iommu_group_for_each_dev(iommu_group, &type,
+ vfio_mdev_domain_type);
+ if (ret)
+ goto out_free;
+
+ switch (type) {
+ case DOMAIN_TYPE_NO_IOMMU:
if (!iommu->external_domain) {
INIT_LIST_HEAD(&domain->group_list);
iommu->external_domain = domain;
@@ -1445,11 +1476,19 @@ static int vfio_iommu_type1_attach_group(void *iommu_data,
list_add(&group->next,
&iommu->external_domain->group_list);
mutex_unlock(&iommu->lock);
+
return 0;
+ case DOMAIN_TYPE_ATTACH_PARENT:
+ /* FALLTHROUGH */
+ default:
+ ret = -EINVAL;
+ goto out_free;
}
- symbol_put(mdev_bus_type);
}

+ if (mdev_bus)
+ symbol_put(mdev_bus_type);
+
domain->domain = iommu_domain_alloc(bus);
if (!domain->domain) {
ret = -EIO;
--
2.17.1


2018-08-30 04:14:22

by Lu Baolu

Subject: [RFC PATCH v2 01/10] iommu: Add APIs for multiple domains per device

Sharing a physical PCI device in a finer-grained way is
becoming a consensus in the industry, and IOMMU vendors
are making efforts to support such sharing as well as
possible. Among these efforts, the capability of supporting
finer-granularity DMA isolation is a common requirement
for security reasons. With finer-granularity DMA isolation,
all DMA requests out of or to a subset of a physical PCI
device can be protected by the IOMMU. As a result, there
is a demand in software to attach multiple domains to a
physical PCI device. One example of such a usage model is
Intel Scalable IOV [1] [2]. The Intel vt-d 3.0 spec [3]
introduces the scalable mode which enables PASID-granularity
DMA isolation.

This adds the APIs to support multiple domains per device.
To ease discussion, we call it 'a domain in auxiliary mode',
or simply 'auxiliary domain', when multiple domains are
attached to a physical device.

The APIs include:

* iommu_capable(IOMMU_CAP_AUX_DOMAIN)
- Represents the ability to support multiple domains
per device.

* iommu_en(dis)able_aux_domain(struct device *dev)
- Enable/disable the multiple-domains capability for the
device referenced by @dev.

* iommu_auxiliary_id(struct iommu_domain *domain)
- Return the ID used for finer-granularity DMA translation.
For the Intel Scalable IOV usage model, this will be
a PASID. A device which supports Scalable IOV needs
to write this ID to the device register so that DMA
requests can be tagged with the right PASID prefix.

Many people were involved in discussions of this design:

Kevin Tian <[email protected]>
Liu Yi L <[email protected]>
Ashok Raj <[email protected]>
Sanjay Kumar <[email protected]>
Alex Williamson <[email protected]>
Jean-Philippe Brucker <[email protected]>

and some discussions can be found here [4].

[1] https://software.intel.com/en-us/download/intel-scalable-io-virtualization-technical-specification
[2] https://schd.ws/hosted_files/lc32018/00/LC3-SIOV-final.pdf
[3] https://software.intel.com/en-us/download/intel-virtualization-technology-for-directed-io-architecture-specification
[4] https://lkml.org/lkml/2018/7/26/4

Cc: Ashok Raj <[email protected]>
Cc: Jacob Pan <[email protected]>
Cc: Kevin Tian <[email protected]>
Cc: Liu Yi L <[email protected]>
Suggested-by: Kevin Tian <[email protected]>
Signed-off-by: Lu Baolu <[email protected]>
---
drivers/iommu/iommu.c | 29 +++++++++++++++++++++++++++++
include/linux/iommu.h | 13 +++++++++++++
2 files changed, 42 insertions(+)

diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index 8c15c5980299..2c6faf417dd5 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -2014,3 +2014,32 @@ int iommu_fwspec_add_ids(struct device *dev, u32 *ids, int num_ids)
return 0;
}
EXPORT_SYMBOL_GPL(iommu_fwspec_add_ids);
+
+int iommu_enable_aux_domain(struct device *dev)
+{
+ const struct iommu_ops *ops = dev->bus->iommu_ops;
+
+ if (ops && ops->enable_auxd)
+ return ops->enable_auxd(dev);
+
+ return -EINVAL;
+}
+EXPORT_SYMBOL_GPL(iommu_enable_aux_domain);
+
+void iommu_disable_aux_domain(struct device *dev)
+{
+ const struct iommu_ops *ops = dev->bus->iommu_ops;
+
+ if (ops && ops->disable_auxd)
+ ops->disable_auxd(dev);
+}
+EXPORT_SYMBOL_GPL(iommu_disable_aux_domain);
+
+int iommu_auxiliary_id(struct iommu_domain *domain)
+{
+ if (domain->ops->auxd_id)
+ return domain->ops->auxd_id(domain);
+
+ return -EINVAL;
+}
+EXPORT_SYMBOL_GPL(iommu_auxiliary_id);
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index 87994c265bf5..ffd20b315bee 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -101,6 +101,8 @@ enum iommu_cap {
transactions */
IOMMU_CAP_INTR_REMAP, /* IOMMU supports interrupt isolation */
IOMMU_CAP_NOEXEC, /* IOMMU_NOEXEC flag */
+ IOMMU_CAP_AUX_DOMAIN, /* IOMMU supports multiple domains per
+ device */
};

/*
@@ -185,6 +187,9 @@ struct iommu_resv_region {
* @domain_get_windows: Return the number of windows for a domain
* @of_xlate: add OF master IDs to iommu grouping
* @pgsize_bitmap: bitmap of all possible supported page sizes
+ * @enable_auxd: enable multiple domains per device support
+ * @disable_auxd: disable multiple domains per device support
+ * @auxd_id: return the id of an auxiliary domain
*/
struct iommu_ops {
bool (*capable)(enum iommu_cap);
@@ -231,6 +236,10 @@ struct iommu_ops {
int (*of_xlate)(struct device *dev, struct of_phandle_args *args);
bool (*is_attach_deferred)(struct iommu_domain *domain, struct device *dev);

+ int (*enable_auxd)(struct device *dev);
+ void (*disable_auxd)(struct device *dev);
+ int (*auxd_id)(struct iommu_domain *domain);
+
unsigned long pgsize_bitmap;
};

@@ -400,6 +409,10 @@ void iommu_fwspec_free(struct device *dev);
int iommu_fwspec_add_ids(struct device *dev, u32 *ids, int num_ids);
const struct iommu_ops *iommu_ops_from_fwnode(struct fwnode_handle *fwnode);

+int iommu_enable_aux_domain(struct device *dev);
+void iommu_disable_aux_domain(struct device *dev);
+int iommu_auxiliary_id(struct iommu_domain *domain);
+
#else /* CONFIG_IOMMU_API */

struct iommu_ops {};
--
2.17.1


2018-08-30 04:14:38

by Lu Baolu

Subject: [RFC PATCH v2 04/10] iommu/vt-d: Enable/disable multiple domains per device

Add iommu ops for enabling and disabling multiple domains
for a device.

Cc: Ashok Raj <[email protected]>
Cc: Jacob Pan <[email protected]>
Cc: Kevin Tian <[email protected]>
Cc: Liu Yi L <[email protected]>
Signed-off-by: Lu Baolu <[email protected]>
---
drivers/iommu/intel-iommu.c | 36 ++++++++++++++++++++++++++++++++++++
include/linux/intel-iommu.h | 1 +
2 files changed, 37 insertions(+)

diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
index 891ae70e7bf2..3606d25bc40c 100644
--- a/drivers/iommu/intel-iommu.c
+++ b/drivers/iommu/intel-iommu.c
@@ -2501,6 +2501,7 @@ static struct dmar_domain *dmar_insert_one_dev_info(struct intel_iommu *iommu,
info->domain = domain;
info->iommu = iommu;
info->pasid_table = NULL;
+ info->auxd_enabled = 0;

if (dev && dev_is_pci(dev)) {
struct pci_dev *pdev = to_pci_dev(info->dev);
@@ -5384,6 +5385,39 @@ struct intel_iommu *intel_svm_device_to_iommu(struct device *dev)
}
#endif /* CONFIG_INTEL_IOMMU_SVM */

+static int intel_iommu_enable_auxd(struct device *dev)
+{
+ struct device_domain_info *info;
+ struct dmar_domain *domain;
+ unsigned long flags;
+
+ if (!scalable_mode_support())
+ return -ENODEV;
+
+ domain = get_valid_domain_for_dev(dev);
+ if (!domain)
+ return -ENODEV;
+
+ spin_lock_irqsave(&device_domain_lock, flags);
+ info = dev->archdata.iommu;
+ info->auxd_enabled = 1;
+ spin_unlock_irqrestore(&device_domain_lock, flags);
+
+ return 0;
+}
+
+static void intel_iommu_disable_auxd(struct device *dev)
+{
+ struct device_domain_info *info;
+ unsigned long flags;
+
+ spin_lock_irqsave(&device_domain_lock, flags);
+ info = dev->archdata.iommu;
+ if (!WARN_ON(!info))
+ info->auxd_enabled = 0;
+ spin_unlock_irqrestore(&device_domain_lock, flags);
+}
+
const struct iommu_ops intel_iommu_ops = {
.capable = intel_iommu_capable,
.domain_alloc = intel_iommu_domain_alloc,
@@ -5398,6 +5432,8 @@ const struct iommu_ops intel_iommu_ops = {
.get_resv_regions = intel_iommu_get_resv_regions,
.put_resv_regions = intel_iommu_put_resv_regions,
.device_group = pci_device_group,
+ .enable_auxd = intel_iommu_enable_auxd,
+ .disable_auxd = intel_iommu_disable_auxd,
.pgsize_bitmap = INTEL_IOMMU_PGSIZES,
};

diff --git a/include/linux/intel-iommu.h b/include/linux/intel-iommu.h
index b34cf8b887a0..15981245796e 100644
--- a/include/linux/intel-iommu.h
+++ b/include/linux/intel-iommu.h
@@ -487,6 +487,7 @@ struct device_domain_info {
u8 pri_enabled:1;
u8 ats_supported:1;
u8 ats_enabled:1;
+ u8 auxd_enabled:1; /* Multiple domains per device */
u8 ats_qdep;
struct device *dev; /* it's NULL for PCIe-to-PCI bridge */
struct intel_iommu *iommu; /* IOMMU used by this device */
--
2.17.1


2018-08-30 04:14:53

by Lu Baolu

[permalink] [raw]
Subject: [RFC PATCH v2 02/10] iommu/vt-d: Add multiple domains per device query

Add a response to the IOMMU_CAP_AUX_DOMAIN capability query
through iommu_capable(). Return true if all active IOMMUs
support scalable mode, false otherwise.

Cc: Ashok Raj <[email protected]>
Cc: Jacob Pan <[email protected]>
Cc: Kevin Tian <[email protected]>
Cc: Liu Yi L <[email protected]>
Signed-off-by: Lu Baolu <[email protected]>
---
drivers/iommu/intel-iommu.c | 31 +++++++++++++++++++++++++++++--
1 file changed, 29 insertions(+), 2 deletions(-)

diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
index 3e49d4029058..891ae70e7bf2 100644
--- a/drivers/iommu/intel-iommu.c
+++ b/drivers/iommu/intel-iommu.c
@@ -5193,12 +5193,39 @@ static phys_addr_t intel_iommu_iova_to_phys(struct iommu_domain *domain,
return phys;
}

+static inline bool scalable_mode_support(void)
+{
+ struct dmar_drhd_unit *drhd;
+ struct intel_iommu *iommu;
+ bool ret = true;
+
+ rcu_read_lock();
+ for_each_active_iommu(iommu, drhd) {
+ if (!sm_supported(iommu)) {
+ ret = false;
+ break;
+ }
+ }
+ rcu_read_unlock();
+
+ return ret;
+}
+
static bool intel_iommu_capable(enum iommu_cap cap)
{
- if (cap == IOMMU_CAP_CACHE_COHERENCY)
+ switch (cap) {
+ case IOMMU_CAP_CACHE_COHERENCY:
return domain_update_iommu_snooping(NULL) == 1;
- if (cap == IOMMU_CAP_INTR_REMAP)
+ case IOMMU_CAP_INTR_REMAP:
return irq_remapping_enabled == 1;
+ case IOMMU_CAP_AUX_DOMAIN:
+ return scalable_mode_support();
+ case IOMMU_CAP_NOEXEC:
+ /* PASSTHROUGH */
+ default:
+ pr_info("Unsupported capability query %d\n", cap);
+ break;
+ }

return false;
}
--
2.17.1


2018-08-30 04:15:07

by Lu Baolu

[permalink] [raw]
Subject: [RFC PATCH v2 08/10] vfio/type1: Add domain at(de)taching group helpers

If a domain is being attached to a group which includes
mediated devices, it should be attached to the parent
device of each mdev. This adds helpers for attaching a
domain to, and detaching it from, a group, regardless of
whether the group contains physical PCI devices or
mediated devices derived from a PCI physical device.

Cc: Ashok Raj <[email protected]>
Cc: Jacob Pan <[email protected]>
Cc: Kevin Tian <[email protected]>
Cc: Liu Yi L <[email protected]>
Signed-off-by: Lu Baolu <[email protected]>
---
drivers/vfio/vfio_iommu_type1.c | 77 ++++++++++++++++++++++++++++++---
1 file changed, 70 insertions(+), 7 deletions(-)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index d9fd3188615d..89e2e6123223 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -91,6 +91,9 @@ struct vfio_dma {
struct vfio_group {
struct iommu_group *iommu_group;
struct list_head next;
+ bool attach_parent; /* An mdev group with domain
+ * attached to parent
+ */
};

/*
@@ -1327,6 +1330,66 @@ static bool vfio_iommu_has_sw_msi(struct iommu_group *group, phys_addr_t *base)
return ret;
}

+static int vfio_mdev_set_aux_domain(struct device *dev,
+ struct iommu_domain *domain)
+{
+ int (*fn)(struct device *dev, void *domain);
+ int ret;
+
+ fn = symbol_get(mdev_set_domain);
+ if (fn) {
+ ret = fn(dev, domain);
+ symbol_put(mdev_set_domain);
+
+ return ret;
+ }
+
+ return -EINVAL;
+}
+
+static int vfio_attach_aux_domain(struct device *dev, void *data)
+{
+ struct iommu_domain *domain = data;
+ int ret;
+
+ ret = vfio_mdev_set_aux_domain(dev, domain);
+ if (ret)
+ return ret;
+
+ return iommu_attach_device(domain, dev->parent);
+}
+
+static int vfio_detach_aux_domain(struct device *dev, void *data)
+{
+ struct iommu_domain *domain = data;
+
+ vfio_mdev_set_aux_domain(dev, NULL);
+ iommu_detach_device(domain, dev->parent);
+
+ return 0;
+}
+
+static int vfio_iommu_attach_group(struct vfio_domain *domain,
+ struct vfio_group *group)
+{
+ if (group->attach_parent)
+ return iommu_group_for_each_dev(group->iommu_group,
+ domain->domain,
+ vfio_attach_aux_domain);
+ else
+ return iommu_attach_group(domain->domain, group->iommu_group);
+}
+
+static void vfio_iommu_detach_group(struct vfio_domain *domain,
+ struct vfio_group *group)
+{
+ if (group->attach_parent)
+ iommu_group_for_each_dev(group->iommu_group, domain->domain,
+ vfio_detach_aux_domain);
+ else
+ iommu_detach_group(domain->domain, group->iommu_group);
+}
+
static int vfio_iommu_type1_attach_group(void *iommu_data,
struct iommu_group *iommu_group)
{
@@ -1402,7 +1465,7 @@ static int vfio_iommu_type1_attach_group(void *iommu_data,
goto out_domain;
}

- ret = iommu_attach_group(domain->domain, iommu_group);
+ ret = vfio_iommu_attach_group(domain, group);
if (ret)
goto out_domain;

@@ -1434,8 +1497,8 @@ static int vfio_iommu_type1_attach_group(void *iommu_data,
list_for_each_entry(d, &iommu->domain_list, next) {
if (d->domain->ops == domain->domain->ops &&
d->prot == domain->prot) {
- iommu_detach_group(domain->domain, iommu_group);
- if (!iommu_attach_group(d->domain, iommu_group)) {
+ vfio_iommu_detach_group(domain, group);
+ if (!vfio_iommu_attach_group(d, group)) {
list_add(&group->next, &d->group_list);
iommu_domain_free(domain->domain);
kfree(domain);
@@ -1443,7 +1506,7 @@ static int vfio_iommu_type1_attach_group(void *iommu_data,
return 0;
}

- ret = iommu_attach_group(domain->domain, iommu_group);
+ ret = vfio_iommu_attach_group(domain, group);
if (ret)
goto out_domain;
}
@@ -1469,7 +1532,7 @@ static int vfio_iommu_type1_attach_group(void *iommu_data,
return 0;

out_detach:
- iommu_detach_group(domain->domain, iommu_group);
+ vfio_iommu_detach_group(domain, group);
out_domain:
iommu_domain_free(domain->domain);
out_free:
@@ -1560,7 +1623,7 @@ static void vfio_iommu_type1_detach_group(void *iommu_data,
if (!group)
continue;

- iommu_detach_group(domain->domain, iommu_group);
+ vfio_iommu_detach_group(domain, group);
list_del(&group->next);
kfree(group);
/*
@@ -1625,7 +1688,7 @@ static void vfio_release_domain(struct vfio_domain *domain, bool external)
list_for_each_entry_safe(group, group_tmp,
&domain->group_list, next) {
if (!external)
- iommu_detach_group(domain->domain, group->iommu_group);
+ vfio_iommu_detach_group(domain, group);
list_del(&group->next);
kfree(group);
}
--
2.17.1


2018-09-05 03:03:14

by Tian, Kevin

[permalink] [raw]
Subject: RE: [RFC PATCH v2 00/10] vfio/mdev: IOMMU aware mediated device

> From: Lu Baolu [mailto:[email protected]]
> Sent: Thursday, August 30, 2018 12:09 PM
>
[...]
>
> In order to distinguish the IOMMU-capable mediated devices from those
> which still need to rely on parent devices, this patch set adds a
> domain type attribute to each mdev.
>
> enum mdev_domain_type {
> DOMAIN_TYPE_NO_IOMMU, /* Don't need any IOMMU support.
> * All isolation and protection
> * are handled by the parent
> * device driver with a device
> * specific mechanism.
> */
> DOMAIN_TYPE_ATTACH_PARENT, /* IOMMU can isolate and
> protect
> * the mdev, and the isolation
> * domain should be attaced with
> * the parent device.
> */
> };
>

ATTACH_PARENT is not like a good counterpart to NO_IOMMU.

what about DOMAIN_TYPE_NO_IOMMU/DOMAIN_TYPE_IOMMU? whether
to attach parent device is just internal logic.

Alternatively DOMAIN_TYPE_SOFTWARE/DOMAIN_TYPE_HARDWARE,
where software means iommu_domain is managed by software while
the other means managed by hardware.

One side note to Alex - with multiple domain extension in IOMMU layer,
this version combines IOMMU-capable usages in VFIO: PASID-based (as
in scalable iov) and RID-based (as the usage of mdev wrapper on any
device). Both cases share the common path - just binding the domain to the
parent device of mdev. IOMMU layer will handle two cases differently later.

Thanks
Kevin

2018-09-05 19:17:20

by Alex Williamson

[permalink] [raw]
Subject: Re: [RFC PATCH v2 00/10] vfio/mdev: IOMMU aware mediated device

On Wed, 5 Sep 2018 03:01:39 +0000
"Tian, Kevin" <[email protected]> wrote:

> > From: Lu Baolu [mailto:[email protected]]
> > Sent: Thursday, August 30, 2018 12:09 PM
> >
> [...]
> >
> > In order to distinguish the IOMMU-capable mediated devices from those
> > which still need to rely on parent devices, this patch set adds a
> > domain type attribute to each mdev.
> >
> > enum mdev_domain_type {
> > DOMAIN_TYPE_NO_IOMMU, /* Don't need any IOMMU support.
> > * All isolation and protection
> > * are handled by the parent
> > * device driver with a device
> > * specific mechanism.
> > */
> > DOMAIN_TYPE_ATTACH_PARENT, /* IOMMU can isolate and
> > protect
> > * the mdev, and the isolation
> > * domain should be attaced with
> > * the parent device.
> > */
> > };
> >
>
> ATTACH_PARENT is not like a good counterpart to NO_IOMMU.

Please do not use NO_IOMMU, we already have a thing called
vfio-noiommu, enabled through CONFIG_VFIO_NOIOMMU and module parameter
enable_unsafe_noiommu_mode. This is much, much too similar and will
generate confusion.

> what about DOMAIN_TYPE_NO_IOMMU/DOMAIN_TYPE_IOMMU? whether
> to attach parent device is just internal logic.
>
> Alternatively DOMAIN_TYPE_SOFTWARE/DOMAIN_TYPE_HARDWARE,
> where software means iommu_domain is managed by software while
> the other means managed by hardware.

I haven't gotten deep enough into the series to see how it's used, but
my gut reaction is that we don't need an enum, we just need some sort
of pointer on the mdev that points to an iommu_parent, which indicates
the root of our IOMMU based isolation, or is NULL, which indicates we
use vendor defined isolation as we have now.

> One side note to Alex - with multiple domain extension in IOMMU layer,
> this version combines IOMMU-capable usages in VFIO: PASID-based (as
> in scalable iov) and RID-based (as the usage of mdev wrapper on any
> device). Both cases share the common path - just binding the domain to the
> parent device of mdev. IOMMU layer will handle two cases differently later.

Good, I'm glad you've considered the regular (RID) IOMMU domain and not
just the new aux domain. Thanks,

Alex

2018-09-05 19:38:06

by Alex Williamson

[permalink] [raw]
Subject: Re: [RFC PATCH v2 02/10] iommu/vt-d: Add multiple domains per device query

On Thu, 30 Aug 2018 12:09:14 +0800
Lu Baolu <[email protected]> wrote:

> Add the response to IOMMU_CAP_AUX_DOMAIN capability query
> through iommu_capable(). Return true if IOMMUs support the
> scalable mode, return false otherwise.
>
> Cc: Ashok Raj <[email protected]>
> Cc: Jacob Pan <[email protected]>
> Cc: Kevin Tian <[email protected]>
> Cc: Liu Yi L <[email protected]>
> Signed-off-by: Lu Baolu <[email protected]>
> ---
> drivers/iommu/intel-iommu.c | 31 +++++++++++++++++++++++++++++--
> 1 file changed, 29 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
> index 3e49d4029058..891ae70e7bf2 100644
> --- a/drivers/iommu/intel-iommu.c
> +++ b/drivers/iommu/intel-iommu.c
> @@ -5193,12 +5193,39 @@ static phys_addr_t intel_iommu_iova_to_phys(struct iommu_domain *domain,
> return phys;
> }
>
> +static inline bool scalable_mode_support(void)
> +{
> + struct dmar_drhd_unit *drhd;
> + struct intel_iommu *iommu;
> + bool ret = true;
> +
> + rcu_read_lock();
> + for_each_active_iommu(iommu, drhd) {
> + if (!sm_supported(iommu)) {
> + ret = false;
> + break;
> + }
> + }
> + rcu_read_unlock();
> +
> + return ret;
> +}
> +
> static bool intel_iommu_capable(enum iommu_cap cap)
> {
> - if (cap == IOMMU_CAP_CACHE_COHERENCY)
> + switch (cap) {
> + case IOMMU_CAP_CACHE_COHERENCY:
> return domain_update_iommu_snooping(NULL) == 1;
> - if (cap == IOMMU_CAP_INTR_REMAP)
> + case IOMMU_CAP_INTR_REMAP:
> return irq_remapping_enabled == 1;
> + case IOMMU_CAP_AUX_DOMAIN:
> + return scalable_mode_support();
> + case IOMMU_CAP_NOEXEC:
> + /* PASSTHROUGH */
> + default:
> + pr_info("Unsupported capability query %d\n", cap);
> + break;

Please don't do this, there's no reason to be noisy about a query of a
capability that VT-d doesn't know about. We implement capabilities
exactly so that relevant drivers can expose a feature and others can
happily (and quietly) ignore them. Thanks,

Alex

2018-09-05 19:39:17

by Alex Williamson

[permalink] [raw]
Subject: Re: [RFC PATCH v2 03/10] iommu/amd: Add default branch in amd_iommu_capable()

On Thu, 30 Aug 2018 12:09:15 +0800
Lu Baolu <[email protected]> wrote:

> Otherwise, there will be a build warning:
>
> drivers/iommu/amd_iommu.c:3083:2: warning: enumeration value
> 'IOMMU_CAP_AUX_DOMAIN' not handled in switch [-Wswitch]
>
> There is no functional change.
>
> Signed-off-by: Lu Baolu <[email protected]>
> ---
> drivers/iommu/amd_iommu.c | 2 ++
> 1 file changed, 2 insertions(+)
>
> diff --git a/drivers/iommu/amd_iommu.c b/drivers/iommu/amd_iommu.c
> index 4e04fff23977..237ae6db4cfd 100644
> --- a/drivers/iommu/amd_iommu.c
> +++ b/drivers/iommu/amd_iommu.c
> @@ -3077,6 +3077,8 @@ static bool amd_iommu_capable(enum iommu_cap cap)
> return (irq_remapping_enabled == 1);
> case IOMMU_CAP_NOEXEC:
> return false;
> + default:
> + break;
> }
>
> return false;

Seems like a bug fix that doesn't need to be part of this RFC, send it
separately. Thanks,

Alex

2018-09-06 00:57:55

by Lu Baolu

[permalink] [raw]
Subject: Re: [RFC PATCH v2 02/10] iommu/vt-d: Add multiple domains per device query

Hi,

On 09/06/2018 03:35 AM, Alex Williamson wrote:
> On Thu, 30 Aug 2018 12:09:14 +0800
> Lu Baolu <[email protected]> wrote:
>
>> Add the response to IOMMU_CAP_AUX_DOMAIN capability query
>> through iommu_capable(). Return true if IOMMUs support the
>> scalable mode, return false otherwise.
>>
>> Cc: Ashok Raj <[email protected]>
>> Cc: Jacob Pan <[email protected]>
>> Cc: Kevin Tian <[email protected]>
>> Cc: Liu Yi L <[email protected]>
>> Signed-off-by: Lu Baolu <[email protected]>
>> ---
>> drivers/iommu/intel-iommu.c | 31 +++++++++++++++++++++++++++++--
>> 1 file changed, 29 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
>> index 3e49d4029058..891ae70e7bf2 100644
>> --- a/drivers/iommu/intel-iommu.c
>> +++ b/drivers/iommu/intel-iommu.c
>> @@ -5193,12 +5193,39 @@ static phys_addr_t intel_iommu_iova_to_phys(struct iommu_domain *domain,
>> return phys;
>> }
>>
>> +static inline bool scalable_mode_support(void)
>> +{
>> + struct dmar_drhd_unit *drhd;
>> + struct intel_iommu *iommu;
>> + bool ret = true;
>> +
>> + rcu_read_lock();
>> + for_each_active_iommu(iommu, drhd) {
>> + if (!sm_supported(iommu)) {
>> + ret = false;
>> + break;
>> + }
>> + }
>> + rcu_read_unlock();
>> +
>> + return ret;
>> +}
>> +
>> static bool intel_iommu_capable(enum iommu_cap cap)
>> {
>> - if (cap == IOMMU_CAP_CACHE_COHERENCY)
>> + switch (cap) {
>> + case IOMMU_CAP_CACHE_COHERENCY:
>> return domain_update_iommu_snooping(NULL) == 1;
>> - if (cap == IOMMU_CAP_INTR_REMAP)
>> + case IOMMU_CAP_INTR_REMAP:
>> return irq_remapping_enabled == 1;
>> + case IOMMU_CAP_AUX_DOMAIN:
>> + return scalable_mode_support();
>> + case IOMMU_CAP_NOEXEC:
>> + /* PASSTHROUGH */
>> + default:
>> + pr_info("Unsupported capability query %d\n", cap);
>> + break;
>
> Please don't do this, there's no reason to be noisy about a query of a
> capability that VT-d doesn't know about. We implement capabilities
> exactly so that relevant drivers can expose a feature and others can
> happily (and quietly) ignore them. Thanks,

Fair enough. I will remove it in the next version.

Best regards,
Lu Baolu

2018-09-06 01:02:57

by Lu Baolu

[permalink] [raw]
Subject: Re: [RFC PATCH v2 03/10] iommu/amd: Add default branch in amd_iommu_capable()

Hi,

On 09/06/2018 03:37 AM, Alex Williamson wrote:
> On Thu, 30 Aug 2018 12:09:15 +0800
> Lu Baolu <[email protected]> wrote:
>
>> Otherwise, there will be a build warning:
>>
>> drivers/iommu/amd_iommu.c:3083:2: warning: enumeration value
>> 'IOMMU_CAP_AUX_DOMAIN' not handled in switch [-Wswitch]
>>
>> There is no functional change.
>>
>> Signed-off-by: Lu Baolu <[email protected]>
>> ---
>> drivers/iommu/amd_iommu.c | 2 ++
>> 1 file changed, 2 insertions(+)
>>
>> diff --git a/drivers/iommu/amd_iommu.c b/drivers/iommu/amd_iommu.c
>> index 4e04fff23977..237ae6db4cfd 100644
>> --- a/drivers/iommu/amd_iommu.c
>> +++ b/drivers/iommu/amd_iommu.c
>> @@ -3077,6 +3077,8 @@ static bool amd_iommu_capable(enum iommu_cap cap)
>> return (irq_remapping_enabled == 1);
>> case IOMMU_CAP_NOEXEC:
>> return false;
>> + default:
>> + break;
>> }
>>
>> return false;
>
> Seems like a bug fix that doesn't need to be part of this RFC, send it
> separately. Thanks,

Sure.

Best regards,
Lu Baolu

2018-09-06 01:33:04

by Lu Baolu

[permalink] [raw]
Subject: Re: [RFC PATCH v2 00/10] vfio/mdev: IOMMU aware mediated device

Hi,

On 09/06/2018 03:15 AM, Alex Williamson wrote:
> On Wed, 5 Sep 2018 03:01:39 +0000
> "Tian, Kevin" <[email protected]> wrote:
>
>>> From: Lu Baolu [mailto:[email protected]]
>>> Sent: Thursday, August 30, 2018 12:09 PM
>>>
>> [...]
>>>
>>> In order to distinguish the IOMMU-capable mediated devices from those
>>> which still need to rely on parent devices, this patch set adds a
>>> domain type attribute to each mdev.
>>>
>>> enum mdev_domain_type {
>>> DOMAIN_TYPE_NO_IOMMU, /* Don't need any IOMMU support.
>>> * All isolation and protection
>>> * are handled by the parent
>>> * device driver with a device
>>> * specific mechanism.
>>> */
>>> DOMAIN_TYPE_ATTACH_PARENT, /* IOMMU can isolate and
>>> protect
>>> * the mdev, and the isolation
>>> * domain should be attaced with
>>> * the parent device.
>>> */
>>> };
>>>
>>
>> ATTACH_PARENT is not like a good counterpart to NO_IOMMU.
>
> Please do not use NO_IOMMU, we already have a thing called
> vfio-noiommu, enabled through CONFIG_VFIO_NOIOMMU and module parameter
> enable_unsafe_noiommu_mode. This is much, much too similar and will
> generate confusion.

Sure. Will remove this confusion.

>
>> what about DOMAIN_TYPE_NO_IOMMU/DOMAIN_TYPE_IOMMU? whether
>> to attach parent device is just internal logic.
>>
>> Alternatively DOMAIN_TYPE_SOFTWARE/DOMAIN_TYPE_HARDWARE,
>> where software means iommu_domain is managed by software while
>> the other means managed by hardware.
>
> I haven't gotten deep enough into the series to see how it's used, but
> my gut reaction is that we don't need an enum, we just need some sort
> of pointer on the mdev that points to an iommu_parent, which indicates
> the root of our IOMMU based isolation, or is NULL, which indicates we
> use vendor defined isolation as we have now.

It works as long as we can distinguish IOMMU-based isolation from
vendor-defined isolation.

How about making iommu_parent point to the device structure that
created the mdev? If this pointer is not NULL, we bind the domain
to the device it points to; otherwise, we handle it in the vendor-
defined way.

Best regards,
Lu Baolu

>
>> One side note to Alex - with multiple domain extension in IOMMU layer,
>> this version combines IOMMU-capable usages in VFIO: PASID-based (as
>> in scalable iov) and RID-based (as the usage of mdev wrapper on any
>> device). Both cases share the common path - just binding the domain to the
>> parent device of mdev. IOMMU layer will handle two cases differently later.
>
> Good, I'm glad you've considered the regular (RID) IOMMU domain and not
> just the new aux domain. Thanks,
>
> Alex
>

2018-09-10 16:25:23

by Jean-Philippe Brucker

[permalink] [raw]
Subject: Re: [RFC PATCH v2 08/10] vfio/type1: Add domain at(de)taching group helpers

On 30/08/2018 05:09, Lu Baolu wrote:
> +static int vfio_detach_aux_domain(struct device *dev, void *data)
> +{
> + struct iommu_domain *domain = data;
> +
> + vfio_mdev_set_aux_domain(dev, NULL);
> + iommu_detach_device(domain, dev->parent);

I think that's only going to work for vt-d, which doesn't use a
default_domain. For other drivers, iommu.c ends up calling
domain->ops->attach_dev(default_domain, dev) here, which isn't what we want.

That was my concern with reusing attach/detach_dev callbacks for PASID
management. The attach_dev code of IOMMU drivers already has to deal
with toggling between default and unmanaged domain. Dealing with more
state transitions in the same path is going to be difficult.

Thanks,
Jean

2018-09-10 16:27:28

by Jean-Philippe Brucker

[permalink] [raw]
Subject: Re: [RFC PATCH v2 00/10] vfio/mdev: IOMMU aware mediated device

Hi,

On 30/08/2018 05:09, Lu Baolu wrote:
> Below APIs are introduced in the IOMMU glue for device drivers to use
> the finer granularity translation.
>
> * iommu_capable(IOMMU_CAP_AUX_DOMAIN)
> - Represents the ability for supporting multiple domains per device
> (a.k.a. finer granularity translations) of the IOMMU hardware.

iommu_capable() cannot represent hardware capabilities, we need
something else for systems with multiple IOMMUs that have different
caps. How about iommu_domain_get_attr on the device's domain instead?

> * iommu_en(dis)able_aux_domain(struct device *dev)
> - Enable/disable the multiple domains capability for a device
> referenced by @dev.
>
> * iommu_auxiliary_id(struct iommu_domain *domain)
> - Return the index value used for finer-granularity DMA translation.
> The specific device driver needs to feed the hardware with this
> value, so that hardware device could issue the DMA transaction with
> this value tagged.

This could also reuse iommu_domain_get_attr.


More generally I'm having trouble understanding how auxiliary domains
will be used. So VFIO allocates PASIDs like this:

* iommu_enable_aux_domain(parent_dev)
* iommu_domain_alloc() -> dom1
* iommu_domain_alloc() -> dom2
* iommu_attach_device(dom1, parent_dev)
-> dom1 gets PASID #1
* iommu_attach_device(dom2, parent_dev)
-> dom2 gets PASID #2

Then I'm not sure about the next steps, when userspace does
VFIO_IOMMU_MAP_DMA or VFIO_IOMMU_BIND on an mdev's container. Is the
following use accurate?

For the single translation level:
* iommu_map(dom1, ...) updates first-level/second-level pgtables for
PASID #1
* iommu_map(dom2, ...) updates first-level/second-level pgtables for
PASID #2

Nested translation:
* iommu_map(dom1, ...) updates second-level pgtables for PASID #1
* iommu_bind_table(dom1, ...) binds first-level pgtables, provided by
the guest, for PASID #1
* iommu_map(dom2, ...) updates second-level pgtables for PASID #2
* iommu_bind_table(dom2, ...) binds first-level pgtables for PASID #2


I'm trying to understand how to implement this with SMMU and other
IOMMUs. It's not a clean fit since we have a single domain to hold the
second-level pgtables. Then again, the nested case probably doesn't
matter for us - we might as well assign the parent directly, since all
mdevs have the same second-level and can only be assigned to the same VM.


Also, can non-VFIO device drivers use auxiliary domains to do map/unmap
on PASIDs? They are asking to do that and I'm proposing the private
PASID thing, but since aux domains provide a similar feature we should
probably converge somehow.

Thanks,
Jean

2018-09-12 02:44:46

by Lu Baolu

[permalink] [raw]
Subject: Re: [RFC PATCH v2 00/10] vfio/mdev: IOMMU aware mediated device

Hi,

On 09/11/2018 12:22 AM, Jean-Philippe Brucker wrote:
> Hi,
>
> On 30/08/2018 05:09, Lu Baolu wrote:
>> Below APIs are introduced in the IOMMU glue for device drivers to use
>> the finer granularity translation.
>>
>> * iommu_capable(IOMMU_CAP_AUX_DOMAIN)
>> - Represents the ability for supporting multiple domains per device
>> (a.k.a. finer granularity translations) of the IOMMU hardware.
>
> iommu_capable() cannot represent hardware capabilities, we need
> something else for systems with multiple IOMMUs that have different
> caps. How about iommu_domain_get_attr on the device's domain instead?

A domain is not a good choice for a per-IOMMU capability query. A domain
might be attached to devices belonging to different IOMMUs.

How about an API with a device structure as its parameter? A device always
belongs to a specific IOMMU. This API is supposed to be used by the
device driver.

>
>> * iommu_en(dis)able_aux_domain(struct device *dev)
>> - Enable/disable the multiple domains capability for a device
>> referenced by @dev.
>>
>> * iommu_auxiliary_id(struct iommu_domain *domain)
>> - Return the index value used for finer-granularity DMA translation.
>> The specific device driver needs to feed the hardware with this
>> value, so that hardware device could issue the DMA transaction with
>> this value tagged.
>
> This could also reuse iommu_domain_get_attr.
>
>
> More generally I'm having trouble understanding how auxiliary domains
> will be used. So VFIO allocates PASIDs like this:

As I wrote in the cover letter, "auxiliary domain" is just a name to
ease discussion. It actually has no special meaning (we think of a domain
as an isolation boundary which could be used by the IOMMU to isolate
the DMA transactions out of a PCI device, or part of it).

So drivers like vfio should see no difference when they use an auxiliary
domain. The auxiliary domain is not visible outside of the iommu driver.

>
> * iommu_enable_aux_domain(parent_dev)
> * iommu_domain_alloc() -> dom1
> * iommu_domain_alloc() -> dom2
> * iommu_attach_device(dom1, parent_dev)
> -> dom1 gets PASID #1
> * iommu_attach_device(dom2, parent_dev)
> -> dom2 gets PASID #2
>
> Then I'm not sure about the next steps, when userspace does
> VFIO_IOMMU_MAP_DMA or VFIO_IOMMU_BIND on an mdev's container. Is the
> following use accurate?
>
> For the single translation level:
> * iommu_map(dom1, ...) updates first-level/second-level pgtables for
> PASID #1
> * iommu_map(dom2, ...) updates first-level/second-level pgtables for
> PASID #2
>
> Nested translation:
> * iommu_map(dom1, ...) updates second-level pgtables for PASID #1
> * iommu_bind_table(dom1, ...) binds first-level pgtables, provided by
> the guest, for PASID #1
> * iommu_map(dom2, ...) updates second-level pgtables for PASID #2
> * iommu_bind_table(dom2, ...) binds first-level pgtables for PASID #2
>
> I'm trying to understand how to implement this with SMMU and other

This is proposed for architectures which support finer-grained
second-level translation, with no impact on architectures which only
support Source ID or similar granularity.

> IOMMUs. It's not a clean fit since we have a single domain to hold the
> second-level pgtables.

Do you mind explaining why a domain holds multiple second-level
pgtables? Shouldn't that be multiple domains?

> Then again, the nested case probably doesn't
> matter for us - we might as well assign the parent directly, since all
> mdevs have the same second-level and can only be assigned to the same VM.
>
>
> Also, can non-VFIO device drivers use auxiliary domains to do map/unmap
> on PASIDs? They are asking to do that and I'm proposing the private
> PASID thing, but since aux domains provide a similar feature we should
> probably converge somehow.

Yes, any non-VFIO device driver could use aux domain as well. The use
model is:

iommu_enable_aux_domain(dev)
-- enables aux domain support for this device

iommu_domain_alloc(dev)
-- allocate an iommu domain

iommu_attach_device(domain, dev)
-- attach the domain to device

iommu_auxiliary_id(domain)
-- retrieve the pasid id used by this domain

The device driver then calls

iommu_map(domain, ...)

sets the pasid value in the hardware register, and starts to do DMA.

Best regards,
Lu Baolu

2018-09-12 05:03:43

by Lu Baolu

[permalink] [raw]
Subject: Re: [RFC PATCH v2 08/10] vfio/type1: Add domain at(de)taching group helpers

Hi,

On 09/11/2018 12:23 AM, Jean-Philippe Brucker wrote:
> On 30/08/2018 05:09, Lu Baolu wrote:
>> +static int vfio_detach_aux_domain(struct device *dev, void *data)
>> +{
>> + struct iommu_domain *domain = data;
>> +
>> + vfio_mdev_set_aux_domain(dev, NULL);
>> + iommu_detach_device(domain, dev->parent);
>
> I think that's only going to work for vt-d, which doesn't use a
> default_domain. For other drivers, iommu.c ends up calling
> domain->ops->attach_dev(default_domain, dev) here, which isn't what we want.

This doesn't break any functionality. It takes effect only if the IOMMU
hardware supports finer-grained translation and advertises it through
the API.

>
> That was my concern with reusing attach/detach_dev callbacks for PASID
> management. The attach_dev code of IOMMU drivers already has to deal
> with toggling between default and unmanaged domain. Dealing with more
> state transitions in the same path is going to be difficult.

Let's consider it from the API point of view.

We have the iommu_a(de)ttach_device() APIs to attach or detach a domain
to/from a device. We should avoid applying a limitation like "these are
only for the single-domain case; for multiple domains, use another API".

Best regards,
Lu Baolu


2018-09-12 17:55:02

by Jean-Philippe Brucker

[permalink] [raw]
Subject: Re: [RFC PATCH v2 00/10] vfio/mdev: IOMMU aware mediated device

On 12/09/2018 03:42, Lu Baolu wrote:
> Hi,
>
> On 09/11/2018 12:22 AM, Jean-Philippe Brucker wrote:
>> Hi,
>>
>> On 30/08/2018 05:09, Lu Baolu wrote:
>>> Below APIs are introduced in the IOMMU glue for device drivers to use
>>> the finer granularity translation.
>>>
>>> * iommu_capable(IOMMU_CAP_AUX_DOMAIN)
>>> - Represents the ability for supporting multiple domains per device
>>> (a.k.a. finer granularity translations) of the IOMMU hardware.
>>
>> iommu_capable() cannot represent hardware capabilities, we need
>> something else for systems with multiple IOMMUs that have different
>> caps. How about iommu_domain_get_attr on the device's domain instead?
>
> A domain is not a good choice for a per-IOMMU capability query. A domain
> might be attached to devices belonging to different IOMMUs.
>
> How about an API with a device structure as its parameter? A device always
> belongs to a specific IOMMU. This API is supposed to be used by the
> device driver.

Ah right, domain attributes won't work. Your suggestion seems more
suitable, but maybe users can simply try to enable auxiliary domains
first, and conclude that the IOMMU doesn't support it if that returns
an error.

>>> * iommu_en(dis)able_aux_domain(struct device *dev)
>>> - Enable/disable the multiple domains capability for a device
>>> referenced by @dev.

It strikes me now that in the IOMMU driver,
iommu_enable/disable_aux_domain() will do the same thing as
iommu_sva_device_init/shutdown()
(https://www.spinics.net/lists/arm-kernel/msg651896.html). Some IOMMU
drivers want to enable PASID and allocate PASID tables only when
requested by users, in the sva_init_device IOMMU op (see Joerg's comment
last year https://patchwork.kernel.org/patch/9989307/#21025429). Maybe
we could simply add a flag to iommu_sva_device_init?

>>> * iommu_auxiliary_id(struct iommu_domain *domain)
>>> - Return the index value used for finer-granularity DMA translation.
>>> The specific device driver needs to feed the hardware with this
>>> value, so that hardware device could issue the DMA transaction with
>>> this value tagged.
>>
>> This could also reuse iommu_domain_get_attr.
>>
>>
>> More generally I'm having trouble understanding how auxiliary domains
>> will be used. So VFIO allocates PASIDs like this:
>
> As I wrote in the cover letter, "auxiliary domain" is just a name to
> ease discussion. It actually has no special meaning (we think of a domain
> as an isolation boundary which could be used by the IOMMU to isolate
> the DMA transactions out of a PCI device, or part of it).
>
> So drivers like vfio should see no difference when they use an auxiliary
> domain. The auxiliary domain is not visible outside of the iommu driver.

For an auxiliary domain, VFIO does need to retrieve the PASID and write
it to hardware. But being able to reuse iommu_map/unmap/iova_to_phys/etc
on the auxiliary domain is nice.

>> * iommu_enable_aux_domain(parent_dev)
>> * iommu_domain_alloc() -> dom1
>> * iommu_domain_alloc() -> dom2
>> * iommu_attach_device(dom1, parent_dev)
>> -> dom1 gets PASID #1
>> * iommu_attach_device(dom2, parent_dev)
>> -> dom2 gets PASID #2
>>
>> Then I'm not sure about the next steps, when userspace does
>> VFIO_IOMMU_MAP_DMA or VFIO_IOMMU_BIND on an mdev's container. Is the
>> following use accurate?
>>
>> For the single translation level:
>> * iommu_map(dom1, ...) updates first-level/second-level pgtables for
>> PASID #1
>> * iommu_map(dom2, ...) updates first-level/second-level pgtables for
>> PASID #2
>>
>> Nested translation:
>> * iommu_map(dom1, ...) updates second-level pgtables for PASID #1
>> * iommu_bind_table(dom1, ...) binds first-level pgtables, provided by
>> the guest, for PASID #1
>> * iommu_map(dom2, ...) updates second-level pgtables for PASID #2
>> * iommu_bind_table(dom2, ...) binds first-level pgtables for PASID #2
>>
>> I'm trying to understand how to implement this with SMMU and other
>
> This is proposed for architectures which support finer granularity
> second level translation with no impact on architectures which only
> support Source ID or similar granularity.

Just to be clear, in this paragraph you're only referring to the
Nested/second-level translation for mdev, which is specific to vt-d
rev3? Other architectures can still do first-level translation with
PASID, to support some use-cases of IOMMU aware mediated device
(assigning mdevs to userspace drivers, for example)

>> IOMMUs. It's not a clean fit since we have a single domain to hold the
>> second-level pgtables.
>
> Do you mind explaining why a domain holds multiple second-level
> pgtables? Shouldn't that be multiple domains?

I didn't mean a single domain holding multiple second-level pgtables,
but a single domain holding a single set of second-level pgtables for
all mdevs. But let's ignore that, mdev and second-level isn't realistic
for arm SMMU.

>> Then again, the nested case probably doesn't
>> matter for us - we might as well assign the parent directly, since all
>> mdevs have the same second-level and can only be assigned to the same VM.
>>
>>
>> Also, can non-VFIO device drivers use auxiliary domains to do map/unmap
>> on PASIDs? They are asking to do that and I'm proposing the private
>> PASID thing, but since aux domains provide a similar feature we should
>> probably converge somehow.
>
> Yes, any non-VFIO device driver could use aux domain as well. The use
> model is:
>
> iommu_enable_aux_domain(dev)
> -- enables aux domain support for this device
>
> iommu_domain_alloc(dev)
> -- allocate an iommu domain
>
> iommu_attach_device(domain, dev)
> -- attach the domain to device
>
> iommu_auxiliary_id(domain)
> -- retrieve the pasid id used by this domain
>
> The device driver then
>
> iommu_map(domain, ...)
>
> sets the PASID to the hardware register and starts to do DMA.

Sounds good, I'll drop the private PASID patch if we can figure out a
solution to the attach/detach_dev problem discussed on patch 8/10

Thanks,
Jean

2018-09-12 17:56:48

by Jean-Philippe Brucker

[permalink] [raw]
Subject: Re: [RFC PATCH v2 08/10] vfio/type1: Add domain at(de)taching group helpers

On 12/09/2018 06:02, Lu Baolu wrote:
> Hi,
>
> On 09/11/2018 12:23 AM, Jean-Philippe Brucker wrote:
>> On 30/08/2018 05:09, Lu Baolu wrote:
>>> +static int vfio_detach_aux_domain(struct device *dev, void *data)
>>> +{
>>> + struct iommu_domain *domain = data;
>>> +
>>> + vfio_mdev_set_aux_domain(dev, NULL);
>>> + iommu_detach_device(domain, dev->parent);
>>
>> I think that's only going to work for vt-d, which doesn't use a
>> default_domain. For other drivers, iommu.c ends up calling
>> domain->ops->attach_dev(default_domain, dev) here, which isn't what we want.
>
> This doesn't break any functionality. It takes effect only if iommu
> hardware supports finer granular translation and advertises it through
> the API.
>> That was my concern with reusing attach/detach_dev callbacks for PASID
>> management. The attach_dev code of IOMMU drivers already has to deal
>> with toggling between default and unmanaged domain. Dealing with more
>> state transitions in the same path is going to be difficult.
>
> Let's consider it from the API point of view.
>
> We have iommu_a(de)ttach_device() APIs to attach or detach a domain
> to/from a device. We should avoid applying a limitation of "these are
> only for single domain case, for multiple domains, use another API".

That's an idealized view of the API, the actual semantics aren't as
simple. For IOMMU drivers that implement IOMMU_DOMAIN_DMA in their
domain_alloc operation (Arm SMMU, AMD IOMMU, ...), attach_dev attaches
an unmanaged domain, detach_dev reattaches the default DMA domain, and
the detach_dev IOMMU op is not called. We can't change that now, so it's
difficult to add more functionality to attach_dev and detach_dev.

Thanks,
Jean

2018-09-13 00:24:17

by Tian, Kevin

[permalink] [raw]
Subject: RE: [RFC PATCH v2 00/10] vfio/mdev: IOMMU aware mediated device

> From: Jean-Philippe Brucker
> Sent: Thursday, September 13, 2018 1:54 AM
>
> On 12/09/2018 03:42, Lu Baolu wrote:
> > Hi,
> >
> > On 09/11/2018 12:22 AM, Jean-Philippe Brucker wrote:
> >> Hi,
> >>
> >> On 30/08/2018 05:09, Lu Baolu wrote:
> >>> Below APIs are introduced in the IOMMU glue for device drivers to use
> >>> the finer granularity translation.
> >>>
> >>> * iommu_capable(IOMMU_CAP_AUX_DOMAIN)
> >>> - Represents the ability for supporting multiple domains per device
> >>> (a.k.a. finer granularity translations) of the IOMMU hardware.
> >>
> >> iommu_capable() cannot represent hardware capabilities, we need
> >> something else for systems with multiple IOMMUs that have different
> >> caps. How about iommu_domain_get_attr on the device's domain
> instead?
> >
> > Domain is not a good choice for per iommu cap query. A domain might be
> > attached to devices belonging to different iommu's.
> >
> > How about an API with device structure as parameter? A device always
> > belongs to a specific iommu. This API is supposed to be used by the
> > device driver.
>
> Ah right, domain attributes won't work. Your suggestion seems more
> suitable, but maybe users can simply try to enable auxiliary domains
> first, and conclude that the IOMMU doesn't support it if it returns an error
>
> >>> * iommu_en(dis)able_aux_domain(struct device *dev)
> >>> - Enable/disable the multiple domains capability for a device
> >>> referenced by @dev.
>
> It strikes me now that in the IOMMU driver,
> iommu_enable/disable_aux_domain() will do the same thing as
> iommu_sva_device_init/shutdown()
> (https://www.spinics.net/lists/arm-kernel/msg651896.html). Some IOMMU
> drivers want to enable PASID and allocate PASID tables only when
> requested by users, in the sva_init_device IOMMU op (see Joerg's comment
> last year https://patchwork.kernel.org/patch/9989307/#21025429). Maybe
> we could simply add a flag to iommu_sva_device_init?

We could combine them, but the 'sva' prefix definitely should be removed :-)

>
> >>> * iommu_auxiliary_id(struct iommu_domain *domain)
> >>> - Return the index value used for finer-granularity DMA translation.
> >>> The specific device driver needs to feed the hardware with this
> >>> value, so that hardware device could issue the DMA transaction with
> >>> this value tagged.
> >>
> >> This could also reuse iommu_domain_get_attr.
> >>
> >>
> >> More generally I'm having trouble understanding how auxiliary domains
> >> will be used. So VFIO allocates PASIDs like this:
> >
> > As I wrote in the cover letter, "auxiliary domain" is just a name to
> > ease discussion. It actually has no special meaning (we think of a
> > domain as an isolation boundary which could be used by the IOMMU to
> > isolate the DMA transactions out of a PCI device or part of it).
> >
> > So drivers like vfio should see no difference when using an auxiliary
> > domain. The auxiliary domain concept is not visible outside the IOMMU
> > driver.
>
> For an auxiliary domain, VFIO does need to retrieve the PASID and write
> it to hardware. But being able to reuse
> iommu_map/unmap/iova_to_phys/etc
> on the auxiliary domain is nice.
>
> >> * iommu_enable_aux_domain(parent_dev)
> >> * iommu_domain_alloc() -> dom1
> >> * iommu_domain_alloc() -> dom2
> >> * iommu_attach_device(dom1, parent_dev)
> >> -> dom1 gets PASID #1
> >> * iommu_attach_device(dom2, parent_dev)
> >> -> dom2 gets PASID #2
> >>
> >> Then I'm not sure about the next steps, when userspace does
> >> VFIO_IOMMU_MAP_DMA or VFIO_IOMMU_BIND on an mdev's
> container. Is the
> >> following use accurate?
> >>
> >> For the single translation level:
> >> * iommu_map(dom1, ...) updates first-level/second-level pgtables for
> >> PASID #1
> >> * iommu_map(dom2, ...) updates first-level/second-level pgtables for
> >> PASID #2
> >>
> >> Nested translation:
> >> * iommu_map(dom1, ...) updates second-level pgtables for PASID #1
> >> * iommu_bind_table(dom1, ...) binds first-level pgtables, provided by
> >> the guest, for PASID #1
> >> * iommu_map(dom2, ...) updates second-level pgtables for PASID #2
> >> * iommu_bind_table(dom2, ...) binds first-level pgtables for PASID #2
> >>
> >> I'm trying to understand how to implement this with SMMU and other
> >
> > This is proposed for architectures which support finer granularity
> > second level translation with no impact on architectures which only
> > support Source ID or similar granularity.
>
> Just to be clear, in this paragraph you're only referring to the
> Nested/second-level translation for mdev, which is specific to vt-d
> rev3? Other architectures can still do first-level translation with
> PASID, to support some use-cases of IOMMU aware mediated device
> (assigning mdevs to userspace drivers, for example)

Yes. The aux domain concept applies only to vt-d rev3, which introduces
scalable mode. Care is taken to avoid breaking usages on existing
architectures.

One note: assigning mdevs to user space alone doesn't imply IOMMU
awareness. All existing mdev usages use software or proprietary methods
to isolate DMA. There is only one potential IOMMU-aware mdev usage we
discussed that does not rely on vt-d rev3 scalable mode - wrapping a
random PCI device into a single mdev instance (no sharing). In that
case the mdev inherits the RID from its parent PCI device, and is thus
isolated by the IOMMU at RID granularity. Our RFC supports this usage
too. In VFIO the two usages (PASID-based and RID-based) use the same
code path, i.e. always binding the domain to the parent device of the
mdev. But within the IOMMU they go down different paths: PASID-based
goes to the aux domain, since iommu_enable_aux_domain has been called
on that device; RID-based follows the existing unmanaged-domain path,
as if it were parent device assignment.

>
> >> IOMMUs. It's not a clean fit since we have a single domain to hold the
> >> second-level pgtables.
> >
> > Do you mind explaining why a domain holds multiple second-level
> > pgtables? Shouldn't that be multiple domains?
>
> I didn't mean a single domain holding multiple second-level pgtables,
> but a single domain holding a single set of second-level pgtables for
> all mdevs. But let's ignore that, mdev and second-level isn't realistic
> for arm SMMU.

Yes. A single second level doesn't allow multiple mdevs (each mdev
assigned to a different user process or VM). That is why vt-d rev3
introduces scalable mode. :-)

>
> >> Then again, the nested case probably doesn't
> >> matter for us - we might as well assign the parent directly, since all
> >> mdevs have the same second-level and can only be assigned to the same
> VM.
> >>
> >>
> >> Also, can non-VFIO device drivers use auxiliary domains to do
> map/unmap
> >> on PASIDs? They are asking to do that and I'm proposing the private
> >> PASID thing, but since aux domains provide a similar feature we should
> >> probably converge somehow.
> >
> > Yes, any non-VFIO device driver could use aux domain as well. The use
> > model is:
> >
> > iommu_enable_aux_domain(dev)
> > -- enables aux domain support for this device
> >
> > iommu_domain_alloc(dev)
> > -- allocate an iommu domain
> >
> > iommu_attach_device(domain, dev)
> > -- attach the domain to device
> >
> > iommu_auxiliary_id(domain)
> > -- retrieve the pasid id used by this domain
> >
> > The device driver then
> >
> > iommu_map(domain, ...)
> >
> > sets the PASID to the hardware register and starts to do DMA.
>
> Sounds good, I'll drop the private PASID patch if we can figure out a
> solution to the attach/detach_dev problem discussed on patch 8/10
>

Can you elaborate a bit on the private PASID usage? What is the
high-level flow for it?

Again, based on the earlier explanation, aux domain is specific to
IOMMU architectures supporting a vt-d scalable-mode-like capability,
which allows separate 2nd/1st level translations per PASID. I need a
better understanding of how private PASID is relevant here.

Thanks
Kevin

2018-09-13 00:36:07

by Tian, Kevin

[permalink] [raw]
Subject: RE: [RFC PATCH v2 08/10] vfio/type1: Add domain at(de)taching group helpers

> From: Jean-Philippe Brucker [mailto:[email protected]]
> Sent: Thursday, September 13, 2018 1:54 AM
>
> On 12/09/2018 06:02, Lu Baolu wrote:
> > Hi,
> >
> > On 09/11/2018 12:23 AM, Jean-Philippe Brucker wrote:
> >> On 30/08/2018 05:09, Lu Baolu wrote:
> >>> +static int vfio_detach_aux_domain(struct device *dev, void *data)
> >>> +{
> >>> + struct iommu_domain *domain = data;
> >>> +
> >>> + vfio_mdev_set_aux_domain(dev, NULL);
> >>> + iommu_detach_device(domain, dev->parent);
> >>
> >> I think that's only going to work for vt-d, which doesn't use a
> >> default_domain. For other drivers, iommu.c ends up calling
> >> domain->ops->attach_dev(default_domain, dev) here, which isn't what
> we want.
> >
> > This doesn't break any functionality. It takes effect only if iommu
> > hardware supports finer granular translation and advertises it through
> > the API.
> >> That was my concern with reusing attach/detach_dev callbacks for
> PASID
> >> management. The attach_dev code of IOMMU drivers already has to
> deal
> >> with toggling between default and unmanaged domain. Dealing with
> more
> >> state transitions in the same path is going to be difficult.
> >
> > Let's consider it from the API point of view.
> >
> > We have iommu_a(de)ttach_device() APIs to attach or detach a domain
> > to/from a device. We should avoid applying a limitation of "these are
> > only for single domain case, for multiple domains, use another API".
>
> That's an idealized view of the API, the actual semantics aren't as
> simple. For IOMMU drivers that implement IOMMU_DOMAIN_DMA in their
> domain_alloc operation (Arm SMMU, AMD IOMMU, ...), attach_dev
> attaches
> an unmanaged domain, detach_dev reattaches the default DMA domain,
> and
> the detach_dev IOMMU op is not called. We can't change that now, so it's
> difficult to add more functionality to attach_dev and detach_dev.
>

Now we have four possible usages of a(de)ttach_device:

1) Normal DMA API path i.e. IOMMU_DOMAIN_DMA, for DMA operations
in host/parent device driver;

2) VFIO passthru path, when the physical device is assigned to user space
or guest driver

3) mdev passthru path 1, when mdev is assigned to user space or guest
driver. Here the mdev is a wrapper around a whole PCI device

4) mdev passthru path 2, when mdev is assigned to user space or guest
driver. Here the mdev is at a finer granularity (e.g. tagged by a
PASID) as supported by vt-d scalable mode

1) and 2) are existing usages. What you described (unmanaged vs. default)
falls into this category.

3) is actually the same as 2) at the IOMMU layer - sort of passing the
DMA capability through to the guest. Though there is still a parent
driver, the parent driver itself should not do DMA, i.e. it is
unmanaged on the host side.

4) is a new code path introduced in IOMMU layer, which is what
vfio_a(de)tach_aux_domain is trying to address. In this case parent device
can still have its own DMA capability, using existing code path 1)
(thus the default domain still applies), and the parent device can have
multiple aux domains (and associated structures), using code path 4).

I don't see how this new code path interferes with the default domain
concept in the existing paths. In the future, if ARM or other
architectures plan to support a similar scalable-mode capability, only
the new code path should be tweaked.

Thanks
Kevin

2018-09-13 15:04:01

by Jean-Philippe Brucker

[permalink] [raw]
Subject: Re: [RFC PATCH v2 00/10] vfio/mdev: IOMMU aware mediated device

On 13/09/2018 01:19, Tian, Kevin wrote:
>>> This is proposed for architectures which support finer granularity
>>> second level translation with no impact on architectures which only
>>> support Source ID or similar granularity.
>>
>> Just to be clear, in this paragraph you're only referring to the
>> Nested/second-level translation for mdev, which is specific to vt-d
>> rev3? Other architectures can still do first-level translation with
>> PASID, to support some use-cases of IOMMU aware mediated device
>> (assigning mdevs to userspace drivers, for example)
>
> yes. aux domain concept applies only to vt-d rev3 which introduces
> scalable mode. Care is taken to avoid breaking usages on existing
> architectures.
>
> one note. Assigning mdevs to user space alone doesn't imply IOMMU
> aware. All existing mdev usages use software or proprietary methods to
> isolate DMA. There is only one potential IOMMU aware mdev usage
> which we discussed that does not rely on vt-d rev3 scalable mode - wrap a random
> PCI device into a single mdev instance (no sharing). In that case mdev
> inherits RID from parent PCI device, thus is isolated by IOMMU in RID
> granular. Our RFC supports this usage too. In VFIO two usages (PASID-
> based and RID-based) use same code path, i.e. always binding domain to
> the parent device of mdev. But within IOMMU they go different paths.
> PASID-based will go to aux-domain as iommu_enable_aux_domain
> has been called on that device. RID-based will follow existing
> unmanaged domain path, as if it is parent device assignment.

For Arm SMMU we're more interested in the PASID-granular case than the
RID-granular one. It doesn't necessarily require vt-d rev3 scalable
mode, the following example can be implemented with an SMMUv3, since it
only needs PASID-granular first-level translation:

We have a PCI function that supports PASID, and can be partitioned into
multiple isolated entities, mdevs. Each mdev has an MMIO frame, an MSI
vector and a PASID.

Different processes (userspace drivers, not QEMU) each open one mdev. A
process controlling one mdev has two ways of doing DMA:

(1) Classically, the process uses a VFIO_TYPE1v2_IOMMU container. This
creates an auxiliary domain for the mdev, with PASID #35. The process
creates DMA mappings with VFIO_IOMMU_MAP_DMA. VFIO calls iommu_map on
the auxiliary domain. The IOMMU driver populates the pgtables associated
with PASID #35.

(2) SVA. One way of doing it: the process uses a new
"VFIO_TYPE1_SVA_IOMMU" type of container. VFIO binds the process address
space to the device, gets PASID #35. Simpler, but not everyone wants to
use SVA, especially not userspace drivers which need the highest
performance.


This example only needs to modify first-level translation, and works
with SMMUv3. The kernel here could be the host, in which case
second-level translation is disabled in the SMMU, or it could be the
guest, in which case second-level mappings are created by QEMU and
first-level translation is managed by assigning PASID tables to the guest.

So (2) would use iommu_sva_bind_device(), but (1) needs something else.
Aren't auxiliary domains suitable for (1)? Why limit auxiliary domain to
second-level or nested translation? It seems silly to use a different
API for first-level, since the flow in userspace and VFIO is the same as
your second-level case as far as MAP_DMA ioctl goes. The difference is
that in your case the auxiliary domain supports an additional operation
which binds first-level page tables. An auxiliary domain that only
supports first-level wouldn't support this operation, but it can still
implement iommu_map/unmap/etc.


Another note: if for some reason you did want to allow userspace to
choose between first-level or second-level, you could implement the
VFIO_TYPE1_NESTING_IOMMU container. It acts like a VFIO_TYPE1v2_IOMMU,
but also sets the DOMAIN_ATTR_NESTING on the IOMMU domain. So DMA_MAP
ioctl on a NESTING container would populate second-level, and DMA_MAP on
a normal container populates first-level. But if you're always going to
use second-level by default, the distinction isn't necessary.


>> Sounds good, I'll drop the private PASID patch if we can figure out a
>> solution to the attach/detach_dev problem discussed on patch 8/10
>>
>
> Can you elaborate a bit on private PASID usage? what is the
> high level flow on it?
>
> Again based on earlier explanation, aux domain is specific to IOMMU
> architecture supporting vtd scalable mode-like capability, which allows
> separate 2nd/1st level translations per PASID. Need a better understanding
> how private PASID is relevant here.

Private PASIDs are used for doing iommu_map/iommu_unmap on PASIDs
(first-level translation):
https://www.spinics.net/lists/dri-devel/msg177003.html As above, some
people don't want SVA, some can't do it, some may even want a few
private address spaces just for their kernel driver. They need a way to
allocate PASIDs and do iommu_map/iommu_unmap on them, without binding to
a process. I was planning to add the private PASID patch to my SVA
series, but in my opinion the feature overlaps with auxiliary domains.

Thanks,
Jean

2018-09-13 16:59:33

by Ashok Raj

[permalink] [raw]
Subject: Re: [RFC PATCH v2 00/10] vfio/mdev: IOMMU aware mediated device

On Thu, Sep 13, 2018 at 04:03:01PM +0100, Jean-Philippe Brucker wrote:
> On 13/09/2018 01:19, Tian, Kevin wrote:
> >>> This is proposed for architectures which support finer granularity
> >>> second level translation with no impact on architectures which only
> >>> support Source ID or similar granularity.
> >>
> >> Just to be clear, in this paragraph you're only referring to the
> >> Nested/second-level translation for mdev, which is specific to vt-d
> >> rev3? Other architectures can still do first-level translation with
> >> PASID, to support some use-cases of IOMMU aware mediated device
> >> (assigning mdevs to userspace drivers, for example)
> >
> > yes. aux domain concept applies only to vt-d rev3 which introduces
> > scalable mode. Care is taken to avoid breaking usages on existing
> > architectures.
> >
> > one note. Assigning mdevs to user space alone doesn't imply IOMMU
> > aware. All existing mdev usages use software or proprietary methods to
> > isolate DMA. There is only one potential IOMMU aware mdev usage
> > which we discussed that does not rely on vt-d rev3 scalable mode - wrap a random
> > PCI device into a single mdev instance (no sharing). In that case mdev
> > inherits RID from parent PCI device, thus is isolated by IOMMU in RID
> > granular. Our RFC supports this usage too. In VFIO two usages (PASID-
> > based and RID-based) use same code path, i.e. always binding domain to
> > the parent device of mdev. But within IOMMU they go different paths.
> > PASID-based will go to aux-domain as iommu_enable_aux_domain
> > has been called on that device. RID-based will follow existing
> > unmanaged domain path, as if it is parent device assignment.
>
> For Arm SMMU we're more interested in the PASID-granular case than the
> RID-granular one. It doesn't necessarily require vt-d rev3 scalable
> mode, the following example can be implemented with an SMMUv3, since it
> only needs PASID-granular first-level translation:

You are right, you can simply use the first level as IOVA for every PASID.

The only issue arises when you need to assign that to a guest: you
would be required to shadow the 1st level. If you have a 2nd level, the
per-PASID 1st level can be managed in the guest and doesn't need to be
shadowed.

>
> We have a PCI function that supports PASID, and can be partitioned into
> multiple isolated entities, mdevs. Each mdev has an MMIO frame, an MSI
> vector and a PASID.
>
> Different processes (userspace drivers, not QEMU) each open one mdev. A
> process controlling one mdev has two ways of doing DMA:
>
> (1) Classically, the process uses a VFIO_TYPE1v2_IOMMU container. This
> creates an auxiliary domain for the mdev, with PASID #35. The process
> creates DMA mappings with VFIO_IOMMU_MAP_DMA. VFIO calls iommu_map on
> the auxiliary domain. The IOMMU driver populates the pgtables associated
> with PASID #35.
>
> (2) SVA. One way of doing it: the process uses a new
> "VFIO_TYPE1_SVA_IOMMU" type of container. VFIO binds the process address
> space to the device, gets PASID #35. Simpler, but not everyone wants to
> use SVA, especially not userspace drivers which need the highest
> performance.
>
>
> This example only needs to modify first-level translation, and works
> with SMMUv3. The kernel here could be the host, in which case
> second-level translation is disabled in the SMMU, or it could be the
> guest, in which case second-level mappings are created by QEMU and
> first-level translation is managed by assigning PASID tables to the guest.
>
> So (2) would use iommu_sva_bind_device(), but (1) needs something else.
> Aren't auxiliary domains suitable for (1)? Why limit auxiliary domain to
> second-level or nested translation? It seems silly to use a different
> API for first-level, since the flow in userspace and VFIO is the same as
> your second-level case as far as MAP_DMA ioctl goes. The difference is
> that in your case the auxiliary domain supports an additional operation
> which binds first-level page tables. An auxiliary domain that only
> supports first-level wouldn't support this operation, but it can still
> implement iommu_map/unmap/etc.
>
>
> Another note: if for some reason you did want to allow userspace to
> choose between first-level or second-level, you could implement the
> VFIO_TYPE1_NESTING_IOMMU container. It acts like a VFIO_TYPE1v2_IOMMU,
> but also sets the DOMAIN_ATTR_NESTING on the IOMMU domain. So DMA_MAP
> ioctl on a NESTING container would populate second-level, and DMA_MAP on
> a normal container populates first-level. But if you're always going to
> use second-level by default, the distinction isn't necessary.

Where is the nesting attribute specified? In vt-d2 it was part of the
context entry, which meant all PASIDs were nested. In vt-d3 it is part
of the PASID context.

It seems unsafe to share PASIDs across different VMs, since any request
without a PASID has only one mapping.

>
>
> >> Sounds good, I'll drop the private PASID patch if we can figure out a
> >> solution to the attach/detach_dev problem discussed on patch 8/10
> >>
> >
> > Can you elaborate a bit on private PASID usage? what is the
> > high level flow on it?
> >
> > Again based on earlier explanation, aux domain is specific to IOMMU
> > architecture supporting vtd scalable mode-like capability, which allows
> > separate 2nd/1st level translations per PASID. Need a better understanding
> > how private PASID is relevant here.
>
> Private PASIDs are used for doing iommu_map/iommu_unmap on PASIDs
> (first-level translation):
> https://www.spinics.net/lists/dri-devel/msg177003.html As above, some
> people don't want SVA, some can't do it, some may even want a few
> private address spaces just for their kernel driver. They need a way to
> allocate PASIDs and do iommu_map/iommu_unmap on them, without binding to
> a process. I was planning to add the private PASID patch to my SVA
> series, but in my opinion the feature overlaps with auxiliary domains.

It sounds like it maps to AUX domains.

2018-09-14 03:04:29

by Lu Baolu

[permalink] [raw]
Subject: Re: [RFC PATCH v2 00/10] vfio/mdev: IOMMU aware mediated device

Hi,

On 09/13/2018 01:54 AM, Jean-Philippe Brucker wrote:
> On 12/09/2018 03:42, Lu Baolu wrote:
>> Hi,
>>
>> On 09/11/2018 12:22 AM, Jean-Philippe Brucker wrote:
>>> Hi,
>>>
>>> On 30/08/2018 05:09, Lu Baolu wrote:
>>>> Below APIs are introduced in the IOMMU glue for device drivers to use
>>>> the finer granularity translation.
>>>>
>>>> * iommu_capable(IOMMU_CAP_AUX_DOMAIN)
>>>> - Represents the ability for supporting multiple domains per device
>>>> (a.k.a. finer granularity translations) of the IOMMU hardware.
>>> iommu_capable() cannot represent hardware capabilities, we need
>>> something else for systems with multiple IOMMUs that have different
>>> caps. How about iommu_domain_get_attr on the device's domain instead?
>> Domain is not a good choice for per iommu cap query. A domain might be
>> attached to devices belonging to different iommu's.
>>
>> How about an API with device structure as parameter? A device always
>> belongs to a specific iommu. This API is supposed to be used the
>> device driver.
> Ah right, domain attributes won't work. Your suggestion seems more
> suitable, but maybe users can simply try to enable auxiliary domains
> first, and conclude that the IOMMU doesn't support it if it returns an error
>

Some drivers might want to check whether the hardware supports
AUX_DOMAIN during the driver probe stage, but don't want to enable
AUX_DOMAIN at that time. One reasonable use case: the driver checks the
AUX_DOMAIN cap during probe and exposes different sysfs nodes according
to whether AUX_DOMAIN is supported or not; AUX_DOMAIN is then enabled
or disabled at run time through a sysfs node. With this consideration,
we still need an API to check the cap.

How about

* iommu_check_aux_domain(struct device *dev)
- Check whether the iommu driver supports multiple domains on @dev.

Best regards,
Lu Baolu




2018-09-14 03:06:10

by Tian, Kevin

[permalink] [raw]
Subject: RE: [RFC PATCH v2 00/10] vfio/mdev: IOMMU aware mediated device

> From: Lu Baolu [mailto:[email protected]]
> Sent: Friday, September 14, 2018 10:47 AM
>
> Hi,
>
> On 09/13/2018 01:54 AM, Jean-Philippe Brucker wrote:
> > On 12/09/2018 03:42, Lu Baolu wrote:
> >> Hi,
> >>
> >> On 09/11/2018 12:22 AM, Jean-Philippe Brucker wrote:
> >>> Hi,
> >>>
> >>> On 30/08/2018 05:09, Lu Baolu wrote:
> >>>> Below APIs are introduced in the IOMMU glue for device drivers to
> use
> >>>> the finer granularity translation.
> >>>>
> >>>> * iommu_capable(IOMMU_CAP_AUX_DOMAIN)
> >>>> - Represents the ability for supporting multiple domains per device
> >>>> (a.k.a. finer granularity translations) of the IOMMU hardware.
> >>> iommu_capable() cannot represent hardware capabilities, we need
> >>> something else for systems with multiple IOMMUs that have different
> >>> caps. How about iommu_domain_get_attr on the device's domain
> instead?
> >> Domain is not a good choice for per iommu cap query. A domain might
> be
> >> attached to devices belonging to different iommu's.
> >>
> >> How about an API with device structure as parameter? A device always
> >> belongs to a specific iommu. This API is supposed to be used the
> >> device driver.
> > Ah right, domain attributes won't work. Your suggestion seems more
> > suitable, but maybe users can simply try to enable auxiliary domains
> > first, and conclude that the IOMMU doesn't support it if it returns an
> error
> >
>
> Some drivers might want to check whether the hardware supports
> AUX_DOMAIN during the driver probe stage, but don't want to enable
> AUX_DOMAIN at that time. One reasonable use case: the driver checks
> the AUX_DOMAIN cap during probe and exposes different sysfs nodes
> according to whether AUX_DOMAIN is supported or not; AUX_DOMAIN is
> then enabled or disabled at run time through a sysfs node. With this
> consideration, we still need an API to check the cap.
>
> How about
>
> * iommu_check_aux_domain(struct device *dev)
> - Check whether the iommu driver supports multiple domains on @dev.
>

Maybe generalize it as iommu_check_attr with aux_domain as a flag, in
case other IOMMU checks are introduced in the future, as hinted by
Jean's comment on the iommu_sva_device_init part.

Thanks
Kevin

2018-09-14 14:41:41

by Jean-Philippe Brucker

[permalink] [raw]
Subject: Re: [RFC PATCH v2 00/10] vfio/mdev: IOMMU aware mediated device

On 13/09/2018 17:55, Raj, Ashok wrote:
>> For Arm SMMU we're more interested in the PASID-granular case than the
>> RID-granular one. It doesn't necessarily require vt-d rev3 scalable
>> mode, the following example can be implemented with an SMMUv3, since it
>> only needs PASID-granular first-level translation:
>
> You are right, you can simply use the first level as IOVA for every PASID.
>
> Only issue becomes when you need to assign that to a guest, you would be required
> to shadow the 1st level. If you have a 2nd level per-pasid first level can
> be managed in guest and don't require to shadow them.

Right, for us assigning a PASID-granular mdev to a guest requires shadowing
first-level translation.

>> Another note: if for some reason you did want to allow userspace to
>> choose between first-level or second-level, you could implement the
>> VFIO_TYPE1_NESTING_IOMMU container. It acts like a VFIO_TYPE1v2_IOMMU,
>> but also sets the DOMAIN_ATTR_NESTING on the IOMMU domain. So DMA_MAP
>> ioctl on a NESTING container would populate second-level, and DMA_MAP on
>> a normal container populates first-level. But if you're always going to
>> use second-level by default, the distinction isn't necessary.
>
> Where is the nesting attribute specified? in vt-d2 it was part of context
> entry, so also meant all PASID's are nested now. In vt-d3 its part of
> PASID context.

I don't think the nesting attribute is described in detail anywhere.
The SMMU drivers use it to know if they should create first- or
second-level mappings. At the moment QEMU always uses
VFIO_TYPE1v2_IOMMU, but Eric Auger is proposing a patch that adds
VFIO_TYPE1_NESTING_IOMMU to QEMU:
https://www.mail-archive.com/[email protected]/msg559820.html

> It seems unsafe to share PASID's with different VM's since any request
> W/O PASID has only one mapping.

Which case are you talking about? It might be more confusing than
helpful, but here's my understanding of what we can assign to a guest:

| no vIOMMU | vIOMMU no PASID | vIOMMU with PASID
--------------+-------------+------------------+--------------------
VF | ok | shadow or nest | nest
mdev, SMMUv3 | ok | shadow | shadow + PV (?)
mdev, vt-d3 | ok | nest | nest + PV

The first line, assigning a PCI VF to a guest is the "basic" vfio-pci
case. Currently in QEMU it works by shadowing first-level translation.
We still have to upstream nested translation for that case. Vt-d2 didn't
support nested without PASID, vt-d3 offers RID_PASID for this. On SMMUv3
the PASID table is assigned to the guest, whereas on vt-d3 the host
manages the PASID table and individual page tables are assigned to the
guest.

Assigning an mdev (here I'm talking about the PASID-granular partition
of a VF, not the whole RID-granular VF wrapped by an mdev) could be done
by shadowing first-level translation on SMMUv3. It cannot do nested
since the VF has a single set of second-level page tables, which cannot
be used when mdevs are assigned to different VMs. Vt-d3 has one set of
second-level page tables per PASID, so it can do nested.

Since the parent device has a single PASID space, allowing the guest to
use multiple PASIDs for one mdev requires paravirtual allocation of
PASIDs (last column). Vt-d3 uses the Virtual Command Registers for that.
I assume that it is safe because the host is in charge of programming
PASIDs in the parent device, so the guest couldn't use a PASID allocated
to another mdev, but I don't know what the device's programming model
would look like. Anyway I don't think guest PASID is tackled by this
series (right?) and I don't intend to work on it for SMMUv3 (shadowing
stage-1 for vSVA seems like a bad idea...)

Does this seem accurate?

Thanks,
Jean

2018-09-14 14:43:30

by Jean-Philippe Brucker

[permalink] [raw]
Subject: Re: [RFC PATCH v2 00/10] vfio/mdev: IOMMU aware mediated device

>> This example only needs to modify first-level translation, and works
>> with SMMUv3. The kernel here could be the host, in which case
>> second-level translation is disabled in the SMMU, or it could be the
>> guest, in which case second-level mappings are created by QEMU and
>> first-level translation is managed by assigning PASID tables to the guest.
>
> the former yes applies to aux domain concept. The latter doesn't -
> you have only one second-level per device. whole PASID table managed
> by guest means you assign the whole device to guest, which is not the
> concept of aux domain here.

Right, in the latter case, the host uses a "normal" domain to assign the
whole PCI function to the guest. But the guest can still use auxiliary
domains like in my example, to sub-assign the PCI function to different
guest userspace applications.

>> So (2) would use iommu_sva_bind_device(), but (1) needs something else.
>> Aren't auxiliary domains suitable for (1)? Why limit auxiliary domain to
>> second-level or nested translation? It seems silly to use a different
>> API for first-level, since the flow in userspace and VFIO is the same as
>> your second-level case as far as MAP_DMA ioctl goes. The difference is
>> that in your case the auxiliary domain supports an additional operation
>> which binds first-level page tables. An auxiliary domain that only
>> supports first-level wouldn't support this operation, but it can still
>> implement iommu_map/unmap/etc.
>
> Thanks for correcting me on this. You are right that aux domain shouldn't
> impose such limitation on 2nd or nested only. We define aux domain
> as a normal domain (aux takes effect only when attaching to a device),
> thus it should support all capabilities possible on a normal domain.
>
> btw I'm not sure whether you look at my comment to patch 8/10. I
> explained the rationale why aux domain doesn't interfere with existing
> default domain usage, and in a quick thinking above example might
> not make difference. but need your confirm here. :-)

Yes sorry, I didn't have time to answer, will do it now

Thanks,
Jean

2018-09-14 14:46:39

by Jean-Philippe Brucker

[permalink] [raw]
Subject: Re: [RFC PATCH v2 08/10] vfio/type1: Add domain at(de)taching group helpers

On 13/09/2018 01:35, Tian, Kevin wrote:
>>> Let's consider it from the API point of view.
>>>
>>> We have iommu_a(de)ttach_device() APIs to attach or detach a domain
>>> to/from a device. We should avoid applying a limitation of "these are
>>> only for single domain case, for multiple domains, use another API".
>>
>> That's an idealized view of the API, the actual semantics aren't as
>> simple. For IOMMU drivers that implement IOMMU_DOMAIN_DMA in their
>> domain_alloc operation (Arm SMMU, AMD IOMMU, ...), attach_dev attaches
>> an unmanaged domain, detach_dev reattaches the default DMA domain, and
>> the detach_dev IOMMU op is not called. We can't change that now, so it's
>> difficult to add more functionality to attach_dev and detach_dev.
>>
>
> Now we have four possible usages on a(de)ttach_device:
>
> 1) Normal DMA API path i.e. IOMMU_DOMAIN_DMA, for DMA operations
> in host/parent device driver;
>
> 2) VFIO passthru path, when the physical device is assigned to user space
> or guest driver
>
> 3) mdev passthru path 1, when mdev is assigned to user space or guest
> driver. Here mdev is a wrap on random PCI device
>
> 4) mdev passthru path 2, when mdev is assigned to user space or guest
> driver. Here mdev is in a smaller granularity (e.g. tagged by PASID) as
> supported by vt-d scalable mode
>
> 1) and 2) are existing usages. What you described (unmanaged vs. default)
> falls into this category.
>
> 3) is actually same as 2) in IOMMU layer. sort of passing through DMA
> capability to guest. Though there is still a parent driver, the parent driver
> itself should not do DMA - i.e. unmanaged in host side.
>
> 4) is a new code path introduced in IOMMU layer, which is what
> vfio_a(de)tach_aux_domain is trying to address. In this case parent device
> can still have its own DMA capability, using existing code path 1) (thus
> default domain still applies). and the parent device can have multiple
> aux domains (and associated structures), using code path 4).

We still have the problem that detach() doesn't propagate to the IOMMU
driver. Here's the current flow, without mdev:

1) At boot, the PF's (parent device) driver is probed, and the IOMMU
core sets up a default DMA domain, to be used by dma_alloc by the PF's
driver.
-> iommu.c calls default_domain->ops->attach_dev(default_domain, dev)

2) or 3) Later userspace wants to control the PF, replaces the PF's
driver with vfio-pci. When userspace creates a container, VFIO allocates
a new IOMMU domain, and calls iommu_attach_group.
-> iommu.c calls domain->ops->attach_dev(domain, dev)
This detaches the PF from the default domain, and attaches it to the new
domain.

1) When the container is closed, VFIO calls iommu_detach_group. This
detaches the PF from its current domain, and attaches it back to the
default domain.
-> iommu.c calls default_domain->ops->attach_dev(default_domain, dev)

-----
Now with mdev, we still attach the DMA domain in 1). Then:

4) Userspace opens an mdev and creates a container. VFIO enables aux
domain for the device. VFIO allocates a new IOMMU domain, and calls
iommu_attach_device(domain1, parent_dev).
-> iommu.c calls domain->ops->attach_dev(domain1, dev)
Because the device is in "aux domain" state, the IOMMU driver does not
detach from the default domain, but instead allocates a PASID and
attaches the aux domain. (Side note: for SMMU we couldn't detach from
the default domain, because we need it for MSI mappings.)

4) Userspace opens another mdev.
-> iommu.c calls domain->ops->attach_dev(domain2, dev)

1)? When the container is closed, VFIO calls
iommu_detach_device(domain2, parent_dev)
-> iommu.c calls default_domain->ops->attach_dev(default_domain, dev)
Given that auxiliary domains are attached, the IOMMU driver could deduce
that this actually means "detach an auxiliary domain". But which one?

So the proposed interface doesn't seem to work as is. If we want to use
iommu_attach/detach_device for auxiliary domains, the existing behavior
of iommu.c, and IOMMU drivers that rely on it, have to change. Any
change I can think of right now seems more daunting than introducing a
different path for auxiliary domains, like iommu_attach_aux_domain for
example.

Thanks,
Jean

2018-09-14 21:06:39

by Jacob Pan

[permalink] [raw]
Subject: Re: [RFC PATCH v2 00/10] vfio/mdev: IOMMU aware mediated device

On Thu, 13 Sep 2018 16:03:01 +0100
Jean-Philippe Brucker <[email protected]> wrote:

> On 13/09/2018 01:19, Tian, Kevin wrote:
> >>> This is proposed for architectures which support finer granularity
> >>> second level translation with no impact on architectures which
> >>> only support Source ID or the similar granularity.
> >>
> >> Just to be clear, in this paragraph you're only referring to the
> >> Nested/second-level translation for mdev, which is specific to vt-d
> >> rev3? Other architectures can still do first-level translation with
> >> PASID, to support some use-cases of IOMMU aware mediated device
> >> (assigning mdevs to userspace drivers, for example)
> >
> > yes. aux domain concept applies only to vt-d rev3 which introduces
> > scalable mode. Care is taken to avoid breaking usages on existing
> > architectures.
> >
> > one note. Assigning mdevs to user space alone doesn't imply IOMMU
> > aware. All existing mdev usages use software or proprietary methods
> > to isolate DMA. There is only one potential IOMMU aware mdev usage
> > which we discussed that does not rely on vt-d rev3 scalable mode:
> > wrapping a random PCI device into a single mdev instance (no sharing).
> > In that case the mdev inherits the RID from the parent PCI device, and
> > is thus isolated by the IOMMU at RID granularity. Our RFC supports this
> > usage too. In VFIO the two usages (PASID-based and RID-based) use the
> > same code path, i.e. always
> > binding domain to the parent device of mdev. But within IOMMU they
> > go different paths. PASID-based will go to aux-domain as
> > iommu_enable_aux_domain has been called on that device. RID-based
> > will follow existing unmanaged domain path, as if it is parent
> > device assignment.
>
> For Arm SMMU we're more interested in the PASID-granular case than the
> RID-granular one. It doesn't necessarily require vt-d rev3 scalable
> mode, the following example can be implemented with an SMMUv3, since
> it only needs PASID-granular first-level translation:
>
> We have a PCI function that supports PASID, and can be partitioned
> into multiple isolated entities, mdevs. Each mdev has an MMIO frame,
> an MSI vector and a PASID.
>
> Different processes (userspace drivers, not QEMU) each open one mdev.
> A process controlling one mdev has two ways of doing DMA:
>
> (1) Classically, the process uses a VFIO_TYPE1v2_IOMMU container. This
> creates an auxiliary domain for the mdev, with PASID #35. The process
> creates DMA mappings with VFIO_IOMMU_MAP_DMA. VFIO calls iommu_map on
> the auxiliary domain. The IOMMU driver populates the pgtables
> associated with PASID #35.
>
> (2) SVA. One way of doing it: the process uses a new
> "VFIO_TYPE1_SVA_IOMMU" type of container. VFIO binds the process
> address space to the device, gets PASID #35. Simpler, but not
> everyone wants to use SVA, especially not userspace drivers which
> need the highest performance.
>
>
> This example only needs to modify first-level translation, and works
> with SMMUv3. The kernel here could be the host, in which case
> second-level translation is disabled in the SMMU, or it could be the
> guest, in which case second-level mappings are created by QEMU and
> first-level translation is managed by assigning PASID tables to the
> guest.
There is a difference in case of guest SVA. VT-d v3 will bind guest
PASID and guest CR3 instead of the guest PASID table. Then turn on
nesting. In case of mdev, the second level is obtained from the aux
domain which was setup for the default PASID. Or in case of PCI device,
second level is harvested from RID2PASID.

> So (2) would use iommu_sva_bind_device(),
We would need something different than that for guest bind, just to show
the two cases:

int iommu_sva_bind_device(struct device *dev, struct mm_struct *mm,
                          int *pasid, unsigned long flags, void *drvdata)

(WIP)
int sva_bind_gpasid(struct device *dev, struct gpasid_bind_data *data)
where:
/**
 * struct gpasid_bind_data - Information about device and guest PASID binding
 * @pasid:      Process address space ID used for the guest mm
 * @addr_width: Guest address width. Paging mode can also be derived.
 * @gcr3:       Guest CR3 value from guest mm
 */
struct gpasid_bind_data {
        __u32 pasid;
        __u64 gcr3;
        __u32 addr_width;
        __u32 flags;
#define IOMMU_SVA_GPASID_SRE    BIT(0) /* supervisor request */
};
Perhaps there is room to merge with io_mm but the life cycle management
of guest PASID and host PASID will be different if you rely on mm
release callback than FD.

> but (1) needs something
> else. Aren't auxiliary domains suitable for (1)? Why limit auxiliary
> domain to second-level or nested translation? It seems silly to use a
> different API for first-level, since the flow in userspace and VFIO
> is the same as your second-level case as far as MAP_DMA ioctl goes.
> The difference is that in your case the auxiliary domain supports an
> additional operation which binds first-level page tables. An
> auxiliary domain that only supports first-level wouldn't support this
> operation, but it can still implement iommu_map/unmap/etc.
>
I think the intention is that when an mdev is created, we don't
know whether it will be used for SVA or IOVA. So aux domain is here to
"hold a spot" for the default PASID such that MAP_DMA calls can work as
usual, which is second level only. Later, if SVA is used on the mdev
there will be another PASID allocated for that purpose.
Do we need to create an aux domain for each PASID? the translation can
be looked up by the combination of parent dev and pasid.

>
> Another note: if for some reason you did want to allow userspace to
> choose between first-level or second-level, you could implement the
> VFIO_TYPE1_NESTING_IOMMU container. It acts like a VFIO_TYPE1v2_IOMMU,
> but also sets the DOMAIN_ATTR_NESTING on the IOMMU domain. So DMA_MAP
> ioctl on a NESTING container would populate second-level, and DMA_MAP
> on a normal container populates first-level. But if you're always
> going to use second-level by default, the distinction isn't necessary.
>
In case of guest SVA, the second level is always there.
>
> >> Sounds good, I'll drop the private PASID patch if we can figure
> >> out a solution to the attach/detach_dev problem discussed on patch
> >> 8/10
> >
> > Can you elaborate a bit on private PASID usage? what is the
> > high level flow on it?
> >
> > Again based on earlier explanation, aux domain is specific to IOMMU
> > architecture supporting vtd scalable mode-like capability, which
> > allows separate 2nd/1st level translations per PASID. Need a better
> > understanding how private PASID is relevant here.
>
> Private PASIDs are used for doing iommu_map/iommu_unmap on PASIDs
> (first-level translation):
> https://www.spinics.net/lists/dri-devel/msg177003.html As above, some
> people don't want SVA, some can't do it, some may even want a few
> private address spaces just for their kernel driver. They need a way
> to allocate PASIDs and do iommu_map/iommu_unmap on them, without
> binding to a process. I was planning to add the private PASID patch
> to my SVA series, but in my opinion the feature overlaps with
> auxiliary domains.
>
> Thanks,
> Jean
> _______________________________________________
> iommu mailing list
> [email protected]
> https://lists.linuxfoundation.org/mailman/listinfo/iommu

[Jacob Pan]

2018-09-15 02:37:31

by Tian, Kevin

[permalink] [raw]
Subject: RE: [RFC PATCH v2 08/10] vfio/type1: Add domain at(de)taching group helpers

> From: Jean-Philippe Brucker
> Sent: Friday, September 14, 2018 10:46 PM
>
> On 13/09/2018 01:35, Tian, Kevin wrote:
> >>> Let's consider it from the API point of view.
> >>>
> >>> We have iommu_a(de)ttach_device() APIs to attach or detach a domain
> >>> to/from a device. We should avoid applying a limitation of "these are
> >>> only for single domain case, for multiple domains, use another API".
> >>
> >> That's an idealized view of the API, the actual semantics aren't as
> >> simple. For IOMMU drivers that implement IOMMU_DOMAIN_DMA in their
> >> domain_alloc operation (Arm SMMU, AMD IOMMU, ...), attach_dev attaches
> >> an unmanaged domain, detach_dev reattaches the default DMA domain, and
> >> the detach_dev IOMMU op is not called. We can't change that now, so it's
> >> difficult to add more functionality to attach_dev and detach_dev.
> >>
> >
> > Now we have four possible usages on a(de)ttach_device:
> >
> > 1) Normal DMA API path i.e. IOMMU_DOMAIN_DMA, for DMA operations
> > in host/parent device driver;
> >
> > 2) VFIO passthru path, when the physical device is assigned to user space
> > or guest driver
> >
> > 3) mdev passthru path 1, when mdev is assigned to user space or guest
> > driver. Here mdev is a wrap on random PCI device
> >
> > 4) mdev passthru path 2, when mdev is assigned to user space or guest
> > driver. Here mdev is in a smaller granularity (e.g. tagged by PASID) as
> > supported by vt-d scalable mode
> >
> > 1) and 2) are existing usages. What you described (unmanaged vs. default)
> > falls into this category.
> >
> > 3) is actually same as 2) in IOMMU layer. sort of passing through DMA
> > capability to guest. Though there is still a parent driver, the parent driver
> > itself should not do DMA - i.e. unmanaged in host side.
> >
> > 4) is a new code path introduced in IOMMU layer, which is what
> > vfio_a(de)tach_aux_domain is trying to address. In this case parent device
> > can still have its own DMA capability, using existing code path 1) (thus
> > default domain still applies). and the parent device can have multiple
> > aux domains (and associated structures), using code path 4).
>
> We still have the problem that detach() doesn't propagate to the IOMMU
> driver. Here's the current flow, without mdev:
>
> 1) At boot, the PF's (parent device) driver is probed, and the IOMMU
> core sets up a default DMA domain, to be used by dma_alloc by the PF's
> driver.
> -> iommu.c calls default_domain->ops->attach_dev(default_domain, dev)
>
> 2) or 3) Later userspace wants to control the PF, replaces the PF's
> driver with vfio-pci. When userspace creates a container, VFIO allocates
> a new IOMMU domain, and calls iommu_attach_group.
> -> iommu.c calls domain->ops->attach_dev(domain, dev)
> This detaches the PF from the default domain, and attaches it to the new
> domain.
>
> 1) When the container is closed, VFIO calls iommu_detach_group. This
> detaches the PF from its current domain, and attaches it back to the
> default domain.
> -> iommu.c calls default_domain->ops->attach_dev(default_domain, dev)
>
> -----
> Now with mdev, we still attach the DMA domain in 1). Then:
>
> 4) Userspace opens an mdev and creates a container. VFIO enables aux
> domain for the device. VFIO allocates a new IOMMU domain, and calls
> iommu_attach_device(domain1, parent_dev).
> -> iommu.c calls domain->ops->attach_dev(domain1, dev)
> Because the device is in "aux domain" state, the IOMMU driver does not
> detach from the default domain, but instead allocates a PASID and
> attaches the aux domain. (Side note: for SMMU we couldn't detach from
> the default domain, because we need it for MSI mappings.)

same for vt-d. We don't require the parent driver to detach its domain in 1).
The parent driver can have its own DMA capability when aux domain is
enabled in parallel.

>
> 4) Userspace opens another mdev.
> -> iommu.c calls domain->ops->attach_dev(domain2, dev)

another mdev in same VFIO container or different? I assume the
latter since you mentioned a new domain2.

>
> 1)? When the container is closed, VFIO calls
> iommu_detach_device(domain2, parent_dev)
> -> iommu.c calls default_domain->ops->attach_dev(default_domain, dev)
> Given that auxiliary domains are attached, the IOMMU driver could deduce
> that this actually means "detach an auxiliary domain". But which one?

I didn't get this one. There is no need to stick to 1) behavior for
4), i.e. below is expected:
domain2->ops->detach_dev(domain2, dev)

why cannot ARM implement a detach_dev for aux_domain too? My
feeling is that default domain twist is only for switch between 1/2/3
in concept.

>
> So the proposed interface doesn't seem to work as is. If we want to use
> iommu_attach/detach_device for auxiliary domains, the existing behavior
> of iommu.c, and IOMMU drivers that rely on it, have to change. Any
> change I can think of right now seems more daunting than introducing a
> different path for auxiliary domains, like iommu_attach_aux_domain for
> example.
>

introducing an *aux*-specific API will cause different VFIO code paths to
handle RID-based and PASID-based mdev, since RID-based still needs to use
the normal attach_domain that way. Well, this argument is not very strong
in itself, if the proposed way indeed doesn't work for ARM. But let's see
whether that is the case with more understanding of your actual concern.

Thanks
Kevin

2018-09-18 15:47:23

by Jean-Philippe Brucker

[permalink] [raw]
Subject: Re: [RFC PATCH v2 00/10] vfio/mdev: IOMMU aware mediated device

On 14/09/2018 22:04, Jacob Pan wrote:
>> This example only needs to modify first-level translation, and works
>> with SMMUv3. The kernel here could be the host, in which case
>> second-level translation is disabled in the SMMU, or it could be the
>> guest, in which case second-level mappings are created by QEMU and
>> first-level translation is managed by assigning PASID tables to the
>> guest.
> There is a difference in case of guest SVA. VT-d v3 will bind guest
> PASID and guest CR3 instead of the guest PASID table. Then turn on
> nesting. In case of mdev, the second level is obtained from the aux
> domain which was setup for the default PASID. Or in case of PCI device,
> second level is harvested from RID2PASID.

Right, though I wasn't talking about the host managing guest SVA here,
but a kernel binding the address space of one of its userspace drivers
to the mdev.

>> So (2) would use iommu_sva_bind_device(),
> We would need something different than that for guest bind, just to show
> the two cases:>
> int iommu_sva_bind_device(struct device *dev, struct mm_struct *mm, int
> *pasid, unsigned long flags, void *drvdata)
>
> (WIP)
> int sva_bind_gpasid(struct device *dev, struct gpasid_bind_data *data)
> where:
> /**
>  * struct gpasid_bind_data - Information about device and guest PASID
> binding
>  * @pasid:      Process address space ID used for the guest mm
>  * @addr_width: Guest address width. Paging mode can also be derived.
>  * @gcr3:       Guest CR3 value from guest mm
>  */
> struct gpasid_bind_data {
>         __u32 pasid;
>         __u64 gcr3;
>         __u32 addr_width;
>         __u32 flags;
> #define IOMMU_SVA_GPASID_SRE    BIT(0) /* supervisor request */
> };
> Perhaps there is room to merge with io_mm but the life cycle management
> of guest PASID and host PASID will be different if you rely on mm
> release callback than FD.

I think gpasid management should stay separate from io_mm, since in your
case VFIO mechanisms are used for life cycle management of the VM,
similarly to the former bind_pasid_table proposal. For example closing
the container fd would unbind all guest page tables. The QEMU process'
address space lifetime seems like the wrong thing to track for gpasid.

>> but (1) needs something
>> else. Aren't auxiliary domains suitable for (1)? Why limit auxiliary
>> domain to second-level or nested translation? It seems silly to use a
>> different API for first-level, since the flow in userspace and VFIO
>> is the same as your second-level case as far as MAP_DMA ioctl goes.
>> The difference is that in your case the auxiliary domain supports an
>> additional operation which binds first-level page tables. An
>> auxiliary domain that only supports first-level wouldn't support this
>> operation, but it can still implement iommu_map/unmap/etc.
>>
> I think the intention is that when an mdev is created, we don't
> know whether it will be used for SVA or IOVA. So aux domain is here to
> "hold a spot" for the default PASID such that MAP_DMA calls can work as
> usual, which is second level only. Later, if SVA is used on the mdev
> there will be another PASID allocated for that purpose.
> Do we need to create an aux domain for each PASID? the translation can
> be looked up by the combination of parent dev and pasid.

When allocating a new PASID for the guest, I suppose you need to clone
the second-level translation config? In which case a single aux domain
for the mdev might be easier to implement in the IOMMU driver. Entirely
up to you since we don't have this case on SMMUv3

Thanks,
Jean

2018-09-18 15:55:15

by Jean-Philippe Brucker

[permalink] [raw]
Subject: Re: [RFC PATCH v2 08/10] vfio/type1: Add domain at(de)taching group helpers

On 15/09/2018 03:36, Tian, Kevin wrote:
>> 4) Userspace opens another mdev.
>> -> iommu.c calls domain->ops->attach_dev(domain2, dev)
>
> another mdev in same VFIO container or different? I assume the
> latter since you mentioned a new domain2.

I was thinking a different VFIO container actually. I used domain2 to
try to make the example clearer

>> 1)? When the container is closed, VFIO calls
>> iommu_detach_device(domain2, parent_dev)
>> -> iommu.c calls default_domain->ops->attach_dev(default_domain, dev)
>> Given that auxiliary domains are attached, the IOMMU driver could deduce
>> that this actually means "detach an auxiliary domain". But which one?
>
> I didn't get this one. There is no need to stick to 1) behavior for
> 4), i.e. below is expected:
>         domain2->ops->detach_dev(domain2, dev)

But in order to get that, the IOMMU core needs to know that domain2 is
auxiliary. Otherwise, detach_dev is never called when a default_domain
is present for the parent dev.

I guess one solution is to add an "auxiliary" attribute to iommu_domain,
so __iommu_detach_group would do something like:

diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index 7113fe398b70..2b3e9b91aee7 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -1786,10 +1786,11 @@ static void __iommu_detach_group(struct iommu_domain *domain,
 {
 	int ret;
 
-	if (!group->default_domain) {
+	if (!group->default_domain || domain->auxiliary) {
 		__iommu_group_for_each_dev(group, domain,
 					   iommu_group_do_detach_device);
-		group->domain = NULL;
+		if (!domain->auxiliary)
+			group->domain = NULL;
 		return;
 	}

Not sure who would set this "auxiliary" attribute... Maybe the IOMMU
driver, when attaching the domain to a device that has auxiliary mode
enabled?

> why cannot ARM implement a detach_dev for aux_domain too? My
> feeling is that default domain twist is only for switch between 1/2/3
> in concept.

If the core actually calls it, we can implement detach_dev :) The
problem is that the core never calls detach_dev when default_domain is
present (affects any IOMMU driver that relies on default_domain,
including AMD), and even in case 4) the default_domain is present for
the parent device

>> So the proposed interface doesn't seem to work as is. If we want to use
>> iommu_attach/detach_device for auxiliary domains, the existing behavior
>> of iommu.c, and IOMMU drivers that rely on it, have to change. Any
>> change I can think of right now seems more daunting than introducing a
>> different path for auxiliary domains, like iommu_attach_aux_domain for
>> example.
>>
>
> introducing *aux* specific API will cause different VFIO code path to
> handle RID-based and PASID-based mdev, since RID-based still needs
> to use normal attach_domain that way.

The PASID-based mdev still requires a special case to retrieve the
allocated PASID and program it in the parent device, so VFIO will need
to know the difference between the two

Thanks,
Jean

> well, this argument is not very strong
> in itself, if indeed proposed way doesn't work for ARM. But let's see
> whether it is the case with more understanding of your actual concern.
>
> Thanks
> Kevin


2018-09-19 02:12:13

by Lu Baolu

[permalink] [raw]
Subject: Re: [RFC PATCH v2 08/10] vfio/type1: Add domain at(de)taching group helpers

Hi,

On 09/19/2018 07:26 AM, Tian, Kevin wrote:
>> From: Jean-Philippe Brucker [mailto:[email protected]]
>> Sent: Tuesday, September 18, 2018 11:52 PM
>>
>> On 15/09/2018 03:36, Tian, Kevin wrote:
>>>> 4) Userspace opens another mdev.
>>>> -> iommu.c calls domain->ops->attach_dev(domain2, dev)
>>>
>>> another mdev in same VFIO container or different? I assume the
>>> latter since you mentioned a new domain2.
>>
>> I was thinking a different VFIO container actually. I used domain2 to
>> try to make the example clearer
>>
>>>> 1)? When the container is closed, VFIO calls
>>>> iommu_detach_device(domain2, parent_dev)
>>>> -> iommu.c calls default_domain->ops->attach_dev(default_domain, dev)
>>>> Given that auxiliary domains are attached, the IOMMU driver could deduce
>>>> that this actually means "detach an auxiliary domain". But which one?
>>>
>>> I didn't get this one. There is no need to stick to 1) behavior for
>>> 4), i.e. below is expected:
>>>         domain2->ops->detach_dev(domain2, dev)
>>
>> But in order to get that, the IOMMU core needs to know that domain2 is
>> auxiliary. Otherwise, detach_dev is never called when a default_domain
>> is present for the parent dev.
>>
>> I guess one solution is to add an "auxiliary" attribute to iommu_domain,
>> so __iommu_detach_group would do something like:
>
> this doesn't work. The same domain can also be attached to another physical
> device as a non-aux domain (e.g. passthrough) at the same time (vfio-pci
> device and vfio-mdev device in the same container), and then the default
> domain tweak is required in that case. "aux" takes effect only per-device,
> not per-domain.

If we have the below APIs for aux domains (the API names are just for
discussion purposes, subject to change):

iommu_query_aux_domain_capability(dev)
iommu_enable_aux_domain(dev)
iommu_disable_aux_domain(dev)
iommu_check_aux_domain_status(dev)

then we could do it like below:

diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index ab3d7d3b1583..3bfb652c78e8 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -1469,12 +1469,31 @@ static int iommu_group_do_detach_device(struct device *dev, void *data)
 	return 0;
 }
 
+static int iommu_group_check_aux_domain(struct device *dev, void *data)
+{
+	const struct iommu_ops *ops = dev->bus->iommu_ops;
+
+	if (ops && ops->check_auxd)
+		return !ops->check_auxd(dev);
+
+	return -EINVAL;
+}
+
+/*
+ * Check whether devices in @group have aux domain enabled.
+ */
+static int iommu_group_aux_domain_enabled(struct iommu_group *group)
+{
+	return __iommu_group_for_each_dev(group, NULL,
+					  iommu_group_check_aux_domain);
+}
+
 static void __iommu_detach_group(struct iommu_domain *domain,
 				 struct iommu_group *group)
 {
 	int ret;
 
-	if (!group->default_domain) {
+	if (!group->default_domain ||
+	    iommu_group_aux_domain_enabled(group)) {
 		__iommu_group_for_each_dev(group, domain,
 					   iommu_group_do_detach_device);
 		group->domain = NULL;

Best regards,
Lu Baolu

>
>>
>> diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
>> index 7113fe398b70..2b3e9b91aee7 100644
>> --- a/drivers/iommu/iommu.c
>> +++ b/drivers/iommu/iommu.c
>> @@ -1786,10 +1786,11 @@ static void __iommu_detach_group(struct iommu_domain *domain,
>>  {
>>  	int ret;
>>  
>> -	if (!group->default_domain) {
>> +	if (!group->default_domain || domain->auxiliary) {
>>  		__iommu_group_for_each_dev(group, domain,
>>  					   iommu_group_do_detach_device);
>> -		group->domain = NULL;
>> +		if (!domain->auxiliary)
>> +			group->domain = NULL;
>>  		return;
>>  	}
>>
>> Not sure who would set this "auxiliary" attribute... Maybe the IOMMU
>> driver, when attaching the domain to a device that has auxiliary mode
>> enabled?
>>
>>> why cannot ARM implement a detach_dev for aux_domain too? My
>>> feeling is that default domain twist is only for switch between 1/2/3
>>> in concept.
>>
>> If the core actually calls it, we can implement detach_dev :) The
>> problem is that the core never calls detach_dev when default_domain is
>> present (affects any IOMMU driver that relies on default_domain,
>> including AMD), and even in case 4) the default_domain is present for
>> the parent device
>
> Then can we change that core logic so detach_dev is invoked in all
> cases? yes there will be some changes in vendor drivers, but I expect
> this change trivial (especially considering the gain in IOMMU API
> simplicity side as described below).
>
>>
>>>> So the proposed interface doesn't seem to work as is. If we want to use
>>>> iommu_attach/detach_device for auxiliary domains, the existing
>> behavior
>>>> of iommu.c, and IOMMU drivers that rely on it, have to change. Any
>>>> change I can think of right now seems more daunting than introducing a
>>>> different path for auxiliary domains, like iommu_attach_aux_domain for
>>>> example.
>>>>
>>>
>>> introducing *aux* specific API will cause different VFIO code path to
>>> handle RID-based and PASID-based mdev, since RID-based still needs
>>> to use normal attach_domain that way.
>>
>> The PASID-based mdev still requires a special case to retrieve the
>> allocated PASID and program it in the parent device, so VFIO will need
>> to know the difference between the two
>>
>
> that retrieve/program is done by the parent driver, instead of VFIO.
>
> Thanks
> Kevin
>

2018-09-19 02:22:34

by Tian, Kevin

Subject: RE: [RFC PATCH v2 00/10] vfio/mdev: IOMMU aware mediated device

> From: Jean-Philippe Brucker [mailto:[email protected]]
> Sent: Tuesday, September 18, 2018 11:47 PM
>
> On 14/09/2018 22:04, Jacob Pan wrote:
> >> This example only needs to modify first-level translation, and works
> >> with SMMUv3. The kernel here could be the host, in which case
> >> second-level translation is disabled in the SMMU, or it could be the
> >> guest, in which case second-level mappings are created by QEMU and
> >> first-level translation is managed by assigning PASID tables to the
> >> guest.
> > There is a difference in case of guest SVA. VT-d v3 will bind guest
> > PASID and guest CR3 instead of the guest PASID table. Then turn on
> > nesting. In case of mdev, the second level is obtained from the aux
> > domain which was setup for the default PASID. Or in case of PCI device,
> > second level is harvested from RID2PASID.
>
> Right, though I wasn't talking about the host managing guest SVA here,
> but a kernel binding the address space of one of its userspace drivers
> to the mdev.
>
> >> So (2) would use iommu_sva_bind_device(),
> > We would need something different than that for guest bind, just to
> > show the two cases:
> >
> > int iommu_sva_bind_device(struct device *dev, struct mm_struct *mm,
> >                           int *pasid, unsigned long flags, void *drvdata)
> >
> > (WIP)
> > int sva_bind_gpasid(struct device *dev, struct gpasid_bind_data *data)
> > where:
> > /**
> >  * struct gpasid_bind_data - Information about device and guest PASID
> >  *                           binding
> >  * @pasid:       Process address space ID used for the guest mm
> >  * @addr_width:  Guest address width. Paging mode can also be derived.
> >  * @gcr3:        Guest CR3 value from guest mm
> >  */
> > struct gpasid_bind_data {
> >         __u32 pasid;
> >         __u64 gcr3;
> >         __u32 addr_width;
> >         __u32 flags;
> > #define IOMMU_SVA_GPASID_SRE    BIT(0) /* supervisor request */
> > };
> > Perhaps there is room to merge with io_mm but the life cycle
> > management of guest PASID and host PASID will be different if you
> > rely on mm release callback than FD.

let's not call it gpasid here - that makes sense only in the
bind_pasid_table proposal, where the pasid table and thus the pasid
space are managed by the guest. In the above context it is always a
host pasid (allocated system-wide), which could point to a host cr3
(user process) or a guest cr3 (vm case).

>
> I think gpasid management should stay separate from io_mm, since in your
> case VFIO mechanisms are used for life cycle management of the VM,
> similarly to the former bind_pasid_table proposal. For example closing
> the container fd would unbind all guest page tables. The QEMU process'
> address space lifetime seems like the wrong thing to track for gpasid.

I sort of agree (though not thinking through all the flow carefully). PASIDs
are allocated per iommu domain, thus release also happens when domain
is detached (along with container fd close).

>
> >> but (1) needs something
> >> else. Aren't auxiliary domains suitable for (1)? Why limit auxiliary
> >> domain to second-level or nested translation? It seems silly to use a
> >> different API for first-level, since the flow in userspace and VFIO
> >> is the same as your second-level case as far as MAP_DMA ioctl goes.
> >> The difference is that in your case the auxiliary domain supports an
> >> additional operation which binds first-level page tables. An
> >> auxiliary domain that only supports first-level wouldn't support this
> >> operation, but it can still implement iommu_map/unmap/etc.
> >>
> > > I think the intention is that when a mdev is created, we don't
> > know whether it will be used for SVA or IOVA. So aux domain is here to
> > "hold a spot" for the default PASID such that MAP_DMA calls can work as
> > usual, which is second level only. Later, if SVA is used on the mdev
> > there will be another PASID allocated for that purpose.
> > Do we need to create an aux domain for each PASID? the translation can
> > be looked up by the combination of parent dev and pasid.
>
> When allocating a new PASID for the guest, I suppose you need to clone
> the second-level translation config? In which case a single aux domain
> for the mdev might be easier to implement in the IOMMU driver. Entirely
> up to you since we don't have this case on SMMUv3
>

One thing to highlight in related discussions (also mentioned in the
other thread): there is not a new iommu domain type called 'aux'. 'aux'
matters only to a specific device, when a domain is attached to that
device with aux capability enabled. The same domain can be attached to
another device as a normal domain. In that case multiple PASIDs
allocated on the same mdev are tied to the same aux domain, the same as
the bare-metal SVA case, i.e. any domain (normal or aux) can include a
2nd-level structure and multiple 1st-level structures. Jean is correct -
all PASIDs in the same domain then share the 2nd-level translation, and
there are io_mm or similar tracking structures to associate each PASID
with a 1st-level translation structure.

Thanks
Kevin

2018-09-20 15:52:40

by Jacob Pan

Subject: Re: [RFC PATCH v2 00/10] vfio/mdev: IOMMU aware mediated device

On Wed, 19 Sep 2018 02:22:03 +0000
"Tian, Kevin" <[email protected]> wrote:

> > From: Jean-Philippe Brucker [mailto:[email protected]]
> > Sent: Tuesday, September 18, 2018 11:47 PM
> >
> > On 14/09/2018 22:04, Jacob Pan wrote:
> > >> This example only needs to modify first-level translation, and
> > >> works with SMMUv3. The kernel here could be the host, in which
> > >> case second-level translation is disabled in the SMMU, or it
> > >> could be the guest, in which case second-level mappings are
> > >> created by QEMU and first-level translation is managed by
> > >> assigning PASID tables to the guest.
> > > There is a difference in case of guest SVA. VT-d v3 will bind
> > > guest PASID and guest CR3 instead of the guest PASID table. Then
> > > turn on nesting. In case of mdev, the second level is obtained
> > > from the aux domain which was setup for the default PASID. Or in
> > > case of PCI device, second level is harvested from RID2PASID.
> >
> > Right, though I wasn't talking about the host managing guest SVA
> > here, but a kernel binding the address space of one of its
> > userspace drivers to the mdev.
> >
> > >> So (2) would use iommu_sva_bind_device(),
> > > We would need something different than that for guest bind, just
> > > to show the two cases:
> > >
> > > int iommu_sva_bind_device(struct device *dev, struct mm_struct *mm,
> > >                           int *pasid, unsigned long flags, void *drvdata)
> > >
> > > (WIP)
> > > int sva_bind_gpasid(struct device *dev, struct gpasid_bind_data *data)
> > > where:
> > > /**
> > >  * struct gpasid_bind_data - Information about device and guest
> > > PASID binding
> > >  * @pasid:       Process address space ID used for the guest mm
> > >  * @addr_width:  Guest address width. Paging mode can also be
> > > derived.
> > >  * @gcr3:        Guest CR3 value from guest mm
> > >  */
> > > struct gpasid_bind_data {
> > >         __u32 pasid;
> > >         __u64 gcr3;
> > >         __u32 addr_width;
> > >         __u32 flags;
> > > #define IOMMU_SVA_GPASID_SRE    BIT(0) /* supervisor request */
> > > };
> > > Perhaps there is room to merge with io_mm but the life cycle
> > > management of guest PASID and host PASID will be different if you
> > > rely on mm release callback than FD.
>
> let's not call it gpasid here - that makes sense only in the
> bind_pasid_table proposal, where the pasid table and thus the pasid
> space are managed by the guest. In the above context it is always a
> host pasid (allocated system-wide), which could point to a host cr3
> (user process) or a guest cr3 (vm case).
>
I agree this gpasid is confusing; we have a system-wide PASID
name space. We just need a way to differentiate the different binds,
perhaps just a flag indicating the PASID is used by a guest, i.e.:
struct pasid_bind_data {
        __u32 pasid;
        __u64 gcr3;
        __u32 addr_width;
        __u32 flags;
#define IOMMU_SVA_GPASID_SRE    BIT(0) /* supervisor request */
#define IOMMU_SVA_PASID_GUEST   BIT(1) /* host pasid used by guest */
};

> > I think gpasid management should stay separate from io_mm, since in
> > your case VFIO mechanisms are used for life cycle management of the
> > VM, similarly to the former bind_pasid_table proposal. For example
> > closing the container fd would unbind all guest page tables. The
> > QEMU process' address space lifetime seems like the wrong thing to
> > track for gpasid.
>
> I sort of agree (though not thinking through all the flow carefully).
> PASIDs are allocated per iommu domain, thus release also happens when
> domain is detached (along with container fd close).
>
I also prefer to keep gpasid separate.

But I don't think we need a per-PASID iommu domain for the guest
SVA case, assuming you are talking about the host IOMMU domain. The
PASID bind call is the result of a guest PASID cache flush with a
previously allocated PASID. The host just needs to put the gcr3 into
the PASID entry and then harvest the second level from the existing
domain.
> >
> > >> but (1) needs something
> > >> else. Aren't auxiliary domains suitable for (1)? Why limit
> > >> auxiliary domain to second-level or nested translation? It seems
> > >> silly to use a different API for first-level, since the flow in
> > >> userspace and VFIO is the same as your second-level case as far
> > >> as MAP_DMA ioctl goes. The difference is that in your case the
> > >> auxiliary domain supports an additional operation which binds
> > >> first-level page tables. An auxiliary domain that only supports
> > >> first-level wouldn't support this operation, but it can still
> > >> implement iommu_map/unmap/etc.
> > > I think the intention is that when a mdev is created, we don't
> > > know whether it will be used for SVA or IOVA. So aux domain is
> > > here to "hold a spot" for the default PASID such that MAP_DMA
> > > calls can work as usual, which is second level only. Later, if
> > > SVA is used on the mdev there will be another PASID allocated for
> > > that purpose. Do we need to create an aux domain for each PASID?
> > > the translation can be looked up by the combination of parent dev
> > > and pasid.
> >
> > When allocating a new PASID for the guest, I suppose you need to
> > clone the second-level translation config? In which case a single
> > aux domain for the mdev might be easier to implement in the IOMMU
> > driver. Entirely up to you since we don't have this case on SMMUv3
> >
>
> One thing to highlight in related discussions (also mentioned in other
> thread). There is not a new iommu domain type called 'aux'. 'aux'
> matters only to a specific device when a domain is attached to that
> device which has aux capability enabled. Same domain can be attached
> to other device as normal domain. In that case multiple PASIDs
> allocated on same mdev are tied to same aux domain, same bare metal
> SVA case, i.e. any domain (normal or aux) can include 2nd level
> structure and multiple 1st level structures. Jean is correct - all
> PASIDs in same domain then share 2nd level translation, and there are
> io_mm or similar tracking structures to associate each PASID to a 1st
> level translation structure.
>
I think we are all talking about the same thing :)
yes, the 2nd level is cloned from the aux domain/default PASID for an
mdev, and similarly from the DMA_MAP domain for a pdev.

> Thanks
> Kevin
> _______________________________________________
> iommu mailing list
> [email protected]
> https://lists.linuxfoundation.org/mailman/listinfo/iommu

[Jacob Pan]

2018-09-25 17:57:46

by Jean-Philippe Brucker

Subject: Re: [RFC PATCH v2 08/10] vfio/type1: Add domain at(de)taching group helpers

On 19/09/2018 00:26, Tian, Kevin wrote:
>> If the core actually calls it, we can implement detach_dev :) The
>> problem is that the core never calls detach_dev when default_domain is
>> present (affects any IOMMU driver that relies on default_domain,
>> including AMD), and even in case 4) the default_domain is present for
>> the parent device
>
> Then can we change that core logic so detach_dev is invoked in all
> cases? yes there will be some changes in vendor drivers, but I expect
> this change trivial (especially considering the gain in IOMMU API
> simplicity side as described below).

Thinking more about this, there might be a way that doesn't require too
much rewriting of IOMMU drivers. When VFIO calls iommu_detach_group to
detach a vfio-pci device from its domain, it could do detach_dev(dev,
old_domain) followed by attach_dev(dev, default_domain) instead of only
attach_dev(dev, default_domain), so:

__iommu_attach_group(grp, unmanaged_domain)
    ops->attach_dev(dev, unmanaged_domain)
      -> IOMMU driver detaches from default_domain only if the dev
         isn't in auxiliary mode

__iommu_detach_group(grp, unmanaged_domain)
    if (ops->detach_dev)
        ops->detach_dev(dev, unmanaged_domain)
    ops->attach_dev(dev, default_domain)
      -> IOMMU driver ignores this if still attached to default_domain

It's not trivial: since half the IOMMU drivers seem to be using default
domain now, their detach_dev callback might have been dead code for a
while (I can't find a path where it still gets called), so re-enabling
it requires some attention. However, since most of them won't care about
aux domains, maybe we could just remove their detach_dev pointer.

The resulting logic of attach/detach_dev in IOMMU drivers that do use
aux domains still seems unnecessarily convoluted. It could as well be
done with separate attach_aux/detach_aux functions. If you want to keep
it that way though, I think the above could work for us and I'll try to
write an RFC.

Thanks,
Jean

2018-09-26 02:15:03

by Lu Baolu

Subject: Re: [RFC PATCH v2 08/10] vfio/type1: Add domain at(de)taching group helpers

Hi,

On 09/26/2018 01:55 AM, Jean-Philippe Brucker wrote:
> On 19/09/2018 00:26, Tian, Kevin wrote:
>>> If the core actually calls it, we can implement detach_dev :) The
>>> problem is that the core never calls detach_dev when default_domain is
>>> present (affects any IOMMU driver that relies on default_domain,
>>> including AMD), and even in case 4) the default_domain is present for
>>> the parent device
>>
>> Then can we change that core logic so detach_dev is invoked in all
>> cases? yes there will be some changes in vendor drivers, but I expect
>> this change trivial (especially considering the gain in IOMMU API
>> simplicity side as described below).
>
> Thinking more about this, there might be a way that doesn't require too
> much rewriting of IOMMU drivers. When VFIO calls iommu_detach_group to
> detach a vfio-pci device from its domain, it could do detach_dev(dev,
> old_domain) followed by attach_dev(dev, default_domain) instead of only
> attach_dev(dev, default_domain), so:
>
> __iommu_attach_group(grp, unmanaged_domain)
>     ops->attach_dev(dev, unmanaged_domain)
>       -> IOMMU driver detaches from default_domain only if the dev
>          isn't in auxiliary mode
>
> __iommu_detach_group(grp, unmanaged_domain)
>     if (ops->detach_dev)
>         ops->detach_dev(dev, unmanaged_domain)
>     ops->attach_dev(dev, default_domain)
>       -> IOMMU driver ignores this if still attached to default_domain
>

This looks good to me.

Just a minor change to make it more readable.

__iommu_detach_group(grp, unmanaged_domain)
    if (ops->detach_dev)
        ops->detach_dev(dev, unmanaged_domain)
    if (default_domain)
        ops->attach_dev(dev, default_domain)

Best regards,
Lu Baolu

> It's not trivial: since half the IOMMU drivers seem to be using default
> domain now, their detach_dev callback might have been dead code for a
> while (I can't find a path where it still gets called), so re-enabling
> it requires some attention. However, since most of them won't care about
> aux domains, maybe we could just remove their detach_dev pointer.
>
> The resulting logic of attach/detach_dev in IOMMU drivers that do use
> aux domains still seems unnecessarily convoluted. It could as well be
> done with separate attach_aux/detach_aux functions. If you want to keep
> it that way though, I think the above could work for us and I'll try to
> write an RFC.
>
> Thanks,
> Jean
>