2021-07-09 07:49:41

by Tian, Kevin

Subject: [RFC v2] /dev/iommu uAPI proposal

/dev/iommu provides a unified interface for managing I/O page tables for
devices assigned to userspace. Device passthrough frameworks (VFIO, vDPA,
etc.) are expected to use this interface instead of creating their own logic to
isolate untrusted device DMAs initiated by userspace.

This proposal describes the uAPI of /dev/iommu along with sample sequences,
using VFIO as the example in typical usages. The driver-facing kernel API
provided by the iommu layer is still TBD, which can be discussed after
consensus is reached on this uAPI.

It's based on a lengthy discussion starting from here:
https://lore.kernel.org/linux-iommu/[email protected]/

v1 can be found here:
https://lore.kernel.org/linux-iommu/PH0PR12MB54811863B392C644E5365446DC3E9@PH0PR12MB5481.namprd12.prod.outlook.com/T/

This doc is also tracked on github, though the diff is not very useful for
v1->v2 given the dramatic refactoring:
https://github.com/luxis1999/dev_iommu_uapi

Changelog (v1->v2):
- Rename /dev/ioasid to /dev/iommu (Jason);
- Add a section for device-centric vs. group-centric design (many);
- Add a section for handling no-snoop DMA (Jason/Alex/Paolo);
- Add definition of user/kernel/shared I/O page tables (Baolu/Jason);
- Allow one device bound to multiple iommu fd's (Jason);
- No need to track user I/O page tables in kernel on ARM/AMD (Jean/Jason);
- Add a device cookie for iotlb invalidation and fault handling (Jean/Jason);
- Add capability/format query interface per device cookie (Jason);
- Specify format/attribute when creating an IOASID, leading to several v1
uAPI commands removed (Jason);
- Explain the value of software nesting (Jean);
- Replace IOASID_REGISTER_VIRTUAL_MEMORY with software nesting (David/Jason);
- Cover software mdev usage (Jason);
- No restriction on map/unmap vs. bind/invalidate (Jason/David);
- Report permitted IOVA range instead of reserved range (David);
- Refine the sample structures and helper functions (Jason);
- Add definition of default and non-default I/O address spaces;
- Expand and clarify the design for PASID virtualization;
- and lots of subtle refinement according to above changes;

TOC
====
1. Terminologies and Concepts
1.1. Manage I/O address space
1.2. Attach device to I/O address space
1.3. Group isolation
1.4. PASID virtualization
1.4.1. Devices which don't support DMWr
1.4.2. Devices which support DMWr
1.4.3. Mix different types together
1.4.4. User sequence
1.5. No-snoop DMA
2. uAPI Proposal
2.1. /dev/iommu uAPI
2.2. /dev/vfio device uAPI
2.3. /dev/kvm uAPI
3. Sample Structures and Helper Functions
4. Use Cases and Flows
4.1. A simple example
4.2. Multiple IOASIDs (no nesting)
4.3. IOASID nesting (software)
4.4. IOASID nesting (hardware)
4.5. Guest SVA (vSVA)
4.6. I/O page fault
====

1. Terminologies and Concepts
-----------------------------------------

IOMMU fd is the container holding multiple I/O address spaces. User
manages those address spaces through fd operations. Multiple fd's are
allowed per process, but with this proposal one fd should be sufficient for
all intended usages.

IOASID is the fd-local software handle representing an I/O address space.
Each IOASID is associated with a single I/O page table. IOASIDs can be
nested together, implying the output address from one I/O page table
(represented by child IOASID) must be further translated by another I/O
page table (represented by parent IOASID).

An I/O address space takes effect only after it is attached by a device.
One device is allowed to attach to multiple I/O address spaces. One I/O
address space can be attached by multiple devices.

A device must be bound to an IOMMU fd before the attach operation can be
conducted. Though not necessary, the user could bind one device to multiple
IOMMU fd's, but no cross-fd IOASID nesting is allowed.

The format of an I/O page table must be compatible with the attached
devices (or more specifically to the IOMMU which serves the DMA from
the attached devices). User is responsible for specifying the format
when allocating an IOASID, according to one or multiple devices which
will be attached right after. Attaching a device to an IOASID with
incompatible format is simply rejected.

Relationship between IOMMU fd, VFIO fd and KVM fd:

- IOMMU fd provides uAPI for managing IOASIDs and I/O page tables.
It also provides a unified capability/format reporting interface for
each bound device.

- VFIO fd provides uAPI for device binding and attaching. In this proposal
VFIO is used as the example of device passthrough frameworks. The
routing information that identifies an I/O address space on the wire is
per-device and registered to IOMMU fd via VFIO uAPI.

- KVM fd provides uAPI for handling no-snoop DMA and PASID virtualization
in CPU (when PASID is carried in instruction payload).

1.1. Manage I/O address space
+++++++++++++++++++++++++++++

An I/O address space can be created in three ways, according to how
the corresponding I/O page table is managed:

- kernel-managed I/O page table which is created via IOMMU fd, e.g.
for IOVA space (dpdk), GPA space (Qemu), GIOVA space (vIOMMU), etc.

- user-managed I/O page table which is created by the user, e.g. for
GIOVA/GVA space (vIOMMU), etc.

- shared kernel-managed CPU page table which is created by another
subsystem, e.g. for process VA space (mm), GPA space (kvm), etc.

The first category is managed via a dma mapping protocol (similar to
existing VFIO iommu type1), which allows the user to explicitly specify
which range in the I/O address space should be mapped.

The second category is managed via an iotlb protocol (similar to the
underlying IOMMU semantics). Once the user-managed page table is
bound to the IOMMU, the user can invoke an invalidation command
to update the kernel-side cache (either in software or in physical IOMMU).
In the meantime, a fault reporting/completion mechanism is also provided
for the user to fixup potential I/O page faults.

The last category is supposed to be managed via the subsystem which
actually owns the shared address space. Likely what's minimally required
in /dev/iommu uAPI is to build the connection with the address space
owner when allocating the IOASID, so an in-kernel interface (e.g.
mmu_notifier) is activated for any required synchronization between IOMMU fd
and the space owner.

This proposal focuses on how to manage the first two categories, as
they are existing and more urgent requirements. Support of the last
category can be discussed when a real usage comes in the future.

The user needs to specify the desired management protocol and page
table format when creating a new I/O address space. Before allocating
the IOASID, the user should already know at least one device that will be
attached to this space. It is expected to first query (via IOMMU fd) the
supported capabilities and page table format information of the to-be-
attached device (or a common set between multiple devices) and then
choose a compatible format to set on the IOASID.

I/O address spaces can be nested together, called IOASID nesting. IOASID
nesting can be implemented in two ways: hardware nesting and software
nesting. With hardware support the child and parent I/O page tables are
walked consecutively by the IOMMU to form a nested translation. When
it's implemented in software, /dev/iommu is responsible for merging the
two-level mappings into a single-level shadow I/O page table.

A user-managed I/O page table can be set up only on the child IOASID,
implying IOASID nesting must be enabled. This is because the kernel
doesn't trust userspace. Nesting allows the kernel to enforce its DMA
isolation policy through the parent IOASID.

Software nesting is useful in several scenarios. First, it allows
centralized accounting on locked pages between multiple root IOASIDs
(no parent). In this case a 'dummy' IOASID can be created with an
identity mapping (HVA->HVA), dedicated for page pinning/accounting and
nested by all root IOASIDs. Second, it's also useful for mdev drivers
(e.g. kvmgt) to write-protect guest structures when vIOMMU is enabled.
In this case the protected addresses are in GIOVA space while KVM
write-protection API is based on GPA. Software nesting allows finding
GPA according to GIOVA in the kernel.

1.2. Attach Device to I/O address space
+++++++++++++++++++++++++++++++++++++++

Device attach/bind is initiated through passthrough framework uAPI.

Device attaching is allowed only after a device is successfully bound to
the IOMMU fd. User should provide a device cookie when binding the
device through VFIO uAPI. This cookie is used when the user queries
device capability/format, issues per-device iotlb invalidation and
receives per-device I/O page fault data via IOMMU fd.

Successful binding puts the device into a security context which isolates
its DMA from the rest of the system. VFIO should not allow the user to access the
device before binding is completed. Similarly, VFIO should prevent the
user from unbinding the device before user access is withdrawn.

When a device is in an iommu group which contains multiple devices,
all devices within the group must enter/exit the security context
together. Please check {1.3} for more info about group isolation via
this device-centric design.

Successful attaching activates an I/O address space in the IOMMU,
if the device is not purely software mediated. VFIO must provide device
specific routing information for where to install the I/O page table in
the IOMMU for this device. VFIO must also guarantee that the attached
device is configured to compose DMAs with the routing information that
is provided in the attaching call. When handling DMA requests, IOMMU
identifies the target I/O address space according to the routing
information carried in the request. Misconfiguration breaks DMA
isolation and could thus lead to severe security vulnerabilities.

Routing information is per-device and bus specific. For PCI, it is
Requester ID (RID) identifying the device plus optional Process Address
Space ID (PASID). For ARM, it is Stream ID (SID) plus optional Sub-Stream
ID (SSID). PASID or SSID is used when multiple I/O address spaces are
enabled on a single device. For simplicity and continuity reasons the
following text uses RID+PASID, though SID+SSID may be clearer naming
from the device p.o.v. We can decide the actual naming when coding.

Because one I/O address space can be attached by multiple devices,
per-device routing information (plus device cookie) is tracked under
each IOASID and is used respectively when activating the I/O address
space in the IOMMU for each attached device.

The device in the /dev/iommu context always refers to a physical one
(pdev) which is identifiable via RID. Physically each pdev can support
one default I/O address space (routed via RID) and optionally multiple
non-default I/O address spaces (via RID+PASID).

The device in the VFIO context is a logical concept, being either a physical
device (pdev) or mediated device (mdev or subdev). Each vfio device
is represented by RID+cookie in IOMMU fd. User is allowed to create
one default I/O address space (routed by vRID from the user p.o.v) per
vfio_device. VFIO decides the routing information for this default
space based on device type:

1) pdev, routed via RID;

2) mdev/subdev with IOMMU-enforced DMA isolation, routed via
the parent's RID plus the PASID marking this mdev;

3) a purely sw-mediated device (sw mdev), no routing required i.e. no
need to install the I/O page table in the IOMMU. sw mdev just uses
the metadata to assist its internal DMA isolation logic on top of
the parent's IOMMU page table;

In addition, VFIO may allow the user to create additional I/O address spaces
on a vfio_device based on the hardware capability. In such a case the user
has its own view of the virtual routing information (vPASID) when marking
these non-default address spaces. How to virtualize vPASID is platform
specific and device specific. Some platforms allow the user to fully
manage the PASID space thus vPASIDs are directly used for routing and
even hidden from the kernel. Other platforms require the user to
explicitly register the vPASID information to the kernel when attaching
the vfio_device. In this case VFIO must figure out whether vPASID should
be directly used (pdev) or converted to a kernel-allocated pPASID (mdev)
for physical routing. A detailed explanation of PASID virtualization can
be found in {1.4}.

For mdev both default and non-default I/O address spaces are routed
via PASIDs. To better differentiate them we use "default PASID" (or
defPASID) when talking about the default I/O address space on mdev. When
vPASID or pPASID is referred to in PASID virtualization it's all about the
non-default spaces. defPASID and pPASID are always hidden from userspace
and can only be indirectly referenced via IOASID.

1.3. Group isolation
++++++++++++++++++++

Group is the minimal object when talking about DMA isolation in the
iommu layer. Devices which cannot be isolated from each other are
organized into a single group. Lack of isolation could be caused by
multiple reasons: no ACS capability in the upstream port, behind a
PCIe-to-PCI bridge (thus sharing RID), or DMA aliasing (multiple RIDs
per device), etc.

All devices in the group must be put in a security context together
before one or more devices in the group are operated by an untrusted
user. Passthrough frameworks must guarantee that:

1) No user access is granted on a device before a security context is
established for the entire group (becomes viable).

2) Group viability is not broken before the user relinquishes the device.
This implies that devices in the group must be either assigned to this
user, or driver-less, or bound to a driver which is known to be safe
(does no DMA).

3) The security context should not be destroyed before user access
permission is withdrawn.

Existing VFIO introduces explicit container and group semantics in its
uAPI to meet the above requirements:

1) VFIO user can open a device fd only after:

* A container is created;
* The group is attached to the container (VFIO_GROUP_SET_CONTAINER);
* An empty I/O page table is created in the container (VFIO_SET_IOMMU);
* Group viability is passed and the entire group is attached to
the empty I/O page table (the security context);

2) VFIO monitors driver binding status to verify group viability

* IOMMU_GROUP_NOTIFY_BOUND_DRIVER;
* BUG_ON() if group viability is broken;

3) Detach the group from the container when the last device fd in the
group is closed and destroy the I/O page table only after the last
group is detached from the container.

With this proposal VFIO can move to a simpler device-centric model by
directly exposing device nodes under "/dev/vfio/devices" w/o using the
container and group uAPI at all. In this case group isolation is enforced
implicitly within IOMMU fd (a rough sketch follows at the end of this
section):

1) A successful binding call for the first device in the group creates
the security context for the entire group, by:

* Verifying group viability in a similar way as VFIO does;

* Calling IOMMU-API to move the group into a block-dma state,
which makes all devices in the group attached to a block-dma
domain with an empty I/O page table;

VFIO should not allow the user to mmap the MMIO bar of the bound
device until the binding call succeeds.

Binding other devices in the same group just succeeds since the
security context has already been established for the entire group.

2) IOMMU fd monitors driver binding status in case group viability is
broken, same as VFIO does today. BUG_ON() might be eliminated if we
can find a way to deny probe of non-iommu-safe drivers.

Before a device is unbound from IOMMU fd, it is always attached to a
security context (either the block-dma domain or an IOASID domain).
Switching between the two domains is initiated by attaching the device to or
detaching it from an IOASID. The IOMMU layer should ensure that
the default domain is not implicitly re-attached in the switching
process, before the group is moved out of the block-dma state.

To stay on par with legacy VFIO, IOMMU fd could verify that all
bound devices in the same group are attached to a single IOASID.

3) When a device fd is closed, VFIO automatically unbinds the device from
IOMMU fd before zapping the mmio mapping. Unbinding the last device
in the group moves the entire group out of the block-dma state and
re-attaches it to the default domain.

The actual implementation may use a staging approach, e.g. only support
single-device groups at the start (leaving multi-device groups handled via
the legacy VFIO uAPI) and then cover multi-device groups at a later stage.

If necessary, devices within a group may be further allowed to be
attached to different IOASIDs in the same IOMMU fd, in case the
source devices can be reliably identified (e.g. due to !ACS). This will
require additional sub-group logic in the iommu layer, with the
sub-group topology exposed to userspace. But there is no expectation of
changing the device-centric semantics except introducing sub-group
awareness within IOMMU fd.

A more detailed explanation of the staging approach can be found:

https://lore.kernel.org/linux-iommu/BN9PR11MB543382665D34E58155A9593C8C039@BN9PR11MB5433.namprd11.prod.outlook.com/
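
To make the binding flow in 1) above more concrete, below is a rough
pseudo-code sketch of what IOMMU fd might do when the first device of a
group is bound. The helper names (group_already_bound(), viable(),
block_group_dma()) are illustrative only; the actual in-kernel API is TBD
as noted earlier:

int iommu_fd_bind_device(struct iommu_ctx *ctx, struct device *dev,
			 u64 cookie)
{
        struct iommu_group *grp = iommu_group_get(dev);

        if (!group_already_bound(ctx, grp)) {
                /* same rule as VFIO: every peer device is assigned to
                 * this user, driver-less, or bound to a known-safe
                 * driver */
                if (!viable(grp))
                        return -EPERM;

                /* move the whole group to a block-dma domain with an
                 * empty I/O page table (the security context) */
                block_group_dma(grp);
        }

        /* record the device and its cookie under this iommu_ctx */
        ...
        return 0;
}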

1.4. PASID Virtualization
+++++++++++++++++++++++++

As explained in {1.2}, PASID virtualization is required when multiple I/O
address spaces are supported on a device. The actual policy is per-device,
thus defined by the specific VFIO device driver.

A PASID virtualization policy is defined by four aspects:

1) Whether this device allows the user to create multiple I/O address
spaces (vPASID capability). This is decided by whether this device
and its upstream IOMMU both support PASID.

2) If yes, whether the PASID space is delegated to the user, based on
whether the PASID table should be managed by user or kernel.

3) If no, the user should register vPASID to the kernel. Then the next
question is whether vPASID should be directly used for physical routing
(vPASID==pPASID or vPASID!=pPASID). The key is whether this device
must share the PASID space with others (pdev vs. mdev).

4) If vPASID!=pPASID, whether pPASID should be allocated from the
per-RID space or a global space. This is about whether the device
supports PCIe DMWr-type work submission (e.g. Intel ENQCMD) which
requires global pPASID allocation across multiple devices.

Only vPASIDs are part of the VM state to be migrated in VM live migration.
This is basically about the virtual PASID table state in vendor vIOMMU. If
vPASID!=pPASID, new pPASIDs will be re-allocated on the destination and
VFIO device driver is responsible for programming the device to use the
new pPASID when restoring the device state.

Different policies may imply different uAPI semantics for user to follow
when attaching a device. The semantics information is expected to be
reported to the user via VFIO uAPI instead of via IOMMU fd, since the
latter only cares about pPASID. But if there is a different thought we'd
like to hear it.

The following sections (1.4.1 - 1.4.3) provide a detailed explanation of how
the above are selected for different device types and the implication when
multiple types are mixed together (i.e. assigned to a single user). The last
section (1.4.4) then summarizes what uAPI semantics information is
reported and how the user is expected to deal with it.

1.4.1. Devices which don't support DMWr
***************************************

This section is about the following types:

1) a pdev which doesn't issue PASID;
2) a sw mdev which doesn't issue PASID;
3) a mdev which is programmed with a fixed defPASID (for the default I/O
address space), but does not expose the vPASID capability;

4) a pdev which exposes vPASID and has its PASID table managed by user;
5) a pdev which exposes vPASID and has its PASID table managed by kernel;
6) a mdev which exposes vPASID and shares the parent's PASID table
with other mdev's;

+--------+---------+---------+----------+-----------+
| | |Delegated| vPASID== | per-RID |
| | vPASID | to user | pPASID | pPASID |
+========+=========+=========+==========+===========+
| type-1 | N/A | N/A | N/A | N/A |
+--------+---------+---------+----------+-----------+
| type-2 | N/A | N/A | N/A | N/A |
+--------+---------+---------+----------+-----------+
| type-3 | N/A | N/A | N/A | N/A |
+--------+---------+---------+----------+-----------+
| type-4 | Yes | Yes | v==p(*)| per-RID(*)|
+--------+---------+---------+----------+-----------+
| type-5 | Yes | No | v==p | per-RID |
+--------+---------+---------+----------+-----------+
| type-6 | Yes | No | v!=p | per-RID |
+--------+---------+---------+----------+-----------+
<* conceptual definition though the PASID space is fully delegated>

For types 1-3 there is no vPASID capability exposed and the user can create
only one default I/O address space on this device. Thus there is no PASID
virtualization at all.

4) is specific to ARM/AMD platforms where the PASID table is managed by
the user. In this case the entire PASID space is delegated to the user
which just needs to create a single IOASID linked to the user-managed
PASID table, as a placeholder covering all non-default I/O address spaces
on the pdev. In concept this looks like one big 84-bit address space (20-bit
PASID + 64-bit addr). vPASID may be carried in the uAPI data to help define
the operation scope when invalidating IOTLB or reporting I/O page fault.
IOMMU fd doesn't touch it and just acts as a channel for vIOMMU/pIOMMU to
exchange info.

5) is specific to Intel platforms where the PASID table is managed by
the kernel. In this case vPASIDs should be registered to the kernel
in the attaching call. This implies that every non-default I/O address
space on the pdev is explicitly tracked by a unique IOASID in the kernel.
Because the pdev is fully controlled by the user, its DMA requests carry
vPASID as the routing information, thus requiring the VFIO device driver
to adopt the vPASID==pPASID policy. Because an IOASID already represents a
standalone address space, there is no need to further carry vPASID in
the invalidation and fault paths.

6) is about mdevs, such as those enabled by Intel Scalable IOV. The main
difference from type-5) is whether vPASID==pPASID. There is
only a single PASID table per parent device, implying that the per-RID
PASID space is shared by all mdevs created on this parent. The VFIO device
driver must use the vPASID!=pPASID policy and allocate a pPASID from the
per-RID space for every registered vPASID to guarantee DMA isolation
between sibling mdevs. The VFIO device driver needs to conduct vPASID->
pPASID conversion properly in several paths (see the sketch after this
list):

- When VFIO device driver provides the routing information in the
attaching call, since IOMMU fd only cares about pPASID;
- When VFIO device driver updates a PASID MMIO register in the
parent according to the vPASID intercepted in the mediation path;
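
Below is a minimal sketch of how a type-6 mdev driver might track this
conversion, assuming a per-parent xarray indexed by vPASID. Only
iommu_pci_device_attach_pasid() comes from the helpers in section 3;
alloc_ppasid_from_parent() and the surrounding names are illustrative:

/* per-parent translation table kept by the mdev driver
 * (initialization via xa_init() omitted) */
struct xarray pasid_xa;         /* vPASID -> pPASID */

/* at VFIO_ATTACH_IOASID time, with a registered vPASID */
ppasid = alloc_ppasid_from_parent(parent);      /* per-RID space */
xa_store(&pasid_xa, vpasid, xa_mk_value(ppasid), GFP_KERNEL);
iommu_pci_device_attach_pasid(idev, parent_pdev, ioasid, ppasid);

/* mediation path: the guest writes vPASID to a PASID MMIO register */
ppasid = xa_to_value(xa_load(&pasid_xa, vpasid));
/* ... program ppasid into the parent's register instead ... */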

1.4.2. Devices which support DMWr
*********************************

Modern devices may support a scalable workload submission interface
based on PCI Deferrable Memory Write (DMWr) capability, allowing a
single work queue to access multiple I/O address spaces. One example
using DMWr is Intel ENQCMD, having PASID saved in the CPU MSR and
carried in the non-posted DMWr payload when sent out to the device.
Then a single work queue shared by multiple processes can compose
DMAs toward different address spaces, by carrying the PASID value
retrieved from the DMWr payload. The role of DMWr is allowing the
shared work queue to return a retry response when the work queue
is under pressure (due to capacity or QoS). Upon such response the
software could try re-submitting the descriptor.

When ENQCMD is executed in the guest, the value saved in the CPU
MSR is vPASID (part of the xsave state). This creates another point for
consideration regarding PASID virtualization.

Two device types are relevant:

7) a pdev same as 5) plus DMWr support;
8) a mdev same as 6) plus DMWr support;

and their respective policies:

+--------+---------+---------+----------+-----------+
| | |Delegated| vPASID== | per-RID |
| | vPASID | to user | pPASID | pPASID |
+========+=========+=========+==========+===========+
| type-7 | Yes | Yes | v==p | per-RID |
+--------+---------+---------+----------+-----------+
| type-8 | Yes | Yes | v!=p | global |
+--------+---------+---------+----------+-----------+

DMWr or shared mode is configurable per work queue. It's completely
sane if an assigned device with multiple queues needs to handle both
DMWr (shared work queue) and normal write (dedicated work queue)
simultaneously. Thus the PASID virtualization policy must be consistent
when both paths are activated.

For 7) we should use the same policy as 5), i.e. directly using vPASID
for physical routing on the pdev. In this case ENQCMD in the guest just works
w/o additional work because the vPASID saved in the PASID_MSR
matches the routing information configured for the target I/O address
space in the IOMMU. When receiving a DMWr request, the shared
work queue grabs vPASID from the payload and then tags outgoing
DMAs with vPASID. This is consistent with the dedicated work queue
path where vPASID is grabbed from the MMIO register to tag DMAs.

For 8) the vPASID in the PASID_MSR must be converted to pPASID before
being sent on the wire (given vPASID!=pPASID for the same reason as 6).
Intel CPU provides a hardware PASID translation capability for auto-
conversion when ENQCMD is being executed. In this case the payload
received by the work queue contains pPASID thus outgoing DMAs are
tagged with pPASID. This is consistent with the dedicated work
queue path where pPASID is programmed to the MMIO register in the
mediation path and then grabbed to tag DMAs.

However, the CPU translation structure is per-VM, which implies
that the same pPASID must be used across all type-8 devices (of this VM)
for a given vPASID. This requires the pPASID to be allocated from a global
pool by the first type-8 device and then shared by the following type-8
devices when they are attached to the same vPASID.

The CPU translation capability is enabled via KVM uAPI. We need a secure
contract between the VFIO device fd and the KVM fd so the VFIO device driver
knows when it's safe to allow guest access to the cmd portal of the type-8
device. It's dangerous to allow the guest to issue ENQCMD to the
device before the CPU is ready for PASID translation. In this window the
vPASID is untranslated, which allows the guest to access a random I/O
address space on the parent of this mdev.

We plan to utilize existing kvm-vfio contract. It is currently used for
multiple purposes including propagating the kvm pointer to the VFIO
device driver. It can be extended to further notify whether CPU PASID
translation capability is turned on. Before receiving this notification,
the VFIO device driver should not allow the user to access the DMWr-capable
work queue on a type-8 device.

1.4.3. Mix different types together
***********************************

In the majority of cases mixing different types doesn't change the
aforementioned PASID virtualization policy for each type. The user just
needs to handle them on a per-device basis.

There is one exception though, when mixing type 7) and 8) together,
due to conflicting policies on how PASID_MSR should be handled.
For mdev (type-8) the CPU translation capability must be enabled to
prevent a malicious guest from doing bad things. But once per-VM
PASID translation is enabled, the shared work queue of pdev (type-7)
will also receive a pPASID allocated for mdev instead of the vPASID
that is expected on this pdev.

Fixing this exception for pdev is not easy. There are three options.

One is moving pdev to also accept pPASID. Because pdev may have both
shared work queue (PASID in MSR) and dedicated work queue (PASID
in MMIO) enabled by the guest, this requires VFIO device driver to
mediate the dedicated work queue path so vPASIDs programmed by
the guest are manually translated to pPASIDs before being written to the
pdev. This may add undesired software complexity and a potential
performance impact if the PASID register is located alongside other
fast-path resources in the same 4K page. If it works it essentially
converts type-7 to type-8 from user p.o.v.

The second option is using an enlightened approach so the guest
directly uses the host-allocated pPASIDs instead of creating its own vPASID
space. In this case even the dedicated work queue path uses pPASID w/o
the need of mediation. However this requires different uAPI semantics
(from register-vPASID to return-pPASID) and exposes pPASID knowledge
to userspace which also implies breaking VM live migration.

The third option is making pPASID an alias of vPASID for routing
and having both linked to the same I/O page table in the IOMMU, so
either way can hit the desired address space. This further requires some
sort of range-split scheme to avoid conflicts between vPASID and pPASID.
However, we haven't found a clear way to fold this trick into this uAPI
proposal yet. And this option may not work when PASID is also used to
tag the IMS entry for verifying the interrupt source. In this case there
is no room for aliasing.

So, none of the above can work cleanly based on current thoughts. We
decided not to support a type-7/8 mix in this proposal. The user can detect
this exception based on the reported PASID flags, as outlined in the next
section.

1.4.4. User sequence
********************

A new PASID capability info could be introduced to VFIO_DEVICE_GET_INFO.
Its presence indicates that the user is allowed to create multiple I/O address
spaces with vPASID on the device. This capability further includes the
following flags to help describe the desired uAPI semantics:

- PASID_DELEGATED; // PASID space delegated to the user?
- PASID_CPU; // Allow vPASID used in the CPU?
- PASID_CPU_VIRT; // Require vPASID translation in the CPU?

The last two flags together help the user to detect the unsupported
type 7/8 mix scenario.

Take Qemu for example. It queries the above flags for every vfio device at
initialization time, after identifying the PASID capability (a pseudo-code
sketch follows step 3 below):

1) If PASID_DELEGATED is set, the PASID space is fully managed by the
user thus a single IOASID (linked to user-managed page table) is
required as the placeholder for all non-default I/O address spaces
on the device.

If not set, an IOASID must be created for every non-default I/O address
space on this device and vPASID must be registered to the kernel
when attaching the device to this IOASID.

The user may want to sanity check that all devices report the same setting,
as this flag is a platform attribute though it's exported per device.

If not set, continue to step 2.

2) If PASID_CPU is not set, done.

Otherwise check whether the PASID_CPU_VIRT flag on this device is
consistent with all other devices with PASID_CPU set.

If inconsistency is found (indicating type 7/8 mix), only one type
of devices (all set, or all clear) should have the vPASID capability
exposed to the guest.

3) If PASID_CPU_VIRT is not set, done.

If set and consistency check in 2) is passed, call KVM uAPI to
enable CPU PASID translation if it is the first device with this flag
set. Later when a new vPASID is identified through vIOMMU at run-time,
call another KVM uAPI to update the corresponding PASID mapping.
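
The consistency check in step 2) could look roughly like the pseudo-code
below. vfio_get_pasid_flags() is a placeholder for parsing the PASID
capability out of VFIO_DEVICE_GET_INFO, not a proposed interface:

bool seen = false, cpu_virt = false;

for (each vfio device fd) {
        flags = vfio_get_pasid_flags(device_fd);
        if (!(flags & PASID_CPU))
                continue;
        if (!seen) {
                cpu_virt = !!(flags & PASID_CPU_VIRT);
                seen = true;
        } else if (cpu_virt != !!(flags & PASID_CPU_VIRT)) {
                /* type 7/8 mix: expose the vPASID capability to the
                 * guest only for one category (all set, or all clear) */
        }
}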

1.5. No-snoop DMA
++++++++++++++++++++

Snoop behavior of a DMA specifies whether the access is coherent (snoops
the processor caches) or not. The snoop behavior is decided by both device
and IOMMU. Device can set a no-snoop attribute in DMA request to force
the non-coherent behavior, while IOMMU may support a configuration which
enforces DMAs to be coherent (with the no-snoop attribute ignored).

No-snoop DMA requires the driver to manually flush caches to observe
the latest content. When such a driver is running in the guest,
it further requires KVM to intercept/emulate WBINVD plus favor
guest cache attributes in the EPT page table.

Alex helped create a matrix as below:
(https://lore.kernel.org/linux-iommu/PH0PR12MB54811863B392C644E5365446DC3E9@PH0PR12MB5481.namprd12.prod.outlook.com/T/#mbfc96278b078d3ec07eabb9aa46abfe03a886dc6)

\ Device supports
IOMMU enforces\ no-snoop
snoop \ yes | no |
----------------+-----+-----+
yes | 1 | 2 |
----------------+-----+-----+
no | 3 | 4 |
----------------+-----+-----+

DMA is always coherent in boxes {1, 2, 4}. No-snoop DMA is allowed
in {3} but whether it is actually used is a driver decision.

VFIO currently adopts a simple policy - always turn on IOMMU enforce-
snoop if available. It provides a contract via the kvm-vfio fd for KVM to
learn whether no-snoop DMA is used, so that the special tricks on WBINVD
and EPT can be enabled. However, the criterion for no-snoop DMA is
solely the lack of IOMMU enforce-snoop for any vfio
device, i.e. both 3) and 4) are considered capable of doing no-snoop
DMA. This model has several limitations:

- It's impossible to move a device from 1) to 3) when no-snoop DMA
is a must to achieve the desired user experience;

- Unnecessary overhead on the KVM side in 4), or if the driver doesn't do
no-snoop DMA in 3). Although the driver doesn't use WBINVD, the
guest still uses WBINVD in other places, e.g. when changing cache-
related registers (e.g. MTRR/CR0);

We want to adopt a user-driven model in /dev/iommu for more accurate
control over the no-snoop usage. In this model the enforce-snoop format
is specified when an IOASID is created, while the device no-snoop usage
can be further clarified when it's attached to the IOASID.

IOMMU fd is expected to provide uAPIs and helper functions for:

- reporting IOMMU enforce-snoop capability to the user per device
cookie (device no-snoop capability is reported via VFIO).

- allowing user to specify whether an IOASID should be created in the
IOMMU enforce-snoop format (enable/disable/auto):

* This allows moving a device from 1) to 3) in case of performance
requirement.

* 'auto' falls back to the legacy VFIO policy, i.e. always enables
enforce-snoop if available.

* Any device can be attached to a non-enforce-snoop IOASID,
because this format is supported by all IOMMUs. In this case the
device belongs to {3, 4} and whether it is considered doing no-snoop
DMA is decided by the next interface.

* Attaching a device which cannot be forced to snoop by its IOMMU
to an enforce-snoop IOASID gets a failure. Successful attaching
implies the device always does snoop DMA, i.e. belonging to {1,2}.

* Some platforms support page-granular enforce-snoop. One open
question is whether a page-granular interface is necessary here.

- allowing user to further hint whether no-snoop DMA is actually used
in {3, 4} on a specific IOASID, via the VFIO attaching call:

* in case the user has such intrinsic knowledge on a specific device.

* {3} can be filtered out with this hint.

* {4} can be filtered out automatically by VFIO device driver,
based on device no-snoop capability.

* If no hint is provided, fall back to legacy VFIO policy, i.e.
treating all devices in {3, 4} as capable of doing no-snoop.

- a new contract for KVM to learn whether any IOASID is attached by
devices which require no-snoop DMA:

* We once thought the existing kvm-vfio fd could be leveraged as a short
term approach (see the above link). However kvm-vfio is centered
on the vfio group concept, while this proposal is moving to a device-
centric model.

* The new contract will allow KVM to query the no-snoop requirement
per IOMMU fd. This will apply to all passthrough frameworks.

* A notification mechanism might be introduced to switch between
WBINVD emulation and no-op intercept according to device
attaching status change in registered IOMMU fd.

* Whether kvm-vfio will be completely deprecated is TBD. It's
still used for non-iommu related contracts, e.g. notifying the kvm
pointer to the mdev driver and pvIOMMU acceleration on PPC.

- optional bulk cache invalidation:

* Userspace driver can use clflush to invalidate cachelines for
buffers used for no-snoop DMA. But this may be inefficient when
a big buffer needs to be invalidated. In this case a bulk
invalidation could be provided based on WBINVD.

The implementation might use a staging approach. At the start IOMMU fd
only supports devices which can be forced to snoop via the IOMMU (i.e.
{1, 2}), while leaving {3, 4} still handled via legacy VFIO. In
this case there is no need to introduce a new contract with KVM. An easy way
is having VFIO not expose {3, 4} devices in /dev/vfio/devices. Then we have
plenty of time to figure out the implementation details of the new model
at a later stage.
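
Put together, the user-driven flow could look like the sketch below, in the
same style as the examples in section 4. The enforce_snoop field and the
no-snoop hint flag are illustrative names only; the exact fields are part of
the uAPI details still to be settled:

/* ask for a non-enforce-snoop I/O page table, i.e. allow moving a
 * device from {1} to {3} */
alloc_data = {
        .user_pgtable = false;
        .enforce_snoop = false;         // or 'auto' for the legacy policy
};
ioasid = ioctl(iommu_fd, IOMMU_IOASID_ALLOC, &alloc_data);

/* hint that this device will actually issue no-snoop DMA on this
 * IOASID, so KVM can be notified to emulate WBINVD */
at_data = {
        .fd = iommu_fd;
        .ioasid = ioasid;
        .flag = VFIO_ATTACH_NO_SNOOP;   // illustrative flag name
};
ioctl(device_fd, VFIO_ATTACH_IOASID, &at_data);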

2. uAPI Proposal
----------------------

/dev/iommu uAPI covers everything about managing I/O address spaces.

/dev/vfio device uAPI builds connection between devices and I/O address
spaces.

/dev/kvm uAPI is optionally required as far as no-snoop DMA or ENQCMD
is concerned.

2.1. /dev/iommu uAPI
++++++++++++++++++++

/*
* Check whether a uAPI extension is supported.
*
* It's unlikely that all planned capabilities in IOMMU fd will be ready in
* one breath. User should check which uAPI extension is supported
* according to its intended usage.
*
* A rough list of possible extensions may include:
*
* - EXT_MAP_TYPE1V2 for vfio type1v2 map semantics;
* - EXT_MAP_NEWTYPE for an enhanced map semantics;
* - EXT_IOASID_NESTING for what the name stands for;
* - EXT_USER_PAGE_TABLE for user managed page table;
* - EXT_USER_PASID_TABLE for user managed PASID table;
* - EXT_MULTIDEV_GROUP for 1:N iommu group;
* - EXT_DMA_NO_SNOOP for no-snoop DMA support;
* - EXT_DIRTY_TRACKING for tracking pages dirtied by DMA;
* - ...
*
* Return: 0 if not supported, 1 if supported.
*/
#define IOMMU_CHECK_EXTENSION _IO(IOMMU_TYPE, IOMMU_BASE + 0)
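
A minimal usage sketch in the style of the examples in section 4, assuming
the extension ID is simply passed as the ioctl argument (similar to how
VFIO_CHECK_EXTENSION works today):

if (!ioctl(iommu_fd, IOMMU_CHECK_EXTENSION, EXT_IOASID_NESTING))
        /* nesting not available, fall back to a flat IOASID layout */;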


/*
* Check capabilities and format information on a bound device.
*
* It could be reported either via a capability chain as implemented in
* VFIO or a per-capability query interface. The device is identified
* by device cookie (registered when binding this device).
*
* Sample capability info:
* - VFIO type1 map: supported page sizes, permitted IOVA ranges, etc.;
* - IOASID nesting: hardware nesting vs. software nesting;
* - User-managed page table: vendor specific formats;
* - User-managed pasid table: vendor specific formats;
* - coherency: whether IOMMU can enforce snoop for this device;
* - ...
*
*/
#define IOMMU_DEVICE_GET_INFO _IO(IOMMU_TYPE, IOMMU_BASE + 1)


/*
* Allocate an IOASID.
*
* IOASID is the FD-local software handle representing an I/O address
* space. Each IOASID is associated with a single I/O page table. User
* must call this ioctl to get an IOASID for every I/O address space that is
* intended to be tracked by the kernel.
*
* User needs to specify the attributes of the IOASID and associated
* I/O page table format information according to one or multiple devices
* which will be attached to this IOASID right after. The I/O page table
* is activated in the IOMMU when it's attached by a device. Incompatible
* format between device and IOASID will lead to attaching failure.
*
* The root IOASID should always have a kernel-managed I/O page
* table for safety. Locked page accounting is also conducted on the root.
*
* Multiple roots are possible, e.g. when multiple I/O address spaces
* are created but IOASID nesting is disabled. However, one page might
* be accounted multiple times in this case. The user is recommended to
* instead create a 'dummy' root with identity mapping (HVA->HVA) for
* centralized accounting, nested by all other IOASIDs which represent
* 'real' I/O address spaces.
*
* Sample attributes:
* - Ownership: kernel-managed or user-managed I/O page table;
* - IOASID nesting: the parent IOASID info if enabled;
* - User-managed page table: addr and vendor specific formats;
* - User-managed pasid table: addr and vendor specific formats;
* - coherency: enforce-snoop;
* - ...
*
* Return: allocated ioasid on success, -errno on failure.
*/
#define IOMMU_IOASID_ALLOC _IO(IOMMU_TYPE, IOMMU_BASE + 2)
#define IOMMU_IOASID_FREE _IO(IOMMU_TYPE, IOMMU_BASE + 3)
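
The allocation argument layout is intentionally not finalized in this
proposal. One possible shape, only to make the attribute list above more
concrete and matching the alloc_data examples in section 4 (all field names
are illustrative):

struct iommu_ioasid_alloc {
        __u32   argsz;
        __u32   flags;          /* kernel- vs. user-managed, enforce-snoop, ... */
        __u32   parent;         /* parent IOASID if nesting, 0 otherwise */
        __u32   pad;
        __u64   addr;           /* user page/PASID table pointer, if any */
        /* vendor-specific format information follows */
};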


/*
* Map/unmap process virtual addresses to I/O virtual addresses.
*
* Provide VFIO type1 equivalent semantics. Start with the same
* restrictions, e.g. the unmap size should match those used in the
* original mapping call.
*
* If the specified IOASID is the root, the mapped pages are automatically
* pinned and accounted as locked memory. Pinning might be postponed
* until the IOASID is attached by a device. A software mdev driver may
* further provide a hint to skip auto-pinning at attaching time, since
* it does selective pinning at run-time. Auto-pinning can also be
* skipped when I/O page fault is enabled on the root.
*
* When software nesting is enabled, this implies that the merged
* shadow mapping will also be updated accordingly. However if the
* change happens on the parent, it requires reverse lookup to update
* all relevant child mappings, which is time-consuming. So the user
* is advised not to change the parent mapping after the software
* nesting is established (maybe disallow?). There is no such restriction
* with hardware nesting, as the IOMMU will pick up the change
* when actually walking the page table.
*
* Input parameters:
* - u32 ioasid;
* - refer to vfio_iommu_type1_dma_{un}map
*
* Return: 0 on success, -errno on failure.
*/
#define IOMMU_MAP_DMA _IO(IOMMU_TYPE, IOMMU_BASE + 4)
#define IOMMU_UNMAP_DMA _IO(IOMMU_TYPE, IOMMU_BASE + 5)
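
As a reference only, the input could mirror struct vfio_iommu_type1_dma_map
with an added ioasid field; the final layout is part of the TBD details:

struct iommu_dma_map {
        __u32   argsz;
        __u32   flags;          /* read/write permission flags */
        __u32   ioasid;
        __u32   pad;
        __u64   vaddr;          /* process virtual address */
        __u64   iova;           /* I/O virtual address */
        __u64   size;           /* size of mapping (bytes) */
};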


/*
* Invalidate the IOTLB for a user-managed I/O page table
*
* Check include/uapi/linux/iommu.h for supported cache types and
* granularities. Device cookie and vPASID may be specified to help
* decide the scope of this operation.
*
* Input parameters:
* - child_ioasid;
* - granularity (per-device, per-pasid, range-based);
* - cache type (iotlb, devtlb, pasid cache);
*
* Return: 0 on success, -errno on failure
*/
#define IOMMU_INVALIDATE_CACHE _IO(IOMMU_TYPE, IOMMU_BASE + 6)
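
One possibility, purely for illustration, is to wrap the existing cache
invalidation info from include/uapi/linux/iommu.h with the target IOASID
and an optional per-device scope:

struct iommu_invalidate_cache {
        __u32   argsz;
        __u32   flags;
        __u32   ioasid;                         /* child IOASID */
        __u32   pad;
        __u64   dev_cookie;                     /* optional scope */
        struct iommu_cache_invalidate_info info;        /* cache type, granularity, ... */
};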


/*
* Page fault report and response
*
* This is TBD. It can be added after other parts are cleared up. It may
* include a fault region to report fault data via read(), an
* eventfd to notify the user, and an ioctl to complete the fault.
*
* The fault data includes {IOASID, device_cookie, faulting addr, perm}
* as common info. Vendor-specific fault info can also be included if
* necessary.
*
* If the IOASID represents a user-managed PASID table, the vendor
* fault info includes vPASID information for the user to figure out
* which I/O page table triggers the fault.
*
* If the IOASID represents a user-managed I/O page table, the user
* is expected to find out vPASID itself according to {IOASID, device_
* cookie}.
*/


/*
* Dirty page tracking
*
* Track and report memory pages dirtied in I/O address spaces. There
* is ongoing work by Kunkun Jiang extending the existing VFIO type1.
* It needs to be adapted to /dev/iommu later.
*/


2.2. /dev/vfio device uAPI
++++++++++++++++++++++++++

/*
* Bind a vfio_device to the specified IOMMU fd
*
* The user should provide a device cookie when calling this ioctl. The
* cookie is later used in IOMMU fd for capability query, iotlb invalidation
* and I/O fault handling.
*
* User is not allowed to access the device before the binding operation
* is completed.
*
* Unbind is automatically conducted when device fd is closed.
*
* Input parameters:
* - iommu_fd;
* - cookie;
*
* Return: 0 on success, -errno on failure.
*/
#define VFIO_BIND_IOMMU_FD _IO(VFIO_TYPE, VFIO_BASE + 22)
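
A possible argument layout, matching the input parameters listed above
(illustrative only):

struct vfio_device_bind_iommu_fd {
        __u32   argsz;
        __u32   flags;
        __s32   iommu_fd;
        __u32   pad;
        __u64   cookie;         /* device cookie chosen by the user */
};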


/*
* Report vPASID info to userspace via VFIO_DEVICE_GET_INFO
*
* Add a new device capability. The presence indicates that the user
* is allowed to create multiple I/O address spaces on this device. The
* capability further includes the following flags:
*
* - PASID_DELEGATED, if clear every vPASID must be registered to
* the kernel;
* - PASID_CPU, if set vPASID is allowed to be carried in the CPU
* instructions (e.g. ENQCMD);
* - PASID_CPU_VIRT, if set require vPASID translation in the CPU;
*
* The user must check that all devices with PASID_CPU set have the
* same setting of PASID_CPU_VIRT. If they mismatch, it should enable
* vPASID only for one category (all set, or all clear).
*
* When the user enables vPASID on the device with PASID_CPU_VIRT
* set, it must enable vPASID CPU translation via kvm fd before attempting
* to use ENQCMD to submit work items. The command portal is blocked
* by the kernel until the CPU translation is enabled.
*/
#define VFIO_DEVICE_INFO_CAP_PASID 5


/*
* Attach a vfio device to the specified IOASID
*
* Multiple vfio devices can be attached to the same IOASID, and vice
* versa.
*
* User may optionally provide a "virtual PASID" to mark an I/O page
* table on this vfio device, if PASID_DELEGATED is not set in device info.
* Whether the virtual PASID is physically used or converted to another
* kernel-allocated PASID is a policy in the kernel.
*
* Because one device is allowed to bind to multiple IOMMU fd's, the
* user should provide both iommu_fd and ioasid for this attach operation.
*
* Input parameter:
* - iommu_fd;
* - ioasid;
* - flag;
* - vpasid (if specified);
*
* Return: 0 on success, -errno on failure.
*/
#define VFIO_ATTACH_IOASID _IO(VFIO_TYPE, VFIO_BASE + 23)
#define VFIO_DETACH_IOASID _IO(VFIO_TYPE, VFIO_BASE + 24)
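
A possible argument layout, matching the input parameters listed above
(field and flag names are illustrative):

struct vfio_device_attach_ioasid {
        __u32   argsz;
        __u32   flags;          /* e.g. IOASID_ATTACH_VPASID */
        __s32   iommu_fd;
        __u32   ioasid;
        __u32   vpasid;         /* valid only when flagged */
        __u32   pad;
};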


2.3. /dev/kvm uAPI
++++++++++++++++++

/*
* Check/enable CPU PASID translation via KVM CAP interface
*
* This is necessary when ENQCMD will be used in the guest while the
* targeted device doesn't accept the vPASID saved in the CPU MSR.
*/
#define KVM_CAP_PASID_TRANSLATION 206


/*
* Update CPU PASID mapping
*
* This command allows the user to set/clear the vPASID->pPASID mapping
* in the CPU, by providing the IOASID (and FD) information representing
* the I/O address space marked by this vPASID. KVM calls iommu helper
* function to retrieve pPASID according to the input parameters. So the
* pPASID value is completely hidden from the user.
*
* Input parameters:
* - user_pasid;
* - iommu_fd;
* - ioasid;
*/
#define KVM_MAP_PASID _IO(KVMIO, 0xf0)
#define KVM_UNMAP_PASID _IO(KVMIO, 0xf1)
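
A possible argument layout, matching the input parameters listed above
(illustrative only; the pPASID itself never appears here):

struct kvm_pasid_mapping {
        __u32   flags;
        __u32   user_pasid;     /* the vPASID seen by the guest */
        __s32   iommu_fd;
        __u32   ioasid;         /* KVM looks up the pPASID from this */
};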


/*
* A new contract to exchange no-snoop DMA status with IOMMU fd is also
* needed. This will be a device-centric interface, thus the existing
* vfio-kvm contract is not suitable as it's group-centric.
*
* actual definition TBD.
*/


3. Sample structures and helper functions
--------------------------------------------------------

Three helper functions are provided to support VFIO_BIND_IOMMU_FD:

struct iommu_ctx *iommu_ctx_fdget(int fd);
struct iommu_dev *iommu_register_device(struct iommu_ctx *ctx,
                struct device *device, u64 cookie);
int iommu_unregister_device(struct iommu_dev *dev);

An iommu_ctx is created for each fd:

struct iommu_ctx {
        // a list of allocated IOASID data's
        struct xarray           ioasid_xa;

        // a list of registered devices
        struct xarray           dev_xa;
};

Later some group-tracking fields will also be introduced to support
multi-device groups.

Each registered device is represented by iommu_dev:

struct iommu_dev {
        struct iommu_ctx        *ctx;
        // always the physical device
        struct device           *device;
        u64                     cookie;
        struct kref             kref;
};

A successful binding establishes a security context for the bound
device and returns a struct iommu_dev pointer to the caller. After this
point, the user is allowed to query device capabilities via
IOMMU_DEVICE_GET_INFO.

For mdev the struct device should be the pointer to the parent device.

An ioasid_data is created upon IOMMU_IOASID_ALLOC, as the main
object describing the characteristics of an I/O page table:

struct ioasid_data {
        struct iommu_ctx        *ctx;

        // the IOASID number
        u32                     ioasid;

        // the handle for kernel-managed I/O page table
        struct iommu_domain     *domain;

        // map metadata (vfio type1 semantics)
        struct rb_node          dma_list;

        // pointer to user-managed pgtable
        u64                     user_pgd;

        // link to the parent ioasid (for nesting)
        struct ioasid_data      *parent;

        // IOMMU enforce-snoop
        bool                    enforce_snoop;

        // various format information
        ...

        // a list of device attach data (routing information)
        struct list_head        attach_data;

        // a list of fault_data reported from the iommu layer
        struct list_head        fault_data;

        ...
};

iommu_domain is the object for operating the kernel-managed I/O
page tables in the IOMMU layer. An ioasid_data is associated with an
iommu_domain explicitly or implicitly:

- root IOASID (except the 'dummy' one for locked accounting)
must use a kernel-managed I/O page table, thus is always linked to an
iommu_domain;

- child IOASID (via software nesting) is explicitly linked to an iommu
domain as the shadow I/O page table is managed by the kernel;

- child IOASID (via hardware nesting) is linked to another simpler iommu
layer object (TBD) for tracking the user-managed page table. Due to
nesting it is also implicitly linked to the iommu_domain of the
parent;

Following link has an initial discussion on this part:

https://lore.kernel.org/linux-iommu/BN9PR11MB54331FC6BB31E8CBF11914A48C019@BN9PR11MB5433.namprd11.prod.outlook.com/T/#m2c19d3825cc096daf2026ea94e00cc5858cda321

As Jason recommended in v1, bus-specific wrapper functions are provided
explicitly to support VFIO_ATTACH_IOASID, e.g.

struct iommu_attach_data *iommu_pci_device_attach(
                struct iommu_dev *dev, struct pci_device *pdev,
                u32 ioasid);
struct iommu_attach_data *iommu_pci_device_attach_pasid(
                struct iommu_dev *dev, struct pci_device *pdev,
                u32 ioasid, u32 pasid);

and variants for non-PCI devices.

A helper function is provided for above wrappers:

// flags specifies whether pasid is valid
struct iommu_attach_data *__iommu_device_attach(
                struct iommu_dev *dev, u32 ioasid, u32 pasid, int flags);

A new object is introduced and linked to ioasid_data->attach_data for
each successful attach operation:

struct iommu_attach_data {
        struct list_head        next;
        struct iommu_dev        *dev;
        u32                     pasid;
};

The helper function for VFIO_DETACH_IOASID is generic:

int iommu_device_detach(struct iommu_attach_data *data);
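
For illustration, below is a rough sketch of how a VFIO PCI variant driver
might wire the device uAPIs in {2.2} to the helpers above. Error handling
and locking are omitted; vdev, bind and attach are placeholders for the
driver's own state and the ioctl payloads:

/* VFIO_BIND_IOMMU_FD */
ictx = iommu_ctx_fdget(bind.iommu_fd);
vdev->idev = iommu_register_device(ictx, &vdev->pdev->dev, bind.cookie);

/* VFIO_ATTACH_IOASID */
vdev->attach = iommu_pci_device_attach(vdev->idev, vdev->pdev, attach.ioasid);

/* VFIO_DETACH_IOASID */
iommu_device_detach(vdev->attach);

/* unbind, done automatically when the device fd is closed */
iommu_unregister_device(vdev->idev);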

4. Use Cases and Flows
-------------------------------

Here we assume VFIO will support a new model where /dev/iommu-capable
devices are explicitly listed under /dev/vfio/devices, thus a device fd can
be acquired w/o going through the legacy container/group interface. They
may be further categorized into sub-directories based on device types
(e.g. pdev, mdev, etc.). For illustration purposes those devices are put
together and just called dev[1...N]:

device_fd[1...N] = open("/dev/vfio/devices/dev[1...N]", mode);

VFIO continues to support the container/group model for legacy applications
and also for devices which are not moved to /dev/iommu in one breath
(e.g. in a group with multiple devices, or supporting no-snoop DMA). In
concept there is no problem for VFIO to support two models simultaneously,
but we'll wait to see whether any issue arises when reaching the
implementation.

As explained earlier, one IOMMU fd is sufficient for all intended use cases:

iommu_fd = open("/dev/iommu", mode);

For simplicity the below examples are all made for the virtualization story.
They are representative and could be easily adapted to a non-virtualization
scenario.

Three types of IOASIDs are considered:

gpa_ioasid[1...N]: GPA as the default address space
giova_ioasid[1...N]: GIOVA as the default address space (nesting)
gva_ioasid[1...N]: CPU VA as non-default address space (nesting)

At least one gpa_ioasid must always be created per guest, while the other
two are relevant as far as vIOMMU is concerned.

Examples here apply to both pdev and mdev. VFIO device driver in the
kernel will figure out the associated routing information in the attaching
operation.

For illustration simplicity, IOMMU_CHECK_EXTENSION and
IOMMU_DEVICE_GET_INFO are skipped in these examples. No-snoop DMA is also
not covered here.

The below examples may not apply to all platforms. For example, the PAPR
IOMMU on the PPC platform always requires a vIOMMU and blocks DMAs until the
device is explicitly attached to a GIOVA address space. There are even fixed
associations between available GIOVA spaces and devices. Those platform-
specific variances are not covered here and will be figured out in the
implementation phase.

4.1. A simple example
+++++++++++++++++++++

Dev1 is assigned to the guest. A cookie has been allocated by the user
to represent this device in the iommu_fd.

One gpa_ioasid is created. The GPA address space is managed through
DMA mapping protocol by specifying that the I/O page table is managed
by the kernel:

/* Bind device to IOMMU fd */
device_fd = open("/dev/vfio/devices/dev1", mode);
iommu_fd = open("/dev/iommu", mode);
bind_data = {.fd = iommu_fd; .cookie = cookie};
ioctl(device_fd, VFIO_BIND_IOMMU_FD, &bind_data);

/* Allocate IOASID */
alloc_data = {.user_pgtable = false};
gpa_ioasid = ioctl(iommu_fd, IOMMU_IOASID_ALLOC, &alloc_data);

/* Attach device to IOASID */
at_data = { .fd = iommu_fd; .ioasid = gpa_ioasid};
ioctl(device_fd, VFIO_ATTACH_IOASID, &at_data);

/* Setup GPA mapping [0 - 1GB] */
dma_map = {
        .ioasid = gpa_ioasid;
        .iova = 0;              // GPA
        .vaddr = 0x40000000;    // HVA
        .size = 1GB;
};
ioctl(iommu_fd, IOMMU_MAP_DMA, &dma_map);

If the guest is assigned more than dev1, the user follows the above
sequence to attach other devices to the same gpa_ioasid, i.e. sharing
the GPA address space across all assigned devices, e.g. for dev2:

bind_data = {.fd = iommu_fd; .cookie = cookie2};
ioctl(device_fd2, VFIO_BIND_IOMMU_FD, &bind_data);
ioctl(device_fd2, VFIO_ATTACH_IOASID, &at_data);

4.2. Multiple IOASIDs (no nesting)
++++++++++++++++++++++++++++++++++

Dev1 and dev2 are assigned to the guest. vIOMMU is enabled. Initially
both devices are attached to gpa_ioasid. After boot the guest creates
a GIOVA address space (giova_ioasid) for dev2, leaving dev1 in
passthrough mode (gpa_ioasid).

Suppose IOASID nesting is not supported in this case. Qemu needs to
generate shadow mappings in userspace for giova_ioasid (like how
VFIO works today). The side-effect is that duplicated locked page
accounting might be incurred in this example as there are two root
IOASIDs now. It will be fixed once IOASID nesting is supported:

device_fd1 = open("/dev/vfio/devices/dev1", mode);
device_fd2 = open("/dev/vfio/devices/dev2", mode);
iommu_fd = open("/dev/iommu", mode);

/* Bind device to IOMMU fd */
bind_data = {.fd = iommu_fd; .cookie = cookie1};
ioctl(device_fd1, VFIO_BIND_IOMMU_FD, &bind_data);
bind_data = {.fd = iommu_fd; .cookie = cookie2};
ioctl(device_fd2, VFIO_BIND_IOMMU_FD, &bind_data);

/* Allocate IOASID */
alloc_data = {.user_pgtable = false};
gpa_ioasid = ioctl(iommu_fd, IOMMU_IOASID_ALLOC, &alloc_data);

/* Attach dev1 and dev2 to gpa_ioasid */
at_data = { .fd = iommu_fd; .ioasid = gpa_ioasid};
ioctl(device_fd1, VFIO_ATTACH_IOASID, &at_data);
ioctl(device_fd2, VFIO_ATTACH_IOASID, &at_data);

/* Setup GPA mapping [0 - 1GB] */
dma_map = {
        .ioasid = gpa_ioasid;
        .iova = 0;              // GPA
        .vaddr = 0x40000000;    // HVA
        .size = 1GB;
};
ioctl(iommu_fd, IOMMU_MAP_DMA, &dma_map);

/* After boot, guest enables a GIOVA space for dev2 via vIOMMU */
alloc_data = {.user_pgtable = false};
giova_ioasid = ioctl(iommu_fd, IOMMU_IOASID_ALLOC, &alloc_data);

/* First detach dev2 from previous address space */
at_data = { .fd = iommu_fd; .ioasid = gpa_ioasid};
ioctl(device_fd2, VFIO_DETACH_IOASID, &at_data);

/* Then attach dev2 to the new address space */
at_data = { .fd = iommu_fd; .ioasid = giova_ioasid};
ioctl(device_fd2, VFIO_ATTACH_IOASID, &at_data);

/* Setup a shadow DMA mapping according to vIOMMU.
*
* e.g. the vIOMMU page table adds a new 4KB mapping:
* GIOVA [0x2000] -> GPA [0x1000]
*
* and GPA [0x1000] is mapped to HVA [0x40001000] in gpa_ioasid.
*
* In this case the shadow mapping should be:
* GIOVA [0x2000] -> HVA [0x40001000]
*/
dma_map = {
        .ioasid = giova_ioasid;
        .iova = 0x2000;         // GIOVA
        .vaddr = 0x40001000;    // HVA
        .size = 4KB;
};
ioctl(iommu_fd, IOMMU_MAP_DMA, &dma_map);

4.3. IOASID nesting (software)
++++++++++++++++++++++++++++++

Same usage scenario as 4.2, with software-based IOASID nesting
available. In this mode it is the kernel instead of the user that creates
the shadow mapping.

The flow before the guest boots is the same as 4.2, except for one point. Because
giova_ioasid is nested on gpa_ioasid, locked accounting is only
conducted for gpa_ioasid which becomes the only root.

There could be a case where different gpa_ioasids are created due
to incompatible format between dev1/dev2 (e.g. about IOMMU
enforce-snoop). In such a case the user could further create a dummy
IOASID (HVA->HVA) as the root parent for the two gpa_ioasids to avoid
duplicated accounting. But this scenario is not covered in the following
flows.

To save space we only list the steps after boot (i.e. both dev1/dev2
have been attached to gpa_ioasid before the guest boots):

/* After boot */
/* Create GIOVA space nested on GPA space
* Both page tables are managed by the kernel
*/
alloc_data = {.user_pgtable = false; .parent = gpa_ioasid};
giova_ioasid = ioctl(iommu_fd, IOMMU_IOASID_ALLOC, &alloc_data);

/* Attach dev2 to the new address space (child)
* Note dev2 is still attached to gpa_ioasid (parent)
*/
at_data = { .fd = iommu_fd; .ioasid = giova_ioasid};
ioctl(device_fd2, VFIO_ATTACH_IOASID, &at_data);

/* Setup a GIOVA [0x2000] -> GPA [0x1000] mapping for giova_ioasid,
* based on the vIOMMU page table. The kernel is responsible for
* creating the shadow mapping GIOVA [0x2000] -> HVA [0x40001000]
* by walking the parent's I/O page table to find out GPA [0x1000] ->
* HVA [0x40001000].
*/
dma_map = {
.ioasid = giova_ioasid;
.iova = 0x2000; // GIOVA
.vaddr = 0x1000; // GPA
.size = 4KB;
};
ioctl(iommu_fd, IOMMU_MAP_DMA, &dma_map);

4.4. IOASID nesting (hardware)
++++++++++++++++++++++++++++++

Same usage scenario as 4.2, with hardware-based IOASID nesting
available. In this mode the I/O page table is managed by userspace,
so an invalidation interface is provided for the user to request
IOTLB invalidation.

/* After boots */
/* Create GIOVA space nested on GPA space.
* Claim it's a user-managed I/O page table.
*/
alloc_data = {
.user_pgtable = true;
.parent = gpa_ioasid;
.addr = giova_pgtable;
// and format information;
};
giova_ioasid = ioctl(iommu_fd, IOMMU_IOASID_ALLOC, &alloc_data);

/* Attach dev2 to the new address space (child)
* Note dev2 is still attached to gpa_ioasid (parent)
*/
at_data = { .fd = iommu_fd; .ioasid = giova_ioasid};
ioctl(device_fd2, VFIO_ATTACH_IOASID, &at_data);

/* Invalidate IOTLB when required */
inv_data = {
.ioasid = giova_ioasid;
// granular/cache type information
};
ioctl(iommu_fd, IOMMU_INVALIDATE_CACHE, &inv_data);

/* See 4.6 for I/O page fault handling */

4.5. Guest SVA (vSVA)
+++++++++++++++++++++

After boot the guest further creates a GVA address space (vpasid1) on
dev1. Dev2 is not affected (still attached to giova_ioasid).

As explained in section 1.4, the user should check the PASID capability
exposed via VFIO_DEVICE_GET_INFO and follow the required uAPI
semantics when doing the attaching call:

/****** If dev1 reports PASID_DELEGATED=false **********/
/* After boots */
/* Create GVA space nested on GPA space.
* Claim it's a user-managed I/O page table.
*/
alloc_data = {
.user_pgtable = true;
.parent = gpa_ioasid;
.addr = gva_pgtable;
// and format information;
};
gva_ioasid = ioctl(iommu_fd, IOMMU_IOASID_ALLOC, &alloc_data);

/* Attach dev1 to the new address space (child) and specify
* vPASID. Note dev1 is still attached to gpa_ioasid (parent)
*/
at_data = {
.fd = iommu_fd;
.ioasid = gva_ioasid;
.flag = IOASID_ATTACH_VPASID;
.vpasid = vpasid1;
};
ioctl(device_fd1, VFIO_ATTACH_IOASID, &at_data);

/* Enable CPU PASID translation if required */
if (PASID_CPU and PASID_CPU_VIRT are both true for dev1) {
pa_data = {
.iommu_fd = iommu_fd;
.ioasid = gva_ioasid;
.vpasid = vpasid1;
};
ioctl(kvm_fd, KVM_MAP_PASID, &pa_data);
};

/* Invalidate IOTLB when required */
...

/****** If dev1 reports PASID_DELEGATED=true **********/
/* Create user-managed vPASID space when it's enabled via vIOMMU */
alloc_data = {
.user_pasid_table = true;
.parent = gpa_ioasid;
.addr = gpasid_tbl;
// and format information;
};
pasidtbl_ioasid = ioctl(iommu_fd, IOMMU_IOASID_ALLOC, &alloc_data);

/* Attach dev1 to the vPASID space */
at_data = {.fd = iommu_fd; .ioasid = pasidtbl_ioasid};
ioctl(device_fd1, VFIO_ATTACH_IOASID, &at_data);

/* From now on all GVA address spaces on dev1 are represented by
* a single pasidtbl_ioasid as the placeholder in the kernel.
*
* But iotlb invalidation and fault handling are still per GVA
* address space. They still go through the IOMMU fd in the same
* way as in the PASID_DELEGATED=false scenario (a rough sketch
* follows).
*/
...
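
As a hedged illustration of that last point, an invalidation against
the pasid-table IOASID might carry the vPASID to identify which GVA
space is affected. The .vpasid field below is an assumption, not a
defined part of this uAPI:

/* Hypothetical per-GVA invalidation under a pasid-table IOASID */
inv_data = {
.ioasid = pasidtbl_ioasid;
.vpasid = vpasid1; // assumed field selecting the GVA space
// granular/cache type information
};
ioctl(iommu_fd, IOMMU_INVALIDATE_CACHE, &inv_data);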

4.6. I/O page fault
+++++++++++++++++++

The uAPI is TBD. This section only describes the high-level flow from
the host IOMMU driver to the guest IOMMU driver and back (a rough
sketch of possible structures follows the flow). This flow assumes that
I/O page faults are reported via IOMMU interrupts. Some devices report
faults in a device-specific way instead of going through the IOMMU;
that usage is not covered here:

- Host IOMMU driver receives an I/O page fault with raw fault_data {rid,
pasid, addr};

- Host IOMMU driver identifies the faulting I/O page table according to
{rid, pasid} and calls the corresponding fault handler with an opaque
object (registered by the handler) and raw fault_data (rid, pasid, addr);

- IOASID fault handler identifies the corresponding ioasid and device
cookie according to the opaque object, generates a user fault_data
(ioasid, cookie, addr) in the fault region, and triggers the eventfd to
userspace;

* In case the ioasid represents a pasid table, the pasid is also included
as additional fault_data;

* the raw fault_data is also cached in ioasid_data->fault_data and
used when generating response;

- Upon receiving the event, Qemu needs to find the virtual routing
information (v_rid + v_pasid) of the device attached to the faulting ioasid;

* v_rid is identified according to device_cookie;

* v_pasid is either identified according to ioasid, or already carried
in the fault data;

- Qemu generates a virtual I/O page fault through vIOMMU into guest,
carrying the virtual fault data (v_rid, v_pasid, addr);

- Guest IOMMU driver fixes up the fault, updates the guest I/O page table
(GIOVA or GVA), and then sends a page response with virtual completion
data (v_rid, v_pasid, response_code) to vIOMMU;

- Qemu finds the pending fault event, converts virtual completion data
into (ioasid, cookie, response_code), and then calls a /dev/iommu ioctl to
complete the pending fault;

- /dev/iommu finds the pending fault data {rid, pasid, addr} saved in
ioasid_data->fault_data, and then calls the IOMMU API to complete it with
{rid, pasid, response_code};
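
Since the uAPI is TBD, the following is only a rough sketch of what the
user-visible fault record and the completion ioctl might look like. The
structure layouts and the IOMMU_PAGE_RESPONSE name are assumptions for
illustration, not part of this proposal:

/* Hypothetical record written into the fault region */
struct iommu_fault_data {
__u32 ioasid; // faulting I/O address space
__u32 pasid; // valid only when ioasid is a pasid table
__u64 cookie; // device cookie provided at bind time
__u64 addr; // faulting address
};

/* Hypothetical completion call issued by Qemu after the guest responds */
resp = {
.ioasid = ioasid;
.cookie = cookie;
.response_code = code; // success / retry / failure
};
ioctl(iommu_fd, IOMMU_PAGE_RESPONSE, &resp);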


2021-07-09 21:51:41

by Alex Williamson

[permalink] [raw]
Subject: Re: [RFC v2] /dev/iommu uAPI proposal

Hi Kevin,

A couple first pass comments...

On Fri, 9 Jul 2021 07:48:44 +0000
"Tian, Kevin" <[email protected]> wrote:
> 2.2. /dev/vfio device uAPI
> ++++++++++++++++++++++++++
>
> /*
> * Bind a vfio_device to the specified IOMMU fd
> *
> * The user should provide a device cookie when calling this ioctl. The
> * cookie is later used in IOMMU fd for capability query, iotlb invalidation
> * and I/O fault handling.
> *
> * User is not allowed to access the device before the binding operation
> * is completed.
> *
> * Unbind is automatically conducted when device fd is closed.
> *
> * Input parameters:
> * - iommu_fd;
> * - cookie;
> *
> * Return: 0 on success, -errno on failure.
> */
> #define VFIO_BIND_IOMMU_FD _IO(VFIO_TYPE, VFIO_BASE + 22)

I believe this is an ioctl on the device fd, therefore it should be
named VFIO_DEVICE_BIND_IOMMU_FD.

>
>
> /*
> * Report vPASID info to userspace via VFIO_DEVICE_GET_INFO
> *
> * Add a new device capability. The presence indicates that the user
> * is allowed to create multiple I/O address spaces on this device. The
> * capability further includes following flags:
> *
> * - PASID_DELEGATED, if clear every vPASID must be registered to
> * the kernel;
> * - PASID_CPU, if set vPASID is allowed to be carried in the CPU
> * instructions (e.g. ENQCMD);
> * - PASID_CPU_VIRT, if set require vPASID translation in the CPU;
> *
> * The user must check that all devices with PASID_CPU set have the
> * same setting on PASID_CPU_VIRT. If mismatching, it should enable
> * vPASID only in one category (all set, or all clear).
> *
> * When the user enables vPASID on the device with PASID_CPU_VIRT
> * set, it must enable vPASID CPU translation via kvm fd before attempting
> * to use ENQCMD to submit work items. The command portal is blocked
> * by the kernel until the CPU translation is enabled.
> */
> #define VFIO_DEVICE_INFO_CAP_PASID 5
>
>
> /*
> * Attach a vfio device to the specified IOASID
> *
> * Multiple vfio devices can be attached to the same IOASID, and vice
> * versa.
> *
> * User may optionally provide a "virtual PASID" to mark an I/O page
> * table on this vfio device, if PASID_DELEGATED is not set in device info.
> * Whether the virtual PASID is physically used or converted to another
> * kernel-allocated PASID is a policy in the kernel.
> *
> * Because one device is allowed to bind to multiple IOMMU fd's, the
> * user should provide both iommu_fd and ioasid for this attach operation.
> *
> * Input parameter:
> * - iommu_fd;
> * - ioasid;
> * - flag;
> * - vpasid (if specified);
> *
> * Return: 0 on success, -errno on failure.
> */
> #define VFIO_ATTACH_IOASID _IO(VFIO_TYPE, VFIO_BASE + 23)
> #define VFIO_DETACH_IOASID _IO(VFIO_TYPE, VFIO_BASE + 24)

Likewise, VFIO_DEVICE_{ATTACH,DETACH}_IOASID

...
> 3. Sample structures and helper functions
> --------------------------------------------------------
>
> Three helper functions are provided to support VFIO_BIND_IOMMU_FD:
>
> struct iommu_ctx *iommu_ctx_fdget(int fd);
> struct iommu_dev *iommu_register_device(struct iommu_ctx *ctx,
> struct device *device, u64 cookie);
> int iommu_unregister_device(struct iommu_dev *dev);
>
> An iommu_ctx is created for each fd:
>
> struct iommu_ctx {
> // a list of allocated IOASID data's
> struct xarray ioasid_xa;
>
> // a list of registered devices
> struct xarray dev_xa;
> };
>
> Later some group-tracking fields will be also introduced to support
> multi-devices group.
>
> Each registered device is represented by iommu_dev:
>
> struct iommu_dev {
> struct iommu_ctx *ctx;
> // always be the physical device
> struct device *device;
> u64 cookie;
> struct kref kref;
> };
>
> A successful binding establishes a security context for the bound
> device and returns struct iommu_dev pointer to the caller. After this
> point, the user is allowed to query device capabilities via IOMMU_
> DEVICE_GET_INFO.

If we have an initial singleton group only restriction, I assume that
both iommu_register_device() would fail for any devices that are not in
a singleton group and vfio would only expose direct device files for
the devices in singleton groups. The latter implementation could
change when multi-device group support is added so that userspace can
assume that if the vfio device file exists, this interface is available.
I think this is confirmed further below.

> For mdev the struct device should be the pointer to the parent device.

I don't get how iommu_register_device() differentiates an mdev from a
pdev in this case.

...
> 4.3. IOASID nesting (software)
> ++++++++++++++++++++++++++++++
>
> Same usage scenario as 4.2, with software-based IOASID nesting
> available. In this mode it is the kernel instead of user to create the
> shadow mapping.
>
> The flow before guest boots is same as 4.2, except one point. Because
> giova_ioasid is nested on gpa_ioasid, locked accounting is only
> conducted for gpa_ioasid which becomes the only root.
>
> There could be a case where different gpa_ioasids are created due
> to incompatible format between dev1/dev2 (e.g. about IOMMU
> enforce-snoop). In such case the user could further created a dummy
> IOASID (HVA->HVA) as the root parent for two gpa_ioasids to avoid
> duplicated accounting. But this scenario is not covered in following
> flows.

This use case has been noted several times in the proposal, it probably
deserves an example.

>
> To save space we only list the steps after boots (i.e. both dev1/dev2
> have been attached to gpa_ioasid before guest boots):
>
> /* After boots */
> /* Create GIOVA space nested on GPA space
> * Both page tables are managed by the kernel
> */
> alloc_data = {.user_pgtable = false; .parent = gpa_ioasid};
> giova_ioasid = ioctl(iommu_fd, IOMMU_IOASID_ALLOC, &alloc_data);

So the user would use IOMMU_DEVICE_GET_INFO on the iommu_fd with device
cookie2 after the VFIO_DEVICE_BIND_IOMMU_FD to learn that software
nesting is supported before proceeding down this path?

>
> /* Attach dev2 to the new address space (child)
> * Note dev2 is still attached to gpa_ioasid (parent)
> */
> at_data = { .fd = iommu_fd; .ioasid = giova_ioasid};
> ioctl(device_fd2, VFIO_ATTACH_IOASID, &at_data);
>
> /* Setup a GIOVA [0x2000] ->GPA [0x1000] mapping for giova_ioasid,
> * based on the vIOMMU page table. The kernel is responsible for
> * creating the shadow mapping GIOVA [0x2000] -> HVA [0x40001000]
> * by walking the parent's I/O page table to find out GPA [0x1000] ->
> * HVA [0x40001000].
> */
> dma_map = {
> .ioasid = giova_ioasid;
> .iova = 0x2000; // GIOVA
> .vaddr = 0x1000; // GPA
> .size = 4KB;
> };
> ioctl(iommu_fd, IOMMU_MAP_DMA, &dma_map);
>
> 4.4. IOASID nesting (hardware)
> ++++++++++++++++++++++++++++++
>
> Same usage scenario as 4.2, with hardware-based IOASID nesting
> available. In this mode the I/O page table is managed by userspace
> thus an invalidation interface is used for the user to request iotlb
> invalidation.
>
> /* After boots */
> /* Create GIOVA space nested on GPA space.
> * Claim it's an user-managed I/O page table.
> */
> alloc_data = {
> .user_pgtable = true;
> .parent = gpa_ioasid;
> .addr = giova_pgtable;
> // and format information;
> };
> giova_ioasid = ioctl(iommu_fd, IOMMU_IOASID_ALLOC, &alloc_data);
>
> /* Attach dev2 to the new address space (child)
> * Note dev2 is still attached to gpa_ioasid (parent)
> */
> at_data = { .fd = iommu_fd; .ioasid = giova_ioasid};
> ioctl(device_fd2, VFIO_ATTACH_IOASID, &at_data);
>
> /* Invalidate IOTLB when required */
> inv_data = {
> .ioasid = giova_ioasid;
> // granular/cache type information
> };
> ioctl(iommu_fd, IOMMU_INVALIDATE_CACHE, &inv_data);
>
> /* See 4.6 for I/O page fault handling */
>
> 4.5. Guest SVA (vSVA)
> +++++++++++++++++++++
>
> After boots the guest further creates a GVA address spaces (vpasid1) on
> dev1. Dev2 is not affected (still attached to giova_ioasid).
>
> As explained in section 1.4, the user should check the PASID capability
> exposed via VFIO_DEVICE_GET_INFO and follow the required uAPI
> semantics when doing the attaching call:

And this characteristic lives in VFIO_DEVICE_GET_INFO rather than
IOMMU_DEVICE_GET_INFO because this is a characteristic known by the
vfio device driver rather than the system IOMMU, right? Thanks,

Alex

2021-07-12 01:23:48

by Tian, Kevin

[permalink] [raw]
Subject: RE: [RFC v2] /dev/iommu uAPI proposal

> From: Alex Williamson <[email protected]>
> Sent: Saturday, July 10, 2021 5:51 AM
>
> Hi Kevin,
>
> A couple first pass comments...
>
> On Fri, 9 Jul 2021 07:48:44 +0000
> "Tian, Kevin" <[email protected]> wrote:
> > 2.2. /dev/vfio device uAPI
> > ++++++++++++++++++++++++++
> >
> > /*
> > * Bind a vfio_device to the specified IOMMU fd
> > *
> > * The user should provide a device cookie when calling this ioctl. The
> > * cookie is later used in IOMMU fd for capability query, iotlb invalidation
> > * and I/O fault handling.
> > *
> > * User is not allowed to access the device before the binding operation
> > * is completed.
> > *
> > * Unbind is automatically conducted when device fd is closed.
> > *
> > * Input parameters:
> > * - iommu_fd;
> > * - cookie;
> > *
> > * Return: 0 on success, -errno on failure.
> > */
> > #define VFIO_BIND_IOMMU_FD _IO(VFIO_TYPE, VFIO_BASE + 22)
>
> I believe this is an ioctl on the device fd, therefore it should be
> named VFIO_DEVICE_BIND_IOMMU_FD.

make sense.

>
> >
> >
> > /*
> > * Report vPASID info to userspace via VFIO_DEVICE_GET_INFO
> > *
> > * Add a new device capability. The presence indicates that the user
> > * is allowed to create multiple I/O address spaces on this device. The
> > * capability further includes following flags:
> > *
> > * - PASID_DELEGATED, if clear every vPASID must be registered to
> > * the kernel;
> > * - PASID_CPU, if set vPASID is allowed to be carried in the CPU
> > * instructions (e.g. ENQCMD);
> > * - PASID_CPU_VIRT, if set require vPASID translation in the CPU;
> > *
> > * The user must check that all devices with PASID_CPU set have the
> > * same setting on PASID_CPU_VIRT. If mismatching, it should enable
> > * vPASID only in one category (all set, or all clear).
> > *
> > * When the user enables vPASID on the device with PASID_CPU_VIRT
> > * set, it must enable vPASID CPU translation via kvm fd before attempting
> > * to use ENQCMD to submit work items. The command portal is blocked
> > * by the kernel until the CPU translation is enabled.
> > */
> > #define VFIO_DEVICE_INFO_CAP_PASID 5
> >
> >
> > /*
> > * Attach a vfio device to the specified IOASID
> > *
> > * Multiple vfio devices can be attached to the same IOASID, and vice
> > * versa.
> > *
> > * User may optionally provide a "virtual PASID" to mark an I/O page
> > * table on this vfio device, if PASID_DELEGATED is not set in device info.
> > * Whether the virtual PASID is physically used or converted to another
> > * kernel-allocated PASID is a policy in the kernel.
> > *
> > * Because one device is allowed to bind to multiple IOMMU fd's, the
> > * user should provide both iommu_fd and ioasid for this attach operation.
> > *
> > * Input parameter:
> > * - iommu_fd;
> > * - ioasid;
> > * - flag;
> > * - vpasid (if specified);
> > *
> > * Return: 0 on success, -errno on failure.
> > */
> > #define VFIO_ATTACH_IOASID _IO(VFIO_TYPE, VFIO_BASE +
> 23)
> > #define VFIO_DETACH_IOASID _IO(VFIO_TYPE, VFIO_BASE +
> 24)
>
> Likewise, VFIO_DEVICE_{ATTACH,DETACH}_IOASID
>
> ...
> > 3. Sample structures and helper functions
> > --------------------------------------------------------
> >
> > Three helper functions are provided to support VFIO_BIND_IOMMU_FD:
> >
> > struct iommu_ctx *iommu_ctx_fdget(int fd);
> > struct iommu_dev *iommu_register_device(struct iommu_ctx *ctx,
> > struct device *device, u64 cookie);
> > int iommu_unregister_device(struct iommu_dev *dev);
> >
> > An iommu_ctx is created for each fd:
> >
> > struct iommu_ctx {
> > // a list of allocated IOASID data's
> > struct xarray ioasid_xa;
> >
> > // a list of registered devices
> > struct xarray dev_xa;
> > };
> >
> > Later some group-tracking fields will be also introduced to support
> > multi-devices group.
> >
> > Each registered device is represented by iommu_dev:
> >
> > struct iommu_dev {
> > struct iommu_ctx *ctx;
> > // always be the physical device
> > struct device *device;
> > u64 cookie;
> > struct kref kref;
> > };
> >
> > A successful binding establishes a security context for the bound
> > device and returns struct iommu_dev pointer to the caller. After this
> > point, the user is allowed to query device capabilities via IOMMU_
> > DEVICE_GET_INFO.
>
> If we have an initial singleton group only restriction, I assume that
> both iommu_register_device() would fail for any devices that are not in
> a singleton group and vfio would only expose direct device files for
> the devices in singleton groups. The latter implementation could
> change when multi-device group support is added so that userspace can
> assume that if the vfio device file exists, this interface is available.
> I think this is confirmed further below.

Exactly. I will elaborate on this assumption in the next version.

>
> > For mdev the struct device should be the pointer to the parent device.
>
> I don't get how iommu_register_device() differentiates an mdev from a
> pdev in this case.

via device cookie.

>
> ...
> > 4.3. IOASID nesting (software)
> > ++++++++++++++++++++++++++++++
> >
> > Same usage scenario as 4.2, with software-based IOASID nesting
> > available. In this mode it is the kernel instead of user to create the
> > shadow mapping.
> >
> > The flow before guest boots is same as 4.2, except one point. Because
> > giova_ioasid is nested on gpa_ioasid, locked accounting is only
> > conducted for gpa_ioasid which becomes the only root.
> >
> > There could be a case where different gpa_ioasids are created due
> > to incompatible format between dev1/dev2 (e.g. about IOMMU
> > enforce-snoop). In such case the user could further created a dummy
> > IOASID (HVA->HVA) as the root parent for two gpa_ioasids to avoid
> > duplicated accounting. But this scenario is not covered in following
> > flows.
>
> This use case has been noted several times in the proposal, it probably
> deserves an example.

will do.

>
> >
> > To save space we only list the steps after boots (i.e. both dev1/dev2
> > have been attached to gpa_ioasid before guest boots):
> >
> > /* After boots */
> > /* Create GIOVA space nested on GPA space
> > * Both page tables are managed by the kernel
> > */
> > alloc_data = {.user_pgtable = false; .parent = gpa_ioasid};
> > giova_ioasid = ioctl(iommu_fd, IOMMU_IOASID_ALLOC, &alloc_data);
>
> So the user would use IOMMU_DEVICE_GET_INFO on the iommu_fd with
> device
> cookie2 after the VFIO_DEVICE_BIND_IOMMU_FD to learn that software
> nesting is supported before proceeding down this path?

yes. If this capability is not available, the user should fall back to the
flow in 4.2, i.e. generating shadow mappings in userspace.

>
> >
> > /* Attach dev2 to the new address space (child)
> > * Note dev2 is still attached to gpa_ioasid (parent)
> > */
> > at_data = { .fd = iommu_fd; .ioasid = giova_ioasid};
> > ioctl(device_fd2, VFIO_ATTACH_IOASID, &at_data);
> >
> > /* Setup a GIOVA [0x2000] ->GPA [0x1000] mapping for giova_ioasid,
> > * based on the vIOMMU page table. The kernel is responsible for
> > * creating the shadow mapping GIOVA [0x2000] -> HVA [0x40001000]
> > * by walking the parent's I/O page table to find out GPA [0x1000] ->
> > * HVA [0x40001000].
> > */
> > dma_map = {
> > .ioasid = giova_ioasid;
> > .iova = 0x2000; // GIOVA
> > .vaddr = 0x1000; // GPA
> > .size = 4KB;
> > };
> > ioctl(iommu_fd, IOMMU_MAP_DMA, &dma_map);
> >
> > 4.4. IOASID nesting (hardware)
> > ++++++++++++++++++++++++++++++
> >
> > Same usage scenario as 4.2, with hardware-based IOASID nesting
> > available. In this mode the I/O page table is managed by userspace
> > thus an invalidation interface is used for the user to request iotlb
> > invalidation.
> >
> > /* After boots */
> > /* Create GIOVA space nested on GPA space.
> > * Claim it's an user-managed I/O page table.
> > */
> > alloc_data = {
> > .user_pgtable = true;
> > .parent = gpa_ioasid;
> > .addr = giova_pgtable;
> > // and format information;
> > };
> > giova_ioasid = ioctl(iommu_fd, IOMMU_IOASID_ALLOC, &alloc_data);
> >
> > /* Attach dev2 to the new address space (child)
> > * Note dev2 is still attached to gpa_ioasid (parent)
> > */
> > at_data = { .fd = iommu_fd; .ioasid = giova_ioasid};
> > ioctl(device_fd2, VFIO_ATTACH_IOASID, &at_data);
> >
> > /* Invalidate IOTLB when required */
> > inv_data = {
> > .ioasid = giova_ioasid;
> > // granular/cache type information
> > };
> > ioctl(iommu_fd, IOMMU_INVALIDATE_CACHE, &inv_data);
> >
> > /* See 4.6 for I/O page fault handling */
> >
> > 4.5. Guest SVA (vSVA)
> > +++++++++++++++++++++
> >
> > After boots the guest further creates a GVA address spaces (vpasid1) on
> > dev1. Dev2 is not affected (still attached to giova_ioasid).
> >
> > As explained in section 1.4, the user should check the PASID capability
> > exposed via VFIO_DEVICE_GET_INFO and follow the required uAPI
> > semantics when doing the attaching call:
>
> And this characteristic lives in VFIO_DEVICE_GET_INFO rather than
> IOMMU_DEVICE_GET_INFO because this is a characteristic known by the
> vfio device driver rather than the system IOMMU, right? Thanks,
>

yes.

Thanks
Kevin

2021-07-12 18:43:46

by Alex Williamson

[permalink] [raw]
Subject: Re: [RFC v2] /dev/iommu uAPI proposal

On Mon, 12 Jul 2021 01:22:11 +0000
"Tian, Kevin" <[email protected]> wrote:
> > From: Alex Williamson <[email protected]>
> > Sent: Saturday, July 10, 2021 5:51 AM
> > On Fri, 9 Jul 2021 07:48:44 +0000
> > "Tian, Kevin" <[email protected]> wrote:

> > > For mdev the struct device should be the pointer to the parent device.
> >
> > I don't get how iommu_register_device() differentiates an mdev from a
> > pdev in this case.
>
> via device cookie.


Let me re-add this section for more context:

> 3. Sample structures and helper functions
> --------------------------------------------------------
>
> Three helper functions are provided to support VFIO_BIND_IOMMU_FD:
>
> struct iommu_ctx *iommu_ctx_fdget(int fd);
> struct iommu_dev *iommu_register_device(struct iommu_ctx *ctx,
> struct device *device, u64 cookie);
> int iommu_unregister_device(struct iommu_dev *dev);
>
> An iommu_ctx is created for each fd:
>
> struct iommu_ctx {
> // a list of allocated IOASID data's
> struct xarray ioasid_xa;
>
> // a list of registered devices
> struct xarray dev_xa;
> };
>
> Later some group-tracking fields will be also introduced to support
> multi-devices group.
>
> Each registered device is represented by iommu_dev:
>
> struct iommu_dev {
> struct iommu_ctx *ctx;
> // always be the physical device
> struct device *device;
> u64 cookie;
> struct kref kref;
> };
>
> A successful binding establishes a security context for the bound
> device and returns struct iommu_dev pointer to the caller. After this
> point, the user is allowed to query device capabilities via IOMMU_
> DEVICE_GET_INFO.
>
> For mdev the struct device should be the pointer to the parent device.


So we'll have a VFIO_DEVICE_BIND_IOMMU_FD ioctl where the user provides
the iommu_fd and a cookie. vfio will use iommu_ctx_fdget() to get an
iommu_ctx* for that iommu_fd, then we'll call iommu_register_device()
using that iommu_ctx* we got from the iommu_fd, the cookie provided by
the user, and for an mdev, the parent of the device the user owns
(the device_fd on which this ioctl is called)...

How does an arbitrary user provided cookie let you differentiate that
the request is actually for an mdev versus the parent device itself?

For instance, how can the IOMMU layer distinguish GVT-g (mdev) vs GVT-d
(direct assignment) when both use the same struct device* and cookie is
just a user provided value? Still confused. Thanks,

Alex

2021-07-12 23:42:39

by Tian, Kevin

[permalink] [raw]
Subject: RE: [RFC v2] /dev/iommu uAPI proposal

> From: Alex Williamson <[email protected]>
> Sent: Tuesday, July 13, 2021 2:42 AM
>
> On Mon, 12 Jul 2021 01:22:11 +0000
> "Tian, Kevin" <[email protected]> wrote:
> > > From: Alex Williamson <[email protected]>
> > > Sent: Saturday, July 10, 2021 5:51 AM
> > > On Fri, 9 Jul 2021 07:48:44 +0000
> > > "Tian, Kevin" <[email protected]> wrote:
>
> > > > For mdev the struct device should be the pointer to the parent device.
> > >
> > > I don't get how iommu_register_device() differentiates an mdev from a
> > > pdev in this case.
> >
> > via device cookie.
>
>
> Let me re-add this section for more context:
>
> > 3. Sample structures and helper functions
> > --------------------------------------------------------
> >
> > Three helper functions are provided to support VFIO_BIND_IOMMU_FD:
> >
> > struct iommu_ctx *iommu_ctx_fdget(int fd);
> > struct iommu_dev *iommu_register_device(struct iommu_ctx *ctx,
> > struct device *device, u64 cookie);
> > int iommu_unregister_device(struct iommu_dev *dev);
> >
> > An iommu_ctx is created for each fd:
> >
> > struct iommu_ctx {
> > // a list of allocated IOASID data's
> > struct xarray ioasid_xa;
> >
> > // a list of registered devices
> > struct xarray dev_xa;
> > };
> >
> > Later some group-tracking fields will be also introduced to support
> > multi-devices group.
> >
> > Each registered device is represented by iommu_dev:
> >
> > struct iommu_dev {
> > struct iommu_ctx *ctx;
> > // always be the physical device
> > struct device *device;
> > u64 cookie;
> > struct kref kref;
> > };
> >
> > A successful binding establishes a security context for the bound
> > device and returns struct iommu_dev pointer to the caller. After this
> > point, the user is allowed to query device capabilities via IOMMU_
> > DEVICE_GET_INFO.
> >
> > For mdev the struct device should be the pointer to the parent device.
>
>
> So we'll have a VFIO_DEVICE_BIND_IOMMU_FD ioctl where the user
> provides
> the iommu_fd and a cookie. vfio will use iommu_ctx_fdget() to get an
> iommu_ctx* for that iommu_fd, then we'll call iommu_register_device()
> using that iommu_ctx* we got from the iommu_fd, the cookie provided by
> the user, and for an mdev, the parent of the device the user owns
> (the device_fd on which this ioctl is called)...
>
> How does an arbitrary user provided cookie let you differentiate that
> the request is actually for an mdev versus the parent device itself?
>
> For instance, how can the IOMMU layer distinguish GVT-g (mdev) vs GVT-d
> (direct assignment) when both use the same struct device* and cookie is
> just a user provided value? Still confused. Thanks,
>

GVT-g is a special case here since it's a purely software-emulated mdev
that reuses the default domain of the parent device. In this case the
IOASID is treated as metadata for the GVT-g device driver to conduct DMA
isolation in software. We won't install a new page table in the IOMMU
just for a GVT-g mdev (this does remind me of a missing flag in the
attaching call to indicate this requirement).

What you really care about is SIOV mdev (with PASID-granular
DMA isolation in the IOMMU) and its parent. In this case mdev and
parent assignment are exclusive. When the parent is already assigned
to a user, it's not managed by the kernel anymore thus no mdev
per se. If an mdev is created then it implies that the parent must be
managed by the kernel. In either case the user-provided cookie is
contained only within the IOMMU fd. When calling IOMMU-API, it's
always about the routing information (RID, or RID+PASID) provided
in the attaching call.

Thanks
Kevin

2021-07-12 23:57:51

by Tian, Kevin

[permalink] [raw]
Subject: RE: [RFC v2] /dev/iommu uAPI proposal

> From: Alex Williamson <[email protected]>
> Sent: Tuesday, July 13, 2021 2:42 AM
>
> On Mon, 12 Jul 2021 01:22:11 +0000
> "Tian, Kevin" <[email protected]> wrote:
> > > From: Alex Williamson <[email protected]>
> > > Sent: Saturday, July 10, 2021 5:51 AM
> > > On Fri, 9 Jul 2021 07:48:44 +0000
> > > "Tian, Kevin" <[email protected]> wrote:
>
> > > > For mdev the struct device should be the pointer to the parent device.
> > >
> > > I don't get how iommu_register_device() differentiates an mdev from a
> > > pdev in this case.
> >
> > via device cookie.
>
>
> Let me re-add this section for more context:
>
> > 3. Sample structures and helper functions
> > --------------------------------------------------------
> >
> > Three helper functions are provided to support VFIO_BIND_IOMMU_FD:
> >
> > struct iommu_ctx *iommu_ctx_fdget(int fd);
> > struct iommu_dev *iommu_register_device(struct iommu_ctx *ctx,
> > struct device *device, u64 cookie);
> > int iommu_unregister_device(struct iommu_dev *dev);
> >
> > An iommu_ctx is created for each fd:
> >
> > struct iommu_ctx {
> > // a list of allocated IOASID data's
> > struct xarray ioasid_xa;
> >
> > // a list of registered devices
> > struct xarray dev_xa;
> > };
> >
> > Later some group-tracking fields will be also introduced to support
> > multi-devices group.
> >
> > Each registered device is represented by iommu_dev:
> >
> > struct iommu_dev {
> > struct iommu_ctx *ctx;
> > // always be the physical device
> > struct device *device;
> > u64 cookie;
> > struct kref kref;
> > };
> >
> > A successful binding establishes a security context for the bound
> > device and returns struct iommu_dev pointer to the caller. After this
> > point, the user is allowed to query device capabilities via IOMMU_
> > DEVICE_GET_INFO.
> >
> > For mdev the struct device should be the pointer to the parent device.
>
>
> So we'll have a VFIO_DEVICE_BIND_IOMMU_FD ioctl where the user
> provides
> the iommu_fd and a cookie. vfio will use iommu_ctx_fdget() to get an
> iommu_ctx* for that iommu_fd, then we'll call iommu_register_device()
> using that iommu_ctx* we got from the iommu_fd, the cookie provided by
> the user, and for an mdev, the parent of the device the user owns
> (the device_fd on which this ioctl is called)...
>
> How does an arbitrary user provided cookie let you differentiate that
> the request is actually for an mdev versus the parent device itself?
>

Maybe I misunderstood your question. Are you specifically worried
about establishing the security context for an mdev vs. for its parent?
At least in concept we should not change the security context of
the parent if this binding call is just for the mdev. And the mdev will
be in a security context as long as the associated PASID entry is
disabled at binding time. If this is the case, possibly we also need
VFIO to provide a defPASID marking the mdev when calling
iommu_register_device(); the IOMMU fd then also provides that defPASID
when calling the IOMMU API to establish the security context.

Thanks,
Kevin

2021-07-13 12:55:52

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [RFC v2] /dev/iommu uAPI proposal

On Mon, Jul 12, 2021 at 11:56:24PM +0000, Tian, Kevin wrote:

> Maybe I misunderstood your question. Are you specifically worried
> about establishing the security context for a mdev vs. for its
> parent?

The way to think about the cookie, and the device bind/attach in
general, is as taking control of a portion of the IOMMU routing:

- RID
- RID + PASID
- "software"

For the first two there can be only one device attachment per value so
the cookie is unambiguous.

For "software" the iommu layer has little to do with this - everything
is constructed outside by the mdev. If the mdev wishes to communicate
on /dev/iommu using the cookie then it has to do so using some iommufd
api and we can convay the proper device at that point.

Kevin didn't show it, but alongside the PCI attaches:

struct iommu_attach_data * iommu_pci_device_attach(
struct iommu_dev *dev, struct pci_device *pdev,
u32 ioasid);

There would also be a software attach for mdev:

struct iommu_attach_data * iommu_sw_device_attach(
struct iommu_dev *dev, struct device *pdev, u32 ioasid);

Which does not connect anything to the iommu layer.

It would have to return something that allows querying the IO page
table, and the mdev would use that API instead of vfio_pin_pages().
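
A hedged sketch of what that query/pin helper might look like; the name
and signature below are purely illustrative, not something defined by
this proposal:

/* Hypothetical replacement for vfio_pin_pages(): pin and translate an
 * IOVA range through the I/O page table behind the attach handle
 * returned by iommu_sw_device_attach()
 */
int iommu_attach_pin_pages(struct iommu_attach_data *attach,
unsigned long iova, unsigned long npages,
struct page **pages);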

Jason

2021-07-13 16:27:54

by Alex Williamson

[permalink] [raw]
Subject: Re: [RFC v2] /dev/iommu uAPI proposal

On Tue, 13 Jul 2021 09:55:03 -0300
Jason Gunthorpe <[email protected]> wrote:

> On Mon, Jul 12, 2021 at 11:56:24PM +0000, Tian, Kevin wrote:
>
> > Maybe I misunderstood your question. Are you specifically worried
> > about establishing the security context for a mdev vs. for its
> > parent?
>
> The way to think about the cookie, and the device bind/attach in
> general, is as taking control of a portion of the IOMMU routing:
>
> - RID
> - RID + PASID
> - "software"
>
> For the first two there can be only one device attachment per value so
> the cookie is unambiguous.
>
> For "software" the iommu layer has little to do with this - everything
> is constructed outside by the mdev. If the mdev wishes to communicate
> on /dev/iommu using the cookie then it has to do so using some iommufd
> api and we can convay the proper device at that point.
>
> Kevin didn't show it, but along side the PCI attaches:
>
> struct iommu_attach_data * iommu_pci_device_attach(
> struct iommu_dev *dev, struct pci_device *pdev,
> u32 ioasid);
>
> There would also be a software attach for mdev:
>
> struct iommu_attach_data * iommu_sw_device_attach(
> struct iommu_dev *dev, struct device *pdev, u32 ioasid);
>
> Which does not connect anything to the iommu layer.
>
> It would have to return something that allows querying the IO page
> table, and the mdev would use that API instead of vfio_pin_pages().


Quoting this proposal again:

> 1) A successful binding call for the first device in the group creates
> the security context for the entire group, by:
>
> * Verifying group viability in a similar way as VFIO does;
>
> * Calling IOMMU-API to move the group into a block-dma state,
> which makes all devices in the group attached to an block-dma
> domain with an empty I/O page table;
>
> VFIO should not allow the user to mmap the MMIO bar of the bound
> device until the binding call succeeds.

The attach step is irrelevant to my question; the bind step is where
the device/group gets into a secure state for device access.

So for IGD we have two scenarios, direct assignment and software mdevs.

AIUI the operation of VFIO_DEVICE_BIND_IOMMU_FD looks like this:

iommu_ctx = iommu_ctx_fdget(iommu_fd);

mdev = mdev_from_dev(vdev->dev);
dev = mdev ? mdev_parent_dev(mdev) : vdev->dev;

iommu_dev = iommu_register_device(iommu_ctx, dev, cookie);

In either case, this last line is either registering the IGD itself
(ie. the struct device representing PCI device 0000:00:02.0) or the
parent of the GVT-g mdev (ie. the struct device representing PCI device
0000:00:02.0). They're the same! AIUI, the cookie is simply an
arbitrary user generated value which they'll use to refer to this
device via the iommu_fd uAPI.

So what magic is iommu_register_device() doing to infer my intentions
as to whether I'm asking for the IGD RID to be isolated or I'm only
creating a software context for an mdev? Thanks,

Alex

2021-07-13 16:37:24

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [RFC v2] /dev/iommu uAPI proposal

On Tue, Jul 13, 2021 at 10:26:07AM -0600, Alex Williamson wrote:
> Quoting this proposal again:
>
> > 1) A successful binding call for the first device in the group creates
> > the security context for the entire group, by:
> >
> > * Verifying group viability in a similar way as VFIO does;
> >
> > * Calling IOMMU-API to move the group into a block-dma state,
> > which makes all devices in the group attached to an block-dma
> > domain with an empty I/O page table;
> >
> > VFIO should not allow the user to mmap the MMIO bar of the bound
> > device until the binding call succeeds.
>
> The attach step is irrelevant to my question, the bind step is where
> the device/group gets into a secure state for device access.

Binding is similar to attach; it will need to indicate the driver's
intention, and a SW driver will not attach to the PCI device underneath
it.

> AIUI the operation of VFIO_DEVICE_BIND_IOMMU_FD looks like this:
>
> iommu_ctx = iommu_ctx_fdget(iommu_fd);
>
> mdev = mdev_from_dev(vdev->dev);
> dev = mdev ? mdev_parent_dev(mdev) : vdev->dev;
>
> iommu_dev = iommu_register_device(iommu_ctx, dev, cookie);

A default of binding to vdev->dev might turn out to be OK, but this
needs to be an overridable op in vfio_device and the SW mdevs will
have to do some 'iommu_register_sw_device()' and not pass in a dev at
all.

Jason

2021-07-13 22:50:10

by Tian, Kevin

[permalink] [raw]
Subject: RE: [RFC v2] /dev/iommu uAPI proposal

> From: Jason Gunthorpe <[email protected]>
> Sent: Wednesday, July 14, 2021 12:33 AM
>
> On Tue, Jul 13, 2021 at 10:26:07AM -0600, Alex Williamson wrote:
> > Quoting this proposal again:
> >
> > > 1) A successful binding call for the first device in the group creates
> > > the security context for the entire group, by:
> > >
> > > * Verifying group viability in a similar way as VFIO does;
> > >
> > > * Calling IOMMU-API to move the group into a block-dma state,
> > > which makes all devices in the group attached to an block-dma
> > > domain with an empty I/O page table;
> > >
> > > VFIO should not allow the user to mmap the MMIO bar of the bound
> > > device until the binding call succeeds.
> >
> > The attach step is irrelevant to my question, the bind step is where
> > the device/group gets into a secure state for device access.
>
> Binding is similar to attach, it will need to indicate the drivers
> intention and a SW driver will not attach to the PCI device underneath
> it.

Yes. I need to clarify this part in the next version. In v1 the binding
operation was purely a software operation within the IOMMU fd, so there
was no intention to differentiate device types in this step. But now
with v2 the binding actually involves calling the IOMMU API for devices
other than sw mdev. Then we do need similar per-type binding wrappers
as defined for the attaching calls.

>
> > AIUI the operation of VFIO_DEVICE_BIND_IOMMU_FD looks like this:
> >
> > iommu_ctx = iommu_ctx_fdget(iommu_fd);
> >
> > mdev = mdev_from_dev(vdev->dev);
> > dev = mdev ? mdev_parent_dev(mdev) : vdev->dev;
> >
> > iommu_dev = iommu_register_device(iommu_ctx, dev, cookie);
>
> A default of binding to vdev->dev might turn out to be OK, but this
> needs to be an overridable op in vfio_device and the SW mdevs will
> have to do some 'iommu_register_sw_device()' and not pass in a dev at
> all.
>

We can still bind to the parent with the cookie, but with
iommu_register_sw_device() the IOMMU fd knows that this binding doesn't
need to establish any security context via the IOMMU API.

Thanks
Kevin

2021-07-13 23:04:46

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [RFC v2] /dev/iommu uAPI proposal

On Tue, Jul 13, 2021 at 10:48:38PM +0000, Tian, Kevin wrote:

> We can still bind to the parent with cookie, but with
> iommu_register_ sw_device() IOMMU fd knows that this binding doesn't
> need to establish any security context via IOMMU API.

AFAIK there is no reason to involve the parent PCI or other device in
SW mode. The iommufd doesn't need to be aware of anything there.

Jason

2021-07-13 23:21:40

by Tian, Kevin

[permalink] [raw]
Subject: RE: [RFC v2] /dev/iommu uAPI proposal

> From: Jason Gunthorpe <[email protected]>
> Sent: Wednesday, July 14, 2021 7:03 AM
>
> On Tue, Jul 13, 2021 at 10:48:38PM +0000, Tian, Kevin wrote:
>
> > We can still bind to the parent with cookie, but with
> > iommu_register_ sw_device() IOMMU fd knows that this binding doesn't
> > need to establish any security context via IOMMU API.
>
> AFAIK there is no reason to involve the parent PCI or other device in
> SW mode. The iommufd doesn't need to be aware of anything there.
>

Yes, but does it make sense to have a unified model in the IOMMU fd
which always has a [struct device, cookie] with flags to indicate whether
the binding/attaching should be specially handled for sw mdev? Or
are you suggesting that the lack of a struct device is actually the
indicator for such a trick?

Thanks
Kevin

2021-07-13 23:24:11

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [RFC v2] /dev/iommu uAPI proposal

On Tue, Jul 13, 2021 at 11:20:12PM +0000, Tian, Kevin wrote:
> > From: Jason Gunthorpe <[email protected]>
> > Sent: Wednesday, July 14, 2021 7:03 AM
> >
> > On Tue, Jul 13, 2021 at 10:48:38PM +0000, Tian, Kevin wrote:
> >
> > > We can still bind to the parent with cookie, but with
> > > iommu_register_ sw_device() IOMMU fd knows that this binding doesn't
> > > need to establish any security context via IOMMU API.
> >
> > AFAIK there is no reason to involve the parent PCI or other device in
> > SW mode. The iommufd doesn't need to be aware of anything there.
> >
>
> Yes. but does it makes sense to have an unified model in IOMMU fd
> which always have a [struct device, cookie] with flags to indicate whether
> the binding/attaching should be specially handled for sw mdev? Or
> are you suggesting that lacking of struct device is actually the indicator
> for such trick?

I think you've veered into such micro implementation details that it
is better to wait and see how things look.

The important point here is that whatever physical device is under a
SW mdev does not need to be passed to the iommufd because there is
nothing it can do with that information.

Jason

2021-07-13 23:26:10

by Tian, Kevin

[permalink] [raw]
Subject: RE: [RFC v2] /dev/iommu uAPI proposal

> From: Jason Gunthorpe <[email protected]>
> Sent: Wednesday, July 14, 2021 7:23 AM
>
> On Tue, Jul 13, 2021 at 11:20:12PM +0000, Tian, Kevin wrote:
> > > From: Jason Gunthorpe <[email protected]>
> > > Sent: Wednesday, July 14, 2021 7:03 AM
> > >
> > > On Tue, Jul 13, 2021 at 10:48:38PM +0000, Tian, Kevin wrote:
> > >
> > > > We can still bind to the parent with cookie, but with
> > > > iommu_register_ sw_device() IOMMU fd knows that this binding
> doesn't
> > > > need to establish any security context via IOMMU API.
> > >
> > > AFAIK there is no reason to involve the parent PCI or other device in
> > > SW mode. The iommufd doesn't need to be aware of anything there.
> > >
> >
> > Yes. but does it makes sense to have an unified model in IOMMU fd
> > which always have a [struct device, cookie] with flags to indicate whether
> > the binding/attaching should be specially handled for sw mdev? Or
> > are you suggesting that lacking of struct device is actually the indicator
> > for such trick?
>
> I think you've veered into such micro implementation details that it
> is better to wait and see how things look.
>
> The important point here is that whatever physical device is under a
> SW mdev does not need to be passed to the iommufd because there is
> nothing it can do with that information.
>

Makes sense.

2021-07-15 03:22:20

by Shenming Lu

[permalink] [raw]
Subject: Re: [RFC v2] /dev/iommu uAPI proposal

On 2021/7/9 15:48, Tian, Kevin wrote:
> 4.6. I/O page fault
> +++++++++++++++++++
>
> uAPI is TBD. Here is just about the high-level flow from host IOMMU driver
> to guest IOMMU driver and backwards. This flow assumes that I/O page faults
> are reported via IOMMU interrupts. Some devices report faults via device
> specific way instead of going through the IOMMU. That usage is not covered
> here:
>
> - Host IOMMU driver receives a I/O page fault with raw fault_data {rid,
> pasid, addr};
>
> - Host IOMMU driver identifies the faulting I/O page table according to
> {rid, pasid} and calls the corresponding fault handler with an opaque
> object (registered by the handler) and raw fault_data (rid, pasid, addr);
>
> - IOASID fault handler identifies the corresponding ioasid and device
> cookie according to the opaque object, generates an user fault_data
> (ioasid, cookie, addr) in the fault region, and triggers eventfd to
> userspace;
>

Hi, I have some doubts here:

For mdev, it seems that the rid in the raw fault_data is the parent
device's; in the vSVA scenario, how can we then identify the mdev
(cookie) from the rid and pasid?

And from this point of view, would it be better to register the mdev
(iommu_register_device()) with the parent device info?

Thanks,
Shenming

2021-07-15 06:24:29

by Tian, Kevin

[permalink] [raw]
Subject: RE: [RFC v2] /dev/iommu uAPI proposal

> From: Shenming Lu <[email protected]>
> Sent: Thursday, July 15, 2021 11:21 AM
>
> On 2021/7/9 15:48, Tian, Kevin wrote:
> > 4.6. I/O page fault
> > +++++++++++++++++++
> >
> > uAPI is TBD. Here is just about the high-level flow from host IOMMU driver
> > to guest IOMMU driver and backwards. This flow assumes that I/O page
> faults
> > are reported via IOMMU interrupts. Some devices report faults via device
> > specific way instead of going through the IOMMU. That usage is not
> covered
> > here:
> >
> > - Host IOMMU driver receives a I/O page fault with raw fault_data {rid,
> > pasid, addr};
> >
> > - Host IOMMU driver identifies the faulting I/O page table according to
> > {rid, pasid} and calls the corresponding fault handler with an opaque
> > object (registered by the handler) and raw fault_data (rid, pasid, addr);
> >
> > - IOASID fault handler identifies the corresponding ioasid and device
> > cookie according to the opaque object, generates an user fault_data
> > (ioasid, cookie, addr) in the fault region, and triggers eventfd to
> > userspace;
> >
>
> Hi, I have some doubts here:
>
> For mdev, it seems that the rid in the raw fault_data is the parent device's,
> then in the vSVA scenario, how can we get to know the mdev(cookie) from
> the
> rid and pasid?
>
> And from this point of view,would it be better to register the mdev
> (iommu_register_device()) with the parent device info?
>

This is what is proposed in this RFC. A successful binding generates a new
iommu_dev object for each vfio device. For mdev this object includes
its parent device, the defPASID marking this mdev, and the cookie
representing it in userspace. Later this iommu_dev is recorded in
the attaching_data when the mdev is attached to an IOASID:

struct iommu_attach_data *__iommu_device_attach(
struct iommu_dev *dev, u32 ioasid, u32 pasid, int flags);

Then when a fault is reported, the fault handler just needs to figure out
the iommu_dev according to {rid, pasid} in the raw fault data.
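
A rough sketch of that fault-path lookup, assuming the attach data is
indexed by a {rid, pasid} key; the attach_xa xarray, the
rid_pasid_key() helper and the field names are all illustrative:

/* Hypothetical: map raw {rid, pasid} back to the attach data recorded
 * at attach time, then to the user-visible {ioasid, cookie}
 */
struct iommu_attach_data *att;

att = xa_load(&ctx->attach_xa, rid_pasid_key(rid, pasid));
if (att) {
fault.ioasid = att->ioasid;
fault.cookie = att->dev->cookie;
fault.addr = addr;
/* write fault to the fault region and trigger eventfd, as in 4.6 */
}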

Thanks
Kevin

2021-07-15 08:03:15

by Shenming Lu

[permalink] [raw]
Subject: Re: [RFC v2] /dev/iommu uAPI proposal

On 2021/7/15 11:55, Tian, Kevin wrote:
>> From: Shenming Lu <[email protected]>
>> Sent: Thursday, July 15, 2021 11:21 AM
>>
>> On 2021/7/9 15:48, Tian, Kevin wrote:
>>> 4.6. I/O page fault
>>> +++++++++++++++++++
>>>
>>> uAPI is TBD. Here is just about the high-level flow from host IOMMU driver
>>> to guest IOMMU driver and backwards. This flow assumes that I/O page
>> faults
>>> are reported via IOMMU interrupts. Some devices report faults via device
>>> specific way instead of going through the IOMMU. That usage is not
>> covered
>>> here:
>>>
>>> - Host IOMMU driver receives a I/O page fault with raw fault_data {rid,
>>> pasid, addr};
>>>
>>> - Host IOMMU driver identifies the faulting I/O page table according to
>>> {rid, pasid} and calls the corresponding fault handler with an opaque
>>> object (registered by the handler) and raw fault_data (rid, pasid, addr);
>>>
>>> - IOASID fault handler identifies the corresponding ioasid and device
>>> cookie according to the opaque object, generates an user fault_data
>>> (ioasid, cookie, addr) in the fault region, and triggers eventfd to
>>> userspace;
>>>
>>
>> Hi, I have some doubts here:
>>
>> For mdev, it seems that the rid in the raw fault_data is the parent device's,
>> then in the vSVA scenario, how can we get to know the mdev(cookie) from
>> the
>> rid and pasid?
>>
>> And from this point of view,would it be better to register the mdev
>> (iommu_register_device()) with the parent device info?
>>
>
> This is what is proposed in this RFC. A successful binding generates a new
> iommu_dev object for each vfio device. For mdev this object includes
> its parent device, the defPASID marking this mdev, and the cookie
> representing it in userspace. Later it is iommu_dev being recorded in
> the attaching_data when the mdev is attached to an IOASID:
>
> struct iommu_attach_data *__iommu_device_attach(
> struct iommu_dev *dev, u32 ioasid, u32 pasid, int flags);
>
> Then when a fault is reported, the fault handler just needs to figure out
> iommu_dev according to {rid, pasid} in the raw fault data.
>

Yeah, we have the defPASID that marks the mdev and refers to the default
I/O address space, but how about the non-default I/O address spaces?
Is there a case where two different mdevs (on the same parent device)
are used by the same process in the guest, and thus have the same pasid
route in the physical IOMMU? It seems that we can't figure out the mdev
from the rid and pasid in this case...

Did I misunderstand something?... :-)

Thanks,
Shenming

2021-07-15 09:52:11

by Tian, Kevin

[permalink] [raw]
Subject: RE: [RFC v2] /dev/iommu uAPI proposal

> From: Shenming Lu <[email protected]>
> Sent: Thursday, July 15, 2021 2:29 PM
>
> On 2021/7/15 11:55, Tian, Kevin wrote:
> >> From: Shenming Lu <[email protected]>
> >> Sent: Thursday, July 15, 2021 11:21 AM
> >>
> >> On 2021/7/9 15:48, Tian, Kevin wrote:
> >>> 4.6. I/O page fault
> >>> +++++++++++++++++++
> >>>
> >>> uAPI is TBD. Here is just about the high-level flow from host IOMMU
> driver
> >>> to guest IOMMU driver and backwards. This flow assumes that I/O page
> >> faults
> >>> are reported via IOMMU interrupts. Some devices report faults via
> device
> >>> specific way instead of going through the IOMMU. That usage is not
> >> covered
> >>> here:
> >>>
> >>> - Host IOMMU driver receives a I/O page fault with raw fault_data {rid,
> >>> pasid, addr};
> >>>
> >>> - Host IOMMU driver identifies the faulting I/O page table according to
> >>> {rid, pasid} and calls the corresponding fault handler with an opaque
> >>> object (registered by the handler) and raw fault_data (rid, pasid, addr);
> >>>
> >>> - IOASID fault handler identifies the corresponding ioasid and device
> >>> cookie according to the opaque object, generates an user fault_data
> >>> (ioasid, cookie, addr) in the fault region, and triggers eventfd to
> >>> userspace;
> >>>
> >>
> >> Hi, I have some doubts here:
> >>
> >> For mdev, it seems that the rid in the raw fault_data is the parent device's,
> >> then in the vSVA scenario, how can we get to know the mdev(cookie) from
> >> the
> >> rid and pasid?
> >>
> >> And from this point of view,would it be better to register the mdev
> >> (iommu_register_device()) with the parent device info?
> >>
> >
> > This is what is proposed in this RFC. A successful binding generates a new
> > iommu_dev object for each vfio device. For mdev this object includes
> > its parent device, the defPASID marking this mdev, and the cookie
> > representing it in userspace. Later it is iommu_dev being recorded in
> > the attaching_data when the mdev is attached to an IOASID:
> >
> > struct iommu_attach_data *__iommu_device_attach(
> > struct iommu_dev *dev, u32 ioasid, u32 pasid, int flags);
> >
> > Then when a fault is reported, the fault handler just needs to figure out
> > iommu_dev according to {rid, pasid} in the raw fault data.
> >
>
> Yeah, we have the defPASID that marks the mdev and refers to the default
> I/O address space, but how about the non-default I/O address spaces?
> Is there a case that two different mdevs (on the same parent device)
> are used by the same process in the guest, thus have a same pasid route
> in the physical IOMMU? It seems that we can't figure out the mdev from
> the rid and pasid in this case...
>
> Did I misunderstand something?... :-)
>

No. You are right on this case. I don't think there is a way to
differentiate one mdev from the other if they come from the
same parent and are attached by the same guest process. In this
case the fault could be reported on either mdev (e.g. the first
matching one) to get it fixed in the guest.

Thanks
Kevin

2021-07-15 10:37:23

by Shenming Lu

[permalink] [raw]
Subject: Re: [RFC v2] /dev/iommu uAPI proposal

On 2021/7/15 14:49, Tian, Kevin wrote:
>> From: Shenming Lu <[email protected]>
>> Sent: Thursday, July 15, 2021 2:29 PM
>>
>> On 2021/7/15 11:55, Tian, Kevin wrote:
>>>> From: Shenming Lu <[email protected]>
>>>> Sent: Thursday, July 15, 2021 11:21 AM
>>>>
>>>> On 2021/7/9 15:48, Tian, Kevin wrote:
>>>>> 4.6. I/O page fault
>>>>> +++++++++++++++++++
>>>>>
>>>>> uAPI is TBD. Here is just about the high-level flow from host IOMMU
>> driver
>>>>> to guest IOMMU driver and backwards. This flow assumes that I/O page
>>>> faults
>>>>> are reported via IOMMU interrupts. Some devices report faults via
>> device
>>>>> specific way instead of going through the IOMMU. That usage is not
>>>> covered
>>>>> here:
>>>>>
>>>>> - Host IOMMU driver receives a I/O page fault with raw fault_data {rid,
>>>>> pasid, addr};
>>>>>
>>>>> - Host IOMMU driver identifies the faulting I/O page table according to
>>>>> {rid, pasid} and calls the corresponding fault handler with an opaque
>>>>> object (registered by the handler) and raw fault_data (rid, pasid, addr);
>>>>>
>>>>> - IOASID fault handler identifies the corresponding ioasid and device
>>>>> cookie according to the opaque object, generates an user fault_data
>>>>> (ioasid, cookie, addr) in the fault region, and triggers eventfd to
>>>>> userspace;
>>>>>
>>>>
>>>> Hi, I have some doubts here:
>>>>
>>>> For mdev, it seems that the rid in the raw fault_data is the parent device's,
>>>> then in the vSVA scenario, how can we get to know the mdev(cookie) from
>>>> the
>>>> rid and pasid?
>>>>
>>>> And from this point of view,would it be better to register the mdev
>>>> (iommu_register_device()) with the parent device info?
>>>>
>>>
>>> This is what is proposed in this RFC. A successful binding generates a new
>>> iommu_dev object for each vfio device. For mdev this object includes
>>> its parent device, the defPASID marking this mdev, and the cookie
>>> representing it in userspace. Later it is iommu_dev being recorded in
>>> the attaching_data when the mdev is attached to an IOASID:
>>>
>>> struct iommu_attach_data *__iommu_device_attach(
>>> struct iommu_dev *dev, u32 ioasid, u32 pasid, int flags);
>>>
>>> Then when a fault is reported, the fault handler just needs to figure out
>>> iommu_dev according to {rid, pasid} in the raw fault data.
>>>
>>
>> Yeah, we have the defPASID that marks the mdev and refers to the default
>> I/O address space, but how about the non-default I/O address spaces?
>> Is there a case that two different mdevs (on the same parent device)
>> are used by the same process in the guest, thus have a same pasid route
>> in the physical IOMMU? It seems that we can't figure out the mdev from
>> the rid and pasid in this case...
>>
>> Did I misunderstand something?... :-)
>>
>
> No. You are right on this case. I don't think there is a way to
> differentiate one mdev from the other if they come from the
> same parent and attached by the same guest process. In this
> case the fault could be reported on either mdev (e.g. the first
> matching one) to get it fixed in the guest.
>

OK. Thanks,

Shenming

2021-07-15 14:11:46

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [RFC v2] /dev/iommu uAPI proposal

On Thu, Jul 15, 2021 at 06:49:54AM +0000, Tian, Kevin wrote:

> No. You are right on this case. I don't think there is a way to
> differentiate one mdev from the other if they come from the
> same parent and attached by the same guest process. In this
> case the fault could be reported on either mdev (e.g. the first
> matching one) to get it fixed in the guest.

If the IOMMU can't distinguish the two mdevs they are not isolated
and would have to share a group. Since group sharing is not supported
today this seems like a non-issue

Jason

2021-07-15 16:19:01

by Raj, Ashok

[permalink] [raw]
Subject: Re: [RFC v2] /dev/iommu uAPI proposal

On Thu, Jul 15, 2021 at 09:48:13AM -0300, Jason Gunthorpe wrote:
> On Thu, Jul 15, 2021 at 06:49:54AM +0000, Tian, Kevin wrote:
>
> > No. You are right on this case. I don't think there is a way to
> > differentiate one mdev from the other if they come from the
> > same parent and attached by the same guest process. In this
> > case the fault could be reported on either mdev (e.g. the first
> > matching one) to get it fixed in the guest.
>
> If the IOMMU can't distinguish the two mdevs they are not isolated
> and would have to share a group. Since group sharing is not supported
> today this seems like a non-issue

Does this mean we have to prevent 2 mdev's from same pdev being assigned to
the same guest?

2021-07-15 16:51:12

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [RFC v2] /dev/iommu uAPI proposal

On Thu, Jul 15, 2021 at 06:57:57AM -0700, Raj, Ashok wrote:
> On Thu, Jul 15, 2021 at 09:48:13AM -0300, Jason Gunthorpe wrote:
> > On Thu, Jul 15, 2021 at 06:49:54AM +0000, Tian, Kevin wrote:
> >
> > > No. You are right on this case. I don't think there is a way to
> > > differentiate one mdev from the other if they come from the
> > > same parent and attached by the same guest process. In this
> > > case the fault could be reported on either mdev (e.g. the first
> > > matching one) to get it fixed in the guest.
> >
> > If the IOMMU can't distinguish the two mdevs they are not isolated
> > and would have to share a group. Since group sharing is not supported
> > today this seems like a non-issue
>
> Does this mean we have to prevent 2 mdev's from same pdev being assigned to
> the same guest?

No, it means that the IOMMU layer has to be able to distinguish them.

This either means they are "SW mdevs" which do not involve the IOMMU
layer and put both the responsibility for isolation and identification
on the mdev driver.

Or they are some "PASID mdev" which does allow the IOMMU to isolate
them.

What can't happen is to comingle /dev/iommu control over the pdev
between two mdevs.

i.e. we can't talk about faults for IOMMU on SW mdevs - faults do not
come from the IOMMU layer, they have to come from inside the mdev
itself, somehow.

Jason

2021-07-15 17:28:08

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [RFC v2] /dev/iommu uAPI proposal

On Thu, Jul 15, 2021 at 09:21:41AM -0700, Raj, Ashok wrote:
> On Thu, Jul 15, 2021 at 12:23:25PM -0300, Jason Gunthorpe wrote:
> > On Thu, Jul 15, 2021 at 06:57:57AM -0700, Raj, Ashok wrote:
> > > On Thu, Jul 15, 2021 at 09:48:13AM -0300, Jason Gunthorpe wrote:
> > > > On Thu, Jul 15, 2021 at 06:49:54AM +0000, Tian, Kevin wrote:
> > > >
> > > > > No. You are right on this case. I don't think there is a way to
> > > > > differentiate one mdev from the other if they come from the
> > > > > same parent and attached by the same guest process. In this
> > > > > case the fault could be reported on either mdev (e.g. the first
> > > > > matching one) to get it fixed in the guest.
> > > >
> > > > If the IOMMU can't distinguish the two mdevs they are not isolated
> > > > and would have to share a group. Since group sharing is not supported
> > > > today this seems like a non-issue
> > >
> > > Does this mean we have to prevent 2 mdev's from same pdev being assigned to
> > > the same guest?
> >
> > No, it means that the IOMMU layer has to be able to distinguish them.
>
> Ok, the guest has no control over it, as it see 2 separate pci devices and
> thinks they are all different.
>
> Only time when it can fail is during the bind operation. From guest
> perspective a bind in vIOMMU just turns into a write to local table and a
> invalidate will cause the host to update the real copy from the shadow.
>
> There is no way to fail the bind? and Allocation of the PASID is also a
> separate operation and has no clue how its going to be used in the guest.

You can't attach the same RID to the same PASID twice. The IOMMU code
should prevent this.

As we've talked about several times, it seems to me the vIOMMU
interface is misdesigned for the requirements you have. The hypervisor
should have a role in allocating the PASID since there are invisible
hypervisor restrictions. This is one of them.

> Do we have any isolation requirements here? its the same process. So if the
> page-request it sent to guest and even if you report it for mdev1, after
> the PRQ is resolved by guest, the request from mdev2 from the same guest
> should simply work?

I think we already talked about this and said it should not be done.

Jason

2021-07-15 18:26:53

by Raj, Ashok

[permalink] [raw]
Subject: Re: [RFC v2] /dev/iommu uAPI proposal

On Thu, Jul 15, 2021 at 02:18:26PM -0300, Jason Gunthorpe wrote:
> On Thu, Jul 15, 2021 at 09:21:41AM -0700, Raj, Ashok wrote:
> > On Thu, Jul 15, 2021 at 12:23:25PM -0300, Jason Gunthorpe wrote:
> > > On Thu, Jul 15, 2021 at 06:57:57AM -0700, Raj, Ashok wrote:
> > > > On Thu, Jul 15, 2021 at 09:48:13AM -0300, Jason Gunthorpe wrote:
> > > > > On Thu, Jul 15, 2021 at 06:49:54AM +0000, Tian, Kevin wrote:
> > > > >
> > > > > > No. You are right on this case. I don't think there is a way to
> > > > > > differentiate one mdev from the other if they come from the
> > > > > > same parent and attached by the same guest process. In this
> > > > > > case the fault could be reported on either mdev (e.g. the first
> > > > > > matching one) to get it fixed in the guest.
> > > > >
> > > > > If the IOMMU can't distinguish the two mdevs they are not isolated
> > > > > and would have to share a group. Since group sharing is not supported
> > > > > today this seems like a non-issue
> > > >
> > > > Does this mean we have to prevent 2 mdev's from same pdev being assigned to
> > > > the same guest?
> > >
> > > No, it means that the IOMMU layer has to be able to distinguish them.
> >
> > Ok, the guest has no control over it, as it see 2 separate pci devices and
> > thinks they are all different.
> >
> > Only time when it can fail is during the bind operation. From guest
> > perspective a bind in vIOMMU just turns into a write to local table and a
> > invalidate will cause the host to update the real copy from the shadow.
> >
> > There is no way to fail the bind? and Allocation of the PASID is also a
> > separate operation and has no clue how its going to be used in the guest.
>
> You can't attach the same RID to the same PASID twice. The IOMMU code
> should prevent this.
>
> As we've talked about several times, it seems to me the vIOMMU
> interface is misdesigned for the requirements you have. The hypervisor
> should have a role in allocating the PASID since there are invisible
> hypervisor restrictions. This is one of them.

Allocating a PASID is a separate step from binding, isn't it? In VT-d we
have a virtual command interface that can fail an allocation of a PASID.
But which device it's bound to is a dynamic thing that only gets decided
at bind_mm(), right?

>
> > Do we have any isolation requirements here? its the same process. So if the
> > page-request it sent to guest and even if you report it for mdev1, after
> > the PRQ is resolved by guest, the request from mdev2 from the same guest
> > should simply work?
>
> I think we already talked about this and said it should not be done.

I get the should not be done, I'm wondering where should that be
implemented?

2021-07-15 18:28:29

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [RFC v2] /dev/iommu uAPI proposal

On Thu, Jul 15, 2021 at 10:48:36AM -0700, Raj, Ashok wrote:

> > > Do we have any isolation requirements here? its the same process. So if the
> > > page-request it sent to guest and even if you report it for mdev1, after
> > > the PRQ is resolved by guest, the request from mdev2 from the same guest
> > > should simply work?
> >
> > I think we already talked about this and said it should not be done.
>
> I get the should not be done, I'm wondering where should that be
> implemented?

The iommu layer cannot have ambiguity. Every RID or RID,PASID slot
must have only one device attached to it. Attempting to connect two
devices to the same slot fails on the iommu layer.

So the 2nd mdev will fail during IOASID binding when it tries to bind
to the same PASID that the first mdev is already bound to.
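
A standalone sketch of the kind of check meant here (every name below
is made up, this is not real kernel code):

/* Sketch only: a RID or RID,PASID slot may have at most one owner. */
#include <stdint.h>
#include <errno.h>

#define MAX_SLOTS 64

struct slot {
	uint32_t rid;
	uint32_t pasid;		/* routing info identifying the slot */
	void *owner;		/* the device currently bound to it */
	int used;
};

static struct slot slots[MAX_SLOTS];

/* 0 on success, -EBUSY if another device already owns the slot */
int claim_slot(void *dev, uint32_t rid, uint32_t pasid)
{
	int i, free_idx = -1;

	for (i = 0; i < MAX_SLOTS; i++) {
		if (!slots[i].used) {
			if (free_idx < 0)
				free_idx = i;
			continue;
		}
		if (slots[i].rid == rid && slots[i].pasid == pasid)
			return slots[i].owner == dev ? 0 : -EBUSY;
	}
	if (free_idx < 0)
		return -ENOSPC;
	slots[free_idx] = (struct slot){ .rid = rid, .pasid = pasid,
					 .owner = dev, .used = 1 };
	return 0;
}

The 2nd mdev's attach hits the -EBUSY path because its parent's RID plus
the guest-chosen PASID already has an owner.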

Jason

2021-07-15 18:29:22

by Raj, Ashok

[permalink] [raw]
Subject: Re: [RFC v2] /dev/iommu uAPI proposal

On Thu, Jul 15, 2021 at 02:53:36PM -0300, Jason Gunthorpe wrote:
> On Thu, Jul 15, 2021 at 10:48:36AM -0700, Raj, Ashok wrote:
>
> > > > Do we have any isolation requirements here? its the same process. So if the
> > > > page-request it sent to guest and even if you report it for mdev1, after
> > > > the PRQ is resolved by guest, the request from mdev2 from the same guest
> > > > should simply work?
> > >
> > > I think we already talked about this and said it should not be done.
> >
> > I get the should not be done, I'm wondering where should that be
> > implemented?
>
> The iommu layer cannot have ambiguity. Every RID or RID,PASID slot
> must have only one device attached to it. Attempting to connect two
> devices to the same slot fails on the iommu layer.

I guess we are talking about two different things. I was referring to SVM
side of things. Maybe you are referring to the mdev.

A single guest process should be allowed to work with 2 different
accelerators. The PASID for the process is just 1. Limiting that to just
one accelerator per process seems wrong.

Unless there is something else to prevent this, the best way seems to be
to never expose more than 1 mdev from the same pdev to the same guest. I
think this is a reasonable restriction compared to limiting a process to
bind to no more than 1 accelerator.


>
> So the 2nd mdev will fail during IOASID binding when it tries to bind
> to the same PASID that the first mdev is already bound to.
>
> Jason

2021-07-15 18:30:58

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [RFC v2] /dev/iommu uAPI proposal

On Thu, Jul 15, 2021 at 11:05:45AM -0700, Raj, Ashok wrote:
> On Thu, Jul 15, 2021 at 02:53:36PM -0300, Jason Gunthorpe wrote:
> > On Thu, Jul 15, 2021 at 10:48:36AM -0700, Raj, Ashok wrote:
> >
> > > > > Do we have any isolation requirements here? its the same process. So if the
> > > > > page-request it sent to guest and even if you report it for mdev1, after
> > > > > the PRQ is resolved by guest, the request from mdev2 from the same guest
> > > > > should simply work?
> > > >
> > > > I think we already talked about this and said it should not be done.
> > >
> > > I get the should not be done, I'm wondering where should that be
> > > implemented?
> >
> > The iommu layer cannot have ambiguity. Every RID or RID,PASID slot
> > must have only one device attached to it. Attempting to connect two
> > devices to the same slot fails on the iommu layer.
>
> I guess we are talking about two different things. I was referring to SVM
> side of things. Maybe you are referring to the mdev.

I'm talking about in the hypervisor.

As I've said already, the vIOMMU interface is the problem here. The
guest VM should be able to know that it cannot use PASID 1 with two
devices, like the hypervisor knows. At the very least it should be
able to know that the PASID binding has failed and relay that failure
back to the process.

Ideally the guest would know it should allocate another PASID for
these cases.

But yes, if mdevs are going to be modeled with RIDs in the guest then
with the current vIOMMU we cannot cause a single hypervisor RID to
show up as two RIDs in the guest without breaking the vIOMMU model.

Jason

2021-07-15 21:01:03

by Raj, Ashok

[permalink] [raw]
Subject: Re: [RFC v2] /dev/iommu uAPI proposal

On Thu, Jul 15, 2021 at 12:23:25PM -0300, Jason Gunthorpe wrote:
> On Thu, Jul 15, 2021 at 06:57:57AM -0700, Raj, Ashok wrote:
> > On Thu, Jul 15, 2021 at 09:48:13AM -0300, Jason Gunthorpe wrote:
> > > On Thu, Jul 15, 2021 at 06:49:54AM +0000, Tian, Kevin wrote:
> > >
> > > > No. You are right on this case. I don't think there is a way to
> > > > differentiate one mdev from the other if they come from the
> > > > same parent and attached by the same guest process. In this
> > > > case the fault could be reported on either mdev (e.g. the first
> > > > matching one) to get it fixed in the guest.
> > >
> > > If the IOMMU can't distinguish the two mdevs they are not isolated
> > > and would have to share a group. Since group sharing is not supported
> > > today this seems like a non-issue
> >
> > Does this mean we have to prevent 2 mdev's from same pdev being assigned to
> > the same guest?
>
> No, it means that the IOMMU layer has to be able to distinguish them.

Ok, the guest has no control over it, as it see 2 separate pci devices and
thinks they are all different.

Only time when it can fail is during the bind operation. From guest
perspective a bind in vIOMMU just turns into a write to local table and a
invalidate will cause the host to update the real copy from the shadow.

There is no way to fail the bind? and Allocation of the PASID is also a
separate operation and has no clue how its going to be used in the guest.

>
> This either means they are "SW mdevs" which do not involve the IOMMU
> layer and put both the responsibility for isolation and identification
> on the mdev driver.

When you say SW mdev, is it the GPU-like case where the mdev is purely a
SW construct? Or the SIOV type with RID+PASID?

>
> Or they are some "PASID mdev" which does allow the IOMMU to isolate
> them.
>
> What can't happen is to comingle /dev/iommu control over the pdev
> between two mdevs.
>
> i.e. we can't talk about faults for IOMMU on SW mdevs - faults do not
> come from the IOMMU layer, they have to come from inside the mdev
> itself, somehow.

Recoverable faults for the guest need to be sent to the guest, right? A
page-request from mdev1 and from mdev2 will both look alike when the same
process is using both.

Do we have any isolation requirements here? its the same process. So if the
page-request it sent to guest and even if you report it for mdev1, after
the PRQ is resolved by guest, the request from mdev2 from the same guest
should simply work?


2021-07-16 01:23:19

by Tian, Kevin

[permalink] [raw]
Subject: RE: [RFC v2] /dev/iommu uAPI proposal

> From: Jason Gunthorpe <[email protected]>
> Sent: Friday, July 16, 2021 2:13 AM
>
> On Thu, Jul 15, 2021 at 11:05:45AM -0700, Raj, Ashok wrote:
> > On Thu, Jul 15, 2021 at 02:53:36PM -0300, Jason Gunthorpe wrote:
> > > On Thu, Jul 15, 2021 at 10:48:36AM -0700, Raj, Ashok wrote:
> > >
> > > > > > Do we have any isolation requirements here? its the same process.
> So if the
> > > > > > page-request it sent to guest and even if you report it for mdev1,
> after
> > > > > > the PRQ is resolved by guest, the request from mdev2 from the
> same guest
> > > > > > should simply work?
> > > > >
> > > > > I think we already talked about this and said it should not be done.
> > > >
> > > > I get the should not be done, I'm wondering where should that be
> > > > implemented?
> > >
> > > The iommu layer cannot have ambiguity. Every RID or RID,PASID slot
> > > must have only one device attached to it. Attempting to connect two
> > > devices to the same slot fails on the iommu layer.
> >
> > I guess we are talking about two different things. I was referring to SVM
> > side of things. Maybe you are referring to the mdev.
>
> I'm talking about in the hypervisor.
>
> As I've said already, the vIOMMU interface is the problem here. The
> guest VM should be able to know that it cannot use PASID 1 with two
> devices, like the hypervisor knows. At the very least it should be
> able to know that the PASID binding has failed and relay that failure
> back to the process.
>
> Ideally the guest would know it should allocate another PASID for
> these cases.
>
> But yes, if mdevs are going to be modeled with RIDs in the guest then
> with the current vIOMMU we cannot cause a single hypervisor RID to
> show up as two RIDs in the guest without breaking the vIOMMU model.
>

To summarize, for vIOMMU we can work with the spec owner to
define a proper interface to feedback such restriction into the guest
if necessary. For the kernel part, it's clear that IOMMU fd should
disallow two devices attached to a single [RID] or [RID, PASID] slot
in the first place.

Then the next question is how to communicate such restriction
to the userspace. It sounds like a group, but different in concept.
An iommu group describes the minimal isolation boundary thus all
devices in the group can be only assigned to a single user. But this
case is opposite - the two mdevs (both support ENQCMD submission)
with the same parent have problem when assigned to a single VM
(in this case vPASID is vm-wide translated thus a same pPASID will be
used cross both mdevs) while they instead work pretty well when
assigned to different VMs (completely different vPASID spaces thus
different pPASIDs).

One thought is to have vfio device driver deal with it. In this proposal
it is the vfio device driver to define the PASID virtualization policy and
report it to userspace via VFIO_DEVICE_GET_INFO. The driver understands
the restriction thus could just hide the vPASID capability when the user
calls GET_INFO on the 2nd mdev in above scenario. In this way the
user even doesn't need to know such restriction at all and both mdevs
can be assigned to a single VM w/o any problem.
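
A rough sketch of such a policy on the driver side (the flag and the
helpers below are purely illustrative, not existing VFIO uAPI):

/* Sketch: whether to advertise vPASID in VFIO_DEVICE_GET_INFO. */
#include <stdbool.h>
#include <stdint.h>

#define INFO_FLAG_VPASID	(1u << 0)	/* illustrative flag only */

struct mdev_state {
	const void *parent;	/* parent pdev */
	const void *owner;	/* user (VM/process) that opened the mdev */
	bool pasid_capable;
	bool vpasid_reported;	/* vPASID already advertised to this user */
};

/* true if another mdev on the same parent, opened by the same user,
 * already advertises vPASID - then hide it for this one */
static bool vpasid_taken(const struct mdev_state *m,
			 const struct mdev_state *others, int n)
{
	for (int i = 0; i < n; i++) {
		const struct mdev_state *o = &others[i];

		if (o != m && o->parent == m->parent &&
		    o->owner == m->owner && o->vpasid_reported)
			return true;
	}
	return false;
}

static uint32_t info_flags(struct mdev_state *m,
			   const struct mdev_state *others, int n)
{
	uint32_t flags = 0;

	if (m->pasid_capable && !vpasid_taken(m, others, n)) {
		flags |= INFO_FLAG_VPASID;
		m->vpasid_reported = true;
	}
	return flags;
}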

Does this sound like a reasonable approach?

Thanks
Kevin

2021-07-16 12:21:17

by Shenming Lu

[permalink] [raw]
Subject: Re: [RFC v2] /dev/iommu uAPI proposal

On 2021/7/16 9:20, Tian, Kevin wrote:
> To summarize, for vIOMMU we can work with the spec owner to
> define a proper interface to feedback such restriction into the guest
> if necessary. For the kernel part, it's clear that IOMMU fd should
> disallow two devices attached to a single [RID] or [RID, PASID] slot
> in the first place.
>
> Then the next question is how to communicate such restriction
> to the userspace. It sounds like a group, but different in concept.
> An iommu group describes the minimal isolation boundary thus all
> devices in the group can be only assigned to a single user. But this
> case is opposite - the two mdevs (both support ENQCMD submission)
> with the same parent have problem when assigned to a single VM
> (in this case vPASID is vm-wide translated thus a same pPASID will be
> used cross both mdevs) while they instead work pretty well when
> assigned to different VMs (completely different vPASID spaces thus
> different pPASIDs).
>
> One thought is to have vfio device driver deal with it. In this proposal
> it is the vfio device driver to define the PASID virtualization policy and
> report it to userspace via VFIO_DEVICE_GET_INFO. The driver understands
> the restriction thus could just hide the vPASID capability when the user
> calls GET_INFO on the 2nd mdev in above scenario. In this way the
> user even doesn't need to know such restriction at all and both mdevs
> can be assigned to a single VM w/o any problem.
>

The restriction only probably happens when two mdevs are assigned to one VM,
how could the vfio device driver get to know this info to accurately hide
the vPASID capability for the 2nd mdev when VFIO_DEVICE_GET_INFO? There is no
need to do this in other cases.

Thanks,
Shenming

2021-07-16 18:33:40

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [RFC v2] /dev/iommu uAPI proposal

On Fri, Jul 16, 2021 at 01:20:15AM +0000, Tian, Kevin wrote:

> One thought is to have vfio device driver deal with it. In this proposal
> it is the vfio device driver to define the PASID virtualization policy and
> report it to userspace via VFIO_DEVICE_GET_INFO. The driver understands
> the restriction thus could just hide the vPASID capability when the user
> calls GET_INFO on the 2nd mdev in above scenario. In this way the
> user even doesn't need to know such restriction at all and both mdevs
> can be assigned to a single VM w/o any problem.

I think it makes more sense to expose some kind of "pasid group" to
qemu that identifies that each PASID must be unique across the group.
For vIOMMUs that are doing funky things with the RID, this means a
single PASID group must not be exposed as two RIDs to the guest.

If the kernel blocks it then it can never be fixed by updating the
vIOMMU design.

Jason

2021-07-21 02:13:23

by Tian, Kevin

[permalink] [raw]
Subject: RE: [RFC v2] /dev/iommu uAPI proposal

> From: Jason Gunthorpe <[email protected]>
> Sent: Saturday, July 17, 2021 2:30 AM
>
> On Fri, Jul 16, 2021 at 01:20:15AM +0000, Tian, Kevin wrote:
>
> > One thought is to have vfio device driver deal with it. In this proposal
> > it is the vfio device driver to define the PASID virtualization policy and
> > report it to userspace via VFIO_DEVICE_GET_INFO. The driver understands
> > the restriction thus could just hide the vPASID capability when the user
> > calls GET_INFO on the 2nd mdev in above scenario. In this way the
> > user even doesn't need to know such restriction at all and both mdevs
> > can be assigned to a single VM w/o any problem.
>
> I think it makes more sense to expose some kind of "pasid group" to
> qemu that identifies that each PASID must be unique across the group.
> For vIOMMUs that are doing funky things with the RID, this means a
> single PASID group must not be exposed as two RIDs to the guest.
>

It's an interesting idea. Some aspects are still unclear to me now,
e.g. how to describe such a restriction in a way that it applies only
to a single user owning the group (not the case when assigned to
different users), whether it can be generalized across subsystems
(vPASID being a vfio-managed resource), etc. Let's refine it when
working on the actual implementation.

> If the kernel blocks it then it can never be fixed by updating the
> vIOMMU design.
>

But the mdev driver can choose to do so. Should we prevent it?

btw, just curious whether you have had a chance to do a full review
of v2. I wonder when might be a good time to discuss the execution
plan following this proposal, if no major open remains...

Thanks
Kevin

2021-07-21 02:16:58

by Tian, Kevin

[permalink] [raw]
Subject: RE: [RFC v2] /dev/iommu uAPI proposal

> From: Shenming Lu
> Sent: Friday, July 16, 2021 8:20 PM
>
> On 2021/7/16 9:20, Tian, Kevin wrote:
> > To summarize, for vIOMMU we can work with the spec owner to
> > define a proper interface to feedback such restriction into the guest
> > if necessary. For the kernel part, it's clear that IOMMU fd should
> > disallow two devices attached to a single [RID] or [RID, PASID] slot
> > in the first place.
> >
> > Then the next question is how to communicate such restriction
> > to the userspace. It sounds like a group, but different in concept.
> > An iommu group describes the minimal isolation boundary thus all
> > devices in the group can be only assigned to a single user. But this
> > case is opposite - the two mdevs (both support ENQCMD submission)
> > with the same parent have problem when assigned to a single VM
> > (in this case vPASID is vm-wide translated thus a same pPASID will be
> > used cross both mdevs) while they instead work pretty well when
> > assigned to different VMs (completely different vPASID spaces thus
> > different pPASIDs).
> >
> > One thought is to have vfio device driver deal with it. In this proposal
> > it is the vfio device driver to define the PASID virtualization policy and
> > report it to userspace via VFIO_DEVICE_GET_INFO. The driver understands
> > the restriction thus could just hide the vPASID capability when the user
> > calls GET_INFO on the 2nd mdev in above scenario. In this way the
> > user even doesn't need to know such restriction at all and both mdevs
> > can be assigned to a single VM w/o any problem.
> >
>
> The restriction only probably happens when two mdevs are assigned to one
> VM,
> how could the vfio device driver get to know this info to accurately hide
> the vPASID capability for the 2nd mdev when VFIO_DEVICE_GET_INFO?
> There is no
> need to do this in other cases.
>

I suppose the driver can detect it via whether two mdevs are opened by a
single process.

Thanks
Kevin

2021-07-22 16:34:11

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [RFC v2] /dev/iommu uAPI proposal

On Wed, Jul 21, 2021 at 02:13:23AM +0000, Tian, Kevin wrote:
> > From: Shenming Lu
> > Sent: Friday, July 16, 2021 8:20 PM
> >
> > On 2021/7/16 9:20, Tian, Kevin wrote:
> > > To summarize, for vIOMMU we can work with the spec owner to
> > > define a proper interface to feedback such restriction into the guest
> > > if necessary. For the kernel part, it's clear that IOMMU fd should
> > > disallow two devices attached to a single [RID] or [RID, PASID] slot
> > > in the first place.
> > >
> > > Then the next question is how to communicate such restriction
> > > to the userspace. It sounds like a group, but different in concept.
> > > An iommu group describes the minimal isolation boundary thus all
> > > devices in the group can be only assigned to a single user. But this
> > > case is opposite - the two mdevs (both support ENQCMD submission)
> > > with the same parent have problem when assigned to a single VM
> > > (in this case vPASID is vm-wide translated thus a same pPASID will be
> > > used cross both mdevs) while they instead work pretty well when
> > > assigned to different VMs (completely different vPASID spaces thus
> > > different pPASIDs).
> > >
> > > One thought is to have vfio device driver deal with it. In this proposal
> > > it is the vfio device driver to define the PASID virtualization policy and
> > > report it to userspace via VFIO_DEVICE_GET_INFO. The driver understands
> > > the restriction thus could just hide the vPASID capability when the user
> > > calls GET_INFO on the 2nd mdev in above scenario. In this way the
> > > user even doesn't need to know such restriction at all and both mdevs
> > > can be assigned to a single VM w/o any problem.
> > >
> >
> > The restriction only probably happens when two mdevs are assigned to one
> > VM,
> > how could the vfio device driver get to know this info to accurately hide
> > the vPASID capability for the 2nd mdev when VFIO_DEVICE_GET_INFO?
> > There is no
> > need to do this in other cases.
> >
>
> I suppose the driver can detect it via whether two mdevs are opened by a
> single process.

Just have the kernel report some ID for the PASID numberspace - devices
with the same ID have to be represented as a single RID.
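
e.g. something along these lines in the per-device info reported to
userspace (a sketch only, the field names are not a real proposal):

#include <stdint.h>

struct device_pasid_info {
	uint32_t pasid_group_id;	/* same value = same PASID numberspace */
	uint32_t num_pasid_bits;
};

/* userspace policy: devices with an equal pasid_group_id must be shown
 * to the guest behind one virtual RID */
static inline int must_share_vrid(const struct device_pasid_info *a,
				  const struct device_pasid_info *b)
{
	return a->pasid_group_id == b->pasid_group_id;
}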

Jason

2021-07-26 04:54:35

by David Gibson

[permalink] [raw]
Subject: Re: [RFC v2] /dev/iommu uAPI proposal

On Fri, Jul 09, 2021 at 07:48:44AM +0000, Tian, Kevin wrote:
> /dev/iommu provides an unified interface for managing I/O page tables for
> devices assigned to userspace. Device passthrough frameworks (VFIO, vDPA,
> etc.) are expected to use this interface instead of creating their own logic to
> isolate untrusted device DMAs initiated by userspace.
>
> This proposal describes the uAPI of /dev/iommu and also sample sequences
> with VFIO as example in typical usages. The driver-facing kernel API provided
> by the iommu layer is still TBD, which can be discussed after consensus is
> made on this uAPI.
>
> It's based on a lengthy discussion starting from here:
> https://lore.kernel.org/linux-iommu/[email protected]/
>
> v1 can be found here:
> https://lore.kernel.org/linux-iommu/PH0PR12MB54811863B392C644E5365446DC3E9@PH0PR12MB5481.namprd12.prod.outlook.com/T/
>
> This doc is also tracked on github, though it's not very useful for v1->v2
> given dramatic refactoring:
> https://github.com/luxis1999/dev_iommu_uapi

Thanks for all your work on this, Kevin. Apart from the actual
semantic improvements, I'm finding v2 significantly easier to read and
understand than v1.

[snip]
> 1.2. Attach Device to I/O address space
> +++++++++++++++++++++++++++++++++++++++
>
> Device attach/bind is initiated through passthrough framework uAPI.
>
> Device attaching is allowed only after a device is successfully bound to
> the IOMMU fd. User should provide a device cookie when binding the
> device through VFIO uAPI. This cookie is used when the user queries
> device capability/format, issues per-device iotlb invalidation and
> receives per-device I/O page fault data via IOMMU fd.
>
> Successful binding puts the device into a security context which isolates
> its DMA from the rest system. VFIO should not allow user to access the
> device before binding is completed. Similarly, VFIO should prevent the
> user from unbinding the device before user access is withdrawn.
>
> When a device is in an iommu group which contains multiple devices,
> all devices within the group must enter/exit the security context
> together. Please check {1.3} for more info about group isolation via
> this device-centric design.
>
> Successful attaching activates an I/O address space in the IOMMU,
> if the device is not purely software mediated. VFIO must provide device
> specific routing information for where to install the I/O page table in
> the IOMMU for this device. VFIO must also guarantee that the attached
> device is configured to compose DMAs with the routing information that
> is provided in the attaching call. When handling DMA requests, IOMMU
> identifies the target I/O address space according to the routing
> information carried in the request. Misconfiguration breaks DMA
> isolation thus could lead to severe security vulnerability.
>
> Routing information is per-device and bus specific. For PCI, it is
> Requester ID (RID) identifying the device plus optional Process Address
> Space ID (PASID). For ARM, it is Stream ID (SID) plus optional Sub-Stream
> ID (SSID). PASID or SSID is used when multiple I/O address spaces are
> enabled on a single device. For simplicity and continuity reason the
> following context uses RID+PASID though SID+SSID may sound a clearer
> naming from device p.o.v. We can decide the actual naming when coding.
>
> Because one I/O address space can be attached by multiple devices,
> per-device routing information (plus device cookie) is tracked under
> each IOASID and is used respectively when activating the I/O address
> space in the IOMMU for each attached device.
>
> The device in the /dev/iommu context always refers to a physical one
> (pdev) which is identifiable via RID. Physically each pdev can support
> one default I/O address space (routed via RID) and optionally multiple
> non-default I/O address spaces (via RID+PASID).
>
> The device in VFIO context is a logic concept, being either a physical
> device (pdev) or mediated device (mdev or subdev). Each vfio device
> is represented by RID+cookie in IOMMU fd. User is allowed to create
> one default I/O address space (routed by vRID from user p.o.v) per
> each vfio_device. VFIO decides the routing information for this default
> space based on device type:
>
> 1) pdev, routed via RID;
>
> 2) mdev/subdev with IOMMU-enforced DMA isolation, routed via
> the parent's RID plus the PASID marking this mdev;
>
> 3) a purely sw-mediated device (sw mdev), no routing required i.e. no
> need to install the I/O page table in the IOMMU. sw mdev just uses
> the metadata to assist its internal DMA isolation logic on top of
> the parent's IOMMU page table;
>
> In addition, VFIO may allow user to create additional I/O address spaces
> on a vfio_device based on the hardware capability. In such case the user
> has its own view of the virtual routing information (vPASID) when marking
> these non-default address spaces. How to virtualize vPASID is platform
> specific and device specific. Some platforms allow the user to fully
> manage the PASID space thus vPASIDs are directly used for routing and
> even hidden from the kernel. Other platforms require the user to
> explicitly register the vPASID information to the kernel when attaching
> the vfio_device. In this case VFIO must figure out whether vPASID should
> be directly used (pdev) or converted to a kernel-allocated pPASID (mdev)
> for physical routing. Detail explanation about PASID virtualization can
> be found in {1.4}.
>
> For mdev both default and non-default I/O address spaces are routed
> via PASIDs. To better differentiate them we use "default PASID" (or
> defPASID) when talking about the default I/O address space on mdev. When
> vPASID or pPASID is referred in PASID virtualization it's all about the
> non-default spaces. defPASID and pPASID are always hidden from userspace
> and can only be indirectly referenced via IOASID.

That said, I'm still finding the various ways a device can attach to
an ioasid pretty confusing. Here are some thoughts on some extra
concepts that might make it easier to handle [note, I haven't thought
this all the way through so far, so there might be fatal problems with
this approach].

* DMA address type

This represents the format of the actual "over the wire" DMA
address. So far I only see 3 likely options for this 1) 32-bit,
2) 64-bit and 3) PASID, meaning the 84-bit PASID+address
combination.

* DMA identifier type

This represents the format of the "over the wire"
device-identifying information that the IOMMU receives. So "RID",
"RID+PASID", "SID+SSID" would all be DMA identifier types. We
could introduce some extra ones which might be necessary for
software mdevs.

So, every single DMA transaction has both DMA address and DMA
identifier information attached. In some cases we get to choose how
we split the available information between identifier and address, more
on that later.

* DMA endpoint

An endpoint would represent a DMA origin which is identifiable to
the IOMMU. I'm using the new term, because while this would
sometimes correspond one to one with a device, there would be some
cases where it does not.

a) Multiple devices could be a single DMA endpoint - this would
be the case with non-ACS bridges or PCIe to PCI bridges where
devices behind the bridge can't be distinguished from each other.
Early versions might be able to treat all VFIO groups as single
endpoints, which might simplify transition

b) A single device could supply multiple DMA endpoints, this would
be the case with PASID capable devices where you want to map
different PASIDs to different IOASes.

**Caveat: feel free to come up with a better name than "endpoint"

**Caveat: I'm not immediately sure how to represent these to
userspace, and how we do that could have some important
implications for managing their lifetime

Every endpoint would have a fixed, known DMA address type and DMA
identifier type (though I'm not sure if we need/want to expose the DMA
identifier type to userspace). Every IOAS would also have a DMA
address type fixed at IOAS creation.

An endpoint can only be attached to one IOAS at a time. It can only
be attached to an IOAS whose DMA address type matches the endpoint.

Most userspace managed IO page table formats would imply a particular DMA
address type, and also a particular DMA address type for their
"parent" IOAS. I'd expect kernel managed IO page tables to be able to
handle most combinations.

/dev/iommu would work entirely (or nearly so) in terms of endpoint
handles, not device handles. Endpoints are what get bound to an IOAS,
and endpoints are what get the user chosen endpoint cookie.

Getting endpoint handles from devices is handled on the VFIO/device
side. The simplest transitional approach is probably for a VFIO pdev
groups to expose just a single endpoint. We can potentially make that
more flexible as a later step, and other subsystems might have other
needs.
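
To make the model a little more concrete, a very rough sketch of the
structures an endpoint-centric uAPI could be built around (every name
and value here is hypothetical):

#include <stdint.h>

enum dma_addr_type { DMA_ADDR_32, DMA_ADDR_64, DMA_ADDR_PASID };

struct ioas_alloc {
	uint32_t addr_type;	/* enum dma_addr_type for this IOAS */
	uint32_t parent_ioas;	/* 0 means root (user virtual address) */
};

struct endpoint_attach {
	uint32_t endpoint;	/* endpoint handle obtained via VFIO */
	uint32_t ioas;		/* must have a matching addr_type */
	uint64_t cookie;	/* user chosen, echoed back in fault reports */
};

The attach would be rejected when the endpoint's DMA address type does
not match the addr_type of the target IOAS.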

Example A: VFIO userspace driver, with non-PASID capable device(s)

IOAS A1
IOPT format: Kernel managed
DMA address type: 64-bit
Parent DMA address type: Root (User Virtual Address)

=> 1 or more VFIO group endpoints attached
DMA address type: 64-bit

Driver manually maps userspace address ranges into A1, and
doesn't really care what IOVAs it uses.


Example B: Qemu passthrough, no-vIOMMU

IOAS B1
IOPT format: Kernel managed
DMA address type: 64-bit
Parent DMA address type: Root (User Virtual Address)

=> 1 or more VFIO group endpoints attached
DMA address type: 64-bit

Qemu maps guest memory ranges into B1, using IOVAs equal to GPA.

Example C: Qemu passthrough, non-PASID paravirtual vIOMMU

IOAS C1
IOPT format: Kernel managed
DMA address type: 64-bit
Parent DMA address type: Root (User Virtual Address)

Qemu maps guest memory ranges into C1, using IOVAs equal to GPA

IOAS C2
IOPT format: Kernel managed
DMA address type: 64-bit
Parent DMA address type: 64-bit

=> 1 or more VFIO group endpoints attached
DMA address type: 64-bit

Qemu implements vIOMMU hypercalls updating guest IOMMU domain 0
to change mappings in C2.

IOAS C3, C4, ...

As C2, but for other guest IOMMU domains.

Example D: Qemu passthrough, non-PASID virtual-IOPT vIOMMU

IOAS D1
IOPT format: Kernel managed
DMA address type: 64-bit
Parent DMA address type: Root (User Virtual Address)

Qemu maps guest memory ranges into D1, using IOVAs equal to GPA

IOAS D2
IOPT format: x86 IOPT (non-PASID)
DMA address type: 64-bit
Parent DMA address type: 64-bit

=> 1 or more VFIO group endpoints attached
DMA address type: 64-bit

Qemu configures D2 to point at the guest IOPT root. Guest IOTLB
flushes are trapped and translated to flushes on D2.

With nested-IOMMU capable host hardware, /dev/iommu will
configure the host IOMMU to use D1's IOPT as the L1 and D2's
IOPT as the L2 for the relevant endpoints

With a host-IOMMU that isn't nested capable, /dev/iommu will
shadow the combined D1+D2 mappings into the host IOPT for the
relevant endpoints.

IOAS D3, D4, ...
As D2, but for other guest IOMMU domains

Example E: Userspace driver, single-PASID mdev

IOAS E1
IOPT format: Kernel managed
DMA address type: 64-bit
Parent DMA address type: Root (User Virtual Address)

=> mdev endpoint attached
DMA address type: 64-bit
DMA identifier type: RID+PASID

Userspace maps the ranges it wants to use, not caring about IOVA

Example F: Userspace driver, PASID capable dev (option 1)

IOAS F1
IOPT format: Kernel managed
DMA address type: PASID
Parent DMA address type: Root (User Virtual Address)

=> all-PASID endpoint for device
DMA address type: PASID
DMA identifier type: RID

Driver maps in whatever chunks of memory it wants. Note that
every IO_MAP operation supplies both a PASID and address (because
that's the format of a "PASID" type IOVA).


Example G: Userspace driver, PASID capable dev (option 2)

IOAS G1
IOPT format: Kernel managed
DMA address type: 64-bit
Parent DMA address type: Root (User Virtual Address)

=> one-PASID endpoint for device
DMA address type: 64-bit
DMA identifier type: RID+PASID

Driver makes mappings for a single PASID into G1. IO_MAP
operations include only a 64-bit address, because the PASID is
implied by the choice of IOAS/endpoint

IOAS G2, G3, ...
As G1 but for different PASIDs


More examples are possible, of course.

--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson



2021-07-26 08:16:27

by Jean-Philippe Brucker

[permalink] [raw]
Subject: Re: [RFC v2] /dev/iommu uAPI proposal

Hi Kevin,

On Fri, Jul 09, 2021 at 07:48:44AM +0000, Tian, Kevin wrote:
> /dev/iommu provides an unified interface for managing I/O page tables for
> devices assigned to userspace. Device passthrough frameworks (VFIO, vDPA,
> etc.) are expected to use this interface instead of creating their own logic to
> isolate untrusted device DMAs initiated by userspace.
>
> This proposal describes the uAPI of /dev/iommu and also sample sequences
> with VFIO as example in typical usages. The driver-facing kernel API provided
> by the iommu layer is still TBD, which can be discussed after consensus is
> made on this uAPI.

The document looks good to me, I don't have other concerns at the moment

Thanks,
Jean

2021-07-28 04:06:49

by Tian, Kevin

[permalink] [raw]
Subject: RE: [RFC v2] /dev/iommu uAPI proposal

Hi, David,

> From: David Gibson <[email protected]>
> Sent: Monday, July 26, 2021 12:51 PM
>
> On Fri, Jul 09, 2021 at 07:48:44AM +0000, Tian, Kevin wrote:
> > /dev/iommu provides an unified interface for managing I/O page tables for
> > devices assigned to userspace. Device passthrough frameworks (VFIO,
> vDPA,
> > etc.) are expected to use this interface instead of creating their own logic to
> > isolate untrusted device DMAs initiated by userspace.
> >
> > This proposal describes the uAPI of /dev/iommu and also sample
> sequences
> > with VFIO as example in typical usages. The driver-facing kernel API
> provided
> > by the iommu layer is still TBD, which can be discussed after consensus is
> > made on this uAPI.
> >
> > It's based on a lengthy discussion starting from here:
> > https://lore.kernel.org/linux-
> iommu/[email protected]/
> >
> > v1 can be found here:
> > https://lore.kernel.org/linux-
> iommu/[email protected]
> amprd12.prod.outlook.com/T/
> >
> > This doc is also tracked on github, though it's not very useful for v1->v2
> > given dramatic refactoring:
> > https://github.com/luxis1999/dev_iommu_uapi
>
> Thanks for all your work on this, Kevin. Apart from the actual
> semantic improvements, I'm finding v2 significantly easier to read and
> understand than v1.
>
> [snip]
> > 1.2. Attach Device to I/O address space
> > +++++++++++++++++++++++++++++++++++++++
> >
> > Device attach/bind is initiated through passthrough framework uAPI.
> >
> > Device attaching is allowed only after a device is successfully bound to
> > the IOMMU fd. User should provide a device cookie when binding the
> > device through VFIO uAPI. This cookie is used when the user queries
> > device capability/format, issues per-device iotlb invalidation and
> > receives per-device I/O page fault data via IOMMU fd.
> >
> > Successful binding puts the device into a security context which isolates
> > its DMA from the rest system. VFIO should not allow user to access the
> > device before binding is completed. Similarly, VFIO should prevent the
> > user from unbinding the device before user access is withdrawn.
> >
> > When a device is in an iommu group which contains multiple devices,
> > all devices within the group must enter/exit the security context
> > together. Please check {1.3} for more info about group isolation via
> > this device-centric design.
> >
> > Successful attaching activates an I/O address space in the IOMMU,
> > if the device is not purely software mediated. VFIO must provide device
> > specific routing information for where to install the I/O page table in
> > the IOMMU for this device. VFIO must also guarantee that the attached
> > device is configured to compose DMAs with the routing information that
> > is provided in the attaching call. When handling DMA requests, IOMMU
> > identifies the target I/O address space according to the routing
> > information carried in the request. Misconfiguration breaks DMA
> > isolation thus could lead to severe security vulnerability.
> >
> > Routing information is per-device and bus specific. For PCI, it is
> > Requester ID (RID) identifying the device plus optional Process Address
> > Space ID (PASID). For ARM, it is Stream ID (SID) plus optional Sub-Stream
> > ID (SSID). PASID or SSID is used when multiple I/O address spaces are
> > enabled on a single device. For simplicity and continuity reason the
> > following context uses RID+PASID though SID+SSID may sound a clearer
> > naming from device p.o.v. We can decide the actual naming when coding.
> >
> > Because one I/O address space can be attached by multiple devices,
> > per-device routing information (plus device cookie) is tracked under
> > each IOASID and is used respectively when activating the I/O address
> > space in the IOMMU for each attached device.
> >
> > The device in the /dev/iommu context always refers to a physical one
> > (pdev) which is identifiable via RID. Physically each pdev can support
> > one default I/O address space (routed via RID) and optionally multiple
> > non-default I/O address spaces (via RID+PASID).
> >
> > The device in VFIO context is a logic concept, being either a physical
> > device (pdev) or mediated device (mdev or subdev). Each vfio device
> > is represented by RID+cookie in IOMMU fd. User is allowed to create
> > one default I/O address space (routed by vRID from user p.o.v) per
> > each vfio_device. VFIO decides the routing information for this default
> > space based on device type:
> >
> > 1) pdev, routed via RID;
> >
> > 2) mdev/subdev with IOMMU-enforced DMA isolation, routed via
> > the parent's RID plus the PASID marking this mdev;
> >
> > 3) a purely sw-mediated device (sw mdev), no routing required i.e. no
> > need to install the I/O page table in the IOMMU. sw mdev just uses
> > the metadata to assist its internal DMA isolation logic on top of
> > the parent's IOMMU page table;
> >
> > In addition, VFIO may allow user to create additional I/O address spaces
> > on a vfio_device based on the hardware capability. In such case the user
> > has its own view of the virtual routing information (vPASID) when marking
> > these non-default address spaces. How to virtualize vPASID is platform
> > specific and device specific. Some platforms allow the user to fully
> > manage the PASID space thus vPASIDs are directly used for routing and
> > even hidden from the kernel. Other platforms require the user to
> > explicitly register the vPASID information to the kernel when attaching
> > the vfio_device. In this case VFIO must figure out whether vPASID should
> > be directly used (pdev) or converted to a kernel-allocated pPASID (mdev)
> > for physical routing. Detail explanation about PASID virtualization can
> > be found in {1.4}.
> >
> > For mdev both default and non-default I/O address spaces are routed
> > via PASIDs. To better differentiate them we use "default PASID" (or
> > defPASID) when talking about the default I/O address space on mdev.
> When
> > vPASID or pPASID is referred in PASID virtualization it's all about the
> > non-default spaces. defPASID and pPASID are always hidden from
> userspace
> > and can only be indirectly referenced via IOASID.
>
> That said, I'm still finding the various ways a device can attach to
> an ioasid pretty confusing. Here are some thoughts on some extra
> concepts that might make it easier to handle [note, I haven't thought
> this all the way through so far, so there might be fatal problems with
> this approach].

Thanks for sharing your thoughts.

>
> * DMA address type
>
> This represents the format of the actual "over the wire" DMA
> address. So far I only see 3 likely options for this 1) 32-bit,
> 2) 64-bit and 3) PASID, meaning the 84-bit PASID+address
> combination.
>
> * DMA identifier type
>
> This represents the format of the "over the wire"
> device-identifying information that the IOMMU receives. So "RID",
> "RID+PASID", "SID+SSID" would all be DMA identifier types. We
> could introduce some extra ones which might be necessary for
> software mdevs.
>
> So, every single DMA transaction has both DMA address and DMA
> identifier information attached. In some cases we get to choose how
> we split the available information between identifier and address, more
> on that later.
>
> * DMA endpoint
>
> An endpoint would represent a DMA origin which is identifiable to
> the IOMMU. I'm using the new term, because while this would
> sometimes correspond one to one with a device, there would be some
> cases where it does not.
>
> a) Multiple devices could be a single DMA endpoint - this would
> be the case with non-ACS bridges or PCIe to PCI bridges where
> devices behind the bridge can't be distinguished from each other.
> Early versions might be able to treat all VFIO groups as single
> endpoints, which might simplify transition
>
> b) A single device could supply multiple DMA endpoints, this would
> be the case with PASID capable devices where you want to map
> different PASIDs to different IOASes.
>
> **Caveat: feel free to come up with a better name than "endpoint"
>
> **Caveat: I'm not immediately sure how to represent these to
> userspace, and how we do that could have some important
> implications for managing their lifetime
>
> Every endpoint would have a fixed, known DMA address type and DMA
> identifier type (though I'm not sure if we need/want to expose the DMA
> identifier type to userspace). Every IOAS would also have a DMA
> address type fixed at IOAS creation.
>
> An endpoint can only be attached to one IOAS at a time. It can only
> be attached to an IOAS whose DMA address type matches the endpoint.
>
> Most userspace managed IO page table formats would imply a particular DMA
> address type, and also a particular DMA address type for their
> "parent" IOAS. I'd expect kernel managed IO page tables to be able to
> handle most combinations.
>
> /dev/iommu would work entirely (or nearly so) in terms of endpoint
> handles, not device handles. Endpoints are what get bound to an IOAS,
> and endpoints are what get the user chosen endpoint cookie.
>
> Getting endpoint handles from devices is handled on the VFIO/device
> side. The simplest transitional approach is probably for a VFIO pdev
> groups to expose just a single endpoint. We can potentially make that
> more flexible as a later step, and other subsystems might have other
> needs.
>

I wonder what the real value of this endpoint concept is. For the
SVA-capable pdev case, the entire pdev is fully managed by the guest,
thus only the guest driver knows the DMA endpoints on this pdev.
vfio-pci doesn't know of the presence of an endpoint until Qemu requests
to do the ioasid attaching after identifying an IOAS via the vIOMMU. If
we want to build the /dev/iommu uAPI around endpoints, vfio probably has
to provide a uAPI for the user to request creating an endpoint on the
fly before doing the attaching call. But what benefit does that bring
for the additional complexity, given that what we require is just the
RID or RID+PASID routing info, which can already be dug out by the vfio
driver w/o knowing any endpoint concept...

In concept I feel the purpose of a DMA endpoint is equivalent to the
routing info in this proposal. But making it explicit in the uAPI
doesn't sound like it brings more value...

Thanks
Kevin

2021-07-28 04:10:02

by Tian, Kevin

[permalink] [raw]
Subject: RE: [RFC v2] /dev/iommu uAPI proposal

> From: Jean-Philippe Brucker <[email protected]>
> Sent: Monday, July 26, 2021 4:15 PM
>
> Hi Kevin,
>
> On Fri, Jul 09, 2021 at 07:48:44AM +0000, Tian, Kevin wrote:
> > /dev/iommu provides an unified interface for managing I/O page tables for
> > devices assigned to userspace. Device passthrough frameworks (VFIO,
> vDPA,
> > etc.) are expected to use this interface instead of creating their own logic to
> > isolate untrusted device DMAs initiated by userspace.
> >
> > This proposal describes the uAPI of /dev/iommu and also sample
> sequences
> > with VFIO as example in typical usages. The driver-facing kernel API
> provided
> > by the iommu layer is still TBD, which can be discussed after consensus is
> > made on this uAPI.
>
> The document looks good to me, I don't have other concerns at the moment
>

Thanks for your review.

2021-07-30 14:55:24

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [RFC v2] /dev/iommu uAPI proposal

On Mon, Jul 26, 2021 at 02:50:48PM +1000, David Gibson wrote:

> That said, I'm still finding the various ways a device can attach to
> an ioasid pretty confusing. Here are some thoughts on some extra
> concepts that might make it easier to handle [note, I haven't thought
> this all the way through so far, so there might be fatal problems with
> this approach].

I think you've summarized how I've been viewing this problem. All the
concepts you pointed to should show through in the various APIs at the
end, one way or another.

How much we need to expose to userspace, I don't know.

Does userspace need to care how the system labels traffic between DMA
endpoint and the IOASID? At some point maybe yes since stuff like
PASID does leak out in various spots

> /dev/iommu would work entirely (or nearly so) in terms of endpoint
> handles, not device handles. Endpoints are what get bound to an IOAS,
> and endpoints are what get the user chosen endpoint cookie.

While an accurate modeling of groups, it feels like an
overcomplication at this point in history where new HW largely doesn't
need it. The user interface VFIO and others present is device centric;
inserting a new endpoint object is going back to some kind of
group-centric view of the world.

I'd rather deduce the endpoint from a collection of devices than the
other way around...

Jason

2021-08-02 02:50:47

by Tian, Kevin

[permalink] [raw]
Subject: RE: [RFC v2] /dev/iommu uAPI proposal

> From: Jason Gunthorpe <[email protected]>
> Sent: Friday, July 30, 2021 10:51 PM
>
> On Mon, Jul 26, 2021 at 02:50:48PM +1000, David Gibson wrote:
>
> > That said, I'm still finding the various ways a device can attach to
> > an ioasid pretty confusing. Here are some thoughts on some extra
> > concepts that might make it easier to handle [note, I haven't thought
> > this all the way through so far, so there might be fatal problems with
> > this approach].
>
> I think you've summarized how I've been viewing this problem. All the
> concepts you pointed to should show through in the various APIs at the
> end, one way or another.

I still don't get the value of making the endpoint explicit in the
/dev/iommu uAPI. From the IOMMU p.o.v. it only cares about how to route
incoming DMA traffic to a specific I/O page table, according to the RID
or RID+PASID info carried in the DMA packets. This is already covered by
this proposal. Which DMA endpoint in the source device actually triggers
the traffic is not a concern for /dev/iommu...

>
> How much we need to expose to userspace, I don't know.
>
> Does userspace need to care how the system labels traffic between DMA
> endpoint and the IOASID? At some point maybe yes since stuff like
> PASID does leak out in various spots
>

Can you elaborate? IMO the user only cares about the label (device
cookie plus optional vPASID), which it generates itself when doing the
attaching call, and expects this virtual label to be used in various
spots (invalidation, page fault, etc.). How the system labels the
traffic (the physical RID or RID+PASID) should be completely invisible
to userspace.
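
For illustration, the kind of record the fault region could carry under
this model (the layout and names are only a sketch, not the final uAPI):

#include <stdint.h>

struct iommu_user_fault {
	uint32_t ioasid;	/* IOASID the faulting device is attached to */
	uint32_t flags;		/* e.g. recoverable vs. unrecoverable */
	uint64_t dev_cookie;	/* the cookie supplied at binding time */
	uint64_t addr;		/* faulting I/O virtual address */
};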

Thanks
Kevin

2021-08-03 02:00:54

by David Gibson

[permalink] [raw]
Subject: Re: [RFC v2] /dev/iommu uAPI proposal

On Wed, Jul 28, 2021 at 04:04:24AM +0000, Tian, Kevin wrote:
> Hi, David,
>
> > From: David Gibson <[email protected]>
> > Sent: Monday, July 26, 2021 12:51 PM
> >
> > On Fri, Jul 09, 2021 at 07:48:44AM +0000, Tian, Kevin wrote:
> > > /dev/iommu provides an unified interface for managing I/O page tables for
> > > devices assigned to userspace. Device passthrough frameworks (VFIO,
> > vDPA,
> > > etc.) are expected to use this interface instead of creating their own logic to
> > > isolate untrusted device DMAs initiated by userspace.
> > >
> > > This proposal describes the uAPI of /dev/iommu and also sample
> > sequences
> > > with VFIO as example in typical usages. The driver-facing kernel API
> > provided
> > > by the iommu layer is still TBD, which can be discussed after consensus is
> > > made on this uAPI.
> > >
> > > It's based on a lengthy discussion starting from here:
> > > https://lore.kernel.org/linux-
> > iommu/[email protected]/
> > >
> > > v1 can be found here:
> > > https://lore.kernel.org/linux-
> > iommu/[email protected]
> > amprd12.prod.outlook.com/T/
> > >
> > > This doc is also tracked on github, though it's not very useful for v1->v2
> > > given dramatic refactoring:
> > > https://github.com/luxis1999/dev_iommu_uapi
> >
> > Thanks for all your work on this, Kevin. Apart from the actual
> > semantic improvements, I'm finding v2 significantly easier to read and
> > understand than v1.
> >
> > [snip]
> > > 1.2. Attach Device to I/O address space
> > > +++++++++++++++++++++++++++++++++++++++
> > >
> > > Device attach/bind is initiated through passthrough framework uAPI.
> > >
> > > Device attaching is allowed only after a device is successfully bound to
> > > the IOMMU fd. User should provide a device cookie when binding the
> > > device through VFIO uAPI. This cookie is used when the user queries
> > > device capability/format, issues per-device iotlb invalidation and
> > > receives per-device I/O page fault data via IOMMU fd.
> > >
> > > Successful binding puts the device into a security context which isolates
> > > its DMA from the rest system. VFIO should not allow user to access the
> > > device before binding is completed. Similarly, VFIO should prevent the
> > > user from unbinding the device before user access is withdrawn.
> > >
> > > When a device is in an iommu group which contains multiple devices,
> > > all devices within the group must enter/exit the security context
> > > together. Please check {1.3} for more info about group isolation via
> > > this device-centric design.
> > >
> > > Successful attaching activates an I/O address space in the IOMMU,
> > > if the device is not purely software mediated. VFIO must provide device
> > > specific routing information for where to install the I/O page table in
> > > the IOMMU for this device. VFIO must also guarantee that the attached
> > > device is configured to compose DMAs with the routing information that
> > > is provided in the attaching call. When handling DMA requests, IOMMU
> > > identifies the target I/O address space according to the routing
> > > information carried in the request. Misconfiguration breaks DMA
> > > isolation thus could lead to severe security vulnerability.
> > >
> > > Routing information is per-device and bus specific. For PCI, it is
> > > Requester ID (RID) identifying the device plus optional Process Address
> > > Space ID (PASID). For ARM, it is Stream ID (SID) plus optional Sub-Stream
> > > ID (SSID). PASID or SSID is used when multiple I/O address spaces are
> > > enabled on a single device. For simplicity and continuity reason the
> > > following context uses RID+PASID though SID+SSID may sound a clearer
> > > naming from device p.o.v. We can decide the actual naming when coding.
> > >
> > > Because one I/O address space can be attached by multiple devices,
> > > per-device routing information (plus device cookie) is tracked under
> > > each IOASID and is used respectively when activating the I/O address
> > > space in the IOMMU for each attached device.
> > >
> > > The device in the /dev/iommu context always refers to a physical one
> > > (pdev) which is identifiable via RID. Physically each pdev can support
> > > one default I/O address space (routed via RID) and optionally multiple
> > > non-default I/O address spaces (via RID+PASID).
> > >
> > > The device in VFIO context is a logic concept, being either a physical
> > > device (pdev) or mediated device (mdev or subdev). Each vfio device
> > > is represented by RID+cookie in IOMMU fd. User is allowed to create
> > > one default I/O address space (routed by vRID from user p.o.v) per
> > > each vfio_device. VFIO decides the routing information for this default
> > > space based on device type:
> > >
> > > 1) pdev, routed via RID;
> > >
> > > 2) mdev/subdev with IOMMU-enforced DMA isolation, routed via
> > > the parent's RID plus the PASID marking this mdev;
> > >
> > > 3) a purely sw-mediated device (sw mdev), no routing required i.e. no
> > > need to install the I/O page table in the IOMMU. sw mdev just uses
> > > the metadata to assist its internal DMA isolation logic on top of
> > > the parent's IOMMU page table;
> > >
> > > In addition, VFIO may allow user to create additional I/O address spaces
> > > on a vfio_device based on the hardware capability. In such case the user
> > > has its own view of the virtual routing information (vPASID) when marking
> > > these non-default address spaces. How to virtualize vPASID is platform
> > > specific and device specific. Some platforms allow the user to fully
> > > manage the PASID space thus vPASIDs are directly used for routing and
> > > even hidden from the kernel. Other platforms require the user to
> > > explicitly register the vPASID information to the kernel when attaching
> > > the vfio_device. In this case VFIO must figure out whether vPASID should
> > > be directly used (pdev) or converted to a kernel-allocated pPASID (mdev)
> > > for physical routing. Detail explanation about PASID virtualization can
> > > be found in {1.4}.
> > >
> > > For mdev both default and non-default I/O address spaces are routed
> > > via PASIDs. To better differentiate them we use "default PASID" (or
> > > defPASID) when talking about the default I/O address space on mdev.
> > When
> > > vPASID or pPASID is referred in PASID virtualization it's all about the
> > > non-default spaces. defPASID and pPASID are always hidden from
> > userspace
> > > and can only be indirectly referenced via IOASID.
> >
> > That said, I'm still finding the various ways a device can attach to
> > an ioasid pretty confusing. Here are some thoughts on some extra
> > concepts that might make it easier to handle [note, I haven't thought
> > this all the way through so far, so there might be fatal problems with
> > this approach].
>
> Thanks for sharing your thoughts.
>
> >
> > * DMA address type
> >
> > This represents the format of the actual "over the wire" DMA
> > address. So far I only see 3 likely options for this 1) 32-bit,
> > 2) 64-bit and 3) PASID, meaning the 84-bit PASID+address
> > combination.
> >
> > * DMA identifier type
> >
> > This represents the format of the "over the wire"
> > device-identifying information that the IOMMU receives. So "RID",
> > "RID+PASID", "SID+SSID" would all be DMA identifier types. We
> > could introduce some extra ones which might be necessary for
> > software mdevs.
> >
> > So, every single DMA transaction has both DMA address and DMA
> > identifier information attached. In some cases we get to choose how
> > we split the availble information between identifier and address, more
> > on that later.
> >
> > * DMA endpoint
> >
> > An endpoint would represent a DMA origin which is identifiable to
> > the IOMMU. I'm using the new term, because while this would
> > sometimes correspond one to one with a device, there would be some
> > cases where it does not.
> >
> > a) Multiple devices could be a single DMA endpoint - this would
> > be the case with non-ACS bridges or PCIe to PCI bridges where
> > devices behind the bridge can't be distinguished from each other.
> > Early versions might be able to treat all VFIO groups as single
> > endpoints, which might simplify transition
> >
> > b) A single device could supply multiple DMA endpoints, this would
> > be the case with PASID capable devices where you want to map
> > different PASIDs to different IOASes.
> >
> > **Caveat: feel free to come up with a better name than "endpoint"
> >
> > **Caveat: I'm not immediately sure how to represent these to
> > userspace, and how we do that could have some important
> > implications for managing their lifetime
> >
> > Every endpoint would have a fixed, known DMA address type and DMA
> > identifier type (though I'm not sure if we need/want to expose the DMA
> > identifier type to userspace). Every IOAS would also have a DMA
> > address type fixed at IOAS creation.
> >
> > An endpoint can only be attached to one IOAS at a time. It can only
> > be attached to an IOAS whose DMA address type matches the endpoint.
> >
> > Most userspace managed IO page formats would imply a particular DMA
> > address type, and also a particular DMA address type for their
> > "parent" IOAS. I'd expect kernel managed IO page tables to be able to
> > be able to handle most combinations.
> >
> > /dev/iommu would work entirely (or nearly so) in terms of endpoint
> > handles, not device handles. Endpoints are what get bound to an IOAS,
> > and endpoints are what get the user chosen endpoint cookie.
> >
> > Getting endpoint handles from devices is handled on the VFIO/device
> > side. The simplest transitional approach is probably for a VFIO pdev
> > groups to expose just a single endpoint. We can potentially make that
> > more flexible as a later step, and other subsystems might have other
> > needs.
>
> I wonder what is the real value of this endpoint concept. for SVA-capable
> pdev case, the entire pdev is fully managed by the guest thus only the
> guest driver knows DMA endpoints on this pdev. vfio-pci doesn't know
> the presence of an endpoint until Qemu requests to do ioasid attaching
> after identifying an IOAS via vIOMMU.

No.. that's not true. vfio-pci knows it can generate a "RID"-type
endpoint for the device, and I assume the device will have an SVA
capability bit, which lets vfio know that the endpoint will generate
PASID+addr addresses, rather than plain 64-bit addresses.

You can't construct RID+PASID endpoints with vfio's knowledge alone,
but that's ok - that style would be for mdevs or other cases where you
do have more information about the specific device.
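
For reference, the capability detection hinted at here amounts to checking
the PCIe PASID extended capability (ID 0x1b). A rough standalone sketch (not
vfio-pci code; the helper name is invented) that walks a config-space
snapshot:

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

static bool has_pasid_cap(const uint8_t *cfg, size_t cfg_len)
{
	uint32_t off = 0x100;   /* extended capabilities start here */
	int guard = 64;         /* avoid looping on malformed lists */

	while (off && off + 4 <= cfg_len && guard--) {
		uint32_t hdr = (uint32_t)cfg[off] |
			       (uint32_t)cfg[off + 1] << 8 |
			       (uint32_t)cfg[off + 2] << 16 |
			       (uint32_t)cfg[off + 3] << 24;
		if ((hdr & 0xffff) == 0x001b)   /* PASID capability ID */
			return true;
		off = hdr >> 20;                /* next capability offset */
	}
	return false;
}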

> If we want to build /dev/iommu
> uAPI around endpoint, probably vfio has to provide an uAPI for user to
> request creating an endpoint in the fly before doing the attaching call.
> but what goodness does it bring with additional complexity, given what
> we require is just the RID or RID+PASID routing info which can be already
> dig out by vfio driver w/o knowing any endpoint concept...

It more clearly delineates who's responsible for what. The driver
(VFIO, mdev, vDPA, whatever) supplies endpoints. Depending on the
type of device it could be one endpoint per device, a choice of
several different endpoints for a device, several simultaneous
endpoints for the device, or one endpoint for several devices. But
whatever it is that's all on the device side. Once you get an
endpoint, it's always binding exactly one endpoint to exactly one IOAS
so the point at which the device side meets the IOMMU side becomes
much simpler.

If we find a new device type or bus with a new way of doing DMA
addressing, it's just adding some address/id types; we don't need new
ways of binding these devices to the IOMMU.
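
One possible reading of this endpoint-centric flow, sketched purely for
illustration (none of these ioctls or structs exist; every name here is
invented): the device-side driver mints an endpoint handle, and the IOMMU
side only ever binds one endpoint to one IOAS.

#include <stdint.h>

/* issued on the device fd; the driver decides what endpoints it can offer */
struct vfio_create_endpoint {
	uint32_t argsz;
	uint32_t flags;
	uint32_t pasid;        /* optional, for drivers able to mint RID+PASID endpoints */
	uint32_t endpoint_id;  /* output: handle usable on /dev/iommu */
};

/* issued on the /dev/iommu fd; always 1 endpoint : 1 IOAS */
struct iommu_bind_endpoint {
	uint32_t argsz;
	uint32_t endpoint_id;
	uint32_t ioasid;          /* the IOAS this endpoint is attached to */
	uint32_t pad;
	uint64_t endpoint_cookie; /* user-chosen label for faults/invalidation */
};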

> In concept I feel the purpose of DMA endpoint is equivalent to the routing
> info in this proposal.

Maybe? I'm afraid I never quite managed to understand the role of the
routing info in your proposal.

> But making it explicitly in uAPI doesn't sound bring
> more value...
>
> Thanks
> Kevin
>

--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson



2021-08-03 02:02:00

by David Gibson

[permalink] [raw]
Subject: Re: [RFC v2] /dev/iommu uAPI proposal

On Fri, Jul 30, 2021 at 11:51:23AM -0300, Jason Gunthorpe wrote:
> On Mon, Jul 26, 2021 at 02:50:48PM +1000, David Gibson wrote:
>
> > That said, I'm still finding the various ways a device can attach to
> > an ioasid pretty confusing. Here are some thoughts on some extra
> > concepts that might make it easier to handle [note, I haven't thought
> > this all the way through so far, so there might be fatal problems with
> > this approach].
>
> I think you've summarized how I've been viewing this problem. All the
> concepts you pointed to should show through in the various APIs at the
> end, one way or another.
>
> How much we need to expose to userspace, I don't know.
>
> Does userspace need to care how the system labels traffic between DMA
> endpoint and the IOASID? At some point maybe yes since stuff like
> PASID does leak out in various spots

Yeah, I'm not sure. I think it probably doesn't for the "main path"
of the API, though we might want to expose that for debugging and some
edge cases.

We *should* however be exposing the address type for each IOAS, since
that affects how your MAP operations will work, as well as what
endpoints are compatible with the IOAS.
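
A minimal sketch of what exposing that attribute might look like (invented
names, not part of the proposal): the address type is fixed at IOAS creation
and later checked against whatever gets attached.

#include <stdint.h>

enum ioas_addr_type {
	IOAS_ADDR_32BIT,    /* legacy 32-bit DMA addresses */
	IOAS_ADDR_64BIT,    /* plain 64-bit addresses */
	IOAS_ADDR_PASID64,  /* 20-bit PASID + 64-bit address */
};

struct ioas_alloc {
	uint32_t argsz;
	uint32_t flags;
	uint32_t addr_type;  /* enum ioas_addr_type, immutable for the IOAS */
	uint32_t ioasid;     /* output handle */
};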

> > /dev/iommu would work entirely (or nearly so) in terms of endpoint
> > handles, not device handles. Endpoints are what get bound to an IOAS,
> > and endpoints are what get the user chosen endpoint cookie.
>
> While an accurate modeling of groups, it feels like an
> overcomplication at this point in history where new HW largely doesn't
> need it.

So.. first, is that really true across the board? I expect it's true
of high end server hardware, but for consumer level and embedded
hardware as well? Then there's virtual hardware - I could point to
several things still routinely using emulated PCIe to PCI bridges in
qemu.

Second, we can't just ignore older hardware.

> The user interface VFIO and others present is device
> centric; inserting a new endpoint object is going back to some
> kind of group centric view of the world.

Well, kind of, yeah, because I still think the concept has value.
Part of the trouble is that "device" is pretty ambiguous. "Device" in
the sense of the PCI address for the register interface may not be the
same as "device" in terms of the DMA RID, which may not be the same as
"device" in terms of a Linux struct device.

> I'd rather deduce the endpoint from a collection of devices than the
> other way around...

Which I think is confusing, and in any case doesn't cover the case of
one "device" with multiple endpoints.

--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson



2021-08-03 03:24:13

by Tian, Kevin

[permalink] [raw]
Subject: RE: [RFC v2] /dev/iommu uAPI proposal

> From: David Gibson <[email protected]>
> Sent: Tuesday, August 3, 2021 9:51 AM
>
> On Wed, Jul 28, 2021 at 04:04:24AM +0000, Tian, Kevin wrote:
> > Hi, David,
> >
> > > From: David Gibson <[email protected]>
> > > Sent: Monday, July 26, 2021 12:51 PM
> > >
> > > On Fri, Jul 09, 2021 at 07:48:44AM +0000, Tian, Kevin wrote:
> > > > /dev/iommu provides an unified interface for managing I/O page tables
> for
> > > > devices assigned to userspace. Device passthrough frameworks (VFIO,
> > > vDPA,
> > > > etc.) are expected to use this interface instead of creating their own
> logic to
> > > > isolate untrusted device DMAs initiated by userspace.
> > > >
> > > > This proposal describes the uAPI of /dev/iommu and also sample
> > > sequences
> > > > with VFIO as example in typical usages. The driver-facing kernel API
> > > provided
> > > > by the iommu layer is still TBD, which can be discussed after consensus
> is
> > > > made on this uAPI.
> > > >
> > > > It's based on a lengthy discussion starting from here:
> > > > https://lore.kernel.org/linux-
> > > iommu/[email protected]/
> > > >
> > > > v1 can be found here:
> > > > https://lore.kernel.org/linux-
> > >
> iommu/[email protected]
> > > amprd12.prod.outlook.com/T/
> > > >
> > > > This doc is also tracked on github, though it's not very useful for v1->v2
> > > > given dramatic refactoring:
> > > > https://github.com/luxis1999/dev_iommu_uapi
> > >
> > > Thanks for all your work on this, Kevin. Apart from the actual
> > > semantic improvements, I'm finding v2 significantly easier to read and
> > > understand than v1.
> > >
> > > [snip]
> > > > 1.2. Attach Device to I/O address space
> > > > +++++++++++++++++++++++++++++++++++++++
> > > >
> > > > Device attach/bind is initiated through passthrough framework uAPI.
> > > >
> > > > Device attaching is allowed only after a device is successfully bound to
> > > > the IOMMU fd. User should provide a device cookie when binding the
> > > > device through VFIO uAPI. This cookie is used when the user queries
> > > > device capability/format, issues per-device iotlb invalidation and
> > > > receives per-device I/O page fault data via IOMMU fd.
> > > >
> > > > Successful binding puts the device into a security context which isolates
> > > > its DMA from the rest system. VFIO should not allow user to access the
> > > > device before binding is completed. Similarly, VFIO should prevent the
> > > > user from unbinding the device before user access is withdrawn.
> > > >
> > > > When a device is in an iommu group which contains multiple devices,
> > > > all devices within the group must enter/exit the security context
> > > > together. Please check {1.3} for more info about group isolation via
> > > > this device-centric design.
> > > >
> > > > Successful attaching activates an I/O address space in the IOMMU,
> > > > if the device is not purely software mediated. VFIO must provide device
> > > > specific routing information for where to install the I/O page table in
> > > > the IOMMU for this device. VFIO must also guarantee that the attached
> > > > device is configured to compose DMAs with the routing information
> that
> > > > is provided in the attaching call. When handling DMA requests, IOMMU
> > > > identifies the target I/O address space according to the routing
> > > > information carried in the request. Misconfiguration breaks DMA
> > > > isolation thus could lead to severe security vulnerability.
> > > >
> > > > Routing information is per-device and bus specific. For PCI, it is
> > > > Requester ID (RID) identifying the device plus optional Process Address
> > > > Space ID (PASID). For ARM, it is Stream ID (SID) plus optional Sub-
> Stream
> > > > ID (SSID). PASID or SSID is used when multiple I/O address spaces are
> > > > enabled on a single device. For simplicity and continuity reason the
> > > > following context uses RID+PASID though SID+SSID may sound a clearer
> > > > naming from device p.o.v. We can decide the actual naming when
> coding.
> > > >
> > > > Because one I/O address space can be attached by multiple devices,
> > > > per-device routing information (plus device cookie) is tracked under
> > > > each IOASID and is used respectively when activating the I/O address
> > > > space in the IOMMU for each attached device.
> > > >
> > > > The device in the /dev/iommu context always refers to a physical one
> > > > (pdev) which is identifiable via RID. Physically each pdev can support
> > > > one default I/O address space (routed via RID) and optionally multiple
> > > > non-default I/O address spaces (via RID+PASID).
> > > >
> > > > The device in VFIO context is a logic concept, being either a physical
> > > > device (pdev) or mediated device (mdev or subdev). Each vfio device
> > > > is represented by RID+cookie in IOMMU fd. User is allowed to create
> > > > one default I/O address space (routed by vRID from user p.o.v) per
> > > > each vfio_device. VFIO decides the routing information for this default
> > > > space based on device type:
> > > >
> > > > 1) pdev, routed via RID;
> > > >
> > > > 2) mdev/subdev with IOMMU-enforced DMA isolation, routed via
> > > > the parent's RID plus the PASID marking this mdev;
> > > >
> > > > 3) a purely sw-mediated device (sw mdev), no routing required i.e. no
> > > > need to install the I/O page table in the IOMMU. sw mdev just uses
> > > > the metadata to assist its internal DMA isolation logic on top of
> > > > the parent's IOMMU page table;
> > > >
> > > > In addition, VFIO may allow user to create additional I/O address
> spaces
> > > > on a vfio_device based on the hardware capability. In such case the
> user
> > > > has its own view of the virtual routing information (vPASID) when
> marking
> > > > these non-default address spaces. How to virtualize vPASID is platform
> > > > specific and device specific. Some platforms allow the user to fully
> > > > manage the PASID space thus vPASIDs are directly used for routing and
> > > > even hidden from the kernel. Other platforms require the user to
> > > > explicitly register the vPASID information to the kernel when attaching
> > > > the vfio_device. In this case VFIO must figure out whether vPASID
> should
> > > > be directly used (pdev) or converted to a kernel-allocated pPASID (mdev)
> > > > for physical routing. Detail explanation about PASID virtualization can
> > > > be found in {1.4}.
> > > >
> > > > For mdev both default and non-default I/O address spaces are routed
> > > > via PASIDs. To better differentiate them we use "default PASID" (or
> > > > defPASID) when talking about the default I/O address space on mdev.
> > > When
> > > > vPASID or pPASID is referred in PASID virtualization it's all about the
> > > > non-default spaces. defPASID and pPASID are always hidden from
> > > userspace
> > > > and can only be indirectly referenced via IOASID.
> > >
> > > That said, I'm still finding the various ways a device can attach to
> > > an ioasid pretty confusing. Here are some thoughts on some extra
> > > concepts that might make it easier to handle [note, I haven't thought
> > > this all the way through so far, so there might be fatal problems with
> > > this approach].
> >
> > Thanks for sharing your thoughts.
> >
> > >
> > > * DMA address type
> > >
> > > This represents the format of the actual "over the wire" DMA
> > > address. So far I only see 3 likely options for this 1) 32-bit,
> > > 2) 64-bit and 3) PASID, meaning the 84-bit PASID+address
> > > combination.
> > >
> > > * DMA identifier type
> > >
> > > This represents the format of the "over the wire"
> > > device-identifying information that the IOMMU receives. So "RID",
> > > "RID+PASID", "SID+SSID" would all be DMA identifier types. We
> > > could introduce some extra ones which might be necessary for
> > > software mdevs.
> > >
> > > So, every single DMA transaction has both DMA address and DMA
> > > identifier information attached. In some cases we get to choose how
> > > we split the availble information between identifier and address, more
> > > on that later.
> > >
> > > * DMA endpoint
> > >
> > > An endpoint would represent a DMA origin which is identifiable to
> > > the IOMMU. I'm using the new term, because while this would
> > > sometimes correspond one to one with a device, there would be some
> > > cases where it does not.
> > >
> > > a) Multiple devices could be a single DMA endpoint - this would
> > > be the case with non-ACS bridges or PCIe to PCI bridges where
> > > devices behind the bridge can't be distinguished from each other.
> > > Early versions might be able to treat all VFIO groups as single
> > > endpoints, which might simplify transition
> > >
> > > b) A single device could supply multiple DMA endpoints, this would
> > > be the case with PASID capable devices where you want to map
> > > different PASIDs to different IOASes.
> > >
> > > **Caveat: feel free to come up with a better name than "endpoint"
> > >
> > > **Caveat: I'm not immediately sure how to represent these to
> > > userspace, and how we do that could have some important
> > > implications for managing their lifetime
> > >
> > > Every endpoint would have a fixed, known DMA address type and DMA
> > > identifier type (though I'm not sure if we need/want to expose the DMA
> > > identifier type to userspace). Every IOAS would also have a DMA
> > > address type fixed at IOAS creation.
> > >
> > > An endpoint can only be attached to one IOAS at a time. It can only
> > > be attached to an IOAS whose DMA address type matches the endpoint.
> > >
> > > Most userspace managed IO page formats would imply a particular DMA
> > > address type, and also a particular DMA address type for their
> > > "parent" IOAS. I'd expect kernel managed IO page tables to be able to
> > > be able to handle most combinations.
> > >
> > > /dev/iommu would work entirely (or nearly so) in terms of endpoint
> > > handles, not device handles. Endpoints are what get bound to an IOAS,
> > > and endpoints are what get the user chosen endpoint cookie.
> > >
> > > Getting endpoint handles from devices is handled on the VFIO/device
> > > side. The simplest transitional approach is probably for a VFIO pdev
> > > groups to expose just a single endpoint. We can potentially make that
> > > more flexible as a later step, and other subsystems might have other
> > > needs.
> >
> > I wonder what is the real value of this endpoint concept. for SVA-capable
> > pdev case, the entire pdev is fully managed by the guest thus only the
> > guest driver knows DMA endpoints on this pdev. vfio-pci doesn't know
> > the presence of an endpoint until Qemu requests to do ioasid attaching
> > after identifying an IOAS via vIOMMU.
>
> No.. that's not true. vfio-pci knows it can generate a "RID"-type
> endpoint for the device, and I assume the device will have a SVA
> capability bit, which lets vfio know that the endpoint will generate
> PASID+addr addresses, rather than plain 64-bit addresses.
>
> You can't construct RID+PASID endpoints with vfio's knowledge alone,
> but that's ok - that style would be for mdevs or other cases where you
> do have more information about the specific device.

If vfio-pci cannot construct the endpoint alone in all cases, then I worry
we are just inventing an unnecessary uAPI object whose role can already
be fulfilled by device cookie+PASID in the proposed uAPI.

>
> > If we want to build /dev/iommu
> > uAPI around endpoint, probably vfio has to provide an uAPI for user to
> > request creating an endpoint in the fly before doing the attaching call.
> > but what goodness does it bring with additional complexity, given what
> > we require is just the RID or RID+PASID routing info which can be already
> > dig out by vfio driver w/o knowing any endpoint concept...
>
> It more clearly delineates who's responsible for what. The driver
> (VFIO, mdev, vDPA, whatever) supplies endpoints. Depending on the
> type of device it could be one endpoint per device, a choice of
> several different endpoints for a device, several simultaneous
> endpoints for the device, or one endpoint for several devices. But
> whatever it is that's all on the device side. Once you get an
> endpoint, it's always binding exactly one endpoint to exactly one IOAS
> so the point at which the device side meets the IOMMU side becomes
> much simpler.

Sticking to iommu semantics, {device, pasid} is clear enough for
the user to build the connection between an IOAS and a device. This
also matches the vIOMMU part, which just understands vRID and
vPASID without any concept of endpoint. Anyway, RID+PASID (or
SID+SSID) is what the bus defines to tag an I/O address space. Forcing
the vIOMMU to fake an endpoint (via vfio) per PASID before doing the
attach just adds unnecessary confusion.
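
As a sketch of those attach semantics (field names invented here; the actual
attach commands live in section 2 of the proposal): when the vIOMMU emulation
sees the guest install a PASID table entry for vPASID on vRID, userspace
simply attaches {device cookie, vPASID} to the IOASID backing that guest
address space.

#include <stdint.h>

struct vfio_device_attach_ioasid {
	uint32_t argsz;
	uint32_t flags;       /* e.g. a bit indicating the pasid field is valid */
	uint32_t ioasid;      /* target I/O address space */
	uint32_t pasid;       /* vPASID from the guest PASID table entry */
};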

>
> If we find a new device type or bus with a new way of doing DMA
> addressing, it's just adding some address/id types; we don't need new
> ways of binding these devices to the IOMMU.

We just need to define a better name to cover pasid, ssid, or other ids.
Regardless of the bus type, it's always the device cookie that identifies
the default I/O address space, plus a whatever-id to tag the other I/O
address spaces targeted by the device.

>
> > In concept I feel the purpose of DMA endpoint is equivalent to the routing
> > info in this proposal.
>
> Maybe? I'm afraid I never quite managed to understand the role of the
> routing info in your proposal.
>

The IOMMU routes incoming DMA packets to a specific I/O page table
according to the RID or RID+PASID carried in the packet. RID or RID+PASID
is the routing information (represented by device cookie + PASID in the
proposed uAPI) and is what the iommu driver really cares about when
activating the I/O page table in the iommu.

Thanks
Kevin

2021-08-04 14:06:21

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [RFC v2] /dev/iommu uAPI proposal

On Mon, Aug 02, 2021 at 02:49:44AM +0000, Tian, Kevin wrote:

> Can you elaborate? IMO the user only cares about the label (device cookie
> plus optional vPASID) which is generated by itself when doing the attaching
> call, and expects this virtual label being used in various spots (invalidation,
> page fault, etc.). How the system labels the traffic (the physical RID or RID+
> PASID) should be completely invisible to userspace.

I don't think that is true if the vIOMMU driver is also emulating
PASID. Presumably the same is true for other PASID-like schemes.

Jason

2021-08-04 14:08:57

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [RFC v2] /dev/iommu uAPI proposal

On Tue, Aug 03, 2021 at 11:58:54AM +1000, David Gibson wrote:
> > I'd rather deduce the endpoint from a collection of devices than the
> > other way around...
>
> Which I think is confusing, and in any case doesn't cover the case of
> one "device" with multiple endpoints.

Well they are both confusing, and I'd prefer to focus on the common
case without extra mandatory steps. Exposing optional endpoint sharing
information seems more in line with where everything is going than
making endpoint sharing a first class object.

AFAIK a device with multiple endpoints where those endpoints are
shared with other devices doesn't really exist, or isn't useful? Eg PASID
has multiple RIDs but they are not shared.

Jason

2021-08-04 16:05:00

by Eric Auger

[permalink] [raw]
Subject: Re: [RFC v2] /dev/iommu uAPI proposal

Hi Kevin,

Few comments/questions below.

On 7/9/21 9:48 AM, Tian, Kevin wrote:
> /dev/iommu provides an unified interface for managing I/O page tables for
> devices assigned to userspace. Device passthrough frameworks (VFIO, vDPA,
> etc.) are expected to use this interface instead of creating their own logic to
> isolate untrusted device DMAs initiated by userspace.
>
> This proposal describes the uAPI of /dev/iommu and also sample sequences
> with VFIO as example in typical usages. The driver-facing kernel API provided
> by the iommu layer is still TBD, which can be discussed after consensus is
> made on this uAPI.
>
> It's based on a lengthy discussion starting from here:
> https://lore.kernel.org/linux-iommu/[email protected]/
>
> v1 can be found here:
> https://lore.kernel.org/linux-iommu/PH0PR12MB54811863B392C644E5365446DC3E9@PH0PR12MB5481.namprd12.prod.outlook.com/T/
>
> This doc is also tracked on github, though it's not very useful for v1->v2
> given dramatic refactoring:
> https://github.com/luxis1999/dev_iommu_uapi
>
> Changelog (v1->v2):
> - Rename /dev/ioasid to /dev/iommu (Jason);
> - Add a section for device-centric vs. group-centric design (many);
> - Add a section for handling no-snoop DMA (Jason/Alex/Paolo);
> - Add definition of user/kernel/shared I/O page tables (Baolu/Jason);
> - Allow one device bound to multiple iommu fd's (Jason);
> - No need to track user I/O page tables in kernel on ARM/AMD (Jean/Jason);
> - Add a device cookie for iotlb invalidation and fault handling (Jean/Jason);
> - Add capability/format query interface per device cookie (Jason);
> - Specify format/attribute when creating an IOASID, leading to several v1
> uAPI commands removed (Jason);
> - Explain the value of software nesting (Jean);
> - Replace IOASID_REGISTER_VIRTUAL_MEMORY with software nesting (David/Jason);
> - Cover software mdev usage (Jason);
> - No restriction on map/unmap vs. bind/invalidate (Jason/David);
> - Report permitted IOVA range instead of reserved range (David);
> - Refine the sample structures and helper functions (Jason);
> - Add definition of default and non-default I/O address spaces;
> - Expand and clarify the design for PASID virtualization;
> - and lots of subtle refinement according to above changes;
>
> TOC
> ====
> 1. Terminologies and Concepts
> 1.1. Manage I/O address space
> 1.2. Attach device to I/O address space
> 1.3. Group isolation
> 1.4. PASID virtualization
> 1.4.1. Devices which don't support DMWr
> 1.4.2. Devices which support DMWr
> 1.4.3. Mix different types together
> 1.4.4. User sequence
> 1.5. No-snoop DMA
> 2. uAPI Proposal
> 2.1. /dev/iommu uAPI
> 2.2. /dev/vfio device uAPI
> 2.3. /dev/kvm uAPI
> 3. Sample Structures and Helper Functions
> 4. Use Cases and Flows
> 4.1. A simple example
> 4.2. Multiple IOASIDs (no nesting)
> 4.3. IOASID nesting (software)
> 4.4. IOASID nesting (hardware)
> 4.5. Guest SVA (vSVA)
> 4.6. I/O page fault
> ====
>
> 1. Terminologies and Concepts
> -----------------------------------------
>
> IOMMU fd is the container holding multiple I/O address spaces. User
> manages those address spaces through fd operations. Multiple fd's are
> allowed per process, but with this proposal one fd should be sufficient for
> all intended usages.
>
> IOASID is the fd-local software handle representing an I/O address space.
> Each IOASID is associated with a single I/O page table. IOASIDs can be
> nested together, implying the output address from one I/O page table
> (represented by child IOASID) must be further translated by another I/O
> page table (represented by parent IOASID).
>
> An I/O address space takes effect only after it is attached by a device.
> One device is allowed to attach to multiple I/O address spaces. One I/O
> address space can be attached by multiple devices.
>
> Device must be bound to an IOMMU fd before attach operation can be
> conducted. Though not necessary, user could bind one device to multiple
> IOMMU FD's. But no cross-FD IOASID nesting is allowed.
>
> The format of an I/O page table must be compatible to the attached
> devices (or more specifically to the IOMMU which serves the DMA from
> the attached devices). User is responsible for specifying the format
> when allocating an IOASID, according to one or multiple devices which
> will be attached right after. Attaching a device to an IOASID with
> incompatible format is simply rejected.
>
> Relationship between IOMMU fd, VFIO fd and KVM fd:
>
> - IOMMU fd provides uAPI for managing IOASIDs and I/O page tables.
> It also provides an unified capability/format reporting interface for
> each bound device.
>
> - VFIO fd provides uAPI for device binding and attaching. In this proposal
> VFIO is used as the example of device passthrough frameworks. The
> routing information that identifies an I/O address space in the wire is
> per-device and registered to IOMMU fd via VFIO uAPI.
>
> - KVM fd provides uAPI for handling no-snoop DMA and PASID virtualization
> in CPU (when PASID is carried in instruction payload).
>
> 1.1. Manage I/O address space
> +++++++++++++++++++++++++++++
>
> An I/O address space can be created in three ways, according to how
> the corresponding I/O page table is managed:
>
> - kernel-managed I/O page table which is created via IOMMU fd, e.g.
> for IOVA space (dpdk), GPA space (Qemu), GIOVA space (vIOMMU), etc.
>
> - user-managed I/O page table which is created by the user, e.g. for
> GIOVA/GVA space (vIOMMU), etc.
>
> - shared kernel-managed CPU page table which is created by another
> subsystem, e.g. for process VA space (mm), GPA space (kvm), etc.
>
> The first category is managed via a dma mapping protocol (similar to
> existing VFIO iommu type1), which allows the user to explicitly specify
> which range in the I/O address space should be mapped.
>
> The second category is managed via an iotlb protocol (similar to the
> underlying IOMMU semantics). Once the user-managed page table is
> bound to the IOMMU, the user can invoke an invalidation command
> to update the kernel-side cache (either in software or in physical IOMMU).
> In the meantime, a fault reporting/completion mechanism is also provided
> for the user to fixup potential I/O page faults.
>
> The last category is supposed to be managed via the subsystem which
> actually owns the shared address space. Likely what's minimally required
> in /dev/iommu uAPI is to build the connection with the address space
> owner when allocating the IOASID, so an in-kernel interface (e.g. mmu_
> notifer) is activated for any required synchronization between IOMMU fd
> and the space owner.
>
> This proposal focuses on how to manage the first two categories, as
> they are existing and more urgent requirements. Support of the last
> category can be discussed when a real usage comes in the future.
>
> The user needs to specify the desired management protocol and page
> table format when creating a new I/O address space. Before allocating
> the IOASID, the user should already know at least one device that will be
> attached to this space. It is expected to first query (via IOMMU fd) the
> supported capabilities and page table format information of the to-be-
> attached device (or a common set between multiple devices) and then
> choose a compatible format to set on the IOASID.
>
> I/O address spaces can be nested together, called IOASID nesting. IOASID
> nesting can be implemented in two ways: hardware nesting and software
> nesting. With hardware support the child and parent I/O page tables are
> walked consecutively by the IOMMU to form a nested translation. When
> it's implemented in software, /dev/iommu is responsible for merging the
> two-level mappings into a single-level shadow I/O page table.
>
> An user-managed I/O page table can be setup only on the child IOASID,
> implying IOASID nesting must be enabled. This is because the kernel
> doesn't trust userspace. Nesting allows the kernel to enforce its DMA
> isolation policy through the parent IOASID.
>
> Software nesting is useful in several scenarios. First, it allows
> centralized accounting on locked pages between multiple root IOASIDs
> (no parent). In this case a 'dummy' IOASID can be created with an
> identity mapping (HVA->HVA), dedicated for page pinning/accounting and
> nested by all root IOASIDs. Second, it's also useful for mdev drivers
> (e.g. kvmgt) to write-protect guest structures when vIOMMU is enabled.
> In this case the protected addresses are in GIOVA space while KVM
> write-protection API is based on GPA. Software nesting allows finding
> GPA according to GIOVA in the kernel.
>
> 1.2. Attach Device to I/O address space
> +++++++++++++++++++++++++++++++++++++++
>
> Device attach/bind is initiated through passthrough framework uAPI.
>
> Device attaching is allowed only after a device is successfully bound to
> the IOMMU fd. User should provide a device cookie when binding the
> device through VFIO uAPI. This cookie is used when the user queries
> device capability/format, issues per-device iotlb invalidation and
> receives per-device I/O page fault data via IOMMU fd.
>
> Successful binding puts the device into a security context which isolates
> its DMA from the rest system. VFIO should not allow user to access the
s/from the rest system/from the rest of the system
> device before binding is completed. Similarly, VFIO should prevent the
> user from unbinding the device before user access is withdrawn.
With Intel Scalable IOV, I understand you could assign one RID/PASID to
one VM and another one to another VM (which is not the case for ARM). Is
it a targeted use case? How would it be handled? Is it related to the
sub-groups evoked hereafter?

Actually all devices bound to an IOMMU fd should have the same parent
I/O address space or root address space, am I correct? If so, maybe add
this comment explicitly?
> When a device is in an iommu group which contains multiple devices,
> all devices within the group must enter/exit the security context
> together. Please check {1.3} for more info about group isolation via
> this device-centric design.
>
> Successful attaching activates an I/O address space in the IOMMU,
> if the device is not purely software mediated. VFIO must provide device
> specific routing information for where to install the I/O page table in
> the IOMMU for this device. VFIO must also guarantee that the attached
> device is configured to compose DMAs with the routing information that
> is provided in the attaching call. When handling DMA requests, IOMMU
> identifies the target I/O address space according to the routing
> information carried in the request. Misconfiguration breaks DMA
> isolation thus could lead to severe security vulnerability.
>
> Routing information is per-device and bus specific. For PCI, it is
> Requester ID (RID) identifying the device plus optional Process Address
> Space ID (PASID). For ARM, it is Stream ID (SID) plus optional Sub-Stream
> ID (SSID). PASID or SSID is used when multiple I/O address spaces are
> enabled on a single device. For simplicity and continuity reason the
> following context uses RID+PASID though SID+SSID may sound a clearer
> naming from device p.o.v. We can decide the actual naming when coding.
>
> Because one I/O address space can be attached by multiple devices,
> per-device routing information (plus device cookie) is tracked under
> each IOASID and is used respectively when activating the I/O address
> space in the IOMMU for each attached device.
>
> The device in the /dev/iommu context always refers to a physical one
> (pdev) which is identifiable via RID. Physically each pdev can support
> one default I/O address space (routed via RID) and optionally multiple
> non-default I/O address spaces (via RID+PASID).
>
> The device in VFIO context is a logic concept, being either a physical
> device (pdev) or mediated device (mdev or subdev). Each vfio device
> is represented by RID+cookie in IOMMU fd. User is allowed to create
> one default I/O address space (routed by vRID from user p.o.v) per
> each vfio_device.
The concept of default address space is not fully clear to me. I
currently understand this is a root address space (not nested). Is that
correct? This may need clarification.
> VFIO decides the routing information for this default
> space based on device type:
>
> 1) pdev, routed via RID;
>
> 2) mdev/subdev with IOMMU-enforced DMA isolation, routed via
> the parent's RID plus the PASID marking this mdev;
>
> 3) a purely sw-mediated device (sw mdev), no routing required i.e. no
> need to install the I/O page table in the IOMMU. sw mdev just uses
> the metadata to assist its internal DMA isolation logic on top of
> the parent's IOMMU page table;
Maybe you should introduce this concept of SW mediated device earlier,
because it seems to special-case the way the attach behaves. I am
especially referring to

"Successful attaching activates an I/O address space in the IOMMU, if the device is not purely software mediated"

>
> In addition, VFIO may allow user to create additional I/O address spaces
> on a vfio_device based on the hardware capability. In such case the user
> has its own view of the virtual routing information (vPASID) when marking
> these non-default address spaces.
I do not understand what "marking these non-default address spaces" means.
> How to virtualize vPASID is platform
> specific and device specific. Some platforms allow the user to fully
> manage the PASID space thus vPASIDs are directly used for routing and
> even hidden from the kernel. Other platforms require the user to
> explicitly register the vPASID information to the kernel when attaching
> the vfio_device. In this case VFIO must figure out whether vPASID should
> be directly used (pdev) or converted to a kernel-allocated pPASID (mdev)
> for physical routing. Detail explanation about PASID virtualization can
> be found in {1.4}.
>
> For mdev both default and non-default I/O address spaces are routed
> via PASIDs. To better differentiate them we use "default PASID" (or
> defPASID) when talking about the default I/O address space on mdev. When
> vPASID or pPASID is referred in PASID virtualization it's all about the
> non-default spaces. defPASID and pPASID are always hidden from userspace
> and can only be indirectly referenced via IOASID.
>
> 1.3. Group isolation
> ++++++++++++++++++++
>
> Group is the minimal object when talking about DMA isolation in the
> iommu layer. Devices which cannot be isolated from each other are
> organized into a single group. Lack of isolation could be caused by
> multiple reasons: no ACS capability in the upstreaming port, behind a
> PCIe-to-PCI bridge (thus sharing RID), or DMA aliasing (multiple RIDs
> per device), etc.
>
> All devices in the group must be put in a security context together
> before one or more devices in the group are operated by an untrusted
> user. Passthrough frameworks must guarantee that:
>
> 1) No user access is granted on a device before an security context is
> established for the entire group (becomes viable).
>
> 2) Group viability is not broken before the user relinquishes the device.
> This implies that devices in the group must be either assigned to this
> user, or driver-less, or bound to a driver which is known safe (not
> do DMA).
>
> 3) The security context should not be destroyed before user access
> permission is withdrawn.
>
> Existing VFIO introduces explicit container and group semantics in its
> uAPI to meet above requirements:
>
> 1) VFIO user can open a device fd only after:
>
> * A container is created;
> * The group is attached to the container (VFIO_GROUP_SET_CONTAINER);
> * An empty I/O page table is created in the container (VFIO_SET_IOMMU);
> * Group viability is passed and the entire group is attached to
> the empty I/O page table (the security context);
>
> 2) VFIO monitors driver binding status to verify group viability
>
> * IOMMU_GROUP_NOTIFY_BOUND_DRIVER;
> * BUG_ON() if group viability is broken;
>
> 3) Detach the group from the container when the last device fd in the
> group is closed and destroy the I/O page table only after the last
> group is detached from the container.
>
> With this proposal VFIO can move to a simpler device-centric model by
> directly exposeing device node under "/dev/vfio/devices" w/o using
s/exposeing/exposing
> container and group uAPI at all. In this case group isolation is enforced
> mplicitly within IOMMU fd:
s/mplicitly/implicitly
>
> 1) A successful binding call for the first device in the group creates
> the security context for the entire group, by:
>
> * Verifying group viability in a similar way as VFIO does;
>
> * Calling IOMMU-API to move the group into a block-dma state,
> which makes all devices in the group attached to an block-dma
> domain with an empty I/O page table;
This block-dma state/domain would deserve to be better defined (I know
you already evoked it in 1.1 with the dma mapping protocol though).
Does it activate an empty I/O page table in the IOMMU (if the device is
not purely SW mediated)? How does that relate to the default address
space? Is it the same?
>
> VFIO should not allow the user to mmap the MMIO bar of the bound
> device until the binding call succeeds.
>
> Binding other devices in the same group just succeeds since the
> security context has already been established for the entire group.
>
> 2) IOMMU fd monitors driver binding status in case group viability is
> broken, same as VFIO does today. BUG_ON() might be eliminated if we
> can find a way to deny probe of non-iommu-safe drivers.
>
> Before a device is unbound from IOMMU fd, it is always attached to a
> security context (either the block-dma domain or an IOASID domain).
> Switch between two domains is initiated by attaching the device to or
> detaching it from an IOASID. The IOMMU layer should ensure that
> the default domain is not implicitly re-attached in the switching
> process, before the group is moved out of the block-dma state.
>
> To stay on par with legacy VFIO, IOMMU fd could verify that all
> bound devices in the same group must be attached to a single IOASID.
>
> 3) When a device fd is closed, VFIO automatically unbinds the device from
> IOMMU fd before zapping the mmio mapping. Unbinding the last device
> in the group moves the entire group out of the block-dma state and
> re-attached to the default domain.
>
> Actual implementation may use a staging approach, e.g. only support
> one-device group in the start (leaving multi-devices group handled via
> legacy VFIO uAPI) and then cover multi-devices group in a later stage.
>
> If necessary, devices within a group may be further allowed to be
> attached to different IOASIDs in the same IOMMU fd, in case that the
> source devices can be reliably identifiable (e.g. due to !ACS). This will
> require additional sub-group logic in the iommu layer and with
> sub-group topology exposed to userspace. But no expectation of
> changing the device-centric semantics except introducing sub-group
> awareness within IOMMU fd.
This is a bit cryptic to me. Devices using different child IOASIDs and
the same parent IOASID are allowed within the same IOMMU fd, right?
Could you please clarify?
>
> A more detailed explanation of the staging approach can be found:
>
> https://lore.kernel.org/linux-iommu/BN9PR11MB543382665D34E58155A9593C8C039@BN9PR11MB5433.namprd11.prod.outlook.com/
>
> 1.4. PASID Virtualization
> +++++++++++++++++++++++++
>
> As explained in {1.2}, PASID virtualization is required when multiple I/O
> address spaces are supported on a device. The actual policy is per-device
> thus defined by specific VFIO device driver.
>
> A PASID virtualization policy is defined by four aspects:
>
> 1) Whether this device allows the user to create multiple I/O address
> spaces (vPASID capability). This is decided upon whether this device
> and its upstream IOMMU both support PASID.
>
> 2) If yes, whether the PASID space is delegated to the user, based on
> whether the PASID table should be managed by user or kernel.
>
> 3) If no, the user should register vPASID to the kernel. Then the next
> question is whether vPASID should be directly used for physical routing
> (vPASID==pPASID or vPASID!=pPASID). The key is whether this device
> must share the PASID space with others (pdev vs. mdev).
>
> 4) If vPASID!=pPASID, whether pPASID should be allocated from the
> per-RID space or a global space. This is about whether the device
> supports PCIe DMWr-type work submission (e.g. Intel ENQCMD) which
> requires global pPASID allocation cross multiple devices.
>
> Only vPASIDs are part of the VM state to be migrated in VM live migration.
> This is basically about the virtual PASID table state in vendor vIOMMU. If
> vPASID!=pPASID, new pPASIDs will be re-allocated on the destination and
> VFIO device driver is responsible for programming the device to use the
> new pPASID when restoring the device state.
>
> Different policies may imply different uAPI semantics for user to follow
> when attaching a device. The semantics information is expected to be
> reported to the user via VFIO uAPI instead of via IOMMU fd, since the
> latter only cares about pPASID. But if there is a different thought we'd
> like to hear it.
>
> Following sections (1.4.1 - 1.4.3) provide detail explanation on how
> above are selected on different device types and the implication when
> multiple types are mixed together (i.e. assigned to a single user). Last
> section (1.4.4) then summarizes what uAPI semantics information is
> reported and how user is expected to deal with it.
>
> 1.4.1. Devices which don't support DMWr
> ***************************************
>
> This section is about following types:
>
> 1) a pdev which doesn't issue PASID;
> 2) a sw mdev which doesn't issue PASID;
> 3) a mdev which is programmed a fixed defPASID (for default I/O address
> space), but does not expose vPASID capability;
>
> 4) a pdev which exposes vPASID and has its PASID table managed by user;
> 5) a pdev which exposes vPASID and has its PASID table managed by kernel;
> 6) a mdev which exposes vPASID and shares the parent's PASID table
> with other mdev's;
>
> +--------+---------+---------+----------+-----------+
> | | |Delegated| vPASID== | per-RID |
> | | vPASID | to user | pPASID | pPASID |
> +========+=========+=========+==========+===========+
> | type-1 | N/A | N/A | N/A | N/A |
> +--------+---------+---------+----------+-----------+
> | type-2 | N/A | N/A | N/A | N/A |
> +--------+---------+---------+----------+-----------+
> | type-3 | N/A | N/A | N/A | N/A |
> +--------+---------+---------+----------+-----------+
> | type-4 | Yes | Yes | v==p(*)| per-RID(*)|
> +--------+---------+---------+----------+-----------+
> | type-5 | Yes | No | v==p | per-RID |
> +--------+---------+---------+----------+-----------+
> | type-6 | Yes | No | v!=p | per-RID |
> +--------+---------+---------+----------+-----------+
> <* conceptual definition though the PASID space is fully delegated>
>
> for 1-3 there is no vPASID capability exposed and the user can create
> only one default I/O address space on this device. Thus there is no PASID
> virtualization at all.
>
> 4) is specific to ARM/AMD platforms where the PASID table is managed by
> the user. In this case the entire PASID space is delegated to the user
> which just needs to create a single IOASID linked to the user-managed
> PASID table, as placeholder covering all non-default I/O address spaces
> on pdev. In concept this looks like a big 84bit address space (20bit
> PASID + 64bit addr). vPASID may be carried in the uAPI data to help define
> the operation scope when invalidating IOTLB or reporting I/O page fault.
> IOMMU fd doesn't touch it and just acts as a channel for vIOMMU/pIOMMU to
> exchange info.
>
> 5) is specific to Intel platforms where the PASID table is managed by
> the kernel. In this case vPASIDs should be registered to the kernel
> in the attaching call. This implies that every non-default I/O address
> space on pdev is explicitly tracked by an unique IOASID in the kernel.
> Because pdev is fully controlled by the user, its DMA request carries
> vPASID as the routing informaiton thus requires VFIO device driver to
s/informaiton/information
> adopt vPASID==pPASID policy. Because an IOASID already represents a
> standalone address space, there is no need to further carry vPASID in
> the invalidation and fault paths.
>
> 6) is about mdev, as those enabled by Intel Scalable IOV. The main
> difference from type-5) is on whether vPASID==pPASID. There is
> only a single PASID table per the parent device, implying the per-RID
> PASID space shared by all mdevs created on this parent. VFIO device
> driver must use vPASID!=pPASID policy and allocate a pPASID from the
> per-RID space for every registered vPASID to guarantee DMA isolation
> between sibling mdev's. VFIO device driver needs to conduct vPASID->
> pPASID conversion properly in several paths:
>
> - When VFIO device driver provides the routing information in the
> attaching call, since IOMMU fd only cares about pPASID;
> - When VFIO device driver updates a PASID MMIO register in the
> parent according to the vPASID intercepted in the mediation path;
>
> 1.4.2. Devices which support DMWr
> *********************************
>
> Modern devices may support a scalable workload submission interface
> based on PCI Deferrable Memory Write (DMWr) capability, allowing a
> single work queue to access multiple I/O address spaces. One example
> using DMWr is Intel ENQCMD, having PASID saved in the CPU MSR and
> carried in the non-posted DMWr payload when sent out to the device.
> Then a single work queue shared by multiple processes can compose
> DMAs toward different address spaces, by carrying the PASID value
> retrieved from the DMWr payload. The role of DMWr is allowing the
> shared work queue to return a retry response when the work queue
> is under pressure (due to capacity or QoS). Upon such response the
> software could try re-submitting the descriptor.
>
> When ENQCMD is executed in the guest, the value saved in the CPU
> MSR is vPASID (part of the xsave state). This creates another point for
> consideration regarding to PASID virtualization.
>
> Two device types are relevant:
>
> 7) a pdev same as 5) plus DMWr support;
> 8) a mdev same as 6) plus DMWr support;
>
> and respective polices:
>
> +--------+---------+---------+----------+-----------+
> | | |Delegated| vPASID== | per-RID |
> | | vPASID | to user | pPASID | pPASID |
> +========+=========+=========+==========+===========+
> | type-7 | Yes | Yes | v==p | per-RID |
> +--------+---------+---------+----------+-----------+
> | type-8 | Yes | Yes | v!=p | global |
> +--------+---------+---------+----------+-----------+
>
> DMWr or shared mode is configurable per work queue. It's completely
> sane if an assigned device with multiple queues needs to handle both
> DMWr (shared work queue) and normal write (dedicated work queue)
> simultaneously. Thus the PASID virtualization policy must be consistent
> when both paths are activated.
>
> for 7) we should use the same policy as 5), i.e. directly using vPASID
> for physical routing on pdev. In this case ENQCMD in the guest just works
> w/o additional work because the vPASID saved in the PASID_MSR
> matches the routing information configured for the target I/O address
> space in the IOMMU. When receiving a DMWr request, the shared
> work queue grabs vPASID from the payload and then tags outgoing
> DMAs with vPASID. This is consistent with the dedicated work queue
> path where vPASID is grabbed from the MMIO register to tag DMAs.
>
> for 8) vPASID in the PASID_MSR must be converted to pPASID before
> sent to the wire (given vPASID!=pPASID for the same reason as 6).
> Intel CPU provides a hardware PASID translation capability for auto-
> conversion when ENQCMD is being executed. In this case the payload
> received by the work queue contains pPASID thus outgoing DMAs are
> tagged with pPASID. This is consistent with the dedicated work
> queue path where pPASID is programmed to the MMIO register in the
> mediation path and then grabbed to tag DMAs.
>
> However, the CPU translation structure is per-VM, which implies
> that the same pPASID must be used across all type-8 devices (of this VM)
> for a given vPASID. This requires the pPASID to be allocated from a global
> pool by the first type-8 device and then shared by subsequent type-8
> devices when they are attached to the same vPASID.
>
> CPU translation capability is enabled via KVM uAPI. We need a secure
> contract between the VFIO device fd and the KVM fd so the VFIO device
> driver knows when it is safe to allow guest access to the cmd portal of
> the type-8 device. It is dangerous to allow the guest to issue ENQCMD to
> the device before the CPU is ready for PASID translation. In this window
> the vPASID is untranslated, thus allowing the guest to access a random
> I/O address space on the parent of this mdev.
>
> We plan to utilize the existing kvm-vfio contract. It is currently used
> for multiple purposes including propagating the kvm pointer to the VFIO
> device driver. It can be extended to further notify whether the CPU PASID
> translation capability is turned on. Before receiving this notification,
> the VFIO device driver should not allow the user to access the DMWr-capable
> work queue on a type-8 device.
>
> 1.4.3. Mix different types together
> ***********************************
>
> In the majority of cases mixing different types doesn't change the
> aforementioned PASID virtualization policy for each type. The user just
> needs to handle them on a per-device basis.
>
> There is one exception though, when mixing type 7) and 8) together,
> due to conflicting policies on how PASID_MSR should be handled.
> For mdev (type-8) the CPU translation capability must be enabled to
> prevent a malicious guest from doing bad things. But once per-VM
> PASID translation is enabled, the shared work queue of the pdev (type-7)
> will also receive a pPASID allocated for the mdev instead of the vPASID
> that is expected on this pdev.
>
> Fixing this exception for pdev is not easy. There are three options.
>
> One is moving the pdev to also accept pPASID. Because the pdev may have
> both a shared work queue (PASID in MSR) and a dedicated work queue (PASID
> in MMIO) enabled by the guest, this requires the VFIO device driver to
> mediate the dedicated work queue path so vPASIDs programmed by
> the guest are manually translated to pPASIDs before being written to the
> pdev. This may add undesired software complexity and a potential
> performance impact if the PASID register is located alongside other
> fast-path resources in the same 4K page. If it works, it essentially
> converts type-7 to type-8 from the user's p.o.v.
>
> The second option is an enlightened approach where the guest
> directly uses the host-allocated pPASIDs instead of creating its own vPASID
> space. In this case even the dedicated work queue path uses pPASID w/o
> the need for mediation. However this requires different uAPI semantics
> (from register-vPASID to return-pPASID) and exposes pPASID knowledge
> to userspace, which also implies breaking VM live migration.
>
> The third option is making pPASID an alias of vPASID for routing,
> with both linked to the same I/O page table in the IOMMU, so
> either way can hit the desired address space. This further requires some
> sort of range-split scheme to avoid conflicts between vPASIDs and pPASIDs.
> However, we haven't found a clean way to fold this trick into this uAPI
> proposal yet, and this option may not work when the PASID is also used to
> tag the IMS entry for verifying the interrupt source. In that case there
> is no room for aliasing.
>
> So, none of the above works cleanly based on current thoughts. We
> decided not to support the type-7/8 mix in this proposal. The user can
> detect this exception based on the reported PASID flags, as outlined in
> the next section.
>
> 1.4.4. User sequence
> ********************
>
> A new PASID capability info could be introduced to VFIO_DEVICE_GET_INFO.
> Its presence indicates that the user is allowed to create multiple I/O
> address spaces with vPASID on the device. This capability further includes
> the following flags to help describe the desired uAPI semantics:
>
> - PASID_DELEGATED; // PASID space delegated to the user?
> - PASID_CPU; // Allow vPASID used in the CPU?
> - PASID_CPU_VIRT; // Require vPASID translation in the CPU?
>
> The last two flags together help the user to detect the unsupported
> type 7/8 mix scenario.
>
> Take Qemu as an example. It queries the above flags for every vfio device
> at initialization time, after identifying the PASID capability (a
> pseudo-code sketch follows these steps):
>
> 1) If PASID_DELEGATED is set, the PASID space is fully managed by the
> user thus a single IOASID (linked to user-managed page table) is
> required as the placeholder for all non-default I/O address spaces
> on the device.
>
> If not set, an IOASID must be created for every non-default I/O address
> space on this device and vPASID must be registered to the kernel
> when attaching the device to this IOASID.
>
> The user may want to sanity-check that all devices report the same
> setting, as this flag is a platform attribute even though it is exported
> per device.
>
> If not set, continue to step 2.
>
> 2) If PASID_CPU is not set, done.
>
> Otherwise check whether the PASID_CPU_VIRT flag on this device is
> consistent with all other devices with PASID_CPU set.
>
> If an inconsistency is found (indicating a type-7/8 mix), only one
> category of devices (all set, or all clear) should have the vPASID
> capability exposed to the guest.
>
> 3) If PASID_CPU_VIRT is not set, done.
>
> If set and the consistency check in 2) passes, call the KVM uAPI to
> enable CPU PASID translation if this is the first device with the flag
> set. Later, when a new vPASID is identified through the vIOMMU at
> run-time, call another KVM uAPI to update the corresponding PASID mapping.
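>
> In pseudo-code, the initialization flow could look like below. The helper
> names (query_pasid_cap, disable_vpasid_for, enable_cap) are illustrative
> only and not part of this proposal:
>
>     pasid_cpu_virt = UNKNOWN;
>
>     for each vfio device dev {
>         cap = query_pasid_cap(dev);        // via VFIO_DEVICE_GET_INFO
>         if (!cap)
>             continue;                      // no vPASID support on dev
>
>         if (cap.flags & PASID_DELEGATED)
>             continue;                      // one placeholder IOASID later
>
>         // otherwise one IOASID per non-default space; vPASID is
>         // registered at VFIO_ATTACH_IOASID time
>
>         if (!(cap.flags & PASID_CPU))
>             continue;
>
>         virt = !!(cap.flags & PASID_CPU_VIRT);
>         if (pasid_cpu_virt == UNKNOWN)
>             pasid_cpu_virt = virt;
>         else if (pasid_cpu_virt != virt)
>             disable_vpasid_for(dev);       // type-7/8 mix: expose only one category
>
>         if (virt && !kvm_pasid_translation_enabled)
>             enable_cap(kvm_fd, KVM_CAP_PASID_TRANSLATION);
>     }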
>
> 1.5. No-snoop DMA
> ++++++++++++++++++++
>
> Snoop behavior of a DMA specifies whether the access is coherent (snoops
> the processor caches) or not. The snoop behavior is decided by both device
> and IOMMU. The device can set a no-snoop attribute in a DMA request to
> force non-coherent behavior, while the IOMMU may support a configuration
> which enforces DMAs to be coherent (with the no-snoop attribute ignored).
>
> No-snoop DMA requires the driver to manually flush caches in order to
> observe the latest content. When such a driver runs in the guest, it
> further requires KVM to intercept/emulate WBINVD and to favor guest
> cache attributes in the EPT page table.
>
> Alex helped create a matrix as below:
> (https://lore.kernel.org/linux-iommu/PH0PR12MB54811863B392C644E5365446DC3E9@PH0PR12MB5481.namprd12.prod.outlook.com/T/#mbfc96278b078d3ec07eabb9aa46abfe03a886dc6)
>
> \ Device supports
> IOMMU enforces\ no-snoop
> snoop \ yes | no |
> ----------------+-----+-----+
> yes | 1 | 2 |
> ----------------+-----+-----+
> no | 3 | 4 |
> ----------------+-----+-----+
>
> DMA is always coherent in boxes {1, 2, 4}. No-snoop DMA is allowed
> in {3} but whether it is actually used is a driver decision.
>
> VFIO currently adopts a simple policy - always turn on IOMMU enforce-
> snoop if available. It provides a contract via the kvm-vfio fd for KVM to
> learn whether no-snoop DMA is used, so that the special handling of WBINVD
> and EPT can be enabled. However, the criterion for no-snoop DMA is
> solely the lack of IOMMU enforce-snoop for any vfio device, i.e. both
> {3} and {4} are considered capable of doing no-snoop DMA. This model has
> several limitations:
>
> - It's impossible to move a device from {1} to {3} when no-snoop DMA
> is a must to achieve the desired user experience;
>
> - Unnecessary overhead on the KVM side in {4}, or if the driver doesn't
> do no-snoop DMA in {3}. Although the driver doesn't use WBINVD, the
> guest still uses WBINVD in other places, e.g. when changing cache-
> related registers (e.g. MTRR/CR0);
>
> We want to adopt a user-driven model in /dev/iommu for more accurate
> control over the no-snoop usage. In this model the enforce-snoop format
> is specified when an IOASID is created, while the device's no-snoop usage
> can be further clarified when it's attached to the IOASID.
>
> IOMMU fd is expected to provide uAPIs and helper functions for:
>
> - reporting IOMMU enforce-snoop capability to the user per device
> cookie (device no-snoop capability is reported via VFIO).
>
> - allowing the user to specify whether an IOASID should be created in the
> IOMMU enforce-snoop format (enable/disable/auto):
>
> * This allows moving a device from 1) to 3) in case of performance
> requirement.
>
> * 'auto' falls back to the legacy VFIO policy, i.e. always enables
> enforce-snoop if available.
>
> * Any device can be attached to a non-enforce-snoop IOASID,
> because this format is supported by all IOMMUs. In this case the
> device belongs to {3, 4} and whether it is considered doing no-snoop
> DMA is decided by the next interface.
>
> * Attaching a device which cannot be forced to snoop by its IOMMU
> to an enforce-snoop IOASID fails. Successful attaching
> implies the device always does snoop DMA, i.e. it belongs to {1, 2}.
>
> * Some platforms support page-granular enforce-snoop. One open question
> is whether a page-granular interface is necessary here.
>
> - allowing the user to further hint whether no-snoop DMA is actually used
> in {3, 4} on a specific IOASID, via the VFIO attaching call:
>
> * in case the user has such intrinsic knowledge on a specific device.
>
> * {3} can be filtered out with this hint.
>
> * {4} can be filtered out automatically by VFIO device driver,
> based on device no-snoop capability.
>
> * If no hint is provided, fall back to legacy VFIO policy, i.e.
> treating all devices in {3, 4} as capable of doing no-snoop.
>
> - a new contract for KVM to learn whether any IOASID is attached by
> devices which require no-snoop DMA:
>
> * We once thought the existing kvm-vfio fd could be leveraged as a
> short-term approach (see above link). However kvm-vfio is centered
> on the vfio group concept, while this proposal is moving to a device-
> centric model.
>
> * The new contract will allow KVM to query the no-snoop requirement
> per IOMMU fd. This will apply to all passthrough frameworks.
>
> * A notification mechanism might be introduced to switch between
> WBINVD emulation and no-op interception according to device
> attaching status changes in the registered IOMMU fd.
>
> * Whether kvm-vfio will be completely deprecated is TBD. It's
> still used for non-iommu related contracts, e.g. notifying the kvm
> pointer to the mdev driver and pvIOMMU acceleration on PPC.
>
> - optional bulk cache invalidation:
>
> * A userspace driver can use clflush to invalidate cachelines for
> buffers used for no-snoop DMA. But this may be inefficient when
> a big buffer needs to be invalidated. In that case a bulk
> invalidation could be provided based on WBINVD.
>
> The implementation might take a staged approach. At the start, IOMMU fd
> only supports devices which can be forced to snoop via the IOMMU (i.e.
> {1, 2}), while leaving {3, 4} still handled via legacy VFIO. In
> this case there is no need to introduce a new contract with KVM. An easy
> way is to have VFIO not expose {3, 4} devices in /dev/vfio/devices. Then
> we have plenty of time to figure out the implementation details of the
> new model at a later stage.
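>
> To illustrate the intended usage with the uAPI defined in section 2
> (field and flag names below, such as enforce_snoop and
> IOASID_ATTACH_HINT_NO_SNOOP, are illustrative only):
>
>     /* create an IOASID without enforce-snoop, i.e. allowing {3} */
>     alloc_data = {
>         .user_pgtable = false;
>         .enforce_snoop = DISABLE;    // enable/disable/auto
>     };
>     ioasid = ioctl(iommu_fd, IOMMU_IOASID_ALLOC, &alloc_data);
>
>     /* hint that this device will actually issue no-snoop DMA */
>     at_data = {
>         .fd = iommu_fd;
>         .ioasid = ioasid;
>         .flag = IOASID_ATTACH_HINT_NO_SNOOP;
>     };
>     ioctl(device_fd, VFIO_ATTACH_IOASID, &at_data);
>
>     /* KVM then learns the no-snoop requirement per IOMMU fd (contract TBD) */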
>
> 2. uAPI Proposal
> ----------------------
>
> /dev/iommu uAPI covers everything about managing I/O address spaces.
>
> /dev/vfio device uAPI builds connection between devices and I/O address
> spaces.
>
> /dev/kvm uAPI is optionally required as far as no-snoop DMA or ENQCMD
> is concerned.
>
> 2.1. /dev/iommu uAPI
> ++++++++++++++++++++
>
> /*
> * Check whether an uAPI extension is supported.
> *
> * It's unlikely that all planned capabilities in IOMMU fd will be ready in
> * one breath. The user should check which uAPI extensions are supported
> * according to its intended usage.
> *
> * A rough list of possible extensions may include:
> *
> * - EXT_MAP_TYPE1V2 for vfio type1v2 map semantics;
> * - EXT_MAP_NEWTYPE for an enhanced map semantics;
> * - EXT_IOASID_NESTING for what the name stands;
> * - EXT_USER_PAGE_TABLE for user managed page table;
> * - EXT_USER_PASID_TABLE for user managed PASID table;
> * - EXT_MULTIDEV_GROUP for 1:N iommu group;
> * - EXT_DMA_NO_SNOOP for no-snoop DMA support;
> * - EXT_DIRTY_TRACKING for tracking pages dirtied by DMA;
> * - ...
> *
> * Return: 0 if not supported, 1 if supported.
> */
> #define IOMMU_CHECK_EXTENSION _IO(IOMMU_TYPE, IOMMU_BASE + 0)
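>
> For example, a user relying on nesting could probe it as below (the
> extension IDs above are tentative):
>
>     if (!ioctl(iommu_fd, IOMMU_CHECK_EXTENSION, EXT_IOASID_NESTING)) {
>         /* fall back to userspace shadowing, as in sample flow 4.2 */
>     }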
>
>
> /*
> * Check capabilities and format information on a bound device.
> *
> * It could be reported either via a capability chain as implemented in
> * VFIO or a per-capability query interface. The device is identified
> * by device cookie (registered when binding this device).
> *
> * Sample capability info:
> * - VFIO type1 map: supported page sizes, permitted IOVA ranges, etc.;
> * - IOASID nesting: hardware nesting vs. software nesting;
> * - User-managed page table: vendor specific formats;
> * - User-managed pasid table: vendor specific formats;
> * - coherency: whether IOMMU can enforce snoop for this device;
> * - ...
> *
> */
> #define IOMMU_DEVICE_GET_INFO _IO(IOMMU_TYPE, IOMMU_BASE + 1)
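>
> One possible shape of the query structure, if a VFIO-style capability
> chain is chosen (layout illustrative only):
>
>     struct iommu_device_info {
>         __u32   argsz;
>         __u32   flags;
>         __u64   dev_cookie;     /* device cookie registered at bind time */
>         __u32   cap_offset;     /* start of the capability chain, 0 if none */
>         __u32   __reserved;
>         /* capabilities follow: map limits, nesting, pgtable formats, ... */
>     };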
>
>
> /*
> * Allocate an IOASID.
> *
> * IOASID is the FD-local software handle representing an I/O address
> * space. Each IOASID is associated with a single I/O page table. User
> * must call this ioctl to get an IOASID for every I/O address space that is
> * intended to be tracked by the kernel.
> *
> * User needs to specify the attributes of the IOASID and associated
> * I/O page table format information according to one or multiple devices
> * which will be attached to this IOASID right after. The I/O page table
> * is activated in the IOMMU when it's attached by a device. Incompatible

.. if not SW mediated
> * format between device and IOASID will lead to attaching failure.
> *
> * The root IOASID should always have a kernel-managed I/O page
> * table for safety. Locked page accounting is also conducted on the root.
The definition of root IOASID is not easily found in this spec. Maybe
this would deserve some clarification.
> * Multiple roots are possible, e.g. when multiple I/O address spaces
> * are created but IOASID nesting is disabled. However, one page might
> * be accounted multiple times in this case. The user is recommended to
> * instead create a 'dummy' root with identity mapping (HVA->HVA) for
> * centralized accounting, nested by all other IOASIDs which represent
> * 'real' I/O address spaces.
> *
> * Sample attributes:
> * - Ownership: kernel-managed or user-managed I/O page table;
> * - IOASID nesting: the parent IOASID info if enabled;
> * - User-managed page table: addr and vendor specific formats;
> * - User-managed pasid table: addr and vendor specific formats;
> * - coherency: enforce-snoop;
> * - ...
> *
> * Return: allocated ioasid on success, -errno on failure.
> */
> #define IOMMU_IOASID_ALLOC _IO(IOMMU_TYPE, IOMMU_BASE + 2)
> #define IOMMU_IOASID_FREE _IO(IOMMU_TYPE, IOMMU_BASE + 3)
>
>
> /*
> * Map/unmap process virtual addresses to I/O virtual addresses.
> *
> * Provide VFIO type1 equivalent semantics. Start with the same
> * restriction e.g. the unmap size should match those used in the
> * original mapping call.
> *
> * If the specified IOASID is the root, the mapped pages are automatically
> * pinned and accounted as locked memory. Pinning might be postponed
> * until the IOASID is attached by a device. A software mdev driver may
> * further provide a hint to skip auto-pinning at attaching time, since
> * it does selective pinning at run-time. Auto-pinning can also be
> * skipped when I/O page fault is enabled on the root.
> *
> * When software nesting is enabled, this implies that the merged
> * shadow mapping will also be updated accordingly. However if the
> * change happens on the parent, it requires reverse lookup to update
> * all relevant child mappings, which is time consuming. So the user
> * is advised not to change the parent mapping after the software
> * nesting is established (maybe disallow?). There is no such restriction
> * with hardware nesting, as the IOMMU will pick up the change
> * when actually walking the page table.
> *
> * Input parameters:
> * - u32 ioasid;
> * - refer to vfio_iommu_type1_dma_{un}map
> *
> * Return: 0 on success, -errno on failure.
> */
> #define IOMMU_MAP_DMA _IO(IOMMU_TYPE, IOMMU_BASE + 4)
> #define IOMMU_UNMAP_DMA _IO(IOMMU_TYPE, IOMMU_BASE + 5)
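>
> A possible parameter layout, mirroring struct vfio_iommu_type1_dma_map
> with an added ioasid field (illustrative only):
>
>     struct iommu_ioasid_dma_map {
>         __u32   argsz;
>         __u32   flags;          /* read/write permission, as in VFIO type1 */
>         __u32   ioasid;
>         __u32   __reserved;
>         __u64   vaddr;          /* process virtual address */
>         __u64   iova;           /* I/O virtual address */
>         __u64   size;           /* must match at unmap time, per the type1 restriction */
>     };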
>
>
> /*
> * Invalidate IOTLB for a user-managed I/O page table
> *
> * Check include/uapi/linux/iommu.h for supported cache types and
> * granularities. Device cookie and vPASID may be specified to help
> * decide the scope of this operation.
> *
> * Input parameters:
> * - child_ioasid;
> * - granularity (per-device, per-pasid, range-based);
> * - cache type (iotlb, devtlb, pasid cache);
> *
> * Return: 0 on success, -errno on failure
> */
> #define IOMMU_INVALIDATE_CACHE _IO(IOMMU_TYPE, IOMMU_BASE + 6)
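>
> A possible parameter layout based on the inputs above (illustrative only;
> the cache/granularity encodings would follow include/uapi/linux/iommu.h):
>
>     struct iommu_ioasid_invalidate {
>         __u32   argsz;
>         __u32   flags;          /* which optional fields below are valid */
>         __u32   ioasid;         /* child IOASID with the user-managed pgtable */
>         __u32   granularity;    /* per-device, per-pasid or range-based */
>         __u32   cache;          /* iotlb, devtlb or pasid cache */
>         __u32   pasid;          /* optional, narrows the scope */
>         __u64   dev_cookie;     /* optional, narrows the scope */
>         __u64   addr;
>         __u64   size;
>     };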
>
>
> /*
> * Page fault report and response
> *
> * This is TBD. Can be added after other parts are cleared up. It may
> * include a fault region to report fault data via read(), an
> * eventfd to notify the user and an ioctl to complete the fault.
> *
> * The fault data includes {IOASID, device_cookie, faulting addr, perm}
> * as common info. vendor specific fault info can be also included if
> * necessary.
> *
> * If the IOASID represents a user-managed PASID table, the vendor
> * fault info includes vPASID information for the user to figure out
> * which I/O page table triggers the fault.
> *
> * If the IOASID represents a user-managed I/O page table, the user
> * is expected to find out vPASID itself according to {IOASID, device_
> * cookie}.
> */
>
>
> /*
> * Dirty page tracking
> *
> * Track and report memory pages dirtied in I/O address spaces. There
> * is ongoing work by Kunkun Jiang extending the existing VFIO type1
> * support. It needs to be adapted to /dev/iommu later.
> */
>
>
> 2.2. /dev/vfio device uAPI
> ++++++++++++++++++++++++++
>
> /*
> * Bind a vfio_device to the specified IOMMU fd
> *
> * The user should provide a device cookie when calling this ioctl. The
> * cookie is later used in IOMMU fd for capability query, iotlb invalidation
> * and I/O fault handling.
> *
> * User is not allowed to access the device before the binding operation
> * is completed.
> *
> * Unbind is automatically conducted when device fd is closed.
> *
> * Input parameters:
> * - iommu_fd;
> * - cookie;
> *
> * Return: 0 on success, -errno on failure.
> */
> #define VFIO_BIND_IOMMU_FD _IO(VFIO_TYPE, VFIO_BASE + 22)
>
>
> /*
> * Report vPASID info to userspace via VFIO_DEVICE_GET_INFO
> *
> * Add a new device capability. The presence indicates that the user
> * is allowed to create multiple I/O address spaces on this device. The
> * capability further includes following flags:
> *
> * - PASID_DELEGATED, if clear every vPASID must be registered to
> * the kernel;
> * - PASID_CPU, if set vPASID is allowed to be carried in the CPU
> * instructions (e.g. ENQCMD);
> * - PASID_CPU_VIRT, if set require vPASID translation in the CPU;
> *
> * The user must check that all devices with PASID_CPU set have the
> * same setting on PASID_CPU_VIRT. If mismatching, it should enable
> * vPASID only in one category (all set, or all clear).
> *
> * When the user enables vPASID on the device with PASID_CPU_VIRT
> * set, it must enable vPASID CPU translation via kvm fd before attempting
> * to use ENQCMD to submit work items. The command portal is blocked
> * by the kernel until the CPU translation is enabled.
> */
> #define VFIO_DEVICE_INFO_CAP_PASID 5
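>
> A possible encoding, reusing the existing vfio_info_cap_header chain
> (flag values are illustrative):
>
>     struct vfio_device_info_cap_pasid {
>         struct vfio_info_cap_header header; /* .id = VFIO_DEVICE_INFO_CAP_PASID */
>     #define VFIO_PASID_DELEGATED   (1 << 0)
>     #define VFIO_PASID_CPU         (1 << 1)
>     #define VFIO_PASID_CPU_VIRT    (1 << 2)
>         __u32   flags;
>         __u32   __reserved;
>     };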
>
>
> /*
> * Attach a vfio device to the specified IOASID
> *
> * Multiple vfio devices can be attached to the same IOASID, and vice
> * versa.
> *
> * User may optionally provide a "virtual PASID" to mark an I/O page
> * table on this vfio device, if PASID_DELEGATED is not set in device info.
> * Whether the virtual PASID is physically used or converted to another
> * kernel-allocated PASID is a policy in the kernel.
> *
> * Because one device is allowed to bind to multiple IOMMU fd's, the
> * user should provide both iommu_fd and ioasid for this attach operation.
> *
> * Input parameter:
> * - iommu_fd;
> * - ioasid;
> * - flag;
> * - vpasid (if specified);
> *
> * Return: 0 on success, -errno on failure.
> */
> #define VFIO_ATTACH_IOASID _IO(VFIO_TYPE, VFIO_BASE + 23)
> #define VFIO_DETACH_IOASID _IO(VFIO_TYPE, VFIO_BASE + 24)
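>
> A possible parameter layout for the attach/detach calls (illustrative
> only):
>
>     struct vfio_device_attach_ioasid {
>         __u32   argsz;
>         __u32   flags;          /* e.g. whether vpasid below is valid */
>         __s32   iommu_fd;
>         __u32   ioasid;
>         __u32   vpasid;         /* ignored unless the flag is set */
>         __u32   __reserved;
>     };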
>
>
> 2.3. KVM uAPI
> +++++++++++++
>
> /*
> * Check/enable CPU PASID translation via KVM CAP interface
> *
> * This is necessary when ENQCMD will be used in the guest while the
> * targeted device doesn't accept the vPASID saved in the CPU MSR.
> */
> #define KVM_CAP_PASID_TRANSLATION 206
>
>
> /*
> * Update CPU PASID mapping
> *
> * This command allows user to set/clear the vPASID->pPASID mapping
> * in the CPU, by providing the IOASID (and FD) information representing
> * the I/O address space marked by this vPASID. KVM calls iommu helper
> * function to retrieve pPASID according to the input parameters. So the
> * pPASID value is completely hidden from the user.
> *
> * Input parameters:
> * - user_pasid;
> * - iommu_fd;
> * - ioasid;
> */
> #define KVM_MAP_PASID _IO(KVMIO, 0xf0)
> #define KVM_UNMAP_PASID _IO(KVMIO, 0xf1)
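>
> A possible parameter layout (illustrative only):
>
>     struct kvm_pasid_mapping {
>         __u32   flags;
>         __u32   user_pasid;     /* vPASID as seen by the guest */
>         __s32   iommu_fd;
>         __u32   ioasid;         /* KVM resolves pPASID from (iommu_fd, ioasid) */
>     };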
>
>
> /*
> * and a new contract to exchange no-snoop dma status with IOMMU fd.
> * this will be a device-centric interface, thus existing vfio-kvm contract
> * is not suitable as it's group-centric.
> *
> * actual definition TBD.
> */
>
>
> 3. Sample structures and helper functions
> --------------------------------------------------------
>
> Three helper functions are provided to support VFIO_BIND_IOMMU_FD:
>
> struct iommu_ctx *iommu_ctx_fdget(int fd);
> struct iommu_dev *iommu_register_device(struct iommu_ctx *ctx,
> struct device *device, u64 cookie);
> int iommu_unregister_device(struct iommu_dev *dev);
>
> An iommu_ctx is created for each fd:
>
> struct iommu_ctx {
> // a list of allocated IOASID data's
> struct xarray ioasid_xa;
>
> // a list of registered devices
> struct xarray dev_xa;
> };
>
> Later some group-tracking fields will also be introduced to support
> multi-device groups.
>
> Each registered device is represented by iommu_dev:
>
> struct iommu_dev {
> struct iommu_ctx *ctx;
> // always be the physical device
> struct device *device;
> u64 cookie;
> struct kref kref;
> };
>
> A successful binding establishes a security context for the bound
> device and returns struct iommu_dev pointer to the caller. After this
> point, the user is allowed to query device capabilities via IOMMU_
> DEVICE_GET_INFO.
>
> For mdev the struct device should be the pointer to the parent device.
>
> An ioasid_data is created when IOMMU_IOASID_ALLOC, as the main
> object describing characteristics about an I/O page table:
>
> struct ioasid_data {
> struct iommu_ctx *ctx;
>
> // the IOASID number
> u32 ioasid;
>
> // the handle for kernel-managed I/O page table
> struct iommu_domain *domain;
>
> // map metadata (vfio type1 semantics)
> struct rb_node dma_list;
>
> // pointer to user-managed pgtable
> u64 user_pgd;
>
> // link to the parent ioasid (for nesting)
> struct ioasid_data *parent;
>
> // IOMMU enforce-snoop
> bool enforce_snoop;
>
> // various format information
> ...
>
> // a list of device attach data (routing information)
> struct list_head attach_data;
>
> // a list of fault_data reported from the iommu layer
> struct list_head fault_data;
>
> ...
> }
>
> iommu_domain is the object for operating the kernel-managed I/O
> page tables in the IOMMU layer. ioasid_data is associated to an
> iommu_domain explicitly or implicitly:
>
> - root IOASID (except the 'dummy' one for locked accounting)
> must use a kernel-managed I/O page table and is thus always linked to an
> iommu_domain;
>
> - child IOASID (via software nesting) is explicitly linked to an iommu
> domain as the shadow I/O page table is managed by the kernel;
>
> - child IOASID (via hardware nesting) is linked to another simpler iommu
> layer object (TBD) for tracking user-managed page table. Due to
> nesting it is also implicitly linked to the iommu_domain of the
> parent;
>
> Following link has an initial discussion on this part:
>
> https://lore.kernel.org/linux-iommu/BN9PR11MB54331FC6BB31E8CBF11914A48C019@BN9PR11MB5433.namprd11.prod.outlook.com/T/#m2c19d3825cc096daf2026ea94e00cc5858cda321
>
> As Jason recommends in v1, bus-specific wrapper functions are provided
> explicitly to support VFIO_ATTACH_IOASID, e.g.
>
> struct iommu_attach_data * iommu_pci_device_attach(
> struct iommu_dev *dev, struct pci_device *pdev,
> u32 ioasid);
> struct iommu_attach_data * iommu_pci_device_attach_pasid(
> struct iommu_dev *dev, struct pci_device *pdev,
> u32 ioasid, u32 pasid);
>
> and variants for non-PCI devices.
>
> A helper function is provided for above wrappers:
>
> // flags specifies whether pasid is valid
> struct iommu_attach_data *__iommu_device_attach(
> struct iommu_dev *dev, u32 ioasid, u32 pasid, int flags);
>
> A new object is introduced and linked to ioasid_data->attach_data for
> each successful attach operation:
>
> struct iommu_attach_data {
> struct list_head next;
> struct iommu_dev *dev;
> u32 pasid;
> }
>
> The helper function for VFIO_DETACH_IOASID is generic:
>
> int iommu_device_detach(struct iommu_attach_data *data);
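>
> A rough sketch of how a VFIO device driver could stitch these helpers
> together for a pdev (error handling omitted; illustrative only):
>
>     /* VFIO_BIND_IOMMU_FD */
>     ctx  = iommu_ctx_fdget(bind.iommu_fd);
>     idev = iommu_register_device(ctx, &pdev->dev, bind.cookie);
>
>     /* VFIO_ATTACH_IOASID (pdev, no PASID) */
>     attach = iommu_pci_device_attach(idev, pdev, at.ioasid);
>
>     /* VFIO_DETACH_IOASID, then unbind when the device fd is closed */
>     iommu_device_detach(attach);
>     iommu_unregister_device(idev);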
>
> 4. Use Cases and Flows
> -------------------------------
>
> Here assume VFIO will support a new model where /dev/iommu capable
> devices are explicitly listed under /dev/vfio/devices thus a device fd can
> be acquired w/o going through legacy container/group interface. They
> may be further categorized into sub-directories based on device types
> (e.g. pdev, mdev, etc.). For illustration purpose those devices are putting
s/putting/put
> together and just called dev[1...N]:
>
> device_fd[1...N] = open("/dev/vfio/devices/dev[1...N]", mode);
>
> VFIO continues to support container/group model for legacy applications
> and also for devices which are not moved to /dev/iommu in one breath
> (e.g. in a group with multiple devices, or needing no-snoop DMA support). In concept
> there is no problem for VFIO to support two models simultaneously, but
> we'll wait to see any issue when reaching implementation.
>
> As explained earlier, one IOMMU fd is sufficient for all intended use cases:
>
> iommu_fd = open("/dev/iommu", mode);
>
> For simplicity below examples are all made for the virtualization story.
> They are representative and could be easily adapted to a non-virtualization
> scenario.
>
> Three types of IOASIDs are considered:
>
> gpa_ioasid[1...N]: GPA as the default address space
> giova_ioasid[1...N]: GIOVA as the default address space (nesting)
> gva_ioasid[1...N]: CPU VA as non-default address space (nesting)
>
> At least one gpa_ioasid must always be created per guest, while the other
> two are relevant as far as vIOMMU is concerned.
>
> Examples here apply to both pdev and mdev. VFIO device driver in the
> kernel will figure out the associated routing information in the attaching
> operation.
>
> For illustration simplicity, IOMMU_CHECK_EXTENSION and IOMMU_DEVICE_
> GET_INFO are skipped in these examples. No-snoop DMA is also not covered here.
>
> The examples below may not apply to all platforms. For example, the PAPR IOMMU
> on the PPC platform always requires a vIOMMU and blocks DMAs until the device is
> explicitly attached to a GIOVA address space. There are even fixed
> associations between available GIOVA spaces and devices. Those platform-
> specific variances are not covered here and will be figured out in the
> implementation phase.
>
> 4.1. A simple example
> +++++++++++++++++++++
>
> Dev1 is assigned to the guest. A cookie has been allocated by the user
> to represent this device in the iommu_fd.
>
> One gpa_ioasid is created. The GPA address space is managed through
> DMA mapping protocol by specifying that the I/O page table is managed
> by the kernel:
>
> /* Bind device to IOMMU fd */
> device_fd = open("/dev/vfio/devices/dev1", mode);
> iommu_fd = open("/dev/iommu", mode);
> bind_data = {.fd = iommu_fd; .cookie = cookie};
> ioctl(device_fd, VFIO_BIND_IOMMU_FD, &bind_data);
>
> /* Allocate IOASID */
> alloc_data = {.user_pgtable = false};
> gpa_ioasid = ioctl(iommu_fd, IOMMU_IOASID_ALLOC, &alloc_data);
>
> /* Attach device to IOASID */
> at_data = { .fd = iommu_fd; .ioasid = gpa_ioasid};
> ioctl(device_fd, VFIO_ATTACH_IOASID, &at_data);
>
> /* Setup GPA mapping [0 - 1GB] */
> dma_map = {
> .ioasid = gpa_ioasid;
> .iova = 0; // GPA
> .vaddr = 0x40000000; // HVA
> .size = 1GB;
> };
> ioctl(iommu_fd, IOMMU_MAP_DMA, &dma_map);
>
> If the guest is assigned more than dev1, the user follows the above
> sequence to attach other devices to the same gpa_ioasid, i.e. sharing
> the GPA address space across all assigned devices, e.g. for dev2:
>
> bind_data = {.fd = iommu_fd; .cookie = cookie2};
> ioctl(device_fd2, VFIO_BIND_IOMMU_FD, &bind_data);
> ioctl(device_fd2, VFIO_ATTACH_IOASID, &at_data);
>
> 4.2. Multiple IOASIDs (no nesting)
> ++++++++++++++++++++++++++++++++++
>
> Dev1 and dev2 are assigned to the guest. vIOMMU is enabled. Initially
> both devices are attached to gpa_ioasid. After boot the guest creates
> a GIOVA address space (giova_ioasid) for dev2, leaving dev1 in pass
> through mode (gpa_ioasid).
>
> Suppose IOASID nesting is not supported in this case. Qemu needs to
> generate shadow mappings in userspace for giova_ioasid (like how
> VFIO works today). The side-effect is that duplicated locked page
> accounting might be incurred in this example as there are two root
> IOASIDs now. It will be fixed once IOASID nesting is supported:
>
> device_fd1 = open("/dev/vfio/devices/dev1", mode);
> device_fd2 = open("/dev/vfio/devices/dev2", mode);
> iommu_fd = open("/dev/iommu", mode);
>
> /* Bind device to IOMMU fd */
> bind_data = {.fd = iommu_fd; .device_cookie = cookie1};
> ioctl(device_fd1, VFIO_BIND_IOMMU_FD, &bind_data);
> bind_data = {.fd = iommu_fd; .device_cookie = cookie2};
> ioctl(device_fd2, VFIO_BIND_IOMMU_FD, &bind_data);
>
> /* Allocate IOASID */
> alloc_data = {.user_pgtable = false};
> gpa_ioasid = ioctl(iommu_fd, IOMMU_IOASID_ALLOC, &alloc_data);
>
> /* Attach dev1 and dev2 to gpa_ioasid */
> at_data = { .fd = iommu_fd; .ioasid = gpa_ioasid};
> ioctl(device_fd1, VFIO_ATTACH_IOASID, &at_data);
> ioctl(device_fd2, VFIO_ATTACH_IOASID, &at_data);
>
> /* Setup GPA mapping [0 - 1GB] */
> dma_map = {
> .ioasid = gpa_ioasid;
> .iova = 0; // GPA
> .vaddr = 0x40000000; // HVA
> .size = 1GB;
> };
> ioctl(iommu_fd, IOMMU_MAP_DMA, &dma_map);
>
> /* After boot, guest enables a GIOVA space for dev2 via vIOMMU */
> alloc_data = {.user_pgtable = false};
> giova_ioasid = ioctl(iommu_fd, IOMMU_IOASID_ALLOC, &alloc_data);
>
> /* First detach dev2 from previous address space */
> at_data = { .fd = iommu_fd; .ioasid = gpa_ioasid};
> ioctl(device_fd2, VFIO_DETACH_IOASID, &at_data);
>
> /* Then attach dev2 to the new address space */
> at_data = { .fd = iommu_fd; .ioasid = giova_ioasid};
> ioctl(device_fd2, VFIO_ATTACH_IOASID, &at_data);
>
> /* Setup a shadow DMA mapping according to vIOMMU.
> *
> * e.g. the vIOMMU page table adds a new 4KB mapping:
> * GIOVA [0x2000] -> GPA [0x1000]
> *
> * and GPA [0x1000] is mapped to HVA [0x40001000] in gpa_ioasid.
> *
> * In this case the shadow mapping should be:
> * GIOVA [0x2000] -> HVA [0x40001000]
> */
> dma_map = {
> .ioasid = giova_ioasid;
> .iova = 0x2000; // GIOVA
> .vaddr = 0x40001000; // HVA
> .size = 4KB;
> };
> ioctl(iommu_fd, IOMMU_MAP_DMA, &dma_map);
>
> 4.3. IOASID nesting (software)
> ++++++++++++++++++++++++++++++
>
> Same usage scenario as 4.2, with software-based IOASID nesting
> available. In this mode it is the kernel, instead of the user, that creates
> the shadow mapping.
>
> The flow before guest boots is same as 4.2, except one point. Because
> giova_ioasid is nested on gpa_ioasid, locked accounting is only
> conducted for gpa_ioasid which becomes the only root.
>
> There could be a case where different gpa_ioasids are created due
> to incompatible formats between dev1/dev2 (e.g. regarding IOMMU
> enforce-snoop). In such a case the user could further create a dummy
> IOASID (HVA->HVA) as the root parent of the two gpa_ioasids to avoid
> duplicated accounting. But this scenario is not covered in the following
> flows.
>
> To save space we only list the steps after boots (i.e. both dev1/dev2
s/after boots/after boot
here and below
> have been attached to gpa_ioasid before guest boots):
>
> /* After boots */
> /* Create GIOVA space nested on GPA space
> * Both page tables are managed by the kernel
> */
> alloc_data = {.user_pgtable = false; .parent = gpa_ioasid};
> giova_ioasid = ioctl(iommu_fd, IOMMU_IOASID_ALLOC, &alloc_data);
>
> /* Attach dev2 to the new address space (child)
> * Note dev2 is still attached to gpa_ioasid (parent)
> */
> at_data = { .fd = iommu_fd; .ioasid = giova_ioasid};
> ioctl(device_fd2, VFIO_ATTACH_IOASID, &at_data);
>
> /* Setup a GIOVA [0x2000] ->GPA [0x1000] mapping for giova_ioasid,
> * based on the vIOMMU page table. The kernel is responsible for
> * creating the shadow mapping GIOVA [0x2000] -> HVA [0x40001000]
> * by walking the parent's I/O page table to find out GPA [0x1000] ->
> * HVA [0x40001000].
> */
> dma_map = {
> .ioasid = giova_ioasid;
> .iova = 0x2000; // GIOVA
> .vaddr = 0x1000; // GPA
> .size = 4KB;
> };
> ioctl(iommu_fd, IOMMU_MAP_DMA, &dma_map);
>
> 4.4. IOASID nesting (hardware)
> ++++++++++++++++++++++++++++++
>
> Same usage scenario as 4.2, with hardware-based IOASID nesting
> available. In this mode the I/O page table is managed by userspace
> thus an invalidation interface is used for the user to request iotlb
> invalidation.
>
> /* After boots */
> /* Create GIOVA space nested on GPA space.
> * Claim it's an user-managed I/O page table.
> */
> alloc_data = {
> .user_pgtable = true;
> .parent = gpa_ioasid;
> .addr = giova_pgtable;
> // and format information;
> };
> giova_ioasid = ioctl(iommu_fd, IOMMU_IOASID_ALLOC, &alloc_data);
>
> /* Attach dev2 to the new address space (child)
> * Note dev2 is still attached to gpa_ioasid (parent)
> */
> at_data = { .fd = iommu_fd; .ioasid = giova_ioasid};
> ioctl(device_fd2, VFIO_ATTACH_IOASID, &at_data);
>
> /* Invalidate IOTLB when required */
> inv_data = {
> .ioasid = giova_ioasid;
> // granular/cache type information
> };
> ioctl(iommu_fd, IOMMU_INVALIDATE_CACHE, &inv_data);
>
> /* See 4.6 for I/O page fault handling */
>
> 4.5. Guest SVA (vSVA)
> +++++++++++++++++++++
>
> After boots the guest further creates a GVA address space (vpasid1) on
> dev1. Dev2 is not affected (still attached to giova_ioasid).

> As explained in section 1.4, the user should check the PASID capability
> exposed via VFIO_DEVICE_GET_INFO and follow the required uAPI
> semantics when doing the attaching call:
>
> /****** If dev1 reports PASID_DELEGATED=false **********/
> /* After boots */
> /* Create GVA space nested on GPA space.
> * Claim it's an user-managed I/O page table.
> */
> alloc_data = {
> .user_pgtable = true;
> .parent = gpa_ioasid;
> .addr = gva_pgtable;
> // and format information;
> };
> gva_ioasid = ioctl(iommu_fd, IOMMU_IOASID_ALLOC, &alloc_data);
>
> /* Attach dev1 to the new address space (child) and specify
> * vPASID. Note dev1 is still attached to gpa_ioasid (parent)
> */
> at_data = {
> .fd = iommu_fd;
> .ioasid = gva_ioasid;
> .flag = IOASID_ATTACH_VPASID;
> .vpasid = vpasid1;
> };
> ioctl(device_fd1, VFIO_ATTACH_IOASID, &at_data);
>
> /* Enable CPU PASID translation if required */
> if (PASID_CPU and PASID_CPU_VIRT are both true for dev1) {
> pa_data = {
> .iommu_fd = iommu_fd;
> .ioasid = gva_ioasid;
> .vpasid = vpasid1;
> };
> ioctl(kvm_fd, KVM_MAP_PASID, &pa_data);
> };
>
> /* Invalidate IOTLB when required */
> ...
>
> /****** If dev1 reports PASID_DELEGATED=true **********/
> /* Create user-managed vPASID space when it's enabled via vIOMMU */
> alloc_data = {
> .user_pasid_table = true;
> .parent = gpa_ioasid;
> .addr = gpasid_tbl;
> // and format information;
> };
> pasidtbl_ioasid = ioctl(iommu_fd, IOMMU_IOASID_ALLOC, &alloc_data);
>
> /* Attach dev1 to the vPASID space */
> at_data = {.fd = iommu_fd; .ioasid = pasidtbl_ioasid};
> ioctl(device_fd1, VFIO_ATTACH_IOASID, &at_data);
>
> /* from now on all GVA address spaces on dev1 are represented by
> * a single pasidtbl_ioasid as the placeholder in the kernel.
> *
> * But iotlb invalidation and fault handling are still per GVA
> * address space. They are still going through IOMMU fd in the
> * same way as PASID_DELEGATED=false scenario
> */
> ...
>
> 4.6. I/O page fault
> +++++++++++++++++++
>
> The uAPI is TBD. Here is just the high-level flow from the host IOMMU driver
> to the guest IOMMU driver and back. This flow assumes that I/O page faults
> are reported via IOMMU interrupts. Some devices report faults in a device-
> specific way instead of going through the IOMMU; that usage is not covered
> here (a sketch of a possible fault record follows the flow below):
>
> - Host IOMMU driver receives an I/O page fault with raw fault_data {rid,
> pasid, addr};
>
> - Host IOMMU driver identifies the faulting I/O page table according to
> {rid, pasid} and calls the corresponding fault handler with an opaque
> object (registered by the handler) and raw fault_data (rid, pasid, addr);
>
> - IOASID fault handler identifies the corresponding ioasid and device
> cookie according to the opaque object, generates a user fault_data
> (ioasid, cookie, addr) in the fault region, and triggers eventfd to
> userspace;
>
> * In case ioasid represents a pasid table, pasid is also included as
> additional fault_data;
>
> * the raw fault_data is also cached in ioasid_data->fault_data and
> used when generating response;
>
> - Upon receiving the event, Qemu needs to find the virtual routing information
> (v_rid + v_pasid) of the device attached to the faulting ioasid;
>
> * v_rid is identified according to device_cookie;
>
> * v_pasid is either identified according to ioasid, or already carried
> in the fault data;
>
> - Qemu generates a virtual I/O page fault through vIOMMU into guest,
> carrying the virtual fault data (v_rid, v_pasid, addr);
>
> - Guest IOMMU driver fixes up the fault, updates the guest I/O page table
> (GIOVA or GVA), and then sends a page response with virtual completion
> data (v_rid, v_pasid, response_code) to vIOMMU;
>
> - Qemu finds the pending fault event, converts virtual completion data
> into (ioasid, cookie, response_code), and then calls a /dev/iommu ioctl to
> complete the pending fault;
>
> - /dev/iommu finds out the pending fault data {rid, pasid, addr} saved in
> ioasid_data->fault_data, and then calls iommu api to complete it with
> {rid, pasid, response_code};
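>
> As a strawman, the user-visible fault record and completion data could
> look like below (the uAPI is TBD; names and layout illustrative only):
>
>     /* read() from the fault region returns records like this */
>     struct iommu_fault_event {
>         __u32   ioasid;
>         __u32   flags;          /* e.g. whether pasid below is valid */
>         __u64   dev_cookie;
>         __u64   addr;           /* faulting I/O address */
>         __u32   perm;           /* requested access permissions */
>         __u32   pasid;          /* only for user-managed PASID table IOASIDs */
>     };
>
>     /* passed to the completion ioctl */
>     struct iommu_fault_response {
>         __u32   ioasid;
>         __u32   flags;
>         __u64   dev_cookie;
>         __u64   addr;
>         __u32   response_code;  /* success / invalid / failure */
>         __u32   __reserved;
>     };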
>
Thanks

Eric

2021-08-05 03:48:31

by Tian, Kevin

[permalink] [raw]
Subject: RE: [RFC v2] /dev/iommu uAPI proposal

> From: Jason Gunthorpe <[email protected]>
> Sent: Wednesday, August 4, 2021 10:05 PM
>
> On Mon, Aug 02, 2021 at 02:49:44AM +0000, Tian, Kevin wrote:
>
> > Can you elaborate? IMO the user only cares about the label (device cookie
> > plus optional vPASID) which is generated by itself when doing the attaching
> > call, and expects this virtual label being used in various spots (invalidation,
> > page fault, etc.). How the system labels the traffic (the physical RID or RID+
> > PASID) should be completely invisible to userspace.
>
> I don't think that is true if the vIOMMU driver is also emulating
> PASID. Presumably the same is true for other PASID-like schemes.
>

I'm getting even more confused with this comment. Isn't it the
consensus from day one that physical PASID should not be exposed
to userspace as doing so breaks live migration? with PASID emulation
vIOMMU only cares about vPASID instead of pPASID, and the uAPI
only requires user to register vPASID instead of reporting pPASID
back to userspace...

Thanks
Kevin

2021-08-05 04:15:17

by Tian, Kevin

[permalink] [raw]
Subject: RE: [RFC v2] /dev/iommu uAPI proposal

> From: Eric Auger <[email protected]>
> Sent: Wednesday, August 4, 2021 11:59 PM
>
[...]
> > 1.2. Attach Device to I/O address space
> > +++++++++++++++++++++++++++++++++++++++
> >
> > Device attach/bind is initiated through passthrough framework uAPI.
> >
> > Device attaching is allowed only after a device is successfully bound to
> > the IOMMU fd. User should provide a device cookie when binding the
> > device through VFIO uAPI. This cookie is used when the user queries
> > device capability/format, issues per-device iotlb invalidation and
> > receives per-device I/O page fault data via IOMMU fd.
> >
> > Successful binding puts the device into a security context which isolates
> > its DMA from the rest system. VFIO should not allow user to access the
> s/from the rest system/from the rest of the system
> > device before binding is completed. Similarly, VFIO should prevent the
> > user from unbinding the device before user access is withdrawn.
> With Intel scalable IOV, I understand you could assign an RID/PASID to
> one VM and another one to another VM (which is not the case for ARM). Is
> it a targetted use case?How would it be handled? Is it related to the
> sub-groups evoked hereafter?

Not related to sub-groups. Each mdev is bound to the IOMMU fd separately,
with the defPASID which represents the mdev.

>
> Actually all devices bound to an IOMMU fd should have the same parent
> I/O address space or root address space, am I correct? If so, maybe add
> this comment explicitly?

In most cases yes, but it's not mandatory. Multiple roots are allowed
(e.g. with vIOMMU but no nesting).

[...]
> > The device in the /dev/iommu context always refers to a physical one
> > (pdev) which is identifiable via RID. Physically each pdev can support
> > one default I/O address space (routed via RID) and optionally multiple
> > non-default I/O address spaces (via RID+PASID).
> >
> > The device in VFIO context is a logic concept, being either a physical
> > device (pdev) or mediated device (mdev or subdev). Each vfio device
> > is represented by RID+cookie in IOMMU fd. User is allowed to create
> > one default I/O address space (routed by vRID from user p.o.v) per
> > each vfio_device.
> The concept of default address space is not fully clear for me. I
> currently understand this is a
> root address space (not nesting). Is that coorect.This may need
> clarification.

w/o PASID there is only one address space (either GPA or GIOVA)
per device. This one is called default. Whether it's root is orthogonal
(e.g. GIOVA could also be nested) to the device view of this space.

w/ PASID additional address spaces can be targeted by the device.
Those are called non-default.

I could also rename default to RID address space and non-default to
RID+PASID address space if doing so makes it clearer.

> > VFIO decides the routing information for this default
> > space based on device type:
> >
> > 1) pdev, routed via RID;
> >
> > 2) mdev/subdev with IOMMU-enforced DMA isolation, routed via
> > the parent's RID plus the PASID marking this mdev;
> >
> > 3) a purely sw-mediated device (sw mdev), no routing required i.e. no
> > need to install the I/O page table in the IOMMU. sw mdev just uses
> > the metadata to assist its internal DMA isolation logic on top of
> > the parent's IOMMU page table;
> Maybe you should introduce this concept of SW mediated device earlier
> because it seems to special case the way the attach behaves. I am
> especially refering to
>
> "Successful attaching activates an I/O address space in the IOMMU, if the
> device is not purely software mediated"

makes sense.

>
> >
> > In addition, VFIO may allow user to create additional I/O address spaces
> > on a vfio_device based on the hardware capability. In such case the user
> > has its own view of the virtual routing information (vPASID) when marking
> > these non-default address spaces.
> I do not catch what does mean "marking these non default address space".

As explained above, those non-default address spaces are identified/routed
via PASID.

> >
> > 1.3. Group isolation
> > ++++++++++++++++++++
[...]
> >
> > 1) A successful binding call for the first device in the group creates
> > the security context for the entire group, by:
> >
> > * Verifying group viability in a similar way as VFIO does;
> >
> > * Calling IOMMU-API to move the group into a block-dma state,
> > which makes all devices in the group attached to an block-dma
> > domain with an empty I/O page table;
> this block-dma state/domain would deserve to be better defined (I know
> you already evoked it in 1.1 with the dma mapping protocol though)
> activates an empty I/O page table in the IOMMU (if the device is not
> purely SW mediated)?

Sure. Some explanations are scattered in the following paragraphs, but I
can further clarify it.

> How does that relate to the default address space? Is it the same?

Different. This block-dma domain doesn't hold any valid mapping. The
default address space is represented by a normal unmanaged domain.
The ioasid attaching operation will detach the device from the block-dma
domain and then attach it to the target ioasid.

> >
> > 2. uAPI Proposal
> > ----------------------
[...]
> > /*
> > * Allocate an IOASID.
> > *
> > * IOASID is the FD-local software handle representing an I/O address
> > * space. Each IOASID is associated with a single I/O page table. User
> > * must call this ioctl to get an IOASID for every I/O address space that is
> > * intended to be tracked by the kernel.
> > *
> > * User needs to specify the attributes of the IOASID and associated
> > * I/O page table format information according to one or multiple devices
> > * which will be attached to this IOASID right after. The I/O page table
> > * is activated in the IOMMU when it's attached by a device. Incompatible
>
> .. if not SW mediated
> > * format between device and IOASID will lead to attaching failure.
> > *
> > * The root IOASID should always have a kernel-managed I/O page
> > * table for safety. Locked page accounting is also conducted on the root.
> The definition of root IOASID is not easily found in this spec. Maybe
> this would deserve some clarification.

Makes sense.

And thanks for the other typo-related comments.

Thanks
Kevin

2021-08-05 11:29:31

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [RFC v2] /dev/iommu uAPI proposal

On Wed, Aug 04, 2021 at 10:59:21PM +0000, Tian, Kevin wrote:
> > From: Jason Gunthorpe <[email protected]>
> > Sent: Wednesday, August 4, 2021 10:05 PM
> >
> > On Mon, Aug 02, 2021 at 02:49:44AM +0000, Tian, Kevin wrote:
> >
> > > Can you elaborate? IMO the user only cares about the label (device cookie
> > > plus optional vPASID) which is generated by itself when doing the attaching
> > > call, and expects this virtual label being used in various spots (invalidation,
> > > page fault, etc.). How the system labels the traffic (the physical RID or RID+
> > > PASID) should be completely invisible to userspace.
> >
> > I don't think that is true if the vIOMMU driver is also emulating
> > PASID. Presumably the same is true for other PASID-like schemes.
> >
>
> I'm getting even more confused with this comment. Isn't it the
> consensus from day one that physical PASID should not be exposed
> to userspace as doing so breaks live migration?

Uh, no?

> with PASID emulation vIOMMU only cares about vPASID instead of
> pPASID, and the uAPI only requires user to register vPASID instead
> of reporting pPASID back to userspace...

vPASID is only a feature of one device in existence, so we can't make
vPASID mandatory.

Jason

2021-08-06 05:51:59

by Tian, Kevin

[permalink] [raw]
Subject: RE: [RFC v2] /dev/iommu uAPI proposal

> From: Jason Gunthorpe <[email protected]>
> Sent: Thursday, August 5, 2021 7:27 PM
>
> On Wed, Aug 04, 2021 at 10:59:21PM +0000, Tian, Kevin wrote:
> > > From: Jason Gunthorpe <[email protected]>
> > > Sent: Wednesday, August 4, 2021 10:05 PM
> > >
> > > On Mon, Aug 02, 2021 at 02:49:44AM +0000, Tian, Kevin wrote:
> > >
> > > > Can you elaborate? IMO the user only cares about the label (device
> cookie
> > > > plus optional vPASID) which is generated by itself when doing the
> attaching
> > > > call, and expects this virtual label being used in various spots
> (invalidation,
> > > > page fault, etc.). How the system labels the traffic (the physical RID or
> RID+
> > > > PASID) should be completely invisible to userspace.
> > >
> > > I don't think that is true if the vIOMMU driver is also emulating
> > > PASID. Presumably the same is true for other PASID-like schemes.
> > >
> >
> > I'm getting even more confused with this comment. Isn't it the
> > consensus from day one that physical PASID should not be exposed
> > to userspace as doing so breaks live migration?
>
> Uh, no?
>
> > with PASID emulation vIOMMU only cares about vPASID instead of
> > pPASID, and the uAPI only requires user to register vPASID instead
> > of reporting pPASID back to userspace...
>
> vPASID is only a feature of one device in existance, so we can't make
> vPASID mandatory.
>

Sure. My point is just that if vPASID is being emulated there is no need
to expose pPASID to user space. Can you give a concrete example
where pPASID must be exposed and how the user wants to use this
information?

Thanks
Kevin

2021-08-06 07:51:46

by David Gibson

[permalink] [raw]
Subject: Re: [RFC v2] /dev/iommu uAPI proposal

On Wed, Aug 04, 2021 at 11:04:47AM -0300, Jason Gunthorpe wrote:
> On Mon, Aug 02, 2021 at 02:49:44AM +0000, Tian, Kevin wrote:
>
> > Can you elaborate? IMO the user only cares about the label (device cookie
> > plus optional vPASID) which is generated by itself when doing the attaching
> > call, and expects this virtual label being used in various spots (invalidation,
> > page fault, etc.). How the system labels the traffic (the physical RID or RID+
> > PASID) should be completely invisible to userspace.
>
> I don't think that is true if the vIOMMU driver is also emulating
> PASID. Presumably the same is true for other PASID-like schemes.

Right. The idea for an SVA capable vIOMMU in my scheme is that the
hypervisor would set up an IOAS of address type "PASID+address" with
the mappings made by the guest according to its vIOMMU semantics.
Then SVA capable devices would be plugged into that IOAS by using
"PASID+address" type endpoints from those devices.

--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson



2021-08-06 11:39:27

by David Gibson

[permalink] [raw]
Subject: Re: [RFC v2] /dev/iommu uAPI proposal

On Wed, Aug 04, 2021 at 11:07:42AM -0300, Jason Gunthorpe wrote:
> On Tue, Aug 03, 2021 at 11:58:54AM +1000, David Gibson wrote:
> > > I'd rather deduce the endpoint from a collection of devices than the
> > > other way around...
> >
> > Which I think is confusing, and in any case doesn't cover the case of
> > one "device" with multiple endpoints.
>
> Well they are both confusing, and I'd prefer to focus on the common
> case without extra mandatory steps. Exposing optional endpoint sharing
> information seems more in line with where everything is going than
> making endpoint sharing a first class object.
>
> AFAIK a device with multiple endpoints where those endpoints are
> shared with other devices doesn't really exist/or is useful? Eg PASID
> has multiple RIDs by they are not shared.

No, I can't think of a (non-contrived) example where a device would
have *both* multiple endpoints and those endpoints are shared amongst
multiple devices. I can easily think of examples where a device has
multiple (non shared) endpoints and where multiple devices share a
single endpoint.

The point is that making endpoints explicit separates the various
options here from the logic of the IOMMU layer itself. New device
types with new possibilities here means new interfaces *on those
devices*, but not new interfaces on /dev/iommu.

--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson



2021-08-06 11:39:26

by David Gibson

[permalink] [raw]
Subject: Re: [RFC v2] /dev/iommu uAPI proposal

On Tue, Aug 03, 2021 at 03:19:26AM +0000, Tian, Kevin wrote:
> > From: David Gibson <[email protected]>
> > Sent: Tuesday, August 3, 2021 9:51 AM
> >
> > On Wed, Jul 28, 2021 at 04:04:24AM +0000, Tian, Kevin wrote:
> > > Hi, David,
> > >
> > > > From: David Gibson <[email protected]>
> > > > Sent: Monday, July 26, 2021 12:51 PM
> > > >
> > > > On Fri, Jul 09, 2021 at 07:48:44AM +0000, Tian, Kevin wrote:
> > > > > /dev/iommu provides an unified interface for managing I/O page tables
> > for
> > > > > devices assigned to userspace. Device passthrough frameworks (VFIO,
> > > > vDPA,
> > > > > etc.) are expected to use this interface instead of creating their own
> > logic to
> > > > > isolate untrusted device DMAs initiated by userspace.
> > > > >
> > > > > This proposal describes the uAPI of /dev/iommu and also sample
> > > > sequences
> > > > > with VFIO as example in typical usages. The driver-facing kernel API
> > > > provided
> > > > > by the iommu layer is still TBD, which can be discussed after consensus
> > is
> > > > > made on this uAPI.
> > > > >
> > > > > It's based on a lengthy discussion starting from here:
> > > > > https://lore.kernel.org/linux-
> > > > iommu/[email protected]/
> > > > >
> > > > > v1 can be found here:
> > > > > https://lore.kernel.org/linux-
> > > >
> > iommu/[email protected]
> > > > amprd12.prod.outlook.com/T/
> > > > >
> > > > > This doc is also tracked on github, though it's not very useful for v1->v2
> > > > > given dramatic refactoring:
> > > > > https://github.com/luxis1999/dev_iommu_uapi
> > > >
> > > > Thanks for all your work on this, Kevin. Apart from the actual
> > > > semantic improvements, I'm finding v2 significantly easier to read and
> > > > understand than v1.
> > > >
> > > > [snip]
> > > > > 1.2. Attach Device to I/O address space
> > > > > +++++++++++++++++++++++++++++++++++++++
> > > > >
> > > > > Device attach/bind is initiated through passthrough framework uAPI.
> > > > >
> > > > > Device attaching is allowed only after a device is successfully bound to
> > > > > the IOMMU fd. User should provide a device cookie when binding the
> > > > > device through VFIO uAPI. This cookie is used when the user queries
> > > > > device capability/format, issues per-device iotlb invalidation and
> > > > > receives per-device I/O page fault data via IOMMU fd.
> > > > >
> > > > > Successful binding puts the device into a security context which isolates
> > > > > its DMA from the rest system. VFIO should not allow user to access the
> > > > > device before binding is completed. Similarly, VFIO should prevent the
> > > > > user from unbinding the device before user access is withdrawn.
> > > > >
> > > > > When a device is in an iommu group which contains multiple devices,
> > > > > all devices within the group must enter/exit the security context
> > > > > together. Please check {1.3} for more info about group isolation via
> > > > > this device-centric design.
> > > > >
> > > > > Successful attaching activates an I/O address space in the IOMMU,
> > > > > if the device is not purely software mediated. VFIO must provide device
> > > > > specific routing information for where to install the I/O page table in
> > > > > the IOMMU for this device. VFIO must also guarantee that the attached
> > > > > device is configured to compose DMAs with the routing information
> > that
> > > > > is provided in the attaching call. When handling DMA requests, IOMMU
> > > > > identifies the target I/O address space according to the routing
> > > > > information carried in the request. Misconfiguration breaks DMA
> > > > > isolation thus could lead to severe security vulnerability.
> > > > >
> > > > > Routing information is per-device and bus specific. For PCI, it is
> > > > > Requester ID (RID) identifying the device plus optional Process Address
> > > > > Space ID (PASID). For ARM, it is Stream ID (SID) plus optional Sub-Stream
> > > > > ID (SSID). PASID or SSID is used when multiple I/O address spaces are
> > > > > enabled on a single device. For simplicity and continuity reason the
> > > > > following context uses RID+PASID though SID+SSID may sound a clearer
> > > > > naming from device p.o.v. We can decide the actual naming when coding.
> > > > >
> > > > > Because one I/O address space can be attached by multiple devices,
> > > > > per-device routing information (plus device cookie) is tracked under
> > > > > each IOASID and is used respectively when activating the I/O address
> > > > > space in the IOMMU for each attached device.
> > > > >
> > > > > The device in the /dev/iommu context always refers to a physical one
> > > > > (pdev) which is identifiable via RID. Physically each pdev can support
> > > > > one default I/O address space (routed via RID) and optionally multiple
> > > > > non-default I/O address spaces (via RID+PASID).
> > > > >
> > > > > The device in VFIO context is a logic concept, being either a physical
> > > > > device (pdev) or mediated device (mdev or subdev). Each vfio device
> > > > > is represented by RID+cookie in IOMMU fd. User is allowed to create
> > > > > one default I/O address space (routed by vRID from user p.o.v) per
> > > > > each vfio_device. VFIO decides the routing information for this default
> > > > > space based on device type:
> > > > >
> > > > > 1) pdev, routed via RID;
> > > > >
> > > > > 2) mdev/subdev with IOMMU-enforced DMA isolation, routed via
> > > > > the parent's RID plus the PASID marking this mdev;
> > > > >
> > > > > 3) a purely sw-mediated device (sw mdev), no routing required i.e. no
> > > > > need to install the I/O page table in the IOMMU. sw mdev just uses
> > > > > the metadata to assist its internal DMA isolation logic on top of
> > > > > the parent's IOMMU page table;
> > > > >
> > > > > In addition, VFIO may allow user to create additional I/O address spaces
> > > > > on a vfio_device based on the hardware capability. In such case the user
> > > > > has its own view of the virtual routing information (vPASID) when marking
> > > > > these non-default address spaces. How to virtualize vPASID is platform
> > > > > specific and device specific. Some platforms allow the user to fully
> > > > > manage the PASID space thus vPASIDs are directly used for routing and
> > > > > even hidden from the kernel. Other platforms require the user to
> > > > > explicitly register the vPASID information to the kernel when attaching
> > > > > the vfio_device. In this case VFIO must figure out whether vPASID should
> > > > > be directly used (pdev) or converted to a kernel-allocated pPASID (mdev)
> > > > > for physical routing. Detail explanation about PASID virtualization can
> > > > > be found in {1.4}.
> > > > >
> > > > > For mdev both default and non-default I/O address spaces are routed
> > > > > via PASIDs. To better differentiate them we use "default PASID" (or
> > > > > defPASID) when talking about the default I/O address space on mdev. When
> > > > > vPASID or pPASID is referred in PASID virtualization it's all about the
> > > > > non-default spaces. defPASID and pPASID are always hidden from userspace
> > > > > and can only be indirectly referenced via IOASID.
> > > >
> > > > That said, I'm still finding the various ways a device can attach to
> > > > an ioasid pretty confusing. Here are some thoughts on some extra
> > > > concepts that might make it easier to handle [note, I haven't thought
> > > > this all the way through so far, so there might be fatal problems with
> > > > this approach].
> > >
> > > Thanks for sharing your thoughts.
> > >
> > > >
> > > > * DMA address type
> > > >
> > > > This represents the format of the actual "over the wire" DMA
> > > > address. So far I only see 3 likely options for this 1) 32-bit,
> > > > 2) 64-bit and 3) PASID, meaning the 84-bit PASID+address
> > > > combination.
> > > >
> > > > * DMA identifier type
> > > >
> > > > This represents the format of the "over the wire"
> > > > device-identifying information that the IOMMU receives. So "RID",
> > > > "RID+PASID", "SID+SSID" would all be DMA identifier types. We
> > > > could introduce some extra ones which might be necessary for
> > > > software mdevs.
> > > >
> > > > So, every single DMA transaction has both DMA address and DMA
> > > > identifier information attached. In some cases we get to choose how
> > > > we split the available information between identifier and address, more
> > > > on that later.
> > > >
> > > > * DMA endpoint
> > > >
> > > > An endpoint would represent a DMA origin which is identifiable to
> > > > the IOMMU. I'm using the new term, because while this would
> > > > sometimes correspond one to one with a device, there would be some
> > > > cases where it does not.
> > > >
> > > > a) Multiple devices could be a single DMA endpoint - this would
> > > > be the case with non-ACS bridges or PCIe to PCI bridges where
> > > > devices behind the bridge can't be distinguished from each other.
> > > > Early versions might be able to treat all VFIO groups as single
> > > > endpoints, which might simplify transition
> > > >
> > > > b) A single device could supply multiple DMA endpoints, this would
> > > > be the case with PASID capable devices where you want to map
> > > > different PASIDs to different IOASes.
> > > >
> > > > **Caveat: feel free to come up with a better name than "endpoint"
> > > >
> > > > **Caveat: I'm not immediately sure how to represent these to
> > > > userspace, and how we do that could have some important
> > > > implications for managing their lifetime
> > > >
> > > > Every endpoint would have a fixed, known DMA address type and DMA
> > > > identifier type (though I'm not sure if we need/want to expose the DMA
> > > > identifier type to userspace). Every IOAS would also have a DMA
> > > > address type fixed at IOAS creation.
> > > >
> > > > An endpoint can only be attached to one IOAS at a time. It can only
> > > > be attached to an IOAS whose DMA address type matches the endpoint.
> > > >
> > > > Most userspace managed IO page formats would imply a particular DMA
> > > > address type, and also a particular DMA address type for their
> > > > "parent" IOAS. I'd expect kernel managed IO page tables to be able to
> > > > be able to handle most combinations.
> > > >
> > > > /dev/iommu would work entirely (or nearly so) in terms of endpoint
> > > > handles, not device handles. Endpoints are what get bound to an IOAS,
> > > > and endpoints are what get the user chosen endpoint cookie.
> > > >
> > > > Getting endpoint handles from devices is handled on the VFIO/device
> > > > side. The simplest transitional approach is probably for a VFIO pdev
> > > > groups to expose just a single endpoint. We can potentially make that
> > > > more flexible as a later step, and other subsystems might have other
> > > > needs.
> > >
> > > I wonder what is the real value of this endpoint concept. for SVA-capable
> > > pdev case, the entire pdev is fully managed by the guest thus only the
> > > guest driver knows DMA endpoints on this pdev. vfio-pci doesn't know
> > > the presence of an endpoint until Qemu requests to do ioasid attaching
> > > after identifying an IOAS via vIOMMU.
> >
> > No.. that's not true. vfio-pci knows it can generate a "RID"-type
> > endpoint for the device, and I assume the device will have a SVA
> > capability bit, which lets vfio know that the endpoint will generate
> > PASID+addr addresses, rather than plain 64-bit addresses.
> >
> > You can't construct RID+PASID endpoints with vfio's knowledge alone,
> > but that's ok - that style would be for mdevs or other cases where you
> > do have more information about the specific device.
>
> if vfio-pci cannot construct an endpoint alone in all cases, then I worry
> we are just inventing unnecessary uAPI objects whose role can already
> be fulfilled by device cookie+PASID in the proposed uAPI.

I don't see that the device cookie is relevant here - AIUI that's
*assigned* by the user, so it doesn't really address the original
identification of the device. In my proposal you'd instead assign a
cookie to an endpoint, since that's the identifiable object.

The problem with "device+PASID" is that it only makes sense for some
versions of "device". For mdevs that are implemented by using
one-PASID of a pdev, then the PASID is included already in the
"device" so adding a PASID makes no sense. For devices that aren't
SVA capable, "device+PASID" is meaningful (sort of) but useless.

So, you have to know details of the type of device in order to use the
IOMMU APIs. You have to know what kinds of devices an IOAS can
accommodate, and what kinds of binding to that IOAS it can accommodate.
Can a device be bound to multiple IOASes or not? If you want to bind
it to multiple IOASes, does it need a specific kind of binding?

And that's before we even consider what some new bus revision might do
and what new possibilities that might add.

With endpoints it's simply "does this endpoint have the same address
type as the IOAS".

> > > If we want to build /dev/iommu
> > > uAPI around endpoint, probably vfio has to provide an uAPI for user to
> > > request creating an endpoint on the fly before doing the attaching call.
> > > But what benefit does it bring for the additional complexity, given what
> > > we require is just the RID or RID+PASID routing info which can already
> > > be dug out by the vfio driver w/o knowing any endpoint concept...
> >
> > It more clearly delineates who's responsible for what. The driver
> > (VFIO, mdev, vDPA, whatever) supplies endpoints. Depending on the
> > type of device it could be one endpoint per device, a choice of
> > several different endpoints for a device, several simultaneous
> > endpoints for the device, or one endpoint for several devices. But
> > whatever it is that's all on the device side. Once you get an
> > endpoint, it's always binding exactly one endpoint to exactly one IOAS
> > so the point at which the device side meets the IOMMU side becomes
> > much simpler.
>
> Sticking to iommu semantics {device, pasid} are clear enough for
> the user to build the connection between IOAS and device.

I don't really agree, it's certainly confusing to me. Again the
problem is that "device" can cover a bunch of different meanings (PCI
device, mdev, kernel device). Endpoint means, explicitly, "a thing
that can be bound to a single DMA address space at a time".

> This
> also matches the vIOMMU part which just understands vRID and
> vPASID, without any concept of endpoint. anyway RID+PASID (or
> SID+SSID) is what is defined in the bus to tag an IOAS. Forcing
> vIOMMU to fake an endpoint (via vfio) per PASID before doing
> attach just adds unnecessary confusion.

There's no faking here - the endpoint is a real thing, and you don't
need to have an endpoint per PASID. As noted in one of the examples,
an SVA-capable device can be treated as a single endpoint, with address
type "PASID+address". The *possibility* of single-PASID endpoints
exists, so that a) single-PASID mdevs can expose themselves that way
and b) for some userspace drivers it might be more convenient to
handle individual PASIDs that way.

> > If we find a new device type or bus with a new way of doing DMA
> > addressing, it's just adding some address/id types; we don't need new
> > ways of binding these devices to the IOMMU.
>
> We just need to define a better name to cover pasid, ssid, or other ids.
> Regardless of the bus type, it's always the device cookie to identify
> the default I/O address space plus a whatever-id to tag other I/O
> address spaces targeted by the device.

Well, that's kind of what I'm doing. PCI currently has the notion of
"default" address space for a RID, but there's no guarantee that other
buses (or even future PCI extensions) will. The idea is that
"endpoint" means exactly the (RID, PASID) or (SID, SSID) or whatever
future variations are on that. It can also then accommodate both
non-SVA PCI devices, as simply (RID) == (RID, DEFAULT_PASID), and the
case where you want to manage multiple PASIDs within a single IOMMU
table, again as just (RID) but with a complex address type.
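
To make that concrete, here is a minimal sketch of what an
endpoint-centric attach could look like; every name below is
hypothetical, only meant to illustrate the shape of the idea:

/* all names hypothetical, nothing here is proposed uAPI */
struct iommu_endpoint_attach {
	__u32	endpoint_fd;	/* endpoint handle from the device driver */
	__u32	ioas_id;	/* target I/O address space */
	__u64	cookie;		/* user-chosen, shows up in faults/invalidations */
};

/* device side: the driver (vfio here) hands out the endpoint; a plain
 * pdev yields one (RID) endpoint, an SVA-capable pdev yields one (RID)
 * endpoint with "PASID+address" address type, a single-PASID mdev
 * yields one (RID, PASID) endpoint */
ep_fd = ioctl(device_fd, VFIO_DEVICE_GET_ENDPOINT, NULL);

/* iommu side: attach succeeds only if the endpoint's DMA address type
 * matches the address type fixed at IOAS creation */
struct iommu_endpoint_attach at = {
	.endpoint_fd	= ep_fd,
	.ioas_id	= ioas,
	.cookie		= my_cookie,
};
ioctl(iommu_fd, IOMMU_ENDPOINT_ATTACH, &at);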

> > > In concept I feel the purpose of DMA endpoint is equivalent to the routing
> > > info in this proposal.
> >
> > Maybe? I'm afraid I never quite managed to understand the role of the
> > routing info in your proposal.
> >
>
> the IOMMU routes incoming DMA packets to a specific I/O page table,
> according to RID or RID+PASID carried in the packet. RID or RID+PASID
> is the routing information (represented by device cookie +PASID in
> proposed uAPI) and what the iommu driver really cares when activating
> the I/O page table in the iommu.

Ok, so yes, endpoint is roughly equivalent to that. But my point is
that the IOMMU layer really only cares about that (device+routing)
combination, not other aspects of what the device is. So that's the
concept we should give a name and put front and center in the API.

--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson



2021-08-06 19:05:16

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [RFC v2] /dev/iommu uAPI proposal

On Fri, Aug 06, 2021 at 02:45:26PM +1000, David Gibson wrote:

> Well, that's kind of what I'm doing. PCI currently has the notion of
> "default" address space for a RID, but there's no guarantee that other
> buses (or even future PCI extensions) will. The idea is that
> "endpoint" means exactly the (RID, PASID) or (SID, SSID) or whatever
> future variations are on that.

This is already happening in this proposal; it is why I insisted that
the driver-facing API has to be very explicit. That API specifies
exactly what the device silicon is doing.

However, that is placed at the IOASID level. There is no reason to
create endpoint objects that are 1:1 with IOASID objects, eg for
PASID.

We need to have clear software layers and responsibilities; I think
this is where the VFIO container design has fallen behind.

The device driver is responsible to declare what TLPs the device it
controls will issue

The system layer is responsible to determine how those TLPs can be
matched to IO page tables, if at all

The IO page table layer is responsible to map the TLPs to physical
memory.

Each must stay in its box and we should not create objects that smush
together, say, the device and system layers because it will only make
a mess of the software design.

Since the system layer doesn't have any concrete objects in our
environment (which is based on devices and IO page tables) it has to
exist as metadata attached to the other two objects.
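
As a rough sketch of that split (all names below are made up, purely
to show which layer owns what):

/* device layer: the driver declares exactly what TLPs its device
 * will issue (with or without a PASID) */
int iommufd_bind_device(struct iommufd_ctx *ictx, struct device *dev,
			u64 dev_cookie);
int iommufd_bind_device_pasid(struct iommufd_ctx *ictx, struct device *dev,
			      u32 pasid, u64 dev_cookie);

/* system layer: no object of its own; it is the metadata/logic that
 * decides whether a declared TLP source can be matched to a given
 * IOASID's page table at all */
bool iommufd_device_compatible(struct iommufd_device *idev, u32 ioasid);

/* I/O page table layer: owns the IOVA -> physical mapping, independent
 * of which TLP sources are routed to it */
int ioasid_map(struct iommufd_ctx *ictx, u32 ioasid, u64 iova,
	       u64 vaddr, u64 size, int prot);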

Jason

2021-08-09 08:36:16

by Tian, Kevin

[permalink] [raw]
Subject: RE: [RFC v2] /dev/iommu uAPI proposal

> From: David Gibson <[email protected]>
> Sent: Friday, August 6, 2021 12:45 PM
> > > > In concept I feel the purpose of DMA endpoint is equivalent to the routing
> > > > info in this proposal.
> > >
> > > Maybe? I'm afraid I never quite managed to understand the role of the
> > > routing info in your proposal.
> > >
> >
> > the IOMMU routes incoming DMA packets to a specific I/O page table,
> > according to RID or RID+PASID carried in the packet. RID or RID+PASID
> > is the routing information (represented by device cookie +PASID in
> > proposed uAPI) and what the iommu driver really cares when activating
> > the I/O page table in the iommu.
>
> Ok, so yes, endpoint is roughly equivalent to that. But my point is
> that the IOMMU layer really only cares about that (device+routing)
> combination, not other aspects of what the device is. So that's the
> concept we should give a name and put front and center in the API.
>

This is how this proposal works, centered around the routing info. the
uAPI doesn't care what the device is. It just requires the user to specify
the user view of routing info (device fd + optional pasid) to tag an IOAS.
the user view is then converted to the kernel view of routing (rid or
rid+pasid) by vfio driver and then passed to iommu fd in the attaching
operation. and GET_INFO interface is provided for the user to check
whether a device supports multiple IOASes and whether pasid space is
delegated to the user. We just need a better name if pasid is considered
too pci specific...
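
roughly, from the user's point of view (command and field names below
are illustrative only, not the exact proposed ones):

/* 1) bind the device to the iommu fd with a user-provided cookie */
struct vfio_device_bind bind = {
	.iommu_fd	= iommu_fd,
	.dev_cookie	= my_cookie,
};
ioctl(device_fd, VFIO_DEVICE_BIND_IOMMU_FD, &bind);

/* 2) per-device GET_INFO: does the device support multiple IOASes,
 *    and is the pasid space delegated to the user? */
struct iommu_device_info info = { .dev_cookie = my_cookie };
ioctl(iommu_fd, IOMMU_DEVICE_GET_INFO, &info);

/* 3) attach: the user view of routing is (device fd + optional pasid);
 *    vfio converts it to the kernel view (rid or rid+pasid) before
 *    calling into the iommu fd */
struct vfio_device_attach at = {
	.iommu_fd	= iommu_fd,
	.ioasid		= ioasid,
	.flags		= VFIO_ATTACH_PASID,	/* only for non-default spaces */
	.pasid		= vpasid,
};
ioctl(device_fd, VFIO_DEVICE_ATTACH_IOASID, &at);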

But creating an endpoint per ioasid and making it centered in uAPI is not
what the IOMMU layer cares about.

Thanks
Kevin

2021-08-10 07:05:49

by David Gibson

[permalink] [raw]
Subject: Re: [RFC v2] /dev/iommu uAPI proposal

On Mon, Aug 09, 2021 at 08:34:06AM +0000, Tian, Kevin wrote:
> > From: David Gibson <[email protected]>
> > Sent: Friday, August 6, 2021 12:45 PM
> > > > > In concept I feel the purpose of DMA endpoint is equivalent to the routing
> > > > > info in this proposal.
> > > >
> > > > Maybe? I'm afraid I never quite managed to understand the role of the
> > > > routing info in your proposal.
> > > >
> > >
> > > the IOMMU routes incoming DMA packets to a specific I/O page table,
> > > according to RID or RID+PASID carried in the packet. RID or RID+PASID
> > > is the routing information (represented by device cookie +PASID in
> > > proposed uAPI) and what the iommu driver really cares when activating
> > > the I/O page table in the iommu.
> >
> > Ok, so yes, endpoint is roughly equivalent to that. But my point is
> > that the IOMMU layer really only cares about that (device+routing)
> > combination, not other aspects of what the device is. So that's the
> > concept we should give a name and put front and center in the API.
> >
>
> This is how this proposal works, centered around the routing info. the
> uAPI doesn't care what the device is. It just requires the user to specify
> the user view of routing info (device fd + optional pasid) to tag an IOAS.

Which works as long as (just device fd) and (device fd + PASID) covers
all the options. If we have new possibilities we need new interfaces.
And, that can't even handle the case of one endpoint for multiple
devices (e.g. ACS-incapable bridge).

> the user view is then converted to the kernel view of routing (rid or
> rid+pasid) by vfio driver and then passed to iommu fd in the attaching
> operation. and GET_INFO interface is provided for the user to check
> whether a device supports multiple IOASes and whether pasid space is
> delegated to the user. We just need a better name if pasid is considered
> too pci specific...
>
> But creating an endpoint per ioasid and making it centered in uAPI is not
> what the IOMMU layer cares about.

It's not an endpoint per ioasid. You can have multiple endpoints per
ioasid, just not the other way around. As it is, multiple IOASes per
device means *some* sort of disambiguation (generally by PASID) which
is hard to describe generally. Having endpoints as a first-class
concept makes that simpler.

--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson



2021-08-10 08:38:14

by Eric Auger

[permalink] [raw]
Subject: Re: [RFC v2] /dev/iommu uAPI proposal

Hi Kevin,

On 8/5/21 2:36 AM, Tian, Kevin wrote:
>> From: Eric Auger <[email protected]>
>> Sent: Wednesday, August 4, 2021 11:59 PM
>>
> [...]
>>> 1.2. Attach Device to I/O address space
>>> +++++++++++++++++++++++++++++++++++++++
>>>
>>> Device attach/bind is initiated through passthrough framework uAPI.
>>>
>>> Device attaching is allowed only after a device is successfully bound to
>>> the IOMMU fd. User should provide a device cookie when binding the
>>> device through VFIO uAPI. This cookie is used when the user queries
>>> device capability/format, issues per-device iotlb invalidation and
>>> receives per-device I/O page fault data via IOMMU fd.
>>>
>>> Successful binding puts the device into a security context which isolates
>>> its DMA from the rest system. VFIO should not allow user to access the
>> s/from the rest system/from the rest of the system
>>> device before binding is completed. Similarly, VFIO should prevent the
>>> user from unbinding the device before user access is withdrawn.
>> With Intel scalable IOV, I understand you could assign an RID/PASID to
>> one VM and another one to another VM (which is not the case for ARM). Is
>> it a targeted use case? How would it be handled? Is it related to the
>> sub-groups evoked hereafter?
> Not related to sub-group. Each mdev is bound to the IOMMU fd respectively
> with the defPASID which represents the mdev.
But how does it work in terms of security? The device (RID) is bound to
an IOMMU fd, but then each SID/PASID may be working for a different VM.
How do you determine this is safe, i.e. that each SID can safely work for
a different VM, versus the ARM case where it is not possible?

1.3 says
"

1) A successful binding call for the first device in the group creates
the security context for the entire group, by:
"
What does it mean for the above scalable IOV use case?

>
>> Actually all devices bound to an IOMMU fd should have the same parent
>> I/O address space or root address space, am I correct? If so, maybe add
>> this comment explicitly?
> in most cases yes but it's not mandatory. multiple roots are allowed
> (e.g. with vIOMMU but no nesting).
OK, right, this corresponds to example 4.2. I misinterpreted
the notion of security context. The security context does not match the
IOMMU fd but is something implicitly created on 1st device binding.
>
> [...]
>>> The device in the /dev/iommu context always refers to a physical one
>>> (pdev) which is identifiable via RID. Physically each pdev can support
>>> one default I/O address space (routed via RID) and optionally multiple
>>> non-default I/O address spaces (via RID+PASID).
>>>
>>> The device in VFIO context is a logic concept, being either a physical
>>> device (pdev) or mediated device (mdev or subdev). Each vfio device
>>> is represented by RID+cookie in IOMMU fd. User is allowed to create
>>> one default I/O address space (routed by vRID from user p.o.v) per
>>> each vfio_device.
>> The concept of default address space is not fully clear for me. I
>> currently understand this is a
>> root address space (not nesting). Is that correct? This may need
>> clarification.
> w/o PASID there is only one address space (either GPA or GIOVA)
> per device. This one is called default. Whether it's root is orthogonal
> (e.g. GIOVA could also be nested) to the device view of this space.
>
> w/ PASID additional address spaces can be targeted by the device.
> those are called non-default.
>
> I could also rename default to RID address space and non-default to
> RID+PASID address space if doing so makes it clearer.
Yes, I think it is worth having a kind of glossary defining root and
default, as you clearly defined child/parent.
>
>>> VFIO decides the routing information for this default
>>> space based on device type:
>>>
>>> 1) pdev, routed via RID;
>>>
>>> 2) mdev/subdev with IOMMU-enforced DMA isolation, routed via
>>> the parent's RID plus the PASID marking this mdev;
>>>
>>> 3) a purely sw-mediated device (sw mdev), no routing required i.e. no
>>> need to install the I/O page table in the IOMMU. sw mdev just uses
>>> the metadata to assist its internal DMA isolation logic on top of
>>> the parent's IOMMU page table;
>> Maybe you should introduce this concept of SW mediated device earlier
>> because it seems to special case the way the attach behaves. I am
>> especially referring to
>>
>> "Successful attaching activates an I/O address space in the IOMMU, if the
>> device is not purely software mediated"
> makes sense.
>
>>> In addition, VFIO may allow user to create additional I/O address spaces
>>> on a vfio_device based on the hardware capability. In such case the user
>>> has its own view of the virtual routing information (vPASID) when marking
>>> these non-default address spaces.
>> I do not catch what "marking these non-default address spaces" means.
> as explained above, those non-default address spaces are identified/routed
> via PASID.
>
>>> 1.3. Group isolation
>>> ++++++++++++++++++++
> [...]
>>> 1) A successful binding call for the first device in the group creates
>>> the security context for the entire group, by:
>>>
>>> * Verifying group viability in a similar way as VFIO does;
>>>
>>> * Calling IOMMU-API to move the group into a block-dma state,
>>> which makes all devices in the group attached to an block-dma
>>> domain with an empty I/O page table;
>> this block-dma state/domain would deserve to be better defined (I know
>> you already evoked it in 1.1 with the dma mapping protocol though)
>> activates an empty I/O page table in the IOMMU (if the device is not
>> purely SW mediated)?
> sure. some explanations are scattered in following paragraph, but I
> can consider to further clarify it.
>
>> How does that relate to the default address space? Is it the same?
> different. this block-dma domain doesn't hold any valid mapping. The
> default address space is represented by a normal unmanaged domain.
> the ioasid attaching operation will detach the device from the block-dma
> domain and then attach it to the target ioasid.
OK

Thanks

Eric
>
>>> 2. uAPI Proposal
>>> ----------------------
> [...]
>>> /*
>>> * Allocate an IOASID.
>>> *
>>> * IOASID is the FD-local software handle representing an I/O address
>>> * space. Each IOASID is associated with a single I/O page table. User
>>> * must call this ioctl to get an IOASID for every I/O address space that is
>>> * intended to be tracked by the kernel.
>>> *
>>> * User needs to specify the attributes of the IOASID and associated
>>> * I/O page table format information according to one or multiple devices
>>> * which will be attached to this IOASID right after. The I/O page table
>>> * is activated in the IOMMU when it's attached by a device. Incompatible
>> .. if not SW mediated
>>> * format between device and IOASID will lead to attaching failure.
>>> *
>>> * The root IOASID should always have a kernel-managed I/O page
>>> * table for safety. Locked page accounting is also conducted on the root.
>> The definition of root IOASID is not easily found in this spec. Maybe
>> this would deserve some clarification.
> make sense.
>
> and thanks for other typo-related comments.
>
> Thanks
> Kevin

2021-08-10 09:08:07

by Tian, Kevin

[permalink] [raw]
Subject: RE: [RFC v2] /dev/iommu uAPI proposal

> From: David Gibson <[email protected]>
> Sent: Tuesday, August 10, 2021 12:48 PM
>
> On Mon, Aug 09, 2021 at 08:34:06AM +0000, Tian, Kevin wrote:
> > > From: David Gibson <[email protected]>
> > > Sent: Friday, August 6, 2021 12:45 PM
> > > > > > In concept I feel the purpose of DMA endpoint is equivalent to the routing
> > > > > > info in this proposal.
> > > > >
> > > > > Maybe? I'm afraid I never quite managed to understand the role of the
> > > > > routing info in your proposal.
> > > > >
> > > >
> > > > the IOMMU routes incoming DMA packets to a specific I/O page table,
> > > > according to RID or RID+PASID carried in the packet. RID or RID+PASID
> > > > is the routing information (represented by device cookie +PASID in
> > > > proposed uAPI) and what the iommu driver really cares when activating
> > > > the I/O page table in the iommu.
> > >
> > > Ok, so yes, endpoint is roughly equivalent to that. But my point is
> > > that the IOMMU layer really only cares about that (device+routing)
> > > combination, not other aspects of what the device is. So that's the
> > > concept we should give a name and put front and center in the API.
> > >
> >
> > This is how this proposal works, centered around the routing info. the
> > uAPI doesn't care what the device is. It just requires the user to specify
> > the user view of routing info (device fd + optional pasid) to tag an IOAS.
>
> Which works as long as (just device fd) and (device fd + PASID) covers
> all the options. If we have new possibilities we need new interfaces.
> And, that can't even handle the case of one endpoint for multiple
> devices (e.g. ACS-incapable bridge).

why not? We went through a long debate in v1 to reach the conclusion
that a device-centric uAPI can cover the above group scenario (see section 1.3)

>
> > the user view is then converted to the kernel view of routing (rid or
> > rid+pasid) by vfio driver and then passed to iommu fd in the attaching
> > operation. and GET_INFO interface is provided for the user to check
> > whether a device supports multiple IOASes and whether pasid space is
> > delegated to the user. We just need a better name if pasid is considered
> > too pci specific...
> >
> > But creating an endpoint per ioasid and making it centered in uAPI is not
> > what the IOMMU layer cares about.
>
> It's not an endpoint per ioasid. You can have multiple endpoints per
> ioasid, just not the other way around. As it is, multiple IOASes per

you need to create one endpoint per device fd to attach to gpa_ioasid.

then one endpoint per device fd to attach to pasidtbl_ioasid on arm/amd.

then one endpoint per pasid to attach to gva_ioasid on intel.

In the end you just create one endpoint per attached ioasid for a given
device.
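
i.e. roughly (hypothetical calls, just to show the one-to-one
correspondence):

/* hypothetical endpoint-centric calls */
ep1 = create_endpoint(device_fd);		/* to attach to gpa_ioasid */
attach(ep1, gpa_ioasid);
ep2 = create_endpoint(device_fd);		/* to attach to pasidtbl_ioasid (arm/amd) */
attach(ep2, pasidtbl_ioasid);
ep3 = create_endpoint(device_fd, pasid);	/* to attach to gva_ioasid (intel) */
attach(ep3, gva_ioasid);

/* vs. the device-centric calls carrying the same information */
attach(device_fd, gpa_ioasid);
attach(device_fd, pasidtbl_ioasid);
attach(device_fd, gva_ioasid, pasid);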

> device means *some* sort of disambiguation (generally by PASID) which
> is hard to describe generally. Having endpoints as a first-class
> concept makes that simpler.
>

I don't think pasid causes any ambiguity (except the name itself
being pci specific). with multiple IOASes you always need an id to tag
each of them. This id is what the iommu layer cares about. which endpoint
on the device uses the id is not the iommu's business.

Thanks
Kevin

2021-08-10 11:49:05

by Tian, Kevin

[permalink] [raw]
Subject: RE: [RFC v2] /dev/iommu uAPI proposal

> From: Eric Auger <[email protected]>
> Sent: Tuesday, August 10, 2021 3:17 PM
>
> Hi Kevin,
>
> On 8/5/21 2:36 AM, Tian, Kevin wrote:
> >> From: Eric Auger <[email protected]>
> >> Sent: Wednesday, August 4, 2021 11:59 PM
> >>
> > [...]
> >>> 1.2. Attach Device to I/O address space
> >>> +++++++++++++++++++++++++++++++++++++++
> >>>
> >>> Device attach/bind is initiated through passthrough framework uAPI.
> >>>
> >>> Device attaching is allowed only after a device is successfully bound to
> >>> the IOMMU fd. User should provide a device cookie when binding the
> >>> device through VFIO uAPI. This cookie is used when the user queries
> >>> device capability/format, issues per-device iotlb invalidation and
> >>> receives per-device I/O page fault data via IOMMU fd.
> >>>
> >>> Successful binding puts the device into a security context which isolates
> >>> its DMA from the rest system. VFIO should not allow user to access the
> >> s/from the rest system/from the rest of the system
> >>> device before binding is completed. Similarly, VFIO should prevent the
> >>> user from unbinding the device before user access is withdrawn.
> >> With Intel scalable IOV, I understand you could assign an RID/PASID to
> >> one VM and another one to another VM (which is not the case for ARM). Is
> >> it a targeted use case? How would it be handled? Is it related to the
> >> sub-groups evoked hereafter?
> > Not related to sub-group. Each mdev is bound to the IOMMU fd
> respectively
> > with the defPASID which represents the mdev.
> But how does it work in terms of security? The device (RID) is bound to
> an IOMMU fd, but then each SID/PASID may be working for a different VM.
> How do you determine this is safe, i.e. that each SID can safely work for
> a different VM, versus the ARM case where it is not possible?

PASID is managed by the parent driver, which knows which PASID to
use for a given mdev when later attaching it to an IOASID.

>
> 1.3 says
> "
>
> 1) A successful binding call for the first device in the group creates
> the security context for the entire group, by:
> "
> What does it mean for the above scalable IOV use case?
>

This is a good question (as Alex raised) which needs more explanation
in next version:

https://lore.kernel.org/linux-iommu/[email protected]/

In general we need to provide different helpers for binding pdev/mdev/
sw mdev. 1.3 in v2 describes the behavior for pdev via iommu_register_
device(). for mdev a new helper (e.g. iommu_register_device_pasid())
is required and then the IOMMU-API will also provide a pasid variation
for creating the security context per pasid. sw mdev will also have its
binding helper to indicate that no routing info is required in ioasid
attaching.
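
e.g. something like (signatures are only a sketch; the sw mdev helper
name is made up and the driver-facing API is still TBD):

/* pdev: routed via RID; the first binding in the group creates the
 * security context for the whole group (see 1.3) */
int iommu_register_device(struct iommu_fd *ifd, struct device *dev,
			  u64 dev_cookie);

/* mdev/subdev: routed via the parent's RID plus the defPASID chosen
 * by the parent driver; the security context is established per pasid */
int iommu_register_device_pasid(struct iommu_fd *ifd, struct device *parent,
				u32 defpasid, u64 dev_cookie);

/* sw mdev: no routing info, nothing is installed in the IOMMU; the
 * binding only provides metadata for the parent driver's own DMA
 * isolation logic (helper name hypothetical) */
int iommu_register_sw_mdev(struct iommu_fd *ifd, struct device *parent,
			   u64 dev_cookie);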

Thanks
Kevin

2021-08-10 13:27:46

by David Gibson

[permalink] [raw]
Subject: Re: [RFC v2] /dev/iommu uAPI proposal

On Fri, Aug 06, 2021 at 09:32:11AM -0300, Jason Gunthorpe wrote:
> On Fri, Aug 06, 2021 at 02:45:26PM +1000, David Gibson wrote:
>
> > Well, that's kind of what I'm doing. PCI currently has the notion of
> > "default" address space for a RID, but there's no guarantee that other
> > buses (or even future PCI extensions) will. The idea is that
> > "endpoint" means exactly the (RID, PASID) or (SID, SSID) or whatever
> > future variations are on that.
>
> This is already happening in this proposal, it is why I insisted that
> the driver facing API has to be very explicit. That API specifies
> exactly what the device silicon is doing.
>
> However, that is placed at the IOASID level. There is no reason to
> create endpoint objects that are 1:1 with IOASID objects, eg for
> PASID.

They're not 1:1 though. You can have multiple endpoints in the same
IOAS, that's the whole point.

> We need to have clear software layers and responsibilities, I think
> this is where the VFIO container design has fallen behind.
>
> The device driver is responsible to declare what TLPs the device it
> controls will issue

Right.. and I'm envisaging an endpoint as an abstraction to represent a
single TLP.

> The system layer is responsible to determine how those TLPs can be
> matched to IO page tables, if at all
>
> The IO page table layer is responsible to map the TLPs to physical
> memory.
>
> Each must stay in its box and we should not create objects that smush
> together, say, the device and system layers because it will only make
> a mess of the software design.

I agree... and endpoints are explicitly an attempt to do that. I
don't see how you think they're smushing things together.

> Since the system layer doesn't have any concrete objects in our
> environment (which is based on devices and IO page tables) it has to
> exist as metadata attached to the other two objects.

Whereas I'm suggesting clarifying this by *creating* concrete objects
to represent the concept we need.

--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson

