2021-05-27 08:00:42

by Tian, Kevin

Subject: [RFC] /dev/ioasid uAPI proposal

/dev/ioasid provides a unified interface for managing I/O page tables for
devices assigned to userspace. Device passthrough frameworks (VFIO, vDPA,
etc.) are expected to use this interface instead of creating their own logic to
isolate untrusted device DMAs initiated by userspace.

This proposal describes the uAPI of /dev/ioasid along with sample sequences
using VFIO as an example of typical usages. The driver-facing kernel API provided
by the iommu layer is still TBD, which can be discussed after consensus is
reached on this uAPI.

It's based on a lengthy discussion starting from here:
https://lore.kernel.org/linux-iommu/[email protected]/

It ends up being a long write-up because many things needed to be summarized
and non-trivial effort was required to connect them into a complete proposal.
Hope it provides a clean base to converge on.

TOC
====
1. Terminologies and Concepts
2. uAPI Proposal
2.1. /dev/ioasid uAPI
2.2. /dev/vfio uAPI
2.3. /dev/kvm uAPI
3. Sample structures and helper functions
4. PASID virtualization
5. Use Cases and Flows
5.1. A simple example
5.2. Multiple IOASIDs (no nesting)
5.3. IOASID nesting (software)
5.4. IOASID nesting (hardware)
5.5. Guest SVA (vSVA)
5.6. I/O page fault
5.7. BIND_PASID_TABLE
====

1. Terminologies and Concepts
-----------------------------------------

IOASID FD is the container holding multiple I/O address spaces. User
manages those address spaces through FD operations. Multiple FD's are
allowed per process, but with this proposal one FD should be sufficient for
all intended usages.

IOASID is the FD-local software handle representing an I/O address space.
Each IOASID is associated with a single I/O page table. IOASIDs can be
nested together, implying the output address from one I/O page table
(represented by child IOASID) must be further translated by another I/O
page table (represented by parent IOASID).

An I/O address space can be managed through two protocols, according to
whether the corresponding I/O page table is constructed by the kernel or
the user. When kernel-managed, a DMA mapping protocol (similar to the
existing VFIO iommu type1) is provided for the user to explicitly specify
how the I/O address space is mapped. Otherwise, a different protocol is
provided for the user to bind a user-managed I/O page table to the
IOMMU, plus necessary commands for iotlb invalidation and I/O fault
handling.

The pgtable binding protocol can be used only on child IOASIDs, implying
IOASID nesting must be enabled. This is because the kernel doesn't trust
userspace. Nesting allows the kernel to enforce its DMA isolation policy
through the parent IOASID.

IOASID nesting can be implemented in two ways: hardware nesting and
software nesting. With hardware support the child and parent I/O page
tables are walked consecutively by the IOMMU to form a nested translation.
When it's implemented in software, the ioasid driver is responsible for
merging the two-level mappings into a single-level shadow I/O page table.
Software nesting requires both the child and parent page tables to be
operated through the DMA mapping protocol, so any change in either level
can be captured by the kernel to update the corresponding shadow mapping.
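
To make the software nesting case concrete, below is a minimal sketch of how
the kernel might regenerate the shadow mapping when the child changes; every
helper and structure name in it is purely illustrative, not part of this proposal:

    /*
     * Illustrative sketch: software nesting shadow update.
     * The child maps IOVA_C -> IOVA_P (e.g. GIOVA -> GPA) and the
     * parent maps IOVA_P -> HVA (e.g. GPA -> HVA). The shadow I/O
     * page table maps IOVA_C directly to the host page backing HVA,
     * so the IOMMU only walks a single level.
     */
    static int ioasid_update_shadow(struct ioasid_data *child,
                                    u64 iova_c, u64 size)
    {
        struct ioasid_data *parent = child->parent;
        u64 off;

        for (off = 0; off < size; off += PAGE_SIZE) {
            u64 iova_p = child_lookup(child, iova_c + off);  /* GIOVA -> GPA */
            u64 hva = parent_lookup(parent, iova_p);         /* GPA -> HVA */

            /* pin the backing user page and install it into the
             * single-level shadow I/O page table */
            shadow_map_page(child->domain, iova_c + off, hva);
        }
        return 0;
    }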

An I/O address space takes effect in the IOMMU only after it is attached
to a device. The device in the /dev/ioasid context always refers to a
physical one or 'pdev' (PF or VF).

One I/O address space could be attached to multiple devices. In this case,
/dev/ioasid uAPI applies to all attached devices under the specified IOASID.

Based on the underlying IOMMU capability one device might be allowed
to attach to multiple I/O address spaces, with DMAs accessing them by
carrying different routing information. One of them is the default I/O
address space routed by PCI Requestor ID (RID) or ARM Stream ID. The
rest are routed by RID + Process Address Space ID (PASID) or
Stream+Substream ID. For simplicity the following context uses RID and
PASID when talking about the routing information for I/O address spaces.

Device attachment is initiated through the passthrough framework uAPI (using
VFIO for simplicity in the following context). VFIO is responsible for identifying
the routing information and registering it to the ioasid driver when calling
the ioasid attach helper function. It could be RID if the assigned device is a
pdev (PF/VF) or RID+PASID if the device is mediated (mdev). In addition, the
user might also provide its view of virtual routing information (vPASID) in
the attach call, e.g. when multiple user-managed I/O address spaces are
attached to the vfio_device. In this case VFIO must figure out whether the
vPASID should be directly used (for pdev) or converted to a kernel-
allocated one (pPASID, for mdev) for physical routing (see section 4).

A device must be bound to an IOASID FD before the attach operation can be
conducted. This is also done through the VFIO uAPI. In this proposal one device
should not be bound to multiple FD's. We are not sure about the gain of
allowing it beyond adding unnecessary complexity, but if others have a
different view we can discuss further.

VFIO must ensure its device composes DMAs with the routing information
attached to the IOASID. For pdev it naturally happens since the vPASID is
directly programmed into the device by guest software. For mdev this
implies any guest operation carrying a vPASID on this device must be
trapped into VFIO and then converted to pPASID before being sent to the
device. A detailed explanation of PASID virtualization policies can be
found in section 4.

Modern devices may support a scalable workload submission interface
based on PCI DMWr capability, allowing a single work queue to access
multiple I/O address spaces. One example is Intel ENQCMD, which has the
PASID saved in a CPU MSR and carried in the instruction payload when the
command is sent to the device. Then a single work queue shared by
multiple processes can compose DMAs carrying different PASIDs.

When executing ENQCMD in the guest, the CPU MSR includes a vPASID
which, if targeting a mdev, must be converted to pPASID before being sent
to the wire. Intel CPUs provide a hardware PASID translation capability
for auto-conversion in the fast path. The user is expected to set up the
PASID mapping through the KVM uAPI, with information about {vpasid,
ioasid_fd, ioasid}. The ioasid driver provides a helper function for KVM
to figure out the actual pPASID given an IOASID.

With the above design the /dev/ioasid uAPI is all about I/O address spaces.
It doesn't include any device routing information, which is only
indirectly registered to the ioasid driver through the VFIO uAPI. For
example, an I/O page fault is always reported to userspace per IOASID,
although it's physically reported per device (RID+PASID). If there is a
need to further relay this fault into the guest, the user is responsible
for identifying the device attached to this IOASID (picking one randomly if
multiple devices are attached) and then generating a per-device virtual I/O
page fault into the guest. Similarly the iotlb invalidation uAPI describes the
granularity in the I/O address space (all, or a range), different from the
underlying IOMMU semantics (domain-wide, PASID-wide, range-based).

I/O page tables routed through PASID are installed in a per-RID PASID
table structure. Some platforms implement the PASID table in the guest
physical address space (GPA), expecting it to be managed by the guest. The
guest PASID table is bound to the IOMMU also by attaching it to an IOASID,
representing the per-RID vPASID space.

We propose that the host kernel explicitly track guest I/O page
tables even on these platforms, i.e. the same pgtable binding protocol
should be used universally on all platforms (with the only difference being
who actually writes the PASID table). One opinion from the previous discussion
was to treat this special IOASID as a container for all guest I/O page
tables, i.e. hiding them from the host. However that approach significantly
violates the philosophy of this /dev/ioasid proposal, since it is no longer
one IOASID per address space. Device routing information (indirectly
marking hidden I/O spaces) would have to be carried in the iotlb invalidation
and page faulting uAPI to help connect the vIOMMU with the underlying
pIOMMU. This is one design choice to be confirmed with the ARM folks.

Devices may sit behind IOMMUs with incompatible capabilities. The
difference may lie in the I/O page table format, or the availability of a
user-visible uAPI (e.g. hardware nesting). /dev/ioasid is responsible for
checking the incompatibility between a newly-attached device and existing
devices under the specific IOASID and, if found, returning an error to the
user. Upon such an error the user should create a new IOASID for the
incompatible device.

There is no explicit group enforcement in the /dev/ioasid uAPI, since this
interface has no device notion, as mentioned above. But the ioasid driver
does an implicit check to make sure that devices within an iommu group
are all attached to the same IOASID before this IOASID starts to
accept any uAPI command. Otherwise an error is returned to
the user.

There was a long debate in the previous discussion about whether VFIO should
keep explicit container/group semantics in its uAPI. Jason Gunthorpe proposes
a simplified model where every device bound to VFIO is explicitly listed
under /dev/vfio, thus a device fd can be acquired w/o going through the legacy
container/group interface. In this case the user is responsible for
understanding the group topology and meeting the implicit group check
criteria enforced in /dev/ioasid. The use case examples in this proposal
are based on the new model.

Of course for backward compatibility VFIO still needs to keep the existing
uAPI, and vfio iommu type1 will become a shim layer connecting the VFIO
iommu ops to the internal ioasid helper functions.

Notes:
- It might be confusing that IOASID is also used in the kernel (drivers/
iommu/ioasid.c) to represent PCI PASID or ARM substream ID. We need
to find a better name later to differentiate them.

- PPC has not been considered yet as we haven't had time to fully understand
its semantics. According to the previous discussion there is some commonality
between the PPC window-based scheme and the VFIO type1 semantics. Let's
first reach consensus on this proposal and then further discuss how to
extend it to cover PPC's requirements.

- There is a protocol between the vfio group and kvm. We need to think about
how it will be affected by this proposal.

- mdev in this context refers to mediated subfunctions (e.g. Intel SIOV)
which can be physically isolated from each other through PASID-granular
IOMMU protection. Historically people also discussed a usage of
mediating a pdev into a mdev. That usage is not covered here, and is
expected to be replaced by Max's work which allows overriding various
VFIO operations in the vfio-pci driver.

2. uAPI Proposal
----------------------

/dev/ioasid uAPI covers everything about managing I/O address spaces.

/dev/vfio uAPI builds connection between devices and I/O address spaces.

/dev/kvm uAPI is optional, required only as far as ENQCMD is concerned.


2.1. /dev/ioasid uAPI
+++++++++++++++++

/*
* Check whether a uAPI extension is supported.
*
* This is for FD-level capabilities, such as locked page pre-registration.
* IOASID-level capabilities are reported through IOASID_GET_INFO.
*
* Return: 0 if not supported, 1 if supported.
*/
#define IOASID_CHECK_EXTENSION _IO(IOASID_TYPE, IOASID_BASE + 0)


/*
* Register user space memory where DMA is allowed.
*
* It pins user pages and does the locked memory accounting so
* subsequent IOASID_MAP/UNMAP_DMA calls get faster.
*
* When this ioctl is not used, one user page might be accounted
* multiple times when it is mapped by multiple IOASIDs which are
* not nested together.
*
* Input parameters:
* - vaddr;
* - size;
*
* Return: 0 on success, -errno on failure.
*/
#define IOASID_REGISTER_MEMORY _IO(IOASID_TYPE, IOASID_BASE + 1)
#define IOASID_UNREGISTER_MEMORY _IO(IOASID_TYPE, IOASID_BASE + 2)
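
A possible layout of the input structure is sketched below, mirroring the
parameters listed above; the structure name and the argsz/flags convention
are assumptions, not part of the proposal:

    struct ioasid_register_memory {    /* hypothetical layout */
        __u32   argsz;
        __u32   flags;
        __u64   vaddr;   /* start of the pre-registered range */
        __u64   size;    /* length in bytes */
    };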


/*
* Allocate an IOASID.
*
* IOASID is the FD-local software handle representing an I/O address
* space. Each IOASID is associated with a single I/O page table. User
* must call this ioctl to get an IOASID for every I/O address space that is
* intended to be enabled in the IOMMU.
*
* A newly-created IOASID doesn't accept any command before it is
* attached to a device. Once attached, an empty I/O page table is
* bound with the IOMMU then the user could use either DMA mapping
* or pgtable binding commands to manage this I/O page table.
*
* Device attachment is initiated through device driver uAPI (e.g. VFIO)
*
* Return: allocated ioasid on success, -errno on failure.
*/
#define IOASID_ALLOC _IO(IOASID_TYPE, IOASID_BASE + 3)
#define IOASID_FREE _IO(IOASID_TYPE, IOASID_BASE + 4)


/*
* Get information about an I/O address space
*
* Supported capabilities:
* - VFIO type1 map/unmap;
* - pgtable/pasid_table binding
* - hardware nesting vs. software nesting;
* - ...
*
* Related attributes:
* - supported page sizes, reserved IOVA ranges (DMA mapping);
* - vendor pgtable formats (pgtable binding);
* - number of child IOASIDs (nesting);
* - ...
*
* Above information is available only after one or more devices are
* attached to the specified IOASID. Otherwise the IOASID is just a
* number w/o any capability or attribute.
*
* Input parameters:
* - u32 ioasid;
*
* Output parameters:
* - many. TBD.
*/
#define IOASID_GET_INFO _IO(IOASID_TYPE, IOASID_BASE + 5)


/*
* Map/unmap process virtual addresses to I/O virtual addresses.
*
* Provides VFIO type1 equivalent semantics. Start with the same
* restriction, e.g. the unmap size should match the one used in the
* original mapping call.
*
* If IOASID_REGISTER_MEMORY has been called, the mapped vaddr
* must be already in the preregistered list.
*
* Input parameters:
* - u32 ioasid;
* - refer to vfio_iommu_type1_dma_{un}map
*
* Return: 0 on success, -errno on failure.
*/
#define IOASID_MAP_DMA _IO(IOASID_TYPE, IOASID_BASE + 6)
#define IOASID_UNMAP_DMA _IO(IOASID_TYPE, IOASID_BASE + 7)
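
As a sketch, the parameter structure could mirror struct
vfio_iommu_type1_dma_map with an added ioasid field; the exact layout
below is hypothetical:

    struct ioasid_dma_map {    /* hypothetical layout */
        __u32   argsz;
        __u32   flags;   /* read/write permissions, etc. */
        __u32   ioasid;
        __u64   vaddr;   /* process virtual address */
        __u64   iova;    /* I/O virtual address */
        __u64   size;    /* size of mapping (bytes) */
    };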


/*
* Create a nesting IOASID (child) on an existing IOASID (parent)
*
* IOASIDs can be nested together, implying that the output address
* from one I/O page table (child) must be further translated by
* another I/O page table (parent).
*
* As the child adds essentially another reference to the I/O page table
* represented by the parent, any device attached to the child ioasid
* must already be attached to the parent.
*
* In concept there is no limit on the number of the nesting levels.
* However for the majority case one nesting level is sufficient. The
* user should check whether an IOASID supports nesting through
* IOASID_GET_INFO. For example, if only one nesting level is allowed,
* the nesting capability is reported only on the parent instead of the
* child.
*
* The user also needs to check (via IOASID_GET_INFO) whether the nesting
* is implemented in hardware or software. If software-based, the DMA
* mapping protocol should be used on the child IOASID. Otherwise,
* the child should be operated with the pgtable binding protocol.
*
* Input parameters:
* - u32 parent_ioasid;
*
* Return: child_ioasid on success, -errno on failure;
*/
#define IOASID_CREATE_NESTING _IO(IOASID_TYPE, IOASID_BASE + 8)


/*
* Bind a user-managed I/O page table to the IOMMU
*
* Because the user page table is untrusted, IOASID nesting must be enabled
* for this ioasid so the kernel can enforce its DMA isolation policy
* through the parent ioasid.
*
* The pgtable binding protocol is different from DMA mapping. The latter
* has the I/O page table constructed by the kernel and updated
* according to user MAP/UNMAP commands. With pgtable binding the
* whole page table is created and updated by userspace, thus a different
* set of commands is required (bind, iotlb invalidation, page fault, etc.).
*
* Because the page table is directly walked by the IOMMU, the user
* must use a format compatible with the underlying hardware. It can
* check the format information through IOASID_GET_INFO.
*
* The page table is bound to the IOMMU according to the routing
* information of each attached device under the specified IOASID. The
* routing information (RID and optional PASID) is registered when a
* device is attached to this IOASID through VFIO uAPI.
*
* Input parameters:
* - child_ioasid;
* - address of the user page table;
* - formats (vendor, address_width, etc.);
*
* Return: 0 on success, -errno on failure.
*/
#define IOASID_BIND_PGTABLE _IO(IOASID_TYPE, IOASID_BASE + 9)
#define IOASID_UNBIND_PGTABLE _IO(IOASID_TYPE, IOASID_BASE + 10)
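
A hypothetical sketch of the bind parameters, derived from the inputs
listed above:

    struct ioasid_bind_pgtable {    /* hypothetical layout */
        __u32   argsz;
        __u32   flags;
        __u32   ioasid;        /* child IOASID */
        __u32   format;        /* vendor pgtable format */
        __u32   addr_width;
        __u32   __reserved;
        __u64   addr;          /* address of the user page table */
    };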


/*
* Bind a user-managed PASID table to the IOMMU
*
* This is required for platforms which place PASID table in the GPA space.
* In this case the specified IOASID represents the per-RID PASID space.
*
* Alternatively this may be replaced by IOASID_BIND_PGTABLE plus a
* special flag to indicate the difference from normal I/O address spaces.
*
* The format info of the PASID table is reported in IOASID_GET_INFO.
*
* As explained in the design section, user-managed I/O page tables must
* be explicitly bound to the kernel even on these platforms. It allows
* the kernel to uniformly manage I/O address spaces across all platforms.
* Otherwise, the iotlb invalidation and page faulting uAPI must be hacked
* to carry device routing information to indirectly mark the hidden I/O
* address spaces.
*
* Input parameters:
* - child_ioasid;
* - address of PASID table;
* - formats (vendor, size, etc.);
*
* Return: 0 on success, -errno on failure.
*/
#define IOASID_BIND_PASID_TABLE _IO(IOASID_TYPE, IOASID_BASE + 11)
#define IOASID_UNBIND_PASID_TABLE _IO(IOASID_TYPE, IOASID_BASE + 12)


/*
* Invalidate the IOTLB for a user-managed I/O page table
*
* Unlike what's defined in include/uapi/linux/iommu.h, this command
* doesn't allow the user to specify the cache type and likely supports only
* two granularities (all, or a specified range) in the I/O address space.
*
* Physical IOMMUs have three cache types (iotlb, dev_iotlb and pasid
* cache). If the IOASID represents an I/O address space, the invalidation
* always applies to the iotlb (and dev_iotlb if enabled). If the IOASID
* represents a vPASID space, then this command applies to the PASID
* cache.
*
* Similarly this command doesn't provide IOMMU-like granularity
* info (domain-wide, pasid-wide, range-based), since it's all about the
* I/O address space itself. The ioasid driver walks the attached
* routing information to match the IOMMU semantics under the
* hood.
*
* Input parameters:
* - child_ioasid;
* - granularity
*
* Return: 0 on success, -errno on failure
*/
#define IOASID_INVALIDATE_CACHE _IO(IOASID_TYPE, IOASID_BASE + 13)
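
A possible parameter layout, with hypothetical names; the granularity
encoding below is only one of several options:

    struct ioasid_cache_invalidate {    /* hypothetical layout */
        __u32   argsz;
        __u32   flags;
        __u32   ioasid;         /* child IOASID */
        __u32   granularity;    /* e.g. IOASID_INV_ALL or IOASID_INV_RANGE */
        __u64   iova;           /* valid for RANGE */
        __u64   size;           /* valid for RANGE */
    };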


/*
* Page fault report and response
*
* This is TBD. Can be added after other parts are cleared up. Likely it
* will be a ring buffer shared between user/kernel, an eventfd to notify
* the user and an ioctl to complete the fault.
*
* The fault data is per I/O address space, i.e.: IOASID + faulting_addr
*/


/*
* Dirty page tracking
*
* Track and report memory pages dirtied in I/O address spaces. There
* is ongoing work by Kunkun Jiang extending the existing VFIO type1
* support. It needs to be adapted to /dev/ioasid later.
*/


2.2. /dev/vfio uAPI
++++++++++++++++

/*
* Bind a vfio_device to the specified IOASID fd
*
* Multiple vfio devices can be bound to a single ioasid_fd, but a single
* vfio device should not be bound to multiple ioasid_fd's.
*
* Input parameters:
* - ioasid_fd;
*
* Return: 0 on success, -errno on failure.
*/
#define VFIO_BIND_IOASID_FD _IO(VFIO_TYPE, VFIO_BASE + 22)
#define VFIO_UNBIND_IOASID_FD _IO(VFIO_TYPE, VFIO_BASE + 23)
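
A sketch of the bind parameters, following the usual VFIO argsz/flags
convention; the structure name is an assumption:

    struct vfio_bind_ioasid_fd {    /* hypothetical layout */
        __u32   argsz;
        __u32   flags;
        __s32   ioasid_fd;
    };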


/*
* Attach a vfio device to the specified IOASID
*
* Multiple vfio devices can be attached to the same IOASID, and vice
* versa.
*
* The user may optionally provide a "virtual PASID" to mark an I/O page
* table on this vfio device. Whether the virtual PASID is physically used
* or converted to another kernel-allocated PASID is a policy decision in the
* vfio device driver.
*
* There is no need to specify ioasid_fd in this call due to the assumption
* of 1:1 connection between vfio device and the bound fd.
*
* Input parameter:
* - ioasid;
* - flag;
* - user_pasid (if specified);
*
* Return: 0 on success, -errno on failure.
*/
#define VFIO_ATTACH_IOASID _IO(VFIO_TYPE, VFIO_BASE + 24)
#define VFIO_DETACH_IOASID _IO(VFIO_TYPE, VFIO_BASE + 25)
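
A sketch of the attach parameters, mirroring the at_data used in the flows
of section 5; the layout is an assumption:

    struct vfio_attach_ioasid {    /* hypothetical layout */
        __u32   argsz;
        __u32   flags;        /* e.g. IOASID_ATTACH_USER_PASID */
        __u32   ioasid;
        __u32   user_pasid;   /* valid if the flag is set */
    };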


2.3. /dev/kvm uAPI
++++++++++++

/*
* Update CPU PASID mapping
*
* This is necessary when ENQCMD will be used in the guest while the
* targeted device doesn't accept the vPASID saved in the CPU MSR.
*
* This command allows the user to set/clear the vPASID->pPASID mapping
* in the CPU, by providing the IOASID (and FD) information representing
* the I/O address space marked by this vPASID.
*
* Input parameters:
* - user_pasid;
* - ioasid_fd;
* - ioasid;
*/
#define KVM_MAP_PASID _IO(KVMIO, 0xf0)
#define KVM_UNMAP_PASID _IO(KVMIO, 0xf1)
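
A sketch of the parameters, consistent with the pa_data used in section 5.5;
the layout is an assumption:

    struct kvm_pasid_mapping {    /* hypothetical layout */
        __u32   flags;
        __s32   ioasid_fd;
        __u32   ioasid;
        __u32   guest_pasid;   /* vPASID */
    };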


3. Sample structures and helper functions
--------------------------------------------------------

Three helper functions are provided to support VFIO_BIND_IOASID_FD:

struct ioasid_ctx *ioasid_ctx_fdget(int fd);
int ioasid_register_device(struct ioasid_ctx *ctx, struct ioasid_dev *dev);
int ioasid_unregister_device(struct ioasid_dev *dev);

An ioasid_ctx is created for each fd:

struct ioasid_ctx {
    // a list of allocated IOASID data's
    struct list_head    ioasid_list;
    // a list of registered devices
    struct list_head    dev_list;
    // a list of pre-registered virtual address ranges
    struct list_head    prereg_list;
};

Each registered device is represented by ioasid_dev:

struct ioasid_dev {
    struct list_head    next;
    struct ioasid_ctx   *ctx;
    // always the physical device
    struct device       *device;
    struct kref         kref;
};

Because we assume one vfio_device is connected to at most one ioasid_fd,
ioasid_dev could be embedded in vfio_device and then linked into
ioasid_ctx->dev_list when registration succeeds. For mdev the struct
device pointer should point to the parent device. The PASID marking this
mdev is specified later, at VFIO_ATTACH_IOASID time.
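
A minimal sketch of how the VFIO_BIND_IOASID_FD path might use these
helpers, assuming a hypothetical ioasid_dev member embedded in vfio_device:

    /* hypothetical handler for VFIO_BIND_IOASID_FD in the vfio device driver */
    static int vfio_device_bind_ioasid_fd(struct vfio_device *vdev, int fd)
    {
        struct ioasid_ctx *ctx = ioasid_ctx_fdget(fd);

        if (!ctx)
            return -EBADF;

        /* for an mdev this would point to the parent device instead */
        vdev->ioasid_dev.device = vdev->dev;

        /* links vdev->ioasid_dev into ctx->dev_list on success */
        return ioasid_register_device(ctx, &vdev->ioasid_dev);
    }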

An ioasid_data is created by IOASID_ALLOC, as the main object
describing the characteristics of an I/O page table:

struct ioasid_data {
    // link to ioasid_ctx->ioasid_list
    struct list_head    next;

    // the IOASID number
    u32                 ioasid;

    // the handle to convey iommu operations
    // hold the pgd (TBD until discussing iommu api)
    struct iommu_domain *domain;

    // map metadata (vfio type1 semantics)
    struct rb_node      dma_list;

    // pointer to user-managed pgtable (for nesting case)
    u64                 user_pgd;

    // link to the parent ioasid (for nesting)
    struct ioasid_data  *parent;

    // cache the global PASID shared by ENQCMD-capable
    // devices (see the explanation in section 4)
    u32                 pasid;

    // a list of device attach data (routing information)
    struct list_head    attach_data;

    // a list of partially-attached devices (group)
    struct list_head    partial_devices;

    // a list of fault_data reported from the iommu layer
    struct list_head    fault_data;

    ...
};

ioasid_data and iommu_domain have overlapping roles as both are
introduced to represent an I/O address space. It is still a big TBD how
the two should be correlated or even merged, and whether new iommu
ops are required to handle RID+PASID explicitly. We leave this open
for now as this proposal is mainly about uAPI. For simplification
purposes the two objects are kept separate in this context, assuming a
1:1 connection between them, with the domain as the placeholder
representing the 1st class object in the iommu ops.

Two helper functions are provided to support VFIO_ATTACH_IOASID:

struct attach_info {
    u32     ioasid;
    // If valid, the PASID to be used physically
    u32     pasid;
};
int ioasid_device_attach(struct ioasid_dev *dev,
                         struct attach_info info);
int ioasid_device_detach(struct ioasid_dev *dev, u32 ioasid);

The pasid parameter is optionally provided based on the policy in the vfio
device driver. It could be the PASID marking the default I/O address
space for a mdev, the user-provided PASID marking a user I/O page
table, or another kernel-allocated PASID backing the user-provided one.
Please check the next section for a detailed explanation.

A new object is introduced and linked to ioasid_data->attach_data for
each successful attach operation:

struct ioasid_attach_data {
    struct list_head    next;
    struct ioasid_dev   *dev;
    u32                 pasid;
};

As explained in the design section, there is no explicit group enforcement
in the /dev/ioasid uAPI or helper functions. But the ioasid driver does an
implicit group check - until every device within an iommu group is
attached to this IOASID, the already-attached devices in this group are
kept in ioasid_data->partial_devices. The IOASID rejects any command while
the partial_devices list is not empty.
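
A minimal sketch of how such a gate could look; the helper name and its
exact placement in the ioasid driver are hypothetical:

    /* hypothetical check executed at the start of every IOASID uAPI handler */
    static int ioasid_group_check(struct ioasid_data *data)
    {
        /*
         * Commands are rejected until every device of each attached
         * iommu group has been attached to this IOASID, i.e. until
         * the partial_devices list drains to empty.
         */
        if (!list_empty(&data->partial_devices))
            return -EBUSY;
        return 0;
    }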

Finally, the last helper function:

    u32 ioasid_get_global_pasid(struct ioasid_ctx *ctx,
                                u32 ioasid, bool alloc);

ioasid_get_global_pasid is necessary in scenarios where multiple devices
want to share the same PASID value on the attached I/O page table (e.g.
when ENQCMD is enabled, as explained in the next section). We need a
centralized place (ioasid_data->pasid) to hold this value (allocated when
first called with alloc=true). The vfio device driver calls this function
(alloc=true) to get the global PASID for an ioasid before calling
ioasid_device_attach. KVM also calls this function (alloc=false) to set up
the PASID translation structure when the user calls KVM_MAP_PASID.
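
For illustration, a sketch of the attach path in an ENQCMD-capable mdev
driver under this scheme; everything except the helpers defined above is
hypothetical:

    /* hypothetical attach path for an ENQCMD-capable mdev */
    static int mdev_attach_ioasid(struct ioasid_dev *idev,
                                  struct ioasid_ctx *ctx, u32 ioasid)
    {
        struct attach_info info = {
            .ioasid = ioasid,
            /* allocate (or look up) the global pPASID for this ioasid */
            .pasid  = ioasid_get_global_pasid(ctx, ioasid, true),
        };

        return ioasid_device_attach(idev, info);
    }

    /* KVM later calls ioasid_get_global_pasid(ctx, ioasid, false) to fill
     * the CPU PASID translation structure on KVM_MAP_PASID. */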

4. PASID Virtualization
------------------------------

When guest SVA (vSVA) is enabled, multiple GVA address spaces are
created on the assigned vfio device. This leads to the concepts of
"virtual PASID" (vPASID) vs. "physical PASID" (pPASID). vPASID is assigned
by the guest to mark a GVA address space while pPASID is the one
selected by the host and actually routed on the wire.

The vPASID is conveyed to the kernel when the user calls VFIO_ATTACH_IOASID.

The vfio device driver translates vPASID to pPASID before calling
ioasid_device_attach, with two factors to be considered:

- Whether vPASID is directly used (vPASID==pPASID) in the wire, or
should be instead converted to a newly-allocated one (vPASID!=
pPASID);

- If vPASID!=pPASID, whether pPASID is allocated from the per-RID PASID
space or a global PASID space (implying sharing pPASID across devices,
e.g. when supporting Intel ENQCMD which puts the PASID in a CPU MSR
as part of the process context);

The actual policy depends on pdev vs. mdev, and whether ENQCMD is
supported. There are three possible scenarios:

(Note: /dev/ioasid uAPI is not affected by underlying PASID virtualization
policies.)

1) pdev (w/ or w/o ENQCMD): vPASID==pPASID

vPASIDs are directly programmed by the guest to the assigned MMIO
bar, implying all DMAs out of this device carry vPASID in the packet
header. This mandates vPASID==pPASID, in effect delegating the entire
per-RID PASID space to the guest.

When ENQCMD is enabled, the CPU MSR contains a vPASID while running a
guest task. In this case the CPU PASID translation capability
should be disabled so the vPASID in the CPU MSR is sent directly to the
wire.

This ensures consistent vPASID usage on pdev regardless of whether the
workload is submitted through an MMIO register or the ENQCMD instruction.

2) mdev: vPASID!=pPASID (per-RID if w/o ENQCMD, otherwise global)

PASIDs are also used by the kernel to mark the default I/O address space
for mdev, and thus cannot be delegated to the guest. Instead, the mdev
driver must allocate a new pPASID for each vPASID (thus vPASID!=
pPASID) and then use pPASID when attaching this mdev to an ioasid.

The mdev driver needs to cache the PASID mapping so that in the mediation
path a vPASID programmed by the guest can be converted to pPASID
before updating the physical MMIO register. The mapping should
also be saved in the CPU PASID translation structure (via KVM uAPI),
so the vPASID saved in the CPU MSR is auto-translated to pPASID
before being sent to the wire, when ENQCMD is enabled.

Generally pPASIDs could be allocated from the per-RID PASID space
if all mdev's created on the parent device don't support ENQCMD.

However if the parent supports ENQCMD-capable mdevs, pPASIDs
must be allocated from a global pool because the CPU PASID
translation structure is per-VM. It implies that when a guest I/O
page table is attached to two mdevs with a single vPASID (i.e. bound
to the same guest process), the same pPASID should be used for
both mdevs even when they belong to different parents. Sharing
pPASID across mdevs is achieved by calling the aforementioned
ioasid_get_global_pasid().

3) Mix pdev/mdev together

The above policies are per device type and thus are not affected when mixing
those device types together (when assigned to a single guest). However,
there is one exception - when both pdev and mdev support ENQCMD.

Remember the two types have conflicting requirements on whether
CPU PASID translation should be enabled. This capability is per-VM,
and must be enabled for mdev isolation. When enabled, a pdev will
receive a mdev pPASID, violating its vPASID expectation.

In the previous thread a PASID range split scheme was discussed to support
this combination, but we haven't worked out a clean uAPI design yet.
Therefore in this proposal we decide not to support it, implying the
user should have some intelligence to avoid such a scenario. It could be
a TODO task for the future.

In spite of those subtle considerations, the kernel implementation could
start simple, e.g.:

- v==p for pdev;
- v!=p and always use a global PASID pool for all mdev's;

Regardless of the kernel policy, the user policy is unchanged:

- provide vPASID when calling VFIO_ATTACH_IOASID;
- call KVM uAPI to set up CPU PASID translation for ENQCMD-capable mdevs;
- Don't expose ENQCMD capability on both pdev and mdev;

Sample user flow is described in section 5.5.

5. Use Cases and Flows
-------------------------------

Here we assume VFIO will support a new model where every bound device
is explicitly listed under /dev/vfio, thus a device fd can be acquired w/o
going through the legacy container/group interface. For illustration purposes
those devices are just called dev[1...N]:

    device_fd[1...N] = open("/dev/vfio/devices/dev[1...N]", mode);

As explained earlier, one IOASID fd is sufficient for all intended use cases:

    ioasid_fd = open("/dev/ioasid", mode);

For simplicity the examples below are all made for the virtualization story.
They are representative and could be easily adapted to a non-virtualization
scenario.

Three types of IOASIDs are considered:

    gpa_ioasid[1...N]:   for GPA address space
    giova_ioasid[1...N]: for guest IOVA address space
    gva_ioasid[1...N]:   for guest CPU VA address space

At least one gpa_ioasid must always be created per guest, while the other
two are relevant only as far as vIOMMU is concerned.

The examples here apply to both pdev and mdev, unless explicitly noted
otherwise (e.g. in section 5.5). The VFIO device driver in the kernel will
figure out the associated routing information in the attach operation.

For illustration simplicity, IOASID_CHECK_EXTENSION and IOASID_GET_
INFO are skipped in these examples.

5.1. A simple example
++++++++++++++++++

Dev1 is assigned to the guest. One gpa_ioasid is created. The GPA address
space is managed through the DMA mapping protocol:

    /* Bind device to IOASID fd */
    device_fd = open("/dev/vfio/devices/dev1", mode);
    ioasid_fd = open("/dev/ioasid", mode);
    ioctl(device_fd, VFIO_BIND_IOASID_FD, ioasid_fd);

    /* Attach device to IOASID */
    gpa_ioasid = ioctl(ioasid_fd, IOASID_ALLOC);
    at_data = { .ioasid = gpa_ioasid };
    ioctl(device_fd, VFIO_ATTACH_IOASID, &at_data);

    /* Setup GPA mapping */
    dma_map = {
        .ioasid = gpa_ioasid;
        .iova   = 0;            // GPA
        .vaddr  = 0x40000000;   // HVA
        .size   = 1GB;
    };
    ioctl(ioasid_fd, IOASID_MAP_DMA, &dma_map);

If more devices than dev1 are assigned to the guest, the user follows the
above sequence to attach the other devices to the same gpa_ioasid, i.e.
sharing the GPA address space across all assigned devices.

5.2. Multiple IOASIDs (no nesting)
++++++++++++++++++++++++++++

Dev1 and dev2 are assigned to the guest. vIOMMU is enabled. Initially
both devices are attached to gpa_ioasid. After boot the guest creates
a GIOVA address space (giova_ioasid) for dev2, leaving dev1 in pass-
through mode (gpa_ioasid).

Suppose IOASID nesting is not supported in this case. Qemu needs to
generate shadow mappings in userspace for giova_ioasid (like how
VFIO works today).

To avoid duplicated locked page accounting, it's recommended to pre-
register the virtual address range that will be used for DMA:

    device_fd1 = open("/dev/vfio/devices/dev1", mode);
    device_fd2 = open("/dev/vfio/devices/dev2", mode);
    ioasid_fd = open("/dev/ioasid", mode);
    ioctl(device_fd1, VFIO_BIND_IOASID_FD, ioasid_fd);
    ioctl(device_fd2, VFIO_BIND_IOASID_FD, ioasid_fd);

    /* pre-register the virtual address range for accounting */
    mem_info = { .vaddr = 0x40000000; .size = 1GB };
    ioctl(ioasid_fd, IOASID_REGISTER_MEMORY, &mem_info);

    /* Attach dev1 and dev2 to gpa_ioasid */
    gpa_ioasid = ioctl(ioasid_fd, IOASID_ALLOC);
    at_data = { .ioasid = gpa_ioasid };
    ioctl(device_fd1, VFIO_ATTACH_IOASID, &at_data);
    ioctl(device_fd2, VFIO_ATTACH_IOASID, &at_data);

    /* Setup GPA mapping */
    dma_map = {
        .ioasid = gpa_ioasid;
        .iova   = 0;            // GPA
        .vaddr  = 0x40000000;   // HVA
        .size   = 1GB;
    };
    ioctl(ioasid_fd, IOASID_MAP_DMA, &dma_map);

    /* After boot, guest enables a GIOVA space for dev2 */
    giova_ioasid = ioctl(ioasid_fd, IOASID_ALLOC);

    /* First detach dev2 from previous address space */
    at_data = { .ioasid = gpa_ioasid };
    ioctl(device_fd2, VFIO_DETACH_IOASID, &at_data);

    /* Then attach dev2 to the new address space */
    at_data = { .ioasid = giova_ioasid };
    ioctl(device_fd2, VFIO_ATTACH_IOASID, &at_data);

    /* Setup a shadow DMA mapping according to vIOMMU
     * GIOVA (0x2000) -> GPA (0x1000) -> HVA (0x40001000)
     */
    dma_map = {
        .ioasid = giova_ioasid;
        .iova   = 0x2000;       // GIOVA
        .vaddr  = 0x40001000;   // HVA
        .size   = 4KB;
    };
    ioctl(ioasid_fd, IOASID_MAP_DMA, &dma_map);

5.3. IOASID nesting (software)
+++++++++++++++++++++++++

Same usage scenario as 5.2, with software-based IOASID nesting
available. In this mode it is the kernel instead of the user that creates the
shadow mapping.

The flow before the guest boots is the same as in 5.2, except for one point.
Because giova_ioasid is nested on gpa_ioasid, locked page accounting is only
conducted for gpa_ioasid. So it's not necessary to pre-register the virtual
memory.

To save space we only list the steps after boot (i.e. both dev1/dev2
have been attached to gpa_ioasid before the guest boots):

    /* After boot */
    /* Make GIOVA space nested on GPA space */
    giova_ioasid = ioctl(ioasid_fd, IOASID_CREATE_NESTING,
                         gpa_ioasid);

    /* Attach dev2 to the new address space (child)
     * Note dev2 is still attached to gpa_ioasid (parent)
     */
    at_data = { .ioasid = giova_ioasid };
    ioctl(device_fd2, VFIO_ATTACH_IOASID, &at_data);

    /* Setup a GIOVA->GPA mapping for giova_ioasid, which will be
     * merged by the kernel with the GPA->HVA mapping of gpa_ioasid
     * to form a shadow mapping.
     */
    dma_map = {
        .ioasid = giova_ioasid;
        .iova   = 0x2000;       // GIOVA
        .vaddr  = 0x1000;       // GPA
        .size   = 4KB;
    };
    ioctl(ioasid_fd, IOASID_MAP_DMA, &dma_map);

5.4. IOASID nesting (hardware)
+++++++++++++++++++++++++

Same usage scenario as 5.2, with hardware-based IOASID nesting
available. In this mode the pgtable binding protocol is used to
bind the guest IOVA page table to the IOMMU:

    /* After boot */
    /* Make GIOVA space nested on GPA space */
    giova_ioasid = ioctl(ioasid_fd, IOASID_CREATE_NESTING,
                         gpa_ioasid);

    /* Attach dev2 to the new address space (child)
     * Note dev2 is still attached to gpa_ioasid (parent)
     */
    at_data = { .ioasid = giova_ioasid };
    ioctl(device_fd2, VFIO_ATTACH_IOASID, &at_data);

    /* Bind guest I/O page table */
    bind_data = {
        .ioasid = giova_ioasid;
        .addr   = giova_pgtable;
        // and format information
    };
    ioctl(ioasid_fd, IOASID_BIND_PGTABLE, &bind_data);

    /* Invalidate IOTLB when required */
    inv_data = {
        .ioasid = giova_ioasid;
        // granular information
    };
    ioctl(ioasid_fd, IOASID_INVALIDATE_CACHE, &inv_data);

    /* See 5.6 for I/O page fault handling */

5.5. Guest SVA (vSVA)
++++++++++++++++++

After boot the guest further creates a GVA address space (gpasid1) on
dev1. Dev2 is not affected (still attached to giova_ioasid).

As explained in section 4, the user should avoid exposing ENQCMD on both
pdev and mdev.

The sequence applies to all device types (pdev or mdev), except for
one additional step to call KVM for an ENQCMD-capable mdev:

    /* After boot */
    /* Make GVA space nested on GPA space */
    gva_ioasid = ioctl(ioasid_fd, IOASID_CREATE_NESTING,
                       gpa_ioasid);

    /* Attach dev1 to the new address space and specify vPASID */
    at_data = {
        .ioasid     = gva_ioasid;
        .flag       = IOASID_ATTACH_USER_PASID;
        .user_pasid = gpasid1;
    };
    ioctl(device_fd1, VFIO_ATTACH_IOASID, &at_data);

    /* if dev1 is an ENQCMD-capable mdev, update the CPU PASID
     * translation structure through KVM
     */
    pa_data = {
        .ioasid_fd   = ioasid_fd;
        .ioasid      = gva_ioasid;
        .guest_pasid = gpasid1;
    };
    ioctl(kvm_fd, KVM_MAP_PASID, &pa_data);

    /* Bind guest I/O page table */
    bind_data = {
        .ioasid = gva_ioasid;
        .addr   = gva_pgtable1;
        // and format information
    };
    ioctl(ioasid_fd, IOASID_BIND_PGTABLE, &bind_data);

...


5.6. I/O page fault
+++++++++++++++

(uAPI is TBD. Here is just the high-level flow from the host IOMMU driver
to the guest IOMMU driver and back; a rough userspace-side sketch follows
the list below.)

- Host IOMMU driver receives a page request with raw fault_data {rid,
pasid, addr};

- Host IOMMU driver identifies the faulting I/O page table according to
information registered by IOASID fault handler;

- IOASID fault handler is called with raw fault_data (rid, pasid, addr), which
is saved in ioasid_data->fault_data (used for response);

- The IOASID fault handler generates a user fault_data (ioasid, addr), links it
to the shared ring buffer and triggers the eventfd to userspace;

- Upon receiving the event, Qemu needs to find the virtual routing information
(v_rid + v_pasid) of the device attached to the faulting ioasid. If there are
multiple, pick a random one. This should be fine since the purpose is to
fix the I/O page table in the guest;

- Qemu generates a virtual I/O page fault through vIOMMU into guest,
carrying the virtual fault data (v_rid, v_pasid, addr);

- Guest IOMMU driver fixes up the fault, updates the I/O page table, and
then sends a page response with virtual completion data (v_rid, v_pasid,
response_code) to vIOMMU;

- Qemu finds the pending fault event, converts virtual completion data
into (ioasid, response_code), and then calls a /dev/ioasid ioctl to
complete the pending fault;

- /dev/ioasid finds out the pending fault data {rid, pasid, addr} saved in
ioasid_data->fault_data, and then calls iommu api to complete it with
{rid, pasid, response_code};
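
To make the flow concrete, below is a heavily-hedged sketch of the Qemu-side
loop. The ring buffer layout, the completion ioctl and every name in it are
placeholders, since this part of the uAPI is explicitly TBD:

    /* hypothetical userspace fault handling loop (uAPI TBD) */
    while (read(event_fd, &cnt, sizeof(cnt)) > 0) {
        /* fault_ring is the shared ring buffer; entries carry
         * (ioasid, addr) as described above */
        fault = fault_ring_pop(fault_ring);

        /* pick one device attached to the faulting ioasid and
         * derive its virtual routing info (v_rid, v_pasid) */
        vdev = lookup_any_device(fault.ioasid);

        /* inject a virtual I/O page fault through the vIOMMU */
        viommu_inject_fault(vdev.v_rid, vdev.v_pasid, fault.addr);

        /* ... later, when the guest sends the page response ... */
        resp = { .ioasid = fault.ioasid; .response_code = code; };
        ioctl(ioasid_fd, IOASID_PAGE_RESPONSE, &resp);   // ioctl name TBD
    }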

5.7. BIND_PASID_TABLE
++++++++++++++++++++

The PASID table is put in the GPA space on some platforms, thus must be
updated by the guest. It is treated as another user page table to be bound
to the IOMMU.

As explained earlier, the user still needs to explicitly bind every user I/O
page table to the kernel so the same pgtable binding protocol (bind, cache
invalidate and fault handling) is unified across platforms.

vIOMMUs may include a caching mode (or a paravirtualized method) which, once
enabled, requires the guest to invalidate the PASID cache for any change to the
PASID table. This allows Qemu to track the lifespan of guest I/O page tables.

If such a capability is missing, Qemu could enable write-protection on
the guest PASID table to achieve the same effect.

    /* After boot */
    /* Make the vPASID space nested on GPA space */
    pasidtbl_ioasid = ioctl(ioasid_fd, IOASID_CREATE_NESTING,
                            gpa_ioasid);

    /* Attach dev1 to pasidtbl_ioasid */
    at_data = { .ioasid = pasidtbl_ioasid };
    ioctl(device_fd1, VFIO_ATTACH_IOASID, &at_data);

    /* Bind PASID table */
    bind_data = {
        .ioasid = pasidtbl_ioasid;
        .addr   = gpa_pasid_table;
        // and format information
    };
    ioctl(ioasid_fd, IOASID_BIND_PASID_TABLE, &bind_data);

    /* vIOMMU detects a new GVA I/O space created */
    gva_ioasid = ioctl(ioasid_fd, IOASID_CREATE_NESTING,
                       gpa_ioasid);

    /* Attach dev1 to the new address space, with gpasid1 */
    at_data = {
        .ioasid     = gva_ioasid;
        .flag       = IOASID_ATTACH_USER_PASID;
        .user_pasid = gpasid1;
    };
    ioctl(device_fd1, VFIO_ATTACH_IOASID, &at_data);

    /* Bind guest I/O page table. Because BIND_PASID_TABLE has been
     * used, the kernel will not update the PASID table. Instead, it just
     * tracks the bound I/O page table for handling invalidation and
     * I/O page faults.
     */
    bind_data = {
        .ioasid = gva_ioasid;
        .addr   = gva_pgtable1;
        // and format information
    };
    ioctl(ioasid_fd, IOASID_BIND_PGTABLE, &bind_data);

...

Thanks
Kevin


2021-05-28 08:48:42

by Jason Wang

Subject: Re: [RFC] /dev/ioasid uAPI proposal


On 2021/5/27 3:58 PM, Tian, Kevin wrote:
> /dev/ioasid provides an unified interface for managing I/O page tables for
> devices assigned to userspace. Device passthrough frameworks (VFIO, vDPA,
> etc.) are expected to use this interface instead of creating their own logic to
> isolate untrusted device DMAs initiated by userspace.


Not a native speaker but /dev/ioas seems better?


>
> This proposal describes the uAPI of /dev/ioasid and also sample sequences
> with VFIO as example in typical usages. The driver-facing kernel API provided
> by the iommu layer is still TBD, which can be discussed after consensus is
> made on this uAPI.
>
> It's based on a lengthy discussion starting from here:
> https://lore.kernel.org/linux-iommu/[email protected]/
>
> It ends up to be a long writing due to many things to be summarized and
> non-trivial effort required to connect them into a complete proposal.
> Hope it provides a clean base to converge.
>
> TOC
> ====
> 1. Terminologies and Concepts
> 2. uAPI Proposal
> 2.1. /dev/ioasid uAPI
> 2.2. /dev/vfio uAPI
> 2.3. /dev/kvm uAPI
> 3. Sample structures and helper functions
> 4. PASID virtualization
> 5. Use Cases and Flows
> 5.1. A simple example
> 5.2. Multiple IOASIDs (no nesting)
> 5.3. IOASID nesting (software)
> 5.4. IOASID nesting (hardware)
> 5.5. Guest SVA (vSVA)
> 5.6. I/O page fault
> 5.7. BIND_PASID_TABLE
> ====
>
> 1. Terminologies and Concepts
> -----------------------------------------
>
> IOASID FD is the container holding multiple I/O address spaces. User
> manages those address spaces through FD operations. Multiple FD's are
> allowed per process, but with this proposal one FD should be sufficient for
> all intended usages.
>
> IOASID is the FD-local software handle representing an I/O address space.
> Each IOASID is associated with a single I/O page table. IOASIDs can be
> nested together, implying the output address from one I/O page table
> (represented by child IOASID) must be further translated by another I/O
> page table (represented by parent IOASID).
>
> I/O address space can be managed through two protocols, according to
> whether the corresponding I/O page table is constructed by the kernel or
> the user. When kernel-managed, a dma mapping protocol (similar to
> existing VFIO iommu type1) is provided for the user to explicitly specify
> how the I/O address space is mapped. Otherwise, a different protocol is
> provided for the user to bind an user-managed I/O page table to the
> IOMMU, plus necessary commands for iotlb invalidation and I/O fault
> handling.
>
> Pgtable binding protocol can be used only on the child IOASID's, implying
> IOASID nesting must be enabled. This is because the kernel doesn't trust
> userspace. Nesting allows the kernel to enforce its DMA isolation policy
> through the parent IOASID.
>
> IOASID nesting can be implemented in two ways: hardware nesting and
> software nesting. With hardware support the child and parent I/O page
> tables are walked consecutively by the IOMMU to form a nested translation.
> When it's implemented in software, the ioasid driver


Need to explain what "ioasid driver" means.

I guess it's the module that implements the IOASID abstraction:

1) RID
2) RID+PASID
3) others

And if yes, does it allow the device to use a software-specific implementation:

1) swiotlb or
2) device specific IOASID implementation


> is responsible for
> merging the two-level mappings into a single-level shadow I/O page table.
> Software nesting requires both child/parent page tables operated through
> the dma mapping protocol, so any change in either level can be captured
> by the kernel to update the corresponding shadow mapping.
>
> An I/O address space takes effect in the IOMMU only after it is attached
> to a device. The device in the /dev/ioasid context always refers to a
> physical one or 'pdev' (PF or VF).
>
> One I/O address space could be attached to multiple devices. In this case,
> /dev/ioasid uAPI applies to all attached devices under the specified IOASID.
>
> Based on the underlying IOMMU capability one device might be allowed
> to attach to multiple I/O address spaces, with DMAs accessing them by
> carrying different routing information. One of them is the default I/O
> address space routed by PCI Requestor ID (RID) or ARM Stream ID. The
> remaining are routed by RID + Process Address Space ID (PASID) or
> Stream+Substream ID. For simplicity the following context uses RID and
> PASID when talking about the routing information for I/O address spaces.
>
> Device attachment is initiated through passthrough framework uAPI (use
> VFIO for simplicity in following context). VFIO is responsible for identifying
> the routing information and registering it to the ioasid driver when calling
> ioasid attach helper function. It could be RID if the assigned device is
> pdev (PF/VF) or RID+PASID if the device is mediated (mdev). In addition,
> user might also provide its view of virtual routing information (vPASID) in
> the attach call, e.g. when multiple user-managed I/O address spaces are
> attached to the vfio_device. In this case VFIO must figure out whether
> vPASID should be directly used (for pdev) or converted to a kernel-
> allocated one (pPASID, for mdev) for physical routing (see section 4).
>
> Device must be bound to an IOASID FD before attach operation can be
> conducted. This is also through VFIO uAPI. In this proposal one device
> should not be bound to multiple FD's. Not sure about the gain of
> allowing it except adding unnecessary complexity. But if others have
> different view we can further discuss.
>
> VFIO must ensure its device composes DMAs with the routing information
> attached to the IOASID. For pdev it naturally happens since vPASID is
> directly programmed to the device by guest software. For mdev this
> implies any guest operation carrying a vPASID on this device must be
> trapped into VFIO and then converted to pPASID before sent to the
> device. A detail explanation about PASID virtualization policies can be
> found in section 4.
>
> Modern devices may support a scalable workload submission interface
> based on PCI DMWr capability, allowing a single work queue to access
> multiple I/O address spaces. One example is Intel ENQCMD, having
> PASID saved in the CPU MSR and carried in the instruction payload
> when sent out to the device. Then a single work queue shared by
> multiple processes can compose DMAs carrying different PASIDs.
>
> When executing ENQCMD in the guest, the CPU MSR includes a vPASID
> which, if targeting a mdev, must be converted to pPASID before sent
> to the wire. Intel CPU provides a hardware PASID translation capability
> for auto-conversion in the fast path. The user is expected to setup the
> PASID mapping through KVM uAPI, with information about {vpasid,
> ioasid_fd, ioasid}. The ioasid driver provides helper function for KVM
> to figure out the actual pPASID given an IOASID.
>
> With above design /dev/ioasid uAPI is all about I/O address spaces.
> It doesn't include any device routing information, which is only
> indirectly registered to the ioasid driver through VFIO uAPI. For
> example, I/O page fault is always reported to userspace per IOASID,
> although it's physically reported per device (RID+PASID). If there is a
> need of further relaying this fault into the guest, the user is responsible
> of identifying the device attached to this IOASID (randomly pick one if
> multiple attached devices) and then generates a per-device virtual I/O
> page fault into guest. Similarly the iotlb invalidation uAPI describes the
> granularity in the I/O address space (all, or a range), different from the
> underlying IOMMU semantics (domain-wide, PASID-wide, range-based).
>
> I/O page tables routed through PASID are installed in a per-RID PASID
> table structure.


I'm not sure this is true for all archs.


> Some platforms implement the PASID table in the guest
> physical space (GPA), expecting it managed by the guest. The guest
> PASID table is bound to the IOMMU also by attaching to an IOASID,
> representing the per-RID vPASID space.
>
> We propose the host kernel needs to explicitly track guest I/O page
> tables even on these platforms, i.e. the same pgtable binding protocol
> should be used universally on all platforms (with only difference on who
> actually writes the PASID table). One opinion from previous discussion
> was treating this special IOASID as a container for all guest I/O page
> tables i.e. hiding them from the host. However this way significantly
> violates the philosophy in this /dev/ioasid proposal. It is not one IOASID
> one address space any more. Device routing information (indirectly
> marking hidden I/O spaces) has to be carried in iotlb invalidation and
> page faulting uAPI to help connect vIOMMU with the underlying
> pIOMMU. This is one design choice to be confirmed with ARM guys.
>
> Devices may sit behind IOMMU's with incompatible capabilities. The
> difference may lie in the I/O page table format, or availability of an user
> visible uAPI (e.g. hardware nesting). /dev/ioasid is responsible for
> checking the incompatibility between newly-attached device and existing
> devices under the specific IOASID and, if found, returning error to user.
> Upon such error the user should create a new IOASID for the incompatible
> device.
>
> There is no explicit group enforcement in /dev/ioasid uAPI, due to no
> device notation in this interface as aforementioned. But the ioasid driver
> does implicit check to make sure that devices within an iommu group
> must be all attached to the same IOASID before this IOASID starts to
> accept any uAPI command. Otherwise error information is returned to
> the user.
>
> There was a long debate in previous discussion whether VFIO should keep
> explicit container/group semantics in its uAPI. Jason Gunthorpe proposes
> a simplified model where every device bound to VFIO is explicitly listed
> under /dev/vfio thus a device fd can be acquired w/o going through legacy
> container/group interface. In this case the user is responsible for
> understanding the group topology and meeting the implicit group check
> criteria enforced in /dev/ioasid. The use case examples in this proposal
> are based on the new model.
>
> Of course for backward compatibility VFIO still needs to keep the existing
> uAPI and vfio iommu type1 will become a shim layer connecting VFIO
> iommu ops to internal ioasid helper functions.
>
> Notes:
> - It might be confusing as IOASID is also used in the kernel (drivers/
> iommu/ioasid.c) to represent PCI PASID or ARM substream ID. We need
> find a better name later to differentiate.
>
> - PPC has not be considered yet as we haven't got time to fully understand
> its semantics. According to previous discussion there is some generality
> between PPC window-based scheme and VFIO type1 semantics. Let's
> first make consensus on this proposal and then further discuss how to
> extend it to cover PPC's requirement.
>
> - There is a protocol between vfio group and kvm. Needs to think about
> how it will be affected following this proposal.
>
> - mdev in this context refers to mediated subfunctions (e.g. Intel SIOV)
> which can be physically isolated in-between through PASID-granular
> IOMMU protection. Historically people also discussed one usage by
> mediating a pdev into a mdev. This usage is not covered here, and is
> supposed to be replaced by Max's work which allows overriding various
> VFIO operations in vfio-pci driver.
>
> 2. uAPI Proposal
> ----------------------
>
> /dev/ioasid uAPI covers everything about managing I/O address spaces.
>
> /dev/vfio uAPI builds connection between devices and I/O address spaces.
>
> /dev/kvm uAPI is optional required as far as ENQCMD is concerned.
>
>
> 2.1. /dev/ioasid uAPI
> +++++++++++++++++
>
> /*
> * Check whether an uAPI extension is supported.
> *
> * This is for FD-level capabilities, such as locked page pre-registration.
> * IOASID-level capabilities are reported through IOASID_GET_INFO.
> *
> * Return: 0 if not supported, 1 if supported.
> */
> #define IOASID_CHECK_EXTENSION _IO(IOASID_TYPE, IOASID_BASE + 0)
>
>
> /*
> * Register user space memory where DMA is allowed.
> *
> * It pins user pages and does the locked memory accounting so sub-
> * sequent IOASID_MAP/UNMAP_DMA calls get faster.
> *
> * When this ioctl is not used, one user page might be accounted
> * multiple times when it is mapped by multiple IOASIDs which are
> * not nested together.
> *
> * Input parameters:
> * - vaddr;
> * - size;
> *
> * Return: 0 on success, -errno on failure.
> */
> #define IOASID_REGISTER_MEMORY _IO(IOASID_TYPE, IOASID_BASE + 1)
> #define IOASID_UNREGISTER_MEMORY _IO(IOASID_TYPE, IOASID_BASE + 2)
>
>
> /*
> * Allocate an IOASID.
> *
> * IOASID is the FD-local software handle representing an I/O address
> * space. Each IOASID is associated with a single I/O page table. User
> * must call this ioctl to get an IOASID for every I/O address space that is
> * intended to be enabled in the IOMMU.
> *
> * A newly-created IOASID doesn't accept any command before it is
> * attached to a device. Once attached, an empty I/O page table is
> * bound with the IOMMU then the user could use either DMA mapping
> * or pgtable binding commands to manage this I/O page table.
> *
> * Device attachment is initiated through device driver uAPI (e.g. VFIO)
> *
> * Return: allocated ioasid on success, -errno on failure.
> */
> #define IOASID_ALLOC _IO(IOASID_TYPE, IOASID_BASE + 3)
> #define IOASID_FREE _IO(IOASID_TYPE, IOASID_BASE + 4)


I would like to know the reason for such an indirection.

It looks to me like the ioasid fd is sufficient for performing any operation.

Such an allocation only works if an ioasid fd can have multiple IOASIDs,
which seems not to be the case you describe here.


>
>
> /*
> * Get information about an I/O address space
> *
> * Supported capabilities:
> * - VFIO type1 map/unmap;
> * - pgtable/pasid_table binding
> * - hardware nesting vs. software nesting;
> * - ...
> *
> * Related attributes:
> * - supported page sizes, reserved IOVA ranges (DMA mapping);
> * - vendor pgtable formats (pgtable binding);
> * - number of child IOASIDs (nesting);
> * - ...
> *
> * Above information is available only after one or more devices are
> * attached to the specified IOASID. Otherwise the IOASID is just a
> * number w/o any capability or attribute.
> *
> * Input parameters:
> * - u32 ioasid;
> *
> * Output parameters:
> * - many. TBD.
> */
> #define IOASID_GET_INFO _IO(IOASID_TYPE, IOASID_BASE + 5)
>
>
> /*
> * Map/unmap process virtual addresses to I/O virtual addresses.
> *
> * Provide VFIO type1 equivalent semantics. Start with the same
> * restriction e.g. the unmap size should match those used in the
> * original mapping call.
> *
> * If IOASID_REGISTER_MEMORY has been called, the mapped vaddr
> * must be already in the preregistered list.
> *
> * Input parameters:
> * - u32 ioasid;
> * - refer to vfio_iommu_type1_dma_{un}map
> *
> * Return: 0 on success, -errno on failure.
> */
> #define IOASID_MAP_DMA _IO(IOASID_TYPE, IOASID_BASE + 6)
> #define IOASID_UNMAP_DMA _IO(IOASID_TYPE, IOASID_BASE + 7)
>
>
> /*
> * Create a nesting IOASID (child) on an existing IOASID (parent)
> *
> * IOASIDs can be nested together, implying that the output address
> * from one I/O page table (child) must be further translated by
> * another I/O page table (parent).
> *
> * As the child adds essentially another reference to the I/O page table
> * represented by the parent, any device attached to the child ioasid
> * must be already attached to the parent.
> *
> * In concept there is no limit on the number of the nesting levels.
> * However for the majority case one nesting level is sufficient. The
> * user should check whether an IOASID supports nesting through
> * IOASID_GET_INFO. For example, if only one nesting level is allowed,
> * the nesting capability is reported only on the parent instead of the
> * child.
> *
> * User also needs check (via IOASID_GET_INFO) whether the nesting
> * is implemented in hardware or software. If software-based, DMA
> * mapping protocol should be used on the child IOASID. Otherwise,
> * the child should be operated with pgtable binding protocol.
> *
> * Input parameters:
> * - u32 parent_ioasid;
> *
> * Return: child_ioasid on success, -errno on failure;
> */
> #define IOASID_CREATE_NESTING _IO(IOASID_TYPE, IOASID_BASE + 8)
>
>
> /*
> * Bind a user-managed I/O page table to the IOMMU
> *
> * Because user page table is untrusted, IOASID nesting must be enabled
> * for this ioasid so the kernel can enforce its DMA isolation policy
> * through the parent ioasid.
> *
> * Pgtable binding protocol is different from DMA mapping. The latter
> * has the I/O page table constructed by the kernel and updated
> * according to user MAP/UNMAP commands. With pgtable binding the
> * whole page table is created and updated by userspace, thus different
> * set of commands are required (bind, iotlb invalidation, page fault, etc.).
> *
> * Because the page table is directly walked by the IOMMU, the user
> * must use a format compatible with the underlying hardware. It can
> * check the format information through IOASID_GET_INFO.
> *
> * The page table is bound to the IOMMU according to the routing
> * information of each attached device under the specified IOASID. The
> * routing information (RID and optional PASID) is registered when a
> * device is attached to this IOASID through VFIO uAPI.
> *
> * Input parameters:
> * - child_ioasid;
> * - address of the user page table;
> * - formats (vendor, address_width, etc.);
> *
> * Return: 0 on success, -errno on failure.
> */
> #define IOASID_BIND_PGTABLE _IO(IOASID_TYPE, IOASID_BASE + 9)
> #define IOASID_UNBIND_PGTABLE _IO(IOASID_TYPE, IOASID_BASE + 10)
>
>
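
Likewise, a purely illustrative sketch of the bind parameters, matching the
input list above (all names invented here, not proposed):

        struct ioasid_bind_pgtable {
                __u32   argsz;
                __u32   flags;
                __u32   ioasid;         /* must be a child (nested) IOASID */
                __u32   format;         /* vendor pgtable format, per IOASID_GET_INFO */
                __u32   addr_width;     /* input address width of the page table */
                __u32   __reserved;
                __u64   pgtable_addr;   /* root of the user-managed page table */
        };
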
> /*
> * Bind a user-managed PASID table to the IOMMU
> *
> * This is required for platforms which place the PASID table in the GPA space.
> * In this case the specified IOASID represents the per-RID PASID space.
> *
> * Alternatively this may be replaced by IOASID_BIND_PGTABLE plus a
> * special flag to indicate the difference from normal I/O address spaces.
> *
> * The format info of the PASID table is reported in IOASID_GET_INFO.
> *
> * As explained in the design section, user-managed I/O page tables must
> * be explicitly bound to the kernel even on these platforms. It allows
> * the kernel to uniformly manage I/O address spaces across all platforms.
> * Otherwise, the iotlb invalidation and page faulting uAPI must be hacked
> * to carry device routing information to indirectly mark the hidden I/O
> * address spaces.
> *
> * Input parameters:
> * - child_ioasid;
> * - address of PASID table;
> * - formats (vendor, size, etc.);
> *
> * Return: 0 on success, -errno on failure.
> */
> #define IOASID_BIND_PASID_TABLE _IO(IOASID_TYPE, IOASID_BASE + 11)
> #define IOASID_UNBIND_PASID_TABLE _IO(IOASID_TYPE, IOASID_BASE + 12)
>
>
> /*
> * Invalidate IOTLB for an user-managed I/O page table
> *
> * Unlike what's defined in include/uapi/linux/iommu.h, this command
> * doesn't allow the user to specify the cache type and likely supports only
> * two granularities (all, or a specified range) in the I/O address space.
> *
> * Physical IOMMUs have three cache types (iotlb, dev_iotlb and pasid
> * cache). If the IOASID represents an I/O address space, the invalidation
> * always applies to the iotlb (and dev_iotlb if enabled). If the IOASID
> * represents a vPASID space, then this command applies to the PASID
> * cache.
> *
> * Similarly this command doesn't provide IOMMU-like granularity
> * info (domain-wide, pasid-wide, range-based), since it's all about the
> * I/O address space itself. The ioasid driver walks the attached
> * routing information to match the IOMMU semantics under the
> * hood.
> *
> * Input parameters:
> * - child_ioasid;
> * - granularity
> *
> * Return: 0 on success, -errno on failure
> */
> #define IOASID_INVALIDATE_CACHE _IO(IOASID_TYPE, IOASID_BASE + 13)
>
>
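
And a possible shape for the invalidation parameters, given that only the
"all" and "range" granularities are proposed (again only a sketch, names
invented here):

        struct ioasid_cache_invalidate {
                __u32   argsz;
                __u32   flags;          /* e.g. IOASID_INV_ALL or IOASID_INV_RANGE */
                __u32   ioasid;
                __u32   __reserved;
                __u64   iova;           /* used only for range invalidation */
                __u64   size;
        };
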
> /*
> * Page fault report and response
> *
> * This is TBD. Can be added after other parts are cleared up. Likely it
> * will be a ring buffer shared between user/kernel, an eventfd to notify
> * the user and an ioctl to complete the fault.
> *
> * The fault data is per I/O address space, i.e.: IOASID + faulting_addr
> */
>
>
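
Even though the fault uAPI is TBD above, a sketch of what one ring-buffer
entry might carry, matching the "IOASID + faulting_addr" description (the
completion cookie is purely my assumption):

        struct ioasid_fault_event {
                __u32   ioasid;         /* faulting I/O address space */
                __u32   flags;          /* e.g. read/write, TBD */
                __u64   addr;           /* faulting I/O address */
                __u64   cookie;         /* assumption: passed back when completing */
        };
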
> /*
> * Dirty page tracking
> *
> * Track and report memory pages dirtied in I/O address spaces. There
> * is ongoing work by Kunkun Jiang extending the existing VFIO type1 code.
> * It needs to be adapted to /dev/ioasid later.
> */
>
>
> 2.2. /dev/vfio uAPI
> ++++++++++++++++
>
> /*
> * Bind a vfio_device to the specified IOASID fd
> *
> * Multiple vfio devices can be bound to a single ioasid_fd, but a single
> * vfio device should not be bound to multiple ioasid_fd's.
> *
> * Input parameters:
> * - ioasid_fd;
> *
> * Return: 0 on success, -errno on failure.
> */
> #define VFIO_BIND_IOASID_FD _IO(VFIO_TYPE, VFIO_BASE + 22)
> #define VFIO_UNBIND_IOASID_FD _IO(VFIO_TYPE, VFIO_BASE + 23)
>
>
> /*
> * Attach a vfio device to the specified IOASID
> *
> * Multiple vfio devices can be attached to the same IOASID, and vice
> * versa.
> *
> * The user may optionally provide a "virtual PASID" to mark an I/O page
> * table on this vfio device. Whether the virtual PASID is physically used
> * or converted to another kernel-allocated PASID is a policy decision in
> * the vfio device driver.
> *
> * There is no need to specify ioasid_fd in this call due to the assumption
> * of 1:1 connection between vfio device and the bound fd.
> *
> * Input parameter:
> * - ioasid;
> * - flag;
> * - user_pasid (if specified);
> *
> * Return: 0 on success, -errno on failure.
> */
> #define VFIO_ATTACH_IOASID _IO(VFIO_TYPE, VFIO_BASE + 24)
> #define VFIO_DETACH_IOASID _IO(VFIO_TYPE, VFIO_BASE + 25)
>
>
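
Based on the input list above and the at_data usage in section 5, the attach
parameters could look roughly like this (sketch only, not proposed):

        struct vfio_device_attach_ioasid {
                __u32   argsz;
                __u32   flags;          /* e.g. IOASID_ATTACH_USER_PASID */
                __u32   ioasid;
                __u32   user_pasid;     /* vPASID, valid only with the flag above */
        };
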
> 2.3. KVM uAPI
> ++++++++++++
>
> /*
> * Update CPU PASID mapping
> *
> * This is necessary when ENQCMD will be used in the guest while the
> * targeted device doesn't accept the vPASID saved in the CPU MSR.
> *
> * This command allows the user to set/clear the vPASID->pPASID mapping
> * in the CPU, by providing the IOASID (and FD) information representing
> * the I/O address space marked by this vPASID.
> *
> * Input parameters:
> * - user_pasid;
> * - ioasid_fd;
> * - ioasid;
> */
> #define KVM_MAP_PASID _IO(KVMIO, 0xf0)
> #define KVM_UNMAP_PASID _IO(KVMIO, 0xf1)
>
>
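
Correspondingly, a sketch of the KVM_MAP_PASID payload, following the input
list above and the pa_data usage in section 5.5 (names illustrative):

        struct kvm_pasid_mapping {
                __u32   flags;
                __u32   ioasid_fd;      /* /dev/ioasid fd owning the ioasid */
                __u32   ioasid;         /* identifies the I/O address space */
                __u32   guest_pasid;    /* vPASID to be translated by the CPU */
        };
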
> 3. Sample structures and helper functions
> --------------------------------------------------------
>
> Three helper functions are provided to support VFIO_BIND_IOASID_FD:
>
> struct ioasid_ctx *ioasid_ctx_fdget(int fd);
> int ioasid_register_device(struct ioasid_ctx *ctx, struct ioasid_dev *dev);
> int ioasid_unregister_device(struct ioasid_dev *dev);
>
> An ioasid_ctx is created for each fd:
>
> struct ioasid_ctx {
> // a list of allocated IOASID data's
> struct list_head ioasid_list;
> // a list of registered devices
> struct list_head dev_list;
> // a list of pre-registered virtual address ranges
> struct list_head prereg_list;
> };
>
> Each registered device is represented by ioasid_dev:
>
> struct ioasid_dev {
> struct list_head next;
> struct ioasid_ctx *ctx;
> // always be the physical device
> struct device *device;
> struct kref kref;
> };
>
> Because we assume one vfio_device is connected to at most one ioasid_fd,
> the ioasid_dev could be embedded in vfio_device and then linked to
> ioasid_ctx->dev_list when registration succeeds. For mdev the struct
> device should point to the parent device. The PASID marking this
> mdev is specified later, at VFIO_ATTACH_IOASID time.
>
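
A rough sketch of how the vfio device driver might wire these up in its
VFIO_BIND_IOASID_FD handler (the error convention and the vdev fields here
are my assumptions, not part of the proposal):

        /* sketch: VFIO_BIND_IOASID_FD handler in the vfio device driver */
        struct ioasid_ctx *ctx = ioasid_ctx_fdget(bind.ioasid_fd);

        if (IS_ERR(ctx))                /* assuming ERR_PTR on failure */
                return PTR_ERR(ctx);

        vdev->idev.ctx = ctx;           /* ioasid_dev embedded in vfio_device */
        vdev->idev.device = phys_dev;   /* pdev itself, or mdev's parent device */

        return ioasid_register_device(ctx, &vdev->idev);
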
> An ioasid_data is created when IOASID_ALLOC, as the main object
> describing characteristics about an I/O page table:
>
> struct ioasid_data {
> // link to ioasid_ctx->ioasid_list
> struct list_head next;
>
> // the IOASID number
> u32 ioasid;
>
> // the handle to convey iommu operations
> // hold the pgd (TBD until discussing iommu api)
> struct iommu_domain *domain;
>
> // map metadata (vfio type1 semantics)
> struct rb_node dma_list;
>
> // pointer to user-managed pgtable (for nesting case)
> u64 user_pgd;
>
> // link to the parent ioasid (for nesting)
> struct ioasid_data *parent;
>
> // cache the global PASID shared by ENQCMD-capable
> // devices (see below explanation in section 4)
> u32 pasid;
>
> // a list of device attach data (routing information)
> struct list_head attach_data;
>
> // a list of partially-attached devices (group)
> struct list_head partial_devices;
>
> // a list of fault_data reported from the iommu layer
> struct list_head fault_data;
>
> ...
> }
>
> ioasid_data and iommu_domain have overlapping roles as both are
> introduced to represent an I/O address space. It is still a big TBD how
> the two should be correlated or even merged, and whether new iommu
> ops are required to handle RID+PASID explicitly. We leave this open
> for now as this proposal is mainly about uAPI. For simplicity
> the two objects are kept separate in this context, assuming a
> 1:1 connection in-between, with the domain as the place-holder
> representing the 1st class object in the iommu ops.
>
> Two helper functions are provided to support VFIO_ATTACH_IOASID:
>
> struct attach_info {
> u32 ioasid;
> // If valid, the PASID to be used physically
> u32 pasid;
> };
> int ioasid_device_attach(struct ioasid_dev *dev,
> struct attach_info info);
> int ioasid_device_detach(struct ioasid_dev *dev, u32 ioasid);
>
> The pasid parameter is optionally provided based on the policy in the vfio
> device driver. It could be the PASID marking the default I/O address
> space for an mdev, the user-provided PASID marking a user I/O page
> table, or another kernel-allocated PASID backing the user-provided one.
> Please check the next section for a detailed explanation.
>
> A new object is introduced and linked to ioasid_data->attach_data for
> each successful attach operation:
>
> struct ioasid_attach_data {
> struct list_head next;
> struct ioasid_dev *dev;
> u32 pasid;
> }
>
> As explained in the design section, there is no explicit group enforcement
> in the /dev/ioasid uAPI or helper functions. But the ioasid driver does an
> implicit group check - until every device within an iommu group is
> attached to this IOASID, the already-attached devices in this group are
> kept in ioasid_data->partial_devices. The IOASID rejects any command if
> the partial_devices list is not empty.
>
> Then is the last helper function:
> u32 ioasid_get_global_pasid(struct ioasid_ctx *ctx,
> u32 ioasid, bool alloc);
>
> ioasid_get_global_pasid is necessary in scenarios where multiple devices
> want to share the same PASID value on the attached I/O page table (e.g.
> when ENQCMD is enabled, as explained in the next section). We need a
> centralized place (ioasid_data->pasid) to hold this value (allocated when
> first called with alloc=true). The vfio device driver calls this function
> (alloc=true) to get the global PASID for an ioasid before calling
> ioasid_device_attach. KVM also calls this function (alloc=false) to set up
> the PASID translation structure when the user calls KVM_MAP_PASID.
>
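
To make the calling convention above concrete, a sketch of the vfio-side
sequence for an ENQCMD-capable mdev (error handling omitted; variable names
are illustrative):

        /* vfio device driver: attach an ENQCMD-capable mdev to an ioasid */
        u32 ppasid = ioasid_get_global_pasid(ctx, ioasid, true /* alloc */);
        struct attach_info info = {
                .ioasid = ioasid,
                .pasid  = ppasid,       /* shared global pPASID */
        };

        ioasid_device_attach(idev, info);

        /*
         * KVM later calls ioasid_get_global_pasid(ctx, ioasid, false) when
         * the user issues KVM_MAP_PASID, to fill the CPU PASID translation
         * structure with vPASID -> ppasid.
         */
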
> 4. PASID Virtualization
> ------------------------------
>
> When guest SVA (vSVA) is enabled, multiple GVA address spaces are
> created on the assigned vfio device. This leads to the concepts of
> "virtual PASID" (vPASID) vs. "physical PASID" (pPASID). vPASID is assigned
> by the guest to mark a GVA address space while pPASID is the one
> selected by the host and actually routed on the wire.
>
> vPASID is conveyed to the kernel when user calls VFIO_ATTACH_IOASID.
>
> The vfio device driver translates vPASID to pPASID before calling
> ioasid_device_attach, with two factors to be considered:
>
> - Whether vPASID is directly used (vPASID==pPASID) in the wire, or
> should be instead converted to a newly-allocated one (vPASID!=
> pPASID);
>
> - If vPASID!=pPASID, whether pPASID is allocated from the per-RID PASID
> space or a global PASID space (implying sharing pPASID across devices,
> e.g. when supporting Intel ENQCMD which puts PASID in a CPU MSR
> as part of the process context);
>
> The actual policy depends on pdev vs. mdev, and whether ENQCMD is
> supported. There are three possible scenarios:
>
> (Note: /dev/ioasid uAPI is not affected by underlying PASID virtualization
> policies.)
>
> 1) pdev (w/ or w/o ENQCMD): vPASID==pPASID
>
> vPASIDs are directly programmed by the guest to the assigned MMIO
> bar, implying all DMAs out of this device carry the vPASID in the packet
> header. This mandates vPASID==pPASID, sort of delegating the entire
> per-RID PASID space to the guest.
>
> When ENQCMD is enabled, the CPU MSR when running a guest task
> contains a vPASID. In this case the CPU PASID translation capability
> should be disabled so this vPASID in CPU MSR is directly sent to the
> wire.
>
> This ensures consistent vPASID usage on the pdev regardless of whether
> the workload is submitted through an MMIO register or the ENQCMD
> instruction.
>
> 2) mdev: vPASID!=pPASID (per-RID if w/o ENQCMD, otherwise global)
>
> PASIDs are also used by the kernel to mark the default I/O address space
> for an mdev, thus they cannot be delegated to the guest. Instead, the mdev
> driver must allocate a new pPASID for each vPASID (thus vPASID!=
> pPASID) and then use pPASID when attaching this mdev to an ioasid.
>
> The mdev driver needs to cache the PASID mapping so that in the
> mediation path the vPASID programmed by the guest can be converted to
> the pPASID before updating the physical MMIO register. The mapping should
> also be saved in the CPU PASID translation structure (via the KVM uAPI),
> so the vPASID saved in the CPU MSR is auto-translated to the pPASID
> before being sent to the wire when ENQCMD is enabled.
>
> Generally pPASID could be allocated from the per-RID PASID space
> if all mdev's created on the parent device don't support ENQCMD.
>
> However if the parent supports ENQCMD-capable mdevs, pPASIDs
> must be allocated from a global pool because the CPU PASID
> translation structure is per-VM. It implies that when a guest I/O
> page table is attached to two mdevs with a single vPASID (i.e. bound
> to the same guest process), the same pPASID should be used for
> both mdevs even when they belong to different parents. Sharing a
> pPASID across mdevs is achieved by calling the aforementioned
> ioasid_get_global_pasid().
>
> 3) Mix pdev/mdev together
>
> The above policies are per device type and thus are not affected when
> mixing those device types together (when assigned to a single guest).
> However, there is one exception - when both pdev and mdev support ENQCMD.
>
> Remember the two types have conflicting requirements on whether
> CPU PASID translation should be enabled. This capability is per-VM,
> and must be enabled for mdev isolation. When enabled, a pdev will
> receive an mdev pPASID, violating its vPASID expectation.
>
> In a previous thread a PASID range split scheme was discussed to support
> this combination, but we haven't worked out a clean uAPI design yet.
> Therefore in this proposal we decide not to support it, implying the
> user should have some intelligence to avoid such a scenario. It could be
> a TODO task for the future.
>
> In spite of those subtle considerations, the kernel implementation could
> start simple, e.g.:
>
> - v==p for pdev;
> - v!=p and always use a global PASID pool for all mdev's;
>
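
A minimal sketch of that "start simple" policy, only to show where the
decision would live (the vfio_device fields used here are assumptions, not
part of the proposal):

        /* sketch: first-cut vPASID -> pPASID selection in the vfio driver */
        static u32 vfio_select_ppasid(struct vfio_device *vdev, u32 vpasid)
        {
                if (vdev->is_pdev)
                        return vpasid;  /* v == p: PASID space delegated to guest */

                /* mdev: always back the vPASID with a global pPASID */
                return ioasid_get_global_pasid(vdev->ioasid_ctx,
                                               vdev->ioasid, true);
        }
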
> Regardless of the kernel policy, the user policy is unchanged:
>
> - provide vPASID when calling VFIO_ATTACH_IOASID;
> - call the KVM uAPI to set up CPU PASID translation for an ENQCMD-capable mdev;
> - don't expose the ENQCMD capability on both pdev and mdev;
>
> Sample user flow is described in section 5.5.
>
> 5. Use Cases and Flows
> -------------------------------
>
> Here assume VFIO will support a new model where every bound device
> is explicitly listed under /dev/vfio thus a device fd can be acquired w/o
> going through legacy container/group interface. For illustration purpose
> those devices are just called dev[1...N]:
>
> device_fd[1...N] = open("/dev/vfio/devices/dev[1...N]", mode);
>
> As explained earlier, one IOASID fd is sufficient for all intended use cases:
>
> ioasid_fd = open("/dev/ioasid", mode);
>
> For simplicity below examples are all made for the virtualization story.
> They are representative and could be easily adapted to a non-virtualization
> scenario.
>
> Three types of IOASIDs are considered:
>
> gpa_ioasid[1...N]: for GPA address space
> giova_ioasid[1...N]: for guest IOVA address space
> gva_ioasid[1...N]: for guest CPU VA address space
>
> At least one gpa_ioasid must always be created per guest, while the other
> two are relevant as far as vIOMMU is concerned.
>
> Examples here apply to both pdev and mdev, if not explicitly marked out
> (e.g. in section 5.5). The VFIO device driver in the kernel will figure out the
> associated routing information in the attaching operation.
>
> For illustration simplicity, IOASID_CHECK_EXTENSION and IOASID_GET_
> INFO are skipped in these examples.
>
> 5.1. A simple example
> ++++++++++++++++++
>
> Dev1 is assigned to the guest. One gpa_ioasid is created. The GPA address
> space is managed through DMA mapping protocol:
>
> /* Bind device to IOASID fd */
> device_fd = open("/dev/vfio/devices/dev1", mode);
> ioasid_fd = open("/dev/ioasid", mode);
> ioctl(device_fd, VFIO_BIND_IOASID_FD, ioasid_fd);
>
> /* Attach device to IOASID */
> gpa_ioasid = ioctl(ioasid_fd, IOASID_ALLOC);
> at_data = { .ioasid = gpa_ioasid};
> ioctl(device_fd, VFIO_ATTACH_IOASID, &at_data);
>
> /* Setup GPA mapping */
> dma_map = {
> .ioasid = gpa_ioasid;
> .iova = 0; // GPA
> .vaddr = 0x40000000; // HVA
> .size = 1GB;
> };
> ioctl(ioasid_fd, IOASID_DMA_MAP, &dma_map);
>
> If the guest is assigned more than dev1, the user follows the above sequence
> to attach the other devices to the same gpa_ioasid, i.e. sharing the GPA
> address space across all assigned devices.
>
> 5.2. Multiple IOASIDs (no nesting)
> ++++++++++++++++++++++++++++
>
> Dev1 and dev2 are assigned to the guest. vIOMMU is enabled. Initially
> both devices are attached to gpa_ioasid. After boot the guest creates
> a GIOVA address space (giova_ioasid) for dev2, leaving dev1 in pass-
> through mode (gpa_ioasid).
>
> Suppose IOASID nesting is not supported in this case. Qemu needs to
> generate shadow mappings in userspace for giova_ioasid (like how
> VFIO works today).
>
> To avoid duplicated locked page accounting, it's recommended to pre-
> register the virtual address range that will be used for DMA:
>
> device_fd1 = open("/dev/vfio/devices/dev1", mode);
> device_fd2 = open("/dev/vfio/devices/dev2", mode);
> ioasid_fd = open("/dev/ioasid", mode);
> ioctl(device_fd1, VFIO_BIND_IOASID_FD, ioasid_fd);
> ioctl(device_fd2, VFIO_BIND_IOASID_FD, ioasid_fd);
>
> /* pre-register the virtual address range for accounting */
> mem_info = { .vaddr = 0x40000000; .size = 1GB };
> ioctl(ioasid_fd, IOASID_REGISTER_MEMORY, &mem_info);
>
> /* Attach dev1 and dev2 to gpa_ioasid */
> gpa_ioasid = ioctl(ioasid_fd, IOASID_ALLOC);
> at_data = { .ioasid = gpa_ioasid};
> ioctl(device_fd1, VFIO_ATTACH_IOASID, &at_data);
> ioctl(device_fd2, VFIO_ATTACH_IOASID, &at_data);
>
> /* Setup GPA mapping */
> dma_map = {
> .ioasid = gpa_ioasid;
> .iova = 0; // GPA
> .vaddr = 0x40000000; // HVA
> .size = 1GB;
> };
> ioctl(ioasid_fd, IOASID_DMA_MAP, &dma_map);
>
> /* After boot, the guest enables a GIOVA space for dev2 */
> giova_ioasid = ioctl(ioasid_fd, IOASID_ALLOC);
>
> /* First detach dev2 from previous address space */
> at_data = { .ioasid = gpa_ioasid};
> ioctl(device_fd2, VFIO_DETACH_IOASID, &at_data);
>
> /* Then attach dev2 to the new address space */
> at_data = { .ioasid = giova_ioasid};
> ioctl(device_fd2, VFIO_ATTACH_IOASID, &at_data);
>
> /* Setup a shadow DMA mapping according to vIOMMU
> * GIOVA (0x2000) -> GPA (0x1000) -> HVA (0x40001000)
> */
> dma_map = {
> .ioasid = giova_ioasid;
> .iova = 0x2000; // GIOVA
> .vaddr = 0x40001000; // HVA
> .size = 4KB;
> };
> ioctl(ioasid_fd, IOASID_DMA_MAP, &dma_map);
>
> 5.3. IOASID nesting (software)
> +++++++++++++++++++++++++
>
> Same usage scenario as 5.2, with software-based IOASID nesting
> available. In this mode it is the kernel instead of the user that creates
> the shadow mapping.
>
> The flow before the guest boots is the same as 5.2, except for one point.
> Because giova_ioasid is nested on gpa_ioasid, locked-page accounting is
> only conducted for gpa_ioasid, so it's not necessary to pre-register
> virtual memory.
>
> To save space we only list the steps after boot (i.e. both dev1/dev2
> have been attached to gpa_ioasid before the guest boots):
>
> /* After boots */
> /* Make GIOVA space nested on GPA space */
> giova_ioasid = ioctl(ioasid_fd, IOASID_CREATE_NESTING,
> gpa_ioasid);
>
> /* Attach dev2 to the new address space (child)
> * Note dev2 is still attached to gpa_ioasid (parent)
> */
> at_data = { .ioasid = giova_ioasid};
> ioctl(device_fd2, VFIO_ATTACH_IOASID, &at_data);


For vDPA, we need something similar. And in the future, vDPA may allow
multiple IOASIDs to be attached to a single device. It should work with
the current design.


>
> /* Setup a GIOVA->GPA mapping for giova_ioasid, which will be
> * merged by the kernel with GPA->HVA mapping of gpa_ioasid
> * to form a shadow mapping.
> */
> dma_map = {
> .ioasid = giova_ioasid;
> .iova = 0x2000; // GIOVA
> .vaddr = 0x1000; // GPA
> .size = 4KB;
> };
> ioctl(ioasid_fd, IOASID_DMA_MAP, &dma_map);
>
> 5.4. IOASID nesting (hardware)
> +++++++++++++++++++++++++
>
> Same usage scenario as 5.2, with hardware-based IOASID nesting
> available. In this mode the pgtable binding protocol is used to
> bind the guest IOVA page table with the IOMMU:
>
> /* After boots */
> /* Make GIOVA space nested on GPA space */
> giova_ioasid = ioctl(ioasid_fd, IOASID_CREATE_NESTING,
> gpa_ioasid);
>
> /* Attach dev2 to the new address space (child)
> * Note dev2 is still attached to gpa_ioasid (parent)
> */
> at_data = { .ioasid = giova_ioasid};
> ioctl(device_fd2, VFIO_ATTACH_IOASID, &at_data);


I guess VFIO_ATTACH_IOASID will fail if the underlying hardware doesn't
support nesting. Or is there a way to detect the capability beforehand?

I think GET_INFO only works after the ATTACH.


>
> /* Bind guest I/O page table */
> bind_data = {
> .ioasid = giova_ioasid;
> .addr = giova_pgtable;
> // and format information
> };
> ioctl(ioasid_fd, IOASID_BIND_PGTABLE, &bind_data);
>
> /* Invalidate IOTLB when required */
> inv_data = {
> .ioasid = giova_ioasid;
> // granular information
> };
> ioctl(ioasid_fd, IOASID_INVALIDATE_CACHE, &inv_data);
>
> /* See 5.6 for I/O page fault handling */
>
> 5.5. Guest SVA (vSVA)
> ++++++++++++++++++
>
> After boot the guest further creates a GVA address space (gpasid1) on
> dev1. Dev2 is not affected (still attached to giova_ioasid).
>
> As explained in section 4, the user should avoid exposing ENQCMD on both
> pdev and mdev.
>
> The sequence applies to all device types (being pdev or mdev), except
> one additional step to call KVM for ENQCMD-capable mdev:


My understanding is ENQCMD is Intel-specific and not a requirement for
having vSVA.


>
> /* After boots */
> /* Make GVA space nested on GPA space */
> gva_ioasid = ioctl(ioasid_fd, IOASID_CREATE_NESTING,
> gpa_ioasid);
>
> /* Attach dev1 to the new address space and specify vPASID */
> at_data = {
> .ioasid = gva_ioasid;
> .flag = IOASID_ATTACH_USER_PASID;
> .user_pasid = gpasid1;
> };
> ioctl(device_fd1, VFIO_ATTACH_IOASID, &at_data);
>
> /* if dev1 is ENQCMD-capable mdev, update CPU PASID
> * translation structure through KVM
> */
> pa_data = {
> .ioasid_fd = ioasid_fd;
> .ioasid = gva_ioasid;
> .guest_pasid = gpasid1;
> };
> ioctl(kvm_fd, KVM_MAP_PASID, &pa_data);
>
> /* Bind guest I/O page table */
> bind_data = {
> .ioasid = gva_ioasid;
> .addr = gva_pgtable1;
> // and format information
> };
> ioctl(ioasid_fd, IOASID_BIND_PGTABLE, &bind_data);
>
> ...
>
>
> 5.6. I/O page fault
> +++++++++++++++
>
> (uAPI is TBD. This is just about the high-level flow from the host IOMMU driver
> to guest IOMMU driver and backwards).
>
> - Host IOMMU driver receives a page request with raw fault_data {rid,
> pasid, addr};
>
> - Host IOMMU driver identifies the faulting I/O page table according to
> information registered by IOASID fault handler;
>
> - IOASID fault handler is called with raw fault_data (rid, pasid, addr), which
> is saved in ioasid_data->fault_data (used for response);
>
> - IOASID fault handler generates a user fault_data (ioasid, addr), links it
> to the shared ring buffer and triggers the eventfd to userspace;
>
> - Upon receiving the event, Qemu needs to find the virtual routing information
> (v_rid + v_pasid) of the device attached to the faulting ioasid. If there are
> multiple, pick a random one. This should be fine since the purpose is to
> fix the I/O page table on the guest;
>
> - Qemu generates a virtual I/O page fault through vIOMMU into guest,
> carrying the virtual fault data (v_rid, v_pasid, addr);
>
> - Guest IOMMU driver fixes up the fault, updates the I/O page table, and
> then sends a page response with virtual completion data (v_rid, v_pasid,
> response_code) to vIOMMU;
>
> - Qemu finds the pending fault event, converts virtual completion data
> into (ioasid, response_code), and then calls a /dev/ioasid ioctl to
> complete the pending fault;
>
> - /dev/ioasid finds out the pending fault data {rid, pasid, addr} saved in
> ioasid_data->fault_data, and then calls iommu api to complete it with
> {rid, pasid, response_code};
>
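
Since the fault uAPI is TBD, the following is only a sketch of what the Qemu
side of this flow could look like. Every name here (the event/complete
structs, the helpers, IOASID_COMPLETE_FAULT) is a placeholder, not something
the proposal defines:

        /* Qemu sketch: on eventfd notification, drain the shared ring buffer */
        struct ioasid_fault_event evt;
        struct ioasid_fault_complete complete;

        while (ring_pop(fault_ring, &evt)) {
                /* pick a device attached to evt.ioasid to get (v_rid, v_pasid) */
                VFIODevice *vdev = find_attached_device(evt.ioasid);

                /* inject a virtual I/O page fault (v_rid, v_pasid, evt.addr) */
                viommu_inject_fault(vdev, evt.addr);
        }

        /* later, once the guest sends its page response through the vIOMMU */
        complete.ioasid = faulting_ioasid;
        complete.response_code = resp;
        ioctl(ioasid_fd, IOASID_COMPLETE_FAULT, &complete);
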
> 5.7. BIND_PASID_TABLE
> ++++++++++++++++++++
>
> The PASID table is put in the GPA space on some platforms, and thus must be
> updated by the guest. It is treated as another user page table to be bound
> to the IOMMU.
>
> As explained earlier, the user still needs to explicitly bind every user I/O
> page table to the kernel so the same pgtable binding protocol (bind, cache
> invalidate and fault handling) is unified across platforms.
>
> vIOMMUs may include a caching mode (or paravirtualized way) which, once
> enabled, requires the guest to invalidate PASID cache for any change on the
> PASID table. This allows Qemu to track the lifespan of guest I/O page tables.
>
> If such a capability is missing, Qemu could enable write-protection on
> the guest PASID table to achieve the same effect.
>
> /* After boots */
> /* Make vPASID space nested on GPA space */
> pasidtbl_ioasid = ioctl(ioasid_fd, IOASID_CREATE_NESTING,
> gpa_ioasid);
>
> /* Attach dev1 to pasidtbl_ioasid */
> at_data = { .ioasid = pasidtbl_ioasid};
> ioctl(device_fd1, VFIO_ATTACH_IOASID, &at_data);
>
> /* Bind PASID table */
> bind_data = {
> .ioasid = pasidtbl_ioasid;
> .addr = gpa_pasid_table;
> // and format information
> };
> ioctl(ioasid_fd, IOASID_BIND_PASID_TABLE, &bind_data);
>
> /* vIOMMU detects a new GVA I/O space created */
> gva_ioasid = ioctl(ioasid_fd, IOASID_CREATE_NESTING,
> gpa_ioasid);
>
> /* Attach dev1 to the new address space, with gpasid1 */
> at_data = {
> .ioasid = gva_ioasid;
> .flag = IOASID_ATTACH_USER_PASID;
> .user_pasid = gpasid1;
> };
> ioctl(device_fd1, VFIO_ATTACH_IOASID, &at_data);


Do we need VFIO_DETACH_IOASID?

Thanks


>
> /* Bind guest I/O page table. Because BIND_PASID_TABLE has been
> * used, the kernel will not update the PASID table. Instead, just
> * track the bound I/O page table for handling invalidation and
> * I/O page faults.
> */
> bind_data = {
> .ioasid = gva_ioasid;
> .addr = gva_pgtable1;
> // and format information
> };
> ioctl(ioasid_fd, IOASID_BIND_PGTABLE, &bind_data);
>
> ...
>
> Thanks
> Kevin
>

2021-05-28 17:17:44

by Jean-Philippe Brucker

[permalink] [raw]
Subject: Re: [RFC] /dev/ioasid uAPI proposal

On Thu, May 27, 2021 at 07:58:12AM +0000, Tian, Kevin wrote:
> /dev/ioasid provides an unified interface for managing I/O page tables for
> devices assigned to userspace. Device passthrough frameworks (VFIO, vDPA,
> etc.) are expected to use this interface instead of creating their own logic to
> isolate untrusted device DMAs initiated by userspace.
>
> This proposal describes the uAPI of /dev/ioasid and also sample sequences
> with VFIO as example in typical usages. The driver-facing kernel API provided
> by the iommu layer is still TBD, which can be discussed after consensus is
> made on this uAPI.
>
> It's based on a lengthy discussion starting from here:
> https://lore.kernel.org/linux-iommu/[email protected]/
>
> It ends up to be a long writing due to many things to be summarized and
> non-trivial effort required to connect them into a complete proposal.
> Hope it provides a clean base to converge.

Firstly thanks for writing this up and for your patience. I've not read in
detail the second half yet, will take another look later.

> 1. Terminologies and Concepts
> -----------------------------------------
>
> IOASID FD is the container holding multiple I/O address spaces. User
> manages those address spaces through FD operations. Multiple FD's are
> allowed per process, but with this proposal one FD should be sufficient for
> all intended usages.
>
> IOASID is the FD-local software handle representing an I/O address space.
> Each IOASID is associated with a single I/O page table. IOASIDs can be
> nested together, implying the output address from one I/O page table
> (represented by child IOASID) must be further translated by another I/O
> page table (represented by parent IOASID).
>
> I/O address space can be managed through two protocols, according to
> whether the corresponding I/O page table is constructed by the kernel or
> the user. When kernel-managed, a dma mapping protocol (similar to
> existing VFIO iommu type1) is provided for the user to explicitly specify
> how the I/O address space is mapped. Otherwise, a different protocol is
> provided for the user to bind an user-managed I/O page table to the
> IOMMU, plus necessary commands for iotlb invalidation and I/O fault
> handling.
>
> Pgtable binding protocol can be used only on the child IOASID's, implying
> IOASID nesting must be enabled. This is because the kernel doesn't trust
> userspace. Nesting allows the kernel to enforce its DMA isolation policy
> through the parent IOASID.
>
> IOASID nesting can be implemented in two ways: hardware nesting and
> software nesting. With hardware support the child and parent I/O page
> tables are walked consecutively by the IOMMU to form a nested translation.
> When it's implemented in software, the ioasid driver is responsible for
> merging the two-level mappings into a single-level shadow I/O page table.
> Software nesting requires both child/parent page tables operated through
> the dma mapping protocol, so any change in either level can be captured
> by the kernel to update the corresponding shadow mapping.

Is there an advantage to moving software nesting into the kernel?
We could just have the guest do its usual combined map/unmap on the child
fd.

>
> An I/O address space takes effect in the IOMMU only after it is attached
> to a device. The device in the /dev/ioasid context always refers to a
> physical one or 'pdev' (PF or VF).
>
> One I/O address space could be attached to multiple devices. In this case,
> /dev/ioasid uAPI applies to all attached devices under the specified IOASID.
>
> Based on the underlying IOMMU capability one device might be allowed
> to attach to multiple I/O address spaces, with DMAs accessing them by
> carrying different routing information. One of them is the default I/O
> address space routed by PCI Requestor ID (RID) or ARM Stream ID. The
> remaining are routed by RID + Process Address Space ID (PASID) or
> Stream+Substream ID. For simplicity the following context uses RID and
> PASID when talking about the routing information for I/O address spaces.
>
> Device attachment is initiated through passthrough framework uAPI (use
> VFIO for simplicity in following context). VFIO is responsible for identifying
> the routing information and registering it to the ioasid driver when calling
> ioasid attach helper function. It could be RID if the assigned device is
> pdev (PF/VF) or RID+PASID if the device is mediated (mdev). In addition,
> user might also provide its view of virtual routing information (vPASID) in
> the attach call, e.g. when multiple user-managed I/O address spaces are
> attached to the vfio_device. In this case VFIO must figure out whether
> vPASID should be directly used (for pdev) or converted to a kernel-
> allocated one (pPASID, for mdev) for physical routing (see section 4).
>
> Device must be bound to an IOASID FD before attach operation can be
> conducted. This is also through VFIO uAPI. In this proposal one device
> should not be bound to multiple FD's. Not sure about the gain of
> allowing it except adding unnecessary complexity. But if others have
> different view we can further discuss.
>
> VFIO must ensure its device composes DMAs with the routing information
> attached to the IOASID. For pdev it naturally happens since vPASID is
> directly programmed to the device by guest software. For mdev this
> implies any guest operation carrying a vPASID on this device must be
> trapped into VFIO and then converted to pPASID before sent to the
> device. A detail explanation about PASID virtualization policies can be
> found in section 4.
>
> Modern devices may support a scalable workload submission interface
> based on PCI DMWr capability, allowing a single work queue to access
> multiple I/O address spaces. One example is Intel ENQCMD, having
> PASID saved in the CPU MSR and carried in the instruction payload
> when sent out to the device. Then a single work queue shared by
> multiple processes can compose DMAs carrying different PASIDs.
>
> When executing ENQCMD in the guest, the CPU MSR includes a vPASID
> which, if targeting a mdev, must be converted to pPASID before sent
> to the wire. Intel CPU provides a hardware PASID translation capability
> for auto-conversion in the fast path. The user is expected to setup the
> PASID mapping through KVM uAPI, with information about {vpasid,
> ioasid_fd, ioasid}. The ioasid driver provides helper function for KVM
> to figure out the actual pPASID given an IOASID.
>
> With above design /dev/ioasid uAPI is all about I/O address spaces.
> It doesn't include any device routing information, which is only
> indirectly registered to the ioasid driver through VFIO uAPI. For
> example, I/O page fault is always reported to userspace per IOASID,
> although it's physically reported per device (RID+PASID). If there is a
> need of further relaying this fault into the guest, the user is responsible
> of identifying the device attached to this IOASID (randomly pick one if
> multiple attached devices)

We need to report accurate information for faults. If the guest tells
device A to DMA, it shouldn't receive a fault report for device B. This is
important if the guest needs to kill a misbehaving device, or even just
for statistics and debugging. It may also simplify routing the page
response, which has to be fast.

> and then generates a per-device virtual I/O
> page fault into guest. Similarly the iotlb invalidation uAPI describes the
> granularity in the I/O address space (all, or a range), different from the
> underlying IOMMU semantics (domain-wide, PASID-wide, range-based).
>
> I/O page tables routed through PASID are installed in a per-RID PASID
> table structure. Some platforms implement the PASID table in the guest
> physical space (GPA), expecting it managed by the guest. The guest
> PASID table is bound to the IOMMU also by attaching to an IOASID,
> representing the per-RID vPASID space.
>
> We propose the host kernel needs to explicitly track guest I/O page
> tables even on these platforms, i.e. the same pgtable binding protocol
> should be used universally on all platforms (with only difference on who
> actually writes the PASID table).

This adds significant complexity for Arm (and AMD). Userspace will now
need to walk the PASID table, serializing against invalidation. At least
the SMMU has caching mode for PASID tables so there is no need to trap,
but I'd rather avoid this. I really don't want to make virtio-iommu
devices walk PASID tables unless absolutely necessary, they need to stay
simple.

> One opinion from previous discussion
> was treating this special IOASID as a container for all guest I/O page
> tables i.e. hiding them from the host. However this way significantly
> violates the philosophy in this /dev/ioasid proposal.

It does correspond better to the underlying architecture and hardware
implementation, of which userspace is well aware since it has to report
them to the guest and deal with different descriptor formats.

> It is not one IOASID
> one address space any more. Device routing information (indirectly
> marking hidden I/O spaces) has to be carried in iotlb invalidation and
> page faulting uAPI to help connect vIOMMU with the underlying
> pIOMMU.

As above, I think it's essential that we carry device information in fault
reports. In addition to the two previous reasons, on these platforms
userspace will route all faults through the same channel (vIOMMU event
queue) regardless of the PASID, so we do not need them split and tracked
by PASID. Given that IOPF will be a hot path we should make sure there is
no unnecessary indirection.

Regarding the invalidation, I think limiting it to IOASID may work but it
does bother me that we can't directly forward all invalidations received
on the vIOMMU: if the guest sends a device-wide invalidation, do we
iterate over all IOASIDs and issue one ioctl for each? Sure the guest is
probably sending that because of detaching the PASID table, for which the
kernel did perform the invalidation, but we can't just assume that and
ignore the request, there may be a different reason. Iterating is going to
take a lot time, whereas with the current API we can send a single request
and issue a single command to the IOMMU hardware.

Similarly, if the guest sends an ATC invalidation for a whole device (in
the SMMU, that's an ATC_INV without SSID), we'll have to transform that
into multiple IOTLB invalidations? We can't just send it on IOASID #0,
because it may not have been created by the guest.

Maybe we could at least have invalidation requests on the parent fd for
this kind of global case? But I'd much rather avoid the PASID tracking
altogether and keep the existing cache invalidate API, let the pIOMMU
driver decode that stuff.

> This is one design choice to be confirmed with ARM guys.
>
> Devices may sit behind IOMMU's with incompatible capabilities. The
> difference may lie in the I/O page table format, or availability of an user
> visible uAPI (e.g. hardware nesting). /dev/ioasid is responsible for
> checking the incompatibility between newly-attached device and existing
> devices under the specific IOASID and, if found, returning error to user.
> Upon such error the user should create a new IOASID for the incompatible
> device.
>
> There is no explicit group enforcement in /dev/ioasid uAPI, due to no
> device notation in this interface as aforementioned. But the ioasid driver
> does implicit check to make sure that devices within an iommu group
> must be all attached to the same IOASID before this IOASID starts to
> accept any uAPI command. Otherwise error information is returned to
> the user.
>
> There was a long debate in previous discussion whether VFIO should keep
> explicit container/group semantics in its uAPI. Jason Gunthorpe proposes
> a simplified model where every device bound to VFIO is explicitly listed
> under /dev/vfio thus a device fd can be acquired w/o going through legacy
> container/group interface. In this case the user is responsible for
> understanding the group topology and meeting the implicit group check
> criteria enforced in /dev/ioasid. The use case examples in this proposal
> are based on the new model.
>
> Of course for backward compatibility VFIO still needs to keep the existing
> uAPI and vfio iommu type1 will become a shim layer connecting VFIO
> iommu ops to internal ioasid helper functions.
>
> Notes:
> - It might be confusing as IOASID is also used in the kernel (drivers/
> iommu/ioasid.c) to represent PCI PASID or ARM substream ID. We need
> find a better name later to differentiate.

Yes this isn't just about allocating PASIDs anymore. /dev/iommu or
/dev/ioas would make more sense.

>
> - PPC has not be considered yet as we haven't got time to fully understand
> its semantics. According to previous discussion there is some generality
> between PPC window-based scheme and VFIO type1 semantics. Let's
> first make consensus on this proposal and then further discuss how to
> extend it to cover PPC's requirement.
>
> - There is a protocol between vfio group and kvm. Needs to think about
> how it will be affected following this proposal.

(Arm also needs this, obtaining the VMID allocated by KVM and write it to
the SMMU descriptor when installing the PASID table
https://lore.kernel.org/linux-iommu/[email protected]/)

>
> - mdev in this context refers to mediated subfunctions (e.g. Intel SIOV)
> which can be physically isolated in-between through PASID-granular
> IOMMU protection. Historically people also discussed one usage by
> mediating a pdev into a mdev. This usage is not covered here, and is
> supposed to be replaced by Max's work which allows overriding various
> VFIO operations in vfio-pci driver.
>
> 2. uAPI Proposal
> ----------------------
[...]

> /*
> * Get information about an I/O address space
> *
> * Supported capabilities:
> * - VFIO type1 map/unmap;
> * - pgtable/pasid_table binding
> * - hardware nesting vs. software nesting;
> * - ...
> *
> * Related attributes:
> * - supported page sizes, reserved IOVA ranges (DMA mapping);
> * - vendor pgtable formats (pgtable binding);
> * - number of child IOASIDs (nesting);
> * - ...
> *
> * Above information is available only after one or more devices are
> * attached to the specified IOASID. Otherwise the IOASID is just a
> * number w/o any capability or attribute.
> *
> * Input parameters:
> * - u32 ioasid;
> *
> * Output parameters:
> * - many. TBD.

We probably need a capability format similar to PCI and VFIO.

> */
> #define IOASID_GET_INFO _IO(IOASID_TYPE, IOASID_BASE + 5)
[...]

> 2.2. /dev/vfio uAPI
> ++++++++++++++++
>
> /*
> * Bind a vfio_device to the specified IOASID fd
> *
> * Multiple vfio devices can be bound to a single ioasid_fd, but a single
> * vfio device should not be bound to multiple ioasid_fd's.
> *
> * Input parameters:
> * - ioasid_fd;

How about adding a 32-bit "virtual RID" at this point, that the kernel can
provide to userspace during fault reporting?

Thanks,
Jean

> *
> * Return: 0 on success, -errno on failure.
> */
> #define VFIO_BIND_IOASID_FD _IO(VFIO_TYPE, VFIO_BASE + 22)
> #define VFIO_UNBIND_IOASID_FD _IO(VFIO_TYPE, VFIO_BASE + 23)

2021-05-28 17:37:29

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [RFC] /dev/ioasid uAPI proposal

On Thu, May 27, 2021 at 07:58:12AM +0000, Tian, Kevin wrote:

> IOASID nesting can be implemented in two ways: hardware nesting and
> software nesting. With hardware support the child and parent I/O page
> tables are walked consecutively by the IOMMU to form a nested translation.
> When it's implemented in software, the ioasid driver is responsible for
> merging the two-level mappings into a single-level shadow I/O page table.
> Software nesting requires both child/parent page tables operated through
> the dma mapping protocol, so any change in either level can be captured
> by the kernel to update the corresponding shadow mapping.

Why? A SW emulation could do this synchronization during invalidation
processing if invalidation contained an IOVA range.

I think this document would be stronger with some "Rationale"
statements in key places.

> Based on the underlying IOMMU capability one device might be allowed
> to attach to multiple I/O address spaces, with DMAs accessing them by
> carrying different routing information. One of them is the default I/O
> address space routed by PCI Requestor ID (RID) or ARM Stream ID. The
> remaining are routed by RID + Process Address Space ID (PASID) or
> Stream+Substream ID. For simplicity the following context uses RID and
> PASID when talking about the routing information for I/O address spaces.

I wonder if we should just adopt the ARM naming as the API
standard. It is general and doesn't have the SVA connotation that
"Process Address Space ID" carries.

> Device must be bound to an IOASID FD before attach operation can be
> conducted. This is also through VFIO uAPI. In this proposal one device
> should not be bound to multiple FD's. Not sure about the gain of
> allowing it except adding unnecessary complexity. But if others have
> different view we can further discuss.

Unless there is some internal kernel design reason to block it, I
wouldn't go out of my way to prevent it.

> VFIO must ensure its device composes DMAs with the routing information
> attached to the IOASID. For pdev it naturally happens since vPASID is
> directly programmed to the device by guest software. For mdev this
> implies any guest operation carrying a vPASID on this device must be
> trapped into VFIO and then converted to pPASID before sent to the
> device. A detail explanation about PASID virtualization policies can be
> found in section 4.

vPASID and related seems like it needs other IOMMU vendors to take a
very careful look. I'm really glad to see this starting to be spelled
out in such a clear way, as it was hard to see from the patches there
is vendor variation.

> With above design /dev/ioasid uAPI is all about I/O address spaces.
> It doesn't include any device routing information, which is only
> indirectly registered to the ioasid driver through VFIO uAPI. For
> example, I/O page fault is always reported to userspace per IOASID,
> although it's physically reported per device (RID+PASID).

I agree with Jean-Philippe - at the very least erasing this
information needs a major rational - but I don't really see why it
must be erased? The HW reports the originating device, is it just a
matter of labeling the devices attached to the /dev/ioasid FD so it
can be reported to userspace?

> multiple attached devices) and then generates a per-device virtual I/O
> page fault into guest. Similarly the iotlb invalidation uAPI describes the
> granularity in the I/O address space (all, or a range), different from the
> underlying IOMMU semantics (domain-wide, PASID-wide, range-based).

This seems OK though, I can't think of a reason to allow an IOASID to
be left partially invalidated???

> I/O page tables routed through PASID are installed in a per-RID PASID
> table structure. Some platforms implement the PASID table in the guest
> physical space (GPA), expecting it managed by the guest. The guest
> PASID table is bound to the IOMMU also by attaching to an IOASID,
> representing the per-RID vPASID space.
>
> We propose the host kernel needs to explicitly track guest I/O page
> tables even on these platforms, i.e. the same pgtable binding protocol
> should be used universally on all platforms (with only difference on who
> actually writes the PASID table). One opinion from previous discussion
> was treating this special IOASID as a container for all guest I/O page
> tables i.e. hiding them from the host.

> However this way significantly
> violates the philosophy in this /dev/ioasid proposal. It is not one IOASID
> one address space any more. Device routing information (indirectly
> marking hidden I/O spaces) has to be carried in iotlb invalidation and
> page faulting uAPI to help connect vIOMMU with the underlying
> pIOMMU. This is one design choice to be confirmed with ARM guys.

I'm confused by this rationale.

For a vIOMMU that has IO page tables in the guest the basic
choices are:
- Do we have a hypervisor trap to bind the page table or not? (RID
and PASID may differ here)
- Do we have a hypervisor trap to invalidate the page tables or not?

If the first is a hypervisor trap then I agree it makes sense to create a
child IOASID that points to each guest page table and manage it
directly. This should not require walking guest page tables as it is
really just informing the HW where the page table lives. HW will walk
them.

If there are no hypervisor traps (does this exist?) then there is no
way to involve the hypervisor here and the child IOASID should simply
be a pointer to the guest's data structure that describes binding. In
this case that IOASID should claim all PASIDs when bound to a
RID.

Invalidation should be passed up the to the IOMMU driver in terms of
the guest tables information and either the HW or software has to walk
to guest tables to make sense of it.

Events from the IOMMU to userspace should be tagged with the attached
device label and the PASID/substream ID. This means there is no issue
to have an 'all PASID' IOASID.

> Notes:
> - It might be confusing as IOASID is also used in the kernel (drivers/
> iommu/ioasid.c) to represent PCI PASID or ARM substream ID. We need
> find a better name later to differentiate.

+1 on Jean-Philippe's remarks

> - PPC has not be considered yet as we haven't got time to fully understand
> its semantics. According to previous discussion there is some generality
> between PPC window-based scheme and VFIO type1 semantics. Let's
> first make consensus on this proposal and then further discuss how to
> extend it to cover PPC's requirement.

From what I understood PPC is not so bad: nesting IOASIDs covers its
preload feature, and it needs a way to specify/query the IOVA range an
IOASID will cover.

> - There is a protocol between vfio group and kvm. Needs to think about
> how it will be affected following this proposal.

Ugh, I always stop looking when I reach that boundary. Can anyone
summarize what is going on there?

Most likely passing the /dev/ioasid into KVM's FD (or vice versa) is the
right answer. Eg if ARM needs to get the VMID from KVM and set it on the
ioasid then a KVM "ioctl set_arm_vmid(/dev/ioasid)" call is
reasonable. Certainly better than the symbol_get stuff we have right
now.

I will read through the detail below in another email

Jason

2021-05-28 20:00:48

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [RFC] /dev/ioasid uAPI proposal

On Thu, May 27, 2021 at 07:58:12AM +0000, Tian, Kevin wrote:
>
> 5. Use Cases and Flows
>
> Here assume VFIO will support a new model where every bound device
> is explicitly listed under /dev/vfio thus a device fd can be acquired w/o
> going through legacy container/group interface. For illustration purpose
> those devices are just called dev[1...N]:
>
> device_fd[1...N] = open("/dev/vfio/devices/dev[1...N]", mode);
>
> As explained earlier, one IOASID fd is sufficient for all intended use cases:
>
> ioasid_fd = open("/dev/ioasid", mode);
>
> For simplicity below examples are all made for the virtualization story.
> They are representative and could be easily adapted to a non-virtualization
> scenario.

For others, I don't think this is *strictly* necessary; we can
probably still get to the device_fd using the group_fd and fit in
/dev/ioasid. It does make the rest of this more readable though.


> Three types of IOASIDs are considered:
>
> gpa_ioasid[1...N]: for GPA address space
> giova_ioasid[1...N]: for guest IOVA address space
> gva_ioasid[1...N]: for guest CPU VA address space
>
> At least one gpa_ioasid must always be created per guest, while the other
> two are relevant as far as vIOMMU is concerned.
>
> Examples here apply to both pdev and mdev, if not explicitly marked out
> (e.g. in section 5.5). VFIO device driver in the kernel will figure out the
> associated routing information in the attaching operation.
>
> For illustration simplicity, IOASID_CHECK_EXTENSION and IOASID_GET_
> INFO are skipped in these examples.
>
> 5.1. A simple example
> ++++++++++++++++++
>
> Dev1 is assigned to the guest. One gpa_ioasid is created. The GPA address
> space is managed through DMA mapping protocol:
>
> /* Bind device to IOASID fd */
> device_fd = open("/dev/vfio/devices/dev1", mode);
> ioasid_fd = open("/dev/ioasid", mode);
> ioctl(device_fd, VFIO_BIND_IOASID_FD, ioasid_fd);
>
> /* Attach device to IOASID */
> gpa_ioasid = ioctl(ioasid_fd, IOASID_ALLOC);
> at_data = { .ioasid = gpa_ioasid};
> ioctl(device_fd, VFIO_ATTACH_IOASID, &at_data);
>
> /* Setup GPA mapping */
> dma_map = {
> .ioasid = gpa_ioasid;
> .iova = 0; // GPA
> .vaddr = 0x40000000; // HVA
> .size = 1GB;
> };
> ioctl(ioasid_fd, IOASID_DMA_MAP, &dma_map);
>
> If the guest is assigned with more than dev1, user follows above sequence
> to attach other devices to the same gpa_ioasid i.e. sharing the GPA
> address space cross all assigned devices.

eg

device2_fd = open("/dev/vfio/devices/dev1", mode);
ioctl(device2_fd, VFIO_BIND_IOASID_FD, ioasid_fd);
ioctl(device2_fd, VFIO_ATTACH_IOASID, &at_data);

Right?

>
> 5.2. Multiple IOASIDs (no nesting)
> ++++++++++++++++++++++++++++
>
> Dev1 and dev2 are assigned to the guest. vIOMMU is enabled. Initially
> both devices are attached to gpa_ioasid. After boot the guest creates
> an GIOVA address space (giova_ioasid) for dev2, leaving dev1 in pass
> through mode (gpa_ioasid).
>
> Suppose IOASID nesting is not supported in this case. Qemu need to
> generate shadow mappings in userspace for giova_ioasid (like how
> VFIO works today).
>
> To avoid duplicated locked page accounting, it's recommended to pre-
> register the virtual address range that will be used for DMA:
>
> device_fd1 = open("/dev/vfio/devices/dev1", mode);
> device_fd2 = open("/dev/vfio/devices/dev2", mode);
> ioasid_fd = open("/dev/ioasid", mode);
> ioctl(device_fd1, VFIO_BIND_IOASID_FD, ioasid_fd);
> ioctl(device_fd2, VFIO_BIND_IOASID_FD, ioasid_fd);
>
> /* pre-register the virtual address range for accounting */
> mem_info = { .vaddr = 0x40000000; .size = 1GB };
> ioctl(ioasid_fd, IOASID_REGISTER_MEMORY, &mem_info);
>
> /* Attach dev1 and dev2 to gpa_ioasid */
> gpa_ioasid = ioctl(ioasid_fd, IOASID_ALLOC);
> at_data = { .ioasid = gpa_ioasid};
> ioctl(device_fd1, VFIO_ATTACH_IOASID, &at_data);
> ioctl(device_fd2, VFIO_ATTACH_IOASID, &at_data);
>
> /* Setup GPA mapping */
> dma_map = {
> .ioasid = gpa_ioasid;
> .iova = 0; // GPA
> .vaddr = 0x40000000; // HVA
> .size = 1GB;
> };
> ioctl(ioasid_fd, IOASID_DMA_MAP, &dma_map);
>
> /* After boot, guest enables an GIOVA space for dev2 */
> giova_ioasid = ioctl(ioasid_fd, IOASID_ALLOC);
>
> /* First detach dev2 from previous address space */
> at_data = { .ioasid = gpa_ioasid};
> ioctl(device_fd2, VFIO_DETACH_IOASID, &at_data);
>
> /* Then attach dev2 to the new address space */
> at_data = { .ioasid = giova_ioasid};
> ioctl(device_fd2, VFIO_ATTACH_IOASID, &at_data);
>
> /* Setup a shadow DMA mapping according to vIOMMU
> * GIOVA (0x2000) -> GPA (0x1000) -> HVA (0x40001000)
> */

Here "shadow DMA" means relay the guest's vIOMMU page tables to the HW
IOMMU?

> dma_map = {
> .ioasid = giova_ioasid;
> .iova = 0x2000; // GIOVA
> .vaddr = 0x40001000; // HVA

eg HVA came from reading the guest's page tables and finding it wanted
GPA 0x1000 mapped to IOVA 0x2000?


> 5.3. IOASID nesting (software)
> +++++++++++++++++++++++++
>
> Same usage scenario as 5.2, with software-based IOASID nesting
> available. In this mode it is the kernel instead of user to create the
> shadow mapping.
>
> The flow before guest boots is same as 5.2, except one point. Because
> giova_ioasid is nested on gpa_ioasid, locked accounting is only
> conducted for gpa_ioasid. So it's not necessary to pre-register virtual
> memory.
>
> To save space we only list the steps after boots (i.e. both dev1/dev2
> have been attached to gpa_ioasid before guest boots):
>
> /* After boots */
> /* Make GIOVA space nested on GPA space */
> giova_ioasid = ioctl(ioasid_fd, IOASID_CREATE_NESTING,
> gpa_ioasid);
>
> /* Attach dev2 to the new address space (child)
> * Note dev2 is still attached to gpa_ioasid (parent)
> */
> at_data = { .ioasid = giova_ioasid};
> ioctl(device_fd2, VFIO_ATTACH_IOASID, &at_data);
>
> /* Setup a GIOVA->GPA mapping for giova_ioasid, which will be
> * merged by the kernel with GPA->HVA mapping of gpa_ioasid
> * to form a shadow mapping.
> */
> dma_map = {
> .ioasid = giova_ioasid;
> .iova = 0x2000; // GIOVA
> .vaddr = 0x1000; // GPA
> .size = 4KB;
> };
> ioctl(ioasid_fd, IOASID_DMA_MAP, &dma_map);

And in this version the kernel reaches into the parent IOASID's page
tables to translate 0x1000 to 0x40001000 to a physical page? So we
basically remove the qemu process address space entirely from this
translation. It does seem convenient.

> 5.4. IOASID nesting (hardware)
> +++++++++++++++++++++++++
>
> Same usage scenario as 5.2, with hardware-based IOASID nesting
> available. In this mode the pgtable binding protocol is used to
> bind the guest IOVA page table with the IOMMU:
>
> /* After boots */
> /* Make GIOVA space nested on GPA space */
> giova_ioasid = ioctl(ioasid_fd, IOASID_CREATE_NESTING,
> gpa_ioasid);
>
> /* Attach dev2 to the new address space (child)
> * Note dev2 is still attached to gpa_ioasid (parent)
> */
> at_data = { .ioasid = giova_ioasid};
> ioctl(device_fd2, VFIO_ATTACH_IOASID, &at_data);
>
> /* Bind guest I/O page table */
> bind_data = {
> .ioasid = giova_ioasid;
> .addr = giova_pgtable;
> // and format information
> };
> ioctl(ioasid_fd, IOASID_BIND_PGTABLE, &bind_data);

I really think you need to use consistent language. Things that
allocate a new IOASID should be called IOASID_ALLOC_IOASID. If multiple
IOCTLs are needed then it is IOASID_ALLOC_IOASID_PGTABLE, etc.
alloc/create/bind is too confusing.

> 5.5. Guest SVA (vSVA)
> ++++++++++++++++++
>
> After boots the guest further create a GVA address spaces (gpasid1) on
> dev1. Dev2 is not affected (still attached to giova_ioasid).
>
> As explained in section 4, user should avoid expose ENQCMD on both
> pdev and mdev.
>
> The sequence applies to all device types (being pdev or mdev), except
> one additional step to call KVM for ENQCMD-capable mdev:
>
> /* After boots */
> /* Make GVA space nested on GPA space */
> gva_ioasid = ioctl(ioasid_fd, IOASID_CREATE_NESTING,
> gpa_ioasid);
>
> /* Attach dev1 to the new address space and specify vPASID */
> at_data = {
> .ioasid = gva_ioasid;
> .flag = IOASID_ATTACH_USER_PASID;
> .user_pasid = gpasid1;
> };
> ioctl(device_fd1, VFIO_ATTACH_IOASID, &at_data);

Still a little unsure why the vPASID is here and not on the gva_ioasid. Is
there any scenario where we want different vPASIDs for the same
IOASID? I guess it is OK like this. Hum.

> /* if dev1 is ENQCMD-capable mdev, update CPU PASID
> * translation structure through KVM
> */
> pa_data = {
> .ioasid_fd = ioasid_fd;
> .ioasid = gva_ioasid;
> .guest_pasid = gpasid1;
> };
> ioctl(kvm_fd, KVM_MAP_PASID, &pa_data);

Make sense

> /* Bind guest I/O page table */
> bind_data = {
> .ioasid = gva_ioasid;
> .addr = gva_pgtable1;
> // and format information
> };
> ioctl(ioasid_fd, IOASID_BIND_PGTABLE, &bind_data);

Again I do wonder if this should just be part of alloc_ioasid. Is
there any reason to split these things? The only advantage to the
split is that the device is known, but the device shouldn't impact
anything..

> 5.6. I/O page fault
> +++++++++++++++
>
> (uAPI is TBD. Here is just about the high-level flow from host IOMMU driver
> to guest IOMMU driver and backwards).
>
> - Host IOMMU driver receives a page request with raw fault_data {rid,
> pasid, addr};
>
> - Host IOMMU driver identifies the faulting I/O page table according to
> information registered by IOASID fault handler;
>
> - IOASID fault handler is called with raw fault_data (rid, pasid, addr), which
> is saved in ioasid_data->fault_data (used for response);
>
> - IOASID fault handler generates an user fault_data (ioasid, addr), links it
> to the shared ring buffer and triggers eventfd to userspace;

Here rid should be translated to a labeled device, returning the
device label assigned at VFIO_BIND_IOASID_FD time. Depending on how the device
was bound, the label might match a rid or a (rid, pasid) pair.
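
For instance, one entry in the shared fault ring could then carry something
like this (field names invented here, not part of the proposal):

    struct ioasid_fault_event {
        __u32  ioasid;     /* faulting I/O address space */
        __u32  dev_label;  /* label returned by VFIO_BIND_IOASID_FD */
        __u32  pasid;      /* valid only if the device bound with rid+pasid */
        __u32  flags;      /* e.g. whether pasid is valid */
        __u64  addr;       /* faulting I/O address */
    };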

> - Upon received event, Qemu needs to find the virtual routing information
> (v_rid + v_pasid) of the device attached to the faulting ioasid. If there are
> multiple, pick a random one. This should be fine since the purpose is to
> fix the I/O page table on the guest;

The device label should fix this

> - Qemu finds the pending fault event, converts virtual completion data
> into (ioasid, response_code), and then calls a /dev/ioasid ioctl to
> complete the pending fault;
>
> - /dev/ioasid finds out the pending fault data {rid, pasid, addr} saved in
> ioasid_data->fault_data, and then calls iommu api to complete it with
> {rid, pasid, response_code};

So resuming a fault on an ioasid will resume all devices pending on
the fault?

> 5.7. BIND_PASID_TABLE
> ++++++++++++++++++++
>
> PASID table is put in the GPA space on some platform, thus must be updated
> by the guest. It is treated as another user page table to be bound with the
> IOMMU.
>
> As explained earlier, the user still needs to explicitly bind every user I/O
> page table to the kernel so the same pgtable binding protocol (bind, cache
> invalidate and fault handling) is unified cross platforms.
>
> vIOMMUs may include a caching mode (or paravirtualized way) which, once
> enabled, requires the guest to invalidate PASID cache for any change on the
> PASID table. This allows Qemu to track the lifespan of guest I/O page tables.
>
> In case of missing such capability, Qemu could enable write-protection on
> the guest PASID table to achieve the same effect.
>
> /* After boots */
> /* Make vPASID space nested on GPA space */
> pasidtbl_ioasid = ioctl(ioasid_fd, IOASID_CREATE_NESTING,
> gpa_ioasid);
>
> /* Attach dev1 to pasidtbl_ioasid */
> at_data = { .ioasid = pasidtbl_ioasid};
> ioctl(device_fd1, VFIO_ATTACH_IOASID, &at_data);
>
> /* Bind PASID table */
> bind_data = {
> .ioasid = pasidtbl_ioasid;
> .addr = gpa_pasid_table;
> // and format information
> };
> ioctl(ioasid_fd, IOASID_BIND_PASID_TABLE, &bind_data);
>
> /* vIOMMU detects a new GVA I/O space created */
> gva_ioasid = ioctl(ioasid_fd, IOASID_CREATE_NESTING,
> gpa_ioasid);
>
> /* Attach dev1 to the new address space, with gpasid1 */
> at_data = {
> .ioasid = gva_ioasid;
> .flag = IOASID_ATTACH_USER_PASID;
> .user_pasid = gpasid1;
> };
> ioctl(device_fd1, VFIO_ATTACH_IOASID, &at_data);
>
> /* Bind guest I/O page table. Because SET_PASID_TABLE has been
> * used, the kernel will not update the PASID table. Instead, just
> * track the bound I/O page table for handling invalidation and
> * I/O page faults.
> */
> bind_data = {
> .ioasid = gva_ioasid;
> .addr = gva_pgtable1;
> // and format information
> };
> ioctl(ioasid_fd, IOASID_BIND_PGTABLE, &bind_data);

I still don't quite get the benefit from doing this.

The idea to create an all-PASID IOASID seems to work better with less
fuss on HW that is directly parsing the guest's PASID table.

Cache invalidate seems easy enough to support

Fault handling needs to return the (ioasid, device_label, pasid) when
working with this kind of ioasid.

It is true that it does create an additional flow qemu has to
implement, but it does directly mirror the HW.

Jason

2021-05-28 20:04:32

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [RFC] /dev/ioasid uAPI proposal

On Thu, May 27, 2021 at 07:58:12AM +0000, Tian, Kevin wrote:
> /dev/ioasid provides an unified interface for managing I/O page tables for
> devices assigned to userspace. Device passthrough frameworks (VFIO, vDPA,
> etc.) are expected to use this interface instead of creating their own logic to
> isolate untrusted device DMAs initiated by userspace.

It is very long, but I think this has turned out quite well. It
certainly matches the basic sketch I had in my head when we were
talking about how to create vDPA devices a few years ago.

When you get down to the operations they all seem pretty common sense
and straightforward. Create an IOASID. Connect to a device. Fill the
IOASID with pages somehow. Worry about PASID labeling.

It really is critical to get all the vendor IOMMU people to go over it
and see how their HW features map into this.

Thanks,
Jason

2021-05-28 20:18:51

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [RFC] /dev/ioasid uAPI proposal

On Fri, May 28, 2021 at 06:23:07PM +0200, Jean-Philippe Brucker wrote:

> Regarding the invalidation, I think limiting it to IOASID may work but it
> does bother me that we can't directly forward all invalidations received
> on the vIOMMU: if the guest sends a device-wide invalidation, do we
> iterate over all IOASIDs and issue one ioctl for each? Sure the guest is
> probably sending that because of detaching the PASID table, for which the
> kernel did perform the invalidation, but we can't just assume that and
> ignore the request, there may be a different reason. Iterating is going to
> take a lot time, whereas with the current API we can send a single request
> and issue a single command to the IOMMU hardware.

I think the invalidation could stand some improvement, but that also
feels basically incremental to the essence of the proposal.

I agree with the general goal that the uAPI should be able to issue
invalidates that directly map to HW invalidations.

> Similarly, if the guest sends an ATC invalidation for a whole device (in
> the SMMU, that's an ATC_INV without SSID), we'll have to transform that
> into multiple IOTLB invalidations? We can't just send it on IOASID #0,
> because it may not have been created by the guest.

For instance adding device labels allows an invalidate device
operation to exist and the "generic" kernel driver can iterate over
all IOASIDs hooked to the device. Overridable by the IOMMU driver.

> > Notes:
> > - It might be confusing as IOASID is also used in the kernel (drivers/
> > iommu/ioasid.c) to represent PCI PASID or ARM substream ID. We need
> > find a better name later to differentiate.
>
> Yes this isn't just about allocating PASIDs anymore. /dev/iommu or
> /dev/ioas would make more sense.

Either makes sense to me

Using /dev/iommu, with the internal IOASID objects called IOAS (==
iommu_domain), is not bad

> > * Get information about an I/O address space
> > *
> > * Supported capabilities:
> > * - VFIO type1 map/unmap;
> > * - pgtable/pasid_table binding
> > * - hardware nesting vs. software nesting;
> > * - ...
> > *
> > * Related attributes:
> > * - supported page sizes, reserved IOVA ranges (DMA mapping);
> > * - vendor pgtable formats (pgtable binding);
> > * - number of child IOASIDs (nesting);
> > * - ...
> > *
> > * Above information is available only after one or more devices are
> > * attached to the specified IOASID. Otherwise the IOASID is just a
> > * number w/o any capability or attribute.
> > *
> > * Input parameters:
> > * - u32 ioasid;
> > *
> > * Output parameters:
> > * - many. TBD.
>
> We probably need a capability format similar to PCI and VFIO.

Designing this kind of uAPI where it is half HW and half generic is
really tricky to get right. Probably best to take the detailed design
of the IOCTL structs later.

Jason

2021-05-28 20:26:52

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [RFC] /dev/ioasid uAPI proposal

On Fri, May 28, 2021 at 10:24:56AM +0800, Jason Wang wrote:
> > IOASID nesting can be implemented in two ways: hardware nesting and
> > software nesting. With hardware support the child and parent I/O page
> > tables are walked consecutively by the IOMMU to form a nested translation.
> > When it's implemented in software, the ioasid driver
>
> Need to explain what did "ioasid driver" mean.

I think it means "drivers/iommu"

> And if yes, does it allow the device for software specific implementation:
>
> 1) swiotlb or

I think it is necessary to have a 'software page table' which is
required to do all the mdevs we have today.

> 2) device specific IOASID implementation

"drivers/iommu" is pluggable, so I guess it can exist? I've never seen
it done before though

Whether we'd want this to drive an on-device translation table is an
interesting question. I don't have an answer.

> > I/O page tables routed through PASID are installed in a per-RID PASID
> > table structure.
>
> I'm not sure this is true for all archs.

It must be true. For security reasons access to a PASID must be
limited by RID.

RID_A assigned to guest A should not be able to access a PASID being
used by RID_B in guest B. Only a per-RID restriction can accomplish
this.

> I would like to know the reason for such indirection.
>
> It looks to me the ioasid fd is sufficient for performing any operations.
>
> Such allocation only work if as ioas fd can have multiple ioasid which seems
> not the case you describe here.

It is the case, read the examples section. One had 3 interrelated
IOASID objects inside the same FD.

> > 5.3. IOASID nesting (software)
> > +++++++++++++++++++++++++
> >
> > Same usage scenario as 5.2, with software-based IOASID nesting
> > available. In this mode it is the kernel instead of user to create the
> > shadow mapping.
> >
> > The flow before guest boots is same as 5.2, except one point. Because
> > giova_ioasid is nested on gpa_ioasid, locked accounting is only
> > conducted for gpa_ioasid. So it's not necessary to pre-register virtual
> > memory.
> >
> > To save space we only list the steps after boots (i.e. both dev1/dev2
> > have been attached to gpa_ioasid before guest boots):
> >
> > /* After boots */
> > /* Make GIOVA space nested on GPA space */
> > giova_ioasid = ioctl(ioasid_fd, IOASID_CREATE_NESTING,
> > gpa_ioasid);
> >
> > /* Attach dev2 to the new address space (child)
> > * Note dev2 is still attached to gpa_ioasid (parent)
> > */
> > at_data = { .ioasid = giova_ioasid};
> > ioctl(device_fd2, VFIO_ATTACH_IOASID, &at_data);
>
>
> For vDPA, we need something similar. And in the future, vDPA may allow
> multiple ioasid to be attached to a single device. It should work with the
> current design.

What do you imagine multiple IOASIDs being used for in VDPA?

Jason

2021-05-28 23:41:28

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [RFC] /dev/ioasid uAPI proposal

On Thu, May 27, 2021 at 07:58:12AM +0000, Tian, Kevin wrote:

> 2.1. /dev/ioasid uAPI
> +++++++++++++++++
>
> /*
> * Check whether an uAPI extension is supported.
> *
> * This is for FD-level capabilities, such as locked page pre-registration.
> * IOASID-level capabilities are reported through IOASID_GET_INFO.
> *
> * Return: 0 if not supported, 1 if supported.
> */
> #define IOASID_CHECK_EXTENSION _IO(IOASID_TYPE, IOASID_BASE + 0)


> /*
> * Register user space memory where DMA is allowed.
> *
> * It pins user pages and does the locked memory accounting so sub-
> * sequent IOASID_MAP/UNMAP_DMA calls get faster.
> *
> * When this ioctl is not used, one user page might be accounted
> * multiple times when it is mapped by multiple IOASIDs which are
> * not nested together.
> *
> * Input parameters:
> * - vaddr;
> * - size;
> *
> * Return: 0 on success, -errno on failure.
> */
> #define IOASID_REGISTER_MEMORY _IO(IOASID_TYPE, IOASID_BASE + 1)
> #define IOASID_UNREGISTER_MEMORY _IO(IOASID_TYPE, IOASID_BASE + 2)

So VA ranges are pinned and stored in a tree and later references to
those VA ranges by any other IOASID use the pin cached in the tree?

It seems reasonable and is similar to the ioasid parent/child I
suggested for PPC.

IMHO this should be merged with the all SW IOASID that is required for
today's mdev drivers. If this can be done while keeping this uAPI then
great, otherwise I don't think it is so bad to weakly nest a physical
IOASID under a SW one just to optimize page pinning.

Either way this seems like a smart direction

> /*
> * Allocate an IOASID.
> *
> * IOASID is the FD-local software handle representing an I/O address
> * space. Each IOASID is associated with a single I/O page table. User
> * must call this ioctl to get an IOASID for every I/O address space that is
> * intended to be enabled in the IOMMU.
> *
> * A newly-created IOASID doesn't accept any command before it is
> * attached to a device. Once attached, an empty I/O page table is
> * bound with the IOMMU then the user could use either DMA mapping
> * or pgtable binding commands to manage this I/O page table.

Can the IOASID be populated before being attached?

> * Device attachment is initiated through device driver uAPI (e.g. VFIO)
> *
> * Return: allocated ioasid on success, -errno on failure.
> */
> #define IOASID_ALLOC _IO(IOASID_TYPE, IOASID_BASE + 3)
> #define IOASID_FREE _IO(IOASID_TYPE, IOASID_BASE + 4)

I assume alloc will include quite a big structure to satisfy the
various vendor needs?

> /*
> * Get information about an I/O address space
> *
> * Supported capabilities:
> * - VFIO type1 map/unmap;
> * - pgtable/pasid_table binding
> * - hardware nesting vs. software nesting;
> * - ...
> *
> * Related attributes:
> * - supported page sizes, reserved IOVA ranges (DMA mapping);
> * - vendor pgtable formats (pgtable binding);
> * - number of child IOASIDs (nesting);
> * - ...
> *
> * Above information is available only after one or more devices are
> * attached to the specified IOASID. Otherwise the IOASID is just a
> * number w/o any capability or attribute.

This feels wrong to learn most of these attributes of the IOASID after
attaching to a device.

The user should have some idea how it intends to use the IOASID when
it creates it and the rest of the system should match the intention.

For instance if the user is creating a IOASID to cover the guest GPA
with the intention of making children it should indicate this during
alloc.

If the user is intending to point a child IOASID to a guest page table
in a certain descriptor format then it should indicate it during
alloc.

device bind should fail if the device somehow isn't compatible with
the scheme the user is trying to use.
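
As a rough sketch of what expressing that intent at alloc time could look
like (struct, flags and field names are all invented for illustration):

    struct ioasid_alloc_info {
        __u32  flags;
    #define IOASID_ALLOC_PARENT        (1 << 0) /* e.g. a GPA space meant to have children */
    #define IOASID_ALLOC_USER_PGTABLE  (1 << 1) /* a user-managed page table will be bound */
        __u32  parent_ioasid;   /* valid when allocating a child */
        __u32  pgtable_format;  /* descriptor format the user intends to use */
        __u32  addr_width;
    };

    gpa_ioasid = ioctl(ioasid_fd, IOASID_ALLOC, &alloc_info);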

> /*
> * Map/unmap process virtual addresses to I/O virtual addresses.
> *
> * Provide VFIO type1 equivalent semantics. Start with the same
> * restriction e.g. the unmap size should match those used in the
> * original mapping call.
> *
> * If IOASID_REGISTER_MEMORY has been called, the mapped vaddr
> * must be already in the preregistered list.
> *
> * Input parameters:
> * - u32 ioasid;
> * - refer to vfio_iommu_type1_dma_{un}map
> *
> * Return: 0 on success, -errno on failure.
> */
> #define IOASID_MAP_DMA _IO(IOASID_TYPE, IOASID_BASE + 6)
> #define IOASID_UNMAP_DMA _IO(IOASID_TYPE, IOASID_BASE + 7)

What about nested IOASIDs?

> /*
> * Create a nesting IOASID (child) on an existing IOASID (parent)
> *
> * IOASIDs can be nested together, implying that the output address
> * from one I/O page table (child) must be further translated by
> * another I/O page table (parent).
> *
> * As the child adds essentially another reference to the I/O page table
> * represented by the parent, any device attached to the child ioasid
> * must be already attached to the parent.
> *
> * In concept there is no limit on the number of the nesting levels.
> * However for the majority case one nesting level is sufficient. The
> * user should check whether an IOASID supports nesting through
> * IOASID_GET_INFO. For example, if only one nesting level is allowed,
> * the nesting capability is reported only on the parent instead of the
> * child.
> *
> * User also needs check (via IOASID_GET_INFO) whether the nesting
> * is implemented in hardware or software. If software-based, DMA
> * mapping protocol should be used on the child IOASID. Otherwise,
> * the child should be operated with pgtable binding protocol.
> *
> * Input parameters:
> * - u32 parent_ioasid;
> *
> * Return: child_ioasid on success, -errno on failure;
> */
> #define IOASID_CREATE_NESTING _IO(IOASID_TYPE, IOASID_BASE + 8)

Do you think another ioctl is best? Should this just be another
parameter to alloc?

> /*
> * Bind an user-managed I/O page table with the IOMMU
> *
> * Because user page table is untrusted, IOASID nesting must be enabled
> * for this ioasid so the kernel can enforce its DMA isolation policy
> * through the parent ioasid.
> *
> * Pgtable binding protocol is different from DMA mapping. The latter
> * has the I/O page table constructed by the kernel and updated
> * according to user MAP/UNMAP commands. With pgtable binding the
> * whole page table is created and updated by userspace, thus different
> * set of commands are required (bind, iotlb invalidation, page fault, etc.).
> *
> * Because the page table is directly walked by the IOMMU, the user
> * must use a format compatible to the underlying hardware. It can
> * check the format information through IOASID_GET_INFO.
> *
> * The page table is bound to the IOMMU according to the routing
> * information of each attached device under the specified IOASID. The
> * routing information (RID and optional PASID) is registered when a
> * device is attached to this IOASID through VFIO uAPI.
> *
> * Input parameters:
> * - child_ioasid;
> * - address of the user page table;
> * - formats (vendor, address_width, etc.);
> *
> * Return: 0 on success, -errno on failure.
> */
> #define IOASID_BIND_PGTABLE _IO(IOASID_TYPE, IOASID_BASE + 9)
> #define IOASID_UNBIND_PGTABLE _IO(IOASID_TYPE, IOASID_BASE + 10)

Also feels backwards, why wouldn't we specify this, and the required
page table format, during alloc time?

> /*
> * Bind an user-managed PASID table to the IOMMU
> *
> * This is required for platforms which place PASID table in the GPA space.
> * In this case the specified IOASID represents the per-RID PASID space.
> *
> * Alternatively this may be replaced by IOASID_BIND_PGTABLE plus a
> * special flag to indicate the difference from normal I/O address spaces.
> *
> * The format info of the PASID table is reported in IOASID_GET_INFO.
> *
> * As explained in the design section, user-managed I/O page tables must
> * be explicitly bound to the kernel even on these platforms. It allows
> * the kernel to uniformly manage I/O address spaces cross all platforms.
> * Otherwise, the iotlb invalidation and page faulting uAPI must be hacked
> * to carry device routing information to indirectly mark the hidden I/O
> * address spaces.
> *
> * Input parameters:
> * - child_ioasid;
> * - address of PASID table;
> * - formats (vendor, size, etc.);
> *
> * Return: 0 on success, -errno on failure.
> */
> #define IOASID_BIND_PASID_TABLE _IO(IOASID_TYPE, IOASID_BASE + 11)
> #define IOASID_UNBIND_PASID_TABLE _IO(IOASID_TYPE, IOASID_BASE + 12)

Ditto

>
> /*
> * Invalidate IOTLB for an user-managed I/O page table
> *
> * Unlike what's defined in include/uapi/linux/iommu.h, this command
> * doesn't allow the user to specify cache type and likely support only
> * two granularities (all, or a specified range) in the I/O address space.
> *
> * Physical IOMMU have three cache types (iotlb, dev_iotlb and pasid
> * cache). If the IOASID represents an I/O address space, the invalidation
> * always applies to the iotlb (and dev_iotlb if enabled). If the IOASID
> * represents a vPASID space, then this command applies to the PASID
> * cache.
> *
> * Similarly this command doesn't provide IOMMU-like granularity
> * info (domain-wide, pasid-wide, range-based), since it's all about the
> * I/O address space itself. The ioasid driver walks the attached
> * routing information to match the IOMMU semantics under the
> * hood.
> *
> * Input parameters:
> * - child_ioasid;
> * - granularity
> *
> * Return: 0 on success, -errno on failure
> */
> #define IOASID_INVALIDATE_CACHE _IO(IOASID_TYPE, IOASID_BASE + 13)

This should have an IOVA range too?
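
i.e. something along these lines (IOASID_INV_RANGE/IOASID_INV_ALL are
made-up names):

    inv_data = {
        .ioasid = giova_ioasid,
        .granularity = IOASID_INV_RANGE,  /* or IOASID_INV_ALL */
        .iova = 0x2000,
        .size = 0x1000,
    };
    ioctl(ioasid_fd, IOASID_INVALIDATE_CACHE, &inv_data);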

> /*
> * Page fault report and response
> *
> * This is TBD. Can be added after other parts are cleared up. Likely it
> * will be a ring buffer shared between user/kernel, an eventfd to notify
> * the user and an ioctl to complete the fault.
> *
> * The fault data is per I/O address space, i.e.: IOASID + faulting_addr
> */

Any reason not to just use read()?

>
> 2.2. /dev/vfio uAPI
> ++++++++++++++++

To be clear you mean the 'struct vfio_device' API, these are not
IOCTLs on the container or group?

> /*
> * Bind a vfio_device to the specified IOASID fd
> *
> * Multiple vfio devices can be bound to a single ioasid_fd, but a single
> * vfio device should not be bound to multiple ioasid_fd's.
> *
> * Input parameters:
> * - ioasid_fd;
> *
> * Return: 0 on success, -errno on failure.
> */
> #define VFIO_BIND_IOASID_FD _IO(VFIO_TYPE, VFIO_BASE + 22)
> #define VFIO_UNBIND_IOASID_FD _IO(VFIO_TYPE, VFIO_BASE + 23)

This is where it would make sense to have an output "device id" that
allows /dev/ioasid to refer to this "device" by number in events and
other related things.
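
e.g. something like this, with the label handed back to userspace (names
are illustrative only):

    struct vfio_bind_ioasid_fd {
        __u32  argsz;
        __u32  flags;
        __s32  ioasid_fd;  /* in: the /dev/ioasid FD */
        __u32  dev_label;  /* out: label /dev/ioasid uses for this device */
    };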

>
> 2.3. KVM uAPI
> ++++++++++++
>
> /*
> * Update CPU PASID mapping
> *
> * This is necessary when ENQCMD will be used in the guest while the
> * targeted device doesn't accept the vPASID saved in the CPU MSR.
> *
> * This command allows user to set/clear the vPASID->pPASID mapping
> * in the CPU, by providing the IOASID (and FD) information representing
> * the I/O address space marked by this vPASID.
> *
> * Input parameters:
> * - user_pasid;
> * - ioasid_fd;
> * - ioasid;
> */
> #define KVM_MAP_PASID _IO(KVMIO, 0xf0)
> #define KVM_UNMAP_PASID _IO(KVMIO, 0xf1)

It seems simple enough.. So the physical PASID can only be assigned if
the user has an IOASID that points at it? Thus it is secure?

> 3. Sample structures and helper functions
>
> Three helper functions are provided to support VFIO_BIND_IOASID_FD:
>
> struct ioasid_ctx *ioasid_ctx_fdget(int fd);
> int ioasid_register_device(struct ioasid_ctx *ctx, struct ioasid_dev *dev);
> int ioasid_unregister_device(struct ioasid_dev *dev);
>
> An ioasid_ctx is created for each fd:
>
> struct ioasid_ctx {
> // a list of allocated IOASID data's
> struct list_head ioasid_list;

Would expect an xarray

> // a list of registered devices
> struct list_head dev_list;

xarray of device_id

> // a list of pre-registered virtual address ranges
> struct list_head prereg_list;

Should re-use the existing SW IOASID table, and be an interval tree.
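
Putting those three together, a rough sketch of the container could be
(not from the proposal, just to show the shape):

    struct ioasid_ctx {
        struct xarray          ioasid_xa;     /* IOASID number -> struct ioasid_data */
        struct xarray          dev_xa;        /* device label -> struct ioasid_dev */
        struct rb_root_cached  prereg_itree;  /* interval tree of pre-registered VA ranges */
        struct mutex           lock;
    };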

> Each registered device is represented by ioasid_dev:
>
> struct ioasid_dev {
> struct list_head next;
> struct ioasid_ctx *ctx;
> // always be the physical device
> struct device *device;
> struct kref kref;
> };
>
> Because we assume one vfio_device connected to at most one ioasid_fd,
> here ioasid_dev could be embedded in vfio_device and then linked to
> ioasid_ctx->dev_list when registration succeeds. For mdev the struct
> device should be the pointer to the parent device. PASID marking this
> mdev is specified later when VFIO_ATTACH_IOASID.

Don't embed a struct like this in something with vfio_device - that
just makes a mess of reference counting by having multiple krefs in
the same memory block. Keep it as a pointer; the attach operation
should return a pointer to the above struct.

> An ioasid_data is created when IOASID_ALLOC, as the main object
> describing characteristics about an I/O page table:
>
> struct ioasid_data {
> // link to ioasid_ctx->ioasid_list
> struct list_head next;
>
> // the IOASID number
> u32 ioasid;
>
> // the handle to convey iommu operations
> // hold the pgd (TBD until discussing iommu api)
> struct iommu_domain *domain;

But at least for the first coding draft I would expect to see this API
presented with no PASID support and a simple 1:1 with iommu_domain. How
PASID gets modeled is the big TBD, right?

> ioasid_data and iommu_domain have overlapping roles as both are
> introduced to represent an I/O address space. It is still a big TBD how
> the two should be corelated or even merged, and whether new iommu
> ops are required to handle RID+PASID explicitly.

I think it is OK that the uapi and kernel api have different
structs. The uapi focused one should hold the uapi related data, which
is what you've shown here, I think.

> Two helper functions are provided to support VFIO_ATTACH_IOASID:
>
> struct attach_info {
> u32 ioasid;
> // If valid, the PASID to be used physically
> u32 pasid;
> };
> int ioasid_device_attach(struct ioasid_dev *dev,
> struct attach_info info);
> int ioasid_device_detach(struct ioasid_dev *dev, u32 ioasid);

Honestly, I still prefer this to be highly explicit as this is where
all device driver authors get involved:

ioasid_pci_device_attach(struct pci_device *pdev, struct ioasid_dev *dev, u32 ioasid);
ioasid_pci_device_pasid_attach(struct pci_device *pdev, u32 *physical_pasid, struct ioasid_dev *dev, u32 ioasid);

And presumably a variant for ARM non-PCI platform (?) devices.

This could boil down to a __ioasid_device_attach() as you've shown.

> A new object is introduced and linked to ioasid_data->attach_data for
> each successful attach operation:
>
> struct ioasid_attach_data {
> struct list_head next;
> struct ioasid_dev *dev;
> u32 pasid;
> }

This should be returned as a pointer and detach should be:

int ioasid_device_detach(struct ioasid_attach_data *);

> As explained in the design section, there is no explicit group enforcement
> in /dev/ioasid uAPI or helper functions. But the ioasid driver does
> implicit group check - before every device within an iommu group is
> attached to this IOASID, the previously-attached devices in this group are
> put in ioasid_data->partial_devices. The IOASID rejects any command if
> the partial_devices list is not empty.

It is simple enough. It would be good to design in a diagnostic string so
userspace can make sense of the failure, e.g. return something like
-EDEADLK and provide a 'why did EDEADLK happen' ioctl?


> Then is the last helper function:
> u32 ioasid_get_global_pasid(struct ioasid_ctx *ctx,
> u32 ioasid, bool alloc);
>
> ioasid_get_global_pasid is necessary in scenarios where multiple devices
> want to share a same PASID value on the attached I/O page table (e.g.
> when ENQCMD is enabled, as explained in next section). We need a
> centralized place (ioasid_data->pasid) to hold this value (allocated when
> first called with alloc=true). vfio device driver calls this function (alloc=
> true) to get the global PASID for an ioasid before calling ioasid_device_
> attach. KVM also calls this function (alloc=false) to setup PASID translation
> structure when user calls KVM_MAP_PASID.

When/why would the VFIO driver do this? Isn't this just some variant
of pasid_attach?

ioasid_pci_device_enqcmd_attach(struct pci_device *pdev, u32 *physical_pasid, struct ioasid_dev *dev, u32 ioasid);

?

> 4. PASID Virtualization
>
> When guest SVA (vSVA) is enabled, multiple GVA address spaces are
> created on the assigned vfio device. This leads to the concepts of
> "virtual PASID" (vPASID) vs. "physical PASID" (pPASID). vPASID is assigned
> by the guest to mark an GVA address space while pPASID is the one
> selected by the host and actually routed in the wire.
>
> vPASID is conveyed to the kernel when user calls VFIO_ATTACH_IOASID.

Should the vPASID be programmed into the IOASID before calling
VFIO_ATTACH_IOASID?

> vfio device driver translates vPASID to pPASID before calling ioasid_attach_
> device, with two factors to be considered:
>
> - Whether vPASID is directly used (vPASID==pPASID) in the wire, or
> should be instead converted to a newly-allocated one (vPASID!=
> pPASID);
>
> - If vPASID!=pPASID, whether pPASID is allocated from per-RID PASID
> space or a global PASID space (implying sharing pPASID cross devices,
> e.g. when supporting Intel ENQCMD which puts PASID in a CPU MSR
> as part of the process context);

This whole section 4 is really confusing. I think it would be more
understandable to focus on the list below and minimize the vPASID discussion.

> The actual policy depends on pdev vs. mdev, and whether ENQCMD is
> supported. There are three possible scenarios:
>
> (Note: /dev/ioasid uAPI is not affected by underlying PASID virtualization
> policies.)

This has become unclear. I think this should start by identifying the
6 main types of devices and how they can use pPASID/vPASID:

0) Device is a RID and cannot issue PASID
1) Device is a mdev and cannot issue PASID
2) Device is a mdev and programs a single fixed PASID during bind,
does not accept PASID from the guest

3) Device accepts any PASIDs from the guest. No
vPASID/pPASID translation is possible. (classic vfio_pci)
4) Device accepts any PASID from the guest and has an
internal vPASID/pPASID translation (enhanced vfio_pci)
5) Device accepts any PASID from the guest and relies on
external vPASID/pPASID translation via ENQCMD (Intel SIOV mdev)

0-2 don't use vPASID at all

3-5 consume a vPASID but handle it differently.

I think the 3-5 map into what you are trying to explain in the table
below, which is the rules for allocating the vPASID depending on which
of device types 3-5 are present and/or mixed.

For instance device type 3 requires vPASID == pPASID because it can't
do translation at all.

This probably all needs to come through clearly in the /dev/ioasid
interface. Once the attached devices are labeled it would make sense to
have a 'query device' /dev/ioasid IOCTL to report the details based on
how the device was attached and other information.
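
For instance, the 'query device' result could label the modes along these
lines (all names here are invented for illustration):

    enum ioasid_dev_pasid_mode {
        IOASID_DEV_NO_PASID = 0,        /* types 0-2: no guest-visible PASID */
        IOASID_DEV_PASID_PASSTHROUGH,   /* type 3: vPASID == pPASID required */
        IOASID_DEV_PASID_DEV_XLATE,     /* type 4: device translates vPASID->pPASID */
        IOASID_DEV_PASID_ENQCMD_XLATE,  /* type 5: relies on CPU/ENQCMD translation */
    };

    struct ioasid_device_info {
        __u32  dev_label;
        __u32  pasid_mode;  /* enum ioasid_dev_pasid_mode */
    };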

> 2) mdev: vPASID!=pPASID (per-RID if w/o ENQCMD, otherwise global)
>
> PASIDs are also used by kernel to mark the default I/O address space
> for mdev, thus cannot be delegated to the guest. Instead, the mdev
> driver must allocate a new pPASID for each vPASID (thus vPASID!=
> pPASID) and then use pPASID when attaching this mdev to an ioasid.

I don't understand this at all.. What does "PASIDs are also used by
the kernel" mean?

> The mdev driver needs cache the PASID mapping so in mediation
> path vPASID programmed by the guest can be converted to pPASID
> before updating the physical MMIO register.

This is my scenario #4 above. The device can internally virtualize
vPASID/pPASID - how that is done is up to the device. But this is all
just labels, when such a device attaches, it should use some specific
API:

ioasid_pci_device_vpasid_attach(struct pci_device *pdev,
u32 *physical_pasid, u32 *virtual_pasid, struct ioasid_dev *dev, u32 ioasid);

And then maintain its internal translation

> In previous thread a PASID range split scheme was discussed to support
> this combination, but we haven't worked out a clean uAPI design yet.
> Therefore in this proposal we decide to not support it, implying the
> user should have some intelligence to avoid such scenario. It could be
> a TODO task for future.

It really just boils down to how to allocate the PASIDs to get around
the bad viommu interface that assumes all PASIDs are usable by all
devices.

> In spite of those subtle considerations, the kernel implementation could
> start simple, e.g.:
>
> - v==p for pdev;
> - v!=p and always use a global PASID pool for all mdev's;

Regardless, all this mess needs to be hidden from the consuming drivers
with some simple APIs as above. The driver should indicate what its HW
can do and the PASID #'s that magically come out of /dev/ioasid should
be appropriate.

Will resume on another email..

Jason

2021-05-31 11:35:23

by Liu Yi L

[permalink] [raw]
Subject: Re: [RFC] /dev/ioasid uAPI proposal

On Fri, 28 May 2021 20:36:49 -0300, Jason Gunthorpe wrote:

> On Thu, May 27, 2021 at 07:58:12AM +0000, Tian, Kevin wrote:
>
> > 2.1. /dev/ioasid uAPI
> > +++++++++++++++++
> >
> > /*
> > * Check whether an uAPI extension is supported.
> > *
> > * This is for FD-level capabilities, such as locked page pre-registration.
> > * IOASID-level capabilities are reported through IOASID_GET_INFO.
> > *
> > * Return: 0 if not supported, 1 if supported.
> > */
> > #define IOASID_CHECK_EXTENSION _IO(IOASID_TYPE, IOASID_BASE + 0)
>
>
> > /*
> > * Register user space memory where DMA is allowed.
> > *
> > * It pins user pages and does the locked memory accounting so sub-
> > * sequent IOASID_MAP/UNMAP_DMA calls get faster.
> > *
> > * When this ioctl is not used, one user page might be accounted
> > * multiple times when it is mapped by multiple IOASIDs which are
> > * not nested together.
> > *
> > * Input parameters:
> > * - vaddr;
> > * - size;
> > *
> > * Return: 0 on success, -errno on failure.
> > */
> > #define IOASID_REGISTER_MEMORY _IO(IOASID_TYPE, IOASID_BASE + 1)
> > #define IOASID_UNREGISTER_MEMORY _IO(IOASID_TYPE, IOASID_BASE + 2)
>
> So VA ranges are pinned and stored in a tree and later references to
> those VA ranges by any other IOASID use the pin cached in the tree?
>
> It seems reasonable and is similar to the ioasid parent/child I
> suggested for PPC.
>
> IMHO this should be merged with the all SW IOASID that is required for
> today's mdev drivers. If this can be done while keeping this uAPI then
> great, otherwise I don't think it is so bad to weakly nest a physical
> IOASID under a SW one just to optimize page pinning.
>
> Either way this seems like a smart direction
>
> > /*
> > * Allocate an IOASID.
> > *
> > * IOASID is the FD-local software handle representing an I/O address
> > * space. Each IOASID is associated with a single I/O page table. User
> > * must call this ioctl to get an IOASID for every I/O address space that is
> > * intended to be enabled in the IOMMU.
> > *
> > * A newly-created IOASID doesn't accept any command before it is
> > * attached to a device. Once attached, an empty I/O page table is
> > * bound with the IOMMU then the user could use either DMA mapping
> > * or pgtable binding commands to manage this I/O page table.
>
> Can the IOASID be populated before being attached?

perhaps a MAP/UNMAP operation on a gpa_ioasid?

>
> > * Device attachment is initiated through device driver uAPI (e.g. VFIO)
> > *
> > * Return: allocated ioasid on success, -errno on failure.
> > */
> > #define IOASID_ALLOC _IO(IOASID_TYPE, IOASID_BASE + 3)
> > #define IOASID_FREE _IO(IOASID_TYPE, IOASID_BASE + 4)
>
> I assume alloc will include quite a big structure to satisfy the
> various vendor needs?
>
>
> > /*
> > * Get information about an I/O address space
> > *
> > * Supported capabilities:
> > * - VFIO type1 map/unmap;
> > * - pgtable/pasid_table binding
> > * - hardware nesting vs. software nesting;
> > * - ...
> > *
> > * Related attributes:
> > * - supported page sizes, reserved IOVA ranges (DMA mapping);
> > * - vendor pgtable formats (pgtable binding);
> > * - number of child IOASIDs (nesting);
> > * - ...
> > *
> > * Above information is available only after one or more devices are
> > * attached to the specified IOASID. Otherwise the IOASID is just a
> > * number w/o any capability or attribute.
>
> This feels wrong to learn most of these attributes of the IOASID after
> attaching to a device.

but an IOASID is just a software handle before it is attached to a specific
device. E.g. before attaching to a device, we have no idea about the
supported page sizes of the underlying iommu, cache coherency, etc.

> The user should have some idea how it intends to use the IOASID when
> it creates it and the rest of the system should match the intention.
>
> For instance if the user is creating a IOASID to cover the guest GPA
> with the intention of making children it should indicate this during
> alloc.
>
> If the user is intending to point a child IOASID to a guest page table
> in a certain descriptor format then it should indicate it during
> alloc.

Actually, we have only two kinds of IOASIDs so far. One is used as parent
and the other as child. For the child, this proposal defines IOASID_CREATE_NESTING.
But yeah, I think it is doable to indicate the type in ALLOC. For a
child IOASID one more step is then required to configure its parent IOASID,
or such info may be included in the ioctl input as well.

> device bind should fail if the device somehow isn't compatible with
> the scheme the user is trying to use.

yeah, I guess you mean to fail the device attach when the IOASID is a
nesting IOASID but the device is behind an iommu without nesting support.
right?

>
> > /*
> > * Map/unmap process virtual addresses to I/O virtual addresses.
> > *
> > * Provide VFIO type1 equivalent semantics. Start with the same
> > * restriction e.g. the unmap size should match those used in the
> > * original mapping call.
> > *
> > * If IOASID_REGISTER_MEMORY has been called, the mapped vaddr
> > * must be already in the preregistered list.
> > *
> > * Input parameters:
> > * - u32 ioasid;
> > * - refer to vfio_iommu_type1_dma_{un}map
> > *
> > * Return: 0 on success, -errno on failure.
> > */
> > #define IOASID_MAP_DMA _IO(IOASID_TYPE, IOASID_BASE + 6)
> > #define IOASID_UNMAP_DMA _IO(IOASID_TYPE, IOASID_BASE + 7)
>
> What about nested IOASIDs?

at first glance, it looks like we should prevent MAP/UNMAP usage on
nested IOASIDs. At least hardware nested translation only allows MAP/UNMAP
on the parent IOASIDs and page table bind on nested IOASIDs. But considering
software nesting, it still seems useful to allow MAP/UNMAP usage
on nested IOASIDs. This is how I understand it; what is your opinion
on it? Do you think it's better to allow MAP/UNMAP usage only on parent
IOASIDs as a start?

>
> > /*
> > * Create a nesting IOASID (child) on an existing IOASID (parent)
> > *
> > * IOASIDs can be nested together, implying that the output address
> > * from one I/O page table (child) must be further translated by
> > * another I/O page table (parent).
> > *
> > * As the child adds essentially another reference to the I/O page table
> > * represented by the parent, any device attached to the child ioasid
> > * must be already attached to the parent.
> > *
> > * In concept there is no limit on the number of the nesting levels.
> > * However for the majority case one nesting level is sufficient. The
> > * user should check whether an IOASID supports nesting through
> > * IOASID_GET_INFO. For example, if only one nesting level is allowed,
> > * the nesting capability is reported only on the parent instead of the
> > * child.
> > *
> > * User also needs check (via IOASID_GET_INFO) whether the nesting
> > * is implemented in hardware or software. If software-based, DMA
> > * mapping protocol should be used on the child IOASID. Otherwise,
> > * the child should be operated with pgtable binding protocol.
> > *
> > * Input parameters:
> > * - u32 parent_ioasid;
> > *
> > * Return: child_ioasid on success, -errno on failure;
> > */
> > #define IOASID_CREATE_NESTING _IO(IOASID_TYPE, IOASID_BASE + 8)
>
> Do you think another ioctl is best? Should this just be another
> parameter to alloc?

Either is fine. This ioctl follows one of your previous comments.

https://lore.kernel.org/linux-iommu/[email protected]/

>
> > /*
> > * Bind an user-managed I/O page table with the IOMMU
> > *
> > * Because user page table is untrusted, IOASID nesting must be enabled
> > * for this ioasid so the kernel can enforce its DMA isolation policy
> > * through the parent ioasid.
> > *
> > * Pgtable binding protocol is different from DMA mapping. The latter
> > * has the I/O page table constructed by the kernel and updated
> > * according to user MAP/UNMAP commands. With pgtable binding the
> > * whole page table is created and updated by userspace, thus different
> > * set of commands are required (bind, iotlb invalidation, page fault, etc.).
> > *
> > * Because the page table is directly walked by the IOMMU, the user
> > * must use a format compatible to the underlying hardware. It can
> > * check the format information through IOASID_GET_INFO.
> > *
> > * The page table is bound to the IOMMU according to the routing
> > * information of each attached device under the specified IOASID. The
> > * routing information (RID and optional PASID) is registered when a
> > * device is attached to this IOASID through VFIO uAPI.
> > *
> > * Input parameters:
> > * - child_ioasid;
> > * - address of the user page table;
> > * - formats (vendor, address_width, etc.);
> > *
> > * Return: 0 on success, -errno on failure.
> > */
> > #define IOASID_BIND_PGTABLE _IO(IOASID_TYPE, IOASID_BASE + 9)
> > #define IOASID_UNBIND_PGTABLE _IO(IOASID_TYPE, IOASID_BASE + 10)
>
> Also feels backwards, why wouldn't we specify this, and the required
> page table format, during alloc time?

here the model is that user-space gets the page table format from the kernel and
decides if it can proceed. So what you are suggesting is that user-space should
tell the kernel the page table format it has in ALLOC and the kernel should fail
the ALLOC if the user-space page table format is not compatible with the underlying
iommu?

>
> > /*
> > * Bind an user-managed PASID table to the IOMMU
> > *
> > * This is required for platforms which place PASID table in the GPA space.
> > * In this case the specified IOASID represents the per-RID PASID space.
> > *
> > * Alternatively this may be replaced by IOASID_BIND_PGTABLE plus a
> > * special flag to indicate the difference from normal I/O address spaces.
> > *
> > * The format info of the PASID table is reported in IOASID_GET_INFO.
> > *
> > * As explained in the design section, user-managed I/O page tables must
> > * be explicitly bound to the kernel even on these platforms. It allows
> > * the kernel to uniformly manage I/O address spaces cross all platforms.
> > * Otherwise, the iotlb invalidation and page faulting uAPI must be hacked
> > * to carry device routing information to indirectly mark the hidden I/O
> > * address spaces.
> > *
> > * Input parameters:
> > * - child_ioasid;
> > * - address of PASID table;
> > * - formats (vendor, size, etc.);
> > *
> > * Return: 0 on success, -errno on failure.
> > */
> > #define IOASID_BIND_PASID_TABLE _IO(IOASID_TYPE, IOASID_BASE + 11)
> > #define IOASID_UNBIND_PASID_TABLE _IO(IOASID_TYPE, IOASID_BASE + 12)
>
> Ditto
>
> >
> > /*
> > * Invalidate IOTLB for an user-managed I/O page table
> > *
> > * Unlike what's defined in include/uapi/linux/iommu.h, this command
> > * doesn't allow the user to specify cache type and likely support only
> > * two granularities (all, or a specified range) in the I/O address space.
> > *
> > * Physical IOMMU have three cache types (iotlb, dev_iotlb and pasid
> > * cache). If the IOASID represents an I/O address space, the invalidation
> > * always applies to the iotlb (and dev_iotlb if enabled). If the IOASID
> > * represents a vPASID space, then this command applies to the PASID
> > * cache.
> > *
> > * Similarly this command doesn't provide IOMMU-like granularity
> > * info (domain-wide, pasid-wide, range-based), since it's all about the
> > * I/O address space itself. The ioasid driver walks the attached
> > * routing information to match the IOMMU semantics under the
> > * hood.
> > *
> > * Input parameters:
> > * - child_ioasid;
> > * - granularity
> > *
> > * Return: 0 on success, -errno on failure
> > */
> > #define IOASID_INVALIDATE_CACHE _IO(IOASID_TYPE, IOASID_BASE + 13)
>
> This should have an IOVA range too?
>
> > /*
> > * Page fault report and response
> > *
> > * This is TBD. Can be added after other parts are cleared up. Likely it
> > * will be a ring buffer shared between user/kernel, an eventfd to notify
> > * the user and an ioctl to complete the fault.
> > *
> > * The fault data is per I/O address space, i.e.: IOASID + faulting_addr
> > */
>
> Any reason not to just use read()?

a ring buffer may be mmap'ed to user-space, so reading fault data from the kernel
would be faster. This is also how Eric's fault reporting works today.

https://lore.kernel.org/linux-iommu/[email protected]/
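
For reference, a mmap'ed ring could be laid out roughly like below (purely
illustrative, not the layout used in that series):

    struct ioasid_fault_ring_header {
        __u32  head;        /* producer index, updated by the kernel */
        __u32  tail;        /* consumer index, updated by userspace */
        __u32  entry_size;
        __u32  nr_entries;
        /* followed by nr_entries fixed-size fault records */
    };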

> >
> > 2.2. /dev/vfio uAPI
> > ++++++++++++++++
>
> To be clear you mean the 'struct vfio_device' API, these are not
> IOCTLs on the container or group?
>
> > /*
> > * Bind a vfio_device to the specified IOASID fd
> > *
> > * Multiple vfio devices can be bound to a single ioasid_fd, but a single
> > * vfio device should not be bound to multiple ioasid_fd's.
> > *
> > * Input parameters:
> > * - ioasid_fd;
> > *
> > * Return: 0 on success, -errno on failure.
> > */
> > #define VFIO_BIND_IOASID_FD _IO(VFIO_TYPE, VFIO_BASE + 22)
> > #define VFIO_UNBIND_IOASID_FD _IO(VFIO_TYPE, VFIO_BASE + 23)
>
> This is where it would make sense to have an output "device id" that
> allows /dev/ioasid to refer to this "device" by number in events and
> other related things.

perhaps this is the device info Jean Philippe wants in page fault reporting
path?

--
Regards,
Yi Liu

2021-05-31 17:38:46

by Parav Pandit

[permalink] [raw]
Subject: RE: [RFC] /dev/ioasid uAPI proposal



> From: Tian, Kevin <[email protected]>
> Sent: Thursday, May 27, 2021 1:28 PM
> /dev/ioasid provides an unified interface for managing I/O page tables for
> devices assigned to userspace. Device passthrough frameworks (VFIO, vDPA,
> etc.) are expected to use this interface instead of creating their own logic to
> isolate untrusted device DMAs initiated by userspace.
>
> This proposal describes the uAPI of /dev/ioasid and also sample sequences
> with VFIO as example in typical usages. The driver-facing kernel API provided
> by the iommu layer is still TBD, which can be discussed after consensus is
> made on this uAPI.
>
> It's based on a lengthy discussion starting from here:
> https://lore.kernel.org/linux-
> iommu/[email protected]/
>
> It ends up to be a long writing due to many things to be summarized and
> non-trivial effort required to connect them into a complete proposal.
> Hope it provides a clean base to converge.

Thanks for the detailed RFC. Digesting it...

[..]
> 2.1. /dev/ioasid uAPI
> +++++++++++++++++
> /*
> * Register user space memory where DMA is allowed.
> *
> * It pins user pages and does the locked memory accounting so sub-
> * sequent IOASID_MAP/UNMAP_DMA calls get faster.
> *
> * When this ioctl is not used, one user page might be accounted
> * multiple times when it is mapped by multiple IOASIDs which are
> * not nested together.
> *
> * Input parameters:
> * - vaddr;
> * - size;
> *
> * Return: 0 on success, -errno on failure.
> */
It appears that this is only to make the map ioctl faster, apart from accounting.
It doesn't have any ioasid handle input either.

In that case, can it be a new system call? Why does it have to be under /dev/ioasid?
For example, a few years back such a system call, mpin(), was proposed in [1].

Or would a new MAP_PINNED flag be a better approach, achieving this in a single mmap() call?

> #define IOASID_REGISTER_MEMORY _IO(IOASID_TYPE, IOASID_BASE + 1)
> #define IOASID_UNREGISTER_MEMORY _IO(IOASID_TYPE,
> IOASID_BASE + 2)

[1] https://lwn.net/Articles/600502/

2021-05-31 18:11:20

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [RFC] /dev/ioasid uAPI proposal

On Mon, May 31, 2021 at 07:31:57PM +0800, Liu Yi L wrote:
> > > /*
> > > * Get information about an I/O address space
> > > *
> > > * Supported capabilities:
> > > * - VFIO type1 map/unmap;
> > > * - pgtable/pasid_table binding
> > > * - hardware nesting vs. software nesting;
> > > * - ...
> > > *
> > > * Related attributes:
> > > * - supported page sizes, reserved IOVA ranges (DMA mapping);
> > > * - vendor pgtable formats (pgtable binding);
> > > * - number of child IOASIDs (nesting);
> > > * - ...
> > > *
> > > * Above information is available only after one or more devices are
> > > * attached to the specified IOASID. Otherwise the IOASID is just a
> > > * number w/o any capability or attribute.
> >
> > This feels wrong to learn most of these attributes of the IOASID after
> > attaching to a device.
>
> but an IOASID is just a software handle before it is attached to a specific
> device. E.g. before attaching to a device, we have no idea about the
> supported page sizes of the underlying iommu, cache coherency, etc.

The idea is you attach the device to the /dev/ioasid FD and this
action is what crystallizes the iommu driver that is being used:

device_fd = open("/dev/vfio/devices/dev1", mode);
ioasid_fd = open("/dev/ioasid", mode);
ioctl(device_fd, VFIO_BIND_IOASID_FD, ioasid_fd);

After this sequence we should have most of the information about the
IOMMU.

One /dev/ioasid FD has one iommu driver. Design what an "iommu driver"
means so that the system should only have one. E.g. the coherent/not
coherent distinction should not be a different "iommu driver".

Device attach to the _IOASID_ is a different thing, and I think it
puts the whole sequence out of order because we lose the option to
customize the IOASID before it has to be realized into HW format.

> > The user should have some idea how it intends to use the IOASID when
> > it creates it and the rest of the system should match the intention.
> >
> > For instance if the user is creating a IOASID to cover the guest GPA
> > with the intention of making children it should indicate this during
> > alloc.
> >
> > If the user is intending to point a child IOASID to a guest page table
> > in a certain descriptor format then it should indicate it during
> > alloc.
>
> Actually, we have only two kinds of IOASIDs so far.

Maybe at a very, very high level, but it looks like there is a lot of
IOMMU-specific configuration that goes into an IOASID.


> > device bind should fail if the device somehow isn't compatible with
> > the scheme the user is trying to use.
>
> yeah, I guess you mean to fail the device attach when the IOASID is a
> nesting IOASID but the device is behind an iommu without nesting support.
> right?

Right..

> >
> > > /*
> > > * Map/unmap process virtual addresses to I/O virtual addresses.
> > > *
> > > * Provide VFIO type1 equivalent semantics. Start with the same
> > > * restriction e.g. the unmap size should match those used in the
> > > * original mapping call.
> > > *
> > > * If IOASID_REGISTER_MEMORY has been called, the mapped vaddr
> > > * must be already in the preregistered list.
> > > *
> > > * Input parameters:
> > > * - u32 ioasid;
> > > * - refer to vfio_iommu_type1_dma_{un}map
> > > *
> > > * Return: 0 on success, -errno on failure.
> > > */
> > > #define IOASID_MAP_DMA _IO(IOASID_TYPE, IOASID_BASE + 6)
> > > #define IOASID_UNMAP_DMA _IO(IOASID_TYPE, IOASID_BASE + 7)
> >
> > What about nested IOASIDs?
>
> at first glance, it looks like we should prevent MAP/UNMAP usage on
> nested IOASIDs. At least hardware nested translation only allows MAP/UNMAP
> on the parent IOASIDs and page table bind on nested IOASIDs. But considering
> software nesting, it still seems useful to allow MAP/UNMAP usage
> on nested IOASIDs. This is how I understand it; what is your opinion
> on it? Do you think it's better to allow MAP/UNMAP usage only on parent
> IOASIDs as a start?

If the only form of nested IOASID is the "read the page table from
my process memory" then MAP/UNMAP won't make sense on that..

MAP/UNMAP is only useful if the page table is stored in kernel memory.

> > > #define IOASID_CREATE_NESTING _IO(IOASID_TYPE, IOASID_BASE + 8)
> >
> > Do you think another ioctl is best? Should this just be another
> > parameter to alloc?
>
> Either is fine. This ioctl follows one of your previous comments.

Sometimes I say things in a way that is meant to make concepts easier to
understand, not necessarily good API design :)

> > > #define IOASID_BIND_PGTABLE _IO(IOASID_TYPE, IOASID_BASE + 9)
> > > #define IOASID_UNBIND_PGTABLE _IO(IOASID_TYPE, IOASID_BASE + 10)
> >
> > Also feels backwards, why wouldn't we specify this, and the required
> > page table format, during alloc time?
>
> here the model is that user-space gets the page table format from the kernel and
> decides if it can proceed. So what you are suggesting is that user-space should
> tell the kernel the page table format it has in ALLOC and the kernel should fail
> the ALLOC if the user-space page table format is not compatible with the underlying
> iommu?

Yes, the action should be
Alloc an IOASID that points at a page table in this user memory,
that is stored in this specific format.

The supported formats should be discoverable after VFIO_BIND_IOASID_FD
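
i.e. roughly like below; the type/format names are made up here and a real
format id would come from FD-level discovery:

    alloc = {
        .type = IOASID_TYPE_NESTED_PGTABLE,
        .parent_ioasid = gpa_ioasid,
        .pgtable_addr = giova_pgtable,        /* page table in user memory */
        .pgtable_format = IOASID_FMT_VTD_S1,  /* discovered format id */
    };
    giova_ioasid = ioctl(ioasid_fd, IOASID_ALLOC, &alloc);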

> > > /*
> > > * Page fault report and response
> > > *
> > > * This is TBD. Can be added after other parts are cleared up. Likely it
> > > * will be a ring buffer shared between user/kernel, an eventfd to notify
> > > * the user and an ioctl to complete the fault.
> > > *
> > > * The fault data is per I/O address space, i.e.: IOASID + faulting_addr
> > > */
> >
> > Any reason not to just use read()?
>
> a ring buffer may be mmap'ed to user-space, so reading fault data from the kernel
> would be faster. This is also how Eric's fault reporting works today.

Okay, if it is performance sensitive.. mmap rings are just tricky beasts

> > > * Bind a vfio_device to the specified IOASID fd
> > > *
> > > * Multiple vfio devices can be bound to a single ioasid_fd, but a single
> > > * vfio device should not be bound to multiple ioasid_fd's.
> > > *
> > > * Input parameters:
> > > * - ioasid_fd;
> > > *
> > > * Return: 0 on success, -errno on failure.
> > > */
> > > #define VFIO_BIND_IOASID_FD _IO(VFIO_TYPE, VFIO_BASE + 22)
> > > #define VFIO_UNBIND_IOASID_FD _IO(VFIO_TYPE, VFIO_BASE + 23)
> >
> > This is where it would make sense to have an output "device id" that
> > allows /dev/ioasid to refer to this "device" by number in events and
> > other related things.
>
> perhaps this is the device info Jean Philippe wants in page fault reporting
> path?

Yes, it is

Jason

2021-05-31 18:15:41

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [RFC] /dev/ioasid uAPI proposal

On Mon, May 31, 2021 at 05:37:35PM +0000, Parav Pandit wrote:

> In that case, can it be a new system call? Why does it have to be under /dev/ioasid?
> For example, a few years back such a system call, mpin(), was proposed in [1].

Reference counting of the overall pins is required

So when a pinned page is incorporated into an IOASID page table in a
later IOCTL it means it cannot be unpinned while the IOASID page table
is using it.

This is some trick to organize the pinning into groups and then
refcount each group, thus avoiding the need for per-page refcounts.

The data structure would be an interval tree of pins in general

The ioasid itself would have an interval tree of its own mappings;
each entry in this tree would reference count against an element in
the above tree

Then the ioasid's interval tree would be mapped into a page table tree
in HW format.

The redundant storage is needed to keep track of the referencing and
the CPU page table values for later unpinning.
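
A rough data-structure sketch of that scheme (names are mine, not from the
proposal):

    /* one pre-registered, pinned VA range; lives in the per-FD interval tree */
    struct ioasid_pin_group {
        u64           vaddr;
        u64           size;
        struct page   **pages;  /* kept for later unpin */
        refcount_t    users;    /* one count per IOASID mapping using it */
        struct rb_node node;
    };

    /* one IOVA->VA entry in an IOASID's own interval tree */
    struct ioasid_mapping {
        u64                      iova;
        u64                      vaddr;
        u64                      size;
        struct ioasid_pin_group  *pins;  /* refcounted reference into the pin tree */
        struct rb_node           node;
    };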

Jason

2021-06-01 01:31:13

by Lu Baolu

[permalink] [raw]
Subject: Re: [RFC] /dev/ioasid uAPI proposal

On 5/31/21 7:31 PM, Liu Yi L wrote:
> On Fri, 28 May 2021 20:36:49 -0300, Jason Gunthorpe wrote:
>
>> On Thu, May 27, 2021 at 07:58:12AM +0000, Tian, Kevin wrote:
>>
>>> 2.1. /dev/ioasid uAPI
>>> +++++++++++++++++

[---cut for short---]

>>> /*
>>> * Allocate an IOASID.
>>> *
>>> * IOASID is the FD-local software handle representing an I/O address
>>> * space. Each IOASID is associated with a single I/O page table. User
>>> * must call this ioctl to get an IOASID for every I/O address space that is
>>> * intended to be enabled in the IOMMU.
>>> *
>>> * A newly-created IOASID doesn't accept any command before it is
>>> * attached to a device. Once attached, an empty I/O page table is
>>> * bound with the IOMMU then the user could use either DMA mapping
>>> * or pgtable binding commands to manage this I/O page table.
>> Can the IOASID be populated before being attached?
> perhaps a MAP/UNMAP operation on a gpa_ioasid?
>

But before attaching to any device, there's no connection between an
IOASID and the underlying IOMMU. How do you know the supported page
sizes and cache coherency?

The iommu_group restriction is implicitly expressed this way: only after all
devices belonging to an iommu_group are attached can the operations on the
page table be performed.

Best regards,
baolu

2021-06-01 02:38:11

by Jason Wang

[permalink] [raw]
Subject: Re: [RFC] /dev/ioasid uAPI proposal


On 2021/5/31 4:41 PM, Liu Yi L wrote:
>> I guess VFIO_ATTACH_IOASID will fail if the underlayer doesn't support
>> hardware nesting. Or is there way to detect the capability before?
> I think it could fail in the IOASID_CREATE_NESTING. If the gpa_ioasid
> is not able to support nesting, then should fail it.
>
>> I think GET_INFO only works after the ATTACH.
> yes. After attaching to gpa_ioasid, userspace could GET_INFO on the
> gpa_ioasid and check if nesting is supported or not. right?


Some more questions:

1) Is the handle returned by IOASID_ALLOC an fd?
2) If yes, what's the reason for not simply use the fd opened from
/dev/ioas. (This is the question that is not answered) and what happens
if we call GET_INFO for the ioasid_fd?
3) If not, how GET_INFO work?


>
>>> /* Bind guest I/O page table */
>>> bind_data = {
>>> .ioasid = giova_ioasid;
>>> .addr = giova_pgtable;
>>> // and format information
>>> };
>>> ioctl(ioasid_fd, IOASID_BIND_PGTABLE, &bind_data);
>>>
>>> /* Invalidate IOTLB when required */
>>> inv_data = {
>>> .ioasid = giova_ioasid;
>>> // granular information
>>> };
>>> ioctl(ioasid_fd, IOASID_INVALIDATE_CACHE, &inv_data);
>>>
>>> /* See 5.6 for I/O page fault handling */
>>>
>>> 5.5. Guest SVA (vSVA)
>>> ++++++++++++++++++
>>>
>>> After boots the guest further create a GVA address spaces (gpasid1) on
>>> dev1. Dev2 is not affected (still attached to giova_ioasid).
>>>
>>> As explained in section 4, user should avoid expose ENQCMD on both
>>> pdev and mdev.
>>>
>>> The sequence applies to all device types (being pdev or mdev), except
>>> one additional step to call KVM for ENQCMD-capable mdev:
>> My understanding is ENQCMD is Intel specific and not a requirement for
>> having vSVA.
> ENQCMD is not really Intel specific although only Intel supports it today.
> The PCIe DMWr capability is the capability for software to enumerate the
> ENQCMD support in device side. yes, it is not a requirement for vSVA. They
> are orthogonal.


Right, then it's better to mention DMWr instead of a vendor specific
instruction in a general framework like ioasid.

Thanks


>

2021-06-01 03:10:52

by Lu Baolu

[permalink] [raw]
Subject: Re: [RFC] /dev/ioasid uAPI proposal

On 6/1/21 2:09 AM, Jason Gunthorpe wrote:
>>> device bind should fail if the device somehow isn't compatible with
>>> the scheme the user is trying to use.
>> yeah, I guess you mean to fail the device attach when the IOASID is a
>> nesting IOASID but the device is behind an iommu without nesting support.
>> right?
> Right..
>

Just want to confirm...

Does this mean that we only support hardware nesting and don't want to
have soft nesting (shadowed page table in kernel) in IOASID?

Best regards,
baolu

2021-06-01 04:29:28

by Shenming Lu

[permalink] [raw]
Subject: Re: [RFC] /dev/ioasid uAPI proposal

On 2021/6/1 10:36, Jason Wang wrote:
>
> On 2021/5/31 4:41 PM, Liu Yi L wrote:
>>> I guess VFIO_ATTACH_IOASID will fail if the underlayer doesn't support
>>> hardware nesting. Or is there way to detect the capability before?
>> I think it could fail in the IOASID_CREATE_NESTING. If the gpa_ioasid
>> is not able to support nesting, then should fail it.
>>
>>> I think GET_INFO only works after the ATTACH.
>> yes. After attaching to gpa_ioasid, userspace could GET_INFO on the
>> gpa_ioasid and check if nesting is supported or not. right?
>
>
> Some more questions:
>
> 1) Is the handle returned by IOASID_ALLOC an fd?
> 2) If yes, what's the reason for not simply use the fd opened from /dev/ioas. (This is the question that is not answered) and what happens if we call GET_INFO for the ioasid_fd?
> 3) If not, how GET_INFO work?

It seems that the return value from IOASID_ALLOC is an IOASID number in the
ioasid_data struct. Then, when calling GET_INFO, we should convey this IOASID
number to get the associated I/O address space attributes (which depend on the
physical IOMMU and could be discovered when attaching a device to the
IOASID fd or number), right?
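
A sketch of that reading, following the RFC's example style (the info struct
layout is an assumption of this interpretation, not defined in the RFC):

	gpa_ioasid = ioctl(ioasid_fd, IOASID_ALLOC);

	info = { .ioasid = gpa_ioasid };
	ioctl(ioasid_fd, IOASID_GET_INFO, &info);	// fills IOMMU-dependent attributes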

Thanks,
Shenming

2021-06-01 04:33:10

by Shenming Lu

[permalink] [raw]
Subject: Re: [RFC] /dev/ioasid uAPI proposal

On 2021/5/27 15:58, Tian, Kevin wrote:
> /dev/ioasid provides an unified interface for managing I/O page tables for
> devices assigned to userspace. Device passthrough frameworks (VFIO, vDPA,
> etc.) are expected to use this interface instead of creating their own logic to
> isolate untrusted device DMAs initiated by userspace.
>
> This proposal describes the uAPI of /dev/ioasid and also sample sequences
> with VFIO as example in typical usages. The driver-facing kernel API provided
> by the iommu layer is still TBD, which can be discussed after consensus is
> made on this uAPI.
>
> It's based on a lengthy discussion starting from here:
> https://lore.kernel.org/linux-iommu/[email protected]/
>
> It ends up to be a long writing due to many things to be summarized and
> non-trivial effort required to connect them into a complete proposal.
> Hope it provides a clean base to converge.
>

[..]

>
> /*
> * Page fault report and response
> *
> * This is TBD. Can be added after other parts are cleared up. Likely it
> * will be a ring buffer shared between user/kernel, an eventfd to notify
> * the user and an ioctl to complete the fault.
> *
> * The fault data is per I/O address space, i.e.: IOASID + faulting_addr
> */

Hi,

It seems that the ioasid has different usage in different situation, it could
be directly used in the physical routing, or just a virtual handle that indicates
a page table or a vPASID table (such as the GPA address space, in the simple
passthrough case, the DMA input to IOMMU will just contain a Stream ID, no
Substream ID), right?

And Baolu suggested that since one device might consume multiple page tables,
it's more reasonable to have one fault handler per page table. By this, do we
have to maintain such an ioasid info list in the IOMMU layer?

Then if we add host IOPF support (for the GPA address space) in the future
(I have sent a series for this but it aimed for VFIO, I will convert it for
IOASID later [1] :-)), how could we find the handler for the received fault
event which only contains a Stream ID... Do we also have to maintain a
dev(vPASID)->ioasid mapping in the IOMMU layer?

[1] https://lore.kernel.org/patchwork/cover/1410223/

Thanks,
Shenming

2021-06-01 05:11:32

by Jason Wang

[permalink] [raw]
Subject: Re: [RFC] /dev/ioasid uAPI proposal


On 2021/6/1 11:31 AM, Liu Yi L wrote:
> On Tue, 1 Jun 2021 10:36:36 +0800, Jason Wang wrote:
>
>> On 2021/5/31 4:41 PM, Liu Yi L wrote:
>>>> I guess VFIO_ATTACH_IOASID will fail if the underlayer doesn't support
>>>> hardware nesting. Or is there way to detect the capability before?
>>> I think it could fail in the IOASID_CREATE_NESTING. If the gpa_ioasid
>>> is not able to support nesting, then should fail it.
>>>
>>>> I think GET_INFO only works after the ATTACH.
>>> yes. After attaching to gpa_ioasid, userspace could GET_INFO on the
>>> gpa_ioasid and check if nesting is supported or not. right?
>>
>> Some more questions:
>>
>> 1) Is the handle returned by IOASID_ALLOC an fd?
> it's an ID so far in this proposal.


Ok.


>
>> 2) If yes, what's the reason for not simply use the fd opened from
>> /dev/ioas. (This is the question that is not answered) and what happens
>> if we call GET_INFO for the ioasid_fd?
>> 3) If not, how GET_INFO work?
> oh, missed this question in prior reply. Personally, no special reason
> yet. But using ID may give us opportunity to customize the management
> of the handle. For one, better lookup efficiency by using xarray to
> store the allocated IDs. For two, could categorize the allocated IDs
> (parent or nested). GET_INFO just works with an input FD and an ID.


I'm not sure I get this, for nesting cases you can still make the child
an fd.

And a question still, under what case we need to create multiple ioasids
on a single ioasid fd?

(This case is not demonstrated in your examples).

Thanks


>
>>>
>>>>> /* Bind guest I/O page table */
>>>>> bind_data = {
>>>>> .ioasid = giova_ioasid;
>>>>> .addr = giova_pgtable;
>>>>> // and format information
>>>>> };
>>>>> ioctl(ioasid_fd, IOASID_BIND_PGTABLE, &bind_data);
>>>>>
>>>>> /* Invalidate IOTLB when required */
>>>>> inv_data = {
>>>>> .ioasid = giova_ioasid;
>>>>> // granular information
>>>>> };
>>>>> ioctl(ioasid_fd, IOASID_INVALIDATE_CACHE, &inv_data);
>>>>>
>>>>> /* See 5.6 for I/O page fault handling */
>>>>>
>>>>> 5.5. Guest SVA (vSVA)
>>>>> ++++++++++++++++++
>>>>>
>>>>> After boots the guest further create a GVA address spaces (gpasid1) on
>>>>> dev1. Dev2 is not affected (still attached to giova_ioasid).
>>>>>
>>>>> As explained in section 4, user should avoid expose ENQCMD on both
>>>>> pdev and mdev.
>>>>>
>>>>> The sequence applies to all device types (being pdev or mdev), except
>>>>> one additional step to call KVM for ENQCMD-capable mdev:
>>>> My understanding is ENQCMD is Intel specific and not a requirement for
>>>> having vSVA.
>>> ENQCMD is not really Intel specific although only Intel supports it today.
>>> The PCIe DMWr capability is the capability for software to enumerate the
>>> ENQCMD support in device side. yes, it is not a requirement for vSVA. They
>>> are orthogonal.
>>
>> Right, then it's better to mention DMWr instead of a vendor specific
>> instruction in a general framework like ioasid.
> good suggestion. :)
>

2021-06-01 05:12:26

by Jason Wang

[permalink] [raw]
Subject: Re: [RFC] /dev/ioasid uAPI proposal


On 2021/6/1 12:27 PM, Shenming Lu wrote:
> On 2021/6/1 10:36, Jason Wang wrote:
>> On 2021/5/31 4:41 PM, Liu Yi L wrote:
>>>> I guess VFIO_ATTACH_IOASID will fail if the underlayer doesn't support
>>>> hardware nesting. Or is there way to detect the capability before?
>>> I think it could fail in the IOASID_CREATE_NESTING. If the gpa_ioasid
>>> is not able to support nesting, then should fail it.
>>>
>>>> I think GET_INFO only works after the ATTACH.
>>> yes. After attaching to gpa_ioasid, userspace could GET_INFO on the
>>> gpa_ioasid and check if nesting is supported or not. right?
>>
>> Some more questions:
>>
>> 1) Is the handle returned by IOASID_ALLOC an fd?
>> 2) If yes, what's the reason for not simply use the fd opened from /dev/ioas. (This is the question that is not answered) and what happens if we call GET_INFO for the ioasid_fd?
>> 3) If not, how GET_INFO work?
> It seems that the return value from IOASID_ALLOC is an IOASID number in the
> ioasid_data struct. Then, when calling GET_INFO, we should convey this IOASID
> number to get the associated I/O address space attributes (which depend on the
> physical IOMMU and could be discovered when attaching a device to the
> IOASID fd or number), right?


Right, but the question is why we need such indirection. Unless there's a
case where you need to create multiple IOASIDs per ioasid fd, it's
simpler to attach the metadata to the ioasid fd itself.

Thanks


>
> Thanks,
> Shenming
>

2021-06-01 05:14:33

by Lu Baolu

[permalink] [raw]
Subject: Re: [RFC] /dev/ioasid uAPI proposal

Hi Shenming,

On 6/1/21 12:31 PM, Shenming Lu wrote:
> On 2021/5/27 15:58, Tian, Kevin wrote:
>> /dev/ioasid provides an unified interface for managing I/O page tables for
>> devices assigned to userspace. Device passthrough frameworks (VFIO, vDPA,
>> etc.) are expected to use this interface instead of creating their own logic to
>> isolate untrusted device DMAs initiated by userspace.
>>
>> This proposal describes the uAPI of /dev/ioasid and also sample sequences
>> with VFIO as example in typical usages. The driver-facing kernel API provided
>> by the iommu layer is still TBD, which can be discussed after consensus is
>> made on this uAPI.
>>
>> It's based on a lengthy discussion starting from here:
>> https://lore.kernel.org/linux-iommu/[email protected]/
>>
>> It ends up to be a long writing due to many things to be summarized and
>> non-trivial effort required to connect them into a complete proposal.
>> Hope it provides a clean base to converge.
>>
>
> [..]
>
>>
>> /*
>> * Page fault report and response
>> *
>> * This is TBD. Can be added after other parts are cleared up. Likely it
>> * will be a ring buffer shared between user/kernel, an eventfd to notify
>> * the user and an ioctl to complete the fault.
>> *
>> * The fault data is per I/O address space, i.e.: IOASID + faulting_addr
>> */
>
> Hi,
>
> It seems that the ioasid has different usage in different situation, it could
> be directly used in the physical routing, or just a virtual handle that indicates
> a page table or a vPASID table (such as the GPA address space, in the simple
> passthrough case, the DMA input to IOMMU will just contain a Stream ID, no
> Substream ID), right?
>
> And Baolu suggested that since one device might consume multiple page tables,
> it's more reasonable to have one fault handler per page table. By this, do we
> have to maintain such an ioasid info list in the IOMMU layer?

As discussed earlier, the I/O page fault and cache invalidation paths
will have "device labels" so that the information could be easily
translated and routed.

So it's likely the per-device fault handler registering API in iommu
core can be kept, but /dev/ioasid will be grown with a layer to
translate and propagate I/O page fault information to the right
consumers.
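
A rough sketch of what such a translation layer could look like (all names
and signatures here are illustrative, not proposed kAPI):

	/* fault handler that /dev/ioasid registers per bound device */
	static int ioasid_fd_fault_handler(struct device *dev, u32 pasid,
					   u64 addr, void *data)
	{
		struct ioasid_fd *ifd = data;
		struct ioasid_data *ioasid;

		/* translate (dev, pasid) into the user-visible IOASID + device label */
		ioasid = ioasid_fd_lookup(ifd, dev, pasid);
		if (!ioasid)
			return -ENODEV;	/* no consumer registered for this fault */

		/* queue {device_label, ioasid, addr} on the shared ring, kick eventfd */
		return ioasid_fd_report_fault(ifd, ioasid, dev, addr);
	}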

If things evolve in this way, probably the SVA I/O page fault also needs
to be ported to /dev/ioasid.

>
> Then if we add host IOPF support (for the GPA address space) in the future
> (I have sent a series for this but it aimed for VFIO, I will convert it for
> IOASID later [1] :-)), how could we find the handler for the received fault
> event which only contains a Stream ID... Do we also have to maintain a
> dev(vPASID)->ioasid mapping in the IOMMU layer?
>
> [1] https://lore.kernel.org/patchwork/cover/1410223/

Best regards,
baolu

2021-06-01 05:26:07

by Lu Baolu

[permalink] [raw]
Subject: Re: [RFC] /dev/ioasid uAPI proposal

Hi Jason W,

On 6/1/21 1:08 PM, Jason Wang wrote:
>>> 2) If yes, what's the reason for not simply use the fd opened from
>>> /dev/ioas. (This is the question that is not answered) and what happens
>>> if we call GET_INFO for the ioasid_fd?
>>> 3) If not, how GET_INFO work?
>> oh, missed this question in prior reply. Personally, no special reason
>> yet. But using ID may give us opportunity to customize the management
>> of the handle. For one, better lookup efficiency by using xarray to
>> store the allocated IDs. For two, could categorize the allocated IDs
>> (parent or nested). GET_INFO just works with an input FD and an ID.
>
>
> I'm not sure I get this, for nesting cases you can still make the child
> an fd.
>
> And a question still, under what case we need to create multiple ioasids
> on a single ioasid fd?

One possible situation where multiple IOASIDs per FD could be used is
that devices with different underlying IOMMU capabilities are sharing a
single FD. In this case, only devices with consistent underlying IOMMU
capabilities could be put in an IOASID and multiple IOASIDs per FD could
be applied.

Though, I still not sure about "multiple IOASID per-FD" vs "multiple
IOASID FDs" for such case.

Best regards,
baolu

2021-06-01 05:32:23

by Jason Wang

[permalink] [raw]
Subject: Re: [RFC] /dev/ioasid uAPI proposal


On 2021/6/1 1:23 PM, Lu Baolu wrote:
> Hi Jason W,
>
> On 6/1/21 1:08 PM, Jason Wang wrote:
>>>> 2) If yes, what's the reason for not simply use the fd opened from
>>>> /dev/ioas. (This is the question that is not answered) and what
>>>> happens
>>>> if we call GET_INFO for the ioasid_fd?
>>>> 3) If not, how GET_INFO work?
>>> oh, missed this question in prior reply. Personally, no special reason
>>> yet. But using ID may give us opportunity to customize the management
>>> of the handle. For one, better lookup efficiency by using xarray to
>>> store the allocated IDs. For two, could categorize the allocated IDs
>>> (parent or nested). GET_INFO just works with an input FD and an ID.
>>
>>
>> I'm not sure I get this, for nesting cases you can still make the
>> child an fd.
>>
>> And a question still, under what case we need to create multiple
>> ioasids on a single ioasid fd?
>
> One possible situation where multiple IOASIDs per FD could be used is
> that devices with different underlying IOMMU capabilities are sharing a
> single FD. In this case, only devices with consistent underlying IOMMU
> capabilities could be put in an IOASID and multiple IOASIDs per FD could
> be applied.
>
> Though, I still not sure about "multiple IOASID per-FD" vs "multiple
> IOASID FDs" for such case.


Right, that's exactly my question. The latter seems much more easier to
be understood and implemented.

Thanks


>
> Best regards,
> baolu
>

2021-06-01 05:46:30

by Tian, Kevin

[permalink] [raw]
Subject: RE: [RFC] /dev/ioasid uAPI proposal

> From: Jason Wang
> Sent: Tuesday, June 1, 2021 1:30 PM
>
> On 2021/6/1 1:23 PM, Lu Baolu wrote:
> > Hi Jason W,
> >
> > On 6/1/21 1:08 PM, Jason Wang wrote:
> >>>> 2) If yes, what's the reason for not simply use the fd opened from
> >>>> /dev/ioas. (This is the question that is not answered) and what
> >>>> happens
> >>>> if we call GET_INFO for the ioasid_fd?
> >>>> 3) If not, how GET_INFO work?
> >>> oh, missed this question in prior reply. Personally, no special reason
> >>> yet. But using ID may give us opportunity to customize the management
> >>> of the handle. For one, better lookup efficiency by using xarray to
> >>> store the allocated IDs. For two, could categorize the allocated IDs
> >>> (parent or nested). GET_INFO just works with an input FD and an ID.
> >>
> >>
> >> I'm not sure I get this, for nesting cases you can still make the
> >> child an fd.
> >>
> >> And a question still, under what case we need to create multiple
> >> ioasids on a single ioasid fd?
> >
> > One possible situation where multiple IOASIDs per FD could be used is
> > that devices with different underlying IOMMU capabilities are sharing a
> > single FD. In this case, only devices with consistent underlying IOMMU
> > capabilities could be put in an IOASID and multiple IOASIDs per FD could
> > be applied.
> >
> > Though, I still not sure about "multiple IOASID per-FD" vs "multiple
> > IOASID FDs" for such case.
>
>
> Right, that's exactly my question. The latter seems much more easier to
> be understood and implemented.
>

A simple reason discussed in previous thread - there could be 1M's
I/O address spaces per device while #FD's are precious resource.
So this RFC treats fd as a container of address spaces which is each
tagged by an IOASID.

Thanks
Kevin

2021-06-01 06:10:29

by Jason Wang

[permalink] [raw]
Subject: Re: [RFC] /dev/ioasid uAPI proposal


On 2021/6/1 1:42 PM, Tian, Kevin wrote:
>> From: Jason Wang
>> Sent: Tuesday, June 1, 2021 1:30 PM
>>
>> On 2021/6/1 1:23 PM, Lu Baolu wrote:
>>> Hi Jason W,
>>>
>>> On 6/1/21 1:08 PM, Jason Wang wrote:
>>>>>> 2) If yes, what's the reason for not simply use the fd opened from
>>>>>> /dev/ioas. (This is the question that is not answered) and what
>>>>>> happens
>>>>>> if we call GET_INFO for the ioasid_fd?
>>>>>> 3) If not, how GET_INFO work?
>>>>> oh, missed this question in prior reply. Personally, no special reason
>>>>> yet. But using ID may give us opportunity to customize the management
>>>>> of the handle. For one, better lookup efficiency by using xarray to
>>>>> store the allocated IDs. For two, could categorize the allocated IDs
>>>>> (parent or nested). GET_INFO just works with an input FD and an ID.
>>>>
>>>> I'm not sure I get this, for nesting cases you can still make the
>>>> child an fd.
>>>>
>>>> And a question still, under what case we need to create multiple
>>>> ioasids on a single ioasid fd?
>>> One possible situation where multiple IOASIDs per FD could be used is
>>> that devices with different underlying IOMMU capabilities are sharing a
>>> single FD. In this case, only devices with consistent underlying IOMMU
>>> capabilities could be put in an IOASID and multiple IOASIDs per FD could
>>> be applied.
>>>
>>> Though, I still not sure about "multiple IOASID per-FD" vs "multiple
>>> IOASID FDs" for such case.
>>
>> Right, that's exactly my question. The latter seems much more easier to
>> be understood and implemented.
>>
> A simple reason discussed in previous thread - there could be 1M's
> I/O address spaces per device while #FD's are precious resource.


Is the concern for ulimit or performance? Note that we had

#define NR_OPEN_MAX ~0U

And with the fd semantic, you can do a lot of other stuffs: close on
exec, passing via SCM_RIGHTS.

For the case of 1M, I would like to know what's the use case for a
single process to handle 1M+ address spaces?


> So this RFC treats fd as a container of address spaces which is each
> tagged by an IOASID.


If the container and address space is 1:1 then the container seems useless.

Thanks


>
> Thanks
> Kevin

2021-06-01 06:17:40

by Tian, Kevin

[permalink] [raw]
Subject: RE: [RFC] /dev/ioasid uAPI proposal

> From: Jason Wang
> Sent: Tuesday, June 1, 2021 2:07 PM
>
> On 2021/6/1 1:42 PM, Tian, Kevin wrote:
> >> From: Jason Wang
> >> Sent: Tuesday, June 1, 2021 1:30 PM
> >>
> >> On 2021/6/1 1:23 PM, Lu Baolu wrote:
> >>> Hi Jason W,
> >>>
> >>> On 6/1/21 1:08 PM, Jason Wang wrote:
> >>>>>> 2) If yes, what's the reason for not simply use the fd opened from
> >>>>>> /dev/ioas. (This is the question that is not answered) and what
> >>>>>> happens
> >>>>>> if we call GET_INFO for the ioasid_fd?
> >>>>>> 3) If not, how GET_INFO work?
> >>>>> oh, missed this question in prior reply. Personally, no special reason
> >>>>> yet. But using ID may give us opportunity to customize the
> management
> >>>>> of the handle. For one, better lookup efficiency by using xarray to
> >>>>> store the allocated IDs. For two, could categorize the allocated IDs
> >>>>> (parent or nested). GET_INFO just works with an input FD and an ID.
> >>>>
> >>>> I'm not sure I get this, for nesting cases you can still make the
> >>>> child an fd.
> >>>>
> >>>> And a question still, under what case we need to create multiple
> >>>> ioasids on a single ioasid fd?
> >>> One possible situation where multiple IOASIDs per FD could be used is
> >>> that devices with different underlying IOMMU capabilities are sharing a
> >>> single FD. In this case, only devices with consistent underlying IOMMU
> >>> capabilities could be put in an IOASID and multiple IOASIDs per FD could
> >>> be applied.
> >>>
> >>> Though, I still not sure about "multiple IOASID per-FD" vs "multiple
> >>> IOASID FDs" for such case.
> >>
> >> Right, that's exactly my question. The latter seems much more easier to
> >> be understood and implemented.
> >>
> > A simple reason discussed in previous thread - there could be 1M's
> > I/O address spaces per device while #FD's are precious resource.
>
>
> Is the concern for ulimit or performance? Note that we had
>
> #define NR_OPEN_MAX ~0U
>
> And with the fd semantic, you can do a lot of other stuffs: close on
> exec, passing via SCM_RIGHTS.

yes, fd has its merits.

>
> For the case of 1M, I would like to know what's the use case for a
> single process to handle 1M+ address spaces?

This single process is Qemu with an assigned device. Within the guest
there could be many guest processes. Though in reality I didn't see
such 1M processes on a single device, better not restrict it in uAPI?

>
>
> > So this RFC treats fd as a container of address spaces which is each
> > tagged by an IOASID.
>
>
> If the container and address space is 1:1 then the container seems useless.
>

yes, 1:1 then container is useless. But here it's assumed 1:M then
even a single fd is sufficient for all intended usages.

Thanks
Kevin

2021-06-01 07:03:46

by Tian, Kevin

[permalink] [raw]
Subject: RE: [RFC] /dev/ioasid uAPI proposal

> From: Jason Gunthorpe <[email protected]>
> Sent: Saturday, May 29, 2021 4:03 AM
>
> On Thu, May 27, 2021 at 07:58:12AM +0000, Tian, Kevin wrote:
> > /dev/ioasid provides an unified interface for managing I/O page tables for
> > devices assigned to userspace. Device passthrough frameworks (VFIO,
> vDPA,
> > etc.) are expected to use this interface instead of creating their own logic to
> > isolate untrusted device DMAs initiated by userspace.
>
> It is very long, but I think this has turned out quite well. It
> certainly matches the basic sketch I had in my head when we were
> talking about how to create vDPA devices a few years ago.
>
> When you get down to the operations they all seem pretty common sense
> and straightforward. Create an IOASID. Connect to a device. Fill the
> IOASID with pages somehow. Worry about PASID labeling.
>
> It really is critical to get all the vendor IOMMU people to go over it
> and see how their HW features map into this.
>

Agree. btw I feel it might be good to have the several design opens
centrally discussed after going through all the comments. Otherwise
they may be buried in different sub-threads and potentially receive
insufficient attention (especially from people who haven't completed
the reading).

I summarized five opens here, about:

1) Finalizing the name to replace /dev/ioasid;
2) Whether one device is allowed to bind to multiple IOASID fd's;
3) Carry device information in invalidation/fault reporting uAPI;
4) What should/could be specified when allocating an IOASID;
5) The protocol between vfio group and kvm;

For 1), two alternative names are mentioned: /dev/iommu and
/dev/ioas. I don't have a strong preference and would like to hear
votes from all stakeholders. /dev/iommu is slightly better imho for
two reasons. First, per AMD's presentation at the last KVM forum they
implement vIOMMU in hardware and thus need to support user-managed
domains. An iommu uAPI notation might make more sense moving
forward. Second, it makes later uAPI naming easier as 'IOASID' can
always be put as an object, e.g. IOMMU_ALLOC_IOASID instead of
IOASID_ALLOC_IOASID. :)

Another naming open is about IOASID (the software handle for an ioas)
and the associated hardware ID (PASID or substream ID). Jason thought
PASID is defined more from the SVA angle while ARM's convention sounds
clearer from the device p.o.v. Following this direction, SID/SSID would be
used to replace RID/PASID in this RFC (possibly also implying that
the kernel IOASID allocator should be renamed to an SSID allocator).
I don't have a better alternative. If no one objects, I'll change to this new
naming in the next version.

For 2), Jason prefers not to block it if there is no kernel design reason. If
one device is allowed to bind to multiple IOASID fd's, the main problem
is cross-fd IOASID nesting, e.g. having gpa_ioasid created in fd1
and giova_ioasid created in fd2 and then nesting them together (and
whether any cross-fd notification is required when handling invalidation,
etc.). We thought that this just adds complexity while we are not sure
about the value of supporting it (when one fd can already cover all
discussed usages). Therefore this RFC proposes that a device be bound
to at most one IOASID fd. Does this rationale make sense?

At the other end, there was also the thought of whether we should make
it a single I/O address space per IOASID fd. This was discussed in a previous
thread: #fd's are insufficient to cover the theoretical 1M address
spaces per device. But let's have another revisit and draw a clear
conclusion on whether this option is viable.

For 3), Jason/Jean both think it's cleaner to carry device info in the
uAPI. Actually this was one option we developed in earlier internal
versions of this RFC. Later on we changed it to the current way based
on a misinterpretation of a previous discussion. Thinking about it more, we
will adopt this suggestion in the next version, for both efficiency (the I/O
page fault path is already long) and security reasons (some faults are
unrecoverable, thus the faulting device must be identified/isolated).

This implies that VFIO_BIND_IOASID_FD will be extended to allow the user
to specify a device label. This label will be recorded in /dev/iommu to
serve per-device invalidation requests from, and report per-device
fault data to, the user. In addition, the vPASID (if provided by the user) will
also be recorded in /dev/iommu so that vPASID<->pPASID conversion
is done properly, e.g. an invalidation request from the user carries
a vPASID which must be converted into a pPASID before calling the iommu
driver, and vice versa for raw fault data, which carries a pPASID while the
user expects a vPASID.
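
For example, extending the RFC's bind sequence (the label field and its
placement are placeholders, not settled uAPI):

	/* bind dev1 to the ioasid fd with a user-chosen label */
	bind_data = {
		.ioasid_fd = ioasid_fd;
		.device_label = 1;	// echoed back in fault reports,
					// accepted in per-device invalidations
	};
	ioctl(device_fd1, VFIO_BIND_IOASID_FD, &bind_data);

The vPASID<->pPASID translation would then be keyed by this label plus the
vPASID recorded at VFIO_ATTACH_IOASID time.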

For 4), there are two options for specifying the IOASID attributes:

In this RFC, an IOASID has no attributes before it's attached to any
device. After device attach, the user queries capability/format info
about the IOMMU which the device belongs to, and then calls
different ioctl commands to set the attributes for the IOASID (e.g.
map/unmap, bind/unbind user pgtable, nesting, etc.). This follows
how the underlying iommu-layer API is designed: a domain reports
capability/format info and serves iommu ops only after it's attached
to a device.

Jason suggests having the user specify all attributes about how an
IOASID is expected to work when creating the IOASID. This requires
/dev/iommu to provide capability/format info once a device is bound
to the ioasid fd (before creating any IOASID). In concept this should work,
since given a device we can always find its IOMMU. The only gap is
the aforementioned one: the current iommu API is designed per domain
instead of per device.

It seems that to close this design open we have to touch the kAPI design,
and Joerg's input is highly appreciated here.
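
To make the two options concrete, side by side (placeholder fields are used
where the RFC leaves them undefined):

	/* Option A (this RFC): attach first, then query and configure */
	gva_ioasid = ioctl(ioasid_fd, IOASID_ALLOC);
	ioctl(device_fd, VFIO_ATTACH_IOASID, &at_data);
	ioctl(ioasid_fd, IOASID_GET_INFO, &info);	// now reflects the IOMMU behind the device
	ioctl(ioasid_fd, IOASID_BIND_PGTABLE, &bind_data);

	/* Option B (Jason's suggestion): formats known per bound device,
	 * all attributes specified at creation time
	 */
	ioctl(device_fd, VFIO_BIND_IOASID_FD, ioasid_fd);
	ioctl(ioasid_fd, IOASID_GET_INFO, &info);	// capability/format per bound device
	alloc_data = {
		.parent = gpa_ioasid;
		.addr = gva_pgtable1;
		// and format information taken from info
	};
	gva_ioasid = ioctl(ioasid_fd, IOASID_ALLOC, &alloc_data);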

For 5), I'd expect Alex to chime in. Per my understanding, it looks like the
original purpose of this protocol is not about the I/O address space. It's
for KVM to know whether any device is assigned to this VM and then
do something special (e.g. posted interrupts, EPT cache attributes, etc.).
Because KVM deduces some policy based on the fact of an assigned device,
it needs to hold a reference to the related vfio group. This part is irrelevant
to this RFC.

But ARM's VMID usage is related to the I/O address space and thus needs some
consideration. Another strange thing is about PPC. It looks like it also leverages
this protocol to do the iommu group attach: kvm_spapr_tce_attach_iommu_group.
I don't know why it's done through KVM instead of the VFIO uAPI in
the first place.

Thanks
Kevin

2021-06-01 07:16:58

by Shenming Lu

[permalink] [raw]
Subject: Re: [RFC] /dev/ioasid uAPI proposal

On 2021/6/1 13:10, Lu Baolu wrote:
> Hi Shenming,
>
> On 6/1/21 12:31 PM, Shenming Lu wrote:
>> On 2021/5/27 15:58, Tian, Kevin wrote:
>>> /dev/ioasid provides an unified interface for managing I/O page tables for
>>> devices assigned to userspace. Device passthrough frameworks (VFIO, vDPA,
>>> etc.) are expected to use this interface instead of creating their own logic to
>>> isolate untrusted device DMAs initiated by userspace.
>>>
>>> This proposal describes the uAPI of /dev/ioasid and also sample sequences
>>> with VFIO as example in typical usages. The driver-facing kernel API provided
>>> by the iommu layer is still TBD, which can be discussed after consensus is
>>> made on this uAPI.
>>>
>>> It's based on a lengthy discussion starting from here:
>>>     https://lore.kernel.org/linux-iommu/[email protected]/
>>>
>>> It ends up to be a long writing due to many things to be summarized and
>>> non-trivial effort required to connect them into a complete proposal.
>>> Hope it provides a clean base to converge.
>>>
>>
>> [..]
>>
>>>
>>> /*
>>>    * Page fault report and response
>>>    *
>>>    * This is TBD. Can be added after other parts are cleared up. Likely it
>>>    * will be a ring buffer shared between user/kernel, an eventfd to notify
>>>    * the user and an ioctl to complete the fault.
>>>    *
>>>    * The fault data is per I/O address space, i.e.: IOASID + faulting_addr
>>>    */
>>
>> Hi,
>>
>> It seems that the ioasid has different usage in different situation, it could
>> be directly used in the physical routing, or just a virtual handle that indicates
>> a page table or a vPASID table (such as the GPA address space, in the simple
>> passthrough case, the DMA input to IOMMU will just contain a Stream ID, no
>> Substream ID), right?
>>
>> And Baolu suggested that since one device might consume multiple page tables,
>> it's more reasonable to have one fault handler per page table. By this, do we
>> have to maintain such an ioasid info list in the IOMMU layer?
>
> As discussed earlier, the I/O page fault and cache invalidation paths
> will have "device labels" so that the information could be easily
> translated and routed.
>
> So it's likely the per-device fault handler registering API in iommu
> core can be kept, but /dev/ioasid will be grown with a layer to
> translate and propagate I/O page fault information to the right
> consumers.

Yeah, having a general preprocessing of the faults in IOASID seems to be
a doable direction. But since there may be more than one consumer at the
same time, who is responsible for registering the per-device fault handler?

Thanks,
Shenming

>
> If things evolve in this way, probably the SVA I/O page fault also needs
> to be ported to /dev/ioasid.
>
>>
>> Then if we add host IOPF support (for the GPA address space) in the future
>> (I have sent a series for this but it aimed for VFIO, I will convert it for
>> IOASID later [1] :-)), how could we find the handler for the received fault
>> event which only contains a Stream ID... Do we also have to maintain a
>> dev(vPASID)->ioasid mapping in the IOMMU layer?
>>
>> [1] https://lore.kernel.org/patchwork/cover/1410223/
>
> Best regards,
> baolu
> .

2021-06-01 07:54:12

by Tian, Kevin

[permalink] [raw]
Subject: RE: [RFC] /dev/ioasid uAPI proposal

> From: Jean-Philippe Brucker <[email protected]>
> Sent: Saturday, May 29, 2021 12:23 AM
> >
> > IOASID nesting can be implemented in two ways: hardware nesting and
> > software nesting. With hardware support the child and parent I/O page
> > tables are walked consecutively by the IOMMU to form a nested translation.
> > When it's implemented in software, the ioasid driver is responsible for
> > merging the two-level mappings into a single-level shadow I/O page table.
> > Software nesting requires both child/parent page tables operated through
> > the dma mapping protocol, so any change in either level can be captured
> > by the kernel to update the corresponding shadow mapping.
>
> Is there an advantage to moving software nesting into the kernel?
> We could just have the guest do its usual combined map/unmap on the child
> fd
>

There are at least two intended usages:

1) From previous discussion it looks like PPC's window-based scheme can be
better supported with software nesting (a shared IOVA address space
as the parent (shared by all devices), nested by multiple windows
as the children (per-device));

2) Some mdev drivers (e.g. kvmgt) may want to do write-protection on
guest data structures (base address programmed to a mediated MMIO
register). The base address is an IOVA while the KVM page-tracking API is
based on GPA; nesting allows finding the GPA corresponding to an IOVA.

Thanks
Kevin

2021-06-01 08:11:10

by Tian, Kevin

[permalink] [raw]
Subject: RE: [RFC] /dev/ioasid uAPI proposal

> From: Jason Gunthorpe <[email protected]>
> Sent: Saturday, May 29, 2021 1:36 AM
>
> On Thu, May 27, 2021 at 07:58:12AM +0000, Tian, Kevin wrote:
>
> > IOASID nesting can be implemented in two ways: hardware nesting and
> > software nesting. With hardware support the child and parent I/O page
> > tables are walked consecutively by the IOMMU to form a nested translation.
> > When it's implemented in software, the ioasid driver is responsible for
> > merging the two-level mappings into a single-level shadow I/O page table.
> > Software nesting requires both child/parent page tables operated through
> > the dma mapping protocol, so any change in either level can be captured
> > by the kernel to update the corresponding shadow mapping.
>
> Why? A SW emulation could do this synchronization during invalidation
> processing if invalidation contained an IOVA range.

In this proposal we differentiate between host-managed and user-
managed I/O page tables. If host-managed, the user is expected to use
the map/unmap cmds explicitly upon any change required on the page table.
If user-managed, the user first binds its page table to the IOMMU and
then uses the invalidation cmd to flush the iotlb when necessary (e.g. typically
not required when changing a PTE from non-present to present).

We expect the user to use map+unmap and bind+invalidate respectively
instead of mixing them together. Following this policy, map+unmap
must be used at both levels for software nesting, so changes at either
level are captured in time to synchronize the shadow mapping.
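
Roughly, the shadow update on a child map would look like this (helper
names are invented for illustration):

	/* on IOASID_DMA_MAP against the child (GIOVA -> GPA) */
	hva = parent_lookup(gpa_ioasid, dma_map.vaddr);	// GPA -> HVA via parent mappings
	hpa = resolve_and_pin(hva, dma_map.size);	// pin the backing pages
	shadow_map(giova_ioasid, dma_map.iova, hpa, dma_map.size);	// install GIOVA -> HPA

An unmap at either level would likewise trigger the corresponding shadow unmap.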

>
> I think this document would be stronger to include some "Rational"
> statements in key places
>

Sure. I tried to provide rationale as much as possible but sometimes
it's lost in a complex context like this. :)

Thanks
Kevin

2021-06-01 08:39:44

by Tian, Kevin

[permalink] [raw]
Subject: RE: [RFC] /dev/ioasid uAPI proposal

> From: Jason Gunthorpe <[email protected]>
> Sent: Saturday, May 29, 2021 3:59 AM
>
> On Thu, May 27, 2021 at 07:58:12AM +0000, Tian, Kevin wrote:
> >
> > 5. Use Cases and Flows
> >
> > Here assume VFIO will support a new model where every bound device
> > is explicitly listed under /dev/vfio thus a device fd can be acquired w/o
> > going through legacy container/group interface. For illustration purpose
> > those devices are just called dev[1...N]:
> >
> > device_fd[1...N] = open("/dev/vfio/devices/dev[1...N]", mode);
> >
> > As explained earlier, one IOASID fd is sufficient for all intended use cases:
> >
> > ioasid_fd = open("/dev/ioasid", mode);
> >
> > For simplicity below examples are all made for the virtualization story.
> > They are representative and could be easily adapted to a non-virtualization
> > scenario.
>
> For others, I don't think this is *strictly* necessary, we can
> probably still get to the device_fd using the group_fd and fit in
> /dev/ioasid. It does make the rest of this more readable though.

Jason, I want to confirm here. Per earlier discussion we were left with the
impression that you want VFIO to be a pure device driver, with
container/group used only for legacy applications. From this
comment are you suggesting that VFIO can still keep the container/
group concepts and the user just deprecates the use of the vfio iommu
uAPI (e.g. VFIO_SET_IOMMU) by using /dev/ioasid (which has
a simple policy that an IOASID will reject cmds if a partially-attached
group exists)?

>
>
> > Three types of IOASIDs are considered:
> >
> > gpa_ioasid[1...N]: for GPA address space
> > giova_ioasid[1...N]: for guest IOVA address space
> > gva_ioasid[1...N]: for guest CPU VA address space
> >
> > At least one gpa_ioasid must always be created per guest, while the other
> > two are relevant as far as vIOMMU is concerned.
> >
> > Examples here apply to both pdev and mdev, if not explicitly marked out
> > (e.g. in section 5.5). VFIO device driver in the kernel will figure out the
> > associated routing information in the attaching operation.
> >
> > For illustration simplicity, IOASID_CHECK_EXTENSION and IOASID_GET_
> > INFO are skipped in these examples.
> >
> > 5.1. A simple example
> > ++++++++++++++++++
> >
> > Dev1 is assigned to the guest. One gpa_ioasid is created. The GPA address
> > space is managed through DMA mapping protocol:
> >
> > /* Bind device to IOASID fd */
> > device_fd = open("/dev/vfio/devices/dev1", mode);
> > ioasid_fd = open("/dev/ioasid", mode);
> > ioctl(device_fd, VFIO_BIND_IOASID_FD, ioasid_fd);
> >
> > /* Attach device to IOASID */
> > gpa_ioasid = ioctl(ioasid_fd, IOASID_ALLOC);
> > at_data = { .ioasid = gpa_ioasid};
> > ioctl(device_fd, VFIO_ATTACH_IOASID, &at_data);
> >
> > /* Setup GPA mapping */
> > dma_map = {
> > .ioasid = gpa_ioasid;
> > .iova = 0; // GPA
> > .vaddr = 0x40000000; // HVA
> > .size = 1GB;
> > };
> > ioctl(ioasid_fd, IOASID_DMA_MAP, &dma_map);
> >
> > If the guest is assigned with more than dev1, user follows above sequence
> > to attach other devices to the same gpa_ioasid i.e. sharing the GPA
> > address space cross all assigned devices.
>
> eg
>
> device2_fd = open("/dev/vfio/devices/dev1", mode);
> ioctl(device2_fd, VFIO_BIND_IOASID_FD, ioasid_fd);
> ioctl(device2_fd, VFIO_ATTACH_IOASID, &at_data);
>
> Right?

Exactly, except a small typo ('dev1' -> 'dev2'). :)

>
> >
> > 5.2. Multiple IOASIDs (no nesting)
> > ++++++++++++++++++++++++++++
> >
> > Dev1 and dev2 are assigned to the guest. vIOMMU is enabled. Initially
> > both devices are attached to gpa_ioasid. After boot the guest creates
> > an GIOVA address space (giova_ioasid) for dev2, leaving dev1 in pass
> > through mode (gpa_ioasid).
> >
> > Suppose IOASID nesting is not supported in this case. Qemu need to
> > generate shadow mappings in userspace for giova_ioasid (like how
> > VFIO works today).
> >
> > To avoid duplicated locked page accounting, it's recommended to pre-
> > register the virtual address range that will be used for DMA:
> >
> > device_fd1 = open("/dev/vfio/devices/dev1", mode);
> > device_fd2 = open("/dev/vfio/devices/dev2", mode);
> > ioasid_fd = open("/dev/ioasid", mode);
> > ioctl(device_fd1, VFIO_BIND_IOASID_FD, ioasid_fd);
> > ioctl(device_fd2, VFIO_BIND_IOASID_FD, ioasid_fd);
> >
> > /* pre-register the virtual address range for accounting */
> > mem_info = { .vaddr = 0x40000000; .size = 1GB };
> > ioctl(ioasid_fd, IOASID_REGISTER_MEMORY, &mem_info);
> >
> > /* Attach dev1 and dev2 to gpa_ioasid */
> > gpa_ioasid = ioctl(ioasid_fd, IOASID_ALLOC);
> > at_data = { .ioasid = gpa_ioasid};
> > ioctl(device_fd1, VFIO_ATTACH_IOASID, &at_data);
> > ioctl(device_fd2, VFIO_ATTACH_IOASID, &at_data);
> >
> > /* Setup GPA mapping */
> > dma_map = {
> > .ioasid = gpa_ioasid;
> > .iova = 0; // GPA
> > .vaddr = 0x40000000; // HVA
> > .size = 1GB;
> > };
> > ioctl(ioasid_fd, IOASID_DMA_MAP, &dma_map);
> >
> > /* After boot, guest enables an GIOVA space for dev2 */
> > giova_ioasid = ioctl(ioasid_fd, IOASID_ALLOC);
> >
> > /* First detach dev2 from previous address space */
> > at_data = { .ioasid = gpa_ioasid};
> > ioctl(device_fd2, VFIO_DETACH_IOASID, &at_data);
> >
> > /* Then attach dev2 to the new address space */
> > at_data = { .ioasid = giova_ioasid};
> > ioctl(device_fd2, VFIO_ATTACH_IOASID, &at_data);
> >
> > /* Setup a shadow DMA mapping according to vIOMMU
> > * GIOVA (0x2000) -> GPA (0x1000) -> HVA (0x40001000)
> > */
>
> Here "shadow DMA" means relay the guest's vIOMMU page tables to the HW
> IOMMU?

'shadow' means the merged mapping: GIOVA(0x2000) -> HVA (0x40001000)

>
> > dma_map = {
> > .ioasid = giova_ioasid;
> > .iova = 0x2000; // GIOVA
> > .vaddr = 0x40001000; // HVA
>
> eg HVA came from reading the guest's page tables and finding it wanted
> GPA 0x1000 mapped to IOVA 0x2000?

yes

>
>
> > 5.3. IOASID nesting (software)
> > +++++++++++++++++++++++++
> >
> > Same usage scenario as 5.2, with software-based IOASID nesting
> > available. In this mode it is the kernel instead of user to create the
> > shadow mapping.
> >
> > The flow before guest boots is same as 5.2, except one point. Because
> > giova_ioasid is nested on gpa_ioasid, locked accounting is only
> > conducted for gpa_ioasid. So it's not necessary to pre-register virtual
> > memory.
> >
> > To save space we only list the steps after boots (i.e. both dev1/dev2
> > have been attached to gpa_ioasid before guest boots):
> >
> > /* After boots */
> > /* Make GIOVA space nested on GPA space */
> > giova_ioasid = ioctl(ioasid_fd, IOASID_CREATE_NESTING,
> > gpa_ioasid);
> >
> > /* Attach dev2 to the new address space (child)
> > * Note dev2 is still attached to gpa_ioasid (parent)
> > */
> > at_data = { .ioasid = giova_ioasid};
> > ioctl(device_fd2, VFIO_ATTACH_IOASID, &at_data);
> >
> > /* Setup a GIOVA->GPA mapping for giova_ioasid, which will be
> > * merged by the kernel with GPA->HVA mapping of gpa_ioasid
> > * to form a shadow mapping.
> > */
> > dma_map = {
> > .ioasid = giova_ioasid;
> > .iova = 0x2000; // GIOVA
> > .vaddr = 0x1000; // GPA
> > .size = 4KB;
> > };
> > ioctl(ioasid_fd, IOASID_DMA_MAP, &dma_map);
>
> And in this version the kernel reaches into the parent IOASID's page
> tables to translate 0x1000 to 0x40001000 to physical page? So we
> basically remove the qemu process address space entirely from this
> translation. It does seem convenient

yes.

>
> > 5.4. IOASID nesting (hardware)
> > +++++++++++++++++++++++++
> >
> > Same usage scenario as 5.2, with hardware-based IOASID nesting
> > available. In this mode the pgtable binding protocol is used to
> > bind the guest IOVA page table with the IOMMU:
> >
> > /* After boots */
> > /* Make GIOVA space nested on GPA space */
> > giova_ioasid = ioctl(ioasid_fd, IOASID_CREATE_NESTING,
> > gpa_ioasid);
> >
> > /* Attach dev2 to the new address space (child)
> > * Note dev2 is still attached to gpa_ioasid (parent)
> > */
> > at_data = { .ioasid = giova_ioasid};
> > ioctl(device_fd2, VFIO_ATTACH_IOASID, &at_data);
> >
> > /* Bind guest I/O page table */
> > bind_data = {
> > .ioasid = giova_ioasid;
> > .addr = giova_pgtable;
> > // and format information
> > };
> > ioctl(ioasid_fd, IOASID_BIND_PGTABLE, &bind_data);
>
> I really think you need to use consistent language. Things that
> allocate a new IOASID should be calle IOASID_ALLOC_IOASID. If multiple
> IOCTLs are needed then it is IOASID_ALLOC_IOASID_PGTABLE, etc.
> alloc/create/bind is too confusing.
>
> > 5.5. Guest SVA (vSVA)
> > ++++++++++++++++++
> >
> > After boots the guest further create a GVA address spaces (gpasid1) on
> > dev1. Dev2 is not affected (still attached to giova_ioasid).
> >
> > As explained in section 4, user should avoid expose ENQCMD on both
> > pdev and mdev.
> >
> > The sequence applies to all device types (being pdev or mdev), except
> > one additional step to call KVM for ENQCMD-capable mdev:
> >
> > /* After boots */
> > /* Make GVA space nested on GPA space */
> > gva_ioasid = ioctl(ioasid_fd, IOASID_CREATE_NESTING,
> > gpa_ioasid);
> >
> > /* Attach dev1 to the new address space and specify vPASID */
> > at_data = {
> > .ioasid = gva_ioasid;
> > .flag = IOASID_ATTACH_USER_PASID;
> > .user_pasid = gpasid1;
> > };
> > ioctl(device_fd1, VFIO_ATTACH_IOASID, &at_data);
>
> Still a little unsure why the vPASID is here not on the gva_ioasid. Is
> there any scenario where we want different vpasid's for the same
> IOASID? I guess it is OK like this. Hum.

Yes, it's completely sane that the guest links an I/O page table to
different vpasids on dev1 and dev2. The IOMMU doesn't mandate
that when multiple devices share an I/O page table they must use
the same PASID#.

>
> > /* if dev1 is ENQCMD-capable mdev, update CPU PASID
> > * translation structure through KVM
> > */
> > pa_data = {
> > .ioasid_fd = ioasid_fd;
> > .ioasid = gva_ioasid;
> > .guest_pasid = gpasid1;
> > };
> > ioctl(kvm_fd, KVM_MAP_PASID, &pa_data);
>
> Make sense
>
> > /* Bind guest I/O page table */
> > bind_data = {
> > .ioasid = gva_ioasid;
> > .addr = gva_pgtable1;
> > // and format information
> > };
> > ioctl(ioasid_fd, IOASID_BIND_PGTABLE, &bind_data);
>
> Again I do wonder if this should just be part of alloc_ioasid. Is
> there any reason to split these things? The only advantage to the
> split is the device is known, but the device shouldn't impact
> anything..

I summarized this as open#4 in another mail for focused discussion.

>
> > 5.6. I/O page fault
> > +++++++++++++++
> >
> > (uAPI is TBD. Here is just about the high-level flow from host IOMMU driver
> > to guest IOMMU driver and backwards).
> >
> > - Host IOMMU driver receives a page request with raw fault_data {rid,
> > pasid, addr};
> >
> > - Host IOMMU driver identifies the faulting I/O page table according to
> > information registered by IOASID fault handler;
> >
> > - IOASID fault handler is called with raw fault_data (rid, pasid, addr),
> which
> > is saved in ioasid_data->fault_data (used for response);
> >
> > - IOASID fault handler generates an user fault_data (ioasid, addr), links it
> > to the shared ring buffer and triggers eventfd to userspace;
>
> Here rid should be translated to a labeled device and return the
> device label from VFIO_BIND_IOASID_FD. Depending on how the device
> bound the label might match to a rid or to a rid,pasid

Yes, I acknowledged this input from you and Jean about page fault and
bind_pasid_table. I summarized it as open#3 in another mail.

thus the following is skipped...

Thanks
Kevin

>
> > - Upon received event, Qemu needs to find the virtual routing information
> > (v_rid + v_pasid) of the device attached to the faulting ioasid. If there are
> > multiple, pick a random one. This should be fine since the purpose is to
> > fix the I/O page table on the guest;
>
> The device label should fix this
>
> > - Qemu finds the pending fault event, converts virtual completion data
> > into (ioasid, response_code), and then calls a /dev/ioasid ioctl to
> > complete the pending fault;
> >
> > - /dev/ioasid finds out the pending fault data {rid, pasid, addr} saved in
> > ioasid_data->fault_data, and then calls iommu api to complete it with
> > {rid, pasid, response_code};
>
> So resuming a fault on an ioasid will resume all devices pending on
> the fault?
>
> > 5.7. BIND_PASID_TABLE
> > ++++++++++++++++++++
> >
> > PASID table is put in the GPA space on some platform, thus must be
> updated
> > by the guest. It is treated as another user page table to be bound with the
> > IOMMU.
> >
> > As explained earlier, the user still needs to explicitly bind every user I/O
> > page table to the kernel so the same pgtable binding protocol (bind, cache
> > invalidate and fault handling) is unified cross platforms.
> >
> > vIOMMUs may include a caching mode (or paravirtualized way) which,
> once
> > enabled, requires the guest to invalidate PASID cache for any change on the
> > PASID table. This allows Qemu to track the lifespan of guest I/O page tables.
> >
> > In case of missing such capability, Qemu could enable write-protection on
> > the guest PASID table to achieve the same effect.
> >
> > /* After boots */
> > /* Make vPASID space nested on GPA space */
> > pasidtbl_ioasid = ioctl(ioasid_fd, IOASID_CREATE_NESTING,
> > gpa_ioasid);
> >
> > /* Attach dev1 to pasidtbl_ioasid */
> > at_data = { .ioasid = pasidtbl_ioasid};
> > ioctl(device_fd1, VFIO_ATTACH_IOASID, &at_data);
> >
> > /* Bind PASID table */
> > bind_data = {
> > .ioasid = pasidtbl_ioasid;
> > .addr = gpa_pasid_table;
> > // and format information
> > };
> > ioctl(ioasid_fd, IOASID_BIND_PASID_TABLE, &bind_data);
> >
> > /* vIOMMU detects a new GVA I/O space created */
> > gva_ioasid = ioctl(ioasid_fd, IOASID_CREATE_NESTING,
> > gpa_ioasid);
> >
> > /* Attach dev1 to the new address space, with gpasid1 */
> > at_data = {
> > .ioasid = gva_ioasid;
> > .flag = IOASID_ATTACH_USER_PASID;
> > .user_pasid = gpasid1;
> > };
> > ioctl(device_fd1, VFIO_ATTACH_IOASID, &at_data);
> >
> > /* Bind guest I/O page table. Because SET_PASID_TABLE has been
> > * used, the kernel will not update the PASID table. Instead, just
> > * track the bound I/O page table for handling invalidation and
> > * I/O page faults.
> > */
> > bind_data = {
> > .ioasid = gva_ioasid;
> > .addr = gva_pgtable1;
> > // and format information
> > };
> > ioctl(ioasid_fd, IOASID_BIND_PGTABLE, &bind_data);
>
> I still don't quite get the benifit from doing this.
>
> The idea to create an all PASID IOASID seems to work better with less
> fuss on HW that is directly parsing the guest's PASID table.
>
> Cache invalidate seems easy enough to support
>
> Fault handling needs to return the (ioasid, device_label, pasid) when
> working with this kind of ioasid.
>
> It is true that it does create an additional flow qemu has to
> implement, but it does directly mirror the HW.
>
> Jason

2021-06-01 08:48:23

by Jason Wang

[permalink] [raw]
Subject: Re: [RFC] /dev/ioasid uAPI proposal


On 2021/6/1 2:16 PM, Tian, Kevin wrote:
>> From: Jason Wang
>> Sent: Tuesday, June 1, 2021 2:07 PM
>>
>> On 2021/6/1 1:42 PM, Tian, Kevin wrote:
>>>> From: Jason Wang
>>>> Sent: Tuesday, June 1, 2021 1:30 PM
>>>>
>>>> On 2021/6/1 1:23 PM, Lu Baolu wrote:
>>>>> Hi Jason W,
>>>>>
>>>>> On 6/1/21 1:08 PM, Jason Wang wrote:
>>>>>>>> 2) If yes, what's the reason for not simply use the fd opened from
>>>>>>>> /dev/ioas. (This is the question that is not answered) and what
>>>>>>>> happens
>>>>>>>> if we call GET_INFO for the ioasid_fd?
>>>>>>>> 3) If not, how GET_INFO work?
>>>>>>> oh, missed this question in prior reply. Personally, no special reason
>>>>>>> yet. But using ID may give us opportunity to customize the
>> management
>>>>>>> of the handle. For one, better lookup efficiency by using xarray to
>>>>>>> store the allocated IDs. For two, could categorize the allocated IDs
>>>>>>> (parent or nested). GET_INFO just works with an input FD and an ID.
>>>>>> I'm not sure I get this, for nesting cases you can still make the
>>>>>> child an fd.
>>>>>>
>>>>>> And a question still, under what case we need to create multiple
>>>>>> ioasids on a single ioasid fd?
>>>>> One possible situation where multiple IOASIDs per FD could be used is
>>>>> that devices with different underlying IOMMU capabilities are sharing a
>>>>> single FD. In this case, only devices with consistent underlying IOMMU
>>>>> capabilities could be put in an IOASID and multiple IOASIDs per FD could
>>>>> be applied.
>>>>>
>>>>> Though, I still not sure about "multiple IOASID per-FD" vs "multiple
>>>>> IOASID FDs" for such case.
>>>> Right, that's exactly my question. The latter seems much more easier to
>>>> be understood and implemented.
>>>>
>>> A simple reason discussed in previous thread - there could be 1M's
>>> I/O address spaces per device while #FD's are precious resource.
>>
>> Is the concern for ulimit or performance? Note that we had
>>
>> #define NR_OPEN_MAX ~0U
>>
>> And with the fd semantic, you can do a lot of other stuffs: close on
>> exec, passing via SCM_RIGHTS.
> yes, fd has its merits.
>
>> For the case of 1M, I would like to know what's the use case for a
>> single process to handle 1M+ address spaces?
> This single process is Qemu with an assigned device. Within the guest
> there could be many guest processes. Though in reality I didn't see
> such 1M processes on a single device, better not restrict it in uAPI?


Sorry I don't get here.

We can open up to ~0U file descriptors, I don't see why we need to
restrict it in uAPI.

Thanks


>
>>
>>> So this RFC treats fd as a container of address spaces which is each
>>> tagged by an IOASID.
>>
>> If the container and address space is 1:1 then the container seems useless.
>>
> yes, 1:1 then container is useless. But here it's assumed 1:M then
> even a single fd is sufficient for all intended usages.
>
> Thanks
> Kevin

2021-06-01 11:10:53

by Lu Baolu

[permalink] [raw]
Subject: Re: [RFC] /dev/ioasid uAPI proposal

Hi Jason,

On 2021/5/29 7:36, Jason Gunthorpe wrote:
>> /*
>> * Bind an user-managed I/O page table with the IOMMU
>> *
>> * Because user page table is untrusted, IOASID nesting must be enabled
>> * for this ioasid so the kernel can enforce its DMA isolation policy
>> * through the parent ioasid.
>> *
>> * Pgtable binding protocol is different from DMA mapping. The latter
>> * has the I/O page table constructed by the kernel and updated
>> * according to user MAP/UNMAP commands. With pgtable binding the
>> * whole page table is created and updated by userspace, thus different
>> * set of commands are required (bind, iotlb invalidation, page fault, etc.).
>> *
>> * Because the page table is directly walked by the IOMMU, the user
>> * must use a format compatible to the underlying hardware. It can
>> * check the format information through IOASID_GET_INFO.
>> *
>> * The page table is bound to the IOMMU according to the routing
>> * information of each attached device under the specified IOASID. The
>> * routing information (RID and optional PASID) is registered when a
>> * device is attached to this IOASID through VFIO uAPI.
>> *
>> * Input parameters:
>> * - child_ioasid;
>> * - address of the user page table;
>> * - formats (vendor, address_width, etc.);
>> *
>> * Return: 0 on success, -errno on failure.
>> */
>> #define IOASID_BIND_PGTABLE _IO(IOASID_TYPE, IOASID_BASE + 9)
>> #define IOASID_UNBIND_PGTABLE _IO(IOASID_TYPE, IOASID_BASE + 10)
> Also feels backwards, why wouldn't we specify this, and the required
> page table format, during alloc time?
>

Thinking of the required page table format, perhaps we should shed more
light on the page table of an IOASID. So far, an IOASID might represent
one of the following page tables (might be more):

1) an IOMMU format page table (a.k.a. iommu_domain)
2) a user application CPU page table (SVA for example)
3) a KVM EPT (future option)
4) a VM guest managed page table (nesting mode)

This version only covers 1) and 4). Do you think we need to support 2),
3) and beyond? If so, it seems that we need some in-kernel helpers and
uAPIs to support pre-installing a page table to IOASID. From this point
of view an IOASID is actually not just a variant of iommu_domain, but an
I/O page table representation in a broader sense.
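
Purely as an illustration of "specify the format at alloc time" (not a
proposal - every name below is made up), the allocation parameters might
carry the page table type and format like this:

    /* Illustrative sketch only; struct and enum names are hypothetical. */
    #include <stdint.h>

    enum ioasid_pgtable_type {
            IOASID_PGTABLE_KERNEL,     /* 1) IOMMU format, kernel-managed (iommu_domain) */
            IOASID_PGTABLE_CPU_SVA,    /* 2) user application CPU page table (SVA) */
            IOASID_PGTABLE_KVM_EPT,    /* 3) KVM EPT (future option) */
            IOASID_PGTABLE_USER_GUEST, /* 4) guest-managed page table (nesting mode) */
    };

    struct ioasid_alloc_data {
            uint32_t flags;
            uint32_t pgtable_type;     /* one of enum ioasid_pgtable_type */
            uint32_t vendor_format;    /* HW format as reported by IOASID_GET_INFO */
            uint32_t addr_width;
            uint64_t pgtable_addr;     /* user page table base, if user-managed */
            uint32_t parent_ioasid;    /* for nesting; 0 if none */
            uint32_t pad;
    };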

Best regards,
baolu

2021-06-01 12:05:10

by Parav Pandit

[permalink] [raw]
Subject: RE: [RFC] /dev/ioasid uAPI proposal



> From: Jason Gunthorpe <[email protected]>
> Sent: Monday, May 31, 2021 11:43 PM
>
> On Mon, May 31, 2021 at 05:37:35PM +0000, Parav Pandit wrote:
>
> > In that case, can it be a new system call? Why does it have to be under
> /dev/ioasid?
> > For example few years back such system call mpin() thought was proposed
> in [1].
>
> Reference counting of the overall pins are required
>
> So when a pinned pages is incorporated into an IOASID page table in a later
> IOCTL it means it cannot be unpinned while the IOASID page table is using it.
OK, but can't it use the same refcount that the mmu uses?

>
> This is some trick to organize the pinning into groups and then refcount each
> group, thus avoiding needing per-page refcounts.
Pinned page refcount is already maintained by the mmu without ioasid, isn't it?

>
> The data structure would be an interval tree of pins in general
>
> The ioasid itself would have an interval tree of its own mappings, each entry
> in this tree would reference count against an element in the above tree
>
> Then the ioasid's interval tree would be mapped into a page table tree in HW
> format.
Does it mean that in the simple use case [1], a second-level page table copy is maintained on the IOMMU side via the map interface?
I hope not. It should use the same one that the mmu uses, right?

[1] one SIOV/ADI device assigned with one PASID and mapped in guest VM

>
> The redundant storage is needed to keep track of the referencing and the
> CPU page table values for later unpinning.
>
> Jason
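
Purely for illustration, a rough sketch of the structures described above -
a refcounted "tree of pins" plus per-IOASID mappings referencing it; nothing
here is an actual kernel data structure:

    #include <stdint.h>

    /* One entry in the interval tree of pins: a contiguous user range that
     * was pinned once, with a single refcount for the whole range rather
     * than per-page refcounts. */
    struct pin_range {
            uint64_t user_va;        /* start of the pinned user range */
            uint64_t length;
            uint64_t refcount;       /* number of IOASID mappings referencing it */
            /* page list / CPU PTE values kept here for later unpinning */
    };

    /* One entry in a given IOASID's own interval tree of mappings; it holds
     * a reference on a pin_range instead of pinning pages again. This is
     * the interval that later gets mirrored into the HW-format page table. */
    struct ioasid_mapping {
            uint64_t iova;
            uint64_t length;
            struct pin_range *pins;  /* backing pinned range, refcount held */
    };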

2021-06-01 12:32:09

by Lu Baolu

[permalink] [raw]
Subject: Re: [RFC] /dev/ioasid uAPI proposal

On 2021/6/1 15:15, Shenming Lu wrote:
> On 2021/6/1 13:10, Lu Baolu wrote:
>> Hi Shenming,
>>
>> On 6/1/21 12:31 PM, Shenming Lu wrote:
>>> On 2021/5/27 15:58, Tian, Kevin wrote:
>>>> /dev/ioasid provides an unified interface for managing I/O page tables for
>>>> devices assigned to userspace. Device passthrough frameworks (VFIO, vDPA,
>>>> etc.) are expected to use this interface instead of creating their own logic to
>>>> isolate untrusted device DMAs initiated by userspace.
>>>>
>>>> This proposal describes the uAPI of /dev/ioasid and also sample sequences
>>>> with VFIO as example in typical usages. The driver-facing kernel API provided
>>>> by the iommu layer is still TBD, which can be discussed after consensus is
>>>> made on this uAPI.
>>>>
>>>> It's based on a lengthy discussion starting from here:
>>>> https://lore.kernel.org/linux-iommu/[email protected]/
>>>>
>>>> It ends up to be a long writing due to many things to be summarized and
>>>> non-trivial effort required to connect them into a complete proposal.
>>>> Hope it provides a clean base to converge.
>>>>
>>> [..]
>>>
>>>> /*
>>>>    * Page fault report and response
>>>>    *
>>>>    * This is TBD. Can be added after other parts are cleared up. Likely it
>>>>    * will be a ring buffer shared between user/kernel, an eventfd to notify
>>>>    * the user and an ioctl to complete the fault.
>>>>    *
>>>>    * The fault data is per I/O address space, i.e.: IOASID + faulting_addr
>>>>    */
>>> Hi,
>>>
>>> It seems that the ioasid has different usage in different situation, it could
>>> be directly used in the physical routing, or just a virtual handle that indicates
>>> a page table or a vPASID table (such as the GPA address space, in the simple
>>> passthrough case, the DMA input to IOMMU will just contain a Stream ID, no
>>> Substream ID), right?
>>>
>>> And Baolu suggested that since one device might consume multiple page tables,
>>> it's more reasonable to have one fault handler per page table. By this, do we
>>> have to maintain such an ioasid info list in the IOMMU layer?
>> As discussed earlier, the I/O page fault and cache invalidation paths
>> will have "device labels" so that the information could be easily
>> translated and routed.
>>
>> So it's likely the per-device fault handler registering API in iommu
>> core can be kept, but /dev/ioasid will be grown with a layer to
>> translate and propagate I/O page fault information to the right
>> consumers.
> Yeah, having a general preprocessing of the faults in IOASID seems to be
> a doable direction. But since there may be more than one consumer at the
> same time, who is responsible for registering the per-device fault handler?

The drivers register per-page-table fault handlers with /dev/ioasid, which
will then register itself with the iommu core to listen for and route the per-
device I/O page faults. This is just a top-level thought. I haven't gone
through the details yet. We need to wait and see what /dev/ioasid finally
looks like.
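
A very rough sketch of that layering, just to make it concrete (none of
these functions or types exist; the names are purely illustrative):

    #include <stdint.h>

    struct device;                    /* opaque here */

    /* A consumer registers one handler per page table, i.e. per IOASID. */
    typedef int (*ioasid_fault_handler_t)(uint32_t ioasid, uint64_t addr,
                                          void *data);
    int ioasid_register_fault_handler(uint32_t ioasid,
                                      ioasid_fault_handler_t handler,
                                      void *data);

    /* /dev/ioasid itself keeps the per-device registration with the iommu
     * core; its job is only translation and routing of the raw fault,
     * (dev, pasid, addr) -> (ioasid, addr), to the handler above. */
    int ioasid_route_device_fault(struct device *dev, uint32_t pasid,
                                  uint64_t addr);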

Best regards,
baolu

2021-06-01 13:12:53

by Shenming Lu

[permalink] [raw]
Subject: Re: [RFC] /dev/ioasid uAPI proposal

On 2021/6/1 20:30, Lu Baolu wrote:
> On 2021/6/1 15:15, Shenming Lu wrote:
>> On 2021/6/1 13:10, Lu Baolu wrote:
>>> Hi Shenming,
>>>
>>> On 6/1/21 12:31 PM, Shenming Lu wrote:
>>>> On 2021/5/27 15:58, Tian, Kevin wrote:
>>>>> /dev/ioasid provides an unified interface for managing I/O page tables for
>>>>> devices assigned to userspace. Device passthrough frameworks (VFIO, vDPA,
>>>>> etc.) are expected to use this interface instead of creating their own logic to
>>>>> isolate untrusted device DMAs initiated by userspace.
>>>>>
>>>>> This proposal describes the uAPI of /dev/ioasid and also sample sequences
>>>>> with VFIO as example in typical usages. The driver-facing kernel API provided
>>>>> by the iommu layer is still TBD, which can be discussed after consensus is
>>>>> made on this uAPI.
>>>>>
>>>>> It's based on a lengthy discussion starting from here:
>>>>>      https://lore.kernel.org/linux-iommu/[email protected]/
>>>>>
>>>>> It ends up to be a long writing due to many things to be summarized and
>>>>> non-trivial effort required to connect them into a complete proposal.
>>>>> Hope it provides a clean base to converge.
>>>>>
>>>> [..]
>>>>
>>>>> /*
>>>>>     * Page fault report and response
>>>>>     *
>>>>>     * This is TBD. Can be added after other parts are cleared up. Likely it
>>>>>     * will be a ring buffer shared between user/kernel, an eventfd to notify
>>>>>     * the user and an ioctl to complete the fault.
>>>>>     *
>>>>>     * The fault data is per I/O address space, i.e.: IOASID + faulting_addr
>>>>>     */
>>>> Hi,
>>>>
>>>> It seems that the ioasid has different usage in different situation, it could
>>>> be directly used in the physical routing, or just a virtual handle that indicates
>>>> a page table or a vPASID table (such as the GPA address space, in the simple
>>>> passthrough case, the DMA input to IOMMU will just contain a Stream ID, no
>>>> Substream ID), right?
>>>>
>>>> And Baolu suggested that since one device might consume multiple page tables,
>>>> it's more reasonable to have one fault handler per page table. By this, do we
>>>> have to maintain such an ioasid info list in the IOMMU layer?
>>> As discussed earlier, the I/O page fault and cache invalidation paths
>>> will have "device labels" so that the information could be easily
>>> translated and routed.
>>>
>>> So it's likely the per-device fault handler registering API in iommu
>>> core can be kept, but /dev/ioasid will be grown with a layer to
>>> translate and propagate I/O page fault information to the right
>>> consumers.
>> Yeah, having a general preprocessing of the faults in IOASID seems to be
>> a doable direction. But since there may be more than one consumer at the
>> same time, who is responsible for registering the per-device fault handler?
>
> The drivers register per page table fault handlers to /dev/ioasid which
> will then register itself to iommu core to listen and route the per-
> device I/O page faults. This is just a top level thought. Haven't gone
> through the details yet. Need to wait and see what /dev/ioasid finally
> looks like.

OK. And it needs to be confirmed by Jean since we might migrate the code from
io-pgfault.c to IOASID... Anyway, finalize /dev/ioasid first. Thanks,

Shenming

>
> Best regards,
> baolu
> .

2021-06-01 17:26:31

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [RFC] /dev/ioasid uAPI proposal

On Tue, Jun 01, 2021 at 11:08:53AM +0800, Lu Baolu wrote:
> On 6/1/21 2:09 AM, Jason Gunthorpe wrote:
> > > > device bind should fail if the device somehow isn't compatible with
> > > > the scheme the user is trying to use.
> > > yeah, I guess you mean to fail the device attach when the IOASID is a
> > > nesting IOASID but the device is behind an iommu without nesting support.
> > > right?
> > Right..
>
> Just want to confirm...
>
> Does this mean that we only support hardware nesting and don't want to
> have soft nesting (shadowed page table in kernel) in IOASID?

No, the uAPI presents a contract, if the kernel can fulfill the
contract then it should be supported.

If you want SW nesting then the kernel has to have the SW support for
it or fail.

At least for the purposes of the document I wouldn't delve too much deeper
into that question.

Jason

2021-06-01 17:28:40

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [RFC] /dev/ioasid uAPI proposal

On Tue, Jun 01, 2021 at 07:09:21PM +0800, Lu Baolu wrote:

> This version only covers 1) and 4). Do you think we need to support 2),
> 3) and beyond?

Yes, absolutely. The API should be flexible enough to specify the
creation of all future page table formats we'd want to have and all HW
specific details on those formats.

> If so, it seems that we need some in-kernel helpers and uAPIs to
> support pre-installing a page table to IOASID.

Not sure what this means..

> From this point of view an IOASID is actually not just a variant of
> iommu_domain, but an I/O page table representation in a broader
> sense.

Yes, and things need to evolve in a staged way. The ioctl API should
have room for this growth, but you need to start out with something
constrained enough to actually implement and then figure out how to grow
from there.

Jason

2021-06-01 17:31:17

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [RFC] /dev/ioasid uAPI proposal

On Tue, Jun 01, 2021 at 02:07:05PM +0800, Jason Wang wrote:

> For the case of 1M, I would like to know what's the use case for a single
> process to handle 1M+ address spaces?

For some scenarios every guest PASID will require an IOASID ID #, so
there is a large enough demand that FDs alone are not a good fit.

Further, there are global container-wide properties that are hard to
carry over to a multi-FD model, like the attachment of devices to the
container at startup.

> > So this RFC treats the fd as a container of address spaces, each of which
> > is tagged by an IOASID.
>
> If the container and address space is 1:1 then the container seems useless.

The examples at the bottom of the document show multiple IOASIDs in
the container for a parent/child type relationship

Jason

2021-06-01 17:33:57

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [RFC] /dev/ioasid uAPI proposal

On Tue, Jun 01, 2021 at 04:47:15PM +0800, Jason Wang wrote:

> We can open up to ~0U file descriptors, I don't see why we need to restrict
> it in uAPI.

There are significant problems with such large file descriptor
tables. High FD numbers mean things like select don't work at all
anymore and IIRC there are more complications.

A huge number of FDs for typical usages should be avoided.

Jason

2021-06-01 17:34:13

by Parav Pandit

[permalink] [raw]
Subject: RE: [RFC] /dev/ioasid uAPI proposal

> From: Tian, Kevin <[email protected]>
> Sent: Thursday, May 27, 2021 1:28 PM

> 5.6. I/O page fault
> +++++++++++++++
>
> (uAPI is TBD. Here is just about the high-level flow from host IOMMU driver
> to guest IOMMU driver and backwards).
>
> - Host IOMMU driver receives a page request with raw fault_data {rid,
> pasid, addr};
>
> - Host IOMMU driver identifies the faulting I/O page table according to
> information registered by IOASID fault handler;
>
> - IOASID fault handler is called with raw fault_data (rid, pasid, addr), which
> is saved in ioasid_data->fault_data (used for response);
>
> - IOASID fault handler generates an user fault_data (ioasid, addr), links it
> to the shared ring buffer and triggers eventfd to userspace;
>
> - Upon received event, Qemu needs to find the virtual routing information
> (v_rid + v_pasid) of the device attached to the faulting ioasid. If there are
> multiple, pick a random one. This should be fine since the purpose is to
> fix the I/O page table on the guest;
>
> - Qemu generates a virtual I/O page fault through vIOMMU into guest,
> carrying the virtual fault data (v_rid, v_pasid, addr);
>
Why does it have to be through the vIOMMU?
For a VFIO PCI device, have you considered reusing the same PRI interface to inject the page fault into the guest?
This eliminates any new v_rid.
It will also route the page fault request and response through the right vfio device.

> - Guest IOMMU driver fixes up the fault, updates the I/O page table, and
> then sends a page response with virtual completion data (v_rid, v_pasid,
> response_code) to vIOMMU;
>
What about fixing up the fault for the mmu page table as well in the guest?
Or did you mean both when you said "updates the I/O page table" above?

It is unclear to me whether there is a single nested page table maintained or two (one for cr3 references and the other for the iommu).
Can you please clarify?

> - Qemu finds the pending fault event, converts virtual completion data
> into (ioasid, response_code), and then calls a /dev/ioasid ioctl to
> complete the pending fault;
>
For a VFIO PCI device, if a virtual PRI request/response interface is done, it can be a generic interface among multiple vIOMMUs.

> - /dev/ioasid finds out the pending fault data {rid, pasid, addr} saved in
> ioasid_data->fault_data, and then calls iommu api to complete it with
> {rid, pasid, response_code};
>
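
For illustration only, the userspace half of this flow might look roughly
like the sketch below; the fault uAPI is TBD, so the ring layout, struct
fields and the IOASID_COMPLETE_FAULT ioctl name are all made up:

    #include <stdint.h>
    #include <unistd.h>
    #include <sys/ioctl.h>
    #include <linux/ioctl.h>

    struct ioasid_fault_event {          /* entry in the shared ring buffer */
            uint32_t ioasid;
            uint64_t addr;
            uint64_t cookie;             /* identifies the fault for completion */
    };

    struct ioasid_fault_response {
            uint64_t cookie;
            uint32_t response_code;      /* success / invalid / failure */
    };

    /* hypothetical ioctl number, for the sketch only */
    #define IOASID_COMPLETE_FAULT _IOW('i', 0x40, struct ioasid_fault_response)

    static void handle_one_fault(int ioasid_fd, int event_fd,
                                 struct ioasid_fault_event *ring,
                                 unsigned int *tail)
    {
            uint64_t n;

            /* kernel signalled new faults via the eventfd */
            if (read(event_fd, &n, sizeof(n)) != sizeof(n))
                    return;

            struct ioasid_fault_event ev = ring[*tail];
            *tail = (*tail + 1) % 256;   /* ring size assumed to be 256 */

            /* Qemu would look up (v_rid, v_pasid) for ev.ioasid, inject a
             * vIOMMU fault into the guest and wait for the guest's page
             * response; once it has one it completes the host-side fault: */
            struct ioasid_fault_response resp = {
                    .cookie = ev.cookie,
                    .response_code = 0,  /* success */
            };
            ioctl(ioasid_fd, IOASID_COMPLETE_FAULT, &resp);
    }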

2021-06-01 17:36:02

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [RFC] /dev/ioasid uAPI proposal

On Tue, Jun 01, 2021 at 08:30:35PM +0800, Lu Baolu wrote:

> The drivers register per page table fault handlers to /dev/ioasid which
> will then register itself to iommu core to listen and route the per-
> device I/O page faults.

I'm still confused why drivers need fault handlers at all?

Jason

2021-06-01 17:38:06

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [RFC] /dev/ioasid uAPI proposal

On Tue, Jun 01, 2021 at 12:04:00PM +0000, Parav Pandit wrote:
>
>
> > From: Jason Gunthorpe <[email protected]>
> > Sent: Monday, May 31, 2021 11:43 PM
> >
> > On Mon, May 31, 2021 at 05:37:35PM +0000, Parav Pandit wrote:
> >
> > > In that case, can it be a new system call? Why does it have to be under
> > /dev/ioasid?
> > > For example few years back such system call mpin() thought was proposed
> > in [1].
> >
> > Reference counting of the overall pins are required
> >
> > So when a pinned pages is incorporated into an IOASID page table in a later
> > IOCTL it means it cannot be unpinned while the IOASID page table is using it.
> Ok. but cant it use the same refcount of that mmu uses?

Manipulating that refcount is part of the overhead that is trying to
be avoided here, plus ensuring that the pinned pages accounting
doesn't get out of sync with the actual count of pinned pages!

> > Then the ioasid's interval tree would be mapped into a page table tree in HW
> > format.

> Does it mean in simple use case [1], second level page table copy is
> maintained in the IOMMU side via map interface? I hope not. It
> should use the same as what mmu uses, right?

Not a full page by page copy, but some interval reference.

Jason

2021-06-01 17:44:28

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [RFC] /dev/ioasid uAPI proposal

On Tue, Jun 01, 2021 at 08:10:14AM +0000, Tian, Kevin wrote:
> > From: Jason Gunthorpe <[email protected]>
> > Sent: Saturday, May 29, 2021 1:36 AM
> >
> > On Thu, May 27, 2021 at 07:58:12AM +0000, Tian, Kevin wrote:
> >
> > > IOASID nesting can be implemented in two ways: hardware nesting and
> > > software nesting. With hardware support the child and parent I/O page
> > > tables are walked consecutively by the IOMMU to form a nested translation.
> > > When it's implemented in software, the ioasid driver is responsible for
> > > merging the two-level mappings into a single-level shadow I/O page table.
> > > Software nesting requires both child/parent page tables operated through
> > > the dma mapping protocol, so any change in either level can be captured
> > > by the kernel to update the corresponding shadow mapping.
> >
> > Why? A SW emulation could do this synchronization during invalidation
> > processing if invalidation contained an IOVA range.
>
> In this proposal we differentiate between host-managed and user-
> managed I/O page tables. If host-managed, the user is expected to use
> map/unmap cmd explicitly upon any change required on the page table.
> If user-managed, the user first binds its page table to the IOMMU and
> then use invalidation cmd to flush iotlb when necessary (e.g. typically
> not required when changing a PTE from non-present to present).
>
> We expect user to use map+unmap and bind+invalidate respectively
> instead of mixing them together. Following this policy, map+unmap
> must be used in both levels for software nesting, so changes in either
> level are captured timely to synchronize the shadow mapping.

map+unmap or bind+invalidate is a policy of the IOASID itself set when
it is created. If you put two different types in a tree then each IOASID
must continue to use its own operation mode.

I don't see a reason to force all IOASIDs in a tree to be consistent??

A software emulated two level page table where the leaf level is a
bound page table in guest memory should continue to use
bind/invalidate to maintain the guest page table IOASID even though it
is a SW construct.

The GPA level should use map/unmap because it is a kernel owned page
table

Though how to efficiently mix map/unmap on the GPA when there are SW
nested levels below it looks to be quite challenging.
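
To make the mixed tree concrete, a rough sketch in the style of the RFC's
own flow examples (the GPA parent uses map/unmap, the bound guest-table
child uses bind/invalidate; IOASID_MAP_DMA is a placeholder name for the
map command, which isn't quoted here):

	/* parent: GPA address space, kernel-managed, uses map/unmap */
	gpa_ioasid = ioctl(ioasid_fd, IOASID_ALLOC);
	dma_map = {
		.ioasid	= gpa_ioasid;
		.iova	= 0;
		.vaddr	= guest_ram;	/* GPA backing in the VMM */
		.size	= ram_size;
	};
	ioctl(ioasid_fd, IOASID_MAP_DMA, &dma_map);	/* placeholder name */

	/* child: guest I/O page table, user-managed, uses bind/invalidate */
	gva_ioasid = ioctl(ioasid_fd, IOASID_CREATE_NESTING, gpa_ioasid);
	bind_data = {
		.ioasid	= gva_ioasid;
		.addr	= gva_pgtable;
		// and format information
	};
	ioctl(ioasid_fd, IOASID_BIND_PGTABLE, &bind_data);

	/* guest TLB flushes arrive as invalidate cmds on gva_ioasid;
	 * map/unmap is never used on the child */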

Jason

2021-06-01 17:58:45

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [RFC] /dev/ioasid uAPI proposal

On Tue, Jun 01, 2021 at 08:38:00AM +0000, Tian, Kevin wrote:
> > From: Jason Gunthorpe <[email protected]>
> > Sent: Saturday, May 29, 2021 3:59 AM
> >
> > On Thu, May 27, 2021 at 07:58:12AM +0000, Tian, Kevin wrote:
> > >
> > > 5. Use Cases and Flows
> > >
> > > Here assume VFIO will support a new model where every bound device
> > > is explicitly listed under /dev/vfio thus a device fd can be acquired w/o
> > > going through legacy container/group interface. For illustration purpose
> > > those devices are just called dev[1...N]:
> > >
> > > device_fd[1...N] = open("/dev/vfio/devices/dev[1...N]", mode);
> > >
> > > As explained earlier, one IOASID fd is sufficient for all intended use cases:
> > >
> > > ioasid_fd = open("/dev/ioasid", mode);
> > >
> > > For simplicity below examples are all made for the virtualization story.
> > > They are representative and could be easily adapted to a non-virtualization
> > > scenario.
> >
> > For others, I don't think this is *strictly* necessary, we can
> > probably still get to the device_fd using the group_fd and fit in
> > /dev/ioasid. It does make the rest of this more readable though.
>
> Jason, want to confirm here. Per earlier discussion we were left with the
> impression that you want VFIO to be a pure device driver, thus
> container/group are used only for legacy applications.

Let me call this a "nice wish".

If you get to a point where you hard need this, then identify the hard
requirement and let's do it, but I wouldn't bloat this already large
project unnecessarily.

Similarly I wouldn't depend on the group fd existing in this design
so it could be changed later.

> From this comment are you suggesting that VFIO can still keep
> container/ group concepts and user just deprecates the use of vfio
> iommu uAPI (e.g. VFIO_SET_IOMMU) by using /dev/ioasid (which has a
> simple policy that an IOASID will reject cmd if partially-attached
> group exists)?

I would say no on the container. /dev/ioasid == the container, having
two competing objects at once in a single process is just a mess.

Whether the group fd can be kept requires charting a path through the
ioctls where the container is not used and /dev/ioasid is sub'd in
using the same device-FD-specific IOCTLs you show here.

I didn't try to chart this out carefully.

Also, ultimately, something needs to be done about compatibility with
the vfio container fd. It looks clear enough to me that the VFIO
container FD is just a single IOASID using a special ioctl interface,
so it would be quite reasonable to harmonize these somehow.

But that is too complicated and far out for me at least to guess on at
this point..

> > Still a little unsure why the vPASID is here not on the gva_ioasid. Is
> > there any scenario where we want different vpasid's for the same
> > IOASID? I guess it is OK like this. Hum.
>
> Yes, it's completely sane that the guest links a I/O page table to
> different vpasids on dev1 and dev2. The IOMMU doesn't mandate
> that when multiple devices share an I/O page table they must use
> the same PASID#.

Ok..

Jason

2021-06-01 20:30:28

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [RFC] /dev/ioasid uAPI proposal

On Tue, Jun 01, 2021 at 07:01:57AM +0000, Tian, Kevin wrote:
> > From: Jason Gunthorpe <[email protected]>
> > Sent: Saturday, May 29, 2021 4:03 AM
> >
> > On Thu, May 27, 2021 at 07:58:12AM +0000, Tian, Kevin wrote:
> > > /dev/ioasid provides an unified interface for managing I/O page tables for
> > > devices assigned to userspace. Device passthrough frameworks (VFIO,
> > vDPA,
> > > etc.) are expected to use this interface instead of creating their own logic to
> > > isolate untrusted device DMAs initiated by userspace.
> >
> > It is very long, but I think this has turned out quite well. It
> > certainly matches the basic sketch I had in my head when we were
> > talking about how to create vDPA devices a few years ago.
> >
> > When you get down to the operations they all seem pretty common sense
> > and straightforward. Create an IOASID. Connect to a device. Fill the
> > IOASID with pages somehow. Worry about PASID labeling.
> >
> > It really is critical to get all the vendor IOMMU people to go over it
> > and see how their HW features map into this.
> >
>
> Agree. btw I feel it might be good to have several design opens
> centrally discussed after going through all the comments. Otherwise
> they may be buried in different sub-threads and potentially with
> insufficient care (especially for people who haven't completed the
> reading).
>
> I summarized five opens here, about:
>
> 1) Finalizing the name to replace /dev/ioasid;
> 2) Whether one device is allowed to bind to multiple IOASID fd's;
> 3) Carry device information in invalidation/fault reporting uAPI;
> 4) What should/could be specified when allocating an IOASID;
> 5) The protocol between vfio group and kvm;
>
> For 1), two alternative names are mentioned: /dev/iommu and
> /dev/ioas. I don't have a strong preference and would like to hear
> votes from all stakeholders. /dev/iommu is slightly better imho for
> two reasons. First, per AMD's presentation in last KVM forum they
> implement vIOMMU in hardware thus need to support user-managed
> domains. An iommu uAPI notation might make more sense moving
> forward. Second, it makes later uAPI naming easier as 'IOASID' can
> be always put as an object, e.g. IOMMU_ALLOC_IOASID instead of
> IOASID_ALLOC_IOASID. :)

I think two years ago I suggested /dev/iommu and it didn't go very far
at the time. We've also talked about this as /dev/sva for a while and
now /dev/ioasid

I think /dev/iommu is fine, and call the things inside them IOAS
objects.

Then we don't have naming aliasing with kernel constructs.

> For 2), Jason prefers not to block it if there is no kernel design reason. If
> one device is allowed to bind multiple IOASID fd's, the main problem
> is about cross-fd IOASID nesting, e.g. having gpa_ioasid created in fd1
> and giova_ioasid created in fd2 and then nesting them together (and

Huh? This can't happen

Creating an IOASID is an operation on the /dev/ioasid FD. We won't
provide APIs to create a tree of IOASID's outside a single FD container.

If a device can consume multiple IOASID's it doesn't care how many or
what /dev/ioasid FDs they come from.

> At the other end there was also the thought of whether we should make
> a single I/O address space per IOASID fd. It was discussed in the previous
> thread that #fd's are insufficient to afford the theoretical 1M address
> spaces per device. But let's have another revisit and draw a clear
> conclusion whether this option is viable.

I had remarks on this, I think per-fd doesn't work

> This implies that VFIO_BOUND_IOASID will be extended to allow user
> specify a device label. This label will be recorded in /dev/iommu to
> serve per-device invalidation request from and report per-device
> fault data to the user.

I wonder which of the user providing a 64 bit cookie or the kernel
returning a small IDA is the best choice here? Both have merits
depending on what qemu needs..

> In addition, vPASID (if provided by user) will
> be also recorded in /dev/iommu so vPASID<->pPASID conversion
> is conducted properly. e.g. invalidation request from user carries
> a vPASID which must be converted into pPASID before calling iommu
> driver. Vice versa for raw fault data which carries pPASID while the
> user expects a vPASID.

I don't think the PASID should be returned at all. It should return
the IOASID number in the FD and/or a u64 cookie associated with that
IOASID. Userspace should figure out what the IOASID & device
combination means.

> Seems to close this design open we have to touch the kAPI design. and
> Joerg's input is highly appreciated here.

uAPI is forever, the kAPI is constantly changing. I always dislike
warping the uAPI based on the current kAPI situation.

Jason

2021-06-01 22:32:21

by Alex Williamson

[permalink] [raw]
Subject: Re: [RFC] /dev/ioasid uAPI proposal

On Tue, 1 Jun 2021 07:01:57 +0000
"Tian, Kevin" <[email protected]> wrote:
>
> I summarized five opens here, about:
>
> 1) Finalizing the name to replace /dev/ioasid;
> 2) Whether one device is allowed to bind to multiple IOASID fd's;
> 3) Carry device information in invalidation/fault reporting uAPI;
> 4) What should/could be specified when allocating an IOASID;
> 5) The protocol between vfio group and kvm;
>
...
>
> For 5), I'd expect Alex to chime in. Per my understanding it looks like the
> original purpose of this protocol is not about I/O address space. It's
> for KVM to know whether any device is assigned to this VM and then
> do something special (e.g. posted interrupt, EPT cache attribute, etc.).

Right, the original use case was for KVM to determine whether it needs
to emulate invlpg, so it needs to be aware when an assigned device is
present and be able to test if DMA for that device is cache coherent.
The user, QEMU, creates a KVM "pseudo" device representing the vfio
group, providing the file descriptor of that group to show ownership.
The ugly symbol_get code is to avoid hard module dependencies, ie. the
kvm module should not pull in or require the vfio module, but vfio will
be present if attempting to register this device.

With kvmgt, the interface also became a way to register the kvm pointer
with vfio for the translation mentioned elsewhere in this thread.

The PPC/SPAPR support allows KVM to associate a vfio group to an IOMMU
page table so that it can handle iotlb programming from pre-registered
memory without trapping out to userspace.

> Because KVM deduces some policy based on the fact of an assigned device,
> it needs to hold a reference to the related vfio group. This part is irrelevant
> to this RFC.

All of these use cases are related to the IOMMU, whether DMA is
coherent, translating device IOVA to GPA, and an acceleration path to
emulate IOMMU programming in kernel... they seem pretty relevant.

> But ARM's VMID usage is related to I/O address space thus needs some
> consideration. Another strange thing is about PPC. Looks it also leverages
> this protocol to do iommu group attach: kvm_spapr_tce_attach_iommu_
> group. I don't know why it's done through KVM instead of VFIO uAPI in
> the first place.

AIUI, IOMMU programming on PPC is done through hypercalls, so KVM needs
to know how to handle those for in-kernel acceleration. Thanks,

Alex

2021-06-02 02:15:49

by Tian, Kevin

[permalink] [raw]
Subject: RE: [RFC] /dev/ioasid uAPI proposal

> From: Jason Gunthorpe <[email protected]>
> Sent: Wednesday, June 2, 2021 1:57 AM
>
> On Tue, Jun 01, 2021 at 08:38:00AM +0000, Tian, Kevin wrote:
> > > From: Jason Gunthorpe <[email protected]>
> > > Sent: Saturday, May 29, 2021 3:59 AM
> > >
> > > On Thu, May 27, 2021 at 07:58:12AM +0000, Tian, Kevin wrote:
> > > >
> > > > 5. Use Cases and Flows
> > > >
> > > > Here assume VFIO will support a new model where every bound device
> > > > is explicitly listed under /dev/vfio thus a device fd can be acquired w/o
> > > > going through legacy container/group interface. For illustration purpose
> > > > those devices are just called dev[1...N]:
> > > >
> > > > device_fd[1...N] = open("/dev/vfio/devices/dev[1...N]", mode);
> > > >
> > > > As explained earlier, one IOASID fd is sufficient for all intended use
> cases:
> > > >
> > > > ioasid_fd = open("/dev/ioasid", mode);
> > > >
> > > > For simplicity below examples are all made for the virtualization story.
> > > > They are representative and could be easily adapted to a non-
> virtualization
> > > > scenario.
> > >
> > > For others, I don't think this is *strictly* necessary, we can
> > > probably still get to the device_fd using the group_fd and fit in
> > > /dev/ioasid. It does make the rest of this more readable though.
> >
> > Jason, want to confirm here. Per earlier discussion we remain an
> > impression that you want VFIO to be a pure device driver thus
> > container/group are used only for legacy application.
>
> Let me call this a "nice wish".
>
> If you get to a point where you hard need this, then identify the hard
> requirement and let's do it, but I wouldn't bloat this already large
> project unnecessarily.
>

OK, got your point. So let's start by keeping this room. New
sub-systems like vDPA don't need to invent a group fd uAPI
and can just leave it to their users to meet the group limitation. The existing
sub-system, i.e. VFIO, could keep a stronger group enforcement
uAPI like today. One day, we may revisit it if the simple policy works
well for all other new sub-systems.

> Similarly I wouldn't depend on the group fd existing in this design
> so it could be changed later.

Yes, this is guaranteed. /dev/ioasid uAPI has no group concept.

>
> > From this comment are you suggesting that VFIO can still keep
> > container/ group concepts and user just deprecates the use of vfio
> > iommu uAPI (e.g. VFIO_SET_IOMMU) by using /dev/ioasid (which has a
> > simple policy that an IOASID will reject cmd if partially-attached
> > group exists)?
>
> I would say no on the container. /dev/ioasid == the container, having
> two competing objects at once in a single process is just a mess.
>
> If the group fd can be kept requires charting a path through the
> ioctls where the container is not used and /dev/ioasid is sub'd in
> using the same device FD specific IOCTLs you show here.

yes

>
> I didn't try to chart this out carefully.
>
> Also, ultimately, something needs to be done about compatibility with
> the vfio container fd. It looks clear enough to me that the VFIO
> container FD is just a single IOASID using a special ioctl interface,
> so it would be quite reasonable to harmonize these somehow.

Possibly multiple IOASIDs, as a VFIO container can hold incompatible devices
today. Suppose helper functions will be provided for the VFIO container to
create an IOASID and then use map/unmap to manage its I/O page table.
This is the shim iommu driver concept in previous discussion between
you and Alex.

This can be done at a later stage. Let's focus on /dev/ioasid uAPI, and
bear some code duplication between it and vfio type1 for now.

>
> But that is too complicated and far out for me at least to guess on at
> this point..

We're working on a prototype in parallel with this discussion. Based on
this work we'll figure out what's the best way to start with.

Thanks
Kevin

2021-06-02 02:21:38

by Tian, Kevin

[permalink] [raw]
Subject: RE: [RFC] /dev/ioasid uAPI proposal

> From: Alex Williamson <[email protected]>
> Sent: Wednesday, June 2, 2021 6:22 AM
>
> On Tue, 1 Jun 2021 07:01:57 +0000
> "Tian, Kevin" <[email protected]> wrote:
> >
> > I summarized five opens here, about:
> >
> > 1) Finalizing the name to replace /dev/ioasid;
> > 2) Whether one device is allowed to bind to multiple IOASID fd's;
> > 3) Carry device information in invalidation/fault reporting uAPI;
> > 4) What should/could be specified when allocating an IOASID;
> > 5) The protocol between vfio group and kvm;
> >
> ...
> >
> > For 5), I'd expect Alex to chime in. Per my understanding looks the
> > original purpose of this protocol is not about I/O address space. It's
> > for KVM to know whether any device is assigned to this VM and then
> > do something special (e.g. posted interrupt, EPT cache attribute, etc.).
>
> Right, the original use case was for KVM to determine whether it needs
> to emulate invlpg, so it needs to be aware when an assigned device is

invlpg -> wbinvd :)

> present and be able to test if DMA for that device is cache coherent.
> The user, QEMU, creates a KVM "pseudo" device representing the vfio
> group, providing the file descriptor of that group to show ownership.
> The ugly symbol_get code is to avoid hard module dependencies, ie. the
> kvm module should not pull in or require the vfio module, but vfio will
> be present if attempting to register this device.

So the symbol_get thing is not about the protocol itself. Whatever protocol
is defined, as long as kvm needs to call a vfio or ioasid helper function, we
need to define a proper way to do it. Jason, what's your opinion on an alternative
option, since you dislike symbol_get?

>
> With kvmgt, the interface also became a way to register the kvm pointer
> with vfio for the translation mentioned elsewhere in this thread.
>
> The PPC/SPAPR support allows KVM to associate a vfio group to an IOMMU
> page table so that it can handle iotlb programming from pre-registered
> memory without trapping out to userspace.
>
> > Because KVM deduces some policy based on the fact of assigned device,
> > it needs to hold a reference to related vfio group. this part is irrelevant
> > to this RFC.
>
> All of these use cases are related to the IOMMU, whether DMA is
> coherent, translating device IOVA to GPA, and an acceleration path to
> emulate IOMMU programming in kernel... they seem pretty relevant.

One open is whether kvm should hold a device reference or IOASID
reference. For DMA coherence, it only matters whether assigned
devices are coherent or not (not for a specific address space). For kvmgt,
it is for recording the kvm pointer in the mdev driver to do write protection. For
ppc, it does relate to a specific I/O page table.

Then I feel only a part of the protocol will be moved to /dev/ioasid and
something will still remain between kvm and vfio?

>
> > But ARM's VMID usage is related to I/O address space thus needs some
> > consideration. Another strange thing is about PPC. Looks it also leverages
> > this protocol to do iommu group attach: kvm_spapr_tce_attach_iommu_
> > group. I don't know why it's done through KVM instead of VFIO uAPI in
> > the first place.
>
> AIUI, IOMMU programming on PPC is done through hypercalls, so KVM
> needs
> to know how to handle those for in-kernel acceleration. Thanks,
>

ok.

Thanks
Kevin

2021-06-02 03:44:00

by Tian, Kevin

[permalink] [raw]
Subject: RE: [RFC] /dev/ioasid uAPI proposal

> From: Jason Gunthorpe <[email protected]>
> Sent: Wednesday, June 2, 2021 4:29 AM
>
> On Tue, Jun 01, 2021 at 07:01:57AM +0000, Tian, Kevin wrote:
> > > From: Jason Gunthorpe <[email protected]>
> > > Sent: Saturday, May 29, 2021 4:03 AM
> > >
> > > On Thu, May 27, 2021 at 07:58:12AM +0000, Tian, Kevin wrote:
> > > > /dev/ioasid provides an unified interface for managing I/O page tables
> for
> > > > devices assigned to userspace. Device passthrough frameworks (VFIO,
> > > vDPA,
> > > > etc.) are expected to use this interface instead of creating their own
> logic to
> > > > isolate untrusted device DMAs initiated by userspace.
> > >
> > > It is very long, but I think this has turned out quite well. It
> > > certainly matches the basic sketch I had in my head when we were
> > > talking about how to create vDPA devices a few years ago.
> > >
> > > When you get down to the operations they all seem pretty common
> sense
> > > and straightforward. Create an IOASID. Connect to a device. Fill the
> > > IOASID with pages somehow. Worry about PASID labeling.
> > >
> > > It really is critical to get all the vendor IOMMU people to go over it
> > > and see how their HW features map into this.
> > >
> >
> > Agree. btw I feel it might be good to have several design opens
> > centrally discussed after going through all the comments. Otherwise
> > they may be buried in different sub-threads and potentially with
> > insufficient care (especially for people who haven't completed the
> > reading).
> >
> > I summarized five opens here, about:
> >
> > 1) Finalizing the name to replace /dev/ioasid;
> > 2) Whether one device is allowed to bind to multiple IOASID fd's;
> > 3) Carry device information in invalidation/fault reporting uAPI;
> > 4) What should/could be specified when allocating an IOASID;
> > 5) The protocol between vfio group and kvm;
> >
> > For 1), two alternative names are mentioned: /dev/iommu and
> > /dev/ioas. I don't have a strong preference and would like to hear
> > votes from all stakeholders. /dev/iommu is slightly better imho for
> > two reasons. First, per AMD's presentation in last KVM forum they
> > implement vIOMMU in hardware thus need to support user-managed
> > domains. An iommu uAPI notation might make more sense moving
> > forward. Second, it makes later uAPI naming easier as 'IOASID' can
> > be always put as an object, e.g. IOMMU_ALLOC_IOASID instead of
> > IOASID_ALLOC_IOASID. :)
>
> I think two years ago I suggested /dev/iommu and it didn't go very far
> at the time. We've also talked about this as /dev/sva for a while and
> now /dev/ioasid
>
> I think /dev/iommu is fine, and call the things inside them IOAS
> objects.
>
> Then we don't have naming aliasing with kernel constructs.
>
> > For 2), Jason prefers not to block it if there is no kernel design reason. If
> > one device is allowed to bind multiple IOASID fd's, the main problem
> > is about cross-fd IOASID nesting, e.g. having gpa_ioasid created in fd1
> > and giova_ioasid created in fd2 and then nesting them together (and
>
> Huh? This can't happen
>
> Creating an IOASID is an operation on the /dev/ioasid FD. We won't
> provide APIs to create a tree of IOASID's outside a single FD container.
>
> If a device can consume multiple IOASID's it doesn't care how many or
> what /dev/ioasid FDs they come from.

OK, this implies that if one user inadvertently creates an intended parent/
child via different fd's then the operation will simply fail. More specifically,
take ARM's case for example. There is only a single 2nd-level I/O page
table per device (nested by multiple 1st-level tables). Say the user already
creates a gpa_ioasid for a device via fd1. Now he binds the device to fd2,
intending to enable vSVA, which requires nested translation and thus needs
to create a parent via fd2. This parent creation will simply be failed by the IOMMU
layer because the 2nd-level (via fd1) is already installed for this device.

>
> > To the other end there was also thought whether we should make
> > a single I/O address space per IOASID fd. This was discussed in previous
> > thread that #fd's are insufficient to afford theoretical 1M's address
> > spaces per device. But let's have another revisit and draw a clear
> > conclusion whether this option is viable.
>
> I had remarks on this, I think per-fd doesn't work
>
> > This implies that VFIO_BOUND_IOASID will be extended to allow user
> > specify a device label. This label will be recorded in /dev/iommu to
> > serve per-device invalidation request from and report per-device
> > fault data to the user.
>
> I wonder which of the user providing a 64 bit cookie or the kernel
> returning a small IDA is the best choice here? Both have merits
> depending on what qemu needs..

Yes, either way can work. I don't have a strong preference. Jean?

>
> > In addition, vPASID (if provided by user) will
> > be also recorded in /dev/iommu so vPASID<->pPASID conversion
> > is conducted properly. e.g. invalidation request from user carries
> > a vPASID which must be converted into pPASID before calling iommu
> > driver. Vice versa for raw fault data which carries pPASID while the
> > user expects a vPASID.
>
> I don't think the PASID should be returned at all. It should return
> the IOASID number in the FD and/or a u64 cookie associated with that
> IOASID. Userspace should figure out what the IOASID & device
> combination means.

This is true for Intel. But what about ARM which has only one IOASID
(pasid table) per device to represent all guest I/O page tables?

>
> > Seems to close this design open we have to touch the kAPI design. and
> > Joerg's input is highly appreciated here.
>
> uAPI is forever, the kAPI is constantly changing. I always dislike
> warping the uAPI based on the current kAPI situation.
>

I got this point. My point was that I didn't see a significant gain from either
option, thus to better compare the two uAPI options we might want to
further consider the involved kAPI effort as another factor.

Thanks
Kevin

2021-06-02 03:44:14

by Tian, Kevin

[permalink] [raw]
Subject: RE: [RFC] /dev/ioasid uAPI proposal

> From: Jason Gunthorpe <[email protected]>
> Sent: Wednesday, June 2, 2021 1:42 AM
>
> On Tue, Jun 01, 2021 at 08:10:14AM +0000, Tian, Kevin wrote:
> > > From: Jason Gunthorpe <[email protected]>
> > > Sent: Saturday, May 29, 2021 1:36 AM
> > >
> > > On Thu, May 27, 2021 at 07:58:12AM +0000, Tian, Kevin wrote:
> > >
> > > > IOASID nesting can be implemented in two ways: hardware nesting and
> > > > software nesting. With hardware support the child and parent I/O page
> > > > tables are walked consecutively by the IOMMU to form a nested
> translation.
> > > > When it's implemented in software, the ioasid driver is responsible for
> > > > merging the two-level mappings into a single-level shadow I/O page
> table.
> > > > Software nesting requires both child/parent page tables operated
> through
> > > > the dma mapping protocol, so any change in either level can be
> captured
> > > > by the kernel to update the corresponding shadow mapping.
> > >
> > > Why? A SW emulation could do this synchronization during invalidation
> > > processing if invalidation contained an IOVA range.
> >
> > In this proposal we differentiate between host-managed and user-
> > managed I/O page tables. If host-managed, the user is expected to use
> > map/unmap cmd explicitly upon any change required on the page table.
> > If user-managed, the user first binds its page table to the IOMMU and
> > then use invalidation cmd to flush iotlb when necessary (e.g. typically
> > not required when changing a PTE from non-present to present).
> >
> > We expect user to use map+unmap and bind+invalidate respectively
> > instead of mixing them together. Following this policy, map+unmap
> > must be used in both levels for software nesting, so changes in either
> > level are captured timely to synchronize the shadow mapping.
>
> map+unmap or bind+invalidate is a policy of the IOASID itself set when
> it is created. If you put two different types in a tree then each IOASID
> must continue to use its own operation mode.
>
> I don't see a reason to force all IOASIDs in a tree to be consistent??

only for software nesting. With hardware support the parent uses map
while the child uses bind.

Yes, the policy is specified per IOASID. But if the policy violates the
requirement in a specific nesting mode, then nesting should fail.

>
> A software emulated two level page table where the leaf level is a
> bound page table in guest memory should continue to use
> bind/invalidate to maintain the guest page table IOASID even though it
> is a SW construct.

With software nesting the leaf should be a host-managed page table
(or metadata). A bind/invalidate protocol doesn't require the user
to notify the kernel of every page table change. But for software nesting
the kernel must know every change so that it can update the shadow/merged
mapping in time; otherwise DMA may hit a stale mapping.
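
Said differently, with software nesting every child-level map forces a walk
of the parent and an update of the single shadow table, roughly like the
sketch below (the struct and all helper names are hypothetical, illustrative
only):

	/* Illustrative only - not actual kernel code. */
	static int child_map(struct ioasid_data *child, u64 iova, u64 gpa, u64 size)
	{
		u64 off;

		for (off = 0; off < size; off += PAGE_SIZE) {
			/* consult the parent (GPA -> HPA) for every child map */
			u64 hpa = parent_lookup(child->parent, gpa + off);

			/* merged IOVA -> HPA entry goes into the shadow table */
			shadow_install(child->shadow_pgtable, iova + off, hpa);
		}
		return 0;
	}

	/* conversely, any map/unmap on the parent must revisit every shadow
	 * entry derived from the changed GPA range, which is why both levels
	 * need to go through the map/unmap protocol */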

>
> The GPA level should use map/unmap because it is a kernel owned page
> table

yes, this is always true.

>
> Though how to efficiently mix map/unmap on the GPA when there are SW
> nested levels below it looks to be quite challenging.
>

Thanks
Kevin

2021-06-02 04:21:46

by Lu Baolu

[permalink] [raw]
Subject: Re: [RFC] /dev/ioasid uAPI proposal

On 6/2/21 1:26 AM, Jason Gunthorpe wrote:
> On Tue, Jun 01, 2021 at 07:09:21PM +0800, Lu Baolu wrote:
>
>> This version only covers 1) and 4). Do you think we need to support 2),
>> 3) and beyond?
>
> Yes, absolutely. The API should be flexible enough to specify the
> creation of all future page table formats we'd want to have and all HW
> specific details on those formats.

OK, staying on the same line.

>> If so, it seems that we need some in-kernel helpers and uAPIs to
>> support pre-installing a page table to IOASID.
>
> Not sure what this means..

Sorry that I didn't make this clear.

Let me bring back the page table types in my eyes.

1) IOMMU format page table (a.k.a. iommu_domain)
2) user application CPU page table (SVA for example)
3) KVM EPT (future option)
4) VM guest managed page table (nesting mode)

Each type of page table should be able to be associated with its IOASID.
We have the BIND protocol for 4); we explicitly allocate an iommu_domain for
1). But we don't have a clear definition for 2), 3) and the others. I think
it's necessary to clearly define a time point and kAPI name between
IOASID_ALLOC and IOASID_ATTACH, so that other modules have the
opportunity to associate their page table with the allocated IOASID
before attaching the page table to the real IOMMU hardware.

I/O page fault handling is similar. The provider of the page table
should take the responsibility to handle the possible page faults.
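
A tiny sketch of what such an association point might look like on the
kernel side (both function names are hypothetical; kernel types assumed):

	/* Called by the page table provider (SVA, KVM, vIOMMU emulation, ...)
	 * after IOASID_ALLOC and before IOASID_ATTACH. */
	int ioasid_set_page_table(ioasid_t ioasid, unsigned int pgtable_type,
				  void *pgtable_data);

	/* The same provider supplies the fault handler for its page table,
	 * which is why drivers may need fault handlers at all. */
	int ioasid_set_fault_handler(ioasid_t ioasid,
				     int (*handler)(ioasid_t ioasid, u64 addr,
						    void *data),
				     void *data);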

Could this answer the question of "I'm still confused why drivers need
fault handlers at all?" in below thread?

https://lore.kernel.org/linux-iommu/PH0PR12MB54811863B392C644E5365446DC3E9@PH0PR12MB5481.namprd12.prod.outlook.com/T/#m15def9e8b236dfcf97e21c8e9f8a58da214e3691

>
>> From this point of view an IOASID is actually not just a variant of
>> iommu_domain, but an I/O page table representation in a broader
>> sense.
>
> Yes, and things need to evolve in a staged way. The ioctl API should
> have room for this growth, but you need to start out with something
> constrained enough to actually implement and then figure out how to grow
> from there.

Yes, agreed. I was just thinking about it from the perspective of a design
document.

Best regards,
baolu

2021-06-02 07:30:09

by David Gibson

[permalink] [raw]
Subject: Re: [RFC] /dev/ioasid uAPI proposal

On Fri, May 28, 2021 at 04:58:39PM -0300, Jason Gunthorpe wrote:
> On Thu, May 27, 2021 at 07:58:12AM +0000, Tian, Kevin wrote:
> >
> > 5. Use Cases and Flows
> >
> > Here assume VFIO will support a new model where every bound device
> > is explicitly listed under /dev/vfio thus a device fd can be acquired w/o
> > going through legacy container/group interface. For illustration purpose
> > those devices are just called dev[1...N]:
> >
> > device_fd[1...N] = open("/dev/vfio/devices/dev[1...N]", mode);
> >
> > As explained earlier, one IOASID fd is sufficient for all intended use cases:
> >
> > ioasid_fd = open("/dev/ioasid", mode);
> >
> > For simplicity below examples are all made for the virtualization story.
> > They are representative and could be easily adapted to a non-virtualization
> > scenario.
>
> For others, I don't think this is *strictly* necessary, we can
> probably still get to the device_fd using the group_fd and fit in
> /dev/ioasid. It does make the rest of this more readable though.

Leaving aside whether group fds should exist, while they *do* exist
binding to an IOASID should be done on the group not an individual
device.

[snip]
> > /* if dev1 is ENQCMD-capable mdev, update CPU PASID
> > * translation structure through KVM
> > */
> > pa_data = {
> > .ioasid_fd = ioasid_fd;
> > .ioasid = gva_ioasid;
> > .guest_pasid = gpasid1;
> > };
> > ioctl(kvm_fd, KVM_MAP_PASID, &pa_data);
>
> Make sense
>
> > /* Bind guest I/O page table */
> > bind_data = {
> > .ioasid = gva_ioasid;
> > .addr = gva_pgtable1;
> > // and format information
> > };
> > ioctl(ioasid_fd, IOASID_BIND_PGTABLE, &bind_data);
>
> Again I do wonder if this should just be part of alloc_ioasid. Is
> there any reason to split these things? The only advantage to the
> split is the device is known, but the device shouldn't impact
> anything..

I'm pretty sure the device(s) could matter, although they probably
won't usually. But it would certainly be possible for a system to
have two different host bridges with two different IOMMUs with
different pagetable formats. Until you know which devices (and
therefore which host bridge) you're talking about, you don't know what
formats of pagetable to accept. And if you have devices from *both*
bridges you can't bind a page table at all - you could theoretically
support a kernel managed pagetable by mirroring each MAP and UNMAP to
tables in both formats, but it would be pretty reasonable not to
support that.
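
To make the ordering concrete, a sketch in the style of the RFC's examples
(the info struct and its fields are illustrative):

	/* formats become known only once the attached device, and therefore
	 * the host bridge / IOMMU behind it, is known */
	gva_ioasid = ioctl(ioasid_fd, IOASID_CREATE_NESTING, gpa_ioasid);

	at_data = { .ioasid = gva_ioasid };
	ioctl(device_fd1, VFIO_ATTACH_IOASID, &at_data);

	/* now GET_INFO can report the formats this IOASID will accept,
	 * reflecting the IOMMU behind the attached device(s) */
	info = { .ioasid = gva_ioasid };
	ioctl(ioasid_fd, IOASID_GET_INFO, &info);

	/* bind only if the guest's format is among the reported ones;
	 * a device behind an incompatible bridge would instead fail the
	 * attach or the subsequent bind */
	ioctl(ioasid_fd, IOASID_BIND_PGTABLE, &bind_data);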

> > 5.6. I/O page fault
> > +++++++++++++++
> >
> > (uAPI is TBD. Here is just about the high-level flow from host IOMMU driver
> > to guest IOMMU driver and backwards).
> >
> > - Host IOMMU driver receives a page request with raw fault_data {rid,
> > pasid, addr};
> >
> > - Host IOMMU driver identifies the faulting I/O page table according to
> > information registered by IOASID fault handler;
> >
> > - IOASID fault handler is called with raw fault_data (rid, pasid, addr), which
> > is saved in ioasid_data->fault_data (used for response);
> >
> > - IOASID fault handler generates an user fault_data (ioasid, addr), links it
> > to the shared ring buffer and triggers eventfd to userspace;
>
> Here rid should be translated to a labeled device and return the
> device label from VFIO_BIND_IOASID_FD. Depending on how the device
> bound the label might match to a rid or to a rid,pasid

I like the idea of labelling devices when they're attached; it makes
extension to non-PCI devices much more obvious than having to deal
with concrete RIDs.

But, remember we can only (reliably) determine rid up to the group
boundary. So if you're labelling devices, all devices in a group
would have to have the same label. Or you attach the label to a group
not a device, which would be a reason to represent the group as an
object again.

> > - Upon received event, Qemu needs to find the virtual routing information
> > (v_rid + v_pasid) of the device attached to the faulting ioasid. If there are
> > multiple, pick a random one. This should be fine since the purpose is to
> > fix the I/O page table on the guest;
>
> The device label should fix this
>
> > - Qemu finds the pending fault event, converts virtual completion data
> > into (ioasid, response_code), and then calls a /dev/ioasid ioctl to
> > complete the pending fault;
> >
> > - /dev/ioasid finds out the pending fault data {rid, pasid, addr} saved in
> > ioasid_data->fault_data, and then calls iommu api to complete it with
> > {rid, pasid, response_code};
>
> So resuming a fault on an ioasid will resume all devices pending on
> the fault?
>
> > 5.7. BIND_PASID_TABLE
> > ++++++++++++++++++++
> >
> > PASID table is put in the GPA space on some platform, thus must be updated
> > by the guest. It is treated as another user page table to be bound with the
> > IOMMU.
> >
> > As explained earlier, the user still needs to explicitly bind every user I/O
> > page table to the kernel so the same pgtable binding protocol (bind, cache
> > invalidate and fault handling) is unified cross platforms.
> >
> > vIOMMUs may include a caching mode (or paravirtualized way) which, once
> > enabled, requires the guest to invalidate PASID cache for any change on the
> > PASID table. This allows Qemu to track the lifespan of guest I/O page tables.
> >
> > In case of missing such capability, Qemu could enable write-protection on
> > the guest PASID table to achieve the same effect.
> >
> > /* After boots */
> > /* Make vPASID space nested on GPA space */
> > pasidtbl_ioasid = ioctl(ioasid_fd, IOASID_CREATE_NESTING,
> > gpa_ioasid);
> >
> > /* Attach dev1 to pasidtbl_ioasid */
> > at_data = { .ioasid = pasidtbl_ioasid};
> > ioctl(device_fd1, VFIO_ATTACH_IOASID, &at_data);
> >
> > /* Bind PASID table */
> > bind_data = {
> > .ioasid = pasidtbl_ioasid;
> > .addr = gpa_pasid_table;
> > // and format information
> > };
> > ioctl(ioasid_fd, IOASID_BIND_PASID_TABLE, &bind_data);
> >
> > /* vIOMMU detects a new GVA I/O space created */
> > gva_ioasid = ioctl(ioasid_fd, IOASID_CREATE_NESTING,
> > gpa_ioasid);
> >
> > /* Attach dev1 to the new address space, with gpasid1 */
> > at_data = {
> > .ioasid = gva_ioasid;
> > .flag = IOASID_ATTACH_USER_PASID;
> > .user_pasid = gpasid1;
> > };
> > ioctl(device_fd1, VFIO_ATTACH_IOASID, &at_data);
> >
> > /* Bind guest I/O page table. Because SET_PASID_TABLE has been
> > * used, the kernel will not update the PASID table. Instead, just
> > * track the bound I/O page table for handling invalidation and
> > * I/O page faults.
> > */
> > bind_data = {
> > .ioasid = gva_ioasid;
> > .addr = gva_pgtable1;
> > // and format information
> > };
> > ioctl(ioasid_fd, IOASID_BIND_PGTABLE, &bind_data);
>
> I still don't quite get the benefit from doing this.
>
> The idea to create an all PASID IOASID seems to work better with less
> fuss on HW that is directly parsing the guest's PASID table.
>
> Cache invalidate seems easy enough to support
>
> Fault handling needs to return the (ioasid, device_label, pasid) when
> working with this kind of ioasid.
>
> It is true that it does create an additional flow qemu has to
> implement, but it does directly mirror the HW.
>
> Jason
>

--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson



2021-06-02 07:30:20

by David Gibson

[permalink] [raw]
Subject: Re: [RFC] /dev/ioasid uAPI proposal

On Tue, Jun 01, 2021 at 02:56:43PM -0300, Jason Gunthorpe wrote:
> On Tue, Jun 01, 2021 at 08:38:00AM +0000, Tian, Kevin wrote:
> > > From: Jason Gunthorpe <[email protected]>
> > > Sent: Saturday, May 29, 2021 3:59 AM
> > >
> > > On Thu, May 27, 2021 at 07:58:12AM +0000, Tian, Kevin wrote:
> > > >
> > > > 5. Use Cases and Flows
> > > >
> > > > Here assume VFIO will support a new model where every bound device
> > > > is explicitly listed under /dev/vfio thus a device fd can be acquired w/o
> > > > going through legacy container/group interface. For illustration purpose
> > > > those devices are just called dev[1...N]:
> > > >
> > > > device_fd[1...N] = open("/dev/vfio/devices/dev[1...N]", mode);
> > > >
> > > > As explained earlier, one IOASID fd is sufficient for all intended use cases:
> > > >
> > > > ioasid_fd = open("/dev/ioasid", mode);
> > > >
> > > > For simplicity below examples are all made for the virtualization story.
> > > > They are representative and could be easily adapted to a non-virtualization
> > > > scenario.
> > >
> > > For others, I don't think this is *strictly* necessary, we can
> > > probably still get to the device_fd using the group_fd and fit in
> > > /dev/ioasid. It does make the rest of this more readable though.
> >
> > Jason, want to confirm here. Per earlier discussion we retain the
> > impression that you want VFIO to be a pure device driver, thus
> > container/group are used only for legacy applications.
>
> Let me call this a "nice wish".
>
> If you get to a point where you hard need this, then identify the hard
> requirement and let's do it, but I wouldn't bloat this already large
> project unnecessarily.
>
> Similarly I wouldn't depend on the group fd existing in this design
> so it could be changed later.

I don't think presence or absence of a group fd makes a lot of
difference to this design. Having a group fd just means we attach
groups to the ioasid instead of individual devices, and we no longer
need the bookkeeping of "partial" devices.

> > From this comment are you suggesting that VFIO can still keep
> > container/ group concepts and user just deprecates the use of vfio
> > iommu uAPI (e.g. VFIO_SET_IOMMU) by using /dev/ioasid (which has a
> > simple policy that an IOASID will reject cmd if partially-attached
> > group exists)?
>
> I would say no on the container. /dev/ioasid == the container, having
> two competing objects at once in a single process is just a mess.

Right. I'd assume that for compatibility, creating a container would
create a single IOASID under the hood with a compatibility layer
translating the container operations to ioasid operations.
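
Just to illustrate what I mean, such a shim might look roughly like
the below. All of the struct and helper names here are made up for
illustration; this is a sketch of the idea, not a concrete proposal:

        /* Legacy VFIO_IOMMU_MAP_DMA on the container, forwarded to a
         * hidden "compat" IOASID created when the container is opened.
         */
        static int vfio_compat_map_dma(struct vfio_container *container,
                                       struct vfio_iommu_type1_dma_map *map)
        {
                struct ioasid_dma_map im = {
                        .ioasid = container->compat_ioasid,
                        .iova   = map->iova,
                        .vaddr  = map->vaddr,
                        .size   = map->size,
                };

                /* same helper that backs IOASID_MAP_DMA on /dev/ioasid */
                return ioasid_map_dma(container->ioasid_ctx, &im);
        }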

> Keeping the group fd would require charting a path through the
> ioctls where the container is not used and /dev/ioasid is sub'd in
> using the same device-FD-specific IOCTLs you show here.

Again, I don't think it makes much difference. The model doesn't
really change even if you allow both ATTACH_GROUP and ATTACH_DEVICE on
the IOASID. Basically ATTACH_GROUP would just be equivalent to
attaching all the constituent devices.

> I didn't try to chart this out carefully.
>
> Also, ultimately, something needs to be done about compatibility with
> the vfio container fd. It looks clear enough to me that the VFIO
> container FD is just a single IOASID using a special ioctl interface
> so it would be quite reasonable to harmonize these somehow.
>
> But that is too complicated and far out for me at least to guess on at
> this point..
>
> > > Still a little unsure why the vPASID is here not on the gva_ioasid. Is
> > > there any scenario where we want different vpasid's for the same
> > > IOASID? I guess it is OK like this. Hum.
> >
> > Yes, it's completely sane that the guest links a I/O page table to
> > different vpasids on dev1 and dev2. The IOMMU doesn't mandate
> > that when multiple devices share an I/O page table they must use
> > the same PASID#.
>
> Ok..
>
> Jason
>

--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson



2021-06-02 07:31:11

by David Gibson

[permalink] [raw]
Subject: Re: [RFC] /dev/ioasid uAPI proposal

On Fri, May 28, 2021 at 08:36:49PM -0300, Jason Gunthorpe wrote:
> On Thu, May 27, 2021 at 07:58:12AM +0000, Tian, Kevin wrote:
>
> > 2.1. /dev/ioasid uAPI
> > +++++++++++++++++
> >
> > /*
> > * Check whether an uAPI extension is supported.
> > *
> > * This is for FD-level capabilities, such as locked page pre-registration.
> > * IOASID-level capabilities are reported through IOASID_GET_INFO.
> > *
> > * Return: 0 if not supported, 1 if supported.
> > */
> > #define IOASID_CHECK_EXTENSION _IO(IOASID_TYPE, IOASID_BASE + 0)
>
>
> > /*
> > * Register user space memory where DMA is allowed.
> > *
> > * It pins user pages and does the locked memory accounting so sub-
> > * sequent IOASID_MAP/UNMAP_DMA calls get faster.
> > *
> > * When this ioctl is not used, one user page might be accounted
> > * multiple times when it is mapped by multiple IOASIDs which are
> > * not nested together.
> > *
> > * Input parameters:
> > * - vaddr;
> > * - size;
> > *
> > * Return: 0 on success, -errno on failure.
> > */
> > #define IOASID_REGISTER_MEMORY _IO(IOASID_TYPE, IOASID_BASE + 1)
> > #define IOASID_UNREGISTER_MEMORY _IO(IOASID_TYPE, IOASID_BASE + 2)
>
> So VA ranges are pinned and stored in a tree and later references to
> those VA ranges by any other IOASID use the pin cached in the tree?
>
> It seems reasonable and is similar to the ioasid parent/child I
> suggested for PPC.
>
> IMHO this should be merged with the all-SW IOASID that is required for
> today's mdev drivers. If this can be done while keeping this uAPI then
> great, otherwise I don't think it is so bad to weakly nest a physical
> IOASID under a SW one just to optimize page pinning.

Right, I think we can simplify the interface by modelling the
preregistration as a nesting layer. Well, mostly.. the wrinkle is
that generally you can't do anything with an ioasid until you've
attached devices to it, but that doesn't really make sense for the
prereg layer. I expect we can find some way to deal with that,
though.

Actually... to simplify that "weak nesting" concept I wonder if we
want to expand to 3 ways of specifying the pagetables for the ioasid:
1) kernel managed (MAP/UNMAP)
2) user managed (BIND/INVALIDATE)
3) pass-through (IOVA==parent address)

Obviously pass-through wouldn't be allowed in all circumstances.
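
If that choice were made visible at allocation time, it could be as
simple as the below; the enum and the extended IOASID_ALLOC argument
are invented here purely to illustrate the shape of the idea:

        /* hypothetical allocation-time selection of the pagetable model */
        enum ioasid_pgtable_mode {
                IOASID_PGTABLE_KERNEL,          /* 1) MAP/UNMAP protocol */
                IOASID_PGTABLE_USER,            /* 2) BIND/INVALIDATE protocol */
                IOASID_PGTABLE_PASSTHROUGH,     /* 3) IOVA == parent address */
        };

        alloc = {
                .parent_ioasid  = prereg_ioasid,
                .mode           = IOASID_PGTABLE_PASSTHROUGH,
        };
        gpa_ioasid = ioctl(ioasid_fd, IOASID_ALLOC, &alloc);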

> Either way this seems like a smart direction
>
> > /*
> > * Allocate an IOASID.
> > *
> > * IOASID is the FD-local software handle representing an I/O address
> > * space. Each IOASID is associated with a single I/O page table. User
> > * must call this ioctl to get an IOASID for every I/O address space that is
> > * intended to be enabled in the IOMMU.
> > *
> > * A newly-created IOASID doesn't accept any command before it is
> > * attached to a device. Once attached, an empty I/O page table is
> > * bound with the IOMMU then the user could use either DMA mapping
> > * or pgtable binding commands to manage this I/O page table.
>
> Can the IOASID be populated before being attached?

I don't think it reasonably can. Until attached, you don't actually
know what hardware IOMMU will be backing it, and therefore you don't
know its capabilities. You can't really allow mappings if you don't
even know the allowed IOVA ranges and page sizes.

> > * Device attachment is initiated through device driver uAPI (e.g. VFIO)
> > *
> > * Return: allocated ioasid on success, -errno on failure.
> > */
> > #define IOASID_ALLOC _IO(IOASID_TYPE, IOASID_BASE + 3)
> > #define IOASID_FREE _IO(IOASID_TYPE, IOASID_BASE + 4)
>
> I assume alloc will include quite a big structure to satisfy the
> various vendor needs?
>
> > /*
> > * Get information about an I/O address space
> > *
> > * Supported capabilities:
> > * - VFIO type1 map/unmap;
> > * - pgtable/pasid_table binding
> > * - hardware nesting vs. software nesting;
> > * - ...
> > *
> > * Related attributes:
> > * - supported page sizes, reserved IOVA ranges (DMA mapping);
> > * - vendor pgtable formats (pgtable binding);
> > * - number of child IOASIDs (nesting);
> > * - ...
> > *
> > * Above information is available only after one or more devices are
> > * attached to the specified IOASID. Otherwise the IOASID is just a
> > * number w/o any capability or attribute.
>
> This feels wrong to learn most of these attributes of the IOASID after
> attaching to a device.

Yes... but as above, we have no idea what the IOMMU's capabilities are
until devices are attached.

> The user should have some idea how it intends to use the IOASID when
> it creates it and the rest of the system should match the intention.
>
> For instance if the user is creating a IOASID to cover the guest GPA
> with the intention of making children it should indicate this during
> alloc.
>
> If the user is intending to point a child IOASID to a guest page table
> in a certain descriptor format then it should indicate it during
> alloc.
>
> device bind should fail if the device somehow isn't compatible with
> the scheme the user is trying to use.

[snip]
> > 2.2. /dev/vfio uAPI
> > ++++++++++++++++
>
> To be clear you mean the 'struct vfio_device' API, these are not
> IOCTLs on the container or group?
>
> > /*
> > * Bind a vfio_device to the specified IOASID fd
> > *
> > * Multiple vfio devices can be bound to a single ioasid_fd, but a single
> > * vfio device should not be bound to multiple ioasid_fd's.
> > *
> > * Input parameters:
> > * - ioasid_fd;
> > *
> > * Return: 0 on success, -errno on failure.
> > */
> > #define VFIO_BIND_IOASID_FD _IO(VFIO_TYPE, VFIO_BASE + 22)
> > #define VFIO_UNBIND_IOASID_FD _IO(VFIO_TYPE, VFIO_BASE + 23)
>
> This is where it would make sense to have an output "device id" that
> allows /dev/ioasid to refer to this "device" by number in events and
> other related things.

The group number could be used for that, even if there are no group
fds. You generally can't identify things more narrowly than group
anyway.


--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson



2021-06-02 07:31:33

by David Gibson

[permalink] [raw]
Subject: Re: [RFC] /dev/ioasid uAPI proposal

On Fri, May 28, 2021 at 02:35:38PM -0300, Jason Gunthorpe wrote:
> On Thu, May 27, 2021 at 07:58:12AM +0000, Tian, Kevin wrote:
[snip]
> > With above design /dev/ioasid uAPI is all about I/O address spaces.
> > It doesn't include any device routing information, which is only
> > indirectly registered to the ioasid driver through VFIO uAPI. For
> > example, I/O page fault is always reported to userspace per IOASID,
> > although it's physically reported per device (RID+PASID).
>
> I agree with Jean-Philippe - at the very least erasing this
> information needs a major rationale - but I don't really see why it
> must be erased? The HW reports the originating device, is it just a
> matter of labeling the devices attached to the /dev/ioasid FD so it
> can be reported to userspace?

HW reports the originating device as far as it knows. In many cases
where you have multiple devices in an IOMMU group, it's because
although they're treated as separate devices at the kernel level, they
have the same RID at the HW level. Which means a RID for something in
the right group is the closest you can count on supplying.

[snip]
> > However this way significantly
> > violates the philosophy in this /dev/ioasid proposal. It is not one IOASID
> > one address space any more. Device routing information (indirectly
> > marking hidden I/O spaces) has to be carried in iotlb invalidation and
> > page faulting uAPI to help connect vIOMMU with the underlying
> > pIOMMU. This is one design choice to be confirmed with ARM guys.
>
> I'm confused by this rationale.
>
> For a vIOMMU that has IO page tables in the guest the basic
> choices are:
> - Do we have a hypervisor trap to bind the page table or not? (RID
> and PASID may differ here)
> - Do we have a hypervisor trap to invalidate the page tables or not?
>
> If the first is a hypervisor trap then I agree it makes sense to create a
> child IOASID that points to each guest page table and manage it
> directly. This should not require walking guest page tables as it is
> really just informing the HW where the page table lives. HW will walk
> them.
>
> If there are no hypervisor traps (does this exist?) then there is no
> way to involve the hypervisor here and the child IOASID should simply
> be a pointer to the guest's data structure that describes binding. In
> this case that IOASID should claim all PASIDs when bound to a
> RID.

And in that case I think we should call that object something other
than an IOASID, since it represents multiple address spaces.

> Invalidation should be passed up the to the IOMMU driver in terms of
> the guest tables information and either the HW or software has to walk
> to guest tables to make sense of it.
>
> Events from the IOMMU to userspace should be tagged with the attached
> device label and the PASID/substream ID. This means there is no issue
> to have an 'all PASID' IOASID.
>
> > Notes:
> > - It might be confusing as IOASID is also used in the kernel (drivers/
> > iommu/ioasid.c) to represent PCI PASID or ARM substream ID. We need
> > find a better name later to differentiate.
>
> +1 on Jean-Philippe's remarks
>
> > - PPC has not be considered yet as we haven't got time to fully understand
> > its semantics. According to previous discussion there is some generality
> > between PPC window-based scheme and VFIO type1 semantics. Let's
> > first make consensus on this proposal and then further discuss how to
> > extend it to cover PPC's requirement.
>
> From what I understood PPC is not so bad; nesting IOASIDs covered its
> preload feature, and it needed a way to specify/query the IOVA range an
> IOASID will cover.
>
> > - There is a protocol between vfio group and kvm. Needs to think about
> > how it will be affected following this proposal.
>
> Ugh, I always stop looking when I reach that boundary. Can anyone
> summarize what is going on there?
>
> Most likely passing the /dev/ioasid into KVM's FD (or vice versa) is the
> right answer. Eg if ARM needs to get the VMID from KVM and set it to
> ioasid then a KVM "ioctl set_arm_vmid(/dev/ioasid)" call is
> reasonable. Certainly better than the symbol_get stuff we have right
> now.
>
> I will read through the detail below in another email
>
> Jason
>

--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson



2021-06-02 07:59:57

by Shenming Lu

[permalink] [raw]
Subject: Re: [RFC] /dev/ioasid uAPI proposal

On 2021/6/2 1:33, Jason Gunthorpe wrote:
> On Tue, Jun 01, 2021 at 08:30:35PM +0800, Lu Baolu wrote:
>
>> The drivers register per page table fault handlers to /dev/ioasid which
>> will then register itself to iommu core to listen and route the per-
>> device I/O page faults.
>
> I'm still confused why drivers need fault handlers at all?

Essentially it is the userspace that needs the fault handlers:
one case is to deliver the faults to the vIOMMU, and another
case is to enable IOPF on the GPA address space for on-demand
paging. It seems that both could be specified in/through the
IOASID_ALLOC ioctl?
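
For example (the flag names below are only to illustrate the idea,
nothing more):

        alloc = {
                .flags  = IOASID_FLAG_FAULT_TO_USER |   /* deliver faults to the vIOMMU path */
                          IOASID_FLAG_IOPF,             /* on-demand paging on this space */
        };
        ioasid = ioctl(ioasid_fd, IOASID_ALLOC, &alloc);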

Thanks,
Shenming

2021-06-02 09:51:31

by Jason Wang

[permalink] [raw]
Subject: Re: [RFC] /dev/ioasid uAPI proposal


On 2021/6/2 1:31, Jason Gunthorpe wrote:
> On Tue, Jun 01, 2021 at 04:47:15PM +0800, Jason Wang wrote:
>
>> We can open up to ~0U file descriptors, I don't see why we need to restrict
>> it in uAPI.
> There are significant problems with such large file descriptor
> tables. High FD numbers mean things like select don't work at all
> anymore and IIRC there are more complications.


I don't see much difference between IOASID and other types of fds. People
can choose to use poll or epoll.

And with the current proposal (assuming there's an N:1 mapping from
IOASIDs to an ioasid_fd), I wonder how select can work for a specific
IOASID.

Thanks


>
> A huge number of FDs for typical usages should be avoided.
>
> Jason
>

2021-06-02 11:45:32

by David Gibson

[permalink] [raw]
Subject: Re: [RFC] /dev/ioasid uAPI proposal

On Thu, May 27, 2021 at 07:58:12AM +0000, Tian, Kevin wrote:
> /dev/ioasid provides an unified interface for managing I/O page tables for
> devices assigned to userspace. Device passthrough frameworks (VFIO, vDPA,
> etc.) are expected to use this interface instead of creating their own logic to
> isolate untrusted device DMAs initiated by userspace.
>
> This proposal describes the uAPI of /dev/ioasid and also sample sequences
> with VFIO as example in typical usages. The driver-facing kernel API provided
> by the iommu layer is still TBD, which can be discussed after consensus is
> made on this uAPI.
>
> It's based on a lengthy discussion starting from here:
> https://lore.kernel.org/linux-iommu/[email protected]/
>
> It ends up to be a long writing due to many things to be summarized and
> non-trivial effort required to connect them into a complete proposal.
> Hope it provides a clean base to converge.

Thanks for the writeup. I'm giving this a first pass review, note
that I haven't read all the existing replies in detail yet.

>
> TOC
> ====
> 1. Terminologies and Concepts
> 2. uAPI Proposal
> 2.1. /dev/ioasid uAPI
> 2.2. /dev/vfio uAPI
> 2.3. /dev/kvm uAPI
> 3. Sample structures and helper functions
> 4. PASID virtualization
> 5. Use Cases and Flows
> 5.1. A simple example
> 5.2. Multiple IOASIDs (no nesting)
> 5.3. IOASID nesting (software)
> 5.4. IOASID nesting (hardware)
> 5.5. Guest SVA (vSVA)
> 5.6. I/O page fault
> 5.7. BIND_PASID_TABLE
> ====
>
> 1. Terminologies and Concepts
> -----------------------------------------
>
> IOASID FD is the container holding multiple I/O address spaces. User
> manages those address spaces through FD operations. Multiple FD's are
> allowed per process, but with this proposal one FD should be sufficient for
> all intended usages.
>
> IOASID is the FD-local software handle representing an I/O address space.
> Each IOASID is associated with a single I/O page table. IOASIDs can be
> nested together, implying the output address from one I/O page table
> (represented by child IOASID) must be further translated by another I/O
> page table (represented by parent IOASID).

Is there a compelling reason to have all the IOASIDs handled by one
FD? Simply on the grounds that handles to kernel internal objects are
usually fds, having an fd per ioasid seems like an obvious alternative.
In that case plain open() would replace IOASID_ALLOC. Nesting could be
handled either by 1) having a CREATE_NESTED on the parent fd which
spawns a new fd, or 2) opening /dev/ioasid again for a new fd and doing
a SET_PARENT before doing anything else.

I may be bikeshedding here..
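
For concreteness, option 2) above might look like this (IOASID_SET_PARENT
is of course a made-up name):

        /* parent address space: plain open() replaces IOASID_ALLOC */
        gpa_fd = open("/dev/ioasid", O_RDWR);

        /* child address space: open again, then declare the nesting */
        giova_fd = open("/dev/ioasid", O_RDWR);
        ioctl(giova_fd, IOASID_SET_PARENT, gpa_fd);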

> I/O address space can be managed through two protocols, according to
> whether the corresponding I/O page table is constructed by the kernel or
> the user. When kernel-managed, a dma mapping protocol (similar to
> existing VFIO iommu type1) is provided for the user to explicitly specify
> how the I/O address space is mapped. Otherwise, a different protocol is
> provided for the user to bind an user-managed I/O page table to the
> IOMMU, plus necessary commands for iotlb invalidation and I/O fault
> handling.
>
> Pgtable binding protocol can be used only on the child IOASID's, implying
> IOASID nesting must be enabled. This is because the kernel doesn't trust
> userspace. Nesting allows the kernel to enforce its DMA isolation policy
> through the parent IOASID.

To clarify, I'm guessing that's a restriction of likely practice,
rather than a fundamental API restriction. I can see a couple of
theoretical future cases where a user-managed pagetable for a "base"
IOASID would be feasible:

1) On some fancy future MMU allowing free nesting, where the kernel
would insert an implicit extra layer translating user addresses
to physical addresses, and the userspace manages a pagetable with
its own VAs being the target AS
2) For a purely software virtual device, where its virtual DMA
engine can interpret user addresses fine

> IOASID nesting can be implemented in two ways: hardware nesting and
> software nesting. With hardware support the child and parent I/O page
> tables are walked consecutively by the IOMMU to form a nested translation.
> When it's implemented in software, the ioasid driver is responsible for
> merging the two-level mappings into a single-level shadow I/O page table.
> Software nesting requires both child/parent page tables operated through
> the dma mapping protocol, so any change in either level can be captured
> by the kernel to update the corresponding shadow mapping.

As Jason also said, I don't think you need to restrict software
nesting to only kernel managed L2 tables - you already need hooks for
cache invalidation, and you can use those to trigger reshadows.

> An I/O address space takes effect in the IOMMU only after it is attached
> to a device. The device in the /dev/ioasid context always refers to a
> physical one or 'pdev' (PF or VF).

What you mean by "physical" device here isn't really clear - VFs
aren't really physical devices, and the PF/VF terminology also doesn't
extend to non-PCI devices (which I think we want to consider for the
API, even if we're not implementing it any time soon).

Now, it's clear that we can't program things into the IOMMU before
attaching a device - we might not even know which IOMMU to use.
However, I'm not sure if it's wise to automatically make the AS "real"
as soon as we attach a device:

* If we're going to attach a whole bunch of devices, could we (for at
least some IOMMU models) end up doing a lot of work which then has
to be re-done for each extra device we attach?

* With kernel managed IO page tables could attaching a second device
(at least on some IOMMU models) require some operation which would
require discarding those tables? e.g. if the second device somehow
forces a different IO page size

For that reason I wonder if we want some sort of explicit enable or
activate call. Device attaches would only be valid before, map or
attach pagetable calls would only be valid after.
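
Sketching that with a made-up IOASID_ENABLE call:

        ioasid = ioctl(ioasid_fd, IOASID_ALLOC);

        /* attaches are only valid before enable */
        at_data = { .ioasid = ioasid };
        ioctl(device_fd1, VFIO_ATTACH_IOASID, &at_data);
        ioctl(device_fd2, VFIO_ATTACH_IOASID, &at_data);

        /* commit: the set of devices, and hence the usable IOMMU
         * capabilities, page sizes etc., is now fixed
         */
        ioctl(ioasid_fd, IOASID_ENABLE, ioasid);

        /* map or bind-pgtable calls are only valid after enable */
        ioctl(ioasid_fd, IOASID_MAP_DMA, &dma_map);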

> One I/O address space could be attached to multiple devices. In this case,
> /dev/ioasid uAPI applies to all attached devices under the specified IOASID.
>
> Based on the underlying IOMMU capability one device might be allowed
> to attach to multiple I/O address spaces, with DMAs accessing them by
> carrying different routing information. One of them is the default I/O
> address space routed by PCI Requestor ID (RID) or ARM Stream ID. The
> remaining are routed by RID + Process Address Space ID (PASID) or
> Stream+Substream ID. For simplicity the following context uses RID and
> PASID when talking about the routing information for I/O address spaces.

I'm not really clear on how this interacts with nested ioasids. Would
you generally expect the RID+PASID IOASes to be children of the base
RID IOAS, or not?

If the PASID ASes are children of the RID AS, can we consider this not
as the device explicitly attaching to multiple IOASIDs, but instead
attaching to the parent IOASID with awareness of the child ones?

> Device attachment is initiated through passthrough framework uAPI (use
> VFIO for simplicity in following context). VFIO is responsible for identifying
> the routing information and registering it to the ioasid driver when calling
> ioasid attach helper function. It could be RID if the assigned device is
> pdev (PF/VF) or RID+PASID if the device is mediated (mdev). In addition,
> user might also provide its view of virtual routing information (vPASID) in
> the attach call, e.g. when multiple user-managed I/O address spaces are
> attached to the vfio_device. In this case VFIO must figure out whether
> vPASID should be directly used (for pdev) or converted to a kernel-
> allocated one (pPASID, for mdev) for physical routing (see section 4).
>
> Device must be bound to an IOASID FD before attach operation can be
> conducted. This is also through VFIO uAPI. In this proposal one device
> should not be bound to multiple FD's. Not sure about the gain of
> allowing it except adding unnecessary complexity. But if others have
> different view we can further discuss.
>
> VFIO must ensure its device composes DMAs with the routing information
> attached to the IOASID. For pdev it naturally happens since vPASID is
> directly programmed to the device by guest software. For mdev this
> implies any guest operation carrying a vPASID on this device must be
> trapped into VFIO and then converted to pPASID before sent to the
> device. A detail explanation about PASID virtualization policies can be
> found in section 4.
>
> Modern devices may support a scalable workload submission interface
> based on PCI DMWr capability, allowing a single work queue to access
> multiple I/O address spaces. One example is Intel ENQCMD, having
> PASID saved in the CPU MSR and carried in the instruction payload
> when sent out to the device. Then a single work queue shared by
> multiple processes can compose DMAs carrying different PASIDs.

Is the assumption here that the processes share the IOASID FD
instance, but not memory?

> When executing ENQCMD in the guest, the CPU MSR includes a vPASID
> which, if targeting a mdev, must be converted to pPASID before sent
> to the wire. Intel CPU provides a hardware PASID translation capability
> for auto-conversion in the fast path. The user is expected to setup the
> PASID mapping through KVM uAPI, with information about {vpasid,
> ioasid_fd, ioasid}. The ioasid driver provides helper function for KVM
> to figure out the actual pPASID given an IOASID.
>
> With above design /dev/ioasid uAPI is all about I/O address spaces.
> It doesn't include any device routing information, which is only
> indirectly registered to the ioasid driver through VFIO uAPI. For
> example, I/O page fault is always reported to userspace per IOASID,
> although it's physically reported per device (RID+PASID). If there is a
> need of further relaying this fault into the guest, the user is responsible
> of identifying the device attached to this IOASID (randomly pick one if
> multiple attached devices) and then generates a per-device virtual I/O
> page fault into guest. Similarly the iotlb invalidation uAPI describes the
> granularity in the I/O address space (all, or a range), different from the
> underlying IOMMU semantics (domain-wide, PASID-wide, range-based).
>
> I/O page tables routed through PASID are installed in a per-RID PASID
> table structure. Some platforms implement the PASID table in the guest
> physical space (GPA), expecting it managed by the guest. The guest
> PASID table is bound to the IOMMU also by attaching to an IOASID,
> representing the per-RID vPASID space.

Do we need to consider two management modes here, much as we have for
the pagetables themselves: either kernel managed, in which we have
explicit calls to bind a vPASID to a parent PASID, or user managed in
which case we register a table in some format.
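
Very roughly, and purely to illustrate the two modes (the per-vPASID
bind ioctl below is not something being proposed by name):

        /* (a) kernel-managed vPASID space: explicit per-PASID binding */
        pasid_bind = {
                .ioasid = pasidtbl_ioasid,      /* the per-RID vPASID space */
                .vpasid = gpasid1,
                .child  = gva_ioasid,           /* I/O space for that vPASID */
        };
        ioctl(ioasid_fd, IOASID_BIND_VPASID, &pasid_bind);

        /* (b) user-managed: register the whole guest table, as in 5.7 */
        bind_data = {
                .ioasid = pasidtbl_ioasid,
                .addr   = gpa_pasid_table,
                /* plus format information */
        };
        ioctl(ioasid_fd, IOASID_BIND_PASID_TABLE, &bind_data);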

> We propose the host kernel needs to explicitly track guest I/O page
> tables even on these platforms, i.e. the same pgtable binding protocol
> should be used universally on all platforms (with only difference on who
> actually writes the PASID table). One opinion from previous discussion
> was treating this special IOASID as a container for all guest I/O page
> tables i.e. hiding them from the host. However this way significantly
> violates the philosophy in this /dev/ioasid proposal. It is not one IOASID
> one address space any more. Device routing information (indirectly
> marking hidden I/O spaces) has to be carried in iotlb invalidation and
> page faulting uAPI to help connect vIOMMU with the underlying
> pIOMMU. This is one design choice to be confirmed with ARM guys.
>
> Devices may sit behind IOMMU's with incompatible capabilities. The
> difference may lie in the I/O page table format, or availability of an user
> visible uAPI (e.g. hardware nesting). /dev/ioasid is responsible for
> checking the incompatibility between newly-attached device and existing
> devices under the specific IOASID and, if found, returning error to user.
> Upon such error the user should create a new IOASID for the incompatible
> device.
>
> There is no explicit group enforcement in /dev/ioasid uAPI, due to no
> device notation in this interface as aforementioned. But the ioasid driver
> does implicit check to make sure that devices within an iommu group
> must be all attached to the same IOASID before this IOASID starts to
> accept any uAPI command. Otherwise error information is returned to
> the user.

An explicit ENABLE call might make this checking simpler.

> There was a long debate in previous discussion whether VFIO should keep
> explicit container/group semantics in its uAPI. Jason Gunthorpe proposes
> a simplified model where every device bound to VFIO is explicitly listed
> under /dev/vfio thus a device fd can be acquired w/o going through legacy
> container/group interface. In this case the user is responsible for
> understanding the group topology and meeting the implicit group check
> criteria enforced in /dev/ioasid. The use case examples in this proposal
> are based on the new model.
>
> Of course for backward compatibility VFIO still needs to keep the existing
> uAPI and vfio iommu type1 will become a shim layer connecting VFIO
> iommu ops to internal ioasid helper functions.
>
> Notes:
> - It might be confusing as IOASID is also used in the kernel (drivers/
> iommu/ioasid.c) to represent PCI PASID or ARM substream ID. We need
> find a better name later to differentiate.
>
> - PPC has not be considered yet as we haven't got time to fully understand
> its semantics. According to previous discussion there is some generality
> between PPC window-based scheme and VFIO type1 semantics. Let's
> first make consensus on this proposal and then further discuss how to
> extend it to cover PPC's requirement.

From what I've seen so far, it seems ok to me. Note that at this
stage I'm only familiar with existing PPC IOMMUs, which don't have
PASID or anything similar. I'm not sure what IBM's future plans are
for IOMMUs, so there will be more checking to be done.

> - There is a protocol between vfio group and kvm. Needs to think about
> how it will be affected following this proposal.

I think that's only used on PPC, as an optimization for PAPR's
paravirt IOMMU with a small default IOVA window. I think we can do
something equivalent for IOASIDs from what I've seen so far.

> - mdev in this context refers to mediated subfunctions (e.g. Intel SIOV)
> which can be physically isolated in-between through PASID-granular
> IOMMU protection. Historically people also discussed one usage by
> mediating a pdev into a mdev. This usage is not covered here, and is
> supposed to be replaced by Max's work which allows overriding various
> VFIO operations in vfio-pci driver.

I think there are a couple of different mdev cases, so we'll need to
be careful of that and clarify our terminology a bit, I think.

> 2. uAPI Proposal
> ----------------------
>
> /dev/ioasid uAPI covers everything about managing I/O address spaces.
>
> /dev/vfio uAPI builds connection between devices and I/O address spaces.
>
> /dev/kvm uAPI is optional required as far as ENQCMD is concerned.
>
>
> 2.1. /dev/ioasid uAPI
> +++++++++++++++++
>
> /*
> * Check whether an uAPI extension is supported.
> *
> * This is for FD-level capabilities, such as locked page pre-registration.
> * IOASID-level capabilities are reported through IOASID_GET_INFO.
> *
> * Return: 0 if not supported, 1 if supported.
> */
> #define IOASID_CHECK_EXTENSION _IO(IOASID_TYPE, IOASID_BASE + 0)
>
>
> /*
> * Register user space memory where DMA is allowed.
> *
> * It pins user pages and does the locked memory accounting so sub-
> * sequent IOASID_MAP/UNMAP_DMA calls get faster.
> *
> * When this ioctl is not used, one user page might be accounted
> * multiple times when it is mapped by multiple IOASIDs which are
> * not nested together.
> *
> * Input parameters:
> * - vaddr;
> * - size;
> *
> * Return: 0 on success, -errno on failure.
> */
> #define IOASID_REGISTER_MEMORY _IO(IOASID_TYPE, IOASID_BASE + 1)
> #define IOASID_UNREGISTER_MEMORY _IO(IOASID_TYPE, IOASID_BASE + 2)

AIUI PPC is the main user of the current pre-registration API, though
it could have value in any vIOMMU case to avoid possibly costly
accounting on every guest map/unmap.

I wonder if there's a way to model this using a nested AS rather than
requiring special operations. e.g.

'prereg' IOAS
|
\- 'rid' IOAS
|
\- 'pasid' IOAS (maybe)

'prereg' would have a kernel managed pagetable into which (for
example) qemu platform code would map all guest memory (using
IOASID_MAP_DMA). qemu's vIOMMU driver would then mirror the guest's
IO mappings into the 'rid' IOAS in terms of GPA.

This wouldn't quite work as is, because the 'prereg' IOAS would have
no devices. But we could potentially have another call to mark an
IOAS as a purely "preregistration" or pure virtual IOAS. Using that
would be an alternative to attaching devices.
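
As a sketch of the flow (IOASID_SET_VIRTUAL is just a placeholder name
for the "mark as pure preregistration IOAS" call suggested above):

        /* software-only IOAS holding all guest RAM; no devices attached */
        prereg_ioasid = ioctl(ioasid_fd, IOASID_ALLOC);
        ioctl(ioasid_fd, IOASID_SET_VIRTUAL, prereg_ioasid);

        dma_map = {
                .ioasid = prereg_ioasid,
                .iova   = 0,                    /* GPA */
                .vaddr  = guest_ram_hva,        /* HVA */
                .size   = guest_ram_size,
        };
        ioctl(ioasid_fd, IOASID_MAP_DMA, &dma_map);

        /* 'rid' IOAS nested on it; the vIOMMU mirrors guest mappings
         * here in terms of GPA rather than HVA
         */
        rid_ioasid = ioctl(ioasid_fd, IOASID_CREATE_NESTING, prereg_ioasid);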

> /*
> * Allocate an IOASID.
> *
> * IOASID is the FD-local software handle representing an I/O address
> * space. Each IOASID is associated with a single I/O page table. User
> * must call this ioctl to get an IOASID for every I/O address space that is
> * intended to be enabled in the IOMMU.
> *
> * A newly-created IOASID doesn't accept any command before it is
> * attached to a device. Once attached, an empty I/O page table is
> * bound with the IOMMU then the user could use either DMA mapping
> * or pgtable binding commands to manage this I/O page table.
> *
> * Device attachment is initiated through device driver uAPI (e.g. VFIO)
> *
> * Return: allocated ioasid on success, -errno on failure.
> */
> #define IOASID_ALLOC _IO(IOASID_TYPE, IOASID_BASE + 3)
> #define IOASID_FREE _IO(IOASID_TYPE, IOASID_BASE + 4)
>
>
> /*
> * Get information about an I/O address space
> *
> * Supported capabilities:
> * - VFIO type1 map/unmap;
> * - pgtable/pasid_table binding
> * - hardware nesting vs. software nesting;
> * - ...
> *
> * Related attributes:
> * - supported page sizes, reserved IOVA ranges (DMA mapping);

Can I request we represent this in terms of permitted IOVA ranges,
rather than reserved IOVA ranges. This works better with the "window"
model I have in mind for unifying the restrictions of the POWER IOMMU
with Type1 like mapping.
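
Something along these lines, say (field names invented, just to show
the shape):

        struct ioasid_iova_range {
                __u64   start;
                __u64   last;           /* inclusive */
        };

        struct ioasid_info {
                /* ... other capabilities and attributes ... */

                /* DMA mappings are only accepted inside these windows,
                 * which expresses the POWER window model directly
                 */
                __u32   nr_iova_ranges;
                struct ioasid_iova_range iova_ranges[];
        };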

> * - vendor pgtable formats (pgtable binding);
> * - number of child IOASIDs (nesting);
> * - ...
> *
> * Above information is available only after one or more devices are
> * attached to the specified IOASID. Otherwise the IOASID is just a
> * number w/o any capability or attribute.
> *
> * Input parameters:
> * - u32 ioasid;
> *
> * Output parameters:
> * - many. TBD.
> */
> #define IOASID_GET_INFO _IO(IOASID_TYPE, IOASID_BASE + 5)
>
>
> /*
> * Map/unmap process virtual addresses to I/O virtual addresses.
> *
> * Provide VFIO type1 equivalent semantics. Start with the same
> * restriction e.g. the unmap size should match those used in the
> * original mapping call.
> *
> * If IOASID_REGISTER_MEMORY has been called, the mapped vaddr
> * must be already in the preregistered list.
> *
> * Input parameters:
> * - u32 ioasid;
> * - refer to vfio_iommu_type1_dma_{un}map
> *
> * Return: 0 on success, -errno on failure.
> */
> #define IOASID_MAP_DMA _IO(IOASID_TYPE, IOASID_BASE + 6)
> #define IOASID_UNMAP_DMA _IO(IOASID_TYPE, IOASID_BASE + 7)

I'm assuming these would be expected to fail if a user managed
pagetable has been bound?

> /*
> * Create a nesting IOASID (child) on an existing IOASID (parent)
> *
> * IOASIDs can be nested together, implying that the output address
> * from one I/O page table (child) must be further translated by
> * another I/O page table (parent).
> *
> * As the child adds essentially another reference to the I/O page table
> * represented by the parent, any device attached to the child ioasid
> * must be already attached to the parent.
> *
> * In concept there is no limit on the number of the nesting levels.
> * However for the majority case one nesting level is sufficient. The
> * user should check whether an IOASID supports nesting through
> * IOASID_GET_INFO. For example, if only one nesting level is allowed,
> * the nesting capability is reported only on the parent instead of the
> * child.
> *
> * User also needs check (via IOASID_GET_INFO) whether the nesting
> * is implemented in hardware or software. If software-based, DMA
> * mapping protocol should be used on the child IOASID. Otherwise,
> * the child should be operated with pgtable binding protocol.
> *
> * Input parameters:
> * - u32 parent_ioasid;
> *
> * Return: child_ioasid on success, -errno on failure;
> */
> #define IOASID_CREATE_NESTING _IO(IOASID_TYPE, IOASID_BASE + 8)
>
>
> /*
> * Bind an user-managed I/O page table with the IOMMU
> *
> * Because user page table is untrusted, IOASID nesting must be enabled
> * for this ioasid so the kernel can enforce its DMA isolation policy
> * through the parent ioasid.
> *
> * Pgtable binding protocol is different from DMA mapping. The latter
> * has the I/O page table constructed by the kernel and updated
> * according to user MAP/UNMAP commands. With pgtable binding the
> * whole page table is created and updated by userspace, thus different
> * set of commands are required (bind, iotlb invalidation, page fault, etc.).
> *
> * Because the page table is directly walked by the IOMMU, the user
> * must use a format compatible to the underlying hardware. It can
> * check the format information through IOASID_GET_INFO.
> *
> * The page table is bound to the IOMMU according to the routing
> * information of each attached device under the specified IOASID. The
> * routing information (RID and optional PASID) is registered when a
> * device is attached to this IOASID through VFIO uAPI.
> *
> * Input parameters:
> * - child_ioasid;
> * - address of the user page table;
> * - formats (vendor, address_width, etc.);
> *
> * Return: 0 on success, -errno on failure.
> */
> #define IOASID_BIND_PGTABLE _IO(IOASID_TYPE, IOASID_BASE + 9)
> #define IOASID_UNBIND_PGTABLE _IO(IOASID_TYPE, IOASID_BASE + 10)

I'm assuming that UNBIND would return the IOASID to a kernel-managed
pagetable?

For debugging and certain hypervisor edge cases it might be useful to
have a call to allow userspace to look up a specific IOVA in a guest-
managed pgtable.
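
e.g. something like the below (ioctl name and fields hypothetical):

        lookup = {
                .ioasid = gva_ioasid,
                .iova   = addr_of_interest,
        };
        /* kernel walks the bound user page table for this one IOVA */
        ioctl(ioasid_fd, IOASID_LOOKUP_IOVA, &lookup);
        /* on success lookup.phys_addr / lookup.perm are filled in;
         * -ENOENT if the IOVA is not mapped
         */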


> /*
> * Bind an user-managed PASID table to the IOMMU
> *
> * This is required for platforms which place PASID table in the GPA space.
> * In this case the specified IOASID represents the per-RID PASID space.
> *
> * Alternatively this may be replaced by IOASID_BIND_PGTABLE plus a
> * special flag to indicate the difference from normal I/O address spaces.
> *
> * The format info of the PASID table is reported in IOASID_GET_INFO.
> *
> * As explained in the design section, user-managed I/O page tables must
> * be explicitly bound to the kernel even on these platforms. It allows
> * the kernel to uniformly manage I/O address spaces cross all platforms.
> * Otherwise, the iotlb invalidation and page faulting uAPI must be hacked
> * to carry device routing information to indirectly mark the hidden I/O
> * address spaces.
> *
> * Input parameters:
> * - child_ioasid;

Wouldn't this be the parent ioasid, rather than one of the potentially
many child ioasids?

> * - address of PASID table;
> * - formats (vendor, size, etc.);
> *
> * Return: 0 on success, -errno on failure.
> */
> #define IOASID_BIND_PASID_TABLE _IO(IOASID_TYPE, IOASID_BASE + 11)
> #define IOASID_UNBIND_PASID_TABLE _IO(IOASID_TYPE, IOASID_BASE + 12)
>
>
> /*
> * Invalidate IOTLB for an user-managed I/O page table
> *
> * Unlike what's defined in include/uapi/linux/iommu.h, this command
> * doesn't allow the user to specify cache type and likely support only
> * two granularities (all, or a specified range) in the I/O address space.
> *
> * Physical IOMMU have three cache types (iotlb, dev_iotlb and pasid
> * cache). If the IOASID represents an I/O address space, the invalidation
> * always applies to the iotlb (and dev_iotlb if enabled). If the IOASID
> * represents a vPASID space, then this command applies to the PASID
> * cache.
> *
> * Similarly this command doesn't provide IOMMU-like granularity
> * info (domain-wide, pasid-wide, range-based), since it's all about the
> * I/O address space itself. The ioasid driver walks the attached
> * routing information to match the IOMMU semantics under the
> * hood.
> *
> * Input parameters:
> * - child_ioasid;

And couldn't this be any ioasid, not just a child one, depending on
whether you want PASID scope or RID scope invalidation?

> * - granularity
> *
> * Return: 0 on success, -errno on failure
> */
> #define IOASID_INVALIDATE_CACHE _IO(IOASID_TYPE, IOASID_BASE + 13)
>
>
> /*
> * Page fault report and response
> *
> * This is TBD. Can be added after other parts are cleared up. Likely it
> * will be a ring buffer shared between user/kernel, an eventfd to notify
> * the user and an ioctl to complete the fault.
> *
> * The fault data is per I/O address space, i.e.: IOASID + faulting_addr
> */
>
>
> /*
> * Dirty page tracking
> *
> * Track and report memory pages dirtied in I/O address spaces. There
> * is an ongoing work by Kunkun Jiang by extending existing VFIO type1.
> * It needs be adapted to /dev/ioasid later.
> */
>
>
> 2.2. /dev/vfio uAPI
> ++++++++++++++++
>
> /*
> * Bind a vfio_device to the specified IOASID fd
> *
> * Multiple vfio devices can be bound to a single ioasid_fd, but a single
> * vfio device should not be bound to multiple ioasid_fd's.
> *
> * Input parameters:
> * - ioasid_fd;
> *
> * Return: 0 on success, -errno on failure.
> */
> #define VFIO_BIND_IOASID_FD _IO(VFIO_TYPE, VFIO_BASE + 22)
> #define VFIO_UNBIND_IOASID_FD _IO(VFIO_TYPE, VFIO_BASE + 23)
>
>
> /*
> * Attach a vfio device to the specified IOASID
> *
> * Multiple vfio devices can be attached to the same IOASID, and vice
> * versa.
> *
> * User may optionally provide a "virtual PASID" to mark an I/O page
> * table on this vfio device. Whether the virtual PASID is physically used
> * or converted to another kernel-allocated PASID is a policy in vfio device
> * driver.
> *
> * There is no need to specify ioasid_fd in this call due to the assumption
> * of 1:1 connection between vfio device and the bound fd.
> *
> * Input parameter:
> * - ioasid;
> * - flag;
> * - user_pasid (if specified);

Wouldn't the PASID be communicated by whether you give a parent or
child ioasid, rather than needing an extra value?

> * Return: 0 on success, -errno on failure.
> */
> #define VFIO_ATTACH_IOASID _IO(VFIO_TYPE, VFIO_BASE + 24)
> #define VFIO_DETACH_IOASID _IO(VFIO_TYPE, VFIO_BASE + 25)
>
>
> 2.3. KVM uAPI
> ++++++++++++
>
> /*
> * Update CPU PASID mapping
> *
> * This is necessary when ENQCMD will be used in the guest while the
> * targeted device doesn't accept the vPASID saved in the CPU MSR.
> *
> * This command allows user to set/clear the vPASID->pPASID mapping
> * in the CPU, by providing the IOASID (and FD) information representing
> * the I/O address space marked by this vPASID.
> *
> * Input parameters:
> * - user_pasid;
> * - ioasid_fd;
> * - ioasid;
> */
> #define KVM_MAP_PASID _IO(KVMIO, 0xf0)
> #define KVM_UNMAP_PASID _IO(KVMIO, 0xf1)
>
>
> 3. Sample structures and helper functions
> --------------------------------------------------------
>
> Three helper functions are provided to support VFIO_BIND_IOASID_FD:
>
> struct ioasid_ctx *ioasid_ctx_fdget(int fd);
> int ioasid_register_device(struct ioasid_ctx *ctx, struct ioasid_dev *dev);
> int ioasid_unregister_device(struct ioasid_dev *dev);
>
> An ioasid_ctx is created for each fd:
>
> struct ioasid_ctx {
> // a list of allocated IOASID data's
> struct list_head ioasid_list;
> // a list of registered devices
> struct list_head dev_list;
> // a list of pre-registered virtual address ranges
> struct list_head prereg_list;
> };
>
> Each registered device is represented by ioasid_dev:
>
> struct ioasid_dev {
> struct list_head next;
> struct ioasid_ctx *ctx;
> // always be the physical device

Again "physical" isn't really clearly defined here.

> struct device *device;
> struct kref kref;
> };
>
> Because we assume one vfio_device connected to at most one ioasid_fd,
> here ioasid_dev could be embedded in vfio_device and then linked to
> ioasid_ctx->dev_list when registration succeeds. For mdev the struct
> device should be the pointer to the parent device. PASID marking this
> mdev is specified later when VFIO_ATTACH_IOASID.
>
> An ioasid_data is created when IOASID_ALLOC, as the main object
> describing characteristics about an I/O page table:
>
> struct ioasid_data {
> // link to ioasid_ctx->ioasid_list
> struct list_head next;
>
> // the IOASID number
> u32 ioasid;
>
> // the handle to convey iommu operations
> // hold the pgd (TBD until discussing iommu api)
> struct iommu_domain *domain;
>
> // map metadata (vfio type1 semantics)
> struct rb_node dma_list;

Why do you need this? Can't you just store the kernel managed
mappings in the host IO pgtable?

> // pointer to user-managed pgtable (for nesting case)
> u64 user_pgd;

> // link to the parent ioasid (for nesting)
> struct ioasid_data *parent;
>
> // cache the global PASID shared by ENQCMD-capable
> // devices (see below explanation in section 4)
> u32 pasid;
>
> // a list of device attach data (routing information)
> struct list_head attach_data;
>
> // a list of partially-attached devices (group)
> struct list_head partial_devices;
>
> // a list of fault_data reported from the iommu layer
> struct list_head fault_data;
>
> ...
> }
>
> ioasid_data and iommu_domain have overlapping roles as both are
> introduced to represent an I/O address space. It is still a big TBD how
> the two should be corelated or even merged, and whether new iommu
> ops are required to handle RID+PASID explicitly. We leave this as open
> for now as this proposal is mainly about uAPI. For simplification
> purpose the two objects are kept separate in this context, assuming an
> 1:1 connection in-between and the domain as the place-holder
> representing the 1st class object in the iommu ops.
>
> Two helper functions are provided to support VFIO_ATTACH_IOASID:
>
> struct attach_info {
> u32 ioasid;
> // If valid, the PASID to be used physically
> u32 pasid;

Again shouldn't the choice of a parent or child ioasid inform whether
there is a pasid, and if so which one?

> };
> int ioasid_device_attach(struct ioasid_dev *dev,
> struct attach_info info);
> int ioasid_device_detach(struct ioasid_dev *dev, u32 ioasid);
>
> The pasid parameter is optionally provided based on the policy in vfio
> device driver. It could be the PASID marking the default I/O address
> space for a mdev, or the user-provided PASID marking an user I/O page
> table, or another kernel-allocated PASID backing the user-provided one.
> Please check next section for detail explanation.
>
> A new object is introduced and linked to ioasid_data->attach_data for
> each successful attach operation:
>
> struct ioasid_attach_data {
> struct list_head next;
> struct ioasid_dev *dev;
> u32 pasid;
> }
>
> As explained in the design section, there is no explicit group enforcement
> in /dev/ioasid uAPI or helper functions. But the ioasid driver does
> implicit group check - before every device within an iommu group is
> attached to this IOASID, the previously-attached devices in this group are
> put in ioasid_data->partial_devices. The IOASID rejects any command if
> the partial_devices list is not empty.
>
> Then is the last helper function:
> u32 ioasid_get_global_pasid(struct ioasid_ctx *ctx,
> u32 ioasid, bool alloc);
>
> ioasid_get_global_pasid is necessary in scenarios where multiple devices
> want to share a same PASID value on the attached I/O page table (e.g.
> when ENQCMD is enabled, as explained in next section). We need a
> centralized place (ioasid_data->pasid) to hold this value (allocated when
> first called with alloc=true). vfio device driver calls this function (alloc=
> true) to get the global PASID for an ioasid before calling ioasid_device_
> attach. KVM also calls this function (alloc=false) to setup PASID translation
> structure when user calls KVM_MAP_PASID.
>
> 4. PASID Virtualization
> ------------------------------
>
> When guest SVA (vSVA) is enabled, multiple GVA address spaces are
> created on the assigned vfio device. This leads to the concepts of
> "virtual PASID" (vPASID) vs. "physical PASID" (pPASID). vPASID is assigned
> by the guest to mark an GVA address space while pPASID is the one
> selected by the host and actually routed in the wire.
>
> vPASID is conveyed to the kernel when user calls VFIO_ATTACH_IOASID.
>
> vfio device driver translates vPASID to pPASID before calling ioasid_attach_
> device, with two factors to be considered:
>
> - Whether vPASID is directly used (vPASID==pPASID) in the wire, or
> should be instead converted to a newly-allocated one (vPASID!=
> pPASID);
>
> - If vPASID!=pPASID, whether pPASID is allocated from per-RID PASID
> space or a global PASID space (implying sharing pPASID cross devices,
> e.g. when supporting Intel ENQCMD which puts PASID in a CPU MSR
> as part of the process context);
>
> The actual policy depends on pdev vs. mdev, and whether ENQCMD is
> supported. There are three possible scenarios:
>
> (Note: /dev/ioasid uAPI is not affected by underlying PASID virtualization
> policies.)
>
> 1) pdev (w/ or w/o ENQCMD): vPASID==pPASID
>
> vPASIDs are directly programmed by the guest to the assigned MMIO
> bar, implying all DMAs out of this device having vPASID in the packet
> header. This mandates vPASID==pPASID, sort of delegating the entire
> per-RID PASID space to the guest.
>
> When ENQCMD is enabled, the CPU MSR when running a guest task
> contains a vPASID. In this case the CPU PASID translation capability
> should be disabled so this vPASID in CPU MSR is directly sent to the
> wire.
>
> This ensures consistent vPASID usage on pdev regardless of the
> workload submitted through a MMIO register or ENQCMD instruction.
>
> 2) mdev: vPASID!=pPASID (per-RID if w/o ENQCMD, otherwise global)
>
> PASIDs are also used by kernel to mark the default I/O address space
> for mdev, thus cannot be delegated to the guest. Instead, the mdev
> driver must allocate a new pPASID for each vPASID (thus vPASID!=
> pPASID) and then use pPASID when attaching this mdev to an ioasid.
>
> The mdev driver needs cache the PASID mapping so in mediation
> path vPASID programmed by the guest can be converted to pPASID
> before updating the physical MMIO register. The mapping should
> also be saved in the CPU PASID translation structure (via KVM uAPI),
> so the vPASID saved in the CPU MSR is auto-translated to pPASID
> before sent to the wire, when ENQCMD is enabled.
>
> Generally pPASID could be allocated from the per-RID PASID space
> if all mdev's created on the parent device don't support ENQCMD.
>
> However if the parent supports ENQCMD-capable mdev, pPASIDs
> must be allocated from a global pool because the CPU PASID
> translation structure is per-VM. It implies that when an guest I/O
> page table is attached to two mdevs with a single vPASID (i.e. bind
> to the same guest process), a same pPASID should be used for
> both mdevs even when they belong to different parents. Sharing
> pPASID cross mdevs is achieved by calling aforementioned ioasid_
> get_global_pasid().
>
> 3) Mix pdev/mdev together
>
> Above policies are per device type thus are not affected when mixing
> those device types together (when assigned to a single guest). However,
> there is one exception - when both pdev/mdev support ENQCMD.
>
> Remember the two types have conflicting requirements on whether
> CPU PASID translation should be enabled. This capability is per-VM,
> and must be enabled for mdev isolation. When enabled, pdev will
> receive a mdev pPASID violating its vPASID expectation.
>
> In previous thread a PASID range split scheme was discussed to support
> this combination, but we haven't worked out a clean uAPI design yet.
> Therefore in this proposal we decide to not support it, implying the
> user should have some intelligence to avoid such scenario. It could be
> a TODO task for future.
>
> In spite of those subtle considerations, the kernel implementation could
> start simple, e.g.:
>
> - v==p for pdev;
> - v!=p and always use a global PASID pool for all mdev's;
>
> Regardless of the kernel policy, the user policy is unchanged:
>
> - provide vPASID when calling VFIO_ATTACH_IOASID;
> - call KVM uAPI to setup CPU PASID translation if ENQCMD-capable mdev;
> - Don't expose ENQCMD capability on both pdev and mdev;
>
> Sample user flow is described in section 5.5.
>
> 5. Use Cases and Flows
> -------------------------------
>
> Here assume VFIO will support a new model where every bound device
> is explicitly listed under /dev/vfio thus a device fd can be acquired w/o
> going through legacy container/group interface. For illustration purpose
> those devices are just called dev[1...N]:
>
> device_fd[1...N] = open("/dev/vfio/devices/dev[1...N]", mode);

Minor detail, but I'd suggest /dev/vfio/pci/DDDD:BB:SS.F for the
filenames for actual PCI functions. Maybe /dev/vfio/mdev/something
for mdevs. That leaves other subdirs of /dev/vfio free for future
non-PCI device types, and /dev/vfio itself for the legacy group
devices.
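
i.e. something like (the paths below are only examples of the
suggested layout):

        /* a PCI VF, by its full address */
        device_fd = open("/dev/vfio/pci/0000:03:00.1", O_RDWR);

        /* an mdev, by UUID, under its own subdirectory */
        mdev_fd = open("/dev/vfio/mdev/<uuid>", O_RDWR);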

> As explained earlier, one IOASID fd is sufficient for all intended use cases:
>
> ioasid_fd = open("/dev/ioasid", mode);
>
> For simplicity below examples are all made for the virtualization story.
> They are representative and could be easily adapted to a non-virtualization
> scenario.
>
> Three types of IOASIDs are considered:
>
> gpa_ioasid[1...N]: for GPA address space
> giova_ioasid[1...N]: for guest IOVA address space
> gva_ioasid[1...N]: for guest CPU VA address space
>
> At least one gpa_ioasid must always be created per guest, while the other
> two are relevant as far as vIOMMU is concerned.
>
> Examples here apply to both pdev and mdev, if not explicitly marked out
> (e.g. in section 5.5). VFIO device driver in the kernel will figure out the
> associated routing information in the attaching operation.
>
> For illustration simplicity, IOASID_CHECK_EXTENSION and IOASID_GET_
> INFO are skipped in these examples.
>
> 5.1. A simple example
> ++++++++++++++++++
>
> Dev1 is assigned to the guest. One gpa_ioasid is created. The GPA address
> space is managed through DMA mapping protocol:
>
> /* Bind device to IOASID fd */
> device_fd = open("/dev/vfio/devices/dev1", mode);
> ioasid_fd = open("/dev/ioasid", mode);
> ioctl(device_fd, VFIO_BIND_IOASID_FD, ioasid_fd);
>
> /* Attach device to IOASID */
> gpa_ioasid = ioctl(ioasid_fd, IOASID_ALLOC);
> at_data = { .ioasid = gpa_ioasid};
> ioctl(device_fd, VFIO_ATTACH_IOASID, &at_data);
>
> /* Setup GPA mapping */
> dma_map = {
> .ioasid = gpa_ioasid;
> .iova = 0; // GPA
> .vaddr = 0x40000000; // HVA
> .size = 1GB;
> };
> ioctl(ioasid_fd, IOASID_DMA_MAP, &dma_map);
>
> If the guest is assigned with more than dev1, user follows above sequence
> to attach other devices to the same gpa_ioasid i.e. sharing the GPA
> address space cross all assigned devices.
>
> 5.2. Multiple IOASIDs (no nesting)
> ++++++++++++++++++++++++++++
>
> Dev1 and dev2 are assigned to the guest. vIOMMU is enabled. Initially
> both devices are attached to gpa_ioasid.

Doesn't really affect your example, but note that the PAPR IOMMU does
not have a passthrough mode, so devices will not initially be attached
to gpa_ioasid - they will be unusable for DMA until attached to a
gIOVA ioasid.

> After boot the guest creates
> an GIOVA address space (giova_ioasid) for dev2, leaving dev1 in pass
> through mode (gpa_ioasid).
>
> Suppose IOASID nesting is not supported in this case. Qemu need to
> generate shadow mappings in userspace for giova_ioasid (like how
> VFIO works today).
>
> To avoid duplicated locked page accounting, it's recommended to pre-
> register the virtual address range that will be used for DMA:
>
> device_fd1 = open("/dev/vfio/devices/dev1", mode);
> device_fd2 = open("/dev/vfio/devices/dev2", mode);
> ioasid_fd = open("/dev/ioasid", mode);
> ioctl(device_fd1, VFIO_BIND_IOASID_FD, ioasid_fd);
> ioctl(device_fd2, VFIO_BIND_IOASID_FD, ioasid_fd);
>
> /* pre-register the virtual address range for accounting */
> mem_info = { .vaddr = 0x40000000; .size = 1GB };
> ioctl(ioasid_fd, IOASID_REGISTER_MEMORY, &mem_info);
> /* Attach dev1 and dev2 to gpa_ioasid */
> gpa_ioasid = ioctl(ioasid_fd, IOASID_ALLOC);
> at_data = { .ioasid = gpa_ioasid};
> ioctl(device_fd1, VFIO_ATTACH_IOASID, &at_data);
> ioctl(device_fd2, VFIO_ATTACH_IOASID, &at_data);
>
> /* Setup GPA mapping */
> dma_map = {
> .ioasid = gpa_ioasid;
> .iova = 0; // GPA
> .vaddr = 0x40000000; // HVA
> .size = 1GB;
> };
> ioctl(ioasid_fd, IOASID_DMA_MAP, &dma_map);
>
> /* After boot, guest enables an GIOVA space for dev2 */

Again, doesn't break the example, but this need not happen after guest
boot. On the PAPR vIOMMU, the guest IOVA spaces (known as "logical IO
bus numbers" / liobns) and which devices are in each are fixed at
guest creation time and advertised to the guest via firmware.

> giova_ioasid = ioctl(ioasid_fd, IOASID_ALLOC);
>
> /* First detach dev2 from previous address space */
> at_data = { .ioasid = gpa_ioasid};
> ioctl(device_fd2, VFIO_DETACH_IOASID, &at_data);
>
> /* Then attach dev2 to the new address space */
> at_data = { .ioasid = giova_ioasid};
> ioctl(device_fd2, VFIO_ATTACH_IOASID, &at_data);
>
> /* Setup a shadow DMA mapping according to vIOMMU
> * GIOVA (0x2000) -> GPA (0x1000) -> HVA (0x40001000)
> */
> dma_map = {
> .ioasid = giova_ioasid;
> .iova = 0x2000; // GIOVA
> .vaddr = 0x40001000; // HVA
> .size = 4KB;
> };
> ioctl(ioasid_fd, IOASID_DMA_MAP, &dma_map);
>
> 5.3. IOASID nesting (software)
> +++++++++++++++++++++++++
>
> Same usage scenario as 5.2, with software-based IOASID nesting
> available. In this mode it is the kernel instead of the user that creates
> the shadow mapping.

In this case, I feel like the preregistration is redundant with the
GPA level mapping. As long as the gIOVA mappings (which might be
frequent) can piggyback on the accounting done for the GPA mapping we
accomplish what we need from preregistration.

> The flow before guest boots is same as 5.2, except one point. Because
> giova_ioasid is nested on gpa_ioasid, locked accounting is only
> conducted for gpa_ioasid. So it's not necessary to pre-register virtual
> memory.
>
> To save space we only list the steps after boot (i.e. both dev1/dev2
> have been attached to gpa_ioasid before guest boots):
>
> /* After boots */
> /* Make GIOVA space nested on GPA space */
> giova_ioasid = ioctl(ioasid_fd, IOASID_CREATE_NESTING,
> gpa_ioasid);
>
> /* Attach dev2 to the new address space (child)
> * Note dev2 is still attached to gpa_ioasid (parent)
> */
> at_data = { .ioasid = giova_ioasid};
> ioctl(device_fd2, VFIO_ATTACH_IOASID, &at_data);
>
> /* Setup a GIOVA->GPA mapping for giova_ioasid, which will be
> * merged by the kernel with GPA->HVA mapping of gpa_ioasid
> * to form a shadow mapping.
> */
> dma_map = {
> .ioasid = giova_ioasid;
> .iova = 0x2000; // GIOVA
> .vaddr = 0x1000; // GPA
> .size = 4KB;
> };
> ioctl(ioasid_fd, IOASID_DMA_MAP, &dma_map);
>
> 5.4. IOASID nesting (hardware)
> +++++++++++++++++++++++++
>
> Same usage scenario as 5.2, with hardware-based IOASID nesting
> available. In this mode the pgtable binding protocol is used to
> bind the guest IOVA page table with the IOMMU:
>
> /* After boots */
> /* Make GIOVA space nested on GPA space */
> giova_ioasid = ioctl(ioasid_fd, IOASID_CREATE_NESTING,
> gpa_ioasid);
>
> /* Attach dev2 to the new address space (child)
> * Note dev2 is still attached to gpa_ioasid (parent)
> */
> at_data = { .ioasid = giova_ioasid};
> ioctl(device_fd2, VFIO_ATTACH_IOASID, &at_data);
>
> /* Bind guest I/O page table */
> bind_data = {
> .ioasid = giova_ioasid;
> .addr = giova_pgtable;
> // and format information
> };
> ioctl(ioasid_fd, IOASID_BIND_PGTABLE, &bind_data);
>
> /* Invalidate IOTLB when required */
> inv_data = {
> .ioasid = giova_ioasid;
> // granular information
> };
> ioctl(ioasid_fd, IOASID_INVALIDATE_CACHE, &inv_data);
>
> /* See 5.6 for I/O page fault handling */
>
> 5.5. Guest SVA (vSVA)
> ++++++++++++++++++
>
> After boot the guest further creates a GVA address space (gpasid1) on
> dev1. Dev2 is not affected (still attached to giova_ioasid).
>
> As explained in section 4, the user should avoid exposing ENQCMD on both
> pdev and mdev.
>
> The sequence applies to all device types (being pdev or mdev), except
> one additional step to call KVM for ENQCMD-capable mdev:
>
> /* After boots */
> /* Make GVA space nested on GPA space */
> gva_ioasid = ioctl(ioasid_fd, IOASID_CREATE_NESTING,
> gpa_ioasid);

I'm not clear what gva_ioasid is representing. Is it representing a
single vPASID's address space, or a whole bunch of vPASIDs address
spaces?

> /* Attach dev1 to the new address space and specify vPASID */
> at_data = {
> .ioasid = gva_ioasid;
> .flag = IOASID_ATTACH_USER_PASID;
> .user_pasid = gpasid1;
> };
> ioctl(device_fd1, VFIO_ATTACH_IOASID, &at_data);
>
> /* if dev1 is ENQCMD-capable mdev, update CPU PASID
> * translation structure through KVM
> */
> pa_data = {
> .ioasid_fd = ioasid_fd;
> .ioasid = gva_ioasid;
> .guest_pasid = gpasid1;
> };
> ioctl(kvm_fd, KVM_MAP_PASID, &pa_data);
>
> /* Bind guest I/O page table */
> bind_data = {
> .ioasid = gva_ioasid;
> .addr = gva_pgtable1;
> // and format information
> };
> ioctl(ioasid_fd, IOASID_BIND_PGTABLE, &bind_data);
>
> ...
>
>
> 5.6. I/O page fault
> +++++++++++++++
>
> (uAPI is TBD. Here is just about the high-level flow from host IOMMU driver
> to guest IOMMU driver and backwards).
>
> - Host IOMMU driver receives a page request with raw fault_data {rid,
> pasid, addr};
>
> - Host IOMMU driver identifies the faulting I/O page table according to
> information registered by IOASID fault handler;
>
> - IOASID fault handler is called with raw fault_data (rid, pasid, addr), which
> is saved in ioasid_data->fault_data (used for response);
>
> - IOASID fault handler generates a user fault_data (ioasid, addr), links it
> to the shared ring buffer and triggers eventfd to userspace;
>
> - Upon receiving the event, Qemu needs to find the virtual routing information
> (v_rid + v_pasid) of the device attached to the faulting ioasid. If there are
> multiple, pick a random one. This should be fine since the purpose is to
> fix the I/O page table on the guest;
>
> - Qemu generates a virtual I/O page fault through vIOMMU into guest,
> carrying the virtual fault data (v_rid, v_pasid, addr);
>
> - Guest IOMMU driver fixes up the fault, updates the I/O page table, and
> then sends a page response with virtual completion data (v_rid, v_pasid,
> response_code) to vIOMMU;
>
> - Qemu finds the pending fault event, converts virtual completion data
> into (ioasid, response_code), and then calls a /dev/ioasid ioctl to
> complete the pending fault;
>
> - /dev/ioasid finds out the pending fault data {rid, pasid, addr} saved in
> ioasid_data->fault_data, and then calls iommu api to complete it with
> {rid, pasid, response_code};
>
> 5.7. BIND_PASID_TABLE
> ++++++++++++++++++++
>
> The PASID table is put in the GPA space on some platforms, thus must be updated
> by the guest. It is treated as another user page table to be bound with the
> IOMMU.
>
> As explained earlier, the user still needs to explicitly bind every user I/O
> page table to the kernel so the same pgtable binding protocol (bind, cache
> invalidate and fault handling) is unified across platforms.
>
> vIOMMUs may include a caching mode (or paravirtualized way) which, once
> enabled, requires the guest to invalidate PASID cache for any change on the
> PASID table. This allows Qemu to track the lifespan of guest I/O page tables.
>
> If such a capability is missing, Qemu could enable write-protection on
> the guest PASID table to achieve the same effect.
>
> /* After boots */
> /* Make vPASID space nested on GPA space */
> pasidtbl_ioasid = ioctl(ioasid_fd, IOASID_CREATE_NESTING,
> gpa_ioasid);

I think this time pasidtbl_ioasid is representing multiple vPASID
address spaces, yes? In which case I don't think it should be treated
as the same sort of object as a normal IOASID, which represents a
single address space IIUC.

> /* Attach dev1 to pasidtbl_ioasid */
> at_data = { .ioasid = pasidtbl_ioasid};
> ioctl(device_fd1, VFIO_ATTACH_IOASID, &at_data);
>
> /* Bind PASID table */
> bind_data = {
> .ioasid = pasidtbl_ioasid;
> .addr = gpa_pasid_table;
> // and format information
> };
> ioctl(ioasid_fd, IOASID_BIND_PASID_TABLE, &bind_data);
>
> /* vIOMMU detects a new GVA I/O space created */
> gva_ioasid = ioctl(ioasid_fd, IOASID_CREATE_NESTING,
> gpa_ioasid);
>
> /* Attach dev1 to the new address space, with gpasid1 */
> at_data = {
> .ioasid = gva_ioasid;
> .flag = IOASID_ATTACH_USER_PASID;
> .user_pasid = gpasid1;
> };
> ioctl(device_fd1, VFIO_ATTACH_IOASID, &at_data);
>
> /* Bind guest I/O page table. Because SET_PASID_TABLE has been
> * used, the kernel will not update the PASID table. Instead, just
> * track the bound I/O page table for handling invalidation and
> * I/O page faults.
> */
> bind_data = {
> .ioasid = gva_ioasid;
> .addr = gva_pgtable1;
> // and format information
> };
> ioctl(ioasid_fd, IOASID_BIND_PGTABLE, &bind_data);

Hrm.. if you still have to individually bind a table for each vPASID,
what's the point of BIND_PASID_TABLE?

--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson


Subject: Re: [RFC] /dev/ioasid uAPI proposal

On 31.05.21 19:37, Parav Pandit wrote:

> It appears that this is only to make map ioctl faster apart from accounting.
> It doesn't have any ioasid handle input either.
>
> In that case, can it be a new system call? Why does it have to be under /dev/ioasid?
> For example few years back such system call mpin() thought was proposed in [1].

I'm very reluctant about more syscall inflation. We already have lots of
syscalls that could have been easily done via devices or filesystems
(yes, some of them are just old Unix relics).

Syscalls don't play well w/ modules, containers, distributed systems,
etc, and need extra low-level code for most non-C languages (eg.
scripting languages).


--mtx

--
---
Note: unencrypted e-mails can easily be intercepted and manipulated!
For confidential communication, please send your GPG/PGP key.
---
Enrico Weigelt, metux IT consult
Free software and Linux embedded engineering
[email protected] -- +49-151-27565287

Subject: Re: [RFC] /dev/ioasid uAPI proposal

On 27.05.21 09:58, Tian, Kevin wrote:

Hi,

> /dev/ioasid provides an unified interface for managing I/O page tables for
> devices assigned to userspace. Device passthrough frameworks (VFIO, vDPA,
> etc.) are expected to use this interface instead of creating their own logic to
> isolate untrusted device DMAs initiated by userspace.

While I'm in favour of having generic APIs for generic tasks, as well as
using FDs, I wonder whether it has to be a new and separate device.

Now applications have to use multiple APIs in lockstep. One consequence
of that is operators, as well as provisioning systems, container
infrastructures, etc, always have to consider multiple devices together.

You can't just say "give workload XY access to device /dev/foo" anymore.
Now you have to take care about scenarios like "if someone wants
/dev/foo, he also needs /dev/bar". And if that happens multiple times
together ("/dev/foo and /dev/wurst both require /dev/bar"), leading to
scenarios like the dev nodes being bind-mounted somewhere, you need to
take care that additional devices aren't bind-mounted twice, etc ...

If I understand this correctly, /dev/ioasid is a kind of "common
supplier" to other APIs / devices. Why can't the fd be acquired by the
consumer APIs (eg. kvm, vfio, etc) ?


--mtx

--
---
Note: unencrypted e-mails can easily be intercepted and manipulated!
For confidential communication, please send your GPG/PGP key.
---
Enrico Weigelt, metux IT consult
Free software and Linux embedded engineering
[email protected] -- +49-151-27565287

2021-06-02 11:52:33

by Jason Wang

[permalink] [raw]
Subject: Re: [RFC] /dev/ioasid uAPI proposal


On 2021/6/2 4:28 AM, Jason Gunthorpe wrote:
>> I summarized five opens here, about:
>>
>> 1) Finalizing the name to replace /dev/ioasid;
>> 2) Whether one device is allowed to bind to multiple IOASID fd's;
>> 3) Carry device information in invalidation/fault reporting uAPI;
>> 4) What should/could be specified when allocating an IOASID;
>> 5) The protocol between vfio group and kvm;
>>
>> For 1), two alternative names are mentioned: /dev/iommu and
>> /dev/ioas. I don't have a strong preference and would like to hear
>> votes from all stakeholders. /dev/iommu is slightly better imho for
>> two reasons. First, per AMD's presentation in last KVM forum they
>> implement vIOMMU in hardware thus need to support user-managed
>> domains. An iommu uAPI notation might make more sense moving
>> forward. Second, it makes later uAPI naming easier as 'IOASID' can
>> be always put as an object, e.g. IOMMU_ALLOC_IOASID instead of
>> IOASID_ALLOC_IOASID.:)
> I think two years ago I suggested /dev/iommu and it didn't go very far
> at the time.


It looks to me like using "/dev/iommu" excludes the possibility of
implementing IOASID in a device-specific way (e.g. through the
co-operation of a device MMU + platform IOMMU)?

What's more, the ATS spec doesn't forbid the device #PF from being reported
in a device-specific way.

Thanks


> We've also talked about this as /dev/sva for a while and
> now /dev/ioasid
>
> I think /dev/iommu is fine, and call the things inside them IOAS
> objects.
>
> Then we don't have naming aliasing with kernel constructs.
>

2021-06-02 11:52:59

by Jason Wang

[permalink] [raw]
Subject: Re: [RFC] /dev/ioasid uAPI proposal


On 2021/6/2 1:29, Jason Gunthorpe wrote:
> On Tue, Jun 01, 2021 at 02:07:05PM +0800, Jason Wang wrote:
>
>> For the case of 1M, I would like to know what's the use case for a single
>> process to handle 1M+ address spaces?
> For some scenarios every guest PASID will require a IOASID ID # so
> there is a large enough demand that FDs alone are not a good fit.
>
> Further there are global container wide properties that are hard to
> carry over to a multi-FD model, like the attachment of devices to the
> container at the startup.


So if we implement a per-fd model, the global "container" properties could
be done via the parent fd, e.g. attaching the parent to the device at
startup.


>
>>> So this RFC treats fd as a container of address spaces which is each
>>> tagged by an IOASID.
>> If the container and address space is 1:1 then the container seems useless.
> The examples at the bottom of the document show multiple IOASIDs in
> the container for a parent/child type relationship


This can also be done per fd? An fd parent can have multiple fd children.

Thanks


>
> Jason
>

2021-06-02 12:43:56

by Parav Pandit

[permalink] [raw]
Subject: RE: [RFC] /dev/ioasid uAPI proposal


> From: Enrico Weigelt, metux IT consult <[email protected]>
> Sent: Wednesday, June 2, 2021 2:09 PM
>
> On 31.05.21 19:37, Parav Pandit wrote:
>
> > It appears that this is only to make map ioctl faster apart from accounting.
> > It doesn't have any ioasid handle input either.
> >
> > In that case, can it be a new system call? Why does it have to be under
> /dev/ioasid?
> > For example few years back such system call mpin() thought was proposed
> in [1].
>
> I'm very reluctant to more syscall inflation. We already have lots of syscalls
> that could have been easily done via devices or filesystems (yes, some of
> them are just old Unix relics).
>
> Syscalls don't play well w/ modules, containers, distributed systems, etc, and
> need extra low-level code for most non-C languages (eg.
> scripting languages).

Likely, but as per my understanding, this ioctl() is a wrapper around device-agnostic code like:

{
	/* account the pages against the process and keep them resident */
	atomic64_add(npages, &mm->pinned_vm);
	pin_user_pages(...);
}

And mm must hold a reference to those pages, so that they cannot be munmap()'ed or freed.

And the second reason, I think (I could be wrong), is that the second-level page table for a PASID should be the same as what the process CR3 uses.
Essentially the IOMMU page table and the MMU page table should point to the same page table entries.
If they are different, then even if the guest CPU has accessed the pages, device access via the IOMMU will result in expensive page faults.

So assuming both CR3 and the PASID table entry point to the same page table, I fail to understand the need for the extra refcount and hence the driver-specific ioctl().
Though I do not have a strong objection to the ioctl(), I want to know what it will and will not do.
io_uring has a similar ioctl() doing io_sqe_buffer_register(), pinning the memory.

2021-06-02 16:04:30

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [RFC] /dev/ioasid uAPI proposal

On Wed, Jun 02, 2021 at 02:20:15AM +0000, Tian, Kevin wrote:
> > From: Alex Williamson <[email protected]>
> > Sent: Wednesday, June 2, 2021 6:22 AM
> >
> > On Tue, 1 Jun 2021 07:01:57 +0000
> > "Tian, Kevin" <[email protected]> wrote:
> > >
> > > I summarized five opens here, about:
> > >
> > > 1) Finalizing the name to replace /dev/ioasid;
> > > 2) Whether one device is allowed to bind to multiple IOASID fd's;
> > > 3) Carry device information in invalidation/fault reporting uAPI;
> > > 4) What should/could be specified when allocating an IOASID;
> > > 5) The protocol between vfio group and kvm;
> > >
> > ...
> > >
> > > For 5), I'd expect Alex to chime in. Per my understanding looks the
> > > original purpose of this protocol is not about I/O address space. It's
> > > for KVM to know whether any device is assigned to this VM and then
> > > do something special (e.g. posted interrupt, EPT cache attribute, etc.).
> >
> > Right, the original use case was for KVM to determine whether it needs
> > to emulate invlpg, so it needs to be aware when an assigned device is
>
> invlpg -> wbinvd :)
>
> > present and be able to test if DMA for that device is cache
> > coherent.

Why is this such a strong linkage to VFIO and not just a 'hey kvm
emulate wbinvd' flag from qemu?

I briefly looked and didn't see any obvious linkage in the arch code, just some
dead code:

$ git grep iommu_noncoherent
arch/x86/include/asm/kvm_host.h: bool iommu_noncoherent;
$ git grep iommu_domain arch/x86
arch/x86/include/asm/kvm_host.h: struct iommu_domain *iommu_domain;

Huh?

It kind of looks like the other main point is to generate the
VFIO_GROUP_NOTIFY_SET_KVM which is being used by two VFIO drivers to
connect back to the kvm data

But that seems like it would have been better handled with some IOCTL
on the vfio_device fd to import the KVM to the driver not this
roundabout way?

> > The user, QEMU, creates a KVM "pseudo" device representing the vfio
> > group, providing the file descriptor of that group to show ownership.
> > The ugly symbol_get code is to avoid hard module dependencies, ie. the
> > kvm module should not pull in or require the vfio module, but vfio will
> > be present if attempting to register this device.
>
> so the symbol_get thing is not about the protocol itself. Whatever protocol
> is defined, as long as kvm needs to call vfio or ioasid helper function, we
> need define a proper way to do it. Jason, what's your opinion of alternative
> option since you dislike symbol_get?

The symbol_get was to avoid module dependencies because bringing in
vfio along with kvm is not nice.

The symbol_get is not nice here, but unless things can be truly moved
to lower levels of code where module dependencies are not a problem (eg
kvm to iommu is a non-issue) I don't see much of a solution.

Other cases like kvmgt or AP would be similarly fine to have had a
kvmgt to kvm module dependency.

> > All of these use cases are related to the IOMMU, whether DMA is
> > coherent, translating device IOVA to GPA, and an acceleration path to
> > emulate IOMMU programming in kernel... they seem pretty relevant.
>
> One open is whether kvm should hold a device reference or IOASID
> reference. For DMA coherence, it only matters whether assigned
> devices are coherent or not (not for a specific address space). For kvmgt,
> it is for recoding kvm pointer in mdev driver to do write protection. For
> ppc, it does relate to a specific I/O page table.
>
> Then I feel only a part of the protocol will be moved to /dev/ioasid and
> something will still remain between kvm and vfio?

Honestly I would try not to touch this too much :\

Jason

2021-06-02 16:08:46

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [RFC] /dev/ioasid uAPI proposal

On Wed, Jun 02, 2021 at 04:52:02PM +0800, Jason Wang wrote:
>
> On 2021/6/2 4:28 AM, Jason Gunthorpe wrote:
> > > I summarized five opens here, about:
> > >
> > > 1) Finalizing the name to replace /dev/ioasid;
> > > 2) Whether one device is allowed to bind to multiple IOASID fd's;
> > > 3) Carry device information in invalidation/fault reporting uAPI;
> > > 4) What should/could be specified when allocating an IOASID;
> > > 5) The protocol between vfio group and kvm;
> > >
> > > For 1), two alternative names are mentioned: /dev/iommu and
> > > /dev/ioas. I don't have a strong preference and would like to hear
> > > votes from all stakeholders. /dev/iommu is slightly better imho for
> > > two reasons. First, per AMD's presentation in last KVM forum they
> > > implement vIOMMU in hardware thus need to support user-managed
> > > domains. An iommu uAPI notation might make more sense moving
> > > forward. Second, it makes later uAPI naming easier as 'IOASID' can
> > > be always put as an object, e.g. IOMMU_ALLOC_IOASID instead of
> > > IOASID_ALLOC_IOASID.:)
> > I think two years ago I suggested /dev/iommu and it didn't go very far
> > at the time.
>
>
> It looks to me using "/dev/iommu" excludes the possibility of implementing
> IOASID in a device specific way (e.g through the co-operation with device
> MMU + platform IOMMU)?

This is intended to be the 'drivers/iommu' subsystem though. I don't
want to see pluggability here beyond what drivers/iommu is providing.

If someone wants to do a something complicated through this interface
then they need to make a drivers/iommu implementation.

Or they need to use the mdev-esque "SW TABLE" mode when their driver
attaches to the interface.

Whether this is good enough for a specific device is an entirely
different question though

> What's more, ATS spec doesn't forbid the device #PF to be reported via a
> device specific way.

And if this is done then a kernel function indicating page fault
should be triggered on the ioasid handle that the device has. It is
still drivers/iommu functionality
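
As a sketch only (this helper does not exist today; the name and
signature are made up), the driver-facing entry point could look
roughly like:

	/*
	 * Hypothetical: a device driver that receives a page fault through
	 * a device-specific channel forwards it to the iommu/ioasid layer,
	 * which then follows the normal fault path from section 5.6.
	 */
	int ioasid_report_fault(struct device *dev, ioasid_t ioasid,
				u64 addr, unsigned int perm);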

Jason

2021-06-02 16:13:11

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [RFC] /dev/ioasid uAPI proposal

On Wed, Jun 02, 2021 at 01:33:22AM +0000, Tian, Kevin wrote:
> > From: Jason Gunthorpe <[email protected]>
> > Sent: Wednesday, June 2, 2021 1:42 AM
> >
> > On Tue, Jun 01, 2021 at 08:10:14AM +0000, Tian, Kevin wrote:
> > > > From: Jason Gunthorpe <[email protected]>
> > > > Sent: Saturday, May 29, 2021 1:36 AM
> > > >
> > > > On Thu, May 27, 2021 at 07:58:12AM +0000, Tian, Kevin wrote:
> > > >
> > > > > IOASID nesting can be implemented in two ways: hardware nesting and
> > > > > software nesting. With hardware support the child and parent I/O page
> > > > > tables are walked consecutively by the IOMMU to form a nested
> > translation.
> > > > > When it's implemented in software, the ioasid driver is responsible for
> > > > > merging the two-level mappings into a single-level shadow I/O page
> > table.
> > > > > Software nesting requires both child/parent page tables operated
> > through
> > > > > the dma mapping protocol, so any change in either level can be
> > captured
> > > > > by the kernel to update the corresponding shadow mapping.
> > > >
> > > > Why? A SW emulation could do this synchronization during invalidation
> > > > processing if invalidation contained an IOVA range.
> > >
> > > In this proposal we differentiate between host-managed and user-
> > > managed I/O page tables. If host-managed, the user is expected to use
> > > map/unmap cmd explicitly upon any change required on the page table.
> > > If user-managed, the user first binds its page table to the IOMMU and
> > > then use invalidation cmd to flush iotlb when necessary (e.g. typically
> > > not required when changing a PTE from non-present to present).
> > >
> > > We expect user to use map+unmap and bind+invalidate respectively
> > > instead of mixing them together. Following this policy, map+unmap
> > > must be used in both levels for software nesting, so changes in either
> > > level are captured timely to synchronize the shadow mapping.
> >
> > map+unmap or bind+invalidate is a policy of the IOASID itself set when
> > it is created. If you put two different types in a tree then each IOASID
> > must continue to use its own operation mode.
> >
> > I don't see a reason to force all IOASIDs in a tree to be consistent??
>
> only for software nesting. With hardware support the parent uses map
> while the child uses bind.
>
> Yes, the policy is specified per IOASID. But if the policy violates the
> requirement in a specific nesting mode, then nesting should fail.

I don't get it.

If the IOASID is a page table then it is bind/invalidate. SW or not SW
doesn't matter at all.

> >
> > A software emulated two level page table where the leaf level is a
> > bound page table in guest memory should continue to use
> > bind/invalidate to maintain the guest page table IOASID even though it
> > is a SW construct.
>
> with software nesting the leaf should be a host-managed page table
> (or metadata). A bind/invalidate protocol doesn't require the user
> to notify the kernel of every page table change.

The purpose of invalidate is to inform the implementation that the
page table has changed so it can flush the caches. If the page table
is changed and invalidation is not issued then the implementation
is free to ignore the changes.

In this way the SW mode is the same as a HW mode with an infinite
cache.

The collapsed shadow page table is really just a cache.
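
To make that concrete with the RFC's own example ioctls (an illustrative
sketch of the model described here, not a statement of the final
semantics): after the user changes a PTE in its level of a software-nested
table, the kernel only has to resync the collapsed shadow when the
invalidation arrives.

	/* user modifies a PTE in its (SW-nested) I/O page table ...
	 * the kernel may keep serving the stale shadow entry until: */
	inv_data = {
		.ioasid = giova_ioasid;
		// granularity: the changed GIOVA range
	};
	ioctl(ioasid_fd, IOASID_INVALIDATE_CACHE, &inv_data);
	/* only now must the shadow mapping be updated/flushed */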

Jason

2021-06-02 16:21:00

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [RFC] /dev/ioasid uAPI proposal

On Wed, Jun 02, 2021 at 04:32:27PM +1000, David Gibson wrote:
> > I agree with Jean-Philippe - at the very least erasing this
> > information needs a major rational - but I don't really see why it
> > must be erased? The HW reports the originating device, is it just a
> > matter of labeling the devices attached to the /dev/ioasid FD so it
> > can be reported to userspace?
>
> HW reports the originating device as far as it knows. In many cases
> where you have multiple devices in an IOMMU group, it's because
> although they're treated as separate devices at the kernel level, they
> have the same RID at the HW level. Which means a RID for something in
> the right group is the closest you can count on supplying.

Granted there may be cases where exact fidelity is not possible, but
that doesn't excuse eliminating fidelity where it does exist..

> > If there are no hypervisor traps (does this exist?) then there is no
> > way to involve the hypervisor here and the child IOASID should simply
> > be a pointer to the guest's data structure that describes binding. In
> > this case that IOASID should claim all PASIDs when bound to a
> > RID.
>
> And in that case I think we should call that object something other
> than an IOASID, since it represents multiple address spaces.

Maybe.. It is certainly a special case.

We can still consider it a single "address space" from the IOMMU
perspective. What has happened is that the address table is not just a
64 bit IOVA, but an extended ~80 bit IOVA formed by "PASID, IOVA".

If we are already going in the direction of having the IOASID specify
the page table format and other details, specifying that the page
table format is the 80 bit "PASID, IOVA" format is a fairly small
step.

I wouldn't twist things into knots to create a difference, but if it
is easy to do it wouldn't hurt either.

Jason

2021-06-02 16:42:38

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [RFC] /dev/ioasid uAPI proposal

On Wed, Jun 02, 2021 at 04:57:52PM +1000, David Gibson wrote:

> I don't think presence or absence of a group fd makes a lot of
> difference to this design. Having a group fd just means we attach
> groups to the ioasid instead of individual devices, and we no longer
> need the bookkeeping of "partial" devices.

Oh, I think we really don't want to attach the group to an ioasid, or
at least not as a first-class idea.

The fundamental problem that got us here is we now live in a world
where there are many ways to attach a device to an IOASID:

- A RID binding
- A RID,PASID binding
- A RID,PASID binding for ENQCMD
- A SW TABLE binding
- etc

The selection of which mode to use is based on the specific
driver/device operation. Ie the thing that implements the 'struct
vfio_device' is the thing that has to select the binding mode.

group attachment was fine when there was only one mode. As you say it
is fine to just attach every group member with RID binding if RID
binding is the only option.

When SW TABLE binding was added the group code was hacked up - now the
group logic is choosing between RID/SW TABLE in a very hacky and mdev
specific way, and this is just a mess.

The flow must carry the IOASID from the /dev/iommu to the vfio_device
driver and the vfio_device implementation must choose which binding
mode and parameters it wants based on driver and HW configuration.

eg if two PCI devices are in a group then it is perfectly fine that
one device uses RID binding and the other device uses RID,PASID
binding.
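
Using the RFC's example ioctls, a rough sketch of that (field names follow
the RFC's pseudocode, not a final uAPI):

	/* dev1: plain RID binding to the GPA address space */
	at_data = { .ioasid = gpa_ioasid };
	ioctl(device_fd1, VFIO_ATTACH_IOASID, &at_data);

	/* dev2 (same group): RID,PASID binding to a GVA address space */
	at_data = {
		.ioasid = gva_ioasid;
		.flag = IOASID_ATTACH_USER_PASID;
		.user_pasid = gpasid1;
	};
	ioctl(device_fd2, VFIO_ATTACH_IOASID, &at_data);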

The only place I see for a "group bind" in the uAPI is some compat
layer for the vfio container, and the implementation would be quite
different, we'd have to call each vfio_device driver in the group and
execute the IOASID attach IOCTL.

> > I would say no on the container. /dev/ioasid == the container, having
> > two competing objects at once in a single process is just a mess.
>
> Right. I'd assume that for compatibility, creating a container would
> create a single IOASID under the hood with a compatibility layer
> translating the container operations to ioasid operations.

It is a nice dream for sure

/dev/vfio could be a special case of /dev/ioasid just with a different
uapi and ending up with only one IOASID. They could be interchangeable
from then on, which would simplify the internals of VFIO if it
consistently dealt with these new ioasid objects everywhere. But last I
looked it was complicated enough to best be done later on

Jason

2021-06-02 17:01:05

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [RFC] /dev/ioasid uAPI proposal

On Wed, Jun 02, 2021 at 04:48:35PM +1000, David Gibson wrote:
> > > /* Bind guest I/O page table */
> > > bind_data = {
> > > .ioasid = gva_ioasid;
> > > .addr = gva_pgtable1;
> > > // and format information
> > > };
> > > ioctl(ioasid_fd, IOASID_BIND_PGTABLE, &bind_data);
> >
> > Again I do wonder if this should just be part of alloc_ioasid. Is
> > there any reason to split these things? The only advantage to the
> > split is the device is known, but the device shouldn't impact
> > anything..
>
> I'm pretty sure the device(s) could matter, although they probably
> won't usually.

It is a bit subtle, but the /dev/iommu fd itself is connected to the
devices first. This prevents wildly incompatible devices from being
joined together, and allows some "get info" to report the capability
union of all devices if we want to do that.

The original concept was that devices joined would all have to support
the same IOASID format, at least for the kernel owned map/unmap IOASID
type. Supporting different page table formats maybe is reason to
revisit that concept.

There is a small advantage to re-using the IOASID container because of
the get_user_pages caching and pinned accounting management at the FD
level.

I don't know if that small advantage is worth the extra complexity
though.

> But it would certainly be possible for a system to have two
> different host bridges with two different IOMMUs with different
> pagetable formats. Until you know which devices (and therefore
> which host bridge) you're talking about, you don't know what formats
> of pagetable to accept. And if you have devices from *both* bridges
> you can't bind a page table at all - you could theoretically support
> a kernel managed pagetable by mirroring each MAP and UNMAP to tables
> in both formats, but it would be pretty reasonable not to support
> that.

The basic process for a user space owned pgtable mode would be:

1) qemu has to figure out what format of pgtable to use

Presumably it uses query functions using the device label. The
kernel code should look at the entire device path through all the
IOMMU HW to determine what is possible.

Or it already knows because the VM's vIOMMU is running in some
fixed page table format, or the VM's vIOMMU already told it, or
something.

2) qemu creates an IOASID and based on #1 and says 'I want this format'

3) qemu binds the IOASID to the device.

If qemu gets it wrong then it just fails.

4) For the next device qemu would have to figure out if it can re-use
an existing IOASID based on the required properties.
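
A rough sketch of steps 1-3 in terms of the RFC's example ioctls. Note the
argument to IOASID_ALLOC is hypothetical - what can be specified at
allocation time is one of the open questions in this thread:

	ioctl(device_fd, VFIO_BIND_IOASID_FD, ioasid_fd);

	/* 1) query what the device path can do (IOASID_GET_INFO is in the
	 *    proposal; the returned format info is assumed here) */
	ioctl(ioasid_fd, IOASID_GET_INFO, &info);

	/* 2) allocate an IOASID asking for the format the vIOMMU needs */
	alloc = { /* hypothetical: .pgtable_format = <ARM 64-bit v2> */ };
	gva_ioasid = ioctl(ioasid_fd, IOASID_ALLOC, &alloc);

	/* 3) attach; this simply fails if the device/IOMMU cannot do the
	 *    requested format, and qemu falls back to SW emulation */
	at_data = { .ioasid = gva_ioasid };
	if (ioctl(device_fd, VFIO_ATTACH_IOASID, &at_data) < 0)
		/* fall back */;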

You pointed to the case of mixing vIOMMU's of different platforms. So
it is completely reasonable for qemu to ask for a "ARM 64 bit IOMMU
page table mode v2" while running on an x86 because that is what the
vIOMMU is wired to work with.

Presumably qemu will fall back to software emulation if this is not
possible.

One interesting option for software emulation is to just transform the
ARM page table format to a x86 page table format in userspace and use
nested bind/invalidate to synchronize with the kernel. With SW nesting
I suspect this would be much faster

Jason

2021-06-02 17:13:27

by Alex Williamson

[permalink] [raw]
Subject: Re: [RFC] /dev/ioasid uAPI proposal

On Wed, 2 Jun 2021 13:01:40 -0300
Jason Gunthorpe <[email protected]> wrote:

> On Wed, Jun 02, 2021 at 02:20:15AM +0000, Tian, Kevin wrote:
> > > From: Alex Williamson <[email protected]>
> > > Sent: Wednesday, June 2, 2021 6:22 AM
> > >
> > > On Tue, 1 Jun 2021 07:01:57 +0000
> > > "Tian, Kevin" <[email protected]> wrote:
> > > >
> > > > I summarized five opens here, about:
> > > >
> > > > 1) Finalizing the name to replace /dev/ioasid;
> > > > 2) Whether one device is allowed to bind to multiple IOASID fd's;
> > > > 3) Carry device information in invalidation/fault reporting uAPI;
> > > > 4) What should/could be specified when allocating an IOASID;
> > > > 5) The protocol between vfio group and kvm;
> > > >
> > > ...
> > > >
> > > > For 5), I'd expect Alex to chime in. Per my understanding looks the
> > > > original purpose of this protocol is not about I/O address space. It's
> > > > for KVM to know whether any device is assigned to this VM and then
> > > > do something special (e.g. posted interrupt, EPT cache attribute, etc.).
> > >
> > > Right, the original use case was for KVM to determine whether it needs
> > > to emulate invlpg, so it needs to be aware when an assigned device is
> >
> > invlpg -> wbinvd :)

Oops, of course.

> > > present and be able to test if DMA for that device is cache
> > > coherent.
>
> Why is this such a strong linkage to VFIO and not just a 'hey kvm
> emulate wbinvd' flag from qemu?

IIRC, wbinvd has host implications, a malicious user could tell KVM to
emulate wbinvd then run the op in a loop and induce a disproportionate
load on the system. We therefore wanted a way that it would only be
enabled when required.

> I briefly didn't see any obvios linkage in the arch code, just some
> dead code:
>
> $ git grep iommu_noncoherent
> arch/x86/include/asm/kvm_host.h: bool iommu_noncoherent;
> $ git grep iommu_domain arch/x86
> arch/x86/include/asm/kvm_host.h: struct iommu_domain *iommu_domain;
>
> Huh?

Cruft from legacy KVM device assignment, I assume. What you're looking
for is:

kvm_vfio_update_coherency
kvm_arch_register_noncoherent_dma
atomic_inc(&kvm->arch.noncoherent_dma_count);

need_emulate_wbinvd
kvm_arch_has_noncoherent_dma
atomic_read(&kvm->arch.noncoherent_dma_count);

There are a couple other callers that I'm not as familiar with.

> It kind of looks like the other main point is to generate the
> VFIO_GROUP_NOTIFY_SET_KVM which is being used by two VFIO drivers to
> connect back to the kvm data
>
> But that seems like it would have been better handled with some IOCTL
> on the vfio_device fd to import the KVM to the driver not this
> roundabout way?

Then QEMU would need to know which drivers require KVM knowledge? This
allowed transparent backwards compatibility with userspace. Thanks,

Alex

2021-06-02 17:23:42

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [RFC] /dev/ioasid uAPI proposal

On Wed, Jun 02, 2021 at 04:15:07PM +1000, David Gibson wrote:

> Is there a compelling reason to have all the IOASIDs handled by one
> FD?

There was an answer on this, if every PASID needs an IOASID then there
are too many FDs.

It is difficult to share the get_user_pages cache across FDs.

There are global properties in the /dev/iommu FD, like what devices
are part of it, that are important for group security operations. This
becomes confusing if it is split across many FDs.

> > I/O address space can be managed through two protocols, according to
> > whether the corresponding I/O page table is constructed by the kernel or
> > the user. When kernel-managed, a dma mapping protocol (similar to
> > existing VFIO iommu type1) is provided for the user to explicitly specify
> > how the I/O address space is mapped. Otherwise, a different protocol is
> > provided for the user to bind an user-managed I/O page table to the
> > IOMMU, plus necessary commands for iotlb invalidation and I/O fault
> > handling.
> >
> > Pgtable binding protocol can be used only on the child IOASID's, implying
> > IOASID nesting must be enabled. This is because the kernel doesn't trust
> > userspace. Nesting allows the kernel to enforce its DMA isolation policy
> > through the parent IOASID.
>
> To clarify, I'm guessing that's a restriction of likely practice,
> rather than a fundamental API restriction. I can see a couple of
> theoretical future cases where a user-managed pagetable for a "base"
> IOASID would be feasible:
>
> 1) On some fancy future MMU allowing free nesting, where the kernel
> would insert an implicit extra layer translating user addresses
> to physical addresses, and the userspace manages a pagetable with
> its own VAs being the target AS

I would model this by having a "SVA" parent IOASID. A "SVA" IOASID is one
where the IOVA == process VA and the kernel maintains this mapping.

Since the uAPI is so general I do have a general expectation that the
drivers/iommu implementations might need to be a bit more complicated,
like if the HW can optimize certain specific graphs of IOASIDs we
would still model them as graphs and the HW driver would have to
"compile" the graph into the optimal hardware.

This approach has worked reasonably well in other kernel areas.

> 2) For a purely software virtual device, where its virtual DMA
> engine can interpret user addresses fine

This also sounds like an SVA IOASID.

Depending on HW if a device can really only bind to a very narrow kind
of IOASID then it should ask for that (probably platform specific!)
type during its attachment request to drivers/iommu.

eg "I am special hardware and only know how to do PLATFORM_BLAH
transactions, give me an IOASID compatible with that". If the only way
to create "PLATFORM_BLAH" is with a SVA IOASID because BLAH is
hardwired to the CPU ASID then that is just how it is.

> I wonder if there's a way to model this using a nested AS rather than
> requiring special operations. e.g.
>
> 'prereg' IOAS
> |
> \- 'rid' IOAS
> |
> \- 'pasid' IOAS (maybe)
>
> 'prereg' would have a kernel managed pagetable into which (for
> example) qemu platform code would map all guest memory (using
> IOASID_MAP_DMA). qemu's vIOMMU driver would then mirror the guest's
> IO mappings into the 'rid' IOAS in terms of GPA.
>
> This wouldn't quite work as is, because the 'prereg' IOAS would have
> no devices. But we could potentially have another call to mark an
> IOAS as a purely "preregistration" or pure virtual IOAS. Using that
> would be an alternative to attaching devices.

It is one option for sure, this is where I was thinking when we were
talking in the other thread. I think the decision is best driven by the
implementation, as the data structure to store the preregistration data
should be rather purpose-built.

> > /*
> > * Map/unmap process virtual addresses to I/O virtual addresses.
> > *
> > * Provide VFIO type1 equivalent semantics. Start with the same
> > * restriction e.g. the unmap size should match those used in the
> > * original mapping call.
> > *
> > * If IOASID_REGISTER_MEMORY has been called, the mapped vaddr
> > * must be already in the preregistered list.
> > *
> > * Input parameters:
> > * - u32 ioasid;
> > * - refer to vfio_iommu_type1_dma_{un}map
> > *
> > * Return: 0 on success, -errno on failure.
> > */
> > #define IOASID_MAP_DMA _IO(IOASID_TYPE, IOASID_BASE + 6)
> > #define IOASID_UNMAP_DMA _IO(IOASID_TYPE, IOASID_BASE + 7)
>
> I'm assuming these would be expected to fail if a user managed
> pagetable has been bound?

Me too, or a SVA page table.

This document would do well to have a list of imagined page table
types and the set of operations that act on them. I think they are all
pretty disjoint..

Your presentation of 'kernel owns the table' vs 'userspace owns the
table' is a useful clarification to call out too

> > 5. Use Cases and Flows
> >
> > Here assume VFIO will support a new model where every bound device
> > is explicitly listed under /dev/vfio thus a device fd can be acquired w/o
> > going through legacy container/group interface. For illustration purpose
> > those devices are just called dev[1...N]:
> >
> > device_fd[1...N] = open("/dev/vfio/devices/dev[1...N]", mode);
>
> Minor detail, but I'd suggest /dev/vfio/pci/DDDD:BB:SS.F for the
> filenames for actual PCI functions. Maybe /dev/vfio/mdev/something
> for mdevs. That leaves other subdirs of /dev/vfio free for future
> non-PCI device types, and /dev/vfio itself for the legacy group
> devices.

There are a bunch of nice options here if we go this path

> > 5.2. Multiple IOASIDs (no nesting)
> > ++++++++++++++++++++++++++++
> >
> > Dev1 and dev2 are assigned to the guest. vIOMMU is enabled. Initially
> > both devices are attached to gpa_ioasid.
>
> Doesn't really affect your example, but note that the PAPR IOMMU does
> not have a passthrough mode, so devices will not initially be attached
> to gpa_ioasid - they will be unusable for DMA until attached to a
> gIOVA ioasid.

I think attachment should always be explicit in the API. If the user
doesn't explicitly ask for a device to be attached to the IOASID then
the iommu driver is free to block it.

If you want passthrough then you have to create a passthrough IOASID
and attach every device to it. Some of those attaches might be NOP's
due to groups.

Jason

2021-06-02 17:26:33

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [RFC] /dev/ioasid uAPI proposal

On Wed, Jun 02, 2021 at 04:54:26PM +0800, Jason Wang wrote:
>
> On 2021/6/2 1:31 AM, Jason Gunthorpe wrote:
> > On Tue, Jun 01, 2021 at 04:47:15PM +0800, Jason Wang wrote:
> > > We can open up to ~0U file descriptors, I don't see why we need to restrict
> > > it in uAPI.
> > There are significant problems with such large file descriptor
> > tables. High FD numbers man things like select don't work at all
> > anymore and IIRC there are more complications.
>
>
> I don't see how much difference for IOASID and other type of fds. People can
> choose to use poll or epoll.

Not really, once one thing in an application uses a large number of FDs the
entire application is affected. If any open() can return a 'very big
number' then nothing in the process is allowed to ever use select.

It is not a trivial thing to ask for

> And with the current proposal, (assuming there's a N:1 ioasid to ioasid). I
> wonder how select can work for the specific ioasid.

Page fault events are one thing that comes to mind. Bundling them all
together into a single ring buffer is going to be necessary. Multiple FDs
just complicate this too.
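
Purely for illustration (the field names are made up; the RFC's section 5.6
only says the user fault_data is (ioasid, addr)), one entry in such a single
shared ring might look like:

	struct ioasid_fault_event {	/* hypothetical layout */
		__u32	ioasid;		/* which I/O address space faulted */
		__u32	flags;
		__u64	addr;		/* faulting address */
	};
	/* userspace waits on one eventfd and consumes ring entries for all
	 * IOASIDs in one place - no per-IOASID fd needed */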

Jason

2021-06-02 17:27:04

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [RFC] /dev/ioasid uAPI proposal

On Wed, Jun 02, 2021 at 10:56:48AM +0200, Enrico Weigelt, metux IT consult wrote:

> If I understand this correctly, /dev/ioasid is a kind of "common supplier"
> to other APIs / devices. Why can't the fd be acquired by the
> consumer APIs (eg. kvm, vfio, etc) ?

/dev/ioasid would be similar to /dev/vfio, and everything already
deals with exposing /dev/vfio and /dev/vfio/N together

I don't see it as a problem, just more work.

Having FDs spawn other FDs is pretty ugly, it defeats the "everything
is a file" model of UNIX.

Jason

2021-06-02 17:38:32

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [RFC] /dev/ioasid uAPI proposal

On Wed, Jun 02, 2021 at 11:11:17AM -0600, Alex Williamson wrote:

> > > > present and be able to test if DMA for that device is cache
> > > > coherent.
> >
> > Why is this such a strong linkage to VFIO and not just a 'hey kvm
> > emulate wbinvd' flag from qemu?
>
> IIRC, wbinvd has host implications, a malicious user could tell KVM to
> emulate wbinvd then run the op in a loop and induce a disproportionate
> load on the system. We therefore wanted a way that it would only be
> enabled when required.

I think the non-coherentness is vfio_device specific? eg a specific
device will decide if it is coherent or not?

If yes I'd recast this to call kvm_arch_register_noncoherent_dma()
from the VFIO_GROUP_NOTIFY_SET_KVM in the struct vfio_device
implementation and not link it through the IOMMU.

If userspace is telling the vfio_device to be non-coherent or not then
it can call kvm_arch_register_noncoherent_dma() or not based on that
signal.
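
A minimal sketch of that recast, assuming the vfio_device driver knows
whether its device can do non-coherent DMA (device_is_noncoherent() is a
made-up helper; kvm_arch_register_noncoherent_dma() is the existing KVM
symbol):

	/* in the vfio_device driver's VFIO_GROUP_NOTIFY_SET_KVM handling */
	static void sample_set_kvm(struct vfio_device *vdev, struct kvm *kvm)
	{
		if (device_is_noncoherent(vdev))	/* hypothetical */
			kvm_arch_register_noncoherent_dma(kvm);
	}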

> > It kind of looks like the other main point is to generate the
> > VFIO_GROUP_NOTIFY_SET_KVM which is being used by two VFIO drivers to
> > connect back to the kvm data
> >
> > But that seems like it would have been better handled with some IOCTL
> > on the vfio_device fd to import the KVM to the driver not this
> > roundabout way?
>
> Then QEMU would need to know which drivers require KVM knowledge? This
> allowed transparent backwards compatibility with userspace. Thanks,

I'd just blindly fire a generic 'hey here is your KVM FD' into every
VFIO device.

The backwards compat angle is reasonable enough though.

So those two don't sound so bad, don't know about PPC, but David seems
optimistic

A basic idea is to remove the iommu stuff from the kvm connection so
that the scope of the iommu related rework is contained to vfio

Jason

2021-06-02 18:05:12

by Alex Williamson

[permalink] [raw]
Subject: Re: [RFC] /dev/ioasid uAPI proposal

On Wed, 2 Jun 2021 14:35:10 -0300
Jason Gunthorpe <[email protected]> wrote:

> On Wed, Jun 02, 2021 at 11:11:17AM -0600, Alex Williamson wrote:
>
> > > > > present and be able to test if DMA for that device is cache
> > > > > coherent.
> > >
> > > Why is this such a strong linkage to VFIO and not just a 'hey kvm
> > > emulate wbinvd' flag from qemu?
> >
> > IIRC, wbinvd has host implications, a malicious user could tell KVM to
> > emulate wbinvd then run the op in a loop and induce a disproportionate
> > load on the system. We therefore wanted a way that it would only be
> > enabled when required.
>
> I think the non-coherentness is vfio_device specific? eg a specific
> device will decide if it is coherent or not?

No, this is specifically whether DMA is cache coherent to the
processor, ie. in the case of wbinvd whether the processor needs to
invalidate its cache in order to see data from DMA.

> If yes I'd recast this to call kvm_arch_register_noncoherent_dma()
> from the VFIO_GROUP_NOTIFY_SET_KVM in the struct vfio_device
> implementation and not link it through the IOMMU.

The IOMMU tells us if DMA is cache coherent, VFIO_DMA_CC_IOMMU maps to
IOMMU_CAP_CACHE_COHERENCY for all domains within a container.

> If userspace is telling the vfio_device to be non-coherent or not then
> it can call kvm_arch_register_noncoherent_dma() or not based on that
> signal.

Not non-coherent device memory, that would be a driver issue, cache
coherence of DMA is what we're after.

> > > It kind of looks like the other main point is to generate the
> > > VFIO_GROUP_NOTIFY_SET_KVM which is being used by two VFIO drivers to
> > > connect back to the kvm data
> > >
> > > But that seems like it would have been better handled with some IOCTL
> > > on the vfio_device fd to import the KVM to the driver not this
> > > roundabout way?
> >
> > Then QEMU would need to know which drivers require KVM knowledge? This
> > allowed transparent backwards compatibility with userspace. Thanks,
>
> I'd just blindly fire a generic 'hey here is your KVM FD' into every
> VFIO device.

Yes, QEMU could do this, but the vfio-kvm device was already there with
this association and required no uAPI work. This one is the least IOMMU
related of the use cases. Thanks,

Alex

2021-06-02 18:11:20

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [RFC] /dev/ioasid uAPI proposal

On Wed, Jun 02, 2021 at 12:01:11PM -0600, Alex Williamson wrote:
> On Wed, 2 Jun 2021 14:35:10 -0300
> Jason Gunthorpe <[email protected]> wrote:
>
> > On Wed, Jun 02, 2021 at 11:11:17AM -0600, Alex Williamson wrote:
> >
> > > > > > present and be able to test if DMA for that device is cache
> > > > > > coherent.
> > > >
> > > > Why is this such a strong linkage to VFIO and not just a 'hey kvm
> > > > emulate wbinvd' flag from qemu?
> > >
> > > IIRC, wbinvd has host implications, a malicious user could tell KVM to
> > > emulate wbinvd then run the op in a loop and induce a disproportionate
> > > load on the system. We therefore wanted a way that it would only be
> > > enabled when required.
> >
> > I think the non-coherentness is vfio_device specific? eg a specific
> > device will decide if it is coherent or not?
>
> No, this is specifically whether DMA is cache coherent to the
> processor, ie. in the case of wbinvd whether the processor needs to
> invalidate its cache in order to see data from DMA.

I'm confused. This is x86, all DMA is cache coherent unless the device
is doing something special.

> > If yes I'd recast this to call kvm_arch_register_noncoherent_dma()
> > from the VFIO_GROUP_NOTIFY_SET_KVM in the struct vfio_device
> > implementation and not link it through the IOMMU.
>
> The IOMMU tells us if DMA is cache coherent, VFIO_DMA_CC_IOMMU maps to
> IOMMU_CAP_CACHE_COHERENCY for all domains within a container.

And this special IOMMU mode is basically requested by the device
driver, right? Because if you use this mode you have to also use
special programming techniques.

This smells like all the "snoop bypass" stuff from PCIE (for GPUs
even) in a different guise - it is device triggered, not platform
triggered behavior.

Jason

2021-06-02 19:02:33

by Alex Williamson

[permalink] [raw]
Subject: Re: [RFC] /dev/ioasid uAPI proposal

On Wed, 2 Jun 2021 15:09:25 -0300
Jason Gunthorpe <[email protected]> wrote:

> On Wed, Jun 02, 2021 at 12:01:11PM -0600, Alex Williamson wrote:
> > On Wed, 2 Jun 2021 14:35:10 -0300
> > Jason Gunthorpe <[email protected]> wrote:
> >
> > > On Wed, Jun 02, 2021 at 11:11:17AM -0600, Alex Williamson wrote:
> > >
> > > > > > > present and be able to test if DMA for that device is cache
> > > > > > > coherent.
> > > > >
> > > > > Why is this such a strong linkage to VFIO and not just a 'hey kvm
> > > > > emulate wbinvd' flag from qemu?
> > > >
> > > > IIRC, wbinvd has host implications, a malicious user could tell KVM to
> > > > emulate wbinvd then run the op in a loop and induce a disproportionate
> > > > load on the system. We therefore wanted a way that it would only be
> > > > enabled when required.
> > >
> > > I think the non-coherentness is vfio_device specific? eg a specific
> > > device will decide if it is coherent or not?
> >
> > No, this is specifically whether DMA is cache coherent to the
> > processor, ie. in the case of wbinvd whether the processor needs to
> > invalidate its cache in order to see data from DMA.
>
> I'm confused. This is x86, all DMA is cache coherent unless the device
> is doing something special.
>
> > > If yes I'd recast this to call kvm_arch_register_noncoherent_dma()
> > > from the VFIO_GROUP_NOTIFY_SET_KVM in the struct vfio_device
> > > implementation and not link it through the IOMMU.
> >
> > The IOMMU tells us if DMA is cache coherent, VFIO_DMA_CC_IOMMU maps to
> > IOMMU_CAP_CACHE_COHERENCY for all domains within a container.
>
> And this special IOMMU mode is basically requested by the device
> driver, right? Because if you use this mode you have to also use
> special programming techniques.
>
> This smells like all the "snoop bypass" stuff from PCIE (for GPUs
> even) in a different guise - it is device triggered, not platform
> triggered behavior.

Right, the device can generate the no-snoop transactions, but it's the
IOMMU that essentially determines whether those transactions are
actually still cache coherent, AIUI.

I did experiment with virtually hardwiring the Enable No-Snoop bit in
the Device Control Register to zero, which would be generically allowed
by the PCIe spec, but then we get into subtle dependencies in the device
drivers and clearing the bit again after any sort of reset and the
backdoor accesses to config space which exist mostly in the class of
devices that might use no-snoop transactions (yes, GPUs suck).

It was much easier and more robust to ignore the device setting and rely
on the IOMMU behavior. Yes, maybe we sometimes emulate wbinvd for VMs
where the device doesn't support no-snoop, but it seemed like platforms
were headed in this direction where no-snoop was ignored anyway.
Thanks,

Alex

2021-06-02 19:57:41

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [RFC] /dev/ioasid uAPI proposal

On Wed, Jun 02, 2021 at 01:00:53PM -0600, Alex Williamson wrote:
> On Wed, 2 Jun 2021 15:09:25 -0300
> Jason Gunthorpe <[email protected]> wrote:
>
> > On Wed, Jun 02, 2021 at 12:01:11PM -0600, Alex Williamson wrote:
> > > On Wed, 2 Jun 2021 14:35:10 -0300
> > > Jason Gunthorpe <[email protected]> wrote:
> > >
> > > > On Wed, Jun 02, 2021 at 11:11:17AM -0600, Alex Williamson wrote:
> > > >
> > > > > > > > present and be able to test if DMA for that device is cache
> > > > > > > > coherent.
> > > > > >
> > > > > > Why is this such a strong linkage to VFIO and not just a 'hey kvm
> > > > > > emulate wbinvd' flag from qemu?
> > > > >
> > > > > IIRC, wbinvd has host implications, a malicious user could tell KVM to
> > > > > emulate wbinvd then run the op in a loop and induce a disproportionate
> > > > > load on the system. We therefore wanted a way that it would only be
> > > > > enabled when required.
> > > >
> > > > I think the non-coherentness is vfio_device specific? eg a specific
> > > > device will decide if it is coherent or not?
> > >
> > > No, this is specifically whether DMA is cache coherent to the
> > > processor, ie. in the case of wbinvd whether the processor needs to
> > > invalidate its cache in order to see data from DMA.
> >
> > I'm confused. This is x86, all DMA is cache coherent unless the device
> > is doing something special.
> >
> > > > If yes I'd recast this to call kvm_arch_register_noncoherent_dma()
> > > > from the VFIO_GROUP_NOTIFY_SET_KVM in the struct vfio_device
> > > > implementation and not link it through the IOMMU.
> > >
> > > The IOMMU tells us if DMA is cache coherent, VFIO_DMA_CC_IOMMU maps to
> > > IOMMU_CAP_CACHE_COHERENCY for all domains within a container.
> >
> > And this special IOMMU mode is basically requested by the device
> > driver, right? Because if you use this mode you have to also use
> > special programming techniques.
> >
> > This smells like all the "snoop bypass" stuff from PCIE (for GPUs
> > even) in a different guise - it is device triggered, not platform
> > triggered behavior.
>
> Right, the device can generate the no-snoop transactions, but it's the
> IOMMU that essentially determines whether those transactions are
> actually still cache coherent, AIUI.

Wow, this is really confusing stuff in the code.

At the PCI level there is a TLP bit called no-snoop that is platform
specific. The general intention is to allow devices to selectively
bypass the CPU caching for DMAs. GPUs like to use this feature for
performance.

I assume there are some exciting security issues here. Looks like
allowing cache bypass does something bad inside VMs? Looks like
allowing the VM to use the cache clear instruction that is mandatory
with cache bypass DMA causes some QOS issues? OK.

So how does it work?

What I see in the intel/iommu.c is that some domains support "snoop
control" or not, based on some HW flag. This indicates if the
DMA_PTE_SNP bit is supported on a page by page basis or not.

Since x86 always leans toward "DMA cache coherent" I'm reading some
tea leaves here:

IOMMU_CAP_CACHE_COHERENCY, /* IOMMU can enforce cache coherent DMA
transactions */

And guessing that IOMMUs that implement DMA_PTE_SNP will ignore the
snoop bit in TLPs for IOVA's that have DMA_PTE_SNP set?

Further, I guess IOMMUs that don't support PTE_SNP, or have
DMA_PTE_SNP clear will always honour the snoop bit. (backwards compat
and all)

So, lack of IOMMU_CAP_CACHE_COHERENCY does not mean the IOMMU is DMA
incoherent with the CPU caches, it just means that snooping cannot be
enforced against the no-snoop bit in the TLP, ie the device *could*
do no-snoop DMA if it wants. Devices that never do no-snoop remain DMA
coherent on x86, as they always have been.

IOMMU_CACHE does not mean the IOMMU is DMA cache coherent, it means
the PCI device is blocked from using no-snoop in its TLPs.

I wonder if ARM implemented this consistently? I see VDPA is
confused.. I was confused. What a terrible set of names.

In VFIO generic code I see it always sets IOMMU_CACHE:

if (iommu_capable(bus, IOMMU_CAP_CACHE_COHERENCY))
	domain->prot |= IOMMU_CACHE;

And thus also always provides IOMMU_CACHE to iommu_map:

ret = iommu_map(d->domain, iova, (phys_addr_t)pfn << PAGE_SHIFT,
		npage << PAGE_SHIFT, prot | d->prot);

So when the IOMMU supports the no-snoop blocking security feature VFIO
turns it on and blocks no-snoop to all pages? Ok..

But I must be missing something big because *something* in the IOVA
map should work with no-snoopable DMA, right? Otherwise what is the
point of exposing the invalidate instruction to the guest?

I would think userspace should be relaying the DMA_PTE_SNP bit from
the guest's page tables up to here??

The KVM hookup is driven by IOMMU_CACHE which is driven by
IOMMU_CAP_CACHE_COHERENCY. So we turn on the special KVM support only
if the IOMMU can block the SNP bit? And then we map all the pages to
block the snoop bit? Huh?

Your explanation makes perfect sense: Block guests from using the
dangerous cache invalidate instruction unless a device that uses
no-snoop is plugged in. Block devices from using no-snoop because
something about it is insecure. Ok.

But the conditions I'm looking for in "device that uses no-snoop" are:
- The device will issue no-snoop TLPs at all
- The IOMMU will let no-snoop through
- The platform will honor no-snoop

Only if all three are met should we allow the dangerous instruction in
KVM, right?

Which brings me back to my original point - this is at least partially
a device specific behavior. It depends on the content of the IOMMU
page table, it depends if the device even supports no-snoop at all.

My guess is this works correctly for the mdev Intel kvmgt which
probably somehow allows no-snoop DMA through the mdev SW iommu
mappings. (assuming I didn't miss a tricky iommu_map without
IOMMU_CACHE set in the type1 code?)

But why is vfio-pci using it? Hmm?

Confused,
Jason

2021-06-02 20:41:09

by Alex Williamson

[permalink] [raw]
Subject: Re: [RFC] /dev/ioasid uAPI proposal

On Wed, 2 Jun 2021 16:54:04 -0300
Jason Gunthorpe <[email protected]> wrote:

> On Wed, Jun 02, 2021 at 01:00:53PM -0600, Alex Williamson wrote:
> >
> > Right, the device can generate the no-snoop transactions, but it's the
> > IOMMU that essentially determines whether those transactions are
> > actually still cache coherent, AIUI.
>
> Wow, this is really confusing stuff in the code.
>
> At the PCI level there is a TLP bit called no-snoop that is platform
> specific. The general intention is to allow devices to selectively
> bypass the CPU caching for DMAs. GPUs like to use this feature for
> performance.

Yes

> I assume there is some exciting security issues here. Looks like
> allowing cache bypass does something bad inside VMs? Looks like
> allowing the VM to use the cache clear instruction that is mandatory
> with cache bypass DMA causes some QOS issues? OK.

IIRC, largely a DoS issue if userspace gets to choose when to emulate
wbinvd rather than it being demanded for correct operation.

> So how does it work?
>
> What I see in the intel/iommu.c is that some domains support "snoop
> control" or not, based on some HW flag. This indicates if the
> DMA_PTE_SNP bit is supported on a page by page basis or not.
>
> Since x86 always leans toward "DMA cache coherent" I'm reading some
> tea leaves here:
>
> IOMMU_CAP_CACHE_COHERENCY, /* IOMMU can enforce cache coherent DMA
> transactions */
>
> And guessing that IOMMUs that implement DMA_PTE_SNP will ignore the
> snoop bit in TLPs for IOVA's that have DMA_PTE_SNP set?

That's my understanding as well.

> Further, I guess IOMMUs that don't support PTE_SNP, or have
> DMA_PTE_SNP clear will always honour the snoop bit. (backwards compat
> and all)

Yes.

> So, lack of IOMMU_CAP_CACHE_COHERENCY does not mean the IOMMU is DMA
> incoherent with the CPU caches, it just means that snooping cannot be
> enforced against the no-snoop bit in the TLP, ie the device *could*
> do no-snoop DMA if it wants. Devices that never do no-snoop remain DMA
> coherent on x86, as they always have been.

Yes, IOMMU_CAP_CACHE_COHERENCY=false means we cannot force the device
DMA to be coherent via the IOMMU.

> IOMMU_CACHE does not mean the IOMMU is DMA cache coherent, it means
> the PCI device is blocked from using no-snoop in its TLPs.
>
> I wonder if ARM implemented this consistently? I see VDPA is
> confused.. I was confused. What a terrible set of names.
>
> In VFIO generic code I see it always sets IOMMU_CACHE:
>
> if (iommu_capable(bus, IOMMU_CAP_CACHE_COHERENCY))
> domain->prot |= IOMMU_CACHE;
>
> And thus also always provides IOMMU_CACHE to iommu_map:
>
> ret = iommu_map(d->domain, iova, (phys_addr_t)pfn << PAGE_SHIFT,
> npage << PAGE_SHIFT, prot | d->prot);
>
> So when the IOMMU supports the no-snoop blocking security feature VFIO
> turns it on and blocks no-snoop to all pages? Ok..

Yep, I'd forgotten this nuance that we need to enable it via the
mapping flags.

> But I must be missing something big because *something* in the IOVA
> map should work with no-snoopable DMA, right? Otherwise what is the
> point of exposing the invalidate instruction to the guest?
>
> I would think userspace should be relaying the DMA_PTE_SNP bit from
> the guest's page tables up to here??
>
> The KVM hookup is driven by IOMMU_CACHE which is driven by
> IOMMU_CAP_CACHE_COHERENCY. So we turn on the special KVM support only
> if the IOMMU can block the SNP bit? And then we map all the pages to
> block the snoop bit? Huh?

Right. I don't follow where you're jumping to relaying DMA_PTE_SNP
from the guest page table... what page table? We don't necessarily
have a vIOMMU to expose such things, I don't think it even existed when
we added this. Essentially if we can ignore no-snoop at the IOMMU,
then KVM doesn't need to worry about emulating wbinvd because of an
assigned device, whether that device uses it or not. Win-win.

> Your explanation makes perfect sense: Block guests from using the
> dangerous cache invalidate instruction unless a device that uses
> no-snoop is plugged in. Block devices from using no-snoop because
> something about it is insecure. Ok.

No-snoop itself is not insecure, but to support no-snoop in a VM KVM
can't ignore wbinvd, which has overhead and abuse implications.

> But the conditions I'm looking for "device that uses no-snoop" is:
> - The device will issue no-snoop TLPs at all

We can't really know this generically. We can try to set the enable
bit to see if the device is capable of no-snoop, but that doesn't mean
it will use no-snoop.

> - The IOMMU will let no-snoop through
> - The platform will honor no-snoop
>
> Only if all three are met we should allow the dangerous instruction in
> KVM, right?

We test at the IOMMU and assume that the IOMMU knowledge encompasses
whether the platform honors no-snoop (note for example how amd and arm
report true for IOMMU_CAP_CACHE_COHERENCY but seem to ignore the
IOMMU_CACHE flag). We could probably use an iommu_group_for_each_dev
to test if any devices within the group are capable of no-snoop if the
IOMMU can't protect us, but at the time it didn't seem worthwhile. I'm
still not sure if it is.

> Which brings me back to my original point - this is at least partially
> a device specific behavior. It depends on the content of the IOMMU
> page table, it depends if the device even supports no-snoop at all.
>
> My guess is this works correctly for the mdev Intel kvmgt which
> probably somehow allows no-snoop DMA through the mdev SW iommu
> mappings. (assuming I didn't miss a tricky iommu_map without
> IOMMU_CACHE set in the type1 code?)

This support existed before mdev, IIRC we needed it for direct
assignment of NVIDIA GPUs.

> But why is vfio-pci using it? Hmm?

Use the IOMMU to reduce hypervisor overhead, let the hypervisor learn
about it, ignore the subtleties of whether the device actually uses
no-snoop as imprecise and poor ROI given the apparent direction of
hardware.

¯\_(ツ)_/¯,
Alex

2021-06-02 22:47:23

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [RFC] /dev/ioasid uAPI proposal

On Wed, Jun 02, 2021 at 02:37:34PM -0600, Alex Williamson wrote:

> Right. I don't follow where you're jumping to relaying DMA_PTE_SNP
> from the guest page table... what page table?

I see my confusion now, the phrasing in your earlier remark led me to
think this was about allowing the no-snoop performance enhancement in
some restricted way.

It is really about blocking no-snoop 100% of the time and then
disabling the dangerous wbinvd when the block is successful.

Didn't closely read the kvm code :\

If it was about allowing the optimization then I'd expect the guest to
enable no-snoopable regions via its vIOMMU, relay them to the
hypervisor, and plumb the whole thing through. Hence my remark about
the guest page tables..

So really the test is just 'were we able to block it' ?

> This support existed before mdev, IIRC we needed it for direct
> assignment of NVIDIA GPUs.

Probably because they ignored the disable no-snoop bits in the control
block, or reset them in some insane way to "fix" broken bioses and
kept using it even though by all rights qemu would have tried hard to
turn it off via the config space. Processing no-snoop without a
working wbinvd would be fatal. Yeesh

But OK, back to /dev/ioasid. This answers a few lingering questions I
had..

1) Mixing IOMMU_CAP_CACHE_COHERENCY and !IOMMU_CAP_CACHE_COHERENCY
domains.

This doesn't actually matter. If you mix them together then kvm
will turn on wbinvd anyhow, so we don't need to use the DMA_PTE_SNP
anywhere in this VM.

Thus if two IOMMUs are joined together into a single /dev/ioasid
then we can just make them both pretend to be
!IOMMU_CAP_CACHE_COHERENCY and both not set IOMMU_CACHE.

2) How to fit this part of kvm in some new /dev/ioasid world

What we want to do here is iterate over every ioasid associated
with the group fd that is passed into kvm.

Today the group fd has a single container which specifies the
single ioasid so this is being done trivially.

To reorg we want to get the ioasid from the device not the
group (see my note to David about the groups vs device rationale)

This is just iterating over each vfio_device in the group and
querying the ioasid it is using.

Or perhaps more directly: an op attaching the vfio_device to the
kvm and having some simple helper
'(un)register ioasid with kvm (kvm, ioasid)'
that the vfio_device driver can call that just sorts this out.

It is not terrible..
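
For illustration only, a rough sketch of what such helpers might look
like (every name below is invented for this discussion, none of it is
existing kernel API):

/* invented names -- just to make the shape of the idea concrete */
int kvm_ioasid_register(struct kvm *kvm, struct ioasid_ctx *ictx, u32 ioasid);
void kvm_ioasid_unregister(struct kvm *kvm, struct ioasid_ctx *ictx, u32 ioasid);

/* called by the vfio_device driver after the device is attached */
static int vfio_device_hook_kvm(struct vfio_device *vdev, struct kvm *kvm)
{
	/* hypothetical accessor returning the IOASID this device uses */
	u32 ioasid = vfio_device_ioasid(vdev);

	return kvm_ioasid_register(kvm, vdev->ioasid_ctx, ioasid);
}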

Jason

2021-06-02 23:26:06

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [RFC] /dev/ioasid uAPI proposal

On Wed, Jun 02, 2021 at 12:01:57PM +0800, Lu Baolu wrote:
> On 6/2/21 1:26 AM, Jason Gunthorpe wrote:
> > On Tue, Jun 01, 2021 at 07:09:21PM +0800, Lu Baolu wrote:
> >
> > > This version only covers 1) and 4). Do you think we need to support 2),
> > > 3) and beyond?
> >
> > Yes absolutely. The API should be flexible enough to specify the
> > creation of all future page table formats we'd want to have and all HW
> > specific details on those formats.
>
> OK, stay in the same line.
>
> > > If so, it seems that we need some in-kernel helpers and uAPIs to
> > > support pre-installing a page table to IOASID.
> >
> > Not sure what this means..
>
> Sorry that I didn't make this clear.
>
> Let me bring back the page table types in my eyes.
>
> 1) IOMMU format page table (a.k.a. iommu_domain)
> 2) user application CPU page table (SVA for example)
> 3) KVM EPT (future option)
> 4) VM guest managed page table (nesting mode)
>
> Each type of page table should be able to be associated with its IOASID.
> We have BIND protocol for 4); We explicitly allocate an iommu_domain for
> 1). But we don't have a clear definition for 2) 3) and others. I think
> it's necessary to clearly define a time point and kAPI name between
> IOASID_ALLOC and IOASID_ATTACH, so that other modules have the
> opportunity to associate their page table with the allocated IOASID
> before attaching the page table to the real IOMMU hardware.

In my mind these are all actions of creation..

#1 is ALLOC_IOASID 'to be compatible with these devices attached to
this FD'
#2 is ALLOC_IOASID_SVA
#3 is some ALLOC_IOASID_KVM (and maybe the kvm fd has to issue this ioctl)
#4 is ALLOC_IOASID_USER_PAGE_TABLE w/ user VA address or
ALLOC_IOASID_NESTED_PAGE_TABLE w/ IOVA address

Each allocation should have a set of operations that are allowed:
map/unmap is only legal on #1, invalidate is only legal on #4, etc.

How you want to split this up in the ioctl interface is a more
interesting question. I generally like more calls than giant unwieldy
multiplexer structs, but some things are naturally flags and optional
modifications of a single ioctl.

In any event they should have a similar naming 'ALLOC_IOASID_XXX' and
then a single 'DESTROY_IOASID' that works on all of them.
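
Purely as an illustration of that naming pattern (the ioctl numbers and
exact names below are placeholders, not part of the proposal):

/* placeholder sketch -- numbers and names to be decided */
#define IOASID_ALLOC_KERNEL_PGTABLE	_IO(IOASID_TYPE, IOASID_BASE + 20)
#define IOASID_ALLOC_SVA		_IO(IOASID_TYPE, IOASID_BASE + 21)
#define IOASID_ALLOC_KVM		_IO(IOASID_TYPE, IOASID_BASE + 22)
#define IOASID_ALLOC_USER_PGTABLE	_IO(IOASID_TYPE, IOASID_BASE + 23)
#define IOASID_ALLOC_NESTED_PGTABLE	_IO(IOASID_TYPE, IOASID_BASE + 24)
/* a single destroy that works on every type above */
#define IOASID_DESTROY			_IO(IOASID_TYPE, IOASID_BASE + 25)

map/unmap would then only be accepted on an IOASID created by the
KERNEL_PGTABLE variant, invalidate only on the USER/NESTED variants,
and so on.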

> I/O page fault handling is similar. The provider of the page table
> should take the responsibility to handle the possible page faults.

For the faultable types, yes #3 and #4 should hook in the fault
handler and deal with it.

Jason

2021-06-02 23:28:26

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [RFC] /dev/ioasid uAPI proposal

On Wed, Jun 02, 2021 at 01:25:00AM +0000, Tian, Kevin wrote:

> OK, this implies that if one user inadvertently creates intended parent/
> child via different fd's then the operation will simply fail.

Remember the number space used to refer to the IOASIDs inside the FD is
local to that instance of the FD. Each FD should have its own xarray.

You can't actually accidentally refer to an IOASID in FD A from FD B
because the xarray lookup in FD B will not return 'IOASID A'.

Jason

2021-06-03 01:31:54

by Tian, Kevin

[permalink] [raw]
Subject: RE: [RFC] /dev/ioasid uAPI proposal

> From: Jason Gunthorpe
> Sent: Thursday, June 3, 2021 12:09 AM
>
> On Wed, Jun 02, 2021 at 01:33:22AM +0000, Tian, Kevin wrote:
> > > From: Jason Gunthorpe <[email protected]>
> > > Sent: Wednesday, June 2, 2021 1:42 AM
> > >
> > > On Tue, Jun 01, 2021 at 08:10:14AM +0000, Tian, Kevin wrote:
> > > > > From: Jason Gunthorpe <[email protected]>
> > > > > Sent: Saturday, May 29, 2021 1:36 AM
> > > > >
> > > > > On Thu, May 27, 2021 at 07:58:12AM +0000, Tian, Kevin wrote:
> > > > >
> > > > > > IOASID nesting can be implemented in two ways: hardware nesting
> and
> > > > > > software nesting. With hardware support the child and parent I/O
> page
> > > > > > tables are walked consecutively by the IOMMU to form a nested
> > > translation.
> > > > > > When it's implemented in software, the ioasid driver is responsible
> for
> > > > > > merging the two-level mappings into a single-level shadow I/O page
> > > table.
> > > > > > Software nesting requires both child/parent page tables operated
> > > through
> > > > > > the dma mapping protocol, so any change in either level can be
> > > captured
> > > > > > by the kernel to update the corresponding shadow mapping.
> > > > >
> > > > > Why? A SW emulation could do this synchronization during
> invalidation
> > > > > processing if invalidation contained an IOVA range.
> > > >
> > > > In this proposal we differentiate between host-managed and user-
> > > > managed I/O page tables. If host-managed, the user is expected to use
> > > > map/unmap cmd explicitly upon any change required on the page table.
> > > > If user-managed, the user first binds its page table to the IOMMU and
> > > > then use invalidation cmd to flush iotlb when necessary (e.g. typically
> > > > not required when changing a PTE from non-present to present).
> > > >
> > > > We expect user to use map+unmap and bind+invalidate respectively
> > > > instead of mixing them together. Following this policy, map+unmap
> > > > must be used in both levels for software nesting, so changes in either
> > > > level are captured timely to synchronize the shadow mapping.
> > >
> > > map+unmap or bind+invalidate is a policy of the IOASID itself set when
> > > it is created. If you put two different types in a tree then each IOASID
> > > must continue to use its own operation mode.
> > >
> > > I don't see a reason to force all IOASIDs in a tree to be consistent??
> >
> > only for software nesting. With hardware support the parent uses map
> > while the child uses bind.
> >
> > Yes, the policy is specified per IOASID. But if the policy violates the
> > requirement in a specific nesting mode, then nesting should fail.
>
> I don't get it.
>
> If the IOASID is a page table then it is bind/invalidate. SW or not SW
> doesn't matter at all.
>
> > >
> > > A software emulated two level page table where the leaf level is a
> > > bound page table in guest memory should continue to use
> > > bind/invalidate to maintain the guest page table IOASID even though it
> > > is a SW construct.
> >
> > with software nesting the leaf should be a host-managed page table
> > (or metadata). A bind/invalidate protocol doesn't require the user
> > to notify the kernel of every page table change.
>
> The purpose of invalidate is to inform the implementation that the
> page table has changed so it can flush the caches. If the page table
> is changed and invalidation is not issued then the implementation
> is free to ignore the changes.
>
> In this way the SW mode is the same as a HW mode with an infinite
> cache.
>
> The collapsed shadow page table is really just a cache.
>

OK. One additional thing is that we may need a 'caching_mode'
flag reported by /dev/ioasid, indicating whether invalidation is
required when changing non-present to present. For hardware
nesting it's not reported, as the hardware IOMMU will walk the
guest page table on an iotlb miss. For software nesting
caching_mode is reported, so the user must issue an invalidation
upon any change in the guest page table and the kernel can update
the shadow page table in a timely manner.

Following this and your other comment with David, we will mark
host-managed vs. guest-managed explicitly for the I/O page table
of each IOASID. map+unmap or bind+invalidate is then decided by
which owner the user specifies.
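
To make that concrete, one possible way to report both properties
(layout and names are made up here, purely for illustration):

/* illustrative only -- not proposed uAPI */
struct ioasid_info {
	__u32	ioasid;
	__u32	flags;
/* the I/O page table is user/guest-managed (bind+invalidate);
 * if clear it is host-managed (map+unmap) */
#define IOASID_INFO_USER_MANAGED	(1 << 0)
/* invalidation is required even for non-present -> present changes,
 * i.e. software nesting where the kernel shadows the page table;
 * clear for hardware nesting where the IOMMU walks the guest table */
#define IOASID_INFO_CACHING_MODE	(1 << 1)
};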

Thanks
Kevin

2021-06-03 02:13:52

by Tian, Kevin

[permalink] [raw]
Subject: RE: [RFC] /dev/ioasid uAPI proposal

> From: Jason Gunthorpe <[email protected]>
> Sent: Thursday, June 3, 2021 12:17 AM
>
[...]
> > > If there are no hypervisor traps (does this exist?) then there is no
> > > way to involve the hypervisor here and the child IOASID should simply
> > > be a pointer to the guest's data structure that describes binding. In
> > > this case that IOASID should claim all PASIDs when bound to a
> > > RID.
> >
> > And in that case I think we should call that object something other
> > than an IOASID, since it represents multiple address spaces.
>
> Maybe.. It is certainly a special case.
>
> We can still consider it a single "address space" from the IOMMU
> perspective. What has happened is that the address table is not just a
> 64 bit IOVA, but an extended ~80 bit IOVA formed by "PASID, IOVA".

More accurately 64+20 = 84 bit IOVA :)

>
> If we are already going in the direction of having the IOASID specify
> the page table format and other details, specifying that the page

I'm leaning toward this direction now, after a discussion with Baolu.
He reminded me that a default domain is already created for each
device when it's probed by the iommu driver. So it looks workable
to expose a per-device capability query uAPI to the user once a device
is bound to the ioasid fd. Once it's available, the user should be able
to judge what format/mode should be set when creating an IOASID.
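
For example, the query could look roughly like this (an illustrative
sketch only; the field names and the ioctl are placeholders):

/* placeholder sketch of a per-device capability query */
struct ioasid_device_info {
	__u32	dev_handle;	/* handle from binding the device to the fd */
	__u32	flags;
	__u32	addr_width;	/* max IOVA width the full device path supports */
	__u32	pasid_bits;	/* 0 if PASID is not available */
	__u64	pgtable_formats;	/* bitmap of supported vendor formats */
};
#define IOASID_GET_DEVICE_INFO	_IO(IOASID_TYPE, IOASID_BASE + 30)

qemu would issue this per bound device and pick a format compatible
with both the hardware report and the vIOMMU it emulates.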

> table format is the 80 bit "PASID, IOVA" format is a fairly small
> step.

In concept this view is true. But when designing the uAPI we will
possibly not call it an 84-bit format, as the PASID table itself just
serves a 20-bit PASID space.

Will think more how to mark it in the next version.

>
> I wouldn't twist things into knots to create a difference, but if it
> is easy to do it wouldn't hurt either.
>

Thanks
Kevin

2021-06-03 02:53:06

by Tian, Kevin

[permalink] [raw]
Subject: RE: [RFC] /dev/ioasid uAPI proposal

> From: Jason Gunthorpe <[email protected]>
> Sent: Thursday, June 3, 2021 12:59 AM
>
> On Wed, Jun 02, 2021 at 04:48:35PM +1000, David Gibson wrote:
> > > > /* Bind guest I/O page table */
> > > > bind_data = {
> > > > .ioasid = gva_ioasid;
> > > > .addr = gva_pgtable1;
> > > > // and format information
> > > > };
> > > > ioctl(ioasid_fd, IOASID_BIND_PGTABLE, &bind_data);
> > >
> > > Again I do wonder if this should just be part of alloc_ioasid. Is
> > > there any reason to split these things? The only advantage to the
> > > split is the device is known, but the device shouldn't impact
> > > anything..
> >
> > I'm pretty sure the device(s) could matter, although they probably
> > won't usually.
>
> It is a bit subtle, but the /dev/iommu fd itself is connected to the
> devices first. This prevents wildly incompatible devices from being
> joined together, and allows some "get info" to report the capability
> union of all devices if we want to do that.

I would expect the capability to be reported per-device via /dev/iommu.
Incompatible devices can bind to the same fd but cannot attach to
the same IOASID. This allows incompatible devices to share locked
page accounting.

>
> The original concept was that devices joined would all have to support
> the same IOASID format, at least for the kernel owned map/unmap IOASID
> type. Supporting different page table formats maybe is reason to
> revisit that concept.

If my memory is not broken, the original concept was that the
devices attached to the same IOASID must support the same
format. Otherwise they need to attach to different IOASIDs (but still
within the same fd).

>
> There is a small advantage to re-using the IOASID container because of
> the get_user_pages caching and pinned accounting management at the FD
> level.

With the above concept we don't need the IOASID container then.

>
> I don't know if that small advantage is worth the extra complexity
> though.
>
> > But it would certainly be possible for a system to have two
> > different host bridges with two different IOMMUs with different
> > pagetable formats. Until you know which devices (and therefore
> > which host bridge) you're talking about, you don't know what formats
> > of pagetable to accept. And if you have devices from *both* bridges
> > you can't bind a page table at all - you could theoretically support
> > a kernel managed pagetable by mirroring each MAP and UNMAP to tables
> > in both formats, but it would be pretty reasonable not to support
> > that.
>
> The basic process for a user space owned pgtable mode would be:
>
> 1) qemu has to figure out what format of pgtable to use
>
> Presumably it uses query functions using the device label. The
> kernel code should look at the entire device path through all the
> IOMMU HW to determine what is possible.
>
> Or it already knows because the VM's vIOMMU is running in some
> fixed page table format, or the VM's vIOMMU already told it, or
> something.

I'd expect both. First get the hardware format, then detect whether
it's compatible with the vIOMMU format.

>
> 2) qemu creates an IOASID and based on #1 and says 'I want this format'

Based on earlier discussion this will possibly be:

struct iommu_ioasid_create_info {
	// if set this is a guest-managed page table, use bind+invalidate,
	// with info provided in struct pgtable_info;
	// if clear it's host-managed and use map+unmap;
#define IOMMU_IOASID_FLAG_USER_PGTABLE		1
	// if set it is for pasid table binding. same implication as
	// USER_PGTABLE except it's for a different pgtable type
#define IOMMU_IOASID_FLAG_USER_PASID_TABLE	2
	int flags;

	// Create nesting if not INVALID_IOASID
	u32 parent_ioasid;

	// additional info about the page table
	union {
		// for user-managed page table
		struct {
			u64 user_pgd;
			u32 format;
			u32 addr_width;
			// and other vendor format info
		} pgtable_info;

		// for kernel-managed page table
		struct {
			// not required on x86
			// for ppc, iirc the user wants to claim a window
			// explicitly?
		} map_info;
	};
};

then there will be no UNBIND_PGTABLE ioctl. The unbind is done
automatically when the IOASID is freed.
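
As a usage sketch against the structure above (values are illustrative
only; IOMMU_PGTABLE_FORMAT_VTD_S1 is a made-up format name):

/* bind a guest I/O page table by nesting it under the GPA IOASID */
struct iommu_ioasid_create_info info = {
	.flags = IOMMU_IOASID_FLAG_USER_PGTABLE,
	.parent_ioasid = gpa_ioasid,
	.pgtable_info = {
		.user_pgd = gva_pgtable1,
		.format = IOMMU_PGTABLE_FORMAT_VTD_S1,	/* made-up name */
		.addr_width = 48,
	},
};
gva_ioasid = ioctl(ioasid_fd, IOASID_ALLOC, &info);
/* no UNBIND_PGTABLE: freeing gva_ioasid tears down the binding */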

>
> 3) qemu binds the IOASID to the device.

let's use 'attach' for consistency. :) 'bind' is for the ioasid fd, which must
be completed in step 0) so the format can be reported in step 1).

>
> If qmeu gets it wrong then it just fails.
>
> 4) For the next device qemu would have to figure out if it can re-use
> an existing IOASID based on the required proeprties.
>
> You pointed to the case of mixing vIOMMU's of different platforms. So
> it is completely reasonable for qemu to ask for a "ARM 64 bit IOMMU
> page table mode v2" while running on an x86 because that is what the
> vIOMMU is wired to work with.
>
> Presumably qemu will fall back to software emulation if this is not
> possible.
>
> One interesting option for software emulation is to just transform the
> ARM page table format to a x86 page table format in userspace and use
> nested bind/invalidate to synchronize with the kernel. With SW nesting
> I suspect this would be much faster
>

or just use map+unmap. It's no different from how a virtio-iommu could
work on all platforms, which by definition is not the same type as the
underlying hardware.

Thanks
Kevin

2021-06-03 02:54:28

by Alex Williamson

[permalink] [raw]
Subject: Re: [RFC] /dev/ioasid uAPI proposal

On Wed, 2 Jun 2021 19:45:36 -0300
Jason Gunthorpe <[email protected]> wrote:

> On Wed, Jun 02, 2021 at 02:37:34PM -0600, Alex Williamson wrote:
>
> > Right. I don't follow where you're jumping to relaying DMA_PTE_SNP
> > from the guest page table... what page table?
>
> I see my confusion now, the phrasing in your earlier remark led me
> think this was about allowing the no-snoop performance enhancement in
> some restricted way.
>
> It is really about blocking no-snoop 100% of the time and then
> disabling the dangerous wbinvd when the block is successful.
>
> Didn't closely read the kvm code :\
>
> If it was about allowing the optimization then I'd expect the guest to
> enable no-snoopable regions via it's vIOMMU and realize them to the
> hypervisor and plumb the whole thing through. Hence my remark about
> the guest page tables..
>
> So really the test is just 'were we able to block it' ?

Yup. Do we really still consider that there's some performance benefit
to be had by enabling a device to use no-snoop? This seems largely a
legacy thing.

> > This support existed before mdev, IIRC we needed it for direct
> > assignment of NVIDIA GPUs.
>
> Probably because they ignored the disable no-snoop bits in the control
> block, or reset them in some insane way to "fix" broken bioses and
> kept using it even though by all rights qemu would have tried hard to
> turn it off via the config space. Processing no-snoop without a
> working wbinvd would be fatal. Yeesh
>
> But Ok, back the /dev/ioasid. This answers a few lingering questions I
> had..
>
> 1) Mixing IOMMU_CAP_CACHE_COHERENCY and !IOMMU_CAP_CACHE_COHERENCY
> domains.
>
> This doesn't actually matter. If you mix them together then kvm
> will turn on wbinvd anyhow, so we don't need to use the DMA_PTE_SNP
> anywhere in this VM.
>
> This if two IOMMU's are joined together into a single /dev/ioasid
> then we can just make them both pretend to be
> !IOMMU_CAP_CACHE_COHERENCY and both not set IOMMU_CACHE.

Yes and no. Yes, if any domain is !IOMMU_CAP_CACHE_COHERENCY then we
need to emulate wbinvd, but no we'll use IOMMU_CACHE any time it's
available based on the per domain support available. That gives us the
most consistent behavior, ie. we don't have VMs emulating wbinvd
because they used to have a device attached where the domain required
it and we can't atomically remap with new flags to perform the same as
a VM that never had that device attached in the first place.

> 2) How to fit this part of kvm in some new /dev/ioasid world
>
> What we want to do here is iterate over every ioasid associated
> with the group fd that is passed into kvm.

Yeah, we need some better names, binding a device to an ioasid (fd) but
then attaching a device to an allocated ioasid (non-fd)... I assume
you're talking about the latter ioasid.

> Today the group fd has a single container which specifies the
> single ioasid so this is being done trivially.
>
> To reorg we want to get the ioasid from the device not the
> group (see my note to David about the groups vs device rational)
>
> This is just iterating over each vfio_device in the group and
> querying the ioasid it is using.

The IOMMU API group interface is largely iommu_group_for_each_dev()
anyway; we still need to account for all the RIDs and aliases of a
group.

> Or perhaps more directly: an op attaching the vfio_device to the
> kvm and having some simple helper
> '(un)register ioasid with kvm (kvm, ioasid)'
> that the vfio_device driver can call that just sorts this out.

We could almost eliminate the device notion altogether here, use an
ioasidfd_for_each_ioasid() but we really want a way to trigger on each
change to the composition of the device set for the ioasid, which is
why we currently do it on addition or removal of a group, where the
group has a consistent set of IOMMU properties. Register a notifier
callback via the ioasidfd? Thanks,

Alex

2021-06-03 02:55:33

by Jason Wang

[permalink] [raw]
Subject: Re: [RFC] /dev/ioasid uAPI proposal


On 2021/6/3 4:37 AM, Alex Williamson wrote:
> On Wed, 2 Jun 2021 16:54:04 -0300
> Jason Gunthorpe <[email protected]> wrote:
>
>> On Wed, Jun 02, 2021 at 01:00:53PM -0600, Alex Williamson wrote:
>>> Right, the device can generate the no-snoop transactions, but it's the
>>> IOMMU that essentially determines whether those transactions are
>>> actually still cache coherent, AIUI.
>> Wow, this is really confusing stuff in the code.
>>
>> At the PCI level there is a TLP bit called no-snoop that is platform
>> specific. The general intention is to allow devices to selectively
>> bypass the CPU caching for DMAs. GPUs like to use this feature for
>> performance.
> Yes
>
>> I assume there is some exciting security issues here. Looks like
>> allowing cache bypass does something bad inside VMs? Looks like
>> allowing the VM to use the cache clear instruction that is mandatory
>> with cache bypass DMA causes some QOS issues? OK.
> IIRC, largely a DoS issue if userspace gets to choose when to emulate
> wbinvd rather than it being demanded for correct operation.
>
>> So how does it work?
>>
>> What I see in the intel/iommu.c is that some domains support "snoop
>> control" or not, based on some HW flag. This indicates if the
>> DMA_PTE_SNP bit is supported on a page by page basis or not.
>>
>> Since x86 always leans toward "DMA cache coherent" I'm reading some
>> tea leaves here:
>>
>> IOMMU_CAP_CACHE_COHERENCY, /* IOMMU can enforce cache coherent DMA
>> transactions */
>>
>> And guessing that IOMMUs that implement DMA_PTE_SNP will ignore the
>> snoop bit in TLPs for IOVA's that have DMA_PTE_SNP set?
> That's my understanding as well.
>
>> Further, I guess IOMMUs that don't support PTE_SNP, or have
>> DMA_PTE_SNP clear will always honour the snoop bit. (backwards compat
>> and all)
> Yes.
>
>> So, lack of IOMMU_CAP_CACHE_COHERENCY does not mean the IOMMU is DMA
>> incoherent with the CPU caches, it just means that snooping cannot be
>> enforced against the no-snoop bit in the TLP, ie the device *could*
>> do no-snoop DMA if it wants. Devices that never do no-snoop remain DMA
>> coherent on x86, as they always have been.
> Yes, IOMMU_CAP_CACHE_COHERENCY=false means we cannot force the device
> DMA to be coherent via the IOMMU.
>
>> IOMMU_CACHE does not mean the IOMMU is DMA cache coherent, it means
>> the PCI device is blocked from using no-snoop in its TLPs.
>>
>> I wonder if ARM implemented this consistently? I see VDPA is
>> confused..


Basically, we don't want to bother with a pseudo KVM device like what VFIO
did. So for simplicity, we rule out IOMMUs that can't enforce
coherency in vhost-vDPA if the parent purely depends on the platform IOMMU:


	if (!iommu_capable(bus, IOMMU_CAP_CACHE_COHERENCY))
		return -ENOTSUPP;

For the parents that use their own translation logic, an implicit
assumption is that the hardware will always perform cache coherent DMA.

Thanks


2021-06-03 03:08:25

by Tian, Kevin

[permalink] [raw]
Subject: RE: [RFC] /dev/ioasid uAPI proposal

> From: Jason Gunthorpe <[email protected]>
> Sent: Thursday, June 3, 2021 1:20 AM
>
[...]
> > I wonder if there's a way to model this using a nested AS rather than
> > requiring special operations. e.g.
> >
> > 'prereg' IOAS
> > |
> > \- 'rid' IOAS
> > |
> > \- 'pasid' IOAS (maybe)
> >
> > 'prereg' would have a kernel managed pagetable into which (for
> > example) qemu platform code would map all guest memory (using
> > IOASID_MAP_DMA). qemu's vIOMMU driver would then mirror the guest's
> > IO mappings into the 'rid' IOAS in terms of GPA.
> >
> > This wouldn't quite work as is, because the 'prereg' IOAS would have
> > no devices. But we could potentially have another call to mark an
> > IOAS as a purely "preregistration" or pure virtual IOAS. Using that
> > would be an alternative to attaching devices.
>
> It is one option for sure, this is where I was thinking when we were
> talking in the other thread. I think the decision is best
> implementation driven as the datastructure to store the
> preregsitration data should be rather purpose built.

Yes. For now I prefer managing prereg through a separate cmd
instead of special-casing it in the IOASID graph. Anyway this is sort
of a per-fd thing.

>
> > > /*
> > > * Map/unmap process virtual addresses to I/O virtual addresses.
> > > *
> > > * Provide VFIO type1 equivalent semantics. Start with the same
> > > * restriction e.g. the unmap size should match those used in the
> > > * original mapping call.
> > > *
> > > * If IOASID_REGISTER_MEMORY has been called, the mapped vaddr
> > > * must be already in the preregistered list.
> > > *
> > > * Input parameters:
> > > * - u32 ioasid;
> > > * - refer to vfio_iommu_type1_dma_{un}map
> > > *
> > > * Return: 0 on success, -errno on failure.
> > > */
> > > #define IOASID_MAP_DMA _IO(IOASID_TYPE, IOASID_BASE + 6)
> > > #define IOASID_UNMAP_DMA _IO(IOASID_TYPE, IOASID_BASE + 7)
> >
> > I'm assuming these would be expected to fail if a user managed
> > pagetable has been bound?
>
> Me too, or a SVA page table.
>
> This document would do well to have a list of imagined page table
> types and the set of operations that act on them. I think they are all
> pretty disjoint..
>
> Your presentation of 'kernel owns the table' vs 'userspace owns the
> table' is a useful clarification to call out too

Sure, I incorporated this comment in my last reply.

>
> > > 5. Use Cases and Flows
> > >
> > > Here assume VFIO will support a new model where every bound device
> > > is explicitly listed under /dev/vfio thus a device fd can be acquired w/o
> > > going through legacy container/group interface. For illustration purpose
> > > those devices are just called dev[1...N]:
> > >
> > > device_fd[1...N] = open("/dev/vfio/devices/dev[1...N]", mode);
> >
> > Minor detail, but I'd suggest /dev/vfio/pci/DDDD:BB:SS.F for the
> > filenames for actual PCI functions. Maybe /dev/vfio/mdev/something
> > for mdevs. That leaves other subdirs of /dev/vfio free for future
> > non-PCI device types, and /dev/vfio itself for the legacy group
> > devices.
>
> There are a bunch of nice options here if we go this path

Yes, this part is only briefly touched on so we can focus on /dev/iommu first. In later
versions it will be considered more seriously.

>
> > > 5.2. Multiple IOASIDs (no nesting)
> > > ++++++++++++++++++++++++++++
> > >
> > > Dev1 and dev2 are assigned to the guest. vIOMMU is enabled. Initially
> > > both devices are attached to gpa_ioasid.
> >
> > Doesn't really affect your example, but note that the PAPR IOMMU does
> > not have a passthrough mode, so devices will not initially be attached
> > to gpa_ioasid - they will be unusable for DMA until attached to a
> > gIOVA ioasid.

'initially' here still refers to a user-requested action. For PAPR you should do
the attach only when it's necessary.

Thanks
Kevin

2021-06-03 03:38:18

by Tian, Kevin

[permalink] [raw]
Subject: RE: [RFC] /dev/ioasid uAPI proposal

> From: Alex Williamson <[email protected]>
> Sent: Thursday, June 3, 2021 10:51 AM
>
> On Wed, 2 Jun 2021 19:45:36 -0300
> Jason Gunthorpe <[email protected]> wrote:
>
> > On Wed, Jun 02, 2021 at 02:37:34PM -0600, Alex Williamson wrote:
> >
> > > Right. I don't follow where you're jumping to relaying DMA_PTE_SNP
> > > from the guest page table... what page table?
> >
> > I see my confusion now, the phrasing in your earlier remark led me
> > think this was about allowing the no-snoop performance enhancement in
> > some restricted way.
> >
> > It is really about blocking no-snoop 100% of the time and then
> > disabling the dangerous wbinvd when the block is successful.
> >
> > Didn't closely read the kvm code :\
> >
> > If it was about allowing the optimization then I'd expect the guest to
> > enable no-snoopable regions via it's vIOMMU and realize them to the
> > hypervisor and plumb the whole thing through. Hence my remark about
> > the guest page tables..
> >
> > So really the test is just 'were we able to block it' ?
>
> Yup. Do we really still consider that there's some performance benefit
> to be had by enabling a device to use no-snoop? This seems largely a
> legacy thing.

Yes, there is indeed a performance benefit for a device to use no-snoop,
e.g. 8K display and some image processing paths, etc. The problem is
that the IOMMU for such devices is typically a different one from the
default IOMMU for most devices. This special IOMMU may not have
the ability to enforce snoop on no-snoop PCI traffic, so this fact
must be understood by KVM to do proper mtrr/pat/wbinvd virtualization
for such devices to work correctly.

>
> > > This support existed before mdev, IIRC we needed it for direct
> > > assignment of NVIDIA GPUs.
> >
> > Probably because they ignored the disable no-snoop bits in the control
> > block, or reset them in some insane way to "fix" broken bioses and
> > kept using it even though by all rights qemu would have tried hard to
> > turn it off via the config space. Processing no-snoop without a
> > working wbinvd would be fatal. Yeesh
> >
> > But Ok, back the /dev/ioasid. This answers a few lingering questions I
> > had..
> >
> > 1) Mixing IOMMU_CAP_CACHE_COHERENCY
> and !IOMMU_CAP_CACHE_COHERENCY
> > domains.
> >
> > This doesn't actually matter. If you mix them together then kvm
> > will turn on wbinvd anyhow, so we don't need to use the DMA_PTE_SNP
> > anywhere in this VM.
> >
> > This if two IOMMU's are joined together into a single /dev/ioasid
> > then we can just make them both pretend to be
> > !IOMMU_CAP_CACHE_COHERENCY and both not set IOMMU_CACHE.
>
> Yes and no. Yes, if any domain is !IOMMU_CAP_CACHE_COHERENCY then
> we
> need to emulate wbinvd, but no we'll use IOMMU_CACHE any time it's
> available based on the per domain support available. That gives us the
> most consistent behavior, ie. we don't have VMs emulating wbinvd
> because they used to have a device attached where the domain required
> it and we can't atomically remap with new flags to perform the same as
> a VM that never had that device attached in the first place.
>
> > 2) How to fit this part of kvm in some new /dev/ioasid world
> >
> > What we want to do here is iterate over every ioasid associated
> > with the group fd that is passed into kvm.
>
> Yeah, we need some better names, binding a device to an ioasid (fd) but
> then attaching a device to an allocated ioasid (non-fd)... I assume
> you're talking about the latter ioasid.
>
> > Today the group fd has a single container which specifies the
> > single ioasid so this is being done trivially.
> >
> > To reorg we want to get the ioasid from the device not the
> > group (see my note to David about the groups vs device rational)
> >
> > This is just iterating over each vfio_device in the group and
> > querying the ioasid it is using.
>
> The IOMMU API group interfaces is largely iommu_group_for_each_dev()
> anyway, we still need to account for all the RIDs and aliases of a
> group.
>
> > Or perhaps more directly: an op attaching the vfio_device to the
> > kvm and having some simple helper
> > '(un)register ioasid with kvm (kvm, ioasid)'
> > that the vfio_device driver can call that just sorts this out.
>
> We could almost eliminate the device notion altogether here, use an
> ioasidfd_for_each_ioasid() but we really want a way to trigger on each
> change to the composition of the device set for the ioasid, which is
> why we currently do it on addition or removal of a group, where the
> group has a consistent set of IOMMU properties. Register a notifier
> callback via the ioasidfd? Thanks,
>

When discussing I/O page fault support in another thread, the consensus
is that a device handle will be registered (by the user) or allocated (returned
to the user) in /dev/ioasid when binding the device to the ioasid fd. From this
angle we can register {ioasid_fd, device_handle} with KVM and then call
something like ioasidfd_device_is_coherent() to get the property.
Anyway the coherency is a per-device property which is not changed
by how many I/O page tables are attached to it.
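
A rough sketch of the KVM side (ioasidfd_device_is_coherent() and the
other names are only suggestions from this thread, nothing here is
settled kAPI):

/* placeholder kAPI -- purely illustrative */
bool ioasidfd_device_is_coherent(struct ioasid_ctx *ictx, u32 dev_handle);

static void kvm_update_noncoherent_dma(struct kvm *kvm,
				       struct ioasid_ctx *ictx, u32 dev_handle)
{
	/* if the IOMMU cannot force snooping for this device,
	 * KVM has to emulate wbinvd for the VM */
	if (!ioasidfd_device_is_coherent(ictx, dev_handle))
		kvm_arch_register_noncoherent_dma(kvm);
}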

Thanks
Kevin

2021-06-03 04:16:20

by Alex Williamson

[permalink] [raw]
Subject: Re: [RFC] /dev/ioasid uAPI proposal

On Thu, 3 Jun 2021 03:22:27 +0000
"Tian, Kevin" <[email protected]> wrote:

> > From: Alex Williamson <[email protected]>
> > Sent: Thursday, June 3, 2021 10:51 AM
> >
> > On Wed, 2 Jun 2021 19:45:36 -0300
> > Jason Gunthorpe <[email protected]> wrote:
> >
> > > On Wed, Jun 02, 2021 at 02:37:34PM -0600, Alex Williamson wrote:
> > >
> > > > Right. I don't follow where you're jumping to relaying DMA_PTE_SNP
> > > > from the guest page table... what page table?
> > >
> > > I see my confusion now, the phrasing in your earlier remark led me
> > > think this was about allowing the no-snoop performance enhancement in
> > > some restricted way.
> > >
> > > It is really about blocking no-snoop 100% of the time and then
> > > disabling the dangerous wbinvd when the block is successful.
> > >
> > > Didn't closely read the kvm code :\
> > >
> > > If it was about allowing the optimization then I'd expect the guest to
> > > enable no-snoopable regions via it's vIOMMU and realize them to the
> > > hypervisor and plumb the whole thing through. Hence my remark about
> > > the guest page tables..
> > >
> > > So really the test is just 'were we able to block it' ?
> >
> > Yup. Do we really still consider that there's some performance benefit
> > to be had by enabling a device to use no-snoop? This seems largely a
> > legacy thing.
>
> Yes, there is indeed performance benefit for device to use no-snoop,
> e.g. 8K display and some imaging processing path, etc. The problem is
> that the IOMMU for such devices is typically a different one from the
> default IOMMU for most devices. This special IOMMU may not have
> the ability of enforcing snoop on no-snoop PCI traffic then this fact
> must be understood by KVM to do proper mtrr/pat/wbinvd virtualization
> for such devices to work correctly.

The case where the IOMMU does not support snoop-control for such a
device already works fine: we can't prevent no-snoop, so KVM will
emulate wbinvd. The harder question is whether we should opt to allow no-snoop
even if the IOMMU does support snoop-control.

> > > > This support existed before mdev, IIRC we needed it for direct
> > > > assignment of NVIDIA GPUs.
> > >
> > > Probably because they ignored the disable no-snoop bits in the control
> > > block, or reset them in some insane way to "fix" broken bioses and
> > > kept using it even though by all rights qemu would have tried hard to
> > > turn it off via the config space. Processing no-snoop without a
> > > working wbinvd would be fatal. Yeesh
> > >
> > > But Ok, back the /dev/ioasid. This answers a few lingering questions I
> > > had..
> > >
> > > 1) Mixing IOMMU_CAP_CACHE_COHERENCY
> > and !IOMMU_CAP_CACHE_COHERENCY
> > > domains.
> > >
> > > This doesn't actually matter. If you mix them together then kvm
> > > will turn on wbinvd anyhow, so we don't need to use the DMA_PTE_SNP
> > > anywhere in this VM.
> > >
> > > This if two IOMMU's are joined together into a single /dev/ioasid
> > > then we can just make them both pretend to be
> > > !IOMMU_CAP_CACHE_COHERENCY and both not set IOMMU_CACHE.
> >
> > Yes and no. Yes, if any domain is !IOMMU_CAP_CACHE_COHERENCY then
> > we
> > need to emulate wbinvd, but no we'll use IOMMU_CACHE any time it's
> > available based on the per domain support available. That gives us the
> > most consistent behavior, ie. we don't have VMs emulating wbinvd
> > because they used to have a device attached where the domain required
> > it and we can't atomically remap with new flags to perform the same as
> > a VM that never had that device attached in the first place.
> >
> > > 2) How to fit this part of kvm in some new /dev/ioasid world
> > >
> > > What we want to do here is iterate over every ioasid associated
> > > with the group fd that is passed into kvm.
> >
> > Yeah, we need some better names, binding a device to an ioasid (fd) but
> > then attaching a device to an allocated ioasid (non-fd)... I assume
> > you're talking about the latter ioasid.
> >
> > > Today the group fd has a single container which specifies the
> > > single ioasid so this is being done trivially.
> > >
> > > To reorg we want to get the ioasid from the device not the
> > > group (see my note to David about the groups vs device rational)
> > >
> > > This is just iterating over each vfio_device in the group and
> > > querying the ioasid it is using.
> >
> > The IOMMU API group interfaces is largely iommu_group_for_each_dev()
> > anyway, we still need to account for all the RIDs and aliases of a
> > group.
> >
> > > Or perhaps more directly: an op attaching the vfio_device to the
> > > kvm and having some simple helper
> > > '(un)register ioasid with kvm (kvm, ioasid)'
> > > that the vfio_device driver can call that just sorts this out.
> >
> > We could almost eliminate the device notion altogether here, use an
> > ioasidfd_for_each_ioasid() but we really want a way to trigger on each
> > change to the composition of the device set for the ioasid, which is
> > why we currently do it on addition or removal of a group, where the
> > group has a consistent set of IOMMU properties. Register a notifier
> > callback via the ioasidfd? Thanks,
> >
>
> When discussing I/O page fault support in another thread, the consensus
> is that an device handle will be registered (by user) or allocated (return
> to user) in /dev/ioasid when binding the device to ioasid fd. From this
> angle we can register {ioasid_fd, device_handle} to KVM and then call
> something like ioasidfd_device_is_coherent() to get the property.
> Anyway the coherency is a per-device property which is not changed
> by how many I/O page tables are attached to it.

The mechanics are different, but this is pretty similar in concept to
KVM learning coherence using the groupfd today. Do we want to
compromise on kernel control of wbinvd emulation to allow userspace to
make such decisions? Ownership of a device might be reason enough to
allow the user that privilege. Thanks,

Alex

2021-06-03 05:22:21

by Tian, Kevin

[permalink] [raw]
Subject: RE: [RFC] /dev/ioasid uAPI proposal

> From: Alex Williamson <[email protected]>
> Sent: Thursday, June 3, 2021 12:15 PM
>
> On Thu, 3 Jun 2021 03:22:27 +0000
> "Tian, Kevin" <[email protected]> wrote:
>
> > > From: Alex Williamson <[email protected]>
> > > Sent: Thursday, June 3, 2021 10:51 AM
> > >
> > > On Wed, 2 Jun 2021 19:45:36 -0300
> > > Jason Gunthorpe <[email protected]> wrote:
> > >
> > > > On Wed, Jun 02, 2021 at 02:37:34PM -0600, Alex Williamson wrote:
> > > >
> > > > > Right. I don't follow where you're jumping to relaying DMA_PTE_SNP
> > > > > from the guest page table... what page table?
> > > >
> > > > I see my confusion now, the phrasing in your earlier remark led me
> > > > think this was about allowing the no-snoop performance enhancement
> in
> > > > some restricted way.
> > > >
> > > > It is really about blocking no-snoop 100% of the time and then
> > > > disabling the dangerous wbinvd when the block is successful.
> > > >
> > > > Didn't closely read the kvm code :\
> > > >
> > > > If it was about allowing the optimization then I'd expect the guest to
> > > > enable no-snoopable regions via it's vIOMMU and realize them to the
> > > > hypervisor and plumb the whole thing through. Hence my remark about
> > > > the guest page tables..
> > > >
> > > > So really the test is just 'were we able to block it' ?
> > >
> > > Yup. Do we really still consider that there's some performance benefit
> > > to be had by enabling a device to use no-snoop? This seems largely a
> > > legacy thing.
> >
> > Yes, there is indeed performance benefit for device to use no-snoop,
> > e.g. 8K display and some imaging processing path, etc. The problem is
> > that the IOMMU for such devices is typically a different one from the
> > default IOMMU for most devices. This special IOMMU may not have
> > the ability of enforcing snoop on no-snoop PCI traffic then this fact
> > must be understood by KVM to do proper mtrr/pat/wbinvd virtualization
> > for such devices to work correctly.
>
> The case where the IOMMU does not support snoop-control for such a
> device already works fine, we can't prevent no-snoop so KVM will
> emulate wbinvd. The harder one is if we should opt to allow no-snoop
> even if the IOMMU does support snoop-control.

In another discussion we are leaning toward a per-device capability
reporting scheme through /dev/ioasid (or /dev/iommu as the new
name). It seems natural to also allow setting a capability, e.g. no-
snoop, for a device if the underlying IOMMU driver allows it.

>
> > > > > This support existed before mdev, IIRC we needed it for direct
> > > > > assignment of NVIDIA GPUs.
> > > >
> > > > Probably because they ignored the disable no-snoop bits in the control
> > > > block, or reset them in some insane way to "fix" broken bioses and
> > > > kept using it even though by all rights qemu would have tried hard to
> > > > turn it off via the config space. Processing no-snoop without a
> > > > working wbinvd would be fatal. Yeesh
> > > >
> > > > But Ok, back the /dev/ioasid. This answers a few lingering questions I
> > > > had..
> > > >
> > > > 1) Mixing IOMMU_CAP_CACHE_COHERENCY
> > > and !IOMMU_CAP_CACHE_COHERENCY
> > > > domains.
> > > >
> > > > This doesn't actually matter. If you mix them together then kvm
> > > > will turn on wbinvd anyhow, so we don't need to use the
> DMA_PTE_SNP
> > > > anywhere in this VM.
> > > >
> > > > This if two IOMMU's are joined together into a single /dev/ioasid
> > > > then we can just make them both pretend to be
> > > > !IOMMU_CAP_CACHE_COHERENCY and both not set IOMMU_CACHE.
> > >
> > > Yes and no. Yes, if any domain is !IOMMU_CAP_CACHE_COHERENCY
> then
> > > we
> > > need to emulate wbinvd, but no we'll use IOMMU_CACHE any time it's
> > > available based on the per domain support available. That gives us the
> > > most consistent behavior, ie. we don't have VMs emulating wbinvd
> > > because they used to have a device attached where the domain required
> > > it and we can't atomically remap with new flags to perform the same as
> > > a VM that never had that device attached in the first place.
> > >
> > > > 2) How to fit this part of kvm in some new /dev/ioasid world
> > > >
> > > > What we want to do here is iterate over every ioasid associated
> > > > with the group fd that is passed into kvm.
> > >
> > > Yeah, we need some better names, binding a device to an ioasid (fd) but
> > > then attaching a device to an allocated ioasid (non-fd)... I assume
> > > you're talking about the latter ioasid.
> > >
> > > > Today the group fd has a single container which specifies the
> > > > single ioasid so this is being done trivially.
> > > >
> > > > To reorg we want to get the ioasid from the device not the
> > > > group (see my note to David about the groups vs device rational)
> > > >
> > > > This is just iterating over each vfio_device in the group and
> > > > querying the ioasid it is using.
> > >
> > > The IOMMU API group interfaces is largely iommu_group_for_each_dev()
> > > anyway, we still need to account for all the RIDs and aliases of a
> > > group.
> > >
> > > > Or perhaps more directly: an op attaching the vfio_device to the
> > > > kvm and having some simple helper
> > > > '(un)register ioasid with kvm (kvm, ioasid)'
> > > > that the vfio_device driver can call that just sorts this out.
> > >
> > > We could almost eliminate the device notion altogether here, use an
> > > ioasidfd_for_each_ioasid() but we really want a way to trigger on each
> > > change to the composition of the device set for the ioasid, which is
> > > why we currently do it on addition or removal of a group, where the
> > > group has a consistent set of IOMMU properties. Register a notifier
> > > callback via the ioasidfd? Thanks,
> > >
> >
> > When discussing I/O page fault support in another thread, the consensus
> > is that an device handle will be registered (by user) or allocated (return
> > to user) in /dev/ioasid when binding the device to ioasid fd. From this
> > angle we can register {ioasid_fd, device_handle} to KVM and then call
> > something like ioasidfd_device_is_coherent() to get the property.
> > Anyway the coherency is a per-device property which is not changed
> > by how many I/O page tables are attached to it.
>
> The mechanics are different, but this is pretty similar in concept to
> KVM learning coherence using the groupfd today. Do we want to
> compromise on kernel control of wbinvd emulation to allow userspace to
> make such decisions? Ownership of a device might be reason enough to
> allow the user that privilege. Thanks,
>

I think so. In the end it's still decided by the underlying IOMMU driver.
If the IOMMU driver doesn't allow the user to opt for no-snoop, it's exactly
the same as today's groupfd approach. Otherwise a user-opted policy
implies that the decision is delegated to userspace.

Thanks
Kevin

2021-06-03 05:52:20

by Lu Baolu

[permalink] [raw]
Subject: Re: [RFC] /dev/ioasid uAPI proposal

On 6/3/21 7:23 AM, Jason Gunthorpe wrote:
> On Wed, Jun 02, 2021 at 12:01:57PM +0800, Lu Baolu wrote:
>> On 6/2/21 1:26 AM, Jason Gunthorpe wrote:
>>> On Tue, Jun 01, 2021 at 07:09:21PM +0800, Lu Baolu wrote:
>>>
>>>> This version only covers 1) and 4). Do you think we need to support 2),
>>>> 3) and beyond?
>>>
>>> Yes absolutely. The API should be flexible enough to specify the
>>> creation of all future page table formats we'd want to have and all HW
>>> specific details on those formats.
>>
>> OK, stay in the same line.
>>
>>>> If so, it seems that we need some in-kernel helpers and uAPIs to
>>>> support pre-installing a page table to IOASID.
>>>
>>> Not sure what this means..
>>
>> Sorry that I didn't make this clear.
>>
>> Let me bring back the page table types in my eyes.
>>
>> 1) IOMMU format page table (a.k.a. iommu_domain)
>> 2) user application CPU page table (SVA for example)
>> 3) KVM EPT (future option)
>> 4) VM guest managed page table (nesting mode)
>>
>> Each type of page table should be able to be associated with its IOASID.
>> We have BIND protocol for 4); We explicitly allocate an iommu_domain for
>> 1). But we don't have a clear definition for 2) 3) and others. I think
>> it's necessary to clearly define a time point and kAPI name between
>> IOASID_ALLOC and IOASID_ATTACH, so that other modules have the
>> opportunity to associate their page table with the allocated IOASID
>> before attaching the page table to the real IOMMU hardware.
>
> In my mind these are all actions of creation..
>
> #1 is ALLOC_IOASID 'to be compatible with these devices attached to
> this FD'
> #2 is ALLOC_IOASID_SVA
> #3 is some ALLOC_IOASID_KVM (and maybe the kvm fd has to issue this ioctl)
> #4 is ALLOC_IOASID_USER_PAGE_TABLE w/ user VA address or
> ALLOC_IOASID_NESTED_PAGE_TABLE w/ IOVA address
>
> Each allocation should have a set of operations that are allowed:
> map/unmap is only legal on #1, invalidate is only legal on #4, etc.

This sounds reasonable. The corresponding page table types and required
callbacks are also part of it.
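
As a rough sketch (the enum and table below are purely illustrative, not
part of the proposal), the per-type operation gating could look like:

    enum ioasid_pt_type {
        IOASID_PT_KERNEL,       /* #1: kernel-managed, map/unmap */
        IOASID_PT_SVA,          /* #2: shares the CPU page table */
        IOASID_PT_KVM,          /* #3: KVM EPT (future) */
        IOASID_PT_USER,         /* #4: user/guest-managed, bind/invalidate */
    };

    struct ioasid_pt_ops {
        bool map_unmap;         /* IOASID_MAP/UNMAP_DMA allowed */
        bool invalidate;        /* iotlb invalidation allowed */
        bool iopf;              /* hooks an I/O page fault handler */
    };

    static const struct ioasid_pt_ops ioasid_pt_allowed[] = {
        [IOASID_PT_KERNEL] = { .map_unmap = true },
        [IOASID_PT_SVA]    = {},                    /* faults handled via the mm */
        [IOASID_PT_KVM]    = { .iopf = true },
        [IOASID_PT_USER]   = { .invalidate = true, .iopf = true },
    };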

>
> How you want to split this up in the ioctl interface is a more
> interesting question. I generally like more calls than giant unwieldy
> multiplexer structs, but some things are naturally flags and optional
> modifications of a single ioctl.
>
> In any event they should have a similar naming 'ALLOC_IOASID_XXX' and
> then a single 'DESTROY_IOASID' that works on all of them.
>
>> I/O page fault handling is similar. The provider of the page table
>> should take the responsibility to handle the possible page faults.
>
> For the faultable types, yes #3 and #4 should hook in the fault
> handler and deal with it.

Agreed.

Best regards,
baolu

2021-06-03 06:30:54

by David Gibson

[permalink] [raw]
Subject: Re: [RFC] /dev/ioasid uAPI proposal

On Wed, Jun 02, 2021 at 01:16:48PM -0300, Jason Gunthorpe wrote:
> On Wed, Jun 02, 2021 at 04:32:27PM +1000, David Gibson wrote:
> > > I agree with Jean-Philippe - at the very least erasing this
> > > information needs a major rational - but I don't really see why it
> > > must be erased? The HW reports the originating device, is it just a
> > > matter of labeling the devices attached to the /dev/ioasid FD so it
> > > can be reported to userspace?
> >
> > HW reports the originating device as far as it knows. In many cases
> > where you have multiple devices in an IOMMU group, it's because
> > although they're treated as separate devices at the kernel level, they
> > have the same RID at the HW level. Which means a RID for something in
> > the right group is the closest you can count on supplying.
>
> Granted there may be cases where exact fidelity is not possible, but
> that doesn't excuse eliminating fidelity where it does exist..
>
> > > If there are no hypervisor traps (does this exist?) then there is no
> > > way to involve the hypervisor here and the child IOASID should simply
> > > be a pointer to the guest's data structure that describes binding. In
> > > this case that IOASID should claim all PASIDs when bound to a
> > > RID.
> >
> > And in that case I think we should call that object something other
> > than an IOASID, since it represents multiple address spaces.
>
> Maybe.. It is certainly a special case.
>
> We can still consider it a single "address space" from the IOMMU
> perspective. What has happened is that the address table is not just a
> 64 bit IOVA, but an extended ~80 bit IOVA formed by "PASID, IOVA".

True. This does complicate how we represent which IOVA ranges are
valid, though. I'll bet you most implementations don't actually
implement a full 64-bit IOVA, which means we effectively have a large
number of windows from (0..max IOVA) for each valid pasid. This adds
another reason I don't think my concept of IOVA windows is just a
power specific thing.

> If we are already going in the direction of having the IOASID specify
> the page table format and other details, specifying that the page
> table format is the 80 bit "PASID, IOVA" format is a fairly small
> step.

Well, rather I think userspace needs to request what page table format
it wants and the kernel tells it whether it can oblige or not.
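
In other words, something roughly like this at allocation time (the struct
and format names are just illustrative; whether IOASID_ALLOC grows such an
argument is exactly the open question):

    struct ioasid_alloc_arg {
        __u32 flags;
        __u32 pgtable_format;   /* e.g. a hypothetical IOASID_FMT_ARM_SMMU_V3_S1 */
        __u32 addr_width;       /* requested IOVA width in bits */
        __u32 pad;
    };

    /* userspace proposes a format; the kernel returns an IOASID, or fails
     * if it can't provide that format for the devices bound to this fd */
    int ioasid_alloc_with_format(int ioasid_fd, __u32 format, __u32 width)
    {
        struct ioasid_alloc_arg arg = {
            .pgtable_format = format,
            .addr_width = width,
        };

        return ioctl(ioasid_fd, IOASID_ALLOC, &arg);
    }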

> I wouldn't twist things into knots to create a difference, but if it
> is easy to do it wouldn't hurt either.
>
> Jason
>

--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson



2021-06-03 06:30:54

by David Gibson

[permalink] [raw]
Subject: Re: [RFC] /dev/ioasid uAPI proposal

On Thu, Jun 03, 2021 at 02:49:56AM +0000, Tian, Kevin wrote:
> > From: Jason Gunthorpe <[email protected]>
> > Sent: Thursday, June 3, 2021 12:59 AM
> >
> > On Wed, Jun 02, 2021 at 04:48:35PM +1000, David Gibson wrote:
> > > > > /* Bind guest I/O page table */
> > > > > bind_data = {
> > > > > .ioasid = gva_ioasid;
> > > > > .addr = gva_pgtable1;
> > > > > // and format information
> > > > > };
> > > > > ioctl(ioasid_fd, IOASID_BIND_PGTABLE, &bind_data);
> > > >
> > > > Again I do wonder if this should just be part of alloc_ioasid. Is
> > > > there any reason to split these things? The only advantage to the
> > > > split is the device is known, but the device shouldn't impact
> > > > anything..
> > >
> > > I'm pretty sure the device(s) could matter, although they probably
> > > won't usually.
> >
> > It is a bit subtle, but the /dev/iommu fd itself is connected to the
> > devices first. This prevents wildly incompatible devices from being
> > joined together, and allows some "get info" to report the capability
> > union of all devices if we want to do that.
>
> I would expect the capability reported per-device via /dev/iommu.
> Incompatible devices can bind to the same fd but cannot attach to
> the same IOASID. This allows incompatible devices to share locked
> page accounting.

Yeah... I'm not convinced that everything relevant here can be
reported per-device. I think we may have edge cases where
combinations of devices have restrictions that individual devices in
the set do not.

--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson



2021-06-03 06:30:54

by David Gibson

[permalink] [raw]
Subject: Re: [RFC] /dev/ioasid uAPI proposal

On Wed, Jun 02, 2021 at 01:58:38PM -0300, Jason Gunthorpe wrote:
> On Wed, Jun 02, 2021 at 04:48:35PM +1000, David Gibson wrote:
> > > > /* Bind guest I/O page table */
> > > > bind_data = {
> > > > .ioasid = gva_ioasid;
> > > > .addr = gva_pgtable1;
> > > > // and format information
> > > > };
> > > > ioctl(ioasid_fd, IOASID_BIND_PGTABLE, &bind_data);
> > >
> > > Again I do wonder if this should just be part of alloc_ioasid. Is
> > > there any reason to split these things? The only advantage to the
> > > split is the device is known, but the device shouldn't impact
> > > anything..
> >
> > I'm pretty sure the device(s) could matter, although they probably
> > won't usually.
>
> It is a bit subtle, but the /dev/iommu fd itself is connected to the
> devices first. This prevents wildly incompatible devices from being
> joined together, and allows some "get info" to report the capability
> union of all devices if we want to do that.

Right.. but I've not been convinced that having a /dev/iommu fd
instance be the boundary for these types of things actually makes
sense. For example if we were doing the preregistration thing
(whether by child ASes or otherwise) then that still makes sense
across wildly different devices, but we couldn't share that layer if
we have to open different instances for each of them.

It really seems to me that it's at the granularity of the address
space (including extended RID+PASID ASes) that we need to know what
devices we have, and therefore what capabilities we have for that AS.

> The original concept was that devices joined would all have to support
> the same IOASID format, at least for the kernel owned map/unmap IOASID
> type. Supporting different page table formats maybe is reason to
> revisit that concept.
>
> There is a small advantage to re-using the IOASID container because of
> the get_user_pages caching and pinned accounting management at the FD
> level.

Right, but at this stage I'm just not seeing a really clear (across
platforms and device types) boundary for what things have to be per
IOASID container and what have to be per IOASID, so I'm just not sure
the /dev/iommu instance grouping makes any sense.

> I don't know if that small advantage is worth the extra complexity
> though.
>
> > But it would certainly be possible for a system to have two
> > different host bridges with two different IOMMUs with different
> > pagetable formats. Until you know which devices (and therefore
> > which host bridge) you're talking about, you don't know what formats
> > of pagetable to accept. And if you have devices from *both* bridges
> > you can't bind a page table at all - you could theoretically support
> > a kernel managed pagetable by mirroring each MAP and UNMAP to tables
> > in both formats, but it would be pretty reasonable not to support
> > that.
>
> The basic process for a user space owned pgtable mode would be:
>
> 1) qemu has to figure out what format of pgtable to use
>
> Presumably it uses query functions using the device label.

No... in the qemu case it would always select the page table format
that it needs to present to the guest. That's part of the
guest-visible platform that's selected by qemu's configuration.

There's no negotiation here: either the kernel can supply what qemu
needs to pass to the guest, or it can't. If it can't, qemu will have
to either emulate in SW (if possible, probably using a kernel-managed
IOASID to back it) or fail outright.

> The
> kernel code should look at the entire device path through all the
> IOMMU HW to determine what is possible.
>
> Or it already knows because the VM's vIOMMU is running in some
> fixed page table format, or the VM's vIOMMU already told it, or
> something.

Again, I think you have the order a bit backwards. The user selects
the capabilities that the vIOMMU will present to the guest as part of
the qemu configuration. Qemu then requests that of the host kernel,
and either the host kernel supplies it, qemu emulates it in SW, or
qemu fails to start.

Guest visible properties of the platform never (or *should* never)
depend implicitly on host capabilities - it's impossible to sanely
support migration in such an environment.
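
So the flow on the qemu side would be roughly the sketch below (all the
helper names are made up):

    /* the vIOMMU format comes from the machine configuration,
     * never from probing the host */
    int viommu_backend_init(int ioasid_fd, uint32_t cfg_format)
    {
        int ioasid = ioasid_alloc_with_format(ioasid_fd, cfg_format);

        if (ioasid >= 0)
            return ioasid;                  /* host supplies the configured format */

        if (viommu_can_shadow(cfg_format))  /* SW emulation on a kernel-managed IOASID */
            return viommu_shadow_init(ioasid_fd, cfg_format);

        return -1;                          /* can't honour the config: fail to start */
    }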

> 2) qemu creates an IOASID and based on #1 and says 'I want this format'

Right.

> 3) qemu binds the IOASID to the device.
>
> If qemu gets it wrong then it just fails.

Right, though it may fall back to (partial) software emulation. In
practice that would mean using a kernel-managed IOASID and walking the
guest IO pagetables itself to mirror them into the host kernel.

> 4) For the next device qemu would have to figure out if it can re-use
> an existing IOASID based on the required properties.

Nope. Again, what devices share an IO address space is a guest
visible part of the platform. If the host kernel can't supply that,
then qemu must not start (or fail the hotplug if the new device is
being hotplugged).

> You pointed to the case of mixing vIOMMU's of different platforms. So
> it is completely reasonable for qemu to ask for a "ARM 64 bit IOMMU
> page table mode v2" while running on an x86 because that is what the
> vIOMMU is wired to work with.

Yes.

> Presumably qemu will fall back to software emulation if this is not
> possible.

Right. But even in this case it needs to do some checking of the
capabilities of the backing IOMMU. At minimum the host IOMMU needs to
be able to map all the IOVAs that the guest expects to be mappable,
and the host IOMMU needs to support a pagesize that's a submultiple of
the pagesize expected in the guest.

For this reason, amongst some others, I think when selecting a kernel
managed pagetable we need to also have userspace explicitly request
which IOVA ranges are mappable, and what (minimum) page size it
needs.
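
Concretely, something like this at creation time (purely illustrative; none
of these fields exist in the RFC yet):

    struct ioasid_iova_range {
        __u64 start;
        __u64 last;
    };

    struct ioasid_alloc_kernel_pt {
        __u32 flags;
        __u32 min_pgsize;                   /* smallest page size userspace needs */
        __u32 nr_iova_ranges;
        __u32 pad;
        struct ioasid_iova_range ranges[];  /* IOVA windows that must be mappable */
    };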

> One interesting option for software emulation is to just transform the
> ARM page table format to a x86 page table format in userspace and use
> nested bind/invalidate to synchronize with the kernel. With SW nesting
> I suspect this would be much faster

It should be possible *if* the backing IOMMU can support the necessary
IOVAs and pagesizes (and maybe some other things I haven't thought
of). If not, you're simply out of luck and there's no option but to
fail to start the guest.

--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson



2021-06-03 06:31:22

by David Gibson

[permalink] [raw]
Subject: Re: [RFC] /dev/ioasid uAPI proposal

On Wed, Jun 02, 2021 at 01:37:53PM -0300, Jason Gunthorpe wrote:
> On Wed, Jun 02, 2021 at 04:57:52PM +1000, David Gibson wrote:
>
> > I don't think presence or absence of a group fd makes a lot of
> > difference to this design. Having a group fd just means we attach
> > groups to the ioasid instead of individual devices, and we no longer
> > need the bookkeeping of "partial" devices.
>
> Oh, I think we really don't want to attach the group to an ioasid, or
> at least not as a first-class idea.
>
> The fundamental problem that got us here is we now live in a world
> where there are many ways to attach a device to an IOASID:

I'm not seeing that that's necessarily a problem.

> - A RID binding
> - A RID,PASID binding
> - A RID,PASID binding for ENQCMD

I have to admit I haven't fully grasped the differences between these
modes. I'm hoping we can consolidate at least some of them into the
same sort of binding onto different IOASIDs (which may be linked in
parent/child relationships).

> - A SW TABLE binding
> - etc
>
> The selection of which mode to use is based on the specific
> driver/device operation. Ie the thing that implements the 'struct
> vfio_device' is the thing that has to select the binding mode.

I thought userspace selected the binding mode - although not all modes
will be possible for all devices.

> group attachment was fine when there was only one mode. As you say it
> is fine to just attach every group member with RID binding if RID
> binding is the only option.
>
> When SW TABLE binding was added the group code was hacked up - now the
> group logic is choosing between RID/SW TABLE in a very hacky and mdev
> specific way, and this is just a mess.

Sounds like it. What do you propose instead to handle backwards
compatibility for group-based VFIO code?

> The flow must carry the IOASID from the /dev/iommu to the vfio_device
> driver and the vfio_device implementation must choose which binding
> mode and parameters it wants based on driver and HW configuration.
>
> eg if two PCI devices are in a group then it is perfectly fine that
> one device uses RID binding and the other device uses RID,PASID
> binding.

Uhhhh... I don't see how that can be. They could well be in the same
group because their RIDs cannot be distinguished from each other.

> The only place I see for a "group bind" in the uAPI is some compat
> layer for the vfio container, and the implementation would be quite
> different, we'd have to call each vfio_device driver in the group and
> execute the IOASID attach IOCTL.
>
> > > I would say no on the container. /dev/ioasid == the container, having
> > > two competing objects at once in a single process is just a mess.
> >
> > Right. I'd assume that for compatibility, creating a container would
> > create a single IOASID under the hood with a compatibility layer
> > translating the container operations to ioasid operations.
>
> It is a nice dream for sure
>
> /dev/vfio could be a special case of /dev/ioasid just with a different
> uapi and ending up with only one IOASID. They could be interchangeable
> from then on, which would simplify the internals of VFIO if it
> consistently dealt with these new ioasid objects everywhere. But last I
> looked it was complicated enough to best be done later on
>
> Jason
>

--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson



2021-06-03 06:31:30

by David Gibson

[permalink] [raw]
Subject: Re: [RFC] /dev/ioasid uAPI proposal

On Tue, Jun 01, 2021 at 07:09:21PM +0800, Lu Baolu wrote:
> Hi Jason,
>
> On 2021/5/29 7:36, Jason Gunthorpe wrote:
> > > /*
> > > * Bind an user-managed I/O page table with the IOMMU
> > > *
> > > * Because user page table is untrusted, IOASID nesting must be enabled
> > > * for this ioasid so the kernel can enforce its DMA isolation policy
> > > * through the parent ioasid.
> > > *
> > > * Pgtable binding protocol is different from DMA mapping. The latter
> > > * has the I/O page table constructed by the kernel and updated
> > > * according to user MAP/UNMAP commands. With pgtable binding the
> > > * whole page table is created and updated by userspace, thus different
> > > * set of commands are required (bind, iotlb invalidation, page fault, etc.).
> > > *
> > > * Because the page table is directly walked by the IOMMU, the user
> > > * must use a format compatible to the underlying hardware. It can
> > > * check the format information through IOASID_GET_INFO.
> > > *
> > > * The page table is bound to the IOMMU according to the routing
> > > * information of each attached device under the specified IOASID. The
> > > * routing information (RID and optional PASID) is registered when a
> > > * device is attached to this IOASID through VFIO uAPI.
> > > *
> > > * Input parameters:
> > > * - child_ioasid;
> > > * - address of the user page table;
> > > * - formats (vendor, address_width, etc.);
> > > *
> > > * Return: 0 on success, -errno on failure.
> > > */
> > > #define IOASID_BIND_PGTABLE _IO(IOASID_TYPE, IOASID_BASE + 9)
> > > #define IOASID_UNBIND_PGTABLE _IO(IOASID_TYPE, IOASID_BASE + 10)
> > Also feels backwards, why wouldn't we specify this, and the required
> > page table format, during alloc time?
> >
>
> Thinking of the required page table format, perhaps we should shed more
> light on the page table of an IOASID. So far, an IOASID might represent
> one of the following page tables (might be more):
>
> 1) an IOMMU format page table (a.k.a. iommu_domain)
> 2) a user application CPU page table (SVA for example)
> 3) a KVM EPT (future option)
> 4) a VM guest managed page table (nesting mode)
>
> This version only covers 1) and 4). Do you think we need to support 2),

Isn't (2) the equivalent of using the host-managed pagetable
then doing a giant MAP of all your user address space into it? But
maybe we should identify that case explicitly in case the host can
optimize it.

> 3) and beyond? If so, it seems that we need some in-kernel helpers and
> uAPIs to support pre-installing a page table to IOASID. From this point
> of view an IOASID is actually not just a variant of iommu_domain, but an
> I/O page table representation in a broader sense.
>
> Best regards,
> baolu
>

--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson



2021-06-03 06:31:30

by David Gibson

[permalink] [raw]
Subject: Re: [RFC] /dev/ioasid uAPI proposal

On Thu, Jun 03, 2021 at 01:29:58AM +0000, Tian, Kevin wrote:
> > From: Jason Gunthorpe
> > Sent: Thursday, June 3, 2021 12:09 AM
> >
> > On Wed, Jun 02, 2021 at 01:33:22AM +0000, Tian, Kevin wrote:
> > > > From: Jason Gunthorpe <[email protected]>
> > > > Sent: Wednesday, June 2, 2021 1:42 AM
> > > >
> > > > On Tue, Jun 01, 2021 at 08:10:14AM +0000, Tian, Kevin wrote:
> > > > > > From: Jason Gunthorpe <[email protected]>
> > > > > > Sent: Saturday, May 29, 2021 1:36 AM
> > > > > >
> > > > > > On Thu, May 27, 2021 at 07:58:12AM +0000, Tian, Kevin wrote:
> > > > > >
> > > > > > > IOASID nesting can be implemented in two ways: hardware nesting
> > and
> > > > > > > software nesting. With hardware support the child and parent I/O
> > page
> > > > > > > tables are walked consecutively by the IOMMU to form a nested
> > > > translation.
> > > > > > > When it's implemented in software, the ioasid driver is responsible
> > for
> > > > > > > merging the two-level mappings into a single-level shadow I/O page
> > > > table.
> > > > > > > Software nesting requires both child/parent page tables operated
> > > > through
> > > > > > > the dma mapping protocol, so any change in either level can be
> > > > captured
> > > > > > > by the kernel to update the corresponding shadow mapping.
> > > > > >
> > > > > > Why? A SW emulation could do this synchronization during
> > invalidation
> > > > > > processing if invalidation contained an IOVA range.
> > > > >
> > > > > In this proposal we differentiate between host-managed and user-
> > > > > managed I/O page tables. If host-managed, the user is expected to use
> > > > > map/unmap cmd explicitly upon any change required on the page table.
> > > > > If user-managed, the user first binds its page table to the IOMMU and
> > > > > then use invalidation cmd to flush iotlb when necessary (e.g. typically
> > > > > not required when changing a PTE from non-present to present).
> > > > >
> > > > > We expect user to use map+unmap and bind+invalidate respectively
> > > > > instead of mixing them together. Following this policy, map+unmap
> > > > > must be used in both levels for software nesting, so changes in either
> > > > > level are captured timely to synchronize the shadow mapping.
> > > >
> > > > map+unmap or bind+invalidate is a policy of the IOASID itself set when
> > > > it is created. If you put two different types in a tree then each IOASID
> > > > must continue to use its own operation mode.
> > > >
> > > > I don't see a reason to force all IOASIDs in a tree to be consistent??
> > >
> > > only for software nesting. With hardware support the parent uses map
> > > while the child uses bind.
> > >
> > > Yes, the policy is specified per IOASID. But if the policy violates the
> > > requirement in a specific nesting mode, then nesting should fail.
> >
> > I don't get it.
> >
> > If the IOASID is a page table then it is bind/invalidate. SW or not SW
> > doesn't matter at all.
> >
> > > >
> > > > A software emulated two level page table where the leaf level is a
> > > > bound page table in guest memory should continue to use
> > > > bind/invalidate to maintain the guest page table IOASID even though it
> > > > is a SW construct.
> > >
> > > with software nesting the leaf should be a host-managed page table
> > > (or metadata). A bind/invalidate protocol doesn't require the user
> > > to notify the kernel of every page table change.
> >
> > The purpose of invalidate is to inform the implementation that the
> > page table has changed so it can flush the caches. If the page table
> > is changed and invalidation is not issued then the implementation
> > is free to ignore the changes.
> >
> > In this way the SW mode is the same as a HW mode with an infinite
> > cache.
> >
> > The collapsed shadow page table is really just a cache.
> >
>
> OK. One additional thing is that we may need a 'caching_mode"
> thing reported by /dev/ioasid, indicating whether invalidation is
> required when changing non-present to present. For hardware
> nesting it's not reported as the hardware IOMMU will walk the
> guest page table in cases of iotlb miss. For software nesting
> caching_mode is reported so the user must issue invalidation
> upon any change in guest page table so the kernel can update
> the shadow page table timely.

For the first cut, I'd have the API assume that invalidates are
*always* required. Some bypass to avoid them in cases where they're
not needed can be an additional extension.
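
In other words, the info report could carry a flag like the sketch below
(field and flag names invented), with the conservative default being that
invalidation is always required:

    struct ioasid_info {
        __u32 flags;
    #define IOASID_INFO_CACHES_NON_PRESENT  (1 << 0)    /* invalidate even on
                                                           non-present -> present */
        __u32 pgtable_format;
        __u32 addr_width;
        __u32 pad;
    };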

> Following this and your other comment with David, we will mark
> host-managed vs. guest-managed explicitly for I/O page table
> of each IOASID. map+unmap or bind+invalidate is decided by
> which owner is specified by the user.
>
> Thanks
> Kevin
>

--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson



2021-06-03 06:32:12

by David Gibson

[permalink] [raw]
Subject: Re: [RFC] /dev/ioasid uAPI proposal

On Wed, Jun 02, 2021 at 02:19:30PM -0300, Jason Gunthorpe wrote:
> On Wed, Jun 02, 2021 at 04:15:07PM +1000, David Gibson wrote:
>
> > Is there a compelling reason to have all the IOASIDs handled by one
> > FD?
>
> There was an answer on this, if every PASID needs an IOASID then there
> are too many FDs.

Too many in what regard? fd limits? Something else?

It seems to be there are two different cases for PASID handling here.
One is where userspace explicitly creates each valid PASID and
attaches a separate pagetable for each (or handles each with
MAP/UNMAP). In that case I wouldn't have expected there to be too
many fds.

Then there's the case where we register a whole PASID table, in which
case I think you only need the one FD. We can treat that as creating
an 84-bit IOAS, whose pagetable format is (PASID table + a bunch of
pagetables for each PASID).

> It is difficult to share the get_user_pages cache across FDs.

Ah... hrm, yes I can see that.

> There are global properties in the /dev/iommu FD, like what devices
> are part of it, that are important for group security operations. This
> becomes confused if it is split to many FDs.

I'm still not seeing those. I'm really not seeing any well-defined
meaning to devices being attached to the fd, but not to a particular
IOAS.

> > > I/O address space can be managed through two protocols, according to
> > > whether the corresponding I/O page table is constructed by the kernel or
> > > the user. When kernel-managed, a dma mapping protocol (similar to
> > > existing VFIO iommu type1) is provided for the user to explicitly specify
> > > how the I/O address space is mapped. Otherwise, a different protocol is
> > > provided for the user to bind an user-managed I/O page table to the
> > > IOMMU, plus necessary commands for iotlb invalidation and I/O fault
> > > handling.
> > >
> > > Pgtable binding protocol can be used only on the child IOASID's, implying
> > > IOASID nesting must be enabled. This is because the kernel doesn't trust
> > > userspace. Nesting allows the kernel to enforce its DMA isolation policy
> > > through the parent IOASID.
> >
> > To clarify, I'm guessing that's a restriction of likely practice,
> > rather than a fundamental API restriction. I can see a couple of
> > theoretical future cases where a user-managed pagetable for a "base"
> > IOASID would be feasible:
> >
> > 1) On some fancy future MMU allowing free nesting, where the kernel
> > would insert an implicit extra layer translating user addresses
> > to physical addresses, and the userspace manages a pagetable with
> > its own VAs being the target AS
>
> I would model this by having a "SVA" parent IOASID. A "SVA" IOASID is one
> where the IOVA == process VA and the kernel maintains this mapping.

That makes sense. Needs a different name to avoid Intel and PCI
specificness, but having a trivial "pagetable format" which just says
IOVA == user address is a nice idea.

> Since the uAPI is so general I do have a general expectation that the
> drivers/iommu implementations might need to be a bit more complicated,
> like if the HW can optimize certain specific graphs of IOASIDs we
> would still model them as graphs and the HW driver would have to
> "compile" the graph into the optimal hardware.
>
> This approach has worked reasonably well in other kernel areas.

That seems sensible.

> > 2) For a purely software virtual device, where its virtual DMA
> > engine can interpret user addresses fine
>
> This also sounds like an SVA IOASID.

Ok.

> Depending on HW if a device can really only bind to a very narrow kind
> of IOASID then it should ask for that (probably platform specific!)
> type during its attachment request to drivers/iommu.
>
> eg "I am special hardware and only know how to do PLATFORM_BLAH
> transactions, give me an IOASID compatible with that". If the only way
> to create "PLATFORM_BLAH" is with a SVA IOASID because BLAH is
> hardwired to the CPU ASID then that is just how it is.

Fair enough.

> > I wonder if there's a way to model this using a nested AS rather than
> > requiring special operations. e.g.
> >
> > 'prereg' IOAS
> > |
> > \- 'rid' IOAS
> > |
> > \- 'pasid' IOAS (maybe)
> >
> > 'prereg' would have a kernel managed pagetable into which (for
> > example) qemu platform code would map all guest memory (using
> > IOASID_MAP_DMA). qemu's vIOMMU driver would then mirror the guest's
> > IO mappings into the 'rid' IOAS in terms of GPA.
> >
> > This wouldn't quite work as is, because the 'prereg' IOAS would have
> > no devices. But we could potentially have another call to mark an
> > IOAS as a purely "preregistration" or pure virtual IOAS. Using that
> > would be an alternative to attaching devices.
>
> It is one option for sure, this is where I was thinking when we were
> talking in the other thread. I think the decision is best
> implementation driven as the datastructure to store the
> preregistration data should be rather purpose built.

Right. I think this gets nicer now that we're considering more
specific options at IOAS creation time, and different "types" of
IOAS. We could add a "preregistration" IOAS type, which supports
MAP/UNMAP of user addresses, and allows *no* devices to be attached,
but does allow other IOAS types to be added as children.
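
A rough sketch of what that could look like (the type value and struct here
are hypothetical):

    #define IOASID_TYPE_PREREG  0x1     /* hypothetical allocation type */

    struct ioasid_alloc_prereg {
        __u32 type;                     /* IOASID_TYPE_PREREG */
        __u32 flags;
    };

    /* the returned IOAS accepts MAP/UNMAP of user VAs for pinning and
     * accounting, rejects device attach, and can parent other IOAS types */
    int create_prereg_ioas(int ioasid_fd)
    {
        struct ioasid_alloc_prereg arg = { .type = IOASID_TYPE_PREREG };

        return ioctl(ioasid_fd, IOASID_ALLOC, &arg);
    }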

>
> > > /*
> > > * Map/unmap process virtual addresses to I/O virtual addresses.
> > > *
> > > * Provide VFIO type1 equivalent semantics. Start with the same
> > > * restriction e.g. the unmap size should match those used in the
> > > * original mapping call.
> > > *
> > > * If IOASID_REGISTER_MEMORY has been called, the mapped vaddr
> > > * must be already in the preregistered list.
> > > *
> > > * Input parameters:
> > > * - u32 ioasid;
> > > * - refer to vfio_iommu_type1_dma_{un}map
> > > *
> > > * Return: 0 on success, -errno on failure.
> > > */
> > > #define IOASID_MAP_DMA _IO(IOASID_TYPE, IOASID_BASE + 6)
> > > #define IOASID_UNMAP_DMA _IO(IOASID_TYPE, IOASID_BASE + 7)
> >
> > I'm assuming these would be expected to fail if a user managed
> > pagetable has been bound?
>
> Me too, or a SVA page table.
>
> This document would do well to have a list of imagined page table
> types and the set of operations that act on them. I think they are all
> pretty disjoint..

Right. With the possible exception that I can imagine a call for
several types which all support MAP/UNMAP, but have other different
characteristics.

> Your presentation of 'kernel owns the table' vs 'userspace owns the
> table' is a useful clarification to call out too
>
> > > 5. Use Cases and Flows
> > >
> > > Here assume VFIO will support a new model where every bound device
> > > is explicitly listed under /dev/vfio thus a device fd can be acquired w/o
> > > going through legacy container/group interface. For illustration purpose
> > > those devices are just called dev[1...N]:
> > >
> > > device_fd[1...N] = open("/dev/vfio/devices/dev[1...N]", mode);
> >
> > Minor detail, but I'd suggest /dev/vfio/pci/DDDD:BB:SS.F for the
> > filenames for actual PCI functions. Maybe /dev/vfio/mdev/something
> > for mdevs. That leaves other subdirs of /dev/vfio free for future
> > non-PCI device types, and /dev/vfio itself for the legacy group
> > devices.
>
> There are a bunch of nice options here if we go this path
>
> > > 5.2. Multiple IOASIDs (no nesting)
> > > ++++++++++++++++++++++++++++
> > >
> > > Dev1 and dev2 are assigned to the guest. vIOMMU is enabled. Initially
> > > both devices are attached to gpa_ioasid.
> >
> > Doesn't really affect your example, but note that the PAPR IOMMU does
> > not have a passthrough mode, so devices will not initially be attached
> > to gpa_ioasid - they will be unusable for DMA until attached to a
> > gIOVA ioasid.
>
> I think attachment should always be explicit in the API. If the user
> doesn't explicitly ask for a device to be attached to the IOASID then
> the iommu driver is free to block it.
>
> If you want passthrough then you have to create a passthrough IOASID
> and attach every device to it. Some of those attaches might be NOP's
> due to groups.

Agreed.

--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson



2021-06-03 06:42:07

by Tian, Kevin

[permalink] [raw]
Subject: RE: [RFC] /dev/ioasid uAPI proposal

> From: Jason Gunthorpe <[email protected]>
> Sent: Saturday, May 29, 2021 7:37 AM
>
> On Thu, May 27, 2021 at 07:58:12AM +0000, Tian, Kevin wrote:
>
> > 2.1. /dev/ioasid uAPI
> > +++++++++++++++++
> >
> > /*
> > * Check whether an uAPI extension is supported.
> > *
> > * This is for FD-level capabilities, such as locked page pre-registration.
> > * IOASID-level capabilities are reported through IOASID_GET_INFO.
> > *
> > * Return: 0 if not supported, 1 if supported.
> > */
> > #define IOASID_CHECK_EXTENSION _IO(IOASID_TYPE, IOASID_BASE + 0)
>
>
> > /*
> > * Register user space memory where DMA is allowed.
> > *
> > * It pins user pages and does the locked memory accounting so sub-
> > * sequent IOASID_MAP/UNMAP_DMA calls get faster.
> > *
> > * When this ioctl is not used, one user page might be accounted
> > * multiple times when it is mapped by multiple IOASIDs which are
> > * not nested together.
> > *
> > * Input parameters:
> > * - vaddr;
> > * - size;
> > *
> > * Return: 0 on success, -errno on failure.
> > */
> > #define IOASID_REGISTER_MEMORY _IO(IOASID_TYPE, IOASID_BASE + 1)
> > #define IOASID_UNREGISTER_MEMORY _IO(IOASID_TYPE, IOASID_BASE + 2)
>
> So VA ranges are pinned and stored in a tree and later references to
> those VA ranges by any other IOASID use the pin cached in the tree?

yes.
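
For example (the argument structs are illustrative; the RFC only lists the
parameters):

    /* pre-register once so later maps from any IOASID reuse the cached pin */
    int prereg_and_map(int ioasid_fd, __u32 ioasid, void *vaddr, __u64 size)
    {
        struct ioasid_memory_region reg = {
            .vaddr = (__u64)(unsigned long)vaddr,
            .size = size,
        };
        struct ioasid_dma_map map = {
            .ioasid = ioasid,
            .vaddr = (__u64)(unsigned long)vaddr,
            .iova = 0,
            .size = size,
        };

        if (ioctl(ioasid_fd, IOASID_REGISTER_MEMORY, &reg))
            return -1;
        /* fast path: pages already pinned and accounted above */
        return ioctl(ioasid_fd, IOASID_MAP_DMA, &map);
    }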

>
> It seems reasonable and is similar to the ioasid parent/child I
> suggested for PPC.
>
> IMHO this should be merged with the all SW IOASID that is required for
> today's mdev drivers. If this can be done while keeping this uAPI then

Agree. Regarding the uAPI there is no difference between a SW IOASID and
a HW IOASID. The main difference is behind /dev/ioasid: a SW IOASID
is not linked to the IOMMU.

> great, otherwise I don't think it is so bad to weakly nest a physical
> IOASID under a SW one just to optimize page pinning.
>
> Either way this seems like a smart direction
>
> > /*
> > * Allocate an IOASID.
> > *
> > * IOASID is the FD-local software handle representing an I/O address
> > * space. Each IOASID is associated with a single I/O page table. User
> > * must call this ioctl to get an IOASID for every I/O address space that is
> > * intended to be enabled in the IOMMU.
> > *
> > * A newly-created IOASID doesn't accept any command before it is
> > * attached to a device. Once attached, an empty I/O page table is
> > * bound with the IOMMU then the user could use either DMA mapping
> > * or pgtable binding commands to manage this I/O page table.
>
> Can the IOASID can be populated before being attached?
>
> > * Device attachment is initiated through device driver uAPI (e.g. VFIO)
> > *
> > * Return: allocated ioasid on success, -errno on failure.
> > */
> > #define IOASID_ALLOC _IO(IOASID_TYPE, IOASID_BASE + 3)
> > #define IOASID_FREE _IO(IOASID_TYPE, IOASID_BASE + 4)
>
> I assume alloc will include quite a big structure to satisfy the
> various vendor needs?

I'll skip the /dev/ioasid uAPI comments below about alloc/bind. They're already
covered in other sub-threads.

[...]

> >
> > 2.2. /dev/vfio uAPI
> > ++++++++++++++++
>
> To be clear you mean the 'struct vfio_device' API, these are not
> IOCTLs on the container or group?

Exactly

>
> > /*
> > * Bind a vfio_device to the specified IOASID fd
> > *
> > * Multiple vfio devices can be bound to a single ioasid_fd, but a single
> > * vfio device should not be bound to multiple ioasid_fd's.
> > *
> > * Input parameters:
> > * - ioasid_fd;
> > *
> > * Return: 0 on success, -errno on failure.
> > */
> > #define VFIO_BIND_IOASID_FD _IO(VFIO_TYPE, VFIO_BASE + 22)
> > #define VFIO_UNBIND_IOASID_FD _IO(VFIO_TYPE, VFIO_BASE + 23)
>
> This is where it would make sense to have an output "device id" that
> allows /dev/ioasid to refer to this "device" by number in events and
> other related things.

As discussed earlier, either an input or an output "device id" is fine here.

>
> >
> > 2.3. KVM uAPI
> > ++++++++++++
> >
> > /*
> > * Update CPU PASID mapping
> > *
> > * This is necessary when ENQCMD will be used in the guest while the
> > * targeted device doesn't accept the vPASID saved in the CPU MSR.
> > *
> > * This command allows user to set/clear the vPASID->pPASID mapping
> > * in the CPU, by providing the IOASID (and FD) information representing
> > * the I/O address space marked by this vPASID.
> > *
> > * Input parameters:
> > * - user_pasid;
> > * - ioasid_fd;
> > * - ioasid;
> > */
> > #define KVM_MAP_PASID _IO(KVMIO, 0xf0)
> > #define KVM_UNMAP_PASID _IO(KVMIO, 0xf1)
>
> It seems simple enough.. So the physical PASID can only be assigned if
> the user has an IOASID that points at it? Thus it is secure?

Yes. The kernel doesn't trust the user to provide a random physical PASID.
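
A usage sketch (the struct layout is illustrative; the proposal only lists
the input parameters):

    struct kvm_pasid_mapping {
        __u32 user_pasid;   /* vPASID as seen/used by the guest */
        __s32 ioasid_fd;    /* proves the caller owns the address space */
        __u32 ioasid;       /* IOASID representing that address space */
    };

    int map_vpasid(int kvm_vm_fd, __u32 vpasid, int ioasid_fd, __u32 ioasid)
    {
        struct kvm_pasid_mapping map = {
            .user_pasid = vpasid,
            .ioasid_fd  = ioasid_fd,
            .ioasid     = ioasid,
        };

        /* KVM resolves the pPASID through the IOASID, never from userspace */
        return ioctl(kvm_vm_fd, KVM_MAP_PASID, &map);
    }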

>
> > 3. Sample structures and helper functions
> >
> > Three helper functions are provided to support VFIO_BIND_IOASID_FD:
> >
> > struct ioasid_ctx *ioasid_ctx_fdget(int fd);
> > int ioasid_register_device(struct ioasid_ctx *ctx, struct ioasid_dev
> *dev);
> > int ioasid_unregister_device(struct ioasid_dev *dev);
> >
> > An ioasid_ctx is created for each fd:
> >
> > struct ioasid_ctx {
> > // a list of allocated IOASID data's
> > struct list_head ioasid_list;
>
> Would expect an xarray
>
> > // a list of registered devices
> > struct list_head dev_list;
>
> xarray of device_id

list of ioasid_dev objects. device_id will be put inside each object.

>
> > // a list of pre-registered virtual address ranges
> > struct list_head prereg_list;
>
> Should re-use the existing SW IOASID table, and be an interval tree.

What is the existing SW IOASID table?

>
> > Each registered device is represented by ioasid_dev:
> >
> > struct ioasid_dev {
> > struct list_head next;
> > struct ioasid_ctx *ctx;
> > // always be the physical device
> > struct device *device;
> > struct kref kref;
> > };
> >
> > Because we assume one vfio_device connected to at most one ioasid_fd,
> > here ioasid_dev could be embedded in vfio_device and then linked to
> > ioasid_ctx->dev_list when registration succeeds. For mdev the struct
> > device should be the pointer to the parent device. PASID marking this
> > mdev is specified later when VFIO_ATTACH_IOASID.
>
> Don't embed a struct like this in something with vfio_device - that
> just makes a mess of reference counting by having multiple krefs in
> the same memory block. Keep it as a pointer, the attach operation
> should return a pointer to the above struct.

OK. Also based on the agreement that one device can bind to multiple
fd's, this struct embed approach also doesn't work then.

>
> > An ioasid_data is created when IOASID_ALLOC, as the main object
> > describing characteristics about an I/O page table:
> >
> > struct ioasid_data {
> > // link to ioasid_ctx->ioasid_list
> > struct list_head next;
> >
> > // the IOASID number
> > u32 ioasid;
> >
> > // the handle to convey iommu operations
> > // hold the pgd (TBD until discussing iommu api)
> > struct iommu_domain *domain;
>
> But at least for the first coding draft I would expect to see this API
> presented with no PASID support and a simple 1:1 with iommu_domain. How
> PASID gets modeled is the big TBD, right?

Yes. As the starting point we will assume a 1:1 association. This should
work for PF/VF, but very soon mdev must be considered. I expect
we can start the conversation on PASID support once this uAPI proposal
is settled.

>
> > ioasid_data and iommu_domain have overlapping roles as both are
> > introduced to represent an I/O address space. It is still a big TBD how
> > the two should be correlated or even merged, and whether new iommu
> > ops are required to handle RID+PASID explicitly.
>
> I think it is OK that the uapi and kernel api have different
> structs. The uapi focused one should hold the uapi related data, which
> is what you've shown here, I think.
>
> > Two helper functions are provided to support VFIO_ATTACH_IOASID:
> >
> > struct attach_info {
> > u32 ioasid;
> > // If valid, the PASID to be used physically
> > u32 pasid;
> > };
> > int ioasid_device_attach(struct ioasid_dev *dev,
> > struct attach_info info);
> > int ioasid_device_detach(struct ioasid_dev *dev, u32 ioasid);
>
> Honestly, I still prefer this to be highly explicit as this is where
> all device driver authors get involved:
>
> ioasid_pci_device_attach(struct pci_device *pdev, struct ioasid_dev *dev,
> u32 ioasid);
> ioasid_pci_device_pasid_attach(struct pci_device *pdev, u32 *physical_pasid,
> struct ioasid_dev *dev, u32 ioasid);

Then it would be better to name it pci_device_attach_ioasid since the 1st
parameter is struct pci_device?

By keeping physical_pasid as a pointer, you want to remove the last helper
function (ioasid_get_global_pasid) so the global pasid is returned along
with the attach function?

>
> And presumably a variant for ARM non-PCI platform (?) devices.
>
> This could boil down to a __ioasid_device_attach() as you've shown.
>
> > A new object is introduced and linked to ioasid_data->attach_data for
> > each successful attach operation:
> >
> > struct ioasid_attach_data {
> > struct list_head next;
> > struct ioasid_dev *dev;
> > u32 pasid;
> > }
>
> This should be returned as a pointer and detach should be:
>
> int ioasid_device_detach(struct ioasid_attach_data *);

ok
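
So on the vfio_device driver side it would look roughly like this
(my_vfio_device and its ioasid_attach field are just placeholders):

    static int my_vfio_attach_ioasid(struct my_vfio_device *vdev,
                                     struct attach_info info)
    {
        struct ioasid_attach_data *att;

        att = ioasid_device_attach(vdev->idev, info);
        if (IS_ERR(att))
            return PTR_ERR(att);

        /* keep the handle; detach later via ioasid_device_detach(att) */
        vdev->ioasid_attach = att;
        return 0;
    }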

>
> > As explained in the design section, there is no explicit group enforcement
> > in /dev/ioasid uAPI or helper functions. But the ioasid driver does
> > implicit group check - before every device within an iommu group is
> > attached to this IOASID, the previously-attached devices in this group are
> > put in ioasid_data->partial_devices. The IOASID rejects any command if
> > the partial_devices list is not empty.
>
> It is simple enough. Would be good to design in a diagnostic string so
> userspace can make sense of the failure. Eg return something like
> -EDEADLK and provide an ioctl 'why did EDEADLK happen' ?
>

Make sense.

>
> > Then is the last helper function:
> > u32 ioasid_get_global_pasid(struct ioasid_ctx *ctx,
> > u32 ioasid, bool alloc);
> >
> > ioasid_get_global_pasid is necessary in scenarios where multiple devices
> > want to share a same PASID value on the attached I/O page table (e.g.
> > when ENQCMD is enabled, as explained in next section). We need a
> > centralized place (ioasid_data->pasid) to hold this value (allocated when
> > first called with alloc=true). vfio device driver calls this function (alloc=
> > true) to get the global PASID for an ioasid before calling ioasid_device_
> > attach. KVM also calls this function (alloc=false) to setup PASID translation
> > structure when user calls KVM_MAP_PASID.
>
> When/why would the VFIO driver do this? Isn't this just some variant
> of pasid_attach?
>
> ioasid_pci_device_enqcmd_attach(struct pci_device *pdev, u32
> *physical_pasid, struct ioasid_dev *dev, u32 ioasid);
>
> ?

will adopt this way.

>
> > 4. PASID Virtualization
> >
> > When guest SVA (vSVA) is enabled, multiple GVA address spaces are
> > created on the assigned vfio device. This leads to the concepts of
> > "virtual PASID" (vPASID) vs. "physical PASID" (pPASID). vPASID is assigned
> > by the guest to mark an GVA address space while pPASID is the one
> > selected by the host and actually routed in the wire.
> >
> > vPASID is conveyed to the kernel when user calls VFIO_ATTACH_IOASID.
>
> Should the vPASID programmed into the IOASID before calling
> VFIO_ATTACH_IOASID?

No. As explained in an earlier reply, when multiple devices are attached
to the same IOASID the guest may link the page table to different
vPASID#s across the attached devices. Anyway vPASID is a per-RID thing.

>
> > vfio device driver translates vPASID to pPASID before calling ioasid_attach_
> > device, with two factors to be considered:
> >
> > - Whether vPASID is directly used (vPASID==pPASID) in the wire, or
> > should be instead converted to a newly-allocated one (vPASID!=
> > pPASID);
> >
> > - If vPASID!=pPASID, whether pPASID is allocated from per-RID PASID
> > space or a global PASID space (implying sharing pPASID cross devices,
> > e.g. when supporting Intel ENQCMD which puts PASID in a CPU MSR
> > as part of the process context);
>
> This whole section 4 is really confusing. I think it would be more
> understandable to focus on the list below and minimize the vPASID
>
> > The actual policy depends on pdev vs. mdev, and whether ENQCMD is
> > supported. There are three possible scenarios:
> >
> > (Note: /dev/ioasid uAPI is not affected by underlying PASID virtualization
> > policies.)
>
> This has become unclear. I think this should start by identifying the
> 6 main types of devices and how they can use pPASID/vPASID:
>
> 0) Device is a RID and cannot issue PASID
> 1) Device is a mdev and cannot issue PASID
> 2) Device is a mdev and programs a single fixed PASID during bind,
> does not accept PASID from the guest

There is no vPASID per se in the above 3 types, so this section only
focuses on the latter 3 types. But I can include them in the next version
if it makes the intent clearer.

>
> 3) Device accepts any PASIDs from the guest. No
> vPASID/pPASID translation is possible. (classic vfio_pci)
> 4) Device accepts any PASID from the guest and has an
> internal vPASID/pPASID translation (enhanced vfio_pci)

What is enhanced vfio_pci? In my writing this case is for mdev
which doesn't support ENQCMD.

> 5) Device accepts any PASID from the guest and relies on
> external vPASID/pPASID translation via ENQCMD (Intel SIOV mdev)
>
> 0-2 don't use vPASID at all
>
> 3-5 consume a vPASID but handle it differently.
>
> I think the 3-5 map into what you are trying to explain in the table
> below, which is the rules for allocating the vPASID depending on which
> of device types 3-5 are present and/or mixed.

Exactly

>
> For instance device type 3 requires vPASID == pPASID because it can't
> do translation at all.
>
> This probably all needs to come through clearly in the /dev/ioasid
> interface. Once the attached devices are labeled it would make sense to
> have a 'query device' /dev/ioasid IOCTL to report the details based on
> how the device attached and other information.

This is a good point. Another benefit of having a device label.

for 0-2 the device will report no PASID support. Although this may duplicate
with other information (e.g. PCI PASID cap), this provides a vendor-agnostic
way for reporting details around IOASID.

for 3-5 the device will report PASID support. In these cases the user is
expected to always provide a vPASID.

for 5 in addition the device will report a requirement for CPU PASID
translation. For such a device the user should talk to KVM to set up the PASID
mapping. This way the user doesn't need to know whether a device is a
pdev or an mdev; it just follows what the device capability reports.
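
i.e. the per-device report could carry something like (flag names invented):

    struct ioasid_device_info {
        __u32 flags;
    #define IOASID_DEV_INFO_PASID       (1 << 0)    /* types 3-5: user supplies vPASID */
    #define IOASID_DEV_INFO_PASID_CPU   (1 << 1)    /* type 5: needs KVM_MAP_PASID for
                                                       CPU PASID translation */
        __u32 pad;
    };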

>
> > 2) mdev: vPASID!=pPASID (per-RID if w/o ENQCMD, otherwise global)
> >
> > PASIDs are also used by kernel to mark the default I/O address space
> > for mdev, thus cannot be delegated to the guest. Instead, the mdev
> > driver must allocate a new pPASID for each vPASID (thus vPASID!=
> > pPASID) and then use pPASID when attaching this mdev to an ioasid.
>
> I don't understand this at all.. What does "PASIDs are also used by
> the kernel" mean?

This just refers to your type 2. Because PASIDs on this device are already used
by the parent driver to mark mdevs, we cannot delegate the per-RID PASID space
to the guest.

>
> > The mdev driver needs to cache the PASID mapping so in the mediation
> > path vPASID programmed by the guest can be converted to pPASID
> > before updating the physical MMIO register.
>
> This is my scenario #4 above. Device and internally virtualize
> vPASID/pPASID - how that is done is up to the device. But this is all
> just labels, when such a device attaches, it should use some specific
> API:
>
> ioasid_pci_device_vpasid_attach(struct pci_device *pdev,
> u32 *physical_pasid, u32 *virtual_pasid, struct ioasid_dev *dev, u32 ioasid);

yes.

>
> And then maintain its internal translation
>
> > In previous thread a PASID range split scheme was discussed to support
> > this combination, but we haven't worked out a clean uAPI design yet.
> > Therefore in this proposal we decide to not support it, implying the
> > user should have some intelligence to avoid such scenario. It could be
> > a TODO task for future.
>
> It really just boils down to how to allocate the PASIDs to get around
> the bad viommu interface that assumes all PASIDs are usable by all
> devices.

The viommu (e.g. Intel VT-d) has a good interface to restrict how many PASIDs
are available to the guest. There is a PASID size field in the viommu
register. Here the puzzle is just how to design a good uAPI to
handle this mixed scenario where vPASID/pPASID are in split ranges and
must be linked to the same I/O page table together.

I'll see whether this can be accommodated after addressing the other comments
in this section.

>
> > In spite of those subtle considerations, the kernel implementation could
> > start simple, e.g.:
> >
> > - v==p for pdev;
> > - v!=p and always use a global PASID pool for all mdev's;
>
> Regardless all this mess needs to be hidden from the consuming drivers
> with some simple APIs as above. The driver should indicate what its HW
> can do and the PASID #'s that magically come out of /dev/ioasid should
> be appropriate.
>

Yes, I see how it should work now.

Thanks
Kevin

2021-06-03 06:51:32

by Tian, Kevin

[permalink] [raw]
Subject: RE: [RFC] /dev/ioasid uAPI proposal

> From: David Gibson
> Sent: Thursday, June 3, 2021 1:09 PM
[...]
> > > In this way the SW mode is the same as a HW mode with an infinite
> > > cache.
> > >
> > > The collaposed shadow page table is really just a cache.
> > >
> >
> > OK. One additional thing is that we may need a 'caching_mode"
> > thing reported by /dev/ioasid, indicating whether invalidation is
> > required when changing non-present to present. For hardware
> > nesting it's not reported as the hardware IOMMU will walk the
> > guest page table in cases of iotlb miss. For software nesting
> > caching_mode is reported so the user must issue invalidation
> > upon any change in guest page table so the kernel can update
> > the shadow page table timely.
>
> For the first cut, I'd have the API assume that invalidates are
> *always* required. Some bypass to avoid them in cases where they're
> not needed can be an additional extension.
>

Isn't the typical TLB semantic that non-present entries are not
cached, thus invalidation is not required when changing non-present
to present? That's true for both the CPU TLB and the IOMMU TLB. In reality
I feel there are more usages built on hardware nesting than software
nesting, so making the default follow hardware TLB behavior makes
more sense...

Thanks
Kevin

2021-06-03 06:53:42

by Lu Baolu

[permalink] [raw]
Subject: Re: [RFC] /dev/ioasid uAPI proposal

Hi David,

On 6/3/21 1:54 PM, David Gibson wrote:
> On Tue, Jun 01, 2021 at 07:09:21PM +0800, Lu Baolu wrote:
>> Hi Jason,
>>
>> On 2021/5/29 7:36, Jason Gunthorpe wrote:
>>>> /*
>>>> * Bind an user-managed I/O page table with the IOMMU
>>>> *
>>>> * Because user page table is untrusted, IOASID nesting must be enabled
>>>> * for this ioasid so the kernel can enforce its DMA isolation policy
>>>> * through the parent ioasid.
>>>> *
>>>> * Pgtable binding protocol is different from DMA mapping. The latter
>>>> * has the I/O page table constructed by the kernel and updated
>>>> * according to user MAP/UNMAP commands. With pgtable binding the
>>>> * whole page table is created and updated by userspace, thus different
>>>> * set of commands are required (bind, iotlb invalidation, page fault, etc.).
>>>> *
>>>> * Because the page table is directly walked by the IOMMU, the user
>>>> * must use a format compatible to the underlying hardware. It can
>>>> * check the format information through IOASID_GET_INFO.
>>>> *
>>>> * The page table is bound to the IOMMU according to the routing
>>>> * information of each attached device under the specified IOASID. The
>>>> * routing information (RID and optional PASID) is registered when a
>>>> * device is attached to this IOASID through VFIO uAPI.
>>>> *
>>>> * Input parameters:
>>>> * - child_ioasid;
>>>> * - address of the user page table;
>>>> * - formats (vendor, address_width, etc.);
>>>> *
>>>> * Return: 0 on success, -errno on failure.
>>>> */
>>>> #define IOASID_BIND_PGTABLE _IO(IOASID_TYPE, IOASID_BASE + 9)
>>>> #define IOASID_UNBIND_PGTABLE _IO(IOASID_TYPE, IOASID_BASE + 10)
>>> Also feels backwards, why wouldn't we specify this, and the required
>>> page table format, during alloc time?
>>>
>> Thinking of the required page table format, perhaps we should shed more
>> light on the page table of an IOASID. So far, an IOASID might represent
>> one of the following page tables (might be more):
>>
>> 1) an IOMMU format page table (a.k.a. iommu_domain)
>> 2) a user application CPU page table (SVA for example)
>> 3) a KVM EPT (future option)
>> 4) a VM guest managed page table (nesting mode)
>>
>> This version only covers 1) and 4). Do you think we need to support 2),
> Isn't (2) the equivalent of using the host-managed pagetable
> then doing a giant MAP of all your user address space into it? But
> maybe we should identify that case explicitly in case the host can
> optimize it.
>

Conceptually, yes. The current SVA implementation just reuses the
application's CPU page table w/o map/unmap operations.

Best regards,
baolu

2021-06-03 07:19:26

by Tian, Kevin

[permalink] [raw]
Subject: RE: [RFC] /dev/ioasid uAPI proposal

> From: David Gibson <[email protected]>
> Sent: Wednesday, June 2, 2021 2:15 PM
>
[...]
> > An I/O address space takes effect in the IOMMU only after it is attached
> > to a device. The device in the /dev/ioasid context always refers to a
> > physical one or 'pdev' (PF or VF).
>
> What you mean by "physical" device here isn't really clear - VFs
> aren't really physical devices, and the PF/VF terminology also doesn't
> extend to non-PCI devices (which I think we want to consider for the
> API, even if we're not implementing it any time soon).

Yes, it's not very clear, and it leans on the PCI context to simplify the
description. A "physical" one here means a PCI endpoint function
which has a unique RID. It's more to differentiate from the later mdev/
subdevice case which uses both RID+PASID. Naming is always a hard
exercise for me... Possibly I'll just use device vs. subdevice in future
versions.

>
> Now, it's clear that we can't program things into the IOMMU before
> attaching a device - we might not even know which IOMMU to use.

yes

> However, I'm not sure if its wise to automatically make the AS "real"
> as soon as we attach a device:
>
> * If we're going to attach a whole bunch of devices, could we (for at
> least some IOMMU models) end up doing a lot of work which then has
> to be re-done for each extra device we attach?

Which extra work did you specifically refer to? Each attach just implies
writing the base address of the I/O page table to the IOMMU structure
corresponding to this device (either a per-device entry, or a per-
device+PASID entry).

and generally device attach should not be in a hot path.

>
> * With kernel managed IO page tables could attaching a second device
> (at least on some IOMMU models) require some operation which would
> require discarding those tables? e.g. if the second device somehow
> forces a different IO page size

Then the attach should fail and the user should create another IOASID
for the second device.

>
> For that reason I wonder if we want some sort of explicit enable or
> activate call. Device attaches would only be valid before, map or
> attach pagetable calls would only be valid after.

I'm interested in learning about a real example that requires an explicit enable...

>
> > One I/O address space could be attached to multiple devices. In this case,
> > /dev/ioasid uAPI applies to all attached devices under the specified IOASID.
> >
> > Based on the underlying IOMMU capability one device might be allowed
> > to attach to multiple I/O address spaces, with DMAs accessing them by
> > carrying different routing information. One of them is the default I/O
> > address space routed by PCI Requestor ID (RID) or ARM Stream ID. The
> > remaining are routed by RID + Process Address Space ID (PASID) or
> > Stream+Substream ID. For simplicity the following context uses RID and
> > PASID when talking about the routing information for I/O address spaces.
>
> I'm not really clear on how this interacts with nested ioasids. Would
> you generally expect the RID+PASID IOASes to be children of the base
> RID IOAS, or not?

No. With Intel SIOV both the parent and the children could be RID+PASID,
e.g. when one enables vSVA on an mdev.

>
> If the PASID ASes are children of the RID AS, can we consider this not
> as the device explicitly attaching to multiple IOASIDs, but instead
> attaching to the parent IOASID with awareness of the child ones?
>
> > Device attachment is initiated through passthrough framework uAPI (use
> > VFIO for simplicity in following context). VFIO is responsible for identifying
> > the routing information and registering it to the ioasid driver when calling
> > ioasid attach helper function. It could be RID if the assigned device is
> > pdev (PF/VF) or RID+PASID if the device is mediated (mdev). In addition,
> > user might also provide its view of virtual routing information (vPASID) in
> > the attach call, e.g. when multiple user-managed I/O address spaces are
> > attached to the vfio_device. In this case VFIO must figure out whether
> > vPASID should be directly used (for pdev) or converted to a kernel-
> > allocated one (pPASID, for mdev) for physical routing (see section 4).
> >
> > Device must be bound to an IOASID FD before attach operation can be
> > conducted. This is also through VFIO uAPI. In this proposal one device
> > should not be bound to multiple FD's. Not sure about the gain of
> > allowing it except adding unnecessary complexity. But if others have
> > different view we can further discuss.
> >
> > VFIO must ensure its device composes DMAs with the routing information
> > attached to the IOASID. For pdev it naturally happens since vPASID is
> > directly programmed to the device by guest software. For mdev this
> > implies any guest operation carrying a vPASID on this device must be
> > trapped into VFIO and then converted to pPASID before being sent to the
> > device. A detailed explanation of the PASID virtualization policies can be
> > found in section 4.
> >
> > Modern devices may support a scalable workload submission interface
> > based on PCI DMWr capability, allowing a single work queue to access
> > multiple I/O address spaces. One example is Intel ENQCMD, having
> > PASID saved in the CPU MSR and carried in the instruction payload
> > when sent out to the device. Then a single work queue shared by
> > multiple processes can compose DMAs carrying different PASIDs.
>
> Is the assumption here that the processes share the IOASID FD
> instance, but not memory?

I didn't get this question

>
> > When executing ENQCMD in the guest, the CPU MSR includes a vPASID
> > which, if targeting an mdev, must be converted to pPASID before being sent
> > to the wire. Intel CPU provides a hardware PASID translation capability
> > for auto-conversion in the fast path. The user is expected to setup the
> > PASID mapping through KVM uAPI, with information about {vpasid,
> > ioasid_fd, ioasid}. The ioasid driver provides a helper function for KVM
> > to figure out the actual pPASID given an IOASID.
> >
> > With above design /dev/ioasid uAPI is all about I/O address spaces.
> > It doesn't include any device routing information, which is only
> > indirectly registered to the ioasid driver through VFIO uAPI. For
> > example, I/O page fault is always reported to userspace per IOASID,
> > although it's physically reported per device (RID+PASID). If there is a
> > need to further relay this fault into the guest, the user is responsible
> > for identifying the device attached to this IOASID (randomly pick one if
> > multiple devices are attached) and then generating a per-device virtual I/O
> > page fault into the guest. Similarly the iotlb invalidation uAPI describes the
> > granularity in the I/O address space (all, or a range), different from the
> > underlying IOMMU semantics (domain-wide, PASID-wide, range-based).
> >
> > I/O page tables routed through PASID are installed in a per-RID PASID
> > table structure. Some platforms implement the PASID table in the guest
> > physical space (GPA), expecting it managed by the guest. The guest
> > PASID table is bound to the IOMMU also by attaching to an IOASID,
> > representing the per-RID vPASID space.
>
> Do we need to consider two management modes here, much as we have for
> the pagetables themsleves: either kernel managed, in which we have
> explicit calls to bind a vPASID to a parent PASID, or user managed in
> which case we register a table in some format.

Yes, this is related to PASID virtualization in section 4. Based on Jason's
suggestion, the vPASID requirement will be reported to userspace via the
per-device reporting interface.

Thanks
Kevin

2021-06-03 08:14:48

by Tian, Kevin

[permalink] [raw]
Subject: RE: [RFC] /dev/ioasid uAPI proposal

> From: David Gibson <[email protected]>
> Sent: Wednesday, June 2, 2021 2:15 PM
>
[...]

> >
> > /*
> > * Get information about an I/O address space
> > *
> > * Supported capabilities:
> > * - VFIO type1 map/unmap;
> > * - pgtable/pasid_table binding
> > * - hardware nesting vs. software nesting;
> > * - ...
> > *
> > * Related attributes:
> > * - supported page sizes, reserved IOVA ranges (DMA mapping);
>
> Can I request we represent this in terms of permitted IOVA ranges,
> rather than reserved IOVA ranges. This works better with the "window"
> model I have in mind for unifying the restrictions of the POWER IOMMU
> with Type1 like mapping.

Can you elaborate on how permitted ranges work better here?

> > #define IOASID_MAP_DMA _IO(IOASID_TYPE, IOASID_BASE + 6)
> > #define IOASID_UNMAP_DMA _IO(IOASID_TYPE, IOASID_BASE + 7)
>
> I'm assuming these would be expected to fail if a user managed
> pagetable has been bound?

Yes. Following Jason's suggestion, the format will be specified when
creating an IOASID, so incompatible commands will simply be rejected.
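
For example, something along these lines (purely a sketch; none of these
names exist in the current RFC):

	struct ioasid_alloc_data {
		__u32	flags;
		__u32	type;		/* kernel-managed vs. user pgtable */
		__u32	format;		/* vendor pgtable format, if user-managed */
		__u32	addr_width;
		__u32	parent_ioasid;	/* 0 if not nested */
	};

	/* IOASID_MAP_DMA on a user-managed IOASID, or IOASID_BIND_PGTABLE on
	 * a kernel-managed one, would then simply fail with -EINVAL. */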

> > #define IOASID_BIND_PGTABLE _IO(IOASID_TYPE,
> IOASID_BASE + 9)
> > #define IOASID_UNBIND_PGTABLE _IO(IOASID_TYPE, IOASID_BASE + 10)
>
> I'm assuming that UNBIND would return the IOASID to a kernel-managed
> pagetable?

There will be no UNBIND call in the next version. Unbinding will be
handled automatically when the IOASID is destroyed.

>
> For debugging and certain hypervisor edge cases it might be useful to
> have a call to allow userspace to lookup and specific IOVA in a guest
> managed pgtable.

Since all the mapping metadata comes from userspace, why would one
rely on the kernel to provide such a service? Or are you simply asking
for some debugfs node to dump the I/O page table for a given
IOASID?

>
>
> > /*
> > * Bind an user-managed PASID table to the IOMMU
> > *
> > * This is required for platforms which place PASID table in the GPA space.
> > * In this case the specified IOASID represents the per-RID PASID space.
> > *
> > * Alternatively this may be replaced by IOASID_BIND_PGTABLE plus a
> > * special flag to indicate the difference from normal I/O address spaces.
> > *
> > * The format info of the PASID table is reported in IOASID_GET_INFO.
> > *
> > * As explained in the design section, user-managed I/O page tables must
> > * be explicitly bound to the kernel even on these platforms. It allows
> > * the kernel to uniformly manage I/O address spaces cross all platforms.
> > * Otherwise, the iotlb invalidation and page faulting uAPI must be hacked
> > * to carry device routing information to indirectly mark the hidden I/O
> > * address spaces.
> > *
> > * Input parameters:
> > * - child_ioasid;
>
> Wouldn't this be the parent ioasid, rather than one of the potentially
> many child ioasids?

There is just one child IOASID (per device) for this PASID table.

The parent IOASID in this case carries the GPA mapping.

> >
> > /*
> > * Invalidate IOTLB for an user-managed I/O page table
> > *
> > * Unlike what's defined in include/uapi/linux/iommu.h, this command
> > * doesn't allow the user to specify cache type and likely support only
> > * two granularities (all, or a specified range) in the I/O address space.
> > *
> > * Physical IOMMU have three cache types (iotlb, dev_iotlb and pasid
> > * cache). If the IOASID represents an I/O address space, the invalidation
> > * always applies to the iotlb (and dev_iotlb if enabled). If the IOASID
> > * represents a vPASID space, then this command applies to the PASID
> > * cache.
> > *
> > * Similarly this command doesn't provide IOMMU-like granularity
> > * info (domain-wide, pasid-wide, range-based), since it's all about the
> > * I/O address space itself. The ioasid driver walks the attached
> > * routing information to match the IOMMU semantics under the
> > * hood.
> > *
> > * Input parameters:
> > * - child_ioasid;
>
> And couldn't this be be any ioasid, not just a child one, depending on
> whether you want PASID scope or RID scope invalidation?

Yes, any IOASID can accept the invalidation command. This was based on
the old assumption that bind+invalidate only applies to a child IOASID,
which will be fixed in the next version.

> > /*
> > * Attach a vfio device to the specified IOASID
> > *
> > * Multiple vfio devices can be attached to the same IOASID, and vice
> > * versa.
> > *
> > * User may optionally provide a "virtual PASID" to mark an I/O page
> > * table on this vfio device. Whether the virtual PASID is physically used
> > * or converted to another kernel-allocated PASID is a policy in vfio device
> > * driver.
> > *
> > * There is no need to specify ioasid_fd in this call due to the assumption
> > * of 1:1 connection between vfio device and the bound fd.
> > *
> > * Input parameter:
> > * - ioasid;
> > * - flag;
> > * - user_pasid (if specified);
>
> Wouldn't the PASID be communicated by whether you give a parent or
> child ioasid, rather than needing an extra value?

No. The ioasid is just a software handle.

> > struct ioasid_data {
> > // link to ioasid_ctx->ioasid_list
> > struct list_head next;
> >
> > // the IOASID number
> > u32 ioasid;
> >
> > // the handle to convey iommu operations
> > // hold the pgd (TBD until discussing iommu api)
> > struct iommu_domain *domain;
> >
> > // map metadata (vfio type1 semantics)
> > struct rb_node dma_list;
>
> Why do you need this? Can't you just store the kernel managed
> mappings in the host IO pgtable?

A simple reason is that to implement vfio type1 semantics we need to
make sure an unmap uses the same size as the original map. The metadata
allows verifying this assumption. Another reason is that when doing
software nesting, the page table linked into the iommu domain is the
shadow one. It's better to keep the original metadata so it can be used
to update the shadow when the other level (parent or child) changes its
mapping.
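
For illustration, a minimal sketch of such a metadata node, assuming an
rb-tree keyed by IOVA (field names are made up here, not from the RFC):

	struct ioasid_dma_entry {
		struct rb_node	node;	/* linked into ioasid_data->dma_list */
		dma_addr_t	iova;	/* start of the mapped range */
		size_t		size;	/* unmap must use exactly this size */
		unsigned long	vaddr;	/* user VA backing the range */
	};

	/* Unmap path: look up the entry covering 'iova' and fail if the
	 * requested size doesn't match. Software nesting path: walk these
	 * entries to regenerate the affected part of the shadow I/O page
	 * table when the other level changes its mapping. */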

> >
> > 5.3. IOASID nesting (software)
> > +++++++++++++++++++++++++
> >
> > Same usage scenario as 5.2, with software-based IOASID nesting
> > available. In this mode it is the kernel instead of user to create the
> > shadow mapping.
>
> In this case, I feel like the preregistration is redundant with the
> GPA level mapping. As long as the gIOVA mappings (which might be
> frequent) can piggyback on the accounting done for the GPA mapping we
> accomplish what we need from preregistration.

Yes, preregistration makes more sense when multiple IOASIDs are
used but not nested together.

> > 5.5. Guest SVA (vSVA)
> > ++++++++++++++++++
> >
> > After boot the guest further creates a GVA address space (gpasid1) on
> > dev1. Dev2 is not affected (still attached to giova_ioasid).
> >
> > As explained in section 4, the user should avoid exposing ENQCMD on both
> > pdev and mdev.
> >
> > The sequence applies to all device types (being pdev or mdev), except
> > one additional step to call KVM for ENQCMD-capable mdev:
> >
> > /* After boots */
> > /* Make GVA space nested on GPA space */
> > gva_ioasid = ioctl(ioasid_fd, IOASID_CREATE_NESTING,
> > gpa_ioasid);
>
> I'm not clear what gva_ioasid is representing. Is it representing a
> single vPASID's address space, or a whole bunch of vPASIDs address
> spaces?

a single vPASID's address space.

Thanks
Kevin

2021-06-03 11:52:12

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [RFC] /dev/ioasid uAPI proposal

On Thu, Jun 03, 2021 at 06:49:20AM +0000, Tian, Kevin wrote:
> > From: David Gibson
> > Sent: Thursday, June 3, 2021 1:09 PM
> [...]
> > > > In this way the SW mode is the same as a HW mode with an infinite
> > > > cache.
> > > >
> > > > The collaposed shadow page table is really just a cache.
> > > >
> > >
> > > OK. One additional thing is that we may need a 'caching_mode"
> > > thing reported by /dev/ioasid, indicating whether invalidation is
> > > required when changing non-present to present. For hardware
> > > nesting it's not reported as the hardware IOMMU will walk the
> > > guest page table in cases of iotlb miss. For software nesting
> > > caching_mode is reported so the user must issue invalidation
> > > upon any change in guest page table so the kernel can update
> > > the shadow page table timely.
> >
> > For the fist cut, I'd have the API assume that invalidates are
> > *always* required. Some bypass to avoid them in cases where they're
> > not needed can be an additional extension.
> >
>
> Isn't the typical TLB semantic that non-present entries are not
> cached, thus invalidation is not required when making a non-present
> entry present? This is true for both the CPU TLB and the IOMMU TLB.
> In reality I feel there are more usages built on hardware nesting than
> software nesting, thus making the default follow hardware TLB behavior
> makes more sense...

From a modelling perspective it makes sense to have the most general
case be the default; if an implementation can elide certain steps then
describing those as additional behaviors on top of the universal
baseline is cleaner.

I'm surprised to hear your remark about non-present entries though;
how does the vIOMMU emulation work if there are no hypervisor
invalidation traps for non-present/present transitions?

Jason

2021-06-03 11:54:05

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [RFC] /dev/ioasid uAPI proposal

On Thu, Jun 03, 2021 at 03:13:44PM +1000, David Gibson wrote:

> > We can still consider it a single "address space" from the IOMMU
> > perspective. What has happened is that the address table is not just a
> > 64 bit IOVA, but an extended ~80 bit IOVA formed by "PASID, IOVA".
>
> True. This does complexify how we represent what IOVA ranges are
> valid, though. I'll bet you most implementations don't actually
> implement a full 64-bit IOVA, which means we effectively have a large
> number of windows from (0..max IOVA) for each valid pasid. This adds
> another reason I don't think my concept of IOVA windows is just a
> POWER-specific thing.

Yes

Things rapidly get into weird hardware-specific stuff though; the
request will be for things like:
"ARM PASID&IO page table format from SMMU IP block vXX"

Which may have a bunch of (possibly very weird!) format specific data
to describe and/or configure it.

The uAPI needs to be suitably general here. :(

> > If we are already going in the direction of having the IOASID specify
> > the page table format and other details, specifying that the page
> > tabnle format is the 80 bit "PASID, IOVA" format is a fairly small
> > step.
>
> Well, rather I think userspace needs to request what page table format
> it wants and the kernel tells it whether it can oblige or not.

Yes, this is what I meant.

Jason

2021-06-03 12:15:47

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [RFC] /dev/ioasid uAPI proposal

On Thu, Jun 03, 2021 at 03:45:09PM +1000, David Gibson wrote:
> On Wed, Jun 02, 2021 at 01:58:38PM -0300, Jason Gunthorpe wrote:
> > On Wed, Jun 02, 2021 at 04:48:35PM +1000, David Gibson wrote:
> > > > > /* Bind guest I/O page table */
> > > > > bind_data = {
> > > > > .ioasid = gva_ioasid;
> > > > > .addr = gva_pgtable1;
> > > > > // and format information
> > > > > };
> > > > > ioctl(ioasid_fd, IOASID_BIND_PGTABLE, &bind_data);
> > > >
> > > > Again I do wonder if this should just be part of alloc_ioasid. Is
> > > > there any reason to split these things? The only advantage to the
> > > > split is the device is known, but the device shouldn't impact
> > > > anything..
> > >
> > > I'm pretty sure the device(s) could matter, although they probably
> > > won't usually.
> >
> > It is a bit subtle, but the /dev/iommu fd itself is connected to the
> > devices first. This prevents wildly incompatible devices from being
> > joined together, and allows some "get info" to report the capability
> > union of all devices if we want to do that.
>
> Right.. but I've not been convinced that having a /dev/iommu fd
> instance be the boundary for these types of things actually makes
> sense. For example if we were doing the preregistration thing
> (whether by child ASes or otherwise) then that still makes sense
> across wildly different devices, but we couldn't share that layer if
> we have to open different instances for each of them.

It is something that still seems up in the air.. What seems clear for
/dev/iommu is that it
- holds a bunch of IOASID's organized into a tree
- holds a bunch of connected devices
- holds a pinned memory cache

One thing it must do is enforce IOMMU group security. A device cannot
be attached to an IOASID unless all devices in its IOMMU group are
part of the same /dev/iommu FD.

The big open question is what parameters govern allowing devices to
connect to the /dev/iommu:
- all devices can connect and we model the differences inside the API
somehow.
- Only sufficiently "similar" devices can be connected
- The FD's capability is the minimum of all the connected devices

There are some practical problems here, when an IOASID is created the
kernel does need to allocate a page table for it, and that has to be
in some definite format.

It may be that we had a false start thinking the FD container should
be limited. Perhaps creating an IOASID should pass in a list
of the "device labels" that the IOASID will be used with and that can
guide the kernel what to do?
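
Something like this, just to make the idea concrete (names are
hypothetical):

	/* Tell the kernel at creation time which already-bound devices the
	 * IOASID will be used with, so it can pick a compatible page table
	 * format or fail early if no common format exists.
	 */
	struct ioasid_alloc_labeled {
		__u32	flags;
		__u32	nr_devices;
		__u64	device_labels;	/* userptr to __u32[nr_devices],
					 * labels assigned at device bind */
	};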

> Right, but at this stage I'm just not seeing a really clear (across
> platforms and device types) boundary for what things have to be per
> IOASID container and what have to be per IOASID, so I'm just not sure
> the /dev/iommu instance grouping makes any sense.

I would push as much stuff as possible to be per-IOASID..

> > I don't know if that small advantage is worth the extra complexity
> > though.
> >
> > > But it would certainly be possible for a system to have two
> > > different host bridges with two different IOMMUs with different
> > > pagetable formats. Until you know which devices (and therefore
> > > which host bridge) you're talking about, you don't know what formats
> > > of pagetable to accept. And if you have devices from *both* bridges
> > > you can't bind a page table at all - you could theoretically support
> > > a kernel managed pagetable by mirroring each MAP and UNMAP to tables
> > > in both formats, but it would be pretty reasonable not to support
> > > that.
> >
> > The basic process for a user space owned pgtable mode would be:
> >
> > 1) qemu has to figure out what format of pgtable to use
> >
> > Presumably it uses query functions using the device label.
>
> No... in the qemu case it would always select the page table format
> that it needs to present to the guest. That's part of the
> guest-visible platform that's selected by qemu's configuration.

I should have said "vfio user" here because apps like DPDK might use
this path

> > 4) For the next device qemu would have to figure out if it can re-use
> > an existing IOASID based on the required properties.
>
> Nope. Again, what devices share an IO address space is a guest
> visible part of the platform. If the host kernel can't supply that,
> then qemu must not start (or fail the hotplug if the new device is
> being hotplugged).

qemu can always emulate. If the config requires two devices that cannot
share an IOASID because the local platform is wonky, then qemu needs to
shadow and duplicate the IO page table from the guest into two IOASID
objects to make it work. This is a SW emulation option.

> For this reason, amongst some others, I think when selecting a kernel
> managed pagetable we need to also have userspace explicitly request
> which IOVA ranges are mappable, and what (minimum) page size it
> needs.

It does make sense
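
For instance (a sketch only, not something in the RFC), the creation call
for a kernel-managed IOASID could take:

	struct ioasid_iova_window {
		__u64	start;
		__u64	last;
	};

	struct ioasid_kernel_pgtable_req {
		__u64	min_pgsize;	/* smallest page size userspace needs */
		__u32	nr_windows;	/* IOVA ranges that must be mappable */
		__u32	__reserved;
		struct ioasid_iova_window windows[];
	};

and the kernel would accept or reject it up front instead of failing
individual maps later.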

Jason

2021-06-03 12:31:36

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [RFC] /dev/ioasid uAPI proposal

On Thu, Jun 03, 2021 at 03:23:17PM +1000, David Gibson wrote:
> On Wed, Jun 02, 2021 at 01:37:53PM -0300, Jason Gunthorpe wrote:
> > On Wed, Jun 02, 2021 at 04:57:52PM +1000, David Gibson wrote:
> >
> > > I don't think presence or absence of a group fd makes a lot of
> > > difference to this design. Having a group fd just means we attach
> > > groups to the ioasid instead of individual devices, and we no longer
> > > need the bookkeeping of "partial" devices.
> >
> > Oh, I think we really don't want to attach the group to an ioasid, or
> > at least not as a first-class idea.
> >
> > The fundamental problem that got us here is we now live in a world
> > where there are many ways to attach a device to an IOASID:
>
> I'm not seeing that that's necessarily a problem.
>
> > - A RID binding
> > - A RID,PASID binding
> > - A RID,PASID binding for ENQCMD
>
> I have to admit I haven't fully grasped the differences between these
> modes. I'm hoping we can consolidate at least some of them into the
> same sort of binding onto different IOASIDs (which may be linked in
> parent/child relationships).

What I would like is that the /dev/iommu side managing the IOASID
doesn't really care much, but the device driver has to tell
drivers/iommu what it is going to do when it attaches.

It makes sense: in PCI terms, only the driver knows what TLPs the
device will generate. The IOMMU needs to know what TLPs it will
receive in order to configure itself properly.

PASID or not is a major device-specific variation, as is ENQCMD, etc.

Having the device be explicit when it tells the IOMMU what it is going
to be sending is a major plus to me. I actually don't want to see this
part of the interface be made less strong.

> > The selection of which mode to use is based on the specific
> > driver/device operation. Ie the thing that implements the 'struct
> > vfio_device' is the thing that has to select the binding mode.
>
> I thought userspace selected the binding mode - although not all modes
> will be possible for all devices.

/dev/iommu is concerned with setting up the IOAS and filling the IO
page tables with information

The driver behind "struct vfio_device" is responsible to "route" its
HW into that IOAS.

They are two halves of the problem: one is only the IO page table, and
the other is the connection of a PCI TLP to a specific IO page table.

Only the driver knows what format of TLPs the device will generate so
only the driver can specify the "route"

> > eg if two PCI devices are in a group then it is perfectly fine that
> > one device uses RID binding and the other device uses RID,PASID
> > binding.
>
> Uhhhh... I don't see how that can be. They could well be in the same
> group because their RIDs cannot be distinguished from each other.

Inability to match the RID is rare; certainly I would expect any IOMMU
HW that can do PCIe PASID matching to also do RID matching. With
such HW the above is perfectly fine - the group may not be secure
between members (eg !ACS), but the TLPs still carry valid RIDs and
PASIDs and the IOMMU can still discriminate.

I think you are talking about really old IOMMU's that could only
isolate based on ingress port or something.. I suppose modern PCIe has
some cases like this in the NTB stuff too.

Oh, I hadn't spent time thinking about any of those.. It is messy but
it can still be forced to work, I guess. A device centric model means
all the devices using the same routing ID have to be connected to the
same IOASID by userspace. So some of the connections will be NOPs.

Jason

2021-06-03 12:36:40

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [RFC] /dev/ioasid uAPI proposal

On Wed, Jun 02, 2021 at 08:50:54PM -0600, Alex Williamson wrote:
> On Wed, 2 Jun 2021 19:45:36 -0300
> Jason Gunthorpe <[email protected]> wrote:
>
> > On Wed, Jun 02, 2021 at 02:37:34PM -0600, Alex Williamson wrote:
> >
> > > Right. I don't follow where you're jumping to relaying DMA_PTE_SNP
> > > from the guest page table... what page table?
> >
> > I see my confusion now, the phrasing in your earlier remark led me
> > think this was about allowing the no-snoop performance enhancement in
> > some restricted way.
> >
> > It is really about blocking no-snoop 100% of the time and then
> > disabling the dangerous wbinvd when the block is successful.
> >
> > Didn't closely read the kvm code :\
> >
> > If it was about allowing the optimization then I'd expect the guest to
> > enable no-snoopable regions via it's vIOMMU and realize them to the
> > hypervisor and plumb the whole thing through. Hence my remark about
> > the guest page tables..
> >
> > So really the test is just 'were we able to block it' ?
>
> Yup. Do we really still consider that there's some performance benefit
> to be had by enabling a device to use no-snoop? This seems largely a
> legacy thing.

I've had some no-snoop discussions recently.. The issue didn't
vanish; it is still expensive going through all that cache hardware.

> > But Ok, back the /dev/ioasid. This answers a few lingering questions I
> > had..
> >
> > 1) Mixing IOMMU_CAP_CACHE_COHERENCY and !IOMMU_CAP_CACHE_COHERENCY
> > domains.
> >
> > This doesn't actually matter. If you mix them together then kvm
> > will turn on wbinvd anyhow, so we don't need to use the DMA_PTE_SNP
> > anywhere in this VM.
> >
> > This if two IOMMU's are joined together into a single /dev/ioasid
> > then we can just make them both pretend to be
> > !IOMMU_CAP_CACHE_COHERENCY and both not set IOMMU_CACHE.
>
> Yes and no. Yes, if any domain is !IOMMU_CAP_CACHE_COHERENCY then we
> need to emulate wbinvd, but no we'll use IOMMU_CACHE any time it's
> available based on the per domain support available. That gives us the
> most consistent behavior, ie. we don't have VMs emulating wbinvd
> because they used to have a device attached where the domain required
> it and we can't atomically remap with new flags to perform the same as
> a VM that never had that device attached in the first place.

I think we are saying the same thing..

> > 2) How to fit this part of kvm in some new /dev/ioasid world
> >
> > What we want to do here is iterate over every ioasid associated
> > with the group fd that is passed into kvm.
>
> Yeah, we need some better names, binding a device to an ioasid (fd) but
> then attaching a device to an allocated ioasid (non-fd)... I assume
> you're talking about the latter ioasid.

Fingers crossed on RFCv2.. Here I mean the IOASID object inside the
/dev/iommu FD. The vfio_device would have some kref handle to the
in-kernel representation of it. So we can interact with it..

> > Or perhaps more directly: an op attaching the vfio_device to the
> > kvm and having some simple helper
> > '(un)register ioasid with kvm (kvm, ioasid)'
> > that the vfio_device driver can call that just sorts this out.
>
> We could almost eliminate the device notion altogether here, use an
> ioasidfd_for_each_ioasid() but we really want a way to trigger on each
> change to the composition of the device set for the ioasid, which is
> why we currently do it on addition or removal of a group, where the
> group has a consistent set of IOMMU properties.

That is another quite good option, just forget about trying to be
highly specific and feed in the /dev/ioasid FD and have kvm ask "does
anything in here not enforce snoop?"

With something appropriate to track/block changing that answer.

It doesn't solve the problem to connect kvm to AP and kvmgt though

Jason

2021-06-03 12:43:25

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [RFC] /dev/ioasid uAPI proposal

On Thu, Jun 03, 2021 at 03:22:27AM +0000, Tian, Kevin wrote:
> > From: Alex Williamson <[email protected]>
> > Sent: Thursday, June 3, 2021 10:51 AM
> >
> > On Wed, 2 Jun 2021 19:45:36 -0300
> > Jason Gunthorpe <[email protected]> wrote:
> >
> > > On Wed, Jun 02, 2021 at 02:37:34PM -0600, Alex Williamson wrote:
> > >
> > > > Right. I don't follow where you're jumping to relaying DMA_PTE_SNP
> > > > from the guest page table... what page table?
> > >
> > > I see my confusion now, the phrasing in your earlier remark led me
> > > think this was about allowing the no-snoop performance enhancement in
> > > some restricted way.
> > >
> > > It is really about blocking no-snoop 100% of the time and then
> > > disabling the dangerous wbinvd when the block is successful.
> > >
> > > Didn't closely read the kvm code :\
> > >
> > > If it was about allowing the optimization then I'd expect the guest to
> > > enable no-snoopable regions via it's vIOMMU and realize them to the
> > > hypervisor and plumb the whole thing through. Hence my remark about
> > > the guest page tables..
> > >
> > > So really the test is just 'were we able to block it' ?
> >
> > Yup. Do we really still consider that there's some performance benefit
> > to be had by enabling a device to use no-snoop? This seems largely a
> > legacy thing.
>
> Yes, there is indeed a performance benefit for a device to use no-snoop,
> e.g. 8K display and some image processing paths, etc. The problem is
> that the IOMMU for such devices is typically a different one from the
> default IOMMU for most devices. This special IOMMU may not have
> the ability to enforce snoop on no-snoop PCI traffic, so this fact
> must be understood by KVM to do proper mtrr/pat/wbinvd virtualization
> for such devices to work correctly.

Or stated another way:

We in Linux don't have a way to control if the VFIO IO page table will
be snoop or no snoop from userspace so Intel has forced the platform's
IOMMU path for the integrated GPU to be unable to enforce snoop, thus
"solving" the problem.

I don't think that is sustainable in the overall ecosystem though.

'qemu --allow-no-snoop' makes more sense to me

> When discussing I/O page fault support in another thread, the consensus
> is that a device handle will be registered (by the user) or allocated
> (returned to the user) in /dev/ioasid when binding the device to the ioasid
> fd. From this angle we can register {ioasid_fd, device_handle} with KVM and
> then call something like ioasidfd_device_is_coherent() to get the property.
> Anyway the coherency is a per-device property which is not changed
> by how many I/O page tables are attached to it.

It is not device specific, it is driver specific

As I said before, the question is if the IOASID itself can enforce
snoop, or not. AND if the device will issue no-snoop or not.

Devices that are hard wired to never issue no-snoop are safe even with
an IOASID that cannot enforce snoop. AFAIK really only GPUs use this
feature. Eg I would be comfortable to say mlx5 never uses the no-snoop
TLP flag.

Only the vfio_driver could know this.

Jason

2021-06-03 12:49:19

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [RFC] /dev/ioasid uAPI proposal

On Thu, Jun 03, 2021 at 04:26:08PM +1000, David Gibson wrote:

> > There are global properties in the /dev/iommu FD, like what devices
> > are part of it, that are important for group security operations. This
> > becomes confused if it is split to many FDs.
>
> I'm still not seeing those. I'm really not seeing any well-defined
> meaning to devices being attached to the fd, but not to a particular
> IOAS.

Kevin, can you add a section to the RFC on how group security would have
to work? This is the idea that you can't attach a device to an IOASID
unless all devices in its IOMMU group are joined to the /dev/iommu FD.

The basic statement is that userspace must present the entire group
membership to /dev/iommu to prove that it has the security right to
manipulate their DMA translation.

It is the device centric analog to what the group FD is doing for
security.

Jason

2021-06-03 12:51:17

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [RFC] /dev/ioasid uAPI proposal

On Thu, Jun 03, 2021 at 07:17:23AM +0000, Tian, Kevin wrote:
> > From: David Gibson <[email protected]>
> > Sent: Wednesday, June 2, 2021 2:15 PM
> >
> [...]
> > > An I/O address space takes effect in the IOMMU only after it is attached
> > > to a device. The device in the /dev/ioasid context always refers to a
> > > physical one or 'pdev' (PF or VF).
> >
> > What you mean by "physical" device here isn't really clear - VFs
> > aren't really physical devices, and the PF/VF terminology also doesn't
> > extend to non-PCI devices (which I think we want to consider for the
> > API, even if we're not implementing it any time soon).
>
> Yes, it's not very clear, and it's more in a PCI context to simplify the
> description. A "physical" one here means a PCI endpoint function
> which has a unique RID. It's more to differentiate it from the later mdev/
> subdevice case which uses both RID+PASID. Naming is always a hard
> exercise for me... Possibly I'll just use device vs. subdevice in future
> versions.

Using PCI words:

A "physical" device is RID matching.

A "subdevice" is (RID, PASID) matching.

A "SW mdev" is performing DMA isolation in a device specific way - all
DMA's from the device are routed to the hypervisor's IOMMU page
tables.

Jason

2021-06-03 12:58:08

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [RFC] /dev/ioasid uAPI proposal

On Thu, Jun 03, 2021 at 02:50:11PM +0800, Lu Baolu wrote:
> Hi David,
>
> On 6/3/21 1:54 PM, David Gibson wrote:
> > On Tue, Jun 01, 2021 at 07:09:21PM +0800, Lu Baolu wrote:
> > > Hi Jason,
> > >
> > > On 2021/5/29 7:36, Jason Gunthorpe wrote:
> > > > > /*
> > > > > * Bind an user-managed I/O page table with the IOMMU
> > > > > *
> > > > > * Because user page table is untrusted, IOASID nesting must be enabled
> > > > > * for this ioasid so the kernel can enforce its DMA isolation policy
> > > > > * through the parent ioasid.
> > > > > *
> > > > > * Pgtable binding protocol is different from DMA mapping. The latter
> > > > > * has the I/O page table constructed by the kernel and updated
> > > > > * according to user MAP/UNMAP commands. With pgtable binding the
> > > > > * whole page table is created and updated by userspace, thus different
> > > > > * set of commands are required (bind, iotlb invalidation, page fault, etc.).
> > > > > *
> > > > > * Because the page table is directly walked by the IOMMU, the user
> > > > > * must use a format compatible to the underlying hardware. It can
> > > > > * check the format information through IOASID_GET_INFO.
> > > > > *
> > > > > * The page table is bound to the IOMMU according to the routing
> > > > > * information of each attached device under the specified IOASID. The
> > > > > * routing information (RID and optional PASID) is registered when a
> > > > > * device is attached to this IOASID through VFIO uAPI.
> > > > > *
> > > > > * Input parameters:
> > > > > * - child_ioasid;
> > > > > * - address of the user page table;
> > > > > * - formats (vendor, address_width, etc.);
> > > > > *
> > > > > * Return: 0 on success, -errno on failure.
> > > > > */
> > > > > #define IOASID_BIND_PGTABLE _IO(IOASID_TYPE, IOASID_BASE + 9)
> > > > > #define IOASID_UNBIND_PGTABLE _IO(IOASID_TYPE, IOASID_BASE + 10)
> > > > Also feels backwards, why wouldn't we specify this, and the required
> > > > page table format, during alloc time?
> > > >
> > > Thinking of the required page table format, perhaps we should shed more
> > > light on the page table of an IOASID. So far, an IOASID might represent
> > > one of the following page tables (might be more):
> > >
> > > 1) an IOMMU format page table (a.k.a. iommu_domain)
> > > 2) a user application CPU page table (SVA for example)
> > > 3) a KVM EPT (future option)
> > > 4) a VM guest managed page table (nesting mode)
> > >
> > > This version only covers 1) and 4). Do you think we need to support 2),
> > Isn't (2) the equivalent of using the host-managed pagetable
> > then doing a giant MAP of all your user address space into it? But
> > maybe we should identify that case explicitly in case the host can
> > optimize it.
>
> Conceptually, yes. Current SVA implementation just reuses the
> application's cpu page table w/o map/unmap operations.

The key distinction is faulting, and this goes back to the importance
of having the device tell drivers/iommu what TLPs it is generating.

A #1 table with a map of 'all user space memory' does not have IO DMA
faults. The pages should be pinned and this object should be
compatible with any DMA user.

A #2/#3 table allows page faulting, and it can only be used with a
device that supports the page faulting protocol. For instance a PCI
device needs to say it is running in ATS mode and supports PRI. This
is where you might fit in CAPI generically.

As with the other case in my other email, the kind of TLPs the device
generates is only known by the driver when it connects to the IOASID
and must be communicated to the IOMMU so it knows how to set things
up. ATS/PRI w/ faulting is a very different setup than simple RID
matching.
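
In code-ish terms the attach could carry something like this (values
invented here for illustration) so drivers/iommu knows what to set up:

	enum ioasid_attach_mode {
		IOASID_ATTACH_RID,		/* plain RID-matched DMA, no faults */
		IOASID_ATTACH_RID_PASID,	/* RID + PASID matched DMA */
		IOASID_ATTACH_PASID_FAULTABLE,	/* ATS/PRI capable; IO page
						 * faults may be generated */
	};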

Jason

2021-06-03 13:07:06

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [RFC] /dev/ioasid uAPI proposal

On Thu, Jun 03, 2021 at 06:39:30AM +0000, Tian, Kevin wrote:
> > > Two helper functions are provided to support VFIO_ATTACH_IOASID:
> > >
> > > struct attach_info {
> > > u32 ioasid;
> > > // If valid, the PASID to be used physically
> > > u32 pasid;
> > > };
> > > int ioasid_device_attach(struct ioasid_dev *dev,
> > > struct attach_info info);
> > > int ioasid_device_detach(struct ioasid_dev *dev, u32 ioasid);
> >
> > Honestly, I still prefer this to be highly explicit as this is where
> > all device driver authors get involved:
> >
> > ioasid_pci_device_attach(struct pci_device *pdev, struct ioasid_dev *dev,
> > u32 ioasid);
> > ioasid_pci_device_pasid_attach(struct pci_device *pdev, u32 *physical_pasid,
> > struct ioasid_dev *dev, u32 ioasid);
>
> Then is it better to name it pci_device_attach_ioasid, since the 1st
> parameter is a struct pci_device?

No, the leading tag indicates the API's primary subsystem, in this case
it is iommu (and if you prefer, list the iommu-related arguments first)

> By keeping physical_pasid as a pointer, you want to remove the last helper
> function (ioasid_get_global_pasid) so the global pasid is returned along
> with the attach function?

It is just a thought.. It allows the caller to both specify a fixed
PASID and request an allocation
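
i.e., roughly, against the hypothetical helper signature above
(IOASID_PASID_ANY and use_pasid() are placeholders):

	u32 pasid = IOASID_PASID_ANY;	/* ask the kernel to allocate one */
	/* or: u32 pasid = fixed_pasid;	   request this exact physical PASID */

	int ret = ioasid_pci_device_pasid_attach(pdev, &pasid, dev, ioasid);
	if (!ret)
		use_pasid(pasid);	/* 'pasid' now holds what was programmed */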

I still don't have a clear idea how all this PASID complexity should
work, sorry.

> > > The actual policy depends on pdev vs. mdev, and whether ENQCMD is
> > > supported. There are three possible scenarios:
> > >
> > > (Note: /dev/ioasid uAPI is not affected by underlying PASID virtualization
> > > policies.)
> >
> > This has become unclear. I think this should start by identifying the
> > 6 main type of devices and how they can use pPASID/vPASID:
> >
> > 0) Device is a RID and cannot issue PASID
> > 1) Device is a mdev and cannot issue PASID
> > 2) Device is a mdev and programs a single fixed PASID during bind,
> > does not accept PASID from the guest
>
> There is no vPASID per se in the above 3 types, so this section only
> focuses on the latter 3 types. But I can include them in the next version
> if it makes things clearer.

I think it helps

> >
> > 3) Device accepts any PASIDs from the guest. No
> > vPASID/pPASID translation is possible. (classic vfio_pci)
> > 4) Device accepts any PASID from the guest and has an
> > internal vPASID/pPASID translation (enhanced vfio_pci)
>
> what is enhanced vfio_pci? In my writing this is for mdev
> which doesn't support ENQCMD

This is a vfio_pci that mediates some element of the device interface
to communicate the vPASID/pPASID table to the device, using Max's
series for vfio_pci drivers to inject itself into VFIO.

For instance a device might send a message through the PF that the VF
has a certain vPASID/pPASID translation table. This would be useful
for devices that cannot use ENQCMD but still want to support migration
and thus need vPASID.

> for 0-2 the device will report no PASID support. Although this may
> duplicate other information (e.g. the PCI PASID cap), it provides a
> vendor-agnostic way of reporting details around IOASID.

We have to consider mdevs too here, so PCI caps are not general enough

> for 3-5 the device will report PASID support. In these cases the user is
> expected to always provide a vPASID.
>
> for 5, in addition, the device will report a requirement for CPU PASID
> translation. For such a device the user should talk to KVM to set up the
> PASID mapping. This way the user doesn't need to know whether a device is
> a pdev or mdev; it just follows what the device capability reports.

Something like that. Needs careful documentation
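
Maybe as simple as two per-device capability flags (names made up here):

	#define IOASID_DEV_CAP_PASID		(1 << 0) /* types 3-5: user always
							  * supplies a vPASID */
	#define IOASID_DEV_CAP_PASID_CPU_XLATE	(1 << 1) /* type 5: user must also
							  * set up the KVM vPASID
							  * to pPASID mapping */

	/* Types 0-2 report neither flag and the user never deals with PASIDs. */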

Jason

2021-06-03 13:14:05

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [RFC] /dev/ioasid uAPI proposal

On Thu, Jun 03, 2021 at 10:52:51AM +0800, Jason Wang wrote:

> Basically, we don't want to bother with a pseudo KVM device like what VFIO
> did. So for simplicity, we rule out IOMMUs that can't enforce coherency
> in vhost-vDPA if the parent purely depends on the platform IOMMU:

VDPA HW cannot issue no-snoop TLPs in the first place.

virtio does not define a protocol to discover such functionality,
nor do any virtio drivers implement the required platform-specific
cache flushing to make no-snoop TLPs work.

It is fundamentally part of the virtio HW PCI API that a device vendor
cannot alter.

Basically since we already know that the virtio kernel drivers do not
call the cache flush instruction we don't need the weird KVM logic to
turn it on at all.

Enforcing no-snoop at the IOMMU here is redundant/confusing.

Jason

2021-06-03 18:19:58

by Jacob Pan

[permalink] [raw]
Subject: Re: [RFC] /dev/ioasid uAPI proposal

Hi Shenming,

On Wed, 2 Jun 2021 12:50:26 +0800, Shenming Lu <[email protected]>
wrote:

> On 2021/6/2 1:33, Jason Gunthorpe wrote:
> > On Tue, Jun 01, 2021 at 08:30:35PM +0800, Lu Baolu wrote:
> >
> >> The drivers register per page table fault handlers to /dev/ioasid which
> >> will then register itself to iommu core to listen and route the per-
> >> device I/O page faults.
> >
> > I'm still confused why drivers need fault handlers at all?
>
> Essentially it is the userspace that needs the fault handlers.
> One case is to deliver the faults to the vIOMMU, and another
> case is to enable IOPF on the GPA address space for on-demand
> paging; it seems that both could be specified in/through the
> IOASID_ALLOC ioctl?
>
I would think IOASID_BIND_PGTABLE is where the fault handler should be
registered. There wouldn't be any IO page fault without the binding anyway.

I also don't understand why device drivers should register the fault
handler; the fault is detected by the pIOMMU and injected into the vIOMMU.
So I think it should be the IOASID itself that registers the handler.

> Thanks,
> Shenming
>


Thanks,

Jacob

2021-06-03 20:06:21

by Alex Williamson

[permalink] [raw]
Subject: Re: [RFC] /dev/ioasid uAPI proposal

On Thu, 3 Jun 2021 09:34:01 -0300
Jason Gunthorpe <[email protected]> wrote:

> On Wed, Jun 02, 2021 at 08:50:54PM -0600, Alex Williamson wrote:
> > On Wed, 2 Jun 2021 19:45:36 -0300
> > Jason Gunthorpe <[email protected]> wrote:
> >
> > > On Wed, Jun 02, 2021 at 02:37:34PM -0600, Alex Williamson wrote:
> > >
> > > > Right. I don't follow where you're jumping to relaying DMA_PTE_SNP
> > > > from the guest page table... what page table?
> > >
> > > I see my confusion now, the phrasing in your earlier remark led me
> > > think this was about allowing the no-snoop performance enhancement in
> > > some restricted way.
> > >
> > > It is really about blocking no-snoop 100% of the time and then
> > > disabling the dangerous wbinvd when the block is successful.
> > >
> > > Didn't closely read the kvm code :\
> > >
> > > If it was about allowing the optimization then I'd expect the guest to
> > > enable no-snoopable regions via it's vIOMMU and realize them to the
> > > hypervisor and plumb the whole thing through. Hence my remark about
> > > the guest page tables..
> > >
> > > So really the test is just 'were we able to block it' ?
> >
> > Yup. Do we really still consider that there's some performance benefit
> > to be had by enabling a device to use no-snoop? This seems largely a
> > legacy thing.
>
> I've had some no-snoop discussions recently.. The issue didn't
> vanish; it is still expensive going through all that cache hardware.
>
> > > But Ok, back the /dev/ioasid. This answers a few lingering questions I
> > > had..
> > >
> > > 1) Mixing IOMMU_CAP_CACHE_COHERENCY and !IOMMU_CAP_CACHE_COHERENCY
> > > domains.
> > >
> > > This doesn't actually matter. If you mix them together then kvm
> > > will turn on wbinvd anyhow, so we don't need to use the DMA_PTE_SNP
> > > anywhere in this VM.
> > >
> > > This if two IOMMU's are joined together into a single /dev/ioasid
> > > then we can just make them both pretend to be
> > > !IOMMU_CAP_CACHE_COHERENCY and both not set IOMMU_CACHE.
> >
> > Yes and no. Yes, if any domain is !IOMMU_CAP_CACHE_COHERENCY then we
> > need to emulate wbinvd, but no we'll use IOMMU_CACHE any time it's
> > available based on the per domain support available. That gives us the
> > most consistent behavior, ie. we don't have VMs emulating wbinvd
> > because they used to have a device attached where the domain required
> > it and we can't atomically remap with new flags to perform the same as
> > a VM that never had that device attached in the first place.
>
> I think we are saying the same thing..

Hrm? I think I'm saying the opposite of your "both not set
IOMMU_CACHE". IOMMU_CACHE is the mapping flag that enables
DMA_PTE_SNP. Maybe you're using IOMMU_CACHE as the state reported to
KVM?

> > > 2) How to fit this part of kvm in some new /dev/ioasid world
> > >
> > > What we want to do here is iterate over every ioasid associated
> > > with the group fd that is passed into kvm.
> >
> > Yeah, we need some better names, binding a device to an ioasid (fd) but
> > then attaching a device to an allocated ioasid (non-fd)... I assume
> > you're talking about the latter ioasid.
>
> Fingers crossed on RFCv2.. Here I mean the IOASID object inside the
> /dev/iommu FD. The vfio_device would have some kref handle to the
> in-kernel representation of it. So we can interact with it..
>
> > > Or perhaps more directly: an op attaching the vfio_device to the
> > > kvm and having some simple helper
> > > '(un)register ioasid with kvm (kvm, ioasid)'
> > > that the vfio_device driver can call that just sorts this out.
> >
> > We could almost eliminate the device notion altogether here, use an
> > ioasidfd_for_each_ioasid() but we really want a way to trigger on each
> > change to the composition of the device set for the ioasid, which is
> > why we currently do it on addition or removal of a group, where the
> > group has a consistent set of IOMMU properties.
>
> That is another quite good option, just forget about trying to be
> highly specific and feed in the /dev/ioasid FD and have kvm ask "does
> anything in here not enforce snoop?"
>
> With something appropriate to track/block changing that answer.
>
> It doesn't solve the problem to connect kvm to AP and kvmgt though

It does not, we'll probably need a vfio ioctl to gratuitously announce
the KVM fd to each device. I think some devices might currently fail
their open callback if that linkage isn't already available though, so
it's not clear when that should happen, ie. it can't currently be a
VFIO_DEVICE ioctl as getting the device fd requires an open, but this
proposal requires some availability of the vfio device fd without any
setup, so presumably that won't yet call the driver open callback.
Maybe that's part of the attach phase now... I'm not sure, it's not
clear when the vfio device uAPI starts being available in the process
of setting up the ioasid. Thanks,

Alex

2021-06-03 20:11:55

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [RFC] /dev/ioasid uAPI proposal

On Thu, Jun 03, 2021 at 02:01:46PM -0600, Alex Williamson wrote:

> > > > 1) Mixing IOMMU_CAP_CACHE_COHERENCY and !IOMMU_CAP_CACHE_COHERENCY
> > > > domains.
> > > >
> > > > This doesn't actually matter. If you mix them together then kvm
> > > > will turn on wbinvd anyhow, so we don't need to use the DMA_PTE_SNP
> > > > anywhere in this VM.
> > > >
> > > > This if two IOMMU's are joined together into a single /dev/ioasid
> > > > then we can just make them both pretend to be
> > > > !IOMMU_CAP_CACHE_COHERENCY and both not set IOMMU_CACHE.
> > >
> > > Yes and no. Yes, if any domain is !IOMMU_CAP_CACHE_COHERENCY then we
> > > need to emulate wbinvd, but no we'll use IOMMU_CACHE any time it's
> > > available based on the per domain support available. That gives us the
> > > most consistent behavior, ie. we don't have VMs emulating wbinvd
> > > because they used to have a device attached where the domain required
> > > it and we can't atomically remap with new flags to perform the same as
> > > a VM that never had that device attached in the first place.
> >
> > I think we are saying the same thing..
>
> Hrm? I think I'm saying the opposite of your "both not set
> IOMMU_CACHE". IOMMU_CACHE is the mapping flag that enables
> DMA_PTE_SNP. Maybe you're using IOMMU_CACHE as the state reported to
> KVM?

I'm saying if we enable wbinvd in the guest then no IOASIDs used by
that guest need to set DMA_PTE_SNP. If we disable wbinvd in the guest
then all IOASIDs must enforce DMA_PTE_SNP (or we otherwise guarantee
no-snoop is not possible).

This is not what VFIO does today, but it is a reasonable choice.

Based on that observation we can say that as soon as the user wants to use
an IOMMU that does not support DMA_PTE_SNP in the guest, we can still
share the IO page table with IOMMUs that do support DMA_PTE_SNP.

> > It doesn't solve the problem to connect kvm to AP and kvmgt though
>
> It does not, we'll probably need a vfio ioctl to gratuitously announce
> the KVM fd to each device. I think some devices might currently fail
> their open callback if that linkage isn't already available though, so
> it's not clear when that should happen, ie. it can't currently be a
> VFIO_DEVICE ioctl as getting the device fd requires an open, but this
> proposal requires some availability of the vfio device fd without any
> setup, so presumably that won't yet call the driver open callback.
> Maybe that's part of the attach phase now... I'm not sure, it's not
> clear when the vfio device uAPI starts being available in the process
> of setting up the ioasid. Thanks,

At a certain point we may just have to stick to backward compat, I
think. Though it is useful to think about greenfield alternatives to
try to guide the backward compat design..

Jason

2021-06-03 20:44:45

by Alex Williamson

[permalink] [raw]
Subject: Re: [RFC] /dev/ioasid uAPI proposal

On Thu, 3 Jun 2021 09:40:36 -0300
Jason Gunthorpe <[email protected]> wrote:

> On Thu, Jun 03, 2021 at 03:22:27AM +0000, Tian, Kevin wrote:
> > > From: Alex Williamson <[email protected]>
> > > Sent: Thursday, June 3, 2021 10:51 AM
> > >
> > > On Wed, 2 Jun 2021 19:45:36 -0300
> > > Jason Gunthorpe <[email protected]> wrote:
> > >
> > > > On Wed, Jun 02, 2021 at 02:37:34PM -0600, Alex Williamson wrote:
> > > >
> > > > > Right. I don't follow where you're jumping to relaying DMA_PTE_SNP
> > > > > from the guest page table... what page table?
> > > >
> > > > I see my confusion now, the phrasing in your earlier remark led me
> > > > think this was about allowing the no-snoop performance enhancement in
> > > > some restricted way.
> > > >
> > > > It is really about blocking no-snoop 100% of the time and then
> > > > disabling the dangerous wbinvd when the block is successful.
> > > >
> > > > Didn't closely read the kvm code :\
> > > >
> > > > If it was about allowing the optimization then I'd expect the guest to
> > > > enable no-snoopable regions via it's vIOMMU and realize them to the
> > > > hypervisor and plumb the whole thing through. Hence my remark about
> > > > the guest page tables..
> > > >
> > > > So really the test is just 'were we able to block it' ?
> > >
> > > Yup. Do we really still consider that there's some performance benefit
> > > to be had by enabling a device to use no-snoop? This seems largely a
> > > legacy thing.
> >
> > Yes, there is indeed a performance benefit for a device to use no-snoop,
> > e.g. 8K display and some image processing paths, etc. The problem is
> > that the IOMMU for such devices is typically a different one from the
> > default IOMMU for most devices. This special IOMMU may not have
> > the ability to enforce snoop on no-snoop PCI traffic, so this fact
> > must be understood by KVM to do proper mtrr/pat/wbinvd virtualization
> > for such devices to work correctly.
>
> Or stated another way:
>
> We in Linux don't have a way to control if the VFIO IO page table will
> be snoop or no snoop from userspace so Intel has forced the platform's
> IOMMU path for the integrated GPU to be unable to enforce snoop, thus
> "solving" the problem.

That's giving vfio a lot of credit for influencing VT-d design.

> I don't think that is sustainable in the overall ecosystem though.

Our current behavior is a reasonable default IMO, but I agree more
control will probably benefit us in the long run.

> 'qemu --allow-no-snoop' makes more sense to me

I'd be tempted to attach it to the -device vfio-pci option; it's
specific drivers for specific devices that are going to want this, and
those devices may not be permanently attached to the VM. But I see in
the other thread you're trying to optimize IOMMU page table sharing.

There's a usability question in either case though and I'm not sure how
to get around it other than QEMU or the kernel knowing a list of
devices (explicit IDs or vendor+class) to select per device defaults.

> > When discussing I/O page fault support in another thread, the consensus
> > is that a device handle will be registered (by the user) or allocated
> > (returned to the user) in /dev/ioasid when binding the device to the ioasid
> > fd. From this angle we can register {ioasid_fd, device_handle} with KVM and
> > then call something like ioasidfd_device_is_coherent() to get the property.
> > Anyway the coherency is a per-device property which is not changed
> > by how many I/O page tables are attached to it.
>
> It is not device specific, it is driver specific
>
> As I said before, the question is if the IOASID itself can enforce
> snoop, or not. AND if the device will issue no-snoop or not.
>
> Devices that are hard wired to never issue no-snoop are safe even with
> an IOASID that cannot enforce snoop. AFAIK really only GPUs use this
> feature. Eg I would be comfortable to say mlx5 never uses the no-snoop
> TLP flag.
>
> Only the vfio_driver could know this.

Could you clarify "vfio_driver"? The existing vfio-pci driver can't
know this, beyond perhaps probing if the Enable No-snoop bit is
hardwired to zero. It's the driver running on top of vfio that
ultimately controls whether a capable device actually issues no-snoop
TLPs, but that can't be known to us. A vendor variant of vfio-pci
might certainly know more about how its device is used by those
userspace/VM drivers. Thanks,

Alex

2021-06-03 20:58:37

by Jacob Pan

[permalink] [raw]
Subject: Re: [RFC] /dev/ioasid uAPI proposal

Hi Parav,

On Tue, 1 Jun 2021 17:30:51 +0000, Parav Pandit <[email protected]> wrote:

> > From: Tian, Kevin <[email protected]>
> > Sent: Thursday, May 27, 2021 1:28 PM
>
> > 5.6. I/O page fault
> > +++++++++++++++
> >
> > (uAPI is TBD. Here is just about the high-level flow from host IOMMU
> > driver to guest IOMMU driver and backwards).
> >
> > - Host IOMMU driver receives a page request with raw fault_data {rid,
> > pasid, addr};
> >
> > - Host IOMMU driver identifies the faulting I/O page table according
> > to information registered by IOASID fault handler;
> >
> > - IOASID fault handler is called with raw fault_data (rid, pasid,
> > addr), which is saved in ioasid_data->fault_data (used for response);
> >
> > - IOASID fault handler generates an user fault_data (ioasid, addr),
> > links it to the shared ring buffer and triggers eventfd to userspace;
> >
> > - Upon received event, Qemu needs to find the virtual routing
> > information (v_rid + v_pasid) of the device attached to the faulting
> > ioasid. If there are multiple, pick a random one. This should be fine
> > since the purpose is to fix the I/O page table on the guest;
> >
> > - Qemu generates a virtual I/O page fault through vIOMMU into guest,
> > carrying the virtual fault data (v_rid, v_pasid, addr);
> >
> Why does it have to be through vIOMMU?
I think this flow is for a fully emulated IOMMU, where the same IOMMU and
device drivers run in the host and guest. The page request interrupt is
reported by the IOMMU, and is thus reported to the vIOMMU in the guest.

> For a VFIO PCI device, have you considered to reuse the same PRI
> interface to inject page fault in the guest? This eliminates any new
> v_rid. It will also route the page fault request and response through the
> right vfio device.
>
I am curious how PCI PRI can be used to inject faults. Are you talking
about the PCI config PRI extended capability structure? The control there is
very limited, only enable and reset. Can you explain how page faults would
be handled via the generic PCI capability?
Some devices may have a device-specific way to handle page faults, but I
guess this is not the PCI PRI method you are referring to?

> > - Guest IOMMU driver fixes up the fault, updates the I/O page table,
> > and then sends a page response with virtual completion data (v_rid,
> > v_pasid, response_code) to vIOMMU;
> >
> What about fixing up the fault for mmu page table as well in guest?
> Or you meant both when above you said "updates the I/O page table"?
>
> It is unclear to me that if there is single nested page table maintained
> or two (one for cr3 references and other for iommu). Can you please
> clarify?
>
I think it is just one, at least for VT-d: the guest cr3 (in GPA) is stored
in the host IOMMU. The guest IOMMU driver calls handle_mm_fault to fix the
MMU page tables, which are shared with the IOMMU.

> > - Qemu finds the pending fault event, converts virtual completion data
> > into (ioasid, response_code), and then calls a /dev/ioasid ioctl to
> > complete the pending fault;
> >
> For VFIO PCI device a virtual PRI request response interface is done, it
> can be generic interface among multiple vIOMMUs.
>
Same question as above; I'm not sure how this works in terms of interrupts
and response queuing, etc.

> > - /dev/ioasid finds out the pending fault data {rid, pasid, addr}
> > saved in ioasid_data->fault_data, and then calls iommu api to complete
> > it with {rid, pasid, response_code};
> >


Thanks,

Jacob
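
To make the quoted flow concrete, here is a rough userspace sketch in the same
pseudo-code style as the RFC; since the fault uAPI is explicitly TBD, the event
layout, ring helpers and the IOASID_PAGE_RESPONSE ioctl are placeholders only:

/* hypothetical record produced into the shared ring buffer */
struct ioasid_fault_event {
        __u32 ioasid;
        __u64 addr;
};

/* the eventfd registered with /dev/ioasid signals pending faults */
read(fault_eventfd, &cnt, sizeof(cnt));
while (ring_pop(fault_ring, &evt)) {
        /* look up (v_rid, v_pasid) of a device attached to evt.ioasid and
           inject a virtual page request through the vIOMMU */
        viommu_inject_page_request(v_rid, v_pasid, evt.addr);
}

/* later, when the guest sends a page response via the vIOMMU */
resp = {
        .ioasid = evt.ioasid,
        .addr = evt.addr,
        .response_code = code,
};
ioctl(ioasid_fd, IOASID_PAGE_RESPONSE, &resp);  /* placeholder name */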

2021-06-03 21:46:46

by Alex Williamson

[permalink] [raw]
Subject: Re: [RFC] /dev/ioasid uAPI proposal

On Thu, 3 Jun 2021 17:10:18 -0300
Jason Gunthorpe <[email protected]> wrote:

> On Thu, Jun 03, 2021 at 02:01:46PM -0600, Alex Williamson wrote:
>
> > > > > 1) Mixing IOMMU_CAP_CACHE_COHERENCY and !IOMMU_CAP_CACHE_COHERENCY
> > > > > domains.
> > > > >
> > > > > This doesn't actually matter. If you mix them together then kvm
> > > > > will turn on wbinvd anyhow, so we don't need to use the DMA_PTE_SNP
> > > > > anywhere in this VM.
> > > > >
> > > > > This if two IOMMU's are joined together into a single /dev/ioasid
> > > > > then we can just make them both pretend to be
> > > > > !IOMMU_CAP_CACHE_COHERENCY and both not set IOMMU_CACHE.
> > > >
> > > > Yes and no. Yes, if any domain is !IOMMU_CAP_CACHE_COHERENCY then we
> > > > need to emulate wbinvd, but no we'll use IOMMU_CACHE any time it's
> > > > available based on the per domain support available. That gives us the
> > > > most consistent behavior, ie. we don't have VMs emulating wbinvd
> > > > because they used to have a device attached where the domain required
> > > > it and we can't atomically remap with new flags to perform the same as
> > > > a VM that never had that device attached in the first place.
> > >
> > > I think we are saying the same thing..
> >
> > Hrm? I think I'm saying the opposite of your "both not set
> > IOMMU_CACHE". IOMMU_CACHE is the mapping flag that enables
> > DMA_PTE_SNP. Maybe you're using IOMMU_CACHE as the state reported to
> > KVM?
>
> I'm saying if we enable wbinvd in the guest then no IOASIDs used by
> that guest need to set DMA_PTE_SNP.

Yes

> If we disable wbinvd in the guest
> then all IOASIDs must enforce DMA_PTE_SNP (or we otherwise guarentee
> no-snoop is not possible).

Yes, but we can't get from one of these to the other atomically wrt
the device DMA.

> This is not what VFIO does today, but it is a reasonable choice.
>
> Based on that observation we can say as soon as the user wants to use
> an IOMMU that does not support DMA_PTE_SNP in the guest we can still
> share the IO page table with IOMMUs that do support DMA_PTE_SNP.

If your goal is to prioritize IO page table sharing, sure. But because
we cannot atomically transition from one to the other, each device is
stuck with the page tables it has, so the history of the VM becomes a
factor in the performance characteristics.

For example if device {A} is backed by an IOMMU capable of blocking
no-snoop and device {B} is backed by an IOMMU which cannot block
no-snoop, then booting VM1 with {A,B} and later removing device {B}
would result in ongoing wbinvd emulation versus a VM2 only booted with
{A}.

Type1 would use separate IO page tables (domains/ioasids) for these such
that VM1 and VM2 have the same characteristics at the end.

Does this become user defined policy in the IOASID model? There's
quite a mess of exposing sufficient GET_INFO for an IOASID for the user
to know such properties of the IOMMU, plus maybe we need mapping flags
equivalent to IOMMU_CACHE exposed to the user, preventing sharing an
IOASID that could generate IOMMU faults, etc.

> > > It doesn't solve the problem to connect kvm to AP and kvmgt though
> >
> > It does not, we'll probably need a vfio ioctl to gratuitously announce
> > the KVM fd to each device. I think some devices might currently fail
> > their open callback if that linkage isn't already available though, so
> > it's not clear when that should happen, ie. it can't currently be a
> > VFIO_DEVICE ioctl as getting the device fd requires an open, but this
> > proposal requires some availability of the vfio device fd without any
> > setup, so presumably that won't yet call the driver open callback.
> > Maybe that's part of the attach phase now... I'm not sure, it's not
> > clear when the vfio device uAPI starts being available in the process
> > of setting up the ioasid. Thanks,
>
> At a certain point we maybe just have to stick to backward compat, I
> think. Though it is useful to think about green field alternates to
> try to guide the backward compat design..

I think more to drive the replacement design; if we can't figure out
how to do something other than backwards compatibility trickery in the
kernel, it's probably going to bite us. Thanks,

Alex

2021-06-04 01:12:32

by Jason Wang

[permalink] [raw]
Subject: Re: [RFC] /dev/ioasid uAPI proposal


On 2021/6/3 9:09 PM, Jason Gunthorpe wrote:
> On Thu, Jun 03, 2021 at 10:52:51AM +0800, Jason Wang wrote:
>
>> Basically, we don't want to bother with pseudo KVM device like what VFIO
>> did. So for simplicity, we rules out the IOMMU that can't enforce coherency
>> in vhost-vDPA if the parent purely depends on the platform IOMMU:
> VDPA HW cannot issue no-snoop TLPs in the first place.


Note that virtio/vDPA is not necessarily a PCI device.


>
> virtio does not define a protocol to discover such a functionality,


Actually we had:

VIRTIO_F_ACCESS_PLATFORM(33)
This feature indicates that the device can be used on a platform where
device access to data in memory is limited and/or translated. E.g. this
is the case if the device can be located behind an IOMMU that translates
bus addresses from the device into physical addresses in memory, if the
device can be limited to only access certain memory addresses or if
special commands such as a cache flush can be needed to synchronise data
in memory with the device.


> nor do any virtio drivers implement the required platform specific
> cache flushing to make no-snoop TLPs work.


I don't get why virtio drivers need to do that. I think the DMA API should
hide those arch/platform specific details from us.


>
> It is fundamentally part of the virtio HW PCI API that a device vendor
> cannot alter.


The spec doesn't forbid this; it just leaves the detection and action
to the driver in a platform-specific way.

Thanks


>
> Basically since we already know that the virtio kernel drivers do not
> call the cache flush instruction we don't need the weird KVM logic to
> turn it on at all.
>
> Enforcing no-snoop at the IOMMU here is redundant/confusing.
>
> Jason
>

2021-06-04 01:34:06

by Jason Wang

[permalink] [raw]
Subject: Re: [RFC] /dev/ioasid uAPI proposal


On 2021/6/4 2:19 PM, Jacob Pan wrote:
> Hi Shenming,
>
> On Wed, 2 Jun 2021 12:50:26 +0800, Shenming Lu <[email protected]>
> wrote:
>
>> On 2021/6/2 1:33, Jason Gunthorpe wrote:
>>> On Tue, Jun 01, 2021 at 08:30:35PM +0800, Lu Baolu wrote:
>>>
>>>> The drivers register per page table fault handlers to /dev/ioasid which
>>>> will then register itself to iommu core to listen and route the per-
>>>> device I/O page faults.
>>> I'm still confused why drivers need fault handlers at all?
>> Essentially it is the userspace that needs the fault handlers,
>> one case is to deliver the faults to the vIOMMU, and another
>> case is to enable IOPF on the GPA address space for on-demand
>> paging, it seems that both could be specified in/through the
>> IOASID_ALLOC ioctl?
>>
> I would think IOASID_BIND_PGTABLE is where fault handler should be
> registered. There wouldn't be any IO page fault without the binding anyway.
>
> I also don't understand why device drivers should register the fault
> handler, the fault is detected by the pIOMMU and injected to the vIOMMU. So
> I think it should be the IOASID itself register the handler.


As discussed in another thread, I think the reason is that ATS doesn't
forbid the #PF from being reported in a device-specific way.

Thanks


>
>> Thanks,
>> Shenming
>>
>
> Thanks,
>
> Jacob
>

2021-06-04 02:05:19

by Shenming Lu

[permalink] [raw]
Subject: Re: [RFC] /dev/ioasid uAPI proposal

On 2021/6/4 2:19, Jacob Pan wrote:
> Hi Shenming,
>
> On Wed, 2 Jun 2021 12:50:26 +0800, Shenming Lu <[email protected]>
> wrote:
>
>> On 2021/6/2 1:33, Jason Gunthorpe wrote:
>>> On Tue, Jun 01, 2021 at 08:30:35PM +0800, Lu Baolu wrote:
>>>
>>>> The drivers register per page table fault handlers to /dev/ioasid which
>>>> will then register itself to iommu core to listen and route the per-
>>>> device I/O page faults.
>>>
>>> I'm still confused why drivers need fault handlers at all?
>>
>> Essentially it is the userspace that needs the fault handlers,
>> one case is to deliver the faults to the vIOMMU, and another
>> case is to enable IOPF on the GPA address space for on-demand
>> paging, it seems that both could be specified in/through the
>> IOASID_ALLOC ioctl?
>>
> I would think IOASID_BIND_PGTABLE is where fault handler should be
> registered. There wouldn't be any IO page fault without the binding anyway.

Yeah, I also proposed this before; registering the handler in the BIND_PGTABLE
ioctl does make sense for the guest page faults. :-)

But how about the page faults from the GPA address space (its page table is
mapped through the MAP_DMA ioctl)? From your point of view, it seems that we
should register the handler for the GPA address space in the (first) MAP_DMA
ioctl.

>
> I also don't understand why device drivers should register the fault
> handler, the fault is detected by the pIOMMU and injected to the vIOMMU. So
> I think it should be the IOASID itself register the handler.

Yeah, and it can also be said that the provider of the page table registers the
handler (Baolu).

Thanks,
Shenming

2021-06-04 02:18:10

by Tian, Kevin

[permalink] [raw]
Subject: RE: [RFC] /dev/ioasid uAPI proposal

> From: Jason Gunthorpe <[email protected]>
> Sent: Thursday, June 3, 2021 7:47 PM
>
> On Thu, Jun 03, 2021 at 06:49:20AM +0000, Tian, Kevin wrote:
> > > From: David Gibson
> > > Sent: Thursday, June 3, 2021 1:09 PM
> > [...]
> > > > > In this way the SW mode is the same as a HW mode with an infinite
> > > > > cache.
> > > > >
> > > > > The collaposed shadow page table is really just a cache.
> > > > >
> > > >
> > > > OK. One additional thing is that we may need a 'caching_mode"
> > > > thing reported by /dev/ioasid, indicating whether invalidation is
> > > > required when changing non-present to present. For hardware
> > > > nesting it's not reported as the hardware IOMMU will walk the
> > > > guest page table in cases of iotlb miss. For software nesting
> > > > caching_mode is reported so the user must issue invalidation
> > > > upon any change in guest page table so the kernel can update
> > > > the shadow page table timely.
> > >
> > > For the fist cut, I'd have the API assume that invalidates are
> > > *always* required. Some bypass to avoid them in cases where they're
> > > not needed can be an additional extension.
> > >
> >
> > Isn't a typical TLB semantics is that non-present entries are not
> > cached thus invalidation is not required when making non-present
> > to present? It's true to both CPU TLB and IOMMU TLB. In reality
> > I feel there are more usages built on hardware nesting than software
> > nesting thus making default following hardware TLB behavior makes
> > more sense...
>
> From a modelling perspective it makes sense to have the most general
> be the default and if an implementation can elide certain steps then
> describing those as additional behaviors on the universal baseline is
> cleaner
>
> I'm surprised to hear your remarks about the not-present though,
> how does the vIOMMU emulation work if there are not hypervisor
> invalidation traps for not-present/present transitions?
>

Such invalidation traps matter only for a shadow I/O page table (software
nesting). For hardware nesting no trap is required for the non-present/
present transition since the physical IOTLB doesn't cache non-present
entries. The IOMMU will walk the guest I/O page table in case of an IOTLB
miss.

The vIOMMU should be constructed according to whether software
or hardware nesting is used. For Intel (and AMD iirc), a caching_mode
capability decides whether the guest needs to do invalidation for the
non-present/present transition. The vIOMMU should clear this bit
for hardware nesting or set it for software nesting. ARM SMMU doesn't
have this capability, therefore its vSMMU can only work with a
hardware-nested IOASID.

Thanks
Kevin
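
In the RFC's pseudo-code style, the user-side consequence would be roughly the
following (IOASID_GET_INFO, the caching_mode field and IOASID_INVALIDATE_CACHE
are placeholder names, since the query and invalidation uAPI are still TBD):

ioctl(ioasid_fd, IOASID_GET_INFO, &info);       /* placeholder query */

/* guest changes a PTE from non-present to present */
update_guest_pgtable(gva_ioasid, gva, gpa);

if (info.caching_mode) {
        /* software nesting: the kernel shadows the guest page table and
           must be told about every change, including non-present->present */
        inv_data = {
                .ioasid = gva_ioasid,
                .addr = gva,
                .size = PAGE_SIZE,
        };
        ioctl(ioasid_fd, IOASID_INVALIDATE_CACHE, &inv_data);
}
/* hardware nesting: no invalidation needed here, the IOMMU walks the
   guest I/O page table on IOTLB miss */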

2021-06-04 06:10:47

by Tian, Kevin

[permalink] [raw]
Subject: RE: [RFC] /dev/ioasid uAPI proposal

> From: Jason Gunthorpe <[email protected]>
> Sent: Thursday, June 3, 2021 8:11 PM
>
> On Thu, Jun 03, 2021 at 03:45:09PM +1000, David Gibson wrote:
> > On Wed, Jun 02, 2021 at 01:58:38PM -0300, Jason Gunthorpe wrote:
> > > On Wed, Jun 02, 2021 at 04:48:35PM +1000, David Gibson wrote:
> > > > > > /* Bind guest I/O page table */
> > > > > > bind_data = {
> > > > > > .ioasid = gva_ioasid;
> > > > > > .addr = gva_pgtable1;
> > > > > > // and format information
> > > > > > };
> > > > > > ioctl(ioasid_fd, IOASID_BIND_PGTABLE, &bind_data);
> > > > >
> > > > > Again I do wonder if this should just be part of alloc_ioasid. Is
> > > > > there any reason to split these things? The only advantage to the
> > > > > split is the device is known, but the device shouldn't impact
> > > > > anything..
> > > >
> > > > I'm pretty sure the device(s) could matter, although they probably
> > > > won't usually.
> > >
> > > It is a bit subtle, but the /dev/iommu fd itself is connected to the
> > > devices first. This prevents wildly incompatible devices from being
> > > joined together, and allows some "get info" to report the capability
> > > union of all devices if we want to do that.
> >
> > Right.. but I've not been convinced that having a /dev/iommu fd
> > instance be the boundary for these types of things actually makes
> > sense. For example if we were doing the preregistration thing
> > (whether by child ASes or otherwise) then that still makes sense
> > across wildly different devices, but we couldn't share that layer if
> > we have to open different instances for each of them.
>
> It is something that still seems up in the air.. What seems clear for
> /dev/iommu is that it
> - holds a bunch of IOASID's organized into a tree
> - holds a bunch of connected devices
> - holds a pinned memory cache
>
> One thing it must do is enforce IOMMU group security. A device cannot
> be attached to an IOASID unless all devices in its IOMMU group are
> part of the same /dev/iommu FD.
>
> The big open question is what parameters govern allowing devices to
> connect to the /dev/iommu:
> - all devices can connect and we model the differences inside the API
> somehow.

I prefer this option if there is no significant blocker ahead.

> - Only sufficiently "similar" devices can be connected
> - The FD's capability is the minimum of all the connected devices
>
> There are some practical problems here, when an IOASID is created the
> kernel does need to allocate a page table for it, and that has to be
> in some definite format.
>
> It may be that we had a false start thinking the FD container should
> be limited. Perhaps creating an IOASID should pass in a list
> of the "device labels" that the IOASID will be used with and that can
> guide the kernel what to do?

In Qemu's case the problem is that it doesn't know the list of devices
that will be attached to an IOASID when the IOASID is created. This is
guest-side knowledge which is conveyed to Qemu one device at a time
through the vIOMMU.

I feel it's fair to say that before the user creates an IOASID he should
already have checked the format information of the device which is
intended to be attached right after, and then specify a format compatible
with that device when creating the IOASID. There is no format check when
the IOASID is created, since its I/O page table is not installed to the
IOMMU yet. Later, when the intended device is attached to this IOASID,
the format is verified and the attach request fails if incompatible.

Thanks
Kevin
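
Expressed in the RFC's pseudo-code style (the format-related fields of
IOASID_ALLOC are an assumed extension; VFIO_ATTACH_IOASID is from the
proposal):

/* user has already queried the intended device's supported formats */
alloc_data = {
        .format = IOASID_FMT_VTD_S2,    /* assumed format identifier */
        // other nesting/size parameters
};
gpa_ioasid = ioctl(ioasid_fd, IOASID_ALLOC, &alloc_data);
/* no compatibility check yet: the page table isn't installed anywhere */

/* the check happens at attach time */
at_data = { .ioasid = gpa_ioasid };
if (ioctl(device_fd, VFIO_ATTACH_IOASID, &at_data) < 0)
        /* incompatible format: create a separate IOASID for this device */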

2021-06-04 06:31:01

by Tian, Kevin

[permalink] [raw]
Subject: RE: [RFC] /dev/ioasid uAPI proposal

> From: Jason Gunthorpe <[email protected]>
> Sent: Thursday, June 3, 2021 8:46 PM
>
> On Thu, Jun 03, 2021 at 04:26:08PM +1000, David Gibson wrote:
>
> > > There are global properties in the /dev/iommu FD, like what devices
> > > are part of it, that are important for group security operations. This
> > > becomes confused if it is split to many FDs.
> >
> > I'm still not seeing those. I'm really not seeing any well-defined
> > meaning to devices being attached to the fd, but not to a particular
> > IOAS.
>
> Kevin can you add a section on how group security would have to work
> to the RFC? This is the idea you can't attach a device to an IOASID
> unless all devices in the IOMMU group are joined to the /dev/iommu FD.
>
> The basic statement is that userspace must present the entire group
> membership to /dev/iommu to prove that it has the security right to
> manipulate their DMA translation.
>
> It is the device centric analog to what the group FD is doing for
> security.
>

Yes, will do.

Thanks
Kevin
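
Roughly, the rule Jason describes would look like this from userspace (the
sysfs path is the existing IOMMU group layout; the bind ioctl name and helper
are illustrative only):

/* every device in the IOMMU group must be bound to the same ioasid_fd
 * before any of them may be attached to an IOASID */
ioasid_fd = open("/dev/ioasid", O_RDWR);

for each dev in /sys/kernel/iommu_groups/<N>/devices/ {
        device_fd = open_vfio_device(dev);
        ioctl(device_fd, VFIO_BIND_IOASID_FD, &ioasid_fd);      /* illustrative */
}

/* only now does the kernel allow attaching group members */
ioctl(device_fd, VFIO_ATTACH_IOASID, &at_data);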

2021-06-04 06:39:42

by Tian, Kevin

[permalink] [raw]
Subject: RE: [RFC] /dev/ioasid uAPI proposal

> From: Jason Gunthorpe
> Sent: Thursday, June 3, 2021 9:05 PM
>
> > >
> > > 3) Device accepts any PASIDs from the guest. No
> > > vPASID/pPASID translation is possible. (classic vfio_pci)
> > > 4) Device accepts any PASID from the guest and has an
> > > internal vPASID/pPASID translation (enhanced vfio_pci)
> >
> > what is enhanced vfio_pci? In my writing this is for mdev
> > which doesn't support ENQCMD
>
> This is a vfio_pci that mediates some element of the device interface
> to communicate the vPASID/pPASID table to the device, using Max's
> series for vfio_pci drivers to inject itself into VFIO.
>
> For instance a device might send a message through the PF that the VF
> has a certain vPASID/pPASID translation table. This would be useful
> for devices that cannot use ENQCMD but still want to support migration
> and thus need vPASID.

I still don't quite get it. If it's a PCI device, why is PASID translation
required? Just delegate the per-RID PASID space to the user as type-3; then
migrating the vPASID space is straightforward.

Thanks
Kevin

2021-06-04 07:35:25

by Tian, Kevin

[permalink] [raw]
Subject: RE: [RFC] /dev/ioasid uAPI proposal

> From: Jason Gunthorpe <[email protected]>
> Sent: Thursday, June 3, 2021 8:41 PM
>
> > When discussing I/O page fault support in another thread, the consensus
> > is that an device handle will be registered (by user) or allocated (return
> > to user) in /dev/ioasid when binding the device to ioasid fd. From this
> > angle we can register {ioasid_fd, device_handle} to KVM and then call
> > something like ioasidfd_device_is_coherent() to get the property.
> > Anyway the coherency is a per-device property which is not changed
> > by how many I/O page tables are attached to it.
>
> It is not device specific, it is driver specific
>
> As I said before, the question is if the IOASID itself can enforce
> snoop, or not. AND if the device will issue no-snoop or not.

Sure. My earlier comment was based on the assumption that all IOASIDs
attached to a device should inherit the same snoop/no-snoop property. But
it looks like nothing prevents a device driver from setting PTE_SNP only for
selected I/O page tables, according to whether isoch agents are involved.

A user space driver could figure out per-IOASID requirements itself.

A guest device driver can indirectly convey this information through
the vIOMMU.

Registering {IOASID_FD, IOASID} to KVM has another merit, as we also
need it to update the CPU PASID mapping for ENQCMD. We can define
one interface for both requirements.

Thanks
Kevin

2021-06-04 08:19:35

by Jean-Philippe Brucker

[permalink] [raw]
Subject: Re: [RFC] /dev/ioasid uAPI proposal

On Wed, Jun 02, 2021 at 01:25:00AM +0000, Tian, Kevin wrote:
> > > This implies that VFIO_BOUND_IOASID will be extended to allow user
> > > specify a device label. This label will be recorded in /dev/iommu to
> > > serve per-device invalidation request from and report per-device
> > > fault data to the user.
> >
> > I wonder which of the user providing a 64 bit cookie or the kernel
> > returning a small IDA is the best choice here? Both have merits
> > depending on what qemu needs..
>
> Yes, either way can work. I don't have a strong preference. Jean?

I don't see an issue with either solution, maybe it will show up while
prototyping. First one uses IDs that do mean something for someone, and
userspace may inject faults slightly faster since it doesn't need an
ID->vRID lookup, so that's my preference.

> > > In addition, vPASID (if provided by user) will
> > > be also recorded in /dev/iommu so vPASID<->pPASID conversion
> > > is conducted properly. e.g. invalidation request from user carries
> > > a vPASID which must be converted into pPASID before calling iommu
> > > driver. Vice versa for raw fault data which carries pPASID while the
> > > user expects a vPASID.
> >
> > I don't think the PASID should be returned at all. It should return
> > the IOASID number in the FD and/or a u64 cookie associated with that
> > IOASID. Userspace should figure out what the IOASID & device
> > combination means.
>
> This is true for Intel. But what about ARM which has only one IOASID
> (pasid table) per device to represent all guest I/O page tables?

In that case vPASID = pPASID though. The vPASID allocated by the guest is
the same from the vIOMMU inval to the pIOMMU inval. I don't think host
kernel or userspace need to alter it.

Thanks,
Jean

2021-06-04 08:42:25

by Tian, Kevin

[permalink] [raw]
Subject: RE: [RFC] /dev/ioasid uAPI proposal

> From: Alex Williamson <[email protected]>
> Sent: Friday, June 4, 2021 5:44 AM
>
> > Based on that observation we can say as soon as the user wants to use
> > an IOMMU that does not support DMA_PTE_SNP in the guest we can still
> > share the IO page table with IOMMUs that do support DMA_PTE_SNP.

Page table sharing between incompatible IOMMUs is not a critical
thing. I prefer disallowing sharing in such cases as the starting point,
i.e. the user needs to create separate IOASIDs for such devices.

>
> If your goal is to prioritize IO page table sharing, sure. But because
> we cannot atomically transition from one to the other, each device is
> stuck with the pages tables it has, so the history of the VM becomes a
> factor in the performance characteristics.
>
> For example if device {A} is backed by an IOMMU capable of blocking
> no-snoop and device {B} is backed by an IOMMU which cannot block
> no-snoop, then booting VM1 with {A,B} and later removing device {B}
> would result in ongoing wbinvd emulation versus a VM2 only booted with
> {A}.
>
> Type1 would use separate IO page tables (domains/ioasids) for these such
> that VM1 and VM2 have the same characteristics at the end.
>
> Does this become user defined policy in the IOASID model? There's
> quite a mess of exposing sufficient GET_INFO for an IOASID for the user
> to know such properties of the IOMMU, plus maybe we need mapping flags
> equivalent to IOMMU_CACHE exposed to the user, preventing sharing an
> IOASID that could generate IOMMU faults, etc.

IOMMU_CACHE is a fixed attribute of a given IOMMU. So it's better to
convey this info to userspace via GET_INFO for a device_label, before
creating any IOASID. But overall I agree that careful thinking is required
about how to organize that info reporting (per-fd, per-device, per-ioasid)
to userspace.

>
> > > > It doesn't solve the problem to connect kvm to AP and kvmgt though
> > >
> > > It does not, we'll probably need a vfio ioctl to gratuitously announce
> > > the KVM fd to each device. I think some devices might currently fail
> > > their open callback if that linkage isn't already available though, so
> > > it's not clear when that should happen, ie. it can't currently be a
> > > VFIO_DEVICE ioctl as getting the device fd requires an open, but this
> > > proposal requires some availability of the vfio device fd without any
> > > setup, so presumably that won't yet call the driver open callback.
> > > Maybe that's part of the attach phase now... I'm not sure, it's not
> > > clear when the vfio device uAPI starts being available in the process
> > > of setting up the ioasid. Thanks,
> >
> > At a certain point we maybe just have to stick to backward compat, I
> > think. Though it is useful to think about green field alternates to
> > try to guide the backward compat design..
>
> I think more to drive the replacement design; if we can't figure out
> how to do something other than backwards compatibility trickery in the
> kernel, it's probably going to bite us. Thanks,
>

I'm a bit lost on the desired flow in your minds. Here is one flow based
on my understanding of this discussion. Please comment whether it
matches your thinking:

0) ioasid_fd is created and registered to KVM via KVM_ADD_IOASID_FD;

1) Qemu binds dev1 to ioasid_fd;

2) Qemu calls IOASID_GET_DEV_INFO for dev1. This will carry IOMMU_
CACHE info i.e. whether underlying IOMMU can enforce snoop;

3) Qemu plans to create a gpa_ioasid, and attach dev1 to it. Here Qemu
needs to figure out whether dev1 wants to do no-snoop. This might
be based on a fixed vendor/class list or specified by the user;

4) gpa_ioasid = ioctl(ioasid_fd, IOASID_ALLOC); At this point a 'snoop'
flag is specified to decide the page table format, which is supposed
to match dev1;

5) Qemu attaches dev1 to gpa_ioasid via VFIO_ATTACH_IOASID. At this
point, specify snoop/no-snoop again. If not supported by related
iommu or different from what gpa_ioasid has, attach fails.

6) Call KVM to update the snoop requirement via KVM_UPDATE_IOASID_FD.
This triggers ioasidfd_for_each_ioasid();

Later when dev2 is attached to gpa_ioasid, the same flow is followed. This
implies that KVM_UPDATE_IOASID_FD is called only when a new IOASID is
created or an existing IOASID is destroyed, because all devices under an
IOASID should have the same snoop requirement.

Thanks
Kevin
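
For reference, the flow above written out in the RFC's pseudo-code style (all
names are the proposed/placeholder ones used in this thread, not existing
uAPI):

/* 0) register the ioasid fd with KVM */
ioctl(kvm_fd, KVM_ADD_IOASID_FD, ioasid_fd);

/* 1-2) bind dev1 and query whether its IOMMU can enforce snoop */
ioctl(device_fd, VFIO_BIND_IOASID_FD, &ioasid_fd);
ioctl(ioasid_fd, IOASID_GET_DEV_INFO, &dev_info);

/* 3-4) decide whether dev1 may want no-snoop, allocate accordingly */
alloc_data = { .flags = dev1_wants_nosnoop ? 0 : IOASID_ENFORCE_SNOOP };
gpa_ioasid = ioctl(ioasid_fd, IOASID_ALLOC, &alloc_data);

/* 5) attach; fails if the IOMMU cannot provide what the IOASID requires */
at_data = { .ioasid = gpa_ioasid };
ioctl(device_fd, VFIO_ATTACH_IOASID, &at_data);

/* 6) let KVM recompute whether wbinvd emulation is needed */
ioctl(kvm_fd, KVM_UPDATE_IOASID_FD, ioasid_fd);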

2021-06-04 08:46:06

by Tian, Kevin

[permalink] [raw]
Subject: RE: [RFC] /dev/ioasid uAPI proposal

> From: Jean-Philippe Brucker <[email protected]>
> Sent: Friday, June 4, 2021 4:18 PM
>
> On Wed, Jun 02, 2021 at 01:25:00AM +0000, Tian, Kevin wrote:
> > > > This implies that VFIO_BOUND_IOASID will be extended to allow user
> > > > specify a device label. This label will be recorded in /dev/iommu to
> > > > serve per-device invalidation request from and report per-device
> > > > fault data to the user.
> > >
> > > I wonder which of the user providing a 64 bit cookie or the kernel
> > > returning a small IDA is the best choice here? Both have merits
> > > depending on what qemu needs..
> >
> > Yes, either way can work. I don't have a strong preference. Jean?
>
> I don't see an issue with either solution, maybe it will show up while
> prototyping. First one uses IDs that do mean something for someone, and
> userspace may inject faults slightly faster since it doesn't need an
> ID->vRID lookup, so that's my preference.

ok, will go for the first option in v2.

>
> > > > In addition, vPASID (if provided by user) will
> > > > be also recorded in /dev/iommu so vPASID<->pPASID conversion
> > > > is conducted properly. e.g. invalidation request from user carries
> > > > a vPASID which must be converted into pPASID before calling iommu
> > > > driver. Vice versa for raw fault data which carries pPASID while the
> > > > user expects a vPASID.
> > >
> > > I don't think the PASID should be returned at all. It should return
> > > the IOASID number in the FD and/or a u64 cookie associated with that
> > > IOASID. Userspace should figure out what the IOASID & device
> > > combination means.
> >
> > This is true for Intel. But what about ARM which has only one IOASID
> > (pasid table) per device to represent all guest I/O page tables?
>
> In that case vPASID = pPASID though. The vPASID allocated by the guest is
> the same from the vIOMMU inval to the pIOMMU inval. I don't think host
> kernel or userspace need to alter it.
>

Yes. So responding to Jason's earlier comment, we do need to return the
PASID (although no conversion is required) to userspace in this case.

Thanks
Kevin

2021-06-04 09:21:39

by Tian, Kevin

[permalink] [raw]
Subject: RE: [RFC] /dev/ioasid uAPI proposal

> From: Alex Williamson <[email protected]>
> Sent: Friday, June 4, 2021 4:42 AM
>
> > 'qemu --allow-no-snoop' makes more sense to me
>
> I'd be tempted to attach it to the -device vfio-pci option, it's
> specific drivers for specific devices that are going to want this and
> those devices may not be permanently attached to the VM. But I see in
> the other thread you're trying to optimize IOMMU page table sharing.
>
> There's a usability question in either case though and I'm not sure how
> to get around it other than QEMU or the kernel knowing a list of
> devices (explicit IDs or vendor+class) to select per device defaults.
>

"-device vfio-pci" is a per-device option, which implies that the
no-snoop choice is given to the admin then no need to maintain
a fixed device list in Qemu?

Thanks
Kevin

2021-06-04 10:26:38

by Jean-Philippe Brucker

[permalink] [raw]
Subject: Re: [RFC] /dev/ioasid uAPI proposal

On Thu, Jun 03, 2021 at 03:45:09PM +1000, David Gibson wrote:
> > > But it would certainly be possible for a system to have two
> > > different host bridges with two different IOMMUs with different
> > > pagetable formats. Until you know which devices (and therefore
> > > which host bridge) you're talking about, you don't know what formats
> > > of pagetable to accept. And if you have devices from *both* bridges
> > > you can't bind a page table at all - you could theoretically support
> > > a kernel managed pagetable by mirroring each MAP and UNMAP to tables
> > > in both formats, but it would be pretty reasonable not to support
> > > that.
> >
> > The basic process for a user space owned pgtable mode would be:
> >
> > 1) qemu has to figure out what format of pgtable to use
> >
> > Presumably it uses query functions using the device label.
>
> No... in the qemu case it would always select the page table format
> that it needs to present to the guest. That's part of the
> guest-visible platform that's selected by qemu's configuration.
>
> There's no negotiation here: either the kernel can supply what qemu
> needs to pass to the guest, or it can't. If it can't qemu, will have
> to either emulate in SW (if possible, probably using a kernel-managed
> IOASID to back it) or fail outright.
>
> > The
> > kernel code should look at the entire device path through all the
> > IOMMU HW to determine what is possible.
> >
> > Or it already knows because the VM's vIOMMU is running in some
> > fixed page table format, or the VM's vIOMMU already told it, or
> > something.
>
> Again, I think you have the order a bit backwards. The user selects
> the capabilities that the vIOMMU will present to the guest as part of
> the qemu configuration. Qemu then requests that of the host kernel,
> and either the host kernel supplies it, qemu emulates it in SW, or
> qemu fails to start.

Hm, how fine a capability are we talking about? If it's just "give me
VT-d capabilities" or "give me Arm capabilities" that would work, but
probably isn't useful. Anything finer will be awkward because userspace
will have to try combinations of capabilities to see what sticks, and
supporting new hardware will drop compatibility for older ones.

For example, depending on whether the hardware IOMMU is SMMUv2 or SMMUv3,
the capabilities offered to the guest change completely (some v2
implementations support nesting page tables, but never PASID nor PRI
unlike v3.) The same vIOMMU could support either, presenting different
capabilities to the guest, even multiple page table formats if we wanted
to be exhaustive (SMMUv2 supports the older 32-bit descriptor), but it
needs to know early on what the hardware is precisely. Then some new page
table format shows up and, although the vIOMMU can support that in
addition to older ones, QEMU will have to pick a single one, that it
assumes the guest knows how to drive?

I think once it binds a device to an IOASID fd, QEMU will want to probe
what hardware features are available before going further with the vIOMMU
setup (is there PASID, PRI, which page table formats are supported,
address size, page granule, etc). Obtaining precise information about the
hardware would be less awkward than trying different configurations until
one succeeds. Binding an additional device would then fail if its pIOMMU
doesn't support exactly the features supported for the first device,
because we don't know which ones the guest will choose. QEMU will have to
open a new IOASID fd for that device.

Thanks,
Jean
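
A sketch of that probing step, again in the RFC's pseudo-code style (the
GET_INFO ioctl and its fields stand in for whatever the final query interface
ends up reporting):

ioctl(device_fd, VFIO_BIND_IOASID_FD, &ioasid_fd);
ioctl(ioasid_fd, IOASID_GET_DEV_INFO, &info);   /* placeholder query */

/* decide what the vIOMMU may expose based on the hardware */
if (info.pasid_bits && info.pri_supported)
        /* vSVA / PRI can be offered to the guest */
if (!(info.pgtable_formats & FORMAT_REQUIRED_BY_VIOMMU))
        /* fall back to a kernel-managed IOASID with shadowing, or fail */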

>
> Guest visible properties of the platform never (or *should* never)
> depend implicitly on host capabilities - it's impossible to sanely
> support migration in such an environment.
>
> > 2) qemu creates an IOASID and based on #1 and says 'I want this format'
>
> Right.
>
> > 3) qemu binds the IOASID to the device.
> >
> > If qmeu gets it wrong then it just fails.
>
> Right, though it may be fall back to (partial) software emulation. In
> practice that would mean using a kernel-managed IOASID and walking the
> guest IO pagetables itself to mirror them into the host kernel.
>
> > 4) For the next device qemu would have to figure out if it can re-use
> > an existing IOASID based on the required proeprties.
>
> Nope. Again, what devices share an IO address space is a guest
> visible part of the platform. If the host kernel can't supply that,
> then qemu must not start (or fail the hotplug if the new device is
> being hotplugged).

by Enrico Weigelt, metux IT consult

[permalink] [raw]
Subject: Re: [RFC] /dev/ioasid uAPI proposal

On 02.06.21 19:24, Jason Gunthorpe wrote:

Hi,

>> If I understand this correctly, /dev/ioasid is a kind of "common
supplier"
>> to other APIs / devices. Why can't the fd be acquired by the
>> consumer APIs (eg. kvm, vfio, etc) ?
>
> /dev/ioasid would be similar to /dev/vfio, and everything already
> deals with exposing /dev/vfio and /dev/vfio/N together
>
> I don't see it as a problem, just more work.

One of the problems I'm seeing is in container environments: when
passing in a vfio device, we now also need to pass in /dev/ioasid,
thus increasing the complexity in container setup (or orchestration).

And in such scenarios you usually want to pass in one specific device,
not all of the same class, and usually orchestration shall pick the
next free one.

Can we make sure that a process having full access to /dev/ioasid,
while only supposed to have access to specific consumer devices, can't do
any harm (e.g. influencing other containers that might use a different
consumer device)?

Note that we don't have device namespaces yet (device isolation still
has to be done w/ complicated bpf magic). I'm already working on that,
but even "simple" things like loopdev allocation turns out to be not
entirely easy.

> Having FDs spawn other FDs is pretty ugly, it defeats the "everything
> is a file" model of UNIX.

Unfortunately, this is already defeated in many other places :(
(I'd even claim that ioctls already break it :p)

It seems your approach also breaks this, since we now need to open two
files in order to talk to one device.

By the way: my idea does keep the "everything's a file" concept - we
just have a file that allows opening "sub-files". Well, it would be
better if devices could also have directory semantics.


--mtx

---
Note: unencrypted e-mails can easily be intercepted and manipulated!
For confidential communication, please send your GPG/PGP key.
---
Enrico Weigelt, metux IT consult
Free software and Linux embedded engineering
[email protected] -- +49-151-27565287

2021-06-04 11:59:29

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [RFC] /dev/ioasid uAPI proposal

On Fri, Jun 04, 2021 at 09:11:03AM +0800, Jason Wang wrote:
> > nor do any virtio drivers implement the required platform specific
> > cache flushing to make no-snoop TLPs work.
>
> I don't get why virtio drivers needs to do that. I think DMA API should hide
> those arch/platform specific stuffs from us.

It is not arch/platform stuff. If the device uses no-snoop then a
very platform specific recovery is required in the device driver.

It is not part of the normal DMA API, it is side APIs like
flush_agp_cache() or wbinvd() that are used by GPU drivers only.

If drivers/virtio doesn't explicitly call these things it doesn't
support no-snoop - hence no VDPA device can ever use no-snoop.

Since VIRTIO_F_ACCESS_PLATFORM doesn't trigger wbinvd on x86 it has
nothing to do with no-snoop.

Jason

2021-06-04 12:07:41

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [RFC] /dev/ioasid uAPI proposal

On Fri, Jun 04, 2021 at 12:24:08PM +0200, Jean-Philippe Brucker wrote:

> I think once it binds a device to an IOASID fd, QEMU will want to probe
> what hardware features are available before going further with the vIOMMU
> setup (is there PASID, PRI, which page table formats are supported,

I think David's point was that qemu should be told what vIOMMU it is
emulating exactly (right down to what features it has) and then
the goal is simply to match what the vIOMMU needs with direct HW
support via /dev/ioasid and fall back to SW emulation when not
possible.

If qemu wants to have some auto-configuration: 'pass host IOMMU
capabilities' similar to the CPU flags then qemu should probe the
/dev/ioasid - and maybe we should just return some highly rolled up
"this is IOMMU HW ID ARM SMMU vXYZ" out of some query to guide qemu in
doing this.

Jason

2021-06-04 12:11:25

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [RFC] /dev/ioasid uAPI proposal

On Fri, Jun 04, 2021 at 06:37:26AM +0000, Tian, Kevin wrote:
> > From: Jason Gunthorpe
> > Sent: Thursday, June 3, 2021 9:05 PM
> >
> > > >
> > > > 3) Device accepts any PASIDs from the guest. No
> > > > vPASID/pPASID translation is possible. (classic vfio_pci)
> > > > 4) Device accepts any PASID from the guest and has an
> > > > internal vPASID/pPASID translation (enhanced vfio_pci)
> > >
> > > what is enhanced vfio_pci? In my writing this is for mdev
> > > which doesn't support ENQCMD
> >
> > This is a vfio_pci that mediates some element of the device interface
> > to communicate the vPASID/pPASID table to the device, using Max's
> > series for vfio_pci drivers to inject itself into VFIO.
> >
> > For instance a device might send a message through the PF that the VF
> > has a certain vPASID/pPASID translation table. This would be useful
> > for devices that cannot use ENQCMD but still want to support migration
> > and thus need vPASID.
>
> I still don't quite get. If it's a PCI device why is PASID translation required?
> Just delegate the per-RID PASID space to user as type-3 then migrating the
> vPASID space is just straightforward.

This is only possible if we get rid of the global pPASID allocation
(honestly, that is my preference as it makes the HW a lot simpler).

Without vPASID the migration would need pPASIDs on the RID that are
guaranteed free.

Jason

2021-06-04 12:15:54

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [RFC] /dev/ioasid uAPI proposal

On Thu, Jun 03, 2021 at 02:41:36PM -0600, Alex Williamson wrote:

> Could you clarify "vfio_driver"?

This is the thing providing the vfio_device_ops function pointers.

So vfio-pci can't know anything about this (although your no-snoop
control probing idea makes sense to me)

But vfio_mlx5_pci can know

So can mdev_idxd

And kvmgt

Jason

2021-06-04 12:31:10

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [RFC] /dev/ioasid uAPI proposal

On Fri, Jun 04, 2021 at 08:38:26AM +0000, Tian, Kevin wrote:
> > I think more to drive the replacement design; if we can't figure out
> > how to do something other than backwards compatibility trickery in the
> > kernel, it's probably going to bite us. Thanks,
>
> I'm a bit lost on the desired flow in your minds. Here is one flow based
> on my understanding of this discussion. Please comment whether it
> matches your thinking:
>
> 0) ioasid_fd is created and registered to KVM via KVM_ADD_IOASID_FD;
>
> 1) Qemu binds dev1 to ioasid_fd;
>
> 2) Qemu calls IOASID_GET_DEV_INFO for dev1. This will carry IOMMU_
> CACHE info i.e. whether underlying IOMMU can enforce snoop;
>
> 3) Qemu plans to create a gpa_ioasid, and attach dev1 to it. Here Qemu
> needs to figure out whether dev1 wants to do no-snoop. This might
> be based a fixed vendor/class list or specified by user;
>
> 4) gpa_ioasid = ioctl(ioasid_fd, IOASID_ALLOC); At this point a 'snoop'
> flag is specified to decide the page table format, which is supposed
> to match dev1;

> 5) Qemu attaches dev1 to gpa_ioasid via VFIO_ATTACH_IOASID. At this
> point, specify snoop/no-snoop again. If not supported by related
> iommu or different from what gpa_ioasid has, attach fails.

Why do we need to specify it again?

If the IOASID was created with the "block no-snoop" flag then it is
blocked in that IOASID, and that blocking sets the page table format.

The only question is if we can successfully attach a device to the
page table, or not.

The KVM interface is a bit tricky because Alex said this is partially about
security: wbinvd is only enabled if someone has an FD to a device that
can support no-snoop.

Personally I think this got way too complicated, the KVM interface
should simply be

ioctl(KVM_ALLOW_INCOHERENT_DMA, ioasidfd, device_label)
ioctl(KVM_DISALLOW_INCOHERENT_DMA, ioasidfd, device_label)

and let qemu sort it out based on command flags, detection, whatever.

'ioasidfd, device_label' is the security proof that Alex asked
for. This needs to be some device in the ioasidfd that declares it is
capable of no-snoop. E.g. vfio_pci would always declare it is capable
of no-snoop.

No kernel callbacks, no kernel auto-sync/etc. If qemu mismatches the
IOASID block no-snoop flag with the KVM_x_INCOHERENT_DMA state then it
is just a kernel-harmless userspace bug.

Then user space can decide which of the various axis's it wants to
optimize for.

Jason
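
Usage from qemu would then be as simple as the following (pseudo-code;
KVM_ALLOW/DISALLOW_INCOHERENT_DMA are the interface proposed here, not
existing KVM ioctls):

/* device_label names a device in the ioasid fd that declared it is
 * capable of no-snoop; this is the security proof */
ioctl(kvm_fd, KVM_ALLOW_INCOHERENT_DMA, ioasidfd, device_label);
/* guest wbinvd now actually flushes caches */

/* if qemu instead asked /dev/ioasid to block no-snoop for every IOASID
 * this device uses, it can leave wbinvd as a nop */
ioctl(kvm_fd, KVM_DISALLOW_INCOHERENT_DMA, ioasidfd, device_label);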

2021-06-04 12:34:46

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [RFC] /dev/ioasid uAPI proposal

On Fri, Jun 04, 2021 at 12:44:28PM +0200, Enrico Weigelt, metux IT consult wrote:
> On 02.06.21 19:24, Jason Gunthorpe wrote:
>
> Hi,
>
> >> If I understand this correctly, /dev/ioasid is a kind of "common
> supplier"
> >> to other APIs / devices. Why can't the fd be acquired by the
> >> consumer APIs (eg. kvm, vfio, etc) ?
> >
> > /dev/ioasid would be similar to /dev/vfio, and everything already
> > deals with exposing /dev/vfio and /dev/vfio/N together
> >
> > I don't see it as a problem, just more work.
>
> One of the problems I'm seeing is in container environments: when
> passing in an vfio device, we now also need to pass in /dev/ioasid,
> thus increasing the complexity in container setup (or orchestration).

Containers already needed to do this today. Container orchestration is
hard.

> And in such scenarios you usually want to pass in one specific device,
> not all of the same class, and usually orchestration shall pick the
> next free one.
>
> Can we make sure that a process having full access to /dev/ioasid
> while only supposed to have to specific consumer devices, can't do
> any harm (eg. influencing other containers that might use a different
> consumer device) ?

Yes, /dev/ioasid shouldn't do anything unless you have a device to
connect it with. In this way it is probably safe to stuff it into
every container.

> > Having FDs spawn other FDs is pretty ugly, it defeats the "everything
> > is a file" model of UNIX.
>
> Unfortunately, this is already defeated in many other places :(
> (I'd even claim that ioctls already break it :p)

I think you are reaching a bit :)

> It seems your approach also breaks this, since we now need to open two
> files in order to talk to one device.

It is two devices, thus two files.

Jason

2021-06-04 12:35:41

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [RFC] /dev/ioasid uAPI proposal

On Fri, Jun 04, 2021 at 06:08:28AM +0000, Tian, Kevin wrote:

> In Qemu case the problem is that it doesn't know the list of devices
> that will be attached to an IOASID when it's created. This is a guest-
> side knowledge which is conveyed one device at a time to Qemu
> though vIOMMU.

At least for the guest side it is a lot simpler because the vIOMMU
being emulated will define nearly everything.

qemu will just have to ask the kernel for whatever it is the guest is
doing. If the kernel can't do it then qemu has to SW emulate.

The no-snoop block may be the only thing that is under qemu's control
because it is transparent to the guest.

This will probably become clearer as people start to define what the
get_info should return.

Jason

2021-06-04 15:29:55

by Alex Williamson

[permalink] [raw]
Subject: Re: [RFC] /dev/ioasid uAPI proposal

[Cc +Paolo]

On Fri, 4 Jun 2021 09:28:30 -0300
Jason Gunthorpe <[email protected]> wrote:

> On Fri, Jun 04, 2021 at 08:38:26AM +0000, Tian, Kevin wrote:
> > > I think more to drive the replacement design; if we can't figure out
> > > how to do something other than backwards compatibility trickery in the
> > > kernel, it's probably going to bite us. Thanks,
> >
> > I'm a bit lost on the desired flow in your minds. Here is one flow based
> > on my understanding of this discussion. Please comment whether it
> > matches your thinking:
> >
> > 0) ioasid_fd is created and registered to KVM via KVM_ADD_IOASID_FD;
> >
> > 1) Qemu binds dev1 to ioasid_fd;
> >
> > 2) Qemu calls IOASID_GET_DEV_INFO for dev1. This will carry IOMMU_
> > CACHE info i.e. whether underlying IOMMU can enforce snoop;
> >
> > 3) Qemu plans to create a gpa_ioasid, and attach dev1 to it. Here Qemu
> > needs to figure out whether dev1 wants to do no-snoop. This might
> > be based a fixed vendor/class list or specified by user;
> >
> > 4) gpa_ioasid = ioctl(ioasid_fd, IOASID_ALLOC); At this point a 'snoop'
> > flag is specified to decide the page table format, which is supposed
> > to match dev1;
>
> > 5) Qemu attaches dev1 to gpa_ioasid via VFIO_ATTACH_IOASID. At this
> > point, specify snoop/no-snoop again. If not supported by related
> > iommu or different from what gpa_ioasid has, attach fails.
>
> Why do we need to specify it again?

My thought as well.

> If the IOASID was created with the "block no-snoop" flag then it is
> blocked in that IOASID, and that blocking sets the page table format.
>
> The only question is if we can successfully attach a device to the
> page table, or not.
>
> The KVM interface is a bit tricky because Alex said this is partially
> security, wbinvd is only enabled if someone has a FD to a device that
> can support no-snoop.
>
> Personally I think this got way too complicated, the KVM interface
> should simply be
>
> ioctl(KVM_ALLOW_INCOHERENT_DMA, ioasidfd, device_label)
> ioctl(KVM_DISALLOW_INCOHERENT_DMA, ioasidfd, device_label)
>
> and let qemu sort it out based on command flags, detection, whatever.
>
> 'ioasidfd, device_label' is the security proof that Alex asked
> for. This needs to be some device in the ioasidfd that declares it is
> capabale of no-snoop. Eg vfio_pci would always declare it is capable
> of no-snoop.
>
> No kernel call backs, no kernel auto-sync/etc. If qemu mismatches the
> IOASID block no-snoop flag with the KVM_x_INCOHERENT_DMA state then it
> is just a kernel-harmless uerspace bug.
>
> Then user space can decide which of the various axis's it wants to
> optimize for.

Let's make sure the KVM folks are part of this decision; a re-cap for
them, KVM currently automatically enables wbinvd emulation when
potentially non-coherent devices are present which is determined solely
based on the IOMMU's (or platform's, as exposed via the IOMMU) ability
to essentially force no-snoop transactions from a device to be cache
coherent. This synchronization is triggered via the kvm-vfio device,
where QEMU creates the device and adds/removes vfio group fd
descriptors as an additional layer to prevent the user from enabling
wbinvd emulation on a whim.

IIRC, this latter association was considered a security/DoS issue to
prevent a malicious guest/userspace from creating a disproportionate
system load.

Where would KVM stand on allowing more direct userspace control of
wbinvd behavior? Would arbitrary control be acceptable or should we
continue to require it only in association to a device requiring it for
correct operation.

A wrinkle in "correct operation" is that while the IOMMU may be able to
force no-snoop transactions to be coherent, in the scenario described
in the previous reply, the user may intend to use non-coherent DMA
regardless of the IOMMU capabilities due to their own optimization
policy. There's a whole spectrum here, including aspects we can't
determine around the device driver's intentions to use non-coherent
transactions, the user's policy in trading hypervisor overhead for
cache coherence overhead, etc. Thanks,

Alex

2021-06-04 15:40:11

by Alex Williamson

[permalink] [raw]
Subject: Re: [RFC] /dev/ioasid uAPI proposal

On Fri, 4 Jun 2021 09:19:50 +0000
"Tian, Kevin" <[email protected]> wrote:

> > From: Alex Williamson <[email protected]>
> > Sent: Friday, June 4, 2021 4:42 AM
> >
> > > 'qemu --allow-no-snoop' makes more sense to me
> >
> > I'd be tempted to attach it to the -device vfio-pci option, it's
> > specific drivers for specific devices that are going to want this and
> > those devices may not be permanently attached to the VM. But I see in
> > the other thread you're trying to optimize IOMMU page table sharing.
> >
> > There's a usability question in either case though and I'm not sure how
> > to get around it other than QEMU or the kernel knowing a list of
> > devices (explicit IDs or vendor+class) to select per device defaults.
> >
>
> "-device vfio-pci" is a per-device option, which implies that the
> no-snoop choice is given to the admin then no need to maintain
> a fixed device list in Qemu?

I think we want to look at where we put it to have the best default
user experience. For example the QEMU vfio-pci device option could use
on/off/auto semantics where auto is the default and QEMU maintains a
list of IDs or vendor/class configurations where we've determined the
"optimal" auto configuration. Management tools could provide an
override, but we're imposing some pretty technical requirements for a
management tool to be able to come up with good per device defaults.
Seems like we should consolidate that technical decision in one place.
Thanks,

Alex

2021-06-04 15:42:03

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [RFC] /dev/ioasid uAPI proposal

On 04/06/21 17:26, Alex Williamson wrote:
> Let's make sure the KVM folks are part of this decision; a re-cap for
> them, KVM currently automatically enables wbinvd emulation when
> potentially non-coherent devices are present which is determined solely
> based on the IOMMU's (or platform's, as exposed via the IOMMU) ability
> to essentially force no-snoop transactions from a device to be cache
> coherent. This synchronization is triggered via the kvm-vfio device,
> where QEMU creates the device and adds/removes vfio group fd
> descriptors as an additionally layer to prevent the user from enabling
> wbinvd emulation on a whim.
>
> IIRC, this latter association was considered a security/DoS issue to
> prevent a malicious guest/userspace from creating a disproportionate
> system load.
>
> Where would KVM stand on allowing more direct userspace control of
> wbinvd behavior? Would arbitrary control be acceptable or should we
> continue to require it only in association to a device requiring it for
> correct operation.

Extending the scenarios where WBINVD is not a nop is not a problem for
me. If possible I wouldn't mind keeping the existing kvm-vfio
connection via the device, if only because then the decision remains in
the VFIO camp (whose judgment I trust more than mine on this kind of issue).

For example, would it make sense if *VFIO* (not KVM) gets an API that
says "I am going to do incoherent DMA"? Then that API causes WBINVD to
become not-a-nop even on otherwise coherent platforms. (Would this make
sense at all without a hypervisor that indirectly lets userspace execute
WBINVD? Perhaps VFIO would benefit from a WBINVD ioctl too).

Paolo

2021-06-04 15:52:19

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [RFC] /dev/ioasid uAPI proposal

On Fri, Jun 04, 2021 at 05:40:34PM +0200, Paolo Bonzini wrote:
> On 04/06/21 17:26, Alex Williamson wrote:
> > Let's make sure the KVM folks are part of this decision; a re-cap for
> > them, KVM currently automatically enables wbinvd emulation when
> > potentially non-coherent devices are present which is determined solely
> > based on the IOMMU's (or platform's, as exposed via the IOMMU) ability
> > to essentially force no-snoop transactions from a device to be cache
> > coherent. This synchronization is triggered via the kvm-vfio device,
> > where QEMU creates the device and adds/removes vfio group fd
> > descriptors as an additionally layer to prevent the user from enabling
> > wbinvd emulation on a whim.
> >
> > IIRC, this latter association was considered a security/DoS issue to
> > prevent a malicious guest/userspace from creating a disproportionate
> > system load.
> >
> > Where would KVM stand on allowing more direct userspace control of
> > wbinvd behavior? Would arbitrary control be acceptable or should we
> > continue to require it only in association to a device requiring it for
> > correct operation.
>
> Extending the scenarios where WBINVD is not a nop is not a problem for me.
> If possible I wouldn't mind keeping the existing kvm-vfio connection via the
> device, if only because then the decision remains in the VFIO camp (whose
> judgment I trust more than mine on this kind of issue).

Really the question to answer is what "security proof" do you want
before the wbinvd can be enabled

1) User has access to a device that can issue no-snoop TLPs
2) User has access to an IOMMU that cannot block no-snoop (today)
3) Require CAP_SYS_RAWIO
4) Anyone

#1 is an improvement because it allows userspace to enable wbinvd and
no-snoop optimizations based on user choice

#2 is where we are today and wbinvd effectively becomes a fixed
platform choice. Userspace has no say

#3 is "there is a problem, but not so serious, root is powerful
enough to override"

#4 is "there is no problem here"

Jason

2021-06-04 15:58:41

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [RFC] /dev/ioasid uAPI proposal

On 04/06/21 17:50, Jason Gunthorpe wrote:
>> Extending the scenarios where WBINVD is not a nop is not a problem for me.
>> If possible I wouldn't mind keeping the existing kvm-vfio connection via the
>> device, if only because then the decision remains in the VFIO camp (whose
>> judgment I trust more than mine on this kind of issue).
> Really the question to answer is what "security proof" do you want
> before the wbinvd can be enabled

I don't want a security proof myself; I want to trust VFIO to make the
right judgment and I'm happy to defer to it (via the KVM-VFIO device).

Given how KVM is just a device driver inside Linux, VMs should be a
slightly more roundabout way to do stuff that is accessible to bare
metal; not a way to gain extra privilege.

Paolo

> 1) User has access to a device that can issue no-snoop TLPS
> 2) User has access to an IOMMU that can not block no-snoop (today)
> 3) Require CAP_SYS_RAW_IO
> 4) Anyone
>
> #1 is an improvement because it allows userspace to enable wbinvd and
> no-snoop optimizations based on user choice
>
> #2 is where we are today and wbinvd effectively becomes a fixed
> platform choice. Userspace has no say
>
> #3 is "there is a problem, but not so serious, root is powerful
> enough to override"

2021-06-04 16:07:56

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [RFC] /dev/ioasid uAPI proposal

On Fri, Jun 04, 2021 at 05:57:19PM +0200, Paolo Bonzini wrote:
> On 04/06/21 17:50, Jason Gunthorpe wrote:
> > > Extending the scenarios where WBINVD is not a nop is not a problem for me.
> > > If possible I wouldn't mind keeping the existing kvm-vfio connection via the
> > > device, if only because then the decision remains in the VFIO camp (whose
> > > judgment I trust more than mine on this kind of issue).
> > Really the question to answer is what "security proof" do you want
> > before the wbinvd can be enabled
>
> I don't want a security proof myself; I want to trust VFIO to make the right
> judgment and I'm happy to defer to it (via the KVM-VFIO device).
>
> Given how KVM is just a device driver inside Linux, VMs should be a slightly
> more roundabout way to do stuff that is accessible to bare metal; not a way
> to gain extra privilege.

Okay, fine, let's turn the question on its head then.

VFIO should provide an IOCTL VFIO_EXECUTE_WBINVD so that userspace VFIO
application can make use of no-snoop optimizations. The ability of KVM
to execute wbinvd should be tied to the ability of that IOCTL to run
in a normal process context.

So, under what conditions do we want to allow VFIO to give a process
elevated access to the CPU:

> > 1) User has access to a device that can issue no-snoop TLPS
> > 2) User has access to an IOMMU that can not block no-snoop (today)
> > 3) Require CAP_SYS_RAW_IO
> > 4) Anyone

Jason

2021-06-04 16:15:05

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [RFC] /dev/ioasid uAPI proposal

On 04/06/21 18:03, Jason Gunthorpe wrote:
> On Fri, Jun 04, 2021 at 05:57:19PM +0200, Paolo Bonzini wrote:
>> I don't want a security proof myself; I want to trust VFIO to make the right
>> judgment and I'm happy to defer to it (via the KVM-VFIO device).
>>
>> Given how KVM is just a device driver inside Linux, VMs should be a slightly
>> more roundabout way to do stuff that is accessible to bare metal; not a way
>> to gain extra privilege.
>
> Okay, fine, lets turn the question on its head then.
>
> VFIO should provide a IOCTL VFIO_EXECUTE_WBINVD so that userspace VFIO
> application can make use of no-snoop optimizations. The ability of KVM
> to execute wbinvd should be tied to the ability of that IOCTL to run
> in a normal process context.
>
> So, under what conditions do we want to allow VFIO to giave a process
> elevated access to the CPU:

Ok, I would definitely not want to tie it *only* to CAP_SYS_RAWIO (i.e.
#2+#3 would be worse than what we have today), but IIUC the proposal
(was it yours or Kevin's?) was to keep #2 and add #1 with an
enable/disable ioctl, which then would be on VFIO and not on KVM. I
assumed Alex was more or less okay with it, given he included me in the
discussion.

If later y'all switch to "it's always okay to issue the enable/disable
ioctl", I guess the rationale would be documented in the commit message.

Paolo

>>> 1) User has access to a device that can issue no-snoop TLPS
>>> 2) User has access to an IOMMU that can not block no-snoop (today)
>>> 3) Require CAP_SYS_RAW_IO
>>> 4) Anyone
>
> Jason
>

2021-06-04 16:23:42

by Jacob Pan

[permalink] [raw]
Subject: Re: [RFC] /dev/ioasid uAPI proposal

Hi Jason,

On Fri, 4 Jun 2021 09:30:37 +0800, Jason Wang <[email protected]> wrote:

> On 2021/6/4 2:19 AM, Jacob Pan wrote:
> > Hi Shenming,
> >
> > On Wed, 2 Jun 2021 12:50:26 +0800, Shenming Lu <[email protected]>
> > wrote:
> >
> >> On 2021/6/2 1:33, Jason Gunthorpe wrote:
> >>> On Tue, Jun 01, 2021 at 08:30:35PM +0800, Lu Baolu wrote:
> >>>
> >>>> The drivers register per page table fault handlers to /dev/ioasid
> >>>> which will then register itself to iommu core to listen and route
> >>>> the per- device I/O page faults.
> >>> I'm still confused why drivers need fault handlers at all?
> >> Essentially it is the userspace that needs the fault handlers,
> >> one case is to deliver the faults to the vIOMMU, and another
> >> case is to enable IOPF on the GPA address space for on-demand
> >> paging, it seems that both could be specified in/through the
> >> IOASID_ALLOC ioctl?
> >>
> > I would think IOASID_BIND_PGTABLE is where fault handler should be
> > registered. There wouldn't be any IO page fault without the binding
> > anyway.
> >
> > I also don't understand why device drivers should register the fault
> > handler, the fault is detected by the pIOMMU and injected to the
> > vIOMMU. So I think it should be the IOASID itself register the handler.
> >
>
>
> As discussed in another thread.
>
> I think the reason is that ATS doesn't forbid the #PF to be reported via
> a device specific way.
>
Yes, in that case we should support both. Give the device driver a chance
to handle the IOPF if it can.

> Thanks
>
>
> >
> >> Thanks,
> >> Shenming
> >>
> >
> > Thanks,
> >
> > Jacob
> >
>


Thanks,

Jacob

2021-06-04 16:25:58

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [RFC] /dev/ioasid uAPI proposal

On Fri, Jun 04, 2021 at 09:22:43AM -0700, Jacob Pan wrote:
> Hi Jason,
>
> On Fri, 4 Jun 2021 09:30:37 +0800, Jason Wang <[email protected]> wrote:
>
> > On 2021/6/4 2:19 AM, Jacob Pan wrote:
> > > Hi Shenming,
> > >
> > > On Wed, 2 Jun 2021 12:50:26 +0800, Shenming Lu <[email protected]>
> > > wrote:
> > >
> > >> On 2021/6/2 1:33, Jason Gunthorpe wrote:
> > >>> On Tue, Jun 01, 2021 at 08:30:35PM +0800, Lu Baolu wrote:
> > >>>
> > >>>> The drivers register per page table fault handlers to /dev/ioasid
> > >>>> which will then register itself to iommu core to listen and route
> > >>>> the per- device I/O page faults.
> > >>> I'm still confused why drivers need fault handlers at all?
> > >> Essentially it is the userspace that needs the fault handlers,
> > >> one case is to deliver the faults to the vIOMMU, and another
> > >> case is to enable IOPF on the GPA address space for on-demand
> > >> paging, it seems that both could be specified in/through the
> > >> IOASID_ALLOC ioctl?
> > >>
> > > I would think IOASID_BIND_PGTABLE is where fault handler should be
> > > registered. There wouldn't be any IO page fault without the binding
> > > anyway.
> > >
> > > I also don't understand why device drivers should register the fault
> > > handler, the fault is detected by the pIOMMU and injected to the
> > > vIOMMU. So I think it should be the IOASID itself register the handler.
> > >
> >
> >
> > As discussed in another thread.
> >
> > I think the reason is that ATS doesn't forbid the #PF to be reported via
> > a device specific way.
>
> Yes, in that case we should support both. Give the device driver a chance
> to handle the IOPF if it can.

Huh?

The device driver does not "handle the IOPF"; the device driver might
inject the IOPF.

Jason

2021-06-04 17:24:36

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [RFC] /dev/ioasid uAPI proposal

On Fri, Jun 04, 2021 at 06:10:51PM +0200, Paolo Bonzini wrote:
> On 04/06/21 18:03, Jason Gunthorpe wrote:
> > On Fri, Jun 04, 2021 at 05:57:19PM +0200, Paolo Bonzini wrote:
> > > I don't want a security proof myself; I want to trust VFIO to make the right
> > > judgment and I'm happy to defer to it (via the KVM-VFIO device).
> > >
> > > Given how KVM is just a device driver inside Linux, VMs should be a slightly
> > > more roundabout way to do stuff that is accessible to bare metal; not a way
> > > to gain extra privilege.
> >
> > Okay, fine, lets turn the question on its head then.
> >
> > VFIO should provide a IOCTL VFIO_EXECUTE_WBINVD so that userspace VFIO
> > application can make use of no-snoop optimizations. The ability of KVM
> > to execute wbinvd should be tied to the ability of that IOCTL to run
> > in a normal process context.
> >
> > So, under what conditions do we want to allow VFIO to giave a process
> > elevated access to the CPU:
>
> Ok, I would definitely not want to tie it *only* to CAP_SYS_RAWIO (i.e.
> #2+#3 would be worse than what we have today), but IIUC the proposal (was it
> yours or Kevin's?) was to keep #2 and add #1 with an enable/disable ioctl,
> which then would be on VFIO and not on KVM.

At the end of the day we need an ioctl with two arguments:
- The 'security proof' FD (ie /dev/vfio/XX, or /dev/ioasid, or whatever)
- The KVM FD to control wbinvd support on

Philosophically it doesn't matter too much which subsystem that ioctl
lives, but we have these obnoxious cross module dependencies to
consider..

Framing the question, as you have, to be about the process, I think
explains why KVM doesn't really care what is decided, so long as the
process and the VM have equivalent rights.

Alex, how about a more fleshed out suggestion:

1) When the device is attached to the IOASID via VFIO_ATTACH_IOASID
it communicates its no-snoop configuration:
- 0 enable, allow WBINVD
- 1 automatic disable, block WBINVD if the platform
IOMMU can police it (what we do today)
 - 2 force disable, do not allow WBINVD ever

vfio_pci may want to take this from an admin configuration knob
someplace. It allows the admin to customize if they want.

If we can figure out a way to autodetect 2 from vfio_pci, all the
better.

2) There is some IOMMU_EXECUTE_WBINVD IOCTL that allows userspace
to access wbinvd so it can make use of the no snoop optimization.

wbinvd is allowed when:
- A device is joined with mode #0
- A device is joined with mode #1 and the IOMMU cannot block
no-snoop (today)

3) The IOASIDs don't care about this at all. If IOMMU_EXECUTE_WBINVD
is blocked and userspace doesn't request to block no-snoop in the
IOASID then it is a userspace error.

4) The KVM interface is the very simple enable/disable WBINVD.
Possessing a FD that can do IOMMU_EXECUTE_WBINVD is required
to enable WBINVD at KVM.

It is pretty simple from a /dev/ioasid perspective, covers today's
compat requirement, gives some future option to allow the no-snoop
optimization, and gives a new option for qemu to totally block wbinvd
no matter what.
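
For concreteness, a minimal userspace sketch of that flow. Only the names
VFIO_ATTACH_IOASID and IOMMU_EXECUTE_WBINVD come from the proposal above;
every ioctl number, struct layout and the KVM_ENABLE_WBINVD hook are
placeholders invented for illustration and do not exist today:

#include <sys/ioctl.h>
#include <linux/types.h>

/* Placeholder ioctl numbers; nothing below is a defined interface. */
#define VFIO_ATTACH_IOASID      _IO('V', 200)
#define IOMMU_EXECUTE_WBINVD    _IO(';', 201)
#define KVM_ENABLE_WBINVD       _IO(0xAE, 202)

struct vfio_attach_ioasid {     /* hypothetical layout */
    __u32 argsz;
    __u32 flags;
    __s32 ioasid_fd;
    __u32 ioasid;
    __u32 nosnoop_mode;         /* 0 allow, 1 automatic, 2 force disable */
};

static int setup_wbinvd(int device_fd, int kvm_vm_fd,
                        struct vfio_attach_ioasid *at)
{
    /* 1) the device reports its no-snoop configuration at attach time */
    if (ioctl(device_fd, VFIO_ATTACH_IOASID, at))
        return -1;

    /* 2) wbinvd access is granted (or not) per the modes above */
    if (ioctl(at->ioasid_fd, IOMMU_EXECUTE_WBINVD, 0))
        return 0;               /* blocked: run the guest without wbinvd */

    /* 4) an FD that can do IOMMU_EXECUTE_WBINVD is the proof KVM needs */
    return ioctl(kvm_vm_fd, KVM_ENABLE_WBINVD, at->ioasid_fd);
}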

Jason

2021-06-04 17:26:36

by Jacob Pan

[permalink] [raw]
Subject: Re: [RFC] /dev/ioasid uAPI proposal

Hi Jason,

On Fri, 4 Jun 2021 09:05:55 -0300, Jason Gunthorpe <[email protected]> wrote:

> On Fri, Jun 04, 2021 at 12:24:08PM +0200, Jean-Philippe Brucker wrote:
>
> > I think once it binds a device to an IOASID fd, QEMU will want to probe
> > what hardware features are available before going further with the
> > vIOMMU setup (is there PASID, PRI, which page table formats are
> > supported,
>
> I think David's point was that qemu should be told what vIOMMU it is
> emulating exactly (right down to what features it has) and then
> the goal is simply to match what the vIOMMU needs with direct HW
> support via /dev/ioasid and fall back to SW emulation when not
> possible.
>
> If qemu wants to have some auto-configuration: 'pass host IOMMU
> capabilities' similar to the CPU flags then qemu should probe the
> /dev/ioasid - and maybe we should just return some highly rolled up
> "this is IOMMU HW ID ARM SMMU vXYZ" out of some query to guide qemu in
> doing this.
>
There can be mixed types of physical IOMMUs on the host. So until a
device is attached, we would not know whether the vIOMMU can match the HW
support of the device's IOMMU. Perhaps the vIOMMU should check the
least common denominator of features before committing.

Thanks,

Jacob

2021-06-04 17:41:48

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [RFC] /dev/ioasid uAPI proposal

On Fri, Jun 04, 2021 at 10:27:43AM -0700, Jacob Pan wrote:
> Hi Jason,
>
> On Fri, 4 Jun 2021 09:05:55 -0300, Jason Gunthorpe <[email protected]> wrote:
>
> > On Fri, Jun 04, 2021 at 12:24:08PM +0200, Jean-Philippe Brucker wrote:
> >
> > > I think once it binds a device to an IOASID fd, QEMU will want to probe
> > > what hardware features are available before going further with the
> > > vIOMMU setup (is there PASID, PRI, which page table formats are
> > > supported,
> >
> > I think David's point was that qemu should be told what vIOMMU it is
> > emulating exactly (right down to what features it has) and then
> > the goal is simply to match what the vIOMMU needs with direct HW
> > support via /dev/ioasid and fall back to SW emulation when not
> > possible.
> >
> > If qemu wants to have some auto-configuration: 'pass host IOMMU
> > capabilities' similar to the CPU flags then qemu should probe the
> > /dev/ioasid - and maybe we should just return some highly rolled up
> > "this is IOMMU HW ID ARM SMMU vXYZ" out of some query to guide qemu in
> > doing this.
> >
> There can be mixed types of physical IOMMUs on the host. So not until a
> device is attached, we would not know if the vIOMMU can match the HW
> support of the device's IOMMU. Perhaps, vIOMMU should check the
> least common denominator features before commit.

qemu has to set the vIOMMU at VM startup time, so if it is running in
some "copy host" mode the only thing it can do is evaluate the VFIO
devices that are present at boot and select a vIOMMU from that list.

Probably it would pick the most capable physical IOMMU and
software-emulate the rest.

Platforms really should avoid creating wildly divergent IOMMUs in the
same system if they want to support virtualization effectively.

Jason

2021-06-04 18:07:09

by Jacob Pan

[permalink] [raw]
Subject: Re: [RFC] /dev/ioasid uAPI proposal

Hi Jason,

On Fri, 4 Jun 2021 13:22:00 -0300, Jason Gunthorpe <[email protected]> wrote:

> >
> > Yes, in that case we should support both. Give the device driver a
> > chance to handle the IOPF if it can.
>
> Huh?
>
> The device driver does not "handle the IOPF" the device driver might
> inject the IOPF.
You are right, I got confused with the native case where device drivers can
handle the fault, or do something about it.

Thanks,

Jacob

2021-06-04 21:33:18

by Alex Williamson

[permalink] [raw]
Subject: Re: [RFC] /dev/ioasid uAPI proposal

On Fri, 4 Jun 2021 14:22:07 -0300
Jason Gunthorpe <[email protected]> wrote:

> On Fri, Jun 04, 2021 at 06:10:51PM +0200, Paolo Bonzini wrote:
> > On 04/06/21 18:03, Jason Gunthorpe wrote:
> > > On Fri, Jun 04, 2021 at 05:57:19PM +0200, Paolo Bonzini wrote:
> > > > I don't want a security proof myself; I want to trust VFIO to make the right
> > > > judgment and I'm happy to defer to it (via the KVM-VFIO device).
> > > >
> > > > Given how KVM is just a device driver inside Linux, VMs should be a slightly
> > > > more roundabout way to do stuff that is accessible to bare metal; not a way
> > > > to gain extra privilege.
> > >
> > > Okay, fine, lets turn the question on its head then.
> > >
> > > VFIO should provide a IOCTL VFIO_EXECUTE_WBINVD so that userspace VFIO
> > > application can make use of no-snoop optimizations. The ability of KVM
> > > to execute wbinvd should be tied to the ability of that IOCTL to run
> > > in a normal process context.
> > >
> > > So, under what conditions do we want to allow VFIO to giave a process
> > > elevated access to the CPU:
> >
> > Ok, I would definitely not want to tie it *only* to CAP_SYS_RAWIO (i.e.
> > #2+#3 would be worse than what we have today), but IIUC the proposal (was it
> > yours or Kevin's?) was to keep #2 and add #1 with an enable/disable ioctl,
> > which then would be on VFIO and not on KVM.
>
> At the end of the day we need an ioctl with two arguments:
> - The 'security proof' FD (ie /dev/vfio/XX, or /dev/ioasid, or whatever)
> - The KVM FD to control wbinvd support on
>
> Philosophically it doesn't matter too much which subsystem that ioctl
> lives, but we have these obnoxious cross module dependencies to
> consider..
>
> Framing the question, as you have, to be about the process, I think
> explains why KVM doesn't really care what is decided, so long as the
> process and the VM have equivalent rights.
>
> Alex, how about a more fleshed out suggestion:
>
> 1) When the device is attached to the IOASID via VFIO_ATTACH_IOASID
> it communicates its no-snoop configuration:

Communicates to whom?

> - 0 enable, allow WBINVD
> - 1 automatic disable, block WBINVD if the platform
> IOMMU can police it (what we do today)
> - 2 force disable, do not allow BINVD ever

The only thing we know about the device is whether or not Enable
No-snoop is hard wired to zero, ie. it either can't generate no-snoop
TLPs ("coherent-only") or it might ("assumed non-coherent"). If
we're putting the policy decision in the hands of userspace they should
have access to wbinvd if they own a device that is assumed
non-coherent AND it's attached to an IOMMU (page table) that is not
blocking no-snoop (a "non-coherent IOASID").

I think that means that the IOASID needs to be created (IOASID_ALLOC)
with a flag that specifies whether this address space is coherent
(IOASID_GET_INFO probably needs a flag/cap to expose if the system
supports this). All mappings in this IOASID would use IOMMU_CACHE,
and devices attached to it would be required to be backed by an IOMMU
capable of IOMMU_CAP_CACHE_COHERENCY (attach fails otherwise). If only
these IOASIDs exist, access to wbinvd would not be provided. (How does
a user-provided page table work? - reserved bit set, user error?)

Conversely, a user could create a non-coherent IOASID and attach any
device to it, regardless of IOMMU backing capabilities. Only if an
assumed non-coherent device is attached would the wbinvd be allowed.

I think that means that an EXECUTE_WBINVD ioctl lives on the IOASIDFD
and the IOASID world needs to understand the device's ability to
generate non-coherent DMA. This wbinvd ioctl would be a no-op (or
some known errno) unless a non-coherent IOASID exists with a potentially
non-coherent device attached.
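
A purely illustrative sketch of that model; IOASID_ALLOC is the RFC's
proposed ioctl, while the flags, layouts and ioctl numbers below are
invented here:

#include <sys/ioctl.h>
#include <linux/types.h>

#define IOASID_ALLOC              _IO('I', 1)   /* RFC ioctl, placeholder nr */
#define IOASID_EXECUTE_WBINVD     _IO('I', 2)   /* hypothetical */

#define IOASID_CAP_ENFORCE_SNOOP  (1u << 0)     /* would be reported via
                                                   IOASID_GET_INFO */
#define IOASID_ALLOC_COHERENT     (1u << 0)     /* all maps use IOMMU_CACHE;
                                                   attach requires
                                                   IOMMU_CAP_CACHE_COHERENCY */
struct ioasid_alloc_req {                       /* hypothetical layout */
    __u32 argsz;
    __u32 flags;
};

static int alloc_ioasid(int ioasid_fd, int coherent)
{
    struct ioasid_alloc_req req = {
        .argsz = sizeof(req),
        .flags = coherent ? IOASID_ALLOC_COHERENT : 0,
    };

    return ioctl(ioasid_fd, IOASID_ALLOC, &req);
}

/*
 * Lives on the ioasidfd: a real wbinvd only when a non-coherent IOASID
 * exists with a potentially non-coherent device attached, otherwise a
 * no-op / known errno.
 */
static int ioasid_wbinvd(int ioasid_fd)
{
    return ioctl(ioasid_fd, IOASID_EXECUTE_WBINVD, 0);
}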

> vfio_pci may want to take this from an admin configuration knob
> someplace. It allows the admin to customize if they want.
>
> If we can figure out a way to autodetect 2 from vfio_pci, all the
> better
>
> 2) There is some IOMMU_EXECUTE_WBINVD IOCTL that allows userspace
> to access wbinvd so it can make use of the no snoop optimization.
>
> wbinvd is allowed when:
> - A device is joined with mode #0
> - A device is joined with mode #1 and the IOMMU cannot block
> no-snoop (today)
>
> 3) The IOASID's don't care about this at all. If IOMMU_EXECUTE_WBINVD
> is blocked and userspace doesn't request to block no-snoop in the
> IOASID then it is a userspace error.

In my model above, the IOASID is central to this.

> 4) The KVM interface is the very simple enable/disable WBINVD.
> Possessing a FD that can do IOMMU_EXECUTE_WBINVD is required
> to enable WBINVD at KVM.

Right, and in the new world order, vfio is only a device driver, the
IOASID manages the device's DMA. wbinvd is only necessary relative to
non-coherent DMA, which seems like QEMU needs to bump KVM with an
ioasidfd.

> It is pretty simple from a /dev/ioasid perpsective, covers todays
> compat requirement, gives some future option to allow the no-snoop
> optimization, and gives a new option for qemu to totally block wbinvd
> no matter what.

What do you imagine is the use case for totally blocking wbinvd? In
the model I describe, wbinvd would always be a no-op/known-errno when
the IOASIDs are all allocated as coherent or a non-coherent IOASID has
only coherent-only devices attached. Does userspace need a way to
prevent itself from scenarios where wbinvd is not a no-op?

In general I'm having trouble wrapping my brain around the semantics of
the enable/automatic/force-disable wbinvd specific proposal, sorry.
Thanks,

Alex

2021-06-04 21:46:38

by Alex Williamson

[permalink] [raw]
Subject: Re: [RFC] /dev/ioasid uAPI proposal

On Fri, 4 Jun 2021 09:13:37 -0300
Jason Gunthorpe <[email protected]> wrote:

> On Thu, Jun 03, 2021 at 02:41:36PM -0600, Alex Williamson wrote:
>
> > Could you clarify "vfio_driver"?
>
> This is the thing providing the vfio_device_ops function pointers.
>
> So vfio-pci can't know anything about this (although your no-snoop
> control probing idea makes sense to me)
>
> But vfio_mlx5_pci can know
>
> So can mdev_idxd
>
> And kvmgt

A capability on VFIO_DEVICE_GET_INFO could provide a hint to userspace.
Stock vfio-pci could fill it out to the extent of advertising whether the
device is capable of non-coherent DMA based on the Enable No-snoop
probing; the device-specific vfio_drivers could set it based on
knowledge of the device behavior. Another bit might indicate a
preference to not suppress non-coherent DMA at the IOMMU (a sketch of
such a capability follows below). Thanks,

Alex
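
For illustration, such a hint could reuse the existing
vfio_info_cap_header chaining from <linux/vfio.h>; the capability ID and
flag bits below are made up:

#include <linux/vfio.h>

#define VFIO_DEVICE_INFO_CAP_NOSNOOP    11      /* placeholder ID */

struct vfio_device_info_cap_nosnoop {
    struct vfio_info_cap_header header;
    __u32 flags;
#define VFIO_DEVICE_NOSNOOP_POSSIBLE   (1 << 0) /* Enable No-snoop not wired
                                                   to zero */
#define VFIO_DEVICE_NOSNOOP_PREFERRED  (1 << 1) /* prefers no-snoop not
                                                   suppressed at the IOMMU */
    __u32 reserved;
};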

2021-06-04 23:04:14

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [RFC] /dev/ioasid uAPI proposal

On Fri, Jun 04, 2021 at 03:29:18PM -0600, Alex Williamson wrote:
> On Fri, 4 Jun 2021 14:22:07 -0300
> Jason Gunthorpe <[email protected]> wrote:
>
> > On Fri, Jun 04, 2021 at 06:10:51PM +0200, Paolo Bonzini wrote:
> > > On 04/06/21 18:03, Jason Gunthorpe wrote:
> > > > On Fri, Jun 04, 2021 at 05:57:19PM +0200, Paolo Bonzini wrote:
> > > > > I don't want a security proof myself; I want to trust VFIO to make the right
> > > > > judgment and I'm happy to defer to it (via the KVM-VFIO device).
> > > > >
> > > > > Given how KVM is just a device driver inside Linux, VMs should be a slightly
> > > > > more roundabout way to do stuff that is accessible to bare metal; not a way
> > > > > to gain extra privilege.
> > > >
> > > > Okay, fine, lets turn the question on its head then.
> > > >
> > > > VFIO should provide a IOCTL VFIO_EXECUTE_WBINVD so that userspace VFIO
> > > > application can make use of no-snoop optimizations. The ability of KVM
> > > > to execute wbinvd should be tied to the ability of that IOCTL to run
> > > > in a normal process context.
> > > >
> > > > So, under what conditions do we want to allow VFIO to giave a process
> > > > elevated access to the CPU:
> > >
> > > Ok, I would definitely not want to tie it *only* to CAP_SYS_RAWIO (i.e.
> > > #2+#3 would be worse than what we have today), but IIUC the proposal (was it
> > > yours or Kevin's?) was to keep #2 and add #1 with an enable/disable ioctl,
> > > which then would be on VFIO and not on KVM.
> >
> > At the end of the day we need an ioctl with two arguments:
> > - The 'security proof' FD (ie /dev/vfio/XX, or /dev/ioasid, or whatever)
> > - The KVM FD to control wbinvd support on
> >
> > Philosophically it doesn't matter too much which subsystem that ioctl
> > lives, but we have these obnoxious cross module dependencies to
> > consider..
> >
> > Framing the question, as you have, to be about the process, I think
> > explains why KVM doesn't really care what is decided, so long as the
> > process and the VM have equivalent rights.
> >
> > Alex, how about a more fleshed out suggestion:
> >
> > 1) When the device is attached to the IOASID via VFIO_ATTACH_IOASID
> > it communicates its no-snoop configuration:
>
> Communicates to whom?

To the /dev/iommu FD which will have to maintain a list of devices
attached to it internally.

> > - 0 enable, allow WBINVD
> > - 1 automatic disable, block WBINVD if the platform
> > IOMMU can police it (what we do today)
> > - 2 force disable, do not allow BINVD ever
>
> The only thing we know about the device is whether or not Enable
> No-snoop is hard wired to zero, ie. it either can't generate no-snoop
> TLPs ("coherent-only") or it might ("assumed non-coherent").

Here I am outlining the choices and also imagining we might want an
admin knob to select among the three.

> If we're putting the policy decision in the hands of userspace they
> should have access to wbinvd if they own a device that is assumed
> non-coherent AND it's attached to an IOMMU (page table) that is not
> blocking no-snoop (a "non-coherent IOASID").

There are two parts here, like Paolo was leading to: whether the process
has access to WBINVD, and then whether such an allowed process tells KVM
to turn on WBINVD in the guest.

If the process has a device and it has a way to create a non-coherent
IOASID, then that process has access to WBINVD.

For security it doesn't matter if the process actually creates the
non-coherent IOASID or not. An attacker will simply do the steps that
give access to WBINVD.

The important detail is that access to WBINVD does not compel the
process to tell KVM to turn on WBINVD. So a qemu with access to WBINVD
can still choose to create a secure guest by always using IOMMU_CACHE
in its page tables and not asking KVM to enable WBINVD.

This proposal shifts this policy decision from the kernel to userspace.
qemu is responsible for determining whether KVM should enable wbinvd
or not based on whether it was able to create IOASIDs with IOMMU_CACHE.
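
In other words, roughly the following on the qemu side (the flag and
struct are hypothetical, shown only to illustrate where the decision
would live):

struct ioasid_state {                   /* hypothetical qemu-side bookkeeping */
    int id;
    unsigned int flags;
#define IOASID_STATE_IOMMU_CACHE   (1u << 0)   /* kernel enforces snooping */
};

/* Ask KVM to enable wbinvd emulation only if some IOASID could not be
 * created with IOMMU_CACHE, i.e. non-coherent DMA is actually possible. */
static int guest_needs_wbinvd(const struct ioasid_state *s, unsigned int n)
{
    unsigned int i;

    for (i = 0; i < n; i++)
        if (!(s[i].flags & IOASID_STATE_IOMMU_CACHE))
            return 1;
    return 0;
}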

> Conversely, a user could create a non-coherent IOASID and attach any
> device to it, regardless of IOMMU backing capabilities. Only if an
> assumed non-coherent device is attached would the wbinvd be allowed.

Right, this is exactly the point. Since the user gets to pick whether the
IOASID is coherent or not, an attacker can always reach WBINVD
using only the device FD. Additional checks don't add to the security
of the process.

The additional checks you are describing add to the security of the
guest; however, qemu is capable of doing them without more help from the
kernel.

It is the strength of Paolo's model that KVM can optionally do less,
but never more, than the process itself can do.

> > It is pretty simple from a /dev/ioasid perpsective, covers todays
> > compat requirement, gives some future option to allow the no-snoop
> > optimization, and gives a new option for qemu to totally block wbinvd
> > no matter what.
>
> What do you imagine is the use case for totally blocking wbinvd?

If wbinvd is really security-important then an operator should endeavor
to turn it off. It can be safely turned off if the operator
understands the SR-IOV devices they are using, ie if you are only using
mlx5 or an NVMe device then force it off and be secure, regardless of
the platform capability.

Jason

2021-06-04 23:13:50

by Tian, Kevin

[permalink] [raw]
Subject: RE: [RFC] /dev/ioasid uAPI proposal

> From: Jason Gunthorpe <[email protected]>
> Sent: Friday, June 4, 2021 8:09 PM
>
> On Fri, Jun 04, 2021 at 06:37:26AM +0000, Tian, Kevin wrote:
> > > From: Jason Gunthorpe
> > > Sent: Thursday, June 3, 2021 9:05 PM
> > >
> > > > >
> > > > > 3) Device accepts any PASIDs from the guest. No
> > > > > vPASID/pPASID translation is possible. (classic vfio_pci)
> > > > > 4) Device accepts any PASID from the guest and has an
> > > > > internal vPASID/pPASID translation (enhanced vfio_pci)
> > > >
> > > > what is enhanced vfio_pci? In my writing this is for mdev
> > > > which doesn't support ENQCMD
> > >
> > > This is a vfio_pci that mediates some element of the device interface
> > > to communicate the vPASID/pPASID table to the device, using Max's
> > > series for vfio_pci drivers to inject itself into VFIO.
> > >
> > > For instance a device might send a message through the PF that the VF
> > > has a certain vPASID/pPASID translation table. This would be useful
> > > for devices that cannot use ENQCMD but still want to support migration
> > > and thus need vPASID.
> >
> > I still don't quite get. If it's a PCI device why is PASID translation required?
> > Just delegate the per-RID PASID space to user as type-3 then migrating the
> > vPASID space is just straightforward.
>
> This is only possible if we get rid of the global pPASID allocation
> (honestly is my preference as it makes the HW a lot simpler)
>

In this proposal global vs. per-RID allocation is a per-device policy.
For vfio-pci it can always use per-RID (regardless of whether the
device is partially mediated or not) and no vPASID/pPASID conversion.
Even for mdev, if there is no ENQCMD we can still do per-RID conversion.
Only for mdev which has ENQCMD do we need global pPASID allocation.

I think this is the motivation you explained earlier that it's not good
to have one global PASID allocator in the kernel. Per-RID vs. global
should be selected per device.

Thanks
Kevin

2021-06-04 23:22:23

by Tian, Kevin

[permalink] [raw]
Subject: RE: [RFC] /dev/ioasid uAPI proposal

> From: Jason Gunthorpe <[email protected]>
> Sent: Friday, June 4, 2021 8:34 PM
>
> On Fri, Jun 04, 2021 at 06:08:28AM +0000, Tian, Kevin wrote:
>
> > In Qemu case the problem is that it doesn't know the list of devices
> > that will be attached to an IOASID when it's created. This is a guest-
> > side knowledge which is conveyed one device at a time to Qemu
> > though vIOMMU.
>
> At least for the guest side it is alot simpler because the vIOMMU
> being emulated will define nearly everything.
>
> qemu will just have to ask the kernel for whatever it is the guest is
> doing. If the kernel can't do it then qemu has to SW emulate.
>
> The no-snoop block may be the only thing that is under qemu's control
> because it is transparent to the guest.
>
> This will probably become clearer as people start to define what the
> get_info should return.
>

Sure. Just to clarify, my comment was about "Perhaps creating an
IOASID should pass in a list of the device labels that the IOASID will
be used with". My point is that Qemu doesn't know this fact before
the guest completes binding the page table to all relevant devices,
while the IOASID must be created when the table is bound to the first
device. So Qemu just needs to create the IOASID with the format that is
required for the current device. Incompatibility will be detected when
attaching other devices later.

Thanks
Kevin

2021-06-05 06:27:25

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [RFC] /dev/ioasid uAPI proposal

On 04/06/21 19:22, Jason Gunthorpe wrote:
> 4) The KVM interface is the very simple enable/disable WBINVD.
> Possessing a FD that can do IOMMU_EXECUTE_WBINVD is required
> to enable WBINVD at KVM.

The KVM interface is the same kvm-vfio device that exists already. The
userspace API does not need to change at all: adding one VFIO file
descriptor with WBINVD enabled to the kvm-vfio device lets the VM use
WBINVD functionality (see kvm_vfio_update_coherency).
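
Roughly, that existing flow looks like this from userspace (error
handling omitted; in practice the kvm-vfio device is created once and
group fds are added/removed as devices come and go):

#include <sys/ioctl.h>
#include <linux/kvm.h>

static int kvm_vfio_add_group(int vm_fd, int vfio_group_fd)
{
    struct kvm_create_device cd = { .type = KVM_DEV_TYPE_VFIO };
    struct kvm_device_attr attr = {
        .group = KVM_DEV_VFIO_GROUP,
        .attr  = KVM_DEV_VFIO_GROUP_ADD,
        .addr  = (__u64)(unsigned long)&vfio_group_fd,
    };

    if (ioctl(vm_fd, KVM_CREATE_DEVICE, &cd))
        return -1;

    /* kvm_vfio_update_coherency() runs as a side effect of this */
    return ioctl(cd.fd, KVM_SET_DEVICE_ATTR, &attr);
}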

Alternatively you can add a KVM_DEV_IOASID_{ADD,DEL} pair of ioctls.
But it seems a useless complication compared to just using what we have
now, at least while VMs only use IOASIDs via VFIO.

Either way, there should be no policy attached to the add/delete
operations. KVM users want to add the VFIO (or IOASID) file descriptors
to the device independent of WBINVD. If userspace wants/needs to apply
its own policy on whether to enable WBINVD or not, they can do it on the
VFIO/IOASID side:

> 1) When the device is attached to the IOASID via VFIO_ATTACH_IOASID
> it communicates its no-snoop configuration:
> - 0 enable, allow WBINVD
> - 1 automatic disable, block WBINVD if the platform
> IOMMU can police it (what we do today)
> - 2 force disable, do not allow BINVD ever

Though, like Alex, it's also not clear to me whether force-disable is
useful. Instead userspace can query the IOMMU or the device to ensure
it's not enabled.

Paolo

2021-06-07 03:23:22

by Jason Wang

[permalink] [raw]
Subject: Re: [RFC] /dev/ioasid uAPI proposal


On 2021/6/4 7:58, Jason Gunthorpe wrote:
> On Fri, Jun 04, 2021 at 09:11:03AM +0800, Jason Wang wrote:
>>> nor do any virtio drivers implement the required platform specific
>>> cache flushing to make no-snoop TLPs work.
>> I don't get why virtio drivers needs to do that. I think DMA API should hide
>> those arch/platform specific stuffs from us.
> It is not arch/platform stuff. If the device uses no-snoop then a
> very platform specific recovery is required in the device driver.
>
> It is not part of the normal DMA API, it is side APIs like
> flush_agp_cache() or wbinvd() that are used by GPU drivers only.


Yes and virtio doesn't support AGP.


>
> If drivers/virtio doesn't explicitly call these things it doesn't
> support no-snoop - hence no VDPA device can ever use no-snoop.


Note that the fact that no drivers call these things doesn't mean it is
not supported by the spec.

Actually, the spec doesn't forbid non-coherent DMA; anyway, we can start
a new thread on the virtio mailing list to discuss that.

But consider that virtio already supports GPU, crypto and sound devices,
and devices like codec and video are being proposed. It doesn't help
if we mandate coherent DMA now.

Thanks


>
> Since VIRTIO_F_ACCESS_PLATFORM doesn't trigger wbinvd on x86 it has
> nothing to do with no-snoop.
>
> Jason
>

2021-06-07 03:27:03

by Tian, Kevin

[permalink] [raw]
Subject: RE: [RFC] /dev/ioasid uAPI proposal

> From: Alex Williamson <[email protected]>
> Sent: Saturday, June 5, 2021 5:29 AM
>
> On Fri, 4 Jun 2021 14:22:07 -0300
> Jason Gunthorpe <[email protected]> wrote:
>
> > On Fri, Jun 04, 2021 at 06:10:51PM +0200, Paolo Bonzini wrote:
> > > On 04/06/21 18:03, Jason Gunthorpe wrote:
> > > > On Fri, Jun 04, 2021 at 05:57:19PM +0200, Paolo Bonzini wrote:
> > > > > I don't want a security proof myself; I want to trust VFIO to make the
> right
> > > > > judgment and I'm happy to defer to it (via the KVM-VFIO device).
> > > > >
> > > > > Given how KVM is just a device driver inside Linux, VMs should be a
> slightly
> > > > > more roundabout way to do stuff that is accessible to bare metal; not
> a way
> > > > > to gain extra privilege.
> > > >
> > > > Okay, fine, lets turn the question on its head then.
> > > >
> > > > VFIO should provide a IOCTL VFIO_EXECUTE_WBINVD so that userspace
> VFIO
> > > > application can make use of no-snoop optimizations. The ability of KVM
> > > > to execute wbinvd should be tied to the ability of that IOCTL to run
> > > > in a normal process context.
> > > >
> > > > So, under what conditions do we want to allow VFIO to giave a process
> > > > elevated access to the CPU:
> > >
> > > Ok, I would definitely not want to tie it *only* to CAP_SYS_RAWIO (i.e.
> > > #2+#3 would be worse than what we have today), but IIUC the proposal
> (was it
> > > yours or Kevin's?) was to keep #2 and add #1 with an enable/disable ioctl,
> > > which then would be on VFIO and not on KVM.
> >
> > At the end of the day we need an ioctl with two arguments:
> > - The 'security proof' FD (ie /dev/vfio/XX, or /dev/ioasid, or whatever)
> > - The KVM FD to control wbinvd support on
> >
> > Philosophically it doesn't matter too much which subsystem that ioctl
> > lives, but we have these obnoxious cross module dependencies to
> > consider..
> >
> > Framing the question, as you have, to be about the process, I think
> > explains why KVM doesn't really care what is decided, so long as the
> > process and the VM have equivalent rights.
> >
> > Alex, how about a more fleshed out suggestion:

Possibly just a naming thing, but I feel it's better to just talk about
no-snoop or non-coherent in the uAPI. Per the Intel SDM wbinvd is a
privileged instruction. A process on the host has no privilege to
execute it. Only when this process holds a VM does this instruction
matter, as there are guest privilege levels. But having a VFIO uAPI
(which is userspace-oriented) explicitly deal with a CPU instruction
which makes sense only in a virtualization context sounds a bit weird...

> >
> > 1) When the device is attached to the IOASID via VFIO_ATTACH_IOASID
> > it communicates its no-snoop configuration:
>
> Communicates to whom?
>
> > - 0 enable, allow WBINVD
> > - 1 automatic disable, block WBINVD if the platform
> > IOMMU can police it (what we do today)
> > - 2 force disable, do not allow BINVD ever
>
> The only thing we know about the device is whether or not Enable
> No-snoop is hard wired to zero, ie. it either can't generate no-snoop
> TLPs ("coherent-only") or it might ("assumed non-coherent"). If
> we're putting the policy decision in the hands of userspace they should
> have access to wbinvd if they own a device that is assumed
> non-coherent AND it's attached to an IOMMU (page table) that is not
> blocking no-snoop (a "non-coherent IOASID").
>
> I think that means that the IOASID needs to be created (IOASID_ALLOC)
> with a flag that specifies whether this address space is coherent
> (IOASID_GET_INFO probably needs a flag/cap to expose if the system
> supports this). All mappings in this IOASID would use IOMMU_CACHE and

Yes, this sounds like a cleaner way than specifying this attribute late
in VFIO_ATTACH_IOASID. Following Jason's proposal, v2 will move to
the scheme requiring the user to specify format info when creating an
IOASID. Leaving coherency out of that box just adds some trickiness,
e.g. whether to allow the user to update the page table between ALLOC
and ATTACH.

> and devices attached to it would be required to be backed by an IOMMU
> capable of IOMMU_CAP_CACHE_COHERENCY (attach fails otherwise). If only
> these IOASIDs exist, access to wbinvd would not be provided. (How does
> a user provided page table work? - reserved bit set, user error?)
>
> Conversely, a user could create a non-coherent IOASID and attach any
> device to it, regardless of IOMMU backing capabilities. Only if an
> assumed non-coherent device is attached would the wbinvd be allowed.
>
> I think that means that an EXECUTE_WBINVD ioctl lives on the IOASIDFD
> and the IOASID world needs to understand the device's ability to
> generate non-coherent DMA. This wbinvd ioctl would be a no-op (or
> some known errno) unless a non-coherent IOASID exists with a potentially
> non-coherent device attached.
>
> > vfio_pci may want to take this from an admin configuration knob
> > someplace. It allows the admin to customize if they want.
> >
> > If we can figure out a way to autodetect 2 from vfio_pci, all the
> > better
> >
> > 2) There is some IOMMU_EXECUTE_WBINVD IOCTL that allows userspace
> > to access wbinvd so it can make use of the no snoop optimization.
> >
> > wbinvd is allowed when:
> > - A device is joined with mode #0
> > - A device is joined with mode #1 and the IOMMU cannot block
> > no-snoop (today)
> >
> > 3) The IOASID's don't care about this at all. If IOMMU_EXECUTE_WBINVD
> > is blocked and userspace doesn't request to block no-snoop in the
> > IOASID then it is a userspace error.
>
> In my model above, the IOASID is central to this.
>
> > 4) The KVM interface is the very simple enable/disable WBINVD.
> > Possessing a FD that can do IOMMU_EXECUTE_WBINVD is required
> > to enable WBINVD at KVM.
>
> Right, and in the new world order, vfio is only a device driver, the
> IOASID manages the device's DMA. wbinvd is only necessary relative to
> non-coherent DMA, which seems like QEMU needs to bump KVM with an
> ioasidfd.
>
> > It is pretty simple from a /dev/ioasid perpsective, covers todays
> > compat requirement, gives some future option to allow the no-snoop
> > optimization, and gives a new option for qemu to totally block wbinvd
> > no matter what.
>
> What do you imagine is the use case for totally blocking wbinvd? In
> the model I describe, wbinvd would always be a no-op/known-errno when
> the IOASIDs are all allocated as coherent or a non-coherent IOASID has
> only coherent-only devices attached. Does userspace need a way to
> prevent itself from scenarios where wbvind is not a no-op?
>
> In general I'm having trouble wrapping my brain around the semantics of
> the enable/automatic/force-disable wbinvd specific proposal, sorry.
> Thanks,
>

Thanks,
Kevin

2021-06-07 03:53:21

by Tian, Kevin

[permalink] [raw]
Subject: RE: [RFC] /dev/ioasid uAPI proposal

> From: Paolo Bonzini <[email protected]>
> Sent: Saturday, June 5, 2021 2:22 PM
>
> On 04/06/21 19:22, Jason Gunthorpe wrote:
> > 4) The KVM interface is the very simple enable/disable WBINVD.
> > Possessing a FD that can do IOMMU_EXECUTE_WBINVD is required
> > to enable WBINVD at KVM.
>
> The KVM interface is the same kvm-vfio device that exists already. The
> userspace API does not need to change at all: adding one VFIO file
> descriptor with WBINVD enabled to the kvm-vfio device lets the VM use
> WBINVD functionality (see kvm_vfio_update_coherency).
>
> Alternatively you can add a KVM_DEV_IOASID_{ADD,DEL} pair of ioctls.
> But it seems useless complication compared to just using what we have
> now, at least while VMs only use IOASIDs via VFIO.
>

A new IOASID variation may make more sense in case non-vfio subsystems
want to handle a similar coherency problem. Per other discussions it
looks like it's still open whether vDPA wants it or not, and there could
be other passthrough frameworks in the future. Having them all use vfio
naming does not sound very clean. Anyway the coherency attribute must be
configured on the IOASID in the end, so it looks reasonable for KVM to
learn the info from a unified place.

Just FYI we are also planning a new IOASID-specific ioctl in KVM for
other usages. Future Intel platforms support a new ENQCMD instruction for
scalable work submission to the device. This instruction includes a
64-byte payload plus a PASID retrieved from a CPU MSR register (covered
by xsave). When supporting this instruction in the guest, the value in
the MSR is a guest PASID which must be translated to a host PASID.
A new VMCS structure (PASID translation table) is introduced for this
purpose. In this /dev/ioasid proposal, we propose VFIO_{UN}MAP_IOASID
for the user to update the VMCS structure properly. The user is
expected to provide {ioasid_fd, ioasid, vPASID} to KVM, which then
calls an ioasid helper function to figure out the corresponding hPASID
and update the specified entry.
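
A possible shape for such an ioctl, purely illustrative since neither
the name nor the layout is settled:

#include <sys/ioctl.h>
#include <linux/types.h>

#define KVM_MAP_VPASID  _IO(0xAE, 0xf0)     /* placeholder name and number */

struct kvm_pasid_mapping {                  /* hypothetical layout */
    __u32 flags;
    __s32 ioasid_fd;    /* identifies the /dev/ioasid context */
    __u32 ioasid;
    __u32 vpasid;       /* guest PASID taken from the xsave-covered MSR */
};

static int kvm_map_vpasid(int kvm_vm_fd, struct kvm_pasid_mapping *map)
{
    /* KVM resolves the corresponding hPASID via an ioasid helper and
     * updates the specified VMCS PASID translation table entry. */
    return ioctl(kvm_vm_fd, KVM_MAP_VPASID, map);
}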

Thanks
Kevin

2021-06-07 06:55:15

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [RFC] /dev/ioasid uAPI proposal

On 07/06/21 05:25, Tian, Kevin wrote:
> Per Intel SDM wbinvd is a privileged instruction. A process on the
> host has no privilege to execute it.

(Half of) the point of the kernel is to do privileged tasks on the
processes' behalf. There are good reasons why a process that uses VFIO
(without KVM) could want to use wbinvd, so VFIO lets them do it with an
ioctl and adequate checks around the operation.

Paolo

2021-06-07 12:22:36

by Yi Liu

[permalink] [raw]
Subject: RE: [RFC] /dev/ioasid uAPI proposal

> From: Shenming Lu <[email protected]>
> Sent: Friday, June 4, 2021 10:03 AM
>
> On 2021/6/4 2:19, Jacob Pan wrote:
> > Hi Shenming,
> >
> > On Wed, 2 Jun 2021 12:50:26 +0800, Shenming Lu
> <[email protected]>
> > wrote:
> >
> >> On 2021/6/2 1:33, Jason Gunthorpe wrote:
> >>> On Tue, Jun 01, 2021 at 08:30:35PM +0800, Lu Baolu wrote:
> >>>
> >>>> The drivers register per page table fault handlers to /dev/ioasid which
> >>>> will then register itself to iommu core to listen and route the per-
> >>>> device I/O page faults.
> >>>
> >>> I'm still confused why drivers need fault handlers at all?
> >>
> >> Essentially it is the userspace that needs the fault handlers,
> >> one case is to deliver the faults to the vIOMMU, and another
> >> case is to enable IOPF on the GPA address space for on-demand
> >> paging, it seems that both could be specified in/through the
> >> IOASID_ALLOC ioctl?
> >>
> > I would think IOASID_BIND_PGTABLE is where fault handler should be
> > registered. There wouldn't be any IO page fault without the binding
> anyway.
>
> Yeah, I also proposed this before, registering the handler in the
> BIND_PGTABLE
> ioctl does make sense for the guest page faults. :-)
>
> But how about the page faults from the GPA address space (it's page table is
> mapped through the MAP_DMA ioctl)? From your point of view, it seems
> that we should register the handler for the GPA address space in the (first)
> MAP_DMA ioctl.

Under the new proposal, I think the page fault handler is also registered
per IOASID object. The difference compared with the guest page table case
is that there is no need to inject the fault into the VM.

Regards,
Yi Liu

by Enrico Weigelt, metux IT consult

Subject: Re: [RFC] /dev/ioasid uAPI proposal

On 02.06.21 19:21, Jason Gunthorpe wrote:

Hi,

> Not really, once one thing in an applicate uses a large number FDs the
> entire application is effected. If any open() can return 'very big
> number' then nothing in the process is allowed to ever use select.

isn't that a bug in select() ?

--mtx

--
---
Note: unencrypted e-mails can easily be intercepted and manipulated!
For confidential communication, please send your GPG/PGP key.
---
Enrico Weigelt, metux IT consult
Free software and Linux embedded engineering
[email protected] -- +49-151-27565287

2021-06-07 14:18:12

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [RFC] /dev/ioasid uAPI proposal

On Mon, Jun 07, 2021 at 11:18:33AM +0800, Jason Wang wrote:

> Note that no drivers call these things doesn't meant it was not
> supported by the spec.

Of course it does. If the spec doesn't define exactly when the driver
should call the cache flushes for no-snoop transactions then the
protocol doesn't support no-snoop.

no-snoop is only used in very specific sequences of operations, like
certain GPU usages, because regaining coherence on x86 is incredibly
expensive.

ie I wouldn't ever expect a NIC to use no-snoop because NICs expect
packets to be processed by the CPU.

"non-coherent DMA" is some general euphemism that evokes images of
embedded platforms that don't have coherent DMA at all and have low-cost
ways to regain coherence. This is not at all what we are talking
about here.

Jason

2021-06-07 15:44:02

by Alex Williamson

[permalink] [raw]
Subject: Re: [RFC] /dev/ioasid uAPI proposal

On Fri, 4 Jun 2021 20:01:08 -0300
Jason Gunthorpe <[email protected]> wrote:

> On Fri, Jun 04, 2021 at 03:29:18PM -0600, Alex Williamson wrote:
> > On Fri, 4 Jun 2021 14:22:07 -0300
> > Jason Gunthorpe <[email protected]> wrote:
> >
> > > On Fri, Jun 04, 2021 at 06:10:51PM +0200, Paolo Bonzini wrote:
> > > > On 04/06/21 18:03, Jason Gunthorpe wrote:
> > > > > On Fri, Jun 04, 2021 at 05:57:19PM +0200, Paolo Bonzini wrote:
> > > > > > I don't want a security proof myself; I want to trust VFIO to make the right
> > > > > > judgment and I'm happy to defer to it (via the KVM-VFIO device).
> > > > > >
> > > > > > Given how KVM is just a device driver inside Linux, VMs should be a slightly
> > > > > > more roundabout way to do stuff that is accessible to bare metal; not a way
> > > > > > to gain extra privilege.
> > > > >
> > > > > Okay, fine, lets turn the question on its head then.
> > > > >
> > > > > VFIO should provide a IOCTL VFIO_EXECUTE_WBINVD so that userspace VFIO
> > > > > application can make use of no-snoop optimizations. The ability of KVM
> > > > > to execute wbinvd should be tied to the ability of that IOCTL to run
> > > > > in a normal process context.
> > > > >
> > > > > So, under what conditions do we want to allow VFIO to giave a process
> > > > > elevated access to the CPU:
> > > >
> > > > Ok, I would definitely not want to tie it *only* to CAP_SYS_RAWIO (i.e.
> > > > #2+#3 would be worse than what we have today), but IIUC the proposal (was it
> > > > yours or Kevin's?) was to keep #2 and add #1 with an enable/disable ioctl,
> > > > which then would be on VFIO and not on KVM.
> > >
> > > At the end of the day we need an ioctl with two arguments:
> > > - The 'security proof' FD (ie /dev/vfio/XX, or /dev/ioasid, or whatever)
> > > - The KVM FD to control wbinvd support on
> > >
> > > Philosophically it doesn't matter too much which subsystem that ioctl
> > > lives, but we have these obnoxious cross module dependencies to
> > > consider..
> > >
> > > Framing the question, as you have, to be about the process, I think
> > > explains why KVM doesn't really care what is decided, so long as the
> > > process and the VM have equivalent rights.
> > >
> > > Alex, how about a more fleshed out suggestion:
> > >
> > > 1) When the device is attached to the IOASID via VFIO_ATTACH_IOASID
> > > it communicates its no-snoop configuration:
> >
> > Communicates to whom?
>
> To the /dev/iommu FD which will have to maintain a list of devices
> attached to it internally.
>
> > > - 0 enable, allow WBINVD
> > > - 1 automatic disable, block WBINVD if the platform
> > > IOMMU can police it (what we do today)
> > > - 2 force disable, do not allow BINVD ever
> >
> > The only thing we know about the device is whether or not Enable
> > No-snoop is hard wired to zero, ie. it either can't generate no-snoop
> > TLPs ("coherent-only") or it might ("assumed non-coherent").
>
> Here I am outlining the choice an also imagining we might want an
> admin knob to select the three.

You're calling this an admin knob, which to me suggests a global module
option, so are you trying to implement both an administrator and a user
policy? ie. the user can create scenarios where access to wbinvd might
be justified by hardware/IOMMU configuration, but can be limited by the
admin?

For example I proposed that the ioasidfd would bear the responsibility
of a wbinvd ioctl and therefore validate the user's access to enable
wbinvd emulation w/ KVM, so I'm assuming this module option lives
there. I essentially described the "enable" behavior in my previous
reply, user has access to wbinvd if owning a non-coherent capable
device managed in a non-coherent IOASID. Yes, the user IOASID
configuration controls the latter half of this.

What then is "automatic" mode? The user cannot create a non-coherent
IOASID with a non-coherent device if the IOMMU supports no-snoop
blocking? Do they get a failure? Does it get silently promoted to
coherent?

In "disable" mode, I think we're just narrowing the restriction
further, a non-coherent capable device cannot be used except in a
forced coherent IOASID.

> > If we're putting the policy decision in the hands of userspace they
> > should have access to wbinvd if they own a device that is assumed
> > non-coherent AND it's attached to an IOMMU (page table) that is not
> > blocking no-snoop (a "non-coherent IOASID").
>
> There are two parts here, like Paolo was leading too. If the process
> has access to WBINVD and then if such an allowed process tells KVM to
> turn on WBINVD in the guest.
>
> If the process has a device and it has a way to create a non-coherent
> IOASID, then that process has access to WBINVD.
>
> For security it doesn't matter if the process actually creates the
> non-coherent IOASID or not. An attacker will simply do the steps that
> give access to WBINVD.

Yes, at this point the user has the ability to create a configuration
where they could have access to wbinvd, but if they haven't created
such a configuration, is the wbinvd a no-op?

> The important detail is that access to WBINVD does not compell the
> process to tell KVM to turn on WBINVD. So a qemu with access to WBINVD
> can still choose to create a secure guest by always using IOMMU_CACHE
> in its page tables and not asking KVM to enable WBINVD.

Of course.

> This propsal shifts this policy decision from the kernel to userspace.
> qemu is responsible to determine if KVM should enable wbinvd or not
> based on if it was able to create IOASID's with IOMMU_CACHE.

QEMU is responsible for making sure the VM is consistent; if
non-coherent DMA can occur, wbinvd is emulated. But it's still the
KVM/IOASID connection that validates that access.

> > Conversely, a user could create a non-coherent IOASID and attach any
> > device to it, regardless of IOMMU backing capabilities. Only if an
> > assumed non-coherent device is attached would the wbinvd be allowed.
>
> Right, this is exactly the point. Since the user gets to pick if the
> IOASID is coherent or not then an attacker can always reach WBINVD
> using only the device FD. Additional checks don't add to the security
> of the process.
>
> The additional checks you are describing add to the security of the
> guest, however qemu is capable of doing them without more help from the
> kernel.
>
> It is the strenth of Paolo's model that KVM should not be able to do
> optionally less, not more than the process itself can do.

I think my previous reply was working towards those guidelines. I feel
like we're mostly in agreement, but perhaps reading past each other.
Nothing here convinced me against my previous proposal that the
ioasidfd bears responsibility for managing access to a wbinvd ioctl,
and therefore the equivalent KVM access. Whether wbinvd is allowed or
no-op'd when the user has access to a non-coherent device in a
configuration where the IOMMU prevents non-coherent DMA is maybe still
a matter of personal preference.

> > > It is pretty simple from a /dev/ioasid perpsective, covers todays
> > > compat requirement, gives some future option to allow the no-snoop
> > > optimization, and gives a new option for qemu to totally block wbinvd
> > > no matter what.
> >
> > What do you imagine is the use case for totally blocking wbinvd?
>
> If wbinvd is really security important then an operator should endevor
> to turn it off. It can be safely turned off if the operator
> understands the SRIOV devices they are using. ie if you are only using
> mlx5 or a nvme then force it off and be secure, regardless of the
> platform capability.

Ok, I'm not opposed to something like a module option that restricts to
only coherent DMA, but we need to work through how that's exposed and
the userspace behavior. The most obvious would be that a GET_INFO
ioctl on the ioasidfd indicates the restrictions, a flag on the IOASID
alloc indicates the coherency of the IOASID, and we fail any cases
where the admin policy or hardware support doesn't match (ie. alloc if
it's incompatible with policy, attach if the device/IOMMU backing
violates policy). This is all compatible with what I described
previously. Thanks,

Alex

2021-06-07 17:58:24

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [RFC] /dev/ioasid uAPI proposal

On Fri, Jun 04, 2021 at 11:10:53PM +0000, Tian, Kevin wrote:
> > From: Jason Gunthorpe <[email protected]>
> > Sent: Friday, June 4, 2021 8:09 PM
> >
> > On Fri, Jun 04, 2021 at 06:37:26AM +0000, Tian, Kevin wrote:
> > > > From: Jason Gunthorpe
> > > > Sent: Thursday, June 3, 2021 9:05 PM
> > > >
> > > > > >
> > > > > > 3) Device accepts any PASIDs from the guest. No
> > > > > > vPASID/pPASID translation is possible. (classic vfio_pci)
> > > > > > 4) Device accepts any PASID from the guest and has an
> > > > > > internal vPASID/pPASID translation (enhanced vfio_pci)
> > > > >
> > > > > what is enhanced vfio_pci? In my writing this is for mdev
> > > > > which doesn't support ENQCMD
> > > >
> > > > This is a vfio_pci that mediates some element of the device interface
> > > > to communicate the vPASID/pPASID table to the device, using Max's
> > > > series for vfio_pci drivers to inject itself into VFIO.
> > > >
> > > > For instance a device might send a message through the PF that the VF
> > > > has a certain vPASID/pPASID translation table. This would be useful
> > > > for devices that cannot use ENQCMD but still want to support migration
> > > > and thus need vPASID.
> > >
> > > I still don't quite get. If it's a PCI device why is PASID translation required?
> > > Just delegate the per-RID PASID space to user as type-3 then migrating the
> > > vPASID space is just straightforward.
> >
> > This is only possible if we get rid of the global pPASID allocation
> > (honestly is my preference as it makes the HW a lot simpler)
> >
>
> In this proposal global vs. per-RID allocation is a per-device policy.
> for vfio-pci it can always use per-RID (regardless of whether the
> device is partially mediated or not) and no vPASID/pPASID conversion.
> Even for mdev if no ENQCMD we can still do per-RID conversion.
> only for mdev which has ENQCMD we need global pPASID allocation.
>
> I think this is the motivation you explained earlier that it's not good
> to have one global PASID allocator in the kernel. per-RID vs. global
> should be selected per device.

I thought we concluded this wasn't possible because the guest could
choose to bind the same vPASID to a RID and to a ENQCMD device and
then we run into trouble? Or are you saying that a RID device gets a
complete dedicated table and can always have a vPASID == pPASID?

In any event it needs a clear explanation in the next RFC.

Jason

2021-06-07 18:01:24

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [RFC] /dev/ioasid uAPI proposal

On Sat, Jun 05, 2021 at 08:22:27AM +0200, Paolo Bonzini wrote:
> On 04/06/21 19:22, Jason Gunthorpe wrote:
> > 4) The KVM interface is the very simple enable/disable WBINVD.
> > Possessing a FD that can do IOMMU_EXECUTE_WBINVD is required
> > to enable WBINVD at KVM.
>
> The KVM interface is the same kvm-vfio device that exists already. The
> userspace API does not need to change at all: adding one VFIO file
> descriptor with WBINVD enabled to the kvm-vfio device lets the VM use WBINVD
> functionality (see kvm_vfio_update_coherency).

The problem is we are talking about adding a new /dev/ioasid FD and it
won't fit into the existing KVM VFIO FD interface. There are lots of
options here, one is to add new ioctls that specifically use the new
FD, the other is to somehow use VFIO as a proxy to carry things to the
/dev/ioasid FD code.

> Alternatively you can add a KVM_DEV_IOASID_{ADD,DEL} pair of ioctls. But it
> seems useless complication compared to just using what we have now, at least
> while VMs only use IOASIDs via VFIO.

The simplest is KVM_ENABLE_WBINVD(<fd security proof>) and be done
with it.

I don't need to keep track of things in KVM, just flip one flag on/off
under user control.
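
As a minimal sketch of that model (all names invented, not actual KVM
uAPI): KVM keeps nothing but a boolean, and flips it only when userspace
presents an fd that can already reach wbinvd on its own.

#include <errno.h>
#include <stdbool.h>

/* Stub standing in for "ask the owning subsystem whether this fd may
 * already execute wbinvd itself", e.g. an ioasid fd holding a
 * non-coherent device.  Purely illustrative. */
static bool fd_can_execute_wbinvd(int fd)
{
    (void)fd;
    return true;
}

struct kvm_wbinvd_state {
    bool wbinvd_allowed;        /* the only state KVM would keep */
};

/* Model of KVM_ENABLE_WBINVD(<fd security proof>): flip one flag. */
static int kvm_enable_wbinvd(struct kvm_wbinvd_state *kvm, int proof_fd)
{
    if (!fd_can_execute_wbinvd(proof_fd))
        return -EPERM;
    kvm->wbinvd_allowed = true; /* not tracked or revoked afterwards */
    return 0;
}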

> Either way, there should be no policy attached to the add/delete operations.
> KVM users want to add the VFIO (or IOASID) file descriptors to the device
> independent of WBINVD. If userspace wants/needs to apply its own policy on
> whether to enable WBINVD or not, they can do it on the VFIO/IOASID side:

Why does KVM need to know about IOASIDs? I don't think it can do
anything with this general information.

> > 1) When the device is attached to the IOASID via VFIO_ATTACH_IOASID
> > it communicates its no-snoop configuration:
> > - 0 enable, allow WBINVD
> > - 1 automatic disable, block WBINVD if the platform
> > IOMMU can police it (what we do today)
> > - 2 force disable, do not allow WBINVD ever
>
> Though, like Alex, it's also not clear to me whether force-disable is
> useful. Instead userspace can query the IOMMU or the device to ensure it's
> not enabled.

"force disable" would be a way for the device to signal to whatever
query you imagine that it is not enabled. Maybe I should have called
it "no-snoop is never used"

Jason

2021-06-07 18:03:39

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [RFC] /dev/ioasid uAPI proposal

On Mon, Jun 07, 2021 at 08:51:42AM +0200, Paolo Bonzini wrote:
> On 07/06/21 05:25, Tian, Kevin wrote:
> > Per Intel SDM wbinvd is a privileged instruction. A process on the
> > host has no privilege to execute it.
>
> (Half of) the point of the kernel is to do privileged tasks on the
> processes' behalf. There are good reasons why a process that uses VFIO
> (without KVM) could want to use wbinvd, so VFIO lets them do it with a ioctl
> and adequate checks around the operation.

Yes, exactly.

You cannot write a correct VFIO application for hardware that uses the
no-snoop bit without access to wbinvd.

KVM or not does not matter.

Jason

2021-06-07 18:05:00

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [RFC] /dev/ioasid uAPI proposal

On Mon, Jun 07, 2021 at 03:30:21PM +0200, Enrico Weigelt, metux IT consult wrote:
> On 02.06.21 19:21, Jason Gunthorpe wrote:
>
> Hi,
>
> > Not really, once one thing in an application uses a large number of FDs the
> > entire application is affected. If any open() can return 'very big
> > number' then nothing in the process is allowed to ever use select.
>
> isn't that a bug in select() ?

<shrug> it is what it is, select has a fixed size bitmap of FD #s and
a hard upper bound on that size as part of the glibc ABI - can't be
fixed.

Jason
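
For reference, the limit in question: glibc's fd_set is a fixed bitmap of
FD_SETSIZE (1024) bits, so a descriptor number at or above that can never
be monitored with select(), and FD_SET() on it writes out of bounds.

#include <stdio.h>
#include <sys/select.h>

int main(void)
{
    int fd = 4096;  /* a "very big number" returned by some open() */

    printf("FD_SETSIZE = %d\n", FD_SETSIZE);    /* 1024 on glibc */
    if (fd >= FD_SETSIZE)
        printf("fd %d can never be passed to select(); "
               "FD_SET() on it would write past the bitmap\n", fd);
    return 0;
}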

2021-06-07 18:22:47

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [RFC] /dev/ioasid uAPI proposal

On Mon, Jun 07, 2021 at 09:41:48AM -0600, Alex Williamson wrote:
> You're calling this an admin knob, which to me suggests a global module
> option, so are you trying to implement both an administrator and a user
> policy? ie. the user can create scenarios where access to wbinvd might
> be justified by hardware/IOMMU configuration, but can be limited by the
> admin?

Could be a per-device sysfs too. I'm not really sure what is useful
here.

> For example I proposed that the ioasidfd would bear the responsibility
> of a wbinvd ioctl and therefore validate the user's access to enable
> wbinvd emulation w/ KVM, so I'm assuming this module option lives
> there.

Right, this is what I was thinking

> What then is "automatic" mode? The user cannot create a non-coherent
> IOASID with a non-coherent device if the IOMMU supports no-snoop
> blocking? Do they get a failure? Does it get silently promoted to
> coherent?

"automatic" was just a way to keep the API the same as today. Today if
the IOMMU can block no-snoop then vfio disables wbinvd. To get the
same level of security automatic mode would detect that vfio would
have blocked wbinvd because the IOMMU can do it, and then always block
it.

It makes sense if there is an admin knob, as the admin could then move
to an explicit enable/disable to get functionality they can't get
today.

> In "disable" mode, I think we're just narrowing the restriction
> further, a non-coherent capable device cannot be used except in a
> forced coherent IOASID.

I wouldn't say "cannot be used" - just you can't get access to
wbinvd.

It is up to qemu if it wants to proceed or not. There is no issue with
allowing the use of no-snoop and blocking wbinvd, other than some
drivers may malfunction. If the user is certain they don't have
malfunctioning drivers then no issue to go ahead.

The current vfio arrangement (automatic) maximized compatibility. The
enable/disable options provide for max performance and max security as
alternative targets.
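
Restating those three targets as code (the enum and helper below are
purely illustrative, not proposed uAPI):

#include <stdbool.h>

enum nosnoop_policy {
    NOSNOOP_ENABLE,         /* allow wbinvd: max performance */
    NOSNOOP_AUTOMATIC,      /* today's vfio behavior: max compatibility */
    NOSNOOP_FORCE_DISABLE,  /* never allow wbinvd: max security */
};

/* May the user reach wbinvd in this configuration? */
static bool wbinvd_allowed(enum nosnoop_policy policy,
                           bool iommu_blocks_nosnoop)
{
    switch (policy) {
    case NOSNOOP_ENABLE:
        return true;
    case NOSNOOP_AUTOMATIC:
        /* same security as today: if the IOMMU can police
         * no-snoop, keep wbinvd blocked */
        return !iommu_blocks_nosnoop;
    case NOSNOOP_FORCE_DISABLE:
    default:
        return false;
    }
}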

> > It is the strength of Paolo's model that KVM should be able to do
> > optionally less, but never more, than the process itself can do.
>
> I think my previous reply was working towards those guidelines. I feel
> like we're mostly in agreement, but perhaps reading past each other.

Yes, I think I said we were agreeing :)

> Nothing here convinced me against my previous proposal that the
> ioasidfd bears responsibility for managing access to a wbinvd ioctl,
> and therefore the equivalent KVM access. Whether wbinvd is allowed or
> no-op'd when the user has access to a non-coherent device in a
> configuration where the IOMMU prevents non-coherent DMA is maybe still
> a matter of personal preference.

I think it makes the software design much simpler if the security
check is very simple. Possessing a suitable device in an ioasid fd
container is enough to flip on the feature and we don't need to track
changes from that point on. We don't need to revoke wbinvd if the
ioasid fd changes, for instance. Better to keep the kernel very simple
in this regard.

Seems agreeable enough that there is something here to explore in
patches when the time comes

Jason

2021-06-07 19:04:43

by Alex Williamson

[permalink] [raw]
Subject: Re: [RFC] /dev/ioasid uAPI proposal

On Mon, 7 Jun 2021 15:18:58 -0300
Jason Gunthorpe <[email protected]> wrote:

> On Mon, Jun 07, 2021 at 09:41:48AM -0600, Alex Williamson wrote:
> > You're calling this an admin knob, which to me suggests a global module
> > option, so are you trying to implement both an administrator and a user
> > policy? ie. the user can create scenarios where access to wbinvd might
> > be justified by hardware/IOMMU configuration, but can be limited by the
> > admin?
>
> Could be a per-device sysfs too. I'm not really sure what is useful
> here.
>
> > For example I proposed that the ioasidfd would bear the responsibility
> > of a wbinvd ioctl and therefore validate the user's access to enable
> > wbinvd emulation w/ KVM, so I'm assuming this module option lives
> > there.
>
> Right, this is what I was thinking
>
> > What then is "automatic" mode? The user cannot create a non-coherent
> > IOASID with a non-coherent device if the IOMMU supports no-snoop
> > blocking? Do they get a failure? Does it get silently promoted to
> > coherent?
>
> "automatic" was just a way to keep the API the same as today. Today if
> the IOMMU can block no-snoop then vfio disables wbinvd. To get the
> same level of security automatic mode would detect that vfio would
> have blocked wbinvd because the IOMMU can do it, and then always block
> it.
>
> It makes sense if there is an admin knob, as the admin could then move
> to an explicit enable/disable to get functionality they can't get
> today.
>
> > In "disable" mode, I think we're just narrowing the restriction
> > further, a non-coherent capable device cannot be used except in a
> > forced coherent IOASID.
>
> I wouldn't say "cannot be used" - just you can't get access to
> wbinvd.
>
> It is up to qemu if it wants to proceed or not. There is no issue with
> allowing the use of no-snoop and blocking wbinvd, other than some
> drivers may malfunction. If the user is certain they don't have
> malfunctioning drivers then no issue to go ahead.

A driver that knows how to use the device in a coherent way can
certainly proceed, but I suspect that's not something we can ask of
QEMU. QEMU has no visibility to the in-use driver and sketchy ability
to virtualize the no-snoop enable bit to prevent non-coherent DMA from
the device. There might be an experimental ("x-" prefixed) QEMU device
option to allow user override, but QEMU should disallow the possibility
of malfunctioning drivers by default. If we have devices that probe as
supporting no-snoop, but actually can't generate such traffic, we might
need a quirk list somewhere.

> The current vfio arrangement (automatic) maximized compatibility. The
> enable/disable options provide for max performance and max security as
> alternative targets.
>
> > > It is the strength of Paolo's model that KVM should be able to do
> > > optionally less, but never more, than the process itself can do.
> >
> > I think my previous reply was working towards those guidelines. I feel
> > like we're mostly in agreement, but perhaps reading past each other.
>
> Yes, I think I said we were agreeing :)
>
> > Nothing here convinced me against my previous proposal that the
> > ioasidfd bears responsibility for managing access to a wbinvd ioctl,
> > and therefore the equivalent KVM access. Whether wbinvd is allowed or
> > no-op'd when the user has access to a non-coherent device in a
> > configuration where the IOMMU prevents non-coherent DMA is maybe still
> > a matter of personal preference.
>
> I think it makes the software design much simpler if the security
> check is very simple. Possessing a suitable device in an ioasid fd
> container is enough to flip on the feature and we don't need to track
> changes from that point on. We don't need to revoke wbinvd if the
> ioasid fd changes, for instance. Better to keep the kernel very simple
> in this regard.

You're suggesting that a user isn't forced to give up wbinvd emulation
if they lose access to their device? I suspect that like we do today,
we'll want to re-evaluate the need for wbinvd on most device changes.
I think this is why the kvm-vfio device holds a vfio group reference;
to make sure a given group can't elevate privileges for multiple
processes. Thanks,

Alex

2021-06-07 19:09:28

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [RFC] /dev/ioasid uAPI proposal

On Mon, Jun 07, 2021 at 12:59:46PM -0600, Alex Williamson wrote:

> > It is up to qemu if it wants to proceed or not. There is no issue with
> > allowing the use of no-snoop and blocking wbinvd, other than some
> > drivers may malfunction. If the user is certain they don't have
> > malfunctioning drivers then no issue to go ahead.
>
> A driver that knows how to use the device in a coherent way can
> certainly proceed, but I suspect that's not something we can ask of
> QEMU. QEMU has no visibility to the in-use driver and sketchy ability
> to virtualize the no-snoop enable bit to prevent non-coherent DMA from
> the device. There might be an experimental ("x-" prefixed) QEMU device
> option to allow user override, but QEMU should disallow the possibility
> of malfunctioning drivers by default. If we have devices that probe as
> supporting no-snoop, but actually can't generate such traffic, we might
> need a quirk list somewhere.

Compatibility is important, but when I look in the kernel code I see
very few places that call wbinvd(). Basically all DRM for something
relevant to qemu.

That tells me that the vast majority of PCI devices do not generate
no-snoop traffic.

> > I think it makes the software design much simpler if the security
> > check is very simple. Possessing a suitable device in an ioasid fd
> > container is enough to flip on the feature and we don't need to track
> > changes from that point on. We don't need to revoke wbinvd if the
> > ioasid fd changes, for instance. Better to keep the kernel very simple
> > in this regard.
>
> You're suggesting that a user isn't forced to give up wbinvd emulation
> if they lose access to their device?

Sure, why do we need to be stricter? It is the same logic I gave
earlier, once an attacker process has access to wbinvd an attacker can
just keep its access indefinitely.

The main use case for revocation assumes that qemu would be
compromised after a device is hot-unplugged and you want to block off
wbinvd. But I have a hard time seeing that as useful enough to justify
all the complicated code to do it...

For KVM qemu can turn on/off on hot plug events as it requires to give
VM security. It doesn't need to rely on the kernel to control this.

But I think it is all fine tuning, the basic idea seems like it could
work, so we are not blocked here on kvm interactions.

Jason

2021-06-07 19:45:54

by Alex Williamson

[permalink] [raw]
Subject: Re: [RFC] /dev/ioasid uAPI proposal

On Mon, 7 Jun 2021 16:08:02 -0300
Jason Gunthorpe <[email protected]> wrote:

> On Mon, Jun 07, 2021 at 12:59:46PM -0600, Alex Williamson wrote:
>
> > > It is up to qemu if it wants to proceed or not. There is no issue with
> > > allowing the use of no-snoop and blocking wbinvd, other than some
> > > drivers may malfunction. If the user is certain they don't have
> > > malfunctioning drivers then no issue to go ahead.
> >
> > A driver that knows how to use the device in a coherent way can
> > certainly proceed, but I suspect that's not something we can ask of
> > QEMU. QEMU has no visibility to the in-use driver and sketchy ability
> > to virtualize the no-snoop enable bit to prevent non-coherent DMA from
> > the device. There might be an experimental ("x-" prefixed) QEMU device
> > option to allow user override, but QEMU should disallow the possibility
> > of malfunctioning drivers by default. If we have devices that probe as
> > supporting no-snoop, but actually can't generate such traffic, we might
> > need a quirk list somewhere.
>
> Compatibility is important, but when I look in the kernel code I see
> very few places that call wbinvd(). Basically all DRM for something
> relevant to qemu.
>
> That tells me that the vast majority of PCI devices do not generate
> no-snoop traffic.

Unfortunately, even just looking at devices across a couple laptops
most devices do support and have NoSnoop+ set by default. I don't
notice anything in the kernel that actually tries to set this enable (a
handful that actively disable), so I assume it's done by the firmware.
It's not safe for QEMU to make an assumption that only GPUs will
actually make use of it.
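
For reference, the NoSnoop+ that lspci reports is bit 11, "Enable No
Snoop", of the PCI Express Device Control register. A small sketch of
checking it through sysfs (assumes a little-endian host and root access
to the full config space):

/* Print whether "Enable No Snoop" (Device Control bit 11) is set for a
 * PCI device, i.e. what lspci shows as NoSnoop+/-.  Illustrative only. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    char path[256];
    uint8_t pos, cap[2];
    uint16_t devctl;
    int fd, guard = 48;

    if (argc != 2) {
        fprintf(stderr, "usage: %s <domain:bus:dev.fn>\n", argv[0]);
        return 1;
    }
    snprintf(path, sizeof(path), "/sys/bus/pci/devices/%s/config", argv[1]);
    fd = open(path, O_RDONLY);
    if (fd < 0 || pread(fd, &pos, 1, 0x34) != 1)
        return 1;

    while (pos && guard--) {            /* walk the capability list */
        if (pread(fd, cap, 2, pos) != 2)
            return 1;
        if (cap[0] == 0x10) {           /* PCI Express capability */
            if (pread(fd, &devctl, 2, pos + 0x08) != 2)
                return 1;
            printf("NoSnoop%c\n", (devctl & (1 << 11)) ? '+' : '-');
            return 0;
        }
        pos = cap[1];                   /* next capability */
    }
    fprintf(stderr, "no PCIe capability found\n");
    return 1;
}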

> > > I think it makes the software design much simpler if the security
> > > check is very simple. Possessing a suitable device in an ioasid fd
> > > container is enough to flip on the feature and we don't need to track
> > > changes from that point on. We don't need to revoke wbinvd if the
> > > ioasid fd changes, for instance. Better to keep the kernel very simple
> > > in this regard.
> >
> > You're suggesting that a user isn't forced to give up wbinvd emulation
> > if they lose access to their device?
>
> Sure, why do we need to be stricter? It is the same logic I gave
> earlier, once an attacker process has access to wbinvd an attacker can
> just keep its access indefinitely.
>
> The main use case for revocation assumes that qemu would be
> compromised after a device is hot-unplugged and you want to block off
> wbinvd. But I have a hard time seeing that as useful enough to justify
> all the complicated code to do it...

It's currently just a matter of the kvm-vfio device holding a reference
to the group so that it cannot be used elsewhere so long as it's being
used to elevate privileges on a given KVM instance. If we conclude that
access to a device with the right capability is required to gain a
privilege, I don't really see how we can wave aside that the privilege
isn't lost with the device.

> For KVM qemu can turn on/off on hot plug events as it requires to give
> VM security. It doesn't need to rely on the kernel to control this.

Yes, QEMU can reject a hot-unplug event, but then QEMU retains the
privilege that the device grants it. Releasing the device and
retaining the privilege gained by it seems wrong. Thanks,

Alex

2021-06-07 23:06:43

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [RFC] /dev/ioasid uAPI proposal

On Mon, Jun 07, 2021 at 01:41:28PM -0600, Alex Williamson wrote:

> > Compatibility is important, but when I look in the kernel code I see
> > very few places that call wbinvd(). Basically all DRM for something
> > relevant to qemu.
> >
> > That tells me that the vast majority of PCI devices do not generate
> > no-snoop traffic.
>
> Unfortunately, even just looking at devices across a couple laptops
> most devices do support and have NoSnoop+ set by default.

Yes, mine too, but that doesn't mean the device is issuing nosnoop
transactions, it just means the OS is allowing it to do so if it wants.

As I said, without driver support the feature cannot be used, and
there is no driver support in Linux outside DRM, unless it is
hidden.. Certainly I've never run into it..

Even mlx5 is setting the nosnoop bit, but I have a fairly high
confidence that we don't set the TLP bit for anything Linux does.

> It's not safe for QEMU to make an assumption that only GPUs will
> actually make use of it.

Not 100% safe, but if you know you are running Linux OS in the VM you
can look at the drivers the devices need and make a determination.

> Yes, QEMU can reject a hot-unplug event, but then QEMU retains the
> privilege that the device grants it. Releasing the device and
> retaining the privilege gained by it seems wrong. Thanks,

It is not completely ideal, but it is such a simplification, and I
can't really see a drawback..

Jason

2021-06-08 00:32:06

by Alex Williamson

[permalink] [raw]
Subject: Re: [RFC] /dev/ioasid uAPI proposal

On Mon, 7 Jun 2021 20:03:53 -0300
Jason Gunthorpe <[email protected]> wrote:

> On Mon, Jun 07, 2021 at 01:41:28PM -0600, Alex Williamson wrote:
>
> > > Compatibility is important, but when I look in the kernel code I see
> > > very few places that call wbinvd(). Basically all DRM for something
> > > relevant to qemu.
> > >
> > > That tells me that the vast majority of PCI devices do not generate
> > > no-snoop traffic.
> >
> > Unfortunately, even just looking at devices across a couple laptops
> > most devices do support and have NoSnoop+ set by default.
>
> Yes, mine too, but that doesn't mean the device is issuing nosnoop
> transactions, it just means the OS is allowing it to do so if it wants.
>
> As I said, without driver support the feature cannot be used, and
> there is no driver support in Linux outside DRM, unless it is
> hidden.. Certainly I've never run into it..
>
> Even mlx5 is setting the nosnoop bit, but I have a fairly high
> confidence that we don't set the TLP bit for anything Linux does.
>
> > It's not safe for QEMU to make an assumption that only GPUs will
> > actually make use of it.
>
> Not 100% safe, but if you know you are running Linux OS in the VM you
> can look at the drivers the devices need and make a determination.

QEMU doesn't know what guest it's running or what driver the guest is
using. QEMU can only create safe configurations by default, the same
as done now using vfio. Anything outside of that scope would require
experimental opt-in support by the user or a guarantee from the device
vendor that the device cannot ever (not just for the existing drivers)
create non-coherent TLPs. Thanks,

Alex

> > Yes, QEMU can reject a hot-unplug event, but then QEMU retains the
> > privilege that the device grants it. Releasing the device and
> > retaining the privilege gained by it seems wrong. Thanks,
>
> It is not completely ideal, but it is such a simplification, and I
> can't really see a drawback..
>
> Jason
>

2021-06-08 01:04:10

by Jason Wang

[permalink] [raw]
Subject: Re: [RFC] /dev/ioasid uAPI proposal


On 2021/6/7 10:14 PM, Jason Gunthorpe wrote:
> On Mon, Jun 07, 2021 at 11:18:33AM +0800, Jason Wang wrote:
>
>> Note that no drivers call these things doesn't meant it was not
>> supported by the spec.
> Of course it does. If the spec doesn't define exactly when the driver
> should call the cache flushes for no-snoop transactions then the
> protocol doesn't support no-soop.


Just to make sure we are on the same page: what I meant is that if a DMA
behavior like no-snoop is device specific, there's no need to mandate a
general virtio attribute; we can describe it per device. The fact that
the devices implemented in the current spec do not use non-coherent DMA
doesn't mean any future devices won't. The driver could choose a
transport (e.g. PCI), platform (ACPI), or device-specific (general
virtio command) way to detect and flush the cache when necessary.


>
> no-snoop is only used in very specific sequences of operations, like
> certain GPU usages, because regaining coherence on x86 is incredibly
> expensive.
>
> ie I wouldn't ever expect a NIC to use no-snoop because NIC's expect
> packets to be processed by the CPU.


For a NIC, yes. But virtio is more than just NICs. We've already
supported GPU and crypto devices; in those cases no-snoop will be useful
since the data is not necessarily expected to be processed by the CPU.

And a lot of other types of devices are being proposed.

Thanks


>
> "non-coherent DMA" is some general euphemism that evokes images of
> embedded platforms that don't have coherent DMA at all and have low
> cost ways to regain coherence. This is not at all what we are talking
> about here at all.
>
> Jason
>

2021-06-08 01:10:52

by Shenming Lu

[permalink] [raw]
Subject: Re: [RFC] /dev/ioasid uAPI proposal

On 2021/6/7 20:19, Liu, Yi L wrote:
>> From: Shenming Lu <[email protected]>
>> Sent: Friday, June 4, 2021 10:03 AM
>>
>> On 2021/6/4 2:19, Jacob Pan wrote:
>>> Hi Shenming,
>>>
>>> On Wed, 2 Jun 2021 12:50:26 +0800, Shenming Lu
>> <[email protected]>
>>> wrote:
>>>
>>>> On 2021/6/2 1:33, Jason Gunthorpe wrote:
>>>>> On Tue, Jun 01, 2021 at 08:30:35PM +0800, Lu Baolu wrote:
>>>>>
>>>>>> The drivers register per page table fault handlers to /dev/ioasid which
>>>>>> will then register itself to iommu core to listen and route the per-
>>>>>> device I/O page faults.
>>>>>
>>>>> I'm still confused why drivers need fault handlers at all?
>>>>
>>>> Essentially it is the userspace that needs the fault handlers,
>>>> one case is to deliver the faults to the vIOMMU, and another
>>>> case is to enable IOPF on the GPA address space for on-demand
>>>> paging, it seems that both could be specified in/through the
>>>> IOASID_ALLOC ioctl?
>>>>
>>> I would think IOASID_BIND_PGTABLE is where fault handler should be
>>> registered. There wouldn't be any IO page fault without the binding
>> anyway.
>>
>> Yeah, I also proposed this before, registering the handler in the
>> BIND_PGTABLE
>> ioctl does make sense for the guest page faults. :-)
>>
>> But how about the page faults from the GPA address space (it's page table is
>> mapped through the MAP_DMA ioctl)? From your point of view, it seems
>> that we should register the handler for the GPA address space in the (first)
>> MAP_DMA ioctl.
>
> under new proposal, I think the page fault handler is also registered
> per ioasid object. The difference compared with guest page table case
> is there is no need to inject the fault to VM.

Yeah. And there are some issues specific to the GPA address space case
which have been discussed with Alex.. Thanks,

Shenming

2021-06-08 01:12:49

by Jason Wang

[permalink] [raw]
Subject: Re: [RFC] /dev/ioasid uAPI proposal


On 2021/6/3 1:21 AM, Jason Gunthorpe wrote:
> On Wed, Jun 02, 2021 at 04:54:26PM +0800, Jason Wang wrote:
>> On 2021/6/2 1:31 AM, Jason Gunthorpe wrote:
>>> On Tue, Jun 01, 2021 at 04:47:15PM +0800, Jason Wang wrote:
>>>> We can open up to ~0U file descriptors, I don't see why we need to restrict
>>>> it in uAPI.
>>> There are significant problems with such large file descriptor
>>> tables. High FD numbers mean things like select don't work at all
>>> anymore and IIRC there are more complications.
>>
>> I don't see how much difference for IOASID and other type of fds. People can
>> choose to use poll or epoll.
> Not really, once one thing in an application uses a large number of FDs the
> entire application is affected. If any open() can return 'very big
> number' then nothing in the process is allowed to ever use select.
>
> It is not a trivial thing to ask for
>
>> And with the current proposal, (assuming there's a N:1 ioasid to ioasid). I
>> wonder how select can work for the specific ioasid.
> pagefault events are one thing that comes to mind. Bundling them all
> together into a single ring buffer is going to be necessary. Multifds
> just complicate this too
>
> Jason


Well, this sounds like a re-invention of io_uring, which already works
for multifds.

Thanks


2021-06-08 01:23:07

by Jason Wang

[permalink] [raw]
Subject: Re: [RFC] /dev/ioasid uAPI proposal


On 2021/6/8 3:41 AM, Alex Williamson wrote:
> On Mon, 7 Jun 2021 16:08:02 -0300
> Jason Gunthorpe <[email protected]> wrote:
>
>> On Mon, Jun 07, 2021 at 12:59:46PM -0600, Alex Williamson wrote:
>>
>>>> It is up to qemu if it wants to proceed or not. There is no issue with
>>>> allowing the use of no-snoop and blocking wbinvd, other than some
>>>> drivers may malfunction. If the user is certain they don't have
>>>> malfunctioning drivers then no issue to go ahead.
>>> A driver that knows how to use the device in a coherent way can
>>> certainly proceed, but I suspect that's not something we can ask of
>>> QEMU. QEMU has no visibility to the in-use driver and sketchy ability
>>> to virtualize the no-snoop enable bit to prevent non-coherent DMA from
>>> the device. There might be an experimental ("x-" prefixed) QEMU device
>>> option to allow user override, but QEMU should disallow the possibility
>>> of malfunctioning drivers by default. If we have devices that probe as
>>> supporting no-snoop, but actually can't generate such traffic, we might
>>> need a quirk list somewhere.
>> Compatibility is important, but when I look in the kernel code I see
>> very few places that call wbinvd(). Basically all DRM for something
>> relevant to qemu.
>>
>> That tells me that the vast majority of PCI devices do not generate
>> no-snoop traffic.
> Unfortunately, even just looking at devices across a couple laptops
> most devices do support and have NoSnoop+ set by default. I don't
> notice anything in the kernel that actually tries to set this enable (a
> handful that actively disable), so I assume it's done by the firmware.


I wonder whether or not it was done via ACPI:

"

6.2.17 _CCA (Cache Coherency Attribute) The _CCA object returns whether
or not a bus-master device supports hardware managed cache coherency.
Expected values are 0 to indicate it is not supported, and 1 to indicate
that it is supported. All other values are reserved.

...

On Intel platforms, if the _CCA object is not supplied, the OSPM will
assume the devices are hardware cache coherent.

"

Thanks


> It's not safe for QEMU to make an assumption that only GPUs will
> actually make use of it.
>
>>>> I think it makes the software design much simpler if the security
>>>> check is very simple. Possessing a suitable device in an ioasid fd
>>>> container is enough to flip on the feature and we don't need to track
>>>> changes from that point on. We don't need to revoke wbinvd if the
>>>> ioasid fd changes, for instance. Better to keep the kernel very simple
>>>> in this regard.
>>> You're suggesting that a user isn't forced to give up wbinvd emulation
>>> if they lose access to their device?
>> Sure, why do we need to be stricter? It is the same logic I gave
>> earlier, once an attacker process has access to wbinvd an attacker can
>> just keep its access indefinitely.
>>
>> The main use case for revocation assumes that qemu would be
>> compromised after a device is hot-unplugged and you want to block off
>> wbinvd. But I have a hard time seeing that as useful enough to justify
>> all the complicated code to do it...
> It's currently just a matter of the kvm-vfio device holding a reference
> to the group so that it cannot be used elsewhere so long as it's being
> used to elevate privileges on a given KVM instance. If we conclude that
> access to a device with the right capability is required to gain a
> privilege, I don't really see how we can wave aside that the privilege
> isn't lost with the device.
>
>> For KVM qemu can turn on/off on hot plug events as it requires to give
>> VM security. It doesn't need to rely on the kernel to control this.
> Yes, QEMU can reject a hot-unplug event, but then QEMU retains the
> privilege that the device grants it. Releasing the device and
> retaining the privilege gained by it seems wrong. Thanks,
>
> Alex
>

2021-06-08 02:27:48

by David Gibson

[permalink] [raw]
Subject: Re: [RFC] /dev/ioasid uAPI proposal

On Thu, Jun 03, 2021 at 06:49:20AM +0000, Tian, Kevin wrote:
> > From: David Gibson
> > Sent: Thursday, June 3, 2021 1:09 PM
> [...]
> > > > In this way the SW mode is the same as a HW mode with an infinite
> > > > cache.
> > > >
> > > > The collaposed shadow page table is really just a cache.
> > > >
> > >
> > > OK. One additional thing is that we may need a 'caching_mode"
> > > thing reported by /dev/ioasid, indicating whether invalidation is
> > > required when changing non-present to present. For hardware
> > > nesting it's not reported as the hardware IOMMU will walk the
> > > guest page table in cases of iotlb miss. For software nesting
> > > caching_mode is reported so the user must issue invalidation
> > > upon any change in guest page table so the kernel can update
> > > the shadow page table timely.
> >
> > For the fist cut, I'd have the API assume that invalidates are
> > *always* required. Some bypass to avoid them in cases where they're
> > not needed can be an additional extension.
> >
>
> Isn't the typical TLB semantic that non-present entries are not
> cached, thus invalidation is not required when making non-present
> entries present?

Usually, but not necessarily.

> It's true to both CPU TLB and IOMMU TLB.

I don't think it's entirely true of the CPU TLB on all ppc MMU models
(of which there are far too many).

> In reality
> I feel there are more usages built on hardware nesting than software
> nesting thus making default following hardware TLB behavior makes
> more sense...

I'm arguing for always-require-invalidate because it's strictly more
general. Requiring the invalidate will support models that don't
require it in all cases; we just make the invalidate a no-op. The
reverse is not true, so we should tackle the general case first, then
optimize.

--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson



2021-06-08 02:30:19

by David Gibson

[permalink] [raw]
Subject: Re: [RFC] /dev/ioasid uAPI proposal

On Fri, Jun 04, 2021 at 09:30:54AM -0300, Jason Gunthorpe wrote:
> On Fri, Jun 04, 2021 at 12:44:28PM +0200, Enrico Weigelt, metux IT consult wrote:
> > On 02.06.21 19:24, Jason Gunthorpe wrote:
> >
> > Hi,
> >
> > >> If I understand this correctly, /dev/ioasid is a kind of "common
> > supplier"
> > >> to other APIs / devices. Why can't the fd be acquired by the
> > >> consumer APIs (eg. kvm, vfio, etc) ?
> > >
> > > /dev/ioasid would be similar to /dev/vfio, and everything already
> > > deals with exposing /dev/vfio and /dev/vfio/N together
> > >
> > > I don't see it as a problem, just more work.
> >
> > One of the problems I'm seeing is in container environments: when
> > passing in an vfio device, we now also need to pass in /dev/ioasid,
> > thus increasing the complexity in container setup (or orchestration).
>
> Containers already needed to do this today. Container orchestration is
> hard.

Right, to use VFIO a container already needs both /dev/vfio and one or
more /dev/vfio/NNN group devices.

--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson



2021-06-08 02:30:24

by David Gibson

[permalink] [raw]
Subject: Re: [RFC] /dev/ioasid uAPI proposal

On Thu, Jun 03, 2021 at 08:52:24AM -0300, Jason Gunthorpe wrote:
> On Thu, Jun 03, 2021 at 03:13:44PM +1000, David Gibson wrote:
>
> > > We can still consider it a single "address space" from the IOMMU
> > > perspective. What has happened is that the address table is not just a
> > > 64 bit IOVA, but an extended ~80 bit IOVA formed by "PASID, IOVA".
> >
> > True. This does complexify how we represent what IOVA ranges are
> > valid, though. I'll bet you most implementations don't actually
> > implement a full 64-bit IOVA, which means we effectively have a large
> > number of windows from (0..max IOVA) for each valid pasid. This adds
> > another reason I don't think my concept of IOVA windows is just a
> > power specific thing.
>
> Yes
>
> Things rapidly get into weird hardware specific stuff though, the
> request will be for things like:
> "ARM PASID&IO page table format from SMMU IP block vXX"

So, I'm happy enough for picking a user-managed pagetable format to
imply the set of valid IOVA ranges (though a query might be nice).

I'm mostly thinking of representing (and/or choosing) valid IOVA
ranges as something for the kernel-managed pagetable style
(MAP/UNMAP).

> Which may have a bunch of (possibly very weird!) format specific data
> to describe and/or configure it.
>
> The uAPI needs to be suitably general here. :(
>
> > > If we are already going in the direction of having the IOASID specify
> > > the page table format and other details, specifying that the page
> > > tabnle format is the 80 bit "PASID, IOVA" format is a fairly small
> > > step.
> >
> > Well, rather I think userspace needs to request what page table format
> > it wants and the kernel tells it whether it can oblige or not.
>
> Yes, this is what I ment.
>
> Jason
>

--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson



2021-06-08 03:03:47

by David Gibson

[permalink] [raw]
Subject: Re: [RFC] /dev/ioasid uAPI proposal

On Tue, Jun 01, 2021 at 04:22:25PM -0600, Alex Williamson wrote:
> On Tue, 1 Jun 2021 07:01:57 +0000
> "Tian, Kevin" <[email protected]> wrote:
> >
> > I summarized five opens here, about:
> >
> > 1) Finalizing the name to replace /dev/ioasid;
> > 2) Whether one device is allowed to bind to multiple IOASID fd's;
> > 3) Carry device information in invalidation/fault reporting uAPI;
> > 4) What should/could be specified when allocating an IOASID;
> > 5) The protocol between vfio group and kvm;
> >
> ...
> >
> > For 5), I'd expect Alex to chime in. Per my understanding looks the
> > original purpose of this protocol is not about I/O address space. It's
> > for KVM to know whether any device is assigned to this VM and then
> > do something special (e.g. posted interrupt, EPT cache attribute, etc.).
>
> Right, the original use case was for KVM to determine whether it needs
> to emulate invlpg, so it needs to be aware when an assigned device is
> present and be able to test if DMA for that device is cache coherent.
> The user, QEMU, creates a KVM "pseudo" device representing the vfio
> group, providing the file descriptor of that group to show ownership.
> The ugly symbol_get code is to avoid hard module dependencies, ie. the
> kvm module should not pull in or require the vfio module, but vfio will
> be present if attempting to register this device.
>
> With kvmgt, the interface also became a way to register the kvm pointer
> with vfio for the translation mentioned elsewhere in this thread.
>
> The PPC/SPAPR support allows KVM to associate a vfio group to an IOMMU
> page table so that it can handle iotlb programming from pre-registered
> memory without trapping out to userspace.

To clarify, that's a guest-side logical vIOMMU page table which is
partially managed by KVM. This is an optimization - things can work
without it, but then guest iomap/unmap becomes a hot path because
each map/unmap hypercall has to go
guest -> KVM -> qemu -> VFIO

so there are multiple context transitions.

> > Because KVM deduces some policy based on the fact of assigned device,
> > it needs to hold a reference to related vfio group. this part is irrelevant
> > to this RFC.
>
> All of these use cases are related to the IOMMU, whether DMA is
> coherent, translating device IOVA to GPA, and an acceleration path to
> emulate IOMMU programming in kernel... they seem pretty relevant.
>
> > But ARM's VMID usage is related to I/O address space thus needs some
> > consideration. Another strange thing is about PPC. Looks it also leverages
> > this protocol to do iommu group attach: kvm_spapr_tce_attach_iommu_
> > group. I don't know why it's done through KVM instead of VFIO uAPI in
> > the first place.
>
> AIUI, IOMMU programming on PPC is done through hypercalls, so KVM needs
> to know how to handle those for in-kernel acceleration. Thanks,

For PAPR guests, which is the common case, yes. Bare metal POWER
hosts have their own page table format. And probably some of the
newer embedded ppc models have some different IOMMU model entirely,
but I'm not familiar with it.

--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson



2021-06-08 06:32:28

by Parav Pandit

[permalink] [raw]
Subject: RE: [RFC] /dev/ioasid uAPI proposal

Hi Jacob,

Sorry for the late response. Was on PTO on Friday last week.
Please see comments below.

> From: Jacob Pan <[email protected]>
> Sent: Friday, June 4, 2021 2:28 AM
>
> Hi Parav,
>
> On Tue, 1 Jun 2021 17:30:51 +0000, Parav Pandit <[email protected]> wrote:
>
> > > From: Tian, Kevin <[email protected]>
> > > Sent: Thursday, May 27, 2021 1:28 PM
> >
> > > 5.6. I/O page fault
> > > +++++++++++++++
> > >
> > > (uAPI is TBD. Here is just about the high-level flow from host IOMMU
> > > driver to guest IOMMU driver and backwards).
> > >
> > > - Host IOMMU driver receives a page request with raw fault_data {rid,
> > > pasid, addr};
> > >
> > > - Host IOMMU driver identifies the faulting I/O page table according
> > > to information registered by IOASID fault handler;
> > >
> > > - IOASID fault handler is called with raw fault_data (rid, pasid,
> > > addr), which is saved in ioasid_data->fault_data (used for
> > > response);
> > >
> > > - IOASID fault handler generates an user fault_data (ioasid, addr),
> > > links it to the shared ring buffer and triggers eventfd to
> > > userspace;
> > >
> > > - Upon received event, Qemu needs to find the virtual routing
> > > information (v_rid + v_pasid) of the device attached to the faulting
> > > ioasid. If there are multiple, pick a random one. This should be
> > > fine since the purpose is to fix the I/O page table on the guest;
> > >
> > > - Qemu generates a virtual I/O page fault through vIOMMU into guest,
> > > carrying the virtual fault data (v_rid, v_pasid, addr);
> > >
> > Why does it have to be through vIOMMU?
> I think this flow is for fully emulated IOMMU, the same IOMMU and device
> drivers run in the host and guest. Page request interrupt is reported by the
> IOMMU, thus reporting to vIOMMU in the guest.
In the non-emulated case, how will the guest's page fault be handled?
If I take the Intel example, I thought the FL page table entry still needs to be handled by the guest, which in turn fills up the 2nd-level page table entries.
No?

>
> > For a VFIO PCI device, have you considered to reuse the same PRI
> > interface to inject page fault in the guest? This eliminates any new
> > v_rid. It will also route the page fault request and response through
> > the right vfio device.
> >
> I am curious how would PCI PRI can be used to inject fault. Are you talking
> about PCI config PRI extended capability structure?
The PCI PRI capability is only there to expose page fault support.
Page fault injection/response cannot happen through the PCI cap anyway;
this requires a side channel.
I was suggesting to emulate the pci_endpoint->rc->iommu->iommu_irq path of the hypervisor as

vmm->guest_emulated_pri_device->pri_req/rsp queue(s).


> The control is very
> limited, only enable and reset. Can you explain how would page fault
> handled in generic PCI cap?
Not via the PCI cap.
Through a more generic interface without attaching to the vIOMMU.

> Some devices may have device specific way to handle page faults, but I guess
> this is not the PCI PRI method you are referring to?
This was my next question: if the page fault reporting and response interface is generic, it will be more scalable, given that PCI PRI is limited to single-page requests.
Additionally, VT-d seems to funnel all the page fault interrupts via a single IRQ.
And thirdly, there is its requirement to always come through the hypervisor intermediary.

Having a generic mechanism will help to overcome the above limitations, as Jean already pointed out that page fault handling is a hot path.

>
> > > - Guest IOMMU driver fixes up the fault, updates the I/O page table,
> > > and then sends a page response with virtual completion data (v_rid,
> > > v_pasid, response_code) to vIOMMU;
> > >
> > What about fixing up the fault for mmu page table as well in guest?
> > Or you meant both when above you said "updates the I/O page table"?
> >
> > It is unclear to me that if there is single nested page table
> > maintained or two (one for cr3 references and other for iommu). Can
> > you please clarify?
> >
> I think it is just one, at least for VT-d, guest cr3 in GPA is stored in the host
> iommu. Guest iommu driver calls handle_mm_fault to fix the mmu page
> tables which is shared by the iommu.
>
So if the guest has touched the page data, the FL and SL entries of the MMU should be populated and the IOMMU side should not even reach the point of raising a PRI
(ATS should be enough),
because the IOMMU side shares the same FL and SL table entries referred to by the scalable-mode PASID-table entry format described in Section 9.6.
Is that correct?

> > > - Qemu finds the pending fault event, converts virtual completion data
> > > into (ioasid, response_code), and then calls a /dev/ioasid ioctl to
> > > complete the pending fault;
> > >
> > For VFIO PCI device a virtual PRI request response interface is done,
> > it can be generic interface among multiple vIOMMUs.
> >
> same question above, not sure how this works in terms of interrupts and
> response queuing etc.
>
Citing "VFIO PCI device" was wrong on my part.
Was considering a generic page fault device to expose in guest that has request/response queues.
This way it is not attached to specific viommu driver and having other benefits explained above.
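
For reference, a sketch of the userspace half of the fault flow quoted
above: the kernel queues fault records into a shared ring and signals an
eventfd, and the VMM drains the ring and completes each fault. The record
layout, ring convention and the IOASID_PAGE_RESPONSE ioctl are
assumptions, not proposed uAPI.

#include <stdint.h>
#include <sys/ioctl.h>
#include <unistd.h>

struct ioasid_fault_record {        /* assumed layout */
    uint32_t ioasid;
    uint32_t flags;
    uint64_t addr;
    uint64_t cookie;                /* echoed back in the response */
};

struct ioasid_fault_response {      /* assumed layout */
    uint64_t cookie;
    uint32_t response_code;         /* success / invalid / ... */
    uint32_t pad;
};

#define IOASID_PAGE_RESPONSE _IOW('i', 0x10, struct ioasid_fault_response)

/* Called when the fault eventfd becomes readable.  For brevity this
 * completes each fault immediately; a real VMM would first route it to
 * the vIOMMU (or a generic fault device) and respond only after the
 * guest has fixed up its I/O page table. */
static void drain_faults(int event_fd, int ioasid_fd,
                         struct ioasid_fault_record *ring,
                         uint32_t ring_size, uint32_t *tail)
{
    uint64_t n;

    if (read(event_fd, &n, sizeof(n)) != (ssize_t)sizeof(n))
        return;

    while (n--) {
        struct ioasid_fault_record *rec = &ring[*tail % ring_size];
        struct ioasid_fault_response rsp = {
            .cookie = rec->cookie,
            .response_code = 0,     /* assume success */
        };

        ioctl(ioasid_fd, IOASID_PAGE_RESPONSE, &rsp);
        (*tail)++;
    }
}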

2021-06-08 06:56:05

by David Gibson

[permalink] [raw]
Subject: Re: [RFC] /dev/ioasid uAPI proposal

On Fri, Jun 04, 2021 at 12:24:08PM +0200, Jean-Philippe Brucker wrote:
> On Thu, Jun 03, 2021 at 03:45:09PM +1000, David Gibson wrote:
> > > > But it would certainly be possible for a system to have two
> > > > different host bridges with two different IOMMUs with different
> > > > pagetable formats. Until you know which devices (and therefore
> > > > which host bridge) you're talking about, you don't know what formats
> > > > of pagetable to accept. And if you have devices from *both* bridges
> > > > you can't bind a page table at all - you could theoretically support
> > > > a kernel managed pagetable by mirroring each MAP and UNMAP to tables
> > > > in both formats, but it would be pretty reasonable not to support
> > > > that.
> > >
> > > The basic process for a user space owned pgtable mode would be:
> > >
> > > 1) qemu has to figure out what format of pgtable to use
> > >
> > > Presumably it uses query functions using the device label.
> >
> > No... in the qemu case it would always select the page table format
> > that it needs to present to the guest. That's part of the
> > guest-visible platform that's selected by qemu's configuration.
> >
> > There's no negotiation here: either the kernel can supply what qemu
> > needs to pass to the guest, or it can't. If it can't qemu, will have
> > to either emulate in SW (if possible, probably using a kernel-managed
> > IOASID to back it) or fail outright.
> >
> > > The
> > > kernel code should look at the entire device path through all the
> > > IOMMU HW to determine what is possible.
> > >
> > > Or it already knows because the VM's vIOMMU is running in some
> > > fixed page table format, or the VM's vIOMMU already told it, or
> > > something.
> >
> > Again, I think you have the order a bit backwards. The user selects
> > the capabilities that the vIOMMU will present to the guest as part of
> > the qemu configuration. Qemu then requests that of the host kernel,
> > and either the host kernel supplies it, qemu emulates it in SW, or
> > qemu fails to start.
>
> Hm, how fine a capability are we talking about? If it's just "give me
> VT-d capabilities" or "give me Arm capabilities" that would work, but
> probably isn't useful. Anything finer will be awkward because userspace
> will have to try combinations of capabilities to see what sticks, and
> supporting new hardware will drop compatibility for older one.

For the qemu case, I would imagine a two stage fallback:

1) Ask for the exact IOMMU capabilities (including pagetable
format) that the vIOMMU has. If the host can supply, you're
good

2) If not, ask for a kernel managed IOAS. Verify that it can map
all the IOVA ranges the guest vIOMMU needs, and has an equal or
smaller pagesize than the guest vIOMMU presents. If so,
software emulate the vIOMMU by shadowing guest io pagetable
updates into the kernel managed IOAS.

3) You're out of luck, don't start.

For both (1) and (2) I'd expect it to be asking this question *after*
saying what devices are attached to the IOAS, based on the virtual
hardware configuration. That doesn't cover hotplug, of course, for
that you have to just fail the hotplug if the new device isn't
supportable with the IOAS you already have.

One can imagine optimizations where for certain intermediate cases you
could do a lighter SW emu if the host supports a model that's close to
the vIOMMU one, and you're able to trap and emulate the differences.
In practice I doubt anyone's going to have time to look for such cases
and implement the logic for it.
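
Restated as a sketch (every function name below is invented and stands in
for the real capability query or creation step):

#include <stdbool.h>

/* Stubs standing in for the real queries against /dev/ioasid. */
static bool host_supports_exact_viommu_format(void) { return false; }
static bool kernel_ioas_covers_guest_iova(void)     { return true; }
static bool kernel_ioas_pagesize_ok(void)           { return true; }

enum viommu_backing { BACKING_HW_NESTED, BACKING_SW_SHADOW, BACKING_NONE };

static enum viommu_backing pick_backing(void)
{
    /* 1) ask for the exact capabilities the vIOMMU presents
     *    (page table format, PASID, page sizes, ...) */
    if (host_supports_exact_viommu_format())
        return BACKING_HW_NESTED;

    /* 2) fall back to a kernel-managed IOAS and shadow guest I/O
     *    page table updates into it, provided it covers all guest
     *    IOVA ranges with an equal or smaller page size */
    if (kernel_ioas_covers_guest_iova() && kernel_ioas_pagesize_ok())
        return BACKING_SW_SHADOW;

    /* 3) out of luck: refuse to start (or fail the hotplug) */
    return BACKING_NONE;
}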

> For example depending whether the hardware IOMMU is SMMUv2 or SMMUv3, that
> completely changes the capabilities offered to the guest (some v2
> implementations support nesting page tables, but never PASID nor PRI
> unlike v3.) The same vIOMMU could support either, presenting different
> capabilities to the guest, even multiple page table formats if we wanted
> to be exhaustive (SMMUv2 supports the older 32-bit descriptor), but it
> needs to know early on what the hardware is precisely. Then some new page
> table format shows up and, although the vIOMMU can support that in
> addition to older ones, QEMU will have to pick a single one, that it
> assumes the guest knows how to drive?
>
> I think once it binds a device to an IOASID fd, QEMU will want to probe
> what hardware features are available before going further with the vIOMMU
> setup (is there PASID, PRI, which page table formats are supported,
> address size, page granule, etc). Obtaining precise information about the
> hardware would be less awkward than trying different configurations until
> one succeeds. Binding an additional device would then fail if its pIOMMU
> doesn't support exactly the features supported for the first device,
> because we don't know which ones the guest will choose. QEMU will have to
> open a new IOASID fd for that device.

No, this fundamentally misunderstands the qemu model. The user
*chooses* the guest visible platform, and qemu supplies it or fails.
There is no negotiation with the guest, because this makes managing
migration impossibly difficult.

-cpu host is an exception, which is used because it is so useful, but
it's kind of a pain on the qemu side. Virt management systems like
oVirt/RHV almost universally *do not use* -cpu host, precisely because
it cannot support predictable migration.

--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson



2021-06-08 06:57:19

by David Gibson

[permalink] [raw]
Subject: Re: [RFC] /dev/ioasid uAPI proposal

On Thu, Jun 03, 2021 at 09:11:05AM -0300, Jason Gunthorpe wrote:
> On Thu, Jun 03, 2021 at 03:45:09PM +1000, David Gibson wrote:
> > On Wed, Jun 02, 2021 at 01:58:38PM -0300, Jason Gunthorpe wrote:
> > > On Wed, Jun 02, 2021 at 04:48:35PM +1000, David Gibson wrote:
> > > > > > /* Bind guest I/O page table */
> > > > > > bind_data = {
> > > > > > .ioasid = gva_ioasid;
> > > > > > .addr = gva_pgtable1;
> > > > > > // and format information
> > > > > > };
> > > > > > ioctl(ioasid_fd, IOASID_BIND_PGTABLE, &bind_data);
> > > > >
> > > > > Again I do wonder if this should just be part of alloc_ioasid. Is
> > > > > there any reason to split these things? The only advantage to the
> > > > > split is the device is known, but the device shouldn't impact
> > > > > anything..
> > > >
> > > > I'm pretty sure the device(s) could matter, although they probably
> > > > won't usually.
> > >
> > > It is a bit subtle, but the /dev/iommu fd itself is connected to the
> > > devices first. This prevents wildly incompatible devices from being
> > > joined together, and allows some "get info" to report the capability
> > > union of all devices if we want to do that.
> >
> > Right.. but I've not been convinced that having a /dev/iommu fd
> > instance be the boundary for these types of things actually makes
> > sense. For example if we were doing the preregistration thing
> > (whether by child ASes or otherwise) then that still makes sense
> > across wildly different devices, but we couldn't share that layer if
> > we have to open different instances for each of them.
>
> It is something that still seems up in the air.. What seems clear for
> /dev/iommu is that it
> - holds a bunch of IOASID's organized into a tree
> - holds a bunch of connected devices

Right, and it's still not really clear to me what devices connected to
the same /dev/iommu instance really need to have in common, as
distinct from what devices connected to the same specific ioasid need
to have in common.

> - holds a pinned memory cache
>
> One thing it must do is enforce IOMMU group security. A device cannot
> be attached to an IOASID unless all devices in its IOMMU group are
> part of the same /dev/iommu FD.

Well, you can't attach a device to an individual IOASID unless all
devices in its group are attached to the same individual IOASID
either, so I'm not clear what benefit there is to enforcing it at the
/dev/iommu instance as well as at the individual ioasid level.
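
The per-IOASID rule being referred to, as a sketch (the data layout is
purely illustrative):

#include <stdbool.h>
#include <stddef.h>

struct device_state {
    int group_id;           /* IOMMU group the device belongs to */
    int attached_ioasid;    /* -1 if not attached */
};

/* A device may attach to an IOASID only if every other device in its
 * IOMMU group is either unattached or attached to that same IOASID. */
static bool group_attach_ok(const struct device_state *devs, size_t n,
                            const struct device_state *dev, int ioasid)
{
    for (size_t i = 0; i < n; i++) {
        if (devs[i].group_id != dev->group_id)
            continue;
        if (devs[i].attached_ioasid != -1 &&
            devs[i].attached_ioasid != ioasid)
            return false;
    }
    return true;
}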

> The big open question is what parameters govern allowing devices to
> connect to the /dev/iommu:
> - all devices can connect and we model the differences inside the API
> somehow.
> - Only sufficiently "similar" devices can be connected
> - The FD's capability is the minimum of all the connected devices
>
> There are some practical problems here, when an IOASID is created the
> kernel does need to allocate a page table for it, and that has to be
> in some definite format.
>
> It may be that we had a false start thinking the FD container should
> be limited. Perhaps creating an IOASID should pass in a list
> of the "device labels" that the IOASID will be used with and that can
> guide the kernel what to do?
>
> > Right, but at this stage I'm just not seeing a really clear (across
> > platforms and device typpes) boundary for what things have to be per
> > IOASID container and what have to be per IOASID, so I'm just not sure
> > the /dev/iommu instance grouping makes any sense.
>
> I would push as much stuff as possible to be per-IOASID..

I agree. But the question is what's *not* possible to be per-IOASID,
so what's the semantic boundary that defines when things have to be in
the same /dev/iommu instance, but not the same IOASID.

> > > I don't know if that small advantage is worth the extra complexity
> > > though.
> > >
> > > > But it would certainly be possible for a system to have two
> > > > different host bridges with two different IOMMUs with different
> > > > pagetable formats. Until you know which devices (and therefore
> > > > which host bridge) you're talking about, you don't know what formats
> > > > of pagetable to accept. And if you have devices from *both* bridges
> > > > you can't bind a page table at all - you could theoretically support
> > > > a kernel managed pagetable by mirroring each MAP and UNMAP to tables
> > > > in both formats, but it would be pretty reasonable not to support
> > > > that.
> > >
> > > The basic process for a user space owned pgtable mode would be:
> > >
> > > 1) qemu has to figure out what format of pgtable to use
> > >
> > > Presumably it uses query functions using the device label.
> >
> > No... in the qemu case it would always select the page table format
> > that it needs to present to the guest. That's part of the
> > guest-visible platform that's selected by qemu's configuration.
>
> I should have said "vfio user" here because apps like DPDK might use
> this path

Ok.

> > > 4) For the next device qemu would have to figure out if it can re-use
> > > an existing IOASID based on the required proeprties.
> >
> > Nope. Again, what devices share an IO address space is a guest
> > visible part of the platform. If the host kernel can't supply that,
> > then qemu must not start (or fail the hotplug if the new device is
> > being hotplugged).
>
> qemu can always emulate.

No, not always, only sometimes. The host side IOAS has to be able to
process all the IOVAs that the guest might generate, and it needs to
have an equal or smaller pagesize than the guest expects.

> If the config requires to devices that cannot
> share an IOASID because the local platform is wonky then qemu needs to
> shadow and duplicate the IO page table from the guest into two IOASID
> objects to make it work. This is a SW emulation option.
>
> > For this reason, amongst some others, I think when selecting a kernel
> > managed pagetable we need to also have userspace explicitly request
> > which IOVA ranges are mappable, and what (minimum) page size it
> > needs.
>
> It does make sense
>
> Jason
>

--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson



2021-06-08 06:58:04

by David Gibson

[permalink] [raw]
Subject: Re: [RFC] /dev/ioasid uAPI proposal

On Thu, Jun 03, 2021 at 09:28:32AM -0300, Jason Gunthorpe wrote:
> On Thu, Jun 03, 2021 at 03:23:17PM +1000, David Gibson wrote:
> > On Wed, Jun 02, 2021 at 01:37:53PM -0300, Jason Gunthorpe wrote:
> > > On Wed, Jun 02, 2021 at 04:57:52PM +1000, David Gibson wrote:
> > >
> > > > I don't think presence or absence of a group fd makes a lot of
> > > > difference to this design. Having a group fd just means we attach
> > > > groups to the ioasid instead of individual devices, and we no longer
> > > > need the bookkeeping of "partial" devices.
> > >
> > > Oh, I think we really don't want to attach the group to an ioasid, or
> > > at least not as a first-class idea.
> > >
> > > The fundamental problem that got us here is we now live in a world
> > > where there are many ways to attach a device to an IOASID:
> >
> > I'm not seeing that that's necessarily a problem.
> >
> > > - A RID binding
> > > - A RID,PASID binding
> > > - A RID,PASID binding for ENQCMD
> >
> > I have to admit I haven't fully grasped the differences between these
> > modes. I'm hoping we can consolidate at least some of them into the
> > same sort of binding onto different IOASIDs (which may be linked in
> > parent/child relationships).
>
> What I would like is that the /dev/iommu side managing the IOASID
> doesn't really care much, but the device driver has to tell
> drivers/iommu what it is going to do when it attaches.

By the device driver, do you mean the userspace or guest device
driver? Or do you mean the vfio_pci or mdev "shim" device driver?

> It makes sense, in PCI terms, only the driver knows what TLPs the
> device will generate. The IOMMU needs to know what TLPs it will
> receive to configure properly.
>
> PASID or not is major device specific variation, as is the ENQCMD/etc
>
> Having the device be explicit when it tells the IOMMU what it is going
> to be sending is a major plus to me. I actually don't want to see this
> part of the interface be made less strong.

Ok, if I'm understanding this right a PASID capable IOMMU will be able
to process *both* transactions with just a RID and transactions with a
RID+PASID.

So if we're thinking of this notional 84ish-bit address space, then
that includes "no PASID" as well as all the possible PASID values.
Yes? Or am I confused?

>
> > > The selection of which mode to use is based on the specific
> > > driver/device operation. Ie the thing that implements the 'struct
> > > vfio_device' is the thing that has to select the binding mode.
> >
> > I thought userspace selected the binding mode - although not all modes
> > will be possible for all devices.
>
> /dev/iommu is concerned with setting up the IOAS and filling the IO
> page tables with information
>
> The driver behind "struct vfio_device" is responsible to "route" its
> HW into that IOAS.
>
> They are two halves of the problem: one is only the io page table, and the
> other is the connection of a PCI TLP to a specific io page table.
>
> Only the driver knows what format of TLPs the device will generate so
> only the driver can specify the "route"

Ok. I'd really like if we can encode this in a way that doesn't build
PCI-specific structure into the API, though.
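
A strawman of what that could look like, with the PCI-specific bits pushed
into one arm of a union rather than into the top-level call (all names here
are invented; nothing is existing uAPI):

#include <stdint.h>

/* Strawman only: how a vfio_device driver might describe its "route"
 * into an IOAS without making the top-level API PCI-specific. */
enum ioas_route_type {
	IOAS_ROUTE_PCI_RID,		/* DMA tagged by Requester ID only */
	IOAS_ROUTE_PCI_RID_PASID,	/* DMA tagged by RID + PASID */
	IOAS_ROUTE_PLATFORM_SID,	/* e.g. ARM StreamID, non-PCI buses */
};

struct ioas_attach_route {
	uint32_t ioasid;		/* target I/O address space */
	uint32_t type;			/* enum ioas_route_type */
	union {
		struct {
			uint16_t rid;
			uint16_t pad;
			uint32_t pasid;	/* unused for IOAS_ROUTE_PCI_RID */
		} pci;
		struct {
			uint32_t sid;
			uint32_t ssid;
		} platform;
	};
};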

>
> > > eg if two PCI devices are in a group then it is perfectly fine that
> > > one device uses RID binding and the other device uses RID,PASID
> > > binding.
> >
> > Uhhhh... I don't see how that can be. They could well be in the same
> > group because their RIDs cannot be distinguished from each other.
>
> Inability to match the RID is rare, certainly I would expect any IOMMU
> HW that can do PCIe PASID matching can also do RID matching.

It's not just up to the IOMMU. The obvious case is a PCIe-to-PCI
bridge. All transactions show the RID of the bridge, because vanilla
PCI doesn't have them. Same situation with a buggy multifunction
device which uses function 0's RID for all functions.

It may be rare, but we still have to deal with it one way or another.

I really don't think we want to support multiple binding types for a
single group.

> With
> such HW the above is perfectly fine - the group may not be secure
> between members (eg !ACS), but the TLPs still carry valid RIDs and
> PASID and the IOMMU can still discriminate.

They carry RIDs; whether they're valid depends on how buggy your
hardware is.

> I think you are talking about really old IOMMU's that could only
> isolate based on ingress port or something.. I suppose modern PCIe has
> some cases like this in the NTB stuff too.

Depends what you mean by really old. They may seem really old to
those working on new fancy IOMMU technology. But I hit problems in
practice not long ago with awkwardly multi-device groups because it
was on a particular Dell server without ACS implementation. Likewise
I strongly suspect non-PASID IOMMUs will remain common on low end
hardware (like people's laptops) for some time.

> Oh, I hadn't spent time thinking about any of those.. It is messy but
> it can still be forced to work, I guess. A device centric model means
> all the devices using the same routing ID have to be connected to the
> same IOASID by userspace. So some of the connections will be NOPs.

See, that's exactly what I thought the group checks were enforcing.
I'm really hoping we don't need two levels of granularity here: groups
of devices that can't be identified from each other, and then groups
of those that can't be isolated from each other. That introduces a
huge amount of extra conceptual complexity.

--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson



2021-06-08 06:59:23

by David Gibson

[permalink] [raw]
Subject: Re: [RFC] /dev/ioasid uAPI proposal

On Thu, Jun 03, 2021 at 07:17:23AM +0000, Tian, Kevin wrote:
> > From: David Gibson <[email protected]>
> > Sent: Wednesday, June 2, 2021 2:15 PM
> >
> [...]
> > > An I/O address space takes effect in the IOMMU only after it is attached
> > > to a device. The device in the /dev/ioasid context always refers to a
> > > physical one or 'pdev' (PF or VF).
> >
> > What you mean by "physical" device here isn't really clear - VFs
> > aren't really physical devices, and the PF/VF terminology also doesn't
> > extend to non-PCI devices (which I think we want to consider for the
> > API, even if we're not implementing it any time soon).
>
> Yes, it's not very clear, and more in PCI context to simplify the
> description. A "physical" one here means a PCI endpoint function
> which has a unique RID. It's more to differentiate from the later mdev/
> subdevice which uses both RID+PASID. Naming is always a hard
> exercise to me... Possibly I'll just use device vs. subdevice in future
> versions.
>
> >
> > Now, it's clear that we can't program things into the IOMMU before
> > attaching a device - we might not even know which IOMMU to use.
>
> yes
>
> > However, I'm not sure if its wise to automatically make the AS "real"
> > as soon as we attach a device:
> >
> > * If we're going to attach a whole bunch of devices, could we (for at
> > least some IOMMU models) end up doing a lot of work which then has
> > to be re-done for each extra device we attach?
>
> which extra work did you specifically refer to? each attach just implies
> writing the base address of the I/O page table to the IOMMU structure
> corresponding to this device (either being a per-device entry, or per
> device+PASID entry).
>
> and generally device attach should not be in a hot path.
>
> >
> > * With kernel managed IO page tables could attaching a second device
> > (at least on some IOMMU models) require some operation which would
> > require discarding those tables? e.g. if the second device somehow
> > forces a different IO page size
>
> Then the attach should fail and the user should create another IOASID
> for the second device.

Couldn't this make things weirdly order-dependent, though? If device A
has strictly more capabilities than device B, then attaching A then B
will be fine, but B then A will force creating a second IOASID.

> > For that reason I wonder if we want some sort of explicit enable or
> > activate call. Device attaches would only be valid before, map or
> > attach pagetable calls would only be valid after.
>
> I'm interested in learning a real example requiring explicit enable...
>
> >
> > > One I/O address space could be attached to multiple devices. In this case,
> > > /dev/ioasid uAPI applies to all attached devices under the specified IOASID.
> > >
> > > Based on the underlying IOMMU capability one device might be allowed
> > > to attach to multiple I/O address spaces, with DMAs accessing them by
> > > carrying different routing information. One of them is the default I/O
> > > address space routed by PCI Requestor ID (RID) or ARM Stream ID. The
> > > remaining are routed by RID + Process Address Space ID (PASID) or
> > > Stream+Substream ID. For simplicity the following context uses RID and
> > > PASID when talking about the routing information for I/O address spaces.
> >
> > I'm not really clear on how this interacts with nested ioasids. Would
> > you generally expect the RID+PASID IOASes to be children of the base
> > RID IOAS, or not?
>
> No. With Intel SIOV both parent/children could be RID+PASID, e.g.
> when one enables vSVA on a mdev.

Hm, ok. I really haven't understood how the PASIDs fit into this
then. I'll try again on v2.

> > If the PASID ASes are children of the RID AS, can we consider this not
> > as the device explicitly attaching to multiple IOASIDs, but instead
> > attaching to the parent IOASID with awareness of the child ones?
> >
> > > Device attachment is initiated through passthrough framework uAPI (use
> > > VFIO for simplicity in following context). VFIO is responsible for identifying
> > > the routing information and registering it to the ioasid driver when calling
> > > ioasid attach helper function. It could be RID if the assigned device is
> > > pdev (PF/VF) or RID+PASID if the device is mediated (mdev). In addition,
> > > user might also provide its view of virtual routing information (vPASID) in
> > > the attach call, e.g. when multiple user-managed I/O address spaces are
> > > attached to the vfio_device. In this case VFIO must figure out whether
> > > vPASID should be directly used (for pdev) or converted to a kernel-
> > > allocated one (pPASID, for mdev) for physical routing (see section 4).
> > >
> > > Device must be bound to an IOASID FD before attach operation can be
> > > conducted. This is also through VFIO uAPI. In this proposal one device
> > > should not be bound to multiple FD's. Not sure about the gain of
> > > allowing it except adding unnecessary complexity. But if others have
> > > different view we can further discuss.
> > >
> > > VFIO must ensure its device composes DMAs with the routing information
> > > attached to the IOASID. For pdev it naturally happens since vPASID is
> > > directly programmed to the device by guest software. For mdev this
> > > implies any guest operation carrying a vPASID on this device must be
> > > trapped into VFIO and then converted to pPASID before sent to the
> > > device. A detail explanation about PASID virtualization policies can be
> > > found in section 4.
> > >
> > > Modern devices may support a scalable workload submission interface
> > > based on PCI DMWr capability, allowing a single work queue to access
> > > multiple I/O address spaces. One example is Intel ENQCMD, having
> > > PASID saved in the CPU MSR and carried in the instruction payload
> > > when sent out to the device. Then a single work queue shared by
> > > multiple processes can compose DMAs carrying different PASIDs.
> >
> > Is the assumption here that the processes share the IOASID FD
> > instance, but not memory?
>
> I didn't get this question

Ok, stepping back, what exactly do you mean by "processes" above? Do
you mean Linux processes, or something else?

> > > When executing ENQCMD in the guest, the CPU MSR includes a vPASID
> > > which, if targeting a mdev, must be converted to pPASID before sent
> > > to the wire. Intel CPU provides a hardware PASID translation capability
> > > for auto-conversion in the fast path. The user is expected to setup the
> > > PASID mapping through KVM uAPI, with information about {vpasid,
> > > ioasid_fd, ioasid}. The ioasid driver provides helper function for KVM
> > > to figure out the actual pPASID given an IOASID.
> > >
> > > With above design /dev/ioasid uAPI is all about I/O address spaces.
> > > It doesn't include any device routing information, which is only
> > > indirectly registered to the ioasid driver through VFIO uAPI. For
> > > example, I/O page fault is always reported to userspace per IOASID,
> > > although it's physically reported per device (RID+PASID). If there is a
> > > need of further relaying this fault into the guest, the user is responsible
> > > of identifying the device attached to this IOASID (randomly pick one if
> > > multiple attached devices) and then generates a per-device virtual I/O
> > > page fault into guest. Similarly the iotlb invalidation uAPI describes the
> > > granularity in the I/O address space (all, or a range), different from the
> > > underlying IOMMU semantics (domain-wide, PASID-wide, range-based).
> > >
> > > I/O page tables routed through PASID are installed in a per-RID PASID
> > > table structure. Some platforms implement the PASID table in the guest
> > > physical space (GPA), expecting it managed by the guest. The guest
> > > PASID table is bound to the IOMMU also by attaching to an IOASID,
> > > representing the per-RID vPASID space.
> >
> > Do we need to consider two management modes here, much as we have for
> > the pagetables themselves: either kernel managed, in which we have
> > explicit calls to bind a vPASID to a parent PASID, or user managed in
> > which case we register a table in some format.
>
> yes, this is related to PASID virtualization in section 4. And based on a
> suggestion from Jason, the vPASID requirement will be reported to
> user space via the per-device reporting interface.
>
> Thanks
> Kevin
>

--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson



2021-06-08 07:57:57

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [RFC] /dev/ioasid uAPI proposal

On 07/06/21 19:59, Jason Gunthorpe wrote:
>> The KVM interface is the same kvm-vfio device that exists already. The
>> userspace API does not need to change at all: adding one VFIO file
>> descriptor with WBINVD enabled to the kvm-vfio device lets the VM use WBINVD
>> functionality (see kvm_vfio_update_coherency).
>
> The problem is we are talking about adding a new /dev/ioasid FD and it
> won't fit into the existing KVM VFIO FD interface. There are lots of
> options here, one is to add new ioctls that specifically use the new
> FD, the other is to somehow use VFIO as a proxy to carry things to the
> /dev/ioasid FD code.

Exactly.

>> Alternatively you can add a KVM_DEV_IOASID_{ADD,DEL} pair of ioctls. But it
>> seems useless complication compared to just using what we have now, at least
>> while VMs only use IOASIDs via VFIO.
>
> The simplest is KVM_ENABLE_WBINVD(<fd security proof>) and be done
> with it.

The simplest one is KVM_DEV_VFIO_GROUP_ADD/DEL, that already exists and
also covers hot-unplug. The second simplest one is KVM_DEV_IOASID_ADD/DEL.

It need not be limited to wbinvd support, it's just a generic "let VMs
do what userspace can do if it has access to this file descriptor".
That it enables guest WBINVD is an implementation detail.

>> Either way, there should be no policy attached to the add/delete operations.
>> KVM users want to add the VFIO (or IOASID) file descriptors to the device
>> independent of WBINVD. If userspace wants/needs to apply its own policy on
>> whether to enable WBINVD or not, they can do it on the VFIO/IOASID side:
>
> Why does KVM need to know abut IOASID's? I don't think it can do
> anything with this general information.

Indeed, it only uses them as the security proofs---either VFIO or IOASID
file descriptors can be used as such.

Paolo

Subject: Re: [RFC] /dev/ioasid uAPI proposal

On 08.06.21 03:00, Jason Wang wrote:

Hi folks,

> Just to make sure we are on the same page. What I meant is, if the DMA
> behavior (like no-snoop) is device specific, there's no need to mandate
> a general virtio attribute. We can describe it per device. That the devices
> implemented in the current spec do not use non-coherent DMA doesn't
> mean any future devices won't do that. The driver could choose to use a
> transport (e.g. PCI), platform (ACPI) or device specific (general virtio
> command) way to detect and flush caches when necessary.

Maybe I've totally misunderstood the whole issue, but what I've learned
so far:

* it's a performance improvement for certain scenarios
* whether it can be used depends on the devices as well as the
underlying transport (combination of both)
* whether it should be used (when possible) can only be decided by the
driver

Correct?

I tend to believe that's something that virtio infrastructure should
handle in a generic way.

Maybe the device as well as the transport could announce their
capability (which IMHO should go via the virtio protocol), and if both
are capable, the (guest's) virtio subsys tells the driver whether it's
usable for a specific device. Perhaps we should also have a mechanism
to tell the device that it's actually used.


Sorry if I'm completely on the wrong page and just talking junk here :o


--mtx

--
---
Note: unencrypted e-mails can easily be intercepted and manipulated!
For confidential communication, please send your GPG/PGP key.
---
Enrico Weigelt, metux IT consult
Free software and Linux embedded engineering
[email protected] -- +49-151-27565287

Subject: Re: [RFC] /dev/ioasid uAPI proposal

On 04.06.21 14:30, Jason Gunthorpe wrote:

Hi,

> Containers already needed to do this today. Container orchestration is
> hard.

Yes, but I hate to see even more work upcoming here.

> Yes, /dev/ioasid shouldn't do anything unless you have a device to
> connect it with. In this way it is probably safe to stuff it into
> every container.

Okay, if we can guarantee that, I'm completely fine.

>>> Having FDs spawn other FDs is pretty ugly, it defeats the "everything
>>> is a file" model of UNIX.
>>
>> Unfortunately, this is already defeated in many other places :(
>> (I'd even claim that ioctls already break it :p)
>
> I think you are reaching a bit :)
>
>> It seems your approach also breaks this, since we now need to open two
>> files in order to talk to one device.
>
> It is two devices, thus two files.

Two separate real (hardware) devices or just two logical device nodes?


--mtx

--
---
Note: unencrypted e-mails can easily be intercepted and manipulated!
For confidential communication, please send your GPG/PGP key.
---
Enrico Weigelt, metux IT consult
Free software and Linux embedded engineering
[email protected] -- +49-151-27565287

Subject: Re: [RFC] /dev/ioasid uAPI proposal

On 07.06.21 20:01, Jason Gunthorpe wrote:
> <shrug> it is what it is, select has a fixed size bitmap of FD #s and
> a hard upper bound on that size as part of the glibc ABI - can't be
> fixed.

in the glibc ABI? Uuuuh!


--mtx

--
---
Note: unencrypted e-mails can easily be intercepted and manipulated!
For confidential communication, please send your GPG/PGP key.
---
Enrico Weigelt, metux IT consult
Free software and Linux embedded engineering
[email protected] -- +49-151-27565287

2021-06-08 13:17:00

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [RFC] /dev/ioasid uAPI proposal

On Tue, Jun 08, 2021 at 12:43:38PM +0200, Enrico Weigelt, metux IT consult wrote:

> > It is two devices, thus two files.
>
> Two separate real (hardware) devices or just two logical device nodes ?

A real PCI device and a real IOMMU block integrated into the CPU

Jason

2021-06-08 13:17:53

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [RFC] /dev/ioasid uAPI proposal

On Tue, Jun 08, 2021 at 09:56:09AM +0200, Paolo Bonzini wrote:

> > > Alternatively you can add a KVM_DEV_IOASID_{ADD,DEL} pair of ioctls. But it
> > > seems useless complication compared to just using what we have now, at least
> > > while VMs only use IOASIDs via VFIO.
> >
> > The simplest is KVM_ENABLE_WBINVD(<fd security proof>) and be done
> > with it.
>
> The simplest one is KVM_DEV_VFIO_GROUP_ADD/DEL, that already exists and also
> covers hot-unplug. The second simplest one is KVM_DEV_IOASID_ADD/DEL.

This isn't the same thing, this is back to trying to have the kernel
set policy for userspace.

qemu needs to be in direct control of this specific KVM emulation
feature if it is going to support the full range of options.

IMHO obfuscating it with some ADD/DEL doesn't achieve that.

Especially since even today GROUP_ADD/DEL is not just about
controlling wbinvd but also about linking mdevs to the kvm struct - it
is both not optional for qemu to call and it triggers behavior that goes
against the userspace policy.

This is why I prefer a direct and obvious KVM_ENABLE_WBINVD approach
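
Something as small as the sketch below would do; the ioctl, its number and
the struct here are invented for illustration, they are not existing uAPI:

#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/* Invented for illustration: a one-shot, explicit enable driven by qemu. */
struct kvm_enable_wbinvd {
	int32_t  ioasid_fd;	/* the FD used as the one-time access proof */
	uint32_t flags;
};

#define KVM_ENABLE_WBINVD _IOW(KVMIO, 0xf0, struct kvm_enable_wbinvd) /* number made up */

static int enable_wbinvd(int vm_fd, int ioasid_fd)
{
	struct kvm_enable_wbinvd arg = { .ioasid_fd = ioasid_fd };

	/* qemu decides the policy and makes one explicit call */
	return ioctl(vm_fd, KVM_ENABLE_WBINVD, &arg);
}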

Jason

2021-06-08 13:21:23

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [RFC] /dev/ioasid uAPI proposal

On Tue, Jun 08, 2021 at 12:37:04PM +1000, David Gibson wrote:

> > The PPC/SPAPR support allows KVM to associate a vfio group to an IOMMU
> > page table so that it can handle iotlb programming from pre-registered
> > memory without trapping out to userspace.
>
> To clarify that's a guest side logical vIOMMU page table which is
> partially managed by KVM. This is an optimization - things can work
> without it, but it means guest iomap/unmap becomes a hot path because
> each map/unmap hypercall has to go
> guest -> KVM -> qemu -> VFIO
>
> So there are multiple context transitions.

Isn't this overhead true of many of the vIOMMUs? Can the fast path be
generalized?

Jason

2021-06-08 13:50:02

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [RFC] /dev/ioasid uAPI proposal

On 08/06/21 15:15, Jason Gunthorpe wrote:
> On Tue, Jun 08, 2021 at 09:56:09AM +0200, Paolo Bonzini wrote:
>
>>>> Alternatively you can add a KVM_DEV_IOASID_{ADD,DEL} pair of ioctls. But it
>>>> seems useless complication compared to just using what we have now, at least
>>>> while VMs only use IOASIDs via VFIO.
>>>
>>> The simplest is KVM_ENABLE_WBINVD(<fd security proof>) and be done
>>> with it.
>>
>> The simplest one is KVM_DEV_VFIO_GROUP_ADD/DEL, that already exists and also
>> covers hot-unplug. The second simplest one is KVM_DEV_IOASID_ADD/DEL.
>
> This isn't the same thing, this is back to trying to have the kernel
> set policy for userspace.

If you want a userspace policy then there would be three states:

* WBINVD enabled because a WBINVD-enabled VFIO device is attached.

* WBINVD potentially enabled but no WBINVD-enabled VFIO device attached

* WBINVD forcefully disabled

KVM_DEV_VFIO_GROUP_ADD/DEL can still be used to distinguish the first
two. Due to backwards compatibility, those two describe the default
behavior; disabling wbinvd can be done easily with a new sub-ioctl of
KVM_ENABLE_CAP and doesn't require any security proof.
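
In userspace terms that would look roughly like the sketch below; the
kvm-vfio device part is existing uAPI, while the force-disable capability
number is purely made up here:

#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/* Existing uAPI: register a VFIO group with the kvm-vfio device so KVM can
 * derive the default WBINVD behavior from the group's coherency. */
static int kvm_vfio_add_group(int kvm_vfio_dev_fd, int group_fd)
{
	struct kvm_device_attr attr = {
		.group = KVM_DEV_VFIO_GROUP,
		.attr  = KVM_DEV_VFIO_GROUP_ADD,
		.addr  = (uint64_t)(uintptr_t)&group_fd,
	};

	return ioctl(kvm_vfio_dev_fd, KVM_SET_DEVICE_ATTR, &attr);
}

/* The third state ("WBINVD forcefully disabled") as a KVM_ENABLE_CAP
 * sub-ioctl; this capability does not exist, the number is invented. */
#define KVM_CAP_FORCE_WBINVD_NOOP 9999

static int kvm_force_wbinvd_noop(int vm_fd)
{
	struct kvm_enable_cap cap = { .cap = KVM_CAP_FORCE_WBINVD_NOOP };

	return ioctl(vm_fd, KVM_ENABLE_CAP, &cap);
}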

The meaning of WBINVD-enabled is "won't return -ENXIO for the wbinvd
ioctl", nothing more nothing less. If all VFIO devices are going to be
WBINVD-enabled, then that will reflect on KVM as well, and I won't have
anything to object if there's consensus on the device assignment side of
things that the wbinvd ioctl won't ever fail.

Paolo

2021-06-08 19:15:17

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [RFC] /dev/ioasid uAPI proposal

On Tue, Jun 08, 2021 at 12:47:00PM -0600, Alex Williamson wrote:
> On Tue, 8 Jun 2021 15:44:26 +0200
> Paolo Bonzini <[email protected]> wrote:
>
> > On 08/06/21 15:15, Jason Gunthorpe wrote:
> > > On Tue, Jun 08, 2021 at 09:56:09AM +0200, Paolo Bonzini wrote:
> > >
> > >>>> Alternatively you can add a KVM_DEV_IOASID_{ADD,DEL} pair of ioctls. But it
> > >>>> seems useless complication compared to just using what we have now, at least
> > >>>> while VMs only use IOASIDs via VFIO.
> > >>>
> > >>> The simplest is KVM_ENABLE_WBINVD(<fd security proof>) and be done
> > >>> with it.
>
> Even if we were to relax wbinvd access to any device (capable of
> no-snoop or not) in any IOMMU configuration (blocking no-snoop or not),
> I think as soon as we say "proof" is required to gain this access then
> that proof should be ongoing for the life of the access.

This idea is not entirely consistent with the usual Unix access
control model..

Eg I can do open() on a file and I get to keep that FD. I get to keep
that FD even if someone later does chmod() on that file so I can't
open it again.

There are lots of examples where a one time access control check
provides continuing access to a resource. I feel the ongoing proof is
the rarity in Unix.. 'revoke' is an uncommon concept in Unix..

That said, I don't feel strongly either way, would just like to see
something implementable. Even the various options to change the
feature are more thought explorations to try to understand how to
model the feature than any requirements I am aware of.

> notifier to indicate an end of that authorization. I don't think we
> can simplify that out of the equation or we've essentially invalidated
> that the proof is really required.

The proof is like the chmod in the above open() example. Once kvm is
authorized it keeps working even if a new authorization could not be
obtained. It is not very different from chmod'ing a file after
something opened it.

Inability to revoke doesn't invalidate the value of the initial
one-time access control check.

Generally agree on the rest of your message

Regards,
Jason

2021-06-09 00:29:31

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [RFC] /dev/ioasid uAPI proposal

On Tue, Jun 08, 2021 at 10:54:59AM +0200, Enrico Weigelt, metux IT consult wrote:

> Maybe the device as well as the transport could announce their
> capability (which IMHO should go via the virtio protocol), and if both
> are capable, the (guest's) virtio subsys tells the driver whether it's
> usable for a specific device. Perhaps we should also have a mechanism
> to tell the device that it's actually used.

The usage should be extremely narrow, like

"If the driver issues a GPU command with flag X then the resulting
DMAs will be no-snoop and the driver must re-establish coherency at
the right moment"

It is not a general idea, but something baked directly into the device
protocol that virtio carries.

The general notion of no-snoop should ideally be carried in the PCIe
extended config space flag. If 0 then no-snoop should never be issued,
expected, or requested.

Jason

2021-06-09 01:36:26

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [RFC] /dev/ioasid uAPI proposal

On Tue, Jun 08, 2021 at 09:10:42AM +0800, Jason Wang wrote:

> Well, this sounds like a re-invention of io_uring which has already worked
> for multifds.

How so? io_uring is about sending work to the kernel, not getting
structured events back?

It is more like one of the perf rings

Jason

2021-06-09 08:05:18

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [RFC] /dev/ioasid uAPI proposal

On Tue, Jun 08, 2021 at 10:53:02AM +1000, David Gibson wrote:
> On Thu, Jun 03, 2021 at 08:52:24AM -0300, Jason Gunthorpe wrote:
> > On Thu, Jun 03, 2021 at 03:13:44PM +1000, David Gibson wrote:
> >
> > > > We can still consider it a single "address space" from the IOMMU
> > > > perspective. What has happened is that the address table is not just a
> > > > 64 bit IOVA, but an extended ~80 bit IOVA formed by "PASID, IOVA".
> > >
> > > True. This does complexify how we represent what IOVA ranges are
> > > valid, though. I'll bet you most implementations don't actually
> > > implement a full 64-bit IOVA, which means we effectively have a large
> > > number of windows from (0..max IOVA) for each valid pasid. This adds
> > > another reason I don't think my concept of IOVA windows is just a
> > > power specific thing.
> >
> > Yes
> >
> > Things rapidly get into weird hardware specific stuff though, the
> > request will be for things like:
> > "ARM PASID&IO page table format from SMMU IP block vXX"
>
> So, I'm happy enough for picking a user-managed pagetable format to
> imply the set of valid IOVA ranges (though a query might be nice).

I think a query is mandatory, and optionally asking for ranges seems
generally useful as a HW property.

The danger is things can get really tricky as the app can ask for
ranges some HW needs but other HW can't provide.

I would encourage a flow where "generic" apps like DPDK can somehow
just ignore this, or at least be very, very simplified "I want around
XX GB of IOVA space"

dpdk type apps vs qemu apps are really quite different and we should
be careful that the needs of HW-accelerated vIOMMU emulation do not
trump the needs of simple universal control over a DMA map.

Jason

2021-06-09 08:25:49

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [RFC] /dev/ioasid uAPI proposal

On Tue, Jun 08, 2021 at 04:04:26PM +1000, David Gibson wrote:

> > What I would like is that the /dev/iommu side managing the IOASID
> > doesn't really care much, but the device driver has to tell
> > drivers/iommu what it is going to do when it attaches.
>
> By the device driver, do you mean the userspace or guest device
> driver? Or do you mean the vfio_pci or mdev "shim" device driver?

I mean vfio_pci, mdev "shim", vdpa, etc. Some kernel driver that is
allowing qemu access to a HW resource.

> > It makes sense, in PCI terms, only the driver knows what TLPs the
> > device will generate. The IOMMU needs to know what TLPs it will
> > receive to configure properly.
> >
> > PASID or not is major device specific variation, as is the ENQCMD/etc
> >
> > Having the device be explicit when it tells the IOMMU what it is going
> > to be sending is a major plus to me. I actually don't want to see this
> > part of the interface be made less strong.
>
> Ok, if I'm understanding this right a PASID capable IOMMU will be able
> to process *both* transactions with just a RID and transactions with a
> RID+PASID.

Yes

> So if we're thinking of this notional 84ish-bit address space, then
> that includes "no PASID" as well as all the possible PASID values.
> Yes? Or am I confused?

Right, though I expect how to model 'no pasid' vs all the pasids is
some micro-detail someone would need to work on a real vIOMMU
implementation to decide..
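
As a purely illustrative strawman of that micro-detail, the per-address-space
key could be as small as:

#include <stdbool.h>
#include <stdint.h>

/* Illustrative only: treating "no PASID" as just another value of the key
 * that selects one of the address spaces behind a RID. */
struct ioas_route_key {
	bool     has_pasid;	/* false => the default, RID-only space */
	uint32_t pasid;		/* 20-bit PASID when has_pasid is true */
};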

> > /dev/iommu is concerned with setting up the IOAS and filling the IO
> > page tables with information
> >
> > The driver behind "struct vfio_device" is responsible to "route" its
> > HW into that IOAS.
> >
> > They are two halves of the problem: one is only the io page table, and the
> > other is the connection of a PCI TLP to a specific io page table.
> >
> > Only the driver knows what format of TLPs the device will generate so
> > only the driver can specify the "route"
>
> Ok. I'd really like if we can encode this in a way that doesn't build
> PCI-specific structure into the API, though.

I think we should at least have bus specific "convenience" APIs for
the popular cases. It is complicated enough already, trying to force
people to figure out the kernel synonym for a PCI standard name gets
pretty rough... Plus the RID is inherently a hardware specific
concept.

> > Inability to match the RID is rare, certainly I would expect any IOMMU
> > HW that can do PCIe PASID matching can also do RID matching.
>
> It's not just up to the IOMMU. The obvious case is a PCIe-to-PCI
> bridge.

Yes.. but PCI is *really* old at this point. Even PCI-X sustains the
originating RID.

The general case here is that each device can route to its own
IOAS. The specialty case is that only one IOAS in a group can be
used. We should make room in the API for the special case without
destroying the general case.

> > Oh, I hadn't spent time thinking about any of those.. It is messy but
> > it can still be forced to work, I guess. A device centric model means
> > all the devices using the same routing ID have to be connected to the
> > same IOASID by userspace. So some of the connections will be NOPs.
>
> See, that's exactly what I thought the group checks were enforcing.
> I'm really hoping we don't need two levels of granularity here: groups
> of devices that can't be identified from each other, and then groups
> of those that can't be isolated from each other. That introduces a
> huge amount of extra conceptual complexity.

We've got this far with groups that mean all those things together, so I
wouldn't propose to do a bunch of kernel work to change that
significantly.

I just want to have a device centric uAPI so we are not trapped
forever in groups being 1:1 with an IOASID model, which is clearly not
accurately modeling what today's systems are actually able to do,
especially with PASID.

We can report some fixed info to user space 'all these devices share
one ioasid' and leave it for now/ever

Jason

2021-06-09 10:11:44

by Alex Williamson

[permalink] [raw]
Subject: Re: [RFC] /dev/ioasid uAPI proposal

On Tue, 8 Jun 2021 15:44:26 +0200
Paolo Bonzini <[email protected]> wrote:

> On 08/06/21 15:15, Jason Gunthorpe wrote:
> > On Tue, Jun 08, 2021 at 09:56:09AM +0200, Paolo Bonzini wrote:
> >
> >>>> Alternatively you can add a KVM_DEV_IOASID_{ADD,DEL} pair of ioctls. But it
> >>>> seems useless complication compared to just using what we have now, at least
> >>>> while VMs only use IOASIDs via VFIO.
> >>>
> >>> The simplest is KVM_ENABLE_WBINVD(<fd security proof>) and be done
> >>> with it.

Even if we were to relax wbinvd access to any device (capable of
no-snoop or not) in any IOMMU configuration (blocking no-snoop or not),
I think as soon as we say "proof" is required to gain this access then
that proof should be ongoing for the life of the access.

That alone makes this more than a "I want this feature, here's my
proof", one-shot ioctl. Like the groupfd enabling a path for KVM to
ask if that group is non-coherent and holding a group reference to
prevent the group from being used to authorize multiple KVM instances,
the ioasidfd proof would need to minimally validate that devices are
present and provide some reference to enforce that model as ongoing, or
notifier to indicate an end of that authorization. I don't think we
can simplify that out of the equation or we've essentially invalidated
that the proof is really required.

> >>
> >> The simplest one is KVM_DEV_VFIO_GROUP_ADD/DEL, that already exists and also
> >> covers hot-unplug. The second simplest one is KVM_DEV_IOASID_ADD/DEL.
> >
> > This isn't the same thing, this is back to trying to have the kernel
> > set policy for userspace.
>
> If you want a userspace policy then there would be three states:
>
> * WBINVD enabled because a WBINVD-enabled VFIO device is attached.
>
> * WBINVD potentially enabled but no WBINVD-enabled VFIO device attached
>
> * WBINVD forcefully disabled
>
> KVM_DEV_VFIO_GROUP_ADD/DEL can still be used to distinguish the first
> two. Due to backwards compatibility, those two describe the default
> behavior; disabling wbinvd can be done easily with a new sub-ioctl of
> KVM_ENABLE_CAP and doesn't require any security proof.

That seems like a good model, use the kvm-vfio device for the default
behavior and extend an existing KVM ioctl if QEMU still needs a way to
tell KVM to assume all DMA is coherent, regardless of what the kvm-vfio
device reports.

It feels like we should be able to support a backwards compatibility
mode using the vfio group, but I expect long term we'll want to
transition the kvm-vfio device from a groupfd to an ioasidfd.

> The meaning of WBINVD-enabled is "won't return -ENXIO for the wbinvd
> ioctl", nothing more nothing less. If all VFIO devices are going to be
> WBINVD-enabled, then that will reflect on KVM as well, and I won't have
> anything to object if there's consensus on the device assignment side of
> things that the wbinvd ioctl won't ever fail.

If we create the IOMMU vs device coherency matrix:

                \ Device supports
  IOMMU blocks   \    no-snoop
    no-snoop      \ yes |  no |
  ----------------+-----+-----+
         yes      |  1  |  2  |
  ----------------+-----+-----+
         no       |  3  |  4  |
  ----------------+-----+-----+

DMA is always coherent in boxes {1,2,4} (wbinvd emulation is not
needed). VFIO will currently always configure the IOMMU for {1,2} when
the feature is supported. Boxes {3,4} are where we'll currently
emulate wbinvd. The best we could do, not knowing the guest or
insights into the guest driver would be to only emulate wbinvd for {3}.
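
In code form the matrix boils down to roughly this (illustrative only):

#include <stdbool.h>

/* Which box of the matrix above a device lands in. */
enum coherency_box { BOX_1 = 1, BOX_2, BOX_3, BOX_4 };

static enum coherency_box classify(bool iommu_blocks_nosnoop,
				   bool dev_supports_nosnoop)
{
	if (iommu_blocks_nosnoop)
		return dev_supports_nosnoop ? BOX_1 : BOX_2;
	return dev_supports_nosnoop ? BOX_3 : BOX_4;
}

/* DMA is always coherent in {1,2,4}; only {3} can see non-coherent DMA,
 * so "the best we could do" is to emulate wbinvd only for {3}. */
static bool needs_wbinvd_emulation(enum coherency_box box)
{
	return box == BOX_3;
}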

The majority of devices appear to report no-snoop support {1,3}, but
the claim is that it's mostly unused outside of GPUs, effectively {2,4}.
I'll speculate that most IOMMUs support enforcing coherency (amd, arm,
fsl unconditionally, intel conditionally) {1,2}.

I think that means we're currently operating primarily in Box {1},
which does not seem to lean towards unconditional wbinvd access with
device ownership.

I think we have a desire with IOASID to allow user policy to operate
certain devices in {3} and I think the argument there is that a
specific software enforced coherence sync is more efficient on the bus
than the constant coherence enforcement by the IOMMU.

I think that the disable mode Jason proposed is essentially just a way
to move a device from {3} to {4}, ie. the IOASID support or
configuration does not block no-snoop and the device claims to support
no-snoop, but doesn't use it. How we'd determine this to be true for
a device without a crystal ball of driver development or hardware
errata that no-snoop transactions are not possible regardless of the
behavior of the enable bit, I'm not sure. If we're operating a device
in {3}, but the device does not generate no-snoop transactions, then
presumably the guest driver isn't generating wbinvd instructions for us
to emulate, so where are the wbinvd instructions that this feature
would prevent being emulated coming from? Thanks,

Alex

2021-06-09 12:30:04

by Tian, Kevin

[permalink] [raw]
Subject: RE: [RFC] /dev/ioasid uAPI proposal

> From: David Gibson <[email protected]>
> Sent: Tuesday, June 8, 2021 8:50 AM
>
> On Thu, Jun 03, 2021 at 06:49:20AM +0000, Tian, Kevin wrote:
> > > From: David Gibson
> > > Sent: Thursday, June 3, 2021 1:09 PM
> > [...]
> > > > > In this way the SW mode is the same as a HW mode with an infinite
> > > > > cache.
> > > > >
> > > > > The collaposed shadow page table is really just a cache.
> > > > >
> > > >
> > > > OK. One additional thing is that we may need a 'caching_mode"
> > > > thing reported by /dev/ioasid, indicating whether invalidation is
> > > > required when changing non-present to present. For hardware
> > > > nesting it's not reported as the hardware IOMMU will walk the
> > > > guest page table in cases of iotlb miss. For software nesting
> > > > caching_mode is reported so the user must issue invalidation
> > > > upon any change in guest page table so the kernel can update
> > > > the shadow page table timely.
> > >
> > > For the fist cut, I'd have the API assume that invalidates are
> > > *always* required. Some bypass to avoid them in cases where they're
> > > not needed can be an additional extension.
> > >
> >
> > Isn't a typical TLB semantics is that non-present entries are not
> > cached thus invalidation is not required when making non-present
> > to present?
>
> Usually, but not necessarily.
>
> > It's true to both CPU TLB and IOMMU TLB.
>
> I don't think it's entirely true of the CPU TLB on all ppc MMU models
> (of which there are far too many).
>
> > In reality
> > I feel there are more usages built on hardware nesting than software
> > nesting thus making default following hardware TLB behavior makes
> > more sense...
>
> I'm arguing for always-require-invalidate because it's strictly more
> general. Requiring the invalidate will support models that don't
> require it in all cases; we just make the invalidate a no-op. The
> reverse is not true, so we should tackle the general case first, then
> optimize.
>

It makes sense. Will adopt this way.

2021-06-09 15:00:35

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [RFC] /dev/ioasid uAPI proposal

On Wed, Jun 09, 2021 at 02:49:32AM +0000, Tian, Kevin wrote:

> Last unclosed open issue. Jason, you dislike symbol_get in this contract per
> your earlier comment. As Alex explained, it looks like it's more about module
> dependency, which is orthogonal to how this contract is designed. What
> is your opinion now?

Generally when you see symbol_get like this it suggests something is
wrong in the layering..

Why shouldn't kvm have a normal module dependency on drivers/iommu?

Jason

2021-06-09 15:22:20

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [RFC] /dev/ioasid uAPI proposal

On 09/06/21 14:47, Jason Gunthorpe wrote:
> On Wed, Jun 09, 2021 at 02:46:05PM +0200, Paolo Bonzini wrote:
>> On 09/06/21 13:57, Jason Gunthorpe wrote:
>>> On Wed, Jun 09, 2021 at 02:49:32AM +0000, Tian, Kevin wrote:
>>>
>>>> Last unclosed open issue. Jason, you dislike symbol_get in this contract per
>>>> your earlier comment. As Alex explained, it looks like it's more about module
>>>> dependency, which is orthogonal to how this contract is designed. What
>>>> is your opinion now?
>>>
>>> Generally when you see symbol_get like this it suggests something is
>>> wrong in the layering..
>>>
>>> Why shouldn't kvm have a normal module dependency on drivers/iommu?
>>
>> It allows KVM to load even if there's an "install /bin/false" for vfio
>> (typically used together with the blacklist directive) in modprobe.conf.
>> This rationale should apply to iommu as well.
>
I can vaguely understand this rationale for vfio, but not at all for
> the platform's iommu driver, sorry.

Sorry, should apply to ioasid, not iommu (assuming that /dev/ioasid
support would be modular).

Paolo

2021-06-09 15:41:02

by Alex Williamson

[permalink] [raw]
Subject: Re: [RFC] /dev/ioasid uAPI proposal

On Wed, 9 Jun 2021 08:54:45 -0300
Jason Gunthorpe <[email protected]> wrote:

> On Wed, Jun 09, 2021 at 11:11:17AM +0200, Paolo Bonzini wrote:
> > On 09/06/21 10:51, Enrico Weigelt, metux IT consult wrote:
> > > On 08.06.21 21:00, Jason Gunthorpe wrote:
> > >
> > > > Eg I can do open() on a file and I get to keep that FD. I get to keep
> > > > that FD even if someone later does chmod() on that file so I can't
> > > > open it again.
> > > >
> > > > There are lots of examples where a one time access control check
> > > > provides continuing access to a resource. I feel the ongoing proof is
> > > > the rarity in Unix.. 'revoke' is an uncommon concept in Unix..
> > >
> > > Yes, it's even possible that somebody w/ privileges opens an fd and
> > > hands it over to somebody unprivileged (eg. via unix socket). This is
> > > a very basic unix concept. If some (already opened) fd now suddenly
> > > behaves differently based on the current caller, that would be a break
> > > with traditional unix semantics.

That's not the scenario here; this would work as expected. It's not a
matter of who uses the fd; it's that a property of the fd that provided
elevated access has been removed. I only need to look as far as sudo
to find examples of protocols where having access at some point in time
does not guarantee ongoing access.

If we go back to the wbinvd ioctl mechanism, if I call that ioctl with
an ioasidfd that contains no devices, then I shouldn't be able to
generate a wbinvd on the processor, right? If I add a device,
especially in a configuration that can generate non-coherent DMA, now
that ioctl should work. If I then remove all devices from that ioasid,
what then is the difference from the initial state. Should the ioctl
now work because it worked once in the past?

If we equate KVM to the process access to the ioctl, then a reference
or notification method should be used to retain or mark the end of that
access. This is no different than starting a shell via sudo (ie. an
ongoing reference) or having the previous authentication time out, or
in our case be notified it has expired.

> > That's already more or less meaningless for both KVM and VFIO, since they
> > are tied to an mm.
>
> vfio isn't supposed to be tied to a mm.

vfio does accounting against an mm, why shouldn't it be tied to an mm?
Maybe you mean in the new model where vfio is just a device driver?
Thanks,

Alex

2021-06-09 15:43:35

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [RFC] /dev/ioasid uAPI proposal

On Wed, Jun 09, 2021 at 03:24:11PM +0200, Paolo Bonzini wrote:
> On 09/06/21 14:47, Jason Gunthorpe wrote:
> > On Wed, Jun 09, 2021 at 02:46:05PM +0200, Paolo Bonzini wrote:
> > > On 09/06/21 13:57, Jason Gunthorpe wrote:
> > > > On Wed, Jun 09, 2021 at 02:49:32AM +0000, Tian, Kevin wrote:
> > > >
> > > > > Last unclosed open issue. Jason, you dislike symbol_get in this contract per
> > > > > your earlier comment. As Alex explained, it looks like it's more about module
> > > > > dependency, which is orthogonal to how this contract is designed. What
> > > > > is your opinion now?
> > > >
> > > > Generally when you see symbol_get like this it suggests something is
> > > > wrong in the layering..
> > > >
> > > > Why shouldn't kvm have a normal module dependency on drivers/iommu?
> > >
> > > It allows KVM to load even if there's an "install /bin/false" for vfio
> > > (typically used together with the blacklist directive) in modprobe.conf.
> > > This rationale should apply to iommu as well.
> >
> > I can vaugely understand this rational for vfio, but not at all for
> > the platform's iommu driver, sorry.
>
> Sorry, should apply to ioasid, not iommu (assuming that /dev/ioasid support
> would be modular).

/dev/ioasid and drivers/iommu are tightly coupled things, so I think we can put
a key symbol or two in a place where it will not be a problem to
depend on.

Jason

2021-06-09 15:46:21

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [RFC] /dev/ioasid uAPI proposal

On Wed, Jun 09, 2021 at 08:31:34AM -0600, Alex Williamson wrote:

> If we go back to the wbinvd ioctl mechanism, if I call that ioctl with
> an ioasidfd that contains no devices, then I shouldn't be able to
> generate a wbinvd on the processor, right? If I add a device,
> especially in a configuration that can generate non-coherent DMA, now
> that ioctl should work. If I then remove all devices from that ioasid,
> what then is the difference from the initial state. Should the ioctl
> now work because it worked once in the past?

The ioctl is fine, but telling KVM to enable WBINVD is very similar to
open and then reconfiguring the ioasid_fd is very similar to
chmod. From a security perspective revoke is not strictly required,
IMHO.

> access. This is no different than starting a shell via sudo (ie. an
> ongoing reference) or having the previous authentication time out, or
> in our case be notified it has expired.

Those are all authentication gates as well, yes sudo has a timer, but
once the timer expires it doesn't forcibly revoke & close all the
existing sudo sessions. It just means you can't create new ones
without authenticating.

> > > That's already more or less meaningless for both KVM and VFIO, since they
> > > are tied to an mm.
> >
> > vfio isn't supposed to be tied to a mm.
>
> vfio does accounting against an mm, why shouldn't it be tied to an mm?

It looks like vfio type 1 is doing it properly: each range of user
VA is stuffed into a struct vfio_dma and that contains a struct task
(which can be a mm_struct these days) that refers to the owning mm.

Looks like a single fd can hold multiple vfio_dma's and I don't see an
enforcement that current is locked to any specific process.

When the accounting is done it is done via the mm obtained through the
vfio_dma struct, not a global FD wide mm.

This appears all fine for something using pin_user_pages(). We don't
expect FDs to become locked to a single process on the first call to
pin_user_pages(); that is un-unixy.

kvm is special in this regard.

Jason

2021-06-09 16:18:52

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [RFC] /dev/ioasid uAPI proposal

On 09/06/21 16:45, Jason Gunthorpe wrote:
> On Wed, Jun 09, 2021 at 08:31:34AM -0600, Alex Williamson wrote:
>
>> If we go back to the wbinvd ioctl mechanism, if I call that ioctl with
>> an ioasidfd that contains no devices, then I shouldn't be able to
>> generate a wbinvd on the processor, right? If I add a device,
>> especially in a configuration that can generate non-coherent DMA, now
>> that ioctl should work. If I then remove all devices from that ioasid,
>> what then is the difference from the initial state. Should the ioctl
>> now work because it worked once in the past?
>
> The ioctl is fine, but telling KVM to enable WBINVD is very similar to
> open and then reconfiguring the ioasid_fd is very similar to
> chmod. From a security perspective revoke is not strictly required,
> IMHO.

I absolutely do *not* want an API that tells KVM to enable WBINVD. This
is not up for discussion.

But really, let's stop calling the file descriptor a security proof or a
capability. It's overkill; all that we are doing here is kernel
acceleration of the WBINVD ioctl.

As a thought experiment, let's consider what would happen if wbinvd
caused an unconditional exit from guest to userspace. Userspace would
react by invoking the ioctl on the ioasid. The proposed functionality
is just an acceleration of this same thing, avoiding the
guest->KVM->userspace->IOASID->wbinvd trip.

This is why the API that I want, and that already exists for VFIO
group file descriptors, informs KVM of which "ioctls" the guest should
be able to do via privileged instructions[1]. Then the kernel works out
with KVM how to ensure a 1:1 correspondence between the operation of the
ioctls and the privileged operations.

One way to do it would be to always trap WBINVD and invoke the same
kernel function that implements the ioctl. The function would do either
a wbinvd or nothing, based on whether the ioasid has any device. The
next logical step is a notification mechanism that enables WBINVD (by
disabling the WBINVD intercept) when there are devices in the ioasidfd,
and disables WBINVD (by enabling a no-op intercept) when there are none.

And in fact once all VFIO devices are gone, wbinvd is for all purposes a
no-op as far as the guest kernel can tell. So there's no reason to
treat it as anything but a no-op.
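
As a toy model of that mechanism (not actual KVM code; every name below is
invented):

#include <stdbool.h>
#include <stdio.h>

struct vm_state {
	int  nr_devices;		/* devices currently in the ioasidfd */
	bool wbinvd_intercepted;	/* true => guest WBINVD traps to a no-op */
};

/* Common path behind both the wbinvd ioctl and the WBINVD intercept. */
static void handle_wbinvd(struct vm_state *vm)
{
	if (vm->nr_devices == 0)
		return;	/* a no-op, which is all the guest can observe anyway */

	printf("flush caches here (stand-in for the real wbinvd)\n");
}

/* Notification when devices are added to / removed from the ioasidfd. */
static void ioasid_devices_changed(struct vm_state *vm, int delta)
{
	vm->nr_devices += delta;

	/* Intercept (and no-op) only while there are no devices attached,
	 * otherwise let the guest execute WBINVD natively. */
	vm->wbinvd_intercepted = (vm->nr_devices == 0);
}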

Thanks,

Paolo

[1] As an aside, I must admit I didn't entirely understand the design of
the KVM-VFIO device back when Alex added it. But with this model it was
absolutely the right thing to do, and it remains the right thing to do
even if VFIO groups are replaced with IOASID file descriptors.

2021-06-09 17:05:40

by Tian, Kevin

[permalink] [raw]
Subject: RE: [RFC] /dev/ioasid uAPI proposal

> From: Alex Williamson <[email protected]>
> Sent: Wednesday, June 9, 2021 2:47 AM
>
> On Tue, 8 Jun 2021 15:44:26 +0200
> Paolo Bonzini <[email protected]> wrote:
>
> > On 08/06/21 15:15, Jason Gunthorpe wrote:
> > > On Tue, Jun 08, 2021 at 09:56:09AM +0200, Paolo Bonzini wrote:
> > >
> > >>>> Alternatively you can add a KVM_DEV_IOASID_{ADD,DEL} pair of
> ioctls. But it
> > >>>> seems useless complication compared to just using what we have
> now, at least
> > >>>> while VMs only use IOASIDs via VFIO.
> > >>>
> > >>> The simplest is KVM_ENABLE_WBINVD(<fd security proof>) and be
> done
> > >>> with it.
>
> Even if we were to relax wbinvd access to any device (capable of
> no-snoop or not) in any IOMMU configuration (blocking no-snoop or not),
> I think as soon as we say "proof" is required to gain this access then
> that proof should be ongoing for the life of the access.
>
> That alone makes this more than a "I want this feature, here's my
> proof", one-shot ioctl. Like the groupfd enabling a path for KVM to
> ask if that group is non-coherent and holding a group reference to
> prevent the group from being used to authorize multiple KVM instances,
> the ioasidfd proof would need to minimally validate that devices are
> present and provide some reference to enforce that model as ongoing, or
> notifier to indicate an end of that authorization. I don't think we
> can simplify that out of the equation or we've essentially invalidated
> that the proof is really required.
>
> > >>
> > >> The simplest one is KVM_DEV_VFIO_GROUP_ADD/DEL, that already
> exists and also
> > >> covers hot-unplug. The second simplest one is
> KVM_DEV_IOASID_ADD/DEL.
> > >
> > > This isn't the same thing, this is back to trying to have the kernel
> > > set policy for userspace.
> >
> > If you want a userspace policy then there would be three states:
> >
> > * WBINVD enabled because a WBINVD-enabled VFIO device is attached.
> >
> > * WBINVD potentially enabled but no WBINVD-enabled VFIO device
> attached
> >
> > * WBINVD forcefully disabled
> >
> > KVM_DEV_VFIO_GROUP_ADD/DEL can still be used to distinguish the first
> > two. Due to backwards compatibility, those two describe the default
> > behavior; disabling wbinvd can be done easily with a new sub-ioctl of
> > KVM_ENABLE_CAP and doesn't require any security proof.
>
> That seems like a good model, use the kvm-vfio device for the default
> behavior and extend an existing KVM ioctl if QEMU still needs a way to
> tell KVM to assume all DMA is coherent, regardless of what the kvm-vfio
> device reports.
>
> If feels like we should be able to support a backwards compatibility
> mode using the vfio group, but I expect long term we'll want to
> transition the kvm-vfio device from a groupfd to an ioasidfd.
>
> > The meaning of WBINVD-enabled is "won't return -ENXIO for the wbinvd
> > ioctl", nothing more nothing less. If all VFIO devices are going to be
> > WBINVD-enabled, then that will reflect on KVM as well, and I won't have
> > anything to object if there's consensus on the device assignment side of
> > things that the wbinvd ioctl won't ever fail.
>
> If we create the IOMMU vs device coherency matrix:
>
>                 \ Device supports
>   IOMMU blocks   \    no-snoop
>     no-snoop      \ yes |  no |
>   ----------------+-----+-----+
>          yes      |  1  |  2  |
>   ----------------+-----+-----+
>          no       |  3  |  4  |
>   ----------------+-----+-----+
>
> DMA is always coherent in boxes {1,2,4} (wbinvd emulation is not
> needed). VFIO will currently always configure the IOMMU for {1,2} when
> the feature is supported. Boxes {3,4} are where we'll currently
> emulate wbinvd. The best we could do, not knowing the guest or
> insights into the guest driver would be to only emulate wbinvd for {3}.
>
> The majority of devices appear to report no-snoop support {1,3}, but
> the claim is that it's mostly unused outside of GPUs, effectively {2,4}.
> I'll speculate that most IOMMUs support enforcing coherency (amd, arm,
> fsl unconditionally, intel conditionally) {1,2}.
>
> I think that means we're currently operating primarily in Box {1},
> which does not seem to lean towards unconditional wbinvd access with
> device ownership.
>
> I think we have a desire with IOASID to allow user policy to operate
> certain devices in {3} and I think the argument there is that a
> specific software enforced coherence sync is more efficient on the bus
> than the constant coherence enforcement by the IOMMU.
>
> I think that the disable mode Jason proposed is essentially just a way
> to move a device from {3} to {4}, ie. the IOASID support or
> configuration does not block no-snoop and the device claims to support
> no-snoop, but doesn't use it. How we'd determine this to be true for
> a device without a crystal ball of driver development or hardware
> errata that no-snoop transactions are not possible regardless of the
> behavior of the enable bit, I'm not sure. If we're operating a device
> in {3}, but the device does not generate no-snoop transactions, then
> presumably the guest driver isn't generating wbinvd instructions for us
> to emulate, so where are the wbinvd instructions that this feature
> would prevent being emulated coming from? Thanks,
>

I'm writing v2 now. Below is what I captured from this discussion.
Please let me know whether it matches your thoughts:

- Keep the existing kvm-vfio device with a kernel-decided policy in the short
term, i.e. 'disable' for 1/2 and 'enable' for 3/4. Jason still has a different
thought on whether this should be an explicit wbinvd cmd, though;

- Long-term transition to ioasidfd (for non-vfio usage);

- As an extension we want to support 'force-enable' (1->3 for performance
reasons) from the user, but not 'force-disable' (3->4, which sounds meaningless
since if the guest driver doesn't use wbinvd then keeping wbinvd emulation
doesn't hurt);

- To support 'force-enable' there is no need for an additional KVM-side
contract. It just relies on what the kvm-vfio device reports, regardless of
whether an 'enable' policy comes from the kernel or the user;

- 'force-enable' is supported through /dev/iommu (new name for
/dev/ioasid);

- Qemu first calls IOMMU_GET_DEV_INFO(device_handle) to acquire
the default policy (enable vs. disable) for a device. This is a kernel
decision based on the device's no-snoop capability and the iommu's snoop
control capability;

- If not specified, a newly-created IOASID follows the kernel policy.
Alternatively, Qemu could explicitly mark an IOASID as non-coherent
at IOMMU_ALLOC_IOASID time;

- Attaching a non-snoop device which cannot be forced to snoop by the
iommu to a coherent IOASID fails, because a snoop-format I/O page
table causes errors on such an iommu;

- Devices attached to a non-coherent IOASID all use the no-snoop
format I/O page table, even when the iommu is capable of forcing
snoop;

- After the IOASID is properly configured, Qemu then uses the kvm-vfio device
to notify KVM, which calls a vfio helper function to get the coherent attribute
per vfio_group. Because this attribute is kept in the IOASID, we possibly need
to return the attribute to vfio at vfio_attach_ioasid time (a rough sketch of
this flow follows below).
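
To make this sequence concrete, here is a minimal C sketch of how Qemu might
drive it. Only the names IOMMU_GET_DEV_INFO, IOMMU_ALLOC_IOASID,
VFIO_ATTACH_IOASID and the kvm-vfio device come from this thread; the argument
structures and the *_NON_COHERENT flags below are illustrative assumptions,
not a settled uAPI:

    #include <sys/ioctl.h>
    #include <linux/kvm.h>
    #include <linux/types.h>

    /* Sketch only: layouts and flags below are assumptions for illustration */
    struct iommu_device_info {
        __u32 dev_handle;
        __u32 flags;
    #define IOMMU_INFO_NON_COHERENT   (1 << 0)  /* kernel default: no-snoop kept */
    };

    struct iommu_alloc_ioasid {
        __u32 flags;
    #define IOMMU_IOASID_NON_COHERENT (1 << 0)  /* user asks for 'force-enable' */
        __u32 ioasid;                           /* output */
    };

    static int setup_ioasid(int iommu_fd, int device_fd, int kvm_vfio_dev,
                            int group_fd, __u32 dev_handle)
    {
        struct iommu_device_info info = { .dev_handle = dev_handle };
        struct iommu_alloc_ioasid alloc = {};

        /* 1. Default coherency policy decided by the kernel for this device */
        if (ioctl(iommu_fd, IOMMU_GET_DEV_INFO, &info))
            return -1;

        /* 2. Allocate the IOASID, optionally marking it non-coherent */
        if (info.flags & IOMMU_INFO_NON_COHERENT)
            alloc.flags |= IOMMU_IOASID_NON_COHERENT;
        if (ioctl(iommu_fd, IOMMU_ALLOC_IOASID, &alloc))
            return -1;

        /* 3. Attach; fails if the I/O page table format can't match the iommu */
        if (ioctl(device_fd, VFIO_ATTACH_IOASID, &alloc.ioasid))
            return -1;

        /* 4. Notify KVM; it asks vfio for the per-group coherent attribute */
        struct kvm_device_attr attr = {
            .group = KVM_DEV_VFIO_GROUP,
            .attr  = KVM_DEV_VFIO_GROUP_ADD,
            .addr  = (__u64)(unsigned long)&group_fd,
        };
        return ioctl(kvm_vfio_dev, KVM_SET_DEVICE_ATTR, &attr);
    }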

One last open issue remains. Jason, you disliked symbol_get in this contract
per your earlier comment. As Alex explained, it looks like it's more about
module dependency, which is orthogonal to how this contract is designed. What
is your opinion now?

Thanks
Kevin

2021-06-09 17:14:14

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [RFC] /dev/ioasid uAPI proposal

On 09/06/21 10:51, Enrico Weigelt, metux IT consult wrote:
> On 08.06.21 21:00, Jason Gunthorpe wrote:
>
>> Eg I can do open() on a file and I get to keep that FD. I get to keep
>> that FD even if someone later does chmod() on that file so I can't
>> open it again.
>>
>> There are lots of examples where a one time access control check
>> provides continuing access to a resource. I feel the ongoing proof is
>> the rarity in Unix.. 'revoke' is an uncommon concept in Unix..
>
> Yes, it's even possible that somebody w/ privileges opens an fd and
> hands it over to somebody unprivileged (eg. via unix socket). This is
> a very basic unix concept. If some (already opened) fd now suddenly
> behaves differently based on the current caller, that would be a break
> with traditional unix semantics.

That's already more or less meaningless for both KVM and VFIO, since
they are tied to an mm.

Paolo


2021-06-09 17:22:20

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [RFC] /dev/ioasid uAPI proposal

On Wed, Jun 09, 2021 at 11:11:17AM +0200, Paolo Bonzini wrote:
> On 09/06/21 10:51, Enrico Weigelt, metux IT consult wrote:
> > On 08.06.21 21:00, Jason Gunthorpe wrote:
> >
> > > Eg I can do open() on a file and I get to keep that FD. I get to keep
> > > that FD even if someone later does chmod() on that file so I can't
> > > open it again.
> > >
> > > There are lots of examples where a one time access control check
> > > provides continuing access to a resource. I feel the ongoing proof is
> > > the rarity in Unix.. 'revoke' is an uncommon concept in Unix..
> >
> > Yes, it's even possible that somebody w/ privileges opens an fd and
> > hands it over to somebody unprivileged (eg. via unix socket). This is
> > a very basic unix concept. If some (already opened) fd now suddenly
> > behaves differently based on the current caller, that would be a break
> > with traditional unix semantics.
>
> That's already more or less meaningless for both KVM and VFIO, since they
> are tied to an mm.

vfio isn't supposed to be tied to a mm.

Jason

2021-06-09 17:26:44

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [RFC] /dev/ioasid uAPI proposal

On 09/06/21 13:57, Jason Gunthorpe wrote:
> On Wed, Jun 09, 2021 at 02:49:32AM +0000, Tian, Kevin wrote:
>
>> Last unclosed open. Jason, you dislike symbol_get in this contract per
>> earlier comment. As Alex explained, looks it's more about module
>> dependency which is orthogonal to how this contract is designed. What
>> is your opinion now?
>
> Generally when you see symbol_get like this it suggests something is
> wrong in the layering..
>
> Why shouldn't kvm have a normal module dependency on drivers/iommu?

It allows KVM to load even if there's an "install /bin/false" for vfio
(typically used together with the blacklist directive) in modprobe.conf.
This rationale should apply to iommu as well.

Paolo

2021-06-09 17:27:56

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [RFC] /dev/ioasid uAPI proposal

On Wed, Jun 09, 2021 at 02:46:05PM +0200, Paolo Bonzini wrote:
> On 09/06/21 13:57, Jason Gunthorpe wrote:
> > On Wed, Jun 09, 2021 at 02:49:32AM +0000, Tian, Kevin wrote:
> >
> > > Last unclosed open. Jason, you dislike symbol_get in this contract per
> > > earlier comment. As Alex explained, looks it's more about module
> > > dependency which is orthogonal to how this contract is designed. What
> > > is your opinion now?
> >
> > Generally when you see symbol_get like this it suggests something is
> > wrong in the layering..
> >
> > Why shouldn't kvm have a normal module dependency on drivers/iommu?
>
> It allows KVM to load even if there's an "install /bin/false" for vfio
> (typically used together with the blacklist directive) in modprobe.conf.
> This rationale should apply to iommu as well.

I can vaguely understand this rationale for vfio, but not at all for
the platform's iommu driver, sorry.

Jason

2021-06-09 18:19:51

by Alex Williamson

[permalink] [raw]
Subject: Re: [RFC] /dev/ioasid uAPI proposal

On Wed, 9 Jun 2021 02:49:32 +0000
"Tian, Kevin" <[email protected]> wrote:

> > From: Alex Williamson <[email protected]>
> > Sent: Wednesday, June 9, 2021 2:47 AM
> >
> > On Tue, 8 Jun 2021 15:44:26 +0200
> > Paolo Bonzini <[email protected]> wrote:
> >
> > > On 08/06/21 15:15, Jason Gunthorpe wrote:
> > > > On Tue, Jun 08, 2021 at 09:56:09AM +0200, Paolo Bonzini wrote:
> > > >
> > > >>>> Alternatively you can add a KVM_DEV_IOASID_{ADD,DEL} pair of
> > ioctls. But it
> > > >>>> seems useless complication compared to just using what we have
> > now, at least
> > > >>>> while VMs only use IOASIDs via VFIO.
> > > >>>
> > > >>> The simplest is KVM_ENABLE_WBINVD(<fd security proof>) and be
> > done
> > > >>> with it.
> >
> > Even if we were to relax wbinvd access to any device (capable of
> > no-snoop or not) in any IOMMU configuration (blocking no-snoop or not),
> > I think as soon as we say "proof" is required to gain this access then
> > that proof should be ongoing for the life of the access.
> >
> > That alone makes this more than a "I want this feature, here's my
> > proof", one-shot ioctl. Like the groupfd enabling a path for KVM to
> > ask if that group is non-coherent and holding a group reference to
> > prevent the group from being used to authorize multiple KVM instances,
> > the ioasidfd proof would need to minimally validate that devices are
> > present and provide some reference to enforce that model as ongoing, or
> > notifier to indicate an end of that authorization. I don't think we
> > can simplify that out of the equation or we've essentially invalidated
> > that the proof is really required.
> >
> > > >>
> > > >> The simplest one is KVM_DEV_VFIO_GROUP_ADD/DEL, that already
> > exists and also
> > > >> covers hot-unplug. The second simplest one is
> > KVM_DEV_IOASID_ADD/DEL.
> > > >
> > > > This isn't the same thing, this is back to trying to have the kernel
> > > > set policy for userspace.
> > >
> > > If you want a userspace policy then there would be three states:
> > >
> > > * WBINVD enabled because a WBINVD-enabled VFIO device is attached.
> > >
> > > * WBINVD potentially enabled but no WBINVD-enabled VFIO device
> > attached
> > >
> > > * WBINVD forcefully disabled
> > >
> > > KVM_DEV_VFIO_GROUP_ADD/DEL can still be used to distinguish the first
> > > two. Due to backwards compatibility, those two describe the default
> > > behavior; disabling wbinvd can be done easily with a new sub-ioctl of
> > > KVM_ENABLE_CAP and doesn't require any security proof.
> >
> > That seems like a good model, use the kvm-vfio device for the default
> > behavior and extend an existing KVM ioctl if QEMU still needs a way to
> > tell KVM to assume all DMA is coherent, regardless of what the kvm-vfio
> > device reports.
> >
> > If feels like we should be able to support a backwards compatibility
> > mode using the vfio group, but I expect long term we'll want to
> > transition the kvm-vfio device from a groupfd to an ioasidfd.
> >
> > > The meaning of WBINVD-enabled is "won't return -ENXIO for the wbinvd
> > > ioctl", nothing more nothing less. If all VFIO devices are going to be
> > > WBINVD-enabled, then that will reflect on KVM as well, and I won't have
> > > anything to object if there's consensus on the device assignment side of
> > > things that the wbinvd ioctl won't ever fail.
> >
> > If we create the IOMMU vs device coherency matrix:
> >
> > \ Device supports
> > IOMMU blocks \ no-snoop
> > no-snoop \ yes | no |
> > ---------------+-----+-----+
> > yes | 1 | 2 |
> > ---------------+-----+-----+
> > no | 3 | 4 |
> > ---------------+-----+-----+
> >
> > DMA is always coherent in boxes {1,2,4} (wbinvd emulation is not
> > needed). VFIO will currently always configure the IOMMU for {1,2} when
> > the feature is supported. Boxes {3,4} are where we'll currently
> > emulate wbinvd. The best we could do, not knowing the guest or
> > insights into the guest driver would be to only emulate wbinvd for {3}.
> >
> > The majority of devices appear to report no-snoop support {1,3}, but
> > the claim is that it's mostly unused outside of GPUs, effectively {2,4}.
> > I'll speculate that most IOMMUs support enforcing coherency (amd, arm,
> > fsl unconditionally, intel conditionally) {1,2}.
> >
> > I think that means we're currently operating primarily in Box {1},
> > which does not seem to lean towards unconditional wbinvd access with
> > device ownership.
> >
> > I think we have a desire with IOASID to allow user policy to operate
> > certain devices in {3} and I think the argument there is that a
> > specific software enforced coherence sync is more efficient on the bus
> > than the constant coherence enforcement by the IOMMU.
> >
> > I think that the disable mode Jason proposed is essentially just a way
> > to move a device from {3} to {4}, ie. the IOASID support or
> > configuration does not block no-snoop and the device claims to support
> > no-snoop, but doesn't use it. How we'd determine this to be true for
> > a device without a crystal ball of driver development or hardware
> > errata that no-snoop transactions are not possible regardless of the
> > behavior of the enable bit, I'm not sure. If we're operating a device
> > in {3}, but the device does not generate no-snoop transactions, then
> > presumably the guest driver isn't generating wbinvd instructions for us
> > to emulate, so where are the wbinvd instructions that this feature
> > would prevent being emulated coming from? Thanks,
> >
>
> I'm writing v2 now. Below is what I captured from this discussion.
> Please let me know whether it matches your thoughts:
>
> - Keep the existing kvm-vfio device with a kernel-decided policy in the short
> term, i.e. 'disable' for 1/2 and 'enable' for 3/4. Jason still has a different
> thought on whether this should be an explicit wbinvd cmd, though;

Right, in the short term there will need to be compatibility through
the vfio groupfd as used by the kvm-vfio device, and that should attempt
to approximate the narrowest reporting of potential non-coherent DMA:
ideally only {3}, but treating {3,4} as equivalent is acceptable.

> - Long-term transition to ioasidfd (for non-vfio usage);

Long term, I'd expect the kvm-vfio device to transition to ioasidfd as
well.

> - As an extension we want to support 'force-enable' (1->3 for performance
> reasons) from the user, but not 'force-disable' (3->4, which sounds meaningless
> since if the guest driver doesn't use wbinvd then keeping wbinvd emulation
> doesn't hurt);

Yes, the user should have control to define an IOASID where coherency is
not enforced by the IOMMU. The value of disabling KVM wbinvd
emulation (3->4) needs to be justified, but there's no opposition to
extending existing KVM ioctls to provide that emulate-pause/resume
switch.

> - To support 'force-enable' there is no need for an additional KVM-side
> contract. It just relies on what the kvm-vfio device reports, regardless of
> whether an 'enable' policy comes from the kernel or the user;

Yes.

> - 'force-enable' is supported through /dev/iommu (new name for
> /dev/ioasid);

Same as 2 above, but yes. There are probably several open questions here
regarding the user granularity of IOMMU coherence, e.g. is it defined at
the IOASID or per mapping? If not per mapping, how is it enforced for
user-owned page tables? The map/unmap interface could simply follow an
IOASID creation attribute to maintain ioctl compatibility with type1.

> - Qemu first calls IOMMU_GET_DEV_INFO(device_handle) to acquire
> the default policy (enable vs. disable) for a device. This is a kernel
> decision based on the device's no-snoop capability and the iommu's snoop
> control capability;

The device handle is from vfio though, right? I'd think this would be
a capability on the VFIO_DEVICE_GET_INFO ioctl, where for PCI,
non-coherent is simply the result of probing that Enable No-snoop is
already or can be enabled (not hardwired zero).

kvm_vfio_group_is_coherent() would require either IOMMU enforced
coherence or all devices in the group to be coherent, the latter
allowing box 4 in the matrix to make wbinvd a no-op.
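
A rough sketch of the semantics being described for such a helper; this is
purely illustrative, and vfio_group_enforced_coherent() and
vfio_group_all_devices_coherent() are made-up placeholder names, not existing
vfio functions:

    /* Illustrative only: the two helpers below are hypothetical placeholders. */
    static bool kvm_vfio_group_is_coherent(struct vfio_group *group)
    {
        /* Box 1/2: the IOMMU forces snooping, so DMA is always coherent */
        if (vfio_group_enforced_coherent(group))
            return true;

        /*
         * Box 4: no device in the group can actually generate no-snoop
         * transactions, so wbinvd emulation would be a no-op anyway.
         */
        return vfio_group_all_devices_coherent(group);
    }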

> - If not specified, a newly-created IOASID follows the kernel policy.
> Alternatively, Qemu could explicitly mark an IOASID as non-coherent
> at IOMMU_ALLOC_IOASID time;

It seems like the "kernel policy" is specific to vfio compatibility.
The native IOASID API should allow the user to specify.

> - Attaching a non-snoop device which cannot be forced to snoop by the
> iommu to a coherent IOASID fails, because a snoop-format I/O page
> table causes errors on such an iommu;

Yes, and vfio code emulating type1 via IOASID would need to create a
separate IOASID.

> - Devices attached to a non-coherent IOASID all use the no-snoop
> format I/O page table, even when the iommu is capable of forcing
> snoop;

Both this and the previous point assume that coherence is defined at the
IOASID rather than per mapping; I think we need to decide whether that's the
case.

> - After the IOASID is properly configured, Qemu then uses the kvm-vfio device
> to notify KVM, which calls a vfio helper function to get the coherent attribute
> per vfio_group. Because this attribute is kept in the IOASID, we possibly need
> to return the attribute to vfio at vfio_attach_ioasid time.

Yes, vfio will need some way to know whether the IOASID is operating in
boxes 3/4 (or better just 3). Also, since coherence is the product of
device configuration rather than group configuration, we'll likely need
new mechanisms to keep the KVM view correct. Thanks,

Alex

2021-06-10 02:02:19

by Jason Wang

[permalink] [raw]
Subject: Re: [RFC] /dev/ioasid uAPI proposal


On 2021/6/8 9:20 PM, Jason Gunthorpe wrote:
> On Tue, Jun 08, 2021 at 09:10:42AM +0800, Jason Wang wrote:
>
>> Well, this sounds like a re-invention of io_uring which has already worked
>> for multifds.
> How so? io_uring is about sending work to the kernel, not getting
> structued events back?


Actually it can. Userspace can poll multiple fds via preparing multiple
sqes with IORING_OP_ADD flag.


>
> It is more like one of the perf rings


This means another ring, and we need to introduce an ioctl() to add or remove
ioasids from the poll. And it still needs a kind of fallback like a list
if the ring is full.

Thanks


>
> Jason
>

2021-06-10 02:18:55

by Jason Wang

[permalink] [raw]
Subject: Re: [RFC] /dev/ioasid uAPI proposal


On 2021/6/8 6:45 PM, Enrico Weigelt, metux IT consult wrote:
> On 07.06.21 20:01, Jason Gunthorpe wrote:
>> <shrug> it is what it is, select has a fixed size bitmap of FD #s and
>> a hard upper bound on that size as part of the glibc ABI - can't be
>> fixed.
>
> in glibc ABI ? Uuuuh!
>

Note that dealing with select(), or trying to overcome the limitation via
epoll() directly in the application, is not good practice (or at least not
portable).

It's suggested to use the building blocks provided by glib, e.g. the main
event loop [1]. That is how Qemu solves the issue of dealing with a lot
of file descriptors.

Thanks

[1] https://developer.gnome.org/glib/stable/glib-The-Main-Event-Loop.html


>
> --mtx
>

2021-06-10 04:08:05

by Jason Wang

[permalink] [raw]
Subject: Re: [RFC] /dev/ioasid uAPI proposal


On 2021/6/10 10:00 AM, Jason Wang wrote:
>
> On 2021/6/8 9:20 PM, Jason Gunthorpe wrote:
>> On Tue, Jun 08, 2021 at 09:10:42AM +0800, Jason Wang wrote:
>>
>>> Well, this sounds like a re-invention of io_uring which has already
>>> worked
>>> for multifds.
>> How so? io_uring is about sending work to the kernel, not getting
>> structued events back?
>
>
> Actually it can. Userspace can poll multiple fds via preparing
> multiple sqes with IORING_OP_ADD flag.


IORING_OP_POLL_ADD actually.

Thanks
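
For reference, a minimal liburing sketch of that pattern, watching a set of
file descriptors with IORING_OP_POLL_ADD; the fds here are just placeholders
for whatever an ioasid fault-reporting interface would hand out:

    #include <liburing.h>
    #include <poll.h>
    #include <stdio.h>

    /* Watch a set of fds for readability using one io_uring instance. */
    int poll_fds(const int *fds, int nr)
    {
        struct io_uring ring;
        struct io_uring_cqe *cqe;
        int i, ret;

        ret = io_uring_queue_init(nr, &ring, 0);
        if (ret < 0)
            return ret;

        for (i = 0; i < nr; i++) {
            /* One poll request per fd; the fd itself is used as user data */
            struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);

            io_uring_prep_poll_add(sqe, fds[i], POLLIN);
            io_uring_sqe_set_data(sqe, (void *)(long)fds[i]);
        }
        io_uring_submit(&ring);

        /* Wait for the first fd that becomes readable */
        ret = io_uring_wait_cqe(&ring, &cqe);
        if (!ret) {
            printf("fd %ld is readable\n", (long)io_uring_cqe_get_data(cqe));
            io_uring_cqe_seen(&ring, cqe);
        }
        io_uring_queue_exit(&ring);
        return ret;
    }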


>
>
>>
>> It is more like one of the perf rings
>
>
> This means another ring and we need introduce ioctl() to add or remove
> ioasids from the poll. And it still need a kind of fallback like a
> list if the ring is full.
>
> Thanks
>
>
>>
>> Jason
>>

2021-06-10 11:49:50

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [RFC] /dev/ioasid uAPI proposal

On Thu, Jun 10, 2021 at 10:00:01AM +0800, Jason Wang wrote:
>
> On 2021/6/8 9:20 PM, Jason Gunthorpe wrote:
> > On Tue, Jun 08, 2021 at 09:10:42AM +0800, Jason Wang wrote:
> >
> > > Well, this sounds like a re-invention of io_uring which has already worked
> > > for multifds.
> > How so? io_uring is about sending work to the kernel, not getting
> > structued events back?
>
>
> Actually it can. Userspace can poll multiple fds via preparing multiple sqes
> with IORING_OP_ADD flag.

Poll is only a part of what is needed here; the main issue is
transferring the PRI events to userspace quickly.

> This means another ring and we need introduce ioctl() to add or remove
> ioasids from the poll. And it still need a kind of fallback like a list if
> the ring is full.

The max size of the ring should be determinable based on the PRI
concurrency of each device and the number of devices sharing the ring.

In any event, I'm not entirely convinced eliding the PRI user/kernel
copy is the main issue here.. If we want this to be low latency I
think it ends up with some kernel driver component assisting the
vIOMMU emulation and avoiding the round trip to userspace

Jason

2021-06-10 16:41:44

by Jean-Philippe Brucker

[permalink] [raw]
Subject: Re: [RFC] /dev/ioasid uAPI proposal

On Tue, Jun 08, 2021 at 04:31:50PM +1000, David Gibson wrote:
> For the qemu case, I would imagine a two stage fallback:
>
> 1) Ask for the exact IOMMU capabilities (including pagetable
> format) that the vIOMMU has. If the host can supply, you're
> good
>
> 2) If not, ask for a kernel managed IOAS. Verify that it can map
> all the IOVA ranges the guest vIOMMU needs, and has an equal or
> smaller pagesize than the guest vIOMMU presents. If so,
> software emulate the vIOMMU by shadowing guest io pagetable
> updates into the kernel managed IOAS.
>
> 3) You're out of luck, don't start.
>
> For both (1) and (2) I'd expect it to be asking this question *after*
> saying what devices are attached to the IOAS, based on the virtual
> hardware configuration. That doesn't cover hotplug, of course, for
> that you have to just fail the hotplug if the new device isn't
> supportable with the IOAS you already have.

Yes. So there is a point in time when the IOAS is frozen, and cannot take
in new incompatible devices. I think that can support the usage I had in
mind. If the VMM (non-QEMU, let's say) wanted to create one IOASID FD per
feature set it could bind the first device, freeze the features, then bind
the second device. If the second bind fails it creates a new FD, allowing it
to fall back to (2) for the second device while keeping (1) for the first
device. A paravirtual IOMMU like virtio-iommu could easily support this as
it describes pIOMMU properties for each device to the guest. An emulated
vIOMMU could also support some hybrid cases as you describe below.
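
To illustrate that per-feature-set FD idea, here is a very rough sketch of the
VMM-side logic. The calls ioasid_open(), ioasid_bind_device() and
ioasid_freeze_features() are purely hypothetical placeholders for whatever the
final uAPI provides, and the return convention is arbitrary:

    /*
     * Hypothetical sketch: try the existing IOASID FD (whose feature set is
     * already frozen); if the device is incompatible, give it its own FD so
     * it can fall back to the shadowing path (2) described above.
     */
    int fd_for_device(int primary_fd, int device_fd)
    {
        if (ioasid_bind_device(primary_fd, device_fd) == 0)
            return primary_fd;              /* path (1): native pgtable format */

        int fd = ioasid_open();             /* hypothetical */
        if (fd < 0)
            return -1;
        if (ioasid_bind_device(fd, device_fd) < 0)
            return -1;
        ioasid_freeze_features(fd);         /* hypothetical */
        return fd;                          /* path (2): software shadowing */
    }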

> One can imagine optimizations where for certain intermediate cases you
> could do a lighter SW emu if the host supports a model that's close to
> the vIOMMU one, and you're able to trap and emulate the differences.
> In practice I doubt anyone's going to have time to look for such cases
> and implement the logic for it.
>
> > For example depending whether the hardware IOMMU is SMMUv2 or SMMUv3, that
> > completely changes the capabilities offered to the guest (some v2
> > implementations support nesting page tables, but never PASID nor PRI
> > unlike v3.) The same vIOMMU could support either, presenting different
> > capabilities to the guest, even multiple page table formats if we wanted
> > to be exhaustive (SMMUv2 supports the older 32-bit descriptor), but it
> > needs to know early on what the hardware is precisely. Then some new page
> > table format shows up and, although the vIOMMU can support that in
> > addition to older ones, QEMU will have to pick a single one, that it
> > assumes the guest knows how to drive?
> >
> > I think once it binds a device to an IOASID fd, QEMU will want to probe
> > what hardware features are available before going further with the vIOMMU
> > setup (is there PASID, PRI, which page table formats are supported,
> > address size, page granule, etc). Obtaining precise information about the
> > hardware would be less awkward than trying different configurations until
> > one succeeds. Binding an additional device would then fail if its pIOMMU
> > doesn't support exactly the features supported for the first device,
> > because we don't know which ones the guest will choose. QEMU will have to
> > open a new IOASID fd for that device.
>
> No, this fundamentally misunderstands the qemu model. The user
> *chooses* the guest visible platform, and qemu supplies it or fails.
> There is no negotiation with the guest, because this makes managing
> migration impossibly difficult.

I'd like to understand better where the difficulty lies, with migration.
Is the problem, once we have a guest running on physical machine A, to
make sure that physical machine B supports the same IOMMU properties
before migrating the VM over to B? Why can't QEMU (instead of the user)
select a feature set on machine A, then when time comes to migrate, query
all information from the host kernel on machine B and check that it
matches what was picked for machine A? Or is it only trying to
accommodate different sets of features between A and B, that would be too
difficult?

Thanks,
Jean

>
> -cpu host is an exception, which is used because it is so useful, but
> it's kind of a pain on the qemu side. Virt management systems like
> oVirt/RHV almost universally *do not use* -cpu host, precisely because
> it cannot support predictable migration.
>
> --
> David Gibson | I'll have my music baroque, and my code
> david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
> | _way_ _around_!
> http://www.ozlabs.org/~dgibson


2021-06-11 05:51:40

by Jason Wang

[permalink] [raw]
Subject: Re: [RFC] /dev/ioasid uAPI proposal


On 2021/6/10 7:47 PM, Jason Gunthorpe wrote:
> On Thu, Jun 10, 2021 at 10:00:01AM +0800, Jason Wang wrote:
> On 2021/6/8 9:20 PM, Jason Gunthorpe wrote:
>>> On Tue, Jun 08, 2021 at 09:10:42AM +0800, Jason Wang wrote:
>>>
>>>> Well, this sounds like a re-invention of io_uring which has already worked
>>>> for multifds.
>>> How so? io_uring is about sending work to the kernel, not getting
>>> structued events back?
>>
>> Actually it can. Userspace can poll multiple fds via preparing multiple sqes
>> with IORING_OP_ADD flag.
> Poll is only a part of what is needed here, the main issue is
> transfering the PRI events to userspace quickly.


Do we really care about, e.g., at most one more syscall in this case? I think
the time spent on demand paging is much more than that spent transferring the
#PF to userspace. What's more, well-designed vIOMMU-capable IOMMU hardware
should have the ability to inject such an event directly into the guest if the
#PF happens on L1.


>
>> This means another ring and we need introduce ioctl() to add or remove
>> ioasids from the poll. And it still need a kind of fallback like a list if
>> the ring is full.
> The max size of the ring should be determinable based on the PRI
> concurrance of each device and the number of devices sharing the ring


This has at least one assumption: that the #PF event is the only event for
the ring. I'm not sure that is the case.

Thanks


>
> In any event, I'm not entirely convinced eliding the PRI user/kernel
> copy is the main issue here.. If we want this to be low latency I
> think it ends up with some kernel driver component assisting the
> vIOMMU emulation and avoiding the round trip to userspace
>
> Jason
>

2021-06-15 09:00:45

by Tian, Kevin

[permalink] [raw]
Subject: RE: [RFC] /dev/ioasid uAPI proposal

Hi, Jason,

> From: Jason Gunthorpe
> Sent: Thursday, June 3, 2021 9:05 PM
>
> On Thu, Jun 03, 2021 at 06:39:30AM +0000, Tian, Kevin wrote:
> > > > Two helper functions are provided to support VFIO_ATTACH_IOASID:
> > > >
> > > > struct attach_info {
> > > > u32 ioasid;
> > > > // If valid, the PASID to be used physically
> > > > u32 pasid;
> > > > };
> > > > int ioasid_device_attach(struct ioasid_dev *dev,
> > > > struct attach_info info);
> > > > int ioasid_device_detach(struct ioasid_dev *dev, u32 ioasid);
> > >
> > > Honestly, I still prefer this to be highly explicit as this is where
> > > all device driver authors get invovled:
> > >
> > > ioasid_pci_device_attach(struct pci_device *pdev, struct ioasid_dev *dev,
> > > u32 ioasid);
> > > ioasid_pci_device_pasid_attach(struct pci_device *pdev, u32
> *physical_pasid,
> > > struct ioasid_dev *dev, u32 ioasid);
> >
> > Then better naming it as pci_device_attach_ioasid since the 1st parameter
> > is struct pci_device?
>
> No, the leading tag indicates the API's primary subystem, in this case
> it is iommu (and if you prefer list the iommu related arguments first)
>

I have a question on this suggestion when working on v2.

Within the IOMMU fd, only the generic struct device pointer is used, which
is already saved in struct ioasid_dev at device bind time:

struct ioasid_dev *ioasid_register_device(struct ioasid_ctx *ctx,
struct device *device, u64 device_label);

What does this additional struct pci_device bring when it's specified in
the attach call? If we save it in attach_data, at which point will it be
used or checked?

Can you help elaborate?

Thanks
Kevin

2021-06-15 15:08:21

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [RFC] /dev/ioasid uAPI proposal

On Tue, Jun 15, 2021 at 08:59:25AM +0000, Tian, Kevin wrote:
> Hi, Jason,
>
> > From: Jason Gunthorpe
> > Sent: Thursday, June 3, 2021 9:05 PM
> >
> > On Thu, Jun 03, 2021 at 06:39:30AM +0000, Tian, Kevin wrote:
> > > > > Two helper functions are provided to support VFIO_ATTACH_IOASID:
> > > > >
> > > > > struct attach_info {
> > > > > u32 ioasid;
> > > > > // If valid, the PASID to be used physically
> > > > > u32 pasid;
> > > > > };
> > > > > int ioasid_device_attach(struct ioasid_dev *dev,
> > > > > struct attach_info info);
> > > > > int ioasid_device_detach(struct ioasid_dev *dev, u32 ioasid);
> > > >
> > > > Honestly, I still prefer this to be highly explicit as this is where
> > > > all device driver authors get invovled:
> > > >
> > > > ioasid_pci_device_attach(struct pci_device *pdev, struct ioasid_dev *dev,
> > > > u32 ioasid);
> > > > ioasid_pci_device_pasid_attach(struct pci_device *pdev, u32
> > *physical_pasid,
> > > > struct ioasid_dev *dev, u32 ioasid);
> > >
> > > Then better naming it as pci_device_attach_ioasid since the 1st parameter
> > > is struct pci_device?
> >
> > No, the leading tag indicates the API's primary subystem, in this case
> > it is iommu (and if you prefer list the iommu related arguments first)
> >
>
> I have a question on this suggestion when working on v2.
>
> Within IOMMU fd it uses only the generic struct device pointer, which
> is already saved in struct ioasid_dev at device bind time:
>
> struct ioasid_dev *ioasid_register_device(struct ioasid_ctx *ctx,
> struct device *device, u64 device_label);
>
> What does this additional struct pci_device bring when it's specified in
> the attach call? If we save it in attach_data, at which point will it be
> used or checked?

The above was for attaching to an ioasid not the register path

You should call 'device_label' 'device_cookie' if it is a user
provided u64

Jason

2021-06-15 23:02:36

by Tian, Kevin

[permalink] [raw]
Subject: RE: [RFC] /dev/ioasid uAPI proposal

> From: Jason Gunthorpe <[email protected]>
> Sent: Tuesday, June 15, 2021 11:07 PM
>
> On Tue, Jun 15, 2021 at 08:59:25AM +0000, Tian, Kevin wrote:
> > Hi, Jason,
> >
> > > From: Jason Gunthorpe
> > > Sent: Thursday, June 3, 2021 9:05 PM
> > >
> > > On Thu, Jun 03, 2021 at 06:39:30AM +0000, Tian, Kevin wrote:
> > > > > > Two helper functions are provided to support VFIO_ATTACH_IOASID:
> > > > > >
> > > > > > struct attach_info {
> > > > > > u32 ioasid;
> > > > > > // If valid, the PASID to be used physically
> > > > > > u32 pasid;
> > > > > > };
> > > > > > int ioasid_device_attach(struct ioasid_dev *dev,
> > > > > > struct attach_info info);
> > > > > > int ioasid_device_detach(struct ioasid_dev *dev, u32 ioasid);
> > > > >
> > > > > Honestly, I still prefer this to be highly explicit as this is where
> > > > > all device driver authors get invovled:
> > > > >
> > > > > ioasid_pci_device_attach(struct pci_device *pdev, struct ioasid_dev
> *dev,
> > > > > u32 ioasid);
> > > > > ioasid_pci_device_pasid_attach(struct pci_device *pdev, u32
> > > *physical_pasid,
> > > > > struct ioasid_dev *dev, u32 ioasid);
> > > >
> > > > Then better naming it as pci_device_attach_ioasid since the 1st
> parameter
> > > > is struct pci_device?
> > >
> > > No, the leading tag indicates the API's primary subystem, in this case
> > > it is iommu (and if you prefer list the iommu related arguments first)
> > >
> >
> > I have a question on this suggestion when working on v2.
> >
> > Within IOMMU fd it uses only the generic struct device pointer, which
> > is already saved in struct ioasid_dev at device bind time:
> >
> > struct ioasid_dev *ioasid_register_device(struct ioasid_ctx *ctx,
> > struct device *device, u64 device_label);
> >
> > What does this additional struct pci_device bring when it's specified in
> > the attach call? If we save it in attach_data, at which point will it be
> > used or checked?
>
> The above was for attaching to an ioasid not the register path

Yes, I know, and this is my question. When receiving a struct pci_device
at attach time, what should the IOMMU fd do with it? Just verify whether
pci_device->device is the same as ioasid_dev->device? If we save it in
per-device attach data under the ioasid, when will it be used later?

I feel that once ioasid_dev is returned in the register path, the following
operations (unregister, attach, detach) should just use ioasid_dev as the
main object.

>
> You should call 'device_label' 'device_cookie' if it is a user
> provided u64
>

will do.

Thanks
Kevin

2021-06-15 23:04:37

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [RFC] /dev/ioasid uAPI proposal

On Tue, Jun 15, 2021 at 10:59:06PM +0000, Tian, Kevin wrote:
> > From: Jason Gunthorpe <[email protected]>
> > Sent: Tuesday, June 15, 2021 11:07 PM
> >
> > On Tue, Jun 15, 2021 at 08:59:25AM +0000, Tian, Kevin wrote:
> > > Hi, Jason,
> > >
> > > > From: Jason Gunthorpe
> > > > Sent: Thursday, June 3, 2021 9:05 PM
> > > >
> > > > On Thu, Jun 03, 2021 at 06:39:30AM +0000, Tian, Kevin wrote:
> > > > > > > Two helper functions are provided to support VFIO_ATTACH_IOASID:
> > > > > > >
> > > > > > > struct attach_info {
> > > > > > > u32 ioasid;
> > > > > > > // If valid, the PASID to be used physically
> > > > > > > u32 pasid;
> > > > > > > };
> > > > > > > int ioasid_device_attach(struct ioasid_dev *dev,
> > > > > > > struct attach_info info);
> > > > > > > int ioasid_device_detach(struct ioasid_dev *dev, u32 ioasid);
> > > > > >
> > > > > > Honestly, I still prefer this to be highly explicit as this is where
> > > > > > all device driver authors get invovled:
> > > > > >
> > > > > > ioasid_pci_device_attach(struct pci_device *pdev, struct ioasid_dev
> > *dev,
> > > > > > u32 ioasid);
> > > > > > ioasid_pci_device_pasid_attach(struct pci_device *pdev, u32
> > > > *physical_pasid,
> > > > > > struct ioasid_dev *dev, u32 ioasid);
> > > > >
> > > > > Then better naming it as pci_device_attach_ioasid since the 1st
> > parameter
> > > > > is struct pci_device?
> > > >
> > > > No, the leading tag indicates the API's primary subystem, in this case
> > > > it is iommu (and if you prefer list the iommu related arguments first)
> > > >
> > >
> > > I have a question on this suggestion when working on v2.
> > >
> > > Within IOMMU fd it uses only the generic struct device pointer, which
> > > is already saved in struct ioasid_dev at device bind time:
> > >
> > > struct ioasid_dev *ioasid_register_device(struct ioasid_ctx *ctx,
> > > struct device *device, u64 device_label);
> > >
> > > What does this additional struct pci_device bring when it's specified in
> > > the attach call? If we save it in attach_data, at which point will it be
> > > used or checked?
> >
> > The above was for attaching to an ioasid not the register path
>
> Yes, I know. and this is my question. When receiving a struct pci_device
> at attach time, what should IOMMU fd do with it? Just verify whether
> pci_device->device is same as ioasid_dev->device? if saving it to per-device
> attach data under ioasid then when will it be further used?
>
> I feel once ioasid_dev is returned in the register path, following operations
> (unregister, attach, detach) just uses ioasid_dev as the main object.

The point of having the pci_device specific API was to convey bus
specific information during the attachment to the IOASID.

The registration of the device to the iommu_fd doesn't need bus-specific
information, AFAIK? So just use a normal struct device pointer.

Jason

2021-06-15 23:10:54

by Tian, Kevin

[permalink] [raw]
Subject: RE: [RFC] /dev/ioasid uAPI proposal

> From: Jason Gunthorpe <[email protected]>
> Sent: Wednesday, June 16, 2021 7:02 AM
>
> On Tue, Jun 15, 2021 at 10:59:06PM +0000, Tian, Kevin wrote:
> > > From: Jason Gunthorpe <[email protected]>
> > > Sent: Tuesday, June 15, 2021 11:07 PM
> > >
> > > On Tue, Jun 15, 2021 at 08:59:25AM +0000, Tian, Kevin wrote:
> > > > Hi, Jason,
> > > >
> > > > > From: Jason Gunthorpe
> > > > > Sent: Thursday, June 3, 2021 9:05 PM
> > > > >
> > > > > On Thu, Jun 03, 2021 at 06:39:30AM +0000, Tian, Kevin wrote:
> > > > > > > > Two helper functions are provided to support
> VFIO_ATTACH_IOASID:
> > > > > > > >
> > > > > > > > struct attach_info {
> > > > > > > > u32 ioasid;
> > > > > > > > // If valid, the PASID to be used physically
> > > > > > > > u32 pasid;
> > > > > > > > };
> > > > > > > > int ioasid_device_attach(struct ioasid_dev *dev,
> > > > > > > > struct attach_info info);
> > > > > > > > int ioasid_device_detach(struct ioasid_dev *dev, u32 ioasid);
> > > > > > >
> > > > > > > Honestly, I still prefer this to be highly explicit as this is where
> > > > > > > all device driver authors get invovled:
> > > > > > >
> > > > > > > ioasid_pci_device_attach(struct pci_device *pdev, struct
> ioasid_dev
> > > *dev,
> > > > > > > u32 ioasid);
> > > > > > > ioasid_pci_device_pasid_attach(struct pci_device *pdev, u32
> > > > > *physical_pasid,
> > > > > > > struct ioasid_dev *dev, u32 ioasid);
> > > > > >
> > > > > > Then better naming it as pci_device_attach_ioasid since the 1st
> > > parameter
> > > > > > is struct pci_device?
> > > > >
> > > > > No, the leading tag indicates the API's primary subystem, in this case
> > > > > it is iommu (and if you prefer list the iommu related arguments first)
> > > > >
> > > >
> > > > I have a question on this suggestion when working on v2.
> > > >
> > > > Within IOMMU fd it uses only the generic struct device pointer, which
> > > > is already saved in struct ioasid_dev at device bind time:
> > > >
> > > > struct ioasid_dev *ioasid_register_device(struct ioasid_ctx *ctx,
> > > > struct device *device, u64 device_label);
> > > >
> > > > What does this additional struct pci_device bring when it's specified in
> > > > the attach call? If we save it in attach_data, at which point will it be
> > > > used or checked?
> > >
> > > The above was for attaching to an ioasid not the register path
> >
> > Yes, I know. and this is my question. When receiving a struct pci_device
> > at attach time, what should IOMMU fd do with it? Just verify whether
> > pci_device->device is same as ioasid_dev->device? if saving it to per-device
> > attach data under ioasid then when will it be further used?
> >
> > I feel once ioasid_dev is returned in the register path, following operations
> > (unregister, attach, detach) just uses ioasid_dev as the main object.
>
> The point of having the pci_device specific API was to convey bus
> specific information during the attachment to the IOASID.

Which information? Can you elaborate? This is an area I'm not
familiar with, so I would appreciate it if you could help explain how this
bus-specific information is utilized within the attach function or
sometime later.

>
> The registration of the device to the iommu_fd doesn't need bus
> specific information, AFIAK? So just use a normal struct device
> pointer
>

yes.

Thanks
Kevin

2021-06-15 23:42:42

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [RFC] /dev/ioasid uAPI proposal

On Tue, Jun 15, 2021 at 11:09:37PM +0000, Tian, Kevin wrote:

> which information can you elaborate? This is the area which I'm not
> familiar with thus would appreciate if you can help explain how this
> bus specific information is utilized within the attach function or
> sometime later.

This is the idea that the device driver needs to specify which bus-specific
protocol it uses to issue DMAs when it attaches itself to an
IOASID. For PCI:

- Normal RID DMA
- PASID DMA
- ENQCMD triggered PASID DMA
- ATS/PRI enabled or not

And maybe more; e.g. CXL has some other operating modes, I think.

The device knows what it is going to do; we need to convey that to the
IOMMU layer so it is prepared properly.
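
One way to picture this is a PCI-specific attach descriptor filled in by the
device driver. The names below are purely illustrative (not a proposed kernel
API), just to show the kind of intent being conveyed:

    /* Illustrative sketch only: how a PCI driver could state its DMA protocol. */
    enum ioasid_pci_dma_mode {
        IOASID_PCI_DMA_RID,             /* plain requester-ID tagged DMA */
        IOASID_PCI_DMA_PASID,           /* PASID-tagged DMA */
        IOASID_PCI_DMA_ENQCMD_PASID,    /* PASID supplied through ENQCMD */
    };

    struct ioasid_pci_attach_info {
        enum ioasid_pci_dma_mode mode;
        u32  pasid;     /* valid for the PASID modes */
        bool ats_pri;   /* device will use ATS/PRI, i.e. I/O page faults */
    };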

Jason

2021-06-15 23:59:34

by Tian, Kevin

[permalink] [raw]
Subject: RE: [RFC] /dev/ioasid uAPI proposal

> From: Jason Gunthorpe <[email protected]>
> Sent: Wednesday, June 16, 2021 7:41 AM
>
> On Tue, Jun 15, 2021 at 11:09:37PM +0000, Tian, Kevin wrote:
>
> > which information can you elaborate? This is the area which I'm not
> > familiar with thus would appreciate if you can help explain how this
> > bus specific information is utilized within the attach function or
> > sometime later.
>
> This is the idea that the device driver needs to specify which bus
> specific protocol it uses to issue DMA's when it attaches itself to an
> IOASID. For PCI:

What about defining some general attributes instead of asking the iommu
fd to understand those bus-specific details?

>
> - Normal RID DMA

this is struct device pointer

> - PASID DMA

this is PASID or SSID which is understood by underlying iommu driver

> - ENQCMD triggered PASID DMA

From the iommu p.o.v. there is no difference from the last one. In v2 the
device driver just needs to communicate the PASID virtualization policy at
device binding time, e.g. whether vPASID is allowed; if yes, whether the
vPASID must be registered to the kernel; if via the kernel, whether it is
per-RID vs. global, etc. This policy is then conveyed to userspace via the
device capability query interface on the iommu fd.

> - ATS/PRI enabled or not

Just a generic 'supports I/O page faults or not' attribute

>
> And maybe more. Eg CXL has some other operating modes, I think
>
> The device knows what it is going to do, we need to convey that to the
> IOMMU layer so it is prepared properly.
>

Yes, but it's not necessary for the iommu fd to understand bus-specific
attributes. In the end, when the /dev/iommu uAPI calls the iommu layer
interface, it's all bus-agnostic.

Thanks
Kevin

2021-06-16 00:01:16

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [RFC] /dev/ioasid uAPI proposal

On Tue, Jun 15, 2021 at 11:56:28PM +0000, Tian, Kevin wrote:
> > From: Jason Gunthorpe <[email protected]>
> > Sent: Wednesday, June 16, 2021 7:41 AM
> >
> > On Tue, Jun 15, 2021 at 11:09:37PM +0000, Tian, Kevin wrote:
> >
> > > which information can you elaborate? This is the area which I'm not
> > > familiar with thus would appreciate if you can help explain how this
> > > bus specific information is utilized within the attach function or
> > > sometime later.
> >
> > This is the idea that the device driver needs to specify which bus
> > specific protocol it uses to issue DMA's when it attaches itself to an
> > IOASID. For PCI:
>
> What about defining some general attributes instead of asking iommu
> fd to understand those bus specific detail?

I prefer the API to be very clear and intent-driven; otherwise things
just get confused.

The whole WBINVD/no-snoop discussion I think is proof of that :\

> from iommu p.o.v there is no difference from last one. In v2 the device
> driver just needs to communicate the PASID virtualization policy at
> device binding time,

I want it documented in the kernel source WTF is happening, because
otherwise we are going to be completely lost in a few years. And your
RFC did have device driver specific differences here

> > The device knows what it is going to do, we need to convey that to the
> > IOMMU layer so it is prepared properly.
>
> Yes, but it's not necessarily to have iommu fd understand bus specific
> attributes. In the end when /dev/iommu uAPI calls iommu layer interface,
> it's all bus agnostic.

Why not? Just put some inline wrappers to translate the bus specific
language to your generic language if that is what makes the most
sense.
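
Something like the following, purely as a sketch of the wrapper idea. It
reuses the attach_info/ioasid_device_attach names quoted earlier in the
thread (and struct pci_device as written there); how the PCI-specific intent
actually maps onto generic attributes is deliberately left as a comment:

    /* Sketch only: a thin PCI wrapper over the generic attach path. */
    static inline int ioasid_pci_device_attach(struct pci_device *pdev,
                                               struct ioasid_dev *idev,
                                               u32 ioasid)
    {
        struct attach_info info = {
            .ioasid = ioasid,
            /*
             * Translate PCI-specific intent (RID/PASID/ENQCMD, ATS/PRI)
             * into whatever generic attributes the iommu layer wants.
             */
        };

        return ioasid_device_attach(idev, info);
    }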

Jason

2021-06-16 00:05:17

by Tian, Kevin

[permalink] [raw]
Subject: RE: [RFC] /dev/ioasid uAPI proposal

> From: Jason Gunthorpe <[email protected]>
> Sent: Wednesday, June 16, 2021 7:59 AM
>
> On Tue, Jun 15, 2021 at 11:56:28PM +0000, Tian, Kevin wrote:
> > > From: Jason Gunthorpe <[email protected]>
> > > Sent: Wednesday, June 16, 2021 7:41 AM
> > >
> > > On Tue, Jun 15, 2021 at 11:09:37PM +0000, Tian, Kevin wrote:
> > >
> > > > which information can you elaborate? This is the area which I'm not
> > > > familiar with thus would appreciate if you can help explain how this
> > > > bus specific information is utilized within the attach function or
> > > > sometime later.
> > >
> > > This is the idea that the device driver needs to specify which bus
> > > specific protocol it uses to issue DMA's when it attaches itself to an
> > > IOASID. For PCI:
> >
> > What about defining some general attributes instead of asking iommu
> > fd to understand those bus specific detail?
>
> I prefer the API be very clear and intent driven, otherwise things
> just get confused.
>
> The whole WBINVD/no-snoop discussion I think is proof of that :\
>
> > from iommu p.o.v there is no difference from last one. In v2 the device
> > driver just needs to communicate the PASID virtualization policy at
> > device binding time,
>
> I want it documented in the kernel source WTF is happening, because
> otherwise we are going to be completely lost in a few years. And your
> RFC did have device driver specific differences here
>
> > > The device knows what it is going to do, we need to convey that to the
> > > IOMMU layer so it is prepared properly.
> >
> > Yes, but it's not necessarily to have iommu fd understand bus specific
> > attributes. In the end when /dev/iommu uAPI calls iommu layer interface,
> > it's all bus agnostic.
>
> Why not? Just put some inline wrappers to translate the bus specific
> language to your generic language if that is what makes the most
> sense.
>

I can do this. Thanks

2021-06-17 07:22:55

by David Gibson

[permalink] [raw]
Subject: Re: [RFC] /dev/ioasid uAPI proposal

On Thu, Jun 10, 2021 at 06:37:31PM +0200, Jean-Philippe Brucker wrote:
> On Tue, Jun 08, 2021 at 04:31:50PM +1000, David Gibson wrote:
> > For the qemu case, I would imagine a two stage fallback:
> >
> > 1) Ask for the exact IOMMU capabilities (including pagetable
> > format) that the vIOMMU has. If the host can supply, you're
> > good
> >
> > 2) If not, ask for a kernel managed IOAS. Verify that it can map
> > all the IOVA ranges the guest vIOMMU needs, and has an equal or
> > smaller pagesize than the guest vIOMMU presents. If so,
> > software emulate the vIOMMU by shadowing guest io pagetable
> > updates into the kernel managed IOAS.
> >
> > 3) You're out of luck, don't start.
> >
> > For both (1) and (2) I'd expect it to be asking this question *after*
> > saying what devices are attached to the IOAS, based on the virtual
> > hardware configuration. That doesn't cover hotplug, of course, for
> > that you have to just fail the hotplug if the new device isn't
> > supportable with the IOAS you already have.
>
> Yes. So there is a point in time when the IOAS is frozen, and cannot take
> in new incompatible devices. I think that can support the usage I had in
> mind. If the VMM (non-QEMU, let's say) wanted to create one IOASID FD per
> feature set it could bind the first device, freeze the features, then bind

Are you thinking of this "freeze the features" as an explicitly
triggered action? I have suggested that an explicit "ENABLE" step
might be useful, but that hasn't had much traction from what I've
seen.

> the second device. If the second bind fails it creates a new FD, allowing
> to fall back to (2) for the second device while keeping (1) for the first
> device. A paravirtual IOMMU like virtio-iommu could easily support this as
> it describes pIOMMU properties for each device to the guest. An emulated
> vIOMMU could also support some hybrid cases as you describe below.

Eh.. in some cases. The vIOMMU model will often dictate which guest-side
devices need to share an address space, which may make it very
impractical to have them in different address spaces on the host side.

> > One can imagine optimizations where for certain intermediate cases you
> > could do a lighter SW emu if the host supports a model that's close to
> > the vIOMMU one, and you're able to trap and emulate the differences.
> > In practice I doubt anyone's going to have time to look for such cases
> > and implement the logic for it.
> >
> > > For example depending whether the hardware IOMMU is SMMUv2 or SMMUv3, that
> > > completely changes the capabilities offered to the guest (some v2
> > > implementations support nesting page tables, but never PASID nor PRI
> > > unlike v3.) The same vIOMMU could support either, presenting different
> > > capabilities to the guest, even multiple page table formats if we wanted
> > > to be exhaustive (SMMUv2 supports the older 32-bit descriptor), but it
> > > needs to know early on what the hardware is precisely. Then some new page
> > > table format shows up and, although the vIOMMU can support that in
> > > addition to older ones, QEMU will have to pick a single one, that it
> > > assumes the guest knows how to drive?
> > >
> > > I think once it binds a device to an IOASID fd, QEMU will want to probe
> > > what hardware features are available before going further with the vIOMMU
> > > setup (is there PASID, PRI, which page table formats are supported,
> > > address size, page granule, etc). Obtaining precise information about the
> > > hardware would be less awkward than trying different configurations until
> > > one succeeds. Binding an additional device would then fail if its pIOMMU
> > > doesn't support exactly the features supported for the first device,
> > > because we don't know which ones the guest will choose. QEMU will have to
> > > open a new IOASID fd for that device.
> >
> > No, this fundamentally misunderstands the qemu model. The user
> > *chooses* the guest visible platform, and qemu supplies it or fails.
> > There is no negotiation with the guest, because this makes managing
> > migration impossibly difficult.
>
> I'd like to understand better where the difficulty lies, with migration.
> Is the problem, once we have a guest running on physical machine A, to
> make sure that physical machine B supports the same IOMMU properties
> before migrating the VM over to B? Why can't QEMU (instead of the user)
> select a feature set on machine A, then when time comes to migrate, query
> all information from the host kernel on machine B and check that it
> matches what was picked for machine A? Or is it only trying to
> accommodate different sets of features between A and B, that would be too
> difficult?

There are two problems

1) Although it could be done in theory, it's hard, and it would need a
huge rewrite to qemu's whole migration infrastructure to do this.
We'd need a way of representing host features, working out which sets
are compatible with which others depending on what things the guest is
allowed to use, encoding the information in the migration stream and
reporting failure. None of this exists now.

Indeed qemu requires that you create the (stopped) machine on the
destination (including virtual hardware configuration) before even
attempting to process the incoming migration. It does not for the
most part transfer the machine configuration in the migration stream.
Now, that's generally considered a flaw with the design, but fixing it
is a huge project that no-one's really had the energy to begin despite
the idea being around for years.

2) It makes behaviour really hard to predict for management layers
above. Things like oVirt automatically migrate around a cluster for
load balancing. At the moment the model which works is basically that
if you request the same guest features on each end of the
migration, and qemu starts with that configuration on each end, the
migration should work (or only fail for transient reasons). If you
can't know if the migration is possible until you get the incoming
stream, reporting and exposing what will and won't work to the layer
above also becomes an immensely fiddly problem.

--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson



2021-06-17 07:23:09

by David Gibson

[permalink] [raw]
Subject: Re: [RFC] /dev/ioasid uAPI proposal

On Tue, Jun 08, 2021 at 10:17:56AM -0300, Jason Gunthorpe wrote:
> On Tue, Jun 08, 2021 at 12:37:04PM +1000, David Gibson wrote:
>
> > > The PPC/SPAPR support allows KVM to associate a vfio group to an IOMMU
> > > page table so that it can handle iotlb programming from pre-registered
> > > memory without trapping out to userspace.
> >
> > To clarify that's a guest side logical vIOMMU page table which is
> > partially managed by KVM. This is an optimization - things can work
> > without it, but it means guest iomap/unmap becomes a hot path because
> > each map/unmap hypercall has to go
> > guest -> KVM -> qemu -> VFIO
> >
> > So there are multiple context transitions.
>
> Isn't this overhead true of many of the vIOMMUs?

Yes, but historically it bit much harder on POWER for a couple of reasons:

1) POWER guests *always* have a vIOMMU - the platform has no concept
of passthrough mode. We therefore had a vIOMMU implementation some
time before the AMD or Intel IOMMUs were implemented as vIOMMUs in
qemu.

2) At the time we were implementing this the supported IOVA window for
the paravirtualized IOMMU was pretty small (1G, I think) making
vIOMMU maps and unmaps a pretty common operation.

> Can the fast path be
> generalized?

Not really. This is a paravirtualized guest IOMMU, so it's a platform
specific group of hypercalls that's being interpreted by KVM and
passed through to the IOMMU side using essentially the same backend
that the userspace implementation would eventually get to after a
bunch more context switches.

--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson



2021-06-17 07:23:19

by David Gibson

[permalink] [raw]
Subject: Re: [RFC] /dev/ioasid uAPI proposal

On Thu, Jun 03, 2021 at 08:12:27AM +0000, Tian, Kevin wrote:
> > From: David Gibson <[email protected]>
> > Sent: Wednesday, June 2, 2021 2:15 PM
> >
> [...]
>
> > >
> > > /*
> > > * Get information about an I/O address space
> > > *
> > > * Supported capabilities:
> > > * - VFIO type1 map/unmap;
> > > * - pgtable/pasid_table binding
> > > * - hardware nesting vs. software nesting;
> > > * - ...
> > > *
> > > * Related attributes:
> > > * - supported page sizes, reserved IOVA ranges (DMA mapping);
> >
> > Can I request we represent this in terms of permitted IOVA ranges,
> > rather than reserved IOVA ranges. This works better with the "window"
> > model I have in mind for unifying the restrictions of the POWER IOMMU
> > with Type1 like mapping.
>
> Can you elaborate how permitted range work better here?

Pretty much just that MAP operations would fail if they don't entirely
lie within a permitted range. So, for example, if your IOMMU only
implements, say, 45 bits of IOVA, then you'd have 0..0x1fffffffffff as
your only permitted range. If, like the POWER paravirtual IOMMU (in
default configuration), you have a small (1G) 32-bit range and a large
(45-bit) 64-bit range at a high address, you'd have, say:
0x00000000..0x3fffffff (32-bit range)
and
0x800000000000000 .. 0x8001fffffffffff (64-bit range)
as your permitted ranges.

If your IOMMU supports truly full 64-bit addressing, but has a
reserved range (for MSIs or whatever) at 0xaaaa0000..0xbbbb0000, then
you'd have permitted ranges of 0..0xaaa9ffff and
0xbbbb0000..0xffffffffffffffff.
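
As a small illustration of that model (the windows are taken from the POWER
example above; the struct and helper are just a sketch, not an existing
interface), a MAP request would only be accepted if it sits entirely inside
one permitted range:

    #include <stdbool.h>
    #include <stdint.h>

    struct iova_range {
        uint64_t start;
        uint64_t last;      /* inclusive */
    };

    /* Example: POWER paravirtual IOMMU windows from the text above */
    static const struct iova_range spapr_ranges[] = {
        { 0x0,               0x3fffffff },          /* 32-bit window */
        { 0x800000000000000, 0x8001fffffffffff },   /* 64-bit window */
    };

    /* A MAP of [iova, iova + len) is allowed only if it fits one window */
    static bool map_allowed(const struct iova_range *r, int n,
                            uint64_t iova, uint64_t len)
    {
        for (int i = 0; i < n; i++)
            if (iova >= r[i].start && iova + len - 1 <= r[i].last)
                return true;
        return false;
    }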

[snip]
> > For debugging and certain hypervisor edge cases it might be useful to
> > have a call to allow userspace to lookup and specific IOVA in a guest
> > managed pgtable.
>
> Since all the mapping metadata is from userspace, why would one
> rely on the kernel to provide such service? Or are you simply asking
> for some debugfs node to dump the I/O page table for a given
> IOASID?

I'm thinking of this as a debugging aid so you can make sure that
the kernel is interpreting that metadata in the same way that your
userspace expects it to be interpreted.


--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson



2021-06-17 07:24:04

by David Gibson

[permalink] [raw]
Subject: Re: [RFC] /dev/ioasid uAPI proposal

On Tue, Jun 08, 2021 at 04:04:06PM -0300, Jason Gunthorpe wrote:
> On Tue, Jun 08, 2021 at 10:53:02AM +1000, David Gibson wrote:
> > On Thu, Jun 03, 2021 at 08:52:24AM -0300, Jason Gunthorpe wrote:
> > > On Thu, Jun 03, 2021 at 03:13:44PM +1000, David Gibson wrote:
> > >
> > > > > We can still consider it a single "address space" from the IOMMU
> > > > > perspective. What has happened is that the address table is not just a
> > > > > 64 bit IOVA, but an extended ~80 bit IOVA formed by "PASID, IOVA".
> > > >
> > > > True. This does complexify how we represent what IOVA ranges are
> > > > valid, though. I'll bet you most implementations don't actually
> > > > implement a full 64-bit IOVA, which means we effectively have a large
> > > > number of windows from (0..max IOVA) for each valid pasid. This adds
> > > > another reason I don't think my concept of IOVA windows is just a
> > > > power specific thing.
> > >
> > > Yes
> > >
> > > Things rapidly get into weird hardware specific stuff though, the
> > > request will be for things like:
> > > "ARM PASID&IO page table format from SMMU IP block vXX"
> >
> > So, I'm happy enough for picking a user-managed pagetable format to
> > imply the set of valid IOVA ranges (though a query might be nice).
>
> I think a query is mandatory, and optionally asking for ranges seems
> generally useful as a HW property.
>
> The danger is things can get really tricky as the app can ask for
> ranges some HW needs but other HW can't provide.
>
> I would encourage a flow where "generic" apps like DPDK can somehow
> just ignore this, or at least be very, very simplified "I want around
> XX GB of IOVA space"
>
> dpdk type apps vs qemu apps are really quite different and we should
> be careful that the needs of HW accelerated vIOMMU emulation do not
> trump the needs of simple universal control over a DMA map.

Agreed.

--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson



2021-06-18 19:22:38

by Jean-Philippe Brucker

[permalink] [raw]
Subject: Re: [RFC] /dev/ioasid uAPI proposal

On Thu, Jun 17, 2021 at 01:00:14PM +1000, David Gibson wrote:
> On Thu, Jun 10, 2021 at 06:37:31PM +0200, Jean-Philippe Brucker wrote:
> > On Tue, Jun 08, 2021 at 04:31:50PM +1000, David Gibson wrote:
> > > For the qemu case, I would imagine a two stage fallback:
> > >
> > > 1) Ask for the exact IOMMU capabilities (including pagetable
> > > format) that the vIOMMU has. If the host can supply, you're
> > > good
> > >
> > > 2) If not, ask for a kernel managed IOAS. Verify that it can map
> > > all the IOVA ranges the guest vIOMMU needs, and has an equal or
> > > smaller pagesize than the guest vIOMMU presents. If so,
> > > software emulate the vIOMMU by shadowing guest io pagetable
> > > updates into the kernel managed IOAS.
> > >
> > > 3) You're out of luck, don't start.
> > >
> > > For both (1) and (2) I'd expect it to be asking this question *after*
> > > saying what devices are attached to the IOAS, based on the virtual
> > > hardware configuration. That doesn't cover hotplug, of course, for
> > > that you have to just fail the hotplug if the new device isn't
> > > supportable with the IOAS you already have.
> >
> > Yes. So there is a point in time when the IOAS is frozen, and cannot take
> > in new incompatible devices. I think that can support the usage I had in
> > mind. If the VMM (non-QEMU, let's say) wanted to create one IOASID FD per
> > feature set it could bind the first device, freeze the features, then bind
>
> Are you thinking of this "freeze the features" as an explicitly
> triggered action? I have suggested that an explicit "ENABLE" step
> might be useful, but that hasn't had much traction from what I've
> seen.

Seems like we do need an explicit enable step for the flow you described
above:

a) Bind all devices to an ioasid. Each bind succeeds.
b) Ask for a specific set of features for this aggregate of devices. Ask
for (1), fall back to (2), or abort.
c) Boot the VM
d) Hotplug a device, bind it to the ioasid. We're long past negotiating
features for the ioasid, so the host needs to reject the bind if the
new device is incompatible with what was requested at (b)

So a successful request at (b) would be the point where we change the
behavior of bind.

Since the kernel needs a form of feature check in any case, I still have a
preference for aborting the bind at (a) if the device isn't exactly
compatible with other devices already in the ioasid, because it might be
simpler to implement in the host, but I don't feel strongly about this.


> > I'd like to understand better where the difficulty lies, with migration.
> > Is the problem, once we have a guest running on physical machine A, to
> > make sure that physical machine B supports the same IOMMU properties
> > before migrating the VM over to B? Why can't QEMU (instead of the user)
> > select a feature set on machine A, then when time comes to migrate, query
> > all information from the host kernel on machine B and check that it
> > matches what was picked for machine A? Or is it only trying to
> > accommodate different sets of features between A and B, that would be too
> > difficult?
>
> There are two problems
>
> 1) Although it could be done in theory, it's hard, and it would need a
> huge rewrite to qemu's whole migration infrastructure to do this.
> We'd need a way of representing host features, working out which sets
> are compatible with which others depending on what things the guest is
> allowed to use, encoding the information in the migration stream and
> reporting failure. None of this exists now.
>
> Indeed qemu requires that you create the (stopped) machine on the
> destination (including virtual hardware configuration) before even
> attempting to process the incoming migration. It does not for the
> most part transfer the machine configuration in the migration stream.
> Now, that's generally considered a flaw with the design, but fixing it
> is a huge project that no-one's really had the energy to begin despite
> the idea being around for years.
>
> 2) It makes behaviour really hard to predict for management layers
> above. Things like oVirt automatically migrate around a cluster for
> load balancing. At the moment the model which works is basically that
> if you request the same guest features on each end of the
> migration, and qemu starts with that configuration on each end, the
> migration should work (or only fail for transient reasons). If you
> can't know if the migration is possible until you get the incoming
> stream, reporting and exposing what will and won't work to the layer
> above also becomes an immensely fiddly problem.

That was really useful, thanks. One thing I'm worried about is the user
having to know way too much detail about IOMMUs in order to pick a precise
configuration. The Arm SMMUs have a lot of small features that
implementations can mix and match and that a user shouldn't have to care
about, and there are lots of different implementations by various vendors.
I suppose QEMU can offer a couple of configurations with predefined sets
of features, but it seems easy to end up with a config that gets rejected
because it is slightly different than the hardware. Anyway this is a
discussion we can have once we touch on the features in GET_INFO, I don't
have a precise idea at the moment.

Thanks,
Jean

2021-06-19 02:18:07

by Jason Gunthorpe

[permalink] [raw]
Subject: Re: [RFC] /dev/ioasid uAPI proposal

On Fri, Jun 18, 2021 at 07:03:31PM +0200, Jean-Philippe Brucker wrote:

> configuration. The Arm SMMUs have a lot of small features that
> implementations can mix and match and that a user shouldn't have to care
> about, and there are lots of different implementations by various
> vendors.

This is really something to think about carefully in this RFC - I do
have a guess that a 'HW specific' channel is going to be useful here.

If the goal is for qemu to provide all these fiddly things and they
cannot be SW emulated, then direct access to the fiddly HW native
stuff is possibly necessary.

We've kind of seen this mistake in DRM and RDMA
historically. Attempting to generalize too early, or generalize
something that is really a one off. Better for everyone to just keep
it as a one off.

Jason

2021-06-23 07:58:20

by Tian, Kevin

[permalink] [raw]
Subject: RE: [RFC] /dev/ioasid uAPI proposal

> From: Jean-Philippe Brucker
> Sent: Saturday, June 19, 2021 1:04 AM
>
> On Thu, Jun 17, 2021 at 01:00:14PM +1000, David Gibson wrote:
> > On Thu, Jun 10, 2021 at 06:37:31PM +0200, Jean-Philippe Brucker wrote:
> > > On Tue, Jun 08, 2021 at 04:31:50PM +1000, David Gibson wrote:
> > > > For the qemu case, I would imagine a two stage fallback:
> > > >
> > > > 1) Ask for the exact IOMMU capabilities (including pagetable
> > > > format) that the vIOMMU has. If the host can supply, you're
> > > > good
> > > >
> > > > 2) If not, ask for a kernel managed IOAS. Verify that it can map
> > > > all the IOVA ranges the guest vIOMMU needs, and has an equal or
> > > > smaller pagesize than the guest vIOMMU presents. If so,
> > > > software emulate the vIOMMU by shadowing guest io pagetable
> > > > updates into the kernel managed IOAS.
> > > >
> > > > 3) You're out of luck, don't start.
> > > >
> > > > For both (1) and (2) I'd expect it to be asking this question *after*
> > > > saying what devices are attached to the IOAS, based on the virtual
> > > > hardware configuration. That doesn't cover hotplug, of course, for
> > > > that you have to just fail the hotplug if the new device isn't
> > > > supportable with the IOAS you already have.
> > >
> > > Yes. So there is a point in time when the IOAS is frozen, and cannot take
> > > in new incompatible devices. I think that can support the usage I had in
> > > mind. If the VMM (non-QEMU, let's say) wanted to create one IOASID FD
> per
> > > feature set it could bind the first device, freeze the features, then bind
> >
> > Are you thinking of this "freeze the features" as an explicitly
> > triggered action? I have suggested that an explicit "ENABLE" step
> > might be useful, but that hasn't had much traction from what I've
> > seen.
>
> Seems like we do need an explicit enable step for the flow you described
> above:
>
> a) Bind all devices to an ioasid. Each bind succeeds.

let's use consistent terms in this discussion. :)

'bind' the device to an IOMMU fd (container of I/O address spaces).

'attach' the device to an IOASID (representing an I/O address space
within the IOMMU fd)

> b) Ask for a specific set of features for this aggregate of devices. Ask
> for (1), fall back to (2), or abort.
> c) Boot the VM
> d) Hotplug a device, bind it to the ioasid. We're long past negotiating
> features for the ioasid, so the host needs to reject the bind if the
> new device is incompatible with what was requested at (b)
>
> So a successful request at (b) would be the point where we change the
> behavior of bind.

Per Jason's recommendation v2 will move to a new model:

a) Bind all devices to an IOMMU fd:
- The user should provide a 'device_cookie' to mark each bound
device in the following uAPIs.

b) Successful binding allows the user to check the capability/format info per
device_cookie (GET_DEVICE_INFO), before creating any IOASID:
- Sample capability info:
* VFIO type1 map: supported page sizes, permitted IOVA ranges, etc.;
* IOASID nesting: hardware nesting vs. software nesting;
* User-managed page table: vendor specific formats;
* User-managed pasid table: vendor specific formats;
* vPASID: whether it is delegated to the user; if kernel-managed, whether per-RID or global;
* coherency: what the kernel's default policy is, and whether the user is allowed to change it;
* ...
- Actual logistics might be finalized when code is implemented;

c) When creating a new IOASID, the user should specify a format which
is compatible with one or more devices which will be attached to this
IOASID right after.

d) Attaching a device which has incompatible format to this IOASID
is simply rejected. Whether it's hotplugged doesn't matter.

Qemu is expected to query capability/format information for all devices
according to what the specified vIOMMU model requires, then decide
whether to fail vIOMMU creation if there is no strict match or to fall
back to a hybrid model with software emulation to bridge the gap. In any
case, before a new I/O address space is created, Qemu should have a clear
picture of what format is required given a set of to-be-attached devices,
and whether multiple IOASIDs are required if these devices have
incompatible formats.

With this model we don't need a separate 'enable' step.
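
As a rough illustration of this sequence (all ioctl names, numbers and
struct layouts below are hypothetical placeholders for the operations
described above, not the actual uAPI):

#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/ioctl.h>

struct iommu_bind_device   { int32_t device_fd; uint64_t device_cookie; };
struct iommu_device_info   { uint64_t device_cookie; uint64_t format; uint64_t addr_width; };
struct iommu_ioasid_alloc  { uint64_t format; uint32_t out_ioasid; };
struct iommu_attach_device { uint64_t device_cookie; uint32_t ioasid; };

#define IOMMU_BIND_DEVICE     _IOW('I', 0x80, struct iommu_bind_device)
#define IOMMU_GET_DEVICE_INFO _IOWR('I', 0x81, struct iommu_device_info)
#define IOMMU_IOASID_ALLOC    _IOWR('I', 0x82, struct iommu_ioasid_alloc)
#define IOMMU_ATTACH_DEVICE   _IOW('I', 0x83, struct iommu_attach_device)

static int setup_one_device(int iommu_fd, int device_fd, uint64_t cookie)
{
	struct iommu_bind_device bind = { .device_fd = device_fd,
					  .device_cookie = cookie };
	struct iommu_device_info info = { .device_cookie = cookie };
	struct iommu_ioasid_alloc alloc = { 0 };
	struct iommu_attach_device at = { .device_cookie = cookie };

	/* a) bind: the device becomes known to the IOMMU fd under 'cookie' */
	if (ioctl(iommu_fd, IOMMU_BIND_DEVICE, &bind))
		return -1;

	/* b) query capability/format info before creating any IOASID */
	if (ioctl(iommu_fd, IOMMU_GET_DEVICE_INFO, &info))
		return -1;

	/* c) create an IOASID whose format this device supports
	 *    (pretending the reply carries a single format id) */
	alloc.format = info.format;
	if (ioctl(iommu_fd, IOMMU_IOASID_ALLOC, &alloc))
		return -1;

	/* d) attach; an incompatible format is simply rejected here */
	at.ioasid = alloc.out_ioasid;
	return ioctl(iommu_fd, IOMMU_ATTACH_DEVICE, &at);
}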

>
> Since the kernel needs a form of feature check in any case, I still have a
> preference for aborting the bind at (a) if the device isn't exactly
> compatible with other devices already in the ioasid, because it might be
> simpler to implement in the host, but I don't feel strongly about this.

this is covered by d). Actually, with all the format information available,
Qemu should not even attempt to attach an incompatible device in the
first place, though the kernel will still do this simple check under the hood.

Thanks
Kevin

2021-06-23 08:02:04

by Tian, Kevin

[permalink] [raw]
Subject: RE: [RFC] /dev/ioasid uAPI proposal

> From: David Gibson <[email protected]>
> Sent: Thursday, June 17, 2021 11:48 AM
>
> On Tue, Jun 08, 2021 at 10:17:56AM -0300, Jason Gunthorpe wrote:
> > On Tue, Jun 08, 2021 at 12:37:04PM +1000, David Gibson wrote:
> >
> > > > The PPC/SPAPR support allows KVM to associate a vfio group to an
> IOMMU
> > > > page table so that it can handle iotlb programming from pre-registered
> > > > memory without trapping out to userspace.
> > >
> > > To clarify that's a guest side logical vIOMMU page table which is
> > > partially managed by KVM. This is an optimization - things can work
> > > without it, but it means guest iomap/unmap becomes a hot path because
> > > each map/unmap hypercall has to go
> > > guest -> KVM -> qemu -> VFIO
> > >
> > > So there are multiple context transitions.
> >
> > Isn't this overhead true of many of the vIOMMUs?
>
> Yes, but historically it bit much harder on POWER for a couple of reasons:
>
> 1) POWER guests *always* have a vIOMMU - the platform has no concept
> of passthrough mode. We therefore had a vIOMMU implementation some
> time before the AMD or Intel IOMMUs were implemented as vIOMMUs in
> qemu.
>
> 2) At the time we were implementing this the supported IOVA window for
> the paravirtualized IOMMU was pretty small (1G, I think) making
> vIOMMU maps and unmaps a pretty common operation.
>
> > Can the fast path be
> > generalized?
>
> Not really. This is a paravirtualized guest IOMMU, so it's a platform
> specific group of hypercalls that's being interpreted by KVM and
> passed through to the IOMMU side using essentially the same backend
> that the userspace implementation would eventually get to after a
> bunch more context switches.
>

Can virtio-iommu work on PPC? iirc Jean has a plan to implement
a vhost-iommu which is supposed to implement similar in-kernel
acceleration...

Thanks
Kevin

2021-06-23 08:02:33

by Tian, Kevin

[permalink] [raw]
Subject: RE: [RFC] /dev/ioasid uAPI proposal

> From: David Gibson
> Sent: Thursday, June 17, 2021 12:08 PM
>
> On Thu, Jun 03, 2021 at 08:12:27AM +0000, Tian, Kevin wrote:
> > > From: David Gibson <[email protected]>
> > > Sent: Wednesday, June 2, 2021 2:15 PM
> > >
> > [...]
> >
> > > >
> > > > /*
> > > > * Get information about an I/O address space
> > > > *
> > > > * Supported capabilities:
> > > > * - VFIO type1 map/unmap;
> > > > * - pgtable/pasid_table binding
> > > > * - hardware nesting vs. software nesting;
> > > > * - ...
> > > > *
> > > > * Related attributes:
> > > > * - supported page sizes, reserved IOVA ranges (DMA
> mapping);
> > >
> > > Can I request we represent this in terms of permitted IOVA ranges,
> > > rather than reserved IOVA ranges. This works better with the "window"
> > > model I have in mind for unifying the restrictions of the POWER IOMMU
> > > with Type1 like mapping.
> >
> > Can you elaborate how permitted range work better here?
>
> Pretty much just that MAP operations would fail if they don't entirely
> lie within a permitted range. So, for example if your IOMMU only
> implements say, 45 bits of IOVA, then you'd have 0..0x1fffffffffff as
> your only permitted range. If, like the POWER paravirtual IOMMU (in
> default configuration) you have a small (1G) 32-bit range and a large
> (45-bit) 64-bit range at a high address, you'd have say:
> 0x00000000..0x3fffffff (32-bit range)
> and
> 0x800000000000000 .. 0x8001fffffffffff (64-bit range)
> as your permitted ranges.
>
> If your IOMMU supports truly full 64-bit addressing, but has a
> reserved range (for MSIs or whatever) at 0xaaaa0000..0xbbbb0000 then
> you'd have permitted ranges of 0..0xaaa9ffff and
> 0xbbbb0000..0xffffffffffffffff.

I see. I have incorporated this comment in v2.

>
> [snip]
> > > For debugging and certain hypervisor edge cases it might be useful to
> > > have a call to allow userspace to lookup and specific IOVA in a guest
> > > managed pgtable.
> >
> > Since all the mapping metadata is from userspace, why would one
> > rely on the kernel to provide such service? Or are you simply asking
> > for some debugfs node to dump the I/O page table for a given
> > IOASID?
>
> I'm thinking of this as a debugging aid so you can make sure that the
> kernel is interpreting that metadata in the same way that your
> userspace expects it to be interpreted.
>

I'll not include it in this RFC. There is already too much stuff. The
debugging aid can be added later when it's actually required.

Thanks,
Kevin

2021-06-23 08:23:57

by Tian, Kevin

[permalink] [raw]
Subject: RE: [RFC] /dev/ioasid uAPI proposal

> From: Jason Gunthorpe <[email protected]>
> Sent: Saturday, June 19, 2021 2:31 AM
>
> On Fri, Jun 18, 2021 at 07:03:31PM +0200, Jean-Philippe Brucker wrote:
>
> > configuration. The Arm SMMUs have a lot of small features that
> > implementations can mix and match and that a user shouldn't have to care
> > about, and there are lots of different implementations by various
> > vendors.
>
> This is really something to think about carefully in this RFC - I do
> have a guess that a 'HW specific' channel is going to be useful here.

Agree.

>
> If the goal is for qemu to provide all these fiddly things and they
> cannot be SW emulated, then direct access to the fiddly HW native
> stuff is possibly necessary.
>
> We've kind of seen this mistake in DRM and RDMA
> historically. Attempting to generalize too early, or generalize
> something that is really a one off. Better for everyone to just keep
> it as a one off.
>

Yes. Except for some generic info (e.g. addr_width), most format info can
be exposed via a vendor-specific data union region. There is no need
to define those vendor bits explicitly in the uAPI; the kernel IOMMU driver
and the vIOMMU emulation logic know how to interpret them.

Take VT-d for example: all format/cap info is carried in two registers
(cap_reg and ecap_reg), so two u64 fields are sufficient...
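
Roughly along these lines (a sketch only; the field names, union size and
type ids are made up, not the eventual uAPI):

#include <linux/types.h>

struct iommu_device_info_vtd {
	__u64 cap_reg;		/* raw VT-d capability register */
	__u64 ecap_reg;		/* raw VT-d extended capability register */
};

struct iommu_device_info_sketch {
	__u32 argsz;
	__u32 flags;
	__u32 iommu_type;	/* e.g. INTEL_VTD, ARM_SMMU_V3, ... */
	__u32 addr_width;	/* generic info, usable by any userspace */
	union {
		struct iommu_device_info_vtd vtd;
		__u8 raw[64];	/* room for other vendors' blobs */
	} vendor;		/* opaque to generic code; interpreted only by
				 * the matching IOMMU driver and vIOMMU model */
};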

Thanks
Kevin

2021-06-24 04:54:53

by David Gibson

[permalink] [raw]
Subject: Re: [RFC] /dev/ioasid uAPI proposal

On Wed, Jun 23, 2021 at 07:59:21AM +0000, Tian, Kevin wrote:
> > From: David Gibson <[email protected]>
> > Sent: Thursday, June 17, 2021 11:48 AM
> >
> > On Tue, Jun 08, 2021 at 10:17:56AM -0300, Jason Gunthorpe wrote:
> > > On Tue, Jun 08, 2021 at 12:37:04PM +1000, David Gibson wrote:
> > >
> > > > > The PPC/SPAPR support allows KVM to associate a vfio group to an
> > IOMMU
> > > > > page table so that it can handle iotlb programming from pre-registered
> > > > > memory without trapping out to userspace.
> > > >
> > > > To clarify that's a guest side logical vIOMMU page table which is
> > > > partially managed by KVM. This is an optimization - things can work
> > > > without it, but it means guest iomap/unmap becomes a hot path because
> > > > each map/unmap hypercall has to go
> > > > guest -> KVM -> qemu -> VFIO
> > > >
> > > > So there are multiple context transitions.
> > >
> > > Isn't this overhead true of many of the vIOMMUs?
> >
> > Yes, but historically it bit much harder on POWER for a couple of reasons:
> >
> > 1) POWER guests *always* have a vIOMMU - the platform has no concept
> > of passthrough mode. We therefore had a vIOMMU implementation some
> > time before the AMD or Intel IOMMUs were implemented as vIOMMUs in
> > qemu.
> >
> > 2) At the time we were implementing this the supported IOVA window for
> > the paravirtualized IOMMU was pretty small (1G, I think) making
> > vIOMMU maps and unmaps a pretty common operation.
> >
> > > Can the fast path be
> > > generalized?
> >
> > Not really. This is a paravirtualized guest IOMMU, so it's a platform
> > specific group of hypercalls that's being interpreted by KVM and
> > passed through to the IOMMU side using essentially the same backend
> > that the userspace implementation would eventually get to after a
> > bunch more context switches.
> >
>
> Can virtio-iommu work on PPC? iirc Jean has a plan to implement
> a vhost-iommu which is supposed to implement similar in-kernel
> acceleration...

I don't know - I'd have to research virtio-iommu a bunch to determine
that.

Even if we can, the platform IOMMU would still be there (it's a
platform requirement), so we couldn't completely ignore it.

--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson



2021-06-24 04:54:58

by David Gibson

[permalink] [raw]
Subject: Re: [RFC] /dev/ioasid uAPI proposal

On Fri, Jun 18, 2021 at 07:03:31PM +0200, Jean-Philippe Brucker wrote:
> On Thu, Jun 17, 2021 at 01:00:14PM +1000, David Gibson wrote:
> > On Thu, Jun 10, 2021 at 06:37:31PM +0200, Jean-Philippe Brucker wrote:
> > > On Tue, Jun 08, 2021 at 04:31:50PM +1000, David Gibson wrote:
> > > > For the qemu case, I would imagine a two stage fallback:
> > > >
> > > > 1) Ask for the exact IOMMU capabilities (including pagetable
> > > > format) that the vIOMMU has. If the host can supply, you're
> > > > good
> > > >
> > > > 2) If not, ask for a kernel managed IOAS. Verify that it can map
> > > > all the IOVA ranges the guest vIOMMU needs, and has an equal or
> > > > smaller pagesize than the guest vIOMMU presents. If so,
> > > > software emulate the vIOMMU by shadowing guest io pagetable
> > > > updates into the kernel managed IOAS.
> > > >
> > > > 3) You're out of luck, don't start.
> > > >
> > > > For both (1) and (2) I'd expect it to be asking this question *after*
> > > > saying what devices are attached to the IOAS, based on the virtual
> > > > hardware configuration. That doesn't cover hotplug, of course, for
> > > > that you have to just fail the hotplug if the new device isn't
> > > > supportable with the IOAS you already have.
> > >
> > > Yes. So there is a point in time when the IOAS is frozen, and cannot take
> > > in new incompatible devices. I think that can support the usage I had in
> > > mind. If the VMM (non-QEMU, let's say) wanted to create one IOASID FD per
> > > feature set it could bind the first device, freeze the features, then bind
> >
> > Are you thinking of this "freeze the features" as an explicitly
> > triggered action? I have suggested that an explicit "ENABLE" step
> > might be useful, but that hasn't had much traction from what I've
> > seen.
>
> Seems like we do need an explicit enable step for the flow you described
> above:
>
> a) Bind all devices to an ioasid. Each bind succeeds.
> b) Ask for a specific set of features for this aggregate of devices. Ask
> for (1), fall back to (2), or abort.
> c) Boot the VM
> d) Hotplug a device, bind it to the ioasid. We're long past negotiating
> features for the ioasid, so the host needs to reject the bind if the
> new device is incompatible with what was requested at (b)
>
> So a successful request at (b) would be the point where we change the
> behavior of bind.
>
> Since the kernel needs a form of feature check in any case, I still have a
> preference for aborting the bind at (a) if the device isn't exactly
> compatible with other devices already in the ioasid, because it might be
> simpler to implement in the host, but I don't feel strongly about this.
>
>
> > > I'd like to understand better where the difficulty lies, with migration.
> > > Is the problem, once we have a guest running on physical machine A, to
> > > make sure that physical machine B supports the same IOMMU properties
> > > before migrating the VM over to B? Why can't QEMU (instead of the user)
> > > select a feature set on machine A, then when time comes to migrate, query
> > > all information from the host kernel on machine B and check that it
> > > matches what was picked for machine A? Or is it only trying to
> > > accommodate different sets of features between A and B, that would be too
> > > difficult?
> >
> > There are two problems
> >
> > 1) Although it could be done in theory, it's hard, and it would need a
> > huge rewrite to qemu's whole migration infrastructure to do this.
> > We'd need a way of representing host features, working out which sets
> > are compatible with which others depending on what things the guest is
> > allowed to use, encoding the information in the migration stream and
> > reporting failure. None of this exists now.
> >
> > Indeed qemu requires that you create the (stopped) machine on the
> > destination (including virtual hardware configuration) before even
> > attempting to process the incoming migration. It does not for the
> > most part transfer the machine configuration in the migration stream.
> > Now, that's generally considered a flaw with the design, but fixing it
> > is a huge project that no-one's really had the energy to begin despite
> > the idea being around for years.
> >
> > 2) It makes behaviour really hard to predict for management layers
> > above. Things like oVirt automatically migrate around a cluster for
> > load balancing. At the moment the model which works is basically that
> > if you request the same guest features on each end of the
> > migration, and qemu starts with that configuration on each end, the
> > migration should work (or only fail for transient reasons). If you
> > can't know if the migration is possible until you get the incoming
> > stream, reporting and exposing what will and won't work to the layer
> > above also becomes an immensely fiddly problem.
>
> That was really useful, thanks. One thing I'm worried about is the user
> having to know way too much detail about IOMMUs in order to pick a precise
> configuration. The Arm SMMUs have a lot of small features that
> implementations can mix and match and that a user shouldn't have to care
> about, and there are lots of different implementations by various vendors.
> I suppose QEMU can offer a couple of configurations with predefined sets
> of features, but it seems easy to end up with a config that gets rejected
> because it is slightly different than the hardware. Anyway this is a
> discussion we can have once we touch on the features in GET_INFO, I don't
> have a precise idea at the moment.

That's a reasonable concern. Most of this is about selecting good
default modes in the machine type and virtual devices. In general it
would be best to make the defaults for the virtual devices use only
features that are either available on enough current hardware or can
be software emulated without too much trouble. Roughly have the qemu
mode default to the least common denominator IOMMU capabilities. We
can update those defaults with new machine types as new hardware
becomes current and older stuff becomes rare/obsolete. That still
leaves selecting an old machine type or explicitly overriding the
parameters if you need to either a) work on an old host that's missing
capabilities or b) want to take full advantage of a new host.

This can be a pretty complex judgement call of course. There are many
tradeoffs, particularly of performance on new hosts versus
compatibility with old hosts. There can be compelling reasons to
restrict the default model to new(ish) hardware even though it means
quite a lot of people with older hardware will need awkward options
(we have a non IOMMU related version of this problem on POWER; for
security reasons, current machine types default to enabling several
Spectre mitigations - but that means qemu won't start without special
options on hosts that have an old firmware which doesn't support those
mitigations).

--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson



2021-06-24 04:57:57

by David Gibson

[permalink] [raw]
Subject: Re: [RFC] /dev/ioasid uAPI proposal

On Wed, Jun 23, 2021 at 08:00:49AM +0000, Tian, Kevin wrote:
> > From: David Gibson
> > Sent: Thursday, June 17, 2021 12:08 PM
> >
> > On Thu, Jun 03, 2021 at 08:12:27AM +0000, Tian, Kevin wrote:
> > > > From: David Gibson <[email protected]>
> > > > Sent: Wednesday, June 2, 2021 2:15 PM
> > > >
> > > [...]
> > >
> > > > >
> > > > > /*
> > > > > * Get information about an I/O address space
> > > > > *
> > > > > * Supported capabilities:
> > > > > * - VFIO type1 map/unmap;
> > > > > * - pgtable/pasid_table binding
> > > > > * - hardware nesting vs. software nesting;
> > > > > * - ...
> > > > > *
> > > > > * Related attributes:
> > > > > * - supported page sizes, reserved IOVA ranges (DMA
> > mapping);
> > > >
> > > > Can I request we represent this in terms of permitted IOVA ranges,
> > > > rather than reserved IOVA ranges. This works better with the "window"
> > > > model I have in mind for unifying the restrictions of the POWER IOMMU
> > > > with Type1 like mapping.
> > >
> > > Can you elaborate how permitted range work better here?
> >
> > Pretty much just that MAP operations would fail if they don't entirely
> > lie within a permitted range. So, for example if your IOMMU only
> > implements say, 45 bits of IOVA, then you'd have 0..0x1fffffffffff as
> > your only permitted range. If, like the POWER paravirtual IOMMU (in
> > default configuration) you have a small (1G) 32-bit range and a large
> > (45-bit) 64-bit range at a high address, you'd have say:
> > 0x00000000..0x3fffffff (32-bit range)
> > and
> > 0x800000000000000 .. 0x8001fffffffffff (64-bit range)
> > as your permitted ranges.
> >
> > If your IOMMU supports truly full 64-bit addressing, but has a
> > reserved range (for MSIs or whatever) at 0xaaaa0000..0xbbbb0000 then
> > you'd have permitted ranges of 0..0xaaa9ffff and
> > 0xbbbb0000..0xffffffffffffffff.
>
> I see. I have incorporated this comment in v2.
>
> >
> > [snip]
> > > > For debugging and certain hypervisor edge cases it might be useful to
> > > > have a call to allow userspace to lookup and specific IOVA in a guest
> > > > managed pgtable.
> > >
> > > Since all the mapping metadata is from userspace, why would one
> > > rely on the kernel to provide such service? Or are you simply asking
> > > for some debugfs node to dump the I/O page table for a given
> > > IOASID?
> >
> > I'm thinking of this as a debugging aid so you can make sure that the
> > kernel is interpreting that metadata in the same way that your
> > userspace expects it to be interpreted.
> >
>
> I'll not include it in this RFC. There is already too much stuff. The
> debugging aid can be added later when it's actually required.

Fair enough.

--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson



2021-06-30 06:51:02

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [RFC] /dev/ioasid uAPI proposal

On Mon, Jun 07, 2021 at 04:08:02PM -0300, Jason Gunthorpe wrote:
> Compatibility is important, but when I look in the kernel code I see
> very few places that call wbinvd(). Basically all DRM for something
> relavent to qemu.
>
> That tells me that the vast majority of PCI devices do not generate
> no-snoop traffic.

Part of it is that we have no general API for it, because the DRM folks
as usual just started piling up local hacks instead of introducing
a proper API...

2021-06-30 06:57:03

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [RFC] /dev/ioasid uAPI proposal

On Tue, Jun 08, 2021 at 09:20:29AM +0800, Jason Wang wrote:
> "
>
> 6.2.17 _CCA (Cache Coherency Attribute) The _CCA object returns whether or
> not a bus-master device supports hardware managed cache coherency. Expected
> values are 0 to indicate it is not supported, and 1 to indicate that it is
> supported. All other values are reserved.
>
> ...
>
> On Intel platforms, if the _CCA object is not supplied, the OSPM will assume
> the devices are hardware cache coherent.
>
> "

_CCA is mostly used on arm/arm64 platforms to figure out if a device
needs non-coherent DMA handling in the DMA API or not. It is not
related to the NoSnoop TLPs that override the setting for an otherwise
coherent device.

2021-06-30 06:59:09

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [RFC] /dev/ioasid uAPI proposal

On Mon, Jun 07, 2021 at 03:25:32AM +0000, Tian, Kevin wrote:
>
> Possibly just a naming thing, but I feel it's better to just talk about
> no-snoop or non-coherent in the uAPI. Per Intel SDM wbinvd is a
> privileged instruction. A process on the host has no privilege to
> execute it. Only when this process holds a VM, this instruction matters
> as there are guest privilege levels. But having VFIO uAPI (which is
> userspace oriented) to explicitly deal with a CPU instruction which
> makes sense only in a virtualization context sounds a bit weird...

More importantly the Intel instructions here are super weird.
Pretty much every other architecture just has plain old cache
writeback/invalidate/writeback+invalidate instructions without all these
weird implications.

2021-06-30 07:03:47

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [RFC] /dev/ioasid uAPI proposal

On Wed, Jun 09, 2021 at 09:47:42AM -0300, Jason Gunthorpe wrote:
> I can vaguely understand this rationale for vfio, but not at all for
> the platform's iommu driver, sorry.

Agreed. More importantly the dependency is not for the platform iommu
driver but just for the core iommu code, which is always built in if
enabled.

2021-06-30 07:09:17

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [RFC] /dev/ioasid uAPI proposal

On Fri, Jun 04, 2021 at 08:58:05AM -0300, Jason Gunthorpe wrote:
> On Fri, Jun 04, 2021 at 09:11:03AM +0800, Jason Wang wrote:
> > > nor do any virtio drivers implement the required platform specific
> > > cache flushing to make no-snoop TLPs work.
> >
> > I don't get why virtio drivers need to do that. I think the DMA API should hide
> > that arch/platform specific stuff from us.
>
> It is not arch/platform stuff. If the device uses no-snoop then a
> very platform specific recovery is required in the device driver.

Well, the proper way to support NO_SNOOP DMA would be to force the
DMA API into supporting the device as if the platform was not DMA
coherent, probably on a per-call basis. It is just that no one bothered
to actually do the work and people just kept piling hacks on top of hacks.

2021-06-30 07:10:20

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [RFC] /dev/ioasid uAPI proposal

On Mon, Jun 07, 2021 at 11:14:24AM -0300, Jason Gunthorpe wrote:
> "non-coherent DMA" is some general euphemism that evokes images of
> embedded platforms that don't have coherent DMA at all and have low
> cost ways to regain coherence. This is not at all what we are talking
> about here at all.

It literally is the same way of working. And not just low-end embedded
platforms use this, but a lot of older server platforms did as well.

2021-10-27 18:16:31

by Tian, Kevin

[permalink] [raw]
Subject: RE: [RFC] /dev/ioasid uAPI proposal

Hi, Paolo,

> From: Paolo Bonzini <[email protected]>
> Sent: Wednesday, June 9, 2021 11:21 PM
>
> On 09/06/21 16:45, Jason Gunthorpe wrote:
> > On Wed, Jun 09, 2021 at 08:31:34AM -0600, Alex Williamson wrote:
> >
> >> If we go back to the wbinvd ioctl mechanism, if I call that ioctl with
> >> an ioasidfd that contains no devices, then I shouldn't be able to
> >> generate a wbinvd on the processor, right? If I add a device,
> >> especially in a configuration that can generate non-coherent DMA, now
> >> that ioctl should work. If I then remove all devices from that ioasid,
> >> what then is the difference from the initial state. Should the ioctl
> >> now work because it worked once in the past?
> >
> > The ioctl is fine, but telling KVM to enable WBINVD is very similar to
> > open, and then reconfiguring the ioasid_fd is very similar to
> > chmod. From a security perspective revoke is not strictly required,
> > IMHO.
>
> I absolutely do *not* want an API that tells KVM to enable WBINVD. This
> is not up for discussion.
>
> But really, let's stop calling the file descriptor a security proof or a
> capability. It's overkill; all that we are doing here is kernel
> acceleration of the WBINVD ioctl.
>
> As a thought experiment, let's consider what would happen if wbinvd
> caused an unconditional exit from guest to userspace. Userspace would
> react by invoking the ioctl on the ioasid. The proposed functionality
> is just an acceleration of this same thing, avoiding the
> guest->KVM->userspace->IOASID->wbinvd trip.

While the concept here makes sense, in reality implementing a wbinvd
ioctl for userspace requires iommufd (the previous /dev/ioasid, now
renamed to /dev/iommu) to track the dirty CPUs that a given process has
been running on, since wbinvd only flushes the local cache. KVM tracks
dirty CPUs by registering a preempt notifier on the current vCPU. iommufd
may do the same thing for the thread which opens /dev/iommu, but per the
discussion below one open question is whether it's worthwhile adding such
hassle for something which no real user is interested in today except KVM:

https://lore.kernel.org/lkml/[email protected]/
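
For illustration, the tracking would boil down to something like the
sketch below. Nothing here is implemented; the names are invented and it
only mirrors KVM's preempt-notifier trick, to show where the extra
complexity comes from:

#include <linux/cpumask.h>
#include <linux/kernel.h>
#include <linux/preempt.h>
#include <linux/smp.h>
#include <asm/special_insns.h>

struct iommufd_wbinvd_ctx {
	struct preempt_notifier notifier;	/* registered on the opener thread
						 * via preempt_notifier_register()
						 * (setup omitted) */
	cpumask_var_t dirty_cpus;		/* CPUs the opener has run on */
};

static void iommufd_sched_in(struct preempt_notifier *pn, int cpu)
{
	struct iommufd_wbinvd_ctx *ctx =
		container_of(pn, struct iommufd_wbinvd_ctx, notifier);

	/* the opener is now running here, so this CPU's cache may get dirty */
	cpumask_set_cpu(cpu, ctx->dirty_cpus);
}

static void iommufd_do_wbinvd(void *unused)
{
	wbinvd();	/* only flushes the local CPU */
}

/* what a wbinvd ioctl would reduce to, if it were ever added */
static void iommufd_wbinvd(struct iommufd_wbinvd_ctx *ctx)
{
	on_each_cpu_mask(ctx->dirty_cpus, iommufd_do_wbinvd, NULL, 1);
	cpumask_clear(ctx->dirty_cpus);
}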

Is it ok to omit the actual wbinvd ioctl here and just leverage the vfio-kvm
contract to manage whether guest wbinvd is emulated as a no-op? It is
still iommufd which decides whether wbinvd is allowed (based on IOAS
and device attach information), but this sort of special-cases that the
operation can only be done via the kvm path...

btw does the kvm community set a strict criterion that any operation
the guest can do must first be carried in a host uAPI? In concept
KVM deals with the ISA level to cover both guest kernel and guest user,
while host uAPI is only for the host user. Introducing new uAPIs to allow
the host user to do whatever the guest kernel can do sounds ideal, but not
exactly necessary imho.

>
> This is why the API that I want, and that is already exists for VFIO
> group file descriptors, informs KVM of which "ioctls" the guest should
> be able to do via privileged instructions[1]. Then the kernel works out
> with KVM how to ensure a 1:1 correspondence between the operation of the
> ioctls and the privileged operations.
>
> One way to do it would be to always trap WBINVD and invoke the same
> kernel function that implements the ioctl. The function would do either
> a wbinvd or nothing, based on whether the ioasid has any device. The
> next logical step is a notification mechanism that enables WBINVD (by
> disabling the WBINVD intercept) when there are devices in the ioasidfd,
> and disables WBINVD (by enabling a no-op intercept) when there are none.
>
> And in fact once all VFIO devices are gone, wbinvd is for all purposes a
> no-op as far as the guest kernel can tell. So there's no reason to
> treat it as anything but a no-op.
>
> Thanks,
>
> Paolo
>
> [1] As an aside, I must admit I didn't entirely understand the design of
> the KVM-VFIO device back when Alex added it. But with this model it was
> absolutely the right thing to do, and it remains the right thing to do
> even if VFIO groups are replaced with IOASID file descriptors.
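
The "always trap" variant described in the quoted text reduces to
something like this sketch (the ioasidfd helper is invented; today's
kernel wires an equivalent notification through the kvm-vfio device
mentioned in [1]):

#include <linux/kvm_host.h>
#include <asm/smp.h>

/* hypothetical query into the ioasid/iommufd layer */
static bool ioasidfd_has_noncoherent_devices(struct kvm *kvm);

static int handle_guest_wbinvd(struct kvm_vcpu *vcpu)
{
	/*
	 * Same decision a wbinvd ioctl would make: with no device capable
	 * of no-snoop DMA attached, WBINVD is indistinguishable from a
	 * no-op as far as the guest can tell, so skip the expensive flush.
	 */
	if (ioasidfd_has_noncoherent_devices(vcpu->kvm))
		wbinvd_on_all_cpus();

	return kvm_skip_emulated_instruction(vcpu);
}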

2021-10-27 21:24:26

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [RFC] /dev/ioasid uAPI proposal

On 27/10/21 08:18, Tian, Kevin wrote:
>> I absolutely do *not* want an API that tells KVM to enable WBINVD. This
>> is not up for discussion.
>>
>> But really, let's stop calling the file descriptor a security proof or a
>> capability. It's overkill; all that we are doing here is kernel
>> acceleration of the WBINVD ioctl.
>>
>> As a thought experiment, let's consider what would happen if wbinvd
>> caused an unconditional exit from guest to userspace. Userspace would
>> react by invoking the ioctl on the ioasid. The proposed functionality
>> is just an acceleration of this same thing, avoiding the
>> guest->KVM->userspace->IOASID->wbinvd trip.
>
> While the concept here makes sense, in reality implementing a wbinvd
> ioctl for userspace requires iommufd (the previous /dev/ioasid, now
> renamed to /dev/iommu) to track the dirty CPUs that a given process has
> been running on, since wbinvd only flushes the local cache.
>
> Is it ok to omit the actual wbinvd ioctl here and just leverage the vfio-kvm
> contract to manage whether guest wbinvd is emulated as a no-op?

Yes, it'd be okay for me. As I wrote in the message, the concept of a
wbinvd ioctl is mostly important as a thought experiment for what is
security sensitive and what is not. If a wbinvd ioctl would not be
privileged on the iommufd, then WBINVD is not considered privileged in a
guest either.

That does not imply a requirement to implement the wbinvd ioctl, though.
Of course, non-KVM usage of iommufd systems/devices with non-coherent
DMA would be less useful; but that's already the case for VFIO.

> btw does the kvm community set a strict criterion that any operation
> the guest can do must first be carried in a host uAPI? In concept
> KVM deals with the ISA level to cover both guest kernel and guest user,
> while host uAPI is only for the host user. Introducing new uAPIs to allow
> the host user to do whatever the guest kernel can do sounds ideal, but not
> exactly necessary imho.

I agree; however, it's the right mindset in my opinion because
virtualization (in a perfect world) should not be a way to give
processes privilege to do something that they cannot do. If it does,
it's usually a good idea to ask yourself "should this functionality be
accessible outside KVM too?".

Thanks,

Paolo

2021-10-28 01:53:54

by Tian, Kevin

[permalink] [raw]
Subject: RE: [RFC] /dev/ioasid uAPI proposal

> From: Paolo Bonzini <[email protected]>
> Sent: Wednesday, October 27, 2021 6:32 PM
>
> On 27/10/21 08:18, Tian, Kevin wrote:
> >> I absolutely do *not* want an API that tells KVM to enable WBINVD. This
> >> is not up for discussion.
> >>
> >> But really, let's stop calling the file descriptor a security proof or a
> >> capability. It's overkill; all that we are doing here is kernel
> >> acceleration of the WBINVD ioctl.
> >>
> >> As a thought experiment, let's consider what would happen if wbinvd
> >> caused an unconditional exit from guest to userspace. Userspace would
> >> react by invoking the ioctl on the ioasid. The proposed functionality
> >> is just an acceleration of this same thing, avoiding the
> >> guest->KVM->userspace->IOASID->wbinvd trip.
> >
> > While the concept here makes sense, in reality implementing a wbinvd
> > ioctl for userspace requires iommufd (the previous /dev/ioasid, now
> > renamed to /dev/iommu) to track the dirty CPUs that a given process has
> > been running on, since wbinvd only flushes the local cache.
> >
> > Is it ok to omit the actual wbinvd ioctl here and just leverage the vfio-kvm
> > contract to manage whether guest wbinvd is emulated as a no-op?
>
> Yes, it'd be okay for me. As I wrote in the message, the concept of a
> wbinvd ioctl is mostly important as a thought experiment for what is
> security sensitive and what is not. If a wbinvd ioctl would not be
> privileged on the iommufd, then WBINVD is not considered privileged in a
> guest either.
>
> That does not imply a requirement to implement the wbinvd ioctl, though.
> Of course, non-KVM usage of iommufd systems/devices with non-coherent
> DMA would be less useful; but that's already the case for VFIO.

Thanks for confirming it!

>
> > btw does the kvm community set a strict criterion that any operation
> > the guest can do must first be carried in a host uAPI? In concept
> > KVM deals with the ISA level to cover both guest kernel and guest user,
> > while host uAPI is only for the host user. Introducing new uAPIs to allow
> > the host user to do whatever the guest kernel can do sounds ideal, but not
> > exactly necessary imho.
>
> I agree; however, it's the right mindset in my opinion because
> virtualization (in a perfect world) should not be a way to give
> processes privilege to do something that they cannot do. If it does,
> it's usually a good idea to ask yourself "should this functionality be
> accessible outside KVM too?".
>

Agree. It's always good to keep such a mindset in both thought and practice.

Thanks
Kevin