2020-09-28 21:38:10

by Jacob Pan

[permalink] [raw]
Subject: [PATCH v3 00/14] IOASID extensions for guest SVA

IOASID was introduced in v5.5 as a generic kernel allocator service for
both PCIe Process Address Space ID (PASID) and ARM SMMU's Sub Stream
ID. In addition to basic ID allocation, ioasid_set was defined as a
token that is shared by a group of IOASIDs. This set token can be used
for permission checking, but lack of some features to address the
following needs by guest Shared Virtual Address (SVA).
- Manage IOASIDs by group, group ownership, quota, etc.
- State synchronization among IOASID users
- Non-identity guest-host IOASID mapping
- Lifecycle management across many users

This patchset introduces the following extensions as solutions to the
problems above.
- Redefine and extend IOASID set such that IOASIDs can be managed by groups.
- Add notifications for IOASID state synchronization
- Add reference counting for life cycle alignment among users
- Support ioasid_set private IDs, which can be used as guest IOASIDs
Please refer to Documentation/ioasid.rst in enclosed patch 1/9 for more
details.

This patchset only included VT-d driver as users of some of the new APIs.
VFIO and KVM patches are coming up to fully utilize the APIs introduced
here.

You can find this series at:
https://github.com/jacobpan/linux.git ioasid_v3
(VFIO and KVM patches will be available at this branch when published.)

This work is a result of collaboration with many people:
Liu, Yi L <[email protected]>
Wu Hao <[email protected]>
Ashok Raj <[email protected]>
Kevin Tian <[email protected]>

Thanks,

Jacob

Changelog:

V3:
- Use consistent ioasid_set_ prefix for ioasid_set level APIs
- Make SPID and private detach/attach APIs symmetric
- Use the same ioasid_put semantics as Jean-Phillippe IOASID reference patch
- Take away the public ioasid_notify() function, notifications are now emitted
by IOASID core as a result of certain IOASID APIs
- Partition into finer incremental patches
- Miscellaneous cleanup, locking, exception handling fixes based on v2 reviews

V2:
- Redesigned ioasid_set APIs, removed set ID
- Added set private ID (SPID) for guest PASID usage.
- Add per ioasid_set notification and priority support.
- Back to use spinlocks and atomic notifications.
- Added async work in VT-d driver to perform teardown outside atomic context

Jacob Pan (14):
docs: Document IO Address Space ID (IOASID) APIs
iommu/ioasid: Rename ioasid_set_data()
iommu/ioasid: Add a separate function for detach data
iommu/ioasid: Support setting system-wide capacity
iommu/ioasid: Redefine IOASID set and allocation APIs
iommu/ioasid: Introduce API to adjust the quota of an ioasid_set
iommu/ioasid: Add an iterator API for ioasid_set
iommu/ioasid: Add reference couting functions
iommu/ioasid: Introduce ioasid_set private ID
iommu/ioasid: Introduce notification APIs
iommu/ioasid: Support mm type ioasid_set notifications
iommu/vt-d: Remove mm reference for guest SVA
iommu/vt-d: Listen to IOASID notifications
iommu/vt-d: Store guest PASID during bind

Documentation/driver-api/ioasid.rst | 648 ++++++++++++++++++++++++++
drivers/iommu/intel/iommu.c | 29 +-
drivers/iommu/intel/pasid.h | 1 +
drivers/iommu/intel/svm.c | 132 +++++-
drivers/iommu/ioasid.c | 890 ++++++++++++++++++++++++++++++++++--
include/linux/intel-iommu.h | 2 +
include/linux/ioasid.h | 197 +++++++-
7 files changed, 1830 insertions(+), 69 deletions(-)
create mode 100644 Documentation/driver-api/ioasid.rst

--
2.7.4


2020-09-28 21:38:24

by Jacob Pan

[permalink] [raw]
Subject: [PATCH v3 04/14] iommu/ioasid: Support setting system-wide capacity

IOASID is a system-wide resource that could vary on different systems.
The default capacity is 20 bits as defined in the PCI-E specifications.
This patch adds a function to allow adjusting system IOASID capacity.
For VT-d this is set during boot as part of the Intel IOMMU
initialization.

Signed-off-by: Jacob Pan <[email protected]>
---
drivers/iommu/intel/iommu.c | 5 +++++
drivers/iommu/ioasid.c | 20 ++++++++++++++++++++
include/linux/ioasid.h | 11 +++++++++++
3 files changed, 36 insertions(+)

diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index 18ed3b3c70d7..e7bcb299e51e 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -42,6 +42,7 @@
#include <linux/crash_dump.h>
#include <linux/numa.h>
#include <linux/swiotlb.h>
+#include <linux/ioasid.h>
#include <asm/irq_remapping.h>
#include <asm/cacheflush.h>
#include <asm/iommu.h>
@@ -3331,6 +3332,10 @@ static int __init init_dmars(void)
if (ret)
goto free_iommu;

+ /* PASID is needed for scalable mode irrespective to SVM */
+ if (intel_iommu_sm)
+ ioasid_install_capacity(intel_pasid_max_id);
+
/*
* for each drhd
* enable fault log
diff --git a/drivers/iommu/ioasid.c b/drivers/iommu/ioasid.c
index 6cfbdfb492e0..4277cb17e15b 100644
--- a/drivers/iommu/ioasid.c
+++ b/drivers/iommu/ioasid.c
@@ -10,6 +10,10 @@
#include <linux/spinlock.h>
#include <linux/xarray.h>

+/* Default to PCIe standard 20 bit PASID */
+#define PCI_PASID_MAX 0x100000
+static ioasid_t ioasid_capacity = PCI_PASID_MAX;
+static ioasid_t ioasid_capacity_avail = PCI_PASID_MAX;
struct ioasid_data {
ioasid_t id;
struct ioasid_set *set;
@@ -17,6 +21,22 @@ struct ioasid_data {
struct rcu_head rcu;
};

+void ioasid_install_capacity(ioasid_t total)
+{
+ if (ioasid_capacity && ioasid_capacity != PCI_PASID_MAX) {
+ pr_warn("IOASID capacity is already set.\n");
+ return;
+ }
+ ioasid_capacity = ioasid_capacity_avail = total;
+}
+EXPORT_SYMBOL_GPL(ioasid_install_capacity);
+
+ioasid_t ioasid_get_capacity(void)
+{
+ return ioasid_capacity;
+}
+EXPORT_SYMBOL_GPL(ioasid_get_capacity);
+
/*
* struct ioasid_allocator_data - Internal data structure to hold information
* about an allocator. There are two types of allocators:
diff --git a/include/linux/ioasid.h b/include/linux/ioasid.h
index c7f649fa970a..7fc320656be2 100644
--- a/include/linux/ioasid.h
+++ b/include/linux/ioasid.h
@@ -32,6 +32,8 @@ struct ioasid_allocator_ops {
#define DECLARE_IOASID_SET(name) struct ioasid_set name = { 0 }

#if IS_ENABLED(CONFIG_IOASID)
+void ioasid_install_capacity(ioasid_t total);
+ioasid_t ioasid_get_capacity(void);
ioasid_t ioasid_alloc(struct ioasid_set *set, ioasid_t min, ioasid_t max,
void *private);
void ioasid_free(ioasid_t ioasid);
@@ -42,6 +44,15 @@ void ioasid_unregister_allocator(struct ioasid_allocator_ops *allocator);
int ioasid_attach_data(ioasid_t ioasid, void *data);
void ioasid_detach_data(ioasid_t ioasid);
#else /* !CONFIG_IOASID */
+static inline void ioasid_install_capacity(ioasid_t total)
+{
+}
+
+static inline ioasid_t ioasid_get_capacity(void)
+{
+ return 0;
+}
+
static inline ioasid_t ioasid_alloc(struct ioasid_set *set, ioasid_t min,
ioasid_t max, void *private)
{
--
2.7.4

2020-09-28 21:38:31

by Jacob Pan

[permalink] [raw]
Subject: [PATCH v3 01/14] docs: Document IO Address Space ID (IOASID) APIs

IOASID is used to identify address spaces that can be targeted by device
DMA. It is a system-wide resource that is essential to its many users.
This document is an attempt to help developers from all vendors navigate
the APIs. At this time, ARM SMMU and Intel’s Scalable IO Virtualization
(SIOV) enabled platforms are the primary users of IOASID. Examples of
how SIOV components interact with IOASID APIs are provided in that many
APIs are driven by the requirements from SIOV.

Cc: Jonathan Corbet <[email protected]>
Cc: [email protected]
Cc: Randy Dunlap <[email protected]>
Signed-off-by: Liu Yi L <[email protected]>
Signed-off-by: Wu Hao <[email protected]>
Signed-off-by: Jacob Pan <[email protected]>
---
Documentation/driver-api/ioasid.rst | 648 ++++++++++++++++++++++++++++++++++++
1 file changed, 648 insertions(+)
create mode 100644 Documentation/driver-api/ioasid.rst

diff --git a/Documentation/driver-api/ioasid.rst b/Documentation/driver-api/ioasid.rst
new file mode 100644
index 000000000000..7f8e702997ab
--- /dev/null
+++ b/Documentation/driver-api/ioasid.rst
@@ -0,0 +1,648 @@
+.. SPDX-License-Identifier: GPL-2.0
+.. ioasid:
+
+===================
+IO Address Space ID
+===================
+
+IOASID is a generic name for PCIe Process Address ID (PASID) or ARM
+SMMU SubstreamID. An IOASID identifies an address space that DMA
+requests can target.
+
+The primary use cases for IOASID are Shared Virtual Address (SVA) and
+multiple IOVA spaces per device. However, the requirements for IOASID
+management can vary among hardware architectures.
+
+For baremetal IOVA, IOASID #0 is used for DMA request without
+PASID. Even though some architectures such as VT-d also offers
+the flexibility of using any PASIDs for DMA request without PASID.
+PASID #0 is reserved and not allocated from any ioasid_set.
+
+Multiple IOVA spaces per device are mapped to auxiliary domains which
+can be used for mediated device assignment with and without a virtual
+IOMMU (vIOMMU). An IOASID is allocated for each auxiliary domain as default
+PASID. Without vIOMMU, default IOASID is used for DMA map/unmap
+APIs. With vIOMMU, default IOASID is used for guest IOVA where DMA
+request with PASID is required for the device. The reason is that
+there is only one PASID #0 per device, e.g. VT-d, RID_PASID is per PCI
+device.
+
+This document covers the generic features supported by IOASID
+APIs. Vendor-specific use cases are also illustrated with Intel's VT-d
+based platforms as the first example.
+
+.. contents:: :local:
+
+Glossary
+========
+PASID - Process Address Space ID
+
+IOASID - IO Address Space ID (generic term for PCIe PASID and
+SubstreamID in SMMU)
+
+SVA/SVM - Shared Virtual Addressing/Memory
+
+ENQCMD - Intel X86 ISA for efficient workqueue submission [1]
+!!!TODO: Link to Spec at the bottom
+
+DSA - Intel Data Streaming Accelerator [2]
+
+VDCM - Virtual Device Composition Module [3]
+
+SIOV - Intel Scalable IO Virtualization
+
+
+Key Concepts
+============
+
+IOASID Set
+-----------
+An IOASID set is a group of IOASIDs allocated from the system-wide
+IOASID pool. Refer to IOASID set APIs for more details.
+
+IOASID set is particularly useful for guest SVA where each guest could
+have its own IOASID set for security and efficiency reasons.
+
+IOASID Set Private ID (SPID)
+----------------------------
+Each IOASID set has a private namespace of SPIDs. An SPID maps to a
+single system-wide IOASID. Conversely, each IOASID may be associated
+with an alias ID, local to the IOASID set, named SPID.
+SPIDs can be used as guest IOASIDs where each guest could do
+IOASID allocation from its own pool and map them to host physical
+IOASIDs. SPIDs are particularly useful for supporting live migration
+where decoupling guest and host physical resources are necessary.
+
+For example, two VMs can both allocate guest PASID/SPID #101 but map to
+different host PASIDs #201 and #202 respectively as shown in the
+diagram below.
+::
+
+ .------------------. .------------------.
+ | VM 1 | | VM 2 |
+ | | | |
+ |------------------| |------------------|
+ | GPASID/SPID 101 | | GPASID/SPID 101 |
+ '------------------' -------------------' Guest
+ __________|______________________|____________________
+ | | Host
+ v v
+ .------------------. .------------------.
+ | Host IOASID 201 | | Host IOASID 202 |
+ '------------------' '------------------'
+ | IOASID set 1 | | IOASID set 2 |
+ '------------------' '------------------'
+
+Guest PASID is treated as IOASID set private ID (SPID) within an
+IOASID set, mappings between guest and host IOASIDs are stored in the
+set for inquiry.
+
+IOASID APIs
+===========
+To get the IOASID APIs, users must #include <linux/ioasid.h>. These APIs
+serve the following functionalities:
+
+ - IOASID allocation/Free
+ - Group management in the form of ioasid_set
+ - Private data storage and lookup
+ - Reference counting
+ - Event notification in case of a state change
+
+IOASID Set Level APIs
+--------------------------
+For use cases such as guest SVA it is necessary to manage IOASIDs at
+ioasid_set level. For example, VMs may allocate multiple IOASIDs for
+guest process address sharing (vSVA). It is imperative to enforce
+VM-IOASID ownership such that a malicious guest cannot target DMA
+traffic outside its own IOASIDs, or free an active IOASID that belongs
+to another VM.
+
+The IOASID set APIs serve the following purposes:
+
+ - Ownership/permission enforcement
+ - Take collective actions, e.g. free an entire set
+ - Event notifications within a set
+ - Look up a set based on token
+ - Quota enforcement
+
+Each IOASID set is created with a token, which can be one of the
+following token types:
+
+ - IOASID_SET_TYPE_NULL (Arbitrary u64 value)
+ - IOASID_SET_TYPE_MM (Set token is a mm_struct)
+
+The explicit MM token type is useful when multiple users of an IOASID
+set under the same process need to communicate about their shared IOASIDs.
+E.g. An IOASID set created by VFIO for one guest can be associated
+with the KVM instance for the same guest since they share a common mm_struct.
+A token must be unique within its type.
+
+::
+
+ struct ioasid_set *ioasid_alloc_set(void *token, ioasid_t quota, u32 type)
+
+ int ioasid_adjust_set(struct ioasid_set *set, int quota);
+
+ void ioasid_set_get(struct ioasid_set *set)
+
+ void ioasid_set_put(struct ioasid_set *set)
+
+ void ioasid_set_get_locked(struct ioasid_set *set)
+
+ void ioasid_set_put_locked(struct ioasid_set *set)
+
+ int ioasid_set_for_each_ioasid(struct ioasid_set *set,
+ void (*fn)(ioasid_t id, void *data),
+ void *data)
+
+
+Individual IOASID APIs
+----------------------
+Once an ioasid_set is created, IOASIDs can be allocated from the set.
+Within the IOASID set namespace, set private ID (SPID) is supported. In
+the VM use case, SPID can be used for storing guest PASID.
+
+::
+
+ ioasid_t ioasid_alloc(struct ioasid_set *set, ioasid_t min, ioasid_t max,
+ void *private);
+
+ int ioasid_get(struct ioasid_set *set, ioasid_t ioasid);
+
+ void ioasid_put(struct ioasid_set *set, ioasid_t ioasid);
+
+ int ioasid_get_locked(struct ioasid_set *set, ioasid_t ioasid);
+
+ void ioasid_put_locked(struct ioasid_set *set, ioasid_t ioasid);
+
+ void *ioasid_find(struct ioasid_set *set, ioasid_t ioasid,
+ bool (*getter)(void *));
+
+ ioasid_t ioasid_find_by_spid(struct ioasid_set *set, ioasid_t spid)
+
+ int ioasid_attach_data(struct ioasid_set *set, ioasid_t ioasid,
+ void *data);
+ int ioasid_attach_spid(struct ioasid_set *set, ioasid_t ioasid,
+ ioasid_t spid);
+
+
+Notifications
+-------------
+An IOASID may have multiple users, each user may have hardware context
+associated with an IOASID. When the status of an IOASID changes,
+e.g. an IOASID is being freed, users need to be notified such that the
+associated hardware context can be cleared, flushed, and drained.
+
+::
+
+ int ioasid_register_notifier(struct ioasid_set *set, struct
+ notifier_block *nb)
+
+ void ioasid_unregister_notifier(struct ioasid_set *set,
+ struct notifier_block *nb)
+
+ int ioasid_register_notifier_mm(struct mm_struct *mm, struct
+ notifier_block *nb)
+
+ void ioasid_unregister_notifier_mm(struct mm_struct *mm, struct
+ notifier_block *nb)
+
+ int ioasid_notify(ioasid_t ioasid, enum ioasid_notify_val cmd,
+ unsigned int flags)
+
+"_mm" flavor of the ioasid_register_notifier() APIs are used when
+an IOASID user need to listen to the IOASID events belong to a
+process but without the knowledge of the associated ioasid_set.
+
+Events
+~~~~~~
+Notification events are pertinent to individual IOASIDs, they can be
+one of the following:
+
+ - ALLOC
+ - FREE
+ - BIND
+ - UNBIND
+
+Ordering
+~~~~~~~~
+Ordering is supported by IOASID notification priorities as the
+following (in ascending order):
+
+::
+
+ enum ioasid_notifier_prios {
+ IOASID_PRIO_LAST,
+ IOASID_PRIO_IOMMU,
+ IOASID_PRIO_DEVICE,
+ IOASID_PRIO_CPU,
+ };
+
+When registered, notifiers are assigned a priority that affect the
+call order. For example, notifiers with CPU priority get called before
+notifiers with device priority and so on.
+
+The typical use case is when an IOASID is freed due to an exception, DMA
+source should be quiesced before tearing down other hardware contexts
+in the system. This will reduce the churn in handling faults. DMA work
+submission is performed by the CPU which is granted higher priority than
+devices.
+
+Level Sensitivity
+~~~~~~~~~~~~~~~~~
+For each IOASID state transition, IOASID core ensures that there is
+only one notification sent. This resembles level triggered interrupt
+where a single interrupt is raised during a state transition.
+For example, if ioasid_free() is called twice by a user before the
+IOASID is reclaimed, IOASID core will only send out a single
+IOASID_NOTIFY_FREE event. Similarly, for IOASID_NOTIFY_BIND/UNBIND
+events, which is only sent out once when a SPID is attached/detached.
+
+IOASID notifications cannot be sent directly by IOASID users, they are
+sent out by the IOASID core as a by-product of the following APIs:
+- ioasid_alloc/free(), emit IOASID_NOTIFY_ALLOC/FREE
+- ioasid_attach/detach_spid() emit IOASID_NOTIFY_BIND/UNBIND
+
+Scopes
+~~~~~~
+There are two types of notifiers in IOASID core: system-wide and
+ioasid_set-wide.
+
+System-wide notifier is catering for users that need to handle all the
+IOASIDs in the system. E.g. The IOMMU driver.
+
+Per ioasid_set notifier can be used by VM specific components such as
+KVM. After all, each KVM instance only cares about IOASIDs within its
+own set/guest.
+
+
+Atomicity
+~~~~~~~~~
+IOASID notifiers are atomic due to spinlocks used inside the IOASID
+core. For tasks that cannot be completed in the notifier handler,
+async work can be submitted to complete the work later as long as
+there is no ordering requirement.
+
+Reference counting
+------------------
+IOASID lifecycle management is based on reference counting. Users of
+IOASID who intend to align lifecycle with the IOASID need to hold
+a reference of the IOASID. The IOASID will not be returned to the pool
+for allocation until all references are dropped. Calling ioasid_free()
+will mark the IOASID as FREE_PENDING if the IOASID has outstanding
+references. No new references can be taken by ioasid_get() once an
+IOASID is in the FREE_PENDING state. ioasid_free() can be called
+multiple times without an error until all refs are dropped.
+
+ioasid_put() decrements and tests refcount of the IOASID. If refcount
+is 0, ioasid will be freed. Deleted from the system-wide xarray as
+well as per set xarray. The IOASID will be returned to the pool and
+available for new allocations.
+
+Event notifications are used to inform users of IOASID status change.
+IOASID_FREE event prompts users to drop their references after
+clearing its context.
+
+For example, on VT-d platform when an IOASID is freed, teardown
+actions are performed on KVM, device driver (VDCM), and the IOMMU
+driver. To quiesce VCPU for work submission, KVM notifier handler must
+be called before VDCM handler. Therefore, KVM and VDCM shall use
+notification priority IOASID_PRIO_CPU andIOASID_PRIO_DEVICE
+respectively.
+
+For both KVM and VDCM, notifier blocks shall be registered on the
+IOASID set such that *only* events from the matching VM are received.
+
+If KVM attempts to register a notifier block before the IOASID set is
+created using the MM token, the notifier block will be placed on a
+pending list inside IOASID core. Once the token matching IOASID set
+is created, IOASID will register the notifier block automatically.
+IOASID core does not replay events for the existing IOASIDs in the
+set. For IOASID set of MM type, notification blocks can be registered
+on empty sets only. This is to avoid lost events.
+
+IOMMU driver shall register notifier block on global chain::
+
+ static struct notifier_block pasid_nb_vtd = {
+ .notifier_call = pasid_status_change_vtd,
+ .priority = IOASID_PRIO_IOMMU,
+ };
+
+Custom allocator APIs
+---------------------
+
+::
+
+ int ioasid_register_allocator(struct ioasid_allocator_ops *allocator);
+
+ void ioasid_unregister_allocator(struct ioasid_allocator_ops *allocator);
+
+Allocator Choices
+~~~~~~~~~~~~~~~~~
+IOASIDs are allocated for both host and guest SVA/IOVA usage. However,
+allocators can be different. For example, on VT-d guest PASID
+allocation must be performed via a virtual command interface which is
+emulated by VMM.
+
+IOASID core has the notion of "custom allocator" such that guest can
+register virtual command allocator that precedes the default one.
+
+Namespaces
+~~~~~~~~~~
+IOASIDs are limited system resources that default to 20 bits in
+size. Each device can have its own PASID table for security reasons.
+Theoretically the namespace can be per device also.
+
+However IOASID namespace is system-wide for two reasons:
+- Simplicity
+- Sharing resources of a single device to multiple VMs.
+
+Take VT-d as an example, VT-d supports shared workqueue and ENQCMD[1]
+where one IOASID could be used to submit work on multiple devices that
+are shared with other VMs. This requires IOASID to be
+system-wide. This is also the reason why guests must use an
+emulated virtual command interface to allocate IOASID from the host.
+
+Life cycle
+==========
+This section covers IOASID life cycle management for both bare-metal
+and guest usages. In bare-metal SVA, MMU notifier is directly hooked
+up with IOMMU driver, therefore the IOASID life cycle matches the
+process address space (MM) life cycle.
+
+However, guest MMU notifier is not available to host IOMMU driver,
+when guest MM terminates unexpectedly, the events have to go through
+VFIO and IOMMU UAPI to reach host IOMMU driver. There are also more
+parties involved in guest SVA, e.g. on Intel VT-d platform, IOASIDs
+are used by IOMMU driver, KVM, VDCM, and VFIO.
+
+Native IOASID Life Cycle (VT-d Example)
+---------------------------------------
+
+The normal flow of native SVA code with Intel Data Streaming
+Accelerator(DSA) [2] as example:
+
+1. Host user opens accelerator FD, e.g. DSA driver, or uacce;
+2. DSA driver allocate WQ, do sva_bind_device();
+3. IOMMU driver calls ioasid_alloc(), then bind PASID with device,
+ mmu_notifier_get()
+4. DMA starts by DSA driver userspace
+5. DSA userspace close FD
+6. DSA/uacce kernel driver handles FD.close()
+7. DSA driver stops DMA
+8. DSA driver calls sva_unbind_device();
+9. IOMMU driver does unbind, clears PASID context in IOMMU, flush
+ TLBs. mmu_notifier_put() called.
+10. mmu_notifier.release() called, IOMMU SVA code calls ioasid_free()*
+11. The IOASID is returned to the pool, reclaimed.
+
+::
+
+ * With ENQCMD, PASID used on VT-d is not released in mmu_notifier() but
+ mmdrop(). mmdrop comes after FD close which does not make a difference.
+
+During the normal teardown, the following three key steps would happen in
+order: (7), (9), (11).
+
+Exception happens when process terminates *before* device driver stops
+DMA and call IOMMU driver to unbind. The detailed flow of process
+exits are as follows:
+
+::
+
+ do_exit() {
+ exit_mm() {
+ mm_put();
+ exit_mmap() {
+ intel_invalidate_range() //mmu notifier
+ tlb_finish_mmu()
+ mmu_notifier_release(mm) {
+ intel_mm_release() {
+ (9) intel_iommu_teardown_pasid();
+ intel_iommu_flush_tlbs();
+ }
+ // tlb_invalidate_range cb removed
+ }
+ unmap_vmas();
+ free_pgtables(); // IOMMU cannot walk PGT after this
+ };
+ }
+ exit_files(tsk) {
+ close_files() {
+ dsa_close();
+ (7) dsa_stop_dma();
+ intel_svm_unbind_pasid(); //nothing to do
+ }
+ }
+ }
+
+ mmdrop() /* some random time later, lazy mm user */ {
+ mm_free_pgd();
+ destroy_context(mm); {
+ (11) ioasid_free();
+ }
+ }
+
+As shown in the list above, step #2 could happen before
+#1. Unrecoverable(UR) faults could happen between #2 and #1.
+Fault processing is disabled by the IOMMU driver in #2, therefore the UR
+fault never reaches the driver.
+
+
+Also notice that TLB invalidation occurs at mmu_notifier
+invalidate_range callback as well as the release callback. The reason
+is that release callback will delete IOMMU driver from the notifier
+chain which may skip invalidate_range() calls during the exit path.
+
+To avoid unnecessary reporting of UR fault, IOMMU driver shall disable
+fault reporting after free and before unbind.
+
+Guest IOASID Life Cycle (VT-d Example)
+--------------------------------------
+Guest IOASID life cycle starts with guest driver open(), this could be
+uacce or individual accelerator driver such as DSA. At FD open,
+sva_bind_device() is called which triggers a series of actions.
+
+The example below is an illustration of *normal* operations that
+involves *all* the SW components in VT-d. The flow can be simpler if
+no ENQCMD is supported.
+
+::
+
+ VFIO IOMMU KVM VDCM IOASID Ref
+ ..................................................................
+ 1 ioasid_register_notifier/_mm()
+ 2 ioasid_alloc() 1
+ 3 bind_gpasid()
+ 4 iommu_bind()->ioasid_get() 2
+ 5 ioasid_notify(BIND)
+ 6 -> ioasid_get() 3
+ 7 -> vmcs_update_atomic()
+ 8 mdev_write(gpasid)
+ 9 hpasid=
+ 10 find_by_spid(gpasid) 4
+ 11 vdev_write(hpasid)
+ 12 -------- GUEST STARTS DMA --------------------------
+ 13 -------- GUEST STOPS DMA --------------------------
+ 14 mdev_clear(gpasid)
+ 15 vdev_clear(hpasid)
+ 16 ioasid_put() 3
+ 17 unbind_gpasid()
+ 18 iommu_ubind()
+ 19 ioasid_notify(UNBIND)
+ 20 -> vmcs_update_atomic()
+ 21 -> ioasid_put() 2
+ 22 ioasid_free() 1
+ 23 ioasid_put() 0
+ 24 Reclaimed
+ -------------- New Life Cycle Begin ----------------------------
+ 1 ioasid_alloc() -> 1
+
+ Note: IOASID Notification Events: FREE, BIND, UNBIND
+
+Exception cases arise when a guest crashes or a malicious guest
+attempts to cause disruption on the host system. The fault handling
+rules are:
+
+1. IOASID free must *always* succeed.
+2. An inactive period may be required before the freed IOASID is
+ reclaimed. During this period, consumers of IOASID perform cleanup.
+3. Malfunction is limited to the guest owned resources for all
+ programming errors.
+
+The primary source of exception is when the following are out of
+order:
+
+1. Start/Stop of DMA activity
+ (Guest device driver, mdev via VFIO)
+2. Setup/Teardown of IOMMU PASID context, IOTLB, DevTLB flushes
+ (Host IOMMU driver bind/unbind)
+3. Setup/Teardown of VMCS PASID translation table entries (KVM) in
+ case of ENQCMD
+4. Programming/Clearing host PASID in VDCM (Host VDCM driver)
+5. IOASID alloc/free (Host IOASID)
+
+VFIO is the *only* user-kernel interface, which is ultimately
+responsible for exception handling.
+
+#1 is processed the same way as the assigned device today based on
+device file descriptors and events. There is no special handling.
+
+#3 is based on bind/unbind events emitted by #2.
+
+#4 is naturally aligned with IOASID life cycle in that an illegal
+guest PASID programming would fail in obtaining reference of the
+matching host IOASID.
+
+#5 is similar to #4. The fault will be reported to the user if PASID
+used in the ENQCMD is not set up in VMCS PASID translation table.
+
+Therefore, the remaining out of order problem is between #2 and
+#5. I.e. unbind vs. free. More specifically, free before unbind.
+
+IOASID notifier and refcounting are used to ensure order. Following
+a publisher-subscriber pattern where:
+
+- Publishers: VFIO & IOMMU
+- Subscribers: KVM, VDCM, IOMMU
+
+IOASID reference must be acquired before receiving the FREE event. The
+reference must be dropped at the end of the processing in order to
+return the IOASID to the pool.
+
+Let's examine the IOASID life cycle again when free happens *before*
+unbind. This could be a result of misbehaving guests or crash. Assuming
+VFIO cannot enforce unbind->free order. Notice that the setup part up
+until step #12 is identical to the normal case, the flow below starts
+with step 13.
+
+::
+
+ VFIO IOMMU KVM VDCM IOASID Ref
+ ..................................................................
+ 13 -------- GUEST STARTS DMA --------------------------
+ 14 -------- *GUEST MISBEHAVES!!!* ----------------
+ 15 ioasid_free()
+ 16 ioasid_notify(FREE)
+ 17 mark_free_pending (1)
+ 18 kvm_nb_handler(FREE)
+ 19 vmcs_update_atomic()
+ 20 ioasid_put_locked() -> 3
+ 21 vdcm_nb_handler(FREE)
+ 22 iomm_nb_handler(FREE)
+ 23 ioasid_free() returns(2) schedule_work() 2
+ 24 schedule_work() vdev_clear_wk(hpasid)
+ 25 teardown_pasid_wk()
+ 26 ioasid_put() -> 1
+ 27 ioasid_put() 0
+ 28 Reclaimed
+ 29 unbind_gpasid()
+ 30 iommu_unbind()->ioasid_find() Fails(3)
+ -------------- New Life Cycle Begin ----------------------------
+
+Note:
+
+1. By marking IOASID FREE_PENDING at step #17, no new references can be
+ held. ioasid_get/find() will return -ENOENT;
+2. After step #23, all events can go out of order. Shall not affect
+ the outcome.
+3. IOMMU driver fails to find private data for unbinding. If unbind is
+ called after the same IOASID is allocated for the same guest again,
+ this is a programming error. The damage is limited to the guest
+ itself since unbind performs permission checking based on the
+ IOASID set associated with the guest process.
+4. Workqueues are used by VDCM and IOMMU driver for processing that
+ requires thread context.
+
+
+KVM PASID Translation Table Updates
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+Per VM PASID translation table is maintained by KVM in order to
+support ENQCMD in the guest. The table contains host-guest PASID
+translations to be consumed by CPU ucode. The synchronization of the
+PASID states depends on VFIO/IOMMU driver, where IOCTL and atomic
+notifiers are used. KVM must register IOASID notifier per VM instance
+during launch time. The following events are handled:
+
+1. BIND/UNBIND
+2. FREE
+
+Rules:
+
+1. Multiple devices can bind with the same PASID, this can be different PCI
+ devices or mdevs within the same PCI device. However, only the
+ *first* BIND and *last* UNBIND emit notifications.
+2. IOASID code is responsible for ensuring the correctness of H-G
+ PASID mapping. There is no need for KVM to validate the
+ notification data.
+3. When UNBIND happens *after* FREE, KVM will see error in
+ ioasid_get() even when the reclaim is not done. IOMMU driver will
+ also avoid sending UNBIND if the PASID is already FREE.
+4. When KVM terminates *before* FREE & UNBIND, references will be
+ dropped for all host PASIDs.
+
+VDCM PASID Programming
+~~~~~~~~~~~~~~~~~~~~~~
+VDCM composes virtual devices and exposes them to the guests. When
+the guest allocates a PASID then program it to the virtual device, VDCM
+intercepts the programming attempt then programs the matching host
+PASID on to the hardware.
+Conversely, when a device is going away, VDCM must be informed such
+that PASID context on the hardware can be cleared. There could be
+multiple mdevs assigned to different guests in the same VDCM. Since
+the PASID table is shared at PCI device level, lazy clearing is not
+secure. A malicious guest can attack by using newly freed PASIDs that
+are allocated by another guest.
+
+By holding a reference of the PASID until VDCM cleans up the HW context,
+it is guaranteed that PASID life cycles do not cross within the same
+device.
+
+
+Reference
+====================================================
+1. https://software.intel.com/sites/default/files/managed/c5/15/architecture-instruction-set-extensions-programming-reference.pdf
+
+2. https://01.org/blogs/2019/introducing-intel-data-streaming-accelerator
+
+3. https://software.intel.com/en-us/download/intel-data-streaming-accelerator-preliminary-architecture-specification
--
2.7.4

2020-09-28 21:38:35

by Jacob Pan

[permalink] [raw]
Subject: [PATCH v3 03/14] iommu/ioasid: Add a separate function for detach data

IOASID private data can be cleared by ioasid_attach_data() with a NULL
data pointer. A common use case is for a caller to free the data
afterward. ioasid_attach_data() calls synchronize_rcu() before return
such that free data can be sure without outstanding readers.
However, since synchronize_rcu() may sleep, ioasid_attach_data() cannot
be used under spinlocks.

This patch adds ioasid_detach_data() as a separate API where
synchronize_rcu() is called only in this case. ioasid_attach_data() can
then be used under spinlocks. In addition, this change makes the API
symmetrical.

Signed-off-by: Jacob Pan <[email protected]>
---
drivers/iommu/intel/svm.c | 4 ++--
drivers/iommu/ioasid.c | 54 ++++++++++++++++++++++++++++++++++++++---------
include/linux/ioasid.h | 5 ++++-
3 files changed, 50 insertions(+), 13 deletions(-)

diff --git a/drivers/iommu/intel/svm.c b/drivers/iommu/intel/svm.c
index 2c5645f0737a..06a16bee7b65 100644
--- a/drivers/iommu/intel/svm.c
+++ b/drivers/iommu/intel/svm.c
@@ -398,7 +398,7 @@ int intel_svm_bind_gpasid(struct iommu_domain *domain, struct device *dev,
list_add_rcu(&sdev->list, &svm->devs);
out:
if (!IS_ERR_OR_NULL(svm) && list_empty(&svm->devs)) {
- ioasid_attach_data(data->hpasid, NULL);
+ ioasid_detach_data(data->hpasid);
kfree(svm);
}

@@ -441,7 +441,7 @@ int intel_svm_unbind_gpasid(struct device *dev, int pasid)
* the unbind, IOMMU driver will get notified
* and perform cleanup.
*/
- ioasid_attach_data(pasid, NULL);
+ ioasid_detach_data(pasid);
kfree(svm);
}
}
diff --git a/drivers/iommu/ioasid.c b/drivers/iommu/ioasid.c
index 5f63af07acd5..6cfbdfb492e0 100644
--- a/drivers/iommu/ioasid.c
+++ b/drivers/iommu/ioasid.c
@@ -272,24 +272,58 @@ int ioasid_attach_data(ioasid_t ioasid, void *data)

spin_lock(&ioasid_allocator_lock);
ioasid_data = xa_load(&active_allocator->xa, ioasid);
- if (ioasid_data)
- rcu_assign_pointer(ioasid_data->private, data);
- else
+
+ if (!ioasid_data) {
ret = -ENOENT;
- spin_unlock(&ioasid_allocator_lock);
+ goto done_unlock;
+ }

- /*
- * Wait for readers to stop accessing the old private data, so the
- * caller can free it.
- */
- if (!ret)
- synchronize_rcu();
+ if (ioasid_data->private) {
+ ret = -EBUSY;
+ goto done_unlock;
+ }
+ rcu_assign_pointer(ioasid_data->private, data);
+
+done_unlock:
+ spin_unlock(&ioasid_allocator_lock);

return ret;
}
EXPORT_SYMBOL_GPL(ioasid_attach_data);

/**
+ * ioasid_detach_data - Clear the private data of an ioasid
+ *
+ * @ioasid: the IOASIDD to clear private data
+ */
+void ioasid_detach_data(ioasid_t ioasid)
+{
+ struct ioasid_data *ioasid_data;
+
+ spin_lock(&ioasid_allocator_lock);
+ ioasid_data = xa_load(&active_allocator->xa, ioasid);
+
+ if (!ioasid_data) {
+ pr_warn("IOASID %u not found to detach data from\n", ioasid);
+ goto done_unlock;
+ }
+
+ if (ioasid_data->private) {
+ rcu_assign_pointer(ioasid_data->private, NULL);
+ goto done_unlock;
+ }
+
+done_unlock:
+ spin_unlock(&ioasid_allocator_lock);
+ /*
+ * Wait for readers to stop accessing the old private data,
+ * so the caller can free it.
+ */
+ synchronize_rcu();
+}
+EXPORT_SYMBOL_GPL(ioasid_detach_data);
+
+/**
* ioasid_alloc - Allocate an IOASID
* @set: the IOASID set
* @min: the minimum ID (inclusive)
diff --git a/include/linux/ioasid.h b/include/linux/ioasid.h
index 9c44947a68c8..c7f649fa970a 100644
--- a/include/linux/ioasid.h
+++ b/include/linux/ioasid.h
@@ -40,7 +40,7 @@ void *ioasid_find(struct ioasid_set *set, ioasid_t ioasid,
int ioasid_register_allocator(struct ioasid_allocator_ops *allocator);
void ioasid_unregister_allocator(struct ioasid_allocator_ops *allocator);
int ioasid_attach_data(ioasid_t ioasid, void *data);
-
+void ioasid_detach_data(ioasid_t ioasid);
#else /* !CONFIG_IOASID */
static inline ioasid_t ioasid_alloc(struct ioasid_set *set, ioasid_t min,
ioasid_t max, void *private)
@@ -72,5 +72,8 @@ static inline int ioasid_attach_data(ioasid_t ioasid, void *data)
return -ENOTSUPP;
}

+static inline void ioasid_detach_data(ioasid_t ioasid)
+{
+}
#endif /* CONFIG_IOASID */
#endif /* __LINUX_IOASID_H */
--
2.7.4

2020-09-28 21:38:37

by Jacob Pan

[permalink] [raw]
Subject: [PATCH v3 02/14] iommu/ioasid: Rename ioasid_set_data()

Rename ioasid_set_data() to ioasid_attach_data() to avoid confusion with
struct ioasid_set. ioasid_set is a group of IOASIDs that share a common
token.

Reviewed-by: Jean-Philippe Brucker <[email protected]>
Signed-off-by: Jacob Pan <[email protected]>
---
drivers/iommu/intel/svm.c | 6 +++---
drivers/iommu/ioasid.c | 6 +++---
include/linux/ioasid.h | 4 ++--
3 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/drivers/iommu/intel/svm.c b/drivers/iommu/intel/svm.c
index 0cb9a15f1112..2c5645f0737a 100644
--- a/drivers/iommu/intel/svm.c
+++ b/drivers/iommu/intel/svm.c
@@ -346,7 +346,7 @@ int intel_svm_bind_gpasid(struct iommu_domain *domain, struct device *dev,
svm->gpasid = data->gpasid;
svm->flags |= SVM_FLAG_GUEST_PASID;
}
- ioasid_set_data(data->hpasid, svm);
+ ioasid_attach_data(data->hpasid, svm);
INIT_LIST_HEAD_RCU(&svm->devs);
mmput(svm->mm);
}
@@ -398,7 +398,7 @@ int intel_svm_bind_gpasid(struct iommu_domain *domain, struct device *dev,
list_add_rcu(&sdev->list, &svm->devs);
out:
if (!IS_ERR_OR_NULL(svm) && list_empty(&svm->devs)) {
- ioasid_set_data(data->hpasid, NULL);
+ ioasid_attach_data(data->hpasid, NULL);
kfree(svm);
}

@@ -441,7 +441,7 @@ int intel_svm_unbind_gpasid(struct device *dev, int pasid)
* the unbind, IOMMU driver will get notified
* and perform cleanup.
*/
- ioasid_set_data(pasid, NULL);
+ ioasid_attach_data(pasid, NULL);
kfree(svm);
}
}
diff --git a/drivers/iommu/ioasid.c b/drivers/iommu/ioasid.c
index 0f8dd377aada..5f63af07acd5 100644
--- a/drivers/iommu/ioasid.c
+++ b/drivers/iommu/ioasid.c
@@ -258,14 +258,14 @@ void ioasid_unregister_allocator(struct ioasid_allocator_ops *ops)
EXPORT_SYMBOL_GPL(ioasid_unregister_allocator);

/**
- * ioasid_set_data - Set private data for an allocated ioasid
+ * ioasid_attach_data - Set private data for an allocated ioasid
* @ioasid: the ID to set data
* @data: the private data
*
* For IOASID that is already allocated, private data can be set
* via this API. Future lookup can be done via ioasid_find.
*/
-int ioasid_set_data(ioasid_t ioasid, void *data)
+int ioasid_attach_data(ioasid_t ioasid, void *data)
{
struct ioasid_data *ioasid_data;
int ret = 0;
@@ -287,7 +287,7 @@ int ioasid_set_data(ioasid_t ioasid, void *data)

return ret;
}
-EXPORT_SYMBOL_GPL(ioasid_set_data);
+EXPORT_SYMBOL_GPL(ioasid_attach_data);

/**
* ioasid_alloc - Allocate an IOASID
diff --git a/include/linux/ioasid.h b/include/linux/ioasid.h
index 6f000d7a0ddc..9c44947a68c8 100644
--- a/include/linux/ioasid.h
+++ b/include/linux/ioasid.h
@@ -39,7 +39,7 @@ void *ioasid_find(struct ioasid_set *set, ioasid_t ioasid,
bool (*getter)(void *));
int ioasid_register_allocator(struct ioasid_allocator_ops *allocator);
void ioasid_unregister_allocator(struct ioasid_allocator_ops *allocator);
-int ioasid_set_data(ioasid_t ioasid, void *data);
+int ioasid_attach_data(ioasid_t ioasid, void *data);

#else /* !CONFIG_IOASID */
static inline ioasid_t ioasid_alloc(struct ioasid_set *set, ioasid_t min,
@@ -67,7 +67,7 @@ static inline void ioasid_unregister_allocator(struct ioasid_allocator_ops *allo
{
}

-static inline int ioasid_set_data(ioasid_t ioasid, void *data)
+static inline int ioasid_attach_data(ioasid_t ioasid, void *data)
{
return -ENOTSUPP;
}
--
2.7.4

2020-09-28 21:38:45

by Jacob Pan

[permalink] [raw]
Subject: [PATCH v3 05/14] iommu/ioasid: Redefine IOASID set and allocation APIs

ioasid_set was introduced as an arbitrary token that is shared by a
group of IOASIDs. For example, two IOASIDs allocated via the same
ioasid_set pointer belong to the same set.

For guest SVA usages, system-wide IOASID resources need to be
partitioned such that each VM can have its own quota and being managed
separately. ioasid_set is the perfect candidate for meeting such
requirements. This patch redefines and extends ioasid_set with the
following new fields:
- Quota
- Reference count
- Storage of its namespace
- The token is now stored in the ioasid_set with types

Basic ioasid_set level APIs are introduced that wire up these new data.
Existing users of IOASID APIs are converted where a host IOASID set is
allocated for bare-metal usage.

Signed-off-by: Liu Yi L <[email protected]>
Signed-off-by: Jacob Pan <[email protected]>
---
drivers/iommu/intel/iommu.c | 26 +++--
drivers/iommu/intel/pasid.h | 1 +
drivers/iommu/intel/svm.c | 25 ++--
drivers/iommu/ioasid.c | 277 ++++++++++++++++++++++++++++++++++++++++----
include/linux/ioasid.h | 53 +++++++--
5 files changed, 333 insertions(+), 49 deletions(-)

diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index e7bcb299e51e..872391890323 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -104,6 +104,9 @@
*/
#define INTEL_IOMMU_PGSIZES (~0xFFFUL)

+/* PASIDs used by host SVM */
+struct ioasid_set *host_pasid_set;
+
static inline int agaw_to_level(int agaw)
{
return agaw + 2;
@@ -3147,8 +3150,8 @@ static void intel_vcmd_ioasid_free(ioasid_t ioasid, void *data)
* Sanity check the ioasid owner is done at upper layer, e.g. VFIO
* We can only free the PASID when all the devices are unbound.
*/
- if (ioasid_find(NULL, ioasid, NULL)) {
- pr_alert("Cannot free active IOASID %d\n", ioasid);
+ if (IS_ERR(ioasid_find(host_pasid_set, ioasid, NULL))) {
+ pr_err("IOASID %d to be freed but not in system set\n", ioasid);
return;
}
vcmd_free_pasid(iommu, ioasid);
@@ -3333,8 +3336,17 @@ static int __init init_dmars(void)
goto free_iommu;

/* PASID is needed for scalable mode irrespective to SVM */
- if (intel_iommu_sm)
+ if (intel_iommu_sm) {
ioasid_install_capacity(intel_pasid_max_id);
+ /* We should not run out of IOASIDs at boot */
+ host_pasid_set = ioasid_set_alloc(NULL, PID_MAX_DEFAULT,
+ IOASID_SET_TYPE_NULL);
+ if (IS_ERR_OR_NULL(host_pasid_set)) {
+ pr_err("Failed to allocate host PASID set %lu\n",
+ PTR_ERR(host_pasid_set));
+ intel_iommu_sm = 0;
+ }
+ }

/*
* for each drhd
@@ -3381,7 +3393,7 @@ static int __init init_dmars(void)
disable_dmar_iommu(iommu);
free_dmar_iommu(iommu);
}
-
+ ioasid_set_put(host_pasid_set);
kfree(g_iommus);

error:
@@ -5163,7 +5175,7 @@ static void auxiliary_unlink_device(struct dmar_domain *domain,
domain->auxd_refcnt--;

if (!domain->auxd_refcnt && domain->default_pasid > 0)
- ioasid_free(domain->default_pasid);
+ ioasid_free(host_pasid_set, domain->default_pasid);
}

static int aux_domain_add_dev(struct dmar_domain *domain,
@@ -5181,7 +5193,7 @@ static int aux_domain_add_dev(struct dmar_domain *domain,
int pasid;

/* No private data needed for the default pasid */
- pasid = ioasid_alloc(NULL, PASID_MIN,
+ pasid = ioasid_alloc(host_pasid_set, PASID_MIN,
pci_max_pasids(to_pci_dev(dev)) - 1,
NULL);
if (pasid == INVALID_IOASID) {
@@ -5224,7 +5236,7 @@ static int aux_domain_add_dev(struct dmar_domain *domain,
spin_unlock(&iommu->lock);
spin_unlock_irqrestore(&device_domain_lock, flags);
if (!domain->auxd_refcnt && domain->default_pasid > 0)
- ioasid_free(domain->default_pasid);
+ ioasid_free(host_pasid_set, domain->default_pasid);

return ret;
}
diff --git a/drivers/iommu/intel/pasid.h b/drivers/iommu/intel/pasid.h
index c9850766c3a9..ccdc23446015 100644
--- a/drivers/iommu/intel/pasid.h
+++ b/drivers/iommu/intel/pasid.h
@@ -99,6 +99,7 @@ static inline bool pasid_pte_is_present(struct pasid_entry *pte)
}

extern u32 intel_pasid_max_id;
+extern struct ioasid_set *host_pasid_set;
int intel_pasid_alloc_id(void *ptr, int start, int end, gfp_t gfp);
void intel_pasid_free_id(int pasid);
void *intel_pasid_lookup_id(int pasid);
diff --git a/drivers/iommu/intel/svm.c b/drivers/iommu/intel/svm.c
index 06a16bee7b65..2e764e283469 100644
--- a/drivers/iommu/intel/svm.c
+++ b/drivers/iommu/intel/svm.c
@@ -228,7 +228,9 @@ static LIST_HEAD(global_svm_list);
list_for_each_entry((sdev), &(svm)->devs, list) \
if ((d) != (sdev)->dev) {} else

-static int pasid_to_svm_sdev(struct device *dev, unsigned int pasid,
+static int pasid_to_svm_sdev(struct device *dev,
+ struct ioasid_set *set,
+ unsigned int pasid,
struct intel_svm **rsvm,
struct intel_svm_dev **rsdev)
{
@@ -242,7 +244,7 @@ static int pasid_to_svm_sdev(struct device *dev, unsigned int pasid,
if (pasid == INVALID_IOASID || pasid >= PASID_MAX)
return -EINVAL;

- svm = ioasid_find(NULL, pasid, NULL);
+ svm = ioasid_find(set, pasid, NULL);
if (IS_ERR(svm))
return PTR_ERR(svm);

@@ -312,7 +314,8 @@ int intel_svm_bind_gpasid(struct iommu_domain *domain, struct device *dev,
dmar_domain = to_dmar_domain(domain);

mutex_lock(&pasid_mutex);
- ret = pasid_to_svm_sdev(dev, data->hpasid, &svm, &sdev);
+ ret = pasid_to_svm_sdev(dev, NULL,
+ data->hpasid, &svm, &sdev);
if (ret)
goto out;

@@ -417,7 +420,7 @@ int intel_svm_unbind_gpasid(struct device *dev, int pasid)
return -EINVAL;

mutex_lock(&pasid_mutex);
- ret = pasid_to_svm_sdev(dev, pasid, &svm, &sdev);
+ ret = pasid_to_svm_sdev(dev, NULL, pasid, &svm, &sdev);
if (ret)
goto out;

@@ -555,7 +558,7 @@ intel_svm_bind_mm(struct device *dev, int flags, struct svm_dev_ops *ops,
pasid_max = intel_pasid_max_id;

/* Do not use PASID 0, reserved for RID to PASID */
- svm->pasid = ioasid_alloc(NULL, PASID_MIN,
+ svm->pasid = ioasid_alloc(host_pasid_set, PASID_MIN,
pasid_max - 1, svm);
if (svm->pasid == INVALID_IOASID) {
kfree(svm);
@@ -572,7 +575,7 @@ intel_svm_bind_mm(struct device *dev, int flags, struct svm_dev_ops *ops,
if (mm) {
ret = mmu_notifier_register(&svm->notifier, mm);
if (ret) {
- ioasid_free(svm->pasid);
+ ioasid_free(host_pasid_set, svm->pasid);
kfree(svm);
kfree(sdev);
goto out;
@@ -590,7 +593,7 @@ intel_svm_bind_mm(struct device *dev, int flags, struct svm_dev_ops *ops,
if (ret) {
if (mm)
mmu_notifier_unregister(&svm->notifier, mm);
- ioasid_free(svm->pasid);
+ ioasid_free(host_pasid_set, svm->pasid);
kfree(svm);
kfree(sdev);
goto out;
@@ -638,7 +641,8 @@ static int intel_svm_unbind_mm(struct device *dev, int pasid)
if (!iommu)
goto out;

- ret = pasid_to_svm_sdev(dev, pasid, &svm, &sdev);
+ ret = pasid_to_svm_sdev(dev, host_pasid_set,
+ pasid, &svm, &sdev);
if (ret)
goto out;

@@ -659,7 +663,7 @@ static int intel_svm_unbind_mm(struct device *dev, int pasid)
kfree_rcu(sdev, rcu);

if (list_empty(&svm->devs)) {
- ioasid_free(svm->pasid);
+ ioasid_free(host_pasid_set, svm->pasid);
if (svm->mm)
mmu_notifier_unregister(&svm->notifier, svm->mm);
list_del(&svm->list);
@@ -1130,7 +1134,8 @@ int intel_svm_page_response(struct device *dev,
goto out;
}

- ret = pasid_to_svm_sdev(dev, prm->pasid, &svm, &sdev);
+ ret = pasid_to_svm_sdev(dev, host_pasid_set,
+ prm->pasid, &svm, &sdev);
if (ret || !sdev) {
ret = -ENODEV;
goto out;
diff --git a/drivers/iommu/ioasid.c b/drivers/iommu/ioasid.c
index 4277cb17e15b..61e25c2375ab 100644
--- a/drivers/iommu/ioasid.c
+++ b/drivers/iommu/ioasid.c
@@ -1,8 +1,8 @@
// SPDX-License-Identifier: GPL-2.0
/*
* I/O Address Space ID allocator. There is one global IOASID space, split into
- * subsets. Users create a subset with DECLARE_IOASID_SET, then allocate and
- * free IOASIDs with ioasid_alloc and ioasid_free.
+ * sets. Users create a set with ioasid_set_alloc, then allocate/free IDs
+ * with ioasid_alloc and ioasid_free.
*/
#include <linux/ioasid.h>
#include <linux/module.h>
@@ -14,6 +14,8 @@
#define PCI_PASID_MAX 0x100000
static ioasid_t ioasid_capacity = PCI_PASID_MAX;
static ioasid_t ioasid_capacity_avail = PCI_PASID_MAX;
+static DEFINE_XARRAY_ALLOC(ioasid_sets);
+
struct ioasid_data {
ioasid_t id;
struct ioasid_set *set;
@@ -343,6 +345,114 @@ void ioasid_detach_data(ioasid_t ioasid)
}
EXPORT_SYMBOL_GPL(ioasid_detach_data);

+static inline bool ioasid_set_is_valid(struct ioasid_set *set)
+{
+ return xa_load(&ioasid_sets, set->id) == set;
+}
+
+/**
+ * ioasid_set_alloc - Allocate a new IOASID set for a given token
+ *
+ * @token: An optional arbitrary number that can be associated with the
+ * IOASID set. @token can be NULL if the type is
+ * IOASID_SET_TYPE_NULL
+ * @quota: Quota allowed in this set
+ * @type: The type of the token used to create the IOASID set
+ *
+ * IOASID is limited system-wide resource that requires quota management.
+ * Token will be stored in the ioasid_set returned. A reference will be taken
+ * on the newly created set. Subsequent IOASID allocation within the set need
+ * to use the retured ioasid_set pointer.
+ */
+struct ioasid_set *ioasid_set_alloc(void *token, ioasid_t quota, int type)
+{
+ struct ioasid_set *set;
+ unsigned long index;
+ ioasid_t id;
+
+ if (type >= IOASID_SET_TYPE_NR)
+ return ERR_PTR(-EINVAL);
+
+ /* REVISIT: may support set w/o quota, use system available */
+ if (!quota)
+ return ERR_PTR(-EINVAL);
+
+ /*
+ * Need to check space available if we share system-wide quota.
+ * TODO: we may need to support quota free sets in the future.
+ */
+ spin_lock(&ioasid_allocator_lock);
+ if (quota > ioasid_capacity_avail) {
+ pr_warn("Out of IOASID capacity! ask %d, avail %d\n",
+ quota, ioasid_capacity_avail);
+ set = ERR_PTR(-ENOSPC);
+ goto exit_unlock;
+ }
+
+ /*
+ * Token is only unique within its types but right now we have only
+ * mm type. If we have more token types, we have to match type as well.
+ */
+ switch (type) {
+ case IOASID_SET_TYPE_MM:
+ if (!token) {
+ set = ERR_PTR(-EINVAL);
+ goto exit_unlock;
+ }
+ /* Search existing set tokens, reject duplicates */
+ xa_for_each(&ioasid_sets, index, set) {
+ if (set->token == token &&
+ set->type == IOASID_SET_TYPE_MM) {
+ set = ERR_PTR(-EEXIST);
+ goto exit_unlock;
+ }
+ }
+ break;
+ case IOASID_SET_TYPE_NULL:
+ if (!token)
+ break;
+ fallthrough;
+ default:
+ pr_err("Invalid token and IOASID type\n");
+ set = ERR_PTR(-EINVAL);
+ goto exit_unlock;
+ }
+
+ set = kzalloc(sizeof(*set), GFP_ATOMIC);
+ if (!set) {
+ set = ERR_PTR(-ENOMEM);
+ goto exit_unlock;
+ }
+
+ if (xa_alloc(&ioasid_sets, &id, set,
+ XA_LIMIT(0, ioasid_capacity_avail - quota),
+ GFP_ATOMIC)) {
+ kfree(set);
+ set = ERR_PTR(-ENOSPC);
+ goto exit_unlock;
+ }
+
+ set->token = token;
+ set->type = type;
+ set->quota = quota;
+ set->id = id;
+ refcount_set(&set->ref, 1);
+
+ /*
+ * Per set XA is used to store private IDs within the set, get ready
+ * for ioasid_set private ID and system-wide IOASID allocation
+ * results.
+ */
+ xa_init(&set->xa);
+ ioasid_capacity_avail -= quota;
+
+exit_unlock:
+ spin_unlock(&ioasid_allocator_lock);
+
+ return set;
+}
+EXPORT_SYMBOL_GPL(ioasid_set_alloc);
+
/**
* ioasid_alloc - Allocate an IOASID
* @set: the IOASID set
@@ -360,11 +470,22 @@ ioasid_t ioasid_alloc(struct ioasid_set *set, ioasid_t min, ioasid_t max,
{
struct ioasid_data *data;
void *adata;
- ioasid_t id;
+ ioasid_t id = INVALID_IOASID;
+
+ spin_lock(&ioasid_allocator_lock);
+ /* Check if the IOASID set has been allocated and initialized */
+ if (WARN_ON(!ioasid_set_is_valid(set)))
+ goto done_unlock;
+
+ if (set->quota <= set->nr_ioasids) {
+ pr_err_ratelimited("IOASID set out of quota %d\n",
+ set->quota);
+ goto done_unlock;
+ }

data = kzalloc(sizeof(*data), GFP_ATOMIC);
if (!data)
- return INVALID_IOASID;
+ goto done_unlock;

data->set = set;
data->private = private;
@@ -373,11 +494,10 @@ ioasid_t ioasid_alloc(struct ioasid_set *set, ioasid_t min, ioasid_t max,
* Custom allocator needs allocator data to perform platform specific
* operations.
*/
- spin_lock(&ioasid_allocator_lock);
adata = active_allocator->flags & IOASID_ALLOCATOR_CUSTOM ? active_allocator->ops->pdata : data;
id = active_allocator->ops->alloc(min, max, adata);
if (id == INVALID_IOASID) {
- pr_err("Failed ASID allocation %lu\n", active_allocator->flags);
+ pr_err_ratelimited("Failed IOASID allocation %lu\n", active_allocator->flags);
goto exit_free;
}

@@ -386,46 +506,153 @@ ioasid_t ioasid_alloc(struct ioasid_set *set, ioasid_t min, ioasid_t max,
/* Custom allocator needs framework to store and track allocation results */
pr_err("Failed to alloc ioasid from %d\n", id);
active_allocator->ops->free(id, active_allocator->ops->pdata);
+ if (active_allocator->flags & IOASID_ALLOCATOR_CUSTOM)
+ xa_erase(&active_allocator->xa, id);
goto exit_free;
}
data->id = id;

- spin_unlock(&ioasid_allocator_lock);
- return id;
+ /* Store IOASID in the per set data */
+ if (xa_err(xa_store(&set->xa, id, data, GFP_ATOMIC))) {
+ pr_err_ratelimited("Failed to store ioasid %d in set\n", id);
+ active_allocator->ops->free(id, active_allocator->ops->pdata);
+ goto exit_free;
+ }
+ set->nr_ioasids++;
+ goto done_unlock;
exit_free:
- spin_unlock(&ioasid_allocator_lock);
kfree(data);
- return INVALID_IOASID;
+done_unlock:
+ spin_unlock(&ioasid_allocator_lock);
+ return id;
}
EXPORT_SYMBOL_GPL(ioasid_alloc);

-/**
- * ioasid_free - Free an IOASID
- * @ioasid: the ID to remove
- */
-void ioasid_free(ioasid_t ioasid)
+static void ioasid_do_free_locked(struct ioasid_data *data)
{
struct ioasid_data *ioasid_data;

- spin_lock(&ioasid_allocator_lock);
- ioasid_data = xa_load(&active_allocator->xa, ioasid);
- if (!ioasid_data) {
- pr_err("Trying to free unknown IOASID %u\n", ioasid);
- goto exit_unlock;
- }
-
- active_allocator->ops->free(ioasid, active_allocator->ops->pdata);
+ active_allocator->ops->free(data->id, active_allocator->ops->pdata);
/* Custom allocator needs additional steps to free the xa element */
if (active_allocator->flags & IOASID_ALLOCATOR_CUSTOM) {
- ioasid_data = xa_erase(&active_allocator->xa, ioasid);
+ ioasid_data = xa_erase(&active_allocator->xa, data->id);
kfree_rcu(ioasid_data, rcu);
}

-exit_unlock:
+ data->set->nr_ioasids--;
+ xa_erase(&data->set->xa, data->id);
+}
+
+static void ioasid_free_locked(struct ioasid_set *set, ioasid_t ioasid)
+{
+ struct ioasid_data *data;
+
+ data = xa_load(&active_allocator->xa, ioasid);
+ if (!data) {
+ pr_err("Trying to free unknown IOASID %u\n", ioasid);
+ return;
+ }
+ if (data->set != set) {
+ pr_warn("Cannot free IOASID %u due to set ownership\n", ioasid);
+ return;
+ }
+ /* Check if the set exists */
+ if (WARN_ON(!xa_load(&ioasid_sets, data->set->id)))
+ return;
+
+ ioasid_do_free_locked(data);
+}
+
+/**
+ * ioasid_free - Drop reference on an IOASID. Free if refcount drops to 0,
+ * including free from its set and system-wide list.
+ * @set: The ioasid_set to check permission with. If not NULL, IOASID
+ * free will fail if the set does not match.
+ * @ioasid: The IOASID to remove
+ */
+void ioasid_free(struct ioasid_set *set, ioasid_t ioasid)
+{
+ spin_lock(&ioasid_allocator_lock);
+ ioasid_free_locked(set, ioasid);
spin_unlock(&ioasid_allocator_lock);
}
EXPORT_SYMBOL_GPL(ioasid_free);

+static void ioasid_set_get_locked(struct ioasid_set *set)
+{
+ if (WARN_ON(!ioasid_set_is_valid(set)))
+ return;
+
+ refcount_inc(&set->ref);
+}
+
+void ioasid_set_get(struct ioasid_set *set)
+{
+ spin_lock(&ioasid_allocator_lock);
+ ioasid_set_get_locked(set);
+ spin_unlock(&ioasid_allocator_lock);
+}
+EXPORT_SYMBOL_GPL(ioasid_set_get);
+
+static void ioasid_set_put_locked(struct ioasid_set *set)
+{
+ struct ioasid_data *entry;
+ unsigned long index;
+
+ if (WARN_ON(!ioasid_set_is_valid(set)))
+ return;
+
+ if (!refcount_dec_and_test(&set->ref))
+ return;
+
+ /* The set is already empty, we just destroy the set. */
+ if (xa_empty(&set->xa))
+ goto done_destroy;
+
+ /*
+ * Free all PASIDs from system-wide IOASID pool, all subscribers gets
+ * notified and do cleanup of their own.
+ * Note that some references of the IOASIDs within the set can still
+ * be held after the free call. This is OK in that the IOASIDs will be
+ * marked inactive, the only operations can be done is ioasid_put.
+ * No need to track IOASID set states since there is no reclaim phase.
+ */
+ xa_for_each(&set->xa, index, entry) {
+ ioasid_free_locked(set, index);
+ /* Free from per set private pool */
+ xa_erase(&set->xa, index);
+ }
+
+done_destroy:
+ /* Return the quota back to system pool */
+ ioasid_capacity_avail += set->quota;
+
+ /*
+ * Token got released right away after the ioasid_set is freed.
+ * If a new set is created immediately with the newly released token,
+ * it will not allocate the same IOASIDs unless they are reclaimed.
+ */
+ xa_erase(&ioasid_sets, set->id);
+ kfree_rcu(set, rcu);
+}
+
+/**
+ * ioasid_set_put - Drop a reference to the IOASID set. Free all IOASIDs within
+ * the set if there are no more users.
+ *
+ * @set: The IOASID set ID to be freed
+ *
+ * If refcount drops to zero, all IOASIDs allocated within the set will be
+ * freed.
+ */
+void ioasid_set_put(struct ioasid_set *set)
+{
+ spin_lock(&ioasid_allocator_lock);
+ ioasid_set_put_locked(set);
+ spin_unlock(&ioasid_allocator_lock);
+}
+EXPORT_SYMBOL_GPL(ioasid_set_put);
+
/**
* ioasid_find - Find IOASID data
* @set: the IOASID set
diff --git a/include/linux/ioasid.h b/include/linux/ioasid.h
index 7fc320656be2..1ae213b660f0 100644
--- a/include/linux/ioasid.h
+++ b/include/linux/ioasid.h
@@ -4,14 +4,43 @@

#include <linux/types.h>
#include <linux/errno.h>
+#include <linux/xarray.h>
+#include <linux/refcount.h>

#define INVALID_IOASID ((ioasid_t)-1)
typedef unsigned int ioasid_t;
typedef ioasid_t (*ioasid_alloc_fn_t)(ioasid_t min, ioasid_t max, void *data);
typedef void (*ioasid_free_fn_t)(ioasid_t ioasid, void *data);

+/* IOASID set types */
+enum ioasid_set_type {
+ IOASID_SET_TYPE_NULL = 1, /* Set token is NULL */
+ IOASID_SET_TYPE_MM, /* Set token is a mm_struct pointer
+ * i.e. associated with a process
+ */
+ IOASID_SET_TYPE_NR,
+};
+
+/**
+ * struct ioasid_set - Meta data about ioasid_set
+ * @xa: XArray to store ioasid_set private IDs, can be used for
+ * guest-host IOASID mapping, or just a private IOASID namespace.
+ * @token: Unique to identify an IOASID set
+ * @type: Token types
+ * @quota: Max number of IOASIDs can be allocated within the set
+ * @nr_ioasids: Number of IOASIDs currently allocated in the set
+ * @id: ID of the set
+ * @ref: Reference count of the users
+ */
struct ioasid_set {
- int dummy;
+ struct xarray xa;
+ void *token;
+ int type;
+ int quota;
+ int nr_ioasids;
+ int id;
+ refcount_t ref;
+ struct rcu_head rcu;
};

/**
@@ -29,14 +58,16 @@ struct ioasid_allocator_ops {
void *pdata;
};

-#define DECLARE_IOASID_SET(name) struct ioasid_set name = { 0 }
-
#if IS_ENABLED(CONFIG_IOASID)
void ioasid_install_capacity(ioasid_t total);
ioasid_t ioasid_get_capacity(void);
+struct ioasid_set *ioasid_set_alloc(void *token, ioasid_t quota, int type);
+void ioasid_set_get(struct ioasid_set *set);
+void ioasid_set_put(struct ioasid_set *set);
+
ioasid_t ioasid_alloc(struct ioasid_set *set, ioasid_t min, ioasid_t max,
void *private);
-void ioasid_free(ioasid_t ioasid);
+void ioasid_free(struct ioasid_set *set, ioasid_t ioasid);
void *ioasid_find(struct ioasid_set *set, ioasid_t ioasid,
bool (*getter)(void *));
int ioasid_register_allocator(struct ioasid_allocator_ops *allocator);
@@ -59,12 +90,20 @@ static inline ioasid_t ioasid_alloc(struct ioasid_set *set, ioasid_t min,
return INVALID_IOASID;
}

-static inline void ioasid_free(ioasid_t ioasid)
+static inline void ioasid_free(struct ioasid_set *set, ioasid_t ioasid)
+{
+}
+
+static inline struct ioasid_set *ioasid_set_alloc(void *token, ioasid_t quota, int type)
+{
+ return ERR_PTR(-ENOTSUPP);
+}
+
+static inline void ioasid_set_put(struct ioasid_set *set)
{
}

-static inline void *ioasid_find(struct ioasid_set *set, ioasid_t ioasid,
- bool (*getter)(void *))
+static inline void *ioasid_find(struct ioasid_set *set, ioasid_t ioasid, bool (*getter)(void *))
{
return NULL;
}
--
2.7.4

2020-09-28 21:38:49

by Jacob Pan

[permalink] [raw]
Subject: [PATCH v3 09/14] iommu/ioasid: Introduce ioasid_set private ID

When an IOASID set is used for guest SVA, each VM will acquire its
ioasid_set for IOASID allocations. IOASIDs within the VM must have a
host/physical IOASID backing, mapping between guest and host IOASIDs can
be non-identical. IOASID set private ID (SPID) is introduced in this
patch to be used as guest IOASID. However, the concept of ioasid_set
specific namespace is generic, thus named SPID.

As SPID namespace is within the IOASID set, the IOASID core can provide
lookup services at both directions. SPIDs may not be available when its
IOASID is allocated, the mapping between SPID and IOASID is usually
established when a guest page table is bound to a host PASID.

Signed-off-by: Jacob Pan <[email protected]>
---
drivers/iommu/ioasid.c | 102 +++++++++++++++++++++++++++++++++++++++++++++++++
include/linux/ioasid.h | 19 +++++++++
2 files changed, 121 insertions(+)

diff --git a/drivers/iommu/ioasid.c b/drivers/iommu/ioasid.c
index 828cc44b1b1c..378fef8f23d9 100644
--- a/drivers/iommu/ioasid.c
+++ b/drivers/iommu/ioasid.c
@@ -26,6 +26,7 @@ enum ioasid_state {
* struct ioasid_data - Meta data about ioasid
*
* @id: Unique ID
+ * @spid: Private ID unique within a set
* @users: Number of active users
* @state: Track state of the IOASID
* @set: ioasid_set of the IOASID belongs to
@@ -34,6 +35,7 @@ enum ioasid_state {
*/
struct ioasid_data {
ioasid_t id;
+ ioasid_t spid;
refcount_t users;
enum ioasid_state state;
struct ioasid_set *set;
@@ -363,6 +365,105 @@ void ioasid_detach_data(ioasid_t ioasid)
}
EXPORT_SYMBOL_GPL(ioasid_detach_data);

+static ioasid_t ioasid_find_by_spid_locked(struct ioasid_set *set, ioasid_t spid)
+{
+ ioasid_t ioasid = INVALID_IOASID;
+ struct ioasid_data *entry;
+ unsigned long index;
+
+ if (!xa_load(&ioasid_sets, set->id)) {
+ pr_warn("Invalid set\n");
+ goto done;
+ }
+
+ xa_for_each(&set->xa, index, entry) {
+ if (spid == entry->spid) {
+ refcount_inc(&entry->users);
+ ioasid = index;
+ }
+ }
+done:
+ return ioasid;
+}
+
+/**
+ * ioasid_attach_spid - Attach ioasid_set private ID to an IOASID
+ *
+ * @ioasid: the system-wide IOASID to attach
+ * @spid: the ioasid_set private ID of @ioasid
+ *
+ * After attching SPID, future lookup can be done via ioasid_find_by_spid().
+ */
+int ioasid_attach_spid(ioasid_t ioasid, ioasid_t spid)
+{
+ struct ioasid_data *data;
+ int ret = 0;
+
+ if (spid == INVALID_IOASID)
+ return -EINVAL;
+
+ spin_lock(&ioasid_allocator_lock);
+ data = xa_load(&active_allocator->xa, ioasid);
+
+ if (!data) {
+ pr_err("No IOASID entry %d to attach SPID %d\n",
+ ioasid, spid);
+ ret = -ENOENT;
+ goto done_unlock;
+ }
+ /* Check if SPID is unique within the set */
+ if (ioasid_find_by_spid_locked(data->set, spid) != INVALID_IOASID) {
+ ret = -EINVAL;
+ goto done_unlock;
+ }
+ data->spid = spid;
+
+done_unlock:
+ spin_unlock(&ioasid_allocator_lock);
+ return ret;
+}
+EXPORT_SYMBOL_GPL(ioasid_attach_spid);
+
+void ioasid_detach_spid(ioasid_t ioasid)
+{
+ struct ioasid_data *data;
+
+ spin_lock(&ioasid_allocator_lock);
+ data = xa_load(&active_allocator->xa, ioasid);
+
+ if (!data || data->spid == INVALID_IOASID) {
+ pr_err("Invalid IOASID entry %d to detach\n", ioasid);
+ goto done_unlock;
+ }
+ data->spid = INVALID_IOASID;
+
+done_unlock:
+ spin_unlock(&ioasid_allocator_lock);
+}
+EXPORT_SYMBOL_GPL(ioasid_detach_spid);
+
+/**
+ * ioasid_find_by_spid - Find the system-wide IOASID by a set private ID and
+ * its set.
+ *
+ * @set: the ioasid_set to search within
+ * @spid: the set private ID
+ *
+ * Given a set private ID and its IOASID set, find the system-wide IOASID. Take
+ * a reference upon finding the matching IOASID. Return INVALID_IOASID if the
+ * IOASID is not found in the set or the set is not valid.
+ */
+ioasid_t ioasid_find_by_spid(struct ioasid_set *set, ioasid_t spid)
+{
+ ioasid_t ioasid;
+
+ spin_lock(&ioasid_allocator_lock);
+ ioasid = ioasid_find_by_spid_locked(set, spid);
+ spin_unlock(&ioasid_allocator_lock);
+ return ioasid;
+}
+EXPORT_SYMBOL_GPL(ioasid_find_by_spid);
+
static inline bool ioasid_set_is_valid(struct ioasid_set *set)
{
return xa_load(&ioasid_sets, set->id) == set;
@@ -529,6 +630,7 @@ ioasid_t ioasid_alloc(struct ioasid_set *set, ioasid_t min, ioasid_t max,
goto exit_free;
}
data->id = id;
+ data->spid = INVALID_IOASID;
data->state = IOASID_STATE_ACTIVE;
refcount_set(&data->users, 1);

diff --git a/include/linux/ioasid.h b/include/linux/ioasid.h
index 16d421357173..2dfe85e6cb7e 100644
--- a/include/linux/ioasid.h
+++ b/include/linux/ioasid.h
@@ -79,6 +79,10 @@ bool ioasid_put(struct ioasid_set *set, ioasid_t ioasid);
bool ioasid_put_locked(struct ioasid_set *set, ioasid_t ioasid);
int ioasid_attach_data(ioasid_t ioasid, void *data);
void ioasid_detach_data(ioasid_t ioasid);
+int ioasid_attach_spid(ioasid_t ioasid, ioasid_t spid);
+void ioasid_detach_spid(ioasid_t ioasid);
+ioasid_t ioasid_find_by_spid(struct ioasid_set *set, ioasid_t spid);
+
void ioasid_set_for_each_ioasid(struct ioasid_set *sdata,
void (*fn)(ioasid_t id, void *data),
void *data);
@@ -159,6 +163,21 @@ static inline void ioasid_detach_data(ioasid_t ioasid)
{
}

+static inline int ioasid_attach_spid(ioasid_t ioasid, ioasid_t spid)
+{
+ return -ENOTSUPP;
+}
+
+static inline void ioasid_detach_spid(ioasid_t ioasid)
+{
+}
+
+static inline ioasid_t ioasid_find_by_spid(struct ioasid_set *set,
+ ioasid_t spid)
+{
+ return INVALID_IOASID;
+}
+
static inline void ioasid_set_for_each_ioasid(struct ioasid_set *sdata,
void (*fn)(ioasid_t id, void *data),
void *data)
--
2.7.4

2020-09-28 21:38:53

by Jacob Pan

[permalink] [raw]
Subject: [PATCH v3 13/14] iommu/vt-d: Listen to IOASID notifications

On Intel Scalable I/O Virtualization (SIOV) enabled platforms, IOMMU
driver is one of the users of IOASIDs. In normal flow, callers will
perform IOASID allocation, bind, unbind, and free in order. However, for
guest SVA, IOASID free could come before unbind as guest is untrusted.
This patch registers IOASID notification handler such that IOMMU driver
can perform PASID teardown upon receiving an unexpected IOASID free
event.

Signed-off-by: Jacob Pan <[email protected]>
---
drivers/iommu/intel/svm.c | 94 ++++++++++++++++++++++++++++++++++++++++++++-
include/linux/intel-iommu.h | 2 +
2 files changed, 94 insertions(+), 2 deletions(-)

diff --git a/drivers/iommu/intel/svm.c b/drivers/iommu/intel/svm.c
index 39a09a93300e..8f886718df83 100644
--- a/drivers/iommu/intel/svm.c
+++ b/drivers/iommu/intel/svm.c
@@ -95,6 +95,90 @@ static inline bool intel_svm_capable(struct intel_iommu *iommu)
return iommu->flags & VTD_FLAG_SVM_CAPABLE;
}

+static inline void intel_svm_drop_pasid(ioasid_t pasid)
+{
+ ioasid_detach_data(pasid);
+ ioasid_put(NULL, pasid);
+}
+
+static DEFINE_MUTEX(pasid_mutex);
+#define pasid_lock_held() lock_is_held(&pasid_mutex.dep_map)
+
+static void intel_svm_free_async_fn(struct work_struct *work)
+{
+ struct intel_svm *svm = container_of(work, struct intel_svm, work);
+ struct intel_svm_dev *sdev;
+
+ /*
+ * Unbind all devices associated with this PASID which is
+ * being freed by other users such as VFIO.
+ */
+ mutex_lock(&pasid_mutex);
+ list_for_each_entry_rcu(sdev, &svm->devs, list, pasid_lock_held()) {
+ /* Does not poison forward pointer */
+ list_del_rcu(&sdev->list);
+ spin_lock(&svm->iommu->lock);
+ intel_pasid_tear_down_entry(svm->iommu, sdev->dev,
+ svm->pasid, true);
+ spin_unlock(&svm->iommu->lock);
+ kfree_rcu(sdev, rcu);
+ }
+ /*
+ * We may not be the last user to drop the reference but since
+ * the PASID is in FREE_PENDING state, no one can get new reference.
+ * Therefore, we can safely free the private data svm.
+ */
+ intel_svm_drop_pasid(svm->pasid);
+ /*
+ * Free before unbind can only happen with host PASIDs used for
+ * guest SVM. We get here because ioasid_free is called with
+ * outstanding references. So we need to drop the reference
+ * such that the PASID can be reclaimed. unbind_gpasid() after this
+ * will not result in dropping refcount since the private data is
+ * already detached.
+ */
+ kfree(svm);
+
+ mutex_unlock(&pasid_mutex);
+}
+
+
+static int pasid_status_change(struct notifier_block *nb,
+ unsigned long code, void *data)
+{
+ struct ioasid_nb_args *args = (struct ioasid_nb_args *)data;
+ struct intel_svm *svm = (struct intel_svm *)args->pdata;
+ int ret = NOTIFY_DONE;
+
+ if (code == IOASID_NOTIFY_FREE) {
+ /*
+ * If PASID UNBIND happens before FREE, private data of the
+ * IOASID should be NULL, then we don't need to do anything.
+ */
+ if (!svm)
+ goto done;
+ if (args->id != svm->pasid) {
+ pr_warn("Notify PASID does not match data %d : %d\n",
+ args->id, svm->pasid);
+ goto done;
+ }
+ schedule_work(&svm->work);
+ return NOTIFY_OK;
+ }
+done:
+ return ret;
+}
+
+static struct notifier_block pasid_nb = {
+ .notifier_call = pasid_status_change,
+};
+
+void intel_svm_add_pasid_notifier(void)
+{
+ /* Listen to all PASIDs, not specific to a set */
+ ioasid_register_notifier(NULL, &pasid_nb);
+}
+
void intel_svm_check(struct intel_iommu *iommu)
{
if (!pasid_supported(iommu))
@@ -221,7 +305,6 @@ static const struct mmu_notifier_ops intel_mmuops = {
.invalidate_range = intel_invalidate_range,
};

-static DEFINE_MUTEX(pasid_mutex);
static LIST_HEAD(global_svm_list);

#define for_each_svm_dev(sdev, svm, d) \
@@ -344,6 +427,13 @@ int intel_svm_bind_gpasid(struct iommu_domain *domain, struct device *dev,
svm->flags |= SVM_FLAG_GUEST_PASID;
}
ioasid_attach_data(data->hpasid, svm);
+ ioasid_get(NULL, svm->pasid);
+ svm->iommu = iommu;
+ /*
+ * Set up cleanup async work in case IOASID core notify us PASID
+ * is freed before unbind.
+ */
+ INIT_WORK(&svm->work, intel_svm_free_async_fn);
INIT_LIST_HEAD_RCU(&svm->devs);
}
sdev = kzalloc(sizeof(*sdev), GFP_KERNEL);
@@ -437,7 +527,7 @@ int intel_svm_unbind_gpasid(struct device *dev, int pasid)
* the unbind, IOMMU driver will get notified
* and perform cleanup.
*/
- ioasid_detach_data(pasid);
+ intel_svm_drop_pasid(pasid);
kfree(svm);
}
}
diff --git a/include/linux/intel-iommu.h b/include/linux/intel-iommu.h
index b1ed2f25f7c0..d36038e6ae0b 100644
--- a/include/linux/intel-iommu.h
+++ b/include/linux/intel-iommu.h
@@ -744,6 +744,7 @@ void intel_svm_unbind(struct iommu_sva *handle);
int intel_svm_get_pasid(struct iommu_sva *handle);
int intel_svm_page_response(struct device *dev, struct iommu_fault_event *evt,
struct iommu_page_response *msg);
+void intel_svm_add_pasid_notifier(void);

struct svm_dev_ops;

@@ -770,6 +771,7 @@ struct intel_svm {
int gpasid; /* In case that guest PASID is different from host PASID */
struct list_head devs;
struct list_head list;
+ struct work_struct work; /* For deferred clean up */
};
#else
static inline void intel_svm_check(struct intel_iommu *iommu) {}
--
2.7.4

2020-09-28 21:39:02

by Jacob Pan

[permalink] [raw]
Subject: [PATCH v3 14/14] iommu/vt-d: Store guest PASID during bind

IOASID core maintains the guest-host mapping in the form of SPID and
IOASID. This patch assigns the guest PASID (if valid) as SPID while
binding guest page table with a host PASID. This mapping will be used
for lookup and notifications.

Signed-off-by: Jacob Pan <[email protected]>
---
drivers/iommu/intel/svm.c | 2 ++
1 file changed, 2 insertions(+)

diff --git a/drivers/iommu/intel/svm.c b/drivers/iommu/intel/svm.c
index 8f886718df83..e18f8b5af9ba 100644
--- a/drivers/iommu/intel/svm.c
+++ b/drivers/iommu/intel/svm.c
@@ -98,6 +98,7 @@ static inline bool intel_svm_capable(struct intel_iommu *iommu)
static inline void intel_svm_drop_pasid(ioasid_t pasid)
{
ioasid_detach_data(pasid);
+ ioasid_detach_spid(pasid);
ioasid_put(NULL, pasid);
}

@@ -425,6 +426,7 @@ int intel_svm_bind_gpasid(struct iommu_domain *domain, struct device *dev,
if (data->flags & IOMMU_SVA_GPASID_VAL) {
svm->gpasid = data->gpasid;
svm->flags |= SVM_FLAG_GUEST_PASID;
+ ioasid_attach_spid(data->hpasid, data->gpasid);
}
ioasid_attach_data(data->hpasid, svm);
ioasid_get(NULL, svm->pasid);
--
2.7.4

2020-09-28 21:39:17

by Jacob Pan

[permalink] [raw]
Subject: [PATCH v3 08/14] iommu/ioasid: Add reference couting functions

There can be multiple users of an IOASID, each user could have hardware
contexts associated with the IOASID. In order to align lifecycles,
reference counting is introduced in this patch. It is expected that when
an IOASID is being freed, each user will drop a reference only after its
context is cleared.

Signed-off-by: Jacob Pan <[email protected]>
---
drivers/iommu/ioasid.c | 117 +++++++++++++++++++++++++++++++++++++++++++++++++
include/linux/ioasid.h | 24 ++++++++++
2 files changed, 141 insertions(+)

diff --git a/drivers/iommu/ioasid.c b/drivers/iommu/ioasid.c
index 9628e78b2ab4..828cc44b1b1c 100644
--- a/drivers/iommu/ioasid.c
+++ b/drivers/iommu/ioasid.c
@@ -16,8 +16,26 @@ static ioasid_t ioasid_capacity = PCI_PASID_MAX;
static ioasid_t ioasid_capacity_avail = PCI_PASID_MAX;
static DEFINE_XARRAY_ALLOC(ioasid_sets);

+enum ioasid_state {
+ IOASID_STATE_INACTIVE,
+ IOASID_STATE_ACTIVE,
+ IOASID_STATE_FREE_PENDING,
+};
+
+/**
+ * struct ioasid_data - Meta data about ioasid
+ *
+ * @id: Unique ID
+ * @users: Number of active users
+ * @state: Track state of the IOASID
+ * @set: ioasid_set of the IOASID belongs to
+ * @private: Private data associated with the IOASID
+ * @rcu: For free after RCU grace period
+ */
struct ioasid_data {
ioasid_t id;
+ refcount_t users;
+ enum ioasid_state state;
struct ioasid_set *set;
void *private;
struct rcu_head rcu;
@@ -511,6 +529,8 @@ ioasid_t ioasid_alloc(struct ioasid_set *set, ioasid_t min, ioasid_t max,
goto exit_free;
}
data->id = id;
+ data->state = IOASID_STATE_ACTIVE;
+ refcount_set(&data->users, 1);

/* Store IOASID in the per set data */
if (xa_err(xa_store(&set->xa, id, data, GFP_ATOMIC))) {
@@ -560,6 +580,14 @@ static void ioasid_free_locked(struct ioasid_set *set, ioasid_t ioasid)
if (WARN_ON(!xa_load(&ioasid_sets, data->set->id)))
return;

+ /* Free is already in progress */
+ if (data->state == IOASID_STATE_FREE_PENDING)
+ return;
+
+ data->state = IOASID_STATE_FREE_PENDING;
+ if (!refcount_dec_and_test(&data->users))
+ return;
+
ioasid_do_free_locked(data);
}

@@ -717,6 +745,95 @@ void ioasid_set_for_each_ioasid(struct ioasid_set *set,
}
EXPORT_SYMBOL_GPL(ioasid_set_for_each_ioasid);

+int ioasid_get_locked(struct ioasid_set *set, ioasid_t ioasid)
+{
+ struct ioasid_data *data;
+
+ data = xa_load(&active_allocator->xa, ioasid);
+ if (!data) {
+ pr_err("Trying to get unknown IOASID %u\n", ioasid);
+ return -EINVAL;
+ }
+ if (data->state == IOASID_STATE_FREE_PENDING) {
+ pr_err("Trying to get IOASID being freed%u\n", ioasid);
+ return -EBUSY;
+ }
+
+ /* Check set ownership if the set is non-null */
+ if (set && data->set != set) {
+ pr_err("Trying to get IOASID %u outside the set\n", ioasid);
+ /* data found but does not belong to the set */
+ return -EACCES;
+ }
+ refcount_inc(&data->users);
+
+ return 0;
+}
+EXPORT_SYMBOL_GPL(ioasid_get_locked);
+
+/**
+ * ioasid_get - Obtain a reference to an ioasid
+ * @set: the ioasid_set to check permission against if not NULL
+ * @ioasid: the ID to remove
+ *
+ *
+ * Return: 0 on success, error if failed.
+ */
+int ioasid_get(struct ioasid_set *set, ioasid_t ioasid)
+{
+ int ret;
+
+ spin_lock(&ioasid_allocator_lock);
+ ret = ioasid_get_locked(set, ioasid);
+ spin_unlock(&ioasid_allocator_lock);
+
+ return ret;
+}
+EXPORT_SYMBOL_GPL(ioasid_get);
+
+bool ioasid_put_locked(struct ioasid_set *set, ioasid_t ioasid)
+{
+ struct ioasid_data *data;
+
+ data = xa_load(&active_allocator->xa, ioasid);
+ if (!data) {
+ pr_err("Trying to put unknown IOASID %u\n", ioasid);
+ return false;
+ }
+ if (set && data->set != set) {
+ pr_err("Trying to drop IOASID %u outside the set\n", ioasid);
+ return false;
+ }
+ if (!refcount_dec_and_test(&data->users))
+ return false;
+
+ ioasid_do_free_locked(data);
+
+ return true;
+}
+EXPORT_SYMBOL_GPL(ioasid_put_locked);
+
+/**
+ * ioasid_put - Release a reference to an ioasid
+ * @set: the ioasid_set to check permission against if not NULL
+ * @ioasid: the ID to remove
+ *
+ * Put a reference to the IOASID, free it when the number of references drops to
+ * zero.
+ *
+ * Return: %true if the IOASID was freed, %false otherwise.
+ */
+bool ioasid_put(struct ioasid_set *set, ioasid_t ioasid)
+{
+ bool ret;
+
+ spin_lock(&ioasid_allocator_lock);
+ ret = ioasid_put_locked(set, ioasid);
+ spin_unlock(&ioasid_allocator_lock);
+ return ret;
+}
+EXPORT_SYMBOL_GPL(ioasid_put);
+
/**
* ioasid_find - Find IOASID data
* @set: the IOASID set
diff --git a/include/linux/ioasid.h b/include/linux/ioasid.h
index aab58bc26714..16d421357173 100644
--- a/include/linux/ioasid.h
+++ b/include/linux/ioasid.h
@@ -73,6 +73,10 @@ void *ioasid_find(struct ioasid_set *set, ioasid_t ioasid,
bool (*getter)(void *));
int ioasid_register_allocator(struct ioasid_allocator_ops *allocator);
void ioasid_unregister_allocator(struct ioasid_allocator_ops *allocator);
+int ioasid_get(struct ioasid_set *set, ioasid_t ioasid);
+int ioasid_get_locked(struct ioasid_set *set, ioasid_t ioasid);
+bool ioasid_put(struct ioasid_set *set, ioasid_t ioasid);
+bool ioasid_put_locked(struct ioasid_set *set, ioasid_t ioasid);
int ioasid_attach_data(ioasid_t ioasid, void *data);
void ioasid_detach_data(ioasid_t ioasid);
void ioasid_set_for_each_ioasid(struct ioasid_set *sdata,
@@ -112,6 +116,26 @@ static inline void ioasid_set_put(struct ioasid_set *set)
{
}

+static inline int ioasid_get(struct ioasid_set *set, ioasid_t ioasid)
+{
+ return -ENOTSUPP;
+}
+
+static inline int ioasid_get_locked(struct ioasid_set *set, ioasid_t ioasid)
+{
+ return -ENOTSUPP;
+}
+
+static inline bool ioasid_put(struct ioasid_set *set, ioasid_t ioasid)
+{
+ return false;
+}
+
+static inline bool ioasid_put_locked(struct ioasid_set *set, ioasid_t ioasid)
+{
+ return false;
+}
+
static inline void *ioasid_find(struct ioasid_set *set, ioasid_t ioasid, bool (*getter)(void *))
{
return NULL;
--
2.7.4

2020-09-28 21:39:27

by Jacob Pan

[permalink] [raw]
Subject: [PATCH v3 12/14] iommu/vt-d: Remove mm reference for guest SVA

Now that IOASID core keeps track of the IOASID to mm_struct ownership in
the forms of ioasid_set with IOASID_SET_TYPE_MM token type, there is no
need to keep the same mapping in VT-d driver specific data. Native SVM
usage is not affected by the change.

Signed-off-by: Jacob Pan <[email protected]>
---
drivers/iommu/intel/svm.c | 7 -------
1 file changed, 7 deletions(-)

diff --git a/drivers/iommu/intel/svm.c b/drivers/iommu/intel/svm.c
index 2e764e283469..39a09a93300e 100644
--- a/drivers/iommu/intel/svm.c
+++ b/drivers/iommu/intel/svm.c
@@ -338,12 +338,6 @@ int intel_svm_bind_gpasid(struct iommu_domain *domain, struct device *dev,
ret = -ENOMEM;
goto out;
}
- /* REVISIT: upper layer/VFIO can track host process that bind
- * the PASID. ioasid_set = mm might be sufficient for vfio to
- * check pasid VMM ownership. We can drop the following line
- * once VFIO and IOASID set check is in place.
- */
- svm->mm = get_task_mm(current);
svm->pasid = data->hpasid;
if (data->flags & IOMMU_SVA_GPASID_VAL) {
svm->gpasid = data->gpasid;
@@ -351,7 +345,6 @@ int intel_svm_bind_gpasid(struct iommu_domain *domain, struct device *dev,
}
ioasid_attach_data(data->hpasid, svm);
INIT_LIST_HEAD_RCU(&svm->devs);
- mmput(svm->mm);
}
sdev = kzalloc(sizeof(*sdev), GFP_KERNEL);
if (!sdev) {
--
2.7.4

2020-09-28 21:40:26

by Jacob Pan

[permalink] [raw]
Subject: [PATCH v3 10/14] iommu/ioasid: Introduce notification APIs

Relations among IOASID users largely follow a publisher-subscriber
pattern. E.g. to support guest SVA on Intel Scalable I/O Virtualization
(SIOV) enabled platforms, VFIO, IOMMU, device drivers, KVM are all users
of IOASIDs. When a state change occurs, VFIO publishes the change event
that needs to be processed by other users/subscribers.

This patch introduced two types of notifications: global and per
ioasid_set. The latter is intended for users who only needs to handle
events related to the IOASID of a given set.
For more information, refer to the kernel documentation at
Documentation/ioasid.rst.

Signed-off-by: Liu Yi L <[email protected]>
Signed-off-by: Wu Hao <[email protected]>
Signed-off-by: Jacob Pan <[email protected]>
---
drivers/iommu/ioasid.c | 141 +++++++++++++++++++++++++++++++++++++++++++++++++
include/linux/ioasid.h | 57 +++++++++++++++++++-
2 files changed, 197 insertions(+), 1 deletion(-)

diff --git a/drivers/iommu/ioasid.c b/drivers/iommu/ioasid.c
index 378fef8f23d9..894b17c06ead 100644
--- a/drivers/iommu/ioasid.c
+++ b/drivers/iommu/ioasid.c
@@ -10,12 +10,35 @@
#include <linux/spinlock.h>
#include <linux/xarray.h>

+/*
+ * An IOASID can have multiple consumers where each consumer may have
+ * hardware contexts associated with the IOASID.
+ * When a status change occurs, like on IOASID deallocation, notifier chains
+ * are used to keep the consumers in sync.
+ * This is a publisher-subscriber pattern where publisher can change the
+ * state of each IOASID, e.g. alloc/free, bind IOASID to a device and mm.
+ * On the other hand, subscribers get notified for the state change and
+ * keep local states in sync.
+ */
+static ATOMIC_NOTIFIER_HEAD(ioasid_notifier);
+/* List to hold pending notification block registrations */
+static LIST_HEAD(ioasid_nb_pending_list);
+static DEFINE_SPINLOCK(ioasid_nb_lock);
+
/* Default to PCIe standard 20 bit PASID */
#define PCI_PASID_MAX 0x100000
static ioasid_t ioasid_capacity = PCI_PASID_MAX;
static ioasid_t ioasid_capacity_avail = PCI_PASID_MAX;
static DEFINE_XARRAY_ALLOC(ioasid_sets);

+struct ioasid_set_nb {
+ struct list_head list;
+ struct notifier_block *nb;
+ void *token;
+ struct ioasid_set *set;
+ bool active;
+};
+
enum ioasid_state {
IOASID_STATE_INACTIVE,
IOASID_STATE_ACTIVE,
@@ -365,6 +388,42 @@ void ioasid_detach_data(ioasid_t ioasid)
}
EXPORT_SYMBOL_GPL(ioasid_detach_data);

+/**
+ * ioasid_notify - Send notification on a given IOASID for status change.
+ *
+ * @data: The IOASID data to which the notification will send
+ * @cmd: Notification event sent by IOASID external users, can be
+ * IOASID_BIND or IOASID_UNBIND.
+ *
+ * @flags: Special instructions, e.g. notify within a set or global by
+ * IOASID_NOTIFY_FLAG_SET or IOASID_NOTIFY_FLAG_ALL flags
+ * Caller must hold ioasid_allocator_lock and reference to the IOASID
+ */
+static int ioasid_notify(struct ioasid_data *data,
+ enum ioasid_notify_val cmd, unsigned int flags)
+{
+ struct ioasid_nb_args args = { 0 };
+ int ret = 0;
+
+ /* IOASID_FREE/ALLOC are internal events emitted by IOASID core only */
+ if (cmd <= IOASID_NOTIFY_FREE)
+ return -EINVAL;
+
+ if (flags & ~(IOASID_NOTIFY_FLAG_ALL | IOASID_NOTIFY_FLAG_SET))
+ return -EINVAL;
+
+ args.id = data->id;
+ args.set = data->set;
+ args.pdata = data->private;
+ args.spid = data->spid;
+ if (flags & IOASID_NOTIFY_FLAG_ALL)
+ ret = atomic_notifier_call_chain(&ioasid_notifier, cmd, &args);
+ if (flags & IOASID_NOTIFY_FLAG_SET)
+ ret = atomic_notifier_call_chain(&data->set->nh, cmd, &args);
+
+ return ret;
+}
+
static ioasid_t ioasid_find_by_spid_locked(struct ioasid_set *set, ioasid_t spid)
{
ioasid_t ioasid = INVALID_IOASID;
@@ -417,6 +476,7 @@ int ioasid_attach_spid(ioasid_t ioasid, ioasid_t spid)
goto done_unlock;
}
data->spid = spid;
+ ioasid_notify(data, IOASID_NOTIFY_BIND, IOASID_NOTIFY_FLAG_SET);

done_unlock:
spin_unlock(&ioasid_allocator_lock);
@@ -436,6 +496,7 @@ void ioasid_detach_spid(ioasid_t ioasid)
goto done_unlock;
}
data->spid = INVALID_IOASID;
+ ioasid_notify(data, IOASID_NOTIFY_UNBIND, IOASID_NOTIFY_FLAG_SET);

done_unlock:
spin_unlock(&ioasid_allocator_lock);
@@ -469,6 +530,28 @@ static inline bool ioasid_set_is_valid(struct ioasid_set *set)
return xa_load(&ioasid_sets, set->id) == set;
}

+static void ioasid_add_pending_nb(struct ioasid_set *set)
+{
+ struct ioasid_set_nb *curr;
+
+ if (set->type != IOASID_SET_TYPE_MM)
+ return;
+
+ /*
+ * Check if there are any pending nb requests for the given token, if so
+ * add them to the notifier chain.
+ */
+ spin_lock(&ioasid_nb_lock);
+ list_for_each_entry(curr, &ioasid_nb_pending_list, list) {
+ if (curr->token == set->token && !curr->active) {
+ atomic_notifier_chain_register(&set->nh, curr->nb);
+ curr->set = set;
+ curr->active = true;
+ }
+ }
+ spin_unlock(&ioasid_nb_lock);
+}
+
/**
* ioasid_set_alloc - Allocate a new IOASID set for a given token
*
@@ -556,6 +639,13 @@ struct ioasid_set *ioasid_set_alloc(void *token, ioasid_t quota, int type)
set->quota = quota;
set->id = id;
refcount_set(&set->ref, 1);
+ ATOMIC_INIT_NOTIFIER_HEAD(&set->nh);
+
+ /*
+ * Check if there are any pending nb requests for the given token, if so
+ * add them to the notifier chain.
+ */
+ ioasid_add_pending_nb(set);

/*
* Per set XA is used to store private IDs within the set, get ready
@@ -641,6 +731,7 @@ ioasid_t ioasid_alloc(struct ioasid_set *set, ioasid_t min, ioasid_t max,
goto exit_free;
}
set->nr_ioasids++;
+ ioasid_notify(data, IOASID_NOTIFY_ALLOC, IOASID_NOTIFY_FLAG_SET);
goto done_unlock;
exit_free:
kfree(data);
@@ -687,6 +778,8 @@ static void ioasid_free_locked(struct ioasid_set *set, ioasid_t ioasid)
return;

data->state = IOASID_STATE_FREE_PENDING;
+ ioasid_notify(data, IOASID_NOTIFY_FREE,
+ IOASID_NOTIFY_FLAG_ALL | IOASID_NOTIFY_FLAG_ALL);
if (!refcount_dec_and_test(&data->users))
return;

@@ -726,6 +819,7 @@ EXPORT_SYMBOL_GPL(ioasid_set_get);

static void ioasid_set_put_locked(struct ioasid_set *set)
{
+ struct ioasid_set_nb *curr;
struct ioasid_data *entry;
unsigned long index;

@@ -757,6 +851,16 @@ static void ioasid_set_put_locked(struct ioasid_set *set)
/* Return the quota back to system pool */
ioasid_capacity_avail += set->quota;

+ /* Restore pending status of the set NBs */
+ list_for_each_entry(curr, &ioasid_nb_pending_list, list) {
+ if (curr->token == set->token) {
+ if (curr->active)
+ curr->active = false;
+ else
+ pr_warn("Set token exists but not active!\n");
+ }
+ }
+
/*
* Token got released right away after the ioasid_set is freed.
* If a new set is created immediately with the newly released token,
@@ -778,7 +882,9 @@ static void ioasid_set_put_locked(struct ioasid_set *set)
void ioasid_set_put(struct ioasid_set *set)
{
spin_lock(&ioasid_allocator_lock);
+ spin_lock(&ioasid_nb_lock);
ioasid_set_put_locked(set);
+ spin_unlock(&ioasid_nb_lock);
spin_unlock(&ioasid_allocator_lock);
}
EXPORT_SYMBOL_GPL(ioasid_set_put);
@@ -980,6 +1086,41 @@ void *ioasid_find(struct ioasid_set *set, ioasid_t ioasid,
}
EXPORT_SYMBOL_GPL(ioasid_find);

+int ioasid_register_notifier(struct ioasid_set *set, struct notifier_block *nb)
+{
+ if (set)
+ return atomic_notifier_chain_register(&set->nh, nb);
+ else
+ return atomic_notifier_chain_register(&ioasid_notifier, nb);
+}
+EXPORT_SYMBOL_GPL(ioasid_register_notifier);
+
+void ioasid_unregister_notifier(struct ioasid_set *set,
+ struct notifier_block *nb)
+{
+ struct ioasid_set_nb *curr;
+
+ spin_lock(&ioasid_nb_lock);
+ /*
+ * Pending list is registered with a token without an ioasid_set,
+ * therefore should not be unregistered directly.
+ */
+ list_for_each_entry(curr, &ioasid_nb_pending_list, list) {
+ if (curr->nb == nb) {
+ pr_warn("Cannot unregister NB from pending list\n");
+ spin_unlock(&ioasid_nb_lock);
+ return;
+ }
+ }
+ spin_unlock(&ioasid_nb_lock);
+
+ if (set)
+ atomic_notifier_chain_unregister(&set->nh, nb);
+ else
+ atomic_notifier_chain_unregister(&ioasid_notifier, nb);
+}
+EXPORT_SYMBOL_GPL(ioasid_unregister_notifier);
+
MODULE_AUTHOR("Jean-Philippe Brucker <[email protected]>");
MODULE_AUTHOR("Jacob Pan <[email protected]>");
MODULE_DESCRIPTION("IO Address Space ID (IOASID) allocator");
diff --git a/include/linux/ioasid.h b/include/linux/ioasid.h
index 2dfe85e6cb7e..1b551c99d568 100644
--- a/include/linux/ioasid.h
+++ b/include/linux/ioasid.h
@@ -23,6 +23,7 @@ enum ioasid_set_type {

/**
* struct ioasid_set - Meta data about ioasid_set
+ * @nh: List of notifiers private to that set
* @xa: XArray to store ioasid_set private IDs, can be used for
* guest-host IOASID mapping, or just a private IOASID namespace.
* @token: Unique to identify an IOASID set
@@ -33,6 +34,7 @@ enum ioasid_set_type {
* @ref: Reference count of the users
*/
struct ioasid_set {
+ struct atomic_notifier_head nh;
struct xarray xa;
void *token;
int type;
@@ -58,6 +60,47 @@ struct ioasid_allocator_ops {
void *pdata;
};

+/* Notification data when IOASID status changed */
+enum ioasid_notify_val {
+ IOASID_NOTIFY_ALLOC = 1,
+ IOASID_NOTIFY_FREE,
+ IOASID_NOTIFY_BIND,
+ IOASID_NOTIFY_UNBIND,
+};
+
+#define IOASID_NOTIFY_FLAG_ALL BIT(0)
+#define IOASID_NOTIFY_FLAG_SET BIT(1)
+/**
+ * enum ioasid_notifier_prios - IOASID event notification order
+ *
+ * When status of an IOASID changes, users might need to take actions to
+ * reflect the new state. For example, when an IOASID is freed due to
+ * exception, the hardware context in virtual CPU, DMA device, and IOMMU
+ * shall be cleared and drained. Order is required to prevent life cycle
+ * problems.
+ */
+enum ioasid_notifier_prios {
+ IOASID_PRIO_LAST,
+ IOASID_PRIO_DEVICE,
+ IOASID_PRIO_IOMMU,
+ IOASID_PRIO_CPU,
+};
+
+/**
+ * struct ioasid_nb_args - Argument provided by IOASID core when notifier
+ * is called.
+ * @id: The IOASID being notified
+ * @spid: The set private ID associated with the IOASID
+ * @set: The IOASID set of @id
+ * @pdata: The private data attached to the IOASID
+ */
+struct ioasid_nb_args {
+ ioasid_t id;
+ ioasid_t spid;
+ struct ioasid_set *set;
+ void *pdata;
+};
+
#if IS_ENABLED(CONFIG_IOASID)
void ioasid_install_capacity(ioasid_t total);
ioasid_t ioasid_get_capacity(void);
@@ -82,7 +125,10 @@ void ioasid_detach_data(ioasid_t ioasid);
int ioasid_attach_spid(ioasid_t ioasid, ioasid_t spid);
void ioasid_detach_spid(ioasid_t ioasid);
ioasid_t ioasid_find_by_spid(struct ioasid_set *set, ioasid_t spid);
-
+int ioasid_register_notifier(struct ioasid_set *set,
+ struct notifier_block *nb);
+void ioasid_unregister_notifier(struct ioasid_set *set,
+ struct notifier_block *nb);
void ioasid_set_for_each_ioasid(struct ioasid_set *sdata,
void (*fn)(ioasid_t id, void *data),
void *data);
@@ -145,6 +191,15 @@ static inline void *ioasid_find(struct ioasid_set *set, ioasid_t ioasid, bool (*
return NULL;
}

+static inline int ioasid_register_notifier(struct notifier_block *nb)
+{
+ return -ENOTSUPP;
+}
+
+static inline void ioasid_unregister_notifier(struct notifier_block *nb)
+{
+}
+
static inline int ioasid_register_allocator(struct ioasid_allocator_ops *allocator)
{
return -ENOTSUPP;
--
2.7.4

2020-09-28 21:40:40

by Jacob Pan

[permalink] [raw]
Subject: [PATCH v3 11/14] iommu/ioasid: Support mm type ioasid_set notifications

As a system-wide resource, IOASID is often shared by multiple kernel
subsystems that are independent of each other. However, at the
ioasid_set level, these kernel subsystems must communicate with each
other for ownership checking, event notifications, etc. For example, on
Intel Scalable IO Virtualization (SIOV) enabled platforms, KVM and VFIO
instances under the same process/guest must be aware of a shared IOASID
set.
IOASID_SET_TYPE_MM token type was introduced to explicitly mark an
IOASID set that belongs to a process, thus use the same mm_struct
pointer as a token. Users of the same process can then identify with
each other based on this token.

This patch introduces MM token specific event registration APIs. Event
subscribers such as KVM instances can register IOASID event handler
without the knowledge of its ioasid_set. Event handlers are registered
based on its mm_struct pointer as a token. In case when subscribers
register handler *prior* to the creation of the ioasid_set, the
handler’s notification block is stored in a pending list within IOASID
core. Once the ioasid_set of the MM token is created, the notification
block will be registered by the IOASID core.

Signed-off-by: Liu Yi L <[email protected]>
Signed-off-by: Wu Hao <[email protected]>
Signed-off-by: Jacob Pan <[email protected]>
---
drivers/iommu/ioasid.c | 117 +++++++++++++++++++++++++++++++++++++++++++++++++
include/linux/ioasid.h | 15 +++++++
2 files changed, 132 insertions(+)

diff --git a/drivers/iommu/ioasid.c b/drivers/iommu/ioasid.c
index 894b17c06ead..d5faeb559a43 100644
--- a/drivers/iommu/ioasid.c
+++ b/drivers/iommu/ioasid.c
@@ -889,6 +889,29 @@ void ioasid_set_put(struct ioasid_set *set)
}
EXPORT_SYMBOL_GPL(ioasid_set_put);

+/*
+ * ioasid_find_mm_set - Retrieve IOASID set with mm token
+ * Take a reference of the set if found.
+ */
+static struct ioasid_set *ioasid_find_mm_set(struct mm_struct *token)
+{
+ struct ioasid_set *set;
+ unsigned long index;
+
+ spin_lock(&ioasid_allocator_lock);
+
+ xa_for_each(&ioasid_sets, index, set) {
+ if (set->type == IOASID_SET_TYPE_MM && set->token == token) {
+ refcount_inc(&set->ref);
+ goto exit_unlock;
+ }
+ }
+ set = NULL;
+exit_unlock:
+ spin_unlock(&ioasid_allocator_lock);
+ return set;
+}
+
/**
* ioasid_adjust_set - Adjust the quota of an IOASID set
* @set: IOASID set to be assigned
@@ -1121,6 +1144,100 @@ void ioasid_unregister_notifier(struct ioasid_set *set,
}
EXPORT_SYMBOL_GPL(ioasid_unregister_notifier);

+/**
+ * ioasid_register_notifier_mm - Register a notifier block on the IOASID set
+ * created by the mm_struct pointer as the token
+ *
+ * @mm: the mm_struct token of the ioasid_set
+ * @nb: notfier block to be registered on the ioasid_set
+ *
+ * This a variant of ioasid_register_notifier() where the caller intends to
+ * listen to IOASID events belong the ioasid_set created under the same
+ * process. Caller is not aware of the ioasid_set, no need to hold reference
+ * of the ioasid_set.
+ */
+int ioasid_register_notifier_mm(struct mm_struct *mm, struct notifier_block *nb)
+{
+ struct ioasid_set_nb *curr;
+ struct ioasid_set *set;
+ int ret = 0;
+
+ if (!mm)
+ return -EINVAL;
+
+ spin_lock(&ioasid_nb_lock);
+
+ /* Check for duplicates, nb is unique per set */
+ list_for_each_entry(curr, &ioasid_nb_pending_list, list) {
+ if (curr->token == mm && curr->nb == nb) {
+ ret = -EBUSY;
+ goto exit_unlock;
+ }
+ }
+
+ /* Check if the token has an existing set */
+ set = ioasid_find_mm_set(mm);
+ if (!set) {
+ /* Add to the rsvd list as inactive */
+ curr->active = false;
+ } else {
+ /* REVISIT: Only register empty set for now. Can add an option
+ * in the future to playback existing PASIDs.
+ */
+ if (set->nr_ioasids) {
+ pr_warn("IOASID set %d not empty\n", set->id);
+ ret = -EBUSY;
+ goto exit_unlock;
+ }
+ curr = kzalloc(sizeof(*curr), GFP_ATOMIC);
+ if (!curr) {
+ ret = -ENOMEM;
+ goto exit_unlock;
+ }
+ curr->token = mm;
+ curr->nb = nb;
+ curr->active = true;
+ curr->set = set;
+
+ /* Set already created, add to the notifier chain */
+ atomic_notifier_chain_register(&set->nh, nb);
+ /*
+ * Do not hold a reference, if the set gets destroyed, the nb
+ * entry will be marked inactive.
+ */
+ ioasid_set_put(set);
+ }
+
+ list_add(&curr->list, &ioasid_nb_pending_list);
+
+exit_unlock:
+ spin_unlock(&ioasid_nb_lock);
+ return ret;
+}
+EXPORT_SYMBOL_GPL(ioasid_register_notifier_mm);
+
+void ioasid_unregister_notifier_mm(struct mm_struct *mm, struct notifier_block *nb)
+{
+ struct ioasid_set_nb *curr;
+
+ spin_lock(&ioasid_nb_lock);
+ list_for_each_entry(curr, &ioasid_nb_pending_list, list) {
+ if (curr->token == mm && curr->nb == nb) {
+ list_del(&curr->list);
+ spin_unlock(&ioasid_nb_lock);
+ if (curr->active) {
+ atomic_notifier_chain_unregister(&curr->set->nh,
+ nb);
+ }
+ kfree(curr);
+ return;
+ }
+ }
+ pr_warn("No ioasid set found for mm token %llx\n", (u64)mm);
+ spin_unlock(&ioasid_nb_lock);
+}
+EXPORT_SYMBOL_GPL(ioasid_unregister_notifier_mm);
+
MODULE_AUTHOR("Jean-Philippe Brucker <[email protected]>");
MODULE_AUTHOR("Jacob Pan <[email protected]>");
MODULE_DESCRIPTION("IO Address Space ID (IOASID) allocator");
diff --git a/include/linux/ioasid.h b/include/linux/ioasid.h
index 1b551c99d568..c6cc855aadb6 100644
--- a/include/linux/ioasid.h
+++ b/include/linux/ioasid.h
@@ -132,6 +132,9 @@ void ioasid_unregister_notifier(struct ioasid_set *set,
void ioasid_set_for_each_ioasid(struct ioasid_set *sdata,
void (*fn)(ioasid_t id, void *data),
void *data);
+int ioasid_register_notifier_mm(struct mm_struct *mm, struct notifier_block *nb);
+void ioasid_unregister_notifier_mm(struct mm_struct *mm, struct notifier_block *nb);
+
#else /* !CONFIG_IOASID */
static inline void ioasid_install_capacity(ioasid_t total)
{
@@ -238,5 +241,17 @@ static inline void ioasid_set_for_each_ioasid(struct ioasid_set *sdata,
void *data)
{
}
+
+static inline int ioasid_register_notifier_mm(struct mm_struct *mm,
+ struct notifier_block *nb)
+{
+ return -ENOTSUPP;
+}
+
+static inline void ioasid_unregister_notifier_mm(struct mm_struct *mm,
+ struct notifier_block *nb)
+{
+}
+
#endif /* CONFIG_IOASID */
#endif /* __LINUX_IOASID_H */
--
2.7.4

2020-09-28 21:41:13

by Jacob Pan

[permalink] [raw]
Subject: [PATCH v3 07/14] iommu/ioasid: Add an iterator API for ioasid_set

Users of an ioasid_set may not keep track of all the IOASIDs allocated
under the set. When collective actions are needed for each IOASIDs, it
is useful to iterate over all the IOASIDs within the set. For example,
when the ioasid_set is freed, the user might perform the same cleanup
operation on each IOASID.

This patch adds an API to iterate all the IOASIDs within the set.

Signed-off-by: Jacob Pan <[email protected]>
---
drivers/iommu/ioasid.c | 17 +++++++++++++++++
include/linux/ioasid.h | 9 +++++++++
2 files changed, 26 insertions(+)

diff --git a/drivers/iommu/ioasid.c b/drivers/iommu/ioasid.c
index cf8c7d34e2de..9628e78b2ab4 100644
--- a/drivers/iommu/ioasid.c
+++ b/drivers/iommu/ioasid.c
@@ -701,6 +701,23 @@ int ioasid_adjust_set(struct ioasid_set *set, int quota)
EXPORT_SYMBOL_GPL(ioasid_adjust_set);

/**
+ * ioasid_set_for_each_ioasid - Iterate over all the IOASIDs within the set
+ *
+ * Caller must hold a reference of the set and handles its own locking.
+ */
+void ioasid_set_for_each_ioasid(struct ioasid_set *set,
+ void (*fn)(ioasid_t id, void *data),
+ void *data)
+{
+ struct ioasid_data *entry;
+ unsigned long index;
+
+ xa_for_each(&set->xa, index, entry)
+ fn(index, data);
+}
+EXPORT_SYMBOL_GPL(ioasid_set_for_each_ioasid);
+
+/**
* ioasid_find - Find IOASID data
* @set: the IOASID set
* @ioasid: the IOASID to find
diff --git a/include/linux/ioasid.h b/include/linux/ioasid.h
index 0a5e82148eb9..aab58bc26714 100644
--- a/include/linux/ioasid.h
+++ b/include/linux/ioasid.h
@@ -75,6 +75,9 @@ int ioasid_register_allocator(struct ioasid_allocator_ops *allocator);
void ioasid_unregister_allocator(struct ioasid_allocator_ops *allocator);
int ioasid_attach_data(ioasid_t ioasid, void *data);
void ioasid_detach_data(ioasid_t ioasid);
+void ioasid_set_for_each_ioasid(struct ioasid_set *sdata,
+ void (*fn)(ioasid_t id, void *data),
+ void *data);
#else /* !CONFIG_IOASID */
static inline void ioasid_install_capacity(ioasid_t total)
{
@@ -131,5 +134,11 @@ static inline int ioasid_attach_data(ioasid_t ioasid, void *data)
static inline void ioasid_detach_data(ioasid_t ioasid)
{
}
+
+static inline void ioasid_set_for_each_ioasid(struct ioasid_set *sdata,
+ void (*fn)(ioasid_t id, void *data),
+ void *data)
+{
+}
#endif /* CONFIG_IOASID */
#endif /* __LINUX_IOASID_H */
--
2.7.4

2020-09-28 21:42:01

by Jacob Pan

[permalink] [raw]
Subject: [PATCH v3 06/14] iommu/ioasid: Introduce API to adjust the quota of an ioasid_set

Each ioasid_set is given a quota during allocation. As system
administrators balance resources among VMs, we shall support the
adjustment of quota at runtime. The new quota cannot be less than the
outstanding IOASIDs already allocated within the set. The extra quota
will be returned to the system-wide IOASID pool if the new quota is
smaller than the existing one.

Signed-off-by: Jacob Pan <[email protected]>
---
drivers/iommu/ioasid.c | 47 +++++++++++++++++++++++++++++++++++++++++++++++
include/linux/ioasid.h | 6 ++++++
2 files changed, 53 insertions(+)

diff --git a/drivers/iommu/ioasid.c b/drivers/iommu/ioasid.c
index 61e25c2375ab..cf8c7d34e2de 100644
--- a/drivers/iommu/ioasid.c
+++ b/drivers/iommu/ioasid.c
@@ -654,6 +654,53 @@ void ioasid_set_put(struct ioasid_set *set)
EXPORT_SYMBOL_GPL(ioasid_set_put);

/**
+ * ioasid_adjust_set - Adjust the quota of an IOASID set
+ * @set: IOASID set to be assigned
+ * @quota: Quota allowed in this set
+ *
+ * Return 0 on success. If the new quota is smaller than the number of
+ * IOASIDs already allocated, -EINVAL will be returned. No change will be
+ * made to the existing quota.
+ */
+int ioasid_adjust_set(struct ioasid_set *set, int quota)
+{
+ int ret = 0;
+
+ if (quota <= 0)
+ return -EINVAL;
+
+ spin_lock(&ioasid_allocator_lock);
+ if (set->nr_ioasids > quota) {
+ pr_err("New quota %d is smaller than outstanding IOASIDs %d\n",
+ quota, set->nr_ioasids);
+ ret = -EINVAL;
+ goto done_unlock;
+ }
+
+ if ((quota > set->quota) &&
+ (quota - set->quota > ioasid_capacity_avail)) {
+ ret = -ENOSPC;
+ goto done_unlock;
+ }
+
+ /* Return the delta back to system pool */
+ ioasid_capacity_avail += set->quota - quota;
+
+ /*
+ * May have a policy to prevent giving all available IOASIDs
+ * to one set. But we don't enforce here, it should be in the
+ * upper layers.
+ */
+ set->quota = quota;
+
+done_unlock:
+ spin_unlock(&ioasid_allocator_lock);
+
+ return ret;
+}
+EXPORT_SYMBOL_GPL(ioasid_adjust_set);
+
+/**
* ioasid_find - Find IOASID data
* @set: the IOASID set
* @ioasid: the IOASID to find
diff --git a/include/linux/ioasid.h b/include/linux/ioasid.h
index 1ae213b660f0..0a5e82148eb9 100644
--- a/include/linux/ioasid.h
+++ b/include/linux/ioasid.h
@@ -62,6 +62,7 @@ struct ioasid_allocator_ops {
void ioasid_install_capacity(ioasid_t total);
ioasid_t ioasid_get_capacity(void);
struct ioasid_set *ioasid_set_alloc(void *token, ioasid_t quota, int type);
+int ioasid_adjust_set(struct ioasid_set *set, int quota);
void ioasid_set_get(struct ioasid_set *set);
void ioasid_set_put(struct ioasid_set *set);

@@ -99,6 +100,11 @@ static inline struct ioasid_set *ioasid_set_alloc(void *token, ioasid_t quota, i
return ERR_PTR(-ENOTSUPP);
}

+static inline int ioasid_adjust_set(struct ioasid_set *set, int quota)
+{
+ return -ENOTSUPP;
+}
+
static inline void ioasid_set_put(struct ioasid_set *set)
{
}
--
2.7.4

2020-10-20 02:35:21

by Jacob Pan

[permalink] [raw]
Subject: Re: [PATCH v3 00/14] IOASID extensions for guest SVA

Hi,

Any comments on this? I know we have some opens w.r.t. PASID management
UAPI, but I think having this common kernel API features should be
justified.

Thanks!

Jacob


On Mon, 28 Sep 2020 14:38:27 -0700, Jacob Pan <[email protected]>
wrote:

> IOASID was introduced in v5.5 as a generic kernel allocator service for
> both PCIe Process Address Space ID (PASID) and ARM SMMU's Sub Stream
> ID. In addition to basic ID allocation, ioasid_set was defined as a
> token that is shared by a group of IOASIDs. This set token can be used
> for permission checking, but lack of some features to address the
> following needs by guest Shared Virtual Address (SVA).
> - Manage IOASIDs by group, group ownership, quota, etc.
> - State synchronization among IOASID users
> - Non-identity guest-host IOASID mapping
> - Lifecycle management across many users
>
> This patchset introduces the following extensions as solutions to the
> problems above.
> - Redefine and extend IOASID set such that IOASIDs can be managed by
> groups.
> - Add notifications for IOASID state synchronization
> - Add reference counting for life cycle alignment among users
> - Support ioasid_set private IDs, which can be used as guest IOASIDs
> Please refer to Documentation/ioasid.rst in enclosed patch 1/9 for more
> details.
>
> This patchset only included VT-d driver as users of some of the new APIs.
> VFIO and KVM patches are coming up to fully utilize the APIs introduced
> here.
>
> You can find this series at:
> https://github.com/jacobpan/linux.git ioasid_v3
> (VFIO and KVM patches will be available at this branch when published.)
>
> This work is a result of collaboration with many people:
> Liu, Yi L <[email protected]>
> Wu Hao <[email protected]>
> Ashok Raj <[email protected]>
> Kevin Tian <[email protected]>
>
> Thanks,
>
> Jacob
>
> Changelog:
>
> V3:
> - Use consistent ioasid_set_ prefix for ioasid_set level APIs
> - Make SPID and private detach/attach APIs symmetric
> - Use the same ioasid_put semantics as Jean-Phillippe IOASID reference
> patch
> - Take away the public ioasid_notify() function, notifications are now
> emitted by IOASID core as a result of certain IOASID APIs
> - Partition into finer incremental patches
> - Miscellaneous cleanup, locking, exception handling fixes based on v2
> reviews
>
> V2:
> - Redesigned ioasid_set APIs, removed set ID
> - Added set private ID (SPID) for guest PASID usage.
> - Add per ioasid_set notification and priority support.
> - Back to use spinlocks and atomic notifications.
> - Added async work in VT-d driver to perform teardown outside atomic
> context
>
> Jacob Pan (14):
> docs: Document IO Address Space ID (IOASID) APIs
> iommu/ioasid: Rename ioasid_set_data()
> iommu/ioasid: Add a separate function for detach data
> iommu/ioasid: Support setting system-wide capacity
> iommu/ioasid: Redefine IOASID set and allocation APIs
> iommu/ioasid: Introduce API to adjust the quota of an ioasid_set
> iommu/ioasid: Add an iterator API for ioasid_set
> iommu/ioasid: Add reference couting functions
> iommu/ioasid: Introduce ioasid_set private ID
> iommu/ioasid: Introduce notification APIs
> iommu/ioasid: Support mm type ioasid_set notifications
> iommu/vt-d: Remove mm reference for guest SVA
> iommu/vt-d: Listen to IOASID notifications
> iommu/vt-d: Store guest PASID during bind
>
> Documentation/driver-api/ioasid.rst | 648 ++++++++++++++++++++++++++
> drivers/iommu/intel/iommu.c | 29 +-
> drivers/iommu/intel/pasid.h | 1 +
> drivers/iommu/intel/svm.c | 132 +++++-
> drivers/iommu/ioasid.c | 890
> ++++++++++++++++++++++++++++++++++-- include/linux/intel-iommu.h
> | 2 + include/linux/ioasid.h | 197 +++++++-
> 7 files changed, 1830 insertions(+), 69 deletions(-)
> create mode 100644 Documentation/driver-api/ioasid.rst
>


Thanks,

Jacob

2020-10-20 14:08:00

by Jean-Philippe Brucker

[permalink] [raw]
Subject: Re: [PATCH v3 01/14] docs: Document IO Address Space ID (IOASID) APIs

On Mon, Sep 28, 2020 at 02:38:28PM -0700, Jacob Pan wrote:
> IOASID is used to identify address spaces that can be targeted by device
> DMA. It is a system-wide resource that is essential to its many users.
> This document is an attempt to help developers from all vendors navigate
> the APIs. At this time, ARM SMMU and Intel’s Scalable IO Virtualization
> (SIOV) enabled platforms are the primary users of IOASID. Examples of
> how SIOV components interact with IOASID APIs are provided in that many
> APIs are driven by the requirements from SIOV.
>
> Cc: Jonathan Corbet <[email protected]>
> Cc: [email protected]
> Cc: Randy Dunlap <[email protected]>
> Signed-off-by: Liu Yi L <[email protected]>
> Signed-off-by: Wu Hao <[email protected]>
> Signed-off-by: Jacob Pan <[email protected]>

This looks good to me, with small comments below.

> ---
> Documentation/driver-api/ioasid.rst | 648 ++++++++++++++++++++++++++++++++++++
> 1 file changed, 648 insertions(+)
> create mode 100644 Documentation/driver-api/ioasid.rst
>
> diff --git a/Documentation/driver-api/ioasid.rst b/Documentation/driver-api/ioasid.rst
> new file mode 100644
> index 000000000000..7f8e702997ab
> --- /dev/null
> +++ b/Documentation/driver-api/ioasid.rst
> @@ -0,0 +1,648 @@
> +.. SPDX-License-Identifier: GPL-2.0
> +.. ioasid:
> +
> +===================
> +IO Address Space ID
> +===================
> +
> +IOASID is a generic name for PCIe Process Address ID (PASID) or ARM
> +SMMU SubstreamID. An IOASID identifies an address space that DMA
> +requests can target.
> +
> +The primary use cases for IOASID are Shared Virtual Address (SVA) and
> +multiple IOVA spaces per device. However, the requirements for IOASID
> +management can vary among hardware architectures.
> +
> +For baremetal IOVA, IOASID #0 is used for DMA request without

bare metal

> +PASID. Even though some architectures such as VT-d also offers
> +the flexibility of using any PASIDs for DMA request without PASID.
> +PASID #0 is reserved and not allocated from any ioasid_set.
> +
> +Multiple IOVA spaces per device are mapped to auxiliary domains which
> +can be used for mediated device assignment with and without a virtual
> +IOMMU (vIOMMU). An IOASID is allocated for each auxiliary domain as default
> +PASID. Without vIOMMU, default IOASID is used for DMA map/unmap
> +APIs. With vIOMMU, default IOASID is used for guest IOVA where DMA
> +request with PASID is required for the device. The reason is that
> +there is only one PASID #0 per device, e.g. VT-d, RID_PASID is per PCI

on VT-d

> +device.
> +
> +This document covers the generic features supported by IOASID
> +APIs. Vendor-specific use cases are also illustrated with Intel's VT-d
> +based platforms as the first example.
> +
> +.. contents:: :local:
> +
> +Glossary
> +========
> +PASID - Process Address Space ID
> +
> +IOASID - IO Address Space ID (generic term for PCIe PASID and
> +SubstreamID in SMMU)
> +
> +SVA/SVM - Shared Virtual Addressing/Memory
> +
> +ENQCMD - Intel X86 ISA for efficient workqueue submission [1]
> +!!!TODO: Link to Spec at the bottom

Yes, or maybe hyperlinks at the end of this section would be better. There
are references and lists all over the document so keeping things as close
as possible avoids confusion.

> +
> +DSA - Intel Data Streaming Accelerator [2]
> +
> +VDCM - Virtual Device Composition Module [3]
> +
> +SIOV - Intel Scalable IO Virtualization
> +
> +
> +Key Concepts
> +============
> +
> +IOASID Set
> +-----------
> +An IOASID set is a group of IOASIDs allocated from the system-wide
> +IOASID pool. Refer to IOASID set APIs for more details.

IOASID Set Level APIs

> +
> +IOASID set is particularly useful for guest SVA where each guest could
> +have its own IOASID set for security and efficiency reasons.
> +
> +IOASID Set Private ID (SPID)
> +----------------------------
> +Each IOASID set has a private namespace of SPIDs. An SPID maps to a
> +single system-wide IOASID. Conversely, each IOASID may be associated
> +with an alias ID, local to the IOASID set, named SPID.
> +SPIDs can be used as guest IOASIDs where each guest could do
> +IOASID allocation from its own pool and map them to host physical
> +IOASIDs. SPIDs are particularly useful for supporting live migration
> +where decoupling guest and host physical resources are necessary.
> +
> +For example, two VMs can both allocate guest PASID/SPID #101 but map to
> +different host PASIDs #201 and #202 respectively as shown in the
> +diagram below.
> +::
> +
> + .------------------. .------------------.
> + | VM 1 | | VM 2 |
> + | | | |
> + |------------------| |------------------|
> + | GPASID/SPID 101 | | GPASID/SPID 101 |
> + '------------------' -------------------' Guest
> + __________|______________________|____________________
> + | | Host
> + v v
> + .------------------. .------------------.
> + | Host IOASID 201 | | Host IOASID 202 |
> + '------------------' '------------------'
> + | IOASID set 1 | | IOASID set 2 |
> + '------------------' '------------------'
> +
> +Guest PASID is treated as IOASID set private ID (SPID) within an
> +IOASID set, mappings between guest and host IOASIDs are stored in the
> +set for inquiry.
> +
> +IOASID APIs
> +===========
> +To get the IOASID APIs, users must #include <linux/ioasid.h>. These APIs
> +serve the following functionalities:
> +
> + - IOASID allocation/Free

freeing (or release)

> + - Group management in the form of ioasid_set
> + - Private data storage and lookup
> + - Reference counting
> + - Event notification in case of a state change
> +
> +IOASID Set Level APIs
> +--------------------------
> +For use cases such as guest SVA it is necessary to manage IOASIDs at
> +ioasid_set level. For example, VMs may allocate multiple IOASIDs for
> +guest process address sharing (vSVA). It is imperative to enforce
> +VM-IOASID ownership such that a malicious guest cannot target DMA
> +traffic outside its own IOASIDs, or free an active IOASID that belongs
> +to another VM.
> +
> +The IOASID set APIs serve the following purposes:
> +
> + - Ownership/permission enforcement
> + - Take collective actions, e.g. free an entire set
> + - Event notifications within a set
> + - Look up a set based on token
> + - Quota enforcement
> +
> +Each IOASID set is created with a token, which can be one of the
> +following token types:
> +
> + - IOASID_SET_TYPE_NULL (Arbitrary u64 value)

Maybe NULL isn't the best name then. NONE?

> + - IOASID_SET_TYPE_MM (Set token is a mm_struct)
> +
> +The explicit MM token type is useful when multiple users of an IOASID
> +set under the same process need to communicate about their shared IOASIDs.
> +E.g. An IOASID set created by VFIO for one guest can be associated
> +with the KVM instance for the same guest since they share a common mm_struct.
> +A token must be unique within its type.
> +
> +::
> +
> + struct ioasid_set *ioasid_alloc_set(void *token, ioasid_t quota, u32 type)
> +
> + int ioasid_adjust_set(struct ioasid_set *set, int quota);
> +
> + void ioasid_set_get(struct ioasid_set *set)
> +
> + void ioasid_set_put(struct ioasid_set *set)
> +
> + void ioasid_set_get_locked(struct ioasid_set *set)
> +
> + void ioasid_set_put_locked(struct ioasid_set *set)
> +
> + int ioasid_set_for_each_ioasid(struct ioasid_set *set,
> + void (*fn)(ioasid_t id, void *data),
> + void *data)

misaligned

> +
> +
> +Individual IOASID APIs
> +----------------------
> +Once an ioasid_set is created, IOASIDs can be allocated from the set.
> +Within the IOASID set namespace, set private ID (SPID) is supported. In
> +the VM use case, SPID can be used for storing guest PASID.
> +
> +::
> +
> + ioasid_t ioasid_alloc(struct ioasid_set *set, ioasid_t min, ioasid_t max,
> + void *private);
> +
> + int ioasid_get(struct ioasid_set *set, ioasid_t ioasid);
> +
> + void ioasid_put(struct ioasid_set *set, ioasid_t ioasid);
> +
> + int ioasid_get_locked(struct ioasid_set *set, ioasid_t ioasid);
> +
> + void ioasid_put_locked(struct ioasid_set *set, ioasid_t ioasid);
> +
> + void *ioasid_find(struct ioasid_set *set, ioasid_t ioasid,
> + bool (*getter)(void *));
> +
> + ioasid_t ioasid_find_by_spid(struct ioasid_set *set, ioasid_t spid)
> +
> + int ioasid_attach_data(struct ioasid_set *set, ioasid_t ioasid,
> + void *data);
> + int ioasid_attach_spid(struct ioasid_set *set, ioasid_t ioasid,
> + ioasid_t spid);
> +
> +
> +Notifications
> +-------------
> +An IOASID may have multiple users, each user may have hardware context
> +associated with an IOASID. When the status of an IOASID changes,
> +e.g. an IOASID is being freed, users need to be notified such that the
> +associated hardware context can be cleared, flushed, and drained.
> +
> +::
> +
> + int ioasid_register_notifier(struct ioasid_set *set, struct
> + notifier_block *nb)
> +
> + void ioasid_unregister_notifier(struct ioasid_set *set,
> + struct notifier_block *nb)
> +
> + int ioasid_register_notifier_mm(struct mm_struct *mm, struct
> + notifier_block *nb)
> +
> + void ioasid_unregister_notifier_mm(struct mm_struct *mm, struct
> + notifier_block *nb)
> +
> + int ioasid_notify(ioasid_t ioasid, enum ioasid_notify_val cmd,
> + unsigned int flags)
> +
> +"_mm" flavor of the ioasid_register_notifier() APIs are used when
> +an IOASID user need to listen to the IOASID events belong to a
> +process but without the knowledge of the associated ioasid_set.
> +
> +Events
> +~~~~~~
> +Notification events are pertinent to individual IOASIDs, they can be
> +one of the following:
> +
> + - ALLOC
> + - FREE
> + - BIND
> + - UNBIND
> +
> +Ordering
> +~~~~~~~~
> +Ordering is supported by IOASID notification priorities as the
> +following (in ascending order):
> +
> +::
> +
> + enum ioasid_notifier_prios {
> + IOASID_PRIO_LAST,
> + IOASID_PRIO_IOMMU,
> + IOASID_PRIO_DEVICE,
> + IOASID_PRIO_CPU,
> + };
> +
> +When registered, notifiers are assigned a priority that affect the
> +call order. For example, notifiers with CPU priority get called before
> +notifiers with device priority and so on.
> +
> +The typical use case is when an IOASID is freed due to an exception, DMA
> +source should be quiesced before tearing down other hardware contexts
> +in the system. This will reduce the churn in handling faults. DMA work
> +submission is performed by the CPU which is granted higher priority than
> +devices.
> +
> +Level Sensitivity
> +~~~~~~~~~~~~~~~~~
> +For each IOASID state transition, IOASID core ensures that there is
> +only one notification sent. This resembles level triggered interrupt
> +where a single interrupt is raised during a state transition.
> +For example, if ioasid_free() is called twice by a user before the
> +IOASID is reclaimed, IOASID core will only send out a single
> +IOASID_NOTIFY_FREE event. Similarly, for IOASID_NOTIFY_BIND/UNBIND
> +events, which is only sent out once when a SPID is attached/detached.
> +
> +IOASID notifications cannot be sent directly by IOASID users, they are
> +sent out by the IOASID core as a by-product of the following APIs:
> +- ioasid_alloc/free(), emit IOASID_NOTIFY_ALLOC/FREE
> +- ioasid_attach/detach_spid() emit IOASID_NOTIFY_BIND/UNBIND
> +
> +Scopes
> +~~~~~~
> +There are two types of notifiers in IOASID core: system-wide and
> +ioasid_set-wide.
> +
> +System-wide notifier is catering for users that need to handle all the
> +IOASIDs in the system. E.g. The IOMMU driver.
> +
> +Per ioasid_set notifier can be used by VM specific components such as
> +KVM. After all, each KVM instance only cares about IOASIDs within its
> +own set/guest.
> +
> +
> +Atomicity
> +~~~~~~~~~
> +IOASID notifiers are atomic due to spinlocks used inside the IOASID
> +core. For tasks that cannot be completed in the notifier handler,
> +async work can be submitted to complete the work later as long as
> +there is no ordering requirement.
> +
> +Reference counting
> +------------------
> +IOASID lifecycle management is based on reference counting. Users of
> +IOASID who intend to align lifecycle with the IOASID need to hold
> +a reference of the IOASID. The IOASID will not be returned to the pool
> +for allocation until all references are dropped. Calling ioasid_free()
> +will mark the IOASID as FREE_PENDING if the IOASID has outstanding
> +references. No new references can be taken by ioasid_get() once an
> +IOASID is in the FREE_PENDING state. ioasid_free() can be called
> +multiple times without an error until all refs are dropped.
> +
> +ioasid_put() decrements and tests refcount of the IOASID. If refcount
> +is 0, ioasid will be freed. Deleted from the system-wide xarray as
> +well as per set xarray.

These xarray structures haven't been introduced so far. I would leave out
this level of detail because it's not relevant to the user and will be
difficult to keep in sync with the internals of IOASID.

> The IOASID will be returned to the pool and
> +available for new allocations.
> +
> +Event notifications are used to inform users of IOASID status change.
> +IOASID_FREE event prompts users to drop their references after
> +clearing its context.
> +
> +For example, on VT-d platform when an IOASID is freed, teardown
> +actions are performed on KVM, device driver (VDCM), and the IOMMU
> +driver. To quiesce VCPU for work submission, KVM notifier handler must
> +be called before VDCM handler. Therefore, KVM and VDCM shall use
> +notification priority IOASID_PRIO_CPU andIOASID_PRIO_DEVICE

and IOASID_PRIO_DEVICE

> +respectively.
> +
> +For both KVM and VDCM, notifier blocks shall be registered on the
> +IOASID set such that *only* events from the matching VM are received.
> +
> +If KVM attempts to register a notifier block before the IOASID set is
> +created using the MM token, the notifier block will be placed on a
> +pending list inside IOASID core. Once the token matching IOASID set
> +is created, IOASID will register the notifier block automatically.
> +IOASID core does not replay events for the existing IOASIDs in the
> +set. For IOASID set of MM type, notification blocks can be registered
> +on empty sets only. This is to avoid lost events.
> +
> +IOMMU driver shall register notifier block on global chain::
> +
> + static struct notifier_block pasid_nb_vtd = {
> + .notifier_call = pasid_status_change_vtd,
> + .priority = IOASID_PRIO_IOMMU,
> + };
> +
> +Custom allocator APIs
> +---------------------
> +
> +::
> +
> + int ioasid_register_allocator(struct ioasid_allocator_ops *allocator);
> +
> + void ioasid_unregister_allocator(struct ioasid_allocator_ops *allocator);
> +
> +Allocator Choices
> +~~~~~~~~~~~~~~~~~
> +IOASIDs are allocated for both host and guest SVA/IOVA usage. However,
> +allocators can be different. For example, on VT-d guest PASID
> +allocation must be performed via a virtual command interface which is
> +emulated by VMM.
> +
> +IOASID core has the notion of "custom allocator" such that guest can
> +register virtual command allocator that precedes the default one.

"Supersedes", rather than "precedes"?

> +
> +Namespaces
> +~~~~~~~~~~
> +IOASIDs are limited system resources that default to 20 bits in
> +size. Each device can have its own PASID table for security reasons.
> +Theoretically the namespace can be per device also.
> +
> +However IOASID namespace is system-wide for two reasons:
> +- Simplicity
> +- Sharing resources of a single device to multiple VMs.
> +
> +Take VT-d as an example, VT-d supports shared workqueue and ENQCMD[1]
> +where one IOASID could be used to submit work on multiple devices that
> +are shared with other VMs. This requires IOASID to be
> +system-wide. This is also the reason why guests must use an
> +emulated virtual command interface to allocate IOASID from the host.
> +
> +Life cycle
> +==========
> +This section covers IOASID life cycle management for both bare-metal
> +and guest usages. In bare-metal SVA, MMU notifier is directly hooked
> +up with IOMMU driver, therefore the IOASID life cycle matches the
> +process address space (MM) life cycle.
> +
> +However, guest MMU notifier is not available to host IOMMU driver,
> +when guest MM terminates unexpectedly, the events have to go through
> +VFIO and IOMMU UAPI to reach host IOMMU driver. There are also more
> +parties involved in guest SVA, e.g. on Intel VT-d platform, IOASIDs
> +are used by IOMMU driver, KVM, VDCM, and VFIO.
> +
> +Native IOASID Life Cycle (VT-d Example)
> +---------------------------------------
> +
> +The normal flow of native SVA code with Intel Data Streaming
> +Accelerator(DSA) [2] as example:
> +
> +1. Host user opens accelerator FD, e.g. DSA driver, or uacce;
> +2. DSA driver allocate WQ, do sva_bind_device();
> +3. IOMMU driver calls ioasid_alloc(), then bind PASID with device,
> + mmu_notifier_get()
> +4. DMA starts by DSA driver userspace
> +5. DSA userspace close FD
> +6. DSA/uacce kernel driver handles FD.close()
> +7. DSA driver stops DMA
> +8. DSA driver calls sva_unbind_device();
> +9. IOMMU driver does unbind, clears PASID context in IOMMU, flush
> + TLBs. mmu_notifier_put() called.
> +10. mmu_notifier.release() called, IOMMU SVA code calls ioasid_free()*
> +11. The IOASID is returned to the pool, reclaimed.
> +
> +::
> +
> + * With ENQCMD, PASID used on VT-d is not released in mmu_notifier() but
> + mmdrop(). mmdrop comes after FD close which does not make a difference.
> +
> +During the normal teardown, the following three key steps would happen in
> +order: (7), (9), (11).
> +
> +Exception happens when process terminates *before* device driver stops
> +DMA and call IOMMU driver to unbind. The detailed flow of process
> +exits are as follows:
> +
> +::
> +
> + do_exit() {
> + exit_mm() {
> + mm_put();
> + exit_mmap() {
> + intel_invalidate_range() //mmu notifier
> + tlb_finish_mmu()
> + mmu_notifier_release(mm) {
> + intel_mm_release() {
> + (9) intel_iommu_teardown_pasid();
> + intel_iommu_flush_tlbs();
> + }
> + // tlb_invalidate_range cb removed
> + }
> + unmap_vmas();
> + free_pgtables(); // IOMMU cannot walk PGT after this

or rather, it *must* not walk PGT at this point? PGT should be disabled on
the IOMMU side at step 9, before the TLB invalidation

> + };
> + }
> + exit_files(tsk) {
> + close_files() {
> + dsa_close();
> + (7) dsa_stop_dma();
> + intel_svm_unbind_pasid(); //nothing to do
> + }
> + }
> + }
> +
> + mmdrop() /* some random time later, lazy mm user */ {
> + mm_free_pgd();
> + destroy_context(mm); {
> + (11) ioasid_free();
> + }
> + }
> +
> +As shown in the list above, step #2 could happen before
> +#1. Unrecoverable(UR) faults could happen between #2 and #1.
> +Fault processing is disabled by the IOMMU driver in #2, therefore the UR
> +fault never reaches the driver.

Did you mean steps (7) and (9) instead for #1 and #2?

> +
> +
> +Also notice that TLB invalidation occurs at mmu_notifier
> +invalidate_range callback as well as the release callback. The reason
> +is that release callback will delete IOMMU driver from the notifier
> +chain which may skip invalidate_range() calls during the exit path.

I think this whole section also goes too deep in implementation details,
and is susceptible to become out of sync with the code pretty quickly.
I don't feel strongly about this, though, leave it if you want.

> +
> +To avoid unnecessary reporting of UR fault, IOMMU driver shall disable
> +fault reporting after free and before unbind.
> +
> +Guest IOASID Life Cycle (VT-d Example)
> +--------------------------------------
> +Guest IOASID life cycle starts with guest driver open(), this could be
> +uacce or individual accelerator driver such as DSA. At FD open,
> +sva_bind_device() is called which triggers a series of actions.
> +
> +The example below is an illustration of *normal* operations that
> +involves *all* the SW components in VT-d. The flow can be simpler if
> +no ENQCMD is supported.
> +
> +::
> +
> + VFIO IOMMU KVM VDCM IOASID Ref
> + ..................................................................
> + 1 ioasid_register_notifier/_mm()
> + 2 ioasid_alloc() 1
> + 3 bind_gpasid()
> + 4 iommu_bind()->ioasid_get() 2
> + 5 ioasid_notify(BIND)
> + 6 -> ioasid_get() 3
> + 7 -> vmcs_update_atomic()
> + 8 mdev_write(gpasid)
> + 9 hpasid=
> + 10 find_by_spid(gpasid) 4

Are 9 and 10 the same step?

> + 11 vdev_write(hpasid)

What's the 'v' in vdev? Aren't we writing to the physical device here?

> + 12 -------- GUEST STARTS DMA --------------------------
> + 13 -------- GUEST STOPS DMA --------------------------
> + 14 mdev_clear(gpasid)
> + 15 vdev_clear(hpasid)
> + 16 ioasid_put() 3
> + 17 unbind_gpasid()
> + 18 iommu_ubind()
> + 19 ioasid_notify(UNBIND)
> + 20 -> vmcs_update_atomic()
> + 21 -> ioasid_put() 2
> + 22 ioasid_free() 1
> + 23 ioasid_put() 0
> + 24 Reclaimed
> + -------------- New Life Cycle Begin ----------------------------
> + 1 ioasid_alloc() -> 1
> +
> + Note: IOASID Notification Events: FREE, BIND, UNBIND

Is FREE at step 22?

> +
> +Exception cases arise when a guest crashes or a malicious guest
> +attempts to cause disruption on the host system. The fault handling
> +rules are:
> +
> +1. IOASID free must *always* succeed.
> +2. An inactive period may be required before the freed IOASID is
> + reclaimed. During this period, consumers of IOASID perform cleanup.
> +3. Malfunction is limited to the guest owned resources for all
> + programming errors.
> +
> +The primary source of exception is when the following are out of
> +order:
> +
> +1. Start/Stop of DMA activity
> + (Guest device driver, mdev via VFIO)
> +2. Setup/Teardown of IOMMU PASID context, IOTLB, DevTLB flushes
> + (Host IOMMU driver bind/unbind)
> +3. Setup/Teardown of VMCS PASID translation table entries (KVM) in
> + case of ENQCMD
> +4. Programming/Clearing host PASID in VDCM (Host VDCM driver)
> +5. IOASID alloc/free (Host IOASID)
> +
> +VFIO is the *only* user-kernel interface, which is ultimately
> +responsible for exception handling.
> +
> +#1 is processed the same way as the assigned device today based on
> +device file descriptors and events. There is no special handling.
> +
> +#3 is based on bind/unbind events emitted by #2.
> +
> +#4 is naturally aligned with IOASID life cycle in that an illegal
> +guest PASID programming would fail in obtaining reference of the
> +matching host IOASID.
> +
> +#5 is similar to #4. The fault will be reported to the user if PASID
> +used in the ENQCMD is not set up in VMCS PASID translation table.
> +
> +Therefore, the remaining out of order problem is between #2 and
> +#5. I.e. unbind vs. free. More specifically, free before unbind.
> +
> +IOASID notifier and refcounting are used to ensure order. Following
> +a publisher-subscriber pattern where:
> +
> +- Publishers: VFIO & IOMMU
> +- Subscribers: KVM, VDCM, IOMMU
> +
> +IOASID reference must be acquired before receiving the FREE event. The
> +reference must be dropped at the end of the processing in order to
> +return the IOASID to the pool.
> +
> +Let's examine the IOASID life cycle again when free happens *before*
> +unbind. This could be a result of misbehaving guests or crash. Assuming
> +VFIO cannot enforce unbind->free order. Notice that the setup part up
> +until step #12 is identical to the normal case, the flow below starts
> +with step 13.
> +
> +::
> +
> + VFIO IOMMU KVM VDCM IOASID Ref
> + ..................................................................
> + 13 -------- GUEST STARTS DMA --------------------------
> + 14 -------- *GUEST MISBEHAVES!!!* ----------------
> + 15 ioasid_free()
> + 16 ioasid_notify(FREE)
> + 17 mark_free_pending (1)

Could we use superscript ¹²³⁴ for footnotes? These look like function
parameters

> + 18 kvm_nb_handler(FREE)
> + 19 vmcs_update_atomic()
> + 20 ioasid_put_locked() -> 3
> + 21 vdcm_nb_handler(FREE)
> + 22 iomm_nb_handler(FREE)

iommu_nb_handler

> + 23 ioasid_free() returns(2) schedule_work() 2

I completely lost track here, couldn't figure out in which direction to
read the diagram. What work is scheduled? Why does the IOMMU driver drop
its reference to the IOASID before unbdind_gpasid()?

> + 24 schedule_work() vdev_clear_wk(hpasid)
> + 25 teardown_pasid_wk()
> + 26 ioasid_put() -> 1
> + 27 ioasid_put() 0
> + 28 Reclaimed
> + 29 unbind_gpasid()
> + 30 iommu_unbind()->ioasid_find() Fails(3)
> + -------------- New Life Cycle Begin ----------------------------
> +
> +Note:
> +
> +1. By marking IOASID FREE_PENDING at step #17, no new references can be
> + held. ioasid_get/find() will return -ENOENT;

s/held/taken

Thanks,
Jean

> +2. After step #23, all events can go out of order. Shall not affect
> + the outcome.
> +3. IOMMU driver fails to find private data for unbinding. If unbind is
> + called after the same IOASID is allocated for the same guest again,
> + this is a programming error. The damage is limited to the guest
> + itself since unbind performs permission checking based on the
> + IOASID set associated with the guest process.
> +4. Workqueues are used by VDCM and IOMMU driver for processing that
> + requires thread context.
> +
> +
> +KVM PASID Translation Table Updates
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +Per VM PASID translation table is maintained by KVM in order to
> +support ENQCMD in the guest. The table contains host-guest PASID
> +translations to be consumed by CPU ucode. The synchronization of the
> +PASID states depends on VFIO/IOMMU driver, where IOCTL and atomic
> +notifiers are used. KVM must register IOASID notifier per VM instance
> +during launch time. The following events are handled:
> +
> +1. BIND/UNBIND
> +2. FREE
> +
> +Rules:
> +
> +1. Multiple devices can bind with the same PASID, this can be different PCI
> + devices or mdevs within the same PCI device. However, only the
> + *first* BIND and *last* UNBIND emit notifications.
> +2. IOASID code is responsible for ensuring the correctness of H-G
> + PASID mapping. There is no need for KVM to validate the
> + notification data.
> +3. When UNBIND happens *after* FREE, KVM will see error in
> + ioasid_get() even when the reclaim is not done. IOMMU driver will
> + also avoid sending UNBIND if the PASID is already FREE.
> +4. When KVM terminates *before* FREE & UNBIND, references will be
> + dropped for all host PASIDs.
> +
> +VDCM PASID Programming
> +~~~~~~~~~~~~~~~~~~~~~~
> +VDCM composes virtual devices and exposes them to the guests. When
> +the guest allocates a PASID then program it to the virtual device, VDCM
> +intercepts the programming attempt then programs the matching host
> +PASID on to the hardware.
> +Conversely, when a device is going away, VDCM must be informed such
> +that PASID context on the hardware can be cleared. There could be
> +multiple mdevs assigned to different guests in the same VDCM. Since
> +the PASID table is shared at PCI device level, lazy clearing is not
> +secure. A malicious guest can attack by using newly freed PASIDs that
> +are allocated by another guest.
> +
> +By holding a reference of the PASID until VDCM cleans up the HW context,
> +it is guaranteed that PASID life cycles do not cross within the same
> +device.
> +
> +
> +Reference
> +====================================================
> +1. https://software.intel.com/sites/default/files/managed/c5/15/architecture-instruction-set-extensions-programming-reference.pdf
> +
> +2. https://01.org/blogs/2019/introducing-intel-data-streaming-accelerator
> +
> +3. https://software.intel.com/en-us/download/intel-data-streaming-accelerator-preliminary-architecture-specification
> --
> 2.7.4
>


Attachments:
(No filename) (29.54 kB)

2020-10-21 15:05:24

by Jean-Philippe Brucker

[permalink] [raw]
Subject: Re: [PATCH v3 06/14] iommu/ioasid: Introduce API to adjust the quota of an ioasid_set

On Mon, Sep 28, 2020 at 02:38:33PM -0700, Jacob Pan wrote:
> Each ioasid_set is given a quota during allocation. As system
> administrators balance resources among VMs, we shall support the
> adjustment of quota at runtime. The new quota cannot be less than the
> outstanding IOASIDs already allocated within the set. The extra quota
> will be returned to the system-wide IOASID pool if the new quota is
> smaller than the existing one.
>
> Signed-off-by: Jacob Pan <[email protected]>

Minor comments below, but

Reviewed-by: Jean-Philippe Brucker <[email protected]>

> ---
> drivers/iommu/ioasid.c | 47 +++++++++++++++++++++++++++++++++++++++++++++++
> include/linux/ioasid.h | 6 ++++++
> 2 files changed, 53 insertions(+)
>
> diff --git a/drivers/iommu/ioasid.c b/drivers/iommu/ioasid.c
> index 61e25c2375ab..cf8c7d34e2de 100644
> --- a/drivers/iommu/ioasid.c
> +++ b/drivers/iommu/ioasid.c
> @@ -654,6 +654,53 @@ void ioasid_set_put(struct ioasid_set *set)
> EXPORT_SYMBOL_GPL(ioasid_set_put);
>
> /**
> + * ioasid_adjust_set - Adjust the quota of an IOASID set
> + * @set: IOASID set to be assigned
> + * @quota: Quota allowed in this set
> + *
> + * Return 0 on success. If the new quota is smaller than the number of
> + * IOASIDs already allocated, -EINVAL will be returned. No change will be
> + * made to the existing quota.
> + */
> +int ioasid_adjust_set(struct ioasid_set *set, int quota)

@quota probably doesn't need to be signed (since it's the same as an
ioasid_t, which is unsigned).

> +{
> + int ret = 0;
> +
> + if (quota <= 0)
> + return -EINVAL;
> +
> + spin_lock(&ioasid_allocator_lock);
> + if (set->nr_ioasids > quota) {
> + pr_err("New quota %d is smaller than outstanding IOASIDs %d\n",
> + quota, set->nr_ioasids);
> + ret = -EINVAL;
> + goto done_unlock;
> + }
> +
> + if ((quota > set->quota) &&
> + (quota - set->quota > ioasid_capacity_avail)) {

misaligned

> + ret = -ENOSPC;
> + goto done_unlock;
> + }
> +
> + /* Return the delta back to system pool */
> + ioasid_capacity_avail += set->quota - quota;
> +
> + /*
> + * May have a policy to prevent giving all available IOASIDs
> + * to one set. But we don't enforce here, it should be in the
> + * upper layers.
> + */

I think here would be OK to check about fairness. But anyway, we don't
have to worry about this yet, so I'd just drop the comment.

Thanks,
Jean

> + set->quota = quota;
> +
> +done_unlock:
> + spin_unlock(&ioasid_allocator_lock);
> +
> + return ret;
> +}
> +EXPORT_SYMBOL_GPL(ioasid_adjust_set);
> +
> +/**
> * ioasid_find - Find IOASID data
> * @set: the IOASID set
> * @ioasid: the IOASID to find
> diff --git a/include/linux/ioasid.h b/include/linux/ioasid.h
> index 1ae213b660f0..0a5e82148eb9 100644
> --- a/include/linux/ioasid.h
> +++ b/include/linux/ioasid.h
> @@ -62,6 +62,7 @@ struct ioasid_allocator_ops {
> void ioasid_install_capacity(ioasid_t total);
> ioasid_t ioasid_get_capacity(void);
> struct ioasid_set *ioasid_set_alloc(void *token, ioasid_t quota, int type);
> +int ioasid_adjust_set(struct ioasid_set *set, int quota);
> void ioasid_set_get(struct ioasid_set *set);
> void ioasid_set_put(struct ioasid_set *set);
>
> @@ -99,6 +100,11 @@ static inline struct ioasid_set *ioasid_set_alloc(void *token, ioasid_t quota, i
> return ERR_PTR(-ENOTSUPP);
> }
>
> +static inline int ioasid_adjust_set(struct ioasid_set *set, int quota)
> +{
> + return -ENOTSUPP;
> +}
> +
> static inline void ioasid_set_put(struct ioasid_set *set)
> {
> }
> --
> 2.7.4
>

2020-10-22 03:24:46

by Jean-Philippe Brucker

[permalink] [raw]
Subject: Re: [PATCH v3 04/14] iommu/ioasid: Support setting system-wide capacity

On Mon, Sep 28, 2020 at 02:38:31PM -0700, Jacob Pan wrote:
> IOASID is a system-wide resource that could vary on different systems.
> The default capacity is 20 bits as defined in the PCI-E specifications.
> This patch adds a function to allow adjusting system IOASID capacity.
> For VT-d this is set during boot as part of the Intel IOMMU
> initialization.
>
> Signed-off-by: Jacob Pan <[email protected]>

Reviewed-by: Jean-Philippe Brucker <[email protected]>

> ---
> drivers/iommu/intel/iommu.c | 5 +++++
> drivers/iommu/ioasid.c | 20 ++++++++++++++++++++
> include/linux/ioasid.h | 11 +++++++++++
> 3 files changed, 36 insertions(+)
>
> diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
> index 18ed3b3c70d7..e7bcb299e51e 100644
> --- a/drivers/iommu/intel/iommu.c
> +++ b/drivers/iommu/intel/iommu.c
> @@ -42,6 +42,7 @@
> #include <linux/crash_dump.h>
> #include <linux/numa.h>
> #include <linux/swiotlb.h>
> +#include <linux/ioasid.h>

(not in alphabetical order)

> #include <asm/irq_remapping.h>
> #include <asm/cacheflush.h>
> #include <asm/iommu.h>
> @@ -3331,6 +3332,10 @@ static int __init init_dmars(void)
> if (ret)
> goto free_iommu;
>
> + /* PASID is needed for scalable mode irrespective to SVM */
> + if (intel_iommu_sm)
> + ioasid_install_capacity(intel_pasid_max_id);
> +
> /*
> * for each drhd
> * enable fault log
> diff --git a/drivers/iommu/ioasid.c b/drivers/iommu/ioasid.c
> index 6cfbdfb492e0..4277cb17e15b 100644
> --- a/drivers/iommu/ioasid.c
> +++ b/drivers/iommu/ioasid.c
> @@ -10,6 +10,10 @@
> #include <linux/spinlock.h>
> #include <linux/xarray.h>
>
> +/* Default to PCIe standard 20 bit PASID */
> +#define PCI_PASID_MAX 0x100000
> +static ioasid_t ioasid_capacity = PCI_PASID_MAX;
> +static ioasid_t ioasid_capacity_avail = PCI_PASID_MAX;
> struct ioasid_data {
> ioasid_t id;
> struct ioasid_set *set;
> @@ -17,6 +21,22 @@ struct ioasid_data {
> struct rcu_head rcu;
> };
>
> +void ioasid_install_capacity(ioasid_t total)
> +{
> + if (ioasid_capacity && ioasid_capacity != PCI_PASID_MAX) {
> + pr_warn("IOASID capacity is already set.\n");
> + return;
> + }
> + ioasid_capacity = ioasid_capacity_avail = total;
> +}
> +EXPORT_SYMBOL_GPL(ioasid_install_capacity);
> +
> +ioasid_t ioasid_get_capacity(void)
> +{
> + return ioasid_capacity;
> +}
> +EXPORT_SYMBOL_GPL(ioasid_get_capacity);
> +
> /*
> * struct ioasid_allocator_data - Internal data structure to hold information
> * about an allocator. There are two types of allocators:
> diff --git a/include/linux/ioasid.h b/include/linux/ioasid.h
> index c7f649fa970a..7fc320656be2 100644
> --- a/include/linux/ioasid.h
> +++ b/include/linux/ioasid.h
> @@ -32,6 +32,8 @@ struct ioasid_allocator_ops {
> #define DECLARE_IOASID_SET(name) struct ioasid_set name = { 0 }
>
> #if IS_ENABLED(CONFIG_IOASID)
> +void ioasid_install_capacity(ioasid_t total);
> +ioasid_t ioasid_get_capacity(void);
> ioasid_t ioasid_alloc(struct ioasid_set *set, ioasid_t min, ioasid_t max,
> void *private);
> void ioasid_free(ioasid_t ioasid);
> @@ -42,6 +44,15 @@ void ioasid_unregister_allocator(struct ioasid_allocator_ops *allocator);
> int ioasid_attach_data(ioasid_t ioasid, void *data);
> void ioasid_detach_data(ioasid_t ioasid);
> #else /* !CONFIG_IOASID */
> +static inline void ioasid_install_capacity(ioasid_t total)
> +{
> +}
> +
> +static inline ioasid_t ioasid_get_capacity(void)
> +{
> + return 0;
> +}
> +
> static inline ioasid_t ioasid_alloc(struct ioasid_set *set, ioasid_t min,
> ioasid_t max, void *private)
> {
> --
> 2.7.4
>

2020-10-22 03:25:14

by Jean-Philippe Brucker

[permalink] [raw]
Subject: Re: [PATCH v3 08/14] iommu/ioasid: Add reference couting functions

On Mon, Sep 28, 2020 at 02:38:35PM -0700, Jacob Pan wrote:
> There can be multiple users of an IOASID, each user could have hardware
> contexts associated with the IOASID. In order to align lifecycles,
> reference counting is introduced in this patch. It is expected that when
> an IOASID is being freed, each user will drop a reference only after its
> context is cleared.
>
> Signed-off-by: Jacob Pan <[email protected]>
> ---
> drivers/iommu/ioasid.c | 117 +++++++++++++++++++++++++++++++++++++++++++++++++
> include/linux/ioasid.h | 24 ++++++++++
> 2 files changed, 141 insertions(+)
>
> diff --git a/drivers/iommu/ioasid.c b/drivers/iommu/ioasid.c
> index 9628e78b2ab4..828cc44b1b1c 100644
> --- a/drivers/iommu/ioasid.c
> +++ b/drivers/iommu/ioasid.c
> @@ -16,8 +16,26 @@ static ioasid_t ioasid_capacity = PCI_PASID_MAX;
> static ioasid_t ioasid_capacity_avail = PCI_PASID_MAX;
> static DEFINE_XARRAY_ALLOC(ioasid_sets);
>
> +enum ioasid_state {
> + IOASID_STATE_INACTIVE,
> + IOASID_STATE_ACTIVE,
> + IOASID_STATE_FREE_PENDING,
> +};
> +
> +/**
> + * struct ioasid_data - Meta data about ioasid
> + *
> + * @id: Unique ID
> + * @users: Number of active users
> + * @state: Track state of the IOASID
> + * @set: ioasid_set of the IOASID belongs to
> + * @private: Private data associated with the IOASID
> + * @rcu: For free after RCU grace period
> + */
> struct ioasid_data {
> ioasid_t id;
> + refcount_t users;
> + enum ioasid_state state;
> struct ioasid_set *set;
> void *private;
> struct rcu_head rcu;
> @@ -511,6 +529,8 @@ ioasid_t ioasid_alloc(struct ioasid_set *set, ioasid_t min, ioasid_t max,
> goto exit_free;
> }
> data->id = id;
> + data->state = IOASID_STATE_ACTIVE;
> + refcount_set(&data->users, 1);
>
> /* Store IOASID in the per set data */
> if (xa_err(xa_store(&set->xa, id, data, GFP_ATOMIC))) {
> @@ -560,6 +580,14 @@ static void ioasid_free_locked(struct ioasid_set *set, ioasid_t ioasid)
> if (WARN_ON(!xa_load(&ioasid_sets, data->set->id)))
> return;
>
> + /* Free is already in progress */
> + if (data->state == IOASID_STATE_FREE_PENDING)
> + return;

But the previous call to ioasid_free_locked() dropped a reference, then
returned because more refs where held. Shouldn't this call also
dec_and_test() the reference and call ioasid_do_free_locked() if
necessary?

> +
> + data->state = IOASID_STATE_FREE_PENDING;
> + if (!refcount_dec_and_test(&data->users))
> + return;
> +
> ioasid_do_free_locked(data);
> }
>
> @@ -717,6 +745,95 @@ void ioasid_set_for_each_ioasid(struct ioasid_set *set,
> }
> EXPORT_SYMBOL_GPL(ioasid_set_for_each_ioasid);
>
> +int ioasid_get_locked(struct ioasid_set *set, ioasid_t ioasid)
> +{
> + struct ioasid_data *data;
> +
> + data = xa_load(&active_allocator->xa, ioasid);
> + if (!data) {
> + pr_err("Trying to get unknown IOASID %u\n", ioasid);
> + return -EINVAL;
> + }
> + if (data->state == IOASID_STATE_FREE_PENDING) {
> + pr_err("Trying to get IOASID being freed%u\n", ioasid);

Strange placement of the %u

> + return -EBUSY;
> + }
> +
> + /* Check set ownership if the set is non-null */
> + if (set && data->set != set) {
> + pr_err("Trying to get IOASID %u outside the set\n", ioasid);
> + /* data found but does not belong to the set */
> + return -EACCES;
> + }
> + refcount_inc(&data->users);
> +
> + return 0;
> +}
> +EXPORT_SYMBOL_GPL(ioasid_get_locked);

If this is a public facing let's add a lockdep_assert_held() to make sure
they do hold the right lock. Same for ioasid_put_locked().

Thanks,
Jean

> +
> +/**
> + * ioasid_get - Obtain a reference to an ioasid
> + * @set: the ioasid_set to check permission against if not NULL
> + * @ioasid: the ID to remove
> + *
> + *
> + * Return: 0 on success, error if failed.
> + */
> +int ioasid_get(struct ioasid_set *set, ioasid_t ioasid)
> +{
> + int ret;
> +
> + spin_lock(&ioasid_allocator_lock);
> + ret = ioasid_get_locked(set, ioasid);
> + spin_unlock(&ioasid_allocator_lock);
> +
> + return ret;
> +}
> +EXPORT_SYMBOL_GPL(ioasid_get);
> +
> +bool ioasid_put_locked(struct ioasid_set *set, ioasid_t ioasid)
> +{
> + struct ioasid_data *data;
> +
> + data = xa_load(&active_allocator->xa, ioasid);
> + if (!data) {
> + pr_err("Trying to put unknown IOASID %u\n", ioasid);
> + return false;
> + }
> + if (set && data->set != set) {
> + pr_err("Trying to drop IOASID %u outside the set\n", ioasid);
> + return false;
> + }
> + if (!refcount_dec_and_test(&data->users))
> + return false;
> +
> + ioasid_do_free_locked(data);
> +
> + return true;
> +}
> +EXPORT_SYMBOL_GPL(ioasid_put_locked);
> +
> +/**
> + * ioasid_put - Release a reference to an ioasid
> + * @set: the ioasid_set to check permission against if not NULL
> + * @ioasid: the ID to remove
> + *
> + * Put a reference to the IOASID, free it when the number of references drops to
> + * zero.
> + *
> + * Return: %true if the IOASID was freed, %false otherwise.
> + */
> +bool ioasid_put(struct ioasid_set *set, ioasid_t ioasid)
> +{
> + bool ret;
> +
> + spin_lock(&ioasid_allocator_lock);
> + ret = ioasid_put_locked(set, ioasid);
> + spin_unlock(&ioasid_allocator_lock);
> + return ret;
> +}
> +EXPORT_SYMBOL_GPL(ioasid_put);
> +
> /**
> * ioasid_find - Find IOASID data
> * @set: the IOASID set
> diff --git a/include/linux/ioasid.h b/include/linux/ioasid.h
> index aab58bc26714..16d421357173 100644
> --- a/include/linux/ioasid.h
> +++ b/include/linux/ioasid.h
> @@ -73,6 +73,10 @@ void *ioasid_find(struct ioasid_set *set, ioasid_t ioasid,
> bool (*getter)(void *));
> int ioasid_register_allocator(struct ioasid_allocator_ops *allocator);
> void ioasid_unregister_allocator(struct ioasid_allocator_ops *allocator);
> +int ioasid_get(struct ioasid_set *set, ioasid_t ioasid);
> +int ioasid_get_locked(struct ioasid_set *set, ioasid_t ioasid);
> +bool ioasid_put(struct ioasid_set *set, ioasid_t ioasid);
> +bool ioasid_put_locked(struct ioasid_set *set, ioasid_t ioasid);
> int ioasid_attach_data(ioasid_t ioasid, void *data);
> void ioasid_detach_data(ioasid_t ioasid);
> void ioasid_set_for_each_ioasid(struct ioasid_set *sdata,
> @@ -112,6 +116,26 @@ static inline void ioasid_set_put(struct ioasid_set *set)
> {
> }
>
> +static inline int ioasid_get(struct ioasid_set *set, ioasid_t ioasid)
> +{
> + return -ENOTSUPP;
> +}
> +
> +static inline int ioasid_get_locked(struct ioasid_set *set, ioasid_t ioasid)
> +{
> + return -ENOTSUPP;
> +}
> +
> +static inline bool ioasid_put(struct ioasid_set *set, ioasid_t ioasid)
> +{
> + return false;
> +}
> +
> +static inline bool ioasid_put_locked(struct ioasid_set *set, ioasid_t ioasid)
> +{
> + return false;
> +}
> +
> static inline void *ioasid_find(struct ioasid_set *set, ioasid_t ioasid, bool (*getter)(void *))
> {
> return NULL;
> --
> 2.7.4
>

2020-10-22 03:44:22

by Jean-Philippe Brucker

[permalink] [raw]
Subject: Re: [PATCH v3 03/14] iommu/ioasid: Add a separate function for detach data

On Mon, Sep 28, 2020 at 02:38:30PM -0700, Jacob Pan wrote:
> IOASID private data can be cleared by ioasid_attach_data() with a NULL
> data pointer. A common use case is for a caller to free the data
> afterward. ioasid_attach_data() calls synchronize_rcu() before return
> such that free data can be sure without outstanding readers.
> However, since synchronize_rcu() may sleep, ioasid_attach_data() cannot
> be used under spinlocks.
>
> This patch adds ioasid_detach_data() as a separate API where
> synchronize_rcu() is called only in this case. ioasid_attach_data() can
> then be used under spinlocks. In addition, this change makes the API
> symmetrical.
>
> Signed-off-by: Jacob Pan <[email protected]>

A typo below, but

Reviewed-by: Jean-Philippe Brucker <[email protected]>

> ---
> drivers/iommu/intel/svm.c | 4 ++--
> drivers/iommu/ioasid.c | 54 ++++++++++++++++++++++++++++++++++++++---------
> include/linux/ioasid.h | 5 ++++-
> 3 files changed, 50 insertions(+), 13 deletions(-)
>
> diff --git a/drivers/iommu/intel/svm.c b/drivers/iommu/intel/svm.c
> index 2c5645f0737a..06a16bee7b65 100644
> --- a/drivers/iommu/intel/svm.c
> +++ b/drivers/iommu/intel/svm.c
> @@ -398,7 +398,7 @@ int intel_svm_bind_gpasid(struct iommu_domain *domain, struct device *dev,
> list_add_rcu(&sdev->list, &svm->devs);
> out:
> if (!IS_ERR_OR_NULL(svm) && list_empty(&svm->devs)) {
> - ioasid_attach_data(data->hpasid, NULL);
> + ioasid_detach_data(data->hpasid);
> kfree(svm);
> }
>
> @@ -441,7 +441,7 @@ int intel_svm_unbind_gpasid(struct device *dev, int pasid)
> * the unbind, IOMMU driver will get notified
> * and perform cleanup.
> */
> - ioasid_attach_data(pasid, NULL);
> + ioasid_detach_data(pasid);
> kfree(svm);
> }
> }
> diff --git a/drivers/iommu/ioasid.c b/drivers/iommu/ioasid.c
> index 5f63af07acd5..6cfbdfb492e0 100644
> --- a/drivers/iommu/ioasid.c
> +++ b/drivers/iommu/ioasid.c
> @@ -272,24 +272,58 @@ int ioasid_attach_data(ioasid_t ioasid, void *data)
>
> spin_lock(&ioasid_allocator_lock);
> ioasid_data = xa_load(&active_allocator->xa, ioasid);
> - if (ioasid_data)
> - rcu_assign_pointer(ioasid_data->private, data);
> - else
> +
> + if (!ioasid_data) {
> ret = -ENOENT;
> - spin_unlock(&ioasid_allocator_lock);
> + goto done_unlock;
> + }
>
> - /*
> - * Wait for readers to stop accessing the old private data, so the
> - * caller can free it.
> - */
> - if (!ret)
> - synchronize_rcu();
> + if (ioasid_data->private) {
> + ret = -EBUSY;
> + goto done_unlock;
> + }
> + rcu_assign_pointer(ioasid_data->private, data);
> +
> +done_unlock:
> + spin_unlock(&ioasid_allocator_lock);
>
> return ret;
> }
> EXPORT_SYMBOL_GPL(ioasid_attach_data);
>
> /**
> + * ioasid_detach_data - Clear the private data of an ioasid
> + *
> + * @ioasid: the IOASIDD to clear private data

IOASID

> + */
> +void ioasid_detach_data(ioasid_t ioasid)
> +{
> + struct ioasid_data *ioasid_data;
> +
> + spin_lock(&ioasid_allocator_lock);
> + ioasid_data = xa_load(&active_allocator->xa, ioasid);
> +
> + if (!ioasid_data) {
> + pr_warn("IOASID %u not found to detach data from\n", ioasid);
> + goto done_unlock;
> + }
> +
> + if (ioasid_data->private) {
> + rcu_assign_pointer(ioasid_data->private, NULL);
> + goto done_unlock;
> + }
> +
> +done_unlock:
> + spin_unlock(&ioasid_allocator_lock);
> + /*
> + * Wait for readers to stop accessing the old private data,
> + * so the caller can free it.
> + */
> + synchronize_rcu();
> +}
> +EXPORT_SYMBOL_GPL(ioasid_detach_data);
> +
> +/**
> * ioasid_alloc - Allocate an IOASID
> * @set: the IOASID set
> * @min: the minimum ID (inclusive)
> diff --git a/include/linux/ioasid.h b/include/linux/ioasid.h
> index 9c44947a68c8..c7f649fa970a 100644
> --- a/include/linux/ioasid.h
> +++ b/include/linux/ioasid.h
> @@ -40,7 +40,7 @@ void *ioasid_find(struct ioasid_set *set, ioasid_t ioasid,
> int ioasid_register_allocator(struct ioasid_allocator_ops *allocator);
> void ioasid_unregister_allocator(struct ioasid_allocator_ops *allocator);
> int ioasid_attach_data(ioasid_t ioasid, void *data);
> -
> +void ioasid_detach_data(ioasid_t ioasid);
> #else /* !CONFIG_IOASID */
> static inline ioasid_t ioasid_alloc(struct ioasid_set *set, ioasid_t min,
> ioasid_t max, void *private)
> @@ -72,5 +72,8 @@ static inline int ioasid_attach_data(ioasid_t ioasid, void *data)
> return -ENOTSUPP;
> }
>
> +static inline void ioasid_detach_data(ioasid_t ioasid)
> +{
> +}
> #endif /* CONFIG_IOASID */
> #endif /* __LINUX_IOASID_H */
> --
> 2.7.4
>

2020-10-22 03:44:25

by Jean-Philippe Brucker

[permalink] [raw]
Subject: Re: [PATCH v3 05/14] iommu/ioasid: Redefine IOASID set and allocation APIs

On Mon, Sep 28, 2020 at 02:38:32PM -0700, Jacob Pan wrote:
> ioasid_set was introduced as an arbitrary token that is shared by a
> group of IOASIDs. For example, two IOASIDs allocated via the same
> ioasid_set pointer belong to the same set.
>
> For guest SVA usages, system-wide IOASID resources need to be
> partitioned such that each VM can have its own quota and being managed
> separately. ioasid_set is the perfect candidate for meeting such
> requirements. This patch redefines and extends ioasid_set with the
> following new fields:
> - Quota
> - Reference count
> - Storage of its namespace
> - The token is now stored in the ioasid_set with types
>
> Basic ioasid_set level APIs are introduced that wire up these new data.
> Existing users of IOASID APIs are converted where a host IOASID set is
> allocated for bare-metal usage.
>
> Signed-off-by: Liu Yi L <[email protected]>
> Signed-off-by: Jacob Pan <[email protected]>
> ---
> drivers/iommu/intel/iommu.c | 26 +++--
> drivers/iommu/intel/pasid.h | 1 +
> drivers/iommu/intel/svm.c | 25 ++--
> drivers/iommu/ioasid.c | 277 ++++++++++++++++++++++++++++++++++++++++----
> include/linux/ioasid.h | 53 +++++++--
> 5 files changed, 333 insertions(+), 49 deletions(-)
>
> diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
> index e7bcb299e51e..872391890323 100644
> --- a/drivers/iommu/intel/iommu.c
> +++ b/drivers/iommu/intel/iommu.c
> @@ -104,6 +104,9 @@
> */
> #define INTEL_IOMMU_PGSIZES (~0xFFFUL)
>
> +/* PASIDs used by host SVM */
> +struct ioasid_set *host_pasid_set;
> +
> static inline int agaw_to_level(int agaw)
> {
> return agaw + 2;
> @@ -3147,8 +3150,8 @@ static void intel_vcmd_ioasid_free(ioasid_t ioasid, void *data)
> * Sanity check the ioasid owner is done at upper layer, e.g. VFIO
> * We can only free the PASID when all the devices are unbound.
> */
> - if (ioasid_find(NULL, ioasid, NULL)) {
> - pr_alert("Cannot free active IOASID %d\n", ioasid);
> + if (IS_ERR(ioasid_find(host_pasid_set, ioasid, NULL))) {
> + pr_err("IOASID %d to be freed but not in system set\n", ioasid);
> return;
> }
> vcmd_free_pasid(iommu, ioasid);
> @@ -3333,8 +3336,17 @@ static int __init init_dmars(void)
> goto free_iommu;
>
> /* PASID is needed for scalable mode irrespective to SVM */
> - if (intel_iommu_sm)
> + if (intel_iommu_sm) {
> ioasid_install_capacity(intel_pasid_max_id);
> + /* We should not run out of IOASIDs at boot */
> + host_pasid_set = ioasid_set_alloc(NULL, PID_MAX_DEFAULT,
> + IOASID_SET_TYPE_NULL);
> + if (IS_ERR_OR_NULL(host_pasid_set)) {
> + pr_err("Failed to allocate host PASID set %lu\n",

Printing a negative value as unsigned

> + PTR_ERR(host_pasid_set));
> + intel_iommu_sm = 0;
> + }
> + }
>
> /*
> * for each drhd
> @@ -3381,7 +3393,7 @@ static int __init init_dmars(void)
> disable_dmar_iommu(iommu);
> free_dmar_iommu(iommu);
> }
> -
> + ioasid_set_put(host_pasid_set);
> kfree(g_iommus);
>
> error:
> @@ -5163,7 +5175,7 @@ static void auxiliary_unlink_device(struct dmar_domain *domain,
> domain->auxd_refcnt--;
>
> if (!domain->auxd_refcnt && domain->default_pasid > 0)
> - ioasid_free(domain->default_pasid);
> + ioasid_free(host_pasid_set, domain->default_pasid);
> }
>
> static int aux_domain_add_dev(struct dmar_domain *domain,
> @@ -5181,7 +5193,7 @@ static int aux_domain_add_dev(struct dmar_domain *domain,
> int pasid;
>
> /* No private data needed for the default pasid */
> - pasid = ioasid_alloc(NULL, PASID_MIN,
> + pasid = ioasid_alloc(host_pasid_set, PASID_MIN,
> pci_max_pasids(to_pci_dev(dev)) - 1,
> NULL);
> if (pasid == INVALID_IOASID) {
> @@ -5224,7 +5236,7 @@ static int aux_domain_add_dev(struct dmar_domain *domain,
> spin_unlock(&iommu->lock);
> spin_unlock_irqrestore(&device_domain_lock, flags);
> if (!domain->auxd_refcnt && domain->default_pasid > 0)
> - ioasid_free(domain->default_pasid);
> + ioasid_free(host_pasid_set, domain->default_pasid);
>
> return ret;
> }
> diff --git a/drivers/iommu/intel/pasid.h b/drivers/iommu/intel/pasid.h
> index c9850766c3a9..ccdc23446015 100644
> --- a/drivers/iommu/intel/pasid.h
> +++ b/drivers/iommu/intel/pasid.h
> @@ -99,6 +99,7 @@ static inline bool pasid_pte_is_present(struct pasid_entry *pte)
> }
>
> extern u32 intel_pasid_max_id;
> +extern struct ioasid_set *host_pasid_set;
> int intel_pasid_alloc_id(void *ptr, int start, int end, gfp_t gfp);
> void intel_pasid_free_id(int pasid);
> void *intel_pasid_lookup_id(int pasid);
> diff --git a/drivers/iommu/intel/svm.c b/drivers/iommu/intel/svm.c
> index 06a16bee7b65..2e764e283469 100644
> --- a/drivers/iommu/intel/svm.c
> +++ b/drivers/iommu/intel/svm.c
> @@ -228,7 +228,9 @@ static LIST_HEAD(global_svm_list);
> list_for_each_entry((sdev), &(svm)->devs, list) \
> if ((d) != (sdev)->dev) {} else
>
> -static int pasid_to_svm_sdev(struct device *dev, unsigned int pasid,
> +static int pasid_to_svm_sdev(struct device *dev,
> + struct ioasid_set *set,
> + unsigned int pasid,
> struct intel_svm **rsvm,
> struct intel_svm_dev **rsdev)
> {
> @@ -242,7 +244,7 @@ static int pasid_to_svm_sdev(struct device *dev, unsigned int pasid,
> if (pasid == INVALID_IOASID || pasid >= PASID_MAX)
> return -EINVAL;
>
> - svm = ioasid_find(NULL, pasid, NULL);
> + svm = ioasid_find(set, pasid, NULL);
> if (IS_ERR(svm))
> return PTR_ERR(svm);
>
> @@ -312,7 +314,8 @@ int intel_svm_bind_gpasid(struct iommu_domain *domain, struct device *dev,
> dmar_domain = to_dmar_domain(domain);
>
> mutex_lock(&pasid_mutex);
> - ret = pasid_to_svm_sdev(dev, data->hpasid, &svm, &sdev);
> + ret = pasid_to_svm_sdev(dev, NULL,
> + data->hpasid, &svm, &sdev);
> if (ret)
> goto out;
>
> @@ -417,7 +420,7 @@ int intel_svm_unbind_gpasid(struct device *dev, int pasid)
> return -EINVAL;
>
> mutex_lock(&pasid_mutex);
> - ret = pasid_to_svm_sdev(dev, pasid, &svm, &sdev);
> + ret = pasid_to_svm_sdev(dev, NULL, pasid, &svm, &sdev);
> if (ret)
> goto out;
>
> @@ -555,7 +558,7 @@ intel_svm_bind_mm(struct device *dev, int flags, struct svm_dev_ops *ops,
> pasid_max = intel_pasid_max_id;
>
> /* Do not use PASID 0, reserved for RID to PASID */
> - svm->pasid = ioasid_alloc(NULL, PASID_MIN,
> + svm->pasid = ioasid_alloc(host_pasid_set, PASID_MIN,
> pasid_max - 1, svm);
> if (svm->pasid == INVALID_IOASID) {
> kfree(svm);
> @@ -572,7 +575,7 @@ intel_svm_bind_mm(struct device *dev, int flags, struct svm_dev_ops *ops,
> if (mm) {
> ret = mmu_notifier_register(&svm->notifier, mm);
> if (ret) {
> - ioasid_free(svm->pasid);
> + ioasid_free(host_pasid_set, svm->pasid);
> kfree(svm);
> kfree(sdev);
> goto out;
> @@ -590,7 +593,7 @@ intel_svm_bind_mm(struct device *dev, int flags, struct svm_dev_ops *ops,
> if (ret) {
> if (mm)
> mmu_notifier_unregister(&svm->notifier, mm);
> - ioasid_free(svm->pasid);
> + ioasid_free(host_pasid_set, svm->pasid);
> kfree(svm);
> kfree(sdev);
> goto out;
> @@ -638,7 +641,8 @@ static int intel_svm_unbind_mm(struct device *dev, int pasid)
> if (!iommu)
> goto out;
>
> - ret = pasid_to_svm_sdev(dev, pasid, &svm, &sdev);
> + ret = pasid_to_svm_sdev(dev, host_pasid_set,
> + pasid, &svm, &sdev);
> if (ret)
> goto out;
>
> @@ -659,7 +663,7 @@ static int intel_svm_unbind_mm(struct device *dev, int pasid)
> kfree_rcu(sdev, rcu);
>
> if (list_empty(&svm->devs)) {
> - ioasid_free(svm->pasid);
> + ioasid_free(host_pasid_set, svm->pasid);
> if (svm->mm)
> mmu_notifier_unregister(&svm->notifier, svm->mm);
> list_del(&svm->list);
> @@ -1130,7 +1134,8 @@ int intel_svm_page_response(struct device *dev,
> goto out;
> }
>
> - ret = pasid_to_svm_sdev(dev, prm->pasid, &svm, &sdev);
> + ret = pasid_to_svm_sdev(dev, host_pasid_set,
> + prm->pasid, &svm, &sdev);
> if (ret || !sdev) {
> ret = -ENODEV;
> goto out;
> diff --git a/drivers/iommu/ioasid.c b/drivers/iommu/ioasid.c
> index 4277cb17e15b..61e25c2375ab 100644
> --- a/drivers/iommu/ioasid.c
> +++ b/drivers/iommu/ioasid.c
> @@ -1,8 +1,8 @@
> // SPDX-License-Identifier: GPL-2.0
> /*
> * I/O Address Space ID allocator. There is one global IOASID space, split into
> - * subsets. Users create a subset with DECLARE_IOASID_SET, then allocate and
> - * free IOASIDs with ioasid_alloc and ioasid_free.
> + * sets. Users create a set with ioasid_set_alloc, then allocate/free IDs
> + * with ioasid_alloc and ioasid_free.
> */
> #include <linux/ioasid.h>
> #include <linux/module.h>
> @@ -14,6 +14,8 @@
> #define PCI_PASID_MAX 0x100000
> static ioasid_t ioasid_capacity = PCI_PASID_MAX;
> static ioasid_t ioasid_capacity_avail = PCI_PASID_MAX;
> +static DEFINE_XARRAY_ALLOC(ioasid_sets);
> +
> struct ioasid_data {
> ioasid_t id;
> struct ioasid_set *set;
> @@ -343,6 +345,114 @@ void ioasid_detach_data(ioasid_t ioasid)
> }
> EXPORT_SYMBOL_GPL(ioasid_detach_data);
>
> +static inline bool ioasid_set_is_valid(struct ioasid_set *set)

'inline' isn't needed

> +{
> + return xa_load(&ioasid_sets, set->id) == set;
> +}
> +
> +/**
> + * ioasid_set_alloc - Allocate a new IOASID set for a given token
> + *
> + * @token: An optional arbitrary number that can be associated with the
> + * IOASID set. @token can be NULL if the type is
> + * IOASID_SET_TYPE_NULL
> + * @quota: Quota allowed in this set
> + * @type: The type of the token used to create the IOASID set
> + *
> + * IOASID is limited system-wide resource that requires quota management.
> + * Token will be stored in the ioasid_set returned. A reference will be taken
> + * on the newly created set. Subsequent IOASID allocation within the set need
> + * to use the retured ioasid_set pointer.

returned

> + */
> +struct ioasid_set *ioasid_set_alloc(void *token, ioasid_t quota, int type)
> +{
> + struct ioasid_set *set;
> + unsigned long index;
> + ioasid_t id;
> +
> + if (type >= IOASID_SET_TYPE_NR)
> + return ERR_PTR(-EINVAL);
> +
> + /* REVISIT: may support set w/o quota, use system available */

You can drop this comment, anyone wanting to pass a quota of zero can send
a patch.

> + if (!quota)
> + return ERR_PTR(-EINVAL);
> +
> + /*
> + * Need to check space available if we share system-wide quota.
> + * TODO: we may need to support quota free sets in the future.

You can drop this too

> + */
> + spin_lock(&ioasid_allocator_lock);
> + if (quota > ioasid_capacity_avail) {
> + pr_warn("Out of IOASID capacity! ask %d, avail %d\n",
> + quota, ioasid_capacity_avail);
> + set = ERR_PTR(-ENOSPC);
> + goto exit_unlock;
> + }
> +
> + /*
> + * Token is only unique within its types but right now we have only
> + * mm type. If we have more token types, we have to match type as well.
> + */
> + switch (type) {
> + case IOASID_SET_TYPE_MM:
> + if (!token) {
> + set = ERR_PTR(-EINVAL);
> + goto exit_unlock;
> + }
> + /* Search existing set tokens, reject duplicates */
> + xa_for_each(&ioasid_sets, index, set) {
> + if (set->token == token &&
> + set->type == IOASID_SET_TYPE_MM) {

misaligned

> + set = ERR_PTR(-EEXIST);
> + goto exit_unlock;
> + }
> + }
> + break;
> + case IOASID_SET_TYPE_NULL:
> + if (!token)
> + break;
> + fallthrough;
> + default:
> + pr_err("Invalid token and IOASID type\n");
> + set = ERR_PTR(-EINVAL);
> + goto exit_unlock;
> + }
> +
> + set = kzalloc(sizeof(*set), GFP_ATOMIC);
> + if (!set) {
> + set = ERR_PTR(-ENOMEM);
> + goto exit_unlock;
> + }
> +
> + if (xa_alloc(&ioasid_sets, &id, set,
> + XA_LIMIT(0, ioasid_capacity_avail - quota),
> + GFP_ATOMIC)) {
> + kfree(set);
> + set = ERR_PTR(-ENOSPC);
> + goto exit_unlock;
> + }
> +
> + set->token = token;
> + set->type = type;
> + set->quota = quota;
> + set->id = id;
> + refcount_set(&set->ref, 1);
> +
> + /*
> + * Per set XA is used to store private IDs within the set, get ready

Could drop "get ready". But I think the comment on the ioasid_set struct
is sufficient, maybe you can drop this whole comment.

> + * for ioasid_set private ID and system-wide IOASID allocation
> + * results.
> + */
> + xa_init(&set->xa);
> + ioasid_capacity_avail -= quota;
> +
> +exit_unlock:
> + spin_unlock(&ioasid_allocator_lock);
> +
> + return set;
> +}
> +EXPORT_SYMBOL_GPL(ioasid_set_alloc);
> +
> /**
> * ioasid_alloc - Allocate an IOASID
> * @set: the IOASID set
> @@ -360,11 +470,22 @@ ioasid_t ioasid_alloc(struct ioasid_set *set, ioasid_t min, ioasid_t max,
> {
> struct ioasid_data *data;
> void *adata;
> - ioasid_t id;
> + ioasid_t id = INVALID_IOASID;
> +
> + spin_lock(&ioasid_allocator_lock);
> + /* Check if the IOASID set has been allocated and initialized */
> + if (WARN_ON(!ioasid_set_is_valid(set)))
> + goto done_unlock;
> +
> + if (set->quota <= set->nr_ioasids) {
> + pr_err_ratelimited("IOASID set out of quota %d\n",
> + set->quota);
> + goto done_unlock;
> + }
>
> data = kzalloc(sizeof(*data), GFP_ATOMIC);
> if (!data)
> - return INVALID_IOASID;
> + goto done_unlock;
>
> data->set = set;
> data->private = private;
> @@ -373,11 +494,10 @@ ioasid_t ioasid_alloc(struct ioasid_set *set, ioasid_t min, ioasid_t max,
> * Custom allocator needs allocator data to perform platform specific
> * operations.
> */
> - spin_lock(&ioasid_allocator_lock);
> adata = active_allocator->flags & IOASID_ALLOCATOR_CUSTOM ? active_allocator->ops->pdata : data;
> id = active_allocator->ops->alloc(min, max, adata);
> if (id == INVALID_IOASID) {
> - pr_err("Failed ASID allocation %lu\n", active_allocator->flags);
> + pr_err_ratelimited("Failed IOASID allocation %lu\n", active_allocator->flags);
> goto exit_free;
> }
>
> @@ -386,46 +506,153 @@ ioasid_t ioasid_alloc(struct ioasid_set *set, ioasid_t min, ioasid_t max,
> /* Custom allocator needs framework to store and track allocation results */
> pr_err("Failed to alloc ioasid from %d\n", id);
> active_allocator->ops->free(id, active_allocator->ops->pdata);
> + if (active_allocator->flags & IOASID_ALLOCATOR_CUSTOM)
> + xa_erase(&active_allocator->xa, id);

That doesn't look right, given we get here if the CUSTOM flag is set and
xa_alloc() fails. Should it be in the next block?

> goto exit_free;
> }
> data->id = id;
>
> - spin_unlock(&ioasid_allocator_lock);
> - return id;
> + /* Store IOASID in the per set data */
> + if (xa_err(xa_store(&set->xa, id, data, GFP_ATOMIC))) {
> + pr_err_ratelimited("Failed to store ioasid %d in set\n", id);
> + active_allocator->ops->free(id, active_allocator->ops->pdata);
> + goto exit_free;

Should set id = INVALID_IOASID. I feel this function would be more
readable if we keep separate error and success paths.

> + }
> + set->nr_ioasids++;
> + goto done_unlock;
> exit_free:
> - spin_unlock(&ioasid_allocator_lock);
> kfree(data);
> - return INVALID_IOASID;
> +done_unlock:
> + spin_unlock(&ioasid_allocator_lock);
> + return id;
> }
> EXPORT_SYMBOL_GPL(ioasid_alloc);
>
> -/**
> - * ioasid_free - Free an IOASID
> - * @ioasid: the ID to remove
> - */
> -void ioasid_free(ioasid_t ioasid)
> +static void ioasid_do_free_locked(struct ioasid_data *data)
> {
> struct ioasid_data *ioasid_data;
>
> - spin_lock(&ioasid_allocator_lock);
> - ioasid_data = xa_load(&active_allocator->xa, ioasid);
> - if (!ioasid_data) {
> - pr_err("Trying to free unknown IOASID %u\n", ioasid);
> - goto exit_unlock;
> - }
> -
> - active_allocator->ops->free(ioasid, active_allocator->ops->pdata);
> + active_allocator->ops->free(data->id, active_allocator->ops->pdata);
> /* Custom allocator needs additional steps to free the xa element */
> if (active_allocator->flags & IOASID_ALLOCATOR_CUSTOM) {
> - ioasid_data = xa_erase(&active_allocator->xa, ioasid);
> + ioasid_data = xa_erase(&active_allocator->xa, data->id);
> kfree_rcu(ioasid_data, rcu);
> }
>
> -exit_unlock:
> + data->set->nr_ioasids--;
> + xa_erase(&data->set->xa, data->id);
> +}
> +
> +static void ioasid_free_locked(struct ioasid_set *set, ioasid_t ioasid)
> +{
> + struct ioasid_data *data;
> +
> + data = xa_load(&active_allocator->xa, ioasid);
> + if (!data) {
> + pr_err("Trying to free unknown IOASID %u\n", ioasid);
> + return;
> + }
> + if (data->set != set) {
> + pr_warn("Cannot free IOASID %u due to set ownership\n", ioasid);
> + return;
> + }
> + /* Check if the set exists */
> + if (WARN_ON(!xa_load(&ioasid_sets, data->set->id)))
> + return;
> +
> + ioasid_do_free_locked(data);
> +}
> +
> +/**
> + * ioasid_free - Drop reference on an IOASID. Free if refcount drops to 0,
> + * including free from its set and system-wide list.
> + * @set: The ioasid_set to check permission with. If not NULL, IOASID
> + * free will fail if the set does not match.
> + * @ioasid: The IOASID to remove
> + */
> +void ioasid_free(struct ioasid_set *set, ioasid_t ioasid)
> +{
> + spin_lock(&ioasid_allocator_lock);
> + ioasid_free_locked(set, ioasid);
> spin_unlock(&ioasid_allocator_lock);
> }
> EXPORT_SYMBOL_GPL(ioasid_free);
>
> +static void ioasid_set_get_locked(struct ioasid_set *set)
> +{
> + if (WARN_ON(!ioasid_set_is_valid(set)))
> + return;
> +
> + refcount_inc(&set->ref);
> +}
> +
> +void ioasid_set_get(struct ioasid_set *set)
> +{
> + spin_lock(&ioasid_allocator_lock);
> + ioasid_set_get_locked(set);
> + spin_unlock(&ioasid_allocator_lock);
> +}
> +EXPORT_SYMBOL_GPL(ioasid_set_get);
> +
> +static void ioasid_set_put_locked(struct ioasid_set *set)
> +{
> + struct ioasid_data *entry;
> + unsigned long index;
> +
> + if (WARN_ON(!ioasid_set_is_valid(set)))
> + return;
> +
> + if (!refcount_dec_and_test(&set->ref))
> + return;
> +
> + /* The set is already empty, we just destroy the set. */
> + if (xa_empty(&set->xa))
> + goto done_destroy;

Do we need this optimization, if xa_for_each() is allowed on the empty xa?
This function doesn't really need to be fast. Otherwise you can just test
nr_ioasids.

> +
> + /*
> + * Free all PASIDs from system-wide IOASID pool, all subscribers gets
> + * notified and do cleanup of their own.
> + * Note that some references of the IOASIDs within the set can still
> + * be held after the free call. This is OK in that the IOASIDs will be
> + * marked inactive, the only operations can be done is ioasid_put.
> + * No need to track IOASID set states since there is no reclaim phase.
> + */
> + xa_for_each(&set->xa, index, entry) {
> + ioasid_free_locked(set, index);
> + /* Free from per set private pool */
> + xa_erase(&set->xa, index);
> + }
> +
> +done_destroy:
> + /* Return the quota back to system pool */
> + ioasid_capacity_avail += set->quota;
> +
> + /*
> + * Token got released right away after the ioasid_set is freed.
> + * If a new set is created immediately with the newly released token,
> + * it will not allocate the same IOASIDs unless they are reclaimed.
> + */
> + xa_erase(&ioasid_sets, set->id);
> + kfree_rcu(set, rcu);
> +}
> +
> +/**
> + * ioasid_set_put - Drop a reference to the IOASID set. Free all IOASIDs within
> + * the set if there are no more users.
> + *
> + * @set: The IOASID set ID to be freed
> + *
> + * If refcount drops to zero, all IOASIDs allocated within the set will be
> + * freed.
> + */
> +void ioasid_set_put(struct ioasid_set *set)
> +{
> + spin_lock(&ioasid_allocator_lock);
> + ioasid_set_put_locked(set);
> + spin_unlock(&ioasid_allocator_lock);
> +}
> +EXPORT_SYMBOL_GPL(ioasid_set_put);
> +
> /**
> * ioasid_find - Find IOASID data
> * @set: the IOASID set
> diff --git a/include/linux/ioasid.h b/include/linux/ioasid.h
> index 7fc320656be2..1ae213b660f0 100644
> --- a/include/linux/ioasid.h
> +++ b/include/linux/ioasid.h
> @@ -4,14 +4,43 @@
>
> #include <linux/types.h>
> #include <linux/errno.h>
> +#include <linux/xarray.h>
> +#include <linux/refcount.h>
>
> #define INVALID_IOASID ((ioasid_t)-1)
> typedef unsigned int ioasid_t;
> typedef ioasid_t (*ioasid_alloc_fn_t)(ioasid_t min, ioasid_t max, void *data);
> typedef void (*ioasid_free_fn_t)(ioasid_t ioasid, void *data);
>
> +/* IOASID set types */
> +enum ioasid_set_type {
> + IOASID_SET_TYPE_NULL = 1, /* Set token is NULL */
> + IOASID_SET_TYPE_MM, /* Set token is a mm_struct pointer
> + * i.e. associated with a process
> + */
> + IOASID_SET_TYPE_NR,
> +};
> +
> +/**
> + * struct ioasid_set - Meta data about ioasid_set
> + * @xa: XArray to store ioasid_set private IDs, can be used for
> + * guest-host IOASID mapping, or just a private IOASID namespace.
> + * @token: Unique to identify an IOASID set
> + * @type: Token types
> + * @quota: Max number of IOASIDs can be allocated within the set
> + * @nr_ioasids: Number of IOASIDs currently allocated in the set
> + * @id: ID of the set
> + * @ref: Reference count of the users
> + */
> struct ioasid_set {
> - int dummy;
> + struct xarray xa;
> + void *token;
> + int type;

Could be an 'enum ioasid_set_type'

> + int quota;
> + int nr_ioasids;

These could be unsigned, or ioasid_t

> + int id;

unsigned as well

> + refcount_t ref;
> + struct rcu_head rcu;
> };
>
> /**
> @@ -29,14 +58,16 @@ struct ioasid_allocator_ops {
> void *pdata;
> };
>
> -#define DECLARE_IOASID_SET(name) struct ioasid_set name = { 0 }
> -
> #if IS_ENABLED(CONFIG_IOASID)
> void ioasid_install_capacity(ioasid_t total);
> ioasid_t ioasid_get_capacity(void);
> +struct ioasid_set *ioasid_set_alloc(void *token, ioasid_t quota, int type);
> +void ioasid_set_get(struct ioasid_set *set);
> +void ioasid_set_put(struct ioasid_set *set);
> +
> ioasid_t ioasid_alloc(struct ioasid_set *set, ioasid_t min, ioasid_t max,
> void *private);
> -void ioasid_free(ioasid_t ioasid);
> +void ioasid_free(struct ioasid_set *set, ioasid_t ioasid);
> void *ioasid_find(struct ioasid_set *set, ioasid_t ioasid,
> bool (*getter)(void *));
> int ioasid_register_allocator(struct ioasid_allocator_ops *allocator);
> @@ -59,12 +90,20 @@ static inline ioasid_t ioasid_alloc(struct ioasid_set *set, ioasid_t min,
> return INVALID_IOASID;
> }
>
> -static inline void ioasid_free(ioasid_t ioasid)
> +static inline void ioasid_free(struct ioasid_set *set, ioasid_t ioasid)
> +{
> +}
> +
> +static inline struct ioasid_set *ioasid_set_alloc(void *token, ioasid_t quota, int type)
> +{
> + return ERR_PTR(-ENOTSUPP);
> +}
> +
> +static inline void ioasid_set_put(struct ioasid_set *set)
> {
> }

A stub is missing for ioasid_set_get()

>
> -static inline void *ioasid_find(struct ioasid_set *set, ioasid_t ioasid,
> - bool (*getter)(void *))
> +static inline void *ioasid_find(struct ioasid_set *set, ioasid_t ioasid, bool (*getter)(void *))

Spurious change?

Thanks,
Jean

> {
> return NULL;
> }
> --
> 2.7.4
>

2020-10-22 03:45:24

by Jean-Philippe Brucker

[permalink] [raw]
Subject: Re: [PATCH v3 07/14] iommu/ioasid: Add an iterator API for ioasid_set

On Mon, Sep 28, 2020 at 02:38:34PM -0700, Jacob Pan wrote:
> Users of an ioasid_set may not keep track of all the IOASIDs allocated
> under the set. When collective actions are needed for each IOASIDs, it
> is useful to iterate over all the IOASIDs within the set. For example,
> when the ioasid_set is freed, the user might perform the same cleanup
> operation on each IOASID.
>
> This patch adds an API to iterate all the IOASIDs within the set.
>
> Signed-off-by: Jacob Pan <[email protected]>

Could add a short description of the function parameters, but

Reviewed-by: Jean-Philippe Brucker <[email protected]>

> ---
> drivers/iommu/ioasid.c | 17 +++++++++++++++++
> include/linux/ioasid.h | 9 +++++++++
> 2 files changed, 26 insertions(+)
>
> diff --git a/drivers/iommu/ioasid.c b/drivers/iommu/ioasid.c
> index cf8c7d34e2de..9628e78b2ab4 100644
> --- a/drivers/iommu/ioasid.c
> +++ b/drivers/iommu/ioasid.c
> @@ -701,6 +701,23 @@ int ioasid_adjust_set(struct ioasid_set *set, int quota)
> EXPORT_SYMBOL_GPL(ioasid_adjust_set);
>
> /**
> + * ioasid_set_for_each_ioasid - Iterate over all the IOASIDs within the set
> + *
> + * Caller must hold a reference of the set and handles its own locking.
> + */
> +void ioasid_set_for_each_ioasid(struct ioasid_set *set,
> + void (*fn)(ioasid_t id, void *data),
> + void *data)
> +{
> + struct ioasid_data *entry;
> + unsigned long index;
> +
> + xa_for_each(&set->xa, index, entry)
> + fn(index, data);
> +}
> +EXPORT_SYMBOL_GPL(ioasid_set_for_each_ioasid);
> +
> +/**
> * ioasid_find - Find IOASID data
> * @set: the IOASID set
> * @ioasid: the IOASID to find
> diff --git a/include/linux/ioasid.h b/include/linux/ioasid.h
> index 0a5e82148eb9..aab58bc26714 100644
> --- a/include/linux/ioasid.h
> +++ b/include/linux/ioasid.h
> @@ -75,6 +75,9 @@ int ioasid_register_allocator(struct ioasid_allocator_ops *allocator);
> void ioasid_unregister_allocator(struct ioasid_allocator_ops *allocator);
> int ioasid_attach_data(ioasid_t ioasid, void *data);
> void ioasid_detach_data(ioasid_t ioasid);
> +void ioasid_set_for_each_ioasid(struct ioasid_set *sdata,
> + void (*fn)(ioasid_t id, void *data),
> + void *data);
> #else /* !CONFIG_IOASID */
> static inline void ioasid_install_capacity(ioasid_t total)
> {
> @@ -131,5 +134,11 @@ static inline int ioasid_attach_data(ioasid_t ioasid, void *data)
> static inline void ioasid_detach_data(ioasid_t ioasid)
> {
> }
> +
> +static inline void ioasid_set_for_each_ioasid(struct ioasid_set *sdata,
> + void (*fn)(ioasid_t id, void *data),
> + void *data)
> +{
> +}
> #endif /* CONFIG_IOASID */
> #endif /* __LINUX_IOASID_H */
> --
> 2.7.4
>

2020-10-23 11:43:43

by Jean-Philippe Brucker

[permalink] [raw]
Subject: Re: [PATCH v3 10/14] iommu/ioasid: Introduce notification APIs

On Mon, Sep 28, 2020 at 02:38:37PM -0700, Jacob Pan wrote:
> Relations among IOASID users largely follow a publisher-subscriber
> pattern. E.g. to support guest SVA on Intel Scalable I/O Virtualization
> (SIOV) enabled platforms, VFIO, IOMMU, device drivers, KVM are all users
> of IOASIDs. When a state change occurs, VFIO publishes the change event
> that needs to be processed by other users/subscribers.
>
> This patch introduced two types of notifications: global and per
> ioasid_set. The latter is intended for users who only needs to handle
> events related to the IOASID of a given set.
> For more information, refer to the kernel documentation at
> Documentation/ioasid.rst.
>
> Signed-off-by: Liu Yi L <[email protected]>
> Signed-off-by: Wu Hao <[email protected]>
> Signed-off-by: Jacob Pan <[email protected]>
> ---
> drivers/iommu/ioasid.c | 141 +++++++++++++++++++++++++++++++++++++++++++++++++
> include/linux/ioasid.h | 57 +++++++++++++++++++-
> 2 files changed, 197 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/iommu/ioasid.c b/drivers/iommu/ioasid.c
> index 378fef8f23d9..894b17c06ead 100644
> --- a/drivers/iommu/ioasid.c
> +++ b/drivers/iommu/ioasid.c
> @@ -10,12 +10,35 @@
> #include <linux/spinlock.h>
> #include <linux/xarray.h>
>
> +/*
> + * An IOASID can have multiple consumers where each consumer may have
> + * hardware contexts associated with the IOASID.
> + * When a status change occurs, like on IOASID deallocation, notifier chains
> + * are used to keep the consumers in sync.
> + * This is a publisher-subscriber pattern where publisher can change the
> + * state of each IOASID, e.g. alloc/free, bind IOASID to a device and mm.
> + * On the other hand, subscribers get notified for the state change and
> + * keep local states in sync.
> + */
> +static ATOMIC_NOTIFIER_HEAD(ioasid_notifier);
> +/* List to hold pending notification block registrations */
> +static LIST_HEAD(ioasid_nb_pending_list);

nit: it might be tidier to deal with the pending list in the next patch,
since it's only relevant for mm set notifier

> +static DEFINE_SPINLOCK(ioasid_nb_lock);
> +
> /* Default to PCIe standard 20 bit PASID */
> #define PCI_PASID_MAX 0x100000
> static ioasid_t ioasid_capacity = PCI_PASID_MAX;
> static ioasid_t ioasid_capacity_avail = PCI_PASID_MAX;
> static DEFINE_XARRAY_ALLOC(ioasid_sets);
>
> +struct ioasid_set_nb {
> + struct list_head list;
> + struct notifier_block *nb;
> + void *token;
> + struct ioasid_set *set;
> + bool active;
> +};
> +
> enum ioasid_state {
> IOASID_STATE_INACTIVE,
> IOASID_STATE_ACTIVE,
> @@ -365,6 +388,42 @@ void ioasid_detach_data(ioasid_t ioasid)
> }
> EXPORT_SYMBOL_GPL(ioasid_detach_data);
>
> +/**
> + * ioasid_notify - Send notification on a given IOASID for status change.
> + *
> + * @data: The IOASID data to which the notification will send
> + * @cmd: Notification event sent by IOASID external users, can be
> + * IOASID_BIND or IOASID_UNBIND.
> + *
> + * @flags: Special instructions, e.g. notify within a set or global by
> + * IOASID_NOTIFY_FLAG_SET or IOASID_NOTIFY_FLAG_ALL flags
> + * Caller must hold ioasid_allocator_lock and reference to the IOASID
> + */
> +static int ioasid_notify(struct ioasid_data *data,
> + enum ioasid_notify_val cmd, unsigned int flags)
> +{
> + struct ioasid_nb_args args = { 0 };
> + int ret = 0;
> +
> + /* IOASID_FREE/ALLOC are internal events emitted by IOASID core only */
> + if (cmd <= IOASID_NOTIFY_FREE)
> + return -EINVAL;

This function is now called with ALLOC and FREE

> +
> + if (flags & ~(IOASID_NOTIFY_FLAG_ALL | IOASID_NOTIFY_FLAG_SET))
> + return -EINVAL;
> +
> + args.id = data->id;
> + args.set = data->set;
> + args.pdata = data->private;
> + args.spid = data->spid;
> + if (flags & IOASID_NOTIFY_FLAG_ALL)
> + ret = atomic_notifier_call_chain(&ioasid_notifier, cmd, &args);
> + if (flags & IOASID_NOTIFY_FLAG_SET)
> + ret = atomic_notifier_call_chain(&data->set->nh, cmd, &args);

If both flags are set, we'll miss errors within the global notification.
Is that a problem?

> +
> + return ret;
> +}
> +
> static ioasid_t ioasid_find_by_spid_locked(struct ioasid_set *set, ioasid_t spid)
> {
> ioasid_t ioasid = INVALID_IOASID;
> @@ -417,6 +476,7 @@ int ioasid_attach_spid(ioasid_t ioasid, ioasid_t spid)
> goto done_unlock;
> }
> data->spid = spid;
> + ioasid_notify(data, IOASID_NOTIFY_BIND, IOASID_NOTIFY_FLAG_SET);
>
> done_unlock:
> spin_unlock(&ioasid_allocator_lock);
> @@ -436,6 +496,7 @@ void ioasid_detach_spid(ioasid_t ioasid)
> goto done_unlock;
> }
> data->spid = INVALID_IOASID;
> + ioasid_notify(data, IOASID_NOTIFY_UNBIND, IOASID_NOTIFY_FLAG_SET);
>
> done_unlock:
> spin_unlock(&ioasid_allocator_lock);
> @@ -469,6 +530,28 @@ static inline bool ioasid_set_is_valid(struct ioasid_set *set)
> return xa_load(&ioasid_sets, set->id) == set;
> }
>
> +static void ioasid_add_pending_nb(struct ioasid_set *set)
> +{
> + struct ioasid_set_nb *curr;
> +
> + if (set->type != IOASID_SET_TYPE_MM)
> + return;
> +
> + /*
> + * Check if there are any pending nb requests for the given token, if so
> + * add them to the notifier chain.
> + */
> + spin_lock(&ioasid_nb_lock);
> + list_for_each_entry(curr, &ioasid_nb_pending_list, list) {
> + if (curr->token == set->token && !curr->active) {
> + atomic_notifier_chain_register(&set->nh, curr->nb);
> + curr->set = set;
> + curr->active = true;
> + }
> + }
> + spin_unlock(&ioasid_nb_lock);
> +}
> +
> /**
> * ioasid_set_alloc - Allocate a new IOASID set for a given token
> *
> @@ -556,6 +639,13 @@ struct ioasid_set *ioasid_set_alloc(void *token, ioasid_t quota, int type)
> set->quota = quota;
> set->id = id;
> refcount_set(&set->ref, 1);
> + ATOMIC_INIT_NOTIFIER_HEAD(&set->nh);
> +
> + /*
> + * Check if there are any pending nb requests for the given token, if so
> + * add them to the notifier chain.
> + */
> + ioasid_add_pending_nb(set);
>
> /*
> * Per set XA is used to store private IDs within the set, get ready
> @@ -641,6 +731,7 @@ ioasid_t ioasid_alloc(struct ioasid_set *set, ioasid_t min, ioasid_t max,
> goto exit_free;
> }
> set->nr_ioasids++;
> + ioasid_notify(data, IOASID_NOTIFY_ALLOC, IOASID_NOTIFY_FLAG_SET);
> goto done_unlock;
> exit_free:
> kfree(data);
> @@ -687,6 +778,8 @@ static void ioasid_free_locked(struct ioasid_set *set, ioasid_t ioasid)
> return;
>
> data->state = IOASID_STATE_FREE_PENDING;
> + ioasid_notify(data, IOASID_NOTIFY_FREE,
> + IOASID_NOTIFY_FLAG_ALL | IOASID_NOTIFY_FLAG_ALL);
> if (!refcount_dec_and_test(&data->users))
> return;
>
> @@ -726,6 +819,7 @@ EXPORT_SYMBOL_GPL(ioasid_set_get);
>
> static void ioasid_set_put_locked(struct ioasid_set *set)
> {
> + struct ioasid_set_nb *curr;
> struct ioasid_data *entry;
> unsigned long index;
>
> @@ -757,6 +851,16 @@ static void ioasid_set_put_locked(struct ioasid_set *set)
> /* Return the quota back to system pool */
> ioasid_capacity_avail += set->quota;
>
> + /* Restore pending status of the set NBs */
> + list_for_each_entry(curr, &ioasid_nb_pending_list, list) {
> + if (curr->token == set->token) {
> + if (curr->active)
> + curr->active = false;
> + else
> + pr_warn("Set token exists but not active!\n");
> + }
> + }
> +
> /*
> * Token got released right away after the ioasid_set is freed.
> * If a new set is created immediately with the newly released token,
> @@ -778,7 +882,9 @@ static void ioasid_set_put_locked(struct ioasid_set *set)
> void ioasid_set_put(struct ioasid_set *set)
> {
> spin_lock(&ioasid_allocator_lock);
> + spin_lock(&ioasid_nb_lock);
> ioasid_set_put_locked(set);
> + spin_unlock(&ioasid_nb_lock);
> spin_unlock(&ioasid_allocator_lock);
> }
> EXPORT_SYMBOL_GPL(ioasid_set_put);
> @@ -980,6 +1086,41 @@ void *ioasid_find(struct ioasid_set *set, ioasid_t ioasid,
> }
> EXPORT_SYMBOL_GPL(ioasid_find);
>
> +int ioasid_register_notifier(struct ioasid_set *set, struct notifier_block *nb)

A comment might be useful, explaining that @set is optional and that the
caller should set a priority within ioasid_notifier_prios in @nb.

> +{
> + if (set)
> + return atomic_notifier_chain_register(&set->nh, nb);
> + else
> + return atomic_notifier_chain_register(&ioasid_notifier, nb);
> +}
> +EXPORT_SYMBOL_GPL(ioasid_register_notifier);
> +

Here as well, a comment saying that a reference to the set must be held,
though maybe that's obvious.

Thanks,
Jean

> +void ioasid_unregister_notifier(struct ioasid_set *set,
> + struct notifier_block *nb)
> +{
> + struct ioasid_set_nb *curr;
> +
> + spin_lock(&ioasid_nb_lock);
> + /*
> + * Pending list is registered with a token without an ioasid_set,
> + * therefore should not be unregistered directly.
> + */
> + list_for_each_entry(curr, &ioasid_nb_pending_list, list) {
> + if (curr->nb == nb) {
> + pr_warn("Cannot unregister NB from pending list\n");
> + spin_unlock(&ioasid_nb_lock);
> + return;
> + }
> + }
> + spin_unlock(&ioasid_nb_lock);
> +
> + if (set)
> + atomic_notifier_chain_unregister(&set->nh, nb);
> + else
> + atomic_notifier_chain_unregister(&ioasid_notifier, nb);
> +}
> +EXPORT_SYMBOL_GPL(ioasid_unregister_notifier);
> +
> MODULE_AUTHOR("Jean-Philippe Brucker <[email protected]>");
> MODULE_AUTHOR("Jacob Pan <[email protected]>");
> MODULE_DESCRIPTION("IO Address Space ID (IOASID) allocator");
> diff --git a/include/linux/ioasid.h b/include/linux/ioasid.h
> index 2dfe85e6cb7e..1b551c99d568 100644
> --- a/include/linux/ioasid.h
> +++ b/include/linux/ioasid.h
> @@ -23,6 +23,7 @@ enum ioasid_set_type {
>
> /**
> * struct ioasid_set - Meta data about ioasid_set
> + * @nh: List of notifiers private to that set
> * @xa: XArray to store ioasid_set private IDs, can be used for
> * guest-host IOASID mapping, or just a private IOASID namespace.
> * @token: Unique to identify an IOASID set
> @@ -33,6 +34,7 @@ enum ioasid_set_type {
> * @ref: Reference count of the users
> */
> struct ioasid_set {
> + struct atomic_notifier_head nh;
> struct xarray xa;
> void *token;
> int type;
> @@ -58,6 +60,47 @@ struct ioasid_allocator_ops {
> void *pdata;
> };
>
> +/* Notification data when IOASID status changed */
> +enum ioasid_notify_val {
> + IOASID_NOTIFY_ALLOC = 1,
> + IOASID_NOTIFY_FREE,
> + IOASID_NOTIFY_BIND,
> + IOASID_NOTIFY_UNBIND,
> +};
> +
> +#define IOASID_NOTIFY_FLAG_ALL BIT(0)
> +#define IOASID_NOTIFY_FLAG_SET BIT(1)
> +/**
> + * enum ioasid_notifier_prios - IOASID event notification order
> + *
> + * When status of an IOASID changes, users might need to take actions to
> + * reflect the new state. For example, when an IOASID is freed due to
> + * exception, the hardware context in virtual CPU, DMA device, and IOMMU
> + * shall be cleared and drained. Order is required to prevent life cycle
> + * problems.
> + */
> +enum ioasid_notifier_prios {
> + IOASID_PRIO_LAST,
> + IOASID_PRIO_DEVICE,
> + IOASID_PRIO_IOMMU,
> + IOASID_PRIO_CPU,
> +};
> +
> +/**
> + * struct ioasid_nb_args - Argument provided by IOASID core when notifier
> + * is called.
> + * @id: The IOASID being notified
> + * @spid: The set private ID associated with the IOASID
> + * @set: The IOASID set of @id
> + * @pdata: The private data attached to the IOASID
> + */
> +struct ioasid_nb_args {
> + ioasid_t id;
> + ioasid_t spid;
> + struct ioasid_set *set;
> + void *pdata;
> +};
> +
> #if IS_ENABLED(CONFIG_IOASID)
> void ioasid_install_capacity(ioasid_t total);
> ioasid_t ioasid_get_capacity(void);
> @@ -82,7 +125,10 @@ void ioasid_detach_data(ioasid_t ioasid);
> int ioasid_attach_spid(ioasid_t ioasid, ioasid_t spid);
> void ioasid_detach_spid(ioasid_t ioasid);
> ioasid_t ioasid_find_by_spid(struct ioasid_set *set, ioasid_t spid);
> -
> +int ioasid_register_notifier(struct ioasid_set *set,
> + struct notifier_block *nb);
> +void ioasid_unregister_notifier(struct ioasid_set *set,
> + struct notifier_block *nb);
> void ioasid_set_for_each_ioasid(struct ioasid_set *sdata,
> void (*fn)(ioasid_t id, void *data),
> void *data);
> @@ -145,6 +191,15 @@ static inline void *ioasid_find(struct ioasid_set *set, ioasid_t ioasid, bool (*
> return NULL;
> }
>
> +static inline int ioasid_register_notifier(struct notifier_block *nb)
> +{
> + return -ENOTSUPP;
> +}
> +
> +static inline void ioasid_unregister_notifier(struct notifier_block *nb)
> +{
> +}
> +
> static inline int ioasid_register_allocator(struct ioasid_allocator_ops *allocator)
> {
> return -ENOTSUPP;
> --
> 2.7.4
>

2020-10-23 11:43:54

by Jean-Philippe Brucker

[permalink] [raw]
Subject: Re: [PATCH v3 09/14] iommu/ioasid: Introduce ioasid_set private ID

On Mon, Sep 28, 2020 at 02:38:36PM -0700, Jacob Pan wrote:
> When an IOASID set is used for guest SVA, each VM will acquire its
> ioasid_set for IOASID allocations. IOASIDs within the VM must have a
> host/physical IOASID backing, mapping between guest and host IOASIDs can
> be non-identical. IOASID set private ID (SPID) is introduced in this
> patch to be used as guest IOASID. However, the concept of ioasid_set
> specific namespace is generic, thus named SPID.
>
> As SPID namespace is within the IOASID set, the IOASID core can provide
> lookup services at both directions. SPIDs may not be available when its
> IOASID is allocated, the mapping between SPID and IOASID is usually
> established when a guest page table is bound to a host PASID.
>
> Signed-off-by: Jacob Pan <[email protected]>
> ---
> drivers/iommu/ioasid.c | 102 +++++++++++++++++++++++++++++++++++++++++++++++++
> include/linux/ioasid.h | 19 +++++++++
> 2 files changed, 121 insertions(+)
>
> diff --git a/drivers/iommu/ioasid.c b/drivers/iommu/ioasid.c
> index 828cc44b1b1c..378fef8f23d9 100644
> --- a/drivers/iommu/ioasid.c
> +++ b/drivers/iommu/ioasid.c
> @@ -26,6 +26,7 @@ enum ioasid_state {
> * struct ioasid_data - Meta data about ioasid
> *
> * @id: Unique ID
> + * @spid: Private ID unique within a set
> * @users: Number of active users
> * @state: Track state of the IOASID
> * @set: ioasid_set of the IOASID belongs to
> @@ -34,6 +35,7 @@ enum ioasid_state {
> */
> struct ioasid_data {
> ioasid_t id;
> + ioasid_t spid;
> refcount_t users;
> enum ioasid_state state;
> struct ioasid_set *set;
> @@ -363,6 +365,105 @@ void ioasid_detach_data(ioasid_t ioasid)
> }
> EXPORT_SYMBOL_GPL(ioasid_detach_data);
>
> +static ioasid_t ioasid_find_by_spid_locked(struct ioasid_set *set, ioasid_t spid)
> +{
> + ioasid_t ioasid = INVALID_IOASID;
> + struct ioasid_data *entry;
> + unsigned long index;
> +
> + if (!xa_load(&ioasid_sets, set->id)) {
> + pr_warn("Invalid set\n");

Could use ioasid_set_is_valid(), and perhaps a WARN_ON() instead of
pr_warn() since this is a programming error.

> + goto done;

Or just return INVALID_IOASID

> + }
> +
> + xa_for_each(&set->xa, index, entry) {
> + if (spid == entry->spid) {
> + refcount_inc(&entry->users);
> + ioasid = index;

break

> + }
> + }
> +done:
> + return ioasid;
> +}
> +
> +/**
> + * ioasid_attach_spid - Attach ioasid_set private ID to an IOASID
> + *
> + * @ioasid: the system-wide IOASID to attach
> + * @spid: the ioasid_set private ID of @ioasid
> + *
> + * After attching SPID, future lookup can be done via ioasid_find_by_spid().

attaching

> + */
> +int ioasid_attach_spid(ioasid_t ioasid, ioasid_t spid)
> +{
> + struct ioasid_data *data;
> + int ret = 0;
> +
> + if (spid == INVALID_IOASID)
> + return -EINVAL;
> +
> + spin_lock(&ioasid_allocator_lock);
> + data = xa_load(&active_allocator->xa, ioasid);
> +
> + if (!data) {
> + pr_err("No IOASID entry %d to attach SPID %d\n",
> + ioasid, spid);
> + ret = -ENOENT;
> + goto done_unlock;
> + }
> + /* Check if SPID is unique within the set */
> + if (ioasid_find_by_spid_locked(data->set, spid) != INVALID_IOASID) {

We need an additional parameter to ioasid_find_by_spid_locked(), telling
it not to take a reference to the conflicting entry. Here we return with
the reference, which will never be released.

> + ret = -EINVAL;
> + goto done_unlock;
> + }
> + data->spid = spid;
> +
> +done_unlock:
> + spin_unlock(&ioasid_allocator_lock);
> + return ret;
> +}
> +EXPORT_SYMBOL_GPL(ioasid_attach_spid);
> +
> +void ioasid_detach_spid(ioasid_t ioasid)

Could add a small comment to this public function

Thanks,
Jean

> +{
> + struct ioasid_data *data;
> +
> + spin_lock(&ioasid_allocator_lock);
> + data = xa_load(&active_allocator->xa, ioasid);
> +
> + if (!data || data->spid == INVALID_IOASID) {
> + pr_err("Invalid IOASID entry %d to detach\n", ioasid);
> + goto done_unlock;
> + }
> + data->spid = INVALID_IOASID;
> +
> +done_unlock:
> + spin_unlock(&ioasid_allocator_lock);
> +}
> +EXPORT_SYMBOL_GPL(ioasid_detach_spid);
> +
> +/**
> + * ioasid_find_by_spid - Find the system-wide IOASID by a set private ID and
> + * its set.
> + *
> + * @set: the ioasid_set to search within
> + * @spid: the set private ID
> + *
> + * Given a set private ID and its IOASID set, find the system-wide IOASID. Take
> + * a reference upon finding the matching IOASID. Return INVALID_IOASID if the
> + * IOASID is not found in the set or the set is not valid.
> + */
> +ioasid_t ioasid_find_by_spid(struct ioasid_set *set, ioasid_t spid)
> +{
> + ioasid_t ioasid;
> +
> + spin_lock(&ioasid_allocator_lock);
> + ioasid = ioasid_find_by_spid_locked(set, spid);
> + spin_unlock(&ioasid_allocator_lock);
> + return ioasid;
> +}
> +EXPORT_SYMBOL_GPL(ioasid_find_by_spid);
> +
> static inline bool ioasid_set_is_valid(struct ioasid_set *set)
> {
> return xa_load(&ioasid_sets, set->id) == set;
> @@ -529,6 +630,7 @@ ioasid_t ioasid_alloc(struct ioasid_set *set, ioasid_t min, ioasid_t max,
> goto exit_free;
> }
> data->id = id;
> + data->spid = INVALID_IOASID;
> data->state = IOASID_STATE_ACTIVE;
> refcount_set(&data->users, 1);
>
> diff --git a/include/linux/ioasid.h b/include/linux/ioasid.h
> index 16d421357173..2dfe85e6cb7e 100644
> --- a/include/linux/ioasid.h
> +++ b/include/linux/ioasid.h
> @@ -79,6 +79,10 @@ bool ioasid_put(struct ioasid_set *set, ioasid_t ioasid);
> bool ioasid_put_locked(struct ioasid_set *set, ioasid_t ioasid);
> int ioasid_attach_data(ioasid_t ioasid, void *data);
> void ioasid_detach_data(ioasid_t ioasid);
> +int ioasid_attach_spid(ioasid_t ioasid, ioasid_t spid);
> +void ioasid_detach_spid(ioasid_t ioasid);
> +ioasid_t ioasid_find_by_spid(struct ioasid_set *set, ioasid_t spid);
> +
> void ioasid_set_for_each_ioasid(struct ioasid_set *sdata,
> void (*fn)(ioasid_t id, void *data),
> void *data);
> @@ -159,6 +163,21 @@ static inline void ioasid_detach_data(ioasid_t ioasid)
> {
> }
>
> +static inline int ioasid_attach_spid(ioasid_t ioasid, ioasid_t spid)
> +{
> + return -ENOTSUPP;
> +}
> +
> +static inline void ioasid_detach_spid(ioasid_t ioasid)
> +{
> +}
> +
> +static inline ioasid_t ioasid_find_by_spid(struct ioasid_set *set,
> + ioasid_t spid)
> +{
> + return INVALID_IOASID;
> +}
> +
> static inline void ioasid_set_for_each_ioasid(struct ioasid_set *sdata,
> void (*fn)(ioasid_t id, void *data),
> void *data)
> --
> 2.7.4
>

2020-10-23 11:47:00

by Jean-Philippe Brucker

[permalink] [raw]
Subject: Re: [PATCH v3 11/14] iommu/ioasid: Support mm type ioasid_set notifications

On Mon, Sep 28, 2020 at 02:38:38PM -0700, Jacob Pan wrote:
> As a system-wide resource, IOASID is often shared by multiple kernel
> subsystems that are independent of each other. However, at the
> ioasid_set level, these kernel subsystems must communicate with each
> other for ownership checking, event notifications, etc. For example, on
> Intel Scalable IO Virtualization (SIOV) enabled platforms, KVM and VFIO
> instances under the same process/guest must be aware of a shared IOASID
> set.
> IOASID_SET_TYPE_MM token type was introduced to explicitly mark an
> IOASID set that belongs to a process, thus use the same mm_struct
> pointer as a token. Users of the same process can then identify with
> each other based on this token.
>
> This patch introduces MM token specific event registration APIs. Event
> subscribers such as KVM instances can register IOASID event handler
> without the knowledge of its ioasid_set. Event handlers are registered
> based on its mm_struct pointer as a token. In case when subscribers
> register handler *prior* to the creation of the ioasid_set, the
> handler’s notification block is stored in a pending list within IOASID
> core. Once the ioasid_set of the MM token is created, the notification
> block will be registered by the IOASID core.
>
> Signed-off-by: Liu Yi L <[email protected]>
> Signed-off-by: Wu Hao <[email protected]>
> Signed-off-by: Jacob Pan <[email protected]>
> ---
> drivers/iommu/ioasid.c | 117 +++++++++++++++++++++++++++++++++++++++++++++++++
> include/linux/ioasid.h | 15 +++++++
> 2 files changed, 132 insertions(+)
>
> diff --git a/drivers/iommu/ioasid.c b/drivers/iommu/ioasid.c
> index 894b17c06ead..d5faeb559a43 100644
> --- a/drivers/iommu/ioasid.c
> +++ b/drivers/iommu/ioasid.c
> @@ -889,6 +889,29 @@ void ioasid_set_put(struct ioasid_set *set)
> }
> EXPORT_SYMBOL_GPL(ioasid_set_put);
>
> +/*
> + * ioasid_find_mm_set - Retrieve IOASID set with mm token
> + * Take a reference of the set if found.
> + */
> +static struct ioasid_set *ioasid_find_mm_set(struct mm_struct *token)
> +{
> + struct ioasid_set *set;
> + unsigned long index;
> +
> + spin_lock(&ioasid_allocator_lock);

This could deadlock since ioasid_set_put() takes the nb_lock while holding
the allocator_lock.

> +
> + xa_for_each(&ioasid_sets, index, set) {
> + if (set->type == IOASID_SET_TYPE_MM && set->token == token) {
> + refcount_inc(&set->ref);
> + goto exit_unlock;
> + }
> + }
> + set = NULL;
> +exit_unlock:
> + spin_unlock(&ioasid_allocator_lock);
> + return set;
> +}
> +
> /**
> * ioasid_adjust_set - Adjust the quota of an IOASID set
> * @set: IOASID set to be assigned
> @@ -1121,6 +1144,100 @@ void ioasid_unregister_notifier(struct ioasid_set *set,
> }
> EXPORT_SYMBOL_GPL(ioasid_unregister_notifier);
>
> +/**
> + * ioasid_register_notifier_mm - Register a notifier block on the IOASID set
> + * created by the mm_struct pointer as the token
> + *
> + * @mm: the mm_struct token of the ioasid_set
> + * @nb: notfier block to be registered on the ioasid_set

Maybe add that the priority in @nb needs to be set by the caller using a
ioasid_notifier_priority value

> + *
> + * This a variant of ioasid_register_notifier() where the caller intends to
> + * listen to IOASID events belong the ioasid_set created under the same

that belong to

> + * process. Caller is not aware of the ioasid_set, no need to hold reference
> + * of the ioasid_set.
> + */
> +int ioasid_register_notifier_mm(struct mm_struct *mm, struct notifier_block *nb)
> +{
> + struct ioasid_set_nb *curr;
> + struct ioasid_set *set;
> + int ret = 0;
> +
> + if (!mm)
> + return -EINVAL;
> +
> + spin_lock(&ioasid_nb_lock);
> +
> + /* Check for duplicates, nb is unique per set */
> + list_for_each_entry(curr, &ioasid_nb_pending_list, list) {
> + if (curr->token == mm && curr->nb == nb) {
> + ret = -EBUSY;
> + goto exit_unlock;
> + }
> + }
> +
> + /* Check if the token has an existing set */
> + set = ioasid_find_mm_set(mm);
> + if (!set) {
> + /* Add to the rsvd list as inactive */
> + curr->active = false;

curr is invalid here

> + } else {
> + /* REVISIT: Only register empty set for now. Can add an option
> + * in the future to playback existing PASIDs.
> + */

You can drop this comment

> + if (set->nr_ioasids) {
> + pr_warn("IOASID set %d not empty\n", set->id);
> + ret = -EBUSY;
> + goto exit_unlock;
> + }
> + curr = kzalloc(sizeof(*curr), GFP_ATOMIC);

Needs to be in the common path, before this block

> + if (!curr) {
> + ret = -ENOMEM;
> + goto exit_unlock;
> + }
> + curr->token = mm;
> + curr->nb = nb;
> + curr->active = true;
> + curr->set = set;
> +
> + /* Set already created, add to the notifier chain */
> + atomic_notifier_chain_register(&set->nh, nb);
> + /*
> + * Do not hold a reference, if the set gets destroyed, the nb
> + * entry will be marked inactive.
> + */
> + ioasid_set_put(set);
> + }
> +
> + list_add(&curr->list, &ioasid_nb_pending_list);
> +
> +exit_unlock:
> + spin_unlock(&ioasid_nb_lock);
> + return ret;
> +}
> +EXPORT_SYMBOL_GPL(ioasid_register_notifier_mm);
> +
> +void ioasid_unregister_notifier_mm(struct mm_struct *mm, struct notifier_block *nb)
> +{
> + struct ioasid_set_nb *curr;
> +
> + spin_lock(&ioasid_nb_lock);
> + list_for_each_entry(curr, &ioasid_nb_pending_list, list) {
> + if (curr->token == mm && curr->nb == nb) {
> + list_del(&curr->list);
> + spin_unlock(&ioasid_nb_lock);
> + if (curr->active) {
> + atomic_notifier_chain_unregister(&curr->set->nh,
> + nb);

I think it's possible for ioasid_set_put_locked() to free the set right
after we release the lock, then this unregister() will be use-after-free.
Best keep holding the lock for this.

> + }
> + kfree(curr);
> + return;
> + }
> + }
> + pr_warn("No ioasid set found for mm token %llx\n", (u64)mm);

Use %p

Overall I think error messages in this series could be demoted to
pr_debug, since they're useful when developing users to this API, but
figuring out whether they can be user-triggered and need care is too much
work.

Thanks,
Jean

> + spin_unlock(&ioasid_nb_lock);
> +}
> +EXPORT_SYMBOL_GPL(ioasid_unregister_notifier_mm);
> +
> MODULE_AUTHOR("Jean-Philippe Brucker <[email protected]>");
> MODULE_AUTHOR("Jacob Pan <[email protected]>");
> MODULE_DESCRIPTION("IO Address Space ID (IOASID) allocator");
> diff --git a/include/linux/ioasid.h b/include/linux/ioasid.h
> index 1b551c99d568..c6cc855aadb6 100644
> --- a/include/linux/ioasid.h
> +++ b/include/linux/ioasid.h
> @@ -132,6 +132,9 @@ void ioasid_unregister_notifier(struct ioasid_set *set,
> void ioasid_set_for_each_ioasid(struct ioasid_set *sdata,
> void (*fn)(ioasid_t id, void *data),
> void *data);
> +int ioasid_register_notifier_mm(struct mm_struct *mm, struct notifier_block *nb);
> +void ioasid_unregister_notifier_mm(struct mm_struct *mm, struct notifier_block *nb);
> +
> #else /* !CONFIG_IOASID */
> static inline void ioasid_install_capacity(ioasid_t total)
> {
> @@ -238,5 +241,17 @@ static inline void ioasid_set_for_each_ioasid(struct ioasid_set *sdata,
> void *data)
> {
> }
> +
> +static inline int ioasid_register_notifier_mm(struct mm_struct *mm,
> + struct notifier_block *nb)
> +{
> + return -ENOTSUPP;
> +}
> +
> +static inline void ioasid_unregister_notifier_mm(struct mm_struct *mm,
> + struct notifier_block *nb)
> +{
> +}
> +
> #endif /* CONFIG_IOASID */
> #endif /* __LINUX_IOASID_H */
> --
> 2.7.4
>

2020-10-27 07:30:25

by Jacob Pan

[permalink] [raw]
Subject: Re: [PATCH v3 01/14] docs: Document IO Address Space ID (IOASID) APIs

Hi Jean-Philippe,

Thanks a lot for the review. Comments inline.

On Tue, 20 Oct 2020 15:58:09 +0200, Jean-Philippe Brucker
<[email protected]> wrote:

> On Mon, Sep 28, 2020 at 02:38:28PM -0700, Jacob Pan wrote:
> > IOASID is used to identify address spaces that can be targeted by device
> > DMA. It is a system-wide resource that is essential to its many users.
> > This document is an attempt to help developers from all vendors navigate
> > the APIs. At this time, ARM SMMU and Intel’s Scalable IO Virtualization
> > (SIOV) enabled platforms are the primary users of IOASID. Examples of
> > how SIOV components interact with IOASID APIs are provided in that many
> > APIs are driven by the requirements from SIOV.
> >
> > Cc: Jonathan Corbet <[email protected]>
> > Cc: [email protected]
> > Cc: Randy Dunlap <[email protected]>
> > Signed-off-by: Liu Yi L <[email protected]>
> > Signed-off-by: Wu Hao <[email protected]>
> > Signed-off-by: Jacob Pan <[email protected]>
>
> This looks good to me, with small comments below.
>
Can I add your Reviewed-by tag after addressing the comments?

> > ---
> > Documentation/driver-api/ioasid.rst | 648
> > ++++++++++++++++++++++++++++++++++++ 1 file changed, 648 insertions(+)
> > create mode 100644 Documentation/driver-api/ioasid.rst
> >
> > diff --git a/Documentation/driver-api/ioasid.rst
> > b/Documentation/driver-api/ioasid.rst new file mode 100644
> > index 000000000000..7f8e702997ab
> > --- /dev/null
> > +++ b/Documentation/driver-api/ioasid.rst
> > @@ -0,0 +1,648 @@
> > +.. SPDX-License-Identifier: GPL-2.0
> > +.. ioasid:
> > +
> > +===================
> > +IO Address Space ID
> > +===================
> > +
> > +IOASID is a generic name for PCIe Process Address ID (PASID) or ARM
> > +SMMU SubstreamID. An IOASID identifies an address space that DMA
> > +requests can target.
> > +
> > +The primary use cases for IOASID are Shared Virtual Address (SVA) and
> > +multiple IOVA spaces per device. However, the requirements for IOASID
> > +management can vary among hardware architectures.
> > +
> > +For baremetal IOVA, IOASID #0 is used for DMA request without
>
> bare metal
got it

>
> > +PASID. Even though some architectures such as VT-d also offers
> > +the flexibility of using any PASIDs for DMA request without PASID.
> > +PASID #0 is reserved and not allocated from any ioasid_set.
> > +
> > +Multiple IOVA spaces per device are mapped to auxiliary domains which
> > +can be used for mediated device assignment with and without a virtual
> > +IOMMU (vIOMMU). An IOASID is allocated for each auxiliary domain as
> > default +PASID. Without vIOMMU, default IOASID is used for DMA map/unmap
> > +APIs. With vIOMMU, default IOASID is used for guest IOVA where DMA
> > +request with PASID is required for the device. The reason is that
> > +there is only one PASID #0 per device, e.g. VT-d, RID_PASID is per
> > PCI
>
> on VT-d
got that
>
> > +device.
> > +
> > +This document covers the generic features supported by IOASID
> > +APIs. Vendor-specific use cases are also illustrated with Intel's VT-d
> > +based platforms as the first example.
> > +
> > +.. contents:: :local:
> > +
> > +Glossary
> > +========
> > +PASID - Process Address Space ID
> > +
> > +IOASID - IO Address Space ID (generic term for PCIe PASID and
> > +SubstreamID in SMMU)
> > +
> > +SVA/SVM - Shared Virtual Addressing/Memory
> > +
> > +ENQCMD - Intel X86 ISA for efficient workqueue submission [1]
> > +!!!TODO: Link to Spec at the bottom
>
> Yes, or maybe hyperlinks at the end of this section would be better. There
> are references and lists all over the document so keeping things as close
> as possible avoids confusion.
>
sounds good.

> > +
> > +DSA - Intel Data Streaming Accelerator [2]
> > +
> > +VDCM - Virtual Device Composition Module [3]
> > +
> > +SIOV - Intel Scalable IO Virtualization
> > +
> > +
> > +Key Concepts
> > +============
> > +
> > +IOASID Set
> > +-----------
> > +An IOASID set is a group of IOASIDs allocated from the system-wide
> > +IOASID pool. Refer to IOASID set APIs for more details.
>
> IOASID Set Level APIs
>
Yes, I should use the exact section title.

> > +
> > +IOASID set is particularly useful for guest SVA where each guest could
> > +have its own IOASID set for security and efficiency reasons.
> > +
> > +IOASID Set Private ID (SPID)
> > +----------------------------
> > +Each IOASID set has a private namespace of SPIDs. An SPID maps to a
> > +single system-wide IOASID. Conversely, each IOASID may be associated
> > +with an alias ID, local to the IOASID set, named SPID.
> > +SPIDs can be used as guest IOASIDs where each guest could do
> > +IOASID allocation from its own pool and map them to host physical
> > +IOASIDs. SPIDs are particularly useful for supporting live migration
> > +where decoupling guest and host physical resources are necessary.
> > +
> > +For example, two VMs can both allocate guest PASID/SPID #101 but map to
> > +different host PASIDs #201 and #202 respectively as shown in the
> > +diagram below.
> > +::
> > +
> > + .------------------. .------------------.
> > + | VM 1 | | VM 2 |
> > + | | | |
> > + |------------------| |------------------|
> > + | GPASID/SPID 101 | | GPASID/SPID 101 |
> > + '------------------' -------------------' Guest
> > + __________|______________________|____________________
> > + | | Host
> > + v v
> > + .------------------. .------------------.
> > + | Host IOASID 201 | | Host IOASID 202 |
> > + '------------------' '------------------'
> > + | IOASID set 1 | | IOASID set 2 |
> > + '------------------' '------------------'
> > +
> > +Guest PASID is treated as IOASID set private ID (SPID) within an
> > +IOASID set, mappings between guest and host IOASIDs are stored in the
> > +set for inquiry.
> > +
> > +IOASID APIs
> > +===========
> > +To get the IOASID APIs, users must #include <linux/ioasid.h>. These
> > APIs +serve the following functionalities:
> > +
> > + - IOASID allocation/Free
>
> freeing (or release)
>
got it

> > + - Group management in the form of ioasid_set
> > + - Private data storage and lookup
> > + - Reference counting
> > + - Event notification in case of a state change
> > +
> > +IOASID Set Level APIs
> > +--------------------------
> > +For use cases such as guest SVA it is necessary to manage IOASIDs at
> > +ioasid_set level. For example, VMs may allocate multiple IOASIDs for
> > +guest process address sharing (vSVA). It is imperative to enforce
> > +VM-IOASID ownership such that a malicious guest cannot target DMA
> > +traffic outside its own IOASIDs, or free an active IOASID that belongs
> > +to another VM.
> > +
> > +The IOASID set APIs serve the following purposes:
> > +
> > + - Ownership/permission enforcement
> > + - Take collective actions, e.g. free an entire set
> > + - Event notifications within a set
> > + - Look up a set based on token
> > + - Quota enforcement
> > +
> > +Each IOASID set is created with a token, which can be one of the
> > +following token types:
> > +
> > + - IOASID_SET_TYPE_NULL (Arbitrary u64 value)
>
> Maybe NULL isn't the best name then. NONE?
>
Agreed, 'NONE' makes more sense.

> > + - IOASID_SET_TYPE_MM (Set token is a mm_struct)
> > +
> > +The explicit MM token type is useful when multiple users of an IOASID
> > +set under the same process need to communicate about their shared
> > IOASIDs. +E.g. An IOASID set created by VFIO for one guest can be
> > associated +with the KVM instance for the same guest since they share a
> > common mm_struct. +A token must be unique within its type.
> > +
> > +::
> > +
> > + struct ioasid_set *ioasid_alloc_set(void *token, ioasid_t quota, u32
> > type) +
> > + int ioasid_adjust_set(struct ioasid_set *set, int quota);
> > +
> > + void ioasid_set_get(struct ioasid_set *set)
> > +
> > + void ioasid_set_put(struct ioasid_set *set)
> > +
> > + void ioasid_set_get_locked(struct ioasid_set *set)
> > +
> > + void ioasid_set_put_locked(struct ioasid_set *set)
> > +
> > + int ioasid_set_for_each_ioasid(struct ioasid_set *set,
> > + void (*fn)(ioasid_t id, void *data),
> > + void *data)
>
> misaligned
>
got it

> > +
> > +
> > +Individual IOASID APIs
> > +----------------------
> > +Once an ioasid_set is created, IOASIDs can be allocated from the set.
> > +Within the IOASID set namespace, set private ID (SPID) is supported. In
> > +the VM use case, SPID can be used for storing guest PASID.
> > +
> > +::
> > +
> > + ioasid_t ioasid_alloc(struct ioasid_set *set, ioasid_t min, ioasid_t
> > max,
> > + void *private);
> > +
> > + int ioasid_get(struct ioasid_set *set, ioasid_t ioasid);
> > +
> > + void ioasid_put(struct ioasid_set *set, ioasid_t ioasid);
> > +
> > + int ioasid_get_locked(struct ioasid_set *set, ioasid_t ioasid);
> > +
> > + void ioasid_put_locked(struct ioasid_set *set, ioasid_t ioasid);
> > +
> > + void *ioasid_find(struct ioasid_set *set, ioasid_t ioasid,
> > + bool (*getter)(void *));
> > +
> > + ioasid_t ioasid_find_by_spid(struct ioasid_set *set, ioasid_t spid)
> > +
> > + int ioasid_attach_data(struct ioasid_set *set, ioasid_t ioasid,
> > + void *data);
> > + int ioasid_attach_spid(struct ioasid_set *set, ioasid_t ioasid,
> > + ioasid_t spid);
> > +
> > +
> > +Notifications
> > +-------------
> > +An IOASID may have multiple users, each user may have hardware context
> > +associated with an IOASID. When the status of an IOASID changes,
> > +e.g. an IOASID is being freed, users need to be notified such that the
> > +associated hardware context can be cleared, flushed, and drained.
> > +
> > +::
> > +
> > + int ioasid_register_notifier(struct ioasid_set *set, struct
> > + notifier_block *nb)
> > +
> > + void ioasid_unregister_notifier(struct ioasid_set *set,
> > + struct notifier_block *nb)
> > +
> > + int ioasid_register_notifier_mm(struct mm_struct *mm, struct
> > + notifier_block *nb)
> > +
> > + void ioasid_unregister_notifier_mm(struct mm_struct *mm, struct
> > + notifier_block *nb)
> > +
> > + int ioasid_notify(ioasid_t ioasid, enum ioasid_notify_val cmd,
> > + unsigned int flags)
> > +
> > +"_mm" flavor of the ioasid_register_notifier() APIs are used when
> > +an IOASID user need to listen to the IOASID events belong to a
> > +process but without the knowledge of the associated ioasid_set.
> > +
> > +Events
> > +~~~~~~
> > +Notification events are pertinent to individual IOASIDs, they can be
> > +one of the following:
> > +
> > + - ALLOC
> > + - FREE
> > + - BIND
> > + - UNBIND
> > +
> > +Ordering
> > +~~~~~~~~
> > +Ordering is supported by IOASID notification priorities as the
> > +following (in ascending order):
> > +
> > +::
> > +
> > + enum ioasid_notifier_prios {
> > + IOASID_PRIO_LAST,
> > + IOASID_PRIO_IOMMU,
> > + IOASID_PRIO_DEVICE,
> > + IOASID_PRIO_CPU,
> > + };
> > +
> > +When registered, notifiers are assigned a priority that affect the
> > +call order. For example, notifiers with CPU priority get called before
> > +notifiers with device priority and so on.
> > +
> > +The typical use case is when an IOASID is freed due to an exception,
> > DMA +source should be quiesced before tearing down other hardware
> > contexts +in the system. This will reduce the churn in handling faults.
> > DMA work +submission is performed by the CPU which is granted higher
> > priority than +devices.
> > +
> > +Level Sensitivity
> > +~~~~~~~~~~~~~~~~~
> > +For each IOASID state transition, IOASID core ensures that there is
> > +only one notification sent. This resembles level triggered interrupt
> > +where a single interrupt is raised during a state transition.
> > +For example, if ioasid_free() is called twice by a user before the
> > +IOASID is reclaimed, IOASID core will only send out a single
> > +IOASID_NOTIFY_FREE event. Similarly, for IOASID_NOTIFY_BIND/UNBIND
> > +events, which is only sent out once when a SPID is attached/detached.
> > +
> > +IOASID notifications cannot be sent directly by IOASID users, they are
> > +sent out by the IOASID core as a by-product of the following APIs:
> > +- ioasid_alloc/free(), emit IOASID_NOTIFY_ALLOC/FREE
> > +- ioasid_attach/detach_spid() emit IOASID_NOTIFY_BIND/UNBIND
> > +
> > +Scopes
> > +~~~~~~
> > +There are two types of notifiers in IOASID core: system-wide and
> > +ioasid_set-wide.
> > +
> > +System-wide notifier is catering for users that need to handle all the
> > +IOASIDs in the system. E.g. The IOMMU driver.
> > +
> > +Per ioasid_set notifier can be used by VM specific components such as
> > +KVM. After all, each KVM instance only cares about IOASIDs within its
> > +own set/guest.
> > +
> > +
> > +Atomicity
> > +~~~~~~~~~
> > +IOASID notifiers are atomic due to spinlocks used inside the IOASID
> > +core. For tasks that cannot be completed in the notifier handler,
> > +async work can be submitted to complete the work later as long as
> > +there is no ordering requirement.
> > +
> > +Reference counting
> > +------------------
> > +IOASID lifecycle management is based on reference counting. Users of
> > +IOASID who intend to align lifecycle with the IOASID need to hold
> > +a reference of the IOASID. The IOASID will not be returned to the pool
> > +for allocation until all references are dropped. Calling ioasid_free()
> > +will mark the IOASID as FREE_PENDING if the IOASID has outstanding
> > +references. No new references can be taken by ioasid_get() once an
> > +IOASID is in the FREE_PENDING state. ioasid_free() can be called
> > +multiple times without an error until all refs are dropped.
> > +
> > +ioasid_put() decrements and tests refcount of the IOASID. If refcount
> > +is 0, ioasid will be freed. Deleted from the system-wide xarray as
> > +well as per set xarray.
>
> These xarray structures haven't been introduced so far. I would leave out
> this level of detail because it's not relevant to the user and will be
> difficult to keep in sync with the internals of IOASID.
>
Right, I will deleted the xarray part.

> > The IOASID will be returned to the pool and
> > +available for new allocations.
> > +
> > +Event notifications are used to inform users of IOASID status change.
> > +IOASID_FREE event prompts users to drop their references after
> > +clearing its context.
> > +
> > +For example, on VT-d platform when an IOASID is freed, teardown
> > +actions are performed on KVM, device driver (VDCM), and the IOMMU
> > +driver. To quiesce VCPU for work submission, KVM notifier handler must
> > +be called before VDCM handler. Therefore, KVM and VDCM shall use
> > +notification priority IOASID_PRIO_CPU andIOASID_PRIO_DEVICE
>
> and IOASID_PRIO_DEVICE
>
go it.

> > +respectively.
> > +
> > +For both KVM and VDCM, notifier blocks shall be registered on the
> > +IOASID set such that *only* events from the matching VM are received.
> > +
> > +If KVM attempts to register a notifier block before the IOASID set is
> > +created using the MM token, the notifier block will be placed on a
> > +pending list inside IOASID core. Once the token matching IOASID set
> > +is created, IOASID will register the notifier block automatically.
> > +IOASID core does not replay events for the existing IOASIDs in the
> > +set. For IOASID set of MM type, notification blocks can be registered
> > +on empty sets only. This is to avoid lost events.
> > +
> > +IOMMU driver shall register notifier block on global chain::
> > +
> > + static struct notifier_block pasid_nb_vtd = {
> > + .notifier_call = pasid_status_change_vtd,
> > + .priority = IOASID_PRIO_IOMMU,
> > + };
> > +
> > +Custom allocator APIs
> > +---------------------
> > +
> > +::
> > +
> > + int ioasid_register_allocator(struct ioasid_allocator_ops *allocator);
> > +
> > + void ioasid_unregister_allocator(struct ioasid_allocator_ops
> > *allocator); +
> > +Allocator Choices
> > +~~~~~~~~~~~~~~~~~
> > +IOASIDs are allocated for both host and guest SVA/IOVA usage. However,
> > +allocators can be different. For example, on VT-d guest PASID
> > +allocation must be performed via a virtual command interface which is
> > +emulated by VMM.
> > +
> > +IOASID core has the notion of "custom allocator" such that guest can
> > +register virtual command allocator that precedes the default one.
>
> "Supersedes", rather than "precedes"?
>
My understanding is that 'supersede' means replace something but 'precede'
means get in front of something. I do want to emphasis that the custom
allocator takes precedence over the default allocator.

> > +
> > +Namespaces
> > +~~~~~~~~~~
> > +IOASIDs are limited system resources that default to 20 bits in
> > +size. Each device can have its own PASID table for security reasons.
> > +Theoretically the namespace can be per device also.
> > +
> > +However IOASID namespace is system-wide for two reasons:
> > +- Simplicity
> > +- Sharing resources of a single device to multiple VMs.
> > +
> > +Take VT-d as an example, VT-d supports shared workqueue and ENQCMD[1]
> > +where one IOASID could be used to submit work on multiple devices that
> > +are shared with other VMs. This requires IOASID to be
> > +system-wide. This is also the reason why guests must use an
> > +emulated virtual command interface to allocate IOASID from the host.
> > +
> > +Life cycle
> > +==========
> > +This section covers IOASID life cycle management for both bare-metal
> > +and guest usages. In bare-metal SVA, MMU notifier is directly hooked
> > +up with IOMMU driver, therefore the IOASID life cycle matches the
> > +process address space (MM) life cycle.
> > +
> > +However, guest MMU notifier is not available to host IOMMU driver,
> > +when guest MM terminates unexpectedly, the events have to go through
> > +VFIO and IOMMU UAPI to reach host IOMMU driver. There are also more
> > +parties involved in guest SVA, e.g. on Intel VT-d platform, IOASIDs
> > +are used by IOMMU driver, KVM, VDCM, and VFIO.
> > +
> > +Native IOASID Life Cycle (VT-d Example)
> > +---------------------------------------
> > +
> > +The normal flow of native SVA code with Intel Data Streaming
> > +Accelerator(DSA) [2] as example:
> > +
> > +1. Host user opens accelerator FD, e.g. DSA driver, or uacce;
> > +2. DSA driver allocate WQ, do sva_bind_device();
> > +3. IOMMU driver calls ioasid_alloc(), then bind PASID with device,
> > + mmu_notifier_get()
> > +4. DMA starts by DSA driver userspace
> > +5. DSA userspace close FD
> > +6. DSA/uacce kernel driver handles FD.close()
> > +7. DSA driver stops DMA
> > +8. DSA driver calls sva_unbind_device();
> > +9. IOMMU driver does unbind, clears PASID context in IOMMU, flush
> > + TLBs. mmu_notifier_put() called.
> > +10. mmu_notifier.release() called, IOMMU SVA code calls ioasid_free()*
> > +11. The IOASID is returned to the pool, reclaimed.
> > +
> > +::
> > +
> > + * With ENQCMD, PASID used on VT-d is not released in mmu_notifier()
> > but
> > + mmdrop(). mmdrop comes after FD close which does not make a
> > difference. +
> > +During the normal teardown, the following three key steps would happen
> > in +order: (7), (9), (11).
> > +
> > +Exception happens when process terminates *before* device driver stops
> > +DMA and call IOMMU driver to unbind. The detailed flow of process
> > +exits are as follows:
> > +
> > +::
> > +
> > + do_exit() {
> > + exit_mm() {
> > + mm_put();
> > + exit_mmap() {
> > + intel_invalidate_range() //mmu notifier
> > + tlb_finish_mmu()
> > + mmu_notifier_release(mm) {
> > + intel_mm_release() {
> > + (9)
> > intel_iommu_teardown_pasid();
> > + intel_iommu_flush_tlbs();
> > + }
> > + // tlb_invalidate_range cb removed
> > + }
> > + unmap_vmas();
> > + free_pgtables(); // IOMMU cannot walk PGT
> > after this
>
> or rather, it *must* not walk PGT at this point? PGT should be disabled on
> the IOMMU side at step 9, before the TLB invalidation
>
Yes, agreed already guaranteed. I will delete the comment.

> > + };
> > + }
> > + exit_files(tsk) {
> > + close_files() {
> > + dsa_close();
> > + (7) dsa_stop_dma();
> > + intel_svm_unbind_pasid(); //nothing to do
> > + }
> > + }
> > + }
> > +
> > + mmdrop() /* some random time later, lazy mm user */ {
> > + mm_free_pgd();
> > + destroy_context(mm); {
> > + (11) ioasid_free();
> > + }
> > + }
> > +
> > +As shown in the list above, step #2 could happen before
> > +#1. Unrecoverable(UR) faults could happen between #2 and #1.
> > +Fault processing is disabled by the IOMMU driver in #2, therefore the
> > UR +fault never reaches the driver.
>
> Did you mean steps (7) and (9) instead for #1 and #2?
>
Yes, forgot to update the index.

> > +
> > +
> > +Also notice that TLB invalidation occurs at mmu_notifier
> > +invalidate_range callback as well as the release callback. The reason
> > +is that release callback will delete IOMMU driver from the notifier
> > +chain which may skip invalidate_range() calls during the exit path.
>
> I think this whole section also goes too deep in implementation details,
> and is susceptible to become out of sync with the code pretty quickly.
> I don't feel strongly about this, though, leave it if you want.
>
I will delete this, better be accurate than misleading.

> > +
> > +To avoid unnecessary reporting of UR fault, IOMMU driver shall disable
> > +fault reporting after free and before unbind.
> > +
> > +Guest IOASID Life Cycle (VT-d Example)
> > +--------------------------------------
> > +Guest IOASID life cycle starts with guest driver open(), this could be
> > +uacce or individual accelerator driver such as DSA. At FD open,
> > +sva_bind_device() is called which triggers a series of actions.
> > +
> > +The example below is an illustration of *normal* operations that
> > +involves *all* the SW components in VT-d. The flow can be simpler if
> > +no ENQCMD is supported.
> > +
> > +::
> > +
> > + VFIO IOMMU KVM VDCM IOASID Ref
> > + ..................................................................
> > + 1 ioasid_register_notifier/_mm()
> > + 2 ioasid_alloc() 1
> > + 3 bind_gpasid()
> > + 4 iommu_bind()->ioasid_get() 2
> > + 5 ioasid_notify(BIND)
> > + 6 -> ioasid_get() 3
> > + 7 -> vmcs_update_atomic()
> > + 8 mdev_write(gpasid)
> > + 9 hpasid=
> > + 10 find_by_spid(gpasid) 4
>
> Are 9 and 10 the same step?
>
yes, here is just line #.

> > + 11 vdev_write(hpasid)
>
> What's the 'v' in vdev? Aren't we writing to the physical device here?
>
Yes, we are writing to the physical resource of the virtual device. Let me
remove the 'v'.

> > + 12 -------- GUEST STARTS DMA --------------------------
> > + 13 -------- GUEST STOPS DMA --------------------------
> > + 14 mdev_clear(gpasid)
> > + 15 vdev_clear(hpasid)
> > + 16 ioasid_put() 3
> > + 17 unbind_gpasid()
> > + 18 iommu_ubind()
> > + 19 ioasid_notify(UNBIND)
> > + 20 -> vmcs_update_atomic()
> > + 21 -> ioasid_put() 2
> > + 22 ioasid_free() 1
> > + 23 ioasid_put() 0
> > + 24 Reclaimed
> > + -------------- New Life Cycle Begin ----------------------------
> > + 1 ioasid_alloc() -> 1
> > +
> > + Note: IOASID Notification Events: FREE, BIND, UNBIND
>
> Is FREE at step 22?
>
Yes the IOASID_FREE event should be sent at step 22, let me call it out
explicitly.

> > +
> > +Exception cases arise when a guest crashes or a malicious guest
> > +attempts to cause disruption on the host system. The fault handling
> > +rules are:
> > +
> > +1. IOASID free must *always* succeed.
> > +2. An inactive period may be required before the freed IOASID is
> > + reclaimed. During this period, consumers of IOASID perform cleanup.
> > +3. Malfunction is limited to the guest owned resources for all
> > + programming errors.
> > +
> > +The primary source of exception is when the following are out of
> > +order:
> > +
> > +1. Start/Stop of DMA activity
> > + (Guest device driver, mdev via VFIO)
> > +2. Setup/Teardown of IOMMU PASID context, IOTLB, DevTLB flushes
> > + (Host IOMMU driver bind/unbind)
> > +3. Setup/Teardown of VMCS PASID translation table entries (KVM) in
> > + case of ENQCMD
> > +4. Programming/Clearing host PASID in VDCM (Host VDCM driver)
> > +5. IOASID alloc/free (Host IOASID)
> > +
> > +VFIO is the *only* user-kernel interface, which is ultimately
> > +responsible for exception handling.
> > +
> > +#1 is processed the same way as the assigned device today based on
> > +device file descriptors and events. There is no special handling.
> > +
> > +#3 is based on bind/unbind events emitted by #2.
> > +
> > +#4 is naturally aligned with IOASID life cycle in that an illegal
> > +guest PASID programming would fail in obtaining reference of the
> > +matching host IOASID.
> > +
> > +#5 is similar to #4. The fault will be reported to the user if PASID
> > +used in the ENQCMD is not set up in VMCS PASID translation table.
> > +
> > +Therefore, the remaining out of order problem is between #2 and
> > +#5. I.e. unbind vs. free. More specifically, free before unbind.
> > +
> > +IOASID notifier and refcounting are used to ensure order. Following
> > +a publisher-subscriber pattern where:
> > +
> > +- Publishers: VFIO & IOMMU
> > +- Subscribers: KVM, VDCM, IOMMU
> > +
> > +IOASID reference must be acquired before receiving the FREE event. The
> > +reference must be dropped at the end of the processing in order to
> > +return the IOASID to the pool.
> > +
> > +Let's examine the IOASID life cycle again when free happens *before*
> > +unbind. This could be a result of misbehaving guests or crash. Assuming
> > +VFIO cannot enforce unbind->free order. Notice that the setup part up
> > +until step #12 is identical to the normal case, the flow below starts
> > +with step 13.
> > +
> > +::
> > +
> > + VFIO IOMMU KVM VDCM IOASID Ref
> > + ..................................................................
> > + 13 -------- GUEST STARTS DMA --------------------------
> > + 14 -------- *GUEST MISBEHAVES!!!* ----------------
> > + 15 ioasid_free()
> > + 16 ioasid_notify(FREE)
> > + 17 mark_free_pending
> > (1)
>
> Could we use superscript ¹²³⁴ for footnotes? These look like function
> parameters
>
yes, much better

> > + 18 kvm_nb_handler(FREE)
> > + 19 vmcs_update_atomic()
> > + 20 ioasid_put_locked() -> 3
> > + 21 vdcm_nb_handler(FREE)
> > + 22 iomm_nb_handler(FREE)
>
> iommu_nb_handler
>
got it

> > + 23 ioasid_free() returns(2) schedule_work() 2
>
> I completely lost track here, couldn't figure out in which direction to
> read the diagram. What work is scheduled?
The time line goes downward but we only control the notification order in
terms of when the events are received. Some completions are async thus out
of order done by work items. The only in-order completion is the KVM update
of its PASID translation table.

After #23, the async works are scheduled to complete clean up work outside
the spinlock(held by the caller of the atomic notifier).

Any suggestions to improve the readability of the time line?

> Why does the IOMMU driver drop
> its reference to the IOASID before unbdind_gpasid()?
>
This is the exception case where userspace issues IOASID free before
unbind_gpasid(). The equivalent of unbind is performed in the IOASID_FREE
notification handler. In IOASID_FREE handler, reference is dropped and
private data deleted. After that, if unbind comes to IOMMU driver, it will
not find IOASID private data therefore just return.

> > + 24 schedule_work() vdev_clear_wk(hpasid)
> > + 25 teardown_pasid_wk()
> > + 26 ioasid_put() -> 1
> > + 27 ioasid_put() 0
> > + 28 Reclaimed
> > + 29 unbind_gpasid()
> > + 30 iommu_unbind()->ioasid_find() Fails(3)
> > + -------------- New Life Cycle Begin ----------------------------
> > +
> > +Note:
> > +
> > +1. By marking IOASID FREE_PENDING at step #17, no new references can be
> > + held. ioasid_get/find() will return -ENOENT;
>
> s/held/taken
>
Got it.

> Thanks,
> Jean
>
> > +2. After step #23, all events can go out of order. Shall not affect
> > + the outcome.
> > +3. IOMMU driver fails to find private data for unbinding. If unbind is
> > + called after the same IOASID is allocated for the same guest again,
> > + this is a programming error. The damage is limited to the guest
> > + itself since unbind performs permission checking based on the
> > + IOASID set associated with the guest process.
> > +4. Workqueues are used by VDCM and IOMMU driver for processing that
> > + requires thread context.
> > +
> > +
> > +KVM PASID Translation Table Updates
> > +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> > +Per VM PASID translation table is maintained by KVM in order to
> > +support ENQCMD in the guest. The table contains host-guest PASID
> > +translations to be consumed by CPU ucode. The synchronization of the
> > +PASID states depends on VFIO/IOMMU driver, where IOCTL and atomic
> > +notifiers are used. KVM must register IOASID notifier per VM instance
> > +during launch time. The following events are handled:
> > +
> > +1. BIND/UNBIND
> > +2. FREE
> > +
> > +Rules:
> > +
> > +1. Multiple devices can bind with the same PASID, this can be
> > different PCI
> > + devices or mdevs within the same PCI device. However, only the
> > + *first* BIND and *last* UNBIND emit notifications.
> > +2. IOASID code is responsible for ensuring the correctness of H-G
> > + PASID mapping. There is no need for KVM to validate the
> > + notification data.
> > +3. When UNBIND happens *after* FREE, KVM will see error in
> > + ioasid_get() even when the reclaim is not done. IOMMU driver will
> > + also avoid sending UNBIND if the PASID is already FREE.
> > +4. When KVM terminates *before* FREE & UNBIND, references will be
> > + dropped for all host PASIDs.
> > +
> > +VDCM PASID Programming
> > +~~~~~~~~~~~~~~~~~~~~~~
> > +VDCM composes virtual devices and exposes them to the guests. When
> > +the guest allocates a PASID then program it to the virtual device, VDCM
> > +intercepts the programming attempt then programs the matching host
> > +PASID on to the hardware.
> > +Conversely, when a device is going away, VDCM must be informed such
> > +that PASID context on the hardware can be cleared. There could be
> > +multiple mdevs assigned to different guests in the same VDCM. Since
> > +the PASID table is shared at PCI device level, lazy clearing is not
> > +secure. A malicious guest can attack by using newly freed PASIDs that
> > +are allocated by another guest.
> > +
> > +By holding a reference of the PASID until VDCM cleans up the HW
> > context, +it is guaranteed that PASID life cycles do not cross within
> > the same +device.
> > +
> > +
> > +Reference
> > +====================================================
> > +1.
> > https://software.intel.com/sites/default/files/managed/c5/15/architecture-instruction-set-extensions-programming-reference.pdf
> > + +2.
> > https://01.org/blogs/2019/introducing-intel-data-streaming-accelerator +
> > +3.
> > https://software.intel.com/en-us/download/intel-data-streaming-accelerator-preliminary-architecture-specification
> > -- 2.7.4
> >


Thanks,

Jacob

2020-10-30 10:22:59

by Jean-Philippe Brucker

[permalink] [raw]
Subject: Re: [PATCH v3 01/14] docs: Document IO Address Space ID (IOASID) APIs

On Mon, Oct 26, 2020 at 02:05:06PM -0700, Jacob Pan wrote:
> > This looks good to me, with small comments below.
> >
> Can I add your Reviewed-by tag after addressing the comments?

Yes sure, this took forever to review so I'm happy not to do another pass :)


> > > +Each IOASID set is created with a token, which can be one of the
> > > +following token types:
> > > +
> > > + - IOASID_SET_TYPE_NULL (Arbitrary u64 value)
> >
> > Maybe NULL isn't the best name then. NONE?
> >
> Agreed, 'NONE' makes more sense.

Although patch 5 only allows a NULL token for this type. So the name seems
fine, you could just fix this description.


> > > +IOASID core has the notion of "custom allocator" such that guest can
> > > +register virtual command allocator that precedes the default one.
> >
> > "Supersedes", rather than "precedes"?
> >
> My understanding is that 'supersede' means replace something but 'precede'
> means get in front of something. I do want to emphasis that the custom
> allocator takes precedence over the default allocator.

Right it's ambiguous. The custom allocator does entirely replace the
allocation action, but the default one is still used for storage. Anyway,
you can leave this.


> > > +Let's examine the IOASID life cycle again when free happens *before*
> > > +unbind. This could be a result of misbehaving guests or crash. Assuming
> > > +VFIO cannot enforce unbind->free order. Notice that the setup part up
> > > +until step #12 is identical to the normal case, the flow below starts
> > > +with step 13.
> > > +
> > > +::
> > > +
> > > + VFIO IOMMU KVM VDCM IOASID Ref
> > > + ..................................................................
> > > + 13 -------- GUEST STARTS DMA --------------------------
> > > + 14 -------- *GUEST MISBEHAVES!!!* ----------------
> > > + 15 ioasid_free()
> > > + 16 ioasid_notify(FREE)
> > > + 17 mark_free_pending
> > > (1)
> >
> > Could we use superscript ¹²³⁴ for footnotes? These look like function
> > parameters
> >
> yes, much better
>
> > > + 18 kvm_nb_handler(FREE)
> > > + 19 vmcs_update_atomic()
> > > + 20 ioasid_put_locked() -> 3
> > > + 21 vdcm_nb_handler(FREE)
> > > + 22 iomm_nb_handler(FREE)
> >
> > iommu_nb_handler
> >
> got it
>
> > > + 23 ioasid_free() returns(2) schedule_work() 2
> >
> > I completely lost track here, couldn't figure out in which direction to
> > read the diagram. What work is scheduled?
> The time line goes downward but we only control the notification order in
> terms of when the events are received. Some completions are async thus out
> of order done by work items. The only in-order completion is the KVM update
> of its PASID translation table.
>
> After #23, the async works are scheduled to complete clean up work outside
> the spinlock(held by the caller of the atomic notifier).
>
> Any suggestions to improve the readability of the time line?

Maybe explain what happens from line 23: ioasid_free() schedules... a FREE
notification? Which happens on line 24 (corresponding to the second
schedule_work()?) and is handled by (a) VDCM to clear the device context
and (b) IOMMU to clear the PASID context, both ending up dropping their
ref.

>
> > Why does the IOMMU driver drop
> > its reference to the IOASID before unbdind_gpasid()?
> >
> This is the exception case where userspace issues IOASID free before
> unbind_gpasid(). The equivalent of unbind is performed in the IOASID_FREE
> notification handler. In IOASID_FREE handler, reference is dropped and
> private data deleted. After that, if unbind comes to IOMMU driver, it will
> not find IOASID private data therefore just return.

Right ok. As you noted below the damage is caused by and limited to the
guest, so I think it's fine.

>
> > > + 24 schedule_work() vdev_clear_wk(hpasid)
> > > + 25 teardown_pasid_wk()
> > > + 26 ioasid_put() -> 1
> > > + 27 ioasid_put() 0
> > > + 28 Reclaimed
> > > + 29 unbind_gpasid()
> > > + 30 iommu_unbind()->ioasid_find() Fails(3)
> > > + -------------- New Life Cycle Begin ----------------------------
> > > +
> > > +Note:
> > > +
> > > +1. By marking IOASID FREE_PENDING at step #17, no new references can be
> > > + held. ioasid_get/find() will return -ENOENT;
> >
> > s/held/taken
> >
> Got it.
>
> > Thanks,
> > Jean
> >
> > > +2. After step #23, all events can go out of order. Shall not affect
> > > + the outcome.
> > > +3. IOMMU driver fails to find private data for unbinding. If unbind is
> > > + called after the same IOASID is allocated for the same guest again,
> > > + this is a programming error. The damage is limited to the guest
> > > + itself since unbind performs permission checking based on the
> > > + IOASID set associated with the guest process.

"guest process" can be confusing (process run by the guest?), just "guest"
might be better.

Thanks,
Jean

2020-11-02 23:41:28

by Jacob Pan

[permalink] [raw]
Subject: Re: [PATCH v3 01/14] docs: Document IO Address Space ID (IOASID) APIs

Hi Jean-Philippe,

On Fri, 30 Oct 2020 11:18:27 +0100, Jean-Philippe Brucker
<[email protected]> wrote:

> On Mon, Oct 26, 2020 at 02:05:06PM -0700, Jacob Pan wrote:
> > > This looks good to me, with small comments below.
> > >
> > Can I add your Reviewed-by tag after addressing the comments?
>
> Yes sure, this took forever to review so I'm happy not to do another
> pass :)
>
I am afraid I have to ask for another round of reviews since it was
suggested to keep IOASID allocation interface independent, instead of being
part of VFIO UAPI. Yi and I are working out the details to come up with a
PoC. As you might be aware, the need for this independent interface is that
we may have multiple users of PASID, e.g VDPA, user space drivers, etc.
The IOASID user interface also has slight impact on the IOASID core code,
which is why I am slow in response to your code review. Will incorporate
your review in the next round with support of independent user API.
Much appreciated!

>
> > > > +Each IOASID set is created with a token, which can be one of the
> > > > +following token types:
> > > > +
> > > > + - IOASID_SET_TYPE_NULL (Arbitrary u64 value)
> > >
> > > Maybe NULL isn't the best name then. NONE?
> > >
> > Agreed, 'NONE' makes more sense.
>
> Although patch 5 only allows a NULL token for this type. So the name seems
> fine, you could just fix this description.
>
OK.

>
> > > > +IOASID core has the notion of "custom allocator" such that guest
> > > > can +register virtual command allocator that precedes the default
> > > > one.
> > >
> > > "Supersedes", rather than "precedes"?
> > >
> > My understanding is that 'supersede' means replace something but
> > 'precede' means get in front of something. I do want to emphasis that
> > the custom allocator takes precedence over the default allocator.
>
> Right it's ambiguous. The custom allocator does entirely replace the
> allocation action, but the default one is still used for storage. Anyway,
> you can leave this.
>
OK

>
> > > > +Let's examine the IOASID life cycle again when free happens
> > > > *before* +unbind. This could be a result of misbehaving guests or
> > > > crash. Assuming +VFIO cannot enforce unbind->free order. Notice
> > > > that the setup part up +until step #12 is identical to the normal
> > > > case, the flow below starts +with step 13.
> > > > +
> > > > +::
> > > > +
> > > > + VFIO IOMMU KVM VDCM IOASID
> > > > Ref
> > > > + ..................................................................
> > > > + 13 -------- GUEST STARTS DMA --------------------------
> > > > + 14 -------- *GUEST MISBEHAVES!!!* ----------------
> > > > + 15 ioasid_free()
> > > > + 16
> > > > ioasid_notify(FREE)
> > > > + 17 mark_free_pending
> > > > (1)
> > >
> > > Could we use superscript ¹²³⁴ for footnotes? These look like function
> > > parameters
> > >
> > yes, much better
> >
> > > > + 18 kvm_nb_handler(FREE)
> > > > + 19 vmcs_update_atomic()
> > > > + 20 ioasid_put_locked() -> 3
> > > > + 21 vdcm_nb_handler(FREE)
> > > > + 22 iomm_nb_handler(FREE)
> > >
> > > iommu_nb_handler
> > >
> > got it
> >
> > > > + 23 ioasid_free() returns(2) schedule_work()
> > > > 2
> > >
> > > I completely lost track here, couldn't figure out in which direction
> > > to read the diagram. What work is scheduled?
> > The time line goes downward but we only control the notification order
> > in terms of when the events are received. Some completions are async
> > thus out of order done by work items. The only in-order completion is
> > the KVM update of its PASID translation table.
> >
> > After #23, the async works are scheduled to complete clean up work
> > outside the spinlock(held by the caller of the atomic notifier).
> >
> > Any suggestions to improve the readability of the time line?
>
> Maybe explain what happens from line 23: ioasid_free() schedules... a FREE
> notification? Which happens on line 24 (corresponding to the second
> schedule_work()?) and is handled by (a) VDCM to clear the device context
> and (b) IOMMU to clear the PASID context, both ending up dropping their
> ref.
>
Got it, I will add that.

> >
> > > Why does the IOMMU driver drop
> > > its reference to the IOASID before unbdind_gpasid()?
> > >
> > This is the exception case where userspace issues IOASID free before
> > unbind_gpasid(). The equivalent of unbind is performed in the
> > IOASID_FREE notification handler. In IOASID_FREE handler, reference is
> > dropped and private data deleted. After that, if unbind comes to IOMMU
> > driver, it will not find IOASID private data therefore just return.
>
> Right ok. As you noted below the damage is caused by and limited to the
> guest, so I think it's fine.
>
OK.

> >
> > > > + 24 schedule_work() vdev_clear_wk(hpasid)
> > > > + 25 teardown_pasid_wk()
> > > > + 26 ioasid_put() -> 1
> > > > + 27 ioasid_put() 0
> > > > + 28 Reclaimed
> > > > + 29 unbind_gpasid()
> > > > + 30 iommu_unbind()->ioasid_find() Fails(3)
> > > > + -------------- New Life Cycle Begin ----------------------------
> > > > +
> > > > +Note:
> > > > +
> > > > +1. By marking IOASID FREE_PENDING at step #17, no new references
> > > > can be
> > > > + held. ioasid_get/find() will return -ENOENT;
> > >
> > > s/held/taken
> > >
> > Got it.
> >
> > > Thanks,
> > > Jean
> > >
> > > > +2. After step #23, all events can go out of order. Shall not affect
> > > > + the outcome.
> > > > +3. IOMMU driver fails to find private data for unbinding. If
> > > > unbind is
> > > > + called after the same IOASID is allocated for the same guest
> > > > again,
> > > > + this is a programming error. The damage is limited to the guest
> > > > + itself since unbind performs permission checking based on the
> > > > + IOASID set associated with the guest process.
>
> "guest process" can be confusing (process run by the guest?), just "guest"
> might be better.
>
> Thanks,
> Jean


Thanks,

Jacob