2020-08-22 04:29:26

by Jacob Pan

[permalink] [raw]
Subject: [PATCH v2 0/9] IOASID extensions for guest SVA

IOASID was introduced in v5.5 as a generic kernel allocator service for
both PCIe Process Address Space ID (PASID) and ARM SMMU's Sub Stream
ID. In addition to basic ID allocation, ioasid_set was defined as a
token that is shared by a group of IOASIDs. This set token can be used
for permission checking, but lack of some features to address the
following needs by guest Shared Virtual Address (SVA).
- Manage IOASIDs by group, group ownership, quota, etc.
- State synchronization among IOASID users
- Non-identity guest-host IOASID mapping
- Lifecycle management across many users

This patchset introduces the following extensions as solutions to the
problems above.
- Redefine and extend IOASID set such that IOASIDs can be managed by groups.
- Add notifications for IOASID state synchronization
- Add reference counting for life cycle alignment among users
- Support ioasid_set private IDs, which can be used as guest IOASIDs
Please refer to Documentation/ioasid.rst in enclosed patch 1/9 for more
details.

This patchset only included VT-d driver as users of some of the new APIs.
VFIO and KVM patches are coming up to fully utilize the APIs introduced
here.

You can find this series at:
https://github.com/jacobpan/linux.git ioasid_ext_v2
(VFIO and KVM patches will be available at this branch when published.)

This work is a result of collaboration with many people:
Liu, Yi L <[email protected]>
Wu Hao <[email protected]>
Ashok Raj <[email protected]>
Kevin Tian <[email protected]>

Thanks,

Jacob

Changelog

V2:
- Redesigned ioasid_set APIs, removed set ID
- Added set private ID (SPID) for guest PASID usage.
- Add per ioasid_set notification and priority support.
- Back to use spinlocks and atomic notifications.
- Added async work in VT-d driver to perform teardown outside atomic context

Jacob Pan (9):
docs: Document IO Address Space ID (IOASID) APIs
iommu/ioasid: Rename ioasid_set_data()
iommu/ioasid: Introduce ioasid_set APIs
iommu/ioasid: Add reference couting functions
iommu/ioasid: Introduce ioasid_set private ID
iommu/ioasid: Introduce notification APIs
iommu/vt-d: Listen to IOASID notifications
iommu/vt-d: Send IOASID bind/unbind notifications
iommu/vt-d: Store guest PASID during bind

Documentation/ioasid.rst | 618 ++++++++++++++++++++++++++++++++
drivers/iommu/intel/iommu.c | 27 +-
drivers/iommu/intel/pasid.h | 1 +
drivers/iommu/intel/svm.c | 97 ++++-
drivers/iommu/ioasid.c | 835 ++++++++++++++++++++++++++++++++++++++++++--
include/linux/intel-iommu.h | 2 +
include/linux/ioasid.h | 166 ++++++++-
7 files changed, 1699 insertions(+), 47 deletions(-)
create mode 100644 Documentation/ioasid.rst

--
2.7.4


2020-08-22 04:29:39

by Jacob Pan

[permalink] [raw]
Subject: [PATCH v2 1/9] docs: Document IO Address Space ID (IOASID) APIs

IOASID is used to identify address spaces that can be targeted by device
DMA. It is a system-wide resource that is essential to its many users.
This document is an attempt to help developers from all vendors navigate
the APIs. At this time, ARM SMMU and Intel’s Scalable IO Virtualization
(SIOV) enabled platforms are the primary users of IOASID. Examples of
how SIOV components interact with IOASID APIs are provided in that many
APIs are driven by the requirements from SIOV.

Signed-off-by: Liu Yi L <[email protected]>
Signed-off-by: Wu Hao <[email protected]>
Signed-off-by: Jacob Pan <[email protected]>
---
Documentation/ioasid.rst | 618 +++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 618 insertions(+)
create mode 100644 Documentation/ioasid.rst

diff --git a/Documentation/ioasid.rst b/Documentation/ioasid.rst
new file mode 100644
index 000000000000..b6a8cdc885ff
--- /dev/null
+++ b/Documentation/ioasid.rst
@@ -0,0 +1,618 @@
+.. ioasid:
+
+=====================================
+IO Address Space ID
+=====================================
+
+IOASID is a generic name for PCIe Process Address ID (PASID) or ARM
+SMMU sub-stream ID. An IOASID identifies an address space that DMA
+requests can target.
+
+The primary use cases for IOASID are Shared Virtual Address (SVA) and
+IO Virtual Address (IOVA). However, the requirements for IOASID
+management can vary among hardware architectures.
+
+This document covers the generic features supported by IOASID
+APIs. Vendor-specific use cases are also illustrated with Intel's VT-d
+based platforms as the first example.
+
+.. contents:: :local:
+
+Glossary
+========
+PASID - Process Address Space ID
+
+IOASID - IO Address Space ID (generic term for PCIe PASID and
+sub-stream ID in SMMU)
+
+SVA/SVM - Shared Virtual Addressing/Memory
+
+ENQCMD - New Intel X86 ISA for efficient workqueue submission [1]
+
+DSA - Intel Data Streaming Accelerator [2]
+
+VDCM - Virtual device composition module [3]
+
+SIOV - Intel Scalable IO Virtualization
+
+
+Key Concepts
+============
+
+IOASID Set
+-----------
+An IOASID set is a group of IOASIDs allocated from the system-wide
+IOASID pool. An IOASID set is created and can be identified by a
+token of u64. Refer to IOASID set APIs for more details.
+
+IOASID set is particularly useful for guest SVA where each guest could
+have its own IOASID set for security and efficiency reasons.
+
+IOASID Set Private ID (SPID)
+----------------------------
+SPIDs are introduced as IOASIDs within its set. Each SPID maps to a
+system-wide IOASID but the namespace of SPID is within its IOASID
+set. SPIDs can be used as guest IOASIDs where each guest could do
+IOASID allocation from its own pool and map them to host physical
+IOASIDs. SPIDs are particularly useful for supporting live migration
+where decoupling guest and host physical resources are necessary.
+
+For example, two VMs can both allocate guest PASID/SPID #101 but map to
+different host PASIDs #201 and #202 respectively as shown in the
+diagram below.
+::
+
+ .------------------. .------------------.
+ | VM 1 | | VM 2 |
+ | | | |
+ |------------------| |------------------|
+ | GPASID/SPID 101 | | GPASID/SPID 101 |
+ '------------------' -------------------' Guest
+ __________|______________________|______________________
+ | | Host
+ v v
+ .------------------. .------------------.
+ | Host IOASID 201 | | Host IOASID 202 |
+ '------------------' '------------------'
+ | IOASID set 1 | | IOASID set 2 |
+ '------------------' '------------------'
+
+Guest PASID is treated as IOASID set private ID (SPID) within an
+IOASID set, mappings between guest and host IOASIDs are stored in the
+set for inquiry.
+
+IOASID APIs
+===========
+To get the IOASID APIs, users must #include <linux/ioasid.h>. These APIs
+serve the following functionalities:
+
+ - IOASID allocation/Free
+ - Group management in the form of ioasid_set
+ - Private data storage and lookup
+ - Reference counting
+ - Event notification in case of state change
+
+IOASID Set Level APIs
+--------------------------
+For use cases such as guest SVA it is necessary to manage IOASIDs at
+a group level. For example, VMs may allocate multiple IOASIDs for
+guest process address sharing (vSVA). It is imperative to enforce
+VM-IOASID ownership such that malicious guest cannot target DMA
+traffic outside its own IOASIDs, or free an active IOASID belong to
+another VM.
+::
+
+ struct ioasid_set *ioasid_alloc_set(void *token, ioasid_t quota, u32 type)
+
+ int ioasid_adjust_set(struct ioasid_set *set, int quota);
+
+ void ioasid_set_get(struct ioasid_set *set)
+
+ void ioasid_set_put(struct ioasid_set *set)
+
+ void ioasid_set_get_locked(struct ioasid_set *set)
+
+ void ioasid_set_put_locked(struct ioasid_set *set)
+
+ int ioasid_set_for_each_ioasid(struct ioasid_set *sdata,
+ void (*fn)(ioasid_t id, void *data),
+ void *data)
+
+
+IOASID set concept is introduced to represent such IOASID groups. Each
+IOASID set is created with a token which can be one of the following
+types:
+
+ - IOASID_SET_TYPE_NULL (Arbitrary u64 value)
+ - IOASID_SET_TYPE_MM (Set token is a mm_struct)
+
+The explicit MM token type is useful when multiple users of an IOASID
+set under the same process need to communicate about their shared IOASIDs.
+E.g. An IOASID set created by VFIO for one guest can be associated
+with the KVM instance for the same guest since they share a common mm_struct.
+
+The IOASID set APIs serve the following purposes:
+
+ - Ownership/permission enforcement
+ - Take collective actions, e.g. free an entire set
+ - Event notifications within a set
+ - Look up a set based on token
+ - Quota enforcement
+
+Individual IOASID APIs
+----------------------
+Once an ioasid_set is created, IOASIDs can be allocated from the set.
+Within the IOASID set namespace, set private ID (SPID) is supported. In
+the VM use case, SPID can be used for storing guest PASID.
+
+::
+
+ ioasid_t ioasid_alloc(struct ioasid_set *set, ioasid_t min, ioasid_t max,
+ void *private);
+
+ int ioasid_get(struct ioasid_set *set, ioasid_t ioasid);
+
+ void ioasid_put(struct ioasid_set *set, ioasid_t ioasid);
+
+ int ioasid_get_locked(struct ioasid_set *set, ioasid_t ioasid);
+
+ void ioasid_put_locked(struct ioasid_set *set, ioasid_t ioasid);
+
+ void *ioasid_find(struct ioasid_set *set, ioasid_t ioasid,
+ bool (*getter)(void *));
+
+ ioasid_t ioasid_find_by_spid(struct ioasid_set *set, ioasid_t spid)
+
+ int ioasid_attach_data(struct ioasid_set *set, ioasid_t ioasid,
+ void *data);
+ int ioasid_attach_spid(struct ioasid_set *set, ioasid_t ioasid,
+ ioasid_t ssid);
+
+
+Notifications
+-------------
+An IOASID may have multiple users, each user may have hardware context
+associated with an IOASID. When the status of an IOASID changes,
+e.g. an IOASID is being freed, users need to be notified such that the
+associated hardware context can be cleared, flushed, and drained.
+
+::
+
+ int ioasid_register_notifier(struct ioasid_set *set, struct
+ notifier_block *nb)
+
+ void ioasid_unregister_notifier(struct ioasid_set *set,
+ struct notifier_block *nb)
+
+ int ioasid_register_notifier_mm(struct mm_struct *mm, struct
+ notifier_block *nb)
+
+ void ioasid_unregister_notifier_mm(struct mm_struct *mm, struct
+ notifier_block *nb)
+
+ int ioasid_notify(ioasid_t ioasid, enum ioasid_notify_val cmd,
+ unsigned int flags)
+
+
+Events
+~~~~~~
+Notification events are pertinent to individual IOASIDs, they can be
+one of the following:
+
+ - ALLOC
+ - FREE
+ - BIND
+ - UNBIND
+
+Ordering
+~~~~~~~~
+Ordering is supported by IOASID notification priorities as the
+following (in ascending order):
+
+::
+
+ enum ioasid_notifier_prios {
+ IOASID_PRIO_LAST,
+ IOASID_PRIO_IOMMU,
+ IOASID_PRIO_DEVICE,
+ IOASID_PRIO_CPU,
+ };
+
+The typical use case is when an IOASID is freed due to an exception, DMA
+source should be quiesced before tearing down other hardware contexts
+in the system. This will reduce the churn in handling faults. DMA work
+submission is performed by the CPU which is granted higher priority than
+devices.
+
+
+Scopes
+~~~~~~
+There are two types of notifiers in IOASID core: system-wide and
+ioasid_set-wide.
+
+System-wide notifier is catering for users that need to handle all
+IOASIDs in the system. E.g. The IOMMU driver handles all IOASIDs.
+
+Per ioasid_set notifier can be used by VM specific components such as
+KVM. After all, each KVM instance only cares about IOASIDs within its
+own set.
+
+
+Atomicity
+~~~~~~~~~
+IOASID notifiers are atomic due to spinlocks used inside the IOASID
+core. For tasks cannot be completed in the notifier handler, async work
+can be submitted to complete the work later as long as there is no
+ordering requirement.
+
+Reference counting
+------------------
+IOASID lifecycle management is based on reference counting. Users of
+IOASID intend to align lifecycle with the IOASID need to hold
+reference of the IOASID. IOASID will not be returned to the pool for
+allocation until all references are dropped. Calling ioasid_free()
+will mark the IOASID as FREE_PENDING if the IOASID has outstanding
+reference. ioasid_get() is not allowed once an IOASID is in the
+FREE_PENDING state.
+
+Event notifications are used to inform users of IOASID status change.
+IOASID_FREE event prompts users to drop their references after
+clearing its context.
+
+For example, on VT-d platform when an IOASID is freed, teardown
+actions are performed on KVM, device driver, and IOMMU driver.
+KVM shall register notifier block with::
+
+ static struct notifier_block pasid_nb_kvm = {
+ .notifier_call = pasid_status_change_kvm,
+ .priority = IOASID_PRIO_CPU,
+ };
+
+VDCM driver shall register notifier block with::
+
+ static struct notifier_block pasid_nb_vdcm = {
+ .notifier_call = pasid_status_change_vdcm,
+ .priority = IOASID_PRIO_DEVICE,
+ };
+
+In both cases, notifier blocks shall be registered on the IOASID set
+such that *only* events from the matching VM is received.
+
+If KVM attempts to register notifier block before the IOASID set is
+created for the MM token, the notifier block will be placed on a
+pending list inside IOASID core. Once the token matching IOASID set
+is created, IOASID will register the notifier block automatically.
+IOASID core does not replay events for the existing IOASIDs in the
+set. For IOASID set of MM type, notification blocks can be registered
+on empty sets only. This is to avoid lost events.
+
+IOMMU driver shall register notifier block on global chain::
+
+ static struct notifier_block pasid_nb_vtd = {
+ .notifier_call = pasid_status_change_vtd,
+ .priority = IOASID_PRIO_IOMMU,
+ };
+
+Custom allocator APIs
+---------------------
+
+::
+
+ int ioasid_register_allocator(struct ioasid_allocator_ops *allocator);
+
+ void ioasid_unregister_allocator(struct ioasid_allocator_ops *allocator);
+
+Allocator Choices
+~~~~~~~~~~~~~~~~~
+IOASIDs are allocated for both host and guest SVA/IOVA usage. However,
+allocators can be different. For example, on VT-d guest PASID
+allocation must be performed via a virtual command interface which is
+emulated by VMM.
+
+IOASID core has the notion of "custom allocator" such that guest can
+register virtual command allocator that precedes the default one.
+
+Namespaces
+~~~~~~~~~~
+IOASIDs are limited system resources that default to 20 bits in
+size. Since each device has its own table, theoretically the namespace
+can be per device also. However, for security reasons sharing PASID
+tables among devices are not good for isolation. Therefore, IOASID
+namespace is system-wide.
+
+There are also other reasons to have this simpler system-wide
+namespace. Take VT-d as an example, VT-d supports shared workqueue
+and ENQCMD[1] where one IOASID could be used to submit work on
+multiple devices that are shared with other VMs. This requires IOASID
+to be system-wide. This is also the reason why guests must use an
+emulated virtual command interface to allocate IOASID from the host.
+
+
+Life cycle
+==========
+This section covers IOASID lifecycle management for both bare-metal
+and guest usages. In bare-metal SVA, MMU notifier is directly hooked
+up with IOMMU driver, therefore the process address space (MM)
+lifecycle is aligned with IOASID.
+
+However, guest MMU notifier is not available to host IOMMU driver,
+when guest MM terminates unexpectedly, the events have to go through
+VFIO and IOMMU UAPI to reach host IOMMU driver. There are also more
+parties involved in guest SVA, e.g. on Intel VT-d platform, IOASIDs
+are used by IOMMU driver, KVM, VDCM, and VFIO.
+
+Native IOASID Life Cycle (VT-d Example)
+---------------------------------------
+
+The normal flow of native SVA code with Intel Data Streaming
+Accelerator(DSA) [2] as example:
+
+1. Host user opens accelerator FD, e.g. DSA driver, or uacce;
+2. DSA driver allocate WQ, do sva_bind_device();
+3. IOMMU driver calls ioasid_alloc(), then bind PASID with device,
+ mmu_notifier_get()
+4. DMA starts by DSA driver userspace
+5. DSA userspace close FD
+6. DSA/uacce kernel driver handles FD.close()
+7. DSA driver stops DMA
+8. DSA driver calls sva_unbind_device();
+9. IOMMU driver does unbind, clears PASID context in IOMMU, flush
+ TLBs. mmu_notifier_put() called.
+10. mmu_notifier.release() called, IOMMU SVA code calls ioasid_free()*
+11. The IOASID is returned to the pool, reclaimed.
+
+::
+
+ * With ENQCMD, PASID used on VT-d is not released in mmu_notifier() but
+ mmdrop(). mmdrop comes after FD close. Should not matter.
+ If the user process dies unexpectedly, Step #10 may come before
+ Step #5, in between, all DMA faults discarded. PRQ responded with
+ code INVALID REQUEST.
+
+During the normal teardown, the following three steps would happen in
+order:
+
+1. Device driver stops DMA request
+2. IOMMU driver unbinds PASID and mm, flush all TLBs, drain in-flight
+ requests.
+3. IOASID freed
+
+Exception happens when process terminates *before* device driver stops
+DMA and call IOMMU driver to unbind. The flow of process exists are as
+follows:
+
+::
+
+ do_exit() {
+ exit_mm() {
+ mm_put();
+ exit_mmap() {
+ intel_invalidate_range() //mmu notifier
+ tlb_finish_mmu()
+ mmu_notifier_release(mm) {
+ intel_iommu_release() {
+ [2] intel_iommu_teardown_pasid();
+ intel_iommu_flush_tlbs();
+ }
+ // tlb_invalidate_range cb removed
+ }
+ unmap_vmas();
+ free_pgtables(); // IOMMU cannot walk PGT after this
+ };
+ }
+ exit_files(tsk) {
+ close_files() {
+ dsa_close();
+ [1] dsa_stop_dma();
+ intel_svm_unbind_pasid(); //nothing to do
+ }
+ }
+ }
+
+ mmdrop() /* some random time later, lazy mm user */ {
+ mm_free_pgd();
+ destroy_context(mm); {
+ [3] ioasid_free();
+ }
+ }
+
+As shown in the list above, step #2 could happen before
+#1. Unrecoverable(UR) faults could happen between #2 and #1.
+
+Also notice that TLB invalidation occurs at mmu_notifier
+invalidate_range callback as well as the release callback. The reason
+is that release callback will delete IOMMU driver from the notifier
+chain which may skip invalidate_range() calls during the exit path.
+
+To avoid unnecessary reporting of UR fault, IOMMU driver shall disable
+fault reporting after free and before unbind.
+
+Guest IOASID Life Cycle (VT-d Example)
+--------------------------------------
+Guest IOASID life cycle starts with guest driver open(), this could be
+uacce or individual accelerator driver such as DSA. At FD open,
+sva_bind_device() is called which triggers a series of actions.
+
+The example below is an illustration of *normal* operations that
+involves *all* the SW components in VT-d. The flow can be simpler if
+no ENQCMD is supported.
+
+::
+
+ VFIO IOMMU KVM VDCM IOASID Ref
+ ..................................................................
+ 1 ioasid_register_notifier/_mm()
+ 2 ioasid_alloc() 1
+ 3 bind_gpasid()
+ 4 iommu_bind()->ioasid_get() 2
+ 5 ioasid_notify(BIND)
+ 6 -> ioasid_get() 3
+ 7 -> vmcs_update_atomic()
+ 8 mdev_write(gpasid)
+ 9 hpasid=
+ 10 find_by_spid(gpasid) 4
+ 11 vdev_write(hpasid)
+ 12 -------- GUEST STARTS DMA --------------------------
+ 13 -------- GUEST STOPS DMA --------------------------
+ 14 mdev_clear(gpasid)
+ 15 vdev_clear(hpasid)
+ 16 ioasid_put() 3
+ 17 unbind_gpasid()
+ 18 iommu_ubind()
+ 19 ioasid_notify(UNBIND)
+ 20 -> vmcs_update_atomic()
+ 21 -> ioasid_put() 2
+ 22 ioasid_free() 1
+ 23 ioasid_put() 0
+ 24 Reclaimed
+ -------------- New Life Cycle Begin ----------------------------
+ 1 ioasid_alloc() -> 1
+
+ Note: IOASID Notification Events: FREE, BIND, UNBIND
+
+Exception cases arise when a guest crashes or a malicious guest
+attempts to cause disruption on the host system. The fault handling
+rules are:
+
+1. IOASID free must *always* succeed.
+2. An inactive period may be required before the freed IOASID is
+ reclaimed. During this period, consumers of IOASID perform cleanup.
+3. Malfunction is limited to the guest owned resources for all
+ programming errors.
+
+The primary source of exception is when the following are out of
+order:
+
+1. Start/Stop of DMA activity
+ (Guest device driver, mdev via VFIO)
+2. Setup/Teardown of IOMMU PASID context, IOTLB, DevTLB flushes
+ (Host IOMMU driver bind/unbind)
+3. Setup/Teardown of VMCS PASID translation table entries (KVM) in
+ case of ENQCMD
+4. Programming/Clearing host PASID in VDCM (Host VDCM driver)
+5. IOASID alloc/free (Host IOASID)
+
+VFIO is the *only* user-kernel interface, which is ultimately
+responsible for exception handlings.
+
+#1 is processed the same way as the assigned device today based on
+device file descriptors and events. There is no special handling.
+
+#3 is based on bind/unbind events emitted by #2.
+
+#4 is naturally aligned with IOASID life cycle in that an illegal
+guest PASID programming would fail in obtaining reference of the
+matching host IOASID.
+
+#5 is similar to #4. The fault will be reported to the user if PASID
+used in the ENQCMD is not set up in VMCS PASID translation table.
+
+Therefore, the remaining out of order problem is between #2 and
+#5. I.e. unbind vs. free. More specifically, free before unbind.
+
+IOASID notifier and refcounting are used to ensure order. Following
+a publisher-subscriber pattern where:
+
+- Publishers: VFIO & IOMMU
+- Subscribers: KVM, VDCM, IOMMU
+
+IOASID notifier is atomic which requires subscribers to do quick
+handling of the event in the atomic context. Workqueue can be used for
+any processing that requires thread context. IOASID reference must be
+acquired before receiving the FREE event. The reference must be
+dropped at the end of the processing in order to return the IOASID to
+the pool.
+
+Let's examine the IOASID life cycle again when free happens *before*
+unbind. This could be a result of misbehaving guests or crash. Assuming
+VFIO cannot enforce unbind->free order. Notice that the setup part up
+until step #12 is identical to the normal case, the flow below starts
+with step 13.
+
+::
+
+ VFIO IOMMU KVM VDCM IOASID Ref
+ ..................................................................
+ 13 -------- GUEST STARTS DMA --------------------------
+ 14 -------- *GUEST MISBEHAVES!!!* ----------------
+ 15 ioasid_free()
+ 16 ioasid_notify(FREE)
+ 17 mark_ioasid_inactive[1]
+ 18 kvm_nb_handler(FREE)
+ 19 vmcs_update_atomic()
+ 20 ioasid_put_locked() -> 3
+ 21 vdcm_nb_handler(FREE)
+ 22 iomm_nb_handler(FREE)
+ 23 ioasid_free() returns[2] schedule_work() 2
+ 24 schedule_work() vdev_clear_wk(hpasid)
+ 25 teardown_pasid_wk()
+ 26 ioasid_put() -> 1
+ 27 ioasid_put() 0
+ 28 Reclaimed
+ 29 unbind_gpasid()
+ 30 iommu_unbind()->ioasid_find() Fails[3]
+ -------------- New Life Cycle Begin ----------------------------
+
+Note:
+
+1. By marking IOASID inactive at step #17, no new references can be
+ held. ioasid_get/find() will return -ENOENT;
+2. After step #23, all events can go out of order. Shall not affect
+ the outcome.
+3. IOMMU driver fails to find private data for unbinding. If unbind is
+ called after the same IOASID is allocated for the same guest again,
+ this is a programming error. The damage is limited to the guest
+ itself since unbind performs permission checking based on the
+ IOASID set associated with the guest process.
+
+KVM PASID Translation Table Updates
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+Per VM PASID translation table is maintained by KVM in order to
+support ENQCMD in the guest. The table contains host-guest PASID
+translations to be consumed by CPU ucode. The synchronization of the
+PASID states depends on VFIO/IOMMU driver, where IOCTL and atomic
+notifiers are used. KVM must register IOASID notifier per VM instance
+during launch time. The following events are handled:
+
+1. BIND/UNBIND
+2. FREE
+
+Rules:
+
+1. Multiple devices can bind with the same PASID, this can be different PCI
+ devices or mdevs within the same PCI device. However, only the
+ *first* BIND and *last* UNBIND emit notifications.
+2. IOASID code is responsible for ensuring the correctness of H-G
+ PASID mapping. There is no need for KVM to validate the
+ notification data.
+3. When UNBIND happens *after* FREE, KVM will see error in
+ ioasid_get() even when the reclaim is not done. IOMMU driver will
+ also avoid sending UNBIND if the PASID is already FREE.
+4. When KVM terminates *before* FREE & UNBIND, references will be
+ dropped for all host PASIDs.
+
+VDCM PASID Programming
+~~~~~~~~~~~~~~~~~~~~~~
+VDCM composes virtual devices and exposes them to the guests. When
+the guest allocates a PASID then program it to the virtual device, VDCM
+intercepts the programming attempt then program the matching host
+PASID on to the hardware.
+Conversely, when a device is going away, VDCM must be informed such
+that PASID context on the hardware can be cleared. There could be
+multiple mdevs assigned to different guests in the same VDCM. Since
+the PASID table is shared at PCI device level, lazy clearing is not
+secure. A malicious guest can attack by using newly freed PASIDs that
+are allocated by another guest.
+
+By holding a reference of the PASID until VDCM cleans up the HW context,
+it is guaranteed that PASID life cycles do not cross within the same
+device.
+
+
+Reference
+====================================================
+1. https://software.intel.com/sites/default/files/managed/c5/15/architecture-instruction-set-extensions-programming-reference.pdf
+
+2. https://01.org/blogs/2019/introducing-intel-data-streaming-accelerator
+
+3. https://software.intel.com/en-us/download/intel-data-streaming-accelerator-preliminary-architecture-specification
--
2.7.4

2020-08-22 04:29:48

by Jacob Pan

[permalink] [raw]
Subject: [PATCH v2 3/9] iommu/ioasid: Introduce ioasid_set APIs

ioasid_set was introduced as an arbitrary token that are shared by a
group of IOASIDs. For example, if IOASID #1 and #2 are allocated via the
same ioasid_set*, they are viewed as to belong to the same set.

For guest SVA usages, system-wide IOASID resources need to be
partitioned such that VMs can have its own quota and being managed
separately. ioasid_set is the perfect candidate for meeting such
requirements. This patch redefines and extends ioasid_set with the
following new fields:
- Quota
- Reference count
- Storage of its namespace
- The token is stored in the new ioasid_set but with optional types

ioasid_set level APIs are introduced that wires up these new data.
Existing users of IOASID APIs are converted where a host IOASID set is
allocated for bare-metal usage.

Signed-off-by: Liu Yi L <[email protected]>
Signed-off-by: Jacob Pan <[email protected]>
---
drivers/iommu/intel/iommu.c | 27 ++-
drivers/iommu/intel/pasid.h | 1 +
drivers/iommu/intel/svm.c | 8 +-
drivers/iommu/ioasid.c | 390 +++++++++++++++++++++++++++++++++++++++++---
include/linux/ioasid.h | 82 ++++++++--
5 files changed, 465 insertions(+), 43 deletions(-)

diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index a3a0b5c8921d..5813eeaa5edb 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -42,6 +42,7 @@
#include <linux/crash_dump.h>
#include <linux/numa.h>
#include <linux/swiotlb.h>
+#include <linux/ioasid.h>
#include <asm/irq_remapping.h>
#include <asm/cacheflush.h>
#include <asm/iommu.h>
@@ -103,6 +104,9 @@
*/
#define INTEL_IOMMU_PGSIZES (~0xFFFUL)

+/* PASIDs used by host SVM */
+struct ioasid_set *host_pasid_set;
+
static inline int agaw_to_level(int agaw)
{
return agaw + 2;
@@ -3103,8 +3107,8 @@ static void intel_vcmd_ioasid_free(ioasid_t ioasid, void *data)
* Sanity check the ioasid owner is done at upper layer, e.g. VFIO
* We can only free the PASID when all the devices are unbound.
*/
- if (ioasid_find(NULL, ioasid, NULL)) {
- pr_alert("Cannot free active IOASID %d\n", ioasid);
+ if (IS_ERR(ioasid_find(host_pasid_set, ioasid, NULL))) {
+ pr_err("Cannot free IOASID %d, not in system set\n", ioasid);
return;
}
vcmd_free_pasid(iommu, ioasid);
@@ -3288,6 +3292,19 @@ static int __init init_dmars(void)
if (ret)
goto free_iommu;

+ /* PASID is needed for scalable mode irrespective to SVM */
+ if (intel_iommu_sm) {
+ ioasid_install_capacity(intel_pasid_max_id);
+ /* We should not run out of IOASIDs at boot */
+ host_pasid_set = ioasid_alloc_set(NULL, PID_MAX_DEFAULT,
+ IOASID_SET_TYPE_NULL);
+ if (IS_ERR_OR_NULL(host_pasid_set)) {
+ pr_err("Failed to enable host PASID allocator %lu\n",
+ PTR_ERR(host_pasid_set));
+ intel_iommu_sm = 0;
+ }
+ }
+
/*
* for each drhd
* enable fault log
@@ -5149,7 +5166,7 @@ static void auxiliary_unlink_device(struct dmar_domain *domain,
domain->auxd_refcnt--;

if (!domain->auxd_refcnt && domain->default_pasid > 0)
- ioasid_free(domain->default_pasid);
+ ioasid_free(host_pasid_set, domain->default_pasid);
}

static int aux_domain_add_dev(struct dmar_domain *domain,
@@ -5167,7 +5184,7 @@ static int aux_domain_add_dev(struct dmar_domain *domain,
int pasid;

/* No private data needed for the default pasid */
- pasid = ioasid_alloc(NULL, PASID_MIN,
+ pasid = ioasid_alloc(host_pasid_set, PASID_MIN,
pci_max_pasids(to_pci_dev(dev)) - 1,
NULL);
if (pasid == INVALID_IOASID) {
@@ -5210,7 +5227,7 @@ static int aux_domain_add_dev(struct dmar_domain *domain,
spin_unlock(&iommu->lock);
spin_unlock_irqrestore(&device_domain_lock, flags);
if (!domain->auxd_refcnt && domain->default_pasid > 0)
- ioasid_free(domain->default_pasid);
+ ioasid_free(host_pasid_set, domain->default_pasid);

return ret;
}
diff --git a/drivers/iommu/intel/pasid.h b/drivers/iommu/intel/pasid.h
index c9850766c3a9..ccdc23446015 100644
--- a/drivers/iommu/intel/pasid.h
+++ b/drivers/iommu/intel/pasid.h
@@ -99,6 +99,7 @@ static inline bool pasid_pte_is_present(struct pasid_entry *pte)
}

extern u32 intel_pasid_max_id;
+extern struct ioasid_set *host_pasid_set;
int intel_pasid_alloc_id(void *ptr, int start, int end, gfp_t gfp);
void intel_pasid_free_id(int pasid);
void *intel_pasid_lookup_id(int pasid);
diff --git a/drivers/iommu/intel/svm.c b/drivers/iommu/intel/svm.c
index 37a9beabc0ca..634e191ca2c3 100644
--- a/drivers/iommu/intel/svm.c
+++ b/drivers/iommu/intel/svm.c
@@ -551,7 +551,7 @@ intel_svm_bind_mm(struct device *dev, int flags, struct svm_dev_ops *ops,
pasid_max = intel_pasid_max_id;

/* Do not use PASID 0, reserved for RID to PASID */
- svm->pasid = ioasid_alloc(NULL, PASID_MIN,
+ svm->pasid = ioasid_alloc(host_pasid_set, PASID_MIN,
pasid_max - 1, svm);
if (svm->pasid == INVALID_IOASID) {
kfree(svm);
@@ -568,7 +568,7 @@ intel_svm_bind_mm(struct device *dev, int flags, struct svm_dev_ops *ops,
if (mm) {
ret = mmu_notifier_register(&svm->notifier, mm);
if (ret) {
- ioasid_free(svm->pasid);
+ ioasid_free(host_pasid_set, svm->pasid);
kfree(svm);
kfree(sdev);
goto out;
@@ -586,7 +586,7 @@ intel_svm_bind_mm(struct device *dev, int flags, struct svm_dev_ops *ops,
if (ret) {
if (mm)
mmu_notifier_unregister(&svm->notifier, mm);
- ioasid_free(svm->pasid);
+ ioasid_free(host_pasid_set, svm->pasid);
kfree(svm);
kfree(sdev);
goto out;
@@ -655,7 +655,7 @@ static int intel_svm_unbind_mm(struct device *dev, int pasid)
kfree_rcu(sdev, rcu);

if (list_empty(&svm->devs)) {
- ioasid_free(svm->pasid);
+ ioasid_free(host_pasid_set, svm->pasid);
if (svm->mm)
mmu_notifier_unregister(&svm->notifier, svm->mm);
list_del(&svm->list);
diff --git a/drivers/iommu/ioasid.c b/drivers/iommu/ioasid.c
index 5f63af07acd5..f73b3dbfc37a 100644
--- a/drivers/iommu/ioasid.c
+++ b/drivers/iommu/ioasid.c
@@ -1,22 +1,58 @@
// SPDX-License-Identifier: GPL-2.0
/*
* I/O Address Space ID allocator. There is one global IOASID space, split into
- * subsets. Users create a subset with DECLARE_IOASID_SET, then allocate and
- * free IOASIDs with ioasid_alloc and ioasid_free.
+ * subsets. Users create a subset with ioasid_alloc_set, then allocate/free IDs
+ * with ioasid_alloc and ioasid_free.
*/
-#include <linux/ioasid.h>
#include <linux/module.h>
#include <linux/slab.h>
#include <linux/spinlock.h>
#include <linux/xarray.h>
+#include <linux/ioasid.h>
+
+static DEFINE_XARRAY_ALLOC(ioasid_sets);
+enum ioasid_state {
+ IOASID_STATE_INACTIVE,
+ IOASID_STATE_ACTIVE,
+ IOASID_STATE_FREE_PENDING,
+};

+/**
+ * struct ioasid_data - Meta data about ioasid
+ *
+ * @id: Unique ID
+ * @users Number of active users
+ * @state Track state of the IOASID
+ * @set Meta data of the set this IOASID belongs to
+ * @private Private data associated with the IOASID
+ * @rcu For free after RCU grace period
+ */
struct ioasid_data {
ioasid_t id;
struct ioasid_set *set;
+ refcount_t users;
+ enum ioasid_state state;
void *private;
struct rcu_head rcu;
};

+/* Default to PCIe standard 20 bit PASID */
+#define PCI_PASID_MAX 0x100000
+static ioasid_t ioasid_capacity = PCI_PASID_MAX;
+static ioasid_t ioasid_capacity_avail = PCI_PASID_MAX;
+
+void ioasid_install_capacity(ioasid_t total)
+{
+ ioasid_capacity = ioasid_capacity_avail = total;
+}
+EXPORT_SYMBOL_GPL(ioasid_install_capacity);
+
+ioasid_t ioasid_get_capacity()
+{
+ return ioasid_capacity;
+}
+EXPORT_SYMBOL_GPL(ioasid_get_capacity);
+
/*
* struct ioasid_allocator_data - Internal data structure to hold information
* about an allocator. There are two types of allocators:
@@ -306,11 +342,23 @@ ioasid_t ioasid_alloc(struct ioasid_set *set, ioasid_t min, ioasid_t max,
{
struct ioasid_data *data;
void *adata;
- ioasid_t id;
+ ioasid_t id = INVALID_IOASID;
+
+ spin_lock(&ioasid_allocator_lock);
+ /* Check if the IOASID set has been allocated and initialized */
+ if (WARN_ON(xa_load(&ioasid_sets, set->sid) != set)) {
+ pr_warn("Invalid set\n");
+ goto done_unlock;
+ }
+
+ if (set->quota <= set->nr_ioasids) {
+ pr_err("IOASID set %d out of quota %d\n", set->sid, set->quota);
+ goto done_unlock;
+ }

data = kzalloc(sizeof(*data), GFP_ATOMIC);
if (!data)
- return INVALID_IOASID;
+ goto done_unlock;

data->set = set;
data->private = private;
@@ -319,7 +367,6 @@ ioasid_t ioasid_alloc(struct ioasid_set *set, ioasid_t min, ioasid_t max,
* Custom allocator needs allocator data to perform platform specific
* operations.
*/
- spin_lock(&ioasid_allocator_lock);
adata = active_allocator->flags & IOASID_ALLOCATOR_CUSTOM ? active_allocator->ops->pdata : data;
id = active_allocator->ops->alloc(min, max, adata);
if (id == INVALID_IOASID) {
@@ -335,42 +382,339 @@ ioasid_t ioasid_alloc(struct ioasid_set *set, ioasid_t min, ioasid_t max,
goto exit_free;
}
data->id = id;
+ data->state = IOASID_STATE_ACTIVE;
+ refcount_set(&data->users, 1);
+
+ /* Store IOASID in the per set data */
+ if (xa_err(xa_store(&set->xa, id, data, GFP_ATOMIC))) {
+ pr_err("Failed to ioasid %d in set %d\n", id, set->sid);
+ goto exit_free;
+ }
+ set->nr_ioasids++;
+ goto done_unlock;

- spin_unlock(&ioasid_allocator_lock);
- return id;
exit_free:
- spin_unlock(&ioasid_allocator_lock);
kfree(data);
- return INVALID_IOASID;
+done_unlock:
+ spin_unlock(&ioasid_allocator_lock);
+ return id;
}
EXPORT_SYMBOL_GPL(ioasid_alloc);

+static void ioasid_do_free(struct ioasid_data *data)
+{
+ struct ioasid_data *ioasid_data;
+ struct ioasid_set *sdata;
+
+ active_allocator->ops->free(data->id, active_allocator->ops->pdata);
+ /* Custom allocator needs additional steps to free the xa element */
+ if (active_allocator->flags & IOASID_ALLOCATOR_CUSTOM) {
+ ioasid_data = xa_erase(&active_allocator->xa, data->id);
+ kfree_rcu(ioasid_data, rcu);
+ }
+
+ sdata = xa_load(&ioasid_sets, data->set->sid);
+ if (!sdata) {
+ pr_err("No set %d for IOASID %d\n", data->set->sid,
+ data->id);
+ return;
+ }
+ xa_erase(&sdata->xa, data->id);
+ sdata->nr_ioasids--;
+}
+
+static void ioasid_free_locked(struct ioasid_set *set, ioasid_t ioasid)
+{
+ struct ioasid_data *data;
+
+ data = xa_load(&active_allocator->xa, ioasid);
+ if (!data) {
+ pr_err("Trying to free unknown IOASID %u\n", ioasid);
+ return;
+ }
+
+ if (data->set != set) {
+ pr_warn("Cannot free IOASID %u due to set ownership\n", ioasid);
+ return;
+ }
+ data->state = IOASID_STATE_FREE_PENDING;
+
+ if (!refcount_dec_and_test(&data->users))
+ return;
+
+ ioasid_do_free(data);
+}
+
/**
- * ioasid_free - Free an IOASID
- * @ioasid: the ID to remove
+ * ioasid_free - Drop reference on an IOASID. Free if refcount drops to 0,
+ * including free from its set and system-wide list.
+ * @set: The ioasid_set to check permission with. If not NULL, IOASID
+ * free will fail if the set does not match.
+ * @ioasid: The IOASID to remove
*/
-void ioasid_free(ioasid_t ioasid)
+void ioasid_free(struct ioasid_set *set, ioasid_t ioasid)
{
- struct ioasid_data *ioasid_data;
+ spin_lock(&ioasid_allocator_lock);
+ ioasid_free_locked(set, ioasid);
+ spin_unlock(&ioasid_allocator_lock);
+}
+EXPORT_SYMBOL_GPL(ioasid_free);

+/**
+ * ioasid_alloc_set - Allocate a new IOASID set for a given token
+ *
+ * @token: Unique token of the IOASID set, cannot be NULL
+ * @quota: Quota allowed in this set. Only for new set creation
+ * @flags: Special requirements
+ *
+ * IOASID can be limited system-wide resource that requires quota management.
+ * If caller does not wish to enforce quota, use IOASID_SET_NO_QUOTA flag.
+ *
+ * Token will be stored in the ioasid_set returned. A reference will be taken
+ * upon finding a matching set or newly created set.
+ * IOASID allocation within the set and other per set operations will use
+ * the retured ioasid_set *.
+ *
+ */
+struct ioasid_set *ioasid_alloc_set(void *token, ioasid_t quota, int type)
+{
+ struct ioasid_set *sdata;
+ unsigned long index;
+ ioasid_t id;
+
+ if (type >= IOASID_SET_TYPE_NR)
+ return ERR_PTR(-EINVAL);
+
+ /*
+ * Need to check space available if we share system-wide quota.
+ * TODO: we may need to support quota free sets in the future.
+ */
spin_lock(&ioasid_allocator_lock);
- ioasid_data = xa_load(&active_allocator->xa, ioasid);
- if (!ioasid_data) {
- pr_err("Trying to free unknown IOASID %u\n", ioasid);
+ if (quota > ioasid_capacity_avail) {
+ pr_warn("Out of IOASID capacity! ask %d, avail %d\n",
+ quota, ioasid_capacity_avail);
+ sdata = ERR_PTR(-ENOSPC);
goto exit_unlock;
}

- active_allocator->ops->free(ioasid, active_allocator->ops->pdata);
- /* Custom allocator needs additional steps to free the xa element */
- if (active_allocator->flags & IOASID_ALLOCATOR_CUSTOM) {
- ioasid_data = xa_erase(&active_allocator->xa, ioasid);
- kfree_rcu(ioasid_data, rcu);
+ /*
+ * Token is only unique within its types but right now we have only
+ * mm type. If we have more token types, we have to match type as well.
+ */
+ switch (type) {
+ case IOASID_SET_TYPE_MM:
+ /* Search existing set tokens, reject duplicates */
+ xa_for_each(&ioasid_sets, index, sdata) {
+ if (sdata->token == token &&
+ sdata->type == IOASID_SET_TYPE_MM) {
+ sdata = ERR_PTR(-EEXIST);
+ goto exit_unlock;
+ }
+ }
+ break;
+ case IOASID_SET_TYPE_NULL:
+ if (!token)
+ break;
+ fallthrough;
+ default:
+ pr_err("Invalid token and IOASID type\n");
+ sdata = ERR_PTR(-EINVAL);
+ goto exit_unlock;
}

+ /* REVISIT: may support set w/o quota, use system available */
+ if (!quota) {
+ sdata = ERR_PTR(-EINVAL);
+ goto exit_unlock;
+ }
+
+ sdata = kzalloc(sizeof(*sdata), GFP_ATOMIC);
+ if (!sdata) {
+ sdata = ERR_PTR(-ENOMEM);
+ goto exit_unlock;
+ }
+
+ if (xa_alloc(&ioasid_sets, &id, sdata,
+ XA_LIMIT(0, ioasid_capacity_avail - quota),
+ GFP_ATOMIC)) {
+ kfree(sdata);
+ sdata = ERR_PTR(-ENOSPC);
+ goto exit_unlock;
+ }
+
+ sdata->token = token;
+ sdata->type = type;
+ sdata->quota = quota;
+ sdata->sid = id;
+ refcount_set(&sdata->ref, 1);
+
+ /*
+ * Per set XA is used to store private IDs within the set, get ready
+ * for ioasid_set private ID and system-wide IOASID allocation
+ * results.
+ */
+ xa_init_flags(&sdata->xa, XA_FLAGS_ALLOC);
+ ioasid_capacity_avail -= quota;
+
exit_unlock:
spin_unlock(&ioasid_allocator_lock);
+
+ return sdata;
}
-EXPORT_SYMBOL_GPL(ioasid_free);
+EXPORT_SYMBOL_GPL(ioasid_alloc_set);
+
+void ioasid_set_get_locked(struct ioasid_set *set)
+{
+ if (WARN_ON(xa_load(&ioasid_sets, set->sid) != set)) {
+ pr_warn("Invalid set data\n");
+ return;
+ }
+
+ refcount_inc(&set->ref);
+}
+EXPORT_SYMBOL_GPL(ioasid_set_get_locked);
+
+void ioasid_set_get(struct ioasid_set *set)
+{
+ spin_lock(&ioasid_allocator_lock);
+ ioasid_set_get_locked(set);
+ spin_unlock(&ioasid_allocator_lock);
+}
+EXPORT_SYMBOL_GPL(ioasid_set_get);
+
+void ioasid_set_put_locked(struct ioasid_set *set)
+{
+ struct ioasid_data *entry;
+ unsigned long index;
+
+ if (WARN_ON(xa_load(&ioasid_sets, set->sid) != set)) {
+ pr_warn("Invalid set data\n");
+ return;
+ }
+
+ if (!refcount_dec_and_test(&set->ref)) {
+ pr_debug("%s: IOASID set %d has %d users\n",
+ __func__, set->sid, refcount_read(&set->ref));
+ return;
+ }
+
+ /* The set is already empty, we just destroy the set. */
+ if (xa_empty(&set->xa))
+ goto done_destroy;
+
+ /*
+ * Free all PASIDs from system-wide IOASID pool, all subscribers gets
+ * notified and do cleanup of their own.
+ * Note that some references of the IOASIDs within the set can still
+ * be held after the free call. This is OK in that the IOASIDs will be
+ * marked inactive, the only operations can be done is ioasid_put.
+ * No need to track IOASID set states since there is no reclaim phase.
+ */
+ xa_for_each(&set->xa, index, entry) {
+ ioasid_free_locked(set, index);
+ /* Free from per set private pool */
+ xa_erase(&set->xa, index);
+ }
+
+done_destroy:
+ /* Return the quota back to system pool */
+ ioasid_capacity_avail += set->quota;
+ kfree_rcu(set, rcu);
+
+ /*
+ * Token got released right away after the ioasid_set is freed.
+ * If a new set is created immediately with the newly released token,
+ * it will not allocate the same IOASIDs unless they are reclaimed.
+ */
+ xa_erase(&ioasid_sets, set->sid);
+}
+EXPORT_SYMBOL_GPL(ioasid_set_put_locked);
+
+/**
+ * ioasid_set_put - Drop a reference to the IOASID set. Free all IOASIDs within
+ * the set if there are no more users.
+ *
+ * @set: The IOASID set ID to be freed
+ *
+ * If refcount drops to zero, all IOASIDs allocated within the set will be
+ * freed.
+ */
+void ioasid_set_put(struct ioasid_set *set)
+{
+ spin_lock(&ioasid_allocator_lock);
+ ioasid_set_put_locked(set);
+ spin_unlock(&ioasid_allocator_lock);
+}
+EXPORT_SYMBOL_GPL(ioasid_set_put);
+
+/**
+ * ioasid_adjust_set - Adjust the quota of an IOASID set
+ * @set: IOASID set to be assigned
+ * @quota: Quota allowed in this set
+ *
+ * Return 0 on success. If the new quota is smaller than the number of
+ * IOASIDs already allocated, -EINVAL will be returned. No change will be
+ * made to the existing quota.
+ */
+int ioasid_adjust_set(struct ioasid_set *set, int quota)
+{
+ int ret = 0;
+
+ spin_lock(&ioasid_allocator_lock);
+ if (set->nr_ioasids > quota) {
+ pr_err("New quota %d is smaller than outstanding IOASIDs %d\n",
+ quota, set->nr_ioasids);
+ ret = -EINVAL;
+ goto done_unlock;
+ }
+
+ if (quota >= ioasid_capacity_avail) {
+ ret = -ENOSPC;
+ goto done_unlock;
+ }
+
+ /* Return the delta back to system pool */
+ ioasid_capacity_avail += set->quota - quota;
+
+ /*
+ * May have a policy to prevent giving all available IOASIDs
+ * to one set. But we don't enforce here, it should be in the
+ * upper layers.
+ */
+ set->quota = quota;
+
+done_unlock:
+ spin_unlock(&ioasid_allocator_lock);
+
+ return ret;
+}
+EXPORT_SYMBOL_GPL(ioasid_adjust_set);
+
+/**
+ * ioasid_set_for_each_ioasid - Iterate over all the IOASIDs within the set
+ *
+ * Caller must hold a reference of the set and handles its own locking.
+ */
+int ioasid_set_for_each_ioasid(struct ioasid_set *set,
+ void (*fn)(ioasid_t id, void *data),
+ void *data)
+{
+ struct ioasid_data *entry;
+ unsigned long index;
+ int ret = 0;
+
+ if (xa_empty(&set->xa)) {
+ pr_warn("No IOASIDs in the set %d\n", set->sid);
+ return -ENOENT;
+ }
+
+ xa_for_each(&set->xa, index, entry) {
+ fn(index, data);
+ }
+
+ return ret;
+}
+EXPORT_SYMBOL_GPL(ioasid_set_for_each_ioasid);

/**
* ioasid_find - Find IOASID data
diff --git a/include/linux/ioasid.h b/include/linux/ioasid.h
index 9c44947a68c8..412d025d440e 100644
--- a/include/linux/ioasid.h
+++ b/include/linux/ioasid.h
@@ -10,8 +10,35 @@ typedef unsigned int ioasid_t;
typedef ioasid_t (*ioasid_alloc_fn_t)(ioasid_t min, ioasid_t max, void *data);
typedef void (*ioasid_free_fn_t)(ioasid_t ioasid, void *data);

+/* IOASID set types */
+enum ioasid_set_type {
+ IOASID_SET_TYPE_NULL = 1, /* Set token is NULL */
+ IOASID_SET_TYPE_MM, /* Set token is a mm_struct,
+ * i.e. associated with a process
+ */
+ IOASID_SET_TYPE_NR,
+};
+
+/**
+ * struct ioasid_set - Meta data about ioasid_set
+ * @type: Token types and other features
+ * @token: Unique to identify an IOASID set
+ * @xa: XArray to store ioasid_set private IDs, can be used for
+ * guest-host IOASID mapping, or just a private IOASID namespace.
+ * @quota: Max number of IOASIDs can be allocated within the set
+ * @nr_ioasids Number of IOASIDs currently allocated in the set
+ * @sid: ID of the set
+ * @ref: Reference count of the users
+ */
struct ioasid_set {
- int dummy;
+ void *token;
+ struct xarray xa;
+ int type;
+ int quota;
+ int nr_ioasids;
+ int sid;
+ refcount_t ref;
+ struct rcu_head rcu;
};

/**
@@ -29,31 +56,64 @@ struct ioasid_allocator_ops {
void *pdata;
};

-#define DECLARE_IOASID_SET(name) struct ioasid_set name = { 0 }
-
#if IS_ENABLED(CONFIG_IOASID)
+void ioasid_install_capacity(ioasid_t total);
+ioasid_t ioasid_get_capacity(void);
+struct ioasid_set *ioasid_alloc_set(void *token, ioasid_t quota, int type);
+int ioasid_adjust_set(struct ioasid_set *set, int quota);
+void ioasid_set_get_locked(struct ioasid_set *set);
+void ioasid_set_put_locked(struct ioasid_set *set);
+void ioasid_set_put(struct ioasid_set *set);
+
ioasid_t ioasid_alloc(struct ioasid_set *set, ioasid_t min, ioasid_t max,
void *private);
-void ioasid_free(ioasid_t ioasid);
-void *ioasid_find(struct ioasid_set *set, ioasid_t ioasid,
- bool (*getter)(void *));
+void ioasid_free(struct ioasid_set *set, ioasid_t ioasid);
+
+bool ioasid_is_active(ioasid_t ioasid);
+
+void *ioasid_find(struct ioasid_set *set, ioasid_t ioasid, bool (*getter)(void *));
+int ioasid_attach_data(ioasid_t ioasid, void *data);
int ioasid_register_allocator(struct ioasid_allocator_ops *allocator);
void ioasid_unregister_allocator(struct ioasid_allocator_ops *allocator);
-int ioasid_attach_data(ioasid_t ioasid, void *data);
-
+void ioasid_is_in_set(struct ioasid_set *set, ioasid_t ioasid);
+int ioasid_set_for_each_ioasid(struct ioasid_set *sdata,
+ void (*fn)(ioasid_t id, void *data),
+ void *data);
#else /* !CONFIG_IOASID */
+static inline void ioasid_install_capacity(ioasid_t total)
+{
+}
+
+static inline ioasid_t ioasid_get_capacity(void)
+{
+ return 0;
+}
+
static inline ioasid_t ioasid_alloc(struct ioasid_set *set, ioasid_t min,
ioasid_t max, void *private)
{
return INVALID_IOASID;
}

-static inline void ioasid_free(ioasid_t ioasid)
+static inline void ioasid_free(struct ioasid_set *set, ioasid_t ioasid)
+{
+}
+
+static inline bool ioasid_is_active(ioasid_t ioasid)
+{
+ return false;
+}
+
+static inline struct ioasid_set *ioasid_alloc_set(void *token, ioasid_t quota, int type)
+{
+ return ERR_PTR(-ENOTSUPP);
+}
+
+static inline void ioasid_set_put(struct ioasid_set *set)
{
}

-static inline void *ioasid_find(struct ioasid_set *set, ioasid_t ioasid,
- bool (*getter)(void *))
+static inline void *ioasid_find(struct ioasid_set *set, ioasid_t ioasid, bool (*getter)(void *))
{
return NULL;
}
--
2.7.4

2020-08-22 04:29:56

by Jacob Pan

[permalink] [raw]
Subject: [PATCH v2 4/9] iommu/ioasid: Add reference couting functions

There can be multiple users of an IOASID, each user could have hardware
contexts associated with the IOASID. In order to align lifecycles,
reference counting is introduced in this patch. It is expected that when
an IOASID is being freed, each user will drop a reference only after its
context is cleared.

Signed-off-by: Jacob Pan <[email protected]>
---
drivers/iommu/ioasid.c | 113 +++++++++++++++++++++++++++++++++++++++++++++++++
include/linux/ioasid.h | 4 ++
2 files changed, 117 insertions(+)

diff --git a/drivers/iommu/ioasid.c b/drivers/iommu/ioasid.c
index f73b3dbfc37a..5f31d63c75b1 100644
--- a/drivers/iommu/ioasid.c
+++ b/drivers/iommu/ioasid.c
@@ -717,6 +717,119 @@ int ioasid_set_for_each_ioasid(struct ioasid_set *set,
EXPORT_SYMBOL_GPL(ioasid_set_for_each_ioasid);

/**
+ * IOASID refcounting rules
+ * - ioasid_alloc() set initial refcount to 1
+ *
+ * - ioasid_free() decrement and test refcount.
+ * If refcount is 0, ioasid will be freed. Deleted from the system-wide
+ * xarray as well as per set xarray. The IOASID will be returned to the
+ * pool and available for new allocations.
+ *
+ * If recount is non-zero, mark IOASID as IOASID_STATE_FREE_PENDING.
+ * No new reference can be added. The IOASID is not returned to the pool
+ * for reuse.
+ * After free, ioasid_get() will return error but ioasid_find() and other
+ * non refcount adding APIs will continue to work until the last reference
+ * is dropped
+ *
+ * - ioasid_get() get a reference on an active IOASID
+ *
+ * - ioasid_put() decrement and test refcount of the IOASID.
+ * If refcount is 0, ioasid will be freed. Deleted from the system-wide
+ * xarray as well as per set xarray. The IOASID will be returned to the
+ * pool and available for new allocations.
+ * Do nothing if refcount is non-zero.
+ *
+ * - ioasid_find() does not take reference, caller must hold reference
+ *
+ * ioasid_free() can be called multiple times without error until all refs are
+ * dropped.
+ */
+
+int ioasid_get_locked(struct ioasid_set *set, ioasid_t ioasid)
+{
+ struct ioasid_data *data;
+
+ data = xa_load(&active_allocator->xa, ioasid);
+ if (!data) {
+ pr_err("Trying to get unknown IOASID %u\n", ioasid);
+ return -EINVAL;
+ }
+ if (data->state == IOASID_STATE_FREE_PENDING) {
+ pr_err("Trying to get IOASID being freed%u\n", ioasid);
+ return -EBUSY;
+ }
+
+ if (set && data->set != set) {
+ pr_err("Trying to get IOASID not in set%u\n", ioasid);
+ /* data found but does not belong to the set */
+ return -EACCES;
+ }
+ refcount_inc(&data->users);
+
+ return 0;
+}
+EXPORT_SYMBOL_GPL(ioasid_get_locked);
+
+/**
+ * ioasid_get - Obtain a reference of an ioasid
+ * @set
+ * @ioasid
+ *
+ * Check set ownership if @set is non-null.
+ */
+int ioasid_get(struct ioasid_set *set, ioasid_t ioasid)
+{
+ int ret = 0;
+
+ spin_lock(&ioasid_allocator_lock);
+ ret = ioasid_get_locked(set, ioasid);
+ spin_unlock(&ioasid_allocator_lock);
+
+ return ret;
+}
+EXPORT_SYMBOL_GPL(ioasid_get);
+
+void ioasid_put_locked(struct ioasid_set *set, ioasid_t ioasid)
+{
+ struct ioasid_data *data;
+
+ data = xa_load(&active_allocator->xa, ioasid);
+ if (!data) {
+ pr_err("Trying to put unknown IOASID %u\n", ioasid);
+ return;
+ }
+
+ if (set && data->set != set) {
+ pr_err("Trying to drop IOASID not in the set %u\n", ioasid);
+ return;
+ }
+
+ if (!refcount_dec_and_test(&data->users)) {
+ pr_debug("%s: IOASID %d has %d remainning users\n",
+ __func__, ioasid, refcount_read(&data->users));
+ return;
+ }
+ ioasid_do_free(data);
+}
+EXPORT_SYMBOL_GPL(ioasid_put_locked);
+
+/**
+ * ioasid_put - Drop a reference of an ioasid
+ * @set
+ * @ioasid
+ *
+ * Check set ownership if @set is non-null.
+ */
+void ioasid_put(struct ioasid_set *set, ioasid_t ioasid)
+{
+ spin_lock(&ioasid_allocator_lock);
+ ioasid_put_locked(set, ioasid);
+ spin_unlock(&ioasid_allocator_lock);
+}
+EXPORT_SYMBOL_GPL(ioasid_put);
+
+/**
* ioasid_find - Find IOASID data
* @set: the IOASID set
* @ioasid: the IOASID to find
diff --git a/include/linux/ioasid.h b/include/linux/ioasid.h
index 412d025d440e..310abe4187a3 100644
--- a/include/linux/ioasid.h
+++ b/include/linux/ioasid.h
@@ -76,6 +76,10 @@ int ioasid_attach_data(ioasid_t ioasid, void *data);
int ioasid_register_allocator(struct ioasid_allocator_ops *allocator);
void ioasid_unregister_allocator(struct ioasid_allocator_ops *allocator);
void ioasid_is_in_set(struct ioasid_set *set, ioasid_t ioasid);
+int ioasid_get(struct ioasid_set *set, ioasid_t ioasid);
+int ioasid_get_locked(struct ioasid_set *set, ioasid_t ioasid);
+void ioasid_put(struct ioasid_set *set, ioasid_t ioasid);
+void ioasid_put_locked(struct ioasid_set *set, ioasid_t ioasid);
int ioasid_set_for_each_ioasid(struct ioasid_set *sdata,
void (*fn)(ioasid_t id, void *data),
void *data);
--
2.7.4

2020-08-22 04:30:02

by Jacob Pan

[permalink] [raw]
Subject: [PATCH v2 6/9] iommu/ioasid: Introduce notification APIs

Relations among IOASID users largely follow a publisher-subscriber
pattern. E.g. to support guest SVA on Intel Scalable I/O Virtualization
(SIOV) enabled platforms, VFIO, IOMMU, device drivers, KVM are all users
of IOASIDs. When a state change occurs, VFIO publishes the change event
that needs to be processed by other users/subscribers.

This patch introduced two types of notifications: global and per
ioasid_set. The latter is intended for users who only needs to handle
events related to the IOASID of a given set.
For more information, refer to the kernel documentation at
Documentation/ioasid.rst.

Signed-off-by: Liu Yi L <[email protected]>
Signed-off-by: Wu Hao <[email protected]>
Signed-off-by: Jacob Pan <[email protected]>
---
drivers/iommu/ioasid.c | 280 ++++++++++++++++++++++++++++++++++++++++++++++++-
include/linux/ioasid.h | 70 +++++++++++++
2 files changed, 348 insertions(+), 2 deletions(-)

diff --git a/drivers/iommu/ioasid.c b/drivers/iommu/ioasid.c
index c0aef38a4fde..6ddc09a7fe74 100644
--- a/drivers/iommu/ioasid.c
+++ b/drivers/iommu/ioasid.c
@@ -9,8 +9,35 @@
#include <linux/spinlock.h>
#include <linux/xarray.h>
#include <linux/ioasid.h>
+#include <linux/sched/mm.h>

static DEFINE_XARRAY_ALLOC(ioasid_sets);
+/*
+ * An IOASID could have multiple consumers where each consumeer may have
+ * hardware contexts associated with IOASIDs.
+ * When a status change occurs, such as IOASID is being freed, notifier chains
+ * are used to keep the consumers in sync.
+ * This is a publisher-subscriber pattern where publisher can change the
+ * state of each IOASID, e.g. alloc/free, bind IOASID to a device and mm.
+ * On the other hand, subscribers gets notified for the state change and
+ * keep local states in sync.
+ *
+ * Currently, the notifier is global. A further optimization could be per
+ * IOASID set notifier chain.
+ */
+static ATOMIC_NOTIFIER_HEAD(ioasid_chain);
+
+/* List to hold pending notification block registrations */
+static LIST_HEAD(ioasid_nb_pending_list);
+static DEFINE_SPINLOCK(ioasid_nb_lock);
+struct ioasid_set_nb {
+ struct list_head list;
+ struct notifier_block *nb;
+ void *token;
+ struct ioasid_set *set;
+ bool active;
+};
+
enum ioasid_state {
IOASID_STATE_INACTIVE,
IOASID_STATE_ACTIVE,
@@ -394,6 +421,7 @@ EXPORT_SYMBOL_GPL(ioasid_find_by_spid);
ioasid_t ioasid_alloc(struct ioasid_set *set, ioasid_t min, ioasid_t max,
void *private)
{
+ struct ioasid_nb_args args;
struct ioasid_data *data;
void *adata;
ioasid_t id = INVALID_IOASID;
@@ -445,8 +473,14 @@ ioasid_t ioasid_alloc(struct ioasid_set *set, ioasid_t min, ioasid_t max,
goto exit_free;
}
set->nr_ioasids++;
- goto done_unlock;
+ args.id = id;
+ /* Set private ID is not attached during allocation */
+ args.spid = INVALID_IOASID;
+ args.set = set;
+ atomic_notifier_call_chain(&set->nh, IOASID_ALLOC, &args);

+ spin_unlock(&ioasid_allocator_lock);
+ return id;
exit_free:
kfree(data);
done_unlock:
@@ -479,6 +513,7 @@ static void ioasid_do_free(struct ioasid_data *data)

static void ioasid_free_locked(struct ioasid_set *set, ioasid_t ioasid)
{
+ struct ioasid_nb_args args;
struct ioasid_data *data;

data = xa_load(&active_allocator->xa, ioasid);
@@ -491,7 +526,16 @@ static void ioasid_free_locked(struct ioasid_set *set, ioasid_t ioasid)
pr_warn("Cannot free IOASID %u due to set ownership\n", ioasid);
return;
}
+
data->state = IOASID_STATE_FREE_PENDING;
+ /* Notify all users that this IOASID is being freed */
+ args.id = ioasid;
+ args.spid = data->spid;
+ args.pdata = data->private;
+ args.set = data->set;
+ atomic_notifier_call_chain(&ioasid_chain, IOASID_FREE, &args);
+ /* Notify the ioasid_set for per set users */
+ atomic_notifier_call_chain(&set->nh, IOASID_FREE, &args);

if (!refcount_dec_and_test(&data->users))
return;
@@ -514,6 +558,28 @@ void ioasid_free(struct ioasid_set *set, ioasid_t ioasid)
}
EXPORT_SYMBOL_GPL(ioasid_free);

+static void ioasid_add_pending_nb(struct ioasid_set *set)
+{
+ struct ioasid_set_nb *curr;
+
+ if (set->type != IOASID_SET_TYPE_MM)
+ return;
+
+ /*
+ * Check if there are any pending nb requests for the given token, if so
+ * add them to the notifier chain.
+ */
+ spin_lock(&ioasid_nb_lock);
+ list_for_each_entry(curr, &ioasid_nb_pending_list, list) {
+ if (curr->token == set->token && !curr->active) {
+ atomic_notifier_chain_register(&set->nh, curr->nb);
+ curr->set = set;
+ curr->active = true;
+ }
+ }
+ spin_unlock(&ioasid_nb_lock);
+}
+
/**
* ioasid_alloc_set - Allocate a new IOASID set for a given token
*
@@ -601,6 +667,13 @@ struct ioasid_set *ioasid_alloc_set(void *token, ioasid_t quota, int type)
sdata->quota = quota;
sdata->sid = id;
refcount_set(&sdata->ref, 1);
+ ATOMIC_INIT_NOTIFIER_HEAD(&sdata->nh);
+
+ /*
+ * Check if there are any pending nb requests for the given token, if so
+ * add them to the notifier chain.
+ */
+ ioasid_add_pending_nb(sdata);

/*
* Per set XA is used to store private IDs within the set, get ready
@@ -617,6 +690,30 @@ struct ioasid_set *ioasid_alloc_set(void *token, ioasid_t quota, int type)
}
EXPORT_SYMBOL_GPL(ioasid_alloc_set);

+
+/*
+ * ioasid_find_mm_set - Retrieve IOASID set with mm token
+ * Take a reference of the set if found.
+ */
+static struct ioasid_set *ioasid_find_mm_set(struct mm_struct *token)
+{
+ struct ioasid_set *sdata, *set = NULL;
+ unsigned long index;
+
+ spin_lock(&ioasid_allocator_lock);
+
+ xa_for_each(&ioasid_sets, index, sdata) {
+ if (sdata->type == IOASID_SET_TYPE_MM && sdata->token == token) {
+ refcount_inc(&sdata->ref);
+ set = sdata;
+ goto exit_unlock;
+ }
+ }
+exit_unlock:
+ spin_unlock(&ioasid_allocator_lock);
+ return set;
+}
+
void ioasid_set_get_locked(struct ioasid_set *set)
{
if (WARN_ON(xa_load(&ioasid_sets, set->sid) != set)) {
@@ -638,6 +735,8 @@ EXPORT_SYMBOL_GPL(ioasid_set_get);

void ioasid_set_put_locked(struct ioasid_set *set)
{
+ struct ioasid_nb_args args = { 0 };
+ struct ioasid_set_nb *curr;
struct ioasid_data *entry;
unsigned long index;

@@ -673,8 +772,24 @@ void ioasid_set_put_locked(struct ioasid_set *set)
done_destroy:
/* Return the quota back to system pool */
ioasid_capacity_avail += set->quota;
- kfree_rcu(set, rcu);

+ /* Restore pending status of the set NBs */
+ spin_lock(&ioasid_nb_lock);
+ list_for_each_entry(curr, &ioasid_nb_pending_list, list) {
+ if (curr->token == set->token) {
+ if (curr->active)
+ curr->active = false;
+ else
+ pr_warn("Set token exists but not active!\n");
+ }
+ }
+ spin_unlock(&ioasid_nb_lock);
+
+ args.set = set;
+ atomic_notifier_call_chain(&ioasid_chain, IOASID_SET_FREE, &args);
+
+ kfree_rcu(set, rcu);
+ pr_debug("Set freed %d\n", set->sid);
/*
* Token got released right away after the ioasid_set is freed.
* If a new set is created immediately with the newly released token,
@@ -927,6 +1042,167 @@ void *ioasid_find(struct ioasid_set *set, ioasid_t ioasid,
}
EXPORT_SYMBOL_GPL(ioasid_find);

+int ioasid_register_notifier(struct ioasid_set *set, struct notifier_block *nb)
+{
+ if (set)
+ return atomic_notifier_chain_register(&set->nh, nb);
+ else
+ return atomic_notifier_chain_register(&ioasid_chain, nb);
+}
+EXPORT_SYMBOL_GPL(ioasid_register_notifier);
+
+void ioasid_unregister_notifier(struct ioasid_set *set,
+ struct notifier_block *nb)
+{
+ struct ioasid_set_nb *curr;
+
+ spin_lock(&ioasid_nb_lock);
+ /*
+ * Pending list is registered with a token without an ioasid_set,
+ * therefore should not be unregistered directly.
+ */
+ list_for_each_entry(curr, &ioasid_nb_pending_list, list) {
+ if (curr->nb == nb) {
+ pr_warn("Cannot unregister NB from pending list\n");
+ spin_unlock(&ioasid_nb_lock);
+ return;
+ }
+ }
+ spin_unlock(&ioasid_nb_lock);
+
+ if (set)
+ atomic_notifier_chain_unregister(&set->nh, nb);
+ else
+ atomic_notifier_chain_unregister(&ioasid_chain, nb);
+}
+EXPORT_SYMBOL_GPL(ioasid_unregister_notifier);
+
+int ioasid_register_notifier_mm(struct mm_struct *mm, struct notifier_block *nb)
+{
+ struct ioasid_set_nb *curr;
+ struct ioasid_set *set;
+ int ret = 0;
+
+ if (!mm)
+ return -EINVAL;
+
+ spin_lock(&ioasid_nb_lock);
+
+ /* Check for duplicates, nb is unique per set */
+ list_for_each_entry(curr, &ioasid_nb_pending_list, list) {
+ if (curr->token == mm && curr->nb == nb) {
+ ret = -EBUSY;
+ goto exit_unlock;
+ }
+ }
+
+ /* Check if the token has an existing set */
+ set = ioasid_find_mm_set(mm);
+ if (IS_ERR_OR_NULL(set)) {
+ /* Add to the rsvd list as inactive */
+ curr->active = false;
+ } else {
+ /* REVISIT: Only register empty set for now. Can add an option
+ * in the future to playback existing PASIDs.
+ */
+ if (set->nr_ioasids) {
+ pr_warn("IOASID set %d not empty\n", set->sid);
+ ret = -EBUSY;
+ goto exit_unlock;
+ }
+ curr = kzalloc(sizeof(*curr), GFP_ATOMIC);
+ if (!curr) {
+ ret = -ENOMEM;
+ goto exit_unlock;
+ }
+ curr->token = mm;
+ curr->nb = nb;
+ curr->active = true;
+ curr->set = set;
+
+ /* Set already created, add to the notifier chain */
+ atomic_notifier_chain_register(&set->nh, nb);
+ /*
+ * Do not hold a reference, if the set gets destroyed, the nb
+ * entry will be marked inactive.
+ */
+ ioasid_set_put(set);
+ }
+
+ list_add(&curr->list, &ioasid_nb_pending_list);
+
+exit_unlock:
+ spin_unlock(&ioasid_nb_lock);
+ return ret;
+}
+EXPORT_SYMBOL_GPL(ioasid_register_notifier_mm);
+
+void ioasid_unregister_notifier_mm(struct mm_struct *mm, struct notifier_block *nb)
+{
+ struct ioasid_set_nb *curr;
+
+ spin_lock(&ioasid_nb_lock);
+ list_for_each_entry(curr, &ioasid_nb_pending_list, list) {
+ if (curr->token == mm && curr->nb == nb) {
+ list_del(&curr->list);
+ goto exit_free;
+ }
+ }
+ pr_warn("No ioasid set found for mm token %llx\n", (u64)mm);
+ goto done_unlock;
+
+exit_free:
+ if (curr->active) {
+ pr_debug("mm set active, unregister %llx\n",
+ (u64)mm);
+ atomic_notifier_chain_unregister(&curr->set->nh, nb);
+ }
+ kfree(curr);
+done_unlock:
+ spin_unlock(&ioasid_nb_lock);
+ return;
+}
+EXPORT_SYMBOL_GPL(ioasid_unregister_notifier_mm);
+
+/**
+ * ioasid_notify - Send notification on a given IOASID for status change.
+ * Used by publishers when the status change may affect
+ * subscriber's internal state.
+ *
+ * @ioasid: The IOASID to which the notification will send
+ * @cmd: The notification event
+ * @flags: Special instructions, e.g. notify with a set or global
+ */
+int ioasid_notify(ioasid_t ioasid, enum ioasid_notify_val cmd, unsigned int flags)
+{
+ struct ioasid_data *ioasid_data;
+ struct ioasid_nb_args args = { 0 };
+ int ret = 0;
+
+ spin_lock(&ioasid_allocator_lock);
+ ioasid_data = xa_load(&active_allocator->xa, ioasid);
+ if (!ioasid_data) {
+ pr_err("Trying to notify unknown IOASID %u\n", ioasid);
+ spin_unlock(&ioasid_allocator_lock);
+ return -EINVAL;
+ }
+
+ args.id = ioasid;
+ args.set = ioasid_data->set;
+ args.pdata = ioasid_data->private;
+ args.spid = ioasid_data->spid;
+ if (flags & IOASID_NOTIFY_ALL) {
+ ret = atomic_notifier_call_chain(&ioasid_chain, cmd, &args);
+ } else if (flags & IOASID_NOTIFY_SET) {
+ ret = atomic_notifier_call_chain(&ioasid_data->set->nh,
+ cmd, &args);
+ }
+ spin_unlock(&ioasid_allocator_lock);
+
+ return ret;
+}
+EXPORT_SYMBOL_GPL(ioasid_notify);
+
MODULE_AUTHOR("Jean-Philippe Brucker <[email protected]>");
MODULE_AUTHOR("Jacob Pan <[email protected]>");
MODULE_DESCRIPTION("IO Address Space ID (IOASID) allocator");
diff --git a/include/linux/ioasid.h b/include/linux/ioasid.h
index d4b3e83672f6..572111cd3b4b 100644
--- a/include/linux/ioasid.h
+++ b/include/linux/ioasid.h
@@ -23,6 +23,7 @@ enum ioasid_set_type {
* struct ioasid_set - Meta data about ioasid_set
* @type: Token types and other features
* @token: Unique to identify an IOASID set
+ * @nh: Notifier for IOASID events within the set
* @xa: XArray to store ioasid_set private IDs, can be used for
* guest-host IOASID mapping, or just a private IOASID namespace.
* @quota: Max number of IOASIDs can be allocated within the set
@@ -32,6 +33,7 @@ enum ioasid_set_type {
*/
struct ioasid_set {
void *token;
+ struct atomic_notifier_head nh;
struct xarray xa;
int type;
int quota;
@@ -56,6 +58,49 @@ struct ioasid_allocator_ops {
void *pdata;
};

+/* Notification data when IOASID status changed */
+enum ioasid_notify_val {
+ IOASID_ALLOC = 1,
+ IOASID_FREE,
+ IOASID_BIND,
+ IOASID_UNBIND,
+ IOASID_SET_ALLOC,
+ IOASID_SET_FREE,
+};
+
+#define IOASID_NOTIFY_ALL BIT(0)
+#define IOASID_NOTIFY_SET BIT(1)
+/**
+ * enum ioasid_notifier_prios - IOASID event notification order
+ *
+ * When status of an IOASID changes, users might need to take actions to
+ * reflect the new state. For example, when an IOASID is freed due to
+ * exception, the hardware context in virtual CPU, DMA device, and IOMMU
+ * shall be cleared and drained. Order is required to prevent life cycle
+ * problems.
+ */
+enum ioasid_notifier_prios {
+ IOASID_PRIO_LAST,
+ IOASID_PRIO_DEVICE,
+ IOASID_PRIO_IOMMU,
+ IOASID_PRIO_CPU,
+};
+
+/**
+ * struct ioasid_nb_args - Argument provided by IOASID core when notifier
+ * is called.
+ * @id: The IOASID being notified
+ * @spid: The set private ID associated with the IOASID
+ * @set: The IOASID set of @id
+ * @pdata: The private data attached to the IOASID
+ */
+struct ioasid_nb_args {
+ ioasid_t id;
+ ioasid_t spid;
+ struct ioasid_set *set;
+ void *pdata;
+};
+
#if IS_ENABLED(CONFIG_IOASID)
void ioasid_install_capacity(ioasid_t total);
ioasid_t ioasid_get_capacity(void);
@@ -75,8 +120,16 @@ void *ioasid_find(struct ioasid_set *set, ioasid_t ioasid, bool (*getter)(void *
int ioasid_attach_data(ioasid_t ioasid, void *data);
int ioasid_attach_spid(ioasid_t ioasid, ioasid_t spid);
ioasid_t ioasid_find_by_spid(struct ioasid_set *set, ioasid_t spid);
+
+int ioasid_register_notifier(struct ioasid_set *set,
+ struct notifier_block *nb);
+void ioasid_unregister_notifier(struct ioasid_set *set,
+ struct notifier_block *nb);
+
int ioasid_register_allocator(struct ioasid_allocator_ops *allocator);
void ioasid_unregister_allocator(struct ioasid_allocator_ops *allocator);
+
+int ioasid_notify(ioasid_t ioasid, enum ioasid_notify_val cmd, unsigned int flags);
void ioasid_is_in_set(struct ioasid_set *set, ioasid_t ioasid);
int ioasid_get(struct ioasid_set *set, ioasid_t ioasid);
int ioasid_get_locked(struct ioasid_set *set, ioasid_t ioasid);
@@ -85,6 +138,9 @@ void ioasid_put_locked(struct ioasid_set *set, ioasid_t ioasid);
int ioasid_set_for_each_ioasid(struct ioasid_set *sdata,
void (*fn)(ioasid_t id, void *data),
void *data);
+int ioasid_register_notifier_mm(struct mm_struct *mm, struct notifier_block *nb);
+void ioasid_unregister_notifier_mm(struct mm_struct *mm, struct notifier_block *nb);
+
#else /* !CONFIG_IOASID */
static inline void ioasid_install_capacity(ioasid_t total)
{
@@ -124,6 +180,20 @@ static inline void *ioasid_find(struct ioasid_set *set, ioasid_t ioasid, bool (*
return NULL;
}

+static inline int ioasid_register_notifier(struct notifier_block *nb)
+{
+ return -ENOTSUPP;
+}
+
+static inline void ioasid_unregister_notifier(struct notifier_block *nb)
+{
+}
+
+static inline int ioasid_notify(ioasid_t ioasid, enum ioasid_notify_val cmd, unsigned int flags)
+{
+ return -ENOTSUPP;
+}
+
static inline int ioasid_register_allocator(struct ioasid_allocator_ops *allocator)
{
return -ENOTSUPP;
--
2.7.4

2020-08-22 04:30:07

by Jacob Pan

[permalink] [raw]
Subject: [PATCH v2 8/9] iommu/vt-d: Send IOASID bind/unbind notifications

On Intel Scalable I/O Virtualization (SIOV) enabled platforms with
ENQCMD in use by the guest, KVM must establish a guest-host PASID
translation table prior to the issuing of ENQCMD. PASID translation
table also depends on the IOMMU PASID entry, which is configured during
the guest page table bind time. This patch adds a notification event to
the IOMMU driver such that KVM can be notified when a new guest-host
PASID mapping is established or demolished.

Signed-off-by: Jacob Pan <[email protected]>
---
drivers/iommu/intel/svm.c | 7 +++++++
1 file changed, 7 insertions(+)

diff --git a/drivers/iommu/intel/svm.c b/drivers/iommu/intel/svm.c
index 600e3ae5b656..d8a5efa75095 100644
--- a/drivers/iommu/intel/svm.c
+++ b/drivers/iommu/intel/svm.c
@@ -471,6 +471,11 @@ int intel_svm_bind_gpasid(struct iommu_domain *domain, struct device *dev,
}

mutex_unlock(&pasid_mutex);
+ /*
+ * Notify KVM new host-guest PASID bind is ready. KVM will set up
+ * PASID translation table to support guest ENQCMD.
+ */
+ ioasid_notify(data->hpasid, IOASID_BIND, IOASID_NOTIFY_SET);
return ret;
}

@@ -510,6 +515,8 @@ int intel_svm_unbind_gpasid(struct device *dev, int pasid)
* and perform cleanup.
*/
ioasid_attach_data(pasid, NULL);
+ ioasid_notify(pasid, IOASID_UNBIND,
+ IOASID_NOTIFY_SET);
kfree(svm);
}
}
--
2.7.4

2020-08-22 04:31:19

by Jacob Pan

[permalink] [raw]
Subject: [PATCH v2 2/9] iommu/ioasid: Rename ioasid_set_data()

Rename ioasid_set_data() to ioasid_attach_data() to avoid confusion with
struct ioasid_set. ioasid_set is a group of IOASIDs that share a common
token.

Signed-off-by: Jacob Pan <[email protected]>
---
drivers/iommu/intel/svm.c | 6 +++---
drivers/iommu/ioasid.c | 6 +++---
include/linux/ioasid.h | 4 ++--
3 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/drivers/iommu/intel/svm.c b/drivers/iommu/intel/svm.c
index b6972dca2ae0..37a9beabc0ca 100644
--- a/drivers/iommu/intel/svm.c
+++ b/drivers/iommu/intel/svm.c
@@ -342,7 +342,7 @@ int intel_svm_bind_gpasid(struct iommu_domain *domain, struct device *dev,
svm->gpasid = data->gpasid;
svm->flags |= SVM_FLAG_GUEST_PASID;
}
- ioasid_set_data(data->hpasid, svm);
+ ioasid_attach_data(data->hpasid, svm);
INIT_LIST_HEAD_RCU(&svm->devs);
mmput(svm->mm);
}
@@ -394,7 +394,7 @@ int intel_svm_bind_gpasid(struct iommu_domain *domain, struct device *dev,
list_add_rcu(&sdev->list, &svm->devs);
out:
if (!IS_ERR_OR_NULL(svm) && list_empty(&svm->devs)) {
- ioasid_set_data(data->hpasid, NULL);
+ ioasid_attach_data(data->hpasid, NULL);
kfree(svm);
}

@@ -437,7 +437,7 @@ int intel_svm_unbind_gpasid(struct device *dev, int pasid)
* the unbind, IOMMU driver will get notified
* and perform cleanup.
*/
- ioasid_set_data(pasid, NULL);
+ ioasid_attach_data(pasid, NULL);
kfree(svm);
}
}
diff --git a/drivers/iommu/ioasid.c b/drivers/iommu/ioasid.c
index 0f8dd377aada..5f63af07acd5 100644
--- a/drivers/iommu/ioasid.c
+++ b/drivers/iommu/ioasid.c
@@ -258,14 +258,14 @@ void ioasid_unregister_allocator(struct ioasid_allocator_ops *ops)
EXPORT_SYMBOL_GPL(ioasid_unregister_allocator);

/**
- * ioasid_set_data - Set private data for an allocated ioasid
+ * ioasid_attach_data - Set private data for an allocated ioasid
* @ioasid: the ID to set data
* @data: the private data
*
* For IOASID that is already allocated, private data can be set
* via this API. Future lookup can be done via ioasid_find.
*/
-int ioasid_set_data(ioasid_t ioasid, void *data)
+int ioasid_attach_data(ioasid_t ioasid, void *data)
{
struct ioasid_data *ioasid_data;
int ret = 0;
@@ -287,7 +287,7 @@ int ioasid_set_data(ioasid_t ioasid, void *data)

return ret;
}
-EXPORT_SYMBOL_GPL(ioasid_set_data);
+EXPORT_SYMBOL_GPL(ioasid_attach_data);

/**
* ioasid_alloc - Allocate an IOASID
diff --git a/include/linux/ioasid.h b/include/linux/ioasid.h
index 6f000d7a0ddc..9c44947a68c8 100644
--- a/include/linux/ioasid.h
+++ b/include/linux/ioasid.h
@@ -39,7 +39,7 @@ void *ioasid_find(struct ioasid_set *set, ioasid_t ioasid,
bool (*getter)(void *));
int ioasid_register_allocator(struct ioasid_allocator_ops *allocator);
void ioasid_unregister_allocator(struct ioasid_allocator_ops *allocator);
-int ioasid_set_data(ioasid_t ioasid, void *data);
+int ioasid_attach_data(ioasid_t ioasid, void *data);

#else /* !CONFIG_IOASID */
static inline ioasid_t ioasid_alloc(struct ioasid_set *set, ioasid_t min,
@@ -67,7 +67,7 @@ static inline void ioasid_unregister_allocator(struct ioasid_allocator_ops *allo
{
}

-static inline int ioasid_set_data(ioasid_t ioasid, void *data)
+static inline int ioasid_attach_data(ioasid_t ioasid, void *data)
{
return -ENOTSUPP;
}
--
2.7.4

2020-08-22 04:33:06

by Jacob Pan

[permalink] [raw]
Subject: [PATCH v2 5/9] iommu/ioasid: Introduce ioasid_set private ID

When an IOASID set is used for guest SVA, each VM will acquire its
ioasid_set for IOASID allocations. IOASIDs within the VM must have a
host/physical IOASID backing, mapping between guest and host IOASIDs can
be non-identical. IOASID set private ID (SPID) is introduced in this
patch to be used as guest IOASID. However, the concept of ioasid_set
specific namespace is generic, thus named SPID.

As SPID namespace is within the IOASID set, the IOASID core can provide
lookup services at both directions. SPIDs may not be allocated when its
IOASID is allocated, the mapping between SPID and IOASID is usually
established when a guest page table is bound to a host PASID.

Signed-off-by: Jacob Pan <[email protected]>
---
drivers/iommu/ioasid.c | 54 ++++++++++++++++++++++++++++++++++++++++++++++++++
include/linux/ioasid.h | 12 +++++++++++
2 files changed, 66 insertions(+)

diff --git a/drivers/iommu/ioasid.c b/drivers/iommu/ioasid.c
index 5f31d63c75b1..c0aef38a4fde 100644
--- a/drivers/iommu/ioasid.c
+++ b/drivers/iommu/ioasid.c
@@ -21,6 +21,7 @@ enum ioasid_state {
* struct ioasid_data - Meta data about ioasid
*
* @id: Unique ID
+ * @spid: Private ID unique within a set
* @users Number of active users
* @state Track state of the IOASID
* @set Meta data of the set this IOASID belongs to
@@ -29,6 +30,7 @@ enum ioasid_state {
*/
struct ioasid_data {
ioasid_t id;
+ ioasid_t spid;
struct ioasid_set *set;
refcount_t users;
enum ioasid_state state;
@@ -326,6 +328,58 @@ int ioasid_attach_data(ioasid_t ioasid, void *data)
EXPORT_SYMBOL_GPL(ioasid_attach_data);

/**
+ * ioasid_attach_spid - Attach ioasid_set private ID to an IOASID
+ *
+ * @ioasid: the ID to attach
+ * @spid: the ioasid_set private ID of @ioasid
+ *
+ * For IOASID that is already allocated, private ID within the set can be
+ * attached via this API. Future lookup can be done via ioasid_find.
+ */
+int ioasid_attach_spid(ioasid_t ioasid, ioasid_t spid)
+{
+ struct ioasid_data *ioasid_data;
+ int ret = 0;
+
+ spin_lock(&ioasid_allocator_lock);
+ ioasid_data = xa_load(&active_allocator->xa, ioasid);
+
+ if (!ioasid_data) {
+ pr_err("No IOASID entry %d to attach SPID %d\n",
+ ioasid, spid);
+ ret = -ENOENT;
+ goto done_unlock;
+ }
+ ioasid_data->spid = spid;
+
+done_unlock:
+ spin_unlock(&ioasid_allocator_lock);
+ return ret;
+}
+EXPORT_SYMBOL_GPL(ioasid_attach_spid);
+
+ioasid_t ioasid_find_by_spid(struct ioasid_set *set, ioasid_t spid)
+{
+ struct ioasid_data *entry;
+ unsigned long index;
+
+ if (!xa_load(&ioasid_sets, set->sid)) {
+ pr_warn("Invalid set\n");
+ return INVALID_IOASID;
+ }
+
+ xa_for_each(&set->xa, index, entry) {
+ if (spid == entry->spid) {
+ pr_debug("Found ioasid %lu by spid %u\n", index, spid);
+ refcount_inc(&entry->users);
+ return index;
+ }
+ }
+ return INVALID_IOASID;
+}
+EXPORT_SYMBOL_GPL(ioasid_find_by_spid);
+
+/**
* ioasid_alloc - Allocate an IOASID
* @set: the IOASID set
* @min: the minimum ID (inclusive)
diff --git a/include/linux/ioasid.h b/include/linux/ioasid.h
index 310abe4187a3..d4b3e83672f6 100644
--- a/include/linux/ioasid.h
+++ b/include/linux/ioasid.h
@@ -73,6 +73,8 @@ bool ioasid_is_active(ioasid_t ioasid);

void *ioasid_find(struct ioasid_set *set, ioasid_t ioasid, bool (*getter)(void *));
int ioasid_attach_data(ioasid_t ioasid, void *data);
+int ioasid_attach_spid(ioasid_t ioasid, ioasid_t spid);
+ioasid_t ioasid_find_by_spid(struct ioasid_set *set, ioasid_t spid);
int ioasid_register_allocator(struct ioasid_allocator_ops *allocator);
void ioasid_unregister_allocator(struct ioasid_allocator_ops *allocator);
void ioasid_is_in_set(struct ioasid_set *set, ioasid_t ioasid);
@@ -136,5 +138,15 @@ static inline int ioasid_attach_data(ioasid_t ioasid, void *data)
return -ENOTSUPP;
}

+staic inline int ioasid_attach_spid(ioasid_t ioasid, ioasid_t spid)
+{
+ return -ENOTSUPP;
+}
+
+static inline ioasid_t ioasid_find_by_spid(struct ioasid_set *set, ioasid_t spid)
+{
+ return -ENOTSUPP;
+}
+
#endif /* CONFIG_IOASID */
#endif /* __LINUX_IOASID_H */
--
2.7.4

2020-08-22 04:33:35

by Jacob Pan

[permalink] [raw]
Subject: [PATCH v2 7/9] iommu/vt-d: Listen to IOASID notifications

On Intel Scalable I/O Virtualization (SIOV) enabled platforms, IOMMU
driver is one of the users of IOASIDs. In normal flow, callers will
perform IOASID allocation, bind, unbind, and free in order. However, for
guest SVA, IOASID free could come before unbind as guest is untrusted.
This patch registers IOASID notification handler such that IOMMU driver
can perform PASID teardown upon receiving an unexpected IOASID free
event.

Signed-off-by: Jacob Pan <[email protected]>
---
drivers/iommu/intel/svm.c | 74 ++++++++++++++++++++++++++++++++++++++++++++-
include/linux/intel-iommu.h | 2 ++
2 files changed, 75 insertions(+), 1 deletion(-)

diff --git a/drivers/iommu/intel/svm.c b/drivers/iommu/intel/svm.c
index 634e191ca2c3..600e3ae5b656 100644
--- a/drivers/iommu/intel/svm.c
+++ b/drivers/iommu/intel/svm.c
@@ -95,6 +95,72 @@ static inline bool intel_svm_capable(struct intel_iommu *iommu)
return iommu->flags & VTD_FLAG_SVM_CAPABLE;
}

+#define pasid_lock_held() lock_is_held(&pasid_mutex.dep_map)
+static DEFINE_MUTEX(pasid_mutex);
+
+static void intel_svm_free_async_fn(struct work_struct *work)
+{
+ struct intel_svm *svm = container_of(work, struct intel_svm, work);
+ struct intel_svm_dev *sdev;
+
+ /*
+ * Unbind all devices associated with this PASID which is
+ * being freed by other users such as VFIO.
+ */
+ mutex_lock(&pasid_mutex);
+ list_for_each_entry_rcu(sdev, &svm->devs, list, pasid_lock_held()) {
+ /* Does not poison forward pointer */
+ list_del_rcu(&sdev->list);
+ spin_lock(&svm->iommu->lock);
+ intel_pasid_tear_down_entry(svm->iommu, sdev->dev,
+ svm->pasid, true);
+ spin_unlock(&svm->iommu->lock);
+ kfree_rcu(sdev, rcu);
+ /*
+ * Free before unbind only happens with guest usaged
+ * host PASIDs. IOASID free will detach private data
+ * and free the IOASID entry.
+ */
+ ioasid_put(NULL, svm->pasid);
+ if (list_empty(&svm->devs))
+ kfree(svm);
+ }
+ mutex_unlock(&pasid_mutex);
+}
+
+
+static int pasid_status_change(struct notifier_block *nb,
+ unsigned long code, void *data)
+{
+ struct ioasid_nb_args *args = (struct ioasid_nb_args *)data;
+ struct intel_svm *svm = (struct intel_svm *)args->pdata;
+ int ret = NOTIFY_DONE;
+
+ if (code == IOASID_FREE) {
+ if (!svm)
+ goto done;
+ if (args->id != svm->pasid) {
+ pr_warn("Notify PASID does not match data %d : %d\n",
+ args->id, svm->pasid);
+ goto done;
+ }
+ schedule_work(&svm->work);
+ return NOTIFY_OK;
+ }
+done:
+ return ret;
+}
+
+static struct notifier_block pasid_nb = {
+ .notifier_call = pasid_status_change,
+};
+
+void intel_svm_add_pasid_notifier(void)
+{
+ /* Listen to all PASIDs, not specific to a set */
+ ioasid_register_notifier(NULL, &pasid_nb);
+}
+
void intel_svm_check(struct intel_iommu *iommu)
{
if (!pasid_supported(iommu))
@@ -221,7 +287,6 @@ static const struct mmu_notifier_ops intel_mmuops = {
.invalidate_range = intel_invalidate_range,
};

-static DEFINE_MUTEX(pasid_mutex);
static LIST_HEAD(global_svm_list);

#define for_each_svm_dev(sdev, svm, d) \
@@ -342,7 +407,14 @@ int intel_svm_bind_gpasid(struct iommu_domain *domain, struct device *dev,
svm->gpasid = data->gpasid;
svm->flags |= SVM_FLAG_GUEST_PASID;
}
+ svm->iommu = iommu;
+ /*
+ * Set up cleanup async work in case IOASID core notify us PASID
+ * is freed before unbind.
+ */
+ INIT_WORK(&svm->work, intel_svm_free_async_fn);
ioasid_attach_data(data->hpasid, svm);
+ ioasid_get(NULL, svm->pasid);
INIT_LIST_HEAD_RCU(&svm->devs);
mmput(svm->mm);
}
diff --git a/include/linux/intel-iommu.h b/include/linux/intel-iommu.h
index b1ed2f25f7c0..d36038e6ae0b 100644
--- a/include/linux/intel-iommu.h
+++ b/include/linux/intel-iommu.h
@@ -744,6 +744,7 @@ void intel_svm_unbind(struct iommu_sva *handle);
int intel_svm_get_pasid(struct iommu_sva *handle);
int intel_svm_page_response(struct device *dev, struct iommu_fault_event *evt,
struct iommu_page_response *msg);
+void intel_svm_add_pasid_notifier(void);

struct svm_dev_ops;

@@ -770,6 +771,7 @@ struct intel_svm {
int gpasid; /* In case that guest PASID is different from host PASID */
struct list_head devs;
struct list_head list;
+ struct work_struct work; /* For deferred clean up */
};
#else
static inline void intel_svm_check(struct intel_iommu *iommu) {}
--
2.7.4

2020-08-22 04:33:36

by Jacob Pan

[permalink] [raw]
Subject: [PATCH v2 9/9] iommu/vt-d: Store guest PASID during bind

IOASID core maintains the guest-host mapping in the form of SPID and
IOASID. This patch assigns the guest PASID (if valid) as SPID while
binding guest page table with a host PASID. This mapping will be used
for lookup and notifications.

Signed-off-by: Jacob Pan <[email protected]>
---
drivers/iommu/intel/svm.c | 2 ++
1 file changed, 2 insertions(+)

diff --git a/drivers/iommu/intel/svm.c b/drivers/iommu/intel/svm.c
index d8a5efa75095..4c958b1aec4c 100644
--- a/drivers/iommu/intel/svm.c
+++ b/drivers/iommu/intel/svm.c
@@ -406,6 +406,7 @@ int intel_svm_bind_gpasid(struct iommu_domain *domain, struct device *dev,
if (data->flags & IOMMU_SVA_GPASID_VAL) {
svm->gpasid = data->gpasid;
svm->flags |= SVM_FLAG_GUEST_PASID;
+ ioasid_attach_spid(data->hpasid, data->gpasid);
}
svm->iommu = iommu;
/*
@@ -517,6 +518,7 @@ int intel_svm_unbind_gpasid(struct device *dev, int pasid)
ioasid_attach_data(pasid, NULL);
ioasid_notify(pasid, IOASID_UNBIND,
IOASID_NOTIFY_SET);
+ ioasid_attach_spid(pasid, INVALID_IOASID);
kfree(svm);
}
}
--
2.7.4

2020-08-22 09:07:04

by kernel test robot

[permalink] [raw]
Subject: Re: [PATCH v2 5/9] iommu/ioasid: Introduce ioasid_set private ID

Hi Jacob,

I love your patch! Yet something to improve:

[auto build test ERROR on iommu/next]
[also build test ERROR on linux/master linus/master v5.9-rc1 next-20200821]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url: https://github.com/0day-ci/linux/commits/Jacob-Pan/IOASID-extensions-for-guest-SVA/20200822-123111
base: https://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu.git next
config: mips-randconfig-r015-20200822 (attached as .config)
compiler: clang version 12.0.0 (https://github.com/llvm/llvm-project b587ca93be114d07ec3bf654add97d7872325281)
reproduce (this is a W=1 build):
wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# install mips cross compiling tool for clang build
# apt-get install binutils-mips-linux-gnu
# save the attached .config to linux build tree
COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross ARCH=mips

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <[email protected]>

All errors (new ones prefixed by >>):

In file included from drivers/of/device.c:7:
In file included from include/linux/of_iommu.h:6:
In file included from include/linux/iommu.h:16:
>> include/linux/ioasid.h:141:1: error: unknown type name 'staic'; did you mean 'static'?
staic inline int ioasid_attach_spid(ioasid_t ioasid, ioasid_t spid)
^~~~~
static
In file included from drivers/of/device.c:8:
include/linux/dma-mapping.h:824:9: warning: implicit conversion from 'unsigned long long' to 'unsigned long' changes value from 18446744073709551615 to 4294967295 [-Wconstant-conversion]
return DMA_BIT_MASK(32);
~~~~~~ ^~~~~~~~~~~~~~~~
include/linux/dma-mapping.h:139:40: note: expanded from macro 'DMA_BIT_MASK'
#define DMA_BIT_MASK(n) (((n) == 64) ? ~0ULL : ((1ULL<<(n))-1))
^~~~~
1 warning and 1 error generated.
--
In file included from drivers/of/platform.c:20:
In file included from include/linux/of_iommu.h:6:
In file included from include/linux/iommu.h:16:
>> include/linux/ioasid.h:141:1: error: unknown type name 'staic'; did you mean 'static'?
staic inline int ioasid_attach_spid(ioasid_t ioasid, ioasid_t spid)
^~~~~
static
1 error generated.

# https://github.com/0day-ci/linux/commit/09f31e901946399a274ce954bdefa4108e895b33
git remote add linux-review https://github.com/0day-ci/linux
git fetch --no-tags linux-review Jacob-Pan/IOASID-extensions-for-guest-SVA/20200822-123111
git checkout 09f31e901946399a274ce954bdefa4108e895b33
vim +141 include/linux/ioasid.h

140
> 141 staic inline int ioasid_attach_spid(ioasid_t ioasid, ioasid_t spid)
142 {
143 return -ENOTSUPP;
144 }
145

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/[email protected]


Attachments:
(No filename) (3.08 kB)
.config.gz (34.64 kB)
Download all attachments

2020-08-22 09:10:29

by kernel test robot

[permalink] [raw]
Subject: Re: [PATCH v2 5/9] iommu/ioasid: Introduce ioasid_set private ID

Hi Jacob,

I love your patch! Yet something to improve:

[auto build test ERROR on iommu/next]
[also build test ERROR on linux/master linus/master v5.9-rc1 next-20200821]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url: https://github.com/0day-ci/linux/commits/Jacob-Pan/IOASID-extensions-for-guest-SVA/20200822-123111
base: https://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu.git next
config: ia64-randconfig-r003-20200822 (attached as .config)
compiler: ia64-linux-gcc (GCC) 9.3.0
reproduce (this is a W=1 build):
wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# save the attached .config to linux build tree
COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-9.3.0 make.cross ARCH=ia64

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <[email protected]>

All errors (new ones prefixed by >>):

In file included from include/linux/iommu.h:16,
from include/linux/of_iommu.h:6,
from drivers/of/device.c:7:
>> include/linux/ioasid.h:141:6: error: expected ';' before 'inline'
141 | staic inline int ioasid_attach_spid(ioasid_t ioasid, ioasid_t spid)
| ^
| ;
--
In file included from include/linux/iommu.h:16,
from drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c:34:
>> include/linux/ioasid.h:141:6: error: expected ';' before 'inline'
141 | staic inline int ioasid_attach_spid(ioasid_t ioasid, ioasid_t spid)
| ^
| ;
drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c:1094:5: warning: no previous prototype for 'amdgpu_ttm_gart_bind' [-Wmissing-prototypes]
1094 | int amdgpu_ttm_gart_bind(struct amdgpu_device *adev,
| ^~~~~~~~~~~~~~~~~~~~
In file included from drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c:55:
drivers/gpu/drm/amd/amdgpu/amdgpu.h:190:18: warning: 'sched_policy' defined but not used [-Wunused-const-variable=]
190 | static const int sched_policy = KFD_SCHED_POLICY_HWS;
| ^~~~~~~~~~~~
In file included from drivers/gpu/drm/amd/amdgpu/../display/dc/dc_types.h:33,
from drivers/gpu/drm/amd/amdgpu/../display/dc/dm_services_types.h:30,
from drivers/gpu/drm/amd/amdgpu/../include/dm_pp_interface.h:26,
from drivers/gpu/drm/amd/amdgpu/amdgpu.h:65,
from drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c:55:
drivers/gpu/drm/amd/amdgpu/../display/include/fixed31_32.h:76:32: warning: 'dc_fixpt_ln2_div_2' defined but not used [-Wunused-const-variable=]
76 | static const struct fixed31_32 dc_fixpt_ln2_div_2 = { 1488522236LL };
| ^~~~~~~~~~~~~~~~~~
drivers/gpu/drm/amd/amdgpu/../display/include/fixed31_32.h:75:32: warning: 'dc_fixpt_ln2' defined but not used [-Wunused-const-variable=]
75 | static const struct fixed31_32 dc_fixpt_ln2 = { 2977044471LL };
| ^~~~~~~~~~~~
drivers/gpu/drm/amd/amdgpu/../display/include/fixed31_32.h:74:32: warning: 'dc_fixpt_e' defined but not used [-Wunused-const-variable=]
74 | static const struct fixed31_32 dc_fixpt_e = { 11674931555LL };
| ^~~~~~~~~~
drivers/gpu/drm/amd/amdgpu/../display/include/fixed31_32.h:73:32: warning: 'dc_fixpt_two_pi' defined but not used [-Wunused-const-variable=]
73 | static const struct fixed31_32 dc_fixpt_two_pi = { 26986075409LL };
| ^~~~~~~~~~~~~~~
drivers/gpu/drm/amd/amdgpu/../display/include/fixed31_32.h:72:32: warning: 'dc_fixpt_pi' defined but not used [-Wunused-const-variable=]
72 | static const struct fixed31_32 dc_fixpt_pi = { 13493037705LL };
| ^~~~~~~~~~~
drivers/gpu/drm/amd/amdgpu/../display/include/fixed31_32.h:67:32: warning: 'dc_fixpt_zero' defined but not used [-Wunused-const-variable=]
67 | static const struct fixed31_32 dc_fixpt_zero = { 0 };
| ^~~~~~~~~~~~~
--
In file included from include/linux/iommu.h:16,
from drivers/gpu/drm/nouveau/include/nvif/os.h:30,
from drivers/gpu/drm/nouveau/include/nvkm/core/os.h:4,
from drivers/gpu/drm/nouveau/include/nvkm/core/oclass.h:3,
from drivers/gpu/drm/nouveau/include/nvkm/core/device.h:4,
from drivers/gpu/drm/nouveau/include/nvkm/core/subdev.h:4,
from drivers/gpu/drm/nouveau/nvkm/nvfw/acr.c:22:
>> include/linux/ioasid.h:141:6: error: expected ';' before 'inline'
141 | staic inline int ioasid_attach_spid(ioasid_t ioasid, ioasid_t spid)
| ^
| ;
drivers/gpu/drm/nouveau/nvkm/nvfw/acr.c:49:1: warning: no previous prototype for 'lsb_header_tail_dump' [-Wmissing-prototypes]
49 | lsb_header_tail_dump(struct nvkm_subdev *subdev,
| ^~~~~~~~~~~~~~~~~~~~
--
In file included from include/linux/iommu.h:16,
from drivers/gpu/drm/nouveau/include/nvif/os.h:30,
from drivers/gpu/drm/nouveau/include/nvkm/core/os.h:4,
from drivers/gpu/drm/nouveau/include/nvkm/core/oclass.h:3,
from drivers/gpu/drm/nouveau/include/nvkm/core/device.h:4,
from drivers/gpu/drm/nouveau/include/nvkm/core/subdev.h:4,
from drivers/gpu/drm/nouveau/include/nvkm/subdev/mc.h:4,
from drivers/gpu/drm/nouveau/nvkm/subdev/mc/priv.h:5,
from drivers/gpu/drm/nouveau/nvkm/subdev/mc/gp10b.c:24:
>> include/linux/ioasid.h:141:6: error: expected ';' before 'inline'
141 | staic inline int ioasid_attach_spid(ioasid_t ioasid, ioasid_t spid)
| ^
| ;
drivers/gpu/drm/nouveau/nvkm/subdev/mc/gp10b.c:27:1: warning: no previous prototype for 'gp10b_mc_init' [-Wmissing-prototypes]
27 | gp10b_mc_init(struct nvkm_mc *mc)
| ^~~~~~~~~~~~~
--
In file included from include/linux/iommu.h:16,
from drivers/gpu/drm/nouveau/include/nvif/os.h:30,
from drivers/gpu/drm/nouveau/include/nvkm/core/os.h:4,
from drivers/gpu/drm/nouveau/include/nvkm/core/oclass.h:3,
from drivers/gpu/drm/nouveau/include/nvkm/core/object.h:4,
from drivers/gpu/drm/nouveau/nvkm/subdev/mmu/ummu.h:4,
from drivers/gpu/drm/nouveau/nvkm/subdev/mmu/base.c:24:
>> include/linux/ioasid.h:141:6: error: expected ';' before 'inline'
141 | staic inline int ioasid_attach_spid(ioasid_t ioasid, ioasid_t spid)
| ^
| ;
drivers/gpu/drm/nouveau/nvkm/subdev/mmu/base.c:65:1: warning: no previous prototype for 'nvkm_mmu_ptp_get' [-Wmissing-prototypes]
65 | nvkm_mmu_ptp_get(struct nvkm_mmu *mmu, u32 size, bool zero)
| ^~~~~~~~~~~~~~~~
--
In file included from include/linux/iommu.h:16,
from drivers/gpu/drm/nouveau/include/nvif/os.h:30,
from drivers/gpu/drm/nouveau/include/nvkm/core/os.h:4,
from drivers/gpu/drm/nouveau/include/nvkm/core/oclass.h:3,
from drivers/gpu/drm/nouveau/include/nvkm/core/device.h:4,
from drivers/gpu/drm/nouveau/include/nvkm/core/subdev.h:4,
from drivers/gpu/drm/nouveau/include/nvkm/subdev/mmu.h:4,
from drivers/gpu/drm/nouveau/nvkm/subdev/mmu/priv.h:5,
from drivers/gpu/drm/nouveau/nvkm/subdev/mmu/mem.h:3,
from drivers/gpu/drm/nouveau/nvkm/subdev/mmu/tu102.c:23:
>> include/linux/ioasid.h:141:6: error: expected ';' before 'inline'
141 | staic inline int ioasid_attach_spid(ioasid_t ioasid, ioasid_t spid)
| ^
| ;
drivers/gpu/drm/nouveau/nvkm/subdev/mmu/tu102.c:31:1: warning: no previous prototype for 'tu102_mmu_kind' [-Wmissing-prototypes]
31 | tu102_mmu_kind(struct nvkm_mmu *mmu, int *count, u8 *invalid)
| ^~~~~~~~~~~~~~
--
In file included from include/linux/iommu.h:16,
from drivers/gpu/drm/nouveau/include/nvif/os.h:30,
from drivers/gpu/drm/nouveau/include/nvkm/core/os.h:4,
from drivers/gpu/drm/nouveau/include/nvkm/core/oclass.h:3,
from drivers/gpu/drm/nouveau/include/nvkm/core/device.h:4,
from drivers/gpu/drm/nouveau/include/nvkm/core/subdev.h:4,
from drivers/gpu/drm/nouveau/include/nvkm/subdev/therm.h:4,
from drivers/gpu/drm/nouveau/nvkm/subdev/therm/priv.h:27,
from drivers/gpu/drm/nouveau/nvkm/subdev/therm/gt215.c:24:
>> include/linux/ioasid.h:141:6: error: expected ';' before 'inline'
141 | staic inline int ioasid_attach_spid(ioasid_t ioasid, ioasid_t spid)
| ^
| ;
drivers/gpu/drm/nouveau/nvkm/subdev/therm/gt215.c:40:1: warning: no previous prototype for 'gt215_therm_init' [-Wmissing-prototypes]
40 | gt215_therm_init(struct nvkm_therm *therm)
| ^~~~~~~~~~~~~~~~
--
In file included from include/linux/iommu.h:16,
from drivers/gpu/drm/nouveau/include/nvif/os.h:30,
from drivers/gpu/drm/nouveau/include/nvkm/core/os.h:4,
from drivers/gpu/drm/nouveau/include/nvkm/core/oclass.h:3,
from drivers/gpu/drm/nouveau/include/nvkm/core/device.h:4,
from drivers/gpu/drm/nouveau/include/nvkm/core/subdev.h:4,
from drivers/gpu/drm/nouveau/include/nvkm/core/engine.h:5,
from drivers/gpu/drm/nouveau/include/nvkm/engine/gr.h:4,
from drivers/gpu/drm/nouveau/nvkm/engine/gr/priv.h:5,
from drivers/gpu/drm/nouveau/nvkm/engine/gr/gf100.h:27,
from drivers/gpu/drm/nouveau/nvkm/engine/gr/gf100.c:24:
>> include/linux/ioasid.h:141:6: error: expected ';' before 'inline'
141 | staic inline int ioasid_attach_spid(ioasid_t ioasid, ioasid_t spid)
| ^
| ;
drivers/gpu/drm/nouveau/nvkm/engine/gr/gf100.c:745:1: warning: no previous prototype for 'gf100_gr_fecs_start_ctxsw' [-Wmissing-prototypes]
745 | gf100_gr_fecs_start_ctxsw(struct nvkm_gr *base)
| ^~~~~~~~~~~~~~~~~~~~~~~~~
drivers/gpu/drm/nouveau/nvkm/engine/gr/gf100.c:760:1: warning: no previous prototype for 'gf100_gr_fecs_stop_ctxsw' [-Wmissing-prototypes]
760 | gf100_gr_fecs_stop_ctxsw(struct nvkm_gr *base)
| ^~~~~~~~~~~~~~~~~~~~~~~~
drivers/gpu/drm/nouveau/nvkm/engine/gr/gf100.c:2036:1: warning: no previous prototype for 'gf100_gr_dtor' [-Wmissing-prototypes]
2036 | gf100_gr_dtor(struct nvkm_gr *base)
| ^~~~~~~~~~~~~
--
In file included from include/linux/iommu.h:16,
from drivers/gpu/drm/nouveau/include/nvif/os.h:30,
from drivers/gpu/drm/nouveau/include/nvkm/core/os.h:4,
from drivers/gpu/drm/nouveau/include/nvkm/core/oclass.h:3,
from drivers/gpu/drm/nouveau/include/nvkm/core/device.h:4,
from drivers/gpu/drm/nouveau/include/nvkm/core/subdev.h:4,
from drivers/gpu/drm/nouveau/include/nvkm/core/engine.h:5,
from drivers/gpu/drm/nouveau/include/nvkm/engine/gr.h:4,
from drivers/gpu/drm/nouveau/nvkm/engine/gr/priv.h:5,
from drivers/gpu/drm/nouveau/nvkm/engine/gr/gf100.h:27,
from drivers/gpu/drm/nouveau/nvkm/engine/gr/gk20a.c:22:
>> include/linux/ioasid.h:141:6: error: expected ';' before 'inline'
141 | staic inline int ioasid_attach_spid(ioasid_t ioasid, ioasid_t spid)
| ^
| ;
drivers/gpu/drm/nouveau/nvkm/engine/gr/gk20a.c:37:1: warning: no previous prototype for 'gk20a_gr_av_to_init' [-Wmissing-prototypes]
37 | gk20a_gr_av_to_init(struct gf100_gr *gr, const char *path, const char *name,
| ^~~~~~~~~~~~~~~~~~~
drivers/gpu/drm/nouveau/nvkm/engine/gr/gk20a.c:87:1: warning: no previous prototype for 'gk20a_gr_aiv_to_init' [-Wmissing-prototypes]
87 | gk20a_gr_aiv_to_init(struct gf100_gr *gr, const char *path, const char *name,
| ^~~~~~~~~~~~~~~~~~~~
drivers/gpu/drm/nouveau/nvkm/engine/gr/gk20a.c:130:1: warning: no previous prototype for 'gk20a_gr_av_to_method' [-Wmissing-prototypes]
130 | gk20a_gr_av_to_method(struct gf100_gr *gr, const char *path, const char *name,
| ^~~~~~~~~~~~~~~~~~~~~
--
In file included from include/linux/iommu.h:16,
from drivers/gpu/drm/nouveau/include/nvif/os.h:30,
from drivers/gpu/drm/nouveau/include/nvif/object.h:5,
from drivers/gpu/drm/nouveau/include/nvif/client.h:5,
from drivers/gpu/drm/nouveau/nouveau_drv.h:43,
from drivers/gpu/drm/nouveau/nouveau_display.h:5,
from drivers/gpu/drm/nouveau/nouveau_fbcon.h:32,
from drivers/gpu/drm/nouveau/nouveau_display.c:38:
>> include/linux/ioasid.h:141:6: error: expected ';' before 'inline'
141 | staic inline int ioasid_attach_spid(ioasid_t ioasid, ioasid_t spid)
| ^
| ;
drivers/gpu/drm/nouveau/nouveau_display.c: In function 'nouveau_framebuffer_new':
drivers/gpu/drm/nouveau/nouveau_display.c:338:15: warning: variable 'width' set but not used [-Wunused-but-set-variable]
338 | unsigned int width, height, i;
| ^~~~~
--
In file included from include/linux/iommu.h:16,
from drivers/gpu/drm/nouveau/include/nvif/os.h:30,
from drivers/gpu/drm/nouveau/include/nvif/object.h:5,
from drivers/gpu/drm/nouveau/include/nvif/mmu.h:3,
from drivers/gpu/drm/nouveau/include/nvif/mem.h:3,
from drivers/gpu/drm/nouveau/dispnv50/disp.h:3,
from drivers/gpu/drm/nouveau/dispnv50/disp.c:24:
>> include/linux/ioasid.h:141:6: error: expected ';' before 'inline'
141 | staic inline int ioasid_attach_spid(ioasid_t ioasid, ioasid_t spid)
| ^
| ;
drivers/gpu/drm/nouveau/dispnv50/disp.c: In function 'nv50_mstm_cleanup':
drivers/gpu/drm/nouveau/dispnv50/disp.c:1237:6: warning: variable 'ret' set but not used [-Wunused-but-set-variable]
1237 | int ret;
| ^~~
drivers/gpu/drm/nouveau/dispnv50/disp.c: In function 'nv50_mstm_prepare':
drivers/gpu/drm/nouveau/dispnv50/disp.c:1261:6: warning: variable 'ret' set but not used [-Wunused-but-set-variable]
1261 | int ret;
| ^~~
drivers/gpu/drm/nouveau/dispnv50/disp.c: At top level:
drivers/gpu/drm/nouveau/dispnv50/disp.c:2450:1: warning: no previous prototype for 'nv50_display_create' [-Wmissing-prototypes]
2450 | nv50_display_create(struct drm_device *dev)
| ^~~~~~~~~~~~~~~~~~~
..

# https://github.com/0day-ci/linux/commit/09f31e901946399a274ce954bdefa4108e895b33
git remote add linux-review https://github.com/0day-ci/linux
git fetch --no-tags linux-review Jacob-Pan/IOASID-extensions-for-guest-SVA/20200822-123111
git checkout 09f31e901946399a274ce954bdefa4108e895b33
vim +141 include/linux/ioasid.h

140
> 141 staic inline int ioasid_attach_spid(ioasid_t ioasid, ioasid_t spid)
142 {
143 return -ENOTSUPP;
144 }
145

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/[email protected]


Attachments:
(No filename) (15.76 kB)
.config.gz (27.48 kB)
Download all attachments

2020-08-22 14:02:50

by kernel test robot

[permalink] [raw]
Subject: Re: [PATCH v2 3/9] iommu/ioasid: Introduce ioasid_set APIs

Hi Jacob,

I love your patch! Perhaps something to improve:

[auto build test WARNING on iommu/next]
[also build test WARNING on linux/master linus/master v5.9-rc1 next-20200821]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url: https://github.com/0day-ci/linux/commits/Jacob-Pan/IOASID-extensions-for-guest-SVA/20200822-123111
base: https://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu.git next
config: x86_64-allyesconfig (attached as .config)
compiler: gcc-9 (Debian 9.3.0-15) 9.3.0
reproduce (this is a W=1 build):
# save the attached .config to linux build tree
make W=1 ARCH=x86_64

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <[email protected]>

All warnings (new ones prefixed by >>):

drivers/iommu/ioasid.c: In function 'ioasid_get_capacity':
>> drivers/iommu/ioasid.c:50:10: warning: old-style function definition [-Wold-style-definition]
50 | ioasid_t ioasid_get_capacity()
| ^~~~~~~~~~~~~~~~~~~
drivers/iommu/ioasid.c: At top level:
>> drivers/iommu/ioasid.c:577:6: warning: no previous prototype for 'ioasid_set_get' [-Wmissing-prototypes]
577 | void ioasid_set_get(struct ioasid_set *set)
| ^~~~~~~~~~~~~~

# https://github.com/0day-ci/linux/commit/59b6f319b27588b2a8a0268a4f4f09f7be458861
git remote add linux-review https://github.com/0day-ci/linux
git fetch --no-tags linux-review Jacob-Pan/IOASID-extensions-for-guest-SVA/20200822-123111
git checkout 59b6f319b27588b2a8a0268a4f4f09f7be458861
vim +50 drivers/iommu/ioasid.c

49
> 50 ioasid_t ioasid_get_capacity()
51 {
52 return ioasid_capacity;
53 }
54 EXPORT_SYMBOL_GPL(ioasid_get_capacity);
55

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/[email protected]


Attachments:
(No filename) (1.97 kB)
.config.gz (73.51 kB)
Download all attachments

2020-08-23 08:05:04

by Lu Baolu

[permalink] [raw]
Subject: Re: [PATCH v2 1/9] docs: Document IO Address Space ID (IOASID) APIs

Hi Jacob,

On 2020/8/22 12:35, Jacob Pan wrote:
> IOASID is used to identify address spaces that can be targeted by device
> DMA. It is a system-wide resource that is essential to its many users.
> This document is an attempt to help developers from all vendors navigate
> the APIs. At this time, ARM SMMU and Intel’s Scalable IO Virtualization
> (SIOV) enabled platforms are the primary users of IOASID. Examples of
> how SIOV components interact with IOASID APIs are provided in that many
> APIs are driven by the requirements from SIOV.
>
> Signed-off-by: Liu Yi L <[email protected]>
> Signed-off-by: Wu Hao <[email protected]>
> Signed-off-by: Jacob Pan <[email protected]>
> ---
> Documentation/ioasid.rst | 618 +++++++++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 618 insertions(+)
> create mode 100644 Documentation/ioasid.rst
>
> diff --git a/Documentation/ioasid.rst b/Documentation/ioasid.rst
> new file mode 100644
> index 000000000000..b6a8cdc885ff
> --- /dev/null
> +++ b/Documentation/ioasid.rst
> @@ -0,0 +1,618 @@
> +.. ioasid:
> +
> +=====================================
> +IO Address Space ID
> +=====================================
> +
> +IOASID is a generic name for PCIe Process Address ID (PASID) or ARM
> +SMMU sub-stream ID. An IOASID identifies an address space that DMA
> +requests can target.
> +
> +The primary use cases for IOASID are Shared Virtual Address (SVA) and
> +IO Virtual Address (IOVA). However, the requirements for IOASID

Can you please elaborate a bit more about how ioasid is used by IOVA?

> +management can vary among hardware architectures.
> +
> +This document covers the generic features supported by IOASID
> +APIs. Vendor-specific use cases are also illustrated with Intel's VT-d
> +based platforms as the first example.
> +
> +.. contents:: :local:
> +
> +Glossary
> +========
> +PASID - Process Address Space ID
> +
> +IOASID - IO Address Space ID (generic term for PCIe PASID and
> +sub-stream ID in SMMU)
> +
> +SVA/SVM - Shared Virtual Addressing/Memory
> +
> +ENQCMD - New Intel X86 ISA for efficient workqueue submission [1]
> +
> +DSA - Intel Data Streaming Accelerator [2]
> +
> +VDCM - Virtual device composition module [3]

Capitalize the first letter of each word.

> +
> +SIOV - Intel Scalable IO Virtualization
> +
> +
> +Key Concepts
> +============
> +
> +IOASID Set
> +-----------
> +An IOASID set is a group of IOASIDs allocated from the system-wide
> +IOASID pool. An IOASID set is created and can be identified by a
> +token of u64. Refer to IOASID set APIs for more details.
> +
> +IOASID set is particularly useful for guest SVA where each guest could
> +have its own IOASID set for security and efficiency reasons.
> +
> +IOASID Set Private ID (SPID)
> +----------------------------
> +SPIDs are introduced as IOASIDs within its set. Each SPID maps to a
> +system-wide IOASID but the namespace of SPID is within its IOASID
> +set. SPIDs can be used as guest IOASIDs where each guest could do
> +IOASID allocation from its own pool and map them to host physical
> +IOASIDs. SPIDs are particularly useful for supporting live migration
> +where decoupling guest and host physical resources are necessary.
> +
> +For example, two VMs can both allocate guest PASID/SPID #101 but map to
> +different host PASIDs #201 and #202 respectively as shown in the
> +diagram below.
> +::
> +
> + .------------------. .------------------.
> + | VM 1 | | VM 2 |
> + | | | |
> + |------------------| |------------------|
> + | GPASID/SPID 101 | | GPASID/SPID 101 |
> + '------------------' -------------------' Guest
> + __________|______________________|______________________
> + | | Host
> + v v
> + .------------------. .------------------.
> + | Host IOASID 201 | | Host IOASID 202 |
> + '------------------' '------------------'
> + | IOASID set 1 | | IOASID set 2 |
> + '------------------' '------------------'
> +
> +Guest PASID is treated as IOASID set private ID (SPID) within an
> +IOASID set, mappings between guest and host IOASIDs are stored in the
> +set for inquiry.

Is there a real IOASID set allocated in the host which represent the
SPID?

> +
> +IOASID APIs
> +===========
> +To get the IOASID APIs, users must #include <linux/ioasid.h>. These APIs
> +serve the following functionalities:
> +
> + - IOASID allocation/Free
> + - Group management in the form of ioasid_set
> + - Private data storage and lookup
> + - Reference counting
> + - Event notification in case of state change
> +
> +IOASID Set Level APIs
> +--------------------------
> +For use cases such as guest SVA it is necessary to manage IOASIDs at
> +a group level. For example, VMs may allocate multiple IOASIDs for
> +guest process address sharing (vSVA). It is imperative to enforce
> +VM-IOASID ownership such that malicious guest cannot target DMA
> +traffic outside its own IOASIDs, or free an active IOASID belong to
> +another VM.
> +::
> +
> + struct ioasid_set *ioasid_alloc_set(void *token, ioasid_t quota, u32 type)
> +
> + int ioasid_adjust_set(struct ioasid_set *set, int quota);
> +
> + void ioasid_set_get(struct ioasid_set *set)
> +
> + void ioasid_set_put(struct ioasid_set *set)
> +
> + void ioasid_set_get_locked(struct ioasid_set *set)
> +
> + void ioasid_set_put_locked(struct ioasid_set *set)
> +
> + int ioasid_set_for_each_ioasid(struct ioasid_set *sdata,
> + void (*fn)(ioasid_t id, void *data),
> + void *data)
> +
> +
> +IOASID set concept is introduced to represent such IOASID groups. Each
> +IOASID set is created with a token which can be one of the following
> +types:
> +
> + - IOASID_SET_TYPE_NULL (Arbitrary u64 value)
> + - IOASID_SET_TYPE_MM (Set token is a mm_struct)
> +
> +The explicit MM token type is useful when multiple users of an IOASID
> +set under the same process need to communicate about their shared IOASIDs.
> +E.g. An IOASID set created by VFIO for one guest can be associated
> +with the KVM instance for the same guest since they share a common mm_struct.
> +
> +The IOASID set APIs serve the following purposes:
> +
> + - Ownership/permission enforcement
> + - Take collective actions, e.g. free an entire set
> + - Event notifications within a set
> + - Look up a set based on token
> + - Quota enforcement
> +
> +Individual IOASID APIs
> +----------------------
> +Once an ioasid_set is created, IOASIDs can be allocated from the set.
> +Within the IOASID set namespace, set private ID (SPID) is supported. In
> +the VM use case, SPID can be used for storing guest PASID.
> +
> +::
> +
> + ioasid_t ioasid_alloc(struct ioasid_set *set, ioasid_t min, ioasid_t max,
> + void *private);
> +
> + int ioasid_get(struct ioasid_set *set, ioasid_t ioasid);
> +
> + void ioasid_put(struct ioasid_set *set, ioasid_t ioasid);
> +
> + int ioasid_get_locked(struct ioasid_set *set, ioasid_t ioasid);
> +
> + void ioasid_put_locked(struct ioasid_set *set, ioasid_t ioasid);
> +
> + void *ioasid_find(struct ioasid_set *set, ioasid_t ioasid,
> + bool (*getter)(void *));
> +
> + ioasid_t ioasid_find_by_spid(struct ioasid_set *set, ioasid_t spid)
> +
> + int ioasid_attach_data(struct ioasid_set *set, ioasid_t ioasid,
> + void *data);
> + int ioasid_attach_spid(struct ioasid_set *set, ioasid_t ioasid,
> + ioasid_t ssid);
> +
> +
> +Notifications
> +-------------
> +An IOASID may have multiple users, each user may have hardware context
> +associated with an IOASID. When the status of an IOASID changes,
> +e.g. an IOASID is being freed, users need to be notified such that the
> +associated hardware context can be cleared, flushed, and drained.
> +
> +::
> +
> + int ioasid_register_notifier(struct ioasid_set *set, struct
> + notifier_block *nb)
> +
> + void ioasid_unregister_notifier(struct ioasid_set *set,
> + struct notifier_block *nb)
> +
> + int ioasid_register_notifier_mm(struct mm_struct *mm, struct
> + notifier_block *nb)
> +
> + void ioasid_unregister_notifier_mm(struct mm_struct *mm, struct
> + notifier_block *nb)
> +
> + int ioasid_notify(ioasid_t ioasid, enum ioasid_notify_val cmd,
> + unsigned int flags)
> +
> +
> +Events
> +~~~~~~
> +Notification events are pertinent to individual IOASIDs, they can be
> +one of the following:
> +
> + - ALLOC
> + - FREE
> + - BIND
> + - UNBIND
> +
> +Ordering
> +~~~~~~~~
> +Ordering is supported by IOASID notification priorities as the
> +following (in ascending order):

What does ascending order exactly mean here? LAST->IOMMU->DEVICE...?

> +
> +::
> +
> + enum ioasid_notifier_prios {
> + IOASID_PRIO_LAST,
> + IOASID_PRIO_IOMMU,
> + IOASID_PRIO_DEVICE,
> + IOASID_PRIO_CPU,
> + };
> +
> +The typical use case is when an IOASID is freed due to an exception, DMA
> +source should be quiesced before tearing down other hardware contexts
> +in the system. This will reduce the churn in handling faults. DMA work
> +submission is performed by the CPU which is granted higher priority than
> +devices.
> +
> +
> +Scopes
> +~~~~~~
> +There are two types of notifiers in IOASID core: system-wide and
> +ioasid_set-wide.
> +
> +System-wide notifier is catering for users that need to handle all
> +IOASIDs in the system. E.g. The IOMMU driver handles all IOASIDs.
> +
> +Per ioasid_set notifier can be used by VM specific components such as
> +KVM. After all, each KVM instance only cares about IOASIDs within its
> +own set.
> +
> +
> +Atomicity
> +~~~~~~~~~
> +IOASID notifiers are atomic due to spinlocks used inside the IOASID
> +core. For tasks cannot be completed in the notifier handler, async work
> +can be submitted to complete the work later as long as there is no
> +ordering requirement.
> +
> +Reference counting
> +------------------
> +IOASID lifecycle management is based on reference counting. Users of
> +IOASID intend to align lifecycle with the IOASID need to hold
> +reference of the IOASID. IOASID will not be returned to the pool for
> +allocation until all references are dropped. Calling ioasid_free()
> +will mark the IOASID as FREE_PENDING if the IOASID has outstanding
> +reference. ioasid_get() is not allowed once an IOASID is in the
> +FREE_PENDING state.
> +
> +Event notifications are used to inform users of IOASID status change.
> +IOASID_FREE event prompts users to drop their references after
> +clearing its context.
> +
> +For example, on VT-d platform when an IOASID is freed, teardown
> +actions are performed on KVM, device driver, and IOMMU driver.
> +KVM shall register notifier block with::
> +
> + static struct notifier_block pasid_nb_kvm = {
> + .notifier_call = pasid_status_change_kvm,
> + .priority = IOASID_PRIO_CPU,
> + };
> +
> +VDCM driver shall register notifier block with::
> +
> + static struct notifier_block pasid_nb_vdcm = {
> + .notifier_call = pasid_status_change_vdcm,
> + .priority = IOASID_PRIO_DEVICE,
> + };
> +
> +In both cases, notifier blocks shall be registered on the IOASID set
> +such that *only* events from the matching VM is received.
> +
> +If KVM attempts to register notifier block before the IOASID set is
> +created for the MM token, the notifier block will be placed on a
> +pending list inside IOASID core. Once the token matching IOASID set
> +is created, IOASID will register the notifier block automatically.
> +IOASID core does not replay events for the existing IOASIDs in the
> +set. For IOASID set of MM type, notification blocks can be registered
> +on empty sets only. This is to avoid lost events.
> +
> +IOMMU driver shall register notifier block on global chain::
> +
> + static struct notifier_block pasid_nb_vtd = {
> + .notifier_call = pasid_status_change_vtd,
> + .priority = IOASID_PRIO_IOMMU,
> + };
> +
> +Custom allocator APIs
> +---------------------
> +
> +::
> +
> + int ioasid_register_allocator(struct ioasid_allocator_ops *allocator);
> +
> + void ioasid_unregister_allocator(struct ioasid_allocator_ops *allocator);
> +
> +Allocator Choices
> +~~~~~~~~~~~~~~~~~
> +IOASIDs are allocated for both host and guest SVA/IOVA usage. However,
> +allocators can be different. For example, on VT-d guest PASID
> +allocation must be performed via a virtual command interface which is
> +emulated by VMM.
> +
> +IOASID core has the notion of "custom allocator" such that guest can
> +register virtual command allocator that precedes the default one.
> +
> +Namespaces
> +~~~~~~~~~~
> +IOASIDs are limited system resources that default to 20 bits in
> +size. Since each device has its own table, theoretically the namespace
> +can be per device also. However, for security reasons sharing PASID
> +tables among devices are not good for isolation. Therefore, IOASID
> +namespace is system-wide.
> +
> +There are also other reasons to have this simpler system-wide
> +namespace. Take VT-d as an example, VT-d supports shared workqueue
> +and ENQCMD[1] where one IOASID could be used to submit work on
> +multiple devices that are shared with other VMs. This requires IOASID
> +to be system-wide. This is also the reason why guests must use an
> +emulated virtual command interface to allocate IOASID from the host.
> +
> +
> +Life cycle
> +==========
> +This section covers IOASID lifecycle management for both bare-metal
> +and guest usages. In bare-metal SVA, MMU notifier is directly hooked
> +up with IOMMU driver, therefore the process address space (MM)
> +lifecycle is aligned with IOASID.

MMU notifier for SVA mainly serves IOMMU cache flushes, right? The
IOASID life cycle for bare matal SVA is managed by the device driver
through the iommu sva api's iommu_sva_(un)bind_device()?

> +
> +However, guest MMU notifier is not available to host IOMMU driver,
> +when guest MM terminates unexpectedly, the events have to go through
> +VFIO and IOMMU UAPI to reach host IOMMU driver. There are also more
> +parties involved in guest SVA, e.g. on Intel VT-d platform, IOASIDs
> +are used by IOMMU driver, KVM, VDCM, and VFIO.
> +
> +Native IOASID Life Cycle (VT-d Example)
> +---------------------------------------
> +
> +The normal flow of native SVA code with Intel Data Streaming
> +Accelerator(DSA) [2] as example:
> +
> +1. Host user opens accelerator FD, e.g. DSA driver, or uacce;
> +2. DSA driver allocate WQ, do sva_bind_device();
> +3. IOMMU driver calls ioasid_alloc(), then bind PASID with device,
> + mmu_notifier_get()
> +4. DMA starts by DSA driver userspace
> +5. DSA userspace close FD
> +6. DSA/uacce kernel driver handles FD.close()
> +7. DSA driver stops DMA
> +8. DSA driver calls sva_unbind_device();
> +9. IOMMU driver does unbind, clears PASID context in IOMMU, flush
> + TLBs. mmu_notifier_put() called.
> +10. mmu_notifier.release() called, IOMMU SVA code calls ioasid_free()*
> +11. The IOASID is returned to the pool, reclaimed.
> +
> +::
> +
> + * With ENQCMD, PASID used on VT-d is not released in mmu_notifier() but
> + mmdrop(). mmdrop comes after FD close. Should not matter.
> + If the user process dies unexpectedly, Step #10 may come before
> + Step #5, in between, all DMA faults discarded. PRQ responded with
> + code INVALID REQUEST.
> +
> +During the normal teardown, the following three steps would happen in
> +order:
> +
> +1. Device driver stops DMA request
> +2. IOMMU driver unbinds PASID and mm, flush all TLBs, drain in-flight
> + requests.
> +3. IOASID freed
> +
> +Exception happens when process terminates *before* device driver stops
> +DMA and call IOMMU driver to unbind. The flow of process exists are as
> +follows:
> +
> +::
> +
> + do_exit() {
> + exit_mm() {
> + mm_put();
> + exit_mmap() {
> + intel_invalidate_range() //mmu notifier
> + tlb_finish_mmu()
> + mmu_notifier_release(mm) {
> + intel_iommu_release() {

intel_mm_release()

> + [2] intel_iommu_teardown_pasid();
> + intel_iommu_flush_tlbs();
> + }
> + // tlb_invalidate_range cb removed
> + }
> + unmap_vmas();
> + free_pgtables(); // IOMMU cannot walk PGT after this
> + };
> + }
> + exit_files(tsk) {
> + close_files() {
> + dsa_close();
> + [1] dsa_stop_dma();
> + intel_svm_unbind_pasid(); //nothing to do
> + }
> + }
> + }
> +
> + mmdrop() /* some random time later, lazy mm user */ {
> + mm_free_pgd();
> + destroy_context(mm); {
> + [3] ioasid_free();
> + }
> + }
> +
> +As shown in the list above, step #2 could happen before
> +#1. Unrecoverable(UR) faults could happen between #2 and #1.

The VT-d hardware will ignore UR faults due to the setting of FPD bit of
the PASID entry. The software won't see UR faults.

> +
> +Also notice that TLB invalidation occurs at mmu_notifier
> +invalidate_range callback as well as the release callback. The reason
> +is that release callback will delete IOMMU driver from the notifier
> +chain which may skip invalidate_range() calls during the exit path.
> +
> +To avoid unnecessary reporting of UR fault, IOMMU driver shall disable
> +fault reporting after free and before unbind.
> +
> +Guest IOASID Life Cycle (VT-d Example)
> +--------------------------------------
> +Guest IOASID life cycle starts with guest driver open(), this could be
> +uacce or individual accelerator driver such as DSA. At FD open,
> +sva_bind_device() is called which triggers a series of actions.
> +
> +The example below is an illustration of *normal* operations that
> +involves *all* the SW components in VT-d. The flow can be simpler if
> +no ENQCMD is supported.
> +
> +::
> +
> + VFIO IOMMU KVM VDCM IOASID Ref
> + ..................................................................
> + 1 ioasid_register_notifier/_mm()
> + 2 ioasid_alloc() 1
> + 3 bind_gpasid()
> + 4 iommu_bind()->ioasid_get() 2
> + 5 ioasid_notify(BIND)
> + 6 -> ioasid_get() 3
> + 7 -> vmcs_update_atomic()
> + 8 mdev_write(gpasid)
> + 9 hpasid=
> + 10 find_by_spid(gpasid) 4
> + 11 vdev_write(hpasid)
> + 12 -------- GUEST STARTS DMA --------------------------
> + 13 -------- GUEST STOPS DMA --------------------------
> + 14 mdev_clear(gpasid)
> + 15 vdev_clear(hpasid)
> + 16 ioasid_put() 3
> + 17 unbind_gpasid()
> + 18 iommu_ubind()
> + 19 ioasid_notify(UNBIND)
> + 20 -> vmcs_update_atomic()
> + 21 -> ioasid_put() 2
> + 22 ioasid_free() 1
> + 23 ioasid_put() 0
> + 24 Reclaimed
> + -------------- New Life Cycle Begin ----------------------------
> + 1 ioasid_alloc() -> 1
> +
> + Note: IOASID Notification Events: FREE, BIND, UNBIND
> +
> +Exception cases arise when a guest crashes or a malicious guest
> +attempts to cause disruption on the host system. The fault handling
> +rules are:
> +
> +1. IOASID free must *always* succeed.
> +2. An inactive period may be required before the freed IOASID is
> + reclaimed. During this period, consumers of IOASID perform cleanup.
> +3. Malfunction is limited to the guest owned resources for all
> + programming errors.
> +
> +The primary source of exception is when the following are out of
> +order:
> +
> +1. Start/Stop of DMA activity
> + (Guest device driver, mdev via VFIO)
> +2. Setup/Teardown of IOMMU PASID context, IOTLB, DevTLB flushes
> + (Host IOMMU driver bind/unbind)
> +3. Setup/Teardown of VMCS PASID translation table entries (KVM) in
> + case of ENQCMD
> +4. Programming/Clearing host PASID in VDCM (Host VDCM driver)
> +5. IOASID alloc/free (Host IOASID)
> +
> +VFIO is the *only* user-kernel interface, which is ultimately
> +responsible for exception handlings.
> +
> +#1 is processed the same way as the assigned device today based on
> +device file descriptors and events. There is no special handling.
> +
> +#3 is based on bind/unbind events emitted by #2.
> +
> +#4 is naturally aligned with IOASID life cycle in that an illegal
> +guest PASID programming would fail in obtaining reference of the
> +matching host IOASID.
> +
> +#5 is similar to #4. The fault will be reported to the user if PASID
> +used in the ENQCMD is not set up in VMCS PASID translation table.
> +
> +Therefore, the remaining out of order problem is between #2 and
> +#5. I.e. unbind vs. free. More specifically, free before unbind.
> +
> +IOASID notifier and refcounting are used to ensure order. Following
> +a publisher-subscriber pattern where:
> +
> +- Publishers: VFIO & IOMMU
> +- Subscribers: KVM, VDCM, IOMMU
> +
> +IOASID notifier is atomic which requires subscribers to do quick
> +handling of the event in the atomic context. Workqueue can be used for
> +any processing that requires thread context. IOASID reference must be
> +acquired before receiving the FREE event. The reference must be
> +dropped at the end of the processing in order to return the IOASID to
> +the pool.
> +
> +Let's examine the IOASID life cycle again when free happens *before*
> +unbind. This could be a result of misbehaving guests or crash. Assuming
> +VFIO cannot enforce unbind->free order. Notice that the setup part up
> +until step #12 is identical to the normal case, the flow below starts
> +with step 13.
> +
> +::
> +
> + VFIO IOMMU KVM VDCM IOASID Ref
> + ..................................................................
> + 13 -------- GUEST STARTS DMA --------------------------
> + 14 -------- *GUEST MISBEHAVES!!!* ----------------
> + 15 ioasid_free()
> + 16 ioasid_notify(FREE)
> + 17 mark_ioasid_inactive[1]
> + 18 kvm_nb_handler(FREE)
> + 19 vmcs_update_atomic()
> + 20 ioasid_put_locked() -> 3
> + 21 vdcm_nb_handler(FREE)
> + 22 iomm_nb_handler(FREE)
> + 23 ioasid_free() returns[2] schedule_work() 2
> + 24 schedule_work() vdev_clear_wk(hpasid)
> + 25 teardown_pasid_wk()
> + 26 ioasid_put() -> 1
> + 27 ioasid_put() 0
> + 28 Reclaimed
> + 29 unbind_gpasid()
> + 30 iommu_unbind()->ioasid_find() Fails[3]
> + -------------- New Life Cycle Begin ----------------------------
> +
> +Note:
> +
> +1. By marking IOASID inactive at step #17, no new references can be
> + held. ioasid_get/find() will return -ENOENT;
> +2. After step #23, all events can go out of order. Shall not affect
> + the outcome.
> +3. IOMMU driver fails to find private data for unbinding. If unbind is
> + called after the same IOASID is allocated for the same guest again,
> + this is a programming error. The damage is limited to the guest
> + itself since unbind performs permission checking based on the
> + IOASID set associated with the guest process.
> +
> +KVM PASID Translation Table Updates
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +Per VM PASID translation table is maintained by KVM in order to
> +support ENQCMD in the guest. The table contains host-guest PASID
> +translations to be consumed by CPU ucode. The synchronization of the
> +PASID states depends on VFIO/IOMMU driver, where IOCTL and atomic
> +notifiers are used. KVM must register IOASID notifier per VM instance
> +during launch time. The following events are handled:
> +
> +1. BIND/UNBIND
> +2. FREE
> +
> +Rules:
> +
> +1. Multiple devices can bind with the same PASID, this can be different PCI
> + devices or mdevs within the same PCI device. However, only the
> + *first* BIND and *last* UNBIND emit notifications.
> +2. IOASID code is responsible for ensuring the correctness of H-G
> + PASID mapping. There is no need for KVM to validate the
> + notification data.
> +3. When UNBIND happens *after* FREE, KVM will see error in
> + ioasid_get() even when the reclaim is not done. IOMMU driver will
> + also avoid sending UNBIND if the PASID is already FREE.
> +4. When KVM terminates *before* FREE & UNBIND, references will be
> + dropped for all host PASIDs.
> +
> +VDCM PASID Programming
> +~~~~~~~~~~~~~~~~~~~~~~
> +VDCM composes virtual devices and exposes them to the guests. When
> +the guest allocates a PASID then program it to the virtual device, VDCM
> +intercepts the programming attempt then program the matching host
> +PASID on to the hardware.
> +Conversely, when a device is going away, VDCM must be informed such
> +that PASID context on the hardware can be cleared. There could be
> +multiple mdevs assigned to different guests in the same VDCM. Since
> +the PASID table is shared at PCI device level, lazy clearing is not
> +secure. A malicious guest can attack by using newly freed PASIDs that
> +are allocated by another guest.
> +
> +By holding a reference of the PASID until VDCM cleans up the HW context,
> +it is guaranteed that PASID life cycles do not cross within the same
> +device.
> +
> +
> +Reference
> +====================================================
> +1. https://software.intel.com/sites/default/files/managed/c5/15/architecture-instruction-set-extensions-programming-reference.pdf
> +
> +2. https://01.org/blogs/2019/introducing-intel-data-streaming-accelerator
> +
> +3. https://software.intel.com/en-us/download/intel-data-streaming-accelerator-preliminary-architecture-specification
>

Best regards,
baolu

2020-08-24 02:43:14

by Lu Baolu

[permalink] [raw]
Subject: Re: [PATCH v2 3/9] iommu/ioasid: Introduce ioasid_set APIs

Hi Jacob,

On 8/22/20 12:35 PM, Jacob Pan wrote:
> ioasid_set was introduced as an arbitrary token that are shared by a
> group of IOASIDs. For example, if IOASID #1 and #2 are allocated via the
> same ioasid_set*, they are viewed as to belong to the same set.
>
> For guest SVA usages, system-wide IOASID resources need to be
> partitioned such that VMs can have its own quota and being managed
> separately. ioasid_set is the perfect candidate for meeting such
> requirements. This patch redefines and extends ioasid_set with the
> following new fields:
> - Quota
> - Reference count
> - Storage of its namespace
> - The token is stored in the new ioasid_set but with optional types
>
> ioasid_set level APIs are introduced that wires up these new data.
> Existing users of IOASID APIs are converted where a host IOASID set is
> allocated for bare-metal usage.
>
> Signed-off-by: Liu Yi L <[email protected]>
> Signed-off-by: Jacob Pan <[email protected]>
> ---
> drivers/iommu/intel/iommu.c | 27 ++-
> drivers/iommu/intel/pasid.h | 1 +
> drivers/iommu/intel/svm.c | 8 +-
> drivers/iommu/ioasid.c | 390 +++++++++++++++++++++++++++++++++++++++++---
> include/linux/ioasid.h | 82 ++++++++--
> 5 files changed, 465 insertions(+), 43 deletions(-)
>
> diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
> index a3a0b5c8921d..5813eeaa5edb 100644
> --- a/drivers/iommu/intel/iommu.c
> +++ b/drivers/iommu/intel/iommu.c
> @@ -42,6 +42,7 @@
> #include <linux/crash_dump.h>
> #include <linux/numa.h>
> #include <linux/swiotlb.h>
> +#include <linux/ioasid.h>
> #include <asm/irq_remapping.h>
> #include <asm/cacheflush.h>
> #include <asm/iommu.h>
> @@ -103,6 +104,9 @@
> */
> #define INTEL_IOMMU_PGSIZES (~0xFFFUL)
>
> +/* PASIDs used by host SVM */
> +struct ioasid_set *host_pasid_set;
> +
> static inline int agaw_to_level(int agaw)
> {
> return agaw + 2;
> @@ -3103,8 +3107,8 @@ static void intel_vcmd_ioasid_free(ioasid_t ioasid, void *data)
> * Sanity check the ioasid owner is done at upper layer, e.g. VFIO
> * We can only free the PASID when all the devices are unbound.
> */
> - if (ioasid_find(NULL, ioasid, NULL)) {
> - pr_alert("Cannot free active IOASID %d\n", ioasid);
> + if (IS_ERR(ioasid_find(host_pasid_set, ioasid, NULL))) {
> + pr_err("Cannot free IOASID %d, not in system set\n", ioasid);
> return;
> }
> vcmd_free_pasid(iommu, ioasid);
> @@ -3288,6 +3292,19 @@ static int __init init_dmars(void)
> if (ret)
> goto free_iommu;
>
> + /* PASID is needed for scalable mode irrespective to SVM */
> + if (intel_iommu_sm) {
> + ioasid_install_capacity(intel_pasid_max_id);
> + /* We should not run out of IOASIDs at boot */
> + host_pasid_set = ioasid_alloc_set(NULL, PID_MAX_DEFAULT,
> + IOASID_SET_TYPE_NULL);
> + if (IS_ERR_OR_NULL(host_pasid_set)) {
> + pr_err("Failed to enable host PASID allocator %lu\n",
> + PTR_ERR(host_pasid_set));
> + intel_iommu_sm = 0;
> + }
> + }
> +
> /*
> * for each drhd
> * enable fault log
> @@ -5149,7 +5166,7 @@ static void auxiliary_unlink_device(struct dmar_domain *domain,
> domain->auxd_refcnt--;
>
> if (!domain->auxd_refcnt && domain->default_pasid > 0)
> - ioasid_free(domain->default_pasid);
> + ioasid_free(host_pasid_set, domain->default_pasid);
> }
>
> static int aux_domain_add_dev(struct dmar_domain *domain,
> @@ -5167,7 +5184,7 @@ static int aux_domain_add_dev(struct dmar_domain *domain,
> int pasid;
>
> /* No private data needed for the default pasid */
> - pasid = ioasid_alloc(NULL, PASID_MIN,
> + pasid = ioasid_alloc(host_pasid_set, PASID_MIN,
> pci_max_pasids(to_pci_dev(dev)) - 1,
> NULL);
> if (pasid == INVALID_IOASID) {
> @@ -5210,7 +5227,7 @@ static int aux_domain_add_dev(struct dmar_domain *domain,
> spin_unlock(&iommu->lock);
> spin_unlock_irqrestore(&device_domain_lock, flags);
> if (!domain->auxd_refcnt && domain->default_pasid > 0)
> - ioasid_free(domain->default_pasid);
> + ioasid_free(host_pasid_set, domain->default_pasid);
>
> return ret;
> }
> diff --git a/drivers/iommu/intel/pasid.h b/drivers/iommu/intel/pasid.h
> index c9850766c3a9..ccdc23446015 100644
> --- a/drivers/iommu/intel/pasid.h
> +++ b/drivers/iommu/intel/pasid.h
> @@ -99,6 +99,7 @@ static inline bool pasid_pte_is_present(struct pasid_entry *pte)
> }
>
> extern u32 intel_pasid_max_id;
> +extern struct ioasid_set *host_pasid_set;
> int intel_pasid_alloc_id(void *ptr, int start, int end, gfp_t gfp);
> void intel_pasid_free_id(int pasid);
> void *intel_pasid_lookup_id(int pasid);
> diff --git a/drivers/iommu/intel/svm.c b/drivers/iommu/intel/svm.c
> index 37a9beabc0ca..634e191ca2c3 100644
> --- a/drivers/iommu/intel/svm.c
> +++ b/drivers/iommu/intel/svm.c
> @@ -551,7 +551,7 @@ intel_svm_bind_mm(struct device *dev, int flags, struct svm_dev_ops *ops,
> pasid_max = intel_pasid_max_id;
>
> /* Do not use PASID 0, reserved for RID to PASID */
> - svm->pasid = ioasid_alloc(NULL, PASID_MIN,
> + svm->pasid = ioasid_alloc(host_pasid_set, PASID_MIN,
> pasid_max - 1, svm);
> if (svm->pasid == INVALID_IOASID) {
> kfree(svm);
> @@ -568,7 +568,7 @@ intel_svm_bind_mm(struct device *dev, int flags, struct svm_dev_ops *ops,
> if (mm) {
> ret = mmu_notifier_register(&svm->notifier, mm);
> if (ret) {
> - ioasid_free(svm->pasid);
> + ioasid_free(host_pasid_set, svm->pasid);
> kfree(svm);
> kfree(sdev);
> goto out;
> @@ -586,7 +586,7 @@ intel_svm_bind_mm(struct device *dev, int flags, struct svm_dev_ops *ops,
> if (ret) {
> if (mm)
> mmu_notifier_unregister(&svm->notifier, mm);
> - ioasid_free(svm->pasid);
> + ioasid_free(host_pasid_set, svm->pasid);
> kfree(svm);
> kfree(sdev);
> goto out;
> @@ -655,7 +655,7 @@ static int intel_svm_unbind_mm(struct device *dev, int pasid)
> kfree_rcu(sdev, rcu);
>
> if (list_empty(&svm->devs)) {
> - ioasid_free(svm->pasid);
> + ioasid_free(host_pasid_set, svm->pasid);
> if (svm->mm)
> mmu_notifier_unregister(&svm->notifier, svm->mm);
> list_del(&svm->list);
> diff --git a/drivers/iommu/ioasid.c b/drivers/iommu/ioasid.c
> index 5f63af07acd5..f73b3dbfc37a 100644
> --- a/drivers/iommu/ioasid.c
> +++ b/drivers/iommu/ioasid.c
> @@ -1,22 +1,58 @@
> // SPDX-License-Identifier: GPL-2.0
> /*
> * I/O Address Space ID allocator. There is one global IOASID space, split into
> - * subsets. Users create a subset with DECLARE_IOASID_SET, then allocate and
> - * free IOASIDs with ioasid_alloc and ioasid_free.
> + * subsets. Users create a subset with ioasid_alloc_set, then allocate/free IDs
> + * with ioasid_alloc and ioasid_free.
> */
> -#include <linux/ioasid.h>
> #include <linux/module.h>
> #include <linux/slab.h>
> #include <linux/spinlock.h>
> #include <linux/xarray.h>
> +#include <linux/ioasid.h>
> +
> +static DEFINE_XARRAY_ALLOC(ioasid_sets);
> +enum ioasid_state {
> + IOASID_STATE_INACTIVE,
> + IOASID_STATE_ACTIVE,
> + IOASID_STATE_FREE_PENDING,
> +};
>
> +/**
> + * struct ioasid_data - Meta data about ioasid
> + *
> + * @id: Unique ID
> + * @users Number of active users
> + * @state Track state of the IOASID
> + * @set Meta data of the set this IOASID belongs to
> + * @private Private data associated with the IOASID
> + * @rcu For free after RCU grace period
> + */
> struct ioasid_data {
> ioasid_t id;
> struct ioasid_set *set;
> + refcount_t users;
> + enum ioasid_state state;
> void *private;
> struct rcu_head rcu;
> };
>
> +/* Default to PCIe standard 20 bit PASID */
> +#define PCI_PASID_MAX 0x100000
> +static ioasid_t ioasid_capacity = PCI_PASID_MAX;
> +static ioasid_t ioasid_capacity_avail = PCI_PASID_MAX;
> +
> +void ioasid_install_capacity(ioasid_t total)
> +{
> + ioasid_capacity = ioasid_capacity_avail = total;

Need any check for multiple settings or setting after used?

> +}
> +EXPORT_SYMBOL_GPL(ioasid_install_capacity);
> +
> +ioasid_t ioasid_get_capacity()
> +{
> + return ioasid_capacity;
> +}
> +EXPORT_SYMBOL_GPL(ioasid_get_capacity);
> +
> /*
> * struct ioasid_allocator_data - Internal data structure to hold information
> * about an allocator. There are two types of allocators:
> @@ -306,11 +342,23 @@ ioasid_t ioasid_alloc(struct ioasid_set *set, ioasid_t min, ioasid_t max,
> {
> struct ioasid_data *data;
> void *adata;
> - ioasid_t id;
> + ioasid_t id = INVALID_IOASID;
> +
> + spin_lock(&ioasid_allocator_lock);
> + /* Check if the IOASID set has been allocated and initialized */
> + if (WARN_ON(xa_load(&ioasid_sets, set->sid) != set)) {
> + pr_warn("Invalid set\n");
> + goto done_unlock;
> + }
> +
> + if (set->quota <= set->nr_ioasids) {
> + pr_err("IOASID set %d out of quota %d\n", set->sid, set->quota);
> + goto done_unlock;
> + }
>
> data = kzalloc(sizeof(*data), GFP_ATOMIC);
> if (!data)
> - return INVALID_IOASID;
> + goto done_unlock;
>
> data->set = set;
> data->private = private;
> @@ -319,7 +367,6 @@ ioasid_t ioasid_alloc(struct ioasid_set *set, ioasid_t min, ioasid_t max,
> * Custom allocator needs allocator data to perform platform specific
> * operations.
> */
> - spin_lock(&ioasid_allocator_lock);
> adata = active_allocator->flags & IOASID_ALLOCATOR_CUSTOM ? active_allocator->ops->pdata : data;
> id = active_allocator->ops->alloc(min, max, adata);
> if (id == INVALID_IOASID) {
> @@ -335,42 +382,339 @@ ioasid_t ioasid_alloc(struct ioasid_set *set, ioasid_t min, ioasid_t max,
> goto exit_free;
> }
> data->id = id;
> + data->state = IOASID_STATE_ACTIVE;
> + refcount_set(&data->users, 1);
> +
> + /* Store IOASID in the per set data */
> + if (xa_err(xa_store(&set->xa, id, data, GFP_ATOMIC))) {
> + pr_err("Failed to ioasid %d in set %d\n", id, set->sid);
> + goto exit_free;
> + }
> + set->nr_ioasids++;
> + goto done_unlock;
>
> - spin_unlock(&ioasid_allocator_lock);
> - return id;
> exit_free:
> - spin_unlock(&ioasid_allocator_lock);
> kfree(data);
> - return INVALID_IOASID;
> +done_unlock:
> + spin_unlock(&ioasid_allocator_lock);
> + return id;
> }
> EXPORT_SYMBOL_GPL(ioasid_alloc);
>
> +static void ioasid_do_free(struct ioasid_data *data)
> +{
> + struct ioasid_data *ioasid_data;
> + struct ioasid_set *sdata;
> +
> + active_allocator->ops->free(data->id, active_allocator->ops->pdata);
> + /* Custom allocator needs additional steps to free the xa element */
> + if (active_allocator->flags & IOASID_ALLOCATOR_CUSTOM) {
> + ioasid_data = xa_erase(&active_allocator->xa, data->id);
> + kfree_rcu(ioasid_data, rcu);
> + }
> +
> + sdata = xa_load(&ioasid_sets, data->set->sid);
> + if (!sdata) {
> + pr_err("No set %d for IOASID %d\n", data->set->sid,
> + data->id);
> + return;
> + }
> + xa_erase(&sdata->xa, data->id);
> + sdata->nr_ioasids--;
> +}
> +
> +static void ioasid_free_locked(struct ioasid_set *set, ioasid_t ioasid)
> +{
> + struct ioasid_data *data;
> +
> + data = xa_load(&active_allocator->xa, ioasid);
> + if (!data) {
> + pr_err("Trying to free unknown IOASID %u\n", ioasid);
> + return;
> + }
> +
> + if (data->set != set) {
> + pr_warn("Cannot free IOASID %u due to set ownership\n", ioasid);
> + return;
> + }
> + data->state = IOASID_STATE_FREE_PENDING;
> +
> + if (!refcount_dec_and_test(&data->users))
> + return;
> +
> + ioasid_do_free(data);
> +}
> +
> /**
> - * ioasid_free - Free an IOASID
> - * @ioasid: the ID to remove
> + * ioasid_free - Drop reference on an IOASID. Free if refcount drops to 0,
> + * including free from its set and system-wide list.
> + * @set: The ioasid_set to check permission with. If not NULL, IOASID
> + * free will fail if the set does not match.
> + * @ioasid: The IOASID to remove
> */
> -void ioasid_free(ioasid_t ioasid)
> +void ioasid_free(struct ioasid_set *set, ioasid_t ioasid)
> {
> - struct ioasid_data *ioasid_data;
> + spin_lock(&ioasid_allocator_lock);
> + ioasid_free_locked(set, ioasid);
> + spin_unlock(&ioasid_allocator_lock);
> +}
> +EXPORT_SYMBOL_GPL(ioasid_free);
>
> +/**
> + * ioasid_alloc_set - Allocate a new IOASID set for a given token
> + *
> + * @token: Unique token of the IOASID set, cannot be NULL

What's the use of @token? I might be able to find the answer in the
code, but i have no idea when i comes here first time. :-)

This line of comment says that token cannot be NULL. It seems not to
match the real code, where token could be NULL if type is
IOASID_SET_TYPE_NULL.

> + * @quota: Quota allowed in this set. Only for new set creation
> + * @flags: Special requirements
> + *
> + * IOASID can be limited system-wide resource that requires quota management.
> + * If caller does not wish to enforce quota, use IOASID_SET_NO_QUOTA flag.

If you are not going to add NO_QUOTA support this time, I'd suggest you
to remove above comment.

> + *
> + * Token will be stored in the ioasid_set returned. A reference will be taken
> + * upon finding a matching set or newly created set.
> + * IOASID allocation within the set and other per set operations will use
> + * the retured ioasid_set *.

nit: remove *, or you mean the ioasid_set pointer?

> + *
> + */
> +struct ioasid_set *ioasid_alloc_set(void *token, ioasid_t quota, int type)
> +{
> + struct ioasid_set *sdata;
> + unsigned long index;
> + ioasid_t id;
> +
> + if (type >= IOASID_SET_TYPE_NR)
> + return ERR_PTR(-EINVAL);
> +
> + /*
> + * Need to check space available if we share system-wide quota.
> + * TODO: we may need to support quota free sets in the future.
> + */
> spin_lock(&ioasid_allocator_lock);
> - ioasid_data = xa_load(&active_allocator->xa, ioasid);
> - if (!ioasid_data) {
> - pr_err("Trying to free unknown IOASID %u\n", ioasid);
> + if (quota > ioasid_capacity_avail) {

Thinking that ioasid_set itself needs an ioasid, so the check might be

(quota + 1 > ioasid_capacity_avail)?

> + pr_warn("Out of IOASID capacity! ask %d, avail %d\n",
> + quota, ioasid_capacity_avail);
> + sdata = ERR_PTR(-ENOSPC);
> goto exit_unlock;
> }
>
> - active_allocator->ops->free(ioasid, active_allocator->ops->pdata);
> - /* Custom allocator needs additional steps to free the xa element */
> - if (active_allocator->flags & IOASID_ALLOCATOR_CUSTOM) {
> - ioasid_data = xa_erase(&active_allocator->xa, ioasid);
> - kfree_rcu(ioasid_data, rcu);
> + /*
> + * Token is only unique within its types but right now we have only
> + * mm type. If we have more token types, we have to match type as well.
> + */
> + switch (type) {
> + case IOASID_SET_TYPE_MM:
> + /* Search existing set tokens, reject duplicates */
> + xa_for_each(&ioasid_sets, index, sdata) {
> + if (sdata->token == token &&
> + sdata->type == IOASID_SET_TYPE_MM) {
> + sdata = ERR_PTR(-EEXIST);
> + goto exit_unlock;
> + }
> + }

Do you need to enforce non-NULL token policy here?

> + break;
> + case IOASID_SET_TYPE_NULL:
> + if (!token)
> + break;
> + fallthrough;
> + default:
> + pr_err("Invalid token and IOASID type\n");
> + sdata = ERR_PTR(-EINVAL);
> + goto exit_unlock;
> }
>
> + /* REVISIT: may support set w/o quota, use system available */
> + if (!quota) {
> + sdata = ERR_PTR(-EINVAL);
> + goto exit_unlock;
> + }
> +
> + sdata = kzalloc(sizeof(*sdata), GFP_ATOMIC);
> + if (!sdata) {
> + sdata = ERR_PTR(-ENOMEM);
> + goto exit_unlock;
> + }
> +
> + if (xa_alloc(&ioasid_sets, &id, sdata,
> + XA_LIMIT(0, ioasid_capacity_avail - quota),
> + GFP_ATOMIC)) {
> + kfree(sdata);
> + sdata = ERR_PTR(-ENOSPC);
> + goto exit_unlock;
> + }
> +
> + sdata->token = token;
> + sdata->type = type;
> + sdata->quota = quota;
> + sdata->sid = id;
> + refcount_set(&sdata->ref, 1);
> +
> + /*
> + * Per set XA is used to store private IDs within the set, get ready
> + * for ioasid_set private ID and system-wide IOASID allocation
> + * results.
> + */

I'm not sure that I understood the per-set XA correctly. Is it used to
save both private ID and real ID allocated from the system-wide pool? If
so, isn't private id might be equal to real ID?

> + xa_init_flags(&sdata->xa, XA_FLAGS_ALLOC);
> + ioasid_capacity_avail -= quota;

As mentioned above, the ioasid_set consumed one extra ioasid, so

ioasid_capacity_avail -= (quota + 1);

?

> +
> exit_unlock:
> spin_unlock(&ioasid_allocator_lock);
> +
> + return sdata;
> }
> -EXPORT_SYMBOL_GPL(ioasid_free);
> +EXPORT_SYMBOL_GPL(ioasid_alloc_set);
> +
> +void ioasid_set_get_locked(struct ioasid_set *set)
> +{
> + if (WARN_ON(xa_load(&ioasid_sets, set->sid) != set)) {
> + pr_warn("Invalid set data\n");
> + return;
> + }
> +
> + refcount_inc(&set->ref);
> +}
> +EXPORT_SYMBOL_GPL(ioasid_set_get_locked);
> +
> +void ioasid_set_get(struct ioasid_set *set)
> +{
> + spin_lock(&ioasid_allocator_lock);
> + ioasid_set_get_locked(set);
> + spin_unlock(&ioasid_allocator_lock);
> +}
> +EXPORT_SYMBOL_GPL(ioasid_set_get);
> +
> +void ioasid_set_put_locked(struct ioasid_set *set)
> +{
> + struct ioasid_data *entry;
> + unsigned long index;
> +
> + if (WARN_ON(xa_load(&ioasid_sets, set->sid) != set)) {

There are multiple occurences of this line of code, how about defining
it as a inline helper?

static inline bool ioasid_is_valid(struct ioasid_set *set)
{
return xa_load(&ioasid_sets, set->sid) == set;
}

> + pr_warn("Invalid set data\n");
> + return;
> + }
> +
> + if (!refcount_dec_and_test(&set->ref)) {
> + pr_debug("%s: IOASID set %d has %d users\n",
> + __func__, set->sid, refcount_read(&set->ref));
> + return;
> + }
> +
> + /* The set is already empty, we just destroy the set. */
> + if (xa_empty(&set->xa))
> + goto done_destroy;
> +
> + /*
> + * Free all PASIDs from system-wide IOASID pool, all subscribers gets
> + * notified and do cleanup of their own.
> + * Note that some references of the IOASIDs within the set can still
> + * be held after the free call. This is OK in that the IOASIDs will be
> + * marked inactive, the only operations can be done is ioasid_put.
> + * No need to track IOASID set states since there is no reclaim phase.
> + */
> + xa_for_each(&set->xa, index, entry) {
> + ioasid_free_locked(set, index);
> + /* Free from per set private pool */
> + xa_erase(&set->xa, index);
> + }
> +
> +done_destroy:
> + /* Return the quota back to system pool */
> + ioasid_capacity_avail += set->quota;
> + kfree_rcu(set, rcu);
> +
> + /*
> + * Token got released right away after the ioasid_set is freed.
> + * If a new set is created immediately with the newly released token,
> + * it will not allocate the same IOASIDs unless they are reclaimed.
> + */
> + xa_erase(&ioasid_sets, set->sid);

No. pointer is used after free.

> +}
> +EXPORT_SYMBOL_GPL(ioasid_set_put_locked);
> +
> +/**
> + * ioasid_set_put - Drop a reference to the IOASID set. Free all IOASIDs within
> + * the set if there are no more users.
> + *
> + * @set: The IOASID set ID to be freed
> + *
> + * If refcount drops to zero, all IOASIDs allocated within the set will be
> + * freed.
> + */
> +void ioasid_set_put(struct ioasid_set *set)
> +{
> + spin_lock(&ioasid_allocator_lock);
> + ioasid_set_put_locked(set);
> + spin_unlock(&ioasid_allocator_lock);
> +}
> +EXPORT_SYMBOL_GPL(ioasid_set_put);
> +
> +/**
> + * ioasid_adjust_set - Adjust the quota of an IOASID set
> + * @set: IOASID set to be assigned
> + * @quota: Quota allowed in this set
> + *
> + * Return 0 on success. If the new quota is smaller than the number of
> + * IOASIDs already allocated, -EINVAL will be returned. No change will be
> + * made to the existing quota.
> + */
> +int ioasid_adjust_set(struct ioasid_set *set, int quota)
> +{
> + int ret = 0;
> +
> + spin_lock(&ioasid_allocator_lock);
> + if (set->nr_ioasids > quota) {
> + pr_err("New quota %d is smaller than outstanding IOASIDs %d\n",
> + quota, set->nr_ioasids);
> + ret = -EINVAL;
> + goto done_unlock;
> + }
> +
> + if (quota >= ioasid_capacity_avail) {

This check doesn't make sense since you are updating (not asking for) a
quota.

if ((quota > set->quota) &&
(quota - set->quota > ioasid_capacity_avail))

> + ret = -ENOSPC;
> + goto done_unlock;
> + }
> +
> + /* Return the delta back to system pool */
> + ioasid_capacity_avail += set->quota - quota;

ioasid_capacity_avail is defined as a unsigned int, hence this always
increase the available capacity value even the caller is asking for a
bigger quota?

> +
> + /*
> + * May have a policy to prevent giving all available IOASIDs
> + * to one set. But we don't enforce here, it should be in the
> + * upper layers.
> + */
> + set->quota = quota;
> +
> +done_unlock:
> + spin_unlock(&ioasid_allocator_lock);
> +
> + return ret;
> +}
> +EXPORT_SYMBOL_GPL(ioasid_adjust_set);
> +
> +/**
> + * ioasid_set_for_each_ioasid - Iterate over all the IOASIDs within the set
> + *
> + * Caller must hold a reference of the set and handles its own locking.

Do you need to hold ioasid_allocator_lock here?

> + */
> +int ioasid_set_for_each_ioasid(struct ioasid_set *set,
> + void (*fn)(ioasid_t id, void *data),
> + void *data)
> +{
> + struct ioasid_data *entry;
> + unsigned long index;
> + int ret = 0;
> +
> + if (xa_empty(&set->xa)) {
> + pr_warn("No IOASIDs in the set %d\n", set->sid);
> + return -ENOENT;
> + }
> +
> + xa_for_each(&set->xa, index, entry) {
> + fn(index, data);
> + }
> +
> + return ret;
> +}
> +EXPORT_SYMBOL_GPL(ioasid_set_for_each_ioasid);
>
> /**
> * ioasid_find - Find IOASID data
> diff --git a/include/linux/ioasid.h b/include/linux/ioasid.h
> index 9c44947a68c8..412d025d440e 100644
> --- a/include/linux/ioasid.h
> +++ b/include/linux/ioasid.h
> @@ -10,8 +10,35 @@ typedef unsigned int ioasid_t;
> typedef ioasid_t (*ioasid_alloc_fn_t)(ioasid_t min, ioasid_t max, void *data);
> typedef void (*ioasid_free_fn_t)(ioasid_t ioasid, void *data);
>
> +/* IOASID set types */
> +enum ioasid_set_type {
> + IOASID_SET_TYPE_NULL = 1, /* Set token is NULL */
> + IOASID_SET_TYPE_MM, /* Set token is a mm_struct,
> + * i.e. associated with a process
> + */
> + IOASID_SET_TYPE_NR,
> +};
> +
> +/**
> + * struct ioasid_set - Meta data about ioasid_set
> + * @type: Token types and other features
> + * @token: Unique to identify an IOASID set
> + * @xa: XArray to store ioasid_set private IDs, can be used for
> + * guest-host IOASID mapping, or just a private IOASID namespace.
> + * @quota: Max number of IOASIDs can be allocated within the set
> + * @nr_ioasids Number of IOASIDs currently allocated in the set
> + * @sid: ID of the set
> + * @ref: Reference count of the users
> + */
> struct ioasid_set {
> - int dummy;
> + void *token;
> + struct xarray xa;
> + int type;
> + int quota;
> + int nr_ioasids;
> + int sid;
> + refcount_t ref;
> + struct rcu_head rcu;
> };
>
> /**
> @@ -29,31 +56,64 @@ struct ioasid_allocator_ops {
> void *pdata;
> };
>
> -#define DECLARE_IOASID_SET(name) struct ioasid_set name = { 0 }
> -
> #if IS_ENABLED(CONFIG_IOASID)
> +void ioasid_install_capacity(ioasid_t total);
> +ioasid_t ioasid_get_capacity(void);
> +struct ioasid_set *ioasid_alloc_set(void *token, ioasid_t quota, int type);
> +int ioasid_adjust_set(struct ioasid_set *set, int quota);
> +void ioasid_set_get_locked(struct ioasid_set *set);
> +void ioasid_set_put_locked(struct ioasid_set *set);
> +void ioasid_set_put(struct ioasid_set *set);
> +
> ioasid_t ioasid_alloc(struct ioasid_set *set, ioasid_t min, ioasid_t max,
> void *private);
> -void ioasid_free(ioasid_t ioasid);
> -void *ioasid_find(struct ioasid_set *set, ioasid_t ioasid,
> - bool (*getter)(void *));
> +void ioasid_free(struct ioasid_set *set, ioasid_t ioasid);
> +
> +bool ioasid_is_active(ioasid_t ioasid);
> +
> +void *ioasid_find(struct ioasid_set *set, ioasid_t ioasid, bool (*getter)(void *));
> +int ioasid_attach_data(ioasid_t ioasid, void *data);
> int ioasid_register_allocator(struct ioasid_allocator_ops *allocator);
> void ioasid_unregister_allocator(struct ioasid_allocator_ops *allocator);
> -int ioasid_attach_data(ioasid_t ioasid, void *data);
> -
> +void ioasid_is_in_set(struct ioasid_set *set, ioasid_t ioasid);
> +int ioasid_set_for_each_ioasid(struct ioasid_set *sdata,
> + void (*fn)(ioasid_t id, void *data),
> + void *data);
> #else /* !CONFIG_IOASID */
> +static inline void ioasid_install_capacity(ioasid_t total)
> +{
> +}
> +
> +static inline ioasid_t ioasid_get_capacity(void)
> +{
> + return 0;
> +}
> +
> static inline ioasid_t ioasid_alloc(struct ioasid_set *set, ioasid_t min,
> ioasid_t max, void *private)
> {
> return INVALID_IOASID;
> }
>
> -static inline void ioasid_free(ioasid_t ioasid)
> +static inline void ioasid_free(struct ioasid_set *set, ioasid_t ioasid)
> +{
> +}
> +
> +static inline bool ioasid_is_active(ioasid_t ioasid)
> +{
> + return false;
> +}
> +
> +static inline struct ioasid_set *ioasid_alloc_set(void *token, ioasid_t quota, int type)
> +{
> + return ERR_PTR(-ENOTSUPP);
> +}
> +
> +static inline void ioasid_set_put(struct ioasid_set *set)
> {
> }
>
> -static inline void *ioasid_find(struct ioasid_set *set, ioasid_t ioasid,
> - bool (*getter)(void *))
> +static inline void *ioasid_find(struct ioasid_set *set, ioasid_t ioasid, bool (*getter)(void *))
> {
> return NULL;
> }
>

Best regards,
baolu

2020-08-24 03:16:48

by Lu Baolu

[permalink] [raw]
Subject: Re: [PATCH v2 4/9] iommu/ioasid: Add reference couting functions

Hi Jacob,

On 8/22/20 12:35 PM, Jacob Pan wrote:
> There can be multiple users of an IOASID, each user could have hardware
> contexts associated with the IOASID. In order to align lifecycles,
> reference counting is introduced in this patch. It is expected that when
> an IOASID is being freed, each user will drop a reference only after its
> context is cleared.
>
> Signed-off-by: Jacob Pan <[email protected]>
> ---
> drivers/iommu/ioasid.c | 113 +++++++++++++++++++++++++++++++++++++++++++++++++
> include/linux/ioasid.h | 4 ++
> 2 files changed, 117 insertions(+)
>
> diff --git a/drivers/iommu/ioasid.c b/drivers/iommu/ioasid.c
> index f73b3dbfc37a..5f31d63c75b1 100644
> --- a/drivers/iommu/ioasid.c
> +++ b/drivers/iommu/ioasid.c
> @@ -717,6 +717,119 @@ int ioasid_set_for_each_ioasid(struct ioasid_set *set,
> EXPORT_SYMBOL_GPL(ioasid_set_for_each_ioasid);
>
> /**
> + * IOASID refcounting rules
> + * - ioasid_alloc() set initial refcount to 1
> + *
> + * - ioasid_free() decrement and test refcount.
> + * If refcount is 0, ioasid will be freed. Deleted from the system-wide
> + * xarray as well as per set xarray. The IOASID will be returned to the
> + * pool and available for new allocations.
> + *
> + * If recount is non-zero, mark IOASID as IOASID_STATE_FREE_PENDING.
> + * No new reference can be added. The IOASID is not returned to the pool
> + * for reuse.
> + * After free, ioasid_get() will return error but ioasid_find() and other
> + * non refcount adding APIs will continue to work until the last reference
> + * is dropped
> + *
> + * - ioasid_get() get a reference on an active IOASID
> + *
> + * - ioasid_put() decrement and test refcount of the IOASID.
> + * If refcount is 0, ioasid will be freed. Deleted from the system-wide
> + * xarray as well as per set xarray. The IOASID will be returned to the
> + * pool and available for new allocations.
> + * Do nothing if refcount is non-zero.
> + *
> + * - ioasid_find() does not take reference, caller must hold reference
> + *
> + * ioasid_free() can be called multiple times without error until all refs are
> + * dropped.
> + */
> +
> +int ioasid_get_locked(struct ioasid_set *set, ioasid_t ioasid)
> +{
> + struct ioasid_data *data;
> +
> + data = xa_load(&active_allocator->xa, ioasid);
> + if (!data) {
> + pr_err("Trying to get unknown IOASID %u\n", ioasid);
> + return -EINVAL;
> + }
> + if (data->state == IOASID_STATE_FREE_PENDING) {
> + pr_err("Trying to get IOASID being freed%u\n", ioasid);
> + return -EBUSY;
> + }
> +
> + if (set && data->set != set) {
> + pr_err("Trying to get IOASID not in set%u\n", ioasid);
> + /* data found but does not belong to the set */
> + return -EACCES;
> + }
> + refcount_inc(&data->users);
> +
> + return 0;
> +}
> +EXPORT_SYMBOL_GPL(ioasid_get_locked);
> +
> +/**
> + * ioasid_get - Obtain a reference of an ioasid
> + * @set
> + * @ioasid
> + *
> + * Check set ownership if @set is non-null.
> + */
> +int ioasid_get(struct ioasid_set *set, ioasid_t ioasid)
> +{
> + int ret = 0;
> +
> + spin_lock(&ioasid_allocator_lock);
> + ret = ioasid_get_locked(set, ioasid);
> + spin_unlock(&ioasid_allocator_lock);
> +
> + return ret;
> +}
> +EXPORT_SYMBOL_GPL(ioasid_get);
> +
> +void ioasid_put_locked(struct ioasid_set *set, ioasid_t ioasid)
> +{
> + struct ioasid_data *data;
> +
> + data = xa_load(&active_allocator->xa, ioasid);
> + if (!data) {
> + pr_err("Trying to put unknown IOASID %u\n", ioasid);
> + return;
> + }
> +
> + if (set && data->set != set) {
> + pr_err("Trying to drop IOASID not in the set %u\n", ioasid);
> + return;
> + }
> +
> + if (!refcount_dec_and_test(&data->users)) {
> + pr_debug("%s: IOASID %d has %d remainning users\n",
> + __func__, ioasid, refcount_read(&data->users));
> + return;
> + }
> + ioasid_do_free(data);
> +}
> +EXPORT_SYMBOL_GPL(ioasid_put_locked);
> +
> +/**
> + * ioasid_put - Drop a reference of an ioasid
> + * @set
> + * @ioasid
> + *
> + * Check set ownership if @set is non-null.
> + */
> +void ioasid_put(struct ioasid_set *set, ioasid_t ioasid)
> +{
> + spin_lock(&ioasid_allocator_lock);
> + ioasid_put_locked(set, ioasid);
> + spin_unlock(&ioasid_allocator_lock);
> +}
> +EXPORT_SYMBOL_GPL(ioasid_put);
> +
> +/**
> * ioasid_find - Find IOASID data
> * @set: the IOASID set
> * @ioasid: the IOASID to find

Do you need to increase the refcount of the found ioasid and ask the
caller to drop it after use? Otherwise, the ioasid might be freed
elsewhere.

Best regards,
baolu

> diff --git a/include/linux/ioasid.h b/include/linux/ioasid.h
> index 412d025d440e..310abe4187a3 100644
> --- a/include/linux/ioasid.h
> +++ b/include/linux/ioasid.h
> @@ -76,6 +76,10 @@ int ioasid_attach_data(ioasid_t ioasid, void *data);
> int ioasid_register_allocator(struct ioasid_allocator_ops *allocator);
> void ioasid_unregister_allocator(struct ioasid_allocator_ops *allocator);
> void ioasid_is_in_set(struct ioasid_set *set, ioasid_t ioasid);
> +int ioasid_get(struct ioasid_set *set, ioasid_t ioasid);
> +int ioasid_get_locked(struct ioasid_set *set, ioasid_t ioasid);
> +void ioasid_put(struct ioasid_set *set, ioasid_t ioasid);
> +void ioasid_put_locked(struct ioasid_set *set, ioasid_t ioasid);
> int ioasid_set_for_each_ioasid(struct ioasid_set *sdata,
> void (*fn)(ioasid_t id, void *data),
> void *data);
>

2020-08-24 10:35:07

by Jean-Philippe Brucker

[permalink] [raw]
Subject: Re: [PATCH v2 1/9] docs: Document IO Address Space ID (IOASID) APIs

On Fri, Aug 21, 2020 at 09:35:10PM -0700, Jacob Pan wrote:
> IOASID is used to identify address spaces that can be targeted by device
> DMA. It is a system-wide resource that is essential to its many users.
> This document is an attempt to help developers from all vendors navigate
> the APIs. At this time, ARM SMMU and Intel’s Scalable IO Virtualization
> (SIOV) enabled platforms are the primary users of IOASID. Examples of
> how SIOV components interact with IOASID APIs are provided in that many
> APIs are driven by the requirements from SIOV.
>
> Signed-off-by: Liu Yi L <[email protected]>
> Signed-off-by: Wu Hao <[email protected]>
> Signed-off-by: Jacob Pan <[email protected]>
> ---
> Documentation/ioasid.rst | 618 +++++++++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 618 insertions(+)
> create mode 100644 Documentation/ioasid.rst
>
> diff --git a/Documentation/ioasid.rst b/Documentation/ioasid.rst

Thanks for writing this up. Should it go to Documentation/driver-api/, or
Documentation/driver-api/iommu/? I think this also needs to Cc
[email protected] and [email protected]

> new file mode 100644
> index 000000000000..b6a8cdc885ff
> --- /dev/null
> +++ b/Documentation/ioasid.rst
> @@ -0,0 +1,618 @@
> +.. ioasid:
> +
> +=====================================
> +IO Address Space ID
> +=====================================
> +
> +IOASID is a generic name for PCIe Process Address ID (PASID) or ARM
> +SMMU sub-stream ID. An IOASID identifies an address space that DMA

"SubstreamID"

> +requests can target.
> +
> +The primary use cases for IOASID are Shared Virtual Address (SVA) and
> +IO Virtual Address (IOVA). However, the requirements for IOASID

IOVA alone isn't a use case, maybe "multiple IOVA spaces per device"?

> +management can vary among hardware architectures.
> +
> +This document covers the generic features supported by IOASID
> +APIs. Vendor-specific use cases are also illustrated with Intel's VT-d
> +based platforms as the first example.
> +
> +.. contents:: :local:
> +
> +Glossary
> +========
> +PASID - Process Address Space ID
> +
> +IOASID - IO Address Space ID (generic term for PCIe PASID and
> +sub-stream ID in SMMU)

"SubstreamID"

> +
> +SVA/SVM - Shared Virtual Addressing/Memory
> +
> +ENQCMD - New Intel X86 ISA for efficient workqueue submission [1]

Maybe drop the "New", to keep the documentation perennial. It might be
good to add internal links here to the specifications URLs at the bottom.

> +
> +DSA - Intel Data Streaming Accelerator [2]
> +
> +VDCM - Virtual device composition module [3]
> +
> +SIOV - Intel Scalable IO Virtualization
> +
> +
> +Key Concepts
> +============
> +
> +IOASID Set
> +-----------
> +An IOASID set is a group of IOASIDs allocated from the system-wide
> +IOASID pool. An IOASID set is created and can be identified by a
> +token of u64. Refer to IOASID set APIs for more details.

Identified either by an u64 or an mm_struct, right? Maybe just drop the
second sentence if it's detailed in the IOASID set section below.

> +
> +IOASID set is particularly useful for guest SVA where each guest could
> +have its own IOASID set for security and efficiency reasons.
> +
> +IOASID Set Private ID (SPID)
> +----------------------------
> +SPIDs are introduced as IOASIDs within its set. Each SPID maps to a
> +system-wide IOASID but the namespace of SPID is within its IOASID
> +set.

The intro isn't super clear. Perhaps this is simpler:
"Each IOASID set has a private namespace of SPIDs. An SPID maps to a
single system-wide IOASID."

> SPIDs can be used as guest IOASIDs where each guest could do
> +IOASID allocation from its own pool and map them to host physical
> +IOASIDs. SPIDs are particularly useful for supporting live migration
> +where decoupling guest and host physical resources are necessary.
> +
> +For example, two VMs can both allocate guest PASID/SPID #101 but map to
> +different host PASIDs #201 and #202 respectively as shown in the
> +diagram below.
> +::
> +
> + .------------------. .------------------.
> + | VM 1 | | VM 2 |
> + | | | |
> + |------------------| |------------------|
> + | GPASID/SPID 101 | | GPASID/SPID 101 |
> + '------------------' -------------------' Guest
> + __________|______________________|______________________
> + | | Host
> + v v
> + .------------------. .------------------.
> + | Host IOASID 201 | | Host IOASID 202 |
> + '------------------' '------------------'
> + | IOASID set 1 | | IOASID set 2 |
> + '------------------' '------------------'
> +
> +Guest PASID is treated as IOASID set private ID (SPID) within an
> +IOASID set, mappings between guest and host IOASIDs are stored in the
> +set for inquiry.
> +
> +IOASID APIs
> +===========
> +To get the IOASID APIs, users must #include <linux/ioasid.h>. These APIs
> +serve the following functionalities:
> +
> + - IOASID allocation/Free
> + - Group management in the form of ioasid_set
> + - Private data storage and lookup
> + - Reference counting
> + - Event notification in case of state change
> +
> +IOASID Set Level APIs
> +--------------------------
> +For use cases such as guest SVA it is necessary to manage IOASIDs at
> +a group level. For example, VMs may allocate multiple IOASIDs for
> +guest process address sharing (vSVA). It is imperative to enforce
> +VM-IOASID ownership such that malicious guest cannot target DMA

"a malicious guest"

> +traffic outside its own IOASIDs, or free an active IOASID belong to

"that belongs to"

> +another VM.
> +::
> +
> + struct ioasid_set *ioasid_alloc_set(void *token, ioasid_t quota, u32 type)
> +
> + int ioasid_adjust_set(struct ioasid_set *set, int quota);

These could be named "ioasid_set_alloc" and "ioasid_set_adjust" to be
consistent with the rest of the API.

> +
> + void ioasid_set_get(struct ioasid_set *set)
> +
> + void ioasid_set_put(struct ioasid_set *set)
> +
> + void ioasid_set_get_locked(struct ioasid_set *set)
> +
> + void ioasid_set_put_locked(struct ioasid_set *set)
> +
> + int ioasid_set_for_each_ioasid(struct ioasid_set *sdata,

Might be nicer to keep the same argument names within the API. Here "set"
rather than "sdata".

> + void (*fn)(ioasid_t id, void *data),
> + void *data)

(alignment)

> +
> +
> +IOASID set concept is introduced to represent such IOASID groups. Each

Or just "IOASID sets represent such IOASID groups", but might be
redundant.

> +IOASID set is created with a token which can be one of the following
> +types:
> +
> + - IOASID_SET_TYPE_NULL (Arbitrary u64 value)
> + - IOASID_SET_TYPE_MM (Set token is a mm_struct)
> +
> +The explicit MM token type is useful when multiple users of an IOASID
> +set under the same process need to communicate about their shared IOASIDs.
> +E.g. An IOASID set created by VFIO for one guest can be associated
> +with the KVM instance for the same guest since they share a common mm_struct.
> +
> +The IOASID set APIs serve the following purposes:
> +
> + - Ownership/permission enforcement
> + - Take collective actions, e.g. free an entire set
> + - Event notifications within a set
> + - Look up a set based on token
> + - Quota enforcement

This paragraph could be earlier in the section

> +
> +Individual IOASID APIs
> +----------------------
> +Once an ioasid_set is created, IOASIDs can be allocated from the set.
> +Within the IOASID set namespace, set private ID (SPID) is supported. In
> +the VM use case, SPID can be used for storing guest PASID.
> +
> +::
> +
> + ioasid_t ioasid_alloc(struct ioasid_set *set, ioasid_t min, ioasid_t max,
> + void *private);
> +
> + int ioasid_get(struct ioasid_set *set, ioasid_t ioasid);
> +
> + void ioasid_put(struct ioasid_set *set, ioasid_t ioasid);
> +
> + int ioasid_get_locked(struct ioasid_set *set, ioasid_t ioasid);
> +
> + void ioasid_put_locked(struct ioasid_set *set, ioasid_t ioasid);
> +
> + void *ioasid_find(struct ioasid_set *set, ioasid_t ioasid,
> + bool (*getter)(void *));
> +
> + ioasid_t ioasid_find_by_spid(struct ioasid_set *set, ioasid_t spid)
> +
> + int ioasid_attach_data(struct ioasid_set *set, ioasid_t ioasid,
> + void *data);
> + int ioasid_attach_spid(struct ioasid_set *set, ioasid_t ioasid,
> + ioasid_t ssid);

s/ssid/spid

> +
> +
> +Notifications
> +-------------
> +An IOASID may have multiple users, each user may have hardware context
> +associated with an IOASID. When the status of an IOASID changes,
> +e.g. an IOASID is being freed, users need to be notified such that the
> +associated hardware context can be cleared, flushed, and drained.
> +
> +::
> +
> + int ioasid_register_notifier(struct ioasid_set *set, struct
> + notifier_block *nb)
> +
> + void ioasid_unregister_notifier(struct ioasid_set *set,
> + struct notifier_block *nb)
> +
> + int ioasid_register_notifier_mm(struct mm_struct *mm, struct
> + notifier_block *nb)
> +
> + void ioasid_unregister_notifier_mm(struct mm_struct *mm, struct
> + notifier_block *nb)
> +
> + int ioasid_notify(ioasid_t ioasid, enum ioasid_notify_val cmd,
> + unsigned int flags)
> +
> +
> +Events
> +~~~~~~
> +Notification events are pertinent to individual IOASIDs, they can be
> +one of the following:
> +
> + - ALLOC
> + - FREE
> + - BIND
> + - UNBIND
> +
> +Ordering
> +~~~~~~~~
> +Ordering is supported by IOASID notification priorities as the
> +following (in ascending order):
> +
> +::
> +
> + enum ioasid_notifier_prios {
> + IOASID_PRIO_LAST,
> + IOASID_PRIO_IOMMU,
> + IOASID_PRIO_DEVICE,
> + IOASID_PRIO_CPU,
> + };
> +
> +The typical use case is when an IOASID is freed due to an exception, DMA
> +source should be quiesced before tearing down other hardware contexts
> +in the system. This will reduce the churn in handling faults. DMA work
> +submission is performed by the CPU which is granted higher priority than
> +devices.
> +
> +
> +Scopes
> +~~~~~~
> +There are two types of notifiers in IOASID core: system-wide and
> +ioasid_set-wide.
> +
> +System-wide notifier is catering for users that need to handle all
> +IOASIDs in the system. E.g. The IOMMU driver handles all IOASIDs.
> +
> +Per ioasid_set notifier can be used by VM specific components such as
> +KVM. After all, each KVM instance only cares about IOASIDs within its
> +own set.
> +
> +
> +Atomicity
> +~~~~~~~~~
> +IOASID notifiers are atomic due to spinlocks used inside the IOASID
> +core. For tasks cannot be completed in the notifier handler, async work

"tasks that cannot be"

> +can be submitted to complete the work later as long as there is no
> +ordering requirement.
> +
> +Reference counting
> +------------------
> +IOASID lifecycle management is based on reference counting. Users of
> +IOASID intend to align lifecycle with the IOASID need to hold

"who intend to"

> +reference of the IOASID. IOASID will not be returned to the pool for

"a reference to the IOASID. The IOASID"

> +allocation until all references are dropped. Calling ioasid_free()
> +will mark the IOASID as FREE_PENDING if the IOASID has outstanding
> +reference. ioasid_get() is not allowed once an IOASID is in the
> +FREE_PENDING state.
> +
> +Event notifications are used to inform users of IOASID status change.
> +IOASID_FREE event prompts users to drop their references after
> +clearing its context.
> +
> +For example, on VT-d platform when an IOASID is freed, teardown
> +actions are performed on KVM, device driver, and IOMMU driver.
> +KVM shall register notifier block with::
> +
> + static struct notifier_block pasid_nb_kvm = {
> + .notifier_call = pasid_status_change_kvm,
> + .priority = IOASID_PRIO_CPU,
> + };
> +
> +VDCM driver shall register notifier block with::
> +
> + static struct notifier_block pasid_nb_vdcm = {
> + .notifier_call = pasid_status_change_vdcm,
> + .priority = IOASID_PRIO_DEVICE,
> + };
> +
> +In both cases, notifier blocks shall be registered on the IOASID set
> +such that *only* events from the matching VM is received.
> +
> +If KVM attempts to register notifier block before the IOASID set is
> +created for the MM token, the notifier block will be placed on a
> +pending list inside IOASID core. Once the token matching IOASID set
> +is created, IOASID will register the notifier block automatically.
> +IOASID core does not replay events for the existing IOASIDs in the
> +set. For IOASID set of MM type, notification blocks can be registered
> +on empty sets only. This is to avoid lost events.
> +
> +IOMMU driver shall register notifier block on global chain::
> +
> + static struct notifier_block pasid_nb_vtd = {
> + .notifier_call = pasid_status_change_vtd,
> + .priority = IOASID_PRIO_IOMMU,
> + };
> +
> +Custom allocator APIs
> +---------------------
> +
> +::
> +
> + int ioasid_register_allocator(struct ioasid_allocator_ops *allocator);
> +
> + void ioasid_unregister_allocator(struct ioasid_allocator_ops *allocator);
> +
> +Allocator Choices
> +~~~~~~~~~~~~~~~~~
> +IOASIDs are allocated for both host and guest SVA/IOVA usage. However,
> +allocators can be different. For example, on VT-d guest PASID
> +allocation must be performed via a virtual command interface which is
> +emulated by VMM.
> +
> +IOASID core has the notion of "custom allocator" such that guest can
> +register virtual command allocator that precedes the default one.
> +
> +Namespaces
> +~~~~~~~~~~
> +IOASIDs are limited system resources that default to 20 bits in
> +size. Since each device has its own table, theoretically the namespace
> +can be per device also. However, for security reasons sharing PASID
> +tables among devices are not good for isolation. Therefore, IOASID
> +namespace is system-wide.

I don't follow this development. Having per-device PASID table would work
fine for isolation (assuming no hardware bug necessitating IOMMU groups).
If I remember correctly IOASID space was chosen to be OS-wide because it
simplifies the management code (single PASID per task), and it is
system-wide across VMs only in the case of VT-d scalable mode.

> +
> +There are also other reasons to have this simpler system-wide
> +namespace. Take VT-d as an example, VT-d supports shared workqueue
> +and ENQCMD[1] where one IOASID could be used to submit work on

Maybe use the Sphinx glossary syntax rather than "[1]"
https://www.sphinx-doc.org/en/master/usage/restructuredtext/directives.html#glossary-directive

> +multiple devices that are shared with other VMs. This requires IOASID
> +to be system-wide. This is also the reason why guests must use an
> +emulated virtual command interface to allocate IOASID from the host.
> +
> +
> +Life cycle
> +==========
> +This section covers IOASID lifecycle management for both bare-metal
> +and guest usages. In bare-metal SVA, MMU notifier is directly hooked
> +up with IOMMU driver, therefore the process address space (MM)
> +lifecycle is aligned with IOASID.
> +
> +However, guest MMU notifier is not available to host IOMMU driver,
> +when guest MM terminates unexpectedly, the events have to go through
> +VFIO and IOMMU UAPI to reach host IOMMU driver. There are also more
> +parties involved in guest SVA, e.g. on Intel VT-d platform, IOASIDs
> +are used by IOMMU driver, KVM, VDCM, and VFIO.
> +
> +Native IOASID Life Cycle (VT-d Example)
> +---------------------------------------
> +
> +The normal flow of native SVA code with Intel Data Streaming
> +Accelerator(DSA) [2] as example:
> +
> +1. Host user opens accelerator FD, e.g. DSA driver, or uacce;
> +2. DSA driver allocate WQ, do sva_bind_device();
> +3. IOMMU driver calls ioasid_alloc(), then bind PASID with device,
> + mmu_notifier_get()
> +4. DMA starts by DSA driver userspace
> +5. DSA userspace close FD
> +6. DSA/uacce kernel driver handles FD.close()
> +7. DSA driver stops DMA
> +8. DSA driver calls sva_unbind_device();
> +9. IOMMU driver does unbind, clears PASID context in IOMMU, flush
> + TLBs. mmu_notifier_put() called.
> +10. mmu_notifier.release() called, IOMMU SVA code calls ioasid_free()*
> +11. The IOASID is returned to the pool, reclaimed.
> +
> +::
> +

Use a footnote? https://www.sphinx-doc.org/en/master/usage/restructuredtext/basics.html#footnotes

> + * With ENQCMD, PASID used on VT-d is not released in mmu_notifier() but
> + mmdrop(). mmdrop comes after FD close. Should not matter.

"comes after FD close, which doesn't make a difference?"
The following might not be necessary since early process termination is
described later.

> + If the user process dies unexpectedly, Step #10 may come before
> + Step #5, in between, all DMA faults discarded. PRQ responded with

PRQ hasn't been defined in this document.

> + code INVALID REQUEST.
> +
> +During the normal teardown, the following three steps would happen in
> +order:
> +
> +1. Device driver stops DMA request
> +2. IOMMU driver unbinds PASID and mm, flush all TLBs, drain in-flight
> + requests.
> +3. IOASID freed
> +
> +Exception happens when process terminates *before* device driver stops
> +DMA and call IOMMU driver to unbind. The flow of process exists are as

"exits"

> +follows:
> +
> +::
> +
> + do_exit() {
> + exit_mm() {
> + mm_put();
> + exit_mmap() {
> + intel_invalidate_range() //mmu notifier
> + tlb_finish_mmu()
> + mmu_notifier_release(mm) {
> + intel_iommu_release() {
> + [2] intel_iommu_teardown_pasid();

Parentheses might be better than square brackets for step numbers

> + intel_iommu_flush_tlbs();
> + }
> + // tlb_invalidate_range cb removed
> + }
> + unmap_vmas();
> + free_pgtables(); // IOMMU cannot walk PGT after this
> + };
> + }
> + exit_files(tsk) {
> + close_files() {
> + dsa_close();
> + [1] dsa_stop_dma();
> + intel_svm_unbind_pasid(); //nothing to do
> + }
> + }
> + }
> +
> + mmdrop() /* some random time later, lazy mm user */ {
> + mm_free_pgd();
> + destroy_context(mm); {
> + [3] ioasid_free();
> + }
> + }
> +
> +As shown in the list above, step #2 could happen before
> +#1. Unrecoverable(UR) faults could happen between #2 and #1.
> +
> +Also notice that TLB invalidation occurs at mmu_notifier
> +invalidate_range callback as well as the release callback. The reason
> +is that release callback will delete IOMMU driver from the notifier
> +chain which may skip invalidate_range() calls during the exit path.
> +
> +To avoid unnecessary reporting of UR fault, IOMMU driver shall disable
> +fault reporting after free and before unbind.
> +
> +Guest IOASID Life Cycle (VT-d Example)
> +--------------------------------------
> +Guest IOASID life cycle starts with guest driver open(), this could be
> +uacce or individual accelerator driver such as DSA. At FD open,
> +sva_bind_device() is called which triggers a series of actions.
> +
> +The example below is an illustration of *normal* operations that
> +involves *all* the SW components in VT-d. The flow can be simpler if
> +no ENQCMD is supported.
> +
> +::
> +
> + VFIO IOMMU KVM VDCM IOASID Ref
> + ..................................................................
> + 1 ioasid_register_notifier/_mm()
> + 2 ioasid_alloc() 1
> + 3 bind_gpasid()
> + 4 iommu_bind()->ioasid_get() 2
> + 5 ioasid_notify(BIND)
> + 6 -> ioasid_get() 3
> + 7 -> vmcs_update_atomic()
> + 8 mdev_write(gpasid)
> + 9 hpasid=
> + 10 find_by_spid(gpasid) 4
> + 11 vdev_write(hpasid)
> + 12 -------- GUEST STARTS DMA --------------------------
> + 13 -------- GUEST STOPS DMA --------------------------
> + 14 mdev_clear(gpasid)
> + 15 vdev_clear(hpasid)
> + 16 ioasid_put() 3
> + 17 unbind_gpasid()
> + 18 iommu_ubind()
> + 19 ioasid_notify(UNBIND)
> + 20 -> vmcs_update_atomic()
> + 21 -> ioasid_put() 2
> + 22 ioasid_free() 1
> + 23 ioasid_put() 0
> + 24 Reclaimed
> + -------------- New Life Cycle Begin ----------------------------
> + 1 ioasid_alloc() -> 1
> +
> + Note: IOASID Notification Events: FREE, BIND, UNBIND
> +
> +Exception cases arise when a guest crashes or a malicious guest
> +attempts to cause disruption on the host system. The fault handling
> +rules are:
> +
> +1. IOASID free must *always* succeed.
> +2. An inactive period may be required before the freed IOASID is
> + reclaimed. During this period, consumers of IOASID perform cleanup.
> +3. Malfunction is limited to the guest owned resources for all
> + programming errors.
> +
> +The primary source of exception is when the following are out of
> +order:
> +
> +1. Start/Stop of DMA activity
> + (Guest device driver, mdev via VFIO)
> +2. Setup/Teardown of IOMMU PASID context, IOTLB, DevTLB flushes
> + (Host IOMMU driver bind/unbind)
> +3. Setup/Teardown of VMCS PASID translation table entries (KVM) in
> + case of ENQCMD
> +4. Programming/Clearing host PASID in VDCM (Host VDCM driver)
> +5. IOASID alloc/free (Host IOASID)
> +
> +VFIO is the *only* user-kernel interface, which is ultimately
> +responsible for exception handlings.

"handling"

> +
> +#1 is processed the same way as the assigned device today based on
> +device file descriptors and events. There is no special handling.
> +
> +#3 is based on bind/unbind events emitted by #2.
> +
> +#4 is naturally aligned with IOASID life cycle in that an illegal
> +guest PASID programming would fail in obtaining reference of the
> +matching host IOASID.
> +
> +#5 is similar to #4. The fault will be reported to the user if PASID
> +used in the ENQCMD is not set up in VMCS PASID translation table.
> +
> +Therefore, the remaining out of order problem is between #2 and
> +#5. I.e. unbind vs. free. More specifically, free before unbind.
> +
> +IOASID notifier and refcounting are used to ensure order. Following
> +a publisher-subscriber pattern where:
> +
> +- Publishers: VFIO & IOMMU
> +- Subscribers: KVM, VDCM, IOMMU
> +
> +IOASID notifier is atomic which requires subscribers to do quick
> +handling of the event in the atomic context. Workqueue can be used for
> +any processing that requires thread context. IOASID reference must be
> +acquired before receiving the FREE event. The reference must be
> +dropped at the end of the processing in order to return the IOASID to
> +the pool.
> +
> +Let's examine the IOASID life cycle again when free happens *before*
> +unbind. This could be a result of misbehaving guests or crash. Assuming
> +VFIO cannot enforce unbind->free order. Notice that the setup part up
> +until step #12 is identical to the normal case, the flow below starts
> +with step 13.
> +
> +::
> +
> + VFIO IOMMU KVM VDCM IOASID Ref
> + ..................................................................
> + 13 -------- GUEST STARTS DMA --------------------------
> + 14 -------- *GUEST MISBEHAVES!!!* ----------------
> + 15 ioasid_free()
> + 16 ioasid_notify(FREE)
> + 17 mark_ioasid_inactive[1]
> + 18 kvm_nb_handler(FREE)
> + 19 vmcs_update_atomic()
> + 20 ioasid_put_locked() -> 3
> + 21 vdcm_nb_handler(FREE)
> + 22 iomm_nb_handler(FREE)
> + 23 ioasid_free() returns[2] schedule_work() 2
> + 24 schedule_work() vdev_clear_wk(hpasid)
> + 25 teardown_pasid_wk()
> + 26 ioasid_put() -> 1
> + 27 ioasid_put() 0
> + 28 Reclaimed
> + 29 unbind_gpasid()
> + 30 iommu_unbind()->ioasid_find() Fails[3]
> + -------------- New Life Cycle Begin ----------------------------
> +
> +Note:
> +
> +1. By marking IOASID inactive at step #17, no new references can be

Is "inactive" FREE_PENDING?

> + held. ioasid_get/find() will return -ENOENT;
> +2. After step #23, all events can go out of order. Shall not affect
> + the outcome.
> +3. IOMMU driver fails to find private data for unbinding. If unbind is
> + called after the same IOASID is allocated for the same guest again,
> + this is a programming error. The damage is limited to the guest
> + itself since unbind performs permission checking based on the
> + IOASID set associated with the guest process.
> +
> +KVM PASID Translation Table Updates
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +Per VM PASID translation table is maintained by KVM in order to
> +support ENQCMD in the guest. The table contains host-guest PASID
> +translations to be consumed by CPU ucode. The synchronization of the
> +PASID states depends on VFIO/IOMMU driver, where IOCTL and atomic
> +notifiers are used. KVM must register IOASID notifier per VM instance
> +during launch time. The following events are handled:
> +
> +1. BIND/UNBIND
> +2. FREE
> +
> +Rules:
> +
> +1. Multiple devices can bind with the same PASID, this can be different PCI
> + devices or mdevs within the same PCI device. However, only the
> + *first* BIND and *last* UNBIND emit notifications.
> +2. IOASID code is responsible for ensuring the correctness of H-G
> + PASID mapping. There is no need for KVM to validate the
> + notification data.
> +3. When UNBIND happens *after* FREE, KVM will see error in
> + ioasid_get() even when the reclaim is not done. IOMMU driver will
> + also avoid sending UNBIND if the PASID is already FREE.
> +4. When KVM terminates *before* FREE & UNBIND, references will be
> + dropped for all host PASIDs.
> +
> +VDCM PASID Programming
> +~~~~~~~~~~~~~~~~~~~~~~
> +VDCM composes virtual devices and exposes them to the guests. When
> +the guest allocates a PASID then program it to the virtual device, VDCM
> +intercepts the programming attempt then program the matching host

"programs"

Thanks,
Jean

> +PASID on to the hardware.
> +Conversely, when a device is going away, VDCM must be informed such
> +that PASID context on the hardware can be cleared. There could be
> +multiple mdevs assigned to different guests in the same VDCM. Since
> +the PASID table is shared at PCI device level, lazy clearing is not
> +secure. A malicious guest can attack by using newly freed PASIDs that
> +are allocated by another guest.
> +
> +By holding a reference of the PASID until VDCM cleans up the HW context,
> +it is guaranteed that PASID life cycles do not cross within the same
> +device.
> +
> +
> +Reference
> +====================================================
> +1. https://software.intel.com/sites/default/files/managed/c5/15/architecture-instruction-set-extensions-programming-reference.pdf
> +
> +2. https://01.org/blogs/2019/introducing-intel-data-streaming-accelerator
> +
> +3. https://software.intel.com/en-us/download/intel-data-streaming-accelerator-preliminary-architecture-specification
> --
> 2.7.4
>

2020-08-24 18:30:53

by Jean-Philippe Brucker

[permalink] [raw]
Subject: Re: [PATCH v2 3/9] iommu/ioasid: Introduce ioasid_set APIs

On Fri, Aug 21, 2020 at 09:35:12PM -0700, Jacob Pan wrote:
> ioasid_set was introduced as an arbitrary token that are shared by a
> group of IOASIDs. For example, if IOASID #1 and #2 are allocated via the
> same ioasid_set*, they are viewed as to belong to the same set.
>
> For guest SVA usages, system-wide IOASID resources need to be
> partitioned such that VMs can have its own quota and being managed
> separately. ioasid_set is the perfect candidate for meeting such
> requirements. This patch redefines and extends ioasid_set with the
> following new fields:
> - Quota
> - Reference count
> - Storage of its namespace
> - The token is stored in the new ioasid_set but with optional types
>
> ioasid_set level APIs are introduced that wires up these new data.
> Existing users of IOASID APIs are converted where a host IOASID set is
> allocated for bare-metal usage.
>
> Signed-off-by: Liu Yi L <[email protected]>
> Signed-off-by: Jacob Pan <[email protected]>
> ---
> drivers/iommu/intel/iommu.c | 27 ++-
> drivers/iommu/intel/pasid.h | 1 +
> drivers/iommu/intel/svm.c | 8 +-
> drivers/iommu/ioasid.c | 390 +++++++++++++++++++++++++++++++++++++++++---
> include/linux/ioasid.h | 82 ++++++++--
> 5 files changed, 465 insertions(+), 43 deletions(-)
>
> diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
> index a3a0b5c8921d..5813eeaa5edb 100644
> --- a/drivers/iommu/intel/iommu.c
> +++ b/drivers/iommu/intel/iommu.c
> @@ -42,6 +42,7 @@
> #include <linux/crash_dump.h>
> #include <linux/numa.h>
> #include <linux/swiotlb.h>
> +#include <linux/ioasid.h>
> #include <asm/irq_remapping.h>
> #include <asm/cacheflush.h>
> #include <asm/iommu.h>
> @@ -103,6 +104,9 @@
> */
> #define INTEL_IOMMU_PGSIZES (~0xFFFUL)
>
> +/* PASIDs used by host SVM */
> +struct ioasid_set *host_pasid_set;
> +
> static inline int agaw_to_level(int agaw)
> {
> return agaw + 2;
> @@ -3103,8 +3107,8 @@ static void intel_vcmd_ioasid_free(ioasid_t ioasid, void *data)
> * Sanity check the ioasid owner is done at upper layer, e.g. VFIO
> * We can only free the PASID when all the devices are unbound.
> */
> - if (ioasid_find(NULL, ioasid, NULL)) {
> - pr_alert("Cannot free active IOASID %d\n", ioasid);
> + if (IS_ERR(ioasid_find(host_pasid_set, ioasid, NULL))) {
> + pr_err("Cannot free IOASID %d, not in system set\n", ioasid);
> return;
> }
> vcmd_free_pasid(iommu, ioasid);
> @@ -3288,6 +3292,19 @@ static int __init init_dmars(void)
> if (ret)
> goto free_iommu;
>
> + /* PASID is needed for scalable mode irrespective to SVM */
> + if (intel_iommu_sm) {
> + ioasid_install_capacity(intel_pasid_max_id);
> + /* We should not run out of IOASIDs at boot */
> + host_pasid_set = ioasid_alloc_set(NULL, PID_MAX_DEFAULT,
> + IOASID_SET_TYPE_NULL);
> + if (IS_ERR_OR_NULL(host_pasid_set)) {
> + pr_err("Failed to enable host PASID allocator %lu\n",
> + PTR_ERR(host_pasid_set));
> + intel_iommu_sm = 0;
> + }
> + }
> +
> /*
> * for each drhd
> * enable fault log
> @@ -5149,7 +5166,7 @@ static void auxiliary_unlink_device(struct dmar_domain *domain,
> domain->auxd_refcnt--;
>
> if (!domain->auxd_refcnt && domain->default_pasid > 0)
> - ioasid_free(domain->default_pasid);
> + ioasid_free(host_pasid_set, domain->default_pasid);
> }
>
> static int aux_domain_add_dev(struct dmar_domain *domain,
> @@ -5167,7 +5184,7 @@ static int aux_domain_add_dev(struct dmar_domain *domain,
> int pasid;
>
> /* No private data needed for the default pasid */
> - pasid = ioasid_alloc(NULL, PASID_MIN,
> + pasid = ioasid_alloc(host_pasid_set, PASID_MIN,
> pci_max_pasids(to_pci_dev(dev)) - 1,
> NULL);
> if (pasid == INVALID_IOASID) {
> @@ -5210,7 +5227,7 @@ static int aux_domain_add_dev(struct dmar_domain *domain,
> spin_unlock(&iommu->lock);
> spin_unlock_irqrestore(&device_domain_lock, flags);
> if (!domain->auxd_refcnt && domain->default_pasid > 0)
> - ioasid_free(domain->default_pasid);
> + ioasid_free(host_pasid_set, domain->default_pasid);
>
> return ret;
> }
> diff --git a/drivers/iommu/intel/pasid.h b/drivers/iommu/intel/pasid.h
> index c9850766c3a9..ccdc23446015 100644
> --- a/drivers/iommu/intel/pasid.h
> +++ b/drivers/iommu/intel/pasid.h
> @@ -99,6 +99,7 @@ static inline bool pasid_pte_is_present(struct pasid_entry *pte)
> }
>
> extern u32 intel_pasid_max_id;
> +extern struct ioasid_set *host_pasid_set;
> int intel_pasid_alloc_id(void *ptr, int start, int end, gfp_t gfp);
> void intel_pasid_free_id(int pasid);
> void *intel_pasid_lookup_id(int pasid);
> diff --git a/drivers/iommu/intel/svm.c b/drivers/iommu/intel/svm.c
> index 37a9beabc0ca..634e191ca2c3 100644
> --- a/drivers/iommu/intel/svm.c
> +++ b/drivers/iommu/intel/svm.c
> @@ -551,7 +551,7 @@ intel_svm_bind_mm(struct device *dev, int flags, struct svm_dev_ops *ops,
> pasid_max = intel_pasid_max_id;
>
> /* Do not use PASID 0, reserved for RID to PASID */
> - svm->pasid = ioasid_alloc(NULL, PASID_MIN,
> + svm->pasid = ioasid_alloc(host_pasid_set, PASID_MIN,
> pasid_max - 1, svm);
> if (svm->pasid == INVALID_IOASID) {
> kfree(svm);
> @@ -568,7 +568,7 @@ intel_svm_bind_mm(struct device *dev, int flags, struct svm_dev_ops *ops,
> if (mm) {
> ret = mmu_notifier_register(&svm->notifier, mm);
> if (ret) {
> - ioasid_free(svm->pasid);
> + ioasid_free(host_pasid_set, svm->pasid);
> kfree(svm);
> kfree(sdev);
> goto out;
> @@ -586,7 +586,7 @@ intel_svm_bind_mm(struct device *dev, int flags, struct svm_dev_ops *ops,
> if (ret) {
> if (mm)
> mmu_notifier_unregister(&svm->notifier, mm);
> - ioasid_free(svm->pasid);
> + ioasid_free(host_pasid_set, svm->pasid);
> kfree(svm);
> kfree(sdev);
> goto out;
> @@ -655,7 +655,7 @@ static int intel_svm_unbind_mm(struct device *dev, int pasid)
> kfree_rcu(sdev, rcu);
>
> if (list_empty(&svm->devs)) {
> - ioasid_free(svm->pasid);
> + ioasid_free(host_pasid_set, svm->pasid);
> if (svm->mm)
> mmu_notifier_unregister(&svm->notifier, svm->mm);
> list_del(&svm->list);
> diff --git a/drivers/iommu/ioasid.c b/drivers/iommu/ioasid.c
> index 5f63af07acd5..f73b3dbfc37a 100644
> --- a/drivers/iommu/ioasid.c
> +++ b/drivers/iommu/ioasid.c
> @@ -1,22 +1,58 @@
> // SPDX-License-Identifier: GPL-2.0
> /*
> * I/O Address Space ID allocator. There is one global IOASID space, split into
> - * subsets. Users create a subset with DECLARE_IOASID_SET, then allocate and
> - * free IOASIDs with ioasid_alloc and ioasid_free.
> + * subsets. Users create a subset with ioasid_alloc_set, then allocate/free IDs
> + * with ioasid_alloc and ioasid_free.
> */
> -#include <linux/ioasid.h>
> #include <linux/module.h>
> #include <linux/slab.h>
> #include <linux/spinlock.h>
> #include <linux/xarray.h>
> +#include <linux/ioasid.h>

Spurious change (best keep the includes in alphabetical order)

> +
> +static DEFINE_XARRAY_ALLOC(ioasid_sets);

I'd prefer keeping all static variables together

> +enum ioasid_state {
> + IOASID_STATE_INACTIVE,
> + IOASID_STATE_ACTIVE,
> + IOASID_STATE_FREE_PENDING,
> +};
>
> +/**
> + * struct ioasid_data - Meta data about ioasid
> + *
> + * @id: Unique ID
> + * @users Number of active users
> + * @state Track state of the IOASID
> + * @set Meta data of the set this IOASID belongs to
> + * @private Private data associated with the IOASID
> + * @rcu For free after RCU grace period

nit: it would be nicer to follow the struct order

> + */
> struct ioasid_data {
> ioasid_t id;
> struct ioasid_set *set;
> + refcount_t users;
> + enum ioasid_state state;
> void *private;
> struct rcu_head rcu;
> };
>
> +/* Default to PCIe standard 20 bit PASID */
> +#define PCI_PASID_MAX 0x100000
> +static ioasid_t ioasid_capacity = PCI_PASID_MAX;
> +static ioasid_t ioasid_capacity_avail = PCI_PASID_MAX;
> +
> +void ioasid_install_capacity(ioasid_t total)
> +{
> + ioasid_capacity = ioasid_capacity_avail = total;
> +}
> +EXPORT_SYMBOL_GPL(ioasid_install_capacity);
> +
> +ioasid_t ioasid_get_capacity()
> +{
> + return ioasid_capacity;
> +}
> +EXPORT_SYMBOL_GPL(ioasid_get_capacity);
> +
> /*
> * struct ioasid_allocator_data - Internal data structure to hold information
> * about an allocator. There are two types of allocators:
> @@ -306,11 +342,23 @@ ioasid_t ioasid_alloc(struct ioasid_set *set, ioasid_t min, ioasid_t max,
> {
> struct ioasid_data *data;
> void *adata;
> - ioasid_t id;
> + ioasid_t id = INVALID_IOASID;
> +
> + spin_lock(&ioasid_allocator_lock);
> + /* Check if the IOASID set has been allocated and initialized */
> + if (WARN_ON(xa_load(&ioasid_sets, set->sid) != set)) {
> + pr_warn("Invalid set\n");

WARN_ON() is sufficient

> + goto done_unlock;
> + }
> +
> + if (set->quota <= set->nr_ioasids) {
> + pr_err("IOASID set %d out of quota %d\n", set->sid, set->quota);

As this can be called directly by userspace via VFIO, I wonder if we
should remove non-bug error messages like this one to avoid leaking
internal IDs, or at least rate-limit them. We already have a few, perhaps
we should deal with them before the VFIO_IOMMU_ALLOC_PASID patches land?

> + goto done_unlock;
> + }
>
> data = kzalloc(sizeof(*data), GFP_ATOMIC);
> if (!data)
> - return INVALID_IOASID;
> + goto done_unlock;
>
> data->set = set;
> data->private = private;
> @@ -319,7 +367,6 @@ ioasid_t ioasid_alloc(struct ioasid_set *set, ioasid_t min, ioasid_t max,
> * Custom allocator needs allocator data to perform platform specific
> * operations.
> */
> - spin_lock(&ioasid_allocator_lock);
> adata = active_allocator->flags & IOASID_ALLOCATOR_CUSTOM ? active_allocator->ops->pdata : data;
> id = active_allocator->ops->alloc(min, max, adata);
> if (id == INVALID_IOASID) {
> @@ -335,42 +382,339 @@ ioasid_t ioasid_alloc(struct ioasid_set *set, ioasid_t min, ioasid_t max,
> goto exit_free;
> }
> data->id = id;
> + data->state = IOASID_STATE_ACTIVE;
> + refcount_set(&data->users, 1);
> +
> + /* Store IOASID in the per set data */
> + if (xa_err(xa_store(&set->xa, id, data, GFP_ATOMIC))) {
> + pr_err("Failed to ioasid %d in set %d\n", id, set->sid);

"Failed to store"
Don't we need to call active_allocator->ops->free()?

And I need to think about this more, but do you see any issue with
revoking here the data that we published into the xarray above through
alloc()? We might need to free data in an RCU callback.

> + goto exit_free;
> + }
> + set->nr_ioasids++;
> + goto done_unlock;
>
> - spin_unlock(&ioasid_allocator_lock);
> - return id;
> exit_free:
> - spin_unlock(&ioasid_allocator_lock);
> kfree(data);
> - return INVALID_IOASID;
> +done_unlock:
> + spin_unlock(&ioasid_allocator_lock);
> + return id;
> }
> EXPORT_SYMBOL_GPL(ioasid_alloc);
>
> +static void ioasid_do_free(struct ioasid_data *data)
> +{
> + struct ioasid_data *ioasid_data;
> + struct ioasid_set *sdata;
> +
> + active_allocator->ops->free(data->id, active_allocator->ops->pdata);
> + /* Custom allocator needs additional steps to free the xa element */
> + if (active_allocator->flags & IOASID_ALLOCATOR_CUSTOM) {
> + ioasid_data = xa_erase(&active_allocator->xa, data->id);
> + kfree_rcu(ioasid_data, rcu);
> + }
> +
> + sdata = xa_load(&ioasid_sets, data->set->sid);
> + if (!sdata) {
> + pr_err("No set %d for IOASID %d\n", data->set->sid,
> + data->id);
> + return;

I don't think we're allowed to fail at this point. If we need more
sanity-check on the parameters, it should be before we start removing from
the active_allocator above. Otherwise this should be a WARN

> + }
> + xa_erase(&sdata->xa, data->id);
> + sdata->nr_ioasids--;

Would be nicer to perform the cleanup in the order opposite from
ioasid_alloc()

> +}
> +
> +static void ioasid_free_locked(struct ioasid_set *set, ioasid_t ioasid)
> +{
> + struct ioasid_data *data;
> +
> + data = xa_load(&active_allocator->xa, ioasid);
> + if (!data) {
> + pr_err("Trying to free unknown IOASID %u\n", ioasid);
> + return;
> + }
> +
> + if (data->set != set) {
> + pr_warn("Cannot free IOASID %u due to set ownership\n", ioasid);
> + return;
> + }
> + data->state = IOASID_STATE_FREE_PENDING;
> +
> + if (!refcount_dec_and_test(&data->users))
> + return;
> +
> + ioasid_do_free(data);
> +}
> +
> /**
> - * ioasid_free - Free an IOASID
> - * @ioasid: the ID to remove
> + * ioasid_free - Drop reference on an IOASID. Free if refcount drops to 0,
> + * including free from its set and system-wide list.
> + * @set: The ioasid_set to check permission with. If not NULL, IOASID
> + * free will fail if the set does not match.
> + * @ioasid: The IOASID to remove
> */
> -void ioasid_free(ioasid_t ioasid)
> +void ioasid_free(struct ioasid_set *set, ioasid_t ioasid)
> {
> - struct ioasid_data *ioasid_data;
> + spin_lock(&ioasid_allocator_lock);
> + ioasid_free_locked(set, ioasid);
> + spin_unlock(&ioasid_allocator_lock);
> +}
> +EXPORT_SYMBOL_GPL(ioasid_free);
>
> +/**
> + * ioasid_alloc_set - Allocate a new IOASID set for a given token
> + *
> + * @token: Unique token of the IOASID set, cannot be NULL
> + * @quota: Quota allowed in this set. Only for new set creation
> + * @flags: Special requirements

There is no @flags, but @type is missing

> + *
> + * IOASID can be limited system-wide resource that requires quota management.
> + * If caller does not wish to enforce quota, use IOASID_SET_NO_QUOTA flag.

The flag isn't introduced in this patch. How about passing @quota == 0 in
this case? For now I'm fine with leaving this as TODO and returning
-EINVAL.

> + *
> + * Token will be stored in the ioasid_set returned. A reference will be taken
> + * upon finding a matching set or newly created set.
> + * IOASID allocation within the set and other per set operations will use
> + * the retured ioasid_set *.
> + *
> + */
> +struct ioasid_set *ioasid_alloc_set(void *token, ioasid_t quota, int type)
> +{
> + struct ioasid_set *sdata;
> + unsigned long index;
> + ioasid_t id;
> +
> + if (type >= IOASID_SET_TYPE_NR)
> + return ERR_PTR(-EINVAL);
> +
> + /*
> + * Need to check space available if we share system-wide quota.
> + * TODO: we may need to support quota free sets in the future.
> + */
> spin_lock(&ioasid_allocator_lock);
> - ioasid_data = xa_load(&active_allocator->xa, ioasid);
> - if (!ioasid_data) {
> - pr_err("Trying to free unknown IOASID %u\n", ioasid);
> + if (quota > ioasid_capacity_avail) {
> + pr_warn("Out of IOASID capacity! ask %d, avail %d\n",
> + quota, ioasid_capacity_avail);
> + sdata = ERR_PTR(-ENOSPC);
> goto exit_unlock;
> }
>
> - active_allocator->ops->free(ioasid, active_allocator->ops->pdata);
> - /* Custom allocator needs additional steps to free the xa element */
> - if (active_allocator->flags & IOASID_ALLOCATOR_CUSTOM) {
> - ioasid_data = xa_erase(&active_allocator->xa, ioasid);
> - kfree_rcu(ioasid_data, rcu);
> + /*
> + * Token is only unique within its types but right now we have only
> + * mm type. If we have more token types, we have to match type as well.
> + */
> + switch (type) {
> + case IOASID_SET_TYPE_MM:
> + /* Search existing set tokens, reject duplicates */
> + xa_for_each(&ioasid_sets, index, sdata) {
> + if (sdata->token == token &&
> + sdata->type == IOASID_SET_TYPE_MM) {

Should be aligned at the "if ("

According to the function doc, shouldn't we take a reference to the set in
this case, and return it?
"A reference will be taken upon finding a matching set or newly created
set."

However it might be better to separate the two functionalities into
ioasid_alloc_set() and ioasid_find_set(). Because two modules can want to
work on the same set for an mm, but they won't pass the same @quota,
right? So it'd make more sense for one of them (VFIO) to alloc the set
and the other to reuse it.

> + sdata = ERR_PTR(-EEXIST);
> + goto exit_unlock;
> + }
> + }
> + break;
> + case IOASID_SET_TYPE_NULL:
> + if (!token)
> + break;
> + fallthrough;
> + default:
> + pr_err("Invalid token and IOASID type\n");
> + sdata = ERR_PTR(-EINVAL);
> + goto exit_unlock;
> }
>
> + /* REVISIT: may support set w/o quota, use system available */
> + if (!quota) {

Maybe move this next to the other quota check above

> + sdata = ERR_PTR(-EINVAL);
> + goto exit_unlock;
> + }
> +
> + sdata = kzalloc(sizeof(*sdata), GFP_ATOMIC);
> + if (!sdata) {
> + sdata = ERR_PTR(-ENOMEM);
> + goto exit_unlock;
> + }
> +
> + if (xa_alloc(&ioasid_sets, &id, sdata,
> + XA_LIMIT(0, ioasid_capacity_avail - quota),

Why this limit? sid could just be an arbitrary u32 (xa_limit_32b)

> + GFP_ATOMIC)) {
> + kfree(sdata);
> + sdata = ERR_PTR(-ENOSPC);
> + goto exit_unlock;
> + }
> +
> + sdata->token = token;
> + sdata->type = type;
> + sdata->quota = quota;
> + sdata->sid = id;
> + refcount_set(&sdata->ref, 1);
> +
> + /*
> + * Per set XA is used to store private IDs within the set, get ready
> + * for ioasid_set private ID and system-wide IOASID allocation
> + * results.
> + */
> + xa_init_flags(&sdata->xa, XA_FLAGS_ALLOC);

Since it's only used for storage, you could use xa_init()

> + ioasid_capacity_avail -= quota;
> +
> exit_unlock:
> spin_unlock(&ioasid_allocator_lock);
> +
> + return sdata;
> }
> -EXPORT_SYMBOL_GPL(ioasid_free);
> +EXPORT_SYMBOL_GPL(ioasid_alloc_set);
> +
> +void ioasid_set_get_locked(struct ioasid_set *set)
> +{
> + if (WARN_ON(xa_load(&ioasid_sets, set->sid) != set)) {
> + pr_warn("Invalid set data\n");

WARN_ON() is sufficient

> + return;
> + }
> +
> + refcount_inc(&set->ref);
> +}
> +EXPORT_SYMBOL_GPL(ioasid_set_get_locked);

Why is this function public, is it for an iterator? Might want to add a
lockdep_assert_held() annotation.

> +
> +void ioasid_set_get(struct ioasid_set *set)
> +{
> + spin_lock(&ioasid_allocator_lock);
> + ioasid_set_get_locked(set);
> + spin_unlock(&ioasid_allocator_lock);
> +}
> +EXPORT_SYMBOL_GPL(ioasid_set_get);
> +
> +void ioasid_set_put_locked(struct ioasid_set *set)
> +{
> + struct ioasid_data *entry;
> + unsigned long index;
> +
> + if (WARN_ON(xa_load(&ioasid_sets, set->sid) != set)) {
> + pr_warn("Invalid set data\n");

WARN_ON() is sufficient

> + return;
> + }
> +
> + if (!refcount_dec_and_test(&set->ref)) {
> + pr_debug("%s: IOASID set %d has %d users\n",
> + __func__, set->sid, refcount_read(&set->ref));
> + return;
> + }
> +
> + /* The set is already empty, we just destroy the set. */
> + if (xa_empty(&set->xa))
> + goto done_destroy;
> +
> + /*
> + * Free all PASIDs from system-wide IOASID pool, all subscribers gets
> + * notified and do cleanup of their own.
> + * Note that some references of the IOASIDs within the set can still
> + * be held after the free call. This is OK in that the IOASIDs will be
> + * marked inactive, the only operations can be done is ioasid_put.
> + * No need to track IOASID set states since there is no reclaim phase.
> + */
> + xa_for_each(&set->xa, index, entry) {
> + ioasid_free_locked(set, index);
> + /* Free from per set private pool */
> + xa_erase(&set->xa, index);
> + }
> +
> +done_destroy:
> + /* Return the quota back to system pool */
> + ioasid_capacity_avail += set->quota;
> + kfree_rcu(set, rcu);
> +
> + /*
> + * Token got released right away after the ioasid_set is freed.
> + * If a new set is created immediately with the newly released token,
> + * it will not allocate the same IOASIDs unless they are reclaimed.
> + */
> + xa_erase(&ioasid_sets, set->sid);
> +}
> +EXPORT_SYMBOL_GPL(ioasid_set_put_locked);

Same comment as ioasid_set_get_locked

> +
> +/**
> + * ioasid_set_put - Drop a reference to the IOASID set. Free all IOASIDs within
> + * the set if there are no more users.
> + *
> + * @set: The IOASID set ID to be freed
> + *
> + * If refcount drops to zero, all IOASIDs allocated within the set will be
> + * freed.
> + */
> +void ioasid_set_put(struct ioasid_set *set)
> +{
> + spin_lock(&ioasid_allocator_lock);
> + ioasid_set_put_locked(set);
> + spin_unlock(&ioasid_allocator_lock);
> +}
> +EXPORT_SYMBOL_GPL(ioasid_set_put);
> +
> +/**
> + * ioasid_adjust_set - Adjust the quota of an IOASID set
> + * @set: IOASID set to be assigned
> + * @quota: Quota allowed in this set
> + *
> + * Return 0 on success. If the new quota is smaller than the number of
> + * IOASIDs already allocated, -EINVAL will be returned. No change will be
> + * made to the existing quota.
> + */
> +int ioasid_adjust_set(struct ioasid_set *set, int quota)
> +{
> + int ret = 0;
> +
> + spin_lock(&ioasid_allocator_lock);
> + if (set->nr_ioasids > quota) {
> + pr_err("New quota %d is smaller than outstanding IOASIDs %d\n",
> + quota, set->nr_ioasids);
> + ret = -EINVAL;
> + goto done_unlock;
> + }
> +
> + if (quota >= ioasid_capacity_avail) {
> + ret = -ENOSPC;
> + goto done_unlock;
> + }
> +
> + /* Return the delta back to system pool */
> + ioasid_capacity_avail += set->quota - quota;

I think this is correct as long as the above check is fixed (as pointed
out by Baolu). A check that quota >= 0 could be nice too.

> +
> + /*
> + * May have a policy to prevent giving all available IOASIDs
> + * to one set. But we don't enforce here, it should be in the
> + * upper layers.
> + */
> + set->quota = quota;
> +
> +done_unlock:
> + spin_unlock(&ioasid_allocator_lock);
> +
> + return ret;
> +}
> +EXPORT_SYMBOL_GPL(ioasid_adjust_set);
> +
> +/**
> + * ioasid_set_for_each_ioasid - Iterate over all the IOASIDs within the set
> + *
> + * Caller must hold a reference of the set and handles its own locking.
> + */
> +int ioasid_set_for_each_ioasid(struct ioasid_set *set,
> + void (*fn)(ioasid_t id, void *data),
> + void *data)
> +{
> + struct ioasid_data *entry;
> + unsigned long index;
> + int ret = 0;
> +
> + if (xa_empty(&set->xa)) {

Who calls this function? Empty xa may be a normal use-case if the caller
just uses it for sweeping, so pr_warn() could be problematic. The returned
value also isn't particularly accurate if concurrent ioasid_alloc/free are
allowed, so I'd drop this.

> + pr_warn("No IOASIDs in the set %d\n", set->sid);
> + return -ENOENT;
> + }
> +
> + xa_for_each(&set->xa, index, entry) {
> + fn(index, data);
> + }
> +
> + return ret;
> +}
> +EXPORT_SYMBOL_GPL(ioasid_set_for_each_ioasid);
>
> /**
> * ioasid_find - Find IOASID data
> diff --git a/include/linux/ioasid.h b/include/linux/ioasid.h
> index 9c44947a68c8..412d025d440e 100644
> --- a/include/linux/ioasid.h
> +++ b/include/linux/ioasid.h
> @@ -10,8 +10,35 @@ typedef unsigned int ioasid_t;
> typedef ioasid_t (*ioasid_alloc_fn_t)(ioasid_t min, ioasid_t max, void *data);
> typedef void (*ioasid_free_fn_t)(ioasid_t ioasid, void *data);
>
> +/* IOASID set types */
> +enum ioasid_set_type {
> + IOASID_SET_TYPE_NULL = 1, /* Set token is NULL */
> + IOASID_SET_TYPE_MM, /* Set token is a mm_struct,
> + * i.e. associated with a process
> + */
> + IOASID_SET_TYPE_NR,
> +};
> +
> +/**
> + * struct ioasid_set - Meta data about ioasid_set
> + * @type: Token types and other features

nit: doesn't follow struct order

> + * @token: Unique to identify an IOASID set
> + * @xa: XArray to store ioasid_set private IDs, can be used for
> + * guest-host IOASID mapping, or just a private IOASID namespace.
> + * @quota: Max number of IOASIDs can be allocated within the set
> + * @nr_ioasids Number of IOASIDs currently allocated in the set
> + * @sid: ID of the set
> + * @ref: Reference count of the users
> + */
> struct ioasid_set {
> - int dummy;
> + void *token;
> + struct xarray xa;
> + int type;
> + int quota;
> + int nr_ioasids;
> + int sid;
> + refcount_t ref;
> + struct rcu_head rcu;
> };
>
> /**
> @@ -29,31 +56,64 @@ struct ioasid_allocator_ops {
> void *pdata;
> };
>
> -#define DECLARE_IOASID_SET(name) struct ioasid_set name = { 0 }
> -
> #if IS_ENABLED(CONFIG_IOASID)
> +void ioasid_install_capacity(ioasid_t total);
> +ioasid_t ioasid_get_capacity(void);
> +struct ioasid_set *ioasid_alloc_set(void *token, ioasid_t quota, int type);
> +int ioasid_adjust_set(struct ioasid_set *set, int quota);
> +void ioasid_set_get_locked(struct ioasid_set *set);
> +void ioasid_set_put_locked(struct ioasid_set *set);
> +void ioasid_set_put(struct ioasid_set *set);

These three functions need a stub for !CONFIG_IOASID

> +
> ioasid_t ioasid_alloc(struct ioasid_set *set, ioasid_t min, ioasid_t max,
> void *private);
> -void ioasid_free(ioasid_t ioasid);
> -void *ioasid_find(struct ioasid_set *set, ioasid_t ioasid,
> - bool (*getter)(void *));
> +void ioasid_free(struct ioasid_set *set, ioasid_t ioasid);
> +
> +bool ioasid_is_active(ioasid_t ioasid);

Not implemented by this series?

> +
> +void *ioasid_find(struct ioasid_set *set, ioasid_t ioasid, bool (*getter)(void *));

Spurious change

> +int ioasid_attach_data(ioasid_t ioasid, void *data);
> int ioasid_register_allocator(struct ioasid_allocator_ops *allocator);
> void ioasid_unregister_allocator(struct ioasid_allocator_ops *allocator);
> -int ioasid_attach_data(ioasid_t ioasid, void *data);

Spurious change?
> -
> +void ioasid_is_in_set(struct ioasid_set *set, ioasid_t ioasid);

Not implemented here

> +int ioasid_set_for_each_ioasid(struct ioasid_set *sdata,
> + void (*fn)(ioasid_t id, void *data),
> + void *data);

Needs a stub for !CONFIG_IOASID

> #else /* !CONFIG_IOASID */
> +static inline void ioasid_install_capacity(ioasid_t total)
> +{
> +}
> +
> +static inline ioasid_t ioasid_get_capacity(void)
> +{
> + return 0;
> +}
> +
> static inline ioasid_t ioasid_alloc(struct ioasid_set *set, ioasid_t min,
> ioasid_t max, void *private)
> {
> return INVALID_IOASID;
> }
>
> -static inline void ioasid_free(ioasid_t ioasid)
> +static inline void ioasid_free(struct ioasid_set *set, ioasid_t ioasid)
> +{
> +}
> +
> +static inline bool ioasid_is_active(ioasid_t ioasid)
> +{
> + return false;
> +}
> +
> +static inline struct ioasid_set *ioasid_alloc_set(void *token, ioasid_t quota, int type)
> +{
> + return ERR_PTR(-ENOTSUPP);
> +}
> +
> +static inline void ioasid_set_put(struct ioasid_set *set)
> {
> }
>
> -static inline void *ioasid_find(struct ioasid_set *set, ioasid_t ioasid,
> - bool (*getter)(void *))
> +static inline void *ioasid_find(struct ioasid_set *set, ioasid_t ioasid, bool (*getter)(void *))

Spurious change

Thanks,
Jean

> {
> return NULL;
> }
> --
> 2.7.4
>

2020-08-24 18:31:18

by Jean-Philippe Brucker

[permalink] [raw]
Subject: Re: [PATCH v2 2/9] iommu/ioasid: Rename ioasid_set_data()

On Fri, Aug 21, 2020 at 09:35:11PM -0700, Jacob Pan wrote:
> Rename ioasid_set_data() to ioasid_attach_data() to avoid confusion with
> struct ioasid_set. ioasid_set is a group of IOASIDs that share a common
> token.
>
> Signed-off-by: Jacob Pan <[email protected]>

Reviewed-by: Jean-Philippe Brucker <[email protected]>

> ---
> drivers/iommu/intel/svm.c | 6 +++---
> drivers/iommu/ioasid.c | 6 +++---
> include/linux/ioasid.h | 4 ++--
> 3 files changed, 8 insertions(+), 8 deletions(-)
>
> diff --git a/drivers/iommu/intel/svm.c b/drivers/iommu/intel/svm.c
> index b6972dca2ae0..37a9beabc0ca 100644
> --- a/drivers/iommu/intel/svm.c
> +++ b/drivers/iommu/intel/svm.c
> @@ -342,7 +342,7 @@ int intel_svm_bind_gpasid(struct iommu_domain *domain, struct device *dev,
> svm->gpasid = data->gpasid;
> svm->flags |= SVM_FLAG_GUEST_PASID;
> }
> - ioasid_set_data(data->hpasid, svm);
> + ioasid_attach_data(data->hpasid, svm);
> INIT_LIST_HEAD_RCU(&svm->devs);
> mmput(svm->mm);
> }
> @@ -394,7 +394,7 @@ int intel_svm_bind_gpasid(struct iommu_domain *domain, struct device *dev,
> list_add_rcu(&sdev->list, &svm->devs);
> out:
> if (!IS_ERR_OR_NULL(svm) && list_empty(&svm->devs)) {
> - ioasid_set_data(data->hpasid, NULL);
> + ioasid_attach_data(data->hpasid, NULL);
> kfree(svm);
> }
>
> @@ -437,7 +437,7 @@ int intel_svm_unbind_gpasid(struct device *dev, int pasid)
> * the unbind, IOMMU driver will get notified
> * and perform cleanup.
> */
> - ioasid_set_data(pasid, NULL);
> + ioasid_attach_data(pasid, NULL);
> kfree(svm);
> }
> }
> diff --git a/drivers/iommu/ioasid.c b/drivers/iommu/ioasid.c
> index 0f8dd377aada..5f63af07acd5 100644
> --- a/drivers/iommu/ioasid.c
> +++ b/drivers/iommu/ioasid.c
> @@ -258,14 +258,14 @@ void ioasid_unregister_allocator(struct ioasid_allocator_ops *ops)
> EXPORT_SYMBOL_GPL(ioasid_unregister_allocator);
>
> /**
> - * ioasid_set_data - Set private data for an allocated ioasid
> + * ioasid_attach_data - Set private data for an allocated ioasid
> * @ioasid: the ID to set data
> * @data: the private data
> *
> * For IOASID that is already allocated, private data can be set
> * via this API. Future lookup can be done via ioasid_find.
> */
> -int ioasid_set_data(ioasid_t ioasid, void *data)
> +int ioasid_attach_data(ioasid_t ioasid, void *data)
> {
> struct ioasid_data *ioasid_data;
> int ret = 0;
> @@ -287,7 +287,7 @@ int ioasid_set_data(ioasid_t ioasid, void *data)
>
> return ret;
> }
> -EXPORT_SYMBOL_GPL(ioasid_set_data);
> +EXPORT_SYMBOL_GPL(ioasid_attach_data);
>
> /**
> * ioasid_alloc - Allocate an IOASID
> diff --git a/include/linux/ioasid.h b/include/linux/ioasid.h
> index 6f000d7a0ddc..9c44947a68c8 100644
> --- a/include/linux/ioasid.h
> +++ b/include/linux/ioasid.h
> @@ -39,7 +39,7 @@ void *ioasid_find(struct ioasid_set *set, ioasid_t ioasid,
> bool (*getter)(void *));
> int ioasid_register_allocator(struct ioasid_allocator_ops *allocator);
> void ioasid_unregister_allocator(struct ioasid_allocator_ops *allocator);
> -int ioasid_set_data(ioasid_t ioasid, void *data);
> +int ioasid_attach_data(ioasid_t ioasid, void *data);
>
> #else /* !CONFIG_IOASID */
> static inline ioasid_t ioasid_alloc(struct ioasid_set *set, ioasid_t min,
> @@ -67,7 +67,7 @@ static inline void ioasid_unregister_allocator(struct ioasid_allocator_ops *allo
> {
> }
>
> -static inline int ioasid_set_data(ioasid_t ioasid, void *data)
> +static inline int ioasid_attach_data(ioasid_t ioasid, void *data)
> {
> return -ENOTSUPP;
> }
> --
> 2.7.4
>

2020-08-24 18:34:09

by Randy Dunlap

[permalink] [raw]
Subject: Re: [PATCH v2 3/9] iommu/ioasid: Introduce ioasid_set APIs

On 8/24/20 11:28 AM, Jean-Philippe Brucker wrote:
>> +/**
>> + * struct ioasid_data - Meta data about ioasid
>> + *
>> + * @id: Unique ID
>> + * @users Number of active users
>> + * @state Track state of the IOASID
>> + * @set Meta data of the set this IOASID belongs to
>> + * @private Private data associated with the IOASID
>> + * @rcu For free after RCU grace period
> nit: it would be nicer to follow the struct order

and use a ':' after each struct member name, as is done for @id:

--
~Randy

2020-08-24 18:38:37

by Randy Dunlap

[permalink] [raw]
Subject: Re: [PATCH v2 3/9] iommu/ioasid: Introduce ioasid_set APIs

On 8/24/20 11:28 AM, Jean-Philippe Brucker wrote:
>> +/**
>> + * struct ioasid_set - Meta data about ioasid_set
>> + * @type: Token types and other features
> nit: doesn't follow struct order
>
>> + * @token: Unique to identify an IOASID set
>> + * @xa: XArray to store ioasid_set private IDs, can be used for
>> + * guest-host IOASID mapping, or just a private IOASID namespace.
>> + * @quota: Max number of IOASIDs can be allocated within the set
>> + * @nr_ioasids Number of IOASIDs currently allocated in the set

* @nr_ioasids: Number of IOASIDs currently allocated in the set

>> + * @sid: ID of the set
>> + * @ref: Reference count of the users
>> + */
>> struct ioasid_set {
>> - int dummy;
>> + void *token;
>> + struct xarray xa;
>> + int type;
>> + int quota;
>> + int nr_ioasids;
>> + int sid;
>> + refcount_t ref;
>> + struct rcu_head rcu;
>> };


--
~Randy

2020-08-25 10:23:49

by Jean-Philippe Brucker

[permalink] [raw]
Subject: Re: [PATCH v2 4/9] iommu/ioasid: Add reference couting functions

On Mon, Aug 24, 2020 at 10:26:55AM +0800, Lu Baolu wrote:
> Hi Jacob,
>
> On 8/22/20 12:35 PM, Jacob Pan wrote:
> > There can be multiple users of an IOASID, each user could have hardware
> > contexts associated with the IOASID. In order to align lifecycles,
> > reference counting is introduced in this patch. It is expected that when
> > an IOASID is being freed, each user will drop a reference only after its
> > context is cleared.
> >
> > Signed-off-by: Jacob Pan <[email protected]>
[...]
> > +/**
> > * ioasid_find - Find IOASID data
> > * @set: the IOASID set
> > * @ioasid: the IOASID to find
>
> Do you need to increase the refcount of the found ioasid and ask the
> caller to drop it after use? Otherwise, the ioasid might be freed
> elsewhere.

ioasid_find() takes a getter function as parameter, which ensures that the
returned data is valid. It fetches the IOASID data under rcu_read_lock()
and calls the getter on the private data (for example mmget_not_zero() for
bare-metal SVA). Given that, I don't think returning with a reference to
the IOASID is necessary. The IOASID may be freed once ioasid_find()
returns but not the returned data.

Thanks,
Jean

2020-08-25 12:27:24

by Jean-Philippe Brucker

[permalink] [raw]
Subject: Re: [PATCH v2 4/9] iommu/ioasid: Add reference couting functions

On Fri, Aug 21, 2020 at 09:35:13PM -0700, Jacob Pan wrote:
> There can be multiple users of an IOASID, each user could have hardware
> contexts associated with the IOASID. In order to align lifecycles,
> reference counting is introduced in this patch. It is expected that when
> an IOASID is being freed, each user will drop a reference only after its
> context is cleared.
>
> Signed-off-by: Jacob Pan <[email protected]>
> ---
> drivers/iommu/ioasid.c | 113 +++++++++++++++++++++++++++++++++++++++++++++++++
> include/linux/ioasid.h | 4 ++
> 2 files changed, 117 insertions(+)
>
> diff --git a/drivers/iommu/ioasid.c b/drivers/iommu/ioasid.c
> index f73b3dbfc37a..5f31d63c75b1 100644
> --- a/drivers/iommu/ioasid.c
> +++ b/drivers/iommu/ioasid.c
> @@ -717,6 +717,119 @@ int ioasid_set_for_each_ioasid(struct ioasid_set *set,
> EXPORT_SYMBOL_GPL(ioasid_set_for_each_ioasid);
>
> /**
> + * IOASID refcounting rules
> + * - ioasid_alloc() set initial refcount to 1
> + *
> + * - ioasid_free() decrement and test refcount.
> + * If refcount is 0, ioasid will be freed. Deleted from the system-wide
> + * xarray as well as per set xarray. The IOASID will be returned to the
> + * pool and available for new allocations.
> + *
> + * If recount is non-zero, mark IOASID as IOASID_STATE_FREE_PENDING.
> + * No new reference can be added. The IOASID is not returned to the pool
> + * for reuse.
> + * After free, ioasid_get() will return error but ioasid_find() and other
> + * non refcount adding APIs will continue to work until the last reference
> + * is dropped
> + *
> + * - ioasid_get() get a reference on an active IOASID
> + *
> + * - ioasid_put() decrement and test refcount of the IOASID.
> + * If refcount is 0, ioasid will be freed. Deleted from the system-wide
> + * xarray as well as per set xarray. The IOASID will be returned to the
> + * pool and available for new allocations.
> + * Do nothing if refcount is non-zero.
> + *
> + * - ioasid_find() does not take reference, caller must hold reference
> + *
> + * ioasid_free() can be called multiple times without error until all refs are
> + * dropped.
> + */

Since you already document this in ioasid.rst, I'm not sure the comment
is necessary. Maybe the doc for _free/_put would be better in the
function's documentation.

> +
> +int ioasid_get_locked(struct ioasid_set *set, ioasid_t ioasid)
> +{
> + struct ioasid_data *data;
> +
> + data = xa_load(&active_allocator->xa, ioasid);
> + if (!data) {
> + pr_err("Trying to get unknown IOASID %u\n", ioasid);
> + return -EINVAL;
> + }
> + if (data->state == IOASID_STATE_FREE_PENDING) {
> + pr_err("Trying to get IOASID being freed%u\n", ioasid);
> + return -EBUSY;
> + }
> +
> + if (set && data->set != set) {
> + pr_err("Trying to get IOASID not in set%u\n", ioasid);
> + /* data found but does not belong to the set */
> + return -EACCES;
> + }
> + refcount_inc(&data->users);
> +
> + return 0;
> +}
> +EXPORT_SYMBOL_GPL(ioasid_get_locked);

Is it necessary to export the *_locked variant? Who'd call them and how
would they acquire the lock?

> +
> +/**
> + * ioasid_get - Obtain a reference of an ioasid
> + * @set
> + * @ioasid

Can be dropped. The doc checker will throw a warning, though.

> + *
> + * Check set ownership if @set is non-null.
> + */
> +int ioasid_get(struct ioasid_set *set, ioasid_t ioasid)
> +{
> + int ret = 0;

No need to initialize ret

> +
> + spin_lock(&ioasid_allocator_lock);
> + ret = ioasid_get_locked(set, ioasid);
> + spin_unlock(&ioasid_allocator_lock);
> +
> + return ret;
> +}
> +EXPORT_SYMBOL_GPL(ioasid_get);
> +
> +void ioasid_put_locked(struct ioasid_set *set, ioasid_t ioasid)
> +{
> + struct ioasid_data *data;
> +
> + data = xa_load(&active_allocator->xa, ioasid);
> + if (!data) {
> + pr_err("Trying to put unknown IOASID %u\n", ioasid);
> + return;
> + }
> +
> + if (set && data->set != set) {
> + pr_err("Trying to drop IOASID not in the set %u\n", ioasid);
> + return;
> + }
> +
> + if (!refcount_dec_and_test(&data->users)) {
> + pr_debug("%s: IOASID %d has %d remainning users\n",
> + __func__, ioasid, refcount_read(&data->users));
> + return;
> + }
> + ioasid_do_free(data);
> +}
> +EXPORT_SYMBOL_GPL(ioasid_put_locked);
> +
> +/**
> + * ioasid_put - Drop a reference of an ioasid
> + * @set
> + * @ioasid
> + *
> + * Check set ownership if @set is non-null.
> + */
> +void ioasid_put(struct ioasid_set *set, ioasid_t ioasid)
> +{
> + spin_lock(&ioasid_allocator_lock);
> + ioasid_put_locked(set, ioasid);
> + spin_unlock(&ioasid_allocator_lock);
> +}
> +EXPORT_SYMBOL_GPL(ioasid_put);
> +
> +/**
> * ioasid_find - Find IOASID data
> * @set: the IOASID set
> * @ioasid: the IOASID to find
> diff --git a/include/linux/ioasid.h b/include/linux/ioasid.h
> index 412d025d440e..310abe4187a3 100644
> --- a/include/linux/ioasid.h
> +++ b/include/linux/ioasid.h
> @@ -76,6 +76,10 @@ int ioasid_attach_data(ioasid_t ioasid, void *data);
> int ioasid_register_allocator(struct ioasid_allocator_ops *allocator);
> void ioasid_unregister_allocator(struct ioasid_allocator_ops *allocator);
> void ioasid_is_in_set(struct ioasid_set *set, ioasid_t ioasid);
> +int ioasid_get(struct ioasid_set *set, ioasid_t ioasid);
> +int ioasid_get_locked(struct ioasid_set *set, ioasid_t ioasid);
> +void ioasid_put(struct ioasid_set *set, ioasid_t ioasid);
> +void ioasid_put_locked(struct ioasid_set *set, ioasid_t ioasid);

Please also add the stubs for !CONFIG_IOASID.

Thanks,
Jean

> int ioasid_set_for_each_ioasid(struct ioasid_set *sdata,
> void (*fn)(ioasid_t id, void *data),
> void *data);
> --
> 2.7.4
>

2020-08-25 12:49:48

by Jean-Philippe Brucker

[permalink] [raw]
Subject: Re: [PATCH v2 5/9] iommu/ioasid: Introduce ioasid_set private ID

On Fri, Aug 21, 2020 at 09:35:14PM -0700, Jacob Pan wrote:
> When an IOASID set is used for guest SVA, each VM will acquire its
> ioasid_set for IOASID allocations. IOASIDs within the VM must have a
> host/physical IOASID backing, mapping between guest and host IOASIDs can
> be non-identical. IOASID set private ID (SPID) is introduced in this
> patch to be used as guest IOASID. However, the concept of ioasid_set
> specific namespace is generic, thus named SPID.
>
> As SPID namespace is within the IOASID set, the IOASID core can provide
> lookup services at both directions. SPIDs may not be allocated when its
> IOASID is allocated, the mapping between SPID and IOASID is usually
> established when a guest page table is bound to a host PASID.
>
> Signed-off-by: Jacob Pan <[email protected]>
> ---
> drivers/iommu/ioasid.c | 54 ++++++++++++++++++++++++++++++++++++++++++++++++++
> include/linux/ioasid.h | 12 +++++++++++
> 2 files changed, 66 insertions(+)
>
> diff --git a/drivers/iommu/ioasid.c b/drivers/iommu/ioasid.c
> index 5f31d63c75b1..c0aef38a4fde 100644
> --- a/drivers/iommu/ioasid.c
> +++ b/drivers/iommu/ioasid.c
> @@ -21,6 +21,7 @@ enum ioasid_state {
> * struct ioasid_data - Meta data about ioasid
> *
> * @id: Unique ID
> + * @spid: Private ID unique within a set
> * @users Number of active users
> * @state Track state of the IOASID
> * @set Meta data of the set this IOASID belongs to
> @@ -29,6 +30,7 @@ enum ioasid_state {
> */
> struct ioasid_data {
> ioasid_t id;
> + ioasid_t spid;
> struct ioasid_set *set;
> refcount_t users;
> enum ioasid_state state;
> @@ -326,6 +328,58 @@ int ioasid_attach_data(ioasid_t ioasid, void *data)
> EXPORT_SYMBOL_GPL(ioasid_attach_data);
>
> /**
> + * ioasid_attach_spid - Attach ioasid_set private ID to an IOASID
> + *
> + * @ioasid: the ID to attach
> + * @spid: the ioasid_set private ID of @ioasid
> + *
> + * For IOASID that is already allocated, private ID within the set can be
> + * attached via this API. Future lookup can be done via ioasid_find.

via ioasid_find_by_spid()?

> + */
> +int ioasid_attach_spid(ioasid_t ioasid, ioasid_t spid)
> +{
> + struct ioasid_data *ioasid_data;
> + int ret = 0;
> +
> + spin_lock(&ioasid_allocator_lock);
> + ioasid_data = xa_load(&active_allocator->xa, ioasid);
> +
> + if (!ioasid_data) {
> + pr_err("No IOASID entry %d to attach SPID %d\n",
> + ioasid, spid);
> + ret = -ENOENT;
> + goto done_unlock;
> + }
> + ioasid_data->spid = spid;
> +
> +done_unlock:
> + spin_unlock(&ioasid_allocator_lock);
> + return ret;
> +}
> +EXPORT_SYMBOL_GPL(ioasid_attach_spid);
> +
> +ioasid_t ioasid_find_by_spid(struct ioasid_set *set, ioasid_t spid)

Maybe add a bit of documentation as this is public-facing.

> +{
> + struct ioasid_data *entry;
> + unsigned long index;
> +
> + if (!xa_load(&ioasid_sets, set->sid)) {
> + pr_warn("Invalid set\n");
> + return INVALID_IOASID;
> + }
> +
> + xa_for_each(&set->xa, index, entry) {
> + if (spid == entry->spid) {
> + pr_debug("Found ioasid %lu by spid %u\n", index, spid);
> + refcount_inc(&entry->users);

Nothing prevents ioasid_free() from concurrently dropping the refcount to
zero and calling ioasid_do_free(). The caller will later call ioasid_put()
on a stale/reallocated index.

> + return index;
> + }
> + }
> + return INVALID_IOASID;
> +}
> +EXPORT_SYMBOL_GPL(ioasid_find_by_spid);
> +
> +/**
> * ioasid_alloc - Allocate an IOASID
> * @set: the IOASID set
> * @min: the minimum ID (inclusive)
> diff --git a/include/linux/ioasid.h b/include/linux/ioasid.h
> index 310abe4187a3..d4b3e83672f6 100644
> --- a/include/linux/ioasid.h
> +++ b/include/linux/ioasid.h
> @@ -73,6 +73,8 @@ bool ioasid_is_active(ioasid_t ioasid);
>
> void *ioasid_find(struct ioasid_set *set, ioasid_t ioasid, bool (*getter)(void *));
> int ioasid_attach_data(ioasid_t ioasid, void *data);
> +int ioasid_attach_spid(ioasid_t ioasid, ioasid_t spid);
> +ioasid_t ioasid_find_by_spid(struct ioasid_set *set, ioasid_t spid);
> int ioasid_register_allocator(struct ioasid_allocator_ops *allocator);
> void ioasid_unregister_allocator(struct ioasid_allocator_ops *allocator);
> void ioasid_is_in_set(struct ioasid_set *set, ioasid_t ioasid);
> @@ -136,5 +138,15 @@ static inline int ioasid_attach_data(ioasid_t ioasid, void *data)
> return -ENOTSUPP;
> }
>
> +staic inline int ioasid_attach_spid(ioasid_t ioasid, ioasid_t spid)
> +{
> + return -ENOTSUPP;
> +}
> +
> +static inline ioasid_t ioasid_find_by_spid(struct ioasid_set *set, ioasid_t spid)
> +{
> + return -ENOTSUPP;

INVALID_IOASID

Thanks,
Jean

> +}
> +
> #endif /* CONFIG_IOASID */
> #endif /* __LINUX_IOASID_H */
> --
> 2.7.4
>

2020-08-25 12:49:48

by Jean-Philippe Brucker

[permalink] [raw]
Subject: Re: [PATCH v2 6/9] iommu/ioasid: Introduce notification APIs

On Fri, Aug 21, 2020 at 09:35:15PM -0700, Jacob Pan wrote:
> Relations among IOASID users largely follow a publisher-subscriber
> pattern. E.g. to support guest SVA on Intel Scalable I/O Virtualization
> (SIOV) enabled platforms, VFIO, IOMMU, device drivers, KVM are all users
> of IOASIDs. When a state change occurs, VFIO publishes the change event
> that needs to be processed by other users/subscribers.
>
> This patch introduced two types of notifications: global and per
> ioasid_set. The latter is intended for users who only needs to handle
> events related to the IOASID of a given set.
> For more information, refer to the kernel documentation at
> Documentation/ioasid.rst.
>
> Signed-off-by: Liu Yi L <[email protected]>
> Signed-off-by: Wu Hao <[email protected]>
> Signed-off-by: Jacob Pan <[email protected]>
> ---
> drivers/iommu/ioasid.c | 280 ++++++++++++++++++++++++++++++++++++++++++++++++-
> include/linux/ioasid.h | 70 +++++++++++++
> 2 files changed, 348 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/iommu/ioasid.c b/drivers/iommu/ioasid.c
> index c0aef38a4fde..6ddc09a7fe74 100644
> --- a/drivers/iommu/ioasid.c
> +++ b/drivers/iommu/ioasid.c
> @@ -9,8 +9,35 @@
> #include <linux/spinlock.h>
> #include <linux/xarray.h>
> #include <linux/ioasid.h>
> +#include <linux/sched/mm.h>
>
> static DEFINE_XARRAY_ALLOC(ioasid_sets);
> +/*
> + * An IOASID could have multiple consumers where each consumeer may have

consumer

> + * hardware contexts associated with IOASIDs.
> + * When a status change occurs, such as IOASID is being freed, notifier chains
> + * are used to keep the consumers in sync.
> + * This is a publisher-subscriber pattern where publisher can change the
> + * state of each IOASID, e.g. alloc/free, bind IOASID to a device and mm.
> + * On the other hand, subscribers gets notified for the state change and
> + * keep local states in sync.
> + *
> + * Currently, the notifier is global. A further optimization could be per
> + * IOASID set notifier chain.

The patch adds both

> + */
> +static ATOMIC_NOTIFIER_HEAD(ioasid_chain);

"ioasid_notifier" may be clearer

> +
> +/* List to hold pending notification block registrations */
> +static LIST_HEAD(ioasid_nb_pending_list);
> +static DEFINE_SPINLOCK(ioasid_nb_lock);
> +struct ioasid_set_nb {
> + struct list_head list;
> + struct notifier_block *nb;
> + void *token;
> + struct ioasid_set *set;
> + bool active;
> +};
> +
> enum ioasid_state {
> IOASID_STATE_INACTIVE,
> IOASID_STATE_ACTIVE,
> @@ -394,6 +421,7 @@ EXPORT_SYMBOL_GPL(ioasid_find_by_spid);
> ioasid_t ioasid_alloc(struct ioasid_set *set, ioasid_t min, ioasid_t max,
> void *private)
> {
> + struct ioasid_nb_args args;
> struct ioasid_data *data;
> void *adata;
> ioasid_t id = INVALID_IOASID;
> @@ -445,8 +473,14 @@ ioasid_t ioasid_alloc(struct ioasid_set *set, ioasid_t min, ioasid_t max,
> goto exit_free;
> }
> set->nr_ioasids++;
> - goto done_unlock;
> + args.id = id;
> + /* Set private ID is not attached during allocation */
> + args.spid = INVALID_IOASID;
> + args.set = set;

args.pdata is uninitialized

> + atomic_notifier_call_chain(&set->nh, IOASID_ALLOC, &args);

No global notification?

>
> + spin_unlock(&ioasid_allocator_lock);
> + return id;
> exit_free:
> kfree(data);
> done_unlock:
> @@ -479,6 +513,7 @@ static void ioasid_do_free(struct ioasid_data *data)
>
> static void ioasid_free_locked(struct ioasid_set *set, ioasid_t ioasid)
> {
> + struct ioasid_nb_args args;
> struct ioasid_data *data;
>
> data = xa_load(&active_allocator->xa, ioasid);
> @@ -491,7 +526,16 @@ static void ioasid_free_locked(struct ioasid_set *set, ioasid_t ioasid)
> pr_warn("Cannot free IOASID %u due to set ownership\n", ioasid);
> return;
> }
> +
> data->state = IOASID_STATE_FREE_PENDING;
> + /* Notify all users that this IOASID is being freed */
> + args.id = ioasid;
> + args.spid = data->spid;
> + args.pdata = data->private;
> + args.set = data->set;
> + atomic_notifier_call_chain(&ioasid_chain, IOASID_FREE, &args);
> + /* Notify the ioasid_set for per set users */
> + atomic_notifier_call_chain(&set->nh, IOASID_FREE, &args);
>
> if (!refcount_dec_and_test(&data->users))
> return;
> @@ -514,6 +558,28 @@ void ioasid_free(struct ioasid_set *set, ioasid_t ioasid)
> }
> EXPORT_SYMBOL_GPL(ioasid_free);
>
> +static void ioasid_add_pending_nb(struct ioasid_set *set)
> +{
> + struct ioasid_set_nb *curr;
> +
> + if (set->type != IOASID_SET_TYPE_MM)
> + return;
> +
> + /*
> + * Check if there are any pending nb requests for the given token, if so
> + * add them to the notifier chain.
> + */
> + spin_lock(&ioasid_nb_lock);
> + list_for_each_entry(curr, &ioasid_nb_pending_list, list) {
> + if (curr->token == set->token && !curr->active) {
> + atomic_notifier_chain_register(&set->nh, curr->nb);
> + curr->set = set;
> + curr->active = true;
> + }
> + }
> + spin_unlock(&ioasid_nb_lock);
> +}
> +
> /**
> * ioasid_alloc_set - Allocate a new IOASID set for a given token
> *
> @@ -601,6 +667,13 @@ struct ioasid_set *ioasid_alloc_set(void *token, ioasid_t quota, int type)
> sdata->quota = quota;
> sdata->sid = id;
> refcount_set(&sdata->ref, 1);
> + ATOMIC_INIT_NOTIFIER_HEAD(&sdata->nh);
> +
> + /*
> + * Check if there are any pending nb requests for the given token, if so
> + * add them to the notifier chain.
> + */
> + ioasid_add_pending_nb(sdata);
>
> /*
> * Per set XA is used to store private IDs within the set, get ready
> @@ -617,6 +690,30 @@ struct ioasid_set *ioasid_alloc_set(void *token, ioasid_t quota, int type)
> }
> EXPORT_SYMBOL_GPL(ioasid_alloc_set);
>
> +
> +/*
> + * ioasid_find_mm_set - Retrieve IOASID set with mm token
> + * Take a reference of the set if found.
> + */
> +static struct ioasid_set *ioasid_find_mm_set(struct mm_struct *token)
> +{
> + struct ioasid_set *sdata, *set = NULL;
> + unsigned long index;
> +
> + spin_lock(&ioasid_allocator_lock);
> +
> + xa_for_each(&ioasid_sets, index, sdata) {
> + if (sdata->type == IOASID_SET_TYPE_MM && sdata->token == token) {
> + refcount_inc(&sdata->ref);
> + set = sdata;
> + goto exit_unlock;

Or just break

> + }
> + }
> +exit_unlock:
> + spin_unlock(&ioasid_allocator_lock);
> + return set;
> +}
> +
> void ioasid_set_get_locked(struct ioasid_set *set)
> {
> if (WARN_ON(xa_load(&ioasid_sets, set->sid) != set)) {
> @@ -638,6 +735,8 @@ EXPORT_SYMBOL_GPL(ioasid_set_get);
>
> void ioasid_set_put_locked(struct ioasid_set *set)
> {
> + struct ioasid_nb_args args = { 0 };
> + struct ioasid_set_nb *curr;
> struct ioasid_data *entry;
> unsigned long index;
>
> @@ -673,8 +772,24 @@ void ioasid_set_put_locked(struct ioasid_set *set)
> done_destroy:
> /* Return the quota back to system pool */
> ioasid_capacity_avail += set->quota;
> - kfree_rcu(set, rcu);
>
> + /* Restore pending status of the set NBs */
> + spin_lock(&ioasid_nb_lock);
> + list_for_each_entry(curr, &ioasid_nb_pending_list, list) {
> + if (curr->token == set->token) {
> + if (curr->active)
> + curr->active = false;
> + else
> + pr_warn("Set token exists but not active!\n");
> + }
> + }
> + spin_unlock(&ioasid_nb_lock);
> +
> + args.set = set;
> + atomic_notifier_call_chain(&ioasid_chain, IOASID_SET_FREE, &args);
> +
> + kfree_rcu(set, rcu);
> + pr_debug("Set freed %d\n", set->sid);

set might have been freed

> /*
> * Token got released right away after the ioasid_set is freed.
> * If a new set is created immediately with the newly released token,
> @@ -927,6 +1042,167 @@ void *ioasid_find(struct ioasid_set *set, ioasid_t ioasid,
> }
> EXPORT_SYMBOL_GPL(ioasid_find);
>
> +int ioasid_register_notifier(struct ioasid_set *set, struct notifier_block *nb)

Maybe add a bit of documentation on the difference with the _mm variant,
as well as the @set parameter.

Will this be used by anyone at first? We could introduce only the _mm
functions for now.

> +{
> + if (set)
> + return atomic_notifier_chain_register(&set->nh, nb);
> + else
> + return atomic_notifier_chain_register(&ioasid_chain, nb);
> +}
> +EXPORT_SYMBOL_GPL(ioasid_register_notifier);
> +
> +void ioasid_unregister_notifier(struct ioasid_set *set,
> + struct notifier_block *nb)
> +{
> + struct ioasid_set_nb *curr;
> +
> + spin_lock(&ioasid_nb_lock);
> + /*
> + * Pending list is registered with a token without an ioasid_set,
> + * therefore should not be unregistered directly.
> + */
> + list_for_each_entry(curr, &ioasid_nb_pending_list, list) {
> + if (curr->nb == nb) {
> + pr_warn("Cannot unregister NB from pending list\n");
> + spin_unlock(&ioasid_nb_lock);
> + return;
> + }
> + }
> + spin_unlock(&ioasid_nb_lock);
> +
> + if (set)
> + atomic_notifier_chain_unregister(&set->nh, nb);
> + else
> + atomic_notifier_chain_unregister(&ioasid_chain, nb);
> +}
> +EXPORT_SYMBOL_GPL(ioasid_unregister_notifier);
> +
> +int ioasid_register_notifier_mm(struct mm_struct *mm, struct notifier_block *nb)
> +{
> + struct ioasid_set_nb *curr;
> + struct ioasid_set *set;
> + int ret = 0;
> +
> + if (!mm)
> + return -EINVAL;
> +
> + spin_lock(&ioasid_nb_lock);
> +
> + /* Check for duplicates, nb is unique per set */
> + list_for_each_entry(curr, &ioasid_nb_pending_list, list) {
> + if (curr->token == mm && curr->nb == nb) {
> + ret = -EBUSY;
> + goto exit_unlock;
> + }
> + }
> +
> + /* Check if the token has an existing set */
> + set = ioasid_find_mm_set(mm);

Seems to be a deadlock here, as ioasid_find_mm_set() grabs
ioasid_allocator_lock while holding ioasid_nb_lock, and
ioasid_set_put/get_locked() grabs ioasid_nb_lock while holding
ioasid_allocator_lock.

> + if (IS_ERR_OR_NULL(set)) {

Looks a bit off, maybe we can check !set since ioasid_find_mm_set()
doesn't return errors.

> + /* Add to the rsvd list as inactive */
> + curr->active = false;

curr isn't valid here

> + } else {
> + /* REVISIT: Only register empty set for now. Can add an option
> + * in the future to playback existing PASIDs.
> + */
> + if (set->nr_ioasids) {
> + pr_warn("IOASID set %d not empty\n", set->sid);
> + ret = -EBUSY;
> + goto exit_unlock;
> + }
> + curr = kzalloc(sizeof(*curr), GFP_ATOMIC);

As a side-note, I think there's too much atomic allocation in this file,
I'd like to try and rework the locking once it stabilizes and I find some
time. Do you remember why ioasid_allocator_lock needed to be a spinlock?

> + if (!curr) {
> + ret = -ENOMEM;
> + goto exit_unlock;
> + }
> + curr->token = mm;
> + curr->nb = nb;
> + curr->active = true;
> + curr->set = set;
> +
> + /* Set already created, add to the notifier chain */
> + atomic_notifier_chain_register(&set->nh, nb);
> + /*
> + * Do not hold a reference, if the set gets destroyed, the nb
> + * entry will be marked inactive.
> + */
> + ioasid_set_put(set);
> + }
> +
> + list_add(&curr->list, &ioasid_nb_pending_list);
> +
> +exit_unlock:
> + spin_unlock(&ioasid_nb_lock);
> + return ret;
> +}
> +EXPORT_SYMBOL_GPL(ioasid_register_notifier_mm);
> +
> +void ioasid_unregister_notifier_mm(struct mm_struct *mm, struct notifier_block *nb)
> +{
> + struct ioasid_set_nb *curr;
> +
> + spin_lock(&ioasid_nb_lock);
> + list_for_each_entry(curr, &ioasid_nb_pending_list, list) {
> + if (curr->token == mm && curr->nb == nb) {
> + list_del(&curr->list);
> + goto exit_free;
> + }
> + }
> + pr_warn("No ioasid set found for mm token %llx\n", (u64)mm);
> + goto done_unlock;
> +
> +exit_free:
> + if (curr->active) {
> + pr_debug("mm set active, unregister %llx\n",
> + (u64)mm);

%px shows raw pointers, but I would drop this altogether or use %p.

> + atomic_notifier_chain_unregister(&curr->set->nh, nb);
> + }
> + kfree(curr);
> +done_unlock:
> + spin_unlock(&ioasid_nb_lock);
> + return;
> +}
> +EXPORT_SYMBOL_GPL(ioasid_unregister_notifier_mm);
> +
> +/**
> + * ioasid_notify - Send notification on a given IOASID for status change.
> + * Used by publishers when the status change may affect
> + * subscriber's internal state.
> + *
> + * @ioasid: The IOASID to which the notification will send
> + * @cmd: The notification event
> + * @flags: Special instructions, e.g. notify with a set or global

Describe valid values for @cmd and @flags? I guess this function
shouldn't accept IOASID_ALLOC, IOASID_FREE etc

> + */
> +int ioasid_notify(ioasid_t ioasid, enum ioasid_notify_val cmd, unsigned int flags)
> +{
> + struct ioasid_data *ioasid_data;
> + struct ioasid_nb_args args = { 0 };
> + int ret = 0;
> +
> + spin_lock(&ioasid_allocator_lock);
> + ioasid_data = xa_load(&active_allocator->xa, ioasid);
> + if (!ioasid_data) {
> + pr_err("Trying to notify unknown IOASID %u\n", ioasid);
> + spin_unlock(&ioasid_allocator_lock);
> + return -EINVAL;
> + }
> +
> + args.id = ioasid;
> + args.set = ioasid_data->set;
> + args.pdata = ioasid_data->private;
> + args.spid = ioasid_data->spid;
> + if (flags & IOASID_NOTIFY_ALL) {
> + ret = atomic_notifier_call_chain(&ioasid_chain, cmd, &args);
> + } else if (flags & IOASID_NOTIFY_SET) {
> + ret = atomic_notifier_call_chain(&ioasid_data->set->nh,
> + cmd, &args);
> + }

else ret = -EINVAL?
What about allowing both flags?

> + spin_unlock(&ioasid_allocator_lock);
> +
> + return ret;
> +}
> +EXPORT_SYMBOL_GPL(ioasid_notify);
> +
> MODULE_AUTHOR("Jean-Philippe Brucker <[email protected]>");
> MODULE_AUTHOR("Jacob Pan <[email protected]>");
> MODULE_DESCRIPTION("IO Address Space ID (IOASID) allocator");
> diff --git a/include/linux/ioasid.h b/include/linux/ioasid.h
> index d4b3e83672f6..572111cd3b4b 100644
> --- a/include/linux/ioasid.h
> +++ b/include/linux/ioasid.h
> @@ -23,6 +23,7 @@ enum ioasid_set_type {
> * struct ioasid_set - Meta data about ioasid_set
> * @type: Token types and other features
> * @token: Unique to identify an IOASID set
> + * @nh: Notifier for IOASID events within the set
> * @xa: XArray to store ioasid_set private IDs, can be used for
> * guest-host IOASID mapping, or just a private IOASID namespace.
> * @quota: Max number of IOASIDs can be allocated within the set
> @@ -32,6 +33,7 @@ enum ioasid_set_type {
> */
> struct ioasid_set {
> void *token;
> + struct atomic_notifier_head nh;
> struct xarray xa;
> int type;
> int quota;
> @@ -56,6 +58,49 @@ struct ioasid_allocator_ops {
> void *pdata;
> };
>
> +/* Notification data when IOASID status changed */
> +enum ioasid_notify_val {
> + IOASID_ALLOC = 1,
> + IOASID_FREE,
> + IOASID_BIND,
> + IOASID_UNBIND,
> + IOASID_SET_ALLOC,
> + IOASID_SET_FREE,
> +};

May be nicer to prefix these with IOASID_NOTIFY_

> +
> +#define IOASID_NOTIFY_ALL BIT(0)
> +#define IOASID_NOTIFY_SET BIT(1)
> +/**
> + * enum ioasid_notifier_prios - IOASID event notification order
> + *
> + * When status of an IOASID changes, users might need to take actions to
> + * reflect the new state. For example, when an IOASID is freed due to
> + * exception, the hardware context in virtual CPU, DMA device, and IOMMU
> + * shall be cleared and drained. Order is required to prevent life cycle
> + * problems.
> + */
> +enum ioasid_notifier_prios {
> + IOASID_PRIO_LAST,
> + IOASID_PRIO_DEVICE,
> + IOASID_PRIO_IOMMU,
> + IOASID_PRIO_CPU,
> +};

Not used by this patch, can be added later

> +
> +/**
> + * struct ioasid_nb_args - Argument provided by IOASID core when notifier
> + * is called.
> + * @id: The IOASID being notified
> + * @spid: The set private ID associated with the IOASID
> + * @set: The IOASID set of @id
> + * @pdata: The private data attached to the IOASID
> + */
> +struct ioasid_nb_args {
> + ioasid_t id;
> + ioasid_t spid;
> + struct ioasid_set *set;
> + void *pdata;
> +};
> +
> #if IS_ENABLED(CONFIG_IOASID)
> void ioasid_install_capacity(ioasid_t total);
> ioasid_t ioasid_get_capacity(void);
> @@ -75,8 +120,16 @@ void *ioasid_find(struct ioasid_set *set, ioasid_t ioasid, bool (*getter)(void *
> int ioasid_attach_data(ioasid_t ioasid, void *data);
> int ioasid_attach_spid(ioasid_t ioasid, ioasid_t spid);
> ioasid_t ioasid_find_by_spid(struct ioasid_set *set, ioasid_t spid);
> +
> +int ioasid_register_notifier(struct ioasid_set *set,
> + struct notifier_block *nb);
> +void ioasid_unregister_notifier(struct ioasid_set *set,
> + struct notifier_block *nb);
> +
> int ioasid_register_allocator(struct ioasid_allocator_ops *allocator);
> void ioasid_unregister_allocator(struct ioasid_allocator_ops *allocator);
> +
> +int ioasid_notify(ioasid_t ioasid, enum ioasid_notify_val cmd, unsigned int flags);
> void ioasid_is_in_set(struct ioasid_set *set, ioasid_t ioasid);
> int ioasid_get(struct ioasid_set *set, ioasid_t ioasid);
> int ioasid_get_locked(struct ioasid_set *set, ioasid_t ioasid);
> @@ -85,6 +138,9 @@ void ioasid_put_locked(struct ioasid_set *set, ioasid_t ioasid);
> int ioasid_set_for_each_ioasid(struct ioasid_set *sdata,
> void (*fn)(ioasid_t id, void *data),
> void *data);
> +int ioasid_register_notifier_mm(struct mm_struct *mm, struct notifier_block *nb);
> +void ioasid_unregister_notifier_mm(struct mm_struct *mm, struct notifier_block *nb);

These need stubs for !CONFIG_IOASID

> +
> #else /* !CONFIG_IOASID */
> static inline void ioasid_install_capacity(ioasid_t total)
> {
> @@ -124,6 +180,20 @@ static inline void *ioasid_find(struct ioasid_set *set, ioasid_t ioasid, bool (*
> return NULL;
> }
>
> +static inline int ioasid_register_notifier(struct notifier_block *nb)

Missing set argument

Thanks,
Jean

> +{
> + return -ENOTSUPP;
> +}
> +
> +static inline void ioasid_unregister_notifier(struct notifier_block *nb)
> +{
> +}
> +
> +static inline int ioasid_notify(ioasid_t ioasid, enum ioasid_notify_val cmd, unsigned int flags)
> +{
> + return -ENOTSUPP;
> +}
> +
> static inline int ioasid_register_allocator(struct ioasid_allocator_ops *allocator)
> {
> return -ENOTSUPP;
> --
> 2.7.4
>

2020-08-27 16:22:58

by Eric Auger

[permalink] [raw]
Subject: Re: [PATCH v2 1/9] docs: Document IO Address Space ID (IOASID) APIs

Hi Jacob,
On 8/24/20 12:32 PM, Jean-Philippe Brucker wrote:
> On Fri, Aug 21, 2020 at 09:35:10PM -0700, Jacob Pan wrote:
>> IOASID is used to identify address spaces that can be targeted by device
>> DMA. It is a system-wide resource that is essential to its many users.
>> This document is an attempt to help developers from all vendors navigate
>> the APIs. At this time, ARM SMMU and Intel’s Scalable IO Virtualization
>> (SIOV) enabled platforms are the primary users of IOASID. Examples of
>> how SIOV components interact with IOASID APIs are provided in that many
>> APIs are driven by the requirements from SIOV.
>>
>> Signed-off-by: Liu Yi L <[email protected]>
>> Signed-off-by: Wu Hao <[email protected]>
>> Signed-off-by: Jacob Pan <[email protected]>
>> ---
>> Documentation/ioasid.rst | 618 +++++++++++++++++++++++++++++++++++++++++++++++
>> 1 file changed, 618 insertions(+)
>> create mode 100644 Documentation/ioasid.rst
>>
>> diff --git a/Documentation/ioasid.rst b/Documentation/ioasid.rst
>
> Thanks for writing this up. Should it go to Documentation/driver-api/, or
> Documentation/driver-api/iommu/? I think this also needs to Cc
> [email protected] and [email protected]
>
>> new file mode 100644
>> index 000000000000..b6a8cdc885ff
>> --- /dev/null
>> +++ b/Documentation/ioasid.rst
>> @@ -0,0 +1,618 @@
>> +.. ioasid:
>> +
>> +=====================================
>> +IO Address Space ID
>> +=====================================
>> +
>> +IOASID is a generic name for PCIe Process Address ID (PASID) or ARM
>> +SMMU sub-stream ID. An IOASID identifies an address space that DMA
>
> "SubstreamID"
On ARM if we don't use PASIDs we have streamids (SID) which can also
identify address spaces that DMA requests can target. So maybe this
definition is not sufficient.

>
>> +requests can target.
>> +
>> +The primary use cases for IOASID are Shared Virtual Address (SVA) and
>> +IO Virtual Address (IOVA). However, the requirements for IOASID
>
> IOVA alone isn't a use case, maybe "multiple IOVA spaces per device"?
>
>> +management can vary among hardware architectures.
>> +
>> +This document covers the generic features supported by IOASID
>> +APIs. Vendor-specific use cases are also illustrated with Intel's VT-d
>> +based platforms as the first example.
>> +
>> +.. contents:: :local:
>> +
>> +Glossary
>> +========
>> +PASID - Process Address Space ID
>> +
>> +IOASID - IO Address Space ID (generic term for PCIe PASID and
>> +sub-stream ID in SMMU)
>
> "SubstreamID"
>
>> +
>> +SVA/SVM - Shared Virtual Addressing/Memory
>> +
>> +ENQCMD - New Intel X86 ISA for efficient workqueue submission [1]
>
> Maybe drop the "New", to keep the documentation perennial. It might be
> good to add internal links here to the specifications URLs at the bottom.
>
>> +
>> +DSA - Intel Data Streaming Accelerator [2]
>> +
>> +VDCM - Virtual device composition module [3]
>> +
>> +SIOV - Intel Scalable IO Virtualization
>> +
>> +
>> +Key Concepts
>> +============
>> +
>> +IOASID Set
>> +-----------
>> +An IOASID set is a group of IOASIDs allocated from the system-wide
>> +IOASID pool. An IOASID set is created and can be identified by a
>> +token of u64. Refer to IOASID set APIs for more details.
>
> Identified either by an u64 or an mm_struct, right? Maybe just drop the
> second sentence if it's detailed in the IOASID set section below.
>
>> +
>> +IOASID set is particularly useful for guest SVA where each guest could
>> +have its own IOASID set for security and efficiency reasons.
>> +
>> +IOASID Set Private ID (SPID)
>> +----------------------------
>> +SPIDs are introduced as IOASIDs within its set. Each SPID maps to a
>> +system-wide IOASID but the namespace of SPID is within its IOASID
>> +set.
>
> The intro isn't super clear. Perhaps this is simpler:
> "Each IOASID set has a private namespace of SPIDs. An SPID maps to a
> single system-wide IOASID."
or, "within an ioasid set, each ioasid can be associated with an alias
ID, named SPID."
>
>> SPIDs can be used as guest IOASIDs where each guest could do
>> +IOASID allocation from its own pool and map them to host physical
>> +IOASIDs. SPIDs are particularly useful for supporting live migration
>> +where decoupling guest and host physical resources are necessary.
>> +
>> +For example, two VMs can both allocate guest PASID/SPID #101 but map to
>> +different host PASIDs #201 and #202 respectively as shown in the
>> +diagram below.
>> +::
>> +
>> + .------------------. .------------------.
>> + | VM 1 | | VM 2 |
>> + | | | |
>> + |------------------| |------------------|
>> + | GPASID/SPID 101 | | GPASID/SPID 101 |
>> + '------------------' -------------------' Guest
>> + __________|______________________|______________________
>> + | | Host
>> + v v
>> + .------------------. .------------------.
>> + | Host IOASID 201 | | Host IOASID 202 |
>> + '------------------' '------------------'
>> + | IOASID set 1 | | IOASID set 2 |
>> + '------------------' '------------------'
>> +
>> +Guest PASID is treated as IOASID set private ID (SPID) within an
>> +IOASID set, mappings between guest and host IOASIDs are stored in the
>> +set for inquiry.
>> +
>> +IOASID APIs
>> +===========
>> +To get the IOASID APIs, users must #include <linux/ioasid.h>. These APIs
>> +serve the following functionalities:
>> +
>> + - IOASID allocation/Free
>> + - Group management in the form of ioasid_set
>> + - Private data storage and lookup
>> + - Reference counting
>> + - Event notification in case of state change
(a)
>> +
>> +IOASID Set Level APIs
>> +--------------------------
>> +For use cases such as guest SVA it is necessary to manage IOASIDs at
>> +a group level. For example, VMs may allocate multiple IOASIDs for
I would use the introduced ioasid_set terminology instead of "group".
>> +guest process address sharing (vSVA). It is imperative to enforce
>> +VM-IOASID ownership such that malicious guest cannot target DMA
>
> "a malicious guest"
>
>> +traffic outside its own IOASIDs, or free an active IOASID belong to
>
> "that belongs to"
>
>> +another VM.
>> +::
>> +
>> + struct ioasid_set *ioasid_alloc_set(void *token, ioasid_t quota, u32 type)
what is this void *token? also the type may be explained here.
>> +
>> + int ioasid_adjust_set(struct ioasid_set *set, int quota);
>
> These could be named "ioasid_set_alloc" and "ioasid_set_adjust" to be
> consistent with the rest of the API.
>
>> +
>> + void ioasid_set_get(struct ioasid_set *set)
>> +
>> + void ioasid_set_put(struct ioasid_set *set)
>> +
>> + void ioasid_set_get_locked(struct ioasid_set *set)
>> +
>> + void ioasid_set_put_locked(struct ioasid_set *set)
>> +
>> + int ioasid_set_for_each_ioasid(struct ioasid_set *sdata,
>
> Might be nicer to keep the same argument names within the API. Here "set"
> rather than "sdata".
>
>> + void (*fn)(ioasid_t id, void *data),
>> + void *data)
>
> (alignment)
>
>> +
>> +
>> +IOASID set concept is introduced to represent such IOASID groups. Each
>
> Or just "IOASID sets represent such IOASID groups", but might be
> redundant.
>
>> +IOASID set is created with a token which can be one of the following
>> +types:
I think this explanation should happen before the above function prototypes
>> +
>> + - IOASID_SET_TYPE_NULL (Arbitrary u64 value)
>> + - IOASID_SET_TYPE_MM (Set token is a mm_struct)
>> +
>> +The explicit MM token type is useful when multiple users of an IOASID
>> +set under the same process need to communicate about their shared IOASIDs.
>> +E.g. An IOASID set created by VFIO for one guest can be associated
>> +with the KVM instance for the same guest since they share a common mm_struct.
>> +
>> +The IOASID set APIs serve the following purposes:
>> +
>> + - Ownership/permission enforcement
>> + - Take collective actions, e.g. free an entire set
>> + - Event notifications within a set
>> + - Look up a set based on token
>> + - Quota enforcement
>
> This paragraph could be earlier in the section

yes this is a kind of repetition of (a), above
>
>> +
>> +Individual IOASID APIs
>> +----------------------
>> +Once an ioasid_set is created, IOASIDs can be allocated from the set.
>> +Within the IOASID set namespace, set private ID (SPID) is supported. In
>> +the VM use case, SPID can be used for storing guest PASID.
>> +
>> +::
>> +
>> + ioasid_t ioasid_alloc(struct ioasid_set *set, ioasid_t min, ioasid_t max,
>> + void *private);
>> +
>> + int ioasid_get(struct ioasid_set *set, ioasid_t ioasid);
>> +
>> + void ioasid_put(struct ioasid_set *set, ioasid_t ioasid);
>> +
>> + int ioasid_get_locked(struct ioasid_set *set, ioasid_t ioasid);
>> +
>> + void ioasid_put_locked(struct ioasid_set *set, ioasid_t ioasid);
>> +
>> + void *ioasid_find(struct ioasid_set *set, ioasid_t ioasid,
>> + bool (*getter)(void *));
>> +
>> + ioasid_t ioasid_find_by_spid(struct ioasid_set *set, ioasid_t spid)
>> +
>> + int ioasid_attach_data(struct ioasid_set *set, ioasid_t ioasid,
>> + void *data);
>> + int ioasid_attach_spid(struct ioasid_set *set, ioasid_t ioasid,
>> + ioasid_t ssid);
>
> s/ssid/spid>
>> +
>> +
>> +Notifications
>> +-------------
>> +An IOASID may have multiple users, each user may have hardware context
>> +associated with an IOASID. When the status of an IOASID changes,
>> +e.g. an IOASID is being freed, users need to be notified such that the
>> +associated hardware context can be cleared, flushed, and drained.
>> +
>> +::
>> +
>> + int ioasid_register_notifier(struct ioasid_set *set, struct
>> + notifier_block *nb)
>> +
>> + void ioasid_unregister_notifier(struct ioasid_set *set,
>> + struct notifier_block *nb)
>> +
>> + int ioasid_register_notifier_mm(struct mm_struct *mm, struct
>> + notifier_block *nb)
>> +
>> + void ioasid_unregister_notifier_mm(struct mm_struct *mm, struct
>> + notifier_block *nb)
the mm_struct prototypes may be justified
>> +
>> + int ioasid_notify(ioasid_t ioasid, enum ioasid_notify_val cmd,
>> + unsigned int flags)
this one is not obvious either.
>> +
>> +
>> +Events
>> +~~~~~~
>> +Notification events are pertinent to individual IOASIDs, they can be
>> +one of the following:
>> +
>> + - ALLOC
>> + - FREE
>> + - BIND
>> + - UNBIND
>> +
>> +Ordering
>> +~~~~~~~~
>> +Ordering is supported by IOASID notification priorities as the
>> +following (in ascending order):
>> +
>> +::
>> +
>> + enum ioasid_notifier_prios {
>> + IOASID_PRIO_LAST,
>> + IOASID_PRIO_IOMMU,
>> + IOASID_PRIO_DEVICE,
>> + IOASID_PRIO_CPU,
>> + };

Maybe:
when registered, notifiers are assigned a priority that affect the call
order. Notifiers with CPU priority get called before notifiers with
device priority and so on.
>> +
>> +The typical use case is when an IOASID is freed due to an exception, DMA
>> +source should be quiesced before tearing down other hardware contexts
>> +in the system. This will reduce the churn in handling faults. DMA work
>> +submission is performed by the CPU which is granted higher priority than
>> +devices.
>> +
>> +
>> +Scopes
>> +~~~~~~
>> +There are two types of notifiers in IOASID core: system-wide and
>> +ioasid_set-wide.
>> +
>> +System-wide notifier is catering for users that need to handle all
>> +IOASIDs in the system. E.g. The IOMMU driver handles all IOASIDs.
>> +
>> +Per ioasid_set notifier can be used by VM specific components such as
>> +KVM. After all, each KVM instance only cares about IOASIDs within its
>> +own set.
>> +
>> +
>> +Atomicity
>> +~~~~~~~~~
>> +IOASID notifiers are atomic due to spinlocks used inside the IOASID
>> +core. For tasks cannot be completed in the notifier handler, async work
>
> "tasks that cannot be"
>
>> +can be submitted to complete the work later as long as there is no
>> +ordering requirement.
>> +
>> +Reference counting
>> +------------------
>> +IOASID lifecycle management is based on reference counting. Users of
>> +IOASID intend to align lifecycle with the IOASID need to hold
>
> "who intend to"
>
>> +reference of the IOASID. IOASID will not be returned to the pool for
>
> "a reference to the IOASID. The IOASID"
>
>> +allocation until all references are dropped. Calling ioasid_free()
>> +will mark the IOASID as FREE_PENDING if the IOASID has outstanding
>> +reference. ioasid_get() is not allowed once an IOASID is in the
>> +FREE_PENDING state.
>> +
>> +Event notifications are used to inform users of IOASID status change.
>> +IOASID_FREE event prompts users to drop their references after
>> +clearing its context.
>> +
>> +For example, on VT-d platform when an IOASID is freed, teardown
>> +actions are performed on KVM, device driver, and IOMMU driver.
>> +KVM shall register notifier block with::
>> +
>> + static struct notifier_block pasid_nb_kvm = {
>> + .notifier_call = pasid_status_change_kvm,
>> + .priority = IOASID_PRIO_CPU,
>> + };
>> +
>> +VDCM driver shall register notifier block with::
>> +
>> + static struct notifier_block pasid_nb_vdcm = {
>> + .notifier_call = pasid_status_change_vdcm,
>> + .priority = IOASID_PRIO_DEVICE,
>> + };
not sure those code snippets are really useful. Maybe simply say who is
supposed to use each prio.
>> +
>> +In both cases, notifier blocks shall be registered on the IOASID set
>> +such that *only* events from the matching VM is received.
>> +
>> +If KVM attempts to register notifier block before the IOASID set is
>> +created for the MM token, the notifier block will be placed on a
using the MM token
>> +pending list inside IOASID core. Once the token matching IOASID set
>> +is created, IOASID will register the notifier block automatically.
Is this implementation mandated? Can't you enforce the ioasid_set to be
created before the notifier gets registered?
>> +IOASID core does not replay events for the existing IOASIDs in the
>> +set. For IOASID set of MM type, notification blocks can be registered
>> +on empty sets only. This is to avoid lost events.
>> +
>> +IOMMU driver shall register notifier block on global chain::
>> +
>> + static struct notifier_block pasid_nb_vtd = {
>> + .notifier_call = pasid_status_change_vtd,
>> + .priority = IOASID_PRIO_IOMMU,
>> + };
>> +
>> +Custom allocator APIs
>> +---------------------
>> +
>> +::
>> +
>> + int ioasid_register_allocator(struct ioasid_allocator_ops *allocator);
>> +
>> + void ioasid_unregister_allocator(struct ioasid_allocator_ops *allocator);
>> +
>> +Allocator Choices
>> +~~~~~~~~~~~~~~~~~
>> +IOASIDs are allocated for both host and guest SVA/IOVA usage. However,
>> +allocators can be different. For example, on VT-d guest PASID
>> +allocation must be performed via a virtual command interface which is
>> +emulated by VMM.
>> +
>> +IOASID core has the notion of "custom allocator" such that guest can
>> +register virtual command allocator that precedes the default one.
>> +
>> +Namespaces
>> +~~~~~~~~~~
>> +IOASIDs are limited system resources that default to 20 bits in
>> +size. Since each device has its own table, theoretically the namespace
>> +can be per device also. However, for security reasons sharing PASID
>> +tables among devices are not good for isolation. Therefore, IOASID
>> +namespace is system-wide.
>
> I don't follow this development. Having per-device PASID table would work
> fine for isolation (assuming no hardware bug necessitating IOMMU groups).
> If I remember correctly IOASID space was chosen to be OS-wide because it
> simplifies the management code (single PASID per task), and it is
> system-wide across VMs only in the case of VT-d scalable mode.
>
>> +
>> +There are also other reasons to have this simpler system-wide
>> +namespace. Take VT-d as an example, VT-d supports shared workqueue
>> +and ENQCMD[1] where one IOASID could be used to submit work on
>
> Maybe use the Sphinx glossary syntax rather than "[1]"
> https://www.sphinx-doc.org/en/master/usage/restructuredtext/directives.html#glossary-directive
>
>> +multiple devices that are shared with other VMs. This requires IOASID
>> +to be system-wide. This is also the reason why guests must use an
>> +emulated virtual command interface to allocate IOASID from the host.
>> +
>> +
>> +Life cycle
>> +==========
>> +This section covers IOASID lifecycle management for both bare-metal
>> +and guest usages. In bare-metal SVA, MMU notifier is directly hooked
>> +up with IOMMU driver, therefore the process address space (MM)
>> +lifecycle is aligned with IOASID.
therefore the IOASID lifecyle matches the process address space (MM)
lifecyle?
>> +
>> +However, guest MMU notifier is not available to host IOMMU driver,
the guest MMU notifier
>> +when guest MM terminates unexpectedly, the events have to go through
the guest MM
>> +VFIO and IOMMU UAPI to reach host IOMMU driver. There are also more
>> +parties involved in guest SVA, e.g. on Intel VT-d platform, IOASIDs
>> +are used by IOMMU driver, KVM, VDCM, and VFIO.
>> +
>> +Native IOASID Life Cycle (VT-d Example)
>> +---------------------------------------
>> +
>> +The normal flow of native SVA code with Intel Data Streaming
>> +Accelerator(DSA) [2] as example:
>> +
>> +1. Host user opens accelerator FD, e.g. DSA driver, or uacce;
>> +2. DSA driver allocate WQ, do sva_bind_device();
>> +3. IOMMU driver calls ioasid_alloc(), then bind PASID with device,
>> + mmu_notifier_get()
>> +4. DMA starts by DSA driver userspace
>> +5. DSA userspace close FD
>> +6. DSA/uacce kernel driver handles FD.close()
>> +7. DSA driver stops DMA
>> +8. DSA driver calls sva_unbind_device();
>> +9. IOMMU driver does unbind, clears PASID context in IOMMU, flush
>> + TLBs. mmu_notifier_put() called.
>> +10. mmu_notifier.release() called, IOMMU SVA code calls ioasid_free()*
>> +11. The IOASID is returned to the pool, reclaimed.
>> +
>> +::
>> +
>
> Use a footnote? https://www.sphinx-doc.org/en/master/usage/restructuredtext/basics.html#footnotes
>
>> + * With ENQCMD, PASID used on VT-d is not released in mmu_notifier() but
>> + mmdrop(). mmdrop comes after FD close. Should not matter.
>
> "comes after FD close, which doesn't make a difference?"
> The following might not be necessary since early process termination is
> described later.
>
>> + If the user process dies unexpectedly, Step #10 may come before
>> + Step #5, in between, all DMA faults discarded. PRQ responded with
>
> PRQ hasn't been defined in this document.
>
>> + code INVALID REQUEST.
>> +
>> +During the normal teardown, the following three steps would happen in
>> +order:
can't this be illustrated in the above 1-11 sequence, just adding NORMAL
TEARDONW before #7?
>> +
>> +1. Device driver stops DMA request
>> +2. IOMMU driver unbinds PASID and mm, flush all TLBs, drain in-flight
>> + requests.
>> +3. IOASID freed
>> +
Then you can just focus on abnormal termination
>> +Exception happens when process terminates *before* device driver stops
>> +DMA and call IOMMU driver to unbind. The flow of process exists are as
Can't this be explained with something simpler looking at the steps 1-11?
>
> "exits"
>
>> +follows:
>> +
>> +::
>> +
>> + do_exit() {
>> + exit_mm() {
>> + mm_put();
>> + exit_mmap() {
>> + intel_invalidate_range() //mmu notifier
>> + tlb_finish_mmu()
>> + mmu_notifier_release(mm) {
>> + intel_iommu_release() {
>> + [2] intel_iommu_teardown_pasid();
>
> Parentheses might be better than square brackets for step numbers
>
>> + intel_iommu_flush_tlbs();
>> + }
>> + // tlb_invalidate_range cb removed
>> + }
>> + unmap_vmas();
>> + free_pgtables(); // IOMMU cannot walk PGT after this
>> + };
>> + }
>> + exit_files(tsk) {
>> + close_files() {
>> + dsa_close();
>> + [1] dsa_stop_dma();
>> + intel_svm_unbind_pasid(); //nothing to do
>> + }
>> + }
>> + }
>> +
>> + mmdrop() /* some random time later, lazy mm user */ {
>> + mm_free_pgd();
>> + destroy_context(mm); {
>> + [3] ioasid_free();
>> + }
>> + }
>> +
>> +As shown in the list above, step #2 could happen before
>> +#1. Unrecoverable(UR) faults could happen between #2 and #1.
>> +
>> +Also notice that TLB invalidation occurs at mmu_notifier
>> +invalidate_range callback as well as the release callback. The reason
>> +is that release callback will delete IOMMU driver from the notifier
>> +chain which may skip invalidate_range() calls during the exit path.
>> +
>> +To avoid unnecessary reporting of UR fault, IOMMU driver shall disable
UR?
>> +fault reporting after free and before unbind.
>> +
>> +Guest IOASID Life Cycle (VT-d Example)
>> +--------------------------------------
>> +Guest IOASID life cycle starts with guest driver open(), this could be
>> +uacce or individual accelerator driver such as DSA. At FD open,
>> +sva_bind_device() is called which triggers a series of actions.
>> +
>> +The example below is an illustration of *normal* operations that
>> +involves *all* the SW components in VT-d. The flow can be simpler if
>> +no ENQCMD is supported.
>> +
>> +::
>> +
>> + VFIO IOMMU KVM VDCM IOASID Ref
>> + ..................................................................
>> + 1 ioasid_register_notifier/_mm()
>> + 2 ioasid_alloc() 1
>> + 3 bind_gpasid()
>> + 4 iommu_bind()->ioasid_get() 2
>> + 5 ioasid_notify(BIND)
>> + 6 -> ioasid_get() 3
>> + 7 -> vmcs_update_atomic()
>> + 8 mdev_write(gpasid)
>> + 9 hpasid=
>> + 10 find_by_spid(gpasid) 4
>> + 11 vdev_write(hpasid)
>> + 12 -------- GUEST STARTS DMA --------------------------
>> + 13 -------- GUEST STOPS DMA --------------------------
>> + 14 mdev_clear(gpasid)
>> + 15 vdev_clear(hpasid)
>> + 16 ioasid_put() 3
>> + 17 unbind_gpasid()
>> + 18 iommu_ubind()
>> + 19 ioasid_notify(UNBIND)
>> + 20 -> vmcs_update_atomic()
>> + 21 -> ioasid_put() 2
>> + 22 ioasid_free() 1
>> + 23 ioasid_put() 0
>> + 24 Reclaimed
>> + -------------- New Life Cycle Begin ----------------------------
>> + 1 ioasid_alloc() -> 1
>> +
>> + Note: IOASID Notification Events: FREE, BIND, UNBIND
>> +
>> +Exception cases arise when a guest crashes or a malicious guest
>> +attempts to cause disruption on the host system. The fault handling
>> +rules are:
>> +
>> +1. IOASID free must *always* succeed.
>> +2. An inactive period may be required before the freed IOASID is
>> + reclaimed. During this period, consumers of IOASID perform cleanup.
>> +3. Malfunction is limited to the guest owned resources for all
>> + programming errors.
>> +
>> +The primary source of exception is when the following are out of
>> +order:
>> +
>> +1. Start/Stop of DMA activity
>> + (Guest device driver, mdev via VFIO)
please explain the meaning of what is inside (): initiator?
>> +2. Setup/Teardown of IOMMU PASID context, IOTLB, DevTLB flushes
>> + (Host IOMMU driver bind/unbind)
>> +3. Setup/Teardown of VMCS PASID translation table entries (KVM) in
>> + case of ENQCMD
>> +4. Programming/Clearing host PASID in VDCM (Host VDCM driver)
>> +5. IOASID alloc/free (Host IOASID)
>> +
>> +VFIO is the *only* user-kernel interface, which is ultimately
>> +responsible for exception handlings.
>
> "handling"
>
>> +
>> +#1 is processed the same way as the assigned device today based on
>> +device file descriptors and events. There is no special handling.
>> +
>> +#3 is based on bind/unbind events emitted by #2.
>> +
>> +#4 is naturally aligned with IOASID life cycle in that an illegal
>> +guest PASID programming would fail in obtaining reference of the
>> +matching host IOASID.
>> +
>> +#5 is similar to #4. The fault will be reported to the user if PASID
>> +used in the ENQCMD is not set up in VMCS PASID translation table.
>> +
>> +Therefore, the remaining out of order problem is between #2 and
>> +#5. I.e. unbind vs. free. More specifically, free before unbind.
>> +
>> +IOASID notifier and refcounting are used to ensure order. Following
>> +a publisher-subscriber pattern where:
with the following actors:
>> +
>> +- Publishers: VFIO & IOMMU
>> +- Subscribers: KVM, VDCM, IOMMU
this may be introduced before.
>> +
>> +IOASID notifier is atomic which requires subscribers to do quick
>> +handling of the event in the atomic context. Workqueue can be used for
>> +any processing that requires thread context.
repetition of what was said before.
IOASID reference must be
>> +acquired before receiving the FREE event. The reference must be
>> +dropped at the end of the processing in order to return the IOASID to
>> +the pool.
>> +
>> +Let's examine the IOASID life cycle again when free happens *before*
>> +unbind. This could be a result of misbehaving guests or crash. Assuming
>> +VFIO cannot enforce unbind->free order. Notice that the setup part up
>> +until step #12 is identical to the normal case, the flow below starts
>> +with step 13.
>> +
>> +::
>> +
>> + VFIO IOMMU KVM VDCM IOASID Ref
>> + ..................................................................
>> + 13 -------- GUEST STARTS DMA --------------------------
>> + 14 -------- *GUEST MISBEHAVES!!!* ----------------
>> + 15 ioasid_free()
>> + 16 ioasid_notify(FREE)
>> + 17 mark_ioasid_inactive[1]
>> + 18 kvm_nb_handler(FREE)
>> + 19 vmcs_update_atomic()
>> + 20 ioasid_put_locked() -> 3
>> + 21 vdcm_nb_handler(FREE)
>> + 22 iomm_nb_handler(FREE)
>> + 23 ioasid_free() returns[2] schedule_work() 2
>> + 24 schedule_work() vdev_clear_wk(hpasid)
>> + 25 teardown_pasid_wk()
>> + 26 ioasid_put() -> 1
>> + 27 ioasid_put() 0
>> + 28 Reclaimed
>> + 29 unbind_gpasid()
>> + 30 iommu_unbind()->ioasid_find() Fails[3]
>> + -------------- New Life Cycle Begin ----------------------------
>> +
>> +Note:
>> +
>> +1. By marking IOASID inactive at step #17, no new references can be
>
> Is "inactive" FREE_PENDING?
>
>> + held. ioasid_get/find() will return -ENOENT;
>> +2. After step #23, all events can go out of order. Shall not affect
>> + the outcome.
>> +3. IOMMU driver fails to find private data for unbinding. If unbind is
>> + called after the same IOASID is allocated for the same guest again,
>> + this is a programming error. The damage is limited to the guest
>> + itself since unbind performs permission checking based on the
>> + IOASID set associated with the guest process.
>> +
>> +KVM PASID Translation Table Updates
>> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>> +Per VM PASID translation table is maintained by KVM in order to
>> +support ENQCMD in the guest. The table contains host-guest PASID
>> +translations to be consumed by CPU ucode. The synchronization of the
>> +PASID states depends on VFIO/IOMMU driver, where IOCTL and atomic
>> +notifiers are used. KVM must register IOASID notifier per VM instance
>> +during launch time. The following events are handled:
>> +
>> +1. BIND/UNBIND
>> +2. FREE
>> +
>> +Rules:
>> +
>> +1. Multiple devices can bind with the same PASID, this can be different PCI
>> + devices or mdevs within the same PCI device. However, only the
>> + *first* BIND and *last* UNBIND emit notifications.
>> +2. IOASID code is responsible for ensuring the correctness of H-G
>> + PASID mapping. There is no need for KVM to validate the
>> + notification data.
>> +3. When UNBIND happens *after* FREE, KVM will see error in
>> + ioasid_get() even when the reclaim is not done. IOMMU driver will
>> + also avoid sending UNBIND if the PASID is already FREE.
>> +4. When KVM terminates *before* FREE & UNBIND, references will be
>> + dropped for all host PASIDs.
>> +
>> +VDCM PASID Programming
>> +~~~~~~~~~~~~~~~~~~~~~~
>> +VDCM composes virtual devices and exposes them to the guests. When
>> +the guest allocates a PASID then program it to the virtual device, VDCM
programs as well
>> +intercepts the programming attempt then program the matching host
>
> "programs"
>
> Thanks,
> Jean
>
>> +PASID on to the hardware.
>> +Conversely, when a device is going away, VDCM must be informed such
>> +that PASID context on the hardware can be cleared. There could be
>> +multiple mdevs assigned to different guests in the same VDCM. Since
>> +the PASID table is shared at PCI device level, lazy clearing is not
>> +secure. A malicious guest can attack by using newly freed PASIDs that
>> +are allocated by another guest.
>> +
>> +By holding a reference of the PASID until VDCM cleans up the HW context,
>> +it is guaranteed that PASID life cycles do not cross within the same
>> +device.
>> +
>> +
>> +Reference
>> +====================================================
>> +1. https://software.intel.com/sites/default/files/managed/c5/15/architecture-instruction-set-extensions-programming-reference.pdf
>> +
>> +2. https://01.org/blogs/2019/introducing-intel-data-streaming-accelerator
>> +
>> +3. https://software.intel.com/en-us/download/intel-data-streaming-accelerator-preliminary-architecture-specification
>> --
>> 2.7.4

Thanks

Eric
>>
>

2020-08-28 16:55:11

by Jacob Pan

[permalink] [raw]
Subject: Re: [PATCH v2 1/9] docs: Document IO Address Space ID (IOASID) APIs

Hi Baolu,

Thanks for the review!

On Sun, 23 Aug 2020 15:05:08 +0800
Lu Baolu <[email protected]> wrote:

> Hi Jacob,
>
> On 2020/8/22 12:35, Jacob Pan wrote:
> > IOASID is used to identify address spaces that can be targeted by
> > device DMA. It is a system-wide resource that is essential to its
> > many users. This document is an attempt to help developers from all
> > vendors navigate the APIs. At this time, ARM SMMU and Intel’s
> > Scalable IO Virtualization (SIOV) enabled platforms are the primary
> > users of IOASID. Examples of how SIOV components interact with
> > IOASID APIs are provided in that many APIs are driven by the
> > requirements from SIOV.
> >
> > Signed-off-by: Liu Yi L <[email protected]>
> > Signed-off-by: Wu Hao <[email protected]>
> > Signed-off-by: Jacob Pan <[email protected]>
> > ---
> > Documentation/ioasid.rst | 618
> > +++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 618
> > insertions(+) create mode 100644 Documentation/ioasid.rst
> >
> > diff --git a/Documentation/ioasid.rst b/Documentation/ioasid.rst
> > new file mode 100644
> > index 000000000000..b6a8cdc885ff
> > --- /dev/null
> > +++ b/Documentation/ioasid.rst
> > @@ -0,0 +1,618 @@
> > +.. ioasid:
> > +
> > +=====================================
> > +IO Address Space ID
> > +=====================================
> > +
> > +IOASID is a generic name for PCIe Process Address ID (PASID) or ARM
> > +SMMU sub-stream ID. An IOASID identifies an address space that DMA
> > +requests can target.
> > +
> > +The primary use cases for IOASID are Shared Virtual Address (SVA)
> > and +IO Virtual Address (IOVA). However, the requirements for
> > IOASID
>
> Can you please elaborate a bit more about how ioasid is used by IOVA?
>
Good point, I will add a paragraph for IOVA usage. Something like this:
"For IOVA, IOASID #0 is typically used for DMA request without
PASID. However, some architectures such as VT-d also offers the
flexibility of using any PASID for DMA request without PASID. For
example, on VT-d PASID #0 is used for PCI device RID2PASID and for
SIOV each auxilary domain also allocates a non-zero default PASID for
DMA request w/o PASID. PASID #0, is reserved and not allocated from any
ioasid_set."


> > +management can vary among hardware architectures.
> > +
> > +This document covers the generic features supported by IOASID
> > +APIs. Vendor-specific use cases are also illustrated with Intel's
> > VT-d +based platforms as the first example.
> > +
> > +.. contents:: :local:
> > +
> > +Glossary
> > +========
> > +PASID - Process Address Space ID
> > +
> > +IOASID - IO Address Space ID (generic term for PCIe PASID and
> > +sub-stream ID in SMMU)
> > +
> > +SVA/SVM - Shared Virtual Addressing/Memory
> > +
> > +ENQCMD - New Intel X86 ISA for efficient workqueue submission [1]
> > +
> > +DSA - Intel Data Streaming Accelerator [2]
> > +
> > +VDCM - Virtual device composition module [3]
>
> Capitalize the first letter of each word.
>
will do.

> > +
> > +SIOV - Intel Scalable IO Virtualization
> > +
> > +
> > +Key Concepts
> > +============
> > +
> > +IOASID Set
> > +-----------
> > +An IOASID set is a group of IOASIDs allocated from the system-wide
> > +IOASID pool. An IOASID set is created and can be identified by a
> > +token of u64. Refer to IOASID set APIs for more details.
> > +
> > +IOASID set is particularly useful for guest SVA where each guest
> > could +have its own IOASID set for security and efficiency reasons.
> > +
> > +IOASID Set Private ID (SPID)
> > +----------------------------
> > +SPIDs are introduced as IOASIDs within its set. Each SPID maps to a
> > +system-wide IOASID but the namespace of SPID is within its IOASID
> > +set. SPIDs can be used as guest IOASIDs where each guest could do
> > +IOASID allocation from its own pool and map them to host physical
> > +IOASIDs. SPIDs are particularly useful for supporting live
> > migration +where decoupling guest and host physical resources are
> > necessary. +
> > +For example, two VMs can both allocate guest PASID/SPID #101 but
> > map to +different host PASIDs #201 and #202 respectively as shown
> > in the +diagram below.
> > +::
> > +
> > + .------------------. .------------------.
> > + | VM 1 | | VM 2 |
> > + | | | |
> > + |------------------| |------------------|
> > + | GPASID/SPID 101 | | GPASID/SPID 101 |
> > + '------------------' -------------------' Guest
> > + __________|______________________|______________________
> > + | | Host
> > + v v
> > + .------------------. .------------------.
> > + | Host IOASID 201 | | Host IOASID 202 |
> > + '------------------' '------------------'
> > + | IOASID set 1 | | IOASID set 2 |
> > + '------------------' '------------------'
> > +
> > +Guest PASID is treated as IOASID set private ID (SPID) within an
> > +IOASID set, mappings between guest and host IOASIDs are stored in
> > the +set for inquiry.
>
> Is there a real IOASID set allocated in the host which represent the
> SPID?
>
SPIDs are not allocated from the host IOASID set, but the backing host
IOASID of the SPID is. So the same SPID # can belong to different
IOASID set.

SPIDs are allocated by VMM and given to the host, IOASID code in the
host just stores it in the ioasid_set.

> > +
> > +IOASID APIs
> > +===========
> > +To get the IOASID APIs, users must #include <linux/ioasid.h>.
> > These APIs +serve the following functionalities:
> > +
> > + - IOASID allocation/Free
> > + - Group management in the form of ioasid_set
> > + - Private data storage and lookup
> > + - Reference counting
> > + - Event notification in case of state change
> > +
> > +IOASID Set Level APIs
> > +--------------------------
> > +For use cases such as guest SVA it is necessary to manage IOASIDs
> > at +a group level. For example, VMs may allocate multiple IOASIDs
> > for +guest process address sharing (vSVA). It is imperative to
> > enforce +VM-IOASID ownership such that malicious guest cannot
> > target DMA +traffic outside its own IOASIDs, or free an active
> > IOASID belong to +another VM.
> > +::
> > +
> > + struct ioasid_set *ioasid_alloc_set(void *token, ioasid_t quota,
> > u32 type) +
> > + int ioasid_adjust_set(struct ioasid_set *set, int quota);
> > +
> > + void ioasid_set_get(struct ioasid_set *set)
> > +
> > + void ioasid_set_put(struct ioasid_set *set)
> > +
> > + void ioasid_set_get_locked(struct ioasid_set *set)
> > +
> > + void ioasid_set_put_locked(struct ioasid_set *set)
> > +
> > + int ioasid_set_for_each_ioasid(struct ioasid_set *sdata,
> > + void (*fn)(ioasid_t id, void
> > *data),
> > + void *data)
> > +
> > +
> > +IOASID set concept is introduced to represent such IOASID groups.
> > Each +IOASID set is created with a token which can be one of the
> > following +types:
> > +
> > + - IOASID_SET_TYPE_NULL (Arbitrary u64 value)
> > + - IOASID_SET_TYPE_MM (Set token is a mm_struct)
> > +
> > +The explicit MM token type is useful when multiple users of an
> > IOASID +set under the same process need to communicate about their
> > shared IOASIDs. +E.g. An IOASID set created by VFIO for one guest
> > can be associated +with the KVM instance for the same guest since
> > they share a common mm_struct. +
> > +The IOASID set APIs serve the following purposes:
> > +
> > + - Ownership/permission enforcement
> > + - Take collective actions, e.g. free an entire set
> > + - Event notifications within a set
> > + - Look up a set based on token
> > + - Quota enforcement
> > +
> > +Individual IOASID APIs
> > +----------------------
> > +Once an ioasid_set is created, IOASIDs can be allocated from the
> > set. +Within the IOASID set namespace, set private ID (SPID) is
> > supported. In +the VM use case, SPID can be used for storing guest
> > PASID. +
> > +::
> > +
> > + ioasid_t ioasid_alloc(struct ioasid_set *set, ioasid_t min,
> > ioasid_t max,
> > + void *private);
> > +
> > + int ioasid_get(struct ioasid_set *set, ioasid_t ioasid);
> > +
> > + void ioasid_put(struct ioasid_set *set, ioasid_t ioasid);
> > +
> > + int ioasid_get_locked(struct ioasid_set *set, ioasid_t ioasid);
> > +
> > + void ioasid_put_locked(struct ioasid_set *set, ioasid_t ioasid);
> > +
> > + void *ioasid_find(struct ioasid_set *set, ioasid_t ioasid,
> > + bool (*getter)(void *));
> > +
> > + ioasid_t ioasid_find_by_spid(struct ioasid_set *set, ioasid_t
> > spid) +
> > + int ioasid_attach_data(struct ioasid_set *set, ioasid_t ioasid,
> > + void *data);
> > + int ioasid_attach_spid(struct ioasid_set *set, ioasid_t ioasid,
> > + ioasid_t ssid);
> > +
> > +
> > +Notifications
> > +-------------
> > +An IOASID may have multiple users, each user may have hardware
> > context +associated with an IOASID. When the status of an IOASID
> > changes, +e.g. an IOASID is being freed, users need to be notified
> > such that the +associated hardware context can be cleared, flushed,
> > and drained. +
> > +::
> > +
> > + int ioasid_register_notifier(struct ioasid_set *set, struct
> > + notifier_block *nb)
> > +
> > + void ioasid_unregister_notifier(struct ioasid_set *set,
> > + struct notifier_block *nb)
> > +
> > + int ioasid_register_notifier_mm(struct mm_struct *mm, struct
> > + notifier_block *nb)
> > +
> > + void ioasid_unregister_notifier_mm(struct mm_struct *mm, struct
> > + notifier_block *nb)
> > +
> > + int ioasid_notify(ioasid_t ioasid, enum ioasid_notify_val cmd,
> > + unsigned int flags)
> > +
> > +
> > +Events
> > +~~~~~~
> > +Notification events are pertinent to individual IOASIDs, they can
> > be +one of the following:
> > +
> > + - ALLOC
> > + - FREE
> > + - BIND
> > + - UNBIND
> > +
> > +Ordering
> > +~~~~~~~~
> > +Ordering is supported by IOASID notification priorities as the
> > +following (in ascending order):
>
> What does ascending order exactly mean here? LAST->IOMMU->DEVICE...?
>
Yes. CPU has the highest priority and will get notified first.

> > +
> > +::
> > +
> > + enum ioasid_notifier_prios {
> > + IOASID_PRIO_LAST,
> > + IOASID_PRIO_IOMMU,
> > + IOASID_PRIO_DEVICE,
> > + IOASID_PRIO_CPU,
> > + };
> > +
> > +The typical use case is when an IOASID is freed due to an
> > exception, DMA +source should be quiesced before tearing down other
> > hardware contexts +in the system. This will reduce the churn in
> > handling faults. DMA work +submission is performed by the CPU which
> > is granted higher priority than +devices.
> > +
> > +
> > +Scopes
> > +~~~~~~
> > +There are two types of notifiers in IOASID core: system-wide and
> > +ioasid_set-wide.
> > +
> > +System-wide notifier is catering for users that need to handle all
> > +IOASIDs in the system. E.g. The IOMMU driver handles all IOASIDs.
> > +
> > +Per ioasid_set notifier can be used by VM specific components such
> > as +KVM. After all, each KVM instance only cares about IOASIDs
> > within its +own set.
> > +
> > +
> > +Atomicity
> > +~~~~~~~~~
> > +IOASID notifiers are atomic due to spinlocks used inside the IOASID
> > +core. For tasks cannot be completed in the notifier handler, async
> > work +can be submitted to complete the work later as long as there
> > is no +ordering requirement.
> > +
> > +Reference counting
> > +------------------
> > +IOASID lifecycle management is based on reference counting. Users
> > of +IOASID intend to align lifecycle with the IOASID need to hold
> > +reference of the IOASID. IOASID will not be returned to the pool
> > for +allocation until all references are dropped. Calling
> > ioasid_free() +will mark the IOASID as FREE_PENDING if the IOASID
> > has outstanding +reference. ioasid_get() is not allowed once an
> > IOASID is in the +FREE_PENDING state.
> > +
> > +Event notifications are used to inform users of IOASID status
> > change. +IOASID_FREE event prompts users to drop their references
> > after +clearing its context.
> > +
> > +For example, on VT-d platform when an IOASID is freed, teardown
> > +actions are performed on KVM, device driver, and IOMMU driver.
> > +KVM shall register notifier block with::
> > +
> > + static struct notifier_block pasid_nb_kvm = {
> > + .notifier_call = pasid_status_change_kvm,
> > + .priority = IOASID_PRIO_CPU,
> > + };
> > +
> > +VDCM driver shall register notifier block with::
> > +
> > + static struct notifier_block pasid_nb_vdcm = {
> > + .notifier_call = pasid_status_change_vdcm,
> > + .priority = IOASID_PRIO_DEVICE,
> > + };
> > +
> > +In both cases, notifier blocks shall be registered on the IOASID
> > set +such that *only* events from the matching VM is received.
> > +
> > +If KVM attempts to register notifier block before the IOASID set is
> > +created for the MM token, the notifier block will be placed on a
> > +pending list inside IOASID core. Once the token matching IOASID set
> > +is created, IOASID will register the notifier block automatically.
> > +IOASID core does not replay events for the existing IOASIDs in the
> > +set. For IOASID set of MM type, notification blocks can be
> > registered +on empty sets only. This is to avoid lost events.
> > +
> > +IOMMU driver shall register notifier block on global chain::
> > +
> > + static struct notifier_block pasid_nb_vtd = {
> > + .notifier_call = pasid_status_change_vtd,
> > + .priority = IOASID_PRIO_IOMMU,
> > + };
> > +
> > +Custom allocator APIs
> > +---------------------
> > +
> > +::
> > +
> > + int ioasid_register_allocator(struct ioasid_allocator_ops
> > *allocator); +
> > + void ioasid_unregister_allocator(struct ioasid_allocator_ops
> > *allocator); +
> > +Allocator Choices
> > +~~~~~~~~~~~~~~~~~
> > +IOASIDs are allocated for both host and guest SVA/IOVA usage.
> > However, +allocators can be different. For example, on VT-d guest
> > PASID +allocation must be performed via a virtual command interface
> > which is +emulated by VMM.
> > +
> > +IOASID core has the notion of "custom allocator" such that guest
> > can +register virtual command allocator that precedes the default
> > one. +
> > +Namespaces
> > +~~~~~~~~~~
> > +IOASIDs are limited system resources that default to 20 bits in
> > +size. Since each device has its own table, theoretically the
> > namespace +can be per device also. However, for security reasons
> > sharing PASID +tables among devices are not good for isolation.
> > Therefore, IOASID +namespace is system-wide.
> > +
> > +There are also other reasons to have this simpler system-wide
> > +namespace. Take VT-d as an example, VT-d supports shared workqueue
> > +and ENQCMD[1] where one IOASID could be used to submit work on
> > +multiple devices that are shared with other VMs. This requires
> > IOASID +to be system-wide. This is also the reason why guests must
> > use an +emulated virtual command interface to allocate IOASID from
> > the host. +
> > +
> > +Life cycle
> > +==========
> > +This section covers IOASID lifecycle management for both bare-metal
> > +and guest usages. In bare-metal SVA, MMU notifier is directly
> > hooked +up with IOMMU driver, therefore the process address space
> > (MM) +lifecycle is aligned with IOASID.
>
> MMU notifier for SVA mainly serves IOMMU cache flushes, right? The
> IOASID life cycle for bare matal SVA is managed by the device driver
> through the iommu sva api's iommu_sva_(un)bind_device()?
>
True that lifecycle between iommu and device are aligned by sva APIs.
But between mm/cpu and iommu, it depends on mmu_notifier.release(). In
case process terminates unexpectedly.

> > +
> > +However, guest MMU notifier is not available to host IOMMU driver,
> > +when guest MM terminates unexpectedly, the events have to go
> > through +VFIO and IOMMU UAPI to reach host IOMMU driver. There are
> > also more +parties involved in guest SVA, e.g. on Intel VT-d
> > platform, IOASIDs +are used by IOMMU driver, KVM, VDCM, and VFIO.
> > +
> > +Native IOASID Life Cycle (VT-d Example)
> > +---------------------------------------
> > +
> > +The normal flow of native SVA code with Intel Data Streaming
> > +Accelerator(DSA) [2] as example:
> > +
> > +1. Host user opens accelerator FD, e.g. DSA driver, or uacce;
> > +2. DSA driver allocate WQ, do sva_bind_device();
> > +3. IOMMU driver calls ioasid_alloc(), then bind PASID with device,
> > + mmu_notifier_get()
> > +4. DMA starts by DSA driver userspace
> > +5. DSA userspace close FD
> > +6. DSA/uacce kernel driver handles FD.close()
> > +7. DSA driver stops DMA
> > +8. DSA driver calls sva_unbind_device();
> > +9. IOMMU driver does unbind, clears PASID context in IOMMU, flush
> > + TLBs. mmu_notifier_put() called.
> > +10. mmu_notifier.release() called, IOMMU SVA code calls
> > ioasid_free()* +11. The IOASID is returned to the pool, reclaimed.
> > +
> > +::
> > +
> > + * With ENQCMD, PASID used on VT-d is not released in
> > mmu_notifier() but
> > + mmdrop(). mmdrop comes after FD close. Should not matter.
> > + If the user process dies unexpectedly, Step #10 may come
> > before
> > + Step #5, in between, all DMA faults discarded. PRQ responded
> > with
> > + code INVALID REQUEST.
> > +
> > +During the normal teardown, the following three steps would happen
> > in +order:
> > +
> > +1. Device driver stops DMA request
> > +2. IOMMU driver unbinds PASID and mm, flush all TLBs, drain
> > in-flight
> > + requests.
> > +3. IOASID freed
> > +
> > +Exception happens when process terminates *before* device driver
> > stops +DMA and call IOMMU driver to unbind. The flow of process
> > exists are as +follows:
> > +
> > +::
> > +
> > + do_exit() {
> > + exit_mm() {
> > + mm_put();
> > + exit_mmap() {
> > + intel_invalidate_range() //mmu notifier
> > + tlb_finish_mmu()
> > + mmu_notifier_release(mm) {
> > + intel_iommu_release() {
>
> intel_mm_release()
Good catch,

>
> > + [2]
> > intel_iommu_teardown_pasid();
> > + intel_iommu_flush_tlbs();
> > + }
> > + // tlb_invalidate_range cb removed
> > + }
> > + unmap_vmas();
> > + free_pgtables(); // IOMMU cannot walk PGT
> > after this
> > + };
> > + }
> > + exit_files(tsk) {
> > + close_files() {
> > + dsa_close();
> > + [1] dsa_stop_dma();
> > + intel_svm_unbind_pasid(); //nothing to do
> > + }
> > + }
> > + }
> > +
> > + mmdrop() /* some random time later, lazy mm user */ {
> > + mm_free_pgd();
> > + destroy_context(mm); {
> > + [3] ioasid_free();
> > + }
> > + }
> > +
> > +As shown in the list above, step #2 could happen before
> > +#1. Unrecoverable(UR) faults could happen between #2 and #1.
>
> The VT-d hardware will ignore UR faults due to the setting of FPD bit
> of the PASID entry. The software won't see UR faults.
>
Yes, here I should note that.
"Fault processing is disabled by the IOMMU driver in #2, therefore the
UR fault never reaches the driver."

> > +
> > +Also notice that TLB invalidation occurs at mmu_notifier
> > +invalidate_range callback as well as the release callback. The
> > reason +is that release callback will delete IOMMU driver from the
> > notifier +chain which may skip invalidate_range() calls during the
> > exit path. +
> > +To avoid unnecessary reporting of UR fault, IOMMU driver shall
> > disable +fault reporting after free and before unbind.
> > +
> > +Guest IOASID Life Cycle (VT-d Example)
> > +--------------------------------------
> > +Guest IOASID life cycle starts with guest driver open(), this
> > could be +uacce or individual accelerator driver such as DSA. At FD
> > open, +sva_bind_device() is called which triggers a series of
> > actions. +
> > +The example below is an illustration of *normal* operations that
> > +involves *all* the SW components in VT-d. The flow can be simpler
> > if +no ENQCMD is supported.
> > +
> > +::
> > +
> > + VFIO IOMMU KVM VDCM IOASID
> > Ref
> > + ..................................................................
> > + 1 ioasid_register_notifier/_mm()
> > + 2 ioasid_alloc() 1
> > + 3 bind_gpasid()
> > + 4 iommu_bind()->ioasid_get() 2
> > + 5 ioasid_notify(BIND)
> > + 6 -> ioasid_get() 3
> > + 7 -> vmcs_update_atomic()
> > + 8 mdev_write(gpasid)
> > + 9 hpasid=
> > + 10 find_by_spid(gpasid) 4
> > + 11 vdev_write(hpasid)
> > + 12 -------- GUEST STARTS DMA --------------------------
> > + 13 -------- GUEST STOPS DMA --------------------------
> > + 14 mdev_clear(gpasid)
> > + 15 vdev_clear(hpasid)
> > + 16 ioasid_put()
> > 3
> > + 17 unbind_gpasid()
> > + 18 iommu_ubind()
> > + 19 ioasid_notify(UNBIND)
> > + 20 -> vmcs_update_atomic()
> > + 21 -> ioasid_put()
> > 2
> > + 22 ioasid_free()
> > 1
> > + 23 ioasid_put()
> > 0
> > + 24 Reclaimed
> > + -------------- New Life Cycle Begin ----------------------------
> > + 1 ioasid_alloc() ->
> > 1 +
> > + Note: IOASID Notification Events: FREE, BIND, UNBIND
> > +
> > +Exception cases arise when a guest crashes or a malicious guest
> > +attempts to cause disruption on the host system. The fault handling
> > +rules are:
> > +
> > +1. IOASID free must *always* succeed.
> > +2. An inactive period may be required before the freed IOASID is
> > + reclaimed. During this period, consumers of IOASID perform
> > cleanup. +3. Malfunction is limited to the guest owned resources
> > for all
> > + programming errors.
> > +
> > +The primary source of exception is when the following are out of
> > +order:
> > +
> > +1. Start/Stop of DMA activity
> > + (Guest device driver, mdev via VFIO)
> > +2. Setup/Teardown of IOMMU PASID context, IOTLB, DevTLB flushes
> > + (Host IOMMU driver bind/unbind)
> > +3. Setup/Teardown of VMCS PASID translation table entries (KVM) in
> > + case of ENQCMD
> > +4. Programming/Clearing host PASID in VDCM (Host VDCM driver)
> > +5. IOASID alloc/free (Host IOASID)
> > +
> > +VFIO is the *only* user-kernel interface, which is ultimately
> > +responsible for exception handlings.
> > +
> > +#1 is processed the same way as the assigned device today based on
> > +device file descriptors and events. There is no special handling.
> > +
> > +#3 is based on bind/unbind events emitted by #2.
> > +
> > +#4 is naturally aligned with IOASID life cycle in that an illegal
> > +guest PASID programming would fail in obtaining reference of the
> > +matching host IOASID.
> > +
> > +#5 is similar to #4. The fault will be reported to the user if
> > PASID +used in the ENQCMD is not set up in VMCS PASID translation
> > table. +
> > +Therefore, the remaining out of order problem is between #2 and
> > +#5. I.e. unbind vs. free. More specifically, free before unbind.
> > +
> > +IOASID notifier and refcounting are used to ensure order. Following
> > +a publisher-subscriber pattern where:
> > +
> > +- Publishers: VFIO & IOMMU
> > +- Subscribers: KVM, VDCM, IOMMU
> > +
> > +IOASID notifier is atomic which requires subscribers to do quick
> > +handling of the event in the atomic context. Workqueue can be used
> > for +any processing that requires thread context. IOASID reference
> > must be +acquired before receiving the FREE event. The reference
> > must be +dropped at the end of the processing in order to return
> > the IOASID to +the pool.
> > +
> > +Let's examine the IOASID life cycle again when free happens
> > *before* +unbind. This could be a result of misbehaving guests or
> > crash. Assuming +VFIO cannot enforce unbind->free order. Notice
> > that the setup part up +until step #12 is identical to the normal
> > case, the flow below starts +with step 13.
> > +
> > +::
> > +
> > + VFIO IOMMU KVM VDCM IOASID
> > Ref
> > + ..................................................................
> > + 13 -------- GUEST STARTS DMA --------------------------
> > + 14 -------- *GUEST MISBEHAVES!!!* ----------------
> > + 15 ioasid_free()
> > + 16
> > ioasid_notify(FREE)
> > + 17
> > mark_ioasid_inactive[1]
> > + 18 kvm_nb_handler(FREE)
> > + 19 vmcs_update_atomic()
> > + 20 ioasid_put_locked() -> 3
> > + 21 vdcm_nb_handler(FREE)
> > + 22 iomm_nb_handler(FREE)
> > + 23 ioasid_free() returns[2] schedule_work() 2
> > + 24 schedule_work() vdev_clear_wk(hpasid)
> > + 25 teardown_pasid_wk()
> > + 26 ioasid_put() -> 1
> > + 27 ioasid_put() 0
> > + 28 Reclaimed
> > + 29 unbind_gpasid()
> > + 30 iommu_unbind()->ioasid_find() Fails[3]
> > + -------------- New Life Cycle Begin ----------------------------
> > +
> > +Note:
> > +
> > +1. By marking IOASID inactive at step #17, no new references can be
> > + held. ioasid_get/find() will return -ENOENT;
> > +2. After step #23, all events can go out of order. Shall not affect
> > + the outcome.
> > +3. IOMMU driver fails to find private data for unbinding. If
> > unbind is
> > + called after the same IOASID is allocated for the same guest
> > again,
> > + this is a programming error. The damage is limited to the guest
> > + itself since unbind performs permission checking based on the
> > + IOASID set associated with the guest process.
> > +
> > +KVM PASID Translation Table Updates
> > +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> > +Per VM PASID translation table is maintained by KVM in order to
> > +support ENQCMD in the guest. The table contains host-guest PASID
> > +translations to be consumed by CPU ucode. The synchronization of
> > the +PASID states depends on VFIO/IOMMU driver, where IOCTL and
> > atomic +notifiers are used. KVM must register IOASID notifier per
> > VM instance +during launch time. The following events are handled:
> > +
> > +1. BIND/UNBIND
> > +2. FREE
> > +
> > +Rules:
> > +
> > +1. Multiple devices can bind with the same PASID, this can be
> > different PCI
> > + devices or mdevs within the same PCI device. However, only the
> > + *first* BIND and *last* UNBIND emit notifications.
> > +2. IOASID code is responsible for ensuring the correctness of H-G
> > + PASID mapping. There is no need for KVM to validate the
> > + notification data.
> > +3. When UNBIND happens *after* FREE, KVM will see error in
> > + ioasid_get() even when the reclaim is not done. IOMMU driver
> > will
> > + also avoid sending UNBIND if the PASID is already FREE.
> > +4. When KVM terminates *before* FREE & UNBIND, references will be
> > + dropped for all host PASIDs.
> > +
> > +VDCM PASID Programming
> > +~~~~~~~~~~~~~~~~~~~~~~
> > +VDCM composes virtual devices and exposes them to the guests. When
> > +the guest allocates a PASID then program it to the virtual device,
> > VDCM +intercepts the programming attempt then program the matching
> > host +PASID on to the hardware.
> > +Conversely, when a device is going away, VDCM must be informed such
> > +that PASID context on the hardware can be cleared. There could be
> > +multiple mdevs assigned to different guests in the same VDCM. Since
> > +the PASID table is shared at PCI device level, lazy clearing is not
> > +secure. A malicious guest can attack by using newly freed PASIDs
> > that +are allocated by another guest.
> > +
> > +By holding a reference of the PASID until VDCM cleans up the HW
> > context, +it is guaranteed that PASID life cycles do not cross
> > within the same +device.
> > +
> > +
> > +Reference
> > +====================================================
> > +1.
> > https://software.intel.com/sites/default/files/managed/c5/15/architecture-instruction-set-extensions-programming-reference.pdf
> > + +2.
> > https://01.org/blogs/2019/introducing-intel-data-streaming-accelerator
> > + +3.
> > https://software.intel.com/en-us/download/intel-data-streaming-accelerator-preliminary-architecture-specification
>
> Best regards,
> baolu
> _______________________________________________
> iommu mailing list
> [email protected]
> https://lists.linuxfoundation.org/mailman/listinfo/iommu
[Jacob Pan]

2020-08-28 22:35:20

by Jacob Pan

[permalink] [raw]
Subject: Re: [PATCH v2 1/9] docs: Document IO Address Space ID (IOASID) APIs

Hi Jean,

Thanks for the review!

On Mon, 24 Aug 2020 12:32:39 +0200
Jean-Philippe Brucker <[email protected]> wrote:

> On Fri, Aug 21, 2020 at 09:35:10PM -0700, Jacob Pan wrote:
> > IOASID is used to identify address spaces that can be targeted by
> > device DMA. It is a system-wide resource that is essential to its
> > many users. This document is an attempt to help developers from all
> > vendors navigate the APIs. At this time, ARM SMMU and Intel’s
> > Scalable IO Virtualization (SIOV) enabled platforms are the primary
> > users of IOASID. Examples of how SIOV components interact with
> > IOASID APIs are provided in that many APIs are driven by the
> > requirements from SIOV.
> >
> > Signed-off-by: Liu Yi L <[email protected]>
> > Signed-off-by: Wu Hao <[email protected]>
> > Signed-off-by: Jacob Pan <[email protected]>
> > ---
> > Documentation/ioasid.rst | 618
> > +++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 618
> > insertions(+) create mode 100644 Documentation/ioasid.rst
> >
> > diff --git a/Documentation/ioasid.rst b/Documentation/ioasid.rst
>
> Thanks for writing this up. Should it go to
> Documentation/driver-api/, or Documentation/driver-api/iommu/? I
> think this also needs to Cc [email protected] and
> [email protected]
>
Good point, I think Documentation/driver-api/ is good for now as there
are no other IOMMU docs.
Will CC Jon also.

> > new file mode 100644
> > index 000000000000..b6a8cdc885ff
> > --- /dev/null
> > +++ b/Documentation/ioasid.rst
> > @@ -0,0 +1,618 @@
> > +.. ioasid:
> > +
> > +=====================================
> > +IO Address Space ID
> > +=====================================
> > +
> > +IOASID is a generic name for PCIe Process Address ID (PASID) or ARM
> > +SMMU sub-stream ID. An IOASID identifies an address space that
> > DMA
>
> "SubstreamID"
>

> > +requests can target.
> > +
> > +The primary use cases for IOASID are Shared Virtual Address (SVA)
> > and +IO Virtual Address (IOVA). However, the requirements for
> > IOASID
>
> IOVA alone isn't a use case, maybe "multiple IOVA spaces per device"?
>
Yes, I meant guest IOVA for mdev which has "multiple IOVA spaces per
device" based on aux domain. I will add this to the IOVA case description.

"The primary use cases for IOASID are Shared Virtual Address (SVA) and
multiple IOVA spaces per device. However, the requirements for IOASID
management can vary among hardware architectures.

For baremetal IOVA, IOASID #0 is used for DMA request without
PASID. Even though some architectures such as VT-d also offers
the flexibility of using any PASIDs for DMA request without PASID.
PASID #0 is reserved and not allocated from any ioasid_set.

Multiple IOVA spaces per device are mapped to auxiliary domains which
can be used for mediated device assignment with and without a virtual
IOMMU (vIOMMU). An IOASID is allocated for each auxiliary domain as default
PASID. Without vIOMMU, default IOASID is used for DMA map/unmap
APIs. With vIOMMU, default IOASID is used for guest IOVA where DMA
request with PASID is required for the device. The reason is that
there is only one PASID #0 per device, e.g. VT-d, RID_PASID is per PCI
device.
"

> > +management can vary among hardware architectures.
> > +
> > +This document covers the generic features supported by IOASID
> > +APIs. Vendor-specific use cases are also illustrated with Intel's
> > VT-d +based platforms as the first example.
> > +
> > +.. contents:: :local:
> > +
> > +Glossary
> > +========
> > +PASID - Process Address Space ID
> > +
> > +IOASID - IO Address Space ID (generic term for PCIe PASID and
> > +sub-stream ID in SMMU)
>
> "SubstreamID"
>
will fix.

> > +
> > +SVA/SVM - Shared Virtual Addressing/Memory
> > +
> > +ENQCMD - New Intel X86 ISA for efficient workqueue submission [1]
>
> Maybe drop the "New", to keep the documentation perennial. It might be
> good to add internal links here to the specifications URLs at the
> bottom.
>
Good idea

> > +
> > +DSA - Intel Data Streaming Accelerator [2]
> > +
> > +VDCM - Virtual device composition module [3]
> > +
> > +SIOV - Intel Scalable IO Virtualization
> > +
> > +
> > +Key Concepts
> > +============
> > +
> > +IOASID Set
> > +-----------
> > +An IOASID set is a group of IOASIDs allocated from the system-wide
> > +IOASID pool. An IOASID set is created and can be identified by a
> > +token of u64. Refer to IOASID set APIs for more details.
>
> Identified either by an u64 or an mm_struct, right? Maybe just drop
> the second sentence if it's detailed in the IOASID set section below.
>
Sounds good.

> > +
> > +IOASID set is particularly useful for guest SVA where each guest
> > could +have its own IOASID set for security and efficiency reasons.
> > +
> > +IOASID Set Private ID (SPID)
> > +----------------------------
> > +SPIDs are introduced as IOASIDs within its set. Each SPID maps to a
> > +system-wide IOASID but the namespace of SPID is within its IOASID
> > +set.
>
> The intro isn't super clear. Perhaps this is simpler:
> "Each IOASID set has a private namespace of SPIDs. An SPID maps to a
> single system-wide IOASID."
>
Sounds better, thanks for the rewrite.

> > SPIDs can be used as guest IOASIDs where each guest could do
> > +IOASID allocation from its own pool and map them to host physical
> > +IOASIDs. SPIDs are particularly useful for supporting live
> > migration +where decoupling guest and host physical resources are
> > necessary. +
> > +For example, two VMs can both allocate guest PASID/SPID #101 but
> > map to +different host PASIDs #201 and #202 respectively as shown
> > in the +diagram below.
> > +::
> > +
> > + .------------------. .------------------.
> > + | VM 1 | | VM 2 |
> > + | | | |
> > + |------------------| |------------------|
> > + | GPASID/SPID 101 | | GPASID/SPID 101 |
> > + '------------------' -------------------' Guest
> > + __________|______________________|______________________
> > + | | Host
> > + v v
> > + .------------------. .------------------.
> > + | Host IOASID 201 | | Host IOASID 202 |
> > + '------------------' '------------------'
> > + | IOASID set 1 | | IOASID set 2 |
> > + '------------------' '------------------'
> > +
> > +Guest PASID is treated as IOASID set private ID (SPID) within an
> > +IOASID set, mappings between guest and host IOASIDs are stored in
> > the +set for inquiry.
> > +
> > +IOASID APIs
> > +===========
> > +To get the IOASID APIs, users must #include <linux/ioasid.h>.
> > These APIs +serve the following functionalities:
> > +
> > + - IOASID allocation/Free
> > + - Group management in the form of ioasid_set
> > + - Private data storage and lookup
> > + - Reference counting
> > + - Event notification in case of state change
> > +
> > +IOASID Set Level APIs
> > +--------------------------
> > +For use cases such as guest SVA it is necessary to manage IOASIDs
> > at +a group level. For example, VMs may allocate multiple IOASIDs
> > for +guest process address sharing (vSVA). It is imperative to
> > enforce +VM-IOASID ownership such that malicious guest cannot
> > target DMA
>
> "a malicious guest"
>
got it

> > +traffic outside its own IOASIDs, or free an active IOASID belong
> > to
>
> "that belongs to"
>
got it

> > +another VM.
> > +::
> > +
> > + struct ioasid_set *ioasid_alloc_set(void *token, ioasid_t quota,
> > u32 type) +
> > + int ioasid_adjust_set(struct ioasid_set *set, int quota);
>
> These could be named "ioasid_set_alloc" and "ioasid_set_adjust" to be
> consistent with the rest of the API.
>
right

> > +
> > + void ioasid_set_get(struct ioasid_set *set)
> > +
> > + void ioasid_set_put(struct ioasid_set *set)
> > +
> > + void ioasid_set_get_locked(struct ioasid_set *set)
> > +
> > + void ioasid_set_put_locked(struct ioasid_set *set)
> > +
> > + int ioasid_set_for_each_ioasid(struct ioasid_set *sdata,
>
> Might be nicer to keep the same argument names within the API. Here
> "set" rather than "sdata".
>
yes, will do.

> > + void (*fn)(ioasid_t id, void
> > *data),
> > + void *data)
>
> (alignment)
>
it looked aligned in emacs and generated html doc. might just the email?
I have been having smtp issues, i am using gmail smtp this time.

> > +
> > +
> > +IOASID set concept is introduced to represent such IOASID groups.
> > Each
>
> Or just "IOASID sets represent such IOASID groups", but might be
> redundant.
>
right, simpler is good

> > +IOASID set is created with a token which can be one of the
> > following +types:
> > +
> > + - IOASID_SET_TYPE_NULL (Arbitrary u64 value)
> > + - IOASID_SET_TYPE_MM (Set token is a mm_struct)
> > +
> > +The explicit MM token type is useful when multiple users of an
> > IOASID +set under the same process need to communicate about their
> > shared IOASIDs. +E.g. An IOASID set created by VFIO for one guest
> > can be associated +with the KVM instance for the same guest since
> > they share a common mm_struct. +
> > +The IOASID set APIs serve the following purposes:
> > +
> > + - Ownership/permission enforcement
> > + - Take collective actions, e.g. free an entire set
> > + - Event notifications within a set
> > + - Look up a set based on token
> > + - Quota enforcement
>
> This paragraph could be earlier in the section
>
Make sense. I will move it before listing the APIs.
> > +
> > +Individual IOASID APIs
> > +----------------------
> > +Once an ioasid_set is created, IOASIDs can be allocated from the
> > set. +Within the IOASID set namespace, set private ID (SPID) is
> > supported. In +the VM use case, SPID can be used for storing guest
> > PASID. +
> > +::
> > +
> > + ioasid_t ioasid_alloc(struct ioasid_set *set, ioasid_t min,
> > ioasid_t max,
> > + void *private);
> > +
> > + int ioasid_get(struct ioasid_set *set, ioasid_t ioasid);
> > +
> > + void ioasid_put(struct ioasid_set *set, ioasid_t ioasid);
> > +
> > + int ioasid_get_locked(struct ioasid_set *set, ioasid_t ioasid);
> > +
> > + void ioasid_put_locked(struct ioasid_set *set, ioasid_t ioasid);
> > +
> > + void *ioasid_find(struct ioasid_set *set, ioasid_t ioasid,
> > + bool (*getter)(void *));
> > +
> > + ioasid_t ioasid_find_by_spid(struct ioasid_set *set, ioasid_t
> > spid) +
> > + int ioasid_attach_data(struct ioasid_set *set, ioasid_t ioasid,
> > + void *data);
> > + int ioasid_attach_spid(struct ioasid_set *set, ioasid_t ioasid,
> > + ioasid_t ssid);
>
> s/ssid/spid
>
got it

> > +
> > +
> > +Notifications
> > +-------------
> > +An IOASID may have multiple users, each user may have hardware
> > context +associated with an IOASID. When the status of an IOASID
> > changes, +e.g. an IOASID is being freed, users need to be notified
> > such that the +associated hardware context can be cleared, flushed,
> > and drained. +
> > +::
> > +
> > + int ioasid_register_notifier(struct ioasid_set *set, struct
> > + notifier_block *nb)
> > +
> > + void ioasid_unregister_notifier(struct ioasid_set *set,
> > + struct notifier_block *nb)
> > +
> > + int ioasid_register_notifier_mm(struct mm_struct *mm, struct
> > + notifier_block *nb)
> > +
> > + void ioasid_unregister_notifier_mm(struct mm_struct *mm, struct
> > + notifier_block *nb)
> > +
> > + int ioasid_notify(ioasid_t ioasid, enum ioasid_notify_val cmd,
> > + unsigned int flags)
> > +
> > +
> > +Events
> > +~~~~~~
> > +Notification events are pertinent to individual IOASIDs, they can
> > be +one of the following:
> > +
> > + - ALLOC
> > + - FREE
> > + - BIND
> > + - UNBIND
> > +
> > +Ordering
> > +~~~~~~~~
> > +Ordering is supported by IOASID notification priorities as the
> > +following (in ascending order):
> > +
> > +::
> > +
> > + enum ioasid_notifier_prios {
> > + IOASID_PRIO_LAST,
> > + IOASID_PRIO_IOMMU,
> > + IOASID_PRIO_DEVICE,
> > + IOASID_PRIO_CPU,
> > + };
> > +
> > +The typical use case is when an IOASID is freed due to an
> > exception, DMA +source should be quiesced before tearing down other
> > hardware contexts +in the system. This will reduce the churn in
> > handling faults. DMA work +submission is performed by the CPU which
> > is granted higher priority than +devices.
> > +
> > +
> > +Scopes
> > +~~~~~~
> > +There are two types of notifiers in IOASID core: system-wide and
> > +ioasid_set-wide.
> > +
> > +System-wide notifier is catering for users that need to handle all
> > +IOASIDs in the system. E.g. The IOMMU driver handles all IOASIDs.
> > +
> > +Per ioasid_set notifier can be used by VM specific components such
> > as +KVM. After all, each KVM instance only cares about IOASIDs
> > within its +own set.
> > +
> > +
> > +Atomicity
> > +~~~~~~~~~
> > +IOASID notifiers are atomic due to spinlocks used inside the IOASID
> > +core. For tasks cannot be completed in the notifier handler, async
> > work
>
> "tasks that cannot be"
>
got it

> > +can be submitted to complete the work later as long as there is no
> > +ordering requirement.
> > +
> > +Reference counting
> > +------------------
> > +IOASID lifecycle management is based on reference counting. Users
> > of +IOASID intend to align lifecycle with the IOASID need to hold
>
> "who intend to"
>
got it

> > +reference of the IOASID. IOASID will not be returned to the pool
> > for
>
> "a reference to the IOASID. The IOASID"
>
got it

> > +allocation until all references are dropped. Calling ioasid_free()
> > +will mark the IOASID as FREE_PENDING if the IOASID has outstanding
> > +reference. ioasid_get() is not allowed once an IOASID is in the
> > +FREE_PENDING state.
> > +
> > +Event notifications are used to inform users of IOASID status
> > change. +IOASID_FREE event prompts users to drop their references
> > after +clearing its context.
> > +
> > +For example, on VT-d platform when an IOASID is freed, teardown
> > +actions are performed on KVM, device driver, and IOMMU driver.
> > +KVM shall register notifier block with::
> > +
> > + static struct notifier_block pasid_nb_kvm = {
> > + .notifier_call = pasid_status_change_kvm,
> > + .priority = IOASID_PRIO_CPU,
> > + };
> > +
> > +VDCM driver shall register notifier block with::
> > +
> > + static struct notifier_block pasid_nb_vdcm = {
> > + .notifier_call = pasid_status_change_vdcm,
> > + .priority = IOASID_PRIO_DEVICE,
> > + };
> > +
> > +In both cases, notifier blocks shall be registered on the IOASID
> > set +such that *only* events from the matching VM is received.
> > +
> > +If KVM attempts to register notifier block before the IOASID set is
> > +created for the MM token, the notifier block will be placed on a
> > +pending list inside IOASID core. Once the token matching IOASID set
> > +is created, IOASID will register the notifier block automatically.
> > +IOASID core does not replay events for the existing IOASIDs in the
> > +set. For IOASID set of MM type, notification blocks can be
> > registered +on empty sets only. This is to avoid lost events.
> > +
> > +IOMMU driver shall register notifier block on global chain::
> > +
> > + static struct notifier_block pasid_nb_vtd = {
> > + .notifier_call = pasid_status_change_vtd,
> > + .priority = IOASID_PRIO_IOMMU,
> > + };
> > +
> > +Custom allocator APIs
> > +---------------------
> > +
> > +::
> > +
> > + int ioasid_register_allocator(struct ioasid_allocator_ops
> > *allocator); +
> > + void ioasid_unregister_allocator(struct ioasid_allocator_ops
> > *allocator); +
> > +Allocator Choices
> > +~~~~~~~~~~~~~~~~~
> > +IOASIDs are allocated for both host and guest SVA/IOVA usage.
> > However, +allocators can be different. For example, on VT-d guest
> > PASID +allocation must be performed via a virtual command interface
> > which is +emulated by VMM.
> > +
> > +IOASID core has the notion of "custom allocator" such that guest
> > can +register virtual command allocator that precedes the default
> > one. +
> > +Namespaces
> > +~~~~~~~~~~
> > +IOASIDs are limited system resources that default to 20 bits in
> > +size. Since each device has its own table, theoretically the
> > namespace +can be per device also. However, for security reasons
> > sharing PASID +tables among devices are not good for isolation.
> > Therefore, IOASID +namespace is system-wide.
>
> I don't follow this development. Having per-device PASID table would
> work fine for isolation (assuming no hardware bug necessitating IOMMU
> groups). If I remember correctly IOASID space was chosen to be
> OS-wide because it simplifies the management code (single PASID per
> task), and it is system-wide across VMs only in the case of VT-d
> scalable mode.
>
You are right, system-wide namespace is chosen for simplicity and
enqcmd. I will fix that.

> > +
> > +There are also other reasons to have this simpler system-wide
> > +namespace. Take VT-d as an example, VT-d supports shared workqueue
> > +and ENQCMD[1] where one IOASID could be used to submit work on
>
> Maybe use the Sphinx glossary syntax rather than "[1]"
> https://www.sphinx-doc.org/en/master/usage/restructuredtext/directives.html#glossary-directive
>
I will look into that, thanks!

> > +multiple devices that are shared with other VMs. This requires
> > IOASID +to be system-wide. This is also the reason why guests must
> > use an +emulated virtual command interface to allocate IOASID from
> > the host. +
> > +
> > +Life cycle
> > +==========
> > +This section covers IOASID lifecycle management for both bare-metal
> > +and guest usages. In bare-metal SVA, MMU notifier is directly
> > hooked +up with IOMMU driver, therefore the process address space
> > (MM) +lifecycle is aligned with IOASID.
> > +
> > +However, guest MMU notifier is not available to host IOMMU driver,
> > +when guest MM terminates unexpectedly, the events have to go
> > through +VFIO and IOMMU UAPI to reach host IOMMU driver. There are
> > also more +parties involved in guest SVA, e.g. on Intel VT-d
> > platform, IOASIDs +are used by IOMMU driver, KVM, VDCM, and VFIO.
> > +
> > +Native IOASID Life Cycle (VT-d Example)
> > +---------------------------------------
> > +
> > +The normal flow of native SVA code with Intel Data Streaming
> > +Accelerator(DSA) [2] as example:
> > +
> > +1. Host user opens accelerator FD, e.g. DSA driver, or uacce;
> > +2. DSA driver allocate WQ, do sva_bind_device();
> > +3. IOMMU driver calls ioasid_alloc(), then bind PASID with device,
> > + mmu_notifier_get()
> > +4. DMA starts by DSA driver userspace
> > +5. DSA userspace close FD
> > +6. DSA/uacce kernel driver handles FD.close()
> > +7. DSA driver stops DMA
> > +8. DSA driver calls sva_unbind_device();
> > +9. IOMMU driver does unbind, clears PASID context in IOMMU, flush
> > + TLBs. mmu_notifier_put() called.
> > +10. mmu_notifier.release() called, IOMMU SVA code calls
> > ioasid_free()* +11. The IOASID is returned to the pool, reclaimed.
> > +
> > +::
> > +
>
> Use a footnote?
> https://www.sphinx-doc.org/en/master/usage/restructuredtext/basics.html#footnotes
>
ditto

> > + * With ENQCMD, PASID used on VT-d is not released in
> > mmu_notifier() but
> > + mmdrop(). mmdrop comes after FD close. Should not matter.
>
> "comes after FD close, which doesn't make a difference?"
> The following might not be necessary since early process termination
> is described later.
>
yes, it is redundant. I will remove it.

> > + If the user process dies unexpectedly, Step #10 may come
> > before
> > + Step #5, in between, all DMA faults discarded. PRQ responded
> > with
>
> PRQ hasn't been defined in this document.
>
will remove

> > + code INVALID REQUEST.
> > +
> > +During the normal teardown, the following three steps would happen
> > in +order:
> > +
> > +1. Device driver stops DMA request
> > +2. IOMMU driver unbinds PASID and mm, flush all TLBs, drain
> > in-flight
> > + requests.
> > +3. IOASID freed
> > +
> > +Exception happens when process terminates *before* device driver
> > stops +DMA and call IOMMU driver to unbind. The flow of process
> > exists are as
>
> "exits"
>
will fix

> > +follows:
> > +
> > +::
> > +
> > + do_exit() {
> > + exit_mm() {
> > + mm_put();
> > + exit_mmap() {
> > + intel_invalidate_range() //mmu notifier
> > + tlb_finish_mmu()
> > + mmu_notifier_release(mm) {
> > + intel_iommu_release() {
> > + [2]
> > intel_iommu_teardown_pasid();
>
> Parentheses might be better than square brackets for step numbers
>
yes, it gets highlight as well. thanks!

> > + intel_iommu_flush_tlbs();
> > + }
> > + // tlb_invalidate_range cb removed
> > + }
> > + unmap_vmas();
> > + free_pgtables(); // IOMMU cannot walk PGT
> > after this
> > + };
> > + }
> > + exit_files(tsk) {
> > + close_files() {
> > + dsa_close();
> > + [1] dsa_stop_dma();
> > + intel_svm_unbind_pasid(); //nothing to do
> > + }
> > + }
> > + }
> > +
> > + mmdrop() /* some random time later, lazy mm user */ {
> > + mm_free_pgd();
> > + destroy_context(mm); {
> > + [3] ioasid_free();
> > + }
> > + }
> > +
> > +As shown in the list above, step #2 could happen before
> > +#1. Unrecoverable(UR) faults could happen between #2 and #1.
> > +
> > +Also notice that TLB invalidation occurs at mmu_notifier
> > +invalidate_range callback as well as the release callback. The
> > reason +is that release callback will delete IOMMU driver from the
> > notifier +chain which may skip invalidate_range() calls during the
> > exit path. +
> > +To avoid unnecessary reporting of UR fault, IOMMU driver shall
> > disable +fault reporting after free and before unbind.
> > +
> > +Guest IOASID Life Cycle (VT-d Example)
> > +--------------------------------------
> > +Guest IOASID life cycle starts with guest driver open(), this
> > could be +uacce or individual accelerator driver such as DSA. At FD
> > open, +sva_bind_device() is called which triggers a series of
> > actions. +
> > +The example below is an illustration of *normal* operations that
> > +involves *all* the SW components in VT-d. The flow can be simpler
> > if +no ENQCMD is supported.
> > +
> > +::
> > +
> > + VFIO IOMMU KVM VDCM IOASID
> > Ref
> > + ..................................................................
> > + 1 ioasid_register_notifier/_mm()
> > + 2 ioasid_alloc() 1
> > + 3 bind_gpasid()
> > + 4 iommu_bind()->ioasid_get() 2
> > + 5 ioasid_notify(BIND)
> > + 6 -> ioasid_get() 3
> > + 7 -> vmcs_update_atomic()
> > + 8 mdev_write(gpasid)
> > + 9 hpasid=
> > + 10 find_by_spid(gpasid) 4
> > + 11 vdev_write(hpasid)
> > + 12 -------- GUEST STARTS DMA --------------------------
> > + 13 -------- GUEST STOPS DMA --------------------------
> > + 14 mdev_clear(gpasid)
> > + 15 vdev_clear(hpasid)
> > + 16 ioasid_put()
> > 3
> > + 17 unbind_gpasid()
> > + 18 iommu_ubind()
> > + 19 ioasid_notify(UNBIND)
> > + 20 -> vmcs_update_atomic()
> > + 21 -> ioasid_put()
> > 2
> > + 22 ioasid_free()
> > 1
> > + 23 ioasid_put()
> > 0
> > + 24 Reclaimed
> > + -------------- New Life Cycle Begin ----------------------------
> > + 1 ioasid_alloc() ->
> > 1 +
> > + Note: IOASID Notification Events: FREE, BIND, UNBIND
> > +
> > +Exception cases arise when a guest crashes or a malicious guest
> > +attempts to cause disruption on the host system. The fault handling
> > +rules are:
> > +
> > +1. IOASID free must *always* succeed.
> > +2. An inactive period may be required before the freed IOASID is
> > + reclaimed. During this period, consumers of IOASID perform
> > cleanup. +3. Malfunction is limited to the guest owned resources
> > for all
> > + programming errors.
> > +
> > +The primary source of exception is when the following are out of
> > +order:
> > +
> > +1. Start/Stop of DMA activity
> > + (Guest device driver, mdev via VFIO)
> > +2. Setup/Teardown of IOMMU PASID context, IOTLB, DevTLB flushes
> > + (Host IOMMU driver bind/unbind)
> > +3. Setup/Teardown of VMCS PASID translation table entries (KVM) in
> > + case of ENQCMD
> > +4. Programming/Clearing host PASID in VDCM (Host VDCM driver)
> > +5. IOASID alloc/free (Host IOASID)
> > +
> > +VFIO is the *only* user-kernel interface, which is ultimately
> > +responsible for exception handlings.
>
> "handling"
>
got it

> > +
> > +#1 is processed the same way as the assigned device today based on
> > +device file descriptors and events. There is no special handling.
> > +
> > +#3 is based on bind/unbind events emitted by #2.
> > +
> > +#4 is naturally aligned with IOASID life cycle in that an illegal
> > +guest PASID programming would fail in obtaining reference of the
> > +matching host IOASID.
> > +
> > +#5 is similar to #4. The fault will be reported to the user if
> > PASID +used in the ENQCMD is not set up in VMCS PASID translation
> > table. +
> > +Therefore, the remaining out of order problem is between #2 and
> > +#5. I.e. unbind vs. free. More specifically, free before unbind.
> > +
> > +IOASID notifier and refcounting are used to ensure order. Following
> > +a publisher-subscriber pattern where:
> > +
> > +- Publishers: VFIO & IOMMU
> > +- Subscribers: KVM, VDCM, IOMMU
> > +
> > +IOASID notifier is atomic which requires subscribers to do quick
> > +handling of the event in the atomic context. Workqueue can be used
> > for +any processing that requires thread context. IOASID reference
> > must be +acquired before receiving the FREE event. The reference
> > must be +dropped at the end of the processing in order to return
> > the IOASID to +the pool.
> > +
> > +Let's examine the IOASID life cycle again when free happens
> > *before* +unbind. This could be a result of misbehaving guests or
> > crash. Assuming +VFIO cannot enforce unbind->free order. Notice
> > that the setup part up +until step #12 is identical to the normal
> > case, the flow below starts +with step 13.
> > +
> > +::
> > +
> > + VFIO IOMMU KVM VDCM IOASID
> > Ref
> > + ..................................................................
> > + 13 -------- GUEST STARTS DMA --------------------------
> > + 14 -------- *GUEST MISBEHAVES!!!* ----------------
> > + 15 ioasid_free()
> > + 16
> > ioasid_notify(FREE)
> > + 17
> > mark_ioasid_inactive[1]
> > + 18 kvm_nb_handler(FREE)
> > + 19 vmcs_update_atomic()
> > + 20 ioasid_put_locked() -> 3
> > + 21 vdcm_nb_handler(FREE)
> > + 22 iomm_nb_handler(FREE)
> > + 23 ioasid_free() returns[2] schedule_work() 2
> > + 24 schedule_work() vdev_clear_wk(hpasid)
> > + 25 teardown_pasid_wk()
> > + 26 ioasid_put() -> 1
> > + 27 ioasid_put() 0
> > + 28 Reclaimed
> > + 29 unbind_gpasid()
> > + 30 iommu_unbind()->ioasid_find() Fails[3]
> > + -------------- New Life Cycle Begin ----------------------------
> > +
> > +Note:
> > +
> > +1. By marking IOASID inactive at step #17, no new references can
> > be
>
> Is "inactive" FREE_PENDING?
>
yes, will fix.

> > + held. ioasid_get/find() will return -ENOENT;
> > +2. After step #23, all events can go out of order. Shall not affect
> > + the outcome.
> > +3. IOMMU driver fails to find private data for unbinding. If
> > unbind is
> > + called after the same IOASID is allocated for the same guest
> > again,
> > + this is a programming error. The damage is limited to the guest
> > + itself since unbind performs permission checking based on the
> > + IOASID set associated with the guest process.
> > +
> > +KVM PASID Translation Table Updates
> > +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> > +Per VM PASID translation table is maintained by KVM in order to
> > +support ENQCMD in the guest. The table contains host-guest PASID
> > +translations to be consumed by CPU ucode. The synchronization of
> > the +PASID states depends on VFIO/IOMMU driver, where IOCTL and
> > atomic +notifiers are used. KVM must register IOASID notifier per
> > VM instance +during launch time. The following events are handled:
> > +
> > +1. BIND/UNBIND
> > +2. FREE
> > +
> > +Rules:
> > +
> > +1. Multiple devices can bind with the same PASID, this can be
> > different PCI
> > + devices or mdevs within the same PCI device. However, only the
> > + *first* BIND and *last* UNBIND emit notifications.
> > +2. IOASID code is responsible for ensuring the correctness of H-G
> > + PASID mapping. There is no need for KVM to validate the
> > + notification data.
> > +3. When UNBIND happens *after* FREE, KVM will see error in
> > + ioasid_get() even when the reclaim is not done. IOMMU driver
> > will
> > + also avoid sending UNBIND if the PASID is already FREE.
> > +4. When KVM terminates *before* FREE & UNBIND, references will be
> > + dropped for all host PASIDs.
> > +
> > +VDCM PASID Programming
> > +~~~~~~~~~~~~~~~~~~~~~~
> > +VDCM composes virtual devices and exposes them to the guests. When
> > +the guest allocates a PASID then program it to the virtual device,
> > VDCM +intercepts the programming attempt then program the matching
> > host
>
> "programs"
>
> Thanks,
> Jean
>
> > +PASID on to the hardware.
> > +Conversely, when a device is going away, VDCM must be informed such
> > +that PASID context on the hardware can be cleared. There could be
> > +multiple mdevs assigned to different guests in the same VDCM. Since
> > +the PASID table is shared at PCI device level, lazy clearing is not
> > +secure. A malicious guest can attack by using newly freed PASIDs
> > that +are allocated by another guest.
> > +
> > +By holding a reference of the PASID until VDCM cleans up the HW
> > context, +it is guaranteed that PASID life cycles do not cross
> > within the same +device.
> > +
> > +
> > +Reference
> > +====================================================
> > +1.
> > https://software.intel.com/sites/default/files/managed/c5/15/architecture-instruction-set-extensions-programming-reference.pdf
> > + +2.
> > https://01.org/blogs/2019/introducing-intel-data-streaming-accelerator
> > + +3.
> > https://software.intel.com/en-us/download/intel-data-streaming-accelerator-preliminary-architecture-specification
> > -- 2.7.4
> >
> _______________________________________________
> iommu mailing list
> [email protected]
> https://lists.linuxfoundation.org/mailman/listinfo/iommu
[Jacob Pan]

2020-09-01 12:05:58

by Eric Auger

[permalink] [raw]
Subject: Re: [PATCH v2 3/9] iommu/ioasid: Introduce ioasid_set APIs

Hi Jacob,

On 8/22/20 6:35 AM, Jacob Pan wrote:
> ioasid_set was introduced as an arbitrary token that are shared by a
that is
> group of IOASIDs. For example, if IOASID #1 and #2 are allocated via the
> same ioasid_set*, they are viewed as to belong to the same set.
two IOASIDs allocated via the same ioasid_set pointer belong to the same
set?
>
> For guest SVA usages, system-wide IOASID resources need to be
> partitioned such that VMs can have its own quota and being managed
their own
> separately. ioasid_set is the perfect candidate for meeting such
> requirements. This patch redefines and extends ioasid_set with the
> following new fields:
> - Quota
> - Reference count
> - Storage of its namespace
> - The token is stored in the new ioasid_set but with optional types
>
> ioasid_set level APIs are introduced that wires up these new data.
that wire
> Existing users of IOASID APIs are converted where a host IOASID set is
> allocated for bare-metal usage.
>
> Signed-off-by: Liu Yi L <[email protected]>
> Signed-off-by: Jacob Pan <[email protected]>
> ---
> drivers/iommu/intel/iommu.c | 27 ++-
> drivers/iommu/intel/pasid.h | 1 +
> drivers/iommu/intel/svm.c | 8 +-
> drivers/iommu/ioasid.c | 390 +++++++++++++++++++++++++++++++++++++++++---
> include/linux/ioasid.h | 82 ++++++++--
> 5 files changed, 465 insertions(+), 43 deletions(-)
>
> diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
> index a3a0b5c8921d..5813eeaa5edb 100644
> --- a/drivers/iommu/intel/iommu.c
> +++ b/drivers/iommu/intel/iommu.c
> @@ -42,6 +42,7 @@
> #include <linux/crash_dump.h>
> #include <linux/numa.h>
> #include <linux/swiotlb.h>
> +#include <linux/ioasid.h>
> #include <asm/irq_remapping.h>
> #include <asm/cacheflush.h>
> #include <asm/iommu.h>
> @@ -103,6 +104,9 @@
> */
> #define INTEL_IOMMU_PGSIZES (~0xFFFUL)
>
> +/* PASIDs used by host SVM */
> +struct ioasid_set *host_pasid_set;
> +
> static inline int agaw_to_level(int agaw)
> {
> return agaw + 2;
> @@ -3103,8 +3107,8 @@ static void intel_vcmd_ioasid_free(ioasid_t ioasid, void *data)
> * Sanity check the ioasid owner is done at upper layer, e.g. VFIO
> * We can only free the PASID when all the devices are unbound.
> */
> - if (ioasid_find(NULL, ioasid, NULL)) {
> - pr_alert("Cannot free active IOASID %d\n", ioasid);
> + if (IS_ERR(ioasid_find(host_pasid_set, ioasid, NULL))) {
> + pr_err("Cannot free IOASID %d, not in system set\n", ioasid);
not sure the change in the trace is worth. Also you may be more explicit
like IOASID %d to be freed cannot be found in the system ioasid set.
shouldn't it be rate_limited as it is originated from user space?
> return;
> }
> vcmd_free_pasid(iommu, ioasid);
> @@ -3288,6 +3292,19 @@ static int __init init_dmars(void)
> if (ret)
> goto free_iommu;
>
> + /* PASID is needed for scalable mode irrespective to SVM */
> + if (intel_iommu_sm) {
> + ioasid_install_capacity(intel_pasid_max_id);
> + /* We should not run out of IOASIDs at boot */
> + host_pasid_set = ioasid_alloc_set(NULL, PID_MAX_DEFAULT,
s/PID_MAX_DEFAULT/intel_pasid_max_id?
> + IOASID_SET_TYPE_NULL);
as suggested by jean-Philippe ioasid_set_alloc
> + if (IS_ERR_OR_NULL(host_pasid_set)) {
> + pr_err("Failed to enable host PASID allocator %lu\n",
> + PTR_ERR(host_pasid_set));
does not sound like the correct error message? failed to allocate the
system ioasid_set?
> + intel_iommu_sm = 0;
> + }
> + }
> +
> /*
> * for each drhd
> * enable fault log
> @@ -5149,7 +5166,7 @@ static void auxiliary_unlink_device(struct dmar_domain *domain,
> domain->auxd_refcnt--;
>
> if (!domain->auxd_refcnt && domain->default_pasid > 0)
> - ioasid_free(domain->default_pasid);
> + ioasid_free(host_pasid_set, domain->default_pasid);
> }
>
> static int aux_domain_add_dev(struct dmar_domain *domain,
> @@ -5167,7 +5184,7 @@ static int aux_domain_add_dev(struct dmar_domain *domain,
> int pasid;
>
> /* No private data needed for the default pasid */
> - pasid = ioasid_alloc(NULL, PASID_MIN,
> + pasid = ioasid_alloc(host_pasid_set, PASID_MIN,
> pci_max_pasids(to_pci_dev(dev)) - 1,
> NULL);
don't you want to ioasid_set_put() the ioasid_set in
intel_iommu_free_dmars()?
> if (pasid == INVALID_IOASID) {
> @@ -5210,7 +5227,7 @@ static int aux_domain_add_dev(struct dmar_domain *domain,
> spin_unlock(&iommu->lock);
> spin_unlock_irqrestore(&device_domain_lock, flags);
> if (!domain->auxd_refcnt && domain->default_pasid > 0)
> - ioasid_free(domain->default_pasid);
> + ioasid_free(host_pasid_set, domain->default_pasid);
>
> return ret;
> }
> diff --git a/drivers/iommu/intel/pasid.h b/drivers/iommu/intel/pasid.h
> index c9850766c3a9..ccdc23446015 100644
> --- a/drivers/iommu/intel/pasid.h
> +++ b/drivers/iommu/intel/pasid.h
> @@ -99,6 +99,7 @@ static inline bool pasid_pte_is_present(struct pasid_entry *pte)
> }
>
> extern u32 intel_pasid_max_id;
> +extern struct ioasid_set *host_pasid_set;
> int intel_pasid_alloc_id(void *ptr, int start, int end, gfp_t gfp);
> void intel_pasid_free_id(int pasid);
> void *intel_pasid_lookup_id(int pasid);
> diff --git a/drivers/iommu/intel/svm.c b/drivers/iommu/intel/svm.c
> index 37a9beabc0ca..634e191ca2c3 100644
> --- a/drivers/iommu/intel/svm.c
> +++ b/drivers/iommu/intel/svm.c
> @@ -551,7 +551,7 @@ intel_svm_bind_mm(struct device *dev, int flags, struct svm_dev_ops *ops,
> pasid_max = intel_pasid_max_id;
>
> /* Do not use PASID 0, reserved for RID to PASID */
> - svm->pasid = ioasid_alloc(NULL, PASID_MIN,
> + svm->pasid = ioasid_alloc(host_pasid_set, PASID_MIN,
> pasid_max - 1, svm);
> if (svm->pasid == INVALID_IOASID) {
> kfree(svm);
> @@ -568,7 +568,7 @@ intel_svm_bind_mm(struct device *dev, int flags, struct svm_dev_ops *ops,
> if (mm) {
> ret = mmu_notifier_register(&svm->notifier, mm);
> if (ret) {
> - ioasid_free(svm->pasid);
> + ioasid_free(host_pasid_set, svm->pasid);
> kfree(svm);
> kfree(sdev);
> goto out;
> @@ -586,7 +586,7 @@ intel_svm_bind_mm(struct device *dev, int flags, struct svm_dev_ops *ops,
> if (ret) {
> if (mm)
> mmu_notifier_unregister(&svm->notifier, mm);
> - ioasid_free(svm->pasid);
> + ioasid_free(host_pasid_set, svm->pasid);
> kfree(svm);
> kfree(sdev);
> goto out;
> @@ -655,7 +655,7 @@ static int intel_svm_unbind_mm(struct device *dev, int pasid)
> kfree_rcu(sdev, rcu);
>
> if (list_empty(&svm->devs)) {
> - ioasid_free(svm->pasid);
> + ioasid_free(host_pasid_set, svm->pasid);
> if (svm->mm)
> mmu_notifier_unregister(&svm->notifier, svm->mm);
> list_del(&svm->list);
> diff --git a/drivers/iommu/ioasid.c b/drivers/iommu/ioasid.c
> index 5f63af07acd5..f73b3dbfc37a 100644
> --- a/drivers/iommu/ioasid.c
> +++ b/drivers/iommu/ioasid.c
> @@ -1,22 +1,58 @@
> // SPDX-License-Identifier: GPL-2.0
> /*
> * I/O Address Space ID allocator. There is one global IOASID space, split into
> - * subsets. Users create a subset with DECLARE_IOASID_SET, then allocate and
I would try to avoid using new terms: s/subset_ioset_set
> - * free IOASIDs with ioasid_alloc and ioasid_free.
> + * subsets. Users create a subset with ioasid_alloc_set, then allocate/free IDs
here also and ioasid_set_alloc
> + * with ioasid_alloc and ioasid_free.
> */
> -#include <linux/ioasid.h>
> #include <linux/module.h>
> #include <linux/slab.h>
> #include <linux/spinlock.h>
> #include <linux/xarray.h>
> +#include <linux/ioasid.h>
> +
> +static DEFINE_XARRAY_ALLOC(ioasid_sets);
> +enum ioasid_state {
> + IOASID_STATE_INACTIVE,
> + IOASID_STATE_ACTIVE,
> + IOASID_STATE_FREE_PENDING,
> +};
>
> +/**
> + * struct ioasid_data - Meta data about ioasid
> + *
> + * @id: Unique ID
> + * @users Number of active users
> + * @state Track state of the IOASID
> + * @set Meta data of the set this IOASID belongs to
s/Meta data of the set this IOASID belongs to/ioasid_set the asid belongs to
> + * @private Private data associated with the IOASID
I would have expected to find the private asid somewhere
> + * @rcu For free after RCU grace period
> + */
> struct ioasid_data {
> ioasid_t id;
> struct ioasid_set *set;
> + refcount_t users;
> + enum ioasid_state state;
> void *private;
> struct rcu_head rcu;
> };
>
> +/* Default to PCIe standard 20 bit PASID */
> +#define PCI_PASID_MAX 0x100000
> +static ioasid_t ioasid_capacity = PCI_PASID_MAX;
> +static ioasid_t ioasid_capacity_avail = PCI_PASID_MAX;
> +
> +void ioasid_install_capacity(ioasid_t total)
> +{
> + ioasid_capacity = ioasid_capacity_avail = total;
> +}
> +EXPORT_SYMBOL_GPL(ioasid_install_capacity);
> +
> +ioasid_t ioasid_get_capacity()
> +{
> + return ioasid_capacity;
> +}
> +EXPORT_SYMBOL_GPL(ioasid_get_capacity);
> +
> /*
> * struct ioasid_allocator_data - Internal data structure to hold information
> * about an allocator. There are two types of allocators:
> @@ -306,11 +342,23 @@ ioasid_t ioasid_alloc(struct ioasid_set *set, ioasid_t min, ioasid_t max,
> {
> struct ioasid_data *data;
> void *adata;
> - ioasid_t id;
> + ioasid_t id = INVALID_IOASID;
> +
> + spin_lock(&ioasid_allocator_lock);
> + /* Check if the IOASID set has been allocated and initialized */
> + if (WARN_ON(xa_load(&ioasid_sets, set->sid) != set)) {
> + pr_warn("Invalid set\n");
> + goto done_unlock;
> + }
> +
> + if (set->quota <= set->nr_ioasids) {
> + pr_err("IOASID set %d out of quota %d\n", set->sid, set->quota);
> + goto done_unlock;
> + }
>
> data = kzalloc(sizeof(*data), GFP_ATOMIC);
> if (!data)
> - return INVALID_IOASID;
> + goto done_unlock;
>
> data->set = set;
> data->private = private;
> @@ -319,7 +367,6 @@ ioasid_t ioasid_alloc(struct ioasid_set *set, ioasid_t min, ioasid_t max,
> * Custom allocator needs allocator data to perform platform specific
> * operations.
> */
> - spin_lock(&ioasid_allocator_lock);
> adata = active_allocator->flags & IOASID_ALLOCATOR_CUSTOM ? active_allocator->ops->pdata : data;
> id = active_allocator->ops->alloc(min, max, adata);
> if (id == INVALID_IOASID) {
> @@ -335,42 +382,339 @@ ioasid_t ioasid_alloc(struct ioasid_set *set, ioasid_t min, ioasid_t max,
> goto exit_free;
> }
> data->id = id;
> + data->state = IOASID_STATE_ACTIVE;
> + refcount_set(&data->users, 1);
> +
> + /* Store IOASID in the per set data */
> + if (xa_err(xa_store(&set->xa, id, data, GFP_ATOMIC))) {
> + pr_err("Failed to ioasid %d in set %d\n", id, set->sid);
> + goto exit_free;
> + }
> + set->nr_ioasids++;
> + goto done_unlock;
>
> - spin_unlock(&ioasid_allocator_lock);
> - return id;
> exit_free:
> - spin_unlock(&ioasid_allocator_lock);
> kfree(data);
> - return INVALID_IOASID;
> +done_unlock:
> + spin_unlock(&ioasid_allocator_lock);
> + return id;
> }
> EXPORT_SYMBOL_GPL(ioasid_alloc);
>
> +static void ioasid_do_free(struct ioasid_data *data)
do_free_locked?
> +{
> + struct ioasid_data *ioasid_data;
> + struct ioasid_set *sdata;
> +
> + active_allocator->ops->free(data->id, active_allocator->ops->pdata);
> + /* Custom allocator needs additional steps to free the xa element */
> + if (active_allocator->flags & IOASID_ALLOCATOR_CUSTOM) {
> + ioasid_data = xa_erase(&active_allocator->xa, data->id);
> + kfree_rcu(ioasid_data, rcu);
> + }
> +
> + sdata = xa_load(&ioasid_sets, data->set->sid);
> + if (!sdata) {
> + pr_err("No set %d for IOASID %d\n", data->set->sid,
> + data->id);
> + return;
> + }
> + xa_erase(&sdata->xa, data->id);
> + sdata->nr_ioasids--;
> +}
> +
> +static void ioasid_free_locked(struct ioasid_set *set, ioasid_t ioasid)
> +{
> + struct ioasid_data *data;
> +
> + data = xa_load(&active_allocator->xa, ioasid);
> + if (!data) {
> + pr_err("Trying to free unknown IOASID %u\n", ioasid);
> + return;
> + }
> +
> + if (data->set != set) {
> + pr_warn("Cannot free IOASID %u due to set ownership\n", ioasid);
> + return;
> + }
> + data->state = IOASID_STATE_FREE_PENDING;
> +
> + if (!refcount_dec_and_test(&data->users))
> + return;
> +
> + ioasid_do_free(data);
> +}
> +
> /**
> - * ioasid_free - Free an IOASID
> - * @ioasid: the ID to remove
> + * ioasid_free - Drop reference on an IOASID. Free if refcount drops to 0,
> + * including free from its set and system-wide list.
> + * @set: The ioasid_set to check permission with. If not NULL, IOASID
> + * free will fail if the set does not match.
> + * @ioasid: The IOASID to remove
> */
> -void ioasid_free(ioasid_t ioasid)
> +void ioasid_free(struct ioasid_set *set, ioasid_t ioasid)
> {
> - struct ioasid_data *ioasid_data;
> + spin_lock(&ioasid_allocator_lock);
> + ioasid_free_locked(set, ioasid);
> + spin_unlock(&ioasid_allocator_lock);
> +}
> +EXPORT_SYMBOL_GPL(ioasid_free);
>
> +/**
> + * ioasid_alloc_set - Allocate a new IOASID set for a given token
> + *
> + * @token: Unique token of the IOASID set, cannot be NULL
> + * @quota: Quota allowed in this set. Only for new set creation
> + * @flags: Special requirements
> + *
> + * IOASID can be limited system-wide resource that requires quota management.
> + * If caller does not wish to enforce quota, use IOASID_SET_NO_QUOTA flag.
> + *
> + * Token will be stored in the ioasid_set returned. A reference will be taken
> + * upon finding a matching set or newly created set.
> + * IOASID allocation within the set and other per set operations will use
> + * the retured ioasid_set *.
> + *
> + */
> +struct ioasid_set *ioasid_alloc_set(void *token, ioasid_t quota, int type)
> +{
> + struct ioasid_set *sdata;
> + unsigned long index;
> + ioasid_t id;
> +
> + if (type >= IOASID_SET_TYPE_NR)
> + return ERR_PTR(-EINVAL);
> +
> + /*
> + * Need to check space available if we share system-wide quota.
> + * TODO: we may need to support quota free sets in the future.
> + */
> spin_lock(&ioasid_allocator_lock);
> - ioasid_data = xa_load(&active_allocator->xa, ioasid);
> - if (!ioasid_data) {
> - pr_err("Trying to free unknown IOASID %u\n", ioasid);
> + if (quota > ioasid_capacity_avail) {
> + pr_warn("Out of IOASID capacity! ask %d, avail %d\n",
> + quota, ioasid_capacity_avail);
> + sdata = ERR_PTR(-ENOSPC);
> goto exit_unlock;
> }
>
> - active_allocator->ops->free(ioasid, active_allocator->ops->pdata);
> - /* Custom allocator needs additional steps to free the xa element */
> - if (active_allocator->flags & IOASID_ALLOCATOR_CUSTOM) {
> - ioasid_data = xa_erase(&active_allocator->xa, ioasid);
> - kfree_rcu(ioasid_data, rcu);
> + /*
> + * Token is only unique within its types but right now we have only
> + * mm type. If we have more token types, we have to match type as well.
> + */
> + switch (type) {
> + case IOASID_SET_TYPE_MM:
> + /* Search existing set tokens, reject duplicates */
> + xa_for_each(&ioasid_sets, index, sdata) {
> + if (sdata->token == token &&
> + sdata->type == IOASID_SET_TYPE_MM) {
> + sdata = ERR_PTR(-EEXIST);
> + goto exit_unlock;
> + }
> + }
> + break;
> + case IOASID_SET_TYPE_NULL:
> + if (!token)
> + break;
> + fallthrough;
> + default:
> + pr_err("Invalid token and IOASID type\n");
> + sdata = ERR_PTR(-EINVAL);
> + goto exit_unlock;
> }
>
> + /* REVISIT: may support set w/o quota, use system available */
> + if (!quota) {
> + sdata = ERR_PTR(-EINVAL);
> + goto exit_unlock;
> + }
> +
> + sdata = kzalloc(sizeof(*sdata), GFP_ATOMIC);
> + if (!sdata) {
> + sdata = ERR_PTR(-ENOMEM);
> + goto exit_unlock;
> + }
> +
> + if (xa_alloc(&ioasid_sets, &id, sdata,
> + XA_LIMIT(0, ioasid_capacity_avail - quota),
> + GFP_ATOMIC)) {
> + kfree(sdata);
> + sdata = ERR_PTR(-ENOSPC);
> + goto exit_unlock;
> + }
> +
> + sdata->token = token;
> + sdata->type = type;
> + sdata->quota = quota;
> + sdata->sid = id;
> + refcount_set(&sdata->ref, 1);
> +
> + /*
> + * Per set XA is used to store private IDs within the set, get ready
> + * for ioasid_set private ID and system-wide IOASID allocation
> + * results.
> + */
> + xa_init_flags(&sdata->xa, XA_FLAGS_ALLOC);
> + ioasid_capacity_avail -= quota;
> +
> exit_unlock:
> spin_unlock(&ioasid_allocator_lock);
> +
> + return sdata;
> }
> -EXPORT_SYMBOL_GPL(ioasid_free);
> +EXPORT_SYMBOL_GPL(ioasid_alloc_set);
> +
> +void ioasid_set_get_locked(struct ioasid_set *set)
> +{
> + if (WARN_ON(xa_load(&ioasid_sets, set->sid) != set)) {
> + pr_warn("Invalid set data\n");
> + return;
> + }
> +
> + refcount_inc(&set->ref);
> +}
> +EXPORT_SYMBOL_GPL(ioasid_set_get_locked);
> +
> +void ioasid_set_get(struct ioasid_set *set)
> +{
> + spin_lock(&ioasid_allocator_lock);
> + ioasid_set_get_locked(set);
> + spin_unlock(&ioasid_allocator_lock);
> +}
> +EXPORT_SYMBOL_GPL(ioasid_set_get);
> +
> +void ioasid_set_put_locked(struct ioasid_set *set)
> +{
> + struct ioasid_data *entry;
> + unsigned long index;
> +
> + if (WARN_ON(xa_load(&ioasid_sets, set->sid) != set)) {
> + pr_warn("Invalid set data\n");
> + return;
> + }
> +
> + if (!refcount_dec_and_test(&set->ref)) {
> + pr_debug("%s: IOASID set %d has %d users\n",
> + __func__, set->sid, refcount_read(&set->ref));
> + return;
> + }
> +
> + /* The set is already empty, we just destroy the set. */
> + if (xa_empty(&set->xa))
> + goto done_destroy;
> +
> + /*
> + * Free all PASIDs from system-wide IOASID pool, all subscribers gets
> + * notified and do cleanup of their own.
> + * Note that some references of the IOASIDs within the set can still
> + * be held after the free call. This is OK in that the IOASIDs will be
> + * marked inactive, the only operations can be done is ioasid_put.
> + * No need to track IOASID set states since there is no reclaim phase.
> + */
> + xa_for_each(&set->xa, index, entry) {
> + ioasid_free_locked(set, index);
> + /* Free from per set private pool */
> + xa_erase(&set->xa, index);
> + }
> +
> +done_destroy:
> + /* Return the quota back to system pool */
> + ioasid_capacity_avail += set->quota;
> + kfree_rcu(set, rcu);
> +
> + /*
> + * Token got released right away after the ioasid_set is freed.
> + * If a new set is created immediately with the newly released token,
> + * it will not allocate the same IOASIDs unless they are reclaimed.
> + */
> + xa_erase(&ioasid_sets, set->sid);
> +}
> +EXPORT_SYMBOL_GPL(ioasid_set_put_locked);
> +
> +/**
> + * ioasid_set_put - Drop a reference to the IOASID set. Free all IOASIDs within
> + * the set if there are no more users.
> + *
> + * @set: The IOASID set ID to be freed
> + *
> + * If refcount drops to zero, all IOASIDs allocated within the set will be
> + * freed.
> + */
> +void ioasid_set_put(struct ioasid_set *set)
> +{
> + spin_lock(&ioasid_allocator_lock);
> + ioasid_set_put_locked(set);
> + spin_unlock(&ioasid_allocator_lock);
> +}
> +EXPORT_SYMBOL_GPL(ioasid_set_put);
> +
> +/**
> + * ioasid_adjust_set - Adjust the quota of an IOASID set
> + * @set: IOASID set to be assigned
> + * @quota: Quota allowed in this set
> + *
> + * Return 0 on success. If the new quota is smaller than the number of
> + * IOASIDs already allocated, -EINVAL will be returned. No change will be
> + * made to the existing quota.
> + */
> +int ioasid_adjust_set(struct ioasid_set *set, int quota)
> +{
> + int ret = 0;
> +
> + spin_lock(&ioasid_allocator_lock);
> + if (set->nr_ioasids > quota) {
> + pr_err("New quota %d is smaller than outstanding IOASIDs %d\n",
> + quota, set->nr_ioasids);
> + ret = -EINVAL;
> + goto done_unlock;
> + }
> +
> + if (quota >= ioasid_capacity_avail) {
> + ret = -ENOSPC;
> + goto done_unlock;
> + }
> +
> + /* Return the delta back to system pool */
> + ioasid_capacity_avail += set->quota - quota;
> +
> + /*
> + * May have a policy to prevent giving all available IOASIDs
> + * to one set. But we don't enforce here, it should be in the
> + * upper layers.
> + */
> + set->quota = quota;
> +
> +done_unlock:
> + spin_unlock(&ioasid_allocator_lock);
> +
> + return ret;
> +}
> +EXPORT_SYMBOL_GPL(ioasid_adjust_set);
> +
> +/**
> + * ioasid_set_for_each_ioasid - Iterate over all the IOASIDs within the set
> + *
> + * Caller must hold a reference of the set and handles its own locking.
> + */
> +int ioasid_set_for_each_ioasid(struct ioasid_set *set,
> + void (*fn)(ioasid_t id, void *data),
> + void *data)
> +{
> + struct ioasid_data *entry;
> + unsigned long index;
> + int ret = 0;
> +
> + if (xa_empty(&set->xa)) {
> + pr_warn("No IOASIDs in the set %d\n", set->sid);
> + return -ENOENT;
> + }
> +
> + xa_for_each(&set->xa, index, entry) {
> + fn(index, data);
> + }
> +
> + return ret;
> +}
> +EXPORT_SYMBOL_GPL(ioasid_set_for_each_ioasid);
>
> /**
> * ioasid_find - Find IOASID data
> diff --git a/include/linux/ioasid.h b/include/linux/ioasid.h
> index 9c44947a68c8..412d025d440e 100644
> --- a/include/linux/ioasid.h
> +++ b/include/linux/ioasid.h
> @@ -10,8 +10,35 @@ typedef unsigned int ioasid_t;
> typedef ioasid_t (*ioasid_alloc_fn_t)(ioasid_t min, ioasid_t max, void *data);
> typedef void (*ioasid_free_fn_t)(ioasid_t ioasid, void *data);
>
> +/* IOASID set types */
> +enum ioasid_set_type {
> + IOASID_SET_TYPE_NULL = 1, /* Set token is NULL */
> + IOASID_SET_TYPE_MM, /* Set token is a mm_struct,
s/mm_struct/mm_struct pointer
> + * i.e. associated with a process
> + */
> + IOASID_SET_TYPE_NR,
> +};
> +
> +/**
> + * struct ioasid_set - Meta data about ioasid_set
> + * @type: Token types and other features
token type. Why "and other features"
> + * @token: Unique to identify an IOASID set
> + * @xa: XArray to store ioasid_set private IDs, can be used for
> + * guest-host IOASID mapping, or just a private IOASID namespace.
> + * @quota: Max number of IOASIDs can be allocated within the set
> + * @nr_ioasids Number of IOASIDs currently allocated in the set
> + * @sid: ID of the set
> + * @ref: Reference count of the users
> + */
> struct ioasid_set {
> - int dummy;
> + void *token;
> + struct xarray xa;
> + int type;
> + int quota;
> + int nr_ioasids;
> + int sid;
nit id? sid has a special meaning on ARM.

> + refcount_t ref;
> + struct rcu_head rcu;
> };
>
> /**
> @@ -29,31 +56,64 @@ struct ioasid_allocator_ops {
> void *pdata;
> };
>
> -#define DECLARE_IOASID_SET(name) struct ioasid_set name = { 0 }
> -
> #if IS_ENABLED(CONFIG_IOASID)
> +void ioasid_install_capacity(ioasid_t total);
> +ioasid_t ioasid_get_capacity(void);
> +struct ioasid_set *ioasid_alloc_set(void *token, ioasid_t quota, int type);
> +int ioasid_adjust_set(struct ioasid_set *set, int quota);
ioasid_set_adjust_quota
> +void ioasid_set_get_locked(struct ioasid_set *set);
as mentionned during the Plumber uConf, the set_get is unfortunate.
Globally I wonder if we shouldn't rename "set" into "pool" or something
alike.
> +void ioasid_set_put_locked(struct ioasid_set *set);
> +void ioasid_set_put(struct ioasid_set *set);
> +
> ioasid_t ioasid_alloc(struct ioasid_set *set, ioasid_t min, ioasid_t max,
> void *private);
> -void ioasid_free(ioasid_t ioasid);
> -void *ioasid_find(struct ioasid_set *set, ioasid_t ioasid,
> - bool (*getter)(void *));
> +void ioasid_free(struct ioasid_set *set, ioasid_t ioasid);
> +
> +bool ioasid_is_active(ioasid_t ioasid);
> +
> +void *ioasid_find(struct ioasid_set *set, ioasid_t ioasid, bool (*getter)(void *));
> +int ioasid_attach_data(ioasid_t ioasid, void *data);
> int ioasid_register_allocator(struct ioasid_allocator_ops *allocator);
> void ioasid_unregister_allocator(struct ioasid_allocator_ops *allocator);
> -int ioasid_attach_data(ioasid_t ioasid, void *data);
> -
> +void ioasid_is_in_set(struct ioasid_set *set, ioasid_t ioasid);
> +int ioasid_set_for_each_ioasid(struct ioasid_set *sdata,
> + void (*fn)(ioasid_t id, void *data),
> + void *data);
> #else /* !CONFIG_IOASID */
> +static inline void ioasid_install_capacity(ioasid_t total)
> +{
> +}
> +
> +static inline ioasid_t ioasid_get_capacity(void)
> +{
> + return 0;
> +}
> +
> static inline ioasid_t ioasid_alloc(struct ioasid_set *set, ioasid_t min,
> ioasid_t max, void *private)
> {
> return INVALID_IOASID;
> }
>
> -static inline void ioasid_free(ioasid_t ioasid)
> +static inline void ioasid_free(struct ioasid_set *set, ioasid_t ioasid)
> +{
> +}
> +
> +static inline bool ioasid_is_active(ioasid_t ioasid)
> +{
> + return false;
> +}
> +
> +static inline struct ioasid_set *ioasid_alloc_set(void *token, ioasid_t quota, int type)
> +{
> + return ERR_PTR(-ENOTSUPP);
> +}
> +
> +static inline void ioasid_set_put(struct ioasid_set *set)
> {
> }
>
> -static inline void *ioasid_find(struct ioasid_set *set, ioasid_t ioasid,
> - bool (*getter)(void *))
> +static inline void *ioasid_find(struct ioasid_set *set, ioasid_t ioasid, bool (*getter)(void *))
> {
> return NULL;
> }
>
I felt very difficult to review this patch. Could you split it into
several ones? maybe introduce the a dummy host_pasid_set and update the
call sites accordingling.

You introduce ownership checking, quota checking, ioasid state, ref
counting, ioasid type handling (whereas existing is NULL) so I have the
feeling that could ease the review process by adopting a more
incremental approach.

Thanks

Eric

2020-09-01 12:08:06

by Eric Auger

[permalink] [raw]
Subject: Re: [PATCH v2 2/9] iommu/ioasid: Rename ioasid_set_data()

Hi jacob,

On 8/22/20 6:35 AM, Jacob Pan wrote:
> Rename ioasid_set_data() to ioasid_attach_data() to avoid confusion with
> struct ioasid_set. ioasid_set is a group of IOASIDs that share a common
> token.
>
> Signed-off-by: Jacob Pan <[email protected]>
Reviewed-by: Eric Auger <[email protected]>

Eric
> ---
> drivers/iommu/intel/svm.c | 6 +++---
> drivers/iommu/ioasid.c | 6 +++---
> include/linux/ioasid.h | 4 ++--
> 3 files changed, 8 insertions(+), 8 deletions(-)
>
> diff --git a/drivers/iommu/intel/svm.c b/drivers/iommu/intel/svm.c
> index b6972dca2ae0..37a9beabc0ca 100644
> --- a/drivers/iommu/intel/svm.c
> +++ b/drivers/iommu/intel/svm.c
> @@ -342,7 +342,7 @@ int intel_svm_bind_gpasid(struct iommu_domain *domain, struct device *dev,
> svm->gpasid = data->gpasid;
> svm->flags |= SVM_FLAG_GUEST_PASID;
> }
> - ioasid_set_data(data->hpasid, svm);
> + ioasid_attach_data(data->hpasid, svm);
> INIT_LIST_HEAD_RCU(&svm->devs);
> mmput(svm->mm);
> }
> @@ -394,7 +394,7 @@ int intel_svm_bind_gpasid(struct iommu_domain *domain, struct device *dev,
> list_add_rcu(&sdev->list, &svm->devs);
> out:
> if (!IS_ERR_OR_NULL(svm) && list_empty(&svm->devs)) {
> - ioasid_set_data(data->hpasid, NULL);
> + ioasid_attach_data(data->hpasid, NULL);
> kfree(svm);
> }
>
> @@ -437,7 +437,7 @@ int intel_svm_unbind_gpasid(struct device *dev, int pasid)
> * the unbind, IOMMU driver will get notified
> * and perform cleanup.
> */
> - ioasid_set_data(pasid, NULL);
> + ioasid_attach_data(pasid, NULL);
> kfree(svm);
> }
> }
> diff --git a/drivers/iommu/ioasid.c b/drivers/iommu/ioasid.c
> index 0f8dd377aada..5f63af07acd5 100644
> --- a/drivers/iommu/ioasid.c
> +++ b/drivers/iommu/ioasid.c
> @@ -258,14 +258,14 @@ void ioasid_unregister_allocator(struct ioasid_allocator_ops *ops)
> EXPORT_SYMBOL_GPL(ioasid_unregister_allocator);
>
> /**
> - * ioasid_set_data - Set private data for an allocated ioasid
> + * ioasid_attach_data - Set private data for an allocated ioasid
> * @ioasid: the ID to set data
> * @data: the private data
> *
> * For IOASID that is already allocated, private data can be set
> * via this API. Future lookup can be done via ioasid_find.
> */
> -int ioasid_set_data(ioasid_t ioasid, void *data)
> +int ioasid_attach_data(ioasid_t ioasid, void *data)
> {
> struct ioasid_data *ioasid_data;
> int ret = 0;
> @@ -287,7 +287,7 @@ int ioasid_set_data(ioasid_t ioasid, void *data)
>
> return ret;
> }
> -EXPORT_SYMBOL_GPL(ioasid_set_data);
> +EXPORT_SYMBOL_GPL(ioasid_attach_data);
>
> /**
> * ioasid_alloc - Allocate an IOASID
> diff --git a/include/linux/ioasid.h b/include/linux/ioasid.h
> index 6f000d7a0ddc..9c44947a68c8 100644
> --- a/include/linux/ioasid.h
> +++ b/include/linux/ioasid.h
> @@ -39,7 +39,7 @@ void *ioasid_find(struct ioasid_set *set, ioasid_t ioasid,
> bool (*getter)(void *));
> int ioasid_register_allocator(struct ioasid_allocator_ops *allocator);
> void ioasid_unregister_allocator(struct ioasid_allocator_ops *allocator);
> -int ioasid_set_data(ioasid_t ioasid, void *data);
> +int ioasid_attach_data(ioasid_t ioasid, void *data);
>
> #else /* !CONFIG_IOASID */
> static inline ioasid_t ioasid_alloc(struct ioasid_set *set, ioasid_t min,
> @@ -67,7 +67,7 @@ static inline void ioasid_unregister_allocator(struct ioasid_allocator_ops *allo
> {
> }
>
> -static inline int ioasid_set_data(ioasid_t ioasid, void *data)
> +static inline int ioasid_attach_data(ioasid_t ioasid, void *data)
> {
> return -ENOTSUPP;
> }
>

2020-09-01 12:46:18

by Eric Auger

[permalink] [raw]
Subject: Re: [PATCH v2 4/9] iommu/ioasid: Add reference couting functions

Hi Jacob,

On 8/22/20 6:35 AM, Jacob Pan wrote:
> There can be multiple users of an IOASID, each user could have hardware
> contexts associated with the IOASID. In order to align lifecycles,
> reference counting is introduced in this patch. It is expected that when
> an IOASID is being freed, each user will drop a reference only after its
> context is cleared.
>
> Signed-off-by: Jacob Pan <[email protected]>
> ---
> drivers/iommu/ioasid.c | 113 +++++++++++++++++++++++++++++++++++++++++++++++++
> include/linux/ioasid.h | 4 ++
> 2 files changed, 117 insertions(+)
>
> diff --git a/drivers/iommu/ioasid.c b/drivers/iommu/ioasid.c
> index f73b3dbfc37a..5f31d63c75b1 100644
> --- a/drivers/iommu/ioasid.c
> +++ b/drivers/iommu/ioasid.c
> @@ -717,6 +717,119 @@ int ioasid_set_for_each_ioasid(struct ioasid_set *set,
> EXPORT_SYMBOL_GPL(ioasid_set_for_each_ioasid);
>
> /**
> + * IOASID refcounting rules
> + * - ioasid_alloc() set initial refcount to 1
> + *
> + * - ioasid_free() decrement and test refcount.
> + * If refcount is 0, ioasid will be freed. Deleted from the system-wide
> + * xarray as well as per set xarray. The IOASID will be returned to the
> + * pool and available for new allocations.
> + *
> + * If recount is non-zero, mark IOASID as IOASID_STATE_FREE_PENDING.
s/recount/refcount
> + * No new reference can be added. The IOASID is not returned to the pool
can be taken
> + * for reuse.
> + * After free, ioasid_get() will return error but ioasid_find() and other
> + * non refcount adding APIs will continue to work until the last reference
> + * is dropped
> + *
> + * - ioasid_get() get a reference on an active IOASID
> + *
> + * - ioasid_put() decrement and test refcount of the IOASID.
> + * If refcount is 0, ioasid will be freed. Deleted from the system-wide
> + * xarray as well as per set xarray. The IOASID will be returned to the
> + * pool and available for new allocations.
> + * Do nothing if refcount is non-zero.
I would drop this last sentence
> + *
> + * - ioasid_find() does not take reference, caller must hold reference
So can you use ioasid_find() once the state is
IOASID_STATE_FREE_PENDING? According to Jean's reply, the "IOASID may be
freed once ioasid_find() returns but not the returned data." So holding
a ref is not mandated right?
> + *
> + * ioasid_free() can be called multiple times without error until all refs are
> + * dropped.
> + */
> +
> +int ioasid_get_locked(struct ioasid_set *set, ioasid_t ioasid)
> +{
> + struct ioasid_data *data;
> +
> + data = xa_load(&active_allocator->xa, ioasid);
> + if (!data) {
> + pr_err("Trying to get unknown IOASID %u\n", ioasid);
> + return -EINVAL;
> + }
> + if (data->state == IOASID_STATE_FREE_PENDING) {
> + pr_err("Trying to get IOASID being freed%u\n", ioasid);
> + return -EBUSY;
> + }
> +
> + if (set && data->set != set) {
> + pr_err("Trying to get IOASID not in set%u\n", ioasid);
maybe try to normalize your traces using always the same formatting for
ioasids and ioasid sets. Also we can understand %u is the id of the set.
> + /* data found but does not belong to the set */
you can get rid of this comment
> + return -EACCES;
> + }
> + refcount_inc(&data->users);
> +
> + return 0;
> +}
> +EXPORT_SYMBOL_GPL(ioasid_get_locked);
> +
> +/**
> + * ioasid_get - Obtain a reference of an ioasid
> + * @set
> + * @ioasid
> + *
> + * Check set ownership if @set is non-null.
> + */
> +int ioasid_get(struct ioasid_set *set, ioasid_t ioasid)
> +{
> + int ret = 0;
> +
> + spin_lock(&ioasid_allocator_lock);
> + ret = ioasid_get_locked(set, ioasid);
> + spin_unlock(&ioasid_allocator_lock);
> +
> + return ret;
> +}
> +EXPORT_SYMBOL_GPL(ioasid_get);
> +
> +void ioasid_put_locked(struct ioasid_set *set, ioasid_t ioasid)
> +{
> + struct ioasid_data *data;
> +
> + data = xa_load(&active_allocator->xa, ioasid);
> + if (!data) {
> + pr_err("Trying to put unknown IOASID %u\n", ioasid);
> + return;
> + }
> +
> + if (set && data->set != set) {
> + pr_err("Trying to drop IOASID not in the set %u\n", ioasid);
was set%u above
> + return;
> + }
> +
> + if (!refcount_dec_and_test(&data->users)) {
> + pr_debug("%s: IOASID %d has %d remainning users\n",
> + __func__, ioasid, refcount_read(&data->users));
> + return;
> + }
> + ioasid_do_free(data);
> +}
> +EXPORT_SYMBOL_GPL(ioasid_put_locked);
> +
> +/**
> + * ioasid_put - Drop a reference of an ioasid
> + * @set
> + * @ioasid
> + *
> + * Check set ownership if @set is non-null.
> + */
> +void ioasid_put(struct ioasid_set *set, ioasid_t ioasid)
> +{
> + spin_lock(&ioasid_allocator_lock);
> + ioasid_put_locked(set, ioasid);
> + spin_unlock(&ioasid_allocator_lock);
> +}
> +EXPORT_SYMBOL_GPL(ioasid_put);
> +
> +/**
> * ioasid_find - Find IOASID data
> * @set: the IOASID set
> * @ioasid: the IOASID to find
> diff --git a/include/linux/ioasid.h b/include/linux/ioasid.h
> index 412d025d440e..310abe4187a3 100644
> --- a/include/linux/ioasid.h
> +++ b/include/linux/ioasid.h
> @@ -76,6 +76,10 @@ int ioasid_attach_data(ioasid_t ioasid, void *data);
> int ioasid_register_allocator(struct ioasid_allocator_ops *allocator);
> void ioasid_unregister_allocator(struct ioasid_allocator_ops *allocator);
> void ioasid_is_in_set(struct ioasid_set *set, ioasid_t ioasid);
> +int ioasid_get(struct ioasid_set *set, ioasid_t ioasid);
> +int ioasid_get_locked(struct ioasid_set *set, ioasid_t ioasid);
> +void ioasid_put(struct ioasid_set *set, ioasid_t ioasid);
> +void ioasid_put_locked(struct ioasid_set *set, ioasid_t ioasid);
> int ioasid_set_for_each_ioasid(struct ioasid_set *sdata,
> void (*fn)(ioasid_t id, void *data),
> void *data);
>
Thanks

Eric

2020-09-01 16:12:23

by Eric Auger

[permalink] [raw]
Subject: Re: [PATCH v2 5/9] iommu/ioasid: Introduce ioasid_set private ID

Hi Jacob,
On 8/22/20 6:35 AM, Jacob Pan wrote:
> When an IOASID set is used for guest SVA, each VM will acquire its
> ioasid_set for IOASID allocations. IOASIDs within the VM must have a
> host/physical IOASID backing, mapping between guest and host IOASIDs can
> be non-identical. IOASID set private ID (SPID) is introduced in this
> patch to be used as guest IOASID. However, the concept of ioasid_set
> specific namespace is generic, thus named SPID.
>
> As SPID namespace is within the IOASID set, the IOASID core can provide
> lookup services at both directions. SPIDs may not be allocated when its
> IOASID is allocated, the mapping between SPID and IOASID is usually
> established when a guest page table is bound to a host PASID.
>
> Signed-off-by: Jacob Pan <[email protected]>
> ---
> drivers/iommu/ioasid.c | 54 ++++++++++++++++++++++++++++++++++++++++++++++++++
> include/linux/ioasid.h | 12 +++++++++++
> 2 files changed, 66 insertions(+)
>
> diff --git a/drivers/iommu/ioasid.c b/drivers/iommu/ioasid.c
> index 5f31d63c75b1..c0aef38a4fde 100644
> --- a/drivers/iommu/ioasid.c
> +++ b/drivers/iommu/ioasid.c
> @@ -21,6 +21,7 @@ enum ioasid_state {
> * struct ioasid_data - Meta data about ioasid
> *
> * @id: Unique ID
> + * @spid: Private ID unique within a set
> * @users Number of active users
> * @state Track state of the IOASID
> * @set Meta data of the set this IOASID belongs to
> @@ -29,6 +30,7 @@ enum ioasid_state {
> */
> struct ioasid_data {
> ioasid_t id;
> + ioasid_t spid;
> struct ioasid_set *set;
> refcount_t users;
> enum ioasid_state state;
> @@ -326,6 +328,58 @@ int ioasid_attach_data(ioasid_t ioasid, void *data)
> EXPORT_SYMBOL_GPL(ioasid_attach_data);
>
> /**
> + * ioasid_attach_spid - Attach ioasid_set private ID to an IOASID
> + *
> + * @ioasid: the ID to attach
> + * @spid: the ioasid_set private ID of @ioasid
> + *
> + * For IOASID that is already allocated, private ID within the set can be
> + * attached via this API. Future lookup can be done via ioasid_find.
I would remove "For IOASID that is already allocated, private ID within
the set can be attached via this API"
> + */
> +int ioasid_attach_spid(ioasid_t ioasid, ioasid_t spid)
> +{
> + struct ioasid_data *ioasid_data;
> + int ret = 0;
> +
> + spin_lock(&ioasid_allocator_lock);
We keep on saying the SPID is local to an IOASID set but we don't check
any IOASID set contains this ioasid. It looks a bit weird to me.
> + ioasid_data = xa_load(&active_allocator->xa, ioasid);
> +
> + if (!ioasid_data) {
> + pr_err("No IOASID entry %d to attach SPID %d\n",
> + ioasid, spid);
> + ret = -ENOENT;
> + goto done_unlock;
> + }
> + ioasid_data->spid = spid;
is there any way/need to remove an spid binding?
> +
> +done_unlock:
> + spin_unlock(&ioasid_allocator_lock);
> + return ret;
> +}
> +EXPORT_SYMBOL_GPL(ioasid_attach_spid);
> +
> +ioasid_t ioasid_find_by_spid(struct ioasid_set *set, ioasid_t spid)
> +{
> + struct ioasid_data *entry;
> + unsigned long index;
> +
> + if (!xa_load(&ioasid_sets, set->sid)) {
> + pr_warn("Invalid set\n");
> + return INVALID_IOASID;
> + }
> +
> + xa_for_each(&set->xa, index, entry) {
> + if (spid == entry->spid) {
> + pr_debug("Found ioasid %lu by spid %u\n", index, spid);
> + refcount_inc(&entry->users);
> + return index;
> + }
> + }
> + return INVALID_IOASID;
> +}
> +EXPORT_SYMBOL_GPL(ioasid_find_by_spid);
> +
> +/**
> * ioasid_alloc - Allocate an IOASID
> * @set: the IOASID set
> * @min: the minimum ID (inclusive)
> diff --git a/include/linux/ioasid.h b/include/linux/ioasid.h
> index 310abe4187a3..d4b3e83672f6 100644
> --- a/include/linux/ioasid.h
> +++ b/include/linux/ioasid.h
> @@ -73,6 +73,8 @@ bool ioasid_is_active(ioasid_t ioasid);
>
> void *ioasid_find(struct ioasid_set *set, ioasid_t ioasid, bool (*getter)(void *));
> int ioasid_attach_data(ioasid_t ioasid, void *data);
> +int ioasid_attach_spid(ioasid_t ioasid, ioasid_t spid);
> +ioasid_t ioasid_find_by_spid(struct ioasid_set *set, ioasid_t spid);
> int ioasid_register_allocator(struct ioasid_allocator_ops *allocator);
> void ioasid_unregister_allocator(struct ioasid_allocator_ops *allocator);
> void ioasid_is_in_set(struct ioasid_set *set, ioasid_t ioasid);
> @@ -136,5 +138,15 @@ static inline int ioasid_attach_data(ioasid_t ioasid, void *data)
> return -ENOTSUPP;
> }
>
> +staic inline int ioasid_attach_spid(ioasid_t ioasid, ioasid_t spid)
> +{
> + return -ENOTSUPP;
> +}
> +
> +static inline ioasid_t ioasid_find_by_spid(struct ioasid_set *set, ioasid_t spid)
> +{
> + return -ENOTSUPP;
> +}
> +
> #endif /* CONFIG_IOASID */
> #endif /* __LINUX_IOASID_H */
>
Thanks

Eric

2020-09-01 16:52:23

by Jacob Pan

[permalink] [raw]
Subject: Re: [PATCH v2 1/9] docs: Document IO Address Space ID (IOASID) APIs

Hi Eric,

On Thu, 27 Aug 2020 18:21:07 +0200
Auger Eric <[email protected]> wrote:

> Hi Jacob,
> On 8/24/20 12:32 PM, Jean-Philippe Brucker wrote:
> > On Fri, Aug 21, 2020 at 09:35:10PM -0700, Jacob Pan wrote:
> >> IOASID is used to identify address spaces that can be targeted by
> >> device DMA. It is a system-wide resource that is essential to its
> >> many users. This document is an attempt to help developers from
> >> all vendors navigate the APIs. At this time, ARM SMMU and Intel’s
> >> Scalable IO Virtualization (SIOV) enabled platforms are the
> >> primary users of IOASID. Examples of how SIOV components interact
> >> with IOASID APIs are provided in that many APIs are driven by the
> >> requirements from SIOV.
> >>
> >> Signed-off-by: Liu Yi L <[email protected]>
> >> Signed-off-by: Wu Hao <[email protected]>
> >> Signed-off-by: Jacob Pan <[email protected]>
> >> ---
> >> Documentation/ioasid.rst | 618
> >> +++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed,
> >> 618 insertions(+) create mode 100644 Documentation/ioasid.rst
> >>
> >> diff --git a/Documentation/ioasid.rst b/Documentation/ioasid.rst
> >
> > Thanks for writing this up. Should it go to
> > Documentation/driver-api/, or Documentation/driver-api/iommu/? I
> > think this also needs to Cc [email protected] and
> > [email protected]
> >> new file mode 100644
> >> index 000000000000..b6a8cdc885ff
> >> --- /dev/null
> >> +++ b/Documentation/ioasid.rst
> >> @@ -0,0 +1,618 @@
> >> +.. ioasid:
> >> +
> >> +=====================================
> >> +IO Address Space ID
> >> +=====================================
> >> +
> >> +IOASID is a generic name for PCIe Process Address ID (PASID) or
> >> ARM +SMMU sub-stream ID. An IOASID identifies an address space
> >> that DMA
> >
> > "SubstreamID"
> On ARM if we don't use PASIDs we have streamids (SID) which can also
> identify address spaces that DMA requests can target. So maybe this
> definition is not sufficient.
>
According to SMMU spec, the SubstreamID is equivalent to PASID. My
understanding is that SID is equivalent to PCI requester ID that
identifies stage 2. Do you plan to use IOASID for stage 2?
IOASID is mostly for SVA and DMA request w/ PASID.

> >
> >> +requests can target.
> >> +
> >> +The primary use cases for IOASID are Shared Virtual Address (SVA)
> >> and +IO Virtual Address (IOVA). However, the requirements for
> >> IOASID
> >
> > IOVA alone isn't a use case, maybe "multiple IOVA spaces per
> > device"?
> >> +management can vary among hardware architectures.
> >> +
> >> +This document covers the generic features supported by IOASID
> >> +APIs. Vendor-specific use cases are also illustrated with Intel's
> >> VT-d +based platforms as the first example.
> >> +
> >> +.. contents:: :local:
> >> +
> >> +Glossary
> >> +========
> >> +PASID - Process Address Space ID
> >> +
> >> +IOASID - IO Address Space ID (generic term for PCIe PASID and
> >> +sub-stream ID in SMMU)
> >
> > "SubstreamID"
> >
> >> +
> >> +SVA/SVM - Shared Virtual Addressing/Memory
> >> +
> >> +ENQCMD - New Intel X86 ISA for efficient workqueue submission
> >> [1]
> >
> > Maybe drop the "New", to keep the documentation perennial. It might
> > be good to add internal links here to the specifications URLs at
> > the bottom.
> >> +
> >> +DSA - Intel Data Streaming Accelerator [2]
> >> +
> >> +VDCM - Virtual device composition module [3]
> >> +
> >> +SIOV - Intel Scalable IO Virtualization
> >> +
> >> +
> >> +Key Concepts
> >> +============
> >> +
> >> +IOASID Set
> >> +-----------
> >> +An IOASID set is a group of IOASIDs allocated from the system-wide
> >> +IOASID pool. An IOASID set is created and can be identified by a
> >> +token of u64. Refer to IOASID set APIs for more details.
> >
> > Identified either by an u64 or an mm_struct, right? Maybe just
> > drop the second sentence if it's detailed in the IOASID set section
> > below.
> >> +
> >> +IOASID set is particularly useful for guest SVA where each guest
> >> could +have its own IOASID set for security and efficiency reasons.
> >> +
> >> +IOASID Set Private ID (SPID)
> >> +----------------------------
> >> +SPIDs are introduced as IOASIDs within its set. Each SPID maps to
> >> a +system-wide IOASID but the namespace of SPID is within its
> >> IOASID +set.
> >
> > The intro isn't super clear. Perhaps this is simpler:
> > "Each IOASID set has a private namespace of SPIDs. An SPID maps to a
> > single system-wide IOASID."
> or, "within an ioasid set, each ioasid can be associated with an alias
> ID, named SPID."
I don't have strong opinion, I feel it is good to explain the
relationship between SPID and IOASID in both directions, how about add?
" Conversely, each IOASID is associated with an alias ID, named SPID."

> >
> >> SPIDs can be used as guest IOASIDs where each guest could do
> >> +IOASID allocation from its own pool and map them to host physical
> >> +IOASIDs. SPIDs are particularly useful for supporting live
> >> migration +where decoupling guest and host physical resources are
> >> necessary. +
> >> +For example, two VMs can both allocate guest PASID/SPID #101 but
> >> map to +different host PASIDs #201 and #202 respectively as shown
> >> in the +diagram below.
> >> +::
> >> +
> >> + .------------------. .------------------.
> >> + | VM 1 | | VM 2 |
> >> + | | | |
> >> + |------------------| |------------------|
> >> + | GPASID/SPID 101 | | GPASID/SPID 101 |
> >> + '------------------' -------------------' Guest
> >> + __________|______________________|______________________
> >> + | | Host
> >> + v v
> >> + .------------------. .------------------.
> >> + | Host IOASID 201 | | Host IOASID 202 |
> >> + '------------------' '------------------'
> >> + | IOASID set 1 | | IOASID set 2 |
> >> + '------------------' '------------------'
> >> +
> >> +Guest PASID is treated as IOASID set private ID (SPID) within an
> >> +IOASID set, mappings between guest and host IOASIDs are stored in
> >> the +set for inquiry.
> >> +
> >> +IOASID APIs
> >> +===========
> >> +To get the IOASID APIs, users must #include <linux/ioasid.h>.
> >> These APIs +serve the following functionalities:
> >> +
> >> + - IOASID allocation/Free
> >> + - Group management in the form of ioasid_set
> >> + - Private data storage and lookup
> >> + - Reference counting
> >> + - Event notification in case of state change
> (a)
got it

> >> +
> >> +IOASID Set Level APIs
> >> +--------------------------
> >> +For use cases such as guest SVA it is necessary to manage IOASIDs
> >> at +a group level. For example, VMs may allocate multiple IOASIDs
> >> for
> I would use the introduced ioasid_set terminology instead of "group".
Right, we already introduced it.

> >> +guest process address sharing (vSVA). It is imperative to enforce
> >> +VM-IOASID ownership such that malicious guest cannot target DMA
> >
> > "a malicious guest"
> >
> >> +traffic outside its own IOASIDs, or free an active IOASID belong
> >> to
> >
> > "that belongs to"
> >
> >> +another VM.
> >> +::
> >> +
> >> + struct ioasid_set *ioasid_alloc_set(void *token, ioasid_t quota,
> >> u32 type)
> what is this void *token? also the type may be explained here.
token is explained in the text following API list. I can move it up.

> >> +
> >> + int ioasid_adjust_set(struct ioasid_set *set, int quota);
> >
> > These could be named "ioasid_set_alloc" and "ioasid_set_adjust" to
> > be consistent with the rest of the API.
> >
> >> +
> >> + void ioasid_set_get(struct ioasid_set *set)
> >> +
> >> + void ioasid_set_put(struct ioasid_set *set)
> >> +
> >> + void ioasid_set_get_locked(struct ioasid_set *set)
> >> +
> >> + void ioasid_set_put_locked(struct ioasid_set *set)
> >> +
> >> + int ioasid_set_for_each_ioasid(struct ioasid_set *sdata,
> >
> > Might be nicer to keep the same argument names within the API. Here
> > "set" rather than "sdata".
> >
> >> + void (*fn)(ioasid_t id, void
> >> *data),
> >> + void *data)
> >
> > (alignment)
> >
> >> +
> >> +
> >> +IOASID set concept is introduced to represent such IOASID groups.
> >> Each
> >
> > Or just "IOASID sets represent such IOASID groups", but might be
> > redundant.
> >
> >> +IOASID set is created with a token which can be one of the
> >> following +types:
> I think this explanation should happen before the above function
> prototypes
ditto.

> >> +
> >> + - IOASID_SET_TYPE_NULL (Arbitrary u64 value)
> >> + - IOASID_SET_TYPE_MM (Set token is a mm_struct)
> >> +
> >> +The explicit MM token type is useful when multiple users of an
> >> IOASID +set under the same process need to communicate about their
> >> shared IOASIDs. +E.g. An IOASID set created by VFIO for one guest
> >> can be associated +with the KVM instance for the same guest since
> >> they share a common mm_struct. +
> >> +The IOASID set APIs serve the following purposes:
> >> +
> >> + - Ownership/permission enforcement
> >> + - Take collective actions, e.g. free an entire set
> >> + - Event notifications within a set
> >> + - Look up a set based on token
> >> + - Quota enforcement
> >
> > This paragraph could be earlier in the section
>
> yes this is a kind of repetition of (a), above
I meant to highlight on what the APIs do such that readers don't
need to read the code instead.

> >
> >> +
> >> +Individual IOASID APIs
> >> +----------------------
> >> +Once an ioasid_set is created, IOASIDs can be allocated from the
> >> set. +Within the IOASID set namespace, set private ID (SPID) is
> >> supported. In +the VM use case, SPID can be used for storing guest
> >> PASID. +
> >> +::
> >> +
> >> + ioasid_t ioasid_alloc(struct ioasid_set *set, ioasid_t min,
> >> ioasid_t max,
> >> + void *private);
> >> +
> >> + int ioasid_get(struct ioasid_set *set, ioasid_t ioasid);
> >> +
> >> + void ioasid_put(struct ioasid_set *set, ioasid_t ioasid);
> >> +
> >> + int ioasid_get_locked(struct ioasid_set *set, ioasid_t ioasid);
> >> +
> >> + void ioasid_put_locked(struct ioasid_set *set, ioasid_t ioasid);
> >> +
> >> + void *ioasid_find(struct ioasid_set *set, ioasid_t ioasid,
> >> + bool (*getter)(void *));
> >> +
> >> + ioasid_t ioasid_find_by_spid(struct ioasid_set *set, ioasid_t
> >> spid) +
> >> + int ioasid_attach_data(struct ioasid_set *set, ioasid_t ioasid,
> >> + void *data);
> >> + int ioasid_attach_spid(struct ioasid_set *set, ioasid_t ioasid,
> >> + ioasid_t ssid);
> >
> > s/ssid/spid>
got it

> >> +
> >> +
> >> +Notifications
> >> +-------------
> >> +An IOASID may have multiple users, each user may have hardware
> >> context +associated with an IOASID. When the status of an IOASID
> >> changes, +e.g. an IOASID is being freed, users need to be notified
> >> such that the +associated hardware context can be cleared,
> >> flushed, and drained. +
> >> +::
> >> +
> >> + int ioasid_register_notifier(struct ioasid_set *set, struct
> >> + notifier_block *nb)
> >> +
> >> + void ioasid_unregister_notifier(struct ioasid_set *set,
> >> + struct notifier_block *nb)
> >> +
> >> + int ioasid_register_notifier_mm(struct mm_struct *mm, struct
> >> + notifier_block *nb)
> >> +
> >> + void ioasid_unregister_notifier_mm(struct mm_struct *mm, struct
> >> + notifier_block *nb)
> the mm_struct prototypes may be justified
This is the mm type token, i.e.
- IOASID_SET_TYPE_MM (Set token is a mm_struct)
I am not sure if it is better to keep the explanation in code or in
this document, certainly don't want to duplicate.

> >> +
> >> + int ioasid_notify(ioasid_t ioasid, enum ioasid_notify_val cmd,
> >> + unsigned int flags)
> this one is not obvious either.
Here I just wanted to list the API functions, perhaps readers can check
out the code comments?

> >> +
> >> +
> >> +Events
> >> +~~~~~~
> >> +Notification events are pertinent to individual IOASIDs, they can
> >> be +one of the following:
> >> +
> >> + - ALLOC
> >> + - FREE
> >> + - BIND
> >> + - UNBIND
> >> +
> >> +Ordering
> >> +~~~~~~~~
> >> +Ordering is supported by IOASID notification priorities as the
> >> +following (in ascending order):
> >> +
> >> +::
> >> +
> >> + enum ioasid_notifier_prios {
> >> + IOASID_PRIO_LAST,
> >> + IOASID_PRIO_IOMMU,
> >> + IOASID_PRIO_DEVICE,
> >> + IOASID_PRIO_CPU,
> >> + };
>
> Maybe:
> when registered, notifiers are assigned a priority that affect the
> call order. Notifiers with CPU priority get called before notifiers
> with device priority and so on.
Sounds good.

> >> +
> >> +The typical use case is when an IOASID is freed due to an
> >> exception, DMA +source should be quiesced before tearing down
> >> other hardware contexts +in the system. This will reduce the churn
> >> in handling faults. DMA work +submission is performed by the CPU
> >> which is granted higher priority than +devices.
> >> +
> >> +
> >> +Scopes
> >> +~~~~~~
> >> +There are two types of notifiers in IOASID core: system-wide and
> >> +ioasid_set-wide.
> >> +
> >> +System-wide notifier is catering for users that need to handle all
> >> +IOASIDs in the system. E.g. The IOMMU driver handles all IOASIDs.
> >> +
> >> +Per ioasid_set notifier can be used by VM specific components
> >> such as +KVM. After all, each KVM instance only cares about
> >> IOASIDs within its +own set.
> >> +
> >> +
> >> +Atomicity
> >> +~~~~~~~~~
> >> +IOASID notifiers are atomic due to spinlocks used inside the
> >> IOASID +core. For tasks cannot be completed in the notifier
> >> handler, async work
> >
> > "tasks that cannot be"
> >
> >> +can be submitted to complete the work later as long as there is no
> >> +ordering requirement.
> >> +
> >> +Reference counting
> >> +------------------
> >> +IOASID lifecycle management is based on reference counting. Users
> >> of +IOASID intend to align lifecycle with the IOASID need to hold
> >
> > "who intend to"
> >
> >> +reference of the IOASID. IOASID will not be returned to the pool
> >> for
> >
> > "a reference to the IOASID. The IOASID"
> >
> >> +allocation until all references are dropped. Calling ioasid_free()
> >> +will mark the IOASID as FREE_PENDING if the IOASID has outstanding
> >> +reference. ioasid_get() is not allowed once an IOASID is in the
> >> +FREE_PENDING state.
> >> +
> >> +Event notifications are used to inform users of IOASID status
> >> change. +IOASID_FREE event prompts users to drop their references
> >> after +clearing its context.
> >> +
> >> +For example, on VT-d platform when an IOASID is freed, teardown
> >> +actions are performed on KVM, device driver, and IOMMU driver.
> >> +KVM shall register notifier block with::
> >> +
> >> + static struct notifier_block pasid_nb_kvm = {
> >> + .notifier_call = pasid_status_change_kvm,
> >> + .priority = IOASID_PRIO_CPU,
> >> + };
> >> +
> >> +VDCM driver shall register notifier block with::
> >> +
> >> + static struct notifier_block pasid_nb_vdcm = {
> >> + .notifier_call = pasid_status_change_vdcm,
> >> + .priority = IOASID_PRIO_DEVICE,
> >> + };
> not sure those code snippets are really useful. Maybe simply say who
> is supposed to use each prio.
Agreed, not all the bits in the snippets are explained. I will explain
KVM and VDCM need to use priority to ensure call order.

> >> +
> >> +In both cases, notifier blocks shall be registered on the IOASID
> >> set +such that *only* events from the matching VM is received.
> >> +
> >> +If KVM attempts to register notifier block before the IOASID set
> >> is +created for the MM token, the notifier block will be placed on
> >> a
> using the MM token
sounds good

> >> +pending list inside IOASID core. Once the token matching IOASID
> >> set +is created, IOASID will register the notifier block
> >> automatically.
> Is this implementation mandated? Can't you enforce the ioasid_set to
> be created before the notifier gets registered?
> >> +IOASID core does not replay events for the existing IOASIDs in the
> >> +set. For IOASID set of MM type, notification blocks can be
> >> registered +on empty sets only. This is to avoid lost events.
> >> +
> >> +IOMMU driver shall register notifier block on global chain::
> >> +
> >> + static struct notifier_block pasid_nb_vtd = {
> >> + .notifier_call = pasid_status_change_vtd,
> >> + .priority = IOASID_PRIO_IOMMU,
> >> + };
> >> +
> >> +Custom allocator APIs
> >> +---------------------
> >> +
> >> +::
> >> +
> >> + int ioasid_register_allocator(struct ioasid_allocator_ops
> >> *allocator); +
> >> + void ioasid_unregister_allocator(struct ioasid_allocator_ops
> >> *allocator); +
> >> +Allocator Choices
> >> +~~~~~~~~~~~~~~~~~
> >> +IOASIDs are allocated for both host and guest SVA/IOVA usage.
> >> However, +allocators can be different. For example, on VT-d guest
> >> PASID +allocation must be performed via a virtual command
> >> interface which is +emulated by VMM.
> >> +
> >> +IOASID core has the notion of "custom allocator" such that guest
> >> can +register virtual command allocator that precedes the default
> >> one. +
> >> +Namespaces
> >> +~~~~~~~~~~
> >> +IOASIDs are limited system resources that default to 20 bits in
> >> +size. Since each device has its own table, theoretically the
> >> namespace +can be per device also. However, for security reasons
> >> sharing PASID +tables among devices are not good for isolation.
> >> Therefore, IOASID +namespace is system-wide.
> >
> > I don't follow this development. Having per-device PASID table
> > would work fine for isolation (assuming no hardware bug
> > necessitating IOMMU groups). If I remember correctly IOASID space
> > was chosen to be OS-wide because it simplifies the management code
> > (single PASID per task), and it is system-wide across VMs only in
> > the case of VT-d scalable mode.
> >> +
> >> +There are also other reasons to have this simpler system-wide
> >> +namespace. Take VT-d as an example, VT-d supports shared workqueue
> >> +and ENQCMD[1] where one IOASID could be used to submit work on
> >
> > Maybe use the Sphinx glossary syntax rather than "[1]"
> > https://www.sphinx-doc.org/en/master/usage/restructuredtext/directives.html#glossary-directive
> >
> >> +multiple devices that are shared with other VMs. This requires
> >> IOASID +to be system-wide. This is also the reason why guests must
> >> use an +emulated virtual command interface to allocate IOASID from
> >> the host. +
> >> +
> >> +Life cycle
> >> +==========
> >> +This section covers IOASID lifecycle management for both
> >> bare-metal +and guest usages. In bare-metal SVA, MMU notifier is
> >> directly hooked +up with IOMMU driver, therefore the process
> >> address space (MM) +lifecycle is aligned with IOASID.
> therefore the IOASID lifecyle matches the process address space (MM)
> lifecyle?
Sounds good.

> >> +
> >> +However, guest MMU notifier is not available to host IOMMU
> >> driver,
> the guest MMU notifier
> >> +when guest MM terminates unexpectedly, the events have to go
> >> through
> the guest MM
> >> +VFIO and IOMMU UAPI to reach host IOMMU driver. There are also
> >> more +parties involved in guest SVA, e.g. on Intel VT-d platform,
> >> IOASIDs +are used by IOMMU driver, KVM, VDCM, and VFIO.
> >> +
> >> +Native IOASID Life Cycle (VT-d Example)
> >> +---------------------------------------
> >> +
> >> +The normal flow of native SVA code with Intel Data Streaming
> >> +Accelerator(DSA) [2] as example:
> >> +
> >> +1. Host user opens accelerator FD, e.g. DSA driver, or uacce;
> >> +2. DSA driver allocate WQ, do sva_bind_device();
> >> +3. IOMMU driver calls ioasid_alloc(), then bind PASID with device,
> >> + mmu_notifier_get()
> >> +4. DMA starts by DSA driver userspace
> >> +5. DSA userspace close FD
> >> +6. DSA/uacce kernel driver handles FD.close()
> >> +7. DSA driver stops DMA
> >> +8. DSA driver calls sva_unbind_device();
> >> +9. IOMMU driver does unbind, clears PASID context in IOMMU, flush
> >> + TLBs. mmu_notifier_put() called.
> >> +10. mmu_notifier.release() called, IOMMU SVA code calls
> >> ioasid_free()* +11. The IOASID is returned to the pool, reclaimed.
> >> +
> >> +::
> >> +
> >
> > Use a footnote?
> > https://www.sphinx-doc.org/en/master/usage/restructuredtext/basics.html#footnotes
> >> + * With ENQCMD, PASID used on VT-d is not released in
> >> mmu_notifier() but
> >> + mmdrop(). mmdrop comes after FD close. Should not matter.
> >
> > "comes after FD close, which doesn't make a difference?"
> > The following might not be necessary since early process
> > termination is described later.
> >
> >> + If the user process dies unexpectedly, Step #10 may come
> >> before
> >> + Step #5, in between, all DMA faults discarded. PRQ responded
> >> with
> >
> > PRQ hasn't been defined in this document.
> >
> >> + code INVALID REQUEST.
> >> +
> >> +During the normal teardown, the following three steps would
> >> happen in +order:
> can't this be illustrated in the above 1-11 sequence, just adding
> NORMAL TEARDONW before #7?
> >> +
> >> +1. Device driver stops DMA request
> >> +2. IOMMU driver unbinds PASID and mm, flush all TLBs, drain
> >> in-flight
> >> + requests.
> >> +3. IOASID freed
> >> +
> Then you can just focus on abnormal termination
Yes, will refer to the steps starting #7. These can be removed.

> >> +Exception happens when process terminates *before* device driver
> >> stops +DMA and call IOMMU driver to unbind. The flow of process
> >> exists are as
> Can't this be explained with something simpler looking at the steps
> 1-11?
It meant to be educational given this level of details. Simpler
steps are labeled with (1) (2) (3). Perhaps these labels didn't stand
out right? I will use the steps in the 1-11 sequence.

> >
> > "exits"
> >
> >> +follows:
> >> +
> >> +::
> >> +
> >> + do_exit() {
> >> + exit_mm() {
> >> + mm_put();
> >> + exit_mmap() {
> >> + intel_invalidate_range() //mmu notifier
> >> + tlb_finish_mmu()
> >> + mmu_notifier_release(mm) {
> >> + intel_iommu_release() {
> >> + [2]
> >> intel_iommu_teardown_pasid();
> >
> > Parentheses might be better than square brackets for step numbers
> >
> >> + intel_iommu_flush_tlbs();
> >> + }
> >> + // tlb_invalidate_range cb removed
> >> + }
> >> + unmap_vmas();
> >> + free_pgtables(); // IOMMU cannot walk PGT
> >> after this
> >> + };
> >> + }
> >> + exit_files(tsk) {
> >> + close_files() {
> >> + dsa_close();
> >> + [1] dsa_stop_dma();
> >> + intel_svm_unbind_pasid(); //nothing to do
> >> + }
> >> + }
> >> + }
> >> +
> >> + mmdrop() /* some random time later, lazy mm user */ {
> >> + mm_free_pgd();
> >> + destroy_context(mm); {
> >> + [3] ioasid_free();
> >> + }
> >> + }
> >> +
> >> +As shown in the list above, step #2 could happen before
> >> +#1. Unrecoverable(UR) faults could happen between #2 and #1.
> >> +
> >> +Also notice that TLB invalidation occurs at mmu_notifier
> >> +invalidate_range callback as well as the release callback. The
> >> reason +is that release callback will delete IOMMU driver from the
> >> notifier +chain which may skip invalidate_range() calls during the
> >> exit path. +
> >> +To avoid unnecessary reporting of UR fault, IOMMU driver shall
> >> disable
> UR?
Unrecoverable, mentioned in the previous paragraph.

> >> +fault reporting after free and before unbind.
> >> +
> >> +Guest IOASID Life Cycle (VT-d Example)
> >> +--------------------------------------
> >> +Guest IOASID life cycle starts with guest driver open(), this
> >> could be +uacce or individual accelerator driver such as DSA. At
> >> FD open, +sva_bind_device() is called which triggers a series of
> >> actions. +
> >> +The example below is an illustration of *normal* operations that
> >> +involves *all* the SW components in VT-d. The flow can be simpler
> >> if +no ENQCMD is supported.
> >> +
> >> +::
> >> +
> >> + VFIO IOMMU KVM VDCM IOASID
> >> Ref
> >> + ..................................................................
> >> + 1 ioasid_register_notifier/_mm()
> >> + 2 ioasid_alloc()
> >> 1
> >> + 3 bind_gpasid()
> >> + 4 iommu_bind()->ioasid_get()
> >> 2
> >> + 5 ioasid_notify(BIND)
> >> + 6 -> ioasid_get()
> >> 3
> >> + 7 -> vmcs_update_atomic()
> >> + 8 mdev_write(gpasid)
> >> + 9 hpasid=
> >> + 10 find_by_spid(gpasid)
> >> 4
> >> + 11 vdev_write(hpasid)
> >> + 12 -------- GUEST STARTS DMA --------------------------
> >> + 13 -------- GUEST STOPS DMA --------------------------
> >> + 14 mdev_clear(gpasid)
> >> + 15 vdev_clear(hpasid)
> >> + 16
> >> ioasid_put() 3
> >> + 17 unbind_gpasid()
> >> + 18 iommu_ubind()
> >> + 19 ioasid_notify(UNBIND)
> >> + 20 -> vmcs_update_atomic()
> >> + 21 ->
> >> ioasid_put() 2
> >> + 22
> >> ioasid_free() 1
> >> + 23
> >> ioasid_put() 0
> >> + 24 Reclaimed
> >> + -------------- New Life Cycle Begin
> >> ----------------------------
> >> + 1 ioasid_alloc()
> >> -> 1 +
> >> + Note: IOASID Notification Events: FREE, BIND, UNBIND
> >> +
> >> +Exception cases arise when a guest crashes or a malicious guest
> >> +attempts to cause disruption on the host system. The fault
> >> handling +rules are:
> >> +
> >> +1. IOASID free must *always* succeed.
> >> +2. An inactive period may be required before the freed IOASID is
> >> + reclaimed. During this period, consumers of IOASID perform
> >> cleanup. +3. Malfunction is limited to the guest owned resources
> >> for all
> >> + programming errors.
> >> +
> >> +The primary source of exception is when the following are out of
> >> +order:
> >> +
> >> +1. Start/Stop of DMA activity
> >> + (Guest device driver, mdev via VFIO)
> please explain the meaning of what is inside (): initiator?
> >> +2. Setup/Teardown of IOMMU PASID context, IOTLB, DevTLB flushes
> >> + (Host IOMMU driver bind/unbind)
> >> +3. Setup/Teardown of VMCS PASID translation table entries (KVM) in
> >> + case of ENQCMD
> >> +4. Programming/Clearing host PASID in VDCM (Host VDCM driver)
> >> +5. IOASID alloc/free (Host IOASID)
> >> +
> >> +VFIO is the *only* user-kernel interface, which is ultimately
> >> +responsible for exception handlings.
> >
> > "handling"
> >
> >> +
> >> +#1 is processed the same way as the assigned device today based on
> >> +device file descriptors and events. There is no special handling.
> >> +
> >> +#3 is based on bind/unbind events emitted by #2.
> >> +
> >> +#4 is naturally aligned with IOASID life cycle in that an illegal
> >> +guest PASID programming would fail in obtaining reference of the
> >> +matching host IOASID.
> >> +
> >> +#5 is similar to #4. The fault will be reported to the user if
> >> PASID +used in the ENQCMD is not set up in VMCS PASID translation
> >> table. +
> >> +Therefore, the remaining out of order problem is between #2 and
> >> +#5. I.e. unbind vs. free. More specifically, free before unbind.
> >> +
> >> +IOASID notifier and refcounting are used to ensure order.
> >> Following +a publisher-subscriber pattern where:
> with the following actors:
> >> +
> >> +- Publishers: VFIO & IOMMU
> >> +- Subscribers: KVM, VDCM, IOMMU
> this may be introduced before.
> >> +
> >> +IOASID notifier is atomic which requires subscribers to do quick
> >> +handling of the event in the atomic context. Workqueue can be
> >> used for +any processing that requires thread context.
> repetition of what was said before.
> IOASID reference must be
Right, will remove.

> >> +acquired before receiving the FREE event. The reference must be
> >> +dropped at the end of the processing in order to return the
> >> IOASID to +the pool.
> >> +
> >> +Let's examine the IOASID life cycle again when free happens
> >> *before* +unbind. This could be a result of misbehaving guests or
> >> crash. Assuming +VFIO cannot enforce unbind->free order. Notice
> >> that the setup part up +until step #12 is identical to the normal
> >> case, the flow below starts +with step 13.
> >> +
> >> +::
> >> +
> >> + VFIO IOMMU KVM VDCM IOASID
> >> Ref
> >> + ..................................................................
> >> + 13 -------- GUEST STARTS DMA --------------------------
> >> + 14 -------- *GUEST MISBEHAVES!!!* ----------------
> >> + 15 ioasid_free()
> >> + 16
> >> ioasid_notify(FREE)
> >> + 17
> >> mark_ioasid_inactive[1]
> >> + 18 kvm_nb_handler(FREE)
> >> + 19 vmcs_update_atomic()
> >> + 20 ioasid_put_locked() ->
> >> 3
> >> + 21 vdcm_nb_handler(FREE)
> >> + 22 iomm_nb_handler(FREE)
> >> + 23 ioasid_free() returns[2] schedule_work()
> >> 2
> >> + 24 schedule_work() vdev_clear_wk(hpasid)
> >> + 25 teardown_pasid_wk()
> >> + 26 ioasid_put() ->
> >> 1
> >> + 27 ioasid_put()
> >> 0
> >> + 28 Reclaimed
> >> + 29 unbind_gpasid()
> >> + 30 iommu_unbind()->ioasid_find() Fails[3]
> >> + -------------- New Life Cycle Begin
> >> ---------------------------- +
> >> +Note:
> >> +
> >> +1. By marking IOASID inactive at step #17, no new references can
> >> be
> >
> > Is "inactive" FREE_PENDING?
> >
> >> + held. ioasid_get/find() will return -ENOENT;
> >> +2. After step #23, all events can go out of order. Shall not
> >> affect
> >> + the outcome.
> >> +3. IOMMU driver fails to find private data for unbinding. If
> >> unbind is
> >> + called after the same IOASID is allocated for the same guest
> >> again,
> >> + this is a programming error. The damage is limited to the guest
> >> + itself since unbind performs permission checking based on the
> >> + IOASID set associated with the guest process.
> >> +
> >> +KVM PASID Translation Table Updates
> >> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> >> +Per VM PASID translation table is maintained by KVM in order to
> >> +support ENQCMD in the guest. The table contains host-guest PASID
> >> +translations to be consumed by CPU ucode. The synchronization of
> >> the +PASID states depends on VFIO/IOMMU driver, where IOCTL and
> >> atomic +notifiers are used. KVM must register IOASID notifier per
> >> VM instance +during launch time. The following events are handled:
> >> +
> >> +1. BIND/UNBIND
> >> +2. FREE
> >> +
> >> +Rules:
> >> +
> >> +1. Multiple devices can bind with the same PASID, this can be
> >> different PCI
> >> + devices or mdevs within the same PCI device. However, only the
> >> + *first* BIND and *last* UNBIND emit notifications.
> >> +2. IOASID code is responsible for ensuring the correctness of H-G
> >> + PASID mapping. There is no need for KVM to validate the
> >> + notification data.
> >> +3. When UNBIND happens *after* FREE, KVM will see error in
> >> + ioasid_get() even when the reclaim is not done. IOMMU driver
> >> will
> >> + also avoid sending UNBIND if the PASID is already FREE.
> >> +4. When KVM terminates *before* FREE & UNBIND, references will be
> >> + dropped for all host PASIDs.
> >> +
> >> +VDCM PASID Programming
> >> +~~~~~~~~~~~~~~~~~~~~~~
> >> +VDCM composes virtual devices and exposes them to the guests. When
> >> +the guest allocates a PASID then program it to the virtual
> >> device, VDCM
> programs as well
> >> +intercepts the programming attempt then program the matching
> >> host
> >
> > "programs"
> >
> > Thanks,
> > Jean
> >
> >> +PASID on to the hardware.
> >> +Conversely, when a device is going away, VDCM must be informed
> >> such +that PASID context on the hardware can be cleared. There
> >> could be +multiple mdevs assigned to different guests in the same
> >> VDCM. Since +the PASID table is shared at PCI device level, lazy
> >> clearing is not +secure. A malicious guest can attack by using
> >> newly freed PASIDs that +are allocated by another guest.
> >> +
> >> +By holding a reference of the PASID until VDCM cleans up the HW
> >> context, +it is guaranteed that PASID life cycles do not cross
> >> within the same +device.
> >> +
> >> +
> >> +Reference
> >> +====================================================
> >> +1.
> >> https://software.intel.com/sites/default/files/managed/c5/15/architecture-instruction-set-extensions-programming-reference.pdf
> >> + +2.
> >> https://01.org/blogs/2019/introducing-intel-data-streaming-accelerator
> >> + +3.
> >> https://software.intel.com/en-us/download/intel-data-streaming-accelerator-preliminary-architecture-specification
> >> -- 2.7.4
>
> Thanks
>
> Eric
> >>
> >
>
> _______________________________________________
> iommu mailing list
> [email protected]
> https://lists.linuxfoundation.org/mailman/listinfo/iommu
[Jacob Pan]

2020-09-01 16:53:14

by Eric Auger

[permalink] [raw]
Subject: Re: [PATCH v2 6/9] iommu/ioasid: Introduce notification APIs

Hi Jacob,

On 8/22/20 6:35 AM, Jacob Pan wrote:
> Relations among IOASID users largely follow a publisher-subscriber
> pattern. E.g. to support guest SVA on Intel Scalable I/O Virtualization
> (SIOV) enabled platforms, VFIO, IOMMU, device drivers, KVM are all users
> of IOASIDs. When a state change occurs, VFIO publishes the change event
> that needs to be processed by other users/subscribers.
>
> This patch introduced two types of notifications: global and per
> ioasid_set. The latter is intended for users who only needs to handle
> events related to the IOASID of a given set.
> For more information, refer to the kernel documentation at
> Documentation/ioasid.rst.
>
> Signed-off-by: Liu Yi L <[email protected]>
> Signed-off-by: Wu Hao <[email protected]>
> Signed-off-by: Jacob Pan <[email protected]>
> ---
> drivers/iommu/ioasid.c | 280 ++++++++++++++++++++++++++++++++++++++++++++++++-
> include/linux/ioasid.h | 70 +++++++++++++
> 2 files changed, 348 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/iommu/ioasid.c b/drivers/iommu/ioasid.c
> index c0aef38a4fde..6ddc09a7fe74 100644
> --- a/drivers/iommu/ioasid.c
> +++ b/drivers/iommu/ioasid.c
> @@ -9,8 +9,35 @@
> #include <linux/spinlock.h>
> #include <linux/xarray.h>
> #include <linux/ioasid.h>
> +#include <linux/sched/mm.h>
>
> static DEFINE_XARRAY_ALLOC(ioasid_sets);
> +/*
> + * An IOASID could have multiple consumers where each consumeer may have
can have multiple consumers
> + * hardware contexts associated with IOASIDs.
> + * When a status change occurs, such as IOASID is being freed, notifier chains
s/such as IOASID is being freed/, like on IOASID deallocation,
> + * are used to keep the consumers in sync.
> + * This is a publisher-subscriber pattern where publisher can change the
> + * state of each IOASID, e.g. alloc/free, bind IOASID to a device and mm.
> + * On the other hand, subscribers gets notified for the state change and
> + * keep local states in sync.
> + *
> + * Currently, the notifier is global. A further optimization could be per
> + * IOASID set notifier chain.
> + */
> +static ATOMIC_NOTIFIER_HEAD(ioasid_chain);
> +
> +/* List to hold pending notification block registrations */
> +static LIST_HEAD(ioasid_nb_pending_list);
> +static DEFINE_SPINLOCK(ioasid_nb_lock);
> +struct ioasid_set_nb {
> + struct list_head list;
> + struct notifier_block *nb;
> + void *token;
> + struct ioasid_set *set;
> + bool active;
> +};
> +
> enum ioasid_state {
> IOASID_STATE_INACTIVE,
> IOASID_STATE_ACTIVE,
> @@ -394,6 +421,7 @@ EXPORT_SYMBOL_GPL(ioasid_find_by_spid);
> ioasid_t ioasid_alloc(struct ioasid_set *set, ioasid_t min, ioasid_t max,
> void *private)
> {
> + struct ioasid_nb_args args;
> struct ioasid_data *data;
> void *adata;
> ioasid_t id = INVALID_IOASID;
> @@ -445,8 +473,14 @@ ioasid_t ioasid_alloc(struct ioasid_set *set, ioasid_t min, ioasid_t max,
> goto exit_free;
> }
> set->nr_ioasids++;
> - goto done_unlock;
> + args.id = id;
> + /* Set private ID is not attached during allocation */
> + args.spid = INVALID_IOASID;
> + args.set = set;
> + atomic_notifier_call_chain(&set->nh, IOASID_ALLOC, &args);
>
> + spin_unlock(&ioasid_allocator_lock);
> + return id;
spurious change
> exit_free:
> kfree(data);
> done_unlock:
> @@ -479,6 +513,7 @@ static void ioasid_do_free(struct ioasid_data *data)
>
> static void ioasid_free_locked(struct ioasid_set *set, ioasid_t ioasid)
> {
> + struct ioasid_nb_args args;
> struct ioasid_data *data;
>
> data = xa_load(&active_allocator->xa, ioasid);
> @@ -491,7 +526,16 @@ static void ioasid_free_locked(struct ioasid_set *set, ioasid_t ioasid)
> pr_warn("Cannot free IOASID %u due to set ownership\n", ioasid);
> return;
> }
> +
spurious new line
> data->state = IOASID_STATE_FREE_PENDING;
> + /* Notify all users that this IOASID is being freed */
> + args.id = ioasid;
> + args.spid = data->spid;
> + args.pdata = data->private;
> + args.set = data->set;
> + atomic_notifier_call_chain(&ioasid_chain, IOASID_FREE, &args);
> + /* Notify the ioasid_set for per set users */
> + atomic_notifier_call_chain(&set->nh, IOASID_FREE, &args);
>
> if (!refcount_dec_and_test(&data->users))
> return;
Shouldn't we call the notifier only when ref count == 0?
> @@ -514,6 +558,28 @@ void ioasid_free(struct ioasid_set *set, ioasid_t ioasid)
> }
> EXPORT_SYMBOL_GPL(ioasid_free);
>
> +static void ioasid_add_pending_nb(struct ioasid_set *set)
> +{
> + struct ioasid_set_nb *curr;
> +
> + if (set->type != IOASID_SET_TYPE_MM)
> + return;
> +
> + /*
> + * Check if there are any pending nb requests for the given token, if so
> + * add them to the notifier chain.
> + */
> + spin_lock(&ioasid_nb_lock);
> + list_for_each_entry(curr, &ioasid_nb_pending_list, list) {
> + if (curr->token == set->token && !curr->active) {
> + atomic_notifier_chain_register(&set->nh, curr->nb);
> + curr->set = set;
> + curr->active = true;
> + }
> + }
> + spin_unlock(&ioasid_nb_lock);
> +}
> +
> /**
> * ioasid_alloc_set - Allocate a new IOASID set for a given token
> *
> @@ -601,6 +667,13 @@ struct ioasid_set *ioasid_alloc_set(void *token, ioasid_t quota, int type)
> sdata->quota = quota;
> sdata->sid = id;
> refcount_set(&sdata->ref, 1);
> + ATOMIC_INIT_NOTIFIER_HEAD(&sdata->nh);
> +
> + /*
> + * Check if there are any pending nb requests for the given token, if so
> + * add them to the notifier chain.
> + */
> + ioasid_add_pending_nb(sdata);
>
> /*
> * Per set XA is used to store private IDs within the set, get ready
> @@ -617,6 +690,30 @@ struct ioasid_set *ioasid_alloc_set(void *token, ioasid_t quota, int type)
> }
> EXPORT_SYMBOL_GPL(ioasid_alloc_set);
>
> +
> +/*
> + * ioasid_find_mm_set - Retrieve IOASID set with mm token
> + * Take a reference of the set if found.
> + */
> +static struct ioasid_set *ioasid_find_mm_set(struct mm_struct *token)
> +{
> + struct ioasid_set *sdata, *set = NULL;
> + unsigned long index;
> +
> + spin_lock(&ioasid_allocator_lock);
> +
> + xa_for_each(&ioasid_sets, index, sdata) {
> + if (sdata->type == IOASID_SET_TYPE_MM && sdata->token == token) {
> + refcount_inc(&sdata->ref);
> + set = sdata;
> + goto exit_unlock;
> + }
> + }
> +exit_unlock:
> + spin_unlock(&ioasid_allocator_lock);
> + return set;
> +}
> +
> void ioasid_set_get_locked(struct ioasid_set *set)
> {
> if (WARN_ON(xa_load(&ioasid_sets, set->sid) != set)) {
> @@ -638,6 +735,8 @@ EXPORT_SYMBOL_GPL(ioasid_set_get);
>
> void ioasid_set_put_locked(struct ioasid_set *set)
> {
> + struct ioasid_nb_args args = { 0 };
> + struct ioasid_set_nb *curr;
> struct ioasid_data *entry;
> unsigned long index;
>
> @@ -673,8 +772,24 @@ void ioasid_set_put_locked(struct ioasid_set *set)
> done_destroy:
> /* Return the quota back to system pool */
> ioasid_capacity_avail += set->quota;
> - kfree_rcu(set, rcu);
>
> + /* Restore pending status of the set NBs */
> + spin_lock(&ioasid_nb_lock);
> + list_for_each_entry(curr, &ioasid_nb_pending_list, list) {
> + if (curr->token == set->token) {
> + if (curr->active)
> + curr->active = false;
> + else
> + pr_warn("Set token exists but not active!\n");
> + }
> + }
> + spin_unlock(&ioasid_nb_lock);
> +
> + args.set = set;
> + atomic_notifier_call_chain(&ioasid_chain, IOASID_SET_FREE, &args);
> +
> + kfree_rcu(set, rcu);
> + pr_debug("Set freed %d\n", set->sid);
> /*
> * Token got released right away after the ioasid_set is freed.
> * If a new set is created immediately with the newly released token,
> @@ -927,6 +1042,167 @@ void *ioasid_find(struct ioasid_set *set, ioasid_t ioasid,
> }
> EXPORT_SYMBOL_GPL(ioasid_find);
>
> +int ioasid_register_notifier(struct ioasid_set *set, struct notifier_block *nb)
> +{
> + if (set)
> + return atomic_notifier_chain_register(&set->nh, nb);
> + else
> + return atomic_notifier_chain_register(&ioasid_chain, nb);
> +}
> +EXPORT_SYMBOL_GPL(ioasid_register_notifier);
> +
> +void ioasid_unregister_notifier(struct ioasid_set *set,
> + struct notifier_block *nb)
> +{
> + struct ioasid_set_nb *curr;
> +
> + spin_lock(&ioasid_nb_lock);
> + /*
> + * Pending list is registered with a token without an ioasid_set,
> + * therefore should not be unregistered directly.
> + */
> + list_for_each_entry(curr, &ioasid_nb_pending_list, list) {
> + if (curr->nb == nb) {
> + pr_warn("Cannot unregister NB from pending list\n");
> + spin_unlock(&ioasid_nb_lock);
> + return;
> + }
> + }
> + spin_unlock(&ioasid_nb_lock);
is it safe to release the lock here? What does prevent another NB to be
added to ioasid_nb_pending_list after that?
> +
> + if (set)
> + atomic_notifier_chain_unregister(&set->nh, nb);
> + else
> + atomic_notifier_chain_unregister(&ioasid_chain, nb);
> +}
> +EXPORT_SYMBOL_GPL(ioasid_unregister_notifier);
> +
> +int ioasid_register_notifier_mm(struct mm_struct *mm, struct notifier_block *nb)
> +{
> + struct ioasid_set_nb *curr;
> + struct ioasid_set *set;
> + int ret = 0;
> +
> + if (!mm)
> + return -EINVAL;
> +
> + spin_lock(&ioasid_nb_lock);
> +
> + /* Check for duplicates, nb is unique per set */
> + list_for_each_entry(curr, &ioasid_nb_pending_list, list) {
> + if (curr->token == mm && curr->nb == nb) {
> + ret = -EBUSY;
> + goto exit_unlock;
> + }
> + }
> +
> + /* Check if the token has an existing set */
> + set = ioasid_find_mm_set(mm);
> + if (IS_ERR_OR_NULL(set)) {
> + /* Add to the rsvd list as inactive */
> + curr->active = false;
> + } else {
> + /* REVISIT: Only register empty set for now. Can add an option
> + * in the future to playback existing PASIDs.
> + */
> + if (set->nr_ioasids) {
> + pr_warn("IOASID set %d not empty\n", set->sid);
> + ret = -EBUSY;
> + goto exit_unlock;
> + }
> + curr = kzalloc(sizeof(*curr), GFP_ATOMIC);
> + if (!curr) {
> + ret = -ENOMEM;
> + goto exit_unlock;
> + }
> + curr->token = mm;
> + curr->nb = nb;
> + curr->active = true;
> + curr->set = set;
> +
> + /* Set already created, add to the notifier chain */
> + atomic_notifier_chain_register(&set->nh, nb);
> + /*
> + * Do not hold a reference, if the set gets destroyed, the nb
> + * entry will be marked inactive.
> + */
> + ioasid_set_put(set);
> + }
> +
> + list_add(&curr->list, &ioasid_nb_pending_list);
> +
> +exit_unlock:
> + spin_unlock(&ioasid_nb_lock);
> + return ret;
> +}
> +EXPORT_SYMBOL_GPL(ioasid_register_notifier_mm);
> +
> +void ioasid_unregister_notifier_mm(struct mm_struct *mm, struct notifier_block *nb)
> +{
> + struct ioasid_set_nb *curr;
> +
> + spin_lock(&ioasid_nb_lock);
> + list_for_each_entry(curr, &ioasid_nb_pending_list, list) {
> + if (curr->token == mm && curr->nb == nb) {
> + list_del(&curr->list);
> + goto exit_free;
> + }
> + }
> + pr_warn("No ioasid set found for mm token %llx\n", (u64)mm);
> + goto done_unlock;
> +
> +exit_free:
> + if (curr->active) {
> + pr_debug("mm set active, unregister %llx\n",
> + (u64)mm);
> + atomic_notifier_chain_unregister(&curr->set->nh, nb);
> + }
> + kfree(curr);
> +done_unlock:
> + spin_unlock(&ioasid_nb_lock);
> + return;
> +}
> +EXPORT_SYMBOL_GPL(ioasid_unregister_notifier_mm);
> +
> +/**
> + * ioasid_notify - Send notification on a given IOASID for status change.
> + * Used by publishers when the status change may affect
> + * subscriber's internal state.
> + *
> + * @ioasid: The IOASID to which the notification will send
> + * @cmd: The notification event
> + * @flags: Special instructions, e.g. notify with a set or global
> + */
> +int ioasid_notify(ioasid_t ioasid, enum ioasid_notify_val cmd, unsigned int flags)
> +{
> + struct ioasid_data *ioasid_data;
> + struct ioasid_nb_args args = { 0 };
> + int ret = 0;
> +
> + spin_lock(&ioasid_allocator_lock);
> + ioasid_data = xa_load(&active_allocator->xa, ioasid);
> + if (!ioasid_data) {
> + pr_err("Trying to notify unknown IOASID %u\n", ioasid);
> + spin_unlock(&ioasid_allocator_lock);
> + return -EINVAL;
> + }
> +
> + args.id = ioasid;
> + args.set = ioasid_data->set;
> + args.pdata = ioasid_data->private;
> + args.spid = ioasid_data->spid;
> + if (flags & IOASID_NOTIFY_ALL) {
> + ret = atomic_notifier_call_chain(&ioasid_chain, cmd, &args);
> + } else if (flags & IOASID_NOTIFY_SET) {
> + ret = atomic_notifier_call_chain(&ioasid_data->set->nh,
> + cmd, &args);
> + }
> + spin_unlock(&ioasid_allocator_lock);
> +
> + return ret;
> +}
> +EXPORT_SYMBOL_GPL(ioasid_notify);
> +
> MODULE_AUTHOR("Jean-Philippe Brucker <[email protected]>");
> MODULE_AUTHOR("Jacob Pan <[email protected]>");
> MODULE_DESCRIPTION("IO Address Space ID (IOASID) allocator");
> diff --git a/include/linux/ioasid.h b/include/linux/ioasid.h
> index d4b3e83672f6..572111cd3b4b 100644
> --- a/include/linux/ioasid.h
> +++ b/include/linux/ioasid.h
> @@ -23,6 +23,7 @@ enum ioasid_set_type {
> * struct ioasid_set - Meta data about ioasid_set
> * @type: Token types and other features
> * @token: Unique to identify an IOASID set
> + * @nh: Notifier for IOASID events within the set
list of notifiers private to that set?
> * @xa: XArray to store ioasid_set private IDs, can be used for
> * guest-host IOASID mapping, or just a private IOASID namespace.
> * @quota: Max number of IOASIDs can be allocated within the set
> @@ -32,6 +33,7 @@ enum ioasid_set_type {
> */
> struct ioasid_set {
> void *token;
> + struct atomic_notifier_head nh;
> struct xarray xa;
> int type;
> int quota;
> @@ -56,6 +58,49 @@ struct ioasid_allocator_ops {
> void *pdata;
> };
>
> +/* Notification data when IOASID status changed */
> +enum ioasid_notify_val {
> + IOASID_ALLOC = 1,
> + IOASID_FREE,
> + IOASID_BIND,
> + IOASID_UNBIND,
> + IOASID_SET_ALLOC,
> + IOASID_SET_FREE,
> +};
> +
> +#define IOASID_NOTIFY_ALL BIT(0)
> +#define IOASID_NOTIFY_SET BIT(1)
> +/**
> + * enum ioasid_notifier_prios - IOASID event notification order
> + *
> + * When status of an IOASID changes, users might need to take actions to
> + * reflect the new state. For example, when an IOASID is freed due to
> + * exception, the hardware context in virtual CPU, DMA device, and IOMMU
> + * shall be cleared and drained. Order is required to prevent life cycle
> + * problems.
> + */
> +enum ioasid_notifier_prios {
> + IOASID_PRIO_LAST,
> + IOASID_PRIO_DEVICE,
> + IOASID_PRIO_IOMMU,
> + IOASID_PRIO_CPU,
> +};
> +
> +/**
> + * struct ioasid_nb_args - Argument provided by IOASID core when notifier
> + * is called.
> + * @id: The IOASID being notified
> + * @spid: The set private ID associated with the IOASID
> + * @set: The IOASID set of @id
> + * @pdata: The private data attached to the IOASID
> + */
> +struct ioasid_nb_args {
> + ioasid_t id;
> + ioasid_t spid;
> + struct ioasid_set *set;
> + void *pdata;
> +};
> +
> #if IS_ENABLED(CONFIG_IOASID)
> void ioasid_install_capacity(ioasid_t total);
> ioasid_t ioasid_get_capacity(void);
> @@ -75,8 +120,16 @@ void *ioasid_find(struct ioasid_set *set, ioasid_t ioasid, bool (*getter)(void *
> int ioasid_attach_data(ioasid_t ioasid, void *data);
> int ioasid_attach_spid(ioasid_t ioasid, ioasid_t spid);
> ioasid_t ioasid_find_by_spid(struct ioasid_set *set, ioasid_t spid);
> +
> +int ioasid_register_notifier(struct ioasid_set *set,
> + struct notifier_block *nb);
> +void ioasid_unregister_notifier(struct ioasid_set *set,
> + struct notifier_block *nb);
> +
> int ioasid_register_allocator(struct ioasid_allocator_ops *allocator);
> void ioasid_unregister_allocator(struct ioasid_allocator_ops *allocator);
> +
> +int ioasid_notify(ioasid_t ioasid, enum ioasid_notify_val cmd, unsigned int flags);
> void ioasid_is_in_set(struct ioasid_set *set, ioasid_t ioasid);
> int ioasid_get(struct ioasid_set *set, ioasid_t ioasid);
> int ioasid_get_locked(struct ioasid_set *set, ioasid_t ioasid);
> @@ -85,6 +138,9 @@ void ioasid_put_locked(struct ioasid_set *set, ioasid_t ioasid);
> int ioasid_set_for_each_ioasid(struct ioasid_set *sdata,
> void (*fn)(ioasid_t id, void *data),
> void *data);
> +int ioasid_register_notifier_mm(struct mm_struct *mm, struct notifier_block *nb);
> +void ioasid_unregister_notifier_mm(struct mm_struct *mm, struct notifier_block *nb);
> +
> #else /* !CONFIG_IOASID */
> static inline void ioasid_install_capacity(ioasid_t total)
> {
> @@ -124,6 +180,20 @@ static inline void *ioasid_find(struct ioasid_set *set, ioasid_t ioasid, bool (*
> return NULL;
> }
>
> +static inline int ioasid_register_notifier(struct notifier_block *nb)
> +{
> + return -ENOTSUPP;
> +}
> +
> +static inline void ioasid_unregister_notifier(struct notifier_block *nb)
> +{
> +}
> +
> +static inline int ioasid_notify(ioasid_t ioasid, enum ioasid_notify_val cmd, unsigned int flags)
> +{
> + return -ENOTSUPP;
> +}
> +
> static inline int ioasid_register_allocator(struct ioasid_allocator_ops *allocator)
> {
> return -ENOTSUPP;
>
Thanks

Eric

2020-09-01 17:04:50

by Eric Auger

[permalink] [raw]
Subject: Re: [PATCH v2 7/9] iommu/vt-d: Listen to IOASID notifications

Hi Jacob,

On 8/22/20 6:35 AM, Jacob Pan wrote:
> On Intel Scalable I/O Virtualization (SIOV) enabled platforms, IOMMU
> driver is one of the users of IOASIDs. In normal flow, callers will
> perform IOASID allocation, bind, unbind, and free in order. However, for
> guest SVA, IOASID free could come before unbind as guest is untrusted.
> This patch registers IOASID notification handler such that IOMMU driver
> can perform PASID teardown upon receiving an unexpected IOASID free
> event.
>
> Signed-off-by: Jacob Pan <[email protected]>
> ---
> drivers/iommu/intel/svm.c | 74 ++++++++++++++++++++++++++++++++++++++++++++-
> include/linux/intel-iommu.h | 2 ++
> 2 files changed, 75 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/iommu/intel/svm.c b/drivers/iommu/intel/svm.c
> index 634e191ca2c3..600e3ae5b656 100644
> --- a/drivers/iommu/intel/svm.c
> +++ b/drivers/iommu/intel/svm.c
> @@ -95,6 +95,72 @@ static inline bool intel_svm_capable(struct intel_iommu *iommu)
> return iommu->flags & VTD_FLAG_SVM_CAPABLE;
> }
>
> +#define pasid_lock_held() lock_is_held(&pasid_mutex.dep_map)
put after the pasid_mutex definition?
> +static DEFINE_MUTEX(pasid_mutex);
> +
> +static void intel_svm_free_async_fn(struct work_struct *work)
> +{
> + struct intel_svm *svm = container_of(work, struct intel_svm, work);
> + struct intel_svm_dev *sdev;
> +
> + /*
> + * Unbind all devices associated with this PASID which is
> + * being freed by other users such as VFIO.
> + */
> + mutex_lock(&pasid_mutex);
> + list_for_each_entry_rcu(sdev, &svm->devs, list, pasid_lock_held()) {
> + /* Does not poison forward pointer */
> + list_del_rcu(&sdev->list);
> + spin_lock(&svm->iommu->lock);
> + intel_pasid_tear_down_entry(svm->iommu, sdev->dev,
> + svm->pasid, true);
> + spin_unlock(&svm->iommu->lock);
> + kfree_rcu(sdev, rcu);
> + /*
> + * Free before unbind only happens with guest usaged
usaged?
> + * host PASIDs. IOASID free will detach private data
> + * and free the IOASID entry.
> + */
> + ioasid_put(NULL, svm->pasid);
> + if (list_empty(&svm->devs))
> + kfree(svm);
> + }
> + mutex_unlock(&pasid_mutex);
> +}
> +
> +
> +static int pasid_status_change(struct notifier_block *nb,
> + unsigned long code, void *data)
> +{
> + struct ioasid_nb_args *args = (struct ioasid_nb_args *)data;
> + struct intel_svm *svm = (struct intel_svm *)args->pdata;
> + int ret = NOTIFY_DONE;
> +
> + if (code == IOASID_FREE) {
> + if (!svm)
> + goto done;
> + if (args->id != svm->pasid) {
> + pr_warn("Notify PASID does not match data %d : %d\n",
> + args->id, svm->pasid);
> + goto done;
> + }
> + schedule_work(&svm->work);
> + return NOTIFY_OK;
> + }
> +done:
> + return ret;> +}
> +
> +static struct notifier_block pasid_nb = {
> + .notifier_call = pasid_status_change,
> +};
> +
> +void intel_svm_add_pasid_notifier(void)
> +{
> + /* Listen to all PASIDs, not specific to a set */
> + ioasid_register_notifier(NULL, &pasid_nb);
> +}
> +
> void intel_svm_check(struct intel_iommu *iommu)
> {
> if (!pasid_supported(iommu))
> @@ -221,7 +287,6 @@ static const struct mmu_notifier_ops intel_mmuops = {
> .invalidate_range = intel_invalidate_range,
> };
>
> -static DEFINE_MUTEX(pasid_mutex);
> static LIST_HEAD(global_svm_list);
>
> #define for_each_svm_dev(sdev, svm, d) \
> @@ -342,7 +407,14 @@ int intel_svm_bind_gpasid(struct iommu_domain *domain, struct device *dev,
> svm->gpasid = data->gpasid;
> svm->flags |= SVM_FLAG_GUEST_PASID;
> }
> + svm->iommu = iommu;
> + /*
> + * Set up cleanup async work in case IOASID core notify us PASID
> + * is freed before unbind.
> + */
> + INIT_WORK(&svm->work, intel_svm_free_async_fn);
> ioasid_attach_data(data->hpasid, svm);
> + ioasid_get(NULL, svm->pasid);
> INIT_LIST_HEAD_RCU(&svm->devs);
> mmput(svm->mm);
> }
> diff --git a/include/linux/intel-iommu.h b/include/linux/intel-iommu.h
> index b1ed2f25f7c0..d36038e6ae0b 100644
> --- a/include/linux/intel-iommu.h
> +++ b/include/linux/intel-iommu.h
> @@ -744,6 +744,7 @@ void intel_svm_unbind(struct iommu_sva *handle);
> int intel_svm_get_pasid(struct iommu_sva *handle);
> int intel_svm_page_response(struct device *dev, struct iommu_fault_event *evt,
> struct iommu_page_response *msg);
> +void intel_svm_add_pasid_notifier(void);
>
> struct svm_dev_ops;
>
> @@ -770,6 +771,7 @@ struct intel_svm {
> int gpasid; /* In case that guest PASID is different from host PASID */
> struct list_head devs;
> struct list_head list;
> + struct work_struct work; /* For deferred clean up */
> };
> #else
> static inline void intel_svm_check(struct intel_iommu *iommu) {}
>

Thanks

Eric

2020-09-01 17:10:08

by Eric Auger

[permalink] [raw]
Subject: Re: [PATCH v2 9/9] iommu/vt-d: Store guest PASID during bind

Hi Jacob,
On 8/22/20 6:35 AM, Jacob Pan wrote:
> IOASID core maintains the guest-host mapping in the form of SPID and
> IOASID. This patch assigns the guest PASID (if valid) as SPID while
> binding guest page table with a host PASID. This mapping will be used
> for lookup and notifications.
>
> Signed-off-by: Jacob Pan <[email protected]>
> ---
> drivers/iommu/intel/svm.c | 2 ++
> 1 file changed, 2 insertions(+)
>
> diff --git a/drivers/iommu/intel/svm.c b/drivers/iommu/intel/svm.c
> index d8a5efa75095..4c958b1aec4c 100644
> --- a/drivers/iommu/intel/svm.c
> +++ b/drivers/iommu/intel/svm.c
> @@ -406,6 +406,7 @@ int intel_svm_bind_gpasid(struct iommu_domain *domain, struct device *dev,
> if (data->flags & IOMMU_SVA_GPASID_VAL) {
> svm->gpasid = data->gpasid;
> svm->flags |= SVM_FLAG_GUEST_PASID;
> + ioasid_attach_spid(data->hpasid, data->gpasid);
don't you want to handle the returned value?
> }
> svm->iommu = iommu;
> /*
> @@ -517,6 +518,7 @@ int intel_svm_unbind_gpasid(struct device *dev, int pasid)
> ioasid_attach_data(pasid, NULL);
> ioasid_notify(pasid, IOASID_UNBIND,
> IOASID_NOTIFY_SET);
> + ioasid_attach_spid(pasid, INVALID_IOASID);
So this answers my previous question ;-) but won't it enter the if
(!ioasid_data) path and fail to reset the spid?

Eric
> kfree(svm);
> }
> }
>

2020-09-01 21:22:53

by Jacob Pan

[permalink] [raw]
Subject: Re: [PATCH v2 3/9] iommu/ioasid: Introduce ioasid_set APIs

On Mon, 24 Aug 2020 10:24:11 +0800
Lu Baolu <[email protected]> wrote:

> Hi Jacob,
>
> On 8/22/20 12:35 PM, Jacob Pan wrote:
> > ioasid_set was introduced as an arbitrary token that are shared by a
> > group of IOASIDs. For example, if IOASID #1 and #2 are allocated
> > via the same ioasid_set*, they are viewed as to belong to the same
> > set.
> >
> > For guest SVA usages, system-wide IOASID resources need to be
> > partitioned such that VMs can have its own quota and being managed
> > separately. ioasid_set is the perfect candidate for meeting such
> > requirements. This patch redefines and extends ioasid_set with the
> > following new fields:
> > - Quota
> > - Reference count
> > - Storage of its namespace
> > - The token is stored in the new ioasid_set but with optional types
> >
> > ioasid_set level APIs are introduced that wires up these new data.
> > Existing users of IOASID APIs are converted where a host IOASID set
> > is allocated for bare-metal usage.
> >
> > Signed-off-by: Liu Yi L <[email protected]>
> > Signed-off-by: Jacob Pan <[email protected]>
> > ---
> > drivers/iommu/intel/iommu.c | 27 ++-
> > drivers/iommu/intel/pasid.h | 1 +
> > drivers/iommu/intel/svm.c | 8 +-
> > drivers/iommu/ioasid.c | 390
> > +++++++++++++++++++++++++++++++++++++++++---
> > include/linux/ioasid.h | 82 ++++++++-- 5 files changed, 465
> > insertions(+), 43 deletions(-)
> >
> > diff --git a/drivers/iommu/intel/iommu.c
> > b/drivers/iommu/intel/iommu.c index a3a0b5c8921d..5813eeaa5edb
> > 100644 --- a/drivers/iommu/intel/iommu.c
> > +++ b/drivers/iommu/intel/iommu.c
> > @@ -42,6 +42,7 @@
> > #include <linux/crash_dump.h>
> > #include <linux/numa.h>
> > #include <linux/swiotlb.h>
> > +#include <linux/ioasid.h>
> > #include <asm/irq_remapping.h>
> > #include <asm/cacheflush.h>
> > #include <asm/iommu.h>
> > @@ -103,6 +104,9 @@
> > */
> > #define INTEL_IOMMU_PGSIZES (~0xFFFUL)
> >
> > +/* PASIDs used by host SVM */
> > +struct ioasid_set *host_pasid_set;
> > +
> > static inline int agaw_to_level(int agaw)
> > {
> > return agaw + 2;
> > @@ -3103,8 +3107,8 @@ static void intel_vcmd_ioasid_free(ioasid_t
> > ioasid, void *data)
> > * Sanity check the ioasid owner is done at upper layer,
> > e.g. VFIO
> > * We can only free the PASID when all the devices are
> > unbound. */
> > - if (ioasid_find(NULL, ioasid, NULL)) {
> > - pr_alert("Cannot free active IOASID %d\n", ioasid);
> > + if (IS_ERR(ioasid_find(host_pasid_set, ioasid, NULL))) {
> > + pr_err("Cannot free IOASID %d, not in system
> > set\n", ioasid); return;
> > }
> > vcmd_free_pasid(iommu, ioasid);
> > @@ -3288,6 +3292,19 @@ static int __init init_dmars(void)
> > if (ret)
> > goto free_iommu;
> >
> > + /* PASID is needed for scalable mode irrespective to SVM */
> > + if (intel_iommu_sm) {
> > + ioasid_install_capacity(intel_pasid_max_id);
> > + /* We should not run out of IOASIDs at boot */
> > + host_pasid_set = ioasid_alloc_set(NULL,
> > PID_MAX_DEFAULT,
> > +
> > IOASID_SET_TYPE_NULL);
> > + if (IS_ERR_OR_NULL(host_pasid_set)) {
> > + pr_err("Failed to enable host PASID
> > allocator %lu\n",
> > + PTR_ERR(host_pasid_set));
> > + intel_iommu_sm = 0;
> > + }
> > + }
> > +
> > /*
> > * for each drhd
> > * enable fault log
> > @@ -5149,7 +5166,7 @@ static void auxiliary_unlink_device(struct
> > dmar_domain *domain, domain->auxd_refcnt--;
> >
> > if (!domain->auxd_refcnt && domain->default_pasid > 0)
> > - ioasid_free(domain->default_pasid);
> > + ioasid_free(host_pasid_set, domain->default_pasid);
> > }
> >
> > static int aux_domain_add_dev(struct dmar_domain *domain,
> > @@ -5167,7 +5184,7 @@ static int aux_domain_add_dev(struct
> > dmar_domain *domain, int pasid;
> >
> > /* No private data needed for the default pasid */
> > - pasid = ioasid_alloc(NULL, PASID_MIN,
> > + pasid = ioasid_alloc(host_pasid_set, PASID_MIN,
> > pci_max_pasids(to_pci_dev(dev))
> > - 1, NULL);
> > if (pasid == INVALID_IOASID) {
> > @@ -5210,7 +5227,7 @@ static int aux_domain_add_dev(struct
> > dmar_domain *domain, spin_unlock(&iommu->lock);
> > spin_unlock_irqrestore(&device_domain_lock, flags);
> > if (!domain->auxd_refcnt && domain->default_pasid > 0)
> > - ioasid_free(domain->default_pasid);
> > + ioasid_free(host_pasid_set, domain->default_pasid);
> >
> > return ret;
> > }
> > diff --git a/drivers/iommu/intel/pasid.h
> > b/drivers/iommu/intel/pasid.h index c9850766c3a9..ccdc23446015
> > 100644 --- a/drivers/iommu/intel/pasid.h
> > +++ b/drivers/iommu/intel/pasid.h
> > @@ -99,6 +99,7 @@ static inline bool pasid_pte_is_present(struct
> > pasid_entry *pte) }
> >
> > extern u32 intel_pasid_max_id;
> > +extern struct ioasid_set *host_pasid_set;
> > int intel_pasid_alloc_id(void *ptr, int start, int end, gfp_t
> > gfp); void intel_pasid_free_id(int pasid);
> > void *intel_pasid_lookup_id(int pasid);
> > diff --git a/drivers/iommu/intel/svm.c b/drivers/iommu/intel/svm.c
> > index 37a9beabc0ca..634e191ca2c3 100644
> > --- a/drivers/iommu/intel/svm.c
> > +++ b/drivers/iommu/intel/svm.c
> > @@ -551,7 +551,7 @@ intel_svm_bind_mm(struct device *dev, int
> > flags, struct svm_dev_ops *ops, pasid_max = intel_pasid_max_id;
> >
> > /* Do not use PASID 0, reserved for RID to PASID
> > */
> > - svm->pasid = ioasid_alloc(NULL, PASID_MIN,
> > + svm->pasid = ioasid_alloc(host_pasid_set,
> > PASID_MIN, pasid_max - 1, svm);
> > if (svm->pasid == INVALID_IOASID) {
> > kfree(svm);
> > @@ -568,7 +568,7 @@ intel_svm_bind_mm(struct device *dev, int
> > flags, struct svm_dev_ops *ops, if (mm) {
> > ret =
> > mmu_notifier_register(&svm->notifier, mm); if (ret) {
> > - ioasid_free(svm->pasid);
> > + ioasid_free(host_pasid_set,
> > svm->pasid); kfree(svm);
> > kfree(sdev);
> > goto out;
> > @@ -586,7 +586,7 @@ intel_svm_bind_mm(struct device *dev, int
> > flags, struct svm_dev_ops *ops, if (ret) {
> > if (mm)
> > mmu_notifier_unregister(&svm->notifier,
> > mm);
> > - ioasid_free(svm->pasid);
> > + ioasid_free(host_pasid_set, svm->pasid);
> > kfree(svm);
> > kfree(sdev);
> > goto out;
> > @@ -655,7 +655,7 @@ static int intel_svm_unbind_mm(struct device
> > *dev, int pasid) kfree_rcu(sdev, rcu);
> >
> > if (list_empty(&svm->devs)) {
> > - ioasid_free(svm->pasid);
> > + ioasid_free(host_pasid_set,
> > svm->pasid); if (svm->mm)
> > mmu_notifier_unregister(&svm->notifier,
> > svm->mm); list_del(&svm->list);
> > diff --git a/drivers/iommu/ioasid.c b/drivers/iommu/ioasid.c
> > index 5f63af07acd5..f73b3dbfc37a 100644
> > --- a/drivers/iommu/ioasid.c
> > +++ b/drivers/iommu/ioasid.c
> > @@ -1,22 +1,58 @@
> > // SPDX-License-Identifier: GPL-2.0
> > /*
> > * I/O Address Space ID allocator. There is one global IOASID
> > space, split into
> > - * subsets. Users create a subset with DECLARE_IOASID_SET, then
> > allocate and
> > - * free IOASIDs with ioasid_alloc and ioasid_free.
> > + * subsets. Users create a subset with ioasid_alloc_set, then
> > allocate/free IDs
> > + * with ioasid_alloc and ioasid_free.
> > */
> > -#include <linux/ioasid.h>
> > #include <linux/module.h>
> > #include <linux/slab.h>
> > #include <linux/spinlock.h>
> > #include <linux/xarray.h>
> > +#include <linux/ioasid.h>
> > +
> > +static DEFINE_XARRAY_ALLOC(ioasid_sets);
> > +enum ioasid_state {
> > + IOASID_STATE_INACTIVE,
> > + IOASID_STATE_ACTIVE,
> > + IOASID_STATE_FREE_PENDING,
> > +};
> >
> > +/**
> > + * struct ioasid_data - Meta data about ioasid
> > + *
> > + * @id: Unique ID
> > + * @users Number of active users
> > + * @state Track state of the IOASID
> > + * @set Meta data of the set this IOASID belongs to
> > + * @private Private data associated with the IOASID
> > + * @rcu For free after RCU grace period
> > + */
> > struct ioasid_data {
> > ioasid_t id;
> > struct ioasid_set *set;
> > + refcount_t users;
> > + enum ioasid_state state;
> > void *private;
> > struct rcu_head rcu;
> > };
> >
> > +/* Default to PCIe standard 20 bit PASID */
> > +#define PCI_PASID_MAX 0x100000
> > +static ioasid_t ioasid_capacity = PCI_PASID_MAX;
> > +static ioasid_t ioasid_capacity_avail = PCI_PASID_MAX;
> > +
> > +void ioasid_install_capacity(ioasid_t total)
> > +{
> > + ioasid_capacity = ioasid_capacity_avail = total;
>
> Need any check for multiple settings or setting after used?
>
Good point, capacity can only be set once. will add check to prevent
setting after use.

> > +}
> > +EXPORT_SYMBOL_GPL(ioasid_install_capacity);
> > +
> > +ioasid_t ioasid_get_capacity()
> > +{
> > + return ioasid_capacity;
> > +}
> > +EXPORT_SYMBOL_GPL(ioasid_get_capacity);
> > +
> > /*
> > * struct ioasid_allocator_data - Internal data structure to hold
> > information
> > * about an allocator. There are two types of allocators:
> > @@ -306,11 +342,23 @@ ioasid_t ioasid_alloc(struct ioasid_set *set,
> > ioasid_t min, ioasid_t max, {
> > struct ioasid_data *data;
> > void *adata;
> > - ioasid_t id;
> > + ioasid_t id = INVALID_IOASID;
> > +
> > + spin_lock(&ioasid_allocator_lock);
> > + /* Check if the IOASID set has been allocated and
> > initialized */
> > + if (WARN_ON(xa_load(&ioasid_sets, set->sid) != set)) {
> > + pr_warn("Invalid set\n");
> > + goto done_unlock;
> > + }
> > +
> > + if (set->quota <= set->nr_ioasids) {
> > + pr_err("IOASID set %d out of quota %d\n",
> > set->sid, set->quota);
> > + goto done_unlock;
> > + }
> >
> > data = kzalloc(sizeof(*data), GFP_ATOMIC);
> > if (!data)
> > - return INVALID_IOASID;
> > + goto done_unlock;
> >
> > data->set = set;
> > data->private = private;
> > @@ -319,7 +367,6 @@ ioasid_t ioasid_alloc(struct ioasid_set *set,
> > ioasid_t min, ioasid_t max,
> > * Custom allocator needs allocator data to perform
> > platform specific
> > * operations.
> > */
> > - spin_lock(&ioasid_allocator_lock);
> > adata = active_allocator->flags &
> > IOASID_ALLOCATOR_CUSTOM ? active_allocator->ops->pdata : data; id =
> > active_allocator->ops->alloc(min, max, adata); if (id ==
> > INVALID_IOASID) { @@ -335,42 +382,339 @@ ioasid_t
> > ioasid_alloc(struct ioasid_set *set, ioasid_t min, ioasid_t max,
> > goto exit_free; }
> > data->id = id;
> > + data->state = IOASID_STATE_ACTIVE;
> > + refcount_set(&data->users, 1);
> > +
> > + /* Store IOASID in the per set data */
> > + if (xa_err(xa_store(&set->xa, id, data, GFP_ATOMIC))) {
> > + pr_err("Failed to ioasid %d in set %d\n", id,
> > set->sid);
> > + goto exit_free;
> > + }
> > + set->nr_ioasids++;
> > + goto done_unlock;
> >
> > - spin_unlock(&ioasid_allocator_lock);
> > - return id;
> > exit_free:
> > - spin_unlock(&ioasid_allocator_lock);
> > kfree(data);
> > - return INVALID_IOASID;
> > +done_unlock:
> > + spin_unlock(&ioasid_allocator_lock);
> > + return id;
> > }
> > EXPORT_SYMBOL_GPL(ioasid_alloc);
> >
> > +static void ioasid_do_free(struct ioasid_data *data)
> > +{
> > + struct ioasid_data *ioasid_data;
> > + struct ioasid_set *sdata;
> > +
> > + active_allocator->ops->free(data->id,
> > active_allocator->ops->pdata);
> > + /* Custom allocator needs additional steps to free the xa
> > element */
> > + if (active_allocator->flags & IOASID_ALLOCATOR_CUSTOM) {
> > + ioasid_data = xa_erase(&active_allocator->xa,
> > data->id);
> > + kfree_rcu(ioasid_data, rcu);
> > + }
> > +
> > + sdata = xa_load(&ioasid_sets, data->set->sid);
> > + if (!sdata) {
> > + pr_err("No set %d for IOASID %d\n", data->set->sid,
> > + data->id);
> > + return;
> > + }
> > + xa_erase(&sdata->xa, data->id);
> > + sdata->nr_ioasids--;
> > +}
> > +
> > +static void ioasid_free_locked(struct ioasid_set *set, ioasid_t
> > ioasid) +{
> > + struct ioasid_data *data;
> > +
> > + data = xa_load(&active_allocator->xa, ioasid);
> > + if (!data) {
> > + pr_err("Trying to free unknown IOASID %u\n",
> > ioasid);
> > + return;
> > + }
> > +
> > + if (data->set != set) {
> > + pr_warn("Cannot free IOASID %u due to set
> > ownership\n", ioasid);
> > + return;
> > + }
> > + data->state = IOASID_STATE_FREE_PENDING;
> > +
> > + if (!refcount_dec_and_test(&data->users))
> > + return;
> > +
> > + ioasid_do_free(data);
> > +}
> > +
> > /**
> > - * ioasid_free - Free an IOASID
> > - * @ioasid: the ID to remove
> > + * ioasid_free - Drop reference on an IOASID. Free if refcount
> > drops to 0,
> > + * including free from its set and system-wide list.
> > + * @set: The ioasid_set to check permission with. If not
> > NULL, IOASID
> > + * free will fail if the set does not match.
> > + * @ioasid: The IOASID to remove
> > */
> > -void ioasid_free(ioasid_t ioasid)
> > +void ioasid_free(struct ioasid_set *set, ioasid_t ioasid)
> > {
> > - struct ioasid_data *ioasid_data;
> > + spin_lock(&ioasid_allocator_lock);
> > + ioasid_free_locked(set, ioasid);
> > + spin_unlock(&ioasid_allocator_lock);
> > +}
> > +EXPORT_SYMBOL_GPL(ioasid_free);
> >
> > +/**
> > + * ioasid_alloc_set - Allocate a new IOASID set for a given token
> > + *
> > + * @token: Unique token of the IOASID set, cannot be NULL
>
> What's the use of @token? I might be able to find the answer in the
> code, but i have no idea when i comes here first time. :-)
>
You are right, I need to clarify here. How about:
* @token: An optional arbitrary number that can be associated
with the
* IOASID set. @token can be NULL if the type is
* IOASID_SET_TYPE_NULL
* @quota: Quota allowed in this set
* @type: The type of the token used to create the IOASID set


> This line of comment says that token cannot be NULL. It seems not to
> match the real code, where token could be NULL if type is
> IOASID_SET_TYPE_NULL.
>
Good catch, it was a relic from previous versions.

> > + * @quota: Quota allowed in this set. Only for new set
> > creation
> > + * @flags: Special requirements
> > + *
> > + * IOASID can be limited system-wide resource that requires quota
> > management.
> > + * If caller does not wish to enforce quota, use
> > IOASID_SET_NO_QUOTA flag.
>
> If you are not going to add NO_QUOTA support this time, I'd suggest
> you to remove above comment.
>
Yes, will remove.

> > + *
> > + * Token will be stored in the ioasid_set returned. A reference
> > will be taken
> > + * upon finding a matching set or newly created set.
> > + * IOASID allocation within the set and other per set operations
> > will use
> > + * the retured ioasid_set *.
>
> nit: remove *, or you mean the ioasid_set pointer?
>
Sounds good

> > + *
> > + */
> > +struct ioasid_set *ioasid_alloc_set(void *token, ioasid_t quota,
> > int type) +{
> > + struct ioasid_set *sdata;
> > + unsigned long index;
> > + ioasid_t id;
> > +
> > + if (type >= IOASID_SET_TYPE_NR)
> > + return ERR_PTR(-EINVAL);
> > +
> > + /*
> > + * Need to check space available if we share system-wide
> > quota.
> > + * TODO: we may need to support quota free sets in the
> > future.
> > + */
> > spin_lock(&ioasid_allocator_lock);
> > - ioasid_data = xa_load(&active_allocator->xa, ioasid);
> > - if (!ioasid_data) {
> > - pr_err("Trying to free unknown IOASID %u\n",
> > ioasid);
> > + if (quota > ioasid_capacity_avail) {
>
> Thinking that ioasid_set itself needs an ioasid, so the check might be
>
> (quota + 1 > ioasid_capacity_avail)?
>
ioasid_set does not need an ioasid, the set ID is allocated from a
different XArray: ioasid_sets.

> > + pr_warn("Out of IOASID capacity! ask %d, avail
> > %d\n",
> > + quota, ioasid_capacity_avail);
> > + sdata = ERR_PTR(-ENOSPC);
> > goto exit_unlock;
> > }
> >
> > - active_allocator->ops->free(ioasid,
> > active_allocator->ops->pdata);
> > - /* Custom allocator needs additional steps to free the xa
> > element */
> > - if (active_allocator->flags & IOASID_ALLOCATOR_CUSTOM) {
> > - ioasid_data = xa_erase(&active_allocator->xa,
> > ioasid);
> > - kfree_rcu(ioasid_data, rcu);
> > + /*
> > + * Token is only unique within its types but right now we
> > have only
> > + * mm type. If we have more token types, we have to match
> > type as well.
> > + */
> > + switch (type) {
> > + case IOASID_SET_TYPE_MM:
> > + /* Search existing set tokens, reject duplicates */
> > + xa_for_each(&ioasid_sets, index, sdata) {
> > + if (sdata->token == token &&
> > + sdata->type == IOASID_SET_TYPE_MM)
> > {
> > + sdata = ERR_PTR(-EEXIST);
> > + goto exit_unlock;
> > + }
> > + }
>
> Do you need to enforce non-NULL token policy here?
>
yes, mm type cannot have NULL token. NULL type must have NULL token.
This design has been flipped a couple of times. Thanks for catching
this.

> > + break;
> > + case IOASID_SET_TYPE_NULL:
> > + if (!token)
> > + break;
should not be NULL

> > + fallthrough;
> > + default:
> > + pr_err("Invalid token and IOASID type\n");
> > + sdata = ERR_PTR(-EINVAL);
> > + goto exit_unlock;
> > }
> >
> > + /* REVISIT: may support set w/o quota, use system
> > available */
> > + if (!quota) {
> > + sdata = ERR_PTR(-EINVAL);
> > + goto exit_unlock;
> > + }
> > +
> > + sdata = kzalloc(sizeof(*sdata), GFP_ATOMIC);
> > + if (!sdata) {
> > + sdata = ERR_PTR(-ENOMEM);
> > + goto exit_unlock;
> > + }
> > +
> > + if (xa_alloc(&ioasid_sets, &id, sdata,
> > + XA_LIMIT(0, ioasid_capacity_avail - quota),
> > + GFP_ATOMIC)) {
> > + kfree(sdata);
> > + sdata = ERR_PTR(-ENOSPC);
> > + goto exit_unlock;
> > + }
> > +
> > + sdata->token = token;
> > + sdata->type = type;
> > + sdata->quota = quota;
> > + sdata->sid = id;
> > + refcount_set(&sdata->ref, 1);
> > +
> > + /*
> > + * Per set XA is used to store private IDs within the set,
> > get ready
> > + * for ioasid_set private ID and system-wide IOASID
> > allocation
> > + * results.
> > + */
>
> I'm not sure that I understood the per-set XA correctly. Is it used to
> save both private ID and real ID allocated from the system-wide pool?
> If so, isn't private id might be equal to real ID?
>
> > + xa_init_flags(&sdata->xa, XA_FLAGS_ALLOC);
> > + ioasid_capacity_avail -= quota;
>
> As mentioned above, the ioasid_set consumed one extra ioasid, so
>
> ioasid_capacity_avail -= (quota + 1);
>
> ?
>
Same explanation, make sense?

> > +
> > exit_unlock:
> > spin_unlock(&ioasid_allocator_lock);
> > +
> > + return sdata;
> > }
> > -EXPORT_SYMBOL_GPL(ioasid_free);
> > +EXPORT_SYMBOL_GPL(ioasid_alloc_set);
> > +
> > +void ioasid_set_get_locked(struct ioasid_set *set)
> > +{
> > + if (WARN_ON(xa_load(&ioasid_sets, set->sid) != set)) {
> > + pr_warn("Invalid set data\n");
> > + return;
> > + }
> > +
> > + refcount_inc(&set->ref);
> > +}
> > +EXPORT_SYMBOL_GPL(ioasid_set_get_locked);
> > +
> > +void ioasid_set_get(struct ioasid_set *set)
> > +{
> > + spin_lock(&ioasid_allocator_lock);
> > + ioasid_set_get_locked(set);
> > + spin_unlock(&ioasid_allocator_lock);
> > +}
> > +EXPORT_SYMBOL_GPL(ioasid_set_get);
> > +
> > +void ioasid_set_put_locked(struct ioasid_set *set)
> > +{
> > + struct ioasid_data *entry;
> > + unsigned long index;
> > +
> > + if (WARN_ON(xa_load(&ioasid_sets, set->sid) != set)) {
>
> There are multiple occurences of this line of code, how about defining
> it as a inline helper?
>
> static inline bool ioasid_is_valid(struct ioasid_set *set)
> {
> return xa_load(&ioasid_sets, set->sid) == set;
> }
>
Sounds good. perhaps the name should be
ioasid_set_is_valid()?

> > + pr_warn("Invalid set data\n");
> > + return;
> > + }
> > +
> > + if (!refcount_dec_and_test(&set->ref)) {
> > + pr_debug("%s: IOASID set %d has %d users\n",
> > + __func__, set->sid,
> > refcount_read(&set->ref));
> > + return;
> > + }
> > +
> > + /* The set is already empty, we just destroy the set. */
> > + if (xa_empty(&set->xa))
> > + goto done_destroy;
> > +
> > + /*
> > + * Free all PASIDs from system-wide IOASID pool, all
> > subscribers gets
> > + * notified and do cleanup of their own.
> > + * Note that some references of the IOASIDs within the set
> > can still
> > + * be held after the free call. This is OK in that the
> > IOASIDs will be
> > + * marked inactive, the only operations can be done is
> > ioasid_put.
> > + * No need to track IOASID set states since there is no
> > reclaim phase.
> > + */
> > + xa_for_each(&set->xa, index, entry) {
> > + ioasid_free_locked(set, index);
> > + /* Free from per set private pool */
> > + xa_erase(&set->xa, index);
> > + }
> > +
> > +done_destroy:
> > + /* Return the quota back to system pool */
> > + ioasid_capacity_avail += set->quota;
> > + kfree_rcu(set, rcu);
> > +
> > + /*
> > + * Token got released right away after the ioasid_set is
> > freed.
> > + * If a new set is created immediately with the newly
> > released token,
> > + * it will not allocate the same IOASIDs unless they are
> > reclaimed.
> > + */
> > + xa_erase(&ioasid_sets, set->sid);
>
> No. pointer is used after free.
>
Right, will move it before kfree_rcu.

> > +}
> > +EXPORT_SYMBOL_GPL(ioasid_set_put_locked);
> > +
> > +/**
> > + * ioasid_set_put - Drop a reference to the IOASID set. Free all
> > IOASIDs within
> > + * the set if there are no more users.
> > + *
> > + * @set: The IOASID set ID to be freed
> > + *
> > + * If refcount drops to zero, all IOASIDs allocated within the set
> > will be
> > + * freed.
> > + */
> > +void ioasid_set_put(struct ioasid_set *set)
> > +{
> > + spin_lock(&ioasid_allocator_lock);
> > + ioasid_set_put_locked(set);
> > + spin_unlock(&ioasid_allocator_lock);
> > +}
> > +EXPORT_SYMBOL_GPL(ioasid_set_put);
> > +
> > +/**
> > + * ioasid_adjust_set - Adjust the quota of an IOASID set
> > + * @set: IOASID set to be assigned
> > + * @quota: Quota allowed in this set
> > + *
> > + * Return 0 on success. If the new quota is smaller than the
> > number of
> > + * IOASIDs already allocated, -EINVAL will be returned. No change
> > will be
> > + * made to the existing quota.
> > + */
> > +int ioasid_adjust_set(struct ioasid_set *set, int quota)
> > +{
> > + int ret = 0;
> > +
> > + spin_lock(&ioasid_allocator_lock);
> > + if (set->nr_ioasids > quota) {
> > + pr_err("New quota %d is smaller than outstanding
> > IOASIDs %d\n",
> > + quota, set->nr_ioasids);
> > + ret = -EINVAL;
> > + goto done_unlock;
> > + }
> > +
> > + if (quota >= ioasid_capacity_avail) {
>
> This check doesn't make sense since you are updating (not asking for)
> a quota.
>
> if ((quota > set->quota) &&
> (quota - set->quota > ioasid_capacity_avail))
>
Good point, will fix.

Thanks a lot!

> > + ret = -ENOSPC;
> > + goto done_unlock;
> > + }
> > +
> > + /* Return the delta back to system pool */
> > + ioasid_capacity_avail += set->quota - quota;
>
> ioasid_capacity_avail is defined as a unsigned int, hence this always
> increase the available capacity value even the caller is asking for a
> bigger quota?
>
> > +
> > + /*
> > + * May have a policy to prevent giving all available
> > IOASIDs
> > + * to one set. But we don't enforce here, it should be in
> > the
> > + * upper layers.
> > + */
> > + set->quota = quota;
> > +
> > +done_unlock:
> > + spin_unlock(&ioasid_allocator_lock);
> > +
> > + return ret;
> > +}
> > +EXPORT_SYMBOL_GPL(ioasid_adjust_set);
> > +
> > +/**
> > + * ioasid_set_for_each_ioasid - Iterate over all the IOASIDs
> > within the set
> > + *
> > + * Caller must hold a reference of the set and handles its own
> > locking.
>
> Do you need to hold ioasid_allocator_lock here?
>
> > + */
> > +int ioasid_set_for_each_ioasid(struct ioasid_set *set,
> > + void (*fn)(ioasid_t id, void *data),
> > + void *data)
> > +{
> > + struct ioasid_data *entry;
> > + unsigned long index;
> > + int ret = 0;
> > +
> > + if (xa_empty(&set->xa)) {
> > + pr_warn("No IOASIDs in the set %d\n", set->sid);
> > + return -ENOENT;
> > + }
> > +
> > + xa_for_each(&set->xa, index, entry) {
> > + fn(index, data);
> > + }
> > +
> > + return ret;
> > +}
> > +EXPORT_SYMBOL_GPL(ioasid_set_for_each_ioasid);
> >
> > /**
> > * ioasid_find - Find IOASID data
> > diff --git a/include/linux/ioasid.h b/include/linux/ioasid.h
> > index 9c44947a68c8..412d025d440e 100644
> > --- a/include/linux/ioasid.h
> > +++ b/include/linux/ioasid.h
> > @@ -10,8 +10,35 @@ typedef unsigned int ioasid_t;
> > typedef ioasid_t (*ioasid_alloc_fn_t)(ioasid_t min, ioasid_t max,
> > void *data); typedef void (*ioasid_free_fn_t)(ioasid_t ioasid, void
> > *data);
> > +/* IOASID set types */
> > +enum ioasid_set_type {
> > + IOASID_SET_TYPE_NULL = 1, /* Set token is NULL */
> > + IOASID_SET_TYPE_MM, /* Set token is a mm_struct,
> > + * i.e. associated with a process
> > + */
> > + IOASID_SET_TYPE_NR,
> > +};
> > +
> > +/**
> > + * struct ioasid_set - Meta data about ioasid_set
> > + * @type: Token types and other features
> > + * @token: Unique to identify an IOASID set
> > + * @xa: XArray to store ioasid_set private IDs, can
> > be used for
> > + * guest-host IOASID mapping, or just a private
> > IOASID namespace.
> > + * @quota: Max number of IOASIDs can be allocated within
> > the set
> > + * @nr_ioasids Number of IOASIDs currently allocated in the
> > set
> > + * @sid: ID of the set
> > + * @ref: Reference count of the users
> > + */
> > struct ioasid_set {
> > - int dummy;
> > + void *token;
> > + struct xarray xa;
> > + int type;
> > + int quota;
> > + int nr_ioasids;
> > + int sid;
> > + refcount_t ref;
> > + struct rcu_head rcu;
> > };
> >
> > /**
> > @@ -29,31 +56,64 @@ struct ioasid_allocator_ops {
> > void *pdata;
> > };
> >
> > -#define DECLARE_IOASID_SET(name) struct ioasid_set name = { 0 }
> > -
> > #if IS_ENABLED(CONFIG_IOASID)
> > +void ioasid_install_capacity(ioasid_t total);
> > +ioasid_t ioasid_get_capacity(void);
> > +struct ioasid_set *ioasid_alloc_set(void *token, ioasid_t quota,
> > int type); +int ioasid_adjust_set(struct ioasid_set *set, int
> > quota); +void ioasid_set_get_locked(struct ioasid_set *set);
> > +void ioasid_set_put_locked(struct ioasid_set *set);
> > +void ioasid_set_put(struct ioasid_set *set);
> > +
> > ioasid_t ioasid_alloc(struct ioasid_set *set, ioasid_t min,
> > ioasid_t max, void *private);
> > -void ioasid_free(ioasid_t ioasid);
> > -void *ioasid_find(struct ioasid_set *set, ioasid_t ioasid,
> > - bool (*getter)(void *));
> > +void ioasid_free(struct ioasid_set *set, ioasid_t ioasid);
> > +
> > +bool ioasid_is_active(ioasid_t ioasid);
> > +
> > +void *ioasid_find(struct ioasid_set *set, ioasid_t ioasid, bool
> > (*getter)(void *)); +int ioasid_attach_data(ioasid_t ioasid, void
> > *data); int ioasid_register_allocator(struct ioasid_allocator_ops
> > *allocator); void ioasid_unregister_allocator(struct
> > ioasid_allocator_ops *allocator); -int ioasid_attach_data(ioasid_t
> > ioasid, void *data); -
> > +void ioasid_is_in_set(struct ioasid_set *set, ioasid_t ioasid);
> > +int ioasid_set_for_each_ioasid(struct ioasid_set *sdata,
> > + void (*fn)(ioasid_t id, void *data),
> > + void *data);
> > #else /* !CONFIG_IOASID */
> > +static inline void ioasid_install_capacity(ioasid_t total)
> > +{
> > +}
> > +
> > +static inline ioasid_t ioasid_get_capacity(void)
> > +{
> > + return 0;
> > +}
> > +
> > static inline ioasid_t ioasid_alloc(struct ioasid_set *set,
> > ioasid_t min, ioasid_t max, void *private)
> > {
> > return INVALID_IOASID;
> > }
> >
> > -static inline void ioasid_free(ioasid_t ioasid)
> > +static inline void ioasid_free(struct ioasid_set *set, ioasid_t
> > ioasid) +{
> > +}
> > +
> > +static inline bool ioasid_is_active(ioasid_t ioasid)
> > +{
> > + return false;
> > +}
> > +
> > +static inline struct ioasid_set *ioasid_alloc_set(void *token,
> > ioasid_t quota, int type) +{
> > + return ERR_PTR(-ENOTSUPP);
> > +}
> > +
> > +static inline void ioasid_set_put(struct ioasid_set *set)
> > {
> > }
> >
> > -static inline void *ioasid_find(struct ioasid_set *set, ioasid_t
> > ioasid,
> > - bool (*getter)(void *))
> > +static inline void *ioasid_find(struct ioasid_set *set, ioasid_t
> > ioasid, bool (*getter)(void *)) {
> > return NULL;
> > }
> >
>
> Best regards,
> baolu
> _______________________________________________
> iommu mailing list
> [email protected]
> https://lists.linuxfoundation.org/mailman/listinfo/iommu

[Jacob Pan]

2020-09-02 03:04:55

by Lu Baolu

[permalink] [raw]
Subject: Re: [PATCH v2 3/9] iommu/ioasid: Introduce ioasid_set APIs

Hi,

On 9/2/20 5:28 AM, Jacob Pan wrote:
> On Mon, 24 Aug 2020 10:24:11 +0800
> Lu Baolu <[email protected]> wrote:
>
>> Hi Jacob,
>>
>> On 8/22/20 12:35 PM, Jacob Pan wrote:
>>> ioasid_set was introduced as an arbitrary token that are shared by a
>>> group of IOASIDs. For example, if IOASID #1 and #2 are allocated
>>> via the same ioasid_set*, they are viewed as to belong to the same
>>> set.
>>>
>>> For guest SVA usages, system-wide IOASID resources need to be
>>> partitioned such that VMs can have its own quota and being managed
>>> separately. ioasid_set is the perfect candidate for meeting such
>>> requirements. This patch redefines and extends ioasid_set with the
>>> following new fields:
>>> - Quota
>>> - Reference count
>>> - Storage of its namespace
>>> - The token is stored in the new ioasid_set but with optional types
>>>
>>> ioasid_set level APIs are introduced that wires up these new data.
>>> Existing users of IOASID APIs are converted where a host IOASID set
>>> is allocated for bare-metal usage.
>>>
>>> Signed-off-by: Liu Yi L <[email protected]>
>>> Signed-off-by: Jacob Pan <[email protected]>
>>> ---
>>> drivers/iommu/intel/iommu.c | 27 ++-
>>> drivers/iommu/intel/pasid.h | 1 +
>>> drivers/iommu/intel/svm.c | 8 +-
>>> drivers/iommu/ioasid.c | 390
>>> +++++++++++++++++++++++++++++++++++++++++---
>>> include/linux/ioasid.h | 82 ++++++++-- 5 files changed, 465
>>> insertions(+), 43 deletions(-)
>>>
>>> diff --git a/drivers/iommu/intel/iommu.c
>>> b/drivers/iommu/intel/iommu.c index a3a0b5c8921d..5813eeaa5edb
>>> 100644 --- a/drivers/iommu/intel/iommu.c
>>> +++ b/drivers/iommu/intel/iommu.c
>>> @@ -42,6 +42,7 @@
>>> #include <linux/crash_dump.h>
>>> #include <linux/numa.h>
>>> #include <linux/swiotlb.h>
>>> +#include <linux/ioasid.h>
>>> #include <asm/irq_remapping.h>
>>> #include <asm/cacheflush.h>
>>> #include <asm/iommu.h>
>>> @@ -103,6 +104,9 @@
>>> */
>>> #define INTEL_IOMMU_PGSIZES (~0xFFFUL)
>>>
>>> +/* PASIDs used by host SVM */
>>> +struct ioasid_set *host_pasid_set;
>>> +
>>> static inline int agaw_to_level(int agaw)
>>> {
>>> return agaw + 2;
>>> @@ -3103,8 +3107,8 @@ static void intel_vcmd_ioasid_free(ioasid_t
>>> ioasid, void *data)
>>> * Sanity check the ioasid owner is done at upper layer,
>>> e.g. VFIO
>>> * We can only free the PASID when all the devices are
>>> unbound. */
>>> - if (ioasid_find(NULL, ioasid, NULL)) {
>>> - pr_alert("Cannot free active IOASID %d\n", ioasid);
>>> + if (IS_ERR(ioasid_find(host_pasid_set, ioasid, NULL))) {
>>> + pr_err("Cannot free IOASID %d, not in system
>>> set\n", ioasid); return;
>>> }
>>> vcmd_free_pasid(iommu, ioasid);
>>> @@ -3288,6 +3292,19 @@ static int __init init_dmars(void)
>>> if (ret)
>>> goto free_iommu;
>>>
>>> + /* PASID is needed for scalable mode irrespective to SVM */
>>> + if (intel_iommu_sm) {
>>> + ioasid_install_capacity(intel_pasid_max_id);
>>> + /* We should not run out of IOASIDs at boot */
>>> + host_pasid_set = ioasid_alloc_set(NULL,
>>> PID_MAX_DEFAULT,
>>> +
>>> IOASID_SET_TYPE_NULL);
>>> + if (IS_ERR_OR_NULL(host_pasid_set)) {
>>> + pr_err("Failed to enable host PASID
>>> allocator %lu\n",
>>> + PTR_ERR(host_pasid_set));
>>> + intel_iommu_sm = 0;
>>> + }
>>> + }
>>> +
>>> /*
>>> * for each drhd
>>> * enable fault log
>>> @@ -5149,7 +5166,7 @@ static void auxiliary_unlink_device(struct
>>> dmar_domain *domain, domain->auxd_refcnt--;
>>>
>>> if (!domain->auxd_refcnt && domain->default_pasid > 0)
>>> - ioasid_free(domain->default_pasid);
>>> + ioasid_free(host_pasid_set, domain->default_pasid);
>>> }
>>>
>>> static int aux_domain_add_dev(struct dmar_domain *domain,
>>> @@ -5167,7 +5184,7 @@ static int aux_domain_add_dev(struct
>>> dmar_domain *domain, int pasid;
>>>
>>> /* No private data needed for the default pasid */
>>> - pasid = ioasid_alloc(NULL, PASID_MIN,
>>> + pasid = ioasid_alloc(host_pasid_set, PASID_MIN,
>>> pci_max_pasids(to_pci_dev(dev))
>>> - 1, NULL);
>>> if (pasid == INVALID_IOASID) {
>>> @@ -5210,7 +5227,7 @@ static int aux_domain_add_dev(struct
>>> dmar_domain *domain, spin_unlock(&iommu->lock);
>>> spin_unlock_irqrestore(&device_domain_lock, flags);
>>> if (!domain->auxd_refcnt && domain->default_pasid > 0)
>>> - ioasid_free(domain->default_pasid);
>>> + ioasid_free(host_pasid_set, domain->default_pasid);
>>>
>>> return ret;
>>> }
>>> diff --git a/drivers/iommu/intel/pasid.h
>>> b/drivers/iommu/intel/pasid.h index c9850766c3a9..ccdc23446015
>>> 100644 --- a/drivers/iommu/intel/pasid.h
>>> +++ b/drivers/iommu/intel/pasid.h
>>> @@ -99,6 +99,7 @@ static inline bool pasid_pte_is_present(struct
>>> pasid_entry *pte) }
>>>
>>> extern u32 intel_pasid_max_id;
>>> +extern struct ioasid_set *host_pasid_set;
>>> int intel_pasid_alloc_id(void *ptr, int start, int end, gfp_t
>>> gfp); void intel_pasid_free_id(int pasid);
>>> void *intel_pasid_lookup_id(int pasid);
>>> diff --git a/drivers/iommu/intel/svm.c b/drivers/iommu/intel/svm.c
>>> index 37a9beabc0ca..634e191ca2c3 100644
>>> --- a/drivers/iommu/intel/svm.c
>>> +++ b/drivers/iommu/intel/svm.c
>>> @@ -551,7 +551,7 @@ intel_svm_bind_mm(struct device *dev, int
>>> flags, struct svm_dev_ops *ops, pasid_max = intel_pasid_max_id;
>>>
>>> /* Do not use PASID 0, reserved for RID to PASID
>>> */
>>> - svm->pasid = ioasid_alloc(NULL, PASID_MIN,
>>> + svm->pasid = ioasid_alloc(host_pasid_set,
>>> PASID_MIN, pasid_max - 1, svm);
>>> if (svm->pasid == INVALID_IOASID) {
>>> kfree(svm);
>>> @@ -568,7 +568,7 @@ intel_svm_bind_mm(struct device *dev, int
>>> flags, struct svm_dev_ops *ops, if (mm) {
>>> ret =
>>> mmu_notifier_register(&svm->notifier, mm); if (ret) {
>>> - ioasid_free(svm->pasid);
>>> + ioasid_free(host_pasid_set,
>>> svm->pasid); kfree(svm);
>>> kfree(sdev);
>>> goto out;
>>> @@ -586,7 +586,7 @@ intel_svm_bind_mm(struct device *dev, int
>>> flags, struct svm_dev_ops *ops, if (ret) {
>>> if (mm)
>>> mmu_notifier_unregister(&svm->notifier,
>>> mm);
>>> - ioasid_free(svm->pasid);
>>> + ioasid_free(host_pasid_set, svm->pasid);
>>> kfree(svm);
>>> kfree(sdev);
>>> goto out;
>>> @@ -655,7 +655,7 @@ static int intel_svm_unbind_mm(struct device
>>> *dev, int pasid) kfree_rcu(sdev, rcu);
>>>
>>> if (list_empty(&svm->devs)) {
>>> - ioasid_free(svm->pasid);
>>> + ioasid_free(host_pasid_set,
>>> svm->pasid); if (svm->mm)
>>> mmu_notifier_unregister(&svm->notifier,
>>> svm->mm); list_del(&svm->list);
>>> diff --git a/drivers/iommu/ioasid.c b/drivers/iommu/ioasid.c
>>> index 5f63af07acd5..f73b3dbfc37a 100644
>>> --- a/drivers/iommu/ioasid.c
>>> +++ b/drivers/iommu/ioasid.c
>>> @@ -1,22 +1,58 @@
>>> // SPDX-License-Identifier: GPL-2.0
>>> /*
>>> * I/O Address Space ID allocator. There is one global IOASID
>>> space, split into
>>> - * subsets. Users create a subset with DECLARE_IOASID_SET, then
>>> allocate and
>>> - * free IOASIDs with ioasid_alloc and ioasid_free.
>>> + * subsets. Users create a subset with ioasid_alloc_set, then
>>> allocate/free IDs
>>> + * with ioasid_alloc and ioasid_free.
>>> */
>>> -#include <linux/ioasid.h>
>>> #include <linux/module.h>
>>> #include <linux/slab.h>
>>> #include <linux/spinlock.h>
>>> #include <linux/xarray.h>
>>> +#include <linux/ioasid.h>
>>> +
>>> +static DEFINE_XARRAY_ALLOC(ioasid_sets);
>>> +enum ioasid_state {
>>> + IOASID_STATE_INACTIVE,
>>> + IOASID_STATE_ACTIVE,
>>> + IOASID_STATE_FREE_PENDING,
>>> +};
>>>
>>> +/**
>>> + * struct ioasid_data - Meta data about ioasid
>>> + *
>>> + * @id: Unique ID
>>> + * @users Number of active users
>>> + * @state Track state of the IOASID
>>> + * @set Meta data of the set this IOASID belongs to
>>> + * @private Private data associated with the IOASID
>>> + * @rcu For free after RCU grace period
>>> + */
>>> struct ioasid_data {
>>> ioasid_t id;
>>> struct ioasid_set *set;
>>> + refcount_t users;
>>> + enum ioasid_state state;
>>> void *private;
>>> struct rcu_head rcu;
>>> };
>>>
>>> +/* Default to PCIe standard 20 bit PASID */
>>> +#define PCI_PASID_MAX 0x100000
>>> +static ioasid_t ioasid_capacity = PCI_PASID_MAX;
>>> +static ioasid_t ioasid_capacity_avail = PCI_PASID_MAX;
>>> +
>>> +void ioasid_install_capacity(ioasid_t total)
>>> +{
>>> + ioasid_capacity = ioasid_capacity_avail = total;
>>
>> Need any check for multiple settings or setting after used?
>>
> Good point, capacity can only be set once. will add check to prevent
> setting after use.
>
>>> +}
>>> +EXPORT_SYMBOL_GPL(ioasid_install_capacity);
>>> +
>>> +ioasid_t ioasid_get_capacity()
>>> +{
>>> + return ioasid_capacity;
>>> +}
>>> +EXPORT_SYMBOL_GPL(ioasid_get_capacity);
>>> +
>>> /*
>>> * struct ioasid_allocator_data - Internal data structure to hold
>>> information
>>> * about an allocator. There are two types of allocators:
>>> @@ -306,11 +342,23 @@ ioasid_t ioasid_alloc(struct ioasid_set *set,
>>> ioasid_t min, ioasid_t max, {
>>> struct ioasid_data *data;
>>> void *adata;
>>> - ioasid_t id;
>>> + ioasid_t id = INVALID_IOASID;
>>> +
>>> + spin_lock(&ioasid_allocator_lock);
>>> + /* Check if the IOASID set has been allocated and
>>> initialized */
>>> + if (WARN_ON(xa_load(&ioasid_sets, set->sid) != set)) {
>>> + pr_warn("Invalid set\n");
>>> + goto done_unlock;
>>> + }
>>> +
>>> + if (set->quota <= set->nr_ioasids) {
>>> + pr_err("IOASID set %d out of quota %d\n",
>>> set->sid, set->quota);
>>> + goto done_unlock;
>>> + }
>>>
>>> data = kzalloc(sizeof(*data), GFP_ATOMIC);
>>> if (!data)
>>> - return INVALID_IOASID;
>>> + goto done_unlock;
>>>
>>> data->set = set;
>>> data->private = private;
>>> @@ -319,7 +367,6 @@ ioasid_t ioasid_alloc(struct ioasid_set *set,
>>> ioasid_t min, ioasid_t max,
>>> * Custom allocator needs allocator data to perform
>>> platform specific
>>> * operations.
>>> */
>>> - spin_lock(&ioasid_allocator_lock);
>>> adata = active_allocator->flags &
>>> IOASID_ALLOCATOR_CUSTOM ? active_allocator->ops->pdata : data; id =
>>> active_allocator->ops->alloc(min, max, adata); if (id ==
>>> INVALID_IOASID) { @@ -335,42 +382,339 @@ ioasid_t
>>> ioasid_alloc(struct ioasid_set *set, ioasid_t min, ioasid_t max,
>>> goto exit_free; }
>>> data->id = id;
>>> + data->state = IOASID_STATE_ACTIVE;
>>> + refcount_set(&data->users, 1);
>>> +
>>> + /* Store IOASID in the per set data */
>>> + if (xa_err(xa_store(&set->xa, id, data, GFP_ATOMIC))) {
>>> + pr_err("Failed to ioasid %d in set %d\n", id,
>>> set->sid);
>>> + goto exit_free;
>>> + }
>>> + set->nr_ioasids++;
>>> + goto done_unlock;
>>>
>>> - spin_unlock(&ioasid_allocator_lock);
>>> - return id;
>>> exit_free:
>>> - spin_unlock(&ioasid_allocator_lock);
>>> kfree(data);
>>> - return INVALID_IOASID;
>>> +done_unlock:
>>> + spin_unlock(&ioasid_allocator_lock);
>>> + return id;
>>> }
>>> EXPORT_SYMBOL_GPL(ioasid_alloc);
>>>
>>> +static void ioasid_do_free(struct ioasid_data *data)
>>> +{
>>> + struct ioasid_data *ioasid_data;
>>> + struct ioasid_set *sdata;
>>> +
>>> + active_allocator->ops->free(data->id,
>>> active_allocator->ops->pdata);
>>> + /* Custom allocator needs additional steps to free the xa
>>> element */
>>> + if (active_allocator->flags & IOASID_ALLOCATOR_CUSTOM) {
>>> + ioasid_data = xa_erase(&active_allocator->xa,
>>> data->id);
>>> + kfree_rcu(ioasid_data, rcu);
>>> + }
>>> +
>>> + sdata = xa_load(&ioasid_sets, data->set->sid);
>>> + if (!sdata) {
>>> + pr_err("No set %d for IOASID %d\n", data->set->sid,
>>> + data->id);
>>> + return;
>>> + }
>>> + xa_erase(&sdata->xa, data->id);
>>> + sdata->nr_ioasids--;
>>> +}
>>> +
>>> +static void ioasid_free_locked(struct ioasid_set *set, ioasid_t
>>> ioasid) +{
>>> + struct ioasid_data *data;
>>> +
>>> + data = xa_load(&active_allocator->xa, ioasid);
>>> + if (!data) {
>>> + pr_err("Trying to free unknown IOASID %u\n",
>>> ioasid);
>>> + return;
>>> + }
>>> +
>>> + if (data->set != set) {
>>> + pr_warn("Cannot free IOASID %u due to set
>>> ownership\n", ioasid);
>>> + return;
>>> + }
>>> + data->state = IOASID_STATE_FREE_PENDING;
>>> +
>>> + if (!refcount_dec_and_test(&data->users))
>>> + return;
>>> +
>>> + ioasid_do_free(data);
>>> +}
>>> +
>>> /**
>>> - * ioasid_free - Free an IOASID
>>> - * @ioasid: the ID to remove
>>> + * ioasid_free - Drop reference on an IOASID. Free if refcount
>>> drops to 0,
>>> + * including free from its set and system-wide list.
>>> + * @set: The ioasid_set to check permission with. If not
>>> NULL, IOASID
>>> + * free will fail if the set does not match.
>>> + * @ioasid: The IOASID to remove
>>> */
>>> -void ioasid_free(ioasid_t ioasid)
>>> +void ioasid_free(struct ioasid_set *set, ioasid_t ioasid)
>>> {
>>> - struct ioasid_data *ioasid_data;
>>> + spin_lock(&ioasid_allocator_lock);
>>> + ioasid_free_locked(set, ioasid);
>>> + spin_unlock(&ioasid_allocator_lock);
>>> +}
>>> +EXPORT_SYMBOL_GPL(ioasid_free);
>>>
>>> +/**
>>> + * ioasid_alloc_set - Allocate a new IOASID set for a given token
>>> + *
>>> + * @token: Unique token of the IOASID set, cannot be NULL
>>
>> What's the use of @token? I might be able to find the answer in the
>> code, but i have no idea when i comes here first time. :-)
>>
> You are right, I need to clarify here. How about:
> * @token: An optional arbitrary number that can be associated
> with the
> * IOASID set. @token can be NULL if the type is
> * IOASID_SET_TYPE_NULL
> * @quota: Quota allowed in this set
> * @type: The type of the token used to create the IOASID set
>
>
>> This line of comment says that token cannot be NULL. It seems not to
>> match the real code, where token could be NULL if type is
>> IOASID_SET_TYPE_NULL.
>>
> Good catch, it was a relic from previous versions.
>
>>> + * @quota: Quota allowed in this set. Only for new set
>>> creation
>>> + * @flags: Special requirements
>>> + *
>>> + * IOASID can be limited system-wide resource that requires quota
>>> management.
>>> + * If caller does not wish to enforce quota, use
>>> IOASID_SET_NO_QUOTA flag.
>>
>> If you are not going to add NO_QUOTA support this time, I'd suggest
>> you to remove above comment.
>>
> Yes, will remove.
>
>>> + *
>>> + * Token will be stored in the ioasid_set returned. A reference
>>> will be taken
>>> + * upon finding a matching set or newly created set.
>>> + * IOASID allocation within the set and other per set operations
>>> will use
>>> + * the retured ioasid_set *.
>>
>> nit: remove *, or you mean the ioasid_set pointer?
>>
> Sounds good
>
>>> + *
>>> + */
>>> +struct ioasid_set *ioasid_alloc_set(void *token, ioasid_t quota,
>>> int type) +{
>>> + struct ioasid_set *sdata;
>>> + unsigned long index;
>>> + ioasid_t id;
>>> +
>>> + if (type >= IOASID_SET_TYPE_NR)
>>> + return ERR_PTR(-EINVAL);
>>> +
>>> + /*
>>> + * Need to check space available if we share system-wide
>>> quota.
>>> + * TODO: we may need to support quota free sets in the
>>> future.
>>> + */
>>> spin_lock(&ioasid_allocator_lock);
>>> - ioasid_data = xa_load(&active_allocator->xa, ioasid);
>>> - if (!ioasid_data) {
>>> - pr_err("Trying to free unknown IOASID %u\n",
>>> ioasid);
>>> + if (quota > ioasid_capacity_avail) {
>>
>> Thinking that ioasid_set itself needs an ioasid, so the check might be
>>
>> (quota + 1 > ioasid_capacity_avail)?
>>
> ioasid_set does not need an ioasid, the set ID is allocated from a
> different XArray: ioasid_sets.

Get it now! Thanks!

>
>>> + pr_warn("Out of IOASID capacity! ask %d, avail
>>> %d\n",
>>> + quota, ioasid_capacity_avail);
>>> + sdata = ERR_PTR(-ENOSPC);
>>> goto exit_unlock;
>>> }
>>>
>>> - active_allocator->ops->free(ioasid,
>>> active_allocator->ops->pdata);
>>> - /* Custom allocator needs additional steps to free the xa
>>> element */
>>> - if (active_allocator->flags & IOASID_ALLOCATOR_CUSTOM) {
>>> - ioasid_data = xa_erase(&active_allocator->xa,
>>> ioasid);
>>> - kfree_rcu(ioasid_data, rcu);
>>> + /*
>>> + * Token is only unique within its types but right now we
>>> have only
>>> + * mm type. If we have more token types, we have to match
>>> type as well.
>>> + */
>>> + switch (type) {
>>> + case IOASID_SET_TYPE_MM:
>>> + /* Search existing set tokens, reject duplicates */
>>> + xa_for_each(&ioasid_sets, index, sdata) {
>>> + if (sdata->token == token &&
>>> + sdata->type == IOASID_SET_TYPE_MM)
>>> {
>>> + sdata = ERR_PTR(-EEXIST);
>>> + goto exit_unlock;
>>> + }
>>> + }
>>
>> Do you need to enforce non-NULL token policy here?
>>
> yes, mm type cannot have NULL token. NULL type must have NULL token.
> This design has been flipped a couple of times. Thanks for catching
> this.
>
>>> + break;
>>> + case IOASID_SET_TYPE_NULL:
>>> + if (!token)
>>> + break;
> should not be NULL
>
>>> + fallthrough;
>>> + default:
>>> + pr_err("Invalid token and IOASID type\n");
>>> + sdata = ERR_PTR(-EINVAL);
>>> + goto exit_unlock;
>>> }
>>>
>>> + /* REVISIT: may support set w/o quota, use system
>>> available */
>>> + if (!quota) {
>>> + sdata = ERR_PTR(-EINVAL);
>>> + goto exit_unlock;
>>> + }
>>> +
>>> + sdata = kzalloc(sizeof(*sdata), GFP_ATOMIC);
>>> + if (!sdata) {
>>> + sdata = ERR_PTR(-ENOMEM);
>>> + goto exit_unlock;
>>> + }
>>> +
>>> + if (xa_alloc(&ioasid_sets, &id, sdata,
>>> + XA_LIMIT(0, ioasid_capacity_avail - quota),
>>> + GFP_ATOMIC)) {
>>> + kfree(sdata);
>>> + sdata = ERR_PTR(-ENOSPC);
>>> + goto exit_unlock;
>>> + }
>>> +
>>> + sdata->token = token;
>>> + sdata->type = type;
>>> + sdata->quota = quota;
>>> + sdata->sid = id;
>>> + refcount_set(&sdata->ref, 1);
>>> +
>>> + /*
>>> + * Per set XA is used to store private IDs within the set,
>>> get ready
>>> + * for ioasid_set private ID and system-wide IOASID
>>> allocation
>>> + * results.
>>> + */
>>
>> I'm not sure that I understood the per-set XA correctly. Is it used to
>> save both private ID and real ID allocated from the system-wide pool?
>> If so, isn't private id might be equal to real ID?
>>
>>> + xa_init_flags(&sdata->xa, XA_FLAGS_ALLOC);
>>> + ioasid_capacity_avail -= quota;
>>
>> As mentioned above, the ioasid_set consumed one extra ioasid, so
>>
>> ioasid_capacity_avail -= (quota + 1);
>>
>> ?
>>
> Same explanation, make sense?

Sure.

>
>>> +
>>> exit_unlock:
>>> spin_unlock(&ioasid_allocator_lock);
>>> +
>>> + return sdata;
>>> }
>>> -EXPORT_SYMBOL_GPL(ioasid_free);
>>> +EXPORT_SYMBOL_GPL(ioasid_alloc_set);
>>> +
>>> +void ioasid_set_get_locked(struct ioasid_set *set)
>>> +{
>>> + if (WARN_ON(xa_load(&ioasid_sets, set->sid) != set)) {
>>> + pr_warn("Invalid set data\n");
>>> + return;
>>> + }
>>> +
>>> + refcount_inc(&set->ref);
>>> +}
>>> +EXPORT_SYMBOL_GPL(ioasid_set_get_locked);
>>> +
>>> +void ioasid_set_get(struct ioasid_set *set)
>>> +{
>>> + spin_lock(&ioasid_allocator_lock);
>>> + ioasid_set_get_locked(set);
>>> + spin_unlock(&ioasid_allocator_lock);
>>> +}
>>> +EXPORT_SYMBOL_GPL(ioasid_set_get);
>>> +
>>> +void ioasid_set_put_locked(struct ioasid_set *set)
>>> +{
>>> + struct ioasid_data *entry;
>>> + unsigned long index;
>>> +
>>> + if (WARN_ON(xa_load(&ioasid_sets, set->sid) != set)) {
>>
>> There are multiple occurences of this line of code, how about defining
>> it as a inline helper?
>>
>> static inline bool ioasid_is_valid(struct ioasid_set *set)
>> {
>> return xa_load(&ioasid_sets, set->sid) == set;
>> }
>>
> Sounds good. perhaps the name should be
> ioasid_set_is_valid()?

Looks good to me.

>
>>> + pr_warn("Invalid set data\n");
>>> + return;
>>> + }
>>> +
>>> + if (!refcount_dec_and_test(&set->ref)) {
>>> + pr_debug("%s: IOASID set %d has %d users\n",
>>> + __func__, set->sid,
>>> refcount_read(&set->ref));
>>> + return;
>>> + }
>>> +
>>> + /* The set is already empty, we just destroy the set. */
>>> + if (xa_empty(&set->xa))
>>> + goto done_destroy;
>>> +
>>> + /*
>>> + * Free all PASIDs from system-wide IOASID pool, all
>>> subscribers gets
>>> + * notified and do cleanup of their own.
>>> + * Note that some references of the IOASIDs within the set
>>> can still
>>> + * be held after the free call. This is OK in that the
>>> IOASIDs will be
>>> + * marked inactive, the only operations can be done is
>>> ioasid_put.
>>> + * No need to track IOASID set states since there is no
>>> reclaim phase.
>>> + */
>>> + xa_for_each(&set->xa, index, entry) {
>>> + ioasid_free_locked(set, index);
>>> + /* Free from per set private pool */
>>> + xa_erase(&set->xa, index);
>>> + }
>>> +
>>> +done_destroy:
>>> + /* Return the quota back to system pool */
>>> + ioasid_capacity_avail += set->quota;
>>> + kfree_rcu(set, rcu);
>>> +
>>> + /*
>>> + * Token got released right away after the ioasid_set is
>>> freed.
>>> + * If a new set is created immediately with the newly
>>> released token,
>>> + * it will not allocate the same IOASIDs unless they are
>>> reclaimed.
>>> + */
>>> + xa_erase(&ioasid_sets, set->sid);
>>
>> No. pointer is used after free.
>>
> Right, will move it before kfree_rcu.
>
>>> +}
>>> +EXPORT_SYMBOL_GPL(ioasid_set_put_locked);
>>> +
>>> +/**
>>> + * ioasid_set_put - Drop a reference to the IOASID set. Free all
>>> IOASIDs within
>>> + * the set if there are no more users.
>>> + *
>>> + * @set: The IOASID set ID to be freed
>>> + *
>>> + * If refcount drops to zero, all IOASIDs allocated within the set
>>> will be
>>> + * freed.
>>> + */
>>> +void ioasid_set_put(struct ioasid_set *set)
>>> +{
>>> + spin_lock(&ioasid_allocator_lock);
>>> + ioasid_set_put_locked(set);
>>> + spin_unlock(&ioasid_allocator_lock);
>>> +}
>>> +EXPORT_SYMBOL_GPL(ioasid_set_put);
>>> +
>>> +/**
>>> + * ioasid_adjust_set - Adjust the quota of an IOASID set
>>> + * @set: IOASID set to be assigned
>>> + * @quota: Quota allowed in this set
>>> + *
>>> + * Return 0 on success. If the new quota is smaller than the
>>> number of
>>> + * IOASIDs already allocated, -EINVAL will be returned. No change
>>> will be
>>> + * made to the existing quota.
>>> + */
>>> +int ioasid_adjust_set(struct ioasid_set *set, int quota)
>>> +{
>>> + int ret = 0;
>>> +
>>> + spin_lock(&ioasid_allocator_lock);
>>> + if (set->nr_ioasids > quota) {
>>> + pr_err("New quota %d is smaller than outstanding
>>> IOASIDs %d\n",
>>> + quota, set->nr_ioasids);
>>> + ret = -EINVAL;
>>> + goto done_unlock;
>>> + }
>>> +
>>> + if (quota >= ioasid_capacity_avail) {
>>
>> This check doesn't make sense since you are updating (not asking for)
>> a quota.
>>
>> if ((quota > set->quota) &&
>> (quota - set->quota > ioasid_capacity_avail))
>>
> Good point, will fix.
>
> Thanks a lot!
>
>>> + ret = -ENOSPC;
>>> + goto done_unlock;
>>> + }
>>> +
>>> + /* Return the delta back to system pool */
>>> + ioasid_capacity_avail += set->quota - quota;
>>
>> ioasid_capacity_avail is defined as a unsigned int, hence this always
>> increase the available capacity value even the caller is asking for a
>> bigger quota?
>>
>>> +
>>> + /*
>>> + * May have a policy to prevent giving all available
>>> IOASIDs
>>> + * to one set. But we don't enforce here, it should be in
>>> the
>>> + * upper layers.
>>> + */
>>> + set->quota = quota;
>>> +
>>> +done_unlock:
>>> + spin_unlock(&ioasid_allocator_lock);
>>> +
>>> + return ret;
>>> +}
>>> +EXPORT_SYMBOL_GPL(ioasid_adjust_set);
>>> +
>>> +/**
>>> + * ioasid_set_for_each_ioasid - Iterate over all the IOASIDs
>>> within the set
>>> + *
>>> + * Caller must hold a reference of the set and handles its own
>>> locking.
>>
>> Do you need to hold ioasid_allocator_lock here?
>>
>>> + */
>>> +int ioasid_set_for_each_ioasid(struct ioasid_set *set,
>>> + void (*fn)(ioasid_t id, void *data),
>>> + void *data)
>>> +{
>>> + struct ioasid_data *entry;
>>> + unsigned long index;
>>> + int ret = 0;
>>> +
>>> + if (xa_empty(&set->xa)) {
>>> + pr_warn("No IOASIDs in the set %d\n", set->sid);
>>> + return -ENOENT;
>>> + }
>>> +
>>> + xa_for_each(&set->xa, index, entry) {
>>> + fn(index, data);
>>> + }
>>> +
>>> + return ret;
>>> +}
>>> +EXPORT_SYMBOL_GPL(ioasid_set_for_each_ioasid);
>>>
>>> /**
>>> * ioasid_find - Find IOASID data
>>> diff --git a/include/linux/ioasid.h b/include/linux/ioasid.h
>>> index 9c44947a68c8..412d025d440e 100644
>>> --- a/include/linux/ioasid.h
>>> +++ b/include/linux/ioasid.h
>>> @@ -10,8 +10,35 @@ typedef unsigned int ioasid_t;
>>> typedef ioasid_t (*ioasid_alloc_fn_t)(ioasid_t min, ioasid_t max,
>>> void *data); typedef void (*ioasid_free_fn_t)(ioasid_t ioasid, void
>>> *data);
>>> +/* IOASID set types */
>>> +enum ioasid_set_type {
>>> + IOASID_SET_TYPE_NULL = 1, /* Set token is NULL */
>>> + IOASID_SET_TYPE_MM, /* Set token is a mm_struct,
>>> + * i.e. associated with a process
>>> + */
>>> + IOASID_SET_TYPE_NR,
>>> +};
>>> +
>>> +/**
>>> + * struct ioasid_set - Meta data about ioasid_set
>>> + * @type: Token types and other features
>>> + * @token: Unique to identify an IOASID set
>>> + * @xa: XArray to store ioasid_set private IDs, can
>>> be used for
>>> + * guest-host IOASID mapping, or just a private
>>> IOASID namespace.
>>> + * @quota: Max number of IOASIDs can be allocated within
>>> the set
>>> + * @nr_ioasids Number of IOASIDs currently allocated in the
>>> set
>>> + * @sid: ID of the set
>>> + * @ref: Reference count of the users
>>> + */
>>> struct ioasid_set {
>>> - int dummy;
>>> + void *token;
>>> + struct xarray xa;
>>> + int type;
>>> + int quota;
>>> + int nr_ioasids;
>>> + int sid;
>>> + refcount_t ref;
>>> + struct rcu_head rcu;
>>> };
>>>
>>> /**
>>> @@ -29,31 +56,64 @@ struct ioasid_allocator_ops {
>>> void *pdata;
>>> };
>>>
>>> -#define DECLARE_IOASID_SET(name) struct ioasid_set name = { 0 }
>>> -
>>> #if IS_ENABLED(CONFIG_IOASID)
>>> +void ioasid_install_capacity(ioasid_t total);
>>> +ioasid_t ioasid_get_capacity(void);
>>> +struct ioasid_set *ioasid_alloc_set(void *token, ioasid_t quota,
>>> int type); +int ioasid_adjust_set(struct ioasid_set *set, int
>>> quota); +void ioasid_set_get_locked(struct ioasid_set *set);
>>> +void ioasid_set_put_locked(struct ioasid_set *set);
>>> +void ioasid_set_put(struct ioasid_set *set);
>>> +
>>> ioasid_t ioasid_alloc(struct ioasid_set *set, ioasid_t min,
>>> ioasid_t max, void *private);
>>> -void ioasid_free(ioasid_t ioasid);
>>> -void *ioasid_find(struct ioasid_set *set, ioasid_t ioasid,
>>> - bool (*getter)(void *));
>>> +void ioasid_free(struct ioasid_set *set, ioasid_t ioasid);
>>> +
>>> +bool ioasid_is_active(ioasid_t ioasid);
>>> +
>>> +void *ioasid_find(struct ioasid_set *set, ioasid_t ioasid, bool
>>> (*getter)(void *)); +int ioasid_attach_data(ioasid_t ioasid, void
>>> *data); int ioasid_register_allocator(struct ioasid_allocator_ops
>>> *allocator); void ioasid_unregister_allocator(struct
>>> ioasid_allocator_ops *allocator); -int ioasid_attach_data(ioasid_t
>>> ioasid, void *data); -
>>> +void ioasid_is_in_set(struct ioasid_set *set, ioasid_t ioasid);
>>> +int ioasid_set_for_each_ioasid(struct ioasid_set *sdata,
>>> + void (*fn)(ioasid_t id, void *data),
>>> + void *data);
>>> #else /* !CONFIG_IOASID */
>>> +static inline void ioasid_install_capacity(ioasid_t total)
>>> +{
>>> +}
>>> +
>>> +static inline ioasid_t ioasid_get_capacity(void)
>>> +{
>>> + return 0;
>>> +}
>>> +
>>> static inline ioasid_t ioasid_alloc(struct ioasid_set *set,
>>> ioasid_t min, ioasid_t max, void *private)
>>> {
>>> return INVALID_IOASID;
>>> }
>>>
>>> -static inline void ioasid_free(ioasid_t ioasid)
>>> +static inline void ioasid_free(struct ioasid_set *set, ioasid_t
>>> ioasid) +{
>>> +}
>>> +
>>> +static inline bool ioasid_is_active(ioasid_t ioasid)
>>> +{
>>> + return false;
>>> +}
>>> +
>>> +static inline struct ioasid_set *ioasid_alloc_set(void *token,
>>> ioasid_t quota, int type) +{
>>> + return ERR_PTR(-ENOTSUPP);
>>> +}
>>> +
>>> +static inline void ioasid_set_put(struct ioasid_set *set)
>>> {
>>> }
>>>
>>> -static inline void *ioasid_find(struct ioasid_set *set, ioasid_t
>>> ioasid,
>>> - bool (*getter)(void *))
>>> +static inline void *ioasid_find(struct ioasid_set *set, ioasid_t
>>> ioasid, bool (*getter)(void *)) {
>>> return NULL;
>>> }
>>>
>>
>> Best regards,
>> baolu
>> _______________________________________________
>> iommu mailing list
>> [email protected]
>> https://lists.linuxfoundation.org/mailman/listinfo/iommu
>
> [Jacob Pan]
>

Best regards,
baolu

2020-09-02 21:40:59

by Jacob Pan

[permalink] [raw]
Subject: Re: [PATCH v2 3/9] iommu/ioasid: Introduce ioasid_set APIs

On Mon, 24 Aug 2020 11:30:47 -0700
Randy Dunlap <[email protected]> wrote:

> On 8/24/20 11:28 AM, Jean-Philippe Brucker wrote:
> >> +/**
> >> + * struct ioasid_data - Meta data about ioasid
> >> + *
> >> + * @id: Unique ID
> >> + * @users Number of active users
> >> + * @state Track state of the IOASID
> >> + * @set Meta data of the set this IOASID belongs to
> >> + * @private Private data associated with the IOASID
> >> + * @rcu For free after RCU grace period
> > nit: it would be nicer to follow the struct order
>
> and use a ':' after each struct member name, as is done for @id:
>
Got it, thanks.

2020-09-02 21:41:00

by Jacob Pan

[permalink] [raw]
Subject: Re: [PATCH v2 3/9] iommu/ioasid: Introduce ioasid_set APIs

On Mon, 24 Aug 2020 11:34:29 -0700
Randy Dunlap <[email protected]> wrote:

> On 8/24/20 11:28 AM, Jean-Philippe Brucker wrote:
> >> +/**
> >> + * struct ioasid_set - Meta data about ioasid_set
> >> + * @type: Token types and other features
> > nit: doesn't follow struct order
> >
> >> + * @token: Unique to identify an IOASID set
> >> + * @xa: XArray to store ioasid_set private IDs, can be used for
> >> + * guest-host IOASID mapping, or just a private IOASID namespace.
> >> + * @quota: Max number of IOASIDs can be allocated within the set
> >> + * @nr_ioasids Number of IOASIDs currently allocated in the set
>
> * @nr_ioasids: Number of IOASIDs currently allocated in the set
>
got it. thanks!

> >> + * @sid: ID of the set
> >> + * @ref: Reference count of the users
> >> + */
> >> struct ioasid_set {
> >> - int dummy;
> >> + void *token;
> >> + struct xarray xa;
> >> + int type;
> >> + int quota;
> >> + int nr_ioasids;
> >> + int sid;
> >> + refcount_t ref;
> >> + struct rcu_head rcu;
> >> };
>
>

[Jacob Pan]

2020-09-02 21:57:12

by Jacob Pan

[permalink] [raw]
Subject: Re: [PATCH v2 3/9] iommu/ioasid: Introduce ioasid_set APIs

On Mon, 24 Aug 2020 20:28:48 +0200
Jean-Philippe Brucker <[email protected]> wrote:

> On Fri, Aug 21, 2020 at 09:35:12PM -0700, Jacob Pan wrote:
> > ioasid_set was introduced as an arbitrary token that are shared by a
> > group of IOASIDs. For example, if IOASID #1 and #2 are allocated
> > via the same ioasid_set*, they are viewed as to belong to the same
> > set.
> >
> > For guest SVA usages, system-wide IOASID resources need to be
> > partitioned such that VMs can have its own quota and being managed
> > separately. ioasid_set is the perfect candidate for meeting such
> > requirements. This patch redefines and extends ioasid_set with the
> > following new fields:
> > - Quota
> > - Reference count
> > - Storage of its namespace
> > - The token is stored in the new ioasid_set but with optional types
> >
> > ioasid_set level APIs are introduced that wires up these new data.
> > Existing users of IOASID APIs are converted where a host IOASID set
> > is allocated for bare-metal usage.
> >
> > Signed-off-by: Liu Yi L <[email protected]>
> > Signed-off-by: Jacob Pan <[email protected]>
> > ---
> > drivers/iommu/intel/iommu.c | 27 ++-
> > drivers/iommu/intel/pasid.h | 1 +
> > drivers/iommu/intel/svm.c | 8 +-
> > drivers/iommu/ioasid.c | 390
> > +++++++++++++++++++++++++++++++++++++++++---
> > include/linux/ioasid.h | 82 ++++++++-- 5 files changed, 465
> > insertions(+), 43 deletions(-)
> >
> > diff --git a/drivers/iommu/intel/iommu.c
> > b/drivers/iommu/intel/iommu.c index a3a0b5c8921d..5813eeaa5edb
> > 100644 --- a/drivers/iommu/intel/iommu.c
> > +++ b/drivers/iommu/intel/iommu.c
> > @@ -42,6 +42,7 @@
> > #include <linux/crash_dump.h>
> > #include <linux/numa.h>
> > #include <linux/swiotlb.h>
> > +#include <linux/ioasid.h>
> > #include <asm/irq_remapping.h>
> > #include <asm/cacheflush.h>
> > #include <asm/iommu.h>
> > @@ -103,6 +104,9 @@
> > */
> > #define INTEL_IOMMU_PGSIZES (~0xFFFUL)
> >
> > +/* PASIDs used by host SVM */
> > +struct ioasid_set *host_pasid_set;
> > +
> > static inline int agaw_to_level(int agaw)
> > {
> > return agaw + 2;
> > @@ -3103,8 +3107,8 @@ static void intel_vcmd_ioasid_free(ioasid_t
> > ioasid, void *data)
> > * Sanity check the ioasid owner is done at upper layer,
> > e.g. VFIO
> > * We can only free the PASID when all the devices are
> > unbound. */
> > - if (ioasid_find(NULL, ioasid, NULL)) {
> > - pr_alert("Cannot free active IOASID %d\n", ioasid);
> > + if (IS_ERR(ioasid_find(host_pasid_set, ioasid, NULL))) {
> > + pr_err("Cannot free IOASID %d, not in system
> > set\n", ioasid); return;
> > }
> > vcmd_free_pasid(iommu, ioasid);
> > @@ -3288,6 +3292,19 @@ static int __init init_dmars(void)
> > if (ret)
> > goto free_iommu;
> >
> > + /* PASID is needed for scalable mode irrespective to SVM */
> > + if (intel_iommu_sm) {
> > + ioasid_install_capacity(intel_pasid_max_id);
> > + /* We should not run out of IOASIDs at boot */
> > + host_pasid_set = ioasid_alloc_set(NULL,
> > PID_MAX_DEFAULT,
> > +
> > IOASID_SET_TYPE_NULL);
> > + if (IS_ERR_OR_NULL(host_pasid_set)) {
> > + pr_err("Failed to enable host PASID
> > allocator %lu\n",
> > + PTR_ERR(host_pasid_set));
> > + intel_iommu_sm = 0;
> > + }
> > + }
> > +
> > /*
> > * for each drhd
> > * enable fault log
> > @@ -5149,7 +5166,7 @@ static void auxiliary_unlink_device(struct
> > dmar_domain *domain, domain->auxd_refcnt--;
> >
> > if (!domain->auxd_refcnt && domain->default_pasid > 0)
> > - ioasid_free(domain->default_pasid);
> > + ioasid_free(host_pasid_set, domain->default_pasid);
> > }
> >
> > static int aux_domain_add_dev(struct dmar_domain *domain,
> > @@ -5167,7 +5184,7 @@ static int aux_domain_add_dev(struct
> > dmar_domain *domain, int pasid;
> >
> > /* No private data needed for the default pasid */
> > - pasid = ioasid_alloc(NULL, PASID_MIN,
> > + pasid = ioasid_alloc(host_pasid_set, PASID_MIN,
> > pci_max_pasids(to_pci_dev(dev))
> > - 1, NULL);
> > if (pasid == INVALID_IOASID) {
> > @@ -5210,7 +5227,7 @@ static int aux_domain_add_dev(struct
> > dmar_domain *domain, spin_unlock(&iommu->lock);
> > spin_unlock_irqrestore(&device_domain_lock, flags);
> > if (!domain->auxd_refcnt && domain->default_pasid > 0)
> > - ioasid_free(domain->default_pasid);
> > + ioasid_free(host_pasid_set, domain->default_pasid);
> >
> > return ret;
> > }
> > diff --git a/drivers/iommu/intel/pasid.h
> > b/drivers/iommu/intel/pasid.h index c9850766c3a9..ccdc23446015
> > 100644 --- a/drivers/iommu/intel/pasid.h
> > +++ b/drivers/iommu/intel/pasid.h
> > @@ -99,6 +99,7 @@ static inline bool pasid_pte_is_present(struct
> > pasid_entry *pte) }
> >
> > extern u32 intel_pasid_max_id;
> > +extern struct ioasid_set *host_pasid_set;
> > int intel_pasid_alloc_id(void *ptr, int start, int end, gfp_t gfp);
> > void intel_pasid_free_id(int pasid);
> > void *intel_pasid_lookup_id(int pasid);
> > diff --git a/drivers/iommu/intel/svm.c b/drivers/iommu/intel/svm.c
> > index 37a9beabc0ca..634e191ca2c3 100644
> > --- a/drivers/iommu/intel/svm.c
> > +++ b/drivers/iommu/intel/svm.c
> > @@ -551,7 +551,7 @@ intel_svm_bind_mm(struct device *dev, int
> > flags, struct svm_dev_ops *ops, pasid_max = intel_pasid_max_id;
> >
> > /* Do not use PASID 0, reserved for RID to PASID */
> > - svm->pasid = ioasid_alloc(NULL, PASID_MIN,
> > + svm->pasid = ioasid_alloc(host_pasid_set,
> > PASID_MIN, pasid_max - 1, svm);
> > if (svm->pasid == INVALID_IOASID) {
> > kfree(svm);
> > @@ -568,7 +568,7 @@ intel_svm_bind_mm(struct device *dev, int
> > flags, struct svm_dev_ops *ops, if (mm) {
> > ret =
> > mmu_notifier_register(&svm->notifier, mm); if (ret) {
> > - ioasid_free(svm->pasid);
> > + ioasid_free(host_pasid_set,
> > svm->pasid); kfree(svm);
> > kfree(sdev);
> > goto out;
> > @@ -586,7 +586,7 @@ intel_svm_bind_mm(struct device *dev, int
> > flags, struct svm_dev_ops *ops, if (ret) {
> > if (mm)
> > mmu_notifier_unregister(&svm->notifier,
> > mm);
> > - ioasid_free(svm->pasid);
> > + ioasid_free(host_pasid_set, svm->pasid);
> > kfree(svm);
> > kfree(sdev);
> > goto out;
> > @@ -655,7 +655,7 @@ static int intel_svm_unbind_mm(struct device
> > *dev, int pasid) kfree_rcu(sdev, rcu);
> >
> > if (list_empty(&svm->devs)) {
> > - ioasid_free(svm->pasid);
> > + ioasid_free(host_pasid_set,
> > svm->pasid); if (svm->mm)
> > mmu_notifier_unregister(&svm->notifier,
> > svm->mm); list_del(&svm->list);
> > diff --git a/drivers/iommu/ioasid.c b/drivers/iommu/ioasid.c
> > index 5f63af07acd5..f73b3dbfc37a 100644
> > --- a/drivers/iommu/ioasid.c
> > +++ b/drivers/iommu/ioasid.c
> > @@ -1,22 +1,58 @@
> > // SPDX-License-Identifier: GPL-2.0
> > /*
> > * I/O Address Space ID allocator. There is one global IOASID
> > space, split into
> > - * subsets. Users create a subset with DECLARE_IOASID_SET, then
> > allocate and
> > - * free IOASIDs with ioasid_alloc and ioasid_free.
> > + * subsets. Users create a subset with ioasid_alloc_set, then
> > allocate/free IDs
> > + * with ioasid_alloc and ioasid_free.
> > */
> > -#include <linux/ioasid.h>
> > #include <linux/module.h>
> > #include <linux/slab.h>
> > #include <linux/spinlock.h>
> > #include <linux/xarray.h>
> > +#include <linux/ioasid.h>
>
> Spurious change (best keep the includes in alphabetical order)
>
I changed order so that I don't have to include more headers in
ioasid.h, but it is better to keep the order and add headers to ioasid.h

> > +
> > +static DEFINE_XARRAY_ALLOC(ioasid_sets);
>
> I'd prefer keeping all static variables together
>
yes, same here :) will move.

> > +enum ioasid_state {
> > + IOASID_STATE_INACTIVE,
> > + IOASID_STATE_ACTIVE,
> > + IOASID_STATE_FREE_PENDING,
> > +};
> >
> > +/**
> > + * struct ioasid_data - Meta data about ioasid
> > + *
> > + * @id: Unique ID
> > + * @users Number of active users
> > + * @state Track state of the IOASID
> > + * @set Meta data of the set this IOASID belongs to
> > + * @private Private data associated with the IOASID
> > + * @rcu For free after RCU grace period
>
> nit: it would be nicer to follow the struct order
>
yes, will do.

> > + */
> > struct ioasid_data {
> > ioasid_t id;
> > struct ioasid_set *set;
> > + refcount_t users;
> > + enum ioasid_state state;
> > void *private;
> > struct rcu_head rcu;
> > };
> >
> > +/* Default to PCIe standard 20 bit PASID */
> > +#define PCI_PASID_MAX 0x100000
> > +static ioasid_t ioasid_capacity = PCI_PASID_MAX;
> > +static ioasid_t ioasid_capacity_avail = PCI_PASID_MAX;
> > +
> > +void ioasid_install_capacity(ioasid_t total)
> > +{
> > + ioasid_capacity = ioasid_capacity_avail = total;
> > +}
> > +EXPORT_SYMBOL_GPL(ioasid_install_capacity);
> > +
> > +ioasid_t ioasid_get_capacity()
> > +{
> > + return ioasid_capacity;
> > +}
> > +EXPORT_SYMBOL_GPL(ioasid_get_capacity);
> > +
> > /*
> > * struct ioasid_allocator_data - Internal data structure to hold
> > information
> > * about an allocator. There are two types of allocators:
> > @@ -306,11 +342,23 @@ ioasid_t ioasid_alloc(struct ioasid_set *set,
> > ioasid_t min, ioasid_t max, {
> > struct ioasid_data *data;
> > void *adata;
> > - ioasid_t id;
> > + ioasid_t id = INVALID_IOASID;
> > +
> > + spin_lock(&ioasid_allocator_lock);
> > + /* Check if the IOASID set has been allocated and
> > initialized */
> > + if (WARN_ON(xa_load(&ioasid_sets, set->sid) != set)) {
> > + pr_warn("Invalid set\n");
>
> WARN_ON() is sufficient
>
right, will do

> > + goto done_unlock;
> > + }
> > +
> > + if (set->quota <= set->nr_ioasids) {
> > + pr_err("IOASID set %d out of quota %d\n",
> > set->sid, set->quota);
>
> As this can be called directly by userspace via VFIO, I wonder if we
> should remove non-bug error messages like this one to avoid leaking
> internal IDs, or at least rate-limit them. We already have a few,
> perhaps we should deal with them before the VFIO_IOMMU_ALLOC_PASID
> patches land?
>
Good point, will hide internal IDs and add _ratelimited for VFIO
possible call path.

> > + goto done_unlock;
> > + }
> >
> > data = kzalloc(sizeof(*data), GFP_ATOMIC);
> > if (!data)
> > - return INVALID_IOASID;
> > + goto done_unlock;
> >
> > data->set = set;
> > data->private = private;
> > @@ -319,7 +367,6 @@ ioasid_t ioasid_alloc(struct ioasid_set *set,
> > ioasid_t min, ioasid_t max,
> > * Custom allocator needs allocator data to perform
> > platform specific
> > * operations.
> > */
> > - spin_lock(&ioasid_allocator_lock);
> > adata = active_allocator->flags &
> > IOASID_ALLOCATOR_CUSTOM ? active_allocator->ops->pdata : data; id =
> > active_allocator->ops->alloc(min, max, adata); if (id ==
> > INVALID_IOASID) { @@ -335,42 +382,339 @@ ioasid_t
> > ioasid_alloc(struct ioasid_set *set, ioasid_t min, ioasid_t max,
> > goto exit_free; }
> > data->id = id;
> > + data->state = IOASID_STATE_ACTIVE;
> > + refcount_set(&data->users, 1);
> > +
> > + /* Store IOASID in the per set data */
> > + if (xa_err(xa_store(&set->xa, id, data, GFP_ATOMIC))) {
> > + pr_err("Failed to ioasid %d in set %d\n", id,
> > set->sid);
>
> "Failed to store"
will fix

> Don't we need to call active_allocator->ops->free()?
>
yes, also erase if custom allocator is used. How about:
active_allocator->ops->free(id, active_allocator->ops->pdata);
if (active_allocator->flags & IOASID_ALLOCATOR_CUSTOM)
xa_erase(&active_allocator->xa, id);

> And I need to think about this more, but do you see any issue with
> revoking here the data that we published into the xarray above through
> alloc()? We might need to free data in an RCU callback.
>
I don't see we need to do kfree_rcu since there shouldn't be any
readers before the alloc function returns. Or I am missing something?
> > + goto exit_free;
> > + }
> > + set->nr_ioasids++;
> > + goto done_unlock;
> >
> > - spin_unlock(&ioasid_allocator_lock);
> > - return id;
> > exit_free:
> > - spin_unlock(&ioasid_allocator_lock);
> > kfree(data);
> > - return INVALID_IOASID;
> > +done_unlock:
> > + spin_unlock(&ioasid_allocator_lock);
> > + return id;
> > }
> > EXPORT_SYMBOL_GPL(ioasid_alloc);
> >
> > +static void ioasid_do_free(struct ioasid_data *data)
> > +{
> > + struct ioasid_data *ioasid_data;
> > + struct ioasid_set *sdata;
> > +
> > + active_allocator->ops->free(data->id,
> > active_allocator->ops->pdata);
> > + /* Custom allocator needs additional steps to free the xa
> > element */
> > + if (active_allocator->flags & IOASID_ALLOCATOR_CUSTOM) {
> > + ioasid_data = xa_erase(&active_allocator->xa,
> > data->id);
> > + kfree_rcu(ioasid_data, rcu);
> > + }
> > +
> > + sdata = xa_load(&ioasid_sets, data->set->sid);
> > + if (!sdata) {
> > + pr_err("No set %d for IOASID %d\n", data->set->sid,
> > + data->id);
> > + return;
>
> I don't think we're allowed to fail at this point. If we need more
> sanity-check on the parameters, it should be before we start removing
> from the active_allocator above. Otherwise this should be a WARN
>
Good point, I will move this earlier as a WARN.

> > + }
> > + xa_erase(&sdata->xa, data->id);
> > + sdata->nr_ioasids--;
>
> Would be nicer to perform the cleanup in the order opposite from
> ioasid_alloc()
>
Yes, will switch the order.

> > +}
> > +
> > +static void ioasid_free_locked(struct ioasid_set *set, ioasid_t
> > ioasid) +{
> > + struct ioasid_data *data;
> > +
> > + data = xa_load(&active_allocator->xa, ioasid);
> > + if (!data) {
> > + pr_err("Trying to free unknown IOASID %u\n",
> > ioasid);
> > + return;
> > + }
> > +
> > + if (data->set != set) {
> > + pr_warn("Cannot free IOASID %u due to set
> > ownership\n", ioasid);
> > + return;
> > + }
> > + data->state = IOASID_STATE_FREE_PENDING;
> > +
> > + if (!refcount_dec_and_test(&data->users))
> > + return;
> > +
> > + ioasid_do_free(data);
> > +}
> > +
> > /**
> > - * ioasid_free - Free an IOASID
> > - * @ioasid: the ID to remove
> > + * ioasid_free - Drop reference on an IOASID. Free if refcount
> > drops to 0,
> > + * including free from its set and system-wide list.
> > + * @set: The ioasid_set to check permission with. If not
> > NULL, IOASID
> > + * free will fail if the set does not match.
> > + * @ioasid: The IOASID to remove
> > */
> > -void ioasid_free(ioasid_t ioasid)
> > +void ioasid_free(struct ioasid_set *set, ioasid_t ioasid)
> > {
> > - struct ioasid_data *ioasid_data;
> > + spin_lock(&ioasid_allocator_lock);
> > + ioasid_free_locked(set, ioasid);
> > + spin_unlock(&ioasid_allocator_lock);
> > +}
> > +EXPORT_SYMBOL_GPL(ioasid_free);
> >
> > +/**
> > + * ioasid_alloc_set - Allocate a new IOASID set for a given token
> > + *
> > + * @token: Unique token of the IOASID set, cannot be NULL
> > + * @quota: Quota allowed in this set. Only for new set
> > creation
> > + * @flags: Special requirements
>
> There is no @flags, but @type is missing
>
right, will fix.

> > + *
> > + * IOASID can be limited system-wide resource that requires quota
> > management.
> > + * If caller does not wish to enforce quota, use
> > IOASID_SET_NO_QUOTA flag.
>
> The flag isn't introduced in this patch. How about passing @quota ==
> 0 in this case? For now I'm fine with leaving this as TODO and
> returning -EINVAL.
>
I will remove the flag for now. We can add @quota == 0 case later.

> > + *
> > + * Token will be stored in the ioasid_set returned. A reference
> > will be taken
> > + * upon finding a matching set or newly created set.
> > + * IOASID allocation within the set and other per set operations
> > will use
> > + * the retured ioasid_set *.
> > + *
> > + */
> > +struct ioasid_set *ioasid_alloc_set(void *token, ioasid_t quota,
> > int type) +{
> > + struct ioasid_set *sdata;
> > + unsigned long index;
> > + ioasid_t id;
> > +
> > + if (type >= IOASID_SET_TYPE_NR)
> > + return ERR_PTR(-EINVAL);
> > +
> > + /*
> > + * Need to check space available if we share system-wide
> > quota.
> > + * TODO: we may need to support quota free sets in the
> > future.
> > + */
> > spin_lock(&ioasid_allocator_lock);
> > - ioasid_data = xa_load(&active_allocator->xa, ioasid);
> > - if (!ioasid_data) {
> > - pr_err("Trying to free unknown IOASID %u\n",
> > ioasid);
> > + if (quota > ioasid_capacity_avail) {
> > + pr_warn("Out of IOASID capacity! ask %d, avail
> > %d\n",
> > + quota, ioasid_capacity_avail);
> > + sdata = ERR_PTR(-ENOSPC);
> > goto exit_unlock;
> > }
> >
> > - active_allocator->ops->free(ioasid,
> > active_allocator->ops->pdata);
> > - /* Custom allocator needs additional steps to free the xa
> > element */
> > - if (active_allocator->flags & IOASID_ALLOCATOR_CUSTOM) {
> > - ioasid_data = xa_erase(&active_allocator->xa,
> > ioasid);
> > - kfree_rcu(ioasid_data, rcu);
> > + /*
> > + * Token is only unique within its types but right now we
> > have only
> > + * mm type. If we have more token types, we have to match
> > type as well.
> > + */
> > + switch (type) {
> > + case IOASID_SET_TYPE_MM:
> > + /* Search existing set tokens, reject duplicates */
> > + xa_for_each(&ioasid_sets, index, sdata) {
> > + if (sdata->token == token &&
> > + sdata->type == IOASID_SET_TYPE_MM)
> > {
>
> Should be aligned at the "if ("
>
> According to the function doc, shouldn't we take a reference to the
> set in this case, and return it?
> "A reference will be taken upon finding a matching set or newly
> created set."
>
> However it might be better to separate the two functionalities into
> ioasid_alloc_set() and ioasid_find_set(). Because two modules can
> want to work on the same set for an mm, but they won't pass the same
> @quota, right? So it'd make more sense for one of them (VFIO) to
> alloc the set and the other to reuse it.
I went back and forth on this but yes, it is better to keep them
separate. So far the only use case for find is
ioasid_find_mm_set()
Will take reference during set allocation.

>
> > + sdata = ERR_PTR(-EEXIST);
> > + goto exit_unlock;
> > + }
> > + }
> > + break;
> > + case IOASID_SET_TYPE_NULL:
> > + if (!token)
> > + break;
> > + fallthrough;
> > + default:
> > + pr_err("Invalid token and IOASID type\n");
> > + sdata = ERR_PTR(-EINVAL);
> > + goto exit_unlock;
> > }
> >
> > + /* REVISIT: may support set w/o quota, use system
> > available */
> > + if (!quota) {
>
> Maybe move this next to the other quota check above
>
yes, that is better. no need to be under the spinlock either.

> > + sdata = ERR_PTR(-EINVAL);
> > + goto exit_unlock;
> > + }
> > +
> > + sdata = kzalloc(sizeof(*sdata), GFP_ATOMIC);
> > + if (!sdata) {
> > + sdata = ERR_PTR(-ENOMEM);
> > + goto exit_unlock;
> > + }
> > +
> > + if (xa_alloc(&ioasid_sets, &id, sdata,
> > + XA_LIMIT(0, ioasid_capacity_avail -
> > quota),
>
> Why this limit? sid could just be an arbitrary u32 (xa_limit_32b)
>
True, but I was thinking the number of sets cannot be more than the
number of available IOASIDs. XA allocator should be able to find empty
slots within this range. Performance is better if the IDs are densely
populated.

> > + GFP_ATOMIC)) {
> > + kfree(sdata);
> > + sdata = ERR_PTR(-ENOSPC);
> > + goto exit_unlock;
> > + }
> > +
> > + sdata->token = token;
> > + sdata->type = type;
> > + sdata->quota = quota;
> > + sdata->sid = id;
> > + refcount_set(&sdata->ref, 1);
> > +
> > + /*
> > + * Per set XA is used to store private IDs within the set,
> > get ready
> > + * for ioasid_set private ID and system-wide IOASID
> > allocation
> > + * results.
> > + */
> > + xa_init_flags(&sdata->xa, XA_FLAGS_ALLOC);
>
> Since it's only used for storage, you could use xa_init()
>
good point, will do.

> > + ioasid_capacity_avail -= quota;
> > +
> > exit_unlock:
> > spin_unlock(&ioasid_allocator_lock);
> > +
> > + return sdata;
> > }
> > -EXPORT_SYMBOL_GPL(ioasid_free);
> > +EXPORT_SYMBOL_GPL(ioasid_alloc_set);
> > +
> > +void ioasid_set_get_locked(struct ioasid_set *set)
> > +{
> > + if (WARN_ON(xa_load(&ioasid_sets, set->sid) != set)) {
> > + pr_warn("Invalid set data\n");
>
> WARN_ON() is sufficient
>
yes

> > + return;
> > + }
> > +
> > + refcount_inc(&set->ref);
> > +}
> > +EXPORT_SYMBOL_GPL(ioasid_set_get_locked);
>
> Why is this function public, is it for an iterator? Might want to
> add a lockdep_assert_held() annotation.
>
You are right, no need to be public now.

Earlier, we had a design for users to acquire ioasid_set reference in
the notifier handler (under lock). Since KVM may not know when or if the
ioasid_set for its VM is created, we wanted KVM to listen to ioasid_set
allocation event then register notifier block.
In the current design we have a notifier pending list so users such
as KVM does not need to be aware of ioasid_set.

> > +
> > +void ioasid_set_get(struct ioasid_set *set)
> > +{
> > + spin_lock(&ioasid_allocator_lock);
> > + ioasid_set_get_locked(set);
> > + spin_unlock(&ioasid_allocator_lock);
> > +}
> > +EXPORT_SYMBOL_GPL(ioasid_set_get);
> > +
> > +void ioasid_set_put_locked(struct ioasid_set *set)
> > +{
> > + struct ioasid_data *entry;
> > + unsigned long index;
> > +
> > + if (WARN_ON(xa_load(&ioasid_sets, set->sid) != set)) {
> > + pr_warn("Invalid set data\n");
>
> WARN_ON() is sufficient
>
got it

> > + return;
> > + }
> > +
> > + if (!refcount_dec_and_test(&set->ref)) {
> > + pr_debug("%s: IOASID set %d has %d users\n",
> > + __func__, set->sid,
> > refcount_read(&set->ref));
> > + return;
> > + }
> > +
> > + /* The set is already empty, we just destroy the set. */
> > + if (xa_empty(&set->xa))
> > + goto done_destroy;
> > +
> > + /*
> > + * Free all PASIDs from system-wide IOASID pool, all
> > subscribers gets
> > + * notified and do cleanup of their own.
> > + * Note that some references of the IOASIDs within the set
> > can still
> > + * be held after the free call. This is OK in that the
> > IOASIDs will be
> > + * marked inactive, the only operations can be done is
> > ioasid_put.
> > + * No need to track IOASID set states since there is no
> > reclaim phase.
> > + */
> > + xa_for_each(&set->xa, index, entry) {
> > + ioasid_free_locked(set, index);
> > + /* Free from per set private pool */
> > + xa_erase(&set->xa, index);
> > + }
> > +
> > +done_destroy:
> > + /* Return the quota back to system pool */
> > + ioasid_capacity_avail += set->quota;
> > + kfree_rcu(set, rcu);
> > +
> > + /*
> > + * Token got released right away after the ioasid_set is
> > freed.
> > + * If a new set is created immediately with the newly
> > released token,
> > + * it will not allocate the same IOASIDs unless they are
> > reclaimed.
> > + */
> > + xa_erase(&ioasid_sets, set->sid);
> > +}
> > +EXPORT_SYMBOL_GPL(ioasid_set_put_locked);
>
> Same comment as ioasid_set_get_locked
>
same, will remove :)

> > +
> > +/**
> > + * ioasid_set_put - Drop a reference to the IOASID set. Free all
> > IOASIDs within
> > + * the set if there are no more users.
> > + *
> > + * @set: The IOASID set ID to be freed
> > + *
> > + * If refcount drops to zero, all IOASIDs allocated within the set
> > will be
> > + * freed.
> > + */
> > +void ioasid_set_put(struct ioasid_set *set)
> > +{
> > + spin_lock(&ioasid_allocator_lock);
> > + ioasid_set_put_locked(set);
> > + spin_unlock(&ioasid_allocator_lock);
> > +}
> > +EXPORT_SYMBOL_GPL(ioasid_set_put);
> > +
> > +/**
> > + * ioasid_adjust_set - Adjust the quota of an IOASID set
> > + * @set: IOASID set to be assigned
> > + * @quota: Quota allowed in this set
> > + *
> > + * Return 0 on success. If the new quota is smaller than the
> > number of
> > + * IOASIDs already allocated, -EINVAL will be returned. No change
> > will be
> > + * made to the existing quota.
> > + */
> > +int ioasid_adjust_set(struct ioasid_set *set, int quota)
> > +{
> > + int ret = 0;
> > +
> > + spin_lock(&ioasid_allocator_lock);
> > + if (set->nr_ioasids > quota) {
> > + pr_err("New quota %d is smaller than outstanding
> > IOASIDs %d\n",
> > + quota, set->nr_ioasids);
> > + ret = -EINVAL;
> > + goto done_unlock;
> > + }
> > +
> > + if (quota >= ioasid_capacity_avail) {
> > + ret = -ENOSPC;
> > + goto done_unlock;
> > + }
> > +
> > + /* Return the delta back to system pool */
> > + ioasid_capacity_avail += set->quota - quota;
>
> I think this is correct as long as the above check is fixed (as
> pointed out by Baolu). A check that quota >= 0 could be nice too.
>
yes, will fix per Baolu.
Will add a check for quota > 0 but 0 is not supported yet.

> > +
> > + /*
> > + * May have a policy to prevent giving all available
> > IOASIDs
> > + * to one set. But we don't enforce here, it should be in
> > the
> > + * upper layers.
> > + */
> > + set->quota = quota;
> > +
> > +done_unlock:
> > + spin_unlock(&ioasid_allocator_lock);
> > +
> > + return ret;
> > +}
> > +EXPORT_SYMBOL_GPL(ioasid_adjust_set);
> > +
> > +/**
> > + * ioasid_set_for_each_ioasid - Iterate over all the IOASIDs
> > within the set
> > + *
> > + * Caller must hold a reference of the set and handles its own
> > locking.
> > + */
> > +int ioasid_set_for_each_ioasid(struct ioasid_set *set,
> > + void (*fn)(ioasid_t id, void *data),
> > + void *data)
> > +{
> > + struct ioasid_data *entry;
> > + unsigned long index;
> > + int ret = 0;
> > +
> > + if (xa_empty(&set->xa)) {
>
> Who calls this function?
This is called by Yi's VFIO PASID code. Not sure when he will send out
but should be soon. The use case is when a guest terminates, VFIO can
call this function to free all PASIDs for this guest.

> Empty xa may be a normal use-case if the
> caller just uses it for sweeping, so pr_warn() could be problematic.
> The returned value also isn't particularly accurate if concurrent
> ioasid_alloc/free are allowed, so I'd drop this.
>
Right, xa could be empty if the guest never allocated PASIDS. Will drop
the check and make the function void.

> > + pr_warn("No IOASIDs in the set %d\n", set->sid);
> > + return -ENOENT;
> > + }
> > +
> > + xa_for_each(&set->xa, index, entry) {
> > + fn(index, data);
> > + }
> > +
> > + return ret;
> > +}
> > +EXPORT_SYMBOL_GPL(ioasid_set_for_each_ioasid);
> >
> > /**
> > * ioasid_find - Find IOASID data
> > diff --git a/include/linux/ioasid.h b/include/linux/ioasid.h
> > index 9c44947a68c8..412d025d440e 100644
> > --- a/include/linux/ioasid.h
> > +++ b/include/linux/ioasid.h
> > @@ -10,8 +10,35 @@ typedef unsigned int ioasid_t;
> > typedef ioasid_t (*ioasid_alloc_fn_t)(ioasid_t min, ioasid_t max,
> > void *data); typedef void (*ioasid_free_fn_t)(ioasid_t ioasid, void
> > *data);
> > +/* IOASID set types */
> > +enum ioasid_set_type {
> > + IOASID_SET_TYPE_NULL = 1, /* Set token is NULL */
> > + IOASID_SET_TYPE_MM, /* Set token is a mm_struct,
> > + * i.e. associated with a process
> > + */
> > + IOASID_SET_TYPE_NR,
> > +};
> > +
> > +/**
> > + * struct ioasid_set - Meta data about ioasid_set
> > + * @type: Token types and other features
>
> nit: doesn't follow struct order
>
will fix.

> > + * @token: Unique to identify an IOASID set
> > + * @xa: XArray to store ioasid_set private IDs, can
> > be used for
> > + * guest-host IOASID mapping, or just a private
> > IOASID namespace.
> > + * @quota: Max number of IOASIDs can be allocated within
> > the set
> > + * @nr_ioasids Number of IOASIDs currently allocated in the
> > set
> > + * @sid: ID of the set
> > + * @ref: Reference count of the users
> > + */
> > struct ioasid_set {
> > - int dummy;
> > + void *token;
> > + struct xarray xa;
> > + int type;
> > + int quota;
> > + int nr_ioasids;
> > + int sid;
> > + refcount_t ref;
> > + struct rcu_head rcu;
> > };
> >
> > /**
> > @@ -29,31 +56,64 @@ struct ioasid_allocator_ops {
> > void *pdata;
> > };
> >
> > -#define DECLARE_IOASID_SET(name) struct ioasid_set name = { 0 }
> > -
> > #if IS_ENABLED(CONFIG_IOASID)
> > +void ioasid_install_capacity(ioasid_t total);
> > +ioasid_t ioasid_get_capacity(void);
> > +struct ioasid_set *ioasid_alloc_set(void *token, ioasid_t quota,
> > int type); +int ioasid_adjust_set(struct ioasid_set *set, int
> > quota); +void ioasid_set_get_locked(struct ioasid_set *set);
> > +void ioasid_set_put_locked(struct ioasid_set *set);
> > +void ioasid_set_put(struct ioasid_set *set);
>
> These three functions need a stub for !CONFIG_IOASID
>
will do.

> > +
> > ioasid_t ioasid_alloc(struct ioasid_set *set, ioasid_t min,
> > ioasid_t max, void *private);
> > -void ioasid_free(ioasid_t ioasid);
> > -void *ioasid_find(struct ioasid_set *set, ioasid_t ioasid,
> > - bool (*getter)(void *));
> > +void ioasid_free(struct ioasid_set *set, ioasid_t ioasid);
> > +
> > +bool ioasid_is_active(ioasid_t ioasid);
>
> Not implemented by this series?
>
should be removed. Had a use earlier but not anymore.

> > +
> > +void *ioasid_find(struct ioasid_set *set, ioasid_t ioasid, bool
> > (*getter)(void *));
>
> Spurious change
>
> > +int ioasid_attach_data(ioasid_t ioasid, void *data);
> > int ioasid_register_allocator(struct ioasid_allocator_ops
> > *allocator); void ioasid_unregister_allocator(struct
> > ioasid_allocator_ops *allocator); -int ioasid_attach_data(ioasid_t
> > ioasid, void *data);
>
> Spurious change?
yes, will remove.

> > -
> > +void ioasid_is_in_set(struct ioasid_set *set, ioasid_t ioasid);
>
> Not implemented here
>
will remove

> > +int ioasid_set_for_each_ioasid(struct ioasid_set *sdata,
> > + void (*fn)(ioasid_t id, void *data),
> > + void *data);
>
> Needs a stub for !CONFIG_IOASID
>
got it

> > #else /* !CONFIG_IOASID */
> > +static inline void ioasid_install_capacity(ioasid_t total)
> > +{
> > +}
> > +
> > +static inline ioasid_t ioasid_get_capacity(void)
> > +{
> > + return 0;
> > +}
> > +
> > static inline ioasid_t ioasid_alloc(struct ioasid_set *set,
> > ioasid_t min, ioasid_t max, void *private)
> > {
> > return INVALID_IOASID;
> > }
> >
> > -static inline void ioasid_free(ioasid_t ioasid)
> > +static inline void ioasid_free(struct ioasid_set *set, ioasid_t
> > ioasid) +{
> > +}
> > +
> > +static inline bool ioasid_is_active(ioasid_t ioasid)
> > +{
> > + return false;
> > +}
> > +
> > +static inline struct ioasid_set *ioasid_alloc_set(void *token,
> > ioasid_t quota, int type) +{
> > + return ERR_PTR(-ENOTSUPP);
> > +}
> > +
> > +static inline void ioasid_set_put(struct ioasid_set *set)
> > {
> > }
> >
> > -static inline void *ioasid_find(struct ioasid_set *set, ioasid_t
> > ioasid,
> > - bool (*getter)(void *))
> > +static inline void *ioasid_find(struct ioasid_set *set, ioasid_t
> > ioasid, bool (*getter)(void *))
>
> Spurious change
>
got it

> Thanks,
> Jean
>
> > {
> > return NULL;
> > }
> > --
> > 2.7.4
> >
> _______________________________________________
> iommu mailing list
> [email protected]
> https://lists.linuxfoundation.org/mailman/listinfo/iommu

[Jacob Pan]

2020-09-03 21:02:24

by Jacob Pan

[permalink] [raw]
Subject: Re: [PATCH v2 3/9] iommu/ioasid: Introduce ioasid_set APIs

On Tue, 1 Sep 2020 13:51:26 +0200
Auger Eric <[email protected]> wrote:

> Hi Jacob,
>
> On 8/22/20 6:35 AM, Jacob Pan wrote:
> > ioasid_set was introduced as an arbitrary token that are shared by
> > a
> that is
got it

> > group of IOASIDs. For example, if IOASID #1 and #2 are allocated
> > via the same ioasid_set*, they are viewed as to belong to the same
> > set.
> two IOASIDs allocated via the same ioasid_set pointer belong to the
> same set?
> >
yes, better.

> > For guest SVA usages, system-wide IOASID resources need to be
> > partitioned such that VMs can have its own quota and being managed
> their own
right,

> > separately. ioasid_set is the perfect candidate for meeting such
> > requirements. This patch redefines and extends ioasid_set with the
> > following new fields:
> > - Quota
> > - Reference count
> > - Storage of its namespace
> > - The token is stored in the new ioasid_set but with optional types
> >
> > ioasid_set level APIs are introduced that wires up these new data.
> that wire
right

> > Existing users of IOASID APIs are converted where a host IOASID set
> > is allocated for bare-metal usage.
> >
> > Signed-off-by: Liu Yi L <[email protected]>
> > Signed-off-by: Jacob Pan <[email protected]>
> > ---
> > drivers/iommu/intel/iommu.c | 27 ++-
> > drivers/iommu/intel/pasid.h | 1 +
> > drivers/iommu/intel/svm.c | 8 +-
> > drivers/iommu/ioasid.c | 390
> > +++++++++++++++++++++++++++++++++++++++++---
> > include/linux/ioasid.h | 82 ++++++++-- 5 files changed, 465
> > insertions(+), 43 deletions(-)
> >
> > diff --git a/drivers/iommu/intel/iommu.c
> > b/drivers/iommu/intel/iommu.c index a3a0b5c8921d..5813eeaa5edb
> > 100644 --- a/drivers/iommu/intel/iommu.c
> > +++ b/drivers/iommu/intel/iommu.c
> > @@ -42,6 +42,7 @@
> > #include <linux/crash_dump.h>
> > #include <linux/numa.h>
> > #include <linux/swiotlb.h>
> > +#include <linux/ioasid.h>
> > #include <asm/irq_remapping.h>
> > #include <asm/cacheflush.h>
> > #include <asm/iommu.h>
> > @@ -103,6 +104,9 @@
> > */
> > #define INTEL_IOMMU_PGSIZES (~0xFFFUL)
> >
> > +/* PASIDs used by host SVM */
> > +struct ioasid_set *host_pasid_set;
> > +
> > static inline int agaw_to_level(int agaw)
> > {
> > return agaw + 2;
> > @@ -3103,8 +3107,8 @@ static void intel_vcmd_ioasid_free(ioasid_t
> > ioasid, void *data)
> > * Sanity check the ioasid owner is done at upper layer,
> > e.g. VFIO
> > * We can only free the PASID when all the devices are
> > unbound. */
> > - if (ioasid_find(NULL, ioasid, NULL)) {
> > - pr_alert("Cannot free active IOASID %d\n", ioasid);
> > + if (IS_ERR(ioasid_find(host_pasid_set, ioasid, NULL))) {
> > + pr_err("Cannot free IOASID %d, not in system
> > set\n", ioasid);
> not sure the change in the trace is worth. Also you may be more
> explicit like IOASID %d to be freed cannot be found in the system
> ioasid set.
Yes, better. will do.

> shouldn't it be rate_limited as it is originated from
> user space?
virtual command is only used in the guest kernel, not from userspace
though. But I should add ratelimited to all user originated calls.

> > return;
> > }
> > vcmd_free_pasid(iommu, ioasid);
> > @@ -3288,6 +3292,19 @@ static int __init init_dmars(void)
> > if (ret)
> > goto free_iommu;
> >
> > + /* PASID is needed for scalable mode irrespective to SVM */
> > + if (intel_iommu_sm) {
> > + ioasid_install_capacity(intel_pasid_max_id);
> > + /* We should not run out of IOASIDs at boot */
> > + host_pasid_set = ioasid_alloc_set(NULL,
> > PID_MAX_DEFAULT,
> s/PID_MAX_DEFAULT/intel_pasid_max_id?
Not really, when both baremetal and guest SVA are used on the same
system, we want to to limit the baremetal SVM PASID to the number of
host processes. host_pasid_set is for baremetal only.

intel_pasid_max_id would take up the entire PASID resource and leave no
PASIDs for guest usages.

> > +
> > IOASID_SET_TYPE_NULL);
> as suggested by jean-Philippe ioasid_set_alloc
> > + if (IS_ERR_OR_NULL(host_pasid_set)) {
> > + pr_err("Failed to enable host PASID
> > allocator %lu\n",
> > + PTR_ERR(host_pasid_set));
> does not sound like the correct error message? failed to allocate the
> system ioasid_set?
right

> > + intel_iommu_sm = 0;
> > + }
> > + }
> > +
> > /*
> > * for each drhd
> > * enable fault log
> > @@ -5149,7 +5166,7 @@ static void auxiliary_unlink_device(struct
> > dmar_domain *domain, domain->auxd_refcnt--;
> >
> > if (!domain->auxd_refcnt && domain->default_pasid > 0)
> > - ioasid_free(domain->default_pasid);
> > + ioasid_free(host_pasid_set, domain->default_pasid);
> > }
> >
> > static int aux_domain_add_dev(struct dmar_domain *domain,
> > @@ -5167,7 +5184,7 @@ static int aux_domain_add_dev(struct
> > dmar_domain *domain, int pasid;
> >
> > /* No private data needed for the default pasid */
> > - pasid = ioasid_alloc(NULL, PASID_MIN,
> > + pasid = ioasid_alloc(host_pasid_set, PASID_MIN,
> > pci_max_pasids(to_pci_dev(dev))
> > - 1, NULL);
> don't you want to ioasid_set_put() the ioasid_set in
> intel_iommu_free_dmars()?
yes, good catch.

> > if (pasid == INVALID_IOASID) {
> > @@ -5210,7 +5227,7 @@ static int aux_domain_add_dev(struct
> > dmar_domain *domain, spin_unlock(&iommu->lock);
> > spin_unlock_irqrestore(&device_domain_lock, flags);
> > if (!domain->auxd_refcnt && domain->default_pasid > 0)
> > - ioasid_free(domain->default_pasid);
> > + ioasid_free(host_pasid_set, domain->default_pasid);
> >
> > return ret;
> > }
> > diff --git a/drivers/iommu/intel/pasid.h
> > b/drivers/iommu/intel/pasid.h index c9850766c3a9..ccdc23446015
> > 100644 --- a/drivers/iommu/intel/pasid.h
> > +++ b/drivers/iommu/intel/pasid.h
> > @@ -99,6 +99,7 @@ static inline bool pasid_pte_is_present(struct
> > pasid_entry *pte) }
> >
> > extern u32 intel_pasid_max_id;
> > +extern struct ioasid_set *host_pasid_set;
> > int intel_pasid_alloc_id(void *ptr, int start, int end, gfp_t gfp);
> > void intel_pasid_free_id(int pasid);
> > void *intel_pasid_lookup_id(int pasid);
> > diff --git a/drivers/iommu/intel/svm.c b/drivers/iommu/intel/svm.c
> > index 37a9beabc0ca..634e191ca2c3 100644
> > --- a/drivers/iommu/intel/svm.c
> > +++ b/drivers/iommu/intel/svm.c
> > @@ -551,7 +551,7 @@ intel_svm_bind_mm(struct device *dev, int
> > flags, struct svm_dev_ops *ops, pasid_max = intel_pasid_max_id;
> >
> > /* Do not use PASID 0, reserved for RID to PASID */
> > - svm->pasid = ioasid_alloc(NULL, PASID_MIN,
> > + svm->pasid = ioasid_alloc(host_pasid_set,
> > PASID_MIN, pasid_max - 1, svm);
> > if (svm->pasid == INVALID_IOASID) {
> > kfree(svm);
> > @@ -568,7 +568,7 @@ intel_svm_bind_mm(struct device *dev, int
> > flags, struct svm_dev_ops *ops, if (mm) {
> > ret =
> > mmu_notifier_register(&svm->notifier, mm); if (ret) {
> > - ioasid_free(svm->pasid);
> > + ioasid_free(host_pasid_set,
> > svm->pasid); kfree(svm);
> > kfree(sdev);
> > goto out;
> > @@ -586,7 +586,7 @@ intel_svm_bind_mm(struct device *dev, int
> > flags, struct svm_dev_ops *ops, if (ret) {
> > if (mm)
> > mmu_notifier_unregister(&svm->notifier,
> > mm);
> > - ioasid_free(svm->pasid);
> > + ioasid_free(host_pasid_set, svm->pasid);
> > kfree(svm);
> > kfree(sdev);
> > goto out;
> > @@ -655,7 +655,7 @@ static int intel_svm_unbind_mm(struct device
> > *dev, int pasid) kfree_rcu(sdev, rcu);
> >
> > if (list_empty(&svm->devs)) {
> > - ioasid_free(svm->pasid);
> > + ioasid_free(host_pasid_set,
> > svm->pasid); if (svm->mm)
> > mmu_notifier_unregister(&svm->notifier,
> > svm->mm); list_del(&svm->list);
> > diff --git a/drivers/iommu/ioasid.c b/drivers/iommu/ioasid.c
> > index 5f63af07acd5..f73b3dbfc37a 100644
> > --- a/drivers/iommu/ioasid.c
> > +++ b/drivers/iommu/ioasid.c
> > @@ -1,22 +1,58 @@
> > // SPDX-License-Identifier: GPL-2.0
> > /*
> > * I/O Address Space ID allocator. There is one global IOASID
> > space, split into
> > - * subsets. Users create a subset with DECLARE_IOASID_SET, then
> > allocate and
> I would try to avoid using new terms: s/subset_ioset_set
Right, I initially used term subset ID for set private ID then realized
SSID meant something else in ARM SMMU. :)

> > - * free IOASIDs with ioasid_alloc and ioasid_free.
> > + * subsets. Users create a subset with ioasid_alloc_set, then
> > allocate/free IDs
> here also and ioasid_set_alloc
ditto.

> > + * with ioasid_alloc and ioasid_free.
> > */
> > -#include <linux/ioasid.h>
> > #include <linux/module.h>
> > #include <linux/slab.h>
> > #include <linux/spinlock.h>
> > #include <linux/xarray.h>
> > +#include <linux/ioasid.h>
> > +
> > +static DEFINE_XARRAY_ALLOC(ioasid_sets);
> > +enum ioasid_state {
> > + IOASID_STATE_INACTIVE,
> > + IOASID_STATE_ACTIVE,
> > + IOASID_STATE_FREE_PENDING,
> > +};
> >
> > +/**
> > + * struct ioasid_data - Meta data about ioasid
> > + *
> > + * @id: Unique ID
> > + * @users Number of active users
> > + * @state Track state of the IOASID
> > + * @set Meta data of the set this IOASID belongs
> > to
> s/Meta data of the set this IOASID belongs to/ioasid_set the asid
> belongs to
make sense.

> > + * @private Private data associated with the IOASID
> I would have expected to find the private asid somewhere
> > + * @rcu For free after RCU grace period
> > + */
> > struct ioasid_data {
> > ioasid_t id;
> > struct ioasid_set *set;
> > + refcount_t users;
> > + enum ioasid_state state;
> > void *private;
> > struct rcu_head rcu;
> > };
> >
> > +/* Default to PCIe standard 20 bit PASID */
> > +#define PCI_PASID_MAX 0x100000
> > +static ioasid_t ioasid_capacity = PCI_PASID_MAX;
> > +static ioasid_t ioasid_capacity_avail = PCI_PASID_MAX;
> > +
> > +void ioasid_install_capacity(ioasid_t total)
> > +{
> > + ioasid_capacity = ioasid_capacity_avail = total;
> > +}
> > +EXPORT_SYMBOL_GPL(ioasid_install_capacity);
> > +
> > +ioasid_t ioasid_get_capacity()
> > +{
> > + return ioasid_capacity;
> > +}
> > +EXPORT_SYMBOL_GPL(ioasid_get_capacity);
> > +
> > /*
> > * struct ioasid_allocator_data - Internal data structure to hold
> > information
> > * about an allocator. There are two types of allocators:
> > @@ -306,11 +342,23 @@ ioasid_t ioasid_alloc(struct ioasid_set *set,
> > ioasid_t min, ioasid_t max, {
> > struct ioasid_data *data;
> > void *adata;
> > - ioasid_t id;
> > + ioasid_t id = INVALID_IOASID;
> > +
> > + spin_lock(&ioasid_allocator_lock);
> > + /* Check if the IOASID set has been allocated and
> > initialized */
> > + if (WARN_ON(xa_load(&ioasid_sets, set->sid) != set)) {
> > + pr_warn("Invalid set\n");
> > + goto done_unlock;
> > + }
> > +
> > + if (set->quota <= set->nr_ioasids) {
> > + pr_err("IOASID set %d out of quota %d\n",
> > set->sid, set->quota);
> > + goto done_unlock;
> > + }
> >
> > data = kzalloc(sizeof(*data), GFP_ATOMIC);
> > if (!data)
> > - return INVALID_IOASID;
> > + goto done_unlock;
> >
> > data->set = set;
> > data->private = private;
> > @@ -319,7 +367,6 @@ ioasid_t ioasid_alloc(struct ioasid_set *set,
> > ioasid_t min, ioasid_t max,
> > * Custom allocator needs allocator data to perform
> > platform specific
> > * operations.
> > */
> > - spin_lock(&ioasid_allocator_lock);
> > adata = active_allocator->flags &
> > IOASID_ALLOCATOR_CUSTOM ? active_allocator->ops->pdata : data; id =
> > active_allocator->ops->alloc(min, max, adata); if (id ==
> > INVALID_IOASID) { @@ -335,42 +382,339 @@ ioasid_t
> > ioasid_alloc(struct ioasid_set *set, ioasid_t min, ioasid_t max,
> > goto exit_free; }
> > data->id = id;
> > + data->state = IOASID_STATE_ACTIVE;
> > + refcount_set(&data->users, 1);
> > +
> > + /* Store IOASID in the per set data */
> > + if (xa_err(xa_store(&set->xa, id, data, GFP_ATOMIC))) {
> > + pr_err("Failed to ioasid %d in set %d\n", id,
> > set->sid);
> > + goto exit_free;
> > + }
> > + set->nr_ioasids++;
> > + goto done_unlock;
> >
> > - spin_unlock(&ioasid_allocator_lock);
> > - return id;
> > exit_free:
> > - spin_unlock(&ioasid_allocator_lock);
> > kfree(data);
> > - return INVALID_IOASID;
> > +done_unlock:
> > + spin_unlock(&ioasid_allocator_lock);
> > + return id;
> > }
> > EXPORT_SYMBOL_GPL(ioasid_alloc);
> >
> > +static void ioasid_do_free(struct ioasid_data *data)
> do_free_locked?
sounds good, more accurate.

> > +{
> > + struct ioasid_data *ioasid_data;
> > + struct ioasid_set *sdata;
> > +
> > + active_allocator->ops->free(data->id,
> > active_allocator->ops->pdata);
> > + /* Custom allocator needs additional steps to free the xa
> > element */
> > + if (active_allocator->flags & IOASID_ALLOCATOR_CUSTOM) {
> > + ioasid_data = xa_erase(&active_allocator->xa,
> > data->id);
> > + kfree_rcu(ioasid_data, rcu);
> > + }
> > +
> > + sdata = xa_load(&ioasid_sets, data->set->sid);
> > + if (!sdata) {
> > + pr_err("No set %d for IOASID %d\n", data->set->sid,
> > + data->id);
> > + return;
> > + }
> > + xa_erase(&sdata->xa, data->id);
> > + sdata->nr_ioasids--;
> > +}
> > +
> > +static void ioasid_free_locked(struct ioasid_set *set, ioasid_t
> > ioasid) +{
> > + struct ioasid_data *data;
> > +
> > + data = xa_load(&active_allocator->xa, ioasid);
> > + if (!data) {
> > + pr_err("Trying to free unknown IOASID %u\n",
> > ioasid);
> > + return;
> > + }
> > +
> > + if (data->set != set) {
> > + pr_warn("Cannot free IOASID %u due to set
> > ownership\n", ioasid);
> > + return;
> > + }
> > + data->state = IOASID_STATE_FREE_PENDING;
> > +
> > + if (!refcount_dec_and_test(&data->users))
> > + return;
> > +
> > + ioasid_do_free(data);
> > +}
> > +
> > /**
> > - * ioasid_free - Free an IOASID
> > - * @ioasid: the ID to remove
> > + * ioasid_free - Drop reference on an IOASID. Free if refcount
> > drops to 0,
> > + * including free from its set and system-wide list.
> > + * @set: The ioasid_set to check permission with. If not
> > NULL, IOASID
> > + * free will fail if the set does not match.
> > + * @ioasid: The IOASID to remove
> > */
> > -void ioasid_free(ioasid_t ioasid)
> > +void ioasid_free(struct ioasid_set *set, ioasid_t ioasid)
> > {
> > - struct ioasid_data *ioasid_data;
> > + spin_lock(&ioasid_allocator_lock);
> > + ioasid_free_locked(set, ioasid);
> > + spin_unlock(&ioasid_allocator_lock);
> > +}
> > +EXPORT_SYMBOL_GPL(ioasid_free);
> >
> > +/**
> > + * ioasid_alloc_set - Allocate a new IOASID set for a given token
> > + *
> > + * @token: Unique token of the IOASID set, cannot be NULL
> > + * @quota: Quota allowed in this set. Only for new set
> > creation
> > + * @flags: Special requirements
> > + *
> > + * IOASID can be limited system-wide resource that requires quota
> > management.
> > + * If caller does not wish to enforce quota, use
> > IOASID_SET_NO_QUOTA flag.
> > + *
> > + * Token will be stored in the ioasid_set returned. A reference
> > will be taken
> > + * upon finding a matching set or newly created set.
> > + * IOASID allocation within the set and other per set operations
> > will use
> > + * the retured ioasid_set *.
> > + *
> > + */
> > +struct ioasid_set *ioasid_alloc_set(void *token, ioasid_t quota,
> > int type) +{
> > + struct ioasid_set *sdata;
> > + unsigned long index;
> > + ioasid_t id;
> > +
> > + if (type >= IOASID_SET_TYPE_NR)
> > + return ERR_PTR(-EINVAL);
> > +
> > + /*
> > + * Need to check space available if we share system-wide
> > quota.
> > + * TODO: we may need to support quota free sets in the
> > future.
> > + */
> > spin_lock(&ioasid_allocator_lock);
> > - ioasid_data = xa_load(&active_allocator->xa, ioasid);
> > - if (!ioasid_data) {
> > - pr_err("Trying to free unknown IOASID %u\n",
> > ioasid);
> > + if (quota > ioasid_capacity_avail) {
> > + pr_warn("Out of IOASID capacity! ask %d, avail
> > %d\n",
> > + quota, ioasid_capacity_avail);
> > + sdata = ERR_PTR(-ENOSPC);
> > goto exit_unlock;
> > }
> >
> > - active_allocator->ops->free(ioasid,
> > active_allocator->ops->pdata);
> > - /* Custom allocator needs additional steps to free the xa
> > element */
> > - if (active_allocator->flags & IOASID_ALLOCATOR_CUSTOM) {
> > - ioasid_data = xa_erase(&active_allocator->xa,
> > ioasid);
> > - kfree_rcu(ioasid_data, rcu);
> > + /*
> > + * Token is only unique within its types but right now we
> > have only
> > + * mm type. If we have more token types, we have to match
> > type as well.
> > + */
> > + switch (type) {
> > + case IOASID_SET_TYPE_MM:
> > + /* Search existing set tokens, reject duplicates */
> > + xa_for_each(&ioasid_sets, index, sdata) {
> > + if (sdata->token == token &&
> > + sdata->type == IOASID_SET_TYPE_MM)
> > {
> > + sdata = ERR_PTR(-EEXIST);
> > + goto exit_unlock;
> > + }
> > + }
> > + break;
> > + case IOASID_SET_TYPE_NULL:
> > + if (!token)
> > + break;
> > + fallthrough;
> > + default:
> > + pr_err("Invalid token and IOASID type\n");
> > + sdata = ERR_PTR(-EINVAL);
> > + goto exit_unlock;
> > }
> >
> > + /* REVISIT: may support set w/o quota, use system
> > available */
> > + if (!quota) {
> > + sdata = ERR_PTR(-EINVAL);
> > + goto exit_unlock;
> > + }
> > +
> > + sdata = kzalloc(sizeof(*sdata), GFP_ATOMIC);
> > + if (!sdata) {
> > + sdata = ERR_PTR(-ENOMEM);
> > + goto exit_unlock;
> > + }
> > +
> > + if (xa_alloc(&ioasid_sets, &id, sdata,
> > + XA_LIMIT(0, ioasid_capacity_avail - quota),
> > + GFP_ATOMIC)) {
> > + kfree(sdata);
> > + sdata = ERR_PTR(-ENOSPC);
> > + goto exit_unlock;
> > + }
> > +
> > + sdata->token = token;
> > + sdata->type = type;
> > + sdata->quota = quota;
> > + sdata->sid = id;
> > + refcount_set(&sdata->ref, 1);
> > +
> > + /*
> > + * Per set XA is used to store private IDs within the set,
> > get ready
> > + * for ioasid_set private ID and system-wide IOASID
> > allocation
> > + * results.
> > + */
> > + xa_init_flags(&sdata->xa, XA_FLAGS_ALLOC);
> > + ioasid_capacity_avail -= quota;
> > +
> > exit_unlock:
> > spin_unlock(&ioasid_allocator_lock);
> > +
> > + return sdata;
> > }
> > -EXPORT_SYMBOL_GPL(ioasid_free);
> > +EXPORT_SYMBOL_GPL(ioasid_alloc_set);
> > +
> > +void ioasid_set_get_locked(struct ioasid_set *set)
> > +{
> > + if (WARN_ON(xa_load(&ioasid_sets, set->sid) != set)) {
> > + pr_warn("Invalid set data\n");
> > + return;
> > + }
> > +
> > + refcount_inc(&set->ref);
> > +}
> > +EXPORT_SYMBOL_GPL(ioasid_set_get_locked);
> > +
> > +void ioasid_set_get(struct ioasid_set *set)
> > +{
> > + spin_lock(&ioasid_allocator_lock);
> > + ioasid_set_get_locked(set);
> > + spin_unlock(&ioasid_allocator_lock);
> > +}
> > +EXPORT_SYMBOL_GPL(ioasid_set_get);
> > +
> > +void ioasid_set_put_locked(struct ioasid_set *set)
> > +{
> > + struct ioasid_data *entry;
> > + unsigned long index;
> > +
> > + if (WARN_ON(xa_load(&ioasid_sets, set->sid) != set)) {
> > + pr_warn("Invalid set data\n");
> > + return;
> > + }
> > +
> > + if (!refcount_dec_and_test(&set->ref)) {
> > + pr_debug("%s: IOASID set %d has %d users\n",
> > + __func__, set->sid,
> > refcount_read(&set->ref));
> > + return;
> > + }
> > +
> > + /* The set is already empty, we just destroy the set. */
> > + if (xa_empty(&set->xa))
> > + goto done_destroy;
> > +
> > + /*
> > + * Free all PASIDs from system-wide IOASID pool, all
> > subscribers gets
> > + * notified and do cleanup of their own.
> > + * Note that some references of the IOASIDs within the set
> > can still
> > + * be held after the free call. This is OK in that the
> > IOASIDs will be
> > + * marked inactive, the only operations can be done is
> > ioasid_put.
> > + * No need to track IOASID set states since there is no
> > reclaim phase.
> > + */
> > + xa_for_each(&set->xa, index, entry) {
> > + ioasid_free_locked(set, index);
> > + /* Free from per set private pool */
> > + xa_erase(&set->xa, index);
> > + }
> > +
> > +done_destroy:
> > + /* Return the quota back to system pool */
> > + ioasid_capacity_avail += set->quota;
> > + kfree_rcu(set, rcu);
> > +
> > + /*
> > + * Token got released right away after the ioasid_set is
> > freed.
> > + * If a new set is created immediately with the newly
> > released token,
> > + * it will not allocate the same IOASIDs unless they are
> > reclaimed.
> > + */
> > + xa_erase(&ioasid_sets, set->sid);
> > +}
> > +EXPORT_SYMBOL_GPL(ioasid_set_put_locked);
> > +
> > +/**
> > + * ioasid_set_put - Drop a reference to the IOASID set. Free all
> > IOASIDs within
> > + * the set if there are no more users.
> > + *
> > + * @set: The IOASID set ID to be freed
> > + *
> > + * If refcount drops to zero, all IOASIDs allocated within the set
> > will be
> > + * freed.
> > + */
> > +void ioasid_set_put(struct ioasid_set *set)
> > +{
> > + spin_lock(&ioasid_allocator_lock);
> > + ioasid_set_put_locked(set);
> > + spin_unlock(&ioasid_allocator_lock);
> > +}
> > +EXPORT_SYMBOL_GPL(ioasid_set_put);
> > +
> > +/**
> > + * ioasid_adjust_set - Adjust the quota of an IOASID set
> > + * @set: IOASID set to be assigned
> > + * @quota: Quota allowed in this set
> > + *
> > + * Return 0 on success. If the new quota is smaller than the
> > number of
> > + * IOASIDs already allocated, -EINVAL will be returned. No change
> > will be
> > + * made to the existing quota.
> > + */
> > +int ioasid_adjust_set(struct ioasid_set *set, int quota)
> > +{
> > + int ret = 0;
> > +
> > + spin_lock(&ioasid_allocator_lock);
> > + if (set->nr_ioasids > quota) {
> > + pr_err("New quota %d is smaller than outstanding
> > IOASIDs %d\n",
> > + quota, set->nr_ioasids);
> > + ret = -EINVAL;
> > + goto done_unlock;
> > + }
> > +
> > + if (quota >= ioasid_capacity_avail) {
> > + ret = -ENOSPC;
> > + goto done_unlock;
> > + }
> > +
> > + /* Return the delta back to system pool */
> > + ioasid_capacity_avail += set->quota - quota;
> > +
> > + /*
> > + * May have a policy to prevent giving all available
> > IOASIDs
> > + * to one set. But we don't enforce here, it should be in
> > the
> > + * upper layers.
> > + */
> > + set->quota = quota;
> > +
> > +done_unlock:
> > + spin_unlock(&ioasid_allocator_lock);
> > +
> > + return ret;
> > +}
> > +EXPORT_SYMBOL_GPL(ioasid_adjust_set);
> > +
> > +/**
> > + * ioasid_set_for_each_ioasid - Iterate over all the IOASIDs
> > within the set
> > + *
> > + * Caller must hold a reference of the set and handles its own
> > locking.
> > + */
> > +int ioasid_set_for_each_ioasid(struct ioasid_set *set,
> > + void (*fn)(ioasid_t id, void *data),
> > + void *data)
> > +{
> > + struct ioasid_data *entry;
> > + unsigned long index;
> > + int ret = 0;
> > +
> > + if (xa_empty(&set->xa)) {
> > + pr_warn("No IOASIDs in the set %d\n", set->sid);
> > + return -ENOENT;
> > + }
> > +
> > + xa_for_each(&set->xa, index, entry) {
> > + fn(index, data);
> > + }
> > +
> > + return ret;
> > +}
> > +EXPORT_SYMBOL_GPL(ioasid_set_for_each_ioasid);
> >
> > /**
> > * ioasid_find - Find IOASID data
> > diff --git a/include/linux/ioasid.h b/include/linux/ioasid.h
> > index 9c44947a68c8..412d025d440e 100644
> > --- a/include/linux/ioasid.h
> > +++ b/include/linux/ioasid.h
> > @@ -10,8 +10,35 @@ typedef unsigned int ioasid_t;
> > typedef ioasid_t (*ioasid_alloc_fn_t)(ioasid_t min, ioasid_t max,
> > void *data); typedef void (*ioasid_free_fn_t)(ioasid_t ioasid, void
> > *data);
> > +/* IOASID set types */
> > +enum ioasid_set_type {
> > + IOASID_SET_TYPE_NULL = 1, /* Set token is NULL */
> > + IOASID_SET_TYPE_MM, /* Set token is a mm_struct,
> s/mm_struct/mm_struct pointer
got it

> > + * i.e. associated with a process
> > + */
> > + IOASID_SET_TYPE_NR,
> > +};
> > +
> > +/**
> > + * struct ioasid_set - Meta data about ioasid_set
> > + * @type: Token types and other features
> token type. Why "and other features"
will remove. initially wanted to have a flag

> > + * @token: Unique to identify an IOASID set
> > + * @xa: XArray to store ioasid_set private IDs, can
> > be used for
> > + * guest-host IOASID mapping, or just a private
> > IOASID namespace.
> > + * @quota: Max number of IOASIDs can be allocated within
> > the set
> > + * @nr_ioasids Number of IOASIDs currently allocated in the
> > set
> > + * @sid: ID of the set
> > + * @ref: Reference count of the users
> > + */
> > struct ioasid_set {
> > - int dummy;
> > + void *token;
> > + struct xarray xa;
> > + int type;
> > + int quota;
> > + int nr_ioasids;
> > + int sid;
> nit id? sid has a special meaning on ARM.
>
sounds good.

> > + refcount_t ref;
> > + struct rcu_head rcu;
> > };
> >
> > /**
> > @@ -29,31 +56,64 @@ struct ioasid_allocator_ops {
> > void *pdata;
> > };
> >
> > -#define DECLARE_IOASID_SET(name) struct ioasid_set name = { 0 }
> > -
> > #if IS_ENABLED(CONFIG_IOASID)
> > +void ioasid_install_capacity(ioasid_t total);
> > +ioasid_t ioasid_get_capacity(void);
> > +struct ioasid_set *ioasid_alloc_set(void *token, ioasid_t quota,
> > int type); +int ioasid_adjust_set(struct ioasid_set *set, int
> > quota);
> ioasid_set_adjust_quota
> > +void ioasid_set_get_locked(struct ioasid_set *set);
> as mentionned during the Plumber uConf, the set_get is unfortunate.
> Globally I wonder if we shouldn't rename "set" into "pool" or
> something alike.
I agree, how about "group"? I felt pool does not reflect enough of the
resource partitioning. Any better names? Jean?

> > +void ioasid_set_put_locked(struct ioasid_set *set);
> > +void ioasid_set_put(struct ioasid_set *set);
> > +
> > ioasid_t ioasid_alloc(struct ioasid_set *set, ioasid_t min,
> > ioasid_t max, void *private);
> > -void ioasid_free(ioasid_t ioasid);
> > -void *ioasid_find(struct ioasid_set *set, ioasid_t ioasid,
> > - bool (*getter)(void *));
> > +void ioasid_free(struct ioasid_set *set, ioasid_t ioasid);
> > +
> > +bool ioasid_is_active(ioasid_t ioasid);
> > +
> > +void *ioasid_find(struct ioasid_set *set, ioasid_t ioasid, bool
> > (*getter)(void *)); +int ioasid_attach_data(ioasid_t ioasid, void
> > *data); int ioasid_register_allocator(struct ioasid_allocator_ops
> > *allocator); void ioasid_unregister_allocator(struct
> > ioasid_allocator_ops *allocator); -int ioasid_attach_data(ioasid_t
> > ioasid, void *data); -
> > +void ioasid_is_in_set(struct ioasid_set *set, ioasid_t ioasid);
> > +int ioasid_set_for_each_ioasid(struct ioasid_set *sdata,
> > + void (*fn)(ioasid_t id, void *data),
> > + void *data);
> > #else /* !CONFIG_IOASID */
> > +static inline void ioasid_install_capacity(ioasid_t total)
> > +{
> > +}
> > +
> > +static inline ioasid_t ioasid_get_capacity(void)
> > +{
> > + return 0;
> > +}
> > +
> > static inline ioasid_t ioasid_alloc(struct ioasid_set *set,
> > ioasid_t min, ioasid_t max, void *private)
> > {
> > return INVALID_IOASID;
> > }
> >
> > -static inline void ioasid_free(ioasid_t ioasid)
> > +static inline void ioasid_free(struct ioasid_set *set, ioasid_t
> > ioasid) +{
> > +}
> > +
> > +static inline bool ioasid_is_active(ioasid_t ioasid)
> > +{
> > + return false;
> > +}
> > +
> > +static inline struct ioasid_set *ioasid_alloc_set(void *token,
> > ioasid_t quota, int type) +{
> > + return ERR_PTR(-ENOTSUPP);
> > +}
> > +
> > +static inline void ioasid_set_put(struct ioasid_set *set)
> > {
> > }
> >
> > -static inline void *ioasid_find(struct ioasid_set *set, ioasid_t
> > ioasid,
> > - bool (*getter)(void *))
> > +static inline void *ioasid_find(struct ioasid_set *set, ioasid_t
> > ioasid, bool (*getter)(void *)) {
> > return NULL;
> > }
> >
> I felt very difficult to review this patch. Could you split it into
> several ones? maybe introduce the a dummy host_pasid_set and update
> the call sites accordingling.
>
> You introduce ownership checking, quota checking, ioasid state, ref
> counting, ioasid type handling (whereas existing is NULL) so I have
> the feeling that could ease the review process by adopting a more
> incremental approach.
>
Yes, I felt the same. It is just that the changes are intertwined but I
will give it a try again in the next version.

Thanks for the review and suggestion.

> Thanks
>
> Eric
>
> _______________________________________________
> iommu mailing list
> [email protected]
> https://lists.linuxfoundation.org/mailman/listinfo/iommu

[Jacob Pan]

2020-09-07 08:08:42

by Eric Auger

[permalink] [raw]
Subject: Re: [PATCH v2 1/9] docs: Document IO Address Space ID (IOASID) APIs

Hi Jacob,

On 9/1/20 6:56 PM, Jacob Pan wrote:
> Hi Eric,
>
> On Thu, 27 Aug 2020 18:21:07 +0200
> Auger Eric <[email protected]> wrote:
>
>> Hi Jacob,
>> On 8/24/20 12:32 PM, Jean-Philippe Brucker wrote:
>>> On Fri, Aug 21, 2020 at 09:35:10PM -0700, Jacob Pan wrote:
>>>> IOASID is used to identify address spaces that can be targeted by
>>>> device DMA. It is a system-wide resource that is essential to its
>>>> many users. This document is an attempt to help developers from
>>>> all vendors navigate the APIs. At this time, ARM SMMU and Intel’s
>>>> Scalable IO Virtualization (SIOV) enabled platforms are the
>>>> primary users of IOASID. Examples of how SIOV components interact
>>>> with IOASID APIs are provided in that many APIs are driven by the
>>>> requirements from SIOV.
>>>>
>>>> Signed-off-by: Liu Yi L <[email protected]>
>>>> Signed-off-by: Wu Hao <[email protected]>
>>>> Signed-off-by: Jacob Pan <[email protected]>
>>>> ---
>>>> Documentation/ioasid.rst | 618
>>>> +++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed,
>>>> 618 insertions(+) create mode 100644 Documentation/ioasid.rst
>>>>
>>>> diff --git a/Documentation/ioasid.rst b/Documentation/ioasid.rst
>>>
>>> Thanks for writing this up. Should it go to
>>> Documentation/driver-api/, or Documentation/driver-api/iommu/? I
>>> think this also needs to Cc [email protected] and
>>> [email protected]
>>>> new file mode 100644
>>>> index 000000000000..b6a8cdc885ff
>>>> --- /dev/null
>>>> +++ b/Documentation/ioasid.rst
>>>> @@ -0,0 +1,618 @@
>>>> +.. ioasid:
>>>> +
>>>> +=====================================
>>>> +IO Address Space ID
>>>> +=====================================
>>>> +
>>>> +IOASID is a generic name for PCIe Process Address ID (PASID) or
>>>> ARM +SMMU sub-stream ID. An IOASID identifies an address space
>>>> that DMA
>>>
>>> "SubstreamID"
>> On ARM if we don't use PASIDs we have streamids (SID) which can also
>> identify address spaces that DMA requests can target. So maybe this
>> definition is not sufficient.
>>
> According to SMMU spec, the SubstreamID is equivalent to PASID. My
> understanding is that SID is equivalent to PCI requester ID that
> identifies stage 2. Do you plan to use IOASID for stage 2?
No. So actually if PASID is not used we still have a default single
IOASID matching the single context. So that may be fine as a definition.
> IOASID is mostly for SVA and DMA request w/ PASID.
>
>>>
>>>> +requests can target.
>>>> +
>>>> +The primary use cases for IOASID are Shared Virtual Address (SVA)
>>>> and +IO Virtual Address (IOVA). However, the requirements for
>>>> IOASID
>>>
>>> IOVA alone isn't a use case, maybe "multiple IOVA spaces per
>>> device"?
>>>> +management can vary among hardware architectures.
>>>> +
>>>> +This document covers the generic features supported by IOASID
>>>> +APIs. Vendor-specific use cases are also illustrated with Intel's
>>>> VT-d +based platforms as the first example.
>>>> +
>>>> +.. contents:: :local:
>>>> +
>>>> +Glossary
>>>> +========
>>>> +PASID - Process Address Space ID
>>>> +
>>>> +IOASID - IO Address Space ID (generic term for PCIe PASID and
>>>> +sub-stream ID in SMMU)
>>>
>>> "SubstreamID"
>>>
>>>> +
>>>> +SVA/SVM - Shared Virtual Addressing/Memory
>>>> +
>>>> +ENQCMD - New Intel X86 ISA for efficient workqueue submission
>>>> [1]
>>>
>>> Maybe drop the "New", to keep the documentation perennial. It might
>>> be good to add internal links here to the specifications URLs at
>>> the bottom.
>>>> +
>>>> +DSA - Intel Data Streaming Accelerator [2]
>>>> +
>>>> +VDCM - Virtual device composition module [3]
>>>> +
>>>> +SIOV - Intel Scalable IO Virtualization
>>>> +
>>>> +
>>>> +Key Concepts
>>>> +============
>>>> +
>>>> +IOASID Set
>>>> +-----------
>>>> +An IOASID set is a group of IOASIDs allocated from the system-wide
>>>> +IOASID pool. An IOASID set is created and can be identified by a
>>>> +token of u64. Refer to IOASID set APIs for more details.
>>>
>>> Identified either by an u64 or an mm_struct, right? Maybe just
>>> drop the second sentence if it's detailed in the IOASID set section
>>> below.
>>>> +
>>>> +IOASID set is particularly useful for guest SVA where each guest
>>>> could +have its own IOASID set for security and efficiency reasons.
>>>> +
>>>> +IOASID Set Private ID (SPID)
>>>> +----------------------------
>>>> +SPIDs are introduced as IOASIDs within its set. Each SPID maps to
>>>> a +system-wide IOASID but the namespace of SPID is within its
>>>> IOASID +set.
>>>
>>> The intro isn't super clear. Perhaps this is simpler:
>>> "Each IOASID set has a private namespace of SPIDs. An SPID maps to a
>>> single system-wide IOASID."
>> or, "within an ioasid set, each ioasid can be associated with an alias
>> ID, named SPID."
> I don't have strong opinion, I feel it is good to explain the
> relationship between SPID and IOASID in both directions, how about add?
> " Conversely, each IOASID is associated with an alias ID, named SPID."
yep. I amy suggest: each IOASID may be associated with an alias ID,
local to the IOASID set, named SPID.
>
>>>
>>>> SPIDs can be used as guest IOASIDs where each guest could do
>>>> +IOASID allocation from its own pool and map them to host physical
>>>> +IOASIDs. SPIDs are particularly useful for supporting live
>>>> migration +where decoupling guest and host physical resources are
>>>> necessary. +
>>>> +For example, two VMs can both allocate guest PASID/SPID #101 but
>>>> map to +different host PASIDs #201 and #202 respectively as shown
>>>> in the +diagram below.
>>>> +::
>>>> +
>>>> + .------------------. .------------------.
>>>> + | VM 1 | | VM 2 |
>>>> + | | | |
>>>> + |------------------| |------------------|
>>>> + | GPASID/SPID 101 | | GPASID/SPID 101 |
>>>> + '------------------' -------------------' Guest
>>>> + __________|______________________|______________________
>>>> + | | Host
>>>> + v v
>>>> + .------------------. .------------------.
>>>> + | Host IOASID 201 | | Host IOASID 202 |
>>>> + '------------------' '------------------'
>>>> + | IOASID set 1 | | IOASID set 2 |
>>>> + '------------------' '------------------'
>>>> +
>>>> +Guest PASID is treated as IOASID set private ID (SPID) within an
>>>> +IOASID set, mappings between guest and host IOASIDs are stored in
>>>> the +set for inquiry.
>>>> +
>>>> +IOASID APIs
>>>> +===========
>>>> +To get the IOASID APIs, users must #include <linux/ioasid.h>.
>>>> These APIs +serve the following functionalities:
>>>> +
>>>> + - IOASID allocation/Free
>>>> + - Group management in the form of ioasid_set
>>>> + - Private data storage and lookup
>>>> + - Reference counting
>>>> + - Event notification in case of state change
>> (a)
> got it
>
>>>> +
>>>> +IOASID Set Level APIs
>>>> +--------------------------
>>>> +For use cases such as guest SVA it is necessary to manage IOASIDs
>>>> at +a group level. For example, VMs may allocate multiple IOASIDs
>>>> for
>> I would use the introduced ioasid_set terminology instead of "group".
> Right, we already introduced it.
>
>>>> +guest process address sharing (vSVA). It is imperative to enforce
>>>> +VM-IOASID ownership such that malicious guest cannot target DMA
>>>
>>> "a malicious guest"
>>>
>>>> +traffic outside its own IOASIDs, or free an active IOASID belong
>>>> to
>>>
>>> "that belongs to"
>>>
>>>> +another VM.
>>>> +::
>>>> +
>>>> + struct ioasid_set *ioasid_alloc_set(void *token, ioasid_t quota,
>>>> u32 type)
>> what is this void *token? also the type may be explained here.
> token is explained in the text following API list. I can move it up.
>
>>>> +
>>>> + int ioasid_adjust_set(struct ioasid_set *set, int quota);
>>>
>>> These could be named "ioasid_set_alloc" and "ioasid_set_adjust" to
>>> be consistent with the rest of the API.
>>>
>>>> +
>>>> + void ioasid_set_get(struct ioasid_set *set)
>>>> +
>>>> + void ioasid_set_put(struct ioasid_set *set)
>>>> +
>>>> + void ioasid_set_get_locked(struct ioasid_set *set)
>>>> +
>>>> + void ioasid_set_put_locked(struct ioasid_set *set)
>>>> +
>>>> + int ioasid_set_for_each_ioasid(struct ioasid_set *sdata,
>>>
>>> Might be nicer to keep the same argument names within the API. Here
>>> "set" rather than "sdata".
>>>
>>>> + void (*fn)(ioasid_t id, void
>>>> *data),
>>>> + void *data)
>>>
>>> (alignment)
>>>
>>>> +
>>>> +
>>>> +IOASID set concept is introduced to represent such IOASID groups.
>>>> Each
>>>
>>> Or just "IOASID sets represent such IOASID groups", but might be
>>> redundant.
>>>
>>>> +IOASID set is created with a token which can be one of the
>>>> following +types:
>> I think this explanation should happen before the above function
>> prototypes
> ditto.
>
>>>> +
>>>> + - IOASID_SET_TYPE_NULL (Arbitrary u64 value)
>>>> + - IOASID_SET_TYPE_MM (Set token is a mm_struct)
>>>> +
>>>> +The explicit MM token type is useful when multiple users of an
>>>> IOASID +set under the same process need to communicate about their
>>>> shared IOASIDs. +E.g. An IOASID set created by VFIO for one guest
>>>> can be associated +with the KVM instance for the same guest since
>>>> they share a common mm_struct. +
>>>> +The IOASID set APIs serve the following purposes:
>>>> +
>>>> + - Ownership/permission enforcement
>>>> + - Take collective actions, e.g. free an entire set
>>>> + - Event notifications within a set
>>>> + - Look up a set based on token
>>>> + - Quota enforcement
>>>
>>> This paragraph could be earlier in the section
>>
>> yes this is a kind of repetition of (a), above
> I meant to highlight on what the APIs do such that readers don't
> need to read the code instead.
>
>>>
>>>> +
>>>> +Individual IOASID APIs
>>>> +----------------------
>>>> +Once an ioasid_set is created, IOASIDs can be allocated from the
>>>> set. +Within the IOASID set namespace, set private ID (SPID) is
>>>> supported. In +the VM use case, SPID can be used for storing guest
>>>> PASID. +
>>>> +::
>>>> +
>>>> + ioasid_t ioasid_alloc(struct ioasid_set *set, ioasid_t min,
>>>> ioasid_t max,
>>>> + void *private);
>>>> +
>>>> + int ioasid_get(struct ioasid_set *set, ioasid_t ioasid);
>>>> +
>>>> + void ioasid_put(struct ioasid_set *set, ioasid_t ioasid);
>>>> +
>>>> + int ioasid_get_locked(struct ioasid_set *set, ioasid_t ioasid);
>>>> +
>>>> + void ioasid_put_locked(struct ioasid_set *set, ioasid_t ioasid);
>>>> +
>>>> + void *ioasid_find(struct ioasid_set *set, ioasid_t ioasid,
>>>> + bool (*getter)(void *));
>>>> +
>>>> + ioasid_t ioasid_find_by_spid(struct ioasid_set *set, ioasid_t
>>>> spid) +
>>>> + int ioasid_attach_data(struct ioasid_set *set, ioasid_t ioasid,
>>>> + void *data);
>>>> + int ioasid_attach_spid(struct ioasid_set *set, ioasid_t ioasid,
>>>> + ioasid_t ssid);
>>>
>>> s/ssid/spid>
> got it
>
>>>> +
>>>> +
>>>> +Notifications
>>>> +-------------
>>>> +An IOASID may have multiple users, each user may have hardware
>>>> context +associated with an IOASID. When the status of an IOASID
>>>> changes, +e.g. an IOASID is being freed, users need to be notified
>>>> such that the +associated hardware context can be cleared,
>>>> flushed, and drained. +
>>>> +::
>>>> +
>>>> + int ioasid_register_notifier(struct ioasid_set *set, struct
>>>> + notifier_block *nb)
>>>> +
>>>> + void ioasid_unregister_notifier(struct ioasid_set *set,
>>>> + struct notifier_block *nb)
>>>> +
>>>> + int ioasid_register_notifier_mm(struct mm_struct *mm, struct
>>>> + notifier_block *nb)
>>>> +
>>>> + void ioasid_unregister_notifier_mm(struct mm_struct *mm, struct
>>>> + notifier_block *nb)
>> the mm_struct prototypes may be justified
> This is the mm type token, i.e.
> - IOASID_SET_TYPE_MM (Set token is a mm_struct)
> I am not sure if it is better to keep the explanation in code or in
> this document, certainly don't want to duplicate.
OK. Maybe add a text explaining why it makes sense to register a
notifier at mm_struct granularity.
>
>>>> +
>>>> + int ioasid_notify(ioasid_t ioasid, enum ioasid_notify_val cmd,
>>>> + unsigned int flags)
>> this one is not obvious either.
> Here I just wanted to list the API functions, perhaps readers can check
> out the code comments?
OK never mind. The exercise is difficult anyway.
>
>>>> +
>>>> +
>>>> +Events
>>>> +~~~~~~
>>>> +Notification events are pertinent to individual IOASIDs, they can
>>>> be +one of the following:
>>>> +
>>>> + - ALLOC
>>>> + - FREE
>>>> + - BIND
>>>> + - UNBIND
>>>> +
>>>> +Ordering
>>>> +~~~~~~~~
>>>> +Ordering is supported by IOASID notification priorities as the
>>>> +following (in ascending order):
>>>> +
>>>> +::
>>>> +
>>>> + enum ioasid_notifier_prios {
>>>> + IOASID_PRIO_LAST,
>>>> + IOASID_PRIO_IOMMU,
>>>> + IOASID_PRIO_DEVICE,
>>>> + IOASID_PRIO_CPU,
>>>> + };
>>
>> Maybe:
>> when registered, notifiers are assigned a priority that affect the
>> call order. Notifiers with CPU priority get called before notifiers
>> with device priority and so on.
> Sounds good.
>
>>>> +
>>>> +The typical use case is when an IOASID is freed due to an
>>>> exception, DMA +source should be quiesced before tearing down
>>>> other hardware contexts +in the system. This will reduce the churn
>>>> in handling faults. DMA work +submission is performed by the CPU
>>>> which is granted higher priority than +devices.
>>>> +
>>>> +
>>>> +Scopes
>>>> +~~~~~~
>>>> +There are two types of notifiers in IOASID core: system-wide and
>>>> +ioasid_set-wide.
>>>> +
>>>> +System-wide notifier is catering for users that need to handle all
>>>> +IOASIDs in the system. E.g. The IOMMU driver handles all IOASIDs.
>>>> +
>>>> +Per ioasid_set notifier can be used by VM specific components
>>>> such as +KVM. After all, each KVM instance only cares about
>>>> IOASIDs within its +own set.
>>>> +
>>>> +
>>>> +Atomicity
>>>> +~~~~~~~~~
>>>> +IOASID notifiers are atomic due to spinlocks used inside the
>>>> IOASID +core. For tasks cannot be completed in the notifier
>>>> handler, async work
>>>
>>> "tasks that cannot be"
>>>
>>>> +can be submitted to complete the work later as long as there is no
>>>> +ordering requirement.
>>>> +
>>>> +Reference counting
>>>> +------------------
>>>> +IOASID lifecycle management is based on reference counting. Users
>>>> of +IOASID intend to align lifecycle with the IOASID need to hold
>>>
>>> "who intend to"
>>>
>>>> +reference of the IOASID. IOASID will not be returned to the pool
>>>> for
>>>
>>> "a reference to the IOASID. The IOASID"
>>>
>>>> +allocation until all references are dropped. Calling ioasid_free()
>>>> +will mark the IOASID as FREE_PENDING if the IOASID has outstanding
>>>> +reference. ioasid_get() is not allowed once an IOASID is in the
>>>> +FREE_PENDING state.
>>>> +
>>>> +Event notifications are used to inform users of IOASID status
>>>> change. +IOASID_FREE event prompts users to drop their references
>>>> after +clearing its context.
>>>> +
>>>> +For example, on VT-d platform when an IOASID is freed, teardown
>>>> +actions are performed on KVM, device driver, and IOMMU driver.
>>>> +KVM shall register notifier block with::
>>>> +
>>>> + static struct notifier_block pasid_nb_kvm = {
>>>> + .notifier_call = pasid_status_change_kvm,
>>>> + .priority = IOASID_PRIO_CPU,
>>>> + };
>>>> +
>>>> +VDCM driver shall register notifier block with::
>>>> +
>>>> + static struct notifier_block pasid_nb_vdcm = {
>>>> + .notifier_call = pasid_status_change_vdcm,
>>>> + .priority = IOASID_PRIO_DEVICE,
>>>> + };
>> not sure those code snippets are really useful. Maybe simply say who
>> is supposed to use each prio.
> Agreed, not all the bits in the snippets are explained. I will explain
> KVM and VDCM need to use priority to ensure call order.
>
>>>> +
>>>> +In both cases, notifier blocks shall be registered on the IOASID
>>>> set +such that *only* events from the matching VM is received.
>>>> +
>>>> +If KVM attempts to register notifier block before the IOASID set
>>>> is +created for the MM token, the notifier block will be placed on
>>>> a
>> using the MM token
> sounds good
>
>>>> +pending list inside IOASID core. Once the token matching IOASID
>>>> set +is created, IOASID will register the notifier block
>>>> automatically.
>> Is this implementation mandated? Can't you enforce the ioasid_set to
>> be created before the notifier gets registered?
>>>> +IOASID core does not replay events for the existing IOASIDs in the
>>>> +set. For IOASID set of MM type, notification blocks can be
>>>> registered +on empty sets only. This is to avoid lost events.
>>>> +
>>>> +IOMMU driver shall register notifier block on global chain::
>>>> +
>>>> + static struct notifier_block pasid_nb_vtd = {
>>>> + .notifier_call = pasid_status_change_vtd,
>>>> + .priority = IOASID_PRIO_IOMMU,
>>>> + };
>>>> +
>>>> +Custom allocator APIs
>>>> +---------------------
>>>> +
>>>> +::
>>>> +
>>>> + int ioasid_register_allocator(struct ioasid_allocator_ops
>>>> *allocator); +
>>>> + void ioasid_unregister_allocator(struct ioasid_allocator_ops
>>>> *allocator); +
>>>> +Allocator Choices
>>>> +~~~~~~~~~~~~~~~~~
>>>> +IOASIDs are allocated for both host and guest SVA/IOVA usage.
>>>> However, +allocators can be different. For example, on VT-d guest
>>>> PASID +allocation must be performed via a virtual command
>>>> interface which is +emulated by VMM.
>>>> +
>>>> +IOASID core has the notion of "custom allocator" such that guest
>>>> can +register virtual command allocator that precedes the default
>>>> one. +
>>>> +Namespaces
>>>> +~~~~~~~~~~
>>>> +IOASIDs are limited system resources that default to 20 bits in
>>>> +size. Since each device has its own table, theoretically the
>>>> namespace +can be per device also. However, for security reasons
>>>> sharing PASID +tables among devices are not good for isolation.
>>>> Therefore, IOASID +namespace is system-wide.
>>>
>>> I don't follow this development. Having per-device PASID table
>>> would work fine for isolation (assuming no hardware bug
>>> necessitating IOMMU groups). If I remember correctly IOASID space
>>> was chosen to be OS-wide because it simplifies the management code
>>> (single PASID per task), and it is system-wide across VMs only in
>>> the case of VT-d scalable mode.
>>>> +
>>>> +There are also other reasons to have this simpler system-wide
>>>> +namespace. Take VT-d as an example, VT-d supports shared workqueue
>>>> +and ENQCMD[1] where one IOASID could be used to submit work on
>>>
>>> Maybe use the Sphinx glossary syntax rather than "[1]"
>>> https://www.sphinx-doc.org/en/master/usage/restructuredtext/directives.html#glossary-directive
>>>
>>>> +multiple devices that are shared with other VMs. This requires
>>>> IOASID +to be system-wide. This is also the reason why guests must
>>>> use an +emulated virtual command interface to allocate IOASID from
>>>> the host. +
>>>> +
>>>> +Life cycle
>>>> +==========
>>>> +This section covers IOASID lifecycle management for both
>>>> bare-metal +and guest usages. In bare-metal SVA, MMU notifier is
>>>> directly hooked +up with IOMMU driver, therefore the process
>>>> address space (MM) +lifecycle is aligned with IOASID.
>> therefore the IOASID lifecyle matches the process address space (MM)
>> lifecyle?
> Sounds good.
>
>>>> +
>>>> +However, guest MMU notifier is not available to host IOMMU
>>>> driver,
>> the guest MMU notifier
>>>> +when guest MM terminates unexpectedly, the events have to go
>>>> through
>> the guest MM
>>>> +VFIO and IOMMU UAPI to reach host IOMMU driver. There are also
>>>> more +parties involved in guest SVA, e.g. on Intel VT-d platform,
>>>> IOASIDs +are used by IOMMU driver, KVM, VDCM, and VFIO.
>>>> +
>>>> +Native IOASID Life Cycle (VT-d Example)
>>>> +---------------------------------------
>>>> +
>>>> +The normal flow of native SVA code with Intel Data Streaming
>>>> +Accelerator(DSA) [2] as example:
>>>> +
>>>> +1. Host user opens accelerator FD, e.g. DSA driver, or uacce;
>>>> +2. DSA driver allocate WQ, do sva_bind_device();
>>>> +3. IOMMU driver calls ioasid_alloc(), then bind PASID with device,
>>>> + mmu_notifier_get()
>>>> +4. DMA starts by DSA driver userspace
>>>> +5. DSA userspace close FD
>>>> +6. DSA/uacce kernel driver handles FD.close()
>>>> +7. DSA driver stops DMA
>>>> +8. DSA driver calls sva_unbind_device();
>>>> +9. IOMMU driver does unbind, clears PASID context in IOMMU, flush
>>>> + TLBs. mmu_notifier_put() called.
>>>> +10. mmu_notifier.release() called, IOMMU SVA code calls
>>>> ioasid_free()* +11. The IOASID is returned to the pool, reclaimed.
>>>> +
>>>> +::
>>>> +
>>>
>>> Use a footnote?
>>> https://www.sphinx-doc.org/en/master/usage/restructuredtext/basics.html#footnotes
>>>> + * With ENQCMD, PASID used on VT-d is not released in
>>>> mmu_notifier() but
>>>> + mmdrop(). mmdrop comes after FD close. Should not matter.
>>>
>>> "comes after FD close, which doesn't make a difference?"
>>> The following might not be necessary since early process
>>> termination is described later.
>>>
>>>> + If the user process dies unexpectedly, Step #10 may come
>>>> before
>>>> + Step #5, in between, all DMA faults discarded. PRQ responded
>>>> with
>>>
>>> PRQ hasn't been defined in this document.
>>>
>>>> + code INVALID REQUEST.
>>>> +
>>>> +During the normal teardown, the following three steps would
>>>> happen in +order:
>> can't this be illustrated in the above 1-11 sequence, just adding
>> NORMAL TEARDONW before #7?
>>>> +
>>>> +1. Device driver stops DMA request
>>>> +2. IOMMU driver unbinds PASID and mm, flush all TLBs, drain
>>>> in-flight
>>>> + requests.
>>>> +3. IOASID freed
>>>> +
>> Then you can just focus on abnormal termination
> Yes, will refer to the steps starting #7. These can be removed.
>
>>>> +Exception happens when process terminates *before* device driver
>>>> stops +DMA and call IOMMU driver to unbind. The flow of process
>>>> exists are as
>> Can't this be explained with something simpler looking at the steps
>> 1-11?
> It meant to be educational given this level of details. Simpler
> steps are labeled with (1) (2) (3). Perhaps these labels didn't stand
> out right? I will use the steps in the 1-11 sequence.
>
>>>
>>> "exits"
>>>
>>>> +follows:
>>>> +
>>>> +::
>>>> +
>>>> + do_exit() {
>>>> + exit_mm() {
>>>> + mm_put();
>>>> + exit_mmap() {
>>>> + intel_invalidate_range() //mmu notifier
>>>> + tlb_finish_mmu()
>>>> + mmu_notifier_release(mm) {
>>>> + intel_iommu_release() {
>>>> + [2]
>>>> intel_iommu_teardown_pasid();
>>>
>>> Parentheses might be better than square brackets for step numbers
>>>
>>>> + intel_iommu_flush_tlbs();
>>>> + }
>>>> + // tlb_invalidate_range cb removed
>>>> + }
>>>> + unmap_vmas();
>>>> + free_pgtables(); // IOMMU cannot walk PGT
>>>> after this
>>>> + };
>>>> + }
>>>> + exit_files(tsk) {
>>>> + close_files() {
>>>> + dsa_close();
>>>> + [1] dsa_stop_dma();
>>>> + intel_svm_unbind_pasid(); //nothing to do
>>>> + }
>>>> + }
>>>> + }
>>>> +
>>>> + mmdrop() /* some random time later, lazy mm user */ {
>>>> + mm_free_pgd();
>>>> + destroy_context(mm); {
>>>> + [3] ioasid_free();
>>>> + }
>>>> + }
>>>> +
>>>> +As shown in the list above, step #2 could happen before
>>>> +#1. Unrecoverable(UR) faults could happen between #2 and #1.
>>>> +
>>>> +Also notice that TLB invalidation occurs at mmu_notifier
>>>> +invalidate_range callback as well as the release callback. The
>>>> reason +is that release callback will delete IOMMU driver from the
>>>> notifier +chain which may skip invalidate_range() calls during the
>>>> exit path. +
>>>> +To avoid unnecessary reporting of UR fault, IOMMU driver shall
>>>> disable
>> UR?
> Unrecoverable, mentioned in the previous paragraph.
>
>>>> +fault reporting after free and before unbind.
>>>> +
>>>> +Guest IOASID Life Cycle (VT-d Example)
>>>> +--------------------------------------
>>>> +Guest IOASID life cycle starts with guest driver open(), this
>>>> could be +uacce or individual accelerator driver such as DSA. At
>>>> FD open, +sva_bind_device() is called which triggers a series of
>>>> actions. +
>>>> +The example below is an illustration of *normal* operations that
>>>> +involves *all* the SW components in VT-d. The flow can be simpler
>>>> if +no ENQCMD is supported.
>>>> +
>>>> +::
>>>> +
>>>> + VFIO IOMMU KVM VDCM IOASID
>>>> Ref
>>>> + ..................................................................
>>>> + 1 ioasid_register_notifier/_mm()
>>>> + 2 ioasid_alloc()
>>>> 1
>>>> + 3 bind_gpasid()
>>>> + 4 iommu_bind()->ioasid_get()
>>>> 2
>>>> + 5 ioasid_notify(BIND)
>>>> + 6 -> ioasid_get()
>>>> 3
>>>> + 7 -> vmcs_update_atomic()
>>>> + 8 mdev_write(gpasid)
>>>> + 9 hpasid=
>>>> + 10 find_by_spid(gpasid)
>>>> 4
>>>> + 11 vdev_write(hpasid)
>>>> + 12 -------- GUEST STARTS DMA --------------------------
>>>> + 13 -------- GUEST STOPS DMA --------------------------
>>>> + 14 mdev_clear(gpasid)
>>>> + 15 vdev_clear(hpasid)
>>>> + 16
>>>> ioasid_put() 3
>>>> + 17 unbind_gpasid()
>>>> + 18 iommu_ubind()
>>>> + 19 ioasid_notify(UNBIND)
>>>> + 20 -> vmcs_update_atomic()
>>>> + 21 ->
>>>> ioasid_put() 2
>>>> + 22
>>>> ioasid_free() 1
>>>> + 23
>>>> ioasid_put() 0
>>>> + 24 Reclaimed
>>>> + -------------- New Life Cycle Begin
>>>> ----------------------------
>>>> + 1 ioasid_alloc()
>>>> -> 1 +
>>>> + Note: IOASID Notification Events: FREE, BIND, UNBIND
>>>> +
>>>> +Exception cases arise when a guest crashes or a malicious guest
>>>> +attempts to cause disruption on the host system. The fault
>>>> handling +rules are:
>>>> +
>>>> +1. IOASID free must *always* succeed.
>>>> +2. An inactive period may be required before the freed IOASID is
>>>> + reclaimed. During this period, consumers of IOASID perform
>>>> cleanup. +3. Malfunction is limited to the guest owned resources
>>>> for all
>>>> + programming errors.
>>>> +
>>>> +The primary source of exception is when the following are out of
>>>> +order:
>>>> +
>>>> +1. Start/Stop of DMA activity
>>>> + (Guest device driver, mdev via VFIO)
>> please explain the meaning of what is inside (): initiator?
>>>> +2. Setup/Teardown of IOMMU PASID context, IOTLB, DevTLB flushes
>>>> + (Host IOMMU driver bind/unbind)
>>>> +3. Setup/Teardown of VMCS PASID translation table entries (KVM) in
>>>> + case of ENQCMD
>>>> +4. Programming/Clearing host PASID in VDCM (Host VDCM driver)
>>>> +5. IOASID alloc/free (Host IOASID)
>>>> +
>>>> +VFIO is the *only* user-kernel interface, which is ultimately
>>>> +responsible for exception handlings.
>>>
>>> "handling"
>>>
>>>> +
>>>> +#1 is processed the same way as the assigned device today based on
>>>> +device file descriptors and events. There is no special handling.
>>>> +
>>>> +#3 is based on bind/unbind events emitted by #2.
>>>> +
>>>> +#4 is naturally aligned with IOASID life cycle in that an illegal
>>>> +guest PASID programming would fail in obtaining reference of the
>>>> +matching host IOASID.
>>>> +
>>>> +#5 is similar to #4. The fault will be reported to the user if
>>>> PASID +used in the ENQCMD is not set up in VMCS PASID translation
>>>> table. +
>>>> +Therefore, the remaining out of order problem is between #2 and
>>>> +#5. I.e. unbind vs. free. More specifically, free before unbind.
>>>> +
>>>> +IOASID notifier and refcounting are used to ensure order.
>>>> Following +a publisher-subscriber pattern where:
>> with the following actors:
>>>> +
>>>> +- Publishers: VFIO & IOMMU
>>>> +- Subscribers: KVM, VDCM, IOMMU
>> this may be introduced before.
>>>> +
>>>> +IOASID notifier is atomic which requires subscribers to do quick
>>>> +handling of the event in the atomic context. Workqueue can be
>>>> used for +any processing that requires thread context.
>> repetition of what was said before.
>> IOASID reference must be
> Right, will remove.
>
>>>> +acquired before receiving the FREE event. The reference must be
>>>> +dropped at the end of the processing in order to return the
>>>> IOASID to +the pool.
>>>> +
>>>> +Let's examine the IOASID life cycle again when free happens
>>>> *before* +unbind. This could be a result of misbehaving guests or
>>>> crash. Assuming +VFIO cannot enforce unbind->free order. Notice
>>>> that the setup part up +until step #12 is identical to the normal
>>>> case, the flow below starts +with step 13.
>>>> +
>>>> +::
>>>> +
>>>> + VFIO IOMMU KVM VDCM IOASID
>>>> Ref
>>>> + ..................................................................
>>>> + 13 -------- GUEST STARTS DMA --------------------------
>>>> + 14 -------- *GUEST MISBEHAVES!!!* ----------------
>>>> + 15 ioasid_free()
>>>> + 16
>>>> ioasid_notify(FREE)
>>>> + 17
>>>> mark_ioasid_inactive[1]
>>>> + 18 kvm_nb_handler(FREE)
>>>> + 19 vmcs_update_atomic()
>>>> + 20 ioasid_put_locked() ->
>>>> 3
>>>> + 21 vdcm_nb_handler(FREE)
>>>> + 22 iomm_nb_handler(FREE)
>>>> + 23 ioasid_free() returns[2] schedule_work()
>>>> 2
>>>> + 24 schedule_work() vdev_clear_wk(hpasid)
>>>> + 25 teardown_pasid_wk()
>>>> + 26 ioasid_put() ->
>>>> 1
>>>> + 27 ioasid_put()
>>>> 0
>>>> + 28 Reclaimed
>>>> + 29 unbind_gpasid()
>>>> + 30 iommu_unbind()->ioasid_find() Fails[3]
>>>> + -------------- New Life Cycle Begin
>>>> ---------------------------- +
>>>> +Note:
>>>> +
>>>> +1. By marking IOASID inactive at step #17, no new references can
>>>> be
>>>
>>> Is "inactive" FREE_PENDING?
>>>
>>>> + held. ioasid_get/find() will return -ENOENT;
>>>> +2. After step #23, all events can go out of order. Shall not
>>>> affect
>>>> + the outcome.
>>>> +3. IOMMU driver fails to find private data for unbinding. If
>>>> unbind is
>>>> + called after the same IOASID is allocated for the same guest
>>>> again,
>>>> + this is a programming error. The damage is limited to the guest
>>>> + itself since unbind performs permission checking based on the
>>>> + IOASID set associated with the guest process.
>>>> +
>>>> +KVM PASID Translation Table Updates
>>>> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>>>> +Per VM PASID translation table is maintained by KVM in order to
>>>> +support ENQCMD in the guest. The table contains host-guest PASID
>>>> +translations to be consumed by CPU ucode. The synchronization of
>>>> the +PASID states depends on VFIO/IOMMU driver, where IOCTL and
>>>> atomic +notifiers are used. KVM must register IOASID notifier per
>>>> VM instance +during launch time. The following events are handled:
>>>> +
>>>> +1. BIND/UNBIND
>>>> +2. FREE
>>>> +
>>>> +Rules:
>>>> +
>>>> +1. Multiple devices can bind with the same PASID, this can be
>>>> different PCI
>>>> + devices or mdevs within the same PCI device. However, only the
>>>> + *first* BIND and *last* UNBIND emit notifications.
>>>> +2. IOASID code is responsible for ensuring the correctness of H-G
>>>> + PASID mapping. There is no need for KVM to validate the
>>>> + notification data.
>>>> +3. When UNBIND happens *after* FREE, KVM will see error in
>>>> + ioasid_get() even when the reclaim is not done. IOMMU driver
>>>> will
>>>> + also avoid sending UNBIND if the PASID is already FREE.
>>>> +4. When KVM terminates *before* FREE & UNBIND, references will be
>>>> + dropped for all host PASIDs.
>>>> +
>>>> +VDCM PASID Programming
>>>> +~~~~~~~~~~~~~~~~~~~~~~
>>>> +VDCM composes virtual devices and exposes them to the guests. When
>>>> +the guest allocates a PASID then program it to the virtual
>>>> device, VDCM
>> programs as well
>>>> +intercepts the programming attempt then program the matching
>>>> host
>>>
>>> "programs"
>>>
>>> Thanks,
>>> Jean
>>>
>>>> +PASID on to the hardware.
>>>> +Conversely, when a device is going away, VDCM must be informed
>>>> such +that PASID context on the hardware can be cleared. There
>>>> could be +multiple mdevs assigned to different guests in the same
>>>> VDCM. Since +the PASID table is shared at PCI device level, lazy
>>>> clearing is not +secure. A malicious guest can attack by using
>>>> newly freed PASIDs that +are allocated by another guest.
>>>> +
>>>> +By holding a reference of the PASID until VDCM cleans up the HW
>>>> context, +it is guaranteed that PASID life cycles do not cross
>>>> within the same +device.
>>>> +
>>>> +
>>>> +Reference
>>>> +====================================================
>>>> +1.
>>>> https://software.intel.com/sites/default/files/managed/c5/15/architecture-instruction-set-extensions-programming-reference.pdf
>>>> + +2.
>>>> https://01.org/blogs/2019/introducing-intel-data-streaming-accelerator
>>>> + +3.
>>>> https://software.intel.com/en-us/download/intel-data-streaming-accelerator-preliminary-architecture-specification
>>>> -- 2.7.4
>>
>> Thanks
>>
>> Eric
>>>>
>>>
>>
>> _______________________________________________
>> iommu mailing list
>> [email protected]
>> https://lists.linuxfoundation.org/mailman/listinfo/iommu
> [Jacob Pan]
>
Thanks

Eric

2020-09-07 08:09:40

by Eric Auger

[permalink] [raw]
Subject: Re: [PATCH v2 3/9] iommu/ioasid: Introduce ioasid_set APIs

Hi Jacob,

On 9/3/20 11:07 PM, Jacob Pan wrote:
> On Tue, 1 Sep 2020 13:51:26 +0200
> Auger Eric <[email protected]> wrote:
>
>> Hi Jacob,
>>
>> On 8/22/20 6:35 AM, Jacob Pan wrote:
>>> ioasid_set was introduced as an arbitrary token that are shared by
>>> a
>> that is
> got it
>
>>> group of IOASIDs. For example, if IOASID #1 and #2 are allocated
>>> via the same ioasid_set*, they are viewed as to belong to the same
>>> set.
>> two IOASIDs allocated via the same ioasid_set pointer belong to the
>> same set?
>>>
> yes, better.
>
>>> For guest SVA usages, system-wide IOASID resources need to be
>>> partitioned such that VMs can have its own quota and being managed
>> their own
> right,
>
>>> separately. ioasid_set is the perfect candidate for meeting such
>>> requirements. This patch redefines and extends ioasid_set with the
>>> following new fields:
>>> - Quota
>>> - Reference count
>>> - Storage of its namespace
>>> - The token is stored in the new ioasid_set but with optional types
>>>
>>> ioasid_set level APIs are introduced that wires up these new data.
>> that wire
> right
>
>>> Existing users of IOASID APIs are converted where a host IOASID set
>>> is allocated for bare-metal usage.
>>>
>>> Signed-off-by: Liu Yi L <[email protected]>
>>> Signed-off-by: Jacob Pan <[email protected]>
>>> ---
>>> drivers/iommu/intel/iommu.c | 27 ++-
>>> drivers/iommu/intel/pasid.h | 1 +
>>> drivers/iommu/intel/svm.c | 8 +-
>>> drivers/iommu/ioasid.c | 390
>>> +++++++++++++++++++++++++++++++++++++++++---
>>> include/linux/ioasid.h | 82 ++++++++-- 5 files changed, 465
>>> insertions(+), 43 deletions(-)
>>>
>>> diff --git a/drivers/iommu/intel/iommu.c
>>> b/drivers/iommu/intel/iommu.c index a3a0b5c8921d..5813eeaa5edb
>>> 100644 --- a/drivers/iommu/intel/iommu.c
>>> +++ b/drivers/iommu/intel/iommu.c
>>> @@ -42,6 +42,7 @@
>>> #include <linux/crash_dump.h>
>>> #include <linux/numa.h>
>>> #include <linux/swiotlb.h>
>>> +#include <linux/ioasid.h>
>>> #include <asm/irq_remapping.h>
>>> #include <asm/cacheflush.h>
>>> #include <asm/iommu.h>
>>> @@ -103,6 +104,9 @@
>>> */
>>> #define INTEL_IOMMU_PGSIZES (~0xFFFUL)
>>>
>>> +/* PASIDs used by host SVM */
>>> +struct ioasid_set *host_pasid_set;
>>> +
>>> static inline int agaw_to_level(int agaw)
>>> {
>>> return agaw + 2;
>>> @@ -3103,8 +3107,8 @@ static void intel_vcmd_ioasid_free(ioasid_t
>>> ioasid, void *data)
>>> * Sanity check the ioasid owner is done at upper layer,
>>> e.g. VFIO
>>> * We can only free the PASID when all the devices are
>>> unbound. */
>>> - if (ioasid_find(NULL, ioasid, NULL)) {
>>> - pr_alert("Cannot free active IOASID %d\n", ioasid);
>>> + if (IS_ERR(ioasid_find(host_pasid_set, ioasid, NULL))) {
>>> + pr_err("Cannot free IOASID %d, not in system
>>> set\n", ioasid);
>> not sure the change in the trace is worth. Also you may be more
>> explicit like IOASID %d to be freed cannot be found in the system
>> ioasid set.
> Yes, better. will do.
>
>> shouldn't it be rate_limited as it is originated from
>> user space?
> virtual command is only used in the guest kernel, not from userspace
> though. But I should add ratelimited to all user originated calls.
Sure I mixed things up. Sorry for the noise

Eric
>
>>> return;
>>> }
>>> vcmd_free_pasid(iommu, ioasid);
>>> @@ -3288,6 +3292,19 @@ static int __init init_dmars(void)
>>> if (ret)
>>> goto free_iommu;
>>>
>>> + /* PASID is needed for scalable mode irrespective to SVM */
>>> + if (intel_iommu_sm) {
>>> + ioasid_install_capacity(intel_pasid_max_id);
>>> + /* We should not run out of IOASIDs at boot */
>>> + host_pasid_set = ioasid_alloc_set(NULL,
>>> PID_MAX_DEFAULT,
>> s/PID_MAX_DEFAULT/intel_pasid_max_id?
> Not really, when both baremetal and guest SVA are used on the same
> system, we want to to limit the baremetal SVM PASID to the number of
> host processes. host_pasid_set is for baremetal only.
>
> intel_pasid_max_id would take up the entire PASID resource and leave no
> PASIDs for guest usages.
>
>>> +
>>> IOASID_SET_TYPE_NULL);
>> as suggested by jean-Philippe ioasid_set_alloc
>>> + if (IS_ERR_OR_NULL(host_pasid_set)) {
>>> + pr_err("Failed to enable host PASID
>>> allocator %lu\n",
>>> + PTR_ERR(host_pasid_set));
>> does not sound like the correct error message? failed to allocate the
>> system ioasid_set?
> right
>
>>> + intel_iommu_sm = 0;
>>> + }
>>> + }
>>> +
>>> /*
>>> * for each drhd
>>> * enable fault log
>>> @@ -5149,7 +5166,7 @@ static void auxiliary_unlink_device(struct
>>> dmar_domain *domain, domain->auxd_refcnt--;
>>>
>>> if (!domain->auxd_refcnt && domain->default_pasid > 0)
>>> - ioasid_free(domain->default_pasid);
>>> + ioasid_free(host_pasid_set, domain->default_pasid);
>>> }
>>>
>>> static int aux_domain_add_dev(struct dmar_domain *domain,
>>> @@ -5167,7 +5184,7 @@ static int aux_domain_add_dev(struct
>>> dmar_domain *domain, int pasid;
>>>
>>> /* No private data needed for the default pasid */
>>> - pasid = ioasid_alloc(NULL, PASID_MIN,
>>> + pasid = ioasid_alloc(host_pasid_set, PASID_MIN,
>>> pci_max_pasids(to_pci_dev(dev))
>>> - 1, NULL);
>> don't you want to ioasid_set_put() the ioasid_set in
>> intel_iommu_free_dmars()?
> yes, good catch.
>
>>> if (pasid == INVALID_IOASID) {
>>> @@ -5210,7 +5227,7 @@ static int aux_domain_add_dev(struct
>>> dmar_domain *domain, spin_unlock(&iommu->lock);
>>> spin_unlock_irqrestore(&device_domain_lock, flags);
>>> if (!domain->auxd_refcnt && domain->default_pasid > 0)
>>> - ioasid_free(domain->default_pasid);
>>> + ioasid_free(host_pasid_set, domain->default_pasid);
>>>
>>> return ret;
>>> }
>>> diff --git a/drivers/iommu/intel/pasid.h
>>> b/drivers/iommu/intel/pasid.h index c9850766c3a9..ccdc23446015
>>> 100644 --- a/drivers/iommu/intel/pasid.h
>>> +++ b/drivers/iommu/intel/pasid.h
>>> @@ -99,6 +99,7 @@ static inline bool pasid_pte_is_present(struct
>>> pasid_entry *pte) }
>>>
>>> extern u32 intel_pasid_max_id;
>>> +extern struct ioasid_set *host_pasid_set;
>>> int intel_pasid_alloc_id(void *ptr, int start, int end, gfp_t gfp);
>>> void intel_pasid_free_id(int pasid);
>>> void *intel_pasid_lookup_id(int pasid);
>>> diff --git a/drivers/iommu/intel/svm.c b/drivers/iommu/intel/svm.c
>>> index 37a9beabc0ca..634e191ca2c3 100644
>>> --- a/drivers/iommu/intel/svm.c
>>> +++ b/drivers/iommu/intel/svm.c
>>> @@ -551,7 +551,7 @@ intel_svm_bind_mm(struct device *dev, int
>>> flags, struct svm_dev_ops *ops, pasid_max = intel_pasid_max_id;
>>>
>>> /* Do not use PASID 0, reserved for RID to PASID */
>>> - svm->pasid = ioasid_alloc(NULL, PASID_MIN,
>>> + svm->pasid = ioasid_alloc(host_pasid_set,
>>> PASID_MIN, pasid_max - 1, svm);
>>> if (svm->pasid == INVALID_IOASID) {
>>> kfree(svm);
>>> @@ -568,7 +568,7 @@ intel_svm_bind_mm(struct device *dev, int
>>> flags, struct svm_dev_ops *ops, if (mm) {
>>> ret =
>>> mmu_notifier_register(&svm->notifier, mm); if (ret) {
>>> - ioasid_free(svm->pasid);
>>> + ioasid_free(host_pasid_set,
>>> svm->pasid); kfree(svm);
>>> kfree(sdev);
>>> goto out;
>>> @@ -586,7 +586,7 @@ intel_svm_bind_mm(struct device *dev, int
>>> flags, struct svm_dev_ops *ops, if (ret) {
>>> if (mm)
>>> mmu_notifier_unregister(&svm->notifier,
>>> mm);
>>> - ioasid_free(svm->pasid);
>>> + ioasid_free(host_pasid_set, svm->pasid);
>>> kfree(svm);
>>> kfree(sdev);
>>> goto out;
>>> @@ -655,7 +655,7 @@ static int intel_svm_unbind_mm(struct device
>>> *dev, int pasid) kfree_rcu(sdev, rcu);
>>>
>>> if (list_empty(&svm->devs)) {
>>> - ioasid_free(svm->pasid);
>>> + ioasid_free(host_pasid_set,
>>> svm->pasid); if (svm->mm)
>>> mmu_notifier_unregister(&svm->notifier,
>>> svm->mm); list_del(&svm->list);
>>> diff --git a/drivers/iommu/ioasid.c b/drivers/iommu/ioasid.c
>>> index 5f63af07acd5..f73b3dbfc37a 100644
>>> --- a/drivers/iommu/ioasid.c
>>> +++ b/drivers/iommu/ioasid.c
>>> @@ -1,22 +1,58 @@
>>> // SPDX-License-Identifier: GPL-2.0
>>> /*
>>> * I/O Address Space ID allocator. There is one global IOASID
>>> space, split into
>>> - * subsets. Users create a subset with DECLARE_IOASID_SET, then
>>> allocate and
>> I would try to avoid using new terms: s/subset_ioset_set
> Right, I initially used term subset ID for set private ID then realized
> SSID meant something else in ARM SMMU. :)
>
>>> - * free IOASIDs with ioasid_alloc and ioasid_free.
>>> + * subsets. Users create a subset with ioasid_alloc_set, then
>>> allocate/free IDs
>> here also and ioasid_set_alloc
> ditto.
>
>>> + * with ioasid_alloc and ioasid_free.
>>> */
>>> -#include <linux/ioasid.h>
>>> #include <linux/module.h>
>>> #include <linux/slab.h>
>>> #include <linux/spinlock.h>
>>> #include <linux/xarray.h>
>>> +#include <linux/ioasid.h>
>>> +
>>> +static DEFINE_XARRAY_ALLOC(ioasid_sets);
>>> +enum ioasid_state {
>>> + IOASID_STATE_INACTIVE,
>>> + IOASID_STATE_ACTIVE,
>>> + IOASID_STATE_FREE_PENDING,
>>> +};
>>>
>>> +/**
>>> + * struct ioasid_data - Meta data about ioasid
>>> + *
>>> + * @id: Unique ID
>>> + * @users Number of active users
>>> + * @state Track state of the IOASID
>>> + * @set Meta data of the set this IOASID belongs
>>> to
>> s/Meta data of the set this IOASID belongs to/ioasid_set the asid
>> belongs to
> make sense.
>
>>> + * @private Private data associated with the IOASID
>> I would have expected to find the private asid somewhere
>>> + * @rcu For free after RCU grace period
>>> + */
>>> struct ioasid_data {
>>> ioasid_t id;
>>> struct ioasid_set *set;
>>> + refcount_t users;
>>> + enum ioasid_state state;
>>> void *private;
>>> struct rcu_head rcu;
>>> };
>>>
>>> +/* Default to PCIe standard 20 bit PASID */
>>> +#define PCI_PASID_MAX 0x100000
>>> +static ioasid_t ioasid_capacity = PCI_PASID_MAX;
>>> +static ioasid_t ioasid_capacity_avail = PCI_PASID_MAX;
>>> +
>>> +void ioasid_install_capacity(ioasid_t total)
>>> +{
>>> + ioasid_capacity = ioasid_capacity_avail = total;
>>> +}
>>> +EXPORT_SYMBOL_GPL(ioasid_install_capacity);
>>> +
>>> +ioasid_t ioasid_get_capacity()
>>> +{
>>> + return ioasid_capacity;
>>> +}
>>> +EXPORT_SYMBOL_GPL(ioasid_get_capacity);
>>> +
>>> /*
>>> * struct ioasid_allocator_data - Internal data structure to hold
>>> information
>>> * about an allocator. There are two types of allocators:
>>> @@ -306,11 +342,23 @@ ioasid_t ioasid_alloc(struct ioasid_set *set,
>>> ioasid_t min, ioasid_t max, {
>>> struct ioasid_data *data;
>>> void *adata;
>>> - ioasid_t id;
>>> + ioasid_t id = INVALID_IOASID;
>>> +
>>> + spin_lock(&ioasid_allocator_lock);
>>> + /* Check if the IOASID set has been allocated and
>>> initialized */
>>> + if (WARN_ON(xa_load(&ioasid_sets, set->sid) != set)) {
>>> + pr_warn("Invalid set\n");
>>> + goto done_unlock;
>>> + }
>>> +
>>> + if (set->quota <= set->nr_ioasids) {
>>> + pr_err("IOASID set %d out of quota %d\n",
>>> set->sid, set->quota);
>>> + goto done_unlock;
>>> + }
>>>
>>> data = kzalloc(sizeof(*data), GFP_ATOMIC);
>>> if (!data)
>>> - return INVALID_IOASID;
>>> + goto done_unlock;
>>>
>>> data->set = set;
>>> data->private = private;
>>> @@ -319,7 +367,6 @@ ioasid_t ioasid_alloc(struct ioasid_set *set,
>>> ioasid_t min, ioasid_t max,
>>> * Custom allocator needs allocator data to perform
>>> platform specific
>>> * operations.
>>> */
>>> - spin_lock(&ioasid_allocator_lock);
>>> adata = active_allocator->flags &
>>> IOASID_ALLOCATOR_CUSTOM ? active_allocator->ops->pdata : data; id =
>>> active_allocator->ops->alloc(min, max, adata); if (id ==
>>> INVALID_IOASID) { @@ -335,42 +382,339 @@ ioasid_t
>>> ioasid_alloc(struct ioasid_set *set, ioasid_t min, ioasid_t max,
>>> goto exit_free; }
>>> data->id = id;
>>> + data->state = IOASID_STATE_ACTIVE;
>>> + refcount_set(&data->users, 1);
>>> +
>>> + /* Store IOASID in the per set data */
>>> + if (xa_err(xa_store(&set->xa, id, data, GFP_ATOMIC))) {
>>> + pr_err("Failed to ioasid %d in set %d\n", id,
>>> set->sid);
>>> + goto exit_free;
>>> + }
>>> + set->nr_ioasids++;
>>> + goto done_unlock;
>>>
>>> - spin_unlock(&ioasid_allocator_lock);
>>> - return id;
>>> exit_free:
>>> - spin_unlock(&ioasid_allocator_lock);
>>> kfree(data);
>>> - return INVALID_IOASID;
>>> +done_unlock:
>>> + spin_unlock(&ioasid_allocator_lock);
>>> + return id;
>>> }
>>> EXPORT_SYMBOL_GPL(ioasid_alloc);
>>>
>>> +static void ioasid_do_free(struct ioasid_data *data)
>> do_free_locked?
> sounds good, more accurate.
>
>>> +{
>>> + struct ioasid_data *ioasid_data;
>>> + struct ioasid_set *sdata;
>>> +
>>> + active_allocator->ops->free(data->id,
>>> active_allocator->ops->pdata);
>>> + /* Custom allocator needs additional steps to free the xa
>>> element */
>>> + if (active_allocator->flags & IOASID_ALLOCATOR_CUSTOM) {
>>> + ioasid_data = xa_erase(&active_allocator->xa,
>>> data->id);
>>> + kfree_rcu(ioasid_data, rcu);
>>> + }
>>> +
>>> + sdata = xa_load(&ioasid_sets, data->set->sid);
>>> + if (!sdata) {
>>> + pr_err("No set %d for IOASID %d\n", data->set->sid,
>>> + data->id);
>>> + return;
>>> + }
>>> + xa_erase(&sdata->xa, data->id);
>>> + sdata->nr_ioasids--;
>>> +}
>>> +
>>> +static void ioasid_free_locked(struct ioasid_set *set, ioasid_t
>>> ioasid) +{
>>> + struct ioasid_data *data;
>>> +
>>> + data = xa_load(&active_allocator->xa, ioasid);
>>> + if (!data) {
>>> + pr_err("Trying to free unknown IOASID %u\n",
>>> ioasid);
>>> + return;
>>> + }
>>> +
>>> + if (data->set != set) {
>>> + pr_warn("Cannot free IOASID %u due to set
>>> ownership\n", ioasid);
>>> + return;
>>> + }
>>> + data->state = IOASID_STATE_FREE_PENDING;
>>> +
>>> + if (!refcount_dec_and_test(&data->users))
>>> + return;
>>> +
>>> + ioasid_do_free(data);
>>> +}
>>> +
>>> /**
>>> - * ioasid_free - Free an IOASID
>>> - * @ioasid: the ID to remove
>>> + * ioasid_free - Drop reference on an IOASID. Free if refcount
>>> drops to 0,
>>> + * including free from its set and system-wide list.
>>> + * @set: The ioasid_set to check permission with. If not
>>> NULL, IOASID
>>> + * free will fail if the set does not match.
>>> + * @ioasid: The IOASID to remove
>>> */
>>> -void ioasid_free(ioasid_t ioasid)
>>> +void ioasid_free(struct ioasid_set *set, ioasid_t ioasid)
>>> {
>>> - struct ioasid_data *ioasid_data;
>>> + spin_lock(&ioasid_allocator_lock);
>>> + ioasid_free_locked(set, ioasid);
>>> + spin_unlock(&ioasid_allocator_lock);
>>> +}
>>> +EXPORT_SYMBOL_GPL(ioasid_free);
>>>
>>> +/**
>>> + * ioasid_alloc_set - Allocate a new IOASID set for a given token
>>> + *
>>> + * @token: Unique token of the IOASID set, cannot be NULL
>>> + * @quota: Quota allowed in this set. Only for new set
>>> creation
>>> + * @flags: Special requirements
>>> + *
>>> + * IOASID can be limited system-wide resource that requires quota
>>> management.
>>> + * If caller does not wish to enforce quota, use
>>> IOASID_SET_NO_QUOTA flag.
>>> + *
>>> + * Token will be stored in the ioasid_set returned. A reference
>>> will be taken
>>> + * upon finding a matching set or newly created set.
>>> + * IOASID allocation within the set and other per set operations
>>> will use
>>> + * the retured ioasid_set *.
>>> + *
>>> + */
>>> +struct ioasid_set *ioasid_alloc_set(void *token, ioasid_t quota,
>>> int type) +{
>>> + struct ioasid_set *sdata;
>>> + unsigned long index;
>>> + ioasid_t id;
>>> +
>>> + if (type >= IOASID_SET_TYPE_NR)
>>> + return ERR_PTR(-EINVAL);
>>> +
>>> + /*
>>> + * Need to check space available if we share system-wide
>>> quota.
>>> + * TODO: we may need to support quota free sets in the
>>> future.
>>> + */
>>> spin_lock(&ioasid_allocator_lock);
>>> - ioasid_data = xa_load(&active_allocator->xa, ioasid);
>>> - if (!ioasid_data) {
>>> - pr_err("Trying to free unknown IOASID %u\n",
>>> ioasid);
>>> + if (quota > ioasid_capacity_avail) {
>>> + pr_warn("Out of IOASID capacity! ask %d, avail
>>> %d\n",
>>> + quota, ioasid_capacity_avail);
>>> + sdata = ERR_PTR(-ENOSPC);
>>> goto exit_unlock;
>>> }
>>>
>>> - active_allocator->ops->free(ioasid,
>>> active_allocator->ops->pdata);
>>> - /* Custom allocator needs additional steps to free the xa
>>> element */
>>> - if (active_allocator->flags & IOASID_ALLOCATOR_CUSTOM) {
>>> - ioasid_data = xa_erase(&active_allocator->xa,
>>> ioasid);
>>> - kfree_rcu(ioasid_data, rcu);
>>> + /*
>>> + * Token is only unique within its types but right now we
>>> have only
>>> + * mm type. If we have more token types, we have to match
>>> type as well.
>>> + */
>>> + switch (type) {
>>> + case IOASID_SET_TYPE_MM:
>>> + /* Search existing set tokens, reject duplicates */
>>> + xa_for_each(&ioasid_sets, index, sdata) {
>>> + if (sdata->token == token &&
>>> + sdata->type == IOASID_SET_TYPE_MM)
>>> {
>>> + sdata = ERR_PTR(-EEXIST);
>>> + goto exit_unlock;
>>> + }
>>> + }
>>> + break;
>>> + case IOASID_SET_TYPE_NULL:
>>> + if (!token)
>>> + break;
>>> + fallthrough;
>>> + default:
>>> + pr_err("Invalid token and IOASID type\n");
>>> + sdata = ERR_PTR(-EINVAL);
>>> + goto exit_unlock;
>>> }
>>>
>>> + /* REVISIT: may support set w/o quota, use system
>>> available */
>>> + if (!quota) {
>>> + sdata = ERR_PTR(-EINVAL);
>>> + goto exit_unlock;
>>> + }
>>> +
>>> + sdata = kzalloc(sizeof(*sdata), GFP_ATOMIC);
>>> + if (!sdata) {
>>> + sdata = ERR_PTR(-ENOMEM);
>>> + goto exit_unlock;
>>> + }
>>> +
>>> + if (xa_alloc(&ioasid_sets, &id, sdata,
>>> + XA_LIMIT(0, ioasid_capacity_avail - quota),
>>> + GFP_ATOMIC)) {
>>> + kfree(sdata);
>>> + sdata = ERR_PTR(-ENOSPC);
>>> + goto exit_unlock;
>>> + }
>>> +
>>> + sdata->token = token;
>>> + sdata->type = type;
>>> + sdata->quota = quota;
>>> + sdata->sid = id;
>>> + refcount_set(&sdata->ref, 1);
>>> +
>>> + /*
>>> + * Per set XA is used to store private IDs within the set,
>>> get ready
>>> + * for ioasid_set private ID and system-wide IOASID
>>> allocation
>>> + * results.
>>> + */
>>> + xa_init_flags(&sdata->xa, XA_FLAGS_ALLOC);
>>> + ioasid_capacity_avail -= quota;
>>> +
>>> exit_unlock:
>>> spin_unlock(&ioasid_allocator_lock);
>>> +
>>> + return sdata;
>>> }
>>> -EXPORT_SYMBOL_GPL(ioasid_free);
>>> +EXPORT_SYMBOL_GPL(ioasid_alloc_set);
>>> +
>>> +void ioasid_set_get_locked(struct ioasid_set *set)
>>> +{
>>> + if (WARN_ON(xa_load(&ioasid_sets, set->sid) != set)) {
>>> + pr_warn("Invalid set data\n");
>>> + return;
>>> + }
>>> +
>>> + refcount_inc(&set->ref);
>>> +}
>>> +EXPORT_SYMBOL_GPL(ioasid_set_get_locked);
>>> +
>>> +void ioasid_set_get(struct ioasid_set *set)
>>> +{
>>> + spin_lock(&ioasid_allocator_lock);
>>> + ioasid_set_get_locked(set);
>>> + spin_unlock(&ioasid_allocator_lock);
>>> +}
>>> +EXPORT_SYMBOL_GPL(ioasid_set_get);
>>> +
>>> +void ioasid_set_put_locked(struct ioasid_set *set)
>>> +{
>>> + struct ioasid_data *entry;
>>> + unsigned long index;
>>> +
>>> + if (WARN_ON(xa_load(&ioasid_sets, set->sid) != set)) {
>>> + pr_warn("Invalid set data\n");
>>> + return;
>>> + }
>>> +
>>> + if (!refcount_dec_and_test(&set->ref)) {
>>> + pr_debug("%s: IOASID set %d has %d users\n",
>>> + __func__, set->sid,
>>> refcount_read(&set->ref));
>>> + return;
>>> + }
>>> +
>>> + /* The set is already empty, we just destroy the set. */
>>> + if (xa_empty(&set->xa))
>>> + goto done_destroy;
>>> +
>>> + /*
>>> + * Free all PASIDs from system-wide IOASID pool, all
>>> subscribers gets
>>> + * notified and do cleanup of their own.
>>> + * Note that some references of the IOASIDs within the set
>>> can still
>>> + * be held after the free call. This is OK in that the
>>> IOASIDs will be
>>> + * marked inactive, the only operations can be done is
>>> ioasid_put.
>>> + * No need to track IOASID set states since there is no
>>> reclaim phase.
>>> + */
>>> + xa_for_each(&set->xa, index, entry) {
>>> + ioasid_free_locked(set, index);
>>> + /* Free from per set private pool */
>>> + xa_erase(&set->xa, index);
>>> + }
>>> +
>>> +done_destroy:
>>> + /* Return the quota back to system pool */
>>> + ioasid_capacity_avail += set->quota;
>>> + kfree_rcu(set, rcu);
>>> +
>>> + /*
>>> + * Token got released right away after the ioasid_set is
>>> freed.
>>> + * If a new set is created immediately with the newly
>>> released token,
>>> + * it will not allocate the same IOASIDs unless they are
>>> reclaimed.
>>> + */
>>> + xa_erase(&ioasid_sets, set->sid);
>>> +}
>>> +EXPORT_SYMBOL_GPL(ioasid_set_put_locked);
>>> +
>>> +/**
>>> + * ioasid_set_put - Drop a reference to the IOASID set. Free all
>>> IOASIDs within
>>> + * the set if there are no more users.
>>> + *
>>> + * @set: The IOASID set ID to be freed
>>> + *
>>> + * If refcount drops to zero, all IOASIDs allocated within the set
>>> will be
>>> + * freed.
>>> + */
>>> +void ioasid_set_put(struct ioasid_set *set)
>>> +{
>>> + spin_lock(&ioasid_allocator_lock);
>>> + ioasid_set_put_locked(set);
>>> + spin_unlock(&ioasid_allocator_lock);
>>> +}
>>> +EXPORT_SYMBOL_GPL(ioasid_set_put);
>>> +
>>> +/**
>>> + * ioasid_adjust_set - Adjust the quota of an IOASID set
>>> + * @set: IOASID set to be assigned
>>> + * @quota: Quota allowed in this set
>>> + *
>>> + * Return 0 on success. If the new quota is smaller than the
>>> number of
>>> + * IOASIDs already allocated, -EINVAL will be returned. No change
>>> will be
>>> + * made to the existing quota.
>>> + */
>>> +int ioasid_adjust_set(struct ioasid_set *set, int quota)
>>> +{
>>> + int ret = 0;
>>> +
>>> + spin_lock(&ioasid_allocator_lock);
>>> + if (set->nr_ioasids > quota) {
>>> + pr_err("New quota %d is smaller than outstanding
>>> IOASIDs %d\n",
>>> + quota, set->nr_ioasids);
>>> + ret = -EINVAL;
>>> + goto done_unlock;
>>> + }
>>> +
>>> + if (quota >= ioasid_capacity_avail) {
>>> + ret = -ENOSPC;
>>> + goto done_unlock;
>>> + }
>>> +
>>> + /* Return the delta back to system pool */
>>> + ioasid_capacity_avail += set->quota - quota;
>>> +
>>> + /*
>>> + * May have a policy to prevent giving all available
>>> IOASIDs
>>> + * to one set. But we don't enforce here, it should be in
>>> the
>>> + * upper layers.
>>> + */
>>> + set->quota = quota;
>>> +
>>> +done_unlock:
>>> + spin_unlock(&ioasid_allocator_lock);
>>> +
>>> + return ret;
>>> +}
>>> +EXPORT_SYMBOL_GPL(ioasid_adjust_set);
>>> +
>>> +/**
>>> + * ioasid_set_for_each_ioasid - Iterate over all the IOASIDs
>>> within the set
>>> + *
>>> + * Caller must hold a reference of the set and handles its own
>>> locking.
>>> + */
>>> +int ioasid_set_for_each_ioasid(struct ioasid_set *set,
>>> + void (*fn)(ioasid_t id, void *data),
>>> + void *data)
>>> +{
>>> + struct ioasid_data *entry;
>>> + unsigned long index;
>>> + int ret = 0;
>>> +
>>> + if (xa_empty(&set->xa)) {
>>> + pr_warn("No IOASIDs in the set %d\n", set->sid);
>>> + return -ENOENT;
>>> + }
>>> +
>>> + xa_for_each(&set->xa, index, entry) {
>>> + fn(index, data);
>>> + }
>>> +
>>> + return ret;
>>> +}
>>> +EXPORT_SYMBOL_GPL(ioasid_set_for_each_ioasid);
>>>
>>> /**
>>> * ioasid_find - Find IOASID data
>>> diff --git a/include/linux/ioasid.h b/include/linux/ioasid.h
>>> index 9c44947a68c8..412d025d440e 100644
>>> --- a/include/linux/ioasid.h
>>> +++ b/include/linux/ioasid.h
>>> @@ -10,8 +10,35 @@ typedef unsigned int ioasid_t;
>>> typedef ioasid_t (*ioasid_alloc_fn_t)(ioasid_t min, ioasid_t max,
>>> void *data); typedef void (*ioasid_free_fn_t)(ioasid_t ioasid, void
>>> *data);
>>> +/* IOASID set types */
>>> +enum ioasid_set_type {
>>> + IOASID_SET_TYPE_NULL = 1, /* Set token is NULL */
>>> + IOASID_SET_TYPE_MM, /* Set token is a mm_struct,
>> s/mm_struct/mm_struct pointer
> got it
>
>>> + * i.e. associated with a process
>>> + */
>>> + IOASID_SET_TYPE_NR,
>>> +};
>>> +
>>> +/**
>>> + * struct ioasid_set - Meta data about ioasid_set
>>> + * @type: Token types and other features
>> token type. Why "and other features"
> will remove. initially wanted to have a flag
>
>>> + * @token: Unique to identify an IOASID set
>>> + * @xa: XArray to store ioasid_set private IDs, can
>>> be used for
>>> + * guest-host IOASID mapping, or just a private
>>> IOASID namespace.
>>> + * @quota: Max number of IOASIDs can be allocated within
>>> the set
>>> + * @nr_ioasids Number of IOASIDs currently allocated in the
>>> set
>>> + * @sid: ID of the set
>>> + * @ref: Reference count of the users
>>> + */
>>> struct ioasid_set {
>>> - int dummy;
>>> + void *token;
>>> + struct xarray xa;
>>> + int type;
>>> + int quota;
>>> + int nr_ioasids;
>>> + int sid;
>> nit id? sid has a special meaning on ARM.
>>
> sounds good.
>
>>> + refcount_t ref;
>>> + struct rcu_head rcu;
>>> };
>>>
>>> /**
>>> @@ -29,31 +56,64 @@ struct ioasid_allocator_ops {
>>> void *pdata;
>>> };
>>>
>>> -#define DECLARE_IOASID_SET(name) struct ioasid_set name = { 0 }
>>> -
>>> #if IS_ENABLED(CONFIG_IOASID)
>>> +void ioasid_install_capacity(ioasid_t total);
>>> +ioasid_t ioasid_get_capacity(void);
>>> +struct ioasid_set *ioasid_alloc_set(void *token, ioasid_t quota,
>>> int type); +int ioasid_adjust_set(struct ioasid_set *set, int
>>> quota);
>> ioasid_set_adjust_quota
>>> +void ioasid_set_get_locked(struct ioasid_set *set);
>> as mentionned during the Plumber uConf, the set_get is unfortunate.
>> Globally I wonder if we shouldn't rename "set" into "pool" or
>> something alike.
> I agree, how about "group"? I felt pool does not reflect enough of the
> resource partitioning. Any better names? Jean?
>
>>> +void ioasid_set_put_locked(struct ioasid_set *set);
>>> +void ioasid_set_put(struct ioasid_set *set);
>>> +
>>> ioasid_t ioasid_alloc(struct ioasid_set *set, ioasid_t min,
>>> ioasid_t max, void *private);
>>> -void ioasid_free(ioasid_t ioasid);
>>> -void *ioasid_find(struct ioasid_set *set, ioasid_t ioasid,
>>> - bool (*getter)(void *));
>>> +void ioasid_free(struct ioasid_set *set, ioasid_t ioasid);
>>> +
>>> +bool ioasid_is_active(ioasid_t ioasid);
>>> +
>>> +void *ioasid_find(struct ioasid_set *set, ioasid_t ioasid, bool
>>> (*getter)(void *)); +int ioasid_attach_data(ioasid_t ioasid, void
>>> *data); int ioasid_register_allocator(struct ioasid_allocator_ops
>>> *allocator); void ioasid_unregister_allocator(struct
>>> ioasid_allocator_ops *allocator); -int ioasid_attach_data(ioasid_t
>>> ioasid, void *data); -
>>> +void ioasid_is_in_set(struct ioasid_set *set, ioasid_t ioasid);
>>> +int ioasid_set_for_each_ioasid(struct ioasid_set *sdata,
>>> + void (*fn)(ioasid_t id, void *data),
>>> + void *data);
>>> #else /* !CONFIG_IOASID */
>>> +static inline void ioasid_install_capacity(ioasid_t total)
>>> +{
>>> +}
>>> +
>>> +static inline ioasid_t ioasid_get_capacity(void)
>>> +{
>>> + return 0;
>>> +}
>>> +
>>> static inline ioasid_t ioasid_alloc(struct ioasid_set *set,
>>> ioasid_t min, ioasid_t max, void *private)
>>> {
>>> return INVALID_IOASID;
>>> }
>>>
>>> -static inline void ioasid_free(ioasid_t ioasid)
>>> +static inline void ioasid_free(struct ioasid_set *set, ioasid_t
>>> ioasid) +{
>>> +}
>>> +
>>> +static inline bool ioasid_is_active(ioasid_t ioasid)
>>> +{
>>> + return false;
>>> +}
>>> +
>>> +static inline struct ioasid_set *ioasid_alloc_set(void *token,
>>> ioasid_t quota, int type) +{
>>> + return ERR_PTR(-ENOTSUPP);
>>> +}
>>> +
>>> +static inline void ioasid_set_put(struct ioasid_set *set)
>>> {
>>> }
>>>
>>> -static inline void *ioasid_find(struct ioasid_set *set, ioasid_t
>>> ioasid,
>>> - bool (*getter)(void *))
>>> +static inline void *ioasid_find(struct ioasid_set *set, ioasid_t
>>> ioasid, bool (*getter)(void *)) {
>>> return NULL;
>>> }
>>>
>> I felt very difficult to review this patch. Could you split it into
>> several ones? maybe introduce the a dummy host_pasid_set and update
>> the call sites accordingling.
>>
>> You introduce ownership checking, quota checking, ioasid state, ref
>> counting, ioasid type handling (whereas existing is NULL) so I have
>> the feeling that could ease the review process by adopting a more
>> incremental approach.
>>
> Yes, I felt the same. It is just that the changes are intertwined but I
> will give it a try again in the next version.
>
> Thanks for the review and suggestion.
>
>> Thanks
>>
>> Eric
>>
>> _______________________________________________
>> iommu mailing list
>> [email protected]
>> https://lists.linuxfoundation.org/mailman/listinfo/iommu
>
> [Jacob Pan]
>

2020-09-08 22:18:43

by Jacob Pan

[permalink] [raw]
Subject: Re: [PATCH v2 5/9] iommu/ioasid: Introduce ioasid_set private ID

On Tue, 25 Aug 2020 12:22:09 +0200
Jean-Philippe Brucker <[email protected]> wrote:

> On Fri, Aug 21, 2020 at 09:35:14PM -0700, Jacob Pan wrote:
> > When an IOASID set is used for guest SVA, each VM will acquire its
> > ioasid_set for IOASID allocations. IOASIDs within the VM must have a
> > host/physical IOASID backing, mapping between guest and host IOASIDs can
> > be non-identical. IOASID set private ID (SPID) is introduced in this
> > patch to be used as guest IOASID. However, the concept of ioasid_set
> > specific namespace is generic, thus named SPID.
> >
> > As SPID namespace is within the IOASID set, the IOASID core can provide
> > lookup services at both directions. SPIDs may not be allocated when its
> > IOASID is allocated, the mapping between SPID and IOASID is usually
> > established when a guest page table is bound to a host PASID.
> >
> > Signed-off-by: Jacob Pan <[email protected]>
> > ---
> > drivers/iommu/ioasid.c | 54 ++++++++++++++++++++++++++++++++++++++++++++++++++
> > include/linux/ioasid.h | 12 +++++++++++
> > 2 files changed, 66 insertions(+)
> >
> > diff --git a/drivers/iommu/ioasid.c b/drivers/iommu/ioasid.c
> > index 5f31d63c75b1..c0aef38a4fde 100644
> > --- a/drivers/iommu/ioasid.c
> > +++ b/drivers/iommu/ioasid.c
> > @@ -21,6 +21,7 @@ enum ioasid_state {
> > * struct ioasid_data - Meta data about ioasid
> > *
> > * @id: Unique ID
> > + * @spid: Private ID unique within a set
> > * @users Number of active users
> > * @state Track state of the IOASID
> > * @set Meta data of the set this IOASID belongs to
> > @@ -29,6 +30,7 @@ enum ioasid_state {
> > */
> > struct ioasid_data {
> > ioasid_t id;
> > + ioasid_t spid;
> > struct ioasid_set *set;
> > refcount_t users;
> > enum ioasid_state state;
> > @@ -326,6 +328,58 @@ int ioasid_attach_data(ioasid_t ioasid, void *data)
> > EXPORT_SYMBOL_GPL(ioasid_attach_data);
> >
> > /**
> > + * ioasid_attach_spid - Attach ioasid_set private ID to an IOASID
> > + *
> > + * @ioasid: the ID to attach
> > + * @spid: the ioasid_set private ID of @ioasid
> > + *
> > + * For IOASID that is already allocated, private ID within the set can be
> > + * attached via this API. Future lookup can be done via ioasid_find.
>
> via ioasid_find_by_spid()?
>
yes, will update.

> > + */
> > +int ioasid_attach_spid(ioasid_t ioasid, ioasid_t spid)
> > +{
> > + struct ioasid_data *ioasid_data;
> > + int ret = 0;
> > +
> > + spin_lock(&ioasid_allocator_lock);
> > + ioasid_data = xa_load(&active_allocator->xa, ioasid);
> > +
> > + if (!ioasid_data) {
> > + pr_err("No IOASID entry %d to attach SPID %d\n",
> > + ioasid, spid);
> > + ret = -ENOENT;
> > + goto done_unlock;
> > + }
> > + ioasid_data->spid = spid;
> > +
> > +done_unlock:
> > + spin_unlock(&ioasid_allocator_lock);
> > + return ret;
> > +}
> > +EXPORT_SYMBOL_GPL(ioasid_attach_spid);
> > +
> > +ioasid_t ioasid_find_by_spid(struct ioasid_set *set, ioasid_t spid)
>
> Maybe add a bit of documentation as this is public-facing.
>
Good point, I will add
/**
* ioasid_find_by_spid - Find the system-wide IOASID by a set private ID and
* its set.
*
* @set: the ioasid_set to search within
* @spid: the set private ID
*
* Given a set private ID and its IOASID set, find the system-wide IOASID. Take
* a reference upon finding the matching IOASID. Return INVALID_IOASID if the
* IOASID is not found in the set or the set is not valid.
*/

> > +{
> > + struct ioasid_data *entry;
> > + unsigned long index;
> > +
> > + if (!xa_load(&ioasid_sets, set->sid)) {
> > + pr_warn("Invalid set\n");
> > + return INVALID_IOASID;
> > + }
> > +
> > + xa_for_each(&set->xa, index, entry) {
> > + if (spid == entry->spid) {
> > + pr_debug("Found ioasid %lu by spid %u\n", index, spid);
> > + refcount_inc(&entry->users);
>
> Nothing prevents ioasid_free() from concurrently dropping the refcount to
> zero and calling ioasid_do_free(). The caller will later call ioasid_put()
> on a stale/reallocated index.
>
right, need to add spin_lock(&ioasid_allocator_lock);

> > + return index;
> > + }
> > + }
> > + return INVALID_IOASID;
> > +}
> > +EXPORT_SYMBOL_GPL(ioasid_find_by_spid);
> > +
> > +/**
> > * ioasid_alloc - Allocate an IOASID
> > * @set: the IOASID set
> > * @min: the minimum ID (inclusive)
> > diff --git a/include/linux/ioasid.h b/include/linux/ioasid.h
> > index 310abe4187a3..d4b3e83672f6 100644
> > --- a/include/linux/ioasid.h
> > +++ b/include/linux/ioasid.h
> > @@ -73,6 +73,8 @@ bool ioasid_is_active(ioasid_t ioasid);
> >
> > void *ioasid_find(struct ioasid_set *set, ioasid_t ioasid, bool (*getter)(void *));
> > int ioasid_attach_data(ioasid_t ioasid, void *data);
> > +int ioasid_attach_spid(ioasid_t ioasid, ioasid_t spid);
> > +ioasid_t ioasid_find_by_spid(struct ioasid_set *set, ioasid_t spid);
> > int ioasid_register_allocator(struct ioasid_allocator_ops *allocator);
> > void ioasid_unregister_allocator(struct ioasid_allocator_ops *allocator);
> > void ioasid_is_in_set(struct ioasid_set *set, ioasid_t ioasid);
> > @@ -136,5 +138,15 @@ static inline int ioasid_attach_data(ioasid_t ioasid, void *data)
> > return -ENOTSUPP;
> > }
> >
> > +staic inline int ioasid_attach_spid(ioasid_t ioasid, ioasid_t spid)
> > +{
> > + return -ENOTSUPP;
> > +}
> > +
> > +static inline ioasid_t ioasid_find_by_spid(struct ioasid_set *set, ioasid_t spid)
> > +{
> > + return -ENOTSUPP;
>
> INVALID_IOASID
>
right, will fix.

Thanks!

> Thanks,
> Jean
>
> > +}
> > +
> > #endif /* CONFIG_IOASID */
> > #endif /* __LINUX_IOASID_H */
> > --
> > 2.7.4
> >
> _______________________________________________
> iommu mailing list
> [email protected]
> https://lists.linuxfoundation.org/mailman/listinfo/iommu

[Jacob Pan]

2020-09-08 22:39:50

by Jacob Pan

[permalink] [raw]
Subject: Re: [PATCH v2 5/9] iommu/ioasid: Introduce ioasid_set private ID

On Tue, 1 Sep 2020 17:38:44 +0200
Auger Eric <[email protected]> wrote:

> Hi Jacob,
> On 8/22/20 6:35 AM, Jacob Pan wrote:
> > When an IOASID set is used for guest SVA, each VM will acquire its
> > ioasid_set for IOASID allocations. IOASIDs within the VM must have a
> > host/physical IOASID backing, mapping between guest and host
> > IOASIDs can be non-identical. IOASID set private ID (SPID) is
> > introduced in this patch to be used as guest IOASID. However, the
> > concept of ioasid_set specific namespace is generic, thus named
> > SPID.
> >
> > As SPID namespace is within the IOASID set, the IOASID core can
> > provide lookup services at both directions. SPIDs may not be
> > allocated when its IOASID is allocated, the mapping between SPID
> > and IOASID is usually established when a guest page table is bound
> > to a host PASID.
> >
> > Signed-off-by: Jacob Pan <[email protected]>
> > ---
> > drivers/iommu/ioasid.c | 54
> > ++++++++++++++++++++++++++++++++++++++++++++++++++
> > include/linux/ioasid.h | 12 +++++++++++ 2 files changed, 66
> > insertions(+)
> >
> > diff --git a/drivers/iommu/ioasid.c b/drivers/iommu/ioasid.c
> > index 5f31d63c75b1..c0aef38a4fde 100644
> > --- a/drivers/iommu/ioasid.c
> > +++ b/drivers/iommu/ioasid.c
> > @@ -21,6 +21,7 @@ enum ioasid_state {
> > * struct ioasid_data - Meta data about ioasid
> > *
> > * @id: Unique ID
> > + * @spid: Private ID unique within a set
> > * @users Number of active users
> > * @state Track state of the IOASID
> > * @set Meta data of the set this IOASID belongs to
> > @@ -29,6 +30,7 @@ enum ioasid_state {
> > */
> > struct ioasid_data {
> > ioasid_t id;
> > + ioasid_t spid;
> > struct ioasid_set *set;
> > refcount_t users;
> > enum ioasid_state state;
> > @@ -326,6 +328,58 @@ int ioasid_attach_data(ioasid_t ioasid, void
> > *data) EXPORT_SYMBOL_GPL(ioasid_attach_data);
> >
> > /**
> > + * ioasid_attach_spid - Attach ioasid_set private ID to an IOASID
> > + *
> > + * @ioasid: the ID to attach
> > + * @spid: the ioasid_set private ID of @ioasid
> > + *
> > + * For IOASID that is already allocated, private ID within the set
> > can be
> > + * attached via this API. Future lookup can be done via
> > ioasid_find.
> I would remove "For IOASID that is already allocated, private ID
> within the set can be attached via this API"
I guess it is implied. Will remove.

> > + */
> > +int ioasid_attach_spid(ioasid_t ioasid, ioasid_t spid)
> > +{
> > + struct ioasid_data *ioasid_data;
> > + int ret = 0;
> > +
> > + spin_lock(&ioasid_allocator_lock);
> We keep on saying the SPID is local to an IOASID set but we don't
> check any IOASID set contains this ioasid. It looks a bit weird to me.
We store ioasid_set inside ioasid_data when an IOASID is allocated, so
we don't need to search all the ioasid_sets. Perhaps I missed your
point?

> > + ioasid_data = xa_load(&active_allocator->xa, ioasid);
> > +
> > + if (!ioasid_data) {
> > + pr_err("No IOASID entry %d to attach SPID %d\n",
> > + ioasid, spid);
> > + ret = -ENOENT;
> > + goto done_unlock;
> > + }
> > + ioasid_data->spid = spid;
> is there any way/need to remove an spid binding?
For guest SVA, we attach SPID as a guest PASID during bind guest page
table. Unbind does the opposite, ioasid_attach_spid() with
spid=INVALID_IOASID clears the binding.

Perhaps add more symmetric functions? i.e.
ioasid_detach_spid(ioasid_t ioasid)
ioasid_attach_spid(struct ioasid_set *set, ioasid_t ioasid)

> > +
> > +done_unlock:
> > + spin_unlock(&ioasid_allocator_lock);
> > + return ret;
> > +}
> > +EXPORT_SYMBOL_GPL(ioasid_attach_spid);
> > +
> > +ioasid_t ioasid_find_by_spid(struct ioasid_set *set, ioasid_t spid)
> > +{
> > + struct ioasid_data *entry;
> > + unsigned long index;
> > +
> > + if (!xa_load(&ioasid_sets, set->sid)) {
> > + pr_warn("Invalid set\n");
> > + return INVALID_IOASID;
> > + }
> > +
> > + xa_for_each(&set->xa, index, entry) {
> > + if (spid == entry->spid) {
> > + pr_debug("Found ioasid %lu by spid %u\n",
> > index, spid);
> > + refcount_inc(&entry->users);
> > + return index;
> > + }
> > + }
> > + return INVALID_IOASID;
> > +}
> > +EXPORT_SYMBOL_GPL(ioasid_find_by_spid);
> > +
> > +/**
> > * ioasid_alloc - Allocate an IOASID
> > * @set: the IOASID set
> > * @min: the minimum ID (inclusive)
> > diff --git a/include/linux/ioasid.h b/include/linux/ioasid.h
> > index 310abe4187a3..d4b3e83672f6 100644
> > --- a/include/linux/ioasid.h
> > +++ b/include/linux/ioasid.h
> > @@ -73,6 +73,8 @@ bool ioasid_is_active(ioasid_t ioasid);
> >
> > void *ioasid_find(struct ioasid_set *set, ioasid_t ioasid, bool
> > (*getter)(void *)); int ioasid_attach_data(ioasid_t ioasid, void
> > *data); +int ioasid_attach_spid(ioasid_t ioasid, ioasid_t spid);
> > +ioasid_t ioasid_find_by_spid(struct ioasid_set *set, ioasid_t
> > spid); int ioasid_register_allocator(struct ioasid_allocator_ops
> > *allocator); void ioasid_unregister_allocator(struct
> > ioasid_allocator_ops *allocator); void ioasid_is_in_set(struct
> > ioasid_set *set, ioasid_t ioasid); @@ -136,5 +138,15 @@ static
> > inline int ioasid_attach_data(ioasid_t ioasid, void *data) return
> > -ENOTSUPP; }
> >
> > +staic inline int ioasid_attach_spid(ioasid_t ioasid, ioasid_t spid)
> > +{
> > + return -ENOTSUPP;
> > +}
> > +
> > +static inline ioasid_t ioasid_find_by_spid(struct ioasid_set *set,
> > ioasid_t spid) +{
> > + return -ENOTSUPP;
> > +}
> > +
> > #endif /* CONFIG_IOASID */
> > #endif /* __LINUX_IOASID_H */
> >
> Thanks
>
> Eric
>
> _______________________________________________
> iommu mailing list
> [email protected]
> https://lists.linuxfoundation.org/mailman/listinfo/iommu

[Jacob Pan]

2020-09-09 20:36:52

by Jacob Pan

[permalink] [raw]
Subject: Re: [PATCH v2 6/9] iommu/ioasid: Introduce notification APIs

On Tue, 25 Aug 2020 12:26:17 +0200
Jean-Philippe Brucker <[email protected]> wrote:

> On Fri, Aug 21, 2020 at 09:35:15PM -0700, Jacob Pan wrote:
> > Relations among IOASID users largely follow a publisher-subscriber
> > pattern. E.g. to support guest SVA on Intel Scalable I/O
> > Virtualization (SIOV) enabled platforms, VFIO, IOMMU, device
> > drivers, KVM are all users of IOASIDs. When a state change occurs,
> > VFIO publishes the change event that needs to be processed by other
> > users/subscribers.
> >
> > This patch introduced two types of notifications: global and per
> > ioasid_set. The latter is intended for users who only needs to
> > handle events related to the IOASID of a given set.
> > For more information, refer to the kernel documentation at
> > Documentation/ioasid.rst.
> >
> > Signed-off-by: Liu Yi L <[email protected]>
> > Signed-off-by: Wu Hao <[email protected]>
> > Signed-off-by: Jacob Pan <[email protected]>
> > ---
> > drivers/iommu/ioasid.c | 280
> > ++++++++++++++++++++++++++++++++++++++++++++++++-
> > include/linux/ioasid.h | 70 +++++++++++++ 2 files changed, 348
> > insertions(+), 2 deletions(-)
> >
> > diff --git a/drivers/iommu/ioasid.c b/drivers/iommu/ioasid.c
> > index c0aef38a4fde..6ddc09a7fe74 100644
> > --- a/drivers/iommu/ioasid.c
> > +++ b/drivers/iommu/ioasid.c
> > @@ -9,8 +9,35 @@
> > #include <linux/spinlock.h>
> > #include <linux/xarray.h>
> > #include <linux/ioasid.h>
> > +#include <linux/sched/mm.h>
> >
> > static DEFINE_XARRAY_ALLOC(ioasid_sets);
> > +/*
> > + * An IOASID could have multiple consumers where each consumeer
> > may have
>
> consumer
>
got it

> > + * hardware contexts associated with IOASIDs.
> > + * When a status change occurs, such as IOASID is being freed,
> > notifier chains
> > + * are used to keep the consumers in sync.
> > + * This is a publisher-subscriber pattern where publisher can
> > change the
> > + * state of each IOASID, e.g. alloc/free, bind IOASID to a device
> > and mm.
> > + * On the other hand, subscribers gets notified for the state
> > change and
> > + * keep local states in sync.
> > + *
> > + * Currently, the notifier is global. A further optimization could
> > be per
> > + * IOASID set notifier chain.
>
> The patch adds both
>
right, the comment is old. I will remove the paragraph since it is in
the doc.

> > + */
> > +static ATOMIC_NOTIFIER_HEAD(ioasid_chain);
>
> "ioasid_notifier" may be clearer
>
will do.

> > +
> > +/* List to hold pending notification block registrations */
> > +static LIST_HEAD(ioasid_nb_pending_list);
> > +static DEFINE_SPINLOCK(ioasid_nb_lock);
> > +struct ioasid_set_nb {
> > + struct list_head list;
> > + struct notifier_block *nb;
> > + void *token;
> > + struct ioasid_set *set;
> > + bool active;
> > +};
> > +
> > enum ioasid_state {
> > IOASID_STATE_INACTIVE,
> > IOASID_STATE_ACTIVE,
> > @@ -394,6 +421,7 @@ EXPORT_SYMBOL_GPL(ioasid_find_by_spid);
> > ioasid_t ioasid_alloc(struct ioasid_set *set, ioasid_t min,
> > ioasid_t max, void *private)
> > {
> > + struct ioasid_nb_args args;
> > struct ioasid_data *data;
> > void *adata;
> > ioasid_t id = INVALID_IOASID;
> > @@ -445,8 +473,14 @@ ioasid_t ioasid_alloc(struct ioasid_set *set,
> > ioasid_t min, ioasid_t max, goto exit_free;
> > }
> > set->nr_ioasids++;
> > - goto done_unlock;
> > + args.id = id;
> > + /* Set private ID is not attached during allocation */
> > + args.spid = INVALID_IOASID;
> > + args.set = set;
>
> args.pdata is uninitialized
>
right, it should be
args.pdata = data->private;

> > + atomic_notifier_call_chain(&set->nh, IOASID_ALLOC,
> > &args);
>
> No global notification?
>
There hasn't been a need since the only global notifier listener is
vt-d driver which cares about FREE event only.

> >
> > + spin_unlock(&ioasid_allocator_lock);
> > + return id;
> > exit_free:
> > kfree(data);
> > done_unlock:
> > @@ -479,6 +513,7 @@ static void ioasid_do_free(struct ioasid_data
> > *data)
> > static void ioasid_free_locked(struct ioasid_set *set, ioasid_t
> > ioasid) {
> > + struct ioasid_nb_args args;
> > struct ioasid_data *data;
> >
> > data = xa_load(&active_allocator->xa, ioasid);
> > @@ -491,7 +526,16 @@ static void ioasid_free_locked(struct
> > ioasid_set *set, ioasid_t ioasid) pr_warn("Cannot free IOASID %u
> > due to set ownership\n", ioasid); return;
> > }
> > +
> > data->state = IOASID_STATE_FREE_PENDING;
> > + /* Notify all users that this IOASID is being freed */
> > + args.id = ioasid;
> > + args.spid = data->spid;
> > + args.pdata = data->private;
> > + args.set = data->set;
> > + atomic_notifier_call_chain(&ioasid_chain, IOASID_FREE,
> > &args);
> > + /* Notify the ioasid_set for per set users */
> > + atomic_notifier_call_chain(&set->nh, IOASID_FREE, &args);
> >
> > if (!refcount_dec_and_test(&data->users))
> > return;
> > @@ -514,6 +558,28 @@ void ioasid_free(struct ioasid_set *set,
> > ioasid_t ioasid) }
> > EXPORT_SYMBOL_GPL(ioasid_free);
> >
> > +static void ioasid_add_pending_nb(struct ioasid_set *set)
> > +{
> > + struct ioasid_set_nb *curr;
> > +
> > + if (set->type != IOASID_SET_TYPE_MM)
> > + return;
> > +
> > + /*
> > + * Check if there are any pending nb requests for the
> > given token, if so
> > + * add them to the notifier chain.
> > + */
> > + spin_lock(&ioasid_nb_lock);
> > + list_for_each_entry(curr, &ioasid_nb_pending_list, list) {
> > + if (curr->token == set->token && !curr->active) {
> > + atomic_notifier_chain_register(&set->nh,
> > curr->nb);
> > + curr->set = set;
> > + curr->active = true;
> > + }
> > + }
> > + spin_unlock(&ioasid_nb_lock);
> > +}
> > +
> > /**
> > * ioasid_alloc_set - Allocate a new IOASID set for a given token
> > *
> > @@ -601,6 +667,13 @@ struct ioasid_set *ioasid_alloc_set(void
> > *token, ioasid_t quota, int type) sdata->quota = quota;
> > sdata->sid = id;
> > refcount_set(&sdata->ref, 1);
> > + ATOMIC_INIT_NOTIFIER_HEAD(&sdata->nh);
> > +
> > + /*
> > + * Check if there are any pending nb requests for the
> > given token, if so
> > + * add them to the notifier chain.
> > + */
> > + ioasid_add_pending_nb(sdata);
> >
> > /*
> > * Per set XA is used to store private IDs within the set,
> > get ready @@ -617,6 +690,30 @@ struct ioasid_set
> > *ioasid_alloc_set(void *token, ioasid_t quota, int type) }
> > EXPORT_SYMBOL_GPL(ioasid_alloc_set);
> >
> > +
> > +/*
> > + * ioasid_find_mm_set - Retrieve IOASID set with mm token
> > + * Take a reference of the set if found.
> > + */
> > +static struct ioasid_set *ioasid_find_mm_set(struct mm_struct
> > *token) +{
> > + struct ioasid_set *sdata, *set = NULL;
> > + unsigned long index;
> > +
> > + spin_lock(&ioasid_allocator_lock);
> > +
> > + xa_for_each(&ioasid_sets, index, sdata) {
> > + if (sdata->type == IOASID_SET_TYPE_MM &&
> > sdata->token == token) {
> > + refcount_inc(&sdata->ref);
> > + set = sdata;
> > + goto exit_unlock;
>
> Or just break
>
right, but I missed set = NULL outside xa_for_each(). so I have to
keep this goto, i.e.

spin_lock(&ioasid_allocator_lock);

xa_for_each(&ioasid_sets, index, set) {
if (set->type == IOASID_SET_TYPE_MM && set->token ==
token) {
refcount_inc(&set->ref);
goto exit_unlock;
}
}
set = NULL;
exit_unlock:
spin_unlock(&ioasid_allocator_lock);
return set;


> > + }
> > + }
> > +exit_unlock:
> > + spin_unlock(&ioasid_allocator_lock);
> > + return set;
> > +}
> > +
> > void ioasid_set_get_locked(struct ioasid_set *set)
> > {
> > if (WARN_ON(xa_load(&ioasid_sets, set->sid) != set)) {
> > @@ -638,6 +735,8 @@ EXPORT_SYMBOL_GPL(ioasid_set_get);
> >
> > void ioasid_set_put_locked(struct ioasid_set *set)
> > {
> > + struct ioasid_nb_args args = { 0 };
> > + struct ioasid_set_nb *curr;
> > struct ioasid_data *entry;
> > unsigned long index;
> >
> > @@ -673,8 +772,24 @@ void ioasid_set_put_locked(struct ioasid_set
> > *set) done_destroy:
> > /* Return the quota back to system pool */
> > ioasid_capacity_avail += set->quota;
> > - kfree_rcu(set, rcu);
> >
> > + /* Restore pending status of the set NBs */
> > + spin_lock(&ioasid_nb_lock);
> > + list_for_each_entry(curr, &ioasid_nb_pending_list, list) {
> > + if (curr->token == set->token) {
> > + if (curr->active)
> > + curr->active = false;
> > + else
> > + pr_warn("Set token exists but not
> > active!\n");
> > + }
> > + }
> > + spin_unlock(&ioasid_nb_lock);
> > +
> > + args.set = set;
> > + atomic_notifier_call_chain(&ioasid_chain, IOASID_SET_FREE,
> > &args); +
> > + kfree_rcu(set, rcu);
> > + pr_debug("Set freed %d\n", set->sid);
>
> set might have been freed
>
right, will delete this.

> > /*
> > * Token got released right away after the ioasid_set is
> > freed.
> > * If a new set is created immediately with the newly
> > released token, @@ -927,6 +1042,167 @@ void *ioasid_find(struct
> > ioasid_set *set, ioasid_t ioasid, }
> > EXPORT_SYMBOL_GPL(ioasid_find);
> >
> > +int ioasid_register_notifier(struct ioasid_set *set, struct
> > notifier_block *nb)
>
> Maybe add a bit of documentation on the difference with the _mm
> variant, as well as the @set parameter.
>
> Will this be used by anyone at first? We could introduce only the _mm
> functions for now.
>
We do need both variants, VT-d driver registers global notifier with
set=NULL, KVM registers notifier on the mm_struct pointer.
How about the following comments:
/**
* ioasid_register_notifier_mm - Register a notifier block on the
IOASID set
* created by the mm_struct pointer as
the token *
* @mm: the mm_struct token of the ioasid_set
* @nb: notfier block to be registered on the ioasid_set
*
* This a variant of ioasid_register_notifier() where the caller
intends to
* listen to IOASID events belong the ioasid_set created under the same
* process. Caller is not aware of the ioasid_set, no need to hold
reference
* of the ioasid_set.
*/

> > +{
> > + if (set)
> > + return atomic_notifier_chain_register(&set->nh,
> > nb);
> > + else
> > + return
> > atomic_notifier_chain_register(&ioasid_chain, nb); +}
> > +EXPORT_SYMBOL_GPL(ioasid_register_notifier);
> > +
> > +void ioasid_unregister_notifier(struct ioasid_set *set,
> > + struct notifier_block *nb)
> > +{
> > + struct ioasid_set_nb *curr;
> > +
> > + spin_lock(&ioasid_nb_lock);
> > + /*
> > + * Pending list is registered with a token without an
> > ioasid_set,
> > + * therefore should not be unregistered directly.
> > + */
> > + list_for_each_entry(curr, &ioasid_nb_pending_list, list) {
> > + if (curr->nb == nb) {
> > + pr_warn("Cannot unregister NB from pending
> > list\n");
> > + spin_unlock(&ioasid_nb_lock);
> > + return;
> > + }
> > + }
> > + spin_unlock(&ioasid_nb_lock);
> > +
> > + if (set)
> > + atomic_notifier_chain_unregister(&set->nh, nb);
> > + else
> > + atomic_notifier_chain_unregister(&ioasid_chain,
> > nb); +}
> > +EXPORT_SYMBOL_GPL(ioasid_unregister_notifier);
> > +
> > +int ioasid_register_notifier_mm(struct mm_struct *mm, struct
> > notifier_block *nb) +{
> > + struct ioasid_set_nb *curr;
> > + struct ioasid_set *set;
> > + int ret = 0;
> > +
> > + if (!mm)
> > + return -EINVAL;
> > +
> > + spin_lock(&ioasid_nb_lock);
> > +
> > + /* Check for duplicates, nb is unique per set */
> > + list_for_each_entry(curr, &ioasid_nb_pending_list, list) {
> > + if (curr->token == mm && curr->nb == nb) {
> > + ret = -EBUSY;
> > + goto exit_unlock;
> > + }
> > + }
> > +
> > + /* Check if the token has an existing set */
> > + set = ioasid_find_mm_set(mm);
>
> Seems to be a deadlock here, as ioasid_find_mm_set() grabs
> ioasid_allocator_lock while holding ioasid_nb_lock, and
> ioasid_set_put/get_locked() grabs ioasid_nb_lock while holding
> ioasid_allocator_lock.
>
Good catch, I will move the nb_lock before allocator lock in
ioasid_set_put.

> > + if (IS_ERR_OR_NULL(set)) {
>
> Looks a bit off, maybe we can check !set since ioasid_find_mm_set()
> doesn't return errors.
>
will do.

> > + /* Add to the rsvd list as inactive */
> > + curr->active = false;
>
> curr isn't valid here
>
This is the case where the IOASID set has not been created yet, so we
just put the NB on the pending list. Am I missing your point?
The use case is that when a guest with assigned devices is launched, KVM
and VFIO can be initialized in any order. If KVM starts before VFIO,
which creates the IOASID set, the KVM notifier block will be registered
on the pending list as inactive.

> > + } else {
> > + /* REVISIT: Only register empty set for now. Can
> > add an option
> > + * in the future to playback existing PASIDs.
> > + */
> > + if (set->nr_ioasids) {
> > + pr_warn("IOASID set %d not empty\n",
> > set->sid);
> > + ret = -EBUSY;
> > + goto exit_unlock;
> > + }
> > + curr = kzalloc(sizeof(*curr), GFP_ATOMIC);
>
> As a side-note, I think there's too much atomic allocation in this
> file, I'd like to try and rework the locking once it stabilizes and I
> find some time. Do you remember why ioasid_allocator_lock needed to
> be a spinlock?
>
The spinlock was needed for calling ioasid_free from mmu notifier and
mm_drop().

> > + if (!curr) {
> > + ret = -ENOMEM;
> > + goto exit_unlock;
> > + }
> > + curr->token = mm;
> > + curr->nb = nb;
> > + curr->active = true;
> > + curr->set = set;
> > +
> > + /* Set already created, add to the notifier chain
> > */
> > + atomic_notifier_chain_register(&set->nh, nb);
> > + /*
> > + * Do not hold a reference, if the set gets
> > destroyed, the nb
> > + * entry will be marked inactive.
> > + */
> > + ioasid_set_put(set);
> > + }
> > +
> > + list_add(&curr->list, &ioasid_nb_pending_list);
> > +
> > +exit_unlock:
> > + spin_unlock(&ioasid_nb_lock);
> > + return ret;
> > +}
> > +EXPORT_SYMBOL_GPL(ioasid_register_notifier_mm);
> > +
> > +void ioasid_unregister_notifier_mm(struct mm_struct *mm, struct
> > notifier_block *nb) +{
> > + struct ioasid_set_nb *curr;
> > +
> > + spin_lock(&ioasid_nb_lock);
> > + list_for_each_entry(curr, &ioasid_nb_pending_list, list) {
> > + if (curr->token == mm && curr->nb == nb) {
> > + list_del(&curr->list);
> > + goto exit_free;
> > + }
> > + }
> > + pr_warn("No ioasid set found for mm token %llx\n",
> > (u64)mm);
> > + goto done_unlock;
> > +
> > +exit_free:
> > + if (curr->active) {
> > + pr_debug("mm set active, unregister %llx\n",
> > + (u64)mm);
>
> %px shows raw pointers, but I would drop this altogether or use %p.
>
will drop.

> > + atomic_notifier_chain_unregister(&curr->set->nh,
> > nb);
> > + }
> > + kfree(curr);
> > +done_unlock:
> > + spin_unlock(&ioasid_nb_lock);
> > + return;
> > +}
> > +EXPORT_SYMBOL_GPL(ioasid_unregister_notifier_mm);
> > +
> > +/**
> > + * ioasid_notify - Send notification on a given IOASID for status
> > change.
> > + * Used by publishers when the status change may
> > affect
> > + * subscriber's internal state.
> > + *
> > + * @ioasid: The IOASID to which the notification will send
> > + * @cmd: The notification event
> > + * @flags: Special instructions, e.g. notify with a set or
> > global
>
> Describe valid values for @cmd and @flags? I guess this function
> shouldn't accept IOASID_ALLOC, IOASID_FREE etc
>
Good point. only IOASID_BIND and IOASID_UNBIND are allowed.
Will add a check
/* IOASID_FREE/ALLOC are internal events emitted by IOASID core only */
if (cmd <= IOASID_FREE)
return -EINVAL;
And comment:
* @cmd: Notification event sent by IOASID external users, can be
* IOASID_BIND or IOASID_UNBIND.
*
* @flags: Special instructions, e.g. notify within a set or global by
* IOASID_NOTIFY_SET or IOASID_NOTIFY_ALL flags

> > + */
> > +int ioasid_notify(ioasid_t ioasid, enum ioasid_notify_val cmd,
> > unsigned int flags) +{
> > + struct ioasid_data *ioasid_data;
> > + struct ioasid_nb_args args = { 0 };
> > + int ret = 0;
> > +
> > + spin_lock(&ioasid_allocator_lock);
> > + ioasid_data = xa_load(&active_allocator->xa, ioasid);
> > + if (!ioasid_data) {
> > + pr_err("Trying to notify unknown IOASID %u\n",
> > ioasid);
> > + spin_unlock(&ioasid_allocator_lock);
> > + return -EINVAL;
> > + }
> > +
> > + args.id = ioasid;
> > + args.set = ioasid_data->set;
> > + args.pdata = ioasid_data->private;
> > + args.spid = ioasid_data->spid;
> > + if (flags & IOASID_NOTIFY_ALL) {
> > + ret = atomic_notifier_call_chain(&ioasid_chain,
> > cmd, &args);
> > + } else if (flags & IOASID_NOTIFY_SET) {
> > + ret =
> > atomic_notifier_call_chain(&ioasid_data->set->nh,
> > + cmd, &args);
> > + }
>
> else ret = -EINVAL?
> What about allowing both flags?
>
both flags should be allowed. Let me add a check for valid flags
upfront, then remove the else.

> > + spin_unlock(&ioasid_allocator_lock);
> > +
> > + return ret;
> > +}
> > +EXPORT_SYMBOL_GPL(ioasid_notify);
> > +
> > MODULE_AUTHOR("Jean-Philippe Brucker
> > <[email protected]>"); MODULE_AUTHOR("Jacob Pan
> > <[email protected]>"); MODULE_DESCRIPTION("IO Address
> > Space ID (IOASID) allocator"); diff --git a/include/linux/ioasid.h
> > b/include/linux/ioasid.h index d4b3e83672f6..572111cd3b4b 100644
> > --- a/include/linux/ioasid.h
> > +++ b/include/linux/ioasid.h
> > @@ -23,6 +23,7 @@ enum ioasid_set_type {
> > * struct ioasid_set - Meta data about ioasid_set
> > * @type: Token types and other features
> > * @token: Unique to identify an IOASID set
> > + * @nh: Notifier for IOASID events within the set
> > * @xa: XArray to store ioasid_set private IDs, can
> > be used for
> > * guest-host IOASID mapping, or just a private
> > IOASID namespace.
> > * @quota: Max number of IOASIDs can be allocated within
> > the set @@ -32,6 +33,7 @@ enum ioasid_set_type {
> > */
> > struct ioasid_set {
> > void *token;
> > + struct atomic_notifier_head nh;
> > struct xarray xa;
> > int type;
> > int quota;
> > @@ -56,6 +58,49 @@ struct ioasid_allocator_ops {
> > void *pdata;
> > };
> >
> > +/* Notification data when IOASID status changed */
> > +enum ioasid_notify_val {
> > + IOASID_ALLOC = 1,
> > + IOASID_FREE,
> > + IOASID_BIND,
> > + IOASID_UNBIND,
> > + IOASID_SET_ALLOC,
> > + IOASID_SET_FREE,
> > +};
>
> May be nicer to prefix these with IOASID_NOTIFY_
>
yes, good idea. also add FLAG_ to the flags below

> > +
> > +#define IOASID_NOTIFY_ALL BIT(0)
> > +#define IOASID_NOTIFY_SET BIT(1)
> > +/**
> > + * enum ioasid_notifier_prios - IOASID event notification order
> > + *
> > + * When status of an IOASID changes, users might need to take
> > actions to
> > + * reflect the new state. For example, when an IOASID is freed due
> > to
> > + * exception, the hardware context in virtual CPU, DMA device, and
> > IOMMU
> > + * shall be cleared and drained. Order is required to prevent life
> > cycle
> > + * problems.
> > + */
> > +enum ioasid_notifier_prios {
> > + IOASID_PRIO_LAST,
> > + IOASID_PRIO_DEVICE,
> > + IOASID_PRIO_IOMMU,
> > + IOASID_PRIO_CPU,
> > +};
>
> Not used by this patch, can be added later
>
will do

> > +
> > +/**
> > + * struct ioasid_nb_args - Argument provided by IOASID core when
> > notifier
> > + * is called.
> > + * @id: The IOASID being notified
> > + * @spid: The set private ID associated with the IOASID
> > + * @set: The IOASID set of @id
> > + * @pdata: The private data attached to the IOASID
> > + */
> > +struct ioasid_nb_args {
> > + ioasid_t id;
> > + ioasid_t spid;
> > + struct ioasid_set *set;
> > + void *pdata;
> > +};
> > +
> > #if IS_ENABLED(CONFIG_IOASID)
> > void ioasid_install_capacity(ioasid_t total);
> > ioasid_t ioasid_get_capacity(void);
> > @@ -75,8 +120,16 @@ void *ioasid_find(struct ioasid_set *set,
> > ioasid_t ioasid, bool (*getter)(void * int
> > ioasid_attach_data(ioasid_t ioasid, void *data); int
> > ioasid_attach_spid(ioasid_t ioasid, ioasid_t spid); ioasid_t
> > ioasid_find_by_spid(struct ioasid_set *set, ioasid_t spid); +
> > +int ioasid_register_notifier(struct ioasid_set *set,
> > + struct notifier_block *nb);
> > +void ioasid_unregister_notifier(struct ioasid_set *set,
> > + struct notifier_block *nb);
> > +
> > int ioasid_register_allocator(struct ioasid_allocator_ops
> > *allocator); void ioasid_unregister_allocator(struct
> > ioasid_allocator_ops *allocator); +
> > +int ioasid_notify(ioasid_t ioasid, enum ioasid_notify_val cmd,
> > unsigned int flags); void ioasid_is_in_set(struct ioasid_set *set,
> > ioasid_t ioasid); int ioasid_get(struct ioasid_set *set, ioasid_t
> > ioasid); int ioasid_get_locked(struct ioasid_set *set, ioasid_t
> > ioasid); @@ -85,6 +138,9 @@ void ioasid_put_locked(struct
> > ioasid_set *set, ioasid_t ioasid); int
> > ioasid_set_for_each_ioasid(struct ioasid_set *sdata, void
> > (*fn)(ioasid_t id, void *data), void *data);
> > +int ioasid_register_notifier_mm(struct mm_struct *mm, struct
> > notifier_block *nb); +void ioasid_unregister_notifier_mm(struct
> > mm_struct *mm, struct notifier_block *nb);
>
> These need stubs for !CONFIG_IOASID
>
got it

> > +
> > #else /* !CONFIG_IOASID */
> > static inline void ioasid_install_capacity(ioasid_t total)
> > {
> > @@ -124,6 +180,20 @@ static inline void *ioasid_find(struct
> > ioasid_set *set, ioasid_t ioasid, bool (* return NULL;
> > }
> >
> > +static inline int ioasid_register_notifier(struct notifier_block
> > *nb)
>
> Missing set argument
>
got it


Thanks a lot!
> Thanks,
> Jean
>
> > +{
> > + return -ENOTSUPP;
> > +}
> > +
> > +static inline void ioasid_unregister_notifier(struct
> > notifier_block *nb) +{
> > +}
> > +
> > +static inline int ioasid_notify(ioasid_t ioasid, enum
> > ioasid_notify_val cmd, unsigned int flags) +{
> > + return -ENOTSUPP;
> > +}
> > +
> > static inline int ioasid_register_allocator(struct
> > ioasid_allocator_ops *allocator) {
> > return -ENOTSUPP;
> > --
> > 2.7.4
> >
> _______________________________________________
> iommu mailing list
> [email protected]
> https://lists.linuxfoundation.org/mailman/listinfo/iommu

[Jacob Pan]

2020-09-10 02:56:28

by Jacob Pan

[permalink] [raw]
Subject: Re: [PATCH v2 6/9] iommu/ioasid: Introduce notification APIs

On Tue, 1 Sep 2020 18:49:38 +0200
Auger Eric <[email protected]> wrote:

> Hi Jacob,
>
> On 8/22/20 6:35 AM, Jacob Pan wrote:
> > Relations among IOASID users largely follow a publisher-subscriber
> > pattern. E.g. to support guest SVA on Intel Scalable I/O
> > Virtualization (SIOV) enabled platforms, VFIO, IOMMU, device
> > drivers, KVM are all users of IOASIDs. When a state change occurs,
> > VFIO publishes the change event that needs to be processed by other
> > users/subscribers.
> >
> > This patch introduced two types of notifications: global and per
> > ioasid_set. The latter is intended for users who only needs to
> > handle events related to the IOASID of a given set.
> > For more information, refer to the kernel documentation at
> > Documentation/ioasid.rst.
> >
> > Signed-off-by: Liu Yi L <[email protected]>
> > Signed-off-by: Wu Hao <[email protected]>
> > Signed-off-by: Jacob Pan <[email protected]>
> > ---
> > drivers/iommu/ioasid.c | 280
> > ++++++++++++++++++++++++++++++++++++++++++++++++-
> > include/linux/ioasid.h | 70 +++++++++++++ 2 files changed, 348
> > insertions(+), 2 deletions(-)
> >
> > diff --git a/drivers/iommu/ioasid.c b/drivers/iommu/ioasid.c
> > index c0aef38a4fde..6ddc09a7fe74 100644
> > --- a/drivers/iommu/ioasid.c
> > +++ b/drivers/iommu/ioasid.c
> > @@ -9,8 +9,35 @@
> > #include <linux/spinlock.h>
> > #include <linux/xarray.h>
> > #include <linux/ioasid.h>
> > +#include <linux/sched/mm.h>
> >
> > static DEFINE_XARRAY_ALLOC(ioasid_sets);
> > +/*
> > + * An IOASID could have multiple consumers where each consumeer
> > may have
> can have multiple consumers
Sounds good, I used past tense to describe a possibility :)

> > + * hardware contexts associated with IOASIDs.
> > + * When a status change occurs, such as IOASID is being freed,
> > notifier chains
> s/such as IOASID is being freed/, like on IOASID deallocation,
Better, will do.

> > + * are used to keep the consumers in sync.
> > + * This is a publisher-subscriber pattern where publisher can
> > change the
> > + * state of each IOASID, e.g. alloc/free, bind IOASID to a device
> > and mm.
> > + * On the other hand, subscribers gets notified for the state
> > change and
> > + * keep local states in sync.
> > + *
> > + * Currently, the notifier is global. A further optimization could
> > be per
> > + * IOASID set notifier chain.
> > + */
> > +static ATOMIC_NOTIFIER_HEAD(ioasid_chain);
> > +
> > +/* List to hold pending notification block registrations */
> > +static LIST_HEAD(ioasid_nb_pending_list);
> > +static DEFINE_SPINLOCK(ioasid_nb_lock);
> > +struct ioasid_set_nb {
> > + struct list_head list;
> > + struct notifier_block *nb;
> > + void *token;
> > + struct ioasid_set *set;
> > + bool active;
> > +};
> > +
> > enum ioasid_state {
> > IOASID_STATE_INACTIVE,
> > IOASID_STATE_ACTIVE,
> > @@ -394,6 +421,7 @@ EXPORT_SYMBOL_GPL(ioasid_find_by_spid);
> > ioasid_t ioasid_alloc(struct ioasid_set *set, ioasid_t min,
> > ioasid_t max, void *private)
> > {
> > + struct ioasid_nb_args args;
> > struct ioasid_data *data;
> > void *adata;
> > ioasid_t id = INVALID_IOASID;
> > @@ -445,8 +473,14 @@ ioasid_t ioasid_alloc(struct ioasid_set *set,
> > ioasid_t min, ioasid_t max, goto exit_free;
> > }
> > set->nr_ioasids++;
> > - goto done_unlock;
> > + args.id = id;
> > + /* Set private ID is not attached during allocation */
> > + args.spid = INVALID_IOASID;
> > + args.set = set;
> > + atomic_notifier_call_chain(&set->nh, IOASID_ALLOC, &args);
> >
> > + spin_unlock(&ioasid_allocator_lock);
> > + return id;
> spurious change
Good catch. should just goto done_unlock.

> > exit_free:
> > kfree(data);
> > done_unlock:
> > @@ -479,6 +513,7 @@ static void ioasid_do_free(struct ioasid_data
> > *data)
> > static void ioasid_free_locked(struct ioasid_set *set, ioasid_t
> > ioasid) {
> > + struct ioasid_nb_args args;
> > struct ioasid_data *data;
> >
> > data = xa_load(&active_allocator->xa, ioasid);
> > @@ -491,7 +526,16 @@ static void ioasid_free_locked(struct
> > ioasid_set *set, ioasid_t ioasid) pr_warn("Cannot free IOASID %u
> > due to set ownership\n", ioasid); return;
> > }
> > +
> spurious new line
got it

> > data->state = IOASID_STATE_FREE_PENDING;
> > + /* Notify all users that this IOASID is being freed */
> > + args.id = ioasid;
> > + args.spid = data->spid;
> > + args.pdata = data->private;
> > + args.set = data->set;
> > + atomic_notifier_call_chain(&ioasid_chain, IOASID_FREE,
> > &args);
> > + /* Notify the ioasid_set for per set users */
> > + atomic_notifier_call_chain(&set->nh, IOASID_FREE, &args);
> >
> > if (!refcount_dec_and_test(&data->users))
> > return;
> Shouldn't we call the notifier only when ref count == 0?
Not in the current scheme. The idea is to notify all users the PASID is
being freed, then each user can drop its reference. When refcount == 0,
the PASID will be returned to the pool.

> > @@ -514,6 +558,28 @@ void ioasid_free(struct ioasid_set *set,
> > ioasid_t ioasid) }
> > EXPORT_SYMBOL_GPL(ioasid_free);
> >
> > +static void ioasid_add_pending_nb(struct ioasid_set *set)
> > +{
> > + struct ioasid_set_nb *curr;
> > +
> > + if (set->type != IOASID_SET_TYPE_MM)
> > + return;
> > +
> > + /*
> > + * Check if there are any pending nb requests for the
> > given token, if so
> > + * add them to the notifier chain.
> > + */
> > + spin_lock(&ioasid_nb_lock);
> > + list_for_each_entry(curr, &ioasid_nb_pending_list, list) {
> > + if (curr->token == set->token && !curr->active) {
> > + atomic_notifier_chain_register(&set->nh,
> > curr->nb);
> > + curr->set = set;
> > + curr->active = true;
> > + }
> > + }
> > + spin_unlock(&ioasid_nb_lock);
> > +}
> > +
> > /**
> > * ioasid_alloc_set - Allocate a new IOASID set for a given token
> > *
> > @@ -601,6 +667,13 @@ struct ioasid_set *ioasid_alloc_set(void
> > *token, ioasid_t quota, int type) sdata->quota = quota;
> > sdata->sid = id;
> > refcount_set(&sdata->ref, 1);
> > + ATOMIC_INIT_NOTIFIER_HEAD(&sdata->nh);
> > +
> > + /*
> > + * Check if there are any pending nb requests for the
> > given token, if so
> > + * add them to the notifier chain.
> > + */
> > + ioasid_add_pending_nb(sdata);
> >
> > /*
> > * Per set XA is used to store private IDs within the set,
> > get ready @@ -617,6 +690,30 @@ struct ioasid_set
> > *ioasid_alloc_set(void *token, ioasid_t quota, int type) }
> > EXPORT_SYMBOL_GPL(ioasid_alloc_set);
> >
> > +
> > +/*
> > + * ioasid_find_mm_set - Retrieve IOASID set with mm token
> > + * Take a reference of the set if found.
> > + */
> > +static struct ioasid_set *ioasid_find_mm_set(struct mm_struct
> > *token) +{
> > + struct ioasid_set *sdata, *set = NULL;
> > + unsigned long index;
> > +
> > + spin_lock(&ioasid_allocator_lock);
> > +
> > + xa_for_each(&ioasid_sets, index, sdata) {
> > + if (sdata->type == IOASID_SET_TYPE_MM &&
> > sdata->token == token) {
> > + refcount_inc(&sdata->ref);
> > + set = sdata;
> > + goto exit_unlock;
> > + }
> > + }
> > +exit_unlock:
> > + spin_unlock(&ioasid_allocator_lock);
> > + return set;
> > +}
> > +
> > void ioasid_set_get_locked(struct ioasid_set *set)
> > {
> > if (WARN_ON(xa_load(&ioasid_sets, set->sid) != set)) {
> > @@ -638,6 +735,8 @@ EXPORT_SYMBOL_GPL(ioasid_set_get);
> >
> > void ioasid_set_put_locked(struct ioasid_set *set)
> > {
> > + struct ioasid_nb_args args = { 0 };
> > + struct ioasid_set_nb *curr;
> > struct ioasid_data *entry;
> > unsigned long index;
> >
> > @@ -673,8 +772,24 @@ void ioasid_set_put_locked(struct ioasid_set
> > *set) done_destroy:
> > /* Return the quota back to system pool */
> > ioasid_capacity_avail += set->quota;
> > - kfree_rcu(set, rcu);
> >
> > + /* Restore pending status of the set NBs */
> > + spin_lock(&ioasid_nb_lock);
> > + list_for_each_entry(curr, &ioasid_nb_pending_list, list) {
> > + if (curr->token == set->token) {
> > + if (curr->active)
> > + curr->active = false;
> > + else
> > + pr_warn("Set token exists but not
> > active!\n");
> > + }
> > + }
> > + spin_unlock(&ioasid_nb_lock);
> > +
> > + args.set = set;
> > + atomic_notifier_call_chain(&ioasid_chain, IOASID_SET_FREE,
> > &args); +
> > + kfree_rcu(set, rcu);
> > + pr_debug("Set freed %d\n", set->sid);
> > /*
> > * Token got released right away after the ioasid_set is
> > freed.
> > * If a new set is created immediately with the newly
> > released token, @@ -927,6 +1042,167 @@ void *ioasid_find(struct
> > ioasid_set *set, ioasid_t ioasid, }
> > EXPORT_SYMBOL_GPL(ioasid_find);
> >
> > +int ioasid_register_notifier(struct ioasid_set *set, struct
> > notifier_block *nb) +{
> > + if (set)
> > + return atomic_notifier_chain_register(&set->nh,
> > nb);
> > + else
> > + return
> > atomic_notifier_chain_register(&ioasid_chain, nb); +}
> > +EXPORT_SYMBOL_GPL(ioasid_register_notifier);
> > +
> > +void ioasid_unregister_notifier(struct ioasid_set *set,
> > + struct notifier_block *nb)
> > +{
> > + struct ioasid_set_nb *curr;
> > +
> > + spin_lock(&ioasid_nb_lock);
> > + /*
> > + * Pending list is registered with a token without an
> > ioasid_set,
> > + * therefore should not be unregistered directly.
> > + */
> > + list_for_each_entry(curr, &ioasid_nb_pending_list, list) {
> > + if (curr->nb == nb) {
> > + pr_warn("Cannot unregister NB from pending
> > list\n");
> > + spin_unlock(&ioasid_nb_lock);
> > + return;
> > + }
> > + }
> > + spin_unlock(&ioasid_nb_lock);
> is it safe to release the lock here? What does prevent another NB to
> be added to ioasid_nb_pending_list after that?
Another NB will not be the same one as this NB, which is being removed.
I don't see any issues.
The only reason we check the pending list is to make sure the NB on the
pending list must be removed by ioasid_unregister_notifier_mm() API.

> > +
> > + if (set)
> > + atomic_notifier_chain_unregister(&set->nh, nb);
> > + else
> > + atomic_notifier_chain_unregister(&ioasid_chain,
> > nb); +}
> > +EXPORT_SYMBOL_GPL(ioasid_unregister_notifier);
> > +
> > +int ioasid_register_notifier_mm(struct mm_struct *mm, struct
> > notifier_block *nb) +{
> > + struct ioasid_set_nb *curr;
> > + struct ioasid_set *set;
> > + int ret = 0;
> > +
> > + if (!mm)
> > + return -EINVAL;
> > +
> > + spin_lock(&ioasid_nb_lock);
> > +
> > + /* Check for duplicates, nb is unique per set */
> > + list_for_each_entry(curr, &ioasid_nb_pending_list, list) {
> > + if (curr->token == mm && curr->nb == nb) {
> > + ret = -EBUSY;
> > + goto exit_unlock;
> > + }
> > + }
> > +
> > + /* Check if the token has an existing set */
> > + set = ioasid_find_mm_set(mm);
> > + if (IS_ERR_OR_NULL(set)) {
> > + /* Add to the rsvd list as inactive */
> > + curr->active = false;
> > + } else {
> > + /* REVISIT: Only register empty set for now. Can
> > add an option
> > + * in the future to playback existing PASIDs.
> > + */
> > + if (set->nr_ioasids) {
> > + pr_warn("IOASID set %d not empty\n",
> > set->sid);
> > + ret = -EBUSY;
> > + goto exit_unlock;
> > + }
> > + curr = kzalloc(sizeof(*curr), GFP_ATOMIC);
> > + if (!curr) {
> > + ret = -ENOMEM;
> > + goto exit_unlock;
> > + }
> > + curr->token = mm;
> > + curr->nb = nb;
> > + curr->active = true;
> > + curr->set = set;
> > +
> > + /* Set already created, add to the notifier chain
> > */
> > + atomic_notifier_chain_register(&set->nh, nb);
> > + /*
> > + * Do not hold a reference, if the set gets
> > destroyed, the nb
> > + * entry will be marked inactive.
> > + */
> > + ioasid_set_put(set);
> > + }
> > +
> > + list_add(&curr->list, &ioasid_nb_pending_list);
> > +
> > +exit_unlock:
> > + spin_unlock(&ioasid_nb_lock);
> > + return ret;
> > +}
> > +EXPORT_SYMBOL_GPL(ioasid_register_notifier_mm);
> > +
> > +void ioasid_unregister_notifier_mm(struct mm_struct *mm, struct
> > notifier_block *nb) +{
> > + struct ioasid_set_nb *curr;
> > +
> > + spin_lock(&ioasid_nb_lock);
> > + list_for_each_entry(curr, &ioasid_nb_pending_list, list) {
> > + if (curr->token == mm && curr->nb == nb) {
> > + list_del(&curr->list);
> > + goto exit_free;
> > + }
> > + }
> > + pr_warn("No ioasid set found for mm token %llx\n",
> > (u64)mm);
> > + goto done_unlock;
> > +
> > +exit_free:
> > + if (curr->active) {
> > + pr_debug("mm set active, unregister %llx\n",
> > + (u64)mm);
> > + atomic_notifier_chain_unregister(&curr->set->nh,
> > nb);
> > + }
> > + kfree(curr);
> > +done_unlock:
> > + spin_unlock(&ioasid_nb_lock);
> > + return;
> > +}
> > +EXPORT_SYMBOL_GPL(ioasid_unregister_notifier_mm);
> > +
> > +/**
> > + * ioasid_notify - Send notification on a given IOASID for status
> > change.
> > + * Used by publishers when the status change may
> > affect
> > + * subscriber's internal state.
> > + *
> > + * @ioasid: The IOASID to which the notification will send
> > + * @cmd: The notification event
> > + * @flags: Special instructions, e.g. notify with a set or
> > global
> > + */
> > +int ioasid_notify(ioasid_t ioasid, enum ioasid_notify_val cmd,
> > unsigned int flags) +{
> > + struct ioasid_data *ioasid_data;
> > + struct ioasid_nb_args args = { 0 };
> > + int ret = 0;
> > +
> > + spin_lock(&ioasid_allocator_lock);
> > + ioasid_data = xa_load(&active_allocator->xa, ioasid);
> > + if (!ioasid_data) {
> > + pr_err("Trying to notify unknown IOASID %u\n",
> > ioasid);
> > + spin_unlock(&ioasid_allocator_lock);
> > + return -EINVAL;
> > + }
> > +
> > + args.id = ioasid;
> > + args.set = ioasid_data->set;
> > + args.pdata = ioasid_data->private;
> > + args.spid = ioasid_data->spid;
> > + if (flags & IOASID_NOTIFY_ALL) {
> > + ret = atomic_notifier_call_chain(&ioasid_chain,
> > cmd, &args);
> > + } else if (flags & IOASID_NOTIFY_SET) {
> > + ret =
> > atomic_notifier_call_chain(&ioasid_data->set->nh,
> > + cmd, &args);
> > + }
> > + spin_unlock(&ioasid_allocator_lock);
> > +
> > + return ret;
> > +}
> > +EXPORT_SYMBOL_GPL(ioasid_notify);
> > +
> > MODULE_AUTHOR("Jean-Philippe Brucker
> > <[email protected]>"); MODULE_AUTHOR("Jacob Pan
> > <[email protected]>"); MODULE_DESCRIPTION("IO Address
> > Space ID (IOASID) allocator"); diff --git a/include/linux/ioasid.h
> > b/include/linux/ioasid.h index d4b3e83672f6..572111cd3b4b 100644
> > --- a/include/linux/ioasid.h
> > +++ b/include/linux/ioasid.h
> > @@ -23,6 +23,7 @@ enum ioasid_set_type {
> > * struct ioasid_set - Meta data about ioasid_set
> > * @type: Token types and other features
> > * @token: Unique to identify an IOASID set
> > + * @nh: Notifier for IOASID events within the set
> list of notifiers private to that set?
Sounds more accurate.

> > * @xa: XArray to store ioasid_set private IDs, can
> > be used for
> > * guest-host IOASID mapping, or just a private
> > IOASID namespace.
> > * @quota: Max number of IOASIDs can be allocated within
> > the set @@ -32,6 +33,7 @@ enum ioasid_set_type {
> > */
> > struct ioasid_set {
> > void *token;
> > + struct atomic_notifier_head nh;
> > struct xarray xa;
> > int type;
> > int quota;
> > @@ -56,6 +58,49 @@ struct ioasid_allocator_ops {
> > void *pdata;
> > };
> >
> > +/* Notification data when IOASID status changed */
> > +enum ioasid_notify_val {
> > + IOASID_ALLOC = 1,
> > + IOASID_FREE,
> > + IOASID_BIND,
> > + IOASID_UNBIND,
> > + IOASID_SET_ALLOC,
> > + IOASID_SET_FREE,
> > +};
> > +
> > +#define IOASID_NOTIFY_ALL BIT(0)
> > +#define IOASID_NOTIFY_SET BIT(1)
> > +/**
> > + * enum ioasid_notifier_prios - IOASID event notification order
> > + *
> > + * When status of an IOASID changes, users might need to take
> > actions to
> > + * reflect the new state. For example, when an IOASID is freed due
> > to
> > + * exception, the hardware context in virtual CPU, DMA device, and
> > IOMMU
> > + * shall be cleared and drained. Order is required to prevent life
> > cycle
> > + * problems.
> > + */
> > +enum ioasid_notifier_prios {
> > + IOASID_PRIO_LAST,
> > + IOASID_PRIO_DEVICE,
> > + IOASID_PRIO_IOMMU,
> > + IOASID_PRIO_CPU,
> > +};
> > +
> > +/**
> > + * struct ioasid_nb_args - Argument provided by IOASID core when
> > notifier
> > + * is called.
> > + * @id: The IOASID being notified
> > + * @spid: The set private ID associated with the IOASID
> > + * @set: The IOASID set of @id
> > + * @pdata: The private data attached to the IOASID
> > + */
> > +struct ioasid_nb_args {
> > + ioasid_t id;
> > + ioasid_t spid;
> > + struct ioasid_set *set;
> > + void *pdata;
> > +};
> > +
> > #if IS_ENABLED(CONFIG_IOASID)
> > void ioasid_install_capacity(ioasid_t total);
> > ioasid_t ioasid_get_capacity(void);
> > @@ -75,8 +120,16 @@ void *ioasid_find(struct ioasid_set *set,
> > ioasid_t ioasid, bool (*getter)(void * int
> > ioasid_attach_data(ioasid_t ioasid, void *data); int
> > ioasid_attach_spid(ioasid_t ioasid, ioasid_t spid); ioasid_t
> > ioasid_find_by_spid(struct ioasid_set *set, ioasid_t spid); +
> > +int ioasid_register_notifier(struct ioasid_set *set,
> > + struct notifier_block *nb);
> > +void ioasid_unregister_notifier(struct ioasid_set *set,
> > + struct notifier_block *nb);
> > +
> > int ioasid_register_allocator(struct ioasid_allocator_ops
> > *allocator); void ioasid_unregister_allocator(struct
> > ioasid_allocator_ops *allocator); +
> > +int ioasid_notify(ioasid_t ioasid, enum ioasid_notify_val cmd,
> > unsigned int flags); void ioasid_is_in_set(struct ioasid_set *set,
> > ioasid_t ioasid); int ioasid_get(struct ioasid_set *set, ioasid_t
> > ioasid); int ioasid_get_locked(struct ioasid_set *set, ioasid_t
> > ioasid); @@ -85,6 +138,9 @@ void ioasid_put_locked(struct
> > ioasid_set *set, ioasid_t ioasid); int
> > ioasid_set_for_each_ioasid(struct ioasid_set *sdata, void
> > (*fn)(ioasid_t id, void *data), void *data);
> > +int ioasid_register_notifier_mm(struct mm_struct *mm, struct
> > notifier_block *nb); +void ioasid_unregister_notifier_mm(struct
> > mm_struct *mm, struct notifier_block *nb); +
> > #else /* !CONFIG_IOASID */
> > static inline void ioasid_install_capacity(ioasid_t total)
> > {
> > @@ -124,6 +180,20 @@ static inline void *ioasid_find(struct
> > ioasid_set *set, ioasid_t ioasid, bool (* return NULL;
> > }
> >
> > +static inline int ioasid_register_notifier(struct notifier_block
> > *nb) +{
> > + return -ENOTSUPP;
> > +}
> > +
> > +static inline void ioasid_unregister_notifier(struct
> > notifier_block *nb) +{
> > +}
> > +
> > +static inline int ioasid_notify(ioasid_t ioasid, enum
> > ioasid_notify_val cmd, unsigned int flags) +{
> > + return -ENOTSUPP;
> > +}
> > +
> > static inline int ioasid_register_allocator(struct
> > ioasid_allocator_ops *allocator) {
> > return -ENOTSUPP;
> >
> Thanks
>
> Eric
>
> _______________________________________________
> iommu mailing list
> [email protected]
> https://lists.linuxfoundation.org/mailman/listinfo/iommu

[Jacob Pan]

2020-09-10 09:01:19

by Eric Auger

[permalink] [raw]
Subject: Re: [PATCH v2 6/9] iommu/ioasid: Introduce notification APIs

Hi Jacob,

On 9/10/20 12:58 AM, Jacob Pan wrote:
> On Tue, 1 Sep 2020 18:49:38 +0200
> Auger Eric <[email protected]> wrote:
>
>> Hi Jacob,
>>
>> On 8/22/20 6:35 AM, Jacob Pan wrote:
>>> Relations among IOASID users largely follow a publisher-subscriber
>>> pattern. E.g. to support guest SVA on Intel Scalable I/O
>>> Virtualization (SIOV) enabled platforms, VFIO, IOMMU, device
>>> drivers, KVM are all users of IOASIDs. When a state change occurs,
>>> VFIO publishes the change event that needs to be processed by other
>>> users/subscribers.
>>>
>>> This patch introduced two types of notifications: global and per
>>> ioasid_set. The latter is intended for users who only needs to
>>> handle events related to the IOASID of a given set.
>>> For more information, refer to the kernel documentation at
>>> Documentation/ioasid.rst.
>>>
>>> Signed-off-by: Liu Yi L <[email protected]>
>>> Signed-off-by: Wu Hao <[email protected]>
>>> Signed-off-by: Jacob Pan <[email protected]>
>>> ---
>>> drivers/iommu/ioasid.c | 280
>>> ++++++++++++++++++++++++++++++++++++++++++++++++-
>>> include/linux/ioasid.h | 70 +++++++++++++ 2 files changed, 348
>>> insertions(+), 2 deletions(-)
>>>
>>> diff --git a/drivers/iommu/ioasid.c b/drivers/iommu/ioasid.c
>>> index c0aef38a4fde..6ddc09a7fe74 100644
>>> --- a/drivers/iommu/ioasid.c
>>> +++ b/drivers/iommu/ioasid.c
>>> @@ -9,8 +9,35 @@
>>> #include <linux/spinlock.h>
>>> #include <linux/xarray.h>
>>> #include <linux/ioasid.h>
>>> +#include <linux/sched/mm.h>
>>>
>>> static DEFINE_XARRAY_ALLOC(ioasid_sets);
>>> +/*
>>> + * An IOASID could have multiple consumers where each consumeer
>>> may have
>> can have multiple consumers
> Sounds good, I used past tense to describe a possibility :)
>
>>> + * hardware contexts associated with IOASIDs.
>>> + * When a status change occurs, such as IOASID is being freed,
>>> notifier chains
>> s/such as IOASID is being freed/, like on IOASID deallocation,
> Better, will do.
>
>>> + * are used to keep the consumers in sync.
>>> + * This is a publisher-subscriber pattern where publisher can
>>> change the
>>> + * state of each IOASID, e.g. alloc/free, bind IOASID to a device
>>> and mm.
>>> + * On the other hand, subscribers gets notified for the state
>>> change and
>>> + * keep local states in sync.
>>> + *
>>> + * Currently, the notifier is global. A further optimization could
>>> be per
>>> + * IOASID set notifier chain.
>>> + */
>>> +static ATOMIC_NOTIFIER_HEAD(ioasid_chain);
>>> +
>>> +/* List to hold pending notification block registrations */
>>> +static LIST_HEAD(ioasid_nb_pending_list);
>>> +static DEFINE_SPINLOCK(ioasid_nb_lock);
>>> +struct ioasid_set_nb {
>>> + struct list_head list;
>>> + struct notifier_block *nb;
>>> + void *token;
>>> + struct ioasid_set *set;
>>> + bool active;
>>> +};
>>> +
>>> enum ioasid_state {
>>> IOASID_STATE_INACTIVE,
>>> IOASID_STATE_ACTIVE,
>>> @@ -394,6 +421,7 @@ EXPORT_SYMBOL_GPL(ioasid_find_by_spid);
>>> ioasid_t ioasid_alloc(struct ioasid_set *set, ioasid_t min,
>>> ioasid_t max, void *private)
>>> {
>>> + struct ioasid_nb_args args;
>>> struct ioasid_data *data;
>>> void *adata;
>>> ioasid_t id = INVALID_IOASID;
>>> @@ -445,8 +473,14 @@ ioasid_t ioasid_alloc(struct ioasid_set *set,
>>> ioasid_t min, ioasid_t max, goto exit_free;
>>> }
>>> set->nr_ioasids++;
>>> - goto done_unlock;
>>> + args.id = id;
>>> + /* Set private ID is not attached during allocation */
>>> + args.spid = INVALID_IOASID;
>>> + args.set = set;
>>> + atomic_notifier_call_chain(&set->nh, IOASID_ALLOC, &args);
>>>
>>> + spin_unlock(&ioasid_allocator_lock);
>>> + return id;
>> spurious change
> Good catch. should just goto done_unlock.
>
>>> exit_free:
>>> kfree(data);
>>> done_unlock:
>>> @@ -479,6 +513,7 @@ static void ioasid_do_free(struct ioasid_data
>>> *data)
>>> static void ioasid_free_locked(struct ioasid_set *set, ioasid_t
>>> ioasid) {
>>> + struct ioasid_nb_args args;
>>> struct ioasid_data *data;
>>>
>>> data = xa_load(&active_allocator->xa, ioasid);
>>> @@ -491,7 +526,16 @@ static void ioasid_free_locked(struct
>>> ioasid_set *set, ioasid_t ioasid) pr_warn("Cannot free IOASID %u
>>> due to set ownership\n", ioasid); return;
>>> }
>>> +
>> spurious new line
> got it
>
>>> data->state = IOASID_STATE_FREE_PENDING;
>>> + /* Notify all users that this IOASID is being freed */
>>> + args.id = ioasid;
>>> + args.spid = data->spid;
>>> + args.pdata = data->private;
>>> + args.set = data->set;
>>> + atomic_notifier_call_chain(&ioasid_chain, IOASID_FREE,
>>> &args);
>>> + /* Notify the ioasid_set for per set users */
>>> + atomic_notifier_call_chain(&set->nh, IOASID_FREE, &args);
>>>
>>> if (!refcount_dec_and_test(&data->users))
>>> return;
>> Shouldn't we call the notifier only when ref count == 0?
> Not in the current scheme. The idea is to notify all users the PASID is
> being freed, then each user can drop its reference. When refcount == 0,
> the PASID will be returned to the pool.

OK
>
>>> @@ -514,6 +558,28 @@ void ioasid_free(struct ioasid_set *set,
>>> ioasid_t ioasid) }
>>> EXPORT_SYMBOL_GPL(ioasid_free);
>>>
>>> +static void ioasid_add_pending_nb(struct ioasid_set *set)
>>> +{
>>> + struct ioasid_set_nb *curr;
>>> +
>>> + if (set->type != IOASID_SET_TYPE_MM)
>>> + return;
>>> +
>>> + /*
>>> + * Check if there are any pending nb requests for the
>>> given token, if so
>>> + * add them to the notifier chain.
>>> + */
>>> + spin_lock(&ioasid_nb_lock);
>>> + list_for_each_entry(curr, &ioasid_nb_pending_list, list) {
>>> + if (curr->token == set->token && !curr->active) {
>>> + atomic_notifier_chain_register(&set->nh,
>>> curr->nb);
>>> + curr->set = set;
>>> + curr->active = true;
>>> + }
>>> + }
>>> + spin_unlock(&ioasid_nb_lock);
>>> +}
>>> +
>>> /**
>>> * ioasid_alloc_set - Allocate a new IOASID set for a given token
>>> *
>>> @@ -601,6 +667,13 @@ struct ioasid_set *ioasid_alloc_set(void
>>> *token, ioasid_t quota, int type) sdata->quota = quota;
>>> sdata->sid = id;
>>> refcount_set(&sdata->ref, 1);
>>> + ATOMIC_INIT_NOTIFIER_HEAD(&sdata->nh);
>>> +
>>> + /*
>>> + * Check if there are any pending nb requests for the
>>> given token, if so
>>> + * add them to the notifier chain.
>>> + */
>>> + ioasid_add_pending_nb(sdata);
>>>
>>> /*
>>> * Per set XA is used to store private IDs within the set,
>>> get ready @@ -617,6 +690,30 @@ struct ioasid_set
>>> *ioasid_alloc_set(void *token, ioasid_t quota, int type) }
>>> EXPORT_SYMBOL_GPL(ioasid_alloc_set);
>>>
>>> +
>>> +/*
>>> + * ioasid_find_mm_set - Retrieve IOASID set with mm token
>>> + * Take a reference of the set if found.
>>> + */
>>> +static struct ioasid_set *ioasid_find_mm_set(struct mm_struct
>>> *token) +{
>>> + struct ioasid_set *sdata, *set = NULL;
>>> + unsigned long index;
>>> +
>>> + spin_lock(&ioasid_allocator_lock);
>>> +
>>> + xa_for_each(&ioasid_sets, index, sdata) {
>>> + if (sdata->type == IOASID_SET_TYPE_MM &&
>>> sdata->token == token) {
>>> + refcount_inc(&sdata->ref);
>>> + set = sdata;
>>> + goto exit_unlock;
>>> + }
>>> + }
>>> +exit_unlock:
>>> + spin_unlock(&ioasid_allocator_lock);
>>> + return set;
>>> +}
>>> +
>>> void ioasid_set_get_locked(struct ioasid_set *set)
>>> {
>>> if (WARN_ON(xa_load(&ioasid_sets, set->sid) != set)) {
>>> @@ -638,6 +735,8 @@ EXPORT_SYMBOL_GPL(ioasid_set_get);
>>>
>>> void ioasid_set_put_locked(struct ioasid_set *set)
>>> {
>>> + struct ioasid_nb_args args = { 0 };
>>> + struct ioasid_set_nb *curr;
>>> struct ioasid_data *entry;
>>> unsigned long index;
>>>
>>> @@ -673,8 +772,24 @@ void ioasid_set_put_locked(struct ioasid_set
>>> *set) done_destroy:
>>> /* Return the quota back to system pool */
>>> ioasid_capacity_avail += set->quota;
>>> - kfree_rcu(set, rcu);
>>>
>>> + /* Restore pending status of the set NBs */
>>> + spin_lock(&ioasid_nb_lock);
>>> + list_for_each_entry(curr, &ioasid_nb_pending_list, list) {
>>> + if (curr->token == set->token) {
>>> + if (curr->active)
>>> + curr->active = false;
>>> + else
>>> + pr_warn("Set token exists but not
>>> active!\n");
>>> + }
>>> + }
>>> + spin_unlock(&ioasid_nb_lock);
>>> +
>>> + args.set = set;
>>> + atomic_notifier_call_chain(&ioasid_chain, IOASID_SET_FREE,
>>> &args); +
>>> + kfree_rcu(set, rcu);
>>> + pr_debug("Set freed %d\n", set->sid);
>>> /*
>>> * Token got released right away after the ioasid_set is
>>> freed.
>>> * If a new set is created immediately with the newly
>>> released token, @@ -927,6 +1042,167 @@ void *ioasid_find(struct
>>> ioasid_set *set, ioasid_t ioasid, }
>>> EXPORT_SYMBOL_GPL(ioasid_find);
>>>
>>> +int ioasid_register_notifier(struct ioasid_set *set, struct
>>> notifier_block *nb) +{
>>> + if (set)
>>> + return atomic_notifier_chain_register(&set->nh,
>>> nb);
>>> + else
>>> + return
>>> atomic_notifier_chain_register(&ioasid_chain, nb); +}
>>> +EXPORT_SYMBOL_GPL(ioasid_register_notifier);
>>> +
>>> +void ioasid_unregister_notifier(struct ioasid_set *set,
>>> + struct notifier_block *nb)
>>> +{
>>> + struct ioasid_set_nb *curr;
>>> +
>>> + spin_lock(&ioasid_nb_lock);
>>> + /*
>>> + * Pending list is registered with a token without an
>>> ioasid_set,
>>> + * therefore should not be unregistered directly.
>>> + */
>>> + list_for_each_entry(curr, &ioasid_nb_pending_list, list) {
>>> + if (curr->nb == nb) {
>>> + pr_warn("Cannot unregister NB from pending
>>> list\n");
>>> + spin_unlock(&ioasid_nb_lock);
>>> + return;
>>> + }
>>> + }
>>> + spin_unlock(&ioasid_nb_lock);
>> is it safe to release the lock here? What does prevent another NB to
>> be added to ioasid_nb_pending_list after that?
> Another NB will not be the same one as this NB, which is being removed.
> I don't see any issues.
> The only reason we check the pending list is to make sure the NB on the
> pending list must be removed by ioasid_unregister_notifier_mm() API.
Hum you're right, sorry for the noise.

Thanks

Eric
>
>>> +
>>> + if (set)
>>> + atomic_notifier_chain_unregister(&set->nh, nb);
>>> + else
>>> + atomic_notifier_chain_unregister(&ioasid_chain,
>>> nb); +}
>>> +EXPORT_SYMBOL_GPL(ioasid_unregister_notifier);
>>> +
>>> +int ioasid_register_notifier_mm(struct mm_struct *mm, struct
>>> notifier_block *nb) +{
>>> + struct ioasid_set_nb *curr;
>>> + struct ioasid_set *set;
>>> + int ret = 0;
>>> +
>>> + if (!mm)
>>> + return -EINVAL;
>>> +
>>> + spin_lock(&ioasid_nb_lock);
>>> +
>>> + /* Check for duplicates, nb is unique per set */
>>> + list_for_each_entry(curr, &ioasid_nb_pending_list, list) {
>>> + if (curr->token == mm && curr->nb == nb) {
>>> + ret = -EBUSY;
>>> + goto exit_unlock;
>>> + }
>>> + }
>>> +
>>> + /* Check if the token has an existing set */
>>> + set = ioasid_find_mm_set(mm);
>>> + if (IS_ERR_OR_NULL(set)) {
>>> + /* Add to the rsvd list as inactive */
>>> + curr->active = false;
>>> + } else {
>>> + /* REVISIT: Only register empty set for now. Can
>>> add an option
>>> + * in the future to playback existing PASIDs.
>>> + */
>>> + if (set->nr_ioasids) {
>>> + pr_warn("IOASID set %d not empty\n",
>>> set->sid);
>>> + ret = -EBUSY;
>>> + goto exit_unlock;
>>> + }
>>> + curr = kzalloc(sizeof(*curr), GFP_ATOMIC);
>>> + if (!curr) {
>>> + ret = -ENOMEM;
>>> + goto exit_unlock;
>>> + }
>>> + curr->token = mm;
>>> + curr->nb = nb;
>>> + curr->active = true;
>>> + curr->set = set;
>>> +
>>> + /* Set already created, add to the notifier chain
>>> */
>>> + atomic_notifier_chain_register(&set->nh, nb);
>>> + /*
>>> + * Do not hold a reference, if the set gets
>>> destroyed, the nb
>>> + * entry will be marked inactive.
>>> + */
>>> + ioasid_set_put(set);
>>> + }
>>> +
>>> + list_add(&curr->list, &ioasid_nb_pending_list);
>>> +
>>> +exit_unlock:
>>> + spin_unlock(&ioasid_nb_lock);
>>> + return ret;
>>> +}
>>> +EXPORT_SYMBOL_GPL(ioasid_register_notifier_mm);
>>> +
>>> +void ioasid_unregister_notifier_mm(struct mm_struct *mm, struct
>>> notifier_block *nb) +{
>>> + struct ioasid_set_nb *curr;
>>> +
>>> + spin_lock(&ioasid_nb_lock);
>>> + list_for_each_entry(curr, &ioasid_nb_pending_list, list) {
>>> + if (curr->token == mm && curr->nb == nb) {
>>> + list_del(&curr->list);
>>> + goto exit_free;
>>> + }
>>> + }
>>> + pr_warn("No ioasid set found for mm token %llx\n",
>>> (u64)mm);
>>> + goto done_unlock;
>>> +
>>> +exit_free:
>>> + if (curr->active) {
>>> + pr_debug("mm set active, unregister %llx\n",
>>> + (u64)mm);
>>> + atomic_notifier_chain_unregister(&curr->set->nh,
>>> nb);
>>> + }
>>> + kfree(curr);
>>> +done_unlock:
>>> + spin_unlock(&ioasid_nb_lock);
>>> + return;
>>> +}
>>> +EXPORT_SYMBOL_GPL(ioasid_unregister_notifier_mm);
>>> +
>>> +/**
>>> + * ioasid_notify - Send notification on a given IOASID for status
>>> change.
>>> + * Used by publishers when the status change may
>>> affect
>>> + * subscriber's internal state.
>>> + *
>>> + * @ioasid: The IOASID to which the notification will send
>>> + * @cmd: The notification event
>>> + * @flags: Special instructions, e.g. notify with a set or
>>> global
>>> + */
>>> +int ioasid_notify(ioasid_t ioasid, enum ioasid_notify_val cmd,
>>> unsigned int flags) +{
>>> + struct ioasid_data *ioasid_data;
>>> + struct ioasid_nb_args args = { 0 };
>>> + int ret = 0;
>>> +
>>> + spin_lock(&ioasid_allocator_lock);
>>> + ioasid_data = xa_load(&active_allocator->xa, ioasid);
>>> + if (!ioasid_data) {
>>> + pr_err("Trying to notify unknown IOASID %u\n",
>>> ioasid);
>>> + spin_unlock(&ioasid_allocator_lock);
>>> + return -EINVAL;
>>> + }
>>> +
>>> + args.id = ioasid;
>>> + args.set = ioasid_data->set;
>>> + args.pdata = ioasid_data->private;
>>> + args.spid = ioasid_data->spid;
>>> + if (flags & IOASID_NOTIFY_ALL) {
>>> + ret = atomic_notifier_call_chain(&ioasid_chain,
>>> cmd, &args);
>>> + } else if (flags & IOASID_NOTIFY_SET) {
>>> + ret =
>>> atomic_notifier_call_chain(&ioasid_data->set->nh,
>>> + cmd, &args);
>>> + }
>>> + spin_unlock(&ioasid_allocator_lock);
>>> +
>>> + return ret;
>>> +}
>>> +EXPORT_SYMBOL_GPL(ioasid_notify);
>>> +
>>> MODULE_AUTHOR("Jean-Philippe Brucker
>>> <[email protected]>"); MODULE_AUTHOR("Jacob Pan
>>> <[email protected]>"); MODULE_DESCRIPTION("IO Address
>>> Space ID (IOASID) allocator"); diff --git a/include/linux/ioasid.h
>>> b/include/linux/ioasid.h index d4b3e83672f6..572111cd3b4b 100644
>>> --- a/include/linux/ioasid.h
>>> +++ b/include/linux/ioasid.h
>>> @@ -23,6 +23,7 @@ enum ioasid_set_type {
>>> * struct ioasid_set - Meta data about ioasid_set
>>> * @type: Token types and other features
>>> * @token: Unique to identify an IOASID set
>>> + * @nh: Notifier for IOASID events within the set
>> list of notifiers private to that set?
> Sounds more accurate.
>
>>> * @xa: XArray to store ioasid_set private IDs, can
>>> be used for
>>> * guest-host IOASID mapping, or just a private
>>> IOASID namespace.
>>> * @quota: Max number of IOASIDs can be allocated within
>>> the set @@ -32,6 +33,7 @@ enum ioasid_set_type {
>>> */
>>> struct ioasid_set {
>>> void *token;
>>> + struct atomic_notifier_head nh;
>>> struct xarray xa;
>>> int type;
>>> int quota;
>>> @@ -56,6 +58,49 @@ struct ioasid_allocator_ops {
>>> void *pdata;
>>> };
>>>
>>> +/* Notification data when IOASID status changed */
>>> +enum ioasid_notify_val {
>>> + IOASID_ALLOC = 1,
>>> + IOASID_FREE,
>>> + IOASID_BIND,
>>> + IOASID_UNBIND,
>>> + IOASID_SET_ALLOC,
>>> + IOASID_SET_FREE,
>>> +};
>>> +
>>> +#define IOASID_NOTIFY_ALL BIT(0)
>>> +#define IOASID_NOTIFY_SET BIT(1)
>>> +/**
>>> + * enum ioasid_notifier_prios - IOASID event notification order
>>> + *
>>> + * When status of an IOASID changes, users might need to take
>>> actions to
>>> + * reflect the new state. For example, when an IOASID is freed due
>>> to
>>> + * exception, the hardware context in virtual CPU, DMA device, and
>>> IOMMU
>>> + * shall be cleared and drained. Order is required to prevent life
>>> cycle
>>> + * problems.
>>> + */
>>> +enum ioasid_notifier_prios {
>>> + IOASID_PRIO_LAST,
>>> + IOASID_PRIO_DEVICE,
>>> + IOASID_PRIO_IOMMU,
>>> + IOASID_PRIO_CPU,
>>> +};
>>> +
>>> +/**
>>> + * struct ioasid_nb_args - Argument provided by IOASID core when
>>> notifier
>>> + * is called.
>>> + * @id: The IOASID being notified
>>> + * @spid: The set private ID associated with the IOASID
>>> + * @set: The IOASID set of @id
>>> + * @pdata: The private data attached to the IOASID
>>> + */
>>> +struct ioasid_nb_args {
>>> + ioasid_t id;
>>> + ioasid_t spid;
>>> + struct ioasid_set *set;
>>> + void *pdata;
>>> +};
>>> +
>>> #if IS_ENABLED(CONFIG_IOASID)
>>> void ioasid_install_capacity(ioasid_t total);
>>> ioasid_t ioasid_get_capacity(void);
>>> @@ -75,8 +120,16 @@ void *ioasid_find(struct ioasid_set *set,
>>> ioasid_t ioasid, bool (*getter)(void * int
>>> ioasid_attach_data(ioasid_t ioasid, void *data); int
>>> ioasid_attach_spid(ioasid_t ioasid, ioasid_t spid); ioasid_t
>>> ioasid_find_by_spid(struct ioasid_set *set, ioasid_t spid); +
>>> +int ioasid_register_notifier(struct ioasid_set *set,
>>> + struct notifier_block *nb);
>>> +void ioasid_unregister_notifier(struct ioasid_set *set,
>>> + struct notifier_block *nb);
>>> +
>>> int ioasid_register_allocator(struct ioasid_allocator_ops
>>> *allocator); void ioasid_unregister_allocator(struct
>>> ioasid_allocator_ops *allocator); +
>>> +int ioasid_notify(ioasid_t ioasid, enum ioasid_notify_val cmd,
>>> unsigned int flags); void ioasid_is_in_set(struct ioasid_set *set,
>>> ioasid_t ioasid); int ioasid_get(struct ioasid_set *set, ioasid_t
>>> ioasid); int ioasid_get_locked(struct ioasid_set *set, ioasid_t
>>> ioasid); @@ -85,6 +138,9 @@ void ioasid_put_locked(struct
>>> ioasid_set *set, ioasid_t ioasid); int
>>> ioasid_set_for_each_ioasid(struct ioasid_set *sdata, void
>>> (*fn)(ioasid_t id, void *data), void *data);
>>> +int ioasid_register_notifier_mm(struct mm_struct *mm, struct
>>> notifier_block *nb); +void ioasid_unregister_notifier_mm(struct
>>> mm_struct *mm, struct notifier_block *nb); +
>>> #else /* !CONFIG_IOASID */
>>> static inline void ioasid_install_capacity(ioasid_t total)
>>> {
>>> @@ -124,6 +180,20 @@ static inline void *ioasid_find(struct
>>> ioasid_set *set, ioasid_t ioasid, bool (* return NULL;
>>> }
>>>
>>> +static inline int ioasid_register_notifier(struct notifier_block
>>> *nb) +{
>>> + return -ENOTSUPP;
>>> +}
>>> +
>>> +static inline void ioasid_unregister_notifier(struct
>>> notifier_block *nb) +{
>>> +}
>>> +
>>> +static inline int ioasid_notify(ioasid_t ioasid, enum
>>> ioasid_notify_val cmd, unsigned int flags) +{
>>> + return -ENOTSUPP;
>>> +}
>>> +
>>> static inline int ioasid_register_allocator(struct
>>> ioasid_allocator_ops *allocator) {
>>> return -ENOTSUPP;
>>>
>> Thanks
>>
>> Eric
>>
>> _______________________________________________
>> iommu mailing list
>> [email protected]
>> https://lists.linuxfoundation.org/mailman/listinfo/iommu
>
> [Jacob Pan]
>

2020-09-10 09:21:00

by Eric Auger

[permalink] [raw]
Subject: Re: [PATCH v2 5/9] iommu/ioasid: Introduce ioasid_set private ID

Hi Jacob,

On 9/9/20 12:40 AM, Jacob Pan wrote:
> On Tue, 1 Sep 2020 17:38:44 +0200
> Auger Eric <[email protected]> wrote:
>
>> Hi Jacob,
>> On 8/22/20 6:35 AM, Jacob Pan wrote:
>>> When an IOASID set is used for guest SVA, each VM will acquire its
>>> ioasid_set for IOASID allocations. IOASIDs within the VM must have a
>>> host/physical IOASID backing, mapping between guest and host
>>> IOASIDs can be non-identical. IOASID set private ID (SPID) is
>>> introduced in this patch to be used as guest IOASID. However, the
>>> concept of ioasid_set specific namespace is generic, thus named
>>> SPID.
>>>
>>> As SPID namespace is within the IOASID set, the IOASID core can
>>> provide lookup services at both directions. SPIDs may not be
>>> allocated when its IOASID is allocated, the mapping between SPID
>>> and IOASID is usually established when a guest page table is bound
>>> to a host PASID.
>>>
>>> Signed-off-by: Jacob Pan <[email protected]>
>>> ---
>>> drivers/iommu/ioasid.c | 54
>>> ++++++++++++++++++++++++++++++++++++++++++++++++++
>>> include/linux/ioasid.h | 12 +++++++++++ 2 files changed, 66
>>> insertions(+)
>>>
>>> diff --git a/drivers/iommu/ioasid.c b/drivers/iommu/ioasid.c
>>> index 5f31d63c75b1..c0aef38a4fde 100644
>>> --- a/drivers/iommu/ioasid.c
>>> +++ b/drivers/iommu/ioasid.c
>>> @@ -21,6 +21,7 @@ enum ioasid_state {
>>> * struct ioasid_data - Meta data about ioasid
>>> *
>>> * @id: Unique ID
>>> + * @spid: Private ID unique within a set
>>> * @users Number of active users
>>> * @state Track state of the IOASID
>>> * @set Meta data of the set this IOASID belongs to
>>> @@ -29,6 +30,7 @@ enum ioasid_state {
>>> */
>>> struct ioasid_data {
>>> ioasid_t id;
>>> + ioasid_t spid;
>>> struct ioasid_set *set;
>>> refcount_t users;
>>> enum ioasid_state state;
>>> @@ -326,6 +328,58 @@ int ioasid_attach_data(ioasid_t ioasid, void
>>> *data) EXPORT_SYMBOL_GPL(ioasid_attach_data);
>>>
>>> /**
>>> + * ioasid_attach_spid - Attach ioasid_set private ID to an IOASID
>>> + *
>>> + * @ioasid: the ID to attach
>>> + * @spid: the ioasid_set private ID of @ioasid
>>> + *
>>> + * For IOASID that is already allocated, private ID within the set
>>> can be
>>> + * attached via this API. Future lookup can be done via
>>> ioasid_find.
>> I would remove "For IOASID that is already allocated, private ID
>> within the set can be attached via this API"
> I guess it is implied. Will remove.
>
>>> + */
>>> +int ioasid_attach_spid(ioasid_t ioasid, ioasid_t spid)
>>> +{
>>> + struct ioasid_data *ioasid_data;
>>> + int ret = 0;
>>> +
>>> + spin_lock(&ioasid_allocator_lock);
>> We keep on saying the SPID is local to an IOASID set but we don't
>> check any IOASID set contains this ioasid. It looks a bit weird to me.
> We store ioasid_set inside ioasid_data when an IOASID is allocated, so
> we don't need to search all the ioasid_sets. Perhaps I missed your
> point?
No I think it became clearer ;-)
>
>>> + ioasid_data = xa_load(&active_allocator->xa, ioasid);
>>> +
>>> + if (!ioasid_data) {
>>> + pr_err("No IOASID entry %d to attach SPID %d\n",
>>> + ioasid, spid);
>>> + ret = -ENOENT;
>>> + goto done_unlock;
>>> + }
>>> + ioasid_data->spid = spid;
>> is there any way/need to remove an spid binding?
> For guest SVA, we attach SPID as a guest PASID during bind guest page
> table. Unbind does the opposite, ioasid_attach_spid() with
> spid=INVALID_IOASID clears the binding.
>
> Perhaps add more symmetric functions? i.e.
> ioasid_detach_spid(ioasid_t ioasid)
> ioasid_attach_spid(struct ioasid_set *set, ioasid_t ioasid)
yep make sense

Thanks

Eric
>
>>> +
>>> +done_unlock:
>>> + spin_unlock(&ioasid_allocator_lock);
>>> + return ret;
>>> +}
>>> +EXPORT_SYMBOL_GPL(ioasid_attach_spid);
>>> +
>>> +ioasid_t ioasid_find_by_spid(struct ioasid_set *set, ioasid_t spid)
>>> +{
>>> + struct ioasid_data *entry;
>>> + unsigned long index;
>>> +
>>> + if (!xa_load(&ioasid_sets, set->sid)) {
>>> + pr_warn("Invalid set\n");
>>> + return INVALID_IOASID;
>>> + }
>>> +
>>> + xa_for_each(&set->xa, index, entry) {
>>> + if (spid == entry->spid) {
>>> + pr_debug("Found ioasid %lu by spid %u\n",
>>> index, spid);
>>> + refcount_inc(&entry->users);
>>> + return index;
>>> + }
>>> + }
>>> + return INVALID_IOASID;
>>> +}
>>> +EXPORT_SYMBOL_GPL(ioasid_find_by_spid);
>>> +
>>> +/**
>>> * ioasid_alloc - Allocate an IOASID
>>> * @set: the IOASID set
>>> * @min: the minimum ID (inclusive)
>>> diff --git a/include/linux/ioasid.h b/include/linux/ioasid.h
>>> index 310abe4187a3..d4b3e83672f6 100644
>>> --- a/include/linux/ioasid.h
>>> +++ b/include/linux/ioasid.h
>>> @@ -73,6 +73,8 @@ bool ioasid_is_active(ioasid_t ioasid);
>>>
>>> void *ioasid_find(struct ioasid_set *set, ioasid_t ioasid, bool
>>> (*getter)(void *)); int ioasid_attach_data(ioasid_t ioasid, void
>>> *data); +int ioasid_attach_spid(ioasid_t ioasid, ioasid_t spid);
>>> +ioasid_t ioasid_find_by_spid(struct ioasid_set *set, ioasid_t
>>> spid); int ioasid_register_allocator(struct ioasid_allocator_ops
>>> *allocator); void ioasid_unregister_allocator(struct
>>> ioasid_allocator_ops *allocator); void ioasid_is_in_set(struct
>>> ioasid_set *set, ioasid_t ioasid); @@ -136,5 +138,15 @@ static
>>> inline int ioasid_attach_data(ioasid_t ioasid, void *data) return
>>> -ENOTSUPP; }
>>>
>>> +staic inline int ioasid_attach_spid(ioasid_t ioasid, ioasid_t spid)
>>> +{
>>> + return -ENOTSUPP;
>>> +}
>>> +
>>> +static inline ioasid_t ioasid_find_by_spid(struct ioasid_set *set,
>>> ioasid_t spid) +{
>>> + return -ENOTSUPP;
>>> +}
>>> +
>>> #endif /* CONFIG_IOASID */
>>> #endif /* __LINUX_IOASID_H */
>>>
>> Thanks
>>
>> Eric
>>
>> _______________________________________________
>> iommu mailing list
>> [email protected]
>> https://lists.linuxfoundation.org/mailman/listinfo/iommu
>
> [Jacob Pan]
>

2020-09-10 17:13:32

by Jacob Pan

[permalink] [raw]
Subject: Re: [PATCH v2 9/9] iommu/vt-d: Store guest PASID during bind

On Tue, 1 Sep 2020 19:08:44 +0200
Auger Eric <[email protected]> wrote:

> Hi Jacob,
> On 8/22/20 6:35 AM, Jacob Pan wrote:
> > IOASID core maintains the guest-host mapping in the form of SPID and
> > IOASID. This patch assigns the guest PASID (if valid) as SPID while
> > binding guest page table with a host PASID. This mapping will be
> > used for lookup and notifications.
> >
> > Signed-off-by: Jacob Pan <[email protected]>
> > ---
> > drivers/iommu/intel/svm.c | 2 ++
> > 1 file changed, 2 insertions(+)
> >
> > diff --git a/drivers/iommu/intel/svm.c b/drivers/iommu/intel/svm.c
> > index d8a5efa75095..4c958b1aec4c 100644
> > --- a/drivers/iommu/intel/svm.c
> > +++ b/drivers/iommu/intel/svm.c
> > @@ -406,6 +406,7 @@ int intel_svm_bind_gpasid(struct iommu_domain
> > *domain, struct device *dev, if (data->flags &
> > IOMMU_SVA_GPASID_VAL) { svm->gpasid = data->gpasid;
> > svm->flags |= SVM_FLAG_GUEST_PASID;
> > + ioasid_attach_spid(data->hpasid,
> > data->gpasid);
> don't you want to handle the returned value?
Yes, I also need to add a check for duplicated SPID within a set.

> > }
> > svm->iommu = iommu;
> > /*
> > @@ -517,6 +518,7 @@ int intel_svm_unbind_gpasid(struct device *dev,
> > int pasid) ioasid_attach_data(pasid, NULL);
> > ioasid_notify(pasid, IOASID_UNBIND,
> > IOASID_NOTIFY_SET);
> > + ioasid_attach_spid(pasid,
> > INVALID_IOASID);
> So this answers my previous question ;-) but won't it enter the if
> (!ioasid_data) path and fail to reset the spid?
>
Sorry, i am not following. If there is no ioasid_data then there is no
SPID to be stored.

BTW, I plan to separate the APIs into two.
ioasid_attach_spid
ioasid_detach_spid

Only ioasid_detach_spid will be calling synchronize RCU, then
ioasid_attach_spid can be called under spinlock.

Thanks,

> Eric
> > kfree(svm);
> > }
> > }
> >
>
> _______________________________________________
> iommu mailing list
> [email protected]
> https://lists.linuxfoundation.org/mailman/listinfo/iommu

[Jacob Pan]

Subject: RE: [PATCH v2 4/9] iommu/ioasid: Add reference couting functions

Hi Jacob,

> -----Original Message-----
> From: iommu [mailto:[email protected]] On Behalf Of
> Jacob Pan
> Sent: 22 August 2020 05:35
> To: [email protected]; LKML <[email protected]>;
> Jean-Philippe Brucker <[email protected]>; Lu Baolu
> <[email protected]>; Joerg Roedel <[email protected]>; David
> Woodhouse <[email protected]>
> Cc: Tian, Kevin <[email protected]>; Raj Ashok <[email protected]>; Wu
> Hao <[email protected]>
> Subject: [PATCH v2 4/9] iommu/ioasid: Add reference couting functions
>
> There can be multiple users of an IOASID, each user could have hardware
> contexts associated with the IOASID. In order to align lifecycles,
> reference counting is introduced in this patch. It is expected that when
> an IOASID is being freed, each user will drop a reference only after its
> context is cleared.
>
> Signed-off-by: Jacob Pan <[email protected]>
> ---
> drivers/iommu/ioasid.c | 113
> +++++++++++++++++++++++++++++++++++++++++++++++++
> include/linux/ioasid.h | 4 ++
> 2 files changed, 117 insertions(+)
>
> diff --git a/drivers/iommu/ioasid.c b/drivers/iommu/ioasid.c
> index f73b3dbfc37a..5f31d63c75b1 100644
> --- a/drivers/iommu/ioasid.c
> +++ b/drivers/iommu/ioasid.c
> @@ -717,6 +717,119 @@ int ioasid_set_for_each_ioasid(struct ioasid_set
> *set,
> EXPORT_SYMBOL_GPL(ioasid_set_for_each_ioasid);
>
> /**
> + * IOASID refcounting rules
> + * - ioasid_alloc() set initial refcount to 1
> + *
> + * - ioasid_free() decrement and test refcount.
> + * If refcount is 0, ioasid will be freed. Deleted from the system-wide
> + * xarray as well as per set xarray. The IOASID will be returned to the
> + * pool and available for new allocations.
> + *
> + * If recount is non-zero, mark IOASID as
> IOASID_STATE_FREE_PENDING.
> + * No new reference can be added. The IOASID is not returned to the
> pool
> + * for reuse.
> + * After free, ioasid_get() will return error but ioasid_find() and other
> + * non refcount adding APIs will continue to work until the last reference
> + * is dropped
> + *
> + * - ioasid_get() get a reference on an active IOASID
> + *
> + * - ioasid_put() decrement and test refcount of the IOASID.
> + * If refcount is 0, ioasid will be freed. Deleted from the system-wide
> + * xarray as well as per set xarray. The IOASID will be returned to the
> + * pool and available for new allocations.
> + * Do nothing if refcount is non-zero.
> + *

Is it better to have a return for this based on whether ioasid is freed or not?

I was going through Jean's SMMUv3 SVA patches[1] and that one returns true
if ioasid was freed. And that info is subsequently used to reset the pasid associated
with a mm. Though, not sure that is still relevant or not.

Thanks,
Shameer
1. https://lore.kernel.org/linux-iommu/[email protected]/

> + * - ioasid_find() does not take reference, caller must hold reference
> + *
> + * ioasid_free() can be called multiple times without error until all refs are
> + * dropped.
> + */
> +
> +int ioasid_get_locked(struct ioasid_set *set, ioasid_t ioasid)
> +{
> + struct ioasid_data *data;
> +
> + data = xa_load(&active_allocator->xa, ioasid);
> + if (!data) {
> + pr_err("Trying to get unknown IOASID %u\n", ioasid);
> + return -EINVAL;
> + }
> + if (data->state == IOASID_STATE_FREE_PENDING) {
> + pr_err("Trying to get IOASID being freed%u\n", ioasid);
> + return -EBUSY;
> + }
> +
> + if (set && data->set != set) {
> + pr_err("Trying to get IOASID not in set%u\n", ioasid);
> + /* data found but does not belong to the set */
> + return -EACCES;
> + }
> + refcount_inc(&data->users);
> +
> + return 0;
> +}
> +EXPORT_SYMBOL_GPL(ioasid_get_locked);
> +
> +/**
> + * ioasid_get - Obtain a reference of an ioasid
> + * @set
> + * @ioasid
> + *
> + * Check set ownership if @set is non-null.
> + */
> +int ioasid_get(struct ioasid_set *set, ioasid_t ioasid)
> +{
> + int ret = 0;
> +
> + spin_lock(&ioasid_allocator_lock);
> + ret = ioasid_get_locked(set, ioasid);
> + spin_unlock(&ioasid_allocator_lock);
> +
> + return ret;
> +}
> +EXPORT_SYMBOL_GPL(ioasid_get);
> +
> +void ioasid_put_locked(struct ioasid_set *set, ioasid_t ioasid)
> +{
> + struct ioasid_data *data;
> +
> + data = xa_load(&active_allocator->xa, ioasid);
> + if (!data) {
> + pr_err("Trying to put unknown IOASID %u\n", ioasid);
> + return;
> + }
> +
> + if (set && data->set != set) {
> + pr_err("Trying to drop IOASID not in the set %u\n", ioasid);
> + return;
> + }
> +
> + if (!refcount_dec_and_test(&data->users)) {
> + pr_debug("%s: IOASID %d has %d remainning users\n",
> + __func__, ioasid, refcount_read(&data->users));
> + return;
> + }
> + ioasid_do_free(data);
> +}
> +EXPORT_SYMBOL_GPL(ioasid_put_locked);
> +
> +/**
> + * ioasid_put - Drop a reference of an ioasid
> + * @set
> + * @ioasid
> + *
> + * Check set ownership if @set is non-null.
> + */
> +void ioasid_put(struct ioasid_set *set, ioasid_t ioasid)
> +{
> + spin_lock(&ioasid_allocator_lock);
> + ioasid_put_locked(set, ioasid);
> + spin_unlock(&ioasid_allocator_lock);
> +}
> +EXPORT_SYMBOL_GPL(ioasid_put);
> +
> +/**
> * ioasid_find - Find IOASID data
> * @set: the IOASID set
> * @ioasid: the IOASID to find
> diff --git a/include/linux/ioasid.h b/include/linux/ioasid.h
> index 412d025d440e..310abe4187a3 100644
> --- a/include/linux/ioasid.h
> +++ b/include/linux/ioasid.h
> @@ -76,6 +76,10 @@ int ioasid_attach_data(ioasid_t ioasid, void *data);
> int ioasid_register_allocator(struct ioasid_allocator_ops *allocator);
> void ioasid_unregister_allocator(struct ioasid_allocator_ops *allocator);
> void ioasid_is_in_set(struct ioasid_set *set, ioasid_t ioasid);
> +int ioasid_get(struct ioasid_set *set, ioasid_t ioasid);
> +int ioasid_get_locked(struct ioasid_set *set, ioasid_t ioasid);
> +void ioasid_put(struct ioasid_set *set, ioasid_t ioasid);
> +void ioasid_put_locked(struct ioasid_set *set, ioasid_t ioasid);
> int ioasid_set_for_each_ioasid(struct ioasid_set *sdata,
> void (*fn)(ioasid_t id, void *data),
> void *data);
> --
> 2.7.4
>
> _______________________________________________
> iommu mailing list
> [email protected]
> https://lists.linuxfoundation.org/mailman/listinfo/iommu