I/O Address Space ID (IOASID) core code was introduced in v5.5 as a generic
kernel allocator service for both PCIe Process Address Space ID (PASID) and
ARM SMMU's Substream ID. IOASIDs are used to associate DMA requests with
virtual address spaces, including both host and guest.
In addition to providing basic ID allocation, ioasid_set was defined as a
token shared by a group of IOASIDs. This set token can be used for
permission checking, but it lacks some features needed to address the
following requirements of guest Shared Virtual Address (SVA):
- Manage IOASIDs by group, group ownership, quota, etc.
- State synchronization among IOASID users (e.g. IOMMU driver, KVM, device
drivers)
- Non-identity guest-host IOASID mapping
- Lifecycle management
This patchset introduces the following extensions as solutions to the
problems above.
- Redefine and extend IOASID set such that IOASIDs can be managed by groups/pools.
- Add notifications for IOASID state synchronization
- Extend reference counting for life cycle alignment among multiple users
- Support ioasid_set private IDs, which can be used as guest IOASIDs
- Add a new cgroup controller for resource distribution
Please refer to Documentation/admin-guide/cgroup-v1/ioasids.rst and
Documentation/driver-api/ioasid.rst in the enclosed patches for more
details.
Based on discussions on LKML[1], a direction change was made in v4 such
that the user interfaces for IOASID allocation were extracted from the
VFIO subsystem. The proposed IOASID subsystem now consists of three
components:
1. IOASID core[01-14]: provides APIs for allocation, pool management,
notifications, and refcounting.
2. IOASID cgroup controller[RFC 15-17]: manages resource distribution[2].
3. IOASID user[RFC 18]: provides a userspace allocation interface via /dev/ioasid
This patchset includes only the VT-d driver as a user of some of the new
APIs. VFIO and KVM patches that fully utilize these APIs are coming up.
[1] https://lore.kernel.org/linux-iommu/[email protected]/
[2] Note that ioasid quota management code can be removed once the IOASIDs
cgroup is ratified.
You can find this series, VFIO, KVM, and IOASID user at:
https://github.com/jacobpan/linux.git ioasid_v4
(VFIO and KVM patches will be available at this branch when published.)
This work is a result of collaboration with many people:
Liu, Yi L <[email protected]>
Wu Hao <[email protected]>
Ashok Raj <[email protected]>
Kevin Tian <[email protected]>
Thanks,
Jacob
Changelog:
v4
- Introduced IOASIDs cgroup controller
- Introduced /dev/ioasid user API for allocation/free
- Added IOASID states and a free function; aligned refcounting with the
  v5.11 changes introduced by Jean-Philippe.
- Support iommu-sva-lib (will converge VT-d code afterward)
- Added a shared ordered workqueue for notification work that requires
thread context. Streamlined notification framework among multiple IOASID
users.
- Added ioasid_set helper functions for performing per-set operations
V3:
- Use consistent ioasid_set_ prefix for ioasid_set level APIs
- Make SPID and private detach/attach APIs symmetric
- Use the same ioasid_put semantics as Jean-Philippe's IOASID reference patch
- Take away the public ioasid_notify() function, notifications are now emitted
by IOASID core as a result of certain IOASID APIs
- Partition into finer incremental patches
- Miscellaneous cleanup, locking, exception handling fixes based on v2 reviews
V2:
- Redesigned ioasid_set APIs, removed set ID
- Added set private ID (SPID) for guest PASID usage.
- Added per ioasid_set notification and priority support.
- Went back to using spinlocks and atomic notifications.
- Added async work in VT-d driver to perform teardown outside atomic context
Jacob Pan (17):
docs: Document IO Address Space ID (IOASID) APIs
iommu/ioasid: Rename ioasid_set_data()
iommu/ioasid: Add a separate function for detach data
iommu/ioasid: Support setting system-wide capacity
iommu/ioasid: Redefine IOASID set and allocation APIs
iommu/ioasid: Add free function and states
iommu/ioasid: Add ioasid_set iterator helper functions
iommu/ioasid: Introduce ioasid_set private ID
iommu/ioasid: Introduce notification APIs
iommu/ioasid: Support mm token type ioasid_set notifications
iommu/ioasid: Add ownership check in guest bind
iommu/vt-d: Remove mm reference for guest SVA
iommu/ioasid: Add a workqueue for cleanup work
iommu/vt-d: Listen to IOASID notifications
cgroup: Introduce ioasids controller
iommu/ioasid: Consult IOASIDs cgroup for allocation
docs: cgroup-v1: Add IOASIDs controller
Liu Yi L (1):
ioasid: Add /dev/ioasid for userspace
Documentation/admin-guide/cgroup-v1/index.rst | 1 +
.../admin-guide/cgroup-v1/ioasids.rst | 107 ++
Documentation/driver-api/index.rst | 1 +
Documentation/driver-api/ioasid.rst | 510 +++++++++
Documentation/userspace-api/index.rst | 1 +
Documentation/userspace-api/ioasid.rst | 49 +
drivers/iommu/Kconfig | 5 +
drivers/iommu/Makefile | 1 +
.../iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c | 1 +
drivers/iommu/intel/Kconfig | 1 +
drivers/iommu/intel/iommu.c | 32 +-
drivers/iommu/intel/pasid.h | 1 +
drivers/iommu/intel/svm.c | 145 ++-
drivers/iommu/ioasid.c | 983 +++++++++++++++++-
drivers/iommu/ioasid_user.c | 297 ++++++
drivers/iommu/iommu-sva-lib.c | 19 +-
drivers/iommu/iommu.c | 16 +-
include/linux/cgroup_subsys.h | 4 +
include/linux/intel-iommu.h | 2 +
include/linux/ioasid.h | 256 ++++-
include/linux/miscdevice.h | 1 +
include/uapi/linux/ioasid.h | 98 ++
init/Kconfig | 7 +
kernel/cgroup/Makefile | 1 +
kernel/cgroup/ioasids.c | 345 ++++++
25 files changed, 2794 insertions(+), 90 deletions(-)
create mode 100644 Documentation/admin-guide/cgroup-v1/ioasids.rst
create mode 100644 Documentation/driver-api/ioasid.rst
create mode 100644 Documentation/userspace-api/ioasid.rst
create mode 100644 drivers/iommu/ioasid_user.c
create mode 100644 include/uapi/linux/ioasid.h
create mode 100644 kernel/cgroup/ioasids.c
--
2.25.1
As a system-wide resource, an IOASID is often shared by multiple kernel
subsystems that are independent of each other. However, at the
ioasid_set level, these kernel subsystems must communicate with each
other for ownership checking, event notifications, etc. For example, on
Intel Scalable IO Virtualization (SIOV) enabled platforms, KVM and VFIO
instances under the same process/guest must be aware of a shared IOASID
set.
The IOASID_SET_TYPE_MM token type was introduced to explicitly mark an
IOASID set that belongs to a process, using its mm_struct pointer as the
token. Users within the same process can then identify one another based
on this token.
This patch introduces MM token specific event registration APIs. Event
subscribers, such as KVM instances, can register an IOASID event handler
without knowledge of the ioasid_set; handlers are registered with the
mm_struct pointer as a token. When a subscriber registers a handler
*prior* to the creation of the ioasid_set, the handler's notification
block is stored in a pending list within the IOASID core. Once the
ioasid_set for the MM token is created, the notification block is
registered by the IOASID core.
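To illustrate, a subscriber-side sketch might look as follows (the
example_* names are hypothetical and not part of this series; the APIs
and event/priority constants are the ones added by these patches):

#include <linux/ioasid.h>
#include <linux/notifier.h>
#include <linux/printk.h>

/* Hypothetical subscriber (e.g. a KVM instance) reacting to free events */
static int example_ioasid_event(struct notifier_block *nb,
				unsigned long cmd, void *data)
{
	struct ioasid_nb_args *args = data;

	if (cmd == IOASID_NOTIFY_FREE)
		pr_debug("IOASID %u freed, tear down vCPU state\n", args->id);
	return NOTIFY_OK;
}

static struct notifier_block example_nb = {
	.notifier_call	= example_ioasid_event,
	/* Highest priority: vCPU-side state is handled first on teardown */
	.priority	= IOASID_PRIO_CPU,
};

/* Safe to call before or after the ioasid_set for this mm is created */
static int example_subscribe(struct mm_struct *mm)
{
	return ioasid_register_notifier_mm(mm, &example_nb);
}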
Signed-off-by: Liu Yi L <[email protected]>
Signed-off-by: Wu Hao <[email protected]>
Signed-off-by: Jacob Pan <[email protected]>
---
drivers/iommu/ioasid.c | 142 +++++++++++++++++++++++++++++++++++++++++
include/linux/ioasid.h | 18 ++++++
2 files changed, 160 insertions(+)
diff --git a/drivers/iommu/ioasid.c b/drivers/iommu/ioasid.c
index 56577e745c4b..96e941dfada7 100644
--- a/drivers/iommu/ioasid.c
+++ b/drivers/iommu/ioasid.c
@@ -21,6 +21,8 @@
* keep local states in sync.
*/
static ATOMIC_NOTIFIER_HEAD(ioasid_notifier);
+/* List to hold pending notification block registrations */
+static LIST_HEAD(ioasid_nb_pending_list);
static DEFINE_SPINLOCK(ioasid_nb_lock);
/* Default to PCIe standard 20 bit PASID */
@@ -574,6 +576,27 @@ static inline bool ioasid_set_is_valid(struct ioasid_set *set)
return xa_load(&ioasid_sets, set->id) == set;
}
+static void ioasid_add_pending_nb(struct ioasid_set *set)
+{
+ struct ioasid_set_nb *curr;
+
+ if (set->type != IOASID_SET_TYPE_MM)
+ return;
+ /*
+ * Check if there are any pending nb requests for the given token, if so
+ * add them to the notifier chain.
+ */
+ spin_lock(&ioasid_nb_lock);
+ list_for_each_entry(curr, &ioasid_nb_pending_list, list) {
+ if (curr->token == set->token && !curr->active) {
+ atomic_notifier_chain_register(&set->nh, curr->nb);
+ curr->set = set;
+ curr->active = true;
+ }
+ }
+ spin_unlock(&ioasid_nb_lock);
+}
+
/**
* ioasid_set_alloc - Allocate a new IOASID set for a given token
*
@@ -658,6 +681,11 @@ struct ioasid_set *ioasid_set_alloc(void *token, ioasid_t quota, int type)
atomic_set(&set->nr_ioasids, 0);
ATOMIC_INIT_NOTIFIER_HEAD(&set->nh);
+ /*
+ * Check if there are any pending nb requests for the given token, if so
+ * add them to the notifier chain.
+ */
+ ioasid_add_pending_nb(set);
/*
* Per set XA is used to store private IDs within the set, get ready
* for ioasid_set private ID and system-wide IOASID allocation
@@ -675,6 +703,7 @@ EXPORT_SYMBOL_GPL(ioasid_set_alloc);
static int ioasid_set_free_locked(struct ioasid_set *set)
{
+ struct ioasid_set_nb *curr;
int ret = 0;
if (!ioasid_set_is_valid(set)) {
@@ -688,6 +717,16 @@ static int ioasid_set_free_locked(struct ioasid_set *set)
}
WARN_ON(!xa_empty(&set->xa));
+ /* Restore pending status of the set NBs */
+ list_for_each_entry(curr, &ioasid_nb_pending_list, list) {
+ if (curr->token == set->token) {
+ if (curr->active)
+ curr->active = false;
+ else
+ pr_warn("Set token exists but not active!\n");
+ }
+ }
+
/*
* Token got released right away after the ioasid_set is freed.
* If a new set is created immediately with the newly released token,
@@ -1117,6 +1156,22 @@ EXPORT_SYMBOL_GPL(ioasid_register_notifier);
void ioasid_unregister_notifier(struct ioasid_set *set,
struct notifier_block *nb)
{
+ struct ioasid_set_nb *curr;
+
+ spin_lock(&ioasid_nb_lock);
+ /*
+ * Pending list is registered with a token without an ioasid_set,
+ * therefore should not be unregistered directly.
+ */
+ list_for_each_entry(curr, &ioasid_nb_pending_list, list) {
+ if (curr->nb == nb) {
+ pr_warn("Cannot unregister NB from pending list\n");
+ spin_unlock(&ioasid_nb_lock);
+ return;
+ }
+ }
+ spin_unlock(&ioasid_nb_lock);
+
if (set)
atomic_notifier_chain_unregister(&set->nh, nb);
else
@@ -1124,6 +1179,93 @@ void ioasid_unregister_notifier(struct ioasid_set *set,
}
EXPORT_SYMBOL_GPL(ioasid_unregister_notifier);
+/**
+ * ioasid_register_notifier_mm - Register a notifier block on the IOASID set
+ * created by the mm_struct pointer as the token
+ *
+ * @mm: the mm_struct token of the ioasid_set
+ * @nb: notifier block to be registered on the ioasid_set
+ *
+ * This is a variant of ioasid_register_notifier() where the caller intends to
+ * listen to IOASID events belonging to the ioasid_set created under the same
+ * process. The caller is not aware of the ioasid_set and therefore does not
+ * need to hold a reference to it.
+ */
+int ioasid_register_notifier_mm(struct mm_struct *mm, struct notifier_block *nb)
+{
+ struct ioasid_set_nb *curr;
+ struct ioasid_set *set;
+ int ret = 0;
+
+ spin_lock(&ioasid_nb_lock);
+ /* Check for duplicates, nb is unique per set */
+ list_for_each_entry(curr, &ioasid_nb_pending_list, list) {
+ if (curr->token == mm && curr->nb == nb) {
+ ret = -EBUSY;
+ goto exit_unlock;
+ }
+ }
+ curr = kzalloc(sizeof(*curr), GFP_ATOMIC);
+ if (!curr) {
+ ret = -ENOMEM;
+ goto exit_unlock;
+ }
+ /* Check if the token has an existing set */
+ set = ioasid_find_mm_set(mm);
+ if (!set) {
+ /* Add to the rsvd list as inactive */
+ curr->active = false;
+ } else {
+ /* REVISIT: Only register empty set for now. Can add an option
+ * in the future to playback existing PASIDs.
+ */
+ if (atomic_read(&set->nr_ioasids)) {
+ pr_warn("IOASID set %d not empty %d\n", set->id,
+ atomic_read(&set->nr_ioasids));
+ ret = -EBUSY;
+ goto exit_free;
+ }
+ curr->token = mm;
+ curr->nb = nb;
+ curr->active = true;
+ curr->set = set;
+
+ /* Set already created, add to the notifier chain */
+ atomic_notifier_chain_register(&set->nh, nb);
+ }
+
+ list_add(&curr->list, &ioasid_nb_pending_list);
+ goto exit_unlock;
+exit_free:
+ kfree(curr);
+exit_unlock:
+ spin_unlock(&ioasid_nb_lock);
+ return ret;
+}
+EXPORT_SYMBOL_GPL(ioasid_register_notifier_mm);
+
+void ioasid_unregister_notifier_mm(struct mm_struct *mm, struct notifier_block *nb)
+{
+ struct ioasid_set_nb *curr;
+
+ spin_lock(&ioasid_nb_lock);
+ list_for_each_entry(curr, &ioasid_nb_pending_list, list) {
+ if (curr->token == mm && curr->nb == nb) {
+ list_del(&curr->list);
+ spin_unlock(&ioasid_nb_lock);
+ if (curr->active) {
+ atomic_notifier_chain_unregister(&curr->set->nh,
+ nb);
+ }
+ kfree(curr);
+ return;
+ }
+ }
+ pr_warn("No ioasid set found for mm token %llx\n", (u64)mm);
+ spin_unlock(&ioasid_nb_lock);
+}
+EXPORT_SYMBOL_GPL(ioasid_unregister_notifier_mm);
+
MODULE_AUTHOR("Jean-Philippe Brucker <[email protected]>");
MODULE_AUTHOR("Jacob Pan <[email protected]>");
MODULE_DESCRIPTION("IO Address Space ID (IOASID) allocator");
diff --git a/include/linux/ioasid.h b/include/linux/ioasid.h
index d8b85a04214f..c97e80ff65cc 100644
--- a/include/linux/ioasid.h
+++ b/include/linux/ioasid.h
@@ -132,6 +132,8 @@ void ioasid_unregister_notifier(struct ioasid_set *set,
void ioasid_set_for_each_ioasid(struct ioasid_set *sdata,
void (*fn)(ioasid_t id, void *data),
void *data);
+int ioasid_register_notifier_mm(struct mm_struct *mm, struct notifier_block *nb);
+void ioasid_unregister_notifier_mm(struct mm_struct *mm, struct notifier_block *nb);
#else /* !CONFIG_IOASID */
static inline void ioasid_install_capacity(ioasid_t total)
{
@@ -250,5 +252,21 @@ static inline void ioasid_set_for_each_ioasid(struct ioasid_set *sdata,
void *data)
{
}
+
+static inline int ioasid_register_notifier_mm(struct mm_struct *mm,
+ struct notifier_block *nb)
+{
+ return -ENOTSUPP;
+}
+
+static inline void ioasid_unregister_notifier_mm(struct mm_struct *mm,
+ struct notifier_block *nb)
+{
+}
+
+static inline bool ioasid_queue_work(struct work_struct *work)
+{
+ return false;
+}
#endif /* CONFIG_IOASID */
#endif /* __LINUX_IOASID_H */
--
2.25.1
When an IOASID set is used for guest SVA, each VM acquires its own
ioasid_set for IOASID allocations. IOASIDs within the VM must have
host/physical IOASID backing, and the mapping between guest and host
IOASIDs can be non-identical. This patch introduces the IOASID set
private ID (SPID) to be used as the guest IOASID. The concept of an
ioasid_set-specific namespace is generic, however, hence the name SPID.
Since the SPID namespace is scoped to an IOASID set, the IOASID core can
provide lookup services in both directions. An SPID may not be available
when its IOASID is allocated; the mapping between SPID and IOASID is
usually established when a guest page table is bound to a host PASID.
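As an illustration, the intended flow might look like this sketch
(example_map_guest_pasid() is hypothetical; only the ioasid_*_spid()
APIs used below are introduced by this patch):

#include <linux/ioasid.h>

/*
 * Sketch: establishing and using a guest (SPID) to host (IOASID) mapping,
 * typically done when a guest page table is bound to a host PASID.
 */
static int example_map_guest_pasid(struct ioasid_set *vm_set,
				   ioasid_t host_pasid, ioasid_t guest_pasid)
{
	int ret;

	/* Record the guest PASID as the set-private ID of the host PASID */
	ret = ioasid_attach_spid(host_pasid, guest_pasid);
	if (ret)
		return ret;

	/* Later, look up the host PASID by guest PASID (no reference taken) */
	if (ioasid_find_by_spid(vm_set, guest_pasid, false) != host_pasid)
		return -EINVAL;

	ioasid_detach_spid(host_pasid);
	return 0;
}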
Signed-off-by: Jacob Pan <[email protected]>
---
drivers/iommu/ioasid.c | 104 +++++++++++++++++++++++++++++++++++++++++
include/linux/ioasid.h | 18 +++++++
2 files changed, 122 insertions(+)
diff --git a/drivers/iommu/ioasid.c b/drivers/iommu/ioasid.c
index 9a3ba157dec3..7707bb608bdd 100644
--- a/drivers/iommu/ioasid.c
+++ b/drivers/iommu/ioasid.c
@@ -26,6 +26,7 @@ enum ioasid_state {
* struct ioasid_data - Meta data about ioasid
*
* @id: Unique ID
+ * @spid: Private ID unique within a set
* @refs: Number of active users
* @state: Track state of the IOASID
* @set: ioasid_set of the IOASID belongs to
@@ -34,6 +35,7 @@ enum ioasid_state {
*/
struct ioasid_data {
ioasid_t id;
+ ioasid_t spid;
enum ioasid_state state;
struct ioasid_set *set;
void *private;
@@ -413,6 +415,107 @@ void ioasid_detach_data(ioasid_t ioasid)
}
EXPORT_SYMBOL_GPL(ioasid_detach_data);
+static ioasid_t ioasid_find_by_spid_locked(struct ioasid_set *set, ioasid_t spid, bool get)
+{
+ ioasid_t ioasid = INVALID_IOASID;
+ struct ioasid_data *entry;
+ unsigned long index;
+
+ if (!xa_load(&ioasid_sets, set->id)) {
+ pr_warn("Invalid set\n");
+ goto done;
+ }
+
+ xa_for_each(&set->xa, index, entry) {
+ if (spid == entry->spid) {
+ if (get)
+ refcount_inc(&entry->refs);
+ ioasid = index;
+ }
+ }
+done:
+ return ioasid;
+}
+
+/**
+ * ioasid_attach_spid - Attach ioasid_set private ID to an IOASID
+ *
+ * @ioasid: the system-wide IOASID to attach
+ * @spid: the ioasid_set private ID of @ioasid
+ *
+ * After attaching the SPID, future lookups can be done via ioasid_find_by_spid().
+ */
+int ioasid_attach_spid(ioasid_t ioasid, ioasid_t spid)
+{
+ struct ioasid_data *data;
+ int ret = 0;
+
+ if (spid == INVALID_IOASID)
+ return -EINVAL;
+
+ spin_lock(&ioasid_allocator_lock);
+ data = xa_load(&active_allocator->xa, ioasid);
+
+ if (!data) {
+ pr_err("No IOASID entry %d to attach SPID %d\n",
+ ioasid, spid);
+ ret = -ENOENT;
+ goto done_unlock;
+ }
+ /* Check if SPID is unique within the set */
+ if (ioasid_find_by_spid_locked(data->set, spid, false) != INVALID_IOASID) {
+ ret = -EINVAL;
+ goto done_unlock;
+ }
+ data->spid = spid;
+
+done_unlock:
+ spin_unlock(&ioasid_allocator_lock);
+ return ret;
+}
+EXPORT_SYMBOL_GPL(ioasid_attach_spid);
+
+void ioasid_detach_spid(ioasid_t ioasid)
+{
+ struct ioasid_data *data;
+
+ spin_lock(&ioasid_allocator_lock);
+ data = xa_load(&active_allocator->xa, ioasid);
+
+ if (!data || data->spid == INVALID_IOASID) {
+ pr_err("Invalid IOASID entry %d to detach\n", ioasid);
+ goto done_unlock;
+ }
+ data->spid = INVALID_IOASID;
+
+done_unlock:
+ spin_unlock(&ioasid_allocator_lock);
+}
+EXPORT_SYMBOL_GPL(ioasid_detach_spid);
+
+/**
+ * ioasid_find_by_spid - Find the system-wide IOASID by a set private ID and
+ * its set.
+ *
+ * @set: the ioasid_set to search within
+ * @spid: the set private ID
+ * @get: flag indicating whether to take a reference once found
+ *
+ * Given a set private ID and its IOASID set, find the system-wide IOASID. Take
+ * a reference upon finding the matching IOASID if @get is true. Return
+ * INVALID_IOASID if the IOASID is not found in the set or the set is not valid.
+ */
+ioasid_t ioasid_find_by_spid(struct ioasid_set *set, ioasid_t spid, bool get)
+{
+ ioasid_t ioasid;
+
+ spin_lock(&ioasid_allocator_lock);
+ ioasid = ioasid_find_by_spid_locked(set, spid, get);
+ spin_unlock(&ioasid_allocator_lock);
+ return ioasid;
+}
+EXPORT_SYMBOL_GPL(ioasid_find_by_spid);
+
static inline bool ioasid_set_is_valid(struct ioasid_set *set)
{
return xa_load(&ioasid_sets, set->id) == set;
@@ -616,6 +719,7 @@ ioasid_t ioasid_alloc(struct ioasid_set *set, ioasid_t min, ioasid_t max,
}
data->id = id;
data->state = IOASID_STATE_IDLE;
+ data->spid = INVALID_IOASID;
/* Store IOASID in the per set data */
if (xa_err(xa_store(&set->xa, id, data, GFP_ATOMIC))) {
diff --git a/include/linux/ioasid.h b/include/linux/ioasid.h
index e7f3e6108724..dcab02886cb5 100644
--- a/include/linux/ioasid.h
+++ b/include/linux/ioasid.h
@@ -81,6 +81,9 @@ int ioasid_register_allocator(struct ioasid_allocator_ops *allocator);
void ioasid_unregister_allocator(struct ioasid_allocator_ops *allocator);
int ioasid_attach_data(ioasid_t ioasid, void *data);
void ioasid_detach_data(ioasid_t ioasid);
+int ioasid_attach_spid(ioasid_t ioasid, ioasid_t spid);
+void ioasid_detach_spid(ioasid_t ioasid);
+ioasid_t ioasid_find_by_spid(struct ioasid_set *set, ioasid_t spid, bool get);
void ioasid_set_for_each_ioasid(struct ioasid_set *sdata,
void (*fn)(ioasid_t id, void *data),
void *data);
@@ -173,6 +176,21 @@ static inline struct ioasid_set *ioasid_find_set(ioasid_t ioasid)
return ERR_PTR(-ENOTSUPP);
}
+static inline int ioasid_attach_spid(ioasid_t ioasid, ioasid_t spid)
+{
+ return -ENOTSUPP;
+}
+
+static inline void ioasid_detach_spid(ioasid_t ioasid)
+{
+}
+
+static inline ioasid_t ioasid_find_by_spid(struct ioasid_set *set,
+ ioasid_t spid, bool get)
+{
+ return INVALID_IOASID;
+}
+
static inline void ioasid_set_for_each_ioasid(struct ioasid_set *sdata,
void (*fn)(ioasid_t id, void *data),
void *data)
--
2.25.1
An IOASID can have multiple users, such as the IOMMU driver, KVM, and
device drivers. The atomic IOASID notifier is used to inform users of an
IOASID state change. For example, the IOASID_NOTIFY_UNBIND event is issued when
the IOASID is no longer bound to an address space. This requires ordered
actions among users to tear down their contexts.
Not all work can be handled in the atomic notifier handler. This patch
introduces a shared, ordered workqueue for all IOASID users who wish to
perform work asynchronously upon notification.
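A minimal, hypothetical user of this facility could look like the sketch
below (the example_* names are illustrative; args->pdata is assumed to
carry the private data attached via ioasid_attach_data() at bind time):

#include <linux/ioasid.h>
#include <linux/kernel.h>
#include <linux/notifier.h>
#include <linux/workqueue.h>

/* Hypothetical per-IOASID context owned by a user of the notifier */
struct example_ctx {
	struct work_struct work;	/* INIT_WORK()-ed at bind time */
	ioasid_t pasid;
};

/* Runs in thread context on the shared ordered IOASID workqueue */
static void example_teardown_fn(struct work_struct *work)
{
	struct example_ctx *ctx = container_of(work, struct example_ctx, work);

	/* Sleepable teardown for ctx->pasid goes here, e.g. draining queues */
}

static int example_event(struct notifier_block *nb, unsigned long cmd,
			 void *data)
{
	struct ioasid_nb_args *args = data;
	struct example_ctx *ctx = args->pdata;	/* attached at bind time */

	if (cmd != IOASID_NOTIFY_FREE || !ctx)
		return NOTIFY_DONE;
	/* Atomic context here: defer the sleepable part to the ordered queue */
	if (!ioasid_queue_work(&ctx->work))
		pr_warn("teardown for IOASID %u already queued\n", args->id);
	return NOTIFY_OK;
}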
Signed-off-by: Jacob Pan <[email protected]>
---
drivers/iommu/ioasid.c | 25 +++++++++++++++++++++++++
include/linux/ioasid.h | 1 +
2 files changed, 26 insertions(+)
diff --git a/drivers/iommu/ioasid.c b/drivers/iommu/ioasid.c
index 28a2e9b6594d..d42b39ca2c8b 100644
--- a/drivers/iommu/ioasid.c
+++ b/drivers/iommu/ioasid.c
@@ -32,6 +32,9 @@ static ioasid_t ioasid_capacity = PCI_PASID_MAX;
static ioasid_t ioasid_capacity_avail = PCI_PASID_MAX;
static DEFINE_XARRAY_ALLOC(ioasid_sets);
+/* Workqueue for IOASID users to do cleanup upon notification */
+static struct workqueue_struct *ioasid_wq;
+
struct ioasid_set_nb {
struct list_head list;
struct notifier_block *nb;
@@ -1281,6 +1284,12 @@ int ioasid_register_notifier_mm(struct mm_struct *mm, struct notifier_block *nb)
}
EXPORT_SYMBOL_GPL(ioasid_register_notifier_mm);
+bool ioasid_queue_work(struct work_struct *work)
+{
+ return queue_work(ioasid_wq, work);
+}
+EXPORT_SYMBOL_GPL(ioasid_queue_work);
+
void ioasid_unregister_notifier_mm(struct mm_struct *mm, struct notifier_block *nb)
{
struct ioasid_set_nb *curr;
@@ -1303,7 +1312,23 @@ void ioasid_unregister_notifier_mm(struct mm_struct *mm, struct notifier_block *
}
EXPORT_SYMBOL_GPL(ioasid_unregister_notifier_mm);
+static int __init ioasid_init(void)
+{
+ ioasid_wq = alloc_ordered_workqueue("ioasid_wq", 0);
+ if (!ioasid_wq)
+ return -ENOMEM;
+
+ return 0;
+}
+
+static void __exit ioasid_cleanup(void)
+{
+ destroy_workqueue(ioasid_wq);
+}
+
MODULE_AUTHOR("Jean-Philippe Brucker <[email protected]>");
MODULE_AUTHOR("Jacob Pan <[email protected]>");
MODULE_DESCRIPTION("IO Address Space ID (IOASID) allocator");
MODULE_LICENSE("GPL");
+module_init(ioasid_init);
+module_exit(ioasid_cleanup);
diff --git a/include/linux/ioasid.h b/include/linux/ioasid.h
index 9624b665f810..4547086797df 100644
--- a/include/linux/ioasid.h
+++ b/include/linux/ioasid.h
@@ -135,6 +135,7 @@ void ioasid_set_for_each_ioasid(struct ioasid_set *sdata,
void *data);
int ioasid_register_notifier_mm(struct mm_struct *mm, struct notifier_block *nb);
void ioasid_unregister_notifier_mm(struct mm_struct *mm, struct notifier_block *nb);
+bool ioasid_queue_work(struct work_struct *work);
#else /* !CONFIG_IOASID */
static inline void ioasid_install_capacity(ioasid_t total)
{
--
2.25.1
On Intel Scalable I/O Virtualization (SIOV) enabled platforms, the IOMMU
driver is one of the users of IOASIDs. In the normal flow, callers
perform IOASID allocation, bind, unbind, and free in order. However, for
guest SVA, an IOASID free can come before unbind since the guest is
untrusted.
This patch registers an IOASID notification handler so that the IOMMU
driver can perform PASID teardown upon receiving an unexpected IOASID
free event.
Signed-off-by: Jacob Pan <[email protected]>
---
drivers/iommu/intel/iommu.c | 2 +
drivers/iommu/intel/svm.c | 109 +++++++++++++++++++++++++++++++++++-
include/linux/intel-iommu.h | 2 +
3 files changed, 111 insertions(+), 2 deletions(-)
diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index eb9868061545..d602e89c40d2 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -3313,6 +3313,8 @@ static int __init init_dmars(void)
pr_err("Failed to allocate host PASID set %lu\n",
PTR_ERR(host_pasid_set));
intel_iommu_sm = 0;
+ } else {
+ intel_svm_add_pasid_notifier();
}
}
diff --git a/drivers/iommu/intel/svm.c b/drivers/iommu/intel/svm.c
index f75699ddb923..b5bb9b578281 100644
--- a/drivers/iommu/intel/svm.c
+++ b/drivers/iommu/intel/svm.c
@@ -96,6 +96,104 @@ static inline bool intel_svm_capable(struct intel_iommu *iommu)
return iommu->flags & VTD_FLAG_SVM_CAPABLE;
}
+static inline void intel_svm_drop_pasid(ioasid_t pasid)
+{
+ /*
+ * Detaching SPID results in UNBIND notification on the set, we must
+ * do this before dropping the IOASID reference, otherwise the
+ * notification chain may get destroyed.
+ */
+ ioasid_detach_spid(pasid);
+ ioasid_detach_data(pasid);
+ ioasid_put(NULL, pasid);
+}
+
+static DEFINE_MUTEX(pasid_mutex);
+#define pasid_lock_held() lock_is_held(&pasid_mutex.dep_map)
+
+static void intel_svm_free_async_fn(struct work_struct *work)
+{
+ struct intel_svm *svm = container_of(work, struct intel_svm, work);
+ struct intel_svm_dev *sdev;
+
+ /*
+ * Unbind all devices associated with this PASID which is
+ * being freed by other users such as VFIO.
+ */
+ mutex_lock(&pasid_mutex);
+ list_for_each_entry_rcu(sdev, &svm->devs, list, pasid_lock_held()) {
+ /* Does not poison forward pointer */
+ list_del_rcu(&sdev->list);
+ spin_lock(&sdev->iommu->lock);
+ intel_pasid_tear_down_entry(sdev->iommu, sdev->dev,
+ svm->pasid, true);
+ intel_svm_drain_prq(sdev->dev, svm->pasid);
+ spin_unlock(&sdev->iommu->lock);
+ kfree_rcu(sdev, rcu);
+ }
+ /*
+ * We may not be the last user to drop the reference but since
+	 * the PASID is in FREE_PENDING state, no one can get a new reference.
+ * Therefore, we can safely free the private data svm.
+ */
+ intel_svm_drop_pasid(svm->pasid);
+
+ /*
+ * Free before unbind can only happen with host PASIDs used for
+ * guest SVM. We get here because ioasid_free is called with
+ * outstanding references. So we need to drop the reference
+ * such that the PASID can be reclaimed. unbind_gpasid() after this
+ * will not result in dropping refcount since the private data is
+ * already detached.
+ */
+ kfree(svm);
+
+ mutex_unlock(&pasid_mutex);
+}
+
+
+static int pasid_status_change(struct notifier_block *nb,
+ unsigned long code, void *data)
+{
+ struct ioasid_nb_args *args = (struct ioasid_nb_args *)data;
+ struct intel_svm *svm = (struct intel_svm *)args->pdata;
+ int ret = NOTIFY_DONE;
+
+ /*
+ * Notification private data is a choice of vendor driver when the
+ * IOASID is allocated or attached after allocation. When the data
+ * type changes, we must make modifications here accordingly.
+ */
+ if (code == IOASID_NOTIFY_FREE) {
+ /*
+ * If PASID UNBIND happens before FREE, private data of the
+ * IOASID should be NULL, then we don't need to do anything.
+ */
+ if (!svm)
+ goto done;
+ if (args->id != svm->pasid) {
+ pr_warn("Notify PASID does not match data %d : %d\n",
+ args->id, svm->pasid);
+ goto done;
+ }
+ if (!ioasid_queue_work(&svm->work))
+ pr_warn("Cleanup work already queued\n");
+ return NOTIFY_OK;
+ }
+done:
+ return ret;
+}
+
+static struct notifier_block pasid_nb = {
+ .notifier_call = pasid_status_change,
+};
+
+void intel_svm_add_pasid_notifier(void)
+{
+ /* Listen to all PASIDs, not specific to a set */
+ ioasid_register_notifier(NULL, &pasid_nb);
+}
+
void intel_svm_check(struct intel_iommu *iommu)
{
if (!pasid_supported(iommu))
@@ -240,7 +338,6 @@ static const struct mmu_notifier_ops intel_mmuops = {
.invalidate_range = intel_invalidate_range,
};
-static DEFINE_MUTEX(pasid_mutex);
static LIST_HEAD(global_svm_list);
#define for_each_svm_dev(sdev, svm, d) \
@@ -367,8 +464,16 @@ int intel_svm_bind_gpasid(struct iommu_domain *domain, struct device *dev,
if (data->flags & IOMMU_SVA_GPASID_VAL) {
svm->gpasid = data->gpasid;
svm->flags |= SVM_FLAG_GUEST_PASID;
+ ioasid_attach_spid(data->hpasid, data->gpasid);
}
ioasid_attach_data(data->hpasid, svm);
+ ioasid_get(NULL, svm->pasid);
+ sdev->iommu = iommu;
+ /*
+ * Set up cleanup async work in case IOASID core notify us PASID
+ * is freed before unbind.
+ */
+ INIT_WORK(&svm->work, intel_svm_free_async_fn);
INIT_LIST_HEAD_RCU(&svm->devs);
}
sdev = kzalloc(sizeof(*sdev), GFP_KERNEL);
@@ -464,7 +569,7 @@ int intel_svm_unbind_gpasid(struct device *dev, u32 pasid)
* the unbind, IOMMU driver will get notified
* and perform cleanup.
*/
- ioasid_detach_data(pasid);
+ intel_svm_drop_pasid(pasid);
kfree(svm);
}
}
diff --git a/include/linux/intel-iommu.h b/include/linux/intel-iommu.h
index 09c6a0bf3892..b1b8914e1564 100644
--- a/include/linux/intel-iommu.h
+++ b/include/linux/intel-iommu.h
@@ -757,6 +757,7 @@ void intel_svm_unbind(struct iommu_sva *handle);
u32 intel_svm_get_pasid(struct iommu_sva *handle);
int intel_svm_page_response(struct device *dev, struct iommu_fault_event *evt,
struct iommu_page_response *msg);
+void intel_svm_add_pasid_notifier(void);
struct svm_dev_ops;
@@ -783,6 +784,7 @@ struct intel_svm {
int gpasid; /* In case that guest PASID is different from host PASID */
struct list_head devs;
struct list_head list;
+ struct work_struct work; /* For deferred clean up */
};
#else
static inline void intel_svm_check(struct intel_iommu *iommu) {}
--
2.25.1
Relations among IOASID users largely follow a publisher-subscriber
pattern. E.g., to support guest SVA on Intel Scalable I/O Virtualization
(SIOV) enabled platforms, VFIO, the IOMMU, device drivers, and KVM are
all users of IOASIDs. When a state change occurs, VFIO publishes the
change event, which then needs to be processed by the other
users/subscribers.
This patch introduces two types of notifications: global and per
ioasid_set. The latter is intended for users who only need to handle
events related to the IOASIDs of a given set.
For more information, refer to the kernel documentation at
Documentation/driver-api/ioasid.rst.
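For illustration, registration might look like the following sketch
(example_* names are hypothetical; passing a NULL set selects the global
chain, mirroring what the later VT-d patch does):

#include <linux/ioasid.h>
#include <linux/notifier.h>
#include <linux/printk.h>

static int example_nb_fn(struct notifier_block *nb, unsigned long cmd,
			 void *data)
{
	struct ioasid_nb_args *args = data;

	switch (cmd) {
	case IOASID_NOTIFY_ALLOC:
	case IOASID_NOTIFY_FREE:
		pr_debug("IOASID %u lifecycle event %lu\n", args->id, cmd);
		break;
	case IOASID_NOTIFY_BIND:
	case IOASID_NOTIFY_UNBIND:
		pr_debug("IOASID %u (un)bound, SPID %u\n", args->id, args->spid);
		break;
	}
	return NOTIFY_OK;
}

static struct notifier_block example_nb = {
	.notifier_call	= example_nb_fn,
	.priority	= IOASID_PRIO_IOMMU,	/* runs after CPU, before device */
};

/* Per-set scope when @set is non-NULL, global scope when it is NULL */
static int example_register(struct ioasid_set *set)
{
	return ioasid_register_notifier(set, &example_nb);
}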
Signed-off-by: Liu Yi L <[email protected]>
Signed-off-by: Wu Hao <[email protected]>
Signed-off-by: Jacob Pan <[email protected]>
---
drivers/iommu/ioasid.c | 111 +++++++++++++++++++++++++++++++++++++++--
include/linux/ioasid.h | 54 ++++++++++++++++++++
2 files changed, 161 insertions(+), 4 deletions(-)
diff --git a/drivers/iommu/ioasid.c b/drivers/iommu/ioasid.c
index 7707bb608bdd..56577e745c4b 100644
--- a/drivers/iommu/ioasid.c
+++ b/drivers/iommu/ioasid.c
@@ -10,12 +10,33 @@
#include <linux/spinlock.h>
#include <linux/xarray.h>
+/*
+ * An IOASID can have multiple consumers where each consumer may have
+ * hardware contexts associated with the IOASID.
+ * When a status change occurs, like on IOASID deallocation, notifier chains
+ * are used to keep the consumers in sync.
+ * This is a publisher-subscriber pattern where publisher can change the
+ * state of each IOASID, e.g. alloc/free, bind IOASID to a device and mm.
+ * On the other hand, subscribers get notified for the state change and
+ * keep local states in sync.
+ */
+static ATOMIC_NOTIFIER_HEAD(ioasid_notifier);
+static DEFINE_SPINLOCK(ioasid_nb_lock);
+
/* Default to PCIe standard 20 bit PASID */
#define PCI_PASID_MAX 0x100000
static ioasid_t ioasid_capacity = PCI_PASID_MAX;
static ioasid_t ioasid_capacity_avail = PCI_PASID_MAX;
static DEFINE_XARRAY_ALLOC(ioasid_sets);
+struct ioasid_set_nb {
+ struct list_head list;
+ struct notifier_block *nb;
+ void *token;
+ struct ioasid_set *set;
+ bool active;
+};
+
enum ioasid_state {
IOASID_STATE_IDLE,
IOASID_STATE_ACTIVE,
@@ -415,6 +436,38 @@ void ioasid_detach_data(ioasid_t ioasid)
}
EXPORT_SYMBOL_GPL(ioasid_detach_data);
+/**
+ * ioasid_notify - Send notification on a given IOASID for status change.
+ *
+ * @data: The IOASID data for which the notification will be sent
+ * @cmd: Notification event sent by IOASID external users, e.g.
+ * IOASID_NOTIFY_BIND or IOASID_NOTIFY_UNBIND.
+ *
+ * @flags: Special instructions, e.g. notify within a set or globally via
+ * the IOASID_NOTIFY_FLAG_SET or IOASID_NOTIFY_FLAG_ALL flags.
+ * Caller must hold ioasid_allocator_lock and a reference to the IOASID.
+ */
+static int ioasid_notify(struct ioasid_data *data,
+ enum ioasid_notify_val cmd, unsigned int flags)
+{
+ struct ioasid_nb_args args = { 0 };
+ int ret = 0;
+
+ if (flags & ~(IOASID_NOTIFY_FLAG_ALL | IOASID_NOTIFY_FLAG_SET))
+ return -EINVAL;
+
+ args.id = data->id;
+ args.set = data->set;
+ args.pdata = data->private;
+ args.spid = data->spid;
+ if (flags & IOASID_NOTIFY_FLAG_ALL)
+ ret = atomic_notifier_call_chain(&ioasid_notifier, cmd, &args);
+ if (flags & IOASID_NOTIFY_FLAG_SET)
+ ret = atomic_notifier_call_chain(&data->set->nh, cmd, &args);
+
+ return ret;
+}
+
static ioasid_t ioasid_find_by_spid_locked(struct ioasid_set *set, ioasid_t spid, bool get)
{
ioasid_t ioasid = INVALID_IOASID;
@@ -468,7 +521,7 @@ int ioasid_attach_spid(ioasid_t ioasid, ioasid_t spid)
goto done_unlock;
}
data->spid = spid;
-
+ ioasid_notify(data, IOASID_NOTIFY_BIND, IOASID_NOTIFY_FLAG_SET);
done_unlock:
spin_unlock(&ioasid_allocator_lock);
return ret;
@@ -486,8 +539,8 @@ void ioasid_detach_spid(ioasid_t ioasid)
pr_err("Invalid IOASID entry %d to detach\n", ioasid);
goto done_unlock;
}
+ ioasid_notify(data, IOASID_NOTIFY_UNBIND, IOASID_NOTIFY_FLAG_SET);
data->spid = INVALID_IOASID;
-
done_unlock:
spin_unlock(&ioasid_allocator_lock);
}
@@ -603,6 +656,8 @@ struct ioasid_set *ioasid_set_alloc(void *token, ioasid_t quota, int type)
set->quota = quota;
set->id = id;
atomic_set(&set->nr_ioasids, 0);
+ ATOMIC_INIT_NOTIFIER_HEAD(&set->nh);
+
/*
* Per set XA is used to store private IDs within the set, get ready
* for ioasid_set private ID and system-wide IOASID allocation
@@ -655,7 +710,9 @@ int ioasid_set_free(struct ioasid_set *set)
int ret = 0;
spin_lock(&ioasid_allocator_lock);
+ spin_lock(&ioasid_nb_lock);
ret = ioasid_set_free_locked(set);
+ spin_unlock(&ioasid_nb_lock);
spin_unlock(&ioasid_allocator_lock);
return ret;
}
@@ -728,6 +785,7 @@ ioasid_t ioasid_alloc(struct ioasid_set *set, ioasid_t min, ioasid_t max,
goto exit_free;
}
atomic_inc(&set->nr_ioasids);
+ ioasid_notify(data, IOASID_NOTIFY_ALLOC, IOASID_NOTIFY_FLAG_SET);
goto done_unlock;
exit_free:
kfree(data);
@@ -780,9 +838,11 @@ static void ioasid_free_locked(struct ioasid_set *set, ioasid_t ioasid)
* If the refcount is 1, it means there is no other users of the IOASID
* other than IOASID core itself. There is no need to notify anyone.
*/
- if (!refcount_dec_and_test(&data->refs))
+ if (!refcount_dec_and_test(&data->refs)) {
+ ioasid_notify(data, IOASID_NOTIFY_FREE,
+ IOASID_NOTIFY_FLAG_SET | IOASID_NOTIFY_FLAG_ALL);
return;
-
+ }
ioasid_do_free_locked(data);
}
@@ -833,15 +893,39 @@ void ioasid_free_all_in_set(struct ioasid_set *set)
if (!atomic_read(&set->nr_ioasids))
return;
spin_lock(&ioasid_allocator_lock);
+ spin_lock(&ioasid_nb_lock);
xa_for_each(&set->xa, index, entry) {
ioasid_free_locked(set, index);
/* Free from per set private pool */
xa_erase(&set->xa, index);
}
+ spin_unlock(&ioasid_nb_lock);
spin_unlock(&ioasid_allocator_lock);
}
EXPORT_SYMBOL_GPL(ioasid_free_all_in_set);
+/*
+ * ioasid_find_mm_set - Retrieve IOASID set with mm token
+ * The caller must ensure the set is not freed concurrently.
+ */
+struct ioasid_set *ioasid_find_mm_set(struct mm_struct *token)
+{
+ struct ioasid_set *set;
+ unsigned long index;
+
+ spin_lock(&ioasid_allocator_lock);
+
+ xa_for_each(&ioasid_sets, index, set) {
+ if (set->type == IOASID_SET_TYPE_MM && set->token == token)
+ goto exit_unlock;
+ }
+ set = NULL;
+exit_unlock:
+ spin_unlock(&ioasid_allocator_lock);
+ return set;
+}
+EXPORT_SYMBOL_GPL(ioasid_find_mm_set);
+
/**
* ioasid_set_for_each_ioasid
* @brief
@@ -1021,6 +1105,25 @@ void *ioasid_find(struct ioasid_set *set, ioasid_t ioasid,
}
EXPORT_SYMBOL_GPL(ioasid_find);
+int ioasid_register_notifier(struct ioasid_set *set, struct notifier_block *nb)
+{
+ if (set)
+ return atomic_notifier_chain_register(&set->nh, nb);
+ else
+ return atomic_notifier_chain_register(&ioasid_notifier, nb);
+}
+EXPORT_SYMBOL_GPL(ioasid_register_notifier);
+
+void ioasid_unregister_notifier(struct ioasid_set *set,
+ struct notifier_block *nb)
+{
+ if (set)
+ atomic_notifier_chain_unregister(&set->nh, nb);
+ else
+ atomic_notifier_chain_unregister(&ioasid_notifier, nb);
+}
+EXPORT_SYMBOL_GPL(ioasid_unregister_notifier);
+
MODULE_AUTHOR("Jean-Philippe Brucker <[email protected]>");
MODULE_AUTHOR("Jacob Pan <[email protected]>");
MODULE_DESCRIPTION("IO Address Space ID (IOASID) allocator");
diff --git a/include/linux/ioasid.h b/include/linux/ioasid.h
index dcab02886cb5..d8b85a04214f 100644
--- a/include/linux/ioasid.h
+++ b/include/linux/ioasid.h
@@ -58,6 +58,47 @@ struct ioasid_allocator_ops {
void *pdata;
};
+/* Notification data when IOASID status changed */
+enum ioasid_notify_val {
+ IOASID_NOTIFY_ALLOC = 1,
+ IOASID_NOTIFY_FREE,
+ IOASID_NOTIFY_BIND,
+ IOASID_NOTIFY_UNBIND,
+};
+
+#define IOASID_NOTIFY_FLAG_ALL BIT(0)
+#define IOASID_NOTIFY_FLAG_SET BIT(1)
+/**
+ * enum ioasid_notifier_prios - IOASID event notification order
+ *
+ * When status of an IOASID changes, users might need to take actions to
+ * reflect the new state. For example, when an IOASID is freed due to
+ * exception, the hardware context in virtual CPU, DMA device, and IOMMU
+ * shall be cleared and drained. Order is required to prevent life cycle
+ * problems.
+ */
+enum ioasid_notifier_prios {
+ IOASID_PRIO_LAST,
+ IOASID_PRIO_DEVICE,
+ IOASID_PRIO_IOMMU,
+ IOASID_PRIO_CPU,
+};
+
+/**
+ * struct ioasid_nb_args - Argument provided by IOASID core when notifier
+ * is called.
+ * @id: The IOASID being notified
+ * @spid: The set private ID associated with the IOASID
+ * @set: The IOASID set of @id
+ * @pdata: The private data attached to the IOASID
+ */
+struct ioasid_nb_args {
+ ioasid_t id;
+ ioasid_t spid;
+ struct ioasid_set *set;
+ void *pdata;
+};
+
#if IS_ENABLED(CONFIG_IOASID)
void ioasid_install_capacity(ioasid_t total);
int ioasid_reserve_capacity(ioasid_t nr_ioasid);
@@ -84,6 +125,10 @@ void ioasid_detach_data(ioasid_t ioasid);
int ioasid_attach_spid(ioasid_t ioasid, ioasid_t spid);
void ioasid_detach_spid(ioasid_t ioasid);
ioasid_t ioasid_find_by_spid(struct ioasid_set *set, ioasid_t spid, bool get);
+int ioasid_register_notifier(struct ioasid_set *set,
+ struct notifier_block *nb);
+void ioasid_unregister_notifier(struct ioasid_set *set,
+ struct notifier_block *nb);
void ioasid_set_for_each_ioasid(struct ioasid_set *sdata,
void (*fn)(ioasid_t id, void *data),
void *data);
@@ -149,6 +194,15 @@ static inline void *ioasid_find(struct ioasid_set *set, ioasid_t ioasid,
return NULL;
}
+static inline int ioasid_register_notifier(struct ioasid_set *set,
+					   struct notifier_block *nb)
+{
+	return -ENOTSUPP;
+}
+
+static inline void ioasid_unregister_notifier(struct ioasid_set *set,
+					      struct notifier_block *nb)
+{
+}
+
static inline int ioasid_register_allocator(struct ioasid_allocator_ops *allocator)
{
return -ENOTSUPP;
--
2.25.1
The bind guest page table call comes with an IOASID provided by
userspace. To prevent attacks by malicious users, we must ensure the
IOASID was allocated under the same process.
This patch adds a new API that performs an ownership check based on
whether the IOASID belongs to the ioasid_set allocated with the current
mm_struct pointer as its token.
Signed-off-by: Liu Yi L <[email protected]>
Signed-off-by: Jacob Pan <[email protected]>
---
drivers/iommu/ioasid.c | 37 +++++++++++++++++++++++++++++++++++++
drivers/iommu/iommu.c | 16 ++++++++++++++--
include/linux/ioasid.h | 6 ++++++
3 files changed, 57 insertions(+), 2 deletions(-)
diff --git a/drivers/iommu/ioasid.c b/drivers/iommu/ioasid.c
index 96e941dfada7..28a2e9b6594d 100644
--- a/drivers/iommu/ioasid.c
+++ b/drivers/iommu/ioasid.c
@@ -9,6 +9,7 @@
#include <linux/slab.h>
#include <linux/spinlock.h>
#include <linux/xarray.h>
+#include <linux/sched/mm.h>
/*
* An IOASID can have multiple consumers where each consumer may have
@@ -1028,6 +1029,42 @@ int ioasid_get(struct ioasid_set *set, ioasid_t ioasid)
}
EXPORT_SYMBOL_GPL(ioasid_get);
+/**
+ * ioasid_get_if_owned - obtain a reference to the IOASID if the IOASID belongs
+ * to the ioasid_set with the current mm as token
+ * @ioasid: the IOASID to take a reference to
+ *
+ * Return: 0 on success, or an error if the IOASID does not belong to an
+ * ioasid_set with the current mm as its token.
+ */
+int ioasid_get_if_owned(ioasid_t ioasid)
+{
+ struct ioasid_set *set;
+ int ret;
+
+ spin_lock(&ioasid_allocator_lock);
+ set = ioasid_find_set(ioasid);
+ if (IS_ERR_OR_NULL(set)) {
+ ret = -ENOENT;
+ goto done_unlock;
+ }
+ if (set->type != IOASID_SET_TYPE_MM) {
+ ret = -EINVAL;
+ goto done_unlock;
+ }
+ if (current->mm != set->token) {
+ ret = -EPERM;
+ goto done_unlock;
+ }
+
+ ret = ioasid_get_locked(set, ioasid);
+done_unlock:
+ spin_unlock(&ioasid_allocator_lock);
+
+ return ret;
+}
+EXPORT_SYMBOL_GPL(ioasid_get_if_owned);
+
bool ioasid_put_locked(struct ioasid_set *set, ioasid_t ioasid)
{
struct ioasid_data *data;
diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index fd76e2f579fe..18716d856b02 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -2169,7 +2169,13 @@ int iommu_uapi_sva_bind_gpasid(struct iommu_domain *domain, struct device *dev,
if (ret)
return ret;
- return domain->ops->sva_bind_gpasid(domain, dev, &data);
+ ret = ioasid_get_if_owned(data.hpasid);
+ if (ret)
+ return ret;
+ ret = domain->ops->sva_bind_gpasid(domain, dev, &data);
+ ioasid_put(NULL, data.hpasid);
+
+ return ret;
}
EXPORT_SYMBOL_GPL(iommu_uapi_sva_bind_gpasid);
@@ -2196,7 +2202,13 @@ int iommu_uapi_sva_unbind_gpasid(struct iommu_domain *domain, struct device *dev
if (ret)
return ret;
- return iommu_sva_unbind_gpasid(domain, dev, data.hpasid);
+ ret = ioasid_get_if_owned(data.hpasid);
+ if (ret)
+ return ret;
+ ret = iommu_sva_unbind_gpasid(domain, dev, data.hpasid);
+ ioasid_put(NULL, data.hpasid);
+
+ return ret;
}
EXPORT_SYMBOL_GPL(iommu_uapi_sva_unbind_gpasid);
diff --git a/include/linux/ioasid.h b/include/linux/ioasid.h
index c97e80ff65cc..9624b665f810 100644
--- a/include/linux/ioasid.h
+++ b/include/linux/ioasid.h
@@ -111,6 +111,7 @@ ioasid_t ioasid_alloc(struct ioasid_set *set, ioasid_t min, ioasid_t max,
void *private);
int ioasid_get(struct ioasid_set *set, ioasid_t ioasid);
int ioasid_get_locked(struct ioasid_set *set, ioasid_t ioasid);
+int ioasid_get_if_owned(ioasid_t ioasid);
bool ioasid_put(struct ioasid_set *set, ioasid_t ioasid);
bool ioasid_put_locked(struct ioasid_set *set, ioasid_t ioasid);
void ioasid_free(struct ioasid_set *set, ioasid_t ioasid);
@@ -180,6 +181,11 @@ static inline int ioasid_get_locked(struct ioasid_set *set, ioasid_t ioasid)
return -ENOTSUPP;
}
+static inline int ioasid_get_if_owned(ioasid_t ioasid)
+{
+ return -ENOTSUPP;
+}
+
static inline bool ioasid_put(struct ioasid_set *set, ioasid_t ioasid)
{
return false;
--
2.25.1
Once the IOASIDs cgroup is active, we must consult the limits set up by
the cgroup during allocation. Freeing IOASIDs also needs to return the
quota back to the cgroup.
Signed-off-by: Jacob Pan <[email protected]>
---
drivers/iommu/ioasid.c | 7 ++++++-
1 file changed, 6 insertions(+), 1 deletion(-)
diff --git a/drivers/iommu/ioasid.c b/drivers/iommu/ioasid.c
index d42b39ca2c8b..fd3f5729c71d 100644
--- a/drivers/iommu/ioasid.c
+++ b/drivers/iommu/ioasid.c
@@ -782,7 +782,10 @@ ioasid_t ioasid_alloc(struct ioasid_set *set, ioasid_t min, ioasid_t max,
spin_lock(&ioasid_allocator_lock);
/* Check if the IOASID set has been allocated and initialized */
- if (!ioasid_set_is_valid(set))
+ if (!set || !ioasid_set_is_valid(set))
+ goto done_unlock;
+
+ if (set->type == IOASID_SET_TYPE_MM && ioasid_cg_charge(set))
goto done_unlock;
if (set->quota <= atomic_read(&set->nr_ioasids)) {
@@ -832,6 +835,7 @@ ioasid_t ioasid_alloc(struct ioasid_set *set, ioasid_t min, ioasid_t max,
goto done_unlock;
exit_free:
kfree(data);
+ ioasid_cg_uncharge(set);
done_unlock:
spin_unlock(&ioasid_allocator_lock);
return id;
@@ -849,6 +853,7 @@ static void ioasid_do_free_locked(struct ioasid_data *data)
kfree_rcu(ioasid_data, rcu);
}
atomic_dec(&data->set->nr_ioasids);
+ ioasid_cg_uncharge(data->set);
xa_erase(&data->set->xa, data->id);
/* Destroy the set if empty */
if (!atomic_read(&data->set->nr_ioasids))
--
2.25.1
Now that IOASID core keeps track of the IOASID to mm_struct ownership in
the forms of ioasid_set with IOASID_SET_TYPE_MM token type, there is no
need to keep the same mapping in VT-d driver specific data. Native SVM
usage is not affected by the change.
Signed-off-by: Jacob Pan <[email protected]>
---
drivers/iommu/intel/svm.c | 7 -------
1 file changed, 7 deletions(-)
diff --git a/drivers/iommu/intel/svm.c b/drivers/iommu/intel/svm.c
index c469c24d23f5..f75699ddb923 100644
--- a/drivers/iommu/intel/svm.c
+++ b/drivers/iommu/intel/svm.c
@@ -363,12 +363,6 @@ int intel_svm_bind_gpasid(struct iommu_domain *domain, struct device *dev,
ret = -ENOMEM;
goto out;
}
- /* REVISIT: upper layer/VFIO can track host process that bind
- * the PASID. ioasid_set = mm might be sufficient for vfio to
- * check pasid VMM ownership. We can drop the following line
- * once VFIO and IOASID set check is in place.
- */
- svm->mm = get_task_mm(current);
svm->pasid = data->hpasid;
if (data->flags & IOMMU_SVA_GPASID_VAL) {
svm->gpasid = data->gpasid;
@@ -376,7 +370,6 @@ int intel_svm_bind_gpasid(struct iommu_domain *domain, struct device *dev,
}
ioasid_attach_data(data->hpasid, svm);
INIT_LIST_HEAD_RCU(&svm->devs);
- mmput(svm->mm);
}
sdev = kzalloc(sizeof(*sdev), GFP_KERNEL);
if (!sdev) {
--
2.25.1
IOASIDs are used to associate DMA requests with virtual address spaces.
They are a limited system-wide resource made available to userspace
applications, be they VMs or user-space device drivers.
This RFC patch introduces a cgroup controller to address the following
problems:
1. Some user applications exhaust all the available IOASIDs, thus
depriving others on the same host.
2. System admins need to provision VMs based on their needs for IOASIDs,
e.g. the number of VMs with assigned devices that perform DMA requests
with PASID.
This patch is nowhere near completion; it merely provides the basic
functionality for resource distribution and cgroup hierarchy
organizational changes.
Since this is part of a greater effort to enable Shared Virtual Address
(SVA) virtualization, we would like a direction check and early feedback.
For details, please refer to the documentation:
Documentation/admin-guide/cgroup-v1/ioasids.rst
Signed-off-by: Jacob Pan <[email protected]>
---
include/linux/cgroup_subsys.h | 4 +
include/linux/ioasid.h | 17 ++
init/Kconfig | 7 +
kernel/cgroup/Makefile | 1 +
kernel/cgroup/ioasids.c | 345 ++++++++++++++++++++++++++++++++++
5 files changed, 374 insertions(+)
create mode 100644 kernel/cgroup/ioasids.c
diff --git a/include/linux/cgroup_subsys.h b/include/linux/cgroup_subsys.h
index acb77dcff3b4..cda75ecdcdcb 100644
--- a/include/linux/cgroup_subsys.h
+++ b/include/linux/cgroup_subsys.h
@@ -57,6 +57,10 @@ SUBSYS(hugetlb)
SUBSYS(pids)
#endif
+#if IS_ENABLED(CONFIG_CGROUP_IOASIDS)
+SUBSYS(ioasids)
+#endif
+
#if IS_ENABLED(CONFIG_CGROUP_RDMA)
SUBSYS(rdma)
#endif
diff --git a/include/linux/ioasid.h b/include/linux/ioasid.h
index 4547086797df..5ea4710efb02 100644
--- a/include/linux/ioasid.h
+++ b/include/linux/ioasid.h
@@ -135,8 +135,25 @@ void ioasid_set_for_each_ioasid(struct ioasid_set *sdata,
void *data);
int ioasid_register_notifier_mm(struct mm_struct *mm, struct notifier_block *nb);
void ioasid_unregister_notifier_mm(struct mm_struct *mm, struct notifier_block *nb);
+#ifdef CONFIG_CGROUP_IOASIDS
+int ioasid_cg_charge(struct ioasid_set *set);
+void ioasid_cg_uncharge(struct ioasid_set *set);
+#else
+/* No cgroup control, allocation proceeds until the total pool runs out */
+static inline int ioasid_cg_charge(struct ioasid_set *set)
+{
+ return 0;
+}
+
+static inline void ioasid_cg_uncharge(struct ioasid_set *set)
+{
+	/* Nothing to uncharge without cgroup control */
+}
+#endif /* CGROUP_IOASIDS */
bool ioasid_queue_work(struct work_struct *work);
+
#else /* !CONFIG_IOASID */
+
static inline void ioasid_install_capacity(ioasid_t total)
{
}
diff --git a/init/Kconfig b/init/Kconfig
index b77c60f8b963..9a23683dad98 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1017,6 +1017,13 @@ config CGROUP_PIDS
since the PIDs limit only affects a process's ability to fork, not to
attach to a cgroup.
+config CGROUP_IOASIDS
+ bool "IOASIDs controller"
+ depends on IOASID
+ help
+ Provides enforcement of IO Address Space ID limits in the scope of a
+ cgroup.
+
config CGROUP_RDMA
bool "RDMA controller"
help
diff --git a/kernel/cgroup/Makefile b/kernel/cgroup/Makefile
index 5d7a76bfbbb7..c5ad7c9a2305 100644
--- a/kernel/cgroup/Makefile
+++ b/kernel/cgroup/Makefile
@@ -3,6 +3,7 @@ obj-y := cgroup.o rstat.o namespace.o cgroup-v1.o freezer.o
obj-$(CONFIG_CGROUP_FREEZER) += legacy_freezer.o
obj-$(CONFIG_CGROUP_PIDS) += pids.o
+obj-$(CONFIG_CGROUP_IOASIDS) += ioasids.o
obj-$(CONFIG_CGROUP_RDMA) += rdma.o
obj-$(CONFIG_CPUSETS) += cpuset.o
obj-$(CONFIG_CGROUP_DEBUG) += debug.o
diff --git a/kernel/cgroup/ioasids.c b/kernel/cgroup/ioasids.c
new file mode 100644
index 000000000000..ac43813da6ad
--- /dev/null
+++ b/kernel/cgroup/ioasids.c
@@ -0,0 +1,345 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * IO Address Space ID limiting controller for cgroups.
+ *
+ */
+#define pr_fmt(fmt) "ioasids_cg: " fmt
+
+#include <linux/kernel.h>
+#include <linux/threads.h>
+#include <linux/atomic.h>
+#include <linux/cgroup.h>
+#include <linux/slab.h>
+#include <linux/ioasid.h>
+#include <linux/sched/mm.h>
+#include <linux/sched/task.h>
+
+#define IOASIDS_MAX_STR "max"
+static DEFINE_MUTEX(ioasids_cg_lock);
+
+struct ioasids_cgroup {
+ struct cgroup_subsys_state css;
+ atomic64_t counter;
+ atomic64_t limit;
+ struct cgroup_file events_file;
+ /* Number of times allocations failed because limit was hit. */
+ atomic64_t events_limit;
+};
+
+static struct ioasids_cgroup *css_ioasids(struct cgroup_subsys_state *css)
+{
+ return container_of(css, struct ioasids_cgroup, css);
+}
+
+static struct ioasids_cgroup *parent_ioasids(struct ioasids_cgroup *ioasids)
+{
+ return css_ioasids(ioasids->css.parent);
+}
+
+static struct cgroup_subsys_state *
+ioasids_css_alloc(struct cgroup_subsys_state *parent)
+{
+ struct ioasids_cgroup *ioasids;
+
+ ioasids = kzalloc(sizeof(struct ioasids_cgroup), GFP_KERNEL);
+ if (!ioasids)
+ return ERR_PTR(-ENOMEM);
+
+ atomic64_set(&ioasids->counter, 0);
+ atomic64_set(&ioasids->limit, 0);
+ atomic64_set(&ioasids->events_limit, 0);
+ return &ioasids->css;
+}
+
+static void ioasids_css_free(struct cgroup_subsys_state *css)
+{
+ kfree(css_ioasids(css));
+}
+
+/**
+ * ioasids_cancel - uncharge the local IOASID count
+ * @ioasids: the ioasid cgroup state
+ * @num: the number of ioasids to cancel
+ *
+ */
+static void ioasids_cancel(struct ioasids_cgroup *ioasids, int num)
+{
+ WARN_ON_ONCE(atomic64_add_negative(-num, &ioasids->counter));
+}
+
+/**
+ * ioasids_uncharge - hierarchically uncharge the ioasid count
+ * @ioasids: the ioasid cgroup state
+ * @num: the number of ioasids to uncharge
+ */
+static void ioasids_uncharge(struct ioasids_cgroup *ioasids, int num)
+{
+ struct ioasids_cgroup *p;
+
+ for (p = ioasids; parent_ioasids(p); p = parent_ioasids(p))
+ ioasids_cancel(p, num);
+}
+
+/**
+ * ioasids_charge - hierarchically charge the ioasid count
+ * @ioasids: the ioasid cgroup state
+ * @num: the number of ioasids to charge
+ */
+static void ioasids_charge(struct ioasids_cgroup *ioasids, int num)
+{
+ struct ioasids_cgroup *p;
+
+ for (p = ioasids; parent_ioasids(p); p = parent_ioasids(p))
+ atomic64_add(num, &p->counter);
+}
+
+/**
+ * ioasids_try_charge - hierarchically try to charge the ioasid count
+ * @ioasids: the ioasid cgroup state
+ * @num: the number of ioasids to charge
+ */
+static int ioasids_try_charge(struct ioasids_cgroup *ioasids, int num)
+{
+ struct ioasids_cgroup *p, *q;
+
+ for (p = ioasids; parent_ioasids(p); p = parent_ioasids(p)) {
+ int64_t new = atomic64_add_return(num, &p->counter);
+ int64_t limit = atomic64_read(&p->limit);
+
+ if (new > limit)
+ goto revert;
+ }
+
+ return 0;
+
+revert:
+ for (q = ioasids; q != p; q = parent_ioasids(q))
+ ioasids_cancel(q, num);
+ ioasids_cancel(p, num);
+ cgroup_file_notify(&ioasids->events_file);
+
+ return -EAGAIN;
+}
+
+
+/**
+ * ioasid_cg_charge - Check and charge IOASIDs cgroup
+ *
+ * @set: IOASID set used for allocation
+ *
+ * The IOASID quota is managed per cgroup, all process based allocations
+ * must be validated per cgroup hierarchy.
+ * Return 0 if a single IOASID can be charged, or an error if any of the
+ * checks fail.
+ */
+int ioasid_cg_charge(struct ioasid_set *set)
+{
+	struct cgroup_subsys_state *css;
+	struct ioasids_cgroup *ioasids;
+	struct mm_struct *mm;
+	int ret = 0;
+
+	/* We only charge user process allocated PASIDs */
+	if (set->type != IOASID_SET_TYPE_MM)
+		return ret;
+	/*
+	 * Must be called with a valid mm, not during process exit. Take the
+	 * mm reference only after the set type check so it is never leaked.
+	 */
+	mm = get_task_mm(current);
+	if (!mm)
+		return -EINVAL;
+	if (set->token != mm) {
+ pr_err("No permisson to allocate IOASID\n");
+ ret = -EPERM;
+ goto exit_drop;
+ }
+ rcu_read_lock();
+ css = task_css(current, ioasids_cgrp_id);
+ ioasids = css_ioasids(css);
+ rcu_read_unlock();
+ ret = ioasids_try_charge(ioasids, 1);
+ if (ret)
+ pr_warn("%s: Unable to charge IOASID %d\n", __func__, ret);
+exit_drop:
+ mmput_async(mm);
+ return ret;
+}
+EXPORT_SYMBOL_GPL(ioasid_cg_charge);
+
+/* Uncharge IOASIDs cgroup after freeing an IOASID */
+void ioasid_cg_uncharge(struct ioasid_set *set)
+{
+ struct cgroup_subsys_state *css;
+ struct ioasids_cgroup *ioasids;
+ struct mm_struct *mm;
+
+ /* We only charge user process allocated PASIDs */
+ if (set->type != IOASID_SET_TYPE_MM)
+ return;
+ mm = set->token;
+ if (!mmget_not_zero(mm)) {
+ pr_err("MM defunct! Cannot uncharge IOASID\n");
+ return;
+ }
+ rcu_read_lock();
+ css = task_css(current, ioasids_cgrp_id);
+ ioasids = css_ioasids(css);
+ rcu_read_unlock();
+ ioasids_uncharge(ioasids, 1);
+ mmput_async(mm);
+}
+EXPORT_SYMBOL_GPL(ioasid_cg_uncharge);
+
+static int ioasids_can_attach(struct cgroup_taskset *tset)
+{
+ struct cgroup_subsys_state *dst_css;
+	struct ioasid_set *set;
+ struct task_struct *leader;
+
+ /*
+ * IOASIDs are managed at per process level, we only support domain mode
+ * in task management model. Loop through all processes by each thread
+ * leader, charge the leader's css.
+ */
+ cgroup_taskset_for_each_leader(leader, dst_css, tset) {
+ struct ioasids_cgroup *ioasids = css_ioasids(dst_css);
+ struct cgroup_subsys_state *old_css;
+ struct ioasids_cgroup *old_ioasids;
+ struct mm_struct *mm = get_task_mm(leader);
+
+ set = ioasid_find_mm_set(mm);
+ mmput(mm);
+ if (!set)
+ continue;
+
+ old_css = task_css(leader, ioasids_cgrp_id);
+ old_ioasids = css_ioasids(old_css);
+
+ ioasids_charge(ioasids, atomic_read(&set->nr_ioasids));
+ ioasids_uncharge(old_ioasids, atomic_read(&set->nr_ioasids));
+ }
+
+ return 0;
+}
+
+static void ioasids_cancel_attach(struct cgroup_taskset *tset)
+{
+ struct cgroup_subsys_state *dst_css;
+ struct task_struct *task;
+
+ cgroup_taskset_for_each(task, dst_css, tset) {
+ struct ioasids_cgroup *ioasids = css_ioasids(dst_css);
+ struct cgroup_subsys_state *old_css;
+ struct ioasids_cgroup *old_ioasids;
+
+ old_css = task_css(task, ioasids_cgrp_id);
+ old_ioasids = css_ioasids(old_css);
+
+ ioasids_charge(old_ioasids, 1);
+ ioasids_uncharge(ioasids, 1);
+ }
+}
+
+static ssize_t ioasids_max_write(struct kernfs_open_file *of, char *buf,
+ size_t nbytes, loff_t off)
+{
+ struct cgroup_subsys_state *css = of_css(of);
+ struct ioasids_cgroup *ioasids = css_ioasids(css);
+ int64_t limit, limit_cur;
+ int err;
+
+ mutex_lock(&ioasids_cg_lock);
+ /* Check whether we are growing or shrinking */
+ limit_cur = atomic64_read(&ioasids->limit);
+ buf = strstrip(buf);
+ if (!strcmp(buf, IOASIDS_MAX_STR)) {
+		/* Returns how many IOASIDs were in the pool */
+ limit = ioasid_reserve_capacity(0);
+ ioasid_reserve_capacity(limit - limit_cur);
+ goto set_limit;
+ }
+ err = kstrtoll(buf, 0, &limit);
+ if (err)
+ goto done_unlock;
+
+ err = nbytes;
+ /* Check whether we are growing or shrinking */
+ limit_cur = atomic64_read(&ioasids->limit);
+ if (limit < 0 || limit == limit_cur) {
+ err = -EINVAL;
+ goto done_unlock;
+ }
+ if (limit < limit_cur)
+ err = ioasid_cancel_capacity(limit_cur - limit);
+ else
+ err = ioasid_reserve_capacity(limit - limit_cur);
+ if (err < 0)
+ goto done_unlock;
+
+set_limit:
+ err = nbytes;
+ atomic64_set(&ioasids->limit, limit);
+done_unlock:
+ mutex_unlock(&ioasids_cg_lock);
+ return err;
+}
+
+static int ioasids_max_show(struct seq_file *sf, void *v)
+{
+ struct cgroup_subsys_state *css = seq_css(sf);
+ struct ioasids_cgroup *ioasids = css_ioasids(css);
+ int64_t limit = atomic64_read(&ioasids->limit);
+
+ seq_printf(sf, "%lld\n", limit);
+
+ return 0;
+}
+
+static s64 ioasids_current_read(struct cgroup_subsys_state *css,
+ struct cftype *cft)
+{
+ struct ioasids_cgroup *ioasids = css_ioasids(css);
+
+ return atomic64_read(&ioasids->counter);
+}
+
+static int ioasids_events_show(struct seq_file *sf, void *v)
+{
+ struct ioasids_cgroup *ioasids = css_ioasids(seq_css(sf));
+
+ seq_printf(sf, "max %lld\n", (s64)atomic64_read(&ioasids->events_limit));
+ return 0;
+}
+
+static struct cftype ioasids_files[] = {
+ {
+ .name = "max",
+ .write = ioasids_max_write,
+ .seq_show = ioasids_max_show,
+ .flags = CFTYPE_NOT_ON_ROOT,
+ },
+ {
+ .name = "current",
+ .read_s64 = ioasids_current_read,
+ .flags = CFTYPE_NOT_ON_ROOT,
+ },
+ {
+ .name = "events",
+ .seq_show = ioasids_events_show,
+ .file_offset = offsetof(struct ioasids_cgroup, events_file),
+ .flags = CFTYPE_NOT_ON_ROOT,
+ },
+ { } /* terminate */
+};
+
+struct cgroup_subsys ioasids_cgrp_subsys = {
+ .css_alloc = ioasids_css_alloc,
+ .css_free = ioasids_css_free,
+ .can_attach = ioasids_can_attach,
+ .cancel_attach = ioasids_cancel_attach,
+ .legacy_cftypes = ioasids_files,
+ .dfl_cftypes = ioasids_files,
+ .threaded = false,
+};
+
--
2.25.1
Signed-off-by: Jacob Pan <[email protected]>
---
Documentation/admin-guide/cgroup-v1/index.rst | 1 +
.../admin-guide/cgroup-v1/ioasids.rst | 110 ++++++++++++++++++
2 files changed, 111 insertions(+)
create mode 100644 Documentation/admin-guide/cgroup-v1/ioasids.rst
diff --git a/Documentation/admin-guide/cgroup-v1/index.rst b/Documentation/admin-guide/cgroup-v1/index.rst
index 226f64473e8e..f5e307dc4dbb 100644
--- a/Documentation/admin-guide/cgroup-v1/index.rst
+++ b/Documentation/admin-guide/cgroup-v1/index.rst
@@ -15,6 +15,7 @@ Control Groups version 1
devices
freezer-subsystem
hugetlb
+ ioasids
memcg_test
memory
net_cls
diff --git a/Documentation/admin-guide/cgroup-v1/ioasids.rst b/Documentation/admin-guide/cgroup-v1/ioasids.rst
new file mode 100644
index 000000000000..b30eb41bf1be
--- /dev/null
+++ b/Documentation/admin-guide/cgroup-v1/ioasids.rst
@@ -0,0 +1,110 @@
+========================================
+I/O Address Space ID (IOASID) Controller
+========================================
+
+Acronyms
+--------
+PASID:
+ Process Address Space ID, defined by PCIe
+SVA:
+ Shared Virtual Address
+
+Introduction
+------------
+
+IOASIDs are used to associate DMA requests with virtual address spaces. As
+a system-wide limited¹ resource, their consumption is managed by the
+IOASIDs cgroup subsystem. The specific use cases are:
+
+1. Some user applications exhaust all the available IOASIDs, thus depriving
+   others on the same host.
+
+2. System admins need to provision VMs based on their needs for IOASIDs,
+ e.g. the number of VMs with assigned devices that perform DMA requests
+ with PASID.
+
+The IOASID subsystem consists of three components:
+
+- IOASID core: provides APIs for allocation, pool management,
+  notifications and refcounting. See Documentation/driver-api/ioasid.rst
+  for details.
+- IOASID user: provides the user allocation interface via /dev/ioasid.
+- IOASID cgroup controller: manages resource distribution.
+
+Resource Distribution Model
+---------------------------
+IOASID allocation is process-based in that IOASIDs are tied to page tables²;
+the threaded model is not supported. An allocation is rejected by the
+cgroup hierarchy once a limit is reached. However, organizational changes
+such as moving processes across cgroups are exempted. It is therefore
+possible to have ioasids.current > ioasids.max, but no further allocation
+is allowed after an organizational change that exceeds the max.
+
+The system capacity of IOASIDs defaults to the PCIe PASID size of 20 bits.
+The IOASID core provides an API to adjust the system capacity per platform.
+IOASIDs are used by both user applications (e.g. VMs and userspace drivers)
+and the kernel (e.g. supervisor SVA), but only user allocations are subject
+to cgroup constraints. The host kernel allocates a pool of IOASIDs whose
+quota is subtracted from the system capacity. The IOASIDs cgroup consults
+the IOASID core for available capacity when a new cgroup limit is granted.
+Upon creation of a new cgroup, no IOASID allocation is allowed by its user
+processes until a limit is set (the default limit is 0).
+
+Usage
+-----
+The cgroup filesystem has the following entries specific to the IOASIDs
+controller:
+::
+
+ ioasids.current
+ ioasids.events
+ ioasids.max
+
+To use the IOASIDs controller, set ioasids.max to the maximum number of
+IOASIDs that may be allocated. The file ioasids.current shows the number
+of IOASIDs currently allocated within the cgroup.
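+
+The special value "max" is also accepted (assuming IOASIDS_MAX_STR is
+"max", as in other controllers; see the ioasids_max_write() handler in
+the controller patch): writing it raises the limit to whatever capacity
+the IOASID core still has available ::
+
+ $ echo max > ioasids.max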
+
+Example
+--------
+1. Mount the cgroup2 FS ::
+
+ $ mount -t cgroup2 none /mnt/cg2/
+
+2. Add ioasids controller ::
+
+ $ echo '+ioasids' > /mnt/cg2/cgroup.subtree_control
+
+3. Create a hierarchy, set non-zero limit (default 0) ::
+
+ $ mkdir /mnt/cg2/test1
+ $ echo 5 > /mnt/cg2/test1/ioasids.max
+
+4. Allocating IOASIDs within the limit should succeed ::
+
+ $ echo $$ > /mnt/cg2/test1/cgroup.procs
+ (do IOASID allocation via /dev/ioasid)
+ ioasids.current:1
+ ioasids.max:5
+
+5. Attempting to allocate IOASIDs beyond the limit should fail ::
+
+ ioasids.current:5
+ ioasids.max:5
+
+6. Attaching a process that already has IOASIDs allocated to a cgroup may
+result in ioasids.current > ioasids.max. For example, if a process with
+PID 1234, which has one IOASID allocated, is moved into the test1
+cgroup ::
+
+ $ echo 1234 > /mnt/cg2/test1/cgroup.procs
+ ioasids.current:6
+ ioasids.max:5
+
+Notes
+-----
+¹ When IOASIDs are used for PCI Express PASID, the range is limited to the
+PASID size of 20 bits. For a device whose resources can be shared across
+the platform, the IOASID namespace must be system-wide in order to uniquely
+identify DMA requests with PASID inside the device.
+
+² The primary use case is SVA, where CPU page tables are shared with DMA via
+IOMMU.
--
2.25.1
From: Liu Yi L <[email protected]>
I/O Address Space IDs (IOASIDs) are used to tag DMA requests in order to
target multiple DMA address spaces for physical devices. The PCI term for
this is PASID (Process Address Space ID). Platforms with PASID support
can provide PASID-granularity DMA isolation, which is very useful for
efficient and secure device sharing (SVA, subdevice passthrough, etc.).
Today only kernel drivers are allowed to allocate IOASIDs [1]. This patch
aims to extend this capability to userspace, as required in device pass-
through scenarios. For example, a userspace driver may want to create its
own DMA address spaces besides the default IOVA address space established
by the kernel on the assigned device (e.g. vDPA control vq [2] and guest
SVA [3]), and thus needs to get IOASIDs from the kernel IOASID allocator
for tagging. Conceptually, each device can have its own IOASID space, so
it is also possible for a userspace driver to manage a private IOASID
space itself, say, when a PF/VF is assigned. However, this does not work
for subdevice pass-through: multiple subdevices under the same parent
device share a single IOASID space, so IOASIDs must be centrally managed
by the kernel in that case.
This patch introduces a /dev/ioasid interface for this purpose (per
discussion in [4]). An IOASID is just a number before it is tagged to a
specific DMA address space. The actual IOASID tagging (to DMA requests)
and association (with DMA address spaces) operations from userspace are
scrutinized by the specific device passthrough frameworks, which must
ensure that a malicious driver cannot program arbitrary IOASIDs into its
assigned device to access DMA address spaces that do not belong to it.
That is out of the scope of this patch (a reference VFIO implementation
will be posted soon).
Open:
PCIe PASID is 20 bits, implying a space of 1M IOASIDs. Although that is
plenty, there was an open question [4] on whether this user interface
should be open to all processes or only to selected processes (e.g. with
a device assigned). In this patch series, a cgroup controller is
introduced to manage the IOASID quota that a process is allowed to use.
A cgroup-enabled system may by default set quota=0 to disallow IOASID
allocation for most processes, and then have the virt management stack
adjust the quota for a process which gets a device assigned. But yeah,
we are also willing to hear more suggestions.
[1] https://lore.kernel.org/linux-iommu/[email protected]/
[2] https://lore.kernel.org/kvm/[email protected]/
[3] https://lore.kernel.org/linux-iommu/[email protected]/
[4] https://lore.kernel.org/kvm/[email protected]/
Signed-off-by: Liu Yi L <[email protected]>
---
Documentation/userspace-api/index.rst | 1 +
Documentation/userspace-api/ioasid.rst | 49 ++++
drivers/iommu/Kconfig | 5 +
drivers/iommu/Makefile | 1 +
drivers/iommu/intel/Kconfig | 1 +
drivers/iommu/ioasid_user.c | 297 +++++++++++++++++++++++++
include/linux/ioasid.h | 26 +++
include/linux/miscdevice.h | 1 +
include/uapi/linux/ioasid.h | 98 ++++++++
9 files changed, 479 insertions(+)
create mode 100644 Documentation/userspace-api/ioasid.rst
create mode 100644 drivers/iommu/ioasid_user.c
create mode 100644 include/uapi/linux/ioasid.h
diff --git a/Documentation/userspace-api/index.rst b/Documentation/userspace-api/index.rst
index acd2cc2a538d..69e1be7c67ee 100644
--- a/Documentation/userspace-api/index.rst
+++ b/Documentation/userspace-api/index.rst
@@ -24,6 +24,7 @@ place where this information is gathered.
ioctl/index
iommu
media/index
+ ioasid
.. only:: subproject and html
diff --git a/Documentation/userspace-api/ioasid.rst b/Documentation/userspace-api/ioasid.rst
new file mode 100644
index 000000000000..879d6cbae858
--- /dev/null
+++ b/Documentation/userspace-api/ioasid.rst
@@ -0,0 +1,49 @@
+.. SPDX-License-Identifier: GPL-2.0
+.. ioasid:
+
+=====================================
+IOASID Userspace API
+=====================================
+
+The IOASID UAPI is used for userspace IOASID allocation/free requests;
+IOASID management is thus centralized in the IOASID core [1] in the
+kernel. The primary use case today is guest Shared Virtual Address (SVA).
+
+Requests such as allocation/free can be issued by users and are managed
+on a per-process basis through the IOASID core. Upon opening /dev/ioasid,
+a process obtains a unique handle associated with the process's mm_struct.
+This handle is mapped to an FD in userspace. Only a single open is
+allowed per process.
+
+File descriptors can be transferred across processes via fork() or a
+UNIX domain socket, but FDs obtained by transfer cannot be used to
+perform IOASID requests. The following behaviors are recommended for
+applications:
+
+ - forked children close the parent's IOASID FDs immediately and open
+   new /dev/ioasid FDs if IOASID allocation is desired (see the sketch
+   below)
+
+ - do not share FDs via UNIX domain sockets, e.g. via sendmsg()
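+
+As an illustrative sketch only (``parent_fd`` stands for the parent's
+/dev/ioasid FD and is not part of this patch), a forked child following
+the first recommendation would do::
+
+    if (fork() == 0) {
+            /* child: the inherited FD cannot be used for IOASID
+             * requests; drop it and open a fresh one if needed */
+            close(parent_fd);
+            parent_fd = open("/dev/ioasid", O_RDWR);
+    }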
+
+================
+Userspace APIs
+================
+
+/dev/ioasid provides the following ioctls:
+
+*) IOASID_GET_API_VERSION: returns the API version; userspace should
+   check it against the version it was built for.
+*) IOASID_GET_INFO: returns information about /dev/ioasid.
+   - ioasid_bits: the IOASID bit width supported by this uAPI. Userspace
+     should compare ioasid_bits with the width it requires; if ioasid_bits
+     is smaller, allocations in the desired range will fail.
+*) IOASID_REQUEST_ALLOC: returns an IOASID allocated in the kernel within
+   the specified IOASID range.
+*) IOASID_REQUEST_FREE: frees an IOASID per userspace's request.
+
+For detailed definition, please see include/uapi/linux/ioasid.h.
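+
+Putting it together, a minimal allocation flow might look like the
+following. This is an illustrative sketch based on the uAPI above, with
+error handling abbreviated; it is not part of the patch itself::
+
+    #include <fcntl.h>
+    #include <sys/ioctl.h>
+    #include <unistd.h>
+    #include <linux/ioasid.h>
+
+    int alloc_one_ioasid(void)
+    {
+            struct ioasid_info info = { .argsz = sizeof(info) };
+            struct ioasid_alloc_request req = { .argsz = sizeof(req) };
+            int fd, id;
+
+            fd = open("/dev/ioasid", O_RDWR);   /* one open per process */
+            if (fd < 0)
+                    return -1;
+
+            if (ioctl(fd, IOASID_GET_API_VERSION) != IOASID_API_VERSION ||
+                ioctl(fd, IOASID_GET_INFO, &info) < 0)
+                    goto err;
+
+            /* request any IOASID within the supported width */
+            req.range.min = 1;
+            req.range.max = (1U << info.ioasid_bits) - 1;
+            id = ioctl(fd, IOASID_REQUEST_ALLOC, &req);
+            if (id < 0)
+                    goto err;
+
+            ioctl(fd, IOASID_REQUEST_FREE, &id);  /* and free it again */
+            close(fd);
+            return 0;
+    err:
+            close(fd);
+            return -1;
+    }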
+
+
+[1] Documentation/driver-api/ioasid.rst
diff --git a/drivers/iommu/Kconfig b/drivers/iommu/Kconfig
index 192ef8f61310..830f4ec28a16 100644
--- a/drivers/iommu/Kconfig
+++ b/drivers/iommu/Kconfig
@@ -7,6 +7,11 @@ config IOMMU_IOVA
config IOASID
tristate
+config IOASID_USER
+ tristate
+ depends on IOASID
+ default n
+
# IOMMU_API always gets selected by whoever wants it.
config IOMMU_API
bool
diff --git a/drivers/iommu/Makefile b/drivers/iommu/Makefile
index 61bd30cd8369..305dd019ff49 100644
--- a/drivers/iommu/Makefile
+++ b/drivers/iommu/Makefile
@@ -9,6 +9,7 @@ obj-$(CONFIG_IOMMU_IO_PGTABLE) += io-pgtable.o
obj-$(CONFIG_IOMMU_IO_PGTABLE_ARMV7S) += io-pgtable-arm-v7s.o
obj-$(CONFIG_IOMMU_IO_PGTABLE_LPAE) += io-pgtable-arm.o
obj-$(CONFIG_IOASID) += ioasid.o
+obj-$(CONFIG_IOASID_USER) += ioasid_user.o
obj-$(CONFIG_IOMMU_IOVA) += iova.o
obj-$(CONFIG_OF_IOMMU) += of_iommu.o
obj-$(CONFIG_MSM_IOMMU) += msm_iommu.o
diff --git a/drivers/iommu/intel/Kconfig b/drivers/iommu/intel/Kconfig
index 28a3d1596c76..a6d9dea61d58 100644
--- a/drivers/iommu/intel/Kconfig
+++ b/drivers/iommu/intel/Kconfig
@@ -13,6 +13,7 @@ config INTEL_IOMMU
select DMAR_TABLE
select SWIOTLB
select IOASID
+ select IOASID_USER
select IOMMU_DMA
help
DMA remapping (DMAR) devices support enables independent address
diff --git a/drivers/iommu/ioasid_user.c b/drivers/iommu/ioasid_user.c
new file mode 100644
index 000000000000..2f8957cd055a
--- /dev/null
+++ b/drivers/iommu/ioasid_user.c
@@ -0,0 +1,297 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Support IOASID allocation/free from user space.
+ *
+ * Copyright (C) 2021 Intel Corporation.
+ * Author: Liu Yi L <[email protected]>
+ *
+ */
+
+#include <linux/ioasid.h>
+#include <linux/file.h>
+#include <linux/fs.h>
+#include <linux/module.h>
+#include <linux/slab.h>
+#include <linux/sched/mm.h>
+#include <linux/miscdevice.h>
+
+#define DRIVER_VERSION "0.1"
+#define DRIVER_AUTHOR "Liu Yi L <[email protected]>"
+#define DRIVER_DESC "IOASID management for user space"
+
+/* Current user ioasid uapi supports 31 bits */
+#define IOASID_BITS 31
+
+struct ioasid_user_token {
+ unsigned long long val;
+};
+
+struct ioasid_user {
+ struct kref kref;
+ struct ioasid_set *ioasid_set;
+ struct mutex lock;
+ struct list_head next;
+ struct ioasid_user_token token;
+};
+
+/* Statically initialized so they are valid before misc_register() */
+static DEFINE_MUTEX(ioasid_user_lock);
+static LIST_HEAD(ioasid_user_list);
+
+/* Called with ioasid_user_lock held; releases it before returning */
+static void ioasid_user_release(struct kref *kref)
+{
+ struct ioasid_user *iuser = container_of(kref, struct ioasid_user, kref);
+
+ ioasid_free_all_in_set(iuser->ioasid_set);
+ list_del(&iuser->next);
+ mutex_unlock(&ioasid_user_lock);
+ ioasid_set_free(iuser->ioasid_set);
+ kfree(iuser);
+}
+
+void ioasid_user_put(struct ioasid_user *iuser)
+{
+ kref_put_mutex(&iuser->kref, ioasid_user_release, &ioasid_user_lock);
+}
+EXPORT_SYMBOL_GPL(ioasid_user_put);
+
+static void ioasid_user_get(struct ioasid_user *iuser)
+{
+ kref_get(&iuser->kref);
+}
+
+struct ioasid_user *ioasid_user_get_from_task(struct task_struct *task)
+{
+ struct mm_struct *mm = get_task_mm(task);
+ unsigned long long val = (unsigned long long)mm;
+ struct ioasid_user *iuser;
+ bool found = false;
+
+ if (!mm)
+ return NULL;
+
+ mutex_lock(&ioasid_user_lock);
+ /* Search for an existing ioasid_user matching the task's mm pointer */
+ list_for_each_entry(iuser, &ioasid_user_list, next) {
+ if (iuser->token.val == val) {
+ ioasid_user_get(iuser);
+ found = true;
+ break;
+ }
+ }
+
+ mmput(mm);
+
+ mutex_unlock(&ioasid_user_lock);
+ return found ? iuser : NULL;
+}
+EXPORT_SYMBOL_GPL(ioasid_user_get_from_task);
+
+void ioasid_user_for_each_id(struct ioasid_user *iuser, void *data,
+ void (*fn)(ioasid_t id, void *data))
+{
+ mutex_lock(&iuser->lock);
+ ioasid_set_for_each_ioasid(iuser->ioasid_set, fn, data);
+ mutex_unlock(&iuser->lock);
+}
+EXPORT_SYMBOL_GPL(ioasid_user_for_each_id);
+
+static int ioasid_fops_open(struct inode *inode, struct file *filep)
+{
+ struct mm_struct *mm = get_task_mm(current);
+ unsigned long long val = (unsigned long long)mm;
+ struct ioasid_set *iset;
+ struct ioasid_user *iuser;
+ int ret = 0;
+
+ mutex_lock(&ioasid_user_lock);
+ /* Only allow a single open per process */
+ list_for_each_entry(iuser, &ioasid_user_list, next) {
+ if (iuser->token.val == val) {
+ ret = -EBUSY;
+ goto out;
+ }
+ }
+
+ iuser = kzalloc(sizeof(*iuser), GFP_KERNEL);
+ if (!iuser) {
+ ret = -ENOMEM;
+ goto out;
+ }
+
+ /*
+ * IOASID core provides a 'IOASID set' concept to track all
+ * IOASIDs associated with a token. Here we use mm_struct as
+ * the token and create an IOASID set per mm_struct. All the
+ * containers of the process share the same IOASID set.
+ */
+ iset = ioasid_set_alloc(mm, 0, IOASID_SET_TYPE_MM);
+ if (IS_ERR(iset)) {
+ kfree(iuser);
+ ret = PTR_ERR(iset);
+ goto out;
+ }
+
+ iuser->ioasid_set = iset;
+ kref_init(&iuser->kref);
+ iuser->token.val = val;
+ mutex_init(&iuser->lock);
+ filep->private_data = iuser;
+
+ list_add(&iuser->next, &ioasid_user_list);
+out:
+ mutex_unlock(&ioasid_user_lock);
+ mmput(mm);
+ return ret;
+}
+
+static int ioasid_fops_release(struct inode *inode, struct file *filep)
+{
+ struct ioasid_user *iuser = filep->private_data;
+
+ filep->private_data = NULL;
+
+ ioasid_user_put(iuser);
+
+ return 0;
+}
+
+static int ioasid_get_info(struct ioasid_user *iuser, unsigned long arg)
+{
+ struct ioasid_info info;
+ unsigned long minsz;
+
+ minsz = offsetofend(struct ioasid_info, ioasid_bits);
+
+ if (copy_from_user(&info, (void __user *)arg, minsz))
+ return -EFAULT;
+
+ if (info.argsz < minsz || info.flags)
+ return -EINVAL;
+
+ info.ioasid_bits = IOASID_BITS;
+
+ return copy_to_user((void __user *)arg, &info, minsz) ? -EFAULT : 0;
+}
+
+static int ioasid_alloc_request(struct ioasid_user *iuser, unsigned long arg)
+{
+ struct ioasid_alloc_request req;
+ unsigned long minsz;
+ ioasid_t ioasid;
+
+ minsz = offsetofend(struct ioasid_alloc_request, range);
+
+ if (copy_from_user(&req, (void __user *)arg, minsz))
+ return -EFAULT;
+
+ if (req.argsz < minsz || req.flags)
+ return -EINVAL;
+
+ if (req.range.min > req.range.max ||
+     req.range.min >= (1UL << IOASID_BITS) ||
+     req.range.max >= (1UL << IOASID_BITS))
+ return -EINVAL;
+
+ ioasid = ioasid_alloc(iuser->ioasid_set, req.range.min,
+ req.range.max, NULL);
+
+ if (ioasid == INVALID_IOASID)
+ return -EINVAL;
+
+ return ioasid;
+}
+
+static int ioasid_free_request(struct ioasid_user *iuser, unsigned long arg)
+{
+ int ioasid;
+
+ if (copy_from_user(&ioasid, (void __user *)arg, sizeof(ioasid)))
+ return -EFAULT;
+
+ if (ioasid < 0)
+ return -EINVAL;
+
+ ioasid_free(iuser->ioasid_set, ioasid);
+
+ return 0;
+}
+
+static long ioasid_fops_unl_ioctl(struct file *filep,
+ unsigned int cmd, unsigned long arg)
+{
+ struct ioasid_user *iuser = filep->private_data;
+ long ret = -EINVAL;
+
+ if (!iuser)
+ return ret;
+
+ mutex_lock(&iuser->lock);
+
+ switch (cmd) {
+ case IOASID_GET_API_VERSION:
+ ret = IOASID_API_VERSION;
+ break;
+ case IOASID_GET_INFO:
+ ret = ioasid_get_info(iuser, arg);
+ break;
+ case IOASID_REQUEST_ALLOC:
+ ret = ioasid_alloc_request(iuser, arg);
+ break;
+ case IOASID_REQUEST_FREE:
+ ret = ioasid_free_request(iuser, arg);
+ break;
+ default:
+ pr_err("Unsupported cmd %u\n", cmd);
+ break;
+ }
+
+ mutex_unlock(&iuser->lock);
+ return ret;
+}
+
+static const struct file_operations ioasid_user_fops = {
+ .owner = THIS_MODULE,
+ .open = ioasid_fops_open,
+ .release = ioasid_fops_release,
+ .unlocked_ioctl = ioasid_fops_unl_ioctl,
+};
+
+static struct miscdevice ioasid_user = {
+ .minor = IOASID_MINOR,
+ .name = "ioasid_user",
+ .fops = &ioasid_user_fops,
+ .nodename = "ioasid",
+ .mode = 0666,
+};
+
+static int __init ioasid_user_init(void)
+{
+ int ret;
+
+ ret = misc_register(&ioasid_user);
+ if (ret) {
+ pr_err("ioasid_user: misc device register failed\n");
+ return ret;
+ }
+
+ return 0;
+}
+
+static void __exit ioasid_user_exit(void)
+{
+ WARN_ON(!list_empty(&ioasid_user_list));
+ misc_deregister(&ioasid_user);
+}
+
+module_init(ioasid_user_init);
+module_exit(ioasid_user_exit);
+
+MODULE_VERSION(DRIVER_VERSION);
+MODULE_LICENSE("GPL v2");
+MODULE_AUTHOR(DRIVER_AUTHOR);
+MODULE_DESCRIPTION(DRIVER_DESC);
diff --git a/include/linux/ioasid.h b/include/linux/ioasid.h
index 5ea4710efb02..b82abe6325f7 100644
--- a/include/linux/ioasid.h
+++ b/include/linux/ioasid.h
@@ -6,6 +6,7 @@
#include <linux/errno.h>
#include <linux/xarray.h>
#include <linux/refcount.h>
+#include <uapi/linux/ioasid.h>
#define INVALID_IOASID ((ioasid_t)-1)
typedef unsigned int ioasid_t;
@@ -152,6 +153,31 @@ static inline int ioasid_cg_uncharge(struct ioasid_set *set)
#endif /* CGROUP_IOASIDS */
bool ioasid_queue_work(struct work_struct *work);
+/* IOASID userspace support */
+struct ioasid_user;
+#if IS_ENABLED(CONFIG_IOASID_USER)
+extern struct ioasid_user *ioasid_user_get_from_task(struct task_struct *task);
+extern void ioasid_user_put(struct ioasid_user *iuser);
+extern void ioasid_user_for_each_id(struct ioasid_user *iuser, void *data,
+ void (*fn)(ioasid_t id, void *data));
+
+#else /* CONFIG_IOASID_USER */
+static inline struct ioasid_user *
+ioasid_user_get_from_task(struct task_struct *task)
+{
+ return ERR_PTR(-ENOTTY);
+}
+
+static inline void ioasid_user_put(struct ioasid_user *iuser)
+{
+}
+
+static inline void ioasid_user_for_each_id(struct ioasid_user *iuser, void *data,
+ void (*fn)(ioasid_t id, void *data))
+{
+}
+#endif /* CONFIG_IOASID_USER */
+
#else /* !CONFIG_IOASID */
static inline void ioasid_install_capacity(ioasid_t total)
diff --git a/include/linux/miscdevice.h b/include/linux/miscdevice.h
index 0676f18093f9..9823901f11a4 100644
--- a/include/linux/miscdevice.h
+++ b/include/linux/miscdevice.h
@@ -21,6 +21,7 @@
#define APOLLO_MOUSE_MINOR 7 /* unused */
#define PC110PAD_MINOR 9 /* unused */
/*#define ADB_MOUSE_MINOR 10 FIXME OBSOLETE */
+#define IOASID_MINOR 129 /* /dev/ioasid */
#define WATCHDOG_MINOR 130 /* Watchdog timer */
#define TEMP_MINOR 131 /* Temperature Sensor */
#define APM_MINOR_DEV 134
diff --git a/include/uapi/linux/ioasid.h b/include/uapi/linux/ioasid.h
new file mode 100644
index 000000000000..1529070c0317
--- /dev/null
+++ b/include/uapi/linux/ioasid.h
@@ -0,0 +1,98 @@
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+/*
+ * PASID (Process Address Space ID) is a PCIe concept for tagging
+ * address spaces in DMA requests. When system-wide PASID allocation
+ * is required by the underlying iommu driver (e.g. Intel VT-d), this
+ * provides an interface for userspace to request ioasid alloc/free
+ * for its assigned devices.
+ *
+ * Copyright (C) 2021 Intel Corporation. All rights reserved.
+ * Author: Liu Yi L <[email protected]>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+#ifndef _UAPI_IOASID_H
+#define _UAPI_IOASID_H
+
+#include <linux/types.h>
+#include <linux/ioctl.h>
+
+#define IOASID_API_VERSION 0
+
+/* Kernel & User level defines for IOASID IOCTLs. */
+
+#define IOASID_TYPE ('i')
+#define IOASID_BASE 100
+
+/* -------- IOCTLs for IOASID file descriptor (/dev/ioasid) -------- */
+
+/**
+ * IOASID_GET_API_VERSION - _IO(IOASID_TYPE, IOASID_BASE + 0)
+ *
+ * Report the version of the IOASID API. This allows us to bump the entire
+ * API version should we later need to add or change features in incompatible
+ * ways.
+ * Return: IOASID_API_VERSION
+ * Availability: Always
+ */
+#define IOASID_GET_API_VERSION _IO(IOASID_TYPE, IOASID_BASE + 0)
+
+/**
+ * IOASID_GET_INFO - _IO(IOASID_TYPE, IOASID_BASE + 1)
+ *
+ * Retrieve information about the IOASID object. Fills in provided
+ * struct ioasid_info. Caller sets argsz.
+ *
+ * @argsz: user filled size of this data.
+ * @flags: reserved for future extension; must be set to 0.
+ * @ioasid_bits: maximum supported PASID bits, 0 represents no PASID
+ * support.
+ *
+ * Availability: Always
+ */
+struct ioasid_info {
+ __u32 argsz;
+ __u32 flags;
+ __u32 ioasid_bits;
+};
+#define IOASID_GET_INFO _IO(IOASID_TYPE, IOASID_BASE + 1)
+
+/**
+ * IOASID_REQUEST_ALLOC - _IO(IOASID_TYPE, IOASID_BASE + 2)
+ *
+ * Allocate a PASID within @range. @range is [min, max], meaning both
+ * @min and @max are inclusive.
+ * Userspace must keep @min and @max within the width implied by the
+ * ioasid_bits reported in ioasid_info via IOASID_GET_INFO.
+ *
+ * @argsz: user filled size of this data.
+ * @flags: reserved for future extension; must be set to 0.
+ * @range: the allocated IOASID is expected to fall within this range.
+ *
+ * returns: allocated ID on success, -errno on failure
+ */
+struct ioasid_alloc_request {
+ __u32 argsz;
+ __u32 flags;
+ struct {
+ __u32 min;
+ __u32 max;
+ } range;
+};
+#define IOASID_REQUEST_ALLOC _IO(IOASID_TYPE, IOASID_BASE + 2)
+
+/**
+ * IOASID_REQUEST_FREE - _IO(IOASID_TYPE, IOASID_BASE + 3)
+ *
+ * Free a PASID.
+ *
+ * returns: 0 on success, -errno on failure
+ */
+#define IOASID_REQUEST_FREE _IO(IOASID_TYPE, IOASID_BASE + 3)
+
+#endif /* _UAPI_IOASID_H */
--
2.25.1
> From: Jacob Pan <[email protected]>
> Sent: Sunday, February 28, 2021 6:01 AM
>
> [...]
>
> You can find this series, VFIO, KVM, and IOASID user at:
> https://github.com/jacobpan/linux.git ioasid_v4
> (VFIO and KVM patches will be available at this branch when published.)
VFIO and QEMU series are listed below:
VFIO: https://lore.kernel.org/linux-iommu/[email protected]/
QEMU: https://lore.kernel.org/qemu-devel/[email protected]/T/#t
Regards,
Yi Liu
On Sat, Feb 27, 2021 at 02:01:23PM -0800, Jacob Pan wrote:
> IOASIDs are used to associate DMA requests with virtual address spaces.
> They are a system-wide limited resource made available to the userspace
> applications. Let it be VMs or user-space device drivers.
>
> This RFC patch introduces a cgroup controller to address the following
> problems:
> 1. Some user applications exhaust all the available IOASIDs thus
> depriving others of the same host.
> 2. System admins need to provision VMs based on their needs for IOASIDs,
> e.g. the number of VMs with assigned devices that perform DMA requests
> with PASID.
Please take a look at the proposed misc controller:
http://lkml.kernel.org/r/[email protected]
Would that fit your bill?
Thanks.
--
tejun
On Wed, Mar 03, 2021 at 04:02:05PM -0800, Jacob Pan wrote:
> > The interface definitely can be reused. But IOASID has a different
> > behavior in terms of migration and ownership checking. I guess SEV key
> > IDs are not tied to a process whereas IOASIDs are. Perhaps this can be
> > solved by adding
> > + .can_attach = ioasids_can_attach,
> > + .cancel_attach = ioasids_cancel_attach,
> > Let me give it a try and come back.
> >
> While I am trying to fit the IOASIDs cgroup in to the misc cgroup proposal.
> I'd like to have a direction check on whether this idea of using cgroup for
> IOASID/PASID resource management is viable.
>
> Alex/Jason/Jean and everyone, your feedback is much appreciated.
IMHO I can't think of anything else to enforce a limit on a scarce HW
resource that unprivileged userspace can consume.
Jason
Hi Jacob,
On Wed, 3 Mar 2021 13:17:26 -0800, Jacob Pan
<[email protected]> wrote:
> Hi Tejun,
>
> On Wed, 3 Mar 2021 10:44:28 -0500, Tejun Heo <[email protected]> wrote:
>
> > On Sat, Feb 27, 2021 at 02:01:23PM -0800, Jacob Pan wrote:
> > > [...]
> >
> > Please take a look at the proposed misc controller:
> >
> > http://lkml.kernel.org/r/[email protected]
> >
> > Would that fit your bill?
> The interface definitely can be reused. But IOASID has a different
> behavior in terms of migration and ownership checking. I guess SEV key
> IDs are not tied to a process whereas IOASIDs are. Perhaps this can be
> solved by adding
> + .can_attach = ioasids_can_attach,
> + .cancel_attach = ioasids_cancel_attach,
> Let me give it a try and come back.
>
While I am trying to fit the IOASIDs cgroup in to the misc cgroup proposal.
I'd like to have a direction check on whether this idea of using cgroup for
IOASID/PASID resource management is viable.
Alex/Jason/Jean and everyone, your feedback is much appreciated.
Thanks,
Jacob
Hi Jean-Philippe,
On Thu, 4 Mar 2021 10:49:37 +0100, Jean-Philippe Brucker
<[email protected]> wrote:
> On Wed, Mar 03, 2021 at 04:02:05PM -0800, Jacob Pan wrote:
> > [...]
> > While I am trying to fit the IOASIDs cgroup in to the misc cgroup
> > proposal. I'd like to have a direction check on whether this idea of
> > using cgroup for IOASID/PASID resource management is viable.
>
> Yes, even for host SVA it would be good to have a cgroup. Currently the
> number of shared address spaces is naturally limited by number of
> processes, which can be controlled with rlimit and cgroup. But on Arm the
> hardware limit on shared address spaces is 64k (number of ASIDs), easily
> exhausted with the default PASID and PID limits. So a cgroup for managing
> this resource is more than welcome.
>
> It looks like your current implementation is very dependent on
> IOASID_SET_TYPE_MM? I'll need to do more reading about cgroup to see how
> easily it can be adapted to host SVA which uses IOASID_SET_TYPE_NULL.
>
Right, I was assuming three use cases of IOASIDs:
1. host supervisor SVA (not a concern, just one init_mm to bind)
2. host user SVA, either one IOASID per process or perhaps some private
IOASID for private address space
3. VM use for guest SVA, each IOASID is bound to a guest process
My current cgroup proposal applies to #3 with IOASID_SET_TYPE_MM, which is
allocated by the new /dev/ioasid interface.
For #2, I was thinking you can limit the host process via the pids cgroup,
i.e. limit fork. The host IOASIDs are currently allocated from the system
pool with a quota chosen by iommu_sva_init() in my patch; 0 means unlimited,
i.e. use whatever is available. https://lkml.org/lkml/2021/2/28/18
> Thanks,
> Jean
Thanks,
Jacob
On Thu, Mar 04, 2021 at 11:01:44AM -0800, Jacob Pan wrote:
> > For something like qemu I'd expect to put the qemu process in a cgroup
> > with 1 PASID. Who cares what qemu uses the PASID for, or how it was
> > allocated?
>
> For vSVA, we will need one PASID per guest process. But that is up to the
> admin based on whether or how many SVA capable devices are directly
> assigned.
I hope the virtual IOMMU driver can communicate the PASID limit and
the cgroup machinery in the guest can know what the actual limit is.
I was thinking of a case where qemu is using a single PASID to setup
the guest kVA or similar
Jason
Hi Tejun,
On Wed, 3 Mar 2021 10:44:28 -0500, Tejun Heo <[email protected]> wrote:
> On Sat, Feb 27, 2021 at 02:01:23PM -0800, Jacob Pan wrote:
> > IOASIDs are used to associate DMA requests with virtual address spaces.
> > They are a system-wide limited resource made available to the userspace
> > applications. Let it be VMs or user-space device drivers.
> >
> > This RFC patch introduces a cgroup controller to address the following
> > problems:
> > 1. Some user applications exhaust all the available IOASIDs thus
> > depriving others of the same host.
> > 2. System admins need to provision VMs based on their needs for IOASIDs,
> > e.g. the number of VMs with assigned devices that perform DMA requests
> > with PASID.
>
> Please take a look at the proposed misc controller:
>
> http://lkml.kernel.org/r/[email protected]
>
> Would that fit your bill?
The interface definitely can be reused. But IOASID has a different behavior
in terms of migration and ownership checking. I guess SEV key IDs are not
tied to a process whereas IOASIDs are. Perhaps this can be solved by
adding
+ .can_attach = ioasids_can_attach,
+ .cancel_attach = ioasids_cancel_attach,
Let me give it a try and come back.
Thanks for the pointer.
Jacob
Thanks,
Jacob
On Wed, Mar 03, 2021 at 04:02:05PM -0800, Jacob Pan wrote:
> [...]
> While I am trying to fit the IOASIDs cgroup in to the misc cgroup proposal.
> I'd like to have a direction check on whether this idea of using cgroup for
> IOASID/PASID resource management is viable.
Yes, even for host SVA it would be good to have a cgroup. Currently the
number of shared address spaces is naturally limited by number of
processes, which can be controlled with rlimit and cgroup. But on Arm the
hardware limit on shared address spaces is 64k (number of ASIDs), easily
exhausted with the default PASID and PID limits. So a cgroup for managing
this resource is more than welcome.
It looks like your current implementation is very dependent on
IOASID_SET_TYPE_MM? I'll need to do more reading about cgroup to see how
easily it can be adapted to host SVA which uses IOASID_SET_TYPE_NULL.
Thanks,
Jean
On Thu, Mar 04, 2021 at 09:46:03AM -0800, Jacob Pan wrote:
> Right, I was assuming have three use cases of IOASIDs:
> 1. host supervisor SVA (not a concern, just one init_mm to bind)
> 2. host user SVA, either one IOASID per process or perhaps some private
> IOASID for private address space
> 3. VM use for guest SVA, each IOASID is bound to a guest process
>
> My current cgroup proposal applies to #3 with IOASID_SET_TYPE_MM, which is
> allocated by the new /dev/ioasid interface.
>
> For #2, I was thinking you can limit the host process via PIDs cgroup? i.e.
> limit fork. So the host IOASIDs are currently allocated from the system pool
> with quota of chosen by iommu_sva_init() in my patch, 0 means unlimited use
> whatever is available. https://lkml.org/lkml/2021/2/28/18
Why do we need two pools?
If PASIDs are limited then why does it matter how the PASID was
allocated? Either the thing requesting it is below the limit, or it
isn't.
For something like qemu I'd expect to put the qemu process in a cgroup
with 1 PASID. Who cares what qemu uses the PASID for, or how it was
allocated?
Jason
Hi Jason,
On Thu, 4 Mar 2021 13:54:02 -0400, Jason Gunthorpe <[email protected]> wrote:
> On Thu, Mar 04, 2021 at 09:46:03AM -0800, Jacob Pan wrote:
>
> > Right, I was assuming have three use cases of IOASIDs:
> > 1. host supervisor SVA (not a concern, just one init_mm to bind)
> > 2. host user SVA, either one IOASID per process or perhaps some private
> > IOASID for private address space
> > 3. VM use for guest SVA, each IOASID is bound to a guest process
> >
> > My current cgroup proposal applies to #3 with IOASID_SET_TYPE_MM, which
> > is allocated by the new /dev/ioasid interface.
> >
> > For #2, I was thinking you can limit the host process via PIDs cgroup?
> > i.e. limit fork. So the host IOASIDs are currently allocated from the
> > system pool with quota of chosen by iommu_sva_init() in my patch, 0
> > means unlimited use whatever is available.
> > https://lkml.org/lkml/2021/2/28/18
>
> Why do we need two pools?
>
> If PASID's are limited then why does it matter how the PASID was
> allocated? Either the thing requesting it is below the limit, or it
> isn't.
>
You are right. It should be tracked per process regardless of whether it is
allocated by the user (/dev/ioasid) or indirectly by kernel drivers during
iommu_sva_bind_device(). We need to consolidate both #2 and #3 and
decouple the cgroup from the IOASID set.
> For something like qemu I'd expect to put the qemu process in a cgroup
> with 1 PASID. Who cares what qemu uses the PASID for, or how it was
> allocated?
>
For vSVA, we will need one PASID per guest process. But that is up to the
admin based on whether or how many SVA capable devices are directly
assigned.
> Jason
Thanks,
Jacob
Hi Jason,
On Thu, 4 Mar 2021 15:02:53 -0400, Jason Gunthorpe <[email protected]> wrote:
> On Thu, Mar 04, 2021 at 11:01:44AM -0800, Jacob Pan wrote:
>
> > > For something like qemu I'd expect to put the qemu process in a cgroup
> > > with 1 PASID. Who cares what qemu uses the PASID for, or how it was
> > > allocated?
> >
> > For vSVA, we will need one PASID per guest process. But that is up to
> > the admin based on whether or how many SVA capable devices are directly
> > assigned.
>
> I hope the virtual IOMMU driver can communicate the PASID limit and
> the cgroup machinery in the guest can know what the actual limit is.
>
For VT-d, the emulated vIOMMU can communicate to the guest IOMMU driver how
many PASID bits are supported (the extended capability register PASID size
fields). But it cannot communicate how many PASIDs are in the pool (the host
cgroup capacity). The QEMU process may not be the only one in a cgroup, so it
cannot give hard guarantees. I don't see a good way to communicate accurately
at runtime as the process migrates or the limit changes.
We were thinking to adopt the "Limits" model as defined in the cgroup-v2
doc.
"
Limits
------
A child can only consume upto the configured amount of the resource.
Limits can be over-committed - the sum of the limits of children can
exceed the amount of resource available to the parent.
"
So the guest cgroup would still think it has the full 20 bits of PASIDs at
its disposal, but PASID allocation may fail before reaching the full 20 bits
(1M).
Similarly on the host side, we only enforce the limit set by the cgroup but
do not guarantee it.
> I was thinking of a case where qemu is using a single PASID to setup
> the guest kVA or similar
>
got it.
> Jason
Thanks,
Jacob
On Thu, Mar 04, 2021 at 09:46:03AM -0800, Jacob Pan wrote:
> [...]
> Right, I was assuming have three use cases of IOASIDs:
> 1. host supervisor SVA (not a concern, just one init_mm to bind)
> 2. host user SVA, either one IOASID per process or perhaps some private
> IOASID for private address space
> 3. VM use for guest SVA, each IOASID is bound to a guest process
>
> My current cgroup proposal applies to #3 with IOASID_SET_TYPE_MM, which is
> allocated by the new /dev/ioasid interface.
>
> For #2, I was thinking you can limit the host process via PIDs cgroup? i.e.
> limit fork.
That works but isn't perfect, because the hardware resource of shared
address spaces can be much lower than the PID limit - 16k ASIDs on Arm. To
allow an admin to fairly distribute that resource we could introduce
another cgroup just to limit the number of shared address spaces, but
limiting the number of IOASIDs does the trick.
> So the host IOASIDs are currently allocated from the system pool
> with quota of chosen by iommu_sva_init() in my patch, 0 means unlimited use
> whatever is available. https://lkml.org/lkml/2021/2/28/18
Yes that's sensible, but it would be good to plan the cgroup user
interface to work for #2 as well, even if we don't implement it right
away.
Thanks,
Jean
On Fri, Mar 05, 2021 at 09:30:49AM +0100, Jean-Philippe Brucker wrote:
> That works but isn't perfect, because the hardware resource of shared
> address spaces can be much lower that PID limit - 16k ASIDs on Arm. To
Sorry I meant 16-bit here - 64k
Thanks,
Jean
Hi Jean-Philippe,
On Fri, 5 Mar 2021 09:30:49 +0100, Jean-Philippe Brucker
<[email protected]> wrote:
> On Thu, Mar 04, 2021 at 09:46:03AM -0800, Jacob Pan wrote:
> > [...]
>
> That works but isn't perfect, because the hardware resource of shared
> address spaces can be much lower that PID limit - 16k ASIDs on Arm. To
> allow an admin to fairly distribute that resource we could introduce
> another cgroup just to limit the number of shared address spaces, but
> limiting the number of IOASIDs does the trick.
>
Makes sense. It would be cleaner to have a single approach to limiting
IOASIDs (as Jason asked).
> > So the host IOASIDs are currently allocated from the system pool
> > with quota of chosen by iommu_sva_init() in my patch, 0 means unlimited
> > use whatever is available. https://lkml.org/lkml/2021/2/28/18
>
> Yes that's sensible, but it would be good to plan the cgroup user
> interface to work for #2 as well, even if we don't implement it right
> away.
>
Will do in the next version.
> Thanks,
> Jean
Thanks,
Jacob
On Sat, Feb 27, 2021 at 02:01:26PM -0800, Jacob Pan wrote:
> +/* -------- IOCTLs for IOASID file descriptor (/dev/ioasid) -------- */
> +
> +/**
> + * IOASID_GET_API_VERSION - _IO(IOASID_TYPE, IOASID_BASE + 0)
> + *
> + * Report the version of the IOASID API. This allows us to bump the entire
> + * API version should we later need to add or change features in incompatible
> + * ways.
> + * Return: IOASID_API_VERSION
> + * Availability: Always
> + */
> +#define IOASID_GET_API_VERSION _IO(IOASID_TYPE, IOASID_BASE + 0)
I think this is generally a bad idea; if you change the API later then
also change the ioctl numbers and everything should work out.
E.g. use the 4th argument to _IOC to specify something about the ABI.
Jason
Hi Jason,
Thanks for the review.
On Wed, 10 Mar 2021 15:23:01 -0400, Jason Gunthorpe <[email protected]> wrote:
> On Sat, Feb 27, 2021 at 02:01:26PM -0800, Jacob Pan wrote:
>
> > [...]
>
> I think this is generally a bad idea, if you change the API later then
> also change the ioctl numbers and everything should work out
>
> eg use the 4th argument to IOC to specify something about the ABI
>
Let me try to understand the idea. Do you mean something like this?
#define IOASID_GET_INFO _IOC(_IOC_NONE, IOASID_TYPE, IOASID_BASE + 1,
sizeof(struct ioasid_info))
If we later change the size of struct ioasid_info, IOASID_GET_INFO would be
a different ioctl number, and we would break existing user space that uses
the old number. So I am guessing you mean we also need a different name,
i.e.
#define IOASID_GET_INFO_V2 _IOC(_IOC_NONE, IOASID_TYPE, IOASID_BASE + 1,
sizeof(struct ioasid_info_v2))
Then we can get rid of the API version and just version individual IOCTLs.
Is that right?
> Jason
Thanks,
Jacob
On Thu, Mar 11, 2021 at 02:55:34PM -0800, Jacob Pan wrote:
> Hi Jason,
>
> Thanks for the review.
>
> On Wed, 10 Mar 2021 15:23:01 -0400, Jason Gunthorpe <[email protected]> wrote:
>
> > On Sat, Feb 27, 2021 at 02:01:26PM -0800, Jacob Pan wrote:
> >
> > > [...]
> >
> > I think this is generally a bad idea, if you change the API later then
> > also change the ioctl numbers and everything should work out
> >
> > eg use the 4th argument to IOC to specify something about the ABI
> >
> Let me try to understand the idea, do you mean something like this?
> #define IOASID_GET_INFO _IOC(_IOC_NONE, IOASID_TYPE, IOASID_BASE + 1,
> sizeof(struct ioasid_info))
>
> If we later change the size of struct ioasid_info, IOASID_GET_INFO would be
> a different ioctl number. Then we will break the existing user space that
> uses the old number. So I am guessing you meant we need to have a different
> name also. i.e.
Something like that is more appropriate. Generally we should not be
planning to 'remove' IOCTLs. The kernel must always keep backwards
compat, so any new format you introduce down the road has to have a new
IOCTL number so the old format can continue to be supported.
Negotiation of support can usually be done by probing for ENOIOCTLCMD
or similar on the new ioctls, not an API version.
Jason
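
For illustration, such userspace probing might look like the sketch
below. IOASID_GET_INFO_V2 and struct ioasid_info_v2 are hypothetical
stand-ins for a future incompatible revision, not part of this series;
note that the kernel's ENOIOCTLCMD surfaces to userspace as ENOTTY:

	struct ioasid_info_v2 info2 = { .argsz = sizeof(info2) };
	struct ioasid_info info = { .argsz = sizeof(info) };

	if (ioctl(fd, IOASID_GET_INFO_V2, &info2) < 0 && errno == ENOTTY) {
		/* older kernel without the V2 ioctl: fall back */
		if (ioctl(fd, IOASID_GET_INFO, &info) < 0)
			return -1;
	}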