Received: by 2002:a05:6a10:a0d1:0:0:0:0 with SMTP id j17csp2043877pxa; Mon, 24 Aug 2020 03:35:07 -0700 (PDT) X-Google-Smtp-Source: ABdhPJwA/aDwqYm9I0WJyoHu6LLcrXLLuH4PV8/LCEKbo6M6kM8okljHs0XEGio6EFrJ55daSct4 X-Received: by 2002:a05:6402:206a:: with SMTP id bd10mr4521632edb.321.1598265306966; Mon, 24 Aug 2020 03:35:06 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1598265306; cv=none; d=google.com; s=arc-20160816; b=v8rWPkXbaPH9t2Zp78TTsqJRNWmseSxPz3uL7qrR18ZQ2THCZxTvYg3rWDFWrcNzck 6/sDK9y4Zbyqq0OP0C2FIJGayBX5yfF+J4GIUir3E7C4X/vJwzt0CwbPvOMftxR2SWVf YP9Vavr2rcRlqkMzDWvsbNyLk2XKfzDfsiDqMPvJfNBw1G97tWCL7/XPgu3zIch4tST0 bcG5TAsnGga6KzVx9PpT3nhEkaX56mqnpjdg/b08RgJNvX+bzd9llCWxh4nckXRUEQpN w021EnlhviXH3Hs7K2YNmeKRgWWzAu1r8rMQ8RfLCaqksygqNy622q420hCgl+byedny JhZg== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:sender:in-reply-to:content-transfer-encoding :content-disposition:mime-version:references:message-id:subject:cc :to:from:date:dkim-signature; bh=gQJsdW9V+/O/V6viLqmhaLN2g7JwKwLu9J/aT+L/weo=; b=ELF+1pXyfhzAVPzL7F4On6T4Fnmn7TP8g4vMVW6NqpIoOBQZsRza98WzIOx0GIRP2D jyOASKhpn1w9KJAflQW4HJRwSOGlH7p/QoCc7sEYcSCXVGoLAyTfdluB9ehJ/yPFbsJ3 C7YK7FM3WaAXTcTWpwq0SB168CY+msmFNZyLYs6QSp8rkV7v2Ml7ZekRFTg5GCKmYHS6 fkvm6FcxbJkZGW2yLliehX/XboKQvswOyyJZnoNUFzUQXtsdAtQsow8rUd3AqhF9JtEO vYBNjNAxaoBvGC3ZdvqjLhOnKpeLazbhT+g6ykyvowBMOVcBGsR9DRuXfhHbrGldUCJy IiGw== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@linaro.org header.s=google header.b=cdaZpDQl; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=linaro.org Return-Path: Received: from vger.kernel.org (vger.kernel.org. [23.128.96.18]) by mx.google.com with ESMTP id mv6si217569ejb.698.2020.08.24.03.34.43; Mon, 24 Aug 2020 03:35:06 -0700 (PDT) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) client-ip=23.128.96.18; Authentication-Results: mx.google.com; dkim=pass header.i=@linaro.org header.s=google header.b=cdaZpDQl; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=linaro.org Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1727780AbgHXKde (ORCPT + 99 others); Mon, 24 Aug 2020 06:33:34 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:53034 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1726953AbgHXKc7 (ORCPT ); Mon, 24 Aug 2020 06:32:59 -0400 Received: from mail-ed1-x543.google.com (mail-ed1-x543.google.com [IPv6:2a00:1450:4864:20::543]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 40CF5C061574 for ; Mon, 24 Aug 2020 03:32:58 -0700 (PDT) Received: by mail-ed1-x543.google.com with SMTP id l23so7547064edv.11 for ; Mon, 24 Aug 2020 03:32:58 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linaro.org; s=google; h=date:from:to:cc:subject:message-id:references:mime-version :content-disposition:content-transfer-encoding:in-reply-to; bh=gQJsdW9V+/O/V6viLqmhaLN2g7JwKwLu9J/aT+L/weo=; b=cdaZpDQlkAjWBwm5OjUe12TCF7bqBxV/JfzoEPQoYwICM3068SGGxOoS60+u3dgG3T B6KBzeiPCZTvRnNqYdziUS24QgRnvg2A42Ebm+5VvJmysEwHKOlvn0RLuMNnsvPvJ6Td POL/3dyqXixjviEWVSK1cZ61wBCJyUlTWxrR3J4s43FRB2C8KbLyjQSblXCzkCzS0W2J qyFqadhiumetB0c6dXQdMrV7jFvxxcS17bFKeFy5GgX4ESICfEVg3TJcgqcKPIDes2cT Xdikw1/35IT5PR2IubTrO+SdtlZIH4kdvm3qLvPwTFfYkxp/xBJgcoWDK1m6eQRfZvPq M8Iw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:date:from:to:cc:subject:message-id:references :mime-version:content-disposition:content-transfer-encoding :in-reply-to; bh=gQJsdW9V+/O/V6viLqmhaLN2g7JwKwLu9J/aT+L/weo=; b=AHMWAYSXCsoqFaiWbfFuO+xzQvbnoYLJETPuPLaJlTktqIAIlAJaqshXyYG88IJmjD 8UjBfDCV4yU5mY//htcZVbveNWx6vSeKgBwsaTO0J7UkG/J9FVJsjooJttXX1540w6H5 rUpQwfdKSDvHROOO7yMdFXUZpWCd9hq614JgY/yQFOm5cKRKK5JEi5jAmSAWcrzFZfwQ xMaSa19DKueEQ+yJIAD9l4wTYUosXzE1L1qSno4Gx3sw7+qBb0/1si+A3OxMrU0uVQU2 rE7ysfe0s0yVyNrP3eND4Rfk0MBFVf/D26WLKmHJT2zfxiBWlvU6lA34vaw4xZWwn8rQ OVAA== X-Gm-Message-State: AOAM533NQWG7Y9xB71KJjMEtGbACQf2gY/KKwX1fo3Hjnzj7GEB71eCh 5zjXzcqQ1YSnywhOK8z9gh+VwQ== X-Received: by 2002:a05:6402:748:: with SMTP id p8mr4856393edy.305.1598265176484; Mon, 24 Aug 2020 03:32:56 -0700 (PDT) Received: from myrica ([2001:1715:4e26:a7e0:116c:c27a:3e7f:5eaf]) by smtp.gmail.com with ESMTPSA id r18sm9436293ejz.118.2020.08.24.03.32.55 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Mon, 24 Aug 2020 03:32:55 -0700 (PDT) Date: Mon, 24 Aug 2020 12:32:39 +0200 From: Jean-Philippe Brucker To: Jacob Pan Cc: iommu@lists.linux-foundation.org, LKML , Jean-Philippe Brucker , Lu Baolu , Joerg Roedel , David Woodhouse , Yi Liu , "Tian, Kevin" , Raj Ashok , Eric Auger , Wu Hao Subject: Re: [PATCH v2 1/9] docs: Document IO Address Space ID (IOASID) APIs Message-ID: <20200824103239.GA3210689@myrica> References: <1598070918-21321-1-git-send-email-jacob.jun.pan@linux.intel.com> <1598070918-21321-2-git-send-email-jacob.jun.pan@linux.intel.com> MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Disposition: inline Content-Transfer-Encoding: 8bit In-Reply-To: <1598070918-21321-2-git-send-email-jacob.jun.pan@linux.intel.com> Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Fri, Aug 21, 2020 at 09:35:10PM -0700, Jacob Pan wrote: > IOASID is used to identify address spaces that can be targeted by device > DMA. It is a system-wide resource that is essential to its many users. > This document is an attempt to help developers from all vendors navigate > the APIs. At this time, ARM SMMU and Intel’s Scalable IO Virtualization > (SIOV) enabled platforms are the primary users of IOASID. Examples of > how SIOV components interact with IOASID APIs are provided in that many > APIs are driven by the requirements from SIOV. > > Signed-off-by: Liu Yi L > Signed-off-by: Wu Hao > Signed-off-by: Jacob Pan > --- > Documentation/ioasid.rst | 618 +++++++++++++++++++++++++++++++++++++++++++++++ > 1 file changed, 618 insertions(+) > create mode 100644 Documentation/ioasid.rst > > diff --git a/Documentation/ioasid.rst b/Documentation/ioasid.rst Thanks for writing this up. Should it go to Documentation/driver-api/, or Documentation/driver-api/iommu/? I think this also needs to Cc linux-doc@vger.kernel.org and corbet@lwn.net > new file mode 100644 > index 000000000000..b6a8cdc885ff > --- /dev/null > +++ b/Documentation/ioasid.rst > @@ -0,0 +1,618 @@ > +.. ioasid: > + > +===================================== > +IO Address Space ID > +===================================== > + > +IOASID is a generic name for PCIe Process Address ID (PASID) or ARM > +SMMU sub-stream ID. An IOASID identifies an address space that DMA "SubstreamID" > +requests can target. > + > +The primary use cases for IOASID are Shared Virtual Address (SVA) and > +IO Virtual Address (IOVA). However, the requirements for IOASID IOVA alone isn't a use case, maybe "multiple IOVA spaces per device"? > +management can vary among hardware architectures. > + > +This document covers the generic features supported by IOASID > +APIs. Vendor-specific use cases are also illustrated with Intel's VT-d > +based platforms as the first example. > + > +.. contents:: :local: > + > +Glossary > +======== > +PASID - Process Address Space ID > + > +IOASID - IO Address Space ID (generic term for PCIe PASID and > +sub-stream ID in SMMU) "SubstreamID" > + > +SVA/SVM - Shared Virtual Addressing/Memory > + > +ENQCMD - New Intel X86 ISA for efficient workqueue submission [1] Maybe drop the "New", to keep the documentation perennial. It might be good to add internal links here to the specifications URLs at the bottom. > + > +DSA - Intel Data Streaming Accelerator [2] > + > +VDCM - Virtual device composition module [3] > + > +SIOV - Intel Scalable IO Virtualization > + > + > +Key Concepts > +============ > + > +IOASID Set > +----------- > +An IOASID set is a group of IOASIDs allocated from the system-wide > +IOASID pool. An IOASID set is created and can be identified by a > +token of u64. Refer to IOASID set APIs for more details. Identified either by an u64 or an mm_struct, right? Maybe just drop the second sentence if it's detailed in the IOASID set section below. > + > +IOASID set is particularly useful for guest SVA where each guest could > +have its own IOASID set for security and efficiency reasons. > + > +IOASID Set Private ID (SPID) > +---------------------------- > +SPIDs are introduced as IOASIDs within its set. Each SPID maps to a > +system-wide IOASID but the namespace of SPID is within its IOASID > +set. The intro isn't super clear. Perhaps this is simpler: "Each IOASID set has a private namespace of SPIDs. An SPID maps to a single system-wide IOASID." > SPIDs can be used as guest IOASIDs where each guest could do > +IOASID allocation from its own pool and map them to host physical > +IOASIDs. SPIDs are particularly useful for supporting live migration > +where decoupling guest and host physical resources are necessary. > + > +For example, two VMs can both allocate guest PASID/SPID #101 but map to > +different host PASIDs #201 and #202 respectively as shown in the > +diagram below. > +:: > + > + .------------------. .------------------. > + | VM 1 | | VM 2 | > + | | | | > + |------------------| |------------------| > + | GPASID/SPID 101 | | GPASID/SPID 101 | > + '------------------' -------------------' Guest > + __________|______________________|______________________ > + | | Host > + v v > + .------------------. .------------------. > + | Host IOASID 201 | | Host IOASID 202 | > + '------------------' '------------------' > + | IOASID set 1 | | IOASID set 2 | > + '------------------' '------------------' > + > +Guest PASID is treated as IOASID set private ID (SPID) within an > +IOASID set, mappings between guest and host IOASIDs are stored in the > +set for inquiry. > + > +IOASID APIs > +=========== > +To get the IOASID APIs, users must #include . These APIs > +serve the following functionalities: > + > + - IOASID allocation/Free > + - Group management in the form of ioasid_set > + - Private data storage and lookup > + - Reference counting > + - Event notification in case of state change > + > +IOASID Set Level APIs > +-------------------------- > +For use cases such as guest SVA it is necessary to manage IOASIDs at > +a group level. For example, VMs may allocate multiple IOASIDs for > +guest process address sharing (vSVA). It is imperative to enforce > +VM-IOASID ownership such that malicious guest cannot target DMA "a malicious guest" > +traffic outside its own IOASIDs, or free an active IOASID belong to "that belongs to" > +another VM. > +:: > + > + struct ioasid_set *ioasid_alloc_set(void *token, ioasid_t quota, u32 type) > + > + int ioasid_adjust_set(struct ioasid_set *set, int quota); These could be named "ioasid_set_alloc" and "ioasid_set_adjust" to be consistent with the rest of the API. > + > + void ioasid_set_get(struct ioasid_set *set) > + > + void ioasid_set_put(struct ioasid_set *set) > + > + void ioasid_set_get_locked(struct ioasid_set *set) > + > + void ioasid_set_put_locked(struct ioasid_set *set) > + > + int ioasid_set_for_each_ioasid(struct ioasid_set *sdata, Might be nicer to keep the same argument names within the API. Here "set" rather than "sdata". > + void (*fn)(ioasid_t id, void *data), > + void *data) (alignment) > + > + > +IOASID set concept is introduced to represent such IOASID groups. Each Or just "IOASID sets represent such IOASID groups", but might be redundant. > +IOASID set is created with a token which can be one of the following > +types: > + > + - IOASID_SET_TYPE_NULL (Arbitrary u64 value) > + - IOASID_SET_TYPE_MM (Set token is a mm_struct) > + > +The explicit MM token type is useful when multiple users of an IOASID > +set under the same process need to communicate about their shared IOASIDs. > +E.g. An IOASID set created by VFIO for one guest can be associated > +with the KVM instance for the same guest since they share a common mm_struct. > + > +The IOASID set APIs serve the following purposes: > + > + - Ownership/permission enforcement > + - Take collective actions, e.g. free an entire set > + - Event notifications within a set > + - Look up a set based on token > + - Quota enforcement This paragraph could be earlier in the section > + > +Individual IOASID APIs > +---------------------- > +Once an ioasid_set is created, IOASIDs can be allocated from the set. > +Within the IOASID set namespace, set private ID (SPID) is supported. In > +the VM use case, SPID can be used for storing guest PASID. > + > +:: > + > + ioasid_t ioasid_alloc(struct ioasid_set *set, ioasid_t min, ioasid_t max, > + void *private); > + > + int ioasid_get(struct ioasid_set *set, ioasid_t ioasid); > + > + void ioasid_put(struct ioasid_set *set, ioasid_t ioasid); > + > + int ioasid_get_locked(struct ioasid_set *set, ioasid_t ioasid); > + > + void ioasid_put_locked(struct ioasid_set *set, ioasid_t ioasid); > + > + void *ioasid_find(struct ioasid_set *set, ioasid_t ioasid, > + bool (*getter)(void *)); > + > + ioasid_t ioasid_find_by_spid(struct ioasid_set *set, ioasid_t spid) > + > + int ioasid_attach_data(struct ioasid_set *set, ioasid_t ioasid, > + void *data); > + int ioasid_attach_spid(struct ioasid_set *set, ioasid_t ioasid, > + ioasid_t ssid); s/ssid/spid > + > + > +Notifications > +------------- > +An IOASID may have multiple users, each user may have hardware context > +associated with an IOASID. When the status of an IOASID changes, > +e.g. an IOASID is being freed, users need to be notified such that the > +associated hardware context can be cleared, flushed, and drained. > + > +:: > + > + int ioasid_register_notifier(struct ioasid_set *set, struct > + notifier_block *nb) > + > + void ioasid_unregister_notifier(struct ioasid_set *set, > + struct notifier_block *nb) > + > + int ioasid_register_notifier_mm(struct mm_struct *mm, struct > + notifier_block *nb) > + > + void ioasid_unregister_notifier_mm(struct mm_struct *mm, struct > + notifier_block *nb) > + > + int ioasid_notify(ioasid_t ioasid, enum ioasid_notify_val cmd, > + unsigned int flags) > + > + > +Events > +~~~~~~ > +Notification events are pertinent to individual IOASIDs, they can be > +one of the following: > + > + - ALLOC > + - FREE > + - BIND > + - UNBIND > + > +Ordering > +~~~~~~~~ > +Ordering is supported by IOASID notification priorities as the > +following (in ascending order): > + > +:: > + > + enum ioasid_notifier_prios { > + IOASID_PRIO_LAST, > + IOASID_PRIO_IOMMU, > + IOASID_PRIO_DEVICE, > + IOASID_PRIO_CPU, > + }; > + > +The typical use case is when an IOASID is freed due to an exception, DMA > +source should be quiesced before tearing down other hardware contexts > +in the system. This will reduce the churn in handling faults. DMA work > +submission is performed by the CPU which is granted higher priority than > +devices. > + > + > +Scopes > +~~~~~~ > +There are two types of notifiers in IOASID core: system-wide and > +ioasid_set-wide. > + > +System-wide notifier is catering for users that need to handle all > +IOASIDs in the system. E.g. The IOMMU driver handles all IOASIDs. > + > +Per ioasid_set notifier can be used by VM specific components such as > +KVM. After all, each KVM instance only cares about IOASIDs within its > +own set. > + > + > +Atomicity > +~~~~~~~~~ > +IOASID notifiers are atomic due to spinlocks used inside the IOASID > +core. For tasks cannot be completed in the notifier handler, async work "tasks that cannot be" > +can be submitted to complete the work later as long as there is no > +ordering requirement. > + > +Reference counting > +------------------ > +IOASID lifecycle management is based on reference counting. Users of > +IOASID intend to align lifecycle with the IOASID need to hold "who intend to" > +reference of the IOASID. IOASID will not be returned to the pool for "a reference to the IOASID. The IOASID" > +allocation until all references are dropped. Calling ioasid_free() > +will mark the IOASID as FREE_PENDING if the IOASID has outstanding > +reference. ioasid_get() is not allowed once an IOASID is in the > +FREE_PENDING state. > + > +Event notifications are used to inform users of IOASID status change. > +IOASID_FREE event prompts users to drop their references after > +clearing its context. > + > +For example, on VT-d platform when an IOASID is freed, teardown > +actions are performed on KVM, device driver, and IOMMU driver. > +KVM shall register notifier block with:: > + > + static struct notifier_block pasid_nb_kvm = { > + .notifier_call = pasid_status_change_kvm, > + .priority = IOASID_PRIO_CPU, > + }; > + > +VDCM driver shall register notifier block with:: > + > + static struct notifier_block pasid_nb_vdcm = { > + .notifier_call = pasid_status_change_vdcm, > + .priority = IOASID_PRIO_DEVICE, > + }; > + > +In both cases, notifier blocks shall be registered on the IOASID set > +such that *only* events from the matching VM is received. > + > +If KVM attempts to register notifier block before the IOASID set is > +created for the MM token, the notifier block will be placed on a > +pending list inside IOASID core. Once the token matching IOASID set > +is created, IOASID will register the notifier block automatically. > +IOASID core does not replay events for the existing IOASIDs in the > +set. For IOASID set of MM type, notification blocks can be registered > +on empty sets only. This is to avoid lost events. > + > +IOMMU driver shall register notifier block on global chain:: > + > + static struct notifier_block pasid_nb_vtd = { > + .notifier_call = pasid_status_change_vtd, > + .priority = IOASID_PRIO_IOMMU, > + }; > + > +Custom allocator APIs > +--------------------- > + > +:: > + > + int ioasid_register_allocator(struct ioasid_allocator_ops *allocator); > + > + void ioasid_unregister_allocator(struct ioasid_allocator_ops *allocator); > + > +Allocator Choices > +~~~~~~~~~~~~~~~~~ > +IOASIDs are allocated for both host and guest SVA/IOVA usage. However, > +allocators can be different. For example, on VT-d guest PASID > +allocation must be performed via a virtual command interface which is > +emulated by VMM. > + > +IOASID core has the notion of "custom allocator" such that guest can > +register virtual command allocator that precedes the default one. > + > +Namespaces > +~~~~~~~~~~ > +IOASIDs are limited system resources that default to 20 bits in > +size. Since each device has its own table, theoretically the namespace > +can be per device also. However, for security reasons sharing PASID > +tables among devices are not good for isolation. Therefore, IOASID > +namespace is system-wide. I don't follow this development. Having per-device PASID table would work fine for isolation (assuming no hardware bug necessitating IOMMU groups). If I remember correctly IOASID space was chosen to be OS-wide because it simplifies the management code (single PASID per task), and it is system-wide across VMs only in the case of VT-d scalable mode. > + > +There are also other reasons to have this simpler system-wide > +namespace. Take VT-d as an example, VT-d supports shared workqueue > +and ENQCMD[1] where one IOASID could be used to submit work on Maybe use the Sphinx glossary syntax rather than "[1]" https://www.sphinx-doc.org/en/master/usage/restructuredtext/directives.html#glossary-directive > +multiple devices that are shared with other VMs. This requires IOASID > +to be system-wide. This is also the reason why guests must use an > +emulated virtual command interface to allocate IOASID from the host. > + > + > +Life cycle > +========== > +This section covers IOASID lifecycle management for both bare-metal > +and guest usages. In bare-metal SVA, MMU notifier is directly hooked > +up with IOMMU driver, therefore the process address space (MM) > +lifecycle is aligned with IOASID. > + > +However, guest MMU notifier is not available to host IOMMU driver, > +when guest MM terminates unexpectedly, the events have to go through > +VFIO and IOMMU UAPI to reach host IOMMU driver. There are also more > +parties involved in guest SVA, e.g. on Intel VT-d platform, IOASIDs > +are used by IOMMU driver, KVM, VDCM, and VFIO. > + > +Native IOASID Life Cycle (VT-d Example) > +--------------------------------------- > + > +The normal flow of native SVA code with Intel Data Streaming > +Accelerator(DSA) [2] as example: > + > +1. Host user opens accelerator FD, e.g. DSA driver, or uacce; > +2. DSA driver allocate WQ, do sva_bind_device(); > +3. IOMMU driver calls ioasid_alloc(), then bind PASID with device, > + mmu_notifier_get() > +4. DMA starts by DSA driver userspace > +5. DSA userspace close FD > +6. DSA/uacce kernel driver handles FD.close() > +7. DSA driver stops DMA > +8. DSA driver calls sva_unbind_device(); > +9. IOMMU driver does unbind, clears PASID context in IOMMU, flush > + TLBs. mmu_notifier_put() called. > +10. mmu_notifier.release() called, IOMMU SVA code calls ioasid_free()* > +11. The IOASID is returned to the pool, reclaimed. > + > +:: > + Use a footnote? https://www.sphinx-doc.org/en/master/usage/restructuredtext/basics.html#footnotes > + * With ENQCMD, PASID used on VT-d is not released in mmu_notifier() but > + mmdrop(). mmdrop comes after FD close. Should not matter. "comes after FD close, which doesn't make a difference?" The following might not be necessary since early process termination is described later. > + If the user process dies unexpectedly, Step #10 may come before > + Step #5, in between, all DMA faults discarded. PRQ responded with PRQ hasn't been defined in this document. > + code INVALID REQUEST. > + > +During the normal teardown, the following three steps would happen in > +order: > + > +1. Device driver stops DMA request > +2. IOMMU driver unbinds PASID and mm, flush all TLBs, drain in-flight > + requests. > +3. IOASID freed > + > +Exception happens when process terminates *before* device driver stops > +DMA and call IOMMU driver to unbind. The flow of process exists are as "exits" > +follows: > + > +:: > + > + do_exit() { > + exit_mm() { > + mm_put(); > + exit_mmap() { > + intel_invalidate_range() //mmu notifier > + tlb_finish_mmu() > + mmu_notifier_release(mm) { > + intel_iommu_release() { > + [2] intel_iommu_teardown_pasid(); Parentheses might be better than square brackets for step numbers > + intel_iommu_flush_tlbs(); > + } > + // tlb_invalidate_range cb removed > + } > + unmap_vmas(); > + free_pgtables(); // IOMMU cannot walk PGT after this > + }; > + } > + exit_files(tsk) { > + close_files() { > + dsa_close(); > + [1] dsa_stop_dma(); > + intel_svm_unbind_pasid(); //nothing to do > + } > + } > + } > + > + mmdrop() /* some random time later, lazy mm user */ { > + mm_free_pgd(); > + destroy_context(mm); { > + [3] ioasid_free(); > + } > + } > + > +As shown in the list above, step #2 could happen before > +#1. Unrecoverable(UR) faults could happen between #2 and #1. > + > +Also notice that TLB invalidation occurs at mmu_notifier > +invalidate_range callback as well as the release callback. The reason > +is that release callback will delete IOMMU driver from the notifier > +chain which may skip invalidate_range() calls during the exit path. > + > +To avoid unnecessary reporting of UR fault, IOMMU driver shall disable > +fault reporting after free and before unbind. > + > +Guest IOASID Life Cycle (VT-d Example) > +-------------------------------------- > +Guest IOASID life cycle starts with guest driver open(), this could be > +uacce or individual accelerator driver such as DSA. At FD open, > +sva_bind_device() is called which triggers a series of actions. > + > +The example below is an illustration of *normal* operations that > +involves *all* the SW components in VT-d. The flow can be simpler if > +no ENQCMD is supported. > + > +:: > + > + VFIO IOMMU KVM VDCM IOASID Ref > + .................................................................. > + 1 ioasid_register_notifier/_mm() > + 2 ioasid_alloc() 1 > + 3 bind_gpasid() > + 4 iommu_bind()->ioasid_get() 2 > + 5 ioasid_notify(BIND) > + 6 -> ioasid_get() 3 > + 7 -> vmcs_update_atomic() > + 8 mdev_write(gpasid) > + 9 hpasid= > + 10 find_by_spid(gpasid) 4 > + 11 vdev_write(hpasid) > + 12 -------- GUEST STARTS DMA -------------------------- > + 13 -------- GUEST STOPS DMA -------------------------- > + 14 mdev_clear(gpasid) > + 15 vdev_clear(hpasid) > + 16 ioasid_put() 3 > + 17 unbind_gpasid() > + 18 iommu_ubind() > + 19 ioasid_notify(UNBIND) > + 20 -> vmcs_update_atomic() > + 21 -> ioasid_put() 2 > + 22 ioasid_free() 1 > + 23 ioasid_put() 0 > + 24 Reclaimed > + -------------- New Life Cycle Begin ---------------------------- > + 1 ioasid_alloc() -> 1 > + > + Note: IOASID Notification Events: FREE, BIND, UNBIND > + > +Exception cases arise when a guest crashes or a malicious guest > +attempts to cause disruption on the host system. The fault handling > +rules are: > + > +1. IOASID free must *always* succeed. > +2. An inactive period may be required before the freed IOASID is > + reclaimed. During this period, consumers of IOASID perform cleanup. > +3. Malfunction is limited to the guest owned resources for all > + programming errors. > + > +The primary source of exception is when the following are out of > +order: > + > +1. Start/Stop of DMA activity > + (Guest device driver, mdev via VFIO) > +2. Setup/Teardown of IOMMU PASID context, IOTLB, DevTLB flushes > + (Host IOMMU driver bind/unbind) > +3. Setup/Teardown of VMCS PASID translation table entries (KVM) in > + case of ENQCMD > +4. Programming/Clearing host PASID in VDCM (Host VDCM driver) > +5. IOASID alloc/free (Host IOASID) > + > +VFIO is the *only* user-kernel interface, which is ultimately > +responsible for exception handlings. "handling" > + > +#1 is processed the same way as the assigned device today based on > +device file descriptors and events. There is no special handling. > + > +#3 is based on bind/unbind events emitted by #2. > + > +#4 is naturally aligned with IOASID life cycle in that an illegal > +guest PASID programming would fail in obtaining reference of the > +matching host IOASID. > + > +#5 is similar to #4. The fault will be reported to the user if PASID > +used in the ENQCMD is not set up in VMCS PASID translation table. > + > +Therefore, the remaining out of order problem is between #2 and > +#5. I.e. unbind vs. free. More specifically, free before unbind. > + > +IOASID notifier and refcounting are used to ensure order. Following > +a publisher-subscriber pattern where: > + > +- Publishers: VFIO & IOMMU > +- Subscribers: KVM, VDCM, IOMMU > + > +IOASID notifier is atomic which requires subscribers to do quick > +handling of the event in the atomic context. Workqueue can be used for > +any processing that requires thread context. IOASID reference must be > +acquired before receiving the FREE event. The reference must be > +dropped at the end of the processing in order to return the IOASID to > +the pool. > + > +Let's examine the IOASID life cycle again when free happens *before* > +unbind. This could be a result of misbehaving guests or crash. Assuming > +VFIO cannot enforce unbind->free order. Notice that the setup part up > +until step #12 is identical to the normal case, the flow below starts > +with step 13. > + > +:: > + > + VFIO IOMMU KVM VDCM IOASID Ref > + .................................................................. > + 13 -------- GUEST STARTS DMA -------------------------- > + 14 -------- *GUEST MISBEHAVES!!!* ---------------- > + 15 ioasid_free() > + 16 ioasid_notify(FREE) > + 17 mark_ioasid_inactive[1] > + 18 kvm_nb_handler(FREE) > + 19 vmcs_update_atomic() > + 20 ioasid_put_locked() -> 3 > + 21 vdcm_nb_handler(FREE) > + 22 iomm_nb_handler(FREE) > + 23 ioasid_free() returns[2] schedule_work() 2 > + 24 schedule_work() vdev_clear_wk(hpasid) > + 25 teardown_pasid_wk() > + 26 ioasid_put() -> 1 > + 27 ioasid_put() 0 > + 28 Reclaimed > + 29 unbind_gpasid() > + 30 iommu_unbind()->ioasid_find() Fails[3] > + -------------- New Life Cycle Begin ---------------------------- > + > +Note: > + > +1. By marking IOASID inactive at step #17, no new references can be Is "inactive" FREE_PENDING? > + held. ioasid_get/find() will return -ENOENT; > +2. After step #23, all events can go out of order. Shall not affect > + the outcome. > +3. IOMMU driver fails to find private data for unbinding. If unbind is > + called after the same IOASID is allocated for the same guest again, > + this is a programming error. The damage is limited to the guest > + itself since unbind performs permission checking based on the > + IOASID set associated with the guest process. > + > +KVM PASID Translation Table Updates > +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ > +Per VM PASID translation table is maintained by KVM in order to > +support ENQCMD in the guest. The table contains host-guest PASID > +translations to be consumed by CPU ucode. The synchronization of the > +PASID states depends on VFIO/IOMMU driver, where IOCTL and atomic > +notifiers are used. KVM must register IOASID notifier per VM instance > +during launch time. The following events are handled: > + > +1. BIND/UNBIND > +2. FREE > + > +Rules: > + > +1. Multiple devices can bind with the same PASID, this can be different PCI > + devices or mdevs within the same PCI device. However, only the > + *first* BIND and *last* UNBIND emit notifications. > +2. IOASID code is responsible for ensuring the correctness of H-G > + PASID mapping. There is no need for KVM to validate the > + notification data. > +3. When UNBIND happens *after* FREE, KVM will see error in > + ioasid_get() even when the reclaim is not done. IOMMU driver will > + also avoid sending UNBIND if the PASID is already FREE. > +4. When KVM terminates *before* FREE & UNBIND, references will be > + dropped for all host PASIDs. > + > +VDCM PASID Programming > +~~~~~~~~~~~~~~~~~~~~~~ > +VDCM composes virtual devices and exposes them to the guests. When > +the guest allocates a PASID then program it to the virtual device, VDCM > +intercepts the programming attempt then program the matching host "programs" Thanks, Jean > +PASID on to the hardware. > +Conversely, when a device is going away, VDCM must be informed such > +that PASID context on the hardware can be cleared. There could be > +multiple mdevs assigned to different guests in the same VDCM. Since > +the PASID table is shared at PCI device level, lazy clearing is not > +secure. A malicious guest can attack by using newly freed PASIDs that > +are allocated by another guest. > + > +By holding a reference of the PASID until VDCM cleans up the HW context, > +it is guaranteed that PASID life cycles do not cross within the same > +device. > + > + > +Reference > +==================================================== > +1. https://software.intel.com/sites/default/files/managed/c5/15/architecture-instruction-set-extensions-programming-reference.pdf > + > +2. https://01.org/blogs/2019/introducing-intel-data-streaming-accelerator > + > +3. https://software.intel.com/en-us/download/intel-data-streaming-accelerator-preliminary-architecture-specification > -- > 2.7.4 >