Received: by 2002:a05:6a10:22f:0:0:0:0 with SMTP id 15csp3042880pxk; Mon, 7 Sep 2020 01:08:42 -0700 (PDT) X-Google-Smtp-Source: ABdhPJw6v4WvwIT2Mw6JUH/AYI3tdwojyJwawxs6HCrquwnyQCgrNbu/VI6Fer4iYrfsOOWV1/fW X-Received: by 2002:a50:8e58:: with SMTP id 24mr20863100edx.226.1599466122557; Mon, 07 Sep 2020 01:08:42 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1599466122; cv=none; d=google.com; s=arc-20160816; b=wSiwCLFfY9sJX5F6BKRrl2VA4Y+XXPeyVuLBG4GQHbSo4xfRTt5+ahqjJN2zohP5UW BEdPH5N9k0TVi6+qG0hTPt498h5BUV+vaAbjsi6TA+KW26kZ9dFGHCsaagySf0CPQBy1 87CW0LJMjjX8ynzoFzvemBt6hN9yQWKT+O5cNYH2uc4wtqtSIEibBWf+1DCeAO8YLjO9 62oDj+nt8pxIfEED6Dv0JGas8s6SkMlLxgQZPHXN9SpXm3+I92DOL6Q4P8Mukzpa1T8l zDfDnlhh1T0xSsd3KtQGhnRvHKdPSFY4g5Eq7fhUMKWNoIe2L/MNqyUEvRpWRwiZzSni zxWw== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:sender:content-transfer-encoding :content-language:in-reply-to:mime-version:user-agent:date :message-id:from:references:cc:to:subject:dkim-signature; bh=/3f+hXKHvVEErvyx5yvyvt//WrZVXlZnR4qFaUMPke4=; b=E07DB7VBi/6dwQV0nhRVpqBJE/GDVMFAwYMuxhgR0jpHQK6J56oeHRJcSCm6GOAwx/ NLS7FbdYXEj42O47S48vjgjKVaEFucGYDptV/FXWCqJBM0wwFGGuBZHxPv/lnzYT1TF0 oKO30IoABrzI2NijbVhYKPw26tdvdls/yF0zRhRKucZkM70fNwFZs/NtVjDLaykYtwtH REMCX3KKv8t/C2ShbnQle4a6lt9H2ERD/gY9AA6SLjBJ1qNS7x99c/nqoss2bmGO7Yjt 88Z4fJizTHliMfqp3baW3w3kPwaDL6NflaGvIOyu4yZ/DeAw1BYvK1ZA2JXU0GeMCvdi +qnw== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@redhat.com header.s=mimecast20190719 header.b=emYG8PPk; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=redhat.com Return-Path: Received: from vger.kernel.org (vger.kernel.org. [23.128.96.18]) by mx.google.com with ESMTP id mc20si9374014ejb.571.2020.09.07.01.08.19; Mon, 07 Sep 2020 01:08:42 -0700 (PDT) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) client-ip=23.128.96.18; Authentication-Results: mx.google.com; dkim=pass header.i=@redhat.com header.s=mimecast20190719 header.b=emYG8PPk; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=redhat.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1727908AbgIGIEP (ORCPT + 99 others); Mon, 7 Sep 2020 04:04:15 -0400 Received: from us-smtp-delivery-1.mimecast.com ([205.139.110.120]:35712 "EHLO us-smtp-1.mimecast.com" rhost-flags-OK-OK-OK-FAIL) by vger.kernel.org with ESMTP id S1727884AbgIGIED (ORCPT ); Mon, 7 Sep 2020 04:04:03 -0400 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1599465838; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=/3f+hXKHvVEErvyx5yvyvt//WrZVXlZnR4qFaUMPke4=; b=emYG8PPkici5FTy3VmLLzOSMHT0MsxKfYD4foXUTQEsQl1VPblqWghSNqBVWG+rrBGRqq7 nXmYJR518eVfx0Igzi43p2IDVUK+zt6hoDiCxlRW3hCcVsZiX3cmv29QKkODwqDnM6Tdww Kv0IBBpVpxJfgpQIi4TZLkgd7ONavLo= Received: from mimecast-mx01.redhat.com (mimecast-mx01.redhat.com [209.132.183.4]) (Using TLS) by relay.mimecast.com with ESMTP id us-mta-248-sEDFjCfyOba1QV3e50kiMA-1; Mon, 07 Sep 2020 04:03:50 -0400 X-MC-Unique: sEDFjCfyOba1QV3e50kiMA-1 Received: from smtp.corp.redhat.com (int-mx02.intmail.prod.int.phx2.redhat.com [10.5.11.12]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by mimecast-mx01.redhat.com (Postfix) with ESMTPS id 74A59873113; Mon, 7 Sep 2020 08:03:45 +0000 (UTC) Received: from [10.36.112.51] (ovpn-112-51.ams2.redhat.com [10.36.112.51]) by smtp.corp.redhat.com (Postfix) with ESMTPS id 98C7F60C17; Mon, 7 Sep 2020 08:03:41 +0000 (UTC) Subject: Re: [PATCH v2 1/9] docs: Document IO Address Space ID (IOASID) APIs To: Jacob Pan Cc: Jean-Philippe Brucker , Jacob Pan , "Tian, Kevin" , Raj Ashok , Jean-Philippe Brucker , LKML , iommu@lists.linux-foundation.org, Wu Hao , David Woodhouse References: <1598070918-21321-1-git-send-email-jacob.jun.pan@linux.intel.com> <1598070918-21321-2-git-send-email-jacob.jun.pan@linux.intel.com> <20200824103239.GA3210689@myrica> <15e94dd2-5c4d-27fc-f2b1-13e212538c42@redhat.com> <20200901095640.6f657756@jacob-builder> From: Auger Eric Message-ID: <9ab100e0-71ad-75d7-ffd7-caab4871cc9c@redhat.com> Date: Mon, 7 Sep 2020 10:03:39 +0200 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:68.0) Gecko/20100101 Thunderbird/68.5.0 MIME-Version: 1.0 In-Reply-To: <20200901095640.6f657756@jacob-builder> Content-Type: text/plain; charset=utf-8 Content-Language: en-US Content-Transfer-Encoding: 8bit X-Scanned-By: MIMEDefang 2.79 on 10.5.11.12 Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Hi Jacob, On 9/1/20 6:56 PM, Jacob Pan wrote: > Hi Eric, > > On Thu, 27 Aug 2020 18:21:07 +0200 > Auger Eric wrote: > >> Hi Jacob, >> On 8/24/20 12:32 PM, Jean-Philippe Brucker wrote: >>> On Fri, Aug 21, 2020 at 09:35:10PM -0700, Jacob Pan wrote: >>>> IOASID is used to identify address spaces that can be targeted by >>>> device DMA. It is a system-wide resource that is essential to its >>>> many users. This document is an attempt to help developers from >>>> all vendors navigate the APIs. At this time, ARM SMMU and Intel’s >>>> Scalable IO Virtualization (SIOV) enabled platforms are the >>>> primary users of IOASID. Examples of how SIOV components interact >>>> with IOASID APIs are provided in that many APIs are driven by the >>>> requirements from SIOV. >>>> >>>> Signed-off-by: Liu Yi L >>>> Signed-off-by: Wu Hao >>>> Signed-off-by: Jacob Pan >>>> --- >>>> Documentation/ioasid.rst | 618 >>>> +++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, >>>> 618 insertions(+) create mode 100644 Documentation/ioasid.rst >>>> >>>> diff --git a/Documentation/ioasid.rst b/Documentation/ioasid.rst >>> >>> Thanks for writing this up. Should it go to >>> Documentation/driver-api/, or Documentation/driver-api/iommu/? I >>> think this also needs to Cc linux-doc@vger.kernel.org and >>> corbet@lwn.net >>>> new file mode 100644 >>>> index 000000000000..b6a8cdc885ff >>>> --- /dev/null >>>> +++ b/Documentation/ioasid.rst >>>> @@ -0,0 +1,618 @@ >>>> +.. ioasid: >>>> + >>>> +===================================== >>>> +IO Address Space ID >>>> +===================================== >>>> + >>>> +IOASID is a generic name for PCIe Process Address ID (PASID) or >>>> ARM +SMMU sub-stream ID. An IOASID identifies an address space >>>> that DMA >>> >>> "SubstreamID" >> On ARM if we don't use PASIDs we have streamids (SID) which can also >> identify address spaces that DMA requests can target. So maybe this >> definition is not sufficient. >> > According to SMMU spec, the SubstreamID is equivalent to PASID. My > understanding is that SID is equivalent to PCI requester ID that > identifies stage 2. Do you plan to use IOASID for stage 2? No. So actually if PASID is not used we still have a default single IOASID matching the single context. So that may be fine as a definition. > IOASID is mostly for SVA and DMA request w/ PASID. > >>> >>>> +requests can target. >>>> + >>>> +The primary use cases for IOASID are Shared Virtual Address (SVA) >>>> and +IO Virtual Address (IOVA). However, the requirements for >>>> IOASID >>> >>> IOVA alone isn't a use case, maybe "multiple IOVA spaces per >>> device"? >>>> +management can vary among hardware architectures. >>>> + >>>> +This document covers the generic features supported by IOASID >>>> +APIs. Vendor-specific use cases are also illustrated with Intel's >>>> VT-d +based platforms as the first example. >>>> + >>>> +.. contents:: :local: >>>> + >>>> +Glossary >>>> +======== >>>> +PASID - Process Address Space ID >>>> + >>>> +IOASID - IO Address Space ID (generic term for PCIe PASID and >>>> +sub-stream ID in SMMU) >>> >>> "SubstreamID" >>> >>>> + >>>> +SVA/SVM - Shared Virtual Addressing/Memory >>>> + >>>> +ENQCMD - New Intel X86 ISA for efficient workqueue submission >>>> [1] >>> >>> Maybe drop the "New", to keep the documentation perennial. It might >>> be good to add internal links here to the specifications URLs at >>> the bottom. >>>> + >>>> +DSA - Intel Data Streaming Accelerator [2] >>>> + >>>> +VDCM - Virtual device composition module [3] >>>> + >>>> +SIOV - Intel Scalable IO Virtualization >>>> + >>>> + >>>> +Key Concepts >>>> +============ >>>> + >>>> +IOASID Set >>>> +----------- >>>> +An IOASID set is a group of IOASIDs allocated from the system-wide >>>> +IOASID pool. An IOASID set is created and can be identified by a >>>> +token of u64. Refer to IOASID set APIs for more details. >>> >>> Identified either by an u64 or an mm_struct, right? Maybe just >>> drop the second sentence if it's detailed in the IOASID set section >>> below. >>>> + >>>> +IOASID set is particularly useful for guest SVA where each guest >>>> could +have its own IOASID set for security and efficiency reasons. >>>> + >>>> +IOASID Set Private ID (SPID) >>>> +---------------------------- >>>> +SPIDs are introduced as IOASIDs within its set. Each SPID maps to >>>> a +system-wide IOASID but the namespace of SPID is within its >>>> IOASID +set. >>> >>> The intro isn't super clear. Perhaps this is simpler: >>> "Each IOASID set has a private namespace of SPIDs. An SPID maps to a >>> single system-wide IOASID." >> or, "within an ioasid set, each ioasid can be associated with an alias >> ID, named SPID." > I don't have strong opinion, I feel it is good to explain the > relationship between SPID and IOASID in both directions, how about add? > " Conversely, each IOASID is associated with an alias ID, named SPID." yep. I amy suggest: each IOASID may be associated with an alias ID, local to the IOASID set, named SPID. > >>> >>>> SPIDs can be used as guest IOASIDs where each guest could do >>>> +IOASID allocation from its own pool and map them to host physical >>>> +IOASIDs. SPIDs are particularly useful for supporting live >>>> migration +where decoupling guest and host physical resources are >>>> necessary. + >>>> +For example, two VMs can both allocate guest PASID/SPID #101 but >>>> map to +different host PASIDs #201 and #202 respectively as shown >>>> in the +diagram below. >>>> +:: >>>> + >>>> + .------------------. .------------------. >>>> + | VM 1 | | VM 2 | >>>> + | | | | >>>> + |------------------| |------------------| >>>> + | GPASID/SPID 101 | | GPASID/SPID 101 | >>>> + '------------------' -------------------' Guest >>>> + __________|______________________|______________________ >>>> + | | Host >>>> + v v >>>> + .------------------. .------------------. >>>> + | Host IOASID 201 | | Host IOASID 202 | >>>> + '------------------' '------------------' >>>> + | IOASID set 1 | | IOASID set 2 | >>>> + '------------------' '------------------' >>>> + >>>> +Guest PASID is treated as IOASID set private ID (SPID) within an >>>> +IOASID set, mappings between guest and host IOASIDs are stored in >>>> the +set for inquiry. >>>> + >>>> +IOASID APIs >>>> +=========== >>>> +To get the IOASID APIs, users must #include . >>>> These APIs +serve the following functionalities: >>>> + >>>> + - IOASID allocation/Free >>>> + - Group management in the form of ioasid_set >>>> + - Private data storage and lookup >>>> + - Reference counting >>>> + - Event notification in case of state change >> (a) > got it > >>>> + >>>> +IOASID Set Level APIs >>>> +-------------------------- >>>> +For use cases such as guest SVA it is necessary to manage IOASIDs >>>> at +a group level. For example, VMs may allocate multiple IOASIDs >>>> for >> I would use the introduced ioasid_set terminology instead of "group". > Right, we already introduced it. > >>>> +guest process address sharing (vSVA). It is imperative to enforce >>>> +VM-IOASID ownership such that malicious guest cannot target DMA >>> >>> "a malicious guest" >>> >>>> +traffic outside its own IOASIDs, or free an active IOASID belong >>>> to >>> >>> "that belongs to" >>> >>>> +another VM. >>>> +:: >>>> + >>>> + struct ioasid_set *ioasid_alloc_set(void *token, ioasid_t quota, >>>> u32 type) >> what is this void *token? also the type may be explained here. > token is explained in the text following API list. I can move it up. > >>>> + >>>> + int ioasid_adjust_set(struct ioasid_set *set, int quota); >>> >>> These could be named "ioasid_set_alloc" and "ioasid_set_adjust" to >>> be consistent with the rest of the API. >>> >>>> + >>>> + void ioasid_set_get(struct ioasid_set *set) >>>> + >>>> + void ioasid_set_put(struct ioasid_set *set) >>>> + >>>> + void ioasid_set_get_locked(struct ioasid_set *set) >>>> + >>>> + void ioasid_set_put_locked(struct ioasid_set *set) >>>> + >>>> + int ioasid_set_for_each_ioasid(struct ioasid_set *sdata, >>> >>> Might be nicer to keep the same argument names within the API. Here >>> "set" rather than "sdata". >>> >>>> + void (*fn)(ioasid_t id, void >>>> *data), >>>> + void *data) >>> >>> (alignment) >>> >>>> + >>>> + >>>> +IOASID set concept is introduced to represent such IOASID groups. >>>> Each >>> >>> Or just "IOASID sets represent such IOASID groups", but might be >>> redundant. >>> >>>> +IOASID set is created with a token which can be one of the >>>> following +types: >> I think this explanation should happen before the above function >> prototypes > ditto. > >>>> + >>>> + - IOASID_SET_TYPE_NULL (Arbitrary u64 value) >>>> + - IOASID_SET_TYPE_MM (Set token is a mm_struct) >>>> + >>>> +The explicit MM token type is useful when multiple users of an >>>> IOASID +set under the same process need to communicate about their >>>> shared IOASIDs. +E.g. An IOASID set created by VFIO for one guest >>>> can be associated +with the KVM instance for the same guest since >>>> they share a common mm_struct. + >>>> +The IOASID set APIs serve the following purposes: >>>> + >>>> + - Ownership/permission enforcement >>>> + - Take collective actions, e.g. free an entire set >>>> + - Event notifications within a set >>>> + - Look up a set based on token >>>> + - Quota enforcement >>> >>> This paragraph could be earlier in the section >> >> yes this is a kind of repetition of (a), above > I meant to highlight on what the APIs do such that readers don't > need to read the code instead. > >>> >>>> + >>>> +Individual IOASID APIs >>>> +---------------------- >>>> +Once an ioasid_set is created, IOASIDs can be allocated from the >>>> set. +Within the IOASID set namespace, set private ID (SPID) is >>>> supported. In +the VM use case, SPID can be used for storing guest >>>> PASID. + >>>> +:: >>>> + >>>> + ioasid_t ioasid_alloc(struct ioasid_set *set, ioasid_t min, >>>> ioasid_t max, >>>> + void *private); >>>> + >>>> + int ioasid_get(struct ioasid_set *set, ioasid_t ioasid); >>>> + >>>> + void ioasid_put(struct ioasid_set *set, ioasid_t ioasid); >>>> + >>>> + int ioasid_get_locked(struct ioasid_set *set, ioasid_t ioasid); >>>> + >>>> + void ioasid_put_locked(struct ioasid_set *set, ioasid_t ioasid); >>>> + >>>> + void *ioasid_find(struct ioasid_set *set, ioasid_t ioasid, >>>> + bool (*getter)(void *)); >>>> + >>>> + ioasid_t ioasid_find_by_spid(struct ioasid_set *set, ioasid_t >>>> spid) + >>>> + int ioasid_attach_data(struct ioasid_set *set, ioasid_t ioasid, >>>> + void *data); >>>> + int ioasid_attach_spid(struct ioasid_set *set, ioasid_t ioasid, >>>> + ioasid_t ssid); >>> >>> s/ssid/spid> > got it > >>>> + >>>> + >>>> +Notifications >>>> +------------- >>>> +An IOASID may have multiple users, each user may have hardware >>>> context +associated with an IOASID. When the status of an IOASID >>>> changes, +e.g. an IOASID is being freed, users need to be notified >>>> such that the +associated hardware context can be cleared, >>>> flushed, and drained. + >>>> +:: >>>> + >>>> + int ioasid_register_notifier(struct ioasid_set *set, struct >>>> + notifier_block *nb) >>>> + >>>> + void ioasid_unregister_notifier(struct ioasid_set *set, >>>> + struct notifier_block *nb) >>>> + >>>> + int ioasid_register_notifier_mm(struct mm_struct *mm, struct >>>> + notifier_block *nb) >>>> + >>>> + void ioasid_unregister_notifier_mm(struct mm_struct *mm, struct >>>> + notifier_block *nb) >> the mm_struct prototypes may be justified > This is the mm type token, i.e. > - IOASID_SET_TYPE_MM (Set token is a mm_struct) > I am not sure if it is better to keep the explanation in code or in > this document, certainly don't want to duplicate. OK. Maybe add a text explaining why it makes sense to register a notifier at mm_struct granularity. > >>>> + >>>> + int ioasid_notify(ioasid_t ioasid, enum ioasid_notify_val cmd, >>>> + unsigned int flags) >> this one is not obvious either. > Here I just wanted to list the API functions, perhaps readers can check > out the code comments? OK never mind. The exercise is difficult anyway. > >>>> + >>>> + >>>> +Events >>>> +~~~~~~ >>>> +Notification events are pertinent to individual IOASIDs, they can >>>> be +one of the following: >>>> + >>>> + - ALLOC >>>> + - FREE >>>> + - BIND >>>> + - UNBIND >>>> + >>>> +Ordering >>>> +~~~~~~~~ >>>> +Ordering is supported by IOASID notification priorities as the >>>> +following (in ascending order): >>>> + >>>> +:: >>>> + >>>> + enum ioasid_notifier_prios { >>>> + IOASID_PRIO_LAST, >>>> + IOASID_PRIO_IOMMU, >>>> + IOASID_PRIO_DEVICE, >>>> + IOASID_PRIO_CPU, >>>> + }; >> >> Maybe: >> when registered, notifiers are assigned a priority that affect the >> call order. Notifiers with CPU priority get called before notifiers >> with device priority and so on. > Sounds good. > >>>> + >>>> +The typical use case is when an IOASID is freed due to an >>>> exception, DMA +source should be quiesced before tearing down >>>> other hardware contexts +in the system. This will reduce the churn >>>> in handling faults. DMA work +submission is performed by the CPU >>>> which is granted higher priority than +devices. >>>> + >>>> + >>>> +Scopes >>>> +~~~~~~ >>>> +There are two types of notifiers in IOASID core: system-wide and >>>> +ioasid_set-wide. >>>> + >>>> +System-wide notifier is catering for users that need to handle all >>>> +IOASIDs in the system. E.g. The IOMMU driver handles all IOASIDs. >>>> + >>>> +Per ioasid_set notifier can be used by VM specific components >>>> such as +KVM. After all, each KVM instance only cares about >>>> IOASIDs within its +own set. >>>> + >>>> + >>>> +Atomicity >>>> +~~~~~~~~~ >>>> +IOASID notifiers are atomic due to spinlocks used inside the >>>> IOASID +core. For tasks cannot be completed in the notifier >>>> handler, async work >>> >>> "tasks that cannot be" >>> >>>> +can be submitted to complete the work later as long as there is no >>>> +ordering requirement. >>>> + >>>> +Reference counting >>>> +------------------ >>>> +IOASID lifecycle management is based on reference counting. Users >>>> of +IOASID intend to align lifecycle with the IOASID need to hold >>> >>> "who intend to" >>> >>>> +reference of the IOASID. IOASID will not be returned to the pool >>>> for >>> >>> "a reference to the IOASID. The IOASID" >>> >>>> +allocation until all references are dropped. Calling ioasid_free() >>>> +will mark the IOASID as FREE_PENDING if the IOASID has outstanding >>>> +reference. ioasid_get() is not allowed once an IOASID is in the >>>> +FREE_PENDING state. >>>> + >>>> +Event notifications are used to inform users of IOASID status >>>> change. +IOASID_FREE event prompts users to drop their references >>>> after +clearing its context. >>>> + >>>> +For example, on VT-d platform when an IOASID is freed, teardown >>>> +actions are performed on KVM, device driver, and IOMMU driver. >>>> +KVM shall register notifier block with:: >>>> + >>>> + static struct notifier_block pasid_nb_kvm = { >>>> + .notifier_call = pasid_status_change_kvm, >>>> + .priority = IOASID_PRIO_CPU, >>>> + }; >>>> + >>>> +VDCM driver shall register notifier block with:: >>>> + >>>> + static struct notifier_block pasid_nb_vdcm = { >>>> + .notifier_call = pasid_status_change_vdcm, >>>> + .priority = IOASID_PRIO_DEVICE, >>>> + }; >> not sure those code snippets are really useful. Maybe simply say who >> is supposed to use each prio. > Agreed, not all the bits in the snippets are explained. I will explain > KVM and VDCM need to use priority to ensure call order. > >>>> + >>>> +In both cases, notifier blocks shall be registered on the IOASID >>>> set +such that *only* events from the matching VM is received. >>>> + >>>> +If KVM attempts to register notifier block before the IOASID set >>>> is +created for the MM token, the notifier block will be placed on >>>> a >> using the MM token > sounds good > >>>> +pending list inside IOASID core. Once the token matching IOASID >>>> set +is created, IOASID will register the notifier block >>>> automatically. >> Is this implementation mandated? Can't you enforce the ioasid_set to >> be created before the notifier gets registered? >>>> +IOASID core does not replay events for the existing IOASIDs in the >>>> +set. For IOASID set of MM type, notification blocks can be >>>> registered +on empty sets only. This is to avoid lost events. >>>> + >>>> +IOMMU driver shall register notifier block on global chain:: >>>> + >>>> + static struct notifier_block pasid_nb_vtd = { >>>> + .notifier_call = pasid_status_change_vtd, >>>> + .priority = IOASID_PRIO_IOMMU, >>>> + }; >>>> + >>>> +Custom allocator APIs >>>> +--------------------- >>>> + >>>> +:: >>>> + >>>> + int ioasid_register_allocator(struct ioasid_allocator_ops >>>> *allocator); + >>>> + void ioasid_unregister_allocator(struct ioasid_allocator_ops >>>> *allocator); + >>>> +Allocator Choices >>>> +~~~~~~~~~~~~~~~~~ >>>> +IOASIDs are allocated for both host and guest SVA/IOVA usage. >>>> However, +allocators can be different. For example, on VT-d guest >>>> PASID +allocation must be performed via a virtual command >>>> interface which is +emulated by VMM. >>>> + >>>> +IOASID core has the notion of "custom allocator" such that guest >>>> can +register virtual command allocator that precedes the default >>>> one. + >>>> +Namespaces >>>> +~~~~~~~~~~ >>>> +IOASIDs are limited system resources that default to 20 bits in >>>> +size. Since each device has its own table, theoretically the >>>> namespace +can be per device also. However, for security reasons >>>> sharing PASID +tables among devices are not good for isolation. >>>> Therefore, IOASID +namespace is system-wide. >>> >>> I don't follow this development. Having per-device PASID table >>> would work fine for isolation (assuming no hardware bug >>> necessitating IOMMU groups). If I remember correctly IOASID space >>> was chosen to be OS-wide because it simplifies the management code >>> (single PASID per task), and it is system-wide across VMs only in >>> the case of VT-d scalable mode. >>>> + >>>> +There are also other reasons to have this simpler system-wide >>>> +namespace. Take VT-d as an example, VT-d supports shared workqueue >>>> +and ENQCMD[1] where one IOASID could be used to submit work on >>> >>> Maybe use the Sphinx glossary syntax rather than "[1]" >>> https://www.sphinx-doc.org/en/master/usage/restructuredtext/directives.html#glossary-directive >>> >>>> +multiple devices that are shared with other VMs. This requires >>>> IOASID +to be system-wide. This is also the reason why guests must >>>> use an +emulated virtual command interface to allocate IOASID from >>>> the host. + >>>> + >>>> +Life cycle >>>> +========== >>>> +This section covers IOASID lifecycle management for both >>>> bare-metal +and guest usages. In bare-metal SVA, MMU notifier is >>>> directly hooked +up with IOMMU driver, therefore the process >>>> address space (MM) +lifecycle is aligned with IOASID. >> therefore the IOASID lifecyle matches the process address space (MM) >> lifecyle? > Sounds good. > >>>> + >>>> +However, guest MMU notifier is not available to host IOMMU >>>> driver, >> the guest MMU notifier >>>> +when guest MM terminates unexpectedly, the events have to go >>>> through >> the guest MM >>>> +VFIO and IOMMU UAPI to reach host IOMMU driver. There are also >>>> more +parties involved in guest SVA, e.g. on Intel VT-d platform, >>>> IOASIDs +are used by IOMMU driver, KVM, VDCM, and VFIO. >>>> + >>>> +Native IOASID Life Cycle (VT-d Example) >>>> +--------------------------------------- >>>> + >>>> +The normal flow of native SVA code with Intel Data Streaming >>>> +Accelerator(DSA) [2] as example: >>>> + >>>> +1. Host user opens accelerator FD, e.g. DSA driver, or uacce; >>>> +2. DSA driver allocate WQ, do sva_bind_device(); >>>> +3. IOMMU driver calls ioasid_alloc(), then bind PASID with device, >>>> + mmu_notifier_get() >>>> +4. DMA starts by DSA driver userspace >>>> +5. DSA userspace close FD >>>> +6. DSA/uacce kernel driver handles FD.close() >>>> +7. DSA driver stops DMA >>>> +8. DSA driver calls sva_unbind_device(); >>>> +9. IOMMU driver does unbind, clears PASID context in IOMMU, flush >>>> + TLBs. mmu_notifier_put() called. >>>> +10. mmu_notifier.release() called, IOMMU SVA code calls >>>> ioasid_free()* +11. The IOASID is returned to the pool, reclaimed. >>>> + >>>> +:: >>>> + >>> >>> Use a footnote? >>> https://www.sphinx-doc.org/en/master/usage/restructuredtext/basics.html#footnotes >>>> + * With ENQCMD, PASID used on VT-d is not released in >>>> mmu_notifier() but >>>> + mmdrop(). mmdrop comes after FD close. Should not matter. >>> >>> "comes after FD close, which doesn't make a difference?" >>> The following might not be necessary since early process >>> termination is described later. >>> >>>> + If the user process dies unexpectedly, Step #10 may come >>>> before >>>> + Step #5, in between, all DMA faults discarded. PRQ responded >>>> with >>> >>> PRQ hasn't been defined in this document. >>> >>>> + code INVALID REQUEST. >>>> + >>>> +During the normal teardown, the following three steps would >>>> happen in +order: >> can't this be illustrated in the above 1-11 sequence, just adding >> NORMAL TEARDONW before #7? >>>> + >>>> +1. Device driver stops DMA request >>>> +2. IOMMU driver unbinds PASID and mm, flush all TLBs, drain >>>> in-flight >>>> + requests. >>>> +3. IOASID freed >>>> + >> Then you can just focus on abnormal termination > Yes, will refer to the steps starting #7. These can be removed. > >>>> +Exception happens when process terminates *before* device driver >>>> stops +DMA and call IOMMU driver to unbind. The flow of process >>>> exists are as >> Can't this be explained with something simpler looking at the steps >> 1-11? > It meant to be educational given this level of details. Simpler > steps are labeled with (1) (2) (3). Perhaps these labels didn't stand > out right? I will use the steps in the 1-11 sequence. > >>> >>> "exits" >>> >>>> +follows: >>>> + >>>> +:: >>>> + >>>> + do_exit() { >>>> + exit_mm() { >>>> + mm_put(); >>>> + exit_mmap() { >>>> + intel_invalidate_range() //mmu notifier >>>> + tlb_finish_mmu() >>>> + mmu_notifier_release(mm) { >>>> + intel_iommu_release() { >>>> + [2] >>>> intel_iommu_teardown_pasid(); >>> >>> Parentheses might be better than square brackets for step numbers >>> >>>> + intel_iommu_flush_tlbs(); >>>> + } >>>> + // tlb_invalidate_range cb removed >>>> + } >>>> + unmap_vmas(); >>>> + free_pgtables(); // IOMMU cannot walk PGT >>>> after this >>>> + }; >>>> + } >>>> + exit_files(tsk) { >>>> + close_files() { >>>> + dsa_close(); >>>> + [1] dsa_stop_dma(); >>>> + intel_svm_unbind_pasid(); //nothing to do >>>> + } >>>> + } >>>> + } >>>> + >>>> + mmdrop() /* some random time later, lazy mm user */ { >>>> + mm_free_pgd(); >>>> + destroy_context(mm); { >>>> + [3] ioasid_free(); >>>> + } >>>> + } >>>> + >>>> +As shown in the list above, step #2 could happen before >>>> +#1. Unrecoverable(UR) faults could happen between #2 and #1. >>>> + >>>> +Also notice that TLB invalidation occurs at mmu_notifier >>>> +invalidate_range callback as well as the release callback. The >>>> reason +is that release callback will delete IOMMU driver from the >>>> notifier +chain which may skip invalidate_range() calls during the >>>> exit path. + >>>> +To avoid unnecessary reporting of UR fault, IOMMU driver shall >>>> disable >> UR? > Unrecoverable, mentioned in the previous paragraph. > >>>> +fault reporting after free and before unbind. >>>> + >>>> +Guest IOASID Life Cycle (VT-d Example) >>>> +-------------------------------------- >>>> +Guest IOASID life cycle starts with guest driver open(), this >>>> could be +uacce or individual accelerator driver such as DSA. At >>>> FD open, +sva_bind_device() is called which triggers a series of >>>> actions. + >>>> +The example below is an illustration of *normal* operations that >>>> +involves *all* the SW components in VT-d. The flow can be simpler >>>> if +no ENQCMD is supported. >>>> + >>>> +:: >>>> + >>>> + VFIO IOMMU KVM VDCM IOASID >>>> Ref >>>> + .................................................................. >>>> + 1 ioasid_register_notifier/_mm() >>>> + 2 ioasid_alloc() >>>> 1 >>>> + 3 bind_gpasid() >>>> + 4 iommu_bind()->ioasid_get() >>>> 2 >>>> + 5 ioasid_notify(BIND) >>>> + 6 -> ioasid_get() >>>> 3 >>>> + 7 -> vmcs_update_atomic() >>>> + 8 mdev_write(gpasid) >>>> + 9 hpasid= >>>> + 10 find_by_spid(gpasid) >>>> 4 >>>> + 11 vdev_write(hpasid) >>>> + 12 -------- GUEST STARTS DMA -------------------------- >>>> + 13 -------- GUEST STOPS DMA -------------------------- >>>> + 14 mdev_clear(gpasid) >>>> + 15 vdev_clear(hpasid) >>>> + 16 >>>> ioasid_put() 3 >>>> + 17 unbind_gpasid() >>>> + 18 iommu_ubind() >>>> + 19 ioasid_notify(UNBIND) >>>> + 20 -> vmcs_update_atomic() >>>> + 21 -> >>>> ioasid_put() 2 >>>> + 22 >>>> ioasid_free() 1 >>>> + 23 >>>> ioasid_put() 0 >>>> + 24 Reclaimed >>>> + -------------- New Life Cycle Begin >>>> ---------------------------- >>>> + 1 ioasid_alloc() >>>> -> 1 + >>>> + Note: IOASID Notification Events: FREE, BIND, UNBIND >>>> + >>>> +Exception cases arise when a guest crashes or a malicious guest >>>> +attempts to cause disruption on the host system. The fault >>>> handling +rules are: >>>> + >>>> +1. IOASID free must *always* succeed. >>>> +2. An inactive period may be required before the freed IOASID is >>>> + reclaimed. During this period, consumers of IOASID perform >>>> cleanup. +3. Malfunction is limited to the guest owned resources >>>> for all >>>> + programming errors. >>>> + >>>> +The primary source of exception is when the following are out of >>>> +order: >>>> + >>>> +1. Start/Stop of DMA activity >>>> + (Guest device driver, mdev via VFIO) >> please explain the meaning of what is inside (): initiator? >>>> +2. Setup/Teardown of IOMMU PASID context, IOTLB, DevTLB flushes >>>> + (Host IOMMU driver bind/unbind) >>>> +3. Setup/Teardown of VMCS PASID translation table entries (KVM) in >>>> + case of ENQCMD >>>> +4. Programming/Clearing host PASID in VDCM (Host VDCM driver) >>>> +5. IOASID alloc/free (Host IOASID) >>>> + >>>> +VFIO is the *only* user-kernel interface, which is ultimately >>>> +responsible for exception handlings. >>> >>> "handling" >>> >>>> + >>>> +#1 is processed the same way as the assigned device today based on >>>> +device file descriptors and events. There is no special handling. >>>> + >>>> +#3 is based on bind/unbind events emitted by #2. >>>> + >>>> +#4 is naturally aligned with IOASID life cycle in that an illegal >>>> +guest PASID programming would fail in obtaining reference of the >>>> +matching host IOASID. >>>> + >>>> +#5 is similar to #4. The fault will be reported to the user if >>>> PASID +used in the ENQCMD is not set up in VMCS PASID translation >>>> table. + >>>> +Therefore, the remaining out of order problem is between #2 and >>>> +#5. I.e. unbind vs. free. More specifically, free before unbind. >>>> + >>>> +IOASID notifier and refcounting are used to ensure order. >>>> Following +a publisher-subscriber pattern where: >> with the following actors: >>>> + >>>> +- Publishers: VFIO & IOMMU >>>> +- Subscribers: KVM, VDCM, IOMMU >> this may be introduced before. >>>> + >>>> +IOASID notifier is atomic which requires subscribers to do quick >>>> +handling of the event in the atomic context. Workqueue can be >>>> used for +any processing that requires thread context. >> repetition of what was said before. >> IOASID reference must be > Right, will remove. > >>>> +acquired before receiving the FREE event. The reference must be >>>> +dropped at the end of the processing in order to return the >>>> IOASID to +the pool. >>>> + >>>> +Let's examine the IOASID life cycle again when free happens >>>> *before* +unbind. This could be a result of misbehaving guests or >>>> crash. Assuming +VFIO cannot enforce unbind->free order. Notice >>>> that the setup part up +until step #12 is identical to the normal >>>> case, the flow below starts +with step 13. >>>> + >>>> +:: >>>> + >>>> + VFIO IOMMU KVM VDCM IOASID >>>> Ref >>>> + .................................................................. >>>> + 13 -------- GUEST STARTS DMA -------------------------- >>>> + 14 -------- *GUEST MISBEHAVES!!!* ---------------- >>>> + 15 ioasid_free() >>>> + 16 >>>> ioasid_notify(FREE) >>>> + 17 >>>> mark_ioasid_inactive[1] >>>> + 18 kvm_nb_handler(FREE) >>>> + 19 vmcs_update_atomic() >>>> + 20 ioasid_put_locked() -> >>>> 3 >>>> + 21 vdcm_nb_handler(FREE) >>>> + 22 iomm_nb_handler(FREE) >>>> + 23 ioasid_free() returns[2] schedule_work() >>>> 2 >>>> + 24 schedule_work() vdev_clear_wk(hpasid) >>>> + 25 teardown_pasid_wk() >>>> + 26 ioasid_put() -> >>>> 1 >>>> + 27 ioasid_put() >>>> 0 >>>> + 28 Reclaimed >>>> + 29 unbind_gpasid() >>>> + 30 iommu_unbind()->ioasid_find() Fails[3] >>>> + -------------- New Life Cycle Begin >>>> ---------------------------- + >>>> +Note: >>>> + >>>> +1. By marking IOASID inactive at step #17, no new references can >>>> be >>> >>> Is "inactive" FREE_PENDING? >>> >>>> + held. ioasid_get/find() will return -ENOENT; >>>> +2. After step #23, all events can go out of order. Shall not >>>> affect >>>> + the outcome. >>>> +3. IOMMU driver fails to find private data for unbinding. If >>>> unbind is >>>> + called after the same IOASID is allocated for the same guest >>>> again, >>>> + this is a programming error. The damage is limited to the guest >>>> + itself since unbind performs permission checking based on the >>>> + IOASID set associated with the guest process. >>>> + >>>> +KVM PASID Translation Table Updates >>>> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ >>>> +Per VM PASID translation table is maintained by KVM in order to >>>> +support ENQCMD in the guest. The table contains host-guest PASID >>>> +translations to be consumed by CPU ucode. The synchronization of >>>> the +PASID states depends on VFIO/IOMMU driver, where IOCTL and >>>> atomic +notifiers are used. KVM must register IOASID notifier per >>>> VM instance +during launch time. The following events are handled: >>>> + >>>> +1. BIND/UNBIND >>>> +2. FREE >>>> + >>>> +Rules: >>>> + >>>> +1. Multiple devices can bind with the same PASID, this can be >>>> different PCI >>>> + devices or mdevs within the same PCI device. However, only the >>>> + *first* BIND and *last* UNBIND emit notifications. >>>> +2. IOASID code is responsible for ensuring the correctness of H-G >>>> + PASID mapping. There is no need for KVM to validate the >>>> + notification data. >>>> +3. When UNBIND happens *after* FREE, KVM will see error in >>>> + ioasid_get() even when the reclaim is not done. IOMMU driver >>>> will >>>> + also avoid sending UNBIND if the PASID is already FREE. >>>> +4. When KVM terminates *before* FREE & UNBIND, references will be >>>> + dropped for all host PASIDs. >>>> + >>>> +VDCM PASID Programming >>>> +~~~~~~~~~~~~~~~~~~~~~~ >>>> +VDCM composes virtual devices and exposes them to the guests. When >>>> +the guest allocates a PASID then program it to the virtual >>>> device, VDCM >> programs as well >>>> +intercepts the programming attempt then program the matching >>>> host >>> >>> "programs" >>> >>> Thanks, >>> Jean >>> >>>> +PASID on to the hardware. >>>> +Conversely, when a device is going away, VDCM must be informed >>>> such +that PASID context on the hardware can be cleared. There >>>> could be +multiple mdevs assigned to different guests in the same >>>> VDCM. Since +the PASID table is shared at PCI device level, lazy >>>> clearing is not +secure. A malicious guest can attack by using >>>> newly freed PASIDs that +are allocated by another guest. >>>> + >>>> +By holding a reference of the PASID until VDCM cleans up the HW >>>> context, +it is guaranteed that PASID life cycles do not cross >>>> within the same +device. >>>> + >>>> + >>>> +Reference >>>> +==================================================== >>>> +1. >>>> https://software.intel.com/sites/default/files/managed/c5/15/architecture-instruction-set-extensions-programming-reference.pdf >>>> + +2. >>>> https://01.org/blogs/2019/introducing-intel-data-streaming-accelerator >>>> + +3. >>>> https://software.intel.com/en-us/download/intel-data-streaming-accelerator-preliminary-architecture-specification >>>> -- 2.7.4 >> >> Thanks >> >> Eric >>>> >>> >> >> _______________________________________________ >> iommu mailing list >> iommu@lists.linux-foundation.org >> https://lists.linuxfoundation.org/mailman/listinfo/iommu > [Jacob Pan] > Thanks Eric