Date: Thu, 20 Sep 2018 08:53:39 -0700
From: Jacob Pan
To: "Tian, Kevin"
Cc: Jean-Philippe Brucker, "Raj, Ashok", "Bie, Tiwei", "Kumar, Sanjay K",
    Kirti Wankhede, "iommu@lists.linux-foundation.org",
    "linux-kernel@vger.kernel.org", Alex Williamson, "kvm@vger.kernel.org",
    David Woodhouse, "Sun, Yi Y", jacob.jun.pan@intel.com
Subject: Re: [RFC PATCH v2 00/10] vfio/mdev: IOMMU aware mediated device
Message-ID: <20180920085339.2ea67e72@jacob-builder>
In-Reply-To:
References: <20180830040922.30426-1-baolu.lu@linux.intel.com>
    <380dc154-5d72-0085-2056-fa466789e1ab@arm.com>
    <3602f8c1-df17-4894-1bcc-4d779f9aa7fd@arm.com>
    <03d496b0-84c2-b3ca-5be5-d4540c6d8ec7@arm.com>
    <20180914140433.6891a90c@jacob-builder>
Organization: OTC
X-Mailing-List: linux-kernel@vger.kernel.org

On Wed, 19 Sep 2018 02:22:03 +0000
"Tian, Kevin" wrote:

> > From: Jean-Philippe Brucker [mailto:jean-philippe.brucker@arm.com]
> > Sent: Tuesday, September 18, 2018 11:47 PM
> >
> > On 14/09/2018 22:04, Jacob Pan wrote:
> > >> This example only needs to modify first-level translation, and
> > >> works with SMMUv3. The kernel here could be the host, in which
> > >> case second-level translation is disabled in the SMMU, or it
> > >> could be the guest, in which case second-level mappings are
> > >> created by QEMU and first-level translation is managed by
> > >> assigning PASID tables to the guest.
> > > There is a difference in case of guest SVA. VT-d v3 will bind
> > > guest PASID and guest CR3 instead of the guest PASID table. Then
> > > turn on nesting. In case of mdev, the second level is obtained
> > > from the aux domain which was setup for the default PASID. Or in
> > > case of PCI device, second level is harvested from RID2PASID.
> >
> > Right, though I wasn't talking about the host managing guest SVA
> > here, but a kernel binding the address space of one of its
> > userspace drivers to the mdev.
> >
> > >> So (2) would use iommu_sva_bind_device(),
> > > We would need something different than that for guest bind, just
> > > to show the two cases:
> > >
> > > int iommu_sva_bind_device(struct device *dev, struct mm_struct *mm,
> > >                           int *pasid, unsigned long flags,
> > >                           void *drvdata)
> > >
> > > (WIP)
> > > int sva_bind_gpasid(struct device *dev, struct gpasid_bind_data *data)
> > > where:
> > > /**
> > >  * struct gpasid_bind_data - Information about device and guest
> > >  *                           PASID binding
> > >  * @pasid:       Process address space ID used for the guest mm
> > >  * @addr_width:  Guest address width. Paging mode can also be
> > >  *               derived.
> > >  * @gcr3:        Guest CR3 value from guest mm
> > >  */
> > > struct gpasid_bind_data {
> > >         __u32 pasid;
> > >         __u64 gcr3;
> > >         __u32 addr_width;
> > >         __u32 flags;
> > > #define IOMMU_SVA_GPASID_SRE    BIT(0) /* supervisor request */
> > > };
> > > Perhaps there is room to merge with io_mm but the life cycle
> > > management of guest PASID and host PASID will be different if you
> > > rely on mm release callback than FD.
>
> let's not calling gpasid here - which makes sense only in
> bind_pasid_table proposal where pasid table thus pasid space is
> managed by guest. In above context it is always about host pasid
> (allocated in system-wide), which could point to a host cr3 (user
> process) or a guest cr3 (vm case).
>
I agree that the gpasid name is confusing; we have a system-wide PASID
namespace. It was only meant to differentiate the two kinds of bind.
Perhaps a flag indicating that the PASID is used by a guest is enough,
i.e.
struct pasid_bind_data {
        __u32 pasid;
        __u64 gcr3;
        __u32 addr_width;
        __u32 flags;
#define IOMMU_SVA_GPASID_SRE    BIT(0) /* supervisor request */
#define IOMMU_SVA_PASID_GUEST   BIT(1) /* host pasid used by guest */
};

> > I think gpasid management should stay separate from io_mm, since in
> > your case VFIO mechanisms are used for life cycle management of the
> > VM, similarly to the former bind_pasid_table proposal. For example
> > closing the container fd would unbind all guest page tables. The
> > QEMU process' address space lifetime seems like the wrong thing to
> > track for gpasid.
>
> I sort of agree (though not thinking through all the flow carefully).
> PASIDs are allocated per iommu domain, thus release also happens when
> domain is detached (along with container fd close).
>
I also prefer to keep gpasid separate.

But I don't think we need an IOMMU domain per PASID for the guest SVA
case, assuming you are talking about the host IOMMU domain. The PASID
bind call is the result of a guest PASID cache flush on a previously
allocated PASID. The host just needs to put gcr3 into the PASID entry
and then harvest the second level from the existing domain.
> >
> > >> but (1) needs something
> > >> else. Aren't auxiliary domains suitable for (1)? Why limit
> > >> auxiliary domain to second-level or nested translation? It seems
> > >> silly to use a different API for first-level, since the flow in
> > >> userspace and VFIO is the same as your second-level case as far
> > >> as MAP_DMA ioctl goes. The difference is that in your case the
> > >> auxiliary domain supports an additional operation which binds
> > >> first-level page tables. An auxiliary domain that only supports
> > >> first-level wouldn't support this operation, but it can still
> > >> implement iommu_map/unmap/etc.
> > > I think the intention is that when a mdev is created, we don't
> > > know whether it will be used for SVA or IOVA.
> > > So aux domain is here to "hold a spot" for the default PASID such
> > > that MAP_DMA calls can work as usual, which is second level only.
> > > Later, if SVA is used on the mdev there will be another PASID
> > > allocated for that purpose. Do we need to create an aux domain
> > > for each PASID? the translation can be looked up by the
> > > combination of parent dev and pasid.
> >
> > When allocating a new PASID for the guest, I suppose you need to
> > clone the second-level translation config? In which case a single
> > aux domain for the mdev might be easier to implement in the IOMMU
> > driver. Entirely up to you since we don't have this case on SMMUv3
> >
>
> One thing to highlight in related discussions (also mentioned in other
> thread). There is not a new iommu domain type called 'aux'. 'aux'
> matters only to a specific device when a domain is attached to that
> device which has aux capability enabled. Same domain can be attached
> to other device as normal domain. In that case multiple PASIDs
> allocated on same mdev are tied to same aux domain, same bare metal
> SVA case, i.e. any domain (normal or aux) can include 2nd level
> structure and multiple 1st level structures. Jean is correct - all
> PASIDs in same domain then share 2nd level translation, and there are
> io_mm or similar tracking structures to associate each PASID to a 1st
> level translation structure.
>
I think we are all talking about the same thing :)
Yes, the 2nd level is cloned from the aux domain/default PASID for an
mdev, and similarly from the DMA_MAP domain for a pdev.

> Thanks
> Kevin

[Jacob Pan]
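
To make the guest bind flow above concrete, here is a minimal user-space sketch. It is an illustration only, not the VT-d driver code. It reuses the pasid_bind_data fields and flag names proposed in this mail (with the two flags on distinct bits and with stdint types instead of __u32/__u64). Everything else is a hypothetical stand-in: pasid_entry, pasid_table, host_domain, MAX_PASID and bind_guest_pasid() are made-up names, and the entry layout does not reflect the real hardware format.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define BIT(n) (1u << (n))

/* Same fields as the struct proposed in the mail, user-space types. */
struct pasid_bind_data {
	uint32_t pasid;
	uint64_t gcr3;
	uint32_t addr_width;
	uint32_t flags;
#define IOMMU_SVA_GPASID_SRE  BIT(0) /* supervisor request */
#define IOMMU_SVA_PASID_GUEST BIT(1) /* host pasid used by guest */
};

#define MAX_PASID 16 /* hypothetical toy table size */

/* Hypothetical model of one PASID entry; NOT the hardware layout. */
struct pasid_entry {
	uint64_t first_level_ptr;  /* guest CR3 */
	uint64_t second_level_ptr; /* second level shared with the domain */
	uint32_t addr_width;
	int supervisor;
	int nested;
	int present;
};

struct pasid_table {
	struct pasid_entry entries[MAX_PASID];
};

/* Hypothetical stand-in for the existing aux or DMA_MAP domain. */
struct host_domain {
	uint64_t second_level_pgd;
};

/*
 * Bind a guest address space to an already allocated host PASID:
 * put gcr3 into the PASID entry, take the second level from the
 * existing domain, and turn on nesting.
 */
int bind_guest_pasid(struct pasid_table *tbl,
		     const struct host_domain *dom,
		     const struct pasid_bind_data *data)
{
	struct pasid_entry *e;

	assert(tbl && dom && data);

	if (data->pasid >= MAX_PASID)
		return -EINVAL;
	if (!(data->flags & IOMMU_SVA_PASID_GUEST))
		return -EINVAL; /* this sketch only models the guest case */

	e = &tbl->entries[data->pasid];
	if (e->present)
		return -EBUSY;

	e->first_level_ptr = data->gcr3;
	e->second_level_ptr = dom->second_level_pgd;
	e->addr_width = data->addr_width;
	e->supervisor = !!(data->flags & IOMMU_SVA_GPASID_SRE);
	e->nested = 1;
	e->present = 1;
	return 0;
}
```

A caller would fill pasid_bind_data with the PASID from the guest's PASID cache flush plus the guest CR3, set IOMMU_SVA_PASID_GUEST, and pass the domain already attached to the mdev or pdev. Note that no new domain is created per PASID in this model; every entry points at the same second level.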