Date: Thu, 29 Apr 2021 13:20:22 +1000
From: David Gibson
To: Jason Gunthorpe
Cc: Alex Williamson, "Liu, Yi L", Jacob Pan, Auger Eric,
 Jean-Philippe Brucker, "Tian, Kevin", LKML, Joerg Roedel, Lu Baolu,
 David Woodhouse, "iommu@lists.linux-foundation.org",
 "cgroups@vger.kernel.org", Tejun Heo, Li Zefan, Johannes Weiner,
 Jean-Philippe Brucker, Jonathan Corbet, "Raj, Ashok", "Wu, Hao",
 "Jiang, Dave", Alexey
 Kardashevskiy
Subject: Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
References: <20210421162307.GM1370958@nvidia.com>
 <20210421105451.56d3670a@redhat.com>
 <20210421175203.GN1370958@nvidia.com>
 <20210421133312.15307c44@redhat.com>
 <20210421230301.GP1370958@nvidia.com>
 <20210422111337.6ac3624d@redhat.com>
 <20210427172432.GE1370958@nvidia.com>
 <20210429002149.GZ1370958@nvidia.com>
In-Reply-To: <20210429002149.GZ1370958@nvidia.com>
X-Mailing-List: linux-kernel@vger.kernel.org

On Wed, Apr 28, 2021 at 09:21:49PM -0300, Jason Gunthorpe wrote:
> On Wed, Apr 28, 2021 at 11:23:39AM +1000, David Gibson wrote:
>
> > Yes.  My proposed model for a unified interface would be that when you
> > create a new container/IOASID, *no* IOVAs are valid.
>
> Hurm, it is quite tricky. All IOMMUs seem to have a dead zone around
> the MSI window, so negotiating this all in a general way is not going
> to be a very simple API.
>
> To be general it would be nicer to say something like 'I need XXGB of
> IOVA space' 'I need 32 bit IOVA space' etc and have the kernel return
> ranges that sum up to at least that big. Then the kernel can do
> all its optimizations.

Ah, yes, sorry.  We do need an API that lets the kernel make more of
the decisions too.  For userspace drivers it would generally be
sufficient to just ask for XXX size of IOVA space wherever you can get
it.  Handling guests requires more precision.  So, maybe a request
interface with a bunch of hint variables and a matching set of
MAP_FIXED-like flags to assert which ones aren't negotiable.
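
A rough sketch of what such a "hints plus MAP_FIXED-like flags" request
could look like, to make the idea concrete.  Everything here is an
assumption for illustration: the struct names, the IOVA_REQ_FIXED_*
flags, the single-aperture host description and the fallback policy
(smallest supported page size, lowest aligned address) are all made up
and are not an existing kernel uAPI.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical names throughout: this is not an existing kernel uAPI. */
#define IOVA_REQ_FIXED_BASE (1u << 0) /* 'base' is a requirement, not a hint */
#define IOVA_REQ_FIXED_PGSZ (1u << 1) /* 'pgsize' is a requirement, not a hint */

struct iova_window_req {
    uint64_t base;   /* hint: preferred start IOVA */
    uint64_t size;   /* bytes of IOVA space needed (always required) */
    uint64_t pgsize; /* hint: preferred IO page size; 0 means no preference */
    uint32_t flags;  /* IOVA_REQ_FIXED_* */
};

/* Simplified host description: one usable aperture plus supported page sizes. */
struct host_iommu_caps {
    uint64_t aperture_start;
    uint64_t aperture_end;  /* inclusive */
    uint64_t pgsize_bitmap; /* bit N set => 2^N byte IO pages supported */
};

static bool is_pow2(uint64_t x) { return x && !(x & (x - 1)); }

static bool fits(const struct host_iommu_caps *caps, uint64_t base, uint64_t size)
{
    /* Written to avoid overflow in base + size; caller guarantees size > 0. */
    return base >= caps->aperture_start && base <= caps->aperture_end &&
           size - 1 <= caps->aperture_end - base;
}

/* Round x up to a multiple of the power-of-two 'align'; false on overflow. */
static bool align_up(uint64_t x, uint64_t align, uint64_t *out)
{
    if (x > UINT64_MAX - (align - 1))
        return false;
    *out = (x + align - 1) & ~(align - 1);
    return true;
}

/* Returns 0 and fills *out, or -1 if a non-negotiable field cannot be met. */
int iova_window_resolve(const struct host_iommu_caps *caps,
                        const struct iova_window_req *req,
                        struct iova_window_req *out)
{
    uint64_t pgsize, size, base;

    assert(caps && req && out);
    if (req->size == 0 || caps->pgsize_bitmap == 0)
        return -1;

    /* Page size: honour the hint if the hardware supports it. */
    if (is_pow2(req->pgsize) && (caps->pgsize_bitmap & req->pgsize))
        pgsize = req->pgsize;
    else if (req->flags & IOVA_REQ_FIXED_PGSZ)
        return -1;
    else /* fall back to the lowest set bit, i.e. the smallest page size */
        pgsize = caps->pgsize_bitmap & (~caps->pgsize_bitmap + 1);

    if (!align_up(req->size, pgsize, &size))
        return -1;

    /* Base: honour the hint if it is aligned and inside the aperture. */
    if ((req->base & (pgsize - 1)) == 0 && fits(caps, req->base, size))
        base = req->base;
    else if (req->flags & IOVA_REQ_FIXED_BASE)
        return -1;
    else if (!align_up(caps->aperture_start, pgsize, &base) ||
             !fits(caps, base, size))
        return -1;

    out->base = base;
    out->size = size;
    out->pgsize = pgsize;
    out->flags = req->flags;
    return 0;
}
```

A userspace driver would pass flags == 0 and accept whatever base comes
back; qemu would set IOVA_REQ_FIXED_BASE (and probably
IOVA_REQ_FIXED_PGSZ) and treat -1 as "this host cannot back this guest".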
> I guess you are going to say that the qemu PPC vIOMMU driver needs
> more exact control..

*Every* vIOMMU driver needs more exact control.  The guest drivers will
expect to program the guest devices with IOVAs matching the guest
platform's IOMMU model.  Therefore the backing host IOMMU has to be
programmed to respond to those IOVAs.  If it can't be, there's no way
around it, and you want to fail out early.  With this model that will
happen when qemu (say) requests the host IOMMU window(s) to match the
guest's expected IOVA ranges.

Actually, come to that even guests without a vIOMMU need more exact
control: they'll expect IOVA to match GPA, so if your host IOMMU can't
be set up to translate the full range of GPAs, again, you're out of
luck.

The only reason x86 has been able to ignore this is that the
assumption has been that all IOMMUs can translate IOVAs from 0...
Once you really start to look at what the limits are, you need the
exact window control I'm describing.

> > I expect we'd need some kind of query operation to expose limitations
> > on the number of windows, addresses for them, available pagesizes etc.
>
> Is page size an assumption that hugetlbfs will always be used for backing
> memory or something?

So for TCEs (and maybe other IOMMUs out there), the IO page tables are
independent of the CPU page tables.  They don't have the same format,
and they don't necessarily have the same page size.  In the case of a
bare metal kernel working in physical addresses it can use that TCE
page size however it likes.  For userspace you get another layer of
complexity.  Essentially, to implement things correctly the backing
IOMMU needs to have a page size granularity that's the minimum of
whatever granularity the userspace or guest driver expects and the
host page size backing the memory.

> > > As an ideal, only things like the HW specific qemu vIOMMU driver
> > > should be reaching for all the special stuff.
> >
> > I'm hoping we can even avoid that, usually.
> > With the explicitly created windows model I propose above, it should
> > be able to: qemu will create the windows according to the IOVA windows
> > the guest platform expects to see and they either will or won't work
> > on the host platform IOMMU.  If they do, generic maps/unmaps should be
> > sufficient.  If they don't, well, the host IOMMU simply cannot emulate
> > the vIOMMU so you're out of luck anyway.
>
> It is not just P9 that has special stuff, and this whole area of PASID
> seems to be quite different on every platform.
>
> If things fit very naturally and generally then maybe, but I've been
> down this road before of trying to make a general description of a
> group of very special HW. It ended in tears after 10 years when nobody
> could understand the "general" API after it was Frankenstein'd up with
> special cases for everything. Cautionary tale
>
> There is a certain appeal to having some
> 'PPC_TCE_CREATE_SPECIAL_IOASID' entry point that has a wack of extra
> information like windows that can be optionally called by the viommu
> driver and it remains well defined and described.

Windows really aren't ppc specific.  They're absolutely there on x86
and everything else as well - it's just that people are used to having
a window at 0.. that you can often get away with treating it sloppily.

-- 
David Gibson                    | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
                                | _way_ _around_!
http://www.ozlabs.org/~dgibson
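
Two of the constraints from the reply above, sketched as small
helpers.  This is only one reading of them: the page-size rule is
taken as "largest IO page size the hardware supports that is no bigger
than min(driver granularity, host backing page size)", and the
no-vIOMMU case as "the guest RAM range must sit inside a host window
because IOVA == GPA".  The function names and the page-size bitmap
representation are hypothetical and do not come from the kernel or
qemu.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helpers for illustration; not from any kernel or qemu API. */

/*
 * Largest hardware-supported IO page size that does not exceed the smaller
 * of the driver's expected granularity and the host page size backing the
 * memory.  Bit N of hw_pgsize_bitmap set means 2^N byte IO pages exist.
 * Returns 0 if the hardware has no page size that small.
 */
uint64_t io_pgsize_for(uint64_t driver_pgsize, uint64_t host_pgsize,
                       uint64_t hw_pgsize_bitmap)
{
    uint64_t limit = driver_pgsize < host_pgsize ? driver_pgsize : host_pgsize;

    for (int shift = 63; shift >= 0; shift--) {
        uint64_t sz = UINT64_C(1) << shift;

        if ((hw_pgsize_bitmap & sz) && sz <= limit)
            return sz;
    }
    return 0;
}

/*
 * Guest without a vIOMMU: IOVA == GPA, so the guest RAM range
 * [gpa, gpa + len) is usable for DMA only if it lies entirely inside the
 * host window [win_start, win_end] (win_end inclusive).
 */
bool window_covers_gpa(uint64_t win_start, uint64_t win_end,
                       uint64_t gpa, uint64_t len)
{
    /* Compare against win_end - gpa to avoid overflow in gpa + len. */
    return len != 0 && gpa >= win_start && gpa <= win_end &&
           len - 1 <= win_end - gpa;
}
```

A zero from io_pgsize_for(), or a false from window_covers_gpa() for
any guest RAM range, is the "fail out early" case: the host IOMMU
cannot back that guest configuration.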