Date: Fri, 16 Apr 2021 10:23:32 -0700
From: Jacob Pan
To: Alex Williamson
Cc: Jason Gunthorpe, Auger Eric, "Liu, Yi L", Jean-Philippe Brucker,
    "Tian, Kevin", LKML, Joerg Roedel, Lu Baolu, David Woodhouse,
    iommu@lists.linux-foundation.org, cgroups@vger.kernel.org,
    Tejun Heo, Li Zefan, Johannes Weiner, Jonathan Corbet,
    "Raj, Ashok",
"Wu, Hao" , "Jiang, Dave" , jacob.jun.pan@linux.intel.com Subject: Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs Message-ID: <20210416102332.6f71e53e@jacob-builder> In-Reply-To: <20210416094547.1774e1a3@redhat.com> References: <20210331124038.GE1463678@nvidia.com> <20210401134236.GF1463678@nvidia.com> <20210401160337.GJ1463678@nvidia.com> <4bea6eb9-08ad-4b6b-1e0f-c97ece58a078@redhat.com> <20210415230732.GG1370958@nvidia.com> <20210416061258.325e762e@jacob-builder> <20210416094547.1774e1a3@redhat.com> Organization: OTC X-Mailer: Claws Mail 3.17.5 (GTK+ 2.24.32; x86_64-pc-linux-gnu) MIME-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Hi Alex, On Fri, 16 Apr 2021 09:45:47 -0600, Alex Williamson wrote: > On Fri, 16 Apr 2021 06:12:58 -0700 > Jacob Pan wrote: > > > Hi Jason, > > > > On Thu, 15 Apr 2021 20:07:32 -0300, Jason Gunthorpe > > wrote: > > > On Thu, Apr 15, 2021 at 03:11:19PM +0200, Auger Eric wrote: > > > > Hi Jason, > > > > > > > > On 4/1/21 6:03 PM, Jason Gunthorpe wrote: > > > > > On Thu, Apr 01, 2021 at 02:08:17PM +0000, Liu, Yi L wrote: > > > > > > > > > >> DMA page faults are delivered to root-complex via page request > > > > >> message and it is per-device according to PCIe spec. Page request > > > > >> handling flow is: > > > > >> > > > > >> 1) iommu driver receives a page request from device > > > > >> 2) iommu driver parses the page request message. Get the > > > > >> RID,PASID, faulted page and requested permissions etc. > > > > >> 3) iommu driver triggers fault handler registered by device > > > > >> driver with iommu_report_device_fault() > > > > > > > > > > This seems confused. > > > > > > > > > > The PASID should define how to handle the page fault, not the > > > > > driver. > > > > > > > > In my series I don't use PASID at all. I am just enabling nested > > > > stage and the guest uses a single context. 
> > > > I don't allocate any user PASID at any point.
> > > >
> > > > When there is a fault at the physical level (a stage 1 fault that
> > > > concerns the guest), the fault needs to be reported and injected
> > > > into the guest. The vfio-pci driver registers a fault handler
> > > > with the iommu layer, and in that fault handler it fills a
> > > > circular buffer and triggers an eventfd that is listened to by
> > > > the VFIO-PCI QEMU device. The latter retrieves the fault from the
> > > > mmapped circular buffer, knows which vIOMMU it is attached to,
> > > > and passes the fault to the vIOMMU. Then the vIOMMU triggers an
> > > > IRQ in the guest.
> > > >
> > > > We are reusing the existing concepts from VFIO, regions and IRQs,
> > > > to do that.
> > > >
> > > > For that use case, would you also use /dev/ioasid?
> > >
> > > /dev/ioasid could do all the things you described vfio-pci as
> > > doing; it can even do them the same way you just described.
> > >
> > > Stated another way, do you plan to duplicate all of this code
> > > someday for vfio-cxl? What about for vfio-platform? ARM SMMU can
> > > be hooked to platform devices, right?
> > >
> > > I feel what you are struggling with is a choice in the iommu
> > > kernel APIs that causes the events to be delivered to the
> > > pci_device owner, not the PASID owner.
> > >
> > > That feels solvable.
> > >
> > Perhaps more of a philosophical question for you and Alex. There is
> > no doubt that the direction you guided for /dev/ioasid is a much
> > cleaner one, especially after VDPA emerged as another IOMMU-backed
> > framework.
>
> I think this statement answers all your remaining questions ;)
>
> > The question is what do we do with the nested translation features
> > that have been targeting the existing VFIO-IOMMU for the last three
> > years? That predates VDPA. Shall we put a stop marker *after* nested
> > support and say no more extensions for VFIO-IOMMU, and that new
> > features must be built on this new interface?
> >
> > If we were to close a checkout line for some unforeseen reason,
> > should we honor the customers who have already been waiting in line
> > for a long time?
> >
> > This is not a tactic or an excuse for not working on the new
> > /dev/ioasid interface. In fact, I believe we can benefit from the
> > lessons learned while completing the existing one. That will give
> > confidence to the new interface. Thoughts?
>
> I understand a big part of Jason's argument is that we shouldn't be
> in the habit of creating duplicate interfaces; we should create one
> well-designed interface to share among multiple subsystems. As new
> users have emerged, our solution needs to change to a common one
> rather than a VFIO-specific one. The IOMMU uAPI provides an
> abstraction, but at the wrong level, requiring userspace interfaces
> for each subsystem.
>
> Luckily the IOMMU uAPI is not really exposed as an actual uAPI, but
> that changes if we proceed to enable the interfaces to tunnel it
> through VFIO.
>
> The logical answer would therefore be that we don't make that
> commitment to the IOMMU uAPI if we believe now that it's
> fundamentally flawed.
>
I agree the uAPI data tunneling is definitely flawed in terms of
scalability. I was just thinking that it is still a small part of the
overall picture, considering there are other parts such as fault
reporting, user-space deployment, performance, and security. By
completing the support on the existing VFIO framework, we would at
least have a clear landscape that the new /dev/ioasid can improve upon.
Perhaps this is similar to cgroup v1 vs. v2: v2 took a long time, and
v1 lived on with known limitations.

Anyway, I am glad we have a clear direction now.

Thanks,

Jacob

> Ideally this new /dev/ioasid interface, and making use of it as a
> VFIO IOMMU backend, should replace type1.
> Type1 will live on until that interface gets to parity, at which
> point we may deprecate type1, but it wouldn't make sense to continue
> to expand type1 in the same direction as we intend /dev/ioasid to
> take over in the meantime, especially if it means maintaining an
> otherwise dead uAPI. Thanks,
>
> Alex
>
Thanks,

Jacob