Message-ID: <520e2be4-726f-c680-c010-a308cdddbae0@arm.com>
Date: Fri, 16 Jun 2023 17:34:53 +0100
Subject: Re: Question about reserved_regions w/ Intel IOMMU
From: Robin Murphy
To: Alexander Duyck, Jason Gunthorpe
Cc: "Tian, Kevin", Alex Williamson, Baolu Lu, LKML, linux-pci, "iommu@lists.linux.dev"
X-Mailing-List: linux-kernel@vger.kernel.org

On 2023-06-16 16:27, Alexander Duyck wrote:
> On Fri, Jun 16, 2023 at 5:20 AM Jason Gunthorpe wrote:
>>
>> On Fri, Jun 16, 2023 at 08:39:46AM +0000, Tian, Kevin wrote:
>>> +Alex
>>>
>>>> From: Jason Gunthorpe
>>>> Sent: Tuesday, June 13, 2023 11:54 PM
>>>>
>>>> On Thu, Jun 08, 2023 at 04:28:24PM +0100, Robin Murphy wrote:
>>>>
>>>>>> The iova_reserve_pci_windows() you've seen is for kernel DMA interfaces
>>>>>> which is not related to peer-to-peer accesses.
>>>>>
>>>>> Right, in general the IOMMU driver cannot be held responsible for whatever
>>>>> might happen upstream of the IOMMU input.
>>>>
>>>> The driver yes, but..
>>>>
>>>>> The DMA layer carves PCI windows out of its IOVA space
>>>>> unconditionally because we know that they *might* be problematic,
>>>>> and we don't have any specific constraints on our IOVA layout so
>>>>> it's no big deal to just sacrifice some space for simplicity.
>>>>
>>>> This is a problem for everything using UNMANAGED domains. If the iommu
>>>> API user picks an IOVA it should be able to expect it to work. If the
>>>> interconnect fails to allow it to work then this has to be discovered
>>>> otherwise UNMANAGED domains are not usable at all.
>>>>
>>>> Eg vfio and iommufd are also in trouble on these configurations.
>>>>
>>>
>>> If those PCI windows are problematic e.g. due to ACS they belong to
>>> a single iommu group. If a vfio user opens all the devices in that group
>>> then it can discover and reserve those windows in its IOVA space.
>>
>> How? We don't even exclude the single device's BAR if there is no ACS?
>
> The issue here was a defective ACS on a PCIe switch.
>
>>> The problem is that the user may not open all the devices then
>>> currently there is no way for it to know the windows on those
>>> unopened devices.
>>>
>>> Curious why nobody complains about this gap before this thread...
>>
>> Probably because it only matters if you have a real PCIe switch in the
>> system, which is pretty rare.
>
> So just FYI I am pretty sure we have a partitioned PCIe switch that
> has FW issues. Specifically it doesn't seem to be honoring the
> Redirect Request bit so what is happening is that we are seeing
> requests that are supposed to be going to the root complex/IOMMU
> getting redirected to an NVMe device that was on the same physical
> PCIe switch. We are in the process of getting that sorted out now and
> are using the forcedac option in the meantime to keep the IOMMU out of
> the 32b address space that was causing the issue.
>
> The reason for my original request is more about the user experience
> of trying to figure out what is reserved and what isn't. It seems like
> the IOVA will have reservations that are not visible to the end user.
> So when I go looking through the reserved_regions in sysfs it just
> lists the MSI regions that are reserved, and maybe some regions such
> as the memory for USB, while in reality we may be reserving IOVA
> regions in iova_reserve_pci_windows that will not be exposed without
> having to add probes or adding some printk debugging.

lspci -vvv seems to have no problem telling me about what PCI memory
space is assigned where, even as an unprivileged user, so surely it's
available to any VFIO user too?

It is not necessarily useful for the IOMMU layer to claim to userspace
that an entire window is unusable if in fact there's nothing in there
that would be treated as a P2P address so it's actually fine.
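For illustration only (not part of the original message): the information
discussed above can be read from sysfs without extra privileges. The sketch
below assumes the usual layouts, where each line of
/sys/kernel/iommu_groups/<N>/reserved_regions is "<start> <end> <type>" and
each line of /sys/bus/pci/devices/<BDF>/resource is "<start> <end> <flags>".
The function names are hypothetical, and IORESOURCE_MEM is copied from the
kernel's include/linux/ioport.h.

```python
from pathlib import Path

# Memory-resource flag bit, as defined in the kernel's include/linux/ioport.h.
IORESOURCE_MEM = 0x200


def parse_reserved_regions(text):
    """Parse '<start> <end> <type>' lines into (start, end, type) tuples."""
    regions = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 3:
            continue  # ignore anything that does not match the expected layout
        regions.append((int(fields[0], 16), int(fields[1], 16), fields[2]))
    return regions


def parse_pci_resources(text):
    """Parse '<start> <end> <flags>' lines, keeping assigned memory resources."""
    windows = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 3:
            continue
        start, end, flags = (int(field, 16) for field in fields)
        if start == 0 and end == 0:
            continue  # unassigned resource slot
        if flags & IORESOURCE_MEM:
            windows.append((start, end))
    return windows


def group_devices(group):
    """List the PCI addresses (BDFs) of the devices in one IOMMU group."""
    path = Path("/sys/kernel/iommu_groups") / str(group) / "devices"
    return sorted(entry.name for entry in path.iterdir())


def read_group_reserved_regions(group):
    """Read the reserved regions the IOMMU layer reports for one group."""
    path = Path("/sys/kernel/iommu_groups") / str(group) / "reserved_regions"
    return parse_reserved_regions(path.read_text())


def read_device_memory_resources(bdf):
    """Read the memory resources assigned to one PCI device."""
    path = Path("/sys/bus/pci/devices") / bdf / "resource"
    return parse_pci_resources(path.read_text())
```

Calling read_device_memory_resources() for every entry returned by
group_devices() gives roughly the memory ranges that lspci -vvv prints for
those devices, which can then be compared with the reserved_regions list.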
As I say, iommu-dma can make that assumption for itself because
iommu-dma doesn't need to maintain any particular address space layout,
but it could be overly restrictive for a userspace process or VMM which
does.

If the system has working ACS configured correctly, then this issue
should be moot; if it doesn't, then a VFIO user is going to get a whole
group of peer devices if they're getting anything at all, so it doesn't
seem entirely unreasonable to leave it up to them to check that all
those devices' resources play well with their expected memory map.

And the particular case of a system which claims to have working ACS
but doesn't, doesn't really seem to be something that can or should be
worked around from userspace; if that switch can't be fixed, it probably
wants an ACS quirk adding in the kernel.

Thanks,
Robin.
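For illustration only (not part of the original message): the userspace
check described above, that every device's resources in a group play well
with the intended memory map, amounts to an interval-overlap test. This is a
minimal sketch; the function names, the PCI addresses, and the address
ranges in the usage note are hypothetical, and all ranges are treated as
inclusive (start, end) pairs.

```python
def overlaps(a, b):
    """Return True if inclusive ranges a=(start, end) and b=(start, end) intersect."""
    return a[0] <= b[1] and b[0] <= a[1]


def find_conflicts(iova_ranges, device_windows):
    """Report planned IOVA ranges that collide with a peer device's memory.

    iova_ranges:    list of (start, end) ranges the user intends to map.
    device_windows: dict mapping a PCI address to a list of (start, end)
                    memory resources for that device.
    Returns a list of (pci_address, window, iova_range) tuples.
    """
    conflicts = []
    for bdf, windows in device_windows.items():
        for window in windows:
            for iova in iova_ranges:
                if overlaps(iova, window):
                    conflicts.append((bdf, window, iova))
    return conflicts
```

As a usage example with made-up values, a planned guest memory map of
[(0x0, 0x7fffffff)] checked against a device whose BAR sits at
(0x70000000, 0x700fffff) yields one conflict, telling the VMM to leave a
hole there or to place that part of its map elsewhere.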