From: Robin Murphy
To: joro@8bytes.org
Cc: will@kernel.org, iommu@lists.linux.dev, linux-kernel@vger.kernel.org,
	Linus Torvalds, Jakub Kicinski, John Garry
Subject: [PATCH v4] iommu: Optimise PCI SAC address trick
Date: Thu, 13 Apr 2023 14:40:25 +0100
X-Mailing-List: linux-kernel@vger.kernel.org

Per the reasoning in commit 4bf7fda4dce2 ("iommu/dma: Add config for PCI
SAC address trick") and its subsequent revert, this mechanism no longer
serves its original purpose, but now only works around broken
hardware/drivers in a way that is unfortunately too impactful to remove.

This does not, however, prevent us from solving the performance impact
which that workaround has on large-scale systems that don't need it.
Once the 32-bit IOVA space fills up and a workload starts allocating and
freeing on both sides of the boundary, the opportunistic SAC allocation
can then end up spending significant time hunting down scattered
fragments of free 32-bit space, or just reestablishing max32_alloc_size.
This can easily be exacerbated by a change in allocation pattern, such
as by changing the network MTU, which can increase pressure on the
32-bit space by leaving a large quantity of cached IOVAs which are now
the wrong size to be recycled, but also won't be freed since the
non-opportunistic allocations can still be satisfied from the whole
64-bit space without triggering the reclaim path.

However, in the context of a workaround where smaller DMA addresses
aren't simply a preference but a necessity, if we get to that point at
all then in fact it's already the endgame. The nature of the allocator
is currently such that the first IOVA we give to a device after the
32-bit space runs out will be the highest possible address for that
device, ever. If that works, then great, we know we can optimise for
speed by always allocating from the full range. And if it doesn't, then
the worst has already happened and any brokenness is now showing, so
there's little point in continuing to try to hide it.

To that end, implement a flag to refine the SAC business into a
per-device policy that can automatically get itself out of the way if
and when it stops being useful.

CC: Linus Torvalds
CC: Jakub Kicinski
Reviewed-by: John Garry
Signed-off-by: Robin Murphy
---

v4: Rebase to use the new bitfield in dev_iommu, expand commit message.

 drivers/iommu/dma-iommu.c | 26 ++++++++++++++++++++------
 drivers/iommu/dma-iommu.h |  8 ++++++++
 drivers/iommu/iommu.c     |  3 +++
 include/linux/iommu.h     |  2 ++
 4 files changed, 33 insertions(+), 6 deletions(-)

diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index 99b2646cb5c7..9193ad5bc72f 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -630,7 +630,7 @@ static dma_addr_t iommu_dma_alloc_iova(struct iommu_domain *domain,
 {
 	struct iommu_dma_cookie *cookie = domain->iova_cookie;
 	struct iova_domain *iovad = &cookie->iovad;
-	unsigned long shift, iova_len, iova = 0;
+	unsigned long shift, iova_len, iova;
 
 	if (cookie->type == IOMMU_DMA_MSI_COOKIE) {
 		cookie->msi_iova += size;
@@ -645,15 +645,29 @@ static dma_addr_t iommu_dma_alloc_iova(struct iommu_domain *domain,
 	if (domain->geometry.force_aperture)
 		dma_limit = min(dma_limit, (u64)domain->geometry.aperture_end);
 
-	/* Try to get PCI devices a SAC address */
-	if (dma_limit > DMA_BIT_MASK(32) && !iommu_dma_forcedac && dev_is_pci(dev))
+	/*
+	 * Try to use all the 32-bit PCI addresses first. The original SAC vs.
+	 * DAC reasoning loses relevance with PCIe, but enough hardware and
+	 * firmware bugs are still lurking out there that it's safest not to
+	 * venture into the 64-bit space until necessary.
+	 *
+	 * If your device goes wrong after seeing the notice then likely either
+	 * its driver is not setting DMA masks accurately, the hardware has
+	 * some inherent bug in handling >32-bit addresses, or not all the
+	 * expected address bits are wired up between the device and the IOMMU.
+	 */
+	if (dma_limit > DMA_BIT_MASK(32) && dev->iommu->pci_32bit_workaround) {
 		iova = alloc_iova_fast(iovad, iova_len,
 				       DMA_BIT_MASK(32) >> shift, false);
+		if (iova)
+			goto done;
 
-	if (!iova)
-		iova = alloc_iova_fast(iovad, iova_len, dma_limit >> shift,
-				       true);
+		dev->iommu->pci_32bit_workaround = false;
+		dev_notice(dev, "Using %d-bit DMA addresses\n", bits_per(dma_limit));
+	}
 
+	iova = alloc_iova_fast(iovad, iova_len, dma_limit >> shift, true);
+done:
 	return (dma_addr_t)iova << shift;
 }
 
diff --git a/drivers/iommu/dma-iommu.h b/drivers/iommu/dma-iommu.h
index 942790009292..c829f1f82a99 100644
--- a/drivers/iommu/dma-iommu.h
+++ b/drivers/iommu/dma-iommu.h
@@ -17,6 +17,10 @@ int iommu_dma_init_fq(struct iommu_domain *domain);
 void iommu_dma_get_resv_regions(struct device *dev, struct list_head *list);
 
 extern bool iommu_dma_forcedac;
+static inline void iommu_dma_set_pci_32bit_workaround(struct device *dev)
+{
+	dev->iommu->pci_32bit_workaround = !iommu_dma_forcedac;
+}
 
 #else /* CONFIG_IOMMU_DMA */
 
@@ -38,5 +42,9 @@ static inline void iommu_dma_get_resv_regions(struct device *dev, struct list_he
 {
 }
 
+static inline void iommu_dma_set_pci_32bit_workaround(struct device *dev)
+{
+}
+
 #endif /* CONFIG_IOMMU_DMA */
 #endif /* __DMA_IOMMU_H */
diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index 10db680acaed..8ea5821b637c 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -354,6 +354,9 @@ static int __iommu_probe_device(struct device *dev, struct list_head *group_list
 	mutex_unlock(&iommu_probe_device_lock);
 	iommu_device_link(iommu_dev, dev);
 
+	if (dev_is_pci(dev))
+		iommu_dma_set_pci_32bit_workaround(dev);
+
 	return 0;
 
 out_release:
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index 6595454d4f48..2c908e87a89a 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -403,6 +403,7 @@ struct iommu_fault_param {
  * @priv: IOMMU Driver private data
  * @max_pasids: number of PASIDs this device can consume
  * @attach_deferred: the dma domain attachment is deferred
+ * @pci_32bit_workaround: Limit DMA allocations to 32-bit IOVAs
  *
  * TODO: migrate other per device data pointers under iommu_dev_data, e.g.
  *	struct iommu_group	*iommu_group;
@@ -416,6 +417,7 @@ struct dev_iommu {
 	void				*priv;
 	u32				max_pasids;
 	u32				attach_deferred:1;
+	u32				pci_32bit_workaround:1;
 };
 
 int iommu_device_register(struct iommu_device *iommu,
-- 
2.39.2.101.g768bb238c484.dirty
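
For anyone who wants to poke at the self-disabling policy in isolation,
here is a rough userspace sketch of the control flow in the patched
iommu_dma_alloc_iova(). It is not kernel code and not part of the patch:
struct toy_dev, struct toy_iovad, toy_alloc_range(), toy_alloc_iova()
and the TOY_* constants are hypothetical names made up for illustration,
and the toy allocator merely counts pages top-down instead of modelling
the real IOVA tree and caches.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical toy constants, chosen only to keep the model small. */
#define TOY_PAGE	0x1000ull
#define TOY_MASK32	0xffffffffull
#define TOY_LOW_SLOTS	4	/* pretend only 4 pages fit below 4GB */

struct toy_dev {
	bool pci_32bit_workaround;	/* stands in for the new dev_iommu bit */
	unsigned int notices;		/* times the notice would have printed */
};

struct toy_iovad {
	unsigned int low_used;		/* pages handed out below 4GB */
	unsigned int high_used;		/* pages handed out from the full range */
};

/*
 * Toy stand-in for alloc_iova_fast(): allocates one page top-down under
 * @limit and returns 0 on failure. Only a 32-bit limit can run out here;
 * larger limits are treated as effectively unlimited.
 */
static uint64_t toy_alloc_range(struct toy_iovad *iovad, uint64_t limit)
{
	assert(limit >= TOY_MASK32);	/* smaller masks are not modelled */

	if (limit == TOY_MASK32) {
		if (iovad->low_used >= TOY_LOW_SLOTS)
			return 0;
		return (TOY_MASK32 + 1) - TOY_PAGE * ++iovad->low_used;
	}
	return (limit - (TOY_PAGE - 1)) - TOY_PAGE * iovad->high_used++;
}

/* Same shape as the patched function: try 32-bit first, give up once. */
static uint64_t toy_alloc_iova(struct toy_dev *dev, struct toy_iovad *iovad,
			       uint64_t dma_limit)
{
	uint64_t iova;

	if (dma_limit > TOY_MASK32 && dev->pci_32bit_workaround) {
		iova = toy_alloc_range(iovad, TOY_MASK32);
		if (iova)
			return iova;

		/* 32-bit space exhausted: stop trying for this device. */
		dev->pci_32bit_workaround = false;
		dev->notices++;
	}
	return toy_alloc_range(iovad, dma_limit);
}
```

With a 48-bit dma_limit, repeated calls to toy_alloc_iova() hand out
TOY_LOW_SLOTS addresses below 4GB, then clear the flag exactly once and
return the topmost page under the limit. Later calls skip the 32-bit
attempt entirely, which is the fast path the patch is after.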