Message-ID: <47afacda-3023-4eb7-b227-5f725c3187c2@arm.com>
Date: Tue, 5 Mar 2024 12:05:23 +0000
X-Mailing-List: linux-kernel@vger.kernel.org
MIME-Version: 1.0
User-Agent: Mozilla Thunderbird
Subject: Re: [RFC RESEND 00/16] Split IOMMU DMA mapping operation to two steps
To: Leon Romanovsky, Christoph Hellwig, Marek Szyprowski, Joerg Roedel,
 Will Deacon, Jason Gunthorpe, Chaitanya Kulkarni
Cc: Jonathan Corbet, Jens Axboe, Keith Busch, Sagi Grimberg, Yishai Hadas,
 Shameer Kolothum, Kevin Tian, Alex Williamson,
 Jérôme Glisse, Andrew Morton, linux-doc@vger.kernel.org,
 linux-kernel@vger.kernel.org, linux-block@vger.kernel.org,
 linux-rdma@vger.kernel.org, iommu@lists.linux.dev,
 linux-nvme@lists.infradead.org, kvm@vger.kernel.org, linux-mm@kvack.org,
 Bart Van Assche, Damien Le Moal, Amir Goldstein, "josef@toxicpanda.com",
 "Martin K. Petersen", "daniel@iogearbox.net", Dan Williams, "jack@suse.com",
 Leon Romanovsky, Zhu Yanjun
References:
From: Robin Murphy
Content-Language: en-GB
In-Reply-To:
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit

On 2024-03-05 11:18 am, Leon Romanovsky wrote:
> This is the complementary part to the proposed LSF/MM topic.
> https://lore.kernel.org/linux-rdma/22df55f8-cf64-4aa8-8c0b-b556c867b926@linux.dev/T/#m85672c860539fdbbc8fe0f5ccabdc05b40269057
>
> This is posted as an RFC to get feedback on the proposed split. The RDMA,
> VFIO and DMA patches are ready for review and inclusion; the NVMe patches
> are still in progress as they require agreement on the API first.
>
> Thanks
>
> -------------------------------------------------------------------------------
> The DMA mapping operation performs two steps at the same time: it
> allocates IOVA space and actually maps DMA pages into that space. This
> one-shot operation works perfectly for simple scenarios, where callers
> use the DMA API in the control path when they set up hardware.
>
> However, in more complex scenarios, when DMA mapping is needed in the
> data path, and especially when some specific datatype is involved, such
> a one-shot approach has its drawbacks.
>
> That approach pushes developers to introduce new DMA APIs for specific
> datatypes. For example, the existing scatter-gather mapping functions,
> Chuck's latest RFC series to add biovec-related DMA mapping [1], and
> probably struct folio will need one too.
>
> These advanced DMA mapping APIs are needed to calculate the IOVA size so
> it can be allocated as one chunk, plus some sort of offset calculation to
> know which part of the IOVA to map.

I don't follow this part at all - at *some* point, something must know a
range of memory addresses involved in a DMA transfer, so that's where it
should map that range for DMA. Even in a badly-designed system where the
point at which it's most practical to make the mapping is further out and
only knows that DMA will touch some subset of a buffer, but doesn't know
exactly what subset yet, you'd usually just map the whole buffer. I don't
see why the DMA API would ever need to know about anything other than
pages/PFNs and dma_addr_ts (yes, it does also accept them being wrapped
together in scatterlists; yes, scatterlists are awful and it would be nice
to replace them with a better general DMA descriptor; that is a whole
other subject of its own).

> Instead of teaching DMA to know these specific datatypes, let's separate
> the existing DMA mapping routine into two steps and give advanced callers
> (subsystems) the option to perform all calculations internally in advance
> and map pages later when needed.

From a brief look, this is clearly an awkward reinvention of the IOMMU
API. If IOMMU-aware drivers/subsystems want to explicitly manage IOMMU
address spaces then they can and should use the IOMMU API. Perhaps there's
room for some quality-of-life additions to the IOMMU API to help with
common usage patterns, but the generic DMA mapping API is absolutely not
the place for it.

Thanks,
Robin.
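
For context, the explicit IOMMU address-space management Robin refers to
follows roughly the pattern below. This is a minimal sketch against the
existing in-kernel IOMMU API (the gfp argument to iommu_map() assumes a
recent kernel); the device, buffer and IOVA choice are placeholders and
the function is illustrative only, not code from the series:

#include <linux/iommu.h>

/*
 * Minimal sketch: a driver that wants full control of its IOVA layout
 * allocates an unmanaged domain, attaches its device, and then places
 * mappings at IOVAs of its own choosing.
 */
static int example_map_at_fixed_iova(struct device *dev, unsigned long iova,
				     phys_addr_t paddr, size_t size)
{
	struct iommu_domain *domain;
	int ret;

	domain = iommu_domain_alloc(dev->bus);		/* unmanaged domain */
	if (!domain)
		return -ENOMEM;

	ret = iommu_attach_device(domain, dev);
	if (ret)
		goto err_free;

	/* Map 'size' bytes of physically contiguous memory at 'iova' */
	ret = iommu_map(domain, iova, paddr, size,
			IOMMU_READ | IOMMU_WRITE, GFP_KERNEL);
	if (ret)
		goto err_detach;

	return 0;

err_detach:
	iommu_detach_device(domain, dev);
err_free:
	iommu_domain_free(domain);
	return ret;
}

Tear-down would be iommu_unmap() over the same IOVA range, followed by
detaching the device and freeing the domain.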
> In this series, three users are converted, and each conversion presents
> a different positive gain:
> 1. RDMA simplifies and speeds up its page-fault handling for
>    on-demand-paging (ODP) mode.
> 2. VFIO PCI live-migration code saves a huge chunk of memory.
> 3. NVMe PCI avoids intermediate SG table manipulation and operates
>    directly on BIOs.
>
> Thanks
>
> [1] https://lore.kernel.org/all/169772852492.5232.17148564580779995849.stgit@klimt.1015granger.net
>
> Chaitanya Kulkarni (2):
>   block: add dma_link_range() based API
>   nvme-pci: use blk_rq_dma_map() for NVMe SGL
>
> Leon Romanovsky (14):
>   mm/hmm: let users to tag specific PFNs
>   dma-mapping: provide an interface to allocate IOVA
>   dma-mapping: provide callbacks to link/unlink pages to specific IOVA
>   iommu/dma: Provide an interface to allow preallocate IOVA
>   iommu/dma: Prepare map/unmap page functions to receive IOVA
>   iommu/dma: Implement link/unlink page callbacks
>   RDMA/umem: Preallocate and cache IOVA for UMEM ODP
>   RDMA/umem: Store ODP access mask information in PFN
>   RDMA/core: Separate DMA mapping to caching IOVA and page linkage
>   RDMA/umem: Prevent UMEM ODP creation with SWIOTLB
>   vfio/mlx5: Explicitly use number of pages instead of allocated length
>   vfio/mlx5: Rewrite create mkey flow to allow better code reuse
>   vfio/mlx5: Explicitly store page list
>   vfio/mlx5: Convert vfio to use DMA link API
>
>  Documentation/core-api/dma-attributes.rst |   7 +
>  block/blk-merge.c                         | 156 ++++++++++++++
>  drivers/infiniband/core/umem_odp.c        | 219 +++++++------------
>  drivers/infiniband/hw/mlx5/mlx5_ib.h      |   1 +
>  drivers/infiniband/hw/mlx5/odp.c          |  59 +++--
>  drivers/iommu/dma-iommu.c                 | 129 ++++++++---
>  drivers/nvme/host/pci.c                   | 220 +++++--------------
>  drivers/vfio/pci/mlx5/cmd.c               | 252 ++++++++++++----------
>  drivers/vfio/pci/mlx5/cmd.h               |  22 +-
>  drivers/vfio/pci/mlx5/main.c              | 136 +++++-------
>  include/linux/blk-mq.h                    |   9 +
>  include/linux/dma-map-ops.h               |  13 ++
>  include/linux/dma-mapping.h               |  39 ++++
>  include/linux/hmm.h                       |   3 +
>  include/rdma/ib_umem_odp.h                |  22 +-
>  include/rdma/ib_verbs.h                   |  54 +++++
>  kernel/dma/debug.h                        |   2 +
>  kernel/dma/direct.h                       |   7 +-
>  kernel/dma/mapping.c                      |  91 ++++++++
>  mm/hmm.c                                  |  34 +--
>  20 files changed, 870 insertions(+), 605 deletions(-)
>
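
To make the shape of the proposed split more concrete, a caller of the
two-step flow described in the cover letter might look roughly like the
sketch below. The names dma_alloc_iova(), dma_link_range(),
dma_unlink_range() and dma_free_iova() echo the patch titles above, but
the signatures and calling convention shown here are purely illustrative
assumptions, not the API actually implemented by the series:

#include <linux/dma-mapping.h>

/*
 * HYPOTHETICAL sketch of the two-step flow; the real prototypes live in
 * the series itself and will differ. Step 1 sizes and allocates the IOVA
 * range up front; step 2 links pages into it as the data path goes.
 */
static int example_two_step_map(struct device *dev, struct page **pages,
				unsigned int npages)
{
	size_t size = (size_t)npages << PAGE_SHIFT;
	dma_addr_t iova;
	unsigned int i;
	int ret;

	/* Step 1: reserve a contiguous IOVA range for the whole transfer */
	ret = dma_alloc_iova(dev, size, &iova);		/* hypothetical */
	if (ret)
		return ret;

	/* Step 2: link each page at the offset the caller has computed */
	for (i = 0; i < npages; i++) {
		ret = dma_link_range(dev,
				     iova + ((dma_addr_t)i << PAGE_SHIFT),
				     page_to_phys(pages[i]), PAGE_SIZE,
				     DMA_BIDIRECTIONAL);	/* hypothetical */
		if (ret)
			goto err_unlink;
	}

	return 0;

err_unlink:
	while (i--)
		dma_unlink_range(dev, iova + ((dma_addr_t)i << PAGE_SHIFT),
				 PAGE_SIZE);		/* hypothetical */
	dma_free_iova(dev, iova, size);			/* hypothetical */
	return ret;
}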