Message-ID: <1492565149.25766.128.camel@kernel.crashing.org>
Subject: Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory
From: Benjamin Herrenschmidt
To: Jason Gunthorpe, Dan Williams
Cc: Logan Gunthorpe, Bjorn Helgaas, Christoph Hellwig, Sagi Grimberg, "James E.J. Bottomley", "Martin K. Petersen", Jens Axboe, Steve Wise, Stephen Bates, Max Gurtovoy, Keith Busch, linux-pci@vger.kernel.org, linux-scsi, linux-nvme@lists.infradead.org, linux-rdma@vger.kernel.org, linux-nvdimm, "linux-kernel@vger.kernel.org", Jerome Glisse
Date: Wed, 19 Apr 2017 11:25:49 +1000
In-Reply-To: <20170418232159.GA28477@obsidianresearch.com>

On Tue, 2017-04-18 at 17:21 -0600, Jason Gunthorpe wrote:
> Splitting the sgl is different from iommu batching.
>
> As an example, an O_DIRECT write of 1 MB with a single 4K P2P page in
> the middle.
>
> The optimum behavior is to allocate a 1MB-4K iommu range and fill it
> with the CPU memory. Then return a SGL with three entries, two
> pointing into the range and one to the p2p.
>
> It is creating each range which tends to be expensive, so creating
> two ranges (or worse, if every SGL created a range it would be 255) is
> very undesired.

I think it's easier, to get us started, to just use a helper and stick
it in the existing sglist processing loop of the architecture.

As we noticed, stacking dma_ops is actually non-trivial and opens quite
a can of worms.

As Jerome mentioned, you can end up with IO ops containing an sglist
that is a collection of memory and GPU pages, for example.

Cheers,
Ben.