Date: Tue, 18 Apr 2017 16:24:40 -0600
From: Jason Gunthorpe
To: Logan Gunthorpe
Cc: Dan Williams, Benjamin Herrenschmidt, Bjorn Helgaas, Christoph Hellwig,
	Sagi Grimberg, "James E.J. Bottomley", "Martin K. Petersen",
	Jens Axboe, Steve Wise, Stephen Bates, Max Gurtovoy, Keith Busch,
	linux-pci@vger.kernel.org, linux-scsi, linux-nvme@lists.infradead.org,
	linux-rdma@vger.kernel.org, linux-nvdimm, linux-kernel@vger.kernel.org,
	Jerome Glisse
Subject: Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory
Message-ID: <20170418222440.GA27113@obsidianresearch.com>
In-Reply-To: <9fc9352f-86fe-3a9e-e372-24b3346b518c@deltatee.com>

On Tue, Apr 18, 2017 at 03:31:58PM -0600, Logan Gunthorpe wrote:

> 1) It means that sg_has_p2p has to walk the entire sg and check every
> page. Then map_sg_p2p/map_sg has to walk it again and repeat the check
> then do some operation per page. If anyone is concerned about the
> dma_map performance this could be an issue.

dma_map performance is a concern, which is why I suggest this as an
interim solution until all dma_ops are migrated.
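For concreteness, the interim approach under discussion amounts to a
linear walk of the list checking each entry. A rough userspace sketch
(the struct and the flag are simplified stand-ins for the kernel's
scatterlist API, not the real thing):

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified stand-in for the kernel's scatterlist entry: a low bit in
 * page_link marks a P2P (device-memory) page, mirroring how the kernel
 * packs flag bits into pointer-sized fields. */
#define SG_P2P_FLAG 0x1UL

struct scatterlist {
	unsigned long page_link;   /* page "pointer" | flag bits */
	unsigned int  length;
};

/* Interim check: walk every entry and test the per-page flag. This is
 * the O(n) pass the quoted mail is worried about -- dma_map would then
 * walk the list a second time to actually map it. */
static bool sg_has_p2p(struct scatterlist *sgl, int nents)
{
	for (int i = 0; i < nents; i++)
		if (sgl[i].page_link & SG_P2P_FLAG)
			return true;
	return false;
}
```

The objection in the thread is that this extra walk is pure overhead
for the common no-P2P case, hence the flag-caching idea below.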
Ideally sg_has_p2p would be a fast path that checked some kind of flag
bit set during sg_assign_page... This would probably all have to be
protected with CONFIG_P2P until it becomes performance neutral. People
without an iommu are not going to want to walk the sg list at all..

> 2) Without knowing exactly what the arch specific code may need to do
> it's hard to say that this is exactly the right approach. If every
> dma_ops provider has to do exactly this on every page it may lead to a
> lot of duplicate code:

I think someone would have to start to look at it to make a
determination..

I suspect the main server-oriented iommu dma ops will want to have
proper p2p support anyhow, and will probably have their own unique
control flow..

> The only thing I'm presently aware of is the segment check and applying
> the offset to the physical address

Well, I called the function p2p_same_segment_map_page() in my last
suggestion for a reason - that is all the helper does.

The intention would be for real iommu drivers to call that helper for
the one simple case, and if it fails then use their own routines to
figure out if cross-segment P2P is possible and configure the iommu as
needed.

> bus specific and not arch specific which I think is what Dan may be
> getting at. So it may make sense to just have a pci_map_sg_p2p() which
> takes a dma_ops struct it would use for any page that isn't a p2p page.

Like I keep saying, dma_ops are not really designed to be stacked. Try
to write a stacked map_sg function like you describe and you will see
how horrible it quickly becomes.

Setting up an iommu is very expensive, so we need to batch it for the
entire sg list. Thus a trivial implementation that iterates over all sg
list entries one at a time is not desired.

So first an sg list without p2p memory would have to be created, passed
to the lower-level ops, then brought back. Remember, the returned sg
list will have a different number of entries than the original.
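Concretely, the suggestion is that each iommu driver's map_sg loop
tries the simple helper first and only falls back to its own
cross-segment logic when that fails. A rough sketch follows; everything
except the p2p_same_segment_map_page() name is invented for
illustration (simplified stand-in types, an arbitrary example segment
window, identity mapping in place of real iommu programming):

```c
#include <assert.h>

typedef unsigned long long dma_addr_t;

/* Simplified stand-in for a scatterlist entry. */
struct scatterlist {
	unsigned long phys;    /* physical address of the entry */
	unsigned int  length;
	int           is_p2p;  /* stand-in for a per-page P2P flag */
	dma_addr_t    dma_address;
};

/* From the thread: the helper handles only the easy case where both
 * devices sit under the same PCI segment, so the DMA address is just
 * the physical address shifted into the bus's view of that window.
 * Returns 0 on success, -1 if cross-segment handling is needed. */
static int p2p_same_segment_map_page(struct scatterlist *sg)
{
	const unsigned long seg_base = 0x80000000UL; /* invented window */
	const unsigned long seg_size = 0x10000000UL;

	if (sg->phys < seg_base || sg->phys >= seg_base + seg_size)
		return -1; /* not same-segment: caller programs the iommu */
	sg->dma_address = sg->phys - seg_base; /* bus-relative offset */
	return 0;
}

/* Sketch of a driver's map_sg loop: the P2P handling is one extra
 * helper call inside the existing per-entry iteration, rather than a
 * separate split/merge pass over the whole list. */
static int example_map_sg(struct scatterlist *sgl, int nents)
{
	for (int i = 0; i < nents; i++) {
		struct scatterlist *sg = &sgl[i];

		if (sg->is_p2p) {
			if (p2p_same_segment_map_page(sg) == 0)
				continue;
			return -1; /* cross-segment P2P: driver-specific path */
		}
		sg->dma_address = sg->phys; /* pretend identity mapping */
	}
	return nents;
}
```

The point of this shape is that the common no-P2P entry pays only one
flag test, and the list is still handed to the iommu in one batch.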
Now another complex loop is needed to split/merge back in the p2p sg
elements to get a return result.

Finally, we have to undo all of this when doing unmap.

Basically, all this list processing is a huge overhead compared to just
putting a helper call in the existing sg iteration loop of the actual
op. Particularly if the actual op is a no-op like no-mmu x86 would use.

Since dma mapping is a performance path we must be careful not to
create intrinsic inefficiencies with otherwise nice layering :)

Jason