From: Logan Gunthorpe <logang@deltatee.com>
To: Benjamin Herrenschmidt, Christoph Hellwig, Sagi Grimberg,
 "James E.J. Bottomley", "Martin K. Petersen", Jens Axboe, Steve Wise,
 Stephen Bates, Max Gurtovoy, Dan Williams, Keith Busch, Jason Gunthorpe
Cc: linux-pci@vger.kernel.org, linux-scsi@vger.kernel.org,
 linux-nvme@lists.infradead.org, linux-rdma@vger.kernel.org,
 linux-nvdimm@ml01.01.org, linux-kernel@vger.kernel.org
Subject: Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory
Date: Wed, 12 Apr 2017 11:09:53 -0600
Message-ID: <5ac22496-56ec-025d-f153-140001d2a7f9@deltatee.com>
In-Reply-To: <1491974532.7236.43.camel@kernel.crashing.org>
References: <1490911959-5146-1-git-send-email-logang@deltatee.com>
 <1491974532.7236.43.camel@kernel.crashing.org>

On 11/04/17 11:22 PM, Benjamin Herrenschmidt wrote:
> Another issue of course is that not all systems support P2P
> between host bridges :-) (Though almost all switches can enable it.)

Yes, I'm either going to just let the user enable and test, or limit it
to switches only to start. However, our bigger issue right now is
finding a way to not violate iomem safety.

> Ok. I suppose that's a reasonable starting point. I haven't looked
> at the patches in detail yet, but it would be nice if that policy was
> in a well-isolated component so it can potentially be affected by
> arch/platform code.

The policy is isolated in the new p2pmem driver. There's no reason the
policy couldn't become arbitrarily complex, with specific arch
exceptions; it's just that people would have to do the work to create
those exceptions.

> Do you handle funky address translation too? I.e. the fact that the
> PCI addresses aren't the same as the CPU physical addresses for a BAR?

No, we use the CPU physical address of the BAR. If it's not mapped that
way, we can't use it.

>> This will mean many setups that could likely
>> work well will not be supported so that we can be more confident it
>> will work and not place any responsibility on the user to understand
>> their topology. (We've chosen to go this route based on feedback we
>> received at LSF.)
>>
>> In order to enable this functionality we introduce a new p2pmem device
>> which can be instantiated by PCI drivers. The device will register some
>> PCI memory as ZONE_DEVICE and provide a genalloc-based allocator for
>> users of these devices to get buffers.
>
> I don't completely understand this. Is this actual memory on the PCI
> bus? Where does it come from? Or are you just trying to create struct
> pages that cover your PCIe DMA target?

Yes, the memory is on the PCI bus in a BAR.
For now we have a special PCI card for this, but in the future it would
likely be the CMB in an NVMe card. These patches create struct pages to
map these BAR addresses using ZONE_DEVICE.

> So correct me if I'm wrong: you are trying to create struct pages that
> map a PCIe BAR, right? I'm trying to understand how that interacts
> with what Jerome is doing for HMM.

Yes; we are using ZONE_DEVICE in exactly the same way the dax code is.
These patches use the existing API with no modifications. As I
understand it, HMM was using ZONE_DEVICE in a way that was quite
different from how it was originally designed.

> The reason is that HMM currently creates the struct pages with
> "fake" PFNs pointing to a hole in the address space rather than
> covering the actual PCIe memory of the GPU. He does that to deal with
> the fact that some GPUs have a smaller aperture on PCIe than their
> total memory.

I'm aware of what HMM is trying to do and, although I'm not familiar
with the intimate details, I saw it as fairly orthogonal to what we are
attempting to do.

> However, I have asked him to only apply that policy if the aperture is
> indeed smaller, and if not, create struct pages that directly cover
> the PCIe BAR of the GPU instead, which will work better on systems or
> architectures that don't have a "pinhole" window limitation.
> However, he was under the impression that this was going to collide
> with what you guys are doing, so I'm trying to understand how.

I'm not sure I understand how either. However, I suspect that if you
collide with these patches then you'd also be breaking dax.

Logan