Date: Wed, 23 Nov 2016 12:05:15 -0700
From: Jason Gunthorpe
To: Logan Gunthorpe
Cc: Serguei Sagalovitch, Dan Williams, "Deucher, Alexander", linux-nvdimm@lists.01.org, linux-rdma@vger.kernel.org, linux-pci@vger.kernel.org, "Kuehling, Felix", "Bridgman, John", linux-kernel@vger.kernel.org, dri-devel@lists.freedesktop.org, "Koenig, Christian", "Sander, Ben", "Suthikulpanit, Suravee", "Blinzer, Paul", Linux-media@vger.kernel.org
Subject: Re: Enabling peer to peer device transactions for PCIe devices
Message-ID: <20161123190515.GA12146@obsidianresearch.com>
In-Reply-To: <45c6e878-bece-7987-aee7-0e940044158c@deltatee.com>

On Wed, Nov 23, 2016 at 10:13:03AM -0700, Logan Gunthorpe wrote:

> an MR would be very tricky. The MR may be relied upon by another host
> and the kernel would have to inform user-space the MR was invalid, then
> user-space would have to tell the remote application.

As Bart says, this would best be combined with something like
Mellanox's ODP MRs, which allow a page to be evicted and then trigger a
CPU interrupt if a DMA is attempted, so the page can be brought back.
This includes the usual fencing mechanism, so the CPU can block, flush,
and then evict a page coherently.
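The fence/evict flow described above is essentially what an mmu_notifier-based driver does. A very rough sketch, purely illustrative (the real ODP implementation lives in the mlx5 driver; `hca_fence_and_invalidate()` and `mn_to_mr()` are made-up names, and the notifier signature is the one current at the time of writing):

```c
/* Illustrative sketch only -- not actual ODP code. */
static void odp_invalidate_range_start(struct mmu_notifier *mn,
				       struct mm_struct *mm,
				       unsigned long start,
				       unsigned long end)
{
	/*
	 * 1. Tell the HCA the translations for [start, end) are going
	 *    away and wait for it to acknowledge.  This is the fence:
	 *    after it completes, DMA to this range page-faults instead
	 *    of hitting stale pages.
	 */
	hca_fence_and_invalidate(mn_to_mr(mn), start, end);

	/* 2. The CPU is now free to flush and evict the pages coherently. */
}
```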
This is the general direction the industry is heading: link PCI DMA
directly to dynamic user page tables, including support for demand
faulting and synchronization.

Mellanox ODP is a rough implementation of mirroring a process's page
table via the kernel, while IBM's CAPI (and CCIX, PCI ATS?) is probably
a good example of where this is ultimately headed. CAPI allows a PCI
DMA to directly target an ASID associated with a user process and then
use the usual CPU machinery to do the page translation for the DMA.
This includes page faults for evicted pages, and obviously allows
eviction and migration.

So, of all the solutions in the original list, I would discard anything
that isn't VMA focused. Emulating in software what CAPI does in
hardware is probably the best choice, or we will have to do it all
again when CAPI-style hardware rolls out broadly :(

DAX and GPU allocators should create VMAs and manipulate them in the
usual way to achieve migration, windowing, cache & mirror, movement or
swap of the potentially peer-to-peer memory pages. They would have to
respect the usual rules for a VMA, including pinning.

DMA drivers would use the usual approaches for dealing with DMA from a
VMA: a short-term pin or a long-term coherent translation mirror.

So, in my view (looking from RDMA), the main problem with peer-to-peer
is: how do you DMA-translate VMAs that point at non-struct-page memory?

Does HMM solve the peer-to-peer problem? Does it do it generically, or
only for drivers that are mirroring translation tables?

From an RDMA perspective we could use something other than
get_user_pages() to pin and DMA-translate a VMA, if the core community
could decide on an API. eg get_user_dma_sg() would probably be quite
usable.

Jason
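P.S. To make the get_user_dma_sg() suggestion concrete, one possible
shape for such an API is sketched below. This is purely illustrative:
no such function exists in the kernel, and every parameter here is an
assumption. The idea is a get_user_pages()-like call that returns
device-usable bus addresses in a scatterlist, so the VMA may cover
non-struct-page (peer) memory:

```c
/*
 * Illustrative sketch only -- get_user_dma_sg() is a proposed name,
 * not an existing kernel API.  Like get_user_pages(), it would pin a
 * user VA range; unlike it, it returns a scatterlist of bus addresses
 * valid for dma_dev, so the backing need not be struct-page memory.
 */
int get_user_dma_sg(struct device *dma_dev,	/* device doing the DMA */
		    unsigned long start,	/* user VA, page aligned */
		    unsigned long nr_pages,	/* length of the range */
		    unsigned int gup_flags,	/* FOLL_WRITE, etc. */
		    struct scatterlist *sgl,	/* out: entries for dma_dev */
		    unsigned int *nents);	/* out: entries filled */
```

A matching put_user_dma_sg() would drop the pin and any translation
resources, mirroring the get_user_pages()/put_page() pairing.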