From: Dan Williams
Date: Sun, 16 Apr 2017 08:53:45 -0700
Subject: Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory
To: Benjamin Herrenschmidt
Cc: Logan Gunthorpe, Bjorn Helgaas, Jason Gunthorpe, Christoph Hellwig,
 Sagi Grimberg, "James E.J. Bottomley", "Martin K. Petersen", Jens Axboe,
 Steve Wise, Stephen Bates, Max Gurtovoy, Keith Busch,
 linux-pci@vger.kernel.org, linux-scsi, linux-nvme@lists.infradead.org,
 linux-rdma@vger.kernel.org, linux-nvdimm, "linux-kernel@vger.kernel.org",
 Jerome Glisse

On Sat, Apr 15, 2017 at 8:01 PM, Benjamin Herrenschmidt wrote:
> On Sat, 2017-04-15 at 15:09 -0700, Dan Williams wrote:
>> I'm wondering, since this is limited to support behind a single
>> switch, if you could have a software iommu hanging off that switch
>> device object that knows how to catch and translate the non-zero
>> offset bus address case. We have something like this with the VMD
>> driver, and I toyed with a soft PCI bridge when trying to support
>> AHCI+NVMe BAR remapping. When the DMA API looks up the iommu for its
>> device, it hits this soft-iommu, and that driver checks whether the
>> page is host memory or device memory to do the DMA translation. You
>> wouldn't need a bit in struct page, just a lookup to the hosting
>> struct dev_pagemap in the is_zone_device_page() case, and that can
>> point you to the p2p details.
>
> I was thinking about a hook in the arch DMA ops, but that kind of
> wrapper might work instead, indeed. However, I'm not sure what the
> best way to "instantiate" it is.
>
> The main issue is that the DMA ops are a function of the initiator,
> not the target (since the target is supposed to be memory), so things
> are a bit awkward.
>
> One (user?) would have to know that a given device "intends" to DMA
> directly to another device.
>
> This is awkward because, in the ideal scenario, this isn't something
> the device knows. For example, one could want an existing NIC to DMA
> directly to/from NVMe pages or GPU pages.
>
> The NIC itself doesn't know the characteristics of these pages, but
> *something* needs to insert itself in the DMA ops of that bridge to
> make it possible.
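To make the "wrapper" idea a bit more concrete, something along these
lines is what I have in mind. This is only a sketch: none of the p2p_*
names below exist today, p2p_translate() is a stand-in for the
bus-address fixup (sketched at the end of this mail), and how these ops
get installed on the right devices below the switch is exactly your
"instantiate" question.

#include <linux/dma-mapping.h>
#include <linux/mm.h>

/* hypothetical bus-address fixup, sketched at the end of this mail */
static dma_addr_t p2p_translate(struct device *dev, struct page *page,
				unsigned long offset);

/* the device's original ops, saved when the wrapper is installed */
static const struct dma_map_ops *real_ops;

static dma_addr_t p2p_wrap_map_page(struct device *dev, struct page *page,
				    unsigned long offset, size_t size,
				    enum dma_data_direction dir,
				    unsigned long attrs)
{
	/* device memory (ZONE_DEVICE) gets the p2p treatment... */
	if (is_zone_device_page(page))
		return p2p_translate(dev, page, offset);

	/* ...everything else falls through to the real iommu path */
	return real_ops->map_page(dev, page, offset, size, dir, attrs);
}

static const struct dma_map_ops p2p_wrap_ops = {
	.map_page	= p2p_wrap_map_page,
	/* .unmap_page, .map_sg, etc. would forward to real_ops similarly */
};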
> That's why I wonder if it's the struct page of the target that should
> be "marked" in such a way that the arch DMA ops can immediately catch
> that they belong to a device and might require "wrapped" operations.
>
> Are ZONE_DEVICE pages identifiable based on the struct page alone? (a
> flag?)

Yes, is_zone_device_page(). However, I think we're getting to the point
with pmem, hmm, cdm, and now p2p where ZONE_DEVICE is losing its
specific meaning, and we need explicit type checks like is_hmm_page()
and is_p2p_page() that internally check is_zone_device_page() plus some
other type-specific information.

> That would allow us to keep a fast path for normal memory targets, but
> also have some kind of way to handle the special cases of such
> peer-to-peer transfers (or also handle other types of peer-to-peer
> that don't necessarily involve PCI address wrangling but could require
> additional iommu bits).
>
> Just thinking out loud ... I don't have a firm idea or a design. But
> peer-to-peer is definitely a problem we need to tackle generically;
> the demand for it keeps coming up.

ZONE_DEVICE allows you to redirect via get_dev_pagemap() to retrieve
context about the physical address in question. I'm thinking you can
hang the bus address translation data off of that structure. This seems
vaguely similar to what HMM is doing.
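Roughly like the following, purely as a sketch: struct p2p_pagemap,
is_p2p_page() and the bus_offset field are all made up for
illustration, dev_pagemap carries no type tag today (so a real
is_p2p_page() needs more than is_zone_device_page()), and I'm assuming
the p2p provider (an NVMe CMB, a GPU BAR, ...) embeds the dev_pagemap
it registers in its own context structure.

#include <linux/memremap.h>
#include <linux/mm.h>
#include <linux/io.h>

/* hypothetical p2p context hung off the provider's ZONE_DEVICE pagemap */
struct p2p_pagemap {
	struct dev_pagemap pgmap;	/* assumed embedded by the provider */
	u64 bus_offset;			/* CPU physical -> switch bus address */
};

static bool is_p2p_page(const struct page *page)
{
	/* a real check would also need a per-pagemap type tag, elided here */
	return is_zone_device_page(page);
}

/* the bus-address fixup called from the wrapped map_page above */
static dma_addr_t p2p_translate(struct device *dev, struct page *page,
				unsigned long offset)
{
	struct dev_pagemap *pgmap;
	struct p2p_pagemap *p2p;
	dma_addr_t addr = 0;	/* error handling elided in this sketch */

	if (!is_p2p_page(page))
		return addr;

	pgmap = get_dev_pagemap(page_to_pfn(page), NULL);
	if (pgmap) {
		p2p = container_of(pgmap, struct p2p_pagemap, pgmap);
		/* peers behind one switch: skip the iommu, apply the offset */
		addr = page_to_phys(page) + offset + p2p->bus_offset;
		put_dev_pagemap(pgmap);
	}
	return addr;
}

Since dev_pagemap already records the device that registered the pages,
the same lookup could also be where the "only behind a single switch"
restriction gets enforced.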