Date: Thu, 1 Mar 2018 17:00:32 -0600
From: Bjorn Helgaas
To: Logan Gunthorpe
Cc: linux-kernel@vger.kernel.org, linux-pci@vger.kernel.org,
    linux-nvme@lists.infradead.org, linux-rdma@vger.kernel.org,
    linux-nvdimm@lists.01.org, linux-block@vger.kernel.org,
    Stephen Bates, Christoph Hellwig, Jens Axboe, Keith Busch,
    Sagi Grimberg, Bjorn Helgaas, Jason Gunthorpe, Max Gurtovoy,
    Dan Williams, Jérôme Glisse, Benjamin Herrenschmidt,
    Alex Williamson
Subject: Re: [PATCH v2 01/10] PCI/P2PDMA: Support peer to peer memory
Message-ID: <20180301230032.GA74737@bhelgaas-glaptop.roam.corp.google.com>
References: <20180228234006.21093-1-logang@deltatee.com>
 <20180228234006.21093-2-logang@deltatee.com>
 <20180301173752.GE13722@bhelgaas-glaptop.roam.corp.google.com>

On Thu, Mar 01, 2018 at 11:55:51AM -0700, Logan Gunthorpe wrote:
> Hi Bjorn,
> 
> Thanks for the review. I'll correct all the nits for the next version.
> 
> On 01/03/18 10:37 AM, Bjorn Helgaas wrote:
> > On Wed, Feb 28, 2018 at 04:39:57PM -0700, Logan Gunthorpe wrote:
> > > Some PCI devices may have memory mapped in a BAR space that's
> > > intended for use in Peer-to-Peer transactions. In order to enable
> > > such transactions the memory must be registered with ZONE_DEVICE pages
> > > so it can be used by DMA interfaces in existing drivers.
> > 
> > Is there anything about this memory that makes it specifically
> > intended for peer-to-peer transactions? I assume the device can't
> > really tell whether a transaction is from a CPU or a peer.
> 
> No there's nothing special about the memory and it can still be accessed by
> the CPU. This is just the intended purpose. You could use this PCI memory as
> regular DMA buffers or regular memory but I'm not sure why you would. It
> would probably be pretty bad performance-wise.
> 
> > BTW, maybe there could be some kind of guide for device driver writers
> > in Documentation/PCI/?
> 
> Makes sense we can look at writing something for the next iteration.
> 
> > I think it would be clearer and sufficient to simply say that we have
> > no way to know whether peer-to-peer routing between PCIe Root Ports is
> > supported (PCIe r4.0, sec 1.3.1).
> 
> Fair enough.
> 
> > The fact that you use the PCIe term "switch" suggests that a PCIe
> > Switch is required, but isn't it sufficient for the peers to be below
> > the same "PCI bridge", which would include PCIe Root Ports, PCIe
> > Switch Downstream Ports, and conventional PCI bridges?
> > 
> > The comments at get_upstream_bridge_port() suggest that this isn't
> > enough, and the peers actually do have to be below the same PCIe
> > Switch, but I don't know why.
> 
> I do mean Switch as we do need to keep the traffic off the root complex.
> Seeing, as stated above, we don't know if it actually support it. (While we
> can be certain any PCI switch does). So we specifically want to exclude PCIe
> Root ports and I'm not sure about the support of PCI bridges but I can't
> imagine anyone wanting to do P2P around them so I'd rather be safe than
> sorry and exclude them.

I don't think this is correct. A Root Port defines a hierarchy domain
(I'm looking at PCIe r4.0, sec 1.3.1). The capability to route
peer-to-peer transactions *between* hierarchy domains is optional. I
think this means a Root Complex is not required to route transactions
from one Root Port to another Root Port.

This doesn't say anything about routing between two different devices
below a Root Port. Those would be in the same hierarchy domain and
should follow the conventional PCI routing rules. Of course, since a
Root Port has one link that leads to one device, they would probably
be different functions of a single multi-function device, so I don't
know how practical it would be to test this.
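Just to make the "single shared upstream bridge" idea concrete, a check
could look something like the sketch below. This is only an illustration
of what I mean, not code from your patch; the function name is made up
and I haven't tried to build it. It walks up from each device with
pci_upstream_bridge() and stops before crossing a Root Port, since we
agree we can't assume the Root Complex routes P2P traffic:

#include <linux/pci.h>

/*
 * Sketch only: return true if the two devices share any upstream
 * bridge below (and not including) a Root Port.
 */
static bool share_upstream_bridge(struct pci_dev *a, struct pci_dev *b)
{
	struct pci_dev *up_a, *up_b;

	for (up_a = pci_upstream_bridge(a); up_a;
	     up_a = pci_upstream_bridge(up_a)) {

		/*
		 * Don't cross a Root Port: routing P2P through the Root
		 * Complex may not be supported.
		 */
		if (pci_is_pcie(up_a) &&
		    pci_pcie_type(up_a) == PCI_EXP_TYPE_ROOT_PORT)
			break;

		/* Any common upstream bridge below the Root Port will do */
		for (up_b = pci_upstream_bridge(b); up_b;
		     up_b = pci_upstream_bridge(up_b))
			if (up_a == up_b)
				return true;
	}

	return false;
}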
> > This whole thing is confusing to me. Why do you want to reject peers
> > directly connected to the same root port? Why do you require the same
> > Switch Upstream Port? You don't exclude conventional PCI, but it
> > looks like you would require peers to share *two* upstream PCI-to-PCI
> > bridges? I would think a single shared upstream bridge (conventional,
> > PCIe Switch Downstream Port, or PCIe Root Port) would be sufficient?
> 
> Hmm, yes, this may just be laziness on my part. Finding the shared upstream
> bridge is a bit more tricky than just showing that they are on the same
> switch. So as coded, a fabric of switches with peers on different legs of
> the fabric are not supported. But yes, maybe they just need to be two
> devices with a single shared upstream bridge that is not the root port.
> Again, we need to reject the root port because we can't know if the root
> complex can support P2P traffic.

This sounds like the same issue as above, so we just need to resolve
that.

> > Since "pci_p2pdma_add_client()" includes "pci_" in its name, it seems
> > sort of weird that callers supply a non-PCI device and then we look up
> > a PCI device here. I assume you have some reason for this; if you
> > added a writeup in Documentation/PCI, that would be a good place to
> > elaborate on that, maybe with a one-line clue here.
> 
> Well yes, but this is much more convenient for callers which don't need to
> care if the device they are attempting to add (which in the NVMe target
> case, could be a random block device) is a pci device or not. Especially
> seeing find_parent_pci_dev() is non-trivial.

OK. I accept that it might be convenient, but I still think it leads
to a weird API. Maybe that's OK; I don't know enough about the
scenario in the caller to do anything more than say "hmm, strange".

> > > +void *pci_alloc_p2pmem(struct pci_dev *pdev, size_t size)
> > > +{
> > > +	void *ret;
> > > +
> > > +	if (unlikely(!pdev->p2pdma))
> > 
> > Is this a hot path? I'm not sure it's worth cluttering
> > non-performance paths with likely/unlikely.
> 
> I'd say it can be pretty hot given that we need to allocate and free buffers
> at multiple GB/s for the NVMe target case. I don't exactly have benchmarks
> or anything to show this though...

OK, I'll take your word for that. I'm fine with this as long as it
is performance-sensitive.

Bjorn
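P.S. For clarity, here's roughly the parent walk I assume
find_parent_pci_dev() does when a caller hands in a non-PCI device (my
own sketch, not copied from your patch, and it ignores the reference
counting and corner cases that presumably make the real one
non-trivial):

#include <linux/pci.h>

/* Sketch only: walk up the device tree until we reach a PCI device. */
static struct pci_dev *find_parent_pci_dev_sketch(struct device *dev)
{
	while (dev) {
		if (dev_is_pci(dev))
			return to_pci_dev(dev);
		dev = dev->parent;
	}

	return NULL;
}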