Date: Thu, 1 Mar 2018 15:55:49 -0500
From: Jerome Glisse
To: Benjamin Herrenschmidt
Cc: Logan Gunthorpe, linux-kernel@vger.kernel.org,
    linux-pci@vger.kernel.org, linux-nvme@lists.infradead.org,
    linux-rdma@vger.kernel.org, linux-nvdimm@lists.01.org,
    linux-block@vger.kernel.org, Stephen Bates, Christoph Hellwig,
    Jens Axboe, Keith Busch, Sagi Grimberg, Bjorn Helgaas,
    Jason Gunthorpe, Max Gurtovoy, Dan Williams, Alex Williamson,
    Oliver OHalloran
Subject: Re: [PATCH v2 00/10] Copy Offload in NVMe Fabrics with P2P PCI Memory
Message-ID: <20180301205548.GA6742@redhat.com>
References: <20180228234006.21093-1-logang@deltatee.com>
    <1519876489.4592.3.camel@kernel.crashing.org>
    <1519876569.4592.4.camel@au1.ibm.com>
    <8e808448-fc01-5da0-51e7-1a6657d5a23a@deltatee.com>
    <1519936195.4592.18.camel@au1.ibm.com>
In-Reply-To: <1519936195.4592.18.camel@au1.ibm.com>

On Fri, Mar 02, 2018 at 07:29:55AM +1100, Benjamin Herrenschmidt wrote:
> On Thu, 2018-03-01 at 11:04 -0700, Logan Gunthorpe wrote:
> >
> > On 28/02/18 08:56 PM, Benjamin Herrenschmidt wrote:
> > > On Thu, 2018-03-01 at 14:54 +1100, Benjamin Herrenschmidt wrote:
> > > > The problem is that according to him (I didn't double check the latest
> > > > patches) you effectively hotplug the PCIe memory into the system when
> > > > creating struct pages.
> > > >
> > > > This cannot possibly work for us. First we cannot map PCIe memory as
> > > > cachable. (Note that doing so is a bad idea if you are behind a PLX
> > > > switch anyway since you'd have to manage cache coherency in SW).
> > >
> > > Note: I think the above means it won't work behind a switch on x86
> > > either, will it ?
> >
> > This works perfectly fine on x86 behind a switch and we've tested it on
> > multiple machines.
> > We've never had an issue of running out of virtual
> > space despite our PCI bars typically being located with an offset of
> > 56TB or more. The arch code on x86 also somehow figures out not to map
> > the memory as cachable so that's not an issue (though, at this point,
> > the CPU never accesses the memory so even if it were, it wouldn't affect
> > anything).
>
> Oliver can you look into this ? You said the memory was effectively
> hotplug'ed into the system when creating the struct pages. That would
> mean to me that it's a) mapped (which for us is cachable, maybe x86 has
> tricks to avoid that) and b) potentially used to populate userspace
> pages (that will definitely be cachable). Unless there's something in
> there you didn't see that prevents it.
>
> > We also had this working on ARM64 a while back but it required some out
> > of tree ZONE_DEVICE patches and some truly horrid hacks to its arch
> > code to ioremap the memory into the page map.
> >
> > You didn't mention what architecture you were trying this on.
>
> ppc64.
>
> > It may make sense at this point to make this feature dependent on x86
> > until more work is done to make it properly portable. Something like
> > arch functions that allow adding IO memory pages with a specific
> > cache setting. Though, if an arch has such restrictive limits on the map
> > size it would probably need to address that too somehow.
>
> Not a fan of that approach.
>
> So there are two issues to consider here:
>
>  - Our MMIO space is very far away from memory (high bits set in the
>    address) which causes problems with things like vmemmap, page_address,
>    virt_to_page etc... Do you have similar issues on arm64 ?

HMM private (HMM public is different) works around that by looking for
"holes" in the address space and using those for hotplug (i.e.
page_to_pfn() != physical pfn of the memory).
This is OK for HMM because the memory is never mapped by the CPU, and we
can find the physical pfn with a little bit of math
(page_to_pfn() - page->pgmap->res->start +
page->pgmap->dev->physical_base_address).

To avoid anything going bad, I do not populate the kernel linear mapping
for the range at all, hence there is definitely no CPU access through
those struct pages. The CPU can still access the PCIe BAR through the
usual MMIO mapping.

>
>  - We need to ensure that the mechanism (which I'm not familiar with)
>    that you use to create the struct page's for the device doesn't end up
>    turning those device pages into normal "general use" pages for the
>    system. Oliver thinks it does, you say it doesn't, ...
>
> Jerome (Glisse), what's your take on this ? Smells like something that
> could be covered by HMM...

Well, this is again a new user of struct page for device memory just for
one use case. I wanted HMM to be more versatile so that it could be used
for this kind of thing too. I guess the message didn't go through. I
will take some cycles tomorrow to look into this patchset to ascertain
how struct page is used in this context.

Note that I also want peer to peer for HMM users, but with ACS and using
the IOMMU, i.e. having to populate the IOMMU page table of one device to
point to the BAR of another device. I need to test on how many platforms
this works; hardware engineers are unable/unwilling to commit on whether
this works or not.

> Logan, the only reason you need struct page's to begin with is for the
> DMA API right ? Or am I missing something here ?

If it is only needed for that, this sounds like a waste of memory for
struct page. Though I understand this allows the new API to match the
previous one.

Cheers,
Jérôme
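
A minimal sketch of the pfn arithmetic described above for HMM private
memory, written as plain user-space C so it can be compiled on its own.
Everything here is an assumption made for illustration: struct fake_pgmap
and its fields are hypothetical stand-ins (not the kernel's struct
dev_pagemap), the page size is assumed to be 4 KiB, and the arithmetic is
done consistently in pfn units rather than mixing pfns and byte addresses.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT 12 /* assumption: 4 KiB pages */

/*
 * Hypothetical stand-in for the per-device mapping state. These are
 * illustrative fields, not the kernel's struct dev_pagemap.
 */
struct fake_pgmap {
	uint64_t hole_start;    /* byte address where the unused "hole" starts */
	uint64_t bar_phys_base; /* real physical byte address of the device memory */
};

/*
 * Translate the pfn reported for a struct page placed in the hole into
 * the pfn of the real device memory: take the offset of the page inside
 * the hole and add it to the first pfn of the device memory.
 */
static uint64_t hole_pfn_to_device_pfn(const struct fake_pgmap *pgmap,
				       uint64_t hole_pfn)
{
	uint64_t hole_start_pfn = pgmap->hole_start >> PAGE_SHIFT;

	/* The pfn must lie inside the hole, otherwise the offset underflows. */
	assert(hole_pfn >= hole_start_pfn);

	return (pgmap->bar_phys_base >> PAGE_SHIFT) + (hole_pfn - hole_start_pfn);
}
```

With a hole starting at 4 GiB and device memory at 56 TiB (both made-up
values), the fourth page of the hole maps to the fourth page of the device
memory. No kernel linear mapping is involved, which matches the point above
that the CPU never touches this memory through those struct pages.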