From: Haggai Eran
To: "jgunthorpe@obsidianresearch.com", "christian.koenig@amd.com", "serguei.sagalovitch@amd.com"
CC: "linux-kernel@vger.kernel.org", "linux-rdma@vger.kernel.org", "linux-nvdimm@ml01.01.org", "Suravee.Suthikulpanit@amd.com", "Linux-media@vger.kernel.org", "John.Bridgman@amd.com", "Alexander.Deucher@amd.com", "dan.j.williams@intel.com", "logang@deltatee.com", "dri-devel@lists.freedesktop.org", "Max Gurtovoy", "linux-pci@vger.kernel.org", "Paul.Blinzer@amd.com", "Felix.Kuehling@amd.com", "ben.sander@amd.com"
Subject: Re: Enabling peer to peer device transactions for PCIe devices
Date: Mon, 28 Nov 2016 18:36:07 +0000
Message-ID: <1480358165.19407.26.camel@mellanox.com>
In-Reply-To: <314e9ef7-f60e-bf6b-d488-c585f1ea60e8@amd.com>

On Mon, 2016-11-28 at 09:48 -0500, Serguei Sagalovitch wrote:
> On 2016-11-27 09:02 AM, Haggai Eran wrote
> >
> > On PeerDirect, we have some kind of a middle-ground solution for
> > pinning GPU memory. We create a non-ODP MR pointing to VRAM but rely
> > on user-space and the GPU not to migrate it. If they do, the MR gets
> > destroyed immediately. This should work on legacy devices without
> > ODP support, and allows the system to safely terminate a process
> > that misbehaves. The downside of course is that it cannot
> > transparently migrate memory but I think for user-space RDMA doing
> > that transparently requires hardware support for paging, via
> > something like HMM.
> >
> > ...
> May be I am wrong but my understanding is that PeerDirect logic
> basically follow "RDMA register MR" logic

Yes.
The only difference from regular MRs is the invalidation process I
mentioned, and the fact that we get the addresses not from
get_user_pages but from a peer driver.

> so basically nothing prevent to "terminate"
> process for "MMU notifier" case when we are very low on memory
> not making it similar (not worse) then PeerDirect case.

I'm not sure I understand. I don't think any solution prevents
terminating an application. The paragraph above is just trying to
explain how a non-ODP device/MR can handle an invalidation.

> > > I'm hearing most people say ZONE_DEVICE is the way to handle this,
> > > which means the missing remaing piece for RDMA is some kind of DMA
> > > core support for p2p address translation..
> > Yes, this is definitely something we need. I think Will Davis's
> > patches are a good start.
> >
> > Another thing I think is that while HMM is good for user-space
> > applications, for kernel p2p use there is no need for that.
> About HMM: I do not think that in the current form HMM would fit in
> requirement for generic P2P transfer case. My understanding is that at
> the current stage HMM is good for "caching" system memory
> in device memory for fast GPU access but in RDMA MR non-ODP case
> it will not work because the location of memory should not be
> changed so memory should be allocated directly in PCIe memory.

The way I see it there are two ways to handle non-ODP MRs. Either you
prevent the GPU from migrating / reusing the MR's VRAM pages for as long
as the MR is alive (if I understand correctly you didn't like this
solution), or you allow the GPU to somehow notify the HCA to invalidate
the MR. If you do that, you can use mmu notifiers or HMM or something
else, but HMM provides a nice framework to facilitate that notification.
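
To make the two options concrete, here is a toy user-space model in
Python. It is only a sketch and not kernel code: PeerMemory,
MemoryRegion, get_pages, migrate and dma_target are made-up names for
illustration and do not correspond to any real verbs, PeerDirect, mmu
notifier or HMM interface. Passing no callback models the first option
(pages stay pinned while the MR lives); passing a callback models the
second (the peer driver notifies the MR, which then becomes unusable).

```python
from typing import Callable, Dict, List, Optional, Set

PAGE_SIZE = 0x1000


class PeerMemory:
    """Hypothetical stand-in for a peer (GPU) driver that owns VRAM pages."""

    def __init__(self) -> None:
        self._next_addr = PAGE_SIZE
        self._pinned: Set[int] = set()
        # page address -> callbacks to run when the page is moved or reused
        self._watchers: Dict[int, List[Callable[[], None]]] = {}

    def get_pages(self, npages: int,
                  on_invalidate: Optional[Callable[[], None]] = None) -> List[int]:
        """Hand out page addresses (the role get_user_pages plays for regular MRs)."""
        pages = [self._next_addr + i * PAGE_SIZE for i in range(npages)]
        self._next_addr += npages * PAGE_SIZE
        for page in pages:
            if on_invalidate is None:
                self._pinned.add(page)  # option 1: never migrate
            else:
                self._watchers.setdefault(page, []).append(on_invalidate)  # option 2
        return pages

    def migrate(self, page: int) -> bool:
        """Try to move or reuse a page; returns False if the page is pinned."""
        if page in self._pinned:
            return False
        for callback in self._watchers.pop(page, []):
            callback()  # tell every MR mapping this page that it is gone
        return True


class MemoryRegion:
    """Hypothetical non-ODP MR whose addresses come from a peer driver."""

    def __init__(self, peer: PeerMemory, npages: int, pinned: bool = False) -> None:
        self.valid = True
        callback = None if pinned else self.invalidate
        self.pages = peer.get_pages(npages, callback)

    def invalidate(self) -> None:
        """Invalidation path: the MR is torn down and cannot be used again."""
        self.valid = False
        self.pages = []

    def dma_target(self, index: int) -> int:
        """Address the HCA would use; fails once the MR has been invalidated."""
        if not self.valid:
            raise RuntimeError("MR was invalidated by the peer driver")
        return self.pages[index]
```

In this model a pinned MR makes migrate() return False, while an
unpinned one is invalidated as a whole as soon as any of its pages
moves, and later dma_target() calls raise.
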

> >
> > Using ZONE_DEVICE with or without something like DMA-BUF to pin and
> > unpin pages for the short duration as you wrote above could work
> > fine for kernel uses in which we can guarantee they are short.
> Potentially there is another issue related to pin/unpin. If memory
> could be used a lot of time then there is no sense to rebuild and
> program s/g tables each time if location of memory was not changed.

Is this about the kernel use or user-space? In user-space I think the MR
concept captures a long-lived s/g table so you don't need to rebuild it
(unless the mapping changes).

Haggai