Subject: Re: [LSF/MM TOPIC] Discuss least bad options for resolving longterm-GUP usage by RDMA
From: Doug Ledford
To: Dan Williams
Cc: Jason Gunthorpe, Dave Chinner, Christopher Lameter, Matthew Wilcox,
    Jan Kara, Ira Weiny, lsf-pc@lists.linux-foundation.org, linux-rdma,
    Linux MM, Linux Kernel Mailing List, John Hubbard, Jerome Glisse,
    Michal Hocko
Date: Thu, 07 Feb 2019 11:25:35 -0500
Organization: Red Hat, Inc.

I think I've finally wrapped my head around all of this. Let's see if I
have this right:

* People are using filesystem DAX to expose byte addressable persistent
  memory because putting a filesystem on the memory makes an easy way to
  organize the data in the memory and share it between various processes.
  It's worth noting that this is not the only way to share this memory,
  and arguably not even the best way, but it's what people are doing.
  However, to get byte level addressability on the remote side, we must
  create files on the server side, mmap those files, then give a handle
  to the memory region to the client side that the client then addresses
  on a byte by byte basis. This is because all of the normal kernel
  based device sharing mechanisms are block based and don't provide byte
  level addressability.

* People are asking for thin allocations, reflinks, deduplication,
  whatever else because persistent memory is still small in terms of
  size compared to the amount of data people want to put on it, so these
  techniques stretch its usefulness.

* Because there is no kernel level mechanism for sharing byte
  addressable memory, this only works with specific applications that
  have been written to create files on byte addressable memory, mmap
  them, then share them out via RDMA. I bring this up because in the
  video linked in this email, Oracle is gushing about how great this
  feature is. But it's important to understand that this only works
  because the Oracle processes themselves are the filesystem sharing
  entity. That means at other points in this conversation where we've
  talked about the need for forward progress, and non-ODP hardware, and
  the talk has come down to sending SIGKILL to a process in order to
  free memory reservations, I feel confident in saying that Oracle would
  *never* agree to this. If you kill an Oracle process to make forward
  progress, you are probably also killing the very process that needed
  you to make progress in the first place. I'm pretty confident that
  Oracle will have no problem whatsoever saying that ODP capable
  hardware is a hard requirement for using their software with DAX.

* So if Oracle is likely to demand ODP hardware, period, are there other
  scenarios that might be more accepting of a more limited FS on top of
  DAX that doesn't support reflinks and deduplication? I can think of a
  possible yes to that question rather easily. Message brokerage servers
  (amqp, qpid) have strict requirements about receiving a message and
  then making sure that it makes it once, and only once, to all
  subscribed receivers. A natural way of organizing this sort of thing
  is to create a persistent ring buffer for incoming messages, one per
  connecting client that is sending messages, plus a log file for each
  client you are sending messages back out to.
  Putting these files on persistent memory and then mapping the ring
  buffer to the clients, and writing your own transmission journals to
  the persistent memory, would allow the program to be very robust in
  the face of a program or system crash. This sort of usage would not
  require any thin allocations, reflinks, or other such features, and
  yet would still find the filesystem organization useful. Therefore I
  think the answer is yes, there are at least some use cases that would
  find a less featureful filesystem that works with persistent memory
  and RDMA but without ODP to be of value.

* Really though, as I said in my email to Tom Talpey, this entire
  situation is simply screaming that we are doing DAX networking wrong.
  We shouldn't be writing the networking code once in every single
  application that wants to do this. If we had a memory segment that we
  shared from server to client(s), and in that memory segment we
  implemented a clustered filesystem, then applications would simply
  mmap local files and be done with it. If the file needed to move, the
  kernel would update the mmap in the application, done. If you ask me,
  it is the attempt to do this the wrong way that is resulting in all
  this heartache.

That said, for today, my recommendation would be to require ODP hardware
for XFS filesystems with the DAX option, but allow ext2 filesystems to
be mounted with DAX on non-ODP hardware, and go in and modify the ext2
filesystem so that on DAX mounts, it disables hole punch and ftruncate
any time they would result in the forced removal of an established mmap.

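As a rough illustration of that last recommendation, here is a toy
user-space model, not ext2 or kernel code. It only shows the policy:
ftruncate and hole punch are refused whenever they would remove a range
that still has an established mapping. The names `PinnedFile`, `pin`,
`unpin` and `punch_hole` are hypothetical, and the choice of EBUSY as
the error is an assumption taken from the suggestion quoted further
down in this thread, not an existing return value of truncate.

```python
import errno


class PinnedFile:
    """Toy model of a DAX file whose byte ranges may be pinned (e.g. by RDMA)."""

    def __init__(self, size):
        self.size = size
        self.pins = []  # (offset, length) ranges with an established mapping

    def pin(self, offset, length):
        # Record a long-term mapping over [offset, offset + length).
        if offset < 0 or length <= 0 or offset + length > self.size:
            raise ValueError("pin range must lie inside the file")
        self.pins.append((offset, length))

    def unpin(self, offset, length):
        # Drop a previously recorded mapping.
        self.pins.remove((offset, length))

    def _overlaps_pin(self, start, end):
        # True if [start, end) intersects any pinned range.
        return any(off < end and start < off + length
                   for off, length in self.pins)

    def ftruncate(self, new_size):
        if new_size < 0:
            raise ValueError("size must be non-negative")
        # Shrinking removes [new_size, size); refuse if that hits a pin.
        if new_size < self.size and self._overlaps_pin(new_size, self.size):
            raise OSError(errno.EBUSY, "truncate would remove pinned pages")
        self.size = new_size

    def punch_hole(self, offset, length):
        if offset < 0 or length <= 0:
            raise ValueError("invalid hole range")
        # Refuse to deallocate anything underneath an established mapping.
        if self._overlaps_pin(offset, offset + length):
            raise OSError(errno.EBUSY, "hole punch would remove pinned pages")
        # A real filesystem would free the blocks here; the model does nothing.
```

In this model a caller that gets EBUSY can drop its mapping and retry,
which is the opposite of the revoke approach: the filesystem operation
fails instead of the mapping being torn down underneath the application.
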
On Wed, 2019-02-06 at 14:44 -0800, Dan Williams wrote:
> On Wed, Feb 6, 2019 at 2:25 PM Doug Ledford wrote:
> > On Wed, 2019-02-06 at 15:08 -0700, Jason Gunthorpe wrote:
> > > On Thu, Feb 07, 2019 at 08:03:56AM +1100, Dave Chinner wrote:
> > > > On Wed, Feb 06, 2019 at 07:16:21PM +0000, Christopher Lameter wrote:
> > > > > On Wed, 6 Feb 2019, Doug Ledford wrote:
> > > > >
> > > > > > > Most of the cases we want revoke for are things like truncate().
> > > > > > > Shouldn't happen with a sane system, but we're trying to avoid users
> > > > > > > doing awful things like being able to DMA to pages that are now part of
> > > > > > > a different file.
> > > > > >
> > > > > > Why is the solution revoke then? Is there something besides truncate
> > > > > > that we have to worry about? I ask because EBUSY is not currently
> > > > > > listed as a return value of truncate, so extending the API to include
> > > > > > EBUSY to mean "this file has pinned pages that can not be freed" is not
> > > > > > (or should not be) totally out of the question.
> > > > > >
> > > > > > Admittedly, I'm coming in late to this conversation, but did I miss the
> > > > > > portion where that alternative was ruled out?
> > > > >
> > > > > Coming in late here too but isn't the only DAX case that we are concerned
> > > > > about where there was an mmap with the O_DAX option to do direct write
> > > > > though? If we only allow this use case then we may not have to worry about
> > > > > long term GUP because DAX mapped files will stay in the physical location
> > > > > regardless.
> > > >
> > > > No, that is not guaranteed. Soon as we have reflink support on XFS,
> > > > writes will physically move the data to a new physical location.
> > > > This is non-negotiable, and cannot be blocked forever by a gup
> > > > pin.
> > > >
> > > > IOWs, DAX on RDMA requires a) page fault capable hardware so that
> > > > the filesystem can move data physically on write access, and b)
> > > > revokable file leases so that the filesystem can kick userspace out
> > > > of the way when it needs to.
> > >
> > > Why do we need both? You want to have leases for normal CPU mmaps too?
> > >
> > > > Truncate is a red herring. It's definitely a case for revokable
> > > > leases, but it's the rare case rather than the one we actually care
> > > > about. We really care about making copy-on-write capable filesystems like
> > > > XFS work with DAX (we've got people asking for it to be supported
> > > > yesterday!), and that means DAX+RDMA needs to work with storage that
> > > > can change physical location at any time.
> > >
> > > Then we must continue to ban longterm pin with DAX..
> > >
> > > Nobody is going to want to deploy a system where revoke can happen at
> > > any time and if you don't respond fast enough your system either locks
> > > with some kind of FS meltdown or your process gets SIGKILL.
> > >
> > > I don't really see a reason to invest so much design work into
> > > something that isn't production worthy.
> > >
> > > It *almost* made sense with ftruncate, because you could architect to
> > > avoid ftruncate.. But just any FS op might reallocate? Naw.
> > >
> > > Dave, you said the FS is responsible to arbitrate access to the
> > > physical pages..
> > >
> > > Is it possible to have a filesystem for DAX that is more suited to
> > > this environment? Ie designed to not require block reallocation (no
> > > COW, no reflinks, different approach to ftruncate, etc)
> >
> > Can someone give me a real world scenario that someone is *actually*
> > asking for with this?
>
> I'll point to this example. At the 6:35 mark Kodi talks about the
> Oracle use case for DAX + RDMA.
>
> https://youtu.be/ywKPPIE8JfQ?t=395
>
> Currently the only way to get this to work is to use ODP capable
> hardware, or Device-DAX. Device-DAX is a facility to map persistent
> memory statically through device-file. It's great for statically
> allocated use cases, but loses all the nice things (provisioning,
> permissions, naming) that a filesystem gives you. This debate is what
> to do about non-ODP capable hardware and Filesystem-DAX facility. The
> current answer is "no RDMA for you".
>
> > Are DAX users demanding xfs, or is it just the
> > filesystem of convenience?
>
> xfs is the only Linux filesystem that supports DAX and reflink.
>
> > Do they need to stick with xfs?
>
> Can you clarify the motivation for that question? This problem exists
> for any filesystem that implements an mmap where the physical
> page backing the mapping is identical to the physical storage location
> for the file data. I don't see it as an xfs specific problem. Rather,
> xfs is taking the lead in this space because it has already deployed
> and demonstrated that leases work for the pnfs4 block-server case, so
> it seems logical to attempt to extend that case for non-ODP-RDMA.
>
> > Are they
> > really trying to do COW backed mappings for the RDMA targets? Or do
> > they want a COW backed FS but are perfectly happy if the specific RDMA
> > targets are *not* COW and are statically allocated?
>
> I would expect the COW to be broken at registration time. Only ODP
> could possibly support reflink + RDMA. So I think this devolves the
> problem back to just the "what to do about truncate/punch-hole"
> problem in the specific case of non-ODP hardware combined with the
> Filesystem-DAX facility.

-- 
Doug Ledford
    GPG KeyID: B826A3330E572FDD
    Key fingerprint = AE6B 1BDA 122B 23B4 265B 1274 B826 A333 0E57 2FDD