From: Dan Williams
Date: Mon, 11 Feb 2019 09:22:58 -0800
Subject: Re: [LSF/MM TOPIC] Discuss least bad options for resolving longterm-GUP usage by RDMA
To: Jan Kara
Cc: Dave Chinner, Christopher Lameter, Doug Ledford, Jason Gunthorpe, Matthew Wilcox, Ira Weiny, lsf-pc@lists.linux-foundation.org, linux-rdma, Linux MM, Linux Kernel Mailing List, John Hubbard, Jerome Glisse, Michal Hocko
In-Reply-To: <20190211102402.GF19029@quack2.suse.cz>

On Mon, Feb 11, 2019 at 2:24 AM Jan Kara wrote:
>
> On Fri 08-02-19 12:50:37, Dan Williams wrote:
> > On Fri, Feb 8, 2019 at 3:11 AM Jan Kara wrote:
> > >
> > > On Fri 08-02-19 15:43:02, Dave Chinner wrote:
> > > > On Thu, Feb 07, 2019 at 04:55:37PM +0000, Christopher Lameter wrote:
> > > > > One approach that may be a clean way to solve this:
> > > > > 3. Filesystems that allow bypass of the page cache (like XFS / DAX) will
> > > > > provide the virtual mapping when the PIN is done and DO NO OPERATIONS
> > > > > on the longterm pinned range until the long term pin is removed.
> > > >
> > > > So, ummm, how do we do block allocation then, which is done on
> > > > demand during writes?
> > > >
> > > > IOWs, this requires the application to set up the file in the
> > > > correct state for the filesystem to lock it down so somebody else
> > > > can write to it. That means the file can't be sparse, it can't be
> > > > preallocated (i.e. can't contain unwritten extents), it must have zeroes
> > > > written to its full size before being shared because otherwise it
> > > > exposes stale data to the remote client (secure sites are going to
> > > > love that!), they can't be extended, etc.
> > > >
> > > > IOWs, once the file is prepped and leased out for RDMA, it becomes
> > > > immutable for the purposes of local access.
> > > >
> > > > Which, essentially we can already do. Prep the file, map it
> > > > read/write, mark it immutable, then pin it via the longterm gup
> > > > interface which can do the necessary checks.
> > >
> > > Hum, and what will you do if the immutable file that is target for RDMA
> > > will be a source of reflink? That seems to be currently allowed for
> > > immutable files but RDMA store would be effectively corrupting the data of
> > > the target inode. But we could treat it similarly as swapfiles - those also
> > > have to deal with writes to blocks beyond filesystem control. In fact the
> > > similarity seems to be quite large there. What do you think?
> >
> > This sounds so familiar...
> >
> >     https://lwn.net/Articles/726481/
> >
> > I'm not opposed to trying again, but leases was what crawled out of
> > the smoking crater when this last proposal was nuked.
>
> Umm, don't think this is that similar to daxctl() discussion. We are not
> speaking about providing any new userspace API for this.

I thought explicit userspace API was one of the outcomes, i.e. that we
can't depend on this behavior being an implicit side effect of a page
pin?

> Also I think the
> situation about leases has somewhat cleared up with this discussion - ODP
> hardware does not need leases since it can use MMU notifiers, for non-ODP
> hardware it is difficult to handle leases as such hardware has only one big
> kill-everything call and using that would effectively mean a lot of work on
> the userspace side to resetup everything to make things useful if workable
> at all.
>
> So my proposal would be:
>
> 1) ODP hardware uses gup_fast() like direct IO and uses MMU notifiers to do
> its teardown when fs needs it.
>
> 2) Hardware not capable of tearing down pins from MMU notifiers will have
> to use gup_longterm() (we may actually rename it to a more suitable name).
> FS may just refuse such calls (for normal page cache backed file, it will
> just return success but for DAX file it will do sanity checks whether the
> file is fully allocated etc. like we currently do for swapfiles) but if
> gup_longterm() returns success, it will provide the same guarantees as for
> swapfiles. So the only thing that we need is some call from gup_longterm()
> to a filesystem callback to tell it - this file is going to be used by a
> third party as an IO buffer, don't touch it. And we can (and should)
> probably refactor the handling to be shared between swapfiles and
> gup_longterm().

Yes, let's pursue this. At the risk of "arguing past 'yes'" this is a
solution I thought we dax folks walked away from in the original
MAP_DIRECT discussion [1]. Here is where leases were the response to
MAP_DIRECT [2]. ...and here is where we had tame discussions about
implications of notifying memory-registrations of lease break events
[3].

I honestly don't like the idea that random subsystems can pin down
file blocks as a side effect of gup on the result of mmap. Recall that
it's not just RDMA that wants this guarantee.
It seems safer to have the file be in an explicit
block-allocation-immutable-mode so that the fallocate man page can
describe this error case. Otherwise how would you describe the
scenarios under which FALLOC_FL_PUNCH_HOLE fails?

[1]: https://lwn.net/Articles/736333/
[2]: https://www.mail-archive.com/linux-nvdimm@lists.01.org/msg06437.html
[3]: https://www.mail-archive.com/linux-nvdimm@lists.01.org/msg06499.html
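
For readers following the thread, the gup_longterm() proposal quoted above
and the FALLOC_FL_PUNCH_HOLE question can be sketched as a small toy model.
This is not kernel code: ToyFile, Backing, put_longterm() and punch_hole()
are hypothetical names, gup_longterm() here is only a stand-in for the
proposed filesystem callback, and both errno values are assumptions
(ETXTBSY is picked on the assumption that this is what fallocate() returns
for an active swapfile today).

```python
import errno
from dataclasses import dataclass
from enum import Enum, auto


class Backing(Enum):
    PAGE_CACHE = auto()
    DAX = auto()


@dataclass
class ToyFile:
    backing: Backing
    fully_allocated: bool = True         # False means the file is sparse
    has_unwritten_extents: bool = False  # preallocated but never written
    blocks_locked: int = 0               # active long-term pins on a DAX file


def gup_longterm(f):
    """Hypothetical filesystem callback; returns 0 or a negative errno."""
    if f.backing is Backing.PAGE_CACHE:
        return 0  # page cache case: "it will just return success"
    # DAX case: swapfile-like sanity checks before locking the blocks down.
    if not f.fully_allocated or f.has_unwritten_extents:
        return -errno.EINVAL  # assumed errno for a failed sanity check
    f.blocks_locked += 1
    return 0


def put_longterm(f):
    """Drop one long-term pin (no-op if nothing is locked)."""
    if f.blocks_locked > 0:
        f.blocks_locked -= 1


def punch_hole(f):
    """Toy FALLOC_FL_PUNCH_HOLE: refused while the blocks are locked."""
    if f.blocks_locked > 0:
        return -errno.ETXTBSY  # assumed, mirroring the swapfile case
    f.fully_allocated = False
    return 0


if __name__ == "__main__":
    dax = ToyFile(Backing.DAX)
    print("pin:", gup_longterm(dax))
    print("punch while pinned:", punch_hole(dax))
    put_longterm(dax)
    print("punch after unpin:", punch_hole(dax))
```

In this model the result of punch_hole() depends on blocks_locked, state
that only changes as a side effect of some other subsystem calling
gup_longterm(). That hidden dependency is the case the fallocate man page
would have to describe, and it is what an explicit
block-allocation-immutable mode would make visible to userspace.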