From: ira.weiny@intel.com
To: lsf-pc@lists.linux-foundation.org
Cc: linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, Dan Williams, Jan Kara, Jérôme Glisse, John Hubbard, Michal Hocko, Ira Weiny
Subject: [RFC PATCH 00/10] RDMA/FS DAX "LONGTERM" lease proposal
Date: Sun, 28 Apr 2019 21:53:49 -0700
Message-Id: <20190429045359.8923-1-ira.weiny@intel.com>

From: Ira Weiny

In order to
support RDMA to file system pages[*] without On Demand Paging, a number of
things need to be done.

1) GUP "longterm" users need to inform the other subsystems that they have
   taken a pin on a page which may remain pinned for a very "long time".[1]

2) Any page which is "controlled" by a file system needs special handling.
   The details of the handling depend on whether the page is page cache
   backed or not.

   2a) A page cache backed page which has been pinned by GUP longterm can use
       a bounce buffer to allow the file system to write back snapshots of
       the page.  This is handled by the FS recognizing the GUP longterm pin
       and making a copy of the page to be written back.
       NOTE: this patch set does not address this path.

   2b) A FS "controlled" page which is not page cache backed is either easier
       or harder to deal with, depending on the operation the file system is
       trying to do.

       2ba) [Hard case] If the FS operation _is_ a truncate or hole punch,
            the FS can no longer use the pages in question until the pin has
            been removed.  This patch set presents a solution to this by
            introducing some reasonable restrictions on user space
            applications.

       2bb) [Easy case] If the FS operation is _not_ a truncate or hole punch
            then nothing need be done.  Data is read or written directly to
            the page.  This is an easy case which would currently work if not
            for GUP longterm pins being disabled.  Therefore this patch set
            need not change access to the file data but does allow for GUP
            pins after 2ba above is dealt with.

The architecture of this series is to introduce an F_LONGTERM file lease
mechanism which applications use in one of 2 ways.

1) Applications which may require hole punch or truncation operations on
   files they intend to mmap and pin for long periods can take an F_LONGTERM
   lease on the file.

   When a file system operation needs truncate access to this file the lease
   is broken and the application gets a SIGIO.
   Upon catching SIGIO the application can un-pin (note munmap is not
   required) the memory associated with that file.  At that point the
   truncating user can proceed.  Re-pinning the memory is left entirely up to
   the application.  In some cases a new mmap will be required (as with a
   truncation) or a SIGBUS would be experienced anyway.

   Failure to respond to a SIGIO lease break within the system configured
   lease-break-time will result in a SIGBUS.

   WIP: SIGBUS could be caught and ignored... what danger does this
   present... should this be SIGKILL or should we wait another
   lease-break-time and then send SIGKILL?

2) Applications which don't require hole punch or truncate operations can use
   pinning without taking an F_LONGTERM lease.  However, such applications
   are expected to have considered the access to the files they are mmapping
   and to control them in such a way that other users on the system can't
   truncate a file and cause a DoS on the application.  These applications
   will be sent a SIGBUS if someone attempts to truncate or hole punch a
   file.

   ALTERNATIVE WIP patch in series: If the F_LONGTERM lease is not taken,
   fail the GUP.

The patches compile and have been tested to a first degree.

NOTES:
Can we deal with the lease/pin at the VFS layer?  Or for all FSs?
LONGTERM seems like a bad name.  Suggestions?

[1] The definition of "long time" is debatable, but it has been established
    that RDMA's use of pages minutes or hours after the pin is the extreme
    case which makes this problem most severe.

[*] Not all file system pages are page cache pages.  FS DAX bypasses the page
    cache.
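
For illustration, the user space side of use case 1 (take a lease, get SIGIO
on a lease break, un-pin) might look roughly like the sketch below.  This is
not code from the series.  It uses the existing fcntl(2) lease interface
(F_SETLEASE with F_RDLCK/F_WRLCK/F_UNLCK) so that it builds against current
headers; the assumption is that with these patches applied the lease type
passed in would be F_LONGTERM.  The helper names and the unpin callback are
hypothetical, and whether the application must also drop the lease explicitly
after un-pinning is assumed here by analogy with today's leases.

```c
#define _GNU_SOURCE /* F_SETLEASE is a Linux extension to fcntl() */
#include <assert.h>
#include <fcntl.h>
#include <signal.h>
#include <string.h>

/* Fallback (Linux value) in case <fcntl.h> was included earlier without
 * _GNU_SOURCE. */
#ifndef F_SETLEASE
#define F_SETLEASE 1024
#endif

/* Set from the SIGIO handler, consumed by the application's main loop. */
static volatile sig_atomic_t lease_break_pending;

static void on_lease_break(int signo)
{
	(void)signo;
	lease_break_pending = 1; /* only async-signal-safe work here */
}

/* Install the SIGIO handler; returns 0 on success, -1 on error. */
static int install_lease_break_handler(void)
{
	struct sigaction sa;

	memset(&sa, 0, sizeof(sa));
	sa.sa_handler = on_lease_break;
	sigemptyset(&sa.sa_mask);
	return sigaction(SIGIO, &sa, NULL);
}

/*
 * Take a lease on fd; returns 0 on success, -1 on error.
 * ASSUMPTION: with the series applied, lease_type would be F_LONGTERM.
 * Only the existing lease types are accepted here.
 */
static int take_lease(int fd, int lease_type)
{
	assert(lease_type == F_RDLCK || lease_type == F_WRLCK);
	return fcntl(fd, F_SETLEASE, lease_type);
}

/*
 * Poll from the main loop.  Returns 0 if no break is pending, 1 if the
 * break was handled, -1 if releasing the lease failed.  unpin is a
 * hypothetical application callback which drops the long term pins.
 */
static int handle_lease_break(int fd, void (*unpin)(void))
{
	if (!lease_break_pending)
		return 0;
	lease_break_pending = 0;
	if (unpin)
		unpin();
	/* ASSUMPTION: the lease is given up as with today's leases. */
	return fcntl(fd, F_SETLEASE, F_UNLCK) == -1 ? -1 : 1;
}
```

An application would call install_lease_break_handler() once, take_lease()
after opening the file and before pinning, and handle_lease_break() from its
event loop.  The signal handler only sets a flag so that the un-pin itself
runs outside signal context, and it has to finish within lease-break-time.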
Ira Weiny (10):
  fs/locks: Add trace_leases_conflict
  fs/locks: Introduce FL_LONGTERM file lease
  mm/gup: Pass flags down to __gup_device_huge* calls
  WIP: mm/gup: Ensure F_LONGTERM lease is held on GUP pages
  mm/gup: Take FL_LONGTERM lease if not set by user
  fs/locks: Add longterm lease traces
  fs/dax: Create function dax_mapping_is_dax()
  mm/gup: fs: Send SIGBUS on truncate of active file
  fs/locks: Add tracepoint for SIGBUS on LONGTERM expiration
  mm/gup: Remove FOLL_LONGTERM DAX exclusion

 fs/dax.c                         |  23 ++-
 fs/ext4/inode.c                  |   4 +
 fs/locks.c                       | 301 +++++++++++++++++++++++++++++--
 fs/xfs/xfs_file.c                |   4 +
 include/linux/dax.h              |   6 +
 include/linux/fs.h               |  18 ++
 include/linux/mm.h               |   2 +
 include/trace/events/filelock.h  |  74 +++++++-
 include/uapi/asm-generic/fcntl.h |   2 +
 mm/gup.c                         | 107 ++++-------
 mm/huge_memory.c                 |  18 ++
 11 files changed, 468 insertions(+), 91 deletions(-)

-- 
2.20.1