From: john.hubbard@gmail.com
To: Andrew Morton, linux-mm@kvack.org
Cc: Al Viro, Christian Benvenuti, Christoph Hellwig, Christopher Lameter,
    Dan Williams, Dave Chinner, Dennis Dalessandro, Doug Ledford, Ira Weiny,
    Jan Kara, Jason Gunthorpe, Jerome Glisse, Matthew Wilcox, Michal Hocko,
    Mike Rapoport, Mike Marciniszyn, Ralph Campbell, Tom Talpey, LKML,
    linux-fsdevel@vger.kernel.org, John Hubbard
Subject: [PATCH v3 0/1] mm: introduce put_user_page*(), placeholder versions
Date: Wed, 6 Mar 2019 15:54:54 -0800
Message-Id: <20190306235455.26348-1-jhubbard@nvidia.com>

From: John Hubbard

Hi Andrew and all,

Can we please apply this (destined for 5.2) once the time is right?
(I see that -mm just got merged into the main tree today.)

We seem to have pretty solid consensus on the concept and details of the
put_user_pages() approach. Or at least, if we don't, someone please speak
up now.
Christopher Lameter, especially, since you had some concerns recently.

Therefore, here is the first patch--only. This allows us to begin
converting the get_user_pages() call sites to use put_user_page(), instead
of put_page(). This is in order to implement tracking of get_user_pages()
pages.

Normally I'd include a user of this code, but in this case, I think we have
examples of how it will work in the RFC and related discussions [1]. What
matters more at this point is unblocking the ability to start fixing up
various subsystems, through git trees other than linux-mm. For example, the
Infiniband example conversion now needs to pick up some prerequisite
patches via the RDMA tree. It seems likely that other call sites may need
similar attention, and so having put_user_pages() available would really
make this go more quickly.

Previous cover letter follows:
==============================

A discussion of the overall problem is below.

As mentioned in patch 0001, the steps to fix the problem are:

1) Provide put_user_page*() routines, intended to be used
   for releasing pages that were pinned via get_user_pages*().

2) Convert all of the call sites for get_user_pages*(), to
   invoke put_user_page*(), instead of put_page(). This involves dozens of
   call sites, and will take some time.

3) After (2) is complete, use get_user_pages*() and put_user_page*() to
   implement tracking of these pages. This tracking will be separate from
   the existing struct page refcounting.

4) Use the tracking and identification of these pages, to implement
   special handling (especially in writeback paths) when the pages are
   backed by a filesystem.

Overview
========

Some kernel components (file systems, device drivers) need to access
memory that is specified via process virtual address. For a long time, the
API to achieve that was get_user_pages ("GUP") and its variations.

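As a rough illustration of steps (1) and (2) in the list above, here is a
minimal user-space sketch of what a converted call site might look like.
It is a model only: struct page is a toy stand-in, get_user_pages_model()
and demo_release_buffer() are hypothetical names invented for this example,
and put_user_pages_dirty() is assumed to be one of the put_user_page*()
variants. The placeholder routines simply forward to put_page(), which is
what lets call sites be converted before any tracking exists.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for the kernel's struct page; NOT the real definition. */
struct page {
	int refcount;	/* models the struct page reference count */
	bool dirty;	/* models the page dirty flag */
};

/* Simplified models of existing helpers. */
void put_page(struct page *page)
{
	assert(page->refcount > 0);
	page->refcount--;
}

void set_page_dirty(struct page *page)
{
	page->dirty = true;
}

/* Hypothetical model of get_user_pages(): one reference per page. */
long get_user_pages_model(struct page *src, size_t npages,
			  struct page **pages)
{
	for (size_t i = 0; i < npages; i++) {
		src[i].refcount++;
		pages[i] = &src[i];
	}
	return (long)npages;
}

/* Step (1): placeholder release routines that just wrap put_page(). */
void put_user_page(struct page *page)
{
	put_page(page);
}

void put_user_pages(struct page **pages, size_t npages)
{
	for (size_t i = 0; i < npages; i++)
		put_user_page(pages[i]);
}

/* Assumed variant: mark each page dirty, then release it. */
void put_user_pages_dirty(struct page **pages, size_t npages)
{
	for (size_t i = 0; i < npages; i++) {
		if (!pages[i]->dirty)
			set_page_dirty(pages[i]);
		put_user_page(pages[i]);
	}
}

/*
 * Step (2): a converted call site in a hypothetical driver. Before the
 * conversion this would have been an open-coded loop calling
 * set_page_dirty() and put_page() on every page.
 */
void demo_release_buffer(struct page **pages, size_t npages, bool wrote)
{
	if (wrote)
		put_user_pages_dirty(pages, npages);
	else
		put_user_pages(pages, npages);
}
```

In this model nothing changes functionally for the caller. The point is
only that every release of a GUP'd page goes through one set of routines,
which is where the tracking from step (3) could later be added.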
However, GUP has critical limitations that have been overlooked; in
particular, GUP does not interact correctly with filesystems in all
situations. That means that file-backed memory + GUP is a recipe for
potential problems, some of which have already occurred in the field.

GUP was first introduced for Direct IO (O_DIRECT), allowing filesystem code
to get the struct page behind a virtual address and to let storage hardware
perform a direct copy to or from that page. This is a short-lived access
pattern, and as such, the window for a concurrent writeback of a GUP'd page
was small enough that there were not (we think) any reported problems.
Also, userspace was expected to understand and accept that Direct IO was
not synchronized with memory-mapped access to that data, nor with any
process address space changes such as munmap(), mremap(), etc.

Over the years, more GUP uses have appeared (virtualization, device
drivers, RDMA) that can keep the pages they get via GUP for a long period
of time (seconds, minutes, hours, days, ...). This long-term pinning makes
an underlying design problem more obvious.

In fact, there are a number of key problems inherent to GUP:

Interactions with file systems
==============================

File systems expect to be able to write back data, both to reclaim pages,
and for data integrity. Allowing other hardware (NICs, GPUs, etc) to gain
write access to the file memory pages means that such hardware can dirty
the pages, without the filesystem being aware. This can, in some cases
(depending on filesystem, filesystem options, block device, block device
options, and other variables), lead to data corruption, and also to kernel
bugs of the form:

    kernel BUG at /build/linux-fQ94TU/linux-4.4.0/fs/ext4/inode.c:1899!
    backtrace:
        ext4_writepage
        __writepage
        write_cache_pages
        ext4_writepages
        do_writepages
        __writeback_single_inode
        writeback_sb_inodes
        __writeback_inodes_wb
        wb_writeback
        wb_workfn
        process_one_work
        worker_thread
        kthread
        ret_from_fork

...which is due to the file system asserting that there are still buffer
heads attached:

    ({                                                      \
            BUG_ON(!PagePrivate(page));                     \
            ((struct buffer_head *)page_private(page));     \
    })

Dave Chinner's description of this is very clear:

    "The fundamental issue is that ->page_mkwrite must be called on every
    write access to a clean file backed page, not just the first one.
    How long the GUP reference lasts is irrelevant, if the page is clean
    and you need to dirty it, you must call ->page_mkwrite before it is
    marked writeable and dirtied. Every. Time."

This is just one symptom of the larger design problem: filesystems do not
actually support get_user_pages() being called on their pages, and letting
hardware write directly to those pages--even though that pattern has been
going on since about 2005 or so.

Long term GUP
=============

Long term GUP is an issue when FOLL_WRITE is specified to GUP (so, a
writeable mapping is created), and the pages are file-backed. That can lead
to filesystem corruption. What happens is that when a file-backed page is
being written back, it is first mapped read-only in all of the CPU page
tables; the file system then assumes that nobody can write to the page, and
that the page content is therefore stable. Unfortunately, the GUP callers
generally do not monitor changes to the CPU page tables; they instead
assume that the following pattern is safe (it's not):

    get_user_pages()

    Hardware can keep a reference to those pages for a very long time,
    and write to them at any time. Because "hardware" here means "devices
    that are not a CPU", this activity occurs without any interaction
    with the kernel's file system code.

    for each page
        set_page_dirty
        put_page()

In fact, the GUP documentation even recommends that pattern.
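The pattern above can be walked through with a small user-space model. This
is only a sketch under stated assumptions: struct toy_page, cpu_write(),
device_write(), writeback() and unsafe_long_term_pin() are all hypothetical
names, and the single mkwrite_calls counter stands in for the filesystem
being notified via ->page_mkwrite. It is meant to show the ordering problem,
not how the kernel implements any of this.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define TOY_PAGE_SIZE 8

/* Toy model of a file-backed page; NOT the kernel's struct page. */
struct toy_page {
	unsigned char data[TOY_PAGE_SIZE];
	int refcount;
	bool dirty;
	bool pte_writable;	/* CPU mapping currently allows writes */
	int mkwrite_calls;	/* times the filesystem was told of a write */
};

/* CPU store: a read-only pte faults, so the filesystem is notified. */
void cpu_write(struct toy_page *p, size_t off, unsigned char v)
{
	if (!p->pte_writable) {
		p->mkwrite_calls++;	/* models ->page_mkwrite */
		p->pte_writable = true;
	}
	p->data[off] = v;
	p->dirty = true;
}

/* Device DMA: does not go through the CPU page tables at all. */
void device_write(struct toy_page *p, size_t off, unsigned char v)
{
	p->data[off] = v;
}

/* Writeback: write-protect, copy to "disk", mark the page clean. */
void writeback(struct toy_page *p, unsigned char *disk)
{
	p->pte_writable = false;	/* models the read-only remapping */
	memcpy(disk, p->data, TOY_PAGE_SIZE);
	p->dirty = false;
}

/*
 * The pattern from the text. Returns true if the page content diverged
 * from what was written back without the filesystem ever being notified.
 */
bool unsafe_long_term_pin(struct toy_page *p, unsigned char *disk)
{
	bool silent;

	p->refcount++;			/* get_user_pages(), FOLL_WRITE */
	writeback(p, disk);		/* fs now believes content is stable */
	device_write(p, 0, 0xAB);	/* hardware writes whenever it likes */
	silent = p->mkwrite_calls == 0 &&
		 memcmp(disk, p->data, TOY_PAGE_SIZE) != 0;
	p->dirty = true;		/* set_page_dirty(), far too late */
	assert(p->refcount > 0);
	p->refcount--;			/* put_page() */
	return silent;
}
```

In the model, cpu_write() after writeback would bump mkwrite_calls, while
device_write() never does. That difference is the gap Dave Chinner's quote
describes: the page is clean, it gets dirtied, and nothing called
->page_mkwrite first.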

Anyway, the file system assumes that the page is stable (nothing is writing
to the page), and that is a problem: stable page content is necessary for
many filesystem actions during writeback, such as checksum, encryption,
RAID striping, etc. Furthermore, filesystem features like COW (copy on
write) or snapshot also rely on being able to use a new page as memory for
that memory range inside the file.

Corruption during write back is clearly possible here. To solve that, one
idea is to identify pages that have active GUP, so that we can use a bounce
page to write stable data to the filesystem. The filesystem would work
on the bounce page, while any of the active GUP might write to the
original page. This would avoid the stable page violation problem, but note
that it is only part of the overall solution, because other problems
remain.

Other filesystem features that need to replace the page with a new one can
be inhibited for pages that are GUP-pinned. This will, however, alter and
limit some of those filesystem features. The only fix for that would be to
require GUP users to monitor and respond to CPU page table updates.
Subsystems such as ODP and HMM do this, for example. This aspect of the
problem is still under discussion.

Direct IO
=========

Direct IO can cause corruption, if userspace does Direct-IO that writes to
a range of virtual addresses that are mmap'd to a file. The pages written
to are file-backed pages that can be under write back, while the Direct IO
is taking place. Here, Direct IO races with a write back: it calls
GUP before page_mkclean() has replaced the CPU pte with a read-only entry.
The race window is pretty small, which is probably why years have gone by
before we noticed this problem: Direct IO is generally very quick, and
tends to finish up before the filesystem gets around to doing anything with
the page contents. However, it's still a real problem.
The solution is to never let GUP return pages that are under write back,
but instead, force GUP to take a write fault on those pages. That way, GUP
will properly synchronize with the active write back. This does not change
the required GUP behavior, it just avoids that race.

Changes since v2:

* Reduced down to just one patch, in order to avoid dependencies between
  subsystem git repos.

* Rebased to latest linux.git: commit afe6fe7036c6 ("Merge tag
  'armsoc-late' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc")

* Added Ira's review tag, based on
  https://lore.kernel.org/lkml/20190215002312.GC7512@iweiny-DESK2.sc.intel.com/

[1] https://lore.kernel.org/r/20190208075649.3025-3-jhubbard@nvidia.com
    (RFC v2: mm: gup/dma tracking)

Cc: Christian Benvenuti
Cc: Christoph Hellwig
Cc: Christopher Lameter
Cc: Dan Williams
Cc: Dave Chinner
Cc: Dennis Dalessandro
Cc: Doug Ledford
Cc: Ira Weiny
Cc: Jan Kara
Cc: Jason Gunthorpe
Cc: Jérôme Glisse
Cc: Matthew Wilcox
Cc: Michal Hocko
Cc: Mike Rapoport
Cc: Mike Marciniszyn
Cc: Ralph Campbell
Cc: Tom Talpey

John Hubbard (1):
  mm: introduce put_user_page*(), placeholder versions

 include/linux/mm.h | 24 ++++++++++++++
 mm/swap.c          | 82 ++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 106 insertions(+)

--
2.21.0