Date: Fri, 8 Mar 2019 03:08:40 +0000
From: Christopher Lameter
To: john.hubbard@gmail.com
cc: Andrew Morton, linux-mm@kvack.org, Al Viro, Christian Benvenuti, Christoph Hellwig, Dan Williams, Dave Chinner, Dennis Dalessandro, Doug Ledford, Ira Weiny, Jan Kara, Jason Gunthorpe, Jerome Glisse, Matthew Wilcox, Michal Hocko, Mike Rapoport, Mike Marciniszyn, Ralph Campbell, Tom Talpey, LKML, linux-fsdevel@vger.kernel.org, John Hubbard
Subject: Re: [PATCH v3 0/1] mm: introduce put_user_page*(), placeholder versions
In-Reply-To: <20190306235455.26348-1-jhubbard@nvidia.com>
Message-ID: <010001695b4631cd-f4b8fcbf-a760-4267-afce-fb7969e3ff87-000000@email.amazonses.com>
References: <20190306235455.26348-1-jhubbard@nvidia.com>

On Wed, 6 Mar 2019, john.hubbard@gmail.com wrote:

> GUP was first introduced for Direct IO (O_DIRECT), allowing filesystem code
> to get the struct page behind a virtual address and to let storage hardware
> perform a direct copy to or from that page. This is a short-lived access
> pattern, and as such, the window for a concurrent writeback of GUP'd page
> was small enough that there were not (we think) any reported problems.
> Also, userspace was expected to understand and accept that Direct IO was
> not synchronized with memory-mapped access to that data, nor with any
> process address space changes such as munmap(), mremap(), etc.

It would be good if that understanding were enforced somehow, given the
problems that we see.

> Interactions with file systems
> ==============================
>
> File systems expect to be able to write back data, both to reclaim pages,

Regular filesystems do that. But usually those are not used with GUP
pinning AFAICT.

> and for data integrity. Allowing other hardware (NICs, GPUs, etc) to gain
> write access to the file memory pages means that such hardware can dirty
> the pages, without the filesystem being aware.
> This can, in some cases
> (depending on filesystem, filesystem options, block device, block device
> options, and other variables), lead to data corruption, and also to kernel
> bugs of the form:

> Long term GUP
> =============
>
> Long term GUP is an issue when FOLL_WRITE is specified to GUP (so, a
> writeable mapping is created), and the pages are file-backed. That can lead
> to filesystem corruption. What happens is that when a file-backed page is
> being written back, it is first mapped read-only in all of the CPU page
> tables; the file system then assumes that nobody can write to the page, and
> that the page content is therefore stable. Unfortunately, the GUP callers
> generally do not monitor changes to the CPU pages tables; they instead
> assume that the following pattern is safe (it's not):
>
>     get_user_pages()
>
>     Hardware can keep a reference to those pages for a very long time,
>     and write to it at any time. Because "hardware" here means "devices
>     that are not a CPU", this activity occurs without any interaction
>     with the kernel's file system code.
>
>     for each page
>         set_page_dirty
>         put_page()
>
> In fact, the GUP documentation even recommends that pattern.

Isn't that pattern safe for anonymous memory and memory filesystems like
hugetlbfs etc? That is the common use case.

> Anyway, the file system assumes that the page is stable (nothing is writing
> to the page), and that is a problem: stable page content is necessary for
> many filesystem actions during writeback, such as checksum, encryption,
> RAID striping, etc. Furthermore, filesystem features like COW (copy on
> write) or snapshot also rely on being able to use a new page for as memory
> for that memory range inside the file.
>
> Corruption during write back is clearly possible here. To solve that, one
> idea is to identify pages that have active GUP, so that we can use a bounce
> page to write stable data to the filesystem.
> The filesystem would work
> on the bounce page, while any of the active GUP might write to the
> original page. This would avoid the stable page violation problem, but note
> that it is only part of the overall solution, because other problems
> remain.

Yes, you now have the filesystem as well as the GUP pinner claiming
authority over the contents of a single memory segment. Maybe better not
to allow that?

> Direct IO
> =========
>
> Direct IO can cause corruption, if userspace does Direct-IO that writes to
> a range of virtual addresses that are mmap'd to a file. The pages written
> to are file-backed pages that can be under write back, while the Direct IO
> is taking place. Here, Direct IO races with a write back: it calls
> GUP before page_mkclean() has replaced the CPU pte with a read-only entry.
> The race window is pretty small, which is probably why years have gone by
> before we noticed this problem: Direct IO is generally very quick, and
> tends to finish up before the filesystem gets around to do anything with
> the page contents. However, it's still a real problem. The solution is
> to never let GUP return pages that are under write back, but instead,
> force GUP to take a write fault on those pages. That way, GUP will
> properly synchronize with the active write back. This does not change the
> required GUP behavior, it just avoids that race.

Direct IO on a mmapped file-backed page doesn't make any sense. The direct
I/O write syscall already specifies one file handle of a filesystem that
the data is to be written onto. Plus, mmap has already established a second
file handle and another filesystem that is also in charge of that memory
segment.

Two filesystems are trying to sync one memory segment, both believing they
have exclusive access, and we want to sort this out. Why? Don't allow this.
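
For readers following the thread, here is a minimal userspace sketch of the
stable-page problem and the bounce-page idea quoted above. It is a toy model,
not kernel code: `writeback()`, `dma_write()` and `PAGE_SIZE` are hypothetical
names chosen for illustration, the CRC stands in for whatever checksum a
filesystem might compute during writeback, and the callback stands in for a
device that still holds a long-term GUP reference to the page.

```python
import zlib

PAGE_SIZE = 4096  # assumed page size, for illustration only


def dma_write(page):
    """Hypothetical device write: modifies the page behind the filesystem's back."""
    page[0:4] = b"DMA!"


def writeback(page, device_write=None, use_bounce=False):
    """Toy model of writing one page back to storage.

    Returns (checksum, data_written). With use_bounce=True the filesystem
    works on a private snapshot of the page instead of the page itself.
    """
    # The bounce copy is taken once; the device can no longer change it.
    source = bytes(page) if use_bounce else page

    # Step 1: the filesystem computes a checksum, assuming stable content.
    checksum = zlib.crc32(source)

    # Step 2: a device holding a GUP reference may write at any time,
    # including in the middle of writeback.
    if device_write is not None:
        device_write(page)

    # Step 3: the data is actually sent to storage.
    written = bytes(source)
    return checksum, written


# Without a bounce page, the checksum and the written data disagree.
page = bytearray(PAGE_SIZE)
csum, data = writeback(page, dma_write, use_bounce=False)
print("direct writeback consistent:", zlib.crc32(data) == csum)

# With a bounce page, the written data is stable.
page = bytearray(PAGE_SIZE)
csum, data = writeback(page, dma_write, use_bounce=True)
print("bounce writeback consistent:", zlib.crc32(data) == csum)
```

In this model the bounce copy keeps the checksum and the written data
consistent, but the device's later write lands only in the original page. That
matches the quoted remark that bounce pages are only part of the overall
solution, and it does not settle who owns the page contents.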