Received: by 2002:ad5:474a:0:0:0:0:0 with SMTP id i10csp115814imu; Wed, 19 Dec 2018 14:46:52 -0800 (PST) X-Google-Smtp-Source: AFSGD/WNKY6ixabamj25BZ2t0FWE30RArI+9aUq1zplTL+NCrpBS+9hmUZ9TU0dxblnLGVDyiwZS X-Received: by 2002:a17:902:1105:: with SMTP id d5mr21054643pla.47.1545259612590; Wed, 19 Dec 2018 14:46:52 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1545259612; cv=none; d=google.com; s=arc-20160816; b=OUaO6LRuqyDkW6+ND83w9oW//q9O+MPDMB8/AHq57bj3SuuzmcfGh/DQsIx33eZGRp sYosZoMWFUMMiiLbmJzDkHNJ9eA7jBCe8VfniXIR0xmhg1WeA+/TauV/OKaO4VeVNhNL MbWHNhDPw2yaC3miJBHK0IAtfZ2pgD4U3xqjwmRr/50y+b1y5cWy/K8UPSPeovPuaglW lmGWwPHG7h73gVTcFAey3MQpm3y4ORFSOeUSvjfiC3X6W67jN4mWDCNChVT8LRGpheWo GpEbjY7Ya4pBPSUwjORYwYyQt9Xf7fa+oDcBR1iq+7fA/nLxLW4jFYSdTRol41fyR/Mw fvCw== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:sender:user-agent:in-reply-to :content-disposition:mime-version:references:message-id:subject:cc :to:from:date; bh=CghKEgE1aLLK3FlMZnPtkY/fpQUqIv5IHaHVW9Hhvgo=; b=JQtT8+ocbqWiJ/OP+Qcg4MVdzqAX4uAQgH0ovz1HIP4iar7H4VVSQOo5WU/u1kgA32 uft5fhbbIY9iqwN6T2n97Yk+lC/NRsHwJ/ncPv8ZYOcUUOjlkbvpOsfFa87NStMl2Cf9 ghkKRALPirfrPu144RrkPQqSMQkgatCvXTStKCP3ld3GVrb3tRRQJtBAX9nqnVXRgkFr oAeaZ+OXHcQJAggBkDcPBE6Bf6PiHeBmSLSUyUnqY/o1zTQzRIAfT+UFfcae323nD2Uo FrsasVdyUkNWKtfteh79tBlcGRXTvV7cMatIvF/szqqCTGnHvKg0FLko4/jufZxgED4Q MGhQ== ARC-Authentication-Results: i=1; mx.google.com; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org Return-Path: Received: from vger.kernel.org (vger.kernel.org. [209.132.180.67]) by mx.google.com with ESMTP id e16si16762782pge.364.2018.12.19.14.46.37; Wed, 19 Dec 2018 14:46:52 -0800 (PST) Received-SPF: pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) client-ip=209.132.180.67; Authentication-Results: mx.google.com; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1728986AbeLSWdS (ORCPT + 99 others); Wed, 19 Dec 2018 17:33:18 -0500 Received: from ipmailnode02.adl6.internode.on.net ([150.101.137.148]:14128 "EHLO ipmailnode02.adl6.internode.on.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1728355AbeLSWdR (ORCPT ); Wed, 19 Dec 2018 17:33:17 -0500 Received: from ppp59-167-129-252.static.internode.on.net (HELO dastard) ([59.167.129.252]) by ipmail02.adl6.internode.on.net with ESMTP; 20 Dec 2018 09:03:13 +1030 Received: from dave by dastard with local (Exim 4.80) (envelope-from ) id 1gZkOu-0002Ha-CF; Thu, 20 Dec 2018 09:33:12 +1100 Date: Thu, 20 Dec 2018 09:33:12 +1100 From: Dave Chinner To: Jan Kara Cc: Jason Gunthorpe , Jerome Glisse , John Hubbard , Matthew Wilcox , Dan Williams , John Hubbard , Andrew Morton , Linux MM , tom@talpey.com, Al Viro , benve@cisco.com, Christoph Hellwig , Christopher Lameter , "Dalessandro, Dennis" , Doug Ledford , Michal Hocko , mike.marciniszyn@intel.com, rcampbell@nvidia.com, Linux Kernel Mailing List , linux-fsdevel Subject: Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions Message-ID: <20181219223312.GP6311@dastard> References: <20181210102846.GC29289@quack2.suse.cz> <20181212150319.GA3432@redhat.com> <20181212214641.GB29416@dastard> <20181214154321.GF8896@quack2.suse.cz> <20181216215819.GC10644@dastard> <20181218103306.GC18032@quack2.suse.cz> <20181218234254.GC31274@dastard> <20181219030329.GI21992@ziepe.ca> <20181219102825.GN6311@dastard> <20181219113540.GC18345@quack2.suse.cz> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20181219113540.GC18345@quack2.suse.cz> User-Agent: Mutt/1.5.21 (2010-09-15) Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Wed, Dec 19, 2018 at 12:35:40PM +0100, Jan Kara wrote: > On Wed 19-12-18 21:28:25, Dave Chinner wrote: > > On Tue, Dec 18, 2018 at 08:03:29PM -0700, Jason Gunthorpe wrote: > > > On Wed, Dec 19, 2018 at 10:42:54AM +1100, Dave Chinner wrote: > > > > > > > Essentially, what we are talking about is how to handle broken > > > > hardware. I say we should just brun it with napalm and thermite > > > > (i.e. taint the kernel with "unsupportable hardware") and force > > > > wait_for_stable_page() to trigger when there are GUP mappings if > > > > the underlying storage doesn't already require it. > > > > > > If you want to ban O_DIRECT/etc from writing to file backed pages, > > > then just do it. > > > > O_DIRECT IO *isn't the problem*. > > That is not true. O_DIRECT IO is a problem. In some aspects it is easier > than the problem with RDMA but currently O_DIRECT IO can crash your machine > or corrupt data the same way RDMA can. It's not O_DIRECT - it's a ""transient page pin". Yes, there are problems with that right now, but as we've discussed the issues can be avoided by: a) stable pages always blocking in ->page_mkwrite; b) blocking in write_cache_pages() on an elevated map count when WB_SYNC_ALL is set; and c) blocking in truncate_pagecache() on an elevated map count. That prevents: a) gup pinning a page that is currently under writeback and modifying it while IO is in flight; b) a dirty page being written back while it is pinned by GUP, thereby turning it clean before the gup reference calls set_page_dirty() on DMA completion; and c) truncate/hole punch for pulling the page out from under the gup operation that is ongoing. This is an adequate solution for a short term transient pins. It doesn't break fsync(), it doesn't change how truncate works and it fixes the problem where a mapped file is the buffer for an O_DIRECT IO rather than the open fd and that buffer file gets truncated. IOWs, transient pins (and hence O_DIRECT) is not really the problem here. The problem with this is that blocking on elevated map count does not work for long term pins (i.e. gup_longterm()) which are defined as: * "longterm" == userspace controlled elevated page count lifetime. * Contrast this to iov_iter_get_pages() usages which are transient. It's the "userspace controlled" part of the long term gup pin that is the problem we need to solve. If we treat them the same as a transient pin, then this leads to fsync() and truncate either blocking for a long time waiting for userspace to drop it's gup reference, or having to be failed with something like EBUSY or EAGAIN. This is the problem revokable file layout leases solve. The NFS server is already using this for revoking delegations from remote clients. Userspace holding long term GUP references is essentially the same thing - it's a delegation of file ownership to userspace that the filesystem must be able to revoke when it needs to run internal and/or 3rd-party requested operations on that delegated file. If the hardware supports page faults, then we can further optimise the long term pin case to relax stable page requirements and allow page cleaning to occur while there are long term pins. In this case, the hardware will write-fault the clean pages appropriately before DMA is initiated, and hence avoid the need for data integrity operations like fsync() to trigger lease revocation. However, truncate/hole punch still requires lease revocation to work sanely, especially when we consider DAX *must* ensure there are no remaining references to the physical pmem page after the space has been freed. i.e. conflating the transient and long term gup pins as the same problem doesn't help anyone. If we fix the short term pin problems, then the long term pin problem become tractable by adding a layer over the top (i.e. hardware page fault capability and/or file lease requirements). Existing apps and hardware will continue to work - external operations on the pinned file will simply hang rather than causing corruption or kernel crashes. New (or updated) applications will play nicely with lease revocation and at that point the "long term pin" basically becomes a transient pin where the unpin latency is determined by how quickly the app responds to the lease revocation. And page fault capable hardware will reduce the occurrence of lease revocations due to data writeback/integrity operations and behave almost identically to cpu-based mmap accesses to file backed pages. Cheers, Dave. -- Dave Chinner david@fromorbit.com