Date: Mon, 17 Dec 2018 10:34:43 -0800
From: Matthew Wilcox
To:
	Jerome Glisse
Cc: Dave Chinner, Jan Kara, John Hubbard, Dan Williams, Andrew Morton,
	Linux MM, tom@talpey.com, Al Viro, benve@cisco.com,
	Christoph Hellwig, Christopher Lameter, "Dalessandro, Dennis",
	Doug Ledford, Jason Gunthorpe, Michal Hocko,
	mike.marciniszyn@intel.com, rcampbell@nvidia.com,
	Linux Kernel Mailing List, linux-fsdevel
Subject: Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
Message-ID: <20181217183443.GO10600@bombadil.infradead.org>
In-Reply-To: <20181217181148.GA3341@redhat.com>

On Mon, Dec 17, 2018 at 01:11:50PM -0500, Jerome Glisse wrote:
> On Mon, Dec 17, 2018 at 08:58:19AM +1100, Dave Chinner wrote:
> > Sure, that's a possibility, but that doesn't close off any race
> > conditions, because there can be DMA into the page in progress while
> > the page is being bounced, right? AFAICT this ext3+DIF/DIX case is
> > different in that there is no third-party access to the page while it
> > is under IO (ext3 arbitrates all access to its metadata), and so
> > nothing can actually race for modification of the page between
> > submission and bouncing at the block layer.
> >
> > In this case, the moment the page is unlocked, anyone else can map
> > it and start (R)DMA on it, and that can happen before the bio is
> > bounced by the block layer.
> > So AFAICT, block layer bouncing doesn't
> > solve the problem of racing writeback and DMA direct to the page we
> > are doing IO on. Yes, it reduces the race window substantially, but
> > it doesn't get rid of it.
>
> So the event flow is:
>  - userspace creates an object that matches a range of virtual
>    addresses against a given kernel subsystem (let's say InfiniBand),
>    and let's assume the range is an mmap() of a regular file
>  - the device driver does GUP on the range (let's assume it is a
>    write GUP), so if the page is not already mapped with write
>    permission in the page table, a page fault is triggered and
>    page_mkwrite happens
>  - once GUP returns the page to the device driver, and once the
>    driver has updated the hardware state to allow access to this
>    page, then from that point on the hardware can write to the page
>    at _any_ time; it is fully disconnected from any fs event like
>    writeback and fully ignores things like page_mkclean
>
> This is how it is today; we allowed people to push such users of GUP
> upstream. This is a fact we have to live with: we cannot stop
> hardware access to the page, and we cannot force the hardware to
> follow page_mkclean and force a page_mkwrite once writeback ends.
> This is the situation we are inheriting (and I am personally not
> happy with it).
>
> From my point of view we are left with two choices:
>   [C1] break all drivers that do not abide by page_mkclean and
>        page_mkwrite
>   [C2] mitigate the issue as much as possible
>
> For [C2] the idea is to keep track of GUP per page, so we know
> whether to expect the page to be written to at any time. Here is the
> event flow:
>  - the driver GUPs the page and programs the hardware; the page is
>    marked as GUPed
>    ...
>  - writeback kicks in on the dirty page, locks the page and
>    everything as usual, sees it is GUPed, and informs the block
>    layer to use a bounce page

No. The solution John, Dan & I have been looking at is to take the
dirty page off the LRU while it is pinned by GUP.
It will never be found for writeback. That's not the end of the story,
though; other parts of the kernel (e.g. msync) also need to be taught
to stay away from pages which are pinned by GUP. But the idea is that
no page gets written back to storage while it's pinned by GUP. Only
when the last GUP ends is the page returned to the list of dirty
pages.

>  - the block layer copies the page to a bounce page, effectively
>    creating a snapshot of the real page's content. This allows
>    everything in the block layer that needs stable content to work
>    on the bounce page (RAID, striping, encryption, ...)
>  - once writeback is done, the page is not marked clean but stays
>    dirty; this effectively disables things like COW for the
>    filesystem and other features that expect page_mkwrite between
>    writebacks. AFAIK it is believed that this is acceptable

So none of this is necessary.