Date: Mon, 17 Dec 2018 22:12:19 -0800
From: "Darrick J. Wong"
To: Matthew Wilcox
Cc: Jerome Glisse, Dave Chinner, Jan Kara, John Hubbard, Dan Williams,
    John Hubbard, Andrew Morton, Linux MM, tom@talpey.com, Al Viro,
    benve@cisco.com, Christoph Hellwig, Christopher Lameter,
    "Dalessandro, Dennis", Doug Ledford, Jason Gunthorpe, Michal Hocko,
    mike.marciniszyn@intel.com, rcampbell@nvidia.com,
    Linux Kernel Mailing List, linux-fsdevel
Subject: Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
Message-ID: <20181218061219.GC8112@magnolia>
In-Reply-To: <20181217183443.GO10600@bombadil.infradead.org>
X-Mailing-List: linux-kernel@vger.kernel.org

On Mon, Dec 17, 2018 at 10:34:43AM -0800, Matthew Wilcox wrote:
> On Mon, Dec 17, 2018 at 01:11:50PM -0500, Jerome Glisse wrote:
> > On Mon, Dec 17, 2018 at 08:58:19AM +1100, Dave Chinner wrote:
> > > Sure, that's a possibility, but that doesn't close off any race
> > > conditions because there can be DMA into the page in progress while
> > > the page is being bounced, right? AFAICT this ext3+DIF/DIX case is
> > > different in that there is no 3rd-party access to the page while it
> > > is under IO (ext3 arbitrates all access to its metadata), and so
> > > nothing can actually race for modification of the page between
> > > submission and bouncing at the block layer.
> > >
> > > In this case, the moment the page is unlocked, anyone else can map
> > > it and start (R)DMA on it, and that can happen before the bio is
> > > bounced by the block layer. So AFAICT, block layer bouncing doesn't
> > > solve the problem of racing writeback and DMA direct to the page we
> > > are doing IO on. Yes, it reduces the race window substantially, but
> > > it doesn't get rid of it.
> >
> > So the event flow is:
> >     - userspace creates an object that matches a range of virtual
> >       addresses against a given kernel sub-system (let's say
> >       infiniband) and let's assume that the range is an mmap() of a
> >       regular file
> >     - device driver does GUP on the range (let's assume it is a write
> >       GUP) so if the page is not already mapped with write permission
> >       in the page table then a page fault is triggered and
> >       page_mkwrite happens
> >     - Once GUP returns the page to the device driver and once the
> >       device driver has updated the hardware state to allow access
> >       to this page then from that point on hardware can write to the
> >       page at _any_ time, it is fully disconnected from any fs event
> >       like write back, it fully ignores things like page_mkclean
> >
> > This is how it is today, we allowed people to push upstream such
> > users of GUP. This is a fact we have to live with, we cannot stop
> > hardware access to the page, we cannot force the hardware to follow
> > page_mkclean and force a page_mkwrite once write back ends. This is
> > the situation we are inheriting (and I am personally not happy with
> > that).
> >
> > From my point of view we are left with 2 choices:
> >     [C1] break all drivers that do not abide by the page_mkclean and
> >          page_mkwrite
> >     [C2] mitigate as much as possible the issue
> >
> > For [C2] the idea is to keep track of GUP per page so we know if we
> > can expect the page to be written to at any time. Here is the event
> > flow:
> >     - driver GUPs the page and programs the hardware, page is marked
> >       as GUPed
> >     ...
> >     - write back kicks in on the dirty page, locks the page and
> >       everything as usual, sees it is GUPed and informs the block
> >       layer to use a bounce page
>
> No.  The solution John, Dan & I have been looking at is to take the
> dirty page off the LRU while it is pinned by GUP.  It will never be
> found for writeback.
>
> That's not the end of the story though.  Other parts of the kernel (eg
> msync) also need to be taught to stay away from pages which are pinned
> by GUP.  But the idea is that no page gets written back to storage while
> it's pinned by GUP.  Only when the last GUP ends is the page returned
> to the list of dirty pages.

Errr... what does fsync do in the meantime?  Not write the page?  That
would seem to break what fsync() is supposed to do.

--D

> >     - block layer copies the page to a bounce page effectively
> >       creating a snapshot of what is the content of the real page.
> >       This allows everything in block layer that needs stable content
> >       to work on the bounce page (raid, striping, encryption, ...)
> >     - once write back is done the page is not marked clean but stays
> >       dirty, this effectively disables things like COW for filesystem
> >       and other features that expect page_mkwrite between write back.
> >       AFAIK it is believed that it is something acceptable
>
> So none of this is necessary.
>
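
To make the fsync() question concrete, here is a toy user-space model of
the two proposals in the thread. It is only a sketch under loud
assumptions: this is not kernel code, and every name in it (toy_page,
gup_pins, wb_policy, toy_writeback, toy_fsync_durable) is hypothetical
and invented for illustration. WB_SKIP_PINNED stands in for "a pinned
dirty page is never found for writeback", WB_BOUNCE_PINNED for the
bounce-page snapshot in [C2].

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define PAGE_SZ 16  /* tiny "page" so the model stays readable */

/* Hypothetical stand-in for a page cache page; NOT struct page. */
struct toy_page {
    char data[PAGE_SZ];
    bool dirty;
    int  gup_pins;  /* outstanding GUP-style pins (assumed per-page counter) */
};

/* The two behaviours discussed in the thread, as a toy policy switch. */
enum wb_policy {
    WB_SKIP_PINNED,   /* pinned dirty pages are hidden from writeback */
    WB_BOUNCE_PINNED  /* pinned dirty pages are snapshotted, then written */
};

/* "storage" is just a second buffer. Returns true if data was written. */
bool toy_writeback(struct toy_page *pg, char *storage, enum wb_policy policy)
{
    assert(pg != NULL && storage != NULL);

    if (!pg->dirty)
        return false;               /* nothing to write */

    if (pg->gup_pins > 0) {
        if (policy == WB_SKIP_PINNED)
            return false;           /* nothing reaches storage */

        /* Copy to a bounce buffer so the written content is stable. */
        char bounce[PAGE_SZ];
        memcpy(bounce, pg->data, PAGE_SZ);
        memcpy(storage, bounce, PAGE_SZ);
        /* Page stays dirty: the device may write to it again at any time. */
        return true;
    }

    /* Unpinned page: ordinary writeback, page becomes clean. */
    memcpy(storage, pg->data, PAGE_SZ);
    pg->dirty = false;
    return true;
}

/*
 * Toy fsync: reports whether the page's dirty data is on "storage" when
 * the call returns. A clean page trivially satisfies that.
 */
bool toy_fsync_durable(struct toy_page *pg, char *storage, enum wb_policy policy)
{
    if (!pg->dirty)
        return true;
    return toy_writeback(pg, storage, policy);
}
```

In this model, calling toy_fsync_durable() on a dirty page with
gup_pins > 0 returns false under WB_SKIP_PINNED, which is the behaviour
the question above is about. Under WB_BOUNCE_PINNED it returns true, but
the page remains dirty and the snapshot can already be stale by the time
the call returns.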