Date: Wed, 12 Dec 2018 12:02:20 -0500
From: Jerome Glisse
To: Dan Williams
Cc: Jan Kara, John Hubbard, Matthew Wilcox, Andrew Morton, Linux MM,
    tom@talpey.com, Al Viro, benve@cisco.com, Christoph Hellwig,
    Christopher Lameter, "Dalessandro, Dennis", Doug Ledford,
    Jason Gunthorpe, Michal Hocko, Mike Marciniszyn, rcampbell@nvidia.com,
    Linux Kernel Mailing List, linux-fsdevel
Subject: Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
Message-ID: <20181212170220.GA5037@redhat.com>
References: <20181205011519.GV10377@bombadil.infradead.org>
    <20181205014441.GA3045@redhat.com>
    <59ca5c4b-fd5b-1fc6-f891-c7986d91908e@nvidia.com>
    <7b4733be-13d3-c790-ff1b-ac51b505e9a6@nvidia.com>
    <20181207191620.GD3293@redhat.com>
    <3c4d46c0-aced-f96f-1bf3-725d02f11b60@nvidia.com>
    <20181208022445.GA7024@redhat.com>
    <20181210102846.GC29289@quack2.suse.cz>
    <20181212150319.GA3432@redhat.com>

On Wed, Dec 12, 2018 at 08:27:35AM -0800, Dan Williams wrote:
> On Wed, Dec 12, 2018 at 7:03 AM Jerome Glisse wrote:
> >
> > On Mon, Dec 10, 2018 at 11:28:46AM +0100, Jan Kara wrote:
> > > On Fri 07-12-18 21:24:46, Jerome Glisse wrote:
> > > > Another crazy idea: why not treat GUP as another mapping of the
> > > > page? The caller of GUP would have to provide either a fake
> > > > anon_vma struct or a fake vma struct (or both, for a PRIVATE
> > > > mapping of a file, where you can have a mix of private and file
> > > > pages, and thus only for a read-only GUP) that would get added to
> > > > the list of existing mappings.
> > > >
> > > > So the flow would be:
> > > >   somefunction_thatuse_gup()
> > > >   {
> > > >     ...
> > > >     GUP(_fast)(vma, ..., fake_anon, fake_vma);
> > > >     ...
> > > >   }
> > > >
> > > >   GUP(vma, ..., fake_anon, fake_vma)
> > > >   {
> > > >     if (vma->flags == ANON) {
> > > >       // Add the fake anon vma to the anon vma chain as a child
> > > >       // of the current vma
> > > >     } else {
> > > >       // Add the fake vma to the mapping tree
> > > >     }
> > > >
> > > >     // The existing GUP, except that it now increments mapcount
> > > >     // and not refcount
> > > >     GUP_old(..., &nanonymous, &nfiles);
> > > >
> > > >     atomic_add(nanonymous, &fake_anon->refcount);
> > > >     atomic_add(nfiles, &fake_vma->refcount);
> > > >
> > > >     return nanonymous + nfiles;
> > > >   }
> > >
> > > Thanks for your idea! This is actually something like what I was
> > > suggesting back at LSF/MM in Deer Valley. There were two downsides
> > > to this that I remember people pointing out:
> > >
> > > 1) This cannot really work with __get_user_pages_fast(). You're not
> > >    allowed to take the necessary locks to insert a new entry into
> > >    the VMA tree in that context. So essentially we'd lose
> > >    get_user_pages_fast() functionality.
> > >
> > > 2) The overhead, e.g. for direct IO, may be noticeable. You need to
> > >    allocate the fake tracking VMA, take the VMA interval tree lock,
> > >    and insert into the tree. Then on IO completion you need to queue
> > >    work to unpin the pages again, as you cannot remove the fake VMA
> > >    directly from the interrupt context where the IO is completed.
> > >
> > > You are right that the cost could be amortized if gup() is called
> > > for multiple consecutive pages; however, for small IOs there's no
> > > help...
> > >
> > > So this approach doesn't look like a win to me over using a counter
> > > in struct page, and I'd rather try looking into squeezing HMM public
> > > page usage of struct page so that we can fit that gup counter there
> > > as well. I know that it may be easier said than done...
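The fake-vma flow sketched above can be modeled as a small userspace toy. This is only an illustration of the bookkeeping, not kernel code: `toy_fake_mapping`, `toy_gup_old`, `toy_gup`, and `toy_demo` are all made-up names, and real insertion into the anon-vma chain or the mapping's interval tree (with its locking, which is exactly objection 1 above) is reduced to a flag.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for the fake anon_vma / fake vma the caller would provide.
 * All names here are illustrative, not kernel API. */
struct toy_fake_mapping {
    int refcount;   /* how many GUP'ed pages this fake mapping accounts for */
    int inserted;   /* 1 once linked into the (toy) anon-vma chain or tree  */
};

enum toy_page_kind { TOY_ANON, TOY_FILE };

/* Stand-in for the existing GUP: classify each requested page and count
 * anonymous vs file-backed ones, as GUP_old(..., &nanonymous, &nfiles). */
static void toy_gup_old(const enum toy_page_kind *pages, size_t n,
                        int *nanon, int *nfile)
{
    *nanon = 0;
    *nfile = 0;
    for (size_t i = 0; i < n; i++) {
        if (pages[i] == TOY_ANON)
            ++*nanon;
        else
            ++*nfile;
    }
}

/* The proposed GUP wrapper: link the fake mappings, run the old GUP,
 * then charge each pinned page to the matching fake mapping. */
static int toy_gup(const enum toy_page_kind *pages, size_t n,
                   struct toy_fake_mapping *fake_anon,
                   struct toy_fake_mapping *fake_vma)
{
    int nanon, nfile;

    fake_anon->inserted = 1;  /* would be: child of the real anon_vma chain */
    fake_vma->inserted = 1;   /* would be: insert into mapping interval tree */

    toy_gup_old(pages, n, &nanon, &nfile);
    fake_anon->refcount += nanon;
    fake_vma->refcount += nfile;
    return nanon + nfile;
}

/* Demo: a private mapping with a mix of anon and file pages. */
static int toy_demo(void)
{
    enum toy_page_kind pages[] = { TOY_ANON, TOY_FILE, TOY_FILE };
    struct toy_fake_mapping fa = { 0, 0 }, fv = { 0, 0 };

    if (toy_gup(pages, 3, &fa, &fv) != 3)
        return -1;
    return (fa.refcount == 1 && fv.refcount == 2) ? 0 : -1;
}
```

Writeback (or any rmap walker) would then see the fake mapping like any other mapping of the page, which is what makes the pin visible without a new counter; the cost objections above are about creating `fake_vma` and taking the tree lock per GUP call.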
> >
> > So i went back to the drawing board, and first i would like to
> > ascertain that we all agree on what the objectives are:
> >
> >   [O1] Avoid write back of a page still being written to by either a
> >        device, some direct I/O, or any other existing user of GUP.
> >        This would avoid possible file system corruption.
> >
> >   [O2] Avoid a crash when set_page_dirty() is called on a page that
> >        is considered clean by the core mm (buffer heads have been
> >        removed, and with some file systems this turns into an ugly
> >        mess).
> >
> >   [O3] DAX and the device block problems, ie with DAX the page mapped
> >        in userspace is the same as the block (persistent memory), and
> >        no filesystem or block device understands a page as a block or
> >        a pinned block.
> >
> > For [O3] i don't think any pin count would help in any way. I believe
> > that the current long term GUP API that does not allow GUP of DAX is
> > the only sane solution for now.
>
> No, that's not a sane solution, it's an emergency hack.

Then how do you want to solve it? Knowing the pin count does not help
you, at least i do not see how it would help, and if it does then my
solution lets you compute the pin count too: it is the difference
between the real mapping count and the mapcount value.

> > The real fix would be to teach the filesystem about DAX/pinned
> > blocks so that a pinned block is not reused by the filesystem.
>
> We already have taught filesystems about pinned dax pages, see
> dax_layout_busy_page(). As much as possible I want to eliminate the
> concept of "dax pages" as a special case that gets sprinkled
> throughout the mm.
>
> > For [O1] and [O2] i believe a solution with mapcount would work. So
> > no new struct, no fake vma, nothing like that. In GUP for file-backed
> > pages
>
> With get_user_pages_fast() we don't know that we have a file-backed
> page, because we don't have a vma.
You do not need a vma to know that; we have PageAnon() for that. So my
solution is just about adding the following to the core GUP page table
walker:

    if (!PageAnon(page))
        atomic_inc(&page->mapcount);

Then in put_user_page() you do the opposite. In page_mkclean() you count
the number of real mappings and voilà ... you have an answer for [O1].

You could use the same real-mapping count to get the pin count in other
places that care about it, but i fail to see why the actual pin count
value would matter to anyone.

Cheers,
Jérôme
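The mapcount scheme above can be modeled in a self-contained userspace toy. None of these names are real kernel structures or API (`toy_page`, `toy_gup_pin`, etc. are all illustrative, and the real `struct page` uses `_mapcount` with atomics); the point is only the arithmetic: GUP bumps mapcount for file-backed pages, put_user_page() drops it, and page_mkclean() infers the number of GUP pins as mapcount minus the real mappings found by the rmap walk.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of the proposal; all names are illustrative, not kernel API. */
struct toy_page {
    bool anon;          /* PageAnon() stand-in                           */
    int  refcount;      /* page refcount stand-in                        */
    int  mapcount;      /* page mapcount stand-in                        */
    int  real_mappings; /* what a page_mkclean() rmap walk would count   */
};

/* GUP page-table walker addition: file-backed pages get a mapcount pin. */
static void toy_gup_pin(struct toy_page *p)
{
    p->refcount++;
    if (!p->anon)
        p->mapcount++;
}

/* put_user_page(): undo exactly what GUP did. */
static void toy_put_user_page(struct toy_page *p)
{
    if (!p->anon)
        p->mapcount--;
    p->refcount--;
}

/* Pin count as seen by page_mkclean(): mapcount minus real mappings. */
static int toy_gup_pins(const struct toy_page *p)
{
    return p->mapcount - p->real_mappings;
}

/* Writeback must wait (or skip) while a file page is pinned ([O1]). */
static bool toy_safe_to_writeback(const struct toy_page *p)
{
    return toy_gup_pins(p) == 0;
}

/* Demo: one file page with one real mapping, pinned across an I/O. */
static int toy_demo(void)
{
    struct toy_page page = { false, 1, 1, 1 };

    toy_gup_pin(&page);                 /* e.g. direct I/O in flight     */
    if (toy_safe_to_writeback(&page))
        return -1;                      /* must not write back: pinned   */
    toy_put_user_page(&page);           /* I/O completed                 */
    return toy_safe_to_writeback(&page) ? 0 : -1;
}
```

Anonymous pages keep today's refcount-only behavior; only the !PageAnon() path changes, which is why no vma is needed in the fast path.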