Subject: Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
From: Dan Williams
Date: Wed, 12 Dec 2018 13:40:50 -0800
To: Jérôme Glisse
Cc: Jan Kara, John Hubbard, Matthew Wilcox, Andrew Morton, Linux MM,
    tom@talpey.com, Al Viro, benve@cisco.com, Christoph Hellwig,
    Christopher Lameter, "Dalessandro, Dennis", Doug Ledford,
    Jason Gunthorpe, Michal Hocko, Mike Marciniszyn, rcampbell@nvidia.com,
    Linux Kernel Mailing List, linux-fsdevel, "Weiny, Ira"
In-Reply-To: <20181212213005.GE5037@redhat.com>
List-ID: linux-kernel@vger.kernel.org

On Wed, Dec 12, 2018 at 1:30 PM Jerome Glisse wrote:
>
> On Wed, Dec 12, 2018 at 08:27:35AM -0800, Dan Williams wrote:
> > On Wed, Dec 12, 2018 at 7:03 AM Jerome Glisse wrote:
> > >
> > > On Mon, Dec 10, 2018 at 11:28:46AM +0100,
Jan Kara wrote:
> > > > On Fri 07-12-18 21:24:46, Jerome Glisse wrote:
> > > > > Another crazy idea: why not treat GUP as another mapping of the
> > > > > page? The caller of GUP would have to provide either a fake
> > > > > anon_vma struct or a fake vma struct (or both, for a PRIVATE
> > > > > mapping of a file, where you can have a mix of private and file
> > > > > pages, thus only if it is a read-only GUP) that would get added
> > > > > to the list of existing mappings.
> > > > >
> > > > > So the flow would be:
> > > > >     somefunction_thatuse_gup()
> > > > >     {
> > > > >         ...
> > > > >         GUP(_fast)(vma, ..., fake_anon, fake_vma);
> > > > >         ...
> > > > >     }
> > > > >
> > > > >     GUP(vma, ..., fake_anon, fake_vma)
> > > > >     {
> > > > >         if (vma->flags == ANON) {
> > > > >             // Add the fake anon vma to the anon vma chain as
> > > > >             // a child of the current vma
> > > > >         } else {
> > > > >             // Add the fake vma to the mapping tree
> > > > >         }
> > > > >
> > > > >         // The existing GUP, except that it now increments
> > > > >         // mapcount and not refcount
> > > > >         GUP_old(..., &nanonymous, &nfiles);
> > > > >
> > > > >         atomic_add(&fake_anon->refcount, nanonymous);
> > > > >         atomic_add(&fake_vma->refcount, nfiles);
> > > > >
> > > > >         return nanonymous + nfiles;
> > > > >     }
> > > >
> > > > Thanks for your idea! This is actually something like what I was
> > > > suggesting back at LSF/MM in Deer Valley. There were two downsides
> > > > to this that I remember people pointing out:
> > > >
> > > > 1) This cannot really work with __get_user_pages_fast(). You're not
> > > >    allowed to take the necessary locks to insert a new entry into
> > > >    the VMA tree in that context, so essentially we'd lose the
> > > >    get_user_pages_fast() functionality.
> > > >
> > > > 2) The overhead, e.g. for direct IO, may be noticeable. You need to
> > > >    allocate the fake tracking VMA, take the VMA interval tree lock,
> > > >    and insert into the tree.
> > > >    Then on IO completion you need to queue work to unpin the pages
> > > >    again, as you cannot remove the fake VMA directly from the
> > > >    interrupt context where the IO is completed.
> > > >
> > > > You are right that the cost could be amortized if gup() is called
> > > > for multiple consecutive pages; however, for small IOs there's no
> > > > help...
> > > >
> > > > So this approach doesn't look like a win to me over using a counter
> > > > in struct page, and I'd rather try looking into squeezing HMM
> > > > public page usage of struct page so that we can fit that gup
> > > > counter there as well. I know that it may be easier said than
> > > > done...
> > >
> > > So I went back to the drawing board, and first I would like to
> > > ascertain that we all agree on what the objectives are:
> > >
> > > [O1] Avoid writeback from a page still being written to by either a
> > >      device, some direct I/O, or any other existing user of GUP.
> > >      This would avoid possible filesystem corruption.
> > >
> > > [O2] Avoid a crash when set_page_dirty() is called on a page that is
> > >      considered clean by core mm (buffer heads have been removed,
> > >      and with some filesystems this turns into an ugly mess).
> > >
> > > [O3] DAX and the device block problems, i.e. with DAX the page
> > >      mapped in userspace is the same as the block (persistent
> > >      memory), and no filesystem nor block device understands a page
> > >      as a block or a pinned block.
> > >
> > > For [O3] I don't think any pin count would help in any way. I believe
> > > that the current long-term GUP API that does not allow GUP of DAX is
> > > the only sane solution for now.
> >
> > No, that's not a sane solution, it's an emergency hack.
> >
> > > The real fix would be to teach the filesystem about DAX/pinned
> > > blocks so that a pinned block is not reused by the filesystem.
> >
> > We already have taught filesystems about pinned dax pages, see
> > dax_layout_busy_page().
> > As much as possible I want to eliminate the concept of "dax pages" as
> > a special case that gets sprinkled throughout the mm.
>
> So, thinking on the O3 issues, what about leveraging the recent change
> I made to mmu notifiers? Add an event for truncate, or any other file
> event that needs to invalidate the file->page mapping for a range of
> offsets.
>
> Add an mmu notifier listener to each GUP user (except direct I/O) so
> that they invalidate their hardware mapping, or switch the hardware
> mapping to use a crappy page. When such an event happens, whatever the
> user does to the page through that driver is broken anyway, so it is
> better to be loud about it than to try to make it pass under the
> radar.
>
> This will put the burden on broken users and allow you to properly
> recycle your DAX pages.
>
> Think of it as revoke through mmu notifier.
>
> So the patchset would be:
>     enum mmu_notifier_event {
>     +    MMU_NOTIFY_TRUNCATE,
>     };
>
>     + Change the truncate code path to emit MMU_NOTIFY_TRUNCATE
>
> Then for each user of GUP (except direct I/O or other very short-term
> GUP):
>
>     Patch 1: register an mmu notifier
>     Patch 2: listen to MMU_NOTIFY_TRUNCATE and MMU_NOTIFY_UNMAP;
>              when that happens, update the device page table or
>              usage to point to a crappy page, and do put_user_page
>              on all previously held pages
>
> So this would solve the revoke side of things without adding a burden
> on GUP users like direct I/O. Many existing users of GUP already
> listen to mmu notifiers and already behave properly; it is just about
> making everybody listen for that event. Then we could even add the mmu
> notifier pointer as an argument to GUP, just to make sure no new user
> of GUP forgets to register a notifier (the argument as a teaching
> guide, not as something actively used).
>
> So does that sound like a plan to solve your concern with long-term
> GUP users? This does not depend on DAX or anything; it would apply to
> any file-backed pages.
Almost. We need some safety around assuming that DMA to the page is
complete, so the notification would need to go all the way to userspace
with something like a file lease notification. It would also need to be
backstopped by an IOMMU in the case where the hardware does not / cannot
stop in-flight DMA.