Subject: Re: [PATCH 2/2] mm: set PG_dma_pinned on get_user_pages*()
From: John Hubbard
To: Jan Kara
CC: Matthew Wilcox, Dan Williams, Christoph Hellwig, Jason Gunthorpe,
    John Hubbard, Michal Hocko, Christopher Lameter, Linux MM, LKML,
    linux-rdma
Date: Mon, 25 Jun 2018 23:31:06 -0700
Message-ID: <550aacd3-cfea-c99a-3b60-563dd1621d5c@nvidia.com>
In-Reply-To: <20180625152150.jnf5suiubecfppcl@quack2.suse.cz>
References: <3898ef6b-2fa0-e852-a9ac-d904b47320d5@nvidia.com>
    <0e6053b3-b78c-c8be-4fab-e8555810c732@nvidia.com>
    <20180619082949.wzoe42wpxsahuitu@quack2.suse.cz>
    <20180619090255.GA25522@bombadil.infradead.org>
    <20180619104142.lpilc6esz7w3a54i@quack2.suse.cz>
    <70001987-3938-d33e-11e0-de5b19ca3bdf@nvidia.com>
    <20180620120824.bghoklv7qu2z5wgy@quack2.suse.cz>
    <151edbf3-66ff-df0c-c1cc-5998de50111e@nvidia.com>
    <20180621163036.jvdbsv3t2lu34pdl@quack2.suse.cz>
    <20180625152150.jnf5suiubecfppcl@quack2.suse.cz>

On 06/25/2018 08:21 AM, Jan Kara wrote:
> On Thu 21-06-18 18:30:36, Jan Kara wrote:
>> On Wed 20-06-18 15:55:41, John Hubbard wrote:
>>> On 06/20/2018 05:08 AM, Jan Kara wrote:
>>>> On Tue 19-06-18 11:11:48, John Hubbard wrote:
>>>>> On 06/19/2018 03:41 AM, Jan Kara wrote:
>>>>>> On Tue 19-06-18 02:02:55, Matthew Wilcox wrote:
>>>>>>> On Tue, Jun 19, 2018 at 10:29:49AM +0200, Jan Kara wrote:
>>> [...]
> I've spent some time on this. There are two obstacles with my approach of
> putting a special entry into the inode's VMA tree:
>
> 1) If I want to place this special entry in the inode's VMA tree, I either
> need to allocate a full VMA and somehow initialize it so that it's clear
> it's a special "pinned" range, not a VMA => uses too much memory
> unnecessarily, and it is ugly. Another solution I was hoping for was to
> factor out some common bits of vm_area_struct (pgoff, rb_node, ...) into a
> structure shared by VMAs and locked ranges => doable, but it causes a lot
> of churn, as VMAs are accessed (and modified!) at hundreds of places in
> the kernel. Accessor functions would help to reduce the churn a bit, but
> then stuff like vma_set_pgoff(vma, pgoff) isn't exactly beautiful either.
>
> 2) Some users of GUP (e.g. direct IO) get a block of pages and then put
> references to these pages at different times and in random order -
> basically, when IO for a given page is completed, its reference is
> dropped, and one GUP call can acquire page references for pages which end
> up in multiple different bios (we don't know in advance). This makes it
> difficult to implement a counterpart to GUP that 'unpins' a range of pages
> - we'd either have to support partial unpins (and splitting of pinned
> ranges and all such fun), or track internally how many pages are still
> pinned in the originally pinned range and release the pin once all
> individual pages are unpinned; but then it's difficult to e.g. get to this
> internal structure from an IO completion callback, where we only have the
> bio.
>
> So I think Matthew's idea of removing pinned pages from the LRU is
> definitely worth trying, to see how complex that would end up being. Did
> you get to looking into it? If not, I can probably find some time to try
> that out.
>

OK, so I looked into this some more. As you implied in an earlier response,
removing a page from the LRU is probably the easy part. It's *keeping* it
off the LRU that worries me. I looked at the SetPageLRU() uses; there were
only 5 call sites, and of those, I think only one might be difficult:
__pagevec_lru_add().

It seems like the way to avoid __pagevec_lru_add() calls on these pages is
to first call lru_add_drain_all(), then remove the pages from the LRU
(presumably via isolate_lru_page()). I think that should do it. But I'm a
little concerned that maybe I'm overlooking something.
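To make that concrete, here is roughly the sequence I have in mind, as if
it lived somewhere in mm/gup.c. This is just an untested sketch:
gup_pin_page_off_lru() is a made-up name, and SetPageDmaPinned() is the
accessor I'm assuming the new PG_dma_pinned flag would provide.

#include <linux/mm.h>
#include <linux/page-flags.h>
#include <linux/swap.h>
#include "internal.h"		/* isolate_lru_page() */

/*
 * Sketch only: take a gup-pinned page off the LRU and mark it with the
 * proposed flag. Assumes the caller already holds the gup reference.
 * Returns true if the page was actually isolated from an LRU list.
 */
static bool gup_pin_page_off_lru(struct page *page)
{
	bool isolated;

	/*
	 * Drain the per-CPU pagevecs on all CPUs, so this page cannot
	 * still be sitting in a pagevec, waiting for a later
	 * __pagevec_lru_add() to put it (back) on the LRU.
	 */
	lru_add_drain_all();

	/*
	 * isolate_lru_page() clears PG_lru, removes the page from its
	 * LRU list, and takes an extra page reference on success
	 * (-EBUSY just means the page wasn't on an LRU). The eventual
	 * unpin path would drop that reference via putback_lru_page().
	 */
	isolated = (isolate_lru_page(page) == 0);

	/*
	 * Assumed accessor for the new PG_dma_pinned flag; the
	 * SetPageLRU() call sites listed below would skip pages that
	 * have this set.
	 */
	SetPageDmaPinned(page);

	return isolated;
}

The unpin side would then clear the flag and hand the page back via
putback_lru_page(), which also drops the extra reference that
isolate_lru_page() took.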
Here are the 5 search hits and my analysis. This may have mistakes in it,
as I'm pretty new to this area, which is why I'm spelling it out:

1. mm/memcontrol.c:2082: SetPageLRU(page);

   This is in unlock_page_lru(). Caller: commit_charge(), and it's
   conditional on lrucare, so we can just skip it if the new page flag is
   set.

2. mm/swap.c:831: SetPageLRU(page_tail);

   This is in lru_add_page_tail(), which is only called by
   __split_huge_page_tail(), and there we can also just skip the call for
   these pages.

3. mm/swap.c:866: SetPageLRU(page);

   This is in __pagevec_lru_add_fn() (sole caller: __pagevec_lru_add), and
   is discussed above.

4. mm/vmscan.c:1680: SetPageLRU(page);

   This is in putback_inactive_pages(), which I think won't get called
   unless the page is already on an LRU.

5. mm/vmscan.c:1873: SetPageLRU(page);  // (N/A)

   This is in move_active_pages_to_lru(), which I also think won't get
   called unless the page is already on an LRU.

thanks,
--
John Hubbard
NVIDIA