Subject: Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
From: John Hubbard
To: Jerome Glisse
CC: Matthew Wilcox, Dan Williams, John Hubbard, Andrew Morton, Linux MM,
    Jan Kara, Al Viro, Christoph Hellwig, Christopher Lameter,
    "Dalessandro, Dennis", Doug Ledford, Jason Gunthorpe, Michal Hocko,
    Linux Kernel Mailing List, linux-fsdevel
Date: Fri, 7 Dec 2018 16:52:42 -0800
Message-ID: <3c4d46c0-aced-f96f-1bf3-725d02f11b60@nvidia.com>
In-Reply-To: <20181207191620.GD3293@redhat.com>
References: <20181204001720.26138-1-jhubbard@nvidia.com>
    <20181204001720.26138-2-jhubbard@nvidia.com>
    <3c91d335-921c-4704-d159-2975ff3a5f20@nvidia.com>
    <20181205011519.GV10377@bombadil.infradead.org>
    <20181205014441.GA3045@redhat.com>
    <59ca5c4b-fd5b-1fc6-f891-c7986d91908e@nvidia.com>
    <7b4733be-13d3-c790-ff1b-ac51b505e9a6@nvidia.com>
    <20181207191620.GD3293@redhat.com>
On 12/7/18 11:16 AM, Jerome Glisse wrote:
> On Thu, Dec 06, 2018 at 06:45:49PM -0800, John Hubbard wrote:
>> On 12/4/18 5:57 PM, John Hubbard wrote:
>>> On 12/4/18 5:44 PM, Jerome Glisse wrote:
>>>> On Tue, Dec 04, 2018 at 05:15:19PM -0800, Matthew Wilcox wrote:
>>>>> On Tue, Dec 04, 2018 at 04:58:01PM -0800, John Hubbard wrote:
>>>>>> On 12/4/18 3:03 PM, Dan Williams wrote:
>>>>>>> Except the LRU fields are already in use for ZONE_DEVICE pages... how
>>>>>>> does this proposal interact with those?
>>>>>>
>>>>>> Very badly: page->pgmap and page->hmm_data both get corrupted. Is there an
>>>>>> entire use case I'm missing: calling get_user_pages() on ZONE_DEVICE pages?
>>>>>> Said another way: is it reasonable to disallow calling get_user_pages() on
>>>>>> ZONE_DEVICE pages?
>>>>>>
>>>>>> If we have to support get_user_pages() on ZONE_DEVICE pages, then the whole
>>>>>> LRU field approach is unusable.
>>>>>
>>>>> We just need to rearrange ZONE_DEVICE pages. Please excuse the whitespace
>>>>> damage:
>>>>>
>>>>> +++ b/include/linux/mm_types.h
>>>>> @@ -151,10 +151,12 @@ struct page {
>>>>>  #endif
>>>>>                 };
>>>>>                 struct {        /* ZONE_DEVICE pages */
>>>>> +                       unsigned long _zd_pad_2;        /* LRU */
>>>>> +                       unsigned long _zd_pad_3;        /* LRU */
>>>>> +                       unsigned long _zd_pad_1;        /* uses mapping */
>>>>>                         /** @pgmap: Points to the hosting device page map. */
>>>>>                         struct dev_pagemap *pgmap;
>>>>>                         unsigned long hmm_data;
>>>>> -                       unsigned long _zd_pad_1;        /* uses mapping */
>>>>>                 };
>>>>>
>>>>>  /** @rcu_head: You can use this to free a page by RCU. */
>>>>>
>>>>> You don't use page->private or page->index, do you Dan?
>>>>
>>>> page->private and page->index are used by HMM DEVICE pages.
>>>>
>>>
>>> OK, so for the ZONE_DEVICE + HMM case, that leaves just one field remaining
>>> for dma-pinned information. Which might work. To recap, we need:
>>>
>>>    -- 1 bit for PageDmaPinned
>>>    -- 1 bit, if using LRU field(s), for PageDmaPinnedWasLru
>>>    -- N bits for a reference count
>>>
>>> Those *could* be packed into a single 64-bit field, if really necessary.
>>>
>>
>> ...actually, this needs to work on 32-bit systems as well. And HMM is already
>> using most of these fields. However, it is still possible for this to work.
>>
>> Matthew, can I have that bit now please? I'm about out of options, and now it
>> will actually solve the problem here.
>>
>> Given:
>>
>> 1) It's cheap to know if a page is ZONE_DEVICE, and ZONE_DEVICE means not on
>> the LRU. That, in turn, means only 1 bit instead of 2 bits (in addition to a
>> counter) is required for that case.
>>
>> 2) There is an independent bit available (according to Matthew).
>>
>> 3) HMM uses 4 of the 5 struct page fields, so only one field is available for
>> a counter in that case.
>
> To expand on this, HMM private pages are used for anonymous pages, so the
> index and mapping fields have the values you expect for such pages. Down the
> road I also want to support file-backed pages with HMM private (mapping,
> private, index).
>
> For HMM public, both anonymous and file-backed pages are supported today
> (HMM public is only useful on platforms with something like OpenCAPI, CCIX
> or NVLink ... so PowerPC for now).
>
>> 4) get_user_pages() must work on ZONE_DEVICE and HMM pages.
>
> get_user_pages() only needs to work with HMM public pages, not the private
> ones, as we cannot allow _anyone_ to pin HMM private pages. So a
> get_user_pages() on an HMM private page takes a page fault, and the page is
> migrated back to regular memory.
>
>
>> 5) For a proper atomic counter on both 32- and 64-bit, we really do need a
>> complete unsigned long field.
>>
>> So that leads to the following approach:
>>
>> -- Use a single unsigned long field as an atomic reference count for the DMA
>> pinned count. For normal pages, this will be the *second* field of the LRU
>> (in order to avoid the PageTail bit).
>>
>> For ZONE_DEVICE pages, we can also line up the fields so that the second LRU
>> field is available and reserved for this DMA pinned count. Basically _zd_pad_1
>> gets moved up and optionally renamed:
>>
>> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
>> index 017ab82e36ca..b5dcd9398cae 100644
>> --- a/include/linux/mm_types.h
>> +++ b/include/linux/mm_types.h
>> @@ -90,8 +90,8 @@ struct page {
>>                          * are in use.
>>                          */
>>                         struct {
>> -                               unsigned long dma_pinned_flags;
>> -                               atomic_t dma_pinned_count;
>> +                               unsigned long dma_pinned_flags; /* LRU.next */
>> +                               atomic_t dma_pinned_count;      /* LRU.prev */
>>                         };
>>                 };
>>                 /* See page-flags.h for PAGE_MAPPING_FLAGS */
>> @@ -161,9 +161,9 @@ struct page {
>>                 };
>>                 struct {        /* ZONE_DEVICE pages */
>>                         /** @pgmap: Points to the hosting device page map. */
>> -                       struct dev_pagemap *pgmap;
>> -                       unsigned long hmm_data;
>> -                       unsigned long _zd_pad_1;        /* uses mapping */
>> +                       struct dev_pagemap *pgmap;      /* LRU.next */
>> +                       unsigned long _zd_pad_1;        /* LRU.prev or dma_pinned_count */
>> +                       unsigned long hmm_data;         /* uses mapping */
>
> This breaks HMM today, as hmm_data would alias with the mapping field.
> hmm_data can only be in LRU.prev.
>

I see. OK, HMM has done an efficient job of mopping up unused fields, and now we
are completely out of space. At this point, after thinking about it carefully, it
seems clear that it's time for a single, new field:

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 5ed8f6292a53..1c789e324da8 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -182,6 +182,9 @@ struct page {
        /* Usage count. *DO NOT USE DIRECTLY*. See page_ref.h */
        atomic_t _refcount;
 
+       /* DMA usage count. See get_user_pages*(), put_user_page*(). */
+       atomic_t _dma_pinned_count;
+
 #ifdef CONFIG_MEMCG
        struct mem_cgroup *mem_cgroup;
 #endif

...because after all, the reason this is so difficult is that this fix has to work
in pretty much every configuration. get_user_pages() use is widespread, it's a very
general facility, and... it needs fixing. And we're out of space.

I'm going to send out an updated RFC that shows the latest, and I think it's going
to include the above.

-- 
thanks,
John Hubbard
NVIDIA
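As an aside on the earlier idea in the thread, packing one PageDmaPinned flag bit
plus an N-bit pin count into a single reused LRU word, here is a minimal,
self-contained sketch of what that packing could look like. It is a userspace
mock-up for illustration only: the struct, macro, and helper names are hypothetical
and do not come from the patch series, and a real in-kernel version would need
atomic bit and count updates rather than plain arithmetic.

/*
 * Illustrative mock-up (not kernel code): one unsigned long carrying a
 * "PageDmaPinned" flag bit in bit 0 and a pin count in the remaining bits.
 */
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

#define DMA_PINNED_FLAG       (1UL << 0)   /* hypothetical "PageDmaPinned" bit */
#define DMA_PIN_COUNT_SHIFT   1            /* remaining bits hold the count */

struct mock_page {
        unsigned long dma_pinned;          /* flag + count, packed */
};

static void mock_pin(struct mock_page *page)
{
        page->dma_pinned += 1UL << DMA_PIN_COUNT_SHIFT;
        page->dma_pinned |= DMA_PINNED_FLAG;
}

static void mock_unpin(struct mock_page *page)
{
        page->dma_pinned -= 1UL << DMA_PIN_COUNT_SHIFT;
        if (!(page->dma_pinned >> DMA_PIN_COUNT_SHIFT))
                page->dma_pinned &= ~DMA_PINNED_FLAG;
}

static bool mock_page_dma_pinned(const struct mock_page *page)
{
        return page->dma_pinned & DMA_PINNED_FLAG;
}

int main(void)
{
        struct mock_page page = { 0 };

        mock_pin(&page);
        mock_pin(&page);
        assert(mock_page_dma_pinned(&page));

        mock_unpin(&page);
        mock_unpin(&page);
        assert(!mock_page_dma_pinned(&page));

        printf("packed flag + count: ok\n");
        return 0;
}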
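And a similar sketch of the direction settled on above: a dedicated atomic pin
count next to _refcount, with get/put helpers loosely modeled on the
get_user_pages()/put_user_page*() naming from this series. Again this is a hedged
userspace mock-up using C11 atomics, not the kernel implementation; only the
_dma_pinned_count field name is taken from the diff above, and the helper bodies
are assumptions for illustration.

/*
 * Illustrative mock-up (not kernel code): a separate atomic DMA pin count
 * alongside the ordinary reference count, incremented on pin and decremented
 * by the matching put_user_page()-style release.
 */
#include <assert.h>
#include <stdatomic.h>
#include <stdio.h>

struct mock_page {
        atomic_int _refcount;           /* ordinary page reference count */
        atomic_int _dma_pinned_count;   /* proposed DMA pin count */
};

/* Pin for DMA: take a normal reference plus a DMA pin. */
static void mock_get_user_page(struct mock_page *page)
{
        atomic_fetch_add(&page->_refcount, 1);
        atomic_fetch_add(&page->_dma_pinned_count, 1);
}

/* Unpin: drop the DMA pin, then the normal reference. */
static void mock_put_user_page(struct mock_page *page)
{
        atomic_fetch_sub(&page->_dma_pinned_count, 1);
        atomic_fetch_sub(&page->_refcount, 1);
}

int main(void)
{
        struct mock_page page;

        atomic_init(&page._refcount, 1);          /* page starts with one ref */
        atomic_init(&page._dma_pinned_count, 0);

        mock_get_user_page(&page);
        assert(atomic_load(&page._dma_pinned_count) == 1);

        mock_put_user_page(&page);
        assert(atomic_load(&page._dma_pinned_count) == 0);
        assert(atomic_load(&page._refcount) == 1);

        printf("pin/unpin bookkeeping balanced\n");
        return 0;
}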