Date: Thu, 17 Jan 2019 10:04:06 +0100
From: Jan Kara
To: John Hubbard
Cc: Jan Kara, Jerome Glisse, Matthew Wilcox, Dave Chinner, Dan Williams,
 John Hubbard, Andrew Morton, Linux MM, tom@talpey.com, Al Viro,
 benve@cisco.com, Christoph Hellwig, Christopher Lameter,
 "Dalessandro, Dennis", Doug Ledford, Jason Gunthorpe, Michal Hocko,
 mike.marciniszyn@intel.com, rcampbell@nvidia.com,
 Linux Kernel Mailing List, linux-fsdevel
Subject: Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
Message-ID: <20190117090406.GA9378@quack2.suse.cz>
References: <20190111165141.GB3190@redhat.com>
 <1b37061c-5598-1b02-2983-80003f1c71f2@nvidia.com>
 <20190112020228.GA5059@redhat.com>
 <294bdcfa-5bf9-9c09-9d43-875e8375e264@nvidia.com>
 <20190112024625.GB5059@redhat.com>
 <20190114145447.GJ13316@quack2.suse.cz>
 <20190114172124.GA3702@redhat.com>
 <20190115080759.GC29524@quack2.suse.cz>
 <76788484-d5ec-91f2-1f66-141764ba0b1e@nvidia.com>
In-Reply-To: <76788484-d5ec-91f2-1f66-141764ba0b1e@nvidia.com>

On Wed 16-01-19 21:25:05, John Hubbard wrote:
> On 1/15/19 12:07 AM, Jan Kara wrote:
> >>>>> [...]
> >>> Also there is one more idea I had how to record the number of pins in
> >>> the page:
> >>>
> >>> #define PAGE_PIN_BIAS 1024
> >>>
> >>> get_page_pin()
> >>> 	atomic_add(PAGE_PIN_BIAS, &page->_refcount);
> >>>
> >>> put_page_pin()
> >>> 	atomic_sub(PAGE_PIN_BIAS, &page->_refcount);
> >>>
> >>> page_pinned(page)
> >>> 	(atomic_read(&page->_refcount) - page_mapcount(page)) > PAGE_PIN_BIAS
> >>>
> >>> This is a pretty trivial scheme. It still gives us 22 bits for page
> >>> pins, which should be plenty (but we should check for that and bail
> >>> with an error if it would overflow). Also there will be no false
> >>> negatives, and false positives only if there are more than 1024
> >>> non-page-table references to the page, which I expect to be rare (we
> >>> might want to also subtract hpage_nr_pages() for radix tree references
> >>> to avoid excessive false positives for huge pages, although at this
> >>> point I don't think they would matter). Thoughts?
>
> Some details, sorry I'm not fully grasping your plan without more
> explanation:
>
> Do I read it correctly that this uses the lower 10 bits for the original
> page->_refcount, and the upper 22 bits for gup-pinned counts? If so, I'm
> surprised, because the gup-pinned count is going to be less than or equal
> to the normal (get_page-based) pin count. And 1024 seems like it might be
> reached on a large system with lots of processes and IPC.
>
> Are you just allowing the lower 10 bits to overflow, and that's why the
> subtraction of mapcount? Wouldn't it be better to allow more than 10 bits
> instead?

I'm not really dividing the page->_refcount counter; that's the wrong way
to think about it, I believe. Normal get_page() simply increments _refcount
by 1, while get_page_pin() will increment it by 1024 (or 999 or whatever -
that's PAGE_PIN_BIAS). The choice of PAGE_PIN_BIAS is essentially a
tradeoff between how many page pins you allow and how likely page_pinned()
is to return a false positive. A large PAGE_PIN_BIAS means fewer false
positives, but also fewer page pins allowed for the page before _refcount
would overflow.

Now the trick with subtracting page_mapcount() is the following: we know
that certain places hold references to the page. Common holders of page
references are page table entries. So if we subtract page_mapcount() from
_refcount, we get a more accurate view of how many other references
(including pins) are there, and thus reduce the number of false positives.

> Another question: do we just allow other kernel code to observe this
> biased _refcount, or do we attempt to filter it out? In other words, do
> you expect problems due to some kernel code checking the _refcount and
> finding a large number there, when it expected, say, 3? I recall some code
> tries to do that... in fact, ZONE_DEVICE is 1-based, instead of
> zero-based, with respect to _refcount, right?

I would just allow other places to observe the biased refcount. Sure, there
are places that do comparisons on an exact refcount value, but if such a
place does not exclude page pins, it cannot really depend on whether
there's just one pin or a thousand of them. Generally such places try to
detect whether they are the only owner of the page (besides the page cache
radix tree, LRU, etc.). So they want to bail if any page pin exists, and
that check remains the same regardless of whether we increment _refcount by
1 or by 1024 when pinning the page.
								Honza
-- 
Jan Kara
SUSE Labs, CR