From: Dan Williams
Date: Thu, 20 Dec 2018 08:57:22 -0800
Subject: Re: [PATCH 1/2] mm: introduce put_user_page*(), placeholder versions
To: Jerome Glisse
Cc: John Hubbard, Jan Kara, Matthew Wilcox, Dave Chinner, Andrew Morton,
    Linux MM, tom@talpey.com, Al Viro, benve@cisco.com, Christoph Hellwig,
    Christopher Lameter, "Dalessandro, Dennis", Doug Ledford,
    Jason Gunthorpe, Michal Hocko, Mike Marciniszyn, rcampbell@nvidia.com,
    Linux Kernel Mailing List, linux-fsdevel
List-ID: linux-kernel@vger.kernel.org

On Thu, Dec 20, 2018 at 8:50 AM Jerome Glisse wrote:
>
> On Thu, Dec 20, 2018 at 02:54:49AM -0800, John Hubbard wrote:
> > On 12/19/18 3:08 AM, Jan Kara wrote:
> > > On Tue 18-12-18 21:07:24, Jerome Glisse wrote:
> > >> On Tue, Dec 18, 2018 at 03:29:34PM -0800, John Hubbard wrote:
> > >>> OK, so let's take another look at Jerome's _mapcount idea all by itself
> > >>> (using *only* the pinned-page tracking aspect), given that it is the
> > >>> lightest-weight solution for that.
> > >>>
> > >>> So as I understand it, this would use page->_mapcount to store both the
> > >>> real mapcount and the DMA pinned count (simply added together), but only
> > >>> do so for file-backed (non-anonymous) pages:
> > >>>
> > >>> __get_user_pages()
> > >>> {
> > >>>     ...
> > >>>     get_page(page);
> > >>>
> > >>>     if (!PageAnon(page))
> > >>>         atomic_inc(&page->_mapcount);
> > >>>     ...
> > >>> }
> > >>>
> > >>> put_user_page(struct page *page)
> > >>> {
> > >>>     ...
> > >>>     if (!PageAnon(page))
> > >>>         atomic_dec(&page->_mapcount);
> > >>>
> > >>>     put_page(page);
> > >>>     ...
> > >>> }
> > >>>
> > >>> ...and then in the various consumers of the DMA pinned count, we use
> > >>> page_mapped(page) to see if any mapcount remains, and if so, we treat
> > >>> it as DMA pinned. Is that what you had in mind?
> > >>
> > >> Mostly, with two extra observations:
> > >>     [1] We only need to know the pin count when writeback kicks in.
> > >>     [2] We need to protect the GUP code with wait_on_page_writeback() in
> > >>         case GUP is racing with a writeback that might not see the
> > >>         elevated mapcount in time.
> > >>
> > >> So for [2]:
> > >>
> > >> __get_user_pages()
> > >> {
> > >>     get_page(page);
> > >>
> > >>     if (!PageAnon(page)) {
> > >>         atomic_inc(&page->_mapcount);
> > >> +       if (PageWriteback(page)) {
> > >> +           // Assume we are racing and the current writeback will not
> > >> +           // see the elevated mapcount, so wait for it to finish and
> > >> +           // force a page fault
> > >> +           wait_on_page_writeback(page);
> > >> +           // force the slow path that will fault again
> > >> +       }
> > >>     }
> > >> }
> > >
> > > This is not needed AFAICT. __get_user_pages() gets the page reference
> > > (and it should also increment page->_mapcount) under the PTE lock.
> > > So at that point we are sure we have a writeable PTE that nobody can
> > > change. page_mkclean() has to block on the PTE lock to make the PTE
> > > read-only, and only after going through all PTEs like this can it check
> > > page->_mapcount. So the PTE lock provides enough synchronization.
> > >
> > >> For [1], only needing the pin count during writeback makes
> > >> page_mkclean() the perfect spot to check for it, so:
> > >>
> > >> int page_mkclean(struct page *page)
> > >> {
> > >>     int cleaned = 0;
> > >> +   int real_mapcount = 0;
> > >>     struct address_space *mapping;
> > >>     struct rmap_walk_control rwc = {
> > >>         .arg = (void *)&cleaned,
> > >>         .rmap_one = page_mkclean_one,
> > >>         .invalid_vma = invalid_mkclean_vma,
> > >> +       .mapcount = &real_mapcount,
> > >>     };
> > >>
> > >>     BUG_ON(!PageLocked(page));
> > >>
> > >>     if (!page_mapped(page))
> > >>         return 0;
> > >>
> > >>     mapping = page_mapping(page);
> > >>     if (!mapping)
> > >>         return 0;
> > >>
> > >>     // rmap_walk() needs to change to count mappings and return the
> > >>     // value in .mapcount -- an easy one
> > >>     rmap_walk(page, &rwc);
> > >>
> > >>     // Big fat comment to explain what is going on
> > >> +   if ((page_mapcount(page) - real_mapcount) > 0) {
> > >> +       SetPageDMAPinned(page);
> > >> +   } else {
> > >> +       ClearPageDMAPinned(page);
> > >> +   }
> > >
> > > This is the detail I'm not sure about: why can't rmap_walk_file() race
> > > with e.g. zap_pte_range(), which decrements page->_mapcount and thus
> > > makes the check we do in page_mkclean() wrong?
> >
> > Right. This looks like a dead end after all. We can't lock a whole chunk
> > of "all these are mapped, hold still while we count you" pages. It's not
> > designed to allow that at all.
> >
> > IMHO, we are now back to something like dynamic_page, which provides an
> > independent DMA pinned count.
>
> I will keep looking, because allocating a structure for every GUP is
> insane to me; there are users out there that are GUPing gigabytes of data

This is not the common case.
> and it is going to waste tons of memory just to fix crappy hardware.

This is the common case. Please refrain from hyperbolic assessments.