Date: Thu, 8 Apr 2021 01:37:34 -0700
From: Michel Lespinasse
To: Matthew Wilcox
Cc: Peter Zijlstra, Michel Lespinasse, Linux-MM, Laurent Dufour,
    Michal Hocko, Rik van Riel, Paul McKenney, Andrew Morton,
    Suren Baghdasaryan, Joel Fernandes, Rom Lemarchand, Linux-Kernel
Subject: Re: [RFC PATCH 24/37] mm: implement speculative handling in __do_fault()
Message-ID: <20210408083734.GB27824@lespinasse.org>
References: <20210407014502.24091-1-michel@lespinasse.org>
 <20210407014502.24091-25-michel@lespinasse.org>
 <20210407212027.GE25738@lespinasse.org>
 <20210407212712.GH2531743@casper.infradead.org>
 <20210408071343.GJ2531743@casper.infradead.org>
In-Reply-To: <20210408071343.GJ2531743@casper.infradead.org>

On Thu, Apr 08, 2021 at 08:13:43AM +0100, Matthew Wilcox wrote:
> On Thu, Apr 08, 2021 at 09:00:26AM +0200, Peter Zijlstra wrote:
> > On Wed, Apr 07, 2021 at 10:27:12PM +0100, Matthew Wilcox wrote:
> > > Doing I/O without any lock held already works; it just uses the file
> > > refcount. It would be better to use a vma refcount, as I already said.
> >
> > The original workload that I developed SPF for (waaaay back when) was
> > prefaulting a single huge vma. Using a vma refcount was a total loss
> > because it resulted in the same cacheline contention that down_read()
> > was having.
> >
> > As such, I'm always incredibly sad to see mention of vma refcounts.
> > They're fundamentally not solving the problem :/
>
> OK, let me outline my locking scheme because I think it's rather better
> than Michel's. The vma refcount is the slow path.
>
> 1. take the RCU read lock
> 2. walk the pgd/p4d/pud/pmd
> 3. allocate page tables if necessary. *handwave GFP flags*.
> 4. walk the vma tree
> 5. call ->map_pages
> 6. take ptlock
> 7. insert page(s)
> 8. drop ptlock
> if this all worked out, we're done, drop the RCU read lock and return.
> 9. increment vma refcount
> 10. drop RCU read lock
> 11. call ->fault
> 12. decrement vma refcount

Note that most of your proposed steps seem similar in principle to mine.
Looking at the fast path (steps 1-8):
- step 2 sounds like the speculative part of __handle_mm_fault()
- (step 3 is not included in my proposal)
- step 4 is basically the lookup I currently have in the arch fault handler
- step 6 sounds like the speculative part of map_pte_lock()

I have working implementations for each step, while your proposal
summarizes each as a one-line item. It's not clear to me what to make
of that; presumably you would be "filling in the blanks" differently
than I have, but you are not explaining how. Are you suggesting that
the precautions taken in each step to avoid races with mmap writers
would not be necessary in your proposal? If that is the case, what
alternative mechanism would you use to handle such races?

Going back to the source of this: you suggested not copying the VMA,
so what is your proposed alternative? Do you suggest that fault
handlers should deal with the vma potentially mutating under them?
Should mmap writers consider vmas immutable and copy them whenever
they want to change them? Or are you implying a locking mechanism
that would prevent mmap writers from executing while the fault is
running?

> Compared to today, where we bump the refcount on the file underlying the
> vma, this is _better_ scalability -- different mappings of the same file
> will not contend on the file's refcount.
>
> I suspect your huge VMA was anon, and that wouldn't need a vma refcount
> as faulting in new pages doesn't need to do I/O, just drop the RCU
> lock, allocate and retry.