Received: by 2002:a5b:505:0:0:0:0:0 with SMTP id o5csp180303ybp; Thu, 3 Oct 2019 11:57:36 -0700 (PDT) X-Google-Smtp-Source: APXvYqzzYC7j/wKSREiras3hlAX+i3XeFSSodu/9KwOhmkdjhqh2OuwnK3BwmwmpvmlD+eD5hL5Q X-Received: by 2002:a17:906:bcd6:: with SMTP id lw22mr8975132ejb.270.1570129056790; Thu, 03 Oct 2019 11:57:36 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1570129056; cv=none; d=google.com; s=arc-20160816; b=gcN9aLTd3VeqhcZ/ulAjEgvCm+5q+mOZePRE/EyQrnh2QKfqRUAUHfyzlkR2teKJy1 DuO2aRLjuJ2zkU3cOSoBwIvcRZLi//ye1pJzFNrPhMOciBXH7z/K/W2ZAUyG9XPvwm19 ZoURXeKnjw+beyAMCp3sYy6DDut1Xy/q63Ilu1fehZEAMZOqAj1/7VXAMG1V6tZcvvAb 6eWpGVuN8j072fCYiE8e/++a2sE9MnbSK6t+rM57Mzqhx3ilCG4qcd49egNUqTVPL3LN in+JpZvKoPAo+CuFwzgn8EZ1H7dBKDP9mM0qeE7eXBUsz5cMfMEOCKdmxmA2BK3nfc9S CJiQ== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:sender:content-language :content-transfer-encoding:in-reply-to:mime-version:user-agent:date :message-id:organization:from:references:cc:to:subject :dkim-signature; bh=tTpl57TuPWyDV68UjOKQRTMgGeP9CgDyMjpsaGRD1Sc=; b=g8NNy/G0XPR8/YyeYzwxYFxpWhW+axvzs9QcnRaWpsH2EAJdeMrrDePgBbPhcd++gg aMoqGIsBj74Lj4xeyhtb6GVk3sdlmUAgILhUR/CXhkGZwTmTelzZghz1d7k/UywtELCg dNk9ssyH2pc4m0+s10gFPXLZO5OracUIQPhGdY5Lzexuf/WqN2+28uZlJQgKyJZZxu8t tVQIKwfU1VvpBxYohKXyBOlkT3A1Zwd2w7kGwock+J6IUeKyMTdBbEp9YYG5KFV3WePD TX3V67dtnUgrQ0Q3Nko8Ab924qX5CtvZjlsFAU1hpvWisTY725b9vyMCObLbhatOEFDc 0S0g== ARC-Authentication-Results: i=1; mx.google.com; dkim=fail (test mode) header.i=@shipmail.org header.s=mail header.b=VWznklJ6; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org Return-Path: Received: from vger.kernel.org (vger.kernel.org. [209.132.180.67]) by mx.google.com with ESMTP id c8si2260223edb.82.2019.10.03.11.57.12; Thu, 03 Oct 2019 11:57:36 -0700 (PDT) Received-SPF: pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) client-ip=209.132.180.67; Authentication-Results: mx.google.com; dkim=fail (test mode) header.i=@shipmail.org header.s=mail header.b=VWznklJ6; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1732535AbfJCSDT (ORCPT + 99 others); Thu, 3 Oct 2019 14:03:19 -0400 Received: from ste-pvt-msa1.bahnhof.se ([213.80.101.70]:21767 "EHLO ste-pvt-msa1.bahnhof.se" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1729199AbfJCSDT (ORCPT ); Thu, 3 Oct 2019 14:03:19 -0400 Received: from localhost (localhost [127.0.0.1]) by ste-pvt-msa1.bahnhof.se (Postfix) with ESMTP id 8492D3F6BB; Thu, 3 Oct 2019 20:03:16 +0200 (CEST) Authentication-Results: ste-pvt-msa1.bahnhof.se; dkim=pass (1024-bit key; unprotected) header.d=shipmail.org header.i=@shipmail.org header.b=VWznklJ6; dkim-atps=neutral X-Virus-Scanned: Debian amavisd-new at bahnhof.se X-Spam-Flag: NO X-Spam-Score: -2.1 X-Spam-Level: X-Spam-Status: No, score=-2.1 tagged_above=-999 required=6.31 tests=[BAYES_00=-1.9, DKIM_SIGNED=0.1, DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1, DKIM_VALID_EF=-0.1] autolearn=ham autolearn_force=no Received: from ste-pvt-msa1.bahnhof.se ([127.0.0.1]) by localhost (ste-pvt-msa1.bahnhof.se [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id WyVX858MhJ0Q; Thu, 3 Oct 2019 20:03:15 +0200 (CEST) Received: from mail1.shipmail.org (h-205-35.A357.priv.bahnhof.se [155.4.205.35]) (Authenticated sender: mb878879) by ste-pvt-msa1.bahnhof.se (Postfix) with ESMTPA id 56B7D3F6C6; Thu, 3 Oct 2019 20:03:12 +0200 (CEST) Received: from localhost.localdomain (h-205-35.A357.priv.bahnhof.se [155.4.205.35]) by mail1.shipmail.org (Postfix) with ESMTPSA id 7768E36045B; Thu, 3 Oct 2019 20:03:12 +0200 (CEST) DKIM-Signature: v=1; a=rsa-sha256; c=simple/simple; d=shipmail.org; s=mail; t=1570125792; bh=GF/6FFSHugKxxdXl6zz5/RcxSTIvcJxXQzMLqjyEVXg=; h=Subject:To:Cc:References:From:Date:In-Reply-To:From; b=VWznklJ6n0kMQqXVmmtjzd79P17b+ulfFLyCH51ql0SSXWEZZUxCkS0fBXR5lH1lE NPdf2R3eI4WO+0SU+CSTdQcGOPNTWanD95BYsjkldc4mos9M2h7KttYU+1+gZmRqWq a3w48dyIelee8OhLI182zJh2EgVYmPH2o7xx8TGY= Subject: Re: [PATCH v3 3/7] mm: Add write-protect and clean utilities for address space ranges To: Linus Torvalds , Thomas Hellstrom Cc: Linux-MM , Linux Kernel Mailing List , Andrew Morton , Matthew Wilcox , Will Deacon , Peter Zijlstra , Rik van Riel , Minchan Kim , Michal Hocko , Huang Ying , =?UTF-8?B?SsOpcsO0bWUgR2xpc3Nl?= , "Kirill A . Shutemov" References: <20191002134730.40985-1-thomas_os@shipmail.org> <20191002134730.40985-4-thomas_os@shipmail.org> <516814a5-a616-b15e-ac87-26ede681da85@shipmail.org> From: =?UTF-8?Q?Thomas_Hellstr=c3=b6m_=28VMware=29?= Organization: VMware Inc. Message-ID: Date: Thu, 3 Oct 2019 20:03:12 +0200 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:60.0) Gecko/20100101 Thunderbird/60.6.1 MIME-Version: 1.0 In-Reply-To: Content-Type: text/plain; charset=utf-8; format=flowed Content-Transfer-Encoding: 7bit Content-Language: en-US Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 10/3/19 6:55 PM, Linus Torvalds wrote: > > d) Fix the pte walker to do the right thing, then just use separate > pte walkers in your code > > The fix would be those two conceptual changes: > > 1) don't split if the walker asks for a pmd_entry (the walker itself > can then decide to split, of course, but right now no walkers want it > since there are no pmd _and_ pte walkers, because people who want that > do the pte walk themselves) > > 2) get the proper page table lock if you do walk the pte, since > otherwise it's racy > > Then there won't be any code duplication, because all the duplication > you now have at the pmd level is literally just workarounds for the > fact that our current walker has this bug. I actually started on d) already when Kirill asked me to unify the pud_entry() and pmd_entry() callbacks. > > That "fix the pte walker" would be one preliminary patch that would > look something like the attached TOTALLY UNTESTED garbage. > > I call it "garbage" because I really hope people take it just as what > it is: "something like this". It compiles for me, and I did try to > think it through, but I might have missed some big piece of the > picture when writing that patch. > > And yes, this is a much bigger conceptual change for the VM layer, but > I really think our pagewalk code is actively buggy right now, and is > forcing users to do bad things because they work around the existing > limitations. > > Hmm? Could some of the core mm people look over that patch? > > And yes, I was tempted to move the proper pmd locking into the walker > too, and do > > ptl = pmd_trans_huge_lock(pmd, vma); > if (ptl) { > err = ops->pmd_entry(pmd, addr, next, walk); > spin_unlock(ptl); > ... > > but while I think that's the correct thing to do in the long run, that > would have to be done together with changing all the existing > pmd_entry users. It would make the pmd_entry _solely_ handle the > hugepage case, and then you'd have to remove the locking in the > pmd_entry, and have to make the pte walking be a walker entry. But > that would _really_ clean things up, and would make things like > smaps_pte_range() much easier to read, and much more obvious (it would > be split into a smaps_pmd_range and smaps_pte_range, and the callbacks > wouldn't need to know about the complex locking). > > So I think this is the right direction to move into, but I do want > people to think about this, and think about that next phase of doing > the pmd_trans_huge_lock too. > > Comments? > > Linus I think if we take the ptl lock outside the callback, we'd need to allow the callback to unlock to do non-atomic things or to avoid recursive locking if it decides to split in the callback. FWIW The pud_entry call is being made locked, but the only current implementation appears to happily ignore that from what I can tell. And if we allow unlocking or call the callback unlocked, the callback needs to tell us whether the entry was actually handled or if we need to fallback to the next level. Perhaps using a positive PAGE_WALK_FALLBACK return value? That would allow current implementations to be unmodified. /Thomas