Date: Wed, 2 Aug 2023 11:24:02 +0100
From: Mel Gorman <mgorman@techsingularity.net>
To: David Hildenbrand
Cc: Linus Torvalds, linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	linux-fsdevel@vger.kernel.org, Andrew Morton, liubo, Peter Xu,
	Matthew Wilcox, Hugh Dickins, Jason Gunthorpe, John Hubbard
Subject: Re: [PATCH v1 0/4] smaps / mm/gup: fix gup_can_follow_protnone fallout
Message-ID: <20230802102402.at2lvqp3hlbksoz4@techsingularity.net>
References:
<20230727212845.135673-1-david@redhat.com>

On Fri, Jul 28, 2023 at 07:30:33PM +0200, David Hildenbrand wrote:
> On 28.07.23 18:18, Linus Torvalds wrote:
> > On Thu, 27 Jul 2023 at 14:28, David Hildenbrand wrote:
> > >
> > > This is my proposal on how to handle the fallout of 474098edac26
> > > ("mm/gup: replace FOLL_NUMA by gup_can_follow_protnone()") where I
> > > accidentially missed that follow_page() and smaps implicitly kept the
> > > FOLL_NUMA flag clear by *not* setting it if FOLL_FORCE is absent, to
> > > not trigger faults on PROT_NONE-mapped PTEs.
> >
> > Ugh.
>
> I was hoping for that reaction, with the assumption that we would get
> something cleaner :)
>
> >
> > I hate how it uses FOLL_FORCE that is inherently scary.
>
> I hate FOLL_FORCE, but I hate FOLL_NUMA even more, because to me it
> is FOLL_FORCE in disguise (currently and before 474098edac26, if
> FOLL_FORCE is set, FOLL_NUMA won't be set and the other way around).
>

FOLL_NUMA being conflated with FOLL_FORCE is almost certainly a
historical accident.

> >
> > Why do we have that "gup_can_follow_protnone()" logic AT ALL?
>
> That's what I was hoping for.
>
> >
> > Couldn't we just get rid of that disgusting thing, and just say that
> > GUP (and follow_page()) always just ignores NUMA hinting, and always
> > just follows protnone?
> >
> > We literally used to have this:
> >
> >         if (!(gup_flags & FOLL_FORCE))
> >                 gup_flags |= FOLL_NUMA;
> >
> > ie we *always* set FOLL_NUMA for any sane situation. FOLL_FORCE should
> > be the rare crazy case.
>
> Yes, but my point would be that we now spell that "rare crazy case"
> out for follow_page().
>
> If you're talking about patch #1, I agree, therefore patch #3 to
> avoid all that nasty FOLL_FORCE handling in GUP callers.
>
> But yeah, if we can avoid all that, great.
>
> >
> > The original reason for not setting FOLL_NUMA all the time is
> > documented in commit 0b9d705297b2 ("mm: numa: Support NUMA hinting
> > page faults from gup/gup_fast") from way back in 2012:
> >
> >   * If FOLL_FORCE and FOLL_NUMA are both set, handle_mm_fault
> >   * would be called on PROT_NONE ranges. We must never invoke
> >   * handle_mm_fault on PROT_NONE ranges or the NUMA hinting
> >   * page faults would unprotect the PROT_NONE ranges if
> >   * _PAGE_NUMA and _PAGE_PROTNONE are sharing the same pte/pmd
> >   * bitflag. So to avoid that, don't set FOLL_NUMA if
> >   * FOLL_FORCE is set.
>
> In handle_mm_fault(), we never call do_numa_page() if
> !vma_is_accessible(). Same for do_huge_pmd_numa_page().
>
> So, if we would ever end up triggering a page fault on
> mprotect(PROT_NONE) ranges (i.e., via FOLL_FORCE), we
> would simply do nothing.
>
> At least that's the hope, I'll take a closer look just to make
> sure we're good on all call paths.
>
> >
> > but I don't think the original reason for this is *true* any more.
> >
> > Because then two years later in 2014, in commit c46a7c817e66 ("x86:
> > define _PAGE_NUMA by reusing software bits on the PMD and PTE levels")
> > Mel made the code able to distinguish between PROT_NONE and NUMA
> > pages, and he changed the comment above too.
>
> CCing Mel.
>
> I remember that pte_protnone() can only distinguished between
> NUMA vs. actual mprotect(PROT_NONE) by looking at the VMA --
> vma_is_accessible().
>

Ok, as usual, I'm far behind and this thread is massive but I'll respond
to this part before trying to digest the history of this and the current
implementation.
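David's point above -- that a protnone PTE is ambiguous on its own and only
the VMA says whether it means NUMA hinting or a real mprotect(PROT_NONE) --
can be sketched in a few lines. To be clear, this is not kernel code: the
structs and flag values below are simplified userspace mocks that only mirror
the shape of vma_is_accessible() and the protnone check, and
pte_is_numa_hint() is a made-up illustrative helper, not a real kernel
function.

```c
#include <stdbool.h>

/*
 * Simplified mocks: a "VMA" carries access bits and a "PTE" carries a
 * single protnone bit. The real kernel types are far richer; these
 * stand-ins only capture the part of the logic being discussed.
 */
#define VM_READ  0x1UL
#define VM_WRITE 0x2UL
#define VM_EXEC  0x4UL

struct vma { unsigned long vm_flags; };
struct pte { bool protnone; };

/* Mirrors vma_is_accessible(): does the VMA grant any access at all? */
static bool vma_is_accessible(const struct vma *vma)
{
	return (vma->vm_flags & (VM_READ | VM_WRITE | VM_EXEC)) != 0;
}

/*
 * A protnone PTE inside an accessible VMA can only be there for NUMA
 * hinting; a protnone PTE inside an inaccessible VMA reflects a real
 * mprotect(PROT_NONE) protection. The PTE bit alone cannot tell the
 * two apart -- the VMA must be consulted.
 */
static bool pte_is_numa_hint(const struct pte *pte, const struct vma *vma)
{
	return pte->protnone && vma_is_accessible(vma);
}
```

This is also why handle_mm_fault() skipping do_numa_page() for
!vma_is_accessible() VMAs is enough to keep a stray fault on a genuine
PROT_NONE range harmless: the VMA check filters out the non-hinting case.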
To the best of my recollection, FOLL_NUMA used to be a correctness issue
but that should no longer be true. Initially, it was to prevent mixing up
"PROT_NONE" that was for NUMA hinting and "PROT_NONE" due to VMA
protections. Now the bits are different so this case should be avoidable.
Later it was still a different correctness issue because PMD migration had
a hacky implementation without migration entries and a GUP could find a
page that was being collapsed and had to be serialised. That should also
now be avoidable.

At some point, FOLL_FORCE and FOLL_NUMA got conflated but they really
should not be related even if they are by accident. FOLL_FORCE (e.g.
ptrace) may have to process the fault and make the page resident and
accessible regardless of any other consequences. FOLL_NUMA ideally should
be much more specific. If the calling context only cares about the struct
page (e.g. smaps) then it's ok to get a reference to the page. If
necessary, it could clear the protection and lose the hinting fault
although it's less than ideal. Just needing the struct page for
informational purposes though should not be treated as a NUMA hinting
fault because it has nothing to do with the task's memory reference
behaviour.

A variant of FOLL_NUMA (FOLL_NUMA_HINT?) may still be required to indicate
that the calling context is accessing the page for reasons that are
equivalent to a real memory access from a CPU related to the task mapping
the page. I didn't check but KVM may be an example of this when dealing
with some MMU faults, as the page is being looked up on behalf of the task
and presumably from the same CPU the task was running on. Something like
reading smaps only needs the struct page, but it should not be treated as
a NUMA hinting fault as the access has nothing to do with the task mapping
the page.

> > The original reason for FOLL_NUMA simply does not exist any more.
> > We know exactly when a page is marked for NUMA faulting, and we should
> > simply *ignore* it for GUP and follow_page().
>

I'd be wary of completely ignoring it if there is any known calling context
that is equivalent to a memory access and the hinting fault should be
processed -- KVM may be an example.

-- 
Mel Gorman
SUSE Labs