Received: by 2002:a05:6358:3188:b0:123:57c1:9b43 with SMTP id q8csp2285272rwd; Fri, 2 Jun 2023 07:26:01 -0700 (PDT) X-Google-Smtp-Source: ACHHUZ6Eh648EBgL87V6i6t6VuBOhtBoVETm4Yz4jdhnIkiehTEnn1Niv/vt0IQ7xyTqf+ZCJ08a X-Received: by 2002:a05:6a20:a58c:b0:10d:3134:10d7 with SMTP id bc12-20020a056a20a58c00b0010d313410d7mr13757132pzb.27.1685715961464; Fri, 02 Jun 2023 07:26:01 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1685715961; cv=none; d=google.com; s=arc-20160816; b=eCqZJHMQhwhWjOedROFArvCE3mZi4RWTkchefaBmPl0XO6Lca2I1WOh6T3kd6mZ3ul i/yV2PQ554AykanR+l8xRJnjnWXtS/NC4hUYmLz1oxV5GQCwEH9lOCQm9nU3/MLxq9DT XMM8e8tg2T0wVYSo6rQvcUbASbg3OriSg9lK8sJJrPD81NnxWdZ2E3FwmS20STFIQHSO MNEV1N+cAmjswV7KEBlO9OKvMJozGM+pTorFG+pt6f7YDtdxlVZctzBB7KXUwYpEJzYH 6YJlEqGGQo4nJZs93GQwczDU/Z3N/SWkjFFjCGIQElph8hwFoF0Z4LVmsfPH7/F0zPjl NryA== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:content-transfer-encoding:cc:to:subject :message-id:date:from:in-reply-to:references:mime-version :dkim-signature; bh=H7d8d4MCc4onJq8pBQ681DcsEvxdLqedzghklpUu+XE=; b=sYYnq5fkmIQ01F4X7gJxSWSh8NiYrX17YOnBXXYU1qAJz09dPJ9nDZ4KpxgLIWYsBp d2rf8jxH9YRCT5qP4F/BJGu/SlBAP+vSiZrLr2VXZXIXf5gqbosC04YJ4PNKMZrwbF+A raMQESViGqHIXgvC+LiCFvujWH7HDmszjFpuufIha+2q2ewHosC/8kWiIiH64DvR1Gk3 mq9UKfJxiu9zZdwKkrr41I87uqgqxZeEXlA4TU2RrlFIMlSZTyIRX2mrrSGMnNBp6/DM M2geV/CKaMoFm5qKlqfA+Kt8pfuk958qG9j/7Te/ZNAFfKOkTksg80IAiLt+OkYvcWL8 1txQ== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@google.com header.s=20221208 header.b=L79Re62N; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Return-Path: Received: from out1.vger.email (out1.vger.email. [2620:137:e000::1:20]) by mx.google.com with ESMTP id g15-20020a63374f000000b005340840c0c7si1082048pgn.476.2023.06.02.07.25.47; Fri, 02 Jun 2023 07:26:01 -0700 (PDT) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20; Authentication-Results: mx.google.com; dkim=pass header.i=@google.com header.s=20221208 header.b=L79Re62N; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S235502AbjFBOWQ (ORCPT + 99 others); Fri, 2 Jun 2023 10:22:16 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:60776 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S234339AbjFBOWP (ORCPT ); Fri, 2 Jun 2023 10:22:15 -0400 Received: from mail-il1-x12c.google.com (mail-il1-x12c.google.com [IPv6:2607:f8b0:4864:20::12c]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 048901B9 for ; Fri, 2 Jun 2023 07:22:13 -0700 (PDT) Received: by mail-il1-x12c.google.com with SMTP id e9e14a558f8ab-33baee0235cso92905ab.1 for ; Fri, 02 Jun 2023 07:22:12 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20221208; t=1685715732; x=1688307732; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:from:to:cc:subject:date :message-id:reply-to; bh=H7d8d4MCc4onJq8pBQ681DcsEvxdLqedzghklpUu+XE=; b=L79Re62NKLr7jyTpAhcYr+dj4BGT8jlpsciGJ9UqkUTmF1Fp8HF4f7l7PCMx74wnSp Z4MExZDdVvmGr9HZZTldLvGj6yO4f7fmzOW5XHnYGFsp6aMxTvyGOWMjPJ6ywm+1hCxQ OT6t1BPoULN32Prv5c6c7oEpn7DQQMc7OJntv8coVkeidQLUAxq4YWwjm4UCbhaAUkHY CUlzXPTOm1Z7v2fI3l67bBKLhim0lh6AjxhBFfHcW9dOs+LcV01tauSIex0nknLfh486 eezgiwL7ika5ondE8vdXjdjqeaYRbUDxv3VyIaZo6ZQgfWFoet5hN9FlE8YQgtwWZQa+ l08w== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20221208; t=1685715732; x=1688307732; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=H7d8d4MCc4onJq8pBQ681DcsEvxdLqedzghklpUu+XE=; b=e0JeT9qQSR/k4qmDWFFTcfJF2MB7vE7T/v2NE+2GH19L5ktviX67wrc4FV4pTRVez0 5mtKZ2LAixfkrcazOjjQcjyZAUgzCPXqYhZrHtsuZCp6OIPNYtSapoKRdzB3gZiGXaw7 ZIOEUhrSY3uJXGlLE1JPh43HJGUE7PXYKB6Qw1BRUAxr4uizGGKKKz6YzLj3uDL7EzJS 17fczo+Feue5zseRnRFNBRxNvKrrQ/uyUhNf78KmT6wlwPiTHFTRBvwF5HyyxNIV2/k+ bIhcfFaThvpUnrhzTqodmMPw41WVRWNJQhFKOABZbA52uXIfEog1bEytInj/6HJalbk/ rC9A== X-Gm-Message-State: AC+VfDwvypg2MxHpam3kKrQeoyBLavCswil1u/zKeLDds503jfm82AbU mchyMDVIxZ/QhCq8fv9EkNYMeF/CMKZuWLRR82X76w== X-Received: by 2002:a92:c56a:0:b0:338:55b9:f1a3 with SMTP id b10-20020a92c56a000000b0033855b9f1a3mr137710ilj.7.1685715732071; Fri, 02 Jun 2023 07:22:12 -0700 (PDT) MIME-Version: 1.0 References: <35e983f5-7ed3-b310-d949-9ae8b130cdab@google.com> <88c445ae-552-5243-31a4-2674bac62d4d@google.com> In-Reply-To: From: Jann Horn Date: Fri, 2 Jun 2023 16:21:35 +0200 Message-ID: Subject: Re: [PATCH 01/12] mm/pgtable: add rcu_read_lock() and rcu_read_unlock()s To: Hugh Dickins Cc: Andrew Morton , Mike Kravetz , Mike Rapoport , "Kirill A. Shutemov" , Matthew Wilcox , David Hildenbrand , Suren Baghdasaryan , Qi Zheng , Yang Shi , Mel Gorman , Peter Xu , Peter Zijlstra , Will Deacon , Yu Zhao , Alistair Popple , Ralph Campbell , Ira Weiny , Steven Price , SeongJae Park , Naoya Horiguchi , Christophe Leroy , Zack Rusin , Jason Gunthorpe , Axel Rasmussen , Anshuman Khandual , Pasha Tatashin , Miaohe Lin , Minchan Kim , Christoph Hellwig , Song Liu , Thomas Hellstrom , Russell King , "David S. Miller" , Michael Ellerman , "Aneesh Kumar K.V" , Heiko Carstens , Christian Borntraeger , Claudio Imbrenda , Alexander Gordeev , linux-arm-kernel@lists.infradead.org, sparclinux@vger.kernel.org, linuxppc-dev@lists.ozlabs.org, linux-s390@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable X-Spam-Status: No, score=-17.6 required=5.0 tests=BAYES_00,DKIMWL_WL_MED, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF, ENV_AND_HDR_SPF_MATCH,RCVD_IN_DNSWL_NONE,SPF_HELO_NONE,SPF_PASS, T_SCC_BODY_TEXT_LINE,USER_IN_DEF_DKIM_WL,USER_IN_DEF_SPF_WL autolearn=unavailable autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Fri, Jun 2, 2023 at 4:50=E2=80=AFAM Hugh Dickins wrot= e: > On Wed, 31 May 2023, Jann Horn wrote: > > On Mon, May 29, 2023 at 8:15=E2=80=AFAM Hugh Dickins = wrote: > > > Before putting them to use (several commits later), add rcu_read_lock= () > > > to pte_offset_map(), and rcu_read_unlock() to pte_unmap(). Make this= a > > > separate commit, since it risks exposing imbalances: prior commits ha= ve > > > fixed all the known imbalances, but we may find some have been missed= . > > [...] > > > diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c > > > index c7ab18a5fb77..674671835631 100644 > > > --- a/mm/pgtable-generic.c > > > +++ b/mm/pgtable-generic.c > > > @@ -236,7 +236,7 @@ pte_t *__pte_offset_map(pmd_t *pmd, unsigned long= addr, pmd_t *pmdvalp) > > > { > > > pmd_t pmdval; > > > > > > - /* rcu_read_lock() to be added later */ > > > + rcu_read_lock(); > > > pmdval =3D pmdp_get_lockless(pmd); > > > if (pmdvalp) > > > *pmdvalp =3D pmdval; > > > > It might be a good idea to document that this series assumes that the > > first argument to __pte_offset_map() is a pointer into a second-level > > page table (and not a local copy of the entry) unless the containing > > VMA is known to not be THP-eligible or the page table is detached from > > the page table hierarchy or something like that. Currently a bunch of > > places pass references to local copies of the entry, and while I think > > all of these are fine, it would probably be good to at least document > > why these are allowed to do it while other places aren't. > > Thanks Jann: but I have to guess that here you are showing awareness of > an important issue that I'm simply ignorant of. > > I have been haunted by a dim recollection that there is one architecture > (arm-32?) which is fussy about the placement of the pmdval being examined > (deduces info missing from the arch-independent interface, by following > up the address?), but I couldn't track it down when I tried. > > Please tell me more; or better, don't spend your time explaining to me, > but please just send a link to a good reference on the issue. I'll be > unable to document what you ask there, without educating myself first. Sorry, I think I was somewhat confused about what was going on when I wrote that message. After this series, __pte_offset_map() looks as follows, with added comments describing my understanding of the semantics: // `pmd` points to one of: // case 1: a pmd_t stored outside a page table, // referencing a page table detached by the caller // case 2: a pmd_t stored outside a page table, which the caller copied // from a page table in an RCU-critical section that extends // until at least the end of this function // case 3: a pmd_t stored inside a page table pte_t *__pte_offset_map(pmd_t *pmd, unsigned long addr, pmd_t *pmdvalp) { unsigned long __maybe_unused flags; pmd_t pmdval; // begin an RCU section; this is needed for case 3 rcu_read_lock(); config_might_irq_save(flags); // read the pmd_t. // if the pmd_t references a page table, this page table can not // go away because: // - in case 1, the caller is the main owner of the page table // - in case 2, because the caller // started an RCU read-side critical section before the caller // read the original pmd_t. (This pmdp_get_lockless() is just // reading a copied pmd_t off the stack.) // - in case 3, because we started an RCU section above before // reading the pmd_t out of the page table here pmdval =3D pmdp_get_lockless(pmd); config_might_irq_restore(flags); if (pmdvalp) *pmdvalp =3D pmdval; if (unlikely(pmd_none(pmdval) || is_pmd_migration_entry(pmdval))) goto nomap; if (unlikely(pmd_trans_huge(pmdval) || pmd_devmap(pmdval))) goto nomap; if (unlikely(pmd_bad(pmdval))) { pmd_clear_bad(pmd); goto nomap; } return __pte_map(&pmdval, addr); nomap: rcu_read_unlock(); return NULL; } case 1 is what happens in __page_table_check_pte_clear_range(), __split_huge_zero_page_pmd() and __split_huge_pmd_locked(). case 2 happens in lockless page table traversal (gup_pte_range() and perf_get_pgtable_size()). case 3 is normal page table traversal under mmap lock or mapping lock. I think having a function like this that can run in three different contexts in which it is protected in three different ways is somewhat hard to understand without comments. Though maybe I'm thinking about it the wrong way? Basically my point is: __pte_offset_map() normally requires that the pmd argument points into a page table so that the rcu_read_lock() can provide protection starting from the time the pmd_t is read from a page table. The exception are cases where the caller has taken its own precautions to ensure that the referenced page table can not have been freed.