From: Jann Horn
Date: Tue, 25 Oct 2022 15:44:34 +0200
Subject: Re: [PATCH 01/13] mm: Update ptep_get_lockless()s comment
To: Alistair Popple
Cc: Matthew Wilcox, Linus Torvalds, Peter Zijlstra, John Hubbard,
 x86@kernel.org, akpm@linux-foundation.org, linux-kernel@vger.kernel.org,
 linux-mm@kvack.org, aarcange@redhat.com, kirill.shutemov@linux.intel.com,
 jroedel@suse.de, ubizjak@gmail.com
References: <20221022111403.531902164@infradead.org>
 <20221022114424.515572025@infradead.org>
 <2c800ed1-d17a-def4-39e1-09281ee78d05@nvidia.com>
 <87fsfcuxu6.fsf@nvidia.com>
In-Reply-To: <87fsfcuxu6.fsf@nvidia.com>
X-Mailing-List: linux-kernel@vger.kernel.org

On Tue, Oct
25, 2022 at 10:11 AM Alistair Popple wrote:
>
> Matthew Wilcox writes:
>
> > On Mon, Oct 24, 2022 at 10:23:51PM +0200, Jann Horn wrote:
> >> """
> >> This guarantees that the page tables that are being walked
> >> aren't freed concurrently, but at the end of the walk, we
> >> have to grab a stable reference to the referenced page.
> >> For this we use the grab-reference-and-revalidate trick
> >> from above again:
> >> First we (locklessly) load the page
> >> table entry, then we grab a reference to the page that it
> >> points to (which can fail if the refcount is zero, in that
> >> case we bail), then we recheck that the page table entry
> >> is still the same, and if it changed in between, we drop the
> >> page reference and bail.
> >> This can, again, grab a reference to a page after it has
> >> already been freed and reallocated. The reason why this is
> >> fine is that the metadata structure that holds this refcount,
> >> `struct folio` (or `struct page`, depending on which kernel
> >> version you're looking at; in current kernels it's `folio`
> >> but `struct page` and `struct folio` are actually aliases for
> >> the same memory, basically, though that is supposed to maybe
> >> change at some point) is never freed; even when a page is
> >> freed and reallocated, the corresponding `struct folio`
> >> stays. This does have the fun consequence that whenever a
> >> page/folio has a non-zero refcount, the refcount can
> >> spuriously go up and then back down for a little bit.
> >> (Also it's technically not as simple as I just described it,
> >> because the `struct page` that the PTE points to might be
> >> a "tail page" of a `struct folio`.
> >> So actually we first read the PTE, the PTE gives us the
> >> `page*`, then from that we go to the `folio*`, then we
> >> try to grab a reference to the `folio`, then if that worked
> >> we check that the `page` still points to the same `folio`,
> >> and then we recheck that the PTE is still the same.)
> >> """
> >
> > Nngh. In trying to make this description fit all kernels (with
> > both pages and folios), you've complicated it maximally. Let's
> > try a more simple explanation:
> >
> > First we (locklessly) load the page table entry, then we grab a
> > reference to the folio that contains it (which can fail if the
> > refcount is zero, in that case we bail), then we recheck that the
> > page table entry is still the same, and if it changed in between,
> > we drop the folio reference and bail.
> > This can, again, grab a reference to a folio after it has
> > already been freed and reallocated. The reason why this is
> > fine is that the metadata structure that holds this refcount,
> > `struct folio` is never freed; even when a folio is
> > freed and reallocated, the corresponding `struct folio`
> > stays.

Oh, thanks. You're right, trying to talk about kernels with folios
made it unnecessarily complicated...

> I'm probably missing something obvious but how is that synchronised
> against memory hotplug? AFAICT if it isn't couldn't the pages be freed
> and memory removed? In that case the above would no longer hold because
> (I think) the metadata structure could have been freed.

Hm... that's this codepath?

arch_remove_memory -> __remove_pages -> __remove_section ->
sparse_remove_section -> section_deactivate ->
depopulate_section_memmap -> vmemmap_free -> remove_pagetable
which then walks down the page tables and ends up freeing individual
pages in remove_pte_table() using the confusingly-named
free_pagetable()?

I'm not sure what the synchronization against hotplug is - GUP-fast is
running with IRQs disabled, but other codepaths might not, like
get_ksm_page()? I don't know if that's holding something else for
protection...
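
For readers following along, the simplified grab-reference-and-revalidate
sequence quoted above can be modelled roughly as below. This is only a
single-threaded toy sketch in Python, not kernel code: `Folio`,
`lockless_get_folio()` and the `race` hook are hypothetical names made up
for illustration, the "page table" is a plain dict, and the atomic
"increment unless zero" is modelled with ordinary arithmetic. It also
assumes what the hotplug question above is probing, namely that the
metadata object itself never goes away.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass(eq=False)
class Folio:
    """Hypothetical stand-in for `struct folio`; never freed in this model."""
    refcount: int = 0

    def try_get(self) -> bool:
        # Models "grab a reference unless the refcount is zero".
        # A real implementation would need one atomic operation here.
        if self.refcount == 0:
            return False
        self.refcount += 1
        return True

    def put(self) -> None:
        assert self.refcount > 0
        self.refcount -= 1


def lockless_get_folio(
    page_table: Dict[int, Folio],
    addr: int,
    race: Optional[Callable[[], None]] = None,
) -> Optional[Folio]:
    pte = page_table.get(addr)          # 1. load the PTE without a lock
    if pte is None:
        return None
    if not pte.try_get():               # 2. refcount was zero: bail
        return None
    if race is not None:
        race()                          # simulated concurrent PTE change
    if page_table.get(addr) is not pte:  # 3. recheck the PTE
        pte.put()                       # 4. it changed: drop the ref, bail
        return None
    return pte                          # stable reference


# Uncontended case: the reference is taken and kept.
table = {0x1000: Folio(refcount=1)}
folio = table[0x1000]
got = lockless_get_folio(table, 0x1000)

# Racing remap: the old folio's refcount goes up and then back down.
other = Folio(refcount=1)
seen = []

def remap() -> None:
    seen.append(folio.refcount)         # transient, spurious bump
    table[0x1000] = other

lost = lockless_get_folio(table, 0x1000, race=remap)

# A folio with refcount zero can never be grabbed.
table[0x2000] = Folio(refcount=0)
none_for_free = lockless_get_folio(table, 0x2000)
```

The `seen` list is there only to make the "refcount can spuriously go up
and then back down" effect observable in the model; whether the folio
metadata really stays valid across memory hotplug is exactly the open
question in this mail and is not answered by the sketch.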