Date: Tue, 25 Oct 2022 04:21:39 +0100
From: Matthew Wilcox
To: Jann Horn
Cc: Linus Torvalds, Peter Zijlstra, John Hubbard, x86@kernel.org,
    akpm@linux-foundation.org, linux-kernel@vger.kernel.org,
    linux-mm@kvack.org, aarcange@redhat.com,
    kirill.shutemov@linux.intel.com, jroedel@suse.de, ubizjak@gmail.com
Subject: Re: [PATCH 01/13] mm: Update ptep_get_lockless()s comment

On Mon, Oct 24, 2022 at 10:23:51PM +0200, Jann Horn wrote:
> """
> This guarantees that the page tables that are being walked
> aren't freed concurrently, but at the end of the walk, we
> have to grab a stable reference to the referenced page.
> For this we use the grab-reference-and-revalidate trick
> from above again:
> First we (locklessly) load the page
> table entry, then we grab a reference to the page that it
> points to (which can fail if the refcount is zero, in that
> case we bail), then we recheck that the page table entry
> is still the same, and if it changed in between, we drop the
> page reference and bail.
> This can, again, grab a reference to a page after it has
> already been freed and reallocated.
> The reason why this is
> fine is that the metadata structure that holds this refcount,
> `struct folio` (or `struct page`, depending on which kernel
> version you're looking at; in current kernels it's `folio`
> but `struct page` and `struct folio` are actually aliases for
> the same memory, basically, though that is supposed to maybe
> change at some point) is never freed; even when a page is
> freed and reallocated, the corresponding `struct folio`
> stays. This does have the fun consequence that whenever a
> page/folio has a non-zero refcount, the refcount can
> spuriously go up and then back down for a little bit.
> (Also it's technically not as simple as I just described it,
> because the `struct page` that the PTE points to might be
> a "tail page" of a `struct folio`.
> So actually we first read the PTE, the PTE gives us the
> `page*`, then from that we go to the `folio*`, then we
> try to grab a reference to the `folio`, then if that worked
> we check that the `page` still points to the same `folio`,
> and then we recheck that the PTE is still the same.)
> """

Nngh.  In trying to make this description fit all kernels (with
both pages and folios), you've complicated it maximally.  Let's
try a simpler explanation:

First we (locklessly) load the page table entry, then we grab a
reference to the folio that contains the page it points to (which
can fail if the refcount is zero, in that case we bail), then we
recheck that the page table entry is still the same, and if it
changed in between, we drop the folio reference and bail.
This can, again, grab a reference to a folio after it has
already been freed and reallocated.  The reason why this is
fine is that the metadata structure that holds this refcount,
`struct folio`, is never freed; even when a folio is
freed and reallocated, the corresponding `struct folio`
stays.  This does have the fun consequence that whenever a
folio has a non-zero refcount, the refcount can
spuriously go up and then back down for a little bit.
(Also it's slightly more complex than I just described,
because the page that the PTE we just loaded points to might be
in the middle of being reallocated into a different folio.
So actually we first read the PTE, translate the PTE into the
`page*`, then from that we go to the `folio*`, then we
try to grab a reference to the `folio`, then if that worked
we check that the `page` is still in the same `folio`,
and then we recheck that the PTE is still the same.  Older kernels
did not make a clear distinction between pages and folios, so it
was even more confusing.)

Better?
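
For readers who want the sequence above in code form, here is a rough
userspace sketch of the grab-reference-and-revalidate steps. It is not
the kernel's code: `struct folio`, `struct page` and `pte_t` below are
toy stand-ins, and `folio_try_get()`, `folio_put()` and
`lockless_get_folio()` are hypothetical helper names made up for the
illustration. It assumes a PTE is simply a pointer to a page and that
the folio metadata is never freed.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-ins (assumptions for this sketch, not kernel definitions). */
struct folio {
	atomic_int refcount;            /* 0 means the folio is free */
};

struct page {
	_Atomic(struct folio *) folio;  /* folio this page currently belongs to */
};

/* Toy PTE: a pointer to a page, NULL meaning "not present". */
typedef _Atomic(struct page *) pte_t;

/* Take a reference unless the refcount is zero. */
static bool folio_try_get(struct folio *folio)
{
	int old = atomic_load(&folio->refcount);

	while (old != 0) {
		/* On failure, 'old' is reloaded and the zero check repeats. */
		if (atomic_compare_exchange_weak(&folio->refcount, &old, old + 1))
			return true;
	}
	return false;
}

/* Drop a reference previously taken with folio_try_get(). */
static void folio_put(struct folio *folio)
{
	int old = atomic_fetch_sub(&folio->refcount, 1);

	assert(old > 0);
}

/* Returns the folio with an extra reference held, or NULL if we bail. */
static struct folio *lockless_get_folio(pte_t *ptep)
{
	/* 1. Locklessly load the page table entry. */
	struct page *page = atomic_load(ptep);
	if (!page)
		return NULL;

	/* 2. Go from the page to its folio. */
	struct folio *folio = atomic_load(&page->folio);
	if (!folio)
		return NULL;

	/* 3. Try to grab a reference; fails if the refcount is zero. */
	if (!folio_try_get(folio))
		return NULL;

	/*
	 * 4. Check that the page is still in the same folio, and
	 * 5. recheck that the PTE is still the same.
	 * If either changed, drop the reference and bail.
	 */
	if (atomic_load(&page->folio) != folio || atomic_load(ptep) != page) {
		folio_put(folio);
		return NULL;
	}
	return folio;
}
```

The temporary increment in step 3 followed by `folio_put()` in the
bail-out path is where the "spuriously go up and then back down"
behaviour comes from. The caller is expected to call `folio_put()`
once it is done with a folio returned by `lockless_get_folio()`.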