Message-ID: <772de69a-27fa-4d39-a75d-54600d767ad1@arm.com>
Date: Mon, 15 Apr 2024 14:30:48 +0100
Subject: Re: [RFC PATCH v1 0/4] Reduce cost of ptep_get_lockless on arm64
To: David Hildenbrand, Mark Rutland, Catalin Marinas, Will Deacon,
 Alexander Shishkin, Jiri Olsa, Ian Rogers, Adrian Hunter, Andrew Morton,
 Muchun Song
Cc: linux-arm-kernel@lists.infradead.org, linux-mm@kvack.org,
 linux-kernel@vger.kernel.org
References: <20240215121756.2734131-1-ryan.roberts@arm.com>
 <0ae22147-e1a1-4bcb-8a4c-f900f3f8c39e@redhat.com>
 <374d8500-4625-4bff-a934-77b5f34cf2ec@arm.com>
 <8bd9e136-8575-4c40-bae2-9b015d823916@redhat.com>
 <86680856-2532-495b-951a-ea7b2b93872f@arm.com>
 <35236bbf-3d9a-40e9-84b5-e10e10295c0c@redhat.com>
 <4fba71aa-8a63-4a27-8eaf-92a69b2cff0d@arm.com>
 <5a23518b-7974-4b03-bd6e-80ecf6c39484@redhat.com>
 <81aa23ca-18b1-4430-9ad1-00a2c5af8fc2@arm.com>
 <70a36403-aefd-4311-b612-84e602465689@redhat.com>
 <3e50030d-2289-4470-a727-a293baa21618@redhat.com>
From: Ryan Roberts
In-Reply-To: <3e50030d-2289-4470-a727-a293baa21618@redhat.com>

On 15/04/2024 11:57, David Hildenbrand wrote:
> On 15.04.24 11:28, Ryan Roberts wrote:
>> On 12/04/2024 21:16, David Hildenbrand wrote:
>>>>
>>>> Yes agreed - 2 types; "lockless walkers that later recheck under PTL" and
>>>> "lockless walkers that never take the PTL".
>>>>
>>>> Detail: the part about disabling interrupts and TLB flush syncing is
>>>> arch-specific. That's not how arm64 does it (the hw broadcasts the TLBIs).
>>>> But you make that clear further down.
>>>
>>> Yes, but disabling interrupts is also required for RCU-freeing of page
>>> tables such that they can be walked safely. The TLB flush IPI is
>>> arch-specific and indeed to sync against PTE invalidation (before generic
>>> GUP-fast).
>>> [...]
>>>
>>>>>>
>>>>>> Could it be this easy? My head is hurting...
>>>>>
>>>>> I think what has to happen is:
>>>>>
>>>>> (1) ptep_get_lockless() must return the same value as ptep_get() as long
>>>>> as there are no races. No removal/addition of access/dirty bits etc.
>>>>
>>>> Today's arm64 ptep_get() guarantees this.
>>>>
>>>>>
>>>>> (2) Lockless page table walkers that later verify under the PTL can
>>>>> handle serious "garbage PTEs". This is our page fault handler.
>>>>
>>>> This isn't really a property of a ptep_get_lockless(); it's a statement
>>>> about a class of users. I agree with the statement.
>>>
>>> Yes.
>>> That's a requirement for the user of ptep_get_lockless(), such as page
>>> fault handlers. Well, mostly "not GUP".
>>>
>>>>
>>>>>
>>>>> (3) Lockless page table walkers that cannot verify under PTL cannot
>>>>> handle arbitrary garbage PTEs. This is GUP-fast. Two options:
>>>>>
>>>>> (3a) ptep_get_lockless() can atomically read the PTE: We re-check later
>>>>> if the atomically-read PTE is still unchanged (without PTL). No IPI for
>>>>> TLB flushes required. This is the common case. HW might concurrently set
>>>>> access/dirty bits, so we can race with that. But we don't read garbage.
>>>>
>>>> Today's arm64 ptep_get() cannot guarantee that the access/dirty bits are
>>>> consistent for contpte ptes. That's the bit that complicates the current
>>>> ptep_get_lockless() implementation.
>>>>
>>>> But the point I was trying to make is that GUP-fast does not actually
>>>> care about *all* the fields being consistent (e.g. access/dirty). So we
>>>> could spec ptep_get_lockless() to say that "all fields in the returned
>>>> pte are guaranteed to be self-consistent except for access and dirty
>>>> information, which may be inconsistent if a racing modification occurred".
>>>
>>> We *might* have KVM in the future want to check that a PTE is dirty, such
>>> that we can only allow dirty PTEs to be writable in a secondary MMU.
>>> That's not there yet, but one thing I was discussing on the list recently.
>>> Buried in:
>>>
>>> https://lkml.kernel.org/r/20240320005024.3216282-1-seanjc@google.com
>>>
>>> We wouldn't care about racing modifications, as long as MMU notifiers will
>>> properly notify us when the PTE would lose its dirty bits.
>>>
>>> But getting false-positive dirty bits would be problematic.
>>>
>>>>
>>>> This could mean that the access/dirty state *does* change for a given
>>>> page while GUP-fast is walking it, but GUP-fast *doesn't* detect that
>>>> change. I *think* that failing to detect this is benign.
>>>
>>> I mean, HW could just set the dirty/access bit immediately after the
>>> check. So if HW concurrently sets the bit and we don't observe that change
>>> when we recheck, I think that would be perfectly fine.
>>
>> Yes indeed; that's my point - GUP-fast doesn't care about access/dirty (or
>> soft-dirty or uffd-wp).
>>
>> But if you don't want to change the ptep_get_lockless() spec to explicitly
>> allow this (because you have the KVM use case where false-positive dirty is
>> problematic), then I think we are stuck with ptep_get_lockless() as
>> implemented for arm64 today.
>
> At least regarding the dirty bit, we'd have to guarantee that if
> ptep_get_lockless() returns a false-positive dirty bit, that the PTE recheck
> would be able to catch that.
>
> Would that be possible?

Hmm maybe. My head hurts. Let me try to work through some examples...

Let's imagine for this example, a contpte block is 4 PTEs. Let's say PTEs 0,
1, 2 and 3 initially contpte-map order-2 mTHP, FolioA. The dirty state is
stored in PTE0 for the contpte block, and it is dirty.

Now let's say there are 2 racing threads:

  - ThreadA is doing a GUP-fast for PTE3
  - ThreadB is remapping order-0 FolioB at PTE0

(ptep_get_lockless() below is actually arm64's ptep_get() for the sake of the
example - today's arm64 ptep_get_lockless() can handle the below correctly).

  ThreadA                          ThreadB
  =======                          =======

  gup_pte_range()
    pte1 = ptep_get_lockless(PTE3)
      READ_ONCE(PTE3)
                                   mmap(PTE0)
                                     clear_pte(PTE0)
                                       unfold(PTE0 - PTE3)
                                         WRITE_ONCE(PTE0, 0)
                                         WRITE_ONCE(PTE1, 0)
                                         WRITE_ONCE(PTE2, 0)
      READ_ONCE(PTE0) (for a/d)          << CLEAN!!
      READ_ONCE(PTE1) (for a/d)
      READ_ONCE(PTE2) (for a/d)
      READ_ONCE(PTE3) (for a/d)
    pte2 = ptep_get_lockless(PTE3)
      READ_ONCE(PTE3)
      READ_ONCE(PTE0) (for a/d)
      READ_ONCE(PTE1) (for a/d)
      READ_ONCE(PTE2) (for a/d)
      READ_ONCE(PTE3) (for a/d)
    true = pte_same(pte1, pte2)
                                         WRITE_ONCE(PTE3, 0)
                                         TLBI
                                         WRITE_ONCE(PTE0, )
                                         WRITE_ONCE(PTE1, )
                                         WRITE_ONCE(PTE2, )
                                         WRITE_ONCE(PTE3, )
                                       WRITE_ONCE(PTE0, 0)
                                   set_pte_at(PTE0, )

This example shows how a *false-negative* can be returned for the dirty state,
which isn't detected by the check.

I've been unable to come up with an example where a *false-positive* can be
returned for dirty state without the second ptep_get_lockless() noticing. In
this second example, let's assume everything is the same except FolioA is
initially clean:

  ThreadA                          ThreadB
  =======                          =======

  gup_pte_range()
    pte1 = ptep_get_lockless(PTE3)
      READ_ONCE(PTE3)
                                   mmap(PTE0)
                                     clear_pte(PTE0)
                                       unfold(PTE0 - PTE3)
                                         WRITE_ONCE(PTE0, 0)
                                         WRITE_ONCE(PTE1, 0)
                                         WRITE_ONCE(PTE2, 0)
                                         WRITE_ONCE(PTE3, 0)
                                         TLBI
                                         WRITE_ONCE(PTE0, )
                                         WRITE_ONCE(PTE1, )
                                         WRITE_ONCE(PTE2, )
                                         WRITE_ONCE(PTE3, )
                                       WRITE_ONCE(PTE0, 0)
                                   set_pte_at(PTE0, )
                                   write to FolioB - HW sets PTE0's dirty
      READ_ONCE(PTE0) (for a/d)    << DIRTY!!
      READ_ONCE(PTE1) (for a/d)
      READ_ONCE(PTE2) (for a/d)
      READ_ONCE(PTE3) (for a/d)
    pte2 = ptep_get_lockless(PTE3)
      READ_ONCE(PTE3)              << BUT THIS IS FOR FolioB
      READ_ONCE(PTE0) (for a/d)
      READ_ONCE(PTE1) (for a/d)
      READ_ONCE(PTE2) (for a/d)
      READ_ONCE(PTE3) (for a/d)
    false = pte_same(pte1, pte2)   << So this fails

The only way I can see false-positive not being caught in the second example
is if ThreadB subsequently remaps the original folio, so you have an ABA
scenario. But these lockless table walkers are already susceptible to that.

I think all the same arguments can be extended to the access bit.

For me this is all getting rather subtle and difficult to reason about and
even harder to spec in a comprehensible way.
The best I could come up with is:

"All fields in the returned pte are guaranteed to be self-consistent except
for access and dirty information, which may be inconsistent if a racing
modification occurred. Additionally it is guaranteed that false-positive
access and/or dirty information is not possible if 2 calls are made and both
ptes are the same. Only false-negative access and/or dirty information is
possible in this scenario."

which is starting to sound bonkers. Personally I think we are better off at
this point, just keeping today's arm64 ptep_get_lockless().

Thanks,
Ryan