Received: by 2002:a05:7208:820a:b0:81:e143:7c29 with SMTP id q10csp573846rbf; Wed, 10 Apr 2024 13:09:52 -0700 (PDT) X-Forwarded-Encrypted: i=2; AJvYcCVye/sGAF5I50vprOn2uwDXGKo0iizXCHgS9+0Ghl+ggceONVcpROzRqVCfJuHaDtBui25ebaglDn6SpYem7n2vir/VsjxaRFy3L/VA0Q== X-Google-Smtp-Source: AGHT+IEiKsIfFZb+re2Li2PRQPU1p+JXesS+QOzLBJKrC56zefB3lOCRRHOnQ/2EgqS4Ub1cIWLi X-Received: by 2002:ac2:5bc6:0:b0:516:d471:a4a2 with SMTP id u6-20020ac25bc6000000b00516d471a4a2mr2523209lfn.36.1712779792494; Wed, 10 Apr 2024 13:09:52 -0700 (PDT) Return-Path: Received: from am.mirrors.kernel.org (am.mirrors.kernel.org. [147.75.80.249]) by mx.google.com with ESMTPS id y20-20020a1709060a9400b00a474ef5ecc7si6034870ejf.428.2024.04.10.13.09.52 for (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Wed, 10 Apr 2024 13:09:52 -0700 (PDT) Received-SPF: pass (google.com: domain of linux-kernel+bounces-139293-linux.lists.archive=gmail.com@vger.kernel.org designates 147.75.80.249 as permitted sender) client-ip=147.75.80.249; Authentication-Results: mx.google.com; dkim=neutral (body hash did not verify) header.i=@redhat.com header.s=mimecast20190719 header.b="h7YeKc/E"; arc=fail (body hash mismatch); spf=pass (google.com: domain of linux-kernel+bounces-139293-linux.lists.archive=gmail.com@vger.kernel.org designates 147.75.80.249 as permitted sender) smtp.mailfrom="linux-kernel+bounces-139293-linux.lists.archive=gmail.com@vger.kernel.org"; dmarc=fail (p=NONE sp=NONE dis=NONE) header.from=redhat.com Received: from smtp.subspace.kernel.org (wormhole.subspace.kernel.org [52.25.139.140]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by am.mirrors.kernel.org (Postfix) with ESMTPS id F30891F22073 for ; Wed, 10 Apr 2024 20:09:51 +0000 (UTC) Received: from localhost.localdomain (localhost.localdomain [127.0.0.1]) by smtp.subspace.kernel.org (Postfix) with ESMTP id C0A2B181BBF; Wed, 10 Apr 2024 20:09:37 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; dkim=fail reason="signature verification failed" (1024-bit key) header.d=redhat.com header.i=@redhat.com header.b="h7YeKc/E" Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.133.124]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id DA24E181322 for ; Wed, 10 Apr 2024 20:09:34 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=170.10.133.124 ARC-Seal:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1712779776; cv=none; b=ucAYZRddc8LtTvi1Qzq0nEb0uenltfRTpknULpGL8cG5NZ1hfEGdjgVDIn8WZGMUb8UQB5Hc2gaBfBbX23bqvNEn+vofXbjjEC0b93j3pAMHr95AOLmmYVqmE0N14sQ3km9D0JaE0iReejDnwiuIVaf8APUuqVxZz78ahsPgrjE= ARC-Message-Signature:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1712779776; c=relaxed/simple; bh=Ztwxg5kT3W9Fi1FrJcD+WWBTR6NXmWgeY4cLwjN6bI4=; h=Message-ID:Date:MIME-Version:Subject:To:Cc:References:From: In-Reply-To:Content-Type; b=nWPhlHmRsK/tzwZNDgh7MaDM2i2+kxrsBAD65OArnALXgHOcVzZAjgeUSJO3QNJXOWeys62xWdVAFuCodk9rxzMkwq3Iz8VGGFQtREVLc7wKQTBA4MCmCkEBZnHZ0p6b+Qv3iclGwdZPDXGpIypnE1G5jTrfV8apu6yjh7D/8Vo= ARC-Authentication-Results:i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=redhat.com; spf=pass smtp.mailfrom=redhat.com; dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com header.b=h7YeKc/E; arc=none smtp.client-ip=170.10.133.124 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=redhat.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=redhat.com DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1712779773; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references:autocrypt:autocrypt; bh=OQzTFEfcAfnvDEqgx/z+F5AA+521T37eGIcNNEK+rI8=; b=h7YeKc/E3ZIZ2/mOE59UrwgzEmB8lwJnmmB7O8YRrxi23ERxNbuADP7T4FN0F2234OKRYi pf6A2RWfyAPbRUqYcW/FzqplRpz4qXsPDYrS/PEawQI/QMJaevSzankhq9wjyRS2qUwj2v ONghHSXLMoIXSACToVBVq/FxREE8gEQ= Received: from mail-wm1-f71.google.com (mail-wm1-f71.google.com [209.85.128.71]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384) id us-mta-660-ZT1zjOTkNviCFlk4gRryPw-1; Wed, 10 Apr 2024 16:09:30 -0400 X-MC-Unique: ZT1zjOTkNviCFlk4gRryPw-1 Received: by mail-wm1-f71.google.com with SMTP id 5b1f17b1804b1-416a844695dso9558085e9.2 for ; Wed, 10 Apr 2024 13:09:30 -0700 (PDT) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1712779769; x=1713384569; h=content-transfer-encoding:in-reply-to:organization:autocrypt :content-language:from:references:cc:to:subject:user-agent :mime-version:date:message-id:x-gm-message-state:from:to:cc:subject :date:message-id:reply-to; bh=OQzTFEfcAfnvDEqgx/z+F5AA+521T37eGIcNNEK+rI8=; b=scVbBJqlZHw+gkN28+1NeJhr0ubwkpaAVgjbDYfsCF/KiH8Ys1lP1uBXa5UqQytZDr pjK4yAd6RkseLlJhr+pMYsnxtk29yrEIXrli0bsOZfJ8mRA07PAnj4aY5phFUtKVjNCd GhSaPc97QBBA3oUwwZXMvQ41OW44u5GzK7tDuYvQeHaj5kc4QSWsjm1skX/YLGEBa8N/ mwB3OWbh3oBzejel9g1NG0xIgs0tCP+JMLef+31ekuqQjdKuYQlYALI7DiFiHLIOIQDs c+ILUB0Aw/l1ZmlOsw2jA+4mMB64aUvwJH4rgQkzo++jGb1RIf/izw0Inq/pO2iOf92B 6FlA== X-Forwarded-Encrypted: i=1; AJvYcCWggeqFxadWVrEDGo9DTuEJm97vsuNQzAtOG7B919Jja8LIfpcy1b1bRgaZKuvNED7rzlyYcXKOrltViz7+YFo7bQIYH8zWGnIjabYg X-Gm-Message-State: AOJu0YyuY/ue+NxHL1neOl/jvMXLp5uDjP7uOyurHs/IhZQElNj2+bMn jlLZ6a1gVhx5CWfdL+IjK1P0cWGW1z/DTJQ8xswH3VD+lLFPCdhlz8BrxmGHR4QV3fh7WOPPAze uooNUqv9SZFBkhXLe5PuEVPpH6SaRgO6q+SXhMjeTXj1050eJfv52N+aoKL4msw== X-Received: by 2002:a7b:c4c9:0:b0:414:86a3:8e8d with SMTP id g9-20020a7bc4c9000000b0041486a38e8dmr2641199wmk.22.1712779769160; Wed, 10 Apr 2024 13:09:29 -0700 (PDT) X-Received: by 2002:a7b:c4c9:0:b0:414:86a3:8e8d with SMTP id g9-20020a7bc4c9000000b0041486a38e8dmr2641182wmk.22.1712779768760; Wed, 10 Apr 2024 13:09:28 -0700 (PDT) Received: from ?IPV6:2003:cb:c712:fa00:38eb:93ad:be38:d469? (p200300cbc712fa0038eb93adbe38d469.dip0.t-ipconnect.de. [2003:cb:c712:fa00:38eb:93ad:be38:d469]) by smtp.gmail.com with ESMTPSA id r6-20020a05600c458600b004162a9f03a6sm3320851wmo.7.2024.04.10.13.09.27 (version=TLS1_3 cipher=TLS_AES_128_GCM_SHA256 bits=128/128); Wed, 10 Apr 2024 13:09:28 -0700 (PDT) Message-ID: <5a23518b-7974-4b03-bd6e-80ecf6c39484@redhat.com> Date: Wed, 10 Apr 2024 22:09:27 +0200 Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 User-Agent: Mozilla Thunderbird Subject: Re: [RFC PATCH v1 0/4] Reduce cost of ptep_get_lockless on arm64 To: Ryan Roberts , Mark Rutland , Catalin Marinas , Will Deacon , Alexander Shishkin , Jiri Olsa , Ian Rogers , Adrian Hunter , Andrew Morton , Muchun Song Cc: linux-arm-kernel@lists.infradead.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org References: <20240215121756.2734131-1-ryan.roberts@arm.com> <0ae22147-e1a1-4bcb-8a4c-f900f3f8c39e@redhat.com> <374d8500-4625-4bff-a934-77b5f34cf2ec@arm.com> <8bd9e136-8575-4c40-bae2-9b015d823916@redhat.com> <86680856-2532-495b-951a-ea7b2b93872f@arm.com> <35236bbf-3d9a-40e9-84b5-e10e10295c0c@redhat.com> <4fba71aa-8a63-4a27-8eaf-92a69b2cff0d@arm.com> From: David Hildenbrand Content-Language: en-US Autocrypt: addr=david@redhat.com; keydata= xsFNBFXLn5EBEAC+zYvAFJxCBY9Tr1xZgcESmxVNI/0ffzE/ZQOiHJl6mGkmA1R7/uUpiCjJ dBrn+lhhOYjjNefFQou6478faXE6o2AhmebqT4KiQoUQFV4R7y1KMEKoSyy8hQaK1umALTdL QZLQMzNE74ap+GDK0wnacPQFpcG1AE9RMq3aeErY5tujekBS32jfC/7AnH7I0v1v1TbbK3Gp XNeiN4QroO+5qaSr0ID2sz5jtBLRb15RMre27E1ImpaIv2Jw8NJgW0k/D1RyKCwaTsgRdwuK Kx/Y91XuSBdz0uOyU/S8kM1+ag0wvsGlpBVxRR/xw/E8M7TEwuCZQArqqTCmkG6HGcXFT0V9 PXFNNgV5jXMQRwU0O/ztJIQqsE5LsUomE//bLwzj9IVsaQpKDqW6TAPjcdBDPLHvriq7kGjt WhVhdl0qEYB8lkBEU7V2Yb+SYhmhpDrti9Fq1EsmhiHSkxJcGREoMK/63r9WLZYI3+4W2rAc UucZa4OT27U5ZISjNg3Ev0rxU5UH2/pT4wJCfxwocmqaRr6UYmrtZmND89X0KigoFD/XSeVv jwBRNjPAubK9/k5NoRrYqztM9W6sJqrH8+UWZ1Idd/DdmogJh0gNC0+N42Za9yBRURfIdKSb B3JfpUqcWwE7vUaYrHG1nw54pLUoPG6sAA7Mehl3nd4pZUALHwARAQABzSREYXZpZCBIaWxk ZW5icmFuZCA8ZGF2aWRAcmVkaGF0LmNvbT7CwZgEEwEIAEICGwMGCwkIBwMCBhUIAgkKCwQW AgMBAh4BAheAAhkBFiEEG9nKrXNcTDpGDfzKTd4Q9wD/g1oFAl8Ox4kFCRKpKXgACgkQTd4Q 9wD/g1oHcA//a6Tj7SBNjFNM1iNhWUo1lxAja0lpSodSnB2g4FCZ4R61SBR4l/psBL73xktp rDHrx4aSpwkRP6Epu6mLvhlfjmkRG4OynJ5HG1gfv7RJJfnUdUM1z5kdS8JBrOhMJS2c/gPf wv1TGRq2XdMPnfY2o0CxRqpcLkx4vBODvJGl2mQyJF/gPepdDfcT8/PY9BJ7FL6Hrq1gnAo4 3Iv9qV0JiT2wmZciNyYQhmA1V6dyTRiQ4YAc31zOo2IM+xisPzeSHgw3ONY/XhYvfZ9r7W1l pNQdc2G+o4Di9NPFHQQhDw3YTRR1opJaTlRDzxYxzU6ZnUUBghxt9cwUWTpfCktkMZiPSDGd KgQBjnweV2jw9UOTxjb4LXqDjmSNkjDdQUOU69jGMUXgihvo4zhYcMX8F5gWdRtMR7DzW/YE BgVcyxNkMIXoY1aYj6npHYiNQesQlqjU6azjbH70/SXKM5tNRplgW8TNprMDuntdvV9wNkFs 9TyM02V5aWxFfI42+aivc4KEw69SE9KXwC7FSf5wXzuTot97N9Phj/Z3+jx443jo2NR34XgF 89cct7wJMjOF7bBefo0fPPZQuIma0Zym71cP61OP/i11ahNye6HGKfxGCOcs5wW9kRQEk8P9 M/k2wt3mt/fCQnuP/mWutNPt95w9wSsUyATLmtNrwccz63XOwU0EVcufkQEQAOfX3n0g0fZz Bgm/S2zF/kxQKCEKP8ID+Vz8sy2GpDvveBq4H2Y34XWsT1zLJdvqPI4af4ZSMxuerWjXbVWb T6d4odQIG0fKx4F8NccDqbgHeZRNajXeeJ3R7gAzvWvQNLz4piHrO/B4tf8svmRBL0ZB5P5A 2uhdwLU3NZuK22zpNn4is87BPWF8HhY0L5fafgDMOqnf4guJVJPYNPhUFzXUbPqOKOkL8ojk CXxkOFHAbjstSK5Ca3fKquY3rdX3DNo+EL7FvAiw1mUtS+5GeYE+RMnDCsVFm/C7kY8c2d0G NWkB9pJM5+mnIoFNxy7YBcldYATVeOHoY4LyaUWNnAvFYWp08dHWfZo9WCiJMuTfgtH9tc75 7QanMVdPt6fDK8UUXIBLQ2TWr/sQKE9xtFuEmoQGlE1l6bGaDnnMLcYu+Asp3kDT0w4zYGsx 5r6XQVRH4+5N6eHZiaeYtFOujp5n+pjBaQK7wUUjDilPQ5QMzIuCL4YjVoylWiBNknvQWBXS lQCWmavOT9sttGQXdPCC5ynI+1ymZC1ORZKANLnRAb0NH/UCzcsstw2TAkFnMEbo9Zu9w7Kv AxBQXWeXhJI9XQssfrf4Gusdqx8nPEpfOqCtbbwJMATbHyqLt7/oz/5deGuwxgb65pWIzufa N7eop7uh+6bezi+rugUI+w6DABEBAAHCwXwEGAEIACYCGwwWIQQb2cqtc1xMOkYN/MpN3hD3 AP+DWgUCXw7HsgUJEqkpoQAKCRBN3hD3AP+DWrrpD/4qS3dyVRxDcDHIlmguXjC1Q5tZTwNB boaBTPHSy/Nksu0eY7x6HfQJ3xajVH32Ms6t1trDQmPx2iP5+7iDsb7OKAb5eOS8h+BEBDeq 3ecsQDv0fFJOA9ag5O3LLNk+3x3q7e0uo06XMaY7UHS341ozXUUI7wC7iKfoUTv03iO9El5f XpNMx/YrIMduZ2+nd9Di7o5+KIwlb2mAB9sTNHdMrXesX8eBL6T9b+MZJk+mZuPxKNVfEQMQ a5SxUEADIPQTPNvBewdeI80yeOCrN+Zzwy/Mrx9EPeu59Y5vSJOx/z6OUImD/GhX7Xvkt3kq Er5KTrJz3++B6SH9pum9PuoE/k+nntJkNMmQpR4MCBaV/J9gIOPGodDKnjdng+mXliF3Ptu6 3oxc2RCyGzTlxyMwuc2U5Q7KtUNTdDe8T0uE+9b8BLMVQDDfJjqY0VVqSUwImzTDLX9S4g/8 kC4HRcclk8hpyhY2jKGluZO0awwTIMgVEzmTyBphDg/Gx7dZU1Xf8HFuE+UZ5UDHDTnwgv7E th6RC9+WrhDNspZ9fJjKWRbveQgUFCpe1sa77LAw+XFrKmBHXp9ZVIe90RMe2tRL06BGiRZr jPrnvUsUUsjRoRNJjKKA/REq+sAnhkNPPZ/NNMjaZ5b8Tovi8C0tmxiCHaQYqj7G2rgnT0kt WNyWQQ== Organization: Red Hat In-Reply-To: <4fba71aa-8a63-4a27-8eaf-92a69b2cff0d@arm.com> Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 7bit [...] Skipping the ptdesc stuff we discussed offline, to not get distracted. :) This stuff is killing me, sorry for the lengthy reply ... > > So I've been looking at all this again, and getting myself even more confused. > > I believe there are 2 classes of ptep_get_lockless() caller: > > 1) vmf->orig_pte = ptep_get_lockless(vmf->pte); in handle_pte_fault() > 2) everyone else Likely only completely lockless page table walkers where we *cannot* recheck under PTL is special. Essentially where we disable interrupts and rely on TLB flushes to sync against concurrent changes. Let's take a look where ptep_get_lockless() comes from: commit 2a4a06da8a4b93dd189171eed7a99fffd38f42f3 Author: Peter Zijlstra Date: Fri Nov 13 11:41:40 2020 +0100 mm/gup: Provide gup_get_pte() more generic In order to write another lockless page-table walker, we need gup_get_pte() exposed. While doing that, rename it to ptep_get_lockless() to match the existing ptep_get() naming. GUP-fast, when we were still relying on TLB flushes to sync against GUP-fast. "With get_user_pages_fast(), we walk down the pagetables without taking any locks. For this we would like to load the pointers atomically, but sometimes that is not possible (e.g. without expensive cmpxchg8b on x86_32 PAE). What we do have is the guarantee that a PTE will only either go from not present to present, or present to not present or both -- it will not switch to a completely different present page without a TLB flush in between; something hat we are blocking by holding interrupts off." Later, we added support for GUP-fast that introduced the !TLB variant: commit 2667f50e8b81457fcb4a3dbe6aff3e81ea009e13 Author: Steve Capper Date: Thu Oct 9 15:29:14 2014 -0700 mm: introduce a general RCU get_user_pages_fast() With the pattern /* * In the line below we are assuming that the pte can be read * atomically. If this is not the case for your architecture, * please wrap this in a helper function! * * for an example see gup_get_pte in arch/x86/mm/gup.c */ pte_t pte = ACCESS_ONCE(*ptep); .. if (unlikely(pte_val(pte) != pte_val(*ptep))) { .. Whereby the mentioned arch/x86/mm/gup.c code did a straight pte_t pte = gup_get_pte(ptep) without any re-reading of PTEs. The PTE that was read was required to be sane, this the lengthy comment above ptep_get_lockless() that talks about TLB flushes. The comment above ptep_get_lockless() for CONFIG_GUP_GET_PXX_LOW_HIGH is still full of details about TLB flushes syncing against GUP-fast. But as you note, we use it even in contexts where we don't disable interrupts and the TLB flush can't help. We don't disable interrupts during page faults ... so most of the things described in ptep_get_lockless() don't really apply. That's also the reason why ... > > vmf->pte = pte_offset_map(vmf->pmd, vmf->address); > - vmf->orig_pte = *vmf->pte; > + vmf->orig_pte = ptep_get_lockless(vmf->pte); > vmf->flags |= FAULT_FLAG_ORIG_PTE_VALID; > > - /* > - * some architectures can have larger ptes than wordsize, > - * e.g.ppc44x-defconfig has CONFIG_PTE_64BIT=y and > - * CONFIG_32BIT=y, so READ_ONCE cannot guarantee atomic > - * accesses. The code below just needs a consistent view > - * for the ifs and we later double check anyway with the > - * ptl lock held. So here a barrier will do. > - */ > - barrier(); > if (pte_none(vmf->orig_pte)) { .. that code was in place. We would possibly read garbage PTEs, but would recheck *under PTL* (where the PTE can no longer change) that what we read wasn't garbage and didn't change. > > -- > > (2) All the other users require that a subset of the pte fields are > self-consistent; specifically they don't care about access, dirty, uffd-wp or > soft-dirty. arm64 can guarrantee that all the other bits are self-consistent > just by calling ptep_get(). Let's focus on access+dirty for now ;) > > -- > > So, I'm making the bold claim that it was never neccessary to specialize > pte_get_lockless() on arm64, and I *think* we could just delete it so that > ptep_get_lockless() resolves to ptep_get() on arm64. That solves the original > aim without needing to introduce "norecency" variants. > > Additionally I propose documenting ptep_get_lockless() to describe the set of > fields that are guarranteed to be self-consistent and the remaining fields which > are self-consistent only with best-effort. > > Could it be this easy? My head is hurting... I think what has to happen is: (1) pte_get_lockless() must return the same value as ptep_get() as long as there are no races. No removal/addition of access/dirty bits etc. (2) Lockless page table walkers that later verify under the PTL can handle serious "garbage PTEs". This is our page fault handler. (3) Lockless page table walkers that cannot verify under PTL cannot handle arbitrary garbage PTEs. This is GUP-fast. Two options: (3a) pte_get_lockless() can atomically read the PTE: We re-check later if the atomically-read PTE is still unchanged (without PTL). No IPI for TLB flushes required. This is the common case. HW might concurrently set access/dirty bits, so we can race with that. But we don't read garbage. (3b) pte_get_lockless() cannot atomically read the PTE: We need special magic to read somewhat-sane PTEs and need IPIs during TLB flushes to sync against serious PTE changes (e.g., present -> present). This is weird x86-PAE. If ptep_get() on arm64 can do (1), (2) and (3a), we might be good. My head is hurting ... -- Cheers, David / dhildenb