From: Ryan Roberts
Date: Tue, 13 Feb 2024 13:03:53 +0000
Subject: Re: [PATCH v5 19/25] arm64/mm: Wire up PTE_CONT for user mappings
To: Mark Rutland
Cc: Catalin Marinas, Will Deacon, Ard Biesheuvel, Marc Zyngier, James Morse,
 Andrey Ryabinin, Andrew Morton, Matthew Wilcox, David Hildenbrand,
 Kefeng Wang, John Hubbard, Zi Yan, Barry Song <21cnbao@gmail.com>,
 Alistair Popple, Yang Shi, Nicholas Piggin, Christophe Leroy,
 "Aneesh Kumar K.V", "Naveen N. Rao", Thomas Gleixner, Ingo Molnar,
 Borislav Petkov, Dave Hansen, "H. Peter Anvin",
 linux-arm-kernel@lists.infradead.org, x86@kernel.org,
 linuxppc-dev@lists.ozlabs.org, linux-mm@kvack.org,
 linux-kernel@vger.kernel.org
References: <20240202080756.1453939-1-ryan.roberts@arm.com>
 <20240202080756.1453939-20-ryan.roberts@arm.com>

On 13/02/2024 12:02, Mark Rutland wrote:
> On Mon, Feb 12, 2024 at 12:59:57PM +0000, Ryan Roberts wrote:
>> On 12/02/2024 12:00, Mark Rutland wrote:
>>> Hi Ryan,
>
> [...]
>
>>>> +static inline void set_pte(pte_t *ptep, pte_t pte)
>>>> +{
>>>> +	/*
>>>> +	 * We don't have the mm or vaddr so cannot unfold contig entries (since
>>>> +	 * it requires tlb maintenance). set_pte() is not used in core code, so
>>>> +	 * this should never even be called. Regardless do our best to service
>>>> +	 * any call and emit a warning if there is any attempt to set a pte on
>>>> +	 * top of an existing contig range.
>>>> +	 */
>>>> +	pte_t orig_pte = __ptep_get(ptep);
>>>> +
>>>> +	WARN_ON_ONCE(pte_valid_cont(orig_pte));
>>>> +	__set_pte(ptep, pte_mknoncont(pte));
>>>> +}
>>>> +
>>>> +#define set_ptes set_ptes
>>>> +static inline void set_ptes(struct mm_struct *mm, unsigned long addr,
>>>> +				pte_t *ptep, pte_t pte, unsigned int nr)
>>>> +{
>>>> +	pte = pte_mknoncont(pte);
>>>
>>> Why do we have to clear the contiguous bit here? Is that for the same reason as
>>> set_pte(), or do we expect callers to legitimately call this with the
>>> contiguous bit set in 'pte'?
>>>
>>> I think you explained this to me in-person, and IIRC we don't expect callers to
>>> go set the bit themselves, but since it 'leaks' out to them via __ptep_get() we
>>> have to clear it here to defer the decision of whether to set/clear it when
>>> modifying entries. It would be nice if we could have a description of why/when
>>> we need to clear this, e.g. in the 'public API' comment block above.
>>
>> Yes, I think you've got it, but just to ram home the point: The PTE_CONT bit is
>> private to the architecture code and is never set directly by core code. If the
>> public API ever receives a pte that happens to have the PTE_CONT bit set, it
>> would be bad news if we then accidentally set that in the pgtable.
>>
>> Ideally, we would just unconditionally clear the bit before a getter returns
>> the pte (e.g. ptep_get(), ptep_get_lockless(), ptep_get_and_clear(), ...). That
>> way, the core code is guaranteed never to see a pte with the PTE_CONT bit set
>> and can therefore never accidentally pass such a pte into a setter function.
>> However, there is existing functionality that relies on being able to get a pte,
>> then pass it to pte_leaf_size(), an arch function that checks the PTE_CONT bit
>> to determine how big the leaf is. This is used in perf_get_pgtable_size().
>>
>> So to allow perf_get_pgtable_size() to continue to see the "real" page size, I
>> decided to allow PTE_CONT to leak through the getters and instead
>> unconditionally clear the bit when a pte is passed to any of the setters.
>>
>> I'll add a (slightly less verbose) comment as you suggest.
>
> Great, thanks!
>
> [...]
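(An aside for anyone following the thread: the pte_leaf_size() dependency
mentioned above boils down to roughly the following on arm64 - a paraphrased
sketch, not the exact definition:

	static inline unsigned long pte_leaf_size(pte_t pte)
	{
		return pte_cont(pte) ? CONT_PTE_SIZE : PAGE_SIZE;
	}

If the getters cleared PTE_CONT, perf_get_pgtable_size() would always see
PAGE_SIZE here; that's why the bit is allowed to leak out of the getters and
is cleared on the setter side instead.)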
>
>>>> +static inline bool mm_is_user(struct mm_struct *mm)
>>>> +{
>>>> +	/*
>>>> +	 * Don't attempt to apply the contig bit to kernel mappings, because
>>>> +	 * dynamically adding/removing the contig bit can cause page faults.
>>>> +	 * These racing faults are ok for user space, since they get serialized
>>>> +	 * on the PTL. But kernel mappings can't tolerate faults.
>>>> +	 */
>>>> +	return mm != &init_mm;
>>>> +}
>>>
>>> We also have the efi_mm as a non-user mm, though I don't think we manipulate
>>> that while it is live, and I'm not sure if that needs any special handling.
>>
>> Well we never need this function in the hot (order-0 folio) path, so I think I
>> could add a check for efi_mm here with no performance implication. It's
>> probably safest to explicitly exclude it? What do you think?
>
> That sounds ok to me.
>
> Otherwise, if we (somehow) know that we avoid calling this at all with an EFI
> mm (e.g. because of the way we construct that), I'd be happy with a comment.

We crossed streams - as per my other email, I'm confident that this is safe so
will just add a comment.

>
> Probably best to Cc Ard for whatever we do here.

Ard is already on CC.
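For concreteness, the explicit exclusion I have in mind is something like the
sketch below (not final; the IS_ENABLED() guard is my assumption, on the basis
that the efi_mm reference must compile away when CONFIG_EFI is disabled):

	static inline bool mm_is_user(struct mm_struct *mm)
	{
		/*
		 * Like init_mm, efi_mm is not a user mm and can't tolerate
		 * the transient faults that folding/unfolding the contig
		 * bit would generate, so exclude it explicitly.
		 */
		if (IS_ENABLED(CONFIG_EFI) && mm == &efi_mm)
			return false;
		return mm != &init_mm;
	}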
>
>>>> +static inline pte_t *contpte_align_down(pte_t *ptep)
>>>> +{
>>>> +	return (pte_t *)(ALIGN_DOWN((unsigned long)ptep >> 3, CONT_PTES) << 3);
>>>
>>> I think this can be:
>>>
>>> static inline pte_t *contpte_align_down(pte_t *ptep)
>>> {
>>> 	return PTR_ALIGN_DOWN(ptep, sizeof(*ptep) * CONT_PTES);
>>> }
>>
>> Yep - that's much less ugly - thanks!
>>
>>>
>>>> +
>>>> +static void contpte_convert(struct mm_struct *mm, unsigned long addr,
>>>> +			    pte_t *ptep, pte_t pte)
>>>> +{
>>>> +	struct vm_area_struct vma = TLB_FLUSH_VMA(mm, 0);
>>>> +	unsigned long start_addr;
>>>> +	pte_t *start_ptep;
>>>> +	int i;
>>>> +
>>>> +	start_ptep = ptep = contpte_align_down(ptep);
>>>> +	start_addr = addr = ALIGN_DOWN(addr, CONT_PTE_SIZE);
>>>> +	pte = pfn_pte(ALIGN_DOWN(pte_pfn(pte), CONT_PTES), pte_pgprot(pte));
>>>> +
>>>> +	for (i = 0; i < CONT_PTES; i++, ptep++, addr += PAGE_SIZE) {
>>>> +		pte_t ptent = __ptep_get_and_clear(mm, addr, ptep);
>>>> +
>>>> +		if (pte_dirty(ptent))
>>>> +			pte = pte_mkdirty(pte);
>>>> +
>>>> +		if (pte_young(ptent))
>>>> +			pte = pte_mkyoung(pte);
>>>> +	}
>>>
>>> Not a big deal either way, but I wonder if it makes more sense to accumulate
>>> the 'ptent' dirty/young values, then modify 'pte' once, i.e.
>>>
>>> 	bool dirty = false, young = false;
>>>
>>> 	for (...) {
>>> 		pte_t ptent = __ptep_get_and_clear(mm, addr, ptep);
>>> 		dirty |= pte_dirty(ptent);
>>> 		young |= pte_young(ptent);
>>> 	}
>>>
>>> 	if (dirty)
>>> 		pte_mkdirty(pte);
>>> 	if (young)
>>> 		pte_mkyoung(pte);
>>>
>>> I suspect that might generate slightly better code, but I'm also happy with the
>>> current form if people think that's more legible (I have no strong feelings
>>> either way).
>>
>> I kept it this way, because it's the same pattern used in arm64's hugetlbpage.c.
>> We also had the same comment against David's batching patches recently, and he
>> opted to stick with the former version:
>>
>> https://lore.kernel.org/linux-mm/d83309fa-4daa-430f-ae52-4e72162bca9a@redhat.com/
>>
>> So I'm inclined to leave it as is, since you're not insisting :)
>
> That rationale is reasonable, and I'm fine with this as-is.
>
> [...]
>
>>>> +pte_t contpte_ptep_get_lockless(pte_t *orig_ptep)
>>>> +{
>>>> +	/*
>>>> +	 * Gather access/dirty bits, which may be populated in any of the ptes
>>>> +	 * of the contig range. We may not be holding the PTL, so any contiguous
>>>> +	 * range may be unfolded/modified/refolded under our feet. Therefore we
>>>> +	 * ensure we read a _consistent_ contpte range by checking that all ptes
>>>> +	 * in the range are valid and have CONT_PTE set, that all pfns are
>>>> +	 * contiguous and that all pgprots are the same (ignoring access/dirty).
>>>> +	 * If we find a pte that is not consistent, then we must be racing with
>>>> +	 * an update so start again. If the target pte does not have CONT_PTE
>>>> +	 * set then that is considered consistent on its own because it is not
>>>> +	 * part of a contpte range.
>>>> +	 */
>>>> +
>>>> +	pgprot_t orig_prot;
>>>> +	unsigned long pfn;
>>>> +	pte_t orig_pte;
>>>> +	pgprot_t prot;
>>>> +	pte_t *ptep;
>>>> +	pte_t pte;
>>>> +	int i;
>>>> +
>>>> +retry:
>>>> +	orig_pte = __ptep_get(orig_ptep);
>>>> +
>>>> +	if (!pte_valid_cont(orig_pte))
>>>> +		return orig_pte;
>>>> +
>>>> +	orig_prot = pte_pgprot(pte_mkold(pte_mkclean(orig_pte)));
>>>> +	ptep = contpte_align_down(orig_ptep);
>>>> +	pfn = pte_pfn(orig_pte) - (orig_ptep - ptep);
>>>> +
>>>> +	for (i = 0; i < CONT_PTES; i++, ptep++, pfn++) {
>>>> +		pte = __ptep_get(ptep);
>>>> +		prot = pte_pgprot(pte_mkold(pte_mkclean(pte)));
>>>> +
>>>> +		if (!pte_valid_cont(pte) ||
>>>> +		    pte_pfn(pte) != pfn ||
>>>> +		    pgprot_val(prot) != pgprot_val(orig_prot))
>>>> +			goto retry;
>>>> +
>>>> +		if (pte_dirty(pte))
>>>> +			orig_pte = pte_mkdirty(orig_pte);
>>>> +
>>>> +		if (pte_young(pte))
>>>> +			orig_pte = pte_mkyoung(orig_pte);
>>>> +	}
>>>> +
>>>> +	return orig_pte;
>>>> +}
>>>> +EXPORT_SYMBOL(contpte_ptep_get_lockless);
>>>
>>> I'm struggling to convince myself that this is safe in general, as it really
>>> depends on how the caller will use this value. Which caller(s) actually care
>>> about the access/dirty bits, given those could change at any time anyway?
>>
>> I think your points below are valid, and agree we should try to make this work
>> without needing access/dirty if possible. But can you elaborate on why you don't
>> think it's safe?
>
> Having mulled this over, I think it is safe as-is, and I was being overly
> cautious.
>
> I had a general fear of potential problems stemming from the fact that (a) the
> accumulation of access/dirty bits isn't atomic and (b) the loop is unbounded.
> From looking at how this is used today, I think (a) is essentially the same as
> reading a stale non-contiguous entry, and I'm being overly cautious there. For
> (b), I think that's largely a performance concern and it would only retry
> indefinitely in the presence of mis-programmed entries or consistent racing
> with a writer under heavy contention.
>
> I think it's still desirable to avoid the accumulation in most cases (to avoid
> redundant work and to minimize the potential for unbounded retries), but I'm
> happy with that being a follow-up improvement.

Great! I'll do the conversion to ptep_get_lockless_nosync() as a follow-up
series.
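To make that concrete: the rough shape I have in mind for the follow-up - name
and exact semantics still to be agreed - is a variant that reads only the
single target entry and never iterates over the contpte block:

	static inline pte_t ptep_get_lockless_nosync(pte_t *ptep)
	{
		/*
		 * Single entry read; no attempt to accumulate access/dirty
		 * bits from the rest of a contpte range. Clear them (and
		 * PTE_CONT, as ever) so callers can't accidentally rely on
		 * them being accurate.
		 */
		return pte_mkold(pte_mkclean(pte_mknoncont(__ptep_get(ptep))));
	}

Callers doing the lockless-try...locked-compare dance would then also need to
compare modulo access/dirty, per the discussion below.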
>
>>> I took a quick scan, and AFAICT:
>>
>> Thanks for enumerating these; saves me from having to refresh my memory :)
>>>
>>> * For perf_get_pgtable_size(), we only care about whether the entry is valid
>>> and has the contig bit set. We could clean that up with a new interface, e.g.
>>> something like a new ptep_get_size_lockless().
>>>
>>> * For gup_pte_range(), I'm not sure we actually need the access/dirty bits when
>>> we look at the pte to start with, since we only care whether we can logically
>>> write to the page at that point.
>>>
>>> I see that we later follow up with:
>>>
>>>	pte_val(pte) != pte_val(ptep_get(ptep))
>>>
>>> ... is that why we need ptep_get_lockless() to accumulate the access/dirty
>>> bits? So that shape of lockless-try...locked-compare sequence works?
>>>
>>> * For huge_pte_alloc(), arm64 doesn't select CONFIG_ARCH_WANT_GENERAL_HUGETLB,
>>> so this doesn't seem to matter.
>>>
>>> * For __collapse_huge_page_swapin(), we only care if the pte is a swap pte,
>>> which means the pte isn't valid, and we'll return the orig_pte as-is anyway.
>>>
>>> * For pte_range_none() the access/dirty bits don't matter.
>>>
>>> * For handle_pte_fault() I think we have the same shape of
>>> lockless-try...locked-compare sequence as for gup_pte_range(), where we don't
>>> care about the access/dirty bits before we reach the locked compare step.
>>>
>>> * For ptdump_pte_entry() I think it's arguable that we should continue to
>>> report the access/dirty bits separately for each PTE, as we have done until
>>> now, to give an accurate representation of the contents of the translation
>>> tables.
>>>
>>> * For swap_vma_readahead() and unuse_pte_range() we only care if the PTE is a
>>> swap entry, the access/dirty bits don't matter.
>>>
>>> So AFAICT this only really matters for gup_pte_range() and handle_pte_fault(),
>>> and IIUC that's only so that the locklessly-loaded pte value can be compared
>>> with a subsequently locked-loaded entry (for which the access/dirty bits will
>>> be accumulated). Have I understood that correctly?
>>
>> Yes, I agree with what you are saying. My approach was to try to implement the
>> existing APIs accurately though, the argument being that it reduces the chances
>> of getting it wrong. But if you think the implementation is unsafe, then I guess
>> it blows that out of the water...
>
> I think your approach makes sense, and as above I'm happy to defer the API
> changes/additions to avoid the accumulation of access/dirty bits.
>
>>> If so, I wonder if we could instead do that comparison modulo the access/dirty
>>> bits,
>>
>> I think that would work - but will need to think a bit more on it.
>>
>>> and leave ptep_get_lockless() only reading a single entry?
>>
>> I think we will need to do something a bit less fragile. ptep_get() does collect
>> the access/dirty bits so it's confusing if ptep_get_lockless() doesn't IMHO. So
>> we will likely want to rename the function and make its documentation explicit
>> that it does not return those bits.
>>
>> ptep_get_lockless_noyoungdirty()? yuk... Any ideas?
>>
>> Of course if I could convince you the current implementation is safe, I might be
>> able to sidestep this optimization until a later date?
>
> Yep. :)
>
> Mark.
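PS: for the record, the "comparison modulo the access/dirty bits" would be
something along these lines (pte_same_norecency() is a made-up name, purely
for illustration):

	static inline int pte_same_norecency(pte_t a, pte_t b)
	{
		/* Compare ptes, ignoring the young and dirty bits. */
		a = pte_mkold(pte_mkclean(a));
		b = pte_mkold(pte_mkclean(b));
		return pte_same(a, b);
	}

gup_pte_range() and handle_pte_fault() would then use this in place of their
current pte_val() equality checks.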