Received: by 2002:a05:7412:bbc7:b0:fc:a2b0:25d7 with SMTP id kh7csp686959rdb; Fri, 2 Feb 2024 00:13:53 -0800 (PST) X-Google-Smtp-Source: AGHT+IHM8JvMlhFRyx0Y6YM4BSAlT4z2sYs3fWj+isd/xkJZmbH+O3lahFo98dsSJBXuc7fwNooa X-Received: by 2002:a17:907:2905:b0:a36:917f:556f with SMTP id eq5-20020a170907290500b00a36917f556fmr747914ejc.7.1706861633669; Fri, 02 Feb 2024 00:13:53 -0800 (PST) ARC-Seal: i=2; a=rsa-sha256; t=1706861633; cv=pass; d=google.com; s=arc-20160816; b=TLKLrGnfXLwRVd67B0bbSfaF2B2JAau+d2a/wOxbFjg+og3NL4/hzSDrTaNJFEaNQE G/sRArUSz2cE8F+E98boDt/wnHwwrVGUJWn9lTgrQGlqVdH9mcbnY1vi3/JG81oZvMka q12TP6tNOP+clu1O6FaGg2RfXTo6v+A+9jOwPuftnXMepVrhB1suSFrHI8ZrJgGQR4OS DVwi4DtxOEFtumjZ842kDlYml+q4sTKNC0Txg4Es5pzD7blEadPtxAtjT305MRcjPBcV oekoHAW1TulPfyiIoSQwQMGEWYZtZ0TFnE7UcmpzVQPzCJQmFSgn+eIojqHOFCNVpFID BUvA== ARC-Message-Signature: i=2; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=content-transfer-encoding:mime-version:list-unsubscribe :list-subscribe:list-id:precedence:references:in-reply-to:message-id :date:subject:cc:to:from; bh=P1xu3BLf2o1r+yd8h0LJeygbJrOGinDpUNJMX+u+Vlc=; fh=cJeoLc94Uyn+YB+ENEHmh1sdOwlK0BuvkduELIbRMFM=; b=jp4eQq8cI5OkYzWkSnlAhgO/Jtrdm6dTud0lf1J4/xU4YN/dN5wAXROwjKMru4Zxvc tdflPtCOmCn7R8NewmhKHmAuKm4EB8AphxI1m9T6M0qFC+v12pbbOhVHS0KxWHJSPHIk qdHCMg2BMC8A+zOLtLutJsYRvuYJrXANkNJlAjg0msYJuMuO+ddKvfOxeUoFVcVVYgGX LUC+D/5FUKkE4GpCT6xkOQ7U7CmOYwSWxd6gL/tFMzlQqBSIGFM032NXyBGL9BzU9tZl xdc1Z6vNKrjMrba4gdmLf5y8f6bP34MBNgZjviThROglg6QRTCxrtdCiyBste3a2dP9Z nOzA==; dara=google.com ARC-Authentication-Results: i=2; mx.google.com; arc=pass (i=1 spf=pass spfdomain=arm.com dmarc=pass fromdomain=arm.com); spf=pass (google.com: domain of linux-kernel+bounces-49431-linux.lists.archive=gmail.com@vger.kernel.org designates 2604:1380:4601:e00::3 as permitted sender) smtp.mailfrom="linux-kernel+bounces-49431-linux.lists.archive=gmail.com@vger.kernel.org"; dmarc=fail (p=NONE sp=NONE dis=NONE) header.from=arm.com X-Forwarded-Encrypted: i=1; AJvYcCXvDFN02Gw+I6wOIR0ZtEI65YYKmOFWUxY5i7/WKkQQGiO/FYyvOizd6Ri1Hy3UYKeuAIEMMFiarD9bRuD4fne9jOhFFU6MvVHAf8Ud9A== Return-Path: Received: from am.mirrors.kernel.org (am.mirrors.kernel.org. [2604:1380:4601:e00::3]) by mx.google.com with ESMTPS id v27-20020a1709060b5b00b00a2c342cba71si608478ejg.201.2024.02.02.00.13.53 for (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Fri, 02 Feb 2024 00:13:53 -0800 (PST) Received-SPF: pass (google.com: domain of linux-kernel+bounces-49431-linux.lists.archive=gmail.com@vger.kernel.org designates 2604:1380:4601:e00::3 as permitted sender) client-ip=2604:1380:4601:e00::3; Authentication-Results: mx.google.com; arc=pass (i=1 spf=pass spfdomain=arm.com dmarc=pass fromdomain=arm.com); spf=pass (google.com: domain of linux-kernel+bounces-49431-linux.lists.archive=gmail.com@vger.kernel.org designates 2604:1380:4601:e00::3 as permitted sender) smtp.mailfrom="linux-kernel+bounces-49431-linux.lists.archive=gmail.com@vger.kernel.org"; dmarc=fail (p=NONE sp=NONE dis=NONE) header.from=arm.com Received: from smtp.subspace.kernel.org (wormhole.subspace.kernel.org [52.25.139.140]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by am.mirrors.kernel.org (Postfix) with ESMTPS id 025C21F23EFA for ; Fri, 2 Feb 2024 08:13:53 +0000 (UTC) Received: from localhost.localdomain (localhost.localdomain [127.0.0.1]) by smtp.subspace.kernel.org (Postfix) with ESMTP id 74F684D110; Fri, 2 Feb 2024 08:09:33 +0000 (UTC) Received: from foss.arm.com (foss.arm.com [217.140.110.172]) by smtp.subspace.kernel.org (Postfix) with ESMTP id 828BA4C628 for ; Fri, 2 Feb 2024 08:09:29 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=217.140.110.172 ARC-Seal:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1706861372; cv=none; b=olcJvcB1ZYRlOtgLMz4JiQupGOdNXvUIwf5K3Ra0otVQrnyhT/vkAN0zP9wnUDOPz3HHyqGFcU7HDidgY2J+DmtkIDlQEiZoBtAgJBYbRIk9arkwA6HjywHVOkPQFvUG09n6GUaMqUVUj4q81MLcqPfyejnsMLJa2UyFGbsFzcI= ARC-Message-Signature:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1706861372; c=relaxed/simple; bh=2fecQm6aPUKAUaBgft2S1M+Znf/9o42Nkd6od0oPVEM=; h=From:To:Cc:Subject:Date:Message-Id:In-Reply-To:References: MIME-Version; b=HhAE+FM3zmk8IOMAsRevxvRlPTDmVu1V6+dBnm9heuN1M8lxkDmYIiSSanq5RTV8WhjnkB0CEUiBVqxqJvFq9NLAWozDxRTod8x6JvjXRKh8EsiJoQwKVohNJZl6cHwJHpswAsKWqfJAQVlLtuSenbeAZ4mEGJLQe0tTi/B5/AM= ARC-Authentication-Results:i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=arm.com; spf=pass smtp.mailfrom=arm.com; arc=none smtp.client-ip=217.140.110.172 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=arm.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=arm.com Received: from usa-sjc-imap-foss1.foss.arm.com (unknown [10.121.207.14]) by usa-sjc-mx-foss1.foss.arm.com (Postfix) with ESMTP id 44A331BB2; Fri, 2 Feb 2024 00:10:11 -0800 (PST) Received: from e125769.cambridge.arm.com (e125769.cambridge.arm.com [10.1.196.26]) by usa-sjc-imap-foss1.foss.arm.com (Postfix) with ESMTPSA id 55C663F8A4; Fri, 2 Feb 2024 00:09:25 -0800 (PST) From: Ryan Roberts To: Catalin Marinas , Will Deacon , Ard Biesheuvel , Marc Zyngier , James Morse , Andrey Ryabinin , Andrew Morton , Matthew Wilcox , Mark Rutland , David Hildenbrand , Kefeng Wang , John Hubbard , Zi Yan , Barry Song <21cnbao@gmail.com>, Alistair Popple , Yang Shi , Nicholas Piggin , Christophe Leroy , "Aneesh Kumar K.V" , "Naveen N. Rao" , Thomas Gleixner , Ingo Molnar , Borislav Petkov , Dave Hansen , "H. Peter Anvin" Cc: Ryan Roberts , linux-arm-kernel@lists.infradead.org, x86@kernel.org, linuxppc-dev@lists.ozlabs.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org Subject: [PATCH v5 19/25] arm64/mm: Wire up PTE_CONT for user mappings Date: Fri, 2 Feb 2024 08:07:50 +0000 Message-Id: <20240202080756.1453939-20-ryan.roberts@arm.com> X-Mailer: git-send-email 2.25.1 In-Reply-To: <20240202080756.1453939-1-ryan.roberts@arm.com> References: <20240202080756.1453939-1-ryan.roberts@arm.com> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: 8bit With the ptep API sufficiently refactored, we can now introduce a new "contpte" API layer, which transparently manages the PTE_CONT bit for user mappings. In this initial implementation, only suitable batches of PTEs, set via set_ptes(), are mapped with the PTE_CONT bit. Any subsequent modification of individual PTEs will cause an "unfold" operation to repaint the contpte block as individual PTEs before performing the requested operation. While, a modification of a single PTE could cause the block of PTEs to which it belongs to become eligible for "folding" into a contpte entry, "folding" is not performed in this initial implementation due to the costs of checking the requirements are met. Due to this, contpte mappings will degrade back to normal pte mappings over time if/when protections are changed. This will be solved in a future patch. Since a contpte block only has a single access and dirty bit, the semantic here changes slightly; when getting a pte (e.g. ptep_get()) that is part of a contpte mapping, the access and dirty information are pulled from the block (so all ptes in the block return the same access/dirty info). When changing the access/dirty info on a pte (e.g. ptep_set_access_flags()) that is part of a contpte mapping, this change will affect the whole contpte block. This is works fine in practice since we guarantee that only a single folio is mapped by a contpte block, and the core-mm tracks access/dirty information per folio. In order for the public functions, which used to be pure inline, to continue to be callable by modules, export all the contpte_* symbols that are now called by those public inline functions. The feature is enabled/disabled with the ARM64_CONTPTE Kconfig parameter at build time. It defaults to enabled as long as its dependency, TRANSPARENT_HUGEPAGE is also enabled. The core-mm depends upon TRANSPARENT_HUGEPAGE to be able to allocate large folios, so if its not enabled, then there is no chance of meeting the physical contiguity requirement for contpte mappings. Tested-by: John Hubbard Signed-off-by: Ryan Roberts --- arch/arm64/Kconfig | 9 + arch/arm64/include/asm/pgtable.h | 161 ++++++++++++++++++ arch/arm64/mm/Makefile | 1 + arch/arm64/mm/contpte.c | 283 +++++++++++++++++++++++++++++++ 4 files changed, 454 insertions(+) create mode 100644 arch/arm64/mm/contpte.c diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig index d86d7f4758b5..1442e8ed95b6 100644 --- a/arch/arm64/Kconfig +++ b/arch/arm64/Kconfig @@ -2230,6 +2230,15 @@ config UNWIND_PATCH_PAC_INTO_SCS select UNWIND_TABLES select DYNAMIC_SCS +config ARM64_CONTPTE + bool "Contiguous PTE mappings for user memory" if EXPERT + depends on TRANSPARENT_HUGEPAGE + default y + help + When enabled, user mappings are configured using the PTE contiguous + bit, for any mappings that meet the size and alignment requirements. + This reduces TLB pressure and improves performance. + endmenu # "Kernel Features" menu "Boot options" diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h index 7dc6b68ee516..34892a95403d 100644 --- a/arch/arm64/include/asm/pgtable.h +++ b/arch/arm64/include/asm/pgtable.h @@ -133,6 +133,10 @@ static inline pteval_t __phys_to_pte_val(phys_addr_t phys) */ #define pte_valid_not_user(pte) \ ((pte_val(pte) & (PTE_VALID | PTE_USER | PTE_UXN)) == (PTE_VALID | PTE_UXN)) +/* + * Returns true if the pte is valid and has the contiguous bit set. + */ +#define pte_valid_cont(pte) (pte_valid(pte) && pte_cont(pte)) /* * Could the pte be present in the TLB? We must check mm_tlb_flush_pending * so that we don't erroneously return false for pages that have been @@ -1135,6 +1139,161 @@ void vmemmap_update_pte(unsigned long addr, pte_t *ptep, pte_t pte); #define vmemmap_update_pte vmemmap_update_pte #endif +#ifdef CONFIG_ARM64_CONTPTE + +/* + * The contpte APIs are used to transparently manage the contiguous bit in ptes + * where it is possible and makes sense to do so. The PTE_CONT bit is considered + * a private implementation detail of the public ptep API (see below). + */ +extern void __contpte_try_unfold(struct mm_struct *mm, unsigned long addr, + pte_t *ptep, pte_t pte); +extern pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte); +extern pte_t contpte_ptep_get_lockless(pte_t *orig_ptep); +extern void contpte_set_ptes(struct mm_struct *mm, unsigned long addr, + pte_t *ptep, pte_t pte, unsigned int nr); +extern int contpte_ptep_test_and_clear_young(struct vm_area_struct *vma, + unsigned long addr, pte_t *ptep); +extern int contpte_ptep_clear_flush_young(struct vm_area_struct *vma, + unsigned long addr, pte_t *ptep); +extern int contpte_ptep_set_access_flags(struct vm_area_struct *vma, + unsigned long addr, pte_t *ptep, + pte_t entry, int dirty); + +static inline void contpte_try_unfold(struct mm_struct *mm, unsigned long addr, + pte_t *ptep, pte_t pte) +{ + if (unlikely(pte_valid_cont(pte))) + __contpte_try_unfold(mm, addr, ptep, pte); +} + +/* + * The below functions constitute the public API that arm64 presents to the + * core-mm to manipulate PTE entries within their page tables (or at least this + * is the subset of the API that arm64 needs to implement). These public + * versions will automatically and transparently apply the contiguous bit where + * it makes sense to do so. Therefore any users that are contig-aware (e.g. + * hugetlb, kernel mapper) should NOT use these APIs, but instead use the + * private versions, which are prefixed with double underscore. All of these + * APIs except for ptep_get_lockless() are expected to be called with the PTL + * held. + */ + +#define ptep_get ptep_get +static inline pte_t ptep_get(pte_t *ptep) +{ + pte_t pte = __ptep_get(ptep); + + if (likely(!pte_valid_cont(pte))) + return pte; + + return contpte_ptep_get(ptep, pte); +} + +#define ptep_get_lockless ptep_get_lockless +static inline pte_t ptep_get_lockless(pte_t *ptep) +{ + pte_t pte = __ptep_get(ptep); + + if (likely(!pte_valid_cont(pte))) + return pte; + + return contpte_ptep_get_lockless(ptep); +} + +static inline void set_pte(pte_t *ptep, pte_t pte) +{ + /* + * We don't have the mm or vaddr so cannot unfold contig entries (since + * it requires tlb maintenance). set_pte() is not used in core code, so + * this should never even be called. Regardless do our best to service + * any call and emit a warning if there is any attempt to set a pte on + * top of an existing contig range. + */ + pte_t orig_pte = __ptep_get(ptep); + + WARN_ON_ONCE(pte_valid_cont(orig_pte)); + __set_pte(ptep, pte_mknoncont(pte)); +} + +#define set_ptes set_ptes +static inline void set_ptes(struct mm_struct *mm, unsigned long addr, + pte_t *ptep, pte_t pte, unsigned int nr) +{ + pte = pte_mknoncont(pte); + + if (likely(nr == 1)) { + contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep)); + __set_ptes(mm, addr, ptep, pte, 1); + } else { + contpte_set_ptes(mm, addr, ptep, pte, nr); + } +} + +static inline void pte_clear(struct mm_struct *mm, + unsigned long addr, pte_t *ptep) +{ + contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep)); + __pte_clear(mm, addr, ptep); +} + +#define __HAVE_ARCH_PTEP_GET_AND_CLEAR +static inline pte_t ptep_get_and_clear(struct mm_struct *mm, + unsigned long addr, pte_t *ptep) +{ + contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep)); + return __ptep_get_and_clear(mm, addr, ptep); +} + +#define __HAVE_ARCH_PTEP_TEST_AND_CLEAR_YOUNG +static inline int ptep_test_and_clear_young(struct vm_area_struct *vma, + unsigned long addr, pte_t *ptep) +{ + pte_t orig_pte = __ptep_get(ptep); + + if (likely(!pte_valid_cont(orig_pte))) + return __ptep_test_and_clear_young(vma, addr, ptep); + + return contpte_ptep_test_and_clear_young(vma, addr, ptep); +} + +#define __HAVE_ARCH_PTEP_CLEAR_YOUNG_FLUSH +static inline int ptep_clear_flush_young(struct vm_area_struct *vma, + unsigned long addr, pte_t *ptep) +{ + pte_t orig_pte = __ptep_get(ptep); + + if (likely(!pte_valid_cont(orig_pte))) + return __ptep_clear_flush_young(vma, addr, ptep); + + return contpte_ptep_clear_flush_young(vma, addr, ptep); +} + +#define __HAVE_ARCH_PTEP_SET_WRPROTECT +static inline void ptep_set_wrprotect(struct mm_struct *mm, + unsigned long addr, pte_t *ptep) +{ + contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep)); + __ptep_set_wrprotect(mm, addr, ptep); +} + +#define __HAVE_ARCH_PTEP_SET_ACCESS_FLAGS +static inline int ptep_set_access_flags(struct vm_area_struct *vma, + unsigned long addr, pte_t *ptep, + pte_t entry, int dirty) +{ + pte_t orig_pte = __ptep_get(ptep); + + entry = pte_mknoncont(entry); + + if (likely(!pte_valid_cont(orig_pte))) + return __ptep_set_access_flags(vma, addr, ptep, entry, dirty); + + return contpte_ptep_set_access_flags(vma, addr, ptep, entry, dirty); +} + +#else /* CONFIG_ARM64_CONTPTE */ + #define ptep_get __ptep_get #define set_pte __set_pte #define set_ptes __set_ptes @@ -1150,6 +1309,8 @@ void vmemmap_update_pte(unsigned long addr, pte_t *ptep, pte_t pte); #define __HAVE_ARCH_PTEP_SET_ACCESS_FLAGS #define ptep_set_access_flags __ptep_set_access_flags +#endif /* CONFIG_ARM64_CONTPTE */ + #endif /* !__ASSEMBLY__ */ #endif /* __ASM_PGTABLE_H */ diff --git a/arch/arm64/mm/Makefile b/arch/arm64/mm/Makefile index dbd1bc95967d..60454256945b 100644 --- a/arch/arm64/mm/Makefile +++ b/arch/arm64/mm/Makefile @@ -3,6 +3,7 @@ obj-y := dma-mapping.o extable.o fault.o init.o \ cache.o copypage.o flush.o \ ioremap.o mmap.o pgd.o mmu.o \ context.o proc.o pageattr.o fixmap.o +obj-$(CONFIG_ARM64_CONTPTE) += contpte.o obj-$(CONFIG_HUGETLB_PAGE) += hugetlbpage.o obj-$(CONFIG_PTDUMP_CORE) += ptdump.o obj-$(CONFIG_PTDUMP_DEBUGFS) += ptdump_debugfs.o diff --git a/arch/arm64/mm/contpte.c b/arch/arm64/mm/contpte.c new file mode 100644 index 000000000000..bfb50e6b44c7 --- /dev/null +++ b/arch/arm64/mm/contpte.c @@ -0,0 +1,283 @@ +// SPDX-License-Identifier: GPL-2.0-only +/* + * Copyright (C) 2023 ARM Ltd. + */ + +#include +#include +#include + +static inline bool mm_is_user(struct mm_struct *mm) +{ + /* + * Don't attempt to apply the contig bit to kernel mappings, because + * dynamically adding/removing the contig bit can cause page faults. + * These racing faults are ok for user space, since they get serialized + * on the PTL. But kernel mappings can't tolerate faults. + */ + return mm != &init_mm; +} + +static inline pte_t *contpte_align_down(pte_t *ptep) +{ + return (pte_t *)(ALIGN_DOWN((unsigned long)ptep >> 3, CONT_PTES) << 3); +} + +static void contpte_convert(struct mm_struct *mm, unsigned long addr, + pte_t *ptep, pte_t pte) +{ + struct vm_area_struct vma = TLB_FLUSH_VMA(mm, 0); + unsigned long start_addr; + pte_t *start_ptep; + int i; + + start_ptep = ptep = contpte_align_down(ptep); + start_addr = addr = ALIGN_DOWN(addr, CONT_PTE_SIZE); + pte = pfn_pte(ALIGN_DOWN(pte_pfn(pte), CONT_PTES), pte_pgprot(pte)); + + for (i = 0; i < CONT_PTES; i++, ptep++, addr += PAGE_SIZE) { + pte_t ptent = __ptep_get_and_clear(mm, addr, ptep); + + if (pte_dirty(ptent)) + pte = pte_mkdirty(pte); + + if (pte_young(ptent)) + pte = pte_mkyoung(pte); + } + + __flush_tlb_range(&vma, start_addr, addr, PAGE_SIZE, true, 3); + + __set_ptes(mm, start_addr, start_ptep, pte, CONT_PTES); +} + +void __contpte_try_unfold(struct mm_struct *mm, unsigned long addr, + pte_t *ptep, pte_t pte) +{ + /* + * We have already checked that the ptes are contiguous in + * contpte_try_unfold(), so just check that the mm is user space. + */ + + if (!mm_is_user(mm)) + return; + + pte = pte_mknoncont(pte); + contpte_convert(mm, addr, ptep, pte); +} +EXPORT_SYMBOL(__contpte_try_unfold); + +pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte) +{ + /* + * Gather access/dirty bits, which may be populated in any of the ptes + * of the contig range. We are guarranteed to be holding the PTL, so any + * contiguous range cannot be unfolded or otherwise modified under our + * feet. + */ + + pte_t pte; + int i; + + ptep = contpte_align_down(ptep); + + for (i = 0; i < CONT_PTES; i++, ptep++) { + pte = __ptep_get(ptep); + + if (pte_dirty(pte)) + orig_pte = pte_mkdirty(orig_pte); + + if (pte_young(pte)) + orig_pte = pte_mkyoung(orig_pte); + } + + return orig_pte; +} +EXPORT_SYMBOL(contpte_ptep_get); + +pte_t contpte_ptep_get_lockless(pte_t *orig_ptep) +{ + /* + * Gather access/dirty bits, which may be populated in any of the ptes + * of the contig range. We may not be holding the PTL, so any contiguous + * range may be unfolded/modified/refolded under our feet. Therefore we + * ensure we read a _consistent_ contpte range by checking that all ptes + * in the range are valid and have CONT_PTE set, that all pfns are + * contiguous and that all pgprots are the same (ignoring access/dirty). + * If we find a pte that is not consistent, then we must be racing with + * an update so start again. If the target pte does not have CONT_PTE + * set then that is considered consistent on its own because it is not + * part of a contpte range. + */ + + pgprot_t orig_prot; + unsigned long pfn; + pte_t orig_pte; + pgprot_t prot; + pte_t *ptep; + pte_t pte; + int i; + +retry: + orig_pte = __ptep_get(orig_ptep); + + if (!pte_valid_cont(orig_pte)) + return orig_pte; + + orig_prot = pte_pgprot(pte_mkold(pte_mkclean(orig_pte))); + ptep = contpte_align_down(orig_ptep); + pfn = pte_pfn(orig_pte) - (orig_ptep - ptep); + + for (i = 0; i < CONT_PTES; i++, ptep++, pfn++) { + pte = __ptep_get(ptep); + prot = pte_pgprot(pte_mkold(pte_mkclean(pte))); + + if (!pte_valid_cont(pte) || + pte_pfn(pte) != pfn || + pgprot_val(prot) != pgprot_val(orig_prot)) + goto retry; + + if (pte_dirty(pte)) + orig_pte = pte_mkdirty(orig_pte); + + if (pte_young(pte)) + orig_pte = pte_mkyoung(orig_pte); + } + + return orig_pte; +} +EXPORT_SYMBOL(contpte_ptep_get_lockless); + +void contpte_set_ptes(struct mm_struct *mm, unsigned long addr, + pte_t *ptep, pte_t pte, unsigned int nr) +{ + unsigned long next; + unsigned long end; + unsigned long pfn; + pgprot_t prot; + + /* + * The set_ptes() spec guarantees that when nr > 1, the initial state of + * all ptes is not-present. Therefore we never need to unfold or + * otherwise invalidate a range before we set the new ptes. + * contpte_set_ptes() should never be called for nr < 2. + */ + VM_WARN_ON(nr == 1); + + if (!mm_is_user(mm)) + return __set_ptes(mm, addr, ptep, pte, nr); + + end = addr + (nr << PAGE_SHIFT); + pfn = pte_pfn(pte); + prot = pte_pgprot(pte); + + do { + next = pte_cont_addr_end(addr, end); + nr = (next - addr) >> PAGE_SHIFT; + pte = pfn_pte(pfn, prot); + + if (((addr | next | (pfn << PAGE_SHIFT)) & ~CONT_PTE_MASK) == 0) + pte = pte_mkcont(pte); + else + pte = pte_mknoncont(pte); + + __set_ptes(mm, addr, ptep, pte, nr); + + addr = next; + ptep += nr; + pfn += nr; + + } while (addr != end); +} +EXPORT_SYMBOL(contpte_set_ptes); + +int contpte_ptep_test_and_clear_young(struct vm_area_struct *vma, + unsigned long addr, pte_t *ptep) +{ + /* + * ptep_clear_flush_young() technically requires us to clear the access + * flag for a _single_ pte. However, the core-mm code actually tracks + * access/dirty per folio, not per page. And since we only create a + * contig range when the range is covered by a single folio, we can get + * away with clearing young for the whole contig range here, so we avoid + * having to unfold. + */ + + int young = 0; + int i; + + ptep = contpte_align_down(ptep); + addr = ALIGN_DOWN(addr, CONT_PTE_SIZE); + + for (i = 0; i < CONT_PTES; i++, ptep++, addr += PAGE_SIZE) + young |= __ptep_test_and_clear_young(vma, addr, ptep); + + return young; +} +EXPORT_SYMBOL(contpte_ptep_test_and_clear_young); + +int contpte_ptep_clear_flush_young(struct vm_area_struct *vma, + unsigned long addr, pte_t *ptep) +{ + int young; + + young = contpte_ptep_test_and_clear_young(vma, addr, ptep); + + if (young) { + /* + * See comment in __ptep_clear_flush_young(); same rationale for + * eliding the trailing DSB applies here. + */ + addr = ALIGN_DOWN(addr, CONT_PTE_SIZE); + __flush_tlb_range_nosync(vma, addr, addr + CONT_PTE_SIZE, + PAGE_SIZE, true, 3); + } + + return young; +} +EXPORT_SYMBOL(contpte_ptep_clear_flush_young); + +int contpte_ptep_set_access_flags(struct vm_area_struct *vma, + unsigned long addr, pte_t *ptep, + pte_t entry, int dirty) +{ + unsigned long start_addr; + pte_t orig_pte; + int i; + + /* + * Gather the access/dirty bits for the contiguous range. If nothing has + * changed, its a noop. + */ + orig_pte = pte_mknoncont(ptep_get(ptep)); + if (pte_val(orig_pte) == pte_val(entry)) + return 0; + + /* + * We can fix up access/dirty bits without having to unfold the contig + * range. But if the write bit is changing, we must unfold. + */ + if (pte_write(orig_pte) == pte_write(entry)) { + /* + * For HW access management, we technically only need to update + * the flag on a single pte in the range. But for SW access + * management, we need to update all the ptes to prevent extra + * faults. Avoid per-page tlb flush in __ptep_set_access_flags() + * and instead flush the whole range at the end. + */ + ptep = contpte_align_down(ptep); + start_addr = addr = ALIGN_DOWN(addr, CONT_PTE_SIZE); + + for (i = 0; i < CONT_PTES; i++, ptep++, addr += PAGE_SIZE) + __ptep_set_access_flags(vma, addr, ptep, entry, 0); + + if (dirty) + __flush_tlb_range(vma, start_addr, addr, + PAGE_SIZE, true, 3); + } else { + __contpte_try_unfold(vma->vm_mm, addr, ptep, orig_pte); + __ptep_set_access_flags(vma, addr, ptep, entry, dirty); + } + + return 1; +} +EXPORT_SYMBOL(contpte_ptep_set_access_flags); -- 2.25.1