Here is the v2 series of patches to various architectures, based on v6.4-rc5:
preparing for v2 of the changes which follow in mm, affecting pte_offset_map()
and pte_offset_map_lock(). There are very few differences from v1:
noted patch by patch below.
v1 was "arch: allow pte_offset_map[_lock]() to fail"
https://lore.kernel.org/linux-mm/[email protected]/
series of 23 posted on 2023-05-09,
followed by "mm: allow pte_offset_map[_lock]() to fail"
https://lore.kernel.org/linux-mm/[email protected]/
series of 31 posted on 2023-05-21,
followed by "mm: free retracted page table by RCU"
https://lore.kernel.org/linux-mm/[email protected]/
series of 12 posted on 2023-05-28.
The first two series are "independent": neither depends
for build or correctness on the other, and the arch patches can either be
merged separately via arch trees, or be picked up by akpm; but both series
must be in before the third series can be added to make the changes effective
(and that adds just a little more in arm, powerpc, s390 and sparc).
What is it all about? Some mmap_lock avoidance, i.e. latency reduction.
Initially just for the case of collapsing shmem or file pages to THPs;
but likely to be relied upon later in other contexts e.g. freeing of
empty page tables (but that's not work I'm doing). mmap_write_lock
avoidance when collapsing to anon THPs? Perhaps, but again that's not
work I've done: a quick attempt was not as easy as the shmem/file case.
I would much prefer not to have to make these small but wide-ranging
changes for such a niche case; but failed to find another way, and
have heard that shmem MADV_COLLAPSE's usefulness is being limited by
that mmap_write_lock it currently requires.
These changes (though of course not these exact patches, and not all
of these architectures!) have been in Google's data centre kernel for
three years now: we do rely upon them.
What are the per-arch changes about? Generally, two things.
One: the current mmap locking may not be enough to guard against that
tricky transition between pmd entry pointing to page table, and empty
pmd entry, and pmd entry pointing to huge page: pte_offset_map() will
have to validate the pmd entry for itself, returning NULL if no page
table is there. What to do about that varies: often the nearby error
handling indicates just to skip it; but in some cases a "goto again"
looks appropriate (and if that risks an infinite loop, then there
must have been an oops, or pfn 0 mistaken for page table, before).
Deeper study of each site might show that 90% of them here in arch
code could only fail if there's corruption, e.g. a transition to THP
would be surprising on an arch without HAVE_ARCH_TRANSPARENT_HUGEPAGE.
But given the likely extension to freeing empty page tables, I have
not limited this set of changes to THP; and it has been easier, and
sets a better example, if each site is given appropriate handling.
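To illustrate, a hedged sketch of those two patterns (generic names, not
lifted from any one call site; the details vary per arch):

	/* skip: the nearby error handling covers "no page table here" */
	pte = pte_offset_map(pmd, addr);
	if (!pte)
		return;
	/* ... work on the page table ... */
	pte_unmap(pte);

	/* retry: revalidate the pmd entry, as in s390's pgtable.c below */
again:
	ptl = pmd_lock(mm, pmdp);
	/* ... examine *pmdp under pmd lock, then unlock ... */
	spin_unlock(ptl);
	pte = pte_offset_map_lock(mm, pmdp, addr, &ptl);
	if (!pte)
		goto again;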
Two: pte_offset_map() will need to do an rcu_read_lock(), with the
corresponding rcu_read_unlock() in pte_unmap(). But most architectures
never supported CONFIG_HIGHPTE, so some don't always call pte_unmap()
after pte_offset_map(), or have used userspace pte_offset_map() where
pte_offset_kernel() is more correct. No problem in the current tree,
but a problem once an rcu_read_unlock() is needed to keep balance.
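The rule each of these patches works towards is simply (a sketch, assuming
the later mm series hides rcu_read_lock() inside pte_offset_map() and
rcu_read_unlock() inside pte_unmap()):

	pte = pte_offset_map(pmd, addr);
	if (pte) {
		entry = *pte;	/* take what's wanted while mapped */
		pte_unmap(pte);	/* every success must be balanced */
	}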
A common special case of that comes in arch/*/mm/hugetlbpage.c, if
the architecture supports hugetlb pages down at the lowest PTE level.
huge_pte_alloc() uses pte_alloc_map(), but generic hugetlb code does
no corresponding pte_unmap(); similarly for huge_pte_offset().
Thanks to Mike Kravetz and Andrew Morton, v6.4-rc1 already provides
pte_alloc_huge() and pte_offset_huge() to help fix up those cases.
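A sketch of the fixed-up hugetlb pattern (as in the sh, parisc, sparc and
riscv patches below; generic hugetlb code does no pte_unmap(), so these
helpers must not map):

	/* in huge_pte_alloc(): */
	pte = pte_alloc_huge(mm, pmd, addr);	/* not pte_alloc_map() */
	/* in huge_pte_offset(): */
	pte = pte_offset_huge(pmd, addr);	/* not pte_offset_map() */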
This posting is based on v6.4-rc5, but good for any v6.4-rc,
current mm-everything and linux-next.
01/23 arm: allow pte_offset_map[_lock]() to fail
v2: same as v1
02/23 arm64: allow pte_offset_map() to fail
v2: add ack from Catalin
03/23 arm64/hugetlb: pte_alloc_huge() pte_offset_huge()
v2: add ack from Catalin
04/23 ia64/hugetlb: pte_alloc_huge() pte_offset_huge()
v2: same as v1
05/23 m68k: allow pte_offset_map[_lock]() to fail
v2: same as v1
06/23 microblaze: allow pte_offset_map() to fail
v2: same as v1
07/23 mips: update_mmu_cache() can replace __update_tlb()
v2: same as v1
08/23 parisc: add pte_unmap() to balance get_ptep()
v2: typo fix from Helge; stronger commit message
09/23 parisc: unmap_uncached_pte() use pte_offset_kernel()
v2: same as v1
10/23 parisc/hugetlb: pte_alloc_huge() pte_offset_huge()
v2: same as v1
11/23 powerpc: kvmppc_unmap_free_pmd() pte_offset_kernel()
v2: same as v1
12/23 powerpc: allow pte_offset_map[_lock]() to fail
v2: same as v1
13/23 powerpc/hugetlb: pte_alloc_huge()
v2: same as v1
14/23 riscv/hugetlb: pte_alloc_huge() pte_offset_huge()
v2: add review from Alex, ack from Palmer
15/23 s390: allow pte_offset_map_lock() to fail
v2: add comment for Claudio
16/23 s390: gmap use pte_unmap_unlock() not spin_unlock()
v2: add ack from Alexander
17/23 sh/hugetlb: pte_alloc_huge() pte_offset_huge()
v2: same as v1
18/23 sparc/hugetlb: pte_alloc_huge() pte_offset_huge()
v2: same as v1
19/23 sparc: allow pte_offset_map() to fail
v2: same as v1
20/23 sparc: iounit and iommu use pte_offset_kernel()
v2: same as v1
21/23 x86: Allow get_locked_pte() to fail
v2: add WARN_ON_ONCE from PeterZ
22/23 x86: sme_populate_pgd() use pte_offset_kernel()
v2: same as v1
23/23 xtensa: add pte_unmap() to balance pte_offset_map()
v2: stronger commit message
arch/arm/lib/uaccess_with_memcpy.c | 3 ++
arch/arm/mm/fault-armv.c | 5 +++-
arch/arm/mm/fault.c | 3 ++
arch/arm64/mm/fault.c | 3 ++
arch/arm64/mm/hugetlbpage.c | 11 ++-----
arch/ia64/mm/hugetlbpage.c | 4 +--
arch/m68k/include/asm/mmu_context.h | 6 ++--
arch/m68k/kernel/sys_m68k.c | 2 ++
arch/m68k/mm/mcfmmu.c | 52 +++++++++++++--------------------
arch/microblaze/kernel/signal.c | 5 ++--
arch/mips/include/asm/pgtable.h | 15 ++--------
arch/mips/mm/tlb-r3k.c | 5 ++--
arch/mips/mm/tlb-r4k.c | 9 ++----
arch/parisc/kernel/cache.c | 26 +++++++++++++----
arch/parisc/kernel/pci-dma.c | 2 +-
arch/parisc/mm/hugetlbpage.c | 4 +--
arch/powerpc/kvm/book3s_64_mmu_radix.c | 2 +-
arch/powerpc/mm/book3s64/hash_tlb.c | 4 +++
arch/powerpc/mm/book3s64/subpage_prot.c | 2 ++
arch/powerpc/mm/hugetlbpage.c | 2 +-
arch/powerpc/xmon/xmon.c | 5 +++-
arch/riscv/mm/hugetlbpage.c | 4 +--
arch/s390/kernel/uv.c | 2 ++
arch/s390/mm/gmap.c | 31 ++++++++++++--------
arch/s390/mm/pgtable.c | 12 ++++++--
arch/sh/mm/hugetlbpage.c | 4 +--
arch/sparc/kernel/signal32.c | 2 ++
arch/sparc/mm/fault_64.c | 3 ++
arch/sparc/mm/hugetlbpage.c | 4 +--
arch/sparc/mm/io-unit.c | 2 +-
arch/sparc/mm/iommu.c | 2 +-
arch/sparc/mm/tlb.c | 2 ++
arch/x86/kernel/ldt.c | 6 ++--
arch/x86/mm/mem_encrypt_identity.c | 2 +-
arch/xtensa/mm/tlb.c | 5 +++-
35 files changed, 146 insertions(+), 105 deletions(-)
Hugh
In rare transient cases, not yet made possible, pte_offset_map() and
pte_offset_map_lock() may not find a page table: handle appropriately.
Add comment on mm's contract with s390 above __zap_zero_pages(),
and fix old comment there: must be called after THP was disabled.
Signed-off-by: Hugh Dickins <[email protected]>
---
arch/s390/kernel/uv.c | 2 ++
arch/s390/mm/gmap.c | 9 ++++++++-
arch/s390/mm/pgtable.c | 12 +++++++++---
3 files changed, 19 insertions(+), 4 deletions(-)
diff --git a/arch/s390/kernel/uv.c b/arch/s390/kernel/uv.c
index cb2ee06df286..3c62d1b218b1 100644
--- a/arch/s390/kernel/uv.c
+++ b/arch/s390/kernel/uv.c
@@ -294,6 +294,8 @@ int gmap_make_secure(struct gmap *gmap, unsigned long gaddr, void *uvcb)
rc = -ENXIO;
ptep = get_locked_pte(gmap->mm, uaddr, &ptelock);
+ if (!ptep)
+ goto out;
if (pte_present(*ptep) && !(pte_val(*ptep) & _PAGE_INVALID) && pte_write(*ptep)) {
page = pte_page(*ptep);
rc = -EAGAIN;
diff --git a/arch/s390/mm/gmap.c b/arch/s390/mm/gmap.c
index dc90d1eb0d55..3a2a31a15ea8 100644
--- a/arch/s390/mm/gmap.c
+++ b/arch/s390/mm/gmap.c
@@ -2537,7 +2537,12 @@ static inline void thp_split_mm(struct mm_struct *mm)
* Remove all empty zero pages from the mapping for lazy refaulting
* - This must be called after mm->context.has_pgste is set, to avoid
* future creation of zero pages
- * - This must be called after THP was enabled
+ * - This must be called after THP was disabled.
+ *
+ * mm contracts with s390, that even if mm were to remove a page table,
+ * racing with the loop below and so causing pte_offset_map_lock() to fail,
+ * it will never insert a page table containing empty zero pages once
+ * mm_forbids_zeropage(mm) i.e. mm->context.has_pgste is set.
*/
static int __zap_zero_pages(pmd_t *pmd, unsigned long start,
unsigned long end, struct mm_walk *walk)
@@ -2549,6 +2554,8 @@ static int __zap_zero_pages(pmd_t *pmd, unsigned long start,
spinlock_t *ptl;
ptep = pte_offset_map_lock(walk->mm, pmd, addr, &ptl);
+ if (!ptep)
+ break;
if (is_zero_pfn(pte_pfn(*ptep)))
ptep_xchg_direct(walk->mm, addr, ptep, __pte(_PAGE_INVALID));
pte_unmap_unlock(ptep, ptl);
diff --git a/arch/s390/mm/pgtable.c b/arch/s390/mm/pgtable.c
index 6effb24de6d9..3bd2ab2a9a34 100644
--- a/arch/s390/mm/pgtable.c
+++ b/arch/s390/mm/pgtable.c
@@ -829,7 +829,7 @@ int set_guest_storage_key(struct mm_struct *mm, unsigned long addr,
default:
return -EFAULT;
}
-
+again:
ptl = pmd_lock(mm, pmdp);
if (!pmd_present(*pmdp)) {
spin_unlock(ptl);
@@ -850,6 +850,8 @@ int set_guest_storage_key(struct mm_struct *mm, unsigned long addr,
spin_unlock(ptl);
ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);
+ if (!ptep)
+ goto again;
new = old = pgste_get_lock(ptep);
pgste_val(new) &= ~(PGSTE_GR_BIT | PGSTE_GC_BIT |
PGSTE_ACC_BITS | PGSTE_FP_BIT);
@@ -938,7 +940,7 @@ int reset_guest_reference_bit(struct mm_struct *mm, unsigned long addr)
default:
return -EFAULT;
}
-
+again:
ptl = pmd_lock(mm, pmdp);
if (!pmd_present(*pmdp)) {
spin_unlock(ptl);
@@ -955,6 +957,8 @@ int reset_guest_reference_bit(struct mm_struct *mm, unsigned long addr)
spin_unlock(ptl);
ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);
+ if (!ptep)
+ goto again;
new = old = pgste_get_lock(ptep);
/* Reset guest reference bit only */
pgste_val(new) &= ~PGSTE_GR_BIT;
@@ -1000,7 +1004,7 @@ int get_guest_storage_key(struct mm_struct *mm, unsigned long addr,
default:
return -EFAULT;
}
-
+again:
ptl = pmd_lock(mm, pmdp);
if (!pmd_present(*pmdp)) {
spin_unlock(ptl);
@@ -1017,6 +1021,8 @@ int get_guest_storage_key(struct mm_struct *mm, unsigned long addr,
spin_unlock(ptl);
ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);
+ if (!ptep)
+ goto again;
pgste = pgste_get_lock(ptep);
*key = (pgste_val(pgste) & (PGSTE_ACC_BITS | PGSTE_FP_BIT)) >> 56;
paddr = pte_val(*ptep) & PAGE_MASK;
--
2.35.3
In rare transient cases, not yet made possible, pte_offset_map() and
pte_offset_map_lock() may not find a page table: handle appropriately.
Balance successful pte_offset_map() with pte_unmap() where omitted.
Signed-off-by: Hugh Dickins <[email protected]>
---
arch/powerpc/mm/book3s64/hash_tlb.c | 4 ++++
arch/powerpc/mm/book3s64/subpage_prot.c | 2 ++
arch/powerpc/xmon/xmon.c | 5 ++++-
3 files changed, 10 insertions(+), 1 deletion(-)
diff --git a/arch/powerpc/mm/book3s64/hash_tlb.c b/arch/powerpc/mm/book3s64/hash_tlb.c
index a64ea0a7ef96..21fcad97ae80 100644
--- a/arch/powerpc/mm/book3s64/hash_tlb.c
+++ b/arch/powerpc/mm/book3s64/hash_tlb.c
@@ -239,12 +239,16 @@ void flush_hash_table_pmd_range(struct mm_struct *mm, pmd_t *pmd, unsigned long
local_irq_save(flags);
arch_enter_lazy_mmu_mode();
start_pte = pte_offset_map(pmd, addr);
+ if (!start_pte)
+ goto out;
for (pte = start_pte; pte < start_pte + PTRS_PER_PTE; pte++) {
unsigned long pteval = pte_val(*pte);
if (pteval & H_PAGE_HASHPTE)
hpte_need_flush(mm, addr, pte, pteval, 0);
addr += PAGE_SIZE;
}
+ pte_unmap(start_pte);
+out:
arch_leave_lazy_mmu_mode();
local_irq_restore(flags);
}
diff --git a/arch/powerpc/mm/book3s64/subpage_prot.c b/arch/powerpc/mm/book3s64/subpage_prot.c
index b75a9fb99599..0dc85556dec5 100644
--- a/arch/powerpc/mm/book3s64/subpage_prot.c
+++ b/arch/powerpc/mm/book3s64/subpage_prot.c
@@ -71,6 +71,8 @@ static void hpte_flush_range(struct mm_struct *mm, unsigned long addr,
if (pmd_none(*pmd))
return;
pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
+ if (!pte)
+ return;
arch_enter_lazy_mmu_mode();
for (; npages > 0; --npages) {
pte_update(mm, addr, pte, 0, 0, 0);
diff --git a/arch/powerpc/xmon/xmon.c b/arch/powerpc/xmon/xmon.c
index 70c4c59a1a8f..fae747cc57d2 100644
--- a/arch/powerpc/xmon/xmon.c
+++ b/arch/powerpc/xmon/xmon.c
@@ -3376,12 +3376,15 @@ static void show_pte(unsigned long addr)
printf("pmdp @ 0x%px = 0x%016lx\n", pmdp, pmd_val(*pmdp));
ptep = pte_offset_map(pmdp, addr);
- if (pte_none(*ptep)) {
+ if (!ptep || pte_none(*ptep)) {
+ if (ptep)
+ pte_unmap(ptep);
printf("no valid PTE\n");
return;
}
format_pte(ptep, pte_val(*ptep));
+ pte_unmap(ptep);
sync();
__delay(200);
--
2.35.3
pte_alloc_map() expects to be followed by pte_unmap(), but hugetlb omits
that: to keep balance in future, use the recently added pte_alloc_huge()
instead. huge_pte_offset() is using __find_linux_pte(), which is using
pte_offset_kernel() - don't rename that to _huge, it's more complicated.
Signed-off-by: Hugh Dickins <[email protected]>
---
arch/powerpc/mm/hugetlbpage.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c
index b900933507da..f7c683b672c1 100644
--- a/arch/powerpc/mm/hugetlbpage.c
+++ b/arch/powerpc/mm/hugetlbpage.c
@@ -183,7 +183,7 @@ pte_t *huge_pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma,
return NULL;
if (IS_ENABLED(CONFIG_PPC_8xx) && pshift < PMD_SHIFT)
- return pte_alloc_map(mm, (pmd_t *)hpdp, addr);
+ return pte_alloc_huge(mm, (pmd_t *)hpdp, addr);
BUG_ON(!hugepd_none(*hpdp) && !hugepd_ok(*hpdp));
--
2.35.3
kvmppc_unmap_free_pmd() use pte_offset_kernel(), like everywhere else
in book3s_64_mmu_radix.c: instead of pte_offset_map(), which will come
to need a pte_unmap() to balance it.
But note that this is a more complex case than most: see those -EAGAINs
in kvmppc_create_pte(), which is coping with kvmppc races between page
table and huge entry, of the kind which we are expecting to address
in pte_offset_map() - this might want to be revisited in future.
Signed-off-by: Hugh Dickins <[email protected]>
---
arch/powerpc/kvm/book3s_64_mmu_radix.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/arch/powerpc/kvm/book3s_64_mmu_radix.c b/arch/powerpc/kvm/book3s_64_mmu_radix.c
index 461307b89c3a..572707858d65 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_radix.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_radix.c
@@ -509,7 +509,7 @@ static void kvmppc_unmap_free_pmd(struct kvm *kvm, pmd_t *pmd, bool full,
} else {
pte_t *pte;
- pte = pte_offset_map(p, 0);
+ pte = pte_offset_kernel(p, 0);
kvmppc_unmap_free_pte(kvm, pte, full, lpid);
pmd_clear(p);
}
--
2.35.3
pte_alloc_map() expects to be followed by pte_unmap(), but hugetlb omits
that: to keep balance in future, use the recently added pte_alloc_huge()
instead; with pte_offset_huge() a better name for pte_offset_kernel().
Signed-off-by: Hugh Dickins <[email protected]>
Reviewed-by: Alexandre Ghiti <[email protected]>
Acked-by: Palmer Dabbelt <[email protected]>
---
arch/riscv/mm/hugetlbpage.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/arch/riscv/mm/hugetlbpage.c b/arch/riscv/mm/hugetlbpage.c
index e0ef56dc57b9..542883b3b49b 100644
--- a/arch/riscv/mm/hugetlbpage.c
+++ b/arch/riscv/mm/hugetlbpage.c
@@ -67,7 +67,7 @@ pte_t *huge_pte_alloc(struct mm_struct *mm,
for_each_napot_order(order) {
if (napot_cont_size(order) == sz) {
- pte = pte_alloc_map(mm, pmd, addr & napot_cont_mask(order));
+ pte = pte_alloc_huge(mm, pmd, addr & napot_cont_mask(order));
break;
}
}
@@ -114,7 +114,7 @@ pte_t *huge_pte_offset(struct mm_struct *mm,
for_each_napot_order(order) {
if (napot_cont_size(order) == sz) {
- pte = pte_offset_kernel(pmd, addr & napot_cont_mask(order));
+ pte = pte_offset_huge(pmd, addr & napot_cont_mask(order));
break;
}
}
--
2.35.3
pte_alloc_map_lock() expects to be followed by pte_unmap_unlock(): to
keep balance in future, pass ptep as well as ptl to gmap_pte_op_end(),
and use pte_unmap_unlock() instead of direct spin_unlock() (even though
ptep ends up unused inside the macro).
Signed-off-by: Hugh Dickins <[email protected]>
Acked-by: Alexander Gordeev <[email protected]>
---
arch/s390/mm/gmap.c | 22 +++++++++++-----------
1 file changed, 11 insertions(+), 11 deletions(-)
diff --git a/arch/s390/mm/gmap.c b/arch/s390/mm/gmap.c
index 3a2a31a15ea8..f4b6fc746fce 100644
--- a/arch/s390/mm/gmap.c
+++ b/arch/s390/mm/gmap.c
@@ -895,12 +895,12 @@ static int gmap_pte_op_fixup(struct gmap *gmap, unsigned long gaddr,
/**
* gmap_pte_op_end - release the page table lock
- * @ptl: pointer to the spinlock pointer
+ * @ptep: pointer to the locked pte
+ * @ptl: pointer to the page table spinlock
*/
-static void gmap_pte_op_end(spinlock_t *ptl)
+static void gmap_pte_op_end(pte_t *ptep, spinlock_t *ptl)
{
- if (ptl)
- spin_unlock(ptl);
+ pte_unmap_unlock(ptep, ptl);
}
/**
@@ -1011,7 +1011,7 @@ static int gmap_protect_pte(struct gmap *gmap, unsigned long gaddr,
{
int rc;
pte_t *ptep;
- spinlock_t *ptl = NULL;
+ spinlock_t *ptl;
unsigned long pbits = 0;
if (pmd_val(*pmdp) & _SEGMENT_ENTRY_INVALID)
@@ -1025,7 +1025,7 @@ static int gmap_protect_pte(struct gmap *gmap, unsigned long gaddr,
pbits |= (bits & GMAP_NOTIFY_SHADOW) ? PGSTE_VSIE_BIT : 0;
/* Protect and unlock. */
rc = ptep_force_prot(gmap->mm, gaddr, ptep, prot, pbits);
- gmap_pte_op_end(ptl);
+ gmap_pte_op_end(ptep, ptl);
return rc;
}
@@ -1154,7 +1154,7 @@ int gmap_read_table(struct gmap *gmap, unsigned long gaddr, unsigned long *val)
/* Do *NOT* clear the _PAGE_INVALID bit! */
rc = 0;
}
- gmap_pte_op_end(ptl);
+ gmap_pte_op_end(ptep, ptl);
}
if (!rc)
break;
@@ -1248,7 +1248,7 @@ static int gmap_protect_rmap(struct gmap *sg, unsigned long raddr,
if (!rc)
gmap_insert_rmap(sg, vmaddr, rmap);
spin_unlock(&sg->guest_table_lock);
- gmap_pte_op_end(ptl);
+ gmap_pte_op_end(ptep, ptl);
}
radix_tree_preload_end();
if (rc) {
@@ -2156,7 +2156,7 @@ int gmap_shadow_page(struct gmap *sg, unsigned long saddr, pte_t pte)
tptep = (pte_t *) gmap_table_walk(sg, saddr, 0);
if (!tptep) {
spin_unlock(&sg->guest_table_lock);
- gmap_pte_op_end(ptl);
+ gmap_pte_op_end(sptep, ptl);
radix_tree_preload_end();
break;
}
@@ -2167,7 +2167,7 @@ int gmap_shadow_page(struct gmap *sg, unsigned long saddr, pte_t pte)
rmap = NULL;
rc = 0;
}
- gmap_pte_op_end(ptl);
+ gmap_pte_op_end(sptep, ptl);
spin_unlock(&sg->guest_table_lock);
}
radix_tree_preload_end();
@@ -2495,7 +2495,7 @@ void gmap_sync_dirty_log_pmd(struct gmap *gmap, unsigned long bitmap[4],
continue;
if (ptep_test_and_clear_uc(gmap->mm, vmaddr, ptep))
set_bit(i, bitmap);
- spin_unlock(ptl);
+ pte_unmap_unlock(ptep, ptl);
}
}
gmap_pmd_op_end(gmap, pmdp);
--
2.35.3
unmap_uncached_pte() is working from pgd_offset_k(vaddr), so it should
use pte_offset_kernel() instead of pte_offset_map(), to avoid the
question of whether a pte_unmap() will be needed to balance.
Signed-off-by: Hugh Dickins <[email protected]>
---
arch/parisc/kernel/pci-dma.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/arch/parisc/kernel/pci-dma.c b/arch/parisc/kernel/pci-dma.c
index 71ed5391f29d..415f12d5bab3 100644
--- a/arch/parisc/kernel/pci-dma.c
+++ b/arch/parisc/kernel/pci-dma.c
@@ -164,7 +164,7 @@ static inline void unmap_uncached_pte(pmd_t * pmd, unsigned long vaddr,
pmd_clear(pmd);
return;
}
- pte = pte_offset_map(pmd, vaddr);
+ pte = pte_offset_kernel(pmd, vaddr);
vaddr &= ~PMD_MASK;
end = vaddr + size;
if (end > PMD_SIZE)
--
2.35.3
pte_alloc_map() expects to be followed by pte_unmap(), but hugetlb omits
that: to keep balance in future, use the recently added pte_alloc_huge()
instead; with pte_offset_huge() a better name for pte_offset_kernel().
Signed-off-by: Hugh Dickins <[email protected]>
---
arch/sh/mm/hugetlbpage.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/arch/sh/mm/hugetlbpage.c b/arch/sh/mm/hugetlbpage.c
index 999ab5916e69..6cb0ad73dbb9 100644
--- a/arch/sh/mm/hugetlbpage.c
+++ b/arch/sh/mm/hugetlbpage.c
@@ -38,7 +38,7 @@ pte_t *huge_pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma,
if (pud) {
pmd = pmd_alloc(mm, pud, addr);
if (pmd)
- pte = pte_alloc_map(mm, pmd, addr);
+ pte = pte_alloc_huge(mm, pmd, addr);
}
}
}
@@ -63,7 +63,7 @@ pte_t *huge_pte_offset(struct mm_struct *mm,
if (pud) {
pmd = pmd_offset(pud, addr);
if (pmd)
- pte = pte_offset_map(pmd, addr);
+ pte = pte_offset_huge(pmd, addr);
}
}
}
--
2.35.3
pte_alloc_map() expects to be followed by pte_unmap(), but hugetlb omits
that: to keep balance in future, use the recently added pte_alloc_huge()
instead; with pte_offset_huge() a better name for pte_offset_kernel().
Signed-off-by: Hugh Dickins <[email protected]>
---
arch/parisc/mm/hugetlbpage.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/arch/parisc/mm/hugetlbpage.c b/arch/parisc/mm/hugetlbpage.c
index d1d3990b83f6..a8a1a7c1e16e 100644
--- a/arch/parisc/mm/hugetlbpage.c
+++ b/arch/parisc/mm/hugetlbpage.c
@@ -66,7 +66,7 @@ pte_t *huge_pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma,
if (pud) {
pmd = pmd_alloc(mm, pud, addr);
if (pmd)
- pte = pte_alloc_map(mm, pmd, addr);
+ pte = pte_alloc_huge(mm, pmd, addr);
}
return pte;
}
@@ -90,7 +90,7 @@ pte_t *huge_pte_offset(struct mm_struct *mm,
if (!pud_none(*pud)) {
pmd = pmd_offset(pud, addr);
if (!pmd_none(*pmd))
- pte = pte_offset_map(pmd, addr);
+ pte = pte_offset_huge(pmd, addr);
}
}
}
--
2.35.3
iounit_alloc() and sbus_iommu_alloc() are working from pmd_off_k(),
so should use pte_offset_kernel() instead of pte_offset_map(), to avoid
the question of whether a pte_unmap() will be needed to balance.
Signed-off-by: Hugh Dickins <[email protected]>
---
arch/sparc/mm/io-unit.c | 2 +-
arch/sparc/mm/iommu.c | 2 +-
2 files changed, 2 insertions(+), 2 deletions(-)
diff --git a/arch/sparc/mm/io-unit.c b/arch/sparc/mm/io-unit.c
index bf3e6d2fe5d9..133dd42570d6 100644
--- a/arch/sparc/mm/io-unit.c
+++ b/arch/sparc/mm/io-unit.c
@@ -244,7 +244,7 @@ static void *iounit_alloc(struct device *dev, size_t len,
long i;
pmdp = pmd_off_k(addr);
- ptep = pte_offset_map(pmdp, addr);
+ ptep = pte_offset_kernel(pmdp, addr);
set_pte(ptep, mk_pte(virt_to_page(page), dvma_prot));
diff --git a/arch/sparc/mm/iommu.c b/arch/sparc/mm/iommu.c
index 9e3f6933ca13..3a6caef68348 100644
--- a/arch/sparc/mm/iommu.c
+++ b/arch/sparc/mm/iommu.c
@@ -358,7 +358,7 @@ static void *sbus_iommu_alloc(struct device *dev, size_t len,
__flush_page_to_ram(page);
pmdp = pmd_off_k(addr);
- ptep = pte_offset_map(pmdp, addr);
+ ptep = pte_offset_kernel(pmdp, addr);
set_pte(ptep, mk_pte(virt_to_page(page), dvma_prot));
}
--
2.35.3
In rare transient cases, not yet made possible, pte_offset_map() and
pte_offset_map_lock() may not find a page table: handle appropriately.
Signed-off-by: Hugh Dickins <[email protected]>
---
arch/sparc/kernel/signal32.c | 2 ++
arch/sparc/mm/fault_64.c | 3 +++
arch/sparc/mm/tlb.c | 2 ++
3 files changed, 7 insertions(+)
diff --git a/arch/sparc/kernel/signal32.c b/arch/sparc/kernel/signal32.c
index dad38960d1a8..ca450c7bc53f 100644
--- a/arch/sparc/kernel/signal32.c
+++ b/arch/sparc/kernel/signal32.c
@@ -328,6 +328,8 @@ static void flush_signal_insns(unsigned long address)
goto out_irqs_on;
ptep = pte_offset_map(pmdp, address);
+ if (!ptep)
+ goto out_irqs_on;
pte = *ptep;
if (!pte_present(pte))
goto out_unmap;
diff --git a/arch/sparc/mm/fault_64.c b/arch/sparc/mm/fault_64.c
index d91305de694c..d8a407fbe350 100644
--- a/arch/sparc/mm/fault_64.c
+++ b/arch/sparc/mm/fault_64.c
@@ -99,6 +99,7 @@ static unsigned int get_user_insn(unsigned long tpc)
local_irq_disable();
pmdp = pmd_offset(pudp, tpc);
+again:
if (pmd_none(*pmdp) || unlikely(pmd_bad(*pmdp)))
goto out_irq_enable;
@@ -115,6 +116,8 @@ static unsigned int get_user_insn(unsigned long tpc)
#endif
{
ptep = pte_offset_map(pmdp, tpc);
+ if (!ptep)
+ goto again;
pte = *ptep;
if (pte_present(pte)) {
pa = (pte_pfn(pte) << PAGE_SHIFT);
diff --git a/arch/sparc/mm/tlb.c b/arch/sparc/mm/tlb.c
index 9a725547578e..7ecf8556947a 100644
--- a/arch/sparc/mm/tlb.c
+++ b/arch/sparc/mm/tlb.c
@@ -149,6 +149,8 @@ static void tlb_batch_pmd_scan(struct mm_struct *mm, unsigned long vaddr,
pte_t *pte;
pte = pte_offset_map(&pmd, vaddr);
+ if (!pte)
+ return;
end = vaddr + HPAGE_SIZE;
while (vaddr < end) {
if (pte_val(*pte) & _PAGE_VALID) {
--
2.35.3
To keep balance in future, remember to pte_unmap() after a successful
get_ptep(). And act as if flush_cache_pages() really needs a map there,
to read the pfn before "unmapping", to be sure page table is not removed.
Signed-off-by: Hugh Dickins <[email protected]>
---
arch/parisc/kernel/cache.c | 26 +++++++++++++++++++++-----
1 file changed, 21 insertions(+), 5 deletions(-)
diff --git a/arch/parisc/kernel/cache.c b/arch/parisc/kernel/cache.c
index ca4a302d4365..501160250bb7 100644
--- a/arch/parisc/kernel/cache.c
+++ b/arch/parisc/kernel/cache.c
@@ -426,10 +426,15 @@ void flush_dcache_page(struct page *page)
offset = (pgoff - mpnt->vm_pgoff) << PAGE_SHIFT;
addr = mpnt->vm_start + offset;
if (parisc_requires_coherency()) {
+ bool needs_flush = false;
pte_t *ptep;
ptep = get_ptep(mpnt->vm_mm, addr);
- if (ptep && pte_needs_flush(*ptep))
+ if (ptep) {
+ needs_flush = pte_needs_flush(*ptep);
+ pte_unmap(ptep);
+ }
+ if (needs_flush)
flush_user_cache_page(mpnt, addr);
} else {
/*
@@ -561,14 +566,20 @@ EXPORT_SYMBOL(flush_kernel_dcache_page_addr);
static void flush_cache_page_if_present(struct vm_area_struct *vma,
unsigned long vmaddr, unsigned long pfn)
{
- pte_t *ptep = get_ptep(vma->vm_mm, vmaddr);
+ bool needs_flush = false;
+ pte_t *ptep;
/*
* The pte check is racy and sometimes the flush will trigger
* a non-access TLB miss. Hopefully, the page has already been
* flushed.
*/
- if (ptep && pte_needs_flush(*ptep))
+ ptep = get_ptep(vma->vm_mm, vmaddr);
+ if (ptep) {
+ needs_flush = pte_needs_flush(*ptep);
+ pte_unmap(ptep);
+ }
+ if (needs_flush)
flush_cache_page(vma, vmaddr, pfn);
}
@@ -635,17 +646,22 @@ static void flush_cache_pages(struct vm_area_struct *vma, unsigned long start, u
pte_t *ptep;
for (addr = start; addr < end; addr += PAGE_SIZE) {
+ bool needs_flush = false;
/*
* The vma can contain pages that aren't present. Although
* the pte search is expensive, we need the pte to find the
* page pfn and to check whether the page should be flushed.
*/
ptep = get_ptep(vma->vm_mm, addr);
- if (ptep && pte_needs_flush(*ptep)) {
+ if (ptep) {
+ needs_flush = pte_needs_flush(*ptep);
+ pfn = pte_pfn(*ptep);
+ pte_unmap(ptep);
+ }
+ if (needs_flush) {
if (parisc_requires_coherency()) {
flush_user_cache_page(vma, addr);
} else {
- pfn = pte_pfn(*ptep);
if (WARN_ON(!pfn_valid(pfn)))
return;
__flush_cache_page(vma, addr, PFN_PHYS(pfn));
--
2.35.3
pte_alloc_map() expects to be followed by pte_unmap(), but hugetlb omits
that: to keep balance in future, use the recently added pte_alloc_huge()
instead; with pte_offset_huge() a better name for pte_offset_kernel().
Signed-off-by: Hugh Dickins <[email protected]>
---
arch/sparc/mm/hugetlbpage.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/arch/sparc/mm/hugetlbpage.c b/arch/sparc/mm/hugetlbpage.c
index d8e0e3c7038d..d7018823206c 100644
--- a/arch/sparc/mm/hugetlbpage.c
+++ b/arch/sparc/mm/hugetlbpage.c
@@ -298,7 +298,7 @@ pte_t *huge_pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma,
return NULL;
if (sz >= PMD_SIZE)
return (pte_t *)pmd;
- return pte_alloc_map(mm, pmd, addr);
+ return pte_alloc_huge(mm, pmd, addr);
}
pte_t *huge_pte_offset(struct mm_struct *mm,
@@ -325,7 +325,7 @@ pte_t *huge_pte_offset(struct mm_struct *mm,
return NULL;
if (is_hugetlb_pmd(*pmd))
return (pte_t *)pmd;
- return pte_offset_map(pmd, addr);
+ return pte_offset_huge(pmd, addr);
}
void set_huge_pte_at(struct mm_struct *mm, unsigned long addr,
--
2.35.3
To keep balance in future, remember to pte_unmap() after a successful
pte_offset_map(). And act as if get_pte_for_vaddr() really needs a map
there, to read the pteval before "unmapping", to be sure page table is
not removed.
Signed-off-by: Hugh Dickins <[email protected]>
---
arch/xtensa/mm/tlb.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)
diff --git a/arch/xtensa/mm/tlb.c b/arch/xtensa/mm/tlb.c
index 27a477dae232..0a11fc5f185b 100644
--- a/arch/xtensa/mm/tlb.c
+++ b/arch/xtensa/mm/tlb.c
@@ -179,6 +179,7 @@ static unsigned get_pte_for_vaddr(unsigned vaddr)
pud_t *pud;
pmd_t *pmd;
pte_t *pte;
+ unsigned int pteval;
if (!mm)
mm = task->active_mm;
@@ -197,7 +198,9 @@ static unsigned get_pte_for_vaddr(unsigned vaddr)
pte = pte_offset_map(pmd, vaddr);
if (!pte)
return 0;
- return pte_val(*pte);
+ pteval = pte_val(*pte);
+ pte_unmap(pte);
+ return pteval;
}
enum {
--
2.35.3
In rare transient cases, not yet made possible, pte_offset_map() and
pte_offset_map_lock() may not find a page table: handle appropriately.
Signed-off-by: Hugh Dickins <[email protected]>
---
arch/x86/kernel/ldt.c | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)
diff --git a/arch/x86/kernel/ldt.c b/arch/x86/kernel/ldt.c
index 525876e7b9f4..adc67f98819a 100644
--- a/arch/x86/kernel/ldt.c
+++ b/arch/x86/kernel/ldt.c
@@ -367,8 +367,10 @@ static void unmap_ldt_struct(struct mm_struct *mm, struct ldt_struct *ldt)
va = (unsigned long)ldt_slot_va(ldt->slot) + offset;
ptep = get_locked_pte(mm, va, &ptl);
- pte_clear(mm, va, ptep);
- pte_unmap_unlock(ptep, ptl);
+ if (!WARN_ON_ONCE(!ptep)) {
+ pte_clear(mm, va, ptep);
+ pte_unmap_unlock(ptep, ptl);
+ }
}
va = (unsigned long)ldt_slot_va(ldt->slot);
--
2.35.3
sme_populate_pgd() is an __init function for sme_encrypt_kernel():
it should use pte_offset_kernel() instead of pte_offset_map(), to avoid
the question of whether a pte_unmap() will be needed to balance.
Signed-off-by: Hugh Dickins <[email protected]>
---
arch/x86/mm/mem_encrypt_identity.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/arch/x86/mm/mem_encrypt_identity.c b/arch/x86/mm/mem_encrypt_identity.c
index c6efcf559d88..a1ab542bdfd6 100644
--- a/arch/x86/mm/mem_encrypt_identity.c
+++ b/arch/x86/mm/mem_encrypt_identity.c
@@ -188,7 +188,7 @@ static void __init sme_populate_pgd(struct sme_populate_pgd_data *ppd)
if (pmd_large(*pmd))
return;
- pte = pte_offset_map(pmd, ppd->vaddr);
+ pte = pte_offset_kernel(pmd, ppd->vaddr);
if (pte_none(*pte))
set_pte(pte, __pte(ppd->paddr | ppd->pte_flags));
}
--
2.35.3
On Thu, 8 Jun 2023 12:27:22 -0700 (PDT)
Hugh Dickins <[email protected]> wrote:
> In rare transient cases, not yet made possible, pte_offset_map() and
> pte_offset_map_lock() may not find a page table: handle appropriately.
>
> Add comment on mm's contract with s390 above __zap_zero_pages(),
> and fix old comment there: must be called after THP was disabled.
>
> Signed-off-by: Hugh Dickins <[email protected]>
Acked-by: Claudio Imbrenda <[email protected]>
On 6/8/23 21:18, Hugh Dickins wrote:
> To keep balance in future, remember to pte_unmap() after a successful
> get_ptep(). And act as if flush_cache_pages() really needs a map there,
> to read the pfn before "unmapping", to be sure page table is not removed.
>
> Signed-off-by: Hugh Dickins <[email protected]>
For the parisc parts:
Acked-by: Helge Deller <[email protected]> # parisc
Helge