2023-05-10 04:48:40

by Hugh Dickins

Subject: [PATCH 00/23] arch: allow pte_offset_map[_lock]() to fail

Here is a series of patches to various architectures, based on v6.4-rc1:
preparing for changes expected to follow in mm, affecting pte_offset_map()
and pte_offset_map_lock().

In a week or two, I intend to post a separate series, of equivalent
preparations in mm. These two series are "independent": neither depends
for build or correctness on the other, and the arch patches can be merged
separately via arch trees (stragglers picked up by akpm?); but both series
have to be in before a third series is added to make the effective changes
(and that will add just a little more in powerpc, s390 and sparc).

What is it all about? Some mmap_lock avoidance i.e. latency reduction.
Initially just for the case of collapsing shmem or file pages to THPs;
but likely to be relied upon later in other contexts e.g. freeing of
empty page tables (but that's not work I'm doing). mmap_write_lock
avoidance when collapsing to anon THPs? Perhaps, but again that's not
work I've done: a quick and easy attempt looked like it was going to
shift the load from mmap rwsem to pmd spinlock - not an improvement.

I would much prefer not to have to make these small but wide-ranging
changes for such a niche case; but failed to find another way, and
have heard that shmem MADV_COLLAPSE's usefulness is being limited by
that mmap_write_lock it currently requires.

These changes (though of course not these exact patches, and not all
of these architectures!) have been in Google's data centre kernel for
three years now: we do rely upon them.

What are the per-arch changes about? Generally, two things.

One: the current mmap locking may not be enough to guard against that
tricky transition between pmd entry pointing to page table, and empty
pmd entry, and pmd entry pointing to huge page: pte_offset_map() will
have to validate the pmd entry for itself, returning NULL if no page
table is there. What to do about that varies: often the nearby error
handling indicates just to skip it; but in some cases a "goto again"
looks appropriate (and if that risks an infinite loop, then there
must have been an oops, or pfn 0 mistaken for page table, before).

Deeper study of each site might show that 90% of them here in arch
code could only fail if there's corruption e.g. a transition to THP
would be surprising on an arch without HAVE_ARCH_TRANSPARENT_HUGEPAGE.
But given the likely extension to freeing empty page tables, I have
not limited this set of changes to THP; and it has been easier, and
sets a better example, if each site is given appropriate handling.
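
For illustration, a rough sketch of the calling pattern which that third
series will expect (this function and its per-pte work are hypothetical;
only the NULL check and the balancing pte_unmap_unlock() matter):

static int example_count_present(struct mm_struct *mm, pmd_t *pmd,
				 unsigned long addr, unsigned long end)
{
	spinlock_t *ptl;
	pte_t *pte;
	int count = 0;

	/* May now return NULL if no page table is (or remains) there */
	pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
	if (!pte)
		return 0;	/* often just skip; a few sites "goto again" */
	for (; addr < end; pte++, addr += PAGE_SIZE) {
		if (pte_present(*pte))
			count++;
	}
	pte_unmap_unlock(pte - 1, ptl);
	return count;
}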

Two: pte_offset_map() will need to do an rcu_read_lock(), with the
corresponding rcu_read_unlock() in pte_unmap(). But most architectures
never supported CONFIG_HIGHPTE, so some don't always call pte_unmap()
after pte_offset_map(), or have used userspace pte_offset_map() where
pte_offset_kernel() is more correct. No problem in the current tree,
but a problem once an rcu_read_unlock() will be needed to keep balance.
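
Very roughly, and not the actual implementation (which the mm series will
bring), the shape being aimed at is something like the following - ignoring
the CONFIG_HIGHPTE kmap, and with the exact pmd checks still to be settled:

static inline pte_t *pte_offset_map(pmd_t *pmd, unsigned long addr)
{
	pmd_t pmdval;

	rcu_read_lock();
	pmdval = pmdp_get_lockless(pmd);
	if (pmd_none(pmdval) || pmd_trans_huge(pmdval) || pmd_bad(pmdval)) {
		rcu_read_unlock();
		return NULL;		/* no page table here */
	}
	return pte_offset_kernel(pmd, addr);
}

static inline void pte_unmap(pte_t *pte)
{
	rcu_read_unlock();
}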

A common special case of that comes in arch/*/mm/hugetlbpage.c, if
the architecture supports hugetlb pages down at the lowest PTE level.
huge_pte_alloc() uses pte_alloc_map(), but generic hugetlb code does
no corresponding pte_unmap(); similarly for huge_pte_offset().
Thanks to Mike Kravetz and Andrew Morton, v6.4-rc1 already provides
pte_alloc_huge() and pte_offset_huge() to help fix up those cases.
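
Those helpers are roughly of this shape (paraphrased here, not quoted from
the tree): pte_offset_huge() is just pte_offset_kernel() under a name which
promises no kmap, and pte_alloc_huge() is its allocating counterpart:

static inline pte_t *pte_offset_huge(pmd_t *pmd, unsigned long address)
{
	return pte_offset_kernel(pmd, address);
}

static inline pte_t *pte_alloc_huge(struct mm_struct *mm, pmd_t *pmd,
				    unsigned long address)
{
	return pte_alloc(mm, pmd) ? NULL : pte_offset_huge(pmd, address);
}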

01/23 arm: allow pte_offset_map[_lock]() to fail
02/23 arm64: allow pte_offset_map() to fail
03/23 arm64/hugetlb: pte_alloc_huge() pte_offset_huge()
04/23 ia64/hugetlb: pte_alloc_huge() pte_offset_huge()
05/23 m68k: allow pte_offset_map[_lock]() to fail
06/23 microblaze: allow pte_offset_map() to fail
07/23 mips: update_mmu_cache() can replace __update_tlb()
08/23 parisc: add pte_unmap() to balance get_ptep()
09/23 parisc: unmap_uncached_pte() use pte_offset_kernel()
10/23 parisc/hugetlb: pte_alloc_huge() pte_offset_huge()
11/23 powerpc: kvmppc_unmap_free_pmd() pte_offset_kernel()
12/23 powerpc: allow pte_offset_map[_lock]() to fail
13/23 powerpc/hugetlb: pte_alloc_huge()
14/23 riscv/hugetlb: pte_alloc_huge() pte_offset_huge()
15/23 s390: allow pte_offset_map_lock() to fail
16/23 s390: gmap use pte_unmap_unlock() not spin_unlock()
17/23 sh/hugetlb: pte_alloc_huge() pte_offset_huge()
18/23 sparc/hugetlb: pte_alloc_huge() pte_offset_huge()
19/23 sparc: allow pte_offset_map() to fail
20/23 sparc: iounit and iommu use pte_offset_kernel()
21/23 x86: Allow get_locked_pte() to fail
22/23 x86: sme_populate_pgd() use pte_offset_kernel()
23/23 xtensa: add pte_unmap() to balance pte_offset_map()

arch/arm/lib/uaccess_with_memcpy.c | 3 ++
arch/arm/mm/fault-armv.c | 5 ++-
arch/arm/mm/fault.c | 3 ++
arch/arm64/mm/fault.c | 3 ++
arch/arm64/mm/hugetlbpage.c | 11 ++----
arch/ia64/mm/hugetlbpage.c | 4 +--
arch/m68k/include/asm/mmu_context.h | 6 ++--
arch/m68k/kernel/sys_m68k.c | 2 ++
arch/m68k/mm/mcfmmu.c | 52 +++++++++++----------------
arch/microblaze/kernel/signal.c | 5 +--
arch/mips/include/asm/pgtable.h | 15 ++------
arch/mips/mm/tlb-r3k.c | 5 +--
arch/mips/mm/tlb-r4k.c | 9 ++---
arch/parisc/kernel/cache.c | 26 +++++++++++---
arch/parisc/kernel/pci-dma.c | 2 +-
arch/parisc/mm/hugetlbpage.c | 4 +--
arch/powerpc/kvm/book3s_64_mmu_radix.c | 2 +-
arch/powerpc/mm/book3s64/hash_tlb.c | 4 +++
arch/powerpc/mm/book3s64/subpage_prot.c | 2 ++
arch/powerpc/mm/hugetlbpage.c | 2 +-
arch/powerpc/xmon/xmon.c | 5 ++-
arch/riscv/mm/hugetlbpage.c | 4 +--
arch/s390/kernel/uv.c | 2 ++
arch/s390/mm/gmap.c | 24 +++++++------
arch/s390/mm/pgtable.c | 12 +++++--
arch/sh/mm/hugetlbpage.c | 4 +--
arch/sparc/kernel/signal32.c | 2 ++
arch/sparc/mm/fault_64.c | 3 ++
arch/sparc/mm/hugetlbpage.c | 4 +--
arch/sparc/mm/io-unit.c | 2 +-
arch/sparc/mm/iommu.c | 2 +-
arch/sparc/mm/tlb.c | 2 ++
arch/x86/kernel/ldt.c | 6 ++--
arch/x86/mm/mem_encrypt_identity.c | 2 +-
arch/xtensa/mm/tlb.c | 5 ++-
35 files changed, 140 insertions(+), 104 deletions(-)

Hugh


2023-05-10 04:50:08

by Hugh Dickins

Subject: [PATCH 04/23] ia64/hugetlb: pte_alloc_huge() pte_offset_huge()

pte_alloc_map() expects to be followed by pte_unmap(), but hugetlb omits
that: to keep balance in future, use the recently added pte_alloc_huge()
instead; with pte_offset_huge() a better name for pte_offset_kernel().

Signed-off-by: Hugh Dickins <[email protected]>
---
arch/ia64/mm/hugetlbpage.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/ia64/mm/hugetlbpage.c b/arch/ia64/mm/hugetlbpage.c
index 78a02e026164..adc49f2d22e8 100644
--- a/arch/ia64/mm/hugetlbpage.c
+++ b/arch/ia64/mm/hugetlbpage.c
@@ -41,7 +41,7 @@ huge_pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma,
if (pud) {
pmd = pmd_alloc(mm, pud, taddr);
if (pmd)
- pte = pte_alloc_map(mm, pmd, taddr);
+ pte = pte_alloc_huge(mm, pmd, taddr);
}
return pte;
}
@@ -64,7 +64,7 @@ huge_pte_offset (struct mm_struct *mm, unsigned long addr, unsigned long sz)
if (pud_present(*pud)) {
pmd = pmd_offset(pud, taddr);
if (pmd_present(*pmd))
- pte = pte_offset_map(pmd, taddr);
+ pte = pte_offset_huge(pmd, taddr);
}
}
}
--
2.35.3

2023-05-10 04:54:35

by Hugh Dickins

Subject: [PATCH 06/23] microblaze: allow pte_offset_map() to fail

In rare transient cases, not yet made possible, pte_offset_map() and
pte_offset_map_lock() may not find a page table: handle appropriately.

Signed-off-by: Hugh Dickins <[email protected]>
---
arch/microblaze/kernel/signal.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/arch/microblaze/kernel/signal.c b/arch/microblaze/kernel/signal.c
index c3aebec71c0c..c78a0ff48066 100644
--- a/arch/microblaze/kernel/signal.c
+++ b/arch/microblaze/kernel/signal.c
@@ -194,7 +194,7 @@ static int setup_rt_frame(struct ksignal *ksig, sigset_t *set,

preempt_disable();
ptep = pte_offset_map(pmdp, address);
- if (pte_present(*ptep)) {
+ if (ptep && pte_present(*ptep)) {
address = (unsigned long) page_address(pte_page(*ptep));
/* MS: I need add offset in page */
address += ((unsigned long)frame->tramp) & ~PAGE_MASK;
@@ -203,7 +203,8 @@ static int setup_rt_frame(struct ksignal *ksig, sigset_t *set,
invalidate_icache_range(address, address + 8);
flush_dcache_range(address, address + 8);
}
- pte_unmap(ptep);
+ if (ptep)
+ pte_unmap(ptep);
preempt_enable();
if (err)
return -EFAULT;
--
2.35.3

2023-05-10 04:55:44

by Hugh Dickins

Subject: [PATCH 05/23] m68k: allow pte_offset_map[_lock]() to fail

In rare transient cases, not yet made possible, pte_offset_map() and
pte_offset_map_lock() may not find a page table: handle appropriately.

Restructure cf_tlb_miss() with a pte_unmap() (previously omitted)
at label out, followed by one local_irq_restore() for all.

Signed-off-by: Hugh Dickins <[email protected]>
---
arch/m68k/include/asm/mmu_context.h | 6 ++--
arch/m68k/kernel/sys_m68k.c | 2 ++
arch/m68k/mm/mcfmmu.c | 52 ++++++++++++-----------------
3 files changed, 27 insertions(+), 33 deletions(-)

diff --git a/arch/m68k/include/asm/mmu_context.h b/arch/m68k/include/asm/mmu_context.h
index 8ed6ac14d99f..141bbdfad960 100644
--- a/arch/m68k/include/asm/mmu_context.h
+++ b/arch/m68k/include/asm/mmu_context.h
@@ -99,7 +99,7 @@ static inline void load_ksp_mmu(struct task_struct *task)
p4d_t *p4d;
pud_t *pud;
pmd_t *pmd;
- pte_t *pte;
+ pte_t *pte = NULL;
unsigned long mmuar;

local_irq_save(flags);
@@ -139,7 +139,7 @@ static inline void load_ksp_mmu(struct task_struct *task)

pte = (mmuar >= PAGE_OFFSET) ? pte_offset_kernel(pmd, mmuar)
: pte_offset_map(pmd, mmuar);
- if (pte_none(*pte) || !pte_present(*pte))
+ if (!pte || pte_none(*pte) || !pte_present(*pte))
goto bug;

set_pte(pte, pte_mkyoung(*pte));
@@ -161,6 +161,8 @@ static inline void load_ksp_mmu(struct task_struct *task)
bug:
pr_info("ksp load failed: mm=0x%p ksp=0x08%lx\n", mm, mmuar);
end:
+ if (pte && mmuar < PAGE_OFFSET)
+ pte_unmap(pte);
local_irq_restore(flags);
}

diff --git a/arch/m68k/kernel/sys_m68k.c b/arch/m68k/kernel/sys_m68k.c
index bd0274c7592e..c586034d2a7a 100644
--- a/arch/m68k/kernel/sys_m68k.c
+++ b/arch/m68k/kernel/sys_m68k.c
@@ -488,6 +488,8 @@ sys_atomic_cmpxchg_32(unsigned long newval, int oldval, int d3, int d4, int d5,
if (!pmd_present(*pmd))
goto bad_access;
pte = pte_offset_map_lock(mm, pmd, (unsigned long)mem, &ptl);
+ if (!pte)
+ goto bad_access;
if (!pte_present(*pte) || !pte_dirty(*pte)
|| !pte_write(*pte)) {
pte_unmap_unlock(pte, ptl);
diff --git a/arch/m68k/mm/mcfmmu.c b/arch/m68k/mm/mcfmmu.c
index 70aa0979e027..42f45abea37a 100644
--- a/arch/m68k/mm/mcfmmu.c
+++ b/arch/m68k/mm/mcfmmu.c
@@ -91,7 +91,8 @@ int cf_tlb_miss(struct pt_regs *regs, int write, int dtlb, int extension_word)
p4d_t *p4d;
pud_t *pud;
pmd_t *pmd;
- pte_t *pte;
+ pte_t *pte = NULL;
+ int ret = -1;
int asid;

local_irq_save(flags);
@@ -100,47 +101,33 @@ int cf_tlb_miss(struct pt_regs *regs, int write, int dtlb, int extension_word)
regs->pc + (extension_word * sizeof(long));

mm = (!user_mode(regs) && KMAPAREA(mmuar)) ? &init_mm : current->mm;
- if (!mm) {
- local_irq_restore(flags);
- return -1;
- }
+ if (!mm)
+ goto out;

pgd = pgd_offset(mm, mmuar);
- if (pgd_none(*pgd)) {
- local_irq_restore(flags);
- return -1;
- }
+ if (pgd_none(*pgd))
+ goto out;

p4d = p4d_offset(pgd, mmuar);
- if (p4d_none(*p4d)) {
- local_irq_restore(flags);
- return -1;
- }
+ if (p4d_none(*p4d))
+ goto out;

pud = pud_offset(p4d, mmuar);
- if (pud_none(*pud)) {
- local_irq_restore(flags);
- return -1;
- }
+ if (pud_none(*pud))
+ goto out;

pmd = pmd_offset(pud, mmuar);
- if (pmd_none(*pmd)) {
- local_irq_restore(flags);
- return -1;
- }
+ if (pmd_none(*pmd))
+ goto out;

pte = (KMAPAREA(mmuar)) ? pte_offset_kernel(pmd, mmuar)
: pte_offset_map(pmd, mmuar);
- if (pte_none(*pte) || !pte_present(*pte)) {
- local_irq_restore(flags);
- return -1;
- }
+ if (!pte || pte_none(*pte) || !pte_present(*pte))
+ goto out;

if (write) {
- if (!pte_write(*pte)) {
- local_irq_restore(flags);
- return -1;
- }
+ if (!pte_write(*pte))
+ goto out;
set_pte(pte, pte_mkdirty(*pte));
}

@@ -161,9 +148,12 @@ int cf_tlb_miss(struct pt_regs *regs, int write, int dtlb, int extension_word)
mmu_write(MMUOR, MMUOR_ACC | MMUOR_UAA);
else
mmu_write(MMUOR, MMUOR_ITLB | MMUOR_ACC | MMUOR_UAA);
-
+ ret = 0;
+out:
+ if (pte && !KMAPAREA(mmuar))
+ pte_unmap(pte);
local_irq_restore(flags);
- return 0;
+ return ret;
}

void __init cf_bootmem_alloc(void)
--
2.35.3

2023-05-10 04:55:46

by Hugh Dickins

Subject: [PATCH 08/23] parisc: add pte_unmap() to balance get_ptep()

To keep balance in future, remember to pte_unmap() after a successful
get_ptep(). And (we might as well) pretend that flush_cache_pages()
really needed a map there, to read the pfn before "unmapping".

Signed-off-by: Hugh Dickins <[email protected]>
---
arch/parisc/kernel/cache.c | 26 +++++++++++++++++++++-----
1 file changed, 21 insertions(+), 5 deletions(-)

diff --git a/arch/parisc/kernel/cache.c b/arch/parisc/kernel/cache.c
index 1d3b8bc8a623..b0c969b3a300 100644
--- a/arch/parisc/kernel/cache.c
+++ b/arch/parisc/kernel/cache.c
@@ -425,10 +425,15 @@ void flush_dcache_page(struct page *page)
offset = (pgoff - mpnt->vm_pgoff) << PAGE_SHIFT;
addr = mpnt->vm_start + offset;
if (parisc_requires_coherency()) {
+ bool needs_flush = false;
pte_t *ptep;

ptep = get_ptep(mpnt->vm_mm, addr);
- if (ptep && pte_needs_flush(*ptep))
+ if (ptep) {
+ needs_flush = pte_needs_flush(*ptep);
+ pte_unmap(ptep);
+ }
+ if (needs_flush)
flush_user_cache_page(mpnt, addr);
} else {
/*
@@ -560,14 +565,20 @@ EXPORT_SYMBOL(flush_kernel_dcache_page_addr);
static void flush_cache_page_if_present(struct vm_area_struct *vma,
unsigned long vmaddr, unsigned long pfn)
{
- pte_t *ptep = get_ptep(vma->vm_mm, vmaddr);
+ bool needs_flush = false;
+ pte_t *ptep;

/*
* The pte check is racy and sometimes the flush will trigger
* a non-access TLB miss. Hopefully, the page has already been
* flushed.
*/
- if (ptep && pte_needs_flush(*ptep))
+ ptep = get_ptep(vma->vm_mm, vmaddr);
+ if (ptep) {
+ needs_flush = pte_needs_flush(*ptep);
+ pte_unmap(ptep);
+ }
+ if (needs_flush)
flush_cache_page(vma, vmaddr, pfn);
}

@@ -634,17 +645,22 @@ static void flush_cache_pages(struct vm_area_struct *vma, unsigned long start, u
pte_t *ptep;

for (addr = start; addr < end; addr += PAGE_SIZE) {
+ bool needs_flush = false;
/*
* The vma can contain pages that aren't present. Although
* the pte search is expensive, we need the pte to find the
* page pfn and to check whether the page should be flushed.
*/
ptep = get_ptep(vma->vm_mm, addr);
- if (ptep && pte_needs_flush(*ptep)) {
+ if (ptep) {
+ needs_flush = pte_needs_flush(*ptep);
+ pfn = pte_pfn(*ptep);
+ pte_unmap(ptep);
+ }
+ if (needs_flush) {
if (parisc_requires_coherency()) {
flush_user_cache_page(vma, addr);
} else {
- pfn = pte_pfn(*ptep);
if (WARN_ON(!pfn_valid(pfn)))
return;
__flush_cache_page(vma, addr, PFN_PHYS(pfn));
--
2.35.3

2023-05-10 05:00:19

by Hugh Dickins

Subject: [PATCH 10/23] parisc/hugetlb: pte_alloc_huge() pte_offset_huge()

pte_alloc_map() expects to be followed by pte_unmap(), but hugetlb omits
that: to keep balance in future, use the recently added pte_alloc_huge()
instead; with pte_offset_huge() a better name for pte_offset_kernel().

Signed-off-by: Hugh Dickins <[email protected]>
---
arch/parisc/mm/hugetlbpage.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/parisc/mm/hugetlbpage.c b/arch/parisc/mm/hugetlbpage.c
index d1d3990b83f6..a8a1a7c1e16e 100644
--- a/arch/parisc/mm/hugetlbpage.c
+++ b/arch/parisc/mm/hugetlbpage.c
@@ -66,7 +66,7 @@ pte_t *huge_pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma,
if (pud) {
pmd = pmd_alloc(mm, pud, addr);
if (pmd)
- pte = pte_alloc_map(mm, pmd, addr);
+ pte = pte_alloc_huge(mm, pmd, addr);
}
return pte;
}
@@ -90,7 +90,7 @@ pte_t *huge_pte_offset(struct mm_struct *mm,
if (!pud_none(*pud)) {
pmd = pmd_offset(pud, addr);
if (!pmd_none(*pmd))
- pte = pte_offset_map(pmd, addr);
+ pte = pte_offset_huge(pmd, addr);
}
}
}
--
2.35.3

2023-05-10 05:01:28

by Hugh Dickins

Subject: [PATCH 07/23] mips: update_mmu_cache() can replace __update_tlb()

Don't make update_mmu_cache() a wrapper around __update_tlb(): call it
directly, and use the ptep (or pmdp) provided by the caller, instead of
re-calling pte_offset_map() - which would raise a question of whether a
pte_unmap() is needed to balance it.

Check whether the "ptep" provided by the caller is actually the pmdp,
instead of testing pmd_huge(): or test pmd_huge() too and warn if it
disagrees? This is "hazardous" territory: needs review and testing.

Signed-off-by: Hugh Dickins <[email protected]>
---
arch/mips/include/asm/pgtable.h | 15 +++------------
arch/mips/mm/tlb-r3k.c | 5 +++--
arch/mips/mm/tlb-r4k.c | 9 +++------
3 files changed, 9 insertions(+), 20 deletions(-)

diff --git a/arch/mips/include/asm/pgtable.h b/arch/mips/include/asm/pgtable.h
index 574fa14ac8b2..9175dfab08d5 100644
--- a/arch/mips/include/asm/pgtable.h
+++ b/arch/mips/include/asm/pgtable.h
@@ -565,15 +565,8 @@ static inline pte_t pte_swp_clear_exclusive(pte_t pte)
}
#endif

-extern void __update_tlb(struct vm_area_struct *vma, unsigned long address,
- pte_t pte);
-
-static inline void update_mmu_cache(struct vm_area_struct *vma,
- unsigned long address, pte_t *ptep)
-{
- pte_t pte = *ptep;
- __update_tlb(vma, address, pte);
-}
+extern void update_mmu_cache(struct vm_area_struct *vma,
+ unsigned long address, pte_t *ptep);

#define __HAVE_ARCH_UPDATE_MMU_TLB
#define update_mmu_tlb update_mmu_cache
@@ -581,9 +574,7 @@ static inline void update_mmu_cache(struct vm_area_struct *vma,
static inline void update_mmu_cache_pmd(struct vm_area_struct *vma,
unsigned long address, pmd_t *pmdp)
{
- pte_t pte = *(pte_t *)pmdp;
-
- __update_tlb(vma, address, pte);
+ update_mmu_cache(vma, address, (pte_t *)pmdp);
}

/*
diff --git a/arch/mips/mm/tlb-r3k.c b/arch/mips/mm/tlb-r3k.c
index 53dfa2b9316b..e5722cd8dd6d 100644
--- a/arch/mips/mm/tlb-r3k.c
+++ b/arch/mips/mm/tlb-r3k.c
@@ -176,7 +176,8 @@ void local_flush_tlb_page(struct vm_area_struct *vma, unsigned long page)
}
}

-void __update_tlb(struct vm_area_struct *vma, unsigned long address, pte_t pte)
+void update_mmu_cache(struct vm_area_struct *vma,
+ unsigned long address, pte_t *ptep)
{
unsigned long asid_mask = cpu_asid_mask(&current_cpu_data);
unsigned long flags;
@@ -203,7 +204,7 @@ void __update_tlb(struct vm_area_struct *vma, unsigned long address, pte_t pte)
BARRIER;
tlb_probe();
idx = read_c0_index();
- write_c0_entrylo0(pte_val(pte));
+ write_c0_entrylo0(pte_val(*ptep));
write_c0_entryhi(address | pid);
if (idx < 0) { /* BARRIER */
tlb_write_random();
diff --git a/arch/mips/mm/tlb-r4k.c b/arch/mips/mm/tlb-r4k.c
index 1b939abbe4ca..c96725d17cab 100644
--- a/arch/mips/mm/tlb-r4k.c
+++ b/arch/mips/mm/tlb-r4k.c
@@ -290,14 +290,14 @@ void local_flush_tlb_one(unsigned long page)
* updates the TLB with the new pte(s), and another which also checks
* for the R4k "end of page" hardware bug and does the needy.
*/
-void __update_tlb(struct vm_area_struct * vma, unsigned long address, pte_t pte)
+void update_mmu_cache(struct vm_area_struct *vma,
+ unsigned long address, pte_t *ptep)
{
unsigned long flags;
pgd_t *pgdp;
p4d_t *p4dp;
pud_t *pudp;
pmd_t *pmdp;
- pte_t *ptep;
int idx, pid;

/*
@@ -326,10 +326,9 @@ void __update_tlb(struct vm_area_struct * vma, unsigned long address, pte_t pte)
idx = read_c0_index();
#ifdef CONFIG_MIPS_HUGE_TLB_SUPPORT
/* this could be a huge page */
- if (pmd_huge(*pmdp)) {
+ if (ptep == (pte_t *)pmdp) {
unsigned long lo;
write_c0_pagemask(PM_HUGE_MASK);
- ptep = (pte_t *)pmdp;
lo = pte_to_entrylo(pte_val(*ptep));
write_c0_entrylo0(lo);
write_c0_entrylo1(lo + (HPAGE_SIZE >> 7));
@@ -344,8 +343,6 @@ void __update_tlb(struct vm_area_struct * vma, unsigned long address, pte_t pte)
} else
#endif
{
- ptep = pte_offset_map(pmdp, address);
-
#if defined(CONFIG_PHYS_ADDR_T_64BIT) && defined(CONFIG_CPU_MIPS32)
#ifdef CONFIG_XPA
write_c0_entrylo0(pte_to_entrylo(ptep->pte_high));
--
2.35.3

2023-05-10 05:01:42

by Hugh Dickins

Subject: [PATCH 09/23] parisc: unmap_uncached_pte() use pte_offset_kernel()

unmap_uncached_pte() is working from pgd_offset_k(vaddr), so it should
use pte_offset_kernel() instead of pte_offset_map(), to avoid the
question of whether a pte_unmap() will be needed to balance.

Signed-off-by: Hugh Dickins <[email protected]>
---
arch/parisc/kernel/pci-dma.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/parisc/kernel/pci-dma.c b/arch/parisc/kernel/pci-dma.c
index ba87f791323b..52d5f8a5cdd2 100644
--- a/arch/parisc/kernel/pci-dma.c
+++ b/arch/parisc/kernel/pci-dma.c
@@ -164,7 +164,7 @@ static inline void unmap_uncached_pte(pmd_t * pmd, unsigned long vaddr,
pmd_clear(pmd);
return;
}
- pte = pte_offset_map(pmd, vaddr);
+ pte = pte_offset_kernel(pmd, vaddr);
vaddr &= ~PMD_MASK;
end = vaddr + size;
if (end > PMD_SIZE)
--
2.35.3

2023-05-10 05:02:44

by Hugh Dickins

Subject: [PATCH 13/23] powerpc/hugetlb: pte_alloc_huge()

pte_alloc_map() expects to be followed by pte_unmap(), but hugetlb omits
that: to keep balance in future, use the recently added pte_alloc_huge()
instead. huge_pte_offset() is using __find_linux_pte(), which is using
pte_offset_kernel() - don't rename that to _huge, it's more complicated.

Signed-off-by: Hugh Dickins <[email protected]>
---
arch/powerpc/mm/hugetlbpage.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c
index b900933507da..f7c683b672c1 100644
--- a/arch/powerpc/mm/hugetlbpage.c
+++ b/arch/powerpc/mm/hugetlbpage.c
@@ -183,7 +183,7 @@ pte_t *huge_pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma,
return NULL;

if (IS_ENABLED(CONFIG_PPC_8xx) && pshift < PMD_SHIFT)
- return pte_alloc_map(mm, (pmd_t *)hpdp, addr);
+ return pte_alloc_huge(mm, (pmd_t *)hpdp, addr);

BUG_ON(!hugepd_none(*hpdp) && !hugepd_ok(*hpdp));

--
2.35.3

2023-05-10 05:04:05

by Hugh Dickins

Subject: [PATCH 11/23] powerpc: kvmppc_unmap_free_pmd() pte_offset_kernel()

kvmppc_unmap_free_pmd() use pte_offset_kernel(), like everywhere else
in book3s_64_mmu_radix.c: instead of pte_offset_map(), which will come
to need a pte_unmap() to balance it.

But note that this is a more complex case than most: see those -EAGAINs
in kvmppc_create_pte(), which is coping with kvmppc races between page
table and huge entry, of the kind which we are expecting to address
in pte_offset_map() - this might want to be revisited in future.

Signed-off-by: Hugh Dickins <[email protected]>
---
arch/powerpc/kvm/book3s_64_mmu_radix.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/kvm/book3s_64_mmu_radix.c b/arch/powerpc/kvm/book3s_64_mmu_radix.c
index 461307b89c3a..572707858d65 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_radix.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_radix.c
@@ -509,7 +509,7 @@ static void kvmppc_unmap_free_pmd(struct kvm *kvm, pmd_t *pmd, bool full,
} else {
pte_t *pte;

- pte = pte_offset_map(p, 0);
+ pte = pte_offset_kernel(p, 0);
kvmppc_unmap_free_pte(kvm, pte, full, lpid);
pmd_clear(p);
}
--
2.35.3

2023-05-10 05:05:18

by Hugh Dickins

Subject: [PATCH 12/23] powerpc: allow pte_offset_map[_lock]() to fail

In rare transient cases, not yet made possible, pte_offset_map() and
pte_offset_map_lock() may not find a page table: handle appropriately.
Balance successful pte_offset_map() with pte_unmap() where omitted.

Signed-off-by: Hugh Dickins <[email protected]>
---
arch/powerpc/mm/book3s64/hash_tlb.c | 4 ++++
arch/powerpc/mm/book3s64/subpage_prot.c | 2 ++
arch/powerpc/xmon/xmon.c | 5 ++++-
3 files changed, 10 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/mm/book3s64/hash_tlb.c b/arch/powerpc/mm/book3s64/hash_tlb.c
index a64ea0a7ef96..21fcad97ae80 100644
--- a/arch/powerpc/mm/book3s64/hash_tlb.c
+++ b/arch/powerpc/mm/book3s64/hash_tlb.c
@@ -239,12 +239,16 @@ void flush_hash_table_pmd_range(struct mm_struct *mm, pmd_t *pmd, unsigned long
local_irq_save(flags);
arch_enter_lazy_mmu_mode();
start_pte = pte_offset_map(pmd, addr);
+ if (!start_pte)
+ goto out;
for (pte = start_pte; pte < start_pte + PTRS_PER_PTE; pte++) {
unsigned long pteval = pte_val(*pte);
if (pteval & H_PAGE_HASHPTE)
hpte_need_flush(mm, addr, pte, pteval, 0);
addr += PAGE_SIZE;
}
+ pte_unmap(start_pte);
+out:
arch_leave_lazy_mmu_mode();
local_irq_restore(flags);
}
diff --git a/arch/powerpc/mm/book3s64/subpage_prot.c b/arch/powerpc/mm/book3s64/subpage_prot.c
index b75a9fb99599..0dc85556dec5 100644
--- a/arch/powerpc/mm/book3s64/subpage_prot.c
+++ b/arch/powerpc/mm/book3s64/subpage_prot.c
@@ -71,6 +71,8 @@ static void hpte_flush_range(struct mm_struct *mm, unsigned long addr,
if (pmd_none(*pmd))
return;
pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
+ if (!pte)
+ return;
arch_enter_lazy_mmu_mode();
for (; npages > 0; --npages) {
pte_update(mm, addr, pte, 0, 0, 0);
diff --git a/arch/powerpc/xmon/xmon.c b/arch/powerpc/xmon/xmon.c
index 728d3c257e4a..69447bdf0bcf 100644
--- a/arch/powerpc/xmon/xmon.c
+++ b/arch/powerpc/xmon/xmon.c
@@ -3376,12 +3376,15 @@ static void show_pte(unsigned long addr)
printf("pmdp @ 0x%px = 0x%016lx\n", pmdp, pmd_val(*pmdp));

ptep = pte_offset_map(pmdp, addr);
- if (pte_none(*ptep)) {
+ if (!ptep || pte_none(*ptep)) {
+ if (ptep)
+ pte_unmap(ptep);
printf("no valid PTE\n");
return;
}

format_pte(ptep, pte_val(*ptep));
+ pte_unmap(ptep);

sync();
__delay(200);
--
2.35.3

2023-05-10 05:06:21

by Hugh Dickins

Subject: [PATCH 14/23] riscv/hugetlb: pte_alloc_huge() pte_offset_huge()

pte_alloc_map() expects to be followed by pte_unmap(), but hugetlb omits
that: to keep balance in future, use the recently added pte_alloc_huge()
instead; with pte_offset_huge() a better name for pte_offset_kernel().

Signed-off-by: Hugh Dickins <[email protected]>
---
arch/riscv/mm/hugetlbpage.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/riscv/mm/hugetlbpage.c b/arch/riscv/mm/hugetlbpage.c
index a163a3e0f0d4..80926946759f 100644
--- a/arch/riscv/mm/hugetlbpage.c
+++ b/arch/riscv/mm/hugetlbpage.c
@@ -43,7 +43,7 @@ pte_t *huge_pte_alloc(struct mm_struct *mm,

for_each_napot_order(order) {
if (napot_cont_size(order) == sz) {
- pte = pte_alloc_map(mm, pmd, addr & napot_cont_mask(order));
+ pte = pte_alloc_huge(mm, pmd, addr & napot_cont_mask(order));
break;
}
}
@@ -90,7 +90,7 @@ pte_t *huge_pte_offset(struct mm_struct *mm,

for_each_napot_order(order) {
if (napot_cont_size(order) == sz) {
- pte = pte_offset_kernel(pmd, addr & napot_cont_mask(order));
+ pte = pte_offset_huge(pmd, addr & napot_cont_mask(order));
break;
}
}
--
2.35.3

2023-05-10 05:09:01

by Hugh Dickins

Subject: [PATCH 15/23] s390: allow pte_offset_map_lock() to fail

In rare transient cases, not yet made possible, pte_offset_map() and
pte_offset_map_lock() may not find a page table: handle appropriately.

Signed-off-by: Hugh Dickins <[email protected]>
---
arch/s390/kernel/uv.c | 2 ++
arch/s390/mm/gmap.c | 2 ++
arch/s390/mm/pgtable.c | 12 +++++++++---
3 files changed, 13 insertions(+), 3 deletions(-)

diff --git a/arch/s390/kernel/uv.c b/arch/s390/kernel/uv.c
index cb2ee06df286..3c62d1b218b1 100644
--- a/arch/s390/kernel/uv.c
+++ b/arch/s390/kernel/uv.c
@@ -294,6 +294,8 @@ int gmap_make_secure(struct gmap *gmap, unsigned long gaddr, void *uvcb)

rc = -ENXIO;
ptep = get_locked_pte(gmap->mm, uaddr, &ptelock);
+ if (!ptep)
+ goto out;
if (pte_present(*ptep) && !(pte_val(*ptep) & _PAGE_INVALID) && pte_write(*ptep)) {
page = pte_page(*ptep);
rc = -EAGAIN;
diff --git a/arch/s390/mm/gmap.c b/arch/s390/mm/gmap.c
index dc90d1eb0d55..d198fc9475a2 100644
--- a/arch/s390/mm/gmap.c
+++ b/arch/s390/mm/gmap.c
@@ -2549,6 +2549,8 @@ static int __zap_zero_pages(pmd_t *pmd, unsigned long start,
spinlock_t *ptl;

ptep = pte_offset_map_lock(walk->mm, pmd, addr, &ptl);
+ if (!ptep)
+ break;
if (is_zero_pfn(pte_pfn(*ptep)))
ptep_xchg_direct(walk->mm, addr, ptep, __pte(_PAGE_INVALID));
pte_unmap_unlock(ptep, ptl);
diff --git a/arch/s390/mm/pgtable.c b/arch/s390/mm/pgtable.c
index 6effb24de6d9..3bd2ab2a9a34 100644
--- a/arch/s390/mm/pgtable.c
+++ b/arch/s390/mm/pgtable.c
@@ -829,7 +829,7 @@ int set_guest_storage_key(struct mm_struct *mm, unsigned long addr,
default:
return -EFAULT;
}
-
+again:
ptl = pmd_lock(mm, pmdp);
if (!pmd_present(*pmdp)) {
spin_unlock(ptl);
@@ -850,6 +850,8 @@ int set_guest_storage_key(struct mm_struct *mm, unsigned long addr,
spin_unlock(ptl);

ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);
+ if (!ptep)
+ goto again;
new = old = pgste_get_lock(ptep);
pgste_val(new) &= ~(PGSTE_GR_BIT | PGSTE_GC_BIT |
PGSTE_ACC_BITS | PGSTE_FP_BIT);
@@ -938,7 +940,7 @@ int reset_guest_reference_bit(struct mm_struct *mm, unsigned long addr)
default:
return -EFAULT;
}
-
+again:
ptl = pmd_lock(mm, pmdp);
if (!pmd_present(*pmdp)) {
spin_unlock(ptl);
@@ -955,6 +957,8 @@ int reset_guest_reference_bit(struct mm_struct *mm, unsigned long addr)
spin_unlock(ptl);

ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);
+ if (!ptep)
+ goto again;
new = old = pgste_get_lock(ptep);
/* Reset guest reference bit only */
pgste_val(new) &= ~PGSTE_GR_BIT;
@@ -1000,7 +1004,7 @@ int get_guest_storage_key(struct mm_struct *mm, unsigned long addr,
default:
return -EFAULT;
}
-
+again:
ptl = pmd_lock(mm, pmdp);
if (!pmd_present(*pmdp)) {
spin_unlock(ptl);
@@ -1017,6 +1021,8 @@ int get_guest_storage_key(struct mm_struct *mm, unsigned long addr,
spin_unlock(ptl);

ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);
+ if (!ptep)
+ goto again;
pgste = pgste_get_lock(ptep);
*key = (pgste_val(pgste) & (PGSTE_ACC_BITS | PGSTE_FP_BIT)) >> 56;
paddr = pte_val(*ptep) & PAGE_MASK;
--
2.35.3

2023-05-10 05:09:17

by Hugh Dickins

Subject: [PATCH 18/23] sparc/hugetlb: pte_alloc_huge() pte_offset_huge()

pte_alloc_map() expects to be followed by pte_unmap(), but hugetlb omits
that: to keep balance in future, use the recently added pte_alloc_huge()
instead; with pte_offset_huge() a better name for pte_offset_kernel().

Signed-off-by: Hugh Dickins <[email protected]>
---
arch/sparc/mm/hugetlbpage.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/sparc/mm/hugetlbpage.c b/arch/sparc/mm/hugetlbpage.c
index d8e0e3c7038d..d7018823206c 100644
--- a/arch/sparc/mm/hugetlbpage.c
+++ b/arch/sparc/mm/hugetlbpage.c
@@ -298,7 +298,7 @@ pte_t *huge_pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma,
return NULL;
if (sz >= PMD_SIZE)
return (pte_t *)pmd;
- return pte_alloc_map(mm, pmd, addr);
+ return pte_alloc_huge(mm, pmd, addr);
}

pte_t *huge_pte_offset(struct mm_struct *mm,
@@ -325,7 +325,7 @@ pte_t *huge_pte_offset(struct mm_struct *mm,
return NULL;
if (is_hugetlb_pmd(*pmd))
return (pte_t *)pmd;
- return pte_offset_map(pmd, addr);
+ return pte_offset_huge(pmd, addr);
}

void set_huge_pte_at(struct mm_struct *mm, unsigned long addr,
--
2.35.3

2023-05-10 05:10:17

by Hugh Dickins

Subject: [PATCH 16/23] s390: gmap use pte_unmap_unlock() not spin_unlock()

pte_alloc_map_lock() expects to be followed by pte_unmap_unlock(): to
keep balance in future, pass ptep as well as ptl to gmap_pte_op_end(),
and use pte_unmap_unlock() instead of direct spin_unlock() (even though
ptep ends up unused inside the macro).

Signed-off-by: Hugh Dickins <[email protected]>
---
arch/s390/mm/gmap.c | 22 +++++++++++-----------
1 file changed, 11 insertions(+), 11 deletions(-)

diff --git a/arch/s390/mm/gmap.c b/arch/s390/mm/gmap.c
index d198fc9475a2..638dcd9bc820 100644
--- a/arch/s390/mm/gmap.c
+++ b/arch/s390/mm/gmap.c
@@ -895,12 +895,12 @@ static int gmap_pte_op_fixup(struct gmap *gmap, unsigned long gaddr,

/**
* gmap_pte_op_end - release the page table lock
- * @ptl: pointer to the spinlock pointer
+ * @ptep: pointer to the locked pte
+ * @ptl: pointer to the page table spinlock
*/
-static void gmap_pte_op_end(spinlock_t *ptl)
+static void gmap_pte_op_end(pte_t *ptep, spinlock_t *ptl)
{
- if (ptl)
- spin_unlock(ptl);
+ pte_unmap_unlock(ptep, ptl);
}

/**
@@ -1011,7 +1011,7 @@ static int gmap_protect_pte(struct gmap *gmap, unsigned long gaddr,
{
int rc;
pte_t *ptep;
- spinlock_t *ptl = NULL;
+ spinlock_t *ptl;
unsigned long pbits = 0;

if (pmd_val(*pmdp) & _SEGMENT_ENTRY_INVALID)
@@ -1025,7 +1025,7 @@ static int gmap_protect_pte(struct gmap *gmap, unsigned long gaddr,
pbits |= (bits & GMAP_NOTIFY_SHADOW) ? PGSTE_VSIE_BIT : 0;
/* Protect and unlock. */
rc = ptep_force_prot(gmap->mm, gaddr, ptep, prot, pbits);
- gmap_pte_op_end(ptl);
+ gmap_pte_op_end(ptep, ptl);
return rc;
}

@@ -1154,7 +1154,7 @@ int gmap_read_table(struct gmap *gmap, unsigned long gaddr, unsigned long *val)
/* Do *NOT* clear the _PAGE_INVALID bit! */
rc = 0;
}
- gmap_pte_op_end(ptl);
+ gmap_pte_op_end(ptep, ptl);
}
if (!rc)
break;
@@ -1248,7 +1248,7 @@ static int gmap_protect_rmap(struct gmap *sg, unsigned long raddr,
if (!rc)
gmap_insert_rmap(sg, vmaddr, rmap);
spin_unlock(&sg->guest_table_lock);
- gmap_pte_op_end(ptl);
+ gmap_pte_op_end(ptep, ptl);
}
radix_tree_preload_end();
if (rc) {
@@ -2156,7 +2156,7 @@ int gmap_shadow_page(struct gmap *sg, unsigned long saddr, pte_t pte)
tptep = (pte_t *) gmap_table_walk(sg, saddr, 0);
if (!tptep) {
spin_unlock(&sg->guest_table_lock);
- gmap_pte_op_end(ptl);
+ gmap_pte_op_end(sptep, ptl);
radix_tree_preload_end();
break;
}
@@ -2167,7 +2167,7 @@ int gmap_shadow_page(struct gmap *sg, unsigned long saddr, pte_t pte)
rmap = NULL;
rc = 0;
}
- gmap_pte_op_end(ptl);
+ gmap_pte_op_end(sptep, ptl);
spin_unlock(&sg->guest_table_lock);
}
radix_tree_preload_end();
@@ -2495,7 +2495,7 @@ void gmap_sync_dirty_log_pmd(struct gmap *gmap, unsigned long bitmap[4],
continue;
if (ptep_test_and_clear_uc(gmap->mm, vmaddr, ptep))
set_bit(i, bitmap);
- spin_unlock(ptl);
+ pte_unmap_unlock(ptep, ptl);
}
}
gmap_pmd_op_end(gmap, pmdp);
--
2.35.3

2023-05-10 05:10:25

by Hugh Dickins

Subject: [PATCH 19/23] sparc: allow pte_offset_map() to fail

In rare transient cases, not yet made possible, pte_offset_map() and
pte_offset_map_lock() may not find a page table: handle appropriately.

Signed-off-by: Hugh Dickins <[email protected]>
---
arch/sparc/kernel/signal32.c | 2 ++
arch/sparc/mm/fault_64.c | 3 +++
arch/sparc/mm/tlb.c | 2 ++
3 files changed, 7 insertions(+)

diff --git a/arch/sparc/kernel/signal32.c b/arch/sparc/kernel/signal32.c
index dad38960d1a8..ca450c7bc53f 100644
--- a/arch/sparc/kernel/signal32.c
+++ b/arch/sparc/kernel/signal32.c
@@ -328,6 +328,8 @@ static void flush_signal_insns(unsigned long address)
goto out_irqs_on;

ptep = pte_offset_map(pmdp, address);
+ if (!ptep)
+ goto out_irqs_on;
pte = *ptep;
if (!pte_present(pte))
goto out_unmap;
diff --git a/arch/sparc/mm/fault_64.c b/arch/sparc/mm/fault_64.c
index d91305de694c..d8a407fbe350 100644
--- a/arch/sparc/mm/fault_64.c
+++ b/arch/sparc/mm/fault_64.c
@@ -99,6 +99,7 @@ static unsigned int get_user_insn(unsigned long tpc)
local_irq_disable();

pmdp = pmd_offset(pudp, tpc);
+again:
if (pmd_none(*pmdp) || unlikely(pmd_bad(*pmdp)))
goto out_irq_enable;

@@ -115,6 +116,8 @@ static unsigned int get_user_insn(unsigned long tpc)
#endif
{
ptep = pte_offset_map(pmdp, tpc);
+ if (!ptep)
+ goto again;
pte = *ptep;
if (pte_present(pte)) {
pa = (pte_pfn(pte) << PAGE_SHIFT);
diff --git a/arch/sparc/mm/tlb.c b/arch/sparc/mm/tlb.c
index 9a725547578e..7ecf8556947a 100644
--- a/arch/sparc/mm/tlb.c
+++ b/arch/sparc/mm/tlb.c
@@ -149,6 +149,8 @@ static void tlb_batch_pmd_scan(struct mm_struct *mm, unsigned long vaddr,
pte_t *pte;

pte = pte_offset_map(&pmd, vaddr);
+ if (!pte)
+ return;
end = vaddr + HPAGE_SIZE;
while (vaddr < end) {
if (pte_val(*pte) & _PAGE_VALID) {
--
2.35.3

2023-05-10 05:13:31

by Hugh Dickins

Subject: [PATCH 20/23] sparc: iounit and iommu use pte_offset_kernel()

iounit_alloc() and sbus_iommu_alloc() are working from pmd_off_k(),
so should use pte_offset_kernel() instead of pte_offset_map(), to avoid
the question of whether a pte_unmap() will be needed to balance.

Signed-off-by: Hugh Dickins <[email protected]>
---
arch/sparc/mm/io-unit.c | 2 +-
arch/sparc/mm/iommu.c | 2 +-
2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/sparc/mm/io-unit.c b/arch/sparc/mm/io-unit.c
index bf3e6d2fe5d9..133dd42570d6 100644
--- a/arch/sparc/mm/io-unit.c
+++ b/arch/sparc/mm/io-unit.c
@@ -244,7 +244,7 @@ static void *iounit_alloc(struct device *dev, size_t len,
long i;

pmdp = pmd_off_k(addr);
- ptep = pte_offset_map(pmdp, addr);
+ ptep = pte_offset_kernel(pmdp, addr);

set_pte(ptep, mk_pte(virt_to_page(page), dvma_prot));

diff --git a/arch/sparc/mm/iommu.c b/arch/sparc/mm/iommu.c
index 9e3f6933ca13..3a6caef68348 100644
--- a/arch/sparc/mm/iommu.c
+++ b/arch/sparc/mm/iommu.c
@@ -358,7 +358,7 @@ static void *sbus_iommu_alloc(struct device *dev, size_t len,
__flush_page_to_ram(page);

pmdp = pmd_off_k(addr);
- ptep = pte_offset_map(pmdp, addr);
+ ptep = pte_offset_kernel(pmdp, addr);

set_pte(ptep, mk_pte(virt_to_page(page), dvma_prot));
}
--
2.35.3

2023-05-10 05:24:10

by Hugh Dickins

Subject: [PATCH 23/23] xtensa: add pte_unmap() to balance pte_offset_map()

To keep balance in future, remember to pte_unmap() after a successful
pte_offset_map(). And (might as well) pretend that get_pte_for_vaddr()
really needed a map there, to read the pteval before "unmapping".

Signed-off-by: Hugh Dickins <[email protected]>
---
arch/xtensa/mm/tlb.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/arch/xtensa/mm/tlb.c b/arch/xtensa/mm/tlb.c
index 27a477dae232..0a11fc5f185b 100644
--- a/arch/xtensa/mm/tlb.c
+++ b/arch/xtensa/mm/tlb.c
@@ -179,6 +179,7 @@ static unsigned get_pte_for_vaddr(unsigned vaddr)
pud_t *pud;
pmd_t *pmd;
pte_t *pte;
+ unsigned int pteval;

if (!mm)
mm = task->active_mm;
@@ -197,7 +198,9 @@ static unsigned get_pte_for_vaddr(unsigned vaddr)
pte = pte_offset_map(pmd, vaddr);
if (!pte)
return 0;
- return pte_val(*pte);
+ pteval = pte_val(*pte);
+ pte_unmap(pte);
+ return pteval;
}

enum {
--
2.35.3

2023-05-10 05:24:40

by Hugh Dickins

Subject: [PATCH 22/23] x86: sme_populate_pgd() use pte_offset_kernel()

sme_populate_pgd() is an __init function for sme_encrypt_kernel():
it should use pte_offset_kernel() instead of pte_offset_map(), to avoid
the question of whether a pte_unmap() will be needed to balance.

Signed-off-by: Hugh Dickins <[email protected]>
---
arch/x86/mm/mem_encrypt_identity.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/mm/mem_encrypt_identity.c b/arch/x86/mm/mem_encrypt_identity.c
index c6efcf559d88..a1ab542bdfd6 100644
--- a/arch/x86/mm/mem_encrypt_identity.c
+++ b/arch/x86/mm/mem_encrypt_identity.c
@@ -188,7 +188,7 @@ static void __init sme_populate_pgd(struct sme_populate_pgd_data *ppd)
if (pmd_large(*pmd))
return;

- pte = pte_offset_map(pmd, ppd->vaddr);
+ pte = pte_offset_kernel(pmd, ppd->vaddr);
if (pte_none(*pte))
set_pte(pte, __pte(ppd->paddr | ppd->pte_flags));
}
--
2.35.3

2023-05-10 05:25:53

by Hugh Dickins

Subject: [PATCH 17/23] sh/hugetlb: pte_alloc_huge() pte_offset_huge()

pte_alloc_map() expects to be followed by pte_unmap(), but hugetlb omits
that: to keep balance in future, use the recently added pte_alloc_huge()
instead; with pte_offset_huge() a better name for pte_offset_kernel().

Signed-off-by: Hugh Dickins <[email protected]>
---
arch/sh/mm/hugetlbpage.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/sh/mm/hugetlbpage.c b/arch/sh/mm/hugetlbpage.c
index 999ab5916e69..6cb0ad73dbb9 100644
--- a/arch/sh/mm/hugetlbpage.c
+++ b/arch/sh/mm/hugetlbpage.c
@@ -38,7 +38,7 @@ pte_t *huge_pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma,
if (pud) {
pmd = pmd_alloc(mm, pud, addr);
if (pmd)
- pte = pte_alloc_map(mm, pmd, addr);
+ pte = pte_alloc_huge(mm, pmd, addr);
}
}
}
@@ -63,7 +63,7 @@ pte_t *huge_pte_offset(struct mm_struct *mm,
if (pud) {
pmd = pmd_offset(pud, addr);
if (pmd)
- pte = pte_offset_map(pmd, addr);
+ pte = pte_offset_huge(pmd, addr);
}
}
}
--
2.35.3

2023-05-10 05:25:54

by Hugh Dickins

Subject: [PATCH 21/23] x86: Allow get_locked_pte() to fail

In rare transient cases, not yet made possible, pte_offset_map() and
pte_offset_map_lock() may not find a page table: handle appropriately.

Signed-off-by: Hugh Dickins <[email protected]>
---
arch/x86/kernel/ldt.c | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/ldt.c b/arch/x86/kernel/ldt.c
index 525876e7b9f4..eb844549cd83 100644
--- a/arch/x86/kernel/ldt.c
+++ b/arch/x86/kernel/ldt.c
@@ -367,8 +367,10 @@ static void unmap_ldt_struct(struct mm_struct *mm, struct ldt_struct *ldt)

va = (unsigned long)ldt_slot_va(ldt->slot) + offset;
ptep = get_locked_pte(mm, va, &ptl);
- pte_clear(mm, va, ptep);
- pte_unmap_unlock(ptep, ptl);
+ if (ptep) {
+ pte_clear(mm, va, ptep);
+ pte_unmap_unlock(ptep, ptl);
+ }
}

va = (unsigned long)ldt_slot_va(ldt->slot);
--
2.35.3

2023-05-10 06:17:03

by Matthew Wilcox

Subject: Re: [PATCH 00/23] arch: allow pte_offset_map[_lock]() to fail

On Tue, May 09, 2023 at 09:39:13PM -0700, Hugh Dickins wrote:
> Two: pte_offset_map() will need to do an rcu_read_lock(), with the
> corresponding rcu_read_unlock() in pte_unmap(). But most architectures
> never supported CONFIG_HIGHPTE, so some don't always call pte_unmap()
> after pte_offset_map(), or have used userspace pte_offset_map() where
> pte_offset_kernel() is more correct. No problem in the current tree,
> but a problem once an rcu_read_unlock() will be needed to keep balance.

Hi Hugh,

I shall have to spend some time looking at these patches, but at LSFMM
just a few hours ago, I proposed and nobody objected to removing
CONFIG_HIGHPTE. I don't intend to take action on that consensus
immediately, so I can certainly wait until your patches are applied, but
if this information simplifies what you're doing, feel free to act on it.

2023-05-10 07:28:51

by Geert Uytterhoeven

Subject: Re: [PATCH 05/23] m68k: allow pte_offset_map[_lock]() to fail

Hi Hugh,

Thanks for your patch!

On Wed, May 10, 2023 at 6:48 AM Hugh Dickins <[email protected]> wrote:
> In rare transient cases, not yet made possible, pte_offset_map() and
> pte_offset_map_lock() may not find a page table: handle appropriately.
>
> Restructure cf_tlb_miss() with a pte_unmap() (previously omitted)
> at label out, followed by one local_irq_restore() for all.

That's a bug fix, which should be a separate patch?

>
> Signed-off-by: Hugh Dickins <[email protected]>


> --- a/arch/m68k/include/asm/mmu_context.h
> +++ b/arch/m68k/include/asm/mmu_context.h
> @@ -99,7 +99,7 @@ static inline void load_ksp_mmu(struct task_struct *task)
> p4d_t *p4d;
> pud_t *pud;
> pmd_t *pmd;
> - pte_t *pte;
> + pte_t *pte = NULL;
> unsigned long mmuar;
>
> local_irq_save(flags);
> @@ -139,7 +139,7 @@ static inline void load_ksp_mmu(struct task_struct *task)
>
> pte = (mmuar >= PAGE_OFFSET) ? pte_offset_kernel(pmd, mmuar)
> : pte_offset_map(pmd, mmuar);
> - if (pte_none(*pte) || !pte_present(*pte))
> + if (!pte || pte_none(*pte) || !pte_present(*pte))
> goto bug;

If the absence of a pte is to become a non-abnormal case, it should
probably jump to "end" instead, to avoid spamming the kernel log.

>
> set_pte(pte, pte_mkyoung(*pte));
> @@ -161,6 +161,8 @@ static inline void load_ksp_mmu(struct task_struct *task)
> bug:
> pr_info("ksp load failed: mm=0x%p ksp=0x08%lx\n", mm, mmuar);
> end:
> + if (pte && mmuar < PAGE_OFFSET)
> + pte_unmap(pte);

Is this also a bugfix, not mentioned in the patch description?

> local_irq_restore(flags);
> }
>

Gr{oetje,eeting}s,

Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- [email protected]

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds

2023-05-10 08:13:32

by Alexandre Ghiti

Subject: Re: [PATCH 14/23] riscv/hugetlb: pte_alloc_huge() pte_offset_huge()

Hi Hugh,

On 5/10/23 06:59, Hugh Dickins wrote:
> pte_alloc_map() expects to be followed by pte_unmap(), but hugetlb omits
> that: to keep balance in future, use the recently added pte_alloc_huge()
> instead; with pte_offset_huge() a better name for pte_offset_kernel().
>
> Signed-off-by: Hugh Dickins <[email protected]>
> ---
> arch/riscv/mm/hugetlbpage.c | 4 ++--
> 1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/arch/riscv/mm/hugetlbpage.c b/arch/riscv/mm/hugetlbpage.c
> index a163a3e0f0d4..80926946759f 100644
> --- a/arch/riscv/mm/hugetlbpage.c
> +++ b/arch/riscv/mm/hugetlbpage.c
> @@ -43,7 +43,7 @@ pte_t *huge_pte_alloc(struct mm_struct *mm,
>
> for_each_napot_order(order) {
> if (napot_cont_size(order) == sz) {
> - pte = pte_alloc_map(mm, pmd, addr & napot_cont_mask(order));
> + pte = pte_alloc_huge(mm, pmd, addr & napot_cont_mask(order));
> break;
> }
> }
> @@ -90,7 +90,7 @@ pte_t *huge_pte_offset(struct mm_struct *mm,
>
> for_each_napot_order(order) {
> if (napot_cont_size(order) == sz) {
> - pte = pte_offset_kernel(pmd, addr & napot_cont_mask(order));
> + pte = pte_offset_huge(pmd, addr & napot_cont_mask(order));
> break;
> }
> }


Reviewed-by: Alexandre Ghiti <[email protected]>

Thanks,

Alex


2023-05-10 08:47:48

by Peter Zijlstra

Subject: Re: [PATCH 21/23] x86: Allow get_locked_pte() to fail

On Tue, May 09, 2023 at 10:08:37PM -0700, Hugh Dickins wrote:
> In rare transient cases, not yet made possible, pte_offset_map() and
> pte_offset_map_lock() may not find a page table: handle appropriately.
>
> Signed-off-by: Hugh Dickins <[email protected]>
> ---
> arch/x86/kernel/ldt.c | 6 ++++--
> 1 file changed, 4 insertions(+), 2 deletions(-)
>
> diff --git a/arch/x86/kernel/ldt.c b/arch/x86/kernel/ldt.c
> index 525876e7b9f4..eb844549cd83 100644
> --- a/arch/x86/kernel/ldt.c
> +++ b/arch/x86/kernel/ldt.c
> @@ -367,8 +367,10 @@ static void unmap_ldt_struct(struct mm_struct *mm, struct ldt_struct *ldt)
>
> va = (unsigned long)ldt_slot_va(ldt->slot) + offset;
> ptep = get_locked_pte(mm, va, &ptl);
> - pte_clear(mm, va, ptep);
> - pte_unmap_unlock(ptep, ptl);
> + if (ptep) {
> + pte_clear(mm, va, ptep);
> + pte_unmap_unlock(ptep, ptl);
> + }
> }

Ow geez, now I have to go remember how the whole PTI/LDT crud worked :/

At first glance this seems wrong; we can't just not unmap the LDT if we
can't find it in a hurry. Also, IIRC this isn't in fact a regular user
mapping, so it should not be subject to THP induced seizures.

... memory bubbles back ... for PTI kernels we need to map this in the
user and kernel page-tables because obviously userspace needs to be able
to have access to the LDT. But it is not directly accessible by
userspace. It lives in the cpu_entry_area as a virtual map of the real
kernel allocation, and this virtual address is used for LLDT.
Modification is done through sys_modify_ldt().

I think I would feel much better if this were something like:

if (!WARN_ON_ONCE(!ptep))

This really shouldn't fail and if it does, simply skipping it isn't the
right thing either.
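
That is, completing the hunk quoted above with the suggested check (just an
illustration of the idea, not a posted patch):

	va = (unsigned long)ldt_slot_va(ldt->slot) + offset;
	ptep = get_locked_pte(mm, va, &ptl);
	if (!WARN_ON_ONCE(!ptep)) {
		pte_clear(mm, va, ptep);
		pte_unmap_unlock(ptep, ptl);
	}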

2023-05-10 14:19:28

by Palmer Dabbelt

Subject: Re: [PATCH 14/23] riscv/hugetlb: pte_alloc_huge() pte_offset_huge()

On Tue, 09 May 2023 21:59:57 PDT (-0700), [email protected] wrote:
> pte_alloc_map() expects to be followed by pte_unmap(), but hugetlb omits
> that: to keep balance in future, use the recently added pte_alloc_huge()
> instead; with pte_offset_huge() a better name for pte_offset_kernel().
>
> Signed-off-by: Hugh Dickins <[email protected]>
> ---
> arch/riscv/mm/hugetlbpage.c | 4 ++--
> 1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/arch/riscv/mm/hugetlbpage.c b/arch/riscv/mm/hugetlbpage.c
> index a163a3e0f0d4..80926946759f 100644
> --- a/arch/riscv/mm/hugetlbpage.c
> +++ b/arch/riscv/mm/hugetlbpage.c
> @@ -43,7 +43,7 @@ pte_t *huge_pte_alloc(struct mm_struct *mm,
>
> for_each_napot_order(order) {
> if (napot_cont_size(order) == sz) {
> - pte = pte_alloc_map(mm, pmd, addr & napot_cont_mask(order));
> + pte = pte_alloc_huge(mm, pmd, addr & napot_cont_mask(order));
> break;
> }
> }
> @@ -90,7 +90,7 @@ pte_t *huge_pte_offset(struct mm_struct *mm,
>
> for_each_napot_order(order) {
> if (napot_cont_size(order) == sz) {
> - pte = pte_offset_kernel(pmd, addr & napot_cont_mask(order));
> + pte = pte_offset_huge(pmd, addr & napot_cont_mask(order));
> break;
> }
> }

Acked-by: Palmer Dabbelt <[email protected]>

2023-05-11 03:16:58

by Hugh Dickins

Subject: Re: [PATCH 05/23] m68k: allow pte_offset_map[_lock]() to fail

On Wed, 10 May 2023, Geert Uytterhoeven wrote:

> Hi Hugh,
>
> Thanks for your patch!

And thank you for looking so quickly, Geert.

>
> On Wed, May 10, 2023 at 6:48 AM Hugh Dickins <[email protected]> wrote:
> > In rare transient cases, not yet made possible, pte_offset_map() and
> > pte_offset_map_lock() may not find a page table: handle appropriately.
> >
> > Restructure cf_tlb_miss() with a pte_unmap() (previously omitted)
> > at label out, followed by one local_irq_restore() for all.
>
> That's a bug fix, which should be a separate patch?

No, that's not a bug fix for the current tree, since m68k does not
offer CONFIG_HIGHPTE, so pte_unmap() is never anything but a no-op
for m68k (see include/linux/pgtable.h).

But I want to change pte_unmap() to do something even without
CONFIG_HIGHPTE, so have to fix up any such previously harmless
omissions in this series first.

>
> >
> > Signed-off-by: Hugh Dickins <[email protected]>
>
>
> > --- a/arch/m68k/include/asm/mmu_context.h
> > +++ b/arch/m68k/include/asm/mmu_context.h
> > @@ -99,7 +99,7 @@ static inline void load_ksp_mmu(struct task_struct *task)
> > p4d_t *p4d;
> > pud_t *pud;
> > pmd_t *pmd;
> > - pte_t *pte;
> > + pte_t *pte = NULL;
> > unsigned long mmuar;
> >
> > local_irq_save(flags);
> > @@ -139,7 +139,7 @@ static inline void load_ksp_mmu(struct task_struct *task)
> >
> > pte = (mmuar >= PAGE_OFFSET) ? pte_offset_kernel(pmd, mmuar)
> > : pte_offset_map(pmd, mmuar);
> > - if (pte_none(*pte) || !pte_present(*pte))
> > + if (!pte || pte_none(*pte) || !pte_present(*pte))
> > goto bug;
>
> If the absence of a pte is to become a non-abnormal case, it should
> probably jump to "end" instead, to avoid spamming the kernel log.

I don't think so (but of course it's hard for you to tell, without
seeing all completed series of series). If pmd_none(*pmd) can safely
goto bug just above, and pte_none(*pte) goto bug here, well, the !pte
case is going to be stranger than either of those.

My understanding of this function, load_ksp_mmu(), is that it's dealing
at context switch with a part of userspace which very much needs to be
present: whatever keeps that from being swapped out or migrated at
present, will be sure to keep the !pte case away - we cannot steal its
page table just at random (and a THP on m68k would be surprising too).

Though there is one case I can think of which will cause !pte here,
and so goto bug: if the pmd entry has got corrupted, and counts as
pmd_bad(), which will be tested (and cleared) in pte_offset_map().
But it is okay to report a bug in that case.

I can certainly change this to goto end instead if you still prefer,
no problem; but I'd rather keep it as is, if only for me to be proved
wrong by you actually seeing spam there.

>
> >
> > set_pte(pte, pte_mkyoung(*pte));
> > @@ -161,6 +161,8 @@ static inline void load_ksp_mmu(struct task_struct *task)
> > bug:
> > pr_info("ksp load failed: mm=0x%p ksp=0x08%lx\n", mm, mmuar);
> > end:
> > + if (pte && mmuar < PAGE_OFFSET)
> > + pte_unmap(pte);
>
> Is this also a bugfix, not mentioned in the patch description?

I'm not sure whether you're referring to the pte_unmap() which we
already discussed above, or you're seeing something else in addition;
but I don't think there's a bugfix here, just a rearrangement because
we now want lots of cases to do the pte_unmap() and local_irq_restore().

Hugh

>
> > local_irq_restore(flags);
> > }
> >
>
> Gr{oetje,eeting}s,
>
> Geert
>
> --
> Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- [email protected]
>
> In personal conversations with technical people, I call myself a hacker. But
> when I'm talking to journalists I just say "programmer" or something like that.
> -- Linus Torvalds

2023-05-11 03:19:16

by Hugh Dickins

Subject: Re: [PATCH 21/23] x86: Allow get_locked_pte() to fail

On Wed, 10 May 2023, Peter Zijlstra wrote:

> On Tue, May 09, 2023 at 10:08:37PM -0700, Hugh Dickins wrote:
> > In rare transient cases, not yet made possible, pte_offset_map() and
> > pte_offset_map_lock() may not find a page table: handle appropriately.
> >
> > Signed-off-by: Hugh Dickins <[email protected]>
> > ---
> > arch/x86/kernel/ldt.c | 6 ++++--
> > 1 file changed, 4 insertions(+), 2 deletions(-)
> >
> > diff --git a/arch/x86/kernel/ldt.c b/arch/x86/kernel/ldt.c
> > index 525876e7b9f4..eb844549cd83 100644
> > --- a/arch/x86/kernel/ldt.c
> > +++ b/arch/x86/kernel/ldt.c
> > @@ -367,8 +367,10 @@ static void unmap_ldt_struct(struct mm_struct *mm, struct ldt_struct *ldt)
> >
> > va = (unsigned long)ldt_slot_va(ldt->slot) + offset;
> > ptep = get_locked_pte(mm, va, &ptl);
> > - pte_clear(mm, va, ptep);
> > - pte_unmap_unlock(ptep, ptl);
> > + if (ptep) {
> > + pte_clear(mm, va, ptep);
> > + pte_unmap_unlock(ptep, ptl);
> > + }
> > }
>
> Ow geez, now I have to go remember how the whole PTI/LDT crud worked :/

I apologize for sending you back there!

>
> At first glance this seems wrong; we can't just not unmap the LDT if we
> can't find it in a hurry. Also, IIRC this isn't in fact a regular user
> mapping, so it should not be subject to THP induced seizures.
>
> ... memory bubbles back ... for PTI kernels we need to map this in the
> user and kernel page-tables because obviously userspace needs to be able
> to have access to the LDT. But it is not directly accessible by
> userspace. It lives in the cpu_entry_area as a virtual map of the real
> kernel allocation, and this virtual address is used for LLDT.
> Modification is done through sys_modify_ldt().

And there must be a user-style page table backing that cpu_entry_area,
because the use of get_locked_pte() and pte_unmap_unlock() implies
that there's a user page table (struct page containing spinlock if
config says so) rather than just a kernel page table mapping it.

>
> I think I would feel much better if this were something like:
>
> if (!WARN_ON_ONCE(!ptep))
>
> This really shouldn't fail and if it does, simply skipping it isn't the
> right thing either.

Sure, I'll gladly make that change when I respin - not immediately, let's
get more feedback on this arch series first, but maybe in a week's time.

Thanks for looking so quickly, Peter: I didn't Cc you on this particular
series, but shall certainly be doing so on the ones that follow, because
a few of those patches go into interesting pmdp_get_lockless() territory.

Hugh

2023-05-11 04:48:41

by Hugh Dickins

Subject: Re: [PATCH 00/23] arch: allow pte_offset_map[_lock]() to fail

On Wed, 10 May 2023, Matthew Wilcox wrote:
> On Tue, May 09, 2023 at 09:39:13PM -0700, Hugh Dickins wrote:
> > Two: pte_offset_map() will need to do an rcu_read_lock(), with the
> > corresponding rcu_read_unlock() in pte_unmap(). But most architectures
> > never supported CONFIG_HIGHPTE, so some don't always call pte_unmap()
> > after pte_offset_map(), or have used userspace pte_offset_map() where
> > pte_offset_kernel() is more correct. No problem in the current tree,
> > but a problem once an rcu_read_unlock() will be needed to keep balance.
>
> Hi Hugh,
>
> I shall have to spend some time looking at these patches, but at LSFMM
> just a few hours ago, I proposed and nobody objected to removing
> CONFIG_HIGHPTE. I don't intend to take action on that consensus
> immediately, so I can certainly wait until your patches are applied, but
> if this information simplifies what you're doing, feel free to act on it.

Thanks a lot, Matthew: very considerate, as usual.

Yes, I did see your "Whither Highmem?" (wither highmem!) proposal on the
list, and it did make me think, better get these patches and preview out
soon, before you get to vanish pte_unmap() altogether. HIGHMEM or not,
HIGHPTE or not, I think pte_offset_map() and pte_unmap() still have an
important role to play.

I don't really understand why you're going down a remove-CONFIG_HIGHPTE
route: I thought you were motivated by the awkwardness of kmap on large
folios; but I don't see how removing HIGHPTE helps with that at all
(unless you have a "large page tables" effort in mind, but I doubt it).

But I've no investment in CONFIG_HIGHPTE if people think now is the
time to remove it: I disagree, but wouldn't miss it myself - so long
as you leave pte_offset_map() and pte_unmap() (under whatever names).

I don't think removing CONFIG_HIGHPTE will simplify what I'm doing.
For a moment it looked like it would: the PAE case is nasty (and our
data centres have not been on PAE for a long time, so it wasn't a
problem I had to face before); and knowing pmd_high must be 0 for a
page table looked like it would help, but now I'm not so sure of that
(hmm, I'm changing my mind again as I write).

Peter's pmdp_get_lockless() does rely for complete correctness on
interrupts being disabled, and I suspect that I may be forced in the
PAE case to do so briefly; but detest that notion. For now I'm just
deferring it, hoping for a better idea before third series finalized.

I mention this (and Cc Peter) in passing: don't want this arch thread
to go down into that rabbit hole: we can start a fresh thread on it if
you wish, but right now my priority is commit messages for the second
series, rather than solving (or even detailing) the PAE problem.

Hugh

2023-05-11 06:56:28

by Geert Uytterhoeven

[permalink] [raw]
Subject: Re: [PATCH 05/23] m68k: allow pte_offset_map[_lock]() to fail

Hi Hugh,

On Thu, May 11, 2023 at 4:58 AM Hugh Dickins <[email protected]> wrote:
> On Wed, 10 May 2023, Geert Uytterhoeven wrote:
> > On Wed, May 10, 2023 at 6:48 AM Hugh Dickins <[email protected]> wrote:
> > > In rare transient cases, not yet made possible, pte_offset_map() and
> > > pte_offset_map_lock() may not find a page table: handle appropriately.
> > >
> > > Restructure cf_tlb_miss() with a pte_unmap() (previously omitted)
> > > at label out, followed by one local_irq_restore() for all.
> >
> > That's a bug fix, which should be a separate patch?
>
> No, that's not a bug fix for the current tree, since m68k does not
> offer CONFIG_HIGHPTE, so pte_unmap() is never anything but a no-op
> for m68k (see include/linux/pgtable.h).
>
> But I want to change pte_unmap() to do something even without
> CONFIG_HIGHPTE, so have to fix up any such previously harmless
> omissions in this series first.

OK.

> > > --- a/arch/m68k/include/asm/mmu_context.h
> > > +++ b/arch/m68k/include/asm/mmu_context.h
> > > @@ -99,7 +99,7 @@ static inline void load_ksp_mmu(struct task_struct *task)
> > > p4d_t *p4d;
> > > pud_t *pud;
> > > pmd_t *pmd;
> > > - pte_t *pte;
> > > + pte_t *pte = NULL;
> > > unsigned long mmuar;
> > >
> > > local_irq_save(flags);
> > > @@ -139,7 +139,7 @@ static inline void load_ksp_mmu(struct task_struct *task)
> > >
> > > pte = (mmuar >= PAGE_OFFSET) ? pte_offset_kernel(pmd, mmuar)
> > > : pte_offset_map(pmd, mmuar);
> > > - if (pte_none(*pte) || !pte_present(*pte))
> > > + if (!pte || pte_none(*pte) || !pte_present(*pte))
> > > goto bug;
> >
> > If the absence of a pte is to become a non-abnormal case, it should
> > probably jump to "end" instead, to avoid spamming the kernel log.
>
> I don't think so (but of course it's hard for you to tell, without
> seeing all completed series of series). If pmd_none(*pmd) can safely
> goto bug just above, and pte_none(*pte) goto bug here, well, the !pte
> case is going to be stranger than either of those.
>
> My understanding of this function, load_ksp_mmu(), is that it's dealing
> at context switch with a part of userspace which very much needs to be
> present: whatever keeps that from being swapped out or migrated at
> present, will be sure to keep the !pte case away - we cannot steal its
> page table just at random (and a THP on m68k would be surprising too).
>
> Though there is one case I can think of which will cause !pte here,
> and so goto bug: if the pmd entry has got corrupted, and counts as
> pmd_bad(), which will be tested (and cleared) in pte_offset_map().
> But it is okay to report a bug in that case.
>
> I can certainly change this to goto end instead if you still prefer,
> no problem; but I'd rather keep it as is, if only for me to be proved
> wrong by you actually seeing spam there.

OK, makes sense.

> > > @@ -161,6 +161,8 @@ static inline void load_ksp_mmu(struct task_struct *task)
> > > bug:
> > > pr_info("ksp load failed: mm=0x%p ksp=0x08%lx\n", mm, mmuar);
> > > end:
> > > + if (pte && mmuar < PAGE_OFFSET)
> > > + pte_unmap(pte);
> >
> > Is this also a bugfix, not mentioned in the patch description?
>
> I'm not sure whether you're referring to the pte_unmap() which we
> already discussed above, or you're seeing something else in addition;
> but I don't think there's a bugfix here, just a rearrangement because
> we now want lots of cases to do the pte_unmap() and local_irq_restore().

I was referring to the addition of pte_unmap().
As per your explanation above, this is not a bugfix.

Gr{oetje,eeting}s,

Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- [email protected]

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds

2023-05-11 07:55:35

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 21/23] x86: Allow get_locked_pte() to fail

On Wed, May 10, 2023 at 08:16:34PM -0700, Hugh Dickins wrote:
> Thanks for looking so quickly, Peter: I didn't Cc you on this particular
> series, but shall certainly be doing so on the ones that follow, because
> a few of those patches go into interesting pmdp_get_lockless() territory.

I'm in the x86@ catch-all, which is how I saw this so quickly :-) A
direct copy won't hurt ofc, the mail system will sort it out.

2023-05-11 14:12:28

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [PATCH 00/23] arch: allow pte_offset_map[_lock]() to fail

On Wed, May 10, 2023 at 09:35:44PM -0700, Hugh Dickins wrote:
> On Wed, 10 May 2023, Matthew Wilcox wrote:
> > On Tue, May 09, 2023 at 09:39:13PM -0700, Hugh Dickins wrote:
> > > Two: pte_offset_map() will need to do an rcu_read_lock(), with the
> > > corresponding rcu_read_unlock() in pte_unmap(). But most architectures
> > > never supported CONFIG_HIGHPTE, so some don't always call pte_unmap()
> > > after pte_offset_map(), or have used userspace pte_offset_map() where
> > > pte_offset_kernel() is more correct. No problem in the current tree,
> > > but a problem once an rcu_read_unlock() will be needed to keep balance.
> >
> > Hi Hugh,
> >
> > I shall have to spend some time looking at these patches, but at LSFMM
> > just a few hours ago, I proposed and nobody objected to removing
> > CONFIG_HIGHPTE. I don't intend to take action on that consensus
> > immediately, so I can certainly wait until your patches are applied, but
> > if this information simplifies what you're doing, feel free to act on it.
>
> Thanks a lot, Matthew: very considerate, as usual.
>
> Yes, I did see your "Whither Highmem?" (wither highmem!) proposal on the

I'm glad somebody noticed the pun ;-)

> list, and it did make me think, better get these patches and preview out
> soon, before you get to vanish pte_unmap() altogether. HIGHMEM or not,
> HIGHPTE or not, I think pte_offset_map() and pte_unmap() still have an
> important role to play.
>
> I don't really understand why you're going down a remove-CONFIG_HIGHPTE
> route: I thought you were motivated by the awkwardness of kmap on large
> folios; but I don't see how removing HIGHPTE helps with that at all
> (unless you have a "large page tables" effort in mind, but I doubt it).

Quite right, my primary concern is filesystem metadata; primarily
directories as I don't think anybody has ever supported symlinks or
superblocks larger than 4kB.

I was thinking that removing CONFIG_HIGHPTE might simplify the page
fault handling path a little, but now I've looked at it some more, and
I'm not sure there's any simplification to be had. It should probably
use kmap_local instead of kmap_atomic(), though.

> But I've no investment in CONFIG_HIGHPTE if people think now is the
> time to remove it: I disagree, but wouldn't miss it myself - so long
> as you leave pte_offset_map() and pte_unmap() (under whatever names).
>
> I don't think removing CONFIG_HIGHPTE will simplify what I'm doing.
> For a moment it looked like it would: the PAE case is nasty (and our
> data centres have not been on PAE for a long time, so it wasn't a
> problem I had to face before); and knowing pmd_high must be 0 for a
> page table looked like it would help, but now I'm not so sure of that
> (hmm, I'm changing my mind again as I write).
>
> Peter's pmdp_get_lockless() does rely for complete correctness on
> interrupts being disabled, and I suspect that I may be forced in the
> PAE case to do so briefly; but detest that notion. For now I'm just
> deferring it, hoping for a better idea before third series finalized.
>
> I mention this (and Cc Peter) in passing: don't want this arch thread
> to go down into that rabbit hole: we can start a fresh thread on it if
> you wish, but right now my priority is commit messages for the second
> series, rather than solving (or even detailing) the PAE problem.

I infer that what you need is a pte_access_start() and a
pte_access_end() which look like they can be plausibly rcu_read_lock()
and rcu_read_unlock(), but might need to be local_irq_save() and
local_irq_restore() in some configurations?

We also talked about moving x86 to always RCU-free page tables in
order to make accessing /proc/$pid/smaps lockless. I believe Michel
is going to take a swing at this project.

2023-05-11 22:59:43

by Hugh Dickins

[permalink] [raw]
Subject: Re: [PATCH 00/23] arch: allow pte_offset_map[_lock]() to fail

On Thu, 11 May 2023, Matthew Wilcox wrote:
>
> I was thinking that removing CONFIG_HIGHPTE might simplify the page
> fault handling path a little, but now I've looked at it some more, and
> I'm not sure there's any simplification to be had. It should probably
> use kmap_local instead of kmap_atomic(), though.

Re kmap_local, yes, one of the patches in the next series does make
that change.

>
> I infer that what you need is a pte_access_start() and a
> pte_access_end() which look like they can be plausibly rcu_read_lock()
> and rcu_read_unlock(), but might need to be local_irq_save() and
> local_irq_restore() in some configurations?

Yes, except that the local_irq_restore() in PAE-like configurations
(if we need it at all) is not delayed until the pte_access_end() or
pte_unmap() - it's internal to the pte_access_start() or pte_offset_map():
interrupts only disabled across the getting of a consistent pmd entry.

Over-generalizing a little, any user of pte_offset_map() (as opposed to
pte_offset_map_lock()) has to be prepared for the ptes to change under
them: but we do need to give them something that is or was recently the
relevant page table, rather than a random page mishmashed from mismatched
pmd_low and pmd_high.
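
To make that slightly more concrete (this is only a sketch of the shape,
not code from any of these series, and the helper choices here are
illustrative):

	pte_t *pte_offset_map(pmd_t *pmd, unsigned long addr)
	{
		pmd_t pmdval;

		rcu_read_lock();	/* paired with rcu_read_unlock() in pte_unmap() */
		/*
		 * Get a consistent pmd entry: on a PAE-like config, this is
		 * where interrupts might have to be disabled briefly (if at
		 * all), so that pmd_low and pmd_high come from the same
		 * entry - not held until pte_unmap().
		 */
		pmdval = pmdp_get_lockless(pmd);
		if (pmd_none(pmdval) || pmd_trans_huge(pmdval) ||
		    unlikely(pmd_bad(pmdval))) {
			rcu_read_unlock();
			return NULL;	/* no page table here: caller must handle */
		}
		return pte_offset_kernel(&pmdval, addr);  /* kmap_local() instead under HIGHPTE */
	}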

>
> We also talked about moving x86 to always RCU-free page tables in
> order to make accessing /proc/$pid/smaps lockless. I believe Michel
> is going to take a swing at this project.

(And /proc/$pid/numa_maps, I hope: that's even worse in some way, IIRC.)

That might be orthogonal to what I'm doing: many non-x86 architectures
already do RCU-freeing of page tables via the TLB route, but that doesn't
cover a pte_free() from retract_page_tables() or collapse_and_free_pmd().

Hugh

2023-05-12 03:59:06

by Mike Rapoport

[permalink] [raw]
Subject: Re: [PATCH 00/23] arch: allow pte_offset_map[_lock]() to fail

Hi,

On Thu, May 11, 2023 at 03:02:55PM +0100, Matthew Wilcox wrote:
> On Wed, May 10, 2023 at 09:35:44PM -0700, Hugh Dickins wrote:
> > On Wed, 10 May 2023, Matthew Wilcox wrote:
> >
> > I don't really understand why you're going down a remove-CONFIG_HIGHPTE
> > route: I thought you were motivated by the awkwardness of kmap on large
> > folios; but I don't see how removing HIGHPTE helps with that at all
> > (unless you have a "large page tables" effort in mind, but I doubt it).
>
> Quite right, my primary concern is filesystem metadata; primarily
> directories as I don't think anybody has ever supported symlinks or
> superblocks larger than 4kB.
>
> I was thinking that removing CONFIG_HIGHPTE might simplify the page
> fault handling path a little, but now I've looked at it some more, and
> I'm not sure there's any simplification to be had. It should probably
> use kmap_local instead of kmap_atomic(), though.

Removing CONFIG_HIGHPTE will drop several lines and will allow us to get
rid of the custom __pte_alloc_one on x86.

--
Sincerely yours,
Mike.

2023-05-13 22:05:21

by Helge Deller

[permalink] [raw]
Subject: Re: [PATCH 08/23] parisc: add pte_unmap() to balance get_ptep()

Hi Hugh,

On 5/10/23 06:52, Hugh Dickins wrote:
> To keep balance in future, remember to pte_unmap() after a successful
> get_ptep(). And (we might as well) pretend that flush_cache_pages()
> really needed a map there, to read the pfn before "unmapping".
>
> Signed-off-by: Hugh Dickins <[email protected]>
> ---
> arch/parisc/kernel/cache.c | 26 +++++++++++++++++++++-----
> 1 file changed, 21 insertions(+), 5 deletions(-)
>
> diff --git a/arch/parisc/kernel/cache.c b/arch/parisc/kernel/cache.c
> index 1d3b8bc8a623..b0c969b3a300 100644
> --- a/arch/parisc/kernel/cache.c
> +++ b/arch/parisc/kernel/cache.c
> @@ -425,10 +425,15 @@ void flush_dcache_page(struct page *page)
> offset = (pgoff - mpnt->vm_pgoff) << PAGE_SHIFT;
> addr = mpnt->vm_start + offset;
> if (parisc_requires_coherency()) {
> + bool needs_flush = false;
> pte_t *ptep;
>
> ptep = get_ptep(mpnt->vm_mm, addr);
> - if (ptep && pte_needs_flush(*ptep))
> + if (ptep) {
> + needs_flush = pte_needs_flush(*ptep);
> + pte_unmap(ptep);
> + }
> + if (needs_flush)
> flush_user_cache_page(mpnt, addr);
> } else {
> /*
> @@ -560,14 +565,20 @@ EXPORT_SYMBOL(flush_kernel_dcache_page_addr);
> static void flush_cache_page_if_present(struct vm_area_struct *vma,
> unsigned long vmaddr, unsigned long pfn)
> {
> - pte_t *ptep = get_ptep(vma->vm_mm, vmaddr);
> + bool needs_flush = false;
> + pte_t *ptep;
>
> /*
> * The pte check is racy and sometimes the flush will trigger
> * a non-access TLB miss. Hopefully, the page has already been
> * flushed.
> */
> - if (ptep && pte_needs_flush(*ptep))
> + ptep = get_ptep(vma->vm_mm, vmaddr);
> + if (ptep) {
> + needs_flush = pte_needs_flush(*ptep))

^^^^^
One ")" too much and lacks a trailing ";"
Should be:
needs_flush = pte_needs_flush(*ptep);

With that fixed the kernel compiles and boots successfully on parisc.

Helge

2023-05-14 18:54:55

by Hugh Dickins

[permalink] [raw]
Subject: Re: [PATCH 08/23] parisc: add pte_unmap() to balance get_ptep()

On Sat, 13 May 2023, Helge Deller wrote:

> Hi Hugh,
>
> On 5/10/23 06:52, Hugh Dickins wrote:
> > To keep balance in future, remember to pte_unmap() after a successful
> > get_ptep(). And (we might as well) pretend that flush_cache_pages()
> > really needed a map there, to read the pfn before "unmapping".
> >
> > Signed-off-by: Hugh Dickins <[email protected]>
> > ---
> > arch/parisc/kernel/cache.c | 26 +++++++++++++++++++++-----
> > 1 file changed, 21 insertions(+), 5 deletions(-)
> >
> > diff --git a/arch/parisc/kernel/cache.c b/arch/parisc/kernel/cache.c
> > index 1d3b8bc8a623..b0c969b3a300 100644
> > --- a/arch/parisc/kernel/cache.c
> > +++ b/arch/parisc/kernel/cache.c
> > @@ -425,10 +425,15 @@ void flush_dcache_page(struct page *page)
> > offset = (pgoff - mpnt->vm_pgoff) << PAGE_SHIFT;
> > addr = mpnt->vm_start + offset;
> > if (parisc_requires_coherency()) {
> > + bool needs_flush = false;
> > pte_t *ptep;
> >
> > ptep = get_ptep(mpnt->vm_mm, addr);
> > - if (ptep && pte_needs_flush(*ptep))
> > + if (ptep) {
> > + needs_flush = pte_needs_flush(*ptep);
> > + pte_unmap(ptep);
> > + }
> > + if (needs_flush)
> > flush_user_cache_page(mpnt, addr);
> > } else {
> > /*
> > @@ -560,14 +565,20 @@ EXPORT_SYMBOL(flush_kernel_dcache_page_addr);
> > static void flush_cache_page_if_present(struct vm_area_struct *vma,
> > unsigned long vmaddr, unsigned long pfn)
> > {
> > - pte_t *ptep = get_ptep(vma->vm_mm, vmaddr);
> > + bool needs_flush = false;
> > + pte_t *ptep;
> >
> > /*
> > * The pte check is racy and sometimes the flush will trigger
> > * a non-access TLB miss. Hopefully, the page has already been
> > * flushed.
> > */
> > - if (ptep && pte_needs_flush(*ptep))
> > + ptep = get_ptep(vma->vm_mm, vmaddr);
> > + if (ptep) {
> > + needs_flush = pte_needs_flush(*ptep))
>
> ^^^^^
> One ")" too much and lacks a trailing ";"
> Should be:
> needs_flush = pte_needs_flush(*ptep);
>
> With that fixed the kernel compiles and boots successfully on parisc.

Urgh! Indeed, thanks a lot Helge: I'll fold that in.

Hugh

2023-05-16 10:54:38

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 00/23] arch: allow pte_offset_map[_lock]() to fail

On Thu, May 11, 2023 at 03:02:55PM +0100, Matthew Wilcox wrote:

> We also talked about moving x86 to always RCU-free page tables in
> order to make accessing /proc/$pid/smaps lockless. I believe Michel
> is going to take a swing at this project.

Shouldn't be too controversial I think -- effectively everybody already
has it enabled because everybody builds with KVM enabled.

2023-05-17 10:47:23

by Claudio Imbrenda

[permalink] [raw]
Subject: Re: [PATCH 15/23] s390: allow pte_offset_map_lock() to fail

On Tue, 9 May 2023 22:01:16 -0700 (PDT)
Hugh Dickins <[email protected]> wrote:

> In rare transient cases, not yet made possible, pte_offset_map() and
> pte_offset_map_lock() may not find a page table: handle appropriately.
>
> Signed-off-by: Hugh Dickins <[email protected]>
> ---
> arch/s390/kernel/uv.c | 2 ++
> arch/s390/mm/gmap.c | 2 ++
> arch/s390/mm/pgtable.c | 12 +++++++++---
> 3 files changed, 13 insertions(+), 3 deletions(-)
>
> diff --git a/arch/s390/kernel/uv.c b/arch/s390/kernel/uv.c
> index cb2ee06df286..3c62d1b218b1 100644
> --- a/arch/s390/kernel/uv.c
> +++ b/arch/s390/kernel/uv.c
> @@ -294,6 +294,8 @@ int gmap_make_secure(struct gmap *gmap, unsigned long gaddr, void *uvcb)
>
> rc = -ENXIO;
> ptep = get_locked_pte(gmap->mm, uaddr, &ptelock);
> + if (!ptep)
> + goto out;
> if (pte_present(*ptep) && !(pte_val(*ptep) & _PAGE_INVALID) && pte_write(*ptep)) {
> page = pte_page(*ptep);
> rc = -EAGAIN;
> diff --git a/arch/s390/mm/gmap.c b/arch/s390/mm/gmap.c
> index dc90d1eb0d55..d198fc9475a2 100644
> --- a/arch/s390/mm/gmap.c
> +++ b/arch/s390/mm/gmap.c
> @@ -2549,6 +2549,8 @@ static int __zap_zero_pages(pmd_t *pmd, unsigned long start,
> spinlock_t *ptl;
>
> ptep = pte_offset_map_lock(walk->mm, pmd, addr, &ptl);
> + if (!ptep)
> + break;

so if pte_offset_map_lock fails, we abort and skip both the failed
entry and the rest of the entries?

can pte_offset_map_lock be retried immediately if it fails? (consider
that we currently don't allow THP with KVM guests)

Would something like this:

do {
ptep = pte_offset_map_lock(...);
mb(); /* maybe? */
} while (!ptep);

make sense?


otherwise maybe it's better to return an error and retry the whole
walk_page_range() in s390_enable_sie()? It's a slow path anyway.

> if (is_zero_pfn(pte_pfn(*ptep)))
> ptep_xchg_direct(walk->mm, addr, ptep, __pte(_PAGE_INVALID));
> pte_unmap_unlock(ptep, ptl);

[...]

2023-05-17 11:43:38

by Alexander Gordeev

[permalink] [raw]
Subject: Re: [PATCH 16/23] s390: gmap use pte_unmap_unlock() not spin_unlock()

On Tue, May 09, 2023 at 10:02:32PM -0700, Hugh Dickins wrote:
> pte_alloc_map_lock() expects to be followed by pte_unmap_unlock(): to
> keep balance in future, pass ptep as well as ptl to gmap_pte_op_end(),
> and use pte_unmap_unlock() instead of direct spin_unlock() (even though
> ptep ends up unused inside the macro).
>
> Signed-off-by: Hugh Dickins <[email protected]>
> ---
> arch/s390/mm/gmap.c | 22 +++++++++++-----------
> 1 file changed, 11 insertions(+), 11 deletions(-)

Acked-by: Alexander Gordeev <[email protected]>

2023-05-17 22:25:38

by Hugh Dickins

[permalink] [raw]
Subject: Re: [PATCH 15/23] s390: allow pte_offset_map_lock() to fail

On Wed, 17 May 2023, Claudio Imbrenda wrote:
> On Tue, 9 May 2023 22:01:16 -0700 (PDT)
> Hugh Dickins <[email protected]> wrote:
>
> > In rare transient cases, not yet made possible, pte_offset_map() and
> > pte_offset_map_lock() may not find a page table: handle appropriately.
> >
> > Signed-off-by: Hugh Dickins <[email protected]>
> > ---
> > arch/s390/kernel/uv.c | 2 ++
> > arch/s390/mm/gmap.c | 2 ++
> > arch/s390/mm/pgtable.c | 12 +++++++++---
> > 3 files changed, 13 insertions(+), 3 deletions(-)
> >
> > diff --git a/arch/s390/kernel/uv.c b/arch/s390/kernel/uv.c
> > index cb2ee06df286..3c62d1b218b1 100644
> > --- a/arch/s390/kernel/uv.c
> > +++ b/arch/s390/kernel/uv.c
> > @@ -294,6 +294,8 @@ int gmap_make_secure(struct gmap *gmap, unsigned long gaddr, void *uvcb)
> >
> > rc = -ENXIO;
> > ptep = get_locked_pte(gmap->mm, uaddr, &ptelock);
> > + if (!ptep)
> > + goto out;

You may or may not be asking about this instance too. When I looked at
how the code lower down handles -ENXIO (promoting it to -EFAULT if an
access fails, or to -EAGAIN to ask for a retry), this looked just right
(whereas using -EAGAIN here would be wrong: that expects a "page" which
has not been initialized at this point).

> > if (pte_present(*ptep) && !(pte_val(*ptep) & _PAGE_INVALID) && pte_write(*ptep)) {
> > page = pte_page(*ptep);
> > rc = -EAGAIN;
> > diff --git a/arch/s390/mm/gmap.c b/arch/s390/mm/gmap.c
> > index dc90d1eb0d55..d198fc9475a2 100644
> > --- a/arch/s390/mm/gmap.c
> > +++ b/arch/s390/mm/gmap.c
> > @@ -2549,6 +2549,8 @@ static int __zap_zero_pages(pmd_t *pmd, unsigned long start,
> > spinlock_t *ptl;
> >
> > ptep = pte_offset_map_lock(walk->mm, pmd, addr, &ptl);
> > + if (!ptep)
> > + break;
>
> so if pte_offset_map_lock fails, we abort and skip both the failed
> entry and the rest of the entries?

Yes.

>
> can pte_offset_map_lock be retried immediately if it fails? (consider
> that we currently don't allow THP with KVM guests)
>
> Would something like this:
>
> do {
> ptep = pte_offset_map_lock(...);
> mb(); /* maybe? */
> } while (!ptep);
>
> make sense?

No. But you're absolutely right to be asking: thank you for looking
into it so carefully - and I realize that it's hard at this stage to
judge what's appropriate, when I've not yet even posted the endpoint
of these changes, the patches which make it possible not to find a
page table here. And I'm intentionally keeping that vague, because
although I shall only introduce a THP case, I do expect it to be built
upon later in reclaiming empty page tables: it would be nice not to
have to change the arch code again when extending further.

My "rare transient cases" phrase may be somewhat misleading: one thing
that's wrong with your tight pte_offset_map_lock() loop above is that
the pmd entry pointing to page table may have been suddenly replaced by
a pmd_none() entry; and there's nothing in your loop above to break out
if that is so.

But if a page table is suddenly removed, that would be because it was
either empty, or replaced by a THP entry, or easily reconstructable on
demand (by that, I probably mean it was only mapping shared file pages,
which can just be refaulted if needed again).

The case you're wary of, is if the page table were removed briefly,
then put back shortly after: and still contains zero pages further down.
That's not something mm does now, nor at the end of my several series,
nor that I imagine us wanting to do in future: but I am struggling to
find a killer argument to persuade you that it could never be done -
most pages in a page table do need rmap tracking, which will BUG if
it's broken, but that argument happens not to apply to the zero page.

(Hmm, there could be somewhere, where we would find it convenient to
remove a page table with intent to do ...something, then validation
of that isolated page table fails, so we just put it back again.)

Is it good enough for me to promise you that we won't do that?

There are several ways in which we could change __zap_zero_pages(),
but I don't see them as actually dealing with the concern at hand.

One change, I've tended to make at the mm end but did not dare
to interfere here: it would seem more sensible to do a single
pte_offset_map_lock() outside the loop, return if that fails,
increment ptep inside the loop, pte_unmap_unlock() after the loop.
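
Roughly, that alternative shape would be (just a sketch, not something
I'm proposing for this patch):

	static int __zap_zero_pages(pmd_t *pmd, unsigned long start,
				    unsigned long end, struct mm_walk *walk)
	{
		pte_t *ptep, *start_ptep;
		spinlock_t *ptl;

		start_ptep = ptep = pte_offset_map_lock(walk->mm, pmd, start, &ptl);
		if (!ptep)
			return 0;	/* page table gone: nothing left to zap here */
		for (; start != end; start += PAGE_SIZE, ptep++) {
			if (is_zero_pfn(pte_pfn(*ptep)))
				ptep_xchg_direct(walk->mm, start, ptep,
						 __pte(_PAGE_INVALID));
		}
		pte_unmap_unlock(start_ptep, ptl);
		return 0;
	}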

But perhaps you have preemption reasons for not wanting that; and
although it would eliminate the oddity of half-processing a page
table, it would not really resolve the problem at hand: because,
what if this page table got removed just before __zap_zero_pages()
tries to take the lock, then got put back just after?

Another change: I see __zap_zero_pages() is driven by walk_page_range(),
and over at the mm end I'm usually setting walk->action to ACTION_AGAIN
in these failure cases; but thought that an unnecessary piece of magic
here, and cannot see how it could actually help. Your "retry the whole
walk_page_range()" suggestion below would be a heavier equivalent of
that: but neither way gives confidence, if a page table could actually
be removed then reinserted without mmap_write_lock().

I think I want to keep this s390 __zap_zero_pages() issue in mind, it is
important and thank you for raising it; but don't see any change to the
patch as actually needed.

Hugh

>
>
> otherwise maybe it's better to return an error and retry the whole
> walk_page_range() in s390_enable_sie()? It's a slow path anyway.
>
> > if (is_zero_pfn(pte_pfn(*ptep)))
> > ptep_xchg_direct(walk->mm, addr, ptep, __pte(_PAGE_INVALID));
> > pte_unmap_unlock(ptep, ptl);
>
> [...]

2023-05-23 12:19:13

by Claudio Imbrenda

[permalink] [raw]
Subject: Re: [PATCH 15/23] s390: allow pte_offset_map_lock() to fail

On Wed, 17 May 2023 14:50:28 -0700 (PDT)
Hugh Dickins <[email protected]> wrote:

> On Wed, 17 May 2023, Claudio Imbrenda wrote:
> > On Tue, 9 May 2023 22:01:16 -0700 (PDT)
> > Hugh Dickins <[email protected]> wrote:
> >
> > > In rare transient cases, not yet made possible, pte_offset_map() and
> > > pte_offset_map_lock() may not find a page table: handle appropriately.
> > >
> > > Signed-off-by: Hugh Dickins <[email protected]>
> > > ---
> > > arch/s390/kernel/uv.c | 2 ++
> > > arch/s390/mm/gmap.c | 2 ++
> > > arch/s390/mm/pgtable.c | 12 +++++++++---
> > > 3 files changed, 13 insertions(+), 3 deletions(-)
> > >
> > > diff --git a/arch/s390/kernel/uv.c b/arch/s390/kernel/uv.c
> > > index cb2ee06df286..3c62d1b218b1 100644
> > > --- a/arch/s390/kernel/uv.c
> > > +++ b/arch/s390/kernel/uv.c
> > > @@ -294,6 +294,8 @@ int gmap_make_secure(struct gmap *gmap, unsigned long gaddr, void *uvcb)
> > >
> > > rc = -ENXIO;
> > > ptep = get_locked_pte(gmap->mm, uaddr, &ptelock);
> > > + if (!ptep)
> > > + goto out;
>
> You may or may not be asking about this instance too. When I looked at

actually no, because of the reasons you give here :)

> how the code lower down handles -ENXIO (promoting it to -EFAULT if an
> access fails, or to -EAGAIN to ask for a retry), this looked just right
> (whereas using -EAGAIN here would be wrong: that expects a "page" which
> has not been initialized at this point).
>
> > > if (pte_present(*ptep) && !(pte_val(*ptep) & _PAGE_INVALID) && pte_write(*ptep)) {
> > > page = pte_page(*ptep);
> > > rc = -EAGAIN;
> > > diff --git a/arch/s390/mm/gmap.c b/arch/s390/mm/gmap.c
> > > index dc90d1eb0d55..d198fc9475a2 100644
> > > --- a/arch/s390/mm/gmap.c
> > > +++ b/arch/s390/mm/gmap.c
> > > @@ -2549,6 +2549,8 @@ static int __zap_zero_pages(pmd_t *pmd, unsigned long start,
> > > spinlock_t *ptl;
> > >
> > > ptep = pte_offset_map_lock(walk->mm, pmd, addr, &ptl);
> > > + if (!ptep)
> > > + break;
> >
> > so if pte_offset_map_lock fails, we abort and skip both the failed
> > entry and the rest of the entries?
>
> Yes.
>
> >
> > can pte_offset_map_lock be retried immediately if it fails? (consider
> > that we currently don't allow THP with KVM guests)
> >
> > Would something like this:
> >
> > do {
> > ptep = pte_offset_map_lock(...);
> > mb(); /* maybe? */
> > } while (!ptep);
> >
> > make sense?
>
> No. But you're absolutely right to be asking: thank you for looking
> into it so carefully - and I realize that it's hard at this stage to
> judge what's appropriate, when I've not yet even posted the endpoint
> of these changes, the patches which make it possible not to find a
> page table here. And I'm intentionally keeping that vague, because
> although I shall only introduce a THP case, I do expect it to be built
> upon later in reclaiming empty page tables: it would be nice not to
> have to change the arch code again when extending further.
>
> My "rare transient cases" phrase may be somewhat misleading: one thing
> that's wrong with your tight pte_offset_map_lock() loop above is that
> the pmd entry pointing to page table may have been suddenly replaced
> by a pmd_none() entry; and there's nothing in your loop above to
> break out if that is so.
>
> But if a page table is suddenly removed, that would be because it was
> either empty, or replaced by a THP entry, or easily reconstructable on
> demand (by that, I probably mean it was only mapping shared file
> pages, which can just be refaulted if needed again).
>
> The case you're wary of, is if the page table were removed briefly,
> then put back shortly after: and still contains zero pages further
> down. That's not something mm does now, nor at the end of my several
> series, nor that I imagine us wanting to do in future: but I am
> struggling to find a killer argument to persuade you that it could
> never be done - most pages in a page table do need rmap tracking,
> which will BUG if it's broken, but that argument happens not to apply
> to the zero page.
>
> (Hmm, there could be somewhere, where we would find it convenient to
> remove a page table with intent to do ...something, then validation
> of that isolated page table fails, so we just put it back again.)
>
> Is it good enough for me to promise you that we won't do that?
>
> There are several ways in which we could change __zap_zero_pages(),
> but I don't see them as actually dealing with the concern at hand.
>
> One change, I've tended to make at the mm end but did not dare
> to interfere here: it would seem more sensible to do a single
> pte_offset_map_lock() outside the loop, return if that fails,
> increment ptep inside the loop, pte_unmap_unlock() after the loop.
>
> But perhaps you have preemption reasons for not wanting that; and
> although it would eliminate the oddity of half-processing a page
> table, it would not really resolve the problem at hand: because,
> what if this page table got removed just before __zap_zero_pages()
> tries to take the lock, then got put back just after?
>
> Another change: I see __zap_zero_pages() is driven by
> walk_page_range(), and over at the mm end I'm usually setting
> walk->action to ACTION_AGAIN in these failure cases; but thought that
> an unnecessary piece of magic here, and cannot see how it could
> actually help. Your "retry the whole walk_page_range()" suggestion
> below would be a heavier equivalent of that: but neither way gives
> confidence, if a page table could actually be removed then reinserted
> without mmap_write_lock().
>
> I think I want to keep this s390 __zap_zero_pages() issue in mind, it
> is important and thank you for raising it; but don't see any change
> to the patch as actually needed.
>
> Hugh

so if I understand the above correctly, pte_offset_map_lock will only
fail if the whole page table has disappeared, and in that case, it will
never reappear with zero pages, therefore we can safely skip (in that
case just break). if we were to do a continue instead of a break, we
would most likely fail again anyway.

in that case I would still like a small change in your patch: please
write a short (2~3 lines max) comment about why it's ok to do things
that way

2023-05-24 02:04:49

by Hugh Dickins

[permalink] [raw]
Subject: Re: [PATCH 15/23] s390: allow pte_offset_map_lock() to fail

On Tue, 23 May 2023, Claudio Imbrenda wrote:
>
> so if I understand the above correctly, pte_offset_map_lock will only
> fail if the whole page table has disappeared, and in that case, it will
> never reappear with zero pages, therefore we can safely skip (in that
> case just break). if we were to do a continue instead of a break, we
> would most likely fail again anyway.

Yes, that's the most likely; and you hold mmap_write_lock() there,
and VM_NOHUGEPAGE on all vmas, so I think it's the only foreseeable
possibility.

>
> in that case I would still like a small change in your patch: please
> write a short (2~3 lines max) comment about why it's ok to do things
> that way

Sure.

But I now see that I've disobeyed you, and gone to 4 lines (but in the
comment above the function, so as not to distract from the code itself):
is this good wording to you? I needed to research how they were stopped
from coming in afterwards, so wanted to put something greppable in there.

And, unless I'm misunderstanding, that "after THP was enabled" was
always supposed to say "after THP was disabled" (because splitting a
huge zero page pmd inserts a page table full of little zero ptes).

Or would you prefer the comment in the commit message instead,
or down just above the pte_offset_map_lock() line?

It would be much better if I could find one place at the mm end to
enforce its end of the contract; but cannot think how to do that.

Hugh

--- a/arch/s390/mm/gmap.c
+++ b/arch/s390/mm/gmap.c
@@ -2537,7 +2537,12 @@ static inline void thp_split_mm(struct mm_struct *mm)
* Remove all empty zero pages from the mapping for lazy refaulting
* - This must be called after mm->context.has_pgste is set, to avoid
* future creation of zero pages
- * - This must be called after THP was enabled
+ * - This must be called after THP was disabled.
+ *
+ * mm contracts with s390, that even if mm were to remove a page table,
+ * racing with the loop below and so causing pte_offset_map_lock() to fail,
+ * it will never insert a page table containing empty zero pages once
+ * mm_forbids_zeropage(mm) i.e. mm->context.has_pgste is set.
*/
static int __zap_zero_pages(pmd_t *pmd, unsigned long start,
unsigned long end, struct mm_walk *walk)

2023-05-25 07:48:25

by Claudio Imbrenda

[permalink] [raw]
Subject: Re: [PATCH 15/23] s390: allow pte_offset_map_lock() to fail

On Tue, 23 May 2023 18:49:14 -0700 (PDT)
Hugh Dickins <[email protected]> wrote:

> On Tue, 23 May 2023, Claudio Imbrenda wrote:
> >
> > so if I understand the above correctly, pte_offset_map_lock will only
> > fail if the whole page table has disappeared, and in that case, it will
> > never reappear with zero pages, therefore we can safely skip (in that
> > case just break). if we were to do a continue instead of a break, we
> > would most likely fail again anyway.
>
> Yes, that's the most likely; and you hold mmap_write_lock() there,
> and VM_NOHUGEPAGE on all vmas, so I think it's the only foreseeable
> possibility.
>
> >
> > in that case I would still like a small change in your patch: please
> > write a short (2~3 lines max) comment about why it's ok to do things
> > that way
>
> Sure.
>
> But I now see that I've disobeyed you, and gone to 4 lines (but in the
> comment above the function, so as not to distract from the code itself):
> is this good wording to you? I needed to research how they were stopped
> from coming in afterwards, so wanted to put something greppable in there.
>
> And, unless I'm misunderstanding, that "after THP was enabled" was
> always supposed to say "after THP was disabled" (because splitting a
> huge zero page pmd inserts a page table full of little zero ptes).

indeed, thanks for noticing and fixing it

>
> Or would you prefer the comment in the commit message instead,
> or down just above the pte_offset_map_lock() line?
>
> It would be much better if I could find one place at the mm end to
> enforce its end of the contract; but cannot think how to do that.
>
> Hugh
>
> --- a/arch/s390/mm/gmap.c
> +++ b/arch/s390/mm/gmap.c
> @@ -2537,7 +2537,12 @@ static inline void thp_split_mm(struct mm_struct *mm)
> * Remove all empty zero pages from the mapping for lazy refaulting
> * - This must be called after mm->context.has_pgste is set, to avoid
> * future creation of zero pages
> - * - This must be called after THP was enabled
> + * - This must be called after THP was disabled.
> + *
> + * mm contracts with s390, that even if mm were to remove a page table,
> + * racing with the loop below and so causing pte_offset_map_lock() to fail,
> + * it will never insert a page table containing empty zero pages once
> + * mm_forbids_zeropage(mm) i.e. mm->context.has_pgste is set.
> */
> static int __zap_zero_pages(pmd_t *pmd, unsigned long start,
> unsigned long end, struct mm_walk *walk)

looks good, thanks