2023-05-22 05:10:26

by Hugh Dickins

Subject: [PATCH 00/31] mm: allow pte_offset_map[_lock]() to fail

Here is a series of patches to mm, based on v6.4-rc2: preparing for
changes to follow (mostly in mm/khugepaged.c) affecting pte_offset_map()
and pte_offset_map_lock().

This follows on from the "arch: allow pte_offset_map[_lock]() to fail"
https://lore.kernel.org/linux-mm/[email protected]/
series of 23 posted on 2023-05-09. These two series are "independent":
neither depends for build or correctness on the other, but both series
have to be in before a third series is added to make the effective changes
- though I anticipate that people will want to see at least an initial
version of that third series soon, to complete the context for them all.

What is it all about? Some mmap_lock avoidance i.e. latency reduction.
Initially just for the case of collapsing shmem or file pages to THPs;
but likely to be relied upon later in other contexts e.g. freeing of
empty page tables (but that's not work I'm doing). mmap_write_lock
avoidance when collapsing to anon THPs? Perhaps, but again that's not
work I've done: a quick and easy attempt looked like it was going to
shift the load from mmap rwsem to pmd spinlock - not an improvement.

I would much prefer not to have to make these small but wide-ranging
changes for such a niche case; but failed to find another way, and
have heard that shmem MADV_COLLAPSE's usefulness is being limited by
that mmap_write_lock it currently requires.

These changes (though of course not these exact patches) have been in
Google's data centre kernel for three years now: we do rely upon them.

What is this preparatory series about?

The current mmap locking will not be enough to guard against that
tricky transition between pmd entry pointing to page table, and empty
pmd entry, and pmd entry pointing to huge page: pte_offset_map() will
have to validate the pmd entry for itself, returning NULL if no page
table is there. What to do about that varies: sometimes nearby error
handling indicates just to skip it; but in many cases an ACTION_AGAIN or
"goto again" is appropriate (and if that risks an infinite loop, then
there must have been an oops, or pfn 0 mistaken for page table, before).
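
To make that concrete: a minimal sketch of the two usual reactions, in
terms of the API this series arrives at (the function and its shape are
hypothetical, condensed rather than lifted from any one patch):

	static int example_scan(struct mm_struct *mm, pmd_t *pmd,
				unsigned long addr)
	{
		spinlock_t *ptl;
		pte_t *pte;
	again:
		if (pmd_none(*pmd) || pmd_trans_huge(*pmd))
			return 0;	/* no page table here: skip this extent */
		pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
		if (!pte)
			goto again;	/* *pmd changed under us: re-examine it */
		/* ... examine or modify ptes here, under ptl ... */
		pte_unmap_unlock(pte, ptl);
		return 0;
	}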

Given the likely extension to freeing empty page tables, I have not
limited this set of changes to a THP config; and it has been easier,
and sets a better example, if each site is given appropriate handling:
even where deeper study might prove that failure could only happen if
the pmd table were corrupted.

Several of the patches are, or include, cleanup on the way; and by the
end, pmd_trans_unstable() and suchlike are deleted: pte_offset_map() and
pte_offset_map_lock() then handle those original races and more. Most
uses of pte_lockptr() are deprecated, with pte_offset_map_nolock()
taking its place.
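
For example, where a caller used to do pte_offset_map() and then
pte_lockptr(), the shape it takes by the end of the series is sketched
below (hypothetical caller, for illustration only):

	static int example_nolock(struct mm_struct *mm, pmd_t *pmd,
				  unsigned long addr)
	{
		spinlock_t *ptl;
		pte_t *pte;

		pte = pte_offset_map_nolock(mm, pmd, addr, &ptl);
		if (!pte)
			return -EAGAIN;	/* no page table: let caller retry */
		/*
		 * ptl matches the page table actually mapped, even if *pmd
		 * is then racily changed; a pte_lockptr(mm, pmd) taken later
		 * could not promise that.
		 */
		spin_lock(ptl);
		/* ... examine or modify ptes ... */
		pte_unmap_unlock(pte, ptl);
		return 0;
	}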

Based on v6.4-rc2, but also good for -rc1, -rc3,
current mm-everything and linux-next.

01/31 mm: use pmdp_get_lockless() without surplus barrier()
02/31 mm/migrate: remove cruft from migration_entry_wait()s
03/31 mm/pgtable: kmap_local_page() instead of kmap_atomic()
04/31 mm/pgtable: allow pte_offset_map[_lock]() to fail
05/31 mm/filemap: allow pte_offset_map_lock() to fail
06/31 mm/page_vma_mapped: delete bogosity in page_vma_mapped_walk()
07/31 mm/page_vma_mapped: reformat map_pte() with less indentation
08/31 mm/page_vma_mapped: pte_offset_map_nolock() not pte_lockptr()
09/31 mm/pagewalkers: ACTION_AGAIN if pte_offset_map_lock() fails
10/31 mm/pagewalk: walk_pte_range() allow for pte_offset_map()
11/31 mm/vmwgfx: simplify pmd & pud mapping dirty helpers
12/31 mm/vmalloc: vmalloc_to_page() use pte_offset_kernel()
13/31 mm/hmm: retry if pte_offset_map() fails
14/31 fs/userfaultfd: retry if pte_offset_map() fails
15/31 mm/userfaultfd: allow pte_offset_map_lock() to fail
16/31 mm/debug_vm_pgtable,page_table_check: warn pte map fails
17/31 mm/various: give up if pte_offset_map[_lock]() fails
18/31 mm/mprotect: delete pmd_none_or_clear_bad_unless_trans_huge()
19/31 mm/mremap: retry if either pte_offset_map_*lock() fails
20/31 mm/madvise: clean up pte_offset_map_lock() scans
21/31 mm/madvise: clean up force_shm_swapin_readahead()
22/31 mm/swapoff: allow pte_offset_map[_lock]() to fail
23/31 mm/mglru: allow pte_offset_map_nolock() to fail
24/31 mm/migrate_device: allow pte_offset_map_lock() to fail
25/31 mm/gup: remove FOLL_SPLIT_PMD use of pmd_trans_unstable()
26/31 mm/huge_memory: split huge pmd under one pte_offset_map()
27/31 mm/khugepaged: allow pte_offset_map[_lock]() to fail
28/31 mm/memory: allow pte_offset_map[_lock]() to fail
29/31 mm/memory: handle_pte_fault() use pte_offset_map_nolock()
30/31 mm/pgtable: delete pmd_trans_unstable() and friends
31/31 perf/core: Allow pte_offset_map() to fail

Documentation/mm/split_page_table_lock.rst | 17 +-
fs/proc/task_mmu.c | 32 ++--
fs/userfaultfd.c | 21 +--
include/linux/migrate.h | 4 +-
include/linux/mm.h | 27 ++-
include/linux/pgtable.h | 142 +++-----------
include/linux/swapops.h | 17 +-
kernel/events/core.c | 4 +
mm/damon/vaddr.c | 12 +-
mm/debug_vm_pgtable.c | 9 +-
mm/filemap.c | 25 +--
mm/gup.c | 34 ++--
mm/hmm.c | 4 +-
mm/huge_memory.c | 33 ++--
mm/khugepaged.c | 83 +++++----
mm/ksm.c | 10 +-
mm/madvise.c | 146 ++++++++-------
mm/mapping_dirty_helpers.c | 34 +---
mm/memcontrol.c | 8 +-
mm/memory-failure.c | 8 +-
mm/memory.c | 224 ++++++++++-------------
mm/mempolicy.c | 7 +-
mm/migrate.c | 40 ++--
mm/migrate_device.c | 31 +---
mm/mincore.c | 9 +-
mm/mlock.c | 4 +
mm/mprotect.c | 79 ++------
mm/mremap.c | 28 ++-
mm/page_table_check.c | 2 +
mm/page_vma_mapped.c | 97 +++++-----
mm/pagewalk.c | 33 +++-
mm/pgtable-generic.c | 56 ++++++
mm/swap_state.c | 3 +
mm/swapfile.c | 38 ++--
mm/userfaultfd.c | 10 +-
mm/vmalloc.c | 3 +-
mm/vmscan.c | 16 +-
37 files changed, 641 insertions(+), 709 deletions(-)

Hugh


2023-05-22 05:11:06

by Hugh Dickins

Subject: [PATCH 01/31] mm: use pmdp_get_lockless() without surplus barrier()

Use pmdp_get_lockless() in preference to READ_ONCE(*pmdp), to get a more
reliable result with PAE (or READ_ONCE as before without PAE); and remove
the unnecessary extra barrier()s which got left behind in its callers.

HOWEVER: Note the small print in linux/pgtable.h, where it was designed
specifically for fast GUP, and depends on interrupts being disabled for
its full guarantee: most callers which have been added (here and before)
do NOT have interrupts disabled, so there is still some need for caution.
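
For illustration only, a hypothetical caller sketch of that caution:
without interrupts disabled, the lockless read is just a snapshot, and
whatever is decided from it still gets rechecked under the pte lock (or
the race tolerated) before anything is modified.

	static bool example_pmd_maps_ptes(pmd_t *pmd)
	{
		pmd_t pmdval = pmdp_get_lockless(pmd);

		if (pmd_none(pmdval) || !pmd_present(pmdval) ||
		    pmd_trans_huge(pmdval))
			return false;	/* nothing PTE-mapped to work on */
		/* pmdval may already be stale: revalidate under ptl later */
		return true;
	}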

Signed-off-by: Hugh Dickins <[email protected]>
---
fs/userfaultfd.c | 10 +---------
include/linux/pgtable.h | 17 -----------------
mm/gup.c | 6 +-----
mm/hmm.c | 2 +-
mm/khugepaged.c | 5 -----
mm/ksm.c | 3 +--
mm/memory.c | 14 ++------------
mm/mprotect.c | 5 -----
mm/page_vma_mapped.c | 2 +-
9 files changed, 7 insertions(+), 57 deletions(-)

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 0fd96d6e39ce..f7a0817b1ec0 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -349,15 +349,7 @@ static inline bool userfaultfd_must_wait(struct userfaultfd_ctx *ctx,
if (!pud_present(*pud))
goto out;
pmd = pmd_offset(pud, address);
- /*
- * READ_ONCE must function as a barrier with narrower scope
- * and it must be equivalent to:
- * _pmd = *pmd; barrier();
- *
- * This is to deal with the instability (as in
- * pmd_trans_unstable) of the pmd.
- */
- _pmd = READ_ONCE(*pmd);
+ _pmd = pmdp_get_lockless(pmd);
if (pmd_none(_pmd))
goto out;

diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index c5a51481bbb9..8ec27fe69dc8 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -1344,23 +1344,6 @@ static inline int pud_trans_unstable(pud_t *pud)
static inline int pmd_none_or_trans_huge_or_clear_bad(pmd_t *pmd)
{
pmd_t pmdval = pmdp_get_lockless(pmd);
- /*
- * The barrier will stabilize the pmdval in a register or on
- * the stack so that it will stop changing under the code.
- *
- * When CONFIG_TRANSPARENT_HUGEPAGE=y on x86 32bit PAE,
- * pmdp_get_lockless is allowed to return a not atomic pmdval
- * (for example pointing to an hugepage that has never been
- * mapped in the pmd). The below checks will only care about
- * the low part of the pmd with 32bit PAE x86 anyway, with the
- * exception of pmd_none(). So the important thing is that if
- * the low part of the pmd is found null, the high part will
- * be also null or the pmd_none() check below would be
- * confused.
- */
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
- barrier();
-#endif
/*
* !pmd_present() checks for pmd migration entries
*
diff --git a/mm/gup.c b/mm/gup.c
index bbe416236593..3bd5d3854c51 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -653,11 +653,7 @@ static struct page *follow_pmd_mask(struct vm_area_struct *vma,
struct mm_struct *mm = vma->vm_mm;

pmd = pmd_offset(pudp, address);
- /*
- * The READ_ONCE() will stabilize the pmdval in a register or
- * on the stack so that it will stop changing under the code.
- */
- pmdval = READ_ONCE(*pmd);
+ pmdval = pmdp_get_lockless(pmd);
if (pmd_none(pmdval))
return no_page_table(vma, flags);
if (!pmd_present(pmdval))
diff --git a/mm/hmm.c b/mm/hmm.c
index 6a151c09de5e..e23043345615 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -332,7 +332,7 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp,
pmd_t pmd;

again:
- pmd = READ_ONCE(*pmdp);
+ pmd = pmdp_get_lockless(pmdp);
if (pmd_none(pmd))
return hmm_vma_walk_hole(start, end, -1, walk);

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 6b9d39d65b73..732f9ac393fc 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -961,11 +961,6 @@ static int find_pmd_or_thp_or_none(struct mm_struct *mm,
return SCAN_PMD_NULL;

pmde = pmdp_get_lockless(*pmd);
-
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
- /* See comments in pmd_none_or_trans_huge_or_clear_bad() */
- barrier();
-#endif
if (pmd_none(pmde))
return SCAN_PMD_NONE;
if (!pmd_present(pmde))
diff --git a/mm/ksm.c b/mm/ksm.c
index 0156bded3a66..df2aa281d49d 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -1194,8 +1194,7 @@ static int replace_page(struct vm_area_struct *vma, struct page *page,
* without holding anon_vma lock for write. So when looking for a
* genuine pmde (in which to find pte), test present and !THP together.
*/
- pmde = *pmd;
- barrier();
+ pmde = pmdp_get_lockless(pmd);
if (!pmd_present(pmde) || pmd_trans_huge(pmde))
goto out;

diff --git a/mm/memory.c b/mm/memory.c
index f69fbc251198..2eb54c0d5d3c 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4925,18 +4925,9 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
* So now it's safe to run pte_offset_map().
*/
vmf->pte = pte_offset_map(vmf->pmd, vmf->address);
- vmf->orig_pte = *vmf->pte;
+ vmf->orig_pte = ptep_get_lockless(vmf->pte);
vmf->flags |= FAULT_FLAG_ORIG_PTE_VALID;

- /*
- * some architectures can have larger ptes than wordsize,
- * e.g.ppc44x-defconfig has CONFIG_PTE_64BIT=y and
- * CONFIG_32BIT=y, so READ_ONCE cannot guarantee atomic
- * accesses. The code below just needs a consistent view
- * for the ifs and we later double check anyway with the
- * ptl lock held. So here a barrier will do.
- */
- barrier();
if (pte_none(vmf->orig_pte)) {
pte_unmap(vmf->pte);
vmf->pte = NULL;
@@ -5060,9 +5051,8 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
if (!(ret & VM_FAULT_FALLBACK))
return ret;
} else {
- vmf.orig_pmd = *vmf.pmd;
+ vmf.orig_pmd = pmdp_get_lockless(vmf.pmd);

- barrier();
if (unlikely(is_swap_pmd(vmf.orig_pmd))) {
VM_BUG_ON(thp_migration_supported() &&
!is_pmd_migration_entry(vmf.orig_pmd));
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 92d3d3ca390a..c5a13c0f1017 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -309,11 +309,6 @@ static inline int pmd_none_or_clear_bad_unless_trans_huge(pmd_t *pmd)
{
pmd_t pmdval = pmdp_get_lockless(pmd);

- /* See pmd_none_or_trans_huge_or_clear_bad for info on barrier */
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
- barrier();
-#endif
-
if (pmd_none(pmdval))
return 1;
if (pmd_trans_huge(pmdval))
diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
index 4e448cfbc6ef..64aff6718bdb 100644
--- a/mm/page_vma_mapped.c
+++ b/mm/page_vma_mapped.c
@@ -210,7 +210,7 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
* compiler and used as a stale value after we've observed a
* subsequent update.
*/
- pmde = READ_ONCE(*pvmw->pmd);
+ pmde = pmdp_get_lockless(pvmw->pmd);

if (pmd_trans_huge(pmde) || is_pmd_migration_entry(pmde) ||
(pmd_present(pmde) && pmd_devmap(pmde))) {
--
2.35.3


2023-05-22 05:12:22

by Hugh Dickins

Subject: [PATCH 15/31] mm/userfaultfd: allow pte_offset_map_lock() to fail

mfill_atomic_install_pte() and mfill_atomic_pte_zeropage() treat
failed pte_offset_map_lock() as -EFAULT, with no attempt to retry.

Signed-off-by: Hugh Dickins <[email protected]>
---
mm/userfaultfd.c | 10 ++++++++--
1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index e97a0b4889fc..b1554286a31c 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -76,14 +76,16 @@ int mfill_atomic_install_pte(pmd_t *dst_pmd,
if (flags & MFILL_ATOMIC_WP)
_dst_pte = pte_mkuffd_wp(_dst_pte);

+ ret = -EFAULT;
dst_pte = pte_offset_map_lock(dst_mm, dst_pmd, dst_addr, &ptl);
+ if (!dst_pte)
+ goto out;

if (vma_is_shmem(dst_vma)) {
/* serialize against truncate with the page table lock */
inode = dst_vma->vm_file->f_inode;
offset = linear_page_index(dst_vma, dst_addr);
max_off = DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE);
- ret = -EFAULT;
if (unlikely(offset >= max_off))
goto out_unlock;
}
@@ -121,6 +123,7 @@ int mfill_atomic_install_pte(pmd_t *dst_pmd,
ret = 0;
out_unlock:
pte_unmap_unlock(dst_pte, ptl);
+out:
return ret;
}

@@ -212,13 +215,15 @@ static int mfill_atomic_pte_zeropage(pmd_t *dst_pmd,

_dst_pte = pte_mkspecial(pfn_pte(my_zero_pfn(dst_addr),
dst_vma->vm_page_prot));
+ ret = -EFAULT;
dst_pte = pte_offset_map_lock(dst_vma->vm_mm, dst_pmd, dst_addr, &ptl);
+ if (!dst_pte)
+ goto out;
if (dst_vma->vm_file) {
/* the shmem MAP_PRIVATE case requires checking the i_size */
inode = dst_vma->vm_file->f_inode;
offset = linear_page_index(dst_vma, dst_addr);
max_off = DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE);
- ret = -EFAULT;
if (unlikely(offset >= max_off))
goto out_unlock;
}
@@ -231,6 +236,7 @@ static int mfill_atomic_pte_zeropage(pmd_t *dst_pmd,
ret = 0;
out_unlock:
pte_unmap_unlock(dst_pte, ptl);
+out:
return ret;
}

--
2.35.3


2023-05-22 05:18:45

by Hugh Dickins

Subject: [PATCH 16/31] mm/debug_vm_pgtable,page_table_check: warn pte map fails

Failures here would be surprising: pte_advanced_tests() and
pte_clear_tests() and __page_table_check_pte_clear_range() each
issue a warning if pte_offset_map() or pte_offset_map_lock() fails.

Signed-off-by: Hugh Dickins <[email protected]>
---
mm/debug_vm_pgtable.c | 9 ++++++++-
mm/page_table_check.c | 2 ++
2 files changed, 10 insertions(+), 1 deletion(-)

diff --git a/mm/debug_vm_pgtable.c b/mm/debug_vm_pgtable.c
index c54177aabebd..ee119e33fef1 100644
--- a/mm/debug_vm_pgtable.c
+++ b/mm/debug_vm_pgtable.c
@@ -138,6 +138,9 @@ static void __init pte_advanced_tests(struct pgtable_debug_args *args)
return;

pr_debug("Validating PTE advanced\n");
+ if (WARN_ON(!args->ptep))
+ return;
+
pte = pfn_pte(args->pte_pfn, args->page_prot);
set_pte_at(args->mm, args->vaddr, args->ptep, pte);
flush_dcache_page(page);
@@ -619,6 +622,9 @@ static void __init pte_clear_tests(struct pgtable_debug_args *args)
* the unexpected overhead of cache flushing is acceptable.
*/
pr_debug("Validating PTE clear\n");
+ if (WARN_ON(!args->ptep))
+ return;
+
#ifndef CONFIG_RISCV
pte = __pte(pte_val(pte) | RANDOM_ORVALUE);
#endif
@@ -1377,7 +1383,8 @@ static int __init debug_vm_pgtable(void)
args.ptep = pte_offset_map_lock(args.mm, args.pmdp, args.vaddr, &ptl);
pte_clear_tests(&args);
pte_advanced_tests(&args);
- pte_unmap_unlock(args.ptep, ptl);
+ if (args.ptep)
+ pte_unmap_unlock(args.ptep, ptl);

ptl = pmd_lock(args.mm, args.pmdp);
pmd_clear_tests(&args);
diff --git a/mm/page_table_check.c b/mm/page_table_check.c
index 25d8610c0042..0c511330dbc9 100644
--- a/mm/page_table_check.c
+++ b/mm/page_table_check.c
@@ -240,6 +240,8 @@ void __page_table_check_pte_clear_range(struct mm_struct *mm,
pte_t *ptep = pte_offset_map(&pmd, addr);
unsigned long i;

+ if (WARN_ON(!ptep))
+ return;
for (i = 0; i < PTRS_PER_PTE; i++) {
__page_table_check_pte_clear(mm, addr, *ptep);
addr += PAGE_SIZE;
--
2.35.3


2023-05-22 05:18:54

by Hugh Dickins

Subject: [PATCH 19/31] mm/mremap: retry if either pte_offset_map_*lock() fails

move_ptes() returns -EAGAIN if pte_offset_map_lock() of old fails, or if
pte_offset_map_nolock() of new fails: move_page_tables() retries if so.

But that does need a pmd_none() check inside, to stop endless loop when
huge shmem is truncated (thank you to syzbot); and move_huge_pmd() must
tolerate that a page table might have been allocated there just before
(of course it would be more satisfying to remove the empty page table,
but this is not a path worth optimizing).

Signed-off-by: Hugh Dickins <[email protected]>
---
mm/huge_memory.c | 5 +++--
mm/mremap.c | 28 ++++++++++++++++++++--------
2 files changed, 23 insertions(+), 10 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 624671aaa60d..d4bd5fa7c823 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1760,9 +1760,10 @@ bool move_huge_pmd(struct vm_area_struct *vma, unsigned long old_addr,

/*
* The destination pmd shouldn't be established, free_pgtables()
- * should have release it.
+ * should have released it; but move_page_tables() might have already
+ * inserted a page table, if racing against shmem/file collapse.
*/
- if (WARN_ON(!pmd_none(*new_pmd))) {
+ if (!pmd_none(*new_pmd)) {
VM_BUG_ON(pmd_trans_huge(*new_pmd));
return false;
}
diff --git a/mm/mremap.c b/mm/mremap.c
index b11ce6c92099..1fc47b4f38d7 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -133,7 +133,7 @@ static pte_t move_soft_dirty_pte(pte_t pte)
return pte;
}

-static void move_ptes(struct vm_area_struct *vma, pmd_t *old_pmd,
+static int move_ptes(struct vm_area_struct *vma, pmd_t *old_pmd,
unsigned long old_addr, unsigned long old_end,
struct vm_area_struct *new_vma, pmd_t *new_pmd,
unsigned long new_addr, bool need_rmap_locks)
@@ -143,6 +143,7 @@ static void move_ptes(struct vm_area_struct *vma, pmd_t *old_pmd,
spinlock_t *old_ptl, *new_ptl;
bool force_flush = false;
unsigned long len = old_end - old_addr;
+ int err = 0;

/*
* When need_rmap_locks is true, we take the i_mmap_rwsem and anon_vma
@@ -170,8 +171,16 @@ static void move_ptes(struct vm_area_struct *vma, pmd_t *old_pmd,
* pte locks because exclusive mmap_lock prevents deadlock.
*/
old_pte = pte_offset_map_lock(mm, old_pmd, old_addr, &old_ptl);
- new_pte = pte_offset_map(new_pmd, new_addr);
- new_ptl = pte_lockptr(mm, new_pmd);
+ if (!old_pte) {
+ err = -EAGAIN;
+ goto out;
+ }
+ new_pte = pte_offset_map_nolock(mm, new_pmd, new_addr, &new_ptl);
+ if (!new_pte) {
+ pte_unmap_unlock(old_pte, old_ptl);
+ err = -EAGAIN;
+ goto out;
+ }
if (new_ptl != old_ptl)
spin_lock_nested(new_ptl, SINGLE_DEPTH_NESTING);
flush_tlb_batched_pending(vma->vm_mm);
@@ -208,8 +217,10 @@ static void move_ptes(struct vm_area_struct *vma, pmd_t *old_pmd,
spin_unlock(new_ptl);
pte_unmap(new_pte - 1);
pte_unmap_unlock(old_pte - 1, old_ptl);
+out:
if (need_rmap_locks)
drop_rmap_locks(vma);
+ return err;
}

#ifndef arch_supports_page_table_move
@@ -537,6 +548,7 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
new_pmd = alloc_new_pmd(vma->vm_mm, vma, new_addr);
if (!new_pmd)
break;
+again:
if (is_swap_pmd(*old_pmd) || pmd_trans_huge(*old_pmd) ||
pmd_devmap(*old_pmd)) {
if (extent == HPAGE_PMD_SIZE &&
@@ -544,8 +556,6 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
old_pmd, new_pmd, need_rmap_locks))
continue;
split_huge_pmd(vma, old_pmd, old_addr);
- if (pmd_trans_unstable(old_pmd))
- continue;
} else if (IS_ENABLED(CONFIG_HAVE_MOVE_PMD) &&
extent == PMD_SIZE) {
/*
@@ -556,11 +566,13 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
old_pmd, new_pmd, true))
continue;
}
-
+ if (pmd_none(*old_pmd))
+ continue;
if (pte_alloc(new_vma->vm_mm, new_pmd))
break;
- move_ptes(vma, old_pmd, old_addr, old_addr + extent, new_vma,
- new_pmd, new_addr, need_rmap_locks);
+ if (move_ptes(vma, old_pmd, old_addr, old_addr + extent,
+ new_vma, new_pmd, new_addr, need_rmap_locks) < 0)
+ goto again;
}

mmu_notifier_invalidate_range_end(&range);
--
2.35.3


2023-05-22 05:22:27

by Hugh Dickins

Subject: [PATCH 09/31] mm/pagewalkers: ACTION_AGAIN if pte_offset_map_lock() fails

Simple walk_page_range() users should set ACTION_AGAIN to retry when
pte_offset_map_lock() fails.

No need to check pmd_trans_unstable(): that was precisely to avoid the
possibility of calling pte_offset_map() on a racily removed or inserted
THP entry, but such cases are now safely handled inside it. Likewise
there is no need to check pmd_none() or pmd_bad() before calling it.
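
The resulting shape of each pmd_entry callback is sketched here
(hypothetical callback, condensed from the changes below):

	static int example_pte_range(pmd_t *pmd, unsigned long addr,
				     unsigned long end, struct mm_walk *walk)
	{
		spinlock_t *ptl;
		pte_t *pte;

		pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl);
		if (!pte) {
			walk->action = ACTION_AGAIN;	/* walker retries this pmd */
			return 0;
		}
		for (; addr != end; pte++, addr += PAGE_SIZE) {
			/* ... handle *pte ... */
		}
		pte_unmap_unlock(pte - 1, ptl);
		return 0;
	}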

Signed-off-by: Hugh Dickins <[email protected]>
---
fs/proc/task_mmu.c | 32 ++++++++++++++++----------------
mm/damon/vaddr.c | 12 ++++++++----
mm/mempolicy.c | 7 ++++---
mm/mincore.c | 9 ++++-----
mm/mlock.c | 4 ++++
5 files changed, 36 insertions(+), 28 deletions(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 420510f6a545..dba5052ce09b 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -631,14 +631,11 @@ static int smaps_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
goto out;
}

- if (pmd_trans_unstable(pmd))
- goto out;
- /*
- * The mmap_lock held all the way back in m_start() is what
- * keeps khugepaged out of here and from collapsing things
- * in here.
- */
pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
+ if (!pte) {
+ walk->action = ACTION_AGAIN;
+ return 0;
+ }
for (; addr != end; pte++, addr += PAGE_SIZE)
smaps_pte_entry(pte, addr, walk);
pte_unmap_unlock(pte - 1, ptl);
@@ -1191,10 +1188,11 @@ static int clear_refs_pte_range(pmd_t *pmd, unsigned long addr,
return 0;
}

- if (pmd_trans_unstable(pmd))
- return 0;
-
pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
+ if (!pte) {
+ walk->action = ACTION_AGAIN;
+ return 0;
+ }
for (; addr != end; pte++, addr += PAGE_SIZE) {
ptent = *pte;

@@ -1538,9 +1536,6 @@ static int pagemap_pmd_range(pmd_t *pmdp, unsigned long addr, unsigned long end,
spin_unlock(ptl);
return err;
}
-
- if (pmd_trans_unstable(pmdp))
- return 0;
#endif /* CONFIG_TRANSPARENT_HUGEPAGE */

/*
@@ -1548,6 +1543,10 @@ static int pagemap_pmd_range(pmd_t *pmdp, unsigned long addr, unsigned long end,
* goes beyond vma->vm_end.
*/
orig_pte = pte = pte_offset_map_lock(walk->mm, pmdp, addr, &ptl);
+ if (!pte) {
+ walk->action = ACTION_AGAIN;
+ return err;
+ }
for (; addr < end; pte++, addr += PAGE_SIZE) {
pagemap_entry_t pme;

@@ -1887,11 +1886,12 @@ static int gather_pte_stats(pmd_t *pmd, unsigned long addr,
spin_unlock(ptl);
return 0;
}
-
- if (pmd_trans_unstable(pmd))
- return 0;
#endif
orig_pte = pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl);
+ if (!pte) {
+ walk->action = ACTION_AGAIN;
+ return 0;
+ }
do {
struct page *page = can_gather_numa_stats(*pte, vma, addr);
if (!page)
diff --git a/mm/damon/vaddr.c b/mm/damon/vaddr.c
index 1fec16d7263e..b8762ff15c3c 100644
--- a/mm/damon/vaddr.c
+++ b/mm/damon/vaddr.c
@@ -318,9 +318,11 @@ static int damon_mkold_pmd_entry(pmd_t *pmd, unsigned long addr,
spin_unlock(ptl);
}

- if (pmd_none(*pmd) || unlikely(pmd_bad(*pmd)))
- return 0;
pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl);
+ if (!pte) {
+ walk->action = ACTION_AGAIN;
+ return 0;
+ }
if (!pte_present(*pte))
goto out;
damon_ptep_mkold(pte, walk->mm, addr);
@@ -464,9 +466,11 @@ static int damon_young_pmd_entry(pmd_t *pmd, unsigned long addr,
regular_page:
#endif /* CONFIG_TRANSPARENT_HUGEPAGE */

- if (pmd_none(*pmd) || unlikely(pmd_bad(*pmd)))
- return -EINVAL;
pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl);
+ if (!pte) {
+ walk->action = ACTION_AGAIN;
+ return 0;
+ }
if (!pte_present(*pte))
goto out;
folio = damon_get_folio(pte_pfn(*pte));
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 1756389a0609..4d0bcf6f0d52 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -514,10 +514,11 @@ static int queue_folios_pte_range(pmd_t *pmd, unsigned long addr,
if (ptl)
return queue_folios_pmd(pmd, ptl, addr, end, walk);

- if (pmd_trans_unstable(pmd))
- return 0;
-
mapped_pte = pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl);
+ if (!pte) {
+ walk->action = ACTION_AGAIN;
+ return 0;
+ }
for (; addr != end; pte++, addr += PAGE_SIZE) {
if (!pte_present(*pte))
continue;
diff --git a/mm/mincore.c b/mm/mincore.c
index 2d5be013a25a..f33f6a0b1ded 100644
--- a/mm/mincore.c
+++ b/mm/mincore.c
@@ -113,12 +113,11 @@ static int mincore_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
goto out;
}

- if (pmd_trans_unstable(pmd)) {
- __mincore_unmapped_range(addr, end, vma, vec);
- goto out;
- }
-
ptep = pte_offset_map_lock(walk->mm, pmd, addr, &ptl);
+ if (!ptep) {
+ walk->action = ACTION_AGAIN;
+ return 0;
+ }
for (; addr != end; ptep++, addr += PAGE_SIZE) {
pte_t pte = *ptep;

diff --git a/mm/mlock.c b/mm/mlock.c
index 40b43f8740df..9f2b1173b1b1 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -329,6 +329,10 @@ static int mlock_pte_range(pmd_t *pmd, unsigned long addr,
}

start_pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
+ if (!start_pte) {
+ walk->action = ACTION_AGAIN;
+ return 0;
+ }
for (pte = start_pte; addr != end; pte++, addr += PAGE_SIZE) {
if (!pte_present(*pte))
continue;
--
2.35.3


2023-05-22 05:24:53

by Hugh Dickins

Subject: [PATCH 18/31] mm/mprotect: delete pmd_none_or_clear_bad_unless_trans_huge()

change_pmd_range() had special pmd_none_or_clear_bad_unless_trans_huge(),
required to avoid "bad" choices when setting automatic NUMA hinting under
mmap_read_lock(); but most of that is already covered in pte_offset_map()
now. change_pmd_range() just wants a pmd_none() check before wasting
time on MMU notifiers, then checks on the read-once _pmd value to work
out what's needed for huge cases. If change_pte_range() returns -EAGAIN
to retry if pte_offset_map_lock() fails, nothing more special is needed.

Signed-off-by: Hugh Dickins <[email protected]>
---
mm/mprotect.c | 74 ++++++++++++---------------------------------------
1 file changed, 17 insertions(+), 57 deletions(-)

diff --git a/mm/mprotect.c b/mm/mprotect.c
index c5a13c0f1017..64e1df0af514 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -93,22 +93,9 @@ static long change_pte_range(struct mmu_gather *tlb,
bool uffd_wp_resolve = cp_flags & MM_CP_UFFD_WP_RESOLVE;

tlb_change_page_size(tlb, PAGE_SIZE);
-
- /*
- * Can be called with only the mmap_lock for reading by
- * prot_numa so we must check the pmd isn't constantly
- * changing from under us from pmd_none to pmd_trans_huge
- * and/or the other way around.
- */
- if (pmd_trans_unstable(pmd))
- return 0;
-
- /*
- * The pmd points to a regular pte so the pmd can't change
- * from under us even if the mmap_lock is only hold for
- * reading.
- */
pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
+ if (!pte)
+ return -EAGAIN;

/* Get target node for single threaded private VMAs */
if (prot_numa && !(vma->vm_flags & VM_SHARED) &&
@@ -301,26 +288,6 @@ static long change_pte_range(struct mmu_gather *tlb,
return pages;
}

-/*
- * Used when setting automatic NUMA hinting protection where it is
- * critical that a numa hinting PMD is not confused with a bad PMD.
- */
-static inline int pmd_none_or_clear_bad_unless_trans_huge(pmd_t *pmd)
-{
- pmd_t pmdval = pmdp_get_lockless(pmd);
-
- if (pmd_none(pmdval))
- return 1;
- if (pmd_trans_huge(pmdval))
- return 0;
- if (unlikely(pmd_bad(pmdval))) {
- pmd_clear_bad(pmd);
- return 1;
- }
-
- return 0;
-}
-
/*
* Return true if we want to split THPs into PTE mappings in change
* protection procedure, false otherwise.
@@ -398,7 +365,8 @@ static inline long change_pmd_range(struct mmu_gather *tlb,
pmd = pmd_offset(pud, addr);
do {
long ret;
-
+ pmd_t _pmd;
+again:
next = pmd_addr_end(addr, end);

ret = change_pmd_prepare(vma, pmd, cp_flags);
@@ -406,16 +374,8 @@ static inline long change_pmd_range(struct mmu_gather *tlb,
pages = ret;
break;
}
- /*
- * Automatic NUMA balancing walks the tables with mmap_lock
- * held for read. It's possible a parallel update to occur
- * between pmd_trans_huge() and a pmd_none_or_clear_bad()
- * check leading to a false positive and clearing.
- * Hence, it's necessary to atomically read the PMD value
- * for all the checks.
- */
- if (!is_swap_pmd(*pmd) && !pmd_devmap(*pmd) &&
- pmd_none_or_clear_bad_unless_trans_huge(pmd))
+
+ if (pmd_none(*pmd))
goto next;

/* invoke the mmu notifier if the pmd is populated */
@@ -426,7 +386,8 @@ static inline long change_pmd_range(struct mmu_gather *tlb,
mmu_notifier_invalidate_range_start(&range);
}

- if (is_swap_pmd(*pmd) || pmd_trans_huge(*pmd) || pmd_devmap(*pmd)) {
+ _pmd = pmdp_get_lockless(pmd);
+ if (is_swap_pmd(_pmd) || pmd_trans_huge(_pmd) || pmd_devmap(_pmd)) {
if ((next - addr != HPAGE_PMD_SIZE) ||
pgtable_split_needed(vma, cp_flags)) {
__split_huge_pmd(vma, pmd, addr, false, NULL);
@@ -441,15 +402,10 @@ static inline long change_pmd_range(struct mmu_gather *tlb,
break;
}
} else {
- /*
- * change_huge_pmd() does not defer TLB flushes,
- * so no need to propagate the tlb argument.
- */
- int nr_ptes = change_huge_pmd(tlb, vma, pmd,
+ ret = change_huge_pmd(tlb, vma, pmd,
addr, newprot, cp_flags);
-
- if (nr_ptes) {
- if (nr_ptes == HPAGE_PMD_NR) {
+ if (ret) {
+ if (ret == HPAGE_PMD_NR) {
pages += HPAGE_PMD_NR;
nr_huge_updates++;
}
@@ -460,8 +416,12 @@ static inline long change_pmd_range(struct mmu_gather *tlb,
}
/* fall through, the trans huge pmd just split */
}
- pages += change_pte_range(tlb, vma, pmd, addr, next,
- newprot, cp_flags);
+
+ ret = change_pte_range(tlb, vma, pmd, addr, next, newprot,
+ cp_flags);
+ if (ret < 0)
+ goto again;
+ pages += ret;
next:
cond_resched();
} while (pmd++, addr = next, addr != end);
--
2.35.3


2023-05-22 05:31:11

by Hugh Dickins

Subject: [PATCH 14/31] fs/userfaultfd: retry if pte_offset_map() fails

Instead of worrying whether the pmd is stable, userfaultfd_must_wait()
calls pte_offset_map() as before, but goes back to try again if that fails.

Risk of endless loop? It already broke out if pmd_none(), !pmd_present()
or pmd_trans_huge(), and pte_offset_map() would have cleared pmd_bad():
which leaves pmd_devmap(). Presumably pmd_devmap() is inappropriate in
a vma subject to userfaultfd (it would have been mistreated before),
but add a check just to avoid all possibility of endless loop there.

Signed-off-by: Hugh Dickins <[email protected]>
---
fs/userfaultfd.c | 11 ++++++-----
1 file changed, 6 insertions(+), 5 deletions(-)

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index f7a0817b1ec0..ca83423f8d54 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -349,12 +349,13 @@ static inline bool userfaultfd_must_wait(struct userfaultfd_ctx *ctx,
if (!pud_present(*pud))
goto out;
pmd = pmd_offset(pud, address);
+again:
_pmd = pmdp_get_lockless(pmd);
if (pmd_none(_pmd))
goto out;

ret = false;
- if (!pmd_present(_pmd))
+ if (!pmd_present(_pmd) || pmd_devmap(_pmd))
goto out;

if (pmd_trans_huge(_pmd)) {
@@ -363,11 +364,11 @@ static inline bool userfaultfd_must_wait(struct userfaultfd_ctx *ctx,
goto out;
}

- /*
- * the pmd is stable (as in !pmd_trans_unstable) so we can re-read it
- * and use the standard pte_offset_map() instead of parsing _pmd.
- */
pte = pte_offset_map(pmd, address);
+ if (!pte) {
+ ret = true;
+ goto again;
+ }
/*
* Lockless access: we're in a wait_event so it's ok if it
* changes under us. PTE markers should be handled the same as none
--
2.35.3


2023-05-22 05:32:21

by Hugh Dickins

Subject: [PATCH 26/31] mm/huge_memory: split huge pmd under one pte_offset_map()

__split_huge_zero_page_pmd() uses a single pte_offset_map() to sweep the
extent: it's already under pmd_lock(), so this is no worse for latency;
and since it's supposed to have full control of the just-withdrawn page
table, here choose to VM_BUG_ON if it were to fail. And please don't
increment haddr by PAGE_SIZE, that should remain huge aligned: declare
a separate addr (not a bugfix, but it was deceptive).

__split_huge_pmd_locked() likewise (but it had declared a separate addr);
and change its BUG_ON(!pte_none) to VM_BUG_ON, for consistency with zero
(those deposited page tables are sometimes victims of random corruption).

Signed-off-by: Hugh Dickins <[email protected]>
---
mm/huge_memory.c | 28 ++++++++++++++++++----------
1 file changed, 18 insertions(+), 10 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index d4bd5fa7c823..839c13fa0bbe 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2037,6 +2037,8 @@ static void __split_huge_zero_page_pmd(struct vm_area_struct *vma,
struct mm_struct *mm = vma->vm_mm;
pgtable_t pgtable;
pmd_t _pmd, old_pmd;
+ unsigned long addr;
+ pte_t *pte;
int i;

/*
@@ -2052,17 +2054,20 @@ static void __split_huge_zero_page_pmd(struct vm_area_struct *vma,
pgtable = pgtable_trans_huge_withdraw(mm, pmd);
pmd_populate(mm, &_pmd, pgtable);

- for (i = 0; i < HPAGE_PMD_NR; i++, haddr += PAGE_SIZE) {
- pte_t *pte, entry;
- entry = pfn_pte(my_zero_pfn(haddr), vma->vm_page_prot);
+ pte = pte_offset_map(&_pmd, haddr);
+ VM_BUG_ON(!pte);
+ for (i = 0, addr = haddr; i < HPAGE_PMD_NR; i++, addr += PAGE_SIZE) {
+ pte_t entry;
+
+ entry = pfn_pte(my_zero_pfn(addr), vma->vm_page_prot);
entry = pte_mkspecial(entry);
if (pmd_uffd_wp(old_pmd))
entry = pte_mkuffd_wp(entry);
- pte = pte_offset_map(&_pmd, haddr);
VM_BUG_ON(!pte_none(*pte));
- set_pte_at(mm, haddr, pte, entry);
- pte_unmap(pte);
+ set_pte_at(mm, addr, pte, entry);
+ pte++;
}
+ pte_unmap(pte - 1);
smp_wmb(); /* make pte visible before pmd */
pmd_populate(mm, pmd, pgtable);
}
@@ -2077,6 +2082,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
bool young, write, soft_dirty, pmd_migration = false, uffd_wp = false;
bool anon_exclusive = false, dirty = false;
unsigned long addr;
+ pte_t *pte;
int i;

VM_BUG_ON(haddr & ~HPAGE_PMD_MASK);
@@ -2205,8 +2211,10 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
pgtable = pgtable_trans_huge_withdraw(mm, pmd);
pmd_populate(mm, &_pmd, pgtable);

+ pte = pte_offset_map(&_pmd, haddr);
+ VM_BUG_ON(!pte);
for (i = 0, addr = haddr; i < HPAGE_PMD_NR; i++, addr += PAGE_SIZE) {
- pte_t entry, *pte;
+ pte_t entry;
/*
* Note that NUMA hinting access restrictions are not
* transferred to avoid any possibility of altering
@@ -2249,11 +2257,11 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
entry = pte_mkuffd_wp(entry);
page_add_anon_rmap(page + i, vma, addr, false);
}
- pte = pte_offset_map(&_pmd, addr);
- BUG_ON(!pte_none(*pte));
+ VM_BUG_ON(!pte_none(*pte));
set_pte_at(mm, addr, pte, entry);
- pte_unmap(pte);
+ pte++;
}
+ pte_unmap(pte - 1);

if (!pmd_migration)
page_remove_rmap(page, vma, true);
--
2.35.3


2023-05-22 05:48:15

by Hugh Dickins

Subject: [PATCH 08/31] mm/page_vma_mapped: pte_offset_map_nolock() not pte_lockptr()

map_pte() uses pte_offset_map_nolock(), to make sure of the ptl belonging
to pte, even if pmd entry is then changed racily: page_vma_mapped_walk()
uses that instead of getting pte_lockptr() later, or restarts if map_pte()
found no page table.

Signed-off-by: Hugh Dickins <[email protected]>
---
mm/page_vma_mapped.c | 28 ++++++++++++++++++++++------
1 file changed, 22 insertions(+), 6 deletions(-)

diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
index 947dc7491815..2af734274073 100644
--- a/mm/page_vma_mapped.c
+++ b/mm/page_vma_mapped.c
@@ -13,16 +13,28 @@ static inline bool not_found(struct page_vma_mapped_walk *pvmw)
return false;
}

-static bool map_pte(struct page_vma_mapped_walk *pvmw)
+static bool map_pte(struct page_vma_mapped_walk *pvmw, spinlock_t **ptlp)
{
if (pvmw->flags & PVMW_SYNC) {
/* Use the stricter lookup */
pvmw->pte = pte_offset_map_lock(pvmw->vma->vm_mm, pvmw->pmd,
pvmw->address, &pvmw->ptl);
- return true;
+ *ptlp = pvmw->ptl;
+ return !!pvmw->pte;
}

- pvmw->pte = pte_offset_map(pvmw->pmd, pvmw->address);
+ /*
+ * It is important to return the ptl corresponding to pte,
+ * in case *pvmw->pmd changes underneath us; so we need to
+ * return it even when choosing not to lock, in case caller
+ * proceeds to loop over next ptes, and finds a match later.
+ * Though, in most cases, page lock already protects this.
+ */
+ pvmw->pte = pte_offset_map_nolock(pvmw->vma->vm_mm, pvmw->pmd,
+ pvmw->address, ptlp);
+ if (!pvmw->pte)
+ return false;
+
if (pvmw->flags & PVMW_MIGRATION) {
if (!is_swap_pte(*pvmw->pte))
return false;
@@ -51,7 +63,7 @@ static bool map_pte(struct page_vma_mapped_walk *pvmw)
} else if (!pte_present(*pvmw->pte)) {
return false;
}
- pvmw->ptl = pte_lockptr(pvmw->vma->vm_mm, pvmw->pmd);
+ pvmw->ptl = *ptlp;
spin_lock(pvmw->ptl);
return true;
}
@@ -156,6 +168,7 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
struct vm_area_struct *vma = pvmw->vma;
struct mm_struct *mm = vma->vm_mm;
unsigned long end;
+ spinlock_t *ptl;
pgd_t *pgd;
p4d_t *p4d;
pud_t *pud;
@@ -257,8 +270,11 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
step_forward(pvmw, PMD_SIZE);
continue;
}
- if (!map_pte(pvmw))
+ if (!map_pte(pvmw, &ptl)) {
+ if (!pvmw->pte)
+ goto restart;
goto next_pte;
+ }
this_pte:
if (check_pte(pvmw))
return true;
@@ -281,7 +297,7 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
} while (pte_none(*pvmw->pte));

if (!pvmw->ptl) {
- pvmw->ptl = pte_lockptr(mm, pvmw->pmd);
+ pvmw->ptl = ptl;
spin_lock(pvmw->ptl);
}
goto this_pte;
--
2.35.3


2023-05-22 05:50:09

by Hugh Dickins

Subject: [PATCH 22/31] mm/swapoff: allow pte_offset_map[_lock]() to fail

Adjust unuse_pte() and unuse_pte_range() to allow pte_offset_map_lock()
and pte_offset_map() failure; remove pmd_none_or_trans_huge_or_clear_bad()
from unuse_pmd_range() now that pte_offset_map() does all that itself.

Signed-off-by: Hugh Dickins <[email protected]>
---
mm/swapfile.c | 38 ++++++++++++++++++++------------------
1 file changed, 20 insertions(+), 18 deletions(-)

diff --git a/mm/swapfile.c b/mm/swapfile.c
index 274bbf797480..12d204e6dae2 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1774,7 +1774,7 @@ static int unuse_pte(struct vm_area_struct *vma, pmd_t *pmd,
hwposioned = true;

pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
- if (unlikely(!pte_same_as_swp(*pte, swp_entry_to_pte(entry)))) {
+ if (unlikely(!pte || !pte_same_as_swp(*pte, swp_entry_to_pte(entry)))) {
ret = 0;
goto out;
}
@@ -1827,7 +1827,8 @@ static int unuse_pte(struct vm_area_struct *vma, pmd_t *pmd,
set_pte_at(vma->vm_mm, addr, pte, new_pte);
swap_free(entry);
out:
- pte_unmap_unlock(pte, ptl);
+ if (pte)
+ pte_unmap_unlock(pte, ptl);
if (page != swapcache) {
unlock_page(page);
put_page(page);
@@ -1839,17 +1840,22 @@ static int unuse_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
unsigned long addr, unsigned long end,
unsigned int type)
{
- swp_entry_t entry;
- pte_t *pte;
+ pte_t *pte = NULL;
struct swap_info_struct *si;
- int ret = 0;

si = swap_info[type];
- pte = pte_offset_map(pmd, addr);
do {
struct folio *folio;
unsigned long offset;
unsigned char swp_count;
+ swp_entry_t entry;
+ int ret;
+
+ if (!pte++) {
+ pte = pte_offset_map(pmd, addr);
+ if (!pte)
+ break;
+ }

if (!is_swap_pte(*pte))
continue;
@@ -1860,6 +1866,8 @@ static int unuse_pte_range(struct vm_area_struct *vma, pmd_t *pmd,

offset = swp_offset(entry);
pte_unmap(pte);
+ pte = NULL;
+
folio = swap_cache_get_folio(entry, vma, addr);
if (!folio) {
struct page *page;
@@ -1878,8 +1886,7 @@ static int unuse_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
if (!folio) {
swp_count = READ_ONCE(si->swap_map[offset]);
if (swp_count == 0 || swp_count == SWAP_MAP_BAD)
- goto try_next;
-
+ continue;
return -ENOMEM;
}

@@ -1889,20 +1896,17 @@ static int unuse_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
if (ret < 0) {
folio_unlock(folio);
folio_put(folio);
- goto out;
+ return ret;
}

folio_free_swap(folio);
folio_unlock(folio);
folio_put(folio);
-try_next:
- pte = pte_offset_map(pmd, addr);
- } while (pte++, addr += PAGE_SIZE, addr != end);
- pte_unmap(pte - 1);
+ } while (addr += PAGE_SIZE, addr != end);

- ret = 0;
-out:
- return ret;
+ if (pte)
+ pte_unmap(pte);
+ return 0;
}

static inline int unuse_pmd_range(struct vm_area_struct *vma, pud_t *pud,
@@ -1917,8 +1921,6 @@ static inline int unuse_pmd_range(struct vm_area_struct *vma, pud_t *pud,
do {
cond_resched();
next = pmd_addr_end(addr, end);
- if (pmd_none_or_trans_huge_or_clear_bad(pmd))
- continue;
ret = unuse_pte_range(vma, pmd, addr, next, type);
if (ret)
return ret;
--
2.35.3


2023-05-22 05:51:27

by Hugh Dickins

Subject: [PATCH 21/31] mm/madvise: clean up force_shm_swapin_readahead()

Some nearby MADV_WILLNEED cleanup unrelated to pte_offset_map_lock().
shmem_swapin_range() is a better name than force_shm_swapin_readahead().
Fix unimportant off-by-one on end_index. Call the swp_entry_t "entry"
rather than "swap": either is okay, but entry is the name used elsewhere
in mm/madvise.c. Do not assume GFP_HIGHUSER_MOVABLE: that's right for
anon swap, but shmem should take gfp from mapping. Pass the actual vma
and address to read_swap_cache_async(), in case a NUMA mempolicy applies.
lru_add_drain() at outer level, like madvise_willneed()'s other branch.

Signed-off-by: Hugh Dickins <[email protected]>
---
mm/madvise.c | 24 +++++++++++++-----------
1 file changed, 13 insertions(+), 11 deletions(-)

diff --git a/mm/madvise.c b/mm/madvise.c
index 0af64c4a8f82..9b3c9610052f 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -235,30 +235,34 @@ static const struct mm_walk_ops swapin_walk_ops = {
.pmd_entry = swapin_walk_pmd_entry,
};

-static void force_shm_swapin_readahead(struct vm_area_struct *vma,
+static void shmem_swapin_range(struct vm_area_struct *vma,
unsigned long start, unsigned long end,
struct address_space *mapping)
{
XA_STATE(xas, &mapping->i_pages, linear_page_index(vma, start));
- pgoff_t end_index = linear_page_index(vma, end + PAGE_SIZE - 1);
+ pgoff_t end_index = linear_page_index(vma, end) - 1;
struct page *page;
struct swap_iocb *splug = NULL;

rcu_read_lock();
xas_for_each(&xas, page, end_index) {
- swp_entry_t swap;
+ unsigned long addr;
+ swp_entry_t entry;

if (!xa_is_value(page))
continue;
- swap = radix_to_swp_entry(page);
+ entry = radix_to_swp_entry(page);
/* There might be swapin error entries in shmem mapping. */
- if (non_swap_entry(swap))
+ if (non_swap_entry(entry))
continue;
+
+ addr = vma->vm_start +
+ ((xas.xa_index - vma->vm_pgoff) << PAGE_SHIFT);
xas_pause(&xas);
rcu_read_unlock();

- page = read_swap_cache_async(swap, GFP_HIGHUSER_MOVABLE,
- NULL, 0, false, &splug);
+ page = read_swap_cache_async(entry, mapping_gfp_mask(mapping),
+ vma, addr, false, &splug);
if (page)
put_page(page);

@@ -266,8 +270,6 @@ static void force_shm_swapin_readahead(struct vm_area_struct *vma,
}
rcu_read_unlock();
swap_read_unplug(splug);
-
- lru_add_drain(); /* Push any new pages onto the LRU now */
}
#endif /* CONFIG_SWAP */

@@ -291,8 +293,8 @@ static long madvise_willneed(struct vm_area_struct *vma,
}

if (shmem_mapping(file->f_mapping)) {
- force_shm_swapin_readahead(vma, start, end,
- file->f_mapping);
+ shmem_swapin_range(vma, start, end, file->f_mapping);
+ lru_add_drain(); /* Push any new pages onto the LRU now */
return 0;
}
#else
--
2.35.3


2023-05-22 05:52:34

by Hugh Dickins

Subject: [PATCH 20/31] mm/madvise: clean up pte_offset_map_lock() scans

Came here to make madvise's several pte_offset_map_lock() scans advance
to next extent on failure, and remove superfluous pmd_trans_unstable()
and pmd_none_or_trans_huge_or_clear_bad() calls. But also did some
nearby cleanup.

swapin_walk_pmd_entry(): don't name an address "index"; don't drop the
lock after every pte, only when calling out to read_swap_cache_async().

madvise_cold_or_pageout_pte_range() and madvise_free_pte_range():
prefer "start_pte" for pointer, orig_pte usually denotes a saved pte
value; leave lazy MMU mode before unlocking; merge the success and
failure paths after split_folio().
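
The swapin_walk_pmd_entry() loop below follows a "map lazily, remap
after dropping" idiom; condensed into a hypothetical sketch:

	static int example_lazy_scan(struct mm_struct *mm, pmd_t *pmd,
				     unsigned long start, unsigned long end)
	{
		pte_t *ptep = NULL;
		spinlock_t *ptl;
		unsigned long addr;

		for (addr = start; addr < end; addr += PAGE_SIZE) {
			if (!ptep++) {	/* NULL, or dropped below: (re)map */
				ptep = pte_offset_map_lock(mm, pmd, addr, &ptl);
				if (!ptep)
					break;	/* no page table: give up */
			}
			if (!pte_present(*ptep))
				continue;
			/*
			 * Before calling anything that may sleep or fault:
			 * pte_unmap_unlock(ptep, ptl); ptep = NULL;
			 * and the next iteration remaps at the next address.
			 */
		}
		if (ptep)
			pte_unmap_unlock(ptep, ptl);
		return 0;
	}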

Signed-off-by: Hugh Dickins <[email protected]>
---
mm/madvise.c | 122 ++++++++++++++++++++++++++++-----------------------
1 file changed, 68 insertions(+), 54 deletions(-)

diff --git a/mm/madvise.c b/mm/madvise.c
index b5ffbaf616f5..0af64c4a8f82 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -188,37 +188,43 @@ static int madvise_update_vma(struct vm_area_struct *vma,

#ifdef CONFIG_SWAP
static int swapin_walk_pmd_entry(pmd_t *pmd, unsigned long start,
- unsigned long end, struct mm_walk *walk)
+ unsigned long end, struct mm_walk *walk)
{
struct vm_area_struct *vma = walk->private;
- unsigned long index;
struct swap_iocb *splug = NULL;
+ pte_t *ptep = NULL;
+ spinlock_t *ptl;
+ unsigned long addr;

- if (pmd_none_or_trans_huge_or_clear_bad(pmd))
- return 0;
-
- for (index = start; index != end; index += PAGE_SIZE) {
+ for (addr = start; addr < end; addr += PAGE_SIZE) {
pte_t pte;
swp_entry_t entry;
struct page *page;
- spinlock_t *ptl;
- pte_t *ptep;

- ptep = pte_offset_map_lock(vma->vm_mm, pmd, index, &ptl);
+ if (!ptep++) {
+ ptep = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
+ if (!ptep)
+ break;
+ }
+
pte = *ptep;
- pte_unmap_unlock(ptep, ptl);
-
if (!is_swap_pte(pte))
continue;
entry = pte_to_swp_entry(pte);
if (unlikely(non_swap_entry(entry)))
continue;

+ pte_unmap_unlock(ptep, ptl);
+ ptep = NULL;
+
page = read_swap_cache_async(entry, GFP_HIGHUSER_MOVABLE,
- vma, index, false, &splug);
+ vma, addr, false, &splug);
if (page)
put_page(page);
}
+
+ if (ptep)
+ pte_unmap_unlock(ptep, ptl);
swap_read_unplug(splug);
cond_resched();

@@ -340,7 +346,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
bool pageout = private->pageout;
struct mm_struct *mm = tlb->mm;
struct vm_area_struct *vma = walk->vma;
- pte_t *orig_pte, *pte, ptent;
+ pte_t *start_pte, *pte, ptent;
spinlock_t *ptl;
struct folio *folio = NULL;
LIST_HEAD(folio_list);
@@ -422,11 +428,11 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
}

regular_folio:
- if (pmd_trans_unstable(pmd))
- return 0;
#endif
tlb_change_page_size(tlb, PAGE_SIZE);
- orig_pte = pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
+ start_pte = pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
+ if (!start_pte)
+ return 0;
flush_tlb_batched_pending(mm);
arch_enter_lazy_mmu_mode();
for (; addr < end; pte++, addr += PAGE_SIZE) {
@@ -447,25 +453,28 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
* are sure it's worth. Split it if we are only owner.
*/
if (folio_test_large(folio)) {
+ int err;
+
if (folio_mapcount(folio) != 1)
break;
if (pageout_anon_only_filter && !folio_test_anon(folio))
break;
+ if (!folio_trylock(folio))
+ break;
folio_get(folio);
- if (!folio_trylock(folio)) {
- folio_put(folio);
- break;
- }
- pte_unmap_unlock(orig_pte, ptl);
- if (split_folio(folio)) {
- folio_unlock(folio);
- folio_put(folio);
- orig_pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
- break;
- }
+ arch_leave_lazy_mmu_mode();
+ pte_unmap_unlock(start_pte, ptl);
+ start_pte = NULL;
+ err = split_folio(folio);
folio_unlock(folio);
folio_put(folio);
- orig_pte = pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
+ if (err)
+ break;
+ start_pte = pte =
+ pte_offset_map_lock(mm, pmd, addr, &ptl);
+ if (!start_pte)
+ break;
+ arch_enter_lazy_mmu_mode();
pte--;
addr -= PAGE_SIZE;
continue;
@@ -510,8 +519,10 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
folio_deactivate(folio);
}

- arch_leave_lazy_mmu_mode();
- pte_unmap_unlock(orig_pte, ptl);
+ if (start_pte) {
+ arch_leave_lazy_mmu_mode();
+ pte_unmap_unlock(start_pte, ptl);
+ }
if (pageout)
reclaim_pages(&folio_list);
cond_resched();
@@ -612,7 +623,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
struct mm_struct *mm = tlb->mm;
struct vm_area_struct *vma = walk->vma;
spinlock_t *ptl;
- pte_t *orig_pte, *pte, ptent;
+ pte_t *start_pte, *pte, ptent;
struct folio *folio;
int nr_swap = 0;
unsigned long next;
@@ -620,13 +631,12 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
next = pmd_addr_end(addr, end);
if (pmd_trans_huge(*pmd))
if (madvise_free_huge_pmd(tlb, vma, pmd, addr, next))
- goto next;
-
- if (pmd_trans_unstable(pmd))
- return 0;
+ return 0;

tlb_change_page_size(tlb, PAGE_SIZE);
- orig_pte = pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
+ start_pte = pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
+ if (!start_pte)
+ return 0;
flush_tlb_batched_pending(mm);
arch_enter_lazy_mmu_mode();
for (; addr != end; pte++, addr += PAGE_SIZE) {
@@ -664,23 +674,26 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
* deactivate all pages.
*/
if (folio_test_large(folio)) {
+ int err;
+
if (folio_mapcount(folio) != 1)
- goto out;
+ break;
+ if (!folio_trylock(folio))
+ break;
folio_get(folio);
- if (!folio_trylock(folio)) {
- folio_put(folio);
- goto out;
- }
- pte_unmap_unlock(orig_pte, ptl);
- if (split_folio(folio)) {
- folio_unlock(folio);
- folio_put(folio);
- orig_pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
- goto out;
- }
+ arch_leave_lazy_mmu_mode();
+ pte_unmap_unlock(start_pte, ptl);
+ start_pte = NULL;
+ err = split_folio(folio);
folio_unlock(folio);
folio_put(folio);
- orig_pte = pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
+ if (err)
+ break;
+ start_pte = pte =
+ pte_offset_map_lock(mm, pmd, addr, &ptl);
+ if (!start_pte)
+ break;
+ arch_enter_lazy_mmu_mode();
pte--;
addr -= PAGE_SIZE;
continue;
@@ -725,17 +738,18 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
}
folio_mark_lazyfree(folio);
}
-out:
+
if (nr_swap) {
if (current->mm == mm)
sync_mm_rss(mm);
-
add_mm_counter(mm, MM_SWAPENTS, nr_swap);
}
- arch_leave_lazy_mmu_mode();
- pte_unmap_unlock(orig_pte, ptl);
+ if (start_pte) {
+ arch_leave_lazy_mmu_mode();
+ pte_unmap_unlock(start_pte, ptl);
+ }
cond_resched();
-next:
+
return 0;
}

--
2.35.3


2023-05-22 05:52:46

by Hugh Dickins

Subject: [PATCH 27/31] mm/khugepaged: allow pte_offset_map[_lock]() to fail

__collapse_huge_page_swapin(): don't drop the map after every pte, it
only has to be dropped by do_swap_page(); give up if pte_offset_map()
fails; trace_mm_collapse_huge_page_swapin() at the end, with result;
fix comment on returned result; fix vmf.pgoff, though it's not used.

collapse_huge_page(): use pte_offset_map_lock() on the _pmd returned
from clearing; allow failure, but it should be impossible there.
hpage_collapse_scan_pmd() and collapse_pte_mapped_thp() allow for
pte_offset_map_lock() failure.

Signed-off-by: Hugh Dickins <[email protected]>
---
mm/khugepaged.c | 72 +++++++++++++++++++++++++++++++++----------------
1 file changed, 49 insertions(+), 23 deletions(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 732f9ac393fc..49cfa7cdfe93 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -993,9 +993,8 @@ static int check_pmd_still_valid(struct mm_struct *mm,
* Only done if hpage_collapse_scan_pmd believes it is worthwhile.
*
* Called and returns without pte mapped or spinlocks held.
- * Note that if false is returned, mmap_lock will be released.
+ * Returns result: if not SCAN_SUCCEED, mmap_lock has been released.
*/
-
static int __collapse_huge_page_swapin(struct mm_struct *mm,
struct vm_area_struct *vma,
unsigned long haddr, pmd_t *pmd,
@@ -1004,23 +1003,35 @@ static int __collapse_huge_page_swapin(struct mm_struct *mm,
int swapped_in = 0;
vm_fault_t ret = 0;
unsigned long address, end = haddr + (HPAGE_PMD_NR * PAGE_SIZE);
+ int result;
+ pte_t *pte = NULL;

for (address = haddr; address < end; address += PAGE_SIZE) {
struct vm_fault vmf = {
.vma = vma,
.address = address,
- .pgoff = linear_page_index(vma, haddr),
+ .pgoff = linear_page_index(vma, address),
.flags = FAULT_FLAG_ALLOW_RETRY,
.pmd = pmd,
};

- vmf.pte = pte_offset_map(pmd, address);
- vmf.orig_pte = *vmf.pte;
- if (!is_swap_pte(vmf.orig_pte)) {
- pte_unmap(vmf.pte);
- continue;
+ if (!pte++) {
+ pte = pte_offset_map(pmd, address);
+ if (!pte) {
+ mmap_read_unlock(mm);
+ result = SCAN_PMD_NULL;
+ goto out;
+ }
}
+
+ vmf.orig_pte = *pte;
+ if (!is_swap_pte(vmf.orig_pte))
+ continue;
+
+ vmf.pte = pte;
ret = do_swap_page(&vmf);
+ /* Which unmaps pte (after perhaps re-checking the entry) */
+ pte = NULL;

/*
* do_swap_page returns VM_FAULT_RETRY with released mmap_lock.
@@ -1029,24 +1040,29 @@ static int __collapse_huge_page_swapin(struct mm_struct *mm,
* resulting in later failure.
*/
if (ret & VM_FAULT_RETRY) {
- trace_mm_collapse_huge_page_swapin(mm, swapped_in, referenced, 0);
/* Likely, but not guaranteed, that page lock failed */
- return SCAN_PAGE_LOCK;
+ result = SCAN_PAGE_LOCK;
+ goto out;
}
if (ret & VM_FAULT_ERROR) {
mmap_read_unlock(mm);
- trace_mm_collapse_huge_page_swapin(mm, swapped_in, referenced, 0);
- return SCAN_FAIL;
+ result = SCAN_FAIL;
+ goto out;
}
swapped_in++;
}

+ if (pte)
+ pte_unmap(pte);
+
/* Drain LRU add pagevec to remove extra pin on the swapped in pages */
if (swapped_in)
lru_add_drain();

- trace_mm_collapse_huge_page_swapin(mm, swapped_in, referenced, 1);
- return SCAN_SUCCEED;
+ result = SCAN_SUCCEED;
+out:
+ trace_mm_collapse_huge_page_swapin(mm, swapped_in, referenced, result);
+ return result;
}

static int alloc_charge_hpage(struct page **hpage, struct mm_struct *mm,
@@ -1146,9 +1162,6 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
address + HPAGE_PMD_SIZE);
mmu_notifier_invalidate_range_start(&range);

- pte = pte_offset_map(pmd, address);
- pte_ptl = pte_lockptr(mm, pmd);
-
pmd_ptl = pmd_lock(mm, pmd); /* probably unnecessary */
/*
* This removes any huge TLB entry from the CPU so we won't allow
@@ -1163,13 +1176,18 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
mmu_notifier_invalidate_range_end(&range);
tlb_remove_table_sync_one();

- spin_lock(pte_ptl);
- result = __collapse_huge_page_isolate(vma, address, pte, cc,
- &compound_pagelist);
- spin_unlock(pte_ptl);
+ pte = pte_offset_map_lock(mm, &_pmd, address, &pte_ptl);
+ if (pte) {
+ result = __collapse_huge_page_isolate(vma, address, pte, cc,
+ &compound_pagelist);
+ spin_unlock(pte_ptl);
+ } else {
+ result = SCAN_PMD_NULL;
+ }

if (unlikely(result != SCAN_SUCCEED)) {
- pte_unmap(pte);
+ if (pte)
+ pte_unmap(pte);
spin_lock(pmd_ptl);
BUG_ON(!pmd_none(*pmd));
/*
@@ -1253,6 +1271,11 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm,
memset(cc->node_load, 0, sizeof(cc->node_load));
nodes_clear(cc->alloc_nmask);
pte = pte_offset_map_lock(mm, pmd, address, &ptl);
+ if (!pte) {
+ result = SCAN_PMD_NULL;
+ goto out;
+ }
+
for (_address = address, _pte = pte; _pte < pte + HPAGE_PMD_NR;
_pte++, _address += PAGE_SIZE) {
pte_t pteval = *_pte;
@@ -1622,8 +1645,10 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
* lockless_pages_from_mm() and the hardware page walker can access page
* tables while all the high-level locks are held in write mode.
*/
- start_pte = pte_offset_map_lock(mm, pmd, haddr, &ptl);
result = SCAN_FAIL;
+ start_pte = pte_offset_map_lock(mm, pmd, haddr, &ptl);
+ if (!start_pte)
+ goto drop_immap;

/* step 1: check all mapped PTEs are to the right huge page */
for (i = 0, addr = haddr, pte = start_pte;
@@ -1697,6 +1722,7 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,

abort:
pte_unmap_unlock(start_pte, ptl);
+drop_immap:
i_mmap_unlock_write(vma->vm_file->f_mapping);
goto drop_hpage;
}
--
2.35.3


2023-05-22 05:52:46

by Hugh Dickins

Subject: [PATCH 28/31] mm/memory: allow pte_offset_map[_lock]() to fail

copy_pte_range(): use pte_offset_map_nolock(), and allow for it to fail;
but with a comment on some further assumptions that are being made there.

zap_pte_range() and zap_pmd_range(): adjust their interaction so that
a pte_offset_map_lock() failure in zap_pte_range() leads to a retry in
zap_pmd_range(); remove call to pmd_none_or_trans_huge_or_clear_bad().

Allow pte_offset_map_lock() to fail in many functions. Update comment
on calling pte_alloc() in do_anonymous_page(). Remove redundant calls
to pmd_trans_unstable(), pmd_devmap_trans_unstable(), pmd_none() and
pmd_bad(); but leave pmd_none_or_clear_bad() calls in free_pmd_range()
and copy_pmd_range(), those do simplify the next level down.

Signed-off-by: Hugh Dickins <[email protected]>
---
mm/memory.c | 172 +++++++++++++++++++++++++---------------------------
1 file changed, 82 insertions(+), 90 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 2eb54c0d5d3c..c7b920291a72 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1012,13 +1012,25 @@ copy_pte_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
progress = 0;
init_rss_vec(rss);

+ /*
+ * copy_pmd_range()'s prior pmd_none_or_clear_bad(src_pmd), and the
+ * error handling here, assume that exclusive mmap_lock on dst and src
+ * protects anon from unexpected THP transitions; with shmem and file
+ * protected by mmap_lock-less collapse skipping areas with anon_vma
+ * (whereas vma_needs_copy() skips areas without anon_vma). A rework
+ * can remove such assumptions later, but this is good enough for now.
+ */
dst_pte = pte_alloc_map_lock(dst_mm, dst_pmd, addr, &dst_ptl);
if (!dst_pte) {
ret = -ENOMEM;
goto out;
}
- src_pte = pte_offset_map(src_pmd, addr);
- src_ptl = pte_lockptr(src_mm, src_pmd);
+ src_pte = pte_offset_map_nolock(src_mm, src_pmd, addr, &src_ptl);
+ if (!src_pte) {
+ pte_unmap_unlock(dst_pte, dst_ptl);
+ /* ret == 0 */
+ goto out;
+ }
spin_lock_nested(src_ptl, SINGLE_DEPTH_NESTING);
orig_src_pte = src_pte;
orig_dst_pte = dst_pte;
@@ -1083,8 +1095,7 @@ copy_pte_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
} while (dst_pte++, src_pte++, addr += PAGE_SIZE, addr != end);

arch_leave_lazy_mmu_mode();
- spin_unlock(src_ptl);
- pte_unmap(orig_src_pte);
+ pte_unmap_unlock(orig_src_pte, src_ptl);
add_mm_rss_vec(dst_mm, rss);
pte_unmap_unlock(orig_dst_pte, dst_ptl);
cond_resched();
@@ -1388,10 +1399,11 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
swp_entry_t entry;

tlb_change_page_size(tlb, PAGE_SIZE);
-again:
init_rss_vec(rss);
- start_pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
- pte = start_pte;
+ start_pte = pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
+ if (!pte)
+ return addr;
+
flush_tlb_batched_pending(mm);
arch_enter_lazy_mmu_mode();
do {
@@ -1507,17 +1519,10 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
* If we forced a TLB flush (either due to running out of
* batch buffers or because we needed to flush dirty TLB
* entries before releasing the ptl), free the batched
- * memory too. Restart if we didn't do everything.
+ * memory too. Come back again if we didn't do everything.
*/
- if (force_flush) {
- force_flush = 0;
+ if (force_flush)
tlb_flush_mmu(tlb);
- }
-
- if (addr != end) {
- cond_resched();
- goto again;
- }

return addr;
}
@@ -1536,8 +1541,10 @@ static inline unsigned long zap_pmd_range(struct mmu_gather *tlb,
if (is_swap_pmd(*pmd) || pmd_trans_huge(*pmd) || pmd_devmap(*pmd)) {
if (next - addr != HPAGE_PMD_SIZE)
__split_huge_pmd(vma, pmd, addr, false, NULL);
- else if (zap_huge_pmd(tlb, vma, pmd, addr))
- goto next;
+ else if (zap_huge_pmd(tlb, vma, pmd, addr)) {
+ addr = next;
+ continue;
+ }
/* fall through */
} else if (details && details->single_folio &&
folio_test_pmd_mappable(details->single_folio) &&
@@ -1550,20 +1557,14 @@ static inline unsigned long zap_pmd_range(struct mmu_gather *tlb,
*/
spin_unlock(ptl);
}
-
- /*
- * Here there can be other concurrent MADV_DONTNEED or
- * trans huge page faults running, and if the pmd is
- * none or trans huge it can change under us. This is
- * because MADV_DONTNEED holds the mmap_lock in read
- * mode.
- */
- if (pmd_none_or_trans_huge_or_clear_bad(pmd))
- goto next;
- next = zap_pte_range(tlb, vma, pmd, addr, next, details);
-next:
- cond_resched();
- } while (pmd++, addr = next, addr != end);
+ if (pmd_none(*pmd)) {
+ addr = next;
+ continue;
+ }
+ addr = zap_pte_range(tlb, vma, pmd, addr, next, details);
+ if (addr != next)
+ pmd--;
+ } while (pmd++, cond_resched(), addr != end);

return addr;
}
@@ -1905,6 +1906,10 @@ static int insert_pages(struct vm_area_struct *vma, unsigned long addr,
const int batch_size = min_t(int, pages_to_write_in_pmd, 8);

start_pte = pte_offset_map_lock(mm, pmd, addr, &pte_lock);
+ if (!start_pte) {
+ ret = -EFAULT;
+ goto out;
+ }
for (pte = start_pte; pte_idx < batch_size; ++pte, ++pte_idx) {
int err = insert_page_in_batch_locked(vma, pte,
addr, pages[curr_page_idx], prot);
@@ -2572,10 +2577,10 @@ static int apply_to_pte_range(struct mm_struct *mm, pmd_t *pmd,
mapped_pte = pte = (mm == &init_mm) ?
pte_offset_kernel(pmd, addr) :
pte_offset_map_lock(mm, pmd, addr, &ptl);
+ if (!pte)
+ return -EINVAL;
}

- BUG_ON(pmd_huge(*pmd));
-
arch_enter_lazy_mmu_mode();

if (fn) {
@@ -2804,7 +2809,6 @@ static inline int __wp_page_copy_user(struct page *dst, struct page *src,
int ret;
void *kaddr;
void __user *uaddr;
- bool locked = false;
struct vm_area_struct *vma = vmf->vma;
struct mm_struct *mm = vma->vm_mm;
unsigned long addr = vmf->address;
@@ -2830,12 +2834,12 @@ static inline int __wp_page_copy_user(struct page *dst, struct page *src,
* On architectures with software "accessed" bits, we would
* take a double page fault, so mark it accessed here.
*/
+ vmf->pte = NULL;
if (!arch_has_hw_pte_young() && !pte_young(vmf->orig_pte)) {
pte_t entry;

vmf->pte = pte_offset_map_lock(mm, vmf->pmd, addr, &vmf->ptl);
- locked = true;
- if (!likely(pte_same(*vmf->pte, vmf->orig_pte))) {
+ if (unlikely(!vmf->pte || !pte_same(*vmf->pte, vmf->orig_pte))) {
/*
* Other thread has already handled the fault
* and update local tlb only
@@ -2857,13 +2861,12 @@ static inline int __wp_page_copy_user(struct page *dst, struct page *src,
* zeroes.
*/
if (__copy_from_user_inatomic(kaddr, uaddr, PAGE_SIZE)) {
- if (locked)
+ if (vmf->pte)
goto warn;

/* Re-validate under PTL if the page is still mapped */
vmf->pte = pte_offset_map_lock(mm, vmf->pmd, addr, &vmf->ptl);
- locked = true;
- if (!likely(pte_same(*vmf->pte, vmf->orig_pte))) {
+ if (unlikely(!vmf->pte || !pte_same(*vmf->pte, vmf->orig_pte))) {
/* The PTE changed under us, update local tlb */
update_mmu_tlb(vma, addr, vmf->pte);
ret = -EAGAIN;
@@ -2888,7 +2891,7 @@ static inline int __wp_page_copy_user(struct page *dst, struct page *src,
ret = 0;

pte_unlock:
- if (locked)
+ if (vmf->pte)
pte_unmap_unlock(vmf->pte, vmf->ptl);
kunmap_atomic(kaddr);
flush_dcache_page(dst);
@@ -3110,7 +3113,7 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
* Re-check the pte - we dropped the lock
*/
vmf->pte = pte_offset_map_lock(mm, vmf->pmd, vmf->address, &vmf->ptl);
- if (likely(pte_same(*vmf->pte, vmf->orig_pte))) {
+ if (likely(vmf->pte && pte_same(*vmf->pte, vmf->orig_pte))) {
if (old_folio) {
if (!folio_test_anon(old_folio)) {
dec_mm_counter(mm, mm_counter_file(&old_folio->page));
@@ -3178,19 +3181,20 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
/* Free the old page.. */
new_folio = old_folio;
page_copied = 1;
- } else {
+ pte_unmap_unlock(vmf->pte, vmf->ptl);
+ } else if (vmf->pte) {
update_mmu_tlb(vma, vmf->address, vmf->pte);
+ pte_unmap_unlock(vmf->pte, vmf->ptl);
}

- if (new_folio)
- folio_put(new_folio);
-
- pte_unmap_unlock(vmf->pte, vmf->ptl);
/*
* No need to double call mmu_notifier->invalidate_range() callback as
* the above ptep_clear_flush_notify() did already call it.
*/
mmu_notifier_invalidate_range_only_end(&range);
+
+ if (new_folio)
+ folio_put(new_folio);
if (old_folio) {
if (page_copied)
free_swap_cache(&old_folio->page);
@@ -3230,6 +3234,8 @@ vm_fault_t finish_mkwrite_fault(struct vm_fault *vmf)
WARN_ON_ONCE(!(vmf->vma->vm_flags & VM_SHARED));
vmf->pte = pte_offset_map_lock(vmf->vma->vm_mm, vmf->pmd, vmf->address,
&vmf->ptl);
+ if (!vmf->pte)
+ return VM_FAULT_NOPAGE;
/*
* We might have raced with another page fault while we released the
* pte_offset_map_lock.
@@ -3591,10 +3597,11 @@ static vm_fault_t remove_device_exclusive_entry(struct vm_fault *vmf)

vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
&vmf->ptl);
- if (likely(pte_same(*vmf->pte, vmf->orig_pte)))
+ if (likely(vmf->pte && pte_same(*vmf->pte, vmf->orig_pte)))
restore_exclusive_pte(vma, vmf->page, vmf->address, vmf->pte);

- pte_unmap_unlock(vmf->pte, vmf->ptl);
+ if (vmf->pte)
+ pte_unmap_unlock(vmf->pte, vmf->ptl);
folio_unlock(folio);
folio_put(folio);

@@ -3625,6 +3632,8 @@ static vm_fault_t pte_marker_clear(struct vm_fault *vmf)
{
vmf->pte = pte_offset_map_lock(vmf->vma->vm_mm, vmf->pmd,
vmf->address, &vmf->ptl);
+ if (!vmf->pte)
+ return 0;
/*
* Be careful so that we will only recover a special uffd-wp pte into a
* none pte. Otherwise it means the pte could have changed, so retry.
@@ -3728,11 +3737,9 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
vmf->page = pfn_swap_entry_to_page(entry);
vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
vmf->address, &vmf->ptl);
- if (unlikely(!pte_same(*vmf->pte, vmf->orig_pte))) {
- spin_unlock(vmf->ptl);
- goto out;
- }
-
+ if (unlikely(!vmf->pte ||
+ !pte_same(*vmf->pte, vmf->orig_pte)))
+ goto unlock;
/*
* Get a page reference while we know the page can't be
* freed.
@@ -3807,7 +3814,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
*/
vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
vmf->address, &vmf->ptl);
- if (likely(pte_same(*vmf->pte, vmf->orig_pte)))
+ if (likely(vmf->pte && pte_same(*vmf->pte, vmf->orig_pte)))
ret = VM_FAULT_OOM;
goto unlock;
}
@@ -3877,7 +3884,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
*/
vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
&vmf->ptl);
- if (unlikely(!pte_same(*vmf->pte, vmf->orig_pte)))
+ if (unlikely(!vmf->pte || !pte_same(*vmf->pte, vmf->orig_pte)))
goto out_nomap;

if (unlikely(!folio_test_uptodate(folio))) {
@@ -4003,13 +4010,15 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
/* No need to invalidate - it was non-present before */
update_mmu_cache(vma, vmf->address, vmf->pte);
unlock:
- pte_unmap_unlock(vmf->pte, vmf->ptl);
+ if (vmf->pte)
+ pte_unmap_unlock(vmf->pte, vmf->ptl);
out:
if (si)
put_swap_device(si);
return ret;
out_nomap:
- pte_unmap_unlock(vmf->pte, vmf->ptl);
+ if (vmf->pte)
+ pte_unmap_unlock(vmf->pte, vmf->ptl);
out_page:
folio_unlock(folio);
out_release:
@@ -4041,22 +4050,12 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
return VM_FAULT_SIGBUS;

/*
- * Use pte_alloc() instead of pte_alloc_map(). We can't run
- * pte_offset_map() on pmds where a huge pmd might be created
- * from a different thread.
- *
- * pte_alloc_map() is safe to use under mmap_write_lock(mm) or when
- * parallel threads are excluded by other means.
- *
- * Here we only have mmap_read_lock(mm).
+ * Use pte_alloc() instead of pte_alloc_map(), so that OOM can
+ * be distinguished from a transient failure of pte_offset_map().
*/
if (pte_alloc(vma->vm_mm, vmf->pmd))
return VM_FAULT_OOM;

- /* See comment in handle_pte_fault() */
- if (unlikely(pmd_trans_unstable(vmf->pmd)))
- return 0;
-
/* Use the zero-page for reads */
if (!(vmf->flags & FAULT_FLAG_WRITE) &&
!mm_forbids_zeropage(vma->vm_mm)) {
@@ -4064,6 +4063,8 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
vma->vm_page_prot));
vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
vmf->address, &vmf->ptl);
+ if (!vmf->pte)
+ goto unlock;
if (vmf_pte_changed(vmf)) {
update_mmu_tlb(vma, vmf->address, vmf->pte);
goto unlock;
@@ -4104,6 +4105,8 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)

vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
&vmf->ptl);
+ if (!vmf->pte)
+ goto release;
if (vmf_pte_changed(vmf)) {
update_mmu_tlb(vma, vmf->address, vmf->pte);
goto release;
@@ -4131,7 +4134,8 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
/* No need to invalidate - it was non-present before */
update_mmu_cache(vma, vmf->address, vmf->pte);
unlock:
- pte_unmap_unlock(vmf->pte, vmf->ptl);
+ if (vmf->pte)
+ pte_unmap_unlock(vmf->pte, vmf->ptl);
return ret;
release:
folio_put(folio);
@@ -4380,15 +4384,10 @@ vm_fault_t finish_fault(struct vm_fault *vmf)
return VM_FAULT_OOM;
}

- /*
- * See comment in handle_pte_fault() for how this scenario happens, we
- * need to return NOPAGE so that we drop this page.
- */
- if (pmd_devmap_trans_unstable(vmf->pmd))
- return VM_FAULT_NOPAGE;
-
vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
vmf->address, &vmf->ptl);
+ if (!vmf->pte)
+ return VM_FAULT_NOPAGE;

/* Re-check under ptl */
if (likely(!vmf_pte_changed(vmf))) {
@@ -4630,17 +4629,11 @@ static vm_fault_t do_fault(struct vm_fault *vmf)
* The VMA was not fully populated on mmap() or missing VM_DONTEXPAND
*/
if (!vma->vm_ops->fault) {
- /*
- * If we find a migration pmd entry or a none pmd entry, which
- * should never happen, return SIGBUS
- */
- if (unlikely(!pmd_present(*vmf->pmd)))
+ vmf->pte = pte_offset_map_lock(vmf->vma->vm_mm, vmf->pmd,
+ vmf->address, &vmf->ptl);
+ if (unlikely(!vmf->pte))
ret = VM_FAULT_SIGBUS;
else {
- vmf->pte = pte_offset_map_lock(vmf->vma->vm_mm,
- vmf->pmd,
- vmf->address,
- &vmf->ptl);
/*
* Make sure this is not a temporary clearing of pte
* by holding ptl and checking again. A R/M/W update
@@ -5429,10 +5422,9 @@ int follow_pte(struct mm_struct *mm, unsigned long address,
pmd = pmd_offset(pud, address);
VM_BUG_ON(pmd_trans_huge(*pmd));

- if (pmd_none(*pmd) || unlikely(pmd_bad(*pmd)))
- goto out;
-
ptep = pte_offset_map_lock(mm, pmd, address, ptlp);
+ if (!ptep)
+ goto out;
if (!pte_present(*ptep))
goto unlock;
*ptepp = ptep;
--
2.35.3


2023-05-22 05:53:45

by Hugh Dickins

[permalink] [raw]
Subject: [PATCH 04/31] mm/pgtable: allow pte_offset_map[_lock]() to fail

Make pte_offset_map() a wrapper for __pte_offset_map() (optionally
outputs pmdval), pte_offset_map_lock() a sparse __cond_lock wrapper for
__pte_offset_map_lock(): those __funcs added in mm/pgtable-generic.c.

__pte_offset_map() do pmdval validation (including pmd_clear_bad()
when pmd_bad()), returning NULL if pmdval is not for a page table.
__pte_offset_map_lock() verify pmdval unchanged after getting the
lock, trying again if it changed.

No #ifdef CONFIG_TRANSPARENT_HUGEPAGE around them: that could be done
to cover the imminent case, but we expect to generalize it later, and
it makes a mess of where to do the pmd_bad() clearing.

Add pte_offset_map_nolock(): outputs ptl like pte_offset_map_lock(),
without actually taking the lock. This will be preferred to open uses of
pte_lockptr(), because (when split ptlock is in page table's struct page)
it points to the right lock for the returned pte pointer, even if *pmd
gets changed racily afterwards.
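
To illustrate, the caller patterns expected for the three variants look
roughly like this (a sketch of the idiom, with mm, pmd, addr and ptl as
generic placeholders, not lifted from any single caller):

        pte_t *pte;
        spinlock_t *ptl;

        /* pte_offset_map(): no lock taken; NULL means no page table */
        pte = pte_offset_map(pmd, addr);
        if (!pte)
                return;         /* skip or retry, as suits the caller */
        /* ... read ptes under whatever guarantees the caller has ... */
        pte_unmap(pte);

        /* pte_offset_map_lock(): ptl is taken only when pte is not NULL */
        pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
        if (!pte)
                return;
        /* ... modify ptes ... */
        pte_unmap_unlock(pte, ptl);

        /* pte_offset_map_nolock(): ptl is returned but left for the
         * caller to take (or trylock) when it chooses
         */
        pte = pte_offset_map_nolock(mm, pmd, addr, &ptl);
        if (!pte)
                return;
        spin_lock(ptl);
        /* ... */
        pte_unmap_unlock(pte, ptl);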

Update corresponding Documentation.

Do not add the anticipated rcu_read_lock() and rcu_read_unlock()s yet:
they have to wait until all architectures are balancing pte_offset_map()s
with pte_unmap()s (as in the arch series posted earlier). But comment
where they will go, so that it's easy to add them for experiments. And
only when those are in place can transient racy failure cases be enabled.
Add more safety for the PAE mismatched pmd_low pmd_high case at that time.

Signed-off-by: Hugh Dickins <[email protected]>
---
Documentation/mm/split_page_table_lock.rst | 17 ++++---
include/linux/mm.h | 27 +++++++----
include/linux/pgtable.h | 22 ++++++---
mm/pgtable-generic.c | 56 ++++++++++++++++++++++
4 files changed, 101 insertions(+), 21 deletions(-)

diff --git a/Documentation/mm/split_page_table_lock.rst b/Documentation/mm/split_page_table_lock.rst
index 50ee0dfc95be..a834fad9de12 100644
--- a/Documentation/mm/split_page_table_lock.rst
+++ b/Documentation/mm/split_page_table_lock.rst
@@ -14,15 +14,20 @@ tables. Access to higher level tables protected by mm->page_table_lock.
There are helpers to lock/unlock a table and other accessor functions:

- pte_offset_map_lock()
- maps pte and takes PTE table lock, returns pointer to the taken
- lock;
+ maps PTE and takes PTE table lock, returns pointer to PTE with
+ pointer to its PTE table lock, or returns NULL if no PTE table;
+ - pte_offset_map_nolock()
+ maps PTE, returns pointer to PTE with pointer to its PTE table
+ lock (not taken), or returns NULL if no PTE table;
+ - pte_offset_map()
+ maps PTE, returns pointer to PTE, or returns NULL if no PTE table;
+ - pte_unmap()
+ unmaps PTE table;
- pte_unmap_unlock()
unlocks and unmaps PTE table;
- pte_alloc_map_lock()
- allocates PTE table if needed and take the lock, returns pointer
- to taken lock or NULL if allocation failed;
- - pte_lockptr()
- returns pointer to PTE table lock;
+ allocates PTE table if needed and takes its lock, returns pointer to
+ PTE with pointer to its lock, or returns NULL if allocation failed;
- pmd_lock()
takes PMD table lock, returns pointer to taken lock;
- pmd_lockptr()
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 27ce77080c79..3c2e56980853 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2787,14 +2787,25 @@ static inline void pgtable_pte_page_dtor(struct page *page)
dec_lruvec_page_state(page, NR_PAGETABLE);
}

-#define pte_offset_map_lock(mm, pmd, address, ptlp) \
-({ \
- spinlock_t *__ptl = pte_lockptr(mm, pmd); \
- pte_t *__pte = pte_offset_map(pmd, address); \
- *(ptlp) = __ptl; \
- spin_lock(__ptl); \
- __pte; \
-})
+pte_t *__pte_offset_map(pmd_t *pmd, unsigned long addr, pmd_t *pmdvalp);
+static inline pte_t *pte_offset_map(pmd_t *pmd, unsigned long addr)
+{
+ return __pte_offset_map(pmd, addr, NULL);
+}
+
+pte_t *__pte_offset_map_lock(struct mm_struct *mm, pmd_t *pmd,
+ unsigned long addr, spinlock_t **ptlp);
+static inline pte_t *pte_offset_map_lock(struct mm_struct *mm, pmd_t *pmd,
+ unsigned long addr, spinlock_t **ptlp)
+{
+ pte_t *pte;
+
+ __cond_lock(*ptlp, pte = __pte_offset_map_lock(mm, pmd, addr, ptlp));
+ return pte;
+}
+
+pte_t *pte_offset_map_nolock(struct mm_struct *mm, pmd_t *pmd,
+ unsigned long addr, spinlock_t **ptlp);

#define pte_unmap_unlock(pte, ptl) do { \
spin_unlock(ptl); \
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 94235ff2706e..3fabbb018557 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -94,14 +94,22 @@ static inline pte_t *pte_offset_kernel(pmd_t *pmd, unsigned long address)
#define pte_offset_kernel pte_offset_kernel
#endif

-#if defined(CONFIG_HIGHPTE)
-#define pte_offset_map(dir, address) \
- ((pte_t *)kmap_local_page(pmd_page(*(dir))) + \
- pte_index((address)))
-#define pte_unmap(pte) kunmap_local((pte))
+#ifdef CONFIG_HIGHPTE
+#define __pte_map(pmd, address) \
+ ((pte_t *)kmap_local_page(pmd_page(*(pmd))) + pte_index((address)))
+#define pte_unmap(pte) do { \
+ kunmap_local((pte)); \
+ /* rcu_read_unlock() to be added later */ \
+} while (0)
#else
-#define pte_offset_map(dir, address) pte_offset_kernel((dir), (address))
-#define pte_unmap(pte) ((void)(pte)) /* NOP */
+static inline pte_t *__pte_map(pmd_t *pmd, unsigned long address)
+{
+ return pte_offset_kernel(pmd, address);
+}
+static inline void pte_unmap(pte_t *pte)
+{
+ /* rcu_read_unlock() to be added later */
+}
#endif

/* Find an entry in the second-level page table.. */
diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
index d2fc52bffafc..c7ab18a5fb77 100644
--- a/mm/pgtable-generic.c
+++ b/mm/pgtable-generic.c
@@ -10,6 +10,8 @@
#include <linux/pagemap.h>
#include <linux/hugetlb.h>
#include <linux/pgtable.h>
+#include <linux/swap.h>
+#include <linux/swapops.h>
#include <linux/mm_inline.h>
#include <asm/tlb.h>

@@ -229,3 +231,57 @@ pmd_t pmdp_collapse_flush(struct vm_area_struct *vma, unsigned long address,
}
#endif
#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
+
+pte_t *__pte_offset_map(pmd_t *pmd, unsigned long addr, pmd_t *pmdvalp)
+{
+ pmd_t pmdval;
+
+ /* rcu_read_lock() to be added later */
+ pmdval = pmdp_get_lockless(pmd);
+ if (pmdvalp)
+ *pmdvalp = pmdval;
+ if (unlikely(pmd_none(pmdval) || is_pmd_migration_entry(pmdval)))
+ goto nomap;
+ if (unlikely(pmd_trans_huge(pmdval) || pmd_devmap(pmdval)))
+ goto nomap;
+ if (unlikely(pmd_bad(pmdval))) {
+ pmd_clear_bad(pmd);
+ goto nomap;
+ }
+ return __pte_map(&pmdval, addr);
+nomap:
+ /* rcu_read_unlock() to be added later */
+ return NULL;
+}
+
+pte_t *pte_offset_map_nolock(struct mm_struct *mm, pmd_t *pmd,
+ unsigned long addr, spinlock_t **ptlp)
+{
+ pmd_t pmdval;
+ pte_t *pte;
+
+ pte = __pte_offset_map(pmd, addr, &pmdval);
+ if (likely(pte))
+ *ptlp = pte_lockptr(mm, &pmdval);
+ return pte;
+}
+
+pte_t *__pte_offset_map_lock(struct mm_struct *mm, pmd_t *pmd,
+ unsigned long addr, spinlock_t **ptlp)
+{
+ spinlock_t *ptl;
+ pmd_t pmdval;
+ pte_t *pte;
+again:
+ pte = __pte_offset_map(pmd, addr, &pmdval);
+ if (unlikely(!pte))
+ return pte;
+ ptl = pte_lockptr(mm, &pmdval);
+ spin_lock(ptl);
+ if (likely(pmd_same(pmdval, pmdp_get_lockless(pmd)))) {
+ *ptlp = ptl;
+ return pte;
+ }
+ pte_unmap_unlock(pte, ptl);
+ goto again;
+}
--
2.35.3


2023-05-22 05:54:43

by Hugh Dickins

[permalink] [raw]
Subject: [PATCH 24/31] mm/migrate_device: allow pte_offset_map_lock() to fail

migrate_vma_collect_pmd(): remove the pmd_trans_unstable() handling after
splitting huge zero pmd, and the pmd_none() handling after successfully
splitting huge page: those are now managed inside pte_offset_map_lock(),
and by "goto again" when it fails.

But the skip after unsuccessful split_huge_page() must stay: it avoids an
endless loop. The skip when pmd_bad()? Remove that: it will be treated
as a hole rather than a skip once cleared by pte_offset_map_lock(), but
with different timing that would be so anyway; and it's arguably best to
leave the pmd_bad() handling centralized there.
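
So the failure handling in migrate_vma_collect_pmd() reduces to the
hunk below; roughly:

again:
        /* existing label at the top of migrate_vma_collect_pmd() */
        ...
        ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);
        if (!ptep)
                goto again;     /* raced with a pmd change: start over */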

migrate_vma_insert_page(): remove comment on the old pte_offset_map()
and old locking limitations; remove the pmd_trans_unstable() check and
just proceed to pte_offset_map_lock(), aborting when it fails (page has
now been charged to memcg, but that's so in other cases, and presumably
uncharged later).

Signed-off-by: Hugh Dickins <[email protected]>
---
mm/migrate_device.c | 31 ++++---------------------------
1 file changed, 4 insertions(+), 27 deletions(-)

diff --git a/mm/migrate_device.c b/mm/migrate_device.c
index d30c9de60b0d..a14af6b12b04 100644
--- a/mm/migrate_device.c
+++ b/mm/migrate_device.c
@@ -83,9 +83,6 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
if (is_huge_zero_page(page)) {
spin_unlock(ptl);
split_huge_pmd(vma, pmdp, addr);
- if (pmd_trans_unstable(pmdp))
- return migrate_vma_collect_skip(start, end,
- walk);
} else {
int ret;

@@ -100,16 +97,12 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
if (ret)
return migrate_vma_collect_skip(start, end,
walk);
- if (pmd_none(*pmdp))
- return migrate_vma_collect_hole(start, end, -1,
- walk);
}
}

- if (unlikely(pmd_bad(*pmdp)))
- return migrate_vma_collect_skip(start, end, walk);
-
ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);
+ if (!ptep)
+ goto again;
arch_enter_lazy_mmu_mode();

for (; addr < end; addr += PAGE_SIZE, ptep++) {
@@ -595,27 +588,10 @@ static void migrate_vma_insert_page(struct migrate_vma *migrate,
pmdp = pmd_alloc(mm, pudp, addr);
if (!pmdp)
goto abort;
-
if (pmd_trans_huge(*pmdp) || pmd_devmap(*pmdp))
goto abort;
-
- /*
- * Use pte_alloc() instead of pte_alloc_map(). We can't run
- * pte_offset_map() on pmds where a huge pmd might be created
- * from a different thread.
- *
- * pte_alloc_map() is safe to use under mmap_write_lock(mm) or when
- * parallel threads are excluded by other means.
- *
- * Here we only have mmap_read_lock(mm).
- */
if (pte_alloc(mm, pmdp))
goto abort;
-
- /* See the comment in pte_alloc_one_map() */
- if (unlikely(pmd_trans_unstable(pmdp)))
- goto abort;
-
if (unlikely(anon_vma_prepare(vma)))
goto abort;
if (mem_cgroup_charge(page_folio(page), vma->vm_mm, GFP_KERNEL))
@@ -650,7 +626,8 @@ static void migrate_vma_insert_page(struct migrate_vma *migrate,
}

ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);
-
+ if (!ptep)
+ goto abort;
if (check_stable_address_space(mm))
goto unlock_abort;

--
2.35.3


2023-05-22 05:54:50

by Hugh Dickins

[permalink] [raw]
Subject: [PATCH 31/31] perf/core: Allow pte_offset_map() to fail

In rare transient cases, not yet made possible, pte_offset_map() and
pte_offset_map_lock() may not find a page table: handle appropriately.
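
Here "handle appropriately" means: when the page table has gone, go
back and re-read the pmd, which will then be seen as none, not present,
or a leaf. Condensed from the hunk below:

again:
        pmd = pmdp_get_lockless(pmdp);
        if (!pmd_present(pmd))
                return 0;
        /* ... pmd leaf sizes handled here ... */

        ptep = pte_offset_map(&pmd, addr);
        if (!ptep)
                goto again;     /* page table went away: re-check the pmd */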

Signed-off-by: Hugh Dickins <[email protected]>
---
This is a perf patch, not an mm patch, and it will want to go in through
the tip tree in due course; but keep it in this series for now, so that
it's not missed, and not submitted before mm review.

kernel/events/core.c | 4 ++++
1 file changed, 4 insertions(+)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index db016e418931..174be710f3b3 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -7490,6 +7490,7 @@ static u64 perf_get_pgtable_size(struct mm_struct *mm, unsigned long addr)
return pud_leaf_size(pud);

pmdp = pmd_offset_lockless(pudp, pud, addr);
+again:
pmd = pmdp_get_lockless(pmdp);
if (!pmd_present(pmd))
return 0;
@@ -7498,6 +7499,9 @@ static u64 perf_get_pgtable_size(struct mm_struct *mm, unsigned long addr)
return pmd_leaf_size(pmd);

ptep = pte_offset_map(&pmd, addr);
+ if (!ptep)
+ goto again;
+
pte = ptep_get_lockless(ptep);
if (pte_present(pte))
size = pte_leaf_size(pte);
--
2.35.3


2023-05-22 05:55:03

by Hugh Dickins

[permalink] [raw]
Subject: [PATCH 17/31] mm/various: give up if pte_offset_map[_lock]() fails

Following the examples of nearby code, various functions can just give
up if pte_offset_map() or pte_offset_map_lock() fails. And there's no
need for a preliminary pmd_trans_unstable() or other such check, since
such cases are now safely handled inside.
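
The common shape of these changes, sketched (the preliminary pmd check
is simply deleted, and NULL from the map is treated like any other
reason to give up):

        pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
        if (!pte)
                return 0;       /* no page table here (or it changed) */
        /* ... */
        pte_unmap_unlock(pte, ptl);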

Signed-off-by: Hugh Dickins <[email protected]>
---
mm/gup.c | 9 ++++++---
mm/ksm.c | 7 ++++---
mm/memcontrol.c | 8 ++++----
mm/memory-failure.c | 8 +++++---
mm/migrate.c | 3 +++
mm/swap_state.c | 3 +++
6 files changed, 25 insertions(+), 13 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index 3bd5d3854c51..bb67193c5460 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -544,10 +544,10 @@ static struct page *follow_page_pte(struct vm_area_struct *vma,
if (WARN_ON_ONCE((flags & (FOLL_PIN | FOLL_GET)) ==
(FOLL_PIN | FOLL_GET)))
return ERR_PTR(-EINVAL);
- if (unlikely(pmd_bad(*pmd)))
- return no_page_table(vma, flags);

ptep = pte_offset_map_lock(mm, pmd, address, &ptl);
+ if (!ptep)
+ return no_page_table(vma, flags);
pte = *ptep;
if (!pte_present(pte))
goto no_page;
@@ -851,8 +851,9 @@ static int get_gate_page(struct mm_struct *mm, unsigned long address,
pmd = pmd_offset(pud, address);
if (!pmd_present(*pmd))
return -EFAULT;
- VM_BUG_ON(pmd_trans_huge(*pmd));
pte = pte_offset_map(pmd, address);
+ if (!pte)
+ return -EFAULT;
if (pte_none(*pte))
goto unmap;
*vma = get_gate_vma(mm);
@@ -2377,6 +2378,8 @@ static int gup_pte_range(pmd_t pmd, pmd_t *pmdp, unsigned long addr,
pte_t *ptep, *ptem;

ptem = ptep = pte_offset_map(&pmd, addr);
+ if (!ptep)
+ return 0;
do {
pte_t pte = ptep_get_lockless(ptep);
struct page *page;
diff --git a/mm/ksm.c b/mm/ksm.c
index df2aa281d49d..3dc15459dd20 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -431,10 +431,9 @@ static int break_ksm_pmd_entry(pmd_t *pmd, unsigned long addr, unsigned long nex
pte_t *pte;
int ret;

- if (pmd_leaf(*pmd) || !pmd_present(*pmd))
- return 0;
-
pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl);
+ if (!pte)
+ return 0;
if (pte_present(*pte)) {
page = vm_normal_page(walk->vma, addr, *pte);
} else if (!pte_none(*pte)) {
@@ -1203,6 +1202,8 @@ static int replace_page(struct vm_area_struct *vma, struct page *page,
mmu_notifier_invalidate_range_start(&range);

ptep = pte_offset_map_lock(mm, pmd, addr, &ptl);
+ if (!ptep)
+ goto out_mn;
if (!pte_same(*ptep, orig_pte)) {
pte_unmap_unlock(ptep, ptl);
goto out_mn;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 4b27e245a055..fdd953655fe1 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -6057,9 +6057,9 @@ static int mem_cgroup_count_precharge_pte_range(pmd_t *pmd,
return 0;
}

- if (pmd_trans_unstable(pmd))
- return 0;
pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
+ if (!pte)
+ return 0;
for (; addr != end; pte++, addr += PAGE_SIZE)
if (get_mctgt_type(vma, addr, *pte, NULL))
mc.precharge++; /* increment precharge temporarily */
@@ -6277,10 +6277,10 @@ static int mem_cgroup_move_charge_pte_range(pmd_t *pmd,
return 0;
}

- if (pmd_trans_unstable(pmd))
- return 0;
retry:
pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
+ if (!pte)
+ return 0;
for (; addr != end; addr += PAGE_SIZE) {
pte_t ptent = *(pte++);
bool device = false;
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 5b663eca1f29..b3cc8f213fe3 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -414,6 +414,8 @@ static unsigned long dev_pagemap_mapping_shift(struct vm_area_struct *vma,
if (pmd_devmap(*pmd))
return PMD_SHIFT;
pte = pte_offset_map(pmd, address);
+ if (!pte)
+ return 0;
if (pte_present(*pte) && pte_devmap(*pte))
ret = PAGE_SHIFT;
pte_unmap(pte);
@@ -800,11 +802,11 @@ static int hwpoison_pte_range(pmd_t *pmdp, unsigned long addr,
goto out;
}

- if (pmd_trans_unstable(pmdp))
- goto out;
-
mapped_pte = ptep = pte_offset_map_lock(walk->vma->vm_mm, pmdp,
addr, &ptl);
+ if (!ptep)
+ goto out;
+
for (; addr != end; ptep++, addr += PAGE_SIZE) {
ret = check_hwpoisoned_entry(*ptep, addr, PAGE_SHIFT,
hwp->pfn, &hwp->tk);
diff --git a/mm/migrate.c b/mm/migrate.c
index 3ecb7a40075f..308a56f0b156 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -305,6 +305,9 @@ void migration_entry_wait(struct mm_struct *mm, pmd_t *pmd,
swp_entry_t entry;

ptep = pte_offset_map_lock(mm, pmd, address, &ptl);
+ if (!ptep)
+ return;
+
pte = *ptep;
pte_unmap(ptep);

diff --git a/mm/swap_state.c b/mm/swap_state.c
index b76a65ac28b3..db2ec85ef332 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -734,6 +734,9 @@ static void swap_ra_info(struct vm_fault *vmf,

/* Copy the PTEs because the page table may be unmapped */
orig_pte = pte = pte_offset_map(vmf->pmd, faddr);
+ if (!pte)
+ return;
+
if (fpfn == pfn + 1) {
lpfn = fpfn;
rpfn = fpfn + win;
--
2.35.3


2023-05-22 05:55:09

by Hugh Dickins

[permalink] [raw]
Subject: [PATCH 25/31] mm/gup: remove FOLL_SPLIT_PMD use of pmd_trans_unstable()

There is now no reason for follow_pmd_mask()'s FOLL_SPLIT_PMD block to
distinguish huge_zero_page from a normal THP: follow_page_pte() handles
any instability, and here it's a good idea to replace any pmd_none(*pmd)
by a page table a.s.a.p, in the huge_zero_page case as for a normal THP.
(Hmm, couldn't the normal THP case have hit an unstably refaulted THP
before? But there are only two, exceptional, users of FOLL_SPLIT_PMD.)

Signed-off-by: Hugh Dickins <[email protected]>
---
mm/gup.c | 19 ++++---------------
1 file changed, 4 insertions(+), 15 deletions(-)

diff --git a/mm/gup.c b/mm/gup.c
index bb67193c5460..4ad50a59897f 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -681,21 +681,10 @@ static struct page *follow_pmd_mask(struct vm_area_struct *vma,
return follow_page_pte(vma, address, pmd, flags, &ctx->pgmap);
}
if (flags & FOLL_SPLIT_PMD) {
- int ret;
- page = pmd_page(*pmd);
- if (is_huge_zero_page(page)) {
- spin_unlock(ptl);
- ret = 0;
- split_huge_pmd(vma, pmd, address);
- if (pmd_trans_unstable(pmd))
- ret = -EBUSY;
- } else {
- spin_unlock(ptl);
- split_huge_pmd(vma, pmd, address);
- ret = pte_alloc(mm, pmd) ? -ENOMEM : 0;
- }
-
- return ret ? ERR_PTR(ret) :
+ spin_unlock(ptl);
+ split_huge_pmd(vma, pmd, address);
+ /* If pmd was left empty, stuff a page table in there quickly */
+ return pte_alloc(mm, pmd) ? ERR_PTR(-ENOMEM) :
follow_page_pte(vma, address, pmd, flags, &ctx->pgmap);
}
page = follow_trans_huge_pmd(vma, address, pmd, flags);
--
2.35.3


2023-05-22 05:56:21

by Hugh Dickins

[permalink] [raw]
Subject: [PATCH 23/31] mm/mglru: allow pte_offset_map_nolock() to fail

MGLRU's walk_pte_range() use the safer pte_offset_map_nolock(), rather
than pte_lockptr(), to get the ptl for its trylock. Just return false
and move on to next extent if it fails, like when the trylock fails.
Remove the VM_WARN_ON_ONCE(pmd_leaf) since that will happen, rarely.
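
Condensed, the new sequence in walk_pte_range() is (see the hunk below):

        pte = pte_offset_map_nolock(args->mm, pmd, start & PMD_MASK, &ptl);
        if (!pte)
                return false;   /* no page table: move on to next extent */
        if (!spin_trylock(ptl)) {
                pte_unmap(pte);
                return false;   /* contended, as before: move on */
        }
        /* ... walk the ptes ... */
        pte_unmap_unlock(pte, ptl);
        return suitable_to_scan(total, young);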

Signed-off-by: Hugh Dickins <[email protected]>
---
mm/vmscan.c | 16 +++++++---------
1 file changed, 7 insertions(+), 9 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index d257916f39e5..1c344589c145 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3992,15 +3992,15 @@ static bool walk_pte_range(pmd_t *pmd, unsigned long start, unsigned long end,
struct pglist_data *pgdat = lruvec_pgdat(walk->lruvec);
int old_gen, new_gen = lru_gen_from_seq(walk->max_seq);

- VM_WARN_ON_ONCE(pmd_leaf(*pmd));
-
- ptl = pte_lockptr(args->mm, pmd);
- if (!spin_trylock(ptl))
+ pte = pte_offset_map_nolock(args->mm, pmd, start & PMD_MASK, &ptl);
+ if (!pte)
return false;
+ if (!spin_trylock(ptl)) {
+ pte_unmap(pte);
+ return false;
+ }

arch_enter_lazy_mmu_mode();
-
- pte = pte_offset_map(pmd, start & PMD_MASK);
restart:
for (i = pte_index(start), addr = start; addr != end; i++, addr += PAGE_SIZE) {
unsigned long pfn;
@@ -4041,10 +4041,8 @@ static bool walk_pte_range(pmd_t *pmd, unsigned long start, unsigned long end,
if (i < PTRS_PER_PTE && get_next_vma(PMD_MASK, PAGE_SIZE, args, &start, &end))
goto restart;

- pte_unmap(pte);
-
arch_leave_lazy_mmu_mode();
- spin_unlock(ptl);
+ pte_unmap_unlock(pte, ptl);

return suitable_to_scan(total, young);
}
--
2.35.3


2023-05-22 05:56:29

by Hugh Dickins

[permalink] [raw]
Subject: [PATCH 30/31] mm/pgtable: delete pmd_trans_unstable() and friends

Delete pmd_trans_unstable, pmd_none_or_trans_huge_or_clear_bad() and
pmd_devmap_trans_unstable(), all now unused.

With mixed feelings, delete all the comments on pmd_trans_unstable().
That was very good documentation of a subtle state, and this series does
not even eliminate that state: but rather, normalizes and extends it,
asking pte_offset_map[_lock]() callers to anticipate failure, without
regard for whether mmap_read_lock() or mmap_write_lock() is held.

Retain pud_trans_unstable(), which has one use in __handle_mm_fault(),
but delete its equivalent pud_none_or_trans_huge_or_dev_or_clear_bad().
While there, move the default arch_needs_pgtable_deposit() definition
up near where pgtable_trans_huge_deposit() and withdraw() are declared.

Signed-off-by: Hugh Dickins <[email protected]>
---
include/linux/pgtable.h | 103 +++-------------------------------------
mm/khugepaged.c | 4 --
2 files changed, 7 insertions(+), 100 deletions(-)

diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 3fabbb018557..a1326e61d7ee 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -599,6 +599,10 @@ extern void pgtable_trans_huge_deposit(struct mm_struct *mm, pmd_t *pmdp,
extern pgtable_t pgtable_trans_huge_withdraw(struct mm_struct *mm, pmd_t *pmdp);
#endif

+#ifndef arch_needs_pgtable_deposit
+#define arch_needs_pgtable_deposit() (false)
+#endif
+
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
/*
* This is an implementation of pmdp_establish() that is only suitable for an
@@ -1300,9 +1304,10 @@ static inline int pud_trans_huge(pud_t pud)
}
#endif

-/* See pmd_none_or_trans_huge_or_clear_bad for discussion. */
-static inline int pud_none_or_trans_huge_or_dev_or_clear_bad(pud_t *pud)
+static inline int pud_trans_unstable(pud_t *pud)
{
+#if defined(CONFIG_TRANSPARENT_HUGEPAGE) && \
+ defined(CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD)
pud_t pudval = READ_ONCE(*pud);

if (pud_none(pudval) || pud_trans_huge(pudval) || pud_devmap(pudval))
@@ -1311,104 +1316,10 @@ static inline int pud_none_or_trans_huge_or_dev_or_clear_bad(pud_t *pud)
pud_clear_bad(pud);
return 1;
}
- return 0;
-}
-
-/* See pmd_trans_unstable for discussion. */
-static inline int pud_trans_unstable(pud_t *pud)
-{
-#if defined(CONFIG_TRANSPARENT_HUGEPAGE) && \
- defined(CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD)
- return pud_none_or_trans_huge_or_dev_or_clear_bad(pud);
-#else
- return 0;
#endif
-}
-
-#ifndef arch_needs_pgtable_deposit
-#define arch_needs_pgtable_deposit() (false)
-#endif
-/*
- * This function is meant to be used by sites walking pagetables with
- * the mmap_lock held in read mode to protect against MADV_DONTNEED and
- * transhuge page faults. MADV_DONTNEED can convert a transhuge pmd
- * into a null pmd and the transhuge page fault can convert a null pmd
- * into an hugepmd or into a regular pmd (if the hugepage allocation
- * fails). While holding the mmap_lock in read mode the pmd becomes
- * stable and stops changing under us only if it's not null and not a
- * transhuge pmd. When those races occurs and this function makes a
- * difference vs the standard pmd_none_or_clear_bad, the result is
- * undefined so behaving like if the pmd was none is safe (because it
- * can return none anyway). The compiler level barrier() is critically
- * important to compute the two checks atomically on the same pmdval.
- *
- * For 32bit kernels with a 64bit large pmd_t this automatically takes
- * care of reading the pmd atomically to avoid SMP race conditions
- * against pmd_populate() when the mmap_lock is hold for reading by the
- * caller (a special atomic read not done by "gcc" as in the generic
- * version above, is also needed when THP is disabled because the page
- * fault can populate the pmd from under us).
- */
-static inline int pmd_none_or_trans_huge_or_clear_bad(pmd_t *pmd)
-{
- pmd_t pmdval = pmdp_get_lockless(pmd);
- /*
- * !pmd_present() checks for pmd migration entries
- *
- * The complete check uses is_pmd_migration_entry() in linux/swapops.h
- * But using that requires moving current function and pmd_trans_unstable()
- * to linux/swapops.h to resolve dependency, which is too much code move.
- *
- * !pmd_present() is equivalent to is_pmd_migration_entry() currently,
- * because !pmd_present() pages can only be under migration not swapped
- * out.
- *
- * pmd_none() is preserved for future condition checks on pmd migration
- * entries and not confusing with this function name, although it is
- * redundant with !pmd_present().
- */
- if (pmd_none(pmdval) || pmd_trans_huge(pmdval) ||
- (IS_ENABLED(CONFIG_ARCH_ENABLE_THP_MIGRATION) && !pmd_present(pmdval)))
- return 1;
- if (unlikely(pmd_bad(pmdval))) {
- pmd_clear_bad(pmd);
- return 1;
- }
return 0;
}

-/*
- * This is a noop if Transparent Hugepage Support is not built into
- * the kernel. Otherwise it is equivalent to
- * pmd_none_or_trans_huge_or_clear_bad(), and shall only be called in
- * places that already verified the pmd is not none and they want to
- * walk ptes while holding the mmap sem in read mode (write mode don't
- * need this). If THP is not enabled, the pmd can't go away under the
- * code even if MADV_DONTNEED runs, but if THP is enabled we need to
- * run a pmd_trans_unstable before walking the ptes after
- * split_huge_pmd returns (because it may have run when the pmd become
- * null, but then a page fault can map in a THP and not a regular page).
- */
-static inline int pmd_trans_unstable(pmd_t *pmd)
-{
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
- return pmd_none_or_trans_huge_or_clear_bad(pmd);
-#else
- return 0;
-#endif
-}
-
-/*
- * the ordering of these checks is important for pmds with _page_devmap set.
- * if we check pmd_trans_unstable() first we will trip the bad_pmd() check
- * inside of pmd_none_or_trans_huge_or_clear_bad(). this will end up correctly
- * returning 1 but not before it spams dmesg with the pmd_clear_bad() output.
- */
-static inline int pmd_devmap_trans_unstable(pmd_t *pmd)
-{
- return pmd_devmap(*pmd) || pmd_trans_unstable(pmd);
-}
-
#ifndef CONFIG_NUMA_BALANCING
/*
* Technically a PTE can be PROTNONE even when not doing NUMA balancing but
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index c11db2e78e95..1083f0e38a07 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -946,10 +946,6 @@ static int hugepage_vma_revalidate(struct mm_struct *mm, unsigned long address,
return SCAN_SUCCEED;
}

-/*
- * See pmd_trans_unstable() for how the result may change out from
- * underneath us, even if we hold mmap_lock in read.
- */
static int find_pmd_or_thp_or_none(struct mm_struct *mm,
unsigned long address,
pmd_t **pmd)
--
2.35.3


2023-05-22 05:56:47

by Hugh Dickins

[permalink] [raw]
Subject: [PATCH 29/31] mm/memory: handle_pte_fault() use pte_offset_map_nolock()

handle_pte_fault() use pte_offset_map_nolock() to get the vmf.ptl which
corresponds to vmf.pte, instead of pte_lockptr() being used later, when
there's a chance that the pmd entry might have changed, perhaps to none,
or to a huge pmd, with no split ptlock in its struct page.

Remove its pmd_devmap_trans_unstable() call: pte_offset_map_nolock()
will handle that case by failing. Update the "morph" comment above,
looking forward to when shmem or file collapse to THP may not take
mmap_lock for write (or not at all).
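
Schematically, the pte part of handle_pte_fault() becomes (condensed
from the mm/memory.c hunks below):

        vmf->pte = pte_offset_map_nolock(vmf->vma->vm_mm, vmf->pmd,
                                         vmf->address, &vmf->ptl);
        if (unlikely(!vmf->pte))
                return 0;       /* no page table: retry the fault later */
        vmf->orig_pte = ptep_get_lockless(vmf->pte);
        /* ... */
        spin_lock(vmf->ptl);    /* already the ptl matching vmf->pte */
        if (unlikely(!pte_same(*vmf->pte, vmf->orig_pte))) {
                pte_unmap_unlock(vmf->pte, vmf->ptl);
                return 0;       /* raced: let the fault be retried */
        }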

do_numa_page() use the vmf->ptl from handle_pte_fault() at first, but
refresh it when refreshing vmf->pte.

do_swap_page()'s pte_unmap_same() (the thing that takes ptl to verify a
two-part PAE orig_pte) use the vmf->ptl from handle_pte_fault() too; but
do_swap_page() is also used by anon THP's __collapse_huge_page_swapin(),
so adjust that to set vmf->ptl by pte_offset_map_nolock().

Signed-off-by: Hugh Dickins <[email protected]>
---
mm/khugepaged.c | 6 ++++--
mm/memory.c | 38 +++++++++++++-------------------------
2 files changed, 17 insertions(+), 27 deletions(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 49cfa7cdfe93..c11db2e78e95 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1005,6 +1005,7 @@ static int __collapse_huge_page_swapin(struct mm_struct *mm,
unsigned long address, end = haddr + (HPAGE_PMD_NR * PAGE_SIZE);
int result;
pte_t *pte = NULL;
+ spinlock_t *ptl;

for (address = haddr; address < end; address += PAGE_SIZE) {
struct vm_fault vmf = {
@@ -1016,7 +1017,7 @@ static int __collapse_huge_page_swapin(struct mm_struct *mm,
};

if (!pte++) {
- pte = pte_offset_map(pmd, address);
+ pte = pte_offset_map_nolock(mm, pmd, address, &ptl);
if (!pte) {
mmap_read_unlock(mm);
result = SCAN_PMD_NULL;
@@ -1024,11 +1025,12 @@ static int __collapse_huge_page_swapin(struct mm_struct *mm,
}
}

- vmf.orig_pte = *pte;
+ vmf.orig_pte = ptep_get_lockless(pte);
if (!is_swap_pte(vmf.orig_pte))
continue;

vmf.pte = pte;
+ vmf.ptl = ptl;
ret = do_swap_page(&vmf);
/* Which unmaps pte (after perhaps re-checking the entry) */
pte = NULL;
diff --git a/mm/memory.c b/mm/memory.c
index c7b920291a72..4ec46eecefd3 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2786,10 +2786,9 @@ static inline int pte_unmap_same(struct vm_fault *vmf)
int same = 1;
#if defined(CONFIG_SMP) || defined(CONFIG_PREEMPTION)
if (sizeof(pte_t) > sizeof(unsigned long)) {
- spinlock_t *ptl = pte_lockptr(vmf->vma->vm_mm, vmf->pmd);
- spin_lock(ptl);
+ spin_lock(vmf->ptl);
same = pte_same(*vmf->pte, vmf->orig_pte);
- spin_unlock(ptl);
+ spin_unlock(vmf->ptl);
}
#endif
pte_unmap(vmf->pte);
@@ -4696,7 +4695,6 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
* validation through pte_unmap_same(). It's of NUMA type but
* the pfn may be screwed if the read is non atomic.
*/
- vmf->ptl = pte_lockptr(vma->vm_mm, vmf->pmd);
spin_lock(vmf->ptl);
if (unlikely(!pte_same(*vmf->pte, vmf->orig_pte))) {
pte_unmap_unlock(vmf->pte, vmf->ptl);
@@ -4767,8 +4765,10 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
flags |= TNF_MIGRATED;
} else {
flags |= TNF_MIGRATE_FAIL;
- vmf->pte = pte_offset_map(vmf->pmd, vmf->address);
- spin_lock(vmf->ptl);
+ vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
+ vmf->address, &vmf->ptl);
+ if (unlikely(!vmf->pte))
+ goto out;
if (unlikely(!pte_same(*vmf->pte, vmf->orig_pte))) {
pte_unmap_unlock(vmf->pte, vmf->ptl);
goto out;
@@ -4897,27 +4897,16 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
vmf->pte = NULL;
vmf->flags &= ~FAULT_FLAG_ORIG_PTE_VALID;
} else {
- /*
- * If a huge pmd materialized under us just retry later. Use
- * pmd_trans_unstable() via pmd_devmap_trans_unstable() instead
- * of pmd_trans_huge() to ensure the pmd didn't become
- * pmd_trans_huge under us and then back to pmd_none, as a
- * result of MADV_DONTNEED running immediately after a huge pmd
- * fault in a different thread of this mm, in turn leading to a
- * misleading pmd_trans_huge() retval. All we have to ensure is
- * that it is a regular pmd that we can walk with
- * pte_offset_map() and we can do that through an atomic read
- * in C, which is what pmd_trans_unstable() provides.
- */
- if (pmd_devmap_trans_unstable(vmf->pmd))
- return 0;
/*
* A regular pmd is established and it can't morph into a huge
- * pmd from under us anymore at this point because we hold the
- * mmap_lock read mode and khugepaged takes it in write mode.
- * So now it's safe to run pte_offset_map().
+ * pmd by anon khugepaged, since that takes mmap_lock in write
+ * mode; but shmem or file collapse to THP could still morph
+ * it into a huge pmd: just retry later if so.
*/
- vmf->pte = pte_offset_map(vmf->pmd, vmf->address);
+ vmf->pte = pte_offset_map_nolock(vmf->vma->vm_mm, vmf->pmd,
+ vmf->address, &vmf->ptl);
+ if (unlikely(!vmf->pte))
+ return 0;
vmf->orig_pte = ptep_get_lockless(vmf->pte);
vmf->flags |= FAULT_FLAG_ORIG_PTE_VALID;

@@ -4936,7 +4925,6 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
if (pte_protnone(vmf->orig_pte) && vma_is_accessible(vmf->vma))
return do_numa_page(vmf);

- vmf->ptl = pte_lockptr(vmf->vma->vm_mm, vmf->pmd);
spin_lock(vmf->ptl);
entry = vmf->orig_pte;
if (unlikely(!pte_same(*vmf->pte, entry))) {
--
2.35.3


2023-05-22 11:34:38

by Qi Zheng

[permalink] [raw]
Subject: Re: [PATCH 04/31] mm/pgtable: allow pte_offset_map[_lock]() to fail

Hi Hugh,

On 2023/5/22 12:53, Hugh Dickins wrote:

[...]

>
> @@ -229,3 +231,57 @@ pmd_t pmdp_collapse_flush(struct vm_area_struct *vma, unsigned long address,
> }
> #endif
> #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
> +
> +pte_t *__pte_offset_map(pmd_t *pmd, unsigned long addr, pmd_t *pmdvalp)
> +{
> + pmd_t pmdval;
> +
> + /* rcu_read_lock() to be added later */
> + pmdval = pmdp_get_lockless(pmd);
> + if (pmdvalp)
> + *pmdvalp = pmdval;
> + if (unlikely(pmd_none(pmdval) || is_pmd_migration_entry(pmdval)))
> + goto nomap;
> + if (unlikely(pmd_trans_huge(pmdval) || pmd_devmap(pmdval)))
> + goto nomap;

Will the follow-up patch deal with the above situation specially?
Otherwise, maybe it can be changed to the following check method?

if (unlikely(pmd_none(pmdval) || pmd_leaf(pmdval)))
goto nomap;

> + if (unlikely(pmd_bad(pmdval))) {
> + pmd_clear_bad(pmd);
> + goto nomap;
> + }
> + return __pte_map(&pmdval, addr);
> +nomap:
> + /* rcu_read_unlock() to be added later */
> + return NULL;
> +}
> +
> +pte_t *pte_offset_map_nolock(struct mm_struct *mm, pmd_t *pmd,
> + unsigned long addr, spinlock_t **ptlp)
> +{
> + pmd_t pmdval;
> + pte_t *pte;
> +
> + pte = __pte_offset_map(pmd, addr, &pmdval);
> + if (likely(pte))
> + *ptlp = pte_lockptr(mm, &pmdval);
> + return pte;
> +}
> +
> +pte_t *__pte_offset_map_lock(struct mm_struct *mm, pmd_t *pmd,
> + unsigned long addr, spinlock_t **ptlp)
> +{
> + spinlock_t *ptl;
> + pmd_t pmdval;
> + pte_t *pte;
> +again:
> + pte = __pte_offset_map(pmd, addr, &pmdval);
> + if (unlikely(!pte))
> + return pte;
> + ptl = pte_lockptr(mm, &pmdval);
> + spin_lock(ptl);
> + if (likely(pmd_same(pmdval, pmdp_get_lockless(pmd)))) {
> + *ptlp = ptl;
> + return pte;
> + }
> + pte_unmap_unlock(pte, ptl);
> + goto again;
> +}

--
Thanks,
Qi

2023-05-22 11:51:30

by Qi Zheng

[permalink] [raw]
Subject: Re: [PATCH 08/31] mm/page_vma_mapped: pte_offset_map_nolock() not pte_lockptr()



On 2023/5/22 12:58, Hugh Dickins wrote:
> map_pte() use pte_offset_map_nolock(), to make sure of the ptl belonging
> to pte, even if pmd entry is then changed racily: page_vma_mapped_walk()
> use that instead of getting pte_lockptr() later, or restart if map_pte()
> found no page table.
>
> Signed-off-by: Hugh Dickins <[email protected]>
> ---
> mm/page_vma_mapped.c | 28 ++++++++++++++++++++++------
> 1 file changed, 22 insertions(+), 6 deletions(-)
>
> diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
> index 947dc7491815..2af734274073 100644
> --- a/mm/page_vma_mapped.c
> +++ b/mm/page_vma_mapped.c
> @@ -13,16 +13,28 @@ static inline bool not_found(struct page_vma_mapped_walk *pvmw)
> return false;
> }
>
> -static bool map_pte(struct page_vma_mapped_walk *pvmw)
> +static bool map_pte(struct page_vma_mapped_walk *pvmw, spinlock_t **ptlp)
> {
> if (pvmw->flags & PVMW_SYNC) {
> /* Use the stricter lookup */
> pvmw->pte = pte_offset_map_lock(pvmw->vma->vm_mm, pvmw->pmd,
> pvmw->address, &pvmw->ptl);
> - return true;
> + *ptlp = pvmw->ptl;
> + return !!pvmw->pte;
> }
>
> - pvmw->pte = pte_offset_map(pvmw->pmd, pvmw->address);
> + /*
> + * It is important to return the ptl corresponding to pte,
> + * in case *pvmw->pmd changes underneath us; so we need to
> + * return it even when choosing not to lock, in case caller
> + * proceeds to loop over next ptes, and finds a match later.
> + * Though, in most cases, page lock already protects this.
> + */
> + pvmw->pte = pte_offset_map_nolock(pvmw->vma->vm_mm, pvmw->pmd,
> + pvmw->address, ptlp);
> + if (!pvmw->pte)
> + return false;
> +
> if (pvmw->flags & PVMW_MIGRATION) {
> if (!is_swap_pte(*pvmw->pte))
> return false;
> @@ -51,7 +63,7 @@ static bool map_pte(struct page_vma_mapped_walk *pvmw)
> } else if (!pte_present(*pvmw->pte)) {
> return false;
> }
> - pvmw->ptl = pte_lockptr(pvmw->vma->vm_mm, pvmw->pmd);
> + pvmw->ptl = *ptlp;
> spin_lock(pvmw->ptl);
> return true;
> }
> @@ -156,6 +168,7 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
> struct vm_area_struct *vma = pvmw->vma;
> struct mm_struct *mm = vma->vm_mm;
> unsigned long end;
> + spinlock_t *ptl;
> pgd_t *pgd;
> p4d_t *p4d;
> pud_t *pud;
> @@ -257,8 +270,11 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
> step_forward(pvmw, PMD_SIZE);
> continue;
> }
> - if (!map_pte(pvmw))
> + if (!map_pte(pvmw, &ptl)) {
> + if (!pvmw->pte)
> + goto restart;

Could pvmw->pmd be changed? Otherwise, how about just jumping to the
retry label below?

@@ -205,6 +205,8 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
}

pvmw->pmd = pmd_offset(pud, pvmw->address);
+
+retry:
/*
* Make sure the pmd value isn't cached in a register by the
* compiler and used as a stale value after we've observed a

> goto next_pte;
> + }
> this_pte:
> if (check_pte(pvmw))
> return true;
> @@ -281,7 +297,7 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
> } while (pte_none(*pvmw->pte));
>
> if (!pvmw->ptl) {
> - pvmw->ptl = pte_lockptr(mm, pvmw->pmd);
> + pvmw->ptl = ptl;
> spin_lock(pvmw->ptl);
> }
> goto this_pte;

--
Thanks,
Qi

2023-05-22 12:56:57

by Qi Zheng

[permalink] [raw]
Subject: Re: [PATCH 17/31] mm/various: give up if pte_offset_map[_lock]() fails



On 2023/5/22 13:10, Hugh Dickins wrote:
> Following the examples of nearby code, various functions can just give
> up if pte_offset_map() or pte_offset_map_lock() fails. And there's no
> need for a preliminary pmd_trans_unstable() or other such check, since
> such cases are now safely handled inside.
>
> Signed-off-by: Hugh Dickins <[email protected]>
> ---
> mm/gup.c | 9 ++++++---
> mm/ksm.c | 7 ++++---
> mm/memcontrol.c | 8 ++++----
> mm/memory-failure.c | 8 +++++---
> mm/migrate.c | 3 +++
> mm/swap_state.c | 3 +++
> 6 files changed, 25 insertions(+), 13 deletions(-)
>

[...]

> diff --git a/mm/migrate.c b/mm/migrate.c
> index 3ecb7a40075f..308a56f0b156 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -305,6 +305,9 @@ void migration_entry_wait(struct mm_struct *mm, pmd_t *pmd,
> swp_entry_t entry;
>
> ptep = pte_offset_map_lock(mm, pmd, address, &ptl);
> + if (!ptep)
> + return;

Maybe we should return false and let the caller handle the failure.

> +
> pte = *ptep;
> pte_unmap(ptep);
>
> diff --git a/mm/swap_state.c b/mm/swap_state.c
> index b76a65ac28b3..db2ec85ef332 100644
> --- a/mm/swap_state.c
> +++ b/mm/swap_state.c
> @@ -734,6 +734,9 @@ static void swap_ra_info(struct vm_fault *vmf,
>
> /* Copy the PTEs because the page table may be unmapped */
> orig_pte = pte = pte_offset_map(vmf->pmd, faddr);
> + if (!pte)
> + return;

Ditto?

> +
> if (fpfn == pfn + 1) {
> lpfn = fpfn;
> rpfn = fpfn + win;

--
Thanks,
Qi

2023-05-22 13:03:55

by Qi Zheng

[permalink] [raw]
Subject: Re: [PATCH 17/31] mm/various: give up if pte_offset_map[_lock]() fails



On 2023/5/22 20:24, Qi Zheng wrote:
>
>
> On 2023/5/22 13:10, Hugh Dickins wrote:
>> Following the examples of nearby code, various functions can just give
>> up if pte_offset_map() or pte_offset_map_lock() fails.  And there's no
>> need for a preliminary pmd_trans_unstable() or other such check, since
>> such cases are now safely handled inside.
>>
>> Signed-off-by: Hugh Dickins <[email protected]>
>> ---
>>   mm/gup.c            | 9 ++++++---
>>   mm/ksm.c            | 7 ++++---
>>   mm/memcontrol.c     | 8 ++++----
>>   mm/memory-failure.c | 8 +++++---
>>   mm/migrate.c        | 3 +++
>>   mm/swap_state.c     | 3 +++
>>   6 files changed, 25 insertions(+), 13 deletions(-)
>>
>
> [...]
>
>> diff --git a/mm/migrate.c b/mm/migrate.c
>> index 3ecb7a40075f..308a56f0b156 100644
>> --- a/mm/migrate.c
>> +++ b/mm/migrate.c
>> @@ -305,6 +305,9 @@ void migration_entry_wait(struct mm_struct *mm, pmd_t *pmd,
>>       swp_entry_t entry;
>>       ptep = pte_offset_map_lock(mm, pmd, address, &ptl);
>> +    if (!ptep)
>> +        return;
>
> Maybe we should return false and let the caller handle the failure.
>
>> +
>>       pte = *ptep;
>>       pte_unmap(ptep);
>> diff --git a/mm/swap_state.c b/mm/swap_state.c
>> index b76a65ac28b3..db2ec85ef332 100644
>> --- a/mm/swap_state.c
>> +++ b/mm/swap_state.c
>> @@ -734,6 +734,9 @@ static void swap_ra_info(struct vm_fault *vmf,
>>       /* Copy the PTEs because the page table may be unmapped */
>>       orig_pte = pte = pte_offset_map(vmf->pmd, faddr);
>> +    if (!pte)
>> +        return;
>
> Ditto?

Oh, I see that you handle it in the PATCH[22/31].

>
>> +
>>       if (fpfn == pfn + 1) {
>>           lpfn = fpfn;
>>           rpfn = fpfn + win;
>

--
Thanks,
Qi

2023-05-22 13:05:35

by Qi Zheng

[permalink] [raw]
Subject: Re: [PATCH 29/31] mm/memory: handle_pte_fault() use pte_offset_map_nolock()



On 2023/5/22 13:26, Hugh Dickins wrote:
> handle_pte_fault() use pte_offset_map_nolock() to get the vmf.ptl which
> corresponds to vmf.pte, instead of pte_lockptr() being used later, when
> there's a chance that the pmd entry might have changed, perhaps to none,
> or to a huge pmd, with no split ptlock in its struct page.
>
> Remove its pmd_devmap_trans_unstable() call: pte_offset_map_nolock()
> will handle that case by failing. Update the "morph" comment above,
> looking forward to when shmem or file collapse to THP may not take
> mmap_lock for write (or not at all).
>
> do_numa_page() use the vmf->ptl from handle_pte_fault() at first, but
> refresh it when refreshing vmf->pte.
>
> do_swap_page()'s pte_unmap_same() (the thing that takes ptl to verify a
> two-part PAE orig_pte) use the vmf->ptl from handle_pte_fault() too; but
> do_swap_page() is also used by anon THP's __collapse_huge_page_swapin(),
> so adjust that to set vmf->ptl by pte_offset_map_nolock().
>
> Signed-off-by: Hugh Dickins <[email protected]>
> ---
> mm/khugepaged.c | 6 ++++--
> mm/memory.c | 38 +++++++++++++-------------------------
> 2 files changed, 17 insertions(+), 27 deletions(-)
>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 49cfa7cdfe93..c11db2e78e95 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -1005,6 +1005,7 @@ static int __collapse_huge_page_swapin(struct mm_struct *mm,
> unsigned long address, end = haddr + (HPAGE_PMD_NR * PAGE_SIZE);
> int result;
> pte_t *pte = NULL;
> + spinlock_t *ptl;
>
> for (address = haddr; address < end; address += PAGE_SIZE) {
> struct vm_fault vmf = {
> @@ -1016,7 +1017,7 @@ static int __collapse_huge_page_swapin(struct mm_struct *mm,
> };
>
> if (!pte++) {
> - pte = pte_offset_map(pmd, address);
> + pte = pte_offset_map_nolock(mm, pmd, address, &ptl);
> if (!pte) {
> mmap_read_unlock(mm);
> result = SCAN_PMD_NULL;
> @@ -1024,11 +1025,12 @@ static int __collapse_huge_page_swapin(struct mm_struct *mm,
> }
> }
>
> - vmf.orig_pte = *pte;
> + vmf.orig_pte = ptep_get_lockless(pte);
> if (!is_swap_pte(vmf.orig_pte))
> continue;
>
> vmf.pte = pte;
> + vmf.ptl = ptl;
> ret = do_swap_page(&vmf);
> /* Which unmaps pte (after perhaps re-checking the entry) */
> pte = NULL;
> diff --git a/mm/memory.c b/mm/memory.c
> index c7b920291a72..4ec46eecefd3 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -2786,10 +2786,9 @@ static inline int pte_unmap_same(struct vm_fault *vmf)
> int same = 1;
> #if defined(CONFIG_SMP) || defined(CONFIG_PREEMPTION)
> if (sizeof(pte_t) > sizeof(unsigned long)) {
> - spinlock_t *ptl = pte_lockptr(vmf->vma->vm_mm, vmf->pmd);
> - spin_lock(ptl);
> + spin_lock(vmf->ptl);
> same = pte_same(*vmf->pte, vmf->orig_pte);
> - spin_unlock(ptl);
> + spin_unlock(vmf->ptl);
> }
> #endif
> pte_unmap(vmf->pte);
> @@ -4696,7 +4695,6 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
> * validation through pte_unmap_same(). It's of NUMA type but
> * the pfn may be screwed if the read is non atomic.
> */
> - vmf->ptl = pte_lockptr(vma->vm_mm, vmf->pmd);
> spin_lock(vmf->ptl);
> if (unlikely(!pte_same(*vmf->pte, vmf->orig_pte))) {
> pte_unmap_unlock(vmf->pte, vmf->ptl);
> @@ -4767,8 +4765,10 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
> flags |= TNF_MIGRATED;
> } else {
> flags |= TNF_MIGRATE_FAIL;
> - vmf->pte = pte_offset_map(vmf->pmd, vmf->address);
> - spin_lock(vmf->ptl);
> + vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
> + vmf->address, &vmf->ptl);
> + if (unlikely(!vmf->pte))
> + goto out;
> if (unlikely(!pte_same(*vmf->pte, vmf->orig_pte))) {
> pte_unmap_unlock(vmf->pte, vmf->ptl);
> goto out;
> @@ -4897,27 +4897,16 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
> vmf->pte = NULL;
> vmf->flags &= ~FAULT_FLAG_ORIG_PTE_VALID;
> } else {
> - /*
> - * If a huge pmd materialized under us just retry later. Use
> - * pmd_trans_unstable() via pmd_devmap_trans_unstable() instead
> - * of pmd_trans_huge() to ensure the pmd didn't become
> - * pmd_trans_huge under us and then back to pmd_none, as a
> - * result of MADV_DONTNEED running immediately after a huge pmd
> - * fault in a different thread of this mm, in turn leading to a
> - * misleading pmd_trans_huge() retval. All we have to ensure is
> - * that it is a regular pmd that we can walk with
> - * pte_offset_map() and we can do that through an atomic read
> - * in C, which is what pmd_trans_unstable() provides.
> - */
> - if (pmd_devmap_trans_unstable(vmf->pmd))
> - return 0;
> /*
> * A regular pmd is established and it can't morph into a huge
> - * pmd from under us anymore at this point because we hold the
> - * mmap_lock read mode and khugepaged takes it in write mode.
> - * So now it's safe to run pte_offset_map().
> + * pmd by anon khugepaged, since that takes mmap_lock in write
> + * mode; but shmem or file collapse to THP could still morph
> + * it into a huge pmd: just retry later if so.
> */
> - vmf->pte = pte_offset_map(vmf->pmd, vmf->address);
> + vmf->pte = pte_offset_map_nolock(vmf->vma->vm_mm, vmf->pmd,
> + vmf->address, &vmf->ptl);
> + if (unlikely(!vmf->pte))
> + return 0;

Just jump to the retry label below?

diff --git a/mm/memory.c b/mm/memory.c
index 63632a5eafc1..2e712fe6f4be 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4897,7 +4897,8 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
{
pte_t entry;

- if (unlikely(pmd_none(*vmf->pmd))) {
+retry:
+ if (unlikely(pmd_none(READ_ONCE(*vmf->pmd)))) {
/*
* Leave __pte_alloc() until later: because vm_ops->fault may
* want to allocate huge page, and if we expose page table

> vmf->orig_pte = ptep_get_lockless(vmf->pte);
> vmf->flags |= FAULT_FLAG_ORIG_PTE_VALID;
>
> @@ -4936,7 +4925,6 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
> if (pte_protnone(vmf->orig_pte) && vma_is_accessible(vmf->vma))
> return do_numa_page(vmf);
>
> - vmf->ptl = pte_lockptr(vmf->vma->vm_mm, vmf->pmd);
> spin_lock(vmf->ptl);
> entry = vmf->orig_pte;
> if (unlikely(!pte_same(*vmf->pte, entry))) {

--
Thanks,
Qi

2023-05-23 00:52:27

by Yang Shi

[permalink] [raw]
Subject: Re: [PATCH 27/31] mm/khugepaged: allow pte_offset_map[_lock]() to fail

On Sun, May 21, 2023 at 10:24 PM Hugh Dickins <[email protected]> wrote:
>
> __collapse_huge_page_swapin(): don't drop the map after every pte, it
> only has to be dropped by do_swap_page(); give up if pte_offset_map()
> fails; trace_mm_collapse_huge_page_swapin() at the end, with result;
> fix comment on returned result; fix vmf.pgoff, though it's not used.
>
> collapse_huge_page(): use pte_offset_map_lock() on the _pmd returned
> from clearing; allow failure, but it should be impossible there.
> hpage_collapse_scan_pmd() and collapse_pte_mapped_thp() allow for
> pte_offset_map_lock() failure.
>
> Signed-off-by: Hugh Dickins <[email protected]>

Reviewed-by: Yang Shi <[email protected]>

A nit below:

> ---
> mm/khugepaged.c | 72 +++++++++++++++++++++++++++++++++----------------
> 1 file changed, 49 insertions(+), 23 deletions(-)
>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 732f9ac393fc..49cfa7cdfe93 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -993,9 +993,8 @@ static int check_pmd_still_valid(struct mm_struct *mm,
> * Only done if hpage_collapse_scan_pmd believes it is worthwhile.
> *
> * Called and returns without pte mapped or spinlocks held.
> - * Note that if false is returned, mmap_lock will be released.
> + * Returns result: if not SCAN_SUCCEED, mmap_lock has been released.
> */
> -
> static int __collapse_huge_page_swapin(struct mm_struct *mm,
> struct vm_area_struct *vma,
> unsigned long haddr, pmd_t *pmd,
> @@ -1004,23 +1003,35 @@ static int __collapse_huge_page_swapin(struct mm_struct *mm,
> int swapped_in = 0;
> vm_fault_t ret = 0;
> unsigned long address, end = haddr + (HPAGE_PMD_NR * PAGE_SIZE);
> + int result;
> + pte_t *pte = NULL;
>
> for (address = haddr; address < end; address += PAGE_SIZE) {
> struct vm_fault vmf = {
> .vma = vma,
> .address = address,
> - .pgoff = linear_page_index(vma, haddr),
> + .pgoff = linear_page_index(vma, address),
> .flags = FAULT_FLAG_ALLOW_RETRY,
> .pmd = pmd,
> };
>
> - vmf.pte = pte_offset_map(pmd, address);
> - vmf.orig_pte = *vmf.pte;
> - if (!is_swap_pte(vmf.orig_pte)) {
> - pte_unmap(vmf.pte);
> - continue;
> + if (!pte++) {
> + pte = pte_offset_map(pmd, address);
> + if (!pte) {
> + mmap_read_unlock(mm);
> + result = SCAN_PMD_NULL;
> + goto out;
> + }
> }
> +
> + vmf.orig_pte = *pte;
> + if (!is_swap_pte(vmf.orig_pte))
> + continue;
> +
> + vmf.pte = pte;
> ret = do_swap_page(&vmf);
> + /* Which unmaps pte (after perhaps re-checking the entry) */
> + pte = NULL;
>
> /*
> * do_swap_page returns VM_FAULT_RETRY with released mmap_lock.
> @@ -1029,24 +1040,29 @@ static int __collapse_huge_page_swapin(struct mm_struct *mm,
> * resulting in later failure.
> */
> if (ret & VM_FAULT_RETRY) {
> - trace_mm_collapse_huge_page_swapin(mm, swapped_in, referenced, 0);
> /* Likely, but not guaranteed, that page lock failed */
> - return SCAN_PAGE_LOCK;
> + result = SCAN_PAGE_LOCK;

With per-VMA lock, this may not be true anymore, at least not true
until per-VMA lock supports swap fault. It may be better to have a
more general failure code, for example, SCAN_FAIL. But anyway you
don't have to change it in your patch, I can send a follow-up patch
once this series is landed on mm-unstable.

> + goto out;
> }
> if (ret & VM_FAULT_ERROR) {
> mmap_read_unlock(mm);
> - trace_mm_collapse_huge_page_swapin(mm, swapped_in, referenced, 0);
> - return SCAN_FAIL;
> + result = SCAN_FAIL;
> + goto out;
> }
> swapped_in++;
> }
>
> + if (pte)
> + pte_unmap(pte);
> +
> /* Drain LRU add pagevec to remove extra pin on the swapped in pages */
> if (swapped_in)
> lru_add_drain();
>
> - trace_mm_collapse_huge_page_swapin(mm, swapped_in, referenced, 1);
> - return SCAN_SUCCEED;
> + result = SCAN_SUCCEED;
> +out:
> + trace_mm_collapse_huge_page_swapin(mm, swapped_in, referenced, result);
> + return result;
> }
>
> static int alloc_charge_hpage(struct page **hpage, struct mm_struct *mm,
> @@ -1146,9 +1162,6 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
> address + HPAGE_PMD_SIZE);
> mmu_notifier_invalidate_range_start(&range);
>
> - pte = pte_offset_map(pmd, address);
> - pte_ptl = pte_lockptr(mm, pmd);
> -
> pmd_ptl = pmd_lock(mm, pmd); /* probably unnecessary */
> /*
> * This removes any huge TLB entry from the CPU so we won't allow
> @@ -1163,13 +1176,18 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
> mmu_notifier_invalidate_range_end(&range);
> tlb_remove_table_sync_one();
>
> - spin_lock(pte_ptl);
> - result = __collapse_huge_page_isolate(vma, address, pte, cc,
> - &compound_pagelist);
> - spin_unlock(pte_ptl);
> + pte = pte_offset_map_lock(mm, &_pmd, address, &pte_ptl);
> + if (pte) {
> + result = __collapse_huge_page_isolate(vma, address, pte, cc,
> + &compound_pagelist);
> + spin_unlock(pte_ptl);
> + } else {
> + result = SCAN_PMD_NULL;
> + }
>
> if (unlikely(result != SCAN_SUCCEED)) {
> - pte_unmap(pte);
> + if (pte)
> + pte_unmap(pte);
> spin_lock(pmd_ptl);
> BUG_ON(!pmd_none(*pmd));
> /*
> @@ -1253,6 +1271,11 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm,
> memset(cc->node_load, 0, sizeof(cc->node_load));
> nodes_clear(cc->alloc_nmask);
> pte = pte_offset_map_lock(mm, pmd, address, &ptl);
> + if (!pte) {
> + result = SCAN_PMD_NULL;
> + goto out;
> + }
> +
> for (_address = address, _pte = pte; _pte < pte + HPAGE_PMD_NR;
> _pte++, _address += PAGE_SIZE) {
> pte_t pteval = *_pte;
> @@ -1622,8 +1645,10 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
> * lockless_pages_from_mm() and the hardware page walker can access page
> * tables while all the high-level locks are held in write mode.
> */
> - start_pte = pte_offset_map_lock(mm, pmd, haddr, &ptl);
> result = SCAN_FAIL;
> + start_pte = pte_offset_map_lock(mm, pmd, haddr, &ptl);
> + if (!start_pte)
> + goto drop_immap;
>
> /* step 1: check all mapped PTEs are to the right huge page */
> for (i = 0, addr = haddr, pte = start_pte;
> @@ -1697,6 +1722,7 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
>
> abort:
> pte_unmap_unlock(start_pte, ptl);
> +drop_immap:
> i_mmap_unlock_write(vma->vm_file->f_mapping);
> goto drop_hpage;
> }
> --
> 2.35.3
>

2023-05-23 01:14:39

by Yang Shi

[permalink] [raw]
Subject: Re: [PATCH 26/31] mm/huge_memory: split huge pmd under one pte_offset_map()

On Sun, May 21, 2023 at 10:23 PM Hugh Dickins <[email protected]> wrote:
>
> __split_huge_zero_page_pmd() use a single pte_offset_map() to sweep the
> extent: it's already under pmd_lock(), so this is no worse for latency;
> and since it's supposed to have full control of the just-withdrawn page
> table, here choose to VM_BUG_ON if it were to fail. And please don't
> increment haddr by PAGE_SIZE, that should remain huge aligned: declare
> a separate addr (not a bugfix, but it was deceptive).
>
> __split_huge_pmd_locked() likewise (but it had declared a separate addr);
> and change its BUG_ON(!pte_none) to VM_BUG_ON, for consistency with zero
> (those deposited page tables are sometimes victims of random corruption).
>
> Signed-off-by: Hugh Dickins <[email protected]>

Reviewed-by: Yang Shi <[email protected]>

> ---
> mm/huge_memory.c | 28 ++++++++++++++++++----------
> 1 file changed, 18 insertions(+), 10 deletions(-)
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index d4bd5fa7c823..839c13fa0bbe 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -2037,6 +2037,8 @@ static void __split_huge_zero_page_pmd(struct vm_area_struct *vma,
> struct mm_struct *mm = vma->vm_mm;
> pgtable_t pgtable;
> pmd_t _pmd, old_pmd;
> + unsigned long addr;
> + pte_t *pte;
> int i;
>
> /*
> @@ -2052,17 +2054,20 @@ static void __split_huge_zero_page_pmd(struct vm_area_struct *vma,
> pgtable = pgtable_trans_huge_withdraw(mm, pmd);
> pmd_populate(mm, &_pmd, pgtable);
>
> - for (i = 0; i < HPAGE_PMD_NR; i++, haddr += PAGE_SIZE) {
> - pte_t *pte, entry;
> - entry = pfn_pte(my_zero_pfn(haddr), vma->vm_page_prot);
> + pte = pte_offset_map(&_pmd, haddr);
> + VM_BUG_ON(!pte);
> + for (i = 0, addr = haddr; i < HPAGE_PMD_NR; i++, addr += PAGE_SIZE) {
> + pte_t entry;
> +
> + entry = pfn_pte(my_zero_pfn(addr), vma->vm_page_prot);
> entry = pte_mkspecial(entry);
> if (pmd_uffd_wp(old_pmd))
> entry = pte_mkuffd_wp(entry);
> - pte = pte_offset_map(&_pmd, haddr);
> VM_BUG_ON(!pte_none(*pte));
> - set_pte_at(mm, haddr, pte, entry);
> - pte_unmap(pte);
> + set_pte_at(mm, addr, pte, entry);
> + pte++;
> }
> + pte_unmap(pte - 1);
> smp_wmb(); /* make pte visible before pmd */
> pmd_populate(mm, pmd, pgtable);
> }
> @@ -2077,6 +2082,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
> bool young, write, soft_dirty, pmd_migration = false, uffd_wp = false;
> bool anon_exclusive = false, dirty = false;
> unsigned long addr;
> + pte_t *pte;
> int i;
>
> VM_BUG_ON(haddr & ~HPAGE_PMD_MASK);
> @@ -2205,8 +2211,10 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
> pgtable = pgtable_trans_huge_withdraw(mm, pmd);
> pmd_populate(mm, &_pmd, pgtable);
>
> + pte = pte_offset_map(&_pmd, haddr);
> + VM_BUG_ON(!pte);
> for (i = 0, addr = haddr; i < HPAGE_PMD_NR; i++, addr += PAGE_SIZE) {
> - pte_t entry, *pte;
> + pte_t entry;
> /*
> * Note that NUMA hinting access restrictions are not
> * transferred to avoid any possibility of altering
> @@ -2249,11 +2257,11 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
> entry = pte_mkuffd_wp(entry);
> page_add_anon_rmap(page + i, vma, addr, false);
> }
> - pte = pte_offset_map(&_pmd, addr);
> - BUG_ON(!pte_none(*pte));
> + VM_BUG_ON(!pte_none(*pte));
> set_pte_at(mm, addr, pte, entry);
> - pte_unmap(pte);
> + pte++;
> }
> + pte_unmap(pte - 1);
>
> if (!pmd_migration)
> page_remove_rmap(page, vma, true);
> --
> 2.35.3
>

2023-05-23 02:32:11

by Alistair Popple

[permalink] [raw]
Subject: Re: [PATCH 24/31] mm/migrate_device: allow pte_offset_map_lock() to fail


Hugh Dickins <[email protected]> writes:

> migrate_vma_collect_pmd(): remove the pmd_trans_unstable() handling after
> splitting huge zero pmd, and the pmd_none() handling after successfully
> splitting huge page: those are now managed inside pte_offset_map_lock(),
> and by "goto again" when it fails.
>
> But the skip after unsuccessful split_huge_page() must stay: it avoids an
> endless loop. The skip when pmd_bad()? Remove that: it will be treated
> as a hole rather than a skip once cleared by pte_offset_map_lock(), but
> with different timing that would be so anyway; and it's arguably best to
> leave the pmd_bad() handling centralized there.

So for a pmd_bad() the sequence would be:

1. pte_offset_map_lock() would return NULL and clear the PMD.
2. goto again marks the page as a migrating hole,
3. In migrate_vma_insert_page() a new PMD is created by pmd_alloc().
4. This leads to a new zero page getting mapped for the previously
pmd_bad() mapping.

I'm not entirely sure what the pmd_bad() case is used for but is that
ok? I understand that previously it was all a matter of timing, but I
wouldn't rely on the previous code being correct in this regard either.

> migrate_vma_insert_page(): remove comment on the old pte_offset_map()
> and old locking limitations; remove the pmd_trans_unstable() check and
> just proceed to pte_offset_map_lock(), aborting when it fails (page has
> now been charged to memcg, but that's so in other cases, and presumably
> uncharged later).

Correct, the non-migrating page will be freed later via put_page() which
will uncharge the page.

> Signed-off-by: Hugh Dickins <[email protected]>
> ---
> mm/migrate_device.c | 31 ++++---------------------------
> 1 file changed, 4 insertions(+), 27 deletions(-)
>
> diff --git a/mm/migrate_device.c b/mm/migrate_device.c
> index d30c9de60b0d..a14af6b12b04 100644
> --- a/mm/migrate_device.c
> +++ b/mm/migrate_device.c
> @@ -83,9 +83,6 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
> if (is_huge_zero_page(page)) {
> spin_unlock(ptl);
> split_huge_pmd(vma, pmdp, addr);
> - if (pmd_trans_unstable(pmdp))
> - return migrate_vma_collect_skip(start, end,
> - walk);
> } else {
> int ret;
>
> @@ -100,16 +97,12 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
> if (ret)
> return migrate_vma_collect_skip(start, end,
> walk);
> - if (pmd_none(*pmdp))
> - return migrate_vma_collect_hole(start, end, -1,
> - walk);
> }
> }
>
> - if (unlikely(pmd_bad(*pmdp)))
> - return migrate_vma_collect_skip(start, end, walk);
> -
> ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);
> + if (!ptep)
> + goto again;
> arch_enter_lazy_mmu_mode();
>
> for (; addr < end; addr += PAGE_SIZE, ptep++) {
> @@ -595,27 +588,10 @@ static void migrate_vma_insert_page(struct migrate_vma *migrate,
> pmdp = pmd_alloc(mm, pudp, addr);
> if (!pmdp)
> goto abort;
> -
> if (pmd_trans_huge(*pmdp) || pmd_devmap(*pmdp))
> goto abort;
> -
> - /*
> - * Use pte_alloc() instead of pte_alloc_map(). We can't run
> - * pte_offset_map() on pmds where a huge pmd might be created
> - * from a different thread.
> - *
> - * pte_alloc_map() is safe to use under mmap_write_lock(mm) or when
> - * parallel threads are excluded by other means.
> - *
> - * Here we only have mmap_read_lock(mm).
> - */
> if (pte_alloc(mm, pmdp))
> goto abort;
> -
> - /* See the comment in pte_alloc_one_map() */
> - if (unlikely(pmd_trans_unstable(pmdp)))
> - goto abort;
> -
> if (unlikely(anon_vma_prepare(vma)))
> goto abort;
> if (mem_cgroup_charge(page_folio(page), vma->vm_mm, GFP_KERNEL))
> @@ -650,7 +626,8 @@ static void migrate_vma_insert_page(struct migrate_vma *migrate,
> }
>
> ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);
> -
> + if (!ptep)
> + goto abort;
> if (check_stable_address_space(mm))
> goto unlock_abort;


2023-05-23 02:32:11

by Yang Shi

[permalink] [raw]
Subject: Re: [PATCH 25/31] mm/gup: remove FOLL_SPLIT_PMD use of pmd_trans_unstable()

On Sun, May 21, 2023 at 10:22 PM Hugh Dickins <[email protected]> wrote:
>
> There is now no reason for follow_pmd_mask()'s FOLL_SPLIT_PMD block to
> distinguish huge_zero_page from a normal THP: follow_page_pte() handles
> any instability, and here it's a good idea to replace any pmd_none(*pmd)
> by a page table a.s.a.p, in the huge_zero_page case as for a normal THP.
> (Hmm, couldn't the normal THP case have hit an unstably refaulted THP
> before? But there are only two, exceptional, users of FOLL_SPLIT_PMD.)
>
> Signed-off-by: Hugh Dickins <[email protected]>
> ---
> mm/gup.c | 19 ++++---------------
> 1 file changed, 4 insertions(+), 15 deletions(-)
>
> diff --git a/mm/gup.c b/mm/gup.c
> index bb67193c5460..4ad50a59897f 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -681,21 +681,10 @@ static struct page *follow_pmd_mask(struct vm_area_struct *vma,
> return follow_page_pte(vma, address, pmd, flags, &ctx->pgmap);
> }
> if (flags & FOLL_SPLIT_PMD) {
> - int ret;
> - page = pmd_page(*pmd);
> - if (is_huge_zero_page(page)) {
> - spin_unlock(ptl);
> - ret = 0;
> - split_huge_pmd(vma, pmd, address);
> - if (pmd_trans_unstable(pmd))
> - ret = -EBUSY;

IIUC the pmd_trans_unstable() check was transferred to the implicit
pmd_none() in pte_alloc(). But it will return -ENOMEM instead of
-EBUSY. Won't it break some userspace? Or the pmd_trans_unstable() is
never true? If so it seems worth mentioning in the commit log about
this return value change.

> - } else {
> - spin_unlock(ptl);
> - split_huge_pmd(vma, pmd, address);
> - ret = pte_alloc(mm, pmd) ? -ENOMEM : 0;
> - }
> -
> - return ret ? ERR_PTR(ret) :
> + spin_unlock(ptl);
> + split_huge_pmd(vma, pmd, address);
> + /* If pmd was left empty, stuff a page table in there quickly */
> + return pte_alloc(mm, pmd) ? ERR_PTR(-ENOMEM) :
> follow_page_pte(vma, address, pmd, flags, &ctx->pgmap);
> }
> page = follow_trans_huge_pmd(vma, address, pmd, flags);
> --
> 2.35.3
>

2023-05-23 03:00:52

by Yang Shi

[permalink] [raw]
Subject: Re: [PATCH 25/31] mm/gup: remove FOLL_SPLIT_PMD use of pmd_trans_unstable()

On Mon, May 22, 2023 at 7:26 PM Yang Shi <[email protected]> wrote:
>
> On Sun, May 21, 2023 at 10:22 PM Hugh Dickins <[email protected]> wrote:
> >
> > There is now no reason for follow_pmd_mask()'s FOLL_SPLIT_PMD block to
> > distinguish huge_zero_page from a normal THP: follow_page_pte() handles
> > any instability, and here it's a good idea to replace any pmd_none(*pmd)
> > by a page table a.s.a.p, in the huge_zero_page case as for a normal THP.
> > (Hmm, couldn't the normal THP case have hit an unstably refaulted THP
> > before? But there are only two, exceptional, users of FOLL_SPLIT_PMD.)
> >
> > Signed-off-by: Hugh Dickins <[email protected]>
> > ---
> > mm/gup.c | 19 ++++---------------
> > 1 file changed, 4 insertions(+), 15 deletions(-)
> >
> > diff --git a/mm/gup.c b/mm/gup.c
> > index bb67193c5460..4ad50a59897f 100644
> > --- a/mm/gup.c
> > +++ b/mm/gup.c
> > @@ -681,21 +681,10 @@ static struct page *follow_pmd_mask(struct vm_area_struct *vma,
> > return follow_page_pte(vma, address, pmd, flags, &ctx->pgmap);
> > }
> > if (flags & FOLL_SPLIT_PMD) {
> > - int ret;
> > - page = pmd_page(*pmd);
> > - if (is_huge_zero_page(page)) {
> > - spin_unlock(ptl);
> > - ret = 0;
> > - split_huge_pmd(vma, pmd, address);
> > - if (pmd_trans_unstable(pmd))
> > - ret = -EBUSY;
>
> IIUC the pmd_trans_unstable() check was transferred to the implicit
> pmd_none() in pte_alloc(). But it will return -ENOMEM instead of
> -EBUSY. Won't it break some userspace? Or the pmd_trans_unstable() is
> never true? If so it seems worth mentioning in the commit log about
> this return value change.

Oops, the above comment is not accurate. It will call
follow_page_pte() instead of returning -EBUSY if pmd is none. For
other unstable cases, it will return -ENOMEM instead of -EBUSY.

>
> > - } else {
> > - spin_unlock(ptl);
> > - split_huge_pmd(vma, pmd, address);
> > - ret = pte_alloc(mm, pmd) ? -ENOMEM : 0;
> > - }
> > -
> > - return ret ? ERR_PTR(ret) :
> > + spin_unlock(ptl);
> > + split_huge_pmd(vma, pmd, address);
> > + /* If pmd was left empty, stuff a page table in there quickly */
> > + return pte_alloc(mm, pmd) ? ERR_PTR(-ENOMEM) :
> > follow_page_pte(vma, address, pmd, flags, &ctx->pgmap);
> > }
> > page = follow_trans_huge_pmd(vma, address, pmd, flags);
> > --
> > 2.35.3
> >

2023-05-23 18:21:26

by SeongJae Park

[permalink] [raw]
Subject: Re: [PATCH 09/31] mm/pagewalkers: ACTION_AGAIN if pte_offset_map_lock() fails

Hello Hugh,

On Sun, 21 May 2023 22:00:15 -0700 (PDT) Hugh Dickins <[email protected]> wrote:

> Simple walk_page_range() users should set ACTION_AGAIN to retry when
> pte_offset_map_lock() fails.
>
> No need to check pmd_trans_unstable(): that was precisely to avoid the
> possibility of calling pte_offset_map() on a racily removed or inserted
> THP entry, but such cases are now safely handled inside it. Likewise
> there is no need to check pmd_none() or pmd_bad() before calling it.
>
> Signed-off-by: Hugh Dickins <[email protected]>

For below mm/damon part,

Reviewed-by: SeongJae Park <[email protected]>


Thanks,
SJ

> ---
> fs/proc/task_mmu.c | 32 ++++++++++++++++----------------
> mm/damon/vaddr.c | 12 ++++++++----
> mm/mempolicy.c | 7 ++++---
> mm/mincore.c | 9 ++++-----
> mm/mlock.c | 4 ++++
> 5 files changed, 36 insertions(+), 28 deletions(-)
>
[...]
> diff --git a/mm/damon/vaddr.c b/mm/damon/vaddr.c
> index 1fec16d7263e..b8762ff15c3c 100644
> --- a/mm/damon/vaddr.c
> +++ b/mm/damon/vaddr.c
> @@ -318,9 +318,11 @@ static int damon_mkold_pmd_entry(pmd_t *pmd, unsigned long addr,
> spin_unlock(ptl);
> }
>
> - if (pmd_none(*pmd) || unlikely(pmd_bad(*pmd)))
> - return 0;
> pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl);
> + if (!pte) {
> + walk->action = ACTION_AGAIN;
> + return 0;
> + }
> if (!pte_present(*pte))
> goto out;
> damon_ptep_mkold(pte, walk->mm, addr);
> @@ -464,9 +466,11 @@ static int damon_young_pmd_entry(pmd_t *pmd, unsigned long addr,
> regular_page:
> #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
>
> - if (pmd_none(*pmd) || unlikely(pmd_bad(*pmd)))
> - return -EINVAL;
> pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl);
> + if (!pte) {
> + walk->action = ACTION_AGAIN;
> + return 0;
> + }
> if (!pte_present(*pte))
> goto out;
> folio = damon_get_folio(pte_pfn(*pte));
[...]

2023-05-24 02:45:51

by Hugh Dickins

[permalink] [raw]
Subject: Re: [PATCH 04/31] mm/pgtable: allow pte_offset_map[_lock]() to fail

On Mon, 22 May 2023, Qi Zheng wrote:
> On 2023/5/22 12:53, Hugh Dickins wrote:
>
> [...]
>
> > @@ -229,3 +231,57 @@ pmd_t pmdp_collapse_flush(struct vm_area_struct *vma, unsigned long address,
> > }
> > #endif
> > #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
> > +
> > +pte_t *__pte_offset_map(pmd_t *pmd, unsigned long addr, pmd_t *pmdvalp)
> > +{
> > + pmd_t pmdval;
> > +
> > + /* rcu_read_lock() to be added later */
> > + pmdval = pmdp_get_lockless(pmd);
> > + if (pmdvalp)
> > + *pmdvalp = pmdval;
> > + if (unlikely(pmd_none(pmdval) || is_pmd_migration_entry(pmdval)))
> > + goto nomap;
> > + if (unlikely(pmd_trans_huge(pmdval) || pmd_devmap(pmdval)))
> > + goto nomap;
>
> Will the follow-up patch deal with the above situation specially?

No, the follow-up patch will only insert the rcu_read_lock() and unlock();
and do something (something!) about the PAE mismatched halves case.

> Otherwise, maybe it can be changed to the following check method?
>
> if (unlikely(pmd_none(pmdval) || pmd_leaf(pmdval)))
> goto nomap;

Maybe, but I'm not keen. Partly just because pmd_leaf() is quite a
(good) new invention (I came across a few instances in updating to
the current tree), whereas here I'm just following the old examples,
from zap_pmd_range() etc. I'd have to spend a while getting to know
pmd_leaf(), and its interaction with strange gotchas like pmd_present().

And partly because I do want as many corrupt cases as possible to
reach the pmd_bad() check below, so generating a warning (and clear).
I might be wrong, I haven't checked through the architectures and how
pmd_leaf() is implemented in each, but my guess is that pmd_leaf()
will tend to miss the pmd_bad() check.

But if you can demonstrate a performance improvement from using
pmd_leaf() there, I expect many people would prefer that improvement
to badness catching: send a patch later to make that change if it's
justified.

Thanks a lot for all your attention to these.

Hugh

>
> > + if (unlikely(pmd_bad(pmdval))) {
> > + pmd_clear_bad(pmd);
> > + goto nomap;
> > + }
> > + return __pte_map(&pmdval, addr);
> > +nomap:
> > + /* rcu_read_unlock() to be added later */
> > + return NULL;
> > +}

2023-05-24 02:56:03

by Hugh Dickins

[permalink] [raw]
Subject: Re: [PATCH 08/31] mm/page_vma_mapped: pte_offset_map_nolock() not pte_lockptr()

On Mon, 22 May 2023, Qi Zheng wrote:
> On 2023/5/22 12:58, Hugh Dickins wrote:
> > map_pte() use pte_offset_map_nolock(), to make sure of the ptl belonging
> > to pte, even if pmd entry is then changed racily: page_vma_mapped_walk()
> > use that instead of getting pte_lockptr() later, or restart if map_pte()
> > found no page table.
> >
> > Signed-off-by: Hugh Dickins <[email protected]>
> > ---
> > mm/page_vma_mapped.c | 28 ++++++++++++++++++++++------
> > 1 file changed, 22 insertions(+), 6 deletions(-)
> >
> > diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
> > index 947dc7491815..2af734274073 100644
> > --- a/mm/page_vma_mapped.c
> > +++ b/mm/page_vma_mapped.c
> > @@ -156,6 +168,7 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
> > struct vm_area_struct *vma = pvmw->vma;
> > struct mm_struct *mm = vma->vm_mm;
> > unsigned long end;
> > + spinlock_t *ptl;
> > pgd_t *pgd;
> > p4d_t *p4d;
> > pud_t *pud;
> > @@ -257,8 +270,11 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
> > step_forward(pvmw, PMD_SIZE);
> > continue;
> > }
> > - if (!map_pte(pvmw))
> > + if (!map_pte(pvmw, &ptl)) {
> > + if (!pvmw->pte)
> > + goto restart;
>
> Could pvmw->pmd be changed? Otherwise, how about just jumping to the
> retry label below?
>
> @@ -205,6 +205,8 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
> }
>
> pvmw->pmd = pmd_offset(pud, pvmw->address);
> +
> +retry:
> /*
> * Make sure the pmd value isn't cached in a register by the
> * compiler and used as a stale value after we've observed a

You're right, that could be done, and that's where I'd have inserted
the label if there were none already. I just thought the fewer goto
labels the better, so reused the restart already there. If you feel
strongly that it's actively misleading, I can certainly make that
change; but it's too rare an occurrence to be worth optimizing for.

Hugh

2023-05-24 03:46:55

by Qi Zheng

[permalink] [raw]
Subject: Re: [PATCH 04/31] mm/pgtable: allow pte_offset_map[_lock]() to fail



On 2023/5/24 10:22, Hugh Dickins wrote:
> On Mon, 22 May 2023, Qi Zheng wrote:
>> On 2023/5/22 12:53, Hugh Dickins wrote:
>>
>> [...]
>>
>>> @@ -229,3 +231,57 @@ pmd_t pmdp_collapse_flush(struct vm_area_struct *vma, unsigned long address,
>>> }
>>> #endif
>>> #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
>>> +
>>> +pte_t *__pte_offset_map(pmd_t *pmd, unsigned long addr, pmd_t *pmdvalp)
>>> +{
>>> + pmd_t pmdval;
>>> +
>>> + /* rcu_read_lock() to be added later */
>>> + pmdval = pmdp_get_lockless(pmd);
>>> + if (pmdvalp)
>>> + *pmdvalp = pmdval;
>>> + if (unlikely(pmd_none(pmdval) || is_pmd_migration_entry(pmdval)))
>>> + goto nomap;
>>> + if (unlikely(pmd_trans_huge(pmdval) || pmd_devmap(pmdval)))
>>> + goto nomap;
>>
>> Will the follow-up patch deal with the above situation specially?
>
> No, the follow-up patch will only insert the rcu_read_lock() and unlock();
> and do something (something!) about the PAE mismatched halves case.
>
>> Otherwise, maybe it can be changed to the following check method?
>>
>> if (unlikely(pmd_none(pmdval) || pmd_leaf(pmdval)))
>> goto nomap;
>
> Maybe, but I'm not keen. Partly just because pmd_leaf() is quite a
> (good) new invention (I came across a few instances in updating to
> the current tree), whereas here I'm just following the old examples,
> from zap_pmd_range() etc. I'd have to spend a while getting to know
> pmd_leaf(), and its interaction with strange gotchas like pmd_present().
>
> And partly because I do want as many corrupt cases as possible to
> reach the pmd_bad() check below, so generating a warning (and clear).
> I might be wrong, I haven't checked through the architectures and how
> pmd_leaf() is implemented in each, but my guess is that pmd_leaf()
> will tend to miss the pmd_bad() check.

IIUC, pmd_leaf() is just for checking a leaf-mapped PMD, and will
not cover the pmd_bad() case. See the examples in vmalloc_to_page()
and apply_to_pmd_range().

>
> But if you can demonstrate a performance improvement from using
> pmd_leaf() there, I expect many people would prefer that improvement
> to badness catching: send a patch later to make that change if it's
> justified.

Probably not a lot of performance gain, just makes the check more
concise.

Thanks,
Qi

>
> Thanks a lot for all your attention to these.
>
> Hugh
>
>>
>>> + if (unlikely(pmd_bad(pmdval))) {
>>> + pmd_clear_bad(pmd);
>>> + goto nomap;
>>> + }
>>> + return __pte_map(&pmdval, addr);
>>> +nomap:
>>> + /* rcu_read_unlock() to be added later */
>>> + return NULL;
>>> +}

--
Thanks,
Qi

2023-05-24 04:01:21

by Hugh Dickins

[permalink] [raw]
Subject: Re: [PATCH 17/31] mm/various: give up if pte_offset_map[_lock]() fails

On Mon, 22 May 2023, Qi Zheng wrote:
> On 2023/5/22 20:24, Qi Zheng wrote:
> > On 2023/5/22 13:10, Hugh Dickins wrote:
> >> Following the examples of nearby code, various functions can just give
> >> up if pte_offset_map() or pte_offset_map_lock() fails.  And there's no
> >> need for a preliminary pmd_trans_unstable() or other such check, since
> >> such cases are now safely handled inside.
> >>
> >> Signed-off-by: Hugh Dickins <[email protected]>
> >> ---
> >>   mm/gup.c            | 9 ++++++---
> >>   mm/ksm.c            | 7 ++++---
> >>   mm/memcontrol.c     | 8 ++++----
> >>   mm/memory-failure.c | 8 +++++---
> >>   mm/migrate.c        | 3 +++
> >>   mm/swap_state.c     | 3 +++
> >>   6 files changed, 25 insertions(+), 13 deletions(-)
> >>
> >
> > [...]
> >
> >> diff --git a/mm/migrate.c b/mm/migrate.c
> >> index 3ecb7a40075f..308a56f0b156 100644
> >> --- a/mm/migrate.c
> >> +++ b/mm/migrate.c
> >> @@ -305,6 +305,9 @@ void migration_entry_wait(struct mm_struct *mm, pmd_t *pmd,
> >>       swp_entry_t entry;
> >>       ptep = pte_offset_map_lock(mm, pmd, address, &ptl);
> >> +    if (!ptep)
> >> +        return;
> >
> > Maybe we should return false and let the caller handle the failure.

We have not needed to do that before, it's normal for migration_entry_wait()
not to wait sometimes: it just goes back out to userspace to try again (by
which time the situation is usually resolved). I don't think we want to
trouble the callers with a new case to handle in some other way.
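
For reference, the caller side is roughly this fragment of do_swap_page()
(a simplified sketch, not the exact source):

        swp_entry_t entry = pte_to_swp_entry(vmf->orig_pte);

        if (unlikely(non_swap_entry(entry)) && is_migration_entry(entry)) {
                /*
                 * This may now return without waiting (e.g. if the page
                 * table has gone): the fault just completes with 0 and
                 * the access is retried, by which time the migration has
                 * usually finished.
                 */
                migration_entry_wait(vma->vm_mm, vmf->pmd, vmf->address);
                return 0;
        }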

> >
> >> +
> >>       pte = *ptep;
> >>       pte_unmap(ptep);
> >> diff --git a/mm/swap_state.c b/mm/swap_state.c
> >> index b76a65ac28b3..db2ec85ef332 100644
> >> --- a/mm/swap_state.c
> >> +++ b/mm/swap_state.c
> >> @@ -734,6 +734,9 @@ static void swap_ra_info(struct vm_fault *vmf,
> >>       /* Copy the PTEs because the page table may be unmapped */
> >>       orig_pte = pte = pte_offset_map(vmf->pmd, faddr);
> >> +    if (!pte)
> >> +        return;
> >
> > Ditto?
>
> Oh, I see that you handle it in the PATCH[22/31].

I don't think 22/31 (about swapoff "unuse") relates to this one.
Here swap_vma_readahead() is doing an interesting calculation for
how big the readaround window should be, and my thinking was, who cares?
just read 1, in the rare case that the page table vanishes underneath us.

But thank you for making me look again: it looks like I was not careful
enough before, ra_info->win is definitely *not* 1 on this line, and I
wonder if something bad might result from not following through on the
ensuing calculations - see how !CONFIG_64BIT is copying ptes (and that
implies CONFIG_64BIT is accessing the page table after pte_unmap()).

This swap_ra_info() code looks like it will need a patch all its own:
I must come back to it.

Thanks!
Hugh

2023-05-24 04:01:43

by Hugh Dickins

[permalink] [raw]
Subject: Re: [PATCH 24/31] mm/migrate_device: allow pte_offset_map_lock() to fail

On Tue, 23 May 2023, Alistair Popple wrote:
> Hugh Dickins <[email protected]> writes:
>
> > migrate_vma_collect_pmd(): remove the pmd_trans_unstable() handling after
> > splitting huge zero pmd, and the pmd_none() handling after successfully
> > splitting huge page: those are now managed inside pte_offset_map_lock(),
> > and by "goto again" when it fails.
> >
> > But the skip after unsuccessful split_huge_page() must stay: it avoids an
> > endless loop. The skip when pmd_bad()? Remove that: it will be treated
> > as a hole rather than a skip once cleared by pte_offset_map_lock(), but
> > with different timing that would be so anyway; and it's arguably best to
> > leave the pmd_bad() handling centralized there.
>
> So for a pmd_bad() the sequence would be:
>
> 1. pte_offset_map_lock() would return NULL and clear the PMD.
> 2. goto again marks the page as a migrating hole,
> 3. In migrate_vma_insert_page() a new PMD is created by pmd_alloc().
> 4. This leads to a new zero page getting mapped for the previously
> pmd_bad() mapping.

Agreed.

>
> I'm not entirely sure what the pmd_bad() case is used for but is that
> ok? I understand that previously it was all a matter of timing, but I
> wouldn't rely on the previous code being correct in this regard either.

The pmd_bad() case is for when the pmd table got corrupted (overwritten,
cosmic rays, whatever), and that pmd entry is easily recognized as
nonsense: we try not to crash on it, but user data may have got lost.

My "timing" remark may not be accurate: I seem to be living in the past,
when we had a lot more "pmd_none_or_clear_bad()"s around than today - I
was thinking that any one of them could be racily changing the bad to none.
Though I suppose I am now making my timing remark accurate, by changing
the bad to none more often again.

Since data is liable to be lost anyway (unless the corrupted entry was
actually none before it got corrupted), it doesn't matter greatly what
we do with it (some would definitely prefer a crash, but traditionally
we don't): issue a "pmd bad" message and not get stuck in a loop is
the main thing.
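
For the record, "clearing" a bad pmd amounts to no more than this (a
sketch of the generic version in mm/pgtable-generic.c, from memory):

        void pmd_clear_bad(pmd_t *pmd)
        {
                pmd_ERROR(*pmd);        /* the "bad pmd" message */
                pmd_clear(pmd);         /* now none: treated as a hole */
        }

so once __pte_offset_map() has done that and returned NULL, there is
nothing left for the caller to loop on.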

>
> > migrate_vma_insert_page(): remove comment on the old pte_offset_map()
> > and old locking limitations; remove the pmd_trans_unstable() check and
> > just proceed to pte_offset_map_lock(), aborting when it fails (page has
> > now been charged to memcg, but that's so in other cases, and presumably
> > uncharged later).
>
> Correct, the non-migrating page will be freed later via put_page() which
> will uncharge the page.

Thanks for confirming, yes, it was more difficult once upon a time,
but nowadays just a matter of reaching the final put_page()

Hugh

2023-05-24 04:40:30

by Hugh Dickins

[permalink] [raw]
Subject: Re: [PATCH 25/31] mm/gup: remove FOLL_SPLIT_PMD use of pmd_trans_unstable()

On Mon, 22 May 2023, Yang Shi wrote:
> On Mon, May 22, 2023 at 7:26 PM Yang Shi <[email protected]> wrote:
> > On Sun, May 21, 2023 at 10:22 PM Hugh Dickins <[email protected]> wrote:
> > >
> > > There is now no reason for follow_pmd_mask()'s FOLL_SPLIT_PMD block to
> > > distinguish huge_zero_page from a normal THP: follow_page_pte() handles
> > > any instability, and here it's a good idea to replace any pmd_none(*pmd)
> > > by a page table a.s.a.p, in the huge_zero_page case as for a normal THP.
> > > (Hmm, couldn't the normal THP case have hit an unstably refaulted THP
> > > before? But there are only two, exceptional, users of FOLL_SPLIT_PMD.)
> > >
> > > Signed-off-by: Hugh Dickins <[email protected]>
> > > ---
> > > mm/gup.c | 19 ++++---------------
> > > 1 file changed, 4 insertions(+), 15 deletions(-)
> > >
> > > diff --git a/mm/gup.c b/mm/gup.c
> > > index bb67193c5460..4ad50a59897f 100644
> > > --- a/mm/gup.c
> > > +++ b/mm/gup.c
> > > @@ -681,21 +681,10 @@ static struct page *follow_pmd_mask(struct vm_area_struct *vma,
> > > return follow_page_pte(vma, address, pmd, flags, &ctx->pgmap);
> > > }
> > > if (flags & FOLL_SPLIT_PMD) {
> > > - int ret;
> > > - page = pmd_page(*pmd);
> > > - if (is_huge_zero_page(page)) {
> > > - spin_unlock(ptl);
> > > - ret = 0;
> > > - split_huge_pmd(vma, pmd, address);
> > > - if (pmd_trans_unstable(pmd))
> > > - ret = -EBUSY;
> >
> > IIUC the pmd_trans_unstable() check was transferred to the implicit
> > pmd_none() in pte_alloc(). But it will return -ENOMEM instead of
> > -EBUSY. Won't it break some userspace? Or the pmd_trans_unstable() is
> > never true? If so it seems worth mentioning in the commit log about
> > this return value change.

Thanks a lot for looking at these, but I disagree here.

>
> Oops, the above comment is not accurate. It will call
> follow_page_pte() instead of returning -EBUSY if pmd is none.

Yes. Ignoring secondary races, if pmd is none, pte_alloc() will allocate
an empty page table there, and follow_page_pte() will find !pte_present and return
NULL; or if pmd is not none, follow_page_pte() will return no_page_table()
i.e. NULL. And page NULL ends up with __get_user_pages() having another
go round, instead of failing with -EBUSY.

Which I'd say is better handling for such a transient case - remember,
it's split_huge_pmd() (which should always succeed, but might be raced)
in use there, not split_huge_page() (which might take years for pins to
be removed before it can succeed).

> For other unstable cases, it will return -ENOMEM instead of -EBUSY.

I don't think so: the possibly-failing __pte_alloc() only gets called
in the pmd_none() case.
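
(pte_alloc() is roughly this macro from include/linux/mm.h, quoted from
memory so check the tree:

        #define pte_alloc(mm, pmd) \
                (unlikely(pmd_none(*(pmd))) && __pte_alloc(mm, pmd))

which is why the -ENOMEM path is only reachable when the pmd is none.)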

Hugh

>
> >
> > > - } else {
> > > - spin_unlock(ptl);
> > > - split_huge_pmd(vma, pmd, address);
> > > - ret = pte_alloc(mm, pmd) ? -ENOMEM : 0;
> > > - }
> > > -
> > > - return ret ? ERR_PTR(ret) :
> > > + spin_unlock(ptl);
> > > + split_huge_pmd(vma, pmd, address);
> > > + /* If pmd was left empty, stuff a page table in there quickly */
> > > + return pte_alloc(mm, pmd) ? ERR_PTR(-ENOMEM) :
> > > follow_page_pte(vma, address, pmd, flags, &ctx->pgmap);
> > > }
> > > page = follow_trans_huge_pmd(vma, address, pmd, flags);
> > > --
> > > 2.35.3

2023-05-24 04:55:55

by Hugh Dickins

[permalink] [raw]
Subject: Re: [PATCH 27/31] mm/khugepaged: allow pte_offset_map[_lock]() to fail

On Mon, 22 May 2023, Yang Shi wrote:
> On Sun, May 21, 2023 at 10:24 PM Hugh Dickins <[email protected]> wrote:
> >
> > __collapse_huge_page_swapin(): don't drop the map after every pte, it
> > only has to be dropped by do_swap_page(); give up if pte_offset_map()
> > fails; trace_mm_collapse_huge_page_swapin() at the end, with result;
> > fix comment on returned result; fix vmf.pgoff, though it's not used.
> >
> > collapse_huge_page(): use pte_offset_map_lock() on the _pmd returned
> > from clearing; allow failure, but it should be impossible there.
> > hpage_collapse_scan_pmd() and collapse_pte_mapped_thp() allow for
> > pte_offset_map_lock() failure.
> >
> > Signed-off-by: Hugh Dickins <[email protected]>
>
> Reviewed-by: Yang Shi <[email protected]>

Thanks.

>
> A nit below:
>
> > ---
> > mm/khugepaged.c | 72 +++++++++++++++++++++++++++++++++----------------
> > 1 file changed, 49 insertions(+), 23 deletions(-)
> >
> > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > index 732f9ac393fc..49cfa7cdfe93 100644
> > --- a/mm/khugepaged.c
> > +++ b/mm/khugepaged.c
...
> > @@ -1029,24 +1040,29 @@ static int __collapse_huge_page_swapin(struct mm_struct *mm,
> > * resulting in later failure.
> > */
> > if (ret & VM_FAULT_RETRY) {
> > - trace_mm_collapse_huge_page_swapin(mm, swapped_in, referenced, 0);
> > /* Likely, but not guaranteed, that page lock failed */
> > - return SCAN_PAGE_LOCK;
> > + result = SCAN_PAGE_LOCK;
>
> With per-VMA lock, this may not be true anymore, at least not true
> until per-VMA lock supports swap fault. It may be better to have a
> more general failure code, for example, SCAN_FAIL. But anyway you
> don't have to change it in your patch, I can send a follow-up patch
> once this series is landed on mm-unstable.

Interesting point (I've not tried to wrap my head around what differences
per-VMA locking would make to old likelihoods here), and thank you for
deferring a change on it - appreciated.

Something to beware of, if you do choose to change it: mostly those
SCAN codes (I'm not a fan of them!) are only for a tracepoint somewhere,
but madvise_collapse() and madvise_collapse_errno() take some of them
more seriously than others - I think SCAN_PAGE_LOCK ends up as an
EAGAIN (rightly), but SCAN_FAIL as an EINVAL (depends).
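
Roughly, simplified from madvise_collapse_errno() (not the full list of
cases, just the distinction that matters here):

        switch (result) {
        case SCAN_PAGE_LOCK:            /* transient: worth retrying */
                return -EAGAIN;
        default:                        /* SCAN_FAIL and most others */
                return -EINVAL;
        }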

But maybe there are layers in between which do not propagate the result
code, I didn't check. All in all, not something I'd spend time on myself.

Hugh

2023-05-24 05:07:52

by Hugh Dickins

[permalink] [raw]
Subject: Re: [PATCH 29/31] mm/memory: handle_pte_fault() use pte_offset_map_nolock()

On Mon, 22 May 2023, Qi Zheng wrote:
> On 2023/5/22 13:26, Hugh Dickins wrote:
> > handle_pte_fault() use pte_offset_map_nolock() to get the vmf.ptl which
> > corresponds to vmf.pte, instead of pte_lockptr() being used later, when
> > there's a chance that the pmd entry might have changed, perhaps to none,
> > or to a huge pmd, with no split ptlock in its struct page.
> >
> > Remove its pmd_devmap_trans_unstable() call: pte_offset_map_nolock()
> > will handle that case by failing. Update the "morph" comment above,
> > looking forward to when shmem or file collapse to THP may not take
> > mmap_lock for write (or not at all).
> >
> > do_numa_page() use the vmf->ptl from handle_pte_fault() at first, but
> > refresh it when refreshing vmf->pte.
> >
> > do_swap_page()'s pte_unmap_same() (the thing that takes ptl to verify a
> > two-part PAE orig_pte) use the vmf->ptl from handle_pte_fault() too; but
> > do_swap_page() is also used by anon THP's __collapse_huge_page_swapin(),
> > so adjust that to set vmf->ptl by pte_offset_map_nolock().
> >
> > Signed-off-by: Hugh Dickins <[email protected]>
> > ---
> > mm/khugepaged.c | 6 ++++--
> > mm/memory.c | 38 +++++++++++++-------------------------
> > 2 files changed, 17 insertions(+), 27 deletions(-)
> >
...
> > diff --git a/mm/memory.c b/mm/memory.c
> > index c7b920291a72..4ec46eecefd3 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
...
> > @@ -4897,27 +4897,16 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
> > vmf->pte = NULL;
> > vmf->flags &= ~FAULT_FLAG_ORIG_PTE_VALID;
> > } else {
> > - /*
> > - * If a huge pmd materialized under us just retry later. Use
> > - * pmd_trans_unstable() via pmd_devmap_trans_unstable() instead
> > - * of pmd_trans_huge() to ensure the pmd didn't become
> > - * pmd_trans_huge under us and then back to pmd_none, as a
> > - * result of MADV_DONTNEED running immediately after a huge pmd
> > - * fault in a different thread of this mm, in turn leading to a
> > - * misleading pmd_trans_huge() retval. All we have to ensure is
> > - * that it is a regular pmd that we can walk with
> > - * pte_offset_map() and we can do that through an atomic read
> > - * in C, which is what pmd_trans_unstable() provides.
> > - */
> > - if (pmd_devmap_trans_unstable(vmf->pmd))
> > - return 0;
> > /*
> > * A regular pmd is established and it can't morph into a huge
> > - * pmd from under us anymore at this point because we hold the
> > - * mmap_lock read mode and khugepaged takes it in write mode.
> > - * So now it's safe to run pte_offset_map().
> > + * pmd by anon khugepaged, since that takes mmap_lock in write
> > + * mode; but shmem or file collapse to THP could still morph
> > + * it into a huge pmd: just retry later if so.
> > */
> > - vmf->pte = pte_offset_map(vmf->pmd, vmf->address);
> > + vmf->pte = pte_offset_map_nolock(vmf->vma->vm_mm, vmf->pmd,
> > + vmf->address, &vmf->ptl);
> > + if (unlikely(!vmf->pte))
> > + return 0;
>
> Just jump to the retry label below?

Shrug. Could do. But again I saw no reason to optimize this path,
the pmd_devmap_trans_unstable() treatment sets a good enough example.

Hugh

2023-05-24 05:31:02

by Alistair Popple

[permalink] [raw]
Subject: Re: [PATCH 24/31] mm/migrate_device: allow pte_offset_map_lock() to fail


Hugh Dickins <[email protected]> writes:

> On Tue, 23 May 2023, Alistair Popple wrote:
>> Hugh Dickins <[email protected]> writes:
>>
>> > migrate_vma_collect_pmd(): remove the pmd_trans_unstable() handling after
>> > splitting huge zero pmd, and the pmd_none() handling after successfully
>> > splitting huge page: those are now managed inside pte_offset_map_lock(),
>> > and by "goto again" when it fails.
>> >
>> > But the skip after unsuccessful split_huge_page() must stay: it avoids an
>> > endless loop. The skip when pmd_bad()? Remove that: it will be treated
>> > as a hole rather than a skip once cleared by pte_offset_map_lock(), but
>> > with different timing that would be so anyway; and it's arguably best to
>> > leave the pmd_bad() handling centralized there.
>>
>> So for a pmd_bad() the sequence would be:
>>
>> 1. pte_offset_map_lock() would return NULL and clear the PMD.
>> 2. goto again marks the page as a migrating hole,
>> 3. In migrate_vma_insert_page() a new PMD is created by pmd_alloc().
>> 4. This leads to a new zero page getting mapped for the previously
>> pmd_bad() mapping.
>
> Agreed.
>
>>
>> I'm not entirely sure what the pmd_bad() case is used for but is that
>> ok? I understand that previously it was all a matter of timing, but I
>> wouldn't rely on the previous code being correct in this regard either.
>
> The pmd_bad() case is for when the pmd table got corrupted (overwritten,
> cosmic rays, whatever), and that pmd entry is easily recognized as
> nonsense: we try not to crash on it, but user data may have got lost.
>
> My "timing" remark may not be accurate: I seem to be living in the past,
> when we had a lot more "pmd_none_or_clear_bad()"s around than today - I
> was thinking that any one of them could be racily changing the bad to none.
> Though I suppose I am now making my timing remark accurate, by changing
> the bad to none more often again.
>
> Since data is liable to be lost anyway (unless the corrupted entry was
> actually none before it got corrupted), it doesn't matter greatly what
> we do with it (some would definitely prefer a crash, but traditionally
> we don't): issue a "pmd bad" message and not get stuck in a loop is
> the main thing.

Thanks for the background. Either skipping it or marking it as a hole as
you've done here will avoid a loop so feel free to add:

Reviewed-by: Alistair Popple <[email protected]>

>>
>> > migrate_vma_insert_page(): remove comment on the old pte_offset_map()
>> > and old locking limitations; remove the pmd_trans_unstable() check and
>> > just proceed to pte_offset_map_lock(), aborting when it fails (page has
>> > now been charged to memcg, but that's so in other cases, and presumably
>> > uncharged later).
>>
>> Correct, the non-migrating page will be freed later via put_page() which
>> will uncharge the page.
>
> Thanks for confirming, yes, it was more difficult once upon a time,
> but nowadays just a matter of reaching the final put_page()
>
> Hugh


2023-05-24 22:20:09

by Yang Shi

[permalink] [raw]
Subject: Re: [PATCH 27/31] mm/khugepaged: allow pte_offset_map[_lock]() to fail

On Tue, May 23, 2023 at 9:44 PM Hugh Dickins <[email protected]> wrote:
>
> On Mon, 22 May 2023, Yang Shi wrote:
> > On Sun, May 21, 2023 at 10:24 PM Hugh Dickins <[email protected]> wrote:
> > >
> > > __collapse_huge_page_swapin(): don't drop the map after every pte, it
> > > only has to be dropped by do_swap_page(); give up if pte_offset_map()
> > > fails; trace_mm_collapse_huge_page_swapin() at the end, with result;
> > > fix comment on returned result; fix vmf.pgoff, though it's not used.
> > >
> > > collapse_huge_page(): use pte_offset_map_lock() on the _pmd returned
> > > from clearing; allow failure, but it should be impossible there.
> > > hpage_collapse_scan_pmd() and collapse_pte_mapped_thp() allow for
> > > pte_offset_map_lock() failure.
> > >
> > > Signed-off-by: Hugh Dickins <[email protected]>
> >
> > Reviewed-by: Yang Shi <[email protected]>
>
> Thanks.
>
> >
> > A nit below:
> >
> > > ---
> > > mm/khugepaged.c | 72 +++++++++++++++++++++++++++++++++----------------
> > > 1 file changed, 49 insertions(+), 23 deletions(-)
> > >
> > > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > > index 732f9ac393fc..49cfa7cdfe93 100644
> > > --- a/mm/khugepaged.c
> > > +++ b/mm/khugepaged.c
> ...
> > > @@ -1029,24 +1040,29 @@ static int __collapse_huge_page_swapin(struct mm_struct *mm,
> > > * resulting in later failure.
> > > */
> > > if (ret & VM_FAULT_RETRY) {
> > > - trace_mm_collapse_huge_page_swapin(mm, swapped_in, referenced, 0);
> > > /* Likely, but not guaranteed, that page lock failed */
> > > - return SCAN_PAGE_LOCK;
> > > + result = SCAN_PAGE_LOCK;
> >
> > With per-VMA lock, this may not be true anymore, at least not true
> > until per-VMA lock supports swap fault. It may be better to have a
> > more general failure code, for example, SCAN_FAIL. But anyway you
> > don't have to change it in your patch, I can send a follow-up patch
> > once this series is landed on mm-unstable.
>
> Interesting point (I've not tried to wrap my head around what differences
> per-VMA locking would make to old likelihoods here), and thank you for
> deferring a change on it - appreciated.
>
> Something to beware of, if you do choose to change it: mostly those
> SCAN codes (I'm not a fan of them!) are only for a tracepoint somewhere,
> but madvise_collapse() and madvise_collapse_errno() take some of them
> more seriously than others - I think SCAN_PAGE_LOCK ends up as an
> EAGAIN (rightly), but SCAN_FAIL as an EINVAL (depends).
>
> But maybe there are layers in between which do not propagate the result
> code, I didn't check. All in all, not something I'd spend time on myself.

Thanks, Hugh. A second look shows do_swap_page() should not return
VM_FAULT_RETRY due to the per-VMA lock, since that depends on the
FAULT_FLAG_VMA_LOCK flag, which is actually not set in the khugepaged
path. Khugepaged just has the FAULT_FLAG_ALLOW_RETRY flag set. So we don't
have to change anything.

>
> Hugh

2023-05-24 23:02:27

by Peter Xu

[permalink] [raw]
Subject: Re: [PATCH 15/31] mm/userfaultfd: allow pte_offset_map_lock() to fail

Hi, Hugh,

On Sun, May 21, 2023 at 10:07:35PM -0700, Hugh Dickins wrote:
> mfill_atomic_install_pte() and mfill_atomic_pte_zeropage() treat
> failed pte_offset_map_lock() as -EFAULT, with no attempt to retry.

Could you help explain why it should be -EFAULT, not -EAGAIN or -EEXIST?

IIUC right now if pte existed we have -EEXIST returned as part of the
userfault ABI, no matter whether it's pte or thp.

IMHO it may boil down to my limited knowledge of how pte_offset_map_lock()
is used after this part 2 series, and I assume the core changes will be in
your 3rd series (besides this one and the arch one).

Please shed some light if there are quick answers (IIUC this is for speeding
up collapsing shmem THPs, but I still don't have much of a clue here), or I
can also wait to read the 3rd part if it'll come soon in any form.

Thanks,

--
Peter Xu


2023-05-24 23:03:07

by Yu Zhao

[permalink] [raw]
Subject: Re: [PATCH 01/31] mm: use pmdp_get_lockless() without surplus barrier()

On Sun, May 21, 2023 at 10:49 PM Hugh Dickins <[email protected]> wrote:
>
> Use pmdp_get_lockless() in preference to READ_ONCE(*pmdp), to get a more
> reliable result with PAE (or READ_ONCE as before without PAE); and remove
> the unnecessary extra barrier()s which got left behind in its callers.
>
> HOWEVER: Note the small print in linux/pgtable.h, where it was designed
> specifically for fast GUP, and depends on interrupts being disabled for
> its full guarantee: most callers which have been added (here and before)
> do NOT have interrupts disabled, so there is still some need for caution.
>
> Signed-off-by: Hugh Dickins <[email protected]>

Acked-by: Yu Zhao <[email protected]>

The previous ask here:
https://lore.kernel.org/r/CAOUHufZo=fB2HcaCrj2aidLJ2zEhOpi7ou5M_7qOQiuQq8+wTQ@mail.gmail.com/

2023-05-24 23:11:40

by Peter Xu

[permalink] [raw]
Subject: Re: [PATCH 01/31] mm: use pmdp_get_lockless() without surplus barrier()

On Sun, May 21, 2023 at 09:49:45PM -0700, Hugh Dickins wrote:
> Use pmdp_get_lockless() in preference to READ_ONCE(*pmdp), to get a more
> reliable result with PAE (or READ_ONCE as before without PAE); and remove
> the unnecessary extra barrier()s which got left behind in its callers.

Pure question: does it mean that some of the paths below (the ones missing
barrier()) could have a problem with CONFIG_PAE, hence this can be seen as a
(potential) bug fix?

Thanks,

>
> HOWEVER: Note the small print in linux/pgtable.h, where it was designed
> specifically for fast GUP, and depends on interrupts being disabled for
> its full guarantee: most callers which have been added (here and before)
> do NOT have interrupts disabled, so there is still some need for caution.
>
> Signed-off-by: Hugh Dickins <[email protected]>
> ---
> fs/userfaultfd.c | 10 +---------
> include/linux/pgtable.h | 17 -----------------
> mm/gup.c | 6 +-----
> mm/hmm.c | 2 +-
> mm/khugepaged.c | 5 -----
> mm/ksm.c | 3 +--
> mm/memory.c | 14 ++------------
> mm/mprotect.c | 5 -----
> mm/page_vma_mapped.c | 2 +-
> 9 files changed, 7 insertions(+), 57 deletions(-)
>
> diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
> index 0fd96d6e39ce..f7a0817b1ec0 100644
> --- a/fs/userfaultfd.c
> +++ b/fs/userfaultfd.c
> @@ -349,15 +349,7 @@ static inline bool userfaultfd_must_wait(struct userfaultfd_ctx *ctx,
> if (!pud_present(*pud))
> goto out;
> pmd = pmd_offset(pud, address);
> - /*
> - * READ_ONCE must function as a barrier with narrower scope
> - * and it must be equivalent to:
> - * _pmd = *pmd; barrier();
> - *
> - * This is to deal with the instability (as in
> - * pmd_trans_unstable) of the pmd.
> - */
> - _pmd = READ_ONCE(*pmd);
> + _pmd = pmdp_get_lockless(pmd);
> if (pmd_none(_pmd))
> goto out;
>
> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> index c5a51481bbb9..8ec27fe69dc8 100644
> --- a/include/linux/pgtable.h
> +++ b/include/linux/pgtable.h
> @@ -1344,23 +1344,6 @@ static inline int pud_trans_unstable(pud_t *pud)
> static inline int pmd_none_or_trans_huge_or_clear_bad(pmd_t *pmd)
> {
> pmd_t pmdval = pmdp_get_lockless(pmd);
> - /*
> - * The barrier will stabilize the pmdval in a register or on
> - * the stack so that it will stop changing under the code.
> - *
> - * When CONFIG_TRANSPARENT_HUGEPAGE=y on x86 32bit PAE,
> - * pmdp_get_lockless is allowed to return a not atomic pmdval
> - * (for example pointing to an hugepage that has never been
> - * mapped in the pmd). The below checks will only care about
> - * the low part of the pmd with 32bit PAE x86 anyway, with the
> - * exception of pmd_none(). So the important thing is that if
> - * the low part of the pmd is found null, the high part will
> - * be also null or the pmd_none() check below would be
> - * confused.
> - */
> -#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> - barrier();
> -#endif
> /*
> * !pmd_present() checks for pmd migration entries
> *
> diff --git a/mm/gup.c b/mm/gup.c
> index bbe416236593..3bd5d3854c51 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -653,11 +653,7 @@ static struct page *follow_pmd_mask(struct vm_area_struct *vma,
> struct mm_struct *mm = vma->vm_mm;
>
> pmd = pmd_offset(pudp, address);
> - /*
> - * The READ_ONCE() will stabilize the pmdval in a register or
> - * on the stack so that it will stop changing under the code.
> - */
> - pmdval = READ_ONCE(*pmd);
> + pmdval = pmdp_get_lockless(pmd);
> if (pmd_none(pmdval))
> return no_page_table(vma, flags);
> if (!pmd_present(pmdval))
> diff --git a/mm/hmm.c b/mm/hmm.c
> index 6a151c09de5e..e23043345615 100644
> --- a/mm/hmm.c
> +++ b/mm/hmm.c
> @@ -332,7 +332,7 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp,
> pmd_t pmd;
>
> again:
> - pmd = READ_ONCE(*pmdp);
> + pmd = pmdp_get_lockless(pmdp);
> if (pmd_none(pmd))
> return hmm_vma_walk_hole(start, end, -1, walk);
>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 6b9d39d65b73..732f9ac393fc 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -961,11 +961,6 @@ static int find_pmd_or_thp_or_none(struct mm_struct *mm,
> return SCAN_PMD_NULL;
>
> pmde = pmdp_get_lockless(*pmd);
> -
> -#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> - /* See comments in pmd_none_or_trans_huge_or_clear_bad() */
> - barrier();
> -#endif
> if (pmd_none(pmde))
> return SCAN_PMD_NONE;
> if (!pmd_present(pmde))
> diff --git a/mm/ksm.c b/mm/ksm.c
> index 0156bded3a66..df2aa281d49d 100644
> --- a/mm/ksm.c
> +++ b/mm/ksm.c
> @@ -1194,8 +1194,7 @@ static int replace_page(struct vm_area_struct *vma, struct page *page,
> * without holding anon_vma lock for write. So when looking for a
> * genuine pmde (in which to find pte), test present and !THP together.
> */
> - pmde = *pmd;
> - barrier();
> + pmde = pmdp_get_lockless(pmd);
> if (!pmd_present(pmde) || pmd_trans_huge(pmde))
> goto out;
>
> diff --git a/mm/memory.c b/mm/memory.c
> index f69fbc251198..2eb54c0d5d3c 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -4925,18 +4925,9 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
> * So now it's safe to run pte_offset_map().
> */
> vmf->pte = pte_offset_map(vmf->pmd, vmf->address);
> - vmf->orig_pte = *vmf->pte;
> + vmf->orig_pte = ptep_get_lockless(vmf->pte);
> vmf->flags |= FAULT_FLAG_ORIG_PTE_VALID;
>
> - /*
> - * some architectures can have larger ptes than wordsize,
> - * e.g.ppc44x-defconfig has CONFIG_PTE_64BIT=y and
> - * CONFIG_32BIT=y, so READ_ONCE cannot guarantee atomic
> - * accesses. The code below just needs a consistent view
> - * for the ifs and we later double check anyway with the
> - * ptl lock held. So here a barrier will do.
> - */
> - barrier();
> if (pte_none(vmf->orig_pte)) {
> pte_unmap(vmf->pte);
> vmf->pte = NULL;
> @@ -5060,9 +5051,8 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
> if (!(ret & VM_FAULT_FALLBACK))
> return ret;
> } else {
> - vmf.orig_pmd = *vmf.pmd;
> + vmf.orig_pmd = pmdp_get_lockless(vmf.pmd);
>
> - barrier();
> if (unlikely(is_swap_pmd(vmf.orig_pmd))) {
> VM_BUG_ON(thp_migration_supported() &&
> !is_pmd_migration_entry(vmf.orig_pmd));
> diff --git a/mm/mprotect.c b/mm/mprotect.c
> index 92d3d3ca390a..c5a13c0f1017 100644
> --- a/mm/mprotect.c
> +++ b/mm/mprotect.c
> @@ -309,11 +309,6 @@ static inline int pmd_none_or_clear_bad_unless_trans_huge(pmd_t *pmd)
> {
> pmd_t pmdval = pmdp_get_lockless(pmd);
>
> - /* See pmd_none_or_trans_huge_or_clear_bad for info on barrier */
> -#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> - barrier();
> -#endif
> -
> if (pmd_none(pmdval))
> return 1;
> if (pmd_trans_huge(pmdval))
> diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
> index 4e448cfbc6ef..64aff6718bdb 100644
> --- a/mm/page_vma_mapped.c
> +++ b/mm/page_vma_mapped.c
> @@ -210,7 +210,7 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
> * compiler and used as a stale value after we've observed a
> * subsequent update.
> */
> - pmde = READ_ONCE(*pvmw->pmd);
> + pmde = pmdp_get_lockless(pvmw->pmd);
>
> if (pmd_trans_huge(pmde) || is_pmd_migration_entry(pmde) ||
> (pmd_present(pmde) && pmd_devmap(pmde))) {
> --
> 2.35.3
>

--
Peter Xu


2023-05-24 23:17:48

by Peter Xu

[permalink] [raw]
Subject: Re: [PATCH 14/31] fs/userfaultfd: retry if pte_offset_map() fails

On Sun, May 21, 2023 at 10:06:32PM -0700, Hugh Dickins wrote:
> Instead of worrying whether the pmd is stable, userfaultfd_must_wait()
> call pte_offset_map() as before, but go back to try again if that fails.
>
> Risk of endless loop? It already broke out if pmd_none(), !pmd_present()
> or pmd_trans_huge(), and pte_offset_map() would have cleared pmd_bad():
> which leaves pmd_devmap(). Presumably pmd_devmap() is inappropriate in
> a vma subject to userfaultfd (it would have been mistreated before),
> but add a check just to avoid all possibility of endless loop there.

Agreed, afaiu that's only for PFNMAP or MIXEDMAP vmas. Maybe we could use a
WARN_ON_ONCE() there to make that clear, but no strong opinion.

>
> Signed-off-by: Hugh Dickins <[email protected]>

Acked-by: Peter Xu <[email protected]>

--
Peter Xu


2023-05-24 23:39:29

by Yang Shi

[permalink] [raw]
Subject: Re: [PATCH 25/31] mm/gup: remove FOLL_SPLIT_PMD use of pmd_trans_unstable()

On Tue, May 23, 2023 at 9:26 PM Hugh Dickins <[email protected]> wrote:
>
> On Mon, 22 May 2023, Yang Shi wrote:
> > On Mon, May 22, 2023 at 7:26 PM Yang Shi <[email protected]> wrote:
> > > On Sun, May 21, 2023 at 10:22 PM Hugh Dickins <[email protected]> wrote:
> > > >
> > > > There is now no reason for follow_pmd_mask()'s FOLL_SPLIT_PMD block to
> > > > distinguish huge_zero_page from a normal THP: follow_page_pte() handles
> > > > any instability, and here it's a good idea to replace any pmd_none(*pmd)
> > > > by a page table a.s.a.p, in the huge_zero_page case as for a normal THP.
> > > > (Hmm, couldn't the normal THP case have hit an unstably refaulted THP
> > > > before? But there are only two, exceptional, users of FOLL_SPLIT_PMD.)
> > > >
> > > > Signed-off-by: Hugh Dickins <[email protected]>
> > > > ---
> > > > mm/gup.c | 19 ++++---------------
> > > > 1 file changed, 4 insertions(+), 15 deletions(-)
> > > >
> > > > diff --git a/mm/gup.c b/mm/gup.c
> > > > index bb67193c5460..4ad50a59897f 100644
> > > > --- a/mm/gup.c
> > > > +++ b/mm/gup.c
> > > > @@ -681,21 +681,10 @@ static struct page *follow_pmd_mask(struct vm_area_struct *vma,
> > > > return follow_page_pte(vma, address, pmd, flags, &ctx->pgmap);
> > > > }
> > > > if (flags & FOLL_SPLIT_PMD) {
> > > > - int ret;
> > > > - page = pmd_page(*pmd);
> > > > - if (is_huge_zero_page(page)) {
> > > > - spin_unlock(ptl);
> > > > - ret = 0;
> > > > - split_huge_pmd(vma, pmd, address);
> > > > - if (pmd_trans_unstable(pmd))
> > > > - ret = -EBUSY;
> > >
> > > IIUC the pmd_trans_unstable() check was transferred to the implicit
> > > pmd_none() in pte_alloc(). But it will return -ENOMEM instead of
> > > -EBUSY. Won't it break some userspace? Or is the pmd_trans_unstable() case
> > > never true? If so, it seems worth mentioning this return value change in
> > > the commit log.
>
> Thanks a lot for looking at these, but I disagree here.
>
> >
> > Oops, the above comment is not accurate. It will call
> > follow_page_pte() instead of returning -EBUSY if pmd is none.
>
> Yes. Ignoring secondary races, if the pmd is none, pte_alloc() will allocate
> an empty page table there, follow_page_pte() finds !pte_present and returns
> NULL; or if the pmd is not none, follow_page_pte() will return no_page_table(),
> i.e. NULL. And a NULL page ends up with __get_user_pages() having another
> go round, instead of failing with -EBUSY.
>
> Which I'd say is better handling for such a transient case - remember,
> it's split_huge_pmd() (which should always succeed, but might be raced)
> in use there, not split_huge_page() (which might take years for pins to
> be removed before it can succeed).

It sounds like an improvement.

>
> > For other unstable cases, it will return -ENOMEM instead of -EBUSY.
>
> I don't think so: the possibly-failing __pte_alloc() only gets called
> in the pmd_none() case.

I mean what if pmd is not none for huge zero page. If it is not
pmd_none pte_alloc() just returns 0, then returns -ENOMEM instead of
-EBUSY. Or is it impossible that the pmd ends up being pmd_trans_huge or
!pmd_present? It should be very unlikely, for example, migration does
skip huge zero page, but I'm not sure whether there is any corner case
that I missed.

>
> Hugh
>
> >
> > >
> > > > - } else {
> > > > - spin_unlock(ptl);
> > > > - split_huge_pmd(vma, pmd, address);
> > > > - ret = pte_alloc(mm, pmd) ? -ENOMEM : 0;
> > > > - }
> > > > -
> > > > - return ret ? ERR_PTR(ret) :
> > > > + spin_unlock(ptl);
> > > > + split_huge_pmd(vma, pmd, address);
> > > > + /* If pmd was left empty, stuff a page table in there quickly */
> > > > + return pte_alloc(mm, pmd) ? ERR_PTR(-ENOMEM) :
> > > > follow_page_pte(vma, address, pmd, flags, &ctx->pgmap);
> > > > }
> > > > page = follow_trans_huge_pmd(vma, address, pmd, flags);
> > > > --
> > > > 2.35.3

2023-05-25 21:39:56

by Hugh Dickins

[permalink] [raw]
Subject: Re: [PATCH 25/31] mm/gup: remove FOLL_SPLIT_PMD use of pmd_trans_unstable()

On Wed, 24 May 2023, Yang Shi wrote:
> On Tue, May 23, 2023 at 9:26 PM Hugh Dickins <[email protected]> wrote:
> > On Mon, 22 May 2023, Yang Shi wrote:
> >
> > > For other unstable cases, it will return -ENOMEM instead of -EBUSY.
> >
> > I don't think so: the possibly-failing __pte_alloc() only gets called
> > in the pmd_none() case.
>
> I mean what if pmd is not none for huge zero page. If it is not
> pmd_none pte_alloc() just returns 0,

Yes, I agree with you on that.

> then returns -ENOMEM instead of -EBUSY.

But disagree with you on that.

return pte_alloc(mm, pmd) ? ERR_PTR(-ENOMEM) :
follow_page_pte(vma, address, pmd, flags, &ctx->pgmap);

Doesn't that say that if pte_alloc() returns 0, then follow_page_mask()
will call follow_page_pte() and return whatever that returns?

> Or is it impossible that the pmd ends up being pmd_trans_huge or
> !pmd_present? It should be very unlikely, for example, migration does
> skip huge zero page, but I'm not sure whether there is any corner case
> that I missed.

I'm assuming both are possible there (but not asserting that they are).

Hugh

2023-05-25 22:19:00

by Hugh Dickins

[permalink] [raw]
Subject: Re: [PATCH 15/31] mm/userfaultfd: allow pte_offset_map_lock() to fail

On Wed, 24 May 2023, Peter Xu wrote:
> On Sun, May 21, 2023 at 10:07:35PM -0700, Hugh Dickins wrote:
> > mfill_atomic_install_pte() and mfill_atomic_pte_zeropage() treat
> > failed pte_offset_map_lock() as -EFAULT, with no attempt to retry.
>
> Could you help explain why it should be -EFAULT, not -EAGAIN or -EEXIST?

Thanks a lot for looking, Peter.

No good justification for -EFAULT: I just grabbed the closest, fairly
neutral, error code that I could see already being in use there: but now
that you mention -EAGAIN, which I can see being used from mfill_atomic(),
yes, that would be ideal - and consistent with how it's already being used.

I'll make that change, thanks for suggesting. (And it had bugged me how
my fs/userfaultfd.c was electing to retry, but this one electing to fail.)
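
(Concretely, in mfill_atomic_install_pte() and mfill_atomic_pte_zeropage()
that would look roughly like the below - a sketch only, modulo the exact
variable and label names used there:)

	dst_pte = pte_offset_map_lock(dst_mm, dst_pmd, dst_addr, &ptl);
	if (!dst_pte) {
		/* pmd changed under us, e.g. collapsed to a THP: let caller retry */
		ret = -EAGAIN;
		goto out;
	}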

>
> IIUC right now if pte existed we have -EEXIST returned as part of the
> userfault ABI, no matter whether it's pte or thp.

It might or might not correspond to -EEXIST - it might even end up as
-EFAULT on a retry after -EAGAIN: I see mfill_atomic() contains both
-EEXIST and -EFAULT cases for pmd_trans_huge(). Actually, I could
say that the -EFAULT case there corresponds to the -EFAULT in this
15/31 patch, but that would be by coincidence not design: I'm happier
with your -EAGAIN suggestion.

>
> IMHO it may boil down to my limited knowledge on how pte_offset_map_lock()
> is used after this part 2 series, and I assume the core changes will be in
> your 3rd series (besides this one and the arch one).
>
> Please shed some light if there are quick answers (IIUC this is for speeding
> up collapsing shmem THPs, but I still have not much clue here), or I can also
> wait to read the 3rd part if it'll come soon in any form.

It wouldn't be particularly easy to deduce from the third series of
patches, rather submerged in implementation details. Just keep in mind
that, like in the "old" pmd_trans_unstable() cases, there may be instants
at which, when trying to get the lock on a page table, that page table
might already have gone, or been replaced by something else e.g. a THP,
and a retry necessary at the outer level (if it's important to persist).

Hugh

2023-05-25 22:43:16

by Yang Shi

[permalink] [raw]
Subject: Re: [PATCH 25/31] mm/gup: remove FOLL_SPLIT_PMD use of pmd_trans_unstable()

On Thu, May 25, 2023 at 2:16 PM Hugh Dickins <[email protected]> wrote:
>
> On Wed, 24 May 2023, Yang Shi wrote:
> > On Tue, May 23, 2023 at 9:26 PM Hugh Dickins <[email protected]> wrote:
> > > On Mon, 22 May 2023, Yang Shi wrote:
> > >
> > > > For other unstable cases, it will return -ENOMEM instead of -EBUSY.
> > >
> > > I don't think so: the possibly-failing __pte_alloc() only gets called
> > > in the pmd_none() case.
> >
> > I mean what if pmd is not none for huge zero page. If it is not
> > pmd_none pte_alloc() just returns 0,
>
> Yes, I agree with you on that.
>
> > then returns -ENOMEM instead of -EBUSY.
>
> But disagree with you on that.
>
> return pte_alloc(mm, pmd) ? ERR_PTR(-ENOMEM) :
> follow_page_pte(vma, address, pmd, flags, &ctx->pgmap);
>
> Doesn't that say that if pte_alloc() returns 0, then follow_page_mask()
> will call follow_page_pte() and return whatever that returns?

Err... you are right, I misread the code. Anyway, it returns -ENOMEM
instead of -EBUSY when the pmd is none and the pte alloc fails. Returning
-ENOMEM does make sense for that case. Is it worth a few words in the
commit log about the slight behavior change?

>
> > Or is it impossible that the pmd ends up being pmd_trans_huge or
> > !pmd_present? It should be very unlikely, for example, migration does
> > skip huge zero page, but I'm not sure whether there is any corner case
> > that I missed.
>
> I'm assuming both are possible there (but not asserting that they are).
>
> Hugh

2023-05-25 22:43:38

by Hugh Dickins

[permalink] [raw]
Subject: Re: [PATCH 01/31] mm: use pmdp_get_lockless() without surplus barrier()

On Wed, 24 May 2023, Peter Xu wrote:
> On Sun, May 21, 2023 at 09:49:45PM -0700, Hugh Dickins wrote:
> > Use pmdp_get_lockless() in preference to READ_ONCE(*pmdp), to get a more
> > reliable result with PAE (or READ_ONCE as before without PAE); and remove
> > the unnecessary extra barrier()s which got left behind in its callers.
>
> Pure question: does it mean that some of the paths below (the ones missing
> barrier()) could have a problem with CONFIG_PAE, hence this can be seen as
> a (potential) bug fix?

I don't think so; or at least, I am not claiming that this fixes any.

It really depends on what use is made of the pmdval afterwards, and
I've not checked through them. The READ_ONCE()s which were there,
were good enough to make sure that the compiler did not reevaluate
the pmdval later on, with perhaps a confusingly different result.

But, at least in the x86 PAE case, they were not good enough to ensure
that the two halves of the entry match up; and, sad to say, nor is that
absolutely guaranteed by these conversions to pmdp_get_lockless() -
because of the "HOWEVER" below. PeterZ's comments in linux/pgtable.h
are well worth reading through.
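
(To illustrate the PAE point - this is only a sketch of the idea, not the
code in linux/pgtable.h, and low_half()/high_half()/make_pmd() are
hypothetical helpers standing in for the real accessors:)

	/*
	 * A 64-bit PAE pmd is loaded as two 32-bit halves, so a plain
	 * READ_ONCE() can return the low half of one entry combined with
	 * the high half of another.  The lockless read retries until the
	 * low half is seen unchanged around the high-half load:
	 */
	do {
		lo = READ_ONCE(low_half(pmdp));
		smp_rmb();
		hi = READ_ONCE(high_half(pmdp));
		smp_rmb();
	} while (lo != READ_ONCE(low_half(pmdp)));
	pmdval = make_pmd(hi, lo);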

You might question why I made these changes at all: some days
I question them too. Better though imperfect? Or deceptive?

Hugh

> >
> > HOWEVER: Note the small print in linux/pgtable.h, where it was designed
> > specifically for fast GUP, and depends on interrupts being disabled for
> > its full guarantee: most callers which have been added (here and before)
> > do NOT have interrupts disabled, so there is still some need for caution.

2023-05-26 16:40:54

by Peter Xu

[permalink] [raw]
Subject: Re: [PATCH 15/31] mm/userfaultfd: allow pte_offset_map_lock() to fail

On Thu, May 25, 2023 at 03:06:27PM -0700, Hugh Dickins wrote:
> On Wed, 24 May 2023, Peter Xu wrote:
> > On Sun, May 21, 2023 at 10:07:35PM -0700, Hugh Dickins wrote:
> > > mfill_atomic_install_pte() and mfill_atomic_pte_zeropage() treat
> > > failed pte_offset_map_lock() as -EFAULT, with no attempt to retry.
> >
> > Could you help explain why it should be -EFAULT, not -EAGAIN or -EEXIST?
>
> Thanks a lot for looking, Peter.
>
> No good justification for -EFAULT: I just grabbed the closest, fairly
> neutral, error code that I could see already being in use there: but now
> that you mention -EAGAIN, which I can see being used from mfill_atomic(),
> yes, that would be ideal - and consistent with how it's already being used.
>
> I'll make that change, thanks for suggesting. (And it had bugged me how
> my fs/userfaultfd.c was electing to retry, but this one electing to fail.)

Thanks.

>
> >
> > IIUC right now if pte existed we have -EEXIST returned as part of the
> > userfault ABI, no matter whether it's pte or thp.
>
> It might or might not correspond to -EEXIST - it might even end up as
> -EFAULT on a retry after -EAGAIN: I see mfill_atomic() contains both
> -EEXIST and -EFAULT cases for pmd_trans_huge(). Actually, I could
> say that the -EFAULT case there corresponds to the -EFAULT in this
> 15/31 patch, but that would be by coincidence not design: I'm happier
> with your -EAGAIN suggestion.

I had a feeling that the 2nd -EFAULT there could already crash some userapp
if it got returned somewhere, because the userapp wouldn't expect it. IMHO
it should also return -EAGAIN, or even -EEXIST: even if the user retries,
we would most likely see that THP again, so the -EEXIST would probably
follow anyway.

Not a big deal here I think - if a userapp can trigger that -EFAULT I'd say
it's also a user bug, because it has already made two decisions on resolving
a page fault for a single VA, and it's racy between them.

>
> >
> > IMHO it may boil down to my limited knowledge on how pte_offset_map_lock()
> > is used after this part 2 series, and I assume the core changes will be in
> > your 3rd series (besides this one and the arch one).
> >
> > Please shed some light if there are quick answers (IIUC this is for speeding
> > up collapsing shmem THPs, but I still have not much clue here), or I can also
> > wait to read the 3rd part if it'll come soon in any form.
>
> It wouldn't be particularly easy to deduce from the third series of
> patches, rather submerged in implementation details. Just keep in mind
> that, like in the "old" pmd_trans_unstable() cases, there may be instants
> at which, when trying to get the lock on a page table, that page table
> might already have gone, or been replaced by something else e.g. a THP,
> and a retry necessary at the outer level (if it's important to persist).

I'm actually still curious how the 3rd series will look like; would love to
read it when it comes.

Thanks,

--
Peter Xu


2023-05-26 17:03:31

by Peter Xu

[permalink] [raw]
Subject: Re: [PATCH 01/31] mm: use pmdp_get_lockless() without surplus barrier()

On Thu, May 25, 2023 at 03:35:01PM -0700, Hugh Dickins wrote:
> On Wed, 24 May 2023, Peter Xu wrote:
> > On Sun, May 21, 2023 at 09:49:45PM -0700, Hugh Dickins wrote:
> > > Use pmdp_get_lockless() in preference to READ_ONCE(*pmdp), to get a more
> > > reliable result with PAE (or READ_ONCE as before without PAE); and remove
> > > the unnecessary extra barrier()s which got left behind in its callers.
> >
> > Pure question: does it mean that some of the paths below (the ones missing
> > barrier()) could have a problem with CONFIG_PAE, hence this can be seen as
> > a (potential) bug fix?
>
> I don't think so; or at least, I am not claiming that this fixes any.
>
> It really depends on what use is made of the pmdval afterwards, and
> I've not checked through them. The READ_ONCE()s which were there,
> were good enough to make sure that the compiler did not reevaluate
> the pmdval later on, with perhaps a confusingly different result.
>
> But, at least in the x86 PAE case, they were not good enough to ensure
> that the two halves of the entry match up; and, sad to say, nor is that
> absolutely guaranteed by these conversions to pmdp_get_lockless() -
> because of the "HOWEVER" below. PeterZ's comments in linux/pgtable.h
> are well worth reading through.

Yes exactly - that's a major source of my confusion about using
{ptep|pmdp}_get_lockless().

In irq-off context, AFAICT we can see a totally messed-up pte/pmd with the
present bit set if extremely unlucky. E.g. it can race with something like
"DONTNEED (contains tlbflush) then a POPULATE_WRITE", so we can have a
"present -> present" conversion of the pte while reading it, and read half
of pfn1 and then the other half of pfn2.

The other confusing thing about this _lockless trick on PAE is that I think
it _might_ go wrong with devmap..

The problem is that here we assumed that, even if the high & low halves may
not match, we can still rely on most pte/pmd checks being done only on the
low bits (except the _none() check), so that at least those checks are still
atomic on the low bits.

But it seems to me that's no longer true for pmd_trans_huge() after devmap
was introduced, e.g.:

static inline int pmd_trans_huge(pmd_t pmd)
{
return (pmd_val(pmd) & (_PAGE_PSE|_PAGE_DEVMAP)) == _PAGE_PSE;
}

#define _PAGE_PSE (_AT(pteval_t, 1) << _PAGE_BIT_PSE)
#define _PAGE_BIT_PSE 7 /* 4 MB (or 2MB) page */

#define _PAGE_DEVMAP (_AT(u64, 1) << _PAGE_BIT_DEVMAP)
#define _PAGE_BIT_DEVMAP _PAGE_BIT_SOFTW4
#define _PAGE_BIT_SOFTW4 58 /* available for programmer */

So after devmap, with CONFIG_PAE, pmd_trans_huge() checks not only the low
bits but also the high bits (_PAGE_BIT_DEVMAP is bit 58, in the high half).
I didn't go further to check whether there can be any real issue, but IIUC
that's not what was expected when the low/high trick was introduced
(originally in commit e585513b76f7b05d, solely for x86 PAE fast-gup).

>
> You might question why I made these changes at all: some days
> I question them too. Better though imperfect? Or deceptive?

That's probably a separate topic to address in all cases; this patch still
makes things slightly better regarding the barrier()s, which I agree with:

Acked-by: Peter Xu <[email protected]>

Thanks,

--
Peter Xu


2023-06-02 02:56:58

by Hugh Dickins

[permalink] [raw]
Subject: Re: [PATCH 01/31] mm: use pmdp_get_lockless() without surplus barrier()

On Fri, 26 May 2023, Peter Xu wrote:
>
> The other confusing thing about this _lockless trick on PAE is that I think
> it _might_ go wrong with devmap..
>
> The problem is that here we assumed that, even if the high & low halves may
> not match, we can still rely on most pte/pmd checks being done only on the
> low bits (except the _none() check), so that at least those checks are still
> atomic on the low bits.
>
> But it seems to me that's no longer true for pmd_trans_huge() after devmap
> was introduced, e.g.:

I agree that there would likely be a problem for p??_devmap() on 32-bit
PAE: but (I hope I followed the chain correctly!) I had earlier found
that pmd_devmap() can only return true when CONFIG_ZONE_DEVICE=y, and
config ZONE_DEVICE depends on ARCH_HAS_PTE_DEVMAP, and ARCH_HAS_PTE_DEVMAP
is only selected (in some cases) by arm64, powerpc if PPC_BOOK3S_64, and
x86 if X86_64.

So I stopped worrying about devmap.

Hugh

2023-07-05 15:16:19

by Aneesh Kumar K.V

[permalink] [raw]
Subject: Re: [PATCH 04/31] mm/pgtable: allow pte_offset_map[_lock]() to fail


Hi Hugh,

Sorry for not checking about this before. I am looking at a kernel
crash (BUG_ON()) on ppc64 with 4K page size. The reason we hit
BUG_ON() is because we have pmd_same() calling BUG_ON() on 4K with hash
translation. We don't support THP with 4K page size and hash
translation.

Hugh Dickins <[email protected]> writes:

....

> +
> +pte_t *pte_offset_map_nolock(struct mm_struct *mm, pmd_t *pmd,
> + unsigned long addr, spinlock_t **ptlp)
> +{
> + pmd_t pmdval;
> + pte_t *pte;
> +
> + pte = __pte_offset_map(pmd, addr, &pmdval);
> + if (likely(pte))
> + *ptlp = pte_lockptr(mm, &pmdval);
> + return pte;
> +}
> +
> +pte_t *__pte_offset_map_lock(struct mm_struct *mm, pmd_t *pmd,
> + unsigned long addr, spinlock_t **ptlp)
> +{
> + spinlock_t *ptl;
> + pmd_t pmdval;
> + pte_t *pte;
> +again:
> + pte = __pte_offset_map(pmd, addr, &pmdval);
> + if (unlikely(!pte))
> + return pte;
> + ptl = pte_lockptr(mm, &pmdval);
> + spin_lock(ptl);
> + if (likely(pmd_same(pmdval, pmdp_get_lockless(pmd)))) {
> + *ptlp = ptl;
> + return pte;
> + }
> + pte_unmap_unlock(pte, ptl);
> + goto again;
> +}

What is expected from that pmd_same check? We are holding the pte lock,
not the pmd lock, so the contents of the pmd can change.

-aneesh

2023-07-05 22:41:36

by Hugh Dickins

[permalink] [raw]
Subject: Re: [PATCH 04/31] mm/pgtable: allow pte_offset_map[_lock]() to fail

Hi Aneesh,

On Wed, 5 Jul 2023, Aneesh Kumar K.V wrote:
>
> Hi Hugh,
>
> Sorry for not checking about this before.

I was about to say "No problem" - but it appears I would be lying!

> I am looking at a kernel
> crash (BUG_ON()) on ppc64 with 4K page size. The reason we hit
> BUG_ON() is because we have pmd_same() calling BUG_ON() on 4K with hash
> translation. We don't support THP with 4K page size and hash
> translation.

I misunderstood you at first. I was trying to work out what in that
context might lead to *pmd changing suddenly, was going to ask for
stack backtrace (in faulting? or in copying mm? or?), whether you
have PER_VMA_LOCK enabled, etc. etc.

Then I looked at the source: oh, that is gross, and not something
I had expected at all.

> > +pte_t *__pte_offset_map_lock(struct mm_struct *mm, pmd_t *pmd,
> > + unsigned long addr, spinlock_t **ptlp)
> > +{
> > + spinlock_t *ptl;
> > + pmd_t pmdval;
> > + pte_t *pte;
> > +again:
> > + pte = __pte_offset_map(pmd, addr, &pmdval);
> > + if (unlikely(!pte))
> > + return pte;
> > + ptl = pte_lockptr(mm, &pmdval);
> > + spin_lock(ptl);
> > + if (likely(pmd_same(pmdval, pmdp_get_lockless(pmd)))) {
> > + *ptlp = ptl;
> > + return pte;
> > + }
> > + pte_unmap_unlock(pte, ptl);
> > + goto again;
> > +}
>
> What is expected from that pmd_same check? We are holding the pte lock,
> not the pmd lock, so the contents of the pmd can change.

And you don't need me to answer that question: the answer is in the
"likely". We do not expect *pmd to change there (though maybe some
ancillary bits of it, like "accessed"), unless the page table is on
its way to being freed; and other locking higher up (mmap_lock or
rmap lock) prevent all possibilities of that at present. Later, we
arrange to hold pte lock as well as pmd lock when removing page table.

So the obvious quick fix is:

--- a/arch/powerpc/include/asm/book3s/64/hash-4k.h
+++ b/arch/powerpc/include/asm/book3s/64/hash-4k.h
@@ -138,8 +138,7 @@ static inline int hash__pmd_trans_huge(p

static inline int hash__pmd_same(pmd_t pmd_a, pmd_t pmd_b)
{
- BUG();
- return 0;
+ return 1;
}

static inline pmd_t hash__pmd_mkhuge(pmd_t pmd)

But I hope you will reject that as almost as gross, and instead commit
a patch which makes hash__pmd_same() ... check whether the pmd_ts are the
same - as in ppc64's other implementations. That will save having to change
it again, when/if someone extends the current replace-page-table-by-THP work
by non-THP work to remove empty page tables without mmap_lock or rmap lock.
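
(Not tested, and only a sketch - assuming the masking used by the 64k hash
variant is also what is wanted here, that replacement might look something
like:)

static inline int hash__pmd_same(pmd_t pmd_a, pmd_t pmd_b)
{
	/* Compare the entries, ignoring the HPTE tracking bits */
	return (((pmd_raw(pmd_a) ^ pmd_raw(pmd_b)) &
		 cpu_to_be64(~_PAGE_HPTEFLAGS)) == 0);
}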

Thanks for finding this, Aneesh, and I'm sorry I didn't notice it before,
and I'm sorry to have given you trouble... but really, that BUG(),

(H)ugh!