Here is the v2 series of patches to mm, based on v6.4-rc5: preparing for
the v2 of the effective changes to follow, probably next week (when I hope
s390 will be sorted), affecting pte_offset_map() and pte_offset_map_lock().
There are very few differences from v1, noted patch by patch below.
This follows on from the v2 "arch: allow pte_offset_map[_lock]() to fail"
https://lore.kernel.org/linux-mm/[email protected]/
series of 23 posted on 2023-06-08,
and replaces the v1 "mm: allow pte_offset_map[_lock]() to fail"
https://lore.kernel.org/linux-mm/[email protected]/
series of 31 posted on 2023-05-21,
which was followed by the v1 "mm: free retracted page table by RCU"
https://lore.kernel.org/linux-mm/[email protected]/
series of 12 posted on 2023-05-28.
The first two series are "independent":
neither depends for build or correctness on the other, but both series
must be in before the third series is added to make the effective changes
(and it may be preferred to hold that one until the following release).
What is it all about? Some mmap_lock avoidance, i.e. latency reduction.
Initially just for the case of collapsing shmem or file pages to THPs;
but likely to be relied upon later in other contexts, e.g. freeing of
empty page tables (but that's not work I'm doing). mmap_write_lock
avoidance when collapsing to anon THPs? Perhaps, but again that's not
work I've done: a quick attempt was not as easy as the shmem/file case.
I would much prefer not to have to make these small but wide-ranging
changes for such a niche case; but I failed to find another way, and
have heard that shmem MADV_COLLAPSE's usefulness is being limited by
that mmap_write_lock it currently requires.
These changes (though of course not these exact patches) have been in
Google's data centre kernel for three years now: we do rely upon them.
What is this preparatory series about?
The current mmap locking will not be enough to guard against that
tricky transition between a pmd entry pointing to a page table, an empty
pmd entry, and a pmd entry pointing to a huge page: pte_offset_map() will
have to validate the pmd entry for itself, returning NULL if no page
table is there. What to do about that varies: sometimes nearby error
handling indicates just to skip it; but in many cases an ACTION_AGAIN or
"goto again" is appropriate (and if that risks an infinite loop, then
there must have been an oops, or pfn 0 mistaken for a page table, before).
Given the likely extension to freeing empty page tables, I have not
limited this set of changes to a THP config; and it has been easier,
and sets a better example, if each site is given appropriate handling:
even where deeper study might prove that failure could only happen if
the pmd table were corrupted.
Several of the patches are, or include, cleanup on the way; and by the
end, pmd_trans_unstable() and suchlike are deleted: pte_offset_map() and
pte_offset_map_lock() then handle those original races and more. Most
uses of pte_lockptr() are deprecated, with pte_offset_map_nolock()
taking its place.
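To give the flavour, here is a minimal sketch (invented names, not a
real call site) of the calling convention which the series converges on:

/* Sketch only: how a typical scan copes with pte_offset_map_lock() failing */
static int example_pte_range(struct mm_struct *mm, pmd_t *pmd,
                             unsigned long addr, unsigned long end)
{
        spinlock_t *ptl;
        pte_t *pte;

        pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
        if (!pte)               /* no page table here, or it changed under us */
                return -EAGAIN; /* caller skips, retries, or sets ACTION_AGAIN */

        for (; addr != end; pte++, addr += PAGE_SIZE) {
                /* ... act on *pte under ptl ... */
        }
        pte_unmap_unlock(pte - 1, ptl);
        return 0;
}

Each patch below chooses skip, retry or ACTION_AGAIN to suit its call site.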
This posting is based on v6.4-rc5, but good for any v6.4-rc; and good
for current mm-everything and linux-next, except for one minor clash in
mm/memory.c do_swap_page(), where Ryan Roberts and I fixed the same goto.
01/32 mm: use pmdp_get_lockless() without surplus barrier()
v2: add acks from Yu Zhao and PeterX
02/32 mm/migrate: remove cruft from migration_entry_wait()s
v2: add review from Alistair
03/32 mm/pgtable: kmap_local_page() instead of kmap_atomic()
v2: same as v1
04/32 mm/pgtable: allow pte_offset_map[_lock]() to fail
v2: same as v1
05/32 mm/filemap: allow pte_offset_map_lock() to fail
v2: same as v1
06/32 mm/page_vma_mapped: delete bogosity in page_vma_mapped_walk()
v2: same as v1
07/32 mm/page_vma_mapped: reformat map_pte() with less indentation
v2: same as v1
08/32 mm/page_vma_mapped: pte_offset_map_nolock() not pte_lockptr()
v2: same as v1
09/32 mm/pagewalkers: ACTION_AGAIN if pte_offset_map_lock() fails
v2: add review from SeongJae for mm/damon part
10/32 mm/pagewalk: walk_pte_range() allow for pte_offset_map()
v2: same as v1
11/32 mm/vmwgfx: simplify pmd & pud mapping dirty helpers
v2: same as v1
12/32 mm/vmalloc: vmalloc_to_page() use pte_offset_kernel()
v2: add review from Lorenzo
13/32 mm/hmm: retry if pte_offset_map() fails
v2: add review from Alistair
14/32 fs/userfaultfd: retry if pte_offset_map() fails
v2: add ack from PeterX
15/32 mm/userfaultfd: allow pte_offset_map_lock() to fail
v2: use -EAGAIN instead of -EFAULT from PeterX
16/32 mm/debug_vm_pgtable,page_table_check: warn pte map fails
v2: same as v1
17/32 mm/various: give up if pte_offset_map[_lock]() fails
v2: removed mm/swap_state.c mod to a separate patch 31/32
18/32 mm/mprotect: delete pmd_none_or_clear_bad_unless_transhuge()
v2: same as v1
19/32 mm/mremap: retry if either pte_offset_map_*lock() fails
v2: same as v1
20/32 mm/madvise: clean up pte_offset_map_lock() scans
v2: same as v1
21/32 mm/madvise: clean up force_shm_swapin_readahead()
v2: same as v1
22/32 mm/swapoff: allow pte_offset_map[_lock]() to fail
v2: same as v1
23/32 mm/mglru: allow pte_offset_map_nolock() to fail
v2: add ack from Yu Zhao
24/32 mm/migrate_device: allow pte_offset_map_lock() to fail
v2: add review from Alistair
25/32 mm/gup: remove FOLL_SPLIT_PMD use of pmd_trans_unstable()
v2: add comment on -EBUSY from Yang Shi
26/32 mm/huge_memory: split huge pmd under one pte_offset_map()
v2: add review from Yang Shi
27/32 mm/khugepaged: allow pte_offset_map[_lock]() to fail
v2: add review from Yang Shi
28/32 mm/memory: allow pte_offset_map[_lock]() to fail
v2: same as v1
29/32 mm/memory: handle_pte_fault() use pte_offset_map_nolock()
v2: same as v1
30/32 mm/pgtable: delete pmd_trans_unstable() and friends
v2: same as v1
31/32 mm/swap: swap_vma_readahead() do the pte_offset_map()
v2: new patch replacing last part of v1 17/31
32/32 perf/core: Allow pte_offset_map() to fail
v2: same as v1 31/31
Documentation/mm/split_page_table_lock.rst | 17 ++-
fs/proc/task_mmu.c | 32 ++---
fs/userfaultfd.c | 21 +--
include/linux/migrate.h | 4 +-
include/linux/mm.h | 27 ++--
include/linux/pgtable.h | 142 +++---------------
include/linux/swap.h | 19 ---
include/linux/swapops.h | 17 +--
kernel/events/core.c | 4 +
mm/damon/vaddr.c | 12 +-
mm/debug_vm_pgtable.c | 9 +-
mm/filemap.c | 25 ++--
mm/gup.c | 34 ++---
mm/hmm.c | 4 +-
mm/huge_memory.c | 33 +++--
mm/khugepaged.c | 83 ++++++-----
mm/ksm.c | 10 +-
mm/madvise.c | 146 ++++++++++---------
mm/mapping_dirty_helpers.c | 34 ++---
mm/memcontrol.c | 8 +-
mm/memory-failure.c | 8 +-
mm/memory.c | 224 +++++++++++++----------------
mm/mempolicy.c | 7 +-
mm/migrate.c | 40 +++---
mm/migrate_device.c | 31 +---
mm/mincore.c | 9 +-
mm/mlock.c | 4 +
mm/mprotect.c | 79 +++-------
mm/mremap.c | 28 ++--
mm/page_table_check.c | 2 +
mm/page_vma_mapped.c | 97 +++++++------
mm/pagewalk.c | 33 +++--
mm/pgtable-generic.c | 56 ++++++++
mm/swap_state.c | 45 +++---
mm/swapfile.c | 38 ++---
mm/userfaultfd.c | 8 ++
mm/vmalloc.c | 3 +-
mm/vmscan.c | 16 +--
38 files changed, 662 insertions(+), 747 deletions(-)
Hugh
Use pmdp_get_lockless() in preference to READ_ONCE(*pmdp), to get a more
reliable result with PAE (or READ_ONCE as before without PAE); and remove
the unnecessary extra barrier()s which got left behind in its callers.
HOWEVER: Note the small print in linux/pgtable.h, where it was designed
specifically for fast GUP, and depends on interrupts being disabled for
its full guarantee: most callers which have been added (here and before)
do NOT have interrupts disabled, so there is still some need for caution.
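For illustration only (a sketch, not one of the call sites below), the
pattern these callers converge on, with that caveat kept in mind:

/* Sketch only: pmdp_get_lockless() in place of READ_ONCE(*pmd) + barrier() */
static bool example_pmd_maps_pte_table(pmd_t *pmd)
{
        pmd_t pmdval = pmdp_get_lockless(pmd); /* no extra barrier() needed */

        if (pmd_none(pmdval) || !pmd_present(pmdval) || pmd_trans_huge(pmdval))
                return false;
        /*
         * Interrupts are not disabled here, so pmdval is only a snapshot:
         * whatever is decided from it must be rechecked under the page
         * table lock before any ptes are modified.
         */
        return !pmd_bad(pmdval);
}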
Signed-off-by: Hugh Dickins <[email protected]>
Acked-by: Yu Zhao <[email protected]>
Acked-by: Peter Xu <[email protected]>
---
fs/userfaultfd.c | 10 +---------
include/linux/pgtable.h | 17 -----------------
mm/gup.c | 6 +-----
mm/hmm.c | 2 +-
mm/khugepaged.c | 5 -----
mm/ksm.c | 3 +--
mm/memory.c | 14 ++------------
mm/mprotect.c | 5 -----
mm/page_vma_mapped.c | 2 +-
9 files changed, 7 insertions(+), 57 deletions(-)
diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 0fd96d6e39ce..f7a0817b1ec0 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -349,15 +349,7 @@ static inline bool userfaultfd_must_wait(struct userfaultfd_ctx *ctx,
if (!pud_present(*pud))
goto out;
pmd = pmd_offset(pud, address);
- /*
- * READ_ONCE must function as a barrier with narrower scope
- * and it must be equivalent to:
- * _pmd = *pmd; barrier();
- *
- * This is to deal with the instability (as in
- * pmd_trans_unstable) of the pmd.
- */
- _pmd = READ_ONCE(*pmd);
+ _pmd = pmdp_get_lockless(pmd);
if (pmd_none(_pmd))
goto out;
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index c5a51481bbb9..8ec27fe69dc8 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -1344,23 +1344,6 @@ static inline int pud_trans_unstable(pud_t *pud)
static inline int pmd_none_or_trans_huge_or_clear_bad(pmd_t *pmd)
{
pmd_t pmdval = pmdp_get_lockless(pmd);
- /*
- * The barrier will stabilize the pmdval in a register or on
- * the stack so that it will stop changing under the code.
- *
- * When CONFIG_TRANSPARENT_HUGEPAGE=y on x86 32bit PAE,
- * pmdp_get_lockless is allowed to return a not atomic pmdval
- * (for example pointing to an hugepage that has never been
- * mapped in the pmd). The below checks will only care about
- * the low part of the pmd with 32bit PAE x86 anyway, with the
- * exception of pmd_none(). So the important thing is that if
- * the low part of the pmd is found null, the high part will
- * be also null or the pmd_none() check below would be
- * confused.
- */
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
- barrier();
-#endif
/*
* !pmd_present() checks for pmd migration entries
*
diff --git a/mm/gup.c b/mm/gup.c
index bbe416236593..3bd5d3854c51 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -653,11 +653,7 @@ static struct page *follow_pmd_mask(struct vm_area_struct *vma,
struct mm_struct *mm = vma->vm_mm;
pmd = pmd_offset(pudp, address);
- /*
- * The READ_ONCE() will stabilize the pmdval in a register or
- * on the stack so that it will stop changing under the code.
- */
- pmdval = READ_ONCE(*pmd);
+ pmdval = pmdp_get_lockless(pmd);
if (pmd_none(pmdval))
return no_page_table(vma, flags);
if (!pmd_present(pmdval))
diff --git a/mm/hmm.c b/mm/hmm.c
index 6a151c09de5e..e23043345615 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -332,7 +332,7 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp,
pmd_t pmd;
again:
- pmd = READ_ONCE(*pmdp);
+ pmd = pmdp_get_lockless(pmdp);
if (pmd_none(pmd))
return hmm_vma_walk_hole(start, end, -1, walk);
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 6b9d39d65b73..732f9ac393fc 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -961,11 +961,6 @@ static int find_pmd_or_thp_or_none(struct mm_struct *mm,
return SCAN_PMD_NULL;
pmde = pmdp_get_lockless(*pmd);
-
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
- /* See comments in pmd_none_or_trans_huge_or_clear_bad() */
- barrier();
-#endif
if (pmd_none(pmde))
return SCAN_PMD_NONE;
if (!pmd_present(pmde))
diff --git a/mm/ksm.c b/mm/ksm.c
index 0156bded3a66..df2aa281d49d 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -1194,8 +1194,7 @@ static int replace_page(struct vm_area_struct *vma, struct page *page,
* without holding anon_vma lock for write. So when looking for a
* genuine pmde (in which to find pte), test present and !THP together.
*/
- pmde = *pmd;
- barrier();
+ pmde = pmdp_get_lockless(pmd);
if (!pmd_present(pmde) || pmd_trans_huge(pmde))
goto out;
diff --git a/mm/memory.c b/mm/memory.c
index f69fbc251198..2eb54c0d5d3c 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4925,18 +4925,9 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
* So now it's safe to run pte_offset_map().
*/
vmf->pte = pte_offset_map(vmf->pmd, vmf->address);
- vmf->orig_pte = *vmf->pte;
+ vmf->orig_pte = ptep_get_lockless(vmf->pte);
vmf->flags |= FAULT_FLAG_ORIG_PTE_VALID;
- /*
- * some architectures can have larger ptes than wordsize,
- * e.g.ppc44x-defconfig has CONFIG_PTE_64BIT=y and
- * CONFIG_32BIT=y, so READ_ONCE cannot guarantee atomic
- * accesses. The code below just needs a consistent view
- * for the ifs and we later double check anyway with the
- * ptl lock held. So here a barrier will do.
- */
- barrier();
if (pte_none(vmf->orig_pte)) {
pte_unmap(vmf->pte);
vmf->pte = NULL;
@@ -5060,9 +5051,8 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
if (!(ret & VM_FAULT_FALLBACK))
return ret;
} else {
- vmf.orig_pmd = *vmf.pmd;
+ vmf.orig_pmd = pmdp_get_lockless(vmf.pmd);
- barrier();
if (unlikely(is_swap_pmd(vmf.orig_pmd))) {
VM_BUG_ON(thp_migration_supported() &&
!is_pmd_migration_entry(vmf.orig_pmd));
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 92d3d3ca390a..c5a13c0f1017 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -309,11 +309,6 @@ static inline int pmd_none_or_clear_bad_unless_trans_huge(pmd_t *pmd)
{
pmd_t pmdval = pmdp_get_lockless(pmd);
- /* See pmd_none_or_trans_huge_or_clear_bad for info on barrier */
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
- barrier();
-#endif
-
if (pmd_none(pmdval))
return 1;
if (pmd_trans_huge(pmdval))
diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
index 4e448cfbc6ef..64aff6718bdb 100644
--- a/mm/page_vma_mapped.c
+++ b/mm/page_vma_mapped.c
@@ -210,7 +210,7 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
* compiler and used as a stale value after we've observed a
* subsequent update.
*/
- pmde = READ_ONCE(*pvmw->pmd);
+ pmde = pmdp_get_lockless(pvmw->pmd);
if (pmd_trans_huge(pmde) || is_pmd_migration_entry(pmde) ||
(pmd_present(pmde) && pmd_devmap(pmde))) {
--
2.35.3
pte_offset_map() was still using kmap_atomic(): update it to the
preferred kmap_local_page() before making further changes there, in case
we need this as a bisection point; but I doubt it can cause any trouble.
Signed-off-by: Hugh Dickins <[email protected]>
---
include/linux/pgtable.h | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 8ec27fe69dc8..94235ff2706e 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -96,9 +96,9 @@ static inline pte_t *pte_offset_kernel(pmd_t *pmd, unsigned long address)
#if defined(CONFIG_HIGHPTE)
#define pte_offset_map(dir, address) \
- ((pte_t *)kmap_atomic(pmd_page(*(dir))) + \
+ ((pte_t *)kmap_local_page(pmd_page(*(dir))) + \
pte_index((address)))
-#define pte_unmap(pte) kunmap_atomic((pte))
+#define pte_unmap(pte) kunmap_local((pte))
#else
#define pte_offset_map(dir, address) pte_offset_kernel((dir), (address))
#define pte_unmap(pte) ((void)(pte)) /* NOP */
--
2.35.3
Allow filemap_map_pages()'s pte_offset_map_lock() to fail; and remove the
pmd_devmap_trans_unstable() check from filemap_map_pmd(), which can safely
return to filemap_map_pages() and let pte_offset_map_lock() discover that.
Signed-off-by: Hugh Dickins <[email protected]>
---
mm/filemap.c | 12 +++++-------
1 file changed, 5 insertions(+), 7 deletions(-)
diff --git a/mm/filemap.c b/mm/filemap.c
index 28b42ee848a4..9e129ad43e0d 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -3408,13 +3408,6 @@ static bool filemap_map_pmd(struct vm_fault *vmf, struct folio *folio,
if (pmd_none(*vmf->pmd))
pmd_install(mm, vmf->pmd, &vmf->prealloc_pte);
- /* See comment in handle_pte_fault() */
- if (pmd_devmap_trans_unstable(vmf->pmd)) {
- folio_unlock(folio);
- folio_put(folio);
- return true;
- }
-
return false;
}
@@ -3501,6 +3494,11 @@ vm_fault_t filemap_map_pages(struct vm_fault *vmf,
addr = vma->vm_start + ((start_pgoff - vma->vm_pgoff) << PAGE_SHIFT);
vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, addr, &vmf->ptl);
+ if (!vmf->pte) {
+ folio_unlock(folio);
+ folio_put(folio);
+ goto out;
+ }
do {
again:
page = folio_file_page(folio, xas.xa_index);
--
2.35.3
Make pte_offset_map() a wrapper for __pte_offset_map() (which optionally
outputs pmdval), and pte_offset_map_lock() a sparse __cond_lock wrapper
for __pte_offset_map_lock(): those __funcs are added in mm/pgtable-generic.c.
__pte_offset_map() does pmdval validation (including pmd_clear_bad()
when pmd_bad()), returning NULL if pmdval is not for a page table.
__pte_offset_map_lock() verifies pmdval unchanged after getting the
lock, trying again if it changed.
No #ifdef CONFIG_TRANSPARENT_HUGEPAGE around them: that could be done
to cover the imminent case, but we expect to generalize it later, and
it makes a mess of where to do the pmd_bad() clearing.
Add pte_offset_map_nolock(): outputs ptl like pte_offset_map_lock(),
without actually taking the lock. This will be preferred to open uses of
pte_lockptr(), because (when split ptlock is in page table's struct page)
it points to the right lock for the returned pte pointer, even if *pmd
gets changed racily afterwards.
Update corresponding Documentation.
Do not add the anticipated rcu_read_lock() and rcu_read_unlock()s yet:
they have to wait until all architectures are balancing pte_offset_map()s
with pte_unmap()s (as in the arch series posted earlier). But comment
where they will go, so that it's easy to add them for experiments. And
only when those are in place can transient racy failure cases be enabled.
Add more safety for the PAE mismatched pmd_low pmd_high case at that time.
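As an illustration of intended use (a sketch with invented names, not
part of the patch): pte_offset_map_nolock() hands back the ptl which
matches the returned pte, so it is still the right lock to take later,
even if *pmd changes in between.

/* Sketch only: typical pte_offset_map_nolock() usage */
static void example_scan(struct mm_struct *mm, pmd_t *pmd, unsigned long addr)
{
        spinlock_t *ptl;
        pte_t *pte;

        pte = pte_offset_map_nolock(mm, pmd, addr, &ptl);
        if (!pte)
                return;         /* no page table to scan */

        /* ... lockless peeking at ptes may go here ... */

        spin_lock(ptl);         /* the lock for this pte table */
        /* ... recheck and modify ptes under the lock ... */
        pte_unmap_unlock(pte, ptl);
}

That is the point of returning the lock pointer alongside the pte,
rather than recomputing pte_lockptr(mm, pmd) afterwards.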
Signed-off-by: Hugh Dickins <[email protected]>
---
Documentation/mm/split_page_table_lock.rst | 17 ++++---
include/linux/mm.h | 27 +++++++----
include/linux/pgtable.h | 22 ++++++---
mm/pgtable-generic.c | 56 ++++++++++++++++++++++
4 files changed, 101 insertions(+), 21 deletions(-)
diff --git a/Documentation/mm/split_page_table_lock.rst b/Documentation/mm/split_page_table_lock.rst
index 50ee0dfc95be..a834fad9de12 100644
--- a/Documentation/mm/split_page_table_lock.rst
+++ b/Documentation/mm/split_page_table_lock.rst
@@ -14,15 +14,20 @@ tables. Access to higher level tables protected by mm->page_table_lock.
There are helpers to lock/unlock a table and other accessor functions:
- pte_offset_map_lock()
- maps pte and takes PTE table lock, returns pointer to the taken
- lock;
+ maps PTE and takes PTE table lock, returns pointer to PTE with
+ pointer to its PTE table lock, or returns NULL if no PTE table;
+ - pte_offset_map_nolock()
+ maps PTE, returns pointer to PTE with pointer to its PTE table
+ lock (not taken), or returns NULL if no PTE table;
+ - pte_offset_map()
+ maps PTE, returns pointer to PTE, or returns NULL if no PTE table;
+ - pte_unmap()
+ unmaps PTE table;
- pte_unmap_unlock()
unlocks and unmaps PTE table;
- pte_alloc_map_lock()
- allocates PTE table if needed and take the lock, returns pointer
- to taken lock or NULL if allocation failed;
- - pte_lockptr()
- returns pointer to PTE table lock;
+ allocates PTE table if needed and takes its lock, returns pointer to
+ PTE with pointer to its lock, or returns NULL if allocation failed;
- pmd_lock()
takes PMD table lock, returns pointer to taken lock;
- pmd_lockptr()
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 27ce77080c79..3c2e56980853 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2787,14 +2787,25 @@ static inline void pgtable_pte_page_dtor(struct page *page)
dec_lruvec_page_state(page, NR_PAGETABLE);
}
-#define pte_offset_map_lock(mm, pmd, address, ptlp) \
-({ \
- spinlock_t *__ptl = pte_lockptr(mm, pmd); \
- pte_t *__pte = pte_offset_map(pmd, address); \
- *(ptlp) = __ptl; \
- spin_lock(__ptl); \
- __pte; \
-})
+pte_t *__pte_offset_map(pmd_t *pmd, unsigned long addr, pmd_t *pmdvalp);
+static inline pte_t *pte_offset_map(pmd_t *pmd, unsigned long addr)
+{
+ return __pte_offset_map(pmd, addr, NULL);
+}
+
+pte_t *__pte_offset_map_lock(struct mm_struct *mm, pmd_t *pmd,
+ unsigned long addr, spinlock_t **ptlp);
+static inline pte_t *pte_offset_map_lock(struct mm_struct *mm, pmd_t *pmd,
+ unsigned long addr, spinlock_t **ptlp)
+{
+ pte_t *pte;
+
+ __cond_lock(*ptlp, pte = __pte_offset_map_lock(mm, pmd, addr, ptlp));
+ return pte;
+}
+
+pte_t *pte_offset_map_nolock(struct mm_struct *mm, pmd_t *pmd,
+ unsigned long addr, spinlock_t **ptlp);
#define pte_unmap_unlock(pte, ptl) do { \
spin_unlock(ptl); \
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 94235ff2706e..3fabbb018557 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -94,14 +94,22 @@ static inline pte_t *pte_offset_kernel(pmd_t *pmd, unsigned long address)
#define pte_offset_kernel pte_offset_kernel
#endif
-#if defined(CONFIG_HIGHPTE)
-#define pte_offset_map(dir, address) \
- ((pte_t *)kmap_local_page(pmd_page(*(dir))) + \
- pte_index((address)))
-#define pte_unmap(pte) kunmap_local((pte))
+#ifdef CONFIG_HIGHPTE
+#define __pte_map(pmd, address) \
+ ((pte_t *)kmap_local_page(pmd_page(*(pmd))) + pte_index((address)))
+#define pte_unmap(pte) do { \
+ kunmap_local((pte)); \
+ /* rcu_read_unlock() to be added later */ \
+} while (0)
#else
-#define pte_offset_map(dir, address) pte_offset_kernel((dir), (address))
-#define pte_unmap(pte) ((void)(pte)) /* NOP */
+static inline pte_t *__pte_map(pmd_t *pmd, unsigned long address)
+{
+ return pte_offset_kernel(pmd, address);
+}
+static inline void pte_unmap(pte_t *pte)
+{
+ /* rcu_read_unlock() to be added later */
+}
#endif
/* Find an entry in the second-level page table.. */
diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
index d2fc52bffafc..c7ab18a5fb77 100644
--- a/mm/pgtable-generic.c
+++ b/mm/pgtable-generic.c
@@ -10,6 +10,8 @@
#include <linux/pagemap.h>
#include <linux/hugetlb.h>
#include <linux/pgtable.h>
+#include <linux/swap.h>
+#include <linux/swapops.h>
#include <linux/mm_inline.h>
#include <asm/tlb.h>
@@ -229,3 +231,57 @@ pmd_t pmdp_collapse_flush(struct vm_area_struct *vma, unsigned long address,
}
#endif
#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
+
+pte_t *__pte_offset_map(pmd_t *pmd, unsigned long addr, pmd_t *pmdvalp)
+{
+ pmd_t pmdval;
+
+ /* rcu_read_lock() to be added later */
+ pmdval = pmdp_get_lockless(pmd);
+ if (pmdvalp)
+ *pmdvalp = pmdval;
+ if (unlikely(pmd_none(pmdval) || is_pmd_migration_entry(pmdval)))
+ goto nomap;
+ if (unlikely(pmd_trans_huge(pmdval) || pmd_devmap(pmdval)))
+ goto nomap;
+ if (unlikely(pmd_bad(pmdval))) {
+ pmd_clear_bad(pmd);
+ goto nomap;
+ }
+ return __pte_map(&pmdval, addr);
+nomap:
+ /* rcu_read_unlock() to be added later */
+ return NULL;
+}
+
+pte_t *pte_offset_map_nolock(struct mm_struct *mm, pmd_t *pmd,
+ unsigned long addr, spinlock_t **ptlp)
+{
+ pmd_t pmdval;
+ pte_t *pte;
+
+ pte = __pte_offset_map(pmd, addr, &pmdval);
+ if (likely(pte))
+ *ptlp = pte_lockptr(mm, &pmdval);
+ return pte;
+}
+
+pte_t *__pte_offset_map_lock(struct mm_struct *mm, pmd_t *pmd,
+ unsigned long addr, spinlock_t **ptlp)
+{
+ spinlock_t *ptl;
+ pmd_t pmdval;
+ pte_t *pte;
+again:
+ pte = __pte_offset_map(pmd, addr, &pmdval);
+ if (unlikely(!pte))
+ return pte;
+ ptl = pte_lockptr(mm, &pmdval);
+ spin_lock(ptl);
+ if (likely(pmd_same(pmdval, pmdp_get_lockless(pmd)))) {
+ *ptlp = ptl;
+ return pte;
+ }
+ pte_unmap_unlock(pte, ptl);
+ goto again;
+}
--
2.35.3
Simple walk_page_range() users should set ACTION_AGAIN to retry when
pte_offset_map_lock() fails.
No need to check pmd_trans_unstable(): that was precisely to avoid the
possibility of calling pte_offset_map() on a racily removed or inserted
THP entry, but such cases are now safely handled inside it. Likewise
there is no need to check pmd_none() or pmd_bad() before calling it.
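Each conversion follows the same shape; a minimal sketch of an imaginary
pmd_entry callback (not taken from this patch):

/* Sketch only: retry a pagewalk extent when there is no page table */
static int example_pmd_entry(pmd_t *pmd, unsigned long addr,
                             unsigned long end, struct mm_walk *walk)
{
        spinlock_t *ptl;
        pte_t *pte;

        pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl);
        if (!pte) {
                walk->action = ACTION_AGAIN;    /* walker revisits this pmd */
                return 0;
        }
        for (; addr != end; pte++, addr += PAGE_SIZE) {
                /* ... per-pte work ... */
        }
        pte_unmap_unlock(pte - 1, ptl);
        return 0;
}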
Signed-off-by: Hugh Dickins <[email protected]>
Reviewed-by: SeongJae Park <[email protected]> for mm/damon part
---
fs/proc/task_mmu.c | 32 ++++++++++++++++----------------
mm/damon/vaddr.c | 12 ++++++++----
mm/mempolicy.c | 7 ++++---
mm/mincore.c | 9 ++++-----
mm/mlock.c | 4 ++++
5 files changed, 36 insertions(+), 28 deletions(-)
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 420510f6a545..dba5052ce09b 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -631,14 +631,11 @@ static int smaps_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
goto out;
}
- if (pmd_trans_unstable(pmd))
- goto out;
- /*
- * The mmap_lock held all the way back in m_start() is what
- * keeps khugepaged out of here and from collapsing things
- * in here.
- */
pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
+ if (!pte) {
+ walk->action = ACTION_AGAIN;
+ return 0;
+ }
for (; addr != end; pte++, addr += PAGE_SIZE)
smaps_pte_entry(pte, addr, walk);
pte_unmap_unlock(pte - 1, ptl);
@@ -1191,10 +1188,11 @@ static int clear_refs_pte_range(pmd_t *pmd, unsigned long addr,
return 0;
}
- if (pmd_trans_unstable(pmd))
- return 0;
-
pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
+ if (!pte) {
+ walk->action = ACTION_AGAIN;
+ return 0;
+ }
for (; addr != end; pte++, addr += PAGE_SIZE) {
ptent = *pte;
@@ -1538,9 +1536,6 @@ static int pagemap_pmd_range(pmd_t *pmdp, unsigned long addr, unsigned long end,
spin_unlock(ptl);
return err;
}
-
- if (pmd_trans_unstable(pmdp))
- return 0;
#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
/*
@@ -1548,6 +1543,10 @@ static int pagemap_pmd_range(pmd_t *pmdp, unsigned long addr, unsigned long end,
* goes beyond vma->vm_end.
*/
orig_pte = pte = pte_offset_map_lock(walk->mm, pmdp, addr, &ptl);
+ if (!pte) {
+ walk->action = ACTION_AGAIN;
+ return err;
+ }
for (; addr < end; pte++, addr += PAGE_SIZE) {
pagemap_entry_t pme;
@@ -1887,11 +1886,12 @@ static int gather_pte_stats(pmd_t *pmd, unsigned long addr,
spin_unlock(ptl);
return 0;
}
-
- if (pmd_trans_unstable(pmd))
- return 0;
#endif
orig_pte = pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl);
+ if (!pte) {
+ walk->action = ACTION_AGAIN;
+ return 0;
+ }
do {
struct page *page = can_gather_numa_stats(*pte, vma, addr);
if (!page)
diff --git a/mm/damon/vaddr.c b/mm/damon/vaddr.c
index 1fec16d7263e..b8762ff15c3c 100644
--- a/mm/damon/vaddr.c
+++ b/mm/damon/vaddr.c
@@ -318,9 +318,11 @@ static int damon_mkold_pmd_entry(pmd_t *pmd, unsigned long addr,
spin_unlock(ptl);
}
- if (pmd_none(*pmd) || unlikely(pmd_bad(*pmd)))
- return 0;
pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl);
+ if (!pte) {
+ walk->action = ACTION_AGAIN;
+ return 0;
+ }
if (!pte_present(*pte))
goto out;
damon_ptep_mkold(pte, walk->mm, addr);
@@ -464,9 +466,11 @@ static int damon_young_pmd_entry(pmd_t *pmd, unsigned long addr,
regular_page:
#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
- if (pmd_none(*pmd) || unlikely(pmd_bad(*pmd)))
- return -EINVAL;
pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl);
+ if (!pte) {
+ walk->action = ACTION_AGAIN;
+ return 0;
+ }
if (!pte_present(*pte))
goto out;
folio = damon_get_folio(pte_pfn(*pte));
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 1756389a0609..4d0bcf6f0d52 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -514,10 +514,11 @@ static int queue_folios_pte_range(pmd_t *pmd, unsigned long addr,
if (ptl)
return queue_folios_pmd(pmd, ptl, addr, end, walk);
- if (pmd_trans_unstable(pmd))
- return 0;
-
mapped_pte = pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl);
+ if (!pte) {
+ walk->action = ACTION_AGAIN;
+ return 0;
+ }
for (; addr != end; pte++, addr += PAGE_SIZE) {
if (!pte_present(*pte))
continue;
diff --git a/mm/mincore.c b/mm/mincore.c
index 2d5be013a25a..f33f6a0b1ded 100644
--- a/mm/mincore.c
+++ b/mm/mincore.c
@@ -113,12 +113,11 @@ static int mincore_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
goto out;
}
- if (pmd_trans_unstable(pmd)) {
- __mincore_unmapped_range(addr, end, vma, vec);
- goto out;
- }
-
ptep = pte_offset_map_lock(walk->mm, pmd, addr, &ptl);
+ if (!ptep) {
+ walk->action = ACTION_AGAIN;
+ return 0;
+ }
for (; addr != end; ptep++, addr += PAGE_SIZE) {
pte_t pte = *ptep;
diff --git a/mm/mlock.c b/mm/mlock.c
index 40b43f8740df..9f2b1173b1b1 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -329,6 +329,10 @@ static int mlock_pte_range(pmd_t *pmd, unsigned long addr,
}
start_pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
+ if (!start_pte) {
+ walk->action = ACTION_AGAIN;
+ return 0;
+ }
for (pte = start_pte; addr != end; pte++, addr += PAGE_SIZE) {
if (!pte_present(*pte))
continue;
--
2.35.3
wp_clean_pmd_entry() need not check pmd_trans_unstable() or pmd_none(),
wp_clean_pud_entry() need not check pud_trans_unstable() or pud_none():
it's just the ACTION_CONTINUE when trans_huge or devmap that's needed
to prevent splitting, and we're hoping to remove pmd_trans_unstable().
Is that PUD #ifdef necessary? Maybe some configs are missing a stub.
Signed-off-by: Hugh Dickins <[email protected]>
---
mm/mapping_dirty_helpers.c | 34 +++++++++-------------------------
1 file changed, 9 insertions(+), 25 deletions(-)
diff --git a/mm/mapping_dirty_helpers.c b/mm/mapping_dirty_helpers.c
index e1eb33f49059..87b4beeda4fa 100644
--- a/mm/mapping_dirty_helpers.c
+++ b/mm/mapping_dirty_helpers.c
@@ -128,19 +128,11 @@ static int wp_clean_pmd_entry(pmd_t *pmd, unsigned long addr, unsigned long end,
{
pmd_t pmdval = pmdp_get_lockless(pmd);
- if (!pmd_trans_unstable(&pmdval))
- return 0;
-
- if (pmd_none(pmdval)) {
- walk->action = ACTION_AGAIN;
- return 0;
- }
-
- /* Huge pmd, present or migrated */
- walk->action = ACTION_CONTINUE;
- if (pmd_trans_huge(pmdval) || pmd_devmap(pmdval))
+ /* Do not split a huge pmd, present or migrated */
+ if (pmd_trans_huge(pmdval) || pmd_devmap(pmdval)) {
WARN_ON(pmd_write(pmdval) || pmd_dirty(pmdval));
-
+ walk->action = ACTION_CONTINUE;
+ }
return 0;
}
@@ -156,23 +148,15 @@ static int wp_clean_pmd_entry(pmd_t *pmd, unsigned long addr, unsigned long end,
static int wp_clean_pud_entry(pud_t *pud, unsigned long addr, unsigned long end,
struct mm_walk *walk)
{
+#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
pud_t pudval = READ_ONCE(*pud);
- if (!pud_trans_unstable(&pudval))
- return 0;
-
- if (pud_none(pudval)) {
- walk->action = ACTION_AGAIN;
- return 0;
- }
-
-#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
- /* Huge pud */
- walk->action = ACTION_CONTINUE;
- if (pud_trans_huge(pudval) || pud_devmap(pudval))
+ /* Do not split a huge pud */
+ if (pud_trans_huge(pudval) || pud_devmap(pudval)) {
WARN_ON(pud_write(pudval) || pud_dirty(pudval));
+ walk->action = ACTION_CONTINUE;
+ }
#endif
-
return 0;
}
--
2.35.3
migration_entry_wait_on_locked() does not need to take a mapped pte
pointer, its callers can do the unmap first. Annotate it with
__releases(ptl) to reduce sparse warnings.
Fold __migration_entry_wait_huge() into migration_entry_wait_huge().
Fold __migration_entry_wait() into migration_entry_wait(), preferring
the tighter pte_offset_map_lock() to pte_offset_map() and pte_lockptr().
Signed-off-by: Hugh Dickins <[email protected]>
Reviewed-by: Alistair Popple <[email protected]>
---
include/linux/migrate.h | 4 ++--
include/linux/swapops.h | 17 +++--------------
mm/filemap.c | 13 ++++---------
mm/migrate.c | 37 +++++++++++++------------------------
4 files changed, 22 insertions(+), 49 deletions(-)
diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index 6241a1596a75..affea3063473 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -75,8 +75,8 @@ bool isolate_movable_page(struct page *page, isolate_mode_t mode);
int migrate_huge_page_move_mapping(struct address_space *mapping,
struct folio *dst, struct folio *src);
-void migration_entry_wait_on_locked(swp_entry_t entry, pte_t *ptep,
- spinlock_t *ptl);
+void migration_entry_wait_on_locked(swp_entry_t entry, spinlock_t *ptl)
+ __releases(ptl);
void folio_migrate_flags(struct folio *newfolio, struct folio *folio);
void folio_migrate_copy(struct folio *newfolio, struct folio *folio);
int folio_migrate_mapping(struct address_space *mapping,
diff --git a/include/linux/swapops.h b/include/linux/swapops.h
index 3a451b7afcb3..4c932cb45e0b 100644
--- a/include/linux/swapops.h
+++ b/include/linux/swapops.h
@@ -332,15 +332,9 @@ static inline bool is_migration_entry_dirty(swp_entry_t entry)
return false;
}
-extern void __migration_entry_wait(struct mm_struct *mm, pte_t *ptep,
- spinlock_t *ptl);
extern void migration_entry_wait(struct mm_struct *mm, pmd_t *pmd,
unsigned long address);
-#ifdef CONFIG_HUGETLB_PAGE
-extern void __migration_entry_wait_huge(struct vm_area_struct *vma,
- pte_t *ptep, spinlock_t *ptl);
extern void migration_entry_wait_huge(struct vm_area_struct *vma, pte_t *pte);
-#endif /* CONFIG_HUGETLB_PAGE */
#else /* CONFIG_MIGRATION */
static inline swp_entry_t make_readable_migration_entry(pgoff_t offset)
{
@@ -362,15 +356,10 @@ static inline int is_migration_entry(swp_entry_t swp)
return 0;
}
-static inline void __migration_entry_wait(struct mm_struct *mm, pte_t *ptep,
- spinlock_t *ptl) { }
static inline void migration_entry_wait(struct mm_struct *mm, pmd_t *pmd,
- unsigned long address) { }
-#ifdef CONFIG_HUGETLB_PAGE
-static inline void __migration_entry_wait_huge(struct vm_area_struct *vma,
- pte_t *ptep, spinlock_t *ptl) { }
-static inline void migration_entry_wait_huge(struct vm_area_struct *vma, pte_t *pte) { }
-#endif /* CONFIG_HUGETLB_PAGE */
+ unsigned long address) { }
+static inline void migration_entry_wait_huge(struct vm_area_struct *vma,
+ pte_t *pte) { }
static inline int is_writable_migration_entry(swp_entry_t entry)
{
return 0;
diff --git a/mm/filemap.c b/mm/filemap.c
index b4c9bd368b7e..28b42ee848a4 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1359,8 +1359,6 @@ static inline int folio_wait_bit_common(struct folio *folio, int bit_nr,
/**
* migration_entry_wait_on_locked - Wait for a migration entry to be removed
* @entry: migration swap entry.
- * @ptep: mapped pte pointer. Will return with the ptep unmapped. Only required
- * for pte entries, pass NULL for pmd entries.
* @ptl: already locked ptl. This function will drop the lock.
*
* Wait for a migration entry referencing the given page to be removed. This is
@@ -1369,13 +1367,13 @@ static inline int folio_wait_bit_common(struct folio *folio, int bit_nr,
* should be called while holding the ptl for the migration entry referencing
* the page.
*
- * Returns after unmapping and unlocking the pte/ptl with pte_unmap_unlock().
+ * Returns after unlocking the ptl.
*
* This follows the same logic as folio_wait_bit_common() so see the comments
* there.
*/
-void migration_entry_wait_on_locked(swp_entry_t entry, pte_t *ptep,
- spinlock_t *ptl)
+void migration_entry_wait_on_locked(swp_entry_t entry, spinlock_t *ptl)
+ __releases(ptl)
{
struct wait_page_queue wait_page;
wait_queue_entry_t *wait = &wait_page.wait;
@@ -1409,10 +1407,7 @@ void migration_entry_wait_on_locked(swp_entry_t entry, pte_t *ptep,
* a valid reference to the page, and it must take the ptl to remove the
* migration entry. So the page is valid until the ptl is dropped.
*/
- if (ptep)
- pte_unmap_unlock(ptep, ptl);
- else
- spin_unlock(ptl);
+ spin_unlock(ptl);
for (;;) {
unsigned int flags;
diff --git a/mm/migrate.c b/mm/migrate.c
index 01cac26a3127..3ecb7a40075f 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -296,14 +296,18 @@ void remove_migration_ptes(struct folio *src, struct folio *dst, bool locked)
* get to the page and wait until migration is finished.
* When we return from this function the fault will be retried.
*/
-void __migration_entry_wait(struct mm_struct *mm, pte_t *ptep,
- spinlock_t *ptl)
+void migration_entry_wait(struct mm_struct *mm, pmd_t *pmd,
+ unsigned long address)
{
+ spinlock_t *ptl;
+ pte_t *ptep;
pte_t pte;
swp_entry_t entry;
- spin_lock(ptl);
+ ptep = pte_offset_map_lock(mm, pmd, address, &ptl);
pte = *ptep;
+ pte_unmap(ptep);
+
if (!is_swap_pte(pte))
goto out;
@@ -311,18 +315,10 @@ void __migration_entry_wait(struct mm_struct *mm, pte_t *ptep,
if (!is_migration_entry(entry))
goto out;
- migration_entry_wait_on_locked(entry, ptep, ptl);
+ migration_entry_wait_on_locked(entry, ptl);
return;
out:
- pte_unmap_unlock(ptep, ptl);
-}
-
-void migration_entry_wait(struct mm_struct *mm, pmd_t *pmd,
- unsigned long address)
-{
- spinlock_t *ptl = pte_lockptr(mm, pmd);
- pte_t *ptep = pte_offset_map(pmd, address);
- __migration_entry_wait(mm, ptep, ptl);
+ spin_unlock(ptl);
}
#ifdef CONFIG_HUGETLB_PAGE
@@ -332,9 +328,9 @@ void migration_entry_wait(struct mm_struct *mm, pmd_t *pmd,
*
* This function will release the vma lock before returning.
*/
-void __migration_entry_wait_huge(struct vm_area_struct *vma,
- pte_t *ptep, spinlock_t *ptl)
+void migration_entry_wait_huge(struct vm_area_struct *vma, pte_t *ptep)
{
+ spinlock_t *ptl = huge_pte_lockptr(hstate_vma(vma), vma->vm_mm, ptep);
pte_t pte;
hugetlb_vma_assert_locked(vma);
@@ -352,16 +348,9 @@ void __migration_entry_wait_huge(struct vm_area_struct *vma,
* lock release in migration_entry_wait_on_locked().
*/
hugetlb_vma_unlock_read(vma);
- migration_entry_wait_on_locked(pte_to_swp_entry(pte), NULL, ptl);
+ migration_entry_wait_on_locked(pte_to_swp_entry(pte), ptl);
}
}
-
-void migration_entry_wait_huge(struct vm_area_struct *vma, pte_t *pte)
-{
- spinlock_t *ptl = huge_pte_lockptr(hstate_vma(vma), vma->vm_mm, pte);
-
- __migration_entry_wait_huge(vma, pte, ptl);
-}
#endif
#ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
@@ -372,7 +361,7 @@ void pmd_migration_entry_wait(struct mm_struct *mm, pmd_t *pmd)
ptl = pmd_lock(mm, pmd);
if (!is_pmd_migration_entry(*pmd))
goto unlock;
- migration_entry_wait_on_locked(pmd_to_swp_entry(*pmd), NULL, ptl);
+ migration_entry_wait_on_locked(pmd_to_swp_entry(*pmd), ptl);
return;
unlock:
spin_unlock(ptl);
--
2.35.3
Make map_pte() use pte_offset_map_nolock(), to make sure of the ptl
belonging to the pte, even if the pmd entry is then changed racily:
page_vma_mapped_walk() then uses that instead of getting pte_lockptr()
later, or restarts if map_pte() found no page table.
Signed-off-by: Hugh Dickins <[email protected]>
---
mm/page_vma_mapped.c | 28 ++++++++++++++++++++++------
1 file changed, 22 insertions(+), 6 deletions(-)
diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
index 947dc7491815..2af734274073 100644
--- a/mm/page_vma_mapped.c
+++ b/mm/page_vma_mapped.c
@@ -13,16 +13,28 @@ static inline bool not_found(struct page_vma_mapped_walk *pvmw)
return false;
}
-static bool map_pte(struct page_vma_mapped_walk *pvmw)
+static bool map_pte(struct page_vma_mapped_walk *pvmw, spinlock_t **ptlp)
{
if (pvmw->flags & PVMW_SYNC) {
/* Use the stricter lookup */
pvmw->pte = pte_offset_map_lock(pvmw->vma->vm_mm, pvmw->pmd,
pvmw->address, &pvmw->ptl);
- return true;
+ *ptlp = pvmw->ptl;
+ return !!pvmw->pte;
}
- pvmw->pte = pte_offset_map(pvmw->pmd, pvmw->address);
+ /*
+ * It is important to return the ptl corresponding to pte,
+ * in case *pvmw->pmd changes underneath us; so we need to
+ * return it even when choosing not to lock, in case caller
+ * proceeds to loop over next ptes, and finds a match later.
+ * Though, in most cases, page lock already protects this.
+ */
+ pvmw->pte = pte_offset_map_nolock(pvmw->vma->vm_mm, pvmw->pmd,
+ pvmw->address, ptlp);
+ if (!pvmw->pte)
+ return false;
+
if (pvmw->flags & PVMW_MIGRATION) {
if (!is_swap_pte(*pvmw->pte))
return false;
@@ -51,7 +63,7 @@ static bool map_pte(struct page_vma_mapped_walk *pvmw)
} else if (!pte_present(*pvmw->pte)) {
return false;
}
- pvmw->ptl = pte_lockptr(pvmw->vma->vm_mm, pvmw->pmd);
+ pvmw->ptl = *ptlp;
spin_lock(pvmw->ptl);
return true;
}
@@ -156,6 +168,7 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
struct vm_area_struct *vma = pvmw->vma;
struct mm_struct *mm = vma->vm_mm;
unsigned long end;
+ spinlock_t *ptl;
pgd_t *pgd;
p4d_t *p4d;
pud_t *pud;
@@ -257,8 +270,11 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
step_forward(pvmw, PMD_SIZE);
continue;
}
- if (!map_pte(pvmw))
+ if (!map_pte(pvmw, &ptl)) {
+ if (!pvmw->pte)
+ goto restart;
goto next_pte;
+ }
this_pte:
if (check_pte(pvmw))
return true;
@@ -281,7 +297,7 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
} while (pte_none(*pvmw->pte));
if (!pvmw->ptl) {
- pvmw->ptl = pte_lockptr(mm, pvmw->pmd);
+ pvmw->ptl = ptl;
spin_lock(pvmw->ptl);
}
goto this_pte;
--
2.35.3
Revert commit a7a69d8ba88d ("mm/thp: another PVMW_SYNC fix in
page_vma_mapped_walk()"): I was proud of that "Aha!" commit at the time,
but in revisiting page_vma_mapped_walk() for pte_offset_map() failure,
that block raised a doubt: and it now seems utterly bogus. The prior
map_pte() has taken ptl unconditionally when PVMW_SYNC: I must have
forgotten that when making the change. It did no harm, but could not
have fixed a BUG or WARN, and is hard to reconcile with coming changes.
Signed-off-by: Hugh Dickins <[email protected]>
---
mm/page_vma_mapped.c | 4 ----
1 file changed, 4 deletions(-)
diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
index 64aff6718bdb..007dc7456f0e 100644
--- a/mm/page_vma_mapped.c
+++ b/mm/page_vma_mapped.c
@@ -275,10 +275,6 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
goto restart;
}
pvmw->pte++;
- if ((pvmw->flags & PVMW_SYNC) && !pvmw->ptl) {
- pvmw->ptl = pte_lockptr(mm, pvmw->pmd);
- spin_lock(pvmw->ptl);
- }
} while (pte_none(*pvmw->pte));
if (!pvmw->ptl) {
--
2.35.3
vmalloc_to_page() was using pte_offset_map() (followed by pte_unmap()),
but it's intended for userspace page tables: prefer pte_offset_kernel().
Signed-off-by: Hugh Dickins <[email protected]>
Reviewed-by: Lorenzo Stoakes <[email protected]>
---
mm/vmalloc.c | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 9683573f1225..741722d247d5 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -703,11 +703,10 @@ struct page *vmalloc_to_page(const void *vmalloc_addr)
if (WARN_ON_ONCE(pmd_bad(*pmd)))
return NULL;
- ptep = pte_offset_map(pmd, addr);
+ ptep = pte_offset_kernel(pmd, addr);
pte = *ptep;
if (pte_present(pte))
page = pte_page(pte);
- pte_unmap(ptep);
return page;
}
--
2.35.3
Make move_ptes() return -EAGAIN if pte_offset_map_lock() of old fails, or
if pte_offset_map_nolock() of new fails: move_page_tables() then retries.
But that does need a pmd_none() check inside, to stop endless loop when
huge shmem is truncated (thank you to syzbot); and move_huge_pmd() must
tolerate that a page table might have been allocated there just before
(of course it would be more satisfying to remove the empty page table,
but this is not a path worth optimizing).
Signed-off-by: Hugh Dickins <[email protected]>
---
mm/huge_memory.c | 5 +++--
mm/mremap.c | 28 ++++++++++++++++++++--------
2 files changed, 23 insertions(+), 10 deletions(-)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 624671aaa60d..d4bd5fa7c823 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1760,9 +1760,10 @@ bool move_huge_pmd(struct vm_area_struct *vma, unsigned long old_addr,
/*
* The destination pmd shouldn't be established, free_pgtables()
- * should have release it.
+ * should have released it; but move_page_tables() might have already
+ * inserted a page table, if racing against shmem/file collapse.
*/
- if (WARN_ON(!pmd_none(*new_pmd))) {
+ if (!pmd_none(*new_pmd)) {
VM_BUG_ON(pmd_trans_huge(*new_pmd));
return false;
}
diff --git a/mm/mremap.c b/mm/mremap.c
index b11ce6c92099..1fc47b4f38d7 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -133,7 +133,7 @@ static pte_t move_soft_dirty_pte(pte_t pte)
return pte;
}
-static void move_ptes(struct vm_area_struct *vma, pmd_t *old_pmd,
+static int move_ptes(struct vm_area_struct *vma, pmd_t *old_pmd,
unsigned long old_addr, unsigned long old_end,
struct vm_area_struct *new_vma, pmd_t *new_pmd,
unsigned long new_addr, bool need_rmap_locks)
@@ -143,6 +143,7 @@ static void move_ptes(struct vm_area_struct *vma, pmd_t *old_pmd,
spinlock_t *old_ptl, *new_ptl;
bool force_flush = false;
unsigned long len = old_end - old_addr;
+ int err = 0;
/*
* When need_rmap_locks is true, we take the i_mmap_rwsem and anon_vma
@@ -170,8 +171,16 @@ static void move_ptes(struct vm_area_struct *vma, pmd_t *old_pmd,
* pte locks because exclusive mmap_lock prevents deadlock.
*/
old_pte = pte_offset_map_lock(mm, old_pmd, old_addr, &old_ptl);
- new_pte = pte_offset_map(new_pmd, new_addr);
- new_ptl = pte_lockptr(mm, new_pmd);
+ if (!old_pte) {
+ err = -EAGAIN;
+ goto out;
+ }
+ new_pte = pte_offset_map_nolock(mm, new_pmd, new_addr, &new_ptl);
+ if (!new_pte) {
+ pte_unmap_unlock(old_pte, old_ptl);
+ err = -EAGAIN;
+ goto out;
+ }
if (new_ptl != old_ptl)
spin_lock_nested(new_ptl, SINGLE_DEPTH_NESTING);
flush_tlb_batched_pending(vma->vm_mm);
@@ -208,8 +217,10 @@ static void move_ptes(struct vm_area_struct *vma, pmd_t *old_pmd,
spin_unlock(new_ptl);
pte_unmap(new_pte - 1);
pte_unmap_unlock(old_pte - 1, old_ptl);
+out:
if (need_rmap_locks)
drop_rmap_locks(vma);
+ return err;
}
#ifndef arch_supports_page_table_move
@@ -537,6 +548,7 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
new_pmd = alloc_new_pmd(vma->vm_mm, vma, new_addr);
if (!new_pmd)
break;
+again:
if (is_swap_pmd(*old_pmd) || pmd_trans_huge(*old_pmd) ||
pmd_devmap(*old_pmd)) {
if (extent == HPAGE_PMD_SIZE &&
@@ -544,8 +556,6 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
old_pmd, new_pmd, need_rmap_locks))
continue;
split_huge_pmd(vma, old_pmd, old_addr);
- if (pmd_trans_unstable(old_pmd))
- continue;
} else if (IS_ENABLED(CONFIG_HAVE_MOVE_PMD) &&
extent == PMD_SIZE) {
/*
@@ -556,11 +566,13 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
old_pmd, new_pmd, true))
continue;
}
-
+ if (pmd_none(*old_pmd))
+ continue;
if (pte_alloc(new_vma->vm_mm, new_pmd))
break;
- move_ptes(vma, old_pmd, old_addr, old_addr + extent, new_vma,
- new_pmd, new_addr, need_rmap_locks);
+ if (move_ptes(vma, old_pmd, old_addr, old_addr + extent,
+ new_vma, new_pmd, new_addr, need_rmap_locks) < 0)
+ goto again;
}
mmu_notifier_invalidate_range_end(&range);
--
2.35.3
Came here to make madvise's several pte_offset_map_lock() scans advance
to the next extent on failure, and to remove superfluous pmd_trans_unstable()
and pmd_none_or_trans_huge_or_clear_bad() calls. But also did some
nearby cleanup.
swapin_walk_pmd_entry(): don't name an address "index"; don't drop the
lock after every pte, only when calling out to read_swap_cache_async().
madvise_cold_or_pageout_pte_range() and madvise_free_pte_range():
prefer "start_pte" for pointer, orig_pte usually denotes a saved pte
value; leave lazy MMU mode before unlocking; merge the success and
failure paths after split_folio().
Signed-off-by: Hugh Dickins <[email protected]>
---
mm/madvise.c | 122 ++++++++++++++++++++++++++++-----------------------
1 file changed, 68 insertions(+), 54 deletions(-)
diff --git a/mm/madvise.c b/mm/madvise.c
index b5ffbaf616f5..0af64c4a8f82 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -188,37 +188,43 @@ static int madvise_update_vma(struct vm_area_struct *vma,
#ifdef CONFIG_SWAP
static int swapin_walk_pmd_entry(pmd_t *pmd, unsigned long start,
- unsigned long end, struct mm_walk *walk)
+ unsigned long end, struct mm_walk *walk)
{
struct vm_area_struct *vma = walk->private;
- unsigned long index;
struct swap_iocb *splug = NULL;
+ pte_t *ptep = NULL;
+ spinlock_t *ptl;
+ unsigned long addr;
- if (pmd_none_or_trans_huge_or_clear_bad(pmd))
- return 0;
-
- for (index = start; index != end; index += PAGE_SIZE) {
+ for (addr = start; addr < end; addr += PAGE_SIZE) {
pte_t pte;
swp_entry_t entry;
struct page *page;
- spinlock_t *ptl;
- pte_t *ptep;
- ptep = pte_offset_map_lock(vma->vm_mm, pmd, index, &ptl);
+ if (!ptep++) {
+ ptep = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
+ if (!ptep)
+ break;
+ }
+
pte = *ptep;
- pte_unmap_unlock(ptep, ptl);
-
if (!is_swap_pte(pte))
continue;
entry = pte_to_swp_entry(pte);
if (unlikely(non_swap_entry(entry)))
continue;
+ pte_unmap_unlock(ptep, ptl);
+ ptep = NULL;
+
page = read_swap_cache_async(entry, GFP_HIGHUSER_MOVABLE,
- vma, index, false, &splug);
+ vma, addr, false, &splug);
if (page)
put_page(page);
}
+
+ if (ptep)
+ pte_unmap_unlock(ptep, ptl);
swap_read_unplug(splug);
cond_resched();
@@ -340,7 +346,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
bool pageout = private->pageout;
struct mm_struct *mm = tlb->mm;
struct vm_area_struct *vma = walk->vma;
- pte_t *orig_pte, *pte, ptent;
+ pte_t *start_pte, *pte, ptent;
spinlock_t *ptl;
struct folio *folio = NULL;
LIST_HEAD(folio_list);
@@ -422,11 +428,11 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
}
regular_folio:
- if (pmd_trans_unstable(pmd))
- return 0;
#endif
tlb_change_page_size(tlb, PAGE_SIZE);
- orig_pte = pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
+ start_pte = pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
+ if (!start_pte)
+ return 0;
flush_tlb_batched_pending(mm);
arch_enter_lazy_mmu_mode();
for (; addr < end; pte++, addr += PAGE_SIZE) {
@@ -447,25 +453,28 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
* are sure it's worth. Split it if we are only owner.
*/
if (folio_test_large(folio)) {
+ int err;
+
if (folio_mapcount(folio) != 1)
break;
if (pageout_anon_only_filter && !folio_test_anon(folio))
break;
+ if (!folio_trylock(folio))
+ break;
folio_get(folio);
- if (!folio_trylock(folio)) {
- folio_put(folio);
- break;
- }
- pte_unmap_unlock(orig_pte, ptl);
- if (split_folio(folio)) {
- folio_unlock(folio);
- folio_put(folio);
- orig_pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
- break;
- }
+ arch_leave_lazy_mmu_mode();
+ pte_unmap_unlock(start_pte, ptl);
+ start_pte = NULL;
+ err = split_folio(folio);
folio_unlock(folio);
folio_put(folio);
- orig_pte = pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
+ if (err)
+ break;
+ start_pte = pte =
+ pte_offset_map_lock(mm, pmd, addr, &ptl);
+ if (!start_pte)
+ break;
+ arch_enter_lazy_mmu_mode();
pte--;
addr -= PAGE_SIZE;
continue;
@@ -510,8 +519,10 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
folio_deactivate(folio);
}
- arch_leave_lazy_mmu_mode();
- pte_unmap_unlock(orig_pte, ptl);
+ if (start_pte) {
+ arch_leave_lazy_mmu_mode();
+ pte_unmap_unlock(start_pte, ptl);
+ }
if (pageout)
reclaim_pages(&folio_list);
cond_resched();
@@ -612,7 +623,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
struct mm_struct *mm = tlb->mm;
struct vm_area_struct *vma = walk->vma;
spinlock_t *ptl;
- pte_t *orig_pte, *pte, ptent;
+ pte_t *start_pte, *pte, ptent;
struct folio *folio;
int nr_swap = 0;
unsigned long next;
@@ -620,13 +631,12 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
next = pmd_addr_end(addr, end);
if (pmd_trans_huge(*pmd))
if (madvise_free_huge_pmd(tlb, vma, pmd, addr, next))
- goto next;
-
- if (pmd_trans_unstable(pmd))
- return 0;
+ return 0;
tlb_change_page_size(tlb, PAGE_SIZE);
- orig_pte = pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
+ start_pte = pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
+ if (!start_pte)
+ return 0;
flush_tlb_batched_pending(mm);
arch_enter_lazy_mmu_mode();
for (; addr != end; pte++, addr += PAGE_SIZE) {
@@ -664,23 +674,26 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
* deactivate all pages.
*/
if (folio_test_large(folio)) {
+ int err;
+
if (folio_mapcount(folio) != 1)
- goto out;
+ break;
+ if (!folio_trylock(folio))
+ break;
folio_get(folio);
- if (!folio_trylock(folio)) {
- folio_put(folio);
- goto out;
- }
- pte_unmap_unlock(orig_pte, ptl);
- if (split_folio(folio)) {
- folio_unlock(folio);
- folio_put(folio);
- orig_pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
- goto out;
- }
+ arch_leave_lazy_mmu_mode();
+ pte_unmap_unlock(start_pte, ptl);
+ start_pte = NULL;
+ err = split_folio(folio);
folio_unlock(folio);
folio_put(folio);
- orig_pte = pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
+ if (err)
+ break;
+ start_pte = pte =
+ pte_offset_map_lock(mm, pmd, addr, &ptl);
+ if (!start_pte)
+ break;
+ arch_enter_lazy_mmu_mode();
pte--;
addr -= PAGE_SIZE;
continue;
@@ -725,17 +738,18 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
}
folio_mark_lazyfree(folio);
}
-out:
+
if (nr_swap) {
if (current->mm == mm)
sync_mm_rss(mm);
-
add_mm_counter(mm, MM_SWAPENTS, nr_swap);
}
- arch_leave_lazy_mmu_mode();
- pte_unmap_unlock(orig_pte, ptl);
+ if (start_pte) {
+ arch_leave_lazy_mmu_mode();
+ pte_unmap_unlock(start_pte, ptl);
+ }
cond_resched();
-next:
+
return 0;
}
--
2.35.3
Some nearby MADV_WILLNEED cleanup unrelated to pte_offset_map_lock().
shmem_swapin_range() is a better name than force_shm_swapin_readahead().
Fix unimportant off-by-one on end_index. Call the swp_entry_t "entry"
rather than "swap": either is okay, but entry is the name used elsewhere
in mm/madvise.c. Do not assume GFP_HIGHUSER_MOVABLE: that's right for
anon swap, but shmem should take gfp from mapping. Pass the actual vma
and address to read_swap_cache_async(), in case a NUMA mempolicy applies.
lru_add_drain() at outer level, like madvise_willneed()'s other branch.
Signed-off-by: Hugh Dickins <[email protected]>
---
mm/madvise.c | 24 +++++++++++++-----------
1 file changed, 13 insertions(+), 11 deletions(-)
diff --git a/mm/madvise.c b/mm/madvise.c
index 0af64c4a8f82..9b3c9610052f 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -235,30 +235,34 @@ static const struct mm_walk_ops swapin_walk_ops = {
.pmd_entry = swapin_walk_pmd_entry,
};
-static void force_shm_swapin_readahead(struct vm_area_struct *vma,
+static void shmem_swapin_range(struct vm_area_struct *vma,
unsigned long start, unsigned long end,
struct address_space *mapping)
{
XA_STATE(xas, &mapping->i_pages, linear_page_index(vma, start));
- pgoff_t end_index = linear_page_index(vma, end + PAGE_SIZE - 1);
+ pgoff_t end_index = linear_page_index(vma, end) - 1;
struct page *page;
struct swap_iocb *splug = NULL;
rcu_read_lock();
xas_for_each(&xas, page, end_index) {
- swp_entry_t swap;
+ unsigned long addr;
+ swp_entry_t entry;
if (!xa_is_value(page))
continue;
- swap = radix_to_swp_entry(page);
+ entry = radix_to_swp_entry(page);
/* There might be swapin error entries in shmem mapping. */
- if (non_swap_entry(swap))
+ if (non_swap_entry(entry))
continue;
+
+ addr = vma->vm_start +
+ ((xas.xa_index - vma->vm_pgoff) << PAGE_SHIFT);
xas_pause(&xas);
rcu_read_unlock();
- page = read_swap_cache_async(swap, GFP_HIGHUSER_MOVABLE,
- NULL, 0, false, &splug);
+ page = read_swap_cache_async(entry, mapping_gfp_mask(mapping),
+ vma, addr, false, &splug);
if (page)
put_page(page);
@@ -266,8 +270,6 @@ static void force_shm_swapin_readahead(struct vm_area_struct *vma,
}
rcu_read_unlock();
swap_read_unplug(splug);
-
- lru_add_drain(); /* Push any new pages onto the LRU now */
}
#endif /* CONFIG_SWAP */
@@ -291,8 +293,8 @@ static long madvise_willneed(struct vm_area_struct *vma,
}
if (shmem_mapping(file->f_mapping)) {
- force_shm_swapin_readahead(vma, start, end,
- file->f_mapping);
+ shmem_swapin_range(vma, start, end, file->f_mapping);
+ lru_add_drain(); /* Push any new pages onto the LRU now */
return 0;
}
#else
--
2.35.3
walk_pte_range() has a no_vma option to serve walk_page_range_novma().
I don't know of any problem, but it looks safer to check for init_mm,
and use pte_offset_kernel() rather than pte_offset_map() in that case:
pte_offset_map()'s pmdval validation is intended for userspace.
Allow for its pte_offset_map() or pte_offset_map_lock() to fail, and
retry with ACTION_AGAIN if so. Add a second check for ACTION_AGAIN
in walk_pmd_range(), to catch it after return from walk_pte_range().
Remove the pmd_trans_unstable() check after split_huge_pmd() in
walk_pmd_range(): walk_pte_range() now handles those cases safely
(and they must fail powerpc's is_hugepd() check).
Signed-off-by: Hugh Dickins <[email protected]>
---
mm/pagewalk.c | 33 +++++++++++++++++++++++----------
1 file changed, 23 insertions(+), 10 deletions(-)
diff --git a/mm/pagewalk.c b/mm/pagewalk.c
index cb23f8a15c13..64437105fe0d 100644
--- a/mm/pagewalk.c
+++ b/mm/pagewalk.c
@@ -46,15 +46,27 @@ static int walk_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
spinlock_t *ptl;
if (walk->no_vma) {
- pte = pte_offset_map(pmd, addr);
- err = walk_pte_range_inner(pte, addr, end, walk);
- pte_unmap(pte);
+ /*
+ * pte_offset_map() might apply user-specific validation.
+ */
+ if (walk->mm == &init_mm)
+ pte = pte_offset_kernel(pmd, addr);
+ else
+ pte = pte_offset_map(pmd, addr);
+ if (pte) {
+ err = walk_pte_range_inner(pte, addr, end, walk);
+ if (walk->mm != &init_mm)
+ pte_unmap(pte);
+ }
} else {
pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl);
- err = walk_pte_range_inner(pte, addr, end, walk);
- pte_unmap_unlock(pte, ptl);
+ if (pte) {
+ err = walk_pte_range_inner(pte, addr, end, walk);
+ pte_unmap_unlock(pte, ptl);
+ }
}
-
+ if (!pte)
+ walk->action = ACTION_AGAIN;
return err;
}
@@ -141,11 +153,8 @@ static int walk_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
!(ops->pte_entry))
continue;
- if (walk->vma) {
+ if (walk->vma)
split_huge_pmd(walk->vma, pmd, addr);
- if (pmd_trans_unstable(pmd))
- goto again;
- }
if (is_hugepd(__hugepd(pmd_val(*pmd))))
err = walk_hugepd_range((hugepd_t *)pmd, addr, next, walk, PMD_SHIFT);
@@ -153,6 +162,10 @@ static int walk_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
err = walk_pte_range(pmd, addr, next, walk);
if (err)
break;
+
+ if (walk->action == ACTION_AGAIN)
+ goto again;
+
} while (pmd++, addr = next, addr != end);
return err;
--
2.35.3
mfill_atomic_install_pte() and mfill_atomic_pte_zeropage() treat
failed pte_offset_map_lock() as -EAGAIN, which mfill_atomic() already
returns to user for a similar race.
Signed-off-by: Hugh Dickins <[email protected]>
---
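Not part of the patch: a minimal sketch of the convention relied on above,
with an invented helper name. A pte_offset_map_lock() failure is reported
as -EAGAIN, which mfill_atomic() already returns to the caller for retry.

#include <linux/mm.h>

/* Sketch only: the -EAGAIN convention for a vanished page table */
static int sketch_install_pte(struct mm_struct *mm, pmd_t *dst_pmd,
			      unsigned long dst_addr, pte_t newpte)
{
	spinlock_t *ptl;
	pte_t *dst_pte;
	int ret = -EAGAIN;

	dst_pte = pte_offset_map_lock(mm, dst_pmd, dst_addr, &ptl);
	if (!dst_pte)
		goto out;	/* pmd raced to none or huge: caller retries */

	ret = -EEXIST;
	if (!pte_none(*dst_pte))
		goto out_unlock;

	/* the real code also updates rmap, counters and mmu cache */
	set_pte_at(mm, dst_addr, dst_pte, newpte);
	ret = 0;
out_unlock:
	pte_unmap_unlock(dst_pte, ptl);
out:
	return ret;
}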
mm/userfaultfd.c | 8 ++++++++
1 file changed, 8 insertions(+)
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index e97a0b4889fc..5fd787158c70 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -76,7 +76,10 @@ int mfill_atomic_install_pte(pmd_t *dst_pmd,
if (flags & MFILL_ATOMIC_WP)
_dst_pte = pte_mkuffd_wp(_dst_pte);
+ ret = -EAGAIN;
dst_pte = pte_offset_map_lock(dst_mm, dst_pmd, dst_addr, &ptl);
+ if (!dst_pte)
+ goto out;
if (vma_is_shmem(dst_vma)) {
/* serialize against truncate with the page table lock */
@@ -121,6 +124,7 @@ int mfill_atomic_install_pte(pmd_t *dst_pmd,
ret = 0;
out_unlock:
pte_unmap_unlock(dst_pte, ptl);
+out:
return ret;
}
@@ -212,7 +216,10 @@ static int mfill_atomic_pte_zeropage(pmd_t *dst_pmd,
_dst_pte = pte_mkspecial(pfn_pte(my_zero_pfn(dst_addr),
dst_vma->vm_page_prot));
+ ret = -EAGAIN;
dst_pte = pte_offset_map_lock(dst_vma->vm_mm, dst_pmd, dst_addr, &ptl);
+ if (!dst_pte)
+ goto out;
if (dst_vma->vm_file) {
/* the shmem MAP_PRIVATE case requires checking the i_size */
inode = dst_vma->vm_file->f_inode;
@@ -231,6 +238,7 @@ static int mfill_atomic_pte_zeropage(pmd_t *dst_pmd,
ret = 0;
out_unlock:
pte_unmap_unlock(dst_pte, ptl);
+out:
return ret;
}
--
2.35.3
change_pmd_range() had special pmd_none_or_clear_bad_unless_trans_huge(),
required to avoid "bad" choices when setting automatic NUMA hinting under
mmap_read_lock(); but most of that is already covered in pte_offset_map()
now. change_pmd_range() just wants a pmd_none() check before wasting
time on MMU notifiers, then checks on the read-once _pmd value to work
out what's needed for huge cases. Once change_pte_range() returns
-EAGAIN when pte_offset_map_lock() fails, so that change_pmd_range()
can retry the same pmd entry, nothing more special is needed.
Signed-off-by: Hugh Dickins <[email protected]>
---
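Not part of the patch, just an illustrative sketch of the retry flow
described above, using invented sketch_* names: a negative return from the
pte level sends the pmd level back to re-read and re-decide on the same
pmd entry.

#include <linux/mm.h>
#include <linux/sched.h>

/* Sketch only: the pte level reports -EAGAIN if the page table vanished */
static long sketch_change_pte_level(struct mm_struct *mm, pmd_t *pmd,
				    unsigned long addr, unsigned long end)
{
	spinlock_t *ptl;
	pte_t *pte;
	long pages = 0;

	pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
	if (!pte)
		return -EAGAIN;
	do {
		/* ... examine and update *pte, counting changed pages ... */
	} while (pte++, addr += PAGE_SIZE, addr != end);
	pte_unmap_unlock(pte - 1, ptl);
	return pages;
}

/* Sketch only: the pmd level retries, re-checking what the pmd now holds */
static long sketch_change_pmd_level(struct mm_struct *mm, pud_t *pud,
				    unsigned long addr, unsigned long end)
{
	pmd_t *pmd = pmd_offset(pud, addr);
	unsigned long next;
	long pages = 0, ret;

	do {
again:
		next = pmd_addr_end(addr, end);
		if (pmd_none(*pmd))
			goto skip;
		if (pmd_leaf(*pmd))
			goto skip;	/* real code splits or changes the huge pmd */
		ret = sketch_change_pte_level(mm, pmd, addr, next);
		if (ret < 0)		/* no page table there after all */
			goto again;
		pages += ret;
skip:
		cond_resched();
	} while (pmd++, addr = next, addr != end);

	return pages;
}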
mm/mprotect.c | 74 ++++++++++++---------------------------------------
1 file changed, 17 insertions(+), 57 deletions(-)
diff --git a/mm/mprotect.c b/mm/mprotect.c
index c5a13c0f1017..64e1df0af514 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -93,22 +93,9 @@ static long change_pte_range(struct mmu_gather *tlb,
bool uffd_wp_resolve = cp_flags & MM_CP_UFFD_WP_RESOLVE;
tlb_change_page_size(tlb, PAGE_SIZE);
-
- /*
- * Can be called with only the mmap_lock for reading by
- * prot_numa so we must check the pmd isn't constantly
- * changing from under us from pmd_none to pmd_trans_huge
- * and/or the other way around.
- */
- if (pmd_trans_unstable(pmd))
- return 0;
-
- /*
- * The pmd points to a regular pte so the pmd can't change
- * from under us even if the mmap_lock is only hold for
- * reading.
- */
pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
+ if (!pte)
+ return -EAGAIN;
/* Get target node for single threaded private VMAs */
if (prot_numa && !(vma->vm_flags & VM_SHARED) &&
@@ -301,26 +288,6 @@ static long change_pte_range(struct mmu_gather *tlb,
return pages;
}
-/*
- * Used when setting automatic NUMA hinting protection where it is
- * critical that a numa hinting PMD is not confused with a bad PMD.
- */
-static inline int pmd_none_or_clear_bad_unless_trans_huge(pmd_t *pmd)
-{
- pmd_t pmdval = pmdp_get_lockless(pmd);
-
- if (pmd_none(pmdval))
- return 1;
- if (pmd_trans_huge(pmdval))
- return 0;
- if (unlikely(pmd_bad(pmdval))) {
- pmd_clear_bad(pmd);
- return 1;
- }
-
- return 0;
-}
-
/*
* Return true if we want to split THPs into PTE mappings in change
* protection procedure, false otherwise.
@@ -398,7 +365,8 @@ static inline long change_pmd_range(struct mmu_gather *tlb,
pmd = pmd_offset(pud, addr);
do {
long ret;
-
+ pmd_t _pmd;
+again:
next = pmd_addr_end(addr, end);
ret = change_pmd_prepare(vma, pmd, cp_flags);
@@ -406,16 +374,8 @@ static inline long change_pmd_range(struct mmu_gather *tlb,
pages = ret;
break;
}
- /*
- * Automatic NUMA balancing walks the tables with mmap_lock
- * held for read. It's possible a parallel update to occur
- * between pmd_trans_huge() and a pmd_none_or_clear_bad()
- * check leading to a false positive and clearing.
- * Hence, it's necessary to atomically read the PMD value
- * for all the checks.
- */
- if (!is_swap_pmd(*pmd) && !pmd_devmap(*pmd) &&
- pmd_none_or_clear_bad_unless_trans_huge(pmd))
+
+ if (pmd_none(*pmd))
goto next;
/* invoke the mmu notifier if the pmd is populated */
@@ -426,7 +386,8 @@ static inline long change_pmd_range(struct mmu_gather *tlb,
mmu_notifier_invalidate_range_start(&range);
}
- if (is_swap_pmd(*pmd) || pmd_trans_huge(*pmd) || pmd_devmap(*pmd)) {
+ _pmd = pmdp_get_lockless(pmd);
+ if (is_swap_pmd(_pmd) || pmd_trans_huge(_pmd) || pmd_devmap(_pmd)) {
if ((next - addr != HPAGE_PMD_SIZE) ||
pgtable_split_needed(vma, cp_flags)) {
__split_huge_pmd(vma, pmd, addr, false, NULL);
@@ -441,15 +402,10 @@ static inline long change_pmd_range(struct mmu_gather *tlb,
break;
}
} else {
- /*
- * change_huge_pmd() does not defer TLB flushes,
- * so no need to propagate the tlb argument.
- */
- int nr_ptes = change_huge_pmd(tlb, vma, pmd,
+ ret = change_huge_pmd(tlb, vma, pmd,
addr, newprot, cp_flags);
-
- if (nr_ptes) {
- if (nr_ptes == HPAGE_PMD_NR) {
+ if (ret) {
+ if (ret == HPAGE_PMD_NR) {
pages += HPAGE_PMD_NR;
nr_huge_updates++;
}
@@ -460,8 +416,12 @@ static inline long change_pmd_range(struct mmu_gather *tlb,
}
/* fall through, the trans huge pmd just split */
}
- pages += change_pte_range(tlb, vma, pmd, addr, next,
- newprot, cp_flags);
+
+ ret = change_pte_range(tlb, vma, pmd, addr, next, newprot,
+ cp_flags);
+ if (ret < 0)
+ goto again;
+ pages += ret;
next:
cond_resched();
} while (pmd++, addr = next, addr != end);
--
2.35.3
There is now no reason for follow_pmd_mask()'s FOLL_SPLIT_PMD block to
distinguish huge_zero_page from a normal THP: follow_page_pte() handles
any instability, and here it's a good idea to replace any pmd_none(*pmd)
by a page table a.s.a.p, in the huge_zero_page case as for a normal THP;
and this removes an unnecessary possibility of -EBUSY failure.
(Hmm, couldn't the normal THP case have hit an unstably refaulted THP
before? But there are only two, exceptional, users of FOLL_SPLIT_PMD.)
Signed-off-by: Hugh Dickins <[email protected]>
---
mm/gup.c | 19 ++++---------------
1 file changed, 4 insertions(+), 15 deletions(-)
diff --git a/mm/gup.c b/mm/gup.c
index bb67193c5460..4ad50a59897f 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -681,21 +681,10 @@ static struct page *follow_pmd_mask(struct vm_area_struct *vma,
return follow_page_pte(vma, address, pmd, flags, &ctx->pgmap);
}
if (flags & FOLL_SPLIT_PMD) {
- int ret;
- page = pmd_page(*pmd);
- if (is_huge_zero_page(page)) {
- spin_unlock(ptl);
- ret = 0;
- split_huge_pmd(vma, pmd, address);
- if (pmd_trans_unstable(pmd))
- ret = -EBUSY;
- } else {
- spin_unlock(ptl);
- split_huge_pmd(vma, pmd, address);
- ret = pte_alloc(mm, pmd) ? -ENOMEM : 0;
- }
-
- return ret ? ERR_PTR(ret) :
+ spin_unlock(ptl);
+ split_huge_pmd(vma, pmd, address);
+ /* If pmd was left empty, stuff a page table in there quickly */
+ return pte_alloc(mm, pmd) ? ERR_PTR(-ENOMEM) :
follow_page_pte(vma, address, pmd, flags, &ctx->pgmap);
}
page = follow_trans_huge_pmd(vma, address, pmd, flags);
--
2.35.3
Instead of worrying whether the pmd is stable, userfaultfd_must_wait()
call pte_offset_map() as before, but go back to try again if that fails.
Risk of endless loop? It already broke out if pmd_none(), !pmd_present()
or pmd_trans_huge(), and pte_offset_map() would have cleared pmd_bad():
which leaves pmd_devmap(). Presumably pmd_devmap() is inappropriate in
a vma subject to userfaultfd (it would have been mistreated before),
but add a check just to avoid all possibility of endless loop there.
Signed-off-by: Hugh Dickins <[email protected]>
Acked-by: Peter Xu <[email protected]>
---
fs/userfaultfd.c | 11 ++++++-----
1 file changed, 6 insertions(+), 5 deletions(-)
diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index f7a0817b1ec0..ca83423f8d54 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -349,12 +349,13 @@ static inline bool userfaultfd_must_wait(struct userfaultfd_ctx *ctx,
if (!pud_present(*pud))
goto out;
pmd = pmd_offset(pud, address);
+again:
_pmd = pmdp_get_lockless(pmd);
if (pmd_none(_pmd))
goto out;
ret = false;
- if (!pmd_present(_pmd))
+ if (!pmd_present(_pmd) || pmd_devmap(_pmd))
goto out;
if (pmd_trans_huge(_pmd)) {
@@ -363,11 +364,11 @@ static inline bool userfaultfd_must_wait(struct userfaultfd_ctx *ctx,
goto out;
}
- /*
- * the pmd is stable (as in !pmd_trans_unstable) so we can re-read it
- * and use the standard pte_offset_map() instead of parsing _pmd.
- */
pte = pte_offset_map(pmd, address);
+ if (!pte) {
+ ret = true;
+ goto again;
+ }
/*
* Lockless access: we're in a wait_event so it's ok if it
* changes under us. PTE markers should be handled the same as none
--
2.35.3
__collapse_huge_page_swapin(): don't drop the map after every pte, it
only has to be dropped by do_swap_page(); give up if pte_offset_map()
fails; trace_mm_collapse_huge_page_swapin() at the end, with result;
fix comment on returned result; fix vmf.pgoff, though it's not used.
collapse_huge_page(): use pte_offset_map_lock() on the _pmd returned
from clearing; allow failure, but it should be impossible there.
hpage_collapse_scan_pmd() and collapse_pte_mapped_thp() allow for
pte_offset_map_lock() failure.
Signed-off-by: Hugh Dickins <[email protected]>
Reviewed-by: Yang Shi <[email protected]>
---
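Not part of the patch: a sketch of the map-on-demand loop shape adopted
here (and repeated later in unuse_pte_range() and swap_vma_readahead()),
with invented names. pte starts NULL, is mapped on first use, advances
while it stays mapped, and is reset to NULL whenever something
(do_swap_page() in the real code) has dropped the mapping.

#include <linux/mm.h>
#include <linux/swapops.h>

/* Sketch only: keep the pte mapped across iterations, remap on demand */
static void sketch_scan_ptes(pmd_t *pmd, unsigned long haddr)
{
	unsigned long addr, end = haddr + PTRS_PER_PTE * PAGE_SIZE;
	pte_t *pte = NULL;

	for (addr = haddr; addr < end; addr += PAGE_SIZE) {
		/*
		 * If pte is NULL, "!pte++" is true and pte is reassigned
		 * just below; if pte is already mapped, "!pte++" is false
		 * and the ++ steps to the next entry of the same table.
		 */
		if (!pte++) {
			pte = pte_offset_map(pmd, addr);
			if (!pte)
				break;	/* no page table here: give up */
		}
		if (!is_swap_pte(*pte))
			continue;
		/*
		 * Anything that drops the mapping must leave pte NULL, so
		 * that it is remapped at the right address next time round.
		 */
		pte_unmap(pte);
		pte = NULL;
		/* ... fault in or otherwise handle the swap entry ... */
	}
	if (pte)
		pte_unmap(pte);
}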
mm/khugepaged.c | 72 +++++++++++++++++++++++++++++++++----------------
1 file changed, 49 insertions(+), 23 deletions(-)
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 732f9ac393fc..49cfa7cdfe93 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -993,9 +993,8 @@ static int check_pmd_still_valid(struct mm_struct *mm,
* Only done if hpage_collapse_scan_pmd believes it is worthwhile.
*
* Called and returns without pte mapped or spinlocks held.
- * Note that if false is returned, mmap_lock will be released.
+ * Returns result: if not SCAN_SUCCEED, mmap_lock has been released.
*/
-
static int __collapse_huge_page_swapin(struct mm_struct *mm,
struct vm_area_struct *vma,
unsigned long haddr, pmd_t *pmd,
@@ -1004,23 +1003,35 @@ static int __collapse_huge_page_swapin(struct mm_struct *mm,
int swapped_in = 0;
vm_fault_t ret = 0;
unsigned long address, end = haddr + (HPAGE_PMD_NR * PAGE_SIZE);
+ int result;
+ pte_t *pte = NULL;
for (address = haddr; address < end; address += PAGE_SIZE) {
struct vm_fault vmf = {
.vma = vma,
.address = address,
- .pgoff = linear_page_index(vma, haddr),
+ .pgoff = linear_page_index(vma, address),
.flags = FAULT_FLAG_ALLOW_RETRY,
.pmd = pmd,
};
- vmf.pte = pte_offset_map(pmd, address);
- vmf.orig_pte = *vmf.pte;
- if (!is_swap_pte(vmf.orig_pte)) {
- pte_unmap(vmf.pte);
- continue;
+ if (!pte++) {
+ pte = pte_offset_map(pmd, address);
+ if (!pte) {
+ mmap_read_unlock(mm);
+ result = SCAN_PMD_NULL;
+ goto out;
+ }
}
+
+ vmf.orig_pte = *pte;
+ if (!is_swap_pte(vmf.orig_pte))
+ continue;
+
+ vmf.pte = pte;
ret = do_swap_page(&vmf);
+ /* Which unmaps pte (after perhaps re-checking the entry) */
+ pte = NULL;
/*
* do_swap_page returns VM_FAULT_RETRY with released mmap_lock.
@@ -1029,24 +1040,29 @@ static int __collapse_huge_page_swapin(struct mm_struct *mm,
* resulting in later failure.
*/
if (ret & VM_FAULT_RETRY) {
- trace_mm_collapse_huge_page_swapin(mm, swapped_in, referenced, 0);
/* Likely, but not guaranteed, that page lock failed */
- return SCAN_PAGE_LOCK;
+ result = SCAN_PAGE_LOCK;
+ goto out;
}
if (ret & VM_FAULT_ERROR) {
mmap_read_unlock(mm);
- trace_mm_collapse_huge_page_swapin(mm, swapped_in, referenced, 0);
- return SCAN_FAIL;
+ result = SCAN_FAIL;
+ goto out;
}
swapped_in++;
}
+ if (pte)
+ pte_unmap(pte);
+
/* Drain LRU add pagevec to remove extra pin on the swapped in pages */
if (swapped_in)
lru_add_drain();
- trace_mm_collapse_huge_page_swapin(mm, swapped_in, referenced, 1);
- return SCAN_SUCCEED;
+ result = SCAN_SUCCEED;
+out:
+ trace_mm_collapse_huge_page_swapin(mm, swapped_in, referenced, result);
+ return result;
}
static int alloc_charge_hpage(struct page **hpage, struct mm_struct *mm,
@@ -1146,9 +1162,6 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
address + HPAGE_PMD_SIZE);
mmu_notifier_invalidate_range_start(&range);
- pte = pte_offset_map(pmd, address);
- pte_ptl = pte_lockptr(mm, pmd);
-
pmd_ptl = pmd_lock(mm, pmd); /* probably unnecessary */
/*
* This removes any huge TLB entry from the CPU so we won't allow
@@ -1163,13 +1176,18 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
mmu_notifier_invalidate_range_end(&range);
tlb_remove_table_sync_one();
- spin_lock(pte_ptl);
- result = __collapse_huge_page_isolate(vma, address, pte, cc,
- &compound_pagelist);
- spin_unlock(pte_ptl);
+ pte = pte_offset_map_lock(mm, &_pmd, address, &pte_ptl);
+ if (pte) {
+ result = __collapse_huge_page_isolate(vma, address, pte, cc,
+ &compound_pagelist);
+ spin_unlock(pte_ptl);
+ } else {
+ result = SCAN_PMD_NULL;
+ }
if (unlikely(result != SCAN_SUCCEED)) {
- pte_unmap(pte);
+ if (pte)
+ pte_unmap(pte);
spin_lock(pmd_ptl);
BUG_ON(!pmd_none(*pmd));
/*
@@ -1253,6 +1271,11 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm,
memset(cc->node_load, 0, sizeof(cc->node_load));
nodes_clear(cc->alloc_nmask);
pte = pte_offset_map_lock(mm, pmd, address, &ptl);
+ if (!pte) {
+ result = SCAN_PMD_NULL;
+ goto out;
+ }
+
for (_address = address, _pte = pte; _pte < pte + HPAGE_PMD_NR;
_pte++, _address += PAGE_SIZE) {
pte_t pteval = *_pte;
@@ -1622,8 +1645,10 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
* lockless_pages_from_mm() and the hardware page walker can access page
* tables while all the high-level locks are held in write mode.
*/
- start_pte = pte_offset_map_lock(mm, pmd, haddr, &ptl);
result = SCAN_FAIL;
+ start_pte = pte_offset_map_lock(mm, pmd, haddr, &ptl);
+ if (!start_pte)
+ goto drop_immap;
/* step 1: check all mapped PTEs are to the right huge page */
for (i = 0, addr = haddr, pte = start_pte;
@@ -1697,6 +1722,7 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
abort:
pte_unmap_unlock(start_pte, ptl);
+drop_immap:
i_mmap_unlock_write(vma->vm_file->f_mapping);
goto drop_hpage;
}
--
2.35.3
copy_pte_range(): use pte_offset_map_nolock(), and allow for it to fail;
but with a comment on some further assumptions that are being made there.
zap_pte_range() and zap_pmd_range(): adjust their interaction so that
a pte_offset_map_lock() failure in zap_pte_range() leads to a retry in
zap_pmd_range(); remove call to pmd_none_or_trans_huge_or_clear_bad().
Allow pte_offset_map_lock() to fail in many functions. Update comment
on calling pte_alloc() in do_anonymous_page(). Remove redundant calls
to pmd_trans_unstable(), pmd_devmap_trans_unstable(), pmd_none() and
pmd_bad(); but leave pmd_none_or_clear_bad() calls in free_pmd_range()
and copy_pmd_range(), those do simplify the next level down.
Signed-off-by: Hugh Dickins <[email protected]>
---
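Not part of the patch: the new zap_pmd_range()/zap_pte_range() interaction
in miniature, with invented sketch_* names. The pte level returns the
address it actually reached; if that falls short of next (because
pte_offset_map_lock() failed, or a forced TLB flush stopped the scan
early), the pmd level steps pmd back and covers the same range again.

#include <linux/mm.h>
#include <linux/sched.h>

/* Sketch only: return how far we actually got */
static unsigned long sketch_zap_pte_level(struct mm_struct *mm, pmd_t *pmd,
					  unsigned long addr, unsigned long end)
{
	spinlock_t *ptl;
	pte_t *start_pte, *pte;

	start_pte = pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
	if (!pte)
		return addr;	/* no progress: caller re-examines this pmd */
	do {
		/* ... zap *pte (the real code may stop early to flush) ... */
	} while (pte++, addr += PAGE_SIZE, addr != end);
	pte_unmap_unlock(start_pte, ptl);
	return addr;
}

/* Sketch only: addr != next means "do this pmd entry over again" */
static unsigned long sketch_zap_pmd_level(struct mm_struct *mm, pud_t *pud,
					  unsigned long addr, unsigned long end)
{
	pmd_t *pmd = pmd_offset(pud, addr);
	unsigned long next;

	do {
		next = pmd_addr_end(addr, end);
		if (pmd_none(*pmd) || pmd_leaf(*pmd)) {
			/* the real code splits or zaps huge pmds here */
			addr = next;
			continue;
		}
		addr = sketch_zap_pte_level(mm, pmd, addr, next);
		if (addr != next)
			pmd--;	/* cancel the pmd++ below: same entry again */
	} while (pmd++, cond_resched(), addr != end);

	return addr;
}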
mm/memory.c | 172 +++++++++++++++++++++++++---------------------------
1 file changed, 82 insertions(+), 90 deletions(-)
diff --git a/mm/memory.c b/mm/memory.c
index 2eb54c0d5d3c..c7b920291a72 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1012,13 +1012,25 @@ copy_pte_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
progress = 0;
init_rss_vec(rss);
+ /*
+ * copy_pmd_range()'s prior pmd_none_or_clear_bad(src_pmd), and the
+ * error handling here, assume that exclusive mmap_lock on dst and src
+ * protects anon from unexpected THP transitions; with shmem and file
+ * protected by mmap_lock-less collapse skipping areas with anon_vma
+ * (whereas vma_needs_copy() skips areas without anon_vma). A rework
+ * can remove such assumptions later, but this is good enough for now.
+ */
dst_pte = pte_alloc_map_lock(dst_mm, dst_pmd, addr, &dst_ptl);
if (!dst_pte) {
ret = -ENOMEM;
goto out;
}
- src_pte = pte_offset_map(src_pmd, addr);
- src_ptl = pte_lockptr(src_mm, src_pmd);
+ src_pte = pte_offset_map_nolock(src_mm, src_pmd, addr, &src_ptl);
+ if (!src_pte) {
+ pte_unmap_unlock(dst_pte, dst_ptl);
+ /* ret == 0 */
+ goto out;
+ }
spin_lock_nested(src_ptl, SINGLE_DEPTH_NESTING);
orig_src_pte = src_pte;
orig_dst_pte = dst_pte;
@@ -1083,8 +1095,7 @@ copy_pte_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
} while (dst_pte++, src_pte++, addr += PAGE_SIZE, addr != end);
arch_leave_lazy_mmu_mode();
- spin_unlock(src_ptl);
- pte_unmap(orig_src_pte);
+ pte_unmap_unlock(orig_src_pte, src_ptl);
add_mm_rss_vec(dst_mm, rss);
pte_unmap_unlock(orig_dst_pte, dst_ptl);
cond_resched();
@@ -1388,10 +1399,11 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
swp_entry_t entry;
tlb_change_page_size(tlb, PAGE_SIZE);
-again:
init_rss_vec(rss);
- start_pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
- pte = start_pte;
+ start_pte = pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
+ if (!pte)
+ return addr;
+
flush_tlb_batched_pending(mm);
arch_enter_lazy_mmu_mode();
do {
@@ -1507,17 +1519,10 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
* If we forced a TLB flush (either due to running out of
* batch buffers or because we needed to flush dirty TLB
* entries before releasing the ptl), free the batched
- * memory too. Restart if we didn't do everything.
+ * memory too. Come back again if we didn't do everything.
*/
- if (force_flush) {
- force_flush = 0;
+ if (force_flush)
tlb_flush_mmu(tlb);
- }
-
- if (addr != end) {
- cond_resched();
- goto again;
- }
return addr;
}
@@ -1536,8 +1541,10 @@ static inline unsigned long zap_pmd_range(struct mmu_gather *tlb,
if (is_swap_pmd(*pmd) || pmd_trans_huge(*pmd) || pmd_devmap(*pmd)) {
if (next - addr != HPAGE_PMD_SIZE)
__split_huge_pmd(vma, pmd, addr, false, NULL);
- else if (zap_huge_pmd(tlb, vma, pmd, addr))
- goto next;
+ else if (zap_huge_pmd(tlb, vma, pmd, addr)) {
+ addr = next;
+ continue;
+ }
/* fall through */
} else if (details && details->single_folio &&
folio_test_pmd_mappable(details->single_folio) &&
@@ -1550,20 +1557,14 @@ static inline unsigned long zap_pmd_range(struct mmu_gather *tlb,
*/
spin_unlock(ptl);
}
-
- /*
- * Here there can be other concurrent MADV_DONTNEED or
- * trans huge page faults running, and if the pmd is
- * none or trans huge it can change under us. This is
- * because MADV_DONTNEED holds the mmap_lock in read
- * mode.
- */
- if (pmd_none_or_trans_huge_or_clear_bad(pmd))
- goto next;
- next = zap_pte_range(tlb, vma, pmd, addr, next, details);
-next:
- cond_resched();
- } while (pmd++, addr = next, addr != end);
+ if (pmd_none(*pmd)) {
+ addr = next;
+ continue;
+ }
+ addr = zap_pte_range(tlb, vma, pmd, addr, next, details);
+ if (addr != next)
+ pmd--;
+ } while (pmd++, cond_resched(), addr != end);
return addr;
}
@@ -1905,6 +1906,10 @@ static int insert_pages(struct vm_area_struct *vma, unsigned long addr,
const int batch_size = min_t(int, pages_to_write_in_pmd, 8);
start_pte = pte_offset_map_lock(mm, pmd, addr, &pte_lock);
+ if (!start_pte) {
+ ret = -EFAULT;
+ goto out;
+ }
for (pte = start_pte; pte_idx < batch_size; ++pte, ++pte_idx) {
int err = insert_page_in_batch_locked(vma, pte,
addr, pages[curr_page_idx], prot);
@@ -2572,10 +2577,10 @@ static int apply_to_pte_range(struct mm_struct *mm, pmd_t *pmd,
mapped_pte = pte = (mm == &init_mm) ?
pte_offset_kernel(pmd, addr) :
pte_offset_map_lock(mm, pmd, addr, &ptl);
+ if (!pte)
+ return -EINVAL;
}
- BUG_ON(pmd_huge(*pmd));
-
arch_enter_lazy_mmu_mode();
if (fn) {
@@ -2804,7 +2809,6 @@ static inline int __wp_page_copy_user(struct page *dst, struct page *src,
int ret;
void *kaddr;
void __user *uaddr;
- bool locked = false;
struct vm_area_struct *vma = vmf->vma;
struct mm_struct *mm = vma->vm_mm;
unsigned long addr = vmf->address;
@@ -2830,12 +2834,12 @@ static inline int __wp_page_copy_user(struct page *dst, struct page *src,
* On architectures with software "accessed" bits, we would
* take a double page fault, so mark it accessed here.
*/
+ vmf->pte = NULL;
if (!arch_has_hw_pte_young() && !pte_young(vmf->orig_pte)) {
pte_t entry;
vmf->pte = pte_offset_map_lock(mm, vmf->pmd, addr, &vmf->ptl);
- locked = true;
- if (!likely(pte_same(*vmf->pte, vmf->orig_pte))) {
+ if (unlikely(!vmf->pte || !pte_same(*vmf->pte, vmf->orig_pte))) {
/*
* Other thread has already handled the fault
* and update local tlb only
@@ -2857,13 +2861,12 @@ static inline int __wp_page_copy_user(struct page *dst, struct page *src,
* zeroes.
*/
if (__copy_from_user_inatomic(kaddr, uaddr, PAGE_SIZE)) {
- if (locked)
+ if (vmf->pte)
goto warn;
/* Re-validate under PTL if the page is still mapped */
vmf->pte = pte_offset_map_lock(mm, vmf->pmd, addr, &vmf->ptl);
- locked = true;
- if (!likely(pte_same(*vmf->pte, vmf->orig_pte))) {
+ if (unlikely(!vmf->pte || !pte_same(*vmf->pte, vmf->orig_pte))) {
/* The PTE changed under us, update local tlb */
update_mmu_tlb(vma, addr, vmf->pte);
ret = -EAGAIN;
@@ -2888,7 +2891,7 @@ static inline int __wp_page_copy_user(struct page *dst, struct page *src,
ret = 0;
pte_unlock:
- if (locked)
+ if (vmf->pte)
pte_unmap_unlock(vmf->pte, vmf->ptl);
kunmap_atomic(kaddr);
flush_dcache_page(dst);
@@ -3110,7 +3113,7 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
* Re-check the pte - we dropped the lock
*/
vmf->pte = pte_offset_map_lock(mm, vmf->pmd, vmf->address, &vmf->ptl);
- if (likely(pte_same(*vmf->pte, vmf->orig_pte))) {
+ if (likely(vmf->pte && pte_same(*vmf->pte, vmf->orig_pte))) {
if (old_folio) {
if (!folio_test_anon(old_folio)) {
dec_mm_counter(mm, mm_counter_file(&old_folio->page));
@@ -3178,19 +3181,20 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
/* Free the old page.. */
new_folio = old_folio;
page_copied = 1;
- } else {
+ pte_unmap_unlock(vmf->pte, vmf->ptl);
+ } else if (vmf->pte) {
update_mmu_tlb(vma, vmf->address, vmf->pte);
+ pte_unmap_unlock(vmf->pte, vmf->ptl);
}
- if (new_folio)
- folio_put(new_folio);
-
- pte_unmap_unlock(vmf->pte, vmf->ptl);
/*
* No need to double call mmu_notifier->invalidate_range() callback as
* the above ptep_clear_flush_notify() did already call it.
*/
mmu_notifier_invalidate_range_only_end(&range);
+
+ if (new_folio)
+ folio_put(new_folio);
if (old_folio) {
if (page_copied)
free_swap_cache(&old_folio->page);
@@ -3230,6 +3234,8 @@ vm_fault_t finish_mkwrite_fault(struct vm_fault *vmf)
WARN_ON_ONCE(!(vmf->vma->vm_flags & VM_SHARED));
vmf->pte = pte_offset_map_lock(vmf->vma->vm_mm, vmf->pmd, vmf->address,
&vmf->ptl);
+ if (!vmf->pte)
+ return VM_FAULT_NOPAGE;
/*
* We might have raced with another page fault while we released the
* pte_offset_map_lock.
@@ -3591,10 +3597,11 @@ static vm_fault_t remove_device_exclusive_entry(struct vm_fault *vmf)
vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
&vmf->ptl);
- if (likely(pte_same(*vmf->pte, vmf->orig_pte)))
+ if (likely(vmf->pte && pte_same(*vmf->pte, vmf->orig_pte)))
restore_exclusive_pte(vma, vmf->page, vmf->address, vmf->pte);
- pte_unmap_unlock(vmf->pte, vmf->ptl);
+ if (vmf->pte)
+ pte_unmap_unlock(vmf->pte, vmf->ptl);
folio_unlock(folio);
folio_put(folio);
@@ -3625,6 +3632,8 @@ static vm_fault_t pte_marker_clear(struct vm_fault *vmf)
{
vmf->pte = pte_offset_map_lock(vmf->vma->vm_mm, vmf->pmd,
vmf->address, &vmf->ptl);
+ if (!vmf->pte)
+ return 0;
/*
* Be careful so that we will only recover a special uffd-wp pte into a
* none pte. Otherwise it means the pte could have changed, so retry.
@@ -3728,11 +3737,9 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
vmf->page = pfn_swap_entry_to_page(entry);
vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
vmf->address, &vmf->ptl);
- if (unlikely(!pte_same(*vmf->pte, vmf->orig_pte))) {
- spin_unlock(vmf->ptl);
- goto out;
- }
-
+ if (unlikely(!vmf->pte ||
+ !pte_same(*vmf->pte, vmf->orig_pte)))
+ goto unlock;
/*
* Get a page reference while we know the page can't be
* freed.
@@ -3807,7 +3814,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
*/
vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
vmf->address, &vmf->ptl);
- if (likely(pte_same(*vmf->pte, vmf->orig_pte)))
+ if (likely(vmf->pte && pte_same(*vmf->pte, vmf->orig_pte)))
ret = VM_FAULT_OOM;
goto unlock;
}
@@ -3877,7 +3884,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
*/
vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
&vmf->ptl);
- if (unlikely(!pte_same(*vmf->pte, vmf->orig_pte)))
+ if (unlikely(!vmf->pte || !pte_same(*vmf->pte, vmf->orig_pte)))
goto out_nomap;
if (unlikely(!folio_test_uptodate(folio))) {
@@ -4003,13 +4010,15 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
/* No need to invalidate - it was non-present before */
update_mmu_cache(vma, vmf->address, vmf->pte);
unlock:
- pte_unmap_unlock(vmf->pte, vmf->ptl);
+ if (vmf->pte)
+ pte_unmap_unlock(vmf->pte, vmf->ptl);
out:
if (si)
put_swap_device(si);
return ret;
out_nomap:
- pte_unmap_unlock(vmf->pte, vmf->ptl);
+ if (vmf->pte)
+ pte_unmap_unlock(vmf->pte, vmf->ptl);
out_page:
folio_unlock(folio);
out_release:
@@ -4041,22 +4050,12 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
return VM_FAULT_SIGBUS;
/*
- * Use pte_alloc() instead of pte_alloc_map(). We can't run
- * pte_offset_map() on pmds where a huge pmd might be created
- * from a different thread.
- *
- * pte_alloc_map() is safe to use under mmap_write_lock(mm) or when
- * parallel threads are excluded by other means.
- *
- * Here we only have mmap_read_lock(mm).
+ * Use pte_alloc() instead of pte_alloc_map(), so that OOM can
+ * be distinguished from a transient failure of pte_offset_map().
*/
if (pte_alloc(vma->vm_mm, vmf->pmd))
return VM_FAULT_OOM;
- /* See comment in handle_pte_fault() */
- if (unlikely(pmd_trans_unstable(vmf->pmd)))
- return 0;
-
/* Use the zero-page for reads */
if (!(vmf->flags & FAULT_FLAG_WRITE) &&
!mm_forbids_zeropage(vma->vm_mm)) {
@@ -4064,6 +4063,8 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
vma->vm_page_prot));
vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
vmf->address, &vmf->ptl);
+ if (!vmf->pte)
+ goto unlock;
if (vmf_pte_changed(vmf)) {
update_mmu_tlb(vma, vmf->address, vmf->pte);
goto unlock;
@@ -4104,6 +4105,8 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
&vmf->ptl);
+ if (!vmf->pte)
+ goto release;
if (vmf_pte_changed(vmf)) {
update_mmu_tlb(vma, vmf->address, vmf->pte);
goto release;
@@ -4131,7 +4134,8 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
/* No need to invalidate - it was non-present before */
update_mmu_cache(vma, vmf->address, vmf->pte);
unlock:
- pte_unmap_unlock(vmf->pte, vmf->ptl);
+ if (vmf->pte)
+ pte_unmap_unlock(vmf->pte, vmf->ptl);
return ret;
release:
folio_put(folio);
@@ -4380,15 +4384,10 @@ vm_fault_t finish_fault(struct vm_fault *vmf)
return VM_FAULT_OOM;
}
- /*
- * See comment in handle_pte_fault() for how this scenario happens, we
- * need to return NOPAGE so that we drop this page.
- */
- if (pmd_devmap_trans_unstable(vmf->pmd))
- return VM_FAULT_NOPAGE;
-
vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
vmf->address, &vmf->ptl);
+ if (!vmf->pte)
+ return VM_FAULT_NOPAGE;
/* Re-check under ptl */
if (likely(!vmf_pte_changed(vmf))) {
@@ -4630,17 +4629,11 @@ static vm_fault_t do_fault(struct vm_fault *vmf)
* The VMA was not fully populated on mmap() or missing VM_DONTEXPAND
*/
if (!vma->vm_ops->fault) {
- /*
- * If we find a migration pmd entry or a none pmd entry, which
- * should never happen, return SIGBUS
- */
- if (unlikely(!pmd_present(*vmf->pmd)))
+ vmf->pte = pte_offset_map_lock(vmf->vma->vm_mm, vmf->pmd,
+ vmf->address, &vmf->ptl);
+ if (unlikely(!vmf->pte))
ret = VM_FAULT_SIGBUS;
else {
- vmf->pte = pte_offset_map_lock(vmf->vma->vm_mm,
- vmf->pmd,
- vmf->address,
- &vmf->ptl);
/*
* Make sure this is not a temporary clearing of pte
* by holding ptl and checking again. A R/M/W update
@@ -5429,10 +5422,9 @@ int follow_pte(struct mm_struct *mm, unsigned long address,
pmd = pmd_offset(pud, address);
VM_BUG_ON(pmd_trans_huge(*pmd));
- if (pmd_none(*pmd) || unlikely(pmd_bad(*pmd)))
- goto out;
-
ptep = pte_offset_map_lock(mm, pmd, address, ptlp);
+ if (!ptep)
+ goto out;
if (!pte_present(*ptep))
goto unlock;
*ptepp = ptep;
--
2.35.3
hmm_vma_walk_pmd() is called through mm_walk, but already has a goto
again loop of its own, so take part in that if pte_offset_map() fails.
Signed-off-by: Hugh Dickins <[email protected]>
Reviewed-by: Alistair Popple <[email protected]>
---
mm/hmm.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/mm/hmm.c b/mm/hmm.c
index e23043345615..b1a9159d7c92 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -381,6 +381,8 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp,
}
ptep = pte_offset_map(pmdp, addr);
+ if (!ptep)
+ goto again;
for (; addr < end; addr += PAGE_SIZE, ptep++, hmm_pfns++) {
int r;
--
2.35.3
No functional change here, but adjust the format of map_pte() so that the
following commit will be easier to read: separate out the PVMW_SYNC case
first, and remove two levels of indentation from the ZONE_DEVICE case.
Signed-off-by: Hugh Dickins <[email protected]>
---
mm/page_vma_mapped.c | 65 +++++++++++++++++++++++---------------------
1 file changed, 34 insertions(+), 31 deletions(-)
diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
index 007dc7456f0e..947dc7491815 100644
--- a/mm/page_vma_mapped.c
+++ b/mm/page_vma_mapped.c
@@ -15,38 +15,41 @@ static inline bool not_found(struct page_vma_mapped_walk *pvmw)
static bool map_pte(struct page_vma_mapped_walk *pvmw)
{
- pvmw->pte = pte_offset_map(pvmw->pmd, pvmw->address);
- if (!(pvmw->flags & PVMW_SYNC)) {
- if (pvmw->flags & PVMW_MIGRATION) {
- if (!is_swap_pte(*pvmw->pte))
- return false;
- } else {
- /*
- * We get here when we are trying to unmap a private
- * device page from the process address space. Such
- * page is not CPU accessible and thus is mapped as
- * a special swap entry, nonetheless it still does
- * count as a valid regular mapping for the page (and
- * is accounted as such in page maps count).
- *
- * So handle this special case as if it was a normal
- * page mapping ie lock CPU page table and returns
- * true.
- *
- * For more details on device private memory see HMM
- * (include/linux/hmm.h or mm/hmm.c).
- */
- if (is_swap_pte(*pvmw->pte)) {
- swp_entry_t entry;
+ if (pvmw->flags & PVMW_SYNC) {
+ /* Use the stricter lookup */
+ pvmw->pte = pte_offset_map_lock(pvmw->vma->vm_mm, pvmw->pmd,
+ pvmw->address, &pvmw->ptl);
+ return true;
+ }
- /* Handle un-addressable ZONE_DEVICE memory */
- entry = pte_to_swp_entry(*pvmw->pte);
- if (!is_device_private_entry(entry) &&
- !is_device_exclusive_entry(entry))
- return false;
- } else if (!pte_present(*pvmw->pte))
- return false;
- }
+ pvmw->pte = pte_offset_map(pvmw->pmd, pvmw->address);
+ if (pvmw->flags & PVMW_MIGRATION) {
+ if (!is_swap_pte(*pvmw->pte))
+ return false;
+ } else if (is_swap_pte(*pvmw->pte)) {
+ swp_entry_t entry;
+ /*
+ * Handle un-addressable ZONE_DEVICE memory.
+ *
+ * We get here when we are trying to unmap a private
+ * device page from the process address space. Such
+ * page is not CPU accessible and thus is mapped as
+ * a special swap entry, nonetheless it still does
+ * count as a valid regular mapping for the page
+ * (and is accounted as such in page maps count).
+ *
+ * So handle this special case as if it was a normal
+ * page mapping ie lock CPU page table and return true.
+ *
+ * For more details on device private memory see HMM
+ * (include/linux/hmm.h or mm/hmm.c).
+ */
+ entry = pte_to_swp_entry(*pvmw->pte);
+ if (!is_device_private_entry(entry) &&
+ !is_device_exclusive_entry(entry))
+ return false;
+ } else if (!pte_present(*pvmw->pte)) {
+ return false;
}
pvmw->ptl = pte_lockptr(pvmw->vma->vm_mm, pvmw->pmd);
spin_lock(pvmw->ptl);
--
2.35.3
Following the examples of nearby code, various functions can just give
up if pte_offset_map() or pte_offset_map_lock() fails. And there's no
need for a preliminary pmd_trans_unstable() or other such check, since
such cases are now safely handled inside.
Signed-off-by: Hugh Dickins <[email protected]>
---
mm/gup.c | 9 ++++++---
mm/ksm.c | 7 ++++---
mm/memcontrol.c | 8 ++++----
mm/memory-failure.c | 8 +++++---
mm/migrate.c | 3 +++
5 files changed, 22 insertions(+), 13 deletions(-)
diff --git a/mm/gup.c b/mm/gup.c
index 3bd5d3854c51..bb67193c5460 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -544,10 +544,10 @@ static struct page *follow_page_pte(struct vm_area_struct *vma,
if (WARN_ON_ONCE((flags & (FOLL_PIN | FOLL_GET)) ==
(FOLL_PIN | FOLL_GET)))
return ERR_PTR(-EINVAL);
- if (unlikely(pmd_bad(*pmd)))
- return no_page_table(vma, flags);
ptep = pte_offset_map_lock(mm, pmd, address, &ptl);
+ if (!ptep)
+ return no_page_table(vma, flags);
pte = *ptep;
if (!pte_present(pte))
goto no_page;
@@ -851,8 +851,9 @@ static int get_gate_page(struct mm_struct *mm, unsigned long address,
pmd = pmd_offset(pud, address);
if (!pmd_present(*pmd))
return -EFAULT;
- VM_BUG_ON(pmd_trans_huge(*pmd));
pte = pte_offset_map(pmd, address);
+ if (!pte)
+ return -EFAULT;
if (pte_none(*pte))
goto unmap;
*vma = get_gate_vma(mm);
@@ -2377,6 +2378,8 @@ static int gup_pte_range(pmd_t pmd, pmd_t *pmdp, unsigned long addr,
pte_t *ptep, *ptem;
ptem = ptep = pte_offset_map(&pmd, addr);
+ if (!ptep)
+ return 0;
do {
pte_t pte = ptep_get_lockless(ptep);
struct page *page;
diff --git a/mm/ksm.c b/mm/ksm.c
index df2aa281d49d..3dc15459dd20 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -431,10 +431,9 @@ static int break_ksm_pmd_entry(pmd_t *pmd, unsigned long addr, unsigned long nex
pte_t *pte;
int ret;
- if (pmd_leaf(*pmd) || !pmd_present(*pmd))
- return 0;
-
pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl);
+ if (!pte)
+ return 0;
if (pte_present(*pte)) {
page = vm_normal_page(walk->vma, addr, *pte);
} else if (!pte_none(*pte)) {
@@ -1203,6 +1202,8 @@ static int replace_page(struct vm_area_struct *vma, struct page *page,
mmu_notifier_invalidate_range_start(&range);
ptep = pte_offset_map_lock(mm, pmd, addr, &ptl);
+ if (!ptep)
+ goto out_mn;
if (!pte_same(*ptep, orig_pte)) {
pte_unmap_unlock(ptep, ptl);
goto out_mn;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 4b27e245a055..fdd953655fe1 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -6057,9 +6057,9 @@ static int mem_cgroup_count_precharge_pte_range(pmd_t *pmd,
return 0;
}
- if (pmd_trans_unstable(pmd))
- return 0;
pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
+ if (!pte)
+ return 0;
for (; addr != end; pte++, addr += PAGE_SIZE)
if (get_mctgt_type(vma, addr, *pte, NULL))
mc.precharge++; /* increment precharge temporarily */
@@ -6277,10 +6277,10 @@ static int mem_cgroup_move_charge_pte_range(pmd_t *pmd,
return 0;
}
- if (pmd_trans_unstable(pmd))
- return 0;
retry:
pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
+ if (!pte)
+ return 0;
for (; addr != end; addr += PAGE_SIZE) {
pte_t ptent = *(pte++);
bool device = false;
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 5b663eca1f29..b3cc8f213fe3 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -414,6 +414,8 @@ static unsigned long dev_pagemap_mapping_shift(struct vm_area_struct *vma,
if (pmd_devmap(*pmd))
return PMD_SHIFT;
pte = pte_offset_map(pmd, address);
+ if (!pte)
+ return 0;
if (pte_present(*pte) && pte_devmap(*pte))
ret = PAGE_SHIFT;
pte_unmap(pte);
@@ -800,11 +802,11 @@ static int hwpoison_pte_range(pmd_t *pmdp, unsigned long addr,
goto out;
}
- if (pmd_trans_unstable(pmdp))
- goto out;
-
mapped_pte = ptep = pte_offset_map_lock(walk->vma->vm_mm, pmdp,
addr, &ptl);
+ if (!ptep)
+ goto out;
+
for (; addr != end; ptep++, addr += PAGE_SIZE) {
ret = check_hwpoisoned_entry(*ptep, addr, PAGE_SHIFT,
hwp->pfn, &hwp->tk);
diff --git a/mm/migrate.c b/mm/migrate.c
index 3ecb7a40075f..308a56f0b156 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -305,6 +305,9 @@ void migration_entry_wait(struct mm_struct *mm, pmd_t *pmd,
swp_entry_t entry;
ptep = pte_offset_map_lock(mm, pmd, address, &ptl);
+ if (!ptep)
+ return;
+
pte = *ptep;
pte_unmap(ptep);
--
2.35.3
migrate_vma_collect_pmd(): remove the pmd_trans_unstable() handling after
splitting huge zero pmd, and the pmd_none() handling after successfully
splitting huge page: those are now managed inside pte_offset_map_lock(),
and by "goto again" when it fails.
But the skip after unsuccessful split_huge_page() must stay: it avoids an
endless loop. The skip when pmd_bad()? Remove that: it will be treated
as a hole rather than a skip once cleared by pte_offset_map_lock(), but
with different timing that would be so anyway; and it's arguably best to
leave the pmd_bad() handling centralized there.
migrate_vma_insert_page(): remove comment on the old pte_offset_map()
and old locking limitations; remove the pmd_trans_unstable() check and
just proceed to pte_offset_map_lock(), aborting when it fails (page has
been charged to memcg, but as in other cases, it's uncharged when freed).
Signed-off-by: Hugh Dickins <[email protected]>
Reviewed-by: Alistair Popple <[email protected]>
---
mm/migrate_device.c | 31 ++++---------------------------
1 file changed, 4 insertions(+), 27 deletions(-)
diff --git a/mm/migrate_device.c b/mm/migrate_device.c
index d30c9de60b0d..a14af6b12b04 100644
--- a/mm/migrate_device.c
+++ b/mm/migrate_device.c
@@ -83,9 +83,6 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
if (is_huge_zero_page(page)) {
spin_unlock(ptl);
split_huge_pmd(vma, pmdp, addr);
- if (pmd_trans_unstable(pmdp))
- return migrate_vma_collect_skip(start, end,
- walk);
} else {
int ret;
@@ -100,16 +97,12 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
if (ret)
return migrate_vma_collect_skip(start, end,
walk);
- if (pmd_none(*pmdp))
- return migrate_vma_collect_hole(start, end, -1,
- walk);
}
}
- if (unlikely(pmd_bad(*pmdp)))
- return migrate_vma_collect_skip(start, end, walk);
-
ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);
+ if (!ptep)
+ goto again;
arch_enter_lazy_mmu_mode();
for (; addr < end; addr += PAGE_SIZE, ptep++) {
@@ -595,27 +588,10 @@ static void migrate_vma_insert_page(struct migrate_vma *migrate,
pmdp = pmd_alloc(mm, pudp, addr);
if (!pmdp)
goto abort;
-
if (pmd_trans_huge(*pmdp) || pmd_devmap(*pmdp))
goto abort;
-
- /*
- * Use pte_alloc() instead of pte_alloc_map(). We can't run
- * pte_offset_map() on pmds where a huge pmd might be created
- * from a different thread.
- *
- * pte_alloc_map() is safe to use under mmap_write_lock(mm) or when
- * parallel threads are excluded by other means.
- *
- * Here we only have mmap_read_lock(mm).
- */
if (pte_alloc(mm, pmdp))
goto abort;
-
- /* See the comment in pte_alloc_one_map() */
- if (unlikely(pmd_trans_unstable(pmdp)))
- goto abort;
-
if (unlikely(anon_vma_prepare(vma)))
goto abort;
if (mem_cgroup_charge(page_folio(page), vma->vm_mm, GFP_KERNEL))
@@ -650,7 +626,8 @@ static void migrate_vma_insert_page(struct migrate_vma *migrate,
}
ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);
-
+ if (!ptep)
+ goto abort;
if (check_stable_address_space(mm))
goto unlock_abort;
--
2.35.3
Delete pmd_trans_unstable, pmd_none_or_trans_huge_or_clear_bad() and
pmd_devmap_trans_unstable(), all now unused.
With mixed feelings, delete all the comments on pmd_trans_unstable().
That was very good documentation of a subtle state, and this series does
not even eliminate that state: but rather, normalizes and extends it,
asking pte_offset_map[_lock]() callers to anticipate failure, without
regard for whether mmap_read_lock() or mmap_write_lock() is held.
Retain pud_trans_unstable(), which has one use in __handle_mm_fault(),
but delete its equivalent pud_none_or_trans_huge_or_dev_or_clear_bad().
While there, move the default arch_needs_pgtable_deposit() definition
up near where pgtable_trans_huge_deposit() and withdraw() are declared.
Signed-off-by: Hugh Dickins <[email protected]>
---
include/linux/pgtable.h | 103 +++-------------------------------------
mm/khugepaged.c | 4 --
2 files changed, 7 insertions(+), 100 deletions(-)
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 3fabbb018557..a1326e61d7ee 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -599,6 +599,10 @@ extern void pgtable_trans_huge_deposit(struct mm_struct *mm, pmd_t *pmdp,
extern pgtable_t pgtable_trans_huge_withdraw(struct mm_struct *mm, pmd_t *pmdp);
#endif
+#ifndef arch_needs_pgtable_deposit
+#define arch_needs_pgtable_deposit() (false)
+#endif
+
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
/*
* This is an implementation of pmdp_establish() that is only suitable for an
@@ -1300,9 +1304,10 @@ static inline int pud_trans_huge(pud_t pud)
}
#endif
-/* See pmd_none_or_trans_huge_or_clear_bad for discussion. */
-static inline int pud_none_or_trans_huge_or_dev_or_clear_bad(pud_t *pud)
+static inline int pud_trans_unstable(pud_t *pud)
{
+#if defined(CONFIG_TRANSPARENT_HUGEPAGE) && \
+ defined(CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD)
pud_t pudval = READ_ONCE(*pud);
if (pud_none(pudval) || pud_trans_huge(pudval) || pud_devmap(pudval))
@@ -1311,104 +1316,10 @@ static inline int pud_none_or_trans_huge_or_dev_or_clear_bad(pud_t *pud)
pud_clear_bad(pud);
return 1;
}
- return 0;
-}
-
-/* See pmd_trans_unstable for discussion. */
-static inline int pud_trans_unstable(pud_t *pud)
-{
-#if defined(CONFIG_TRANSPARENT_HUGEPAGE) && \
- defined(CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD)
- return pud_none_or_trans_huge_or_dev_or_clear_bad(pud);
-#else
- return 0;
#endif
-}
-
-#ifndef arch_needs_pgtable_deposit
-#define arch_needs_pgtable_deposit() (false)
-#endif
-/*
- * This function is meant to be used by sites walking pagetables with
- * the mmap_lock held in read mode to protect against MADV_DONTNEED and
- * transhuge page faults. MADV_DONTNEED can convert a transhuge pmd
- * into a null pmd and the transhuge page fault can convert a null pmd
- * into an hugepmd or into a regular pmd (if the hugepage allocation
- * fails). While holding the mmap_lock in read mode the pmd becomes
- * stable and stops changing under us only if it's not null and not a
- * transhuge pmd. When those races occurs and this function makes a
- * difference vs the standard pmd_none_or_clear_bad, the result is
- * undefined so behaving like if the pmd was none is safe (because it
- * can return none anyway). The compiler level barrier() is critically
- * important to compute the two checks atomically on the same pmdval.
- *
- * For 32bit kernels with a 64bit large pmd_t this automatically takes
- * care of reading the pmd atomically to avoid SMP race conditions
- * against pmd_populate() when the mmap_lock is hold for reading by the
- * caller (a special atomic read not done by "gcc" as in the generic
- * version above, is also needed when THP is disabled because the page
- * fault can populate the pmd from under us).
- */
-static inline int pmd_none_or_trans_huge_or_clear_bad(pmd_t *pmd)
-{
- pmd_t pmdval = pmdp_get_lockless(pmd);
- /*
- * !pmd_present() checks for pmd migration entries
- *
- * The complete check uses is_pmd_migration_entry() in linux/swapops.h
- * But using that requires moving current function and pmd_trans_unstable()
- * to linux/swapops.h to resolve dependency, which is too much code move.
- *
- * !pmd_present() is equivalent to is_pmd_migration_entry() currently,
- * because !pmd_present() pages can only be under migration not swapped
- * out.
- *
- * pmd_none() is preserved for future condition checks on pmd migration
- * entries and not confusing with this function name, although it is
- * redundant with !pmd_present().
- */
- if (pmd_none(pmdval) || pmd_trans_huge(pmdval) ||
- (IS_ENABLED(CONFIG_ARCH_ENABLE_THP_MIGRATION) && !pmd_present(pmdval)))
- return 1;
- if (unlikely(pmd_bad(pmdval))) {
- pmd_clear_bad(pmd);
- return 1;
- }
return 0;
}
-/*
- * This is a noop if Transparent Hugepage Support is not built into
- * the kernel. Otherwise it is equivalent to
- * pmd_none_or_trans_huge_or_clear_bad(), and shall only be called in
- * places that already verified the pmd is not none and they want to
- * walk ptes while holding the mmap sem in read mode (write mode don't
- * need this). If THP is not enabled, the pmd can't go away under the
- * code even if MADV_DONTNEED runs, but if THP is enabled we need to
- * run a pmd_trans_unstable before walking the ptes after
- * split_huge_pmd returns (because it may have run when the pmd become
- * null, but then a page fault can map in a THP and not a regular page).
- */
-static inline int pmd_trans_unstable(pmd_t *pmd)
-{
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
- return pmd_none_or_trans_huge_or_clear_bad(pmd);
-#else
- return 0;
-#endif
-}
-
-/*
- * the ordering of these checks is important for pmds with _page_devmap set.
- * if we check pmd_trans_unstable() first we will trip the bad_pmd() check
- * inside of pmd_none_or_trans_huge_or_clear_bad(). this will end up correctly
- * returning 1 but not before it spams dmesg with the pmd_clear_bad() output.
- */
-static inline int pmd_devmap_trans_unstable(pmd_t *pmd)
-{
- return pmd_devmap(*pmd) || pmd_trans_unstable(pmd);
-}
-
#ifndef CONFIG_NUMA_BALANCING
/*
* Technically a PTE can be PROTNONE even when not doing NUMA balancing but
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index c11db2e78e95..1083f0e38a07 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -946,10 +946,6 @@ static int hugepage_vma_revalidate(struct mm_struct *mm, unsigned long address,
return SCAN_SUCCEED;
}
-/*
- * See pmd_trans_unstable() for how the result may change out from
- * underneath us, even if we hold mmap_lock in read.
- */
static int find_pmd_or_thp_or_none(struct mm_struct *mm,
unsigned long address,
pmd_t **pmd)
--
2.35.3
Adjust unuse_pte() and unuse_pte_range() to allow pte_offset_map_lock()
and pte_offset_map() failure; remove pmd_none_or_trans_huge_or_clear_bad()
from unuse_pmd_range() now that pte_offset_map() does all that itself.
Signed-off-by: Hugh Dickins <[email protected]>
---
mm/swapfile.c | 38 ++++++++++++++++++++------------------
1 file changed, 20 insertions(+), 18 deletions(-)
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 274bbf797480..12d204e6dae2 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1774,7 +1774,7 @@ static int unuse_pte(struct vm_area_struct *vma, pmd_t *pmd,
hwposioned = true;
pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
- if (unlikely(!pte_same_as_swp(*pte, swp_entry_to_pte(entry)))) {
+ if (unlikely(!pte || !pte_same_as_swp(*pte, swp_entry_to_pte(entry)))) {
ret = 0;
goto out;
}
@@ -1827,7 +1827,8 @@ static int unuse_pte(struct vm_area_struct *vma, pmd_t *pmd,
set_pte_at(vma->vm_mm, addr, pte, new_pte);
swap_free(entry);
out:
- pte_unmap_unlock(pte, ptl);
+ if (pte)
+ pte_unmap_unlock(pte, ptl);
if (page != swapcache) {
unlock_page(page);
put_page(page);
@@ -1839,17 +1840,22 @@ static int unuse_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
unsigned long addr, unsigned long end,
unsigned int type)
{
- swp_entry_t entry;
- pte_t *pte;
+ pte_t *pte = NULL;
struct swap_info_struct *si;
- int ret = 0;
si = swap_info[type];
- pte = pte_offset_map(pmd, addr);
do {
struct folio *folio;
unsigned long offset;
unsigned char swp_count;
+ swp_entry_t entry;
+ int ret;
+
+ if (!pte++) {
+ pte = pte_offset_map(pmd, addr);
+ if (!pte)
+ break;
+ }
if (!is_swap_pte(*pte))
continue;
@@ -1860,6 +1866,8 @@ static int unuse_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
offset = swp_offset(entry);
pte_unmap(pte);
+ pte = NULL;
+
folio = swap_cache_get_folio(entry, vma, addr);
if (!folio) {
struct page *page;
@@ -1878,8 +1886,7 @@ static int unuse_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
if (!folio) {
swp_count = READ_ONCE(si->swap_map[offset]);
if (swp_count == 0 || swp_count == SWAP_MAP_BAD)
- goto try_next;
-
+ continue;
return -ENOMEM;
}
@@ -1889,20 +1896,17 @@ static int unuse_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
if (ret < 0) {
folio_unlock(folio);
folio_put(folio);
- goto out;
+ return ret;
}
folio_free_swap(folio);
folio_unlock(folio);
folio_put(folio);
-try_next:
- pte = pte_offset_map(pmd, addr);
- } while (pte++, addr += PAGE_SIZE, addr != end);
- pte_unmap(pte - 1);
+ } while (addr += PAGE_SIZE, addr != end);
- ret = 0;
-out:
- return ret;
+ if (pte)
+ pte_unmap(pte);
+ return 0;
}
static inline int unuse_pmd_range(struct vm_area_struct *vma, pud_t *pud,
@@ -1917,8 +1921,6 @@ static inline int unuse_pmd_range(struct vm_area_struct *vma, pud_t *pud,
do {
cond_resched();
next = pmd_addr_end(addr, end);
- if (pmd_none_or_trans_huge_or_clear_bad(pmd))
- continue;
ret = unuse_pte_range(vma, pmd, addr, next, type);
if (ret)
return ret;
--
2.35.3
Failures here would be surprising: pte_advanced_tests(),
pte_clear_tests() and __page_table_check_pte_clear_range() each
issue a warning if pte_offset_map() or pte_offset_map_lock() fails.
Signed-off-by: Hugh Dickins <[email protected]>
---
mm/debug_vm_pgtable.c | 9 ++++++++-
mm/page_table_check.c | 2 ++
2 files changed, 10 insertions(+), 1 deletion(-)
diff --git a/mm/debug_vm_pgtable.c b/mm/debug_vm_pgtable.c
index c54177aabebd..ee119e33fef1 100644
--- a/mm/debug_vm_pgtable.c
+++ b/mm/debug_vm_pgtable.c
@@ -138,6 +138,9 @@ static void __init pte_advanced_tests(struct pgtable_debug_args *args)
return;
pr_debug("Validating PTE advanced\n");
+ if (WARN_ON(!args->ptep))
+ return;
+
pte = pfn_pte(args->pte_pfn, args->page_prot);
set_pte_at(args->mm, args->vaddr, args->ptep, pte);
flush_dcache_page(page);
@@ -619,6 +622,9 @@ static void __init pte_clear_tests(struct pgtable_debug_args *args)
* the unexpected overhead of cache flushing is acceptable.
*/
pr_debug("Validating PTE clear\n");
+ if (WARN_ON(!args->ptep))
+ return;
+
#ifndef CONFIG_RISCV
pte = __pte(pte_val(pte) | RANDOM_ORVALUE);
#endif
@@ -1377,7 +1383,8 @@ static int __init debug_vm_pgtable(void)
args.ptep = pte_offset_map_lock(args.mm, args.pmdp, args.vaddr, &ptl);
pte_clear_tests(&args);
pte_advanced_tests(&args);
- pte_unmap_unlock(args.ptep, ptl);
+ if (args.ptep)
+ pte_unmap_unlock(args.ptep, ptl);
ptl = pmd_lock(args.mm, args.pmdp);
pmd_clear_tests(&args);
diff --git a/mm/page_table_check.c b/mm/page_table_check.c
index f2baf97d5f38..b743a2f6bce0 100644
--- a/mm/page_table_check.c
+++ b/mm/page_table_check.c
@@ -246,6 +246,8 @@ void __page_table_check_pte_clear_range(struct mm_struct *mm,
pte_t *ptep = pte_offset_map(&pmd, addr);
unsigned long i;
+ if (WARN_ON(!ptep))
+ return;
for (i = 0; i < PTRS_PER_PTE; i++) {
__page_table_check_pte_clear(mm, addr, *ptep);
addr += PAGE_SIZE;
--
2.35.3
swap_vma_readahead() has been proceeding in an unconventional way, its
preliminary swap_ra_info() doing the pte_offset_map() and pte_unmap(),
then relying on that pte pointer even after the pte_unmap() - in its
CONFIG_64BIT case (I think !CONFIG_HIGHPTE was intended; whereas 32-bit
copied ptes to stack while they were mapped, but had to limit how many).
Though it would be difficult to construct a failing testcase, accessing
page table after pte_unmap() will become bad practice, even on 64-bit:
an rcu_read_unlock() in pte_unmap() will allow page table to be freed.
Move relevant definitions from include/linux/swap.h to mm/swap_state.c,
nothing else used them. Delete the CONFIG_64BIT distinction and buffer,
delete all reference to ptes from swap_ra_info(), use pte_offset_map()
repeatedly in swap_vma_readahead(), breaking from the loop if it fails.
(Will the repeated "map" and "unmap" show up as a slowdown anywhere?
If so, maybe modify __read_swap_cache_async() to do the pte_unmap()
only when it does not find the page already in the swapcache.)
Use ptep_get_lockless(), mainly for its READ_ONCE(). Correctly advance
the address passed down to each call of __read_swap_cache_async().
Signed-off-by: Hugh Dickins <[email protected]>
---
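Not part of the patch: an illustrative sketch of the per-pte lookup which
swap_vma_readahead() switches to, with invented names, and assuming (as
swap_ra_info() ensures by clamping to the pmd) that the window never
crosses a page-table boundary: map on demand, read the entry with
ptep_get_lockless(), and unmap before anything that might sleep.

#include <linux/mm.h>
#include <linux/swapops.h>

/* Sketch only: walk a readahead window with on-demand pte mapping */
static void sketch_ra_window(pmd_t *pmd, unsigned long start, unsigned int nr)
{
	unsigned long addr = start;
	pte_t *pte = NULL;
	unsigned int i;

	for (i = 0; i < nr; i++, addr += PAGE_SIZE) {
		pte_t pentry;
		swp_entry_t entry;

		if (!pte++) {
			pte = pte_offset_map(pmd, addr);
			if (!pte)
				break;	/* page table gone: stop early */
		}
		pentry = ptep_get_lockless(pte);   /* READ_ONCE of the entry */
		if (!is_swap_pte(pentry))
			continue;
		entry = pte_to_swp_entry(pentry);
		if (non_swap_entry(entry))
			continue;
		pte_unmap(pte);
		pte = NULL;
		/* ... start the asynchronous swapcache read for addr ... */
	}
	if (pte)
		pte_unmap(pte);
}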
include/linux/swap.h | 19 -------------------
mm/swap_state.c | 45 +++++++++++++++++++++++---------------------
2 files changed, 24 insertions(+), 40 deletions(-)
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 3c69cb653cb9..1b9f2d92fc10 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -337,25 +337,6 @@ struct swap_info_struct {
*/
};
-#ifdef CONFIG_64BIT
-#define SWAP_RA_ORDER_CEILING 5
-#else
-/* Avoid stack overflow, because we need to save part of page table */
-#define SWAP_RA_ORDER_CEILING 3
-#define SWAP_RA_PTE_CACHE_SIZE (1 << SWAP_RA_ORDER_CEILING)
-#endif
-
-struct vma_swap_readahead {
- unsigned short win;
- unsigned short offset;
- unsigned short nr_pte;
-#ifdef CONFIG_64BIT
- pte_t *ptes;
-#else
- pte_t ptes[SWAP_RA_PTE_CACHE_SIZE];
-#endif
-};
-
static inline swp_entry_t folio_swap_entry(struct folio *folio)
{
swp_entry_t entry = { .val = page_private(&folio->page) };
diff --git a/mm/swap_state.c b/mm/swap_state.c
index b76a65ac28b3..a43b41975da2 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -698,6 +698,14 @@ void exit_swap_address_space(unsigned int type)
swapper_spaces[type] = NULL;
}
+#define SWAP_RA_ORDER_CEILING 5
+
+struct vma_swap_readahead {
+ unsigned short win;
+ unsigned short offset;
+ unsigned short nr_pte;
+};
+
static void swap_ra_info(struct vm_fault *vmf,
struct vma_swap_readahead *ra_info)
{
@@ -705,11 +713,7 @@ static void swap_ra_info(struct vm_fault *vmf,
unsigned long ra_val;
unsigned long faddr, pfn, fpfn, lpfn, rpfn;
unsigned long start, end;
- pte_t *pte, *orig_pte;
unsigned int max_win, hits, prev_win, win;
-#ifndef CONFIG_64BIT
- pte_t *tpte;
-#endif
max_win = 1 << min_t(unsigned int, READ_ONCE(page_cluster),
SWAP_RA_ORDER_CEILING);
@@ -728,12 +732,9 @@ static void swap_ra_info(struct vm_fault *vmf,
max_win, prev_win);
atomic_long_set(&vma->swap_readahead_info,
SWAP_RA_VAL(faddr, win, 0));
-
if (win == 1)
return;
- /* Copy the PTEs because the page table may be unmapped */
- orig_pte = pte = pte_offset_map(vmf->pmd, faddr);
if (fpfn == pfn + 1) {
lpfn = fpfn;
rpfn = fpfn + win;
@@ -753,15 +754,6 @@ static void swap_ra_info(struct vm_fault *vmf,
ra_info->nr_pte = end - start;
ra_info->offset = fpfn - start;
- pte -= ra_info->offset;
-#ifdef CONFIG_64BIT
- ra_info->ptes = pte;
-#else
- tpte = ra_info->ptes;
- for (pfn = start; pfn != end; pfn++)
- *tpte++ = *pte++;
-#endif
- pte_unmap(orig_pte);
}
/**
@@ -785,7 +777,8 @@ static struct page *swap_vma_readahead(swp_entry_t fentry, gfp_t gfp_mask,
struct swap_iocb *splug = NULL;
struct vm_area_struct *vma = vmf->vma;
struct page *page;
- pte_t *pte, pentry;
+ pte_t *pte = NULL, pentry;
+ unsigned long addr;
swp_entry_t entry;
unsigned int i;
bool page_allocated;
@@ -797,17 +790,25 @@ static struct page *swap_vma_readahead(swp_entry_t fentry, gfp_t gfp_mask,
if (ra_info.win == 1)
goto skip;
+ addr = vmf->address - (ra_info.offset * PAGE_SIZE);
+
blk_start_plug(&plug);
- for (i = 0, pte = ra_info.ptes; i < ra_info.nr_pte;
- i++, pte++) {
- pentry = *pte;
+ for (i = 0; i < ra_info.nr_pte; i++, addr += PAGE_SIZE) {
+ if (!pte++) {
+ pte = pte_offset_map(vmf->pmd, addr);
+ if (!pte)
+ break;
+ }
+ pentry = ptep_get_lockless(pte);
if (!is_swap_pte(pentry))
continue;
entry = pte_to_swp_entry(pentry);
if (unlikely(non_swap_entry(entry)))
continue;
+ pte_unmap(pte);
+ pte = NULL;
page = __read_swap_cache_async(entry, gfp_mask, vma,
- vmf->address, &page_allocated);
+ addr, &page_allocated);
if (!page)
continue;
if (page_allocated) {
@@ -819,6 +820,8 @@ static struct page *swap_vma_readahead(swp_entry_t fentry, gfp_t gfp_mask,
}
put_page(page);
}
+ if (pte)
+ pte_unmap(pte);
blk_finish_plug(&plug);
swap_read_unplug(splug);
lru_add_drain();
--
2.35.3
In rare transient cases, not yet made possible, pte_offset_map() and
pte_offset_map_lock() may not find a page table: handle appropriately.
Signed-off-by: Hugh Dickins <[email protected]>
---
This is a perf patch, not an mm patch, and it will want to go in through
the tip tree in due course; but keep it in this series for now, so that
it's not missed, and not submitted before mm review.
kernel/events/core.c | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/kernel/events/core.c b/kernel/events/core.c
index db016e418931..174be710f3b3 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -7490,6 +7490,7 @@ static u64 perf_get_pgtable_size(struct mm_struct *mm, unsigned long addr)
return pud_leaf_size(pud);
pmdp = pmd_offset_lockless(pudp, pud, addr);
+again:
pmd = pmdp_get_lockless(pmdp);
if (!pmd_present(pmd))
return 0;
@@ -7498,6 +7499,9 @@ static u64 perf_get_pgtable_size(struct mm_struct *mm, unsigned long addr)
return pmd_leaf_size(pmd);
ptep = pte_offset_map(&pmd, addr);
+ if (!ptep)
+ goto again;
+
pte = ptep_get_lockless(ptep);
if (pte_present(pte))
size = pte_leaf_size(pte);
--
2.35.3
__split_huge_zero_page_pmd() use a single pte_offset_map() to sweep the
extent: it's already under pmd_lock(), so this is no worse for latency;
and since it's supposed to have full control of the just-withdrawn page
table, here choose to VM_BUG_ON if it were to fail. And please don't
increment haddr by PAGE_SIZE, that should remain huge aligned: declare
a separate addr (not a bugfix, but it was deceptive).
__split_huge_pmd_locked() likewise (but it had declared a separate addr);
and change its BUG_ON(!pte_none) to VM_BUG_ON, for consistency with zero
(those deposited page tables are sometimes victims of random corruption).
Signed-off-by: Hugh Dickins <[email protected]>
Reviewed-by: Yang Shi <[email protected]>
---
mm/huge_memory.c | 28 ++++++++++++++++++----------
1 file changed, 18 insertions(+), 10 deletions(-)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index d4bd5fa7c823..839c13fa0bbe 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2037,6 +2037,8 @@ static void __split_huge_zero_page_pmd(struct vm_area_struct *vma,
struct mm_struct *mm = vma->vm_mm;
pgtable_t pgtable;
pmd_t _pmd, old_pmd;
+ unsigned long addr;
+ pte_t *pte;
int i;
/*
@@ -2052,17 +2054,20 @@ static void __split_huge_zero_page_pmd(struct vm_area_struct *vma,
pgtable = pgtable_trans_huge_withdraw(mm, pmd);
pmd_populate(mm, &_pmd, pgtable);
- for (i = 0; i < HPAGE_PMD_NR; i++, haddr += PAGE_SIZE) {
- pte_t *pte, entry;
- entry = pfn_pte(my_zero_pfn(haddr), vma->vm_page_prot);
+ pte = pte_offset_map(&_pmd, haddr);
+ VM_BUG_ON(!pte);
+ for (i = 0, addr = haddr; i < HPAGE_PMD_NR; i++, addr += PAGE_SIZE) {
+ pte_t entry;
+
+ entry = pfn_pte(my_zero_pfn(addr), vma->vm_page_prot);
entry = pte_mkspecial(entry);
if (pmd_uffd_wp(old_pmd))
entry = pte_mkuffd_wp(entry);
- pte = pte_offset_map(&_pmd, haddr);
VM_BUG_ON(!pte_none(*pte));
- set_pte_at(mm, haddr, pte, entry);
- pte_unmap(pte);
+ set_pte_at(mm, addr, pte, entry);
+ pte++;
}
+ pte_unmap(pte - 1);
smp_wmb(); /* make pte visible before pmd */
pmd_populate(mm, pmd, pgtable);
}
@@ -2077,6 +2082,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
bool young, write, soft_dirty, pmd_migration = false, uffd_wp = false;
bool anon_exclusive = false, dirty = false;
unsigned long addr;
+ pte_t *pte;
int i;
VM_BUG_ON(haddr & ~HPAGE_PMD_MASK);
@@ -2205,8 +2211,10 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
pgtable = pgtable_trans_huge_withdraw(mm, pmd);
pmd_populate(mm, &_pmd, pgtable);
+ pte = pte_offset_map(&_pmd, haddr);
+ VM_BUG_ON(!pte);
for (i = 0, addr = haddr; i < HPAGE_PMD_NR; i++, addr += PAGE_SIZE) {
- pte_t entry, *pte;
+ pte_t entry;
/*
* Note that NUMA hinting access restrictions are not
* transferred to avoid any possibility of altering
@@ -2249,11 +2257,11 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
entry = pte_mkuffd_wp(entry);
page_add_anon_rmap(page + i, vma, addr, false);
}
- pte = pte_offset_map(&_pmd, addr);
- BUG_ON(!pte_none(*pte));
+ VM_BUG_ON(!pte_none(*pte));
set_pte_at(mm, addr, pte, entry);
- pte_unmap(pte);
+ pte++;
}
+ pte_unmap(pte - 1);
if (!pmd_migration)
page_remove_rmap(page, vma, true);
--
2.35.3
handle_pte_fault() use pte_offset_map_nolock() to get the vmf.ptl which
corresponds to vmf.pte, instead of pte_lockptr() being used later, when
there's a chance that the pmd entry might have changed, perhaps to none,
or to a huge pmd, with no split ptlock in its struct page.
Remove its pmd_devmap_trans_unstable() call: pte_offset_map_nolock()
will handle that case by failing. Update the "morph" comment above,
looking forward to when shmem or file collapse to THP may not take
mmap_lock for write (or not at all).
do_numa_page() use the vmf->ptl from handle_pte_fault() at first, but
refresh it when refreshing vmf->pte.
do_swap_page()'s pte_unmap_same() (the thing that takes ptl to verify a
two-part PAE orig_pte) use the vmf->ptl from handle_pte_fault() too; but
do_swap_page() is also used by anon THP's __collapse_huge_page_swapin(),
so adjust that to set vmf->ptl by pte_offset_map_nolock().
Signed-off-by: Hugh Dickins <[email protected]>
---
mm/khugepaged.c | 6 ++++--
mm/memory.c | 38 +++++++++++++-------------------------
2 files changed, 17 insertions(+), 27 deletions(-)
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 49cfa7cdfe93..c11db2e78e95 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1005,6 +1005,7 @@ static int __collapse_huge_page_swapin(struct mm_struct *mm,
unsigned long address, end = haddr + (HPAGE_PMD_NR * PAGE_SIZE);
int result;
pte_t *pte = NULL;
+ spinlock_t *ptl;
for (address = haddr; address < end; address += PAGE_SIZE) {
struct vm_fault vmf = {
@@ -1016,7 +1017,7 @@ static int __collapse_huge_page_swapin(struct mm_struct *mm,
};
if (!pte++) {
- pte = pte_offset_map(pmd, address);
+ pte = pte_offset_map_nolock(mm, pmd, address, &ptl);
if (!pte) {
mmap_read_unlock(mm);
result = SCAN_PMD_NULL;
@@ -1024,11 +1025,12 @@ static int __collapse_huge_page_swapin(struct mm_struct *mm,
}
}
- vmf.orig_pte = *pte;
+ vmf.orig_pte = ptep_get_lockless(pte);
if (!is_swap_pte(vmf.orig_pte))
continue;
vmf.pte = pte;
+ vmf.ptl = ptl;
ret = do_swap_page(&vmf);
/* Which unmaps pte (after perhaps re-checking the entry) */
pte = NULL;
diff --git a/mm/memory.c b/mm/memory.c
index c7b920291a72..4ec46eecefd3 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2786,10 +2786,9 @@ static inline int pte_unmap_same(struct vm_fault *vmf)
int same = 1;
#if defined(CONFIG_SMP) || defined(CONFIG_PREEMPTION)
if (sizeof(pte_t) > sizeof(unsigned long)) {
- spinlock_t *ptl = pte_lockptr(vmf->vma->vm_mm, vmf->pmd);
- spin_lock(ptl);
+ spin_lock(vmf->ptl);
same = pte_same(*vmf->pte, vmf->orig_pte);
- spin_unlock(ptl);
+ spin_unlock(vmf->ptl);
}
#endif
pte_unmap(vmf->pte);
@@ -4696,7 +4695,6 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
* validation through pte_unmap_same(). It's of NUMA type but
* the pfn may be screwed if the read is non atomic.
*/
- vmf->ptl = pte_lockptr(vma->vm_mm, vmf->pmd);
spin_lock(vmf->ptl);
if (unlikely(!pte_same(*vmf->pte, vmf->orig_pte))) {
pte_unmap_unlock(vmf->pte, vmf->ptl);
@@ -4767,8 +4765,10 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
flags |= TNF_MIGRATED;
} else {
flags |= TNF_MIGRATE_FAIL;
- vmf->pte = pte_offset_map(vmf->pmd, vmf->address);
- spin_lock(vmf->ptl);
+ vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
+ vmf->address, &vmf->ptl);
+ if (unlikely(!vmf->pte))
+ goto out;
if (unlikely(!pte_same(*vmf->pte, vmf->orig_pte))) {
pte_unmap_unlock(vmf->pte, vmf->ptl);
goto out;
@@ -4897,27 +4897,16 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
vmf->pte = NULL;
vmf->flags &= ~FAULT_FLAG_ORIG_PTE_VALID;
} else {
- /*
- * If a huge pmd materialized under us just retry later. Use
- * pmd_trans_unstable() via pmd_devmap_trans_unstable() instead
- * of pmd_trans_huge() to ensure the pmd didn't become
- * pmd_trans_huge under us and then back to pmd_none, as a
- * result of MADV_DONTNEED running immediately after a huge pmd
- * fault in a different thread of this mm, in turn leading to a
- * misleading pmd_trans_huge() retval. All we have to ensure is
- * that it is a regular pmd that we can walk with
- * pte_offset_map() and we can do that through an atomic read
- * in C, which is what pmd_trans_unstable() provides.
- */
- if (pmd_devmap_trans_unstable(vmf->pmd))
- return 0;
/*
* A regular pmd is established and it can't morph into a huge
- * pmd from under us anymore at this point because we hold the
- * mmap_lock read mode and khugepaged takes it in write mode.
- * So now it's safe to run pte_offset_map().
+ * pmd by anon khugepaged, since that takes mmap_lock in write
+ * mode; but shmem or file collapse to THP could still morph
+ * it into a huge pmd: just retry later if so.
*/
- vmf->pte = pte_offset_map(vmf->pmd, vmf->address);
+ vmf->pte = pte_offset_map_nolock(vmf->vma->vm_mm, vmf->pmd,
+ vmf->address, &vmf->ptl);
+ if (unlikely(!vmf->pte))
+ return 0;
vmf->orig_pte = ptep_get_lockless(vmf->pte);
vmf->flags |= FAULT_FLAG_ORIG_PTE_VALID;
@@ -4936,7 +4925,6 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
if (pte_protnone(vmf->orig_pte) && vma_is_accessible(vmf->vma))
return do_numa_page(vmf);
- vmf->ptl = pte_lockptr(vmf->vma->vm_mm, vmf->pmd);
spin_lock(vmf->ptl);
entry = vmf->orig_pte;
if (unlikely(!pte_same(*vmf->pte, entry))) {
--
2.35.3
MGLRU's walk_pte_range() use the safer pte_offset_map_nolock(), rather
than pte_lockptr(), to get the ptl for its trylock. Just return false
and move on to next extent if it fails, like when the trylock fails.
Remove the VM_WARN_ON_ONCE(pmd_leaf) since that will happen, rarely.
Signed-off-by: Hugh Dickins <[email protected]>
Acked-by: Yu Zhao <[email protected]>
---
mm/vmscan.c | 16 +++++++---------
1 file changed, 7 insertions(+), 9 deletions(-)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 6d0cd2840cf0..6a9bb6b30dc8 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3993,15 +3993,15 @@ static bool walk_pte_range(pmd_t *pmd, unsigned long start, unsigned long end,
struct pglist_data *pgdat = lruvec_pgdat(walk->lruvec);
int old_gen, new_gen = lru_gen_from_seq(walk->max_seq);
- VM_WARN_ON_ONCE(pmd_leaf(*pmd));
-
- ptl = pte_lockptr(args->mm, pmd);
- if (!spin_trylock(ptl))
+ pte = pte_offset_map_nolock(args->mm, pmd, start & PMD_MASK, &ptl);
+ if (!pte)
return false;
+ if (!spin_trylock(ptl)) {
+ pte_unmap(pte);
+ return false;
+ }
arch_enter_lazy_mmu_mode();
-
- pte = pte_offset_map(pmd, start & PMD_MASK);
restart:
for (i = pte_index(start), addr = start; addr != end; i++, addr += PAGE_SIZE) {
unsigned long pfn;
@@ -4042,10 +4042,8 @@ static bool walk_pte_range(pmd_t *pmd, unsigned long start, unsigned long end,
if (i < PTRS_PER_PTE && get_next_vma(PMD_MASK, PAGE_SIZE, args, &start, &end))
goto restart;
- pte_unmap(pte);
-
arch_leave_lazy_mmu_mode();
- spin_unlock(ptl);
+ pte_unmap_unlock(pte, ptl);
return suitable_to_scan(total, young);
}
--
2.35.3
On Thu, Jun 8, 2023 at 6:40 PM Hugh Dickins <[email protected]> wrote:
>
> There is now no reason for follow_pmd_mask()'s FOLL_SPLIT_PMD block to
> distinguish huge_zero_page from a normal THP: follow_page_pte() handles
> any instability, and here it's a good idea to replace any pmd_none(*pmd)
> by a page table a.s.a.p, in the huge_zero_page case as for a normal THP;
> and this removes an unnecessary possibility of -EBUSY failure.
>
> (Hmm, couldn't the normal THP case have hit an unstably refaulted THP
> before? But there are only two, exceptional, users of FOLL_SPLIT_PMD.)
>
> Signed-off-by: Hugh Dickins <[email protected]>
Reviewed-by: Yang Shi <[email protected]>
> ---
> mm/gup.c | 19 ++++---------------
> 1 file changed, 4 insertions(+), 15 deletions(-)
>
> diff --git a/mm/gup.c b/mm/gup.c
> index bb67193c5460..4ad50a59897f 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -681,21 +681,10 @@ static struct page *follow_pmd_mask(struct vm_area_struct *vma,
> return follow_page_pte(vma, address, pmd, flags, &ctx->pgmap);
> }
> if (flags & FOLL_SPLIT_PMD) {
> - int ret;
> - page = pmd_page(*pmd);
> - if (is_huge_zero_page(page)) {
> - spin_unlock(ptl);
> - ret = 0;
> - split_huge_pmd(vma, pmd, address);
> - if (pmd_trans_unstable(pmd))
> - ret = -EBUSY;
> - } else {
> - spin_unlock(ptl);
> - split_huge_pmd(vma, pmd, address);
> - ret = pte_alloc(mm, pmd) ? -ENOMEM : 0;
> - }
> -
> - return ret ? ERR_PTR(ret) :
> + spin_unlock(ptl);
> + split_huge_pmd(vma, pmd, address);
> + /* If pmd was left empty, stuff a page table in there quickly */
> + return pte_alloc(mm, pmd) ? ERR_PTR(-ENOMEM) :
> follow_page_pte(vma, address, pmd, flags, &ctx->pgmap);
> }
> page = follow_trans_huge_pmd(vma, address, pmd, flags);
> --
> 2.35.3
>
On Thu, 8 Jun 2023 18:43:38 -0700 (PDT) Hugh Dickins <[email protected]> wrote:
> copy_pte_range(): use pte_offset_map_nolock(), and allow for it to fail;
> but with a comment on some further assumptions that are being made there.
>
> zap_pte_range() and zap_pmd_range(): adjust their interaction so that
> a pte_offset_map_lock() failure in zap_pte_range() leads to a retry in
> zap_pmd_range(); remove call to pmd_none_or_trans_huge_or_clear_bad().
>
> Allow pte_offset_map_lock() to fail in many functions. Update comment
> on calling pte_alloc() in do_anonymous_page(). Remove redundant calls
> to pmd_trans_unstable(), pmd_devmap_trans_unstable(), pmd_none() and
> pmd_bad(); but leave pmd_none_or_clear_bad() calls in free_pmd_range()
> and copy_pmd_range(), those do simplify the next level down.
>
> ...
>
> @@ -3728,11 +3737,9 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> vmf->page = pfn_swap_entry_to_page(entry);
> vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
> vmf->address, &vmf->ptl);
> - if (unlikely(!pte_same(*vmf->pte, vmf->orig_pte))) {
> - spin_unlock(vmf->ptl);
> - goto out;
> - }
> -
> + if (unlikely(!vmf->pte ||
> + !pte_same(*vmf->pte, vmf->orig_pte)))
> + goto unlock;
> /*
> * Get a page reference while we know the page can't be
> * freed.
This hunk falls afoul of
https://lkml.kernel.org/r/[email protected].
I did this:
@@ -3729,7 +3738,8 @@ vm_fault_t do_swap_page(struct vm_fault
vmf->page = pfn_swap_entry_to_page(entry);
vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
vmf->address, &vmf->ptl);
- if (unlikely(!pte_same(*vmf->pte, vmf->orig_pte)))
+ if (unlikely(!vmf->pte ||
+ !pte_same(*vmf->pte, vmf->orig_pte)))
goto unlock;
/*
On Fri, 9 Jun 2023, Andrew Morton wrote:
> On Thu, 8 Jun 2023 18:43:38 -0700 (PDT) Hugh Dickins <[email protected]> wrote:
>
> > copy_pte_range(): use pte_offset_map_nolock(), and allow for it to fail;
> > but with a comment on some further assumptions that are being made there.
> >
> > zap_pte_range() and zap_pmd_range(): adjust their interaction so that
> > a pte_offset_map_lock() failure in zap_pte_range() leads to a retry in
> > zap_pmd_range(); remove call to pmd_none_or_trans_huge_or_clear_bad().
> >
> > Allow pte_offset_map_lock() to fail in many functions. Update comment
> > on calling pte_alloc() in do_anonymous_page(). Remove redundant calls
> > to pmd_trans_unstable(), pmd_devmap_trans_unstable(), pmd_none() and
> > pmd_bad(); but leave pmd_none_or_clear_bad() calls in free_pmd_range()
> > and copy_pmd_range(), those do simplify the next level down.
> >
> > ...
> >
> > @@ -3728,11 +3737,9 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> > vmf->page = pfn_swap_entry_to_page(entry);
> > vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
> > vmf->address, &vmf->ptl);
> > - if (unlikely(!pte_same(*vmf->pte, vmf->orig_pte))) {
> > - spin_unlock(vmf->ptl);
> > - goto out;
> > - }
> > -
> > + if (unlikely(!vmf->pte ||
> > + !pte_same(*vmf->pte, vmf->orig_pte)))
> > + goto unlock;
> > /*
> > * Get a page reference while we know the page can't be
> > * freed.
>
> This hunk falls afoul of
> https://lkml.kernel.org/r/[email protected].
>
> I did this:
>
> @@ -3729,7 +3738,8 @@ vm_fault_t do_swap_page(struct vm_fault
> vmf->page = pfn_swap_entry_to_page(entry);
> vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
> vmf->address, &vmf->ptl);
> - if (unlikely(!pte_same(*vmf->pte, vmf->orig_pte)))
> + if (unlikely(!vmf->pte ||
> + !pte_same(*vmf->pte, vmf->orig_pte)))
> goto unlock;
>
> /*
Yes, that's exactly right: thanks, Andrew.
Hugh
Hi, Hugh,
Sorry for late reply.
Hugh Dickins <[email protected]> writes:
> swap_vma_readahead() has been proceeding in an unconventional way, its
> preliminary swap_ra_info() doing the pte_offset_map() and pte_unmap(),
> then relying on that pte pointer even after the pte_unmap() - in its
> CONFIG_64BIT case (I think !CONFIG_HIGHPTE was intended; whereas 32-bit
> copied ptes to stack while they were mapped, but had to limit how many).
>
> Though it would be difficult to construct a failing testcase, accessing
> page table after pte_unmap() will become bad practice, even on 64-bit:
> an rcu_read_unlock() in pte_unmap() will allow page table to be freed.
>
> Move relevant definitions from include/linux/swap.h to mm/swap_state.c,
> nothing else used them. Delete the CONFIG_64BIT distinction and buffer,
> delete all reference to ptes from swap_ra_info(), use pte_offset_map()
> repeatedly in swap_vma_readahead(), breaking from the loop if it fails.
>
> (Will the repeated "map" and "unmap" show up as a slowdown anywhere?
> If so, maybe modify __read_swap_cache_async() to do the pte_unmap()
> only when it does not find the page already in the swapcache.)
>
> Use ptep_get_lockless(), mainly for its READ_ONCE(). Correctly advance
> the address passed down to each call of __read_swap_cache_async().
>
> Signed-off-by: Hugh Dickins <[email protected]>
> ---
> include/linux/swap.h | 19 -------------------
> mm/swap_state.c | 45 +++++++++++++++++++++++---------------------
> 2 files changed, 24 insertions(+), 40 deletions(-)
>
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 3c69cb653cb9..1b9f2d92fc10 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -337,25 +337,6 @@ struct swap_info_struct {
> */
> };
>
> -#ifdef CONFIG_64BIT
> -#define SWAP_RA_ORDER_CEILING 5
> -#else
> -/* Avoid stack overflow, because we need to save part of page table */
> -#define SWAP_RA_ORDER_CEILING 3
> -#define SWAP_RA_PTE_CACHE_SIZE (1 << SWAP_RA_ORDER_CEILING)
> -#endif
> -
> -struct vma_swap_readahead {
> - unsigned short win;
> - unsigned short offset;
> - unsigned short nr_pte;
> -#ifdef CONFIG_64BIT
> - pte_t *ptes;
> -#else
> - pte_t ptes[SWAP_RA_PTE_CACHE_SIZE];
> -#endif
> -};
> -
> static inline swp_entry_t folio_swap_entry(struct folio *folio)
> {
> swp_entry_t entry = { .val = page_private(&folio->page) };
> diff --git a/mm/swap_state.c b/mm/swap_state.c
> index b76a65ac28b3..a43b41975da2 100644
> --- a/mm/swap_state.c
> +++ b/mm/swap_state.c
> @@ -698,6 +698,14 @@ void exit_swap_address_space(unsigned int type)
> swapper_spaces[type] = NULL;
> }
>
> +#define SWAP_RA_ORDER_CEILING 5
> +
> +struct vma_swap_readahead {
> + unsigned short win;
> + unsigned short offset;
> + unsigned short nr_pte;
> +};
> +
Because we don't deal with PTEs in struct vma_swap_readahead anymore, it
appears simpler to record addresses directly, for example,
struct vma_swap_readahead {
unsigned long start;
unsigned long end;
};
we can make ra_info.win the return value of swap_ra_info().
Anyway, this can be a separate cleanup patch based on this patch.
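For illustration, a rough sketch of that shape (hypothetical and untested,
not part of this series; swap_ra_info() is assumed here to fill start/end
and return win):

	win = swap_ra_info(vmf, &ra_info);
	if (win == 1)
		goto skip;

	for (addr = ra_info.start; addr != ra_info.end; addr += PAGE_SIZE) {
		/*
		 * Map, advance and unmap the pte, and handle the swap
		 * entry, exactly as in the loop in the patch above:
		 * only the loop bounds come from ra_info now.
		 */
	}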
For the patch itself, feel free to add,
Reviewed-by: "Huang, Ying" <[email protected]>
> static void swap_ra_info(struct vm_fault *vmf,
> struct vma_swap_readahead *ra_info)
> {
> @@ -705,11 +713,7 @@ static void swap_ra_info(struct vm_fault *vmf,
> unsigned long ra_val;
> unsigned long faddr, pfn, fpfn, lpfn, rpfn;
> unsigned long start, end;
> - pte_t *pte, *orig_pte;
> unsigned int max_win, hits, prev_win, win;
> -#ifndef CONFIG_64BIT
> - pte_t *tpte;
> -#endif
>
> max_win = 1 << min_t(unsigned int, READ_ONCE(page_cluster),
> SWAP_RA_ORDER_CEILING);
> @@ -728,12 +732,9 @@ static void swap_ra_info(struct vm_fault *vmf,
> max_win, prev_win);
> atomic_long_set(&vma->swap_readahead_info,
> SWAP_RA_VAL(faddr, win, 0));
> -
> if (win == 1)
> return;
>
> - /* Copy the PTEs because the page table may be unmapped */
> - orig_pte = pte = pte_offset_map(vmf->pmd, faddr);
> if (fpfn == pfn + 1) {
> lpfn = fpfn;
> rpfn = fpfn + win;
> @@ -753,15 +754,6 @@ static void swap_ra_info(struct vm_fault *vmf,
>
> ra_info->nr_pte = end - start;
> ra_info->offset = fpfn - start;
> - pte -= ra_info->offset;
> -#ifdef CONFIG_64BIT
> - ra_info->ptes = pte;
> -#else
> - tpte = ra_info->ptes;
> - for (pfn = start; pfn != end; pfn++)
> - *tpte++ = *pte++;
> -#endif
> - pte_unmap(orig_pte);
> }
>
> /**
> @@ -785,7 +777,8 @@ static struct page *swap_vma_readahead(swp_entry_t fentry, gfp_t gfp_mask,
> struct swap_iocb *splug = NULL;
> struct vm_area_struct *vma = vmf->vma;
> struct page *page;
> - pte_t *pte, pentry;
> + pte_t *pte = NULL, pentry;
> + unsigned long addr;
> swp_entry_t entry;
> unsigned int i;
> bool page_allocated;
> @@ -797,17 +790,25 @@ static struct page *swap_vma_readahead(swp_entry_t fentry, gfp_t gfp_mask,
> if (ra_info.win == 1)
> goto skip;
>
> + addr = vmf->address - (ra_info.offset * PAGE_SIZE);
> +
> blk_start_plug(&plug);
> - for (i = 0, pte = ra_info.ptes; i < ra_info.nr_pte;
> - i++, pte++) {
> - pentry = *pte;
> + for (i = 0; i < ra_info.nr_pte; i++, addr += PAGE_SIZE) {
> + if (!pte++) {
> + pte = pte_offset_map(vmf->pmd, addr);
> + if (!pte)
> + break;
> + }
> + pentry = ptep_get_lockless(pte);
> if (!is_swap_pte(pentry))
> continue;
> entry = pte_to_swp_entry(pentry);
> if (unlikely(non_swap_entry(entry)))
> continue;
> + pte_unmap(pte);
> + pte = NULL;
> page = __read_swap_cache_async(entry, gfp_mask, vma,
> - vmf->address, &page_allocated);
> + addr, &page_allocated);
> if (!page)
> continue;
> if (page_allocated) {
> @@ -819,6 +820,8 @@ static struct page *swap_vma_readahead(swp_entry_t fentry, gfp_t gfp_mask,
> }
> put_page(page);
> }
> + if (pte)
> + pte_unmap(pte);
> blk_finish_plug(&plug);
> swap_read_unplug(splug);
> lru_add_drain();
On 09/06/2023 21:11, Hugh Dickins wrote:
> On Fri, 9 Jun 2023, Andrew Morton wrote:
>> On Thu, 8 Jun 2023 18:43:38 -0700 (PDT) Hugh Dickins <[email protected]> wrote:
>>
>>> copy_pte_range(): use pte_offset_map_nolock(), and allow for it to fail;
>>> but with a comment on some further assumptions that are being made there.
>>>
>>> zap_pte_range() and zap_pmd_range(): adjust their interaction so that
>>> a pte_offset_map_lock() failure in zap_pte_range() leads to a retry in
>>> zap_pmd_range(); remove call to pmd_none_or_trans_huge_or_clear_bad().
>>>
>>> Allow pte_offset_map_lock() to fail in many functions. Update comment
>>> on calling pte_alloc() in do_anonymous_page(). Remove redundant calls
>>> to pmd_trans_unstable(), pmd_devmap_trans_unstable(), pmd_none() and
>>> pmd_bad(); but leave pmd_none_or_clear_bad() calls in free_pmd_range()
>>> and copy_pmd_range(), those do simplify the next level down.
>>>
>>> ...
>>>
>>> @@ -3728,11 +3737,9 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>> vmf->page = pfn_swap_entry_to_page(entry);
>>> vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
>>> vmf->address, &vmf->ptl);
>>> - if (unlikely(!pte_same(*vmf->pte, vmf->orig_pte))) {
>>> - spin_unlock(vmf->ptl);
>>> - goto out;
>>> - }
>>> -
>>> + if (unlikely(!vmf->pte ||
>>> + !pte_same(*vmf->pte, vmf->orig_pte)))
>>> + goto unlock;
>>> /*
>>> * Get a page reference while we know the page can't be
>>> * freed.
>>
>> This hunk falls afoul of
>> https://lkml.kernel.org/r/[email protected].
>>
>> I did this:
>>
>> @@ -3729,7 +3738,8 @@ vm_fault_t do_swap_page(struct vm_fault
>> vmf->page = pfn_swap_entry_to_page(entry);
>> vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
>> vmf->address, &vmf->ptl);
>> - if (unlikely(!pte_same(*vmf->pte, vmf->orig_pte)))
>> + if (unlikely(!vmf->pte ||
>> + !pte_same(*vmf->pte, vmf->orig_pte)))
>> goto unlock;
>>
>> /*
>
> Yes, that's exactly right: thanks, Andrew.
FWIW, I agree.
Thanks,
Ryan
>
> Hugh
On Mon, 12 Jun 2023, Huang, Ying wrote:
> Hi, Hugh,
>
> Sorry for late reply.
Never apologize to *me* for being "late" or "slow" or "unresponsive".
Thanks for looking, yes, it was indeed for this one that I particularly
added you to the Cc.
>
> Hugh Dickins <[email protected]> writes:
>
> > swap_vma_readahead() has been proceeding in an unconventional way, its
> > preliminary swap_ra_info() doing the pte_offset_map() and pte_unmap(),
> > then relying on that pte pointer even after the pte_unmap() - in its
> > CONFIG_64BIT case (I think !CONFIG_HIGHPTE was intended; whereas 32-bit
> > copied ptes to stack while they were mapped, but had to limit how many).
> >
> > Though it would be difficult to construct a failing testcase, accessing
> > page table after pte_unmap() will become bad practice, even on 64-bit:
> > an rcu_read_unlock() in pte_unmap() will allow page table to be freed.
> >
> > Move relevant definitions from include/linux/swap.h to mm/swap_state.c,
> > nothing else used them. Delete the CONFIG_64BIT distinction and buffer,
> > delete all reference to ptes from swap_ra_info(), use pte_offset_map()
> > repeatedly in swap_vma_readahead(), breaking from the loop if it fails.
> >
> > (Will the repeated "map" and "unmap" show up as a slowdown anywhere?
> > If so, maybe modify __read_swap_cache_async() to do the pte_unmap()
> > only when it does not find the page already in the swapcache.)
> >
> > Use ptep_get_lockless(), mainly for its READ_ONCE(). Correctly advance
> > the address passed down to each call of __read_swap_cache_async().
> >
> > Signed-off-by: Hugh Dickins <[email protected]>
> > ---
> > include/linux/swap.h | 19 -------------------
> > mm/swap_state.c | 45 +++++++++++++++++++++++---------------------
> > 2 files changed, 24 insertions(+), 40 deletions(-)
...
> Because we don't deal with PTEs in struct vma_swap_readahead anymore, it
> appears simpler to record addresses directly, for example,
>
> struct vma_swap_readahead {
> unsigned long start;
> unsigned long end;
> };
>
> we can make ra_info.win to be the return value of swap_ra_info().
>
> Anyway, this can be a separate cleanup patch based on this patch.
Ooh, that would have required me to think, rather than just delete
lines. Mmm, if you see a cleaner way forward, yes, please do add
some cleanup on top.
>
> For the patch itself, feel free to add,
>
> Reviewed-by: "Huang, Ying" <[email protected]>
Great, thanks a lot.
Hugh
__wp_page_copy_user() was liable to call update_mmu_tlb() with NULL
vmf->pte in two places: not a problem today, but could become a problem
later when pte_offset_map_lock() fails.
Signed-off-by: Hugh Dickins <[email protected]>
---
Andrew, please add this as a fix patch for later merge into my
"mm/memory: allow" patch in mm-unstable: it's something noticed while
researching the bug Nathan reported, but not so serious - thanks.
mm/memory.c | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)
diff --git a/mm/memory.c b/mm/memory.c
index 4ec46eecefd3..cdadcff5ab26 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2843,7 +2843,8 @@ static inline int __wp_page_copy_user(struct page *dst, struct page *src,
* Other thread has already handled the fault
* and update local tlb only
*/
- update_mmu_tlb(vma, addr, vmf->pte);
+ if (vmf->pte)
+ update_mmu_tlb(vma, addr, vmf->pte);
ret = -EAGAIN;
goto pte_unlock;
}
@@ -2867,7 +2868,8 @@ static inline int __wp_page_copy_user(struct page *dst, struct page *src,
vmf->pte = pte_offset_map_lock(mm, vmf->pmd, addr, &vmf->ptl);
if (unlikely(!vmf->pte || !pte_same(*vmf->pte, vmf->orig_pte))) {
/* The PTE changed under us, update local tlb */
- update_mmu_tlb(vma, addr, vmf->pte);
+ if (vmf->pte)
+ update_mmu_tlb(vma, addr, vmf->pte);
ret = -EAGAIN;
goto pte_unlock;
}
--
2.35.3
Delete a triply out-of-date comment from add_swap_count_continuation():
1. vmalloc_to_page() changed from pte_offset_map() to pte_offset_kernel()
2. pte_offset_map() changed from using kmap_atomic() to kmap_local_page()
3. kmap_atomic() changed from using fixed FIX_KMAP addresses in 2.6.37.
Signed-off-by: Hugh Dickins <[email protected]>
---
Here's a late "33/32" to the series just moved to mm-stable - thank you!
mm/swapfile.c | 5 -----
1 file changed, 5 deletions(-)
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 12d204e6dae2..0a17d85b50cb 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -3470,11 +3470,6 @@ int add_swap_count_continuation(swp_entry_t entry, gfp_t gfp_mask)
goto out;
}
- /*
- * We are fortunate that although vmalloc_to_page uses pte_offset_map,
- * no architecture is using highmem pages for kernel page tables: so it
- * will not corrupt the GFP_ATOMIC caller's atomic page table kmaps.
- */
head = vmalloc_to_page(si->swap_map + offset);
offset &= ~PAGE_MASK;
--
2.35.3
On Thu, Jun 08, 2023 at 06:21:41PM -0700, Hugh Dickins wrote:
> vmalloc_to_page() was using pte_offset_map() (followed by pte_unmap()),
> but it's intended for userspace page tables: prefer pte_offset_kernel().
>
> Signed-off-by: Hugh Dickins <[email protected]>
> Reviewed-by: Lorenzo Stoakes <[email protected]>
Currently Linus' tree is reliably failing to boot on pine64plus, an
arm64 SBC. Most other boards seem fine, though I am seeing some
additional instability on Tritium which is another Allwinner platform,
I've not dug into that yet and Tritium is generally less stable.
We end up seeing NULL or otherwise bad pointer dereferences; the
specific error does vary a bit, though it mostly appears to be in the
pinctrl code. A bisect (full log below) identified this patch as
introducing the failure. Nothing is jumping out at me about the patch,
and it's not affecting everything, so I'd not be surprised if it's just
uncovering some bug in the platform support, but I'm not super familiar
with the code.
Sample backtrace:
[ 1.919725] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000000
[ 1.928551] Mem abort info:
[ 1.931359] ESR = 0x0000000096000044
...
[ 1.968870] [0000000000000000] user address but active_mm is swapper
...
[ 2.093969] Call trace:
[ 2.096414] dt_remember_or_free_map+0xc8/0x120
[ 2.100949] pinctrl_dt_to_map+0x23c/0x364
[ 2.105050] create_pinctrl+0x68/0x3ec
[ 2.108803] pinctrl_get+0xb0/0x124
[ 2.112294] devm_pinctrl_get+0x48/0x90
[ 2.116133] pinctrl_bind_pins+0x58/0x158
[ 2.120148] really_probe+0x54/0x2b0
[ 2.123724] __driver_probe_device+0x78/0x12c
Another common theme is the same but with an address like 0x4c and:
[ 2.098328] __kmem_cache_alloc_node+0x1bc/0x2dc
[ 2.102947] kmalloc_trace+0x20/0x2c
[ 2.106524] pinctrl_register_mappings+0x98/0x178
Full boot log from a failure:
https://lava.sirena.org.uk/scheduler/job/712456
git bisect start
# bad: [06c2afb862f9da8dc5efa4b6076a0e48c3fbaaa5] Linux 6.5-rc1
git bisect bad 06c2afb862f9da8dc5efa4b6076a0e48c3fbaaa5
# good: [6995e2de6891c724bfeb2db33d7b87775f913ad1] Linux 6.4
git bisect good 6995e2de6891c724bfeb2db33d7b87775f913ad1
# bad: [1b722407a13b7f8658d2e26917791f32805980a2] Merge tag 'drm-next-2023-06-29' of git://anongit.freedesktop.org/drm/drm
git bisect bad 1b722407a13b7f8658d2e26917791f32805980a2
# bad: [3a8a670eeeaa40d87bd38a587438952741980c18] Merge tag 'net-next-6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next
git bisect bad 3a8a670eeeaa40d87bd38a587438952741980c18
# bad: [6e17c6de3ddf3073741d9c91a796ee696914d8a0] Merge tag 'mm-stable-2023-06-24-19-15' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
git bisect bad 6e17c6de3ddf3073741d9c91a796ee696914d8a0
# good: [2605e80d3438c77190f55b821c6575048c68268e] Merge tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux
git bisect good 2605e80d3438c77190f55b821c6575048c68268e
# good: [72dc6db7e3b692f46f3386b8dd5101d3f431adef] Merge tag 'wq-for-6.5-cleanup-ordered' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq
git bisect good 72dc6db7e3b692f46f3386b8dd5101d3f431adef
# bad: [179d3e4f3bfa5947821c1b1bc6aa49a4797b7f21] mm/madvise: clean up force_shm_swapin_readahead()
git bisect bad 179d3e4f3bfa5947821c1b1bc6aa49a4797b7f21
# good: [523716770e63e229dbb6307d663f03d990dfefc5] maple_tree: rework mtree_alloc_{range,rrange}()
git bisect good 523716770e63e229dbb6307d663f03d990dfefc5
# good: [b764253c18821da31c49a260f92f5d093cf1637e] selftests/mm: fix "warning: expression which evaluates to zero..." in mlock2-tests.c
git bisect good b764253c18821da31c49a260f92f5d093cf1637e
# good: [5c7f3bf04a6cf266567fdea1ae4987875e92619f] s390: allow pte_offset_map_lock() to fail
git bisect good 5c7f3bf04a6cf266567fdea1ae4987875e92619f
# good: [0d940a9b270b9220dcff74d8e9123c9788365751] mm/pgtable: allow pte_offset_map[_lock]() to fail
git bisect good 0d940a9b270b9220dcff74d8e9123c9788365751
# bad: [0d1c81edc61e553ed7a5db18fb8074c8b78e1538] mm/vmalloc: vmalloc_to_page() use pte_offset_kernel()
git bisect bad 0d1c81edc61e553ed7a5db18fb8074c8b78e1538
# good: [2798bbe75b9c2752b46d292e5c2a49f49da36418] mm/page_vma_mapped: pte_offset_map_nolock() not pte_lockptr()
git bisect good 2798bbe75b9c2752b46d292e5c2a49f49da36418
# good: [be872f83bf571f4f9a0ac25e2c9c36e905a36619] mm/pagewalk: walk_pte_range() allow for pte_offset_map()
git bisect good be872f83bf571f4f9a0ac25e2c9c36e905a36619
# good: [e5ad581c7f1c32d309ae4e895eea0cd1a3d9f363] mm/vmwgfx: simplify pmd & pud mapping dirty helpers
git bisect good e5ad581c7f1c32d309ae4e895eea0cd1a3d9f363
# first bad commit: [0d1c81edc61e553ed7a5db18fb8074c8b78e1538] mm/vmalloc: vmalloc_to_page() use pte_offset_kernel()
On Mon, Jul 10, 2023 at 03:42:31PM +0100, Mark Brown wrote:
> On Thu, Jun 08, 2023 at 06:21:41PM -0700, Hugh Dickins wrote:
> > vmalloc_to_page() was using pte_offset_map() (followed by pte_unmap()),
> > but it's intended for userspace page tables: prefer pte_offset_kernel().
> >
> > Signed-off-by: Hugh Dickins <[email protected]>
> > Reviewed-by: Lorenzo Stoakes <[email protected]>
>
> Currently Linus' tree is reliably failing to boot on pine64plus, an
> arm64 SBC. Most other boards seem fine, though I am seeing some
> additional instability on Tritium which is another Allwinner platform,
> I've not dug into that yet and Tritium is generally less stable.
>
> We end up seeing NULL or otherwise bad pointer dereferences, the
> specific error does vary a bit though it mostly appears to be in the
> pinctrl code. A bisect (full log below) identified this patch as
> introducing the failure, nothing is jumping out at me about the patch
> and it's not affecting everything so I'd not be surprised if it's just
> uncovering some bug in the platform support but I'm not super familiar
> with the code.
Yeah seems likely. Do you have a .config you can share for this board? For
a 64-bit device you'd expect that this change would probably be a nop.
>
> Sample backtrace:
>
> [ 1.919725] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000000
> [ 1.928551] Mem abort info:
> [ 1.931359] ESR = 0x0000000096000044
>
> ...
>
> [ 1.968870] [0000000000000000] user address but active_mm is swapper
>
> ...
>
> [ 2.093969] Call trace:
> [ 2.096414] dt_remember_or_free_map+0xc8/0x120
> [ 2.100949] pinctrl_dt_to_map+0x23c/0x364
> [ 2.105050] create_pinctrl+0x68/0x3ec
> [ 2.108803] pinctrl_get+0xb0/0x124
> [ 2.112294] devm_pinctrl_get+0x48/0x90
> [ 2.116133] pinctrl_bind_pins+0x58/0x158
> [ 2.120148] really_probe+0x54/0x2b0
> [ 2.123724] __driver_probe_device+0x78/0x12c
>
> Another common theme is the same but with an address like 0x4c and:
>
> [ 2.098328] __kmem_cache_alloc_node+0x1bc/0x2dc
> [ 2.102947] kmalloc_trace+0x20/0x2c
> [ 2.106524] pinctrl_register_mappings+0x98/0x178
>
> Full boot log from a failure:
>
> https://lava.sirena.org.uk/scheduler/job/712456
>
> git bisect start
> # bad: [06c2afb862f9da8dc5efa4b6076a0e48c3fbaaa5] Linux 6.5-rc1
> git bisect bad 06c2afb862f9da8dc5efa4b6076a0e48c3fbaaa5
> # good: [6995e2de6891c724bfeb2db33d7b87775f913ad1] Linux 6.4
> git bisect good 6995e2de6891c724bfeb2db33d7b87775f913ad1
> # bad: [1b722407a13b7f8658d2e26917791f32805980a2] Merge tag 'drm-next-2023-06-29' of git://anongit.freedesktop.org/drm/drm
> git bisect bad 1b722407a13b7f8658d2e26917791f32805980a2
> # bad: [3a8a670eeeaa40d87bd38a587438952741980c18] Merge tag 'net-next-6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next
> git bisect bad 3a8a670eeeaa40d87bd38a587438952741980c18
> # bad: [6e17c6de3ddf3073741d9c91a796ee696914d8a0] Merge tag 'mm-stable-2023-06-24-19-15' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
> git bisect bad 6e17c6de3ddf3073741d9c91a796ee696914d8a0
> # good: [2605e80d3438c77190f55b821c6575048c68268e] Merge tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux
> git bisect good 2605e80d3438c77190f55b821c6575048c68268e
> # good: [72dc6db7e3b692f46f3386b8dd5101d3f431adef] Merge tag 'wq-for-6.5-cleanup-ordered' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq
> git bisect good 72dc6db7e3b692f46f3386b8dd5101d3f431adef
> # bad: [179d3e4f3bfa5947821c1b1bc6aa49a4797b7f21] mm/madvise: clean up force_shm_swapin_readahead()
> git bisect bad 179d3e4f3bfa5947821c1b1bc6aa49a4797b7f21
> # good: [523716770e63e229dbb6307d663f03d990dfefc5] maple_tree: rework mtree_alloc_{range,rrange}()
> git bisect good 523716770e63e229dbb6307d663f03d990dfefc5
> # good: [b764253c18821da31c49a260f92f5d093cf1637e] selftests/mm: fix "warning: expression which evaluates to zero..." in mlock2-tests.c
> git bisect good b764253c18821da31c49a260f92f5d093cf1637e
> # good: [5c7f3bf04a6cf266567fdea1ae4987875e92619f] s390: allow pte_offset_map_lock() to fail
> git bisect good 5c7f3bf04a6cf266567fdea1ae4987875e92619f
> # good: [0d940a9b270b9220dcff74d8e9123c9788365751] mm/pgtable: allow pte_offset_map[_lock]() to fail
> git bisect good 0d940a9b270b9220dcff74d8e9123c9788365751
> # bad: [0d1c81edc61e553ed7a5db18fb8074c8b78e1538] mm/vmalloc: vmalloc_to_page() use pte_offset_kernel()
> git bisect bad 0d1c81edc61e553ed7a5db18fb8074c8b78e1538
> # good: [2798bbe75b9c2752b46d292e5c2a49f49da36418] mm/page_vma_mapped: pte_offset_map_nolock() not pte_lockptr()
> git bisect good 2798bbe75b9c2752b46d292e5c2a49f49da36418
> # good: [be872f83bf571f4f9a0ac25e2c9c36e905a36619] mm/pagewalk: walk_pte_range() allow for pte_offset_map()
> git bisect good be872f83bf571f4f9a0ac25e2c9c36e905a36619
> # good: [e5ad581c7f1c32d309ae4e895eea0cd1a3d9f363] mm/vmwgfx: simplify pmd & pud mapping dirty helpers
> git bisect good e5ad581c7f1c32d309ae4e895eea0cd1a3d9f363
> # first bad commit: [0d1c81edc61e553ed7a5db18fb8074c8b78e1538] mm/vmalloc: vmalloc_to_page() use pte_offset_kernel()
On Mon, Jul 10, 2023 at 06:18:27PM +0100, Lorenzo Stoakes wrote:
> On Mon, Jul 10, 2023 at 03:42:31PM +0100, Mark Brown wrote:
> > We end up seeing NULL or otherwise bad pointer dereferences, the
> > specific error does vary a bit though it mostly appears to be in the
> > pinctrl code. A bisect (full log below) identified this patch as
> > introducing the failure, nothing is jumping out at me about the patch
> > and it's not affecting everything so I'd not be surprised if it's just
> > uncovering some bug in the platform support but I'm not super familiar
> > with the code.
> Yeah seems likely. Do you have a .config you can share for this board? For
> a 64-bit device you'd expect that this change would probably be a nop.
It's definitely happening with arm64 defconfig, possibly with other
configs but that's the main one.
On 8 Jun 2023, at 21:11, Hugh Dickins wrote:
> filemap_map_pages() allow pte_offset_map_lock() to fail; and remove the
> pmd_devmap_trans_unstable() check from filemap_map_pmd(), which can safely
> return to filemap_map_pages() and let pte_offset_map_lock() discover that.
>
> Signed-off-by: Hugh Dickins <[email protected]>
> ---
> mm/filemap.c | 12 +++++-------
> 1 file changed, 5 insertions(+), 7 deletions(-)
>
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 28b42ee848a4..9e129ad43e0d 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -3408,13 +3408,6 @@ static bool filemap_map_pmd(struct vm_fault *vmf, struct folio *folio,
> if (pmd_none(*vmf->pmd))
> pmd_install(mm, vmf->pmd, &vmf->prealloc_pte);
>
> - /* See comment in handle_pte_fault() */
> - if (pmd_devmap_trans_unstable(vmf->pmd)) {
> - folio_unlock(folio);
> - folio_put(folio);
> - return true;
> - }
> -
There is a pmd_trans_huge() check at the beginning: should it be removed
as well, since pte_offset_map_lock() is also able to detect it?
> return false;
> }
>
> @@ -3501,6 +3494,11 @@ vm_fault_t filemap_map_pages(struct vm_fault *vmf,
>
> addr = vma->vm_start + ((start_pgoff - vma->vm_pgoff) << PAGE_SHIFT);
> vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, addr, &vmf->ptl);
> + if (!vmf->pte) {
> + folio_unlock(folio);
> + folio_put(folio);
> + goto out;
> + }
> do {
> again:
> page = folio_file_page(folio, xas.xa_index);
> --
> 2.35.3
These two changes affect the ret value. Before, pmd_devmap_trans_unstable() == true
made ret = VM_FAULT_NOPAGE, but now ret is the default 0 value. So ret should be set
to VM_FAULT_NOPAGE before goto out in the second hunk?
--
Best Regards,
Yan, Zi
On 8 Jun 2023, at 21:12, Hugh Dickins wrote:
> Revert commit a7a69d8ba88d ("mm/thp: another PVMW_SYNC fix in
> page_vma_mapped_walk()"): I was proud of that "Aha!" commit at the time,
> but in revisiting page_vma_mapped_walk() for pte_offset_map() failure,
> that block raised a doubt: and it now seems utterly bogus. The prior
> map_pte() has taken ptl unconditionally when PVMW_SYNC: I must have
> forgotten that when making the change. It did no harm, but could not
> have fixed a BUG or WARN, and is hard to reconcile with coming changes.
>
> Signed-off-by: Hugh Dickins <[email protected]>
LGTM. Reviewed-by: Zi Yan <[email protected]>
--
Best Regards,
Yan, Zi
On 8 Jun 2023, at 21:14, Hugh Dickins wrote:
> No functional change here, but adjust the format of map_pte() so that the
> following commit will be easier to read: separate out the PVMW_SYNC case
> first, and remove two levels of indentation from the ZONE_DEVICE case.
>
> Signed-off-by: Hugh Dickins <[email protected]>
> ---
> mm/page_vma_mapped.c | 65 +++++++++++++++++++++++---------------------
> 1 file changed, 34 insertions(+), 31 deletions(-)
>
LGTM. Reviewed-by: Zi Yan <[email protected]>
--
Best Regards,
Yan, Zi
On 8 Jun 2023, at 21:10, Hugh Dickins wrote:
> Make pte_offset_map() a wrapper for __pte_offset_map() (optionally
> outputs pmdval), pte_offset_map_lock() a sparse __cond_lock wrapper for
> __pte_offset_map_lock(): those __funcs added in mm/pgtable-generic.c.
>
> __pte_offset_map() do pmdval validation (including pmd_clear_bad()
> when pmd_bad()), returning NULL if pmdval is not for a page table.
> __pte_offset_map_lock() verify pmdval unchanged after getting the
> lock, trying again if it changed.
>
> No #ifdef CONFIG_TRANSPARENT_HUGEPAGE around them: that could be done
> to cover the imminent case, but we expect to generalize it later, and
> it makes a mess of where to do the pmd_bad() clearing.
>
> Add pte_offset_map_nolock(): outputs ptl like pte_offset_map_lock(),
> without actually taking the lock. This will be preferred to open uses of
> pte_lockptr(), because (when split ptlock is in page table's struct page)
> it points to the right lock for the returned pte pointer, even if *pmd
> gets changed racily afterwards.
>
> Update corresponding Documentation.
>
> Do not add the anticipated rcu_read_lock() and rcu_read_unlock()s yet:
> they have to wait until all architectures are balancing pte_offset_map()s
> with pte_unmap()s (as in the arch series posted earlier). But comment
> where they will go, so that it's easy to add them for experiments. And
> only when those are in place can transient racy failure cases be enabled.
> Add more safety for the PAE mismatched pmd_low pmd_high case at that time.
>
> Signed-off-by: Hugh Dickins <[email protected]>
> ---
> Documentation/mm/split_page_table_lock.rst | 17 ++++---
> include/linux/mm.h | 27 +++++++----
> include/linux/pgtable.h | 22 ++++++---
> mm/pgtable-generic.c | 56 ++++++++++++++++++++++
> 4 files changed, 101 insertions(+), 21 deletions(-)
LGTM. Reviewed-by: Zi Yan <[email protected]>
--
Best Regards,
Yan, Zi
On Mon, 10 Jul 2023, Mark Brown wrote:
> On Mon, Jul 10, 2023 at 06:18:27PM +0100, Lorenzo Stoakes wrote:
> > On Mon, Jul 10, 2023 at 03:42:31PM +0100, Mark Brown wrote:
>
> > > We end up seeing NULL or otherwise bad pointer dereferences, the
> > > specific error does vary a bit though it mostly appears to be in the
> > > pinctrl code. A bisect (full log below) identified this patch as
> > > introducing the failure, nothing is jumping out at me about the patch
> > > and it's not affecting everything so I'd not be surprised if it's just
> > > uncovering some bug in the platform support but I'm not super familiar
> > > with the code.
>
> > Yeah seems likely. Do you have a .config you can share for this board? For
> > a 64-bit device you'd expect that this change would probably be a nop.
>
> It's definitely happening with arm64 defconfig, possibly with other
> configs but that's the main one.
I'm sorry for dropping you in it, Mark, but I'm totally baffled.
I've spent most of the day trying to come up with ideas, but failed.
I've no doubt that you're seeing what you're seeing, but how it comes
about is a mystery.
Lorenzo is right that the change should be a no-op - compared with 6.4.
But it's not quite a no-op in this series, because 04/32 0d940a9b270b
("mm/pgtable: allow pte_offset_map[_lock]() to fail") diverts the old
pte_offset_map() macro off to a new function in mm/pgtable-generic.c;
then this commit restores it back to being the pte_offset_kernel() macro.
So the asm in vmalloc_to_page() is expected to change in this commit,
but change back to what it would have been in 6.4.
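For reference, the definitions involved are roughly these (paraphrased,
not quoted exactly from the tree):

	/* v6.4, !CONFIG_HIGHPTE: pte_offset_map() was the kernel macro */
	#define pte_offset_map(pmd, address)  pte_offset_kernel((pmd), (address))

	/* after 0d940a9b270b: pte_offset_map() calls into mm/pgtable-generic.c */
	#define pte_offset_map(pmd, addr)     __pte_offset_map(pmd, addr, NULL)

	/* and 0d1c81edc61e has vmalloc_to_page() go direct again */
	ptep = pte_offset_kernel(pmd, addr);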
This feels like one of those bugs which depends on the code size in
some way (a bit like those bugs we used to have, where a function was
mistakenly marked __init, then in some configs its code landed on a
page which got freed at startup - I'm not saying this is that at all,
just saying it feels weird in that way).
Yet your bisection converges convincingly, which I wouldn't expect
in that case.
I suppose I should ask you to try reverting this 0d1c81edc61e alone
from 6.5-rc1: the consistency of your bisection implies that it will
"fix" the issues, and it is a commit which we could drop. It makes
me a little nervous, applying userspace-pagetable validation to kernel
pagetables, so I don't want to drop it; and it would really be cargo-
culting to drop it without understanding. But we could drop it.
I guess it would be interesting to know whether vmalloc_to_page() is
ever even called in your kernel, before it crashes on the pinctrl stuff.
But putting in a printk to report on that may change everything.
And I guess it would be interesting to know (from a DEBUG_INFO build
of the crashing kernel) which line of dt_remember_or_free_map() it
oopses on i.e. which pointer is NULL when it shouldn't be - or maybe
you already worked that out.
And what device (which ->dt_node_to_map) is involved. If one of the
many dt_node_to_map's fails to initialize *map to NULL when it should,
and has relied on it happening to be a NULL on the stack already...
that might explain it.
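Schematically, the kind of bug I'm imagining (purely illustrative:
hypothetical driver and property names, not pointing at any real code):

	static int foo_dt_node_to_map(struct pinctrl_dev *pctldev,
				      struct device_node *np,
				      struct pinctrl_map **map,
				      unsigned int *num_maps)
	{
		if (!of_find_property(np, "pins", NULL))
			return -EINVAL;	/* bug: returns without ever
					 * writing *map or *num_maps */
		*map = NULL;
		*num_maps = 0;
		return 0;
	}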
Another thing to try would be the kernel at 0d940a9b270b^, just before
pte_offset_map() grew a function call: there's a faint possibility that
the bug came in before this series, that 0d940a9b270b somehow masked it
(I don't see how: vmalloc_to_page() does sensible validation itself),
and then 0d1c81edc61e unmasked it again - so that the bisection skipped
over, and converged on the wrong point.
But I'm thrashing about: I have no confidence that any of this info will
help us. Sorry for wasting your time.
Thanks,
Hugh
On Mon, 10 Jul 2023, Zi Yan wrote:
> On 8 Jun 2023, at 21:11, Hugh Dickins wrote:
>
> > filemap_map_pages() allow pte_offset_map_lock() to fail; and remove the
> > pmd_devmap_trans_unstable() check from filemap_map_pmd(), which can safely
> > return to filemap_map_pages() and let pte_offset_map_lock() discover that.
> >
> > Signed-off-by: Hugh Dickins <[email protected]>
> > ---
> > mm/filemap.c | 12 +++++-------
> > 1 file changed, 5 insertions(+), 7 deletions(-)
> >
> > diff --git a/mm/filemap.c b/mm/filemap.c
> > index 28b42ee848a4..9e129ad43e0d 100644
> > --- a/mm/filemap.c
> > +++ b/mm/filemap.c
> > @@ -3408,13 +3408,6 @@ static bool filemap_map_pmd(struct vm_fault *vmf, struct folio *folio,
> > if (pmd_none(*vmf->pmd))
> > pmd_install(mm, vmf->pmd, &vmf->prealloc_pte);
> >
> > - /* See comment in handle_pte_fault() */
> > - if (pmd_devmap_trans_unstable(vmf->pmd)) {
> > - folio_unlock(folio);
> > - folio_put(folio);
> > - return true;
> > - }
> > -
>
> There is a pmd_trans_huge() check at the beginning: should it be removed
> as well, since pte_offset_map_lock() is also able to detect it?
It probably could be removed: but mostly I avoided such cleanups,
in the hope that the patches could be more easily reviewed as safe.
But I was eager to delete that obscure pmd_devmap_trans_unstable().
The whole strategy of dealing with the pmd_trans_huge()-like cases first,
and only finally arriving at the pte_offset_map_lock() when other cases
have been excluded, could be reversed in *many* places. It had to be that
way before, because pte_offset_map_lock() could only cope with a page
table; but now we could reverse them to do the pte_offset_map_lock()
first, and only try the other cases when it fails.
That would in theory be more efficient; but whether measurably more
efficient I doubt. And very easy to introduce errors on the way:
my enthusiasm for such cleanups is low! But maybe there's a few
places where the rearrangement would be worthwhile.
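Schematically, with made-up helper names, the two shapes contrasted:

	/* today: exclude the huge and none cases first */
	if (pmd_trans_huge(*pmd) || pmd_devmap(*pmd))
		return handle_huge_pmd(vmf);
	pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
	if (!pte)
		return retry_or_skip(vmf);

	/* possible reversal: try for the page table first */
	pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
	if (!pte)
		return handle_huge_pmd_or_retry(vmf);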
>
> > return false;
> > }
> >
> > @@ -3501,6 +3494,11 @@ vm_fault_t filemap_map_pages(struct vm_fault *vmf,
> >
> > addr = vma->vm_start + ((start_pgoff - vma->vm_pgoff) << PAGE_SHIFT);
> > vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, addr, &vmf->ptl);
> > + if (!vmf->pte) {
> > + folio_unlock(folio);
> > + folio_put(folio);
> > + goto out;
> > + }
> > do {
> > again:
> > page = folio_file_page(folio, xas.xa_index);
> > --
> > 2.35.3
>
> These two changes affect the ret value. Before, pmd_devmap_trans_unstable() == true
> made ret = VM_FAULT_NOPAGE, but now ret is the default 0 value. So ret should be set
> to VM_FAULT_NOPAGE before goto out in the second hunk?
Qi Zheng raised a similar question on the original posting, I answered
https://lore.kernel.org/linux-mm/[email protected]/
It's a rare case to fault here, then find pmd_devmap(*pmd), and it really
doesn't matter whether we return VM_FAULT_NOPAGE or 0 for it - maybe I've
left it inconsistent between THP and devmap, but it doesn't really matter.
I haven't checked Matthew's v5 "new page table range API" posted today,
but I expect this all looks different here anyway.
Thanks a lot for checking these: they are now in 6.5-rc1, so if you find
something that needs fixing, all the more important that we do fix it.
Hugh
[CCing the regression list, as it should be in the loop for regressions:
https://docs.kernel.org/admin-guide/reporting-regressions.html]
[TLDR: I'm adding this report to the list of tracked Linux kernel
regressions; the text you find below is based on a few template
paragraphs you might have encountered already in similar form.
See link in footer if these mails annoy you.]
On 10.07.23 16:42, Mark Brown wrote:
> On Thu, Jun 08, 2023 at 06:21:41PM -0700, Hugh Dickins wrote:
>> vmalloc_to_page() was using pte_offset_map() (followed by pte_unmap()),
>> but it's intended for userspace page tables: prefer pte_offset_kernel().
>>
>> Signed-off-by: Hugh Dickins <[email protected]>
>> Reviewed-by: Lorenzo Stoakes <[email protected]>
>
> Currently Linus' tree is reliably failing to boot on pine64plus, an
> arm64 SBC. Most other boards seem fine, though I am seeing some
> additional instability on Tritium which is another Allwinner platform,
> I've not dug into that yet and Tritium is generally less stable.
>
> We end up seeing NULL or otherwise bad pointer dereferences, the
> [...]
> # first bad commit: [0d1c81edc61e553ed7a5db18fb8074c8b78e1538] mm/vmalloc: vmalloc_to_page() use pte_offset_kernel()
Thanks for the report. To be sure the issue doesn't fall through the
cracks unnoticed, I'm adding it to regzbot, the Linux kernel regression
tracking bot:
#regzbot ^introduced 0d1c81edc61e553ed7a5db18fb8074c8
#regzbot title mm/vmalloc: NULL or otherwise bad pointer dereferences on
ARM64
#regzbot ignore-activity
This isn't a regression? This issue or a fix for it is already
discussed somewhere else? It was fixed already? You want to clarify when
the regression started to happen? Or point out I got the title or
something else totally wrong? Then just reply and tell me -- ideally
while also telling regzbot about it, as explained by the page listed in
the footer of this mail.
Developers: When fixing the issue, remember to add 'Link:' tags pointing
to the report (the parent of this mail). See page linked in footer for
details.
Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
--
Everything you wanna know about Linux kernel regression tracking:
https://linux-regtracking.leemhuis.info/about/#tldr
That page also explains what to do if mails like this annoy you.
On Mon, Jul 10, 2023 at 09:34:42PM -0700, Hugh Dickins wrote:
> This feels like one of those bugs which depends on the code size in
> some way (a bit like those bugs we used to have, where a function was
> mistakenly marked __init, then in some configs its code landed on a
> page which got freed at startup - I'm not saying this is that at all,
> just saying it feels weird in that way).
> Yet your bisection converges convincingly, which I wouldn't expect
> in that case.
Yes, it smells like code size or something other than the commit
itself; I have seen this sort of behaviour before, where something nearby
in history introduced something which was then triggered by whatever the
bisect points at.
> I suppose I should ask you to try reverting this 0d1c81edc61e alone
> from 6.5-rc1: the consistency of your bisection implies that it will
> "fix" the issues, and it is a commit which we could drop. It makes
> me a little nervous, applying userspace-pagetable validation to kernel
> pagetables, so I don't want to drop it; and it would really be cargo-
> culting to drop it without understanding. But we could drop it.
I did look at that, it doesn't revert cleanly by itself. Your other
suggestions are all good - I'll poke at them. My suspicion is that
there's some longer-standing breakage elsewhere and your series (or even
just this patch) just happens to push it into happening reliably; had it
not been an mm change and a memory-related bug I'd probably have just
discounted the bisect result.
On Tue, 11 Jul 2023, Mark Brown wrote:
> On Mon, Jul 10, 2023 at 09:34:42PM -0700, Hugh Dickins wrote:
>
> > I suppose I should ask you to try reverting this 0d1c81edc61e alone
> > from 6.5-rc1: the consistency of your bisection implies that it will
> > "fix" the issues, and it is a commit which we could drop. It makes
> > me a little nervous, applying userspace-pagetable validation to kernel
> > pagetables, so I don't want to drop it; and it would really be cargo-
> > culting to drop it without understanding. But we could drop it.
>
> I did look at that, it doesn't revert cleanly by itself. ...
Right, that ptep_get() wrapper on the next line came in on top.
The patch to revert just 0d1c81edc61e is this:
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -703,10 +703,11 @@ struct page *vmalloc_to_page(const void
if (WARN_ON_ONCE(pmd_bad(*pmd)))
return NULL;
- ptep = pte_offset_kernel(pmd, addr);
+ ptep = pte_offset_map(pmd, addr);
pte = ptep_get(ptep);
if (pte_present(pte))
page = pte_page(pte);
+ pte_unmap(ptep);
return page;
}
On Tue, Jul 11, 2023 at 09:13:18AM -0700, Hugh Dickins wrote:
> On Tue, 11 Jul 2023, Mark Brown wrote:
> > On Mon, Jul 10, 2023 at 09:34:42PM -0700, Hugh Dickins wrote:
> > > I suppose I should ask you to try reverting this 0d1c81edc61e alone
> > > from 6.5-rc1: the consistency of your bisection implies that it will
> > > "fix" the issues, and it is a commit which we could drop. It makes
> > > me a little nervous, applying userspace-pagetable validation to kernel
> > > pagetables, so I don't want to drop it; and it would really be cargo-
> > > culting to drop it without understanding. But we could drop it.
> > I did look at that, it doesn't revert cleanly by itself. ...
> Right, that ptep_get() wrapper on the next line came in on top.
> The patch to revert just 0d1c81edc61e is this:
Thanks, tried that and it's still exploding in a similar way (though
this time inside a regulator call from the pinctrl code which was
happening in other cases).
On Tue, Jul 11, 2023 at 09:13:18AM -0700, Hugh Dickins wrote:
> On Tue, 11 Jul 2023, Mark Brown wrote:
> > On Mon, Jul 10, 2023 at 09:34:42PM -0700, Hugh Dickins wrote:
> >
> > > I suppose I should ask you to try reverting this 0d1c81edc61e alone
> > > from 6.5-rc1: the consistency of your bisection implies that it will
> > > "fix" the issues, and it is a commit which we could drop. It makes
> > > me a little nervous, applying userspace-pagetable validation to kernel
> > > pagetables, so I don't want to drop it; and it would really be cargo-
> > > culting to drop it without understanding. But we could drop it.
> >
> > I did look at that, it doesn't revert cleanly by itself. ...
>
> Right, that ptep_get() wrapper on the next line came in on top.
> The patch to revert just 0d1c81edc61e is this:
Still investigating but I'm pretty convinced this is nothing to do with
your commit/series and is just common or garden memory corruption that
just happens to get tickled by your changes. Sorry for the noise.
On 11.07.23 19:57, Mark Brown wrote:
> On Tue, Jul 11, 2023 at 09:13:18AM -0700, Hugh Dickins wrote:
>> On Tue, 11 Jul 2023, Mark Brown wrote:
>>> On Mon, Jul 10, 2023 at 09:34:42PM -0700, Hugh Dickins wrote:
>>>
>>>> I suppose I should ask you to try reverting this 0d1c81edc61e alone
>>>> from 6.5-rc1: the consistency of your bisection implies that it will
>>>> "fix" the issues, and it is a commit which we could drop. It makes
>>>> me a little nervous, applying userspace-pagetable validation to kernel
>>>> pagetables, so I don't want to drop it; and it would really be cargo-
>>>> culting to drop it without understanding. But we could drop it.
>>>
>>> I did look at that, it doesn't revert cleanly by itself. ...
>>
>> Right, that ptep_get() wrapper on the next line came in on top.
>> The patch to revert just 0d1c81edc61e is this:
>
> Still investigating but I'm pretty convinced this is nothing to do with
> your commit/series and is just common or garden memory corruption that
> just happens to get tickled by your changes. Sorry for the noise.
In that case:
#regzbot introduced v6.4..v6.5-rc1
#regzbot ignore-activity
Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
--
Everything you wanna know about Linux kernel regression tracking:
https://linux-regtracking.leemhuis.info/about/#tldr
That page also explains what to do if mails like this annoy you.
On Tue, Jul 11, 2023 at 06:57:33PM +0100, Mark Brown wrote:
> On Tue, Jul 11, 2023 at 09:13:18AM -0700, Hugh Dickins wrote:
> > On Tue, 11 Jul 2023, Mark Brown wrote:
> > > On Mon, Jul 10, 2023 at 09:34:42PM -0700, Hugh Dickins wrote:
> > >
> > > > I suppose I should ask you to try reverting this 0d1c81edc61e alone
> > > > from 6.5-rc1: the consistency of your bisection implies that it will
> > > > "fix" the issues, and it is a commit which we could drop. It makes
> > > > me a little nervous, applying userspace-pagetable validation to kernel
> > > > pagetables, so I don't want to drop it; and it would really be cargo-
> > > > culting to drop it without understanding. But we could drop it.
> > >
> > > I did look at that, it doesn't revert cleanly by itself. ...
> >
> > Right, that ptep_get() wrapper on the next line came in on top.
> > The patch to revert just 0d1c81edc61e is this:
>
> Still investigating but I'm pretty convinced this is nothing to do with
> your commit/series and is just common or garden memory corruption that
> just happens to get tickled by your changes. Sorry for the noise.
Did you get to the bottom of this? If not, do you have a reliable way to
reproduce the problem? I don't like the sound of memory corruption :(
Will
On Thu, Jul 20, 2023 at 11:32:28AM +0100, Will Deacon wrote:
> On Tue, Jul 11, 2023 at 06:57:33PM +0100, Mark Brown wrote:
> > Still investigating but I'm pretty convinced this is nothing to do with
> > your commit/series and is just common or garden memory corruption that
> > just happens to get tickled by your changes. Sorry for the noise.
> Did you get to the bottom of this? If not, do you have a reliable way to
> reproduce the problem? I don't like the sound of memory corruption :(
Not to the bottom of it, but getting there - I isolated the issue to
something in the unregistration path for thermal zones but didn't manage
to figure out exactly what. There was some indication it might be a use
after free but I'm not convinced.
I have a reliable way to reproduce this if you have a pine64plus; it
also shows up a lot on the Libretech Tritium, though not quite so
reliably as on the pine64plus, since Hugh's changes. Equally, the
pine64plus was rock solid until those, so there's some
timing/environment thing going on which makes the issue manifest
obviously. I expect you should be able to trigger the issue by
unregistering a thermal driver, but the effects might not be visible.
There is a change on the list to make the Allwinner SoCs not trigger the
issue during boot (their thermal driver refuses to register if any one
zone fails but most of their SoCs have multiple thermal zones with only
one fully described) but it needs fixing either way.
Hi, Hugh
It seems this change makes pte_offset_map_lock impossible to call
from out-of-tree modules; when one does, modpost reports an error
like this:
ERROR: modpost: "__pte_offset_map_lock"
[../omap-modules/android-mainline/pvr/pvrsrvkm.ko] undefined!
Not sure if you have any idea about it, and any suggestions on how to
resolve it?
Thanks,
Yongqin Liu
On Fri, 9 Jun 2023 at 09:10, Hugh Dickins <[email protected]> wrote:
>
> Make pte_offset_map() a wrapper for __pte_offset_map() (optionally
> outputs pmdval), pte_offset_map_lock() a sparse __cond_lock wrapper for
> __pte_offset_map_lock(): those __funcs added in mm/pgtable-generic.c.
>
> __pte_offset_map() does pmdval validation (including pmd_clear_bad()
> when pmd_bad()), returning NULL if pmdval is not for a page table.
> __pte_offset_map_lock() verifies pmdval unchanged after getting the
> lock, trying again if it changed.
>
> No #ifdef CONFIG_TRANSPARENT_HUGEPAGE around them: that could be done
> to cover the imminent case, but we expect to generalize it later, and
> it makes a mess of where to do the pmd_bad() clearing.
>
> Add pte_offset_map_nolock(): outputs ptl like pte_offset_map_lock(),
> without actually taking the lock. This will be preferred to open uses of
> pte_lockptr(), because (when split ptlock is in page table's struct page)
> it points to the right lock for the returned pte pointer, even if *pmd
> gets changed racily afterwards.
>
> Update corresponding Documentation.
>
> Do not add the anticipated rcu_read_lock() and rcu_read_unlock()s yet:
> they have to wait until all architectures are balancing pte_offset_map()s
> with pte_unmap()s (as in the arch series posted earlier). But comment
> where they will go, so that it's easy to add them for experiments. And
> only when those are in place can transient racy failure cases be enabled.
> Add more safety for the PAE mismatched pmd_low pmd_high case at that time.
>
> Signed-off-by: Hugh Dickins <[email protected]>
> ---
> Documentation/mm/split_page_table_lock.rst | 17 ++++---
> include/linux/mm.h | 27 +++++++----
> include/linux/pgtable.h | 22 ++++++---
> mm/pgtable-generic.c | 56 ++++++++++++++++++++++
> 4 files changed, 101 insertions(+), 21 deletions(-)
>
> diff --git a/Documentation/mm/split_page_table_lock.rst b/Documentation/mm/split_page_table_lock.rst
> index 50ee0dfc95be..a834fad9de12 100644
> --- a/Documentation/mm/split_page_table_lock.rst
> +++ b/Documentation/mm/split_page_table_lock.rst
> @@ -14,15 +14,20 @@ tables. Access to higher level tables protected by mm->page_table_lock.
> There are helpers to lock/unlock a table and other accessor functions:
>
> - pte_offset_map_lock()
> - maps pte and takes PTE table lock, returns pointer to the taken
> - lock;
> + maps PTE and takes PTE table lock, returns pointer to PTE with
> + pointer to its PTE table lock, or returns NULL if no PTE table;
> + - pte_offset_map_nolock()
> + maps PTE, returns pointer to PTE with pointer to its PTE table
> + lock (not taken), or returns NULL if no PTE table;
> + - pte_offset_map()
> + maps PTE, returns pointer to PTE, or returns NULL if no PTE table;
> + - pte_unmap()
> + unmaps PTE table;
> - pte_unmap_unlock()
> unlocks and unmaps PTE table;
> - pte_alloc_map_lock()
> - allocates PTE table if needed and take the lock, returns pointer
> - to taken lock or NULL if allocation failed;
> - - pte_lockptr()
> - returns pointer to PTE table lock;
> + allocates PTE table if needed and takes its lock, returns pointer to
> + PTE with pointer to its lock, or returns NULL if allocation failed;
> - pmd_lock()
> takes PMD table lock, returns pointer to taken lock;
> - pmd_lockptr()
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 27ce77080c79..3c2e56980853 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -2787,14 +2787,25 @@ static inline void pgtable_pte_page_dtor(struct page *page)
> dec_lruvec_page_state(page, NR_PAGETABLE);
> }
>
> -#define pte_offset_map_lock(mm, pmd, address, ptlp) \
> -({ \
> - spinlock_t *__ptl = pte_lockptr(mm, pmd); \
> - pte_t *__pte = pte_offset_map(pmd, address); \
> - *(ptlp) = __ptl; \
> - spin_lock(__ptl); \
> - __pte; \
> -})
> +pte_t *__pte_offset_map(pmd_t *pmd, unsigned long addr, pmd_t *pmdvalp);
> +static inline pte_t *pte_offset_map(pmd_t *pmd, unsigned long addr)
> +{
> + return __pte_offset_map(pmd, addr, NULL);
> +}
> +
> +pte_t *__pte_offset_map_lock(struct mm_struct *mm, pmd_t *pmd,
> + unsigned long addr, spinlock_t **ptlp);
> +static inline pte_t *pte_offset_map_lock(struct mm_struct *mm, pmd_t *pmd,
> + unsigned long addr, spinlock_t **ptlp)
> +{
> + pte_t *pte;
> +
> + __cond_lock(*ptlp, pte = __pte_offset_map_lock(mm, pmd, addr, ptlp));
> + return pte;
> +}
> +
> +pte_t *pte_offset_map_nolock(struct mm_struct *mm, pmd_t *pmd,
> + unsigned long addr, spinlock_t **ptlp);
>
> #define pte_unmap_unlock(pte, ptl) do { \
> spin_unlock(ptl); \
> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> index 94235ff2706e..3fabbb018557 100644
> --- a/include/linux/pgtable.h
> +++ b/include/linux/pgtable.h
> @@ -94,14 +94,22 @@ static inline pte_t *pte_offset_kernel(pmd_t *pmd, unsigned long address)
> #define pte_offset_kernel pte_offset_kernel
> #endif
>
> -#if defined(CONFIG_HIGHPTE)
> -#define pte_offset_map(dir, address) \
> - ((pte_t *)kmap_local_page(pmd_page(*(dir))) + \
> - pte_index((address)))
> -#define pte_unmap(pte) kunmap_local((pte))
> +#ifdef CONFIG_HIGHPTE
> +#define __pte_map(pmd, address) \
> + ((pte_t *)kmap_local_page(pmd_page(*(pmd))) + pte_index((address)))
> +#define pte_unmap(pte) do { \
> + kunmap_local((pte)); \
> + /* rcu_read_unlock() to be added later */ \
> +} while (0)
> #else
> -#define pte_offset_map(dir, address) pte_offset_kernel((dir), (address))
> -#define pte_unmap(pte) ((void)(pte)) /* NOP */
> +static inline pte_t *__pte_map(pmd_t *pmd, unsigned long address)
> +{
> + return pte_offset_kernel(pmd, address);
> +}
> +static inline void pte_unmap(pte_t *pte)
> +{
> + /* rcu_read_unlock() to be added later */
> +}
> #endif
>
> /* Find an entry in the second-level page table.. */
> diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
> index d2fc52bffafc..c7ab18a5fb77 100644
> --- a/mm/pgtable-generic.c
> +++ b/mm/pgtable-generic.c
> @@ -10,6 +10,8 @@
> #include <linux/pagemap.h>
> #include <linux/hugetlb.h>
> #include <linux/pgtable.h>
> +#include <linux/swap.h>
> +#include <linux/swapops.h>
> #include <linux/mm_inline.h>
> #include <asm/tlb.h>
>
> @@ -229,3 +231,57 @@ pmd_t pmdp_collapse_flush(struct vm_area_struct *vma, unsigned long address,
> }
> #endif
> #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
> +
> +pte_t *__pte_offset_map(pmd_t *pmd, unsigned long addr, pmd_t *pmdvalp)
> +{
> + pmd_t pmdval;
> +
> + /* rcu_read_lock() to be added later */
> + pmdval = pmdp_get_lockless(pmd);
> + if (pmdvalp)
> + *pmdvalp = pmdval;
> + if (unlikely(pmd_none(pmdval) || is_pmd_migration_entry(pmdval)))
> + goto nomap;
> + if (unlikely(pmd_trans_huge(pmdval) || pmd_devmap(pmdval)))
> + goto nomap;
> + if (unlikely(pmd_bad(pmdval))) {
> + pmd_clear_bad(pmd);
> + goto nomap;
> + }
> + return __pte_map(&pmdval, addr);
> +nomap:
> + /* rcu_read_unlock() to be added later */
> + return NULL;
> +}
> +
> +pte_t *pte_offset_map_nolock(struct mm_struct *mm, pmd_t *pmd,
> + unsigned long addr, spinlock_t **ptlp)
> +{
> + pmd_t pmdval;
> + pte_t *pte;
> +
> + pte = __pte_offset_map(pmd, addr, &pmdval);
> + if (likely(pte))
> + *ptlp = pte_lockptr(mm, &pmdval);
> + return pte;
> +}
> +
> +pte_t *__pte_offset_map_lock(struct mm_struct *mm, pmd_t *pmd,
> + unsigned long addr, spinlock_t **ptlp)
> +{
> + spinlock_t *ptl;
> + pmd_t pmdval;
> + pte_t *pte;
> +again:
> + pte = __pte_offset_map(pmd, addr, &pmdval);
> + if (unlikely(!pte))
> + return pte;
> + ptl = pte_lockptr(mm, &pmdval);
> + spin_lock(ptl);
> + if (likely(pmd_same(pmdval, pmdp_get_lockless(pmd)))) {
> + *ptlp = ptl;
> + return pte;
> + }
> + pte_unmap_unlock(pte, ptl);
> + goto again;
> +}
> --
> 2.35.3
>
--
Best Regards,
Yongqin Liu
---------------------------------------------------------------
#mailing list
[email protected]
http://lists.linaro.org/mailman/listinfo/linaro-android
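For reference, here is a minimal caller-side sketch of how the interfaces
described in the patch quoted above are intended to be used: the NULL
return (no page table at *pmd) is the new case every caller has to handle.
The function name and the -EAGAIN convention are illustrative assumptions,
not taken from the series.

#include <linux/errno.h>
#include <linux/mm.h>

/*
 * Illustrative only: map and lock one PTE with the new API.
 * pte_offset_map_lock() may now return NULL if there is no page
 * table at *pmd (or if it changed before the lock was taken);
 * the caller decides whether that means "skip" or "try again".
 */
static int example_touch_pte(struct mm_struct *mm, pmd_t *pmd,
			     unsigned long addr)
{
	spinlock_t *ptl;
	pte_t *pte;

	pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
	if (!pte)
		return -EAGAIN;		/* no page table here */

	/* examine or modify *pte under ptl */

	pte_unmap_unlock(pte, ptl);
	return 0;
}

pte_offset_map_nolock() follows the same pattern, except that *ptlp is
returned without being taken, and the pte must still be pte_unmap()ed
when done.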
On Fri, Jul 28, 2023 at 09:53:29PM +0800, Yongqin Liu wrote:
> Hi, Hugh
>
> It seems this change makes pte_offset_map_lock impossible to call
> from out-of-tree modules; when one does, modpost reports an error
> like this:
> ERROR: modpost: "__pte_offset_map_lock"
> [../omap-modules/android-mainline/pvr/pvrsrvkm.ko] undefined!
>
> Not sure if you have any idea about it, and any suggestions on how to
> resolve it?
Please explain why this module needs to map page tables
On Fri, 28 Jul 2023, Matthew Wilcox wrote:
> On Fri, Jul 28, 2023 at 09:53:29PM +0800, Yongqin Liu wrote:
> > Hi, Hugh
> >
> > It seems this change makes pte_offset_map_lock impossible to call
> > from out-of-tree modules; when one does, modpost reports an error
> > like this:
> > ERROR: modpost: "__pte_offset_map_lock"
> > [../omap-modules/android-mainline/pvr/pvrsrvkm.ko] undefined!
> >
> > Not sure if you have any idea about it, and any suggestions on how to
> > resolve it?
>
> Please explain why this module needs to map page tables
+1
Thank you for testing 6.5-rc, and I am sorry to have inconvenienced you.
But there is not one example of an in-tree module needing that,
which is a very strong hint that no module should be needing that.
Sounds like pvrsrvkm.ko wants to muck around with page table entries,
without the core mm knowing. Not something core mm can encourage!
If what pvrsrvkm.ko is aiming to do there would be useful for others,
maybe its owner can share that, and work with core mm developers to
expose a generally useful interface - but that is not likely to be
__pte_offset_map_lock itself.
Hugh
On Sat, 29 Jul 2023 at 00:58, Hugh Dickins <[email protected]> wrote:
>
> On Fri, 28 Jul 2023, Matthew Wilcox wrote:
> > On Fri, Jul 28, 2023 at 09:53:29PM +0800, Yongqin Liu wrote:
> > > Hi, Hugh
> > >
> > > It seems this change makes pte_offset_map_lock impossible to call
> > > from out-of-tree modules; when one does, modpost reports an error
> > > like this:
> > > ERROR: modpost: "__pte_offset_map_lock"
> > > [../omap-modules/android-mainline/pvr/pvrsrvkm.ko] undefined!
> > >
> > > Not sure if you have any idea about it, and any suggestions on how to
> > > resolve it?
> >
> > Please explain why this module needs to map page tables
>
> +1
Sorry, I am not able to give any explanation here,
I am not familiar with the pvrsrvkm source, I just use it to have one
working AOSP build.
here is the source file where pte_offset_map_lock is called,
https://android-git.linaro.org/kernel/omap-modules.git/tree/pvr/services4/srvkm/env/linux/osfunc.c?h=android-mainline#n3508
in case you could know something with a quick look.
Otherwise, it has to wait for another one to report the problem again.
> Thank you for testing 6.5-rc, and I am sorry to have inconvenienced you.
>
> But there is not one example of an in-tree module needing that,
> which is a very strong hint that no module should be needing that.
>
> Sounds like pvrsrvkm.ko wants to muck around with page table entries,
> without the core mm knowing. Not something core mm can encourage!
>
> If what pvrsrvkm.ko is aiming to do there would be useful for others,
> maybe its owner can share that, and work with core mm developers to
> expose a generally useful interface - but that is not likely to be
> __pte_offset_map_lock itself.
>
Thanks for the explanation!
Let's see if any other pvrsrvkm engineer or other out of tree modules could help
give some explanations on this case or similar cases.
Thanks,
Yongqin Liu
--
Best Regards,
Yongqin Liu
---------------------------------------------------------------
#mailing list
[email protected]
http://lists.linaro.org/mailman/listinfo/linaro-android
On Sun, Aug 06, 2023 at 12:06:28AM +0800, Yongqin Liu wrote:
> On Sat, 29 Jul 2023 at 00:58, Hugh Dickins <[email protected]> wrote:
> >
> > On Fri, 28 Jul 2023, Matthew Wilcox wrote:
> > > On Fri, Jul 28, 2023 at 09:53:29PM +0800, Yongqin Liu wrote:
> > > > Hi, Hugh
> > > >
> > > > It seems this change makes pte_offset_map_lock impossible to call
> > > > from out-of-tree modules; when one does, modpost reports an error
> > > > like this:
> > > > ERROR: modpost: "__pte_offset_map_lock"
> > > > [../omap-modules/android-mainline/pvr/pvrsrvkm.ko] undefined!
> > > >
> > > > Not sure if you have any idea about it, and any suggestions on how to
> > > > resolve it?
> > >
> > > Please explain why this module needs to map page tables
> >
> > +1
> Sorry, I am not able to give any explanation here,
> I am not familiar with the pvrsrvkm source, I just use it to have one
> working AOSP build.
>
> here is the source file where pte_offset_map_lock is called,
> https://android-git.linaro.org/kernel/omap-modules.git/tree/pvr/services4/srvkm/env/linux/osfunc.c?h=android-mainline#n3508
> in case you could know something with a quick look.
Isn't this just get_user_pages()?
On 8/5/23 10:07, Matthew Wilcox wrote:
...
>> Sorry, I am not able to give any explanation here,
>> I am not familiar with the pvrsrvkm source, I just use it to have one
>> working AOSP build.
>>
>> here is the source file where pte_offset_map_lock is called,
>> https://android-git.linaro.org/kernel/omap-modules.git/tree/pvr/services4/srvkm/env/linux/osfunc.c?h=android-mainline#n3508
>> in case you could know something with a quick look.
>
> Isn't this just get_user_pages()?
Or even just follow_page(), which looks like a nearly perfect drop-in
replacement, especially since that android link also says, "The page in
question must be present (i.e. no fault handling required)".
thanks,
--
John Hubbard
NVIDIA
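A rough sketch of the kind of replacement being suggested here, assuming
the driver only needs the struct pages behind user addresses that are
already populated; the function name, the use of get_user_pages_fast(),
and the FOLL_WRITE flag are assumptions that would need checking against
what pvrsrvkm actually does with the pages.

#include <linux/errno.h>
#include <linux/mm.h>

/*
 * Hypothetical alternative to walking page tables by hand: let GUP
 * look up, and take references on, the pages behind a user buffer.
 * The caller must put_page() each page when it is finished with it.
 */
static int example_lookup_user_pages(unsigned long uaddr, int nr_pages,
				     struct page **pages)
{
	int got;

	/* FOLL_WRITE only if the pages will be written to */
	got = get_user_pages_fast(uaddr, nr_pages, FOLL_WRITE, pages);
	if (got < 0)
		return got;
	if (got < nr_pages) {
		/* partial lookup: drop what we got and report failure */
		while (got > 0)
			put_page(pages[--got]);
		return -EFAULT;
	}
	return 0;
}

follow_page(), as suggested above, would avoid taking references at all,
but then the caller has to provide its own guarantee that the pages stay
present for as long as it uses them.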
On Tue, Aug 08, 2023 at 07:52:43AM +0200, Linux regression tracking (Thorsten Leemhuis) wrote:
> Hi Mark, just wondering: did anything come out of this, and is this
> still happening? I'm asking as I still have this on my list of
> tracked regressions.
It's fixed.
On 20.07.23 14:06, Mark Brown wrote:
> On Thu, Jul 20, 2023 at 11:32:28AM +0100, Will Deacon wrote:
>> On Tue, Jul 11, 2023 at 06:57:33PM +0100, Mark Brown wrote:
>
>>> Still investigating but I'm pretty convinced this is nothing to do with
>>> your commit/series and is just common or garden memory corruption that
>>> just happens to get tickled by your changes. Sorry for the noise.
>
>> Did you get to the bottom of this? If not, do you have a reliable way to
>> reproduce the problem? I don't like the sound of memory corruption :(
>
> Not to the bottom of it, but getting there - I isolated the issue to
> something in the unregistration path for thermal zones but didn't manage
> to figure out exactly what.
Hi Mark, just wondering: did anything come out of this, and is this
still happening? I'm asking as I still have this on my list of
tracked regressions.
Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
--
Everything you wanna know about Linux kernel regression tracking:
https://linux-regtracking.leemhuis.info/about/#tldr
If I did something stupid, please tell me, as explained on that page.
#regzbot poke
> There was some indication it might be a use
> after free but I'm not convinced.
>
> I have a reliable way to reproduce this if you have a pine64plus; it
> also shows up a lot on the Libretech Tritium, though not quite so
> reliably as on the pine64plus, since Hugh's changes. Equally, the
> pine64plus was rock solid until those, so there's some
> timing/environment thing going on which makes the issue manifest
> obviously. I expect you should be able to trigger the issue by
> unregistering a thermal driver, but the effects might not be visible.
>
> There is a change on the list to make the Allwinner SoCs not trigger the
> issue during boot (their thermal driver refuses to register if any one
> zone fails but most of their SoCs have multiple thermal zones with only
> one fully described) but it needs fixing either way.
On 08.08.23 13:09, Mark Brown wrote:
> On Tue, Aug 08, 2023 at 07:52:43AM +0200, Linux regression tracking (Thorsten Leemhuis) wrote:
>
>> Hi Mark, just wondering: did anything come out of this, and is this
>> still happening? I'm asking as I still have this on my list of
>> tracked regressions.
>
> It's fixed.
In that case:
#regzbot resolve: fixed according to reporter
#regzbot ignore-activity
Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
--
Everything you wanna know about Linux kernel regression tracking:
https://linux-regtracking.leemhuis.info/about/#tldr
That page also explains what to do if mails like this annoy you.