2022-02-07 13:52:28

by Hugh Dickins

Subject: [PATCH 00/13] mm/munlock: rework of mlock+munlock page handling

I wondered whether to post this munlocking rework in
https://lore.kernel.org/linux-mm/[email protected]/

There the discussion was OOM reaping, but the main reason for the rework
has been catastrophic contention on i_mmap_rwsem when exiting from
multiply mlocked files; and frustration with how contorted munlocking is.
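
(For a concrete picture of that workload, here is a minimal userspace
sketch - illustration only, not part of the series.  The file path, sizes
and child count are placeholders, and it needs CAP_IPC_LOCK or a generous
RLIMIT_MEMLOCK.  Many tasks mlock the same file mapping and then exit:
munlocking at each exit is where the i_mmap_rwsem contention described
above showed up.)

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

#define NCHILD	64
#define LEN	(64UL << 20)		/* 64MB mapped and mlocked per child */

int main(void)
{
	int fd = open("/tmp/mlock-test", O_RDWR | O_CREAT, 0600);

	if (fd < 0 || ftruncate(fd, LEN))
		return 1;

	for (int i = 0; i < NCHILD; i++) {
		if (fork() == 0) {
			void *p = mmap(NULL, LEN, PROT_READ | PROT_WRITE,
				       MAP_SHARED, fd, 0);
			if (p == MAP_FAILED || mlock(p, LEN))
				_exit(1);
			_exit(0);	/* exit munlocks and unmaps everything */
		}
	}
	while (wait(NULL) > 0)
		;
	return 0;
}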

Here it is based on 5.17-rc2, applies also to -rc3, almost cleanly to
mmotm 2022-02-03-21-58 (needs two easy fixups in mm/huge_memory.c); but
likely to conflict (I hope not fundamentally) with several concurrent
large patchsets.

tl;dr
mm/mlock.c | 634 +++++++++++++++-----------------------
23 files changed, 510 insertions(+), 779 deletions(-)

In my own opinion, this is ready to go in: that whatever its defects,
it's a big practical improvement over what's presently there. Others
may differ; and it may be too much to arrive at a quick opinion on.
My hope is that it will strike a chord with some who have laboured in
this area in the past: I'll be short of time to do much more with it,
so maybe someone else could take over if necessary.

At present there's no update to Documentation/vm/unevictable-lru.rst:
that always needs a different mindset, can follow later, please refer
to commit messages for now.

There are two half-related mm/thp patches at the end: enhancements
we've had for a long time, but more suited after the mlock changes.

01/13 mm/munlock: delete page_mlock() and all its works
02/13 mm/munlock: delete FOLL_MLOCK and FOLL_POPULATE gupflags
03/13 mm/munlock: delete munlock_vma_pages_all(), allow oomreap
04/13 mm/munlock: rmap call mlock_vma_page() munlock_vma_page()
05/13 mm/munlock: replace clear_page_mlock() by final clearance
06/13 mm/munlock: maintain page->mlock_count while unevictable
07/13 mm/munlock: mlock_pte_range() when mlocking or munlocking
08/13 mm/migrate: __unmap_and_move() push good newpage to LRU
09/13 mm/munlock: delete smp_mb() from __pagevec_lru_add_fn()
10/13 mm/munlock: mlock_page() munlock_page() batch by pagevec
11/13 mm/munlock: page migration needs mlock pagevec drained
12/13 mm/thp: collapse_file() do try_to_unmap(TTU_BATCH_FLUSH)
13/13 mm/thp: shrink_page_list() avoid splitting VM_LOCKED THP

include/linux/mm.h | 2
include/linux/mm_inline.h | 11
include/linux/mm_types.h | 19 +
include/linux/rmap.h | 23 -
kernel/events/uprobes.c | 7
mm/gup.c | 43 --
mm/huge_memory.c | 55 ---
mm/hugetlb.c | 4
mm/internal.h | 64 ++-
mm/khugepaged.c | 14
mm/ksm.c | 12
mm/madvise.c | 5
mm/memcontrol.c | 3
mm/memory.c | 45 --
mm/migrate.c | 42 +-
mm/mlock.c | 634 +++++++++++++++-----------------------
mm/mmap.c | 32 -
mm/mmzone.c | 7
mm/oom_kill.c | 2
mm/rmap.c | 156 ++-------
mm/swap.c | 89 ++---
mm/userfaultfd.c | 14
mm/vmscan.c | 6
23 files changed, 510 insertions(+), 779 deletions(-)

Hugh


2022-02-07 14:51:52

by Hugh Dickins

Subject: [PATCH 03/13] mm/munlock: delete munlock_vma_pages_all(), allow oomreap

munlock_vma_pages_range() will still be required, when munlocking but
not munmapping a set of pages; but when unmapping a pte, the mlock count
will be maintained in much the same way as it will be maintained when
mapping in the pte. Which removes the need for munlock_vma_pages_all()
on mlocked vmas when munmapping or exiting: eliminating the catastrophic
contention on i_mmap_rwsem, and the need for page lock on the pages.

There is still a need to update locked_vm accounting according to the
munmapped vmas when munmapping: do that in detach_vmas_to_be_unmapped().
exit_mmap() does not need locked_vm updates, so delete unlock_range().
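
(Illustration only, not part of the patch - a userspace sketch of the two
operations meant above, assuming a sufficient RLIMIT_MEMLOCK.  munlock()
of part of a range is the "munlocking but not munmapping" case; the final
munmap() of a range still VM_LOCKED is where detach_vmas_to_be_unmapped()
now subtracts from locked_vm.)

#include <sys/mman.h>

int main(void)
{
	size_t len = 8UL << 20;
	char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (p == MAP_FAILED || mlock(p, len))
		return 1;

	munlock(p, len / 2);	/* munlock without munmap: first half */
	munmap(p, len);		/* munmap while the second half is still VM_LOCKED */
	return 0;
}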

And wasn't I the one who forbade the OOM reaper to attack mlocked vmas,
because of the uncertainty in blocking on all those page locks?
No fear of that now, so permit the OOM reaper on mlocked vmas.

Signed-off-by: Hugh Dickins <[email protected]>
---
mm/internal.h | 16 ++--------------
mm/madvise.c | 5 +++++
mm/mlock.c | 4 ++--
mm/mmap.c | 32 ++------------------------------
mm/oom_kill.c | 2 +-
5 files changed, 12 insertions(+), 47 deletions(-)

diff --git a/mm/internal.h b/mm/internal.h
index e48c486d5ddf..f235aa92e564 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -71,11 +71,6 @@ void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *start_vma,
unsigned long floor, unsigned long ceiling);
void pmd_install(struct mm_struct *mm, pmd_t *pmd, pgtable_t *pte);

-static inline bool can_madv_lru_vma(struct vm_area_struct *vma)
-{
- return !(vma->vm_flags & (VM_LOCKED|VM_HUGETLB|VM_PFNMAP));
-}
-
struct zap_details;
void unmap_page_range(struct mmu_gather *tlb,
struct vm_area_struct *vma,
@@ -398,12 +393,8 @@ extern long populate_vma_page_range(struct vm_area_struct *vma,
extern long faultin_vma_page_range(struct vm_area_struct *vma,
unsigned long start, unsigned long end,
bool write, int *locked);
-extern void munlock_vma_pages_range(struct vm_area_struct *vma,
- unsigned long start, unsigned long end);
-static inline void munlock_vma_pages_all(struct vm_area_struct *vma)
-{
- munlock_vma_pages_range(vma, vma->vm_start, vma->vm_end);
-}
+extern int mlock_future_check(struct mm_struct *mm, unsigned long flags,
+ unsigned long len);

/*
* must be called with vma's mmap_lock held for read or write, and page locked.
@@ -411,9 +402,6 @@ static inline void munlock_vma_pages_all(struct vm_area_struct *vma)
extern void mlock_vma_page(struct page *page);
extern void munlock_vma_page(struct page *page);

-extern int mlock_future_check(struct mm_struct *mm, unsigned long flags,
- unsigned long len);
-
/*
* Clear the page's PageMlocked(). This can be useful in a situation where
* we want to unconditionally remove a page from the pagecache -- e.g.,
diff --git a/mm/madvise.c b/mm/madvise.c
index 5604064df464..ae35d72627ef 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -530,6 +530,11 @@ static void madvise_cold_page_range(struct mmu_gather *tlb,
tlb_end_vma(tlb, vma);
}

+static inline bool can_madv_lru_vma(struct vm_area_struct *vma)
+{
+ return !(vma->vm_flags & (VM_LOCKED|VM_HUGETLB|VM_PFNMAP));
+}
+
static long madvise_cold(struct vm_area_struct *vma,
struct vm_area_struct **prev,
unsigned long start_addr, unsigned long end_addr)
diff --git a/mm/mlock.c b/mm/mlock.c
index 544c18ce2c58..d148da934fe9 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -137,8 +137,8 @@ void munlock_vma_page(struct page *page)
* Returns with VM_LOCKED cleared. Callers must be prepared to
* deal with this.
*/
-void munlock_vma_pages_range(struct vm_area_struct *vma,
- unsigned long start, unsigned long end)
+static void munlock_vma_pages_range(struct vm_area_struct *vma,
+ unsigned long start, unsigned long end)
{
/* Reimplementation to follow in later commit */
}
diff --git a/mm/mmap.c b/mm/mmap.c
index 1e8fdb0b51ed..64b5985b5295 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2674,6 +2674,8 @@ detach_vmas_to_be_unmapped(struct mm_struct *mm, struct vm_area_struct *vma,
vma->vm_prev = NULL;
do {
vma_rb_erase(vma, &mm->mm_rb);
+ if (vma->vm_flags & VM_LOCKED)
+ mm->locked_vm -= vma_pages(vma);
mm->map_count--;
tail_vma = vma;
vma = vma->vm_next;
@@ -2778,22 +2780,6 @@ int split_vma(struct mm_struct *mm, struct vm_area_struct *vma,
return __split_vma(mm, vma, addr, new_below);
}

-static inline void
-unlock_range(struct vm_area_struct *start, unsigned long limit)
-{
- struct mm_struct *mm = start->vm_mm;
- struct vm_area_struct *tmp = start;
-
- while (tmp && tmp->vm_start < limit) {
- if (tmp->vm_flags & VM_LOCKED) {
- mm->locked_vm -= vma_pages(tmp);
- munlock_vma_pages_all(tmp);
- }
-
- tmp = tmp->vm_next;
- }
-}
-
/* Munmap is split into 2 main parts -- this part which finds
* what needs doing, and the areas themselves, which do the
* work. This now handles partial unmappings.
@@ -2874,12 +2860,6 @@ int __do_munmap(struct mm_struct *mm, unsigned long start, size_t len,
return error;
}

- /*
- * unlock any mlock()ed ranges before detaching vmas
- */
- if (mm->locked_vm)
- unlock_range(vma, end);
-
/* Detach vmas from rbtree */
if (!detach_vmas_to_be_unmapped(mm, vma, prev, end))
downgrade = false;
@@ -3147,20 +3127,12 @@ void exit_mmap(struct mm_struct *mm)
* Nothing can be holding mm->mmap_lock here and the above call
* to mmu_notifier_release(mm) ensures mmu notifier callbacks in
* __oom_reap_task_mm() will not block.
- *
- * This needs to be done before calling unlock_range(),
- * which clears VM_LOCKED, otherwise the oom reaper cannot
- * reliably test it.
*/
(void)__oom_reap_task_mm(mm);
-
set_bit(MMF_OOM_SKIP, &mm->flags);
}

mmap_write_lock(mm);
- if (mm->locked_vm)
- unlock_range(mm->mmap, ULONG_MAX);
-
arch_exit_mmap(mm);

vma = mm->mmap;
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 832fb330376e..6b875acabd1e 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -526,7 +526,7 @@ bool __oom_reap_task_mm(struct mm_struct *mm)
set_bit(MMF_UNSTABLE, &mm->flags);

for (vma = mm->mmap ; vma; vma = vma->vm_next) {
- if (!can_madv_lru_vma(vma))
+ if (vma->vm_flags & (VM_HUGETLB|VM_PFNMAP))
continue;

/*
--
2.34.1


2022-02-07 15:12:39

by Hugh Dickins

Subject: [PATCH 02/13] mm/munlock: delete FOLL_MLOCK and FOLL_POPULATE

If counting page mlocks, we must not double-count: follow_page_pte() can
tell if a page has already been Mlocked or not, but cannot tell if a pte
has already been counted or not: that will have to be done when the pte
is mapped in (which lru_cache_add_inactive_or_unevictable() already tracks
for new anon pages, but there's no such tracking yet for others).

Delete all the FOLL_MLOCK code - faulting in the missing pages will do
all that is necessary, without special mlock_vma_page() calls from here.

But then FOLL_POPULATE turns out to serve no purpose - it was there so
that its absence would tell faultin_page() not to fault in a page when
setting up VM_LOCKONFAULT areas; but if there's no special work needed
here for mlock, then there's no work at all here for VM_LOCKONFAULT.

Have I got that right? I've not looked into the history, but see that
FOLL_POPULATE goes back before VM_LOCKONFAULT: did it serve a different
purpose before? Ah, yes, it was used to skip the old stack guard page.

And is it intentional that COW is not broken on existing pages when
setting up a VM_LOCKONFAULT area? I can see that being argued either
way, and have no reason to disagree with current behaviour.
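
(Illustration only: a userspace sketch of the VM_LOCKONFAULT behaviour in
question, assuming mlock2() and MLOCK_ONFAULT are available - glibc 2.27+,
kernel 4.4+ - and enough RLIMIT_MEMLOCK.  Nothing is populated at mlock2()
time; each page is locked only as the loop below faults it in.)

#define _GNU_SOURCE
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	long pagesize = sysconf(_SC_PAGESIZE);
	size_t len = 16UL << 20;	/* 16MB, initially unpopulated */
	char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (p == MAP_FAILED)
		return 1;
	if (mlock2(p, len, MLOCK_ONFAULT))	/* VM_LOCKED|VM_LOCKONFAULT, no populate */
		return 1;

	/* Each write faults a page in; only then is that page mlocked */
	for (size_t off = 0; off < len; off += pagesize)
		p[off] = 1;

	return 0;
}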

Signed-off-by: Hugh Dickins <[email protected]>
---
include/linux/mm.h | 2 --
mm/gup.c | 43 ++++++++-----------------------------------
mm/huge_memory.c | 33 ---------------------------------
3 files changed, 8 insertions(+), 70 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 213cc569b192..74ee50c2033b 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2925,13 +2925,11 @@ struct page *follow_page(struct vm_area_struct *vma, unsigned long address,
#define FOLL_FORCE 0x10 /* get_user_pages read/write w/o permission */
#define FOLL_NOWAIT 0x20 /* if a disk transfer is needed, start the IO
* and return without waiting upon it */
-#define FOLL_POPULATE 0x40 /* fault in pages (with FOLL_MLOCK) */
#define FOLL_NOFAULT 0x80 /* do not fault in pages */
#define FOLL_HWPOISON 0x100 /* check page is hwpoisoned */
#define FOLL_NUMA 0x200 /* force NUMA hinting page fault */
#define FOLL_MIGRATION 0x400 /* wait for page to replace migration entry */
#define FOLL_TRIED 0x800 /* a retry, previous pass started an IO */
-#define FOLL_MLOCK 0x1000 /* lock present pages */
#define FOLL_REMOTE 0x2000 /* we are working on non-current tsk/mm */
#define FOLL_COW 0x4000 /* internal GUP flag */
#define FOLL_ANON 0x8000 /* don't do file mappings */
diff --git a/mm/gup.c b/mm/gup.c
index f0af462ac1e2..2076902344d8 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -572,32 +572,6 @@ static struct page *follow_page_pte(struct vm_area_struct *vma,
*/
mark_page_accessed(page);
}
- if ((flags & FOLL_MLOCK) && (vma->vm_flags & VM_LOCKED)) {
- /* Do not mlock pte-mapped THP */
- if (PageTransCompound(page))
- goto out;
-
- /*
- * The preliminary mapping check is mainly to avoid the
- * pointless overhead of lock_page on the ZERO_PAGE
- * which might bounce very badly if there is contention.
- *
- * If the page is already locked, we don't need to
- * handle it now - vmscan will handle it later if and
- * when it attempts to reclaim the page.
- */
- if (page->mapping && trylock_page(page)) {
- lru_add_drain(); /* push cached pages to LRU */
- /*
- * Because we lock page here, and migration is
- * blocked by the pte's page reference, and we
- * know the page is still mapped, we don't even
- * need to check for file-cache page truncation.
- */
- mlock_vma_page(page);
- unlock_page(page);
- }
- }
out:
pte_unmap_unlock(ptep, ptl);
return page;
@@ -920,9 +894,6 @@ static int faultin_page(struct vm_area_struct *vma,
unsigned int fault_flags = 0;
vm_fault_t ret;

- /* mlock all present pages, but do not fault in new pages */
- if ((*flags & (FOLL_POPULATE | FOLL_MLOCK)) == FOLL_MLOCK)
- return -ENOENT;
if (*flags & FOLL_NOFAULT)
return -EFAULT;
if (*flags & FOLL_WRITE)
@@ -1173,8 +1144,6 @@ static long __get_user_pages(struct mm_struct *mm,
case -ENOMEM:
case -EHWPOISON:
goto out;
- case -ENOENT:
- goto next_page;
}
BUG();
} else if (PTR_ERR(page) == -EEXIST) {
@@ -1472,9 +1441,14 @@ long populate_vma_page_range(struct vm_area_struct *vma,
VM_BUG_ON_VMA(end > vma->vm_end, vma);
mmap_assert_locked(mm);

- gup_flags = FOLL_TOUCH | FOLL_POPULATE | FOLL_MLOCK;
+ /*
+ * Rightly or wrongly, the VM_LOCKONFAULT case has never used
+ * faultin_page() to break COW, so it has no work to do here.
+ */
if (vma->vm_flags & VM_LOCKONFAULT)
- gup_flags &= ~FOLL_POPULATE;
+ return nr_pages;
+
+ gup_flags = FOLL_TOUCH;
/*
* We want to touch writable mappings with a write fault in order
* to break COW, except for shared mappings because these don't COW
@@ -1541,10 +1515,9 @@ long faultin_vma_page_range(struct vm_area_struct *vma, unsigned long start,
* in the page table.
* FOLL_HWPOISON: Return -EHWPOISON instead of -EFAULT when we hit
* a poisoned page.
- * FOLL_POPULATE: Always populate memory with VM_LOCKONFAULT.
* !FOLL_FORCE: Require proper access permissions.
*/
- gup_flags = FOLL_TOUCH | FOLL_POPULATE | FOLL_MLOCK | FOLL_HWPOISON;
+ gup_flags = FOLL_TOUCH | FOLL_HWPOISON;
if (write)
gup_flags |= FOLL_WRITE;

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 406a3c28c026..9a34b85ebcf8 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1380,39 +1380,6 @@ struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,
if (flags & FOLL_TOUCH)
touch_pmd(vma, addr, pmd, flags);

- if ((flags & FOLL_MLOCK) && (vma->vm_flags & VM_LOCKED)) {
- /*
- * We don't mlock() pte-mapped THPs. This way we can avoid
- * leaking mlocked pages into non-VM_LOCKED VMAs.
- *
- * For anon THP:
- *
- * In most cases the pmd is the only mapping of the page as we
- * break COW for the mlock() -- see gup_flags |= FOLL_WRITE for
- * writable private mappings in populate_vma_page_range().
- *
- * The only scenario when we have the page shared here is if we
- * mlocking read-only mapping shared over fork(). We skip
- * mlocking such pages.
- *
- * For file THP:
- *
- * We can expect PageDoubleMap() to be stable under page lock:
- * for file pages we set it in page_add_file_rmap(), which
- * requires page to be locked.
- */
-
- if (PageAnon(page) && compound_mapcount(page) != 1)
- goto skip_mlock;
- if (PageDoubleMap(page) || !page->mapping)
- goto skip_mlock;
- if (!trylock_page(page))
- goto skip_mlock;
- if (page->mapping && !PageDoubleMap(page))
- mlock_vma_page(page);
- unlock_page(page);
- }
-skip_mlock:
page += (addr & ~HPAGE_PMD_MASK) >> PAGE_SHIFT;
VM_BUG_ON_PAGE(!PageCompound(page) && !is_zone_device_page(page), page);

--
2.34.1


2022-02-07 17:34:43

by Hugh Dickins

Subject: [PATCH 04/13] mm/munlock: rmap call mlock_vma_page() munlock_vma_page()

Add vma argument to mlock_vma_page() and munlock_vma_page(), make them
inline functions which check (vma->vm_flags & VM_LOCKED) before calling
mlock_page() and munlock_page() in mm/mlock.c.

Add bool compound to mlock_vma_page() and munlock_vma_page(): this is
because we have understandable difficulty in accounting pte maps of THPs,
and if passed a PageHead page, mlock_page() and munlock_page() cannot
tell whether it's a pmd map to be counted or a pte map to be ignored.

Add vma arg to page_add_file_rmap() and page_remove_rmap(), like the
others, and use that to call mlock_vma_page() at the end of the page
adds, and munlock_vma_page() at the end of page_remove_rmap() (end or
beginning? unimportant, but end was easier for assertions in testing).

No page lock is required (although almost all adds happen to hold it):
delete the "Serialize with page migration" BUG_ON(!PageLocked(page))s.
Certainly page lock did serialize with page migration, but I'm having
difficulty explaining why that was ever important.

Mlock accounting on THPs has been hard to define, differed between anon
and file, involved PageDoubleMap in some places and not others, required
clear_page_mlock() at some points. Keep it simple now: just count the
pmds and ignore the ptes; there is no reason for ptes to undo pmd mlocks.

page_add_new_anon_rmap() callers unchanged: they have long been calling
lru_cache_add_inactive_or_unevictable(), which does its own VM_LOCKED
handling (it also checks for not VM_SPECIAL: I think that's overcautious
and inconsistent with other checks, since mmap_region() already prevents
VM_LOCKED on VM_SPECIAL; but I haven't quite convinced myself to change it).

Signed-off-by: Hugh Dickins <[email protected]>
---
include/linux/rmap.h | 17 +++++++------
kernel/events/uprobes.c | 7 ++----
mm/huge_memory.c | 17 ++++++-------
mm/hugetlb.c | 4 +--
mm/internal.h | 36 ++++++++++++++++++++++----
mm/khugepaged.c | 4 +--
mm/ksm.c | 12 +--------
mm/memory.c | 45 +++++++++++----------------------
mm/migrate.c | 9 ++-----
mm/mlock.c | 21 ++++++----------
mm/rmap.c | 56 +++++++++++++++++++----------------------
mm/userfaultfd.c | 14 ++++++-----
12 files changed, 113 insertions(+), 129 deletions(-)

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index dc48aa8c2c94..ac29b076082b 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -167,18 +167,19 @@ struct anon_vma *page_get_anon_vma(struct page *page);
*/
void page_move_anon_rmap(struct page *, struct vm_area_struct *);
void page_add_anon_rmap(struct page *, struct vm_area_struct *,
- unsigned long, bool);
+ unsigned long address, bool compound);
void do_page_add_anon_rmap(struct page *, struct vm_area_struct *,
- unsigned long, int);
+ unsigned long address, int flags);
void page_add_new_anon_rmap(struct page *, struct vm_area_struct *,
- unsigned long, bool);
-void page_add_file_rmap(struct page *, bool);
-void page_remove_rmap(struct page *, bool);
-
+ unsigned long address, bool compound);
+void page_add_file_rmap(struct page *, struct vm_area_struct *,
+ bool compound);
+void page_remove_rmap(struct page *, struct vm_area_struct *,
+ bool compound);
void hugepage_add_anon_rmap(struct page *, struct vm_area_struct *,
- unsigned long);
+ unsigned long address);
void hugepage_add_new_anon_rmap(struct page *, struct vm_area_struct *,
- unsigned long);
+ unsigned long address);

static inline void page_dup_rmap(struct page *page, bool compound)
{
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 6357c3580d07..eed2f7437d96 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -173,7 +173,7 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr,
return err;
}

- /* For try_to_free_swap() and munlock_vma_page() below */
+ /* For try_to_free_swap() below */
lock_page(old_page);

mmu_notifier_invalidate_range_start(&range);
@@ -201,13 +201,10 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr,
set_pte_at_notify(mm, addr, pvmw.pte,
mk_pte(new_page, vma->vm_page_prot));

- page_remove_rmap(old_page, false);
+ page_remove_rmap(old_page, vma, false);
if (!page_mapped(old_page))
try_to_free_swap(old_page);
page_vma_mapped_walk_done(&pvmw);
-
- if ((vma->vm_flags & VM_LOCKED) && !PageCompound(old_page))
- munlock_vma_page(old_page);
put_page(old_page);

err = 0;
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 9a34b85ebcf8..d6477f48a27e 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1577,7 +1577,7 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,

if (pmd_present(orig_pmd)) {
page = pmd_page(orig_pmd);
- page_remove_rmap(page, true);
+ page_remove_rmap(page, vma, true);
VM_BUG_ON_PAGE(page_mapcount(page) < 0, page);
VM_BUG_ON_PAGE(!PageHead(page), page);
} else if (thp_migration_supported()) {
@@ -1962,7 +1962,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
set_page_dirty(page);
if (!PageReferenced(page) && pmd_young(old_pmd))
SetPageReferenced(page);
- page_remove_rmap(page, true);
+ page_remove_rmap(page, vma, true);
put_page(page);
}
add_mm_counter(mm, mm_counter_file(page), -HPAGE_PMD_NR);
@@ -2096,6 +2096,9 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
}
}
unlock_page_memcg(page);
+
+ /* Above is effectively page_remove_rmap(page, vma, true) */
+ munlock_vma_page(page, vma, true);
}

smp_wmb(); /* make pte visible before pmd */
@@ -2103,7 +2106,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,

if (freeze) {
for (i = 0; i < HPAGE_PMD_NR; i++) {
- page_remove_rmap(page + i, false);
+ page_remove_rmap(page + i, vma, false);
put_page(page + i);
}
}
@@ -2163,8 +2166,6 @@ void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
do_unlock_page = true;
}
}
- if (PageMlocked(page))
- clear_page_mlock(page);
} else if (!(pmd_devmap(*pmd) || is_pmd_migration_entry(*pmd)))
goto out;
__split_huge_pmd_locked(vma, pmd, range.start, freeze);
@@ -3138,7 +3139,7 @@ void set_pmd_migration_entry(struct page_vma_mapped_walk *pvmw,
if (pmd_soft_dirty(pmdval))
pmdswp = pmd_swp_mksoft_dirty(pmdswp);
set_pmd_at(mm, address, pvmw->pmd, pmdswp);
- page_remove_rmap(page, true);
+ page_remove_rmap(page, vma, true);
put_page(page);
}

@@ -3168,10 +3169,8 @@ void remove_migration_pmd(struct page_vma_mapped_walk *pvmw, struct page *new)
if (PageAnon(new))
page_add_anon_rmap(new, vma, mmun_start, true);
else
- page_add_file_rmap(new, true);
+ page_add_file_rmap(new, vma, true);
set_pmd_at(mm, mmun_start, pvmw->pmd, pmde);
- if ((vma->vm_flags & VM_LOCKED) && !PageDoubleMap(new))
- mlock_vma_page(new);
update_mmu_cache_pmd(vma, address, pvmw->pmd);
}
#endif
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 61895cc01d09..43fb3155298e 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -5014,7 +5014,7 @@ static void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct
set_page_dirty(page);

hugetlb_count_sub(pages_per_huge_page(h), mm);
- page_remove_rmap(page, true);
+ page_remove_rmap(page, vma, true);

spin_unlock(ptl);
tlb_remove_page_size(tlb, page, huge_page_size(h));
@@ -5259,7 +5259,7 @@ static vm_fault_t hugetlb_cow(struct mm_struct *mm, struct vm_area_struct *vma,
/* Break COW */
huge_ptep_clear_flush(vma, haddr, ptep);
mmu_notifier_invalidate_range(mm, range.start, range.end);
- page_remove_rmap(old_page, true);
+ page_remove_rmap(old_page, vma, true);
hugepage_add_new_anon_rmap(new_page, vma, haddr);
set_huge_pte_at(mm, haddr, ptep,
make_huge_pte(vma, new_page, 1));
diff --git a/mm/internal.h b/mm/internal.h
index f235aa92e564..3d7dfc8bc471 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -395,12 +395,35 @@ extern long faultin_vma_page_range(struct vm_area_struct *vma,
bool write, int *locked);
extern int mlock_future_check(struct mm_struct *mm, unsigned long flags,
unsigned long len);
-
/*
- * must be called with vma's mmap_lock held for read or write, and page locked.
+ * mlock_vma_page() and munlock_vma_page():
+ * should be called with vma's mmap_lock held for read or write,
+ * under page table lock for the pte/pmd being added or removed.
+ *
+ * mlock is usually called at the end of page_add_*_rmap(),
+ * munlock at the end of page_remove_rmap(); but new anon
+ * pages are managed in lru_cache_add_inactive_or_unevictable().
+ *
+ * @compound is used to include pmd mappings of THPs, but filter out
+ * pte mappings of THPs, which cannot be consistently counted: a pte
+ * mapping of the THP head cannot be distinguished by the page alone.
*/
-extern void mlock_vma_page(struct page *page);
-extern void munlock_vma_page(struct page *page);
+void mlock_page(struct page *page);
+static inline void mlock_vma_page(struct page *page,
+ struct vm_area_struct *vma, bool compound)
+{
+ if (unlikely(vma->vm_flags & VM_LOCKED) &&
+ (compound || !PageTransCompound(page)))
+ mlock_page(page);
+}
+void munlock_page(struct page *page);
+static inline void munlock_vma_page(struct page *page,
+ struct vm_area_struct *vma, bool compound)
+{
+ if (unlikely(vma->vm_flags & VM_LOCKED) &&
+ (compound || !PageTransCompound(page)))
+ munlock_page(page);
+}

/*
* Clear the page's PageMlocked(). This can be useful in a situation where
@@ -487,7 +510,10 @@ static inline struct file *maybe_unlock_mmap_for_io(struct vm_fault *vmf,
#else /* !CONFIG_MMU */
static inline void unmap_mapping_folio(struct folio *folio) { }
static inline void clear_page_mlock(struct page *page) { }
-static inline void mlock_vma_page(struct page *page) { }
+static inline void mlock_vma_page(struct page *page,
+ struct vm_area_struct *vma, bool compound) { }
+static inline void munlock_vma_page(struct page *page,
+ struct vm_area_struct *vma, bool compound) { }
static inline void vunmap_range_noflush(unsigned long start, unsigned long end)
{
}
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 35f14d0a00a6..d5e387c58bde 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -773,7 +773,7 @@ static void __collapse_huge_page_copy(pte_t *pte, struct page *page,
*/
spin_lock(ptl);
ptep_clear(vma->vm_mm, address, _pte);
- page_remove_rmap(src_page, false);
+ page_remove_rmap(src_page, vma, false);
spin_unlock(ptl);
free_page_and_swap_cache(src_page);
}
@@ -1497,7 +1497,7 @@ void collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr)
if (pte_none(*pte))
continue;
page = vm_normal_page(vma, addr, *pte);
- page_remove_rmap(page, false);
+ page_remove_rmap(page, vma, false);
}

pte_unmap_unlock(start_pte, ptl);
diff --git a/mm/ksm.c b/mm/ksm.c
index c20bd4d9a0d9..c5a4403b5dc9 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -1177,7 +1177,7 @@ static int replace_page(struct vm_area_struct *vma, struct page *page,
ptep_clear_flush(vma, addr, ptep);
set_pte_at_notify(mm, addr, ptep, newpte);

- page_remove_rmap(page, false);
+ page_remove_rmap(page, vma, false);
if (!page_mapped(page))
try_to_free_swap(page);
put_page(page);
@@ -1252,16 +1252,6 @@ static int try_to_merge_one_page(struct vm_area_struct *vma,
err = replace_page(vma, page, kpage, orig_pte);
}

- if ((vma->vm_flags & VM_LOCKED) && kpage && !err) {
- munlock_vma_page(page);
- if (!PageMlocked(kpage)) {
- unlock_page(page);
- lock_page(kpage);
- mlock_vma_page(kpage);
- page = kpage; /* for final unlock */
- }
- }
-
out_unlock:
unlock_page(page);
out:
diff --git a/mm/memory.c b/mm/memory.c
index c125c4969913..53bd9e5f2e33 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -735,9 +735,6 @@ static void restore_exclusive_pte(struct vm_area_struct *vma,

set_pte_at(vma->vm_mm, address, ptep, pte);

- if (vma->vm_flags & VM_LOCKED)
- mlock_vma_page(page);
-
/*
* No need to invalidate - it was non-present before. However
* secondary CPUs may have mappings that need invalidating.
@@ -1377,7 +1374,7 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
mark_page_accessed(page);
}
rss[mm_counter(page)]--;
- page_remove_rmap(page, false);
+ page_remove_rmap(page, vma, false);
if (unlikely(page_mapcount(page) < 0))
print_bad_pte(vma, addr, ptent, page);
if (unlikely(__tlb_remove_page(tlb, page))) {
@@ -1397,10 +1394,8 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
continue;
pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
rss[mm_counter(page)]--;
-
if (is_device_private_entry(entry))
- page_remove_rmap(page, false);
-
+ page_remove_rmap(page, vma, false);
put_page(page);
continue;
}
@@ -1753,16 +1748,16 @@ static int validate_page_before_insert(struct page *page)
return 0;
}

-static int insert_page_into_pte_locked(struct mm_struct *mm, pte_t *pte,
+static int insert_page_into_pte_locked(struct vm_area_struct *vma, pte_t *pte,
unsigned long addr, struct page *page, pgprot_t prot)
{
if (!pte_none(*pte))
return -EBUSY;
/* Ok, finally just insert the thing.. */
get_page(page);
- inc_mm_counter_fast(mm, mm_counter_file(page));
- page_add_file_rmap(page, false);
- set_pte_at(mm, addr, pte, mk_pte(page, prot));
+ inc_mm_counter_fast(vma->vm_mm, mm_counter_file(page));
+ page_add_file_rmap(page, vma, false);
+ set_pte_at(vma->vm_mm, addr, pte, mk_pte(page, prot));
return 0;
}

@@ -1776,7 +1771,6 @@ static int insert_page_into_pte_locked(struct mm_struct *mm, pte_t *pte,
static int insert_page(struct vm_area_struct *vma, unsigned long addr,
struct page *page, pgprot_t prot)
{
- struct mm_struct *mm = vma->vm_mm;
int retval;
pte_t *pte;
spinlock_t *ptl;
@@ -1785,17 +1779,17 @@ static int insert_page(struct vm_area_struct *vma, unsigned long addr,
if (retval)
goto out;
retval = -ENOMEM;
- pte = get_locked_pte(mm, addr, &ptl);
+ pte = get_locked_pte(vma->vm_mm, addr, &ptl);
if (!pte)
goto out;
- retval = insert_page_into_pte_locked(mm, pte, addr, page, prot);
+ retval = insert_page_into_pte_locked(vma, pte, addr, page, prot);
pte_unmap_unlock(pte, ptl);
out:
return retval;
}

#ifdef pte_index
-static int insert_page_in_batch_locked(struct mm_struct *mm, pte_t *pte,
+static int insert_page_in_batch_locked(struct vm_area_struct *vma, pte_t *pte,
unsigned long addr, struct page *page, pgprot_t prot)
{
int err;
@@ -1805,7 +1799,7 @@ static int insert_page_in_batch_locked(struct mm_struct *mm, pte_t *pte,
err = validate_page_before_insert(page);
if (err)
return err;
- return insert_page_into_pte_locked(mm, pte, addr, page, prot);
+ return insert_page_into_pte_locked(vma, pte, addr, page, prot);
}

/* insert_pages() amortizes the cost of spinlock operations
@@ -1842,7 +1836,7 @@ static int insert_pages(struct vm_area_struct *vma, unsigned long addr,

start_pte = pte_offset_map_lock(mm, pmd, addr, &pte_lock);
for (pte = start_pte; pte_idx < batch_size; ++pte, ++pte_idx) {
- int err = insert_page_in_batch_locked(mm, pte,
+ int err = insert_page_in_batch_locked(vma, pte,
addr, pages[curr_page_idx], prot);
if (unlikely(err)) {
pte_unmap_unlock(start_pte, pte_lock);
@@ -3098,7 +3092,7 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
* mapcount is visible. So transitively, TLBs to
* old page will be flushed before it can be reused.
*/
- page_remove_rmap(old_page, false);
+ page_remove_rmap(old_page, vma, false);
}

/* Free the old page.. */
@@ -3118,16 +3112,6 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
*/
mmu_notifier_invalidate_range_only_end(&range);
if (old_page) {
- /*
- * Don't let another task, with possibly unlocked vma,
- * keep the mlocked page.
- */
- if (page_copied && (vma->vm_flags & VM_LOCKED)) {
- lock_page(old_page); /* LRU manipulation */
- if (PageMlocked(old_page))
- munlock_vma_page(old_page);
- unlock_page(old_page);
- }
if (page_copied)
free_swap_cache(old_page);
put_page(old_page);
@@ -3947,7 +3931,8 @@ vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page)
entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);

add_mm_counter(vma->vm_mm, mm_counter_file(page), HPAGE_PMD_NR);
- page_add_file_rmap(page, true);
+ page_add_file_rmap(page, vma, true);
+
/*
* deposit and withdraw with pmd lock held
*/
@@ -3996,7 +3981,7 @@ void do_set_pte(struct vm_fault *vmf, struct page *page, unsigned long addr)
lru_cache_add_inactive_or_unevictable(page, vma);
} else {
inc_mm_counter_fast(vma->vm_mm, mm_counter_file(page));
- page_add_file_rmap(page, false);
+ page_add_file_rmap(page, vma, false);
}
set_pte_at(vma->vm_mm, addr, vmf->pte, entry);
}
diff --git a/mm/migrate.c b/mm/migrate.c
index c7da064b4781..7c4223ce2500 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -248,14 +248,9 @@ static bool remove_migration_pte(struct page *page, struct vm_area_struct *vma,
if (PageAnon(new))
page_add_anon_rmap(new, vma, pvmw.address, false);
else
- page_add_file_rmap(new, false);
+ page_add_file_rmap(new, vma, false);
set_pte_at(vma->vm_mm, pvmw.address, pvmw.pte, pte);
}
- if (vma->vm_flags & VM_LOCKED && !PageTransCompound(new))
- mlock_vma_page(new);
-
- if (PageTransHuge(page) && PageMlocked(page))
- clear_page_mlock(page);

/* No need to invalidate - it was non-present before */
update_mmu_cache(vma, pvmw.address, pvmw.pte);
@@ -2331,7 +2326,7 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
* drop page refcount. Page won't be freed, as we took
* a reference just above.
*/
- page_remove_rmap(page, false);
+ page_remove_rmap(page, vma, false);
put_page(page);

if (pte_present(pte))
diff --git a/mm/mlock.c b/mm/mlock.c
index d148da934fe9..aaded15b2f8f 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -78,17 +78,13 @@ void clear_page_mlock(struct page *page)
}
}

-/*
- * Mark page as mlocked if not already.
- * If page on LRU, isolate and putback to move to unevictable list.
+/**
+ * mlock_page - mlock a page
+ * @page - page to be mlocked, either a normal page or a THP head.
*/
-void mlock_vma_page(struct page *page)
+void mlock_page(struct page *page)
{
- /* Serialize with page migration */
- BUG_ON(!PageLocked(page));
-
VM_BUG_ON_PAGE(PageTail(page), page);
- VM_BUG_ON_PAGE(PageCompound(page) && PageDoubleMap(page), page);

if (!TestSetPageMlocked(page)) {
int nr_pages = thp_nr_pages(page);
@@ -101,14 +97,11 @@ void mlock_vma_page(struct page *page)
}

/**
- * munlock_vma_page - munlock a vma page
- * @page: page to be unlocked, either a normal page or THP page head
+ * munlock_page - munlock a page
+ * @page: page to be munlocked, either a normal page or a THP head.
*/
-void munlock_vma_page(struct page *page)
+void munlock_page(struct page *page)
{
- /* Serialize with page migration */
- BUG_ON(!PageLocked(page));
-
VM_BUG_ON_PAGE(PageTail(page), page);

if (TestClearPageMlocked(page)) {
diff --git a/mm/rmap.c b/mm/rmap.c
index 7ce7f1946cff..6cc8bf129f18 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1181,17 +1181,17 @@ void do_page_add_anon_rmap(struct page *page,
__mod_lruvec_page_state(page, NR_ANON_MAPPED, nr);
}

- if (unlikely(PageKsm(page))) {
+ if (unlikely(PageKsm(page)))
unlock_page_memcg(page);
- return;
- }

/* address might be in next vma when migration races vma_adjust */
- if (first)
+ else if (first)
__page_set_anon_rmap(page, vma, address,
flags & RMAP_EXCLUSIVE);
else
__page_check_anon_rmap(page, vma, address);
+
+ mlock_vma_page(page, vma, compound);
}

/**
@@ -1232,12 +1232,14 @@ void page_add_new_anon_rmap(struct page *page,

/**
* page_add_file_rmap - add pte mapping to a file page
- * @page: the page to add the mapping to
- * @compound: charge the page as compound or small page
+ * @page: the page to add the mapping to
+ * @vma: the vm area in which the mapping is added
+ * @compound: charge the page as compound or small page
*
* The caller needs to hold the pte lock.
*/
-void page_add_file_rmap(struct page *page, bool compound)
+void page_add_file_rmap(struct page *page,
+ struct vm_area_struct *vma, bool compound)
{
int i, nr = 1;

@@ -1260,13 +1262,8 @@ void page_add_file_rmap(struct page *page, bool compound)
nr_pages);
} else {
if (PageTransCompound(page) && page_mapping(page)) {
- struct page *head = compound_head(page);
-
VM_WARN_ON_ONCE(!PageLocked(page));
-
- SetPageDoubleMap(head);
- if (PageMlocked(page))
- clear_page_mlock(head);
+ SetPageDoubleMap(compound_head(page));
}
if (!atomic_inc_and_test(&page->_mapcount))
goto out;
@@ -1274,6 +1271,8 @@ void page_add_file_rmap(struct page *page, bool compound)
__mod_lruvec_page_state(page, NR_FILE_MAPPED, nr);
out:
unlock_page_memcg(page);
+
+ mlock_vma_page(page, vma, compound);
}

static void page_remove_file_rmap(struct page *page, bool compound)
@@ -1368,11 +1367,13 @@ static void page_remove_anon_compound_rmap(struct page *page)
/**
* page_remove_rmap - take down pte mapping from a page
* @page: page to remove mapping from
+ * @vma: the vm area from which the mapping is removed
* @compound: uncharge the page as compound or small page
*
* The caller needs to hold the pte lock.
*/
-void page_remove_rmap(struct page *page, bool compound)
+void page_remove_rmap(struct page *page,
+ struct vm_area_struct *vma, bool compound)
{
lock_page_memcg(page);

@@ -1414,6 +1415,8 @@ void page_remove_rmap(struct page *page, bool compound)
*/
out:
unlock_page_memcg(page);
+
+ munlock_vma_page(page, vma, compound);
}

/*
@@ -1469,28 +1472,21 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
mmu_notifier_invalidate_range_start(&range);

while (page_vma_mapped_walk(&pvmw)) {
+ /* Unexpected PMD-mapped THP? */
+ VM_BUG_ON_PAGE(!pvmw.pte, page);
+
/*
- * If the page is mlock()d, we cannot swap it out.
+ * If the page is in an mlock()d vma, we must not swap it out.
*/
if (!(flags & TTU_IGNORE_MLOCK) &&
(vma->vm_flags & VM_LOCKED)) {
- /*
- * PTE-mapped THP are never marked as mlocked: so do
- * not set it on a DoubleMap THP, nor on an Anon THP
- * (which may still be PTE-mapped after DoubleMap was
- * cleared). But stop unmapping even in those cases.
- */
- if (!PageTransCompound(page) || (PageHead(page) &&
- !PageDoubleMap(page) && !PageAnon(page)))
- mlock_vma_page(page);
+ /* Restore the mlock which got missed */
+ mlock_vma_page(page, vma, false);
page_vma_mapped_walk_done(&pvmw);
ret = false;
break;
}

- /* Unexpected PMD-mapped THP? */
- VM_BUG_ON_PAGE(!pvmw.pte, page);
-
subpage = page - page_to_pfn(page) + pte_pfn(*pvmw.pte);
address = pvmw.address;

@@ -1668,7 +1664,7 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
*
* See Documentation/vm/mmu_notifier.rst
*/
- page_remove_rmap(subpage, PageHuge(page));
+ page_remove_rmap(subpage, vma, PageHuge(page));
put_page(page);
}

@@ -1942,7 +1938,7 @@ static bool try_to_migrate_one(struct page *page, struct vm_area_struct *vma,
*
* See Documentation/vm/mmu_notifier.rst
*/
- page_remove_rmap(subpage, PageHuge(page));
+ page_remove_rmap(subpage, vma, PageHuge(page));
put_page(page);
}

@@ -2078,7 +2074,7 @@ static bool page_make_device_exclusive_one(struct page *page,
* There is a reference on the page for the swap entry which has
* been removed, so shouldn't take another.
*/
- page_remove_rmap(subpage, false);
+ page_remove_rmap(subpage, vma, false);
}

mmu_notifier_invalidate_range_end(&range);
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 0780c2a57ff1..15d3e97a6e04 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -95,10 +95,15 @@ int mfill_atomic_install_pte(struct mm_struct *dst_mm, pmd_t *dst_pmd,
if (!pte_none(*dst_pte))
goto out_unlock;

- if (page_in_cache)
- page_add_file_rmap(page, false);
- else
+ if (page_in_cache) {
+ /* Usually, cache pages are already added to LRU */
+ if (newly_allocated)
+ lru_cache_add(page);
+ page_add_file_rmap(page, dst_vma, false);
+ } else {
page_add_new_anon_rmap(page, dst_vma, dst_addr, false);
+ lru_cache_add_inactive_or_unevictable(page, dst_vma);
+ }

/*
* Must happen after rmap, as mm_counter() checks mapping (via
@@ -106,9 +111,6 @@ int mfill_atomic_install_pte(struct mm_struct *dst_mm, pmd_t *dst_pmd,
*/
inc_mm_counter(dst_mm, mm_counter(page));

- if (newly_allocated)
- lru_cache_add_inactive_or_unevictable(page, dst_vma);
-
set_pte_at(dst_mm, dst_addr, dst_pte, _dst_pte);

/* No need to invalidate - it was non-present before */
--
2.34.1


2022-02-07 19:08:16

by Hugh Dickins

Subject: [PATCH 12/13] mm/thp: collapse_file() do try_to_unmap(TTU_BATCH_FLUSH)

collapse_file() is using unmap_mapping_pages(1) on each small page found
mapped, unlike others (reclaim, migration, splitting, memory-failure) which
use try_to_unmap(). There are four advantages to try_to_unmap(): first,
its TTU_IGNORE_MLOCK option now avoids leaving a mlocked page in the pagevec;
second, its vma lookup uses i_mmap_lock_read() not i_mmap_lock_write();
third, it breaks out early if page is not mapped everywhere it might be;
fourth, its TTU_BATCH_FLUSH option can be used, as in page reclaim, to
save up all the TLB flushing until all of the pages have been unmapped.

Wild guess: perhaps it was originally written to use try_to_unmap(),
but hit the VM_BUG_ON_PAGE(page_mapped) after unmapping, because without
TTU_SYNC it may skip page table locks; but unmap_mapping_pages() never
skips them, so that fixed the issue. I did once hit that VM_BUG_ON_PAGE()
since making this change: we could pass TTU_SYNC here, but I think just
delete the check - the race is very rare, this is an ordinary small page
so we don't need to be so paranoid about mapcount surprises, and the
page_ref_freeze() just below already handles the case adequately.

Signed-off-by: Hugh Dickins <[email protected]>
---
mm/khugepaged.c | 10 ++++++++--
1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index d5e387c58bde..e0883a33efd6 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1829,13 +1829,12 @@ static void collapse_file(struct mm_struct *mm,
}

if (page_mapped(page))
- unmap_mapping_pages(mapping, index, 1, false);
+ try_to_unmap(page, TTU_IGNORE_MLOCK | TTU_BATCH_FLUSH);

xas_lock_irq(&xas);
xas_set(&xas, index);

VM_BUG_ON_PAGE(page != xas_load(&xas), page);
- VM_BUG_ON_PAGE(page_mapped(page), page);

/*
* The page is expected to have page_count() == 3:
@@ -1899,6 +1898,13 @@ static void collapse_file(struct mm_struct *mm,
xas_unlock_irq(&xas);
xa_unlocked:

+ /*
+ * If collapse is successful, flush must be done now before copying.
+ * If collapse is unsuccessful, does flush actually need to be done?
+ * Do it anyway, to clear the state.
+ */
+ try_to_unmap_flush();
+
if (result == SCAN_SUCCEED) {
struct page *page, *tmp;

--
2.34.1


2022-02-08 09:12:28

by Hugh Dickins

Subject: [PATCH 11/13] mm/munlock: page migration needs mlock pagevec drained

Page migration of a VM_LOCKED page tends to fail, because when the old
page is unmapped, it is put on the mlock pagevec with raised refcount,
which then fails the freeze.

At first I thought this would be fixed by a local mlock_page_drain() at
the upper rmap_walk() level - which would have nicely batched all the
munlocks of that page; but tests show that the task can too easily move
to another cpu, leaving pagevec residue behind which fails the migration.

So have try_to_migrate_one() drain the local pagevec after page_remove_rmap()
from a VM_LOCKED vma; and do the same in try_to_unmap_one(), whose
TTU_IGNORE_MLOCK users would want the same treatment; and do the same
in remove_migration_pte() - not important when successfully inserting
a new page, but necessary when hoping to retry after failure.

Any new pagevec runs the risk of adding a new way of stranding, and we
might discover other corners where mlock_page_drain() or lru_add_drain()
would now help. If the mlock pagevec raises doubts, we can easily add a
sysctl to tune its length to 1, which reverts to synchronous operation.

Signed-off-by: Hugh Dickins <[email protected]>
---
mm/migrate.c | 2 ++
mm/rmap.c | 4 ++++
2 files changed, 6 insertions(+)

diff --git a/mm/migrate.c b/mm/migrate.c
index f4bcf1541b62..e7d0b68d5dcb 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -251,6 +251,8 @@ static bool remove_migration_pte(struct page *page, struct vm_area_struct *vma,
page_add_file_rmap(new, vma, false);
set_pte_at(vma->vm_mm, pvmw.address, pvmw.pte, pte);
}
+ if (vma->vm_flags & VM_LOCKED)
+ mlock_page_drain(smp_processor_id());

/* No need to invalidate - it was non-present before */
update_mmu_cache(vma, pvmw.address, pvmw.pte);
diff --git a/mm/rmap.c b/mm/rmap.c
index 5442a5c97a85..714bfdc72c7b 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1656,6 +1656,8 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
* See Documentation/vm/mmu_notifier.rst
*/
page_remove_rmap(subpage, vma, PageHuge(page));
+ if (vma->vm_flags & VM_LOCKED)
+ mlock_page_drain(smp_processor_id());
put_page(page);
}

@@ -1930,6 +1932,8 @@ static bool try_to_migrate_one(struct page *page, struct vm_area_struct *vma,
* See Documentation/vm/mmu_notifier.rst
*/
page_remove_rmap(subpage, vma, PageHuge(page));
+ if (vma->vm_flags & VM_LOCKED)
+ mlock_page_drain(smp_processor_id());
put_page(page);
}

--
2.34.1


2022-02-09 06:26:58

by Hugh Dickins

Subject: [PATCH 07/13] mm/munlock: mlock_pte_range() when mlocking or munlocking

Fill in missing pieces: reimplementation of munlock_vma_pages_range(),
required to lower the mlock_counts when munlocking without munmapping;
and its complement, implementation of mlock_vma_pages_range(), required
to raise the mlock_counts on pages already there when a range is mlocked.

Combine them into just the one function mlock_vma_pages_range(), using
walk_page_range() to run mlock_pte_range(). This approach fixes the
"Very slow unlockall()" of unpopulated PROT_NONE areas, reported in
https://lore.kernel.org/linux-mm/[email protected]/
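
(Illustration only: a userspace sketch of that reported case, 64-bit only,
and needing CAP_IPC_LOCK since mlockall() checks total_vm against
RLIMIT_MEMLOCK.  The huge PROT_NONE area is never populated, yet the old
munlock path still had to visit every page address in it; the
walk_page_range() approach skips the empty page tables instead.)

#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
	size_t len = 1UL << 40;		/* 1TB of PROT_NONE, never populated */
	void *p = mmap(NULL, len, PROT_NONE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (p == MAP_FAILED)
		return 1;
	if (mlockall(MCL_CURRENT))	/* PROT_NONE pages cannot be populated */
		perror("mlockall");
	munlockall();			/* this was the "Very slow" step */
	return 0;
}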

Munlock clears VM_LOCKED at the start, under exclusive mmap_lock; but if
a racing truncate or holepunch (depending on i_mmap_rwsem) gets to the
pte first, it will not try to munlock the page: leaving release_pages()
to correct it when the last reference to the page is gone - that's okay,
a page is not evictable anyway while it is held by an extra reference.

Mlock sets VM_LOCKED at the start, under exclusive mmap_lock; but if
a racing remove_migration_pte() or try_to_unmap_one() (depending on
i_mmap_rwsem) gets to the pte first, it will try to mlock the page,
then mlock_pte_range() will mlock it a second time. This is harder to
reproduce, but a more serious race because it could leave the page
unevictable indefinitely though the area is munlocked afterwards.
Guard against it by setting the (inappropriate) VM_IO flag,
and modifying mlock_vma_page() to decline such vmas.

Signed-off-by: Hugh Dickins <[email protected]>
---
mm/internal.h | 3 +-
mm/mlock.c | 108 ++++++++++++++++++++++++++++++++++++++++----------
2 files changed, 90 insertions(+), 21 deletions(-)

diff --git a/mm/internal.h b/mm/internal.h
index a43d79335c16..b3f0dd3ffba2 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -412,7 +412,8 @@ void mlock_page(struct page *page);
static inline void mlock_vma_page(struct page *page,
struct vm_area_struct *vma, bool compound)
{
- if (unlikely(vma->vm_flags & VM_LOCKED) &&
+ /* VM_IO check prevents migration from double-counting during mlock */
+ if (unlikely((vma->vm_flags & (VM_LOCKED|VM_IO)) == VM_LOCKED) &&
(compound || !PageTransCompound(page)))
mlock_page(page);
}
diff --git a/mm/mlock.c b/mm/mlock.c
index 0d3ae04b1f4e..f8e5dcff21ae 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -14,6 +14,7 @@
#include <linux/swapops.h>
#include <linux/pagemap.h>
#include <linux/pagevec.h>
+#include <linux/pagewalk.h>
#include <linux/mempolicy.h>
#include <linux/syscalls.h>
#include <linux/sched.h>
@@ -127,23 +128,90 @@ void munlock_page(struct page *page)
unlock_page_memcg(page);
}

+static int mlock_pte_range(pmd_t *pmd, unsigned long addr,
+ unsigned long end, struct mm_walk *walk)
+
+{
+ struct vm_area_struct *vma = walk->vma;
+ spinlock_t *ptl;
+ pte_t *start_pte, *pte;
+ struct page *page;
+
+ ptl = pmd_trans_huge_lock(pmd, vma);
+ if (ptl) {
+ if (!pmd_present(*pmd))
+ goto out;
+ if (is_huge_zero_pmd(*pmd))
+ goto out;
+ page = pmd_page(*pmd);
+ if (vma->vm_flags & VM_LOCKED)
+ mlock_page(page);
+ else
+ munlock_page(page);
+ goto out;
+ }
+
+ start_pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
+ for (pte = start_pte; addr != end; pte++, addr += PAGE_SIZE) {
+ if (!pte_present(*pte))
+ continue;
+ page = vm_normal_page(vma, addr, *pte);
+ if (!page)
+ continue;
+ if (PageTransCompound(page))
+ continue;
+ if (vma->vm_flags & VM_LOCKED)
+ mlock_page(page);
+ else
+ munlock_page(page);
+ }
+ pte_unmap(start_pte);
+out:
+ spin_unlock(ptl);
+ cond_resched();
+ return 0;
+}
+
/*
- * munlock_vma_pages_range() - munlock all pages in the vma range.'
- * @vma - vma containing range to be munlock()ed.
+ * mlock_vma_pages_range() - mlock any pages already in the range,
+ * or munlock all pages in the range.
+ * @vma - vma containing range to be mlock()ed or munlock()ed
* @start - start address in @vma of the range
- * @end - end of range in @vma.
- *
- * For mremap(), munmap() and exit().
+ * @end - end of range in @vma
+ * @newflags - the new set of flags for @vma.
*
- * Called with @vma VM_LOCKED.
- *
- * Returns with VM_LOCKED cleared. Callers must be prepared to
- * deal with this.
+ * Called for mlock(), mlock2() and mlockall(), to set @vma VM_LOCKED;
+ * called for munlock() and munlockall(), to clear VM_LOCKED from @vma.
*/
-static void munlock_vma_pages_range(struct vm_area_struct *vma,
- unsigned long start, unsigned long end)
+static void mlock_vma_pages_range(struct vm_area_struct *vma,
+ unsigned long start, unsigned long end, vm_flags_t newflags)
{
- /* Reimplementation to follow in later commit */
+ static const struct mm_walk_ops mlock_walk_ops = {
+ .pmd_entry = mlock_pte_range,
+ };
+
+ /*
+ * There is a slight chance that concurrent page migration,
+ * or page reclaim finding a page of this now-VM_LOCKED vma,
+ * will call mlock_vma_page() and raise page's mlock_count:
+ * double counting, leaving the page unevictable indefinitely.
+ * Communicate this danger to mlock_vma_page() with VM_IO,
+ * which is a VM_SPECIAL flag not allowed on VM_LOCKED vmas.
+ * mmap_lock is held in write mode here, so this weird
+ * combination should not be visible to others.
+ */
+ if (newflags & VM_LOCKED)
+ newflags |= VM_IO;
+ WRITE_ONCE(vma->vm_flags, newflags);
+
+ lru_add_drain();
+ walk_page_range(vma->vm_mm, start, end, &mlock_walk_ops, NULL);
+ lru_add_drain();
+
+ if (newflags & VM_IO) {
+ newflags &= ~VM_IO;
+ WRITE_ONCE(vma->vm_flags, newflags);
+ }
}

/*
@@ -162,8 +230,7 @@ static int mlock_fixup(struct vm_area_struct *vma, struct vm_area_struct **prev,
pgoff_t pgoff;
int nr_pages;
int ret = 0;
- int lock = !!(newflags & VM_LOCKED);
- vm_flags_t old_flags = vma->vm_flags;
+ vm_flags_t oldflags = vma->vm_flags;

if (newflags == vma->vm_flags || (vma->vm_flags & VM_SPECIAL) ||
is_vm_hugetlb_page(vma) || vma == get_gate_vma(current->mm) ||
@@ -197,9 +264,9 @@ static int mlock_fixup(struct vm_area_struct *vma, struct vm_area_struct **prev,
* Keep track of amount of locked VM.
*/
nr_pages = (end - start) >> PAGE_SHIFT;
- if (!lock)
+ if (!(newflags & VM_LOCKED))
nr_pages = -nr_pages;
- else if (old_flags & VM_LOCKED)
+ else if (oldflags & VM_LOCKED)
nr_pages = 0;
mm->locked_vm += nr_pages;

@@ -209,11 +276,12 @@ static int mlock_fixup(struct vm_area_struct *vma, struct vm_area_struct **prev,
* set VM_LOCKED, populate_vma_page_range will bring it back.
*/

- if (lock)
+ if ((newflags & VM_LOCKED) && (oldflags & VM_LOCKED)) {
+ /* No work to do, and mlocking twice would be wrong */
vma->vm_flags = newflags;
- else
- munlock_vma_pages_range(vma, start, end);
-
+ } else {
+ mlock_vma_pages_range(vma, start, end, newflags);
+ }
out:
*prev = vma;
return ret;
--
2.34.1


2022-02-09 10:01:55

by Hugh Dickins

Subject: [PATCH 09/13] mm/munlock: delete smp_mb() from __pagevec_lru_add_fn()

My reading of the comment on smp_mb__after_atomic() in __pagevec_lru_add_fn()
says that it can now be deleted; and that remains so when the next patch
is added.

Signed-off-by: Hugh Dickins <[email protected]>
---
mm/swap.c | 37 +++++++++----------------------------
1 file changed, 9 insertions(+), 28 deletions(-)

diff --git a/mm/swap.c b/mm/swap.c
index 682a03301a2c..3f770b1ea2c1 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -1025,37 +1025,18 @@ static void __pagevec_lru_add_fn(struct folio *folio, struct lruvec *lruvec)

VM_BUG_ON_FOLIO(folio_test_lru(folio), folio);

+ folio_set_lru(folio);
/*
- * A folio becomes evictable in two ways:
- * 1) Within LRU lock [munlock_vma_page() and __munlock_pagevec()].
- * 2) Before acquiring LRU lock to put the folio on the correct LRU
- * and then
- * a) do PageLRU check with lock [check_move_unevictable_pages]
- * b) do PageLRU check before lock [clear_page_mlock]
- *
- * (1) & (2a) are ok as LRU lock will serialize them. For (2b), we need
- * following strict ordering:
- *
- * #0: __pagevec_lru_add_fn #1: clear_page_mlock
- *
- * folio_set_lru() folio_test_clear_mlocked()
- * smp_mb() // explicit ordering // above provides strict
- * // ordering
- * folio_test_mlocked() folio_test_lru()
+ * Is an smp_mb__after_atomic() still required here, before
+ * folio_evictable() tests PageMlocked, to rule out the possibility
+ * of stranding an evictable folio on an unevictable LRU? I think
+ * not, because munlock_page() only clears PageMlocked while the LRU
+ * lock is held.
*
- *
- * if '#1' does not observe setting of PG_lru by '#0' and
- * fails isolation, the explicit barrier will make sure that
- * folio_evictable check will put the folio on the correct
- * LRU. Without smp_mb(), folio_set_lru() can be reordered
- * after folio_test_mlocked() check and can make '#1' fail the
- * isolation of the folio whose mlocked bit is cleared (#0 is
- * also looking at the same folio) and the evictable folio will
- * be stranded on an unevictable LRU.
+ * (That is not true of __page_cache_release(), and not necessarily
+ * true of release_pages(): but those only clear PageMlocked after
+ * put_page_testzero() has excluded any other users of the page.)
*/
- folio_set_lru(folio);
- smp_mb__after_atomic();
-
if (folio_evictable(folio)) {
if (was_unevictable)
__count_vm_events(UNEVICTABLE_PGRESCUED, nr_pages);
--
2.34.1


2022-02-09 18:34:53

by Michal Hocko

Subject: Re: [PATCH 00/13] mm/munlock: rework of mlock+munlock page handling

On Sun 06-02-22 13:27:41, Hugh Dickins wrote:
> I wondered whether to post this munlocking rework in
> https://lore.kernel.org/linux-mm/[email protected]/
>
> There the discussion was OOM reaping, but the main reason for the rework
> has been catastrophic contention on i_mmap_rwsem when exiting from
> multiply mlocked files; and frustration with how contorted munlocking is.

yeah, I do not think that too many people care about the problem of the
oom_reaper stumbling over vast portions of mlocked memory. So far I have
seen only LTP oom02 hitting this, and that one does mlockall and
intentionally hits OOM. So far I haven't seen any actual workload hitting
something similar. But hey, if we get better and easier to maintain mlock
code, then the oom_reaper side will surely add a huge thanks.

> Here it is based on 5.17-rc2, applies also to -rc3, almost cleanly to
> mmotm 2022-02-03-21-58 (needs two easy fixups in mm/huge_memory.c); but
> likely to conflict (I hope not fundamentally) with several concurrent
> large patchsets.
>
> tl;dr
> mm/mlock.c | 634 +++++++++++++++-----------------------
> 23 files changed, 510 insertions(+), 779 deletions(-)

This is really impressive!

> In my own opinion, this is ready to go in: that whatever its defects,
> it's a big practical improvement over what's presently there. Others
> may differ; and it may be too much to arrive at a quick opinion on.
> My hope is that it will strike a chord with some who have laboured in
> this area in the past: I'll be short of time to do much more with it,
> so maybe someone else could take over if necessary.
>
> At present there's no update to Documentation/vm/unevictable-lru.rst:
> that always needs a different mindset, can follow later, please refer
> to commit messages for now.

Yes, that would help for sure.

So far I have only managed to read through the series and try to put
all the pieces together (so far I have given up on the THP part), and my
understanding is far from complete. But I have to say I like the general
approach and overall simplification.

The only thing that is not entirely clear to me at the moment is why you
have chosen to ignore already mapped LOCKONFAULT pages. They will
eventually get sorted out during the reclaim/migration but this can
backfire if too many pages have been pre-faulted before LOCKONFAULT
call. Maybe not an interesting case in the first place but I am still
wondering why you have chosen that way.

I will be off next couple of days and plan to revisit this afterwards
(should time allow). Anyway thanks a lot Hugh!
--
Michal Hocko
SUSE Labs

2022-02-09 20:41:38

by Hugh Dickins

Subject: Re: [PATCH 00/13] mm/munlock: rework of mlock+munlock page handling

On Wed, 9 Feb 2022, Michal Hocko wrote:
>
> So far I have only managed to read through the series and try to put
> all the pieces together (so far I have given up on the THP part), and my
> understanding is far from complete. But I have to say I like the general
> approach and overall simplification.

Many thanks for looking, Michal, and for all the positivity!

>
> The only thing that is not entirely clear to me at the moment is why you
> have chosen to ignore already mapped LOCKONFAULT pages. They will
> eventually get sorted out during the reclaim/migration but this can
> backfire if too many pages have been pre-faulted before LOCKONFAULT
> call. Maybe not an interesting case in the first place but I am still
> wondering why you have chosen that way.

I'm puzzled: what makes you think I'm ignoring already mapped LOCKONFAULT
pages? I'd consider that a bug.

It is the case, isn't it, that a VM_LOCKONFAULT area always has VM_LOCKED
set too? If I've got that wrong, yes, I'll need to revisit conditions.

>
> I will be off next couple of days and plan to revisit this afterwards
> (should time allow). Anyway thanks a lot Hugh!
> --
> Michal Hocko
> SUSE Labs

2022-02-09 23:26:29

by Hugh Dickins

[permalink] [raw]
Subject: Re: [PATCH 00/13] mm/munlock: rework of mlock+munlock page handling

On Wed, 9 Feb 2022, Michal Hocko wrote:
> On Wed 09-02-22 08:21:17, Hugh Dickins wrote:
> > On Wed, 9 Feb 2022, Michal Hocko wrote:
> [...]
> > > The only thing that is not entirely clear to me at the moment is why you
> > > have chosen to ignore already mapped LOCKONFAULT pages. They will
> > > eventually get sorted out during the reclaim/migration but this can
> > > backfire if too many pages have been pre-faulted before LOCKONFAULT
> > > call. Maybe not an interesting case in the first place but I am still
> > > wondering why you have chosen that way.
> >
> > I'm puzzled: what makes you think I'm ignoring already mapped LOCKONFAULT
> > pages? I'd consider that a bug.
>
> I've had the following path in mind
> __mm_populate
> populate_vma_page_range
> if (vma->vm_flags & VM_LOCKONFAULT)
> return nr_pages
>
> which means that __get_user_pages is not called at all. This also means
> that follow_page_mask is skipped. The page table walk used to mark
> already mapped pages as mlocked, so unless I miss some other path they
> would stay on their original LRU list and only get real mlock protection
> when encountered by reclaim or migration.

That is so until 04/13: but then, whenever a page is faulted into a
VM_LOCKED area (except when skipping pte-of-compound) it goes through
mlock_page(). So it will then lock-on-fault.
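
To spell out "goes through": 04/13 ends the rmap page adds with the new
mlock_vma_page() inline, roughly:

        static inline void mlock_vma_page(struct page *page,
                        struct vm_area_struct *vma, bool compound)
        {
                /* a pte map of a THP is the "skipping pte-of-compound" case */
                if (unlikely(vma->vm_flags & VM_LOCKED) &&
                    (compound || !PageTransCompound(page)))
                        mlock_page(page);
        }

so the page gets counted at the moment it is mapped into the locked vma.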

Ah, but you're worrying about any previously-mapped pages when
VM_LOCKONFAULT is established: those ones get done by mlock_pte_range()
in 07/13, same as when VM_LOCKED is established - I checked again,
VM_LOCKONFAULT is not set without VM_LOCKED.
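
And that's visible right at the entry point - roughly, the mlock2()
syscall in mm/mlock.c:

        SYSCALL_DEFINE3(mlock2, unsigned long, start, size_t, len, int, flags)
        {
                vm_flags_t vm_flags = VM_LOCKED;

                if (flags & ~MLOCK_ONFAULT)
                        return -EINVAL;

                if (flags & MLOCK_ONFAULT)
                        /* only ever added on top of VM_LOCKED */
                        vm_flags |= VM_LOCKONFAULT;

                return do_mlock(start, len, vm_flags);
        }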

>
> > It is the case, isn't it, that a VM_LOCKONFAULT area always has VM_LOCKED
> > set too? If I've got that wrong, yes, I'll need to revisit conditions.
>
> Yes, I didn't really mean we would lose the mlock protection. We would
> just do the lazy mlock also on pages which are already mapped. This is
> certainly not a correctness issue - although stats might be off - but it
> could pollute existing LRUs with mlocked pages rather easily.

Even with the lazy mlock in reclaim, I'd still accuse myself of a bug
if I've got this wrong.

>
> As I've said, either I am still missing something or this might not even
> be a big deal in real life. I was mostly curious about the choice to
> exclude the page table walk for LOCKONFAULT.

It surprised me too: but looking at how FOLL_POPULATE was handled in
faultin_page(), it appeared to me to become redundant, once mlocking
was counted at fault time (and in mlock_pte_range() - maybe that's
the page table walk you're looking for, not excluded but moved).

It is a big deal for me: please don't let me fool you, please do
continue to worry at it until you're convinced or you convince me.

Thanks,
Hugh

2022-02-09 23:38:48

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH 00/13] mm/munlock: rework of mlock+munlock page handling

On Wed 09-02-22 08:21:17, Hugh Dickins wrote:
> On Wed, 9 Feb 2022, Michal Hocko wrote:
[...]
> > The only thing that is not entirely clear to me at the moment is why you
> > have chosen to ignore already mapped LOCKONFAULT pages. They will
> > eventually get sorted out during the reclaim/migration but this can
> > backfire if too many pages have been pre-faulted before LOCKONFAULT
> > call. Maybe not an interesting case in the first place but I am still
> > wondering why you have chosen that way.
>
> I'm puzzled: what makes you think I'm ignoring already mapped LOCKONFAULT
> pages? I'd consider that a bug.

I've had the following path in mind
__mm_populate
populate_vma_page_range
if (vma->vm_flags & VM_LOCKONFAULT)
return nr_pages

which means that __get_user_pages is not called at all. This also means
that follow_page_mask is skipped. The page table walk used to mark
already mapped pages as mlocked, so unless I miss some other path they
would stay on their original LRU list and only get real mlock protection
when encountered by reclaim or migration.

> It is the case, isn't it, that a VM_LOCKONFAULT area always has VM_LOCKED
> set too? If I've got that wrong, yes, I'll need to revisit conditions.

Yes, I didn't really mean we would lose the mlock protection. We would
just do the lazy mlock also on pages which are already mapped. This is
certainly not a correctness issue - although stats might be off - but it
could pollute existing LRUs with mlocked pages rather easily.

As I've said, either I am still missing something or this might not even
be a big deal in real life. I was mostly curious about the choice to
exclude the page table walk for LOCKONFAULT.
--
Michal Hocko
SUSE Labs

2022-02-10 08:43:26

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH 00/13] mm/munlock: rework of mlock+munlock page handling

On Wed 09-02-22 14:59:19, Hugh Dickins wrote:
[...]
> Ah, but you're worrying about any previously-mapped pages when
> VM_LOCKONFAULT is established: those ones get done by mlock_pte_range()
> in 07/13, same as when VM_LOCKED is established - I checked again,
> VM_LOCKONFAULT is not set without VM_LOCKED.

Right you are. mlock_fixup->mlock_vma_pages_range is the missing piece.
This is called for both do_mlock and mlockall paths. So all good and
thanks for the clarification!
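
To write down roughly what I was missing (a sketch, not the whole picture):

    mlock()/mlock2(): do_mlock() -> apply_vma_lock_flags() -> mlock_fixup()
    mlockall():       apply_mlockall_flags() -> mlock_fixup()
      mlock_fixup()
        mlock_vma_pages_range()       applies newflags        [07/13]
          walk_page_range()
            mlock_pte_range() -> mlock_page() on the pages already mapped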

I will get back to this sometime next week.
--
Michal Hocko
SUSE Labs

2022-02-10 12:17:02

by Vlastimil Babka

[permalink] [raw]
Subject: Re: [PATCH 02/13] mm/munlock: delete FOLL_MLOCK and FOLL_POPULATE

On 2/6/22 22:32, Hugh Dickins wrote:
> If counting page mlocks, we must not double-count: follow_page_pte() can
> tell if a page has already been Mlocked or not, but cannot tell if a pte
> has already been counted or not: that will have to be done when the pte
> is mapped in (which lru_cache_add_inactive_or_unevictable() already tracks
> for new anon pages, but there's no such tracking yet for others).
>
> Delete all the FOLL_MLOCK code - faulting in the missing pages will do
> all that is necessary, without special mlock_vma_page() calls from here.
>
> But then FOLL_POPULATE turns out to serve no purpose - it was there so
> that its absence would tell faultin_page() not to faultin page when
> setting up VM_LOCKONFAULT areas; but if there's no special work needed
> here for mlock, then there's no work at all here for VM_LOCKONFAULT.
>
> Have I got that right? I've not looked into the history, but see that
> FOLL_POPULATE goes back before VM_LOCKONFAULT: did it serve a different
> purpose before? Ah, yes, it was used to skip the old stack guard page.
>
> And is it intentional that COW is not broken on existing pages when
> setting up a VM_LOCKONFAULT area? I can see that being argued either
> way, and have no reason to disagree with current behaviour.

Yeah, I think it's consistent with the two use cases stated for VM_LOCKONFAULT:
https://lore.kernel.org/all/[email protected]/

> Signed-off-by: Hugh Dickins <[email protected]>

Acked-by: Vlastimil Babka <[email protected]>

2022-02-11 00:09:36

by Vlastimil Babka

[permalink] [raw]
Subject: Re: [PATCH 03/13] mm/munlock: delete munlock_vma_pages_all(), allow oomreap

On 2/6/22 22:34, Hugh Dickins wrote:
> munlock_vma_pages_range() will still be required, when munlocking but
> not munmapping a set of pages; but when unmapping a pte, the mlock count
> will be maintained in much the same way as it will be maintained when
> mapping in the pte. Which removes the need for munlock_vma_pages_all()
> on mlocked vmas when munmapping or exiting: eliminating the catastrophic
> contention on i_mmap_rwsem, and the need for page lock on the pages.
>
> There is still a need to update locked_vm accounting according to the
> munmapped vmas when munmapping: do that in detach_vmas_to_be_unmapped().
> exit_mmap() does not need locked_vm updates, so delete unlock_range().
>
> And wasn't I the one who forbade the OOM reaper to attack mlocked vmas,
> because of the uncertainty in blocking on all those page locks?
> No fear of that now, so permit the OOM reaper on mlocked vmas.
>
> Signed-off-by: Hugh Dickins <[email protected]>

Acked-by: Vlastimil Babka <[email protected]>

2022-02-11 19:13:10

by Vlastimil Babka

[permalink] [raw]
Subject: Re: [PATCH 07/13] mm/munlock: mlock_pte_range() when mlocking or munlocking

On 2/6/22 22:42, Hugh Dickins wrote:
> Fill in missing pieces: reimplementation of munlock_vma_pages_range(),
> required to lower the mlock_counts when munlocking without munmapping;
> and its complement, implementation of mlock_vma_pages_range(), required
> to raise the mlock_counts on pages already there when a range is mlocked.
>
> Combine them into just the one function mlock_vma_pages_range(), using
> walk_page_range() to run mlock_pte_range(). This approach fixes the
> "Very slow unlockall()" of unpopulated PROT_NONE areas, reported in
> https://lore.kernel.org/linux-mm/[email protected]/
>
> Munlock clears VM_LOCKED at the start, under exclusive mmap_lock; but if
> a racing truncate or holepunch (depending on i_mmap_rwsem) gets to the
> pte first, it will not try to munlock the page: leaving release_pages()
> to correct it when the last reference to the page is gone - that's okay,
> a page is not evictable anyway while it is held by an extra reference.
>
> Mlock sets VM_LOCKED at the start, under exclusive mmap_lock; but if
> a racing remove_migration_pte() or try_to_unmap_one() (depending on
> i_mmap_rwsem) gets to the pte first, it will try to mlock the page,
> then mlock_pte_range() will mlock it a second time. This is harder to
> reproduce, but a more serious race because it could leave the page
> unevictable indefinitely even though the area is munlocked afterwards.
> Guard against it by setting the (inappropriate) VM_IO flag,
> and modifying mlock_vma_page() to decline such vmas.
>
> Signed-off-by: Hugh Dickins <[email protected]>

Acked-by: Vlastimil Babka <[email protected]>

> @@ -162,8 +230,7 @@ static int mlock_fixup(struct vm_area_struct *vma, struct vm_area_struct **prev,
> pgoff_t pgoff;
> int nr_pages;
> int ret = 0;
> - int lock = !!(newflags & VM_LOCKED);
> - vm_flags_t old_flags = vma->vm_flags;
> + vm_flags_t oldflags = vma->vm_flags;
>
> if (newflags == vma->vm_flags || (vma->vm_flags & VM_SPECIAL) ||

Nit: can use oldflags instead of vma->vm_flags above?

2022-02-11 19:15:04

by Vlastimil Babka

[permalink] [raw]
Subject: Re: [PATCH 04/13] mm/munlock: rmap call mlock_vma_page() munlock_vma_page()

On 2/6/22 22:36, Hugh Dickins wrote:
> Add vma argument to mlock_vma_page() and munlock_vma_page(), make them
> inline functions which check (vma->vm_flags & VM_LOCKED) before calling
> mlock_page() and munlock_page() in mm/mlock.c.
>
> Add bool compound to mlock_vma_page() and munlock_vma_page(): this is
> because we have understandable difficulty in accounting pte maps of THPs,
> and if passed a PageHead page, mlock_page() and munlock_page() cannot
> tell whether it's a pmd map to be counted or a pte map to be ignored.
>
> Add vma arg to page_add_file_rmap() and page_remove_rmap(), like the
> others, and use that to call mlock_vma_page() at the end of the page
> adds, and munlock_vma_page() at the end of page_remove_rmap() (end or
> beginning? unimportant, but end was easier for assertions in testing).
>
> No page lock is required (although almost all adds happen to hold it):
> delete the "Serialize with page migration" BUG_ON(!PageLocked(page))s.
> Certainly page lock did serialize with page migration, but I'm having
> difficulty explaining why that was ever important.
>
> Mlock accounting on THPs has been hard to define, differed between anon
> and file, involved PageDoubleMap in some places and not others, required
> clear_page_mlock() at some points. Keep it simple now: just count the
> pmds and ignore the ptes, there is no reason for ptes to undo pmd mlocks.
>
> page_add_new_anon_rmap() callers unchanged: they have long been calling
> lru_cache_add_inactive_or_unevictable(), which does its own VM_LOCKED
> handling (it also checks for not VM_SPECIAL: I think that's overcautious,
> and inconsistent with other checks, that mmap_region() already prevents
> VM_LOCKED on VM_SPECIAL; but haven't quite convinced myself to change it).
>
> Signed-off-by: Hugh Dickins <[email protected]>

Yeah, much better when most of the sites are encapsulated in rmap operations
like this.

Acked-by: Vlastimil Babka <[email protected]>

2022-02-11 20:36:21

by Vlastimil Babka

[permalink] [raw]
Subject: Re: [PATCH 09/13] mm/munlock: delete smp_mb() from __pagevec_lru_add_fn()

On 2/6/22 22:45, Hugh Dickins wrote:
> My reading of comment on smp_mb__after_atomic() in __pagevec_lru_add_fn()
> says that it can now be deleted; and that remains so when the next patch
> is added.

Agree with moderate certainty.

> Signed-off-by: Hugh Dickins <[email protected]>

Acked-by: Vlastimil Babka <[email protected]>

> ---
> mm/swap.c | 37 +++++++++----------------------------
> 1 file changed, 9 insertions(+), 28 deletions(-)
>
> diff --git a/mm/swap.c b/mm/swap.c
> index 682a03301a2c..3f770b1ea2c1 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -1025,37 +1025,18 @@ static void __pagevec_lru_add_fn(struct folio *folio, struct lruvec *lruvec)
>
> VM_BUG_ON_FOLIO(folio_test_lru(folio), folio);
>
> + folio_set_lru(folio);
> /*
> - * A folio becomes evictable in two ways:
> - * 1) Within LRU lock [munlock_vma_page() and __munlock_pagevec()].
> - * 2) Before acquiring LRU lock to put the folio on the correct LRU
> - * and then
> - * a) do PageLRU check with lock [check_move_unevictable_pages]
> - * b) do PageLRU check before lock [clear_page_mlock]
> - *
> - * (1) & (2a) are ok as LRU lock will serialize them. For (2b), we need
> - * following strict ordering:
> - *
> - * #0: __pagevec_lru_add_fn #1: clear_page_mlock
> - *
> - * folio_set_lru() folio_test_clear_mlocked()
> - * smp_mb() // explicit ordering // above provides strict
> - * // ordering
> - * folio_test_mlocked() folio_test_lru()
> + * Is an smp_mb__after_atomic() still required here, before
> + * folio_evictable() tests PageMlocked, to rule out the possibility
> + * of stranding an evictable folio on an unevictable LRU? I think
> + * not, because munlock_page() only clears PageMlocked while the LRU
> + * lock is held.
> *
> - *
> - * if '#1' does not observe setting of PG_lru by '#0' and
> - * fails isolation, the explicit barrier will make sure that
> - * folio_evictable check will put the folio on the correct
> - * LRU. Without smp_mb(), folio_set_lru() can be reordered
> - * after folio_test_mlocked() check and can make '#1' fail the
> - * isolation of the folio whose mlocked bit is cleared (#0 is
> - * also looking at the same folio) and the evictable folio will
> - * be stranded on an unevictable LRU.
> + * (That is not true of __page_cache_release(), and not necessarily
> + * true of release_pages(): but those only clear PageMlocked after
> + * put_page_testzero() has excluded any other users of the page.)
> */
> - folio_set_lru(folio);
> - smp_mb__after_atomic();
> -
> if (folio_evictable(folio)) {
> if (was_unevictable)
> __count_vm_events(UNEVICTABLE_PGRESCUED, nr_pages);

2022-02-14 03:11:20

by Vlastimil Babka

[permalink] [raw]
Subject: Re: [PATCH 11/13] mm/munlock: page migration needs mlock pagevec drained

On 2/6/22 22:49, Hugh Dickins wrote:
> Page migration of a VM_LOCKED page tends to fail, because when the old
> page is unmapped, it is put on the mlock pagevec with raised refcount,
> which then fails the freeze.
>
> At first I thought this would be fixed by a local mlock_page_drain() at
> the upper rmap_walk() level - which would have nicely batched all the
> munlocks of that page; but tests show that the task can too easily move
> to another cpu, leaving pagevec residue behind which fails the migration.
>
> So have try_to_migrate_one() drain the local pagevec after page_remove_rmap()
> from a VM_LOCKED vma; and do the same in try_to_unmap_one(), whose
> TTU_IGNORE_MLOCK users would want the same treatment; and do the same
> in remove_migration_pte() - not important when successfully inserting
> a new page, but necessary when hoping to retry after failure.
>
> Any new pagevec runs the risk of adding a new way of stranding, and we
> might discover other corners where mlock_page_drain() or lru_add_drain()
> would now help. If the mlock pagevec raises doubts, we can easily add a
> sysctl to tune its length to 1, which reverts to synchronous operation.

Not a fan of adding new sysctls like those as that just pushes the failure
of kernel devs to poor admins :)
The old pagevec usage deleted by patch 1 was limited to the naturally larger
munlock_vma_pages_range() operation. The new per-cpu based one is more
general, which obviously has its advantages, but then it might bring new
corner cases.
> So if this turns out to be a big problem, I would rather go back to the
limited scenario pagevec than a sysctl?

> Signed-off-by: Hugh Dickins <[email protected]>

Acked-by: Vlastimil Babka <[email protected]>

> ---
> mm/migrate.c | 2 ++
> mm/rmap.c | 4 ++++
> 2 files changed, 6 insertions(+)
>
> diff --git a/mm/migrate.c b/mm/migrate.c
> index f4bcf1541b62..e7d0b68d5dcb 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -251,6 +251,8 @@ static bool remove_migration_pte(struct page *page, struct vm_area_struct *vma,
> page_add_file_rmap(new, vma, false);
> set_pte_at(vma->vm_mm, pvmw.address, pvmw.pte, pte);
> }
> + if (vma->vm_flags & VM_LOCKED)
> + mlock_page_drain(smp_processor_id());
>
> /* No need to invalidate - it was non-present before */
> update_mmu_cache(vma, pvmw.address, pvmw.pte);
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 5442a5c97a85..714bfdc72c7b 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1656,6 +1656,8 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
> * See Documentation/vm/mmu_notifier.rst
> */
> page_remove_rmap(subpage, vma, PageHuge(page));
> + if (vma->vm_flags & VM_LOCKED)
> + mlock_page_drain(smp_processor_id());
> put_page(page);
> }
>
> @@ -1930,6 +1932,8 @@ static bool try_to_migrate_one(struct page *page, struct vm_area_struct *vma,
> * See Documentation/vm/mmu_notifier.rst
> */
> page_remove_rmap(subpage, vma, PageHuge(page));
> + if (vma->vm_flags & VM_LOCKED)
> + mlock_page_drain(smp_processor_id());
> put_page(page);
> }
>

2022-02-14 09:57:18

by Hugh Dickins

[permalink] [raw]
Subject: Re: [PATCH 11/13] mm/munlock: page migration needs mlock pagevec drained

On Fri, 11 Feb 2022, Vlastimil Babka wrote:
> On 2/6/22 22:49, Hugh Dickins wrote:
> >
> > Any new pagevec runs the risk of adding a new way of stranding, and we
> > might discover other corners where mlock_page_drain() or lru_add_drain()
> > would now help. If the mlock pagevec raises doubts, we can easily add a
> > sysctl to tune its length to 1, which reverts to synchronous operation.
>
> Not a fan of adding new sysctls like those as that just pushes the failure
> of kernel devs to poor admins :)
> The old pagevec usage deleted by patch 1 was limited to the naturally larger
> munlock_vma_pages_range() operation. The new per-cpu based one is more
> general, which obviously has its advantages, but then it might bring new
> corner cases.
> So if this turns out to be a big problem, I would rather go back to the
> limited scenario pagevec than a sysctl?

Okay, I'll delete that comment proposing a sysctl, which was meant more
as a possible safety measure for our internal experimentation than for
general use. I just thought it was an easy way to force synchronous
operation.

I don't expect a big problem. The flush in this commit deals, I think,
with the only way it was treading on its own toes, using a pagevec at
a level which relied on the unlikelihood of a pagevec reference.

But I can imagine that such-and-such a test will expect Mlocked or
Unevictable to be exact, and this pagevec now delays it becoming exact.
I've not seen any example so far, but it does seem possible. Maybe
we shall fix the test, maybe we shall add a drain.

I hadn't thought of limiting the use of pagevec to the mass operation
(if a significant problem emerges). Yes, that might be a better idea,
thanks. Anyway, we don't need to worry in advance.

A change I also had in this patch, originally, was for
/proc/sys/vm/stat_refresh to lru_add_drain_all() first. I'm surprised
that tests see good numbers without it doing so; but since I've not
actually seen the need for it yet, I dropped that - we can always add
it later if the need emerges.
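
Should that need emerge, the change would be roughly this, at the top of
the stat_refresh handler in mm/vmstat.c (a sketch only):

        int vmstat_refresh(struct ctl_table *table, int write,
                           void *buffer, size_t *lenp, loff_t *ppos)
        {
                /* flush per-cpu LRU (and mlock) pagevecs before refreshing */
                lru_add_drain_all();

                /* ... existing refresh and sanity checks unchanged ... */
        }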

Hugh

2022-02-14 10:00:09

by Hugh Dickins

[permalink] [raw]
Subject: Re: [PATCH 07/13] mm/munlock: mlock_pte_range() when mlocking or munlocking

On Fri, 11 Feb 2022, Vlastimil Babka wrote:
> On 2/6/22 22:42, Hugh Dickins wrote:
> > @@ -162,8 +230,7 @@ static int mlock_fixup(struct vm_area_struct *vma, struct vm_area_struct **prev,
> > pgoff_t pgoff;
> > int nr_pages;
> > int ret = 0;
> > - int lock = !!(newflags & VM_LOCKED);
> > - vm_flags_t old_flags = vma->vm_flags;
> > + vm_flags_t oldflags = vma->vm_flags;
> >
> > if (newflags == vma->vm_flags || (vma->vm_flags & VM_SPECIAL) ||
>
> Nit: can use oldflags instead of vma->vm_flags above?

Yes thanks, that is nicer, I'm making that change now.

Hugh

2022-02-14 10:19:53

by Hugh Dickins

[permalink] [raw]
Subject: [PATCH v2 07/13] mm/munlock: mlock_pte_range() when mlocking or munlocking

Fill in missing pieces: reimplementation of munlock_vma_pages_range(),
required to lower the mlock_counts when munlocking without munmapping;
and its complement, implementation of mlock_vma_pages_range(), required
to raise the mlock_counts on pages already there when a range is mlocked.

Combine them into just the one function mlock_vma_pages_range(), using
walk_page_range() to run mlock_pte_range(). This approach fixes the
"Very slow unlockall()" of unpopulated PROT_NONE areas, reported in
https://lore.kernel.org/linux-mm/[email protected]/

Munlock clears VM_LOCKED at the start, under exclusive mmap_lock; but if
a racing truncate or holepunch (depending on i_mmap_rwsem) gets to the
pte first, it will not try to munlock the page: leaving release_pages()
to correct it when the last reference to the page is gone - that's okay,
a page is not evictable anyway while it is held by an extra reference.

Mlock sets VM_LOCKED at the start, under exclusive mmap_lock; but if
a racing remove_migration_pte() or try_to_unmap_one() (depending on
i_mmap_rwsem) gets to the pte first, it will try to mlock the page,
then mlock_pte_range() will mlock it a second time. This is harder to
reproduce, but a more serious race because it could leave the page
unevictable indefinitely even though the area is munlocked afterwards.
Guard against it by setting the (inappropriate) VM_IO flag,
and modifying mlock_vma_page() to decline such vmas.

Signed-off-by: Hugh Dickins <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
---
v2: Delete reintroduced VM_LOCKED_CLEAR_MASKing here, per Vlastimil.
Use oldflags instead of vma->vm_flags in mlock_fixup(), per Vlastimil.
Make clearer comment on mmap_lock and WRITE_ONCE, per Hillf.

mm/internal.h | 3 +-
mm/mlock.c | 111 ++++++++++++++++++++++++++++++++++++++++----------
2 files changed, 91 insertions(+), 23 deletions(-)

diff --git a/mm/internal.h b/mm/internal.h
index a43d79335c16..b3f0dd3ffba2 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -412,7 +412,8 @@ void mlock_page(struct page *page);
static inline void mlock_vma_page(struct page *page,
struct vm_area_struct *vma, bool compound)
{
- if (unlikely(vma->vm_flags & VM_LOCKED) &&
+ /* VM_IO check prevents migration from double-counting during mlock */
+ if (unlikely((vma->vm_flags & (VM_LOCKED|VM_IO)) == VM_LOCKED) &&
(compound || !PageTransCompound(page)))
mlock_page(page);
}
diff --git a/mm/mlock.c b/mm/mlock.c
index f8a3a54687dd..581ea8bf1b83 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -14,6 +14,7 @@
#include <linux/swapops.h>
#include <linux/pagemap.h>
#include <linux/pagevec.h>
+#include <linux/pagewalk.h>
#include <linux/mempolicy.h>
#include <linux/syscalls.h>
#include <linux/sched.h>
@@ -127,25 +128,91 @@ void munlock_page(struct page *page)
unlock_page_memcg(page);
}

+static int mlock_pte_range(pmd_t *pmd, unsigned long addr,
+ unsigned long end, struct mm_walk *walk)
+
+{
+ struct vm_area_struct *vma = walk->vma;
+ spinlock_t *ptl;
+ pte_t *start_pte, *pte;
+ struct page *page;
+
+ ptl = pmd_trans_huge_lock(pmd, vma);
+ if (ptl) {
+ if (!pmd_present(*pmd))
+ goto out;
+ if (is_huge_zero_pmd(*pmd))
+ goto out;
+ page = pmd_page(*pmd);
+ if (vma->vm_flags & VM_LOCKED)
+ mlock_page(page);
+ else
+ munlock_page(page);
+ goto out;
+ }
+
+ start_pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
+ for (pte = start_pte; addr != end; pte++, addr += PAGE_SIZE) {
+ if (!pte_present(*pte))
+ continue;
+ page = vm_normal_page(vma, addr, *pte);
+ if (!page)
+ continue;
+ if (PageTransCompound(page))
+ continue;
+ if (vma->vm_flags & VM_LOCKED)
+ mlock_page(page);
+ else
+ munlock_page(page);
+ }
+ pte_unmap(start_pte);
+out:
+ spin_unlock(ptl);
+ cond_resched();
+ return 0;
+}
+
/*
- * munlock_vma_pages_range() - munlock all pages in the vma range.'
- * @vma - vma containing range to be munlock()ed.
+ * mlock_vma_pages_range() - mlock any pages already in the range,
+ * or munlock all pages in the range.
+ * @vma - vma containing range to be mlock()ed or munlock()ed
* @start - start address in @vma of the range
- * @end - end of range in @vma.
- *
- * For mremap(), munmap() and exit().
+ * @end - end of range in @vma
+ * @newflags - the new set of flags for @vma.
*
- * Called with @vma VM_LOCKED.
- *
- * Returns with VM_LOCKED cleared. Callers must be prepared to
- * deal with this.
+ * Called for mlock(), mlock2() and mlockall(), to set @vma VM_LOCKED;
+ * called for munlock() and munlockall(), to clear VM_LOCKED from @vma.
*/
-static void munlock_vma_pages_range(struct vm_area_struct *vma,
- unsigned long start, unsigned long end)
+static void mlock_vma_pages_range(struct vm_area_struct *vma,
+ unsigned long start, unsigned long end, vm_flags_t newflags)
{
- vma->vm_flags &= VM_LOCKED_CLEAR_MASK;
+ static const struct mm_walk_ops mlock_walk_ops = {
+ .pmd_entry = mlock_pte_range,
+ };

- /* Reimplementation to follow in later commit */
+ /*
+ * There is a slight chance that concurrent page migration,
+ * or page reclaim finding a page of this now-VM_LOCKED vma,
+ * will call mlock_vma_page() and raise page's mlock_count:
+ * double counting, leaving the page unevictable indefinitely.
+ * Communicate this danger to mlock_vma_page() with VM_IO,
+ * which is a VM_SPECIAL flag not allowed on VM_LOCKED vmas.
+ * mmap_lock is held in write mode here, so this weird
+ * combination should not be visible to other mmap_lock users;
+ * but WRITE_ONCE so rmap walkers must see VM_IO if VM_LOCKED.
+ */
+ if (newflags & VM_LOCKED)
+ newflags |= VM_IO;
+ WRITE_ONCE(vma->vm_flags, newflags);
+
+ lru_add_drain();
+ walk_page_range(vma->vm_mm, start, end, &mlock_walk_ops, NULL);
+ lru_add_drain();
+
+ if (newflags & VM_IO) {
+ newflags &= ~VM_IO;
+ WRITE_ONCE(vma->vm_flags, newflags);
+ }
}

/*
@@ -164,10 +231,9 @@ static int mlock_fixup(struct vm_area_struct *vma, struct vm_area_struct **prev,
pgoff_t pgoff;
int nr_pages;
int ret = 0;
- int lock = !!(newflags & VM_LOCKED);
- vm_flags_t old_flags = vma->vm_flags;
+ vm_flags_t oldflags = vma->vm_flags;

- if (newflags == vma->vm_flags || (vma->vm_flags & VM_SPECIAL) ||
+ if (newflags == oldflags || (oldflags & VM_SPECIAL) ||
is_vm_hugetlb_page(vma) || vma == get_gate_vma(current->mm) ||
vma_is_dax(vma) || vma_is_secretmem(vma))
/* don't set VM_LOCKED or VM_LOCKONFAULT and don't count */
@@ -199,9 +265,9 @@ static int mlock_fixup(struct vm_area_struct *vma, struct vm_area_struct **prev,
* Keep track of amount of locked VM.
*/
nr_pages = (end - start) >> PAGE_SHIFT;
- if (!lock)
+ if (!(newflags & VM_LOCKED))
nr_pages = -nr_pages;
- else if (old_flags & VM_LOCKED)
+ else if (oldflags & VM_LOCKED)
nr_pages = 0;
mm->locked_vm += nr_pages;

@@ -211,11 +277,12 @@ static int mlock_fixup(struct vm_area_struct *vma, struct vm_area_struct **prev,
* set VM_LOCKED, populate_vma_page_range will bring it back.
*/

- if (lock)
+ if ((newflags & VM_LOCKED) && (oldflags & VM_LOCKED)) {
+ /* No work to do, and mlocking twice would be wrong */
vma->vm_flags = newflags;
- else
- munlock_vma_pages_range(vma, start, end);
-
+ } else {
+ mlock_vma_pages_range(vma, start, end, newflags);
+ }
out:
*prev = vma;
return ret;
--
2.34.1

2022-02-14 10:43:42

by Hugh Dickins

[permalink] [raw]
Subject: [PATCH v2 11/13] mm/munlock: page migration needs mlock pagevec drained

Page migration of a VM_LOCKED page tends to fail, because when the old
page is unmapped, it is put on the mlock pagevec with raised refcount,
which then fails the freeze.

At first I thought this would be fixed by a local mlock_page_drain() at
the upper rmap_walk() level - which would have nicely batched all the
munlocks of that page; but tests show that the task can too easily move
to another cpu, leaving pagevec residue behind which fails the migration.

So have try_to_migrate_one() drain the local pagevec after page_remove_rmap()
from a VM_LOCKED vma; and do the same in try_to_unmap_one(), whose
TTU_IGNORE_MLOCK users would want the same treatment; and do the same
in remove_migration_pte() - not important when successfully inserting
a new page, but necessary when hoping to retry after failure.

Any new pagevec runs the risk of adding a new way of stranding, and we
might discover other corners where mlock_page_drain() or lru_add_drain()
would now help.

Signed-off-by: Hugh Dickins <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
---
v2: Delete suggestion of sysctl from changelog, per Vlastimil.

mm/migrate.c | 2 ++
mm/rmap.c | 4 ++++
2 files changed, 6 insertions(+)

diff --git a/mm/migrate.c b/mm/migrate.c
index f4bcf1541b62..e7d0b68d5dcb 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -251,6 +251,8 @@ static bool remove_migration_pte(struct page *page, struct vm_area_struct *vma,
page_add_file_rmap(new, vma, false);
set_pte_at(vma->vm_mm, pvmw.address, pvmw.pte, pte);
}
+ if (vma->vm_flags & VM_LOCKED)
+ mlock_page_drain(smp_processor_id());

/* No need to invalidate - it was non-present before */
update_mmu_cache(vma, pvmw.address, pvmw.pte);
diff --git a/mm/rmap.c b/mm/rmap.c
index 5442a5c97a85..714bfdc72c7b 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1656,6 +1656,8 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
* See Documentation/vm/mmu_notifier.rst
*/
page_remove_rmap(subpage, vma, PageHuge(page));
+ if (vma->vm_flags & VM_LOCKED)
+ mlock_page_drain(smp_processor_id());
put_page(page);
}

@@ -1930,6 +1932,8 @@ static bool try_to_migrate_one(struct page *page, struct vm_area_struct *vma,
* See Documentation/vm/mmu_notifier.rst
*/
page_remove_rmap(subpage, vma, PageHuge(page));
+ if (vma->vm_flags & VM_LOCKED)
+ mlock_page_drain(smp_processor_id());
put_page(page);
}

--
2.34.1

2022-02-14 18:50:20

by Hugh Dickins

[permalink] [raw]
Subject: [PATCH v2 04/13] mm/munlock: rmap call mlock_vma_page() munlock_vma_page()

Add vma argument to mlock_vma_page() and munlock_vma_page(), make them
inline functions which check (vma->vm_flags & VM_LOCKED) before calling
mlock_page() and munlock_page() in mm/mlock.c.

Add bool compound to mlock_vma_page() and munlock_vma_page(): this is
because we have understandable difficulty in accounting pte maps of THPs,
and if passed a PageHead page, mlock_page() and munlock_page() cannot
tell whether it's a pmd map to be counted or a pte map to be ignored.

Add vma arg to page_add_file_rmap() and page_remove_rmap(), like the
others, and use that to call mlock_vma_page() at the end of the page
adds, and munlock_vma_page() at the end of page_remove_rmap() (end or
beginning? unimportant, but end was easier for assertions in testing).

No page lock is required (although almost all adds happen to hold it):
delete the "Serialize with page migration" BUG_ON(!PageLocked(page))s.
Certainly page lock did serialize with page migration, but I'm having
difficulty explaining why that was ever important.

Mlock accounting on THPs has been hard to define, differed between anon
and file, involved PageDoubleMap in some places and not others, required
clear_page_mlock() at some points. Keep it simple now: just count the
pmds and ignore the ptes, there is no reason for ptes to undo pmd mlocks.

page_add_new_anon_rmap() callers unchanged: they have long been calling
lru_cache_add_inactive_or_unevictable(), which does its own VM_LOCKED
handling (it also checks for not VM_SPECIAL: I think that's overcautious,
and inconsistent with other checks, that mmap_region() already prevents
VM_LOCKED on VM_SPECIAL; but haven't quite convinced myself to change it).

Signed-off-by: Hugh Dickins <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
---
v2: Fix kerneldoc on mlock_page() to use colon, per Yang Li.
mmotm will require minor fixup in mm/huge_memory.c like v1.

include/linux/rmap.h | 17 +++++++------
kernel/events/uprobes.c | 7 ++----
mm/huge_memory.c | 17 ++++++-------
mm/hugetlb.c | 4 +--
mm/internal.h | 36 ++++++++++++++++++++++----
mm/khugepaged.c | 4 +--
mm/ksm.c | 12 +--------
mm/memory.c | 45 +++++++++++----------------------
mm/migrate.c | 9 ++-----
mm/mlock.c | 21 ++++++----------
mm/rmap.c | 56 +++++++++++++++++++----------------------
mm/userfaultfd.c | 14 ++++++-----
12 files changed, 113 insertions(+), 129 deletions(-)

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index dc48aa8c2c94..ac29b076082b 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -167,18 +167,19 @@ struct anon_vma *page_get_anon_vma(struct page *page);
*/
void page_move_anon_rmap(struct page *, struct vm_area_struct *);
void page_add_anon_rmap(struct page *, struct vm_area_struct *,
- unsigned long, bool);
+ unsigned long address, bool compound);
void do_page_add_anon_rmap(struct page *, struct vm_area_struct *,
- unsigned long, int);
+ unsigned long address, int flags);
void page_add_new_anon_rmap(struct page *, struct vm_area_struct *,
- unsigned long, bool);
-void page_add_file_rmap(struct page *, bool);
-void page_remove_rmap(struct page *, bool);
-
+ unsigned long address, bool compound);
+void page_add_file_rmap(struct page *, struct vm_area_struct *,
+ bool compound);
+void page_remove_rmap(struct page *, struct vm_area_struct *,
+ bool compound);
void hugepage_add_anon_rmap(struct page *, struct vm_area_struct *,
- unsigned long);
+ unsigned long address);
void hugepage_add_new_anon_rmap(struct page *, struct vm_area_struct *,
- unsigned long);
+ unsigned long address);

static inline void page_dup_rmap(struct page *page, bool compound)
{
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 6357c3580d07..eed2f7437d96 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -173,7 +173,7 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr,
return err;
}

- /* For try_to_free_swap() and munlock_vma_page() below */
+ /* For try_to_free_swap() below */
lock_page(old_page);

mmu_notifier_invalidate_range_start(&range);
@@ -201,13 +201,10 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr,
set_pte_at_notify(mm, addr, pvmw.pte,
mk_pte(new_page, vma->vm_page_prot));

- page_remove_rmap(old_page, false);
+ page_remove_rmap(old_page, vma, false);
if (!page_mapped(old_page))
try_to_free_swap(old_page);
page_vma_mapped_walk_done(&pvmw);
-
- if ((vma->vm_flags & VM_LOCKED) && !PageCompound(old_page))
- munlock_vma_page(old_page);
put_page(old_page);

err = 0;
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 9a34b85ebcf8..d6477f48a27e 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1577,7 +1577,7 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,

if (pmd_present(orig_pmd)) {
page = pmd_page(orig_pmd);
- page_remove_rmap(page, true);
+ page_remove_rmap(page, vma, true);
VM_BUG_ON_PAGE(page_mapcount(page) < 0, page);
VM_BUG_ON_PAGE(!PageHead(page), page);
} else if (thp_migration_supported()) {
@@ -1962,7 +1962,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
set_page_dirty(page);
if (!PageReferenced(page) && pmd_young(old_pmd))
SetPageReferenced(page);
- page_remove_rmap(page, true);
+ page_remove_rmap(page, vma, true);
put_page(page);
}
add_mm_counter(mm, mm_counter_file(page), -HPAGE_PMD_NR);
@@ -2096,6 +2096,9 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
}
}
unlock_page_memcg(page);
+
+ /* Above is effectively page_remove_rmap(page, vma, true) */
+ munlock_vma_page(page, vma, true);
}

smp_wmb(); /* make pte visible before pmd */
@@ -2103,7 +2106,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,

if (freeze) {
for (i = 0; i < HPAGE_PMD_NR; i++) {
- page_remove_rmap(page + i, false);
+ page_remove_rmap(page + i, vma, false);
put_page(page + i);
}
}
@@ -2163,8 +2166,6 @@ void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
do_unlock_page = true;
}
}
- if (PageMlocked(page))
- clear_page_mlock(page);
} else if (!(pmd_devmap(*pmd) || is_pmd_migration_entry(*pmd)))
goto out;
__split_huge_pmd_locked(vma, pmd, range.start, freeze);
@@ -3138,7 +3139,7 @@ void set_pmd_migration_entry(struct page_vma_mapped_walk *pvmw,
if (pmd_soft_dirty(pmdval))
pmdswp = pmd_swp_mksoft_dirty(pmdswp);
set_pmd_at(mm, address, pvmw->pmd, pmdswp);
- page_remove_rmap(page, true);
+ page_remove_rmap(page, vma, true);
put_page(page);
}

@@ -3168,10 +3169,8 @@ void remove_migration_pmd(struct page_vma_mapped_walk *pvmw, struct page *new)
if (PageAnon(new))
page_add_anon_rmap(new, vma, mmun_start, true);
else
- page_add_file_rmap(new, true);
+ page_add_file_rmap(new, vma, true);
set_pmd_at(mm, mmun_start, pvmw->pmd, pmde);
- if ((vma->vm_flags & VM_LOCKED) && !PageDoubleMap(new))
- mlock_vma_page(new);
update_mmu_cache_pmd(vma, address, pvmw->pmd);
}
#endif
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 61895cc01d09..43fb3155298e 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -5014,7 +5014,7 @@ static void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct
set_page_dirty(page);

hugetlb_count_sub(pages_per_huge_page(h), mm);
- page_remove_rmap(page, true);
+ page_remove_rmap(page, vma, true);

spin_unlock(ptl);
tlb_remove_page_size(tlb, page, huge_page_size(h));
@@ -5259,7 +5259,7 @@ static vm_fault_t hugetlb_cow(struct mm_struct *mm, struct vm_area_struct *vma,
/* Break COW */
huge_ptep_clear_flush(vma, haddr, ptep);
mmu_notifier_invalidate_range(mm, range.start, range.end);
- page_remove_rmap(old_page, true);
+ page_remove_rmap(old_page, vma, true);
hugepage_add_new_anon_rmap(new_page, vma, haddr);
set_huge_pte_at(mm, haddr, ptep,
make_huge_pte(vma, new_page, 1));
diff --git a/mm/internal.h b/mm/internal.h
index f235aa92e564..3d7dfc8bc471 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -395,12 +395,35 @@ extern long faultin_vma_page_range(struct vm_area_struct *vma,
bool write, int *locked);
extern int mlock_future_check(struct mm_struct *mm, unsigned long flags,
unsigned long len);
-
/*
- * must be called with vma's mmap_lock held for read or write, and page locked.
+ * mlock_vma_page() and munlock_vma_page():
+ * should be called with vma's mmap_lock held for read or write,
+ * under page table lock for the pte/pmd being added or removed.
+ *
+ * mlock is usually called at the end of page_add_*_rmap(),
+ * munlock at the end of page_remove_rmap(); but new anon
+ * pages are managed in lru_cache_add_inactive_or_unevictable().
+ *
+ * @compound is used to include pmd mappings of THPs, but filter out
+ * pte mappings of THPs, which cannot be consistently counted: a pte
+ * mapping of the THP head cannot be distinguished by the page alone.
*/
-extern void mlock_vma_page(struct page *page);
-extern void munlock_vma_page(struct page *page);
+void mlock_page(struct page *page);
+static inline void mlock_vma_page(struct page *page,
+ struct vm_area_struct *vma, bool compound)
+{
+ if (unlikely(vma->vm_flags & VM_LOCKED) &&
+ (compound || !PageTransCompound(page)))
+ mlock_page(page);
+}
+void munlock_page(struct page *page);
+static inline void munlock_vma_page(struct page *page,
+ struct vm_area_struct *vma, bool compound)
+{
+ if (unlikely(vma->vm_flags & VM_LOCKED) &&
+ (compound || !PageTransCompound(page)))
+ munlock_page(page);
+}

/*
* Clear the page's PageMlocked(). This can be useful in a situation where
@@ -487,7 +510,10 @@ static inline struct file *maybe_unlock_mmap_for_io(struct vm_fault *vmf,
#else /* !CONFIG_MMU */
static inline void unmap_mapping_folio(struct folio *folio) { }
static inline void clear_page_mlock(struct page *page) { }
-static inline void mlock_vma_page(struct page *page) { }
+static inline void mlock_vma_page(struct page *page,
+ struct vm_area_struct *vma, bool compound) { }
+static inline void munlock_vma_page(struct page *page,
+ struct vm_area_struct *vma, bool compound) { }
static inline void vunmap_range_noflush(unsigned long start, unsigned long end)
{
}
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 35f14d0a00a6..d5e387c58bde 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -773,7 +773,7 @@ static void __collapse_huge_page_copy(pte_t *pte, struct page *page,
*/
spin_lock(ptl);
ptep_clear(vma->vm_mm, address, _pte);
- page_remove_rmap(src_page, false);
+ page_remove_rmap(src_page, vma, false);
spin_unlock(ptl);
free_page_and_swap_cache(src_page);
}
@@ -1497,7 +1497,7 @@ void collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr)
if (pte_none(*pte))
continue;
page = vm_normal_page(vma, addr, *pte);
- page_remove_rmap(page, false);
+ page_remove_rmap(page, vma, false);
}

pte_unmap_unlock(start_pte, ptl);
diff --git a/mm/ksm.c b/mm/ksm.c
index c20bd4d9a0d9..c5a4403b5dc9 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -1177,7 +1177,7 @@ static int replace_page(struct vm_area_struct *vma, struct page *page,
ptep_clear_flush(vma, addr, ptep);
set_pte_at_notify(mm, addr, ptep, newpte);

- page_remove_rmap(page, false);
+ page_remove_rmap(page, vma, false);
if (!page_mapped(page))
try_to_free_swap(page);
put_page(page);
@@ -1252,16 +1252,6 @@ static int try_to_merge_one_page(struct vm_area_struct *vma,
err = replace_page(vma, page, kpage, orig_pte);
}

- if ((vma->vm_flags & VM_LOCKED) && kpage && !err) {
- munlock_vma_page(page);
- if (!PageMlocked(kpage)) {
- unlock_page(page);
- lock_page(kpage);
- mlock_vma_page(kpage);
- page = kpage; /* for final unlock */
- }
- }
-
out_unlock:
unlock_page(page);
out:
diff --git a/mm/memory.c b/mm/memory.c
index c125c4969913..53bd9e5f2e33 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -735,9 +735,6 @@ static void restore_exclusive_pte(struct vm_area_struct *vma,

set_pte_at(vma->vm_mm, address, ptep, pte);

- if (vma->vm_flags & VM_LOCKED)
- mlock_vma_page(page);
-
/*
* No need to invalidate - it was non-present before. However
* secondary CPUs may have mappings that need invalidating.
@@ -1377,7 +1374,7 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
mark_page_accessed(page);
}
rss[mm_counter(page)]--;
- page_remove_rmap(page, false);
+ page_remove_rmap(page, vma, false);
if (unlikely(page_mapcount(page) < 0))
print_bad_pte(vma, addr, ptent, page);
if (unlikely(__tlb_remove_page(tlb, page))) {
@@ -1397,10 +1394,8 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
continue;
pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
rss[mm_counter(page)]--;
-
if (is_device_private_entry(entry))
- page_remove_rmap(page, false);
-
+ page_remove_rmap(page, vma, false);
put_page(page);
continue;
}
@@ -1753,16 +1748,16 @@ static int validate_page_before_insert(struct page *page)
return 0;
}

-static int insert_page_into_pte_locked(struct mm_struct *mm, pte_t *pte,
+static int insert_page_into_pte_locked(struct vm_area_struct *vma, pte_t *pte,
unsigned long addr, struct page *page, pgprot_t prot)
{
if (!pte_none(*pte))
return -EBUSY;
/* Ok, finally just insert the thing.. */
get_page(page);
- inc_mm_counter_fast(mm, mm_counter_file(page));
- page_add_file_rmap(page, false);
- set_pte_at(mm, addr, pte, mk_pte(page, prot));
+ inc_mm_counter_fast(vma->vm_mm, mm_counter_file(page));
+ page_add_file_rmap(page, vma, false);
+ set_pte_at(vma->vm_mm, addr, pte, mk_pte(page, prot));
return 0;
}

@@ -1776,7 +1771,6 @@ static int insert_page_into_pte_locked(struct mm_struct *mm, pte_t *pte,
static int insert_page(struct vm_area_struct *vma, unsigned long addr,
struct page *page, pgprot_t prot)
{
- struct mm_struct *mm = vma->vm_mm;
int retval;
pte_t *pte;
spinlock_t *ptl;
@@ -1785,17 +1779,17 @@ static int insert_page(struct vm_area_struct *vma, unsigned long addr,
if (retval)
goto out;
retval = -ENOMEM;
- pte = get_locked_pte(mm, addr, &ptl);
+ pte = get_locked_pte(vma->vm_mm, addr, &ptl);
if (!pte)
goto out;
- retval = insert_page_into_pte_locked(mm, pte, addr, page, prot);
+ retval = insert_page_into_pte_locked(vma, pte, addr, page, prot);
pte_unmap_unlock(pte, ptl);
out:
return retval;
}

#ifdef pte_index
-static int insert_page_in_batch_locked(struct mm_struct *mm, pte_t *pte,
+static int insert_page_in_batch_locked(struct vm_area_struct *vma, pte_t *pte,
unsigned long addr, struct page *page, pgprot_t prot)
{
int err;
@@ -1805,7 +1799,7 @@ static int insert_page_in_batch_locked(struct mm_struct *mm, pte_t *pte,
err = validate_page_before_insert(page);
if (err)
return err;
- return insert_page_into_pte_locked(mm, pte, addr, page, prot);
+ return insert_page_into_pte_locked(vma, pte, addr, page, prot);
}

/* insert_pages() amortizes the cost of spinlock operations
@@ -1842,7 +1836,7 @@ static int insert_pages(struct vm_area_struct *vma, unsigned long addr,

start_pte = pte_offset_map_lock(mm, pmd, addr, &pte_lock);
for (pte = start_pte; pte_idx < batch_size; ++pte, ++pte_idx) {
- int err = insert_page_in_batch_locked(mm, pte,
+ int err = insert_page_in_batch_locked(vma, pte,
addr, pages[curr_page_idx], prot);
if (unlikely(err)) {
pte_unmap_unlock(start_pte, pte_lock);
@@ -3098,7 +3092,7 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
* mapcount is visible. So transitively, TLBs to
* old page will be flushed before it can be reused.
*/
- page_remove_rmap(old_page, false);
+ page_remove_rmap(old_page, vma, false);
}

/* Free the old page.. */
@@ -3118,16 +3112,6 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
*/
mmu_notifier_invalidate_range_only_end(&range);
if (old_page) {
- /*
- * Don't let another task, with possibly unlocked vma,
- * keep the mlocked page.
- */
- if (page_copied && (vma->vm_flags & VM_LOCKED)) {
- lock_page(old_page); /* LRU manipulation */
- if (PageMlocked(old_page))
- munlock_vma_page(old_page);
- unlock_page(old_page);
- }
if (page_copied)
free_swap_cache(old_page);
put_page(old_page);
@@ -3947,7 +3931,8 @@ vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page)
entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);

add_mm_counter(vma->vm_mm, mm_counter_file(page), HPAGE_PMD_NR);
- page_add_file_rmap(page, true);
+ page_add_file_rmap(page, vma, true);
+
/*
* deposit and withdraw with pmd lock held
*/
@@ -3996,7 +3981,7 @@ void do_set_pte(struct vm_fault *vmf, struct page *page, unsigned long addr)
lru_cache_add_inactive_or_unevictable(page, vma);
} else {
inc_mm_counter_fast(vma->vm_mm, mm_counter_file(page));
- page_add_file_rmap(page, false);
+ page_add_file_rmap(page, vma, false);
}
set_pte_at(vma->vm_mm, addr, vmf->pte, entry);
}
diff --git a/mm/migrate.c b/mm/migrate.c
index c7da064b4781..7c4223ce2500 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -248,14 +248,9 @@ static bool remove_migration_pte(struct page *page, struct vm_area_struct *vma,
if (PageAnon(new))
page_add_anon_rmap(new, vma, pvmw.address, false);
else
- page_add_file_rmap(new, false);
+ page_add_file_rmap(new, vma, false);
set_pte_at(vma->vm_mm, pvmw.address, pvmw.pte, pte);
}
- if (vma->vm_flags & VM_LOCKED && !PageTransCompound(new))
- mlock_vma_page(new);
-
- if (PageTransHuge(page) && PageMlocked(page))
- clear_page_mlock(page);

/* No need to invalidate - it was non-present before */
update_mmu_cache(vma, pvmw.address, pvmw.pte);
@@ -2331,7 +2326,7 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
* drop page refcount. Page won't be freed, as we took
* a reference just above.
*/
- page_remove_rmap(page, false);
+ page_remove_rmap(page, vma, false);
put_page(page);

if (pte_present(pte))
diff --git a/mm/mlock.c b/mm/mlock.c
index 5d7ced8303be..92f28258b4ae 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -78,17 +78,13 @@ void clear_page_mlock(struct page *page)
}
}

-/*
- * Mark page as mlocked if not already.
- * If page on LRU, isolate and putback to move to unevictable list.
+/**
+ * mlock_page - mlock a page
+ * @page: page to be mlocked, either a normal page or a THP head.
*/
-void mlock_vma_page(struct page *page)
+void mlock_page(struct page *page)
{
- /* Serialize with page migration */
- BUG_ON(!PageLocked(page));
-
VM_BUG_ON_PAGE(PageTail(page), page);
- VM_BUG_ON_PAGE(PageCompound(page) && PageDoubleMap(page), page);

if (!TestSetPageMlocked(page)) {
int nr_pages = thp_nr_pages(page);
@@ -101,14 +97,11 @@ void mlock_vma_page(struct page *page)
}

/**
- * munlock_vma_page - munlock a vma page
- * @page: page to be unlocked, either a normal page or THP page head
+ * munlock_page - munlock a page
+ * @page: page to be munlocked, either a normal page or a THP head.
*/
-void munlock_vma_page(struct page *page)
+void munlock_page(struct page *page)
{
- /* Serialize with page migration */
- BUG_ON(!PageLocked(page));
-
VM_BUG_ON_PAGE(PageTail(page), page);

if (TestClearPageMlocked(page)) {
diff --git a/mm/rmap.c b/mm/rmap.c
index 7ce7f1946cff..6cc8bf129f18 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1181,17 +1181,17 @@ void do_page_add_anon_rmap(struct page *page,
__mod_lruvec_page_state(page, NR_ANON_MAPPED, nr);
}

- if (unlikely(PageKsm(page))) {
+ if (unlikely(PageKsm(page)))
unlock_page_memcg(page);
- return;
- }

/* address might be in next vma when migration races vma_adjust */
- if (first)
+ else if (first)
__page_set_anon_rmap(page, vma, address,
flags & RMAP_EXCLUSIVE);
else
__page_check_anon_rmap(page, vma, address);
+
+ mlock_vma_page(page, vma, compound);
}

/**
@@ -1232,12 +1232,14 @@ void page_add_new_anon_rmap(struct page *page,

/**
* page_add_file_rmap - add pte mapping to a file page
- * @page: the page to add the mapping to
- * @compound: charge the page as compound or small page
+ * @page: the page to add the mapping to
+ * @vma: the vm area in which the mapping is added
+ * @compound: charge the page as compound or small page
*
* The caller needs to hold the pte lock.
*/
-void page_add_file_rmap(struct page *page, bool compound)
+void page_add_file_rmap(struct page *page,
+ struct vm_area_struct *vma, bool compound)
{
int i, nr = 1;

@@ -1260,13 +1262,8 @@ void page_add_file_rmap(struct page *page, bool compound)
nr_pages);
} else {
if (PageTransCompound(page) && page_mapping(page)) {
- struct page *head = compound_head(page);
-
VM_WARN_ON_ONCE(!PageLocked(page));
-
- SetPageDoubleMap(head);
- if (PageMlocked(page))
- clear_page_mlock(head);
+ SetPageDoubleMap(compound_head(page));
}
if (!atomic_inc_and_test(&page->_mapcount))
goto out;
@@ -1274,6 +1271,8 @@ void page_add_file_rmap(struct page *page, bool compound)
__mod_lruvec_page_state(page, NR_FILE_MAPPED, nr);
out:
unlock_page_memcg(page);
+
+ mlock_vma_page(page, vma, compound);
}

static void page_remove_file_rmap(struct page *page, bool compound)
@@ -1368,11 +1367,13 @@ static void page_remove_anon_compound_rmap(struct page *page)
/**
* page_remove_rmap - take down pte mapping from a page
* @page: page to remove mapping from
+ * @vma: the vm area from which the mapping is removed
* @compound: uncharge the page as compound or small page
*
* The caller needs to hold the pte lock.
*/
-void page_remove_rmap(struct page *page, bool compound)
+void page_remove_rmap(struct page *page,
+ struct vm_area_struct *vma, bool compound)
{
lock_page_memcg(page);

@@ -1414,6 +1415,8 @@ void page_remove_rmap(struct page *page, bool compound)
*/
out:
unlock_page_memcg(page);
+
+ munlock_vma_page(page, vma, compound);
}

/*
@@ -1469,28 +1472,21 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
mmu_notifier_invalidate_range_start(&range);

while (page_vma_mapped_walk(&pvmw)) {
+ /* Unexpected PMD-mapped THP? */
+ VM_BUG_ON_PAGE(!pvmw.pte, page);
+
/*
- * If the page is mlock()d, we cannot swap it out.
+ * If the page is in an mlock()d vma, we must not swap it out.
*/
if (!(flags & TTU_IGNORE_MLOCK) &&
(vma->vm_flags & VM_LOCKED)) {
- /*
- * PTE-mapped THP are never marked as mlocked: so do
- * not set it on a DoubleMap THP, nor on an Anon THP
- * (which may still be PTE-mapped after DoubleMap was
- * cleared). But stop unmapping even in those cases.
- */
- if (!PageTransCompound(page) || (PageHead(page) &&
- !PageDoubleMap(page) && !PageAnon(page)))
- mlock_vma_page(page);
+ /* Restore the mlock which got missed */
+ mlock_vma_page(page, vma, false);
page_vma_mapped_walk_done(&pvmw);
ret = false;
break;
}

- /* Unexpected PMD-mapped THP? */
- VM_BUG_ON_PAGE(!pvmw.pte, page);
-
subpage = page - page_to_pfn(page) + pte_pfn(*pvmw.pte);
address = pvmw.address;

@@ -1668,7 +1664,7 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
*
* See Documentation/vm/mmu_notifier.rst
*/
- page_remove_rmap(subpage, PageHuge(page));
+ page_remove_rmap(subpage, vma, PageHuge(page));
put_page(page);
}

@@ -1942,7 +1938,7 @@ static bool try_to_migrate_one(struct page *page, struct vm_area_struct *vma,
*
* See Documentation/vm/mmu_notifier.rst
*/
- page_remove_rmap(subpage, PageHuge(page));
+ page_remove_rmap(subpage, vma, PageHuge(page));
put_page(page);
}

@@ -2078,7 +2074,7 @@ static bool page_make_device_exclusive_one(struct page *page,
* There is a reference on the page for the swap entry which has
* been removed, so shouldn't take another.
*/
- page_remove_rmap(subpage, false);
+ page_remove_rmap(subpage, vma, false);
}

mmu_notifier_invalidate_range_end(&range);
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 0780c2a57ff1..15d3e97a6e04 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -95,10 +95,15 @@ int mfill_atomic_install_pte(struct mm_struct *dst_mm, pmd_t *dst_pmd,
if (!pte_none(*dst_pte))
goto out_unlock;

- if (page_in_cache)
- page_add_file_rmap(page, false);
- else
+ if (page_in_cache) {
+ /* Usually, cache pages are already added to LRU */
+ if (newly_allocated)
+ lru_cache_add(page);
+ page_add_file_rmap(page, dst_vma, false);
+ } else {
page_add_new_anon_rmap(page, dst_vma, dst_addr, false);
+ lru_cache_add_inactive_or_unevictable(page, dst_vma);
+ }

/*
* Must happen after rmap, as mm_counter() checks mapping (via
@@ -106,9 +111,6 @@ int mfill_atomic_install_pte(struct mm_struct *dst_mm, pmd_t *dst_pmd,
*/
inc_mm_counter(dst_mm, mm_counter(page));

- if (newly_allocated)
- lru_cache_add_inactive_or_unevictable(page, dst_vma);
-
set_pte_at(dst_mm, dst_addr, dst_pte, _dst_pte);

/* No need to invalidate - it was non-present before */
--
2.34.1