2022-02-15 02:27:18

by Hugh Dickins

Subject: [PATCH v2 00/13] mm/munlock: rework of mlock+munlock page handling

I wondered whether to post this munlocking rework in
https://lore.kernel.org/linux-mm/[email protected]/

There the discussion was OOM reaping, but the main reason for the rework
has been catastrophic contention on i_mmap_rwsem when exiting from
multiply mlocked files; and frustration with how contorted munlocking is.

tl;dr
mm/mlock.c | 637 +++++++++++++++-----------------------
23 files changed, 514 insertions(+), 780 deletions(-)

v1 of the series was posted on 6 Feb 2022:
https://lore.kernel.org/linux-mm/[email protected]/
A preview of the changed v2 patches 01 04 07 10 11 was posted on 13 Feb 2022.
Here is the full v2 series, in case that is easier to manage:
based again on 5.17-rc2, it applies also to -rc3 and -rc4.

Andrew, many thanks for including v1 and fixes in mmotm: please now replace

mm-munlock-delete-page_mlock-and-all-its-works.patch
mm-munlock-delete-foll_mlock-and-foll_populate.patch
mm-munlock-delete-munlock_vma_pages_all-allow-oomreap.patch
mm-munlock-rmap-call-mlock_vma_page-munlock_vma_page.patch
mm-munlock-replace-clear_page_mlock-by-final-clearance.patch
mm-munlock-maintain-page-mlock_count-while-unevictable.patch
mm-munlock-mlock_pte_range-when-mlocking-or-munlocking.patch
mm-migrate-__unmap_and_move-push-good-newpage-to-lru.patch
mm-munlock-delete-smp_mb-from-__pagevec_lru_add_fn.patch
mm-munlock-mlock_page-munlock_page-batch-by-pagevec.patch
mm-munlock-mlock_page-munlock_page-batch-by-pagevec-fix.patch
mm-munlock-mlock_page-munlock_page-batch-by-pagevec-fix-2.patch
mm-munlock-page-migration-needs-mlock-pagevec-drained.patch
mm-thp-collapse_file-do-try_to_unmapttu_batch_flush.patch
mm-thp-shrink_page_list-avoid-splitting-vm_locked-thp.patch

by the following thirteen of v2. As before, some easy fixups will be
needed to apply them in mm/huge_memory.c, but otherwise they are expected
to apply cleanly.

Many thanks to Vlastimil Babka for his review of 01 through 11, and to
Matthew Wilcox for graciously volunteering to rebase his series on top of these.

At present there's no update to Documentation/vm/unevictable-lru.rst:
that always needs a different mindset, and can follow later; please refer
to the commit messages for now.

There are two half-related mm/thp patches at the end: enhancements
we've had for a long time, but needed more after the mlock changes.

01/13 mm/munlock: delete page_mlock() and all its works
02/13 mm/munlock: delete FOLL_MLOCK and FOLL_POPULATE
03/13 mm/munlock: delete munlock_vma_pages_all(), allow oomreap
04/13 mm/munlock: rmap call mlock_vma_page() munlock_vma_page()
05/13 mm/munlock: replace clear_page_mlock() by final clearance
06/13 mm/munlock: maintain page->mlock_count while unevictable
07/13 mm/munlock: mlock_pte_range() when mlocking or munlocking
08/13 mm/migrate: __unmap_and_move() push good newpage to LRU
09/13 mm/munlock: delete smp_mb() from __pagevec_lru_add_fn()
10/13 mm/munlock: mlock_page() munlock_page() batch by pagevec
11/13 mm/munlock: page migration needs mlock pagevec drained
12/13 mm/thp: collapse_file() do try_to_unmap(TTU_BATCH_FLUSH)
13/13 mm/thp: shrink_page_list() avoid splitting VM_LOCKED THP

include/linux/mm.h | 2
include/linux/mm_inline.h | 11
include/linux/mm_types.h | 19 +
include/linux/rmap.h | 23 -
kernel/events/uprobes.c | 7
mm/gup.c | 43 --
mm/huge_memory.c | 55 ---
mm/hugetlb.c | 4
mm/internal.h | 66 ++-
mm/khugepaged.c | 14
mm/ksm.c | 12
mm/madvise.c | 5
mm/memcontrol.c | 3
mm/memory.c | 45 --
mm/migrate.c | 42 +-
mm/mlock.c | 637 +++++++++++++++-----------------------
mm/mmap.c | 32 -
mm/mmzone.c | 7
mm/oom_kill.c | 2
mm/rmap.c | 156 ++-------
mm/swap.c | 89 ++---
mm/userfaultfd.c | 14
mm/vmscan.c | 6
23 files changed, 514 insertions(+), 780 deletions(-)

Hugh


2022-02-15 03:03:07

by Hugh Dickins

Subject: [PATCH v2 06/13] mm/munlock: maintain page->mlock_count while unevictable

Previous patches have been preparatory: now implement page->mlock_count.
The ordering of the "Unevictable LRU" is of no significance, and there is
no point holding unevictable pages on a list: place page->mlock_count to
overlay page->lru.prev (since page->lru.next is overlaid by compound_head,
which needs to be even so as not to satisfy PageTail - though 2 could be
added instead of 1 for each mlock, if that's ever an improvement).

But it's only safe to rely on or modify page->mlock_count while lruvec
lock is held and page is on unevictable "LRU" - we can save lots of edits
by continuing to pretend that there's an imaginary LRU here (there is an
unevictable count which still needs to be maintained, but not a list).
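
The rule can be summed up in a minimal sketch (not the patch code itself,
but the shape that munlock_page() in the diff below follows - all of the
identifiers appear in the hunks of this patch):

	/* mlock_count is valid only on the unevictable "LRU", under lock */
	lruvec = folio_lruvec_lock_irq(page_folio(page));
	if (PageLRU(page) && PageUnevictable(page))
		page->mlock_count++;	/* safe: lruvec lock held */
	unlock_page_lruvec_irq(lruvec);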

The mlock_count technique suffers from an unreliability much like that of
page_mlock(): while someone else has the page off LRU, not much can be
done. As before, err on the safe side (behave as if mlock_count were 0),
and let try_to_unmap_one() move the page to unevictable if reclaim finds
out later on - a few misplaced pages don't matter; what we want to avoid
is unbalancing reclaim by flooding evictable lists with unevictable pages.

I am not a fan of "if (!isolate_lru_page(page)) putback_lru_page(page);":
if we have taken lruvec lock to get the page off its present list, then
we save everyone trouble (and however many extra atomic ops) by putting
it on its destination list immediately.

Signed-off-by: Hugh Dickins <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
---
v2: same as v1.

include/linux/mm_inline.h | 11 +++++--
include/linux/mm_types.h | 19 +++++++++--
mm/huge_memory.c | 5 ++-
mm/memcontrol.c | 3 +-
mm/mlock.c | 68 +++++++++++++++++++++++++++++++--------
mm/mmzone.c | 7 ++++
mm/swap.c | 1 +
7 files changed, 92 insertions(+), 22 deletions(-)

diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
index b725839dfe71..884d6f6af05b 100644
--- a/include/linux/mm_inline.h
+++ b/include/linux/mm_inline.h
@@ -99,7 +99,8 @@ void lruvec_add_folio(struct lruvec *lruvec, struct folio *folio)

update_lru_size(lruvec, lru, folio_zonenum(folio),
folio_nr_pages(folio));
- list_add(&folio->lru, &lruvec->lists[lru]);
+ if (lru != LRU_UNEVICTABLE)
+ list_add(&folio->lru, &lruvec->lists[lru]);
}

static __always_inline void add_page_to_lru_list(struct page *page,
@@ -115,6 +116,7 @@ void lruvec_add_folio_tail(struct lruvec *lruvec, struct folio *folio)

update_lru_size(lruvec, lru, folio_zonenum(folio),
folio_nr_pages(folio));
+ /* This is not expected to be used on LRU_UNEVICTABLE */
list_add_tail(&folio->lru, &lruvec->lists[lru]);
}

@@ -127,8 +129,11 @@ static __always_inline void add_page_to_lru_list_tail(struct page *page,
static __always_inline
void lruvec_del_folio(struct lruvec *lruvec, struct folio *folio)
{
- list_del(&folio->lru);
- update_lru_size(lruvec, folio_lru_list(folio), folio_zonenum(folio),
+ enum lru_list lru = folio_lru_list(folio);
+
+ if (lru != LRU_UNEVICTABLE)
+ list_del(&folio->lru);
+ update_lru_size(lruvec, lru, folio_zonenum(folio),
-folio_nr_pages(folio));
}

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 5140e5feb486..475bdb282769 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -85,7 +85,16 @@ struct page {
* lruvec->lru_lock. Sometimes used as a generic list
* by the page owner.
*/
- struct list_head lru;
+ union {
+ struct list_head lru;
+ /* Or, for the Unevictable "LRU list" slot */
+ struct {
+ /* Always even, to negate PageTail */
+ void *__filler;
+ /* Count page's or folio's mlocks */
+ unsigned int mlock_count;
+ };
+ };
/* See page-flags.h for PAGE_MAPPING_FLAGS */
struct address_space *mapping;
pgoff_t index; /* Our offset within mapping. */
@@ -241,7 +250,13 @@ struct folio {
struct {
/* public: */
unsigned long flags;
- struct list_head lru;
+ union {
+ struct list_head lru;
+ struct {
+ void *__filler;
+ unsigned int mlock_count;
+ };
+ };
struct address_space *mapping;
pgoff_t index;
void *private;
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index d6477f48a27e..9afca0122723 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2300,8 +2300,11 @@ static void lru_add_page_tail(struct page *head, struct page *tail,
} else {
/* head is still on lru (and we have it frozen) */
VM_WARN_ON(!PageLRU(head));
+ if (PageUnevictable(tail))
+ tail->mlock_count = 0;
+ else
+ list_add_tail(&tail->lru, &head->lru);
SetPageLRU(tail);
- list_add_tail(&tail->lru, &head->lru);
}
}

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 09d342c7cbd0..b10590926177 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1257,8 +1257,7 @@ struct lruvec *folio_lruvec_lock_irqsave(struct folio *folio,
* @nr_pages: positive when adding or negative when removing
*
* This function must be called under lru_lock, just before a page is added
- * to or just after a page is removed from an lru list (that ordering being
- * so as to allow it to check that lru_size 0 is consistent with list_empty).
+ * to or just after a page is removed from an lru list.
*/
void mem_cgroup_update_lru_size(struct lruvec *lruvec, enum lru_list lru,
int zid, int nr_pages)
diff --git a/mm/mlock.c b/mm/mlock.c
index 3c26473050a3..f8a3a54687dd 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -54,16 +54,35 @@ EXPORT_SYMBOL(can_do_mlock);
*/
void mlock_page(struct page *page)
{
+ struct lruvec *lruvec;
+ int nr_pages = thp_nr_pages(page);
+
VM_BUG_ON_PAGE(PageTail(page), page);

if (!TestSetPageMlocked(page)) {
- int nr_pages = thp_nr_pages(page);
-
mod_zone_page_state(page_zone(page), NR_MLOCK, nr_pages);
- count_vm_events(UNEVICTABLE_PGMLOCKED, nr_pages);
- if (!isolate_lru_page(page))
- putback_lru_page(page);
+ __count_vm_events(UNEVICTABLE_PGMLOCKED, nr_pages);
+ }
+
+ /* There is nothing more we can do while it's off LRU */
+ if (!TestClearPageLRU(page))
+ return;
+
+ lruvec = folio_lruvec_lock_irq(page_folio(page));
+ if (PageUnevictable(page)) {
+ page->mlock_count++;
+ goto out;
}
+
+ del_page_from_lru_list(page, lruvec);
+ ClearPageActive(page);
+ SetPageUnevictable(page);
+ page->mlock_count = 1;
+ add_page_to_lru_list(page, lruvec);
+ __count_vm_events(UNEVICTABLE_PGCULLED, nr_pages);
+out:
+ SetPageLRU(page);
+ unlock_page_lruvec_irq(lruvec);
}

/**
@@ -72,19 +91,40 @@ void mlock_page(struct page *page)
*/
void munlock_page(struct page *page)
{
+ struct lruvec *lruvec;
+ int nr_pages = thp_nr_pages(page);
+
VM_BUG_ON_PAGE(PageTail(page), page);

+ lock_page_memcg(page);
+ lruvec = folio_lruvec_lock_irq(page_folio(page));
+ if (PageLRU(page) && PageUnevictable(page)) {
+ /* Then mlock_count is maintained, but might undercount */
+ if (page->mlock_count)
+ page->mlock_count--;
+ if (page->mlock_count)
+ goto out;
+ }
+ /* else assume that was the last mlock: reclaim will fix it if not */
+
if (TestClearPageMlocked(page)) {
- int nr_pages = thp_nr_pages(page);
-
- mod_zone_page_state(page_zone(page), NR_MLOCK, -nr_pages);
- if (!isolate_lru_page(page)) {
- putback_lru_page(page);
- count_vm_events(UNEVICTABLE_PGMUNLOCKED, nr_pages);
- } else if (PageUnevictable(page)) {
- count_vm_events(UNEVICTABLE_PGSTRANDED, nr_pages);
- }
+ __mod_zone_page_state(page_zone(page), NR_MLOCK, -nr_pages);
+ if (PageLRU(page) || !PageUnevictable(page))
+ __count_vm_events(UNEVICTABLE_PGMUNLOCKED, nr_pages);
+ else
+ __count_vm_events(UNEVICTABLE_PGSTRANDED, nr_pages);
+ }
+
+ /* page_evictable() has to be checked *after* clearing Mlocked */
+ if (PageLRU(page) && PageUnevictable(page) && page_evictable(page)) {
+ del_page_from_lru_list(page, lruvec);
+ ClearPageUnevictable(page);
+ add_page_to_lru_list(page, lruvec);
+ __count_vm_events(UNEVICTABLE_PGRESCUED, nr_pages);
}
+out:
+ unlock_page_lruvec_irq(lruvec);
+ unlock_page_memcg(page);
}

/*
diff --git a/mm/mmzone.c b/mm/mmzone.c
index eb89d6e018e2..40e1d9428300 100644
--- a/mm/mmzone.c
+++ b/mm/mmzone.c
@@ -81,6 +81,13 @@ void lruvec_init(struct lruvec *lruvec)

for_each_lru(lru)
INIT_LIST_HEAD(&lruvec->lists[lru]);
+ /*
+ * The "Unevictable LRU" is imaginary: though its size is maintained,
+ * it is never scanned, and unevictable pages are not threaded on it
+ * (so that their lru fields can be reused to hold mlock_count).
+ * Poison its list head, so that any operations on it would crash.
+ */
+ list_del(&lruvec->lists[LRU_UNEVICTABLE]);
}

#if defined(CONFIG_NUMA_BALANCING) && !defined(LAST_CPUPID_NOT_IN_PAGE_FLAGS)
diff --git a/mm/swap.c b/mm/swap.c
index ff4810e4a4bc..682a03301a2c 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -1062,6 +1062,7 @@ static void __pagevec_lru_add_fn(struct folio *folio, struct lruvec *lruvec)
} else {
folio_clear_active(folio);
folio_set_unevictable(folio);
+ folio->mlock_count = !!folio_test_mlocked(folio);
if (!was_unevictable)
__count_vm_events(UNEVICTABLE_PGCULLED, nr_pages);
}
--
2.34.1

2022-02-15 03:09:08

by Hugh Dickins

Subject: [PATCH v2 11/13] mm/munlock: page migration needs mlock pagevec drained

Page migration of a VM_LOCKED page tends to fail, because when the old
page is unmapped, it is put on the mlock pagevec with raised refcount,
which then fails the freeze.

At first I thought this would be fixed by a local mlock_page_drain() at
the upper rmap_walk() level - which would have nicely batched all the
munlocks of that page; but tests show that the task can too easily move
to another cpu, leaving pagevec residue behind which fails the migration.

So have try_to_migrate_one() drain the local pagevec after
page_remove_rmap() from a VM_LOCKED vma; do the same in try_to_unmap_one(),
whose TTU_IGNORE_MLOCK users would want the same treatment; and do the same
in remove_migration_pte() - not important when successfully inserting
a new page, but necessary when hoping to retry after failure.
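
Each of the three sites gains the same two lines immediately after
page_remove_rmap(), as the hunks below show:

	if (vma->vm_flags & VM_LOCKED)
		mlock_page_drain(smp_processor_id());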

Any new pagevec runs the risk of adding a new way of stranding, and we
might discover other corners where mlock_page_drain() or lru_add_drain()
would now help.

Signed-off-by: Hugh Dickins <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
---
v2: Delete suggestion of sysctl from changelog, per Vlastimil.

mm/migrate.c | 2 ++
mm/rmap.c | 4 ++++
2 files changed, 6 insertions(+)

diff --git a/mm/migrate.c b/mm/migrate.c
index f4bcf1541b62..e7d0b68d5dcb 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -251,6 +251,8 @@ static bool remove_migration_pte(struct page *page, struct vm_area_struct *vma,
page_add_file_rmap(new, vma, false);
set_pte_at(vma->vm_mm, pvmw.address, pvmw.pte, pte);
}
+ if (vma->vm_flags & VM_LOCKED)
+ mlock_page_drain(smp_processor_id());

/* No need to invalidate - it was non-present before */
update_mmu_cache(vma, pvmw.address, pvmw.pte);
diff --git a/mm/rmap.c b/mm/rmap.c
index 5442a5c97a85..714bfdc72c7b 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1656,6 +1656,8 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
* See Documentation/vm/mmu_notifier.rst
*/
page_remove_rmap(subpage, vma, PageHuge(page));
+ if (vma->vm_flags & VM_LOCKED)
+ mlock_page_drain(smp_processor_id());
put_page(page);
}

@@ -1930,6 +1932,8 @@ static bool try_to_migrate_one(struct page *page, struct vm_area_struct *vma,
* See Documentation/vm/mmu_notifier.rst
*/
page_remove_rmap(subpage, vma, PageHuge(page));
+ if (vma->vm_flags & VM_LOCKED)
+ mlock_page_drain(smp_processor_id());
put_page(page);
}

--
2.34.1

2022-02-15 03:14:32

by Hugh Dickins

Subject: [PATCH v2 03/13] mm/munlock: delete munlock_vma_pages_all(), allow oomreap

munlock_vma_pages_range() will still be required when munlocking but
not munmapping a set of pages; but when unmapping a pte, the mlock count
will be maintained in much the same way as it will be maintained when
mapping in the pte. That removes the need for munlock_vma_pages_all()
on mlocked vmas when munmapping or exiting: eliminating the catastrophic
contention on i_mmap_rwsem, and the need for page lock on the pages.

There is still a need to update locked_vm accounting according to the
munmapped vmas when munmapping: do that in detach_vmas_to_be_unmapped().
exit_mmap() does not need locked_vm updates, so delete unlock_range().
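
The accounting move amounts to these two lines in
detach_vmas_to_be_unmapped(), run as each vma is erased from the rbtree
(see the mm/mmap.c hunk below):

	if (vma->vm_flags & VM_LOCKED)
		mm->locked_vm -= vma_pages(vma);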

And wasn't I the one who forbade the OOM reaper to attack mlocked vmas,
because of the uncertainty in blocking on all those page locks?
No fear of that now, so permit the OOM reaper on mlocked vmas.

Signed-off-by: Hugh Dickins <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
---
v2: same as v1.

mm/internal.h | 16 ++--------------
mm/madvise.c | 5 +++++
mm/mlock.c | 4 ++--
mm/mmap.c | 32 ++------------------------------
mm/oom_kill.c | 2 +-
5 files changed, 12 insertions(+), 47 deletions(-)

diff --git a/mm/internal.h b/mm/internal.h
index e48c486d5ddf..f235aa92e564 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -71,11 +71,6 @@ void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *start_vma,
unsigned long floor, unsigned long ceiling);
void pmd_install(struct mm_struct *mm, pmd_t *pmd, pgtable_t *pte);

-static inline bool can_madv_lru_vma(struct vm_area_struct *vma)
-{
- return !(vma->vm_flags & (VM_LOCKED|VM_HUGETLB|VM_PFNMAP));
-}
-
struct zap_details;
void unmap_page_range(struct mmu_gather *tlb,
struct vm_area_struct *vma,
@@ -398,12 +393,8 @@ extern long populate_vma_page_range(struct vm_area_struct *vma,
extern long faultin_vma_page_range(struct vm_area_struct *vma,
unsigned long start, unsigned long end,
bool write, int *locked);
-extern void munlock_vma_pages_range(struct vm_area_struct *vma,
- unsigned long start, unsigned long end);
-static inline void munlock_vma_pages_all(struct vm_area_struct *vma)
-{
- munlock_vma_pages_range(vma, vma->vm_start, vma->vm_end);
-}
+extern int mlock_future_check(struct mm_struct *mm, unsigned long flags,
+ unsigned long len);

/*
* must be called with vma's mmap_lock held for read or write, and page locked.
@@ -411,9 +402,6 @@ static inline void munlock_vma_pages_all(struct vm_area_struct *vma)
extern void mlock_vma_page(struct page *page);
extern void munlock_vma_page(struct page *page);

-extern int mlock_future_check(struct mm_struct *mm, unsigned long flags,
- unsigned long len);
-
/*
* Clear the page's PageMlocked(). This can be useful in a situation where
* we want to unconditionally remove a page from the pagecache -- e.g.,
diff --git a/mm/madvise.c b/mm/madvise.c
index 5604064df464..ae35d72627ef 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -530,6 +530,11 @@ static void madvise_cold_page_range(struct mmu_gather *tlb,
tlb_end_vma(tlb, vma);
}

+static inline bool can_madv_lru_vma(struct vm_area_struct *vma)
+{
+ return !(vma->vm_flags & (VM_LOCKED|VM_HUGETLB|VM_PFNMAP));
+}
+
static long madvise_cold(struct vm_area_struct *vma,
struct vm_area_struct **prev,
unsigned long start_addr, unsigned long end_addr)
diff --git a/mm/mlock.c b/mm/mlock.c
index aec4ce7919da..5d7ced8303be 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -137,8 +137,8 @@ void munlock_vma_page(struct page *page)
* Returns with VM_LOCKED cleared. Callers must be prepared to
* deal with this.
*/
-void munlock_vma_pages_range(struct vm_area_struct *vma,
- unsigned long start, unsigned long end)
+static void munlock_vma_pages_range(struct vm_area_struct *vma,
+ unsigned long start, unsigned long end)
{
vma->vm_flags &= VM_LOCKED_CLEAR_MASK;

diff --git a/mm/mmap.c b/mm/mmap.c
index 1e8fdb0b51ed..64b5985b5295 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2674,6 +2674,8 @@ detach_vmas_to_be_unmapped(struct mm_struct *mm, struct vm_area_struct *vma,
vma->vm_prev = NULL;
do {
vma_rb_erase(vma, &mm->mm_rb);
+ if (vma->vm_flags & VM_LOCKED)
+ mm->locked_vm -= vma_pages(vma);
mm->map_count--;
tail_vma = vma;
vma = vma->vm_next;
@@ -2778,22 +2780,6 @@ int split_vma(struct mm_struct *mm, struct vm_area_struct *vma,
return __split_vma(mm, vma, addr, new_below);
}

-static inline void
-unlock_range(struct vm_area_struct *start, unsigned long limit)
-{
- struct mm_struct *mm = start->vm_mm;
- struct vm_area_struct *tmp = start;
-
- while (tmp && tmp->vm_start < limit) {
- if (tmp->vm_flags & VM_LOCKED) {
- mm->locked_vm -= vma_pages(tmp);
- munlock_vma_pages_all(tmp);
- }
-
- tmp = tmp->vm_next;
- }
-}
-
/* Munmap is split into 2 main parts -- this part which finds
* what needs doing, and the areas themselves, which do the
* work. This now handles partial unmappings.
@@ -2874,12 +2860,6 @@ int __do_munmap(struct mm_struct *mm, unsigned long start, size_t len,
return error;
}

- /*
- * unlock any mlock()ed ranges before detaching vmas
- */
- if (mm->locked_vm)
- unlock_range(vma, end);
-
/* Detach vmas from rbtree */
if (!detach_vmas_to_be_unmapped(mm, vma, prev, end))
downgrade = false;
@@ -3147,20 +3127,12 @@ void exit_mmap(struct mm_struct *mm)
* Nothing can be holding mm->mmap_lock here and the above call
* to mmu_notifier_release(mm) ensures mmu notifier callbacks in
* __oom_reap_task_mm() will not block.
- *
- * This needs to be done before calling unlock_range(),
- * which clears VM_LOCKED, otherwise the oom reaper cannot
- * reliably test it.
*/
(void)__oom_reap_task_mm(mm);
-
set_bit(MMF_OOM_SKIP, &mm->flags);
}

mmap_write_lock(mm);
- if (mm->locked_vm)
- unlock_range(mm->mmap, ULONG_MAX);
-
arch_exit_mmap(mm);

vma = mm->mmap;
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 832fb330376e..6b875acabd1e 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -526,7 +526,7 @@ bool __oom_reap_task_mm(struct mm_struct *mm)
set_bit(MMF_UNSTABLE, &mm->flags);

for (vma = mm->mmap ; vma; vma = vma->vm_next) {
- if (!can_madv_lru_vma(vma))
+ if (vma->vm_flags & (VM_HUGETLB|VM_PFNMAP))
continue;

/*
--
2.34.1

2022-02-15 03:17:26

by Hugh Dickins

Subject: [PATCH v2 04/13] mm/munlock: rmap call mlock_vma_page() munlock_vma_page()

Add vma argument to mlock_vma_page() and munlock_vma_page(), make them
inline functions which check (vma->vm_flags & VM_LOCKED) before calling
mlock_page() and munlock_page() in mm/mlock.c.

Add bool compound to mlock_vma_page() and munlock_vma_page(): this is
because we have understandable difficulty in accounting pte maps of THPs,
and if passed a PageHead page, mlock_page() and munlock_page() cannot
tell whether it's a pmd map to be counted or a pte map to be ignored.
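
Taken together, the new wrappers in mm/internal.h (shown in full in the
diff below) reduce to this; munlock_vma_page() is the same shape, calling
munlock_page():

	static inline void mlock_vma_page(struct page *page,
			struct vm_area_struct *vma, bool compound)
	{
		if (unlikely(vma->vm_flags & VM_LOCKED) &&
		    (compound || !PageTransCompound(page)))
			mlock_page(page);
	}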

Add vma arg to page_add_file_rmap() and page_remove_rmap(), like the
others, and use that to call mlock_vma_page() at the end of the page
adds, and munlock_vma_page() at the end of page_remove_rmap() (end or
beginning? unimportant, but end was easier for assertions in testing).

No page lock is required (although almost all adds happen to hold it):
delete the "Serialize with page migration" BUG_ON(!PageLocked(page))s.
Certainly page lock did serialize with page migration, but I'm having
difficulty explaining why that was ever important.

Mlock accounting on THPs has been hard to define, differed between anon
and file, involved PageDoubleMap in some places and not others, required
clear_page_mlock() at some points. Keep it simple now: just count the
pmds and ignore the ptes, there is no reason for ptes to undo pmd mlocks.

page_add_new_anon_rmap() callers are unchanged: they have long been calling
lru_cache_add_inactive_or_unevictable(), which does its own VM_LOCKED
handling (it also checks for not VM_SPECIAL: I think that's overcautious,
and inconsistent with other checks, given that mmap_region() already
prevents VM_LOCKED on VM_SPECIAL; but I haven't quite convinced myself
to change it).

Signed-off-by: Hugh Dickins <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
---
v2: Fix kerneldoc on mlock_page() to use colon, per Yang Li.
mmotm will require minor fixup in mm/huge_memory.c like v1.

include/linux/rmap.h | 17 +++++++------
kernel/events/uprobes.c | 7 ++----
mm/huge_memory.c | 17 ++++++-------
mm/hugetlb.c | 4 +--
mm/internal.h | 36 ++++++++++++++++++++++----
mm/khugepaged.c | 4 +--
mm/ksm.c | 12 +--------
mm/memory.c | 45 +++++++++++----------------------
mm/migrate.c | 9 ++-----
mm/mlock.c | 21 ++++++----------
mm/rmap.c | 56 +++++++++++++++++++----------------------
mm/userfaultfd.c | 14 ++++++-----
12 files changed, 113 insertions(+), 129 deletions(-)

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index dc48aa8c2c94..ac29b076082b 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -167,18 +167,19 @@ struct anon_vma *page_get_anon_vma(struct page *page);
*/
void page_move_anon_rmap(struct page *, struct vm_area_struct *);
void page_add_anon_rmap(struct page *, struct vm_area_struct *,
- unsigned long, bool);
+ unsigned long address, bool compound);
void do_page_add_anon_rmap(struct page *, struct vm_area_struct *,
- unsigned long, int);
+ unsigned long address, int flags);
void page_add_new_anon_rmap(struct page *, struct vm_area_struct *,
- unsigned long, bool);
-void page_add_file_rmap(struct page *, bool);
-void page_remove_rmap(struct page *, bool);
-
+ unsigned long address, bool compound);
+void page_add_file_rmap(struct page *, struct vm_area_struct *,
+ bool compound);
+void page_remove_rmap(struct page *, struct vm_area_struct *,
+ bool compound);
void hugepage_add_anon_rmap(struct page *, struct vm_area_struct *,
- unsigned long);
+ unsigned long address);
void hugepage_add_new_anon_rmap(struct page *, struct vm_area_struct *,
- unsigned long);
+ unsigned long address);

static inline void page_dup_rmap(struct page *page, bool compound)
{
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 6357c3580d07..eed2f7437d96 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -173,7 +173,7 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr,
return err;
}

- /* For try_to_free_swap() and munlock_vma_page() below */
+ /* For try_to_free_swap() below */
lock_page(old_page);

mmu_notifier_invalidate_range_start(&range);
@@ -201,13 +201,10 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr,
set_pte_at_notify(mm, addr, pvmw.pte,
mk_pte(new_page, vma->vm_page_prot));

- page_remove_rmap(old_page, false);
+ page_remove_rmap(old_page, vma, false);
if (!page_mapped(old_page))
try_to_free_swap(old_page);
page_vma_mapped_walk_done(&pvmw);
-
- if ((vma->vm_flags & VM_LOCKED) && !PageCompound(old_page))
- munlock_vma_page(old_page);
put_page(old_page);

err = 0;
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 9a34b85ebcf8..d6477f48a27e 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1577,7 +1577,7 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,

if (pmd_present(orig_pmd)) {
page = pmd_page(orig_pmd);
- page_remove_rmap(page, true);
+ page_remove_rmap(page, vma, true);
VM_BUG_ON_PAGE(page_mapcount(page) < 0, page);
VM_BUG_ON_PAGE(!PageHead(page), page);
} else if (thp_migration_supported()) {
@@ -1962,7 +1962,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
set_page_dirty(page);
if (!PageReferenced(page) && pmd_young(old_pmd))
SetPageReferenced(page);
- page_remove_rmap(page, true);
+ page_remove_rmap(page, vma, true);
put_page(page);
}
add_mm_counter(mm, mm_counter_file(page), -HPAGE_PMD_NR);
@@ -2096,6 +2096,9 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
}
}
unlock_page_memcg(page);
+
+ /* Above is effectively page_remove_rmap(page, vma, true) */
+ munlock_vma_page(page, vma, true);
}

smp_wmb(); /* make pte visible before pmd */
@@ -2103,7 +2106,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,

if (freeze) {
for (i = 0; i < HPAGE_PMD_NR; i++) {
- page_remove_rmap(page + i, false);
+ page_remove_rmap(page + i, vma, false);
put_page(page + i);
}
}
@@ -2163,8 +2166,6 @@ void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
do_unlock_page = true;
}
}
- if (PageMlocked(page))
- clear_page_mlock(page);
} else if (!(pmd_devmap(*pmd) || is_pmd_migration_entry(*pmd)))
goto out;
__split_huge_pmd_locked(vma, pmd, range.start, freeze);
@@ -3138,7 +3139,7 @@ void set_pmd_migration_entry(struct page_vma_mapped_walk *pvmw,
if (pmd_soft_dirty(pmdval))
pmdswp = pmd_swp_mksoft_dirty(pmdswp);
set_pmd_at(mm, address, pvmw->pmd, pmdswp);
- page_remove_rmap(page, true);
+ page_remove_rmap(page, vma, true);
put_page(page);
}

@@ -3168,10 +3169,8 @@ void remove_migration_pmd(struct page_vma_mapped_walk *pvmw, struct page *new)
if (PageAnon(new))
page_add_anon_rmap(new, vma, mmun_start, true);
else
- page_add_file_rmap(new, true);
+ page_add_file_rmap(new, vma, true);
set_pmd_at(mm, mmun_start, pvmw->pmd, pmde);
- if ((vma->vm_flags & VM_LOCKED) && !PageDoubleMap(new))
- mlock_vma_page(new);
update_mmu_cache_pmd(vma, address, pvmw->pmd);
}
#endif
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 61895cc01d09..43fb3155298e 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -5014,7 +5014,7 @@ static void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct
set_page_dirty(page);

hugetlb_count_sub(pages_per_huge_page(h), mm);
- page_remove_rmap(page, true);
+ page_remove_rmap(page, vma, true);

spin_unlock(ptl);
tlb_remove_page_size(tlb, page, huge_page_size(h));
@@ -5259,7 +5259,7 @@ static vm_fault_t hugetlb_cow(struct mm_struct *mm, struct vm_area_struct *vma,
/* Break COW */
huge_ptep_clear_flush(vma, haddr, ptep);
mmu_notifier_invalidate_range(mm, range.start, range.end);
- page_remove_rmap(old_page, true);
+ page_remove_rmap(old_page, vma, true);
hugepage_add_new_anon_rmap(new_page, vma, haddr);
set_huge_pte_at(mm, haddr, ptep,
make_huge_pte(vma, new_page, 1));
diff --git a/mm/internal.h b/mm/internal.h
index f235aa92e564..3d7dfc8bc471 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -395,12 +395,35 @@ extern long faultin_vma_page_range(struct vm_area_struct *vma,
bool write, int *locked);
extern int mlock_future_check(struct mm_struct *mm, unsigned long flags,
unsigned long len);
-
/*
- * must be called with vma's mmap_lock held for read or write, and page locked.
+ * mlock_vma_page() and munlock_vma_page():
+ * should be called with vma's mmap_lock held for read or write,
+ * under page table lock for the pte/pmd being added or removed.
+ *
+ * mlock is usually called at the end of page_add_*_rmap(),
+ * munlock at the end of page_remove_rmap(); but new anon
+ * pages are managed in lru_cache_add_inactive_or_unevictable().
+ *
+ * @compound is used to include pmd mappings of THPs, but filter out
+ * pte mappings of THPs, which cannot be consistently counted: a pte
+ * mapping of the THP head cannot be distinguished by the page alone.
*/
-extern void mlock_vma_page(struct page *page);
-extern void munlock_vma_page(struct page *page);
+void mlock_page(struct page *page);
+static inline void mlock_vma_page(struct page *page,
+ struct vm_area_struct *vma, bool compound)
+{
+ if (unlikely(vma->vm_flags & VM_LOCKED) &&
+ (compound || !PageTransCompound(page)))
+ mlock_page(page);
+}
+void munlock_page(struct page *page);
+static inline void munlock_vma_page(struct page *page,
+ struct vm_area_struct *vma, bool compound)
+{
+ if (unlikely(vma->vm_flags & VM_LOCKED) &&
+ (compound || !PageTransCompound(page)))
+ munlock_page(page);
+}

/*
* Clear the page's PageMlocked(). This can be useful in a situation where
@@ -487,7 +510,10 @@ static inline struct file *maybe_unlock_mmap_for_io(struct vm_fault *vmf,
#else /* !CONFIG_MMU */
static inline void unmap_mapping_folio(struct folio *folio) { }
static inline void clear_page_mlock(struct page *page) { }
-static inline void mlock_vma_page(struct page *page) { }
+static inline void mlock_vma_page(struct page *page,
+ struct vm_area_struct *vma, bool compound) { }
+static inline void munlock_vma_page(struct page *page,
+ struct vm_area_struct *vma, bool compound) { }
static inline void vunmap_range_noflush(unsigned long start, unsigned long end)
{
}
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 35f14d0a00a6..d5e387c58bde 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -773,7 +773,7 @@ static void __collapse_huge_page_copy(pte_t *pte, struct page *page,
*/
spin_lock(ptl);
ptep_clear(vma->vm_mm, address, _pte);
- page_remove_rmap(src_page, false);
+ page_remove_rmap(src_page, vma, false);
spin_unlock(ptl);
free_page_and_swap_cache(src_page);
}
@@ -1497,7 +1497,7 @@ void collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr)
if (pte_none(*pte))
continue;
page = vm_normal_page(vma, addr, *pte);
- page_remove_rmap(page, false);
+ page_remove_rmap(page, vma, false);
}

pte_unmap_unlock(start_pte, ptl);
diff --git a/mm/ksm.c b/mm/ksm.c
index c20bd4d9a0d9..c5a4403b5dc9 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -1177,7 +1177,7 @@ static int replace_page(struct vm_area_struct *vma, struct page *page,
ptep_clear_flush(vma, addr, ptep);
set_pte_at_notify(mm, addr, ptep, newpte);

- page_remove_rmap(page, false);
+ page_remove_rmap(page, vma, false);
if (!page_mapped(page))
try_to_free_swap(page);
put_page(page);
@@ -1252,16 +1252,6 @@ static int try_to_merge_one_page(struct vm_area_struct *vma,
err = replace_page(vma, page, kpage, orig_pte);
}

- if ((vma->vm_flags & VM_LOCKED) && kpage && !err) {
- munlock_vma_page(page);
- if (!PageMlocked(kpage)) {
- unlock_page(page);
- lock_page(kpage);
- mlock_vma_page(kpage);
- page = kpage; /* for final unlock */
- }
- }
-
out_unlock:
unlock_page(page);
out:
diff --git a/mm/memory.c b/mm/memory.c
index c125c4969913..53bd9e5f2e33 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -735,9 +735,6 @@ static void restore_exclusive_pte(struct vm_area_struct *vma,

set_pte_at(vma->vm_mm, address, ptep, pte);

- if (vma->vm_flags & VM_LOCKED)
- mlock_vma_page(page);
-
/*
* No need to invalidate - it was non-present before. However
* secondary CPUs may have mappings that need invalidating.
@@ -1377,7 +1374,7 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
mark_page_accessed(page);
}
rss[mm_counter(page)]--;
- page_remove_rmap(page, false);
+ page_remove_rmap(page, vma, false);
if (unlikely(page_mapcount(page) < 0))
print_bad_pte(vma, addr, ptent, page);
if (unlikely(__tlb_remove_page(tlb, page))) {
@@ -1397,10 +1394,8 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
continue;
pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
rss[mm_counter(page)]--;
-
if (is_device_private_entry(entry))
- page_remove_rmap(page, false);
-
+ page_remove_rmap(page, vma, false);
put_page(page);
continue;
}
@@ -1753,16 +1748,16 @@ static int validate_page_before_insert(struct page *page)
return 0;
}

-static int insert_page_into_pte_locked(struct mm_struct *mm, pte_t *pte,
+static int insert_page_into_pte_locked(struct vm_area_struct *vma, pte_t *pte,
unsigned long addr, struct page *page, pgprot_t prot)
{
if (!pte_none(*pte))
return -EBUSY;
/* Ok, finally just insert the thing.. */
get_page(page);
- inc_mm_counter_fast(mm, mm_counter_file(page));
- page_add_file_rmap(page, false);
- set_pte_at(mm, addr, pte, mk_pte(page, prot));
+ inc_mm_counter_fast(vma->vm_mm, mm_counter_file(page));
+ page_add_file_rmap(page, vma, false);
+ set_pte_at(vma->vm_mm, addr, pte, mk_pte(page, prot));
return 0;
}

@@ -1776,7 +1771,6 @@ static int insert_page_into_pte_locked(struct mm_struct *mm, pte_t *pte,
static int insert_page(struct vm_area_struct *vma, unsigned long addr,
struct page *page, pgprot_t prot)
{
- struct mm_struct *mm = vma->vm_mm;
int retval;
pte_t *pte;
spinlock_t *ptl;
@@ -1785,17 +1779,17 @@ static int insert_page(struct vm_area_struct *vma, unsigned long addr,
if (retval)
goto out;
retval = -ENOMEM;
- pte = get_locked_pte(mm, addr, &ptl);
+ pte = get_locked_pte(vma->vm_mm, addr, &ptl);
if (!pte)
goto out;
- retval = insert_page_into_pte_locked(mm, pte, addr, page, prot);
+ retval = insert_page_into_pte_locked(vma, pte, addr, page, prot);
pte_unmap_unlock(pte, ptl);
out:
return retval;
}

#ifdef pte_index
-static int insert_page_in_batch_locked(struct mm_struct *mm, pte_t *pte,
+static int insert_page_in_batch_locked(struct vm_area_struct *vma, pte_t *pte,
unsigned long addr, struct page *page, pgprot_t prot)
{
int err;
@@ -1805,7 +1799,7 @@ static int insert_page_in_batch_locked(struct mm_struct *mm, pte_t *pte,
err = validate_page_before_insert(page);
if (err)
return err;
- return insert_page_into_pte_locked(mm, pte, addr, page, prot);
+ return insert_page_into_pte_locked(vma, pte, addr, page, prot);
}

/* insert_pages() amortizes the cost of spinlock operations
@@ -1842,7 +1836,7 @@ static int insert_pages(struct vm_area_struct *vma, unsigned long addr,

start_pte = pte_offset_map_lock(mm, pmd, addr, &pte_lock);
for (pte = start_pte; pte_idx < batch_size; ++pte, ++pte_idx) {
- int err = insert_page_in_batch_locked(mm, pte,
+ int err = insert_page_in_batch_locked(vma, pte,
addr, pages[curr_page_idx], prot);
if (unlikely(err)) {
pte_unmap_unlock(start_pte, pte_lock);
@@ -3098,7 +3092,7 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
* mapcount is visible. So transitively, TLBs to
* old page will be flushed before it can be reused.
*/
- page_remove_rmap(old_page, false);
+ page_remove_rmap(old_page, vma, false);
}

/* Free the old page.. */
@@ -3118,16 +3112,6 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
*/
mmu_notifier_invalidate_range_only_end(&range);
if (old_page) {
- /*
- * Don't let another task, with possibly unlocked vma,
- * keep the mlocked page.
- */
- if (page_copied && (vma->vm_flags & VM_LOCKED)) {
- lock_page(old_page); /* LRU manipulation */
- if (PageMlocked(old_page))
- munlock_vma_page(old_page);
- unlock_page(old_page);
- }
if (page_copied)
free_swap_cache(old_page);
put_page(old_page);
@@ -3947,7 +3931,8 @@ vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page)
entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);

add_mm_counter(vma->vm_mm, mm_counter_file(page), HPAGE_PMD_NR);
- page_add_file_rmap(page, true);
+ page_add_file_rmap(page, vma, true);
+
/*
* deposit and withdraw with pmd lock held
*/
@@ -3996,7 +3981,7 @@ void do_set_pte(struct vm_fault *vmf, struct page *page, unsigned long addr)
lru_cache_add_inactive_or_unevictable(page, vma);
} else {
inc_mm_counter_fast(vma->vm_mm, mm_counter_file(page));
- page_add_file_rmap(page, false);
+ page_add_file_rmap(page, vma, false);
}
set_pte_at(vma->vm_mm, addr, vmf->pte, entry);
}
diff --git a/mm/migrate.c b/mm/migrate.c
index c7da064b4781..7c4223ce2500 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -248,14 +248,9 @@ static bool remove_migration_pte(struct page *page, struct vm_area_struct *vma,
if (PageAnon(new))
page_add_anon_rmap(new, vma, pvmw.address, false);
else
- page_add_file_rmap(new, false);
+ page_add_file_rmap(new, vma, false);
set_pte_at(vma->vm_mm, pvmw.address, pvmw.pte, pte);
}
- if (vma->vm_flags & VM_LOCKED && !PageTransCompound(new))
- mlock_vma_page(new);
-
- if (PageTransHuge(page) && PageMlocked(page))
- clear_page_mlock(page);

/* No need to invalidate - it was non-present before */
update_mmu_cache(vma, pvmw.address, pvmw.pte);
@@ -2331,7 +2326,7 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
* drop page refcount. Page won't be freed, as we took
* a reference just above.
*/
- page_remove_rmap(page, false);
+ page_remove_rmap(page, vma, false);
put_page(page);

if (pte_present(pte))
diff --git a/mm/mlock.c b/mm/mlock.c
index 5d7ced8303be..92f28258b4ae 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -78,17 +78,13 @@ void clear_page_mlock(struct page *page)
}
}

-/*
- * Mark page as mlocked if not already.
- * If page on LRU, isolate and putback to move to unevictable list.
+/**
+ * mlock_page - mlock a page
+ * @page: page to be mlocked, either a normal page or a THP head.
*/
-void mlock_vma_page(struct page *page)
+void mlock_page(struct page *page)
{
- /* Serialize with page migration */
- BUG_ON(!PageLocked(page));
-
VM_BUG_ON_PAGE(PageTail(page), page);
- VM_BUG_ON_PAGE(PageCompound(page) && PageDoubleMap(page), page);

if (!TestSetPageMlocked(page)) {
int nr_pages = thp_nr_pages(page);
@@ -101,14 +97,11 @@ void mlock_vma_page(struct page *page)
}

/**
- * munlock_vma_page - munlock a vma page
- * @page: page to be unlocked, either a normal page or THP page head
+ * munlock_page - munlock a page
+ * @page: page to be munlocked, either a normal page or a THP head.
*/
-void munlock_vma_page(struct page *page)
+void munlock_page(struct page *page)
{
- /* Serialize with page migration */
- BUG_ON(!PageLocked(page));
-
VM_BUG_ON_PAGE(PageTail(page), page);

if (TestClearPageMlocked(page)) {
diff --git a/mm/rmap.c b/mm/rmap.c
index 7ce7f1946cff..6cc8bf129f18 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1181,17 +1181,17 @@ void do_page_add_anon_rmap(struct page *page,
__mod_lruvec_page_state(page, NR_ANON_MAPPED, nr);
}

- if (unlikely(PageKsm(page))) {
+ if (unlikely(PageKsm(page)))
unlock_page_memcg(page);
- return;
- }

/* address might be in next vma when migration races vma_adjust */
- if (first)
+ else if (first)
__page_set_anon_rmap(page, vma, address,
flags & RMAP_EXCLUSIVE);
else
__page_check_anon_rmap(page, vma, address);
+
+ mlock_vma_page(page, vma, compound);
}

/**
@@ -1232,12 +1232,14 @@ void page_add_new_anon_rmap(struct page *page,

/**
* page_add_file_rmap - add pte mapping to a file page
- * @page: the page to add the mapping to
- * @compound: charge the page as compound or small page
+ * @page: the page to add the mapping to
+ * @vma: the vm area in which the mapping is added
+ * @compound: charge the page as compound or small page
*
* The caller needs to hold the pte lock.
*/
-void page_add_file_rmap(struct page *page, bool compound)
+void page_add_file_rmap(struct page *page,
+ struct vm_area_struct *vma, bool compound)
{
int i, nr = 1;

@@ -1260,13 +1262,8 @@ void page_add_file_rmap(struct page *page, bool compound)
nr_pages);
} else {
if (PageTransCompound(page) && page_mapping(page)) {
- struct page *head = compound_head(page);
-
VM_WARN_ON_ONCE(!PageLocked(page));
-
- SetPageDoubleMap(head);
- if (PageMlocked(page))
- clear_page_mlock(head);
+ SetPageDoubleMap(compound_head(page));
}
if (!atomic_inc_and_test(&page->_mapcount))
goto out;
@@ -1274,6 +1271,8 @@ void page_add_file_rmap(struct page *page, bool compound)
__mod_lruvec_page_state(page, NR_FILE_MAPPED, nr);
out:
unlock_page_memcg(page);
+
+ mlock_vma_page(page, vma, compound);
}

static void page_remove_file_rmap(struct page *page, bool compound)
@@ -1368,11 +1367,13 @@ static void page_remove_anon_compound_rmap(struct page *page)
/**
* page_remove_rmap - take down pte mapping from a page
* @page: page to remove mapping from
+ * @vma: the vm area from which the mapping is removed
* @compound: uncharge the page as compound or small page
*
* The caller needs to hold the pte lock.
*/
-void page_remove_rmap(struct page *page, bool compound)
+void page_remove_rmap(struct page *page,
+ struct vm_area_struct *vma, bool compound)
{
lock_page_memcg(page);

@@ -1414,6 +1415,8 @@ void page_remove_rmap(struct page *page, bool compound)
*/
out:
unlock_page_memcg(page);
+
+ munlock_vma_page(page, vma, compound);
}

/*
@@ -1469,28 +1472,21 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
mmu_notifier_invalidate_range_start(&range);

while (page_vma_mapped_walk(&pvmw)) {
+ /* Unexpected PMD-mapped THP? */
+ VM_BUG_ON_PAGE(!pvmw.pte, page);
+
/*
- * If the page is mlock()d, we cannot swap it out.
+ * If the page is in an mlock()d vma, we must not swap it out.
*/
if (!(flags & TTU_IGNORE_MLOCK) &&
(vma->vm_flags & VM_LOCKED)) {
- /*
- * PTE-mapped THP are never marked as mlocked: so do
- * not set it on a DoubleMap THP, nor on an Anon THP
- * (which may still be PTE-mapped after DoubleMap was
- * cleared). But stop unmapping even in those cases.
- */
- if (!PageTransCompound(page) || (PageHead(page) &&
- !PageDoubleMap(page) && !PageAnon(page)))
- mlock_vma_page(page);
+ /* Restore the mlock which got missed */
+ mlock_vma_page(page, vma, false);
page_vma_mapped_walk_done(&pvmw);
ret = false;
break;
}

- /* Unexpected PMD-mapped THP? */
- VM_BUG_ON_PAGE(!pvmw.pte, page);
-
subpage = page - page_to_pfn(page) + pte_pfn(*pvmw.pte);
address = pvmw.address;

@@ -1668,7 +1664,7 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
*
* See Documentation/vm/mmu_notifier.rst
*/
- page_remove_rmap(subpage, PageHuge(page));
+ page_remove_rmap(subpage, vma, PageHuge(page));
put_page(page);
}

@@ -1942,7 +1938,7 @@ static bool try_to_migrate_one(struct page *page, struct vm_area_struct *vma,
*
* See Documentation/vm/mmu_notifier.rst
*/
- page_remove_rmap(subpage, PageHuge(page));
+ page_remove_rmap(subpage, vma, PageHuge(page));
put_page(page);
}

@@ -2078,7 +2074,7 @@ static bool page_make_device_exclusive_one(struct page *page,
* There is a reference on the page for the swap entry which has
* been removed, so shouldn't take another.
*/
- page_remove_rmap(subpage, false);
+ page_remove_rmap(subpage, vma, false);
}

mmu_notifier_invalidate_range_end(&range);
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 0780c2a57ff1..15d3e97a6e04 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -95,10 +95,15 @@ int mfill_atomic_install_pte(struct mm_struct *dst_mm, pmd_t *dst_pmd,
if (!pte_none(*dst_pte))
goto out_unlock;

- if (page_in_cache)
- page_add_file_rmap(page, false);
- else
+ if (page_in_cache) {
+ /* Usually, cache pages are already added to LRU */
+ if (newly_allocated)
+ lru_cache_add(page);
+ page_add_file_rmap(page, dst_vma, false);
+ } else {
page_add_new_anon_rmap(page, dst_vma, dst_addr, false);
+ lru_cache_add_inactive_or_unevictable(page, dst_vma);
+ }

/*
* Must happen after rmap, as mm_counter() checks mapping (via
@@ -106,9 +111,6 @@ int mfill_atomic_install_pte(struct mm_struct *dst_mm, pmd_t *dst_pmd,
*/
inc_mm_counter(dst_mm, mm_counter(page));

- if (newly_allocated)
- lru_cache_add_inactive_or_unevictable(page, dst_vma);
-
set_pte_at(dst_mm, dst_addr, dst_pte, _dst_pte);

/* No need to invalidate - it was non-present before */
--
2.34.1

2022-02-15 04:32:10

by Hugh Dickins

Subject: [PATCH v2 07/13] mm/munlock: mlock_pte_range() when mlocking or munlocking

Fill in missing pieces: reimplementation of munlock_vma_pages_range(),
required to lower the mlock_counts when munlocking without munmapping;
and its complement, implementation of mlock_vma_pages_range(), required
to raise the mlock_counts on pages already there when a range is mlocked.

Combine them into just the one function mlock_vma_pages_range(), using
walk_page_range() to run mlock_pte_range(). This approach fixes the
"Very slow unlockall()" of unpopulated PROT_NONE areas, reported in
https://lore.kernel.org/linux-mm/[email protected]/

Munlock clears VM_LOCKED at the start, under exclusive mmap_lock; but if
a racing truncate or holepunch (depending on i_mmap_rwsem) gets to the
pte first, it will not try to munlock the page: leaving release_pages()
to correct it when the last reference to the page is gone - that's okay,
a page is not evictable anyway while it is held by an extra reference.

Mlock sets VM_LOCKED at the start, under exclusive mmap_lock; but if
a racing remove_migration_pte() or try_to_unmap_one() (depending on
i_mmap_rwsem) gets to the pte first, it will try to mlock the page, and
then mlock_pte_range() will mlock it a second time. This is harder to
reproduce, but it is a more serious race, because it could leave the page
unevictable indefinitely even though the area is munlocked afterwards.
Guard against it by setting the (inappropriate) VM_IO flag,
and modifying mlock_vma_page() to decline such vmas.
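
In outline, the guard in mlock_vma_pages_range() (full context, including
mlock_walk_ops, in the mm/mlock.c hunk below) is:

	if (newflags & VM_LOCKED)
		newflags |= VM_IO;
	WRITE_ONCE(vma->vm_flags, newflags);

	lru_add_drain();
	walk_page_range(vma->vm_mm, start, end, &mlock_walk_ops, NULL);
	lru_add_drain();

	if (newflags & VM_IO) {
		newflags &= ~VM_IO;
		WRITE_ONCE(vma->vm_flags, newflags);
	}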

Signed-off-by: Hugh Dickins <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
---
v2: Delete reintroduced VM_LOCKED_CLEAR_MASKing here, per Vlastimil.
Use oldflags instead of vma->vm_flags in mlock_fixup(), per Vlastimil.
Make clearer comment on mmap_lock and WRITE_ONCE, per Hillf.

mm/internal.h | 3 +-
mm/mlock.c | 111 ++++++++++++++++++++++++++++++++++++++++----------
2 files changed, 91 insertions(+), 23 deletions(-)

diff --git a/mm/internal.h b/mm/internal.h
index a43d79335c16..b3f0dd3ffba2 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -412,7 +412,8 @@ void mlock_page(struct page *page);
static inline void mlock_vma_page(struct page *page,
struct vm_area_struct *vma, bool compound)
{
- if (unlikely(vma->vm_flags & VM_LOCKED) &&
+ /* VM_IO check prevents migration from double-counting during mlock */
+ if (unlikely((vma->vm_flags & (VM_LOCKED|VM_IO)) == VM_LOCKED) &&
(compound || !PageTransCompound(page)))
mlock_page(page);
}
diff --git a/mm/mlock.c b/mm/mlock.c
index f8a3a54687dd..581ea8bf1b83 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -14,6 +14,7 @@
#include <linux/swapops.h>
#include <linux/pagemap.h>
#include <linux/pagevec.h>
+#include <linux/pagewalk.h>
#include <linux/mempolicy.h>
#include <linux/syscalls.h>
#include <linux/sched.h>
@@ -127,25 +128,91 @@ void munlock_page(struct page *page)
unlock_page_memcg(page);
}

+static int mlock_pte_range(pmd_t *pmd, unsigned long addr,
+ unsigned long end, struct mm_walk *walk)
+
+{
+ struct vm_area_struct *vma = walk->vma;
+ spinlock_t *ptl;
+ pte_t *start_pte, *pte;
+ struct page *page;
+
+ ptl = pmd_trans_huge_lock(pmd, vma);
+ if (ptl) {
+ if (!pmd_present(*pmd))
+ goto out;
+ if (is_huge_zero_pmd(*pmd))
+ goto out;
+ page = pmd_page(*pmd);
+ if (vma->vm_flags & VM_LOCKED)
+ mlock_page(page);
+ else
+ munlock_page(page);
+ goto out;
+ }
+
+ start_pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
+ for (pte = start_pte; addr != end; pte++, addr += PAGE_SIZE) {
+ if (!pte_present(*pte))
+ continue;
+ page = vm_normal_page(vma, addr, *pte);
+ if (!page)
+ continue;
+ if (PageTransCompound(page))
+ continue;
+ if (vma->vm_flags & VM_LOCKED)
+ mlock_page(page);
+ else
+ munlock_page(page);
+ }
+ pte_unmap(start_pte);
+out:
+ spin_unlock(ptl);
+ cond_resched();
+ return 0;
+}
+
/*
- * munlock_vma_pages_range() - munlock all pages in the vma range.'
- * @vma - vma containing range to be munlock()ed.
+ * mlock_vma_pages_range() - mlock any pages already in the range,
+ * or munlock all pages in the range.
+ * @vma - vma containing range to be mlock()ed or munlock()ed
* @start - start address in @vma of the range
- * @end - end of range in @vma.
- *
- * For mremap(), munmap() and exit().
+ * @end - end of range in @vma
+ * @newflags - the new set of flags for @vma.
*
- * Called with @vma VM_LOCKED.
- *
- * Returns with VM_LOCKED cleared. Callers must be prepared to
- * deal with this.
+ * Called for mlock(), mlock2() and mlockall(), to set @vma VM_LOCKED;
+ * called for munlock() and munlockall(), to clear VM_LOCKED from @vma.
*/
-static void munlock_vma_pages_range(struct vm_area_struct *vma,
- unsigned long start, unsigned long end)
+static void mlock_vma_pages_range(struct vm_area_struct *vma,
+ unsigned long start, unsigned long end, vm_flags_t newflags)
{
- vma->vm_flags &= VM_LOCKED_CLEAR_MASK;
+ static const struct mm_walk_ops mlock_walk_ops = {
+ .pmd_entry = mlock_pte_range,
+ };

- /* Reimplementation to follow in later commit */
+ /*
+ * There is a slight chance that concurrent page migration,
+ * or page reclaim finding a page of this now-VM_LOCKED vma,
+ * will call mlock_vma_page() and raise page's mlock_count:
+ * double counting, leaving the page unevictable indefinitely.
+ * Communicate this danger to mlock_vma_page() with VM_IO,
+ * which is a VM_SPECIAL flag not allowed on VM_LOCKED vmas.
+ * mmap_lock is held in write mode here, so this weird
+ * combination should not be visible to other mmap_lock users;
+ * but WRITE_ONCE so rmap walkers must see VM_IO if VM_LOCKED.
+ */
+ if (newflags & VM_LOCKED)
+ newflags |= VM_IO;
+ WRITE_ONCE(vma->vm_flags, newflags);
+
+ lru_add_drain();
+ walk_page_range(vma->vm_mm, start, end, &mlock_walk_ops, NULL);
+ lru_add_drain();
+
+ if (newflags & VM_IO) {
+ newflags &= ~VM_IO;
+ WRITE_ONCE(vma->vm_flags, newflags);
+ }
}

/*
@@ -164,10 +231,9 @@ static int mlock_fixup(struct vm_area_struct *vma, struct vm_area_struct **prev,
pgoff_t pgoff;
int nr_pages;
int ret = 0;
- int lock = !!(newflags & VM_LOCKED);
- vm_flags_t old_flags = vma->vm_flags;
+ vm_flags_t oldflags = vma->vm_flags;

- if (newflags == vma->vm_flags || (vma->vm_flags & VM_SPECIAL) ||
+ if (newflags == oldflags || (oldflags & VM_SPECIAL) ||
is_vm_hugetlb_page(vma) || vma == get_gate_vma(current->mm) ||
vma_is_dax(vma) || vma_is_secretmem(vma))
/* don't set VM_LOCKED or VM_LOCKONFAULT and don't count */
@@ -199,9 +265,9 @@ static int mlock_fixup(struct vm_area_struct *vma, struct vm_area_struct **prev,
* Keep track of amount of locked VM.
*/
nr_pages = (end - start) >> PAGE_SHIFT;
- if (!lock)
+ if (!(newflags & VM_LOCKED))
nr_pages = -nr_pages;
- else if (old_flags & VM_LOCKED)
+ else if (oldflags & VM_LOCKED)
nr_pages = 0;
mm->locked_vm += nr_pages;

@@ -211,11 +277,12 @@ static int mlock_fixup(struct vm_area_struct *vma, struct vm_area_struct **prev,
* set VM_LOCKED, populate_vma_page_range will bring it back.
*/

- if (lock)
+ if ((newflags & VM_LOCKED) && (oldflags & VM_LOCKED)) {
+ /* No work to do, and mlocking twice would be wrong */
vma->vm_flags = newflags;
- else
- munlock_vma_pages_range(vma, start, end);
-
+ } else {
+ mlock_vma_pages_range(vma, start, end, newflags);
+ }
out:
*prev = vma;
return ret;
--
2.34.1

2022-02-15 05:21:18

by Hugh Dickins

Subject: [PATCH v2 08/13] mm/migrate: __unmap_and_move() push good newpage to LRU

Compaction, NUMA page movement, THP collapse/split, and memory failure
do isolate unevictable pages from their "LRU", losing the record of
mlock_count in doing so (isolators are likely to use page->lru for their
own private lists, so mlock_count has to be presumed lost).

That's unfortunate, and we should put in some work to correct that: one
can imagine a function to build up the mlock_count again - but it would
require i_mmap_rwsem for read, so be careful where it's called. Or
page_referenced_one() and try_to_unmap_one() might do that extra work.

But one place that can very easily be improved is page migration's
__unmap_and_move(): a small adjustment to where the successful new page
is put back on LRU lets its mlock_count (if any) be built back up by
remove_migration_ptes().
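
Concretely (see the diff below), the successful newpage goes onto LRU
before the migration ptes are re-established, so remove_migration_ptes()
can rebuild its mlock_count:

	if (rc == MIGRATEPAGE_SUCCESS) {
		lru_cache_add(newpage);
		if (page_was_mapped)
			lru_add_drain();
	}

	if (page_was_mapped)
		remove_migration_ptes(page,
			rc == MIGRATEPAGE_SUCCESS ? newpage : page, false);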

Signed-off-by: Hugh Dickins <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
---
v2: same as v1.

mm/migrate.c | 31 +++++++++++++++++++------------
1 file changed, 19 insertions(+), 12 deletions(-)

diff --git a/mm/migrate.c b/mm/migrate.c
index 7c4223ce2500..f4bcf1541b62 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1032,6 +1032,21 @@ static int __unmap_and_move(struct page *page, struct page *newpage,
if (!page_mapped(page))
rc = move_to_new_page(newpage, page, mode);

+ /*
+ * When successful, push newpage to LRU immediately: so that if it
+ * turns out to be an mlocked page, remove_migration_ptes() will
+ * automatically build up the correct newpage->mlock_count for it.
+ *
+ * We would like to do something similar for the old page, when
+ * unsuccessful, and other cases when a page has been temporarily
+ * isolated from the unevictable LRU: but this case is the easiest.
+ */
+ if (rc == MIGRATEPAGE_SUCCESS) {
+ lru_cache_add(newpage);
+ if (page_was_mapped)
+ lru_add_drain();
+ }
+
if (page_was_mapped)
remove_migration_ptes(page,
rc == MIGRATEPAGE_SUCCESS ? newpage : page, false);
@@ -1045,20 +1060,12 @@ static int __unmap_and_move(struct page *page, struct page *newpage,
unlock_page(page);
out:
/*
- * If migration is successful, decrease refcount of the newpage
+ * If migration is successful, decrease refcount of the newpage,
* which will not free the page because new page owner increased
- * refcounter. As well, if it is LRU page, add the page to LRU
- * list in here. Use the old state of the isolated source page to
- * determine if we migrated a LRU page. newpage was already unlocked
- * and possibly modified by its owner - don't rely on the page
- * state.
+ * refcounter.
*/
- if (rc == MIGRATEPAGE_SUCCESS) {
- if (unlikely(!is_lru))
- put_page(newpage);
- else
- putback_lru_page(newpage);
- }
+ if (rc == MIGRATEPAGE_SUCCESS)
+ put_page(newpage);

return rc;
}
--
2.34.1

2022-02-15 05:23:42

by Hugh Dickins

Subject: [PATCH v2 05/13] mm/munlock: replace clear_page_mlock() by final clearance

Placing munlock_vma_page() at the end of page_remove_rmap() shifts most
of the munlocking to clear_page_mlock(), since PageMlocked is typically
still set when mapcount has fallen to 0. That is not what we want: we
want /proc/vmstat's unevictable_pgs_cleared to remain a useful check
on the integrity of the mlock/munlock protocol - small numbers are
not surprising, but big numbers mean the protocol is not working.

That could be easily fixed by placing munlock_vma_page() at the start of
page_remove_rmap(); but later in the series we shall want to batch the
munlocking, and that too would tend to leave PageMlocked still set at
the point when it is checked.

So delete clear_page_mlock() now: leave it instead to release_pages()
(and __page_cache_release()) to do this backstop clearing of Mlocked,
when page refcount has fallen to 0. If a pinned page occasionally gets
counted as Mlocked and Unevictable until it is unpinned, that's okay.

A slightly regrettable side-effect of this change is that, since
release_pages() and __page_cache_release() may be called at interrupt
time, those places which update NR_MLOCK with interrupts enabled
had better use mod_zone_page_state() than __mod_zone_page_state()
(but holding the lruvec lock always has interrupts disabled).
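
To illustrate that rule (a sketch only, not a hunk of this patch; zone and
nr stand in for the usual page_zone(page) and thp_nr_pages(page)):

	/* Under lruvec lock, interrupts are already disabled, so the
	 * cheaper non-irq-safe form remains correct there: */
	spin_lock_irq(&lruvec->lru_lock);
	__mod_zone_page_state(zone, NR_MLOCK, -nr);
	spin_unlock_irq(&lruvec->lru_lock);

	/* But on paths now reachable from release_pages() at interrupt
	 * time, the irq-safe form must be used: */
	mod_zone_page_state(zone, NR_MLOCK, -nr);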

This change, forcing Mlocked off when refcount 0 instead of earlier
when mapcount 0, is not fundamental: it can be reversed if performance
or something else is found to suffer; but this is the easiest way to
separate the stats - let's not complicate that without good reason.

Signed-off-by: Hugh Dickins <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
---
v2: same as v1.

mm/internal.h | 12 ------------
mm/mlock.c | 30 ------------------------------
mm/rmap.c | 9 ---------
mm/swap.c | 32 ++++++++++++++++++++++++--------
4 files changed, 24 insertions(+), 59 deletions(-)

diff --git a/mm/internal.h b/mm/internal.h
index 3d7dfc8bc471..a43d79335c16 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -425,17 +425,6 @@ static inline void munlock_vma_page(struct page *page,
munlock_page(page);
}

-/*
- * Clear the page's PageMlocked(). This can be useful in a situation where
- * we want to unconditionally remove a page from the pagecache -- e.g.,
- * on truncation or freeing.
- *
- * It is legal to call this function for any page, mlocked or not.
- * If called for a page that is still mapped by mlocked vmas, all we do
- * is revert to lazy LRU behaviour -- semantics are not broken.
- */
-extern void clear_page_mlock(struct page *page);
-
extern pmd_t maybe_pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vma);

/*
@@ -509,7 +498,6 @@ static inline struct file *maybe_unlock_mmap_for_io(struct vm_fault *vmf,
}
#else /* !CONFIG_MMU */
static inline void unmap_mapping_folio(struct folio *folio) { }
-static inline void clear_page_mlock(struct page *page) { }
static inline void mlock_vma_page(struct page *page,
struct vm_area_struct *vma, bool compound) { }
static inline void munlock_vma_page(struct page *page,
diff --git a/mm/mlock.c b/mm/mlock.c
index 92f28258b4ae..3c26473050a3 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -48,36 +48,6 @@ EXPORT_SYMBOL(can_do_mlock);
* PageUnevictable is set to indicate the unevictable state.
*/

-/*
- * LRU accounting for clear_page_mlock()
- */
-void clear_page_mlock(struct page *page)
-{
- int nr_pages;
-
- if (!TestClearPageMlocked(page))
- return;
-
- nr_pages = thp_nr_pages(page);
- mod_zone_page_state(page_zone(page), NR_MLOCK, -nr_pages);
- count_vm_events(UNEVICTABLE_PGCLEARED, nr_pages);
- /*
- * The previous TestClearPageMlocked() corresponds to the smp_mb()
- * in __pagevec_lru_add_fn().
- *
- * See __pagevec_lru_add_fn for more explanation.
- */
- if (!isolate_lru_page(page)) {
- putback_lru_page(page);
- } else {
- /*
- * We lost the race. the page already moved to evictable list.
- */
- if (PageUnevictable(page))
- count_vm_events(UNEVICTABLE_PGSTRANDED, nr_pages);
- }
-}
-
/**
* mlock_page - mlock a page
* @page: page to be mlocked, either a normal page or a THP head.
diff --git a/mm/rmap.c b/mm/rmap.c
index 6cc8bf129f18..5442a5c97a85 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1315,9 +1315,6 @@ static void page_remove_file_rmap(struct page *page, bool compound)
* pte lock(a spinlock) is held, which implies preemption disabled.
*/
__mod_lruvec_page_state(page, NR_FILE_MAPPED, -nr);
-
- if (unlikely(PageMlocked(page)))
- clear_page_mlock(page);
}

static void page_remove_anon_compound_rmap(struct page *page)
@@ -1357,9 +1354,6 @@ static void page_remove_anon_compound_rmap(struct page *page)
nr = thp_nr_pages(page);
}

- if (unlikely(PageMlocked(page)))
- clear_page_mlock(page);
-
if (nr)
__mod_lruvec_page_state(page, NR_ANON_MAPPED, -nr);
}
@@ -1398,9 +1392,6 @@ void page_remove_rmap(struct page *page,
*/
__dec_lruvec_page_state(page, NR_ANON_MAPPED);

- if (unlikely(PageMlocked(page)))
- clear_page_mlock(page);
-
if (PageTransCompound(page))
deferred_split_huge_page(compound_head(page));

diff --git a/mm/swap.c b/mm/swap.c
index bcf3ac288b56..ff4810e4a4bc 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -74,8 +74,8 @@ static DEFINE_PER_CPU(struct lru_pvecs, lru_pvecs) = {
};

/*
- * This path almost never happens for VM activity - pages are normally
- * freed via pagevecs. But it gets used by networking.
+ * This path almost never happens for VM activity - pages are normally freed
+ * via pagevecs. But it gets used by networking - and for compound pages.
*/
static void __page_cache_release(struct page *page)
{
@@ -89,6 +89,14 @@ static void __page_cache_release(struct page *page)
__clear_page_lru_flags(page);
unlock_page_lruvec_irqrestore(lruvec, flags);
}
+ /* See comment on PageMlocked in release_pages() */
+ if (unlikely(PageMlocked(page))) {
+ int nr_pages = thp_nr_pages(page);
+
+ __ClearPageMlocked(page);
+ mod_zone_page_state(page_zone(page), NR_MLOCK, -nr_pages);
+ count_vm_events(UNEVICTABLE_PGCLEARED, nr_pages);
+ }
__ClearPageWaiters(page);
}

@@ -489,12 +497,8 @@ void lru_cache_add_inactive_or_unevictable(struct page *page,
unevictable = (vma->vm_flags & (VM_LOCKED | VM_SPECIAL)) == VM_LOCKED;
if (unlikely(unevictable) && !TestSetPageMlocked(page)) {
int nr_pages = thp_nr_pages(page);
- /*
- * We use the irq-unsafe __mod_zone_page_state because this
- * counter is not modified from interrupt context, and the pte
- * lock is held(spinlock), which implies preemption disabled.
- */
- __mod_zone_page_state(page_zone(page), NR_MLOCK, nr_pages);
+
+ mod_zone_page_state(page_zone(page), NR_MLOCK, nr_pages);
count_vm_events(UNEVICTABLE_PGMLOCKED, nr_pages);
}
lru_cache_add(page);
@@ -969,6 +973,18 @@ void release_pages(struct page **pages, int nr)
__clear_page_lru_flags(page);
}

+ /*
+ * In rare cases, when truncation or holepunching raced with
+ * munlock after VM_LOCKED was cleared, Mlocked may still be
+ * found set here. This does not indicate a problem, unless
+ * "unevictable_pgs_cleared" appears worryingly large.
+ */
+ if (unlikely(PageMlocked(page))) {
+ __ClearPageMlocked(page);
+ dec_zone_page_state(page, NR_MLOCK);
+ count_vm_event(UNEVICTABLE_PGCLEARED);
+ }
+
__ClearPageWaiters(page);

list_add(&page->lru, &pages_to_free);
--
2.34.1

2022-02-15 05:32:04

by Hugh Dickins

Subject: [PATCH v2 02/13] mm/munlock: delete FOLL_MLOCK and FOLL_POPULATE

If counting page mlocks, we must not double-count: follow_page_pte() can
tell if a page has already been Mlocked or not, but cannot tell if a pte
has already been counted or not: that will have to be done when the pte
is mapped in (which lru_cache_add_inactive_or_unevictable() already tracks
for new anon pages, but there's no such tracking yet for others).

Delete all the FOLL_MLOCK code - faulting in the missing pages will do
all that is necessary, without special mlock_vma_page() calls from here.

But then FOLL_POPULATE turns out to serve no purpose - it was there so
that its absence would tell faultin_page() not to fault in the page when
setting up VM_LOCKONFAULT areas; but if there's no special work needed
here for mlock, then there's no work at all here for VM_LOCKONFAULT.

Have I got that right? I've not looked into the history, but see that
FOLL_POPULATE goes back before VM_LOCKONFAULT: did it serve a different
purpose before? Ah, yes, it was used to skip the old stack guard page.

And is it intentional that COW is not broken on existing pages when
setting up a VM_LOCKONFAULT area? I can see that being argued either
way, and have no reason to disagree with current behaviour.

Signed-off-by: Hugh Dickins <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
---
v2: same as v1.

include/linux/mm.h | 2 --
mm/gup.c | 43 ++++++++-----------------------------------
mm/huge_memory.c | 33 ---------------------------------
3 files changed, 8 insertions(+), 70 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 213cc569b192..74ee50c2033b 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2925,13 +2925,11 @@ struct page *follow_page(struct vm_area_struct *vma, unsigned long address,
#define FOLL_FORCE 0x10 /* get_user_pages read/write w/o permission */
#define FOLL_NOWAIT 0x20 /* if a disk transfer is needed, start the IO
* and return without waiting upon it */
-#define FOLL_POPULATE 0x40 /* fault in pages (with FOLL_MLOCK) */
#define FOLL_NOFAULT 0x80 /* do not fault in pages */
#define FOLL_HWPOISON 0x100 /* check page is hwpoisoned */
#define FOLL_NUMA 0x200 /* force NUMA hinting page fault */
#define FOLL_MIGRATION 0x400 /* wait for page to replace migration entry */
#define FOLL_TRIED 0x800 /* a retry, previous pass started an IO */
-#define FOLL_MLOCK 0x1000 /* lock present pages */
#define FOLL_REMOTE 0x2000 /* we are working on non-current tsk/mm */
#define FOLL_COW 0x4000 /* internal GUP flag */
#define FOLL_ANON 0x8000 /* don't do file mappings */
diff --git a/mm/gup.c b/mm/gup.c
index f0af462ac1e2..2076902344d8 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -572,32 +572,6 @@ static struct page *follow_page_pte(struct vm_area_struct *vma,
*/
mark_page_accessed(page);
}
- if ((flags & FOLL_MLOCK) && (vma->vm_flags & VM_LOCKED)) {
- /* Do not mlock pte-mapped THP */
- if (PageTransCompound(page))
- goto out;
-
- /*
- * The preliminary mapping check is mainly to avoid the
- * pointless overhead of lock_page on the ZERO_PAGE
- * which might bounce very badly if there is contention.
- *
- * If the page is already locked, we don't need to
- * handle it now - vmscan will handle it later if and
- * when it attempts to reclaim the page.
- */
- if (page->mapping && trylock_page(page)) {
- lru_add_drain(); /* push cached pages to LRU */
- /*
- * Because we lock page here, and migration is
- * blocked by the pte's page reference, and we
- * know the page is still mapped, we don't even
- * need to check for file-cache page truncation.
- */
- mlock_vma_page(page);
- unlock_page(page);
- }
- }
out:
pte_unmap_unlock(ptep, ptl);
return page;
@@ -920,9 +894,6 @@ static int faultin_page(struct vm_area_struct *vma,
unsigned int fault_flags = 0;
vm_fault_t ret;

- /* mlock all present pages, but do not fault in new pages */
- if ((*flags & (FOLL_POPULATE | FOLL_MLOCK)) == FOLL_MLOCK)
- return -ENOENT;
if (*flags & FOLL_NOFAULT)
return -EFAULT;
if (*flags & FOLL_WRITE)
@@ -1173,8 +1144,6 @@ static long __get_user_pages(struct mm_struct *mm,
case -ENOMEM:
case -EHWPOISON:
goto out;
- case -ENOENT:
- goto next_page;
}
BUG();
} else if (PTR_ERR(page) == -EEXIST) {
@@ -1472,9 +1441,14 @@ long populate_vma_page_range(struct vm_area_struct *vma,
VM_BUG_ON_VMA(end > vma->vm_end, vma);
mmap_assert_locked(mm);

- gup_flags = FOLL_TOUCH | FOLL_POPULATE | FOLL_MLOCK;
+ /*
+ * Rightly or wrongly, the VM_LOCKONFAULT case has never used
+ * faultin_page() to break COW, so it has no work to do here.
+ */
if (vma->vm_flags & VM_LOCKONFAULT)
- gup_flags &= ~FOLL_POPULATE;
+ return nr_pages;
+
+ gup_flags = FOLL_TOUCH;
/*
* We want to touch writable mappings with a write fault in order
* to break COW, except for shared mappings because these don't COW
@@ -1541,10 +1515,9 @@ long faultin_vma_page_range(struct vm_area_struct *vma, unsigned long start,
* in the page table.
* FOLL_HWPOISON: Return -EHWPOISON instead of -EFAULT when we hit
* a poisoned page.
- * FOLL_POPULATE: Always populate memory with VM_LOCKONFAULT.
* !FOLL_FORCE: Require proper access permissions.
*/
- gup_flags = FOLL_TOUCH | FOLL_POPULATE | FOLL_MLOCK | FOLL_HWPOISON;
+ gup_flags = FOLL_TOUCH | FOLL_HWPOISON;
if (write)
gup_flags |= FOLL_WRITE;

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 406a3c28c026..9a34b85ebcf8 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1380,39 +1380,6 @@ struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,
if (flags & FOLL_TOUCH)
touch_pmd(vma, addr, pmd, flags);

- if ((flags & FOLL_MLOCK) && (vma->vm_flags & VM_LOCKED)) {
- /*
- * We don't mlock() pte-mapped THPs. This way we can avoid
- * leaking mlocked pages into non-VM_LOCKED VMAs.
- *
- * For anon THP:
- *
- * In most cases the pmd is the only mapping of the page as we
- * break COW for the mlock() -- see gup_flags |= FOLL_WRITE for
- * writable private mappings in populate_vma_page_range().
- *
- * The only scenario when we have the page shared here is if we
- * mlocking read-only mapping shared over fork(). We skip
- * mlocking such pages.
- *
- * For file THP:
- *
- * We can expect PageDoubleMap() to be stable under page lock:
- * for file pages we set it in page_add_file_rmap(), which
- * requires page to be locked.
- */
-
- if (PageAnon(page) && compound_mapcount(page) != 1)
- goto skip_mlock;
- if (PageDoubleMap(page) || !page->mapping)
- goto skip_mlock;
- if (!trylock_page(page))
- goto skip_mlock;
- if (page->mapping && !PageDoubleMap(page))
- mlock_vma_page(page);
- unlock_page(page);
- }
-skip_mlock:
page += (addr & ~HPAGE_PMD_MASK) >> PAGE_SHIFT;
VM_BUG_ON_PAGE(!PageCompound(page) && !is_zone_device_page(page), page);

--
2.34.1

2022-02-15 05:46:51

by Hugh Dickins

Subject: [PATCH v2 12/13] mm/thp: collapse_file() do try_to_unmap(TTU_BATCH_FLUSH)

collapse_file() is using unmap_mapping_pages(1) on each small page found
mapped, unlike others (reclaim, migration, splitting, memory-failure) which
use try_to_unmap(). There are four advantages to try_to_unmap(): first,
its TTU_IGNORE_MLOCK option now avoids leaving a mlocked page in the pagevec;
second, its vma lookup uses i_mmap_lock_read() not i_mmap_lock_write();
third, it breaks out early if page is not mapped everywhere it might be;
fourth, its TTU_BATCH_FLUSH option can be used, as in page reclaim, to
save up all the TLB flushing until all of the pages have been unmapped.
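
The batch-and-flush pattern relied on here, condensed from the two hunks
below into a sketch (it is what page reclaim already does):

	/* in the loop over each small page found mapped */
	if (page_mapped(page))
		try_to_unmap(page, TTU_IGNORE_MLOCK | TTU_BATCH_FLUSH);
	/* ... then once, after the whole batch has been unmapped ... */
	try_to_unmap_flush();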

Wild guess: perhaps it was originally written to use try_to_unmap(),
but hit the VM_BUG_ON_PAGE(page_mapped) after unmapping, because without
TTU_SYNC it may skip page table locks; but unmap_mapping_pages() never
skips them, so switching to it fixed the issue. I did once hit that VM_BUG_ON_PAGE()
since making this change: we could pass TTU_SYNC here, but I think just
delete the check - the race is very rare, this is an ordinary small page
so we don't need to be so paranoid about mapcount surprises, and the
page_ref_freeze() just below already handles the case adequately.

Signed-off-by: Hugh Dickins <[email protected]>
---
v2: same as v1.

mm/khugepaged.c | 10 ++++++++--
1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index d5e387c58bde..e0883a33efd6 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1829,13 +1829,12 @@ static void collapse_file(struct mm_struct *mm,
}

if (page_mapped(page))
- unmap_mapping_pages(mapping, index, 1, false);
+ try_to_unmap(page, TTU_IGNORE_MLOCK | TTU_BATCH_FLUSH);

xas_lock_irq(&xas);
xas_set(&xas, index);

VM_BUG_ON_PAGE(page != xas_load(&xas), page);
- VM_BUG_ON_PAGE(page_mapped(page), page);

/*
* The page is expected to have page_count() == 3:
@@ -1899,6 +1898,13 @@ static void collapse_file(struct mm_struct *mm,
xas_unlock_irq(&xas);
xa_unlocked:

+ /*
+ * If collapse is successful, flush must be done now before copying.
+ * If collapse is unsuccessful, does flush actually need to be done?
+ * Do it anyway, to clear the state.
+ */
+ try_to_unmap_flush();
+
if (result == SCAN_SUCCEED) {
struct page *page, *tmp;

--
2.34.1

2022-02-15 06:23:19

by Hugh Dickins

Subject: [PATCH v2 10/13] mm/munlock: mlock_page() munlock_page() batch by pagevec

A weakness of the page->mlock_count approach is the need for lruvec lock
while holding page table lock. That is not an overhead we would allow on
normal pages, but I think acceptable just for pages in an mlocked area.
But let's try to amortize the extra cost by gathering on per-cpu pagevec
before acquiring the lruvec lock.
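
In outline, the batching looks like this (condensed from the mlock_page()
added below, with the THP and lru_cache_disabled() details omitted):

	struct pagevec *pvec = &get_cpu_var(mlock_pvec);

	if (!TestSetPageMlocked(page))
		mod_zone_page_state(page_zone(page), NR_MLOCK, thp_nr_pages(page));
	get_page(page);
	if (!pagevec_add(pvec, mlock_lru(page)))
		mlock_pagevec(pvec);	/* lruvec lock taken once per batch */
	put_cpu_var(mlock_pvec);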

I have an unverified conjecture that the mlock pagevec might work out
well for delaying the mlock processing of new file pages until they have
got off lru_cache_add()'s pagevec and on to LRU.

The initialization of page->mlock_count is racy, and awkward: should it
be 0 or !!PageMlocked or 1? Was it wrong even in the implementation before
this commit, which just widens the window? I haven't gone back to think
it through. Maybe someone can point out a better way to initialize it.

Bringing lru_cache_add_inactive_or_unevictable()'s mlock initialization
into mm/mlock.c has helped: mlock_new_page(), using the mlock pagevec,
rather than lru_cache_add()'s pagevec.

Experimented with various orderings: the right thing seems to be for
mlock_page() and mlock_new_page() to TestSetPageMlocked before adding to
pagevec, but munlock_page() to leave TestClearPageMlocked to the later
pagevec processing.

Dropped the VM_BUG_ON_PAGE(PageTail)s this time around: they have made
their point, and the thp_nr_pages()s already contain a VM_BUG_ON_PGFLAGS()
for that.

This still leaves acquiring lruvec locks under page table lock each time
the pagevec fills (or a THP is added): which I suppose is rather silly,
since they sit on the pagevec waiting to be processed long after page table
lock has been dropped; but I'm disinclined to uglify the calling sequence
until some load shows an actual problem with it (nothing wrong with
taking lruvec lock under page table lock, just "nicer" to do it less).

Signed-off-by: Hugh Dickins <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
---
v2: Fix kerneldoc on mlock_page() and two more functions, per Yang Li.
Fix build without CONFIG_MMU, per SeongJae, Geert, Naresh, lkp.

mm/internal.h | 9 ++-
mm/mlock.c | 212 ++++++++++++++++++++++++++++++++++++++++++--------
mm/swap.c | 27 ++++---
3 files changed, 201 insertions(+), 47 deletions(-)

diff --git a/mm/internal.h b/mm/internal.h
index b3f0dd3ffba2..18af980bb1b8 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -402,7 +402,8 @@ extern int mlock_future_check(struct mm_struct *mm, unsigned long flags,
*
* mlock is usually called at the end of page_add_*_rmap(),
* munlock at the end of page_remove_rmap(); but new anon
- * pages are managed in lru_cache_add_inactive_or_unevictable().
+ * pages are managed by lru_cache_add_inactive_or_unevictable()
+ * calling mlock_new_page().
*
* @compound is used to include pmd mappings of THPs, but filter out
* pte mappings of THPs, which cannot be consistently counted: a pte
@@ -425,6 +426,9 @@ static inline void munlock_vma_page(struct page *page,
(compound || !PageTransCompound(page)))
munlock_page(page);
}
+void mlock_new_page(struct page *page);
+bool need_mlock_page_drain(int cpu);
+void mlock_page_drain(int cpu);

extern pmd_t maybe_pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vma);

@@ -503,6 +507,9 @@ static inline void mlock_vma_page(struct page *page,
struct vm_area_struct *vma, bool compound) { }
static inline void munlock_vma_page(struct page *page,
struct vm_area_struct *vma, bool compound) { }
+static inline void mlock_new_page(struct page *page) { }
+static inline bool need_mlock_page_drain(int cpu) { return false; }
+static inline void mlock_page_drain(int cpu) { }
static inline void vunmap_range_noflush(unsigned long start, unsigned long end)
{
}
diff --git a/mm/mlock.c b/mm/mlock.c
index 581ea8bf1b83..93d616ba3e22 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -28,6 +28,8 @@

#include "internal.h"

+static DEFINE_PER_CPU(struct pagevec, mlock_pvec);
+
bool can_do_mlock(void)
{
if (rlimit(RLIMIT_MEMLOCK) != 0)
@@ -49,57 +51,79 @@ EXPORT_SYMBOL(can_do_mlock);
* PageUnevictable is set to indicate the unevictable state.
*/

-/**
- * mlock_page - mlock a page
- * @page: page to be mlocked, either a normal page or a THP head.
- */
-void mlock_page(struct page *page)
+static struct lruvec *__mlock_page(struct page *page, struct lruvec *lruvec)
{
- struct lruvec *lruvec;
- int nr_pages = thp_nr_pages(page);
+ /* There is nothing more we can do while it's off LRU */
+ if (!TestClearPageLRU(page))
+ return lruvec;

- VM_BUG_ON_PAGE(PageTail(page), page);
+ lruvec = folio_lruvec_relock_irq(page_folio(page), lruvec);

- if (!TestSetPageMlocked(page)) {
- mod_zone_page_state(page_zone(page), NR_MLOCK, nr_pages);
- __count_vm_events(UNEVICTABLE_PGMLOCKED, nr_pages);
+ if (unlikely(page_evictable(page))) {
+ /*
+ * This is a little surprising, but quite possible:
+ * PageMlocked must have got cleared already by another CPU.
+ * Could this page be on the Unevictable LRU? I'm not sure,
+ * but move it now if so.
+ */
+ if (PageUnevictable(page)) {
+ del_page_from_lru_list(page, lruvec);
+ ClearPageUnevictable(page);
+ add_page_to_lru_list(page, lruvec);
+ __count_vm_events(UNEVICTABLE_PGRESCUED,
+ thp_nr_pages(page));
+ }
+ goto out;
}

- /* There is nothing more we can do while it's off LRU */
- if (!TestClearPageLRU(page))
- return;
-
- lruvec = folio_lruvec_lock_irq(page_folio(page));
if (PageUnevictable(page)) {
- page->mlock_count++;
+ if (PageMlocked(page))
+ page->mlock_count++;
goto out;
}

del_page_from_lru_list(page, lruvec);
ClearPageActive(page);
SetPageUnevictable(page);
- page->mlock_count = 1;
+ page->mlock_count = !!PageMlocked(page);
add_page_to_lru_list(page, lruvec);
- __count_vm_events(UNEVICTABLE_PGCULLED, nr_pages);
+ __count_vm_events(UNEVICTABLE_PGCULLED, thp_nr_pages(page));
out:
SetPageLRU(page);
- unlock_page_lruvec_irq(lruvec);
+ return lruvec;
}

-/**
- * munlock_page - munlock a page
- * @page: page to be munlocked, either a normal page or a THP head.
- */
-void munlock_page(struct page *page)
+static struct lruvec *__mlock_new_page(struct page *page, struct lruvec *lruvec)
+{
+ VM_BUG_ON_PAGE(PageLRU(page), page);
+
+ lruvec = folio_lruvec_relock_irq(page_folio(page), lruvec);
+
+ /* As above, this is a little surprising, but possible */
+ if (unlikely(page_evictable(page)))
+ goto out;
+
+ SetPageUnevictable(page);
+ page->mlock_count = !!PageMlocked(page);
+ __count_vm_events(UNEVICTABLE_PGCULLED, thp_nr_pages(page));
+out:
+ add_page_to_lru_list(page, lruvec);
+ SetPageLRU(page);
+ return lruvec;
+}
+
+static struct lruvec *__munlock_page(struct page *page, struct lruvec *lruvec)
{
- struct lruvec *lruvec;
int nr_pages = thp_nr_pages(page);
+ bool isolated = false;

- VM_BUG_ON_PAGE(PageTail(page), page);
+ if (!TestClearPageLRU(page))
+ goto munlock;

- lock_page_memcg(page);
- lruvec = folio_lruvec_lock_irq(page_folio(page));
- if (PageLRU(page) && PageUnevictable(page)) {
+ isolated = true;
+ lruvec = folio_lruvec_relock_irq(page_folio(page), lruvec);
+
+ if (PageUnevictable(page)) {
/* Then mlock_count is maintained, but might undercount */
if (page->mlock_count)
page->mlock_count--;
@@ -108,24 +132,144 @@ void munlock_page(struct page *page)
}
/* else assume that was the last mlock: reclaim will fix it if not */

+munlock:
if (TestClearPageMlocked(page)) {
__mod_zone_page_state(page_zone(page), NR_MLOCK, -nr_pages);
- if (PageLRU(page) || !PageUnevictable(page))
+ if (isolated || !PageUnevictable(page))
__count_vm_events(UNEVICTABLE_PGMUNLOCKED, nr_pages);
else
__count_vm_events(UNEVICTABLE_PGSTRANDED, nr_pages);
}

/* page_evictable() has to be checked *after* clearing Mlocked */
- if (PageLRU(page) && PageUnevictable(page) && page_evictable(page)) {
+ if (isolated && PageUnevictable(page) && page_evictable(page)) {
del_page_from_lru_list(page, lruvec);
ClearPageUnevictable(page);
add_page_to_lru_list(page, lruvec);
__count_vm_events(UNEVICTABLE_PGRESCUED, nr_pages);
}
out:
- unlock_page_lruvec_irq(lruvec);
- unlock_page_memcg(page);
+ if (isolated)
+ SetPageLRU(page);
+ return lruvec;
+}
+
+/*
+ * Flags held in the low bits of a struct page pointer on the mlock_pvec.
+ */
+#define LRU_PAGE 0x1
+#define NEW_PAGE 0x2
+#define mlock_lru(page) ((struct page *)((unsigned long)page + LRU_PAGE))
+#define mlock_new(page) ((struct page *)((unsigned long)page + NEW_PAGE))
+
+/*
+ * mlock_pagevec() is derived from pagevec_lru_move_fn():
+ * perhaps that can make use of such page pointer flags in future,
+ * but for now just keep it for mlock. We could use three separate
+ * pagevecs instead, but one feels better (munlocking a full pagevec
+ * does not need to drain mlocking pagevecs first).
+ */
+static void mlock_pagevec(struct pagevec *pvec)
+{
+ struct lruvec *lruvec = NULL;
+ unsigned long mlock;
+ struct page *page;
+ int i;
+
+ for (i = 0; i < pagevec_count(pvec); i++) {
+ page = pvec->pages[i];
+ mlock = (unsigned long)page & (LRU_PAGE | NEW_PAGE);
+ page = (struct page *)((unsigned long)page - mlock);
+ pvec->pages[i] = page;
+
+ if (mlock & LRU_PAGE)
+ lruvec = __mlock_page(page, lruvec);
+ else if (mlock & NEW_PAGE)
+ lruvec = __mlock_new_page(page, lruvec);
+ else
+ lruvec = __munlock_page(page, lruvec);
+ }
+
+ if (lruvec)
+ unlock_page_lruvec_irq(lruvec);
+ release_pages(pvec->pages, pvec->nr);
+ pagevec_reinit(pvec);
+}
+
+void mlock_page_drain(int cpu)
+{
+ struct pagevec *pvec;
+
+ pvec = &per_cpu(mlock_pvec, cpu);
+ if (pagevec_count(pvec))
+ mlock_pagevec(pvec);
+}
+
+bool need_mlock_page_drain(int cpu)
+{
+ return pagevec_count(&per_cpu(mlock_pvec, cpu));
+}
+
+/**
+ * mlock_page - mlock a page already on (or temporarily off) LRU
+ * @page: page to be mlocked, either a normal page or a THP head.
+ */
+void mlock_page(struct page *page)
+{
+ struct pagevec *pvec = &get_cpu_var(mlock_pvec);
+
+ if (!TestSetPageMlocked(page)) {
+ int nr_pages = thp_nr_pages(page);
+
+ mod_zone_page_state(page_zone(page), NR_MLOCK, nr_pages);
+ __count_vm_events(UNEVICTABLE_PGMLOCKED, nr_pages);
+ }
+
+ get_page(page);
+ if (!pagevec_add(pvec, mlock_lru(page)) ||
+ PageHead(page) || lru_cache_disabled())
+ mlock_pagevec(pvec);
+ put_cpu_var(mlock_pvec);
+}
+
+/**
+ * mlock_new_page - mlock a newly allocated page not yet on LRU
+ * @page: page to be mlocked, either a normal page or a THP head.
+ */
+void mlock_new_page(struct page *page)
+{
+ struct pagevec *pvec = &get_cpu_var(mlock_pvec);
+ int nr_pages = thp_nr_pages(page);
+
+ SetPageMlocked(page);
+ mod_zone_page_state(page_zone(page), NR_MLOCK, nr_pages);
+ __count_vm_events(UNEVICTABLE_PGMLOCKED, nr_pages);
+
+ get_page(page);
+ if (!pagevec_add(pvec, mlock_new(page)) ||
+ PageHead(page) || lru_cache_disabled())
+ mlock_pagevec(pvec);
+ put_cpu_var(mlock_pvec);
+}
+
+/**
+ * munlock_page - munlock a page
+ * @page: page to be munlocked, either a normal page or a THP head.
+ */
+void munlock_page(struct page *page)
+{
+ struct pagevec *pvec = &get_cpu_var(mlock_pvec);
+
+ /*
+ * TestClearPageMlocked(page) must be left to __munlock_page(),
+ * which will check whether the page is multiply mlocked.
+ */
+
+ get_page(page);
+ if (!pagevec_add(pvec, page) ||
+ PageHead(page) || lru_cache_disabled())
+ mlock_pagevec(pvec);
+ put_cpu_var(mlock_pvec);
}

static int mlock_pte_range(pmd_t *pmd, unsigned long addr,
diff --git a/mm/swap.c b/mm/swap.c
index 3f770b1ea2c1..842d5cd92cf6 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -490,18 +490,12 @@ EXPORT_SYMBOL(folio_add_lru);
void lru_cache_add_inactive_or_unevictable(struct page *page,
struct vm_area_struct *vma)
{
- bool unevictable;
-
VM_BUG_ON_PAGE(PageLRU(page), page);

- unevictable = (vma->vm_flags & (VM_LOCKED | VM_SPECIAL)) == VM_LOCKED;
- if (unlikely(unevictable) && !TestSetPageMlocked(page)) {
- int nr_pages = thp_nr_pages(page);
-
- mod_zone_page_state(page_zone(page), NR_MLOCK, nr_pages);
- count_vm_events(UNEVICTABLE_PGMLOCKED, nr_pages);
- }
- lru_cache_add(page);
+ if (unlikely((vma->vm_flags & (VM_LOCKED | VM_SPECIAL)) == VM_LOCKED))
+ mlock_new_page(page);
+ else
+ lru_cache_add(page);
}

/*
@@ -640,6 +634,7 @@ void lru_add_drain_cpu(int cpu)
pagevec_lru_move_fn(pvec, lru_lazyfree_fn);

activate_page_drain(cpu);
+ mlock_page_drain(cpu);
}

/**
@@ -842,6 +837,7 @@ inline void __lru_add_drain_all(bool force_all_cpus)
pagevec_count(&per_cpu(lru_pvecs.lru_deactivate, cpu)) ||
pagevec_count(&per_cpu(lru_pvecs.lru_lazyfree, cpu)) ||
need_activate_page_drain(cpu) ||
+ need_mlock_page_drain(cpu) ||
has_bh_in_lru(cpu, NULL)) {
INIT_WORK(work, lru_add_drain_per_cpu);
queue_work_on(cpu, mm_percpu_wq, work);
@@ -1030,7 +1026,7 @@ static void __pagevec_lru_add_fn(struct folio *folio, struct lruvec *lruvec)
* Is an smp_mb__after_atomic() still required here, before
* folio_evictable() tests PageMlocked, to rule out the possibility
* of stranding an evictable folio on an unevictable LRU? I think
- * not, because munlock_page() only clears PageMlocked while the LRU
+ * not, because __munlock_page() only clears PageMlocked while the LRU
* lock is held.
*
* (That is not true of __page_cache_release(), and not necessarily
@@ -1043,7 +1039,14 @@ static void __pagevec_lru_add_fn(struct folio *folio, struct lruvec *lruvec)
} else {
folio_clear_active(folio);
folio_set_unevictable(folio);
- folio->mlock_count = !!folio_test_mlocked(folio);
+ /*
+ * folio->mlock_count = !!folio_test_mlocked(folio)?
+ * But that leaves __mlock_page() in doubt whether another
+ * actor has already counted the mlock or not. Err on the
+ * safe side, underestimate, let page reclaim fix it, rather
+ * than leaving a page on the unevictable LRU indefinitely.
+ */
+ folio->mlock_count = 0;
if (!was_unevictable)
__count_vm_events(UNEVICTABLE_PGCULLED, nr_pages);
}
--
2.34.1

2022-02-15 06:50:17

by Hugh Dickins

Subject: [PATCH v2 01/13] mm/munlock: delete page_mlock() and all its works

We have recommended that some applications mlock their userspace, but that
turns out to be counter-productive: when many processes mlock the same
file, contention on rmap's i_mmap_rwsem can become intolerable at exit: it
is needed for write, to remove any vma mapping that file from rmap's tree;
but hogged for read by those with mlocks calling page_mlock() (formerly
known as try_to_munlock()) on *each* page mapped from the file (the
purpose being to find out whether another process has the page mlocked,
so therefore it should not be unmlocked yet).

Several optimizations have been made in the past: one is to skip
page_mlock() when mapcount tells that nothing else has this page
mapped; but that doesn't help at all when others do have it mapped.
This time around, I initially intended to add a preliminary search
of the rmap tree for overlapping VM_LOCKED ranges; but that gets
messy with locking order, when in doubt whether a page is actually
present; and risks adding even more contention on the i_mmap_rwsem.

A solution would be much easier, if only there were space in struct page
for an mlock_count... but actually, most of the time, there is space for
it - an mlocked page spends most of its life on an unevictable LRU, but
since 3.18 removed the scan_unevictable_pages sysctl, that "LRU" has
been redundant. Let's try to reuse its page->lru.
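
Roughly the overlay intended, as a sketch only (field names here are
illustrative, not the final definition):

	union {
		struct list_head lru;		/* while on a normal LRU list */
		struct {			/* while "on" the unevictable list */
			void *__filler;			/* overlays lru.next */
			unsigned int mlock_count;	/* overlays lru.prev */
		};
	};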

But leave that until a later patch: in this patch, clear the ground by
removing page_mlock(), and all the infrastructure that has gathered
around it - which mostly hinders understanding, and will make reviewing
new additions harder. Don't mind those old comments about THPs, they
date from before 4.5's refcounting rework: splitting is not a risk here.

Just keep a minimal version of munlock_vma_page(), as reminder of what it
should attend to (in particular, the odd way PGSTRANDED is counted out of
PGMUNLOCKED), and likewise a stub for munlock_vma_pages_range(). Move
unchanged __mlock_posix_error_return() out of the way, down to above its
caller: this series then makes no further change after mlock_fixup().

After this and each following commit, the kernel builds, boots and runs;
but with deficiencies which may show up in testing of mlock and munlock.
The system calls succeed or fail as before, and mlock remains effective
in preventing page reclaim; but meminfo's Unevictable and Mlocked amounts
may be shown too low after mlock, grow, then stay too high after munlock:
with previously mlocked pages remaining unevictable for too long, until
finally unmapped and freed and counts corrected. Normal service will be
resumed in "mm/munlock: mlock_pte_range() when mlocking or munlocking".

Signed-off-by: Hugh Dickins <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
---
v2: Restore munlock_vma_pages_range() VM_LOCKED_CLEAR_MASK, per Vlastimil.
Add paragraph to changelog on what does not work yet, per Vlastimil.
These v2 replacement patches based on 5.17-rc2 like v1.

include/linux/rmap.h | 6 -
mm/internal.h | 2 +-
mm/mlock.c | 375 +++----------------------------------------
mm/rmap.c | 80 ---------
4 files changed, 25 insertions(+), 438 deletions(-)

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index e704b1a4c06c..dc48aa8c2c94 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -237,12 +237,6 @@ unsigned long page_address_in_vma(struct page *, struct vm_area_struct *);
*/
int folio_mkclean(struct folio *);

-/*
- * called in munlock()/munmap() path to check for other vmas holding
- * the page mlocked.
- */
-void page_mlock(struct page *page);
-
void remove_migration_ptes(struct page *old, struct page *new, bool locked);

/*
diff --git a/mm/internal.h b/mm/internal.h
index d80300392a19..e48c486d5ddf 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -409,7 +409,7 @@ static inline void munlock_vma_pages_all(struct vm_area_struct *vma)
* must be called with vma's mmap_lock held for read or write, and page locked.
*/
extern void mlock_vma_page(struct page *page);
-extern unsigned int munlock_vma_page(struct page *page);
+extern void munlock_vma_page(struct page *page);

extern int mlock_future_check(struct mm_struct *mm, unsigned long flags,
unsigned long len);
diff --git a/mm/mlock.c b/mm/mlock.c
index 8f584eddd305..aec4ce7919da 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -46,12 +46,6 @@ EXPORT_SYMBOL(can_do_mlock);
* be placed on the LRU "unevictable" list, rather than the [in]active lists.
* The unevictable list is an LRU sibling list to the [in]active lists.
* PageUnevictable is set to indicate the unevictable state.
- *
- * When lazy mlocking via vmscan, it is important to ensure that the
- * vma's VM_LOCKED status is not concurrently being modified, otherwise we
- * may have mlocked a page that is being munlocked. So lazy mlock must take
- * the mmap_lock for read, and verify that the vma really is locked
- * (see mm/rmap.c).
*/

/*
@@ -106,299 +100,28 @@ void mlock_vma_page(struct page *page)
}
}

-/*
- * Finish munlock after successful page isolation
- *
- * Page must be locked. This is a wrapper for page_mlock()
- * and putback_lru_page() with munlock accounting.
- */
-static void __munlock_isolated_page(struct page *page)
-{
- /*
- * Optimization: if the page was mapped just once, that's our mapping
- * and we don't need to check all the other vmas.
- */
- if (page_mapcount(page) > 1)
- page_mlock(page);
-
- /* Did try_to_unlock() succeed or punt? */
- if (!PageMlocked(page))
- count_vm_events(UNEVICTABLE_PGMUNLOCKED, thp_nr_pages(page));
-
- putback_lru_page(page);
-}
-
-/*
- * Accounting for page isolation fail during munlock
- *
- * Performs accounting when page isolation fails in munlock. There is nothing
- * else to do because it means some other task has already removed the page
- * from the LRU. putback_lru_page() will take care of removing the page from
- * the unevictable list, if necessary. vmscan [page_referenced()] will move
- * the page back to the unevictable list if some other vma has it mlocked.
- */
-static void __munlock_isolation_failed(struct page *page)
-{
- int nr_pages = thp_nr_pages(page);
-
- if (PageUnevictable(page))
- __count_vm_events(UNEVICTABLE_PGSTRANDED, nr_pages);
- else
- __count_vm_events(UNEVICTABLE_PGMUNLOCKED, nr_pages);
-}
-
/**
* munlock_vma_page - munlock a vma page
* @page: page to be unlocked, either a normal page or THP page head
- *
- * returns the size of the page as a page mask (0 for normal page,
- * HPAGE_PMD_NR - 1 for THP head page)
- *
- * called from munlock()/munmap() path with page supposedly on the LRU.
- * When we munlock a page, because the vma where we found the page is being
- * munlock()ed or munmap()ed, we want to check whether other vmas hold the
- * page locked so that we can leave it on the unevictable lru list and not
- * bother vmscan with it. However, to walk the page's rmap list in
- * page_mlock() we must isolate the page from the LRU. If some other
- * task has removed the page from the LRU, we won't be able to do that.
- * So we clear the PageMlocked as we might not get another chance. If we
- * can't isolate the page, we leave it for putback_lru_page() and vmscan
- * [page_referenced()/try_to_unmap()] to deal with.
*/
-unsigned int munlock_vma_page(struct page *page)
+void munlock_vma_page(struct page *page)
{
- int nr_pages;
-
- /* For page_mlock() and to serialize with page migration */
+ /* Serialize with page migration */
BUG_ON(!PageLocked(page));
- VM_BUG_ON_PAGE(PageTail(page), page);
-
- if (!TestClearPageMlocked(page)) {
- /* Potentially, PTE-mapped THP: do not skip the rest PTEs */
- return 0;
- }
-
- nr_pages = thp_nr_pages(page);
- mod_zone_page_state(page_zone(page), NR_MLOCK, -nr_pages);
-
- if (!isolate_lru_page(page))
- __munlock_isolated_page(page);
- else
- __munlock_isolation_failed(page);
-
- return nr_pages - 1;
-}
-
-/*
- * convert get_user_pages() return value to posix mlock() error
- */
-static int __mlock_posix_error_return(long retval)
-{
- if (retval == -EFAULT)
- retval = -ENOMEM;
- else if (retval == -ENOMEM)
- retval = -EAGAIN;
- return retval;
-}
-
-/*
- * Prepare page for fast batched LRU putback via putback_lru_evictable_pagevec()
- *
- * The fast path is available only for evictable pages with single mapping.
- * Then we can bypass the per-cpu pvec and get better performance.
- * when mapcount > 1 we need page_mlock() which can fail.
- * when !page_evictable(), we need the full redo logic of putback_lru_page to
- * avoid leaving evictable page in unevictable list.
- *
- * In case of success, @page is added to @pvec and @pgrescued is incremented
- * in case that the page was previously unevictable. @page is also unlocked.
- */
-static bool __putback_lru_fast_prepare(struct page *page, struct pagevec *pvec,
- int *pgrescued)
-{
- VM_BUG_ON_PAGE(PageLRU(page), page);
- VM_BUG_ON_PAGE(!PageLocked(page), page);
-
- if (page_mapcount(page) <= 1 && page_evictable(page)) {
- pagevec_add(pvec, page);
- if (TestClearPageUnevictable(page))
- (*pgrescued)++;
- unlock_page(page);
- return true;
- }
-
- return false;
-}

-/*
- * Putback multiple evictable pages to the LRU
- *
- * Batched putback of evictable pages that bypasses the per-cpu pvec. Some of
- * the pages might have meanwhile become unevictable but that is OK.
- */
-static void __putback_lru_fast(struct pagevec *pvec, int pgrescued)
-{
- count_vm_events(UNEVICTABLE_PGMUNLOCKED, pagevec_count(pvec));
- /*
- *__pagevec_lru_add() calls release_pages() so we don't call
- * put_page() explicitly
- */
- __pagevec_lru_add(pvec);
- count_vm_events(UNEVICTABLE_PGRESCUED, pgrescued);
-}
-
-/*
- * Munlock a batch of pages from the same zone
- *
- * The work is split to two main phases. First phase clears the Mlocked flag
- * and attempts to isolate the pages, all under a single zone lru lock.
- * The second phase finishes the munlock only for pages where isolation
- * succeeded.
- *
- * Note that the pagevec may be modified during the process.
- */
-static void __munlock_pagevec(struct pagevec *pvec, struct zone *zone)
-{
- int i;
- int nr = pagevec_count(pvec);
- int delta_munlocked = -nr;
- struct pagevec pvec_putback;
- struct lruvec *lruvec = NULL;
- int pgrescued = 0;
-
- pagevec_init(&pvec_putback);
-
- /* Phase 1: page isolation */
- for (i = 0; i < nr; i++) {
- struct page *page = pvec->pages[i];
- struct folio *folio = page_folio(page);
-
- if (TestClearPageMlocked(page)) {
- /*
- * We already have pin from follow_page_mask()
- * so we can spare the get_page() here.
- */
- if (TestClearPageLRU(page)) {
- lruvec = folio_lruvec_relock_irq(folio, lruvec);
- del_page_from_lru_list(page, lruvec);
- continue;
- } else
- __munlock_isolation_failed(page);
- } else {
- delta_munlocked++;
- }
+ VM_BUG_ON_PAGE(PageTail(page), page);

- /*
- * We won't be munlocking this page in the next phase
- * but we still need to release the follow_page_mask()
- * pin. We cannot do it under lru_lock however. If it's
- * the last pin, __page_cache_release() would deadlock.
- */
- pagevec_add(&pvec_putback, pvec->pages[i]);
- pvec->pages[i] = NULL;
- }
- if (lruvec) {
- __mod_zone_page_state(zone, NR_MLOCK, delta_munlocked);
- unlock_page_lruvec_irq(lruvec);
- } else if (delta_munlocked) {
- mod_zone_page_state(zone, NR_MLOCK, delta_munlocked);
- }
+ if (TestClearPageMlocked(page)) {
+ int nr_pages = thp_nr_pages(page);

- /* Now we can release pins of pages that we are not munlocking */
- pagevec_release(&pvec_putback);
-
- /* Phase 2: page munlock */
- for (i = 0; i < nr; i++) {
- struct page *page = pvec->pages[i];
-
- if (page) {
- lock_page(page);
- if (!__putback_lru_fast_prepare(page, &pvec_putback,
- &pgrescued)) {
- /*
- * Slow path. We don't want to lose the last
- * pin before unlock_page()
- */
- get_page(page); /* for putback_lru_page() */
- __munlock_isolated_page(page);
- unlock_page(page);
- put_page(page); /* from follow_page_mask() */
- }
+ mod_zone_page_state(page_zone(page), NR_MLOCK, -nr_pages);
+ if (!isolate_lru_page(page)) {
+ putback_lru_page(page);
+ count_vm_events(UNEVICTABLE_PGMUNLOCKED, nr_pages);
+ } else if (PageUnevictable(page)) {
+ count_vm_events(UNEVICTABLE_PGSTRANDED, nr_pages);
}
}
-
- /*
- * Phase 3: page putback for pages that qualified for the fast path
- * This will also call put_page() to return pin from follow_page_mask()
- */
- if (pagevec_count(&pvec_putback))
- __putback_lru_fast(&pvec_putback, pgrescued);
-}
-
-/*
- * Fill up pagevec for __munlock_pagevec using pte walk
- *
- * The function expects that the struct page corresponding to @start address is
- * a non-TPH page already pinned and in the @pvec, and that it belongs to @zone.
- *
- * The rest of @pvec is filled by subsequent pages within the same pmd and same
- * zone, as long as the pte's are present and vm_normal_page() succeeds. These
- * pages also get pinned.
- *
- * Returns the address of the next page that should be scanned. This equals
- * @start + PAGE_SIZE when no page could be added by the pte walk.
- */
-static unsigned long __munlock_pagevec_fill(struct pagevec *pvec,
- struct vm_area_struct *vma, struct zone *zone,
- unsigned long start, unsigned long end)
-{
- pte_t *pte;
- spinlock_t *ptl;
-
- /*
- * Initialize pte walk starting at the already pinned page where we
- * are sure that there is a pte, as it was pinned under the same
- * mmap_lock write op.
- */
- pte = get_locked_pte(vma->vm_mm, start, &ptl);
- /* Make sure we do not cross the page table boundary */
- end = pgd_addr_end(start, end);
- end = p4d_addr_end(start, end);
- end = pud_addr_end(start, end);
- end = pmd_addr_end(start, end);
-
- /* The page next to the pinned page is the first we will try to get */
- start += PAGE_SIZE;
- while (start < end) {
- struct page *page = NULL;
- pte++;
- if (pte_present(*pte))
- page = vm_normal_page(vma, start, *pte);
- /*
- * Break if page could not be obtained or the page's node+zone does not
- * match
- */
- if (!page || page_zone(page) != zone)
- break;
-
- /*
- * Do not use pagevec for PTE-mapped THP,
- * munlock_vma_pages_range() will handle them.
- */
- if (PageTransCompound(page))
- break;
-
- get_page(page);
- /*
- * Increase the address that will be returned *before* the
- * eventual break due to pvec becoming full by adding the page
- */
- start += PAGE_SIZE;
- if (pagevec_add(pvec, page) == 0)
- break;
- }
- pte_unmap_unlock(pte, ptl);
- return start;
}

/*
@@ -413,75 +136,13 @@ static unsigned long __munlock_pagevec_fill(struct pagevec *pvec,
*
* Returns with VM_LOCKED cleared. Callers must be prepared to
* deal with this.
- *
- * We don't save and restore VM_LOCKED here because pages are
- * still on lru. In unmap path, pages might be scanned by reclaim
- * and re-mlocked by page_mlock/try_to_unmap before we unmap and
- * free them. This will result in freeing mlocked pages.
*/
void munlock_vma_pages_range(struct vm_area_struct *vma,
unsigned long start, unsigned long end)
{
vma->vm_flags &= VM_LOCKED_CLEAR_MASK;

- while (start < end) {
- struct page *page;
- unsigned int page_mask = 0;
- unsigned long page_increm;
- struct pagevec pvec;
- struct zone *zone;
-
- pagevec_init(&pvec);
- /*
- * Although FOLL_DUMP is intended for get_dump_page(),
- * it just so happens that its special treatment of the
- * ZERO_PAGE (returning an error instead of doing get_page)
- * suits munlock very well (and if somehow an abnormal page
- * has sneaked into the range, we won't oops here: great).
- */
- page = follow_page(vma, start, FOLL_GET | FOLL_DUMP);
-
- if (page && !IS_ERR(page)) {
- if (PageTransTail(page)) {
- VM_BUG_ON_PAGE(PageMlocked(page), page);
- put_page(page); /* follow_page_mask() */
- } else if (PageTransHuge(page)) {
- lock_page(page);
- /*
- * Any THP page found by follow_page_mask() may
- * have gotten split before reaching
- * munlock_vma_page(), so we need to compute
- * the page_mask here instead.
- */
- page_mask = munlock_vma_page(page);
- unlock_page(page);
- put_page(page); /* follow_page_mask() */
- } else {
- /*
- * Non-huge pages are handled in batches via
- * pagevec. The pin from follow_page_mask()
- * prevents them from collapsing by THP.
- */
- pagevec_add(&pvec, page);
- zone = page_zone(page);
-
- /*
- * Try to fill the rest of pagevec using fast
- * pte walk. This will also update start to
- * the next page to process. Then munlock the
- * pagevec.
- */
- start = __munlock_pagevec_fill(&pvec, vma,
- zone, start, end);
- __munlock_pagevec(&pvec, zone);
- goto next;
- }
- }
- page_increm = 1 + page_mask;
- start += page_increm * PAGE_SIZE;
-next:
- cond_resched();
- }
+ /* Reimplementation to follow in later commit */
}

/*
@@ -645,6 +306,18 @@ static unsigned long count_mm_mlocked_page_nr(struct mm_struct *mm,
return count >> PAGE_SHIFT;
}

+/*
+ * convert get_user_pages() return value to posix mlock() error
+ */
+static int __mlock_posix_error_return(long retval)
+{
+ if (retval == -EFAULT)
+ retval = -ENOMEM;
+ else if (retval == -ENOMEM)
+ retval = -EAGAIN;
+ return retval;
+}
+
static __must_check int do_mlock(unsigned long start, size_t len, vm_flags_t flags)
{
unsigned long locked;
diff --git a/mm/rmap.c b/mm/rmap.c
index 6a1e8c7f6213..7ce7f1946cff 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1996,76 +1996,6 @@ void try_to_migrate(struct page *page, enum ttu_flags flags)
rmap_walk(page, &rwc);
}

-/*
- * Walks the vma's mapping a page and mlocks the page if any locked vma's are
- * found. Once one is found the page is locked and the scan can be terminated.
- */
-static bool page_mlock_one(struct page *page, struct vm_area_struct *vma,
- unsigned long address, void *unused)
-{
- struct page_vma_mapped_walk pvmw = {
- .page = page,
- .vma = vma,
- .address = address,
- };
-
- /* An un-locked vma doesn't have any pages to lock, continue the scan */
- if (!(vma->vm_flags & VM_LOCKED))
- return true;
-
- while (page_vma_mapped_walk(&pvmw)) {
- /*
- * Need to recheck under the ptl to serialise with
- * __munlock_pagevec_fill() after VM_LOCKED is cleared in
- * munlock_vma_pages_range().
- */
- if (vma->vm_flags & VM_LOCKED) {
- /*
- * PTE-mapped THP are never marked as mlocked; but
- * this function is never called on a DoubleMap THP,
- * nor on an Anon THP (which may still be PTE-mapped
- * after DoubleMap was cleared).
- */
- mlock_vma_page(page);
- /*
- * No need to scan further once the page is marked
- * as mlocked.
- */
- page_vma_mapped_walk_done(&pvmw);
- return false;
- }
- }
-
- return true;
-}
-
-/**
- * page_mlock - try to mlock a page
- * @page: the page to be mlocked
- *
- * Called from munlock code. Checks all of the VMAs mapping the page and mlocks
- * the page if any are found. The page will be returned with PG_mlocked cleared
- * if it is not mapped by any locked vmas.
- */
-void page_mlock(struct page *page)
-{
- struct rmap_walk_control rwc = {
- .rmap_one = page_mlock_one,
- .done = page_not_mapped,
- .anon_lock = page_lock_anon_vma_read,
-
- };
-
- VM_BUG_ON_PAGE(!PageLocked(page) || PageLRU(page), page);
- VM_BUG_ON_PAGE(PageCompound(page) && PageDoubleMap(page), page);
-
- /* Anon THP are only marked as mlocked when singly mapped */
- if (PageTransCompound(page) && PageAnon(page))
- return;
-
- rmap_walk(page, &rwc);
-}
-
#ifdef CONFIG_DEVICE_PRIVATE
struct make_exclusive_args {
struct mm_struct *mm;
@@ -2291,11 +2221,6 @@ static struct anon_vma *rmap_walk_anon_lock(struct page *page,
*
* Find all the mappings of a page using the mapping pointer and the vma chains
* contained in the anon_vma struct it points to.
- *
- * When called from page_mlock(), the mmap_lock of the mm containing the vma
- * where the page was found will be held for write. So, we won't recheck
- * vm_flags for that VMA. That should be OK, because that vma shouldn't be
- * LOCKED.
*/
static void rmap_walk_anon(struct page *page, struct rmap_walk_control *rwc,
bool locked)
@@ -2344,11 +2269,6 @@ static void rmap_walk_anon(struct page *page, struct rmap_walk_control *rwc,
*
* Find all the mappings of a page using the mapping pointer and the vma chains
* contained in the address_space struct it points to.
- *
- * When called from page_mlock(), the mmap_lock of the mm containing the vma
- * where the page was found will be held for write. So, we won't recheck
- * vm_flags for that VMA. That should be OK, because that vma shouldn't be
- * LOCKED.
*/
static void rmap_walk_file(struct page *page, struct rmap_walk_control *rwc,
bool locked)
--
2.34.1

2022-02-15 09:24:40

by Hugh Dickins

Subject: [PATCH v2 13/13] mm/thp: shrink_page_list() avoid splitting VM_LOCKED THP

4.8 commit 7751b2da6be0 ("vmscan: split file huge pages before paging
them out") inserted a split_huge_page_to_list() into shrink_page_list()
without considering the mlock case: no problem if the page has already
been marked as Mlocked (the !page_evictable check much higher up will
have skipped all this), but it has always been the case that races or
omissions in setting Mlocked can rely on page reclaim to detect this
and correct it before actually reclaiming - and that remains so, but
what a shame if a hugepage is needlessly split before discovering it.

It is surprising that page_check_references() returns PAGEREF_RECLAIM
when VM_LOCKED, but there was a good reason for that: try_to_unmap_one()
is where the condition is detected and corrected; and until now it could
not be done in page_referenced_one(), because that does not always have
the page locked. Now that mlock's requirement for page lock has gone,
copy try_to_unmap_one()'s mlock restoration into page_referenced_one(),
and let page_check_references() return PAGEREF_ACTIVATE in this case.

But page_referenced_one() may find a pte mapping one part of a hugepage:
what hold should a pte mapped in a VM_LOCKED area exert over the entire
huge page? That's debatable. The approach taken here is to treat that
pte mapping in page_referenced_one() as if not VM_LOCKED, and if no
VM_LOCKED pmd mapping is found later in the walk, and lack of reference
permits, then PAGEREF_RECLAIM takes it to attempted splitting as before.

Signed-off-by: Hugh Dickins <[email protected]>
---
v2: same as v1.

mm/rmap.c | 7 +++++--
mm/vmscan.c | 6 +++---
2 files changed, 8 insertions(+), 5 deletions(-)

diff --git a/mm/rmap.c b/mm/rmap.c
index 714bfdc72c7b..c7921c102bc0 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -812,7 +812,10 @@ static bool page_referenced_one(struct page *page, struct vm_area_struct *vma,
while (page_vma_mapped_walk(&pvmw)) {
address = pvmw.address;

- if (vma->vm_flags & VM_LOCKED) {
+ if ((vma->vm_flags & VM_LOCKED) &&
+ (!PageTransCompound(page) || !pvmw.pte)) {
+ /* Restore the mlock which got missed */
+ mlock_vma_page(page, vma, !pvmw.pte);
page_vma_mapped_walk_done(&pvmw);
pra->vm_flags |= VM_LOCKED;
return false; /* To break the loop */
@@ -851,7 +854,7 @@ static bool page_referenced_one(struct page *page, struct vm_area_struct *vma,

if (referenced) {
pra->referenced++;
- pra->vm_flags |= vma->vm_flags;
+ pra->vm_flags |= vma->vm_flags & ~VM_LOCKED;
}

if (!pra->mapcount)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 090bfb605ecf..a160efba3c73 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1386,11 +1386,11 @@ static enum page_references page_check_references(struct page *page,
referenced_page = TestClearPageReferenced(page);

/*
- * Mlock lost the isolation race with us. Let try_to_unmap()
- * move the page to the unevictable list.
+ * The supposedly reclaimable page was found to be in a VM_LOCKED vma.
+ * Let the page, now marked Mlocked, be moved to the unevictable list.
*/
if (vm_flags & VM_LOCKED)
- return PAGEREF_RECLAIM;
+ return PAGEREF_ACTIVATE;

if (referenced_ptes) {
/*
--
2.34.1

2022-02-15 10:55:36

by Hugh Dickins

Subject: [PATCH v2 09/13] mm/munlock: delete smp_mb() from __pagevec_lru_add_fn()

My reading of the comment on smp_mb__after_atomic() in __pagevec_lru_add_fn()
says that it can now be deleted; and that remains so when the next patch
is added.

Signed-off-by: Hugh Dickins <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
---
v2: same as v1.

mm/swap.c | 37 +++++++++----------------------------
1 file changed, 9 insertions(+), 28 deletions(-)

diff --git a/mm/swap.c b/mm/swap.c
index 682a03301a2c..3f770b1ea2c1 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -1025,37 +1025,18 @@ static void __pagevec_lru_add_fn(struct folio *folio, struct lruvec *lruvec)

VM_BUG_ON_FOLIO(folio_test_lru(folio), folio);

+ folio_set_lru(folio);
/*
- * A folio becomes evictable in two ways:
- * 1) Within LRU lock [munlock_vma_page() and __munlock_pagevec()].
- * 2) Before acquiring LRU lock to put the folio on the correct LRU
- * and then
- * a) do PageLRU check with lock [check_move_unevictable_pages]
- * b) do PageLRU check before lock [clear_page_mlock]
- *
- * (1) & (2a) are ok as LRU lock will serialize them. For (2b), we need
- * following strict ordering:
- *
- * #0: __pagevec_lru_add_fn #1: clear_page_mlock
- *
- * folio_set_lru() folio_test_clear_mlocked()
- * smp_mb() // explicit ordering // above provides strict
- * // ordering
- * folio_test_mlocked() folio_test_lru()
+ * Is an smp_mb__after_atomic() still required here, before
+ * folio_evictable() tests PageMlocked, to rule out the possibility
+ * of stranding an evictable folio on an unevictable LRU? I think
+ * not, because munlock_page() only clears PageMlocked while the LRU
+ * lock is held.
*
- *
- * if '#1' does not observe setting of PG_lru by '#0' and
- * fails isolation, the explicit barrier will make sure that
- * folio_evictable check will put the folio on the correct
- * LRU. Without smp_mb(), folio_set_lru() can be reordered
- * after folio_test_mlocked() check and can make '#1' fail the
- * isolation of the folio whose mlocked bit is cleared (#0 is
- * also looking at the same folio) and the evictable folio will
- * be stranded on an unevictable LRU.
+ * (That is not true of __page_cache_release(), and not necessarily
+ * true of release_pages(): but those only clear PageMlocked after
+ * put_page_testzero() has excluded any other users of the page.)
*/
- folio_set_lru(folio);
- smp_mb__after_atomic();
-
if (folio_evictable(folio)) {
if (was_unevictable)
__count_vm_events(UNEVICTABLE_PGRESCUED, nr_pages);
--
2.34.1

2022-02-15 21:03:14

by Matthew Wilcox

Subject: Re: [PATCH v2 04/13] mm/munlock: rmap call mlock_vma_page() munlock_vma_page()

On Mon, Feb 14, 2022 at 06:26:39PM -0800, Hugh Dickins wrote:
> Add vma argument to mlock_vma_page() and munlock_vma_page(), make them
> inline functions which check (vma->vm_flags & VM_LOCKED) before calling
> mlock_page() and munlock_page() in mm/mlock.c.
>
> Add bool compound to mlock_vma_page() and munlock_vma_page(): this is
> because we have understandable difficulty in accounting pte maps of THPs,
> and if passed a PageHead page, mlock_page() and munlock_page() cannot
> tell whether it's a pmd map to be counted or a pte map to be ignored.
>
[...]
>
> Mlock accounting on THPs has been hard to define, differed between anon
> and file, involved PageDoubleMap in some places and not others, required
> clear_page_mlock() at some points. Keep it simple now: just count the
> pmds and ignore the ptes, there is no reason for ptes to undo pmd mlocks.

How would you suggest we handle the accounting for folios which are
intermediate in size between PMDs and PTEs? eg, an order-4 page?
Would it make sense to increment mlock_count by HPAGE_PMD_NR for
each PMD mapping and by 1 for each PTE mapping?

2022-02-15 21:15:10

by Matthew Wilcox

Subject: Re: [PATCH v2 00/13] mm/munlock: rework of mlock+munlock page handling

On Mon, Feb 14, 2022 at 06:18:34PM -0800, Hugh Dickins wrote:
> Andrew, many thanks for including v1 and fixes in mmotm: please now replace
>
> mm-munlock-delete-page_mlock-and-all-its-works.patch
> mm-munlock-delete-foll_mlock-and-foll_populate.patch
> mm-munlock-delete-munlock_vma_pages_all-allow-oomreap.patch
> mm-munlock-rmap-call-mlock_vma_page-munlock_vma_page.patch
> mm-munlock-replace-clear_page_mlock-by-final-clearance.patch
> mm-munlock-maintain-page-mlock_count-while-unevictable.patch
> mm-munlock-mlock_pte_range-when-mlocking-or-munlocking.patch
> mm-migrate-__unmap_and_move-push-good-newpage-to-lru.patch
> mm-munlock-delete-smp_mb-from-__pagevec_lru_add_fn.patch
> mm-munlock-mlock_page-munlock_page-batch-by-pagevec.patch
> mm-munlock-mlock_page-munlock_page-batch-by-pagevec-fix.patch
> mm-munlock-mlock_page-munlock_page-batch-by-pagevec-fix-2.patch
> mm-munlock-page-migration-needs-mlock-pagevec-drained.patch
> mm-thp-collapse_file-do-try_to_unmapttu_batch_flush.patch
> mm-thp-shrink_page_list-avoid-splitting-vm_locked-thp.patch
>
> by the following thirteen of v2. As before, some easy fixups will be
> needed to apply in mm/huge_memory.c, but otherwise expected to be clean.
>
> Many thanks to Vlastimil Babka for his review of 01 through 11, and
> to Matthew Wilcox for graciously volunteering to rebase his over these.

I have now pushed these 13 patches to my for-next branch:

git://git.infradead.org/users/willy/pagecache.git for-next

and rebased my folio patches on top. Mostly that involved dropping
my mlock-related patches, although there were a few other adjustments
that needed to be made. That should make Stephen's merge resolution
much easier once Andrew drops v1 of these patches from his tree.

2022-02-16 03:55:33

by Hugh Dickins

[permalink] [raw]
Subject: Re: [PATCH v2 10/13] mm/munlock: mlock_page() munlock_page() batch by pagevec

On Tue, 15 Feb 2022, Matthew Wilcox wrote:
> On Mon, Feb 14, 2022 at 06:37:29PM -0800, Hugh Dickins wrote:
> > +/*
> > + * Flags held in the low bits of a struct page pointer on the mlock_pvec.
> > + */
> > +#define LRU_PAGE 0x1
> > +#define NEW_PAGE 0x2
> > +#define mlock_lru(page) ((struct page *)((unsigned long)page + LRU_PAGE))
> > +#define mlock_new(page) ((struct page *)((unsigned long)page + NEW_PAGE))
>
> You've tripped over one of the weirdnesses in the C preprocessor here.
> If the variable passed is not _named_ page, it gets cast to a pointer
> to a struct of the same name as the variable. There's no way to tell
> cpp that that 'page' after 'struct' is literal and not to be replaced
> by the 'page' argument.
>
> I'm going to change this to:
>
> static inline struct page *mlock_lru(struct page *page)
> {
> return (struct page *)((unsigned long)page + LRU_PAGE);
> }
>
> (mutatis mutandis for mlock_new)

Okay, thanks. (You're not naming your folio "page" :-?)

Hugh

2022-02-16 06:56:44

by Hugh Dickins

[permalink] [raw]
Subject: Re: [PATCH v2 04/13] mm/munlock: rmap call mlock_vma_page() munlock_vma_page()

On Tue, 15 Feb 2022, Matthew Wilcox wrote:
> On Mon, Feb 14, 2022 at 06:26:39PM -0800, Hugh Dickins wrote:
> > Add vma argument to mlock_vma_page() and munlock_vma_page(), make them
> > inline functions which check (vma->vm_flags & VM_LOCKED) before calling
> > mlock_page() and munlock_page() in mm/mlock.c.
> >
> > Add bool compound to mlock_vma_page() and munlock_vma_page(): this is
> > because we have understandable difficulty in accounting pte maps of THPs,
> > and if passed a PageHead page, mlock_page() and munlock_page() cannot
> > tell whether it's a pmd map to be counted or a pte map to be ignored.
> >
> [...]
> >
> > Mlock accounting on THPs has been hard to define, differed between anon
> > and file, involved PageDoubleMap in some places and not others, required
> > clear_page_mlock() at some points. Keep it simple now: just count the
> > pmds and ignore the ptes, there is no reason for ptes to undo pmd mlocks.
>
> How would you suggest we handle the accounting for folios which are
> intermediate in size between PMDs and PTEs? eg, an order-4 page?
> Would it make sense to increment mlock_count by HUGE_PMD_NR for
> each PMD mapping and by 1 for each PTE mapping?

I think you're asking the wrong question here, but perhaps you've
already decided there's only one satisfactory answer to the right question.

To answer what you've asked: it doesn't matter at all how you count them
in mlock_count, just so long as they are counted up and down consistently.
Since it's simplest just to count 1 in mlock_count for each pmd or pte,
I prefer that (as I did with THPs); but if you prefer to count pmds up
and down by HUGE_PMD_NR, that works too.
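
As a toy illustration (hypothetical code, not from the series; the struct,
field and helper names here are invented), either scheme is fine so long as
mlock and munlock adjust the count by the same unit:

/* Toy model only, not kernel code. */
struct toy_folio { long mlock_count; };

enum map_weight { PTE_MAP = 1, PMD_MAP = 512 /* e.g. HPAGE_PMD_NR */ };

static void toy_mlock(struct toy_folio *f, enum map_weight w)
{
	f->mlock_count += w;	/* scheme B; scheme A would always add 1 */
}

static void toy_munlock(struct toy_folio *f, enum map_weight w)
{
	f->mlock_count -= w;	/* must mirror toy_mlock exactly */
}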

Though, reading again, you're asking about a PMD mapping of an order-4
page? I don't understand how that could be allowed (except on some
non-x86 architecture where the page table fits only 16 pages).

The question I thought you should be asking is about how to count them
in Mlocked. That's tough; but I take it for granted that you would not
want per-subpage flags and counts involved (or not unless forced to do
so by some regression that turns out to matter). And I think the only
satisfactory answer is to count the whole compound_nr() as Mlocked
when any part of it (a single pte, a series of ptes, a pmd) is mlocked;
and (try to) move folio to Unevictable whenever any part of it is mlocked.

That differs from what Kirill decided for THPs (which I cannot
confidently describe, but something like count pmd as Mlocked, don't count
ptes as Mlocked, but uncount pmd if any ptes), and what I simplified it to
in the mm/munlock series (count pmd as Mlocked, ignore ptes), and will
tend to show larger numbers for Mlocked than before; but alternatives
seem unworkable to me.

Hugh

2022-02-16 07:04:56

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [PATCH v2 10/13] mm/munlock: mlock_page() munlock_page() batch by pagevec

On Mon, Feb 14, 2022 at 06:37:29PM -0800, Hugh Dickins wrote:
> +/*
> + * Flags held in the low bits of a struct page pointer on the mlock_pvec.
> + */
> +#define LRU_PAGE 0x1
> +#define NEW_PAGE 0x2
> +#define mlock_lru(page) ((struct page *)((unsigned long)page + LRU_PAGE))
> +#define mlock_new(page) ((struct page *)((unsigned long)page + NEW_PAGE))

You've tripped over one of the weirdnesses in the C preprocessor here.
If the variable passed is not _named_ page, it gets cast to a pointer
to a struct of the same name as the variable. There's no way to tell
cpp that that 'page' after 'struct' is literal and not to be replaced
by the 'page' argument.

I'm going to change this to:

static inline struct page *mlock_lru(struct page *page)
{
return (struct page *)((unsigned long)page + LRU_PAGE);
}

(mutatis mutandis for mlock_new)
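
A minimal illustration of that expansion (the variable name is chosen here
just for the demo):

/* With the macro form quoted above: */
#define LRU_PAGE 0x1
#define mlock_lru(page) ((struct page *)((unsigned long)page + LRU_PAGE))

struct folio;			/* any type name other than 'page' */
struct folio *folio;

/*
 * mlock_lru(folio) expands to:
 *
 *	((struct folio *)((unsigned long)folio + LRU_PAGE))
 *
 * because cpp replaces every 'page' token in the macro body, including
 * the one after 'struct', so the cast quietly targets the wrong type.
 * The inline function keeps 'struct page' literal.
 */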

2022-02-16 07:28:03

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [PATCH v2 04/13] mm/munlock: rmap call mlock_vma_page() munlock_vma_page()

On Tue, Feb 15, 2022 at 01:38:20PM -0800, Hugh Dickins wrote:
> On Tue, 15 Feb 2022, Matthew Wilcox wrote:
> > On Mon, Feb 14, 2022 at 06:26:39PM -0800, Hugh Dickins wrote:
> > > Add vma argument to mlock_vma_page() and munlock_vma_page(), make them
> > > inline functions which check (vma->vm_flags & VM_LOCKED) before calling
> > > mlock_page() and munlock_page() in mm/mlock.c.
> > >
> > > Add bool compound to mlock_vma_page() and munlock_vma_page(): this is
> > > because we have understandable difficulty in accounting pte maps of THPs,
> > > and if passed a PageHead page, mlock_page() and munlock_page() cannot
> > > tell whether it's a pmd map to be counted or a pte map to be ignored.
> > >
> > [...]
> > >
> > > Mlock accounting on THPs has been hard to define, differed between anon
> > > and file, involved PageDoubleMap in some places and not others, required
> > > clear_page_mlock() at some points. Keep it simple now: just count the
> > > pmds and ignore the ptes, there is no reason for ptes to undo pmd mlocks.
> >
> > How would you suggest we handle the accounting for folios which are
> > intermediate in size between PMDs and PTEs? eg, an order-4 page?
> > Would it make sense to increment mlock_count by HUGE_PMD_NR for
> > each PMD mapping and by 1 for each PTE mapping?
>
> I think you're asking the wrong question here, but perhaps you've
> already decided there's only one satisfactory answer to the right question.

Or I've gravely misunderstood the situation. Or explained my concern
badly. The possibilities are endless!

My concern is that a filesystem may create an order-4 folio, an
application mmaps the folio and then calls mlock() (either over a portion
or the entirety of the folio). As far as I can tell, we then do not
move the folio onto the unevictable list because it is of order >0 and
is only mapped by PTEs. This presumably then has performance problems
(or we wouldn't need to have an unevictable list in the first place).

> The question I thought you should be asking is about how to count them
> in Mlocked. That's tough; but I take it for granted that you would not
> want per-subpage flags and counts involved (or not unless forced to do
> so by some regression that turns out to matter). And I think the only
> satisfactory answer is to count the whole compound_nr() as Mlocked
> when any part of it (a single pte, a series of ptes, a pmd) is mlocked;
> and (try to) move folio to Unevictable whenever any part of it is mlocked.

I think that makes sense. As with so many other things, we choose to
manage memory in >PAGE_SIZE chunks. If you mlock() a part of a folio,
we lock the whole folio in memory, and it all counts as being locked.

2022-02-16 07:30:20

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [PATCH v2 10/13] mm/munlock: mlock_page() munlock_page() batch by pagevec

On Tue, Feb 15, 2022 at 01:02:30PM -0800, Hugh Dickins wrote:
> On Tue, 15 Feb 2022, Matthew Wilcox wrote:
> > > +#define mlock_new(page) ((struct page *)((unsigned long)page + NEW_PAGE))
> >
> > I'm going to change this to:
> >
> > static inline struct page *mlock_lru(struct page *page)
> > {
> > return (struct page *)((unsigned long)page + LRU_PAGE);
> > }
> >
> > (mutatis mutandi for mlock_new)
>
> Okay, thanks. (You're not naming your folio "page" :-?)

Worse, I tried passing it '&folio->page'. That produced some ...
interesting compiler error messages.
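
For reference, the original macro expands that argument as:

/*
 * mlock_new(&folio->page), with the macro form, expands to:
 *
 *	((struct &folio->page *)((unsigned long)&folio->page + NEW_PAGE))
 *
 * the token after 'struct' is replaced by the whole argument, which is
 * not valid syntax at all; hence the interesting error messages.
 */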

2022-02-18 06:42:43

by kernel test robot

[permalink] [raw]
Subject: [mm/munlock] 237b445401: stress-ng.remap.ops_per_sec -62.6% regression



Greeting,

FYI, we noticed a -62.6% regression of stress-ng.remap.ops_per_sec due to commit:


commit: 237b4454014d3759acc6459eb329c5e3d55113ed ("[PATCH v2 07/13] mm/munlock: mlock_pte_range() when mlocking or munlocking")
url: https://github.com/0day-ci/linux/commits/Hugh-Dickins/mm-munlock-rework-of-mlock-munlock-page-handling/20220215-104421
base: https://git.kernel.org/cgit/linux/kernel/git/tip/tip.git ee28855a54493ce83bc2a3fbe30210be61b57bc7
patch link: https://lore.kernel.org/lkml/[email protected]

in testcase: stress-ng
on test machine: 128 threads 2 sockets Intel(R) Xeon(R) Platinum 8358 CPU @ 2.60GHz with 128G memory
with following parameters:

nr_threads: 100%
testtime: 60s
class: memory
test: remap
cpufreq_governor: performance
ucode: 0xd000280




If you fix the issue, kindly add following tag
Reported-by: kernel test robot <[email protected]>


Details are as below:
-------------------------------------------------------------------------------------------------->


To reproduce:

git clone https://github.com/intel/lkp-tests.git
cd lkp-tests
sudo bin/lkp install job.yaml # job file is attached in this email
bin/lkp split-job --compatible job.yaml # generate the yaml file for lkp run
sudo bin/lkp run generated-yaml-file

# if come across any failure that blocks the test,
# please remove ~/.lkp and /lkp dir to run from a clean state.

=========================================================================================
class/compiler/cpufreq_governor/kconfig/nr_threads/rootfs/tbox_group/test/testcase/testtime/ucode:
memory/gcc-9/performance/x86_64-rhel-8.3/100%/debian-10.4-x86_64-20200603.cgz/lkp-icl-2sp6/remap/stress-ng/60s/0xd000280

commit:
c479426e09 ("mm/munlock: maintain page->mlock_count while unevictable")
237b445401 ("mm/munlock: mlock_pte_range() when mlocking or munlocking")

c479426e09c8088d 237b4454014d3759acc6459eb32
---------------- ---------------------------
%stddev %change %stddev
\ | \
109459 -62.5% 41003 ? 2% stress-ng.remap.ops
1823 -62.6% 682.54 ? 2% stress-ng.remap.ops_per_sec
2.242e+08 -62.5% 83989157 ? 2% stress-ng.time.minor_page_faults
30.00 ? 2% -61.2% 11.65 ? 4% stress-ng.time.user_time
0.48 -0.2 0.26 ? 2% mpstat.cpu.all.usr%
439776 +35.9% 597819 meminfo.Inactive
439776 +35.9% 597819 meminfo.Inactive(anon)
157638 -99.8% 247.00 meminfo.Mlocked
289539 ? 17% +30.0% 376365 ? 5% numa-meminfo.node0.Inactive
289539 ? 17% +30.0% 376365 ? 5% numa-meminfo.node0.Inactive(anon)
77981 -99.8% 126.50 numa-meminfo.node0.Mlocked
79990 ? 2% -99.9% 119.50 numa-meminfo.node1.Mlocked
72605 ? 17% +29.8% 94255 ? 5% numa-vmstat.node0.nr_inactive_anon
19407 -99.8% 31.50 ? 2% numa-vmstat.node0.nr_mlock
72603 ? 17% +29.8% 94255 ? 5% numa-vmstat.node0.nr_zone_inactive_anon
20066 ? 2% -99.9% 29.50 numa-vmstat.node1.nr_mlock
110126 +35.6% 149376 proc-vmstat.nr_inactive_anon
61898 +1.2% 62642 proc-vmstat.nr_mapped
39411 -99.8% 61.33 proc-vmstat.nr_mlock
661395 -5.9% 622058 proc-vmstat.nr_unevictable
110126 +35.6% 149376 proc-vmstat.nr_zone_inactive_anon
661395 -5.9% 622058 proc-vmstat.nr_zone_unevictable
1039724 -17.6% 856602 proc-vmstat.numa_hit
928057 -20.2% 740906 proc-vmstat.numa_local
1039735 -17.7% 855482 proc-vmstat.pgalloc_normal
2.247e+08 -62.4% 84468472 ? 2% proc-vmstat.pgfault
802397 ? 4% -26.1% 593350 ? 7% proc-vmstat.pgfree
55865 ? 38% -100.0% 0.00 proc-vmstat.unevictable_pgs_cleared
1.121e+08 +49.8% 1.678e+08 ? 2% proc-vmstat.unevictable_pgs_culled
1.121e+08 +49.8% 1.678e+08 ? 2% proc-vmstat.unevictable_pgs_mlocked
1.12e+08 +49.9% 1.678e+08 ? 2% proc-vmstat.unevictable_pgs_munlocked
1.12e+08 +49.9% 1.678e+08 ? 2% proc-vmstat.unevictable_pgs_rescued
3.06 ? 11% +33.6% 4.09 ? 9% perf-stat.i.MPKI
1.555e+10 -29.1% 1.102e+10 perf-stat.i.branch-instructions
0.63 ? 3% -0.2 0.47 ? 4% perf-stat.i.branch-miss-rate%
86979814 -49.4% 44006612 ? 2% perf-stat.i.branch-misses
5.57 +45.0% 8.08 perf-stat.i.cpi
0.03 ? 8% -0.0 0.02 ? 12% perf-stat.i.dTLB-load-miss-rate%
4465030 -50.5% 2211973 ? 3% perf-stat.i.dTLB-load-misses
1.882e+10 -32.8% 1.265e+10 perf-stat.i.dTLB-loads
6.515e+09 -52.5% 3.095e+09 ? 2% perf-stat.i.dTLB-stores
7.012e+10 -30.9% 4.842e+10 perf-stat.i.instructions
0.20 ? 2% -26.4% 0.15 ? 2% perf-stat.i.ipc
321.02 -34.4% 210.63 perf-stat.i.metric.M/sec
2061534 -57.3% 879696 ? 2% perf-stat.i.node-loads
531446 ? 7% -31.1% 366181 ? 21% perf-stat.i.node-stores
2.89 ? 9% +37.5% 3.97 ? 10% perf-stat.overall.MPKI
0.56 -0.2 0.40 ? 2% perf-stat.overall.branch-miss-rate%
5.71 +45.1% 8.28 perf-stat.overall.cpi
0.02 -0.0 0.02 ? 2% perf-stat.overall.dTLB-load-miss-rate%
0.00 ? 11% +0.0 0.00 ? 11% perf-stat.overall.dTLB-store-miss-rate%
0.18 -31.1% 0.12 perf-stat.overall.ipc
1.532e+10 -29.1% 1.086e+10 perf-stat.ps.branch-instructions
85628106 -49.4% 43291977 ? 2% perf-stat.ps.branch-misses
4429976 -49.9% 2220756 ? 3% perf-stat.ps.dTLB-load-misses
1.854e+10 -32.8% 1.246e+10 perf-stat.ps.dTLB-loads
6.417e+09 -52.5% 3.049e+09 ? 2% perf-stat.ps.dTLB-stores
6.906e+10 -31.0% 4.769e+10 perf-stat.ps.instructions
2037560 -57.2% 872256 ? 2% perf-stat.ps.node-loads
522322 ? 7% -31.0% 360168 ? 21% perf-stat.ps.node-stores
4.39e+12 -31.2% 3.021e+12 perf-stat.total.instructions
98.28 -47.8 50.50 perf-profile.calltrace.cycles-pp.remap_file_pages
98.14 -47.7 50.44 perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_after_hwframe.remap_file_pages
98.07 -47.6 50.42 perf-profile.calltrace.cycles-pp.do_syscall_64.entry_SYSCALL_64_after_hwframe.remap_file_pages
98.03 -47.6 50.40 perf-profile.calltrace.cycles-pp.__x64_sys_remap_file_pages.do_syscall_64.entry_SYSCALL_64_after_hwframe.remap_file_pages
50.17 -24.4 25.79 perf-profile.calltrace.cycles-pp.do_mmap.__x64_sys_remap_file_pages.do_syscall_64.entry_SYSCALL_64_after_hwframe.remap_file_pages
50.01 -24.3 25.72 perf-profile.calltrace.cycles-pp.mmap_region.do_mmap.__x64_sys_remap_file_pages.do_syscall_64.entry_SYSCALL_64_after_hwframe
48.46 -23.3 25.16 perf-profile.calltrace.cycles-pp.__do_munmap.mmap_region.do_mmap.__x64_sys_remap_file_pages.do_syscall_64
47.70 -23.1 24.56 perf-profile.calltrace.cycles-pp.__mm_populate.__x64_sys_remap_file_pages.do_syscall_64.entry_SYSCALL_64_after_hwframe.remap_file_pages
47.52 -23.0 24.49 perf-profile.calltrace.cycles-pp.populate_vma_page_range.__mm_populate.__x64_sys_remap_file_pages.do_syscall_64.entry_SYSCALL_64_after_hwframe
47.50 -23.0 24.49 perf-profile.calltrace.cycles-pp.__get_user_pages.populate_vma_page_range.__mm_populate.__x64_sys_remap_file_pages.do_syscall_64
47.80 -22.9 24.92 perf-profile.calltrace.cycles-pp.unmap_region.__do_munmap.mmap_region.do_mmap.__x64_sys_remap_file_pages
47.14 -22.8 24.36 perf-profile.calltrace.cycles-pp.handle_mm_fault.__get_user_pages.populate_vma_page_range.__mm_populate.__x64_sys_remap_file_pages
47.06 -22.7 24.33 perf-profile.calltrace.cycles-pp.__handle_mm_fault.handle_mm_fault.__get_user_pages.populate_vma_page_range.__mm_populate
46.97 -22.7 24.30 perf-profile.calltrace.cycles-pp.do_fault.__handle_mm_fault.handle_mm_fault.__get_user_pages.populate_vma_page_range
46.94 -22.7 24.29 perf-profile.calltrace.cycles-pp.filemap_map_pages.do_fault.__handle_mm_fault.handle_mm_fault.__get_user_pages
46.56 -22.4 24.12 perf-profile.calltrace.cycles-pp.do_set_pte.filemap_map_pages.do_fault.__handle_mm_fault.handle_mm_fault
46.91 -22.3 24.58 perf-profile.calltrace.cycles-pp.unmap_vmas.unmap_region.__do_munmap.mmap_region.do_mmap
46.38 -22.3 24.06 perf-profile.calltrace.cycles-pp.mlock_page.do_set_pte.filemap_map_pages.do_fault.__handle_mm_fault
46.84 -22.3 24.55 perf-profile.calltrace.cycles-pp.unmap_page_range.unmap_vmas.unmap_region.__do_munmap.mmap_region
46.72 -22.2 24.51 perf-profile.calltrace.cycles-pp.zap_pte_range.unmap_page_range.unmap_vmas.unmap_region.__do_munmap
45.95 -22.1 23.82 perf-profile.calltrace.cycles-pp.folio_lruvec_lock_irq.mlock_page.do_set_pte.filemap_map_pages.do_fault
45.82 -22.0 23.78 perf-profile.calltrace.cycles-pp._raw_spin_lock_irq.folio_lruvec_lock_irq.mlock_page.do_set_pte.filemap_map_pages
45.64 -21.9 23.72 perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock_irq.folio_lruvec_lock_irq.mlock_page.do_set_pte
46.18 -21.9 24.32 perf-profile.calltrace.cycles-pp.munlock_page.zap_pte_range.unmap_page_range.unmap_vmas.unmap_region
45.70 -21.7 24.00 perf-profile.calltrace.cycles-pp.folio_lruvec_lock_irq.munlock_page.zap_pte_range.unmap_page_range.unmap_vmas
45.61 -21.6 23.97 perf-profile.calltrace.cycles-pp._raw_spin_lock_irq.folio_lruvec_lock_irq.munlock_page.zap_pte_range.unmap_page_range
45.42 -21.5 23.91 perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock_irq.folio_lruvec_lock_irq.munlock_page.zap_pte_range
1.05 ? 2% +23.7 24.78 perf-profile.calltrace.cycles-pp.mlock
0.93 ? 2% +23.8 24.74 perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_after_hwframe.mlock
0.91 ? 2% +23.8 24.73 perf-profile.calltrace.cycles-pp.do_syscall_64.entry_SYSCALL_64_after_hwframe.mlock
0.88 ? 2% +23.8 24.72 perf-profile.calltrace.cycles-pp.__x64_sys_mlock.do_syscall_64.entry_SYSCALL_64_after_hwframe.mlock
0.87 ? 2% +23.8 24.71 perf-profile.calltrace.cycles-pp.do_mlock.__x64_sys_mlock.do_syscall_64.entry_SYSCALL_64_after_hwframe.mlock
0.00 +23.9 23.86 perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock_irq.folio_lruvec_lock_irq.munlock_page.mlock_pte_range
0.00 +23.9 23.87 perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock_irq.folio_lruvec_lock_irq.mlock_page.mlock_pte_range
0.00 +23.9 23.92 perf-profile.calltrace.cycles-pp._raw_spin_lock_irq.folio_lruvec_lock_irq.munlock_page.mlock_pte_range.walk_p4d_range
0.00 +23.9 23.92 perf-profile.calltrace.cycles-pp._raw_spin_lock_irq.folio_lruvec_lock_irq.mlock_page.mlock_pte_range.walk_p4d_range
0.00 +24.0 23.96 perf-profile.calltrace.cycles-pp.folio_lruvec_lock_irq.munlock_page.mlock_pte_range.walk_p4d_range.__walk_page_range
0.00 +24.0 23.97 perf-profile.calltrace.cycles-pp.folio_lruvec_lock_irq.mlock_page.mlock_pte_range.walk_p4d_range.__walk_page_range
0.54 +24.1 24.63 perf-profile.calltrace.cycles-pp.munlock
0.00 +24.2 24.22 perf-profile.calltrace.cycles-pp.mlock_page.mlock_pte_range.walk_p4d_range.__walk_page_range.walk_page_range
0.00 +24.3 24.28 perf-profile.calltrace.cycles-pp.munlock_page.mlock_pte_range.walk_p4d_range.__walk_page_range.walk_page_range
0.00 +24.3 24.33 perf-profile.calltrace.cycles-pp.__walk_page_range.walk_page_range.mlock_fixup.apply_vma_lock_flags.do_mlock
0.00 +24.3 24.35 perf-profile.calltrace.cycles-pp.walk_page_range.mlock_fixup.apply_vma_lock_flags.do_mlock.__x64_sys_mlock
0.00 +24.4 24.36 perf-profile.calltrace.cycles-pp.__walk_page_range.walk_page_range.mlock_fixup.apply_vma_lock_flags.__x64_sys_munlock
0.00 +24.4 24.38 perf-profile.calltrace.cycles-pp.walk_page_range.mlock_fixup.apply_vma_lock_flags.__x64_sys_munlock.do_syscall_64
0.00 +24.5 24.48 perf-profile.calltrace.cycles-pp.mlock_fixup.apply_vma_lock_flags.__x64_sys_munlock.do_syscall_64.entry_SYSCALL_64_after_hwframe
0.00 +24.5 24.48 perf-profile.calltrace.cycles-pp.mlock_fixup.apply_vma_lock_flags.do_mlock.__x64_sys_mlock.do_syscall_64
0.00 +24.5 24.50 perf-profile.calltrace.cycles-pp.apply_vma_lock_flags.__x64_sys_munlock.do_syscall_64.entry_SYSCALL_64_after_hwframe.munlock
0.00 +24.5 24.53 perf-profile.calltrace.cycles-pp.apply_vma_lock_flags.do_mlock.__x64_sys_mlock.do_syscall_64.entry_SYSCALL_64_after_hwframe
0.00 +24.6 24.56 perf-profile.calltrace.cycles-pp.__x64_sys_munlock.do_syscall_64.entry_SYSCALL_64_after_hwframe.munlock
0.00 +24.6 24.58 perf-profile.calltrace.cycles-pp.do_syscall_64.entry_SYSCALL_64_after_hwframe.munlock
0.00 +24.6 24.58 perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_after_hwframe.munlock
0.00 +48.6 48.62 perf-profile.calltrace.cycles-pp.mlock_pte_range.walk_p4d_range.__walk_page_range.walk_page_range.mlock_fixup
0.00 +48.7 48.68 perf-profile.calltrace.cycles-pp.walk_p4d_range.__walk_page_range.walk_page_range.mlock_fixup.apply_vma_lock_flags
98.34 -47.8 50.51 perf-profile.children.cycles-pp.remap_file_pages
98.04 -47.6 50.41 perf-profile.children.cycles-pp.__x64_sys_remap_file_pages
50.18 -24.4 25.79 perf-profile.children.cycles-pp.do_mmap
50.03 -24.3 25.73 perf-profile.children.cycles-pp.mmap_region
48.04 -23.4 24.68 perf-profile.children.cycles-pp.__mm_populate
48.47 -23.3 25.17 perf-profile.children.cycles-pp.__do_munmap
47.76 -23.2 24.58 perf-profile.children.cycles-pp.populate_vma_page_range
47.74 -23.2 24.57 perf-profile.children.cycles-pp.__get_user_pages
47.81 -22.9 24.93 perf-profile.children.cycles-pp.unmap_region
47.14 -22.8 24.36 perf-profile.children.cycles-pp.handle_mm_fault
47.06 -22.7 24.34 perf-profile.children.cycles-pp.__handle_mm_fault
46.98 -22.7 24.31 perf-profile.children.cycles-pp.do_fault
46.96 -22.7 24.30 perf-profile.children.cycles-pp.filemap_map_pages
46.57 -22.4 24.13 perf-profile.children.cycles-pp.do_set_pte
46.92 -22.3 24.58 perf-profile.children.cycles-pp.unmap_vmas
46.85 -22.3 24.56 perf-profile.children.cycles-pp.unmap_page_range
46.74 -22.2 24.52 perf-profile.children.cycles-pp.zap_pte_range
0.60 -0.4 0.23 ? 4% perf-profile.children.cycles-pp.tlb_finish_mmu
0.57 -0.3 0.22 ? 4% perf-profile.children.cycles-pp.tlb_flush_mmu
0.45 ? 2% -0.3 0.16 ? 5% perf-profile.children.cycles-pp.perf_event_mmap
0.38 ? 2% -0.2 0.14 ? 6% perf-profile.children.cycles-pp.vma_link
0.38 ? 2% -0.2 0.14 ? 4% perf-profile.children.cycles-pp.flush_tlb_mm_range
0.37 -0.2 0.16 ? 4% perf-profile.children.cycles-pp.find_vma
0.31 ? 3% -0.2 0.11 ? 3% perf-profile.children.cycles-pp.__vma_adjust
0.29 ? 2% -0.2 0.10 ? 3% perf-profile.children.cycles-pp.__split_vma
0.30 ? 2% -0.2 0.11 ? 6% perf-profile.children.cycles-pp.flush_tlb_func
0.28 ? 2% -0.2 0.10 ? 4% perf-profile.children.cycles-pp.kmem_cache_alloc
0.28 ? 2% -0.2 0.10 ? 5% perf-profile.children.cycles-pp.remove_vma
0.26 ? 2% -0.2 0.10 ? 6% perf-profile.children.cycles-pp.native_flush_tlb_one_user
0.25 -0.2 0.09 ? 5% perf-profile.children.cycles-pp.vm_area_alloc
0.22 -0.1 0.08 ? 4% perf-profile.children.cycles-pp.vma_merge
0.20 ? 2% -0.1 0.06 perf-profile.children.cycles-pp.__mod_lruvec_page_state
0.22 ? 3% -0.1 0.08 ? 4% perf-profile.children.cycles-pp.vma_interval_tree_insert
0.19 ? 2% -0.1 0.06 perf-profile.children.cycles-pp.follow_page_pte
0.18 ? 4% -0.1 0.05 perf-profile.children.cycles-pp.page_remove_rmap
0.19 ? 2% -0.1 0.07 ? 6% perf-profile.children.cycles-pp.free_pgtables
0.18 ? 3% -0.1 0.07 ? 7% perf-profile.children.cycles-pp.syscall_return_via_sysret
0.21 ? 3% -0.1 0.10 ? 5% perf-profile.children.cycles-pp.__might_resched
0.17 ? 3% -0.1 0.06 ? 9% perf-profile.children.cycles-pp.kmem_cache_free
0.18 ? 2% -0.1 0.07 ? 5% perf-profile.children.cycles-pp.sync_mm_rss
0.17 ? 2% -0.1 0.06 perf-profile.children.cycles-pp.__entry_text_start
0.15 ? 3% -0.1 0.04 ? 44% perf-profile.children.cycles-pp.page_add_file_rmap
0.16 ? 3% -0.1 0.06 perf-profile.children.cycles-pp.__cond_resched
0.12 -0.1 0.02 ? 99% perf-profile.children.cycles-pp.follow_page_mask
0.15 ? 2% -0.1 0.06 ? 8% perf-profile.children.cycles-pp.unlink_file_vma
0.14 ? 3% -0.1 0.05 ? 7% perf-profile.children.cycles-pp.perf_iterate_sb
0.14 ? 3% -0.1 0.05 perf-profile.children.cycles-pp.down_write_killable
0.12 ? 3% -0.1 0.05 perf-profile.children.cycles-pp.folio_unlock
0.12 ? 3% -0.1 0.06 ? 9% perf-profile.children.cycles-pp.vmacache_find
0.12 ? 3% -0.1 0.06 perf-profile.children.cycles-pp._raw_spin_lock
0.12 ? 3% -0.0 0.07 ? 5% perf-profile.children.cycles-pp.next_uptodate_page
0.14 -0.0 0.12 ? 9% perf-profile.children.cycles-pp.sysvec_apic_timer_interrupt
0.12 ? 4% -0.0 0.11 ? 10% perf-profile.children.cycles-pp.__sysvec_apic_timer_interrupt
0.12 ? 4% -0.0 0.11 ? 10% perf-profile.children.cycles-pp.hrtimer_interrupt
0.07 ? 5% -0.0 0.05 ? 7% perf-profile.children.cycles-pp.__mod_lruvec_state
0.12 ? 3% -0.0 0.10 ? 3% perf-profile.children.cycles-pp.up_write
0.07 +0.0 0.10 ? 4% perf-profile.children.cycles-pp.__list_add_valid
0.14 ? 4% +0.1 0.20 ? 7% perf-profile.children.cycles-pp.mem_cgroup_update_lru_size
0.00 +0.1 0.10 ? 5% perf-profile.children.cycles-pp.__list_del_entry_valid
99.53 +0.3 99.81 perf-profile.children.cycles-pp.entry_SYSCALL_64_after_hwframe
99.44 +0.3 99.78 perf-profile.children.cycles-pp.do_syscall_64
46.40 +1.9 48.30 perf-profile.children.cycles-pp.mlock_page
46.21 +2.4 48.62 perf-profile.children.cycles-pp.munlock_page
91.65 +4.1 95.75 perf-profile.children.cycles-pp.folio_lruvec_lock_irq
91.44 +4.2 95.60 perf-profile.children.cycles-pp._raw_spin_lock_irq
91.09 +4.3 95.38 perf-profile.children.cycles-pp.native_queued_spin_lock_slowpath
1.09 ? 2% +23.7 24.80 perf-profile.children.cycles-pp.mlock
0.88 ? 2% +23.8 24.71 perf-profile.children.cycles-pp.do_mlock
0.88 ? 2% +23.8 24.72 perf-profile.children.cycles-pp.__x64_sys_mlock
0.58 ? 2% +24.1 24.64 perf-profile.children.cycles-pp.munlock
0.36 ? 2% +24.2 24.56 perf-profile.children.cycles-pp.__x64_sys_munlock
0.74 ? 2% +48.3 49.04 perf-profile.children.cycles-pp.apply_vma_lock_flags
0.59 ? 2% +48.4 48.98 perf-profile.children.cycles-pp.mlock_fixup
0.00 +48.6 48.62 perf-profile.children.cycles-pp.mlock_pte_range
0.00 +48.7 48.68 perf-profile.children.cycles-pp.walk_p4d_range
0.00 +48.7 48.70 perf-profile.children.cycles-pp.__walk_page_range
0.00 +48.7 48.74 perf-profile.children.cycles-pp.walk_page_range
0.32 ? 2% -0.2 0.12 ? 5% perf-profile.self.cycles-pp.mmap_region
0.26 ? 2% -0.2 0.10 ? 8% perf-profile.self.cycles-pp.native_flush_tlb_one_user
0.24 ? 2% -0.1 0.10 ? 6% perf-profile.self.cycles-pp.find_vma
0.21 ? 3% -0.1 0.08 ? 4% perf-profile.self.cycles-pp.vma_interval_tree_insert
0.37 ? 2% -0.1 0.24 ? 6% perf-profile.self.cycles-pp._raw_spin_lock_irq
0.18 ? 2% -0.1 0.07 ? 7% perf-profile.self.cycles-pp.syscall_return_via_sysret
0.18 ? 3% -0.1 0.07 ? 5% perf-profile.self.cycles-pp.sync_mm_rss
0.19 ? 3% -0.1 0.09 ? 5% perf-profile.self.cycles-pp.__might_resched
0.12 ? 4% -0.1 0.02 ? 99% perf-profile.self.cycles-pp.__do_munmap
0.12 ? 3% -0.1 0.05 perf-profile.self.cycles-pp.folio_unlock
0.10 ? 3% -0.1 0.03 ? 70% perf-profile.self.cycles-pp.vmacache_find
0.22 ? 4% -0.1 0.15 ? 2% perf-profile.self.cycles-pp.folio_lruvec_lock_irq
0.11 ? 3% -0.1 0.06 ? 6% perf-profile.self.cycles-pp._raw_spin_lock
0.12 ? 3% -0.1 0.07 ? 7% perf-profile.self.cycles-pp.next_uptodate_page
0.24 ? 3% -0.0 0.21 ? 5% perf-profile.self.cycles-pp.mlock_page
0.07 +0.0 0.10 ? 4% perf-profile.self.cycles-pp.__list_add_valid
0.00 +0.1 0.05 perf-profile.self.cycles-pp.walk_p4d_range
0.00 +0.1 0.05 ? 8% perf-profile.self.cycles-pp.mlock_pte_range
0.19 ? 2% +0.1 0.25 ? 6% perf-profile.self.cycles-pp.munlock_page
0.14 ? 4% +0.1 0.20 ? 7% perf-profile.self.cycles-pp.mem_cgroup_update_lru_size
0.00 +0.1 0.10 ? 7% perf-profile.self.cycles-pp.__list_del_entry_valid
91.08 +4.3 95.38 perf-profile.self.cycles-pp.native_queued_spin_lock_slowpath




Disclaimer:
Results have been estimated based on internal Intel analysis and are provided
for informational purposes only. Any difference in system hardware or software
design or configuration may affect actual performance.


---
0DAY/LKP+ Test Infrastructure Open Source Technology Center
https://lists.01.org/hyperkitty/list/[email protected] Intel Corporation

Thanks,
Oliver Sang


Attachments:
config-5.17.0-rc2-00015-g237b4454014d (177.35 kB)
job-script (8.18 kB)
job.yaml (5.52 kB)
reproduce (349.00 B)

2022-02-18 10:55:23

by Hugh Dickins

[permalink] [raw]
Subject: Re: [mm/munlock] 237b445401: stress-ng.remap.ops_per_sec -62.6% regression

On Fri, 18 Feb 2022, kernel test robot wrote:
>
>
> Greeting,
>
> FYI, we noticed a -62.6% regression of stress-ng.remap.ops_per_sec due to commit:
>
>
> commit: 237b4454014d3759acc6459eb329c5e3d55113ed ("[PATCH v2 07/13] mm/munlock: mlock_pte_range() when mlocking or munlocking")
> url: https://github.com/0day-ci/linux/commits/Hugh-Dickins/mm-munlock-rework-of-mlock-munlock-page-handling/20220215-104421
> base: https://git.kernel.org/cgit/linux/kernel/git/tip/tip.git ee28855a54493ce83bc2a3fbe30210be61b57bc7
> patch link: https://lore.kernel.org/lkml/[email protected]
>
> in testcase: stress-ng
> on test machine: 128 threads 2 sockets Intel(R) Xeon(R) Platinum 8358 CPU @ 2.60GHz with 128G memory
> with following parameters:
>
> nr_threads: 100%
> testtime: 60s
> class: memory
> test: remap
> cpufreq_governor: performance
> ucode: 0xd000280
>
>
>
>
> If you fix the issue, kindly add following tag
> Reported-by: kernel test robot <[email protected]>
>
>
> Details are as below:
> -------------------------------------------------------------------------------------------------->
>
>
> To reproduce:
>
> git clone https://github.com/intel/lkp-tests.git
> cd lkp-tests
> sudo bin/lkp install job.yaml # job file is attached in this email
> bin/lkp split-job --compatible job.yaml # generate the yaml file for lkp run
> sudo bin/lkp run generated-yaml-file
>
> # if come across any failure that blocks the test,
> # please remove ~/.lkp and /lkp dir to run from a clean state.
>
> =========================================================================================
> class/compiler/cpufreq_governor/kconfig/nr_threads/rootfs/tbox_group/test/testcase/testtime/ucode:
> memory/gcc-9/performance/x86_64-rhel-8.3/100%/debian-10.4-x86_64-20200603.cgz/lkp-icl-2sp6/remap/stress-ng/60s/0xd000280
>
> commit:
> c479426e09 ("mm/munlock: maintain page->mlock_count while unevictable")
> 237b445401 ("mm/munlock: mlock_pte_range() when mlocking or munlocking")
>
> c479426e09c8088d 237b4454014d3759acc6459eb32
> ---------------- ---------------------------
> %stddev %change %stddev
> \ | \
> 109459 -62.5% 41003 ? 2% stress-ng.remap.ops
> 1823 -62.6% 682.54 ? 2% stress-ng.remap.ops_per_sec
> 2.242e+08 -62.5% 83989157 ? 2% stress-ng.time.minor_page_faults
> 30.00 ? 2% -61.2% 11.65 ? 4% stress-ng.time.user_time

Thanks a lot for trying it out, I did hope that you would find something.

However, IIUC, this by itself is not very interesting:
the comparison is between c479426e09 (06/13) as base and 237b445401?
237b445401 is 07/13, "Fill in missing pieces", where the series gets
to be correct again, after dropping the old implementation and piecing
together the rest of the new implementation. It's not a surprise that
those tests which need what's added back in 07/13 will get much slower
at this stage. And later 10/13 brings in a pagevec to speed it up.

What would be much more interesting is to treat the series of 13 as one,
and compare the baseline before any of it against the end of the series:
is that something that the 0day robot can easily do?

But I'll look more closely at the numbers you've provided later today,
numbers that I've snipped off here: there may still be useful things to
learn from them; and maybe I'll try following the instructions you've
supplied, though I probably won't do a good job of following them.

Thanks,
Hugh

2022-02-21 09:59:41

by Hugh Dickins

[permalink] [raw]
Subject: Re: [mm/munlock] 237b445401: stress-ng.remap.ops_per_sec -62.6% regression

On Fri, 18 Feb 2022, Hugh Dickins wrote:
> On Fri, 18 Feb 2022, kernel test robot wrote:
> >
> > Greeting,
> >
> > FYI, we noticed a -62.6% regression of stress-ng.remap.ops_per_sec due to commit:
> >
> >
> > commit: 237b4454014d3759acc6459eb329c5e3d55113ed ("[PATCH v2 07/13] mm/munlock: mlock_pte_range() when mlocking or munlocking")
> > url: https://github.com/0day-ci/linux/commits/Hugh-Dickins/mm-munlock-rework-of-mlock-munlock-page-handling/20220215-104421
> > base: https://git.kernel.org/cgit/linux/kernel/git/tip/tip.git ee28855a54493ce83bc2a3fbe30210be61b57bc7
> > patch link: https://lore.kernel.org/lkml/[email protected]
> >
> > in testcase: stress-ng
> > on test machine: 128 threads 2 sockets Intel(R) Xeon(R) Platinum 8358 CPU @ 2.60GHz with 128G memory
> > with following parameters:
> >
> > nr_threads: 100%
> > testtime: 60s
> > class: memory
> > test: remap
> > cpufreq_governor: performance
> > ucode: 0xd000280
> >
> >
> >
> >
> > If you fix the issue, kindly add following tag
> > Reported-by: kernel test robot <[email protected]>
> >
> >
> > Details are as below:
> > -------------------------------------------------------------------------------------------------->
> >
> >
> > To reproduce:
> >
> > git clone https://github.com/intel/lkp-tests.git
> > cd lkp-tests
> > sudo bin/lkp install job.yaml # job file is attached in this email
> > bin/lkp split-job --compatible job.yaml # generate the yaml file for lkp run
> > sudo bin/lkp run generated-yaml-file
> >
> > # if come across any failure that blocks the test,
> > # please remove ~/.lkp and /lkp dir to run from a clean state.
> >
> > =========================================================================================
> > class/compiler/cpufreq_governor/kconfig/nr_threads/rootfs/tbox_group/test/testcase/testtime/ucode:
> > memory/gcc-9/performance/x86_64-rhel-8.3/100%/debian-10.4-x86_64-20200603.cgz/lkp-icl-2sp6/remap/stress-ng/60s/0xd000280
> >
> > commit:
> > c479426e09 ("mm/munlock: maintain page->mlock_count while unevictable")
> > 237b445401 ("mm/munlock: mlock_pte_range() when mlocking or munlocking")
> >
> > c479426e09c8088d 237b4454014d3759acc6459eb32
> > ---------------- ---------------------------
> > %stddev %change %stddev
> > \ | \
> > 109459 -62.5% 41003 ? 2% stress-ng.remap.ops
> > 1823 -62.6% 682.54 ? 2% stress-ng.remap.ops_per_sec
> > 2.242e+08 -62.5% 83989157 ? 2% stress-ng.time.minor_page_faults
> > 30.00 ? 2% -61.2% 11.65 ? 4% stress-ng.time.user_time
>
> Thanks a lot for trying it out, I did hope that you would find something.
>
> However, IIUC, this by itself is not very interesting:
> the comparison is between c479426e09 (06/13) as base and 237b445401?
> 237b445401 is 07/13, "Fill in missing pieces", where the series gets
> to be correct again, after dropping the old implementation and piecing
> together the rest of the new implementation. It's not a surprise that
> those tests which need what's added back in 07/13 will get much slower
> at this stage. And later 10/13 brings in a pagevec to speed it up.

I probably did not understand correctly: I expect you did try the whole
series at first, found a regression, and then bisected to that commit.

>
> What would be much more interesting is to treat the series of 13 as one,
> and compare the baseline before any of it against the end of the series:
> is that something that the 0day robot can easily do?

That would still be more interesting to me - though probably not
actionable (see below), so not worth you going to any effort. But
I hope the bad result on this test did not curtail further testing.

>
> But I'll look more closely at the numbers you've provided later today,
> numbers that I've snipped off here: there may still be useful things to
> learn from them; and maybe I'll try following the instructions you've
> supplied, though I probably won't do a good job of following them.

I did look more closely, didn't try lkp itself, but did observe
stress-ng --timeout 60 --times --verify --metrics-brief --remap 128
in the reproduce file, and followed that up (but with 8 not 128).
In my config, the series managed about half the ops as the baseline.

Comparison of unevictable_pgs in /proc/vmstat was instructive.
Baseline 5.17-rc: With mm/munlock series applied:
unevictable_pgs_culled 17 53097984
unevictable_pgs_rescued 17 53097984
unevictable_pgs_mlocked 97251331 53097984
unevictable_pgs_munlocked 97251331 53097984

I was very surprised by those low culled/rescued baseline numbers,
but a look at stress-remap-file-pages.c shows that each thread is
repetitively doing mlock of one page, remap_file_pages on that address
(with implicit munlock of old page and mlock of new), munlock of new.
Whereas usually, we don't munlock a range immediately after mlocking it.
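
Roughly, per thread (a simplified userspace sketch of the pattern described
above, not the stress-ng source; the function name and mapping setup are
assumptions):

#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>

/*
 * Sketch of the loop described above: mlock one page, remap_file_pages()
 * that address (implicitly munlocking the old page and mlocking the new,
 * since that part of the VMA is VM_LOCKED), then munlock the new page.
 * Assumes buf is a MAP_SHARED file mapping covering npages pages.
 */
static void remap_stress(char *buf, size_t page_size, size_t npages, int iters)
{
	for (int i = 0; i < iters; i++) {
		size_t pgoff = (size_t)i % npages;	/* map a different file page each time */

		mlock(buf, page_size);
		remap_file_pages(buf, page_size, 0, pgoff, 0);
		munlock(buf, page_size);
	}
}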

The baseline's "if (!isolate_lru_page(page)) putback_lru_page(page);"
technique works very much in its favour on this test: the munlocking
isolates fail because mlock puts the page back on a pagevec, nothing
reaches the unevictable list; whereas the mm/munlock series under test
fastidiously moves each page to unevictable before bringing it back.

This certainly modifies my view of the role of the pagevec, and I
think it indicates that we *may* find a regression in some realistic
workload which demands a smarter approach. I have experimented with
munlock_page() "cancelling" an mlock found earlier on its pagevec;
but I very much doubt that the experimental code is correct yet, it
worked well but not as well as hoped (there are various lru_add_drain()s
which are limiting it), and it's just not a complication I'd like to get
into: unless pushed to do so by a realistic workload.

stress-ng --remap is not that realistic workload (and does
not pretend to be). I'm glad that it has highlighted the issue,
thanks to you, but I don't intend to propose any "fix" for this yet.

Hugh

2022-02-24 09:40:12

by kernel test robot

[permalink] [raw]
Subject: Re: [mm/munlock] 237b445401: stress-ng.remap.ops_per_sec -62.6% regression

Hi Hugh,

On Sun, Feb 20, 2022 at 10:32:28PM -0800, Hugh Dickins wrote:
> On Fri, 18 Feb 2022, Hugh Dickins wrote:
> > On Fri, 18 Feb 2022, kernel test robot wrote:
> > >
> > > Greeting,
> > >
> > > FYI, we noticed a -62.6% regression of stress-ng.remap.ops_per_sec due to commit:
> > >
> > >
> > > commit: 237b4454014d3759acc6459eb329c5e3d55113ed ("[PATCH v2 07/13] mm/munlock: mlock_pte_range() when mlocking or munlocking")
> > > url: https://github.com/0day-ci/linux/commits/Hugh-Dickins/mm-munlock-rework-of-mlock-munlock-page-handling/20220215-104421
> > > base: https://git.kernel.org/cgit/linux/kernel/git/tip/tip.git ee28855a54493ce83bc2a3fbe30210be61b57bc7
> > > patch link: https://lore.kernel.org/lkml/[email protected]
> > >
> > > in testcase: stress-ng
> > > on test machine: 128 threads 2 sockets Intel(R) Xeon(R) Platinum 8358 CPU @ 2.60GHz with 128G memory
> > > with following parameters:
> > >
> > > nr_threads: 100%
> > > testtime: 60s
> > > class: memory
> > > test: remap
> > > cpufreq_governor: performance
> > > ucode: 0xd000280
> > >
> > >
> > >
> > >
> > > If you fix the issue, kindly add following tag
> > > Reported-by: kernel test robot <[email protected]>
> > >
> > >
> > > Details are as below:
> > > -------------------------------------------------------------------------------------------------->
> > >
> > >
> > > To reproduce:
> > >
> > > git clone https://github.com/intel/lkp-tests.git
> > > cd lkp-tests
> > > sudo bin/lkp install job.yaml # job file is attached in this email
> > > bin/lkp split-job --compatible job.yaml # generate the yaml file for lkp run
> > > sudo bin/lkp run generated-yaml-file
> > >
> > > # if come across any failure that blocks the test,
> > > # please remove ~/.lkp and /lkp dir to run from a clean state.
> > >
> > > =========================================================================================
> > > class/compiler/cpufreq_governor/kconfig/nr_threads/rootfs/tbox_group/test/testcase/testtime/ucode:
> > > memory/gcc-9/performance/x86_64-rhel-8.3/100%/debian-10.4-x86_64-20200603.cgz/lkp-icl-2sp6/remap/stress-ng/60s/0xd000280
> > >
> > > commit:
> > > c479426e09 ("mm/munlock: maintain page->mlock_count while unevictable")
> > > 237b445401 ("mm/munlock: mlock_pte_range() when mlocking or munlocking")
> > >
> > > c479426e09c8088d 237b4454014d3759acc6459eb32
> > > ---------------- ---------------------------
> > > %stddev %change %stddev
> > > \ | \
> > > 109459 -62.5% 41003 ? 2% stress-ng.remap.ops
> > > 1823 -62.6% 682.54 ? 2% stress-ng.remap.ops_per_sec
> > > 2.242e+08 -62.5% 83989157 ? 2% stress-ng.time.minor_page_faults
> > > 30.00 ? 2% -61.2% 11.65 ? 4% stress-ng.time.user_time
> >
> > Thanks a lot for trying it out, I did hope that you would find something.
> >
> > However, IIUC, this by itself is not very interesting:
> > the comparison is between c479426e09 (06/13) as base and 237b445401?
> > 237b445401 is 07/13, "Fill in missing pieces", where the series gets
> > to be correct again, after dropping the old implementation and piecing
> > together the rest of the new implementation. It's not a surprise that
> > those tests which need what's added back in 07/13 will get much slower
> > at this stage. And later 10/13 brings in a pagevec to speed it up.
>
> I probably did not understand correctly: I expect you did try the whole
> series at first, found a regression, and then bisected to that commit.

Yes, that is kind of what we did. We applied your patch series as a new branch:

* 173fe0fca5b75 (linux-review/Hugh-Dickins/mm-munlock-rework-of-mlock-munlock-page-handling/20220215-104421) mm/thp: shrink_page_list() avoid splitting VM_LOCKED THP
* b98339dee777f mm/thp: collapse_file() do try_to_unmap(TTU_BATCH_FLUSH)
* cfe816e96f7fe mm/munlock: page migration needs mlock pagevec drained
* 9e15831f5ebd1 mm/munlock: mlock_page() munlock_page() batch by pagevec
* 351471e7c8eaa mm/munlock: delete smp_mb() from __pagevec_lru_add_fn()
* 84b82ccebbb3c mm/migrate: __unmap_and_move() push good newpage to LRU
* 237b4454014d3 mm/munlock: mlock_pte_range() when mlocking or munlocking
* c479426e09c80 mm/munlock: maintain page->mlock_count while unevictable
* 4165ea7156fc5 mm/munlock: replace clear_page_mlock() by final clearance
* 9208010d4509e mm/munlock: rmap call mlock_vma_page() munlock_vma_page()
* 9886e4a451624 mm/munlock: delete munlock_vma_pages_all(), allow oomreap
* 963d7e1d5ec62 mm/munlock: delete FOLL_MLOCK and FOLL_POPULATE
* b99933d467204 mm/munlock: delete page_mlock() and all its works
* ee28855a54493 perf/x86/intel: Increase max number of the fixed counters
* 0144ba0c5bd31 KVM: x86: use the KVM side max supported fixed counter
* 2145e77fecfb3 perf/x86/intel: Enable PEBS format 5
* 58b2ff2c18b1e perf/core: Allow kernel address filter when not filtering the kernel
* e5524bf1047eb perf/x86/intel/pt: Fix address filter config for 32-bit kernel
* d680ff24e9e14 perf/core: Fix address filter parser for multiple filters
* 1fb85d06ad675 x86: Share definition of __is_canonical_address()
* c243cecb58e39 perf/x86/intel/pt: Relax address filter validation
* 26291c54e111f (tag: v5.17-rc2,

But we don't actually test the HEAD of this branch directly. We have an
hourly kernel which contains this branch; we found the regression there,
and the bisection finally pointed to 237b4454014d3.

Below are some results on this branch:

173fe0fca5b75 mm/thp: shrink_page_list() avoid splitting VM_LOCKED THP 545.56 547.77 562.23 528.83 547.67 550.75
237b4454014d3 mm/munlock: mlock_pte_range() when mlocking or munlocking 681.04 685.94 704.17 647.75 684.73 691.63
c479426e09c80 mm/munlock: maintain page->mlock_count while unevictable 1842.83 1819.69 1875.71 1811.48 1818.57 1772.87
4165ea7156fc5 mm/munlock: replace clear_page_mlock() by final clearance 1324.93 1303.7 1303.47
9208010d4509e mm/munlock: rmap call mlock_vma_page() munlock_vma_page() 1272.6 1301.53 1354.12
963d7e1d5ec62 mm/munlock: delete FOLL_MLOCK and FOLL_POPULATE 16323.56 15509.96 16488.65
26291c54e111f Linux 5.17-rc2 874.04 890.31 849.84

Our auto-bisect checks the branch HEAD, and in this case the result is
still low, so we sent out the report.


>
> >
> > What would be much more interesting is to treat the series of 13 as one,
> > and compare the baseline before any of it against the end of the series:
> > is that something that the 0day robot can easily do?
>
> That would still be more interesting to me - though probably not
> actionable (see below), so not worth you going to any effort. But
> I hope the bad result on this test did not curtail further testing.

It is kind of difficult for us to treat the series as one, but we're
considering how to accomplish this. Thanks a lot for the suggestion!

And the bad result is not a problem for us :)

>
> >
> > But I'll look more closely at the numbers you've provided later today,
> > numbers that I've snipped off here: there may still be useful things to
> > learn from them; and maybe I'll try following the instructions you've
> > supplied, though I probably won't do a good job of following them.
>
> I did look more closely, didn't try lkp itself, but did observe
> stress-ng --timeout 60 --times --verify --metrics-brief --remap 128
> in the reproduce file, and followed that up (but with 8 not 128).
> In my config, the series managed about half the ops as the baseline.
>
> Comparison of unevictable_pgs in /proc/vmstat was instructive.
> Baseline 5.17-rc: With mm/munlock series applied:
> unevictable_pgs_culled 17 53097984
> unevictable_pgs_rescued 17 53097984
> unevictable_pgs_mlocked 97251331 53097984
> unevictable_pgs_munlocked 97251331 53097984
>
> I was very surprised by those low culled/rescued baseline numbers,
> but a look at stress-remap-file-pages.c shows that each thread is
> repetitively doing mlock of one page, remap_file_pages on that address
> (with implicit munlock of old page and mlock of new), munlock of new.
> Whereas usually, we don't munlock a range immediately after mlocking it.
>
> The baseline's "if (!isolate_lru_page(page)) putback_lru_page(page);"
> technique works very much in its favour on this test: the munlocking
> isolates fail because mlock puts the page back on a pagevec, nothing
> reaches the unevictable list; whereas the mm/munlock series under test
> fastidiously moves each page to unevictable before bringing it back.
>
> This certainly modifies my view of the role of the pagevec, and I
> think it indicates that we *may* find a regression in some realistic
> workload which demands a smarter approach. I have experimented with
> munlock_page() "cancelling" an mlock found earlier on its pagevec;
> but I very much doubt that the experimental code is correct yet, it
> worked well but not as well as hoped (there are various lru_add_drain()s
> which are limiting it), and it's just not a complication I'd like to get
> into: unless pushed to do so by a realistic workload.
>
> stress-ng --remap is not that realistic workload (and does
> not pretend to be). I'm glad that it has highlighted the issue,
> thanks to you, but I don't intend to propose any "fix" for this yet.

Thanks a lot for these details!
Not sure if the above data on that branch is of any help?

>
> Hugh