This patch is part of previous series:
https://lore.kernel.org/lkml/[email protected]/
Originally, it was created for external madvise hinting feature.
https://lkml.org/lkml/2019/5/31/463
Michal wanted to separte the discussion from external hinting interface
so this patchset includes only first part of my entire patchset
- introduce MADV_COLD and MADV_PAGEOUT hint to madvise.
However, I keep entire description for others for easier understanding
why this kinds of hint was born.
Thanks.
This patchset is against on next-20190710.
Below is description of previous entire patchset.
================= &< =====================
- Background
The Android terminology used for forking a new process and starting an app
from scratch is a cold start, while resuming an existing app is a hot start.
While we continually try to improve the performance of cold starts, hot
starts will always be significantly less power hungry as well as faster so
we are trying to make hot start more likely than cold start.
To increase hot start, Android userspace manages the order that apps should
be killed in a process called ActivityManagerService. ActivityManagerService
tracks every Android app or service that the user could be interacting with
at any time and translates that into a ranked list for lmkd(low memory
killer daemon). They are likely to be killed by lmkd if the system has to
reclaim memory. In that sense they are similar to entries in any other cache.
Those apps are kept alive for opportunistic performance improvements but
those performance improvements will vary based on the memory requirements of
individual workloads.
- Problem
Naturally, cached apps were dominant consumers of memory on the system.
However, they were not significant consumers of swap even though they are
good candidate for swap. Under investigation, swapping out only begins
once the low zone watermark is hit and kswapd wakes up, but the overall
allocation rate in the system might trip lmkd thresholds and cause a cached
process to be killed(we measured performance swapping out vs. zapping the
memory by killing a process. Unsurprisingly, zapping is 10x times faster
even though we use zram which is much faster than real storage) so kill
from lmkd will often satisfy the high zone watermark, resulting in very
few pages actually being moved to swap.
- Approach
The approach we chose was to use a new interface to allow userspace to
proactively reclaim entire processes by leveraging platform information.
This allowed us to bypass the inaccuracy of the kernel’s LRUs for pages
that are known to be cold from userspace and to avoid races with lmkd
by reclaiming apps as soon as they entered the cached state. Additionally,
it could provide many chances for platform to use much information to
optimize memory efficiency.
To achieve the goal, the patchset introduce two new options for madvise.
One is MADV_COLD which will deactivate activated pages and the other is
MADV_PAGEOUT which will reclaim private pages instantly. These new options
complement MADV_DONTNEED and MADV_FREE by adding non-destructive ways to
gain some free memory space. MADV_PAGEOUT is similar to MADV_DONTNEED in a way
that it hints the kernel that memory region is not currently needed and
should be reclaimed immediately; MADV_COLD is similar to MADV_FREE in a way
that it hints the kernel that memory region is not currently needed and
should be reclaimed when memory pressure rises.
* v3 - http://lore.kernel.org/lkml/[email protected]
* v2 - http://lore.kernel.org/lkml/[email protected]
* v1 - http://lore.kernel.org/lkml/[email protected]
Minchan Kim (4):
mm: introduce MADV_COLD
mm: change PAGEREF_RECLAIM_CLEAN with PAGE_REFRECLAIM
mm: account nr_isolated_xxx in [isolate|putback]_lru_page
mm: introduce MADV_PAGEOUT
include/linux/swap.h | 2 +
include/uapi/asm-generic/mman-common.h | 2 +
mm/compaction.c | 2 -
mm/gup.c | 7 +-
mm/internal.h | 2 +-
mm/khugepaged.c | 3 -
mm/madvise.c | 377 ++++++++++++++++++++++++-
mm/memory-failure.c | 3 -
mm/memory_hotplug.c | 4 -
mm/mempolicy.c | 6 +-
mm/migrate.c | 37 +--
mm/oom_kill.c | 2 +-
mm/swap.c | 42 +++
mm/vmscan.c | 83 +++++-
14 files changed, 507 insertions(+), 65 deletions(-)
--
2.22.0.410.gd8fdbe21b5-goog
When a process expects no accesses to a certain memory range, it could
give a hint to kernel that the pages can be reclaimed when memory pressure
happens but data should be preserved for future use. This could reduce
workingset eviction so it ends up increasing performance.
This patch introduces the new MADV_COLD hint to madvise(2) syscall.
MADV_COLD can be used by a process to mark a memory range as not expected
to be used in the near future. The hint can help kernel in deciding which
pages to evict early during memory pressure.
It works for every LRU pages like MADV_[DONTNEED|FREE]. IOW, It moves
active file page -> inactive file LRU
active anon page -> inacdtive anon LRU
Unlike MADV_FREE, it doesn't move active anonymous pages to inactive
file LRU's head because MADV_COLD is a little bit different symantic.
MADV_FREE means it's okay to discard when the memory pressure because
the content of the page is *garbage* so freeing such pages is almost zero
overhead since we don't need to swap out and access afterward causes just
minor fault. Thus, it would make sense to put those freeable pages in
inactive file LRU to compete other used-once pages. It makes sense for
implmentaion point of view, too because it's not swapbacked memory any
longer until it would be re-dirtied. Even, it could give a bonus to make
them be reclaimed on swapless system. However, MADV_COLD doesn't mean
garbage so reclaiming them requires swap-out/in in the end so it's bigger
cost. Since we have designed VM LRU aging based on cost-model, anonymous
cold pages would be better to position inactive anon's LRU list, not file
LRU. Furthermore, it would help to avoid unnecessary scanning if system
doesn't have a swap device. Let's start simpler way without adding
complexity at this moment. However, keep in mind, too that it's a caveat
that workloads with a lot of pages cache are likely to ignore MADV_COLD
on anonymous memory because we rarely age anonymous LRU lists.
* man-page material
MADV_COLD (since Linux x.x)
Pages in the specified regions will be treated as less-recently-accessed
compared to pages in the system with similar access frequencies.
In contrast to MADV_FREE, the contents of the region are preserved
regardless of subsequent writes to pages.
MADV_COLD cannot be applied to locked pages, Huge TLB pages, or VM_PFNMAP
pages.
* v2
* add up the warn with lots of page cache workload - mhocko
* add man page stuff - dave
* v1
* remove page_mapcount filter - hannes, mhocko
* remove idle page handling - joelaf
* RFCv2
* add more description - mhocko
* RFCv1
* renaming from MADV_COOL to MADV_COLD - hannes
* internal review
* use clear_page_youn in deactivate_page - joelaf
* Revise the description - surenb
* Renaming from MADV_WARM to MADV_COOL - surenb
Acked-by: Michal Hocko <[email protected]>
Signed-off-by: Minchan Kim <[email protected]>
---
include/linux/swap.h | 1 +
include/uapi/asm-generic/mman-common.h | 1 +
mm/internal.h | 2 +-
mm/madvise.c | 180 ++++++++++++++++++++++++-
mm/oom_kill.c | 2 +-
mm/swap.c | 42 ++++++
6 files changed, 224 insertions(+), 4 deletions(-)
diff --git a/include/linux/swap.h b/include/linux/swap.h
index de2c67a33b7e..0ce997edb8bb 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -340,6 +340,7 @@ extern void lru_add_drain_cpu(int cpu);
extern void lru_add_drain_all(void);
extern void rotate_reclaimable_page(struct page *page);
extern void deactivate_file_page(struct page *page);
+extern void deactivate_page(struct page *page);
extern void mark_page_lazyfree(struct page *page);
extern void swap_setup(void);
diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
index 63b1f506ea67..ef8a56927b12 100644
--- a/include/uapi/asm-generic/mman-common.h
+++ b/include/uapi/asm-generic/mman-common.h
@@ -45,6 +45,7 @@
#define MADV_SEQUENTIAL 2 /* expect sequential page references */
#define MADV_WILLNEED 3 /* will need these pages */
#define MADV_DONTNEED 4 /* don't need these pages */
+#define MADV_COLD 5 /* deactivatie these pages */
/* common parameters: try to keep these consistent across architectures */
#define MADV_FREE 8 /* free pages only if memory pressure */
diff --git a/mm/internal.h b/mm/internal.h
index f53a14d67538..c61b215ff265 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -39,7 +39,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf);
void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *start_vma,
unsigned long floor, unsigned long ceiling);
-static inline bool can_madv_dontneed_vma(struct vm_area_struct *vma)
+static inline bool can_madv_lru_vma(struct vm_area_struct *vma)
{
return !(vma->vm_flags & (VM_LOCKED|VM_HUGETLB|VM_PFNMAP));
}
diff --git a/mm/madvise.c b/mm/madvise.c
index 968df3aa069f..bae0055f9724 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -40,6 +40,7 @@ static int madvise_need_mmap_write(int behavior)
case MADV_REMOVE:
case MADV_WILLNEED:
case MADV_DONTNEED:
+ case MADV_COLD:
case MADV_FREE:
return 0;
default:
@@ -307,6 +308,178 @@ static long madvise_willneed(struct vm_area_struct *vma,
return 0;
}
+static int madvise_cold_pte_range(pmd_t *pmd, unsigned long addr,
+ unsigned long end, struct mm_walk *walk)
+{
+ struct mmu_gather *tlb = walk->private;
+ struct mm_struct *mm = tlb->mm;
+ struct vm_area_struct *vma = walk->vma;
+ pte_t *orig_pte, *pte, ptent;
+ spinlock_t *ptl;
+ struct page *page;
+ unsigned long next;
+
+ next = pmd_addr_end(addr, end);
+ if (pmd_trans_huge(*pmd)) {
+ pmd_t orig_pmd;
+
+ tlb_change_page_size(tlb, HPAGE_PMD_SIZE);
+ ptl = pmd_trans_huge_lock(pmd, vma);
+ if (!ptl)
+ return 0;
+
+ orig_pmd = *pmd;
+ if (is_huge_zero_pmd(orig_pmd))
+ goto huge_unlock;
+
+ if (unlikely(!pmd_present(orig_pmd))) {
+ VM_BUG_ON(thp_migration_supported() &&
+ !is_pmd_migration_entry(orig_pmd));
+ goto huge_unlock;
+ }
+
+ page = pmd_page(orig_pmd);
+ if (next - addr != HPAGE_PMD_SIZE) {
+ int err;
+
+ if (page_mapcount(page) != 1)
+ goto huge_unlock;
+
+ get_page(page);
+ spin_unlock(ptl);
+ lock_page(page);
+ err = split_huge_page(page);
+ unlock_page(page);
+ put_page(page);
+ if (!err)
+ goto regular_page;
+ return 0;
+ }
+
+ if (pmd_young(orig_pmd)) {
+ pmdp_invalidate(vma, addr, pmd);
+ orig_pmd = pmd_mkold(orig_pmd);
+
+ set_pmd_at(mm, addr, pmd, orig_pmd);
+ tlb_remove_pmd_tlb_entry(tlb, pmd, addr);
+ }
+
+ test_and_clear_page_young(page);
+ deactivate_page(page);
+huge_unlock:
+ spin_unlock(ptl);
+ return 0;
+ }
+
+ if (pmd_trans_unstable(pmd))
+ return 0;
+
+regular_page:
+ tlb_change_page_size(tlb, PAGE_SIZE);
+ orig_pte = pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
+ flush_tlb_batched_pending(mm);
+ arch_enter_lazy_mmu_mode();
+ for (; addr < end; pte++, addr += PAGE_SIZE) {
+ ptent = *pte;
+
+ if (pte_none(ptent))
+ continue;
+
+ if (!pte_present(ptent))
+ continue;
+
+ page = vm_normal_page(vma, addr, ptent);
+ if (!page)
+ continue;
+
+ /*
+ * Creating a THP page is expensive so split it only if we
+ * are sure it's worth. Split it if we are only owner.
+ */
+ if (PageTransCompound(page)) {
+ if (page_mapcount(page) != 1)
+ break;
+ get_page(page);
+ if (!trylock_page(page)) {
+ put_page(page);
+ break;
+ }
+ pte_unmap_unlock(orig_pte, ptl);
+ if (split_huge_page(page)) {
+ unlock_page(page);
+ put_page(page);
+ pte_offset_map_lock(mm, pmd, addr, &ptl);
+ break;
+ }
+ unlock_page(page);
+ put_page(page);
+ pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
+ pte--;
+ addr -= PAGE_SIZE;
+ continue;
+ }
+
+ VM_BUG_ON_PAGE(PageTransCompound(page), page);
+
+ if (pte_young(ptent)) {
+ ptent = ptep_get_and_clear_full(mm, addr, pte,
+ tlb->fullmm);
+ ptent = pte_mkold(ptent);
+ set_pte_at(mm, addr, pte, ptent);
+ tlb_remove_tlb_entry(tlb, pte, addr);
+ }
+
+ /*
+ * We are deactivating a page for accelerating reclaiming.
+ * VM couldn't reclaim the page unless we clear PG_young.
+ * As a side effect, it makes confuse idle-page tracking
+ * because they will miss recent referenced history.
+ */
+ test_and_clear_page_young(page);
+ deactivate_page(page);
+ }
+
+ arch_enter_lazy_mmu_mode();
+ pte_unmap_unlock(orig_pte, ptl);
+ cond_resched();
+
+ return 0;
+}
+
+static void madvise_cold_page_range(struct mmu_gather *tlb,
+ struct vm_area_struct *vma,
+ unsigned long addr, unsigned long end)
+{
+ struct mm_walk cold_walk = {
+ .pmd_entry = madvise_cold_pte_range,
+ .mm = vma->vm_mm,
+ .private = tlb,
+ };
+
+ tlb_start_vma(tlb, vma);
+ walk_page_range(addr, end, &cold_walk);
+ tlb_end_vma(tlb, vma);
+}
+
+static long madvise_cold(struct vm_area_struct *vma,
+ struct vm_area_struct **prev,
+ unsigned long start_addr, unsigned long end_addr)
+{
+ struct mm_struct *mm = vma->vm_mm;
+ struct mmu_gather tlb;
+
+ *prev = vma;
+ if (!can_madv_lru_vma(vma))
+ return -EINVAL;
+
+ lru_add_drain();
+ tlb_gather_mmu(&tlb, mm, start_addr, end_addr);
+ madvise_cold_page_range(&tlb, vma, start_addr, end_addr);
+ tlb_finish_mmu(&tlb, start_addr, end_addr);
+
+ return 0;
+}
+
static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
unsigned long end, struct mm_walk *walk)
@@ -519,7 +692,7 @@ static long madvise_dontneed_free(struct vm_area_struct *vma,
int behavior)
{
*prev = vma;
- if (!can_madv_dontneed_vma(vma))
+ if (!can_madv_lru_vma(vma))
return -EINVAL;
if (!userfaultfd_remove(vma, start, end)) {
@@ -541,7 +714,7 @@ static long madvise_dontneed_free(struct vm_area_struct *vma,
*/
return -ENOMEM;
}
- if (!can_madv_dontneed_vma(vma))
+ if (!can_madv_lru_vma(vma))
return -EINVAL;
if (end > vma->vm_end) {
/*
@@ -695,6 +868,8 @@ madvise_vma(struct vm_area_struct *vma, struct vm_area_struct **prev,
return madvise_remove(vma, prev, start, end);
case MADV_WILLNEED:
return madvise_willneed(vma, prev, start, end);
+ case MADV_COLD:
+ return madvise_cold(vma, prev, start, end);
case MADV_FREE:
case MADV_DONTNEED:
return madvise_dontneed_free(vma, prev, start, end, behavior);
@@ -716,6 +891,7 @@ madvise_behavior_valid(int behavior)
case MADV_WILLNEED:
case MADV_DONTNEED:
case MADV_FREE:
+ case MADV_COLD:
#ifdef CONFIG_KSM
case MADV_MERGEABLE:
case MADV_UNMERGEABLE:
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 95872bdfec4e..c8f0ec6b0e80 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -523,7 +523,7 @@ bool __oom_reap_task_mm(struct mm_struct *mm)
set_bit(MMF_UNSTABLE, &mm->flags);
for (vma = mm->mmap ; vma; vma = vma->vm_next) {
- if (!can_madv_dontneed_vma(vma))
+ if (!can_madv_lru_vma(vma))
continue;
/*
diff --git a/mm/swap.c b/mm/swap.c
index 55899c1f54af..b501ad6eb091 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -47,6 +47,7 @@ int page_cluster;
static DEFINE_PER_CPU(struct pagevec, lru_add_pvec);
static DEFINE_PER_CPU(struct pagevec, lru_rotate_pvecs);
static DEFINE_PER_CPU(struct pagevec, lru_deactivate_file_pvecs);
+static DEFINE_PER_CPU(struct pagevec, lru_deactivate_pvecs);
static DEFINE_PER_CPU(struct pagevec, lru_lazyfree_pvecs);
#ifdef CONFIG_SMP
static DEFINE_PER_CPU(struct pagevec, activate_page_pvecs);
@@ -538,6 +539,22 @@ static void lru_deactivate_file_fn(struct page *page, struct lruvec *lruvec,
update_page_reclaim_stat(lruvec, file, 0);
}
+static void lru_deactivate_fn(struct page *page, struct lruvec *lruvec,
+ void *arg)
+{
+ if (PageLRU(page) && PageActive(page) && !PageUnevictable(page)) {
+ int file = page_is_file_cache(page);
+ int lru = page_lru_base_type(page);
+
+ del_page_from_lru_list(page, lruvec, lru + LRU_ACTIVE);
+ ClearPageActive(page);
+ ClearPageReferenced(page);
+ add_page_to_lru_list(page, lruvec, lru);
+
+ __count_vm_events(PGDEACTIVATE, hpage_nr_pages(page));
+ update_page_reclaim_stat(lruvec, file, 0);
+ }
+}
static void lru_lazyfree_fn(struct page *page, struct lruvec *lruvec,
void *arg)
@@ -590,6 +607,10 @@ void lru_add_drain_cpu(int cpu)
if (pagevec_count(pvec))
pagevec_lru_move_fn(pvec, lru_deactivate_file_fn, NULL);
+ pvec = &per_cpu(lru_deactivate_pvecs, cpu);
+ if (pagevec_count(pvec))
+ pagevec_lru_move_fn(pvec, lru_deactivate_fn, NULL);
+
pvec = &per_cpu(lru_lazyfree_pvecs, cpu);
if (pagevec_count(pvec))
pagevec_lru_move_fn(pvec, lru_lazyfree_fn, NULL);
@@ -623,6 +644,26 @@ void deactivate_file_page(struct page *page)
}
}
+/*
+ * deactivate_page - deactivate a page
+ * @page: page to deactivate
+ *
+ * deactivate_page() moves @page to the inactive list if @page was on the active
+ * list and was not an unevictable page. This is done to accelerate the reclaim
+ * of @page.
+ */
+void deactivate_page(struct page *page)
+{
+ if (PageLRU(page) && PageActive(page) && !PageUnevictable(page)) {
+ struct pagevec *pvec = &get_cpu_var(lru_deactivate_pvecs);
+
+ get_page(page);
+ if (!pagevec_add(pvec, page) || PageCompound(page))
+ pagevec_lru_move_fn(pvec, lru_deactivate_fn, NULL);
+ put_cpu_var(lru_deactivate_pvecs);
+ }
+}
+
/**
* mark_page_lazyfree - make an anon page lazyfree
* @page: page to deactivate
@@ -687,6 +728,7 @@ void lru_add_drain_all(void)
if (pagevec_count(&per_cpu(lru_add_pvec, cpu)) ||
pagevec_count(&per_cpu(lru_rotate_pvecs, cpu)) ||
pagevec_count(&per_cpu(lru_deactivate_file_pvecs, cpu)) ||
+ pagevec_count(&per_cpu(lru_deactivate_pvecs, cpu)) ||
pagevec_count(&per_cpu(lru_lazyfree_pvecs, cpu)) ||
need_activate_page_drain(cpu)) {
INIT_WORK(work, lru_add_drain_per_cpu);
--
2.22.0.410.gd8fdbe21b5-goog
The local variable references in shrink_page_list is PAGEREF_RECLAIM_CLEAN
as default. It is for preventing to reclaim dirty pages when CMA try to
migrate pages. Strictly speaking, we don't need it because CMA didn't allow
to write out by .may_writepage = 0 in reclaim_clean_pages_from_list.
Moreover, it has a problem to prevent anonymous pages's swap out even
though force_reclaim = true in shrink_page_list on upcoming patch.
So this patch makes references's default value to PAGEREF_RECLAIM and
rename force_reclaim with ignore_references to make it more clear.
This is a preparatory work for next patch.
* RFCv1
* use ignore_referecnes as parameter name - hannes
Acked-by: Michal Hocko <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Signed-off-by: Minchan Kim <[email protected]>
---
mm/vmscan.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index a0301edd8d03..b4fa04d10ba6 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1119,7 +1119,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
struct scan_control *sc,
enum ttu_flags ttu_flags,
struct reclaim_stat *stat,
- bool force_reclaim)
+ bool ignore_references)
{
LIST_HEAD(ret_pages);
LIST_HEAD(free_pages);
@@ -1133,7 +1133,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
struct address_space *mapping;
struct page *page;
int may_enter_fs;
- enum page_references references = PAGEREF_RECLAIM_CLEAN;
+ enum page_references references = PAGEREF_RECLAIM;
bool dirty, writeback;
unsigned int nr_pages;
@@ -1264,7 +1264,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
}
}
- if (!force_reclaim)
+ if (!ignore_references)
references = page_check_references(page, sc);
switch (references) {
--
2.22.0.410.gd8fdbe21b5-goog
The isolate counting is pecpu counter so it would be not huge gain
to work them by batch. Rather than complicating to make them batch,
let's make it more stright-foward via adding the counting logic
into [isolate|putback]_lru_page API.
* v1
* fix accounting bug - Hillf
Link: http://lkml.kernel.org/r/[email protected]
Acked-by: Michal Hocko <[email protected]>
Suggested-by: Johannes Weiner <[email protected]>
Signed-off-by: Minchan Kim <[email protected]>
---
mm/compaction.c | 2 --
mm/gup.c | 7 +------
mm/khugepaged.c | 3 ---
mm/memory-failure.c | 3 ---
mm/memory_hotplug.c | 4 ----
mm/mempolicy.c | 6 +-----
mm/migrate.c | 37 ++++++++-----------------------------
mm/vmscan.c | 22 ++++++++++++++++------
8 files changed, 26 insertions(+), 58 deletions(-)
diff --git a/mm/compaction.c b/mm/compaction.c
index 9e1b9acb116b..c6591682deda 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -982,8 +982,6 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
/* Successfully isolated */
del_page_from_lru_list(page, lruvec, page_lru(page));
- inc_node_page_state(page,
- NR_ISOLATED_ANON + page_is_file_cache(page));
isolate_success:
list_add(&page->lru, &cc->migratepages);
diff --git a/mm/gup.c b/mm/gup.c
index 98f13ab37bac..11d0634ce613 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1475,13 +1475,8 @@ static long check_and_migrate_cma_pages(struct task_struct *tsk,
drain_allow = false;
}
- if (!isolate_lru_page(head)) {
+ if (!isolate_lru_page(head))
list_add_tail(&head->lru, &cma_page_list);
- mod_node_page_state(page_pgdat(head),
- NR_ISOLATED_ANON +
- page_is_file_cache(head),
- hpage_nr_pages(head));
- }
}
}
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index eaaa21b23215..a8b517d6df4a 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -503,7 +503,6 @@ void __khugepaged_exit(struct mm_struct *mm)
static void release_pte_page(struct page *page)
{
- dec_node_page_state(page, NR_ISOLATED_ANON + page_is_file_cache(page));
unlock_page(page);
putback_lru_page(page);
}
@@ -602,8 +601,6 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
result = SCAN_DEL_PAGE_LRU;
goto out;
}
- inc_node_page_state(page,
- NR_ISOLATED_ANON + page_is_file_cache(page));
VM_BUG_ON_PAGE(!PageLocked(page), page);
VM_BUG_ON_PAGE(PageLRU(page), page);
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 7ef849da8278..9900bb95d774 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1791,9 +1791,6 @@ static int __soft_offline_page(struct page *page, int flags)
* so use !__PageMovable instead for LRU page's mapping
* cannot have PAGE_MAPPING_MOVABLE.
*/
- if (!__PageMovable(page))
- inc_node_page_state(page, NR_ISOLATED_ANON +
- page_is_file_cache(page));
list_add(&page->lru, &pagelist);
ret = migrate_pages(&pagelist, new_page, NULL, MPOL_MF_MOVE_ALL,
MIGRATE_SYNC, MR_MEMORY_FAILURE);
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index b9ba5b85f9f7..15bad2043b41 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1383,10 +1383,6 @@ do_migrate_range(unsigned long start_pfn, unsigned long end_pfn)
ret = isolate_movable_page(page, ISOLATE_UNEVICTABLE);
if (!ret) { /* Success */
list_add_tail(&page->lru, &source);
- if (!__PageMovable(page))
- inc_node_page_state(page, NR_ISOLATED_ANON +
- page_is_file_cache(page));
-
} else {
pr_warn("failed to isolate pfn %lx\n", pfn);
dump_page(page, "isolation failed");
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 4acc2d14bc77..a5685eee6d1d 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -994,12 +994,8 @@ static int migrate_page_add(struct page *page, struct list_head *pagelist,
* Avoid migrating a page that is shared with others.
*/
if ((flags & MPOL_MF_MOVE_ALL) || page_mapcount(head) == 1) {
- if (!isolate_lru_page(head)) {
+ if (!isolate_lru_page(head))
list_add_tail(&head->lru, pagelist);
- mod_node_page_state(page_pgdat(head),
- NR_ISOLATED_ANON + page_is_file_cache(head),
- hpage_nr_pages(head));
- }
}
return 0;
diff --git a/mm/migrate.c b/mm/migrate.c
index 8992741f10aa..f54f449a99f5 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -190,8 +190,6 @@ void putback_movable_pages(struct list_head *l)
unlock_page(page);
put_page(page);
} else {
- mod_node_page_state(page_pgdat(page), NR_ISOLATED_ANON +
- page_is_file_cache(page), -hpage_nr_pages(page));
putback_lru_page(page);
}
}
@@ -1175,10 +1173,17 @@ static ICE_noinline int unmap_and_move(new_page_t get_new_page,
return -ENOMEM;
if (page_count(page) == 1) {
+ bool is_lru = !__PageMovable(page);
+
/* page was freed from under us. So we are done. */
ClearPageActive(page);
ClearPageUnevictable(page);
- if (unlikely(__PageMovable(page))) {
+ if (likely(is_lru))
+ mod_node_page_state(page_pgdat(page),
+ NR_ISOLATED_ANON +
+ page_is_file_cache(page),
+ -hpage_nr_pages(page));
+ else {
lock_page(page);
if (!PageMovable(page))
__ClearPageIsolated(page);
@@ -1204,15 +1209,6 @@ static ICE_noinline int unmap_and_move(new_page_t get_new_page,
* restored.
*/
list_del(&page->lru);
-
- /*
- * Compaction can migrate also non-LRU pages which are
- * not accounted to NR_ISOLATED_*. They can be recognized
- * as __PageMovable
- */
- if (likely(!__PageMovable(page)))
- mod_node_page_state(page_pgdat(page), NR_ISOLATED_ANON +
- page_is_file_cache(page), -hpage_nr_pages(page));
}
/*
@@ -1566,9 +1562,6 @@ static int add_page_for_migration(struct mm_struct *mm, unsigned long addr,
err = 0;
list_add_tail(&head->lru, pagelist);
- mod_node_page_state(page_pgdat(head),
- NR_ISOLATED_ANON + page_is_file_cache(head),
- hpage_nr_pages(head));
}
out_putpage:
/*
@@ -1884,8 +1877,6 @@ static struct page *alloc_misplaced_dst_page(struct page *page,
static int numamigrate_isolate_page(pg_data_t *pgdat, struct page *page)
{
- int page_lru;
-
VM_BUG_ON_PAGE(compound_order(page) && !PageTransHuge(page), page);
/* Avoid migrating to a node that is nearly full */
@@ -1907,10 +1898,6 @@ static int numamigrate_isolate_page(pg_data_t *pgdat, struct page *page)
return 0;
}
- page_lru = page_is_file_cache(page);
- mod_node_page_state(page_pgdat(page), NR_ISOLATED_ANON + page_lru,
- hpage_nr_pages(page));
-
/*
* Isolating the page has taken another reference, so the
* caller's reference can be safely dropped without the page
@@ -1965,8 +1952,6 @@ int migrate_misplaced_page(struct page *page, struct vm_area_struct *vma,
if (nr_remaining) {
if (!list_empty(&migratepages)) {
list_del(&page->lru);
- dec_node_page_state(page, NR_ISOLATED_ANON +
- page_is_file_cache(page));
putback_lru_page(page);
}
isolated = 0;
@@ -1996,7 +1981,6 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
pg_data_t *pgdat = NODE_DATA(node);
int isolated = 0;
struct page *new_page = NULL;
- int page_lru = page_is_file_cache(page);
unsigned long start = address & HPAGE_PMD_MASK;
new_page = alloc_pages_node(node,
@@ -2042,8 +2026,6 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
/* Retake the callers reference and putback on LRU */
get_page(page);
putback_lru_page(page);
- mod_node_page_state(page_pgdat(page),
- NR_ISOLATED_ANON + page_lru, -HPAGE_PMD_NR);
goto out_unlock;
}
@@ -2093,9 +2075,6 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
count_vm_events(PGMIGRATE_SUCCESS, HPAGE_PMD_NR);
count_vm_numa_events(NUMA_PAGE_MIGRATE, HPAGE_PMD_NR);
- mod_node_page_state(page_pgdat(page),
- NR_ISOLATED_ANON + page_lru,
- -HPAGE_PMD_NR);
return isolated;
out_fail:
diff --git a/mm/vmscan.c b/mm/vmscan.c
index b4fa04d10ba6..ca192b792d4f 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1016,6 +1016,9 @@ int remove_mapping(struct address_space *mapping, struct page *page)
void putback_lru_page(struct page *page)
{
lru_cache_add(page);
+ mod_node_page_state(page_pgdat(page),
+ NR_ISOLATED_ANON + page_is_file_cache(page),
+ -hpage_nr_pages(page));
put_page(page); /* drop ref from isolate */
}
@@ -1481,6 +1484,9 @@ static unsigned long shrink_page_list(struct list_head *page_list,
*/
nr_reclaimed += nr_pages;
+ mod_node_page_state(pgdat, NR_ISOLATED_ANON +
+ page_is_file_cache(page),
+ -nr_pages);
/*
* Is there need to periodically free_page_list? It would
* appear not as the counts should be low
@@ -1556,7 +1562,6 @@ unsigned long reclaim_clean_pages_from_list(struct zone *zone,
ret = shrink_page_list(&clean_pages, zone->zone_pgdat, &sc,
TTU_IGNORE_ACCESS, &dummy_stat, true);
list_splice(&clean_pages, page_list);
- mod_node_page_state(zone->zone_pgdat, NR_ISOLATED_FILE, -ret);
return ret;
}
@@ -1632,6 +1637,9 @@ int __isolate_lru_page(struct page *page, isolate_mode_t mode)
*/
ClearPageLRU(page);
ret = 0;
+ __mod_node_page_state(page_pgdat(page), NR_ISOLATED_ANON +
+ page_is_file_cache(page),
+ hpage_nr_pages(page));
}
return ret;
@@ -1763,6 +1771,7 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
trace_mm_vmscan_lru_isolate(sc->reclaim_idx, sc->order, nr_to_scan,
total_scan, skipped, nr_taken, mode, lru);
update_lru_sizes(lruvec, lru, nr_zone_taken);
+
return nr_taken;
}
@@ -1811,6 +1820,9 @@ int isolate_lru_page(struct page *page)
ClearPageLRU(page);
del_page_from_lru_list(page, lruvec, lru);
ret = 0;
+ mod_node_page_state(pgdat, NR_ISOLATED_ANON +
+ page_is_file_cache(page),
+ hpage_nr_pages(page));
}
spin_unlock_irq(&pgdat->lru_lock);
}
@@ -1902,6 +1914,9 @@ static unsigned noinline_for_stack move_pages_to_lru(struct lruvec *lruvec,
update_lru_size(lruvec, lru, page_zonenum(page), nr_pages);
list_move(&page->lru, &lruvec->lists[lru]);
+ __mod_node_page_state(pgdat, NR_ISOLATED_ANON +
+ page_is_file_cache(page),
+ -hpage_nr_pages(page));
if (put_page_testzero(page)) {
__ClearPageLRU(page);
__ClearPageActive(page);
@@ -1979,7 +1994,6 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
nr_taken = isolate_lru_pages(nr_to_scan, lruvec, &page_list,
&nr_scanned, sc, lru);
- __mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, nr_taken);
reclaim_stat->recent_scanned[file] += nr_taken;
item = current_is_kswapd() ? PGSCAN_KSWAPD : PGSCAN_DIRECT;
@@ -2005,8 +2019,6 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
move_pages_to_lru(lruvec, &page_list);
- __mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, -nr_taken);
-
spin_unlock_irq(&pgdat->lru_lock);
mem_cgroup_uncharge_list(&page_list);
@@ -2065,7 +2077,6 @@ static void shrink_active_list(unsigned long nr_to_scan,
nr_taken = isolate_lru_pages(nr_to_scan, lruvec, &l_hold,
&nr_scanned, sc, lru);
- __mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, nr_taken);
reclaim_stat->recent_scanned[file] += nr_taken;
__count_vm_events(PGREFILL, nr_scanned);
@@ -2134,7 +2145,6 @@ static void shrink_active_list(unsigned long nr_to_scan,
__count_vm_events(PGDEACTIVATE, nr_deactivate);
__count_memcg_events(lruvec_memcg(lruvec), PGDEACTIVATE, nr_deactivate);
- __mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, -nr_taken);
spin_unlock_irq(&pgdat->lru_lock);
mem_cgroup_uncharge_list(&l_active);
--
2.22.0.410.gd8fdbe21b5-goog
When a process expects no accesses to a certain memory range
for a long time, it could hint kernel that the pages can be
reclaimed instantly but data should be preserved for future use.
This could reduce workingset eviction so it ends up increasing
performance.
This patch introduces the new MADV_PAGEOUT hint to madvise(2)
syscall. MADV_PAGEOUT can be used by a process to mark a memory
range as not expected to be used for a long time so that kernel
reclaims *any LRU* pages instantly. The hint can help kernel in
deciding which pages to evict proactively.
A note: It doesn't apply SWAP_CLUSTER_MAX LRU page isolation limit
intentionally because it's automatically bounded by PMD size.
If PMD size(e.g., 256) makes some trouble, we could fix it later
by limit it to SWAP_CLUSTER_MAX[1].
- man-page material
MADV_PAGEOUT (since Linux x.x)
Do not expect access in the near future so pages in the specified
regions could be reclaimed instantly regardless of memory pressure.
Thus, access in the range after successful operation could cause
major page fault but never lose the up-to-date contents unlike
MADV_DONTNEED. Pages belonging to a shared mapping are only processed
if a write access is allowed for the calling process.
MADV_PAGEOUT cannot be applied to locked pages, Huge TLB pages, or
VM_PFNMAP pages.
* v3
* man page material modification - mhocko
* remove using SWAP_CLUSTER_MAX - mhocko
* v2
* add comment about SWAP_CLUSTER_MAX - mhocko
* add permission check to prevent sidechannel attack - mhocko
* add man page stuff - dave
* v1
* change pte to old and rely on the other's reference - hannes
* remove page_mapcount to check shared page - mhocko
* RFC v2
* make reclaim_pages simple via factoring out isolate logic - hannes
* RFCv1
* rename from MADV_COLD to MADV_PAGEOUT - hannes
* bail out if process is being killed - Hillf
* fix reclaim_pages bugs - Hillf
[1] https://lore.kernel.org/lkml/[email protected]/
Acked-by: Michal Hocko <[email protected]>
Signed-off-by: Minchan Kim <[email protected]>
---
include/linux/swap.h | 1 +
include/uapi/asm-generic/mman-common.h | 1 +
mm/madvise.c | 197 +++++++++++++++++++++++++
mm/vmscan.c | 55 +++++++
4 files changed, 254 insertions(+)
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 0ce997edb8bb..063c0c1e112b 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -365,6 +365,7 @@ extern int vm_swappiness;
extern int remove_mapping(struct address_space *mapping, struct page *page);
extern unsigned long vm_total_pages;
+extern unsigned long reclaim_pages(struct list_head *page_list);
#ifdef CONFIG_NUMA
extern int node_reclaim_mode;
extern int sysctl_min_unmapped_ratio;
diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
index ef8a56927b12..c613abdb7284 100644
--- a/include/uapi/asm-generic/mman-common.h
+++ b/include/uapi/asm-generic/mman-common.h
@@ -46,6 +46,7 @@
#define MADV_WILLNEED 3 /* will need these pages */
#define MADV_DONTNEED 4 /* don't need these pages */
#define MADV_COLD 5 /* deactivatie these pages */
+#define MADV_PAGEOUT 6 /* reclaim these pages */
/* common parameters: try to keep these consistent across architectures */
#define MADV_FREE 8 /* free pages only if memory pressure */
diff --git a/mm/madvise.c b/mm/madvise.c
index bae0055f9724..bc2f0138982e 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -11,6 +11,7 @@
#include <linux/syscalls.h>
#include <linux/mempolicy.h>
#include <linux/page-isolation.h>
+#include <linux/page_idle.h>
#include <linux/userfaultfd_k.h>
#include <linux/hugetlb.h>
#include <linux/falloc.h>
@@ -41,6 +42,7 @@ static int madvise_need_mmap_write(int behavior)
case MADV_WILLNEED:
case MADV_DONTNEED:
case MADV_COLD:
+ case MADV_PAGEOUT:
case MADV_FREE:
return 0;
default:
@@ -480,6 +482,198 @@ static long madvise_cold(struct vm_area_struct *vma,
return 0;
}
+static int madvise_pageout_pte_range(pmd_t *pmd, unsigned long addr,
+ unsigned long end, struct mm_walk *walk)
+{
+ struct mmu_gather *tlb = walk->private;
+ struct mm_struct *mm = tlb->mm;
+ struct vm_area_struct *vma = walk->vma;
+ pte_t *orig_pte, *pte, ptent;
+ spinlock_t *ptl;
+ LIST_HEAD(page_list);
+ struct page *page;
+ unsigned long next;
+
+ if (fatal_signal_pending(current))
+ return -EINTR;
+
+ next = pmd_addr_end(addr, end);
+ if (pmd_trans_huge(*pmd)) {
+ pmd_t orig_pmd;
+
+ tlb_change_page_size(tlb, HPAGE_PMD_SIZE);
+ ptl = pmd_trans_huge_lock(pmd, vma);
+ if (!ptl)
+ return 0;
+
+ orig_pmd = *pmd;
+ if (is_huge_zero_pmd(orig_pmd))
+ goto huge_unlock;
+
+ if (unlikely(!pmd_present(orig_pmd))) {
+ VM_BUG_ON(thp_migration_supported() &&
+ !is_pmd_migration_entry(orig_pmd));
+ goto huge_unlock;
+ }
+
+ page = pmd_page(orig_pmd);
+ if (next - addr != HPAGE_PMD_SIZE) {
+ int err;
+
+ if (page_mapcount(page) != 1)
+ goto huge_unlock;
+ get_page(page);
+ spin_unlock(ptl);
+ lock_page(page);
+ err = split_huge_page(page);
+ unlock_page(page);
+ put_page(page);
+ if (!err)
+ goto regular_page;
+ return 0;
+ }
+
+ if (isolate_lru_page(page))
+ goto huge_unlock;
+
+ if (pmd_young(orig_pmd)) {
+ pmdp_invalidate(vma, addr, pmd);
+ orig_pmd = pmd_mkold(orig_pmd);
+
+ set_pmd_at(mm, addr, pmd, orig_pmd);
+ tlb_remove_tlb_entry(tlb, pmd, addr);
+ }
+
+ ClearPageReferenced(page);
+ test_and_clear_page_young(page);
+ list_add(&page->lru, &page_list);
+huge_unlock:
+ spin_unlock(ptl);
+ reclaim_pages(&page_list);
+ return 0;
+ }
+
+ if (pmd_trans_unstable(pmd))
+ return 0;
+regular_page:
+ tlb_change_page_size(tlb, PAGE_SIZE);
+ orig_pte = pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
+ flush_tlb_batched_pending(mm);
+ arch_enter_lazy_mmu_mode();
+ for (; addr < end; pte++, addr += PAGE_SIZE) {
+ ptent = *pte;
+ if (!pte_present(ptent))
+ continue;
+
+ page = vm_normal_page(vma, addr, ptent);
+ if (!page)
+ continue;
+
+ /*
+ * creating a THP page is expensive so split it only if we
+ * are sure it's worth. Split it if we are only owner.
+ */
+ if (PageTransCompound(page)) {
+ if (page_mapcount(page) != 1)
+ break;
+ get_page(page);
+ if (!trylock_page(page)) {
+ put_page(page);
+ break;
+ }
+ pte_unmap_unlock(orig_pte, ptl);
+ if (split_huge_page(page)) {
+ unlock_page(page);
+ put_page(page);
+ pte_offset_map_lock(mm, pmd, addr, &ptl);
+ break;
+ }
+ unlock_page(page);
+ put_page(page);
+ pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
+ pte--;
+ addr -= PAGE_SIZE;
+ continue;
+ }
+
+ VM_BUG_ON_PAGE(PageTransCompound(page), page);
+
+ if (isolate_lru_page(page))
+ continue;
+
+ if (pte_young(ptent)) {
+ ptent = ptep_get_and_clear_full(mm, addr, pte,
+ tlb->fullmm);
+ ptent = pte_mkold(ptent);
+ set_pte_at(mm, addr, pte, ptent);
+ tlb_remove_tlb_entry(tlb, pte, addr);
+ }
+ ClearPageReferenced(page);
+ test_and_clear_page_young(page);
+ list_add(&page->lru, &page_list);
+ }
+
+ arch_leave_lazy_mmu_mode();
+ pte_unmap_unlock(orig_pte, ptl);
+ reclaim_pages(&page_list);
+ cond_resched();
+
+ return 0;
+}
+
+static void madvise_pageout_page_range(struct mmu_gather *tlb,
+ struct vm_area_struct *vma,
+ unsigned long addr, unsigned long end)
+{
+ struct mm_walk pageout_walk = {
+ .pmd_entry = madvise_pageout_pte_range,
+ .mm = vma->vm_mm,
+ .private = tlb,
+ };
+
+ tlb_start_vma(tlb, vma);
+ walk_page_range(addr, end, &pageout_walk);
+ tlb_end_vma(tlb, vma);
+}
+
+static inline bool can_do_pageout(struct vm_area_struct *vma)
+{
+ if (vma_is_anonymous(vma))
+ return true;
+ if (!vma->vm_file)
+ return false;
+ /*
+ * paging out pagecache only for non-anonymous mappings that correspond
+ * to the files the calling process could (if tried) open for writing;
+ * otherwise we'd be including shared non-exclusive mappings, which
+ * opens a side channel.
+ */
+ return inode_owner_or_capable(file_inode(vma->vm_file)) ||
+ inode_permission(file_inode(vma->vm_file), MAY_WRITE) == 0;
+}
+
+static long madvise_pageout(struct vm_area_struct *vma,
+ struct vm_area_struct **prev,
+ unsigned long start_addr, unsigned long end_addr)
+{
+ struct mm_struct *mm = vma->vm_mm;
+ struct mmu_gather tlb;
+
+ *prev = vma;
+ if (!can_madv_lru_vma(vma))
+ return -EINVAL;
+
+ if (!can_do_pageout(vma))
+ return 0;
+
+ lru_add_drain();
+ tlb_gather_mmu(&tlb, mm, start_addr, end_addr);
+ madvise_pageout_page_range(&tlb, vma, start_addr, end_addr);
+ tlb_finish_mmu(&tlb, start_addr, end_addr);
+
+ return 0;
+}
+
static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
unsigned long end, struct mm_walk *walk)
@@ -870,6 +1064,8 @@ madvise_vma(struct vm_area_struct *vma, struct vm_area_struct **prev,
return madvise_willneed(vma, prev, start, end);
case MADV_COLD:
return madvise_cold(vma, prev, start, end);
+ case MADV_PAGEOUT:
+ return madvise_pageout(vma, prev, start, end);
case MADV_FREE:
case MADV_DONTNEED:
return madvise_dontneed_free(vma, prev, start, end, behavior);
@@ -892,6 +1088,7 @@ madvise_behavior_valid(int behavior)
case MADV_DONTNEED:
case MADV_FREE:
case MADV_COLD:
+ case MADV_PAGEOUT:
#ifdef CONFIG_KSM
case MADV_MERGEABLE:
case MADV_UNMERGEABLE:
diff --git a/mm/vmscan.c b/mm/vmscan.c
index ca192b792d4f..bda3c41de767 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2153,6 +2153,61 @@ static void shrink_active_list(unsigned long nr_to_scan,
nr_deactivate, nr_rotated, sc->priority, file);
}
+unsigned long reclaim_pages(struct list_head *page_list)
+{
+ int nid = -1;
+ unsigned long nr_reclaimed = 0;
+ LIST_HEAD(node_page_list);
+ struct reclaim_stat dummy_stat;
+ struct page *page;
+ struct scan_control sc = {
+ .gfp_mask = GFP_KERNEL,
+ .priority = DEF_PRIORITY,
+ .may_writepage = 1,
+ .may_unmap = 1,
+ .may_swap = 1,
+ };
+
+ while (!list_empty(page_list)) {
+ page = lru_to_page(page_list);
+ if (nid == -1) {
+ nid = page_to_nid(page);
+ INIT_LIST_HEAD(&node_page_list);
+ }
+
+ if (nid == page_to_nid(page)) {
+ list_move(&page->lru, &node_page_list);
+ continue;
+ }
+
+ nr_reclaimed += shrink_page_list(&node_page_list,
+ NODE_DATA(nid),
+ &sc, 0,
+ &dummy_stat, false);
+ while (!list_empty(&node_page_list)) {
+ page = lru_to_page(&node_page_list);
+ list_del(&page->lru);
+ putback_lru_page(page);
+ }
+
+ nid = -1;
+ }
+
+ if (!list_empty(&node_page_list)) {
+ nr_reclaimed += shrink_page_list(&node_page_list,
+ NODE_DATA(nid),
+ &sc, 0,
+ &dummy_stat, false);
+ while (!list_empty(&node_page_list)) {
+ page = lru_to_page(&node_page_list);
+ list_del(&page->lru);
+ putback_lru_page(page);
+ }
+ }
+
+ return nr_reclaimed;
+}
+
/*
* The inactive anon list should be small enough that the VM never has
* to do too much work.
--
2.22.0.410.gd8fdbe21b5-goog
On Thu, Jul 11, 2019 at 10:25:25AM +0900, Minchan Kim wrote:
> When a process expects no accesses to a certain memory range, it could
> give a hint to kernel that the pages can be reclaimed when memory pressure
> happens but data should be preserved for future use. This could reduce
> workingset eviction so it ends up increasing performance.
>
> This patch introduces the new MADV_COLD hint to madvise(2) syscall.
> MADV_COLD can be used by a process to mark a memory range as not expected
> to be used in the near future. The hint can help kernel in deciding which
> pages to evict early during memory pressure.
>
> It works for every LRU pages like MADV_[DONTNEED|FREE]. IOW, It moves
>
> active file page -> inactive file LRU
> active anon page -> inacdtive anon LRU
>
> Unlike MADV_FREE, it doesn't move active anonymous pages to inactive
> file LRU's head because MADV_COLD is a little bit different symantic.
> MADV_FREE means it's okay to discard when the memory pressure because
> the content of the page is *garbage* so freeing such pages is almost zero
> overhead since we don't need to swap out and access afterward causes just
> minor fault. Thus, it would make sense to put those freeable pages in
> inactive file LRU to compete other used-once pages. It makes sense for
> implmentaion point of view, too because it's not swapbacked memory any
> longer until it would be re-dirtied. Even, it could give a bonus to make
> them be reclaimed on swapless system. However, MADV_COLD doesn't mean
> garbage so reclaiming them requires swap-out/in in the end so it's bigger
> cost. Since we have designed VM LRU aging based on cost-model, anonymous
> cold pages would be better to position inactive anon's LRU list, not file
> LRU. Furthermore, it would help to avoid unnecessary scanning if system
> doesn't have a swap device. Let's start simpler way without adding
> complexity at this moment. However, keep in mind, too that it's a caveat
> that workloads with a lot of pages cache are likely to ignore MADV_COLD
> on anonymous memory because we rarely age anonymous LRU lists.
>
> * man-page material
>
> MADV_COLD (since Linux x.x)
>
> Pages in the specified regions will be treated as less-recently-accessed
> compared to pages in the system with similar access frequencies.
> In contrast to MADV_FREE, the contents of the region are preserved
> regardless of subsequent writes to pages.
>
> MADV_COLD cannot be applied to locked pages, Huge TLB pages, or VM_PFNMAP
> pages.
>
> * v2
> * add up the warn with lots of page cache workload - mhocko
> * add man page stuff - dave
>
> * v1
> * remove page_mapcount filter - hannes, mhocko
> * remove idle page handling - joelaf
>
> * RFCv2
> * add more description - mhocko
>
> * RFCv1
> * renaming from MADV_COOL to MADV_COLD - hannes
>
> * internal review
> * use clear_page_youn in deactivate_page - joelaf
> * Revise the description - surenb
> * Renaming from MADV_WARM to MADV_COOL - surenb
>
> Acked-by: Michal Hocko <[email protected]>
> Signed-off-by: Minchan Kim <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
On Thu, Jul 11, 2019 at 10:25:27AM +0900, Minchan Kim wrote:
> The isolate counting is pecpu counter so it would be not huge gain
> to work them by batch. Rather than complicating to make them batch,
> let's make it more stright-foward via adding the counting logic
> into [isolate|putback]_lru_page API.
>
> * v1
> * fix accounting bug - Hillf
>
> Link: http://lkml.kernel.org/r/[email protected]
> Acked-by: Michal Hocko <[email protected]>
> Suggested-by: Johannes Weiner <[email protected]>
> Signed-off-by: Minchan Kim <[email protected]>
This is tricky to review, but fwiw it looks correct to me.
Acked-by: Johannes Weiner <[email protected]>
On Thu, Jul 11, 2019 at 10:25:28AM +0900, Minchan Kim wrote:
> @@ -480,6 +482,198 @@ static long madvise_cold(struct vm_area_struct *vma,
> return 0;
> }
>
> +static int madvise_pageout_pte_range(pmd_t *pmd, unsigned long addr,
> + unsigned long end, struct mm_walk *walk)
> +{
> + struct mmu_gather *tlb = walk->private;
> + struct mm_struct *mm = tlb->mm;
> + struct vm_area_struct *vma = walk->vma;
> + pte_t *orig_pte, *pte, ptent;
> + spinlock_t *ptl;
> + LIST_HEAD(page_list);
> + struct page *page;
> + unsigned long next;
> +
> + if (fatal_signal_pending(current))
> + return -EINTR;
> +
> + next = pmd_addr_end(addr, end);
> + if (pmd_trans_huge(*pmd)) {
> + pmd_t orig_pmd;
> +
> + tlb_change_page_size(tlb, HPAGE_PMD_SIZE);
> + ptl = pmd_trans_huge_lock(pmd, vma);
> + if (!ptl)
> + return 0;
> +
> + orig_pmd = *pmd;
> + if (is_huge_zero_pmd(orig_pmd))
> + goto huge_unlock;
> +
> + if (unlikely(!pmd_present(orig_pmd))) {
> + VM_BUG_ON(thp_migration_supported() &&
> + !is_pmd_migration_entry(orig_pmd));
> + goto huge_unlock;
> + }
> +
> + page = pmd_page(orig_pmd);
> + if (next - addr != HPAGE_PMD_SIZE) {
> + int err;
> +
> + if (page_mapcount(page) != 1)
> + goto huge_unlock;
> + get_page(page);
> + spin_unlock(ptl);
> + lock_page(page);
> + err = split_huge_page(page);
> + unlock_page(page);
> + put_page(page);
> + if (!err)
> + goto regular_page;
> + return 0;
> + }
> +
> + if (isolate_lru_page(page))
> + goto huge_unlock;
> +
> + if (pmd_young(orig_pmd)) {
> + pmdp_invalidate(vma, addr, pmd);
> + orig_pmd = pmd_mkold(orig_pmd);
> +
> + set_pmd_at(mm, addr, pmd, orig_pmd);
> + tlb_remove_tlb_entry(tlb, pmd, addr);
> + }
> +
> + ClearPageReferenced(page);
> + test_and_clear_page_young(page);
> + list_add(&page->lru, &page_list);
> +huge_unlock:
> + spin_unlock(ptl);
> + reclaim_pages(&page_list);
> + return 0;
> + }
> +
> + if (pmd_trans_unstable(pmd))
> + return 0;
> +regular_page:
> + tlb_change_page_size(tlb, PAGE_SIZE);
> + orig_pte = pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
> + flush_tlb_batched_pending(mm);
> + arch_enter_lazy_mmu_mode();
> + for (; addr < end; pte++, addr += PAGE_SIZE) {
> + ptent = *pte;
> + if (!pte_present(ptent))
> + continue;
> +
> + page = vm_normal_page(vma, addr, ptent);
> + if (!page)
> + continue;
> +
> + /*
> + * creating a THP page is expensive so split it only if we
> + * are sure it's worth. Split it if we are only owner.
> + */
> + if (PageTransCompound(page)) {
> + if (page_mapcount(page) != 1)
> + break;
> + get_page(page);
> + if (!trylock_page(page)) {
> + put_page(page);
> + break;
> + }
> + pte_unmap_unlock(orig_pte, ptl);
> + if (split_huge_page(page)) {
> + unlock_page(page);
> + put_page(page);
> + pte_offset_map_lock(mm, pmd, addr, &ptl);
> + break;
> + }
> + unlock_page(page);
> + put_page(page);
> + pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
> + pte--;
> + addr -= PAGE_SIZE;
> + continue;
> + }
> +
> + VM_BUG_ON_PAGE(PageTransCompound(page), page);
> +
> + if (isolate_lru_page(page))
> + continue;
> +
> + if (pte_young(ptent)) {
> + ptent = ptep_get_and_clear_full(mm, addr, pte,
> + tlb->fullmm);
> + ptent = pte_mkold(ptent);
> + set_pte_at(mm, addr, pte, ptent);
> + tlb_remove_tlb_entry(tlb, pte, addr);
> + }
> + ClearPageReferenced(page);
> + test_and_clear_page_young(page);
> + list_add(&page->lru, &page_list);
> + }
> +
> + arch_leave_lazy_mmu_mode();
> + pte_unmap_unlock(orig_pte, ptl);
> + reclaim_pages(&page_list);
> + cond_resched();
> +
> + return 0;
> +}
I know you have briefly talked about code sharing already.
While I agree that sharing with MADV_FREE is maybe a stretch, I
applied these patches and compared the pageout and the cold page table
functions, and they are line for line the same EXCEPT for 2-3 lines at
the very end, where one reclaims and the other deactivates. It would
be good to share here, it shouldn't be hard or result in fragile code.
Something like int madvise_cold_or_pageout_range(..., bool pageout)?
Hi Johannes,
On Thu, Jul 11, 2019 at 02:42:23PM -0400, Johannes Weiner wrote:
> On Thu, Jul 11, 2019 at 10:25:28AM +0900, Minchan Kim wrote:
> > @@ -480,6 +482,198 @@ static long madvise_cold(struct vm_area_struct *vma,
> > return 0;
> > }
> >
> > +static int madvise_pageout_pte_range(pmd_t *pmd, unsigned long addr,
> > + unsigned long end, struct mm_walk *walk)
> > +{
> > + struct mmu_gather *tlb = walk->private;
> > + struct mm_struct *mm = tlb->mm;
> > + struct vm_area_struct *vma = walk->vma;
> > + pte_t *orig_pte, *pte, ptent;
> > + spinlock_t *ptl;
> > + LIST_HEAD(page_list);
> > + struct page *page;
> > + unsigned long next;
> > +
> > + if (fatal_signal_pending(current))
> > + return -EINTR;
> > +
> > + next = pmd_addr_end(addr, end);
> > + if (pmd_trans_huge(*pmd)) {
> > + pmd_t orig_pmd;
> > +
> > + tlb_change_page_size(tlb, HPAGE_PMD_SIZE);
> > + ptl = pmd_trans_huge_lock(pmd, vma);
> > + if (!ptl)
> > + return 0;
> > +
> > + orig_pmd = *pmd;
> > + if (is_huge_zero_pmd(orig_pmd))
> > + goto huge_unlock;
> > +
> > + if (unlikely(!pmd_present(orig_pmd))) {
> > + VM_BUG_ON(thp_migration_supported() &&
> > + !is_pmd_migration_entry(orig_pmd));
> > + goto huge_unlock;
> > + }
> > +
> > + page = pmd_page(orig_pmd);
> > + if (next - addr != HPAGE_PMD_SIZE) {
> > + int err;
> > +
> > + if (page_mapcount(page) != 1)
> > + goto huge_unlock;
> > + get_page(page);
> > + spin_unlock(ptl);
> > + lock_page(page);
> > + err = split_huge_page(page);
> > + unlock_page(page);
> > + put_page(page);
> > + if (!err)
> > + goto regular_page;
> > + return 0;
> > + }
> > +
> > + if (isolate_lru_page(page))
> > + goto huge_unlock;
> > +
> > + if (pmd_young(orig_pmd)) {
> > + pmdp_invalidate(vma, addr, pmd);
> > + orig_pmd = pmd_mkold(orig_pmd);
> > +
> > + set_pmd_at(mm, addr, pmd, orig_pmd);
> > + tlb_remove_tlb_entry(tlb, pmd, addr);
> > + }
> > +
> > + ClearPageReferenced(page);
> > + test_and_clear_page_young(page);
> > + list_add(&page->lru, &page_list);
> > +huge_unlock:
> > + spin_unlock(ptl);
> > + reclaim_pages(&page_list);
> > + return 0;
> > + }
> > +
> > + if (pmd_trans_unstable(pmd))
> > + return 0;
> > +regular_page:
> > + tlb_change_page_size(tlb, PAGE_SIZE);
> > + orig_pte = pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
> > + flush_tlb_batched_pending(mm);
> > + arch_enter_lazy_mmu_mode();
> > + for (; addr < end; pte++, addr += PAGE_SIZE) {
> > + ptent = *pte;
> > + if (!pte_present(ptent))
> > + continue;
> > +
> > + page = vm_normal_page(vma, addr, ptent);
> > + if (!page)
> > + continue;
> > +
> > + /*
> > + * creating a THP page is expensive so split it only if we
> > + * are sure it's worth. Split it if we are only owner.
> > + */
> > + if (PageTransCompound(page)) {
> > + if (page_mapcount(page) != 1)
> > + break;
> > + get_page(page);
> > + if (!trylock_page(page)) {
> > + put_page(page);
> > + break;
> > + }
> > + pte_unmap_unlock(orig_pte, ptl);
> > + if (split_huge_page(page)) {
> > + unlock_page(page);
> > + put_page(page);
> > + pte_offset_map_lock(mm, pmd, addr, &ptl);
> > + break;
> > + }
> > + unlock_page(page);
> > + put_page(page);
> > + pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
> > + pte--;
> > + addr -= PAGE_SIZE;
> > + continue;
> > + }
> > +
> > + VM_BUG_ON_PAGE(PageTransCompound(page), page);
> > +
> > + if (isolate_lru_page(page))
> > + continue;
> > +
> > + if (pte_young(ptent)) {
> > + ptent = ptep_get_and_clear_full(mm, addr, pte,
> > + tlb->fullmm);
> > + ptent = pte_mkold(ptent);
> > + set_pte_at(mm, addr, pte, ptent);
> > + tlb_remove_tlb_entry(tlb, pte, addr);
> > + }
> > + ClearPageReferenced(page);
> > + test_and_clear_page_young(page);
> > + list_add(&page->lru, &page_list);
> > + }
> > +
> > + arch_leave_lazy_mmu_mode();
> > + pte_unmap_unlock(orig_pte, ptl);
> > + reclaim_pages(&page_list);
> > + cond_resched();
> > +
> > + return 0;
> > +}
>
> I know you have briefly talked about code sharing already.
>
> While I agree that sharing with MADV_FREE is maybe a stretch, I
> applied these patches and compared the pageout and the cold page table
> functions, and they are line for line the same EXCEPT for 2-3 lines at
> the very end, where one reclaims and the other deactivates. It would
> be good to share here, it shouldn't be hard or result in fragile code.
Fair enough if we leave MADV_FREE.
>
> Something like int madvise_cold_or_pageout_range(..., bool pageout)?
How about this?
From 41592f23e876ec21e49dc3c76dc89538e2bb16be Mon Sep 17 00:00:00 2001
From: Minchan Kim <[email protected]>
Date: Fri, 12 Jul 2019 14:05:36 +0900
Subject: [PATCH] mm: factor out common parts between MADV_COLD and
MADV_PAGEOUT
There are many common parts between MADV_COLD and MADV_PAGEOUT.
This patch factor them out to save code duplication.
Signed-off-by: Minchan Kim <[email protected]>
---
mm/madvise.c | 201 +++++++++++++--------------------------------------
1 file changed, 52 insertions(+), 149 deletions(-)
diff --git a/mm/madvise.c b/mm/madvise.c
index bc2f0138982e..3d3d14517cc8 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -30,6 +30,11 @@
#include "internal.h"
+struct madvise_walk_private {
+ struct mmu_gather *tlb;
+ bool pageout;
+};
+
/*
* Any behaviour which results in changes to the vma->vm_flags needs to
* take mmap_sem for writing. Others, which simply traverse vmas, need
@@ -310,16 +315,23 @@ static long madvise_willneed(struct vm_area_struct *vma,
return 0;
}
-static int madvise_cold_pte_range(pmd_t *pmd, unsigned long addr,
- unsigned long end, struct mm_walk *walk)
+static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
+ unsigned long addr, unsigned long end,
+ struct mm_walk *walk)
{
- struct mmu_gather *tlb = walk->private;
+ struct madvise_walk_private *private = walk->private;
+ struct mmu_gather *tlb = private->tlb;
+ bool pageout = private->pageout;
struct mm_struct *mm = tlb->mm;
struct vm_area_struct *vma = walk->vma;
pte_t *orig_pte, *pte, ptent;
spinlock_t *ptl;
- struct page *page;
unsigned long next;
+ struct page *page = NULL;
+ LIST_HEAD(page_list);
+
+ if (fatal_signal_pending(current))
+ return -EINTR;
next = pmd_addr_end(addr, end);
if (pmd_trans_huge(*pmd)) {
@@ -358,6 +370,12 @@ static int madvise_cold_pte_range(pmd_t *pmd, unsigned long addr,
return 0;
}
+ if (pageout) {
+ if (isolate_lru_page(page))
+ goto huge_unlock;
+ list_add(&page->lru, &page_list);
+ }
+
if (pmd_young(orig_pmd)) {
pmdp_invalidate(vma, addr, pmd);
orig_pmd = pmd_mkold(orig_pmd);
@@ -366,10 +384,14 @@ static int madvise_cold_pte_range(pmd_t *pmd, unsigned long addr,
tlb_remove_pmd_tlb_entry(tlb, pmd, addr);
}
+ ClearPageReferenced(page);
test_and_clear_page_young(page);
- deactivate_page(page);
huge_unlock:
spin_unlock(ptl);
+ if (pageout)
+ reclaim_pages(&page_list);
+ else
+ deactivate_page(page);
return 0;
}
@@ -423,6 +445,12 @@ static int madvise_cold_pte_range(pmd_t *pmd, unsigned long addr,
VM_BUG_ON_PAGE(PageTransCompound(page), page);
+ if (pageout) {
+ if (isolate_lru_page(page))
+ continue;
+ list_add(&page->lru, &page_list);
+ }
+
if (pte_young(ptent)) {
ptent = ptep_get_and_clear_full(mm, addr, pte,
tlb->fullmm);
@@ -437,12 +465,16 @@ static int madvise_cold_pte_range(pmd_t *pmd, unsigned long addr,
* As a side effect, it makes confuse idle-page tracking
* because they will miss recent referenced history.
*/
+ ClearPageReferenced(page);
test_and_clear_page_young(page);
- deactivate_page(page);
+ if (!pageout)
+ deactivate_page(page);
}
arch_enter_lazy_mmu_mode();
pte_unmap_unlock(orig_pte, ptl);
+ if (pageout)
+ reclaim_pages(&page_list);
cond_resched();
return 0;
@@ -452,10 +484,15 @@ static void madvise_cold_page_range(struct mmu_gather *tlb,
struct vm_area_struct *vma,
unsigned long addr, unsigned long end)
{
+ struct madvise_walk_private walk_private = {
+ .tlb = tlb,
+ .pageout = false,
+ };
+
struct mm_walk cold_walk = {
- .pmd_entry = madvise_cold_pte_range,
+ .pmd_entry = madvise_cold_or_pageout_pte_range,
.mm = vma->vm_mm,
- .private = tlb,
+ .private = &walk_private,
};
tlb_start_vma(tlb, vma);
@@ -482,153 +519,19 @@ static long madvise_cold(struct vm_area_struct *vma,
return 0;
}
-static int madvise_pageout_pte_range(pmd_t *pmd, unsigned long addr,
- unsigned long end, struct mm_walk *walk)
-{
- struct mmu_gather *tlb = walk->private;
- struct mm_struct *mm = tlb->mm;
- struct vm_area_struct *vma = walk->vma;
- pte_t *orig_pte, *pte, ptent;
- spinlock_t *ptl;
- LIST_HEAD(page_list);
- struct page *page;
- unsigned long next;
-
- if (fatal_signal_pending(current))
- return -EINTR;
-
- next = pmd_addr_end(addr, end);
- if (pmd_trans_huge(*pmd)) {
- pmd_t orig_pmd;
-
- tlb_change_page_size(tlb, HPAGE_PMD_SIZE);
- ptl = pmd_trans_huge_lock(pmd, vma);
- if (!ptl)
- return 0;
-
- orig_pmd = *pmd;
- if (is_huge_zero_pmd(orig_pmd))
- goto huge_unlock;
-
- if (unlikely(!pmd_present(orig_pmd))) {
- VM_BUG_ON(thp_migration_supported() &&
- !is_pmd_migration_entry(orig_pmd));
- goto huge_unlock;
- }
-
- page = pmd_page(orig_pmd);
- if (next - addr != HPAGE_PMD_SIZE) {
- int err;
-
- if (page_mapcount(page) != 1)
- goto huge_unlock;
- get_page(page);
- spin_unlock(ptl);
- lock_page(page);
- err = split_huge_page(page);
- unlock_page(page);
- put_page(page);
- if (!err)
- goto regular_page;
- return 0;
- }
-
- if (isolate_lru_page(page))
- goto huge_unlock;
-
- if (pmd_young(orig_pmd)) {
- pmdp_invalidate(vma, addr, pmd);
- orig_pmd = pmd_mkold(orig_pmd);
-
- set_pmd_at(mm, addr, pmd, orig_pmd);
- tlb_remove_tlb_entry(tlb, pmd, addr);
- }
-
- ClearPageReferenced(page);
- test_and_clear_page_young(page);
- list_add(&page->lru, &page_list);
-huge_unlock:
- spin_unlock(ptl);
- reclaim_pages(&page_list);
- return 0;
- }
-
- if (pmd_trans_unstable(pmd))
- return 0;
-regular_page:
- tlb_change_page_size(tlb, PAGE_SIZE);
- orig_pte = pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
- flush_tlb_batched_pending(mm);
- arch_enter_lazy_mmu_mode();
- for (; addr < end; pte++, addr += PAGE_SIZE) {
- ptent = *pte;
- if (!pte_present(ptent))
- continue;
-
- page = vm_normal_page(vma, addr, ptent);
- if (!page)
- continue;
-
- /*
- * creating a THP page is expensive so split it only if we
- * are sure it's worth. Split it if we are only owner.
- */
- if (PageTransCompound(page)) {
- if (page_mapcount(page) != 1)
- break;
- get_page(page);
- if (!trylock_page(page)) {
- put_page(page);
- break;
- }
- pte_unmap_unlock(orig_pte, ptl);
- if (split_huge_page(page)) {
- unlock_page(page);
- put_page(page);
- pte_offset_map_lock(mm, pmd, addr, &ptl);
- break;
- }
- unlock_page(page);
- put_page(page);
- pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
- pte--;
- addr -= PAGE_SIZE;
- continue;
- }
-
- VM_BUG_ON_PAGE(PageTransCompound(page), page);
-
- if (isolate_lru_page(page))
- continue;
-
- if (pte_young(ptent)) {
- ptent = ptep_get_and_clear_full(mm, addr, pte,
- tlb->fullmm);
- ptent = pte_mkold(ptent);
- set_pte_at(mm, addr, pte, ptent);
- tlb_remove_tlb_entry(tlb, pte, addr);
- }
- ClearPageReferenced(page);
- test_and_clear_page_young(page);
- list_add(&page->lru, &page_list);
- }
-
- arch_leave_lazy_mmu_mode();
- pte_unmap_unlock(orig_pte, ptl);
- reclaim_pages(&page_list);
- cond_resched();
-
- return 0;
-}
-
static void madvise_pageout_page_range(struct mmu_gather *tlb,
struct vm_area_struct *vma,
unsigned long addr, unsigned long end)
{
+ struct madvise_walk_private walk_private = {
+ .pageout = true,
+ .tlb = tlb,
+ };
+
struct mm_walk pageout_walk = {
- .pmd_entry = madvise_pageout_pte_range,
+ .pmd_entry = madvise_cold_or_pageout_pte_range,
.mm = vma->vm_mm,
- .private = tlb,
+ .private = &walk_private,
};
tlb_start_vma(tlb, vma);
--
2.22.0.410.gd8fdbe21b5-goog
On Fri 12-07-19 14:18:28, Minchan Kim wrote:
[...]
> >From 41592f23e876ec21e49dc3c76dc89538e2bb16be Mon Sep 17 00:00:00 2001
> From: Minchan Kim <[email protected]>
> Date: Fri, 12 Jul 2019 14:05:36 +0900
> Subject: [PATCH] mm: factor out common parts between MADV_COLD and
> MADV_PAGEOUT
>
> There are many common parts between MADV_COLD and MADV_PAGEOUT.
> This patch factor them out to save code duplication.
This looks better indeed. I still hope that this can get improved even
further but let's do that in a follow up patch.
> Signed-off-by: Minchan Kim <[email protected]>
Acked-by: Michal Hocko <[email protected]>
> ---
> mm/madvise.c | 201 +++++++++++++--------------------------------------
> 1 file changed, 52 insertions(+), 149 deletions(-)
>
> diff --git a/mm/madvise.c b/mm/madvise.c
> index bc2f0138982e..3d3d14517cc8 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -30,6 +30,11 @@
>
> #include "internal.h"
>
> +struct madvise_walk_private {
> + struct mmu_gather *tlb;
> + bool pageout;
> +};
> +
> /*
> * Any behaviour which results in changes to the vma->vm_flags needs to
> * take mmap_sem for writing. Others, which simply traverse vmas, need
> @@ -310,16 +315,23 @@ static long madvise_willneed(struct vm_area_struct *vma,
> return 0;
> }
>
> -static int madvise_cold_pte_range(pmd_t *pmd, unsigned long addr,
> - unsigned long end, struct mm_walk *walk)
> +static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
> + unsigned long addr, unsigned long end,
> + struct mm_walk *walk)
> {
> - struct mmu_gather *tlb = walk->private;
> + struct madvise_walk_private *private = walk->private;
> + struct mmu_gather *tlb = private->tlb;
> + bool pageout = private->pageout;
> struct mm_struct *mm = tlb->mm;
> struct vm_area_struct *vma = walk->vma;
> pte_t *orig_pte, *pte, ptent;
> spinlock_t *ptl;
> - struct page *page;
> unsigned long next;
> + struct page *page = NULL;
> + LIST_HEAD(page_list);
> +
> + if (fatal_signal_pending(current))
> + return -EINTR;
>
> next = pmd_addr_end(addr, end);
> if (pmd_trans_huge(*pmd)) {
> @@ -358,6 +370,12 @@ static int madvise_cold_pte_range(pmd_t *pmd, unsigned long addr,
> return 0;
> }
>
> + if (pageout) {
> + if (isolate_lru_page(page))
> + goto huge_unlock;
> + list_add(&page->lru, &page_list);
> + }
> +
> if (pmd_young(orig_pmd)) {
> pmdp_invalidate(vma, addr, pmd);
> orig_pmd = pmd_mkold(orig_pmd);
> @@ -366,10 +384,14 @@ static int madvise_cold_pte_range(pmd_t *pmd, unsigned long addr,
> tlb_remove_pmd_tlb_entry(tlb, pmd, addr);
> }
>
> + ClearPageReferenced(page);
> test_and_clear_page_young(page);
> - deactivate_page(page);
> huge_unlock:
> spin_unlock(ptl);
> + if (pageout)
> + reclaim_pages(&page_list);
> + else
> + deactivate_page(page);
> return 0;
> }
>
> @@ -423,6 +445,12 @@ static int madvise_cold_pte_range(pmd_t *pmd, unsigned long addr,
>
> VM_BUG_ON_PAGE(PageTransCompound(page), page);
>
> + if (pageout) {
> + if (isolate_lru_page(page))
> + continue;
> + list_add(&page->lru, &page_list);
> + }
> +
> if (pte_young(ptent)) {
> ptent = ptep_get_and_clear_full(mm, addr, pte,
> tlb->fullmm);
> @@ -437,12 +465,16 @@ static int madvise_cold_pte_range(pmd_t *pmd, unsigned long addr,
> * As a side effect, it makes confuse idle-page tracking
> * because they will miss recent referenced history.
> */
> + ClearPageReferenced(page);
> test_and_clear_page_young(page);
> - deactivate_page(page);
> + if (!pageout)
> + deactivate_page(page);
> }
>
> arch_enter_lazy_mmu_mode();
> pte_unmap_unlock(orig_pte, ptl);
> + if (pageout)
> + reclaim_pages(&page_list);
> cond_resched();
>
> return 0;
> @@ -452,10 +484,15 @@ static void madvise_cold_page_range(struct mmu_gather *tlb,
> struct vm_area_struct *vma,
> unsigned long addr, unsigned long end)
> {
> + struct madvise_walk_private walk_private = {
> + .tlb = tlb,
> + .pageout = false,
> + };
> +
> struct mm_walk cold_walk = {
> - .pmd_entry = madvise_cold_pte_range,
> + .pmd_entry = madvise_cold_or_pageout_pte_range,
> .mm = vma->vm_mm,
> - .private = tlb,
> + .private = &walk_private,
> };
>
> tlb_start_vma(tlb, vma);
> @@ -482,153 +519,19 @@ static long madvise_cold(struct vm_area_struct *vma,
> return 0;
> }
>
> -static int madvise_pageout_pte_range(pmd_t *pmd, unsigned long addr,
> - unsigned long end, struct mm_walk *walk)
> -{
> - struct mmu_gather *tlb = walk->private;
> - struct mm_struct *mm = tlb->mm;
> - struct vm_area_struct *vma = walk->vma;
> - pte_t *orig_pte, *pte, ptent;
> - spinlock_t *ptl;
> - LIST_HEAD(page_list);
> - struct page *page;
> - unsigned long next;
> -
> - if (fatal_signal_pending(current))
> - return -EINTR;
> -
> - next = pmd_addr_end(addr, end);
> - if (pmd_trans_huge(*pmd)) {
> - pmd_t orig_pmd;
> -
> - tlb_change_page_size(tlb, HPAGE_PMD_SIZE);
> - ptl = pmd_trans_huge_lock(pmd, vma);
> - if (!ptl)
> - return 0;
> -
> - orig_pmd = *pmd;
> - if (is_huge_zero_pmd(orig_pmd))
> - goto huge_unlock;
> -
> - if (unlikely(!pmd_present(orig_pmd))) {
> - VM_BUG_ON(thp_migration_supported() &&
> - !is_pmd_migration_entry(orig_pmd));
> - goto huge_unlock;
> - }
> -
> - page = pmd_page(orig_pmd);
> - if (next - addr != HPAGE_PMD_SIZE) {
> - int err;
> -
> - if (page_mapcount(page) != 1)
> - goto huge_unlock;
> - get_page(page);
> - spin_unlock(ptl);
> - lock_page(page);
> - err = split_huge_page(page);
> - unlock_page(page);
> - put_page(page);
> - if (!err)
> - goto regular_page;
> - return 0;
> - }
> -
> - if (isolate_lru_page(page))
> - goto huge_unlock;
> -
> - if (pmd_young(orig_pmd)) {
> - pmdp_invalidate(vma, addr, pmd);
> - orig_pmd = pmd_mkold(orig_pmd);
> -
> - set_pmd_at(mm, addr, pmd, orig_pmd);
> - tlb_remove_tlb_entry(tlb, pmd, addr);
> - }
> -
> - ClearPageReferenced(page);
> - test_and_clear_page_young(page);
> - list_add(&page->lru, &page_list);
> -huge_unlock:
> - spin_unlock(ptl);
> - reclaim_pages(&page_list);
> - return 0;
> - }
> -
> - if (pmd_trans_unstable(pmd))
> - return 0;
> -regular_page:
> - tlb_change_page_size(tlb, PAGE_SIZE);
> - orig_pte = pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
> - flush_tlb_batched_pending(mm);
> - arch_enter_lazy_mmu_mode();
> - for (; addr < end; pte++, addr += PAGE_SIZE) {
> - ptent = *pte;
> - if (!pte_present(ptent))
> - continue;
> -
> - page = vm_normal_page(vma, addr, ptent);
> - if (!page)
> - continue;
> -
> - /*
> - * creating a THP page is expensive so split it only if we
> - * are sure it's worth. Split it if we are only owner.
> - */
> - if (PageTransCompound(page)) {
> - if (page_mapcount(page) != 1)
> - break;
> - get_page(page);
> - if (!trylock_page(page)) {
> - put_page(page);
> - break;
> - }
> - pte_unmap_unlock(orig_pte, ptl);
> - if (split_huge_page(page)) {
> - unlock_page(page);
> - put_page(page);
> - pte_offset_map_lock(mm, pmd, addr, &ptl);
> - break;
> - }
> - unlock_page(page);
> - put_page(page);
> - pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
> - pte--;
> - addr -= PAGE_SIZE;
> - continue;
> - }
> -
> - VM_BUG_ON_PAGE(PageTransCompound(page), page);
> -
> - if (isolate_lru_page(page))
> - continue;
> -
> - if (pte_young(ptent)) {
> - ptent = ptep_get_and_clear_full(mm, addr, pte,
> - tlb->fullmm);
> - ptent = pte_mkold(ptent);
> - set_pte_at(mm, addr, pte, ptent);
> - tlb_remove_tlb_entry(tlb, pte, addr);
> - }
> - ClearPageReferenced(page);
> - test_and_clear_page_young(page);
> - list_add(&page->lru, &page_list);
> - }
> -
> - arch_leave_lazy_mmu_mode();
> - pte_unmap_unlock(orig_pte, ptl);
> - reclaim_pages(&page_list);
> - cond_resched();
> -
> - return 0;
> -}
> -
> static void madvise_pageout_page_range(struct mmu_gather *tlb,
> struct vm_area_struct *vma,
> unsigned long addr, unsigned long end)
> {
> + struct madvise_walk_private walk_private = {
> + .pageout = true,
> + .tlb = tlb,
> + };
> +
> struct mm_walk pageout_walk = {
> - .pmd_entry = madvise_pageout_pte_range,
> + .pmd_entry = madvise_cold_or_pageout_pte_range,
> .mm = vma->vm_mm,
> - .private = tlb,
> + .private = &walk_private,
> };
>
> tlb_start_vma(tlb, vma);
> --
> 2.22.0.410.gd8fdbe21b5-goog
--
Michal Hocko
SUSE Labs
On Fri, Jul 12, 2019 at 02:18:28PM +0900, Minchan Kim wrote:
> Hi Johannes,
>
> On Thu, Jul 11, 2019 at 02:42:23PM -0400, Johannes Weiner wrote:
> > On Thu, Jul 11, 2019 at 10:25:28AM +0900, Minchan Kim wrote:
> > > @@ -480,6 +482,198 @@ static long madvise_cold(struct vm_area_struct *vma,
> > > return 0;
> > > }
> > >
> > > +static int madvise_pageout_pte_range(pmd_t *pmd, unsigned long addr,
> > > + unsigned long end, struct mm_walk *walk)
> > > +{
> > > + struct mmu_gather *tlb = walk->private;
> > > + struct mm_struct *mm = tlb->mm;
> > > + struct vm_area_struct *vma = walk->vma;
> > > + pte_t *orig_pte, *pte, ptent;
> > > + spinlock_t *ptl;
> > > + LIST_HEAD(page_list);
> > > + struct page *page;
> > > + unsigned long next;
> > > +
> > > + if (fatal_signal_pending(current))
> > > + return -EINTR;
> > > +
> > > + next = pmd_addr_end(addr, end);
> > > + if (pmd_trans_huge(*pmd)) {
> > > + pmd_t orig_pmd;
> > > +
> > > + tlb_change_page_size(tlb, HPAGE_PMD_SIZE);
> > > + ptl = pmd_trans_huge_lock(pmd, vma);
> > > + if (!ptl)
> > > + return 0;
> > > +
> > > + orig_pmd = *pmd;
> > > + if (is_huge_zero_pmd(orig_pmd))
> > > + goto huge_unlock;
> > > +
> > > + if (unlikely(!pmd_present(orig_pmd))) {
> > > + VM_BUG_ON(thp_migration_supported() &&
> > > + !is_pmd_migration_entry(orig_pmd));
> > > + goto huge_unlock;
> > > + }
> > > +
> > > + page = pmd_page(orig_pmd);
> > > + if (next - addr != HPAGE_PMD_SIZE) {
> > > + int err;
> > > +
> > > + if (page_mapcount(page) != 1)
> > > + goto huge_unlock;
> > > + get_page(page);
> > > + spin_unlock(ptl);
> > > + lock_page(page);
> > > + err = split_huge_page(page);
> > > + unlock_page(page);
> > > + put_page(page);
> > > + if (!err)
> > > + goto regular_page;
> > > + return 0;
> > > + }
> > > +
> > > + if (isolate_lru_page(page))
> > > + goto huge_unlock;
> > > +
> > > + if (pmd_young(orig_pmd)) {
> > > + pmdp_invalidate(vma, addr, pmd);
> > > + orig_pmd = pmd_mkold(orig_pmd);
> > > +
> > > + set_pmd_at(mm, addr, pmd, orig_pmd);
> > > + tlb_remove_tlb_entry(tlb, pmd, addr);
> > > + }
> > > +
> > > + ClearPageReferenced(page);
> > > + test_and_clear_page_young(page);
> > > + list_add(&page->lru, &page_list);
> > > +huge_unlock:
> > > + spin_unlock(ptl);
> > > + reclaim_pages(&page_list);
> > > + return 0;
> > > + }
> > > +
> > > + if (pmd_trans_unstable(pmd))
> > > + return 0;
> > > +regular_page:
> > > + tlb_change_page_size(tlb, PAGE_SIZE);
> > > + orig_pte = pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
> > > + flush_tlb_batched_pending(mm);
> > > + arch_enter_lazy_mmu_mode();
> > > + for (; addr < end; pte++, addr += PAGE_SIZE) {
> > > + ptent = *pte;
> > > + if (!pte_present(ptent))
> > > + continue;
> > > +
> > > + page = vm_normal_page(vma, addr, ptent);
> > > + if (!page)
> > > + continue;
> > > +
> > > + /*
> > > + * creating a THP page is expensive so split it only if we
> > > + * are sure it's worth. Split it if we are only owner.
> > > + */
> > > + if (PageTransCompound(page)) {
> > > + if (page_mapcount(page) != 1)
> > > + break;
> > > + get_page(page);
> > > + if (!trylock_page(page)) {
> > > + put_page(page);
> > > + break;
> > > + }
> > > + pte_unmap_unlock(orig_pte, ptl);
> > > + if (split_huge_page(page)) {
> > > + unlock_page(page);
> > > + put_page(page);
> > > + pte_offset_map_lock(mm, pmd, addr, &ptl);
> > > + break;
> > > + }
> > > + unlock_page(page);
> > > + put_page(page);
> > > + pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
> > > + pte--;
> > > + addr -= PAGE_SIZE;
> > > + continue;
> > > + }
> > > +
> > > + VM_BUG_ON_PAGE(PageTransCompound(page), page);
> > > +
> > > + if (isolate_lru_page(page))
> > > + continue;
> > > +
> > > + if (pte_young(ptent)) {
> > > + ptent = ptep_get_and_clear_full(mm, addr, pte,
> > > + tlb->fullmm);
> > > + ptent = pte_mkold(ptent);
> > > + set_pte_at(mm, addr, pte, ptent);
> > > + tlb_remove_tlb_entry(tlb, pte, addr);
> > > + }
> > > + ClearPageReferenced(page);
> > > + test_and_clear_page_young(page);
> > > + list_add(&page->lru, &page_list);
> > > + }
> > > +
> > > + arch_leave_lazy_mmu_mode();
> > > + pte_unmap_unlock(orig_pte, ptl);
> > > + reclaim_pages(&page_list);
> > > + cond_resched();
> > > +
> > > + return 0;
> > > +}
> >
> > I know you have briefly talked about code sharing already.
> >
> > While I agree that sharing with MADV_FREE is maybe a stretch, I
> > applied these patches and compared the pageout and the cold page table
> > functions, and they are line for line the same EXCEPT for 2-3 lines at
> > the very end, where one reclaims and the other deactivates. It would
> > be good to share here, it shouldn't be hard or result in fragile code.
>
> Fair enough if we leave MADV_FREE.
>
> >
> > Something like int madvise_cold_or_pageout_range(..., bool pageout)?
>
> How about this?
>
> From 41592f23e876ec21e49dc3c76dc89538e2bb16be Mon Sep 17 00:00:00 2001
> From: Minchan Kim <[email protected]>
> Date: Fri, 12 Jul 2019 14:05:36 +0900
> Subject: [PATCH] mm: factor out common parts between MADV_COLD and
> MADV_PAGEOUT
>
> There are many common parts between MADV_COLD and MADV_PAGEOUT.
> This patch factor them out to save code duplication.
>
> Signed-off-by: Minchan Kim <[email protected]>
This looks much better, thanks!
> @@ -423,6 +445,12 @@ static int madvise_cold_pte_range(pmd_t *pmd, unsigned long addr,
>
> VM_BUG_ON_PAGE(PageTransCompound(page), page);
>
> + if (pageout) {
> + if (isolate_lru_page(page))
> + continue;
> + list_add(&page->lru, &page_list);
> + }
> +
> if (pte_young(ptent)) {
> ptent = ptep_get_and_clear_full(mm, addr, pte,
> tlb->fullmm);
One thought on the ordering here.
When LRU isolation fails, it would still make sense to clear the young
bit: we cannot reclaim the page as we wanted to, but the user still
provided a clear hint that the page is cold and she won't be touching
it for a while. MADV_PAGEOUT is basically MADV_COLD + try_to_reclaim.
So IMO isolation should go to the end next to deactivate_page().
On Fri 12-07-19 09:58:09, Johannes Weiner wrote:
[...]
> > @@ -423,6 +445,12 @@ static int madvise_cold_pte_range(pmd_t *pmd, unsigned long addr,
> >
> > VM_BUG_ON_PAGE(PageTransCompound(page), page);
> >
> > + if (pageout) {
> > + if (isolate_lru_page(page))
> > + continue;
> > + list_add(&page->lru, &page_list);
> > + }
> > +
> > if (pte_young(ptent)) {
> > ptent = ptep_get_and_clear_full(mm, addr, pte,
> > tlb->fullmm);
>
> One thought on the ordering here.
>
> When LRU isolation fails, it would still make sense to clear the young
> bit: we cannot reclaim the page as we wanted to, but the user still
> provided a clear hint that the page is cold and she won't be touching
> it for a while. MADV_PAGEOUT is basically MADV_COLD + try_to_reclaim.
> So IMO isolation should go to the end next to deactivate_page().
Make sense to me
--
Michal Hocko
SUSE Labs
On Fri, Jul 12, 2019 at 09:58:09AM -0400, Johannes Weiner wrote:
> On Fri, Jul 12, 2019 at 02:18:28PM +0900, Minchan Kim wrote:
> > Hi Johannes,
> >
> > On Thu, Jul 11, 2019 at 02:42:23PM -0400, Johannes Weiner wrote:
> > > On Thu, Jul 11, 2019 at 10:25:28AM +0900, Minchan Kim wrote:
> > > > @@ -480,6 +482,198 @@ static long madvise_cold(struct vm_area_struct *vma,
> > > > return 0;
> > > > }
> > > >
> > > > +static int madvise_pageout_pte_range(pmd_t *pmd, unsigned long addr,
> > > > + unsigned long end, struct mm_walk *walk)
> > > > +{
> > > > + struct mmu_gather *tlb = walk->private;
> > > > + struct mm_struct *mm = tlb->mm;
> > > > + struct vm_area_struct *vma = walk->vma;
> > > > + pte_t *orig_pte, *pte, ptent;
> > > > + spinlock_t *ptl;
> > > > + LIST_HEAD(page_list);
> > > > + struct page *page;
> > > > + unsigned long next;
> > > > +
> > > > + if (fatal_signal_pending(current))
> > > > + return -EINTR;
> > > > +
> > > > + next = pmd_addr_end(addr, end);
> > > > + if (pmd_trans_huge(*pmd)) {
> > > > + pmd_t orig_pmd;
> > > > +
> > > > + tlb_change_page_size(tlb, HPAGE_PMD_SIZE);
> > > > + ptl = pmd_trans_huge_lock(pmd, vma);
> > > > + if (!ptl)
> > > > + return 0;
> > > > +
> > > > + orig_pmd = *pmd;
> > > > + if (is_huge_zero_pmd(orig_pmd))
> > > > + goto huge_unlock;
> > > > +
> > > > + if (unlikely(!pmd_present(orig_pmd))) {
> > > > + VM_BUG_ON(thp_migration_supported() &&
> > > > + !is_pmd_migration_entry(orig_pmd));
> > > > + goto huge_unlock;
> > > > + }
> > > > +
> > > > + page = pmd_page(orig_pmd);
> > > > + if (next - addr != HPAGE_PMD_SIZE) {
> > > > + int err;
> > > > +
> > > > + if (page_mapcount(page) != 1)
> > > > + goto huge_unlock;
> > > > + get_page(page);
> > > > + spin_unlock(ptl);
> > > > + lock_page(page);
> > > > + err = split_huge_page(page);
> > > > + unlock_page(page);
> > > > + put_page(page);
> > > > + if (!err)
> > > > + goto regular_page;
> > > > + return 0;
> > > > + }
> > > > +
> > > > + if (isolate_lru_page(page))
> > > > + goto huge_unlock;
> > > > +
> > > > + if (pmd_young(orig_pmd)) {
> > > > + pmdp_invalidate(vma, addr, pmd);
> > > > + orig_pmd = pmd_mkold(orig_pmd);
> > > > +
> > > > + set_pmd_at(mm, addr, pmd, orig_pmd);
> > > > + tlb_remove_tlb_entry(tlb, pmd, addr);
> > > > + }
> > > > +
> > > > + ClearPageReferenced(page);
> > > > + test_and_clear_page_young(page);
> > > > + list_add(&page->lru, &page_list);
> > > > +huge_unlock:
> > > > + spin_unlock(ptl);
> > > > + reclaim_pages(&page_list);
> > > > + return 0;
> > > > + }
> > > > +
> > > > + if (pmd_trans_unstable(pmd))
> > > > + return 0;
> > > > +regular_page:
> > > > + tlb_change_page_size(tlb, PAGE_SIZE);
> > > > + orig_pte = pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
> > > > + flush_tlb_batched_pending(mm);
> > > > + arch_enter_lazy_mmu_mode();
> > > > + for (; addr < end; pte++, addr += PAGE_SIZE) {
> > > > + ptent = *pte;
> > > > + if (!pte_present(ptent))
> > > > + continue;
> > > > +
> > > > + page = vm_normal_page(vma, addr, ptent);
> > > > + if (!page)
> > > > + continue;
> > > > +
> > > > + /*
> > > > + * creating a THP page is expensive so split it only if we
> > > > + * are sure it's worth. Split it if we are only owner.
> > > > + */
> > > > + if (PageTransCompound(page)) {
> > > > + if (page_mapcount(page) != 1)
> > > > + break;
> > > > + get_page(page);
> > > > + if (!trylock_page(page)) {
> > > > + put_page(page);
> > > > + break;
> > > > + }
> > > > + pte_unmap_unlock(orig_pte, ptl);
> > > > + if (split_huge_page(page)) {
> > > > + unlock_page(page);
> > > > + put_page(page);
> > > > + pte_offset_map_lock(mm, pmd, addr, &ptl);
> > > > + break;
> > > > + }
> > > > + unlock_page(page);
> > > > + put_page(page);
> > > > + pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
> > > > + pte--;
> > > > + addr -= PAGE_SIZE;
> > > > + continue;
> > > > + }
> > > > +
> > > > + VM_BUG_ON_PAGE(PageTransCompound(page), page);
> > > > +
> > > > + if (isolate_lru_page(page))
> > > > + continue;
> > > > +
> > > > + if (pte_young(ptent)) {
> > > > + ptent = ptep_get_and_clear_full(mm, addr, pte,
> > > > + tlb->fullmm);
> > > > + ptent = pte_mkold(ptent);
> > > > + set_pte_at(mm, addr, pte, ptent);
> > > > + tlb_remove_tlb_entry(tlb, pte, addr);
> > > > + }
> > > > + ClearPageReferenced(page);
> > > > + test_and_clear_page_young(page);
> > > > + list_add(&page->lru, &page_list);
> > > > + }
> > > > +
> > > > + arch_leave_lazy_mmu_mode();
> > > > + pte_unmap_unlock(orig_pte, ptl);
> > > > + reclaim_pages(&page_list);
> > > > + cond_resched();
> > > > +
> > > > + return 0;
> > > > +}
> > >
> > > I know you have briefly talked about code sharing already.
> > >
> > > While I agree that sharing with MADV_FREE is maybe a stretch, I
> > > applied these patches and compared the pageout and the cold page table
> > > functions, and they are line for line the same EXCEPT for 2-3 lines at
> > > the very end, where one reclaims and the other deactivates. It would
> > > be good to share here, it shouldn't be hard or result in fragile code.
> >
> > Fair enough if we leave MADV_FREE.
> >
> > >
> > > Something like int madvise_cold_or_pageout_range(..., bool pageout)?
> >
> > How about this?
> >
> > From 41592f23e876ec21e49dc3c76dc89538e2bb16be Mon Sep 17 00:00:00 2001
> > From: Minchan Kim <[email protected]>
> > Date: Fri, 12 Jul 2019 14:05:36 +0900
> > Subject: [PATCH] mm: factor out common parts between MADV_COLD and
> > MADV_PAGEOUT
> >
> > There are many common parts between MADV_COLD and MADV_PAGEOUT.
> > This patch factor them out to save code duplication.
> >
> > Signed-off-by: Minchan Kim <[email protected]>
>
> This looks much better, thanks!
>
> > @@ -423,6 +445,12 @@ static int madvise_cold_pte_range(pmd_t *pmd, unsigned long addr,
> >
> > VM_BUG_ON_PAGE(PageTransCompound(page), page);
> >
> > + if (pageout) {
> > + if (isolate_lru_page(page))
> > + continue;
> > + list_add(&page->lru, &page_list);
> > + }
> > +
> > if (pte_young(ptent)) {
> > ptent = ptep_get_and_clear_full(mm, addr, pte,
> > tlb->fullmm);
>
> One thought on the ordering here.
>
> When LRU isolation fails, it would still make sense to clear the young
> bit: we cannot reclaim the page as we wanted to, but the user still
> provided a clear hint that the page is cold and she won't be touching
> it for a while. MADV_PAGEOUT is basically MADV_COLD + try_to_reclaim.
> So IMO isolation should go to the end next to deactivate_page().
Sure, I will modify MADV_PAGEOUT patch instead of refactoring one.
Thanks for the review, Johannes!