2015-11-12 04:32:45

by Minchan Kim

Subject: [PATCH v3 00/17] MADV_FREE support

MADV_FREE has been sitting in linux-next for a long time. I think there were two reasons.

1. The MADV_FREE code on the reclaim path was a real mess.

2. Andrew really wanted to hear from userland people who want to use
the syscall.

A few months ago, Daniel Micay (an active jemalloc contributor) asked me
to make progress on upstreaming, but I was busy at the time, so it took
me a long while to revisit the code. I finally cleaned up the mess
recently, which addresses issue #1.

In addition, Daniel and Jason (the jemalloc maintainer) recently asked
Andrew about it again and said it would be great to have even with the
current swap dependency, so Andrew decided to take it for v4.4.

However, some concerns remained.

* hotness

Some people think MADV_FREEed pages are really cold while others do not.
See the description of "mm: add knob to tune lazyfreeing" for details.

* swap dependency

In the old version, MADV_FREE was equal to MADV_DONTNEED on a swapless
system because we don't have an aged anonymous LRU list without swap.
So there have been requests for MADV_FREE to support swapless systems.

To address these issues, this version adds a new LRU list for
hinted pages and a tuning knob. With that, we can support swapless
systems without zapping hinted pages instantly.

Please review and comment.

I have tested it on v4.3-rc7 and haven't found any problems so far.

git: git://git.kernel.org/pub/scm/linux/kernel/git/minchan/linux.git
branch: mm/madv_free-v4.3-rc7-v3-lazyfreelru

At this stage, I don't think we need to write a man page.
That can be done once the policy and implementation are solid.

* Changes from v2
* add new LRU list and tuning knob
* support swapless

* Changes from v1
* Don't do unnecessary TLB flush - Shaohua
* Added Acked-by - Hugh, Michal
* Merge deactivate_page and deactivate_file_page
* Add pmd_dirty/pmd_mkclean patches for several arches
* Add lazy THP split patch
* Drop [email protected] - Delivery Failure

Chen Gang (1):
arch: uapi: asm: mman.h: Let MADV_FREE have same value for all
architectures

Minchan Kim (16):
mm: support madvise(MADV_FREE)
mm: define MADV_FREE for some arches
mm: free swp_entry in madvise_free
mm: move lazily freed pages to inactive list
mm: clear PG_dirty to mark page freeable
mm: mark stable page dirty in KSM
x86: add pmd_[dirty|mkclean] for THP
sparc: add pmd_[dirty|mkclean] for THP
powerpc: add pmd_[dirty|mkclean] for THP
arm: add pmd_mkclean for THP
arm64: add pmd_mkclean for THP
mm: don't split THP page when syscall is called
mm: introduce wrappers to add new LRU
mm: introduce lazyfree LRU list
mm: support MADV_FREE on swapless system
mm: add knob to tune lazyfreeing

Documentation/sysctl/vm.txt | 13 +++
arch/alpha/include/uapi/asm/mman.h | 1 +
arch/arm/include/asm/pgtable-3level.h | 1 +
arch/arm64/include/asm/pgtable.h | 1 +
arch/mips/include/uapi/asm/mman.h | 1 +
arch/parisc/include/uapi/asm/mman.h | 1 +
arch/powerpc/include/asm/pgtable-ppc64.h | 2 +
arch/sparc/include/asm/pgtable_64.h | 9 ++
arch/x86/include/asm/pgtable.h | 5 +
arch/xtensa/include/uapi/asm/mman.h | 1 +
drivers/base/node.c | 2 +
drivers/staging/android/lowmemorykiller.c | 3 +-
fs/proc/meminfo.c | 2 +
include/linux/huge_mm.h | 3 +
include/linux/memcontrol.h | 1 +
include/linux/mm_inline.h | 83 ++++++++++++++-
include/linux/mmzone.h | 16 ++-
include/linux/page-flags.h | 5 +
include/linux/rmap.h | 1 +
include/linux/swap.h | 18 +++-
include/linux/vm_event_item.h | 3 +-
include/trace/events/vmscan.h | 38 ++++---
include/uapi/asm-generic/mman-common.h | 1 +
kernel/sysctl.c | 9 ++
mm/compaction.c | 14 ++-
mm/huge_memory.c | 51 +++++++--
mm/ksm.c | 6 ++
mm/madvise.c | 171 ++++++++++++++++++++++++++++++
mm/memcontrol.c | 44 +++++++-
mm/memory-failure.c | 7 +-
mm/memory_hotplug.c | 3 +-
mm/mempolicy.c | 3 +-
mm/migrate.c | 28 ++---
mm/page_alloc.c | 3 +
mm/rmap.c | 14 +++
mm/swap.c | 128 +++++++++++++++-------
mm/swap_state.c | 11 +-
mm/truncate.c | 2 +-
mm/vmscan.c | 157 ++++++++++++++++++++-------
mm/vmstat.c | 4 +
40 files changed, 713 insertions(+), 153 deletions(-)

--
1.9.1


2015-11-12 04:36:34

by Minchan Kim

Subject: [PATCH v3 01/17] mm: support madvise(MADV_FREE)

Linux doesn't have the ability to free pages lazily, while other OSes
have supported this for a long time via madvise(MADV_FREE).

The gain is clear: under memory pressure, the kernel can discard freed
pages instead of swapping them out or invoking the OOM killer.

Without memory pressure, freed pages can be reused by userspace without
any additional overhead (e.g. page fault + allocation + zeroing).

Jason Evans said:

: Facebook has been using MAP_UNINITIALIZED
: (https://lkml.org/lkml/2012/1/18/308) in some of its applications for
: several years, but there are operational costs to maintaining this
: out-of-tree in our kernel and in jemalloc, and we are anxious to retire it
: in favor of MADV_FREE. When we first enabled MAP_UNINITIALIZED it
: increased throughput for much of our workload by ~5%, and although the
: benefit has decreased using newer hardware and kernels, there is still
: enough benefit that we cannot reasonably retire it without a replacement.
:
: Aside from Facebook operations, there are numerous broadly used
: applications that would benefit from MADV_FREE. The ones that immediately
: come to mind are redis, varnish, and MariaDB. I don't have much insight
: into Android internals and development process, but I would hope to see
: MADV_FREE support eventually end up there as well to benefit applications
: linked with the integrated jemalloc.
:
: jemalloc will use MADV_FREE once it becomes available in the Linux kernel.
: In fact, jemalloc already uses MADV_FREE or equivalent everywhere it's
: available: *BSD, OS X, Windows, and Solaris -- every platform except Linux
: (and AIX, but I'm not sure it even compiles on AIX). The lack of
: MADV_FREE on Linux forced me down a long series of increasingly
: sophisticated heuristics for madvise() volume reduction, and even so this
: remains a common performance issue for people using jemalloc on Linux.
: Please integrate MADV_FREE; many people will benefit substantially.

How it works:

When the madvise syscall is called, the VM clears the dirty bit of the ptes
in the range. When memory pressure happens, the VM checks the dirty bit of
the page table entry; if it is still "clean", the page is a "lazyfree" page,
so the VM can discard it instead of swapping it out. If there was a store to
the page before the VM picked it for reclaim, the dirty bit is set, so the
VM swaps the page out instead of discarding it.

The first heavy users would be general-purpose allocators (e.g. jemalloc,
tcmalloc, and hopefully glibc), and jemalloc/tcmalloc already support the
feature on other OSes (e.g. FreeBSD).
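
For reference, here is a minimal userspace sketch of how a caller would use
the hint. This example is not part of the original posting; MADV_FREE is
assumed to come from the installed uapi headers, with a fallback to the
value 8 that this series settles on.

#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#ifndef MADV_FREE
#define MADV_FREE 8	/* assumption: the value this series settles on */
#endif

#define CHUNK	(64UL << 20)	/* 64M of anonymous memory */

int main(void)
{
	char *p = mmap(NULL, CHUNK, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	memset(p, 1, CHUNK);	/* dirty every page */

	/*
	 * Hint that the contents are no longer needed.  Under memory
	 * pressure the kernel may discard the clean pages instead of
	 * swapping them out.
	 */
	if (madvise(p, CHUNK, MADV_FREE))
		perror("madvise(MADV_FREE)");

	/* A later store re-dirties the page, so it is swapped, not dropped. */
	p[0] = 2;

	munmap(p, CHUNK);
	return 0;
}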

barrios@blaptop:~/benchmark/ebizzy$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 12
On-line CPU(s) list: 0-11
Thread(s) per core: 1
Core(s) per socket: 1
Socket(s): 12
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 2
Stepping: 3
CPU MHz: 3200.185
BogoMIPS: 6400.53
Virtualization: VT-x
Hypervisor vendor: KVM
Virtualization type: full
L1d cache: 32K
L1i cache: 32K
L2 cache: 4096K
NUMA node0 CPU(s): 0-11
ebizzy benchmark (./ebizzy -S 10 -n 512)

Higher avg is better.

vanilla-jemalloc MADV_free-jemalloc

1 thread
records: 10 records: 10
avg: 2961.90 avg: 12069.70
std: 71.96(2.43%) std: 186.68(1.55%)
max: 3070.00 max: 12385.00
min: 2796.00 min: 11746.00

2 thread
records: 10 records: 10
avg: 5020.00 avg: 17827.00
std: 264.87(5.28%) std: 358.52(2.01%)
max: 5244.00 max: 18760.00
min: 4251.00 min: 17382.00

4 thread
records: 10 records: 10
avg: 8988.80 avg: 27930.80
std: 1175.33(13.08%) std: 3317.33(11.88%)
max: 9508.00 max: 30879.00
min: 5477.00 min: 21024.00

8 thread
records: 10 records: 10
avg: 13036.50 avg: 33739.40
std: 170.67(1.31%) std: 5146.22(15.25%)
max: 13371.00 max: 40572.00
min: 12785.00 min: 24088.00

16 thread
records: 10 records: 10
avg: 11092.40 avg: 31424.20
std: 710.60(6.41%) std: 3763.89(11.98%)
max: 12446.00 max: 36635.00
min: 9949.00 min: 25669.00

32 thread
records: 10 records: 10
avg: 11067.00 avg: 34495.80
std: 971.06(8.77%) std: 2721.36(7.89%)
max: 12010.00 max: 38598.00
min: 9002.00 min: 30636.00

In summary, MADV_FREE is much faster than MADV_DONTNEED.

Acked-by: Hugh Dickins <[email protected]>
Reviewed-by: Michal Hocko <[email protected]>
Signed-off-by: Minchan Kim <[email protected]>
---
include/linux/rmap.h | 1 +
include/linux/vm_event_item.h | 1 +
include/uapi/asm-generic/mman-common.h | 1 +
mm/madvise.c | 132 +++++++++++++++++++++++++++++++++
mm/rmap.c | 7 ++
mm/swap_state.c | 5 +-
mm/vmscan.c | 10 ++-
mm/vmstat.c | 1 +
8 files changed, 153 insertions(+), 5 deletions(-)

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index 29446aeef36e..f4c992826242 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -85,6 +85,7 @@ enum ttu_flags {
TTU_UNMAP = 1, /* unmap mode */
TTU_MIGRATION = 2, /* migration mode */
TTU_MUNLOCK = 4, /* munlock mode */
+ TTU_FREE = 8, /* free mode */

TTU_IGNORE_MLOCK = (1 << 8), /* ignore mlock */
TTU_IGNORE_ACCESS = (1 << 9), /* don't age */
diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index 9246d32dc973..2b1cef88b827 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -25,6 +25,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
FOR_ALL_ZONES(PGALLOC),
PGFREE, PGACTIVATE, PGDEACTIVATE,
PGFAULT, PGMAJFAULT,
+ PGLAZYFREED,
FOR_ALL_ZONES(PGREFILL),
FOR_ALL_ZONES(PGSTEAL_KSWAPD),
FOR_ALL_ZONES(PGSTEAL_DIRECT),
diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
index ddc3b36f1046..7a94102b7a02 100644
--- a/include/uapi/asm-generic/mman-common.h
+++ b/include/uapi/asm-generic/mman-common.h
@@ -34,6 +34,7 @@
#define MADV_SEQUENTIAL 2 /* expect sequential page references */
#define MADV_WILLNEED 3 /* will need these pages */
#define MADV_DONTNEED 4 /* don't need these pages */
+#define MADV_FREE 5 /* free pages only if memory pressure */

/* common parameters: try to keep these consistent across architectures */
#define MADV_REMOVE 9 /* remove these pages & resources */
diff --git a/mm/madvise.c b/mm/madvise.c
index c889fcbb530e..a8813f7b37b3 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -20,6 +20,9 @@
#include <linux/backing-dev.h>
#include <linux/swap.h>
#include <linux/swapops.h>
+#include <linux/mmu_notifier.h>
+
+#include <asm/tlb.h>

/*
* Any behaviour which results in changes to the vma->vm_flags needs to
@@ -32,6 +35,7 @@ static int madvise_need_mmap_write(int behavior)
case MADV_REMOVE:
case MADV_WILLNEED:
case MADV_DONTNEED:
+ case MADV_FREE:
return 0;
default:
/* be safe, default to 1. list exceptions explicitly */
@@ -256,6 +260,125 @@ static long madvise_willneed(struct vm_area_struct *vma,
return 0;
}

+static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
+ unsigned long end, struct mm_walk *walk)
+
+{
+ struct mmu_gather *tlb = walk->private;
+ struct mm_struct *mm = tlb->mm;
+ struct vm_area_struct *vma = walk->vma;
+ spinlock_t *ptl;
+ pte_t *pte, ptent;
+ struct page *page;
+
+ split_huge_page_pmd(vma, addr, pmd);
+ if (pmd_trans_unstable(pmd))
+ return 0;
+
+ pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
+ arch_enter_lazy_mmu_mode();
+ for (; addr != end; pte++, addr += PAGE_SIZE) {
+ ptent = *pte;
+
+ if (!pte_present(ptent))
+ continue;
+
+ page = vm_normal_page(vma, addr, ptent);
+ if (!page)
+ continue;
+
+ if (PageSwapCache(page)) {
+ if (!trylock_page(page))
+ continue;
+
+ if (!try_to_free_swap(page)) {
+ unlock_page(page);
+ continue;
+ }
+
+ ClearPageDirty(page);
+ unlock_page(page);
+ }
+
+ if (pte_young(ptent) || pte_dirty(ptent)) {
+ /*
+ * Some of architecture(ex, PPC) don't update TLB
+ * with set_pte_at and tlb_remove_tlb_entry so for
+ * the portability, remap the pte with old|clean
+ * after pte clearing.
+ */
+ ptent = ptep_get_and_clear_full(mm, addr, pte,
+ tlb->fullmm);
+
+ ptent = pte_mkold(ptent);
+ ptent = pte_mkclean(ptent);
+ set_pte_at(mm, addr, pte, ptent);
+ tlb_remove_tlb_entry(tlb, pte, addr);
+ }
+ }
+
+ arch_leave_lazy_mmu_mode();
+ pte_unmap_unlock(pte - 1, ptl);
+ cond_resched();
+ return 0;
+}
+
+static void madvise_free_page_range(struct mmu_gather *tlb,
+ struct vm_area_struct *vma,
+ unsigned long addr, unsigned long end)
+{
+ struct mm_walk free_walk = {
+ .pmd_entry = madvise_free_pte_range,
+ .mm = vma->vm_mm,
+ .private = tlb,
+ };
+
+ tlb_start_vma(tlb, vma);
+ walk_page_range(addr, end, &free_walk);
+ tlb_end_vma(tlb, vma);
+}
+
+static int madvise_free_single_vma(struct vm_area_struct *vma,
+ unsigned long start_addr, unsigned long end_addr)
+{
+ unsigned long start, end;
+ struct mm_struct *mm = vma->vm_mm;
+ struct mmu_gather tlb;
+
+ if (vma->vm_flags & (VM_LOCKED|VM_HUGETLB|VM_PFNMAP))
+ return -EINVAL;
+
+ /* MADV_FREE works for only anon vma at the moment */
+ if (!vma_is_anonymous(vma))
+ return -EINVAL;
+
+ start = max(vma->vm_start, start_addr);
+ if (start >= vma->vm_end)
+ return -EINVAL;
+ end = min(vma->vm_end, end_addr);
+ if (end <= vma->vm_start)
+ return -EINVAL;
+
+ lru_add_drain();
+ tlb_gather_mmu(&tlb, mm, start, end);
+ update_hiwater_rss(mm);
+
+ mmu_notifier_invalidate_range_start(mm, start, end);
+ madvise_free_page_range(&tlb, vma, start, end);
+ mmu_notifier_invalidate_range_end(mm, start, end);
+ tlb_finish_mmu(&tlb, start, end);
+
+ return 0;
+}
+
+static long madvise_free(struct vm_area_struct *vma,
+ struct vm_area_struct **prev,
+ unsigned long start, unsigned long end)
+{
+ *prev = vma;
+ return madvise_free_single_vma(vma, start, end);
+}
+
/*
* Application no longer needs these pages. If the pages are dirty,
* it's OK to just throw them away. The app will be more careful about
@@ -379,6 +502,14 @@ madvise_vma(struct vm_area_struct *vma, struct vm_area_struct **prev,
return madvise_remove(vma, prev, start, end);
case MADV_WILLNEED:
return madvise_willneed(vma, prev, start, end);
+ case MADV_FREE:
+ /*
+ * XXX: In this implementation, MADV_FREE works like
+ * MADV_DONTNEED on swapless system or full swap.
+ */
+ if (get_nr_swap_pages() > 0)
+ return madvise_free(vma, prev, start, end);
+ /* passthrough */
case MADV_DONTNEED:
return madvise_dontneed(vma, prev, start, end);
default:
@@ -398,6 +529,7 @@ madvise_behavior_valid(int behavior)
case MADV_REMOVE:
case MADV_WILLNEED:
case MADV_DONTNEED:
+ case MADV_FREE:
#ifdef CONFIG_KSM
case MADV_MERGEABLE:
case MADV_UNMERGEABLE:
diff --git a/mm/rmap.c b/mm/rmap.c
index f5b5c1f3dcd7..9449e91839ab 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1374,6 +1374,12 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
swp_entry_t entry = { .val = page_private(page) };
pte_t swp_pte;

+ if (!PageDirty(page) && (flags & TTU_FREE)) {
+ /* It's a freeable page by MADV_FREE */
+ dec_mm_counter(mm, MM_ANONPAGES);
+ goto discard;
+ }
+
if (PageSwapCache(page)) {
/*
* Store the swap location in the pte.
@@ -1414,6 +1420,7 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
} else
dec_mm_counter(mm, MM_FILEPAGES);

+discard:
page_remove_rmap(page);
page_cache_release(page);

diff --git a/mm/swap_state.c b/mm/swap_state.c
index d504adb7fa5f..10f63eded7b7 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -185,13 +185,12 @@ int add_to_swap(struct page *page, struct list_head *list)
* deadlock in the swap out path.
*/
/*
- * Add it to the swap cache and mark it dirty
+ * Add it to the swap cache.
*/
err = add_to_swap_cache(page, entry,
__GFP_HIGH|__GFP_NOMEMALLOC|__GFP_NOWARN);

- if (!err) { /* Success */
- SetPageDirty(page);
+ if (!err) {
return 1;
} else { /* -ENOMEM radix-tree allocation failure */
/*
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 7f63a9381f71..7a415b9fdd34 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -906,6 +906,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
int may_enter_fs;
enum page_references references = PAGEREF_RECLAIM_CLEAN;
bool dirty, writeback;
+ bool freeable = false;

cond_resched();

@@ -1049,6 +1050,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
goto keep_locked;
if (!add_to_swap(page, page_list))
goto activate_locked;
+ freeable = true;
may_enter_fs = 1;

/* Adding to swap updated mapping */
@@ -1060,8 +1062,9 @@ static unsigned long shrink_page_list(struct list_head *page_list,
* processes. Try to unmap it here.
*/
if (page_mapped(page) && mapping) {
- switch (try_to_unmap(page,
- ttu_flags|TTU_BATCH_FLUSH)) {
+ switch (try_to_unmap(page, freeable ?
+ (ttu_flags | TTU_BATCH_FLUSH | TTU_FREE) :
+ (ttu_flags | TTU_BATCH_FLUSH))) {
case SWAP_FAIL:
goto activate_locked;
case SWAP_AGAIN:
@@ -1186,6 +1189,9 @@ static unsigned long shrink_page_list(struct list_head *page_list,
*/
__clear_page_locked(page);
free_it:
+ if (freeable && !PageDirty(page))
+ count_vm_event(PGLAZYFREED);
+
nr_reclaimed++;

/*
diff --git a/mm/vmstat.c b/mm/vmstat.c
index fbf14485a049..59d45b22355f 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -759,6 +759,7 @@ const char * const vmstat_text[] = {

"pgfault",
"pgmajfault",
+ "pglazyfreed",

TEXTS_FOR_ZONES("pgrefill")
TEXTS_FOR_ZONES("pgsteal_kswapd")
--
1.9.1

2015-11-12 04:38:08

by Minchan Kim

Subject: [PATCH v3 02/17] mm: define MADV_FREE for some arches

Most architectures use asm-generic, but alpha, mips, parisc and xtensa
need their own definitions.

This patch defines MADV_FREE for them, so it should fix the build
breakage on those architectures.

Maybe I should split this up and feed the pieces to the arch maintainers,
but it is included here for mmotm convenience.

Cc: Michael Kerrisk <[email protected]>
Cc: Richard Henderson <[email protected]>
Cc: Ivan Kokshaysky <[email protected]>
Cc: "James E.J. Bottomley" <[email protected]>
Cc: Helge Deller <[email protected]>
Cc: Ralf Baechle <[email protected]>
Cc: Chris Zankel <[email protected]>
Acked-by: Max Filippov <[email protected]>
Reported-by: kbuild test robot <[email protected]>
Signed-off-by: Minchan Kim <[email protected]>
---
arch/alpha/include/uapi/asm/mman.h | 1 +
arch/mips/include/uapi/asm/mman.h | 1 +
arch/parisc/include/uapi/asm/mman.h | 1 +
arch/xtensa/include/uapi/asm/mman.h | 1 +
4 files changed, 4 insertions(+)

diff --git a/arch/alpha/include/uapi/asm/mman.h b/arch/alpha/include/uapi/asm/mman.h
index 0086b472bc2b..836fbd44f65b 100644
--- a/arch/alpha/include/uapi/asm/mman.h
+++ b/arch/alpha/include/uapi/asm/mman.h
@@ -44,6 +44,7 @@
#define MADV_WILLNEED 3 /* will need these pages */
#define MADV_SPACEAVAIL 5 /* ensure resources are available */
#define MADV_DONTNEED 6 /* don't need these pages */
+#define MADV_FREE 7 /* free pages only if memory pressure */

/* common/generic parameters */
#define MADV_REMOVE 9 /* remove these pages & resources */
diff --git a/arch/mips/include/uapi/asm/mman.h b/arch/mips/include/uapi/asm/mman.h
index cfcb876cae6b..106e741aa7ee 100644
--- a/arch/mips/include/uapi/asm/mman.h
+++ b/arch/mips/include/uapi/asm/mman.h
@@ -67,6 +67,7 @@
#define MADV_SEQUENTIAL 2 /* expect sequential page references */
#define MADV_WILLNEED 3 /* will need these pages */
#define MADV_DONTNEED 4 /* don't need these pages */
+#define MADV_FREE 5 /* free pages only if memory pressure */

/* common parameters: try to keep these consistent across architectures */
#define MADV_REMOVE 9 /* remove these pages & resources */
diff --git a/arch/parisc/include/uapi/asm/mman.h b/arch/parisc/include/uapi/asm/mman.h
index 294d251ca7b2..6cb8db76fd4e 100644
--- a/arch/parisc/include/uapi/asm/mman.h
+++ b/arch/parisc/include/uapi/asm/mman.h
@@ -40,6 +40,7 @@
#define MADV_SPACEAVAIL 5 /* insure that resources are reserved */
#define MADV_VPS_PURGE 6 /* Purge pages from VM page cache */
#define MADV_VPS_INHERIT 7 /* Inherit parents page size */
+#define MADV_FREE 8 /* free pages only if memory pressure */

/* common/generic parameters */
#define MADV_REMOVE 9 /* remove these pages & resources */
diff --git a/arch/xtensa/include/uapi/asm/mman.h b/arch/xtensa/include/uapi/asm/mman.h
index 201aec0e0446..1b19f25bc567 100644
--- a/arch/xtensa/include/uapi/asm/mman.h
+++ b/arch/xtensa/include/uapi/asm/mman.h
@@ -80,6 +80,7 @@
#define MADV_SEQUENTIAL 2 /* expect sequential page references */
#define MADV_WILLNEED 3 /* will need these pages */
#define MADV_DONTNEED 4 /* don't need these pages */
+#define MADV_FREE 5 /* free pages only if memory pressure */

/* common parameters: try to keep these consistent across architectures */
#define MADV_REMOVE 9 /* remove these pages & resources */
--
1.9.1

2015-11-12 04:32:47

by Minchan Kim

Subject: [PATCH v3 03/17] arch: uapi: asm: mman.h: Let MADV_FREE have same value for all architectures

From: Chen Gang <[email protected]>

For the uapi, we need to try to keep macros at the same value across
architectures. MADV_FREE was added to the main branch recently, so we need
to redefine MADV_FREE accordingly.

At present, the value '8' can be shared by all architectures, so redefine
it to '8'.

Cc: [email protected] <[email protected]>,
Cc: [email protected] <[email protected]>
Cc: [email protected] <[email protected]>
Cc: Ralf Baechle <[email protected]>
Cc: [email protected] <[email protected]>
Cc: [email protected] <[email protected]>
Cc: [email protected] <[email protected]>
Cc: [email protected] <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Acked-by: Hugh Dickins <[email protected]>
Acked-by: Minchan Kim <[email protected]>
Signed-off-by: Chen Gang <[email protected]>
---
arch/alpha/include/uapi/asm/mman.h | 2 +-
arch/mips/include/uapi/asm/mman.h | 2 +-
arch/parisc/include/uapi/asm/mman.h | 2 +-
arch/xtensa/include/uapi/asm/mman.h | 2 +-
include/uapi/asm-generic/mman-common.h | 2 +-
5 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/arch/alpha/include/uapi/asm/mman.h b/arch/alpha/include/uapi/asm/mman.h
index 836fbd44f65b..0b8a5de7aee3 100644
--- a/arch/alpha/include/uapi/asm/mman.h
+++ b/arch/alpha/include/uapi/asm/mman.h
@@ -44,9 +44,9 @@
#define MADV_WILLNEED 3 /* will need these pages */
#define MADV_SPACEAVAIL 5 /* ensure resources are available */
#define MADV_DONTNEED 6 /* don't need these pages */
-#define MADV_FREE 7 /* free pages only if memory pressure */

/* common/generic parameters */
+#define MADV_FREE 8 /* free pages only if memory pressure */
#define MADV_REMOVE 9 /* remove these pages & resources */
#define MADV_DONTFORK 10 /* don't inherit across fork */
#define MADV_DOFORK 11 /* do inherit across fork */
diff --git a/arch/mips/include/uapi/asm/mman.h b/arch/mips/include/uapi/asm/mman.h
index 106e741aa7ee..d247f5457944 100644
--- a/arch/mips/include/uapi/asm/mman.h
+++ b/arch/mips/include/uapi/asm/mman.h
@@ -67,9 +67,9 @@
#define MADV_SEQUENTIAL 2 /* expect sequential page references */
#define MADV_WILLNEED 3 /* will need these pages */
#define MADV_DONTNEED 4 /* don't need these pages */
-#define MADV_FREE 5 /* free pages only if memory pressure */

/* common parameters: try to keep these consistent across architectures */
+#define MADV_FREE 8 /* free pages only if memory pressure */
#define MADV_REMOVE 9 /* remove these pages & resources */
#define MADV_DONTFORK 10 /* don't inherit across fork */
#define MADV_DOFORK 11 /* do inherit across fork */
diff --git a/arch/parisc/include/uapi/asm/mman.h b/arch/parisc/include/uapi/asm/mman.h
index 6cb8db76fd4e..700d83fd9352 100644
--- a/arch/parisc/include/uapi/asm/mman.h
+++ b/arch/parisc/include/uapi/asm/mman.h
@@ -40,9 +40,9 @@
#define MADV_SPACEAVAIL 5 /* insure that resources are reserved */
#define MADV_VPS_PURGE 6 /* Purge pages from VM page cache */
#define MADV_VPS_INHERIT 7 /* Inherit parents page size */
-#define MADV_FREE 8 /* free pages only if memory pressure */

/* common/generic parameters */
+#define MADV_FREE 8 /* free pages only if memory pressure */
#define MADV_REMOVE 9 /* remove these pages & resources */
#define MADV_DONTFORK 10 /* don't inherit across fork */
#define MADV_DOFORK 11 /* do inherit across fork */
diff --git a/arch/xtensa/include/uapi/asm/mman.h b/arch/xtensa/include/uapi/asm/mman.h
index 1b19f25bc567..77eaca434071 100644
--- a/arch/xtensa/include/uapi/asm/mman.h
+++ b/arch/xtensa/include/uapi/asm/mman.h
@@ -80,9 +80,9 @@
#define MADV_SEQUENTIAL 2 /* expect sequential page references */
#define MADV_WILLNEED 3 /* will need these pages */
#define MADV_DONTNEED 4 /* don't need these pages */
-#define MADV_FREE 5 /* free pages only if memory pressure */

/* common parameters: try to keep these consistent across architectures */
+#define MADV_FREE 8 /* free pages only if memory pressure */
#define MADV_REMOVE 9 /* remove these pages & resources */
#define MADV_DONTFORK 10 /* don't inherit across fork */
#define MADV_DOFORK 11 /* do inherit across fork */
diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
index 7a94102b7a02..869595947873 100644
--- a/include/uapi/asm-generic/mman-common.h
+++ b/include/uapi/asm-generic/mman-common.h
@@ -34,9 +34,9 @@
#define MADV_SEQUENTIAL 2 /* expect sequential page references */
#define MADV_WILLNEED 3 /* will need these pages */
#define MADV_DONTNEED 4 /* don't need these pages */
-#define MADV_FREE 5 /* free pages only if memory pressure */

/* common parameters: try to keep these consistent across architectures */
+#define MADV_FREE 8 /* free pages only if memory pressure */
#define MADV_REMOVE 9 /* remove these pages & resources */
#define MADV_DONTFORK 10 /* don't inherit across fork */
#define MADV_DOFORK 11 /* do inherit across fork */
--
1.9.1

2015-11-12 04:38:06

by Minchan Kim

Subject: [PATCH v3 04/17] mm: free swp_entry in madvise_free

When I tested the piece of code below with 12 processes (i.e. 512M * 12 = 6G
consumed) on my machine (3G RAM + 12 CPUs + 8G swap), madvise_free was
significantly slower (i.e. about 2x) than madvise_dontneed.

loop = 5;
mmap(512M);
while (loop--) {
	memset(512M);
	madvise(MADV_FREE or MADV_DONTNEED);
}
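
For completeness, here is a self-contained version of the test as I read it.
This is a sketch, not the original program: the 512M mapping, the five
iterations and the MADV_FREE/MADV_DONTNEED choice come from the description
above, while the error handling, the fallback define and the command-line
switch are my additions.

#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#ifndef MADV_FREE
#define MADV_FREE 8	/* assumption: the value this series settles on */
#endif

#define SIZE	(512UL << 20)	/* 512M per process */

int main(int argc, char **argv)
{
	/* pass "dontneed" to test MADV_DONTNEED instead of MADV_FREE */
	int advice = (argc > 1 && !strcmp(argv[1], "dontneed")) ?
					MADV_DONTNEED : MADV_FREE;
	int loop = 5;
	char *p = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (p == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	while (loop--) {
		memset(p, 1, SIZE);	/* dirty every page */
		if (madvise(p, SIZE, advice))
			perror("madvise");
	}

	munmap(p, SIZE);
	return 0;
}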

The reason is lots of swapin.

1) dontneed: 1,612 swapin
2) madvfree: 879,585 swapin

If we find that hinted pages have already been swapped out when the syscall
is called, it's pointless to keep the swap entries in the ptes.
Instead, let's free those cold pages, because swap-in is more expensive
than (page allocation + zeroing).

With this patch, swap-in was reduced from 879,585 to 1,878, so the elapsed
time improved:

1) dontneed: 6.10user 233.50system 0:50.44elapsed
2) madvfree: 6.03user 401.17system 1:30.67elapsed
3) madvfree + this patch: 6.70user 339.14system 1:04.45elapsed

Acked-by: Michal Hocko <[email protected]>
Acked-by: Hugh Dickins <[email protected]>
Signed-off-by: Minchan Kim <[email protected]>
---
mm/madvise.c | 26 +++++++++++++++++++++++++-
1 file changed, 25 insertions(+), 1 deletion(-)

diff --git a/mm/madvise.c b/mm/madvise.c
index a8813f7b37b3..6240a5de4a3a 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -270,6 +270,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
spinlock_t *ptl;
pte_t *pte, ptent;
struct page *page;
+ int nr_swap = 0;

split_huge_page_pmd(vma, addr, pmd);
if (pmd_trans_unstable(pmd))
@@ -280,8 +281,24 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
for (; addr != end; pte++, addr += PAGE_SIZE) {
ptent = *pte;

- if (!pte_present(ptent))
+ if (pte_none(ptent))
continue;
+ /*
+ * If the pte has swp_entry, just clear page table to
+ * prevent swap-in which is more expensive rather than
+ * (page allocation + zeroing).
+ */
+ if (!pte_present(ptent)) {
+ swp_entry_t entry;
+
+ entry = pte_to_swp_entry(ptent);
+ if (non_swap_entry(entry))
+ continue;
+ nr_swap--;
+ free_swap_and_cache(entry);
+ pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
+ continue;
+ }

page = vm_normal_page(vma, addr, ptent);
if (!page)
@@ -317,6 +334,13 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
}
}

+ if (nr_swap) {
+ if (current->mm == mm)
+ sync_mm_rss(mm);
+
+ add_mm_counter(mm, MM_SWAPENTS, nr_swap);
+ }
+
arch_leave_lazy_mmu_mode();
pte_unmap_unlock(pte - 1, ptl);
cond_resched();
--
1.9.1

2015-11-12 04:36:32

by Minchan Kim

Subject: [PATCH v3 05/17] mm: move lazily freed pages to inactive list

MADV_FREE is a hint that it's okay to discard pages under memory pressure,
and we rely on the reclaimers (i.e. kswapd and direct reclaim) to free them,
so there is no value in keeping them on the active anonymous LRU list.
This patch moves them to the head of the inactive LRU list.

This means that MADV_FREE-ed pages which were living on the inactive list
are reclaimed first, because they are more likely to be cold than recently
active pages.

An arguable issue with the approach is whether we should put the page at the
head or the tail of the inactive list. I chose the head because the kernel
cannot be sure whether the page is really cold or warm for every MADV_FREE
usecase, but at least we know it's not *hot*, so landing at the head of the
inactive list is a compromise for various usecases.

This fixes the suboptimal behavior of MADV_FREE where pages living on the
active list would sit there for a long time even under memory pressure
while the inactive list was reclaimed heavily. That basically defeats the
whole purpose of using MADV_FREE: helping the system free memory that
might not be used again.

Cc: Johannes Weiner <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Shaohua Li <[email protected]>
Acked-by: Hugh Dickins <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Signed-off-by: Minchan Kim <[email protected]>
---
include/linux/swap.h | 2 +-
mm/madvise.c | 3 +++
mm/swap.c | 62 +++++++++++++++++++++++++++++-----------------------
mm/truncate.c | 2 +-
4 files changed, 40 insertions(+), 29 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 7ba7dccaf0e7..8e944c0cedea 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -307,7 +307,7 @@ extern void lru_add_drain(void);
extern void lru_add_drain_cpu(int cpu);
extern void lru_add_drain_all(void);
extern void rotate_reclaimable_page(struct page *page);
-extern void deactivate_file_page(struct page *page);
+extern void deactivate_page(struct page *page);
extern void swap_setup(void);

extern void add_page_to_unevictable_list(struct page *page);
diff --git a/mm/madvise.c b/mm/madvise.c
index 6240a5de4a3a..3462a3ca9690 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -317,6 +317,9 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
unlock_page(page);
}

+ if (PageActive(page))
+ deactivate_page(page);
+
if (pte_young(ptent) || pte_dirty(ptent)) {
/*
* Some of architecture(ex, PPC) don't update TLB
diff --git a/mm/swap.c b/mm/swap.c
index 983f692a47fd..a2f2cd458de0 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -44,7 +44,7 @@ int page_cluster;

static DEFINE_PER_CPU(struct pagevec, lru_add_pvec);
static DEFINE_PER_CPU(struct pagevec, lru_rotate_pvecs);
-static DEFINE_PER_CPU(struct pagevec, lru_deactivate_file_pvecs);
+static DEFINE_PER_CPU(struct pagevec, lru_deactivate_pvecs);

/*
* This path almost never happens for VM activity - pages are normally
@@ -733,13 +733,13 @@ void lru_cache_add_active_or_unevictable(struct page *page,
}

/*
- * If the page can not be invalidated, it is moved to the
+ * If the file page can not be invalidated, it is moved to the
* inactive list to speed up its reclaim. It is moved to the
* head of the list, rather than the tail, to give the flusher
* threads some time to write it out, as this is much more
* effective than the single-page writeout from reclaim.
*
- * If the page isn't page_mapped and dirty/writeback, the page
+ * If the file page isn't page_mapped and dirty/writeback, the page
* could reclaim asap using PG_reclaim.
*
* 1. active, mapped page -> none
@@ -752,32 +752,36 @@ void lru_cache_add_active_or_unevictable(struct page *page,
* In 4, why it moves inactive's head, the VM expects the page would
* be write it out by flusher threads as this is much more effective
* than the single-page writeout from reclaim.
+ *
+ * If @page is anonymous page, it is moved to the inactive list.
*/
-static void lru_deactivate_file_fn(struct page *page, struct lruvec *lruvec,
+static void lru_deactivate_fn(struct page *page, struct lruvec *lruvec,
void *arg)
{
- int lru, file;
- bool active;
+ int lru;
+ bool file, active;

- if (!PageLRU(page))
+ if (!PageLRU(page) || PageUnevictable(page))
return;

- if (PageUnevictable(page))
- return;
+ file = page_is_file_cache(page);
+ active = PageActive(page);
+ lru = page_lru_base_type(page);

- /* Some processes are using the page */
- if (page_mapped(page))
+ if (!file && !active)
return;

- active = PageActive(page);
- file = page_is_file_cache(page);
- lru = page_lru_base_type(page);
+ if (file && page_mapped(page))
+ return;

del_page_from_lru_list(page, lruvec, lru + active);
ClearPageActive(page);
- ClearPageReferenced(page);
add_page_to_lru_list(page, lruvec, lru);

+ if (!file)
+ goto out;
+
+ ClearPageReferenced(page);
if (PageWriteback(page) || PageDirty(page)) {
/*
* PG_reclaim could be raced with end_page_writeback
@@ -793,9 +797,10 @@ static void lru_deactivate_file_fn(struct page *page, struct lruvec *lruvec,
list_move_tail(&page->lru, &lruvec->lists[lru]);
__count_vm_event(PGROTATED);
}
-
+out:
if (active)
__count_vm_event(PGDEACTIVATE);
+
update_page_reclaim_stat(lruvec, file, 0);
}

@@ -821,22 +826,25 @@ void lru_add_drain_cpu(int cpu)
local_irq_restore(flags);
}

- pvec = &per_cpu(lru_deactivate_file_pvecs, cpu);
+ pvec = &per_cpu(lru_deactivate_pvecs, cpu);
if (pagevec_count(pvec))
- pagevec_lru_move_fn(pvec, lru_deactivate_file_fn, NULL);
+ pagevec_lru_move_fn(pvec, lru_deactivate_fn, NULL);

activate_page_drain(cpu);
}

/**
- * deactivate_file_page - forcefully deactivate a file page
+ * deactivate_page - forcefully deactivate a page
* @page: page to deactivate
*
- * This function hints the VM that @page is a good reclaim candidate,
- * for example if its invalidation fails due to the page being dirty
- * or under writeback.
+ * This function hints the VM that @page is a good reclaim candidate to
+ * accelerate the reclaim of @page.
+ * For example,
+ * 1. Invalidation of file-page fails due to the page being dirty or under
+ * writeback.
+ * 2. MADV_FREE hinted anonymous page.
*/
-void deactivate_file_page(struct page *page)
+void deactivate_page(struct page *page)
{
/*
* In a workload with many unevictable page such as mprotect,
@@ -846,11 +854,11 @@ void deactivate_file_page(struct page *page)
return;

if (likely(get_page_unless_zero(page))) {
- struct pagevec *pvec = &get_cpu_var(lru_deactivate_file_pvecs);
+ struct pagevec *pvec = &get_cpu_var(lru_deactivate_pvecs);

if (!pagevec_add(pvec, page))
- pagevec_lru_move_fn(pvec, lru_deactivate_file_fn, NULL);
- put_cpu_var(lru_deactivate_file_pvecs);
+ pagevec_lru_move_fn(pvec, lru_deactivate_fn, NULL);
+ put_cpu_var(lru_deactivate_pvecs);
}
}

@@ -882,7 +890,7 @@ void lru_add_drain_all(void)

if (pagevec_count(&per_cpu(lru_add_pvec, cpu)) ||
pagevec_count(&per_cpu(lru_rotate_pvecs, cpu)) ||
- pagevec_count(&per_cpu(lru_deactivate_file_pvecs, cpu)) ||
+ pagevec_count(&per_cpu(lru_deactivate_pvecs, cpu)) ||
need_activate_page_drain(cpu)) {
INIT_WORK(work, lru_add_drain_per_cpu);
schedule_work_on(cpu, work);
diff --git a/mm/truncate.c b/mm/truncate.c
index 76e35ad97102..cf8d44679364 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -488,7 +488,7 @@ unsigned long invalidate_mapping_pages(struct address_space *mapping,
* of interest and try to speed up its reclaim.
*/
if (!ret)
- deactivate_file_page(page);
+ deactivate_page(page);
count += ret;
}
pagevec_remove_exceptionals(&pvec);
--
1.9.1

2015-11-12 04:36:30

by Minchan Kim

Subject: [PATCH v3 06/17] mm: clear PG_dirty to mark page freeable

Basically, MADV_FREE relies on the dirty bit in the page table entry to
decide whether the VM is allowed to discard the page or not. IOW, if the
page table entry has the dirty bit set, the VM must not discard the page.

However, as an example, if a swap-in happens via a read fault, the page
table entry doesn't have the dirty bit set, so MADV_FREE could wrongly
discard the page.

To avoid the problem, MADV_FREE did additional checks with PageDirty
and PageSwapCache. That worked because a swapped-in page lives in the
swap cache, and once it is evicted from the swap cache, the page has the
PG_dirty flag. So the two page flag checks effectively prevent wrong
discarding by MADV_FREE.

However, a problem with the above logic is that a swapped-in page keeps
PG_dirty after it is removed from the swap cache, so the VM can no longer
consider the page freeable even if madvise_free is called later.

Look at the example below for details.

ptr = malloc();
memset(ptr);
..
..
.. heavy memory pressure so all of the pages are swapped out
..
..
var = *ptr;        -> a page is swapped in and could be removed from
                      the swap cache. Then, the page table entry does
                      not have the dirty bit set, but the page
                      descriptor has PG_dirty.
..
..
madvise_free(ptr); -> It doesn't clear PG_dirty of the page.
..
..
..
.. heavy memory pressure again.
.. This time, the VM cannot discard the page because the page
.. still has *PG_dirty*.

To solve the problem, this patch clears PG_dirty only if the page is owned
exclusively by the current process when madvise is called, because PG_dirty
represents the ptes' dirtiness across several processes, so we can clear it
only if we own the page exclusively.

Acked-by: Michal Hocko <[email protected]>
Acked-by: Hugh Dickins <[email protected]>
Signed-off-by: Minchan Kim <[email protected]>
---
mm/madvise.c | 12 ++++++++++--
1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/mm/madvise.c b/mm/madvise.c
index 3462a3ca9690..4e67ba0b1104 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -304,11 +304,19 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
if (!page)
continue;

- if (PageSwapCache(page)) {
+ if (PageSwapCache(page) || PageDirty(page)) {
if (!trylock_page(page))
continue;
+ /*
+ * If page is shared with others, we couldn't clear
+ * PG_dirty of the page.
+ */
+ if (page_count(page) != 1 + !!PageSwapCache(page)) {
+ unlock_page(page);
+ continue;
+ }

- if (!try_to_free_swap(page)) {
+ if (PageSwapCache(page) && !try_to_free_swap(page)) {
unlock_page(page);
continue;
}
--
1.9.1

2015-11-12 04:36:27

by Minchan Kim

Subject: [PATCH v3 07/17] mm: mark stable page dirty in KSM

The MADV_FREE patchset changes page reclaim to simply free a clean
anonymous page with no dirty ptes, instead of swapping it out; but
KSM uses clean write-protected ptes to reference the stable ksm page.
So be sure to mark that page dirty, so it's never mistakenly discarded.

[hughd: adjusted comments]
Acked-by: Hugh Dickins <[email protected]>
Signed-off-by: Minchan Kim <[email protected]>
---
mm/ksm.c | 6 ++++++
1 file changed, 6 insertions(+)

diff --git a/mm/ksm.c b/mm/ksm.c
index 7ee101eaacdf..18d2b7afecff 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -1053,6 +1053,12 @@ static int try_to_merge_one_page(struct vm_area_struct *vma,
*/
set_page_stable_node(page, NULL);
mark_page_accessed(page);
+ /*
+ * Page reclaim just frees a clean page with no dirty
+ * ptes: make sure that the ksm page would be swapped.
+ */
+ if (!PageDirty(page))
+ SetPageDirty(page);
err = 0;
} else if (pages_identical(page, kpage))
err = replace_page(vma, page, kpage, orig_pte);
--
1.9.1

2015-11-12 04:36:29

by Minchan Kim

Subject: [PATCH v3 08/17] x86: add pmd_[dirty|mkclean] for THP

MADV_FREE needs pmd_dirty and pmd_mkclean to detect recent overwrites of
the contents, since the MADV_FREE syscall can be called on THP pages.

This patch adds pmd_dirty and pmd_mkclean for MADV_FREE support on THP
pages.

Signed-off-by: Minchan Kim <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
---
arch/x86/include/asm/pgtable.h | 5 +++++
1 file changed, 5 insertions(+)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 867da5bbb4a3..b964d54300e1 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -267,6 +267,11 @@ static inline pmd_t pmd_mkold(pmd_t pmd)
return pmd_clear_flags(pmd, _PAGE_ACCESSED);
}

+static inline pmd_t pmd_mkclean(pmd_t pmd)
+{
+ return pmd_clear_flags(pmd, _PAGE_DIRTY);
+}
+
static inline pmd_t pmd_wrprotect(pmd_t pmd)
{
return pmd_clear_flags(pmd, _PAGE_RW);
--
1.9.1

2015-11-12 04:34:50

by Minchan Kim

Subject: [PATCH v3 09/17] sparc: add pmd_[dirty|mkclean] for THP

MADV_FREE needs pmd_dirty and pmd_mkclean to detect recent overwrites of
the contents, since the MADV_FREE syscall can be called on THP pages.

This patch adds pmd_dirty and pmd_mkclean for MADV_FREE support on THP
pages.

Signed-off-by: Minchan Kim <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
---
arch/sparc/include/asm/pgtable_64.h | 9 +++++++++
1 file changed, 9 insertions(+)

diff --git a/arch/sparc/include/asm/pgtable_64.h b/arch/sparc/include/asm/pgtable_64.h
index 131d36fcd07a..5833dc5ee7d7 100644
--- a/arch/sparc/include/asm/pgtable_64.h
+++ b/arch/sparc/include/asm/pgtable_64.h
@@ -717,6 +717,15 @@ static inline pmd_t pmd_mkdirty(pmd_t pmd)
return __pmd(pte_val(pte));
}

+static inline pmd_t pmd_mkclean(pmd_t pmd)
+{
+ pte_t pte = __pte(pmd_val(pmd));
+
+ pte = pte_mkclean(pte);
+
+ return __pmd(pte_val(pte));
+}
+
static inline pmd_t pmd_mkyoung(pmd_t pmd)
{
pte_t pte = __pte(pmd_val(pmd));
--
1.9.1

2015-11-12 04:34:51

by Minchan Kim

Subject: [PATCH v3 10/17] powerpc: add pmd_[dirty|mkclean] for THP

MADV_FREE needs pmd_dirty and pmd_mkclean to detect recent overwrites of
the contents, since the MADV_FREE syscall can be called on THP pages.

This patch adds pmd_dirty and pmd_mkclean for MADV_FREE support on THP
pages.

Signed-off-by: Minchan Kim <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
---
arch/powerpc/include/asm/pgtable-ppc64.h | 2 ++
1 file changed, 2 insertions(+)

diff --git a/arch/powerpc/include/asm/pgtable-ppc64.h b/arch/powerpc/include/asm/pgtable-ppc64.h
index fa1dfb7f7b48..85e15c8067be 100644
--- a/arch/powerpc/include/asm/pgtable-ppc64.h
+++ b/arch/powerpc/include/asm/pgtable-ppc64.h
@@ -507,9 +507,11 @@ static inline pte_t *pmdp_ptep(pmd_t *pmd)
#define pmd_pfn(pmd) pte_pfn(pmd_pte(pmd))
#define pmd_dirty(pmd) pte_dirty(pmd_pte(pmd))
#define pmd_young(pmd) pte_young(pmd_pte(pmd))
+#define pmd_dirty(pmd) pte_dirty(pmd_pte(pmd))
#define pmd_mkold(pmd) pte_pmd(pte_mkold(pmd_pte(pmd)))
#define pmd_wrprotect(pmd) pte_pmd(pte_wrprotect(pmd_pte(pmd)))
#define pmd_mkdirty(pmd) pte_pmd(pte_mkdirty(pmd_pte(pmd)))
+#define pmd_mkclean(pmd) pte_pmd(pte_mkclean(pmd_pte(pmd)))
#define pmd_mkyoung(pmd) pte_pmd(pte_mkyoung(pmd_pte(pmd)))
#define pmd_mkwrite(pmd) pte_pmd(pte_mkwrite(pmd_pte(pmd)))

--
1.9.1

2015-11-12 04:34:46

by Minchan Kim

Subject: [PATCH v3 11/17] arm: add pmd_mkclean for THP

MADV_FREE needs pmd_dirty and pmd_mkclean to detect recent overwrites of
the contents, since the MADV_FREE syscall can be called on THP pages.

This patch adds pmd_mkclean for MADV_FREE support on THP pages.

Signed-off-by: Minchan Kim <[email protected]>
---
arch/arm/include/asm/pgtable-3level.h | 1 +
1 file changed, 1 insertion(+)

diff --git a/arch/arm/include/asm/pgtable-3level.h b/arch/arm/include/asm/pgtable-3level.h
index a745a2a53853..6d6012a320b2 100644
--- a/arch/arm/include/asm/pgtable-3level.h
+++ b/arch/arm/include/asm/pgtable-3level.h
@@ -249,6 +249,7 @@ PMD_BIT_FUNC(mkold, &= ~PMD_SECT_AF);
PMD_BIT_FUNC(mksplitting, |= L_PMD_SECT_SPLITTING);
PMD_BIT_FUNC(mkwrite, &= ~L_PMD_SECT_RDONLY);
PMD_BIT_FUNC(mkdirty, |= L_PMD_SECT_DIRTY);
+PMD_BIT_FUNC(mkclean, &= ~L_PMD_SECT_DIRTY);
PMD_BIT_FUNC(mkyoung, |= PMD_SECT_AF);

#define pmd_mkhuge(pmd) (__pmd(pmd_val(pmd) & ~PMD_TABLE_BIT))
--
1.9.1

2015-11-12 04:34:45

by Minchan Kim

Subject: [PATCH v3 12/17] arm64: add pmd_mkclean for THP

MADV_FREE needs pmd_dirty and pmd_mkclean to detect recent overwrites of
the contents, since the MADV_FREE syscall can be called on THP pages.

This patch adds pmd_mkclean for MADV_FREE support on THP pages.

Signed-off-by: Minchan Kim <[email protected]>
---
arch/arm64/include/asm/pgtable.h | 1 +
1 file changed, 1 insertion(+)

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 26b066690593..a945263addd4 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -325,6 +325,7 @@ void pmdp_splitting_flush(struct vm_area_struct *vma, unsigned long address,
#define pmd_mksplitting(pmd) pte_pmd(pte_mkspecial(pmd_pte(pmd)))
#define pmd_mkold(pmd) pte_pmd(pte_mkold(pmd_pte(pmd)))
#define pmd_mkwrite(pmd) pte_pmd(pte_mkwrite(pmd_pte(pmd)))
+#define pmd_mkclean(pmd) pte_pmd(pte_mkclean(pmd_pte(pmd)))
#define pmd_mkdirty(pmd) pte_pmd(pte_mkdirty(pmd_pte(pmd)))
#define pmd_mkyoung(pmd) pte_pmd(pte_mkyoung(pmd_pte(pmd)))
#define pmd_mknotpresent(pmd) (__pmd(pmd_val(pmd) & ~PMD_TYPE_MASK))
--
1.9.1

2015-11-12 04:32:58

by Minchan Kim

Subject: [PATCH v3 13/17] mm: don't split THP page when syscall is called

We don't need to split a THP page when the MADV_FREE syscall is called.
The split can be done later, when the VM decides to free the page on the
reclaim path under heavy memory pressure, so we can avoid an unnecessary
THP split.

For that, this patch changes two things:

1. __split_huge_page_map

It does pte_mkdirty on the subpages only if pmd_dirty is true.

2. __split_huge_page_refcount

It removes the unconditional marking of PG_dirty on the subpages.

Cc: Kirill A. Shutemov <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Signed-off-by: Minchan Kim <[email protected]>
---
include/linux/huge_mm.h | 3 +++
mm/huge_memory.c | 46 ++++++++++++++++++++++++++++++++++++++++++----
mm/madvise.c | 12 +++++++++++-
3 files changed, 56 insertions(+), 5 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index ecb080d6ff42..e9db238a75c1 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -19,6 +19,9 @@ extern struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,
unsigned long addr,
pmd_t *pmd,
unsigned int flags);
+extern int madvise_free_huge_pmd(struct mmu_gather *tlb,
+ struct vm_area_struct *vma,
+ pmd_t *pmd, unsigned long addr);
extern int zap_huge_pmd(struct mmu_gather *tlb,
struct vm_area_struct *vma,
pmd_t *pmd, unsigned long addr);
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index bbac913f96bc..b8c9b44af864 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1453,6 +1453,41 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
return 0;
}

+int madvise_free_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
+ pmd_t *pmd, unsigned long addr)
+
+{
+ spinlock_t *ptl;
+ pmd_t orig_pmd;
+ struct page *page;
+ struct mm_struct *mm = tlb->mm;
+
+ if (__pmd_trans_huge_lock(pmd, vma, &ptl) != 1)
+ return 1;
+
+ orig_pmd = *pmd;
+ if (is_huge_zero_pmd(orig_pmd))
+ goto out;
+
+ page = pmd_page(orig_pmd);
+ if (PageActive(page))
+ deactivate_page(page);
+
+ if (pmd_young(orig_pmd) || pmd_dirty(orig_pmd)) {
+ orig_pmd = pmdp_huge_get_and_clear_full(tlb->mm, addr, pmd,
+ tlb->fullmm);
+ orig_pmd = pmd_mkold(orig_pmd);
+ orig_pmd = pmd_mkclean(orig_pmd);
+
+ set_pmd_at(mm, addr, pmd, orig_pmd);
+ tlb_remove_pmd_tlb_entry(tlb, pmd, addr);
+ }
+out:
+ spin_unlock(ptl);
+
+ return 0;
+}
+
int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
pmd_t *pmd, unsigned long addr)
{
@@ -1752,8 +1787,8 @@ static void __split_huge_page_refcount(struct page *page,
(1L << PG_mlocked) |
(1L << PG_uptodate) |
(1L << PG_active) |
- (1L << PG_unevictable)));
- page_tail->flags |= (1L << PG_dirty);
+ (1L << PG_unevictable) |
+ (1L << PG_dirty)));

/* clear PageTail before overwriting first_page */
smp_wmb();
@@ -1787,7 +1822,6 @@ static void __split_huge_page_refcount(struct page *page,

BUG_ON(!PageAnon(page_tail));
BUG_ON(!PageUptodate(page_tail));
- BUG_ON(!PageDirty(page_tail));
BUG_ON(!PageSwapBacked(page_tail));

lru_add_page_tail(page, page_tail, lruvec, list);
@@ -1831,10 +1865,12 @@ static int __split_huge_page_map(struct page *page,
int ret = 0, i;
pgtable_t pgtable;
unsigned long haddr;
+ bool dirty;

pmd = page_check_address_pmd(page, mm, address,
PAGE_CHECK_ADDRESS_PMD_SPLITTING_FLAG, &ptl);
if (pmd) {
+ dirty = pmd_dirty(*pmd);
pgtable = pgtable_trans_huge_withdraw(mm, pmd);
pmd_populate(mm, &_pmd, pgtable);
if (pmd_write(*pmd))
@@ -1850,7 +1886,9 @@ static int __split_huge_page_map(struct page *page,
* permissions across VMAs.
*/
entry = mk_pte(page + i, vma->vm_page_prot);
- entry = maybe_mkwrite(pte_mkdirty(entry), vma);
+ if (dirty)
+ entry = pte_mkdirty(entry);
+ entry = maybe_mkwrite(entry, vma);
if (!pmd_write(*pmd))
entry = pte_wrprotect(entry);
if (!pmd_young(*pmd))
diff --git a/mm/madvise.c b/mm/madvise.c
index 4e67ba0b1104..27ed057c0bd7 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -271,8 +271,17 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
pte_t *pte, ptent;
struct page *page;
int nr_swap = 0;
+ unsigned long next;
+
+ next = pmd_addr_end(addr, end);
+ if (pmd_trans_huge(*pmd)) {
+ if (next - addr != HPAGE_PMD_SIZE)
+ split_huge_page_pmd(vma, addr, pmd);
+ else if (!madvise_free_huge_pmd(tlb, vma, pmd, addr))
+ goto next;
+ /* fall through */
+ }

- split_huge_page_pmd(vma, addr, pmd);
if (pmd_trans_unstable(pmd))
return 0;

@@ -355,6 +364,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
arch_leave_lazy_mmu_mode();
pte_unmap_unlock(pte - 1, ptl);
cond_resched();
+next:
return 0;
}

--
1.9.1

2015-11-12 04:32:56

by Minchan Kim

Subject: [PATCH v3 14/17] mm: introduce wrappers to add new LRU

We have been using the binary variable "file" to identify whether a page is
on the anon LRU or the file LRU. That works, but it becomes an obstacle once
we add a new LRU.

So, this patch introduces some wrapper functions to handle it.

Signed-off-by: Minchan Kim <[email protected]>
---
include/linux/mm_inline.h | 64 +++++++++++++++++++++++++++++++++++++++++--
include/trace/events/vmscan.h | 24 ++++++++--------
mm/compaction.c | 2 +-
mm/huge_memory.c | 5 ++--
mm/memory-failure.c | 7 ++---
mm/memory_hotplug.c | 3 +-
mm/mempolicy.c | 3 +-
mm/migrate.c | 26 ++++++------------
mm/swap.c | 22 ++++++---------
mm/vmscan.c | 12 ++++----
10 files changed, 104 insertions(+), 64 deletions(-)

diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
index cf55945c83fb..5e08a354f936 100644
--- a/include/linux/mm_inline.h
+++ b/include/linux/mm_inline.h
@@ -8,8 +8,8 @@
* page_is_file_cache - should the page be on a file LRU or anon LRU?
* @page: the page to test
*
- * Returns 1 if @page is page cache page backed by a regular filesystem,
- * or 0 if @page is anonymous, tmpfs or otherwise ram or swap backed.
+ * Returns true if @page is page cache page backed by a regular filesystem,
+ * or false if @page is anonymous, tmpfs or otherwise ram or swap backed.
* Used by functions that manipulate the LRU lists, to sort a page
* onto the right LRU list.
*
@@ -17,7 +17,7 @@
* needs to survive until the page is last deleted from the LRU, which
* could be as far down as __page_cache_release.
*/
-static inline int page_is_file_cache(struct page *page)
+static inline bool page_is_file_cache(struct page *page)
{
return !PageSwapBacked(page);
}
@@ -56,6 +56,64 @@ static inline enum lru_list page_lru_base_type(struct page *page)
}

/**
+ * lru_index - which LRU list is lru on for accouting update_page_reclaim_stat
+ *
+ * Used for LRU list index arithmetic.
+ *
+ * Returns 0 if @lru is anon, 1 if it is file.
+ */
+static inline int lru_index(enum lru_list lru)
+{
+ int base;
+
+ switch (lru) {
+ case LRU_INACTIVE_ANON:
+ case LRU_ACTIVE_ANON:
+ base = 0;
+ break;
+ case LRU_INACTIVE_FILE:
+ case LRU_ACTIVE_FILE:
+ base = 1;
+ break;
+ default:
+ BUG();
+ }
+ return base;
+}
+
+/*
+ * page_off_isolate - which LRU list was page on for accouting NR_ISOLATED.
+ * @page: the page to test
+ *
+ * Returns the LRU list a page was on, as an index into the array of
+ * zone_page_state;
+ */
+static inline int page_off_isolate(struct page *page)
+{
+ int lru = NR_ISOLATED_ANON;
+
+ if (!PageSwapBacked(page))
+ lru = NR_ISOLATED_FILE;
+ return lru;
+}
+
+/**
+ * lru_off_isolate - which LRU list was @lru on for accouting NR_ISOLATED.
+ * @lru: the lru to test
+ *
+ * Returns the LRU list a page was on, as an index into the array of
+ * zone_page_state;
+ */
+static inline int lru_off_isolate(enum lru_list lru)
+{
+ int base = NR_ISOLATED_FILE;
+
+ if (lru <= LRU_ACTIVE_ANON)
+ base = NR_ISOLATED_ANON;
+ return base;
+}
+
+/**
* page_off_lru - which LRU list was page on? clearing its lru flags.
* @page: the page to test
*
diff --git a/include/trace/events/vmscan.h b/include/trace/events/vmscan.h
index f66476b96264..4e9e86733849 100644
--- a/include/trace/events/vmscan.h
+++ b/include/trace/events/vmscan.h
@@ -30,9 +30,9 @@
(RECLAIM_WB_ASYNC) \
)

-#define trace_shrink_flags(file) \
+#define trace_shrink_flags(lru) \
( \
- (file ? RECLAIM_WB_FILE : RECLAIM_WB_ANON) | \
+ (lru ? RECLAIM_WB_FILE : RECLAIM_WB_ANON) | \
(RECLAIM_WB_ASYNC) \
)

@@ -271,9 +271,9 @@ DECLARE_EVENT_CLASS(mm_vmscan_lru_isolate_template,
unsigned long nr_scanned,
unsigned long nr_taken,
isolate_mode_t isolate_mode,
- int file),
+ enum lru_list lru),

- TP_ARGS(order, nr_requested, nr_scanned, nr_taken, isolate_mode, file),
+ TP_ARGS(order, nr_requested, nr_scanned, nr_taken, isolate_mode, lru),

TP_STRUCT__entry(
__field(int, order)
@@ -281,7 +281,7 @@ DECLARE_EVENT_CLASS(mm_vmscan_lru_isolate_template,
__field(unsigned long, nr_scanned)
__field(unsigned long, nr_taken)
__field(isolate_mode_t, isolate_mode)
- __field(int, file)
+ __field(enum lru_list, lru)
),

TP_fast_assign(
@@ -290,16 +290,16 @@ DECLARE_EVENT_CLASS(mm_vmscan_lru_isolate_template,
__entry->nr_scanned = nr_scanned;
__entry->nr_taken = nr_taken;
__entry->isolate_mode = isolate_mode;
- __entry->file = file;
+ __entry->lru = lru;
),

- TP_printk("isolate_mode=%d order=%d nr_requested=%lu nr_scanned=%lu nr_taken=%lu file=%d",
+ TP_printk("isolate_mode=%d order=%d nr_requested=%lu nr_scanned=%lu nr_taken=%lu lru=%d",
__entry->isolate_mode,
__entry->order,
__entry->nr_requested,
__entry->nr_scanned,
__entry->nr_taken,
- __entry->file)
+ __entry->lru)
);

DEFINE_EVENT(mm_vmscan_lru_isolate_template, mm_vmscan_lru_isolate,
@@ -309,9 +309,9 @@ DEFINE_EVENT(mm_vmscan_lru_isolate_template, mm_vmscan_lru_isolate,
unsigned long nr_scanned,
unsigned long nr_taken,
isolate_mode_t isolate_mode,
- int file),
+ enum lru_list lru),

- TP_ARGS(order, nr_requested, nr_scanned, nr_taken, isolate_mode, file)
+ TP_ARGS(order, nr_requested, nr_scanned, nr_taken, isolate_mode, lru)

);

@@ -322,9 +322,9 @@ DEFINE_EVENT(mm_vmscan_lru_isolate_template, mm_vmscan_memcg_isolate,
unsigned long nr_scanned,
unsigned long nr_taken,
isolate_mode_t isolate_mode,
- int file),
+ enum lru_list lru),

- TP_ARGS(order, nr_requested, nr_scanned, nr_taken, isolate_mode, file)
+ TP_ARGS(order, nr_requested, nr_scanned, nr_taken, isolate_mode, lru)

);

diff --git a/mm/compaction.c b/mm/compaction.c
index c5c627aae996..d888fa248ebb 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -632,7 +632,7 @@ static void acct_isolated(struct zone *zone, struct compact_control *cc)
return;

list_for_each_entry(page, &cc->migratepages, lru)
- count[!!page_is_file_cache(page)]++;
+ count[page_off_isolate(page) - NR_ISOLATED_ANON]++;

mod_zone_page_state(zone, NR_ISOLATED_ANON, count[0]);
mod_zone_page_state(zone, NR_ISOLATED_FILE, count[1]);
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index b8c9b44af864..d020aec63717 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2218,8 +2218,7 @@ void __khugepaged_exit(struct mm_struct *mm)

static void release_pte_page(struct page *page)
{
- /* 0 stands for page_is_file_cache(page) == false */
- dec_zone_page_state(page, NR_ISOLATED_ANON + 0);
+ dec_zone_page_state(page, page_off_isolate(page));
unlock_page(page);
putback_lru_page(page);
}
@@ -2302,7 +2301,7 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
goto out;
}
/* 0 stands for page_is_file_cache(page) == false */
- inc_zone_page_state(page, NR_ISOLATED_ANON + 0);
+ inc_zone_page_state(page, page_off_isolate(page));
VM_BUG_ON_PAGE(!PageLocked(page), page);
VM_BUG_ON_PAGE(PageLRU(page), page);

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 95882692e747..abf50e00705b 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1682,16 +1682,15 @@ static int __soft_offline_page(struct page *page, int flags)
put_hwpoison_page(page);
if (!ret) {
LIST_HEAD(pagelist);
- inc_zone_page_state(page, NR_ISOLATED_ANON +
- page_is_file_cache(page));
+ inc_zone_page_state(page, page_off_isolate(page));
list_add(&page->lru, &pagelist);
ret = migrate_pages(&pagelist, new_page, NULL, MPOL_MF_MOVE_ALL,
MIGRATE_SYNC, MR_MEMORY_FAILURE);
if (ret) {
if (!list_empty(&pagelist)) {
list_del(&page->lru);
- dec_zone_page_state(page, NR_ISOLATED_ANON +
- page_is_file_cache(page));
+ dec_zone_page_state(page,
+ page_off_isolate(page));
putback_lru_page(page);
}

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index aa992e2df58a..7c8360744551 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1449,8 +1449,7 @@ do_migrate_range(unsigned long start_pfn, unsigned long end_pfn)
put_page(page);
list_add_tail(&page->lru, &source);
move_pages--;
- inc_zone_page_state(page, NR_ISOLATED_ANON +
- page_is_file_cache(page));
+ inc_zone_page_state(page, page_off_isolate(page));

} else {
#ifdef CONFIG_DEBUG_VM
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 87a177917cb2..856b6eb07e42 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -930,8 +930,7 @@ static void migrate_page_add(struct page *page, struct list_head *pagelist,
if ((flags & MPOL_MF_MOVE_ALL) || page_mapcount(page) == 1) {
if (!isolate_lru_page(page)) {
list_add_tail(&page->lru, pagelist);
- inc_zone_page_state(page, NR_ISOLATED_ANON +
- page_is_file_cache(page));
+ inc_zone_page_state(page, page_off_isolate(page));
}
}
}
diff --git a/mm/migrate.c b/mm/migrate.c
index 842ecd7aaf7f..87ebf0833b84 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -91,8 +91,7 @@ void putback_movable_pages(struct list_head *l)
continue;
}
list_del(&page->lru);
- dec_zone_page_state(page, NR_ISOLATED_ANON +
- page_is_file_cache(page));
+ dec_zone_page_state(page, page_off_isolate(page));
if (unlikely(isolated_balloon_page(page)))
balloon_page_putback(page);
else
@@ -964,8 +963,7 @@ static ICE_noinline int unmap_and_move(new_page_t get_new_page,
* restored.
*/
list_del(&page->lru);
- dec_zone_page_state(page, NR_ISOLATED_ANON +
- page_is_file_cache(page));
+ dec_zone_page_state(page, page_off_isolate(page));
/* Soft-offlined page shouldn't go through lru cache list */
if (reason == MR_MEMORY_FAILURE) {
put_page(page);
@@ -1278,8 +1276,7 @@ static int do_move_page_to_node_array(struct mm_struct *mm,
err = isolate_lru_page(page);
if (!err) {
list_add_tail(&page->lru, &pagelist);
- inc_zone_page_state(page, NR_ISOLATED_ANON +
- page_is_file_cache(page));
+ inc_zone_page_state(page, page_off_isolate(page));
}
put_and_set:
/*
@@ -1622,8 +1619,6 @@ static bool numamigrate_update_ratelimit(pg_data_t *pgdat,

static int numamigrate_isolate_page(pg_data_t *pgdat, struct page *page)
{
- int page_lru;
-
VM_BUG_ON_PAGE(compound_order(page) && !PageTransHuge(page), page);

/* Avoid migrating to a node that is nearly full */
@@ -1645,8 +1640,7 @@ static int numamigrate_isolate_page(pg_data_t *pgdat, struct page *page)
return 0;
}

- page_lru = page_is_file_cache(page);
- mod_zone_page_state(page_zone(page), NR_ISOLATED_ANON + page_lru,
+ mod_zone_page_state(page_zone(page), page_off_isolate(page),
hpage_nr_pages(page));

/*
@@ -1704,8 +1698,7 @@ int migrate_misplaced_page(struct page *page, struct vm_area_struct *vma,
if (nr_remaining) {
if (!list_empty(&migratepages)) {
list_del(&page->lru);
- dec_zone_page_state(page, NR_ISOLATED_ANON +
- page_is_file_cache(page));
+ dec_zone_page_state(page, page_off_isolate(page));
putback_lru_page(page);
}
isolated = 0;
@@ -1735,7 +1728,7 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
pg_data_t *pgdat = NODE_DATA(node);
int isolated = 0;
struct page *new_page = NULL;
- int page_lru = page_is_file_cache(page);
+ int page_lru = page_off_isolate(page);
unsigned long mmun_start = address & HPAGE_PMD_MASK;
unsigned long mmun_end = mmun_start + HPAGE_PMD_SIZE;
pmd_t orig_entry;
@@ -1794,8 +1787,7 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
/* Retake the callers reference and putback on LRU */
get_page(page);
putback_lru_page(page);
- mod_zone_page_state(page_zone(page),
- NR_ISOLATED_ANON + page_lru, -HPAGE_PMD_NR);
+ mod_zone_page_state(page_zone(page), page_lru, -HPAGE_PMD_NR);

goto out_unlock;
}
@@ -1847,9 +1839,7 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
count_vm_events(PGMIGRATE_SUCCESS, HPAGE_PMD_NR);
count_vm_numa_events(NUMA_PAGE_MIGRATE, HPAGE_PMD_NR);

- mod_zone_page_state(page_zone(page),
- NR_ISOLATED_ANON + page_lru,
- -HPAGE_PMD_NR);
+ mod_zone_page_state(page_zone(page), page_lru, -HPAGE_PMD_NR);
return isolated;

out_fail:
diff --git a/mm/swap.c b/mm/swap.c
index a2f2cd458de0..367940d093ad 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -490,21 +490,20 @@ void rotate_reclaimable_page(struct page *page)
}

static void update_page_reclaim_stat(struct lruvec *lruvec,
- int file, int rotated)
+ int lru, int rotated)
{
struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;

- reclaim_stat->recent_scanned[file]++;
+ reclaim_stat->recent_scanned[lru]++;
if (rotated)
- reclaim_stat->recent_rotated[file]++;
+ reclaim_stat->recent_rotated[lru]++;
}

static void __activate_page(struct page *page, struct lruvec *lruvec,
void *arg)
{
if (PageLRU(page) && !PageActive(page) && !PageUnevictable(page)) {
- int file = page_is_file_cache(page);
- int lru = page_lru_base_type(page);
+ enum lru_list lru = page_lru_base_type(page);

del_page_from_lru_list(page, lruvec, lru);
SetPageActive(page);
@@ -513,7 +512,7 @@ static void __activate_page(struct page *page, struct lruvec *lruvec,
trace_mm_lru_activate(page);

__count_vm_event(PGACTIVATE);
- update_page_reclaim_stat(lruvec, file, 1);
+ update_page_reclaim_stat(lruvec, lru_index(lru), 1);
}
}

@@ -758,7 +757,7 @@ void lru_cache_add_active_or_unevictable(struct page *page,
static void lru_deactivate_fn(struct page *page, struct lruvec *lruvec,
void *arg)
{
- int lru;
+ enum lru_list lru;
bool file, active;

if (!PageLRU(page) || PageUnevictable(page))
@@ -801,7 +800,7 @@ static void lru_deactivate_fn(struct page *page, struct lruvec *lruvec,
if (active)
__count_vm_event(PGDEACTIVATE);

- update_page_reclaim_stat(lruvec, file, 0);
+ update_page_reclaim_stat(lruvec, lru_index(lru), 0);
}

/*
@@ -1002,8 +1001,6 @@ EXPORT_SYMBOL(__pagevec_release);
void lru_add_page_tail(struct page *page, struct page *page_tail,
struct lruvec *lruvec, struct list_head *list)
{
- const int file = 0;
-
VM_BUG_ON_PAGE(!PageHead(page), page);
VM_BUG_ON_PAGE(PageCompound(page_tail), page);
VM_BUG_ON_PAGE(PageLRU(page_tail), page);
@@ -1034,14 +1031,13 @@ void lru_add_page_tail(struct page *page, struct page *page_tail,
}

if (!PageUnevictable(page))
- update_page_reclaim_stat(lruvec, file, PageActive(page_tail));
+ update_page_reclaim_stat(lruvec, 0, PageActive(page_tail));
}
#endif /* CONFIG_TRANSPARENT_HUGEPAGE */

static void __pagevec_lru_add_fn(struct page *page, struct lruvec *lruvec,
void *arg)
{
- int file = page_is_file_cache(page);
int active = PageActive(page);
enum lru_list lru = page_lru(page);

@@ -1049,7 +1045,7 @@ static void __pagevec_lru_add_fn(struct page *page, struct lruvec *lruvec,

SetPageLRU(page);
add_page_to_lru_list(page, lruvec, lru);
- update_page_reclaim_stat(lruvec, file, active);
+ update_page_reclaim_stat(lruvec, lru_index(lru), active);
trace_mm_lru_insertion(page, lru);
}

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 7a415b9fdd34..f731084c3a23 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1398,7 +1398,7 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,

*nr_scanned = scan;
trace_mm_vmscan_lru_isolate(sc->order, nr_to_scan, scan,
- nr_taken, mode, is_file_lru(lru));
+ nr_taken, mode, lru_index(lru));
return nr_taken;
}

@@ -1599,7 +1599,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
&nr_scanned, sc, isolate_mode, lru);

__mod_zone_page_state(zone, NR_LRU_BASE + lru, -nr_taken);
- __mod_zone_page_state(zone, NR_ISOLATED_ANON + file, nr_taken);
+ __mod_zone_page_state(zone, lru_off_isolate(lru), nr_taken);

if (global_reclaim(sc)) {
__mod_zone_page_state(zone, NR_PAGES_SCANNED, nr_scanned);
@@ -1633,7 +1633,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,

putback_inactive_pages(lruvec, &page_list);

- __mod_zone_page_state(zone, NR_ISOLATED_ANON + file, -nr_taken);
+ __mod_zone_page_state(zone, lru_off_isolate(lru), -nr_taken);

spin_unlock_irq(&zone->lru_lock);

@@ -1701,7 +1701,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
zone_idx(zone),
nr_scanned, nr_reclaimed,
sc->priority,
- trace_shrink_flags(file));
+ trace_shrink_flags(lru));
return nr_reclaimed;
}

@@ -1800,7 +1800,7 @@ static void shrink_active_list(unsigned long nr_to_scan,

__count_zone_vm_events(PGREFILL, zone, nr_scanned);
__mod_zone_page_state(zone, NR_LRU_BASE + lru, -nr_taken);
- __mod_zone_page_state(zone, NR_ISOLATED_ANON + file, nr_taken);
+ __mod_zone_page_state(zone, lru_off_isolate(lru), nr_taken);
spin_unlock_irq(&zone->lru_lock);

while (!list_empty(&l_hold)) {
@@ -1857,7 +1857,7 @@ static void shrink_active_list(unsigned long nr_to_scan,

move_active_pages_to_lru(lruvec, &l_active, &l_hold, lru);
move_active_pages_to_lru(lruvec, &l_inactive, &l_hold, lru - LRU_ACTIVE);
- __mod_zone_page_state(zone, NR_ISOLATED_ANON + file, -nr_taken);
+ __mod_zone_page_state(zone, lru_off_isolate(lru), -nr_taken);
spin_unlock_irq(&zone->lru_lock);

mem_cgroup_uncharge_list(&l_hold);
--
1.9.1

2015-11-12 04:33:48

by Minchan Kim

[permalink] [raw]
Subject: [PATCH v3 15/17] mm: introduce lazyfree LRU list

There are issues with supporting MADV_FREE.

* MADV_FREE pages' hotness

It's really arguable: some think they are cold while others think
they are not. It's workload-dependent, so I don't think there is a
single right answer. IOW, we need a tunable knob.

* MADV_FREE on swapless system

Currently, we instantly free MADV_FREEed pages on a swapless system
because we don't have an aged anonymous LRU list there, so there is
no later chance to discard them.

I tried to solve it with the inactive anonymous LRU list, without
introducing a new LRU list, but that needs a few hooks in the reclaim
path to preserve the old behavior, which I didn't like. Moreover, it
makes implementing the tuning knob hard.

To address these issues, this patch adds a new LazyFree LRU list and
functions for its statistics. Pages on the list have the PG_lazyfree
flag, which overlays PG_mappedtodisk (that should be safe because no
anonymous page can have that flag).

If the user calls madvise(start, len, MADV_FREE), pages in the range
move from the anonymous LRU to the lazyfree LRU. When memory pressure
happens, they can be discarded because there has been no store to
them since the hint. If there is a store, they move back to the
active anonymous LRU list.

In this patch, the aging of lazyfree pages is very basic: it just
discards all pages in the list whenever memory pressure happens.
That is enough to prove the approach works. A later patch will
implement the policy.
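
For reference, here is a minimal userspace sketch of how an allocator
would use the hint with this series. It is illustration only and makes
assumptions: it needs headers that define MADV_FREE, and the fallback
value 8 below is just a placeholder, not necessarily the value this
series assigns.

#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#ifndef MADV_FREE
#define MADV_FREE 8	/* assumed fallback; use the value from the real headers */
#endif

int main(void)
{
	size_t len = 1UL << 20;	/* 1MiB of anonymous memory */
	char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (buf == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	/* Dirty the pages so they sit on the anon LRU. */
	memset(buf, 0xa5, len);

	/*
	 * Hint that the contents are no longer needed.  The pages move to
	 * the lazyfree LRU and may be discarded under memory pressure;
	 * until then, reads still return the old data.
	 */
	if (madvise(buf, len, MADV_FREE))
		perror("madvise(MADV_FREE)");

	/*
	 * A later store dirties the pages again, which cancels the hint;
	 * the kernel keeps them (back on the anon LRU) instead of
	 * discarding them.
	 */
	buf[0] = 1;

	munmap(buf, len);
	return 0;
}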

Signed-off-by: Minchan Kim <[email protected]>
---
drivers/base/node.c | 2 +
drivers/staging/android/lowmemorykiller.c | 3 +-
fs/proc/meminfo.c | 2 +
include/linux/mm_inline.h | 25 +++++++++--
include/linux/mmzone.h | 11 +++--
include/linux/page-flags.h | 5 +++
include/linux/rmap.h | 2 +-
include/linux/swap.h | 1 +
include/linux/vm_event_item.h | 4 +-
include/trace/events/vmscan.h | 18 +++++---
mm/compaction.c | 12 ++++--
mm/huge_memory.c | 4 +-
mm/madvise.c | 3 +-
mm/memcontrol.c | 14 +++++-
mm/migrate.c | 2 +
mm/page_alloc.c | 3 ++
mm/rmap.c | 15 +++++--
mm/swap.c | 48 +++++++++++++++++++++
mm/vmscan.c | 71 +++++++++++++++++++++++++------
mm/vmstat.c | 3 ++
20 files changed, 203 insertions(+), 45 deletions(-)

diff --git a/drivers/base/node.c b/drivers/base/node.c
index 560751bad294..f7a1f2107b43 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -70,6 +70,7 @@ static ssize_t node_read_meminfo(struct device *dev,
"Node %d Active(file): %8lu kB\n"
"Node %d Inactive(file): %8lu kB\n"
"Node %d Unevictable: %8lu kB\n"
+ "Node %d LazyFree: %8lu kB\n"
"Node %d Mlocked: %8lu kB\n",
nid, K(i.totalram),
nid, K(i.freeram),
@@ -83,6 +84,7 @@ static ssize_t node_read_meminfo(struct device *dev,
nid, K(node_page_state(nid, NR_ACTIVE_FILE)),
nid, K(node_page_state(nid, NR_INACTIVE_FILE)),
nid, K(node_page_state(nid, NR_UNEVICTABLE)),
+ nid, K(node_page_state(nid, NR_LZFREE)),
nid, K(node_page_state(nid, NR_MLOCK)));

#ifdef CONFIG_HIGHMEM
diff --git a/drivers/staging/android/lowmemorykiller.c b/drivers/staging/android/lowmemorykiller.c
index 872bd603fd0d..658c16a653c2 100644
--- a/drivers/staging/android/lowmemorykiller.c
+++ b/drivers/staging/android/lowmemorykiller.c
@@ -72,7 +72,8 @@ static unsigned long lowmem_count(struct shrinker *s,
return global_page_state(NR_ACTIVE_ANON) +
global_page_state(NR_ACTIVE_FILE) +
global_page_state(NR_INACTIVE_ANON) +
- global_page_state(NR_INACTIVE_FILE);
+ global_page_state(NR_INACTIVE_FILE) +
+ global_page_state(NR_LZFREE);
}

static unsigned long lowmem_scan(struct shrinker *s, struct shrink_control *sc)
diff --git a/fs/proc/meminfo.c b/fs/proc/meminfo.c
index d3ebf2e61853..3444f7c4e0b6 100644
--- a/fs/proc/meminfo.c
+++ b/fs/proc/meminfo.c
@@ -102,6 +102,7 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
"Active(file): %8lu kB\n"
"Inactive(file): %8lu kB\n"
"Unevictable: %8lu kB\n"
+ "LazyFree: %8lu kB\n"
"Mlocked: %8lu kB\n"
#ifdef CONFIG_HIGHMEM
"HighTotal: %8lu kB\n"
@@ -159,6 +160,7 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
K(pages[LRU_ACTIVE_FILE]),
K(pages[LRU_INACTIVE_FILE]),
K(pages[LRU_UNEVICTABLE]),
+ K(pages[LRU_LZFREE]),
K(global_page_state(NR_MLOCK)),
#ifdef CONFIG_HIGHMEM
K(i.totalhigh),
diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
index 5e08a354f936..7342400f434d 100644
--- a/include/linux/mm_inline.h
+++ b/include/linux/mm_inline.h
@@ -26,6 +26,10 @@ static __always_inline void add_page_to_lru_list(struct page *page,
struct lruvec *lruvec, enum lru_list lru)
{
int nr_pages = hpage_nr_pages(page);
+
+ if (lru == LRU_LZFREE)
+ VM_BUG_ON_PAGE(PageActive(page), page);
+
mem_cgroup_update_lru_size(lruvec, lru, nr_pages);
list_add(&page->lru, &lruvec->lists[lru]);
__mod_zone_page_state(lruvec_zone(lruvec), NR_LRU_BASE + lru, nr_pages);
@@ -35,6 +39,10 @@ static __always_inline void del_page_from_lru_list(struct page *page,
struct lruvec *lruvec, enum lru_list lru)
{
int nr_pages = hpage_nr_pages(page);
+
+ if (lru == LRU_LZFREE)
+ VM_BUG_ON_PAGE(!PageLazyFree(page), page);
+
mem_cgroup_update_lru_size(lruvec, lru, -nr_pages);
list_del(&page->lru);
__mod_zone_page_state(lruvec_zone(lruvec), NR_LRU_BASE + lru, -nr_pages);
@@ -46,12 +54,14 @@ static __always_inline void del_page_from_lru_list(struct page *page,
*
* Used for LRU list index arithmetic.
*
- * Returns the base LRU type - file or anon - @page should be on.
+ * Returns the base LRU type - file or anon or lazyfree - @page should be on.
*/
static inline enum lru_list page_lru_base_type(struct page *page)
{
if (page_is_file_cache(page))
return LRU_INACTIVE_FILE;
+ if (PageLazyFree(page))
+ return LRU_LZFREE;
return LRU_INACTIVE_ANON;
}

@@ -60,7 +70,7 @@ static inline enum lru_list page_lru_base_type(struct page *page)
*
* Used for LRU list index arithmetic.
*
- * Returns 0 if @lru is anon, 1 if it is file.
+ * Returns 0 if @lru is anon, 1 if it is file, 2 if it is lazyfree
*/
static inline int lru_index(enum lru_list lru)
{
@@ -75,6 +85,9 @@ static inline int lru_index(enum lru_list lru)
case LRU_ACTIVE_FILE:
base = 1;
break;
+ case LRU_LZFREE:
+ base = 2;
+ break;
default:
BUG();
}
@@ -90,10 +103,12 @@ static inline int lru_index(enum lru_list lru)
*/
static inline int page_off_isolate(struct page *page)
{
- int lru = NR_ISOLATED_ANON;
+ int lru = NR_ISOLATED_LZFREE;

if (!PageSwapBacked(page))
lru = NR_ISOLATED_FILE;
+ else if (PageLazyFree(page))
+ lru = NR_ISOLATED_LZFREE;
return lru;
}

@@ -106,10 +121,12 @@ static inline int page_off_isolate(struct page *page)
*/
static inline int lru_off_isolate(enum lru_list lru)
{
- int base = NR_ISOLATED_FILE;
+ int base = NR_ISOLATED_LZFREE;

if (lru <= LRU_ACTIVE_ANON)
base = NR_ISOLATED_ANON;
+ else if (lru <= LRU_ACTIVE_FILE)
+ base = NR_ISOLATED_FILE;
return base;
}

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index d94347737292..1aaa436da0d5 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -121,6 +121,7 @@ enum zone_stat_item {
NR_INACTIVE_FILE, /* " " " " " */
NR_ACTIVE_FILE, /* " " " " " */
NR_UNEVICTABLE, /* " " " " " */
+ NR_LZFREE, /* " " " " " */
NR_MLOCK, /* mlock()ed pages found and moved off LRU */
NR_ANON_PAGES, /* Mapped anonymous pages */
NR_FILE_MAPPED, /* pagecache pages mapped into pagetables.
@@ -140,6 +141,7 @@ enum zone_stat_item {
NR_WRITEBACK_TEMP, /* Writeback using temporary buffers */
NR_ISOLATED_ANON, /* Temporary isolated pages from anon lru */
NR_ISOLATED_FILE, /* Temporary isolated pages from file lru */
+ NR_ISOLATED_LZFREE, /* Temporary isolated pages from lzfree lru */
NR_SHMEM, /* shmem pages (included tmpfs/GEM pages) */
NR_DIRTIED, /* page dirtyings since bootup */
NR_WRITTEN, /* page writings since bootup */
@@ -178,6 +180,7 @@ enum lru_list {
LRU_INACTIVE_FILE = LRU_BASE + LRU_FILE,
LRU_ACTIVE_FILE = LRU_BASE + LRU_FILE + LRU_ACTIVE,
LRU_UNEVICTABLE,
+ LRU_LZFREE,
NR_LRU_LISTS
};

@@ -207,10 +210,11 @@ struct zone_reclaim_stat {
* The higher the rotated/scanned ratio, the more valuable
* that cache is.
*
- * The anon LRU stats live in [0], file LRU stats in [1]
+ * The anon LRU stats live in [0], file LRU stats in [1],
+ * lazyfree LRU stats in [2]
*/
- unsigned long recent_rotated[2];
- unsigned long recent_scanned[2];
+ unsigned long recent_rotated[3];
+ unsigned long recent_scanned[3];
};

struct lruvec {
@@ -224,6 +228,7 @@ struct lruvec {
/* Mask used at gathering information at once (see memcontrol.c) */
#define LRU_ALL_FILE (BIT(LRU_INACTIVE_FILE) | BIT(LRU_ACTIVE_FILE))
#define LRU_ALL_ANON (BIT(LRU_INACTIVE_ANON) | BIT(LRU_ACTIVE_ANON))
+#define LRU_ALL_LZFREE (BIT(LRU_LZFREE))
#define LRU_ALL ((1 << NR_LRU_LISTS) - 1)

/* Isolate clean file */
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 416509e26d6d..14f0643af5c4 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -115,6 +115,9 @@ enum pageflags {
#endif
__NR_PAGEFLAGS,

+ /* MADV_FREE */
+ PG_lazyfree = PG_mappedtodisk,
+
/* Filesystems */
PG_checked = PG_owner_priv_1,

@@ -343,6 +346,8 @@ TESTPAGEFLAG_FALSE(Ksm)

u64 stable_page_flags(struct page *page);

+PAGEFLAG(LazyFree, lazyfree);
+
static inline int PageUptodate(struct page *page)
{
int ret = test_bit(PG_uptodate, &(page)->flags);
diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index f4c992826242..edace84b45d5 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -85,7 +85,7 @@ enum ttu_flags {
TTU_UNMAP = 1, /* unmap mode */
TTU_MIGRATION = 2, /* migration mode */
TTU_MUNLOCK = 4, /* munlock mode */
- TTU_FREE = 8, /* free mode */
+ TTU_LZFREE = 8, /* lazyfree mode */

TTU_IGNORE_MLOCK = (1 << 8), /* ignore mlock */
TTU_IGNORE_ACCESS = (1 << 9), /* don't age */
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 8e944c0cedea..f0310eeab3ec 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -308,6 +308,7 @@ extern void lru_add_drain_cpu(int cpu);
extern void lru_add_drain_all(void);
extern void rotate_reclaimable_page(struct page *page);
extern void deactivate_page(struct page *page);
+extern void add_page_to_lazyfree_list(struct page *page);
extern void swap_setup(void);

extern void add_page_to_unevictable_list(struct page *page);
diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index 2b1cef88b827..7ebfd7ca992d 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -23,9 +23,9 @@

enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
FOR_ALL_ZONES(PGALLOC),
- PGFREE, PGACTIVATE, PGDEACTIVATE,
+ PGFREE, PGACTIVATE, PGDEACTIVATE, PGLZFREE,
PGFAULT, PGMAJFAULT,
- PGLAZYFREED,
+ PGLZFREED,
FOR_ALL_ZONES(PGREFILL),
FOR_ALL_ZONES(PGSTEAL_KSWAPD),
FOR_ALL_ZONES(PGSTEAL_DIRECT),
diff --git a/include/trace/events/vmscan.h b/include/trace/events/vmscan.h
index 4e9e86733849..a7ce9169b0fa 100644
--- a/include/trace/events/vmscan.h
+++ b/include/trace/events/vmscan.h
@@ -12,28 +12,32 @@

#define RECLAIM_WB_ANON 0x0001u
#define RECLAIM_WB_FILE 0x0002u
+#define RECLAIM_WB_LZFREE 0x0004u
#define RECLAIM_WB_MIXED 0x0010u
-#define RECLAIM_WB_SYNC 0x0004u /* Unused, all reclaim async */
-#define RECLAIM_WB_ASYNC 0x0008u
+#define RECLAIM_WB_SYNC 0x0040u /* Unused, all reclaim async */
+#define RECLAIM_WB_ASYNC 0x0080u

#define show_reclaim_flags(flags) \
(flags) ? __print_flags(flags, "|", \
{RECLAIM_WB_ANON, "RECLAIM_WB_ANON"}, \
{RECLAIM_WB_FILE, "RECLAIM_WB_FILE"}, \
+ {RECLAIM_WB_LZFREE, "RECLAIM_WB_LZFREE"}, \
{RECLAIM_WB_MIXED, "RECLAIM_WB_MIXED"}, \
{RECLAIM_WB_SYNC, "RECLAIM_WB_SYNC"}, \
{RECLAIM_WB_ASYNC, "RECLAIM_WB_ASYNC"} \
) : "RECLAIM_WB_NONE"

#define trace_reclaim_flags(page) ( \
- (page_is_file_cache(page) ? RECLAIM_WB_FILE : RECLAIM_WB_ANON) | \
- (RECLAIM_WB_ASYNC) \
+ (page_is_file_cache(page) ? RECLAIM_WB_FILE : \
+ (PageLazyFree(page) ? RECLAIM_WB_LZFREE : \
+ RECLAIM_WB_ANON)) | (RECLAIM_WB_ASYNC) \
)

-#define trace_shrink_flags(lru) \
+#define trace_shrink_flags(lru_idx) \
( \
- (lru ? RECLAIM_WB_FILE : RECLAIM_WB_ANON) | \
- (RECLAIM_WB_ASYNC) \
+ (lru_idx == 1 ? RECLAIM_WB_FILE : (lru_idx == 0 ? \
+ RECLAIM_WB_ANON : RECLAIM_WB_LZFREE)) | \
+ (RECLAIM_WB_ASYNC) \
)

TRACE_EVENT(mm_vmscan_kswapd_sleep,
diff --git a/mm/compaction.c b/mm/compaction.c
index d888fa248ebb..cc40c766de38 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -626,7 +626,7 @@ isolate_freepages_range(struct compact_control *cc,
static void acct_isolated(struct zone *zone, struct compact_control *cc)
{
struct page *page;
- unsigned int count[2] = { 0, };
+ unsigned int count[3] = { 0, };

if (list_empty(&cc->migratepages))
return;
@@ -636,21 +636,25 @@ static void acct_isolated(struct zone *zone, struct compact_control *cc)

mod_zone_page_state(zone, NR_ISOLATED_ANON, count[0]);
mod_zone_page_state(zone, NR_ISOLATED_FILE, count[1]);
+ mod_zone_page_state(zone, NR_ISOLATED_LZFREE, count[2]);
}

/* Similar to reclaim, but different enough that they don't share logic */
static bool too_many_isolated(struct zone *zone)
{
- unsigned long active, inactive, isolated;
+ unsigned long active, inactive, lzfree, isolated;

inactive = zone_page_state(zone, NR_INACTIVE_FILE) +
zone_page_state(zone, NR_INACTIVE_ANON);
active = zone_page_state(zone, NR_ACTIVE_FILE) +
zone_page_state(zone, NR_ACTIVE_ANON);
+ lzfree = zone_page_state(zone, NR_LZFREE);
+
isolated = zone_page_state(zone, NR_ISOLATED_FILE) +
- zone_page_state(zone, NR_ISOLATED_ANON);
+ zone_page_state(zone, NR_ISOLATED_ANON) +
+ zone_page_state(zone, NR_ISOLATED_LZFREE);

- return isolated > (inactive + active) / 2;
+ return isolated > (inactive + active + lzfree) / 2;
}

/**
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index d020aec63717..6da441618548 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1470,8 +1470,7 @@ int madvise_free_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
goto out;

page = pmd_page(orig_pmd);
- if (PageActive(page))
- deactivate_page(page);
+ add_page_to_lazyfree_list(page);

if (pmd_young(orig_pmd) || pmd_dirty(orig_pmd)) {
orig_pmd = pmdp_huge_get_and_clear_full(tlb->mm, addr, pmd,
@@ -1787,6 +1786,7 @@ static void __split_huge_page_refcount(struct page *page,
(1L << PG_mlocked) |
(1L << PG_uptodate) |
(1L << PG_active) |
+ (1L << PG_lazyfree) |
(1L << PG_unevictable) |
(1L << PG_dirty)));

diff --git a/mm/madvise.c b/mm/madvise.c
index 27ed057c0bd7..7c88c6cfe300 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -334,8 +334,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
unlock_page(page);
}

- if (PageActive(page))
- deactivate_page(page);
+ add_page_to_lazyfree_list(page);

if (pte_young(ptent) || pte_dirty(ptent)) {
/*
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index c57c4423c688..1dc599ce1bcb 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -109,6 +109,7 @@ static const char * const mem_cgroup_lru_names[] = {
"inactive_file",
"active_file",
"unevictable",
+ "lazyfree",
};

#define THRESHOLDS_EVENTS_TARGET 128
@@ -1402,6 +1403,8 @@ static void mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
static bool test_mem_cgroup_node_reclaimable(struct mem_cgroup *memcg,
int nid, bool noswap)
{
+ if (mem_cgroup_node_nr_lru_pages(memcg, nid, LRU_ALL_LZFREE))
+ return true;
if (mem_cgroup_node_nr_lru_pages(memcg, nid, LRU_ALL_FILE))
return true;
if (noswap || !total_swap_pages)
@@ -3120,6 +3123,7 @@ static int memcg_numa_stat_show(struct seq_file *m, void *v)
{ "total", LRU_ALL },
{ "file", LRU_ALL_FILE },
{ "anon", LRU_ALL_ANON },
+ { "lazyfree", LRU_ALL_LZFREE },
{ "unevictable", BIT(LRU_UNEVICTABLE) },
};
const struct numa_stat *stat;
@@ -3231,8 +3235,8 @@ static int memcg_stat_show(struct seq_file *m, void *v)
int nid, zid;
struct mem_cgroup_per_zone *mz;
struct zone_reclaim_stat *rstat;
- unsigned long recent_rotated[2] = {0, 0};
- unsigned long recent_scanned[2] = {0, 0};
+ unsigned long recent_rotated[3] = {0, 0};
+ unsigned long recent_scanned[3] = {0, 0};

for_each_online_node(nid)
for (zid = 0; zid < MAX_NR_ZONES; zid++) {
@@ -3241,13 +3245,19 @@ static int memcg_stat_show(struct seq_file *m, void *v)

recent_rotated[0] += rstat->recent_rotated[0];
recent_rotated[1] += rstat->recent_rotated[1];
+ recent_rotated[2] += rstat->recent_rotated[2];
recent_scanned[0] += rstat->recent_scanned[0];
recent_scanned[1] += rstat->recent_scanned[1];
+ recent_scanned[2] += rstat->recent_scanned[2];
}
seq_printf(m, "recent_rotated_anon %lu\n", recent_rotated[0]);
seq_printf(m, "recent_rotated_file %lu\n", recent_rotated[1]);
+ seq_printf(m, "recent_rotated_lzfree %lu\n",
+ recent_rotated[2]);
seq_printf(m, "recent_scanned_anon %lu\n", recent_scanned[0]);
seq_printf(m, "recent_scanned_file %lu\n", recent_scanned[1]);
+ seq_printf(m, "recent_scanned_lzfree %lu\n",
+ recent_scanned[2]);
}
#endif

diff --git a/mm/migrate.c b/mm/migrate.c
index 87ebf0833b84..945e5655cd69 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -508,6 +508,8 @@ void migrate_page_copy(struct page *newpage, struct page *page)
SetPageChecked(newpage);
if (PageMappedToDisk(page))
SetPageMappedToDisk(newpage);
+ if (PageLazyFree(page))
+ SetPageLazyFree(newpage);

if (PageDirty(page)) {
clear_page_dirty_for_io(page);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 48aaf7b9f253..5d0321c3bc82 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3712,6 +3712,7 @@ void show_free_areas(unsigned int filter)

printk("active_anon:%lu inactive_anon:%lu isolated_anon:%lu\n"
" active_file:%lu inactive_file:%lu isolated_file:%lu\n"
+ " lazy_free:%lu isolated_lazyfree:%lu\n"
" unevictable:%lu dirty:%lu writeback:%lu unstable:%lu\n"
" slab_reclaimable:%lu slab_unreclaimable:%lu\n"
" mapped:%lu shmem:%lu pagetables:%lu bounce:%lu\n"
@@ -3722,6 +3723,8 @@ void show_free_areas(unsigned int filter)
global_page_state(NR_ACTIVE_FILE),
global_page_state(NR_INACTIVE_FILE),
global_page_state(NR_ISOLATED_FILE),
+ global_page_state(NR_LZFREE),
+ global_page_state(NR_ISOLATED_LZFREE),
global_page_state(NR_UNEVICTABLE),
global_page_state(NR_FILE_DIRTY),
global_page_state(NR_WRITEBACK),
diff --git a/mm/rmap.c b/mm/rmap.c
index 9449e91839ab..75bd68bc8abc 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1374,10 +1374,17 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
swp_entry_t entry = { .val = page_private(page) };
pte_t swp_pte;

- if (!PageDirty(page) && (flags & TTU_FREE)) {
- /* It's a freeable page by MADV_FREE */
- dec_mm_counter(mm, MM_ANONPAGES);
- goto discard;
+ if ((flags & TTU_LZFREE)) {
+ VM_BUG_ON_PAGE(!PageLazyFree(page), page);
+ if (!PageDirty(page)) {
+ /* It's a freeable page by MADV_FREE */
+ dec_mm_counter(mm, MM_ANONPAGES);
+ goto discard;
+ } else {
+ set_pte_at(mm, address, pte, pteval);
+ ret = SWAP_FAIL;
+ goto out_unmap;
+ }
}

if (PageSwapCache(page)) {
diff --git a/mm/swap.c b/mm/swap.c
index 367940d093ad..11c1eb147fd4 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -45,6 +45,7 @@ int page_cluster;
static DEFINE_PER_CPU(struct pagevec, lru_add_pvec);
static DEFINE_PER_CPU(struct pagevec, lru_rotate_pvecs);
static DEFINE_PER_CPU(struct pagevec, lru_deactivate_pvecs);
+static DEFINE_PER_CPU(struct pagevec, lru_lazyfree_pvecs);

/*
* This path almost never happens for VM activity - pages are normally
@@ -507,6 +508,10 @@ static void __activate_page(struct page *page, struct lruvec *lruvec,

del_page_from_lru_list(page, lruvec, lru);
SetPageActive(page);
+ if (lru == LRU_LZFREE) {
+ ClearPageLazyFree(page);
+ lru = LRU_INACTIVE_ANON;
+ }
lru += LRU_ACTIVE;
add_page_to_lru_list(page, lruvec, lru);
trace_mm_lru_activate(page);
@@ -767,6 +772,9 @@ static void lru_deactivate_fn(struct page *page, struct lruvec *lruvec,
active = PageActive(page);
lru = page_lru_base_type(page);

+ if (lru == LRU_LZFREE)
+ return;
+
if (!file && !active)
return;

@@ -803,6 +811,29 @@ static void lru_deactivate_fn(struct page *page, struct lruvec *lruvec,
update_page_reclaim_stat(lruvec, lru_index(lru), 0);
}

+static void lru_lazyfree_fn(struct page *page, struct lruvec *lruvec,
+ void *arg)
+{
+ VM_BUG_ON_PAGE(!PageAnon(page), page);
+
+ if (PageLRU(page) && !PageLazyFree(page) &&
+ !PageUnevictable(page)) {
+ unsigned int nr_pages = 1;
+ bool active = PageActive(page);
+
+ del_page_from_lru_list(page, lruvec,
+ LRU_INACTIVE_ANON + active);
+ ClearPageActive(page);
+ SetPageLazyFree(page);
+ add_page_to_lru_list(page, lruvec, LRU_LZFREE);
+
+ if (PageTransHuge(page))
+ nr_pages = HPAGE_PMD_NR;
+ count_vm_events(PGLZFREE, nr_pages);
+ update_page_reclaim_stat(lruvec, 2, 0);
+ }
+}
+
/*
* Drain pages out of the cpu's pagevecs.
* Either "cpu" is the current CPU, and preemption has already been
@@ -829,9 +860,25 @@ void lru_add_drain_cpu(int cpu)
if (pagevec_count(pvec))
pagevec_lru_move_fn(pvec, lru_deactivate_fn, NULL);

+ pvec = &per_cpu(lru_lazyfree_pvecs, cpu);
+ if (pagevec_count(pvec))
+ pagevec_lru_move_fn(pvec, lru_lazyfree_fn, NULL);
+
activate_page_drain(cpu);
}

+void add_page_to_lazyfree_list(struct page *page)
+{
+ if (PageLRU(page) && !PageLazyFree(page) && !PageUnevictable(page)) {
+ struct pagevec *pvec = &get_cpu_var(lru_lazyfree_pvecs);
+
+ page_cache_get(page);
+ if (!pagevec_add(pvec, page))
+ pagevec_lru_move_fn(pvec, lru_lazyfree_fn, NULL);
+ put_cpu_var(lru_lazyfree_pvecs);
+ }
+}
+
/**
* deactivate_page - forcefully deactivate a page
* @page: page to deactivate
@@ -890,6 +937,7 @@ void lru_add_drain_all(void)
if (pagevec_count(&per_cpu(lru_add_pvec, cpu)) ||
pagevec_count(&per_cpu(lru_rotate_pvecs, cpu)) ||
pagevec_count(&per_cpu(lru_deactivate_pvecs, cpu)) ||
+ pagevec_count(&per_cpu(lru_lazyfree_pvecs, cpu)) ||
need_activate_page_drain(cpu)) {
INIT_WORK(work, lru_add_drain_per_cpu);
schedule_work_on(cpu, work);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index f731084c3a23..3a7d57cbceb3 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -197,7 +197,8 @@ static unsigned long zone_reclaimable_pages(struct zone *zone)
int nr;

nr = zone_page_state(zone, NR_ACTIVE_FILE) +
- zone_page_state(zone, NR_INACTIVE_FILE);
+ zone_page_state(zone, NR_INACTIVE_FILE) +
+ zone_page_state(zone, NR_LZFREE);

if (get_nr_swap_pages() > 0)
nr += zone_page_state(zone, NR_ACTIVE_ANON) +
@@ -918,6 +919,8 @@ static unsigned long shrink_page_list(struct list_head *page_list,

VM_BUG_ON_PAGE(PageActive(page), page);
VM_BUG_ON_PAGE(page_zone(page) != zone, page);
+ VM_BUG_ON_PAGE((ttu_flags & TTU_LZFREE) &&
+ !PageLazyFree(page), page);

sc->nr_scanned++;

@@ -1050,7 +1053,16 @@ static unsigned long shrink_page_list(struct list_head *page_list,
goto keep_locked;
if (!add_to_swap(page, page_list))
goto activate_locked;
- freeable = true;
+ if (ttu_flags & TTU_LZFREE) {
+ freeable = true;
+ } else {
+ /*
+ * anon-LRU list can have !PG_dirty &&
+ * !PG_swapcache && clean pte until
+ * lru_lazyfree_pvec is flushed.
+ */
+ SetPageDirty(page);
+ }
may_enter_fs = 1;

/* Adding to swap updated mapping */
@@ -1063,8 +1075,9 @@ static unsigned long shrink_page_list(struct list_head *page_list,
*/
if (page_mapped(page) && mapping) {
switch (try_to_unmap(page, freeable ?
- (ttu_flags | TTU_BATCH_FLUSH | TTU_FREE) :
- (ttu_flags | TTU_BATCH_FLUSH))) {
+ (ttu_flags | TTU_BATCH_FLUSH) :
+ ((ttu_flags & ~TTU_LZFREE) |
+ TTU_BATCH_FLUSH))) {
case SWAP_FAIL:
goto activate_locked;
case SWAP_AGAIN:
@@ -1190,7 +1203,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
__clear_page_locked(page);
free_it:
if (freeable && !PageDirty(page))
- count_vm_event(PGLAZYFREED);
+ count_vm_event(PGLZFREED);

nr_reclaimed++;

@@ -1458,7 +1471,7 @@ int isolate_lru_page(struct page *page)
* the LRU list will go small and be scanned faster than necessary, leading to
* unnecessary swapping, thrashing and OOM.
*/
-static int too_many_isolated(struct zone *zone, int file,
+static int too_many_isolated(struct zone *zone, int lru_index,
struct scan_control *sc)
{
unsigned long inactive, isolated;
@@ -1469,12 +1482,21 @@ static int too_many_isolated(struct zone *zone, int file,
if (!sane_reclaim(sc))
return 0;

- if (file) {
- inactive = zone_page_state(zone, NR_INACTIVE_FILE);
- isolated = zone_page_state(zone, NR_ISOLATED_FILE);
- } else {
+ switch (lru_index) {
+ case 0:
inactive = zone_page_state(zone, NR_INACTIVE_ANON);
isolated = zone_page_state(zone, NR_ISOLATED_ANON);
+ break;
+ case 1:
+ inactive = zone_page_state(zone, NR_INACTIVE_FILE);
+ isolated = zone_page_state(zone, NR_ISOLATED_FILE);
+ break;
+ case 2:
+ inactive = zone_page_state(zone, NR_LZFREE);
+ isolated = zone_page_state(zone, NR_ISOLATED_LZFREE);
+ break;
+ default:
+ BUG();
}

/*
@@ -1515,6 +1537,10 @@ putback_inactive_pages(struct lruvec *lruvec, struct list_head *page_list)

SetPageLRU(page);
lru = page_lru(page);
+ if (lru == LRU_LZFREE + LRU_ACTIVE) {
+ ClearPageLazyFree(page);
+ lru = LRU_ACTIVE_ANON;
+ }
add_page_to_lru_list(page, lruvec, lru);

if (is_active_lru(lru)) {
@@ -1578,7 +1604,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
struct zone *zone = lruvec_zone(lruvec);
struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;

- while (unlikely(too_many_isolated(zone, file, sc))) {
+ while (unlikely(too_many_isolated(zone, lru_index(lru), sc))) {
congestion_wait(BLK_RW_ASYNC, HZ/10);

/* We are about to die and free our memory. Return now. */
@@ -1613,7 +1639,10 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
if (nr_taken == 0)
return 0;

- nr_reclaimed = shrink_page_list(&page_list, zone, sc, TTU_UNMAP,
+ nr_reclaimed = shrink_page_list(&page_list, zone, sc,
+ (lru != LRU_LZFREE) ?
+ TTU_UNMAP :
+ TTU_UNMAP|TTU_LZFREE,
&nr_dirty, &nr_unqueued_dirty, &nr_congested,
&nr_writeback, &nr_immediate,
false);
@@ -1701,7 +1730,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
zone_idx(zone),
nr_scanned, nr_reclaimed,
sc->priority,
- trace_shrink_flags(lru));
+ trace_shrink_flags(lru_index(lru)));
return nr_reclaimed;
}

@@ -2194,6 +2223,7 @@ static void shrink_lruvec(struct lruvec *lruvec, int swappiness,
unsigned long nr[NR_LRU_LISTS];
unsigned long targets[NR_LRU_LISTS];
unsigned long nr_to_scan;
+ unsigned long nr_to_scan_lzfree;
enum lru_list lru;
unsigned long nr_reclaimed = 0;
unsigned long nr_to_reclaim = sc->nr_to_reclaim;
@@ -2204,6 +2234,7 @@ static void shrink_lruvec(struct lruvec *lruvec, int swappiness,

/* Record the original scan target for proportional adjustments later */
memcpy(targets, nr, sizeof(nr));
+ nr_to_scan_lzfree = get_lru_size(lruvec, LRU_LZFREE);

/*
* Global reclaiming within direct reclaim at DEF_PRIORITY is a normal
@@ -2221,6 +2252,19 @@ static void shrink_lruvec(struct lruvec *lruvec, int swappiness,

init_tlb_ubc();

+ while (nr_to_scan_lzfree) {
+ nr_to_scan = min(nr_to_scan_lzfree, SWAP_CLUSTER_MAX);
+ nr_to_scan_lzfree -= nr_to_scan;
+
+ nr_reclaimed += shrink_inactive_list(nr_to_scan, lruvec,
+ sc, LRU_LZFREE);
+ }
+
+ if (nr_reclaimed >= nr_to_reclaim) {
+ sc->nr_reclaimed += nr_reclaimed;
+ return;
+ }
+
blk_start_plug(&plug);
while (nr[LRU_INACTIVE_ANON] || nr[LRU_ACTIVE_FILE] ||
nr[LRU_INACTIVE_FILE]) {
@@ -2364,6 +2408,7 @@ static inline bool should_continue_reclaim(struct zone *zone,
*/
pages_for_compaction = (2UL << sc->order);
inactive_lru_pages = zone_page_state(zone, NR_INACTIVE_FILE);
+ inactive_lru_pages += zone_page_state(zone, NR_LZFREE);
if (get_nr_swap_pages() > 0)
inactive_lru_pages += zone_page_state(zone, NR_INACTIVE_ANON);
if (sc->nr_reclaimed < pages_for_compaction &&
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 59d45b22355f..df95d9473bba 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -704,6 +704,7 @@ const char * const vmstat_text[] = {
"nr_inactive_file",
"nr_active_file",
"nr_unevictable",
+ "nr_lazyfree",
"nr_mlock",
"nr_anon_pages",
"nr_mapped",
@@ -721,6 +722,7 @@ const char * const vmstat_text[] = {
"nr_writeback_temp",
"nr_isolated_anon",
"nr_isolated_file",
+ "nr_isolated_lazyfree",
"nr_shmem",
"nr_dirtied",
"nr_written",
@@ -756,6 +758,7 @@ const char * const vmstat_text[] = {
"pgfree",
"pgactivate",
"pgdeactivate",
+ "pglazyfree",

"pgfault",
"pgmajfault",
--
1.9.1

2015-11-12 04:32:55

by Minchan Kim

[permalink] [raw]
Subject: [PATCH v3 16/17] mm: support MADV_FREE on swapless system

Historically, we have completely disabled reclaiming of anonymous
pages on systems that are swapped off or configured without swap.
That made sense, but the problem for lazyfree pages is that we never
got a chance to discard MADV_FREE hinted pages in the reclaim path on
those systems.

That's why the current MADV_FREE implementation drops pages
instantly, like MADV_DONTNEED, on a swapless system, so users on
those systems couldn't get the benefit of MADV_FREE.

Now we have the lazyfree LRU list to keep MADV_FREEed pages, so we
can scan and discard them on a swapless system without relying on the
anonymous LRU list.

Signed-off-by: Minchan Kim <[email protected]>
---
mm/madvise.c | 7 +------
mm/swap_state.c | 6 ------
mm/vmscan.c | 37 +++++++++++++++++++++++++++----------
3 files changed, 28 insertions(+), 22 deletions(-)

diff --git a/mm/madvise.c b/mm/madvise.c
index 7c88c6cfe300..3a4c3f7efe20 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -547,12 +547,7 @@ madvise_vma(struct vm_area_struct *vma, struct vm_area_struct **prev,
case MADV_WILLNEED:
return madvise_willneed(vma, prev, start, end);
case MADV_FREE:
- /*
- * XXX: In this implementation, MADV_FREE works like
- * MADV_DONTNEED on swapless system or full swap.
- */
- if (get_nr_swap_pages() > 0)
- return madvise_free(vma, prev, start, end);
+ return madvise_free(vma, prev, start, end);
/* passthrough */
case MADV_DONTNEED:
return madvise_dontneed(vma, prev, start, end);
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 10f63eded7b7..49c683b02ee4 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -170,12 +170,6 @@ int add_to_swap(struct page *page, struct list_head *list)
if (!entry.val)
return 0;

- if (unlikely(PageTransHuge(page)))
- if (unlikely(split_huge_page_to_list(page, list))) {
- swapcache_free(entry);
- return 0;
- }
-
/*
* Radix-tree node allocations from PF_MEMALLOC contexts could
* completely exhaust the page allocator. __GFP_NOMEMALLOC
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 3a7d57cbceb3..cd65db9d3004 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -611,13 +611,18 @@ static int __remove_mapping(struct address_space *mapping, struct page *page,
bool reclaimed)
{
unsigned long flags;
- struct mem_cgroup *memcg;
+ struct mem_cgroup *memcg = NULL;
+ int expected = mapping ? 2 : 1;

BUG_ON(!PageLocked(page));
BUG_ON(mapping != page_mapping(page));
+ VM_BUG_ON_PAGE(mapping == NULL && !PageLazyFree(page), page);
+
+ if (mapping) {
+ memcg = mem_cgroup_begin_page_stat(page);
+ spin_lock_irqsave(&mapping->tree_lock, flags);
+ }

- memcg = mem_cgroup_begin_page_stat(page);
- spin_lock_irqsave(&mapping->tree_lock, flags);
/*
* The non racy check for a busy page.
*
@@ -643,14 +648,18 @@ static int __remove_mapping(struct address_space *mapping, struct page *page,
* Note that if SetPageDirty is always performed via set_page_dirty,
* and thus under tree_lock, then this ordering is not required.
*/
- if (!page_freeze_refs(page, 2))
+ if (!page_freeze_refs(page, expected))
goto cannot_free;
/* note: atomic_cmpxchg in page_freeze_refs provides the smp_rmb */
if (unlikely(PageDirty(page))) {
- page_unfreeze_refs(page, 2);
+ page_unfreeze_refs(page, expected);
goto cannot_free;
}

+ /* No more work to do with backing store */
+ if (!mapping)
+ return 1;
+
if (PageSwapCache(page)) {
swp_entry_t swap = { .val = page_private(page) };
mem_cgroup_swapout(page, swap);
@@ -687,8 +696,10 @@ static int __remove_mapping(struct address_space *mapping, struct page *page,
return 1;

cannot_free:
- spin_unlock_irqrestore(&mapping->tree_lock, flags);
- mem_cgroup_end_page_stat(memcg);
+ if (mapping) {
+ spin_unlock_irqrestore(&mapping->tree_lock, flags);
+ mem_cgroup_end_page_stat(memcg);
+ }
return 0;
}

@@ -1051,7 +1062,12 @@ static unsigned long shrink_page_list(struct list_head *page_list,
if (PageAnon(page) && !PageSwapCache(page)) {
if (!(sc->gfp_mask & __GFP_IO))
goto keep_locked;
- if (!add_to_swap(page, page_list))
+ if (unlikely(PageTransHuge(page)) &&
+ unlikely(split_huge_page_to_list(page,
+ page_list)))
+ goto activate_locked;
+ if (total_swap_pages &&
+ !add_to_swap(page, page_list))
goto activate_locked;
if (ttu_flags & TTU_LZFREE) {
freeable = true;
@@ -1073,7 +1089,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
* The page is mapped into the page tables of one or more
* processes. Try to unmap it here.
*/
- if (page_mapped(page) && mapping) {
+ if (page_mapped(page) && (mapping || freeable)) {
switch (try_to_unmap(page, freeable ?
(ttu_flags | TTU_BATCH_FLUSH) :
((ttu_flags & ~TTU_LZFREE) |
@@ -1190,7 +1206,8 @@ static unsigned long shrink_page_list(struct list_head *page_list,
}
}

- if (!mapping || !__remove_mapping(mapping, page, true))
+ if ((!mapping && !freeable) ||
+ !__remove_mapping(mapping, page, true))
goto keep_locked;

/*
--
1.9.1

2015-11-12 04:33:47

by Minchan Kim

[permalink] [raw]
Subject: [PATCH v3 17/17] mm: add knob to tune lazyfreeing

The hotness of MADV_FREEed pages is very arguable: some think they
are hot while others think they are cold.

Quote from Shaohua
"
My main concern is the policy how we should treat the FREE pages. Moving it to
inactive lru is definitely a good start, I'm wondering if it's enough. The
MADV_FREE increases memory pressure and cause unnecessary reclaim because of
the lazy memory free. While MADV_FREE is intended to be a better replacement of
MADV_DONTNEED, MADV_DONTNEED doesn't have the memory pressure issue as it free
memory immediately. So I hope the MADV_FREE doesn't have impact on memory
pressure too. I'm thinking of adding an extra lru list and watermark for this
to make sure FREE pages can be freed before system wide page reclaim. As you
said, this is arguable, but I hope we can discuss about this issue more.
"

Quote from me
"
It seems the divergence comes from MADV_FREE is *replacement* of MADV_DONTNEED.
But I don't think so. If we could discard MADV_FREEed page *anytime*, I agree
but it's not true because the page would be dirty state when VM want to reclaim.

I'm also against with your's suggestion which let's discard FREEed page before
system wide page reclaim because system would have lots of clean cold page
caches or anonymous pages. In such case, reclaiming of them would be better.
Yeb, it's really workload-dependent so we might need some heuristic which is
normally what we want to avoid.

Having said that, I agree with you we could do better than the deactivation
and frankly speaking, I'm thinking of another LRU list(e.g. tentatively named
"ezreclaim LRU list"). What I have in mind is to age (anon|file|ez)
fairly. IOW, I want to percolate ez-LRU list reclaiming into get_scan_count.
When the MADV_FREE is called, we could move hinted pages from anon-LRU to
ez-LRU and then If VM find to not be able to discard a page in ez-LRU,
it could promote it to active-anon-LRU which would be very natural aging
concept because it means someone touches the page recently.
With that, I don't want to bias one side and don't want to add some knob for
tuning the heuristic but let's rely on common fair aging scheme of VM.
"

Quote from Johannes
"
thread 1:
Even if we're wrong about the aging of those MADV_FREE pages, their
contents are invalidated; they can be discarded freely, and restoring
them is a mere GFP_ZERO allocation. All other anonymous pages have to
be written to disk, and potentially be read back.

[ Arguably, MADV_FREE pages should even be reclaimed before inactive
page cache. It's the same cost to discard both types of pages, but
restoring page cache involves IO. ]

It probably makes sense to stop thinking about them as anonymous pages
entirely at this point when it comes to aging. They're really not. The
LRU lists are split to differentiate access patterns and cost of page
stealing (and restoring). From that angle, MADV_FREE pages really have
nothing in common with in-use anonymous pages, and so they shouldn't
be on the same LRU list.

thread:2
What about them is hot? They contain garbage, you have to write to
them before you can use them. Granted, you might have to refetch
cachelines if you don't do cacheline-aligned populating writes, but
you can do a lot of them before it's more expensive than doing IO.

"

Quote from Daniel
"
thread:1
Keep in mind that this is memory the kernel wouldn't be getting back at
all if the allocator wasn't going out of the way to purge it, and they
aren't going to go out of their way to purge it if it means the kernel
is going to steal the pages when there isn't actually memory pressure.

An allocator would be using MADV_DONTNEED if it didn't expect that the
pages were going to be used again shortly. MADV_FREE indicates that it
has time to inform the kernel that they're unused but they could still
be very hot.

thread:2
It's hot because applications churn through memory via the allocator.

Drop the pages and the application is now churning through page faults
and zeroing rather than simply reusing memory. It's not something that
may happen, it *will* happen. A page in the page cache *may* be reused,
but often won't be, especially when the I/O patterns don't line up well
with the way it works.

The whole point of the feature is not requiring the allocator to have
elaborate mechanisms for aging pages and throttling purging. That ends
up resulting in lots of memory held by userspace where the kernel can't
reclaim it under memory pressure. If it's dropped before page cache, it
isn't going to be able to replace any of that logic in allocators.

The page cache is speculative. Page caching by allocators is not really
speculative. Using MADV_FREE on the pages at all is speculative. The
memory is probably going to be reused fairly soon (unless the process
exits, and then it doesn't matter), but purging will end up reducing
memory usage for the portions that aren't.

It would be a different story for a full unpinning/pinning feature since
that would have other use cases (speculative caches), but this is really
only useful in allocators.
"
You can read the whole thread at https://lkml.org/lkml/2015/11/4/51

Since the issue is arguable and there is no single right answer, I
think we should provide a knob, "lazyfreeness" (I hope someone
suggests a better name).

It is similar to swappiness: higher values discard MADV_FREE pages
more aggressively. If memory pressure happens while the system is
still reclaiming at DEF_PRIORITY (e.g. there are plenty of clean cold
caches), the VM doesn't discard any hinted pages until the scanning
priority is raised.

If memory pressure is higher (i.e. the priority has dropped below
DEF_PRIORITY), it scans

nr_to_reclaim * (DEF_PRIORITY - priority) * lazyfreeness(default: 20) / 50

pages from the lazyfree LRU.

If the system has little free memory and file cache left, it starts
to discard MADV_FREEed pages unconditionally, even if the user set
lazyfreeness to 0.
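
To make the arithmetic concrete, below is a small standalone C model of
the lazyfree scan target this patch computes in get_scan_count(). It is
a simplified sketch, not kernel code: the helper name and the sample
numbers in main() are made up, and DEF_PRIORITY is 12 as in the kernel.

#include <stdio.h>

#define DEF_PRIORITY	12

/* Simplified model of the lazyfree part of get_scan_count(). */
static unsigned long lzfree_scan_target(unsigned long nr_to_reclaim,
					int priority, int lazyfreeness,
					unsigned long nr_lzfree,
					unsigned long zone_free,
					unsigned long zone_file,
					unsigned long high_wmark)
{
	unsigned long scan;

	/* Scales with pressure; zero while we still scan at DEF_PRIORITY. */
	scan = nr_to_reclaim * (DEF_PRIORITY - priority);
	scan = scan * lazyfreeness / 50;

	/* Last resort: almost no free memory and file cache left. */
	if (!scan && zone_file + zone_free <= high_wmark)
		scan = nr_lzfree >> priority;

	return scan < nr_lzfree ? scan : nr_lzfree;
}

int main(void)
{
	/* e.g. 32 pages to reclaim at priority 10 with the default knob of 20 */
	printf("scan %lu lazyfree pages\n",
	       lzfree_scan_target(32, 10, 20, 1024, 4096, 8192, 2048));
	return 0;
}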

Signed-off-by: Minchan Kim <[email protected]>
---
Documentation/sysctl/vm.txt | 13 +++++++++
drivers/base/node.c | 4 +--
fs/proc/meminfo.c | 4 +--
include/linux/memcontrol.h | 1 +
include/linux/mmzone.h | 9 +++---
include/linux/swap.h | 15 ++++++++++
kernel/sysctl.c | 9 ++++++
mm/memcontrol.c | 32 +++++++++++++++++++++-
mm/vmscan.c | 67 ++++++++++++++++++++++++++++-----------------
mm/vmstat.c | 2 +-
10 files changed, 121 insertions(+), 35 deletions(-)

diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
index a4482fceacec..c1dc63381f2c 100644
--- a/Documentation/sysctl/vm.txt
+++ b/Documentation/sysctl/vm.txt
@@ -56,6 +56,7 @@ files can be found in mm/swap.c.
- percpu_pagelist_fraction
- stat_interval
- swappiness
+- lazyfreeness
- user_reserve_kbytes
- vfs_cache_pressure
- zone_reclaim_mode
@@ -737,6 +738,18 @@ The default value is 60.

==============================================================

+lazyfreeness
+
+This control is used to define how aggressive the kernel will discard
+MADV_FREE hinted pages. Higher values will increase agressiveness,
+lower values decrease the amount of discarding. A value of 0 instructs
+the kernel not to initiate discarding until the amount of free and
+file-backed pages is less than the high water mark in a zone.
+
+The default value is 20.
+
+==============================================================
+
- user_reserve_kbytes

When overcommit_memory is set to 2, "never overcommit" mode, reserve
diff --git a/drivers/base/node.c b/drivers/base/node.c
index f7a1f2107b43..3b0bf1b78b2e 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -69,8 +69,8 @@ static ssize_t node_read_meminfo(struct device *dev,
"Node %d Inactive(anon): %8lu kB\n"
"Node %d Active(file): %8lu kB\n"
"Node %d Inactive(file): %8lu kB\n"
- "Node %d Unevictable: %8lu kB\n"
"Node %d LazyFree: %8lu kB\n"
+ "Node %d Unevictable: %8lu kB\n"
"Node %d Mlocked: %8lu kB\n",
nid, K(i.totalram),
nid, K(i.freeram),
@@ -83,8 +83,8 @@ static ssize_t node_read_meminfo(struct device *dev,
nid, K(node_page_state(nid, NR_INACTIVE_ANON)),
nid, K(node_page_state(nid, NR_ACTIVE_FILE)),
nid, K(node_page_state(nid, NR_INACTIVE_FILE)),
- nid, K(node_page_state(nid, NR_UNEVICTABLE)),
nid, K(node_page_state(nid, NR_LZFREE)),
+ nid, K(node_page_state(nid, NR_UNEVICTABLE)),
nid, K(node_page_state(nid, NR_MLOCK)));

#ifdef CONFIG_HIGHMEM
diff --git a/fs/proc/meminfo.c b/fs/proc/meminfo.c
index 3444f7c4e0b6..f47e6a5aa2e5 100644
--- a/fs/proc/meminfo.c
+++ b/fs/proc/meminfo.c
@@ -101,8 +101,8 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
"Inactive(anon): %8lu kB\n"
"Active(file): %8lu kB\n"
"Inactive(file): %8lu kB\n"
- "Unevictable: %8lu kB\n"
"LazyFree: %8lu kB\n"
+ "Unevictable: %8lu kB\n"
"Mlocked: %8lu kB\n"
#ifdef CONFIG_HIGHMEM
"HighTotal: %8lu kB\n"
@@ -159,8 +159,8 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
K(pages[LRU_INACTIVE_ANON]),
K(pages[LRU_ACTIVE_FILE]),
K(pages[LRU_INACTIVE_FILE]),
- K(pages[LRU_UNEVICTABLE]),
K(pages[LRU_LZFREE]),
+ K(pages[LRU_UNEVICTABLE]),
K(global_page_state(NR_MLOCK)),
#ifdef CONFIG_HIGHMEM
K(i.totalhigh),
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 3e3318ddfc0e..5522ff733506 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -210,6 +210,7 @@ struct mem_cgroup {
int under_oom;

int swappiness;
+ int lzfreeness;
/* OOM-Killer disable */
int oom_kill_disable;

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 1aaa436da0d5..cca514a9701d 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -120,8 +120,8 @@ enum zone_stat_item {
NR_ACTIVE_ANON, /* " " " " " */
NR_INACTIVE_FILE, /* " " " " " */
NR_ACTIVE_FILE, /* " " " " " */
- NR_UNEVICTABLE, /* " " " " " */
NR_LZFREE, /* " " " " " */
+ NR_UNEVICTABLE, /* " " " " " */
NR_MLOCK, /* mlock()ed pages found and moved off LRU */
NR_ANON_PAGES, /* Mapped anonymous pages */
NR_FILE_MAPPED, /* pagecache pages mapped into pagetables.
@@ -179,14 +179,15 @@ enum lru_list {
LRU_ACTIVE_ANON = LRU_BASE + LRU_ACTIVE,
LRU_INACTIVE_FILE = LRU_BASE + LRU_FILE,
LRU_ACTIVE_FILE = LRU_BASE + LRU_FILE + LRU_ACTIVE,
- LRU_UNEVICTABLE,
LRU_LZFREE,
+ LRU_UNEVICTABLE,
NR_LRU_LISTS
};

#define for_each_lru(lru) for (lru = 0; lru < NR_LRU_LISTS; lru++)
-
-#define for_each_evictable_lru(lru) for (lru = 0; lru <= LRU_ACTIVE_FILE; lru++)
+#define for_each_anon_file_lru(lru) \
+ for (lru = 0; lru <= LRU_ACTIVE_FILE; lru++)
+#define for_each_evictable_lru(lru) for (lru = 0; lru <= LRU_LZFREE; lru++)

static inline int is_file_lru(enum lru_list lru)
{
diff --git a/include/linux/swap.h b/include/linux/swap.h
index f0310eeab3ec..73bcdc9d0e88 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -330,6 +330,7 @@ extern unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *mem,
unsigned long *nr_scanned);
extern unsigned long shrink_all_memory(unsigned long nr_pages);
extern int vm_swappiness;
+extern int vm_lazyfreeness;
extern int remove_mapping(struct address_space *mapping, struct page *page);
extern unsigned long vm_total_pages;

@@ -361,11 +362,25 @@ static inline int mem_cgroup_swappiness(struct mem_cgroup *memcg)
return memcg->swappiness;
}

+static inline int mem_cgroup_lzfreeness(struct mem_cgroup *memcg)
+{
+ /* root ? */
+ if (mem_cgroup_disabled() || !memcg->css.parent)
+ return vm_lazyfreeness;
+
+ return memcg->lzfreeness;
+}
+
#else
static inline int mem_cgroup_swappiness(struct mem_cgroup *mem)
{
return vm_swappiness;
}
+
+static inline int mem_cgroup_lzfreeness(struct mem_cgroup *mem)
+{
+ return vm_lazyfreeness;
+}
#endif
#ifdef CONFIG_MEMCG_SWAP
extern void mem_cgroup_swapout(struct page *page, swp_entry_t entry);
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index e69201d8094e..2496b10c08e9 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1268,6 +1268,15 @@ static struct ctl_table vm_table[] = {
.extra1 = &zero,
.extra2 = &one_hundred,
},
+ {
+ .procname = "lazyfreeness",
+ .data = &vm_lazyfreeness,
+ .maxlen = sizeof(vm_lazyfreeness),
+ .mode = 0644,
+ .proc_handler = proc_dointvec_minmax,
+ .extra1 = &zero,
+ .extra2 = &one_hundred,
+ },
#ifdef CONFIG_HUGETLB_PAGE
{
.procname = "nr_hugepages",
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 1dc599ce1bcb..5bdbe2a20dc0 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -108,8 +108,8 @@ static const char * const mem_cgroup_lru_names[] = {
"active_anon",
"inactive_file",
"active_file",
- "unevictable",
"lazyfree",
+ "unevictable",
};

#define THRESHOLDS_EVENTS_TARGET 128
@@ -3288,6 +3288,30 @@ static int mem_cgroup_swappiness_write(struct cgroup_subsys_state *css,
return 0;
}

+static u64 mem_cgroup_lzfreeness_read(struct cgroup_subsys_state *css,
+ struct cftype *cft)
+{
+ struct mem_cgroup *memcg = mem_cgroup_from_css(css);
+
+ return mem_cgroup_lzfreeness(memcg);
+}
+
+static int mem_cgroup_lzfreeness_write(struct cgroup_subsys_state *css,
+ struct cftype *cft, u64 val)
+{
+ struct mem_cgroup *memcg = mem_cgroup_from_css(css);
+
+ if (val > 100)
+ return -EINVAL;
+
+ if (css->parent)
+ memcg->lzfreeness = val;
+ else
+ vm_lazyfreeness = val;
+
+ return 0;
+}
+
static void __mem_cgroup_threshold(struct mem_cgroup *memcg, bool swap)
{
struct mem_cgroup_threshold_ary *t;
@@ -4085,6 +4109,11 @@ static struct cftype mem_cgroup_legacy_files[] = {
.write_u64 = mem_cgroup_swappiness_write,
},
{
+ .name = "lazyfreeness",
+ .read_u64 = mem_cgroup_lzfreeness_read,
+ .write_u64 = mem_cgroup_lzfreeness_write,
+ },
+ {
.name = "move_charge_at_immigrate",
.read_u64 = mem_cgroup_move_charge_read,
.write_u64 = mem_cgroup_move_charge_write,
@@ -4305,6 +4334,7 @@ mem_cgroup_css_online(struct cgroup_subsys_state *css)
memcg->use_hierarchy = parent->use_hierarchy;
memcg->oom_kill_disable = parent->oom_kill_disable;
memcg->swappiness = mem_cgroup_swappiness(parent);
+ memcg->lzfreeness = mem_cgroup_lzfreeness(parent);

if (parent->use_hierarchy) {
page_counter_init(&memcg->memory, &parent->memory);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index cd65db9d3004..f1abc8a6ca31 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -141,6 +141,10 @@ struct scan_control {
*/
int vm_swappiness = 60;
/*
+ * From 0 .. 100. Higher means more lazy freeing.
+ */
+int vm_lazyfreeness = 20;
+/*
* The total number of pages which are beyond the high watermark within all
* zones.
*/
@@ -2012,10 +2016,11 @@ enum scan_balance {
*
* nr[0] = anon inactive pages to scan; nr[1] = anon active pages to scan
* nr[2] = file inactive pages to scan; nr[3] = file active pages to scan
+ * nr[4] = lazy free pages to scan;
*/
static void get_scan_count(struct lruvec *lruvec, int swappiness,
- struct scan_control *sc, unsigned long *nr,
- unsigned long *lru_pages)
+ int lzfreeness, struct scan_control *sc,
+ unsigned long *nr, unsigned long *lru_pages)
{
struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;
u64 fraction[2];
@@ -2023,12 +2028,13 @@ static void get_scan_count(struct lruvec *lruvec, int swappiness,
struct zone *zone = lruvec_zone(lruvec);
unsigned long anon_prio, file_prio;
enum scan_balance scan_balance;
- unsigned long anon, file;
+ unsigned long anon, file, lzfree;
bool force_scan = false;
unsigned long ap, fp;
enum lru_list lru;
bool some_scanned;
int pass;
+ unsigned long scan_lzfree = 0;

/*
* If the zone or memcg is small, nr[l] can be 0. This
@@ -2166,7 +2172,7 @@ static void get_scan_count(struct lruvec *lruvec, int swappiness,
/* Only use force_scan on second pass. */
for (pass = 0; !some_scanned && pass < 2; pass++) {
*lru_pages = 0;
- for_each_evictable_lru(lru) {
+ for_each_anon_file_lru(lru) {
int file = is_file_lru(lru);
unsigned long size;
unsigned long scan;
@@ -2212,6 +2218,28 @@ static void get_scan_count(struct lruvec *lruvec, int swappiness,
some_scanned |= !!scan;
}
}
+
+ lzfree = get_lru_size(lruvec, LRU_LZFREE);
+ if (lzfree) {
+ scan_lzfree = sc->nr_to_reclaim *
+ (DEF_PRIORITY - sc->priority);
+ scan_lzfree = div64_u64(scan_lzfree *
+ lzfreeness, 50);
+ if (!scan_lzfree) {
+ unsigned long zonefile, zonefree;
+
+ zonefree = zone_page_state(zone, NR_FREE_PAGES);
+ zonefile = zone_page_state(zone, NR_ACTIVE_FILE) +
+ zone_page_state(zone, NR_INACTIVE_FILE);
+ if (unlikely(zonefile + zonefree <=
+ high_wmark_pages(zone))) {
+ scan_lzfree = get_lru_size(lruvec,
+ LRU_LZFREE) >> sc->priority;
+ }
+ }
+ }
+
+ nr[LRU_LZFREE] = min(scan_lzfree, lzfree);
}

#ifdef CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
@@ -2235,23 +2263,22 @@ static inline void init_tlb_ubc(void)
* This is a basic per-zone page freer. Used by both kswapd and direct reclaim.
*/
static void shrink_lruvec(struct lruvec *lruvec, int swappiness,
- struct scan_control *sc, unsigned long *lru_pages)
+ int lzfreeness, struct scan_control *sc,
+ unsigned long *lru_pages)
{
unsigned long nr[NR_LRU_LISTS];
unsigned long targets[NR_LRU_LISTS];
unsigned long nr_to_scan;
- unsigned long nr_to_scan_lzfree;
enum lru_list lru;
unsigned long nr_reclaimed = 0;
unsigned long nr_to_reclaim = sc->nr_to_reclaim;
struct blk_plug plug;
bool scan_adjusted;

- get_scan_count(lruvec, swappiness, sc, nr, lru_pages);
+ get_scan_count(lruvec, swappiness, lzfreeness, sc, nr, lru_pages);

/* Record the original scan target for proportional adjustments later */
memcpy(targets, nr, sizeof(nr));
- nr_to_scan_lzfree = get_lru_size(lruvec, LRU_LZFREE);

/*
* Global reclaiming within direct reclaim at DEF_PRIORITY is a normal
@@ -2269,22 +2296,9 @@ static void shrink_lruvec(struct lruvec *lruvec, int swappiness,

init_tlb_ubc();

- while (nr_to_scan_lzfree) {
- nr_to_scan = min(nr_to_scan_lzfree, SWAP_CLUSTER_MAX);
- nr_to_scan_lzfree -= nr_to_scan;
-
- nr_reclaimed += shrink_inactive_list(nr_to_scan, lruvec,
- sc, LRU_LZFREE);
- }
-
- if (nr_reclaimed >= nr_to_reclaim) {
- sc->nr_reclaimed += nr_reclaimed;
- return;
- }
-
blk_start_plug(&plug);
while (nr[LRU_INACTIVE_ANON] || nr[LRU_ACTIVE_FILE] ||
- nr[LRU_INACTIVE_FILE]) {
+ nr[LRU_INACTIVE_FILE] || nr[LRU_LZFREE]) {
unsigned long nr_anon, nr_file, percentage;
unsigned long nr_scanned;

@@ -2466,7 +2480,7 @@ static bool shrink_zone(struct zone *zone, struct scan_control *sc,
unsigned long lru_pages;
unsigned long scanned;
struct lruvec *lruvec;
- int swappiness;
+ int swappiness, lzfreeness;

if (mem_cgroup_low(root, memcg)) {
if (!sc->may_thrash)
@@ -2476,9 +2490,11 @@ static bool shrink_zone(struct zone *zone, struct scan_control *sc,

lruvec = mem_cgroup_zone_lruvec(zone, memcg);
swappiness = mem_cgroup_swappiness(memcg);
+ lzfreeness = mem_cgroup_lzfreeness(memcg);
scanned = sc->nr_scanned;

- shrink_lruvec(lruvec, swappiness, sc, &lru_pages);
+ shrink_lruvec(lruvec, swappiness, lzfreeness,
+ sc, &lru_pages);
zone_lru_pages += lru_pages;

if (memcg && is_classzone)
@@ -2944,6 +2960,7 @@ unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *memcg,
};
struct lruvec *lruvec = mem_cgroup_zone_lruvec(zone, memcg);
int swappiness = mem_cgroup_swappiness(memcg);
+ int lzfreeness = mem_cgroup_lzfreeness(memcg);
unsigned long lru_pages;

sc.gfp_mask = (gfp_mask & GFP_RECLAIM_MASK) |
@@ -2960,7 +2977,7 @@ unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *memcg,
* will pick up pages from other mem cgroup's as well. We hack
* the priority and make it zero.
*/
- shrink_lruvec(lruvec, swappiness, &sc, &lru_pages);
+ shrink_lruvec(lruvec, swappiness, lzfreeness, &sc, &lru_pages);

trace_mm_vmscan_memcg_softlimit_reclaim_end(sc.nr_reclaimed);

diff --git a/mm/vmstat.c b/mm/vmstat.c
index df95d9473bba..43effd0374d9 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -703,8 +703,8 @@ const char * const vmstat_text[] = {
"nr_active_anon",
"nr_inactive_file",
"nr_active_file",
- "nr_unevictable",
"nr_lazyfree",
+ "nr_unevictable",
"nr_mlock",
"nr_anon_pages",
"nr_mapped",
--
1.9.1
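
For completeness, a small userspace sketch of reading and tuning the new knob
(a sketch only: it assumes the sysctl is exposed as /proc/sys/vm/lazyfreeness,
which is what the vm_table entry above implies, and the write needs root):

#include <stdio.h>

int main(void)
{
        int val = -1;
        FILE *f = fopen("/proc/sys/vm/lazyfreeness", "r");

        if (f) {
                if (fscanf(f, "%d", &val) == 1)
                        printf("current lazyfreeness: %d\n", val);
                fclose(f);
        }

        f = fopen("/proc/sys/vm/lazyfreeness", "w");
        if (!f) {
                perror("write lazyfreeness (needs root and the patch applied)");
                return 1;
        }
        fprintf(f, "%d\n", 40);  /* > 20 (default): discard hinted pages sooner */
        fclose(f);
        return 0;
}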

2015-11-12 04:50:21

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH v3 01/17] mm: support madvise(MADV_FREE)

On Wed, Nov 11, 2015 at 8:32 PM, Minchan Kim <[email protected]> wrote:
>
> Linux doesn't have the ability to free pages lazily, while other OSes have
> long supported this via madvise(MADV_FREE).
>
> The gain is clear: the kernel can discard freed pages rather than swapping
> them out or hitting OOM when memory pressure happens.


>
> When the madvise syscall is called, the VM clears the dirty bit of the ptes
> in the range. If memory pressure happens, the VM checks the dirty bit in the
> page table and, if it is still "clean", the page is a "lazyfree" page, so the
> VM can discard it instead of swapping it out. If there was a store to the
> page before the VM picked it for reclaim, the dirty bit is set, so the VM
> swaps the page out instead of discarding it.
>

I realize that this lends itself to an efficient implementation, but
it's certainly the case that the kernel *could* use the accessed bit
instead of the dirty bit to give more sensible user semantics, and the
semantics that rely on the dirty bit make me uncomfortable from an ABI
perspective.

I also think that the kernel should commit to either zeroing the page
or leaving it unchanged in response to MADV_FREE (even if the decision
of which to do is made later on). I think that your patch series does
this, but only after a few of the patches are applied (the swap entry
freeing), and I think that it should be a real guaranteed part of the
semantics and maybe have a test case.

--Andy
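
A minimal userspace sketch of the test case suggested above (not part of the
series; it assumes MADV_FREE is defined as 8, as patch 03 proposes, and leaves
the memory-pressure step as a comment):

#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#ifndef MADV_FREE
#define MADV_FREE 8     /* value proposed for all arches in this series */
#endif

int main(void)
{
        size_t len = 64 * 4096;
        unsigned char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                                MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        size_t i;

        if (p == MAP_FAILED)
                return 1;

        memset(p, 0xaa, len);                   /* old data */
        if (madvise(p, len, MADV_FREE))         /* hint: contents disposable */
                return 1;

        /* ... apply memory pressure here so some pages get discarded ... */

        for (i = 0; i < len; i++) {
                if (p[i] != 0xaa && p[i] != 0x00) {
                        fprintf(stderr, "byte %zu is neither old data nor zero\n", i);
                        return 1;
                }
        }
        printf("every byte read back as either the old data or zero\n");
        return 0;
}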

2015-11-12 05:21:38

by Daniel Micay

[permalink] [raw]
Subject: Re: [PATCH v3 01/17] mm: support madvise(MADV_FREE)

> I also think that the kernel should commit to either zeroing the page
> or leaving it unchanged in response to MADV_FREE (even if the decision
> of which to do is made later on). I think that your patch series does
> this, but only after a few of the patches are applied (the swap entry
> freeing), and I think that it should be a real guaranteed part of the
> semantics and maybe have a test case.

This would be a good thing to test because it would be required to add
MADV_FREE_UNDO down the road. It would mean the same semantics as the
MEM_RESET and MEM_RESET_UNDO features on Windows, and there's probably
value in that for the sake of migrating existing software too.

For one example, it could be dropped into Firefox:

https://dxr.mozilla.org/mozilla-central/source/memory/volatile/VolatileBufferWindows.cpp

And in Chromium:

https://code.google.com/p/chromium/codesearch#chromium/src/base/memory/discardable_shared_memory.cc

Worth noting that both also support the API for pinning/unpinning that's
used by Android's ashmem too. Linux really needs a feature like this for
caches. Firefox simply doesn't drop the memory at all on Linux right now:

https://dxr.mozilla.org/mozilla-central/source/memory/volatile/VolatileBufferFallback.cpp

(Lock == pin, Unlock == unpin)

For reference:

https://msdn.microsoft.com/en-us/library/windows/desktop/aa366887(v=vs.85).aspx
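
As a rough sketch of the allocator-side usage being described (the fallback
path and the helper name are this example's assumptions, not something the
series defines):

#include <errno.h>
#include <stdio.h>
#include <sys/mman.h>

#ifndef MADV_FREE
#define MADV_FREE 8     /* value proposed by this series */
#endif

/*
 * Purge a cached-but-unused range: prefer the lazy hint, so the pages can be
 * reused without a fault if nothing reclaims them in the meantime, and fall
 * back to MADV_DONTNEED on kernels that reject MADV_FREE.
 */
static int purge_unused(void *addr, size_t len)
{
        if (madvise(addr, len, MADV_FREE) == 0)
                return 0;
        if (errno == EINVAL)
                return madvise(addr, len, MADV_DONTNEED);
        return -1;
}

int main(void)
{
        size_t len = 16 * 4096;
        void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        if (p == MAP_FAILED)
                return 1;
        /* ... the allocator fills and later stops using this chunk ... */
        if (purge_unused(p, len))
                perror("purge_unused");
        return 0;
}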



2015-11-12 11:26:26

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCH v3 01/17] mm: support madvise(MADV_FREE)

On Thu, Nov 12, 2015 at 01:32:57PM +0900, Minchan Kim wrote:
> @@ -256,6 +260,125 @@ static long madvise_willneed(struct vm_area_struct *vma,
> return 0;
> }
>
> +static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
> + unsigned long end, struct mm_walk *walk)
> +
> +{
> + struct mmu_gather *tlb = walk->private;
> + struct mm_struct *mm = tlb->mm;
> + struct vm_area_struct *vma = walk->vma;
> + spinlock_t *ptl;
> + pte_t *pte, ptent;
> + struct page *page;
> +
> + split_huge_page_pmd(vma, addr, pmd);
> + if (pmd_trans_unstable(pmd))
> + return 0;
> +
> + pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
> + arch_enter_lazy_mmu_mode();
> + for (; addr != end; pte++, addr += PAGE_SIZE) {
> + ptent = *pte;
> +
> + if (!pte_present(ptent))
> + continue;
> +
> + page = vm_normal_page(vma, addr, ptent);
> + if (!page)
> + continue;
> +
> + if (PageSwapCache(page)) {

Could you put VM_BUG_ON_PAGE(PageTransCompound(page), page) here?
Just in case.

> + if (!trylock_page(page))
> + continue;
> +
> + if (!try_to_free_swap(page)) {
> + unlock_page(page);
> + continue;
> + }
> +
> + ClearPageDirty(page);
> + unlock_page(page);

Hm. Do we handle pages shared over fork() here?
Shouldn't we ignore pages with mapcount > 0?

> + }
> +
> + if (pte_young(ptent) || pte_dirty(ptent)) {
> + /*
> + * Some of architecture(ex, PPC) don't update TLB
> + * with set_pte_at and tlb_remove_tlb_entry so for
> + * the portability, remap the pte with old|clean
> + * after pte clearing.
> + */
> + ptent = ptep_get_and_clear_full(mm, addr, pte,
> + tlb->fullmm);
> +
> + ptent = pte_mkold(ptent);
> + ptent = pte_mkclean(ptent);
> + set_pte_at(mm, addr, pte, ptent);
> + tlb_remove_tlb_entry(tlb, pte, addr);
> + }
> + }
> +
> + arch_leave_lazy_mmu_mode();
> + pte_unmap_unlock(pte - 1, ptl);
> + cond_resched();
> + return 0;
> +}
>

--
Kirill A. Shutemov

2015-11-12 11:28:03

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCH v3 03/17] arch: uapi: asm: mman.h: Let MADV_FREE have same value for all architectures

On Thu, Nov 12, 2015 at 01:32:59PM +0900, Minchan Kim wrote:
> From: Chen Gang <[email protected]>
>
> For uapi, we should try to let all macros have the same value, and MADV_FREE
> was added to the main branch recently, so MADV_FREE needs to be redefined.
>
> At present, '8' can be shared by all architectures, so redefine it to '8'.

Why not fold the patch into the previous one?
--
Kirill A. Shutemov

2015-11-12 19:44:58

by Shaohua Li

[permalink] [raw]
Subject: Re: [PATCH v3 17/17] mm: add knob to tune lazyfreeing

On Thu, Nov 12, 2015 at 01:33:13PM +0900, Minchan Kim wrote:
> MADV_FREEed pages' hotness is very arguable.
> Some people think they're hot while others think they're cold.
>
> Quote from Shaohua
> "
> My main concern is the policy how we should treat the FREE pages. Moving it to
> inactive lru is definitionly a good start, I'm wondering if it's enough. The
> MADV_FREE increases memory pressure and cause unnecessary reclaim because of
> the lazy memory free. While MADV_FREE is intended to be a better replacement of
> MADV_DONTNEED, MADV_DONTNEED doesn't have the memory pressure issue as it free
> memory immediately. So I hope the MADV_FREE doesn't have impact on memory
> pressure too. I'm thinking of adding an extra lru list and wartermark for this
> to make sure FREE pages can be freed before system wide page reclaim. As you
> said, this is arguable, but I hope we can discuss about this issue more.
> "
>
> Quote from me
> "
> It seems the divergence comes from MADV_FREE is *replacement* of MADV_DONTNEED.
> But I don't think so. If we could discard MADV_FREEed page *anytime*, I agree
> but it's not true because the page would be dirty state when VM want to reclaim.
>
> I'm also against with your's suggestion which let's discard FREEed page before
> system wide page reclaim because system would have lots of clean cold page
> caches or anonymous pages. In such case, reclaiming of them would be better.
> Yeb, it's really workload-dependent so we might need some heuristic which is
> normally what we want to avoid.
>
> Having said that, I agree with you we could do better than the deactivation
> and frankly speaking, I'm thinking of another LRU list(e.g. tentatively named
> "ezreclaim LRU list"). What I have in mind is to age (anon|file|ez)
> fairly. IOW, I want to percolate ez-LRU list reclaiming into get_scan_count.
> When the MADV_FREE is called, we could move hinted pages from anon-LRU to
> ez-LRU and then If VM find to not be able to discard a page in ez-LRU,
> it could promote it to acive-anon-LRU which would be very natural aging
> concept because it mean someone touches the page recenlty.
> With that, I don't want to bias one side and don't want to add some knob for
> tuning the heuristic but let's rely on common fair aging scheme of VM.
> "
>
> Quote from Johannes
> "
> thread 1:
> Even if we're wrong about the aging of those MADV_FREE pages, their
> contents are invalidated; they can be discarded freely, and restoring
> them is a mere GFP_ZERO allocation. All other anonymous pages have to
> be written to disk, and potentially be read back.
>
> [ Arguably, MADV_FREE pages should even be reclaimed before inactive
> page cache. It's the same cost to discard both types of pages, but
> restoring page cache involves IO. ]
>
> It probably makes sense to stop thinking about them as anonymous pages
> entirely at this point when it comes to aging. They're really not. The
> LRU lists are split to differentiate access patterns and cost of page
> stealing (and restoring). From that angle, MADV_FREE pages really have
> nothing in common with in-use anonymous pages, and so they shouldn't
> be on the same LRU list.
>
> thread:2
> What about them is hot? They contain garbage, you have to write to
> them before you can use them. Granted, you might have to refetch
> cachelines if you don't do cacheline-aligned populating writes, but
> you can do a lot of them before it's more expensive than doing IO.
>
> "
>
> Quote from Daniel
> "
> thread:1
> Keep in mind that this is memory the kernel wouldn't be getting back at
> all if the allocator wasn't going out of the way to purge it, and they
> aren't going to go out of their way to purge it if it means the kernel
> is going to steal the pages when there isn't actually memory pressure.
>
> An allocator would be using MADV_DONTNEED if it didn't expect that the
> pages were going to be used against shortly. MADV_FREE indicates that it
> has time to inform the kernel that they're unused but they could still
> be very hot.
>
> thread:2
> It's hot because applications churn through memory via the allocator.
>
> Drop the pages and the application is now churning through page faults
> and zeroing rather than simply reusing memory. It's not something that
> may happen, it *will* happen. A page in the page cache *may* be reused,
> but often won't be, especially when the I/O patterns don't line up well
> with the way it works.
>
> The whole point of the feature is not requiring the allocator to have
> elaborate mechanisms for aging pages and throttling purging. That ends
> up resulting in lots of memory held by userspace where the kernel can't
> reclaim it under memory pressure. If it's dropped before page cache, it
> isn't going to be able to replace any of that logic in allocators.
>
> The page cache is speculative. Page caching by allocators is not really
> speculative. Using MADV_FREE on the pages at all is speculative. The
> memory is probably going to be reused fairly soon (unless the process
> exits, and then it doesn't matter), but purging will end up reducing
> memory usage for the portions that aren't.
>
> It would be a different story for a full unpinning/pinning feature since
> that would have other use cases (speculative caches), but this is really
> only useful in allocators.
> "
> You could read all thread from https://lkml.org/lkml/2015/11/4/51
>
> Yeah, since the issue is arguable and there is no single decision, I think it
> means we should provide the knob "lazyfreeness" (I hope someone
> suggests a better name).
>
> It's similar to swappiness: higher values discard MADV_FREE
> pages more aggressively. If memory pressure happens and the system works at
> DEF_PRIORITY (e.g. plenty of clean cold caches), the VM doesn't discard any
> hinted pages until the scanning priority is increased.
>
> If memory pressure is higher (i.e. the priority is not DEF_PRIORITY),
> it scans
>
> nr_to_reclaim * priority * lazyfreeness (def: 20) / 50
>
> If the system is low on free memory and file cache, it starts to discard
> MADV_FREEed pages unconditionally even if the user set lazyfreeness to 0.
>
> Signed-off-by: Minchan Kim <[email protected]>
> ---
> Documentation/sysctl/vm.txt | 13 +++++++++
> drivers/base/node.c | 4 +--
> fs/proc/meminfo.c | 4 +--
> include/linux/memcontrol.h | 1 +
> include/linux/mmzone.h | 9 +++---
> include/linux/swap.h | 15 ++++++++++
> kernel/sysctl.c | 9 ++++++
> mm/memcontrol.c | 32 +++++++++++++++++++++-
> mm/vmscan.c | 67 ++++++++++++++++++++++++++++-----------------
> mm/vmstat.c | 2 +-
> 10 files changed, 121 insertions(+), 35 deletions(-)
>
> diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
> index a4482fceacec..c1dc63381f2c 100644
> --- a/Documentation/sysctl/vm.txt
> +++ b/Documentation/sysctl/vm.txt
> @@ -56,6 +56,7 @@ files can be found in mm/swap.c.
> - percpu_pagelist_fraction
> - stat_interval
> - swappiness
> +- lazyfreeness
> - user_reserve_kbytes
> - vfs_cache_pressure
> - zone_reclaim_mode
> @@ -737,6 +738,18 @@ The default value is 60.
>
> ==============================================================
>
> +lazyfreeness
> +
> +This control is used to define how aggressive the kernel will discard
> +MADV_FREE hinted pages. Higher values will increase agressiveness,
> +lower values decrease the amount of discarding. A value of 0 instructs
> +the kernel not to initiate discarding until the amount of free and
> +file-backed pages is less than the high water mark in a zone.
> +
> +The default value is 20.
> +
> +==============================================================
> +
> - user_reserve_kbytes
>
> When overcommit_memory is set to 2, "never overcommit" mode, reserve
> diff --git a/drivers/base/node.c b/drivers/base/node.c
> index f7a1f2107b43..3b0bf1b78b2e 100644
> --- a/drivers/base/node.c
> +++ b/drivers/base/node.c
> @@ -69,8 +69,8 @@ static ssize_t node_read_meminfo(struct device *dev,
> "Node %d Inactive(anon): %8lu kB\n"
> "Node %d Active(file): %8lu kB\n"
> "Node %d Inactive(file): %8lu kB\n"
> - "Node %d Unevictable: %8lu kB\n"
> "Node %d LazyFree: %8lu kB\n"
> + "Node %d Unevictable: %8lu kB\n"
> "Node %d Mlocked: %8lu kB\n",
> nid, K(i.totalram),
> nid, K(i.freeram),
> @@ -83,8 +83,8 @@ static ssize_t node_read_meminfo(struct device *dev,
> nid, K(node_page_state(nid, NR_INACTIVE_ANON)),
> nid, K(node_page_state(nid, NR_ACTIVE_FILE)),
> nid, K(node_page_state(nid, NR_INACTIVE_FILE)),
> - nid, K(node_page_state(nid, NR_UNEVICTABLE)),
> nid, K(node_page_state(nid, NR_LZFREE)),
> + nid, K(node_page_state(nid, NR_UNEVICTABLE)),
> nid, K(node_page_state(nid, NR_MLOCK)));
>
> #ifdef CONFIG_HIGHMEM
> diff --git a/fs/proc/meminfo.c b/fs/proc/meminfo.c
> index 3444f7c4e0b6..f47e6a5aa2e5 100644
> --- a/fs/proc/meminfo.c
> +++ b/fs/proc/meminfo.c
> @@ -101,8 +101,8 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
> "Inactive(anon): %8lu kB\n"
> "Active(file): %8lu kB\n"
> "Inactive(file): %8lu kB\n"
> - "Unevictable: %8lu kB\n"
> "LazyFree: %8lu kB\n"
> + "Unevictable: %8lu kB\n"
> "Mlocked: %8lu kB\n"
> #ifdef CONFIG_HIGHMEM
> "HighTotal: %8lu kB\n"
> @@ -159,8 +159,8 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
> K(pages[LRU_INACTIVE_ANON]),
> K(pages[LRU_ACTIVE_FILE]),
> K(pages[LRU_INACTIVE_FILE]),
> - K(pages[LRU_UNEVICTABLE]),
> K(pages[LRU_LZFREE]),
> + K(pages[LRU_UNEVICTABLE]),
> K(global_page_state(NR_MLOCK)),
> #ifdef CONFIG_HIGHMEM
> K(i.totalhigh),
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 3e3318ddfc0e..5522ff733506 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -210,6 +210,7 @@ struct mem_cgroup {
> int under_oom;
>
> int swappiness;
> + int lzfreeness;
> /* OOM-Killer disable */
> int oom_kill_disable;
>
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 1aaa436da0d5..cca514a9701d 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -120,8 +120,8 @@ enum zone_stat_item {
> NR_ACTIVE_ANON, /* " " " " " */
> NR_INACTIVE_FILE, /* " " " " " */
> NR_ACTIVE_FILE, /* " " " " " */
> - NR_UNEVICTABLE, /* " " " " " */
> NR_LZFREE, /* " " " " " */
> + NR_UNEVICTABLE, /* " " " " " */
> NR_MLOCK, /* mlock()ed pages found and moved off LRU */
> NR_ANON_PAGES, /* Mapped anonymous pages */
> NR_FILE_MAPPED, /* pagecache pages mapped into pagetables.
> @@ -179,14 +179,15 @@ enum lru_list {
> LRU_ACTIVE_ANON = LRU_BASE + LRU_ACTIVE,
> LRU_INACTIVE_FILE = LRU_BASE + LRU_FILE,
> LRU_ACTIVE_FILE = LRU_BASE + LRU_FILE + LRU_ACTIVE,
> - LRU_UNEVICTABLE,
> LRU_LZFREE,
> + LRU_UNEVICTABLE,
> NR_LRU_LISTS
> };
>
> #define for_each_lru(lru) for (lru = 0; lru < NR_LRU_LISTS; lru++)
> -
> -#define for_each_evictable_lru(lru) for (lru = 0; lru <= LRU_ACTIVE_FILE; lru++)
> +#define for_each_anon_file_lru(lru) \
> + for (lru = 0; lru <= LRU_ACTIVE_FILE; lru++)
> +#define for_each_evictable_lru(lru) for (lru = 0; lru <= LRU_LZFREE; lru++)
>
> static inline int is_file_lru(enum lru_list lru)
> {
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index f0310eeab3ec..73bcdc9d0e88 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -330,6 +330,7 @@ extern unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *mem,
> unsigned long *nr_scanned);
> extern unsigned long shrink_all_memory(unsigned long nr_pages);
> extern int vm_swappiness;
> +extern int vm_lazyfreeness;
> extern int remove_mapping(struct address_space *mapping, struct page *page);
> extern unsigned long vm_total_pages;
>
> @@ -361,11 +362,25 @@ static inline int mem_cgroup_swappiness(struct mem_cgroup *memcg)
> return memcg->swappiness;
> }
>
> +static inline int mem_cgroup_lzfreeness(struct mem_cgroup *memcg)
> +{
> + /* root ? */
> + if (mem_cgroup_disabled() || !memcg->css.parent)
> + return vm_lazyfreeness;
> +
> + return memcg->lzfreeness;
> +}
> +
> #else
> static inline int mem_cgroup_swappiness(struct mem_cgroup *mem)
> {
> return vm_swappiness;
> }
> +
> +static inline int mem_cgroup_lzfreeness(struct mem_cgroup *mem)
> +{
> + return vm_lazyfreeness;
> +}
> #endif
> #ifdef CONFIG_MEMCG_SWAP
> extern void mem_cgroup_swapout(struct page *page, swp_entry_t entry);
> diff --git a/kernel/sysctl.c b/kernel/sysctl.c
> index e69201d8094e..2496b10c08e9 100644
> --- a/kernel/sysctl.c
> +++ b/kernel/sysctl.c
> @@ -1268,6 +1268,15 @@ static struct ctl_table vm_table[] = {
> .extra1 = &zero,
> .extra2 = &one_hundred,
> },
> + {
> + .procname = "lazyfreeness",
> + .data = &vm_lazyfreeness,
> + .maxlen = sizeof(vm_lazyfreeness),
> + .mode = 0644,
> + .proc_handler = proc_dointvec_minmax,
> + .extra1 = &zero,
> + .extra2 = &one_hundred,
> + },
> #ifdef CONFIG_HUGETLB_PAGE
> {
> .procname = "nr_hugepages",
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 1dc599ce1bcb..5bdbe2a20dc0 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -108,8 +108,8 @@ static const char * const mem_cgroup_lru_names[] = {
> "active_anon",
> "inactive_file",
> "active_file",
> - "unevictable",
> "lazyfree",
> + "unevictable",
> };
>
> #define THRESHOLDS_EVENTS_TARGET 128
> @@ -3288,6 +3288,30 @@ static int mem_cgroup_swappiness_write(struct cgroup_subsys_state *css,
> return 0;
> }
>
> +static u64 mem_cgroup_lzfreeness_read(struct cgroup_subsys_state *css,
> + struct cftype *cft)
> +{
> + struct mem_cgroup *memcg = mem_cgroup_from_css(css);
> +
> + return mem_cgroup_lzfreeness(memcg);
> +}
> +
> +static int mem_cgroup_lzfreeness_write(struct cgroup_subsys_state *css,
> + struct cftype *cft, u64 val)
> +{
> + struct mem_cgroup *memcg = mem_cgroup_from_css(css);
> +
> + if (val > 100)
> + return -EINVAL;
> +
> + if (css->parent)
> + memcg->lzfreeness = val;
> + else
> + vm_lazyfreeness = val;
> +
> + return 0;
> +}
> +
> static void __mem_cgroup_threshold(struct mem_cgroup *memcg, bool swap)
> {
> struct mem_cgroup_threshold_ary *t;
> @@ -4085,6 +4109,11 @@ static struct cftype mem_cgroup_legacy_files[] = {
> .write_u64 = mem_cgroup_swappiness_write,
> },
> {
> + .name = "lazyfreeness",
> + .read_u64 = mem_cgroup_lzfreeness_read,
> + .write_u64 = mem_cgroup_lzfreeness_write,
> + },
> + {
> .name = "move_charge_at_immigrate",
> .read_u64 = mem_cgroup_move_charge_read,
> .write_u64 = mem_cgroup_move_charge_write,
> @@ -4305,6 +4334,7 @@ mem_cgroup_css_online(struct cgroup_subsys_state *css)
> memcg->use_hierarchy = parent->use_hierarchy;
> memcg->oom_kill_disable = parent->oom_kill_disable;
> memcg->swappiness = mem_cgroup_swappiness(parent);
> + memcg->lzfreeness = mem_cgroup_lzfreeness(parent);
>
> if (parent->use_hierarchy) {
> page_counter_init(&memcg->memory, &parent->memory);
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index cd65db9d3004..f1abc8a6ca31 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -141,6 +141,10 @@ struct scan_control {
> */
> int vm_swappiness = 60;
> /*
> + * From 0 .. 100. Higher means more lazy freeing.
> + */
> +int vm_lazyfreeness = 20;
> +/*
> * The total number of pages which are beyond the high watermark within all
> * zones.
> */
> @@ -2012,10 +2016,11 @@ enum scan_balance {
> *
> * nr[0] = anon inactive pages to scan; nr[1] = anon active pages to scan
> * nr[2] = file inactive pages to scan; nr[3] = file active pages to scan
> + * nr[4] = lazy free pages to scan;
> */
> static void get_scan_count(struct lruvec *lruvec, int swappiness,
> - struct scan_control *sc, unsigned long *nr,
> - unsigned long *lru_pages)
> + int lzfreeness, struct scan_control *sc,
> + unsigned long *nr, unsigned long *lru_pages)
> {
> struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;
> u64 fraction[2];
> @@ -2023,12 +2028,13 @@ static void get_scan_count(struct lruvec *lruvec, int swappiness,
> struct zone *zone = lruvec_zone(lruvec);
> unsigned long anon_prio, file_prio;
> enum scan_balance scan_balance;
> - unsigned long anon, file;
> + unsigned long anon, file, lzfree;
> bool force_scan = false;
> unsigned long ap, fp;
> enum lru_list lru;
> bool some_scanned;
> int pass;
> + unsigned long scan_lzfree = 0;
>
> /*
> * If the zone or memcg is small, nr[l] can be 0. This
> @@ -2166,7 +2172,7 @@ static void get_scan_count(struct lruvec *lruvec, int swappiness,
> /* Only use force_scan on second pass. */
> for (pass = 0; !some_scanned && pass < 2; pass++) {
> *lru_pages = 0;
> - for_each_evictable_lru(lru) {
> + for_each_anon_file_lru(lru) {
> int file = is_file_lru(lru);
> unsigned long size;
> unsigned long scan;
> @@ -2212,6 +2218,28 @@ static void get_scan_count(struct lruvec *lruvec, int swappiness,
> some_scanned |= !!scan;
> }
> }
> +
> + lzfree = get_lru_size(lruvec, LRU_LZFREE);
> + if (lzfree) {
> + scan_lzfree = sc->nr_to_reclaim *
> + (DEF_PRIORITY - sc->priority);

scan_lzfree == 0 if sc->priority == DEF_PRIORITY, is this intended?
> + scan_lzfree = div64_u64(scan_lzfree *
> + lzfreeness, 50);
> + if (!scan_lzfree) {
> + unsigned long zonefile, zonefree;
> +
> + zonefree = zone_page_state(zone, NR_FREE_PAGES);
> + zonefile = zone_page_state(zone, NR_ACTIVE_FILE) +
> + zone_page_state(zone, NR_INACTIVE_FILE);
> + if (unlikely(zonefile + zonefree <=
> + high_wmark_pages(zone))) {
> + scan_lzfree = get_lru_size(lruvec,
> + LRU_LZFREE) >> sc->priority;
> + }
> + }
> + }
> +
> + nr[LRU_LZFREE] = min(scan_lzfree, lzfree);
> }

It looks like there is no setting to reclaim only lazyfree pages. Could we have
an option for this? It's legitimate not to want to trash the page cache because
of lazyfree memory.

Thanks,
Shaohua
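
To make the priority dependence concrete, here is a small sketch of the
arithmetic in the quoted hunk (DEF_PRIORITY is 12 in mainline; nr_to_reclaim =
32 and the default lazyfreeness = 20 are just example inputs; the real code
additionally caps the result at the lazyfree LRU size):

#include <stdio.h>

#define DEF_PRIORITY    12      /* as in the mainline VM */

/* nr[LRU_LZFREE] target, per the hunk in get_scan_count() */
static unsigned long scan_lzfree(unsigned long nr_to_reclaim,
                                 int priority, int lazyfreeness)
{
        return nr_to_reclaim * (DEF_PRIORITY - priority) * lazyfreeness / 50;
}

int main(void)
{
        int prio;

        for (prio = DEF_PRIORITY; prio >= 0; prio--)
                printf("priority %2d -> lazyfree scan target %lu\n",
                       prio, scan_lzfree(32, prio, 20));
        /*
         * priority 12 -> 0: nothing is scanned at DEF_PRIORITY, which is
         * exactly the case Shaohua asks about; only the low-free/low-file
         * fallback in the patch can kick in there.
         */
        return 0;
}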

2015-11-13 06:14:58

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCH v3 01/17] mm: support madvise(MADV_FREE)

On Thu, Nov 12, 2015 at 12:21:30AM -0500, Daniel Micay wrote:
> > I also think that the kernel should commit to either zeroing the page
> > or leaving it unchanged in response to MADV_FREE (even if the decision
> > of which to do is made later on). I think that your patch series does
> > this, but only after a few of the patches are applied (the swap entry
> > freeing), and I think that it should be a real guaranteed part of the
> > semantics and maybe have a test case.
>
> This would be a good thing to test because it would be required to add
> MADV_FREE_UNDO down the road. It would mean the same semantics as the
> MEM_RESET and MEM_RESET_UNDO features on Windows, and there's probably
> value in that for the sake of migrating existing software too.

So, do you mean that we could implement MADV_FREE_UNDO with a "read"
operation (just access-bit marking) easily in the future?

If so, it would be a good reason to change MADV_FREE from the dirty bit to
the access bit. Okay, I will look at that.

>
> For one example, it could be dropped into Firefox:
>
> https://dxr.mozilla.org/mozilla-central/source/memory/volatile/VolatileBufferWindows.cpp
>
> And in Chromium:
>
> https://code.google.com/p/chromium/codesearch#chromium/src/base/memory/discardable_shared_memory.cc
>
> Worth noting that both also support the API for pinning/unpinning that's
> used by Android's ashmem too. Linux really needs a feature like this for
> caches. Firefox simply doesn't drop the memory at all on Linux right now:
>
> https://dxr.mozilla.org/mozilla-central/source/memory/volatile/VolatileBufferFallback.cpp
>
> (Lock == pin, Unlock == unpin)
>
> For reference:
>
> https://msdn.microsoft.com/en-us/library/windows/desktop/aa366887(v=vs.85).aspx
>

2015-11-13 06:17:03

by Daniel Micay

[permalink] [raw]
Subject: Re: [PATCH v3 01/17] mm: support madvise(MADV_FREE)

On 13/11/15 01:15 AM, Minchan Kim wrote:
> On Thu, Nov 12, 2015 at 12:21:30AM -0500, Daniel Micay wrote:
>>> I also think that the kernel should commit to either zeroing the page
>>> or leaving it unchanged in response to MADV_FREE (even if the decision
>>> of which to do is made later on). I think that your patch series does
>>> this, but only after a few of the patches are applied (the swap entry
>>> freeing), and I think that it should be a real guaranteed part of the
>>> semantics and maybe have a test case.
>>
>> This would be a good thing to test because it would be required to add
>> MADV_FREE_UNDO down the road. It would mean the same semantics as the
>> MEM_RESET and MEM_RESET_UNDO features on Windows, and there's probably
>> value in that for the sake of migrating existing software too.
>
> So, do you mean that we could implement MADV_FREE_UNDO with a "read"
> operation (just access-bit marking) easily in the future?
>
> If so, it would be a good reason to change MADV_FREE from the dirty bit to
> the access bit. Okay, I will look at that.

I just meant testing that the data is either zero or the old data if
it's read before it's written to. Not having it stay around once there
is a read. Not sure if that's what Andy meant.



2015-11-13 06:16:51

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCH v3 01/17] mm: support madvise(MADV_FREE)

On Thu, Nov 12, 2015 at 01:26:20PM +0200, Kirill A. Shutemov wrote:
> On Thu, Nov 12, 2015 at 01:32:57PM +0900, Minchan Kim wrote:
> > @@ -256,6 +260,125 @@ static long madvise_willneed(struct vm_area_struct *vma,
> > return 0;
> > }
> >
> > +static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
> > + unsigned long end, struct mm_walk *walk)
> > +
> > +{
> > + struct mmu_gather *tlb = walk->private;
> > + struct mm_struct *mm = tlb->mm;
> > + struct vm_area_struct *vma = walk->vma;
> > + spinlock_t *ptl;
> > + pte_t *pte, ptent;
> > + struct page *page;
> > +
> > + split_huge_page_pmd(vma, addr, pmd);
> > + if (pmd_trans_unstable(pmd))
> > + return 0;
> > +
> > + pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
> > + arch_enter_lazy_mmu_mode();
> > + for (; addr != end; pte++, addr += PAGE_SIZE) {
> > + ptent = *pte;
> > +
> > + if (!pte_present(ptent))
> > + continue;
> > +
> > + page = vm_normal_page(vma, addr, ptent);
> > + if (!page)
> > + continue;
> > +
> > + if (PageSwapCache(page)) {
>
> Could you put VM_BUG_ON_PAGE(PageTransCompound(page), page) here?
> Just in case.

No problem.

>
> > + if (!trylock_page(page))
> > + continue;
> > +
> > + if (!try_to_free_swap(page)) {
> > + unlock_page(page);
> > + continue;
> > + }
> > +
> > + ClearPageDirty(page);
> > + unlock_page(page);
>
> Hm. Do we handle pages shared over fork() here?
> Shouldn't we ignore pages with mapcount > 0?

It was handled in a later patch for historical reasons, but it's better
to fold that patch into this one.

Thanks for review!

2015-11-13 06:18:23

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCH v3 03/17] arch: uapi: asm: mman.h: Let MADV_FREE have same value for all architectures

On Thu, Nov 12, 2015 at 01:27:53PM +0200, Kirill A. Shutemov wrote:
> On Thu, Nov 12, 2015 at 01:32:59PM +0900, Minchan Kim wrote:
> > From: Chen Gang <[email protected]>
> >
> > For uapi, we should try to let all macros have the same value, and MADV_FREE
> > was added to the main branch recently, so MADV_FREE needs to be redefined.
> >
> > At present, '8' can be shared by all architectures, so redefine it to '8'.
>
> Why not fold the patch into the previous one?

Because it was a little bit arguable at that time whether we could use the
number 8 for all arches. If that is settled now, I can simply drop this patch.

2015-11-13 06:19:41

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCH v3 17/17] mm: add knob to tune lazyfreeing

On Thu, Nov 12, 2015 at 11:44:53AM -0800, Shaohua Li wrote:
> On Thu, Nov 12, 2015 at 01:33:13PM +0900, Minchan Kim wrote:
> > MADV_FREEed pages' hotness is very arguable.
> > Some people think they're hot while others think they're cold.
> >
> > Quote from Shaohua
> > "
> > My main concern is the policy how we should treat the FREE pages. Moving it to
> > inactive lru is definitionly a good start, I'm wondering if it's enough. The
> > MADV_FREE increases memory pressure and cause unnecessary reclaim because of
> > the lazy memory free. While MADV_FREE is intended to be a better replacement of
> > MADV_DONTNEED, MADV_DONTNEED doesn't have the memory pressure issue as it free
> > memory immediately. So I hope the MADV_FREE doesn't have impact on memory
> > pressure too. I'm thinking of adding an extra lru list and wartermark for this
> > to make sure FREE pages can be freed before system wide page reclaim. As you
> > said, this is arguable, but I hope we can discuss about this issue more.
> > "
> >
> > Quote from me
> > "
> > It seems the divergence comes from MADV_FREE is *replacement* of MADV_DONTNEED.
> > But I don't think so. If we could discard MADV_FREEed page *anytime*, I agree
> > but it's not true because the page would be dirty state when VM want to reclaim.
> >
> > I'm also against with your's suggestion which let's discard FREEed page before
> > system wide page reclaim because system would have lots of clean cold page
> > caches or anonymous pages. In such case, reclaiming of them would be better.
> > Yeb, it's really workload-dependent so we might need some heuristic which is
> > normally what we want to avoid.
> >
> > Having said that, I agree with you we could do better than the deactivation
> > and frankly speaking, I'm thinking of another LRU list(e.g. tentatively named
> > "ezreclaim LRU list"). What I have in mind is to age (anon|file|ez)
> > fairly. IOW, I want to percolate ez-LRU list reclaiming into get_scan_count.
> > When the MADV_FREE is called, we could move hinted pages from anon-LRU to
> > ez-LRU and then If VM find to not be able to discard a page in ez-LRU,
> > it could promote it to acive-anon-LRU which would be very natural aging
> > concept because it mean someone touches the page recenlty.
> > With that, I don't want to bias one side and don't want to add some knob for
> > tuning the heuristic but let's rely on common fair aging scheme of VM.
> > "
> >
> > Quote from Johannes
> > "
> > thread 1:
> > Even if we're wrong about the aging of those MADV_FREE pages, their
> > contents are invalidated; they can be discarded freely, and restoring
> > them is a mere GFP_ZERO allocation. All other anonymous pages have to
> > be written to disk, and potentially be read back.
> >
> > [ Arguably, MADV_FREE pages should even be reclaimed before inactive
> > page cache. It's the same cost to discard both types of pages, but
> > restoring page cache involves IO. ]
> >
> > It probably makes sense to stop thinking about them as anonymous pages
> > entirely at this point when it comes to aging. They're really not. The
> > LRU lists are split to differentiate access patterns and cost of page
> > stealing (and restoring). From that angle, MADV_FREE pages really have
> > nothing in common with in-use anonymous pages, and so they shouldn't
> > be on the same LRU list.
> >
> > thread:2
> > What about them is hot? They contain garbage, you have to write to
> > them before you can use them. Granted, you might have to refetch
> > cachelines if you don't do cacheline-aligned populating writes, but
> > you can do a lot of them before it's more expensive than doing IO.
> >
> > "
> >
> > Quote from Daniel
> > "
> > thread:1
> > Keep in mind that this is memory the kernel wouldn't be getting back at
> > all if the allocator wasn't going out of the way to purge it, and they
> > aren't going to go out of their way to purge it if it means the kernel
> > is going to steal the pages when there isn't actually memory pressure.
> >
> > An allocator would be using MADV_DONTNEED if it didn't expect that the
> > pages were going to be used against shortly. MADV_FREE indicates that it
> > has time to inform the kernel that they're unused but they could still
> > be very hot.
> >
> > thread:2
> > It's hot because applications churn through memory via the allocator.
> >
> > Drop the pages and the application is now churning through page faults
> > and zeroing rather than simply reusing memory. It's not something that
> > may happen, it *will* happen. A page in the page cache *may* be reused,
> > but often won't be, especially when the I/O patterns don't line up well
> > with the way it works.
> >
> > The whole point of the feature is not requiring the allocator to have
> > elaborate mechanisms for aging pages and throttling purging. That ends
> > up resulting in lots of memory held by userspace where the kernel can't
> > reclaim it under memory pressure. If it's dropped before page cache, it
> > isn't going to be able to replace any of that logic in allocators.
> >
> > The page cache is speculative. Page caching by allocators is not really
> > speculative. Using MADV_FREE on the pages at all is speculative. The
> > memory is probably going to be reused fairly soon (unless the process
> > exits, and then it doesn't matter), but purging will end up reducing
> > memory usage for the portions that aren't.
> >
> > It would be a different story for a full unpinning/pinning feature since
> > that would have other use cases (speculative caches), but this is really
> > only useful in allocators.
> > "
> > You could read all thread from https://lkml.org/lkml/2015/11/4/51
> >
> > Yeah, since the issue is arguable and there is no single decision, I think it
> > means we should provide the knob "lazyfreeness" (I hope someone
> > suggests a better name).
> >
> > It's similar to swappiness: higher values discard MADV_FREE
> > pages more aggressively. If memory pressure happens and the system works at
> > DEF_PRIORITY (e.g. plenty of clean cold caches), the VM doesn't discard any
> > hinted pages until the scanning priority is increased.
> >
> > If memory pressure is higher (i.e. the priority is not DEF_PRIORITY),
> > it scans
> >
> > nr_to_reclaim * priority * lazyfreeness (def: 20) / 50
> >
> > If the system is low on free memory and file cache, it starts to discard
> > MADV_FREEed pages unconditionally even if the user set lazyfreeness to 0.
> >
> > Signed-off-by: Minchan Kim <[email protected]>
> > ---
> > Documentation/sysctl/vm.txt | 13 +++++++++
> > drivers/base/node.c | 4 +--
> > fs/proc/meminfo.c | 4 +--
> > include/linux/memcontrol.h | 1 +
> > include/linux/mmzone.h | 9 +++---
> > include/linux/swap.h | 15 ++++++++++
> > kernel/sysctl.c | 9 ++++++
> > mm/memcontrol.c | 32 +++++++++++++++++++++-
> > mm/vmscan.c | 67 ++++++++++++++++++++++++++++-----------------
> > mm/vmstat.c | 2 +-
> > 10 files changed, 121 insertions(+), 35 deletions(-)
> >
> > diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
> > index a4482fceacec..c1dc63381f2c 100644
> > --- a/Documentation/sysctl/vm.txt
> > +++ b/Documentation/sysctl/vm.txt
> > @@ -56,6 +56,7 @@ files can be found in mm/swap.c.
> > - percpu_pagelist_fraction
> > - stat_interval
> > - swappiness
> > +- lazyfreeness
> > - user_reserve_kbytes
> > - vfs_cache_pressure
> > - zone_reclaim_mode
> > @@ -737,6 +738,18 @@ The default value is 60.
> >
> > ==============================================================
> >
> > +lazyfreeness
> > +
> > +This control is used to define how aggressive the kernel will discard
> > +MADV_FREE hinted pages. Higher values will increase agressiveness,
> > +lower values decrease the amount of discarding. A value of 0 instructs
> > +the kernel not to initiate discarding until the amount of free and
> > +file-backed pages is less than the high water mark in a zone.
> > +
> > +The default value is 20.
> > +
> > +==============================================================
> > +
> > - user_reserve_kbytes
> >
> > When overcommit_memory is set to 2, "never overcommit" mode, reserve
> > diff --git a/drivers/base/node.c b/drivers/base/node.c
> > index f7a1f2107b43..3b0bf1b78b2e 100644
> > --- a/drivers/base/node.c
> > +++ b/drivers/base/node.c
> > @@ -69,8 +69,8 @@ static ssize_t node_read_meminfo(struct device *dev,
> > "Node %d Inactive(anon): %8lu kB\n"
> > "Node %d Active(file): %8lu kB\n"
> > "Node %d Inactive(file): %8lu kB\n"
> > - "Node %d Unevictable: %8lu kB\n"
> > "Node %d LazyFree: %8lu kB\n"
> > + "Node %d Unevictable: %8lu kB\n"
> > "Node %d Mlocked: %8lu kB\n",
> > nid, K(i.totalram),
> > nid, K(i.freeram),
> > @@ -83,8 +83,8 @@ static ssize_t node_read_meminfo(struct device *dev,
> > nid, K(node_page_state(nid, NR_INACTIVE_ANON)),
> > nid, K(node_page_state(nid, NR_ACTIVE_FILE)),
> > nid, K(node_page_state(nid, NR_INACTIVE_FILE)),
> > - nid, K(node_page_state(nid, NR_UNEVICTABLE)),
> > nid, K(node_page_state(nid, NR_LZFREE)),
> > + nid, K(node_page_state(nid, NR_UNEVICTABLE)),
> > nid, K(node_page_state(nid, NR_MLOCK)));
> >
> > #ifdef CONFIG_HIGHMEM
> > diff --git a/fs/proc/meminfo.c b/fs/proc/meminfo.c
> > index 3444f7c4e0b6..f47e6a5aa2e5 100644
> > --- a/fs/proc/meminfo.c
> > +++ b/fs/proc/meminfo.c
> > @@ -101,8 +101,8 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
> > "Inactive(anon): %8lu kB\n"
> > "Active(file): %8lu kB\n"
> > "Inactive(file): %8lu kB\n"
> > - "Unevictable: %8lu kB\n"
> > "LazyFree: %8lu kB\n"
> > + "Unevictable: %8lu kB\n"
> > "Mlocked: %8lu kB\n"
> > #ifdef CONFIG_HIGHMEM
> > "HighTotal: %8lu kB\n"
> > @@ -159,8 +159,8 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
> > K(pages[LRU_INACTIVE_ANON]),
> > K(pages[LRU_ACTIVE_FILE]),
> > K(pages[LRU_INACTIVE_FILE]),
> > - K(pages[LRU_UNEVICTABLE]),
> > K(pages[LRU_LZFREE]),
> > + K(pages[LRU_UNEVICTABLE]),
> > K(global_page_state(NR_MLOCK)),
> > #ifdef CONFIG_HIGHMEM
> > K(i.totalhigh),
> > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> > index 3e3318ddfc0e..5522ff733506 100644
> > --- a/include/linux/memcontrol.h
> > +++ b/include/linux/memcontrol.h
> > @@ -210,6 +210,7 @@ struct mem_cgroup {
> > int under_oom;
> >
> > int swappiness;
> > + int lzfreeness;
> > /* OOM-Killer disable */
> > int oom_kill_disable;
> >
> > diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> > index 1aaa436da0d5..cca514a9701d 100644
> > --- a/include/linux/mmzone.h
> > +++ b/include/linux/mmzone.h
> > @@ -120,8 +120,8 @@ enum zone_stat_item {
> > NR_ACTIVE_ANON, /* " " " " " */
> > NR_INACTIVE_FILE, /* " " " " " */
> > NR_ACTIVE_FILE, /* " " " " " */
> > - NR_UNEVICTABLE, /* " " " " " */
> > NR_LZFREE, /* " " " " " */
> > + NR_UNEVICTABLE, /* " " " " " */
> > NR_MLOCK, /* mlock()ed pages found and moved off LRU */
> > NR_ANON_PAGES, /* Mapped anonymous pages */
> > NR_FILE_MAPPED, /* pagecache pages mapped into pagetables.
> > @@ -179,14 +179,15 @@ enum lru_list {
> > LRU_ACTIVE_ANON = LRU_BASE + LRU_ACTIVE,
> > LRU_INACTIVE_FILE = LRU_BASE + LRU_FILE,
> > LRU_ACTIVE_FILE = LRU_BASE + LRU_FILE + LRU_ACTIVE,
> > - LRU_UNEVICTABLE,
> > LRU_LZFREE,
> > + LRU_UNEVICTABLE,
> > NR_LRU_LISTS
> > };
> >
> > #define for_each_lru(lru) for (lru = 0; lru < NR_LRU_LISTS; lru++)
> > -
> > -#define for_each_evictable_lru(lru) for (lru = 0; lru <= LRU_ACTIVE_FILE; lru++)
> > +#define for_each_anon_file_lru(lru) \
> > + for (lru = 0; lru <= LRU_ACTIVE_FILE; lru++)
> > +#define for_each_evictable_lru(lru) for (lru = 0; lru <= LRU_LZFREE; lru++)
> >
> > static inline int is_file_lru(enum lru_list lru)
> > {
> > diff --git a/include/linux/swap.h b/include/linux/swap.h
> > index f0310eeab3ec..73bcdc9d0e88 100644
> > --- a/include/linux/swap.h
> > +++ b/include/linux/swap.h
> > @@ -330,6 +330,7 @@ extern unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *mem,
> > unsigned long *nr_scanned);
> > extern unsigned long shrink_all_memory(unsigned long nr_pages);
> > extern int vm_swappiness;
> > +extern int vm_lazyfreeness;
> > extern int remove_mapping(struct address_space *mapping, struct page *page);
> > extern unsigned long vm_total_pages;
> >
> > @@ -361,11 +362,25 @@ static inline int mem_cgroup_swappiness(struct mem_cgroup *memcg)
> > return memcg->swappiness;
> > }
> >
> > +static inline int mem_cgroup_lzfreeness(struct mem_cgroup *memcg)
> > +{
> > + /* root ? */
> > + if (mem_cgroup_disabled() || !memcg->css.parent)
> > + return vm_lazyfreeness;
> > +
> > + return memcg->lzfreeness;
> > +}
> > +
> > #else
> > static inline int mem_cgroup_swappiness(struct mem_cgroup *mem)
> > {
> > return vm_swappiness;
> > }
> > +
> > +static inline int mem_cgroup_lzfreeness(struct mem_cgroup *mem)
> > +{
> > + return vm_lazyfreeness;
> > +}
> > #endif
> > #ifdef CONFIG_MEMCG_SWAP
> > extern void mem_cgroup_swapout(struct page *page, swp_entry_t entry);
> > diff --git a/kernel/sysctl.c b/kernel/sysctl.c
> > index e69201d8094e..2496b10c08e9 100644
> > --- a/kernel/sysctl.c
> > +++ b/kernel/sysctl.c
> > @@ -1268,6 +1268,15 @@ static struct ctl_table vm_table[] = {
> > .extra1 = &zero,
> > .extra2 = &one_hundred,
> > },
> > + {
> > + .procname = "lazyfreeness",
> > + .data = &vm_lazyfreeness,
> > + .maxlen = sizeof(vm_lazyfreeness),
> > + .mode = 0644,
> > + .proc_handler = proc_dointvec_minmax,
> > + .extra1 = &zero,
> > + .extra2 = &one_hundred,
> > + },
> > #ifdef CONFIG_HUGETLB_PAGE
> > {
> > .procname = "nr_hugepages",
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index 1dc599ce1bcb..5bdbe2a20dc0 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -108,8 +108,8 @@ static const char * const mem_cgroup_lru_names[] = {
> > "active_anon",
> > "inactive_file",
> > "active_file",
> > - "unevictable",
> > "lazyfree",
> > + "unevictable",
> > };
> >
> > #define THRESHOLDS_EVENTS_TARGET 128
> > @@ -3288,6 +3288,30 @@ static int mem_cgroup_swappiness_write(struct cgroup_subsys_state *css,
> > return 0;
> > }
> >
> > +static u64 mem_cgroup_lzfreeness_read(struct cgroup_subsys_state *css,
> > + struct cftype *cft)
> > +{
> > + struct mem_cgroup *memcg = mem_cgroup_from_css(css);
> > +
> > + return mem_cgroup_lzfreeness(memcg);
> > +}
> > +
> > +static int mem_cgroup_lzfreeness_write(struct cgroup_subsys_state *css,
> > + struct cftype *cft, u64 val)
> > +{
> > + struct mem_cgroup *memcg = mem_cgroup_from_css(css);
> > +
> > + if (val > 100)
> > + return -EINVAL;
> > +
> > + if (css->parent)
> > + memcg->lzfreeness = val;
> > + else
> > + vm_lazyfreeness = val;
> > +
> > + return 0;
> > +}
> > +
> > static void __mem_cgroup_threshold(struct mem_cgroup *memcg, bool swap)
> > {
> > struct mem_cgroup_threshold_ary *t;
> > @@ -4085,6 +4109,11 @@ static struct cftype mem_cgroup_legacy_files[] = {
> > .write_u64 = mem_cgroup_swappiness_write,
> > },
> > {
> > + .name = "lazyfreeness",
> > + .read_u64 = mem_cgroup_lzfreeness_read,
> > + .write_u64 = mem_cgroup_lzfreeness_write,
> > + },
> > + {
> > .name = "move_charge_at_immigrate",
> > .read_u64 = mem_cgroup_move_charge_read,
> > .write_u64 = mem_cgroup_move_charge_write,
> > @@ -4305,6 +4334,7 @@ mem_cgroup_css_online(struct cgroup_subsys_state *css)
> > memcg->use_hierarchy = parent->use_hierarchy;
> > memcg->oom_kill_disable = parent->oom_kill_disable;
> > memcg->swappiness = mem_cgroup_swappiness(parent);
> > + memcg->lzfreeness = mem_cgroup_lzfreeness(parent);
> >
> > if (parent->use_hierarchy) {
> > page_counter_init(&memcg->memory, &parent->memory);
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index cd65db9d3004..f1abc8a6ca31 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -141,6 +141,10 @@ struct scan_control {
> > */
> > int vm_swappiness = 60;
> > /*
> > + * From 0 .. 100. Higher means more lazy freeing.
> > + */
> > +int vm_lazyfreeness = 20;
> > +/*
> > * The total number of pages which are beyond the high watermark within all
> > * zones.
> > */
> > @@ -2012,10 +2016,11 @@ enum scan_balance {
> > *
> > * nr[0] = anon inactive pages to scan; nr[1] = anon active pages to scan
> > * nr[2] = file inactive pages to scan; nr[3] = file active pages to scan
> > + * nr[4] = lazy free pages to scan;
> > */
> > static void get_scan_count(struct lruvec *lruvec, int swappiness,
> > - struct scan_control *sc, unsigned long *nr,
> > - unsigned long *lru_pages)
> > + int lzfreeness, struct scan_control *sc,
> > + unsigned long *nr, unsigned long *lru_pages)
> > {
> > struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;
> > u64 fraction[2];
> > @@ -2023,12 +2028,13 @@ static void get_scan_count(struct lruvec *lruvec, int swappiness,
> > struct zone *zone = lruvec_zone(lruvec);
> > unsigned long anon_prio, file_prio;
> > enum scan_balance scan_balance;
> > - unsigned long anon, file;
> > + unsigned long anon, file, lzfree;
> > bool force_scan = false;
> > unsigned long ap, fp;
> > enum lru_list lru;
> > bool some_scanned;
> > int pass;
> > + unsigned long scan_lzfree = 0;
> >
> > /*
> > * If the zone or memcg is small, nr[l] can be 0. This
> > @@ -2166,7 +2172,7 @@ static void get_scan_count(struct lruvec *lruvec, int swappiness,
> > /* Only use force_scan on second pass. */
> > for (pass = 0; !some_scanned && pass < 2; pass++) {
> > *lru_pages = 0;
> > - for_each_evictable_lru(lru) {
> > + for_each_anon_file_lru(lru) {
> > int file = is_file_lru(lru);
> > unsigned long size;
> > unsigned long scan;
> > @@ -2212,6 +2218,28 @@ static void get_scan_count(struct lruvec *lruvec, int swappiness,
> > some_scanned |= !!scan;
> > }
> > }
> > +
> > + lzfree = get_lru_size(lruvec, LRU_LZFREE);
> > + if (lzfree) {
> > + scan_lzfree = sc->nr_to_reclaim *
> > + (DEF_PRIORITY - sc->priority);
>
> scan_lzfree == 0 if sc->priority == DEF_PRIORITY, is this intended?
> > + scan_lzfree = div64_u64(scan_lzfree *
> > + lzfreeness, 50);
> > + if (!scan_lzfree) {
> > + unsigned long zonefile, zonefree;
> > +
> > + zonefree = zone_page_state(zone, NR_FREE_PAGES);
> > + zonefile = zone_page_state(zone, NR_ACTIVE_FILE) +
> > + zone_page_state(zone, NR_INACTIVE_FILE);
> > + if (unlikely(zonefile + zonefree <=
> > + high_wmark_pages(zone))) {
> > + scan_lzfree = get_lru_size(lruvec,
> > + LRU_LZFREE) >> sc->priority;
> > + }
> > + }
> > + }
> > +
> > + nr[LRU_LZFREE] = min(scan_lzfree, lzfree);
> > }
>
> It looks like there is no setting to reclaim only lazyfree pages. Could we have
> an option for this? It's legitimate not to want to trash the page cache because
> of lazyfree memory.

Once we introduce the knob, it could be doable.
I will do it in the next spin.

Thanks for the review!

2015-11-13 06:37:31

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCH v3 01/17] mm: support madvise(MADV_FREE)

On Fri, Nov 13, 2015 at 01:16:54AM -0500, Daniel Micay wrote:
> On 13/11/15 01:15 AM, Minchan Kim wrote:
> > On Thu, Nov 12, 2015 at 12:21:30AM -0500, Daniel Micay wrote:
> >>> I also think that the kernel should commit to either zeroing the page
> >>> or leaving it unchanged in response to MADV_FREE (even if the decision
> >>> of which to do is made later on). I think that your patch series does
> >>> this, but only after a few of the patches are applied (the swap entry
> >>> freeing), and I think that it should be a real guaranteed part of the
> >>> semantics and maybe have a test case.
> >>
> >> This would be a good thing to test because it would be required to add
> >> MADV_FREE_UNDO down the road. It would mean the same semantics as the
> >> MEM_RESET and MEM_RESET_UNDO features on Windows, and there's probably
> >> value in that for the sake of migrating existing software too.
> >
> > So, do you mean that we could implement MADV_FREE_UNDO with a "read"
> > operation (just access-bit marking) easily in the future?
> >
> > If so, it would be a good reason to change MADV_FREE from the dirty bit to
> > the access bit. Okay, I will look at that.
>
> I just meant testing that the data is either zero or the old data if
> it's read before it's written to. Not having it stay around once there
> is a read. Not sure if that's what Andy meant.

Either zero or the old data is guaranteed.
Now:

MADV_FREE(range)
A = read from the range
...
...
B = read from the range


A and B could have different values, but each value will be either the old data or zero.

But Andy wants a stricter ABI, so he suggested the access bit instead of the dirty bit.

MADV_FREE(range)
A = read from the range
...
...
B = read from the range

A and B cannot have different values.

And now I am thinking if we use access bit, we could implement MADV_FREE_UNDO
easily when we need it. Maybe, that's what you want. Right?
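
A small userspace sketch of that sequence (MADV_FREE taken as 8 per this
series; which outcome you see depends on memory pressure and on which bit the
kernel checks):

#include <stdio.h>
#include <sys/mman.h>

#ifndef MADV_FREE
#define MADV_FREE 8
#endif

int main(void)
{
        size_t len = 4096;
        volatile unsigned char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        if ((void *)p == MAP_FAILED)
                return 1;

        p[0] = 0xaa;                            /* old data, pte dirty */
        madvise((void *)p, len, MADV_FREE);     /* pte now clean and old */

        {
                unsigned char a = p[0];         /* reads leave the pte clean */
                unsigned char b = p[0];

                /*
                 * Dirty-bit semantics (this series): the page stays
                 * discardable until the next store, so a and b may differ,
                 * but each is either 0xaa or 0.
                 * Access-bit semantics (Andy's suggestion): the first read
                 * would cancel the hint, so a and b would have to match.
                 */
                printf("a=%#x b=%#x\n", a, b);
        }

        p[0] = 0xbb;                            /* store: cancels the hint */
        return 0;
}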

2015-11-13 06:46:00

by Daniel Micay

[permalink] [raw]
Subject: Re: [PATCH v3 01/17] mm: support madvise(MADV_FREE)

> And now I am thinking if we use access bit, we could implement MADV_FREE_UNDO
> easily when we need it. Maybe, that's what you want. Right?

Yes, but why the access bit instead of the dirty bit for that? It could
always be made more strict (i.e. access bit) in the future, while going
the other way won't be possible. So I think the dirty bit is really the
more conservative choice since if it turns out to be a mistake it can be
fixed without a backwards incompatible change.



2015-11-13 07:03:40

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCH v3 01/17] mm: support madvise(MADV_FREE)

On Fri, Nov 13, 2015 at 01:45:52AM -0500, Daniel Micay wrote:
> > And now I am thinking if we use access bit, we could implement MADV_FREE_UNDO
> > easily when we need it. Maybe, that's what you want. Right?
>
> Yes, but why the access bit instead of the dirty bit for that? It could
> always be made more strict (i.e. access bit) in the future, while going
> the other way won't be possible. So I think the dirty bit is really the
> more conservative choice since if it turns out to be a mistake it can be
> fixed without a backwards incompatible change.

Absolutely true. That's why I have insisted on the dirty bit until now,
although I didn't explain the reason. But I thought you wanted to switch
to the access bit for the future, too. It seems MADV_FREE keeps bloating
over and over again before we know the real problems and use cases.
It's almost the same situation as volatile ranges, so I really want to
stop at a proper point, which the maintainer should decide, I hope.
Without that, we will make the feature a lot heavier just by brainstorming,
causing lots of churn in MM code without real benefit.
It would be very painful for us.

2015-11-13 08:13:24

by Daniel Micay

[permalink] [raw]
Subject: Re: [PATCH v3 01/17] mm: support madvise(MADV_FREE)

On 13/11/15 02:03 AM, Minchan Kim wrote:
> On Fri, Nov 13, 2015 at 01:45:52AM -0500, Daniel Micay wrote:
>>> And now I am thinking if we use access bit, we could implement MADV_FREE_UNDO
>>> easily when we need it. Maybe, that's what you want. Right?
>>
>> Yes, but why the access bit instead of the dirty bit for that? It could
>> always be made more strict (i.e. access bit) in the future, while going
>> the other way won't be possible. So I think the dirty bit is really the
>> more conservative choice since if it turns out to be a mistake it can be
>> fixed without a backwards incompatible change.
>
> Absolutely true. That's why I insist on dirty bit until now although
> I didn't tell the reason. But I thought you wanted to change for using
> access bit for the future, too. It seems MADV_FREE start to bloat
> over and over again before knowing real problems and usecases.
> It's almost same situation with volatile ranges so I really want to
> stop at proper point which maintainer should decide, I hope.
> Without it, we will make the feature a lot heavy by just brain storming
> and then causes lots of churn in MM code without real bebenfit
> It would be very painful for us.

Well, I don't think you need more than a good API and an implementation
with no known bugs, kernel security concerns or backwards compatibility
issues. Configuration and API extensions are something for later (i.e.
land a baseline, then submit stuff like sysctl tunables). Just my take
on it though...



2015-11-13 19:46:30

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH v3 01/17] mm: support madvise(MADV_FREE)

On Fri, Nov 13, 2015 at 12:13 AM, Daniel Micay <[email protected]> wrote:
> On 13/11/15 02:03 AM, Minchan Kim wrote:
>> On Fri, Nov 13, 2015 at 01:45:52AM -0500, Daniel Micay wrote:
>>>> And now I am thinking if we use access bit, we could implment MADV_FREE_UNDO
>>>> easily when we need it. Maybe, that's what you want. Right?
>>>
>>> Yes, but why the access bit instead of the dirty bit for that? It could
>>> always be made more strict (i.e. access bit) in the future, while going
>>> the other way won't be possible. So I think the dirty bit is really the
>>> more conservative choice since if it turns out to be a mistake it can be
>>> fixed without a backwards incompatible change.
>>
>> Absolutely true. That's why I insist on dirty bit until now although
>> I didn't tell the reason. But I thought you wanted to change for using
>> access bit for the future, too. It seems MADV_FREE start to bloat
>> over and over again before knowing real problems and usecases.
>> It's almost same situation with volatile ranges so I really want to
>> stop at proper point which maintainer should decide, I hope.
>> Without it, we will make the feature a lot heavy by just brain storming
>> and then causes lots of churn in MM code without real bebenfit
>> It would be very painful for us.
>
> Well, I don't think you need more than a good API and an implementation
> with no known bugs, kernel security concerns or backwards compatibility
> issues. Configuration and API extensions are something for later (i.e.
> land a baseline, then submit stuff like sysctl tunables). Just my take
> on it though...
>

As long as it's anonymous MAP_PRIVATE only, then the security aspects
should be okay. MADV_DONTNEED seems to work on pretty much any VMA,
and there's been a long history of interesting bugs there.

As for dirty vs accessed, an argument in favor of going straight to
accessed is that it means that users can write code like this without
worrying about whether they have a kernel that uses the dirty bit:

x = mmap(...);
*x = 1; /* mark it present */

/* i'm done with it */
*x = 1;
madvise(MADV_FREE, x, ...);

wait a while;

/* is it still there? */
if (*x == 1) {
        /* use whatever was cached there */
} else {
        /* reinitialize it */
        *x = 1;
}

With the dirty bit, this will look like it works, but on occasion
users will lose the race where they probe *x to see if the data was
lost and then the data gets lost before the next write comes in.

Sure, that load from *x could be changed to RMW or users could do a
dummy write (e.g. x[1] = 1; if (*x == 1) ...), but people might forget
to do that, and the caching implications are a little bit worse.
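
As a concrete illustration of that dummy-write idea (the helper name is made
up for this sketch, and it assumes x[0] and x[1] sit on the same page):

/*
 * Sketch of the dummy-write probe: the store to x[1] re-dirties the
 * page, so the kernel can no longer discard it between the probe and
 * the reuse.  This only works because x[0] and x[1] share a page.
 */
static int region_still_cached(volatile unsigned int *x)
{
	x[1] = 1;		/* dummy write: cancels the lazy free */
	return x[0] == 1;	/* safe to probe the sentinel now */
}

The volatile qualifier keeps the compiler from eliding the store, which is
exactly the failure mode described next for the RMW variant.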

Note that switching to RMW is really really dangerous. Doing:

*x &= 1;
if (*x == 1) ...;

is safe on x86 if the compiler generates:

andl $1, (%[x]);
cmpl $1, (%[x]);

but is unsafe if the compiler generates:

movl (%[x]), %eax;
andl $1, %eax;
movl %eax, (%[x]);
cmpl $1, %eax;

and even worse if the write is omitted when "provably" unnecessary.

OTOH, if switching to the accessed bit is too much of a mess, then
using the dirty bit at first isn't so bad.

--Andy

--
Andy Lutomirski
AMA Capital Management, LLC

2015-11-16 02:12:58

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCH v3 01/17] mm: support madvise(MADV_FREE)

On Fri, Nov 13, 2015 at 11:46:07AM -0800, Andy Lutomirski wrote:
> On Fri, Nov 13, 2015 at 12:13 AM, Daniel Micay <[email protected]> wrote:
> > On 13/11/15 02:03 AM, Minchan Kim wrote:
> >> On Fri, Nov 13, 2015 at 01:45:52AM -0500, Daniel Micay wrote:
> >>>> And now I am thinking if we use access bit, we could implment MADV_FREE_UNDO
> >>>> easily when we need it. Maybe, that's what you want. Right?
> >>>
> >>> Yes, but why the access bit instead of the dirty bit for that? It could
> >>> always be made more strict (i.e. access bit) in the future, while going
> >>> the other way won't be possible. So I think the dirty bit is really the
> >>> more conservative choice since if it turns out to be a mistake it can be
> >>> fixed without a backwards incompatible change.
> >>
> >> Absolutely true. That's why I insist on dirty bit until now although
> >> I didn't tell the reason. But I thought you wanted to change for using
> >> access bit for the future, too. It seems MADV_FREE start to bloat
> >> over and over again before knowing real problems and usecases.
> >> It's almost same situation with volatile ranges so I really want to
> >> stop at proper point which maintainer should decide, I hope.
> >> Without it, we will make the feature a lot heavy by just brain storming
> >> and then causes lots of churn in MM code without real bebenfit
> >> It would be very painful for us.
> >
> > Well, I don't think you need more than a good API and an implementation
> > with no known bugs, kernel security concerns or backwards compatibility
> > issues. Configuration and API extensions are something for later (i.e.
> > land a baseline, then submit stuff like sysctl tunables). Just my take
> > on it though...
> >
>
> As long as it's anonymous MAP_PRIVATE only, then the security aspects
> should be okay. MADV_DONTNEED seems to work on pretty much any VMA,
> and there's been long history of interesting bugs there.
>
> As for dirty vs accessed, an argument in favor of going straight to
> accessed is that it means that users can write code like this without
> worrying about whether they have a kernel that uses the dirty bit:
>
> x = mmap(...);
> *x = 1; /* mark it present */
>
> /* i'm done with it */
> *x = 1;
> madvise(MADV_FREE, x, ...);
>
> wait a while;
>
> /* is it still there? */
> if (*x == 1) {
> /* use whatever was cached there */
> } else {
> /* reinitialize it */
> *x = 1;
> }
>
> With the dirty bit, this will look like it works, but on occasion
> users will lose the race where they probe *x to see if the data was
> lost and then the data gets lost before the next write comes in.
>
> Sure, that load from *x could be changed to RMW or users could do a
> dummy write (e.g. x[1] = 1; if (*x == 1) ...), but people might forget
> to do that, and the caching implications are a little bit worse.

I think your example is a case where people abuse MADV_FREE.
What happens if the object (i.e., x) spans multiple pages?
The user would have to know the object's memory alignment and check all
of the pages which span the object. Hmm, I don't think that's good for an API.
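
To make the alignment problem concrete, a caller would have to do something
like the following sketch before hinting an arbitrary object (the helper name
and rounding strategy are made up for illustration, and a header defining
MADV_FREE is assumed); only the whole pages that lie entirely inside the
object can safely be hinted:

#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Sketch: hint only the whole pages inside [obj, obj + len), since the
 * partial pages at either end may still hold live neighbouring data.
 */
static void free_hint_object(void *obj, size_t len)
{
	uintptr_t page  = (uintptr_t)sysconf(_SC_PAGESIZE);
	uintptr_t start = ((uintptr_t)obj + page - 1) & ~(page - 1);
	uintptr_t end   = ((uintptr_t)obj + len) & ~(page - 1);

	if (end > start)
		madvise((void *)start, end - start, MADV_FREE);
}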

>
> Note that switching to RMW is really really dangerous. Doing:
>
> *x &= 1;
> if (*x == 1) ...;
>
> is safe on x86 if the compiler generates:
>
> andl $1, (%[x]);
> cmpl $1, (%[x]);
>
> but is unsafe if the compiler generates:
>
> movl (%[x]), %eax;
> andl $1, %eax;
> movl %eax, (%[x]);
> cmpl $1, %eax;
>
> and even worse if the write is omitted when "provably" unnecessary.
>
> OTOH, if switching to the accessed bit is too much of a mess, then
> using the dirty bit at first isn't so bad.

Thanks! I want to use the dirty bit first.

About the access bit, I don't want to call it a mess, but I guess it would
change a lot of subtle things for all architectures, because we have used
the access bit as just a *hint* for aging while the dirty bit is a really
*critical marker* for system integrity. As an example, on x86 we don't
keep the access bit accurate, to reduce TLB flush IPIs. I don't know
what techniques other arches have used, but they might have similar ones.

Thanks.


>
> --Andy
>
> --
> Andy Lutomirski
> AMA Capital Management, LLC

2015-11-16 03:15:07

by yalin wang

[permalink] [raw]
Subject: Re: [PATCH v3 01/17] mm: support madvise(MADV_FREE)


> On Nov 16, 2015, at 10:13, Minchan Kim <[email protected]> wrote:
>
> On Fri, Nov 13, 2015 at 11:46:07AM -0800, Andy Lutomirski wrote:
>> On Fri, Nov 13, 2015 at 12:13 AM, Daniel Micay <[email protected]> wrote:
>>> On 13/11/15 02:03 AM, Minchan Kim wrote:
>>>> On Fri, Nov 13, 2015 at 01:45:52AM -0500, Daniel Micay wrote:
>>>>>> And now I am thinking if we use access bit, we could implment MADV_FREE_UNDO
>>>>>> easily when we need it. Maybe, that's what you want. Right?
>>>>>
>>>>> Yes, but why the access bit instead of the dirty bit for that? It could
>>>>> always be made more strict (i.e. access bit) in the future, while going
>>>>> the other way won't be possible. So I think the dirty bit is really the
>>>>> more conservative choice since if it turns out to be a mistake it can be
>>>>> fixed without a backwards incompatible change.
>>>>
>>>> Absolutely true. That's why I insist on dirty bit until now although
>>>> I didn't tell the reason. But I thought you wanted to change for using
>>>> access bit for the future, too. It seems MADV_FREE start to bloat
>>>> over and over again before knowing real problems and usecases.
>>>> It's almost same situation with volatile ranges so I really want to
>>>> stop at proper point which maintainer should decide, I hope.
>>>> Without it, we will make the feature a lot heavy by just brain storming
>>>> and then causes lots of churn in MM code without real bebenfit
>>>> It would be very painful for us.
>>>
>>> Well, I don't think you need more than a good API and an implementation
>>> with no known bugs, kernel security concerns or backwards compatibility
>>> issues. Configuration and API extensions are something for later (i.e.
>>> land a baseline, then submit stuff like sysctl tunables). Just my take
>>> on it though...
>>>
>>
>> As long as it's anonymous MAP_PRIVATE only, then the security aspects
>> should be okay. MADV_DONTNEED seems to work on pretty much any VMA,
>> and there's been long history of interesting bugs there.
>>
>> As for dirty vs accessed, an argument in favor of going straight to
>> accessed is that it means that users can write code like this without
>> worrying about whether they have a kernel that uses the dirty bit:
>>
>> x = mmap(...);
>> *x = 1; /* mark it present */
>>
>> /* i'm done with it */
>> *x = 1;
>> madvise(MADV_FREE, x, ...);
>>
>> wait a while;
>>
>> /* is it still there? */
>> if (*x == 1) {
>> /* use whatever was cached there */
>> } else {
>> /* reinitialize it */
>> *x = 1;
>> }
>>
>> With the dirty bit, this will look like it works, but on occasion
>> users will lose the race where they probe *x to see if the data was
>> lost and then the data gets lost before the next write comes in.
>>
>> Sure, that load from *x could be changed to RMW or users could do a
>> dummy write (e.g. x[1] = 1; if (*x == 1) ...), but people might forget
>> to do that, and the caching implications are a little bit worse.
>
> I think your example is the case what people abuse MADV_FREE.
> What happens if the object(ie, x) spans multiple pages?
> User should know object's memory align and investigate all of pages
> which span the object. Hmm, I don't think it's good for API.
>
>>
>> Note that switching to RMW is really really dangerous. Doing:
>>
>> *x &= 1;
>> if (*x == 1) ...;
>>
>> is safe on x86 if the compiler generates:
>>
>> andl $1, (%[x]);
>> cmpl $1, (%[x]);
>>
>> but is unsafe if the compiler generates:
>>
>> movl (%[x]), %eax;
>> andl $1, %eax;
>> movl %eax, (%[x]);
>> cmpl $1, %eax;
>>
>> and even worse if the write is omitted when "provably" unnecessary.
>>
>> OTOH, if switching to the accessed bit is too much of a mess, then
>> using the dirty bit at first isn't so bad.
>
> Thanks! I want to use dirty bit first.
>
> About access bit, I don't want to say it to mess but I guess it would
> change a lot subtle thing for all architectures. Because we have used
> access bit as just *hint* for aging while dirty bit is really
> *critical marker* for system integrity. A example in x86, we don't
> keep accuracy of access bit for reducing TLB flush IPI. I don't know
> what technique other arches have used but they might have.
>
> Thanks.
>
I think using the access bit is not easy to implement for anon pages in the kernel.
We are sure an anon page is always PageDirty() if it is !PageSwapCache(),
unless it is a MADV_FREE page.
But with the access bit, how do we distinguish a normal anon page from a MADV_FREE page?
It could be implemented with the access bit, but not easily; it needs more code change.

Thanks
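
The distinction being described can be written down roughly as follows; this
is only a sketch of the dirty-bit scheme in kernel-style C, simplified (the
series also rechecks pte dirty bits when unmapping) and not code from the
patch set:

/*
 * Sketch only: with dirty-bit semantics reclaim can tell a lazily
 * freeable anon page apart from a normal one without extra state.
 * A normal anon page that is not yet in the swap cache is always
 * dirty, so a clean one must have been hinted with MADV_FREE and
 * may simply be dropped instead of swapped out.
 */
static bool can_discard_anon(struct page *page)
{
	return PageAnon(page) && !PageSwapCache(page) && !PageDirty(page);
}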