2015-11-20 08:02:45

by Minchan Kim

Subject: [PATCH v4 00/16] MADV_FREE support

I have spent a lot of time trying to land the MADV_FREE feature
at the request of userland people (especially Daniel and Jason, the
jemalloc guys). Thanks for pushing me! ;-)

Reviewers raised a couple of issues.

1) Swap dependency

In older versions, MADV_FREE fell back to MADV_DONTNEED on swapless
systems because we don't have an aged anonymous LRU list there.
So there were requests to support MADV_FREE on swapless systems, too.

To address this, this series adds a new LRU list for lazyfree
pages. With that, we can support MADV_FREE on swapless systems.

2) Hotness

Some people think MADV_FREEed pages are really cold while others do not.
That judgment affects the reclaim policy.
To address this, this series adds a new knob, "lazyfreeness", similar to
swappiness. See the description of "mm: add knob to tune lazyfreeing"
for the details.
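
As a quick illustration (not part of the series), tuning such a knob from
user space would look roughly like the snippet below. The exact sysctl path
and the value semantics are assumptions here; see the description of
"mm: add knob to tune lazyfreeing" and the Documentation/sysctl/vm.txt
change for the authoritative details.

#include <fcntl.h>
#include <unistd.h>

int main(void)
{
	/* Assumed path; the knob itself is added by this series. */
	int fd = open("/proc/sys/vm/lazyfreeness", O_WRONLY);

	if (fd < 0)
		return 1;
	/* Value range assumed to mirror swappiness. */
	write(fd, "60", 2);
	close(fd);
	return 0;
}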

I have tested it on v4.3-rc7 and haven't found any problems so far.

git: git://git.kernel.org/pub/scm/linux/kernel/git/minchan/linux.git
branch: mm/madv_free-v4.3-rc7-v4rc2

At this stage, I don't think we need to write a man page.
That can be done once the policy and implementation are solid.

* Change from v3
* some bug fixes
* code refactoring
* lazyfree reclaim logic change
* patch reordering

* Change from v2
* vm_lazyfreeness tuning knob
* add new LRU list - Johannes, Shaohua
* support swapless - Johannes

* Change from v1
* Don't do unnecessary TLB flush - Shaohua
* Added Acked-by - Hugh, Michal
* Merge deactivate_page and deactivate_file_page
* Add pmd_dirty/pmd_mkclean patches for several arches
* Add lazy THP split patch
* Drop [email protected] - Delivery Failure

Chen Gang (1):
arch: uapi: asm: mman.h: Let MADV_FREE have same value for all
architectures

Minchan Kim (15):
mm: support madvise(MADV_FREE)
mm: define MADV_FREE for some arches
mm: free swp_entry in madvise_free
mm: move lazily freed pages to inactive list
mm: mark stable page dirty in KSM
x86: add pmd_[dirty|mkclean] for THP
sparc: add pmd_[dirty|mkclean] for THP
powerpc: add pmd_[dirty|mkclean] for THP
arm: add pmd_mkclean for THP
arm64: add pmd_mkclean for THP
mm: don't split THP page when syscall is called
mm: introduce wrappers to add new LRU
mm: introduce lazyfree LRU list
mm: support MADV_FREE on swapless system
mm: add knob to tune lazyfreeing

Documentation/sysctl/vm.txt | 13 ++
arch/alpha/include/uapi/asm/mman.h | 1 +
arch/arm/include/asm/pgtable-3level.h | 1 +
arch/arm64/include/asm/pgtable.h | 1 +
arch/mips/include/uapi/asm/mman.h | 1 +
arch/parisc/include/uapi/asm/mman.h | 1 +
arch/powerpc/include/asm/pgtable-ppc64.h | 2 +
arch/sparc/include/asm/pgtable_64.h | 9 ++
arch/x86/include/asm/pgtable.h | 5 +
arch/xtensa/include/uapi/asm/mman.h | 1 +
drivers/base/node.c | 2 +
drivers/staging/android/lowmemorykiller.c | 3 +-
fs/proc/meminfo.c | 2 +
include/linux/huge_mm.h | 6 +
include/linux/memcontrol.h | 1 +
include/linux/mm_inline.h | 85 ++++++++++-
include/linux/mmzone.h | 26 +++-
include/linux/page-flags.h | 5 +
include/linux/rmap.h | 1 +
include/linux/swap.h | 19 ++-
include/linux/vm_event_item.h | 3 +-
include/trace/events/vmscan.h | 38 ++---
include/uapi/asm-generic/mman-common.h | 1 +
kernel/sysctl.c | 9 ++
mm/compaction.c | 14 +-
mm/huge_memory.c | 60 +++++++-
mm/ksm.c | 6 +
mm/madvise.c | 181 ++++++++++++++++++++++
mm/memcontrol.c | 44 +++++-
mm/memory-failure.c | 7 +-
mm/memory_hotplug.c | 3 +-
mm/mempolicy.c | 3 +-
mm/migrate.c | 28 ++--
mm/page_alloc.c | 7 +
mm/rmap.c | 14 ++
mm/swap.c | 84 +++++++++--
mm/swap_state.c | 11 +-
mm/vmscan.c | 240 ++++++++++++++++++++----------
mm/vmstat.c | 4 +
39 files changed, 767 insertions(+), 175 deletions(-)

--
1.9.1


2015-11-20 08:08:13

by Minchan Kim

Subject: [PATCH v4 01/16] mm: support madvise(MADV_FREE)

Linux doesn't have an ability to free pages lazily, while other OSes
have long supported that via madvise(MADV_FREE).

The gain is clear: the kernel can discard freed pages rather than swapping
them out or OOMing when memory pressure happens.

Without memory pressure, freed pages can be reused by userspace without
any additional overhead (e.g. page fault + allocation + zeroing).
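
To illustrate the intended usage from user space (illustration only, not
part of this patch; it assumes <sys/mman.h> from a tree carrying this
series, so that MADV_FREE is defined):

#include <sys/mman.h>
#include <string.h>

#define LEN	(64UL << 20)	/* 64M of anonymous memory */

int main(void)
{
	char *buf = mmap(NULL, LEN, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (buf == MAP_FAILED)
		return 1;

	memset(buf, 1, LEN);		/* dirty the pages */

	/*
	 * Hint that the contents are disposable. Under memory pressure
	 * the kernel may discard the clean pages instead of swapping
	 * them out; without pressure they stay mapped and can be reused
	 * with no extra fault + allocation + zeroing.
	 */
	madvise(buf, LEN, MADV_FREE);

	/* A later store re-dirties the page, so it won't be discarded. */
	buf[0] = 1;
	return 0;
}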

Jason Evans said:

: Facebook has been using MAP_UNINITIALIZED
: (https://lkml.org/lkml/2012/1/18/308) in some of its applications for
: several years, but there are operational costs to maintaining this
: out-of-tree in our kernel and in jemalloc, and we are anxious to retire it
: in favor of MADV_FREE. When we first enabled MAP_UNINITIALIZED it
: increased throughput for much of our workload by ~5%, and although the
: benefit has decreased using newer hardware and kernels, there is still
: enough benefit that we cannot reasonably retire it without a replacement.
:
: Aside from Facebook operations, there are numerous broadly used
: applications that would benefit from MADV_FREE. The ones that immediately
: come to mind are redis, varnish, and MariaDB. I don't have much insight
: into Android internals and development process, but I would hope to see
: MADV_FREE support eventually end up there as well to benefit applications
: linked with the integrated jemalloc.
:
: jemalloc will use MADV_FREE once it becomes available in the Linux kernel.
: In fact, jemalloc already uses MADV_FREE or equivalent everywhere it's
: available: *BSD, OS X, Windows, and Solaris -- every platform except Linux
: (and AIX, but I'm not sure it even compiles on AIX). The lack of
: MADV_FREE on Linux forced me down a long series of increasingly
: sophisticated heuristics for madvise() volume reduction, and even so this
: remains a common performance issue for people using jemalloc on Linux.
: Please integrate MADV_FREE; many people will benefit substantially.

How it works:

When the madvise syscall is called, the VM clears the dirty bit in the ptes
of the range. When memory pressure happens, the VM checks the dirty bit in
the page table; if it is still clean, the page is a "lazyfree" page, so the
VM can discard it instead of swapping it out. If there was a store to the
page before the VM picked it for reclaim, the dirty bit is set again, so
the VM swaps the page out instead of discarding it.

One thing to note is that MADV_FREE basically relies on the dirty bit in
the page table entry to decide whether the VM is allowed to discard the
page. IOW, if the page table entry has the dirty bit set, the VM must not
discard the page.

However, if, for example, a swap-in happens via a read fault, the page
table entry doesn't have the dirty bit set, so MADV_FREE could wrongly
discard the page.

To avoid that problem, MADV_FREE does additional checks on PageDirty
and PageSwapCache. That works because a swapped-in page lives in the
swap cache, and once it is evicted from the swap cache, the page carries
the PG_dirty flag. So both page flag checks effectively prevent wrong
discarding by MADV_FREE.

However, a problem with the above logic is that a swapped-in page still
has PG_dirty after it is removed from the swap cache, so the VM cannot
consider the page freeable any more, even if madvise_free is called on it
later.

The example below shows the problem in detail.

ptr = malloc();
memset(ptr);
..
..
.. heavy memory pressure so all of the pages are swapped out
..
..
var = *ptr;        -> the page is swapped in and may be removed from the
                      swap cache. The page table entry is not marked
                      dirty, but the page descriptor has PG_dirty.
..
..
madvise_free(ptr); -> does not clear PG_dirty of the page
..
..
..
.. heavy memory pressure again
.. this time the VM cannot discard the page because it has *PG_dirty*

To solve the problem, this patch clears PG_dirty when madvise is called,
but only if the page is owned exclusively by the current process, because
PG_dirty represents the dirtiness of ptes across several processes, so we
can clear it only when we own the page exclusively.

The heaviest users will be general-purpose allocators (e.g. jemalloc,
tcmalloc, and hopefully glibc will support it too), and jemalloc/tcmalloc
already support the feature on other OSes (e.g. FreeBSD).

barrios@blaptop:~/benchmark/ebizzy$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 12
On-line CPU(s) list: 0-11
Thread(s) per core: 1
Core(s) per socket: 1
Socket(s): 12
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 2
Stepping: 3
CPU MHz: 3200.185
BogoMIPS: 6400.53
Virtualization: VT-x
Hypervisor vendor: KVM
Virtualization type: full
L1d cache: 32K
L1i cache: 32K
L2 cache: 4096K
NUMA node0 CPU(s): 0-11
ebizzy benchmark (./ebizzy -S 10 -n 512)

Higher avg is better.

                 vanilla-jemalloc      MADV_free-jemalloc

 1 thread
 records:        10                    10
 avg:            2961.90               12069.70
 std:            71.96(2.43%)          186.68(1.55%)
 max:            3070.00               12385.00
 min:            2796.00               11746.00

 2 thread
 records:        10                    10
 avg:            5020.00               17827.00
 std:            264.87(5.28%)         358.52(2.01%)
 max:            5244.00               18760.00
 min:            4251.00               17382.00

 4 thread
 records:        10                    10
 avg:            8988.80               27930.80
 std:            1175.33(13.08%)       3317.33(11.88%)
 max:            9508.00               30879.00
 min:            5477.00               21024.00

 8 thread
 records:        10                    10
 avg:            13036.50              33739.40
 std:            170.67(1.31%)         5146.22(15.25%)
 max:            13371.00              40572.00
 min:            12785.00              24088.00

16 thread
 records:        10                    10
 avg:            11092.40              31424.20
 std:            710.60(6.41%)         3763.89(11.98%)
 max:            12446.00              36635.00
 min:            9949.00               25669.00

32 thread
 records:        10                    10
 avg:            11067.00              34495.80
 std:            971.06(8.77%)         2721.36(7.89%)
 max:            12010.00              38598.00
 min:            9002.00               30636.00

In summary, MADV_FREE is much faster than MADV_DONTNEED.

Acked-by: Michal Hocko <[email protected]>
Acked-by: Hugh Dickins <[email protected]>
Signed-off-by: Minchan Kim <[email protected]>
---
include/linux/rmap.h | 1 +
include/linux/vm_event_item.h | 1 +
include/uapi/asm-generic/mman-common.h | 1 +
mm/madvise.c | 142 +++++++++++++++++++++++++++++++++
mm/rmap.c | 7 ++
mm/swap_state.c | 5 +-
mm/vmscan.c | 10 ++-
mm/vmstat.c | 1 +
8 files changed, 163 insertions(+), 5 deletions(-)

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index 29446aeef36e..f4c992826242 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -85,6 +85,7 @@ enum ttu_flags {
TTU_UNMAP = 1, /* unmap mode */
TTU_MIGRATION = 2, /* migration mode */
TTU_MUNLOCK = 4, /* munlock mode */
+ TTU_FREE = 8, /* free mode */

TTU_IGNORE_MLOCK = (1 << 8), /* ignore mlock */
TTU_IGNORE_ACCESS = (1 << 9), /* don't age */
diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index 9246d32dc973..2b1cef88b827 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -25,6 +25,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
FOR_ALL_ZONES(PGALLOC),
PGFREE, PGACTIVATE, PGDEACTIVATE,
PGFAULT, PGMAJFAULT,
+ PGLAZYFREED,
FOR_ALL_ZONES(PGREFILL),
FOR_ALL_ZONES(PGSTEAL_KSWAPD),
FOR_ALL_ZONES(PGSTEAL_DIRECT),
diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
index ddc3b36f1046..7a94102b7a02 100644
--- a/include/uapi/asm-generic/mman-common.h
+++ b/include/uapi/asm-generic/mman-common.h
@@ -34,6 +34,7 @@
#define MADV_SEQUENTIAL 2 /* expect sequential page references */
#define MADV_WILLNEED 3 /* will need these pages */
#define MADV_DONTNEED 4 /* don't need these pages */
+#define MADV_FREE 5 /* free pages only if memory pressure */

/* common parameters: try to keep these consistent across architectures */
#define MADV_REMOVE 9 /* remove these pages & resources */
diff --git a/mm/madvise.c b/mm/madvise.c
index c889fcbb530e..7b5c6f648fdb 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -20,6 +20,9 @@
#include <linux/backing-dev.h>
#include <linux/swap.h>
#include <linux/swapops.h>
+#include <linux/mmu_notifier.h>
+
+#include <asm/tlb.h>

/*
* Any behaviour which results in changes to the vma->vm_flags needs to
@@ -32,6 +35,7 @@ static int madvise_need_mmap_write(int behavior)
case MADV_REMOVE:
case MADV_WILLNEED:
case MADV_DONTNEED:
+ case MADV_FREE:
return 0;
default:
/* be safe, default to 1. list exceptions explicitly */
@@ -256,6 +260,135 @@ static long madvise_willneed(struct vm_area_struct *vma,
return 0;
}

+static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
+ unsigned long end, struct mm_walk *walk)
+
+{
+ struct mmu_gather *tlb = walk->private;
+ struct mm_struct *mm = tlb->mm;
+ struct vm_area_struct *vma = walk->vma;
+ spinlock_t *ptl;
+ pte_t *pte, ptent;
+ struct page *page;
+
+ split_huge_page_pmd(vma, addr, pmd);
+ if (pmd_trans_unstable(pmd))
+ return 0;
+
+ pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
+ arch_enter_lazy_mmu_mode();
+ for (; addr != end; pte++, addr += PAGE_SIZE) {
+ ptent = *pte;
+
+ if (!pte_present(ptent))
+ continue;
+
+ page = vm_normal_page(vma, addr, ptent);
+ if (!page)
+ continue;
+
+ VM_BUG_ON_PAGE(PageTransCompound(page), page);
+
+ if (PageSwapCache(page) || PageDirty(page)) {
+ if (!trylock_page(page))
+ continue;
+ /*
+ * If page is shared with others, we couldn't clear
+ * PG_dirty of the page.
+ */
+ if (page_mapcount(page) != 1) {
+ unlock_page(page);
+ continue;
+ }
+
+ if (PageSwapCache(page) && !try_to_free_swap(page)) {
+ unlock_page(page);
+ continue;
+ }
+
+ ClearPageDirty(page);
+ unlock_page(page);
+ }
+
+ if (pte_young(ptent) || pte_dirty(ptent)) {
+ /*
+ * Some of architecture(ex, PPC) don't update TLB
+ * with set_pte_at and tlb_remove_tlb_entry so for
+ * the portability, remap the pte with old|clean
+ * after pte clearing.
+ */
+ ptent = ptep_get_and_clear_full(mm, addr, pte,
+ tlb->fullmm);
+
+ ptent = pte_mkold(ptent);
+ ptent = pte_mkclean(ptent);
+ set_pte_at(mm, addr, pte, ptent);
+ tlb_remove_tlb_entry(tlb, pte, addr);
+ }
+ }
+
+ arch_leave_lazy_mmu_mode();
+ pte_unmap_unlock(pte - 1, ptl);
+ cond_resched();
+ return 0;
+}
+
+static void madvise_free_page_range(struct mmu_gather *tlb,
+ struct vm_area_struct *vma,
+ unsigned long addr, unsigned long end)
+{
+ struct mm_walk free_walk = {
+ .pmd_entry = madvise_free_pte_range,
+ .mm = vma->vm_mm,
+ .private = tlb,
+ };
+
+ tlb_start_vma(tlb, vma);
+ walk_page_range(addr, end, &free_walk);
+ tlb_end_vma(tlb, vma);
+}
+
+static int madvise_free_single_vma(struct vm_area_struct *vma,
+ unsigned long start_addr, unsigned long end_addr)
+{
+ unsigned long start, end;
+ struct mm_struct *mm = vma->vm_mm;
+ struct mmu_gather tlb;
+
+ if (vma->vm_flags & (VM_LOCKED|VM_HUGETLB|VM_PFNMAP))
+ return -EINVAL;
+
+ /* MADV_FREE works for only anon vma at the moment */
+ if (!vma_is_anonymous(vma))
+ return -EINVAL;
+
+ start = max(vma->vm_start, start_addr);
+ if (start >= vma->vm_end)
+ return -EINVAL;
+ end = min(vma->vm_end, end_addr);
+ if (end <= vma->vm_start)
+ return -EINVAL;
+
+ lru_add_drain();
+ tlb_gather_mmu(&tlb, mm, start, end);
+ update_hiwater_rss(mm);
+
+ mmu_notifier_invalidate_range_start(mm, start, end);
+ madvise_free_page_range(&tlb, vma, start, end);
+ mmu_notifier_invalidate_range_end(mm, start, end);
+ tlb_finish_mmu(&tlb, start, end);
+
+ return 0;
+}
+
+static long madvise_free(struct vm_area_struct *vma,
+ struct vm_area_struct **prev,
+ unsigned long start, unsigned long end)
+{
+ *prev = vma;
+ return madvise_free_single_vma(vma, start, end);
+}
+
/*
* Application no longer needs these pages. If the pages are dirty,
* it's OK to just throw them away. The app will be more careful about
@@ -379,6 +512,14 @@ madvise_vma(struct vm_area_struct *vma, struct vm_area_struct **prev,
return madvise_remove(vma, prev, start, end);
case MADV_WILLNEED:
return madvise_willneed(vma, prev, start, end);
+ case MADV_FREE:
+ /*
+ * XXX: In this implementation, MADV_FREE works like
+ * MADV_DONTNEED on swapless system or full swap.
+ */
+ if (get_nr_swap_pages() > 0)
+ return madvise_free(vma, prev, start, end);
+ /* passthrough */
case MADV_DONTNEED:
return madvise_dontneed(vma, prev, start, end);
default:
@@ -398,6 +539,7 @@ madvise_behavior_valid(int behavior)
case MADV_REMOVE:
case MADV_WILLNEED:
case MADV_DONTNEED:
+ case MADV_FREE:
#ifdef CONFIG_KSM
case MADV_MERGEABLE:
case MADV_UNMERGEABLE:
diff --git a/mm/rmap.c b/mm/rmap.c
index f5b5c1f3dcd7..9449e91839ab 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1374,6 +1374,12 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
swp_entry_t entry = { .val = page_private(page) };
pte_t swp_pte;

+ if (!PageDirty(page) && (flags & TTU_FREE)) {
+ /* It's a freeable page by MADV_FREE */
+ dec_mm_counter(mm, MM_ANONPAGES);
+ goto discard;
+ }
+
if (PageSwapCache(page)) {
/*
* Store the swap location in the pte.
@@ -1414,6 +1420,7 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
} else
dec_mm_counter(mm, MM_FILEPAGES);

+discard:
page_remove_rmap(page);
page_cache_release(page);

diff --git a/mm/swap_state.c b/mm/swap_state.c
index d504adb7fa5f..10f63eded7b7 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -185,13 +185,12 @@ int add_to_swap(struct page *page, struct list_head *list)
* deadlock in the swap out path.
*/
/*
- * Add it to the swap cache and mark it dirty
+ * Add it to the swap cache.
*/
err = add_to_swap_cache(page, entry,
__GFP_HIGH|__GFP_NOMEMALLOC|__GFP_NOWARN);

- if (!err) { /* Success */
- SetPageDirty(page);
+ if (!err) {
return 1;
} else { /* -ENOMEM radix-tree allocation failure */
/*
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 7f63a9381f71..7a415b9fdd34 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -906,6 +906,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
int may_enter_fs;
enum page_references references = PAGEREF_RECLAIM_CLEAN;
bool dirty, writeback;
+ bool freeable = false;

cond_resched();

@@ -1049,6 +1050,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
goto keep_locked;
if (!add_to_swap(page, page_list))
goto activate_locked;
+ freeable = true;
may_enter_fs = 1;

/* Adding to swap updated mapping */
@@ -1060,8 +1062,9 @@ static unsigned long shrink_page_list(struct list_head *page_list,
* processes. Try to unmap it here.
*/
if (page_mapped(page) && mapping) {
- switch (try_to_unmap(page,
- ttu_flags|TTU_BATCH_FLUSH)) {
+ switch (try_to_unmap(page, freeable ?
+ (ttu_flags | TTU_BATCH_FLUSH | TTU_FREE) :
+ (ttu_flags | TTU_BATCH_FLUSH))) {
case SWAP_FAIL:
goto activate_locked;
case SWAP_AGAIN:
@@ -1186,6 +1189,9 @@ static unsigned long shrink_page_list(struct list_head *page_list,
*/
__clear_page_locked(page);
free_it:
+ if (freeable && !PageDirty(page))
+ count_vm_event(PGLAZYFREED);
+
nr_reclaimed++;

/*
diff --git a/mm/vmstat.c b/mm/vmstat.c
index fbf14485a049..59d45b22355f 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -759,6 +759,7 @@ const char * const vmstat_text[] = {

"pgfault",
"pgmajfault",
+ "pglazyfreed",

TEXTS_FOR_ZONES("pgrefill")
TEXTS_FOR_ZONES("pgsteal_kswapd")
--
1.9.1

2015-11-20 08:02:49

by Minchan Kim

Subject: [PATCH v4 02/16] mm: define MADV_FREE for some arches

Most architectures use asm-generic, but alpha, mips, parisc, xtensa
need their own definitions.

This patch defines MADV_FREE for them, so it should fix the build breakage
on those architectures.

Maybe I should split this up and feed the pieces to the arch maintainers,
but it is included here for mmotm convenience.

Cc: Michael Kerrisk <[email protected]>
Cc: Richard Henderson <[email protected]>
Cc: Ivan Kokshaysky <[email protected]>
Cc: "James E.J. Bottomley" <[email protected]>
Cc: Helge Deller <[email protected]>
Cc: Ralf Baechle <[email protected]>
Cc: Chris Zankel <[email protected]>
Acked-by: Max Filippov <[email protected]>
Reported-by: kbuild test robot <[email protected]>
Signed-off-by: Minchan Kim <[email protected]>
---
arch/alpha/include/uapi/asm/mman.h | 1 +
arch/mips/include/uapi/asm/mman.h | 1 +
arch/parisc/include/uapi/asm/mman.h | 1 +
arch/xtensa/include/uapi/asm/mman.h | 1 +
4 files changed, 4 insertions(+)

diff --git a/arch/alpha/include/uapi/asm/mman.h b/arch/alpha/include/uapi/asm/mman.h
index 0086b472bc2b..836fbd44f65b 100644
--- a/arch/alpha/include/uapi/asm/mman.h
+++ b/arch/alpha/include/uapi/asm/mman.h
@@ -44,6 +44,7 @@
#define MADV_WILLNEED 3 /* will need these pages */
#define MADV_SPACEAVAIL 5 /* ensure resources are available */
#define MADV_DONTNEED 6 /* don't need these pages */
+#define MADV_FREE 7 /* free pages only if memory pressure */

/* common/generic parameters */
#define MADV_REMOVE 9 /* remove these pages & resources */
diff --git a/arch/mips/include/uapi/asm/mman.h b/arch/mips/include/uapi/asm/mman.h
index cfcb876cae6b..106e741aa7ee 100644
--- a/arch/mips/include/uapi/asm/mman.h
+++ b/arch/mips/include/uapi/asm/mman.h
@@ -67,6 +67,7 @@
#define MADV_SEQUENTIAL 2 /* expect sequential page references */
#define MADV_WILLNEED 3 /* will need these pages */
#define MADV_DONTNEED 4 /* don't need these pages */
+#define MADV_FREE 5 /* free pages only if memory pressure */

/* common parameters: try to keep these consistent across architectures */
#define MADV_REMOVE 9 /* remove these pages & resources */
diff --git a/arch/parisc/include/uapi/asm/mman.h b/arch/parisc/include/uapi/asm/mman.h
index 294d251ca7b2..6cb8db76fd4e 100644
--- a/arch/parisc/include/uapi/asm/mman.h
+++ b/arch/parisc/include/uapi/asm/mman.h
@@ -40,6 +40,7 @@
#define MADV_SPACEAVAIL 5 /* insure that resources are reserved */
#define MADV_VPS_PURGE 6 /* Purge pages from VM page cache */
#define MADV_VPS_INHERIT 7 /* Inherit parents page size */
+#define MADV_FREE 8 /* free pages only if memory pressure */

/* common/generic parameters */
#define MADV_REMOVE 9 /* remove these pages & resources */
diff --git a/arch/xtensa/include/uapi/asm/mman.h b/arch/xtensa/include/uapi/asm/mman.h
index 201aec0e0446..1b19f25bc567 100644
--- a/arch/xtensa/include/uapi/asm/mman.h
+++ b/arch/xtensa/include/uapi/asm/mman.h
@@ -80,6 +80,7 @@
#define MADV_SEQUENTIAL 2 /* expect sequential page references */
#define MADV_WILLNEED 3 /* will need these pages */
#define MADV_DONTNEED 4 /* don't need these pages */
+#define MADV_FREE 5 /* free pages only if memory pressure */

/* common parameters: try to keep these consistent across architectures */
#define MADV_REMOVE 9 /* remove these pages & resources */
--
1.9.1

2015-11-20 08:02:51

by Minchan Kim

Subject: [PATCH v4 03/16] arch: uapi: asm: mman.h: Let MADV_FREE have same value for all architectures

From: Chen Gang <[email protected]>

For uapi, we should try to let all macros have the same value across
architectures. MADV_FREE was added to the main branch recently, so
redefine MADV_FREE accordingly.

At present, '8' can be shared by all architectures, so redefine it
to '8'.

Cc: [email protected] <[email protected]>,
Cc: [email protected] <[email protected]>
Cc: [email protected] <[email protected]>
Cc: Ralf Baechle <[email protected]>
Cc: [email protected] <[email protected]>
Cc: [email protected] <[email protected]>
Cc: [email protected] <[email protected]>
Cc: [email protected] <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Acked-by: Hugh Dickins <[email protected]>
Acked-by: Minchan Kim <[email protected]>
Signed-off-by: Chen Gang <[email protected]>
---
arch/alpha/include/uapi/asm/mman.h | 2 +-
arch/mips/include/uapi/asm/mman.h | 2 +-
arch/parisc/include/uapi/asm/mman.h | 2 +-
arch/xtensa/include/uapi/asm/mman.h | 2 +-
include/uapi/asm-generic/mman-common.h | 2 +-
5 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/arch/alpha/include/uapi/asm/mman.h b/arch/alpha/include/uapi/asm/mman.h
index 836fbd44f65b..0b8a5de7aee3 100644
--- a/arch/alpha/include/uapi/asm/mman.h
+++ b/arch/alpha/include/uapi/asm/mman.h
@@ -44,9 +44,9 @@
#define MADV_WILLNEED 3 /* will need these pages */
#define MADV_SPACEAVAIL 5 /* ensure resources are available */
#define MADV_DONTNEED 6 /* don't need these pages */
-#define MADV_FREE 7 /* free pages only if memory pressure */

/* common/generic parameters */
+#define MADV_FREE 8 /* free pages only if memory pressure */
#define MADV_REMOVE 9 /* remove these pages & resources */
#define MADV_DONTFORK 10 /* don't inherit across fork */
#define MADV_DOFORK 11 /* do inherit across fork */
diff --git a/arch/mips/include/uapi/asm/mman.h b/arch/mips/include/uapi/asm/mman.h
index 106e741aa7ee..d247f5457944 100644
--- a/arch/mips/include/uapi/asm/mman.h
+++ b/arch/mips/include/uapi/asm/mman.h
@@ -67,9 +67,9 @@
#define MADV_SEQUENTIAL 2 /* expect sequential page references */
#define MADV_WILLNEED 3 /* will need these pages */
#define MADV_DONTNEED 4 /* don't need these pages */
-#define MADV_FREE 5 /* free pages only if memory pressure */

/* common parameters: try to keep these consistent across architectures */
+#define MADV_FREE 8 /* free pages only if memory pressure */
#define MADV_REMOVE 9 /* remove these pages & resources */
#define MADV_DONTFORK 10 /* don't inherit across fork */
#define MADV_DOFORK 11 /* do inherit across fork */
diff --git a/arch/parisc/include/uapi/asm/mman.h b/arch/parisc/include/uapi/asm/mman.h
index 6cb8db76fd4e..700d83fd9352 100644
--- a/arch/parisc/include/uapi/asm/mman.h
+++ b/arch/parisc/include/uapi/asm/mman.h
@@ -40,9 +40,9 @@
#define MADV_SPACEAVAIL 5 /* insure that resources are reserved */
#define MADV_VPS_PURGE 6 /* Purge pages from VM page cache */
#define MADV_VPS_INHERIT 7 /* Inherit parents page size */
-#define MADV_FREE 8 /* free pages only if memory pressure */

/* common/generic parameters */
+#define MADV_FREE 8 /* free pages only if memory pressure */
#define MADV_REMOVE 9 /* remove these pages & resources */
#define MADV_DONTFORK 10 /* don't inherit across fork */
#define MADV_DOFORK 11 /* do inherit across fork */
diff --git a/arch/xtensa/include/uapi/asm/mman.h b/arch/xtensa/include/uapi/asm/mman.h
index 1b19f25bc567..77eaca434071 100644
--- a/arch/xtensa/include/uapi/asm/mman.h
+++ b/arch/xtensa/include/uapi/asm/mman.h
@@ -80,9 +80,9 @@
#define MADV_SEQUENTIAL 2 /* expect sequential page references */
#define MADV_WILLNEED 3 /* will need these pages */
#define MADV_DONTNEED 4 /* don't need these pages */
-#define MADV_FREE 5 /* free pages only if memory pressure */

/* common parameters: try to keep these consistent across architectures */
+#define MADV_FREE 8 /* free pages only if memory pressure */
#define MADV_REMOVE 9 /* remove these pages & resources */
#define MADV_DONTFORK 10 /* don't inherit across fork */
#define MADV_DOFORK 11 /* do inherit across fork */
diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
index 7a94102b7a02..869595947873 100644
--- a/include/uapi/asm-generic/mman-common.h
+++ b/include/uapi/asm-generic/mman-common.h
@@ -34,9 +34,9 @@
#define MADV_SEQUENTIAL 2 /* expect sequential page references */
#define MADV_WILLNEED 3 /* will need these pages */
#define MADV_DONTNEED 4 /* don't need these pages */
-#define MADV_FREE 5 /* free pages only if memory pressure */

/* common parameters: try to keep these consistent across architectures */
+#define MADV_FREE 8 /* free pages only if memory pressure */
#define MADV_REMOVE 9 /* remove these pages & resources */
#define MADV_DONTFORK 10 /* don't inherit across fork */
#define MADV_DOFORK 11 /* do inherit across fork */
--
1.9.1

2015-11-20 08:07:44

by Minchan Kim

Subject: [PATCH v4 04/16] mm: free swp_entry in madvise_free

When I tested the piece of code below with 12 processes (i.e. 512M * 12 =
6G consumed) on my machine (3G ram + 12 cpu + 8G swap), madvise_free was
significantly slower (i.e. about 2x) than madvise_dontneed.

loop = 5;
mmap(512M);
while (loop--) {
        memset(512M);
        madvise(MADV_FREE or MADV_DONTNEED);
}
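
(For reference, a runnable sketch of the test above, using the sizes from
the description; it assumes headers that define MADV_FREE, e.g. from a
tree carrying this series.)

#include <sys/mman.h>
#include <string.h>

#define SIZE	(512UL << 20)	/* 512M per process */

int main(void)
{
	int loop = 5;
	char *p = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (p == MAP_FAILED)
		return 1;

	while (loop--) {
		memset(p, 1, SIZE);
		madvise(p, SIZE, MADV_FREE);	/* or MADV_DONTNEED */
	}
	return 0;
}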

The reason is the large number of swap-ins:

1) dontneed: 1,612 swapin
2) madvfree: 879,585 swapin

If we find that the hinted pages were already swapped out when the syscall
is called, it's pointless to keep the swap entries in the ptes. Instead,
let's free those cold pages, because swap-in is more expensive than
(page allocation + zeroing).

With this patch, swap-ins dropped from 879,585 to 1,878, and the elapsed
time improved accordingly:

1) dontneed: 6.10user 233.50system 0:50.44elapsed
2) madvfree: 6.03user 401.17system 1:30.67elapsed
3) madvfree + this patch: 6.70user 339.14system 1:04.45elapsed

Acked-by: Michal Hocko <[email protected]>
Acked-by: Hugh Dickins <[email protected]>
Signed-off-by: Minchan Kim <[email protected]>
---
mm/madvise.c | 26 +++++++++++++++++++++++++-
1 file changed, 25 insertions(+), 1 deletion(-)

diff --git a/mm/madvise.c b/mm/madvise.c
index 7b5c6f648fdb..c8e23102fc99 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -270,6 +270,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
spinlock_t *ptl;
pte_t *pte, ptent;
struct page *page;
+ int nr_swap = 0;

split_huge_page_pmd(vma, addr, pmd);
if (pmd_trans_unstable(pmd))
@@ -280,8 +281,24 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
for (; addr != end; pte++, addr += PAGE_SIZE) {
ptent = *pte;

- if (!pte_present(ptent))
+ if (pte_none(ptent))
continue;
+ /*
+ * If the pte has swp_entry, just clear page table to
+ * prevent swap-in which is more expensive rather than
+ * (page allocation + zeroing).
+ */
+ if (!pte_present(ptent)) {
+ swp_entry_t entry;
+
+ entry = pte_to_swp_entry(ptent);
+ if (non_swap_entry(entry))
+ continue;
+ nr_swap--;
+ free_swap_and_cache(entry);
+ pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
+ continue;
+ }

page = vm_normal_page(vma, addr, ptent);
if (!page)
@@ -327,6 +344,13 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
}
}

+ if (nr_swap) {
+ if (current->mm == mm)
+ sync_mm_rss(mm);
+
+ add_mm_counter(mm, MM_SWAPENTS, nr_swap);
+ }
+
arch_leave_lazy_mmu_mode();
pte_unmap_unlock(pte - 1, ptl);
cond_resched();
--
1.9.1

2015-11-20 08:02:52

by Minchan Kim

Subject: [PATCH v4 05/16] mm: move lazily freed pages to inactive list

MADV_FREE is a hint that it's okay to discard pages if there is memory
pressure, and we use the reclaimers (i.e. kswapd and direct reclaim) to
free them, so there is no value in keeping them on the active anonymous
LRU. This patch therefore moves them to the head of the inactive LRU list.

This means that MADV_FREE-ed pages, living on the inactive list, are
reclaimed first, because they are more likely to be cold than recently
active pages.

An arguable point in this approach is whether we should put the page at
the head or the tail of the inactive list. I chose the head because the
kernel cannot be sure the page is really cold or warm for every MADV_FREE
use case, but at least we know it's not *hot*, so landing at the head of
the inactive list is a compromise for various use cases.

This fixes the suboptimal behavior of MADV_FREE where pages living on the
active list would sit there for a long time even under memory pressure
while the inactive list was reclaimed heavily. That basically defeats the
whole purpose of using MADV_FREE: helping the system free memory which
might not be used again.

Cc: Johannes Weiner <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Shaohua Li <[email protected]>
Cc: Wang, Yalin <[email protected]>
Acked-by: Hugh Dickins <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Signed-off-by: Minchan Kim <[email protected]>
---
include/linux/swap.h | 1 +
mm/madvise.c | 2 ++
mm/swap.c | 44 ++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 47 insertions(+)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 7ba7dccaf0e7..f629df4cc13d 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -308,6 +308,7 @@ extern void lru_add_drain_cpu(int cpu);
extern void lru_add_drain_all(void);
extern void rotate_reclaimable_page(struct page *page);
extern void deactivate_file_page(struct page *page);
+extern void deactivate_page(struct page *page);
extern void swap_setup(void);

extern void add_page_to_unevictable_list(struct page *page);
diff --git a/mm/madvise.c b/mm/madvise.c
index c8e23102fc99..60e4d7f8ea16 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -340,6 +340,8 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
ptent = pte_mkold(ptent);
ptent = pte_mkclean(ptent);
set_pte_at(mm, addr, pte, ptent);
+ if (PageActive(page))
+ deactivate_page(page);
tlb_remove_tlb_entry(tlb, pte, addr);
}
}
diff --git a/mm/swap.c b/mm/swap.c
index 983f692a47fd..4a6aec976ab1 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -45,6 +45,7 @@ int page_cluster;
static DEFINE_PER_CPU(struct pagevec, lru_add_pvec);
static DEFINE_PER_CPU(struct pagevec, lru_rotate_pvecs);
static DEFINE_PER_CPU(struct pagevec, lru_deactivate_file_pvecs);
+static DEFINE_PER_CPU(struct pagevec, lru_deactivate_pvecs);

/*
* This path almost never happens for VM activity - pages are normally
@@ -799,6 +800,24 @@ static void lru_deactivate_file_fn(struct page *page, struct lruvec *lruvec,
update_page_reclaim_stat(lruvec, file, 0);
}

+
+static void lru_deactivate_fn(struct page *page, struct lruvec *lruvec,
+ void *arg)
+{
+ if (PageLRU(page) && PageActive(page) && !PageUnevictable(page)) {
+ int file = page_is_file_cache(page);
+ int lru = page_lru_base_type(page);
+
+ del_page_from_lru_list(page, lruvec, lru + LRU_ACTIVE);
+ ClearPageActive(page);
+ ClearPageReferenced(page);
+ add_page_to_lru_list(page, lruvec, lru);
+
+ __count_vm_event(PGDEACTIVATE);
+ update_page_reclaim_stat(lruvec, file, 0);
+ }
+}
+
/*
* Drain pages out of the cpu's pagevecs.
* Either "cpu" is the current CPU, and preemption has already been
@@ -825,6 +844,10 @@ void lru_add_drain_cpu(int cpu)
if (pagevec_count(pvec))
pagevec_lru_move_fn(pvec, lru_deactivate_file_fn, NULL);

+ pvec = &per_cpu(lru_deactivate_pvecs, cpu);
+ if (pagevec_count(pvec))
+ pagevec_lru_move_fn(pvec, lru_deactivate_fn, NULL);
+
activate_page_drain(cpu);
}

@@ -854,6 +877,26 @@ void deactivate_file_page(struct page *page)
}
}

+/**
+ * deactivate_page - deactivate a page
+ * @page: page to deactivate
+ *
+ * deactivate_page() moves @page to the inactive list if @page was on the active
+ * list and was not an unevictable page. This is done to accelerate the reclaim
+ * of @page.
+ */
+void deactivate_page(struct page *page)
+{
+ if (PageLRU(page) && PageActive(page) && !PageUnevictable(page)) {
+ struct pagevec *pvec = &get_cpu_var(lru_deactivate_pvecs);
+
+ page_cache_get(page);
+ if (!pagevec_add(pvec, page))
+ pagevec_lru_move_fn(pvec, lru_deactivate_fn, NULL);
+ put_cpu_var(lru_deactivate_pvecs);
+ }
+}
+
void lru_add_drain(void)
{
lru_add_drain_cpu(get_cpu());
@@ -883,6 +926,7 @@ void lru_add_drain_all(void)
if (pagevec_count(&per_cpu(lru_add_pvec, cpu)) ||
pagevec_count(&per_cpu(lru_rotate_pvecs, cpu)) ||
pagevec_count(&per_cpu(lru_deactivate_file_pvecs, cpu)) ||
+ pagevec_count(&per_cpu(lru_deactivate_pvecs, cpu)) ||
need_activate_page_drain(cpu)) {
INIT_WORK(work, lru_add_drain_per_cpu);
schedule_work_on(cpu, work);
--
1.9.1

2015-11-20 08:07:19

by Minchan Kim

Subject: [PATCH v4 06/16] mm: mark stable page dirty in KSM

The MADV_FREE patchset changes page reclaim to simply free a clean
anonymous page with no dirty ptes, instead of swapping it out; but
KSM uses clean write-protected ptes to reference the stable ksm page.
So be sure to mark that page dirty, so it's never mistakenly discarded.

[hughd: adjusted comments]
Acked-by: Hugh Dickins <[email protected]>
Signed-off-by: Minchan Kim <[email protected]>
---
mm/ksm.c | 6 ++++++
1 file changed, 6 insertions(+)

diff --git a/mm/ksm.c b/mm/ksm.c
index 7ee101eaacdf..18d2b7afecff 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -1053,6 +1053,12 @@ static int try_to_merge_one_page(struct vm_area_struct *vma,
*/
set_page_stable_node(page, NULL);
mark_page_accessed(page);
+ /*
+ * Page reclaim just frees a clean page with no dirty
+ * ptes: make sure that the ksm page would be swapped.
+ */
+ if (!PageDirty(page))
+ SetPageDirty(page);
err = 0;
} else if (pages_identical(page, kpage))
err = replace_page(vma, page, kpage, orig_pte);
--
1.9.1

2015-11-20 08:06:52

by Minchan Kim

Subject: [PATCH v4 07/16] x86: add pmd_[dirty|mkclean] for THP

MADV_FREE needs pmd_dirty and pmd_mkclean to detect whether the contents
have been overwritten since the MADV_FREE syscall was called on a THP page.

This patch adds pmd_dirty and pmd_mkclean for THP page MADV_FREE
support.

Signed-off-by: Minchan Kim <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
---
arch/x86/include/asm/pgtable.h | 5 +++++
1 file changed, 5 insertions(+)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 867da5bbb4a3..b964d54300e1 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -267,6 +267,11 @@ static inline pmd_t pmd_mkold(pmd_t pmd)
return pmd_clear_flags(pmd, _PAGE_ACCESSED);
}

+static inline pmd_t pmd_mkclean(pmd_t pmd)
+{
+ return pmd_clear_flags(pmd, _PAGE_DIRTY);
+}
+
static inline pmd_t pmd_wrprotect(pmd_t pmd)
{
return pmd_clear_flags(pmd, _PAGE_RW);
--
1.9.1

2015-11-20 08:06:31

by Minchan Kim

Subject: [PATCH v4 08/16] sparc: add pmd_[dirty|mkclean] for THP

MADV_FREE needs pmd_dirty and pmd_mkclean to detect whether the contents
have been overwritten since the MADV_FREE syscall was called on a THP page.

This patch adds pmd_dirty and pmd_mkclean for THP page MADV_FREE
support.

Signed-off-by: Minchan Kim <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
---
arch/sparc/include/asm/pgtable_64.h | 9 +++++++++
1 file changed, 9 insertions(+)

diff --git a/arch/sparc/include/asm/pgtable_64.h b/arch/sparc/include/asm/pgtable_64.h
index 131d36fcd07a..5833dc5ee7d7 100644
--- a/arch/sparc/include/asm/pgtable_64.h
+++ b/arch/sparc/include/asm/pgtable_64.h
@@ -717,6 +717,15 @@ static inline pmd_t pmd_mkdirty(pmd_t pmd)
return __pmd(pte_val(pte));
}

+static inline pmd_t pmd_mkclean(pmd_t pmd)
+{
+ pte_t pte = __pte(pmd_val(pmd));
+
+ pte = pte_mkclean(pte);
+
+ return __pmd(pte_val(pte));
+}
+
static inline pmd_t pmd_mkyoung(pmd_t pmd)
{
pte_t pte = __pte(pmd_val(pmd));
--
1.9.1

2015-11-20 08:02:56

by Minchan Kim

Subject: [PATCH v4 09/16] powerpc: add pmd_[dirty|mkclean] for THP

MADV_FREE needs pmd_dirty and pmd_mkclean to detect whether the contents
have been overwritten since the MADV_FREE syscall was called on a THP page.

This patch adds pmd_dirty and pmd_mkclean for THP page MADV_FREE
support.

Signed-off-by: Minchan Kim <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
---
arch/powerpc/include/asm/pgtable-ppc64.h | 2 ++
1 file changed, 2 insertions(+)

diff --git a/arch/powerpc/include/asm/pgtable-ppc64.h b/arch/powerpc/include/asm/pgtable-ppc64.h
index fa1dfb7f7b48..85e15c8067be 100644
--- a/arch/powerpc/include/asm/pgtable-ppc64.h
+++ b/arch/powerpc/include/asm/pgtable-ppc64.h
@@ -507,9 +507,11 @@ static inline pte_t *pmdp_ptep(pmd_t *pmd)
#define pmd_pfn(pmd) pte_pfn(pmd_pte(pmd))
#define pmd_dirty(pmd) pte_dirty(pmd_pte(pmd))
#define pmd_young(pmd) pte_young(pmd_pte(pmd))
+#define pmd_dirty(pmd) pte_dirty(pmd_pte(pmd))
#define pmd_mkold(pmd) pte_pmd(pte_mkold(pmd_pte(pmd)))
#define pmd_wrprotect(pmd) pte_pmd(pte_wrprotect(pmd_pte(pmd)))
#define pmd_mkdirty(pmd) pte_pmd(pte_mkdirty(pmd_pte(pmd)))
+#define pmd_mkclean(pmd) pte_pmd(pte_mkclean(pmd_pte(pmd)))
#define pmd_mkyoung(pmd) pte_pmd(pte_mkyoung(pmd_pte(pmd)))
#define pmd_mkwrite(pmd) pte_pmd(pte_mkwrite(pmd_pte(pmd)))

--
1.9.1

2015-11-20 08:02:58

by Minchan Kim

Subject: [PATCH v4 10/16] arm: add pmd_mkclean for THP

MADV_FREE needs pmd_dirty and pmd_mkclean to detect whether the contents
have been overwritten since the MADV_FREE syscall was called on a THP page.

This patch adds pmd_mkclean for THP page MADV_FREE support.

Signed-off-by: Minchan Kim <[email protected]>
---
arch/arm/include/asm/pgtable-3level.h | 1 +
1 file changed, 1 insertion(+)

diff --git a/arch/arm/include/asm/pgtable-3level.h b/arch/arm/include/asm/pgtable-3level.h
index a745a2a53853..6d6012a320b2 100644
--- a/arch/arm/include/asm/pgtable-3level.h
+++ b/arch/arm/include/asm/pgtable-3level.h
@@ -249,6 +249,7 @@ PMD_BIT_FUNC(mkold, &= ~PMD_SECT_AF);
PMD_BIT_FUNC(mksplitting, |= L_PMD_SECT_SPLITTING);
PMD_BIT_FUNC(mkwrite, &= ~L_PMD_SECT_RDONLY);
PMD_BIT_FUNC(mkdirty, |= L_PMD_SECT_DIRTY);
+PMD_BIT_FUNC(mkclean, &= ~L_PMD_SECT_DIRTY);
PMD_BIT_FUNC(mkyoung, |= PMD_SECT_AF);

#define pmd_mkhuge(pmd) (__pmd(pmd_val(pmd) & ~PMD_TABLE_BIT))
--
1.9.1

2015-11-20 08:05:30

by Minchan Kim

Subject: [PATCH v4 11/16] arm64: add pmd_mkclean for THP

MADV_FREE needs pmd_dirty and pmd_mkclean to detect whether the contents
have been overwritten since the MADV_FREE syscall was called on a THP page.

This patch adds pmd_mkclean for THP page MADV_FREE support.

Signed-off-by: Minchan Kim <[email protected]>
---
arch/arm64/include/asm/pgtable.h | 1 +
1 file changed, 1 insertion(+)

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 26b066690593..a945263addd4 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -325,6 +325,7 @@ void pmdp_splitting_flush(struct vm_area_struct *vma, unsigned long address,
#define pmd_mksplitting(pmd) pte_pmd(pte_mkspecial(pmd_pte(pmd)))
#define pmd_mkold(pmd) pte_pmd(pte_mkold(pmd_pte(pmd)))
#define pmd_mkwrite(pmd) pte_pmd(pte_mkwrite(pmd_pte(pmd)))
+#define pmd_mkclean(pmd) pte_pmd(pte_mkclean(pmd_pte(pmd)))
#define pmd_mkdirty(pmd) pte_pmd(pte_mkdirty(pmd_pte(pmd)))
#define pmd_mkyoung(pmd) pte_pmd(pte_mkyoung(pmd_pte(pmd)))
#define pmd_mknotpresent(pmd) (__pmd(pmd_val(pmd) & ~PMD_TYPE_MASK))
--
1.9.1

2015-11-20 08:04:53

by Minchan Kim

Subject: [PATCH v4 12/16] mm: don't split THP page when syscall is called

We don't need to split a THP page when the MADV_FREE syscall is called.
The split can be done later, when the VM decides to free it in the reclaim
path under heavy memory pressure, so we can avoid unnecessary THP splits.

To do that, this patch changes two things:

1. __split_huge_page_map

It applies pte_mkdirty to the subpages only if pmd_dirty is true.

2. __split_huge_page_refcount

It no longer marks the subpages PG_dirty unconditionally.

Cc: Kirill A. Shutemov <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Signed-off-by: Minchan Kim <[email protected]>
---
include/linux/huge_mm.h | 3 +++
mm/huge_memory.c | 54 +++++++++++++++++++++++++++++++++++++++++++++----
mm/madvise.c | 12 ++++++++++-
3 files changed, 64 insertions(+), 5 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index ecb080d6ff42..e9db238a75c1 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -19,6 +19,9 @@ extern struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,
unsigned long addr,
pmd_t *pmd,
unsigned int flags);
+extern int madvise_free_huge_pmd(struct mmu_gather *tlb,
+ struct vm_area_struct *vma,
+ pmd_t *pmd, unsigned long addr);
extern int zap_huge_pmd(struct mmu_gather *tlb,
struct vm_area_struct *vma,
pmd_t *pmd, unsigned long addr);
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index bbac913f96bc..83bc4ce53e19 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1453,6 +1453,49 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
return 0;
}

+int madvise_free_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
+ pmd_t *pmd, unsigned long addr)
+
+{
+ spinlock_t *ptl;
+ pmd_t orig_pmd;
+ struct page *page;
+ struct mm_struct *mm = tlb->mm;
+
+ if (__pmd_trans_huge_lock(pmd, vma, &ptl) != 1)
+ return 1;
+
+ orig_pmd = *pmd;
+ if (is_huge_zero_pmd(orig_pmd))
+ goto out;
+
+ page = pmd_page(orig_pmd);
+ if (page_mapcount(page) != 1)
+ goto out;
+ if (!trylock_page(page))
+ goto out;
+ if (PageDirty(page))
+ ClearPageDirty(page);
+ unlock_page(page);
+
+ if (PageActive(page))
+ deactivate_page(page);
+
+ if (pmd_young(orig_pmd) || pmd_dirty(orig_pmd)) {
+ orig_pmd = pmdp_huge_get_and_clear_full(tlb->mm, addr, pmd,
+ tlb->fullmm);
+ orig_pmd = pmd_mkold(orig_pmd);
+ orig_pmd = pmd_mkclean(orig_pmd);
+
+ set_pmd_at(mm, addr, pmd, orig_pmd);
+ tlb_remove_pmd_tlb_entry(tlb, pmd, addr);
+ }
+out:
+ spin_unlock(ptl);
+
+ return 0;
+}
+
int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
pmd_t *pmd, unsigned long addr)
{
@@ -1752,8 +1795,8 @@ static void __split_huge_page_refcount(struct page *page,
(1L << PG_mlocked) |
(1L << PG_uptodate) |
(1L << PG_active) |
- (1L << PG_unevictable)));
- page_tail->flags |= (1L << PG_dirty);
+ (1L << PG_unevictable) |
+ (1L << PG_dirty)));

/* clear PageTail before overwriting first_page */
smp_wmb();
@@ -1787,7 +1830,6 @@ static void __split_huge_page_refcount(struct page *page,

BUG_ON(!PageAnon(page_tail));
BUG_ON(!PageUptodate(page_tail));
- BUG_ON(!PageDirty(page_tail));
BUG_ON(!PageSwapBacked(page_tail));

lru_add_page_tail(page, page_tail, lruvec, list);
@@ -1831,10 +1873,12 @@ static int __split_huge_page_map(struct page *page,
int ret = 0, i;
pgtable_t pgtable;
unsigned long haddr;
+ bool dirty;

pmd = page_check_address_pmd(page, mm, address,
PAGE_CHECK_ADDRESS_PMD_SPLITTING_FLAG, &ptl);
if (pmd) {
+ dirty = pmd_dirty(*pmd);
pgtable = pgtable_trans_huge_withdraw(mm, pmd);
pmd_populate(mm, &_pmd, pgtable);
if (pmd_write(*pmd))
@@ -1850,7 +1894,9 @@ static int __split_huge_page_map(struct page *page,
* permissions across VMAs.
*/
entry = mk_pte(page + i, vma->vm_page_prot);
- entry = maybe_mkwrite(pte_mkdirty(entry), vma);
+ if (dirty)
+ entry = pte_mkdirty(entry);
+ entry = maybe_mkwrite(entry, vma);
if (!pmd_write(*pmd))
entry = pte_wrprotect(entry);
if (!pmd_young(*pmd))
diff --git a/mm/madvise.c b/mm/madvise.c
index 60e4d7f8ea16..982484fb44ca 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -271,8 +271,17 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
pte_t *pte, ptent;
struct page *page;
int nr_swap = 0;
+ unsigned long next;
+
+ next = pmd_addr_end(addr, end);
+ if (pmd_trans_huge(*pmd)) {
+ if (next - addr != HPAGE_PMD_SIZE)
+ split_huge_page_pmd(vma, addr, pmd);
+ else if (!madvise_free_huge_pmd(tlb, vma, pmd, addr))
+ goto next;
+ /* fall through */
+ }

- split_huge_page_pmd(vma, addr, pmd);
if (pmd_trans_unstable(pmd))
return 0;

@@ -356,6 +365,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
arch_leave_lazy_mmu_mode();
pte_unmap_unlock(pte - 1, ptl);
cond_resched();
+next:
return 0;
}

--
1.9.1

2015-11-20 08:04:32

by Minchan Kim

Subject: [PATCH v4 13/16] mm: introduce wrappers to add new LRU

We have been using the binary variable "file" to identify whether a page
is on the anon LRU or the file LRU. That works, but it becomes an obstacle
once we add a new LRU.

So, this patch introduces some wrapper functions to handle it.

Signed-off-by: Minchan Kim <[email protected]>
---
include/linux/mm_inline.h | 64 +++++++++++++++++++++++++++++++++++++++++--
include/trace/events/vmscan.h | 24 ++++++++--------
mm/compaction.c | 2 +-
mm/huge_memory.c | 5 ++--
mm/memory-failure.c | 7 ++---
mm/memory_hotplug.c | 3 +-
mm/mempolicy.c | 3 +-
mm/migrate.c | 26 ++++++------------
mm/swap.c | 29 ++++++++------------
mm/vmscan.c | 26 +++++++++---------
10 files changed, 114 insertions(+), 75 deletions(-)

diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
index cf55945c83fb..5e08a354f936 100644
--- a/include/linux/mm_inline.h
+++ b/include/linux/mm_inline.h
@@ -8,8 +8,8 @@
* page_is_file_cache - should the page be on a file LRU or anon LRU?
* @page: the page to test
*
- * Returns 1 if @page is page cache page backed by a regular filesystem,
- * or 0 if @page is anonymous, tmpfs or otherwise ram or swap backed.
+ * Returns true if @page is page cache page backed by a regular filesystem,
+ * or false if @page is anonymous, tmpfs or otherwise ram or swap backed.
* Used by functions that manipulate the LRU lists, to sort a page
* onto the right LRU list.
*
@@ -17,7 +17,7 @@
* needs to survive until the page is last deleted from the LRU, which
* could be as far down as __page_cache_release.
*/
-static inline int page_is_file_cache(struct page *page)
+static inline bool page_is_file_cache(struct page *page)
{
return !PageSwapBacked(page);
}
@@ -56,6 +56,64 @@ static inline enum lru_list page_lru_base_type(struct page *page)
}

/**
+ * lru_index - which LRU list is lru on for accouting update_page_reclaim_stat
+ *
+ * Used for LRU list index arithmetic.
+ *
+ * Returns 0 if @lru is anon, 1 if it is file.
+ */
+static inline int lru_index(enum lru_list lru)
+{
+ int base;
+
+ switch (lru) {
+ case LRU_INACTIVE_ANON:
+ case LRU_ACTIVE_ANON:
+ base = 0;
+ break;
+ case LRU_INACTIVE_FILE:
+ case LRU_ACTIVE_FILE:
+ base = 1;
+ break;
+ default:
+ BUG();
+ }
+ return base;
+}
+
+/*
+ * page_off_isolate - which LRU list was page on for accouting NR_ISOLATED.
+ * @page: the page to test
+ *
+ * Returns the LRU list a page was on, as an index into the array of
+ * zone_page_state;
+ */
+static inline int page_off_isolate(struct page *page)
+{
+ int lru = NR_ISOLATED_ANON;
+
+ if (!PageSwapBacked(page))
+ lru = NR_ISOLATED_FILE;
+ return lru;
+}
+
+/**
+ * lru_off_isolate - which LRU list was @lru on for accouting NR_ISOLATED.
+ * @lru: the lru to test
+ *
+ * Returns the LRU list a page was on, as an index into the array of
+ * zone_page_state;
+ */
+static inline int lru_off_isolate(enum lru_list lru)
+{
+ int base = NR_ISOLATED_FILE;
+
+ if (lru <= LRU_ACTIVE_ANON)
+ base = NR_ISOLATED_ANON;
+ return base;
+}
+
+/**
* page_off_lru - which LRU list was page on? clearing its lru flags.
* @page: the page to test
*
diff --git a/include/trace/events/vmscan.h b/include/trace/events/vmscan.h
index f66476b96264..4e9e86733849 100644
--- a/include/trace/events/vmscan.h
+++ b/include/trace/events/vmscan.h
@@ -30,9 +30,9 @@
(RECLAIM_WB_ASYNC) \
)

-#define trace_shrink_flags(file) \
+#define trace_shrink_flags(lru) \
( \
- (file ? RECLAIM_WB_FILE : RECLAIM_WB_ANON) | \
+ (lru ? RECLAIM_WB_FILE : RECLAIM_WB_ANON) | \
(RECLAIM_WB_ASYNC) \
)

@@ -271,9 +271,9 @@ DECLARE_EVENT_CLASS(mm_vmscan_lru_isolate_template,
unsigned long nr_scanned,
unsigned long nr_taken,
isolate_mode_t isolate_mode,
- int file),
+ enum lru_list lru),

- TP_ARGS(order, nr_requested, nr_scanned, nr_taken, isolate_mode, file),
+ TP_ARGS(order, nr_requested, nr_scanned, nr_taken, isolate_mode, lru),

TP_STRUCT__entry(
__field(int, order)
@@ -281,7 +281,7 @@ DECLARE_EVENT_CLASS(mm_vmscan_lru_isolate_template,
__field(unsigned long, nr_scanned)
__field(unsigned long, nr_taken)
__field(isolate_mode_t, isolate_mode)
- __field(int, file)
+ __field(enum lru_list, lru)
),

TP_fast_assign(
@@ -290,16 +290,16 @@ DECLARE_EVENT_CLASS(mm_vmscan_lru_isolate_template,
__entry->nr_scanned = nr_scanned;
__entry->nr_taken = nr_taken;
__entry->isolate_mode = isolate_mode;
- __entry->file = file;
+ __entry->lru = lru;
),

- TP_printk("isolate_mode=%d order=%d nr_requested=%lu nr_scanned=%lu nr_taken=%lu file=%d",
+ TP_printk("isolate_mode=%d order=%d nr_requested=%lu nr_scanned=%lu nr_taken=%lu lru=%d",
__entry->isolate_mode,
__entry->order,
__entry->nr_requested,
__entry->nr_scanned,
__entry->nr_taken,
- __entry->file)
+ __entry->lru)
);

DEFINE_EVENT(mm_vmscan_lru_isolate_template, mm_vmscan_lru_isolate,
@@ -309,9 +309,9 @@ DEFINE_EVENT(mm_vmscan_lru_isolate_template, mm_vmscan_lru_isolate,
unsigned long nr_scanned,
unsigned long nr_taken,
isolate_mode_t isolate_mode,
- int file),
+ enum lru_list lru),

- TP_ARGS(order, nr_requested, nr_scanned, nr_taken, isolate_mode, file)
+ TP_ARGS(order, nr_requested, nr_scanned, nr_taken, isolate_mode, lru)

);

@@ -322,9 +322,9 @@ DEFINE_EVENT(mm_vmscan_lru_isolate_template, mm_vmscan_memcg_isolate,
unsigned long nr_scanned,
unsigned long nr_taken,
isolate_mode_t isolate_mode,
- int file),
+ enum lru_list lru),

- TP_ARGS(order, nr_requested, nr_scanned, nr_taken, isolate_mode, file)
+ TP_ARGS(order, nr_requested, nr_scanned, nr_taken, isolate_mode, lru)

);

diff --git a/mm/compaction.c b/mm/compaction.c
index c5c627aae996..d888fa248ebb 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -632,7 +632,7 @@ static void acct_isolated(struct zone *zone, struct compact_control *cc)
return;

list_for_each_entry(page, &cc->migratepages, lru)
- count[!!page_is_file_cache(page)]++;
+ count[page_off_isolate(page) - NR_ISOLATED_ANON]++;

mod_zone_page_state(zone, NR_ISOLATED_ANON, count[0]);
mod_zone_page_state(zone, NR_ISOLATED_FILE, count[1]);
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 83bc4ce53e19..7a48c3d4f92e 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2226,8 +2226,7 @@ void __khugepaged_exit(struct mm_struct *mm)

static void release_pte_page(struct page *page)
{
- /* 0 stands for page_is_file_cache(page) == false */
- dec_zone_page_state(page, NR_ISOLATED_ANON + 0);
+ dec_zone_page_state(page, page_off_isolate(page));
unlock_page(page);
putback_lru_page(page);
}
@@ -2310,7 +2309,7 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
goto out;
}
/* 0 stands for page_is_file_cache(page) == false */
- inc_zone_page_state(page, NR_ISOLATED_ANON + 0);
+ inc_zone_page_state(page, page_off_isolate(page));
VM_BUG_ON_PAGE(!PageLocked(page), page);
VM_BUG_ON_PAGE(PageLRU(page), page);

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 95882692e747..abf50e00705b 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1682,16 +1682,15 @@ static int __soft_offline_page(struct page *page, int flags)
put_hwpoison_page(page);
if (!ret) {
LIST_HEAD(pagelist);
- inc_zone_page_state(page, NR_ISOLATED_ANON +
- page_is_file_cache(page));
+ inc_zone_page_state(page, page_off_isolate(page));
list_add(&page->lru, &pagelist);
ret = migrate_pages(&pagelist, new_page, NULL, MPOL_MF_MOVE_ALL,
MIGRATE_SYNC, MR_MEMORY_FAILURE);
if (ret) {
if (!list_empty(&pagelist)) {
list_del(&page->lru);
- dec_zone_page_state(page, NR_ISOLATED_ANON +
- page_is_file_cache(page));
+ dec_zone_page_state(page,
+ page_off_isolate(page));
putback_lru_page(page);
}

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index aa992e2df58a..7c8360744551 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1449,8 +1449,7 @@ do_migrate_range(unsigned long start_pfn, unsigned long end_pfn)
put_page(page);
list_add_tail(&page->lru, &source);
move_pages--;
- inc_zone_page_state(page, NR_ISOLATED_ANON +
- page_is_file_cache(page));
+ inc_zone_page_state(page, page_off_isolate(page));

} else {
#ifdef CONFIG_DEBUG_VM
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 87a177917cb2..856b6eb07e42 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -930,8 +930,7 @@ static void migrate_page_add(struct page *page, struct list_head *pagelist,
if ((flags & MPOL_MF_MOVE_ALL) || page_mapcount(page) == 1) {
if (!isolate_lru_page(page)) {
list_add_tail(&page->lru, pagelist);
- inc_zone_page_state(page, NR_ISOLATED_ANON +
- page_is_file_cache(page));
+ inc_zone_page_state(page, page_off_isolate(page));
}
}
}
diff --git a/mm/migrate.c b/mm/migrate.c
index 842ecd7aaf7f..87ebf0833b84 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -91,8 +91,7 @@ void putback_movable_pages(struct list_head *l)
continue;
}
list_del(&page->lru);
- dec_zone_page_state(page, NR_ISOLATED_ANON +
- page_is_file_cache(page));
+ dec_zone_page_state(page, page_off_isolate(page));
if (unlikely(isolated_balloon_page(page)))
balloon_page_putback(page);
else
@@ -964,8 +963,7 @@ static ICE_noinline int unmap_and_move(new_page_t get_new_page,
* restored.
*/
list_del(&page->lru);
- dec_zone_page_state(page, NR_ISOLATED_ANON +
- page_is_file_cache(page));
+ dec_zone_page_state(page, page_off_isolate(page));
/* Soft-offlined page shouldn't go through lru cache list */
if (reason == MR_MEMORY_FAILURE) {
put_page(page);
@@ -1278,8 +1276,7 @@ static int do_move_page_to_node_array(struct mm_struct *mm,
err = isolate_lru_page(page);
if (!err) {
list_add_tail(&page->lru, &pagelist);
- inc_zone_page_state(page, NR_ISOLATED_ANON +
- page_is_file_cache(page));
+ inc_zone_page_state(page, page_off_isolate(page));
}
put_and_set:
/*
@@ -1622,8 +1619,6 @@ static bool numamigrate_update_ratelimit(pg_data_t *pgdat,

static int numamigrate_isolate_page(pg_data_t *pgdat, struct page *page)
{
- int page_lru;
-
VM_BUG_ON_PAGE(compound_order(page) && !PageTransHuge(page), page);

/* Avoid migrating to a node that is nearly full */
@@ -1645,8 +1640,7 @@ static int numamigrate_isolate_page(pg_data_t *pgdat, struct page *page)
return 0;
}

- page_lru = page_is_file_cache(page);
- mod_zone_page_state(page_zone(page), NR_ISOLATED_ANON + page_lru,
+ mod_zone_page_state(page_zone(page), page_off_isolate(page),
hpage_nr_pages(page));

/*
@@ -1704,8 +1698,7 @@ int migrate_misplaced_page(struct page *page, struct vm_area_struct *vma,
if (nr_remaining) {
if (!list_empty(&migratepages)) {
list_del(&page->lru);
- dec_zone_page_state(page, NR_ISOLATED_ANON +
- page_is_file_cache(page));
+ dec_zone_page_state(page, page_off_isolate(page));
putback_lru_page(page);
}
isolated = 0;
@@ -1735,7 +1728,7 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
pg_data_t *pgdat = NODE_DATA(node);
int isolated = 0;
struct page *new_page = NULL;
- int page_lru = page_is_file_cache(page);
+ int page_lru = page_off_isolate(page);
unsigned long mmun_start = address & HPAGE_PMD_MASK;
unsigned long mmun_end = mmun_start + HPAGE_PMD_SIZE;
pmd_t orig_entry;
@@ -1794,8 +1787,7 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
/* Retake the callers reference and putback on LRU */
get_page(page);
putback_lru_page(page);
- mod_zone_page_state(page_zone(page),
- NR_ISOLATED_ANON + page_lru, -HPAGE_PMD_NR);
+ mod_zone_page_state(page_zone(page), page_lru, -HPAGE_PMD_NR);

goto out_unlock;
}
@@ -1847,9 +1839,7 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
count_vm_events(PGMIGRATE_SUCCESS, HPAGE_PMD_NR);
count_vm_numa_events(NUMA_PAGE_MIGRATE, HPAGE_PMD_NR);

- mod_zone_page_state(page_zone(page),
- NR_ISOLATED_ANON + page_lru,
- -HPAGE_PMD_NR);
+ mod_zone_page_state(page_zone(page), page_lru, -HPAGE_PMD_NR);
return isolated;

out_fail:
diff --git a/mm/swap.c b/mm/swap.c
index 4a6aec976ab1..ac1c6be4381f 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -491,21 +491,20 @@ void rotate_reclaimable_page(struct page *page)
}

static void update_page_reclaim_stat(struct lruvec *lruvec,
- int file, int rotated)
+ int lru, int rotated)
{
struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;

- reclaim_stat->recent_scanned[file]++;
+ reclaim_stat->recent_scanned[lru]++;
if (rotated)
- reclaim_stat->recent_rotated[file]++;
+ reclaim_stat->recent_rotated[lru]++;
}

static void __activate_page(struct page *page, struct lruvec *lruvec,
void *arg)
{
if (PageLRU(page) && !PageActive(page) && !PageUnevictable(page)) {
- int file = page_is_file_cache(page);
- int lru = page_lru_base_type(page);
+ enum lru_list lru = page_lru_base_type(page);

del_page_from_lru_list(page, lruvec, lru);
SetPageActive(page);
@@ -514,7 +513,7 @@ static void __activate_page(struct page *page, struct lruvec *lruvec,
trace_mm_lru_activate(page);

__count_vm_event(PGACTIVATE);
- update_page_reclaim_stat(lruvec, file, 1);
+ update_page_reclaim_stat(lruvec, lru_index(lru), 1);
}
}

@@ -757,8 +756,8 @@ void lru_cache_add_active_or_unevictable(struct page *page,
static void lru_deactivate_file_fn(struct page *page, struct lruvec *lruvec,
void *arg)
{
- int lru, file;
- bool active;
+ enum lru_list lru;
+ bool file, active;

if (!PageLRU(page))
return;
@@ -797,7 +796,7 @@ static void lru_deactivate_file_fn(struct page *page, struct lruvec *lruvec,

if (active)
__count_vm_event(PGDEACTIVATE);
- update_page_reclaim_stat(lruvec, file, 0);
+ update_page_reclaim_stat(lruvec, lru_index(lru), 0);
}


@@ -805,8 +804,7 @@ static void lru_deactivate_fn(struct page *page, struct lruvec *lruvec,
void *arg)
{
if (PageLRU(page) && PageActive(page) && !PageUnevictable(page)) {
- int file = page_is_file_cache(page);
- int lru = page_lru_base_type(page);
+ enum lru_list lru = page_lru_base_type(page);

del_page_from_lru_list(page, lruvec, lru + LRU_ACTIVE);
ClearPageActive(page);
@@ -814,7 +812,7 @@ static void lru_deactivate_fn(struct page *page, struct lruvec *lruvec,
add_page_to_lru_list(page, lruvec, lru);

__count_vm_event(PGDEACTIVATE);
- update_page_reclaim_stat(lruvec, file, 0);
+ update_page_reclaim_stat(lruvec, lru_index(lru), 0);
}
}

@@ -1038,8 +1036,6 @@ EXPORT_SYMBOL(__pagevec_release);
void lru_add_page_tail(struct page *page, struct page *page_tail,
struct lruvec *lruvec, struct list_head *list)
{
- const int file = 0;
-
VM_BUG_ON_PAGE(!PageHead(page), page);
VM_BUG_ON_PAGE(PageCompound(page_tail), page);
VM_BUG_ON_PAGE(PageLRU(page_tail), page);
@@ -1070,14 +1066,13 @@ void lru_add_page_tail(struct page *page, struct page *page_tail,
}

if (!PageUnevictable(page))
- update_page_reclaim_stat(lruvec, file, PageActive(page_tail));
+ update_page_reclaim_stat(lruvec, 0, PageActive(page_tail));
}
#endif /* CONFIG_TRANSPARENT_HUGEPAGE */

static void __pagevec_lru_add_fn(struct page *page, struct lruvec *lruvec,
void *arg)
{
- int file = page_is_file_cache(page);
int active = PageActive(page);
enum lru_list lru = page_lru(page);

@@ -1085,7 +1080,7 @@ static void __pagevec_lru_add_fn(struct page *page, struct lruvec *lruvec,

SetPageLRU(page);
add_page_to_lru_list(page, lruvec, lru);
- update_page_reclaim_stat(lruvec, file, active);
+ update_page_reclaim_stat(lruvec, lru_index(lru), active);
trace_mm_lru_insertion(page, lru);
}

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 7a415b9fdd34..80dff84ba673 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1398,7 +1398,7 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,

*nr_scanned = scan;
trace_mm_vmscan_lru_isolate(sc->order, nr_to_scan, scan,
- nr_taken, mode, is_file_lru(lru));
+ nr_taken, mode, lru_index(lru));
return nr_taken;
}

@@ -1518,9 +1518,9 @@ putback_inactive_pages(struct lruvec *lruvec, struct list_head *page_list)
add_page_to_lru_list(page, lruvec, lru);

if (is_active_lru(lru)) {
- int file = is_file_lru(lru);
int numpages = hpage_nr_pages(page);
- reclaim_stat->recent_rotated[file] += numpages;
+ reclaim_stat->recent_rotated[lru_index(lru)]
+ += numpages;
}
if (put_page_testzero(page)) {
__ClearPageLRU(page);
@@ -1574,7 +1574,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
unsigned long nr_writeback = 0;
unsigned long nr_immediate = 0;
isolate_mode_t isolate_mode = 0;
- int file = is_file_lru(lru);
+ int lruidx = lru_index(lru);
struct zone *zone = lruvec_zone(lruvec);
struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;

@@ -1599,7 +1599,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
&nr_scanned, sc, isolate_mode, lru);

__mod_zone_page_state(zone, NR_LRU_BASE + lru, -nr_taken);
- __mod_zone_page_state(zone, NR_ISOLATED_ANON + file, nr_taken);
+ __mod_zone_page_state(zone, lru_off_isolate(lru), nr_taken);

if (global_reclaim(sc)) {
__mod_zone_page_state(zone, NR_PAGES_SCANNED, nr_scanned);
@@ -1620,7 +1620,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,

spin_lock_irq(&zone->lru_lock);

- reclaim_stat->recent_scanned[file] += nr_taken;
+ reclaim_stat->recent_scanned[lruidx] += nr_taken;

if (global_reclaim(sc)) {
if (current_is_kswapd())
@@ -1633,7 +1633,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,

putback_inactive_pages(lruvec, &page_list);

- __mod_zone_page_state(zone, NR_ISOLATED_ANON + file, -nr_taken);
+ __mod_zone_page_state(zone, lru_off_isolate(lru), -nr_taken);

spin_unlock_irq(&zone->lru_lock);

@@ -1701,7 +1701,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
zone_idx(zone),
nr_scanned, nr_reclaimed,
sc->priority,
- trace_shrink_flags(file));
+ trace_shrink_flags(lru));
return nr_reclaimed;
}

@@ -1779,7 +1779,7 @@ static void shrink_active_list(unsigned long nr_to_scan,
struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;
unsigned long nr_rotated = 0;
isolate_mode_t isolate_mode = 0;
- int file = is_file_lru(lru);
+ int lruidx = lru_index(lru);
struct zone *zone = lruvec_zone(lruvec);

lru_add_drain();
@@ -1796,11 +1796,11 @@ static void shrink_active_list(unsigned long nr_to_scan,
if (global_reclaim(sc))
__mod_zone_page_state(zone, NR_PAGES_SCANNED, nr_scanned);

- reclaim_stat->recent_scanned[file] += nr_taken;
+ reclaim_stat->recent_scanned[lruidx] += nr_taken;

__count_zone_vm_events(PGREFILL, zone, nr_scanned);
__mod_zone_page_state(zone, NR_LRU_BASE + lru, -nr_taken);
- __mod_zone_page_state(zone, NR_ISOLATED_ANON + file, nr_taken);
+ __mod_zone_page_state(zone, lru_off_isolate(lru), nr_taken);
spin_unlock_irq(&zone->lru_lock);

while (!list_empty(&l_hold)) {
@@ -1853,11 +1853,11 @@ static void shrink_active_list(unsigned long nr_to_scan,
* helps balance scan pressure between file and anonymous pages in
* get_scan_count.
*/
- reclaim_stat->recent_rotated[file] += nr_rotated;
+ reclaim_stat->recent_rotated[lru_index(lru)] += nr_rotated;

move_active_pages_to_lru(lruvec, &l_active, &l_hold, lru);
move_active_pages_to_lru(lruvec, &l_inactive, &l_hold, lru - LRU_ACTIVE);
- __mod_zone_page_state(zone, NR_ISOLATED_ANON + file, -nr_taken);
+ __mod_zone_page_state(zone, lru_off_isolate(lru), -nr_taken);
spin_unlock_irq(&zone->lru_lock);

mem_cgroup_uncharge_list(&l_hold);
--
1.9.1

2015-11-20 08:03:07

by Minchan Kim

[permalink] [raw]
Subject: [PATCH v4 14/16] mm: introduce lazyfree LRU list

There are two issues with supporting MADV_FREE.

* Hotness of MADV_FREE pages

It's really arguable: some people consider the pages cold while
others consider them hot. It is workload dependent, so no single
policy suits everyone. IOW, we need a tunable knob.

* MADV_FREE on swapless systems

Currently, we free MADV_FREEed pages instantly on swapless systems
because there is no aged anonymous LRU list there, so the pages
never get a chance to be discarded lazily.

I tried to solve it with the inactive anonymous LRU list, without
introducing a new LRU list, but that needs a few hooks in the
reclaim path to preserve the old behavior, which I didn't like.
Moreover, it makes implementing the tuning knob hard.

To address these issues, this patch adds a new LazyFree LRU list
and the accounting functions for it. Pages on the list carry the
PG_lazyfree flag, which aliases PG_mappedtodisk (that is safe
because no anonymous page can have that flag).

If userspace calls madvise(start, len, MADV_FREE), pages in the
range move from the anonymous LRU to the lazyfree LRU. When memory
pressure happens they can be discarded, since no store has happened
since the hint. If a store does happen, they can move back to the
active anonymous LRU list.
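
For illustration, a minimal userspace sketch of the intended usage
(not part of the patch; error handling is omitted and it assumes the
toolchain headers already carry the MADV_FREE definition added by the
uapi patches earlier in this series):

	#include <string.h>
	#include <sys/mman.h>

	int main(void)
	{
		size_t len = 64 << 20;
		char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
				 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

		memset(buf, 1, len);		/* pages sit on the anon LRU */
		madvise(buf, len, MADV_FREE);	/* move them to the lazyfree LRU */
		buf[0] = 2;	/* a store after the hint keeps the page; it can
				 * move back to the active anon LRU */
		return 0;
	}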

In this patch, the aging of lazyfree pages is very basic: all pages
on the list are discarded whenever memory pressure happens. That is
enough to prove the concept works; a later patch implements the policy.
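
The new accounting can be watched from userspace. A rough sketch,
relying only on the names added by this series (the "LazyFree:" line
in /proc/meminfo and the nr_lazyfree/pglazyfree entries in
/proc/vmstat; the prefix match also picks up pglazyfreed, the discard
counter from earlier in the series, if present):

	#include <stdio.h>
	#include <string.h>

	/* Dump the lazyfree counters exposed by this series. */
	int main(void)
	{
		char line[256];
		FILE *f = fopen("/proc/meminfo", "r");

		while (f && fgets(line, sizeof(line), f))
			if (!strncmp(line, "LazyFree:", 9))
				fputs(line, stdout);
		if (f)
			fclose(f);

		f = fopen("/proc/vmstat", "r");
		while (f && fgets(line, sizeof(line), f))
			if (!strncmp(line, "nr_lazyfree", 11) ||
			    !strncmp(line, "pglazyfree", 10))
				fputs(line, stdout);
		if (f)
			fclose(f);
		return 0;
	}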

Signed-off-by: Minchan Kim <[email protected]>
---
drivers/base/node.c | 2 +
drivers/staging/android/lowmemorykiller.c | 3 +-
fs/proc/meminfo.c | 2 +
include/linux/huge_mm.h | 5 +-
include/linux/mm_inline.h | 25 +++++++--
include/linux/mmzone.h | 11 ++--
include/linux/page-flags.h | 5 ++
include/linux/rmap.h | 2 +-
include/linux/swap.h | 5 +-
include/linux/vm_event_item.h | 4 +-
include/trace/events/vmscan.h | 18 ++++---
mm/compaction.c | 12 +++--
mm/huge_memory.c | 11 ++--
mm/madvise.c | 40 ++++++++------
mm/memcontrol.c | 14 ++++-
mm/migrate.c | 2 +
mm/page_alloc.c | 7 +++
mm/rmap.c | 15 ++++--
mm/swap.c | 87 ++++++++++++++++++-------------
mm/vmscan.c | 78 +++++++++++++++++++--------
mm/vmstat.c | 3 ++
21 files changed, 244 insertions(+), 107 deletions(-)

diff --git a/drivers/base/node.c b/drivers/base/node.c
index 560751bad294..f7a1f2107b43 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -70,6 +70,7 @@ static ssize_t node_read_meminfo(struct device *dev,
"Node %d Active(file): %8lu kB\n"
"Node %d Inactive(file): %8lu kB\n"
"Node %d Unevictable: %8lu kB\n"
+ "Node %d LazyFree: %8lu kB\n"
"Node %d Mlocked: %8lu kB\n",
nid, K(i.totalram),
nid, K(i.freeram),
@@ -83,6 +84,7 @@ static ssize_t node_read_meminfo(struct device *dev,
nid, K(node_page_state(nid, NR_ACTIVE_FILE)),
nid, K(node_page_state(nid, NR_INACTIVE_FILE)),
nid, K(node_page_state(nid, NR_UNEVICTABLE)),
+ nid, K(node_page_state(nid, NR_LZFREE)),
nid, K(node_page_state(nid, NR_MLOCK)));

#ifdef CONFIG_HIGHMEM
diff --git a/drivers/staging/android/lowmemorykiller.c b/drivers/staging/android/lowmemorykiller.c
index 872bd603fd0d..658c16a653c2 100644
--- a/drivers/staging/android/lowmemorykiller.c
+++ b/drivers/staging/android/lowmemorykiller.c
@@ -72,7 +72,8 @@ static unsigned long lowmem_count(struct shrinker *s,
return global_page_state(NR_ACTIVE_ANON) +
global_page_state(NR_ACTIVE_FILE) +
global_page_state(NR_INACTIVE_ANON) +
- global_page_state(NR_INACTIVE_FILE);
+ global_page_state(NR_INACTIVE_FILE) +
+ global_page_state(NR_LZFREE);
}

static unsigned long lowmem_scan(struct shrinker *s, struct shrink_control *sc)
diff --git a/fs/proc/meminfo.c b/fs/proc/meminfo.c
index d3ebf2e61853..3444f7c4e0b6 100644
--- a/fs/proc/meminfo.c
+++ b/fs/proc/meminfo.c
@@ -102,6 +102,7 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
"Active(file): %8lu kB\n"
"Inactive(file): %8lu kB\n"
"Unevictable: %8lu kB\n"
+ "LazyFree: %8lu kB\n"
"Mlocked: %8lu kB\n"
#ifdef CONFIG_HIGHMEM
"HighTotal: %8lu kB\n"
@@ -159,6 +160,7 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
K(pages[LRU_ACTIVE_FILE]),
K(pages[LRU_INACTIVE_FILE]),
K(pages[LRU_UNEVICTABLE]),
+ K(pages[LRU_LZFREE]),
K(global_page_state(NR_MLOCK)),
#ifdef CONFIG_HIGHMEM
K(i.totalhigh),
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index e9db238a75c1..6eb54a6ed5d0 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -1,6 +1,8 @@
#ifndef _LINUX_HUGE_MM_H
#define _LINUX_HUGE_MM_H

+struct pagevec;
+
extern int do_huge_pmd_anonymous_page(struct mm_struct *mm,
struct vm_area_struct *vma,
unsigned long address, pmd_t *pmd,
@@ -21,7 +23,8 @@ extern struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,
unsigned int flags);
extern int madvise_free_huge_pmd(struct mmu_gather *tlb,
struct vm_area_struct *vma,
- pmd_t *pmd, unsigned long addr);
+ pmd_t *pmd, unsigned long addr,
+ struct pagevec *pvec);
extern int zap_huge_pmd(struct mmu_gather *tlb,
struct vm_area_struct *vma,
pmd_t *pmd, unsigned long addr);
diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
index 5e08a354f936..b34e511e90ae 100644
--- a/include/linux/mm_inline.h
+++ b/include/linux/mm_inline.h
@@ -26,6 +26,10 @@ static __always_inline void add_page_to_lru_list(struct page *page,
struct lruvec *lruvec, enum lru_list lru)
{
int nr_pages = hpage_nr_pages(page);
+
+ if (lru == LRU_LZFREE)
+ VM_BUG_ON_PAGE(PageActive(page), page);
+
mem_cgroup_update_lru_size(lruvec, lru, nr_pages);
list_add(&page->lru, &lruvec->lists[lru]);
__mod_zone_page_state(lruvec_zone(lruvec), NR_LRU_BASE + lru, nr_pages);
@@ -35,6 +39,10 @@ static __always_inline void del_page_from_lru_list(struct page *page,
struct lruvec *lruvec, enum lru_list lru)
{
int nr_pages = hpage_nr_pages(page);
+
+ if (lru == LRU_LZFREE)
+ VM_BUG_ON_PAGE(!PageLazyFree(page), page);
+
mem_cgroup_update_lru_size(lruvec, lru, -nr_pages);
list_del(&page->lru);
__mod_zone_page_state(lruvec_zone(lruvec), NR_LRU_BASE + lru, -nr_pages);
@@ -46,12 +54,14 @@ static __always_inline void del_page_from_lru_list(struct page *page,
*
* Used for LRU list index arithmetic.
*
- * Returns the base LRU type - file or anon - @page should be on.
+ * Returns the base LRU type - file or anon or lazyfree - @page should be on.
*/
static inline enum lru_list page_lru_base_type(struct page *page)
{
if (page_is_file_cache(page))
return LRU_INACTIVE_FILE;
+ if (PageLazyFree(page))
+ return LRU_LZFREE;
return LRU_INACTIVE_ANON;
}

@@ -60,7 +70,7 @@ static inline enum lru_list page_lru_base_type(struct page *page)
*
* Used for LRU list index arithmetic.
*
- * Returns 0 if @lru is anon, 1 if it is file.
+ * Returns 0 if @lru is anon, 1 if it is file, 2 if it is lazyfree
*/
static inline int lru_index(enum lru_list lru)
{
@@ -75,6 +85,9 @@ static inline int lru_index(enum lru_list lru)
case LRU_ACTIVE_FILE:
base = 1;
break;
+ case LRU_LZFREE:
+ base = 2;
+ break;
default:
BUG();
}
@@ -94,6 +107,8 @@ static inline int page_off_isolate(struct page *page)

if (!PageSwapBacked(page))
lru = NR_ISOLATED_FILE;
+ else if (PageLazyFree(page))
+ lru = NR_ISOLATED_LZFREE;
return lru;
}

@@ -106,10 +121,12 @@ static inline int page_off_isolate(struct page *page)
*/
static inline int lru_off_isolate(enum lru_list lru)
{
- int base = NR_ISOLATED_FILE;
+ int base = NR_ISOLATED_LZFREE;

if (lru <= LRU_ACTIVE_ANON)
base = NR_ISOLATED_ANON;
+ else if (lru <= LRU_ACTIVE_FILE)
+ base = NR_ISOLATED_FILE;
return base;
}

@@ -154,6 +171,8 @@ static __always_inline enum lru_list page_lru(struct page *page)
lru = page_lru_base_type(page);
if (PageActive(page))
lru += LRU_ACTIVE;
+ if (lru == LRU_LZFREE + LRU_ACTIVE)
+ lru = LRU_ACTIVE_ANON;
}
return lru;
}
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index d94347737292..1aaa436da0d5 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -121,6 +121,7 @@ enum zone_stat_item {
NR_INACTIVE_FILE, /* " " " " " */
NR_ACTIVE_FILE, /* " " " " " */
NR_UNEVICTABLE, /* " " " " " */
+ NR_LZFREE, /* " " " " " */
NR_MLOCK, /* mlock()ed pages found and moved off LRU */
NR_ANON_PAGES, /* Mapped anonymous pages */
NR_FILE_MAPPED, /* pagecache pages mapped into pagetables.
@@ -140,6 +141,7 @@ enum zone_stat_item {
NR_WRITEBACK_TEMP, /* Writeback using temporary buffers */
NR_ISOLATED_ANON, /* Temporary isolated pages from anon lru */
NR_ISOLATED_FILE, /* Temporary isolated pages from file lru */
+ NR_ISOLATED_LZFREE, /* Temporary isolated pages from lzfree lru */
NR_SHMEM, /* shmem pages (included tmpfs/GEM pages) */
NR_DIRTIED, /* page dirtyings since bootup */
NR_WRITTEN, /* page writings since bootup */
@@ -178,6 +180,7 @@ enum lru_list {
LRU_INACTIVE_FILE = LRU_BASE + LRU_FILE,
LRU_ACTIVE_FILE = LRU_BASE + LRU_FILE + LRU_ACTIVE,
LRU_UNEVICTABLE,
+ LRU_LZFREE,
NR_LRU_LISTS
};

@@ -207,10 +210,11 @@ struct zone_reclaim_stat {
* The higher the rotated/scanned ratio, the more valuable
* that cache is.
*
- * The anon LRU stats live in [0], file LRU stats in [1]
+ * The anon LRU stats live in [0], file LRU stats in [1],
+ * lazyfree LRU stats in [2]
*/
- unsigned long recent_rotated[2];
- unsigned long recent_scanned[2];
+ unsigned long recent_rotated[3];
+ unsigned long recent_scanned[3];
};

struct lruvec {
@@ -224,6 +228,7 @@ struct lruvec {
/* Mask used at gathering information at once (see memcontrol.c) */
#define LRU_ALL_FILE (BIT(LRU_INACTIVE_FILE) | BIT(LRU_ACTIVE_FILE))
#define LRU_ALL_ANON (BIT(LRU_INACTIVE_ANON) | BIT(LRU_ACTIVE_ANON))
+#define LRU_ALL_LZFREE (BIT(LRU_LZFREE))
#define LRU_ALL ((1 << NR_LRU_LISTS) - 1)

/* Isolate clean file */
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 416509e26d6d..14f0643af5c4 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -115,6 +115,9 @@ enum pageflags {
#endif
__NR_PAGEFLAGS,

+ /* MADV_FREE */
+ PG_lazyfree = PG_mappedtodisk,
+
/* Filesystems */
PG_checked = PG_owner_priv_1,

@@ -343,6 +346,8 @@ TESTPAGEFLAG_FALSE(Ksm)

u64 stable_page_flags(struct page *page);

+PAGEFLAG(LazyFree, lazyfree);
+
static inline int PageUptodate(struct page *page)
{
int ret = test_bit(PG_uptodate, &(page)->flags);
diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index f4c992826242..edace84b45d5 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -85,7 +85,7 @@ enum ttu_flags {
TTU_UNMAP = 1, /* unmap mode */
TTU_MIGRATION = 2, /* migration mode */
TTU_MUNLOCK = 4, /* munlock mode */
- TTU_FREE = 8, /* free mode */
+ TTU_LZFREE = 8, /* lazyfree mode */

TTU_IGNORE_MLOCK = (1 << 8), /* ignore mlock */
TTU_IGNORE_ACCESS = (1 << 9), /* don't age */
diff --git a/include/linux/swap.h b/include/linux/swap.h
index f629df4cc13d..c484339b46b6 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -14,8 +14,8 @@
#include <asm/page.h>

struct notifier_block;
-
struct bio;
+struct pagevec;

#define SWAP_FLAG_PREFER 0x8000 /* set if swap priority specified */
#define SWAP_FLAG_PRIO_MASK 0x7fff
@@ -308,7 +308,8 @@ extern void lru_add_drain_cpu(int cpu);
extern void lru_add_drain_all(void);
extern void rotate_reclaimable_page(struct page *page);
extern void deactivate_file_page(struct page *page);
-extern void deactivate_page(struct page *page);
+extern void drain_lazyfree_pagevec(struct pagevec *pvec);
+extern int add_page_to_lazyfree_list(struct page *page, struct pagevec *pvec);
extern void swap_setup(void);

extern void add_page_to_unevictable_list(struct page *page);
diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index 2b1cef88b827..7ebfd7ca992d 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -23,9 +23,9 @@

enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
FOR_ALL_ZONES(PGALLOC),
- PGFREE, PGACTIVATE, PGDEACTIVATE,
+ PGFREE, PGACTIVATE, PGDEACTIVATE, PGLZFREE,
PGFAULT, PGMAJFAULT,
- PGLAZYFREED,
+ PGLZFREED,
FOR_ALL_ZONES(PGREFILL),
FOR_ALL_ZONES(PGSTEAL_KSWAPD),
FOR_ALL_ZONES(PGSTEAL_DIRECT),
diff --git a/include/trace/events/vmscan.h b/include/trace/events/vmscan.h
index 4e9e86733849..a7ce9169b0fa 100644
--- a/include/trace/events/vmscan.h
+++ b/include/trace/events/vmscan.h
@@ -12,28 +12,32 @@

#define RECLAIM_WB_ANON 0x0001u
#define RECLAIM_WB_FILE 0x0002u
+#define RECLAIM_WB_LZFREE 0x0004u
#define RECLAIM_WB_MIXED 0x0010u
-#define RECLAIM_WB_SYNC 0x0004u /* Unused, all reclaim async */
-#define RECLAIM_WB_ASYNC 0x0008u
+#define RECLAIM_WB_SYNC 0x0040u /* Unused, all reclaim async */
+#define RECLAIM_WB_ASYNC 0x0080u

#define show_reclaim_flags(flags) \
(flags) ? __print_flags(flags, "|", \
{RECLAIM_WB_ANON, "RECLAIM_WB_ANON"}, \
{RECLAIM_WB_FILE, "RECLAIM_WB_FILE"}, \
+ {RECLAIM_WB_LZFREE, "RECLAIM_WB_LZFREE"}, \
{RECLAIM_WB_MIXED, "RECLAIM_WB_MIXED"}, \
{RECLAIM_WB_SYNC, "RECLAIM_WB_SYNC"}, \
{RECLAIM_WB_ASYNC, "RECLAIM_WB_ASYNC"} \
) : "RECLAIM_WB_NONE"

#define trace_reclaim_flags(page) ( \
- (page_is_file_cache(page) ? RECLAIM_WB_FILE : RECLAIM_WB_ANON) | \
- (RECLAIM_WB_ASYNC) \
+ (page_is_file_cache(page) ? RECLAIM_WB_FILE : \
+ (PageLazyFree(page) ? RECLAIM_WB_LZFREE : \
+ RECLAIM_WB_ANON)) | (RECLAIM_WB_ASYNC) \
)

-#define trace_shrink_flags(lru) \
+#define trace_shrink_flags(lru_idx) \
( \
- (lru ? RECLAIM_WB_FILE : RECLAIM_WB_ANON) | \
- (RECLAIM_WB_ASYNC) \
+ (lru_idx == 1 ? RECLAIM_WB_FILE : (lru_idx == 0 ? \
+ RECLAIM_WB_ANON : RECLAIM_WB_LZFREE)) | \
+ (RECLAIM_WB_ASYNC) \
)

TRACE_EVENT(mm_vmscan_kswapd_sleep,
diff --git a/mm/compaction.c b/mm/compaction.c
index d888fa248ebb..cc40c766de38 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -626,7 +626,7 @@ isolate_freepages_range(struct compact_control *cc,
static void acct_isolated(struct zone *zone, struct compact_control *cc)
{
struct page *page;
- unsigned int count[2] = { 0, };
+ unsigned int count[3] = { 0, };

if (list_empty(&cc->migratepages))
return;
@@ -636,21 +636,25 @@ static void acct_isolated(struct zone *zone, struct compact_control *cc)

mod_zone_page_state(zone, NR_ISOLATED_ANON, count[0]);
mod_zone_page_state(zone, NR_ISOLATED_FILE, count[1]);
+ mod_zone_page_state(zone, NR_ISOLATED_LZFREE, count[2]);
}

/* Similar to reclaim, but different enough that they don't share logic */
static bool too_many_isolated(struct zone *zone)
{
- unsigned long active, inactive, isolated;
+ unsigned long active, inactive, lzfree, isolated;

inactive = zone_page_state(zone, NR_INACTIVE_FILE) +
zone_page_state(zone, NR_INACTIVE_ANON);
active = zone_page_state(zone, NR_ACTIVE_FILE) +
zone_page_state(zone, NR_ACTIVE_ANON);
+ lzfree = zone_page_state(zone, NR_LZFREE);
+
isolated = zone_page_state(zone, NR_ISOLATED_FILE) +
- zone_page_state(zone, NR_ISOLATED_ANON);
+ zone_page_state(zone, NR_ISOLATED_ANON) +
+ zone_page_state(zone, NR_ISOLATED_LZFREE);

- return isolated > (inactive + active) / 2;
+ return isolated > (inactive + active + lzfree) / 2;
}

/**
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 7a48c3d4f92e..4277740494a0 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1454,7 +1454,7 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
}

int madvise_free_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
- pmd_t *pmd, unsigned long addr)
+ pmd_t *pmd, unsigned long addr, struct pagevec *pvec)

{
spinlock_t *ptl;
@@ -1478,18 +1478,18 @@ int madvise_free_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
ClearPageDirty(page);
unlock_page(page);

- if (PageActive(page))
- deactivate_page(page);
-
if (pmd_young(orig_pmd) || pmd_dirty(orig_pmd)) {
orig_pmd = pmdp_huge_get_and_clear_full(tlb->mm, addr, pmd,
tlb->fullmm);
orig_pmd = pmd_mkold(orig_pmd);
orig_pmd = pmd_mkclean(orig_pmd);
-
+ SetPageDirty(page);
set_pmd_at(mm, addr, pmd, orig_pmd);
tlb_remove_pmd_tlb_entry(tlb, pmd, addr);
}
+
+ add_page_to_lazyfree_list(page, pvec);
+ drain_lazyfree_pagevec(pvec);
out:
spin_unlock(ptl);

@@ -1795,6 +1795,7 @@ static void __split_huge_page_refcount(struct page *page,
(1L << PG_mlocked) |
(1L << PG_uptodate) |
(1L << PG_active) |
+ (1L << PG_lazyfree) |
(1L << PG_unevictable) |
(1L << PG_dirty)));

diff --git a/mm/madvise.c b/mm/madvise.c
index 982484fb44ca..e0836c870980 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -21,6 +21,7 @@
#include <linux/swap.h>
#include <linux/swapops.h>
#include <linux/mmu_notifier.h>
+#include <linux/pagevec.h>

#include <asm/tlb.h>

@@ -272,12 +273,15 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
struct page *page;
int nr_swap = 0;
unsigned long next;
+ struct pagevec pvec;
+
+ pagevec_init(&pvec, 0);

next = pmd_addr_end(addr, end);
if (pmd_trans_huge(*pmd)) {
if (next - addr != HPAGE_PMD_SIZE)
split_huge_page_pmd(vma, addr, pmd);
- else if (!madvise_free_huge_pmd(tlb, vma, pmd, addr))
+ else if (!madvise_free_huge_pmd(tlb, vma, pmd, addr, &pvec))
goto next;
/* fall through */
}
@@ -315,24 +319,17 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,

VM_BUG_ON_PAGE(PageTransCompound(page), page);

- if (PageSwapCache(page) || PageDirty(page)) {
+ if (page_mapcount(page) != 1)
+ continue;
+
+ if (PageSwapCache(page)) {
if (!trylock_page(page))
continue;
- /*
- * If page is shared with others, we couldn't clear
- * PG_dirty of the page.
- */
- if (page_mapcount(page) != 1) {
+ if (PageSwapCache(page) &&
+ !try_to_free_swap(page)) {
unlock_page(page);
continue;
}
-
- if (PageSwapCache(page) && !try_to_free_swap(page)) {
- unlock_page(page);
- continue;
- }
-
- ClearPageDirty(page);
unlock_page(page);
}

@@ -348,11 +345,21 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,

ptent = pte_mkold(ptent);
ptent = pte_mkclean(ptent);
+ /*
+			 * The page could lose its pte dirty bit before it
+			 * reaches the lazyfree LRU list, and then it would
+			 * be freed without being paged out.
+			 * So move the dirtiness into page->flags; if the
+			 * page reaches the lazyfree list successfully,
+			 * lru_lazyfree_fn will clear it again.
+ */
+ SetPageDirty(page);
set_pte_at(mm, addr, pte, ptent);
- if (PageActive(page))
- deactivate_page(page);
tlb_remove_tlb_entry(tlb, pte, addr);
}
+
+ if (add_page_to_lazyfree_list(page, &pvec) == 0)
+ drain_lazyfree_pagevec(&pvec);
}

if (nr_swap) {
@@ -362,6 +369,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
add_mm_counter(mm, MM_SWAPENTS, nr_swap);
}

+ drain_lazyfree_pagevec(&pvec);
arch_leave_lazy_mmu_mode();
pte_unmap_unlock(pte - 1, ptl);
cond_resched();
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index c57c4423c688..1dc599ce1bcb 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -109,6 +109,7 @@ static const char * const mem_cgroup_lru_names[] = {
"inactive_file",
"active_file",
"unevictable",
+ "lazyfree",
};

#define THRESHOLDS_EVENTS_TARGET 128
@@ -1402,6 +1403,8 @@ static void mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
static bool test_mem_cgroup_node_reclaimable(struct mem_cgroup *memcg,
int nid, bool noswap)
{
+ if (mem_cgroup_node_nr_lru_pages(memcg, nid, LRU_ALL_LZFREE))
+ return true;
if (mem_cgroup_node_nr_lru_pages(memcg, nid, LRU_ALL_FILE))
return true;
if (noswap || !total_swap_pages)
@@ -3120,6 +3123,7 @@ static int memcg_numa_stat_show(struct seq_file *m, void *v)
{ "total", LRU_ALL },
{ "file", LRU_ALL_FILE },
{ "anon", LRU_ALL_ANON },
+ { "lazyfree", LRU_ALL_LZFREE },
{ "unevictable", BIT(LRU_UNEVICTABLE) },
};
const struct numa_stat *stat;
@@ -3231,8 +3235,8 @@ static int memcg_stat_show(struct seq_file *m, void *v)
int nid, zid;
struct mem_cgroup_per_zone *mz;
struct zone_reclaim_stat *rstat;
- unsigned long recent_rotated[2] = {0, 0};
- unsigned long recent_scanned[2] = {0, 0};
+ unsigned long recent_rotated[3] = {0, 0};
+ unsigned long recent_scanned[3] = {0, 0};

for_each_online_node(nid)
for (zid = 0; zid < MAX_NR_ZONES; zid++) {
@@ -3241,13 +3245,19 @@ static int memcg_stat_show(struct seq_file *m, void *v)

recent_rotated[0] += rstat->recent_rotated[0];
recent_rotated[1] += rstat->recent_rotated[1];
+ recent_rotated[2] += rstat->recent_rotated[2];
recent_scanned[0] += rstat->recent_scanned[0];
recent_scanned[1] += rstat->recent_scanned[1];
+ recent_scanned[2] += rstat->recent_scanned[2];
}
seq_printf(m, "recent_rotated_anon %lu\n", recent_rotated[0]);
seq_printf(m, "recent_rotated_file %lu\n", recent_rotated[1]);
+ seq_printf(m, "recent_rotated_lzfree %lu\n",
+ recent_rotated[2]);
seq_printf(m, "recent_scanned_anon %lu\n", recent_scanned[0]);
seq_printf(m, "recent_scanned_file %lu\n", recent_scanned[1]);
+ seq_printf(m, "recent_scanned_lzfree %lu\n",
+ recent_scanned[2]);
}
#endif

diff --git a/mm/migrate.c b/mm/migrate.c
index 87ebf0833b84..945e5655cd69 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -508,6 +508,8 @@ void migrate_page_copy(struct page *newpage, struct page *page)
SetPageChecked(newpage);
if (PageMappedToDisk(page))
SetPageMappedToDisk(newpage);
+ if (PageLazyFree(page))
+ SetPageLazyFree(newpage);

if (PageDirty(page)) {
clear_page_dirty_for_io(page);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 48aaf7b9f253..d6a42c905b0b 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3712,6 +3712,7 @@ void show_free_areas(unsigned int filter)

printk("active_anon:%lu inactive_anon:%lu isolated_anon:%lu\n"
" active_file:%lu inactive_file:%lu isolated_file:%lu\n"
+ " lazy_free:%lu isolated_lazyfree:%lu\n"
" unevictable:%lu dirty:%lu writeback:%lu unstable:%lu\n"
" slab_reclaimable:%lu slab_unreclaimable:%lu\n"
" mapped:%lu shmem:%lu pagetables:%lu bounce:%lu\n"
@@ -3722,6 +3723,8 @@ void show_free_areas(unsigned int filter)
global_page_state(NR_ACTIVE_FILE),
global_page_state(NR_INACTIVE_FILE),
global_page_state(NR_ISOLATED_FILE),
+ global_page_state(NR_LZFREE),
+ global_page_state(NR_ISOLATED_LZFREE),
global_page_state(NR_UNEVICTABLE),
global_page_state(NR_FILE_DIRTY),
global_page_state(NR_WRITEBACK),
@@ -3756,9 +3759,11 @@ void show_free_areas(unsigned int filter)
" inactive_anon:%lukB"
" active_file:%lukB"
" inactive_file:%lukB"
+ " lazyfree:%lukB"
" unevictable:%lukB"
" isolated(anon):%lukB"
" isolated(file):%lukB"
+ " isolated(lazy):%lukB"
" present:%lukB"
" managed:%lukB"
" mlocked:%lukB"
@@ -3788,9 +3793,11 @@ void show_free_areas(unsigned int filter)
K(zone_page_state(zone, NR_INACTIVE_ANON)),
K(zone_page_state(zone, NR_ACTIVE_FILE)),
K(zone_page_state(zone, NR_INACTIVE_FILE)),
+ K(zone_page_state(zone, NR_LZFREE)),
K(zone_page_state(zone, NR_UNEVICTABLE)),
K(zone_page_state(zone, NR_ISOLATED_ANON)),
K(zone_page_state(zone, NR_ISOLATED_FILE)),
+ K(zone_page_state(zone, NR_ISOLATED_LZFREE)),
K(zone->present_pages),
K(zone->managed_pages),
K(zone_page_state(zone, NR_MLOCK)),
diff --git a/mm/rmap.c b/mm/rmap.c
index 9449e91839ab..75bd68bc8abc 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1374,10 +1374,17 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
swp_entry_t entry = { .val = page_private(page) };
pte_t swp_pte;

- if (!PageDirty(page) && (flags & TTU_FREE)) {
- /* It's a freeable page by MADV_FREE */
- dec_mm_counter(mm, MM_ANONPAGES);
- goto discard;
+ if ((flags & TTU_LZFREE)) {
+ VM_BUG_ON_PAGE(!PageLazyFree(page), page);
+ if (!PageDirty(page)) {
+ /* It's a freeable page by MADV_FREE */
+ dec_mm_counter(mm, MM_ANONPAGES);
+ goto discard;
+ } else {
+ set_pte_at(mm, address, pte, pteval);
+ ret = SWAP_FAIL;
+ goto out_unmap;
+ }
}

if (PageSwapCache(page)) {
diff --git a/mm/swap.c b/mm/swap.c
index ac1c6be4381f..b88c59c2f1e8 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -45,7 +45,6 @@ int page_cluster;
static DEFINE_PER_CPU(struct pagevec, lru_add_pvec);
static DEFINE_PER_CPU(struct pagevec, lru_rotate_pvecs);
static DEFINE_PER_CPU(struct pagevec, lru_deactivate_file_pvecs);
-static DEFINE_PER_CPU(struct pagevec, lru_deactivate_pvecs);

/*
* This path almost never happens for VM activity - pages are normally
@@ -508,6 +507,10 @@ static void __activate_page(struct page *page, struct lruvec *lruvec,

del_page_from_lru_list(page, lruvec, lru);
SetPageActive(page);
+ if (lru == LRU_LZFREE) {
+ ClearPageLazyFree(page);
+ lru = LRU_INACTIVE_ANON;
+ }
lru += LRU_ACTIVE;
add_page_to_lru_list(page, lruvec, lru);
trace_mm_lru_activate(page);
@@ -799,20 +802,41 @@ static void lru_deactivate_file_fn(struct page *page, struct lruvec *lruvec,
update_page_reclaim_stat(lruvec, lru_index(lru), 0);
}

-
-static void lru_deactivate_fn(struct page *page, struct lruvec *lruvec,
- void *arg)
+static void lru_lazyfree_fn(struct page *page, struct lruvec *lruvec,
+ void *arg)
{
- if (PageLRU(page) && PageActive(page) && !PageUnevictable(page)) {
- enum lru_list lru = page_lru_base_type(page);
+ VM_BUG_ON_PAGE(!PageAnon(page), page);
+
+ if (PageLRU(page) && !PageLazyFree(page) &&
+ !PageUnevictable(page)) {
+ unsigned int nr_pages = PageTransHuge(page) ? HPAGE_PMD_NR : 1;
+ bool active = PageActive(page);

- del_page_from_lru_list(page, lruvec, lru + LRU_ACTIVE);
+ if (!trylock_page(page))
+ return;
+ if (PageSwapCache(page))
+ return;
+ if (PageDirty(page)) {
+ /*
+			 * If the page is shared with others, we cannot
+			 * clear its PG_dirty flag.
+ */
+ if (page_count(page) != 2) {
+ unlock_page(page);
+ return;
+ }
+ ClearPageDirty(page);
+ }
+
+ del_page_from_lru_list(page, lruvec,
+ LRU_INACTIVE_ANON + active);
ClearPageActive(page);
- ClearPageReferenced(page);
- add_page_to_lru_list(page, lruvec, lru);
+ SetPageLazyFree(page);
+ add_page_to_lru_list(page, lruvec, LRU_LZFREE);
+ unlock_page(page);

- __count_vm_event(PGDEACTIVATE);
- update_page_reclaim_stat(lruvec, lru_index(lru), 0);
+ count_vm_events(PGLZFREE, nr_pages);
+ update_page_reclaim_stat(lruvec, 2, 0);
}
}

@@ -842,11 +866,25 @@ void lru_add_drain_cpu(int cpu)
if (pagevec_count(pvec))
pagevec_lru_move_fn(pvec, lru_deactivate_file_fn, NULL);

- pvec = &per_cpu(lru_deactivate_pvecs, cpu);
+ activate_page_drain(cpu);
+}
+
+/*
+ * Drain the caller's lazyfree pagevec before releasing the pte lock
+ */
+void drain_lazyfree_pagevec(struct pagevec *pvec)
+{
if (pagevec_count(pvec))
- pagevec_lru_move_fn(pvec, lru_deactivate_fn, NULL);
+ pagevec_lru_move_fn(pvec, lru_lazyfree_fn, NULL);
+}

- activate_page_drain(cpu);
+int add_page_to_lazyfree_list(struct page *page, struct pagevec *pvec)
+{
+ if (PageLRU(page) && !PageLazyFree(page) && !PageUnevictable(page)) {
+ page_cache_get(page);
+ pagevec_add(pvec, page);
+ }
+ return pagevec_space(pvec);
}

/**
@@ -875,26 +913,6 @@ void deactivate_file_page(struct page *page)
}
}

-/**
- * deactivate_page - deactivate a page
- * @page: page to deactivate
- *
- * deactivate_page() moves @page to the inactive list if @page was on the active
- * list and was not an unevictable page. This is done to accelerate the reclaim
- * of @page.
- */
-void deactivate_page(struct page *page)
-{
- if (PageLRU(page) && PageActive(page) && !PageUnevictable(page)) {
- struct pagevec *pvec = &get_cpu_var(lru_deactivate_pvecs);
-
- page_cache_get(page);
- if (!pagevec_add(pvec, page))
- pagevec_lru_move_fn(pvec, lru_deactivate_fn, NULL);
- put_cpu_var(lru_deactivate_pvecs);
- }
-}
-
void lru_add_drain(void)
{
lru_add_drain_cpu(get_cpu());
@@ -924,7 +942,6 @@ void lru_add_drain_all(void)
if (pagevec_count(&per_cpu(lru_add_pvec, cpu)) ||
pagevec_count(&per_cpu(lru_rotate_pvecs, cpu)) ||
pagevec_count(&per_cpu(lru_deactivate_file_pvecs, cpu)) ||
- pagevec_count(&per_cpu(lru_deactivate_pvecs, cpu)) ||
need_activate_page_drain(cpu)) {
INIT_WORK(work, lru_add_drain_per_cpu);
schedule_work_on(cpu, work);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 80dff84ba673..8efe30ceec3a 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -197,7 +197,8 @@ static unsigned long zone_reclaimable_pages(struct zone *zone)
int nr;

nr = zone_page_state(zone, NR_ACTIVE_FILE) +
- zone_page_state(zone, NR_INACTIVE_FILE);
+ zone_page_state(zone, NR_INACTIVE_FILE) +
+ zone_page_state(zone, NR_LZFREE);

if (get_nr_swap_pages() > 0)
nr += zone_page_state(zone, NR_ACTIVE_ANON) +
@@ -918,6 +919,8 @@ static unsigned long shrink_page_list(struct list_head *page_list,

VM_BUG_ON_PAGE(PageActive(page), page);
VM_BUG_ON_PAGE(page_zone(page) != zone, page);
+ VM_BUG_ON_PAGE((ttu_flags & TTU_LZFREE) &&
+ !PageLazyFree(page), page);

sc->nr_scanned++;

@@ -1050,7 +1053,8 @@ static unsigned long shrink_page_list(struct list_head *page_list,
goto keep_locked;
if (!add_to_swap(page, page_list))
goto activate_locked;
- freeable = true;
+ if (ttu_flags & TTU_LZFREE)
+ freeable = true;
may_enter_fs = 1;

/* Adding to swap updated mapping */
@@ -1063,8 +1067,9 @@ static unsigned long shrink_page_list(struct list_head *page_list,
*/
if (page_mapped(page) && mapping) {
switch (try_to_unmap(page, freeable ?
- (ttu_flags | TTU_BATCH_FLUSH | TTU_FREE) :
- (ttu_flags | TTU_BATCH_FLUSH))) {
+ (ttu_flags | TTU_BATCH_FLUSH) :
+ ((ttu_flags & ~TTU_LZFREE) |
+ TTU_BATCH_FLUSH))) {
case SWAP_FAIL:
goto activate_locked;
case SWAP_AGAIN:
@@ -1190,7 +1195,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
__clear_page_locked(page);
free_it:
if (freeable && !PageDirty(page))
- count_vm_event(PGLAZYFREED);
+ count_vm_event(PGLZFREED);

nr_reclaimed++;

@@ -1458,7 +1463,7 @@ int isolate_lru_page(struct page *page)
* the LRU list will go small and be scanned faster than necessary, leading to
* unnecessary swapping, thrashing and OOM.
*/
-static int too_many_isolated(struct zone *zone, int file,
+static int too_many_isolated(struct zone *zone, int lru_index,
struct scan_control *sc)
{
unsigned long inactive, isolated;
@@ -1469,12 +1474,21 @@ static int too_many_isolated(struct zone *zone, int file,
if (!sane_reclaim(sc))
return 0;

- if (file) {
- inactive = zone_page_state(zone, NR_INACTIVE_FILE);
- isolated = zone_page_state(zone, NR_ISOLATED_FILE);
- } else {
+ switch (lru_index) {
+ case 0:
inactive = zone_page_state(zone, NR_INACTIVE_ANON);
isolated = zone_page_state(zone, NR_ISOLATED_ANON);
+ break;
+ case 1:
+ inactive = zone_page_state(zone, NR_INACTIVE_FILE);
+ isolated = zone_page_state(zone, NR_ISOLATED_FILE);
+ break;
+ case 2:
+ inactive = zone_page_state(zone, NR_LZFREE);
+ isolated = zone_page_state(zone, NR_ISOLATED_LZFREE);
+ break;
+ default:
+ BUG_ON(lru_index);
}

/*
@@ -1489,7 +1503,8 @@ static int too_many_isolated(struct zone *zone, int file,
}

static noinline_for_stack void
-putback_inactive_pages(struct lruvec *lruvec, struct list_head *page_list)
+putback_inactive_pages(struct lruvec *lruvec, enum lru_list old_lru,
+ struct list_head *page_list)
{
struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;
struct zone *zone = lruvec_zone(lruvec);
@@ -1500,7 +1515,7 @@ putback_inactive_pages(struct lruvec *lruvec, struct list_head *page_list)
*/
while (!list_empty(page_list)) {
struct page *page = lru_to_page(page_list);
- int lru;
+ int new_lru;

VM_BUG_ON_PAGE(PageLRU(page), page);
list_del(&page->lru);
@@ -1514,18 +1529,20 @@ putback_inactive_pages(struct lruvec *lruvec, struct list_head *page_list)
lruvec = mem_cgroup_page_lruvec(page, zone);

SetPageLRU(page);
- lru = page_lru(page);
- add_page_to_lru_list(page, lruvec, lru);
+ new_lru = page_lru(page);
+ if (old_lru == LRU_LZFREE && new_lru == LRU_ACTIVE_ANON)
+ ClearPageLazyFree(page);

- if (is_active_lru(lru)) {
+ add_page_to_lru_list(page, lruvec, new_lru);
+ if (PageActive(page)) {
int numpages = hpage_nr_pages(page);
- reclaim_stat->recent_rotated[lru_index(lru)]
+ reclaim_stat->recent_rotated[lru_index(old_lru)]
+= numpages;
}
if (put_page_testzero(page)) {
__ClearPageLRU(page);
__ClearPageActive(page);
- del_page_from_lru_list(page, lruvec, lru);
+ del_page_from_lru_list(page, lruvec, new_lru);

if (unlikely(PageCompound(page))) {
spin_unlock_irq(&zone->lru_lock);
@@ -1578,7 +1595,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
struct zone *zone = lruvec_zone(lruvec);
struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;

- while (unlikely(too_many_isolated(zone, file, sc))) {
+ while (unlikely(too_many_isolated(zone, lru_index(lru), sc))) {
congestion_wait(BLK_RW_ASYNC, HZ/10);

/* We are about to die and free our memory. Return now. */
@@ -1613,7 +1630,10 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
if (nr_taken == 0)
return 0;

- nr_reclaimed = shrink_page_list(&page_list, zone, sc, TTU_UNMAP,
+ nr_reclaimed = shrink_page_list(&page_list, zone, sc,
+ (lru != LRU_LZFREE) ?
+ TTU_UNMAP :
+ TTU_UNMAP|TTU_LZFREE,
&nr_dirty, &nr_unqueued_dirty, &nr_congested,
&nr_writeback, &nr_immediate,
false);
@@ -1631,7 +1651,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
nr_reclaimed);
}

- putback_inactive_pages(lruvec, &page_list);
+ putback_inactive_pages(lruvec, lru, &page_list);

__mod_zone_page_state(zone, lru_off_isolate(lru), -nr_taken);

@@ -1701,7 +1721,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
zone_idx(zone),
nr_scanned, nr_reclaimed,
sc->priority,
- trace_shrink_flags(lru));
+ trace_shrink_flags(lru_index(lru)));
return nr_reclaimed;
}

@@ -2194,6 +2214,7 @@ static void shrink_lruvec(struct lruvec *lruvec, int swappiness,
unsigned long nr[NR_LRU_LISTS];
unsigned long targets[NR_LRU_LISTS];
unsigned long nr_to_scan;
+ unsigned long nr_to_scan_lzfree;
enum lru_list lru;
unsigned long nr_reclaimed = 0;
unsigned long nr_to_reclaim = sc->nr_to_reclaim;
@@ -2204,6 +2225,7 @@ static void shrink_lruvec(struct lruvec *lruvec, int swappiness,

/* Record the original scan target for proportional adjustments later */
memcpy(targets, nr, sizeof(nr));
+ nr_to_scan_lzfree = get_lru_size(lruvec, LRU_LZFREE);

/*
* Global reclaiming within direct reclaim at DEF_PRIORITY is a normal
@@ -2221,6 +2243,19 @@ static void shrink_lruvec(struct lruvec *lruvec, int swappiness,

init_tlb_ubc();

+ while (nr_to_scan_lzfree) {
+ nr_to_scan = min(nr_to_scan_lzfree, SWAP_CLUSTER_MAX);
+ nr_to_scan_lzfree -= nr_to_scan;
+
+ nr_reclaimed += shrink_inactive_list(nr_to_scan, lruvec,
+ sc, LRU_LZFREE);
+ }
+
+ if (nr_reclaimed >= nr_to_reclaim) {
+ sc->nr_reclaimed += nr_reclaimed;
+ return;
+ }
+
blk_start_plug(&plug);
while (nr[LRU_INACTIVE_ANON] || nr[LRU_ACTIVE_FILE] ||
nr[LRU_INACTIVE_FILE]) {
@@ -2364,6 +2399,7 @@ static inline bool should_continue_reclaim(struct zone *zone,
*/
pages_for_compaction = (2UL << sc->order);
inactive_lru_pages = zone_page_state(zone, NR_INACTIVE_FILE);
+ inactive_lru_pages += zone_page_state(zone, NR_LZFREE);
if (get_nr_swap_pages() > 0)
inactive_lru_pages += zone_page_state(zone, NR_INACTIVE_ANON);
if (sc->nr_reclaimed < pages_for_compaction &&
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 59d45b22355f..df95d9473bba 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -704,6 +704,7 @@ const char * const vmstat_text[] = {
"nr_inactive_file",
"nr_active_file",
"nr_unevictable",
+ "nr_lazyfree",
"nr_mlock",
"nr_anon_pages",
"nr_mapped",
@@ -721,6 +722,7 @@ const char * const vmstat_text[] = {
"nr_writeback_temp",
"nr_isolated_anon",
"nr_isolated_file",
+ "nr_isolated_lazyfree",
"nr_shmem",
"nr_dirtied",
"nr_written",
@@ -756,6 +758,7 @@ const char * const vmstat_text[] = {
"pgfree",
"pgactivate",
"pgdeactivate",
+ "pglazyfree",

"pgfault",
"pgmajfault",
--
1.9.1

2015-11-20 08:03:04

by Minchan Kim

[permalink] [raw]
Subject: [PATCH v4 15/16] mm: support MADV_FREE on swapless system

Historically, we have disabled reclaiming of anonymous pages
completely on systems where swap is off or not configured.
That made sense, but the problem for lazyfree pages is that
the reclaim path never gets a chance to discard MADV_FREE
hinted pages on those systems.

That's why the current MADV_FREE implementation drops pages
instantly, like MADV_DONTNEED, on swapless systems, so users
of those systems cannot get the benefit of MADV_FREE.

Now that we have the lazyfree LRU list to keep MADV_FREEed pages
without relying on the anonymous LRU, we can scan and discard
MADV_FREE pages on swapless systems as well.
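
A rough way to see the behavioural change on a swapless system (a
hand-written test sketch, not part of the patch; it assumes swap is
disabled and that MADV_FREE comes from this series' uapi headers):

	#include <assert.h>
	#include <string.h>
	#include <sys/mman.h>

	int main(void)
	{
		char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
			       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

		memset(p, 0xaa, 4096);
		madvise(p, 4096, MADV_FREE);
		/*
		 * With the old swapless fallback to MADV_DONTNEED the page
		 * was dropped immediately, so this read saw zero.  With the
		 * lazyfree LRU the old contents may survive until reclaim
		 * actually runs, so either value is acceptable here.
		 */
		assert(p[0] == (char)0xaa || p[0] == 0);
		return 0;
	}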

Signed-off-by: Minchan Kim <[email protected]>
---
mm/madvise.c | 7 +------
mm/swap_state.c | 6 ------
mm/vmscan.c | 37 +++++++++++++++++++++++++++----------
3 files changed, 28 insertions(+), 22 deletions(-)

diff --git a/mm/madvise.c b/mm/madvise.c
index e0836c870980..7ed5b3ea5872 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -557,12 +557,7 @@ madvise_vma(struct vm_area_struct *vma, struct vm_area_struct **prev,
case MADV_WILLNEED:
return madvise_willneed(vma, prev, start, end);
case MADV_FREE:
- /*
- * XXX: In this implementation, MADV_FREE works like
- * MADV_DONTNEED on swapless system or full swap.
- */
- if (get_nr_swap_pages() > 0)
- return madvise_free(vma, prev, start, end);
+ return madvise_free(vma, prev, start, end);
/* passthrough */
case MADV_DONTNEED:
return madvise_dontneed(vma, prev, start, end);
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 10f63eded7b7..49c683b02ee4 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -170,12 +170,6 @@ int add_to_swap(struct page *page, struct list_head *list)
if (!entry.val)
return 0;

- if (unlikely(PageTransHuge(page)))
- if (unlikely(split_huge_page_to_list(page, list))) {
- swapcache_free(entry);
- return 0;
- }
-
/*
* Radix-tree node allocations from PF_MEMALLOC contexts could
* completely exhaust the page allocator. __GFP_NOMEMALLOC
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 8efe30ceec3a..d9dfd034b963 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -611,13 +611,18 @@ static int __remove_mapping(struct address_space *mapping, struct page *page,
bool reclaimed)
{
unsigned long flags;
- struct mem_cgroup *memcg;
+ struct mem_cgroup *memcg = NULL;
+ int expected = mapping ? 2 : 1;

BUG_ON(!PageLocked(page));
BUG_ON(mapping != page_mapping(page));
+ VM_BUG_ON_PAGE(mapping == NULL && !PageLazyFree(page), page);
+
+ if (mapping) {
+ memcg = mem_cgroup_begin_page_stat(page);
+ spin_lock_irqsave(&mapping->tree_lock, flags);
+ }

- memcg = mem_cgroup_begin_page_stat(page);
- spin_lock_irqsave(&mapping->tree_lock, flags);
/*
* The non racy check for a busy page.
*
@@ -643,14 +648,18 @@ static int __remove_mapping(struct address_space *mapping, struct page *page,
* Note that if SetPageDirty is always performed via set_page_dirty,
* and thus under tree_lock, then this ordering is not required.
*/
- if (!page_freeze_refs(page, 2))
+ if (!page_freeze_refs(page, expected))
goto cannot_free;
/* note: atomic_cmpxchg in page_freeze_refs provides the smp_rmb */
if (unlikely(PageDirty(page))) {
- page_unfreeze_refs(page, 2);
+ page_unfreeze_refs(page, expected);
goto cannot_free;
}

+ /* No more work to do with backing store */
+ if (!mapping)
+ return 1;
+
if (PageSwapCache(page)) {
swp_entry_t swap = { .val = page_private(page) };
mem_cgroup_swapout(page, swap);
@@ -687,8 +696,10 @@ static int __remove_mapping(struct address_space *mapping, struct page *page,
return 1;

cannot_free:
- spin_unlock_irqrestore(&mapping->tree_lock, flags);
- mem_cgroup_end_page_stat(memcg);
+ if (mapping) {
+ spin_unlock_irqrestore(&mapping->tree_lock, flags);
+ mem_cgroup_end_page_stat(memcg);
+ }
return 0;
}

@@ -1051,7 +1062,12 @@ static unsigned long shrink_page_list(struct list_head *page_list,
if (PageAnon(page) && !PageSwapCache(page)) {
if (!(sc->gfp_mask & __GFP_IO))
goto keep_locked;
- if (!add_to_swap(page, page_list))
+ if (unlikely(PageTransHuge(page)) &&
+ unlikely(split_huge_page_to_list(page,
+ page_list)))
+ goto activate_locked;
+ if (total_swap_pages &&
+ !add_to_swap(page, page_list))
goto activate_locked;
if (ttu_flags & TTU_LZFREE)
freeable = true;
@@ -1065,7 +1081,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
* The page is mapped into the page tables of one or more
* processes. Try to unmap it here.
*/
- if (page_mapped(page) && mapping) {
+ if (page_mapped(page) && (mapping || freeable)) {
switch (try_to_unmap(page, freeable ?
(ttu_flags | TTU_BATCH_FLUSH) :
((ttu_flags & ~TTU_LZFREE) |
@@ -1182,7 +1198,8 @@ static unsigned long shrink_page_list(struct list_head *page_list,
}
}

- if (!mapping || !__remove_mapping(mapping, page, true))
+ if ((!mapping && !freeable) ||
+ !__remove_mapping(mapping, page, true))
goto keep_locked;

/*
--
1.9.1

2015-11-20 08:03:09

by Minchan Kim

[permalink] [raw]
Subject: [PATCH v4 16/16] mm: add knob to tune lazyfreeing

The hotness of MADV_FREEed pages is very arguable.
Some people think they are hot while others think they are cold.

Quote from Shaohua
"
My main concern is the policy how we should treat the FREE pages. Moving it to
inactive lru is definitionly a good start, I'm wondering if it's enough. The
MADV_FREE increases memory pressure and cause unnecessary reclaim because of
the lazy memory free. While MADV_FREE is intended to be a better replacement of
MADV_DONTNEED, MADV_DONTNEED doesn't have the memory pressure issue as it free
memory immediately. So I hope the MADV_FREE doesn't have impact on memory
pressure too. I'm thinking of adding an extra lru list and wartermark for this
to make sure FREE pages can be freed before system wide page reclaim. As you
said, this is arguable, but I hope we can discuss about this issue more.
"

Quote from me
"
It seems the divergence comes from MADV_FREE is *replacement* of MADV_DONTNEED.
But I don't think so. If we could discard MADV_FREEed page *anytime*, I agree
but it's not true because the page would be dirty state when VM want to reclaim.

I'm also against with your's suggestion which let's discard FREEed page before
system wide page reclaim because system would have lots of clean cold page
caches or anonymous pages. In such case, reclaiming of them would be better.
Yeb, it's really workload-dependent so we might need some heuristic which is
normally what we want to avoid.

Having said that, I agree with you we could do better than the deactivation
and frankly speaking, I'm thinking of another LRU list(e.g. tentatively named
"ezreclaim LRU list"). What I have in mind is to age (anon|file|ez)
fairly. IOW, I want to percolate ez-LRU list reclaiming into get_scan_count.
When the MADV_FREE is called, we could move hinted pages from anon-LRU to
ez-LRU and then If VM find to not be able to discard a page in ez-LRU,
it could promote it to acive-anon-LRU which would be very natural aging
concept because it mean someone touches the page recenlty.
With that, I don't want to bias one side and don't want to add some knob for
tuning the heuristic but let's rely on common fair aging scheme of VM.
"

Quote from Johannes
"
thread 1:
Even if we're wrong about the aging of those MADV_FREE pages, their
contents are invalidated; they can be discarded freely, and restoring
them is a mere GFP_ZERO allocation. All other anonymous pages have to
be written to disk, and potentially be read back.

[ Arguably, MADV_FREE pages should even be reclaimed before inactive
page cache. It's the same cost to discard both types of pages, but
restoring page cache involves IO. ]

It probably makes sense to stop thinking about them as anonymous pages
entirely at this point when it comes to aging. They're really not. The
LRU lists are split to differentiate access patterns and cost of page
stealing (and restoring). From that angle, MADV_FREE pages really have
nothing in common with in-use anonymous pages, and so they shouldn't
be on the same LRU list.

thread:2
What about them is hot? They contain garbage, you have to write to
them before you can use them. Granted, you might have to refetch
cachelines if you don't do cacheline-aligned populating writes, but
you can do a lot of them before it's more expensive than doing IO.

"

Quote from Daniel
"
thread:1
Keep in mind that this is memory the kernel wouldn't be getting back at
all if the allocator wasn't going out of the way to purge it, and they
aren't going to go out of their way to purge it if it means the kernel
is going to steal the pages when there isn't actually memory pressure.

An allocator would be using MADV_DONTNEED if it didn't expect that the
pages were going to be used against shortly. MADV_FREE indicates that it
has time to inform the kernel that they're unused but they could still
be very hot.

thread:2
It's hot because applications churn through memory via the allocator.

Drop the pages and the application is now churning through page faults
and zeroing rather than simply reusing memory. It's not something that
may happen, it *will* happen. A page in the page cache *may* be reused,
but often won't be, especially when the I/O patterns don't line up well
with the way it works.

The whole point of the feature is not requiring the allocator to have
elaborate mechanisms for aging pages and throttling purging. That ends
up resulting in lots of memory held by userspace where the kernel can't
reclaim it under memory pressure. If it's dropped before page cache, it
isn't going to be able to replace any of that logic in allocators.

The page cache is speculative. Page caching by allocators is not really
speculative. Using MADV_FREE on the pages at all is speculative. The
memory is probably going to be reused fairly soon (unless the process
exits, and then it doesn't matter), but purging will end up reducing
memory usage for the portions that aren't.

It would be a different story for a full unpinning/pinning feature since
that would have other use cases (speculative caches), but this is really
only useful in allocators.
"
You can read the whole thread at https://lkml.org/lkml/2015/11/4/51

Since the issue is arguable and there is no single right answer, I
think we should provide a knob, "lazyfreeness" (I hope someone can
suggest a better name).

It's similar to swappiness: higher values discard MADV_FREE pages
more aggressively.
In this implementation, lazyfreeness is biased by 100, the maximum
of swappiness (ie, a value of 40 in /proc/sys/vm/lazyfreeness means
an effective pressure of 100 + 40 = 140). Therefore, the lazyfree
LRU list always has more reclaiming pressure than the anonymous LRU
but, with the default values (ie, 60 swappiness, 40 lazyfreeness),
the same pressure as the file cache (200 - 60 = 140). If a user
wants the lazyfree LRU reclaimed first, to spare the other LRUs,
he could set lazyfreeness above 80, double the default.

There is one exception: if the system is low on free memory and
file cache, it starts discarding MADV_FREEed pages unconditionally,
even if the user set lazyfreeness to 0. This is the same logic
swappiness uses to prevent thrashing.
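
For example, a tiny sketch that raises the knob from userspace (the
sysctl path is the one added by this patch; the value 80 just follows
the guidance above):

	#include <stdio.h>

	/* Raise vm.lazyfreeness so the lazyfree LRU is reclaimed first. */
	int main(void)
	{
		FILE *f = fopen("/proc/sys/vm/lazyfreeness", "w");

		if (!f)
			return 1;
		fprintf(f, "%d\n", 80);	/* effective pressure: 100 + 80 */
		fclose(f);
		return 0;
	}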

Cc: Johannes Weiner <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Mel Gorman <[email protected]>
Signed-off-by: Minchan Kim <[email protected]>
---
Documentation/sysctl/vm.txt | 13 +++++
drivers/base/node.c | 4 +-
fs/proc/meminfo.c | 4 +-
include/linux/memcontrol.h | 1 +
include/linux/mmzone.h | 19 +++++--
include/linux/swap.h | 15 +++++
kernel/sysctl.c | 9 +++
mm/memcontrol.c | 32 ++++++++++-
mm/vmscan.c | 131 +++++++++++++++++++++++++-------------------
mm/vmstat.c | 2 +-
10 files changed, 162 insertions(+), 68 deletions(-)

diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
index a4482fceacec..e3bcf115cf03 100644
--- a/Documentation/sysctl/vm.txt
+++ b/Documentation/sysctl/vm.txt
@@ -56,6 +56,7 @@ files can be found in mm/swap.c.
- percpu_pagelist_fraction
- stat_interval
- swappiness
+- lazyfreeness
- user_reserve_kbytes
- vfs_cache_pressure
- zone_reclaim_mode
@@ -737,6 +738,18 @@ The default value is 60.

==============================================================

+lazyfreeness
+
+This control is used to define how aggressively the kernel discards
+MADV_FREE hinted pages. Higher values increase the aggressiveness,
+lower values decrease the amount of discarding. A value of 0 instructs
+the kernel not to initiate discarding until the amount of free and
+file-backed pages is less than the high water mark in a zone.
+
+The default value is 40.
+
+==============================================================
+
- user_reserve_kbytes

When overcommit_memory is set to 2, "never overcommit" mode, reserve
diff --git a/drivers/base/node.c b/drivers/base/node.c
index f7a1f2107b43..3b0bf1b78b2e 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -69,8 +69,8 @@ static ssize_t node_read_meminfo(struct device *dev,
"Node %d Inactive(anon): %8lu kB\n"
"Node %d Active(file): %8lu kB\n"
"Node %d Inactive(file): %8lu kB\n"
- "Node %d Unevictable: %8lu kB\n"
"Node %d LazyFree: %8lu kB\n"
+ "Node %d Unevictable: %8lu kB\n"
"Node %d Mlocked: %8lu kB\n",
nid, K(i.totalram),
nid, K(i.freeram),
@@ -83,8 +83,8 @@ static ssize_t node_read_meminfo(struct device *dev,
nid, K(node_page_state(nid, NR_INACTIVE_ANON)),
nid, K(node_page_state(nid, NR_ACTIVE_FILE)),
nid, K(node_page_state(nid, NR_INACTIVE_FILE)),
- nid, K(node_page_state(nid, NR_UNEVICTABLE)),
nid, K(node_page_state(nid, NR_LZFREE)),
+ nid, K(node_page_state(nid, NR_UNEVICTABLE)),
nid, K(node_page_state(nid, NR_MLOCK)));

#ifdef CONFIG_HIGHMEM
diff --git a/fs/proc/meminfo.c b/fs/proc/meminfo.c
index 3444f7c4e0b6..f47e6a5aa2e5 100644
--- a/fs/proc/meminfo.c
+++ b/fs/proc/meminfo.c
@@ -101,8 +101,8 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
"Inactive(anon): %8lu kB\n"
"Active(file): %8lu kB\n"
"Inactive(file): %8lu kB\n"
- "Unevictable: %8lu kB\n"
"LazyFree: %8lu kB\n"
+ "Unevictable: %8lu kB\n"
"Mlocked: %8lu kB\n"
#ifdef CONFIG_HIGHMEM
"HighTotal: %8lu kB\n"
@@ -159,8 +159,8 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
K(pages[LRU_INACTIVE_ANON]),
K(pages[LRU_ACTIVE_FILE]),
K(pages[LRU_INACTIVE_FILE]),
- K(pages[LRU_UNEVICTABLE]),
K(pages[LRU_LZFREE]),
+ K(pages[LRU_UNEVICTABLE]),
K(global_page_state(NR_MLOCK)),
#ifdef CONFIG_HIGHMEM
K(i.totalhigh),
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 3e3318ddfc0e..5522ff733506 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -210,6 +210,7 @@ struct mem_cgroup {
int under_oom;

int swappiness;
+ int lzfreeness;
/* OOM-Killer disable */
int oom_kill_disable;

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 1aaa436da0d5..210184b2dd18 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -120,8 +120,8 @@ enum zone_stat_item {
NR_ACTIVE_ANON, /* " " " " " */
NR_INACTIVE_FILE, /* " " " " " */
NR_ACTIVE_FILE, /* " " " " " */
- NR_UNEVICTABLE, /* " " " " " */
NR_LZFREE, /* " " " " " */
+ NR_UNEVICTABLE, /* " " " " " */
NR_MLOCK, /* mlock()ed pages found and moved off LRU */
NR_ANON_PAGES, /* Mapped anonymous pages */
NR_FILE_MAPPED, /* pagecache pages mapped into pagetables.
@@ -179,20 +179,29 @@ enum lru_list {
LRU_ACTIVE_ANON = LRU_BASE + LRU_ACTIVE,
LRU_INACTIVE_FILE = LRU_BASE + LRU_FILE,
LRU_ACTIVE_FILE = LRU_BASE + LRU_FILE + LRU_ACTIVE,
- LRU_UNEVICTABLE,
LRU_LZFREE,
+ LRU_UNEVICTABLE,
NR_LRU_LISTS
};

#define for_each_lru(lru) for (lru = 0; lru < NR_LRU_LISTS; lru++)
+#define for_each_evictable_lru(lru) for (lru = 0; lru <= LRU_LZFREE; lru++)

-#define for_each_evictable_lru(lru) for (lru = 0; lru <= LRU_ACTIVE_FILE; lru++)
-
-static inline int is_file_lru(enum lru_list lru)
+static inline bool is_file_lru(enum lru_list lru)
{
return (lru == LRU_INACTIVE_FILE || lru == LRU_ACTIVE_FILE);
}

+static inline bool is_anon_lru(enum lru_list lru)
+{
+ return (lru == LRU_INACTIVE_ANON || lru == LRU_ACTIVE_ANON);
+}
+
+static inline bool is_lazyfree_lru(enum lru_list lru)
+{
+ return (lru == LRU_LZFREE);
+}
+
static inline int is_active_lru(enum lru_list lru)
{
return (lru == LRU_ACTIVE_ANON || lru == LRU_ACTIVE_FILE);
diff --git a/include/linux/swap.h b/include/linux/swap.h
index c484339b46b6..252b478a2579 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -331,6 +331,7 @@ extern unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *mem,
unsigned long *nr_scanned);
extern unsigned long shrink_all_memory(unsigned long nr_pages);
extern int vm_swappiness;
+extern int vm_lazyfreeness;
extern int remove_mapping(struct address_space *mapping, struct page *page);
extern unsigned long vm_total_pages;

@@ -362,11 +363,25 @@ static inline int mem_cgroup_swappiness(struct mem_cgroup *memcg)
return memcg->swappiness;
}

+static inline int mem_cgroup_lzfreeness(struct mem_cgroup *memcg)
+{
+ /* root ? */
+ if (mem_cgroup_disabled() || !memcg->css.parent)
+ return vm_lazyfreeness;
+
+ return memcg->lzfreeness;
+}
+
#else
static inline int mem_cgroup_swappiness(struct mem_cgroup *mem)
{
return vm_swappiness;
}
+
+static inline int mem_cgroup_lzfreeness(struct mem_cgroup *mem)
+{
+ return vm_lazyfreeness;
+}
#endif
#ifdef CONFIG_MEMCG_SWAP
extern void mem_cgroup_swapout(struct page *page, swp_entry_t entry);
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index e69201d8094e..2496b10c08e9 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1268,6 +1268,15 @@ static struct ctl_table vm_table[] = {
.extra1 = &zero,
.extra2 = &one_hundred,
},
+ {
+ .procname = "lazyfreeness",
+ .data = &vm_lazyfreeness,
+ .maxlen = sizeof(vm_lazyfreeness),
+ .mode = 0644,
+ .proc_handler = proc_dointvec_minmax,
+ .extra1 = &zero,
+ .extra2 = &one_hundred,
+ },
#ifdef CONFIG_HUGETLB_PAGE
{
.procname = "nr_hugepages",
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 1dc599ce1bcb..5bdbe2a20dc0 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -108,8 +108,8 @@ static const char * const mem_cgroup_lru_names[] = {
"active_anon",
"inactive_file",
"active_file",
- "unevictable",
"lazyfree",
+ "unevictable",
};

#define THRESHOLDS_EVENTS_TARGET 128
@@ -3288,6 +3288,30 @@ static int mem_cgroup_swappiness_write(struct cgroup_subsys_state *css,
return 0;
}

+static u64 mem_cgroup_lzfreeness_read(struct cgroup_subsys_state *css,
+ struct cftype *cft)
+{
+ struct mem_cgroup *memcg = mem_cgroup_from_css(css);
+
+ return mem_cgroup_lzfreeness(memcg);
+}
+
+static int mem_cgroup_lzfreeness_write(struct cgroup_subsys_state *css,
+ struct cftype *cft, u64 val)
+{
+ struct mem_cgroup *memcg = mem_cgroup_from_css(css);
+
+ if (val > 100)
+ return -EINVAL;
+
+ if (css->parent)
+ memcg->lzfreeness = val;
+ else
+ vm_lazyfreeness = val;
+
+ return 0;
+}
+
static void __mem_cgroup_threshold(struct mem_cgroup *memcg, bool swap)
{
struct mem_cgroup_threshold_ary *t;
@@ -4085,6 +4109,11 @@ static struct cftype mem_cgroup_legacy_files[] = {
.write_u64 = mem_cgroup_swappiness_write,
},
{
+ .name = "lazyfreeness",
+ .read_u64 = mem_cgroup_lzfreeness_read,
+ .write_u64 = mem_cgroup_lzfreeness_write,
+ },
+ {
.name = "move_charge_at_immigrate",
.read_u64 = mem_cgroup_move_charge_read,
.write_u64 = mem_cgroup_move_charge_write,
@@ -4305,6 +4334,7 @@ mem_cgroup_css_online(struct cgroup_subsys_state *css)
memcg->use_hierarchy = parent->use_hierarchy;
memcg->oom_kill_disable = parent->oom_kill_disable;
memcg->swappiness = mem_cgroup_swappiness(parent);
+ memcg->lzfreeness = mem_cgroup_lzfreeness(parent);

if (parent->use_hierarchy) {
page_counter_init(&memcg->memory, &parent->memory);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index d9dfd034b963..325b49cedee8 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -141,6 +141,10 @@ struct scan_control {
*/
int vm_swappiness = 60;
/*
+ * From 0 .. 100. Higher means more lazy freeing.
+ */
+int vm_lazyfreeness = 40;
+/*
* The total number of pages which are beyond the high watermark within all
* zones.
*/
@@ -1989,10 +1993,11 @@ static unsigned long shrink_list(enum lru_list lru, unsigned long nr_to_scan,
}

enum scan_balance {
- SCAN_EQUAL,
- SCAN_FRACT,
- SCAN_ANON,
- SCAN_FILE,
+ SCAN_EQUAL = (1 << 0),
+ SCAN_FRACT = (1 << 1),
+ SCAN_ANON = (1 << 2),
+ SCAN_FILE = (1 << 3),
+ SCAN_LZFREE = (1 << 4),
};

/*
@@ -2003,20 +2008,21 @@ enum scan_balance {
*
* nr[0] = anon inactive pages to scan; nr[1] = anon active pages to scan
* nr[2] = file inactive pages to scan; nr[3] = file active pages to scan
+ * nr[4] = lazy free pages to scan;
*/
static void get_scan_count(struct lruvec *lruvec, int swappiness,
- struct scan_control *sc, unsigned long *nr,
- unsigned long *lru_pages)
+ int lzfreeness, struct scan_control *sc,
+ unsigned long *nr, unsigned long *lru_pages)
{
struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;
- u64 fraction[2];
+ u64 fraction[3];
u64 denominator = 0; /* gcc */
struct zone *zone = lruvec_zone(lruvec);
- unsigned long anon_prio, file_prio;
- enum scan_balance scan_balance;
- unsigned long anon, file;
+ unsigned long anon_prio, file_prio, lzfree_prio;
+ enum scan_balance scan_balance = 0;
+ unsigned long anon, file, lzfree;
bool force_scan = false;
- unsigned long ap, fp;
+ unsigned long ap, fp, lp;
enum lru_list lru;
bool some_scanned;
int pass;
@@ -2040,9 +2046,19 @@ static void get_scan_count(struct lruvec *lruvec, int swappiness,
if (!global_reclaim(sc))
force_scan = true;

+ /*
+ * If we have lazyfree pages and lzfreeness is high enough,
+ * scan only the lazyfree LRU to avoid reclaiming other pages
+ * until the lazyfree LRU list is empty.
+ */
+ if (get_lru_size(lruvec, LRU_LZFREE) && lzfreeness >= 80) {
+ scan_balance = SCAN_LZFREE;
+ goto out;
+ }
+
/* If we have no swap space, do not bother scanning anon pages. */
if (!sc->may_swap || (get_nr_swap_pages() <= 0)) {
- scan_balance = SCAN_FILE;
+ scan_balance = SCAN_FILE|SCAN_LZFREE;
goto out;
}

@@ -2054,7 +2070,7 @@ static void get_scan_count(struct lruvec *lruvec, int swappiness,
* too expensive.
*/
if (!global_reclaim(sc) && !swappiness) {
- scan_balance = SCAN_FILE;
+ scan_balance = SCAN_FILE|SCAN_LZFREE;
goto out;
}

@@ -2086,7 +2102,7 @@ static void get_scan_count(struct lruvec *lruvec, int swappiness,
zone_page_state(zone, NR_INACTIVE_FILE);

if (unlikely(zonefile + zonefree <= high_wmark_pages(zone))) {
- scan_balance = SCAN_ANON;
+ scan_balance = SCAN_ANON|SCAN_LZFREE;
goto out;
}
}
@@ -2096,7 +2112,7 @@ static void get_scan_count(struct lruvec *lruvec, int swappiness,
* anything from the anonymous working set right now.
*/
if (!inactive_file_is_low(lruvec)) {
- scan_balance = SCAN_FILE;
+ scan_balance = SCAN_FILE|SCAN_LZFREE;
goto out;
}

@@ -2108,6 +2124,7 @@ static void get_scan_count(struct lruvec *lruvec, int swappiness,
*/
anon_prio = swappiness;
file_prio = 200 - anon_prio;
+ lzfree_prio = 100 + lzfreeness;

/*
* OK, so we have swap space and a fair amount of page cache
@@ -2125,6 +2142,7 @@ static void get_scan_count(struct lruvec *lruvec, int swappiness,
get_lru_size(lruvec, LRU_INACTIVE_ANON);
file = get_lru_size(lruvec, LRU_ACTIVE_FILE) +
get_lru_size(lruvec, LRU_INACTIVE_FILE);
+ lzfree = get_lru_size(lruvec, LRU_LZFREE);

spin_lock_irq(&zone->lru_lock);
if (unlikely(reclaim_stat->recent_scanned[0] > anon / 4)) {
@@ -2137,6 +2155,11 @@ static void get_scan_count(struct lruvec *lruvec, int swappiness,
reclaim_stat->recent_rotated[1] /= 2;
}

+ if (unlikely(reclaim_stat->recent_scanned[2] > lzfree / 4)) {
+ reclaim_stat->recent_scanned[2] /= 2;
+ reclaim_stat->recent_rotated[2] /= 2;
+ }
+
/*
* The amount of pressure on anon vs file pages is inversely
* proportional to the fraction of recently scanned pages on
@@ -2147,13 +2170,18 @@ static void get_scan_count(struct lruvec *lruvec, int swappiness,

fp = file_prio * (reclaim_stat->recent_scanned[1] + 1);
fp /= reclaim_stat->recent_rotated[1] + 1;
+
+ lp = lzfree_prio * (reclaim_stat->recent_scanned[2] + 1);
+ lp /= reclaim_stat->recent_rotated[2] + 1;
spin_unlock_irq(&zone->lru_lock);

fraction[0] = ap;
fraction[1] = fp;
- denominator = ap + fp + 1;
+ fraction[2] = lp;
+ denominator = ap + fp + lp + 1;
out:
some_scanned = false;
+
/* Only use force_scan on second pass. */
for (pass = 0; !some_scanned && pass < 2; pass++) {
*lru_pages = 0;
@@ -2168,34 +2196,34 @@ static void get_scan_count(struct lruvec *lruvec, int swappiness,
if (!scan && pass && force_scan)
scan = min(size, SWAP_CLUSTER_MAX);

- switch (scan_balance) {
- case SCAN_EQUAL:
- /* Scan lists relative to size */
- break;
- case SCAN_FRACT:
+ if (scan_balance & SCAN_FRACT) {
/*
* Scan types proportional to swappiness and
* their relative recent reclaim efficiency.
*/
scan = div64_u64(scan * fraction[file],
denominator);
- break;
- case SCAN_FILE:
- case SCAN_ANON:
- /* Scan one type exclusively */
- if ((scan_balance == SCAN_FILE) != file) {
- size = 0;
- scan = 0;
- }
- break;
- default:
- /* Look ma, no brain */
- BUG();
+ goto scan;
}

+ /* Scan lists relative to size */
+ if (scan_balance & SCAN_EQUAL)
+ goto scan;
+
+ if (scan_balance & SCAN_FILE && is_file_lru(lru))
+ goto scan;
+
+ if (scan_balance & SCAN_ANON && is_anon_lru(lru))
+ goto scan;
+
+ if (scan_balance & SCAN_LZFREE &&
+ is_lazyfree_lru(lru))
+ goto scan;
+
+ continue;
+scan:
*lru_pages += size;
nr[lru] = scan;
-
/*
* Skip the second pass and don't force_scan,
* if we found something to scan.
@@ -2226,23 +2254,22 @@ static inline void init_tlb_ubc(void)
* This is a basic per-zone page freer. Used by both kswapd and direct reclaim.
*/
static void shrink_lruvec(struct lruvec *lruvec, int swappiness,
- struct scan_control *sc, unsigned long *lru_pages)
+ int lzfreeness, struct scan_control *sc,
+ unsigned long *lru_pages)
{
- unsigned long nr[NR_LRU_LISTS];
- unsigned long targets[NR_LRU_LISTS];
+ unsigned long nr[NR_LRU_LISTS] = {0,};
+ unsigned long targets[NR_LRU_LISTS] = {0,};
unsigned long nr_to_scan;
- unsigned long nr_to_scan_lzfree;
enum lru_list lru;
unsigned long nr_reclaimed = 0;
unsigned long nr_to_reclaim = sc->nr_to_reclaim;
struct blk_plug plug;
bool scan_adjusted;

- get_scan_count(lruvec, swappiness, sc, nr, lru_pages);
+ get_scan_count(lruvec, swappiness, lzfreeness, sc, nr, lru_pages);

/* Record the original scan target for proportional adjustments later */
memcpy(targets, nr, sizeof(nr));
- nr_to_scan_lzfree = get_lru_size(lruvec, LRU_LZFREE);

/*
* Global reclaiming within direct reclaim at DEF_PRIORITY is a normal
@@ -2260,22 +2287,9 @@ static void shrink_lruvec(struct lruvec *lruvec, int swappiness,

init_tlb_ubc();

- while (nr_to_scan_lzfree) {
- nr_to_scan = min(nr_to_scan_lzfree, SWAP_CLUSTER_MAX);
- nr_to_scan_lzfree -= nr_to_scan;
-
- nr_reclaimed += shrink_inactive_list(nr_to_scan, lruvec,
- sc, LRU_LZFREE);
- }
-
- if (nr_reclaimed >= nr_to_reclaim) {
- sc->nr_reclaimed += nr_reclaimed;
- return;
- }
-
blk_start_plug(&plug);
while (nr[LRU_INACTIVE_ANON] || nr[LRU_ACTIVE_FILE] ||
- nr[LRU_INACTIVE_FILE]) {
+ nr[LRU_INACTIVE_FILE] || nr[LRU_LZFREE]) {
unsigned long nr_anon, nr_file, percentage;
unsigned long nr_scanned;

@@ -2457,7 +2471,7 @@ static bool shrink_zone(struct zone *zone, struct scan_control *sc,
unsigned long lru_pages;
unsigned long scanned;
struct lruvec *lruvec;
- int swappiness;
+ int swappiness, lzfreeness;

if (mem_cgroup_low(root, memcg)) {
if (!sc->may_thrash)
@@ -2467,9 +2481,11 @@ static bool shrink_zone(struct zone *zone, struct scan_control *sc,

lruvec = mem_cgroup_zone_lruvec(zone, memcg);
swappiness = mem_cgroup_swappiness(memcg);
+ lzfreeness = mem_cgroup_lzfreeness(memcg);
scanned = sc->nr_scanned;

- shrink_lruvec(lruvec, swappiness, sc, &lru_pages);
+ shrink_lruvec(lruvec, swappiness, lzfreeness,
+ sc, &lru_pages);
zone_lru_pages += lru_pages;

if (memcg && is_classzone)
@@ -2935,6 +2951,7 @@ unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *memcg,
};
struct lruvec *lruvec = mem_cgroup_zone_lruvec(zone, memcg);
int swappiness = mem_cgroup_swappiness(memcg);
+ int lzfreeness = mem_cgroup_lzfreeness(memcg);
unsigned long lru_pages;

sc.gfp_mask = (gfp_mask & GFP_RECLAIM_MASK) |
@@ -2951,7 +2968,7 @@ unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *memcg,
* will pick up pages from other mem cgroup's as well. We hack
* the priority and make it zero.
*/
- shrink_lruvec(lruvec, swappiness, &sc, &lru_pages);
+ shrink_lruvec(lruvec, swappiness, lzfreeness, &sc, &lru_pages);

trace_mm_vmscan_memcg_softlimit_reclaim_end(sc.nr_reclaimed);

diff --git a/mm/vmstat.c b/mm/vmstat.c
index df95d9473bba..43effd0374d9 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -703,8 +703,8 @@ const char * const vmstat_text[] = {
"nr_active_anon",
"nr_inactive_file",
"nr_active_file",
- "nr_unevictable",
"nr_lazyfree",
+ "nr_unevictable",
"nr_mlock",
"nr_anon_pages",
"nr_mapped",
--
1.9.1

2015-11-24 21:58:54

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH v4 00/16] MADV_FREE support

On Fri, 20 Nov 2015 17:02:32 +0900 Minchan Kim <[email protected]> wrote:

> I have been spent a lot of time to land MADV_FREE feature
> by request of userland people(esp, Daniel and Jason, jemalloc guys.

A couple of things...

There's a massive and complex reject storm against Kirill's page-flags
and thp changes. The problems in hugetlb.c are more than I can
reasonably fix up, sorry. How would you feel about redoing the patches
against next -mm?

Secondly, "mm: introduce lazyfree LRU list" and "mm: support MADV_FREE
on swapless system" are new, and require significant reviewer
attention. But there's so much other stuff flying around that I doubt
if we'll get effective review. So perhaps it would be best to shelve
those new things and introduce them later, after the basic old
MADV_FREE work has settled in?

2015-11-25 02:53:25

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCH v4 00/16] MADV_FREE support

Hi Andrew,

On Tue, Nov 24, 2015 at 01:58:51PM -0800, Andrew Morton wrote:
> On Fri, 20 Nov 2015 17:02:32 +0900 Minchan Kim <[email protected]> wrote:
>
> > I have been spent a lot of time to land MADV_FREE feature
> > by request of userland people(esp, Daniel and Jason, jemalloc guys.
>
> A couple of things...
>
> There's a massive and complex reject storm against Kirill's page-flags
> and thp changes. The problems in hugetlb.c are more than I can
> reasonably fix up, sorry. How would you feel about redoing the patches
> against next -mm?

No problem at all.

>
> Secondly, "mm: introduce lazyfree LRU list" and "mm: support MADV_FREE
> on swapless system" are new, and require significant reviewer
> attention. But there's so much other stuff flying around that I doubt
> if we'll get effective review. So perhaps it would be best to shelve
> those new things and introduce them later, after the basic old
> MADV_FREE work has settled in?
>

That's really what we (Daniel, Michael, and I) want so far.
The one person reluctant about it is Johannes, who has wanted to support
MADV_FREE on swapless systems via a new LRU from the beginning.

If Johannes is not strongly against Andrew's plan, I will resend a
new patchset (i.e., not including the new stuff) based on the next -mmotm.

Hannes?

2015-11-25 17:37:08

by Johannes Weiner

[permalink] [raw]
Subject: Re: [PATCH v4 00/16] MADV_FREE support

On Wed, Nov 25, 2015 at 11:53:18AM +0900, Minchan Kim wrote:
> That's really what we (Daniel, Michael, and I) want so far.
> The one person reluctant about it is Johannes, who has wanted to support
> MADV_FREE on swapless systems via a new LRU from the beginning.
>
> If Johannes is not strongly against Andrew's plan, I will resend a
> new patchset (i.e., not including the new stuff) based on the next -mmotm.
>
> Hannes?

Yeah, let's shelve the new stuff for now and get this going. As Daniel
said, the feature is already useful, and we can improve it later on.