2015-11-04 01:26:09

by Minchan Kim

Subject: [PATCH v2 00/13] MADV_FREE support

MADV_FREE has been in linux-next for a long time. There were two reasons for that, I think.

1. The MADV_FREE code on the reclaim path was a real mess.

2. Andrew really wanted to hear from userland people who want to use
the syscall.

A few months ago, Daniel Micay (an active jemalloc contributor) asked me
to make progress on upstreaming, but I was busy at that time, so it took
a long while for me to revisit the code. I finally cleaned up the mess
recently, which solves the first issue.

In addition, Daniel and Jason (the jemalloc maintainer) recently asked
Andrew again, saying it would be great to have even with the current
swap dependency, so Andrew decided to take it for v4.4.

When I tested the MADV_FREE patches on a recent mmotm, there were some
problems with the THP refcount redesign, which made long-running tests
hard. There is also an ordering dependency, because MADV_FREE sits after
the THP refcount redesign in mmotm, so I discussed it with Andrew in the
hallway at this Kernel Summit and we decided I should send the patchset
based on v4.3-rc7.

I have been testing it on v4.3-rc7 and haven't found any problems so far.

In the meantime, Hugh reviewed all of the code and asked me to tidy up
the many MADV_FREE-related patches in mmotm, so this series is the
result of that request.

There are four final modifications since I sent the cleanup patchset
(ie, the MADV_FREE refactoring and the KSM page fix):

1. Replace the description and comment of the KSM fix patch with Hugh's
suggestion
2. Avoid forcing SetPageDirty in try_to_unmap_one, to avoid clean page
swapout, from Yalin
3. Add a uapi patch to make the value of MADV_FREE the same for all
arches, from Chen
4. Do a lazy split of THP when MADV_FREE is called

About 3: I included it because I thought it was good but Andrew just
missed the patch at that time. However, reading the quilt series file
now, it seems Shaohua had some problem with it, though I couldn't find
any mail about it in my mailbox. If something is wrong with it, please
tell us.

About 4 (ie, mm: don't split THP page when syscall is called): it is a
new implementation, so it needs review.
I don't understand why the THP split code has applied both SetPageDirty
and pte_mkdirty to subpages unconditionally from the beginning. I guess
that at the time there was no MADV_FREE, so it was not a problem and it
was safer to mark both.

For reference, the series-file comment in question is:

#mm-support-madvisemadv_free.patch: other-arch syscall numbering mess
("arch: uapi: asm: mman.h: Let MADV_FREE have same value for all
architectures"). Shaohua Li <[email protected]> testing disasters.

TODO: I will send a man page patch if this lands in v4.4.

Andrew, you could replace all of the MADV_FREE-related patches with
this series. IOW, these are:

# MADV_FREE stuff:
x86-add-pmd_-for-thp.patch
x86-add-pmd_-for-thp-fix.patch
sparc-add-pmd_-for-thp.patch
sparc-add-pmd_-for-thp-fix.patch
powerpc-add-pmd_-for-thp.patch
arm-add-pmd_mkclean-for-thp.patch
arm64-add-pmd_-for-thp.patch

mm-support-madvisemadv_free.patch
mm-support-madvisemadv_free-fix.patch
mm-support-madvisemadv_free-fix-2.patch
mm-support-madvisemadv_free-fix-3.patch
mm-support-madvisemadv_free-vs-thp-rename-split_huge_page_pmd-to-split_huge_pmd.patch
mm-support-madvisemadv_free-fix-5.patch
mm-support-madvisemadv_free-fix-6.patch
mm-mark-stable-page-dirty-in-ksm.patch
mm-dont-split-thp-page-when-syscall-is-called.patch
mm-dont-split-thp-page-when-syscall-is-called-fix.patch
mm-dont-split-thp-page-when-syscall-is-called-fix-2.patch
mm-dont-split-thp-page-when-syscall-is-called-fix-3.patch
mm-dont-split-thp-page-when-syscall-is-called-fix-4.patch
mm-dont-split-thp-page-when-syscall-is-called-fix-5.patch
mm-dont-split-thp-page-when-syscall-is-called-fix-6.patch
mm-dont-split-thp-page-when-syscall-is-called-fix-6-fix.patch
mm-free-swp_entry-in-madvise_free.patch
mm-move-lazy-free-pages-to-inactive-list.patch
mm-move-lazy-free-pages-to-inactive-list-fix.patch
mm-move-lazy-free-pages-to-inactive-list-fix-fix.patch
mm-move-lazy-free-pages-to-inactive-list-fix-fix-fix.patch

* Changes from v1
* Don't do unnecessary TLB flush - Shaohua
* Added Acked-by - Hugh, Michal
* Merge deactivate_page and deactivate_file_page
* Add pmd_dirty/pmd_mkclean patches for several arches
* Add lazy THP split patch
* Drop [email protected] - Delivery Failure

Chen Gang (1):
arch: uapi: asm: mman.h: Let MADV_FREE have same value for all
architectures

Minchan Kim (12):
mm: support madvise(MADV_FREE)
mm: define MADV_FREE for some arches
mm: free swp_entry in madvise_free
mm: move lazily freed pages to inactive list
mm: clear PG_dirty to mark page freeable
mm: mark stable page dirty in KSM
x86: add pmd_[dirty|mkclean] for THP
sparc: add pmd_[dirty|mkclean] for THP
powerpc: add pmd_[dirty|mkclean] for THP
arm: add pmd_mkclean for THP
arm64: add pmd_mkclean for THP
mm: don't split THP page when syscall is called

arch/alpha/include/uapi/asm/mman.h | 1 +
arch/arm/include/asm/pgtable-3level.h | 1 +
arch/arm64/include/asm/pgtable.h | 1 +
arch/mips/include/uapi/asm/mman.h | 1 +
arch/parisc/include/uapi/asm/mman.h | 1 +
arch/powerpc/include/asm/pgtable-ppc64.h | 2 +
arch/sparc/include/asm/pgtable_64.h | 9 ++
arch/x86/include/asm/pgtable.h | 5 +
arch/xtensa/include/uapi/asm/mman.h | 1 +
include/linux/huge_mm.h | 3 +
include/linux/rmap.h | 1 +
include/linux/swap.h | 2 +-
include/linux/vm_event_item.h | 1 +
include/uapi/asm-generic/mman-common.h | 1 +
mm/huge_memory.c | 46 +++++++-
mm/ksm.c | 6 ++
mm/madvise.c | 177 +++++++++++++++++++++++++++++++
mm/rmap.c | 7 ++
mm/swap.c | 62 ++++++-----
mm/swap_state.c | 5 +-
mm/truncate.c | 2 +-
mm/vmscan.c | 10 +-
mm/vmstat.c | 1 +
23 files changed, 308 insertions(+), 38 deletions(-)

--
1.9.1


2015-11-04 01:30:12

by Minchan Kim

Subject: [PATCH v2 01/13] mm: support madvise(MADV_FREE)

Linux doesn't have the ability to free pages lazily, while other OSes
have supported it for a long time as madvise(MADV_FREE).

The gain is clear: the kernel can discard freed pages rather than
swapping them out or hitting OOM when memory pressure happens.

Without memory pressure, freed pages can be reused by userspace without
additional overhead (ex, page fault + allocation + zeroing).

Jason Evans said:

: Facebook has been using MAP_UNINITIALIZED
: (https://lkml.org/lkml/2012/1/18/308) in some of its applications for
: several years, but there are operational costs to maintaining this
: out-of-tree in our kernel and in jemalloc, and we are anxious to retire it
: in favor of MADV_FREE. When we first enabled MAP_UNINITIALIZED it
: increased throughput for much of our workload by ~5%, and although the
: benefit has decreased using newer hardware and kernels, there is still
: enough benefit that we cannot reasonably retire it without a replacement.
:
: Aside from Facebook operations, there are numerous broadly used
: applications that would benefit from MADV_FREE. The ones that immediately
: come to mind are redis, varnish, and MariaDB. I don't have much insight
: into Android internals and development process, but I would hope to see
: MADV_FREE support eventually end up there as well to benefit applications
: linked with the integrated jemalloc.
:
: jemalloc will use MADV_FREE once it becomes available in the Linux kernel.
: In fact, jemalloc already uses MADV_FREE or equivalent everywhere it's
: available: *BSD, OS X, Windows, and Solaris -- every platform except Linux
: (and AIX, but I'm not sure it even compiles on AIX). The lack of
: MADV_FREE on Linux forced me down a long series of increasingly
: sophisticated heuristics for madvise() volume reduction, and even so this
: remains a common performance issue for people using jemalloc on Linux.
: Please integrate MADV_FREE; many people will benefit substantially.

How it works:

When the madvise syscall is called, the VM clears the dirty bit in the
ptes of the range. If memory pressure happens later, the VM checks the
dirty bit in the page table: if it is still "clean", the page is a
"lazyfree" page, so the VM can discard it instead of swapping it out.
If there was a store to the page before the VM picked it for reclaim,
the dirty bit is set, so the VM swaps the page out instead of
discarding it.

The first heavy users would be general-purpose allocators (ex, jemalloc
and tcmalloc, and I hope glibc supports it too); jemalloc and tcmalloc
already support the feature on other OSes (ex, FreeBSD).
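
For illustration, a minimal userspace sketch of the intended usage (a
hypothetical example, not taken from jemalloc; the 64M size and the
MADV_FREE fallback define are assumptions):

#include <string.h>
#include <sys/mman.h>

#ifndef MADV_FREE
#define MADV_FREE 8	/* value proposed in patch 03 */
#endif

int main(void)
{
	size_t len = 64UL << 20;	/* a 64M arena, size chosen arbitrarily */
	char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (p == MAP_FAILED)
		return 1;

	memset(p, 1, len);		/* pages become dirty */
	madvise(p, len, MADV_FREE);	/* hint: discard under pressure */

	/*
	 * The range stays mapped. If the kernel discarded a page, a
	 * later read sees zeroes; a write marks the pte dirty again so
	 * the page is swapped out rather than dropped.
	 */
	p[0] = 2;
	return 0;
}

Compared with MADV_DONTNEED, reusing the range needs no new page faults
unless the kernel actually reclaimed the pages.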

barrios@blaptop:~/benchmark/ebizzy$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 12
On-line CPU(s) list: 0-11
Thread(s) per core: 1
Core(s) per socket: 1
Socket(s): 12
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 2
Stepping: 3
CPU MHz: 3200.185
BogoMIPS: 6400.53
Virtualization: VT-x
Hypervisor vendor: KVM
Virtualization type: full
L1d cache: 32K
L1i cache: 32K
L2 cache: 4096K
NUMA node0 CPU(s): 0-11
ebizzy benchmark (./ebizzy -S 10 -n 512)

Higher avg is better.

vanilla-jemalloc MADV_free-jemalloc

1 thread
records: 10 records: 10
avg: 2961.90 avg: 12069.70
std: 71.96(2.43%) std: 186.68(1.55%)
max: 3070.00 max: 12385.00
min: 2796.00 min: 11746.00

2 thread
records: 10 records: 10
avg: 5020.00 avg: 17827.00
std: 264.87(5.28%) std: 358.52(2.01%)
max: 5244.00 max: 18760.00
min: 4251.00 min: 17382.00

4 thread
records: 10 records: 10
avg: 8988.80 avg: 27930.80
std: 1175.33(13.08%) std: 3317.33(11.88%)
max: 9508.00 max: 30879.00
min: 5477.00 min: 21024.00

8 thread
records: 10 records: 10
avg: 13036.50 avg: 33739.40
std: 170.67(1.31%) std: 5146.22(15.25%)
max: 13371.00 max: 40572.00
min: 12785.00 min: 24088.00

16 thread
records: 10 records: 10
avg: 11092.40 avg: 31424.20
std: 710.60(6.41%) std: 3763.89(11.98%)
max: 12446.00 max: 36635.00
min: 9949.00 min: 25669.00

32 thread
records: 10 records: 10
avg: 11067.00 avg: 34495.80
std: 971.06(8.77%) std: 2721.36(7.89%)
max: 12010.00 max: 38598.00
min: 9002.00 min: 30636.00

In summary, MADV_FREE is much faster than MADV_DONTNEED.

Acked-by: Hugh Dickins <[email protected]>
Reviewed-by: Michal Hocko <[email protected]>
Signed-off-by: Minchan Kim <[email protected]>
---
include/linux/rmap.h | 1 +
include/linux/vm_event_item.h | 1 +
include/uapi/asm-generic/mman-common.h | 1 +
mm/madvise.c | 132 +++++++++++++++++++++++++++++++++
mm/rmap.c | 7 ++
mm/swap_state.c | 5 +-
mm/vmscan.c | 10 ++-
mm/vmstat.c | 1 +
8 files changed, 153 insertions(+), 5 deletions(-)

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index 29446aeef36e..f4c992826242 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -85,6 +85,7 @@ enum ttu_flags {
TTU_UNMAP = 1, /* unmap mode */
TTU_MIGRATION = 2, /* migration mode */
TTU_MUNLOCK = 4, /* munlock mode */
+ TTU_FREE = 8, /* free mode */

TTU_IGNORE_MLOCK = (1 << 8), /* ignore mlock */
TTU_IGNORE_ACCESS = (1 << 9), /* don't age */
diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index 9246d32dc973..2b1cef88b827 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -25,6 +25,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
FOR_ALL_ZONES(PGALLOC),
PGFREE, PGACTIVATE, PGDEACTIVATE,
PGFAULT, PGMAJFAULT,
+ PGLAZYFREED,
FOR_ALL_ZONES(PGREFILL),
FOR_ALL_ZONES(PGSTEAL_KSWAPD),
FOR_ALL_ZONES(PGSTEAL_DIRECT),
diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
index ddc3b36f1046..7a94102b7a02 100644
--- a/include/uapi/asm-generic/mman-common.h
+++ b/include/uapi/asm-generic/mman-common.h
@@ -34,6 +34,7 @@
#define MADV_SEQUENTIAL 2 /* expect sequential page references */
#define MADV_WILLNEED 3 /* will need these pages */
#define MADV_DONTNEED 4 /* don't need these pages */
+#define MADV_FREE 5 /* free pages only if memory pressure */

/* common parameters: try to keep these consistent across architectures */
#define MADV_REMOVE 9 /* remove these pages & resources */
diff --git a/mm/madvise.c b/mm/madvise.c
index c889fcbb530e..a8813f7b37b3 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -20,6 +20,9 @@
#include <linux/backing-dev.h>
#include <linux/swap.h>
#include <linux/swapops.h>
+#include <linux/mmu_notifier.h>
+
+#include <asm/tlb.h>

/*
* Any behaviour which results in changes to the vma->vm_flags needs to
@@ -32,6 +35,7 @@ static int madvise_need_mmap_write(int behavior)
case MADV_REMOVE:
case MADV_WILLNEED:
case MADV_DONTNEED:
+ case MADV_FREE:
return 0;
default:
/* be safe, default to 1. list exceptions explicitly */
@@ -256,6 +260,125 @@ static long madvise_willneed(struct vm_area_struct *vma,
return 0;
}

+static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
+ unsigned long end, struct mm_walk *walk)
+
+{
+ struct mmu_gather *tlb = walk->private;
+ struct mm_struct *mm = tlb->mm;
+ struct vm_area_struct *vma = walk->vma;
+ spinlock_t *ptl;
+ pte_t *pte, ptent;
+ struct page *page;
+
+ split_huge_page_pmd(vma, addr, pmd);
+ if (pmd_trans_unstable(pmd))
+ return 0;
+
+ pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
+ arch_enter_lazy_mmu_mode();
+ for (; addr != end; pte++, addr += PAGE_SIZE) {
+ ptent = *pte;
+
+ if (!pte_present(ptent))
+ continue;
+
+ page = vm_normal_page(vma, addr, ptent);
+ if (!page)
+ continue;
+
+ if (PageSwapCache(page)) {
+ if (!trylock_page(page))
+ continue;
+
+ if (!try_to_free_swap(page)) {
+ unlock_page(page);
+ continue;
+ }
+
+ ClearPageDirty(page);
+ unlock_page(page);
+ }
+
+ if (pte_young(ptent) || pte_dirty(ptent)) {
+ /*
+ * Some of architecture(ex, PPC) don't update TLB
+ * with set_pte_at and tlb_remove_tlb_entry so for
+ * the portability, remap the pte with old|clean
+ * after pte clearing.
+ */
+ ptent = ptep_get_and_clear_full(mm, addr, pte,
+ tlb->fullmm);
+
+ ptent = pte_mkold(ptent);
+ ptent = pte_mkclean(ptent);
+ set_pte_at(mm, addr, pte, ptent);
+ tlb_remove_tlb_entry(tlb, pte, addr);
+ }
+ }
+
+ arch_leave_lazy_mmu_mode();
+ pte_unmap_unlock(pte - 1, ptl);
+ cond_resched();
+ return 0;
+}
+
+static void madvise_free_page_range(struct mmu_gather *tlb,
+ struct vm_area_struct *vma,
+ unsigned long addr, unsigned long end)
+{
+ struct mm_walk free_walk = {
+ .pmd_entry = madvise_free_pte_range,
+ .mm = vma->vm_mm,
+ .private = tlb,
+ };
+
+ tlb_start_vma(tlb, vma);
+ walk_page_range(addr, end, &free_walk);
+ tlb_end_vma(tlb, vma);
+}
+
+static int madvise_free_single_vma(struct vm_area_struct *vma,
+ unsigned long start_addr, unsigned long end_addr)
+{
+ unsigned long start, end;
+ struct mm_struct *mm = vma->vm_mm;
+ struct mmu_gather tlb;
+
+ if (vma->vm_flags & (VM_LOCKED|VM_HUGETLB|VM_PFNMAP))
+ return -EINVAL;
+
+ /* MADV_FREE works for only anon vma at the moment */
+ if (!vma_is_anonymous(vma))
+ return -EINVAL;
+
+ start = max(vma->vm_start, start_addr);
+ if (start >= vma->vm_end)
+ return -EINVAL;
+ end = min(vma->vm_end, end_addr);
+ if (end <= vma->vm_start)
+ return -EINVAL;
+
+ lru_add_drain();
+ tlb_gather_mmu(&tlb, mm, start, end);
+ update_hiwater_rss(mm);
+
+ mmu_notifier_invalidate_range_start(mm, start, end);
+ madvise_free_page_range(&tlb, vma, start, end);
+ mmu_notifier_invalidate_range_end(mm, start, end);
+ tlb_finish_mmu(&tlb, start, end);
+
+ return 0;
+}
+
+static long madvise_free(struct vm_area_struct *vma,
+ struct vm_area_struct **prev,
+ unsigned long start, unsigned long end)
+{
+ *prev = vma;
+ return madvise_free_single_vma(vma, start, end);
+}
+
/*
* Application no longer needs these pages. If the pages are dirty,
* it's OK to just throw them away. The app will be more careful about
@@ -379,6 +502,14 @@ madvise_vma(struct vm_area_struct *vma, struct vm_area_struct **prev,
return madvise_remove(vma, prev, start, end);
case MADV_WILLNEED:
return madvise_willneed(vma, prev, start, end);
+ case MADV_FREE:
+ /*
+ * XXX: In this implementation, MADV_FREE works like
+ * MADV_DONTNEED on swapless system or full swap.
+ */
+ if (get_nr_swap_pages() > 0)
+ return madvise_free(vma, prev, start, end);
+ /* passthrough */
case MADV_DONTNEED:
return madvise_dontneed(vma, prev, start, end);
default:
@@ -398,6 +529,7 @@ madvise_behavior_valid(int behavior)
case MADV_REMOVE:
case MADV_WILLNEED:
case MADV_DONTNEED:
+ case MADV_FREE:
#ifdef CONFIG_KSM
case MADV_MERGEABLE:
case MADV_UNMERGEABLE:
diff --git a/mm/rmap.c b/mm/rmap.c
index f5b5c1f3dcd7..9449e91839ab 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1374,6 +1374,12 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
swp_entry_t entry = { .val = page_private(page) };
pte_t swp_pte;

+ if (!PageDirty(page) && (flags & TTU_FREE)) {
+ /* It's a freeable page by MADV_FREE */
+ dec_mm_counter(mm, MM_ANONPAGES);
+ goto discard;
+ }
+
if (PageSwapCache(page)) {
/*
* Store the swap location in the pte.
@@ -1414,6 +1420,7 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
} else
dec_mm_counter(mm, MM_FILEPAGES);

+discard:
page_remove_rmap(page);
page_cache_release(page);

diff --git a/mm/swap_state.c b/mm/swap_state.c
index d504adb7fa5f..10f63eded7b7 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -185,13 +185,12 @@ int add_to_swap(struct page *page, struct list_head *list)
* deadlock in the swap out path.
*/
/*
- * Add it to the swap cache and mark it dirty
+ * Add it to the swap cache.
*/
err = add_to_swap_cache(page, entry,
__GFP_HIGH|__GFP_NOMEMALLOC|__GFP_NOWARN);

- if (!err) { /* Success */
- SetPageDirty(page);
+ if (!err) {
return 1;
} else { /* -ENOMEM radix-tree allocation failure */
/*
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 7f63a9381f71..7a415b9fdd34 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -906,6 +906,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
int may_enter_fs;
enum page_references references = PAGEREF_RECLAIM_CLEAN;
bool dirty, writeback;
+ bool freeable = false;

cond_resched();

@@ -1049,6 +1050,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
goto keep_locked;
if (!add_to_swap(page, page_list))
goto activate_locked;
+ freeable = true;
may_enter_fs = 1;

/* Adding to swap updated mapping */
@@ -1060,8 +1062,9 @@ static unsigned long shrink_page_list(struct list_head *page_list,
* processes. Try to unmap it here.
*/
if (page_mapped(page) && mapping) {
- switch (try_to_unmap(page,
- ttu_flags|TTU_BATCH_FLUSH)) {
+ switch (try_to_unmap(page, freeable ?
+ (ttu_flags | TTU_BATCH_FLUSH | TTU_FREE) :
+ (ttu_flags | TTU_BATCH_FLUSH))) {
case SWAP_FAIL:
goto activate_locked;
case SWAP_AGAIN:
@@ -1186,6 +1189,9 @@ static unsigned long shrink_page_list(struct list_head *page_list,
*/
__clear_page_locked(page);
free_it:
+ if (freeable && !PageDirty(page))
+ count_vm_event(PGLAZYFREED);
+
nr_reclaimed++;

/*
diff --git a/mm/vmstat.c b/mm/vmstat.c
index fbf14485a049..59d45b22355f 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -759,6 +759,7 @@ const char * const vmstat_text[] = {

"pgfault",
"pgmajfault",
+ "pglazyfreed",

TEXTS_FOR_ZONES("pgrefill")
TEXTS_FOR_ZONES("pgsteal_kswapd")
--
1.9.1

2015-11-04 01:28:08

by Minchan Kim

Subject: [PATCH v2 02/13] mm: define MADV_FREE for some arches

Most architectures use asm-generic, but alpha, mips, parisc, xtensa
need their own definitions.

This patch defines MADV_FREE for them, so it should fix the build
breakage on those architectures.

Maybe I should split this up and feed the pieces to the individual arch
maintainers, but it is included here for mmotm convenience.

Cc: Michael Kerrisk <[email protected]>
Cc: Richard Henderson <[email protected]>
Cc: Ivan Kokshaysky <[email protected]>
Cc: "James E.J. Bottomley" <[email protected]>
Cc: Helge Deller <[email protected]>
Cc: Ralf Baechle <[email protected]>
Cc: Chris Zankel <[email protected]>
Acked-by: Max Filippov <[email protected]>
Reported-by: kbuild test robot <[email protected]>
Signed-off-by: Minchan Kim <[email protected]>
---
arch/alpha/include/uapi/asm/mman.h | 1 +
arch/mips/include/uapi/asm/mman.h | 1 +
arch/parisc/include/uapi/asm/mman.h | 1 +
arch/xtensa/include/uapi/asm/mman.h | 1 +
4 files changed, 4 insertions(+)

diff --git a/arch/alpha/include/uapi/asm/mman.h b/arch/alpha/include/uapi/asm/mman.h
index 0086b472bc2b..836fbd44f65b 100644
--- a/arch/alpha/include/uapi/asm/mman.h
+++ b/arch/alpha/include/uapi/asm/mman.h
@@ -44,6 +44,7 @@
#define MADV_WILLNEED 3 /* will need these pages */
#define MADV_SPACEAVAIL 5 /* ensure resources are available */
#define MADV_DONTNEED 6 /* don't need these pages */
+#define MADV_FREE 7 /* free pages only if memory pressure */

/* common/generic parameters */
#define MADV_REMOVE 9 /* remove these pages & resources */
diff --git a/arch/mips/include/uapi/asm/mman.h b/arch/mips/include/uapi/asm/mman.h
index cfcb876cae6b..106e741aa7ee 100644
--- a/arch/mips/include/uapi/asm/mman.h
+++ b/arch/mips/include/uapi/asm/mman.h
@@ -67,6 +67,7 @@
#define MADV_SEQUENTIAL 2 /* expect sequential page references */
#define MADV_WILLNEED 3 /* will need these pages */
#define MADV_DONTNEED 4 /* don't need these pages */
+#define MADV_FREE 5 /* free pages only if memory pressure */

/* common parameters: try to keep these consistent across architectures */
#define MADV_REMOVE 9 /* remove these pages & resources */
diff --git a/arch/parisc/include/uapi/asm/mman.h b/arch/parisc/include/uapi/asm/mman.h
index 294d251ca7b2..6cb8db76fd4e 100644
--- a/arch/parisc/include/uapi/asm/mman.h
+++ b/arch/parisc/include/uapi/asm/mman.h
@@ -40,6 +40,7 @@
#define MADV_SPACEAVAIL 5 /* insure that resources are reserved */
#define MADV_VPS_PURGE 6 /* Purge pages from VM page cache */
#define MADV_VPS_INHERIT 7 /* Inherit parents page size */
+#define MADV_FREE 8 /* free pages only if memory pressure */

/* common/generic parameters */
#define MADV_REMOVE 9 /* remove these pages & resources */
diff --git a/arch/xtensa/include/uapi/asm/mman.h b/arch/xtensa/include/uapi/asm/mman.h
index 201aec0e0446..1b19f25bc567 100644
--- a/arch/xtensa/include/uapi/asm/mman.h
+++ b/arch/xtensa/include/uapi/asm/mman.h
@@ -80,6 +80,7 @@
#define MADV_SEQUENTIAL 2 /* expect sequential page references */
#define MADV_WILLNEED 3 /* will need these pages */
#define MADV_DONTNEED 4 /* don't need these pages */
+#define MADV_FREE 5 /* free pages only if memory pressure */

/* common parameters: try to keep these consistent across architectures */
#define MADV_REMOVE 9 /* remove these pages & resources */
--
1.9.1

2015-11-04 01:29:49

by Minchan Kim

Subject: [PATCH v2 03/13] arch: uapi: asm: mman.h: Let MADV_FREE have same value for all architectures

From: Chen Gang <[email protected]>

For uapi, we should try to let all macros have the same value across
architectures. MADV_FREE was added to the main branch recently, so its
value needs to be redefined accordingly.

At present, the value '8' can be shared by all architectures, so
redefine it to '8'.

Cc: [email protected] <[email protected]>
Cc: [email protected] <[email protected]>
Cc: [email protected] <[email protected]>
Cc: Ralf Baechle <[email protected]>
Cc: [email protected] <[email protected]>
Cc: [email protected] <[email protected]>
Cc: [email protected] <[email protected]>
Cc: [email protected] <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Acked-by: Hugh Dickins <[email protected]>
Acked-by: Minchan Kim <[email protected]>
Signed-off-by: Chen Gang <[email protected]>
---
arch/alpha/include/uapi/asm/mman.h | 2 +-
arch/mips/include/uapi/asm/mman.h | 2 +-
arch/parisc/include/uapi/asm/mman.h | 2 +-
arch/xtensa/include/uapi/asm/mman.h | 2 +-
include/uapi/asm-generic/mman-common.h | 2 +-
5 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/arch/alpha/include/uapi/asm/mman.h b/arch/alpha/include/uapi/asm/mman.h
index 836fbd44f65b..0b8a5de7aee3 100644
--- a/arch/alpha/include/uapi/asm/mman.h
+++ b/arch/alpha/include/uapi/asm/mman.h
@@ -44,9 +44,9 @@
#define MADV_WILLNEED 3 /* will need these pages */
#define MADV_SPACEAVAIL 5 /* ensure resources are available */
#define MADV_DONTNEED 6 /* don't need these pages */
-#define MADV_FREE 7 /* free pages only if memory pressure */

/* common/generic parameters */
+#define MADV_FREE 8 /* free pages only if memory pressure */
#define MADV_REMOVE 9 /* remove these pages & resources */
#define MADV_DONTFORK 10 /* don't inherit across fork */
#define MADV_DOFORK 11 /* do inherit across fork */
diff --git a/arch/mips/include/uapi/asm/mman.h b/arch/mips/include/uapi/asm/mman.h
index 106e741aa7ee..d247f5457944 100644
--- a/arch/mips/include/uapi/asm/mman.h
+++ b/arch/mips/include/uapi/asm/mman.h
@@ -67,9 +67,9 @@
#define MADV_SEQUENTIAL 2 /* expect sequential page references */
#define MADV_WILLNEED 3 /* will need these pages */
#define MADV_DONTNEED 4 /* don't need these pages */
-#define MADV_FREE 5 /* free pages only if memory pressure */

/* common parameters: try to keep these consistent across architectures */
+#define MADV_FREE 8 /* free pages only if memory pressure */
#define MADV_REMOVE 9 /* remove these pages & resources */
#define MADV_DONTFORK 10 /* don't inherit across fork */
#define MADV_DOFORK 11 /* do inherit across fork */
diff --git a/arch/parisc/include/uapi/asm/mman.h b/arch/parisc/include/uapi/asm/mman.h
index 6cb8db76fd4e..700d83fd9352 100644
--- a/arch/parisc/include/uapi/asm/mman.h
+++ b/arch/parisc/include/uapi/asm/mman.h
@@ -40,9 +40,9 @@
#define MADV_SPACEAVAIL 5 /* insure that resources are reserved */
#define MADV_VPS_PURGE 6 /* Purge pages from VM page cache */
#define MADV_VPS_INHERIT 7 /* Inherit parents page size */
-#define MADV_FREE 8 /* free pages only if memory pressure */

/* common/generic parameters */
+#define MADV_FREE 8 /* free pages only if memory pressure */
#define MADV_REMOVE 9 /* remove these pages & resources */
#define MADV_DONTFORK 10 /* don't inherit across fork */
#define MADV_DOFORK 11 /* do inherit across fork */
diff --git a/arch/xtensa/include/uapi/asm/mman.h b/arch/xtensa/include/uapi/asm/mman.h
index 1b19f25bc567..77eaca434071 100644
--- a/arch/xtensa/include/uapi/asm/mman.h
+++ b/arch/xtensa/include/uapi/asm/mman.h
@@ -80,9 +80,9 @@
#define MADV_SEQUENTIAL 2 /* expect sequential page references */
#define MADV_WILLNEED 3 /* will need these pages */
#define MADV_DONTNEED 4 /* don't need these pages */
-#define MADV_FREE 5 /* free pages only if memory pressure */

/* common parameters: try to keep these consistent across architectures */
+#define MADV_FREE 8 /* free pages only if memory pressure */
#define MADV_REMOVE 9 /* remove these pages & resources */
#define MADV_DONTFORK 10 /* don't inherit across fork */
#define MADV_DOFORK 11 /* do inherit across fork */
diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
index 7a94102b7a02..869595947873 100644
--- a/include/uapi/asm-generic/mman-common.h
+++ b/include/uapi/asm-generic/mman-common.h
@@ -34,9 +34,9 @@
#define MADV_SEQUENTIAL 2 /* expect sequential page references */
#define MADV_WILLNEED 3 /* will need these pages */
#define MADV_DONTNEED 4 /* don't need these pages */
-#define MADV_FREE 5 /* free pages only if memory pressure */

/* common parameters: try to keep these consistent across architectures */
+#define MADV_FREE 8 /* free pages only if memory pressure */
#define MADV_REMOVE 9 /* remove these pages & resources */
#define MADV_DONTFORK 10 /* don't inherit across fork */
#define MADV_DOFORK 11 /* do inherit across fork */
--
1.9.1

2015-11-04 01:26:13

by Minchan Kim

Subject: [PATCH v2 04/13] mm: free swp_entry in madvise_free

When I tested the piece of code below with 12 processes (ie, 512M * 12
= 6G consumed) on my machine (3G ram + 12 cpu + 8G swap), madvise_free
was significantly slower (ie, 2x) than madvise_dontneed.

loop = 5;
mmap(512M);
while (loop--) {
memset(512M);
madvise(MADV_FREE or MADV_DONTNEED);
}
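
For reference, a runnable version of that reproducer might look like
the following (a sketch: the 512M size and the loop count come from the
snippet above; the rest is assumed):

#include <string.h>
#include <sys/mman.h>

#ifndef MADV_FREE
#define MADV_FREE 8
#endif

#define LEN (512UL << 20)	/* 512M per process, as above */

int main(void)
{
	int loop = 5;
	char *p = mmap(NULL, LEN, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (p == MAP_FAILED)
		return 1;

	while (loop--) {
		memset(p, 1, LEN);		/* dirty every page */
		madvise(p, LEN, MADV_FREE);	/* or MADV_DONTNEED */
	}
	return 0;
}

Run 12 of these in parallel to overcommit 3G of RAM, as in the setup
described above.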

The reason is the large number of swap-ins:

1) dontneed: 1,612 swapin
2) madvfree: 879,585 swapin

If we find that hinted pages were already swapped out by the time the
syscall is called, it's pointless to keep the swapped-out pages
referenced in the ptes. Instead, let's free those cold pages, because a
swap-in is more expensive than (page allocation + zeroing).

With this patch, swap-ins dropped from 879,585 to 1,878, and so did the
elapsed time:

1) dontneed: 6.10user 233.50system 0:50.44elapsed
2) madvfree: 6.03user 401.17system 1:30.67elapsed
3) madvfree + this patch: 6.70user 339.14system 1:04.45elapsed

Acked-by: Michal Hocko <[email protected]>
Acked-by: Hugh Dickins <[email protected]>
Signed-off-by: Minchan Kim <[email protected]>
---
mm/madvise.c | 26 +++++++++++++++++++++++++-
1 file changed, 25 insertions(+), 1 deletion(-)

diff --git a/mm/madvise.c b/mm/madvise.c
index a8813f7b37b3..6240a5de4a3a 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -270,6 +270,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
spinlock_t *ptl;
pte_t *pte, ptent;
struct page *page;
+ int nr_swap = 0;

split_huge_page_pmd(vma, addr, pmd);
if (pmd_trans_unstable(pmd))
@@ -280,8 +281,24 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
for (; addr != end; pte++, addr += PAGE_SIZE) {
ptent = *pte;

- if (!pte_present(ptent))
+ if (pte_none(ptent))
continue;
+ /*
+ * If the pte has swp_entry, just clear page table to
+ * prevent swap-in which is more expensive rather than
+ * (page allocation + zeroing).
+ */
+ if (!pte_present(ptent)) {
+ swp_entry_t entry;
+
+ entry = pte_to_swp_entry(ptent);
+ if (non_swap_entry(entry))
+ continue;
+ nr_swap--;
+ free_swap_and_cache(entry);
+ pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
+ continue;
+ }

page = vm_normal_page(vma, addr, ptent);
if (!page)
@@ -317,6 +334,13 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
}
}

+ if (nr_swap) {
+ if (current->mm == mm)
+ sync_mm_rss(mm);
+
+ add_mm_counter(mm, MM_SWAPENTS, nr_swap);
+ }
+
arch_leave_lazy_mmu_mode();
pte_unmap_unlock(pte - 1, ptl);
cond_resched();
--
1.9.1

2015-11-04 01:29:11

by Minchan Kim

Subject: [PATCH v2 05/13] mm: move lazily freed pages to inactive list

MADV_FREE is a hint that it's okay to discard pages under memory
pressure, and we use the reclaimers (ie, kswapd and direct reclaim) to
free them, so there is no value in keeping them on the active anonymous
LRU. This patch therefore moves them to the head of the inactive LRU
list.

This means that MADV_FREE-ed pages, now living on the inactive list,
are reclaimed first, because they are more likely to be cold than
recently active pages.

An arguable point of the approach is whether we should put the page at
the head or the tail of the inactive list. I chose the head because the
kernel cannot be sure whether the page is really cold or warm for every
MADV_FREE use case, but at least we know it's not *hot*, so landing at
the inactive head is a compromise for the various use cases.

This fixes the suboptimal behavior of MADV_FREE where pages living on
the active list would sit there for a long time, even under memory
pressure, while the inactive list was reclaimed heavily. That basically
defeated the whole purpose of MADV_FREE: helping the system free memory
that might not be used again.

Cc: Johannes Weiner <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Shaohua Li <[email protected]>
Acked-by: Hugh Dickins <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Signed-off-by: Minchan Kim <[email protected]>
---
include/linux/swap.h | 2 +-
mm/madvise.c | 3 +++
mm/swap.c | 62 +++++++++++++++++++++++++++++-----------------------
mm/truncate.c | 2 +-
4 files changed, 40 insertions(+), 29 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 7ba7dccaf0e7..8e944c0cedea 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -307,7 +307,7 @@ extern void lru_add_drain(void);
extern void lru_add_drain_cpu(int cpu);
extern void lru_add_drain_all(void);
extern void rotate_reclaimable_page(struct page *page);
-extern void deactivate_file_page(struct page *page);
+extern void deactivate_page(struct page *page);
extern void swap_setup(void);

extern void add_page_to_unevictable_list(struct page *page);
diff --git a/mm/madvise.c b/mm/madvise.c
index 6240a5de4a3a..3462a3ca9690 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -317,6 +317,9 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
unlock_page(page);
}

+ if (PageActive(page))
+ deactivate_page(page);
+
if (pte_young(ptent) || pte_dirty(ptent)) {
/*
* Some of architecture(ex, PPC) don't update TLB
diff --git a/mm/swap.c b/mm/swap.c
index 983f692a47fd..c76dd4175858 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -44,7 +44,7 @@ int page_cluster;

static DEFINE_PER_CPU(struct pagevec, lru_add_pvec);
static DEFINE_PER_CPU(struct pagevec, lru_rotate_pvecs);
-static DEFINE_PER_CPU(struct pagevec, lru_deactivate_file_pvecs);
+static DEFINE_PER_CPU(struct pagevec, lru_deactivate_pvecs);

/*
* This path almost never happens for VM activity - pages are normally
@@ -733,13 +733,13 @@ void lru_cache_add_active_or_unevictable(struct page *page,
}

/*
- * If the page can not be invalidated, it is moved to the
+ * If the file page can not be invalidated, it is moved to the
* inactive list to speed up its reclaim. It is moved to the
* head of the list, rather than the tail, to give the flusher
* threads some time to write it out, as this is much more
* effective than the single-page writeout from reclaim.
*
- * If the page isn't page_mapped and dirty/writeback, the page
+ * If the file page isn't page_mapped and dirty/writeback, the page
* could reclaim asap using PG_reclaim.
*
* 1. active, mapped page -> none
@@ -752,32 +752,36 @@ void lru_cache_add_active_or_unevictable(struct page *page,
* In 4, why it moves inactive's head, the VM expects the page would
* be write it out by flusher threads as this is much more effective
* than the single-page writeout from reclaim.
+ *
+ * If @page is anonymous page, it is moved to the inactive list.
*/
-static void lru_deactivate_file_fn(struct page *page, struct lruvec *lruvec,
+static void lru_deactivate_fn(struct page *page, struct lruvec *lruvec,
void *arg)
{
- int lru, file;
- bool active;
+ int lru;
+ bool file, active;

- if (!PageLRU(page))
+ if (!PageLRU(page) || PageUnevictable(page))
return;

- if (PageUnevictable(page))
- return;
+ file = page_is_file_cache(page);
+ active = PageActive(page);
+ lru = page_lru_base_type(page);

- /* Some processes are using the page */
- if (page_mapped(page))
+ if (!file && active)
return;

- active = PageActive(page);
- file = page_is_file_cache(page);
- lru = page_lru_base_type(page);
+ if (file && page_mapped(page))
+ return;

del_page_from_lru_list(page, lruvec, lru + active);
ClearPageActive(page);
- ClearPageReferenced(page);
add_page_to_lru_list(page, lruvec, lru);

+ if (!file)
+ goto out;
+
+ ClearPageReferenced(page);
if (PageWriteback(page) || PageDirty(page)) {
/*
* PG_reclaim could be raced with end_page_writeback
@@ -793,9 +797,10 @@ static void lru_deactivate_file_fn(struct page *page, struct lruvec *lruvec,
list_move_tail(&page->lru, &lruvec->lists[lru]);
__count_vm_event(PGROTATED);
}
-
+out:
if (active)
__count_vm_event(PGDEACTIVATE);
+
update_page_reclaim_stat(lruvec, file, 0);
}

@@ -821,22 +826,25 @@ void lru_add_drain_cpu(int cpu)
local_irq_restore(flags);
}

- pvec = &per_cpu(lru_deactivate_file_pvecs, cpu);
+ pvec = &per_cpu(lru_deactivate_pvecs, cpu);
if (pagevec_count(pvec))
- pagevec_lru_move_fn(pvec, lru_deactivate_file_fn, NULL);
+ pagevec_lru_move_fn(pvec, lru_deactivate_fn, NULL);

activate_page_drain(cpu);
}

/**
- * deactivate_file_page - forcefully deactivate a file page
+ * deactivate_page - forcefully deactivate a page
* @page: page to deactivate
*
- * This function hints the VM that @page is a good reclaim candidate,
- * for example if its invalidation fails due to the page being dirty
- * or under writeback.
+ * This function hints the VM that @page is a good reclaim candidate to
+ * accelerate the reclaim of @page.
+ * For example,
+ * 1. Invalidation of file-page fails due to the page being dirty or under
+ * writeback.
+ * 2. MADV_FREE hinted anonymous page.
*/
-void deactivate_file_page(struct page *page)
+void deactivate_page(struct page *page)
{
/*
* In a workload with many unevictable page such as mprotect,
@@ -846,11 +854,11 @@ void deactivate_file_page(struct page *page)
return;

if (likely(get_page_unless_zero(page))) {
- struct pagevec *pvec = &get_cpu_var(lru_deactivate_file_pvecs);
+ struct pagevec *pvec = &get_cpu_var(lru_deactivate_pvecs);

if (!pagevec_add(pvec, page))
- pagevec_lru_move_fn(pvec, lru_deactivate_file_fn, NULL);
- put_cpu_var(lru_deactivate_file_pvecs);
+ pagevec_lru_move_fn(pvec, lru_deactivate_fn, NULL);
+ put_cpu_var(lru_deactivate_pvecs);
}
}

@@ -882,7 +890,7 @@ void lru_add_drain_all(void)

if (pagevec_count(&per_cpu(lru_add_pvec, cpu)) ||
pagevec_count(&per_cpu(lru_rotate_pvecs, cpu)) ||
- pagevec_count(&per_cpu(lru_deactivate_file_pvecs, cpu)) ||
+ pagevec_count(&per_cpu(lru_deactivate_pvecs, cpu)) ||
need_activate_page_drain(cpu)) {
INIT_WORK(work, lru_add_drain_per_cpu);
schedule_work_on(cpu, work);
diff --git a/mm/truncate.c b/mm/truncate.c
index 76e35ad97102..cf8d44679364 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -488,7 +488,7 @@ unsigned long invalidate_mapping_pages(struct address_space *mapping,
* of interest and try to speed up its reclaim.
*/
if (!ret)
- deactivate_file_page(page);
+ deactivate_page(page);
count += ret;
}
pagevec_remove_exceptionals(&pvec);
--
1.9.1

2015-11-04 01:29:10

by Minchan Kim

Subject: [PATCH v2 06/13] mm: clear PG_dirty to mark page freeable

Basically, MADV_FREE relies on the dirty bit in the page table entry to
decide whether the VM is allowed to discard the page or not. IOW, if
the page table entry has the dirty bit set, the VM shouldn't discard
the page.

However, if, for example, a swap-in happens via a read fault, the page
table entry doesn't have the dirty bit set, so MADV_FREE could wrongly
discard the page.

To avoid that problem, MADV_FREE does additional checks on PageDirty
and PageSwapCache. That works because a swapped-in page lives in the
swap cache, and when it is evicted from the swap cache the page gets
the PG_dirty flag. So checking both page flags effectively prevents
wrong discards by MADV_FREE.

However, the problem with the above logic is that a swapped-in page
keeps PG_dirty after it is removed from the swap cache, so the VM can
never again consider the page freeable, even if madvise_free is called
on it later.

Look at the example below for details.

ptr = malloc();
memset(ptr);
..
..
.. heavy memory pressure so all of pages are swapped out
..
..
var = *ptr; -> a page swapped-in and could be removed from
swapcache. Then, page table doesn't mark
dirty bit and page descriptor includes PG_dirty
..
..
madvise_free(ptr); -> It doesn't clear PG_dirty of the page.
..
..
..
.. heavy memory pressure again.
.. In this time, VM cannot discard the page because the page
.. has *PG_dirty*

To solve the problem, this patch clears PG_dirty when madvise is
called, but only if the page is owned exclusively by the current
process, because PG_dirty represents the dirtiness of ptes across
several processes, so we may clear it only when we own the page
exclusively.

Acked-by: Michal Hocko <[email protected]>
Acked-by: Hugh Dickins <[email protected]>
Signed-off-by: Minchan Kim <[email protected]>
---
mm/madvise.c | 12 ++++++++++--
1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/mm/madvise.c b/mm/madvise.c
index 3462a3ca9690..4e67ba0b1104 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -304,11 +304,19 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
if (!page)
continue;

- if (PageSwapCache(page)) {
+ if (PageSwapCache(page) || PageDirty(page)) {
if (!trylock_page(page))
continue;
+ /*
+ * If page is shared with others, we couldn't clear
+ * PG_dirty of the page.
+ */
+ if (page_count(page) != 1 + !!PageSwapCache(page)) {
+ unlock_page(page);
+ continue;
+ }

- if (!try_to_free_swap(page)) {
+ if (PageSwapCache(page) && !try_to_free_swap(page)) {
unlock_page(page);
continue;
}
--
1.9.1

2015-11-04 01:26:24

by Minchan Kim

Subject: [PATCH v2 07/13] mm: mark stable page dirty in KSM

The MADV_FREE patchset changes page reclaim to simply free a clean
anonymous page with no dirty ptes, instead of swapping it out; but
KSM uses clean write-protected ptes to reference the stable ksm page.
So be sure to mark that page dirty, so it's never mistakenly discarded.

[hughd: adjusted comments]
Acked-by: Hugh Dickins <[email protected]>
Signed-off-by: Minchan Kim <[email protected]>
---
mm/ksm.c | 6 ++++++
1 file changed, 6 insertions(+)

diff --git a/mm/ksm.c b/mm/ksm.c
index 7ee101eaacdf..18d2b7afecff 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -1053,6 +1053,12 @@ static int try_to_merge_one_page(struct vm_area_struct *vma,
*/
set_page_stable_node(page, NULL);
mark_page_accessed(page);
+ /*
+ * Page reclaim just frees a clean page with no dirty
+ * ptes: make sure that the ksm page would be swapped.
+ */
+ if (!PageDirty(page))
+ SetPageDirty(page);
err = 0;
} else if (pages_identical(page, kpage))
err = replace_page(vma, page, kpage, orig_pte);
--
1.9.1

2015-11-04 01:26:22

by Minchan Kim

Subject: [PATCH v2 08/13] x86: add pmd_[dirty|mkclean] for THP

MADV_FREE needs pmd_dirty and pmd_mkclean to detect recent overwrites
of the contents, since the MADV_FREE syscall can be called on a THP
page.

This patch adds pmd_dirty and pmd_mkclean for THP page MADV_FREE
support.

Signed-off-by: Minchan Kim <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
---
arch/x86/include/asm/pgtable.h | 5 +++++
1 file changed, 5 insertions(+)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 867da5bbb4a3..b964d54300e1 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -267,6 +267,11 @@ static inline pmd_t pmd_mkold(pmd_t pmd)
return pmd_clear_flags(pmd, _PAGE_ACCESSED);
}

+static inline pmd_t pmd_mkclean(pmd_t pmd)
+{
+ return pmd_clear_flags(pmd, _PAGE_DIRTY);
+}
+
static inline pmd_t pmd_wrprotect(pmd_t pmd)
{
return pmd_clear_flags(pmd, _PAGE_RW);
--
1.9.1

2015-11-04 01:26:18

by Minchan Kim

Subject: [PATCH v2 09/13] sparc: add pmd_[dirty|mkclean] for THP

MADV_FREE needs pmd_dirty and pmd_mkclean to detect recent overwrites
of the contents, since the MADV_FREE syscall can be called on a THP
page.

This patch adds pmd_dirty and pmd_mkclean for THP page MADV_FREE
support.

Signed-off-by: Minchan Kim <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
---
arch/sparc/include/asm/pgtable_64.h | 9 +++++++++
1 file changed, 9 insertions(+)

diff --git a/arch/sparc/include/asm/pgtable_64.h b/arch/sparc/include/asm/pgtable_64.h
index 131d36fcd07a..5833dc5ee7d7 100644
--- a/arch/sparc/include/asm/pgtable_64.h
+++ b/arch/sparc/include/asm/pgtable_64.h
@@ -717,6 +717,15 @@ static inline pmd_t pmd_mkdirty(pmd_t pmd)
return __pmd(pte_val(pte));
}

+static inline pmd_t pmd_mkclean(pmd_t pmd)
+{
+ pte_t pte = __pte(pmd_val(pmd));
+
+ pte = pte_mkclean(pte);
+
+ return __pmd(pte_val(pte));
+}
+
static inline pmd_t pmd_mkyoung(pmd_t pmd)
{
pte_t pte = __pte(pmd_val(pmd));
--
1.9.1

2015-11-04 01:26:20

by Minchan Kim

Subject: [PATCH v2 10/13] powerpc: add pmd_[dirty|mkclean] for THP

MADV_FREE needs pmd_dirty and pmd_mkclean to detect recent overwrites
of the contents, since the MADV_FREE syscall can be called on a THP
page.

This patch adds pmd_dirty and pmd_mkclean for THP page MADV_FREE
support.

Signed-off-by: Minchan Kim <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
---
arch/powerpc/include/asm/pgtable-ppc64.h | 2 ++
1 file changed, 2 insertions(+)

diff --git a/arch/powerpc/include/asm/pgtable-ppc64.h b/arch/powerpc/include/asm/pgtable-ppc64.h
index fa1dfb7f7b48..85e15c8067be 100644
--- a/arch/powerpc/include/asm/pgtable-ppc64.h
+++ b/arch/powerpc/include/asm/pgtable-ppc64.h
@@ -507,9 +507,11 @@ static inline pte_t *pmdp_ptep(pmd_t *pmd)
#define pmd_pfn(pmd) pte_pfn(pmd_pte(pmd))
#define pmd_young(pmd) pte_young(pmd_pte(pmd))
+#define pmd_dirty(pmd) pte_dirty(pmd_pte(pmd))
#define pmd_mkold(pmd) pte_pmd(pte_mkold(pmd_pte(pmd)))
#define pmd_wrprotect(pmd) pte_pmd(pte_wrprotect(pmd_pte(pmd)))
#define pmd_mkdirty(pmd) pte_pmd(pte_mkdirty(pmd_pte(pmd)))
+#define pmd_mkclean(pmd) pte_pmd(pte_mkclean(pmd_pte(pmd)))
#define pmd_mkyoung(pmd) pte_pmd(pte_mkyoung(pmd_pte(pmd)))
#define pmd_mkwrite(pmd) pte_pmd(pte_mkwrite(pmd_pte(pmd)))

--
1.9.1

2015-11-04 01:26:26

by Minchan Kim

Subject: [PATCH v2 11/13] arm: add pmd_mkclean for THP

MADV_FREE needs pmd_dirty and pmd_mkclean to detect recent overwrites
of the contents, since the MADV_FREE syscall can be called on a THP
page.

This patch adds pmd_mkclean for THP page MADV_FREE support.

Signed-off-by: Minchan Kim <[email protected]>
---
arch/arm/include/asm/pgtable-3level.h | 1 +
1 file changed, 1 insertion(+)

diff --git a/arch/arm/include/asm/pgtable-3level.h b/arch/arm/include/asm/pgtable-3level.h
index a745a2a53853..6d6012a320b2 100644
--- a/arch/arm/include/asm/pgtable-3level.h
+++ b/arch/arm/include/asm/pgtable-3level.h
@@ -249,6 +249,7 @@ PMD_BIT_FUNC(mkold, &= ~PMD_SECT_AF);
PMD_BIT_FUNC(mksplitting, |= L_PMD_SECT_SPLITTING);
PMD_BIT_FUNC(mkwrite, &= ~L_PMD_SECT_RDONLY);
PMD_BIT_FUNC(mkdirty, |= L_PMD_SECT_DIRTY);
+PMD_BIT_FUNC(mkclean, &= ~L_PMD_SECT_DIRTY);
PMD_BIT_FUNC(mkyoung, |= PMD_SECT_AF);

#define pmd_mkhuge(pmd) (__pmd(pmd_val(pmd) & ~PMD_TABLE_BIT))
--
1.9.1

2015-11-04 01:27:49

by Minchan Kim

Subject: [PATCH v2 12/13] arm64: add pmd_mkclean for THP

MADV_FREE needs pmd_dirty and pmd_mkclean to detect recent overwrites
of the contents, since the MADV_FREE syscall can be called on a THP
page.

This patch adds pmd_mkclean for THP page MADV_FREE support.

Signed-off-by: Minchan Kim <[email protected]>
---
arch/arm64/include/asm/pgtable.h | 1 +
1 file changed, 1 insertion(+)

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 26b066690593..a945263addd4 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -325,6 +325,7 @@ void pmdp_splitting_flush(struct vm_area_struct *vma, unsigned long address,
#define pmd_mksplitting(pmd) pte_pmd(pte_mkspecial(pmd_pte(pmd)))
#define pmd_mkold(pmd) pte_pmd(pte_mkold(pmd_pte(pmd)))
#define pmd_mkwrite(pmd) pte_pmd(pte_mkwrite(pmd_pte(pmd)))
+#define pmd_mkclean(pmd) pte_pmd(pte_mkclean(pmd_pte(pmd)))
#define pmd_mkdirty(pmd) pte_pmd(pte_mkdirty(pmd_pte(pmd)))
#define pmd_mkyoung(pmd) pte_pmd(pte_mkyoung(pmd_pte(pmd)))
#define pmd_mknotpresent(pmd) (__pmd(pmd_val(pmd) & ~PMD_TYPE_MASK))
--
1.9.1

2015-11-04 01:27:33

by Minchan Kim

Subject: [PATCH v2 13/13] mm: don't split THP page when syscall is called

We don't need to split a THP page when the MADV_FREE syscall is called.
The split can be done later, when the VM decides to free the page on
the reclaim path under heavy memory pressure, so we avoid unnecessary
THP splits.

To that end, this patch changes two things:

1. __split_huge_page_map

It applies pte_mkdirty to subpages only if pmd_dirty is true.

2. __split_huge_page_refcount

It stops marking subpages PG_dirty unconditionally.

Cc: Kirill A. Shutemov <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Signed-off-by: Minchan Kim <[email protected]>
---

I don't understand why the THP split code has applied both SetPageDirty
and pte_mkdirty to subpages unconditionally from the beginning. I guess
that at the time there was no MADV_FREE, so it was not a problem and it
was safer to mark both, but I might be missing something. Anyway, this
patch hasn't shown any problem in a few days of my testing.

I intentionally added the pmd_dirty-respecting code only to the split
path of THP, not to collapse, because I don't see how it would help in
real practice. Anyway, it could be done in the future if needed, I
think.

include/linux/huge_mm.h | 3 +++
mm/huge_memory.c | 46 ++++++++++++++++++++++++++++++++++++++++++----
mm/madvise.c | 12 +++++++++++-
3 files changed, 56 insertions(+), 5 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index ecb080d6ff42..e9db238a75c1 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -19,6 +19,9 @@ extern struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,
unsigned long addr,
pmd_t *pmd,
unsigned int flags);
+extern int madvise_free_huge_pmd(struct mmu_gather *tlb,
+ struct vm_area_struct *vma,
+ pmd_t *pmd, unsigned long addr);
extern int zap_huge_pmd(struct mmu_gather *tlb,
struct vm_area_struct *vma,
pmd_t *pmd, unsigned long addr);
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index bbac913f96bc..b8c9b44af864 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1453,6 +1453,41 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
return 0;
}

+int madvise_free_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
+ pmd_t *pmd, unsigned long addr)
+
+{
+ spinlock_t *ptl;
+ pmd_t orig_pmd;
+ struct page *page;
+ struct mm_struct *mm = tlb->mm;
+
+ if (__pmd_trans_huge_lock(pmd, vma, &ptl) != 1)
+ return 1;
+
+ orig_pmd = *pmd;
+ if (is_huge_zero_pmd(orig_pmd))
+ goto out;
+
+ page = pmd_page(orig_pmd);
+ if (PageActive(page))
+ deactivate_page(page);
+
+ if (pmd_young(orig_pmd) || pmd_dirty(orig_pmd)) {
+ orig_pmd = pmdp_huge_get_and_clear_full(tlb->mm, addr, pmd,
+ tlb->fullmm);
+ orig_pmd = pmd_mkold(orig_pmd);
+ orig_pmd = pmd_mkclean(orig_pmd);
+
+ set_pmd_at(mm, addr, pmd, orig_pmd);
+ tlb_remove_pmd_tlb_entry(tlb, pmd, addr);
+ }
+out:
+ spin_unlock(ptl);
+
+ return 0;
+}
+
int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
pmd_t *pmd, unsigned long addr)
{
@@ -1752,8 +1787,8 @@ static void __split_huge_page_refcount(struct page *page,
(1L << PG_mlocked) |
(1L << PG_uptodate) |
(1L << PG_active) |
- (1L << PG_unevictable)));
- page_tail->flags |= (1L << PG_dirty);
+ (1L << PG_unevictable) |
+ (1L << PG_dirty)));

/* clear PageTail before overwriting first_page */
smp_wmb();
@@ -1787,7 +1822,6 @@ static void __split_huge_page_refcount(struct page *page,

BUG_ON(!PageAnon(page_tail));
BUG_ON(!PageUptodate(page_tail));
- BUG_ON(!PageDirty(page_tail));
BUG_ON(!PageSwapBacked(page_tail));

lru_add_page_tail(page, page_tail, lruvec, list);
@@ -1831,10 +1865,12 @@ static int __split_huge_page_map(struct page *page,
int ret = 0, i;
pgtable_t pgtable;
unsigned long haddr;
+ bool dirty;

pmd = page_check_address_pmd(page, mm, address,
PAGE_CHECK_ADDRESS_PMD_SPLITTING_FLAG, &ptl);
if (pmd) {
+ dirty = pmd_dirty(*pmd);
pgtable = pgtable_trans_huge_withdraw(mm, pmd);
pmd_populate(mm, &_pmd, pgtable);
if (pmd_write(*pmd))
@@ -1850,7 +1886,9 @@ static int __split_huge_page_map(struct page *page,
* permissions across VMAs.
*/
entry = mk_pte(page + i, vma->vm_page_prot);
- entry = maybe_mkwrite(pte_mkdirty(entry), vma);
+ if (dirty)
+ entry = pte_mkdirty(entry);
+ entry = maybe_mkwrite(entry, vma);
if (!pmd_write(*pmd))
entry = pte_wrprotect(entry);
if (!pmd_young(*pmd))
diff --git a/mm/madvise.c b/mm/madvise.c
index 4e67ba0b1104..27ed057c0bd7 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -271,8 +271,17 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
pte_t *pte, ptent;
struct page *page;
int nr_swap = 0;
+ unsigned long next;
+
+ next = pmd_addr_end(addr, end);
+ if (pmd_trans_huge(*pmd)) {
+ if (next - addr != HPAGE_PMD_SIZE)
+ split_huge_page_pmd(vma, addr, pmd);
+ else if (!madvise_free_huge_pmd(tlb, vma, pmd, addr))
+ goto next;
+ /* fall through */
+ }

- split_huge_page_pmd(vma, addr, pmd);
if (pmd_trans_unstable(pmd))
return 0;

@@ -355,6 +364,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
arch_leave_lazy_mmu_mode();
pte_unmap_unlock(pte - 1, ptl);
cond_resched();
+next:
return 0;
}

--
1.9.1

2015-11-04 02:15:32

by Sergey Senozhatsky

Subject: Re: [PATCH v2 01/13] mm: support madvise(MADV_FREE)

Hi Minchan,

On (11/04/15 10:25), Minchan Kim wrote:
[..]
>+static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
>+ unsigned long end, struct mm_walk *walk)
>+
...
> + if (pmd_trans_unstable(pmd))
> + return 0;

I think it makes sense to update the pmd_trans_unstable() and
pmd_none_or_trans_huge_or_clear_bad() comments in asm-generic/pgtable.h,
because they explicitly mention MADV_DONTNEED only. Just a thought.


> @@ -379,6 +502,14 @@ madvise_vma(struct vm_area_struct *vma, struct vm_area_struct **prev,
> return madvise_remove(vma, prev, start, end);
> case MADV_WILLNEED:
> return madvise_willneed(vma, prev, start, end);
> + case MADV_FREE:
> + /*
> + * XXX: In this implementation, MADV_FREE works like
^^^^
XXX

> + * MADV_DONTNEED on swapless system or full swap.
> + */
> + if (get_nr_swap_pages() > 0)
> + return madvise_free(vma, prev, start, end);
> + /* passthrough */

-ss

2015-11-04 02:30:54

by Sergey Senozhatsky

Subject: Re: [PATCH v2 01/13] mm: support madvise(MADV_FREE)

On (11/04/15 10:25), Minchan Kim wrote:
[..]
> +static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
> + unsigned long end, struct mm_walk *walk)
> +
> +{
> + struct mmu_gather *tlb = walk->private;
> + struct mm_struct *mm = tlb->mm;
> + struct vm_area_struct *vma = walk->vma;
> + spinlock_t *ptl;
> + pte_t *pte, ptent;
> + struct page *page;


I'll just ask (probably I'm missing something)

+ pmd_trans_huge_lock() ?

> + split_huge_page_pmd(vma, addr, pmd);
> + if (pmd_trans_unstable(pmd))
> + return 0;
> +
> + pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
> + arch_enter_lazy_mmu_mode();
> + for (; addr != end; pte++, addr += PAGE_SIZE) {
> + ptent = *pte;
> +
> + if (!pte_present(ptent))
> + continue;
> +
> + page = vm_normal_page(vma, addr, ptent);
> + if (!page)
> + continue;
> +

-ss

2015-11-04 03:41:57

by Andy Lutomirski

Subject: Re: [PATCH v2 01/13] mm: support madvise(MADV_FREE)

On Nov 3, 2015 5:30 PM, "Minchan Kim" <[email protected]> wrote:
>
> Linux doesn't have an ability to free pages lazy while other OS already
> have been supported that named by madvise(MADV_FREE).
>
> The gain is clear that kernel can discard freed pages rather than swapping
> out or OOM if memory pressure happens.
>
> Without memory pressure, freed pages would be reused by userspace without
> another additional overhead(ex, page fault + allocation + zeroing).
>

[...]

>
> How it works:
>
> When madvise syscall is called, VM clears dirty bit of ptes of the range.
> If memory pressure happens, VM checks dirty bit of page table and if it
> found still "clean", it means it's a "lazyfree pages" so VM could discard
> the page instead of swapping out. Once there was store operation for the
> page before VM peek a page to reclaim, dirty bit is set so VM can swap out
> the page instead of discarding.

What happens if you MADV_FREE something that's MAP_SHARED or isn't
ordinary anonymous memory? There's a long history of MADV_DONTNEED on
such mappings causing exploitable problems, and I think it would be
nice if MADV_FREE were obviously safe.

Does this set the write protect bit?

What happens on architectures without hardware dirty tracking? For
that matter, even on architecture with hardware dirty tracking, what
happens in multithreaded processes that have the dirty TLB state
cached in a different CPU's TLB?

Using the dirty bit for these semantics scares me. This API creates a
page that can have visible nonzero contents and then can
asynchronously and magically zero itself thereafter. That makes me
nervous. Could we use the accessed bit instead? Then the observable
semantics would be equivalent to having MADV_FREE either zero the page
or do nothing, except that it doesn't make up its mind until the next
read.

> + ptent = pte_mkold(ptent);
> + ptent = pte_mkclean(ptent);
> + set_pte_at(mm, addr, pte, ptent);
> + tlb_remove_tlb_entry(tlb, pte, addr);

It looks like you are flushing the TLB. In a multithreaded program,
that's rather expensive. Potentially silly question: would it be
better to just zero the page immediately in a multithreaded program
and then, when swapping out, check the page is zeroed and, if so, skip
swapping it out? That could be done without forcing an IPI.

> +static int madvise_free_single_vma(struct vm_area_struct *vma,
> + unsigned long start_addr, unsigned long end_addr)
> +{
> + unsigned long start, end;
> + struct mm_struct *mm = vma->vm_mm;
> + struct mmu_gather tlb;
> +
> + if (vma->vm_flags & (VM_LOCKED|VM_HUGETLB|VM_PFNMAP))
> + return -EINVAL;
> +
> + /* MADV_FREE works for only anon vma at the moment */
> + if (!vma_is_anonymous(vma))
> + return -EINVAL;

Does anything weird happen if it's shared?

> + if (!PageDirty(page) && (flags & TTU_FREE)) {
> + /* It's a freeable page by MADV_FREE */
> + dec_mm_counter(mm, MM_ANONPAGES);
> + goto discard;
> + }

Does something clear TTU_FREE the next time the page gets marked clean?

--Andy

2015-11-04 05:51:14

by Daniel Micay

[permalink] [raw]
Subject: Re: [PATCH v2 01/13] mm: support madvise(MADV_FREE)

> Does this set the write protect bit?
>
> What happens on architectures without hardware dirty tracking?

It's supposed to avoid needing page faults when the data is accessed
again, but it can just be implemented via page faults on architectures
without a way to check for access or writes. MADV_DONTNEED is also a
valid implementation of MADV_FREE if it comes to that (which is what it
does on swapless systems for now).

> Using the dirty bit for these semantics scares me. This API creates a
> page that can have visible nonzero contents and then can
> asynchronously and magically zero itself thereafter. That makes me
> nervous. Could we use the accessed bit instead? Then the observable
> semantics would be equivalent to having MADV_FREE either zero the page
> or do nothing, except that it doesn't make up its mind until the next
> read.

FWIW, those are already basically the semantics provided by GCC and LLVM
for data the compiler considers uninitialized (they could be more
aggressive, since C just says it's undefined, but in practice they allow
reading it, though it can produce inconsistent results even if the data
isn't touched).

http://llvm.org/docs/LangRef.html#undefined-values

It doesn't seem like there would be an advantage to checking if the data
was written to vs. whether it was accessed if checking for both of those
is comparable in performance. I don't know enough about that.

>> + ptent = pte_mkold(ptent);
>> + ptent = pte_mkclean(ptent);
>> + set_pte_at(mm, addr, pte, ptent);
>> + tlb_remove_tlb_entry(tlb, pte, addr);
>
> It looks like you are flushing the TLB. In a multithreaded program,
> that's rather expensive. Potentially silly question: would it be
> better to just zero the page immediately in a multithreaded program
> and then, when swapping out, check the page is zeroed and, if so, skip
> swapping it out? That could be done without forcing an IPI.

In the common case it will be passed many pages by the allocator. There
will still be a layer of purging logic on top of MADV_FREE but it can be
much thinner than the current workarounds for MADV_DONTNEED. So the
allocator would still be coalescing dirty ranges and only purging when
the ratio of dirty:clean pages rises above some threshold. It would be
able to weight the largest ranges for purging first rather than logic
based on stuff like aging as is used for MADV_DONTNEED.



2015-11-04 05:54:21

by Daniel Micay

[permalink] [raw]
Subject: Re: [PATCH v2 01/13] mm: support madvise(MADV_FREE)

> In the common case it will be passed many pages by the allocator. There
> will still be a layer of purging logic on top of MADV_FREE but it can be
> much thinner than the current workarounds for MADV_DONTNEED. So the
> allocator would still be coalescing dirty ranges and only purging when
> the ratio of dirty:clean pages rises above some threshold. It would be
> able to weight the largest ranges for purging first rather than logic
> based on stuff like aging as is used for MADV_DONTNEED.

I would expect that jemalloc would just start putting the dirty ranges
into the usual pair of red-black trees (with coalescing) and then purging
starting from the largest spans to get back down below whatever
dirty:clean ratio it's trying to keep. Right now, it has lots of other
logic to deal with this, since each MADV_DONTNEED call results in lots of
zeroing and then page faults.



2015-11-04 06:05:29

by Daniel Micay

[permalink] [raw]
Subject: Re: [PATCH v2 01/13] mm: support madvise(MADV_FREE)

On 04/11/15 12:53 AM, Daniel Micay wrote:
>> In the common case it will be passed many pages by the allocator. There
>> will still be a layer of purging logic on top of MADV_FREE but it can be
>> much thinner than the current workarounds for MADV_DONTNEED. So the
>> allocator would still be coalescing dirty ranges and only purging when
>> the ratio of dirty:clean pages rises above some threshold. It would be
>> able to weight the largest ranges for purging first rather than logic
>> based on stuff like aging as is used for MADV_DONTNEED.
>
> I would expect that jemalloc would just start putting the dirty ranges
> into the usual pair of red-black trees (with coalescing) and then purging
> starting from the largest spans to get back down below whatever
> dirty:clean ratio it's trying to keep. Right now, it has lots of other
> logic to deal with this, since each MADV_DONTNEED call results in lots of
> zeroing and then page faults.

Er, I mean dirty:active (i.e. the ratio of unpurged dirty pages to pages
that are handed out as allocations, which is kept at something like 1:8).
A high constant cost per madvise call combined with quick handling of
each page means that allocators are pushed to purge more than they
strictly need to in one go. For example, jemalloc might need to purge 2M
to meet the ratio while having a contiguous span of 32M of dirty pages.
If the cost per page is low enough, it could just do the entire range.
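
As a sketch of that policy (all names here are made up, MADV_FREE is
assumed to be defined, and the spans are assumed to arrive sorted
largest-first, e.g. pulled from a size-ordered tree):

#include <stddef.h>
#include <sys/mman.h>

struct span { void *addr; size_t len; };

/* Purge the biggest spans first until the unpurged dirty bytes fall
 * back under 1/8 of the active bytes. */
static void purge_to_ratio(struct span *spans, size_t nspans,
                           size_t *dirty, size_t active)
{
        for (size_t i = 0; i < nspans && *dirty * 8 > active; i++) {
                if (madvise(spans[i].addr, spans[i].len, MADV_FREE) == 0)
                        *dirty -= spans[i].len;
        }
}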



2015-11-04 18:23:37

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH v2 01/13] mm: support madvise(MADV_FREE)

On Tue, Nov 3, 2015 at 9:50 PM, Daniel Micay <[email protected]> wrote:
>> Does this set the write protect bit?
>>
>> What happens on architectures without hardware dirty tracking?
>
> It's supposed to avoid needing page faults when the data is accessed
> again, but it can just be implemented via page faults on architectures
> without a way to check for access or writes. MADV_DONTNEED is also a
> valid implementation of MADV_FREE if it comes to that (which is what it
> does on swapless systems for now).

I wonder whether arches without the requisite tracking should just
turn it off. While it might be faster than MADV_DONTNEED or munmap on
those arches, it doesn't really deserve to be faster.

>
>> Using the dirty bit for these semantics scares me. This API creates a
>> page that can have visible nonzero contents and then can
>> asynchronously and magically zero itself thereafter. That makes me
>> nervous. Could we use the accessed bit instead? Then the observable
>> semantics would be equivalent to having MADV_FREE either zero the page
>> or do nothing, except that it doesn't make up its mind until the next
>> read.
>
> FWIW, those are already basically the semantics provided by GCC and LLVM
> for data the compiler considers uninitialized (they could be more
> aggressive, since C just says it's undefined, but in practice they allow
> reading it, though it can produce inconsistent results even if the data
> isn't touched).
>
> http://llvm.org/docs/LangRef.html#undefined-values

But C isn't the only thing in the world. Also, I think that a C
optimizer should be free to turn:

if ([complicated condition])
        *ptr = 1;

into:

if (*ptr != 1 && [complicated condition])
        *ptr = 1;

as long as [complicated condition] has no side effects. The MADV_FREE
semantics in this patch set break that.

>
> It doesn't seem like there would be an advantage to checking if the data
> was written to vs. whether it was accessed if checking for both of those
> is comparable in performance. I don't know enough about that.

I'd imagine that there would be no performance difference whatsoever
on hardware that has a real accessed bit. The only thing that changes
is the choice of which bit to use.

>
>>> + ptent = pte_mkold(ptent);
>>> + ptent = pte_mkclean(ptent);
>>> + set_pte_at(mm, addr, pte, ptent);
>>> + tlb_remove_tlb_entry(tlb, pte, addr);
>>
>> It looks like you are flushing the TLB. In a multithreaded program,
>> that's rather expensive. Potentially silly question: would it be
>> better to just zero the page immediately in a multithreaded program
>> and then, when swapping out, check the page is zeroed and, if so, skip
>> swapping it out? That could be done without forcing an IPI.
>
> In the common case it will be passed many pages by the allocator. There
> will still be a layer of purging logic on top of MADV_FREE but it can be
> much thinner than the current workarounds for MADV_DONTNEED. So the
> allocator would still be coalescing dirty ranges and only purging when
> the ratio of dirty:clean pages rises above some threshold. It would be
> able to weight the largest ranges for purging first rather than logic
> based on stuff like aging as is used for MADV_DONTNEED.
>

With enough pages at once, though, munmap would be fine, too.

Maybe what's really needed is a MADV_FREE variant that takes an iovec.
On an all-cores multithreaded mm, the TLB shootdown broadcast takes
thousands of cycles on each core more or less regardless of how much
of the TLB gets zapped.
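
For illustration, such a variant might look like this from userspace
(madvisev() is purely hypothetical; nothing like it exists in this
series):

#include <stddef.h>
#include <sys/uio.h>

/* Hypothetical syscall: apply one advice value to many ranges with
 * at most one TLB shootdown, instead of one broadcast per call. */
int madvisev(const struct iovec *ranges, int nranges, int advice);

/* Usage sketch: free three discontiguous spans in one call. */
static void free_spans(void *a, void *b, void *c, size_t len, int advice)
{
        struct iovec v[3] = {
                { .iov_base = a, .iov_len = len },
                { .iov_base = b, .iov_len = len },
                { .iov_base = c, .iov_len = len },
        };

        madvisev(v, 3, advice);
}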

--Andy

2015-11-04 20:00:11

by Shaohua Li

[permalink] [raw]
Subject: Re: [PATCH v2 01/13] mm: support madvise(MADV_FREE)

On Wed, Nov 04, 2015 at 10:25:55AM +0900, Minchan Kim wrote:
> Linux doesn't have the ability to free pages lazily, while other OSes
> have long supported this via madvise(MADV_FREE).
>
> The gain is clear: the kernel can discard freed pages rather than swapping
> them out or OOMing when memory pressure happens.
>
> Without memory pressure, freed pages can be reused by userspace without
> any additional overhead (ex, page fault + allocation + zeroing).
>
> Jason Evans said:
>
> : Facebook has been using MAP_UNINITIALIZED
> : (https://lkml.org/lkml/2012/1/18/308) in some of its applications for
> : several years, but there are operational costs to maintaining this
> : out-of-tree in our kernel and in jemalloc, and we are anxious to retire it
> : in favor of MADV_FREE. When we first enabled MAP_UNINITIALIZED it
> : increased throughput for much of our workload by ~5%, and although the
> : benefit has decreased using newer hardware and kernels, there is still
> : enough benefit that we cannot reasonably retire it without a replacement.
> :
> : Aside from Facebook operations, there are numerous broadly used
> : applications that would benefit from MADV_FREE. The ones that immediately
> : come to mind are redis, varnish, and MariaDB. I don't have much insight
> : into Android internals and development process, but I would hope to see
> : MADV_FREE support eventually end up there as well to benefit applications
> : linked with the integrated jemalloc.
> :
> : jemalloc will use MADV_FREE once it becomes available in the Linux kernel.
> : In fact, jemalloc already uses MADV_FREE or equivalent everywhere it's
> : available: *BSD, OS X, Windows, and Solaris -- every platform except Linux
> : (and AIX, but I'm not sure it even compiles on AIX). The lack of
> : MADV_FREE on Linux forced me down a long series of increasingly
> : sophisticated heuristics for madvise() volume reduction, and even so this
> : remains a common performance issue for people using jemalloc on Linux.
> : Please integrate MADV_FREE; many people will benefit substantially.
>
> How it works:
>
> When the madvise syscall is called, the VM clears the dirty bit in the ptes
> of the range. If memory pressure happens, the VM checks the dirty bit in the
> page table, and if it finds it still "clean", the page is a "lazyfree" page,
> so the VM can discard it instead of swapping it out. If there was a store to
> the page before the VM picked it for reclaim, the dirty bit is set, so the
> VM swaps the page out instead of discarding it.
>
> Firstly, heavy users would be general allocators (ex, jemalloc, tcmalloc,
> and hopefully glibc), and jemalloc/tcmalloc already support the feature
> on other OSes (ex, FreeBSD).
>
> barrios@blaptop:~/benchmark/ebizzy$ lscpu
> Architecture: x86_64
> CPU op-mode(s): 32-bit, 64-bit
> Byte Order: Little Endian
> CPU(s): 12
> On-line CPU(s) list: 0-11
> Thread(s) per core: 1
> Core(s) per socket: 1
> Socket(s): 12
> NUMA node(s): 1
> Vendor ID: GenuineIntel
> CPU family: 6
> Model: 2
> Stepping: 3
> CPU MHz: 3200.185
> BogoMIPS: 6400.53
> Virtualization: VT-x
> Hypervisor vendor: KVM
> Virtualization type: full
> L1d cache: 32K
> L1i cache: 32K
> L2 cache: 4096K
> NUMA node0 CPU(s): 0-11
> ebizzy benchmark(./ebizzy -S 10 -n 512)
>
> Higher avg is better.
>
> vanilla-jemalloc MADV_free-jemalloc
>
> 1 thread
> records: 10 records: 10
> avg: 2961.90 avg: 12069.70
> std: 71.96(2.43%) std: 186.68(1.55%)
> max: 3070.00 max: 12385.00
> min: 2796.00 min: 11746.00
>
> 2 thread
> records: 10 records: 10
> avg: 5020.00 avg: 17827.00
> std: 264.87(5.28%) std: 358.52(2.01%)
> max: 5244.00 max: 18760.00
> min: 4251.00 min: 17382.00
>
> 4 thread
> records: 10 records: 10
> avg: 8988.80 avg: 27930.80
> std: 1175.33(13.08%) std: 3317.33(11.88%)
> max: 9508.00 max: 30879.00
> min: 5477.00 min: 21024.00
>
> 8 thread
> records: 10 records: 10
> avg: 13036.50 avg: 33739.40
> std: 170.67(1.31%) std: 5146.22(15.25%)
> max: 13371.00 max: 40572.00
> min: 12785.00 min: 24088.00
>
> 16 thread
> records: 10 records: 10
> avg: 11092.40 avg: 31424.20
> std: 710.60(6.41%) std: 3763.89(11.98%)
> max: 12446.00 max: 36635.00
> min: 9949.00 min: 25669.00
>
> 32 thread
> records: 10 records: 10
> avg: 11067.00 avg: 34495.80
> std: 971.06(8.77%) std: 2721.36(7.89%)
> max: 12010.00 max: 38598.00
> min: 9002.00 min: 30636.00
>
> In summary, MADV_FREE is much faster than MADV_DONTNEED.

MADV_FREE has been discussed for a while, so it probably is too late to
propose something new, but we recently had a new idea (from Ben Maurer,
CCed) and think it's better. Our target is still jemalloc.

Compared to MADV_DONTNEED, MADV_FREE's lazy memory free is a huge win for
reducing page faults. But one issue remains: the TLB flush. Both
MADV_DONTNEED and MADV_FREE do a TLB flush, and TLB flush overhead is quite
big in contemporary multi-threaded applications. In our production
workload, we have sometimes observed 80% of CPU time spent on TLB flushes
triggered by jemalloc's madvise(MADV_DONTNEED). We haven't tested MADV_FREE
yet, but the result should be similar. It's hard to avoid the TLB flush
issue with MADV_FREE, because the flush is what prevents data corruption.

The new proposal tries to fix the TLB issue. We introduce two madvise verbs:

MARK_FREE. Userspace notifies the kernel that the memory range can be
discarded. The kernel just records the range at this stage. Should memory
pressure happen, page reclaim can free the memory directly regardless of
the pte state.

MARK_NOFREE. Userspace notifies the kernel that the memory range will be
reused soon. The kernel deletes the record and prevents page reclaim from
discarding the memory. If the memory hasn't been reclaimed, userspace will
access the old memory; otherwise normal page fault handling applies.

The point is to let userspace tell the kernel whether memory can be
discarded, instead of depending on the pte dirty bit as MADV_FREE does.
With these, no TLB flush is required until page reclaim actually frees the
memory (page reclaim needs to do that TLB flush for MADV_FREE too). It
still preserves the lazy memory free merit of MADV_FREE.

Compared to MADV_FREE, reusing memory with the new proposal isn't
transparent, e.g. userspace must call MARK_NOFREE. But it's easy to adopt
the new API in jemalloc.

We don't have code to back this up yet, sorry. We'd like to discuss it if
it makes sense.
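
As a reading of the proposal, the allocator-facing usage might look
roughly like this (MADV_MARK_FREE/MADV_MARK_NOFREE and their values are
invented placeholders, since no code exists yet):

#include <stddef.h>
#include <sys/mman.h>

/* Placeholder advice values -- invented for illustration only. */
#define MADV_MARK_FREE   20
#define MADV_MARK_NOFREE 21

/* Span is freed: record it as discardable; no TLB flush happens. */
static void span_release(void *addr, size_t len)
{
        madvise(addr, len, MADV_MARK_FREE);
}

/* Span is about to be reused: un-mark it first, since reuse is not
 * transparent under this proposal.  The old contents may or may not
 * still be there, so the allocator must treat them as junk. */
static void span_reuse(void *addr, size_t len)
{
        madvise(addr, len, MADV_MARK_NOFREE);
}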

Thanks,
Shaohua

2015-11-04 21:16:44

by Daniel Micay

[permalink] [raw]
Subject: Re: [PATCH v2 01/13] mm: support madvise(MADV_FREE)

> Compared to MADV_DONTNEED, MADV_FREE's lazy memory free is a huge win for
> reducing page faults. But one issue remains: the TLB flush. Both
> MADV_DONTNEED and MADV_FREE do a TLB flush, and TLB flush overhead is quite
> big in contemporary multi-threaded applications. In our production
> workload, we have sometimes observed 80% of CPU time spent on TLB flushes
> triggered by jemalloc's madvise(MADV_DONTNEED). We haven't tested MADV_FREE
> yet, but the result should be similar. It's hard to avoid the TLB flush
> issue with MADV_FREE, because the flush is what prevents data corruption.
>
> The new proposal tries to fix the TLB issue. We introduce two madvise verbs:
>
> MARK_FREE. Userspace notifies the kernel that the memory range can be
> discarded. The kernel just records the range at this stage. Should memory
> pressure happen, page reclaim can free the memory directly regardless of
> the pte state.
>
> MARK_NOFREE. Userspace notifies the kernel that the memory range will be
> reused soon. The kernel deletes the record and prevents page reclaim from
> discarding the memory. If the memory hasn't been reclaimed, userspace will
> access the old memory; otherwise normal page fault handling applies.
>
> The point is to let userspace tell the kernel whether memory can be
> discarded, instead of depending on the pte dirty bit as MADV_FREE does.
> With these, no TLB flush is required until page reclaim actually frees the
> memory (page reclaim needs to do that TLB flush for MADV_FREE too). It
> still preserves the lazy memory free merit of MADV_FREE.
>
> Compared to MADV_FREE, reusing memory with the new proposal isn't
> transparent, e.g. userspace must call MARK_NOFREE. But it's easy to adopt
> the new API in jemalloc.
>
> We don't have code to back this up yet, sorry. We'd like to discuss it if
> it makes sense.

That's comparable to Android's pinning / unpinning API for ashmem and I
think it makes sense if it's faster. It's different than the MADV_FREE
API though, because the new allocations that are handed out won't have
the usual lazy commit which MADV_FREE provides. Pages in an allocation
that's handed out can still be dropped until they are actually written
to. It's considered active by jemalloc either way, but only a subset of
the active pages are actually committed. There's probably a use case for
both of these systems.



2015-11-04 21:29:42

by Daniel Micay

[permalink] [raw]
Subject: Re: [PATCH v2 01/13] mm: support madvise(MADV_FREE)

> That's comparable to Android's pinning / unpinning API for ashmem and I
> think it makes sense if it's faster. It's different than the MADV_FREE
> API though, because the new allocations that are handed out won't have
> the usual lazy commit which MADV_FREE provides. Pages in an allocation
> that's handed out can still be dropped until they are actually written
> to. It's considered active by jemalloc either way, but only a subset of
> the active pages are actually committed. There's probably a use case for
> both of these systems.

Also, consider that MADV_FREE would allow jemalloc to be extremely
aggressive with purging when it actually has to do it. It can start with
the largest span of memory and it can mark more than strictly necessary
to drop below the ratio as there's no cost to using the memory again
(not even a system call).

Since the main cost is the system call itself, there's going to be
pressure to mark the largest possible spans in one go. That means
concentrating on memory compaction will improve performance, and I think
that's the right direction for the kernel to be guiding userspace. It
will play better with THP than the allocator trying to be very precise
with purging based on aging.



2015-11-04 21:43:57

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH v2 01/13] mm: support madvise(MADV_FREE)

On Wed, Nov 4, 2015 at 12:00 PM, Shaohua Li <[email protected]> wrote:
>
> The new proposal tries to fix the TLB issue. We introduce two madvise verbs:
>
> MARK_FREE. Userspace notifies the kernel that the memory range can be
> discarded. The kernel just records the range at this stage. Should memory
> pressure happen, page reclaim can free the memory directly regardless of
> the pte state.
>
> MARK_NOFREE. Userspace notifies the kernel that the memory range will be
> reused soon. The kernel deletes the record and prevents page reclaim from
> discarding the memory. If the memory hasn't been reclaimed, userspace will
> access the old memory; otherwise normal page fault handling applies.
>
> The point is to let userspace tell the kernel whether memory can be
> discarded, instead of depending on the pte dirty bit as MADV_FREE does.
> With these, no TLB flush is required until page reclaim actually frees the
> memory (page reclaim needs to do that TLB flush for MADV_FREE too). It
> still preserves the lazy memory free merit of MADV_FREE.
>
> Compared to MADV_FREE, reusing memory with the new proposal isn't
> transparent, e.g. userspace must call MARK_NOFREE. But it's easy to adopt
> the new API in jemalloc.
>

I can't speak to the usefulness of this or to other arches, but on x86
(unless you have nohz_full or similar enabled), a pair of syscalls
should be *much* faster than an IPI or a page fault.

I don't know how expensive it is to write to a clean page or to access
an unaccessed page on x86. I'm sure it's not free (there's memory
bandwidth if nothing else), but it could be very cheap.

--Andy

2015-11-04 22:06:32

by Daniel Micay

[permalink] [raw]
Subject: Re: [PATCH v2 01/13] mm: support madvise(MADV_FREE)

> With enough pages at once, though, munmap would be fine, too.

That implies lots of page faults and zeroing though. The zeroing alone
is a major performance issue.

There are separate issues with munmap since it ends up resulting in a
lot more virtual memory fragmentation. It would help if the kernel used
first-best-fit for mmap instead of the current naive algorithm (bonus:
O(log n) worst-case, not O(n)). Since allocators like jemalloc and
PartitionAlloc want 2M aligned spans, mixing them with other allocators
can also accelerate the VM fragmentation caused by the dumb mmap
algorithm (i.e. they make a 2M aligned mapping, some other mmap user
does 4k, now there's a nearly 2M gap when the next 2M region is made and
the kernel keeps going rather than reusing it). Anyway, that's a totally
separate issue from this. Just felt like complaining :).
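
For context, allocators typically get a 2M-aligned span by over-mapping
and trimming, which is exactly what leaves those gaps behind; a sketch:

#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

/* Over-map by the alignment, then unmap the misaligned head and the
 * leftover tail.  The trimmed pieces are the small gaps that a naive
 * mmap placement policy never refills. */
static void *map_aligned_2m(size_t len)
{
        const size_t align = 2UL << 20;
        char *raw = mmap(NULL, len + align, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        uintptr_t base, head, tail;

        if (raw == MAP_FAILED)
                return NULL;
        base = ((uintptr_t)raw + align - 1) & ~(align - 1);
        head = base - (uintptr_t)raw;
        tail = align - head;
        if (head)
                munmap(raw, head);
        if (tail)
                munmap((char *)base + len, tail);
        return (void *)base;
}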

> Maybe what's really needed is a MADV_FREE variant that takes an iovec.
> On an all-cores multithreaded mm, the TLB shootdown broadcast takes
> thousands of cycles on each core more or less regardless of how much
> of the TLB gets zapped.

That would work very well. The allocator ends up having a sequence of
dirty spans that it needs to purge in one go. As long as purging is
fairly spread out, the cost of a single TLB shootdown isn't that bad. It
is extremely bad if it needs to do it over and over to purge a bunch of
ranges, which can happen if the memory has ended up being very, very
fragmented despite the efforts to compact it (depends on what the
application ends up doing).



2015-11-04 23:39:07

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCH v2 01/13] mm: support madvise(MADV_FREE)

Hi Sergey,

On Wed, Nov 04, 2015 at 11:16:24AM +0900, Sergey Senozhatsky wrote:
> Hi Minchan,
>
> On (11/04/15 10:25), Minchan Kim wrote:
> [..]
> >+static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
> >+ unsigned long end, struct mm_walk *walk)
> >+
> ...
> > + if (pmd_trans_unstable(pmd))
> > + return 0;
>
> I think it makes sense to update pmd_trans_unstable() and
> pmd_none_or_trans_huge_or_clear_bad() comments in asm-generic/pgtable.h
> Because they explicitly mention MADV_DONTNEED only. Just a thought.

Hmm, when I read the comments (though I don't understand them 100%), they
talk about the pmd disappearing under MADV_DONTNEED with the mmap_sem held
for read. But MADV_FREE doesn't remove the pmd, so I don't understand what
comment I should add. Please suggest if I am missing something.

>
>
> > @@ -379,6 +502,14 @@ madvise_vma(struct vm_area_struct *vma, struct vm_area_struct **prev,
> > return madvise_remove(vma, prev, start, end);
> > case MADV_WILLNEED:
> > return madvise_willneed(vma, prev, start, end);
> > + case MADV_FREE:
> > + /*
> > + * XXX: In this implementation, MADV_FREE works like
> ^^^^
> XXX

What do you mean?

>
> > + * MADV_DONTNEED on swapless system or full swap.
> > + */
> > + if (get_nr_swap_pages() > 0)
> > + return madvise_free(vma, prev, start, end);
> > + /* passthrough */
>
> -ss

2015-11-04 23:40:50

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCH v2 01/13] mm: support madvise(MADV_FREE)

On Wed, Nov 04, 2015 at 11:29:51AM +0900, Sergey Senozhatsky wrote:
> On (11/04/15 10:25), Minchan Kim wrote:
> [..]
> > +static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
> > + unsigned long end, struct mm_walk *walk)
> > +
> > +{
> > + struct mmu_gather *tlb = walk->private;
> > + struct mm_struct *mm = tlb->mm;
> > + struct vm_area_struct *vma = walk->vma;
> > + spinlock_t *ptl;
> > + pte_t *pte, ptent;
> > + struct page *page;
>
>
> I'll just ask (probably I'm missing something)
>
> + pmd_trans_huge_lock() ?

No. That would deadlock.

>
> > + split_huge_page_pmd(vma, addr, pmd);
> > + if (pmd_trans_unstable(pmd))
> > + return 0;
> > +
> > + pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
> > + arch_enter_lazy_mmu_mode();
> > + for (; addr != end; pte++, addr += PAGE_SIZE) {
> > + ptent = *pte;
> > +
> > + if (!pte_present(ptent))
> > + continue;
> > +
> > + page = vm_normal_page(vma, addr, ptent);
> > + if (!page)
> > + continue;
> > +
>
> -ss
>

2015-11-05 00:13:43

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCH v2 01/13] mm: support madvise(MADV_FREE)

On Tue, Nov 03, 2015 at 07:41:35PM -0800, Andy Lutomirski wrote:
> On Nov 3, 2015 5:30 PM, "Minchan Kim" <[email protected]> wrote:
> > [...]
>
> What happens if you MADV_FREE something that's MAP_SHARED or isn't
> ordinary anonymous memory? There's a long history of MADV_DONTNEED on
> such mappings causing exploitable problems, and I think it would be
> nice if MADV_FREE were obviously safe.

It filters out VM_LOCKED|VM_HUGETLB|VM_PFNMAP and file-backed vmas;
MAP_SHARED is filtered out by vma_is_anonymous.

>
> Does this set the write protect bit?

No.

>
> What happens on architectures without hardware dirty tracking? For
> that matter, even on architecture with hardware dirty tracking, what
> happens in multithreaded processes that have the dirty TLB state
> cached in a different CPU's TLB?
>
> Using the dirty bit for these semantics scares me. This API creates a
> page that can have visible nonzero contents and then can
> asynchronously and magically zero itself thereafter. That makes me
> nervous. Could we use the accessed bit instead? Then the observable

The access bit is used by the aging algorithm for reclaim. In addition,
we support the clear_refs feature.
IOW, it can be reset at any time, so it's hard to use it as a marker for
lazy freeing at the moment.

> semantics would be equivalent to having MADV_FREE either zero the page
> or do nothing, except that it doesn't make up its mind until the next
> read.
>
> > + ptent = pte_mkold(ptent);
> > + ptent = pte_mkclean(ptent);
> > + set_pte_at(mm, addr, pte, ptent);
> > + tlb_remove_tlb_entry(tlb, pte, addr);
>
> It looks like you are flushing the TLB. In a multithreaded program,
> that's rather expensive. Potentially silly question: would it be
> better to just zero the page immediately in a multithreaded program
> and then, when swapping out, check the page is zeroed and, if so, skip
> swapping it out? That could be done without forcing an IPI.

So, should we check every page on the reclaim path for whether it is
zero or not? That is faster on the allocation side but much slower on
the reclaim side. To avoid that, we would have to mark lazy-free pages
somewhere outside the page table.

Anyway, it depends on TLB flush overhead vs memset overhead.
If the hinted range is pretty big on a small system (ie, not many cores),
the memset overhead wouldn't be trivial compared to the TLB flush.
Also, some ARM architectures don't use an IPI for TLB flush, so the
flush would be cheaper there.

I don't want to push more optimization into a new syscall from the
beginning. It's an optimization, and a better idea might come up once we
hear from userland folks. Then it's not too late.
Let's go step by step.

>
> > +static int madvise_free_single_vma(struct vm_area_struct *vma,
> > + unsigned long start_addr, unsigned long end_addr)
> > +{
> > + unsigned long start, end;
> > + struct mm_struct *mm = vma->vm_mm;
> > + struct mmu_gather tlb;
> > +
> > + if (vma->vm_flags & (VM_LOCKED|VM_HUGETLB|VM_PFNMAP))
> > + return -EINVAL;
> > +
> > + /* MADV_FREE works for only anon vma at the moment */
> > + if (!vma_is_anonymous(vma))
> > + return -EINVAL;
>
> Does anything weird happen if it's shared?

Hmm, you mean MAP_SHARED|MAP_ANONYMOUS?
In that case, vma->vm_ops is &shmem_vm_ops, so vma_is_anonymous() should
filter it out.

>
> > + if (!PageDirty(page) && (flags & TTU_FREE)) {
> > + /* It's a freeable page by MADV_FREE */
> > + dec_mm_counter(mm, MM_ANONPAGES);
> > + goto discard;
> > + }
>
> Does something clear TTU_FREE the next time the page gets marked clean?

Sorry, I don't understand. Could you elaborate a bit more?

>
> --Andy
>

2015-11-05 00:42:59

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH v2 01/13] mm: support madvise(MADV_FREE)

On Wed, Nov 4, 2015 at 4:13 PM, Minchan Kim <[email protected]> wrote:
> On Tue, Nov 03, 2015 at 07:41:35PM -0800, Andy Lutomirski wrote:
>> On Nov 3, 2015 5:30 PM, "Minchan Kim" <[email protected]> wrote:
>> [...]
>>
>> What happens on architectures without hardware dirty tracking? For
>> that matter, even on architecture with hardware dirty tracking, what
>> happens in multithreaded processes that have the dirty TLB state
>> cached in a different CPU's TLB?
>>
>> Using the dirty bit for these semantics scares me. This API creates a
>> page that can have visible nonzero contents and then can
>> asynchronously and magically zero itself thereafter. That makes me
>> nervous. Could we use the accessed bit instead? Then the observable
>
> The access bit is used by the aging algorithm for reclaim. In addition,
> we support the clear_refs feature.
> IOW, it can be reset at any time, so it's hard to use it as a marker for
> lazy freeing at the moment.
>

That's unfortunate. I think that the ABI would be much nicer if it
used the accessed bit.

In any case, shouldn't the aging algorithm be irrelevant here? A
MADV_FREE page that isn't accessed can be discarded, whereas we could
hopefully just say that a MADV_FREE page that is accessed gets moved
to whatever list holds recently accessed pages and also stops being a
candidate for discarding due to MADV_FREE?

>>
>> > + if (!PageDirty(page) && (flags & TTU_FREE)) {
>> > + /* It's a freeable page by MADV_FREE */
>> > + dec_mm_counter(mm, MM_ANONPAGES);
>> > + goto discard;
>> > + }
>>
>> Does something clear TTU_FREE the next time the page gets marked clean?
>
> Sorry, I don't understand. Could you elaborate a bit more?

I don't fully understand how TTU_FREE ends up being set here, but, if
the page is dirtied by user code and then cleaned later by the kernel,
what prevents TTU_FREE from being incorrectly set here?


--Andy

2015-11-05 00:56:19

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCH v2 01/13] mm: support madvise(MADV_FREE)

On Wed, Nov 04, 2015 at 04:42:37PM -0800, Andy Lutomirski wrote:
> On Wed, Nov 4, 2015 at 4:13 PM, Minchan Kim <[email protected]> wrote:
> > On Tue, Nov 03, 2015 at 07:41:35PM -0800, Andy Lutomirski wrote:
> >> On Nov 3, 2015 5:30 PM, "Minchan Kim" <[email protected]> wrote:
> >> [...]
> >>
> >> What happens on architectures without hardware dirty tracking? For
> >> that matter, even on architecture with hardware dirty tracking, what
> >> happens in multithreaded processes that have the dirty TLB state
> >> cached in a different CPU's TLB?
> >>
> >> Using the dirty bit for these semantics scares me. This API creates a
> >> page that can have visible nonzero contents and then can
> >> asynchronously and magically zero itself thereafter. That makes me
> >> nervous. Could we use the accessed bit instead? Then the observable
> >
> > The access bit is used by the aging algorithm for reclaim. In addition,
> > we support the clear_refs feature.
> > IOW, it can be reset at any time, so it's hard to use it as a marker for
> > lazy freeing at the moment.
> >
>
> That's unfortunate. I think that the ABI would be much nicer if it
> used the accessed bit.
>
> In any case, shouldn't the aging algorithm be irrelevant here? A
> MADV_FREE page that isn't accessed can be discarded, whereas we could
> hopefully just say that a MADV_FREE page that is accessed gets moved
> to whatever list holds recently accessed pages and also stops being a
> candidate for discarding due to MADV_FREE?

I meant that if we used the access bit as the indicator for lazy-freeing
pages, we could discard a valid page that was never hinted by MADV_FREE
but simply doesn't have its access bit set in the page table because of
the aging algorithm.

>
> >>
> >> > + if (!PageDirty(page) && (flags & TTU_FREE)) {
> >> > + /* It's a freeable page by MADV_FREE */
> >> > + dec_mm_counter(mm, MM_ANONPAGES);
> >> > + goto discard;
> >> > + }
> >>
> >> Does something clear TTU_FREE the next time the page gets marked clean?
> >
> > Sorry, I don't understand. Could you elaborate a bit more?
>
> I don't fully understand how TTU_FREE ends up being set here, but, if
> the page is dirtied by user code and then cleaned later by the kernel,
> what prevents TTU_FREE from being incorrectly set here?

The kernel shouldn't make the page clean without writeback (ie, swapout)
if the page has valid data.

>
>
> --Andy
>

2015-11-05 01:30:19

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH v2 01/13] mm: support madvise(MADV_FREE)

On Wed, Nov 4, 2015 at 4:56 PM, Minchan Kim <[email protected]> wrote:
> On Wed, Nov 04, 2015 at 04:42:37PM -0800, Andy Lutomirski wrote:
>> On Wed, Nov 4, 2015 at 4:13 PM, Minchan Kim <[email protected]> wrote:
>> > On Tue, Nov 03, 2015 at 07:41:35PM -0800, Andy Lutomirski wrote:
>> >> On Nov 3, 2015 5:30 PM, "Minchan Kim" <[email protected]> wrote:
>> > [...]
>>
>> That's unfortunate. I think that the ABI would be much nicer if it
>> used the accessed bit.
>>
>> In any case, shouldn't the aging algorithm be irrelevant here? A
>> MADV_FREE page that isn't accessed can be discarded, whereas we could
>> hopefully just say that a MADV_FREE page that is accessed gets moved
>> to whatever list holds recently accessed pages and also stops being a
>> candidate for discarding due to MADV_FREE?
>
> I meant that if we used the access bit as the indicator for lazy-freeing
> pages, we could discard a valid page that was never hinted by MADV_FREE
> but simply doesn't have its access bit set in the page table because of
> the aging algorithm.

Oh, is the rule that the anonymous pages that are clean are discarded
instead of swapped out? That is, does your patch set detect that an
anonymous page can be discarded if it's clean and that the lack of a
dirty bit is the only indication that the page has been hit with
MADV_FREE?

If so, that seems potentially error prone -- I had assumed that pages
that were swapped in but not written since swap-in would also be
clean, and I don't see how you distinguish them.

--Andy

2015-11-05 01:33:44

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCH v2 01/13] mm: support madvise(MADV_FREE)

On Wed, Nov 04, 2015 at 12:00:06PM -0800, Shaohua Li wrote:
> On Wed, Nov 04, 2015 at 10:25:55AM +0900, Minchan Kim wrote:
> > [...]
>
> MADV_FREE has been discussed for a while, so it probably is too late to
> propose something new, but we recently had a new idea (from Ben Maurer,
> CCed) and think it's better. Our target is still jemalloc.
>
> Compared to MADV_DONTNEED, MADV_FREE's lazy memory free is a huge win for
> reducing page faults. But one issue remains: the TLB flush. Both
> MADV_DONTNEED and MADV_FREE do a TLB flush, and TLB flush overhead is quite
> big in contemporary multi-threaded applications. In our production
> workload, we have sometimes observed 80% of CPU time spent on TLB flushes
> triggered by jemalloc's madvise(MADV_DONTNEED). We haven't tested MADV_FREE
> yet, but the result should be similar. It's hard to avoid the TLB flush
> issue with MADV_FREE, because the flush is what prevents data corruption.
>
> The new proposal tries to fix the TLB issue. We introduce two madvise verbs:
>
> MARK_FREE. Userspace notifies the kernel that the memory range can be
> discarded. The kernel just records the range at this stage. Should memory
> pressure happen, page reclaim can free the memory directly regardless of
> the pte state.
>
> MARK_NOFREE. Userspace notifies the kernel that the memory range will be
> reused soon. The kernel deletes the record and prevents page reclaim from
> discarding the memory. If the memory hasn't been reclaimed, userspace will
> access the old memory; otherwise normal page fault handling applies.
>
> The point is to let userspace tell the kernel whether memory can be
> discarded, instead of depending on the pte dirty bit as MADV_FREE does.
> With these, no TLB flush is required until page reclaim actually frees the
> memory (page reclaim needs to do that TLB flush for MADV_FREE too). It
> still preserves the lazy memory free merit of MADV_FREE.
>
> Compared to MADV_FREE, reusing memory with the new proposal isn't
> transparent, e.g. userspace must call MARK_NOFREE. But it's easy to adopt
> the new API in jemalloc.
>
> We don't have code to back this up yet, sorry. We'd like to discuss it if
> it makes sense.

It's really what volatile ranges did.
John Stultz and I tried it for a *long* time, but it had lots of troubles.
It's hard for me to write the history down now because it is really long
and I have forgotten lots of the details (ie, dead brain).
Please search for volatile ranges on Google.
Finally, people at LSF/MM suggested MADV_FREE to help on the anonymous
page side rather than staying stuck, which was preventing a useful
feature. :(

>
> Thanks,
> Shaohua

2015-11-05 01:37:39

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCH v2 01/13] mm: support madvise(MADV_FREE)

On Thu, Nov 05, 2015 at 10:33:50AM +0900, Minchan Kim wrote:
> On Wed, Nov 04, 2015 at 12:00:06PM -0800, Shaohua Li wrote:
> > [...]
>
> It's really what volatile ranges did.
> John Stultz and I tried it for a *long* time, but it had lots of troubles.
> It's hard for me to write the history down now because it is really long
> and I have forgotten lots of the details (ie, dead brain).
> Please search for volatile ranges on Google.
> Finally, people at LSF/MM suggested MADV_FREE to help on the anonymous
> page side rather than staying stuck, which was preventing a useful
> feature. :(

I should have Cced John Stultz.

He has a better memory than me, so he could help, but I'm not sure he
still has an interest in volatile ranges.

2015-11-05 01:48:50

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCH v2 01/13] mm: support madvise(MADV_FREE)

On Wed, Nov 04, 2015 at 05:29:57PM -0800, Andy Lutomirski wrote:
> On Wed, Nov 4, 2015 at 4:56 PM, Minchan Kim <[email protected]> wrote:
> > On Wed, Nov 04, 2015 at 04:42:37PM -0800, Andy Lutomirski wrote:
> >> On Wed, Nov 4, 2015 at 4:13 PM, Minchan Kim <[email protected]> wrote:
> >> > On Tue, Nov 03, 2015 at 07:41:35PM -0800, Andy Lutomirski wrote:
> >> >> On Nov 3, 2015 5:30 PM, "Minchan Kim" <[email protected]> wrote:
> >> >> >
> >> >> > Linux doesn't have an ability to free pages lazy while other OS already
> >> >> > have been supported that named by madvise(MADV_FREE).
> >> >> >
> >> >> > The gain is clear: the kernel can discard freed pages rather than
> >> >> > swapping them out, or OOMing, when memory pressure happens.
> >> >> >
> >> >> > Without memory pressure, freed pages can be reused by userspace without
> >> >> > additional overhead (e.g., page fault + allocation + zeroing).
> >> >> >
> >> >>
> >> >> [...]
> >> >>
> >> >> >
> >> >> > How it works:
> >> >> >
> >> >> > When the madvise syscall is called, the VM clears the dirty bit of the
> >> >> > ptes in the range. If memory pressure happens, the VM checks the dirty
> >> >> > bit in the page table, and if it finds a pte still "clean", the page is
> >> >> > a "lazyfree" page, so the VM can discard it instead of swapping it out.
> >> >> > If there was a store to the page before the VM picked it for reclaim,
> >> >> > the dirty bit is set, so the VM swaps the page out instead of discarding it.
> >> >>
> >> >> What happens if you MADV_FREE something that's MAP_SHARED or isn't
> >> >> ordinary anonymous memory? There's a long history of MADV_DONTNEED on
> >> >> such mappings causing exploitable problems, and I think it would be
> >> >> nice if MADV_FREE were obviously safe.
> >> >
> >> > It filters out VM_LOCKED|VM_HUGETLB|VM_PFNMAP vmas, and file-backed and
> >> > MAP_SHARED vmas via vma_is_anonymous.
> >> >
> >> >>
> >> >> Does this set the write protect bit?
> >> >
> >> > No.
> >> >
> >> >>
> >> >> What happens on architectures without hardware dirty tracking? For
> >> >> that matter, even on architecture with hardware dirty tracking, what
> >> >> happens in multithreaded processes that have the dirty TLB state
> >> >> cached in a different CPU's TLB?
> >> >>
> >> >> Using the dirty bit for these semantics scares me. This API creates a
> >> >> page that can have visible nonzero contents and then can
> >> >> asynchronously and magically zero itself thereafter. That makes me
> >> >> nervous. Could we use the accessed bit instead? Then the observable
> >> >
> >> > The accessed bit is used by the aging algorithm for reclaim. In addition,
> >> > we support the clear_refs feature.
> >> > IOW, the bit can be reset at any time, so it's hard to use it as a marker
> >> > for lazy freeing at the moment.
> >> >
> >>
> >> That's unfortunate. I think that the ABI would be much nicer if it
> >> used the accessed bit.
> >>
> >> In any case, shouldn't the aging algorithm be irrelevant here? A
> >> MADV_FREE page that isn't accessed can be discarded, whereas we could
> >> hopefully just say that a MADV_FREE page that is accessed gets moved
> >> to whatever list holds recently accessed pages and also stops being a
> >> candidate for discarding due to MADV_FREE?
> >
> > I meant that if we used the accessed bit as the indicator for lazy freeing,
> > we could discard a valid page that was never hinted by MADV_FREE but merely
> > had its accessed bit cleared in the page table by the aging algorithm.
>
> Oh, is the rule that the anonymous pages that are clean are discarded
> instead of swapped out? That is, does your patch set detect that an

A page that is swapped in after having been swapped out has a clean pte,
and the swap device still holds valid data as long as the page isn't
touched, so the VM discards the page rather than swapping it out again.
Of course, the page must still have its swap slot. If the VM decides to
remove the page from the swap slot, the page must be marked PG_dirty.
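
To put the rule in one place, a rough sketch of the discard-vs-swapout check
(illustrative only; the real logic lives in the reclaim path and handles many
more cases):

/* Sketch: may reclaim drop this anonymous page instead of writing it
 * to swap?  Both the pte dirty bit and PG_dirty must be clean. */
static bool can_discard(struct page *page, pte_t pte)
{
	/* A store since MADV_FREE (or since swap-in) re-dirties the pte;
	 * removing the page from its swap slot sets PG_dirty. */
	if (pte_dirty(pte) || PageDirty(page))
		return false;	/* data is live, swap it out */

	/* Clean both ways: either lazily freed, or the contents are
	 * still valid on the swap device -- safe to discard. */
	return true;
}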

> anonymous page can be discarded if it's clean and that the lack of a
> dirty bit is the only indication that the page has been hit with
> MADV_FREE?

Not the pte dirty bit alone; exactly speaking, PG_dirty as well,
because the page I mentioned above has a clean pte but will have PG_dirty set.

>
> If so, that seems potentially error prone -- I had assumed that pages
> that were swapped in but not written since swap-in would also be
> clean, and I don't see how you distinguish them.

I hope the above answers that.
>
> --Andy

2015-11-05 03:40:28

by Sergey Senozhatsky

[permalink] [raw]
Subject: Re: [PATCH v2 01/13] mm: support madvise(MADV_FREE)

Hi Minchan,

On (11/05/15 08:39), Minchan Kim wrote:
[..]
> >
> > I think it makes sense to update pmd_trans_unstable() and
> > pmd_none_or_trans_huge_or_clear_bad() comments in asm-generic/pgtable.h,
> > because they explicitly mention MADV_DONTNEED only. Just a thought.
>
> Hmm, when I read the comments (though I don't understand them 100%), they
> talk about the pmd disappearing due to MADV_DONTNEED under the mmap_sem
> read-side lock. But MADV_FREE doesn't remove the pmd, so I don't see what
> comment I should add. Please suggest if I am missing something.
>

Hm, sorry, I need to think about it more; my comment is probably irrelevant.
I was fantasizing about some contrived use cases like doing MADV_DONTNEED and
MADV_FREE on overlapping addresses from different threads, processes that
share memory, etc.

> > > @@ -379,6 +502,14 @@ madvise_vma(struct vm_area_struct *vma, struct vm_area_struct **prev,
> > > return madvise_remove(vma, prev, start, end);
> > > case MADV_WILLNEED:
> > > return madvise_willneed(vma, prev, start, end);
> > > + case MADV_FREE:
> > > + /*
> > > + * XXX: In this implementation, MADV_FREE works like
> > ^^^^
> > XXX
>
> What does it mean?

Not much, just a minor note that there is an 'XXX' in the "XXX: In this
implementation" comment.

-ss

2015-11-05 18:17:32

by Shaohua Li

[permalink] [raw]
Subject: Re: [PATCH v2 01/13] mm: support madvise(MADV_FREE)

On Wed, Nov 04, 2015 at 05:05:47PM -0500, Daniel Micay wrote:
> > With enough pages at once, though, munmap would be fine, too.
>
> That implies lots of page faults and zeroing though. The zeroing alone
> is a major performance issue.
>
> There are separate issues with munmap since it ends up resulting in a
> lot more virtual memory fragmentation. It would help if the kernel used
> first-best-fit for mmap instead of the current naive algorithm (bonus:
> O(log n) worst-case, not O(n)). Since allocators like jemalloc and
> PartitionAlloc want 2M aligned spans, mixing them with other allocators
> can also accelerate the VM fragmentation caused by the dumb mmap
> algorithm (i.e. they make a 2M aligned mapping, some other mmap user
> does 4k, now there's a nearly 2M gap when the next 2M region is made and
> the kernel keeps going rather than reusing it). Anyway, that's a totally
> separate issue from this. Just felt like complaining :).
>
> > Maybe what's really needed is a MADV_FREE variant that takes an iovec.
> > On an all-cores multithreaded mm, the TLB shootdown broadcast takes
> > thousands of cycles on each core more or less regardless of how much
> > of the TLB gets zapped.
>
> That would work very well. The allocator ends up having a sequence of
> dirty spans that it needs to purge in one go. As long as purging is
> fairly spread out, the cost of a single TLB shootdown isn't that bad. It
> is extremely bad if it needs to do it over and over to purge a bunch of
> ranges, which can happen if the memory has ended up being very, very
> fragmented despite the efforts to compact it (depends on what the
> application ends up doing).

I posted a patch doing exactly that: iovec madvise. It doesn't support
MADV_FREE yet, but that should be easy to add.

http://marc.info/?l=linux-mm&m=144615663522661&w=2
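
Roughly, usage would look like the following. The madvisev name and signature
here are assumptions for illustration; see the patch above for the actual
interface:

#include <sys/mman.h>
#include <sys/uio.h>

/* Assumed wrapper for a vectored madvise: one call, so a single TLB
 * shootdown can cover every range.  Name/signature illustrative only. */
int madvisev(const struct iovec *vec, int count, int advice);

static void purge_spans(const struct iovec *dirty_spans, int n)
{
	/* Batch all dirty spans so the shootdown cost is paid once. */
	madvisev(dirty_spans, n, MADV_DONTNEED);	/* MADV_FREE later */
}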

2015-11-05 20:13:18

by Daniel Micay

[permalink] [raw]
Subject: Re: [PATCH v2 01/13] mm: support madvise(MADV_FREE)

> I posted a patch doing exactly that: iovec madvise. It doesn't support
> MADV_FREE yet, but that should be easy to add.
>
> http://marc.info/?l=linux-mm&m=144615663522661&w=2

I think that would be a great way to deal with this. It keeps the nice
property of still being able to drop pages in allocations that have been
handed out but not yet touched. The allocator just needs to be designed
to do lots of purging in one go (i.e. something like an 8:1 active:clean
ratio triggers purging and it goes all the way to 16:1).
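
A sketch of that hysteresis (names and exact thresholds are illustrative, not
jemalloc's actual code):

#include <stdbool.h>
#include <stddef.h>

/* Trigger purging once unused-but-resident pages reach 1/8 of the
 * active pages, then purge in one batch down to 1/16, so each pass
 * amortizes a single TLB shootdown over many ranges. */
static bool should_purge(size_t active, size_t unused)
{
	return unused * 8 >= active;
}

static size_t purge_target(size_t active)
{
	return active / 16;	/* purge until unused falls to this */
}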



2015-12-01 22:30:07

by John Stultz

[permalink] [raw]
Subject: Re: [PATCH v2 01/13] mm: support madvise(MADV_FREE)

On Wed, Nov 4, 2015 at 12:00 PM, Shaohua Li <[email protected]> wrote:
> Compared to MADV_DONTNEED, MADV_FREE's lazy memory freeing is a huge win for
> reducing page faults. But one issue remains: the TLB flush. Both MADV_DONTNEED
> and MADV_FREE do a TLB flush, and TLB flush overhead is quite big in
> contemporary multi-threaded applications. In our production workload, we have
> sometimes observed 80% of CPU time spent on TLB flushes triggered by jemalloc's
> madvise(MADV_DONTNEED). We haven't tested MADV_FREE yet, but the result should
> be similar. The TLB flush is hard to avoid with MADV_FREE, because it is what
> prevents data corruption.
>
> The new proposal tries to fix the TLB issue. We introduce two madvise verbs:
>
> MARK_FREE. Userspace notifies the kernel that the memory range can be
> discarded. At this stage the kernel just records the range. Should memory
> pressure happen, page reclaim can free the memory directly, regardless of the
> pte state.
>
> MARK_NOFREE. Userspace notifies the kernel that the memory range will be
> reused soon. The kernel deletes the record and prevents page reclaim from
> discarding the memory. If the memory hasn't been reclaimed, userspace sees the
> old contents; otherwise normal page fault handling applies.
>
> The point is to let userspace tell the kernel whether memory can be
> discarded, instead of depending on the pte dirty bit as MADV_FREE does. With
> these verbs, no TLB flush is required until page reclaim actually frees the
> memory (page reclaim needs to do that TLB flush for MADV_FREE too). It still
> preserves the lazy memory free merit of MADV_FREE.
>
> Compared to MADV_FREE, reusing memory with the new proposal isn't
> transparent: the application must call MARK_NOFREE first. But it's easy to use
> the new API in jemalloc.
>
> We don't have code to back this up yet, sorry. We'd like to discuss it if it
> makes sense.

Sorry to be so slow to reply here!

As Minchan mentioned, this is very similar in concept to the volatile
ranges work Minchan and I tried to push for a few years.

Here's some of the coverage (in reverse chronological order)
https://lwn.net/Articles/602650/
https://lwn.net/Articles/592042/
https://lwn.net/Articles/590991/
http://permalink.gmane.org/gmane.linux.kernel.mm/98848
http://permalink.gmane.org/gmane.linux.kernel.mm/98676
https://lwn.net/Articles/522135/
https://lwn.net/Kernel/Index/#Volatile_ranges


If you are interested in reviving the patch set, I'd love to hear
about it. I think it's a really compelling feature for kernel
right-sizing of userspace caches.

thanks
-john

2015-12-05 11:10:47

by Pavel Machek

[permalink] [raw]
Subject: Re: [PATCH v2 00/13] MADV_FREE support

On Wed 2015-11-04 10:25:54, Minchan Kim wrote:
> MADV_FREE is on linux-next so long time. The reason was two, I think.
>
> 1. MADV_FREE code on reclaim path was really mess.

Could you explain what MADV_FREE does?

The comment in the code says 'free the page only when there's memory
pressure'. So if I mark my caches MADV_FREE and there's no memory
pressure, I can keep using them? And if there's memory pressure, what
happens? Do I get zeros? A SIGSEGV?

Thanks,
Pavel

--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2015-12-05 15:52:07

by Daniel Micay

[permalink] [raw]
Subject: Re: [PATCH v2 00/13] MADV_FREE support

On 05/12/15 06:10 AM, Pavel Machek wrote:
> On Wed 2015-11-04 10:25:54, Minchan Kim wrote:
>> MADV_FREE is on linux-next so long time. The reason was two, I think.
>>
>> 1. MADV_FREE code on reclaim path was really mess.
>
> Could you explain what MADV_FREE does?
>
> The comment in the code says 'free the page only when there's memory
> pressure'. So if I mark my caches MADV_FREE and there's no memory
> pressure, I can keep using them? And if there's memory pressure, what
> happens? Do I get zeros? A SIGSEGV?

You get zeroes. It's not designed for that use case right now; it's for
malloc implementations to use internally. There would need to be a new
feature like MADV_FREE_UNDO for it to be usable for caches, and it may
make more sense for that to be a separate feature entirely, i.e., have a
different flag for marking too (not sure), since it wouldn't need to
worry about whether the memory has been touched.
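
The intended malloc-internal pattern is roughly this sketch (the fallback
MADV_FREE value matches the generic one proposed by this series):

#include <stddef.h>
#include <sys/mman.h>

#ifndef MADV_FREE
#define MADV_FREE 8	/* generic value from the series' uapi patch */
#endif

/* Retire a freed run without unmapping it.  If the run is reused
 * before reclaim needs memory, the first store simply re-dirties the
 * pages: no fault, no zeroing.  If reclaim discarded them first, the
 * next touch faults in a zero page, which is fine because the old
 * contents were garbage to the allocator anyway. */
static void run_retire(void *run, size_t len)
{
	madvise(run, len, MADV_FREE);
}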

