This patch series enables the MADV_FREE hint for the madvise syscall,
which other OSes already support. [PATCH 1] includes the details.
[1] supports MADV_FREE for !THP pages, so if the VM encounters a
THP page in syscall context, it splits the THP page.
[2-6] prepare for calling the madvise syscall without THP splitting.
[7] enables THP page support for MADV_FREE.
* from v16
* Rebased on mmotm-2014-10-15-16-57
* from v15
* Add more Acked-by - Rik van Riel
* Rebased on mmotm-08-29-15-15
* from v14
* Add more Acked-by from arch people (sparc, arm64 and arm)
* Drop s390 since pmd_dirty/clean was merged
* from v13
* Add more Acked-by from arch people (arm, arm64 and ppc)
* Rebased on mmotm 2014-08-13-14-29
* from v12
* Fix - skip marking the pte free for a page on which try_to_free_swap failed - Kirill
* Add more Acked-by from arch maintainers and Kirill
* From v11
* Fix arm build - Steve
* Separate patch for arm and arm64 - Steve
* Remove unnecessary check - Kirill
* Skip non-vm_normal page - Kirill
* Add Acked-by - Zhang
* Sparc64 build fix
* Pagetable walker THP handling fix
* From v10
* Add Acked-by from arch people (x86, s390)
* Pagewalker-based pagetable walking - Kirill
* Fix try_to_unmap_one broken with hwpoison - Kirill
* Use VM_BUG_ON_PAGE in madvise_free_pmd - Kirill
* Fix pgtable-3level.h for arm - Steve
* From v9
* Add Acked-by - Rik
* Add THP page support - Kirill
* From v8
* Rebased-on v3.16-rc2-mmotm-2014-06-25-16-44
* From v7
* Rebased-on next-20140613
* From v6
* Remove page from swapcache at syscall time
* Move utility functions from memory.c to madvise.c - Johannes
* Rename utility functions - Johannes
* Remove unnecessary checks from vmscan.c - Johannes
* Rebased-on v3.15-rc5-mmotm-2014-05-16-16-56
* Drop Reviewed-by because there were some changes since then.
* From v5
* Fix PPC problem where the TLB was not flushed - Rik
* Remove unnecessary lazyfree_range stub function - Rik
* Rebased on v3.15-rc5
* From v4
* Add Reviewed-by: Zhang Yanfei
* Rebase on v3.15-rc1-mmotm-2014-04-15-16-14
* From v3
* Add "how it works" part to description - Zhang
* Add page_discardable utility function - Zhang
* Clean up
* From v2
* Remove forceful dirty marking of pages read in from swap - Johannes
* Remove deactivation logic of lazyfreed page
* Rebased on 3.14
* Remove RFC tag
* From v1
* Use custom page table walker for madvise_free - Johannes
* Remove PG_lazypage flag - Johannes
* Do madvise_dontneed instead of madvise_free in swapless system
Minchan Kim (7):
mm: support madvise(MADV_FREE)
x86: add pmd_[dirty|mkclean] for THP
sparc: add pmd_[dirty|mkclean] for THP
powerpc: add pmd_[dirty|mkclean] for THP
arm: add pmd_mkclean for THP
arm64: add pmd_[dirty|mkclean] for THP
mm: Don't split THP page when syscall is called
arch/arm/include/asm/pgtable-3level.h | 1 +
arch/arm64/include/asm/pgtable.h | 2 +
arch/powerpc/include/asm/pgtable-ppc64.h | 2 +
arch/sparc/include/asm/pgtable_64.h | 16 ++++
arch/x86/include/asm/pgtable.h | 10 ++
include/linux/huge_mm.h | 4 +
include/linux/rmap.h | 9 +-
include/linux/vm_event_item.h | 1 +
include/uapi/asm-generic/mman-common.h | 1 +
mm/huge_memory.c | 35 +++++++
mm/madvise.c | 159 +++++++++++++++++++++++++++++++
mm/rmap.c | 46 ++++++++-
mm/vmscan.c | 64 +++++++++----
mm/vmstat.c | 1 +
14 files changed, 331 insertions(+), 20 deletions(-)
--
2.0.0
MADV_FREE needs pmd_dirty and pmd_mkclean to detect whether the
contents of a THP page have been overwritten since the MADV_FREE
syscall was called on it.
This patch adds pmd_mkclean for THP page MADV_FREE support.
Cc: Catalin Marinas <[email protected]>
Cc: Russell King <[email protected]>
Cc: [email protected]
Acked-by: Will Deacon <[email protected]>
Acked-by: Steve Capper <[email protected]>
Signed-off-by: Minchan Kim <[email protected]>
---
arch/arm/include/asm/pgtable-3level.h | 1 +
1 file changed, 1 insertion(+)
diff --git a/arch/arm/include/asm/pgtable-3level.h b/arch/arm/include/asm/pgtable-3level.h
index a31ecdad4b59..54dc91486c16 100644
--- a/arch/arm/include/asm/pgtable-3level.h
+++ b/arch/arm/include/asm/pgtable-3level.h
@@ -249,6 +249,7 @@ PMD_BIT_FUNC(mkold, &= ~PMD_SECT_AF);
PMD_BIT_FUNC(mksplitting, |= L_PMD_SECT_SPLITTING);
PMD_BIT_FUNC(mkwrite, &= ~L_PMD_SECT_RDONLY);
PMD_BIT_FUNC(mkdirty, |= L_PMD_SECT_DIRTY);
+PMD_BIT_FUNC(mkclean, &= ~L_PMD_SECT_DIRTY);
PMD_BIT_FUNC(mkyoung, |= PMD_SECT_AF);
#define pmd_mkhuge(pmd) (__pmd(pmd_val(pmd) & ~PMD_TABLE_BIT))
--
2.0.0
MADV_FREE needs pmd_dirty and pmd_mkclean to detect whether the
contents of a THP page have been overwritten since the MADV_FREE
syscall was called on it.
This patch adds pmd_dirty and pmd_mkclean for THP page MADV_FREE
support.
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: [email protected]
Acked-by: Rik van Riel <[email protected]>
Acked-by: Zhang Yanfei <[email protected]>
Acked-by: Kirill A. Shutemov <[email protected]>
Signed-off-by: Minchan Kim <[email protected]>
---
arch/x86/include/asm/pgtable.h | 10 ++++++++++
1 file changed, 10 insertions(+)
diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index aa97a070f09f..2259de0ccd79 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -104,6 +104,11 @@ static inline int pmd_young(pmd_t pmd)
return pmd_flags(pmd) & _PAGE_ACCESSED;
}
+static inline int pmd_dirty(pmd_t pmd)
+{
+ return pmd_flags(pmd) & _PAGE_DIRTY;
+}
+
static inline int pte_write(pte_t pte)
{
return pte_flags(pte) & _PAGE_RW;
@@ -272,6 +277,11 @@ static inline pmd_t pmd_mkold(pmd_t pmd)
return pmd_clear_flags(pmd, _PAGE_ACCESSED);
}
+static inline pmd_t pmd_mkclean(pmd_t pmd)
+{
+ return pmd_clear_flags(pmd, _PAGE_DIRTY);
+}
+
static inline pmd_t pmd_wrprotect(pmd_t pmd)
{
return pmd_clear_flags(pmd, _PAGE_RW);
--
2.0.0
We don't need to split a THP page when the MADV_FREE syscall is
called. The split can be deferred until the VM decides to actually
free the page, which avoids unnecessary THP splits.
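
The check this patch adds to madvise_free_pte_range() can be sketched
in userspace as below. This is only a model under stated assumptions:
the MODEL_* names and model_* functions are hypothetical, a 2 MiB huge
page size is assumed, and only the boundary arithmetic of
pmd_addr_end() is mirrored. A range that covers the whole huge pmd is
lazily freed in place; a partial range still falls back to a split.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: 2 MiB huge pages, as on x86-64. Hypothetical names. */
#define MODEL_HPAGE_PMD_SIZE (UINT64_C(2) << 20)
#define MODEL_HPAGE_PMD_MASK (~(MODEL_HPAGE_PMD_SIZE - 1))

enum model_thp_action {
	MODEL_THP_SPLIT,          /* partial range: split the huge page */
	MODEL_THP_LAZYFREE_WHOLE, /* full range: mark the pmd old and clean */
};

/* Same arithmetic as pmd_addr_end(): next pmd boundary, capped at end. */
static uint64_t model_pmd_addr_end(uint64_t addr, uint64_t end)
{
	uint64_t boundary = (addr + MODEL_HPAGE_PMD_SIZE) & MODEL_HPAGE_PMD_MASK;

	return (boundary - 1 < end - 1) ? boundary : end;
}

/* Decide what to do when the walker meets a huge pmd for [addr, end). */
static enum model_thp_action model_thp_decide(uint64_t addr, uint64_t end)
{
	uint64_t next;

	assert(addr < end);
	next = model_pmd_addr_end(addr, end);
	if (next - addr != MODEL_HPAGE_PMD_SIZE)
		return MODEL_THP_SPLIT;
	return MODEL_THP_LAZYFREE_WHOLE;
}
```

In this model a call such as model_thp_decide(base, base + 4096) on a
huge-page-aligned base splits, while a range spanning the full 2 MiB
does not.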
Cc: Andrea Arcangeli <[email protected]>
Acked-by: Rik van Riel <[email protected]>
Acked-by: Kirill A. Shutemov <[email protected]>
Signed-off-by: Minchan Kim <[email protected]>
---
include/linux/huge_mm.h | 4 ++++
mm/huge_memory.c | 35 +++++++++++++++++++++++++++++++++++
mm/madvise.c | 21 ++++++++++++++++++++-
mm/rmap.c | 8 ++++++--
mm/vmscan.c | 28 ++++++++++++++++++----------
5 files changed, 83 insertions(+), 13 deletions(-)
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index ad9051bab267..07f736b18ffc 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -19,6 +19,9 @@ extern struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,
unsigned long addr,
pmd_t *pmd,
unsigned int flags);
+extern int madvise_free_huge_pmd(struct mmu_gather *tlb,
+ struct vm_area_struct *vma,
+ pmd_t *pmd, unsigned long addr);
extern int zap_huge_pmd(struct mmu_gather *tlb,
struct vm_area_struct *vma,
pmd_t *pmd, unsigned long addr);
@@ -56,6 +59,7 @@ extern pmd_t *page_check_address_pmd(struct page *page,
unsigned long address,
enum page_check_address_pmd_flag flag,
spinlock_t **ptl);
+extern int pmd_freeable(pmd_t pmd);
#define HPAGE_PMD_ORDER (HPAGE_PMD_SHIFT-PAGE_SHIFT)
#define HPAGE_PMD_NR (1<<HPAGE_PMD_ORDER)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index de984159cf0b..5be0a5f3ea3a 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1384,6 +1384,36 @@ out:
return 0;
}
+int madvise_free_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
+ pmd_t *pmd, unsigned long addr)
+
+{
+ spinlock_t *ptl;
+ struct mm_struct *mm = tlb->mm;
+ int ret = 1;
+
+ if (pmd_trans_huge_lock(pmd, vma, &ptl) == 1) {
+ struct page *page;
+ pmd_t orig_pmd;
+
+ orig_pmd = pmdp_get_and_clear(mm, addr, pmd);
+
+ /* No hugepage in swapcache */
+ page = pmd_page(orig_pmd);
+ VM_BUG_ON_PAGE(PageSwapCache(page), page);
+
+ orig_pmd = pmd_mkold(orig_pmd);
+ orig_pmd = pmd_mkclean(orig_pmd);
+
+ set_pmd_at(mm, addr, pmd, orig_pmd);
+ tlb_remove_pmd_tlb_entry(tlb, pmd, addr);
+ spin_unlock(ptl);
+ ret = 0;
+ }
+
+ return ret;
+}
+
int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
pmd_t *pmd, unsigned long addr)
{
@@ -1620,6 +1650,11 @@ unlock:
return NULL;
}
+int pmd_freeable(pmd_t pmd)
+{
+ return !pmd_dirty(pmd);
+}
+
static int __split_huge_page_splitting(struct page *page,
struct vm_area_struct *vma,
unsigned long address)
diff --git a/mm/madvise.c b/mm/madvise.c
index a21584235bb6..84badee5f46d 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -271,8 +271,26 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
spinlock_t *ptl;
pte_t *pte, ptent;
struct page *page;
+ unsigned long next;
+
+ next = pmd_addr_end(addr, end);
+ if (pmd_trans_huge(*pmd)) {
+ if (next - addr != HPAGE_PMD_SIZE) {
+#ifdef CONFIG_DEBUG_VM
+ if (!rwsem_is_locked(&mm->mmap_sem)) {
+ pr_err("%s: mmap_sem is unlocked! addr=0x%lx end=0x%lx vma->vm_start=0x%lx vma->vm_end=0x%lx\n",
+ __func__, addr, end,
+ vma->vm_start,
+ vma->vm_end);
+ BUG();
+ }
+#endif
+ split_huge_page_pmd(vma, addr, pmd);
+ } else if (!madvise_free_huge_pmd(tlb, vma, pmd, addr))
+ goto next;
+ /* fall through */
+ }
- split_huge_page_pmd(vma, addr, pmd);
if (pmd_trans_unstable(pmd))
return 0;
@@ -316,6 +334,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
}
arch_leave_lazy_mmu_mode();
pte_unmap_unlock(pte - 1, ptl);
+next:
cond_resched();
return 0;
}
diff --git a/mm/rmap.c b/mm/rmap.c
index 93149c82a5a4..3a7081d884b9 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -704,9 +704,13 @@ static int page_referenced_one(struct page *page, struct vm_area_struct *vma,
referenced++;
/*
- * In this implmentation, MADV_FREE doesn't support THP free
+ * Use pmd_freeable instead of raw pmd_dirty because on some
+ * architectures pmd_dirty is not defined unless
+ * CONFIG_TRANSPARENT_HUGEPAGE is enabled
*/
- dirty++;
+ if (!pmd_freeable(*pmd))
+ dirty++;
+
spin_unlock(ptl);
} else {
pte_t *pte;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 8f67765ebb77..29ae6382275a 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -976,17 +976,25 @@ static unsigned long shrink_page_list(struct list_head *page_list,
* Anonymous process memory has backing store?
* Try to allocate it some swap space here.
*/
- if (PageAnon(page) && !PageSwapCache(page) && !freeable) {
- if (!(sc->gfp_mask & __GFP_IO))
- goto keep_locked;
- if (!add_to_swap(page, page_list))
- goto activate_locked;
- may_enter_fs = 1;
-
- /* Adding to swap updated mapping */
- mapping = page_mapping(page);
+ if (PageAnon(page) && !PageSwapCache(page)) {
+ if (!freeable) {
+ if (!(sc->gfp_mask & __GFP_IO))
+ goto keep_locked;
+ if (!add_to_swap(page, page_list))
+ goto activate_locked;
+ may_enter_fs = 1;
+ /* Adding to swap updated mapping */
+ mapping = page_mapping(page);
+ } else {
+ if (likely(!PageTransHuge(page)))
+ goto unmap;
+ /* try_to_unmap isn't aware of THP page */
+ if (unlikely(split_huge_page_to_list(page,
+ page_list)))
+ goto keep_locked;
+ }
}
-
+unmap:
/*
* The page is mapped into the page tables of one or more
* processes. Try to unmap it here.
--
2.0.0
MADV_FREE needs pmd_dirty and pmd_mkclean to detect whether the
contents of a THP page have been overwritten since the MADV_FREE
syscall was called on it.
This patch adds pmd_dirty and pmd_mkclean for THP page MADV_FREE
support.
Cc: Benjamin Herrenschmidt <[email protected]>
Cc: Paul Mackerras <[email protected]>
Cc: [email protected]
Reviewed-by: Aneesh Kumar K.V <[email protected]>
Signed-off-by: Minchan Kim <[email protected]>
---
arch/powerpc/include/asm/pgtable-ppc64.h | 2 ++
1 file changed, 2 insertions(+)
diff --git a/arch/powerpc/include/asm/pgtable-ppc64.h b/arch/powerpc/include/asm/pgtable-ppc64.h
index 889c6fa9ee01..7c07e5975871 100644
--- a/arch/powerpc/include/asm/pgtable-ppc64.h
+++ b/arch/powerpc/include/asm/pgtable-ppc64.h
@@ -468,9 +468,11 @@ static inline pte_t *pmdp_ptep(pmd_t *pmd)
#define pmd_pfn(pmd) pte_pfn(pmd_pte(pmd))
#define pmd_young(pmd) pte_young(pmd_pte(pmd))
+#define pmd_dirty(pmd) pte_dirty(pmd_pte(pmd))
#define pmd_mkold(pmd) pte_pmd(pte_mkold(pmd_pte(pmd)))
#define pmd_wrprotect(pmd) pte_pmd(pte_wrprotect(pmd_pte(pmd)))
#define pmd_mkdirty(pmd) pte_pmd(pte_mkdirty(pmd_pte(pmd)))
+#define pmd_mkclean(pmd) pte_pmd(pte_mkclean(pmd_pte(pmd)))
#define pmd_mkyoung(pmd) pte_pmd(pte_mkyoung(pmd_pte(pmd)))
#define pmd_mkwrite(pmd) pte_pmd(pte_mkwrite(pmd_pte(pmd)))
--
2.0.0
MADV_FREE needs pmd_dirty and pmd_mkclean to detect whether the
contents of a THP page have been overwritten since the MADV_FREE
syscall was called on it.
This patch adds pmd_dirty and pmd_mkclean for THP page MADV_FREE
support.
Acked-by: David S. Miller <[email protected]>
Cc: [email protected]
Signed-off-by: Minchan Kim <[email protected]>
---
arch/sparc/include/asm/pgtable_64.h | 16 ++++++++++++++++
1 file changed, 16 insertions(+)
diff --git a/arch/sparc/include/asm/pgtable_64.h b/arch/sparc/include/asm/pgtable_64.h
index 3770bf5c6e1b..b80a309d7e00 100644
--- a/arch/sparc/include/asm/pgtable_64.h
+++ b/arch/sparc/include/asm/pgtable_64.h
@@ -666,6 +666,13 @@ static inline unsigned long pmd_young(pmd_t pmd)
return pte_young(pte);
}
+static inline int pmd_dirty(pmd_t pmd)
+{
+ pte_t pte = __pte(pmd_val(pmd));
+
+ return pte_dirty(pte);
+}
+
static inline unsigned long pmd_write(pmd_t pmd)
{
pte_t pte = __pte(pmd_val(pmd));
@@ -723,6 +730,15 @@ static inline pmd_t pmd_mkdirty(pmd_t pmd)
return __pmd(pte_val(pte));
}
+static inline pmd_t pmd_mkclean(pmd_t pmd)
+{
+ pte_t pte = __pte(pmd_val(pmd));
+
+ pte = pte_mkclean(pte);
+
+ return __pmd(pte_val(pte));
+}
+
static inline pmd_t pmd_mkyoung(pmd_t pmd)
{
pte_t pte = __pte(pmd_val(pmd));
--
2.0.0
MADV_FREE needs pmd_dirty and pmd_mkclean to detect whether the
contents of a THP page have been overwritten since the MADV_FREE
syscall was called on it.
This patch adds pmd_dirty and pmd_mkclean for THP page MADV_FREE
support.
Cc: Russell King <[email protected]>
Cc: [email protected]
Acked-by: Will Deacon <[email protected]>
Acked-by: Steve Capper <[email protected]>
Acked-by: Catalin Marinas <[email protected]>
Signed-off-by: Minchan Kim <[email protected]>
---
arch/arm64/include/asm/pgtable.h | 2 ++
1 file changed, 2 insertions(+)
diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 6d81471183ee..effc859084bc 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -275,10 +275,12 @@ void pmdp_splitting_flush(struct vm_area_struct *vma, unsigned long address,
#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
#define pmd_young(pmd) pte_young(pmd_pte(pmd))
+#define pmd_dirty(pmd) pte_dirty(pmd_pte(pmd))
#define pmd_wrprotect(pmd) pte_pmd(pte_wrprotect(pmd_pte(pmd)))
#define pmd_mksplitting(pmd) pte_pmd(pte_mkspecial(pmd_pte(pmd)))
#define pmd_mkold(pmd) pte_pmd(pte_mkold(pmd_pte(pmd)))
#define pmd_mkwrite(pmd) pte_pmd(pte_mkwrite(pmd_pte(pmd)))
+#define pmd_mkclean(pmd) pte_pmd(pte_mkclean(pmd_pte(pmd)))
#define pmd_mkdirty(pmd) pte_pmd(pte_mkdirty(pmd_pte(pmd)))
#define pmd_mkyoung(pmd) pte_pmd(pte_mkyoung(pmd_pte(pmd)))
#define pmd_mknotpresent(pmd) (__pmd(pmd_val(pmd) & ~PMD_TYPE_MASK))
--
2.0.0
Linux doesn't have the ability to free pages lazily, while other OSes
already support this through madvise(MADV_FREE).
The gain is clear: the kernel can discard freed pages rather than
swapping them out or triggering OOM when memory pressure happens.
Without memory pressure, freed pages are reused by userspace
without additional overhead (e.g., page fault + allocation
+ zeroing).
It works as follows.
When the madvise syscall is called, the VM clears the dirty bit of the
ptes in the range. If memory pressure happens, the VM checks the dirty
bit in the page table; if it is still "clean", the page is a
"lazyfree page", so the VM can discard it instead of swapping it out.
If there was a store to the page before the VM picks it for
reclaim, the dirty bit is set, so the VM swaps the page out instead of
discarding it.
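
The lifecycle above can be modelled in a few lines of userspace C.
This is only a sketch: struct model_page and the model_* functions are
hypothetical stand-ins, and the real check in this series also looks
at PageDirty and PageSwapCache, which the model leaves out.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for one anonymous page and its pte dirty bit. */
struct model_page {
	bool pte_dirty;
};

enum model_reclaim_action {
	MODEL_DISCARD,  /* lazyfree page: drop it without swap I/O */
	MODEL_SWAP_OUT, /* contents are live: write them to swap */
};

/* madvise(MADV_FREE): clear the dirty bit of the pte. */
static void model_madvise_free(struct model_page *p)
{
	assert(p);
	p->pte_dirty = false;
}

/* A later store by userspace sets the dirty bit again. */
static void model_store(struct model_page *p)
{
	assert(p);
	p->pte_dirty = true;
}

/* Reclaim: a still-clean pte means nothing was written since the hint. */
static enum model_reclaim_action model_reclaim_page(const struct model_page *p)
{
	assert(p);
	return p->pte_dirty ? MODEL_SWAP_OUT : MODEL_DISCARD;
}
```

The only state consulted at reclaim time in this model is the dirty
bit, which is why no extra page flag is needed to track the hint.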
The first heavy users would be general-purpose allocators (e.g.,
jemalloc, tcmalloc, and hopefully glibc); jemalloc and tcmalloc
already support the feature on other OSes (e.g., FreeBSD).
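
From the allocator side, the hint might be used roughly as below. This
is a minimal sketch, not code taken from jemalloc or tcmalloc:
lazy_release() is a hypothetical helper, MADV_FREE is used only if the
installed headers define it, and the fallback to MADV_DONTNEED is an
assumption made so the example also runs on kernels without this
series.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/*
 * Tell the kernel that [buf, buf + len) holds no live data.
 * buf must be page aligned. Returns 0 on success, -1 on failure.
 */
static int lazy_release(void *buf, size_t len)
{
	assert(buf != NULL && len > 0);
#ifdef MADV_FREE
	/* Preferred: pages are only reclaimed under memory pressure. */
	if (madvise(buf, len, MADV_FREE) == 0)
		return 0;
#endif
	/* Fallback: pages are dropped right away. */
	return madvise(buf, len, MADV_DONTNEED);
}
```

After a successful call the old contents must be treated as undefined;
a new store to a page makes its contents reliable again.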
barrios@blaptop:~/benchmark/ebizzy$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 12
On-line CPU(s) list: 0-11
Thread(s) per core: 1
Core(s) per socket: 1
Socket(s): 12
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 2
Stepping: 3
CPU MHz: 3200.185
BogoMIPS: 6400.53
Virtualization: VT-x
Hypervisor vendor: KVM
Virtualization type: full
L1d cache: 32K
L1i cache: 32K
L2 cache: 4096K
NUMA node0 CPU(s): 0-11
ebizzy benchmark (./ebizzy -S 10 -n 512)
Higher avg is better.
vanilla-jemalloc           MADV_free-jemalloc
1 thread
records: 10                records: 10
avg:     2961.90           avg:     12069.70
std:     71.96(2.43%)      std:     186.68(1.55%)
max:     3070.00           max:     12385.00
min:     2796.00           min:     11746.00
2 thread
records: 10                records: 10
avg:     5020.00           avg:     17827.00
std:     264.87(5.28%)     std:     358.52(2.01%)
max:     5244.00           max:     18760.00
min:     4251.00           min:     17382.00
4 thread
records: 10                records: 10
avg:     8988.80           avg:     27930.80
std:     1175.33(13.08%)   std:     3317.33(11.88%)
max:     9508.00           max:     30879.00
min:     5477.00           min:     21024.00
8 thread
records: 10                records: 10
avg:     13036.50          avg:     33739.40
std:     170.67(1.31%)     std:     5146.22(15.25%)
max:     13371.00          max:     40572.00
min:     12785.00          min:     24088.00
16 thread
records: 10                records: 10
avg:     11092.40          avg:     31424.20
std:     710.60(6.41%)     std:     3763.89(11.98%)
max:     12446.00          max:     36635.00
min:     9949.00           min:     25669.00
32 thread
records: 10                records: 10
avg:     11067.00          avg:     34495.80
std:     971.06(8.77%)     std:     2721.36(7.89%)
max:     12010.00          max:     38598.00
min:     9002.00           min:     30636.00
In summary, MADV_FREE is much faster than MADV_DONTNEED: the average
is roughly 2.6x to 4.1x higher across the thread counts above.
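
The per-thread-count ratios can be recomputed from the avg rows of the
table with a small helper like the one below. The array and function
names are hypothetical; the numbers are copied from the avg rows for
1, 2, 4, 8, 16 and 32 threads.

```c
#include <assert.h>
#include <stddef.h>

/* avg rows from the ebizzy table, for 1, 2, 4, 8, 16, 32 threads */
static const double vanilla_avg[] = {
	2961.90, 5020.00, 8988.80, 13036.50, 11092.40, 11067.00,
};
static const double madv_free_avg[] = {
	12069.70, 17827.00, 27930.80, 33739.40, 31424.20, 34495.80,
};
#define NR_RUNS (sizeof(vanilla_avg) / sizeof(vanilla_avg[0]))

/* Ratio of MADV_FREE avg to vanilla avg for run i. */
static double speedup(size_t i)
{
	assert(i < NR_RUNS);
	return madv_free_avg[i] / vanilla_avg[i];
}

/* Smallest ratio over all thread counts. */
static double min_speedup(void)
{
	double best = speedup(0);

	for (size_t i = 1; i < NR_RUNS; i++)
		if (speedup(i) < best)
			best = speedup(i);
	return best;
}

/* Largest ratio over all thread counts. */
static double max_speedup(void)
{
	double best = speedup(0);

	for (size_t i = 1; i < NR_RUNS; i++)
		if (speedup(i) > best)
			best = speedup(i);
	return best;
}
```

The smallest ratio comes from the 8 thread run and the largest from
the 1 thread run.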
Cc: Michael Kerrisk <[email protected]>
Cc: Linux API <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: KOSAKI Motohiro <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Jason Evans <[email protected]>
Acked-by: Kirill A. Shutemov <[email protected]>
Acked-by: Zhang Yanfei <[email protected]>
Acked-by: Rik van Riel <[email protected]>
Signed-off-by: Minchan Kim <[email protected]>
---
include/linux/rmap.h | 9 ++-
include/linux/vm_event_item.h | 1 +
include/uapi/asm-generic/mman-common.h | 1 +
mm/madvise.c | 140 +++++++++++++++++++++++++++++++++
mm/rmap.c | 42 +++++++++-
mm/vmscan.c | 40 ++++++++--
mm/vmstat.c | 1 +
7 files changed, 222 insertions(+), 12 deletions(-)
diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index c0c2bce6b0b7..94d5bcacc83e 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -75,6 +75,7 @@ enum ttu_flags {
TTU_UNMAP = 1, /* unmap mode */
TTU_MIGRATION = 2, /* migration mode */
TTU_MUNLOCK = 4, /* munlock mode */
+ TTU_FREE = 8, /* free mode */
TTU_IGNORE_MLOCK = (1 << 8), /* ignore mlock */
TTU_IGNORE_ACCESS = (1 << 9), /* don't age */
@@ -181,7 +182,8 @@ static inline void page_dup_rmap(struct page *page)
* Called from mm/vmscan.c to handle paging out
*/
int page_referenced(struct page *, int is_locked,
- struct mem_cgroup *memcg, unsigned long *vm_flags);
+ struct mem_cgroup *memcg, unsigned long *vm_flags,
+ int *is_pte_dirty);
#define TTU_ACTION(x) ((x) & TTU_ACTION_MASK)
@@ -260,9 +262,12 @@ int rmap_walk(struct page *page, struct rmap_walk_control *rwc);
static inline int page_referenced(struct page *page, int is_locked,
struct mem_cgroup *memcg,
- unsigned long *vm_flags)
+ unsigned long *vm_flags,
+ int *is_pte_dirty)
{
*vm_flags = 0;
+ if (is_pte_dirty)
+ *is_pte_dirty = 0;
return 0;
}
diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index 730334cdf037..8be4582ac3ff 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -25,6 +25,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
FOR_ALL_ZONES(PGALLOC),
PGFREE, PGACTIVATE, PGDEACTIVATE,
PGFAULT, PGMAJFAULT,
+ PGLAZYFREED,
FOR_ALL_ZONES(PGREFILL),
FOR_ALL_ZONES(PGSTEAL_KSWAPD),
FOR_ALL_ZONES(PGSTEAL_DIRECT),
diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
index ddc3b36f1046..7a94102b7a02 100644
--- a/include/uapi/asm-generic/mman-common.h
+++ b/include/uapi/asm-generic/mman-common.h
@@ -34,6 +34,7 @@
#define MADV_SEQUENTIAL 2 /* expect sequential page references */
#define MADV_WILLNEED 3 /* will need these pages */
#define MADV_DONTNEED 4 /* don't need these pages */
+#define MADV_FREE 5 /* free pages only if memory pressure */
/* common parameters: try to keep these consistent across architectures */
#define MADV_REMOVE 9 /* remove these pages & resources */
diff --git a/mm/madvise.c b/mm/madvise.c
index 0938b30da4ab..a21584235bb6 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -19,6 +19,14 @@
#include <linux/blkdev.h>
#include <linux/swap.h>
#include <linux/swapops.h>
+#include <linux/mmu_notifier.h>
+
+#include <asm/tlb.h>
+
+struct madvise_free_private {
+ struct vm_area_struct *vma;
+ struct mmu_gather *tlb;
+};
/*
* Any behaviour which results in changes to the vma->vm_flags needs to
@@ -31,6 +39,7 @@ static int madvise_need_mmap_write(int behavior)
case MADV_REMOVE:
case MADV_WILLNEED:
case MADV_DONTNEED:
+ case MADV_FREE:
return 0;
default:
/* be safe, default to 1. list exceptions explicitly */
@@ -251,6 +260,128 @@ static long madvise_willneed(struct vm_area_struct *vma,
return 0;
}
+static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
+ unsigned long end, struct mm_walk *walk)
+
+{
+ struct madvise_free_private *fp = walk->private;
+ struct mmu_gather *tlb = fp->tlb;
+ struct mm_struct *mm = tlb->mm;
+ struct vm_area_struct *vma = fp->vma;
+ spinlock_t *ptl;
+ pte_t *pte, ptent;
+ struct page *page;
+
+ split_huge_page_pmd(vma, addr, pmd);
+ if (pmd_trans_unstable(pmd))
+ return 0;
+
+ pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
+ arch_enter_lazy_mmu_mode();
+ for (; addr != end; pte++, addr += PAGE_SIZE) {
+ ptent = *pte;
+
+ if (!pte_present(ptent))
+ continue;
+
+ page = vm_normal_page(vma, addr, ptent);
+ if (!page)
+ continue;
+
+ if (PageSwapCache(page)) {
+ if (!trylock_page(page))
+ continue;
+
+ if (!try_to_free_swap(page)) {
+ unlock_page(page);
+ continue;
+ }
+
+ ClearPageDirty(page);
+ unlock_page(page);
+ }
+
+ /*
+ * Some architectures (e.g., PPC) don't update the TLB
+ * with set_pte_at and tlb_remove_tlb_entry, so for
+ * portability, remap the pte with old|clean
+ * after clearing the pte.
+ */
+ ptent = ptep_get_and_clear_full(mm, addr, pte,
+ tlb->fullmm);
+ ptent = pte_mkold(ptent);
+ ptent = pte_mkclean(ptent);
+ set_pte_at(mm, addr, pte, ptent);
+ tlb_remove_tlb_entry(tlb, pte, addr);
+ }
+ arch_leave_lazy_mmu_mode();
+ pte_unmap_unlock(pte - 1, ptl);
+ cond_resched();
+ return 0;
+}
+
+static void madvise_free_page_range(struct mmu_gather *tlb,
+ struct vm_area_struct *vma,
+ unsigned long addr, unsigned long end)
+{
+ struct madvise_free_private fp = {
+ .vma = vma,
+ .tlb = tlb,
+ };
+
+ struct mm_walk free_walk = {
+ .pmd_entry = madvise_free_pte_range,
+ .mm = vma->vm_mm,
+ .private = &fp,
+ };
+
+ BUG_ON(addr >= end);
+ tlb_start_vma(tlb, vma);
+ walk_page_range(addr, end, &free_walk);
+ tlb_end_vma(tlb, vma);
+}
+
+static int madvise_free_single_vma(struct vm_area_struct *vma,
+ unsigned long start_addr, unsigned long end_addr)
+{
+ unsigned long start, end;
+ struct mm_struct *mm = vma->vm_mm;
+ struct mmu_gather tlb;
+
+ if (vma->vm_flags & (VM_LOCKED|VM_HUGETLB|VM_PFNMAP))
+ return -EINVAL;
+
+ /* MADV_FREE works for only anon vma at the moment */
+ if (vma->vm_file)
+ return -EINVAL;
+
+ start = max(vma->vm_start, start_addr);
+ if (start >= vma->vm_end)
+ return -EINVAL;
+ end = min(vma->vm_end, end_addr);
+ if (end <= vma->vm_start)
+ return -EINVAL;
+
+ lru_add_drain();
+ tlb_gather_mmu(&tlb, mm, start, end);
+ update_hiwater_rss(mm);
+
+ mmu_notifier_invalidate_range_start(mm, start, end);
+ madvise_free_page_range(&tlb, vma, start, end);
+ mmu_notifier_invalidate_range_end(mm, start, end);
+ tlb_finish_mmu(&tlb, start, end);
+
+ return 0;
+}
+
+static long madvise_free(struct vm_area_struct *vma,
+ struct vm_area_struct **prev,
+ unsigned long start, unsigned long end)
+{
+ *prev = vma;
+ return madvise_free_single_vma(vma, start, end);
+}
+
/*
* Application no longer needs these pages. If the pages are dirty,
* it's OK to just throw them away. The app will be more careful about
@@ -381,6 +512,14 @@ madvise_vma(struct vm_area_struct *vma, struct vm_area_struct **prev,
return madvise_remove(vma, prev, start, end);
case MADV_WILLNEED:
return madvise_willneed(vma, prev, start, end);
+ case MADV_FREE:
+ /*
+ * XXX: In this implementation, MADV_FREE works like
+ * MADV_DONTNEED on a swapless system or when swap is full.
+ */
+ if (get_nr_swap_pages() > 0)
+ return madvise_free(vma, prev, start, end);
+ /* passthrough */
case MADV_DONTNEED:
return madvise_dontneed(vma, prev, start, end);
default:
@@ -400,6 +539,7 @@ madvise_behavior_valid(int behavior)
case MADV_REMOVE:
case MADV_WILLNEED:
case MADV_DONTNEED:
+ case MADV_FREE:
#ifdef CONFIG_KSM
case MADV_MERGEABLE:
case MADV_UNMERGEABLE:
diff --git a/mm/rmap.c b/mm/rmap.c
index 5fbd0fe8f933..93149c82a5a4 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -663,6 +663,7 @@ int page_mapped_in_vma(struct page *page, struct vm_area_struct *vma)
}
struct page_referenced_arg {
+ int dirtied;
int mapcount;
int referenced;
unsigned long vm_flags;
@@ -677,6 +678,7 @@ static int page_referenced_one(struct page *page, struct vm_area_struct *vma,
struct mm_struct *mm = vma->vm_mm;
spinlock_t *ptl;
int referenced = 0;
+ int dirty = 0;
struct page_referenced_arg *pra = arg;
if (unlikely(PageTransHuge(page))) {
@@ -700,6 +702,11 @@ static int page_referenced_one(struct page *page, struct vm_area_struct *vma,
/* go ahead even if the pmd is pmd_trans_splitting() */
if (pmdp_clear_flush_young_notify(vma, address, pmd))
referenced++;
+
+ /*
+ * In this implmentation, MADV_FREE doesn't support THP free
+ */
+ dirty++;
spin_unlock(ptl);
} else {
pte_t *pte;
@@ -729,6 +736,10 @@ static int page_referenced_one(struct page *page, struct vm_area_struct *vma,
if (likely(!(vma->vm_flags & VM_SEQ_READ)))
referenced++;
}
+
+ if (pte_dirty(*pte))
+ dirty++;
+
pte_unmap_unlock(pte, ptl);
}
@@ -737,6 +748,9 @@ static int page_referenced_one(struct page *page, struct vm_area_struct *vma,
pra->vm_flags |= vma->vm_flags;
}
+ if (dirty)
+ pra->dirtied++;
+
pra->mapcount--;
if (!pra->mapcount)
return SWAP_SUCCESS; /* To break the loop */
@@ -761,6 +775,7 @@ static bool invalid_page_referenced_vma(struct vm_area_struct *vma, void *arg)
* @is_locked: caller holds lock on the page
* @memcg: target memory cgroup
* @vm_flags: collect encountered vma->vm_flags who actually referenced the page
+ * @is_pte_dirty: ptes which have marked dirty bit - used for lazyfree page
*
* Quick test_and_clear_referenced for all mappings to a page,
* returns the number of ptes which referenced the page.
@@ -768,7 +783,8 @@ static bool invalid_page_referenced_vma(struct vm_area_struct *vma, void *arg)
int page_referenced(struct page *page,
int is_locked,
struct mem_cgroup *memcg,
- unsigned long *vm_flags)
+ unsigned long *vm_flags,
+ int *is_pte_dirty)
{
int ret;
int we_locked = 0;
@@ -783,6 +799,9 @@ int page_referenced(struct page *page,
};
*vm_flags = 0;
+ if (is_pte_dirty)
+ *is_pte_dirty = 0;
+
if (!page_mapped(page))
return 0;
@@ -810,6 +829,9 @@ int page_referenced(struct page *page,
if (we_locked)
unlock_page(page);
+ if (is_pte_dirty)
+ *is_pte_dirty = pra.dirtied;
+
return pra.referenced;
}
@@ -1128,6 +1150,7 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
spinlock_t *ptl;
int ret = SWAP_AGAIN;
enum ttu_flags flags = (enum ttu_flags)arg;
+ int dirty = 0;
pte = page_check_address(page, mm, address, &ptl, 0);
if (!pte)
@@ -1157,7 +1180,8 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
pteval = ptep_clear_flush(vma, address, pte);
/* Move the dirty bit to the physical page now the pte is gone. */
- if (pte_dirty(pteval))
+ dirty = pte_dirty(pteval);
+ if (dirty)
set_page_dirty(page);
/* Update high watermark before we lower rss */
@@ -1186,6 +1210,19 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
swp_entry_t entry = { .val = page_private(page) };
pte_t swp_pte;
+ if (flags & TTU_FREE) {
+ VM_BUG_ON_PAGE(PageSwapCache(page), page);
+ if (!dirty && !PageDirty(page)) {
+ /* It's a freeable page by MADV_FREE */
+ dec_mm_counter(mm, MM_ANONPAGES);
+ goto discard;
+ } else {
+ set_pte_at(mm, address, pte, pteval);
+ ret = SWAP_FAIL;
+ goto out_unmap;
+ }
+ }
+
if (PageSwapCache(page)) {
/*
* Store the swap location in the pte.
@@ -1227,6 +1264,7 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
} else
dec_mm_counter(mm, MM_FILEPAGES);
+discard:
page_remove_rmap(page);
page_cache_release(page);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 4636d9e822c1..8f67765ebb77 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -712,13 +712,17 @@ enum page_references {
};
static enum page_references page_check_references(struct page *page,
- struct scan_control *sc)
+ struct scan_control *sc,
+ bool *freeable)
{
int referenced_ptes, referenced_page;
unsigned long vm_flags;
+ int pte_dirty;
+
+ VM_BUG_ON_PAGE(!PageLocked(page), page);
referenced_ptes = page_referenced(page, 1, sc->target_mem_cgroup,
- &vm_flags);
+ &vm_flags, &pte_dirty);
referenced_page = TestClearPageReferenced(page);
/*
@@ -759,6 +763,10 @@ static enum page_references page_check_references(struct page *page,
return PAGEREF_KEEP;
}
+ if (PageAnon(page) && !pte_dirty && !PageSwapCache(page) &&
+ !PageDirty(page))
+ *freeable = true;
+
/* Reclaim if clean, defer dirty pages to writeback */
if (referenced_page && !PageSwapBacked(page))
return PAGEREF_RECLAIM_CLEAN;
@@ -827,6 +835,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
int may_enter_fs;
enum page_references references = PAGEREF_RECLAIM_CLEAN;
bool dirty, writeback;
+ bool freeable = false;
cond_resched();
@@ -950,7 +959,8 @@ static unsigned long shrink_page_list(struct list_head *page_list,
}
if (!force_reclaim)
- references = page_check_references(page, sc);
+ references = page_check_references(page, sc,
+ &freeable);
switch (references) {
case PAGEREF_ACTIVATE:
@@ -966,7 +976,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
* Anonymous process memory has backing store?
* Try to allocate it some swap space here.
*/
- if (PageAnon(page) && !PageSwapCache(page)) {
+ if (PageAnon(page) && !PageSwapCache(page) && !freeable) {
if (!(sc->gfp_mask & __GFP_IO))
goto keep_locked;
if (!add_to_swap(page, page_list))
@@ -981,8 +991,9 @@ static unsigned long shrink_page_list(struct list_head *page_list,
* The page is mapped into the page tables of one or more
* processes. Try to unmap it here.
*/
- if (page_mapped(page) && mapping) {
- switch (try_to_unmap(page, ttu_flags)) {
+ if (page_mapped(page) && (mapping || freeable)) {
+ switch (try_to_unmap(page,
+ freeable ? TTU_FREE : ttu_flags)) {
case SWAP_FAIL:
goto activate_locked;
case SWAP_AGAIN:
@@ -990,7 +1001,20 @@ static unsigned long shrink_page_list(struct list_head *page_list,
case SWAP_MLOCK:
goto cull_mlocked;
case SWAP_SUCCESS:
- ; /* try to free the page below */
+ /* try to free the page below */
+ if (!freeable)
+ break;
+ /*
+ * Freeable anon page doesn't have mapping
+ * due to skipping of swapcache so we free
+ * page in here rather than __remove_mapping.
+ */
+ VM_BUG_ON_PAGE(PageSwapCache(page), page);
+ if (!page_freeze_refs(page, 1))
+ goto keep_locked;
+ __clear_page_locked(page);
+ count_vm_event(PGLAZYFREED);
+ goto free_it;
}
}
@@ -1730,7 +1754,7 @@ static void shrink_active_list(unsigned long nr_to_scan,
}
if (page_referenced(page, 0, sc->target_mem_cgroup,
- &vm_flags)) {
+ &vm_flags, NULL)) {
nr_rotated += hpage_nr_pages(page);
/*
* Identify referenced, file-backed active pages and
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 1b12d390dc68..87b6c83ce717 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -814,6 +814,7 @@ const char * const vmstat_text[] = {
"pgfault",
"pgmajfault",
+ "pglazyfreed",
TEXTS_FOR_ZONES("pgrefill")
TEXTS_FOR_ZONES("pgsteal_kswapd")
--
2.0.0
Hello Andrew,
It seems I have been waiting for your review for a long time.
What should I do to get a slot of your time?
On Mon, Oct 20, 2014 at 07:11:57PM +0900, Minchan Kim wrote:
> This patch series enables the MADV_FREE hint for the madvise syscall, which
> has been supported by other OSes. [PATCH 1] includes the details.
>
> [1] supports MADV_FREE for !THP pages, so if the VM encounters a
> THP page in syscall context, it splits the THP page.
> [2-6] prepare for calling the madvise syscall without THP splitting.
> [7] enables THP page support for MADV_FREE.
>
> * from v16
> * Rebased on mmotm-2014-10-15-16-57
>
> * from v15
> * Add more Acked-by - Rik van Riel
> * Rebased on mmotom-08-29-15-15
>
> * from v14
> * Add more Acked-by from arch people (sparc, arm64 and arm)
> * Drop s390 since pmd_dirty/clean was merged
>
> * from v13
> * Add more Acked-by from arch people (arm, arm64 and ppc)
> * Rebased on mmotm 2014-08-13-14-29
>
> * from v12
> * Fix - skip to mark free pte on try_to_free_swap failed page - Kirill
> * Add more Acked-by from arch maintainers and Kirill
>
> * From v11
> * Fix arm build - Steve
> * Separate patch for arm and arm64 - Steve
> * Remove unnecessary check - Kirill
> * Skip non-vm_normal page - Kirill
> * Add Acked-by - Zhang
> * Sparc64 build fix
> * Pagetable walker THP handling fix
>
> * From v10
> * Add Acked-by from arch stuff(x86, s390)
> * Pagewalker based pagetable working - Kirill
> * Fix try_to_unmap_one broken with hwpoison - Kirill
> * Use VM_BUG_ON_PAGE in madvise_free_pmd - Kirill
> * Fix pgtable-3level.h for arm - Steve
>
> * From v9
> * Add Acked-by - Rik
> * Add THP page support - Kirill
>
> * From v8
> * Rebased-on v3.16-rc2-mmotm-2014-06-25-16-44
>
> * From v7
> * Rebased-on next-20140613
>
> * From v6
> * Remove page from swapcache at syscall time
> * Move utility functions from memory.c to madvise.c - Johannes
> * Rename utility functions - Johannes
> * Remove unnecessary checks from vmscan.c - Johannes
> * Rebased-on v3.15-rc5-mmotm-2014-05-16-16-56
> * Drop Reviewed-by because there were some changes since then.
>
> * From v5
> * Fix PPC problem which doesn't flush TLB - Rik
> * Remove unnecessary lazyfree_range stub function - Rik
> * Rebased on v3.15-rc5
>
> * From v4
> * Add Reviewed-by: Zhang Yanfei
> * Rebase on v3.15-rc1-mmotm-2014-04-15-16-14
>
> * From v3
> * Add "how to work part" in description - Zhang
> * Add page_discardable utility function - Zhang
> * Clean up
>
> * From v2
> * Remove forceful dirty marking of swapped-in page - Johannes
> * Remove deactivation logic of lazyfreed page
> * Rebased on 3.14
> * Remove RFC tag
>
> * From v1
> * Use custom page table walker for madvise_free - Johannes
> * Remove PG_lazypage flag - Johannes
> * Do madvise_dontneed instead of madvise_free in swapless system
>
>
>
> Minchan Kim (7):
> mm: support madvise(MADV_FREE)
> x86: add pmd_[dirty|mkclean] for THP
> sparc: add pmd_[dirty|mkclean] for THP
> powerpc: add pmd_[dirty|mkclean] for THP
> arm: add pmd_mkclean for THP
> arm64: add pmd_[dirty|mkclean] for THP
> mm: Don't split THP page when syscall is called
>
> arch/arm/include/asm/pgtable-3level.h | 1 +
> arch/arm64/include/asm/pgtable.h | 2 +
> arch/powerpc/include/asm/pgtable-ppc64.h | 2 +
> arch/sparc/include/asm/pgtable_64.h | 16 ++++
> arch/x86/include/asm/pgtable.h | 10 ++
> include/linux/huge_mm.h | 4 +
> include/linux/rmap.h | 9 +-
> include/linux/vm_event_item.h | 1 +
> include/uapi/asm-generic/mman-common.h | 1 +
> mm/huge_memory.c | 35 +++++++
> mm/madvise.c | 159 +++++++++++++++++++++++++++++++
> mm/rmap.c | 46 ++++++++-
> mm/vmscan.c | 64 +++++++++----
> mm/vmstat.c | 1 +
> 14 files changed, 331 insertions(+), 20 deletions(-)
>
> --
> 2.0.0
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to [email protected]. For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"[email protected]"> [email protected] </a>
--
Kind regards,
Minchan Kim
On Fri, 14 Nov 2014 07:58:09 +0900 Minchan Kim <[email protected]> wrote:
> It seems I have waited your review for a long time.
> What should I do to take your time slot?
I'm being terrible, sorry.
I'll merge the patches into -mm next week so at least they get some
external testing while I get my ass into gear.
[Late, but I didn't get to this sooner. I hope this is still the
up-to-date version]
On Mon 20-10-14 19:11:58, Minchan Kim wrote:
> Linux doesn't have the ability to free pages lazily, while other OSes
> already support it under the name madvise(MADV_FREE).
>
> The gain is clear: the kernel can discard freed pages rather than
> swapping them out or going OOM if memory pressure happens.
>
> Without memory pressure, freed pages would be reused by userspace
> without additional overhead (e.g., page fault + allocation
> + zeroing).
>
> It works as follows.
>
> When the madvise syscall is called, the VM clears the dirty bit of the
> ptes in the range. If memory pressure happens, the VM checks the dirty
> bit in the page table, and if it is still "clean", the page is a
> "lazyfree page", so the VM can discard it instead of swapping it out.
> If there was a store to the page before the VM picks the page
> for reclaim, the dirty bit is set, so the VM swaps the page out instead
> of discarding it.
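As a rough illustration of the mechanism described above, here is a user-space toy model (not kernel code) of the decision reclaim makes. The `struct toy_page` type and the `toy_*` helpers are hypothetical names invented for this sketch; only the predicate in `toy_freeable()` is taken from the `freeable` test that the patch adds to page_check_references(). The model assumes MADV_FREE only clears the PTE dirty bit, as the changelog states, and ignores everything else the real code does.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the page state the reclaim path looks at. */
struct toy_page {
	bool anon;       /* PageAnon(page) */
	bool pte_dirty;  /* dirty bit seen in a PTE mapping the page */
	bool page_dirty; /* PageDirty(page) */
	bool swapcache;  /* PageSwapCache(page) */
};

enum toy_action { TOY_DISCARD, TOY_SWAP_OUT, TOY_NOT_ANON };

/* Same condition as the "*freeable = true" test in the patch. */
static bool toy_freeable(const struct toy_page *p)
{
	return p->anon && !p->pte_dirty && !p->swapcache && !p->page_dirty;
}

/* MADV_FREE as described in the changelog: clear the PTE dirty bit. */
static void toy_madvise_free(struct toy_page *p)
{
	assert(p->anon); /* the hint targets anonymous memory only */
	p->pte_dirty = false;
}

/* A later store by userspace sets the PTE dirty bit again. */
static void toy_store(struct toy_page *p)
{
	p->pte_dirty = true;
}

/* What this toy reclaim would do with the page under memory pressure. */
static enum toy_action toy_reclaim(const struct toy_page *p)
{
	if (!p->anon)
		return TOY_NOT_ANON;
	return toy_freeable(p) ? TOY_DISCARD : TOY_SWAP_OUT;
}
```

Calling toy_madvise_free() on a dirty anonymous page moves it from the swap-out case to the discard case, and a subsequent toy_store() moves it back, which is the "cancel by write" behaviour the changelog describes.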
Is there a patch for the madvise man page? I guess the semantics will be
the same as or similar to FreeBSD's:
http://www.freebsd.org/cgi/man.cgi?query=madvise&sektion=2
I guess the changelog should state more specifically that this is only for
private MAP_ANON mappings (the same applies to the man page patch).
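For readers who want to see what this looks like from userspace, here is a minimal sketch of an allocator-style release helper operating on a private anonymous mapping. The names `map_private_anon()` and `lazy_release()` are hypothetical, the fallback to MADV_DONTNEED on EINVAL is a design assumption of this sketch rather than something the patch prescribes, and the `MADV_FREE` fallback value of 8 is the one in mainline Linux headers, which may differ from the value used in this revision of the series.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE /* madvise() and MAP_ANONYMOUS under strict -std= modes */
#endif
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>

#ifndef MADV_FREE
#define MADV_FREE 8 /* assumption: value from mainline Linux headers */
#endif

/* Map len bytes of private anonymous memory; returns NULL on failure. */
static void *map_private_anon(size_t len)
{
	void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	return p == MAP_FAILED ? NULL : p;
}

/*
 * Hypothetical allocator helper: tell the kernel the range may be
 * freed lazily. Returns 0 on success, -1 with errno set otherwise.
 */
static int lazy_release(void *addr, size_t len)
{
	assert(addr != NULL && len > 0);
	if (madvise(addr, len, MADV_FREE) == 0)
		return 0;
	if (errno != EINVAL)
		return -1;
	/* The kernel rejected the hint (e.g. it predates MADV_FREE): free eagerly. */
	return madvise(addr, len, MADV_DONTNEED);
}
```

After lazy_release() succeeds the range stays mapped, so the allocator can hand it out again without a new mmap(); the caller must treat the old contents as gone.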
> Firstly, heavy users would be general allocators(ex, jemalloc,
> tcmalloc and hope glibc supports it) and jemalloc/tcmalloc already
> have supported the feature for other OS(ex, FreeBSD)
>
[...]
>
> Cc: Michael Kerrisk <[email protected]>
> Cc: Linux API <[email protected]>
> Cc: Hugh Dickins <[email protected]>
> Cc: Johannes Weiner <[email protected]>
> Cc: KOSAKI Motohiro <[email protected]>
> Cc: Mel Gorman <[email protected]>
> Cc: Jason Evans <[email protected]>
> Acked-by: Kirill A. Shutemov <[email protected]>
> Acked-by: Zhang Yanfei <[email protected]>
> Acked-by: Rik van Riel <[email protected]>
> Signed-off-by: Minchan Kim <[email protected]>
Reviewed-by: Michal Hocko <[email protected]>
[...]
--
Michal Hocko
SUSE Labs
On Mon 20-10-14 19:12:04, Minchan Kim wrote:
> We don't need to split a THP page when the MADV_FREE syscall is
> called. The split can be done when the VM decides to really free it, so
> we can avoid an unnecessary THP split.
>
> Cc: Andrea Arcangeli <[email protected]>
> Acked-by: Rik van Riel <[email protected]>
> Acked-by: Kirill A. Shutemov <[email protected]>
> Signed-off-by: Minchan Kim <[email protected]>
Other than a minor comment below
Reviewed-by: Michal Hocko <[email protected]>
> ---
> include/linux/huge_mm.h | 4 ++++
> mm/huge_memory.c | 35 +++++++++++++++++++++++++++++++++++
> mm/madvise.c | 21 ++++++++++++++++++++-
> mm/rmap.c | 8 ++++++--
> mm/vmscan.c | 28 ++++++++++++++++++----------
> 5 files changed, 83 insertions(+), 13 deletions(-)
>
[...]
> diff --git a/mm/madvise.c b/mm/madvise.c
> index a21584235bb6..84badee5f46d 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -271,8 +271,26 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
> spinlock_t *ptl;
> pte_t *pte, ptent;
> struct page *page;
> + unsigned long next;
> +
> + next = pmd_addr_end(addr, end);
> + if (pmd_trans_huge(*pmd)) {
> + if (next - addr != HPAGE_PMD_SIZE) {
> +#ifdef CONFIG_DEBUG_VM
> + if (!rwsem_is_locked(&mm->mmap_sem)) {
> + pr_err("%s: mmap_sem is unlocked! addr=0x%lx end=0x%lx vma->vm_start=0x%lx vma->vm_end=0x%lx\n",
> + __func__, addr, end,
> + vma->vm_start,
> + vma->vm_end);
> + BUG();
> + }
> +#endif
Why is this code here? madvise_free_pte_range is called only from the
madvise path and we are holding mmap_sem and relying on that for regular
pages as well.
> + split_huge_page_pmd(vma, addr, pmd);
> + } else if (!madvise_free_huge_pmd(tlb, vma, pmd, addr))
> + goto next;
> + /* fall through */
> + }
>
> - split_huge_page_pmd(vma, addr, pmd);
> if (pmd_trans_unstable(pmd))
> return 0;
>
> @@ -316,6 +334,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
> }
> arch_leave_lazy_mmu_mode();
> pte_unmap_unlock(pte - 1, ptl);
> +next:
> cond_resched();
> return 0;
> }
[...]
--
Michal Hocko
SUSE Labs
Hello Michal,
On Thu, Nov 27, 2014 at 03:47:25PM +0100, Michal Hocko wrote:
> [Late but I didn't get to this soone - I hope this is still up-to-date
> version]
>
> On Mon 20-10-14 19:11:58, Minchan Kim wrote:
> > Linux doesn't have the ability to free pages lazily, while other OSes
> > already support it under the name madvise(MADV_FREE).
> >
> > The gain is clear: the kernel can discard freed pages rather than
> > swapping them out or going OOM if memory pressure happens.
> >
> > Without memory pressure, freed pages would be reused by userspace
> > without additional overhead (e.g., page fault + allocation
> > + zeroing).
> >
> > It works as follows.
> >
> > When the madvise syscall is called, the VM clears the dirty bit of the
> > ptes in the range. If memory pressure happens, the VM checks the dirty
> > bit in the page table, and if it is still "clean", the page is a
> > "lazyfree page", so the VM can discard it instead of swapping it out.
> > If there was a store to the page before the VM picks the page
> > for reclaim, the dirty bit is set, so the VM swaps the page out instead
> > of discarding it.
>
> Is there any patch for madvise man page? I guess the semantic will be
> same/similar to FreeBSD:
> http://www.freebsd.org/cgi/man.cgi?query=madvise&sektion=2
I postponed it because I didn't know when the feature would be released into
mainline, and the man page needs the version ("MADV_FREE since Linux x.x.x").
However, posting early does no harm.
Here it goes.
Most of the content was copied from the FreeBSD man page.
From 2edd6890f92fa4943ce3c452194479458582d88c Mon Sep 17 00:00:00 2001
From: Minchan Kim <[email protected]>
Date: Mon, 1 Dec 2014 08:53:55 +0900
Subject: [PATCH] madvise.2: Document MADV_FREE
Signed-off-by: Minchan Kim <[email protected]>
---
man2/madvise.2 | 13 +++++++++++++
1 file changed, 13 insertions(+)
diff --git a/man2/madvise.2 b/man2/madvise.2
index 032ead7..33aa936 100644
--- a/man2/madvise.2
+++ b/man2/madvise.2
@@ -265,6 +265,19 @@ file (see
.BR MADV_DODUMP " (since Linux 3.4)"
Undo the effect of an earlier
.BR MADV_DONTDUMP .
+.TP
+.BR MADV_FREE " (since Linux 3.19)"
+Gives the VM system the freedom to free pages, and tells the system that
+information in the specified page range is no longer important.
+This is an efficient way of allowing
+.BR malloc (3)
+to free pages anywhere in the address space, while keeping the address space
+valid. The next time that the page is referenced, the page might be demand
+zeroed, or might contain the data that was there before the MADV_FREE call.
+References made to that address space range will not make the VM system page the
+information back in from backing store until the page is modified again.
+It works only with private anonymous pages (see
+.BR mmap (2)).
.SH RETURN VALUE
On success
.BR madvise ()
--
2.0.0
>
> I guess the changelog should be more specific that this is only for the
> private MAP_ANON mappings (same applies to the patch for man).
>
> > Firstly, heavy users would be general allocators(ex, jemalloc,
> > tcmalloc and hope glibc supports it) and jemalloc/tcmalloc already
> > have supported the feature for other OS(ex, FreeBSD)
> >
> [...]
> >
> > Cc: Michael Kerrisk <[email protected]>
> > Cc: Linux API <[email protected]>
> > Cc: Hugh Dickins <[email protected]>
> > Cc: Johannes Weiner <[email protected]>
> > Cc: KOSAKI Motohiro <[email protected]>
> > Cc: Mel Gorman <[email protected]>
> > Cc: Jason Evans <[email protected]>
> > Acked-by: Kirill A. Shutemov <[email protected]>
> > Acked-by: Zhang Yanfei <[email protected]>
> > Acked-by: Rik van Riel <[email protected]>
> > Signed-off-by: Minchan Kim <[email protected]>
>
> Reviewed-by: Michal Hocko <[email protected]>
> [...]
> --
> Michal Hocko
> SUSE Labs
>
--
Kind regards,
Minchan Kim
On Thu, Nov 27, 2014 at 04:49:21PM +0100, Michal Hocko wrote:
> On Mon 20-10-14 19:12:04, Minchan Kim wrote:
> > We don't need to split a THP page when the MADV_FREE syscall is
> > called. The split can be done when the VM decides to really free it, so
> > we can avoid an unnecessary THP split.
> >
> > Cc: Andrea Arcangeli <[email protected]>
> > Acked-by: Rik van Riel <[email protected]>
> > Acked-by: Kirill A. Shutemov <[email protected]>
> > Signed-off-by: Minchan Kim <[email protected]>
>
> Other than a minor comment below
> Reviewed-by: Michal Hocko <[email protected]>
Thanks!
>
> > ---
> > include/linux/huge_mm.h | 4 ++++
> > mm/huge_memory.c | 35 +++++++++++++++++++++++++++++++++++
> > mm/madvise.c | 21 ++++++++++++++++++++-
> > mm/rmap.c | 8 ++++++--
> > mm/vmscan.c | 28 ++++++++++++++++++----------
> > 5 files changed, 83 insertions(+), 13 deletions(-)
> >
> [...]
> > diff --git a/mm/madvise.c b/mm/madvise.c
> > index a21584235bb6..84badee5f46d 100644
> > --- a/mm/madvise.c
> > +++ b/mm/madvise.c
> > @@ -271,8 +271,26 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
> > spinlock_t *ptl;
> > pte_t *pte, ptent;
> > struct page *page;
> > + unsigned long next;
> > +
> > + next = pmd_addr_end(addr, end);
> > + if (pmd_trans_huge(*pmd)) {
> > + if (next - addr != HPAGE_PMD_SIZE) {
> > +#ifdef CONFIG_DEBUG_VM
> > + if (!rwsem_is_locked(&mm->mmap_sem)) {
> > + pr_err("%s: mmap_sem is unlocked! addr=0x%lx end=0x%lx vma->vm_start=0x%lx vma->vm_end=0x%lx\n",
> > + __func__, addr, end,
> > + vma->vm_start,
> > + vma->vm_end);
> > + BUG();
> > + }
> > +#endif
>
> Why is this code here? madvise_free_pte_range is called only from the
> madvise path and we are holding mmap_sem and relying on that for regular
> pages as well.
Makes sense.
From 2ecc213a2c3634cbee7529055fd6348b03307ed5 Mon Sep 17 00:00:00 2001
From: Minchan Kim <[email protected]>
Date: Mon, 1 Dec 2014 09:07:18 +0900
Subject: [PATCH] mm: remove lock validation check for MADV_FREE
Currently, madvise_free_pte_range is called only from the madvise path,
which already holds mmap_sem, so the lock validation check is
pointless.
Signed-off-by: Minchan Kim <[email protected]>
---
mm/madvise.c | 13 ++-----------
1 file changed, 2 insertions(+), 11 deletions(-)
diff --git a/mm/madvise.c b/mm/madvise.c
index dc024effa9bf..6fc9b8298da1 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -275,18 +275,9 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
next = pmd_addr_end(addr, end);
if (pmd_trans_huge(*pmd)) {
- if (next - addr != HPAGE_PMD_SIZE) {
-#ifdef CONFIG_DEBUG_VM
- if (!rwsem_is_locked(&mm->mmap_sem)) {
- pr_err("%s: mmap_sem is unlocked! addr=0x%lx end=0x%lx vma->vm_start=0x%lx vma->vm_end=0x%lx\n",
- __func__, addr, end,
- vma->vm_start,
- vma->vm_end);
- BUG();
- }
-#endif
+ if (next - addr != HPAGE_PMD_SIZE)
split_huge_page_pmd(vma, addr, pmd);
- } else if (!madvise_free_huge_pmd(tlb, vma, pmd, addr))
+ else if (!madvise_free_huge_pmd(tlb, vma, pmd, addr))
goto next;
/* fall through */
}
--
2.0.0
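The huge-page branch that remains after this cleanup hinges on one comparison: whether the part of the request that falls inside the current PMD covers the whole huge page. A small user-space model of that arithmetic is sketched below. The `toy_*` names are hypothetical, the 2 MiB huge page size is an assumption (x86-64 with 4 KiB base pages), and `toy_pmd_addr_end()` copies the arithmetic of the kernel's generic pmd_addr_end() macro rather than calling any kernel code.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumption: 2 MiB PMD-sized huge pages, as on x86-64 with 4 KiB pages. */
#define TOY_PMD_SIZE (1UL << 21)
#define TOY_PMD_MASK (~(TOY_PMD_SIZE - 1))

/* Next PMD boundary after addr, clamped to end (same math as pmd_addr_end()). */
static unsigned long toy_pmd_addr_end(unsigned long addr, unsigned long end)
{
	unsigned long boundary = (addr + TOY_PMD_SIZE) & TOY_PMD_MASK;

	return (boundary - 1 < end - 1) ? boundary : end;
}

/*
 * True when [addr, end) covers only part of the huge page starting at
 * addr's PMD, i.e. the "next - addr != HPAGE_PMD_SIZE" case in which
 * the patch calls split_huge_page_pmd().
 */
static bool toy_needs_split(unsigned long addr, unsigned long end)
{
	assert(addr < end);
	return toy_pmd_addr_end(addr, end) - addr != TOY_PMD_SIZE;
}
```

With these definitions, a request for exactly one aligned 2 MiB region does not need a split, while a request that starts or ends inside the huge page does.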
--
Kind regards,
Minchan Kim
On Mon 01-12-14 08:56:52, Minchan Kim wrote:
[...]
> From 2edd6890f92fa4943ce3c452194479458582d88c Mon Sep 17 00:00:00 2001
> From: Minchan Kim <[email protected]>
> Date: Mon, 1 Dec 2014 08:53:55 +0900
> Subject: [PATCH] madvise.2: Document MADV_FREE
>
> Signed-off-by: Minchan Kim <[email protected]>
> ---
> man2/madvise.2 | 13 +++++++++++++
> 1 file changed, 13 insertions(+)
>
> diff --git a/man2/madvise.2 b/man2/madvise.2
> index 032ead7..33aa936 100644
> --- a/man2/madvise.2
> +++ b/man2/madvise.2
> @@ -265,6 +265,19 @@ file (see
> .BR MADV_DODUMP " (since Linux 3.4)"
> Undo the effect of an earlier
> .BR MADV_DONTDUMP .
> +.TP
> +.BR MADV_FREE " (since Linux 3.19)"
> +Gives the VM system the freedom to free pages, and tells the system that
> +information in the specified page range is no longer important.
> +This is an efficient way of allowing
> +.BR malloc (3)
This might be rather misleading. Only some malloc implementations are
using this feature (jemalloc, right?). So either be specific about which
implementation or do not add it at all.
> +to free pages anywhere in the address space, while keeping the address space
> +valid. The next time that the page is referenced, the page might be demand
> +zeroed, or might contain the data that was there before the MADV_FREE call.
> +References made to that address space range will not make the VM system page the
> +information back in from backing store until the page is modified again.
I am not sure I understand the last sentence. So say I did MADV_FREE and
the reclaim has dropped that page. I know that the file backed mappings
are not supported yet but assume they were for a second... Now, if I
read from that location again, what is the result?
If we consider anon mappings then the backing store is misleading as
well because memory was dropped and so always newly allocated.
I would rather drop the whole sentence and instead see an explanation
of how this differs from MADV_DONTNEED.
"
Unlike MADV_DONTNEED, the memory is freed lazily, e.g. when the VM system
is under memory pressure.
"
> +It works only with private anonymous pages (see
> +.BR mmap (2)).
> .SH RETURN VALUE
> On success
> .BR madvise ()
--
Michal Hocko
SUSE Labs
On Tue, Dec 02, 2014 at 11:01:25AM +0100, Michal Hocko wrote:
> On Mon 01-12-14 08:56:52, Minchan Kim wrote:
> [...]
> > From 2edd6890f92fa4943ce3c452194479458582d88c Mon Sep 17 00:00:00 2001
> > From: Minchan Kim <[email protected]>
> > Date: Mon, 1 Dec 2014 08:53:55 +0900
> > Subject: [PATCH] madvise.2: Document MADV_FREE
> >
> > Signed-off-by: Minchan Kim <[email protected]>
> > ---
> > man2/madvise.2 | 13 +++++++++++++
> > 1 file changed, 13 insertions(+)
> >
> > diff --git a/man2/madvise.2 b/man2/madvise.2
> > index 032ead7..33aa936 100644
> > --- a/man2/madvise.2
> > +++ b/man2/madvise.2
> > @@ -265,6 +265,19 @@ file (see
> > .BR MADV_DODUMP " (since Linux 3.4)"
> > Undo the effect of an earlier
> > .BR MADV_DONTDUMP .
> > +.TP
> > +.BR MADV_FREE " (since Linux 3.19)"
> > +Gives the VM system the freedom to free pages, and tells the system that
> > +information in the specified page range is no longer important.
> > +This is an efficient way of allowing
> > +.BR malloc (3)
>
> This might be rather misleading. Only some malloc implementations are
> using this feature (jemalloc, right?). So either be specific about which
> implementation or do not add it at all.
Makes sense. I don't think it's a good idea to give a specific example
in the man page; it is rather arguable and limits the idea.
>
> > +to free pages anywhere in the address space, while keeping the address space
> > +valid. The next time that the page is referenced, the page might be demand
> > +zeroed, or might contain the data that was there before the MADV_FREE call.
> > +References made to that address space range will not make the VM system page the
> > +information back in from backing store until the page is modified again.
>
> I am not sure I understand the last sentence. So say I did MADV_FREE and
> the reclaim has dropped that page. I know that the file backed mappings
> are not supported yet but assume they were for a second... Now, I do
> read from that location again what is the result?
Zero page.
> If we consider anon mappings then the backing store is misleading as
> well because memory was dropped and so always newly allocated.
When I first read the sentence, I thought backing store meant swap,
so I had no trouble understanding it. But I agree with your opinion.
The target audience of the man page is application developers, not kernel developers.
> I would rather drop the whole sentence and rather see an explanation
> what is the difference between to MADV_DONT_NEED.
> "
> Unlike MADV_DONT_NEED the memory is freed lazily e.g. when the VM system
> is under memory pressure.
> "
It's a good idea, but I don't think it's enough. At least we should explain
how the delayed free is cancelled (i.e., by a write). So, how about this?
MADV_FREE " (since Linux 3.19)"
Gives the VM system the freedom to free pages, and tells the system that
it's okay to free pages if the VM system has reasons(e.g., memory pressure).
So, it looks like delayed MADV_DONTNEED.
The next time that the page is referenced, the page might be demand
zeroed if the VM system freed the page. Otherwise, it might contain the data
that was there before the MADV_FREE call if the VM system didn't free the page.
New write in the page after the MADV_FREE call makes the VM system not free
the page any more.
It works only with private anonymous pages (see mmap(2)).
>
> > +It works only with private anonymous pages (see
> > +.BR mmap (2)).
> > .SH RETURN VALUE
> > On success
> > .BR madvise ()
> --
> Michal Hocko
> SUSE Labs
>
--
Kind regards,
Minchan Kim
On Wed 03-12-14 09:00:26, Minchan Kim wrote:
> On Tue, Dec 02, 2014 at 11:01:25AM +0100, Michal Hocko wrote:
> > On Mon 01-12-14 08:56:52, Minchan Kim wrote:
> > [...]
> > > From 2edd6890f92fa4943ce3c452194479458582d88c Mon Sep 17 00:00:00 2001
> > > From: Minchan Kim <[email protected]>
> > > Date: Mon, 1 Dec 2014 08:53:55 +0900
> > > Subject: [PATCH] madvise.2: Document MADV_FREE
> > >
> > > Signed-off-by: Minchan Kim <[email protected]>
> > > ---
> > > man2/madvise.2 | 13 +++++++++++++
> > > 1 file changed, 13 insertions(+)
> > >
> > > diff --git a/man2/madvise.2 b/man2/madvise.2
> > > index 032ead7..33aa936 100644
> > > --- a/man2/madvise.2
> > > +++ b/man2/madvise.2
> > > @@ -265,6 +265,19 @@ file (see
> > > .BR MADV_DODUMP " (since Linux 3.4)"
> > > Undo the effect of an earlier
> > > .BR MADV_DONTDUMP .
> > > +.TP
> > > +.BR MADV_FREE " (since Linux 3.19)"
> > > +Gives the VM system the freedom to free pages, and tells the system that
> > > +information in the specified page range is no longer important.
> > > +This is an efficient way of allowing
> > > +.BR malloc (3)
> >
> > This might be rather misleading. Only some malloc implementations are
> > using this feature (jemalloc, right?). So either be specific about which
> > implementation or do not add it at all.
>
> Make sense. I don't think it's a good idea to say specific example
> in man page, which is rather arguable and limit the idea.
>
> >
> > > +to free pages anywhere in the address space, while keeping the address space
> > > +valid. The next time that the page is referenced, the page might be demand
> > > +zeroed, or might contain the data that was there before the MADV_FREE call.
> > > +References made to that address space range will not make the VM system page the
> > > +information back in from backing store until the page is modified again.
> >
> > I am not sure I understand the last sentence. So say I did MADV_FREE and
> > the reclaim has dropped that page. I know that the file backed mappings
> > are not supported yet but assume they were for a second... Now, I do
> > read from that location again what is the result?
>
> Zero page.
OK, it felt strange at first but now that I am thinking about it some
more it starts making sense. So the semantics are: either a zero page
(disconnected from the backing store) or the original content after
madvise(MADV_FREE). The page gets connected to the backing store after
it gets modified again. If this is the case then the sentence in the man
page makes perfect sense.
What confused me was that I expected file backed pages would get a
fresh page from the origin but this would be awkward I guess.
> > If we consider anon mappings then the backing store is misleading as
> > well because memory was dropped and so always newly allocated.
>
> When I read the sentence at first, I thought backing store means swap
> so I don't have any trouble to understand it. But I agree your opinion.
> Target for man page is not a kernel developer but application developer.
>
> > I would rather drop the whole sentence and rather see an explanation
> > what is the difference between to MADV_DONT_NEED.
> > "
> > Unlike MADV_DONT_NEED the memory is freed lazily e.g. when the VM system
> > is under memory pressure.
> > "
>
> It's a good idea but I don't think it's enough. At least we should explan
> cancel of delay free logic(ie, write). So, How about this?
>
> MADV_FREE " (since Linux 3.19)"
>
> Gives the VM system the freedom to free pages, and tells the system that
> it's okay to free pages if the VM system has reasons(e.g., memory pressure).
> So, it looks like delayed MADV_DONTNEED.
> The next time that the page is referenced, the page might be demand
> zeroed if the VM system freed the page. Otherwise, it might contain the data
> that was there before the MADV_FREE call if the VM system didn't free the page.
> New write in the page after the MADV_FREE call makes the VM system not free
> the page any more.
Dunno, I guess the original content was slightly better. Or the
following wording from UNIX man pages is even more descriptive
(http://www.lehman.cuny.edu/cgi-bin/man-cgi?madvise+3)
"
Tell the kernel that contents in the specified address range are no
longer important and the range will be overwritten. When there is
demand for memory, the system will free pages associated with the
specified address range. In this instance, the next time a page in the
address range is referenced, it will contain all zeroes. Otherwise,
it will contain the data that was there prior to the MADV_FREE
call. References made to the address range will not make the system read
from backing store (swap space) until the page is modified again.
This value cannot be used on mappings that have underlying file objects.
"
I would just clarify the last sentence with the addition
(MAP_PRIVATE|MAP_ANONYMOUS mappings in this implementation). The
difference to MADV_DONTNEED is more complicated now so I wouldn't make
the text even more confusing.
Anyway the confusion started on my end so feel free to stick with the
BSD wording (modulo malloc note which is really confusing as the default
glibc allocator doesn't do that AFAIK).
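The semantics discussed here (zeroes or the old data on the next reference, and a new write being kept) can be observed from userspace with a sketch like the one below. The function name `read_back_after_madv_free()` is hypothetical, and the `MADV_FREE` fallback value of 8 is the mainline Linux one, assumed here for older libc headers. Which of the two outcomes a plain read sees depends on whether reclaim ran in between, so the code only reports the byte and does not predict it.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE /* madvise() and MAP_ANONYMOUS under strict -std= modes */
#endif
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

#ifndef MADV_FREE
#define MADV_FREE 8 /* assumption: value from mainline Linux headers */
#endif

/*
 * Write old_val into a fresh private anonymous page, apply MADV_FREE,
 * optionally store new_val, then read the byte back.
 * Returns the byte read (0..255), or -1 if a system call failed
 * (for example on a kernel without MADV_FREE).
 */
static int read_back_after_madv_free(unsigned char old_val, int do_store,
				     unsigned char new_val)
{
	long page = sysconf(_SC_PAGESIZE);
	if (page <= 0)
		return -1;
	size_t len = (size_t)page;

	/* volatile so the accesses really reach the mapping */
	volatile unsigned char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
					 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED)
		return -1;

	p[0] = old_val; /* dirty the page */
	if (madvise((void *)p, len, MADV_FREE) != 0) {
		munmap((void *)p, len);
		return -1;
	}
	if (do_store)
		p[0] = new_val; /* a store cancels the lazy free for this page */

	int seen = p[0];
	munmap((void *)p, len);
	return seen;
}
```

Without the store, the returned byte is either 0 or old_val; with the store, it is new_val whether or not the page was reclaimed in between.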
> It works only with private anonymous pages (see mmap(2)).
--
Michal Hocko
SUSE Labs
On Wed, Dec 03, 2014 at 11:13:29AM +0100, Michal Hocko wrote:
> On Wed 03-12-14 09:00:26, Minchan Kim wrote:
> > On Tue, Dec 02, 2014 at 11:01:25AM +0100, Michal Hocko wrote:
> > > On Mon 01-12-14 08:56:52, Minchan Kim wrote:
> > > [...]
> > > > From 2edd6890f92fa4943ce3c452194479458582d88c Mon Sep 17 00:00:00 2001
> > > > From: Minchan Kim <[email protected]>
> > > > Date: Mon, 1 Dec 2014 08:53:55 +0900
> > > > Subject: [PATCH] madvise.2: Document MADV_FREE
> > > >
> > > > Signed-off-by: Minchan Kim <[email protected]>
> > > > ---
> > > > man2/madvise.2 | 13 +++++++++++++
> > > > 1 file changed, 13 insertions(+)
> > > >
> > > > diff --git a/man2/madvise.2 b/man2/madvise.2
> > > > index 032ead7..33aa936 100644
> > > > --- a/man2/madvise.2
> > > > +++ b/man2/madvise.2
> > > > @@ -265,6 +265,19 @@ file (see
> > > > .BR MADV_DODUMP " (since Linux 3.4)"
> > > > Undo the effect of an earlier
> > > > .BR MADV_DONTDUMP .
> > > > +.TP
> > > > +.BR MADV_FREE " (since Linux 3.19)"
> > > > +Gives the VM system the freedom to free pages, and tells the system that
> > > > +information in the specified page range is no longer important.
> > > > +This is an efficient way of allowing
> > > > +.BR malloc (3)
> > >
> > > This might be rather misleading. Only some malloc implementations are
> > > using this feature (jemalloc, right?). So either be specific about which
> > > implementation or do not add it at all.
> >
> > Make sense. I don't think it's a good idea to say specific example
> > in man page, which is rather arguable and limit the idea.
> >
> > >
> > > > +to free pages anywhere in the address space, while keeping the address space
> > > > +valid. The next time that the page is referenced, the page might be demand
> > > > +zeroed, or might contain the data that was there before the MADV_FREE call.
> > > > +References made to that address space range will not make the VM system page the
> > > > +information back in from backing store until the page is modified again.
> > >
> > > I am not sure I understand the last sentence. So say I did MADV_FREE and
> > > the reclaim has dropped that page. I know that the file backed mappings
> > > are not supported yet but assume they were for a second... Now, I do
> > > read from that location again what is the result?
> >
> > Zero page.
>
> OK, it felt strange at first but now that I am thinking about it some
> more it starts making sense. So the semantic is: Either zero page
> (disconnected from the backing store) or the original content after
> madvise(MADV_FREE). The page gets connected to the backing store after
> it gets modified again. If this is the case then the sentence in the man
> page makes perfect sense.
>
> What made me confused was that I expected file backed pages would get a
> fresh page from the origin but this would be awkward I guess.
>
> > > If we consider anon mappings then the backing store is misleading as
> > > well because memory was dropped and so always newly allocated.
> >
> > When I read the sentence at first, I thought backing store means swap
> > so I don't have any trouble to understand it. But I agree your opinion.
> > Target for man page is not a kernel developer but application developer.
> >
> > > I would rather drop the whole sentence and instead see an explanation
> > > of how this differs from MADV_DONTNEED.
> > > "
> > > Unlike MADV_DONTNEED, the memory is freed lazily, e.g. when the VM system
> > > is under memory pressure.
> > > "
> >
> > It's a good idea but I don't think it's enough. At least we should explain
> > how the delayed free is cancelled (i.e., by a write). So, how about this?
> >
> > MADV_FREE " (since Linux 3.19)"
> >
> > Gives the VM system the freedom to free pages, and tells the system that
> > it's okay to free pages if the VM system has a reason to (e.g., memory pressure).
> > So, it behaves like a delayed MADV_DONTNEED.
> > The next time the page is referenced, it will be demand-zeroed if the
> > VM system freed the page. Otherwise, it will contain the data that was
> > there before the MADV_FREE call.
> > A new write to the page after the MADV_FREE call stops the VM system from
> > freeing the page.
>
> Dunno, I guess the original content was slightly better. Or the
> following wording from UNIX man pages is even more descriptive
> (http://www.lehman.cuny.edu/cgi-bin/man-cgi?madvise+3)
> "
> Tell the kernel that contents in the specified address range are no
> longer important and the range will be overwritten. When there is
> demand for memory, the system will free pages associated with the
> specified address range. In this instance, the next time a page in the
> address range is referenced, it will contain all zeroes. Otherwise,
> it will contain the data that was there prior to the MADV_FREE
> call. References made to the address range will not make the system read
> from backing store (swap space) until the page is modified again.
>
> This value cannot be used on mappings that have underlying file objects.
> "
For me, it would be better.
Thanks for the heads up.
>
> I would just clarify the last sentence with addition
> (MAP_PRIVATE|MAP_ANONYMOUS mappings in this implementation). The
I want to be consistent with KSM/THP, which use the term "private anonymous pages".
I guess the man page maintainer has already acked that term, so I want to use it,
too.
> difference to MADV_DONTNEED is more complicated now so I wouldn't make
> the text even more confusing.
>
> Anyway the confusion started on my end so feel free to stick with the
> BSD wording (modulo malloc note which is really confusing as the default
> glibc allocator doesn't do that AFAIK).
From cfa212d4fb307ae772b08cf564cab7e6adb8f4fc Mon Sep 17 00:00:00 2001
From: Minchan Kim <[email protected]>
Date: Mon, 1 Dec 2014 08:53:55 +0900
Subject: [PATCH] madvise.2: Document MADV_FREE
Signed-off-by: Minchan Kim <[email protected]>
---
man2/madvise.2 | 12 ++++++++++++
1 file changed, 12 insertions(+)
diff --git a/man2/madvise.2 b/man2/madvise.2
index 032ead7..fc1aaca 100644
--- a/man2/madvise.2
+++ b/man2/madvise.2
@@ -265,6 +265,18 @@ file (see
.BR MADV_DODUMP " (since Linux 3.4)"
Undo the effect of an earlier
.BR MADV_DONTDUMP .
+.TP
+.BR MADV_FREE " (since Linux 3.19)"
+Tell the kernel that contents in the specified address range are no
+longer important and the range will be overwritten. When there is
+demand for memory, the system will free pages associated with the
+specified address range. In this instance, the next time a page in the
+address range is referenced, it will contain all zeroes. Otherwise,
+it will contain the data that was there prior to the MADV_FREE call.
+References made to the address range will not make the system read
+from backing store (swap space) until the page is modified again.
+It works only with private anonymous pages (see
+.BR mmap (2)).
.SH RETURN VALUE
On success
.BR madvise ()
--
2.0.0
--
Kind regards,
Minchan Kim
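
For readers who want to see the semantics of the proposed text from userspace, here is a minimal sketch. It is not part of the patch. The function name madv_free_roundtrip is made up for illustration, and the fallback value 8 for MADV_FREE is an assumption for headers that do not define the flag yet. If the kernel rejects the hint, the range simply keeps its data, so the check still holds.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#ifndef MADV_FREE
#define MADV_FREE 8 /* assumption: Linux value, for headers that lack the flag */
#endif

/*
 * Hypothetical helper: fill a private anonymous mapping, mark it MADV_FREE,
 * then check the two promises of the proposed man-page text:
 *   1. a read sees either the old data or zeroes, never anything else;
 *   2. after a new write, the data stays.
 * Returns 1 if both hold, 0 if not, -1 if the mapping could not be created.
 */
int madv_free_roundtrip(size_t npages, unsigned char fill)
{
    long psz = sysconf(_SC_PAGESIZE);
    assert(psz > 0 && npages > 0 && fill != 0);
    size_t len = npages * (size_t)psz;

    unsigned char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;

    memset(p, fill, len);

    /* Only a hint: on failure the pages are simply kept as they are. */
    (void)madvise(p, len, MADV_FREE);

    volatile unsigned char *vp = p;
    int ok = 1;

    /* Reads do not cancel the lazy free, so each byte may be old or zero. */
    for (size_t i = 0; i < len; i++)
        if (vp[i] != fill && vp[i] != 0)
            ok = 0;

    /* A write cancels the lazy free for the pages it touches. */
    memset(p, fill, len);
    for (size_t i = 0; i < len; i++)
        if (vp[i] != fill)
            ok = 0;

    munmap(p, len);
    return ok;
}
```

Calling madv_free_roundtrip(4, 0xAB) should return 1 whether or not the kernel reclaimed any of the pages in between, which is the point: the application must not rely on either outcome until it writes again.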
On Fri 05-12-14 16:08:16, Minchan Kim wrote:
[...]
> From cfa212d4fb307ae772b08cf564cab7e6adb8f4fc Mon Sep 17 00:00:00 2001
> From: Minchan Kim <[email protected]>
> Date: Mon, 1 Dec 2014 08:53:55 +0900
> Subject: [PATCH] madvise.2: Document MADV_FREE
>
> Signed-off-by: Minchan Kim <[email protected]>
Reviewed-by: Michal Hocko <[email protected]>
Thanks!
> ---
> man2/madvise.2 | 12 ++++++++++++
> 1 file changed, 12 insertions(+)
>
> diff --git a/man2/madvise.2 b/man2/madvise.2
> index 032ead7..fc1aaca 100644
> --- a/man2/madvise.2
> +++ b/man2/madvise.2
> @@ -265,6 +265,18 @@ file (see
> .BR MADV_DODUMP " (since Linux 3.4)"
> Undo the effect of an earlier
> .BR MADV_DONTDUMP .
> +.TP
> +.BR MADV_FREE " (since Linux 3.19)"
> +Tell the kernel that contents in the specified address range are no
> +longer important and the range will be overwritten. When there is
> +demand for memory, the system will free pages associated with the
> +specified address range. In this instance, the next time a page in the
> +address range is referenced, it will contain all zeroes. Otherwise,
> +it will contain the data that was there prior to the MADV_FREE call.
> +References made to the address range will not make the system read
> +from backing store (swap space) until the page is modified again.
> +It works only with private anonymous pages (see
> +.BR mmap (2)).
> .SH RETURN VALUE
> On success
> .BR madvise ()
> --
> 2.0.0
>
> --
> Kind regards,
> Minchan Kim
--
Michal Hocko
SUSE Labs
Hello Minchan (and Michal)
I did not see this patch until just now when Michael explicitly
mentioned it in another discussion because
(a) it was buried in an LKML thread that started on a topic
that was not about a man-pages patch.
(b) linux-man@ was not CCed.
When resubmitting this patch, could you please To:me and CC linux-man@
and give the mail a suitable subject line indicating a man-pages patch.
On 12/05/2014 09:32 AM, Michal Hocko wrote:
> On Fri 05-12-14 16:08:16, Minchan Kim wrote:
> [...]
>> From cfa212d4fb307ae772b08cf564cab7e6adb8f4fc Mon Sep 17 00:00:00 2001
>> From: Minchan Kim <[email protected]>
>> Date: Mon, 1 Dec 2014 08:53:55 +0900
>> Subject: [PATCH] madvise.2: Document MADV_FREE
>>
>> Signed-off-by: Minchan Kim <[email protected]>
>
> Reviewed-by: Michal Hocko <[email protected]>
>
> Thanks!
>
>> ---
>> man2/madvise.2 | 12 ++++++++++++
>> 1 file changed, 12 insertions(+)
>>
>> diff --git a/man2/madvise.2 b/man2/madvise.2
>> index 032ead7..fc1aaca 100644
>> --- a/man2/madvise.2
>> +++ b/man2/madvise.2
>> @@ -265,6 +265,18 @@ file (see
>> .BR MADV_DODUMP " (since Linux 3.4)"
>> Undo the effect of an earlier
>> .BR MADV_DONTDUMP .
>> +.TP
>> +.BR MADV_FREE " (since Linux 3.19)"
>> +Tell the kernel that contents in the specified address range are no
>> +longer important and the range will be overwritten. When there is
>> +demand for memory, the system will free pages associated with the
>> +specified address range. In this instance, the next time a page in the
>> +address range is referenced, it will contain all zeroes. Otherwise,
>> +it will contain the data that was there prior to the MADV_FREE call.
>> +References made to the address range will not make the system read
>> +from backing store (swap space) until the page is modified again.
>> +It works only with private anonymous pages (see
>> +.BR mmap (2)).
>> .SH RETURN VALUE
>> On success
>> .BR madvise ()
If I'm reading the conversation right, the initially proposed text
was from the BSD man page (which would be okay), but most of the
text above seems to have come straight from the page here:
http://www.lehman.cuny.edu/cgi-bin/man-cgi?madvise+3
Right?
Unfortunately, I don't think we can use that text. It's from the
Solaris man page as far as I can tell, and I doubt that it's
under a license that we can use.
If that's the case, we need to go back and come up with an
original text. It might draw inspiration from the Solaris page,
and take actual text from the BSD page (which is under a free
license), and it might also draw inspiration from Jon Corbet's
description at http://lwn.net/Articles/590991/.
Could you take another shot at this, please?
Thanks,
Michael
--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/
Hello, Michael
On Tue, Feb 03, 2015 at 05:39:24PM +0100, Michael Kerrisk (man-pages) wrote:
> Hello Minchan (and Michal)
>
> I did not see this patch until just now when Michael explicitly
> mentioned it in another discussion because
> (a) it was buried in an LMKL thread that started a topic
> that was not about a man-pages patch.
> (b) linux-man@ was not CCed.
Sorry about that.
>
> When resubmitting this patch, could you please To:me and CC linux-man@
> and give the mail a suitable subject line indicating a man-pages patch.
Sure.
>
> On 12/05/2014 09:32 AM, Michal Hocko wrote:
> > On Fri 05-12-14 16:08:16, Minchan Kim wrote:
> > [...]
> >> From cfa212d4fb307ae772b08cf564cab7e6adb8f4fc Mon Sep 17 00:00:00 2001
> >> From: Minchan Kim <[email protected]>
> >> Date: Mon, 1 Dec 2014 08:53:55 +0900
> >> Subject: [PATCH] madvise.2: Document MADV_FREE
> >>
> >> Signed-off-by: Minchan Kim <[email protected]>
> >
> > Reviewed-by: Michal Hocko <[email protected]>
> >
> > Thanks!
> >
> >> ---
> >> man2/madvise.2 | 12 ++++++++++++
> >> 1 file changed, 12 insertions(+)
> >>
> >> diff --git a/man2/madvise.2 b/man2/madvise.2
> >> index 032ead7..fc1aaca 100644
> >> --- a/man2/madvise.2
> >> +++ b/man2/madvise.2
> >> @@ -265,6 +265,18 @@ file (see
> >> .BR MADV_DODUMP " (since Linux 3.4)"
> >> Undo the effect of an earlier
> >> .BR MADV_DONTDUMP .
> >> +.TP
> >> +.BR MADV_FREE " (since Linux 3.19)"
> >> +Tell the kernel that contents in the specified address range are no
> >> +longer important and the range will be overwritten. When there is
> >> +demand for memory, the system will free pages associated with the
> >> +specified address range. In this instance, the next time a page in the
> >> +address range is referenced, it will contain all zeroes. Otherwise,
> >> +it will contain the data that was there prior to the MADV_FREE call.
> >> +References made to the address range will not make the system read
> >> +from backing store (swap space) until the page is modified again.
> >> +It works only with private anonymous pages (see
> >> +.BR mmap (2)).
> >> .SH RETURN VALUE
> >> On success
> >> .BR madvise ()
>
> If I'm reading the conversation right, the initially proposed text
> was from the BSD man page (which would be okay), but most of the
> text above seems to have come straight from the page here:
> http://www.lehman.cuny.edu/cgi-bin/man-cgi?madvise+3
>
> Right?
True. The Solaris man page was more straightforward and clearer than the BSD one.
>
> Unfortunately, I don't think we can use that text. It's from the
> Solaris man page as far as I can tell, and I doubt that it's
> under a license that we can use.
>
> If that's the case, we need to go back and come up with an
> original text. It might draw inspiration from the Solaris page,
> and take actual text from the BSD page (which is under a free
> license), and it might also draw inspiration from Jon Corbet's
> description at http://lwn.net/Articles/590991/.
>
> Could you take another shot this please!
No problem. I will test my essay writing skill.
Thanks.
>
> Thanks,
>
> Michael
>
>
>
> --
> Michael Kerrisk
> Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
> Linux/UNIX System Programming Training: http://man7.org/training/
--
Kind regards,
Minchan Kim
On Tue 03-02-15 17:39:24, Michael Kerrisk wrote:
[...]
> If I'm reading the conversation right, the initially proposed text
> was from the BSD man page (which would be okay), but most of the
> text above seems to have come straight from the page here:
> http://www.lehman.cuny.edu/cgi-bin/man-cgi?madvise+3
>
> Right?
>
> Unfortunately, I don't think we can use that text. It's from the
> Solaris man page as far as I can tell, and I doubt that it's
> under a license that we can use.
Ohh, I wasn't aware of that restriction and didn't notice anything on
the man page or at http://www.lehman.cuny.edu/cgi-bin/man-cgi.
But you are definitely right that it would be better not to use this
source. Sorry about that, I should have noticed that myself.
> If that's the case, we need to go back and come up with an
> original text. It might draw inspiration from the Solaris page,
> and take actual text from the BSD page (which is under a free
> license), and it might also draw inspiration from Jon Corbet's
> description at http://lwn.net/Articles/590991/.
>
> Could you take another shot this please!
Minchan is obviously working on one and I will review it once he is done
with it.
--
Michal Hocko
SUSE Labs
Hi Minchan,
Sorry to jump into this thread so late, and apologies if some of these issues
have been discussed before.
I'm interested in this patch, so I tried it here. I used a simple test with
jemalloc. Obviously this can improve performance when there is no memory
pressure. Did you try a setup with memory pressure?
In my test, jemalloc maps a 61G vma and uses about 32G of memory without
MADV_FREE. If MADV_FREE is enabled, jemalloc uses the whole 61G of memory because
madvise doesn't reclaim the unused memory. If I disable swap (tweaking your patch
slightly to make it work without swap), I get OOM. If swap is enabled, my
system is totally stalled because of swap activity. Without MADV_FREE,
everything is OK. Considering that we definitely don't want to waste too much
memory, a system under memory pressure is normal, so it sounds like MADV_FREE
will introduce big trouble here.
Did you think about moving the MADV_FREE pages to the head of the inactive LRU, so
they can be reclaimed easily?
Thanks,
Shaohua
Hi Shaohua,
On Thu, Feb 05, 2015 at 04:33:11PM -0800, Shaohua Li wrote:
>
> Hi Minchan,
>
> Sorry to jump in this thread so later, and if some issues are discussed before.
> I'm interesting in this patch, so tried it here. I use a simple test with
No problem at all. Interest always wins over ignorance.
> jemalloc. Obviously this can improve performance when there is no memory
> pressure. Did you try setup with memory pressure?
Sure but it was not a huge memory system like yours.
>
> In my test, jemalloc will map 61G vma, and use about 32G memory without
> MADV_FREE. If MADV_FREE is enabled, jemalloc will use whole 61G memory because
> madvise doesn't reclaim the unused memory. If I disable swap (tweak your patch
Yes, IIUC, jemalloc replaces MADV_DONTNEED with MADV_FREE completely.
> slightly to make it work without swap), I got oom. If swap is enabled, my
You mean you modified the anon aging logic so that it works even though there is no swap?
If so, I have no idea why the OOM happens. I would guess it should free all of the freeable
pages during aging, so although the system may stall more, I don't expect
OOM. Anyway, for MADV_FREE with no swap, we should consider more things
about anonymous aging.
> system is totally stalled because of swap activity. Without the MADV_FREE,
> everything is ok. Considering we definitely don't want to waste too much
> memory, a system with memory pressure is normal, so sounds MADV_FREE will
> introduce big trouble here.
>
> Did you think about move the MADV_FREE pages to the head of inactive LRU, so
> they can be reclaimed easily?
I think it's desirable if the page lives on the active LRU.
The reason I didn't do that is the volatile ranges system call, which
was the motivation for MADV_FREE in my mind.
At the last LSF/MM, there was concern about the data's hotness.
Some users want to keep a page at its current LRU position, while others want to
treat it as cold (tail of inactive list), warm (head of inactive list), or
hot (head of active list), for example.
The vrange syscall was only about volatility and did not depend on page hotness,
so my decision was not to change the LRU order and to add a new
hotness advice later if we need it.
However, MADV_FREE's main customers are allocators and, afaik, they want
to replace MADV_DONTNEED with MADV_FREE, so I think such pages are really cold.
But we can't be sure, so the head of the inactive list is a good compromise.
Another concern about the tail of the inactive list is that there could be
plenty of pages there that were written back asynchronously in a
previous reclaim pass but not yet reclaimed, because they could not be
freed in the softirq context of writeback. It means we would end up
freeing pages that could still become part of the working set ahead of
pages the VM has already decided to evict.
In summary, I like your suggestion.
Thanks.
--
Kind regards,
Minchan Kim
On Thu 05-02-15 16:33:11, Shaohua Li wrote:
[...]
> Did you think about move the MADV_FREE pages to the head of inactive LRU, so
> they can be reclaimed easily?
Yes this makes sense for pages living on the active LRU list. I would
preserve LRU ordering on the inactive list because there is no good
reason to make the operation more costly for inactive pages. On the
other hand having tons of to-be-freed pages on the active list clearly
sucks. Care to send a patch?
--
Michal Hocko
SUSE Labs
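
The LRU placement being discussed can be pictured with a toy userspace model. This is only an illustration of the idea (move a lazily freed page from the active list to the head of the inactive list, but leave pages that are already inactive where they are). It is not kernel code, and struct lru_list, LRU_CAP and mark_lazyfree are hypothetical names invented for the sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define LRU_CAP 16

/* Toy list: ids[0] is the head, ids[n - 1] is the tail (reclaimed first). */
struct lru_list {
    int ids[LRU_CAP];
    size_t n;
};

/* Return the index of id in the list, or -1 if it is not there. */
static int lru_find(const struct lru_list *l, int id)
{
    for (size_t i = 0; i < l->n; i++)
        if (l->ids[i] == id)
            return (int)i;
    return -1;
}

/* Remove the entry at index i, closing the gap. */
static void lru_remove_at(struct lru_list *l, size_t i)
{
    assert(i < l->n);
    memmove(&l->ids[i], &l->ids[i + 1], (l->n - i - 1) * sizeof(l->ids[0]));
    l->n--;
}

/* Insert id at the head, shifting everything else toward the tail. */
static void lru_push_head(struct lru_list *l, int id)
{
    assert(l->n < LRU_CAP);
    memmove(&l->ids[1], &l->ids[0], l->n * sizeof(l->ids[0]));
    l->ids[0] = id;
    l->n++;
}

/*
 * Model of the proposal: a page that gets MADV_FREE while on the active
 * list moves to the head of the inactive list. A page that is already
 * inactive keeps its position, so repeated calls cost nothing extra.
 * Returns 1 if the page moved, 0 if it stayed, -1 if inactive is full.
 */
int mark_lazyfree(struct lru_list *active, struct lru_list *inactive, int id)
{
    int i = lru_find(active, id);
    if (i < 0)
        return 0;
    if (inactive->n == LRU_CAP)
        return -1;
    lru_remove_at(active, (size_t)i);
    lru_push_head(inactive, id);
    return 1;
}
```

In this model, a page that is marked repeatedly is moved only on the first call; later calls find it on the inactive list and return 0 without touching the ordering.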
On Fri, Feb 06, 2015 at 02:51:03PM +0900, Minchan Kim wrote:
> Hi Shaohua,
>
> On Thu, Feb 05, 2015 at 04:33:11PM -0800, Shaohua Li wrote:
> >
> > Hi Minchan,
> >
> > Sorry to jump in this thread so later, and if some issues are discussed before.
> > I'm interesting in this patch, so tried it here. I use a simple test with
>
> No problem at all. Interest is always win over ignorance.
>
> > jemalloc. Obviously this can improve performance when there is no memory
> > pressure. Did you try setup with memory pressure?
>
> Sure but it was not a huge memory system like yours.
Yes, I'd like to check the symptoms under memory pressure, so I chose such a test.
> > In my test, jemalloc will map 61G vma, and use about 32G memory without
> > MADV_FREE. If MADV_FREE is enabled, jemalloc will use whole 61G memory because
> > madvise doesn't reclaim the unused memory. If I disable swap (tweak your patch
>
> Yes, IIUC, jemalloc replaces MADV_DONTNEED with MADV_FREE completely.
right.
> > slightly to make it work without swap), I got oom. If swap is enabled, my
>
> You mean you modified anon aging logic so it works although there is no swap?
> If so, I have no idea why OOM happens. I guess it should free all of freeable
> pages during the aging so although system stall happens more, I don't expect
> OOM. Anyway, with MADV_FREE with no swap, we should consider more things
> about anonymous aging.
In the patch, MADV_FREE is disabled and falls back to DONTNEED if no swap is
enabled. Our production environment doesn't enable swap, so I tried deleting
the 'no swap' check to make MADV_FREE always enabled regardless of whether swap is
enabled. I didn't change anything else. With that change, I saw OOM
immediately. So we definitely have an aging issue: the pages aren't reclaimed
fast enough.
> > system is totally stalled because of swap activity. Without the MADV_FREE,
> > everything is ok. Considering we definitely don't want to waste too much
> > memory, a system with memory pressure is normal, so sounds MADV_FREE will
> > introduce big trouble here.
> >
> > Did you think about move the MADV_FREE pages to the head of inactive LRU, so
> > they can be reclaimed easily?
>
> I think it's desirable if the page lived in active LRU.
> The reason I didn't that was caused by volatile ranges system call which
> was motivaion for MADV_FREE in my mind.
> In last LSF/MM, there was concern about data's hotness.
> Some of users want to keep that as it is in LRU position, others want to
> handle that as cold(tail of inactive list)/warm(head of inactive list)/
> hot(head of active list), for example.
> The vrange syscall was just about volatiltiy, not depends on page hotness
> so the decision on my head was not to change LRU order and let's make new
> hotness advise if we need it later.
>
> However, MADV_FREE's main customer is allocators and afaik, they want
> to replace MADV_DONTNEED with MADV_FREE so I think it is really cold,
> but we couldn't make sure so head of inactive is good compromise.
> Another concern about tail of inactive list is that there could be
> plenty of pages in there, which was asynchromos write-backed in
> previous reclaim path, not-yet reclaimed because of not being able
> to free the in softirq context of writeback. It means we ends up
> freeing more potential pages to become workingset in advance
> than pages VM already decided to evict.
Yes, they are definitely cold pages. I think we should make sure the
MADV_FREE pages are reclaimed before other pages, at least in the anon
LRU list, though it might be difficult to determine whether we should reclaim
writeback pages first or MADV_FREE pages first.
Thanks,
Shaohua
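
The fallback mentioned above happens inside the kernel patch. An allocator can apply a similar policy from userspace, and the following is a rough sketch of that under stated assumptions. It is not how jemalloc does it. purge_range is a hypothetical name, and the value 8 for MADV_FREE is an assumption for headers that do not define the flag.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

#ifndef MADV_FREE
#define MADV_FREE 8 /* assumption: Linux value, for headers that lack the flag */
#endif

/*
 * Hypothetical allocator helper: prefer the lazy MADV_FREE and fall back
 * to the immediate MADV_DONTNEED if the kernel rejects the lazy hint.
 * Returns the advice value that was applied, or -1 if both calls failed.
 */
int purge_range(void *addr, size_t len)
{
    assert(addr != NULL && len > 0);

    /* Lazy path: pages are freed only when the kernel needs memory. */
    if (madvise(addr, len, MADV_FREE) == 0)
        return MADV_FREE;

    /* Eager path: pages are dropped now, next touch is zero-filled. */
    if (madvise(addr, len, MADV_DONTNEED) == 0)
        return MADV_DONTNEED;

    return -1;
}
```

The return value lets the caller account for which path was taken, for example to decide whether the range still counts against its resident memory.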
On Fri, Feb 06, 2015 at 01:58:25PM +0100, Michal Hocko wrote:
> On Thu 05-02-15 16:33:11, Shaohua Li wrote:
> [...]
> > Did you think about move the MADV_FREE pages to the head of inactive LRU, so
> > they can be reclaimed easily?
>
> Yes this makes sense for pages living on the active LRU list. I would
> preserve LRU ordering on the inactive list because there is no good
> reason to make the operation more costly for inactive pages. On the
> other hand having tons of to-be-freed pages on the active list clearly
> sucks. Care to send a patch?
Considering that anon pages start out on the active LRU, it's likely that MADV_FREE
pages are on the active list. I'm curious why you would preserve the order of the
inactive list. The app knows which pages are cold, so why not take advantage of
that? I'll play with the patch more to see what I can do.
Thanks,
Shaohua
On 02/06/2015 01:32 PM, Shaohua Li wrote:
> On Fri, Feb 06, 2015 at 01:58:25PM +0100, Michal Hocko wrote:
>> On Thu 05-02-15 16:33:11, Shaohua Li wrote: [...]
>>> Did you think about move the MADV_FREE pages to the head of
>>> inactive LRU, so they can be reclaimed easily?
>>
>> Yes this makes sense for pages living on the active LRU list. I
>> would preserve LRU ordering on the inactive list because there is
>> no good reason to make the operation more costly for inactive
>> pages. On the other hand having tons of to-be-freed pages on the
>> active list clearly sucks. Care to send a patch?
>
> Considering anon pages are in active LRU first, it's likely
> MADV_FREE pages are in active list. I'm curious why preserves the
> order of inactive list.
Only before the first time MADV_FREE is called on those pages.
If a program repeatedly allocates and frees the same memory
region, not moving the MADV_FREE pages around in the LRU
several times can save some overhead.
--
All rights reversed
On Fri, Feb 06, 2015 at 10:29:18AM -0800, Shaohua Li wrote:
> On Fri, Feb 06, 2015 at 02:51:03PM +0900, Minchan Kim wrote:
> > Hi Shaohua,
> >
> > On Thu, Feb 05, 2015 at 04:33:11PM -0800, Shaohua Li wrote:
> > >
> > > Hi Minchan,
> > >
> > > Sorry to jump in this thread so later, and if some issues are discussed before.
> > > I'm interesting in this patch, so tried it here. I use a simple test with
> >
> > No problem at all. Interest is always win over ignorance.
> >
> > > jemalloc. Obviously this can improve performance when there is no memory
> > > pressure. Did you try setup with memory pressure?
> >
> > Sure but it was not a huge memory system like yours.
>
> Yes, I'd like to check the symptom in memory pressure, so choose such test.
>
> > > In my test, jemalloc will map 61G vma, and use about 32G memory without
> > > MADV_FREE. If MADV_FREE is enabled, jemalloc will use whole 61G memory because
> > > madvise doesn't reclaim the unused memory. If I disable swap (tweak your patch
> >
> > Yes, IIUC, jemalloc replaces MADV_DONTNEED with MADV_FREE completely.
>
> right.
> > > slightly to make it work without swap), I got oom. If swap is enabled, my
> >
> > You mean you modified anon aging logic so it works although there is no swap?
> > If so, I have no idea why OOM happens. I guess it should free all of freeable
> > pages during the aging so although system stall happens more, I don't expect
> > OOM. Anyway, with MADV_FREE with no swap, we should consider more things
> > about anonymous aging.
>
> In the patch, MADV_FREE will be disabled and fallback to DONTNEED if no swap is
> enabled. Our production environment doesn't enable swap, so I tried to delete
> the 'no swap' check and make MADV_FREE always enabled regardless if swap is
> enabled. I didn't change anything else. With such change, I saw oom
> immediately. So definitely we have aging issue, the pages aren't reclaimed
> fast.
In the current VM implementation, the anonymous LRU list is not aged when
there is no swap. That is why the patch frees the pages instantly in that case.
I think it could be enhanced later.

http://lists.infradead.org/pipermail/linux-arm-kernel/2014-December/311591.html
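For readers following along from userspace: the fallback described above happens inside the kernel, but an allocator that may run on kernels without MADV_FREE needs a similar fallback of its own. A minimal sketch, assuming a hypothetical helper name (release_range) and assuming that an unsupported advice value is reported as EINVAL; it is not taken from the patch or from jemalloc:

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE 1   /* for madvise() and MAP_ANONYMOUS under strict -std */
#endif
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>

#ifndef MADV_FREE
#define MADV_FREE 8         /* Linux value, defined here for older headers */
#endif

/*
 * Hypothetical helper: hint that [addr, addr + len) is no longer needed.
 * Tries the lazy MADV_FREE first and falls back to MADV_DONTNEED when the
 * kernel rejects the advice value.
 * Returns the advice that was applied, or -1 on any other error.
 */
static int release_range(void *addr, size_t len)
{
    if (madvise(addr, len, MADV_FREE) == 0)
        return MADV_FREE;
    if (errno != EINVAL)
        return -1;
    if (madvise(addr, len, MADV_DONTNEED) == 0)
        return MADV_DONTNEED;
    return -1;
}
```

The caller cannot tell from the return value whether the kernel itself treated MADV_FREE like MADV_DONTNEED (as this patch does on a swapless system); the helper only covers the case where the advice value is rejected outright.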
>
> > > system is totally stalled because of swap activity. Without the MADV_FREE,
> > > everything is ok. Considering we definitely don't want to waste too much
> > > memory, a system with memory pressure is normal, so sounds MADV_FREE will
> > > introduce big trouble here.
> > >
> > > Did you think about move the MADV_FREE pages to the head of inactive LRU, so
> > > they can be reclaimed easily?
> >
> > I think it's desirable if the page lived in active LRU.
> > The reason I didn't that was caused by volatile ranges system call which
> > was motivaion for MADV_FREE in my mind.
> > In last LSF/MM, there was concern about data's hotness.
> > Some of users want to keep that as it is in LRU position, others want to
> > handle that as cold(tail of inactive list)/warm(head of inactive list)/
> > hot(head of active list), for example.
> > The vrange syscall was just about volatiltiy, not depends on page hotness
> > so the decision on my head was not to change LRU order and let's make new
> > hotness advise if we need it later.
> >
> > However, MADV_FREE's main customer is allocators and afaik, they want
> > to replace MADV_DONTNEED with MADV_FREE so I think it is really cold,
> > but we couldn't make sure so head of inactive is good compromise.
> > Another concern about tail of inactive list is that there could be
> > plenty of pages in there, which was asynchromos write-backed in
> > previous reclaim path, not-yet reclaimed because of not being able
> > to free the in softirq context of writeback. It means we ends up
> > freeing more potential pages to become workingset in advance
> > than pages VM already decided to evict.
>
> Yes, they are definitely cold pages. I thought We should make sure the
> MADV_FREE pages are reclaimed first before other pages, at least in the anon
> LRU list, though there might be difficult to determine if we should reclaim
> writeback pages first or MADV_FREE pages first.
Frankly speaking, the issue with writeback pages is just an implementation
hurdle, not a design problem, so if we could fix it, we might move
cold pages to the tail of the inactive LRU. I tried it but haven't had
time to continue these days. I hope to find time to look at it soon.

https://lkml.org/lkml/2014/7/1/628

Even if we couldn't fix the problem with writeback pages, it wouldn't be
critical: they are all cold pages already, so keeping their order in the
LRU might not matter. We could then save the working set and the VM's
reclaim effort at the cost of moving all of the hinted pages to the tail
of the LRU whenever the syscall is called.

However, the significant problem in my mind is that we can't be
sure they are really cold pages. It would be true for allocators,
but those are cache-friendly pages, so it might be better to discard
the tail pages of the inactive LRU, which are really cold.
In addition, we can't anticipate every use case for MADV_FREE,
so some users might want to treat them as warm, not cold.

If we still see a lot of stalls after moving them to the head of the
inactive list, I think it's a sign to add other logic. For example,
we could drop MADV_FREEed pages instantly if the zone is below the
low or min watermark when the syscall is called, because nobody
likes direct reclaim.
>
> Thanks,
> Shaohua
--
Kind regards,
Minchan Kim
On Mon, Feb 09, 2015 at 04:15:53PM +0900, Minchan Kim wrote:
> On Fri, Feb 06, 2015 at 10:29:18AM -0800, Shaohua Li wrote:
> > On Fri, Feb 06, 2015 at 02:51:03PM +0900, Minchan Kim wrote:
> > > Hi Shaohua,
> > >
> > > On Thu, Feb 05, 2015 at 04:33:11PM -0800, Shaohua Li wrote:
> > > >
> > > > Hi Minchan,
> > > >
> > > > Sorry to jump in this thread so later, and if some issues are discussed before.
> > > > I'm interesting in this patch, so tried it here. I use a simple test with
> > >
> > > No problem at all. Interest is always win over ignorance.
> > >
> > > > jemalloc. Obviously this can improve performance when there is no memory
> > > > pressure. Did you try setup with memory pressure?
> > >
> > > Sure but it was not a huge memory system like yours.
> >
> > Yes, I'd like to check the symptom in memory pressure, so choose such test.
> >
> > > > In my test, jemalloc will map 61G vma, and use about 32G memory without
> > > > MADV_FREE. If MADV_FREE is enabled, jemalloc will use whole 61G memory because
> > > > madvise doesn't reclaim the unused memory. If I disable swap (tweak your patch
> > >
> > > Yes, IIUC, jemalloc replaces MADV_DONTNEED with MADV_FREE completely.
> >
> > right.
> > > > slightly to make it work without swap), I got oom. If swap is enabled, my
> > >
> > > You mean you modified anon aging logic so it works although there is no swap?
> > > If so, I have no idea why OOM happens. I guess it should free all of freeable
> > > pages during the aging so although system stall happens more, I don't expect
> > > OOM. Anyway, with MADV_FREE with no swap, we should consider more things
> > > about anonymous aging.
> >
> > In the patch, MADV_FREE will be disabled and fallback to DONTNEED if no swap is
> > enabled. Our production environment doesn't enable swap, so I tried to delete
> > the 'no swap' check and make MADV_FREE always enabled regardless if swap is
> > enabled. I didn't change anything else. With such change, I saw oom
> > immediately. So definitely we have aging issue, the pages aren't reclaimed
> > fast.
>
> In current VM implementation, it doesn't age anonymous LRU list if we have no
> swap. That's the reason to drop freeing pages instantly.
> I think it could be enhanced later.
> http://lists.infradead.org/pipermail/linux-arm-kernel/2014-December/311591.html
>
> >
> > > > system is totally stalled because of swap activity. Without the MADV_FREE,
> > > > everything is ok. Considering we definitely don't want to waste too much
> > > > memory, a system with memory pressure is normal, so sounds MADV_FREE will
> > > > introduce big trouble here.
> > > >
> > > > Did you think about move the MADV_FREE pages to the head of inactive LRU, so
> > > > they can be reclaimed easily?
> > >
> > > I think it's desirable if the page lived in active LRU.
> > > The reason I didn't that was caused by volatile ranges system call which
> > > was motivaion for MADV_FREE in my mind.
> > > In last LSF/MM, there was concern about data's hotness.
> > > Some of users want to keep that as it is in LRU position, others want to
> > > handle that as cold(tail of inactive list)/warm(head of inactive list)/
> > > hot(head of active list), for example.
> > > The vrange syscall was just about volatiltiy, not depends on page hotness
> > > so the decision on my head was not to change LRU order and let's make new
> > > hotness advise if we need it later.
> > >
> > > However, MADV_FREE's main customer is allocators and afaik, they want
> > > to replace MADV_DONTNEED with MADV_FREE so I think it is really cold,
> > > but we couldn't make sure so head of inactive is good compromise.
> > > Another concern about tail of inactive list is that there could be
> > > plenty of pages in there, which was asynchromos write-backed in
> > > previous reclaim path, not-yet reclaimed because of not being able
> > > to free the in softirq context of writeback. It means we ends up
> > > freeing more potential pages to become workingset in advance
> > > than pages VM already decided to evict.
> >
> > Yes, they are definitely cold pages. I thought We should make sure the
> > MADV_FREE pages are reclaimed first before other pages, at least in the anon
> > LRU list, though there might be difficult to determine if we should reclaim
> > writeback pages first or MADV_FREE pages first.
>
> Frankly speaking, the issue with writeback page is just hurdle of
> implementation, not design so if we could fix it, we might move
> cold pages into tail of the inactive LRU. I tried it but don't have
> time slot to continue these days. Hope to get a time to look soon.
> https://lkml.org/lkml/2014/7/1/628
> Even, it wouldn't be critical problem although we couldn't fix
> the problem of writeback pages because they are already all
> cold pages so it might be not important to keep order in LRU so
> we could save working set and effort of VM to reclaim them
> at the cost of moving all of hinting pages into tail of the LRU
> whenever the syscall is called.
>
> However, significant problem from my mind is we couldn't make
> sure they are really cold pages. It would be true for allocators
> but it's cache-friendly pages so it might be better to discard
> tail pages of inactive LRU, which are really cold.
> In addition, we couldn't expect all of usecase for MADV_FREE
> so some of users might want to treat them as warm, not cold.
>
> With moving them into inactive list's head, if we still see
> a lot stall, I think it's a sign to add other logic, for example,
> we could drop MADV_FREEed pages instantly if the zone is below
> low min watermark when the syscall is called. Because everybody
> doesn't like direct reclaim.
So I tried moving the MADV_FREE pages to the inactive list head or tail. It
helps a little, but there are still stalls/OOM. kswapd isn't fast enough to
free the pages, so the app enters direct reclaim frequently. On one machine, no
swap is triggered, but MADV_FREE is 5x slower than MADV_DONTNEED. On another
machine, MADV_FREE triggers a lot of swap and sometimes even OOM. The app
enters direct reclaim and sees a lot of lock contention because of excessive
direct reclaim, so there are a lot of stalls.
Thanks,
Shaohua
Hi Shaohua,
On Tue, Feb 10, 2015 at 02:38:26PM -0800, Shaohua Li wrote:
> On Mon, Feb 09, 2015 at 04:15:53PM +0900, Minchan Kim wrote:
> > On Fri, Feb 06, 2015 at 10:29:18AM -0800, Shaohua Li wrote:
> > > On Fri, Feb 06, 2015 at 02:51:03PM +0900, Minchan Kim wrote:
> > > > Hi Shaohua,
> > > >
> > > > On Thu, Feb 05, 2015 at 04:33:11PM -0800, Shaohua Li wrote:
> > > > >
> > > > > Hi Minchan,
> > > > >
> > > > > Sorry to jump in this thread so later, and if some issues are discussed before.
> > > > > I'm interesting in this patch, so tried it here. I use a simple test with
> > > >
> > > > No problem at all. Interest is always win over ignorance.
> > > >
> > > > > jemalloc. Obviously this can improve performance when there is no memory
> > > > > pressure. Did you try setup with memory pressure?
> > > >
> > > > Sure but it was not a huge memory system like yours.
> > >
> > > Yes, I'd like to check the symptom in memory pressure, so choose such test.
> > >
> > > > > In my test, jemalloc will map 61G vma, and use about 32G memory without
> > > > > MADV_FREE. If MADV_FREE is enabled, jemalloc will use whole 61G memory because
> > > > > madvise doesn't reclaim the unused memory. If I disable swap (tweak your patch
> > > >
> > > > Yes, IIUC, jemalloc replaces MADV_DONTNEED with MADV_FREE completely.
> > >
> > > right.
> > > > > slightly to make it work without swap), I got oom. If swap is enabled, my
> > > >
> > > > You mean you modified anon aging logic so it works although there is no swap?
> > > > If so, I have no idea why OOM happens. I guess it should free all of freeable
> > > > pages during the aging so although system stall happens more, I don't expect
> > > > OOM. Anyway, with MADV_FREE with no swap, we should consider more things
> > > > about anonymous aging.
> > >
> > > In the patch, MADV_FREE will be disabled and fallback to DONTNEED if no swap is
> > > enabled. Our production environment doesn't enable swap, so I tried to delete
> > > the 'no swap' check and make MADV_FREE always enabled regardless if swap is
> > > enabled. I didn't change anything else. With such change, I saw oom
> > > immediately. So definitely we have aging issue, the pages aren't reclaimed
> > > fast.
> >
> > In current VM implementation, it doesn't age anonymous LRU list if we have no
> > swap. That's the reason to drop freeing pages instantly.
> > I think it could be enhanced later.
> > http://lists.infradead.org/pipermail/linux-arm-kernel/2014-December/311591.html
> >
> > >
> > > > > system is totally stalled because of swap activity. Without the MADV_FREE,
> > > > > everything is ok. Considering we definitely don't want to waste too much
> > > > > memory, a system with memory pressure is normal, so sounds MADV_FREE will
> > > > > introduce big trouble here.
> > > > >
> > > > > Did you think about move the MADV_FREE pages to the head of inactive LRU, so
> > > > > they can be reclaimed easily?
> > > >
> > > > I think it's desirable if the page lived in active LRU.
> > > > The reason I didn't that was caused by volatile ranges system call which
> > > > was motivaion for MADV_FREE in my mind.
> > > > In last LSF/MM, there was concern about data's hotness.
> > > > Some of users want to keep that as it is in LRU position, others want to
> > > > handle that as cold(tail of inactive list)/warm(head of inactive list)/
> > > > hot(head of active list), for example.
> > > > The vrange syscall was just about volatiltiy, not depends on page hotness
> > > > so the decision on my head was not to change LRU order and let's make new
> > > > hotness advise if we need it later.
> > > >
> > > > However, MADV_FREE's main customer is allocators and afaik, they want
> > > > to replace MADV_DONTNEED with MADV_FREE so I think it is really cold,
> > > > but we couldn't make sure so head of inactive is good compromise.
> > > > Another concern about tail of inactive list is that there could be
> > > > plenty of pages in there, which was asynchromos write-backed in
> > > > previous reclaim path, not-yet reclaimed because of not being able
> > > > to free the in softirq context of writeback. It means we ends up
> > > > freeing more potential pages to become workingset in advance
> > > > than pages VM already decided to evict.
> > >
> > > Yes, they are definitely cold pages. I thought We should make sure the
> > > MADV_FREE pages are reclaimed first before other pages, at least in the anon
> > > LRU list, though there might be difficult to determine if we should reclaim
> > > writeback pages first or MADV_FREE pages first.
> >
> > Frankly speaking, the issue with writeback page is just hurdle of
> > implementation, not design so if we could fix it, we might move
> > cold pages into tail of the inactive LRU. I tried it but don't have
> > time slot to continue these days. Hope to get a time to look soon.
> > https://lkml.org/lkml/2014/7/1/628
> > Even, it wouldn't be critical problem although we couldn't fix
> > the problem of writeback pages because they are already all
> > cold pages so it might be not important to keep order in LRU so
> > we could save working set and effort of VM to reclaim them
> > at the cost of moving all of hinting pages into tail of the LRU
> > whenever the syscall is called.
> >
> > However, significant problem from my mind is we couldn't make
> > sure they are really cold pages. It would be true for allocators
> > but it's cache-friendly pages so it might be better to discard
> > tail pages of inactive LRU, which are really cold.
> > In addition, we couldn't expect all of usecase for MADV_FREE
> > so some of users might want to treat them as warm, not cold.
> >
> > With moving them into inactive list's head, if we still see
> > a lot stall, I think it's a sign to add other logic, for example,
> > we could drop MADV_FREEed pages instantly if the zone is below
> > low min watermark when the syscall is called. Because everybody
> > doesn't like direct reclaim.
>
> So I tried move the MADV_FREE pages to inactive list head or tail. It helps a
> little. But there are still stalls/oom. kswapd isn't fast enough to free the
> pages, App enters direct reclaim frequently. In one machine, no swap trigger,
> but MADV_FREE is 5x slower than MADV_DONTNEED. In another machine, MADV_FREE
That's expected. MADV_DONTNEED and MADV_FREE are really different.
MADV_DONTNEED is a self-sacrifice for others in the system, while MADV_FREE is
a greedy approach for the caller itself, because a random process asking for
memory could enter direct reclaim.

However, as I said earlier, we could mitigate the problem by checking
min_free_kbytes. If free memory in the system is under min_free_kbytes, it is
pointless to impose reclaim overhead for hinted pages, because the hint means
"please free these when you are in trouble with memory" and we already know
we are in trouble.

I tested the patch below on my machine (3G RAM, 12 CPUs, 8G swap) with this test:
test: 12 processes (each process does 5 iterations: mmap 512M + memset + madvise)

1. MADV_DONTNEED: 41.884s, sys: 3m4.552s
2. MADV_FREE: 1m28s, sys: 5m23s
3. MADV_FREE + the patch below: 37.188s, sys: 2m20s
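
The test program itself was not posted. A rough sketch of the workload as described (N processes, each doing a few rounds of mmap + memset + madvise) might look like the following. The function names are hypothetical, and whether the original test unmapped each region between iterations is not stated; this sketch assumes it did not:

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE 1   /* for madvise() and MAP_ANONYMOUS under strict -std */
#endif
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#ifndef MADV_FREE
#define MADV_FREE 8         /* Linux value, defined here for older headers */
#endif

/* One worker: map, dirty, and hint a region of `len` bytes, `iters` times. */
static int worker(int iters, size_t len, int advice)
{
    for (int i = 0; i < iters; i++) {
        char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED)
            return 1;
        memset(p, 1, len);              /* touch every page */
        if (madvise(p, len, advice) != 0)
            return 1;
        /* Assumption: the region stays mapped until the worker exits. */
    }
    return 0;
}

/* Fork `nproc` workers and return how many of them failed. */
static int run_workers(int nproc, int iters, size_t len, int advice)
{
    int started = 0, failed = 0;

    for (int i = 0; i < nproc; i++) {
        pid_t pid = fork();
        if (pid == 0)
            _exit(worker(iters, len, advice));
        if (pid < 0)
            failed++;
        else
            started++;
    }
    for (int i = 0; i < started; i++) {
        int status;
        if (wait(&status) < 0 || !WIFEXITED(status) ||
            WEXITSTATUS(status) != 0)
            failed++;
    }
    return failed;
}
```

The numbers above would correspond to something like run_workers(12, 5, 512UL << 20, MADV_FREE), wrapped in a main() and run under time(1), on a machine where 12 x 512M exceeds RAM.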
Could you test?
diff --git a/mm/madvise.c b/mm/madvise.c
index 6d0fcb8..da15f8f 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -523,7 +523,7 @@ madvise_vma(struct vm_area_struct *vma, struct vm_area_struct **prev,
* XXX: In this implementation, MADV_FREE works like
* MADV_DONTNEED on swapless system or full swap.
*/
- if (get_nr_swap_pages() > 0)
+ if (get_nr_swap_pages() > 0 && min_free_kbytes < nr_free_pages())
return madvise_free(vma, prev, start, end);
/* passthrough */
case MADV_DONTNEED:
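
One thing to keep in mind when reading the check above: min_free_kbytes is expressed in KiB, while nr_free_pages() returns a page count, so the two sides of the comparison are in different units. This is an observation about the posted diff, not something raised in the thread. A unit-consistent version of the predicate could be modelled in plain C as below; the function names and the page_shift parameter are illustrative assumptions, not kernel API:

```c
#include <stdbool.h>

/*
 * Convert a threshold in KiB to a page count.
 * page_shift is log2 of the page size in bytes (12 for 4K pages) and is
 * assumed to be at least 10.
 */
static unsigned long kbytes_to_pages(unsigned long kbytes,
                                     unsigned int page_shift)
{
    return kbytes >> (page_shift - 10);
}

/*
 * Model of the throttle: use lazy freeing only when swap space is left
 * and the number of free pages is above the min_free_kbytes threshold,
 * with both sides of the comparison expressed in pages.
 */
static bool should_lazy_free(long nr_swap_pages, unsigned long free_pages,
                             unsigned long min_free_kbytes,
                             unsigned int page_shift)
{
    return nr_swap_pages > 0 &&
           kbytes_to_pages(min_free_kbytes, page_shift) < free_pages;
}
```

With 4K pages, comparing the raw KiB value against a page count makes the threshold four times higher than min_free_kbytes suggests, so the fallback to MADV_DONTNEED would start earlier than the description implies.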
> trigers a lot of swap and sometimes even oom. app enters direct reclaim and has
> a lot of lock contention because of excessive direct reclaim, so there are a
> lot of stalls.
>
> Thanks,
> Shaohua
--
Kind regards,
Minchan Kim
On Wed, Feb 11, 2015 at 09:56:20AM +0900, Minchan Kim wrote:
> Hi Shaohua,
>
> On Tue, Feb 10, 2015 at 02:38:26PM -0800, Shaohua Li wrote:
> > On Mon, Feb 09, 2015 at 04:15:53PM +0900, Minchan Kim wrote:
> > > On Fri, Feb 06, 2015 at 10:29:18AM -0800, Shaohua Li wrote:
> > > > On Fri, Feb 06, 2015 at 02:51:03PM +0900, Minchan Kim wrote:
> > > > > Hi Shaohua,
> > > > >
> > > > > On Thu, Feb 05, 2015 at 04:33:11PM -0800, Shaohua Li wrote:
> > > > > >
> > > > > > Hi Minchan,
> > > > > >
> > > > > > Sorry to jump in this thread so later, and if some issues are discussed before.
> > > > > > I'm interesting in this patch, so tried it here. I use a simple test with
> > > > >
> > > > > No problem at all. Interest is always win over ignorance.
> > > > >
> > > > > > jemalloc. Obviously this can improve performance when there is no memory
> > > > > > pressure. Did you try setup with memory pressure?
> > > > >
> > > > > Sure but it was not a huge memory system like yours.
> > > >
> > > > Yes, I'd like to check the symptom in memory pressure, so choose such test.
> > > >
> > > > > > In my test, jemalloc will map 61G vma, and use about 32G memory without
> > > > > > MADV_FREE. If MADV_FREE is enabled, jemalloc will use whole 61G memory because
> > > > > > madvise doesn't reclaim the unused memory. If I disable swap (tweak your patch
> > > > >
> > > > > Yes, IIUC, jemalloc replaces MADV_DONTNEED with MADV_FREE completely.
> > > >
> > > > right.
> > > > > > slightly to make it work without swap), I got oom. If swap is enabled, my
> > > > >
> > > > > You mean you modified anon aging logic so it works although there is no swap?
> > > > > If so, I have no idea why OOM happens. I guess it should free all of freeable
> > > > > pages during the aging so although system stall happens more, I don't expect
> > > > > OOM. Anyway, with MADV_FREE with no swap, we should consider more things
> > > > > about anonymous aging.
> > > >
> > > > In the patch, MADV_FREE will be disabled and fallback to DONTNEED if no swap is
> > > > enabled. Our production environment doesn't enable swap, so I tried to delete
> > > > the 'no swap' check and make MADV_FREE always enabled regardless if swap is
> > > > enabled. I didn't change anything else. With such change, I saw oom
> > > > immediately. So definitely we have aging issue, the pages aren't reclaimed
> > > > fast.
> > >
> > > In current VM implementation, it doesn't age anonymous LRU list if we have no
> > > swap. That's the reason to drop freeing pages instantly.
> > > I think it could be enhanced later.
> > > http://lists.infradead.org/pipermail/linux-arm-kernel/2014-December/311591.html
> > >
> > > >
> > > > > > system is totally stalled because of swap activity. Without the MADV_FREE,
> > > > > > everything is ok. Considering we definitely don't want to waste too much
> > > > > > memory, a system with memory pressure is normal, so sounds MADV_FREE will
> > > > > > introduce big trouble here.
> > > > > >
> > > > > > Did you think about move the MADV_FREE pages to the head of inactive LRU, so
> > > > > > they can be reclaimed easily?
> > > > >
> > > > > I think it's desirable if the page lived in active LRU.
> > > > > The reason I didn't that was caused by volatile ranges system call which
> > > > > was motivaion for MADV_FREE in my mind.
> > > > > In last LSF/MM, there was concern about data's hotness.
> > > > > Some of users want to keep that as it is in LRU position, others want to
> > > > > handle that as cold(tail of inactive list)/warm(head of inactive list)/
> > > > > hot(head of active list), for example.
> > > > > The vrange syscall was just about volatiltiy, not depends on page hotness
> > > > > so the decision on my head was not to change LRU order and let's make new
> > > > > hotness advise if we need it later.
> > > > >
> > > > > However, MADV_FREE's main customer is allocators and afaik, they want
> > > > > to replace MADV_DONTNEED with MADV_FREE so I think it is really cold,
> > > > > but we couldn't make sure so head of inactive is good compromise.
> > > > > Another concern about tail of inactive list is that there could be
> > > > > plenty of pages in there, which was asynchromos write-backed in
> > > > > previous reclaim path, not-yet reclaimed because of not being able
> > > > > to free the in softirq context of writeback. It means we ends up
> > > > > freeing more potential pages to become workingset in advance
> > > > > than pages VM already decided to evict.
> > > >
> > > > Yes, they are definitely cold pages. I thought We should make sure the
> > > > MADV_FREE pages are reclaimed first before other pages, at least in the anon
> > > > LRU list, though there might be difficult to determine if we should reclaim
> > > > writeback pages first or MADV_FREE pages first.
> > >
> > > Frankly speaking, the issue with writeback page is just hurdle of
> > > implementation, not design so if we could fix it, we might move
> > > cold pages into tail of the inactive LRU. I tried it but don't have
> > > time slot to continue these days. Hope to get a time to look soon.
> > > https://lkml.org/lkml/2014/7/1/628
> > > Even, it wouldn't be critical problem although we couldn't fix
> > > the problem of writeback pages because they are already all
> > > cold pages so it might be not important to keep order in LRU so
> > > we could save working set and effort of VM to reclaim them
> > > at the cost of moving all of hinting pages into tail of the LRU
> > > whenever the syscall is called.
> > >
> > > However, significant problem from my mind is we couldn't make
> > > sure they are really cold pages. It would be true for allocators
> > > but it's cache-friendly pages so it might be better to discard
> > > tail pages of inactive LRU, which are really cold.
> > > In addition, we couldn't expect all of usecase for MADV_FREE
> > > so some of users might want to treat them as warm, not cold.
> > >
> > > With moving them into inactive list's head, if we still see
> > > a lot stall, I think it's a sign to add other logic, for example,
> > > we could drop MADV_FREEed pages instantly if the zone is below
> > > low min watermark when the syscall is called. Because everybody
> > > doesn't like direct reclaim.
> >
> > So I tried move the MADV_FREE pages to inactive list head or tail. It helps a
> > little. But there are still stalls/oom. kswapd isn't fast enough to free the
> > pages, App enters direct reclaim frequently. In one machine, no swap trigger,
> > but MADV_FREE is 5x slower than MADV_DONTNEED. In another machine, MADV_FREE
>
> It's expected. MADV_DONTNEED and MADV_FREE is really different.
> MADV_DONTNEED is self-sacrificy for others in the system while MADV_FREE is
> greedy approach for itself because random process asking the memory could
> enter direct reclaim.
> However, as I said earlier, we could mitigate the problem by checking
> min_free_kbytes. If memory in the system is under min_free_kbytes, it is
> pointless to impose reclaim overhead for hinted pages because we alreay
> know the hint is "please free when you are trouble with memory" and we got
> know it already.
>
> When I test below patch on my 3G machine + 12 CPU + 8G swap with below test
> test: 12 processes(each process does 5 iteration: mmap 512M + memset + madvise),
>
> 1. MADV_DONTNEED : 41.884sec, sys:3m4.552
> 2. MADV_FREE : 1m28sec, sys: 5m23
> 3. MADV_FREE + below patch : 37.188s, sys: 2m20
>
> Could you test?
>
> diff --git a/mm/madvise.c b/mm/madvise.c
> index 6d0fcb8..da15f8f 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -523,7 +523,7 @@ madvise_vma(struct vm_area_struct *vma, struct vm_area_struct **prev,
> * XXX: In this implementation, MADV_FREE works like
> * MADV_DONTNEED on swapless system or full swap.
> */
> - if (get_nr_swap_pages() > 0)
> + if (get_nr_swap_pages() > 0 && min_free_kbytes < nr_free_pages())
> return madvise_free(vma, prev, start, end);
> /* passthrough */
> case MADV_DONTNEED:
The throttling makes a lot of sense and definitely should be included in the
patch. At least my jemalloc test shows similar performance with/without
the patch in the memory pressure case. So overall I'm pretty happy with it.

However, this only solves half of the problem. Pages that were MADV_FREEed
before the watermark is hit are still hard to reclaim later if there are other
allocations. I'm not sure how severe this issue is. My jemalloc test calls
madvise frequently (falling back to DONTNEED with the above change), so it can
free a lot of memory by itself under memory pressure. If an application uses
MADV_FREE before the watermark is hit but doesn't use it afterwards, we will
have trouble.
Thanks,
Shaohua
Hi Shaohua,
On Wed, Feb 11, 2015 at 04:14:03PM -0800, Shaohua Li wrote:
> On Wed, Feb 11, 2015 at 09:56:20AM +0900, Minchan Kim wrote:
> > Hi Shaohua,
> >
> > On Tue, Feb 10, 2015 at 02:38:26PM -0800, Shaohua Li wrote:
> > > On Mon, Feb 09, 2015 at 04:15:53PM +0900, Minchan Kim wrote:
> > > > On Fri, Feb 06, 2015 at 10:29:18AM -0800, Shaohua Li wrote:
> > > > > On Fri, Feb 06, 2015 at 02:51:03PM +0900, Minchan Kim wrote:
> > > > > > Hi Shaohua,
> > > > > >
> > > > > > On Thu, Feb 05, 2015 at 04:33:11PM -0800, Shaohua Li wrote:
> > > > > > >
> > > > > > > Hi Minchan,
> > > > > > >
> > > > > > > Sorry to jump in this thread so later, and if some issues are discussed before.
> > > > > > > I'm interesting in this patch, so tried it here. I use a simple test with
> > > > > >
> > > > > > No problem at all. Interest is always win over ignorance.
> > > > > >
> > > > > > > jemalloc. Obviously this can improve performance when there is no memory
> > > > > > > pressure. Did you try setup with memory pressure?
> > > > > >
> > > > > > Sure but it was not a huge memory system like yours.
> > > > >
> > > > > Yes, I'd like to check the symptom in memory pressure, so choose such test.
> > > > >
> > > > > > > In my test, jemalloc will map 61G vma, and use about 32G memory without
> > > > > > > MADV_FREE. If MADV_FREE is enabled, jemalloc will use whole 61G memory because
> > > > > > > madvise doesn't reclaim the unused memory. If I disable swap (tweak your patch
> > > > > >
> > > > > > Yes, IIUC, jemalloc replaces MADV_DONTNEED with MADV_FREE completely.
> > > > >
> > > > > right.
> > > > > > > slightly to make it work without swap), I got oom. If swap is enabled, my
> > > > > >
> > > > > > You mean you modified anon aging logic so it works although there is no swap?
> > > > > > If so, I have no idea why OOM happens. I guess it should free all of freeable
> > > > > > pages during the aging so although system stall happens more, I don't expect
> > > > > > OOM. Anyway, with MADV_FREE with no swap, we should consider more things
> > > > > > about anonymous aging.
> > > > >
> > > > > In the patch, MADV_FREE will be disabled and fallback to DONTNEED if no swap is
> > > > > enabled. Our production environment doesn't enable swap, so I tried to delete
> > > > > the 'no swap' check and make MADV_FREE always enabled regardless if swap is
> > > > > enabled. I didn't change anything else. With such change, I saw oom
> > > > > immediately. So definitely we have aging issue, the pages aren't reclaimed
> > > > > fast.
> > > >
> > > > In current VM implementation, it doesn't age anonymous LRU list if we have no
> > > > swap. That's the reason to drop freeing pages instantly.
> > > > I think it could be enhanced later.
> > > > http://lists.infradead.org/pipermail/linux-arm-kernel/2014-December/311591.html
> > > >
> > > > >
> > > > > > > system is totally stalled because of swap activity. Without the MADV_FREE,
> > > > > > > everything is ok. Considering we definitely don't want to waste too much
> > > > > > > memory, a system with memory pressure is normal, so sounds MADV_FREE will
> > > > > > > introduce big trouble here.
> > > > > > >
> > > > > > > Did you think about move the MADV_FREE pages to the head of inactive LRU, so
> > > > > > > they can be reclaimed easily?
> > > > > >
> > > > > > I think it's desirable if the page lived in active LRU.
> > > > > > The reason I didn't that was caused by volatile ranges system call which
> > > > > > was motivaion for MADV_FREE in my mind.
> > > > > > In last LSF/MM, there was concern about data's hotness.
> > > > > > Some of users want to keep that as it is in LRU position, others want to
> > > > > > handle that as cold(tail of inactive list)/warm(head of inactive list)/
> > > > > > hot(head of active list), for example.
> > > > > > The vrange syscall was just about volatiltiy, not depends on page hotness
> > > > > > so the decision on my head was not to change LRU order and let's make new
> > > > > > hotness advise if we need it later.
> > > > > >
> > > > > > However, MADV_FREE's main customer is allocators and afaik, they want
> > > > > > to replace MADV_DONTNEED with MADV_FREE so I think it is really cold,
> > > > > > but we couldn't make sure so head of inactive is good compromise.
> > > > > > Another concern about tail of inactive list is that there could be
> > > > > > plenty of pages in there, which was asynchromos write-backed in
> > > > > > previous reclaim path, not-yet reclaimed because of not being able
> > > > > > to free the in softirq context of writeback. It means we ends up
> > > > > > freeing more potential pages to become workingset in advance
> > > > > > than pages VM already decided to evict.
> > > > >
> > > > > Yes, they are definitely cold pages. I thought We should make sure the
> > > > > MADV_FREE pages are reclaimed first before other pages, at least in the anon
> > > > > LRU list, though there might be difficult to determine if we should reclaim
> > > > > writeback pages first or MADV_FREE pages first.
> > > >
> > > > Frankly speaking, the issue with writeback pages is just a hurdle of
> > > > implementation, not design, so if we could fix it, we might move
> > > > cold pages to the tail of the inactive LRU. I tried it but haven't had
> > > > time to continue these days. I hope to get time to look at it soon.
> > > > https://lkml.org/lkml/2014/7/1/628
> > > > Even if we couldn't fix the problem of writeback pages, it wouldn't
> > > > be critical, because they are all cold pages already, so keeping
> > > > order in the LRU might not matter. We could then save the working set
> > > > and the VM's effort to reclaim them, at the cost of moving all hinted
> > > > pages to the tail of the LRU whenever the syscall is called.
> > > >
> > > > However, the significant problem in my mind is that we can't be
> > > > sure they are really cold pages. It would be true for allocators,
> > > > but they are cache-friendly pages, so it might be better to discard
> > > > the tail pages of the inactive LRU, which are really cold.
> > > > In addition, we can't anticipate every use case for MADV_FREE,
> > > > so some users might want to treat them as warm, not cold.
> > > >
> > > > With moving them to the inactive list's head, if we still see
> > > > a lot of stalls, I think it's a sign to add other logic. For example,
> > > > we could drop MADV_FREEed pages instantly if the zone is below the
> > > > low min watermark when the syscall is called, because nobody
> > > > likes direct reclaim.
> > >
> > > So I tried moving the MADV_FREE pages to the inactive list head or tail. It
> > > helps a little, but there are still stalls/OOM. kswapd isn't fast enough to
> > > free the pages, so the app enters direct reclaim frequently. On one machine,
> > > with no swap triggered, MADV_FREE is 5x slower than MADV_DONTNEED. On
> > > another machine, MADV_FREE
> >
> > That's expected. MADV_DONTNEED and MADV_FREE are really different.
> > MADV_DONTNEED is self-sacrifice for others in the system, while MADV_FREE is
> > a greedy approach for the caller itself, because a random process asking for
> > memory could enter direct reclaim.
> > However, as I said earlier, we could mitigate the problem by checking
> > min_free_kbytes. If free memory in the system is under min_free_kbytes, it is
> > pointless to impose reclaim overhead for hinted pages, because we already
> > know the hint means "please free when you are in trouble with memory" and we
> > already know that we are in trouble.
> >
> > I tested the patch below on my machine (3G RAM, 12 CPUs, 8G swap) with this
> > test: 12 processes (each process does 5 iterations: mmap 512M + memset + madvise),
> >
> > 1. MADV_DONTNEED : 41.884sec, sys:3m4.552
> > 2. MADV_FREE : 1m28sec, sys: 5m23
> > 3. MADV_FREE + below patch : 37.188s, sys: 2m20
> >
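A minimal sketch of one worker from the test described above, for anyone who
wants to reproduce it. This is not the original test program: the names
touch_and_advise() and run_worker() are hypothetical, and the munmap() after
each iteration is an assumption, since the description only lists
mmap + memset + madvise.

```c
#define _DEFAULT_SOURCE
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>

/*
 * One iteration: map anonymous memory, dirty every page, then pass the
 * given advice (e.g. MADV_DONTNEED or MADV_FREE) to madvise().
 * Returns 0 on success, -1 on failure.
 * Assumption: the mapping is unmapped afterwards to keep iterations
 * independent; the original description does not say so.
 */
static int touch_and_advise(size_t len, int advice)
{
	void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	int ret;

	if (p == MAP_FAILED)
		return -1;

	memset(p, 1, len);		/* fault in and dirty all pages */
	ret = madvise(p, len, advice);
	munmap(p, len);
	return ret;
}

/* Repeat the iteration; stops at the first failure. */
static int run_worker(size_t len, int advice, int iterations)
{
	for (int i = 0; i < iterations; i++)
		if (touch_and_advise(len, advice) != 0)
			return -1;
	return 0;
}
```

Forking 12 such workers with len = 512M and 5 iterations, and timing the run
once per advice value, would approximate the setup. MADV_FREE needs a kernel
and headers that define it.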
> > Could you test?
> >
> > diff --git a/mm/madvise.c b/mm/madvise.c
> > index 6d0fcb8..da15f8f 100644
> > --- a/mm/madvise.c
> > +++ b/mm/madvise.c
> > @@ -523,7 +523,7 @@ madvise_vma(struct vm_area_struct *vma, struct vm_area_struct **prev,
> > * XXX: In this implementation, MADV_FREE works like
> > * MADV_DONTNEED on swapless system or full swap.
> > */
> > - if (get_nr_swap_pages() > 0)
> > + if (get_nr_swap_pages() > 0 && min_free_kbytes < nr_free_pages())
> > return madvise_free(vma, prev, start, end);
> > /* passthrough */
> > case MADV_DONTNEED:
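The condition in the diff above can be read as a small predicate. The
following is a userspace model only, not kernel code: use_lazy_free() and its
parameters are hypothetical names standing in for get_nr_swap_pages(),
min_free_kbytes and nr_free_pages(), and it compares the values exactly as
the diff does, without any unit conversion.

```c
#include <stdbool.h>

/*
 * Hypothetical model of the check in the diff: take the lazy-free
 * (MADV_FREE) path only when swap space is available and the free page
 * count is above the min_free_kbytes value. Otherwise the caller falls
 * through to the MADV_DONTNEED behaviour.
 * Note: the values are compared as given, mirroring the diff.
 */
static bool use_lazy_free(long nr_swap_pages,
			  unsigned long min_free_kbytes,
			  unsigned long nr_free_pages)
{
	return nr_swap_pages > 0 && min_free_kbytes < nr_free_pages;
}
```

When the predicate is false, the request is handled like MADV_DONTNEED, so
the pages are dropped at syscall time instead of being left for reclaim.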
>
> The throttling makes a lot of sense and definitely should be included in the
> patch.
Yep, I will post it with a little modification after a long vacation.
> At least my jemalloc test has similar performance results with/without
> the patch in the memory pressure case. So overall I'm pretty happy with it.
Thanks for the testing.
> However, this only solves half of the problem. Pages which are MADV_FREEed
> before the watermark is hit are still hard to reclaim later if there are other
> allocations. I'm not sure how severe this issue is. My jemalloc test frequently
> calls madvise (falling back to DONTNEED with the above change), so it can itself
> free a lot of memory under memory pressure. If an application uses MADV_FREE
> before the watermark is hit but doesn't use it after the watermark is hit, we
> will have trouble.
Fair enough. It might make those pages close to the inactive LRU's tail
unlikely to be freed and instead rotated back to the active LRU.
Hmm, I don't know how much trouble such anonymous LRU scanning without
freeing causes on a huge system.
Anyway, one idea is that we could use COW so that recently dirtied pages
move to the active LRU's head. Although it adds more overhead for
MADV_FREE than now, it could solve the above issue.
Also, I think we could make MADV_FREE support on swapless systems easier.
On a swapless system, we don't move pages from the active LRU to the
inactive one, so when MADV_FREE is called, we could move those pages to the
inactive LRU, and if a recent access happens on those pages before the VM
discards them, we could move them from the inactive to the active list.
The inactive LRU list would then hold mostly freeable pages (if a swapoff
race happens, some non-freeable pages remain on the inactive list), so it's
not a performance problem, provided the VM does aging only if there are
anonymous pages in the inactive LRU list on a swapless system.
>
> Thanks,
> Shaohua
--
Kind regards,
Minchan Kim