2012-11-15 19:25:59

by Kirill A. Shutemov

Subject: [PATCH v6 00/12] Introduce huge zero page

From: "Kirill A. Shutemov" <[email protected]>

During testing I noticed a big (up to 2.5x) memory consumption overhead
on some workloads (e.g. ft.A from NPB) if THP is enabled.

The main reason for the difference is the lack of a zero page in the THP
case: we have to allocate a real huge page even on a read page fault.

A program to demonstrate the issue:
#include <assert.h>
#include <stdlib.h>
#include <unistd.h>

#define MB 1024*1024

int main(int argc, char **argv)
{
	char *p;
	int i;

	posix_memalign((void **)&p, 2 * MB, 200 * MB);
	for (i = 0; i < 200 * MB; i += 4096)
		assert(p[i] == 0);
	pause();
	return 0;
}

With thp-never RSS is about 400k, but with thp-always it's 200M.
After the patchset thp-always RSS is 400k too.
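
For anyone reproducing the numbers: RSS can be read from the VmRSS field of
/proc/<pid>/status while the test sits in pause(). The checker below is an
editorial sketch, not part of the patchset; it takes the test's pid as its
only argument and prints the kernel-reported VmRSS (in kB).

#include <stdio.h>
#include <string.h>

int main(int argc, char **argv)
{
	char path[64], line[128];
	FILE *f;

	if (argc != 2)
		return 1;
	snprintf(path, sizeof(path), "/proc/%s/status", argv[1]);
	f = fopen(path, "r");
	if (!f)
		return 1;
	while (fgets(line, sizeof(line), f))
		if (!strncmp(line, "VmRSS:", 6))
			fputs(line, stdout);	/* resident set size, in kB */
	fclose(f);
	return 0;
}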

Design overview.

Huge zero page (hzp) is a non-movable huge page (2M on x86-64) filled with
zeros. The way we allocate it changes over the course of the patchset:

- [01/12] simplest way: hzp is allocated at boot time in hugepage_init();
- [09/12] lazy allocation on first use;
- [10/12] lockless refcounting + shrinker-reclaimable hzp;

We set it up in do_huge_pmd_anonymous_page() if the area around the fault
address is suitable for THP and we got a read page fault.
If we fail to set up hzp (ENOMEM) we fall back to handle_pte_fault() as we
normally do in THP.

On a write-protect fault to hzp we allocate real memory for the huge page
and clear it. If that fails (ENOMEM), we fall back gracefully: we create a
new pmd table and set the pte at the fault address to a newly allocated
normal (4k) page. All other ptes in the pmd are set to the normal zero page.

We cannot split hzp itself (and it's a bug if we try), but we can split the
pmd which points to it. On splitting the pmd we create a table with all ptes
set to the normal zero page.

The patchset is organized in a bisect-friendly way:
Patches 01-07: prepare all code paths for hzp
Patch 08: all code paths are covered: safe to setup hzp
Patch 09: lazy allocation
Patch 10: lockless refcounting for hzp
Patch 11: vmstat events for hzp allocation
Patch 12: sysfs knob to disable hzp

v6:
- updates based on feedback from David Rientjes;
- sysfs knob to disable hzp;
v5:
- implement HZP_ALLOC and HZP_ALLOC_FAILED events;
v4:
- Rebase to v3.7-rc1;
- Update commit message;
v3:
- fix potential deadlock in refcounting code on preemptible kernels.
- do not mark huge zero page as movable.
- fix typo in comment.
- Reviewed-by tag from Andrea Arcangeli.
v2:
- Avoid find_vma() if we already have the vma on the stack.
Suggested by Andrea Arcangeli.
- Implement refcounting for huge zero page.

--------------------------------------------------------------------------

At hpa's request I've tried an alternative approach to the hzp
implementation (see the "Virtual huge zero page" patchset): a pmd table with
all entries set to the zero page. This approach should be more cache
friendly, but it increases TLB pressure.

The problem with the virtual huge zero page is that it requires per-arch
enabling: we need a way to mark that a pmd table has all ptes set to the
zero page.

Some numbers comparing the two implementations (on a 4-socket Westmere-EX):

Microbenchmark1
===============

test:
	posix_memalign((void **)&p, 2 * MB, 8 * GB);
	for (i = 0; i < 100; i++) {
		assert(memcmp(p, p + 4*GB, 4*GB) == 0);
		asm volatile ("": : :"memory");
	}
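
For completeness, here is the fragment fleshed out into a standalone
program. This is an editorial sketch: only the measured loop comes from the
original, while the MB/GB macros and the surrounding boilerplate are
assumed.

#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define MB (1024UL * 1024)
#define GB (1024UL * MB)

int main(void)
{
	char *p;
	int i;

	/* 8G of zero-filled, 2M-aligned anonymous memory */
	if (posix_memalign((void **)&p, 2 * MB, 8 * GB))
		return 1;
	for (i = 0; i < 100; i++) {
		/* compare the two 4G halves; both read back as zeros */
		assert(memcmp(p, p + 4 * GB, 4 * GB) == 0);
		/* compiler barrier, as in the original fragment */
		asm volatile ("" : : : "memory");
	}
	return 0;
}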

hzp:
Performance counter stats for './test_memcmp' (5 runs):

32356.272845 task-clock # 0.998 CPUs utilized ( +- 0.13% )
40 context-switches # 0.001 K/sec ( +- 0.94% )
0 CPU-migrations # 0.000 K/sec
4,218 page-faults # 0.130 K/sec ( +- 0.00% )
76,712,481,765 cycles # 2.371 GHz ( +- 0.13% ) [83.31%]
36,279,577,636 stalled-cycles-frontend # 47.29% frontend cycles idle ( +- 0.28% ) [83.35%]
1,684,049,110 stalled-cycles-backend # 2.20% backend cycles idle ( +- 2.96% ) [66.67%]
134,355,715,816 instructions # 1.75 insns per cycle
# 0.27 stalled cycles per insn ( +- 0.10% ) [83.35%]
13,526,169,702 branches # 418.039 M/sec ( +- 0.10% ) [83.31%]
1,058,230 branch-misses # 0.01% of all branches ( +- 0.91% ) [83.36%]

32.413866442 seconds time elapsed ( +- 0.13% )

vhzp:
Performance counter stats for './test_memcmp' (5 runs):

30327.183829 task-clock # 0.998 CPUs utilized ( +- 0.13% )
38 context-switches # 0.001 K/sec ( +- 1.53% )
0 CPU-migrations # 0.000 K/sec
4,218 page-faults # 0.139 K/sec ( +- 0.01% )
71,964,773,660 cycles # 2.373 GHz ( +- 0.13% ) [83.35%]
31,191,284,231 stalled-cycles-frontend # 43.34% frontend cycles idle ( +- 0.40% ) [83.32%]
773,484,474 stalled-cycles-backend # 1.07% backend cycles idle ( +- 6.61% ) [66.67%]
134,982,215,437 instructions # 1.88 insns per cycle
# 0.23 stalled cycles per insn ( +- 0.11% ) [83.32%]
13,509,150,683 branches # 445.447 M/sec ( +- 0.11% ) [83.34%]
1,017,667 branch-misses # 0.01% of all branches ( +- 1.07% ) [83.32%]

30.381324695 seconds time elapsed ( +- 0.13% )

Microbenchmark2
===============

test:
	posix_memalign((void **)&p, 2 * MB, 8 * GB);
	for (i = 0; i < 1000; i++) {
		char *_p = p;
		while (_p < p+4*GB) {
			assert(*_p == *(_p+4*GB));
			_p += 4096;
			asm volatile ("": : :"memory");
		}
	}
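
Again, a hedged standalone version of the fragment (only the inner loops are
from the original; the macros and boilerplate are assumed):

#include <assert.h>
#include <stdlib.h>

#define MB (1024UL * 1024)
#define GB (1024UL * MB)

int main(void)
{
	char *p, *_p;
	int i;

	if (posix_memalign((void **)&p, 2 * MB, 8 * GB))
		return 1;
	for (i = 0; i < 1000; i++) {
		/* touch one byte per 4k page in each 4G half and compare */
		for (_p = p; _p < p + 4 * GB; _p += 4096) {
			assert(*_p == *(_p + 4 * GB));
			/* compiler barrier, as in the original fragment */
			asm volatile ("" : : : "memory");
		}
	}
	return 0;
}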

hzp:
Performance counter stats for 'taskset -c 0 ./test_memcmp2' (5 runs):

3505.727639 task-clock # 0.998 CPUs utilized ( +- 0.26% )
9 context-switches # 0.003 K/sec ( +- 4.97% )
4,384 page-faults # 0.001 M/sec ( +- 0.00% )
8,318,482,466 cycles # 2.373 GHz ( +- 0.26% ) [33.31%]
5,134,318,786 stalled-cycles-frontend # 61.72% frontend cycles idle ( +- 0.42% ) [33.32%]
2,193,266,208 stalled-cycles-backend # 26.37% backend cycles idle ( +- 5.51% ) [33.33%]
9,494,670,537 instructions # 1.14 insns per cycle
# 0.54 stalled cycles per insn ( +- 0.13% ) [41.68%]
2,108,522,738 branches # 601.451 M/sec ( +- 0.09% ) [41.68%]
158,746 branch-misses # 0.01% of all branches ( +- 1.60% ) [41.71%]
3,168,102,115 L1-dcache-loads # 903.693 M/sec ( +- 0.11% ) [41.70%]
1,048,710,998 L1-dcache-misses # 33.10% of all L1-dcache hits ( +- 0.11% ) [41.72%]
1,047,699,685 LLC-load # 298.854 M/sec ( +- 0.03% ) [33.38%]
2,287 LLC-misses # 0.00% of all LL-cache hits ( +- 8.27% ) [33.37%]
3,166,187,367 dTLB-loads # 903.147 M/sec ( +- 0.02% ) [33.35%]
4,266,538 dTLB-misses # 0.13% of all dTLB cache hits ( +- 0.03% ) [33.33%]

3.513339813 seconds time elapsed ( +- 0.26% )

vhzp:
Performance counter stats for 'taskset -c 0 ./test_memcmp2' (5 runs):

27313.891128 task-clock # 0.998 CPUs utilized ( +- 0.24% )
62 context-switches # 0.002 K/sec ( +- 0.61% )
4,384 page-faults # 0.160 K/sec ( +- 0.01% )
64,747,374,606 cycles # 2.370 GHz ( +- 0.24% ) [33.33%]
61,341,580,278 stalled-cycles-frontend # 94.74% frontend cycles idle ( +- 0.26% ) [33.33%]
56,702,237,511 stalled-cycles-backend # 87.57% backend cycles idle ( +- 0.07% ) [33.33%]
10,033,724,846 instructions # 0.15 insns per cycle
# 6.11 stalled cycles per insn ( +- 0.09% ) [41.65%]
2,190,424,932 branches # 80.195 M/sec ( +- 0.12% ) [41.66%]
1,028,630 branch-misses # 0.05% of all branches ( +- 1.50% ) [41.66%]
3,302,006,540 L1-dcache-loads # 120.891 M/sec ( +- 0.11% ) [41.68%]
271,374,358 L1-dcache-misses # 8.22% of all L1-dcache hits ( +- 0.04% ) [41.66%]
20,385,476 LLC-load # 0.746 M/sec ( +- 1.64% ) [33.34%]
76,754 LLC-misses # 0.38% of all LL-cache hits ( +- 2.35% ) [33.34%]
3,309,927,290 dTLB-loads # 121.181 M/sec ( +- 0.03% ) [33.34%]
2,098,967,427 dTLB-misses # 63.41% of all dTLB cache hits ( +- 0.03% ) [33.34%]

27.364448741 seconds time elapsed ( +- 0.24% )

--------------------------------------------------------------------------

I personally prefer the implementation present in this patchset. It doesn't
touch arch-specific code.

Kirill A. Shutemov (12):
thp: huge zero page: basic preparation
thp: zap_huge_pmd(): zap huge zero pmd
thp: copy_huge_pmd(): copy huge zero page
thp: do_huge_pmd_wp_page(): handle huge zero page
thp: change_huge_pmd(): keep huge zero page write-protected
thp: change split_huge_page_pmd() interface
thp: implement splitting pmd for huge zero page
thp: setup huge zero page on non-write page fault
thp: lazy huge zero page allocation
thp: implement refcounting for huge zero page
thp, vmstat: implement HZP_ALLOC and HZP_ALLOC_FAILED events
thp: introduce sysfs knob to disable huge zero page

Documentation/vm/transhuge.txt | 19 ++-
arch/x86/kernel/vm86_32.c | 2 +-
fs/proc/task_mmu.c | 2 +-
include/linux/huge_mm.h | 18 ++-
include/linux/mm.h | 8 +
include/linux/vm_event_item.h | 2 +
mm/huge_memory.c | 339 +++++++++++++++++++++++++++++++++++++----
mm/memory.c | 11 +-
mm/mempolicy.c | 2 +-
mm/mprotect.c | 2 +-
mm/mremap.c | 2 +-
mm/pagewalk.c | 2 +-
mm/vmstat.c | 2 +
13 files changed, 364 insertions(+), 47 deletions(-)

--
1.7.11.7


2012-11-15 19:26:01

by Kirill A. Shutemov

Subject: [PATCH v6 03/12] thp: copy_huge_pmd(): copy huge zero page

From: "Kirill A. Shutemov" <[email protected]>

It's easy to copy the huge zero page: just set the destination pmd to the
huge zero page.

It's safe to copy the huge zero page since nothing maps it yet :-p

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
mm/huge_memory.c | 22 ++++++++++++++++++++++
1 file changed, 22 insertions(+)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index fc2cada..183127c 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -701,6 +701,18 @@ static inline struct page *alloc_hugepage(int defrag)
}
#endif

+static void set_huge_zero_page(pgtable_t pgtable, struct mm_struct *mm,
+ struct vm_area_struct *vma, unsigned long haddr, pmd_t *pmd)
+{
+ pmd_t entry;
+ entry = pfn_pmd(huge_zero_pfn, vma->vm_page_prot);
+ entry = pmd_wrprotect(entry);
+ entry = pmd_mkhuge(entry);
+ set_pmd_at(mm, haddr, pmd, entry);
+ pgtable_trans_huge_deposit(mm, pgtable);
+ mm->nr_ptes++;
+}
+
int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
unsigned long address, pmd_t *pmd,
unsigned int flags)
@@ -778,6 +790,16 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
pte_free(dst_mm, pgtable);
goto out_unlock;
}
+ /*
+ * mm->pagetable lock is enough to be sure that huge zero pmd is not
+ * under splitting since we don't split the page itself, only pmd to
+ * a page table.
+ */
+ if (is_huge_zero_pmd(pmd)) {
+ set_huge_zero_page(pgtable, dst_mm, vma, addr, dst_pmd);
+ ret = 0;
+ goto out_unlock;
+ }
if (unlikely(pmd_trans_splitting(pmd))) {
/* split huge page running from under us */
spin_unlock(&src_mm->page_table_lock);
--
1.7.11.7

2012-11-15 19:26:05

by Kirill A. Shutemov

Subject: [PATCH v6 06/12] thp: change split_huge_page_pmd() interface

From: "Kirill A. Shutemov" <[email protected]>

Pass the vma instead of the mm and add an address parameter.

In most cases we already have the vma on the stack. We provide
split_huge_page_pmd_mm() for the few cases where we have the mm but not the
vma.

This change is a preparation for the huge zero pmd splitting implementation.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
Documentation/vm/transhuge.txt | 4 ++--
arch/x86/kernel/vm86_32.c | 2 +-
fs/proc/task_mmu.c | 2 +-
include/linux/huge_mm.h | 14 ++++++++++----
mm/huge_memory.c | 19 +++++++++++++++++--
mm/memory.c | 4 ++--
mm/mempolicy.c | 2 +-
mm/mprotect.c | 2 +-
mm/mremap.c | 2 +-
mm/pagewalk.c | 2 +-
10 files changed, 37 insertions(+), 16 deletions(-)

diff --git a/Documentation/vm/transhuge.txt b/Documentation/vm/transhuge.txt
index f734bb2..8f5b41d 100644
--- a/Documentation/vm/transhuge.txt
+++ b/Documentation/vm/transhuge.txt
@@ -276,7 +276,7 @@ unaffected. libhugetlbfs will also work fine as usual.
== Graceful fallback ==

Code walking pagetables but unware about huge pmds can simply call
-split_huge_page_pmd(mm, pmd) where the pmd is the one returned by
+split_huge_page_pmd(vma, addr, pmd) where the pmd is the one returned by
pmd_offset. It's trivial to make the code transparent hugepage aware
by just grepping for "pmd_offset" and adding split_huge_page_pmd where
missing after pmd_offset returns the pmd. Thanks to the graceful
@@ -299,7 +299,7 @@ diff --git a/mm/mremap.c b/mm/mremap.c
return NULL;

pmd = pmd_offset(pud, addr);
-+ split_huge_page_pmd(mm, pmd);
++ split_huge_page_pmd(vma, addr, pmd);
if (pmd_none_or_clear_bad(pmd))
return NULL;

diff --git a/arch/x86/kernel/vm86_32.c b/arch/x86/kernel/vm86_32.c
index 5c9687b..1dfe69c 100644
--- a/arch/x86/kernel/vm86_32.c
+++ b/arch/x86/kernel/vm86_32.c
@@ -182,7 +182,7 @@ static void mark_screen_rdonly(struct mm_struct *mm)
if (pud_none_or_clear_bad(pud))
goto out;
pmd = pmd_offset(pud, 0xA0000);
- split_huge_page_pmd(mm, pmd);
+ split_huge_page_pmd_mm(mm, 0xA0000, pmd);
if (pmd_none_or_clear_bad(pmd))
goto out;
pte = pte_offset_map_lock(mm, pmd, 0xA0000, &ptl);
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 90c63f9..291a0d1 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -643,7 +643,7 @@ static int clear_refs_pte_range(pmd_t *pmd, unsigned long addr,
spinlock_t *ptl;
struct page *page;

- split_huge_page_pmd(walk->mm, pmd);
+ split_huge_page_pmd(vma, addr, pmd);
if (pmd_trans_unstable(pmd))
return 0;

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index b31cb7d..856f080 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -91,12 +91,14 @@ extern int handle_pte_fault(struct mm_struct *mm,
struct vm_area_struct *vma, unsigned long address,
pte_t *pte, pmd_t *pmd, unsigned int flags);
extern int split_huge_page(struct page *page);
-extern void __split_huge_page_pmd(struct mm_struct *mm, pmd_t *pmd);
-#define split_huge_page_pmd(__mm, __pmd) \
+extern void __split_huge_page_pmd(struct vm_area_struct *vma,
+ unsigned long address, pmd_t *pmd);
+#define split_huge_page_pmd(__vma, __address, __pmd) \
do { \
pmd_t *____pmd = (__pmd); \
if (unlikely(pmd_trans_huge(*____pmd))) \
- __split_huge_page_pmd(__mm, ____pmd); \
+ __split_huge_page_pmd(__vma, __address, \
+ ____pmd); \
} while (0)
#define wait_split_huge_page(__anon_vma, __pmd) \
do { \
@@ -106,6 +108,8 @@ extern void __split_huge_page_pmd(struct mm_struct *mm, pmd_t *pmd);
BUG_ON(pmd_trans_splitting(*____pmd) || \
pmd_trans_huge(*____pmd)); \
} while (0)
+extern void split_huge_page_pmd_mm(struct mm_struct *mm, unsigned long address,
+ pmd_t *pmd);
#if HPAGE_PMD_ORDER > MAX_ORDER
#error "hugepages can't be allocated by the buddy allocator"
#endif
@@ -173,10 +177,12 @@ static inline int split_huge_page(struct page *page)
{
return 0;
}
-#define split_huge_page_pmd(__mm, __pmd) \
+#define split_huge_page_pmd(__vma, __address, __pmd) \
do { } while (0)
#define wait_split_huge_page(__anon_vma, __pmd) \
do { } while (0)
+#define split_huge_page_pmd_mm(__mm, __address, __pmd) \
+ do { } while (0)
#define compound_trans_head(page) compound_head(page)
static inline int hugepage_madvise(struct vm_area_struct *vma,
unsigned long *vm_flags, int advice)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 42607a1..2e1dbba 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2495,9 +2495,14 @@ static int khugepaged(void *none)
return 0;
}

-void __split_huge_page_pmd(struct mm_struct *mm, pmd_t *pmd)
+void __split_huge_page_pmd(struct vm_area_struct *vma, unsigned long address,
+ pmd_t *pmd)
{
struct page *page;
+ unsigned long haddr = address & HPAGE_PMD_MASK;
+ struct mm_struct *mm = vma->vm_mm;
+
+ BUG_ON(vma->vm_start > haddr || vma->vm_end < haddr + HPAGE_PMD_SIZE);

spin_lock(&mm->page_table_lock);
if (unlikely(!pmd_trans_huge(*pmd))) {
@@ -2515,6 +2520,16 @@ void __split_huge_page_pmd(struct mm_struct *mm, pmd_t *pmd)
BUG_ON(pmd_trans_huge(*pmd));
}

+void split_huge_page_pmd_mm(struct mm_struct *mm, unsigned long address,
+ pmd_t *pmd)
+{
+ struct vm_area_struct *vma;
+
+ vma = find_vma(mm, address);
+ BUG_ON(vma == NULL);
+ split_huge_page_pmd(vma, address, pmd);
+}
+
static void split_huge_page_address(struct mm_struct *mm,
unsigned long address)
{
@@ -2539,7 +2554,7 @@ static void split_huge_page_address(struct mm_struct *mm,
* Caller holds the mmap_sem write mode, so a huge pmd cannot
* materialize from under us.
*/
- split_huge_page_pmd(mm, pmd);
+ split_huge_page_pmd_mm(mm, address, pmd);
}

void __vma_adjust_trans_huge(struct vm_area_struct *vma,
diff --git a/mm/memory.c b/mm/memory.c
index 6edc030..6017e23 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1243,7 +1243,7 @@ static inline unsigned long zap_pmd_range(struct mmu_gather *tlb,
BUG();
}
#endif
- split_huge_page_pmd(vma->vm_mm, pmd);
+ split_huge_page_pmd(vma, addr, pmd);
} else if (zap_huge_pmd(tlb, vma, pmd, addr))
goto next;
/* fall through */
@@ -1512,7 +1512,7 @@ struct page *follow_page(struct vm_area_struct *vma, unsigned long address,
}
if (pmd_trans_huge(*pmd)) {
if (flags & FOLL_SPLIT) {
- split_huge_page_pmd(mm, pmd);
+ split_huge_page_pmd(vma, address, pmd);
goto split_fallthrough;
}
spin_lock(&mm->page_table_lock);
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index d04a8a5..b68061e 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -511,7 +511,7 @@ static inline int check_pmd_range(struct vm_area_struct *vma, pud_t *pud,
pmd = pmd_offset(pud, addr);
do {
next = pmd_addr_end(addr, end);
- split_huge_page_pmd(vma->vm_mm, pmd);
+ split_huge_page_pmd(vma, addr, pmd);
if (pmd_none_or_trans_huge_or_clear_bad(pmd))
continue;
if (check_pte_range(vma, pmd, addr, next, nodes,
diff --git a/mm/mprotect.c b/mm/mprotect.c
index a409926..e8c3938 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -90,7 +90,7 @@ static inline void change_pmd_range(struct vm_area_struct *vma, pud_t *pud,
next = pmd_addr_end(addr, end);
if (pmd_trans_huge(*pmd)) {
if (next - addr != HPAGE_PMD_SIZE)
- split_huge_page_pmd(vma->vm_mm, pmd);
+ split_huge_page_pmd(vma, addr, pmd);
else if (change_huge_pmd(vma, pmd, addr, newprot))
continue;
/* fall through */
diff --git a/mm/mremap.c b/mm/mremap.c
index 1b61c2d..eabb24d 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -182,7 +182,7 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
need_flush = true;
continue;
} else if (!err) {
- split_huge_page_pmd(vma->vm_mm, old_pmd);
+ split_huge_page_pmd(vma, old_addr, old_pmd);
}
VM_BUG_ON(pmd_trans_huge(*old_pmd));
}
diff --git a/mm/pagewalk.c b/mm/pagewalk.c
index 6c118d0..35aa294 100644
--- a/mm/pagewalk.c
+++ b/mm/pagewalk.c
@@ -58,7 +58,7 @@ again:
if (!walk->pte_entry)
continue;

- split_huge_page_pmd(walk->mm, pmd);
+ split_huge_page_pmd_mm(walk->mm, addr, pmd);
if (pmd_none_or_trans_huge_or_clear_bad(pmd))
goto again;
err = walk_pte_range(pmd, addr, next, walk);
--
1.7.11.7

2012-11-15 19:26:17

by Kirill A. Shutemov

Subject: [PATCH v6 02/12] thp: zap_huge_pmd(): zap huge zero pmd

From: "Kirill A. Shutemov" <[email protected]>

We don't have a mapped page to zap in the huge zero page case. Let's just
clear the pmd and remove it from the TLB.

Signed-off-by: Kirill A. Shutemov <[email protected]>
Acked-by: David Rientjes <[email protected]>
---
mm/huge_memory.c | 21 +++++++++++++--------
1 file changed, 13 insertions(+), 8 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 8d7a54b..fc2cada 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1058,15 +1058,20 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
pmd_t orig_pmd;
pgtable = pgtable_trans_huge_withdraw(tlb->mm);
orig_pmd = pmdp_get_and_clear(tlb->mm, addr, pmd);
- page = pmd_page(orig_pmd);
tlb_remove_pmd_tlb_entry(tlb, pmd, addr);
- page_remove_rmap(page);
- VM_BUG_ON(page_mapcount(page) < 0);
- add_mm_counter(tlb->mm, MM_ANONPAGES, -HPAGE_PMD_NR);
- VM_BUG_ON(!PageHead(page));
- tlb->mm->nr_ptes--;
- spin_unlock(&tlb->mm->page_table_lock);
- tlb_remove_page(tlb, page);
+ if (is_huge_zero_pmd(orig_pmd)) {
+ tlb->mm->nr_ptes--;
+ spin_unlock(&tlb->mm->page_table_lock);
+ } else {
+ page = pmd_page(orig_pmd);
+ page_remove_rmap(page);
+ VM_BUG_ON(page_mapcount(page) < 0);
+ add_mm_counter(tlb->mm, MM_ANONPAGES, -HPAGE_PMD_NR);
+ VM_BUG_ON(!PageHead(page));
+ tlb->mm->nr_ptes--;
+ spin_unlock(&tlb->mm->page_table_lock);
+ tlb_remove_page(tlb, page);
+ }
pte_free(tlb->mm, pgtable);
ret = 1;
}
--
1.7.11.7

2012-11-15 19:26:15

by Kirill A. Shutemov

Subject: [PATCH v6 11/12] thp, vmstat: implement HZP_ALLOC and HZP_ALLOC_FAILED events

From: "Kirill A. Shutemov" <[email protected]>

hzp_alloc is incremented every time a huge zero page is successfully
allocated. It includes allocations which were dropped due to a race with
another allocation. Note, it doesn't count every map of the huge zero page,
only its allocation.

hzp_alloc_failed is incremented if the kernel fails to allocate the huge
zero page and falls back to using small pages.
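
For reference, once the patch is applied these counters show up in
/proc/vmstat under the names added below (thp_zero_page_alloc and
thp_zero_page_alloc_failed). A small editorial sketch that prints just
those lines:

#include <stdio.h>
#include <string.h>

int main(void)
{
	char line[128];
	FILE *f = fopen("/proc/vmstat", "r");

	if (!f)
		return 1;
	while (fgets(line, sizeof(line), f))
		if (!strncmp(line, "thp_zero_page_", 14))
			fputs(line, stdout);
	fclose(f);
	return 0;
}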

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
Documentation/vm/transhuge.txt | 8 ++++++++
include/linux/vm_event_item.h | 2 ++
mm/huge_memory.c | 5 ++++-
mm/vmstat.c | 2 ++
4 files changed, 16 insertions(+), 1 deletion(-)

diff --git a/Documentation/vm/transhuge.txt b/Documentation/vm/transhuge.txt
index 8f5b41d..60aeedd 100644
--- a/Documentation/vm/transhuge.txt
+++ b/Documentation/vm/transhuge.txt
@@ -197,6 +197,14 @@ thp_split is incremented every time a huge page is split into base
pages. This can happen for a variety of reasons but a common
reason is that a huge page is old and is being reclaimed.

+thp_zero_page_alloc is incremented every time a huge zero page is
+ successfully allocated. It includes allocations which where
+ dropped due race with other allocation. Note, it doesn't count
+ every map of the huge zero page, only its allocation.
+
+thp_zero_page_alloc_failed is incremented if kernel fails to allocate
+ huge zero page and falls back to using small pages.
+
As the system ages, allocating huge pages may be expensive as the
system uses memory compaction to copy data around memory to free a
huge page for use. There are some counters in /proc/vmstat to help
diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index 3d31145..fe786f0 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -58,6 +58,8 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
THP_COLLAPSE_ALLOC,
THP_COLLAPSE_ALLOC_FAILED,
THP_SPLIT,
+ THP_ZERO_PAGE_ALLOC,
+ THP_ZERO_PAGE_ALLOC_FAILED,
#endif
NR_VM_EVENT_ITEMS
};
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 923ea75..b104718 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -183,8 +183,11 @@ retry:

zero_page = alloc_pages((GFP_TRANSHUGE | __GFP_ZERO) & ~__GFP_MOVABLE,
HPAGE_PMD_ORDER);
- if (!zero_page)
+ if (!zero_page) {
+ count_vm_event(THP_ZERO_PAGE_ALLOC_FAILED);
return 0;
+ }
+ count_vm_event(THP_ZERO_PAGE_ALLOC);
preempt_disable();
if (cmpxchg(&huge_zero_pfn, 0, page_to_pfn(zero_page))) {
preempt_enable();
diff --git a/mm/vmstat.c b/mm/vmstat.c
index c737057..5da4b19 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -801,6 +801,8 @@ const char * const vmstat_text[] = {
"thp_collapse_alloc",
"thp_collapse_alloc_failed",
"thp_split",
+ "thp_zero_page_alloc",
+ "thp_zero_page_alloc_failed",
#endif

#endif /* CONFIG_VM_EVENTS_COUNTERS */
--
1.7.11.7

2012-11-15 19:26:13

by Kirill A. Shutemov

Subject: [PATCH v6 09/12] thp: lazy huge zero page allocation

From: "Kirill A. Shutemov" <[email protected]>

Instead of allocating the huge zero page in hugepage_init() we can postpone
it until the huge zero page is mapped for the first time. This saves memory
if THP is not in use.

cmpxchg() is used to avoid a race on huge_zero_pfn initialization.
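
As an aside, this is the classic allocate-then-publish-with-cmpxchg pattern.
A hedged userspace analogy is sketched below; the names zero_buf and
init_zero_buf are made up for illustration, and a GCC atomic builtin stands
in for the kernel's cmpxchg().

#include <stdlib.h>

static void *zero_buf;	/* plays the role of huge_zero_pfn; NULL == not allocated yet */

static int init_zero_buf(void)
{
	/* speculatively allocate a zero-filled 2M buffer, like alloc_pages(__GFP_ZERO, ...) */
	void *buf = calloc(1, 2UL << 20);

	if (!buf)
		return -1;
	/* publish it only if nobody else did; the loser frees its own copy */
	if (!__sync_bool_compare_and_swap(&zero_buf, NULL, buf))
		free(buf);
	return 0;
}

int main(void)
{
	return init_zero_buf() ? 1 : 0;
}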

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
mm/huge_memory.c | 20 ++++++++++----------
1 file changed, 10 insertions(+), 10 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index ca3f6f2..bad9c8f 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -160,22 +160,24 @@ static int start_khugepaged(void)
return err;
}

-static int __init init_huge_zero_page(void)
+static int init_huge_zero_pfn(void)
{
struct page *hpage;
+ unsigned long pfn;

hpage = alloc_pages((GFP_TRANSHUGE | __GFP_ZERO) & ~__GFP_MOVABLE,
HPAGE_PMD_ORDER);
if (!hpage)
return -ENOMEM;
-
- huge_zero_pfn = page_to_pfn(hpage);
+ pfn = page_to_pfn(hpage);
+ if (cmpxchg(&huge_zero_pfn, 0, pfn))
+ __free_page(hpage);
return 0;
}

static inline bool is_huge_zero_pfn(unsigned long pfn)
{
- return pfn == huge_zero_pfn;
+ return huge_zero_pfn && pfn == huge_zero_pfn;
}

static inline bool is_huge_zero_pmd(pmd_t pmd)
@@ -564,10 +566,6 @@ static int __init hugepage_init(void)
if (err)
return err;

- err = init_huge_zero_page();
- if (err)
- goto out;
-
err = khugepaged_slab_init();
if (err)
goto out;
@@ -590,8 +588,6 @@ static int __init hugepage_init(void)

return 0;
out:
- if (huge_zero_pfn)
- __free_page(pfn_to_page(huge_zero_pfn));
hugepage_exit_sysfs(hugepage_kobj);
return err;
}
@@ -728,6 +724,10 @@ int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
return VM_FAULT_OOM;
if (!(flags & FAULT_FLAG_WRITE)) {
pgtable_t pgtable;
+ if (unlikely(!huge_zero_pfn && init_huge_zero_pfn())) {
+ count_vm_event(THP_FAULT_FALLBACK);
+ goto out;
+ }
pgtable = pte_alloc_one(mm, haddr);
if (unlikely(!pgtable))
goto out;
--
1.7.11.7

2012-11-15 19:26:04

by Kirill A. Shutemov

Subject: [PATCH v6 12/12] thp: introduce sysfs knob to disable huge zero page

From: "Kirill A. Shutemov" <[email protected]>

By default the kernel tries to use the huge zero page on read page faults.
It's possible to disable the huge zero page by writing 0 or enable it
back by writing 1:

echo 0 >/sys/kernel/mm/transparent_hugepage/use_zero_page
echo 1 >/sys/kernel/mm/transparent_hugepage/use_zero_page

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
Documentation/vm/transhuge.txt | 7 +++++++
include/linux/huge_mm.h | 4 ++++
mm/huge_memory.c | 21 +++++++++++++++++++--
3 files changed, 30 insertions(+), 2 deletions(-)

diff --git a/Documentation/vm/transhuge.txt b/Documentation/vm/transhuge.txt
index 60aeedd..8785fb8 100644
--- a/Documentation/vm/transhuge.txt
+++ b/Documentation/vm/transhuge.txt
@@ -116,6 +116,13 @@ echo always >/sys/kernel/mm/transparent_hugepage/defrag
echo madvise >/sys/kernel/mm/transparent_hugepage/defrag
echo never >/sys/kernel/mm/transparent_hugepage/defrag

+By default kernel tries to use huge zero page on read page fault.
+It's possible to disable huge zero page by writing 0 or enable it
+back by writing 1:
+
+echo 0 >/sys/kernel/mm/transparent_hugepage/use_zero_page
+echo 1 >/sys/kernel/mm/transparent_hugepage/use_zero_page
+
khugepaged will be automatically started when
transparent_hugepage/enabled is set to "always" or "madvise, and it'll
be automatically shutdown if it's set to "never".
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 856f080..a9f5bd4 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -35,6 +35,7 @@ enum transparent_hugepage_flag {
TRANSPARENT_HUGEPAGE_DEFRAG_FLAG,
TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG,
TRANSPARENT_HUGEPAGE_DEFRAG_KHUGEPAGED_FLAG,
+ TRANSPARENT_HUGEPAGE_USE_ZERO_PAGE_FLAG,
#ifdef CONFIG_DEBUG_VM
TRANSPARENT_HUGEPAGE_DEBUG_COW_FLAG,
#endif
@@ -74,6 +75,9 @@ extern bool is_vma_temporary_stack(struct vm_area_struct *vma);
(transparent_hugepage_flags & \
(1<<TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG) && \
(__vma)->vm_flags & VM_HUGEPAGE))
+#define transparent_hugepage_use_zero_page() \
+ (transparent_hugepage_flags & \
+ (1<<TRANSPARENT_HUGEPAGE_USE_ZERO_PAGE_FLAG))
#ifdef CONFIG_DEBUG_VM
#define transparent_hugepage_debug_cow() \
(transparent_hugepage_flags & \
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index b104718..1f6c6de 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -38,7 +38,8 @@ unsigned long transparent_hugepage_flags __read_mostly =
(1<<TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG)|
#endif
(1<<TRANSPARENT_HUGEPAGE_DEFRAG_FLAG)|
- (1<<TRANSPARENT_HUGEPAGE_DEFRAG_KHUGEPAGED_FLAG);
+ (1<<TRANSPARENT_HUGEPAGE_DEFRAG_KHUGEPAGED_FLAG)|
+ (1<<TRANSPARENT_HUGEPAGE_USE_ZERO_PAGE_FLAG);

/* default scan 8*512 pte (or vmas) every 30 second */
static unsigned int khugepaged_pages_to_scan __read_mostly = HPAGE_PMD_NR*8;
@@ -356,6 +357,20 @@ static ssize_t defrag_store(struct kobject *kobj,
static struct kobj_attribute defrag_attr =
__ATTR(defrag, 0644, defrag_show, defrag_store);

+static ssize_t use_zero_page_show(struct kobject *kobj,
+ struct kobj_attribute *attr, char *buf)
+{
+ return single_flag_show(kobj, attr, buf,
+ TRANSPARENT_HUGEPAGE_USE_ZERO_PAGE_FLAG);
+}
+static ssize_t use_zero_page_store(struct kobject *kobj,
+ struct kobj_attribute *attr, const char *buf, size_t count)
+{
+ return single_flag_store(kobj, attr, buf, count,
+ TRANSPARENT_HUGEPAGE_USE_ZERO_PAGE_FLAG);
+}
+static struct kobj_attribute use_zero_page_attr =
+ __ATTR(use_zero_page, 0644, use_zero_page_show, use_zero_page_store);
#ifdef CONFIG_DEBUG_VM
static ssize_t debug_cow_show(struct kobject *kobj,
struct kobj_attribute *attr, char *buf)
@@ -377,6 +392,7 @@ static struct kobj_attribute debug_cow_attr =
static struct attribute *hugepage_attr[] = {
&enabled_attr.attr,
&defrag_attr.attr,
+ &use_zero_page_attr.attr,
#ifdef CONFIG_DEBUG_VM
&debug_cow_attr.attr,
#endif
@@ -771,7 +787,8 @@ int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
return VM_FAULT_OOM;
if (unlikely(khugepaged_enter(vma)))
return VM_FAULT_OOM;
- if (!(flags & FAULT_FLAG_WRITE)) {
+ if (!(flags & FAULT_FLAG_WRITE) &&
+ transparent_hugepage_use_zero_page()) {
pgtable_t pgtable;
unsigned long zero_pfn;
pgtable = pte_alloc_one(mm, haddr);
--
1.7.11.7

2012-11-15 19:27:04

by Kirill A. Shutemov

Subject: [PATCH v6 10/12] thp: implement refcounting for huge zero page

From: "Kirill A. Shutemov" <[email protected]>

H. Peter Anvin doesn't like a huge zero page which sticks in memory forever
after the first allocation. Here's an implementation of lockless refcounting
for the huge zero page.

We have two basic primitives: {get,put}_huge_zero_page(). They
manipulate the reference counter.

If the counter is 0, get_huge_zero_page() allocates a new huge page and
takes two references: one for the caller and one for the shrinker. We free
the page only in the shrinker callback if the counter is 1 (only the
shrinker holds a reference).

put_huge_zero_page() only decrements the counter. The counter never reaches
zero in put_huge_zero_page() since the shrinker holds one reference.

Freeing the huge zero page from the shrinker callback helps to avoid
frequent allocate-free cycles.

Refcounting has a cost. On a 4-socket machine I observe a ~1% slowdown on
parallel (40 processes) read page faulting compared to lazy huge zero page
allocation. I think that's pretty reasonable for a synthetic benchmark.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
mm/huge_memory.c | 112 ++++++++++++++++++++++++++++++++++++++++++-------------
1 file changed, 87 insertions(+), 25 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index bad9c8f..923ea75 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -18,6 +18,7 @@
#include <linux/freezer.h>
#include <linux/mman.h>
#include <linux/pagemap.h>
+#include <linux/shrinker.h>
#include <asm/tlb.h>
#include <asm/pgalloc.h>
#include "internal.h"
@@ -47,7 +48,6 @@ static unsigned int khugepaged_scan_sleep_millisecs __read_mostly = 10000;
/* during fragmentation poll the hugepage allocator once every minute */
static unsigned int khugepaged_alloc_sleep_millisecs __read_mostly = 60000;
static struct task_struct *khugepaged_thread __read_mostly;
-static unsigned long huge_zero_pfn __read_mostly;
static DEFINE_MUTEX(khugepaged_mutex);
static DEFINE_SPINLOCK(khugepaged_mm_lock);
static DECLARE_WAIT_QUEUE_HEAD(khugepaged_wait);
@@ -160,31 +160,74 @@ static int start_khugepaged(void)
return err;
}

-static int init_huge_zero_pfn(void)
+static atomic_t huge_zero_refcount;
+static unsigned long huge_zero_pfn __read_mostly;
+
+static inline bool is_huge_zero_pfn(unsigned long pfn)
{
- struct page *hpage;
- unsigned long pfn;
+ unsigned long zero_pfn = ACCESS_ONCE(huge_zero_pfn);
+ return zero_pfn && pfn == zero_pfn;
+}
+
+static inline bool is_huge_zero_pmd(pmd_t pmd)
+{
+ return is_huge_zero_pfn(pmd_pfn(pmd));
+}
+
+static unsigned long get_huge_zero_page(void)
+{
+ struct page *zero_page;
+retry:
+ if (likely(atomic_inc_not_zero(&huge_zero_refcount)))
+ return ACCESS_ONCE(huge_zero_pfn);

- hpage = alloc_pages((GFP_TRANSHUGE | __GFP_ZERO) & ~__GFP_MOVABLE,
+ zero_page = alloc_pages((GFP_TRANSHUGE | __GFP_ZERO) & ~__GFP_MOVABLE,
HPAGE_PMD_ORDER);
- if (!hpage)
- return -ENOMEM;
- pfn = page_to_pfn(hpage);
- if (cmpxchg(&huge_zero_pfn, 0, pfn))
- __free_page(hpage);
- return 0;
+ if (!zero_page)
+ return 0;
+ preempt_disable();
+ if (cmpxchg(&huge_zero_pfn, 0, page_to_pfn(zero_page))) {
+ preempt_enable();
+ __free_page(zero_page);
+ goto retry;
+ }
+
+ /* We take additional reference here. It will be put back by shrinker */
+ atomic_set(&huge_zero_refcount, 2);
+ preempt_enable();
+ return ACCESS_ONCE(huge_zero_pfn);
}

-static inline bool is_huge_zero_pfn(unsigned long pfn)
+static void put_huge_zero_page(void)
{
- return huge_zero_pfn && pfn == huge_zero_pfn;
+ /*
+ * Counter should never go to zero here. Only shrinker can put
+ * last reference.
+ */
+ BUG_ON(atomic_dec_and_test(&huge_zero_refcount));
}

-static inline bool is_huge_zero_pmd(pmd_t pmd)
+static int shrink_huge_zero_page(struct shrinker *shrink,
+ struct shrink_control *sc)
{
- return is_huge_zero_pfn(pmd_pfn(pmd));
+ if (!sc->nr_to_scan)
+ /* we can free zero page only if last reference remains */
+ return atomic_read(&huge_zero_refcount) == 1 ? HPAGE_PMD_NR : 0;
+
+ if (atomic_cmpxchg(&huge_zero_refcount, 1, 0) == 1) {
+ unsigned long zero_pfn = xchg(&huge_zero_pfn, 0);
+ BUG_ON(zero_pfn == 0);
+ __free_page(__pfn_to_page(zero_pfn));
+ }
+
+ return 0;
}

+static struct shrinker huge_zero_page_shrinker = {
+ .shrink = shrink_huge_zero_page,
+ .seeks = DEFAULT_SEEKS,
+};
+
#ifdef CONFIG_SYSFS

static ssize_t double_flag_show(struct kobject *kobj,
@@ -576,6 +619,8 @@ static int __init hugepage_init(void)
goto out;
}

+ register_shrinker(&huge_zero_page_shrinker);
+
/*
* By default disable transparent hugepages on smaller systems,
* where the extra memory used could hurt more than TLB overhead
@@ -698,10 +743,11 @@ static inline struct page *alloc_hugepage(int defrag)
#endif

static void set_huge_zero_page(pgtable_t pgtable, struct mm_struct *mm,
- struct vm_area_struct *vma, unsigned long haddr, pmd_t *pmd)
+ struct vm_area_struct *vma, unsigned long haddr, pmd_t *pmd,
+ unsigned long zero_pfn)
{
pmd_t entry;
- entry = pfn_pmd(huge_zero_pfn, vma->vm_page_prot);
+ entry = pfn_pmd(zero_pfn, vma->vm_page_prot);
entry = pmd_wrprotect(entry);
entry = pmd_mkhuge(entry);
set_pmd_at(mm, haddr, pmd, entry);
@@ -724,15 +770,19 @@ int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
return VM_FAULT_OOM;
if (!(flags & FAULT_FLAG_WRITE)) {
pgtable_t pgtable;
- if (unlikely(!huge_zero_pfn && init_huge_zero_pfn())) {
- count_vm_event(THP_FAULT_FALLBACK);
- goto out;
- }
+ unsigned long zero_pfn;
pgtable = pte_alloc_one(mm, haddr);
if (unlikely(!pgtable))
goto out;
+ zero_pfn = get_huge_zero_page();
+ if (unlikely(!zero_pfn)) {
+ pte_free(mm, pgtable);
+ count_vm_event(THP_FAULT_FALLBACK);
+ goto out;
+ }
spin_lock(&mm->page_table_lock);
- set_huge_zero_page(pgtable, mm, vma, haddr, pmd);
+ set_huge_zero_page(pgtable, mm, vma, haddr, pmd,
+ zero_pfn);
spin_unlock(&mm->page_table_lock);
return 0;
}
@@ -806,7 +856,15 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
* a page table.
*/
if (is_huge_zero_pmd(pmd)) {
- set_huge_zero_page(pgtable, dst_mm, vma, addr, dst_pmd);
+ unsigned long zero_pfn;
+ /*
+ * get_huge_zero_page() will never allocate a new page here,
+ * since we already have a zero page to copy. It just takes a
+ * reference.
+ */
+ zero_pfn = get_huge_zero_page();
+ set_huge_zero_page(pgtable, dst_mm, vma, addr, dst_pmd,
+ zero_pfn);
ret = 0;
goto out_unlock;
}
@@ -894,6 +952,7 @@ static int do_huge_pmd_wp_zero_page_fallback(struct mm_struct *mm,
smp_wmb(); /* make pte visible before pmd */
pmd_populate(mm, pmd, pgtable);
spin_unlock(&mm->page_table_lock);
+ put_huge_zero_page();

mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);

@@ -1095,9 +1154,10 @@ alloc:
page_add_new_anon_rmap(new_page, vma, haddr);
set_pmd_at(mm, haddr, pmd, entry);
update_mmu_cache_pmd(vma, address, pmd);
- if (is_huge_zero_pmd(orig_pmd))
+ if (is_huge_zero_pmd(orig_pmd)) {
add_mm_counter(mm, MM_ANONPAGES, HPAGE_PMD_NR);
- else {
+ put_huge_zero_page();
+ } else {
VM_BUG_ON(!PageHead(page));
page_remove_rmap(page);
put_page(page);
@@ -1174,6 +1234,7 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
if (is_huge_zero_pmd(orig_pmd)) {
tlb->mm->nr_ptes--;
spin_unlock(&tlb->mm->page_table_lock);
+ put_huge_zero_page();
} else {
page = pmd_page(orig_pmd);
page_remove_rmap(page);
@@ -2531,6 +2592,7 @@ static void __split_huge_zero_page_pmd(struct vm_area_struct *vma,
}
smp_wmb(); /* make pte visible before pmd */
pmd_populate(mm, pmd, pgtable);
+ put_huge_zero_page();
}

void __split_huge_page_pmd(struct vm_area_struct *vma, unsigned long address,
--
1.7.11.7

2012-11-15 19:27:38

by Kirill A. Shutemov

Subject: [PATCH v6 08/12] thp: setup huge zero page on non-write page fault

From: "Kirill A. Shutemov" <[email protected]>

All code paths seem covered. Now we can map the huge zero page on read page
faults.

We set it up in do_huge_pmd_anonymous_page() if the area around the fault
address is suitable for THP and we got a read page fault.

If we fail to set up the huge zero page (ENOMEM) we fall back to
handle_pte_fault() as we normally do in THP.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
mm/huge_memory.c | 10 ++++++++++
1 file changed, 10 insertions(+)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 015a13a..ca3f6f2 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -726,6 +726,16 @@ int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
return VM_FAULT_OOM;
if (unlikely(khugepaged_enter(vma)))
return VM_FAULT_OOM;
+ if (!(flags & FAULT_FLAG_WRITE)) {
+ pgtable_t pgtable;
+ pgtable = pte_alloc_one(mm, haddr);
+ if (unlikely(!pgtable))
+ goto out;
+ spin_lock(&mm->page_table_lock);
+ set_huge_zero_page(pgtable, mm, vma, haddr, pmd);
+ spin_unlock(&mm->page_table_lock);
+ return 0;
+ }
page = alloc_hugepage_vma(transparent_hugepage_defrag(vma),
vma, haddr, numa_node_id(), 0);
if (unlikely(!page)) {
--
1.7.11.7

2012-11-15 19:27:37

by Kirill A. Shutemov

Subject: [PATCH v6 07/12] thp: implement splitting pmd for huge zero page

From: "Kirill A. Shutemov" <[email protected]>

We can't split the huge zero page itself (and it's a bug if we try), but we
can split the pmd which points to it.

On splitting the pmd we create a table with all ptes set to the normal zero
page.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
mm/huge_memory.c | 41 +++++++++++++++++++++++++++++++++++++++++
1 file changed, 41 insertions(+)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 2e1dbba..015a13a 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1597,6 +1597,7 @@ int split_huge_page(struct page *page)
struct anon_vma *anon_vma;
int ret = 1;

+ BUG_ON(is_huge_zero_pfn(page_to_pfn(page)));
BUG_ON(!PageAnon(page));
anon_vma = page_lock_anon_vma(page);
if (!anon_vma)
@@ -2495,24 +2496,64 @@ static int khugepaged(void *none)
return 0;
}

+static void __split_huge_zero_page_pmd(struct vm_area_struct *vma,
+ unsigned long haddr, pmd_t *pmd)
+{
+ struct mm_struct *mm = vma->vm_mm;
+ pgtable_t pgtable;
+ pmd_t _pmd;
+ int i;
+
+ pmdp_clear_flush(vma, haddr, pmd);
+ /* leave pmd empty until pte is filled */
+
+ pgtable = pgtable_trans_huge_withdraw(mm);
+ pmd_populate(mm, &_pmd, pgtable);
+
+ for (i = 0; i < HPAGE_PMD_NR; i++, haddr += PAGE_SIZE) {
+ pte_t *pte, entry;
+ entry = pfn_pte(my_zero_pfn(haddr), vma->vm_page_prot);
+ entry = pte_mkspecial(entry);
+ pte = pte_offset_map(&_pmd, haddr);
+ VM_BUG_ON(!pte_none(*pte));
+ set_pte_at(mm, haddr, pte, entry);
+ pte_unmap(pte);
+ }
+ smp_wmb(); /* make pte visible before pmd */
+ pmd_populate(mm, pmd, pgtable);
+}
+
void __split_huge_page_pmd(struct vm_area_struct *vma, unsigned long address,
pmd_t *pmd)
{
struct page *page;
unsigned long haddr = address & HPAGE_PMD_MASK;
struct mm_struct *mm = vma->vm_mm;
+ unsigned long mmun_start; /* For mmu_notifiers */
+ unsigned long mmun_end; /* For mmu_notifiers */

BUG_ON(vma->vm_start > haddr || vma->vm_end < haddr + HPAGE_PMD_SIZE);

+ mmun_start = haddr;
+ mmun_end = haddr + HPAGE_PMD_SIZE;
+ mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
spin_lock(&mm->page_table_lock);
if (unlikely(!pmd_trans_huge(*pmd))) {
spin_unlock(&mm->page_table_lock);
+ mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+ return;
+ }
+ if (is_huge_zero_pmd(*pmd)) {
+ __split_huge_zero_page_pmd(vma, haddr, pmd);
+ spin_unlock(&mm->page_table_lock);
+ mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
return;
}
page = pmd_page(*pmd);
VM_BUG_ON(!page_count(page));
get_page(page);
spin_unlock(&mm->page_table_lock);
+ mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);

split_huge_page(page);

--
1.7.11.7

2012-11-15 19:28:05

by Kirill A. Shutemov

[permalink] [raw]
Subject: [PATCH v6 05/12] thp: change_huge_pmd(): keep huge zero page write-protected

From: "Kirill A. Shutemov" <[email protected]>

We want to get a page fault on a write attempt to the huge zero page, so
let's keep it write-protected.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
mm/huge_memory.c | 2 ++
1 file changed, 2 insertions(+)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index f5589c0..42607a1 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1245,6 +1245,8 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
pmd_t entry;
entry = pmdp_get_and_clear(mm, addr, pmd);
entry = pmd_modify(entry, newprot);
+ if (is_huge_zero_pmd(entry))
+ entry = pmd_wrprotect(entry);
set_pmd_at(mm, addr, pmd, entry);
spin_unlock(&vma->vm_mm->page_table_lock);
ret = 1;
--
1.7.11.7

2012-11-15 19:28:46

by Kirill A. Shutemov

Subject: [PATCH v6 04/12] thp: do_huge_pmd_wp_page(): handle huge zero page

From: "Kirill A. Shutemov" <[email protected]>

On write access to the huge zero page we allocate a new huge page and clear
it.

If that fails (ENOMEM), we fall back gracefully: we create a new pmd table
and set the pte at the fault address to a newly allocated normal (4k) page.
All other ptes in the pmd are set to the normal zero page.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
include/linux/mm.h | 8 ++++
mm/huge_memory.c | 110 +++++++++++++++++++++++++++++++++++++++++++++--------
mm/memory.c | 7 ----
3 files changed, 103 insertions(+), 22 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index fa06804..fe329da 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -516,6 +516,14 @@ static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma)
}
#endif

+#ifndef my_zero_pfn
+static inline unsigned long my_zero_pfn(unsigned long addr)
+{
+ extern unsigned long zero_pfn;
+ return zero_pfn;
+}
+#endif
+
/*
* Multiple processes may "see" the same page. E.g. for untouched
* mappings of /dev/null, all processes see the same page full of
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 183127c..f5589c0 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -829,6 +829,69 @@ out:
return ret;
}

+static int do_huge_pmd_wp_zero_page_fallback(struct mm_struct *mm,
+ struct vm_area_struct *vma, unsigned long address,
+ pmd_t *pmd, unsigned long haddr)
+{
+ pgtable_t pgtable;
+ pmd_t _pmd;
+ struct page *page;
+ int i, ret = 0;
+ unsigned long mmun_start; /* For mmu_notifiers */
+ unsigned long mmun_end; /* For mmu_notifiers */
+
+ page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, address);
+ if (!page) {
+ ret |= VM_FAULT_OOM;
+ goto out;
+ }
+
+ if (mem_cgroup_newpage_charge(page, mm, GFP_KERNEL)) {
+ put_page(page);
+ ret |= VM_FAULT_OOM;
+ goto out;
+ }
+
+ clear_user_highpage(page, address);
+ __SetPageUptodate(page);
+
+ mmun_start = haddr;
+ mmun_end = haddr + HPAGE_PMD_SIZE;
+ mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
+
+ spin_lock(&mm->page_table_lock);
+ pmdp_clear_flush(vma, haddr, pmd);
+ /* leave pmd empty until pte is filled */
+
+ pgtable = pgtable_trans_huge_withdraw(mm);
+ pmd_populate(mm, &_pmd, pgtable);
+
+ for (i = 0; i < HPAGE_PMD_NR; i++, haddr += PAGE_SIZE) {
+ pte_t *pte, entry;
+ if (haddr == (address & PAGE_MASK)) {
+ entry = mk_pte(page, vma->vm_page_prot);
+ entry = maybe_mkwrite(pte_mkdirty(entry), vma);
+ page_add_new_anon_rmap(page, vma, haddr);
+ } else {
+ entry = pfn_pte(my_zero_pfn(haddr), vma->vm_page_prot);
+ entry = pte_mkspecial(entry);
+ }
+ pte = pte_offset_map(&_pmd, haddr);
+ VM_BUG_ON(!pte_none(*pte));
+ set_pte_at(mm, haddr, pte, entry);
+ pte_unmap(pte);
+ }
+ smp_wmb(); /* make pte visible before pmd */
+ pmd_populate(mm, pmd, pgtable);
+ spin_unlock(&mm->page_table_lock);
+
+ mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+
+ ret |= VM_FAULT_WRITE;
+out:
+ return ret;
+}
+
static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm,
struct vm_area_struct *vma,
unsigned long address,
@@ -935,19 +998,21 @@ int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
unsigned long address, pmd_t *pmd, pmd_t orig_pmd)
{
int ret = 0;
- struct page *page, *new_page;
+ struct page *page = NULL, *new_page;
unsigned long haddr;
unsigned long mmun_start; /* For mmu_notifiers */
unsigned long mmun_end; /* For mmu_notifiers */

VM_BUG_ON(!vma->anon_vma);
+ haddr = address & HPAGE_PMD_MASK;
+ if (is_huge_zero_pmd(orig_pmd))
+ goto alloc;
spin_lock(&mm->page_table_lock);
if (unlikely(!pmd_same(*pmd, orig_pmd)))
goto out_unlock;

page = pmd_page(orig_pmd);
VM_BUG_ON(!PageCompound(page) || !PageHead(page));
- haddr = address & HPAGE_PMD_MASK;
if (page_mapcount(page) == 1) {
pmd_t entry;
entry = pmd_mkyoung(orig_pmd);
@@ -959,7 +1024,7 @@ int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
}
get_page(page);
spin_unlock(&mm->page_table_lock);
-
+alloc:
if (transparent_hugepage_enabled(vma) &&
!transparent_hugepage_debug_cow())
new_page = alloc_hugepage_vma(transparent_hugepage_defrag(vma),
@@ -969,24 +1034,34 @@ int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,

if (unlikely(!new_page)) {
count_vm_event(THP_FAULT_FALLBACK);
- ret = do_huge_pmd_wp_page_fallback(mm, vma, address,
- pmd, orig_pmd, page, haddr);
- if (ret & VM_FAULT_OOM)
- split_huge_page(page);
- put_page(page);
+ if (is_huge_zero_pmd(orig_pmd)) {
+ ret = do_huge_pmd_wp_zero_page_fallback(mm, vma,
+ address, pmd, haddr);
+ } else {
+ ret = do_huge_pmd_wp_page_fallback(mm, vma, address,
+ pmd, orig_pmd, page, haddr);
+ if (ret & VM_FAULT_OOM)
+ split_huge_page(page);
+ put_page(page);
+ }
goto out;
}
count_vm_event(THP_FAULT_ALLOC);

if (unlikely(mem_cgroup_newpage_charge(new_page, mm, GFP_KERNEL))) {
put_page(new_page);
- split_huge_page(page);
- put_page(page);
+ if (page) {
+ split_huge_page(page);
+ put_page(page);
+ }
ret |= VM_FAULT_OOM;
goto out;
}

- copy_user_huge_page(new_page, page, haddr, vma, HPAGE_PMD_NR);
+ if (is_huge_zero_pmd(orig_pmd))
+ clear_huge_page(new_page, haddr, HPAGE_PMD_NR);
+ else
+ copy_user_huge_page(new_page, page, haddr, vma, HPAGE_PMD_NR);
__SetPageUptodate(new_page);

mmun_start = haddr;
@@ -994,7 +1069,8 @@ int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);

spin_lock(&mm->page_table_lock);
- put_page(page);
+ if (page)
+ put_page(page);
if (unlikely(!pmd_same(*pmd, orig_pmd))) {
spin_unlock(&mm->page_table_lock);
mem_cgroup_uncharge_page(new_page);
@@ -1002,7 +1078,6 @@ int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
goto out_mn;
} else {
pmd_t entry;
- VM_BUG_ON(!PageHead(page));
entry = mk_pmd(new_page, vma->vm_page_prot);
entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
entry = pmd_mkhuge(entry);
@@ -1010,8 +1085,13 @@ int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
page_add_new_anon_rmap(new_page, vma, haddr);
set_pmd_at(mm, haddr, pmd, entry);
update_mmu_cache_pmd(vma, address, pmd);
- page_remove_rmap(page);
- put_page(page);
+ if (is_huge_zero_pmd(orig_pmd))
+ add_mm_counter(mm, MM_ANONPAGES, HPAGE_PMD_NR);
+ else {
+ VM_BUG_ON(!PageHead(page));
+ page_remove_rmap(page);
+ put_page(page);
+ }
ret |= VM_FAULT_WRITE;
}
spin_unlock(&mm->page_table_lock);
diff --git a/mm/memory.c b/mm/memory.c
index fb135ba..6edc030 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -724,13 +724,6 @@ static inline int is_zero_pfn(unsigned long pfn)
}
#endif

-#ifndef my_zero_pfn
-static inline unsigned long my_zero_pfn(unsigned long addr)
-{
- return zero_pfn;
-}
-#endif
-
/*
* vm_normal_page -- This function gets the "struct page" associated with a pte.
*
--
1.7.11.7

2012-11-15 19:25:58

by Kirill A. Shutemov

Subject: [PATCH v6 01/12] thp: huge zero page: basic preparation

From: "Kirill A. Shutemov" <[email protected]>

Huge zero page (hzp) is a non-movable huge page (2M on x86-64) filled
with zeros.

For now let's allocate the page in hugepage_init(). We'll switch to lazy
allocation later.

We are not going to map the huge zero page until we can handle it
properly on all code paths.

The is_huge_zero_{pfn,pmd}() functions will be used by the following patches
to check whether a given pfn/pmd is the huge zero page.

Signed-off-by: Kirill A. Shutemov <[email protected]>
Acked-by: David Rientjes <[email protected]>
---
mm/huge_memory.c | 30 ++++++++++++++++++++++++++++++
1 file changed, 30 insertions(+)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 40f17c3..8d7a54b 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -47,6 +47,7 @@ static unsigned int khugepaged_scan_sleep_millisecs __read_mostly = 10000;
/* during fragmentation poll the hugepage allocator once every minute */
static unsigned int khugepaged_alloc_sleep_millisecs __read_mostly = 60000;
static struct task_struct *khugepaged_thread __read_mostly;
+static unsigned long huge_zero_pfn __read_mostly;
static DEFINE_MUTEX(khugepaged_mutex);
static DEFINE_SPINLOCK(khugepaged_mm_lock);
static DECLARE_WAIT_QUEUE_HEAD(khugepaged_wait);
@@ -159,6 +160,29 @@ static int start_khugepaged(void)
return err;
}

+static int __init init_huge_zero_page(void)
+{
+ struct page *hpage;
+
+ hpage = alloc_pages((GFP_TRANSHUGE | __GFP_ZERO) & ~__GFP_MOVABLE,
+ HPAGE_PMD_ORDER);
+ if (!hpage)
+ return -ENOMEM;
+
+ huge_zero_pfn = page_to_pfn(hpage);
+ return 0;
+}
+
+static inline bool is_huge_zero_pfn(unsigned long pfn)
+{
+ return pfn == huge_zero_pfn;
+}
+
+static inline bool is_huge_zero_pmd(pmd_t pmd)
+{
+ return is_huge_zero_pfn(pmd_pfn(pmd));
+}
+
#ifdef CONFIG_SYSFS

static ssize_t double_flag_show(struct kobject *kobj,
@@ -540,6 +564,10 @@ static int __init hugepage_init(void)
if (err)
return err;

+ err = init_huge_zero_page();
+ if (err)
+ goto out;
+
err = khugepaged_slab_init();
if (err)
goto out;
@@ -562,6 +590,8 @@ static int __init hugepage_init(void)

return 0;
out:
+ if (huge_zero_pfn)
+ __free_page(pfn_to_page(huge_zero_pfn));
hugepage_exit_sysfs(hugepage_kobj);
return err;
}
--
1.7.11.7

2012-11-15 22:32:19

by David Rientjes

Subject: [patch] thp: copy_huge_pmd(): copy huge zero page v6 fix

Fix comment

Signed-off-by: David Rientjes <[email protected]>
---
mm/huge_memory.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -791,7 +791,7 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
goto out_unlock;
}
/*
- * mm->pagetable lock is enough to be sure that huge zero pmd is not
+ * mm->page_table_lock is enough to be sure that huge zero pmd is not
* under splitting since we don't split the page itself, only pmd to
* a page table.
*/

2012-11-18 06:23:57

by Jaegeuk Hanse

Subject: Re: [PATCH v6 10/12] thp: implement refcounting for huge zero page

On 11/16/2012 03:27 AM, Kirill A. Shutemov wrote:
> From: "Kirill A. Shutemov" <[email protected]>
>
> H. Peter Anvin doesn't like huge zero page which sticks in memory forever
> after the first allocation. Here's implementation of lockless refcounting
> for huge zero page.
>
> We have two basic primitives: {get,put}_huge_zero_page(). They
> manipulate reference counter.
>
> If counter is 0, get_huge_zero_page() allocates a new huge page and
> takes two references: one for caller and one for shrinker. We free the
> page only in shrinker callback if counter is 1 (only shrinker has the
> reference).
>
> put_huge_zero_page() only decrements counter. Counter is never zero
> in put_huge_zero_page() since shrinker holds on reference.
>
> Freeing huge zero page in shrinker callback helps to avoid frequent
> allocate-free.
>
> Refcounting has cost. On 4 socket machine I observe ~1% slowdown on
> parallel (40 processes) read page faulting comparing to lazy huge page
> allocation. I think it's pretty reasonable for synthetic benchmark.

Hi Kirill,

I see your and Andrew's hot discussion in the v4 resend thread.

"I also tried another scenario: usemem -n16 100M -r 1000. It creates
real memory pressure - no easy reclaimable memory. This time callback
called with nr_to_scan > 0 and we freed hzp. "

What's "usemem"? Is it a tool and how to get it? It's hard for me to
find nr_to_scan > 0 in every callset, how can nr_to_scan > 0 in your
scenario?

Regards,
Jaegeuk


2012-11-19 09:55:09

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCH v6 10/12] thp: implement refcounting for huge zero page

On Sun, Nov 18, 2012 at 02:23:44PM +0800, Jaegeuk Hanse wrote:
> On 11/16/2012 03:27 AM, Kirill A. Shutemov wrote:
> >From: "Kirill A. Shutemov" <[email protected]>
> >
> >H. Peter Anvin doesn't like huge zero page which sticks in memory forever
> >after the first allocation. Here's implementation of lockless refcounting
> >for huge zero page.
> >
> >We have two basic primitives: {get,put}_huge_zero_page(). They
> >manipulate reference counter.
> >
> >If counter is 0, get_huge_zero_page() allocates a new huge page and
> >takes two references: one for caller and one for shrinker. We free the
> >page only in shrinker callback if counter is 1 (only shrinker has the
> >reference).
> >
> >put_huge_zero_page() only decrements counter. Counter is never zero
> >in put_huge_zero_page() since shrinker holds on reference.
> >
> >Freeing huge zero page in shrinker callback helps to avoid frequent
> >allocate-free.
> >
> >Refcounting has cost. On 4 socket machine I observe ~1% slowdown on
> >parallel (40 processes) read page faulting comparing to lazy huge page
> >allocation. I think it's pretty reasonable for synthetic benchmark.
>
> Hi Kirill,
>
> I see your and Andew's hot discussion in v4 resend thread.
>
> "I also tried another scenario: usemem -n16 100M -r 1000. It creates
> real memory pressure - no easy reclaimable memory. This time
> callback called with nr_to_scan > 0 and we freed hzp. "
>
> What's "usemem"? Is it a tool and how to get it?

http://www.spinics.net/lists/linux-mm/attachments/gtarazbJaHPaAT.gtar

> It's hard for me to
> find nr_to_scan > 0 in every callset, how can nr_to_scan > 0 in your
> scenario?

shrink_slab() calls the callback with nr_to_scan > 0 when the system is
under memory pressure -- see do_shrinker_shrink().
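
For readers trying to follow the nr_to_scan discussion, here is a rough
userspace model of the protocol the patch implements -- a sketch only,
with C11 atomics and calloc() standing in for the kernel primitives and
purely illustrative names (get_huge_zero_page_model() and friends are not
part of the patch, and the preempt_disable() window is not modelled):

/*
 * shrink_model(0) is the shrinker's "how much could you free?" query;
 * shrink_model(n > 0) is the reclaim pass shrink_slab() issues under
 * memory pressure.
 */
#include <stdatomic.h>
#include <stdio.h>
#include <stdlib.h>

#define MODEL_HPAGE_SIZE (2UL << 20)   /* stands in for HPAGE_PMD_SIZE */
#define MODEL_HPAGE_NR   512UL         /* stands in for HPAGE_PMD_NR */

static atomic_ulong refcount;          /* 0 == no zero page allocated */
static void *_Atomic zero_page;        /* stands in for huge_zero_pfn */

static void *get_huge_zero_page_model(void)
{
    unsigned long c;
retry:
    /* Fast path: equivalent of atomic_inc_not_zero(). */
    c = atomic_load(&refcount);
    while (c != 0) {
        if (atomic_compare_exchange_weak(&refcount, &c, c + 1))
            return atomic_load(&zero_page);
    }

    /* Slow path: allocate a page and try to publish it. */
    void *page = calloc(1, MODEL_HPAGE_SIZE);
    if (!page)
        return NULL;
    void *expected = NULL;
    if (!atomic_compare_exchange_strong(&zero_page, &expected, page)) {
        free(page);             /* lost the race; someone else published */
        goto retry;
    }
    /* Two references: one for the caller, one held by the shrinker. */
    atomic_store(&refcount, 2);
    return page;
}

static void put_huge_zero_page_model(void)
{
    /* Must never drop the last reference -- only the shrinker may. */
    if (atomic_fetch_sub(&refcount, 1) <= 1)
        abort();                /* models the BUG_ON() in the patch */
}

static unsigned long shrink_model(unsigned long nr_to_scan)
{
    if (nr_to_scan == 0)
        /* Freeable only if the shrinker holds the last reference. */
        return atomic_load(&refcount) == 1 ? MODEL_HPAGE_NR : 0;

    unsigned long one = 1;
    if (atomic_compare_exchange_strong(&refcount, &one, 0))
        free(atomic_exchange(&zero_page, NULL));
    return 0;
}

int main(void)
{
    void *p = get_huge_zero_page_model();
    printf("got %p, refcount=%lu\n", p, atomic_load(&refcount)); /* 2 */
    put_huge_zero_page_model();                                  /* 1 */
    printf("query: %lu freeable\n", shrink_model(0));            /* 512 */
    shrink_model(128);                                           /* frees it */
    printf("after scan: refcount=%lu\n", atomic_load(&refcount));
    return 0;
}

The point relevant to this thread is the last branch: the page is freed
only when the callback runs with nr_to_scan > 0 and the shrinker holds
the sole remaining reference.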

--
Kirill A. Shutemov


2012-11-19 10:20:13

by Jaegeuk Hanse

[permalink] [raw]
Subject: Re: [PATCH v6 10/12] thp: implement refcounting for huge zero page

On 11/19/2012 05:56 PM, Kirill A. Shutemov wrote:
> On Sun, Nov 18, 2012 at 02:23:44PM +0800, Jaegeuk Hanse wrote:
>> On 11/16/2012 03:27 AM, Kirill A. Shutemov wrote:
>>> From: "Kirill A. Shutemov" <[email protected]>
>>>
>>> H. Peter Anvin doesn't like huge zero page which sticks in memory forever
>>> after the first allocation. Here's implementation of lockless refcounting
>>> for huge zero page.
>>>
>>> We have two basic primitives: {get,put}_huge_zero_page(). They
>>> manipulate reference counter.
>>>
>>> If counter is 0, get_huge_zero_page() allocates a new huge page and
>>> takes two references: one for caller and one for shrinker. We free the
>>> page only in shrinker callback if counter is 1 (only shrinker has the
>>> reference).
>>>
>>> put_huge_zero_page() only decrements counter. Counter is never zero
>>> in put_huge_zero_page() since shrinker holds on reference.
>>>
>>> Freeing huge zero page in shrinker callback helps to avoid frequent
>>> allocate-free.
>>>
>>> Refcounting has cost. On 4 socket machine I observe ~1% slowdown on
>>> parallel (40 processes) read page faulting comparing to lazy huge page
>>> allocation. I think it's pretty reasonable for synthetic benchmark.
>> Hi Kirill,
>>
>> I see your and Andew's hot discussion in v4 resend thread.
>>
>> "I also tried another scenario: usemem -n16 100M -r 1000. It creates
>> real memory pressure - no easy reclaimable memory. This time
>> callback called with nr_to_scan > 0 and we freed hzp. "
>>
>> What's "usemem"? Is it a tool and how to get it?
> http://www.spinics.net/lists/linux-mm/attachments/gtarazbJaHPaAT.gtar

Thanks for your response. But how do I use it? I can't even compile the
files.

# ./case-lru-file-mmap-read
./case-lru-file-mmap-read: line 3: hw_vars: No such file or directory
./case-lru-file-mmap-read: line 7: 10 * mem / nr_cpu: division by 0
(error token is "nr_cpu")

# gcc usemem.c -o usemem
/tmp/ccFkIDWk.o: In function `do_task':
usemem.c:(.text+0x9f2): undefined reference to `pthread_create'
usemem.c:(.text+0xa44): undefined reference to `pthread_join'
collect2: ld returned 1 exit status

>
>> It's hard for me to
>> find nr_to_scan > 0 in every callset, how can nr_to_scan > 0 in your
>> scenario?
> shrink_slab() calls the callback with nr_to_scan > 0 if system is under
> pressure -- look for do_shrinker_shrink().

Why doesn't Andrew's example (dd if=/fast-disk/large-file) hit this
path? I think it also creates memory pressure; what am I missing?

2012-11-19 10:23:13

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCH v6 10/12] thp: implement refcounting for huge zero page

On Mon, Nov 19, 2012 at 06:20:01PM +0800, Jaegeuk Hanse wrote:
> On 11/19/2012 05:56 PM, Kirill A. Shutemov wrote:
> >On Sun, Nov 18, 2012 at 02:23:44PM +0800, Jaegeuk Hanse wrote:
> >>On 11/16/2012 03:27 AM, Kirill A. Shutemov wrote:
> >>>From: "Kirill A. Shutemov" <[email protected]>
> >>>
> >>>H. Peter Anvin doesn't like huge zero page which sticks in memory forever
> >>>after the first allocation. Here's implementation of lockless refcounting
> >>>for huge zero page.
> >>>
> >>>We have two basic primitives: {get,put}_huge_zero_page(). They
> >>>manipulate reference counter.
> >>>
> >>>If counter is 0, get_huge_zero_page() allocates a new huge page and
> >>>takes two references: one for caller and one for shrinker. We free the
> >>>page only in shrinker callback if counter is 1 (only shrinker has the
> >>>reference).
> >>>
> >>>put_huge_zero_page() only decrements counter. Counter is never zero
> >>>in put_huge_zero_page() since shrinker holds on reference.
> >>>
> >>>Freeing huge zero page in shrinker callback helps to avoid frequent
> >>>allocate-free.
> >>>
> >>>Refcounting has cost. On 4 socket machine I observe ~1% slowdown on
> >>>parallel (40 processes) read page faulting comparing to lazy huge page
> >>>allocation. I think it's pretty reasonable for synthetic benchmark.
> >>Hi Kirill,
> >>
> >>I see your and Andew's hot discussion in v4 resend thread.
> >>
> >>"I also tried another scenario: usemem -n16 100M -r 1000. It creates
> >>real memory pressure - no easy reclaimable memory. This time
> >>callback called with nr_to_scan > 0 and we freed hzp. "
> >>
> >>What's "usemem"? Is it a tool and how to get it?
> >http://www.spinics.net/lists/linux-mm/attachments/gtarazbJaHPaAT.gtar
>
> Thanks for your response. But how to use it, I even can't compile
> the files.
>
> # ./case-lru-file-mmap-read
> ./case-lru-file-mmap-read: line 3: hw_vars: No such file or directory
> ./case-lru-file-mmap-read: line 7: 10 * mem / nr_cpu: division by 0
> (error token is "nr_cpu")
>
> # gcc usemem.c -o usemem

-lpthread
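
(That is, usemem uses POSIX threads, so the link step needs the pthread
library -- something like

    gcc usemem.c -o usemem -lpthread

should resolve the pthread_create/pthread_join references.)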

> /tmp/ccFkIDWk.o: In function `do_task':
> usemem.c:(.text+0x9f2): undefined reference to `pthread_create'
> usemem.c:(.text+0xa44): undefined reference to `pthread_join'
> collect2: ld returned 1 exit status
>
> >
> >>It's hard for me to
> >>find nr_to_scan > 0 in every callset, how can nr_to_scan > 0 in your
> >>scenario?
> >shrink_slab() calls the callback with nr_to_scan > 0 if system is under
> >pressure -- look for do_shrinker_shrink().
>
> Why Andrew's example(dd if=/fast-disk/large-file) doesn't call this
> path? I think it also can add memory pressure, where I miss?

dd if=large-file only fills the page cache -- easily reclaimable memory.
The page cache is dropped first, before slabs are shrunk.
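
If you want to reproduce that kind of pressure without building the whole
benchmark suite, a minimal anonymous-memory hog along these lines should
work (a sketch only, not the actual usemem.c; build it with something
like gcc -O2 hog.c -o hog):

/* Minimal anonymous-memory hog: unlike dd-driven page cache, this memory
 * is not easily reclaimable, so sustained pressure can reach the shrinker
 * callbacks. */
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    size_t mb = argc > 1 ? strtoul(argv[1], NULL, 0) : 1024;
    size_t len = mb << 20;

    char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) {
        perror("mmap");
        return 1;
    }
    /* Touch every page so the kernel must back it with real memory. */
    for (size_t off = 0; off < len; off += 4096)
        p[off] = 1;
    printf("holding %zu MB of anonymous memory; Ctrl-C to exit\n", mb);
    pause();
    return 0;
}

A few instances sized close to the available RAM give the "no easy
reclaimable memory" situation from the v4 discussion, which is when the
callback gets called with nr_to_scan > 0.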

--
Kirill A. Shutemov

2012-11-19 11:02:31

by Jaegeuk Hanse

[permalink] [raw]
Subject: Re: [PATCH v6 10/12] thp: implement refcounting for huge zero page

On 11/19/2012 06:23 PM, Kirill A. Shutemov wrote:
> On Mon, Nov 19, 2012 at 06:20:01PM +0800, Jaegeuk Hanse wrote:
>> On 11/19/2012 05:56 PM, Kirill A. Shutemov wrote:
>>> On Sun, Nov 18, 2012 at 02:23:44PM +0800, Jaegeuk Hanse wrote:
>>>> On 11/16/2012 03:27 AM, Kirill A. Shutemov wrote:
>>>>> From: "Kirill A. Shutemov" <[email protected]>
>>>>>
>>>>> H. Peter Anvin doesn't like huge zero page which sticks in memory forever
>>>>> after the first allocation. Here's implementation of lockless refcounting
>>>>> for huge zero page.
>>>>>
>>>>> We have two basic primitives: {get,put}_huge_zero_page(). They
>>>>> manipulate reference counter.
>>>>>
>>>>> If counter is 0, get_huge_zero_page() allocates a new huge page and
>>>>> takes two references: one for caller and one for shrinker. We free the
>>>>> page only in shrinker callback if counter is 1 (only shrinker has the
>>>>> reference).
>>>>>
>>>>> put_huge_zero_page() only decrements counter. Counter is never zero
>>>>> in put_huge_zero_page() since shrinker holds on reference.
>>>>>
>>>>> Freeing huge zero page in shrinker callback helps to avoid frequent
>>>>> allocate-free.
>>>>>
>>>>> Refcounting has cost. On 4 socket machine I observe ~1% slowdown on
>>>>> parallel (40 processes) read page faulting comparing to lazy huge page
>>>>> allocation. I think it's pretty reasonable for synthetic benchmark.
>>>> Hi Kirill,
>>>>
>>>> I see your and Andew's hot discussion in v4 resend thread.
>>>>
>>>> "I also tried another scenario: usemem -n16 100M -r 1000. It creates
>>>> real memory pressure - no easy reclaimable memory. This time
>>>> callback called with nr_to_scan > 0 and we freed hzp. "
>>>>
>>>> What's "usemem"? Is it a tool and how to get it?
>>> http://www.spinics.net/lists/linux-mm/attachments/gtarazbJaHPaAT.gtar
>> Thanks for your response. But how to use it, I even can't compile
>> the files.
>>
>> # ./case-lru-file-mmap-read
>> ./case-lru-file-mmap-read: line 3: hw_vars: No such file or directory
>> ./case-lru-file-mmap-read: line 7: 10 * mem / nr_cpu: division by 0
>> (error token is "nr_cpu")
>>
>> # gcc usemem.c -o usemem
> -lpthread
>
>> /tmp/ccFkIDWk.o: In function `do_task':
>> usemem.c:(.text+0x9f2): undefined reference to `pthread_create'
>> usemem.c:(.text+0xa44): undefined reference to `pthread_join'
>> collect2: ld returned 1 exit status
>>
>>>> It's hard for me to
>>>> find nr_to_scan > 0 in every callset, how can nr_to_scan > 0 in your
>>>> scenario?
>>> shrink_slab() calls the callback with nr_to_scan > 0 if system is under
>>> pressure -- look for do_shrinker_shrink().
>> Why Andrew's example(dd if=/fast-disk/large-file) doesn't call this
>> path? I think it also can add memory pressure, where I miss?
> dd if=large-file only fills pagecache -- easy reclaimable memory.
> Pagecache will be dropped first, before shrinking slabs.

How can I confirm that page reclaim is working hard and that slabs are
actually being reclaimed at that point?

2012-11-19 11:09:32

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCH v6 10/12] thp: implement refcounting for huge zero page

On Mon, Nov 19, 2012 at 07:02:22PM +0800, Jaegeuk Hanse wrote:
> On 11/19/2012 06:23 PM, Kirill A. Shutemov wrote:
> >On Mon, Nov 19, 2012 at 06:20:01PM +0800, Jaegeuk Hanse wrote:
> >>On 11/19/2012 05:56 PM, Kirill A. Shutemov wrote:
> >>>On Sun, Nov 18, 2012 at 02:23:44PM +0800, Jaegeuk Hanse wrote:
> >>>>On 11/16/2012 03:27 AM, Kirill A. Shutemov wrote:
> >>>>>From: "Kirill A. Shutemov" <[email protected]>
> >>>>>
> >>>>>H. Peter Anvin doesn't like huge zero page which sticks in memory forever
> >>>>>after the first allocation. Here's implementation of lockless refcounting
> >>>>>for huge zero page.
> >>>>>
> >>>>>We have two basic primitives: {get,put}_huge_zero_page(). They
> >>>>>manipulate reference counter.
> >>>>>
> >>>>>If counter is 0, get_huge_zero_page() allocates a new huge page and
> >>>>>takes two references: one for caller and one for shrinker. We free the
> >>>>>page only in shrinker callback if counter is 1 (only shrinker has the
> >>>>>reference).
> >>>>>
> >>>>>put_huge_zero_page() only decrements counter. Counter is never zero
> >>>>>in put_huge_zero_page() since shrinker holds on reference.
> >>>>>
> >>>>>Freeing huge zero page in shrinker callback helps to avoid frequent
> >>>>>allocate-free.
> >>>>>
> >>>>>Refcounting has cost. On 4 socket machine I observe ~1% slowdown on
> >>>>>parallel (40 processes) read page faulting comparing to lazy huge page
> >>>>>allocation. I think it's pretty reasonable for synthetic benchmark.
> >>>>Hi Kirill,
> >>>>
> >>>>I see your and Andew's hot discussion in v4 resend thread.
> >>>>
> >>>>"I also tried another scenario: usemem -n16 100M -r 1000. It creates
> >>>>real memory pressure - no easy reclaimable memory. This time
> >>>>callback called with nr_to_scan > 0 and we freed hzp. "
> >>>>
> >>>>What's "usemem"? Is it a tool and how to get it?
> >>>http://www.spinics.net/lists/linux-mm/attachments/gtarazbJaHPaAT.gtar
> >>Thanks for your response. But how to use it, I even can't compile
> >>the files.
> >>
> >># ./case-lru-file-mmap-read
> >>./case-lru-file-mmap-read: line 3: hw_vars: No such file or directory
> >>./case-lru-file-mmap-read: line 7: 10 * mem / nr_cpu: division by 0
> >>(error token is "nr_cpu")
> >>
> >># gcc usemem.c -o usemem
> >-lpthread
> >
> >>/tmp/ccFkIDWk.o: In function `do_task':
> >>usemem.c:(.text+0x9f2): undefined reference to `pthread_create'
> >>usemem.c:(.text+0xa44): undefined reference to `pthread_join'
> >>collect2: ld returned 1 exit status
> >>
> >>>>It's hard for me to
> >>>>find nr_to_scan > 0 in every callset, how can nr_to_scan > 0 in your
> >>>>scenario?
> >>>shrink_slab() calls the callback with nr_to_scan > 0 if system is under
> >>>pressure -- look for do_shrinker_shrink().
> >>Why Andrew's example(dd if=/fast-disk/large-file) doesn't call this
> >>path? I think it also can add memory pressure, where I miss?
> >dd if=large-file only fills pagecache -- easy reclaimable memory.
> >Pagecache will be dropped first, before shrinking slabs.
>
> How could I confirm page reclaim working hard and slabs are
> reclaimed at this time?

The only indicator I see is slabs_scanned in vmstat.
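
For example, sample that counter around the run (assuming a stock
/proc/vmstat):

    grep slabs_scanned /proc/vmstat

If the value grows while the workload is running, shrink_slab() was
invoked with real scanning work, i.e. the nr_to_scan > 0 case above.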

--
Kirill A. Shutemov

2012-11-19 11:29:39

by Jaegeuk Hanse

[permalink] [raw]
Subject: Re: [PATCH v6 10/12] thp: implement refcounting for huge zero page

On 11/19/2012 07:09 PM, Kirill A. Shutemov wrote:
> On Mon, Nov 19, 2012 at 07:02:22PM +0800, Jaegeuk Hanse wrote:
>> On 11/19/2012 06:23 PM, Kirill A. Shutemov wrote:
>>> On Mon, Nov 19, 2012 at 06:20:01PM +0800, Jaegeuk Hanse wrote:
>>>> On 11/19/2012 05:56 PM, Kirill A. Shutemov wrote:
>>>>> On Sun, Nov 18, 2012 at 02:23:44PM +0800, Jaegeuk Hanse wrote:
>>>>>> On 11/16/2012 03:27 AM, Kirill A. Shutemov wrote:
>>>>>>> From: "Kirill A. Shutemov" <[email protected]>
>>>>>>>
>>>>>>> H. Peter Anvin doesn't like huge zero page which sticks in memory forever
>>>>>>> after the first allocation. Here's implementation of lockless refcounting
>>>>>>> for huge zero page.
>>>>>>>
>>>>>>> We have two basic primitives: {get,put}_huge_zero_page(). They
>>>>>>> manipulate reference counter.
>>>>>>>
>>>>>>> If counter is 0, get_huge_zero_page() allocates a new huge page and
>>>>>>> takes two references: one for caller and one for shrinker. We free the
>>>>>>> page only in shrinker callback if counter is 1 (only shrinker has the
>>>>>>> reference).
>>>>>>>
>>>>>>> put_huge_zero_page() only decrements counter. Counter is never zero
>>>>>>> in put_huge_zero_page() since shrinker holds on reference.
>>>>>>>
>>>>>>> Freeing huge zero page in shrinker callback helps to avoid frequent
>>>>>>> allocate-free.
>>>>>>>
>>>>>>> Refcounting has cost. On 4 socket machine I observe ~1% slowdown on
>>>>>>> parallel (40 processes) read page faulting comparing to lazy huge page
>>>>>>> allocation. I think it's pretty reasonable for synthetic benchmark.
>>>>>> Hi Kirill,
>>>>>>
>>>>>> I see your and Andew's hot discussion in v4 resend thread.
>>>>>>
>>>>>> "I also tried another scenario: usemem -n16 100M -r 1000. It creates
>>>>>> real memory pressure - no easy reclaimable memory. This time
>>>>>> callback called with nr_to_scan > 0 and we freed hzp. "
>>>>>>
>>>>>> What's "usemem"? Is it a tool and how to get it?
>>>>> http://www.spinics.net/lists/linux-mm/attachments/gtarazbJaHPaAT.gtar
>>>> Thanks for your response. But how to use it, I even can't compile
>>>> the files.
>>>>
>>>> # ./case-lru-file-mmap-read
>>>> ./case-lru-file-mmap-read: line 3: hw_vars: No such file or directory
>>>> ./case-lru-file-mmap-read: line 7: 10 * mem / nr_cpu: division by 0
>>>> (error token is "nr_cpu")
>>>>
>>>> # gcc usemem.c -o usemem
>>> -lpthread
>>>
>>>> /tmp/ccFkIDWk.o: In function `do_task':
>>>> usemem.c:(.text+0x9f2): undefined reference to `pthread_create'
>>>> usemem.c:(.text+0xa44): undefined reference to `pthread_join'
>>>> collect2: ld returned 1 exit status
>>>>
>>>>>> It's hard for me to
>>>>>> find nr_to_scan > 0 in every callset, how can nr_to_scan > 0 in your
>>>>>> scenario?
>>>>> shrink_slab() calls the callback with nr_to_scan > 0 if system is under
>>>>> pressure -- look for do_shrinker_shrink().
>>>> Why Andrew's example(dd if=/fast-disk/large-file) doesn't call this
>>>> path? I think it also can add memory pressure, where I miss?
>>> dd if=large-file only fills pagecache -- easy reclaimable memory.
>>> Pagecache will be dropped first, before shrinking slabs.
>> How could I confirm page reclaim working hard and slabs are
>> reclaimed at this time?
> The only what I see is slabs_scanned in vmstat.

Oh, I see. Thanks! :-)