2012-10-15 06:01:19

by Kirill A. Shutemov

Subject: [PATCH v4 00/10, REBASED] Introduce huge zero page

From: "Kirill A. Shutemov" <[email protected]>

Hi,

Andrew, here's huge zero page patchset rebased to v3.7-rc1.

Andrea, I've dropped your Reviewed-by due to not-so-trivial conflicts during
the rebase. Could you look through it again? Patches 2, 3, 4, 7 and 10 had
conflicts, mostly due to the new MMU notifiers interface.

=================

During testing I noticed a big (up to 2.5x) memory consumption overhead on
some workloads (e.g. ft.A from NPB) if THP is enabled.

The main reason for the big difference is the lack of a zero page in the
THP case: we have to allocate a real huge page even on a read page fault.

A program to demonstrate the issue:
#include <assert.h>
#include <stdlib.h>
#include <unistd.h>

#define MB 1024*1024

int main(int argc, char **argv)
{
        char *p;
        int i;

        posix_memalign((void **)&p, 2 * MB, 200 * MB);
        for (i = 0; i < 200 * MB; i += 4096)
                assert(p[i] == 0);
        pause();
        return 0;
}

With thp-never RSS is about 400k, but with thp-always it's 200M.
After the patchset, thp-always RSS is 400k too.
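
For reference, RSS can be checked via VmRSS in /proc/<pid>/status; a small
helper along these lines inside the test program prints the same value (an
illustrative sketch, not part of the patchset):

#include <stdio.h>
#include <string.h>

/* Print this process's VmRSS line from /proc/self/status. */
static void print_rss(void)
{
        char line[256];
        FILE *f = fopen("/proc/self/status", "r");

        if (!f)
                return;
        while (fgets(line, sizeof(line), f)) {
                if (!strncmp(line, "VmRSS:", 6)) {
                        fputs(line, stdout);
                        break;
                }
        }
        fclose(f);
}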

Design overview.

Huge zero page (hzp) is a non-movable huge page (2M on x86-64) filled with
zeros. The way we allocate it changes over the course of the patchset:

- [01/10] simplest way: hzp allocated at boot time in hugepage_init();
- [09/10] lazy allocation on first use;
- [10/10] lockless refcounting + shrinker-reclaimable hzp.

We set it up in do_huge_pmd_anonymous_page() if the area around the fault
address is suitable for THP and we've got a read page fault.
If we fail to set up the hzp (ENOMEM) we fall back to handle_pte_fault()
as we normally do in THP.

On a wp fault to the hzp we allocate real memory for the huge page and
clear it. On ENOMEM we fall back gracefully: we create a new pmd table and
set the pte around the fault address to a newly allocated normal (4k) page.
All other ptes in the pmd are set to the normal zero page.

We cannot split the hzp (and it's a bug if we try), but we can split the
pmd which points to it. On splitting the pmd we create a pte table with all
ptes set to the normal zero page.
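
As a userspace illustration of the wp-fault behaviour described above (a
sketch assuming THP set to "always" and this patchset applied; not part of
the patchset itself): after the read-only scan RSS stays small because only
the huge zero page is mapped, and writing a single byte allocates exactly
one real huge page, growing RSS by roughly 2M.

#include <assert.h>
#include <stdlib.h>
#include <unistd.h>

#define MB (1024 * 1024)

int main(void)
{
        char *p;
        int i;

        if (posix_memalign((void **)&p, 2 * MB, 200 * MB))
                return 1;
        for (i = 0; i < 200 * MB; i += 4096)    /* read faults: huge zero page */
                assert(p[i] == 0);
        p[0] = 1;       /* wp fault: one real 2M huge page is allocated and cleared */
        pause();        /* keep the mapping alive so RSS can be inspected externally */
        return 0;
}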

The patchset is organized in a bisect-friendly way:
Patches 01-07: prepare all code paths for hzp
Patch 08: all code paths are covered: safe to setup hzp
Patch 09: lazy allocation
Patch 10: lockless refcounting for hzp

v4:
- Rebase to v3.7-rc1;
- Update commit message;
v3:
- fix potential deadlock in refcounting code on preemptible kernel.
- do not mark huge zero page as movable.
- fix typo in comment.
- Reviewed-by tag from Andrea Arcangeli.
v2:
- Avoid find_vma() if we've already had vma on stack.
Suggested by Andrea Arcangeli.
- Implement refcounting for huge zero page.

--------------------------------------------------------------------------

At hpa's request I've tried an alternative approach to the hzp
implementation (see the virtual huge zero page patchset): a pmd table with
all entries set to the normal (4k) zero page. That way should be more cache
friendly, but it increases TLB pressure.

The problem with the virtual huge zero page: it requires per-arch enabling.
We need a way to mark that a pmd table has all ptes set to the zero page.

Some numbers to compare the two implementations (on a 4-socket Westmere-EX):

Microbenchmark1
==============

test:
        posix_memalign((void **)&p, 2 * MB, 8 * GB);
        for (i = 0; i < 100; i++) {
                assert(memcmp(p, p + 4*GB, 4*GB) == 0);
                asm volatile ("": : :"memory");
        }

hzp:
Performance counter stats for './test_memcmp' (5 runs):

32356.272845 task-clock # 0.998 CPUs utilized ( +- 0.13% )
40 context-switches # 0.001 K/sec ( +- 0.94% )
0 CPU-migrations # 0.000 K/sec
4,218 page-faults # 0.130 K/sec ( +- 0.00% )
76,712,481,765 cycles # 2.371 GHz ( +- 0.13% ) [83.31%]
36,279,577,636 stalled-cycles-frontend # 47.29% frontend cycles idle ( +- 0.28% ) [83.35%]
1,684,049,110 stalled-cycles-backend # 2.20% backend cycles idle ( +- 2.96% ) [66.67%]
134,355,715,816 instructions # 1.75 insns per cycle
# 0.27 stalled cycles per insn ( +- 0.10% ) [83.35%]
13,526,169,702 branches # 418.039 M/sec ( +- 0.10% ) [83.31%]
1,058,230 branch-misses # 0.01% of all branches ( +- 0.91% ) [83.36%]

32.413866442 seconds time elapsed ( +- 0.13% )

vhzp:
Performance counter stats for './test_memcmp' (5 runs):

30327.183829 task-clock # 0.998 CPUs utilized ( +- 0.13% )
38 context-switches # 0.001 K/sec ( +- 1.53% )
0 CPU-migrations # 0.000 K/sec
4,218 page-faults # 0.139 K/sec ( +- 0.01% )
71,964,773,660 cycles # 2.373 GHz ( +- 0.13% ) [83.35%]
31,191,284,231 stalled-cycles-frontend # 43.34% frontend cycles idle ( +- 0.40% ) [83.32%]
773,484,474 stalled-cycles-backend # 1.07% backend cycles idle ( +- 6.61% ) [66.67%]
134,982,215,437 instructions # 1.88 insns per cycle
# 0.23 stalled cycles per insn ( +- 0.11% ) [83.32%]
13,509,150,683 branches # 445.447 M/sec ( +- 0.11% ) [83.34%]
1,017,667 branch-misses # 0.01% of all branches ( +- 1.07% ) [83.32%]

30.381324695 seconds time elapsed ( +- 0.13% )

Microbenchmark2
==============

test:
        posix_memalign((void **)&p, 2 * MB, 8 * GB);
        for (i = 0; i < 1000; i++) {
                char *_p = p;
                while (_p < p+4*GB) {
                        assert(*_p == *(_p+4*GB));
                        _p += 4096;
                        asm volatile ("": : :"memory");
                }
        }

hzp:
Performance counter stats for 'taskset -c 0 ./test_memcmp2' (5 runs):

3505.727639 task-clock # 0.998 CPUs utilized ( +- 0.26% )
9 context-switches # 0.003 K/sec ( +- 4.97% )
4,384 page-faults # 0.001 M/sec ( +- 0.00% )
8,318,482,466 cycles # 2.373 GHz ( +- 0.26% ) [33.31%]
5,134,318,786 stalled-cycles-frontend # 61.72% frontend cycles idle ( +- 0.42% ) [33.32%]
2,193,266,208 stalled-cycles-backend # 26.37% backend cycles idle ( +- 5.51% ) [33.33%]
9,494,670,537 instructions # 1.14 insns per cycle
# 0.54 stalled cycles per insn ( +- 0.13% ) [41.68%]
2,108,522,738 branches # 601.451 M/sec ( +- 0.09% ) [41.68%]
158,746 branch-misses # 0.01% of all branches ( +- 1.60% ) [41.71%]
3,168,102,115 L1-dcache-loads # 903.693 M/sec ( +- 0.11% ) [41.70%]
1,048,710,998 L1-dcache-misses # 33.10% of all L1-dcache hits ( +- 0.11% ) [41.72%]
1,047,699,685 LLC-load # 298.854 M/sec ( +- 0.03% ) [33.38%]
2,287 LLC-misses # 0.00% of all LL-cache hits ( +- 8.27% ) [33.37%]
3,166,187,367 dTLB-loads # 903.147 M/sec ( +- 0.02% ) [33.35%]
4,266,538 dTLB-misses # 0.13% of all dTLB cache hits ( +- 0.03% ) [33.33%]

3.513339813 seconds time elapsed ( +- 0.26% )

vhzp:
Performance counter stats for 'taskset -c 0 ./test_memcmp2' (5 runs):

27313.891128 task-clock # 0.998 CPUs utilized ( +- 0.24% )
62 context-switches # 0.002 K/sec ( +- 0.61% )
4,384 page-faults # 0.160 K/sec ( +- 0.01% )
64,747,374,606 cycles # 2.370 GHz ( +- 0.24% ) [33.33%]
61,341,580,278 stalled-cycles-frontend # 94.74% frontend cycles idle ( +- 0.26% ) [33.33%]
56,702,237,511 stalled-cycles-backend # 87.57% backend cycles idle ( +- 0.07% ) [33.33%]
10,033,724,846 instructions # 0.15 insns per cycle
# 6.11 stalled cycles per insn ( +- 0.09% ) [41.65%]
2,190,424,932 branches # 80.195 M/sec ( +- 0.12% ) [41.66%]
1,028,630 branch-misses # 0.05% of all branches ( +- 1.50% ) [41.66%]
3,302,006,540 L1-dcache-loads # 120.891 M/sec ( +- 0.11% ) [41.68%]
271,374,358 L1-dcache-misses # 8.22% of all L1-dcache hits ( +- 0.04% ) [41.66%]
20,385,476 LLC-load # 0.746 M/sec ( +- 1.64% ) [33.34%]
76,754 LLC-misses # 0.38% of all LL-cache hits ( +- 2.35% ) [33.34%]
3,309,927,290 dTLB-loads # 121.181 M/sec ( +- 0.03% ) [33.34%]
2,098,967,427 dTLB-misses # 63.41% of all dTLB cache hits ( +- 0.03% ) [33.34%]

27.364448741 seconds time elapsed ( +- 0.24% )

--------------------------------------------------------------------------

I personally prefer the implementation present in this patchset: it doesn't
touch arch-specific code.


Kirill A. Shutemov (10):
thp: huge zero page: basic preparation
thp: zap_huge_pmd(): zap huge zero pmd
thp: copy_huge_pmd(): copy huge zero page
thp: do_huge_pmd_wp_page(): handle huge zero page
thp: change_huge_pmd(): keep huge zero page write-protected
thp: change split_huge_page_pmd() interface
thp: implement splitting pmd for huge zero page
thp: setup huge zero page on non-write page fault
thp: lazy huge zero page allocation
thp: implement refcounting for huge zero page

Documentation/vm/transhuge.txt | 4 +-
arch/x86/kernel/vm86_32.c | 2 +-
fs/proc/task_mmu.c | 2 +-
include/linux/huge_mm.h | 14 ++-
include/linux/mm.h | 8 +
mm/huge_memory.c | 331 +++++++++++++++++++++++++++++++++++++---
mm/memory.c | 11 +-
mm/mempolicy.c | 2 +-
mm/mprotect.c | 2 +-
mm/mremap.c | 2 +-
mm/pagewalk.c | 2 +-
11 files changed, 334 insertions(+), 46 deletions(-)

--
1.7.7.6


2012-10-15 06:00:18

by Kirill A. Shutemov

Subject: [PATCH v4 01/10] thp: huge zero page: basic preparation

From: "Kirill A. Shutemov" <[email protected]>

Huge zero page (hzp) is a non-movable huge page (2M on x86-64) filled
with zeros.

For now let's allocate the page in hugepage_init(). We'll switch to lazy
allocation later.

We are not going to map the huge zero page until we can handle it
properly on all code paths.

The is_huge_zero_{pfn,pmd}() functions will be used by the following patches
to check whether a given pfn/pmd is the huge zero page.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
mm/huge_memory.c | 30 ++++++++++++++++++++++++++++++
1 files changed, 30 insertions(+), 0 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index a863af2..438adbf 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -46,6 +46,7 @@ static unsigned int khugepaged_scan_sleep_millisecs __read_mostly = 10000;
/* during fragmentation poll the hugepage allocator once every minute */
static unsigned int khugepaged_alloc_sleep_millisecs __read_mostly = 60000;
static struct task_struct *khugepaged_thread __read_mostly;
+static unsigned long huge_zero_pfn __read_mostly;
static DEFINE_MUTEX(khugepaged_mutex);
static DEFINE_SPINLOCK(khugepaged_mm_lock);
static DECLARE_WAIT_QUEUE_HEAD(khugepaged_wait);
@@ -158,6 +159,29 @@ static int start_khugepaged(void)
return err;
}

+static int init_huge_zero_page(void)
+{
+ struct page *hpage;
+
+ hpage = alloc_pages((GFP_TRANSHUGE | __GFP_ZERO) & ~__GFP_MOVABLE,
+ HPAGE_PMD_ORDER);
+ if (!hpage)
+ return -ENOMEM;
+
+ huge_zero_pfn = page_to_pfn(hpage);
+ return 0;
+}
+
+static inline bool is_huge_zero_pfn(unsigned long pfn)
+{
+ return pfn == huge_zero_pfn;
+}
+
+static inline bool is_huge_zero_pmd(pmd_t pmd)
+{
+ return is_huge_zero_pfn(pmd_pfn(pmd));
+}
+
#ifdef CONFIG_SYSFS

static ssize_t double_flag_show(struct kobject *kobj,
@@ -539,6 +563,10 @@ static int __init hugepage_init(void)
if (err)
return err;

+ err = init_huge_zero_page();
+ if (err)
+ goto out;
+
err = khugepaged_slab_init();
if (err)
goto out;
@@ -561,6 +589,8 @@ static int __init hugepage_init(void)

return 0;
out:
+ if (huge_zero_pfn)
+ __free_page(pfn_to_page(huge_zero_pfn));
hugepage_exit_sysfs(hugepage_kobj);
return err;
}
--
1.7.7.6

2012-10-15 06:00:21

by Kirill A. Shutemov

Subject: [PATCH v4 10/10] thp: implement refcounting for huge zero page

From: "Kirill A. Shutemov" <[email protected]>

H. Peter Anvin doesn't like a huge zero page that sticks in memory forever
after the first allocation. Here's an implementation of lockless
refcounting for the huge zero page.

We have two basic primitives: {get,put}_huge_zero_page(). They
manipulate the reference counter.

If the counter is 0, get_huge_zero_page() allocates a new huge page and
takes two references: one for the caller and one for the shrinker. We free
the page only in the shrinker callback, and only if the counter is 1
(i.e. only the shrinker holds a reference).

put_huge_zero_page() only decrements the counter. The counter never reaches
zero in put_huge_zero_page() since the shrinker always holds one reference.

Freeing the huge zero page in the shrinker callback helps to avoid
frequent allocate-free cycles.

Refcounting has a cost. On a 4-socket machine I observe a ~1% slowdown on
parallel (40 processes) read page faulting compared to lazy huge page
allocation. I think that's pretty reasonable for a synthetic benchmark.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
mm/huge_memory.c | 111 ++++++++++++++++++++++++++++++++++++++++++------------
1 files changed, 87 insertions(+), 24 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 8fae26a..a4f2110 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -17,6 +17,7 @@
#include <linux/khugepaged.h>
#include <linux/freezer.h>
#include <linux/mman.h>
+#include <linux/shrinker.h>
#include <asm/tlb.h>
#include <asm/pgalloc.h>
#include "internal.h"
@@ -46,7 +47,6 @@ static unsigned int khugepaged_scan_sleep_millisecs __read_mostly = 10000;
/* during fragmentation poll the hugepage allocator once every minute */
static unsigned int khugepaged_alloc_sleep_millisecs __read_mostly = 60000;
static struct task_struct *khugepaged_thread __read_mostly;
-static unsigned long huge_zero_pfn __read_mostly;
static DEFINE_MUTEX(khugepaged_mutex);
static DEFINE_SPINLOCK(khugepaged_mm_lock);
static DECLARE_WAIT_QUEUE_HEAD(khugepaged_wait);
@@ -159,31 +159,74 @@ static int start_khugepaged(void)
return err;
}

-static int init_huge_zero_pfn(void)
+static atomic_t huge_zero_refcount;
+static unsigned long huge_zero_pfn __read_mostly;
+
+static inline bool is_huge_zero_pfn(unsigned long pfn)
{
- struct page *hpage;
- unsigned long pfn;
+ unsigned long zero_pfn = ACCESS_ONCE(huge_zero_pfn);
+ return zero_pfn && pfn == zero_pfn;
+}
+
+static inline bool is_huge_zero_pmd(pmd_t pmd)
+{
+ return is_huge_zero_pfn(pmd_pfn(pmd));
+}
+
+static unsigned long get_huge_zero_page(void)
+{
+ struct page *zero_page;
+retry:
+ if (likely(atomic_inc_not_zero(&huge_zero_refcount)))
+ return ACCESS_ONCE(huge_zero_pfn);

- hpage = alloc_pages((GFP_TRANSHUGE | __GFP_ZERO) & ~__GFP_MOVABLE,
+ zero_page = alloc_pages((GFP_TRANSHUGE | __GFP_ZERO) & ~__GFP_MOVABLE,
HPAGE_PMD_ORDER);
- if (!hpage)
- return -ENOMEM;
- pfn = page_to_pfn(hpage);
- if (cmpxchg(&huge_zero_pfn, 0, pfn))
- __free_page(hpage);
- return 0;
+ if (!zero_page)
+ return 0;
+ preempt_disable();
+ if (cmpxchg(&huge_zero_pfn, 0, page_to_pfn(zero_page))) {
+ preempt_enable();
+ __free_page(zero_page);
+ goto retry;
+ }
+
+ /* We take additional reference here. It will be put back by shrinker */
+ atomic_set(&huge_zero_refcount, 2);
+ preempt_enable();
+ return ACCESS_ONCE(huge_zero_pfn);
}

-static inline bool is_huge_zero_pfn(unsigned long pfn)
+static void put_huge_zero_page(void)
{
- return huge_zero_pfn && pfn == huge_zero_pfn;
+ /*
+ * Counter should never go to zero here. Only shrinker can put
+ * last reference.
+ */
+ BUG_ON(atomic_dec_and_test(&huge_zero_refcount));
}

-static inline bool is_huge_zero_pmd(pmd_t pmd)
+static int shrink_huge_zero_page(struct shrinker *shrink,
+ struct shrink_control *sc)
{
- return is_huge_zero_pfn(pmd_pfn(pmd));
+ if (!sc->nr_to_scan)
+ /* we can free zero page only if last reference remains */
+ return atomic_read(&huge_zero_refcount) == 1 ? HPAGE_PMD_NR : 0;
+
+ if (atomic_cmpxchg(&huge_zero_refcount, 1, 0) == 1) {
+ unsigned long zero_pfn = xchg(&huge_zero_pfn, 0);
+ BUG_ON(zero_pfn == 0);
+ __free_page(__pfn_to_page(zero_pfn));
+ }
+
+ return 0;
}

+static struct shrinker huge_zero_page_shrinker = {
+ .shrink = shrink_huge_zero_page,
+ .seeks = DEFAULT_SEEKS,
+};
+
#ifdef CONFIG_SYSFS

static ssize_t double_flag_show(struct kobject *kobj,
@@ -575,6 +618,8 @@ static int __init hugepage_init(void)
goto out;
}

+ register_shrinker(&huge_zero_page_shrinker);
+
/*
* By default disable transparent hugepages on smaller systems,
* where the extra memory used could hurt more than TLB overhead
@@ -697,10 +742,11 @@ static inline struct page *alloc_hugepage(int defrag)
#endif

static void set_huge_zero_page(pgtable_t pgtable, struct mm_struct *mm,
- struct vm_area_struct *vma, unsigned long haddr, pmd_t *pmd)
+ struct vm_area_struct *vma, unsigned long haddr, pmd_t *pmd,
+ unsigned long zero_pfn)
{
pmd_t entry;
- entry = pfn_pmd(huge_zero_pfn, vma->vm_page_prot);
+ entry = pfn_pmd(zero_pfn, vma->vm_page_prot);
entry = pmd_wrprotect(entry);
entry = pmd_mkhuge(entry);
set_pmd_at(mm, haddr, pmd, entry);
@@ -723,15 +769,19 @@ int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
return VM_FAULT_OOM;
if (!(flags & FAULT_FLAG_WRITE)) {
pgtable_t pgtable;
- if (unlikely(!huge_zero_pfn && init_huge_zero_pfn())) {
- count_vm_event(THP_FAULT_FALLBACK);
- goto out;
- }
+ unsigned long zero_pfn;
pgtable = pte_alloc_one(mm, haddr);
if (unlikely(!pgtable))
goto out;
+ zero_pfn = get_huge_zero_page();
+ if (unlikely(!zero_pfn)) {
+ pte_free(mm, pgtable);
+ count_vm_event(THP_FAULT_FALLBACK);
+ goto out;
+ }
spin_lock(&mm->page_table_lock);
- set_huge_zero_page(pgtable, mm, vma, haddr, pmd);
+ set_huge_zero_page(pgtable, mm, vma, haddr, pmd,
+ zero_pfn);
spin_unlock(&mm->page_table_lock);
return 0;
}
@@ -800,7 +850,15 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
goto out_unlock;
}
if (is_huge_zero_pmd(pmd)) {
- set_huge_zero_page(pgtable, dst_mm, vma, addr, dst_pmd);
+ unsigned long zero_pfn;
+ /*
+ * get_huge_zero_page() will never allocate a new page here,
+ * since we already have a zero page to copy. It just takes a
+ * reference.
+ */
+ zero_pfn = get_huge_zero_page();
+ set_huge_zero_page(pgtable, dst_mm, vma, addr, dst_pmd,
+ zero_pfn);
ret = 0;
goto out_unlock;
}
@@ -907,6 +965,7 @@ static int do_huge_pmd_wp_zero_page_fallback(struct mm_struct *mm,
smp_wmb(); /* make pte visible before pmd */
pmd_populate(mm, pmd, pgtable);
spin_unlock(&mm->page_table_lock);
+ put_huge_zero_page();

mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);

@@ -1108,8 +1167,10 @@ alloc:
page_add_new_anon_rmap(new_page, vma, haddr);
set_pmd_at(mm, haddr, pmd, entry);
update_mmu_cache_pmd(vma, address, pmd);
- if (is_huge_zero_pmd(orig_pmd))
+ if (is_huge_zero_pmd(orig_pmd)) {
add_mm_counter(mm, MM_ANONPAGES, HPAGE_PMD_NR);
+ put_huge_zero_page();
+ }
if (page) {
VM_BUG_ON(!PageHead(page));
page_remove_rmap(page);
@@ -1187,6 +1248,7 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
if (is_huge_zero_pmd(orig_pmd)) {
tlb->mm->nr_ptes--;
spin_unlock(&tlb->mm->page_table_lock);
+ put_huge_zero_page();
} else {
page = pmd_page(orig_pmd);
page_remove_rmap(page);
@@ -2543,6 +2605,7 @@ static void __split_huge_zero_page_pmd(struct vm_area_struct *vma,
}
smp_wmb(); /* make pte visible before pmd */
pmd_populate(vma->vm_mm, pmd, pgtable);
+ put_huge_zero_page();
}

void __split_huge_page_pmd(struct vm_area_struct *vma, unsigned long address,
--
1.7.7.6

2012-10-15 06:00:26

by Kirill A. Shutemov

Subject: [PATCH v4 09/10] thp: lazy huge zero page allocation

From: "Kirill A. Shutemov" <[email protected]>

Instead of allocating the huge zero page in hugepage_init() we can postpone
it until the first huge zero page mapping. This saves memory if THP is not
in use.

cmpxchg() is used to avoid a race on huge_zero_pfn initialization.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
mm/huge_memory.c | 20 ++++++++++----------
1 files changed, 10 insertions(+), 10 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index da7e07b..8fae26a 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -159,22 +159,24 @@ static int start_khugepaged(void)
return err;
}

-static int init_huge_zero_page(void)
+static int init_huge_zero_pfn(void)
{
struct page *hpage;
+ unsigned long pfn;

hpage = alloc_pages((GFP_TRANSHUGE | __GFP_ZERO) & ~__GFP_MOVABLE,
HPAGE_PMD_ORDER);
if (!hpage)
return -ENOMEM;
-
- huge_zero_pfn = page_to_pfn(hpage);
+ pfn = page_to_pfn(hpage);
+ if (cmpxchg(&huge_zero_pfn, 0, pfn))
+ __free_page(hpage);
return 0;
}

static inline bool is_huge_zero_pfn(unsigned long pfn)
{
- return pfn == huge_zero_pfn;
+ return huge_zero_pfn && pfn == huge_zero_pfn;
}

static inline bool is_huge_zero_pmd(pmd_t pmd)
@@ -563,10 +565,6 @@ static int __init hugepage_init(void)
if (err)
return err;

- err = init_huge_zero_page();
- if (err)
- goto out;
-
err = khugepaged_slab_init();
if (err)
goto out;
@@ -589,8 +587,6 @@ static int __init hugepage_init(void)

return 0;
out:
- if (huge_zero_pfn)
- __free_page(pfn_to_page(huge_zero_pfn));
hugepage_exit_sysfs(hugepage_kobj);
return err;
}
@@ -727,6 +723,10 @@ int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
return VM_FAULT_OOM;
if (!(flags & FAULT_FLAG_WRITE)) {
pgtable_t pgtable;
+ if (unlikely(!huge_zero_pfn && init_huge_zero_pfn())) {
+ count_vm_event(THP_FAULT_FALLBACK);
+ goto out;
+ }
pgtable = pte_alloc_one(mm, haddr);
if (unlikely(!pgtable))
goto out;
--
1.7.7.6

2012-10-15 06:00:42

by Kirill A. Shutemov

Subject: [PATCH v4 05/10] thp: change_huge_pmd(): keep huge zero page write-protected

From: "Kirill A. Shutemov" <[email protected]>

We want to get a page fault on a write attempt to the huge zero page, so
let's keep it write-protected.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
mm/huge_memory.c | 2 ++
1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 76548b1..8dbb1e4 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1258,6 +1258,8 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
pmd_t entry;
entry = pmdp_get_and_clear(mm, addr, pmd);
entry = pmd_modify(entry, newprot);
+ if (is_huge_zero_pmd(entry))
+ entry = pmd_wrprotect(entry);
set_pmd_at(mm, addr, pmd, entry);
spin_unlock(&vma->vm_mm->page_table_lock);
ret = 1;
--
1.7.7.6

2012-10-15 06:00:40

by Kirill A. Shutemov

Subject: [PATCH v4 08/10] thp: setup huge zero page on non-write page fault

From: "Kirill A. Shutemov" <[email protected]>

All code paths seem to be covered. Now we can map the huge zero page on
read page faults.

We set it up in do_huge_pmd_anonymous_page() if the area around the fault
address is suitable for THP and we've got a read page fault.

If we fail to set up the huge zero page (ENOMEM) we fall back to
handle_pte_fault() as we normally do in THP.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
mm/huge_memory.c | 10 ++++++++++
1 files changed, 10 insertions(+), 0 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index b267b12..da7e07b 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -725,6 +725,16 @@ int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
return VM_FAULT_OOM;
if (unlikely(khugepaged_enter(vma)))
return VM_FAULT_OOM;
+ if (!(flags & FAULT_FLAG_WRITE)) {
+ pgtable_t pgtable;
+ pgtable = pte_alloc_one(mm, haddr);
+ if (unlikely(!pgtable))
+ goto out;
+ spin_lock(&mm->page_table_lock);
+ set_huge_zero_page(pgtable, mm, vma, haddr, pmd);
+ spin_unlock(&mm->page_table_lock);
+ return 0;
+ }
page = alloc_hugepage_vma(transparent_hugepage_defrag(vma),
vma, haddr, numa_node_id(), 0);
if (unlikely(!page)) {
--
1.7.7.6

2012-10-15 06:00:39

by Kirill A. Shutemov

Subject: [PATCH v4 07/10] thp: implement splitting pmd for huge zero page

From: "Kirill A. Shutemov" <[email protected]>

We can't split the huge zero page itself (and it's a bug if we try), but we
can split the pmd which points to it.

On splitting the pmd we create a pte table with all ptes set to the normal
zero page.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
mm/huge_memory.c | 47 ++++++++++++++++++++++++++++++++++++++++++++---
1 files changed, 44 insertions(+), 3 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 87359f1..b267b12 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1610,6 +1610,7 @@ int split_huge_page(struct page *page)
struct anon_vma *anon_vma;
int ret = 1;

+ BUG_ON(is_huge_zero_pfn(page_to_pfn(page)));
BUG_ON(!PageAnon(page));
anon_vma = page_lock_anon_vma(page);
if (!anon_vma)
@@ -2508,23 +2509,63 @@ static int khugepaged(void *none)
return 0;
}

+static void __split_huge_zero_page_pmd(struct vm_area_struct *vma,
+ unsigned long haddr, pmd_t *pmd)
+{
+ pgtable_t pgtable;
+ pmd_t _pmd;
+ int i;
+
+ pmdp_clear_flush(vma, haddr, pmd);
+ /* leave pmd empty until pte is filled */
+
+ pgtable = get_pmd_huge_pte(vma->vm_mm);
+ pmd_populate(vma->vm_mm, &_pmd, pgtable);
+
+ for (i = 0; i < HPAGE_PMD_NR; i++, haddr += PAGE_SIZE) {
+ pte_t *pte, entry;
+ entry = pfn_pte(my_zero_pfn(haddr), vma->vm_page_prot);
+ entry = pte_mkspecial(entry);
+ pte = pte_offset_map(&_pmd, haddr);
+ VM_BUG_ON(!pte_none(*pte));
+ set_pte_at(vma->vm_mm, haddr, pte, entry);
+ pte_unmap(pte);
+ }
+ smp_wmb(); /* make pte visible before pmd */
+ pmd_populate(vma->vm_mm, pmd, pgtable);
+}
+
void __split_huge_page_pmd(struct vm_area_struct *vma, unsigned long address,
pmd_t *pmd)
{
struct page *page;
+ struct mm_struct *mm = vma->vm_mm;
unsigned long haddr = address & HPAGE_PMD_MASK;
+ unsigned long mmun_start; /* For mmu_notifiers */
+ unsigned long mmun_end; /* For mmu_notifiers */

BUG_ON(vma->vm_start > haddr || vma->vm_end < haddr + HPAGE_PMD_SIZE);

- spin_lock(&vma->vm_mm->page_table_lock);
+ mmun_start = haddr;
+ mmun_end = address + HPAGE_PMD_SIZE;
+ mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
+ spin_lock(&mm->page_table_lock);
if (unlikely(!pmd_trans_huge(*pmd))) {
- spin_unlock(&vma->vm_mm->page_table_lock);
+ spin_unlock(&mm->page_table_lock);
+ mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+ return;
+ }
+ if (is_huge_zero_pmd(*pmd)) {
+ __split_huge_zero_page_pmd(vma, haddr, pmd);
+ spin_unlock(&mm->page_table_lock);
+ mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
return;
}
page = pmd_page(*pmd);
VM_BUG_ON(!page_count(page));
get_page(page);
- spin_unlock(&vma->vm_mm->page_table_lock);
+ spin_unlock(&mm->page_table_lock);
+ mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);

split_huge_page(page);

--
1.7.7.6

2012-10-15 06:01:37

by Kirill A. Shutemov

Subject: [PATCH v4 06/10] thp: change split_huge_page_pmd() interface

From: "Kirill A. Shutemov" <[email protected]>

Pass the vma instead of the mm and add an address parameter.

In most cases we already have the vma on the stack. We provide
split_huge_page_pmd_mm() for the few cases where we have the mm but not the
vma.

This change is preparation for the huge zero pmd splitting implementation.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
Documentation/vm/transhuge.txt | 4 ++--
arch/x86/kernel/vm86_32.c | 2 +-
fs/proc/task_mmu.c | 2 +-
include/linux/huge_mm.h | 14 ++++++++++----
mm/huge_memory.c | 24 +++++++++++++++++++-----
mm/memory.c | 4 ++--
mm/mempolicy.c | 2 +-
mm/mprotect.c | 2 +-
mm/mremap.c | 2 +-
mm/pagewalk.c | 2 +-
10 files changed, 39 insertions(+), 19 deletions(-)

diff --git a/Documentation/vm/transhuge.txt b/Documentation/vm/transhuge.txt
index f734bb2..677a599 100644
--- a/Documentation/vm/transhuge.txt
+++ b/Documentation/vm/transhuge.txt
@@ -276,7 +276,7 @@ unaffected. libhugetlbfs will also work fine as usual.
== Graceful fallback ==

Code walking pagetables but unware about huge pmds can simply call
-split_huge_page_pmd(mm, pmd) where the pmd is the one returned by
+split_huge_page_pmd(vma, pmd, addr) where the pmd is the one returned by
pmd_offset. It's trivial to make the code transparent hugepage aware
by just grepping for "pmd_offset" and adding split_huge_page_pmd where
missing after pmd_offset returns the pmd. Thanks to the graceful
@@ -299,7 +299,7 @@ diff --git a/mm/mremap.c b/mm/mremap.c
return NULL;

pmd = pmd_offset(pud, addr);
-+ split_huge_page_pmd(mm, pmd);
++ split_huge_page_pmd(vma, pmd, addr);
if (pmd_none_or_clear_bad(pmd))
return NULL;

diff --git a/arch/x86/kernel/vm86_32.c b/arch/x86/kernel/vm86_32.c
index 5c9687b..1dfe69c 100644
--- a/arch/x86/kernel/vm86_32.c
+++ b/arch/x86/kernel/vm86_32.c
@@ -182,7 +182,7 @@ static void mark_screen_rdonly(struct mm_struct *mm)
if (pud_none_or_clear_bad(pud))
goto out;
pmd = pmd_offset(pud, 0xA0000);
- split_huge_page_pmd(mm, pmd);
+ split_huge_page_pmd_mm(mm, 0xA0000, pmd);
if (pmd_none_or_clear_bad(pmd))
goto out;
pte = pte_offset_map_lock(mm, pmd, 0xA0000, &ptl);
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 79827ce..866aa48 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -597,7 +597,7 @@ static int clear_refs_pte_range(pmd_t *pmd, unsigned long addr,
spinlock_t *ptl;
struct page *page;

- split_huge_page_pmd(walk->mm, pmd);
+ split_huge_page_pmd(vma, addr, pmd);
if (pmd_trans_unstable(pmd))
return 0;

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index b31cb7d..856f080 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -91,12 +91,14 @@ extern int handle_pte_fault(struct mm_struct *mm,
struct vm_area_struct *vma, unsigned long address,
pte_t *pte, pmd_t *pmd, unsigned int flags);
extern int split_huge_page(struct page *page);
-extern void __split_huge_page_pmd(struct mm_struct *mm, pmd_t *pmd);
-#define split_huge_page_pmd(__mm, __pmd) \
+extern void __split_huge_page_pmd(struct vm_area_struct *vma,
+ unsigned long address, pmd_t *pmd);
+#define split_huge_page_pmd(__vma, __address, __pmd) \
do { \
pmd_t *____pmd = (__pmd); \
if (unlikely(pmd_trans_huge(*____pmd))) \
- __split_huge_page_pmd(__mm, ____pmd); \
+ __split_huge_page_pmd(__vma, __address, \
+ ____pmd); \
} while (0)
#define wait_split_huge_page(__anon_vma, __pmd) \
do { \
@@ -106,6 +108,8 @@ extern void __split_huge_page_pmd(struct mm_struct *mm, pmd_t *pmd);
BUG_ON(pmd_trans_splitting(*____pmd) || \
pmd_trans_huge(*____pmd)); \
} while (0)
+extern void split_huge_page_pmd_mm(struct mm_struct *mm, unsigned long address,
+ pmd_t *pmd);
#if HPAGE_PMD_ORDER > MAX_ORDER
#error "hugepages can't be allocated by the buddy allocator"
#endif
@@ -173,10 +177,12 @@ static inline int split_huge_page(struct page *page)
{
return 0;
}
-#define split_huge_page_pmd(__mm, __pmd) \
+#define split_huge_page_pmd(__vma, __address, __pmd) \
do { } while (0)
#define wait_split_huge_page(__anon_vma, __pmd) \
do { } while (0)
+#define split_huge_page_pmd_mm(__mm, __address, __pmd) \
+ do { } while (0)
#define compound_trans_head(page) compound_head(page)
static inline int hugepage_madvise(struct vm_area_struct *vma,
unsigned long *vm_flags, int advice)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 8dbb1e4..87359f1 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2508,19 +2508,23 @@ static int khugepaged(void *none)
return 0;
}

-void __split_huge_page_pmd(struct mm_struct *mm, pmd_t *pmd)
+void __split_huge_page_pmd(struct vm_area_struct *vma, unsigned long address,
+ pmd_t *pmd)
{
struct page *page;
+ unsigned long haddr = address & HPAGE_PMD_MASK;

- spin_lock(&mm->page_table_lock);
+ BUG_ON(vma->vm_start > haddr || vma->vm_end < haddr + HPAGE_PMD_SIZE);
+
+ spin_lock(&vma->vm_mm->page_table_lock);
if (unlikely(!pmd_trans_huge(*pmd))) {
- spin_unlock(&mm->page_table_lock);
+ spin_unlock(&vma->vm_mm->page_table_lock);
return;
}
page = pmd_page(*pmd);
VM_BUG_ON(!page_count(page));
get_page(page);
- spin_unlock(&mm->page_table_lock);
+ spin_unlock(&vma->vm_mm->page_table_lock);

split_huge_page(page);

@@ -2528,6 +2532,16 @@ void __split_huge_page_pmd(struct mm_struct *mm, pmd_t *pmd)
BUG_ON(pmd_trans_huge(*pmd));
}

+void split_huge_page_pmd_mm(struct mm_struct *mm, unsigned long address,
+ pmd_t *pmd)
+{
+ struct vm_area_struct *vma;
+
+ vma = find_vma(mm, address);
+ BUG_ON(vma == NULL);
+ split_huge_page_pmd(vma, address, pmd);
+}
+
static void split_huge_page_address(struct mm_struct *mm,
unsigned long address)
{
@@ -2552,7 +2566,7 @@ static void split_huge_page_address(struct mm_struct *mm,
* Caller holds the mmap_sem write mode, so a huge pmd cannot
* materialize from under us.
*/
- split_huge_page_pmd(mm, pmd);
+ split_huge_page_pmd_mm(mm, address, pmd);
}

void __vma_adjust_trans_huge(struct vm_area_struct *vma,
diff --git a/mm/memory.c b/mm/memory.c
index 6edc030..6017e23 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1243,7 +1243,7 @@ static inline unsigned long zap_pmd_range(struct mmu_gather *tlb,
BUG();
}
#endif
- split_huge_page_pmd(vma->vm_mm, pmd);
+ split_huge_page_pmd(vma, addr, pmd);
} else if (zap_huge_pmd(tlb, vma, pmd, addr))
goto next;
/* fall through */
@@ -1512,7 +1512,7 @@ struct page *follow_page(struct vm_area_struct *vma, unsigned long address,
}
if (pmd_trans_huge(*pmd)) {
if (flags & FOLL_SPLIT) {
- split_huge_page_pmd(mm, pmd);
+ split_huge_page_pmd(vma, address, pmd);
goto split_fallthrough;
}
spin_lock(&mm->page_table_lock);
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 0b78fb9..a32bbfc 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -511,7 +511,7 @@ static inline int check_pmd_range(struct vm_area_struct *vma, pud_t *pud,
pmd = pmd_offset(pud, addr);
do {
next = pmd_addr_end(addr, end);
- split_huge_page_pmd(vma->vm_mm, pmd);
+ split_huge_page_pmd(vma, addr, pmd);
if (pmd_none_or_trans_huge_or_clear_bad(pmd))
continue;
if (check_pte_range(vma, pmd, addr, next, nodes,
diff --git a/mm/mprotect.c b/mm/mprotect.c
index a409926..e8c3938 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -90,7 +90,7 @@ static inline void change_pmd_range(struct vm_area_struct *vma, pud_t *pud,
next = pmd_addr_end(addr, end);
if (pmd_trans_huge(*pmd)) {
if (next - addr != HPAGE_PMD_SIZE)
- split_huge_page_pmd(vma->vm_mm, pmd);
+ split_huge_page_pmd(vma, addr, pmd);
else if (change_huge_pmd(vma, pmd, addr, newprot))
continue;
/* fall through */
diff --git a/mm/mremap.c b/mm/mremap.c
index 1b61c2d..eabb24d 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -182,7 +182,7 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
need_flush = true;
continue;
} else if (!err) {
- split_huge_page_pmd(vma->vm_mm, old_pmd);
+ split_huge_page_pmd(vma, old_addr, old_pmd);
}
VM_BUG_ON(pmd_trans_huge(*old_pmd));
}
diff --git a/mm/pagewalk.c b/mm/pagewalk.c
index 6c118d0..35aa294 100644
--- a/mm/pagewalk.c
+++ b/mm/pagewalk.c
@@ -58,7 +58,7 @@ again:
if (!walk->pte_entry)
continue;

- split_huge_page_pmd(walk->mm, pmd);
+ split_huge_page_pmd_mm(walk->mm, addr, pmd);
if (pmd_none_or_trans_huge_or_clear_bad(pmd))
goto again;
err = walk_pte_range(pmd, addr, next, walk);
--
1.7.7.6

2012-10-15 06:00:16

by Kirill A. Shutemov

Subject: [PATCH v4 02/10] thp: zap_huge_pmd(): zap huge zero pmd

From: "Kirill A. Shutemov" <[email protected]>

We don't have a real page to zap in the huge zero page case. Let's just
clear the pmd and remove it from the TLB.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
mm/huge_memory.c | 21 +++++++++++++--------
1 files changed, 13 insertions(+), 8 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 438adbf..680c27f 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1057,15 +1057,20 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
pmd_t orig_pmd;
pgtable = pgtable_trans_huge_withdraw(tlb->mm);
orig_pmd = pmdp_get_and_clear(tlb->mm, addr, pmd);
- page = pmd_page(orig_pmd);
tlb_remove_pmd_tlb_entry(tlb, pmd, addr);
- page_remove_rmap(page);
- VM_BUG_ON(page_mapcount(page) < 0);
- add_mm_counter(tlb->mm, MM_ANONPAGES, -HPAGE_PMD_NR);
- VM_BUG_ON(!PageHead(page));
- tlb->mm->nr_ptes--;
- spin_unlock(&tlb->mm->page_table_lock);
- tlb_remove_page(tlb, page);
+ if (is_huge_zero_pmd(orig_pmd)) {
+ tlb->mm->nr_ptes--;
+ spin_unlock(&tlb->mm->page_table_lock);
+ } else {
+ page = pmd_page(orig_pmd);
+ page_remove_rmap(page);
+ VM_BUG_ON(page_mapcount(page) < 0);
+ add_mm_counter(tlb->mm, MM_ANONPAGES, -HPAGE_PMD_NR);
+ VM_BUG_ON(!PageHead(page));
+ tlb->mm->nr_ptes--;
+ spin_unlock(&tlb->mm->page_table_lock);
+ tlb_remove_page(tlb, page);
+ }
pte_free(tlb->mm, pgtable);
ret = 1;
}
--
1.7.7.6

2012-10-15 06:02:17

by Kirill A. Shutemov

Subject: [PATCH v4 04/10] thp: do_huge_pmd_wp_page(): handle huge zero page

From: "Kirill A. Shutemov" <[email protected]>

On write access to the huge zero page we allocate a new huge page and clear
it.

On ENOMEM we fall back gracefully: we create a new pmd table and set the
pte around the fault address to a newly allocated normal (4k) page. All
other ptes in the pmd are set to the normal zero page.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
include/linux/mm.h | 8 +++
mm/huge_memory.c | 129 ++++++++++++++++++++++++++++++++++++++++++++++------
mm/memory.c | 7 ---
3 files changed, 122 insertions(+), 22 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index fa06804..fe329da 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -516,6 +516,14 @@ static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma)
}
#endif

+#ifndef my_zero_pfn
+static inline unsigned long my_zero_pfn(unsigned long addr)
+{
+ extern unsigned long zero_pfn;
+ return zero_pfn;
+}
+#endif
+
/*
* Multiple processes may "see" the same page. E.g. for untouched
* mappings of /dev/null, all processes see the same page full of
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 9f5e5cb..76548b1 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -823,6 +823,88 @@ out:
return ret;
}

+/* no "address" argument so destroys page coloring of some arch */
+pgtable_t get_pmd_huge_pte(struct mm_struct *mm)
+{
+ pgtable_t pgtable;
+
+ assert_spin_locked(&mm->page_table_lock);
+
+ /* FIFO */
+ pgtable = mm->pmd_huge_pte;
+ if (list_empty(&pgtable->lru))
+ mm->pmd_huge_pte = NULL;
+ else {
+ mm->pmd_huge_pte = list_entry(pgtable->lru.next,
+ struct page, lru);
+ list_del(&pgtable->lru);
+ }
+ return pgtable;
+}
+
+static int do_huge_pmd_wp_zero_page_fallback(struct mm_struct *mm,
+ struct vm_area_struct *vma, unsigned long address,
+ pmd_t *pmd, unsigned long haddr)
+{
+ pgtable_t pgtable;
+ pmd_t _pmd;
+ struct page *page;
+ int i, ret = 0;
+ unsigned long mmun_start; /* For mmu_notifiers */
+ unsigned long mmun_end; /* For mmu_notifiers */
+
+ page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, address);
+ if (!page) {
+ ret |= VM_FAULT_OOM;
+ goto out;
+ }
+
+ if (mem_cgroup_newpage_charge(page, mm, GFP_KERNEL)) {
+ put_page(page);
+ ret |= VM_FAULT_OOM;
+ goto out;
+ }
+
+ clear_user_highpage(page, address);
+ __SetPageUptodate(page);
+
+ mmun_start = haddr;
+ mmun_end = haddr + HPAGE_PMD_SIZE;
+ mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
+
+ spin_lock(&mm->page_table_lock);
+ pmdp_clear_flush(vma, haddr, pmd);
+ /* leave pmd empty until pte is filled */
+
+ pgtable = get_pmd_huge_pte(mm);
+ pmd_populate(mm, &_pmd, pgtable);
+
+ for (i = 0; i < HPAGE_PMD_NR; i++, haddr += PAGE_SIZE) {
+ pte_t *pte, entry;
+ if (haddr == (address & PAGE_MASK)) {
+ entry = mk_pte(page, vma->vm_page_prot);
+ entry = maybe_mkwrite(pte_mkdirty(entry), vma);
+ page_add_new_anon_rmap(page, vma, haddr);
+ } else {
+ entry = pfn_pte(my_zero_pfn(haddr), vma->vm_page_prot);
+ entry = pte_mkspecial(entry);
+ }
+ pte = pte_offset_map(&_pmd, haddr);
+ VM_BUG_ON(!pte_none(*pte));
+ set_pte_at(mm, haddr, pte, entry);
+ pte_unmap(pte);
+ }
+ smp_wmb(); /* make pte visible before pmd */
+ pmd_populate(mm, pmd, pgtable);
+ spin_unlock(&mm->page_table_lock);
+
+ mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
+
+ ret |= VM_FAULT_WRITE;
+out:
+ return ret;
+}
+
static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm,
struct vm_area_struct *vma,
unsigned long address,
@@ -929,19 +1011,21 @@ int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
unsigned long address, pmd_t *pmd, pmd_t orig_pmd)
{
int ret = 0;
- struct page *page, *new_page;
+ struct page *page = NULL, *new_page;
unsigned long haddr;
unsigned long mmun_start; /* For mmu_notifiers */
unsigned long mmun_end; /* For mmu_notifiers */

VM_BUG_ON(!vma->anon_vma);
+ haddr = address & HPAGE_PMD_MASK;
+ if (is_huge_zero_pmd(orig_pmd))
+ goto alloc;
spin_lock(&mm->page_table_lock);
if (unlikely(!pmd_same(*pmd, orig_pmd)))
goto out_unlock;

page = pmd_page(orig_pmd);
VM_BUG_ON(!PageCompound(page) || !PageHead(page));
- haddr = address & HPAGE_PMD_MASK;
if (page_mapcount(page) == 1) {
pmd_t entry;
entry = pmd_mkyoung(orig_pmd);
@@ -953,7 +1037,7 @@ int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
}
get_page(page);
spin_unlock(&mm->page_table_lock);
-
+alloc:
if (transparent_hugepage_enabled(vma) &&
!transparent_hugepage_debug_cow())
new_page = alloc_hugepage_vma(transparent_hugepage_defrag(vma),
@@ -963,24 +1047,34 @@ int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,

if (unlikely(!new_page)) {
count_vm_event(THP_FAULT_FALLBACK);
- ret = do_huge_pmd_wp_page_fallback(mm, vma, address,
- pmd, orig_pmd, page, haddr);
- if (ret & VM_FAULT_OOM)
- split_huge_page(page);
- put_page(page);
+ if (is_huge_zero_pmd(orig_pmd)) {
+ ret = do_huge_pmd_wp_zero_page_fallback(mm, vma,
+ address, pmd, haddr);
+ } else {
+ ret = do_huge_pmd_wp_page_fallback(mm, vma, address,
+ pmd, orig_pmd, page, haddr);
+ if (ret & VM_FAULT_OOM)
+ split_huge_page(page);
+ put_page(page);
+ }
goto out;
}
count_vm_event(THP_FAULT_ALLOC);

if (unlikely(mem_cgroup_newpage_charge(new_page, mm, GFP_KERNEL))) {
put_page(new_page);
- split_huge_page(page);
- put_page(page);
+ if (page) {
+ split_huge_page(page);
+ put_page(page);
+ }
ret |= VM_FAULT_OOM;
goto out;
}

- copy_user_huge_page(new_page, page, haddr, vma, HPAGE_PMD_NR);
+ if (is_huge_zero_pmd(orig_pmd))
+ clear_huge_page(new_page, haddr, HPAGE_PMD_NR);
+ else
+ copy_user_huge_page(new_page, page, haddr, vma, HPAGE_PMD_NR);
__SetPageUptodate(new_page);

mmun_start = haddr;
@@ -988,7 +1082,8 @@ int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);

spin_lock(&mm->page_table_lock);
- put_page(page);
+ if (page)
+ put_page(page);
if (unlikely(!pmd_same(*pmd, orig_pmd))) {
spin_unlock(&mm->page_table_lock);
mem_cgroup_uncharge_page(new_page);
@@ -996,7 +1091,6 @@ int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
goto out_mn;
} else {
pmd_t entry;
- VM_BUG_ON(!PageHead(page));
entry = mk_pmd(new_page, vma->vm_page_prot);
entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
entry = pmd_mkhuge(entry);
@@ -1004,8 +1098,13 @@ int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
page_add_new_anon_rmap(new_page, vma, haddr);
set_pmd_at(mm, haddr, pmd, entry);
update_mmu_cache_pmd(vma, address, pmd);
- page_remove_rmap(page);
- put_page(page);
+ if (is_huge_zero_pmd(orig_pmd))
+ add_mm_counter(mm, MM_ANONPAGES, HPAGE_PMD_NR);
+ if (page) {
+ VM_BUG_ON(!PageHead(page));
+ page_remove_rmap(page);
+ put_page(page);
+ }
ret |= VM_FAULT_WRITE;
}
spin_unlock(&mm->page_table_lock);
diff --git a/mm/memory.c b/mm/memory.c
index fb135ba..6edc030 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -724,13 +724,6 @@ static inline int is_zero_pfn(unsigned long pfn)
}
#endif

-#ifndef my_zero_pfn
-static inline unsigned long my_zero_pfn(unsigned long addr)
-{
- return zero_pfn;
-}
-#endif
-
/*
* vm_normal_page -- This function gets the "struct page" associated with a pte.
*
--
1.7.7.6

2012-10-15 06:02:40

by Kirill A. Shutemov

Subject: [PATCH v4 03/10] thp: copy_huge_pmd(): copy huge zero page

From: "Kirill A. Shutemov" <[email protected]>

It's easy to copy the huge zero page: just set the destination pmd to the
huge zero page.

It's safe to copy the huge zero page since we have none yet :-p

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
mm/huge_memory.c | 17 +++++++++++++++++
1 files changed, 17 insertions(+), 0 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 680c27f..9f5e5cb 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -700,6 +700,18 @@ static inline struct page *alloc_hugepage(int defrag)
}
#endif

+static void set_huge_zero_page(pgtable_t pgtable, struct mm_struct *mm,
+ struct vm_area_struct *vma, unsigned long haddr, pmd_t *pmd)
+{
+ pmd_t entry;
+ entry = pfn_pmd(huge_zero_pfn, vma->vm_page_prot);
+ entry = pmd_wrprotect(entry);
+ entry = pmd_mkhuge(entry);
+ set_pmd_at(mm, haddr, pmd, entry);
+ pgtable_trans_huge_deposit(mm, pgtable);
+ mm->nr_ptes++;
+}
+
int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
unsigned long address, pmd_t *pmd,
unsigned int flags)
@@ -777,6 +789,11 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
pte_free(dst_mm, pgtable);
goto out_unlock;
}
+ if (is_huge_zero_pmd(pmd)) {
+ set_huge_zero_page(pgtable, dst_mm, vma, addr, dst_pmd);
+ ret = 0;
+ goto out_unlock;
+ }
if (unlikely(pmd_trans_splitting(pmd))) {
/* split huge page running from under us */
spin_unlock(&src_mm->page_table_lock);
--
1.7.7.6

2012-10-16 09:53:20

by Ni zhan Chen

Subject: Re: [PATCH v4 00/10, REBASED] Introduce huge zero page

On 10/15/2012 02:00 PM, Kirill A. Shutemov wrote:
> From: "Kirill A. Shutemov" <[email protected]>
>
> Hi,
>
> Andrew, here's huge zero page patchset rebased to v3.7-rc1.
>
> Andrea, I've dropped your Reviewed-by due not-so-trivial conflicts in during
> rebase. Could you look through it again. Patches 2, 3, 4, 7, 10 had conflicts.
> Mostly due new MMU notifiers interface.
>
> =================
>
> During testing I noticed big (up to 2.5 times) memory consumption overhead
> on some workloads (e.g. ft.A from NPB) if THP is enabled.
>
> The main reason for that big difference is lacking zero page in THP case.
> We have to allocate a real page on read page fault.
>
> A program to demonstrate the issue:
> #include <assert.h>
> #include <stdlib.h>
> #include <unistd.h>
>
> #define MB 1024*1024
>
> int main(int argc, char **argv)
> {
> char *p;
> int i;
>
> posix_memalign((void **)&p, 2 * MB, 200 * MB);
> for (i = 0; i < 200 * MB; i+= 4096)
> assert(p[i] == 0);
> pause();
> return 0;
> }
>
> With thp-never RSS is about 400k, but with thp-always it's 200M.
> After the patcheset thp-always RSS is 400k too.
>
> Design overview.
>
> Huge zero page (hzp) is a non-movable huge page (2M on x86-64) filled with
> zeros. The way how we allocate it changes in the patchset:
>
> - [01/10] simplest way: hzp allocated on boot time in hugepage_init();
> - [09/10] lazy allocation on first use;
> - [10/10] lockless refcounting + shrinker-reclaimable hzp;
>
> We setup it in do_huge_pmd_anonymous_page() if area around fault address
> is suitable for THP and we've got read page fault.
> If we fail to setup hzp (ENOMEM) we fallback to handle_pte_fault() as we
> normally do in THP.
>
> On wp fault to hzp we allocate real memory for the huge page and clear it.
> If ENOMEM, graceful fallback: we create a new pmd table and set pte around
> fault address to newly allocated normal (4k) page. All other ptes in the
> pmd set to normal zero page.
>
> We cannot split hzp (and it's bug if we try), but we can split the pmd
> which points to it. On splitting the pmd we create a table with all ptes
> set to normal zero page.
>
> Patchset organized in bisect-friendly way:
> Patches 01-07: prepare all code paths for hzp
> Patch 08: all code paths are covered: safe to setup hzp
> Patch 09: lazy allocation
> Patch 10: lockless refcounting for hzp
>
> v4:
> - Rebase to v3.7-rc1;
> - Update commit message;
> v3:
> - fix potential deadlock in refcounting code on preemptive kernel.
> - do not mark huge zero page as movable.
> - fix typo in comment.
> - Reviewed-by tag from Andrea Arcangeli.
> v2:
> - Avoid find_vma() if we've already had vma on stack.
> Suggested by Andrea Arcangeli.
> - Implement refcounting for huge zero page.
>
> --------------------------------------------------------------------------
>
> By hpa request I've tried alternative approach for hzp implementation (see
> Virtual huge zero page patchset): pmd table with all entries set to zero
> page. This way should be more cache friendly, but it increases TLB
> pressure.

Thanks for your excellent work. But could you explain to me why the
current implementation is not cache friendly while hpa's proposal is?
Thanks in advance.

>
> The problem with virtual huge zero page: it requires per-arch enabling.
> We need a way to mark that pmd table has all ptes set to zero page.
>
> Some numbers to compare two implementations (on 4s Westmere-EX):
>
> Mirobenchmark1
> ==============
>
> test:
> posix_memalign((void **)&p, 2 * MB, 8 * GB);
> for (i = 0; i < 100; i++) {
> assert(memcmp(p, p + 4*GB, 4*GB) == 0);
> asm volatile ("": : :"memory");
> }
>
> hzp:
> Performance counter stats for './test_memcmp' (5 runs):
>
> 32356.272845 task-clock # 0.998 CPUs utilized ( +- 0.13% )
> 40 context-switches # 0.001 K/sec ( +- 0.94% )
> 0 CPU-migrations # 0.000 K/sec
> 4,218 page-faults # 0.130 K/sec ( +- 0.00% )
> 76,712,481,765 cycles # 2.371 GHz ( +- 0.13% ) [83.31%]
> 36,279,577,636 stalled-cycles-frontend # 47.29% frontend cycles idle ( +- 0.28% ) [83.35%]
> 1,684,049,110 stalled-cycles-backend # 2.20% backend cycles idle ( +- 2.96% ) [66.67%]
> 134,355,715,816 instructions # 1.75 insns per cycle
> # 0.27 stalled cycles per insn ( +- 0.10% ) [83.35%]
> 13,526,169,702 branches # 418.039 M/sec ( +- 0.10% ) [83.31%]
> 1,058,230 branch-misses # 0.01% of all branches ( +- 0.91% ) [83.36%]
>
> 32.413866442 seconds time elapsed ( +- 0.13% )
>
> vhzp:
> Performance counter stats for './test_memcmp' (5 runs):
>
> 30327.183829 task-clock # 0.998 CPUs utilized ( +- 0.13% )
> 38 context-switches # 0.001 K/sec ( +- 1.53% )
> 0 CPU-migrations # 0.000 K/sec
> 4,218 page-faults # 0.139 K/sec ( +- 0.01% )
> 71,964,773,660 cycles # 2.373 GHz ( +- 0.13% ) [83.35%]
> 31,191,284,231 stalled-cycles-frontend # 43.34% frontend cycles idle ( +- 0.40% ) [83.32%]
> 773,484,474 stalled-cycles-backend # 1.07% backend cycles idle ( +- 6.61% ) [66.67%]
> 134,982,215,437 instructions # 1.88 insns per cycle
> # 0.23 stalled cycles per insn ( +- 0.11% ) [83.32%]
> 13,509,150,683 branches # 445.447 M/sec ( +- 0.11% ) [83.34%]
> 1,017,667 branch-misses # 0.01% of all branches ( +- 1.07% ) [83.32%]
>
> 30.381324695 seconds time elapsed ( +- 0.13% )

Could you tell me which data I should look at in these performance
counters? And what's the benefit of your current implementation compared
to hpa's proposal?

>
> Mirobenchmark2
> ==============
>
> test:
> posix_memalign((void **)&p, 2 * MB, 8 * GB);
> for (i = 0; i < 1000; i++) {
> char *_p = p;
> while (_p < p+4*GB) {
> assert(*_p == *(_p+4*GB));
> _p += 4096;
> asm volatile ("": : :"memory");
> }
> }
>
> hzp:
> Performance counter stats for 'taskset -c 0 ./test_memcmp2' (5 runs):
>
> 3505.727639 task-clock # 0.998 CPUs utilized ( +- 0.26% )
> 9 context-switches # 0.003 K/sec ( +- 4.97% )
> 4,384 page-faults # 0.001 M/sec ( +- 0.00% )
> 8,318,482,466 cycles # 2.373 GHz ( +- 0.26% ) [33.31%]
> 5,134,318,786 stalled-cycles-frontend # 61.72% frontend cycles idle ( +- 0.42% ) [33.32%]
> 2,193,266,208 stalled-cycles-backend # 26.37% backend cycles idle ( +- 5.51% ) [33.33%]
> 9,494,670,537 instructions # 1.14 insns per cycle
> # 0.54 stalled cycles per insn ( +- 0.13% ) [41.68%]
> 2,108,522,738 branches # 601.451 M/sec ( +- 0.09% ) [41.68%]
> 158,746 branch-misses # 0.01% of all branches ( +- 1.60% ) [41.71%]
> 3,168,102,115 L1-dcache-loads
> # 903.693 M/sec ( +- 0.11% ) [41.70%]
> 1,048,710,998 L1-dcache-misses
> # 33.10% of all L1-dcache hits ( +- 0.11% ) [41.72%]
> 1,047,699,685 LLC-load
> # 298.854 M/sec ( +- 0.03% ) [33.38%]
> 2,287 LLC-misses
> # 0.00% of all LL-cache hits ( +- 8.27% ) [33.37%]
> 3,166,187,367 dTLB-loads
> # 903.147 M/sec ( +- 0.02% ) [33.35%]
> 4,266,538 dTLB-misses
> # 0.13% of all dTLB cache hits ( +- 0.03% ) [33.33%]
>
> 3.513339813 seconds time elapsed ( +- 0.26% )
>
> vhzp:
> Performance counter stats for 'taskset -c 0 ./test_memcmp2' (5 runs):
>
> 27313.891128 task-clock # 0.998 CPUs utilized ( +- 0.24% )
> 62 context-switches # 0.002 K/sec ( +- 0.61% )
> 4,384 page-faults # 0.160 K/sec ( +- 0.01% )
> 64,747,374,606 cycles # 2.370 GHz ( +- 0.24% ) [33.33%]
> 61,341,580,278 stalled-cycles-frontend # 94.74% frontend cycles idle ( +- 0.26% ) [33.33%]
> 56,702,237,511 stalled-cycles-backend # 87.57% backend cycles idle ( +- 0.07% ) [33.33%]
> 10,033,724,846 instructions # 0.15 insns per cycle
> # 6.11 stalled cycles per insn ( +- 0.09% ) [41.65%]
> 2,190,424,932 branches # 80.195 M/sec ( +- 0.12% ) [41.66%]
> 1,028,630 branch-misses # 0.05% of all branches ( +- 1.50% ) [41.66%]
> 3,302,006,540 L1-dcache-loads
> # 120.891 M/sec ( +- 0.11% ) [41.68%]
> 271,374,358 L1-dcache-misses
> # 8.22% of all L1-dcache hits ( +- 0.04% ) [41.66%]
> 20,385,476 LLC-load
> # 0.746 M/sec ( +- 1.64% ) [33.34%]
> 76,754 LLC-misses
> # 0.38% of all LL-cache hits ( +- 2.35% ) [33.34%]
> 3,309,927,290 dTLB-loads
> # 121.181 M/sec ( +- 0.03% ) [33.34%]
> 2,098,967,427 dTLB-misses
> # 63.41% of all dTLB cache hits ( +- 0.03% ) [33.34%]
>
> 27.364448741 seconds time elapsed ( +- 0.24% )

For this case, the same question as above, thanks in advance. :-)

>
> --------------------------------------------------------------------------
>
> I personally prefer implementation present in this patchset. It doesn't
> touch arch-specific code.
>
>
> Kirill A. Shutemov (10):
> thp: huge zero page: basic preparation
> thp: zap_huge_pmd(): zap huge zero pmd
> thp: copy_huge_pmd(): copy huge zero page
> thp: do_huge_pmd_wp_page(): handle huge zero page
> thp: change_huge_pmd(): keep huge zero page write-protected
> thp: change split_huge_page_pmd() interface
> thp: implement splitting pmd for huge zero page
> thp: setup huge zero page on non-write page fault
> thp: lazy huge zero page allocation
> thp: implement refcounting for huge zero page
>
> Documentation/vm/transhuge.txt | 4 +-
> arch/x86/kernel/vm86_32.c | 2 +-
> fs/proc/task_mmu.c | 2 +-
> include/linux/huge_mm.h | 14 ++-
> include/linux/mm.h | 8 +
> mm/huge_memory.c | 331 +++++++++++++++++++++++++++++++++++++---
> mm/memory.c | 11 +-
> mm/mempolicy.c | 2 +-
> mm/mprotect.c | 2 +-
> mm/mremap.c | 2 +-
> mm/pagewalk.c | 2 +-
> 11 files changed, 334 insertions(+), 46 deletions(-)
>

2012-10-16 10:53:57

by Kirill A. Shutemov

Subject: Re: [PATCH v4 00/10, REBASED] Introduce huge zero page

On Tue, Oct 16, 2012 at 05:53:07PM +0800, Ni zhan Chen wrote:
> >By hpa request I've tried alternative approach for hzp implementation (see
> >Virtual huge zero page patchset): pmd table with all entries set to zero
> >page. This way should be more cache friendly, but it increases TLB
> >pressure.
>
> Thanks for your excellent works. But could you explain me why
> current implementation not cache friendly and hpa's request cache
> friendly? Thanks in advance.

In workloads like microbenchmark1 you need N * size(zero page) of cache
space to keep the zero page fully cached, where N is the cache
associativity. If the zero page is 2M, the cache pressure is significant.

On the other hand, a table of 4k zero pages (hpa's proposal) increases
pressure on the TLB, since we have more pages covering the same memory
area, so we have to do more page translations in this case.

On my test machine with a simple memcmp() the virtual huge zero page is
faster, but it highly depends on TLB size, cache size, memory access
patterns and page translation costs.

It looks like cache size in modern processors grows faster than TLB size.
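
(A rough illustration of the arithmetic, with made-up cache parameters
rather than numbers measured on the machine above: with a 20-way
associative last-level cache, keeping the 2M zero page guaranteed resident
takes 20 * 2M = 40M of cache by the reasoning above, versus 20 * 4k = 80k
for a single 4k zero page. The trade-off is that every 2M of virtual
address space then needs 512 separate 4k TLB entries instead of a single
huge one.)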

> >The problem with virtual huge zero page: it requires per-arch enabling.
> >We need a way to mark that pmd table has all ptes set to zero page.
> >
> >Some numbers to compare two implementations (on 4s Westmere-EX):
> >
> >Mirobenchmark1
> >==============
> >
> >test:
> > posix_memalign((void **)&p, 2 * MB, 8 * GB);
> > for (i = 0; i < 100; i++) {
> > assert(memcmp(p, p + 4*GB, 4*GB) == 0);
> > asm volatile ("": : :"memory");
> > }
> >
> >hzp:
> > Performance counter stats for './test_memcmp' (5 runs):
> >
> > 32356.272845 task-clock # 0.998 CPUs utilized ( +- 0.13% )
> > 40 context-switches # 0.001 K/sec ( +- 0.94% )
> > 0 CPU-migrations # 0.000 K/sec
> > 4,218 page-faults # 0.130 K/sec ( +- 0.00% )
> > 76,712,481,765 cycles # 2.371 GHz ( +- 0.13% ) [83.31%]
> > 36,279,577,636 stalled-cycles-frontend # 47.29% frontend cycles idle ( +- 0.28% ) [83.35%]
> > 1,684,049,110 stalled-cycles-backend # 2.20% backend cycles idle ( +- 2.96% ) [66.67%]
> > 134,355,715,816 instructions # 1.75 insns per cycle
> > # 0.27 stalled cycles per insn ( +- 0.10% ) [83.35%]
> > 13,526,169,702 branches # 418.039 M/sec ( +- 0.10% ) [83.31%]
> > 1,058,230 branch-misses # 0.01% of all branches ( +- 0.91% ) [83.36%]
> >
> > 32.413866442 seconds time elapsed ( +- 0.13% )
> >
> >vhzp:
> > Performance counter stats for './test_memcmp' (5 runs):
> >
> > 30327.183829 task-clock # 0.998 CPUs utilized ( +- 0.13% )
> > 38 context-switches # 0.001 K/sec ( +- 1.53% )
> > 0 CPU-migrations # 0.000 K/sec
> > 4,218 page-faults # 0.139 K/sec ( +- 0.01% )
> > 71,964,773,660 cycles # 2.373 GHz ( +- 0.13% ) [83.35%]
> > 31,191,284,231 stalled-cycles-frontend # 43.34% frontend cycles idle ( +- 0.40% ) [83.32%]
> > 773,484,474 stalled-cycles-backend # 1.07% backend cycles idle ( +- 6.61% ) [66.67%]
> > 134,982,215,437 instructions # 1.88 insns per cycle
> > # 0.23 stalled cycles per insn ( +- 0.11% ) [83.32%]
> > 13,509,150,683 branches # 445.447 M/sec ( +- 0.11% ) [83.34%]
> > 1,017,667 branch-misses # 0.01% of all branches ( +- 1.07% ) [83.32%]
> >
> > 30.381324695 seconds time elapsed ( +- 0.13% )
>
> Could you tell me which data I should care in this performance
> counter. And what's the benefit of your current implementation
> compare to hpa's request?
>
> >
> >Mirobenchmark2
> >==============
> >
> >test:
> > posix_memalign((void **)&p, 2 * MB, 8 * GB);
> > for (i = 0; i < 1000; i++) {
> > char *_p = p;
> > while (_p < p+4*GB) {
> > assert(*_p == *(_p+4*GB));
> > _p += 4096;
> > asm volatile ("": : :"memory");
> > }
> > }
> >
> >hzp:
> > Performance counter stats for 'taskset -c 0 ./test_memcmp2' (5 runs):
> >
> > 3505.727639 task-clock # 0.998 CPUs utilized ( +- 0.26% )
> > 9 context-switches # 0.003 K/sec ( +- 4.97% )
> > 4,384 page-faults # 0.001 M/sec ( +- 0.00% )
> > 8,318,482,466 cycles # 2.373 GHz ( +- 0.26% ) [33.31%]
> > 5,134,318,786 stalled-cycles-frontend # 61.72% frontend cycles idle ( +- 0.42% ) [33.32%]
> > 2,193,266,208 stalled-cycles-backend # 26.37% backend cycles idle ( +- 5.51% ) [33.33%]
> > 9,494,670,537 instructions # 1.14 insns per cycle
> > # 0.54 stalled cycles per insn ( +- 0.13% ) [41.68%]
> > 2,108,522,738 branches # 601.451 M/sec ( +- 0.09% ) [41.68%]
> > 158,746 branch-misses # 0.01% of all branches ( +- 1.60% ) [41.71%]
> > 3,168,102,115 L1-dcache-loads
> > # 903.693 M/sec ( +- 0.11% ) [41.70%]
> > 1,048,710,998 L1-dcache-misses
> > # 33.10% of all L1-dcache hits ( +- 0.11% ) [41.72%]
> > 1,047,699,685 LLC-load
> > # 298.854 M/sec ( +- 0.03% ) [33.38%]
> > 2,287 LLC-misses
> > # 0.00% of all LL-cache hits ( +- 8.27% ) [33.37%]
> > 3,166,187,367 dTLB-loads
> > # 903.147 M/sec ( +- 0.02% ) [33.35%]
> > 4,266,538 dTLB-misses
> > # 0.13% of all dTLB cache hits ( +- 0.03% ) [33.33%]
> >
> > 3.513339813 seconds time elapsed ( +- 0.26% )
> >
> >vhzp:
> > Performance counter stats for 'taskset -c 0 ./test_memcmp2' (5 runs):
> >
> > 27313.891128 task-clock # 0.998 CPUs utilized ( +- 0.24% )
> > 62 context-switches # 0.002 K/sec ( +- 0.61% )
> > 4,384 page-faults # 0.160 K/sec ( +- 0.01% )
> > 64,747,374,606 cycles # 2.370 GHz ( +- 0.24% ) [33.33%]
> > 61,341,580,278 stalled-cycles-frontend # 94.74% frontend cycles idle ( +- 0.26% ) [33.33%]
> > 56,702,237,511 stalled-cycles-backend # 87.57% backend cycles idle ( +- 0.07% ) [33.33%]
> > 10,033,724,846 instructions # 0.15 insns per cycle
> > # 6.11 stalled cycles per insn ( +- 0.09% ) [41.65%]
> > 2,190,424,932 branches # 80.195 M/sec ( +- 0.12% ) [41.66%]
> > 1,028,630 branch-misses # 0.05% of all branches ( +- 1.50% ) [41.66%]
> > 3,302,006,540 L1-dcache-loads
> > # 120.891 M/sec ( +- 0.11% ) [41.68%]
> > 271,374,358 L1-dcache-misses
> > # 8.22% of all L1-dcache hits ( +- 0.04% ) [41.66%]
> > 20,385,476 LLC-load
> > # 0.746 M/sec ( +- 1.64% ) [33.34%]
> > 76,754 LLC-misses
> > # 0.38% of all LL-cache hits ( +- 2.35% ) [33.34%]
> > 3,309,927,290 dTLB-loads
> > # 121.181 M/sec ( +- 0.03% ) [33.34%]
> > 2,098,967,427 dTLB-misses
> > # 63.41% of all dTLB cache hits ( +- 0.03% ) [33.34%]
> >
> > 27.364448741 seconds time elapsed ( +- 0.24% )
>
> For this case, the same question as above, thanks in adance. :-)

--
Kirill A. Shutemov

2012-10-16 11:13:17

by Ni zhan Chen

[permalink] [raw]
Subject: Re: [PATCH v4 00/10, REBASED] Introduce huge zero page

On 10/16/2012 06:54 PM, Kirill A. Shutemov wrote:
> On Tue, Oct 16, 2012 at 05:53:07PM +0800, Ni zhan Chen wrote:
>>> By hpa request I've tried alternative approach for hzp implementation (see
>>> Virtual huge zero page patchset): pmd table with all entries set to zero
>>> page. This way should be more cache friendly, but it increases TLB
>>> pressure.
>> Thanks for your excellent works. But could you explain me why
>> current implementation not cache friendly and hpa's request cache
>> friendly? Thanks in advance.
> In workloads like microbenchmark1 you need N * size(zero page) cache
> space to get zero page fully cached, where N is cache associativity.
> If zero page is 2M, cache pressure is significant.
>
> On other hand with table of 4k zero pages (hpa's proposal) will increase
> pressure on TLB, since we have more pages for the same memory area. So we
> have to do more page translation in this case.
>
> On my test machine with simple memcmp() virtual huge zero page is faster.
> But it highly depends on TLB size, cache size, memory access and page
> translation costs.
>
> It looks like cache size in modern processors grows faster than TLB size.

Oh, I see, thanks for your quick response. One more question below,

>
>>> The problem with virtual huge zero page: it requires per-arch enabling.
>>> We need a way to mark that pmd table has all ptes set to zero page.
>>>
>>> Some numbers to compare two implementations (on 4s Westmere-EX):
>>>
>>> Mirobenchmark1
>>> ==============
>>>
>>> test:
>>> posix_memalign((void **)&p, 2 * MB, 8 * GB);
>>> for (i = 0; i < 100; i++) {
>>> assert(memcmp(p, p + 4*GB, 4*GB) == 0);
>>> asm volatile ("": : :"memory");
>>> }
>>>
>>> hzp:
>>> Performance counter stats for './test_memcmp' (5 runs):
>>>
>>> 32356.272845 task-clock # 0.998 CPUs utilized ( +- 0.13% )
>>> 40 context-switches # 0.001 K/sec ( +- 0.94% )
>>> 0 CPU-migrations # 0.000 K/sec
>>> 4,218 page-faults # 0.130 K/sec ( +- 0.00% )
>>> 76,712,481,765 cycles # 2.371 GHz ( +- 0.13% ) [83.31%]
>>> 36,279,577,636 stalled-cycles-frontend # 47.29% frontend cycles idle ( +- 0.28% ) [83.35%]
>>> 1,684,049,110 stalled-cycles-backend # 2.20% backend cycles idle ( +- 2.96% ) [66.67%]
>>> 134,355,715,816 instructions # 1.75 insns per cycle
>>> # 0.27 stalled cycles per insn ( +- 0.10% ) [83.35%]
>>> 13,526,169,702 branches # 418.039 M/sec ( +- 0.10% ) [83.31%]
>>> 1,058,230 branch-misses # 0.01% of all branches ( +- 0.91% ) [83.36%]
>>>
>>> 32.413866442 seconds time elapsed ( +- 0.13% )
>>>
>>> vhzp:
>>> Performance counter stats for './test_memcmp' (5 runs):
>>>
>>> 30327.183829 task-clock # 0.998 CPUs utilized ( +- 0.13% )
>>> 38 context-switches # 0.001 K/sec ( +- 1.53% )
>>> 0 CPU-migrations # 0.000 K/sec
>>> 4,218 page-faults # 0.139 K/sec ( +- 0.01% )
>>> 71,964,773,660 cycles # 2.373 GHz ( +- 0.13% ) [83.35%]
>>> 31,191,284,231 stalled-cycles-frontend # 43.34% frontend cycles idle ( +- 0.40% ) [83.32%]
>>> 773,484,474 stalled-cycles-backend # 1.07% backend cycles idle ( +- 6.61% ) [66.67%]
>>> 134,982,215,437 instructions # 1.88 insns per cycle
>>> # 0.23 stalled cycles per insn ( +- 0.11% ) [83.32%]
>>> 13,509,150,683 branches # 445.447 M/sec ( +- 0.11% ) [83.34%]
>>> 1,017,667 branch-misses # 0.01% of all branches ( +- 1.07% ) [83.32%]
>>>
>>> 30.381324695 seconds time elapsed ( +- 0.13% )
>> Could you tell me which data I should care in this performance
>> counter. And what's the benefit of your current implementation
>> compare to hpa's request?

Sorry for my ignorance. Could you tell me which data I should care about
in these performance counter stats? The same question for the second
benchmark's counter stats, thanks in advance. :-)
>>> Mirobenchmark2
>>> ==============
>>>
>>> test:
>>> posix_memalign((void **)&p, 2 * MB, 8 * GB);
>>> for (i = 0; i < 1000; i++) {
>>> char *_p = p;
>>> while (_p < p+4*GB) {
>>> assert(*_p == *(_p+4*GB));
>>> _p += 4096;
>>> asm volatile ("": : :"memory");
>>> }
>>> }
>>>
>>> hzp:
>>> Performance counter stats for 'taskset -c 0 ./test_memcmp2' (5 runs):
>>>
>>> 3505.727639 task-clock # 0.998 CPUs utilized ( +- 0.26% )
>>> 9 context-switches # 0.003 K/sec ( +- 4.97% )
>>> 4,384 page-faults # 0.001 M/sec ( +- 0.00% )
>>> 8,318,482,466 cycles # 2.373 GHz ( +- 0.26% ) [33.31%]
>>> 5,134,318,786 stalled-cycles-frontend # 61.72% frontend cycles idle ( +- 0.42% ) [33.32%]
>>> 2,193,266,208 stalled-cycles-backend # 26.37% backend cycles idle ( +- 5.51% ) [33.33%]
>>> 9,494,670,537 instructions # 1.14 insns per cycle
>>> # 0.54 stalled cycles per insn ( +- 0.13% ) [41.68%]
>>> 2,108,522,738 branches # 601.451 M/sec ( +- 0.09% ) [41.68%]
>>> 158,746 branch-misses # 0.01% of all branches ( +- 1.60% ) [41.71%]
>>> 3,168,102,115 L1-dcache-loads
>>> # 903.693 M/sec ( +- 0.11% ) [41.70%]
>>> 1,048,710,998 L1-dcache-misses
>>> # 33.10% of all L1-dcache hits ( +- 0.11% ) [41.72%]
>>> 1,047,699,685 LLC-load
>>> # 298.854 M/sec ( +- 0.03% ) [33.38%]
>>> 2,287 LLC-misses
>>> # 0.00% of all LL-cache hits ( +- 8.27% ) [33.37%]
>>> 3,166,187,367 dTLB-loads
>>> # 903.147 M/sec ( +- 0.02% ) [33.35%]
>>> 4,266,538 dTLB-misses
>>> # 0.13% of all dTLB cache hits ( +- 0.03% ) [33.33%]
>>>
>>> 3.513339813 seconds time elapsed ( +- 0.26% )
>>>
>>> vhzp:
>>> Performance counter stats for 'taskset -c 0 ./test_memcmp2' (5 runs):
>>>
>>> 27313.891128 task-clock # 0.998 CPUs utilized ( +- 0.24% )
>>> 62 context-switches # 0.002 K/sec ( +- 0.61% )
>>> 4,384 page-faults # 0.160 K/sec ( +- 0.01% )
>>> 64,747,374,606 cycles # 2.370 GHz ( +- 0.24% ) [33.33%]
>>> 61,341,580,278 stalled-cycles-frontend # 94.74% frontend cycles idle ( +- 0.26% ) [33.33%]
>>> 56,702,237,511 stalled-cycles-backend # 87.57% backend cycles idle ( +- 0.07% ) [33.33%]
>>> 10,033,724,846 instructions # 0.15 insns per cycle
>>> # 6.11 stalled cycles per insn ( +- 0.09% ) [41.65%]
>>> 2,190,424,932 branches # 80.195 M/sec ( +- 0.12% ) [41.66%]
>>> 1,028,630 branch-misses # 0.05% of all branches ( +- 1.50% ) [41.66%]
>>> 3,302,006,540 L1-dcache-loads
>>> # 120.891 M/sec ( +- 0.11% ) [41.68%]
>>> 271,374,358 L1-dcache-misses
>>> # 8.22% of all L1-dcache hits ( +- 0.04% ) [41.66%]
>>> 20,385,476 LLC-load
>>> # 0.746 M/sec ( +- 1.64% ) [33.34%]
>>> 76,754 LLC-misses
>>> # 0.38% of all LL-cache hits ( +- 2.35% ) [33.34%]
>>> 3,309,927,290 dTLB-loads
>>> # 121.181 M/sec ( +- 0.03% ) [33.34%]
>>> 2,098,967,427 dTLB-misses
>>> # 63.41% of all dTLB cache hits ( +- 0.03% ) [33.34%]
>>>
>>> 27.364448741 seconds time elapsed ( +- 0.24% )
>> For this case, the same question as above, thanks in adance. :-)

2012-10-16 11:27:48

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCH v4 00/10, REBASED] Introduce huge zero page

On Tue, Oct 16, 2012 at 07:13:07PM +0800, Ni zhan Chen wrote:
> On 10/16/2012 06:54 PM, Kirill A. Shutemov wrote:
> >On Tue, Oct 16, 2012 at 05:53:07PM +0800, Ni zhan Chen wrote:
> >>>By hpa request I've tried alternative approach for hzp implementation (see
> >>>Virtual huge zero page patchset): pmd table with all entries set to zero
> >>>page. This way should be more cache friendly, but it increases TLB
> >>>pressure.
> >>Thanks for your excellent works. But could you explain me why
> >>current implementation not cache friendly and hpa's request cache
> >>friendly? Thanks in advance.
> >In workloads like microbenchmark1 you need N * size(zero page) cache
> >space to get zero page fully cached, where N is cache associativity.
> >If zero page is 2M, cache pressure is significant.
> >
> >On other hand with table of 4k zero pages (hpa's proposal) will increase
> >pressure on TLB, since we have more pages for the same memory area. So we
> >have to do more page translation in this case.
> >
> >On my test machine with simple memcmp() virtual huge zero page is faster.
> >But it highly depends on TLB size, cache size, memory access and page
> >translation costs.
> >
> >It looks like cache size in modern processors grows faster than TLB size.
>
> Oh, I see, thanks for your quick response. Another one question below,
>
> >
> >>>The problem with virtual huge zero page: it requires per-arch enabling.
> >>>We need a way to mark that pmd table has all ptes set to zero page.
> >>>
> >>>Some numbers to compare two implementations (on 4s Westmere-EX):
> >>>
> >>>Mirobenchmark1
> >>>==============
> >>>
> >>>test:
> >>> posix_memalign((void **)&p, 2 * MB, 8 * GB);
> >>> for (i = 0; i < 100; i++) {
> >>> assert(memcmp(p, p + 4*GB, 4*GB) == 0);
> >>> asm volatile ("": : :"memory");
> >>> }
> >>>
> >>>hzp:
> >>> Performance counter stats for './test_memcmp' (5 runs):
> >>>
> >>> 32356.272845 task-clock # 0.998 CPUs utilized ( +- 0.13% )
> >>> 40 context-switches # 0.001 K/sec ( +- 0.94% )
> >>> 0 CPU-migrations # 0.000 K/sec
> >>> 4,218 page-faults # 0.130 K/sec ( +- 0.00% )
> >>> 76,712,481,765 cycles # 2.371 GHz ( +- 0.13% ) [83.31%]
> >>> 36,279,577,636 stalled-cycles-frontend # 47.29% frontend cycles idle ( +- 0.28% ) [83.35%]
> >>> 1,684,049,110 stalled-cycles-backend # 2.20% backend cycles idle ( +- 2.96% ) [66.67%]
> >>> 134,355,715,816 instructions # 1.75 insns per cycle
> >>> # 0.27 stalled cycles per insn ( +- 0.10% ) [83.35%]
> >>> 13,526,169,702 branches # 418.039 M/sec ( +- 0.10% ) [83.31%]
> >>> 1,058,230 branch-misses # 0.01% of all branches ( +- 0.91% ) [83.36%]
> >>>
> >>> 32.413866442 seconds time elapsed ( +- 0.13% )
> >>>
> >>>vhzp:
> >>> Performance counter stats for './test_memcmp' (5 runs):
> >>>
> >>> 30327.183829 task-clock # 0.998 CPUs utilized ( +- 0.13% )
> >>> 38 context-switches # 0.001 K/sec ( +- 1.53% )
> >>> 0 CPU-migrations # 0.000 K/sec
> >>> 4,218 page-faults # 0.139 K/sec ( +- 0.01% )
> >>> 71,964,773,660 cycles # 2.373 GHz ( +- 0.13% ) [83.35%]
> >>> 31,191,284,231 stalled-cycles-frontend # 43.34% frontend cycles idle ( +- 0.40% ) [83.32%]
> >>> 773,484,474 stalled-cycles-backend # 1.07% backend cycles idle ( +- 6.61% ) [66.67%]
> >>> 134,982,215,437 instructions # 1.88 insns per cycle
> >>> # 0.23 stalled cycles per insn ( +- 0.11% ) [83.32%]
> >>> 13,509,150,683 branches # 445.447 M/sec ( +- 0.11% ) [83.34%]
> >>> 1,017,667 branch-misses # 0.01% of all branches ( +- 1.07% ) [83.32%]
> >>>
> >>> 30.381324695 seconds time elapsed ( +- 0.13% )
> >>Could you tell me which data I should care in this performance
> >>counter. And what's the benefit of your current implementation
> >>compare to hpa's request?
>
> Sorry for my unintelligent. Could you tell me which data I should
> care in this performance counter stats. The same question about the
> second benchmark counter stats, thanks in adance. :-)

I missed the relevant counters in this run; you can see them in the second
benchmark.

Relevant counters:
L1-dcache-*, LLC-*: show cache-related stats (hits/misses);
dTLB-*: show data TLB hits and misses.

Indirectly relevant counters:
stalled-cycles-*: how long the CPU pipeline has to wait for data.
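
For reference, the headline ratios can be computed straight from the
Microbenchmark2 output quoted below (the raw counts here are copied from the
perf output, so this just re-derives the percentages perf prints):

#include <stdio.h>

int main(void)
{
        /* hzp run */
        double hzp_dtlb_miss  = 4266538.0 / 3166187367.0;
        double hzp_l1_miss    = 1048710998.0 / 3168102115.0;
        /* vhzp run */
        double vhzp_dtlb_miss = 2098967427.0 / 3309927290.0;
        double vhzp_l1_miss   = 271374358.0 / 3302006540.0;

        printf("dTLB miss rate: hzp %.2f%%, vhzp %.2f%%\n",
               100 * hzp_dtlb_miss, 100 * vhzp_dtlb_miss);
        printf("L1d miss rate:  hzp %.2f%%, vhzp %.2f%%\n",
               100 * hzp_l1_miss, 100 * vhzp_l1_miss);
        return 0;
}

So hzp barely misses the dTLB (~0.13%) but misses L1d on about a third of
loads, while vhzp misses the dTLB on ~63% of loads but has a much better
cache hit rate - which is the trade-off described above.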

> >>>Mirobenchmark2
> >>>==============
> >>>
> >>>test:
> >>> posix_memalign((void **)&p, 2 * MB, 8 * GB);
> >>> for (i = 0; i < 1000; i++) {
> >>> char *_p = p;
> >>> while (_p < p+4*GB) {
> >>> assert(*_p == *(_p+4*GB));
> >>> _p += 4096;
> >>> asm volatile ("": : :"memory");
> >>> }
> >>> }
> >>>
> >>>hzp:
> >>> Performance counter stats for 'taskset -c 0 ./test_memcmp2' (5 runs):
> >>>
> >>> 3505.727639 task-clock # 0.998 CPUs utilized ( +- 0.26% )
> >>> 9 context-switches # 0.003 K/sec ( +- 4.97% )
> >>> 4,384 page-faults # 0.001 M/sec ( +- 0.00% )
> >>> 8,318,482,466 cycles # 2.373 GHz ( +- 0.26% ) [33.31%]
> >>> 5,134,318,786 stalled-cycles-frontend # 61.72% frontend cycles idle ( +- 0.42% ) [33.32%]
> >>> 2,193,266,208 stalled-cycles-backend # 26.37% backend cycles idle ( +- 5.51% ) [33.33%]
> >>> 9,494,670,537 instructions # 1.14 insns per cycle
> >>> # 0.54 stalled cycles per insn ( +- 0.13% ) [41.68%]
> >>> 2,108,522,738 branches # 601.451 M/sec ( +- 0.09% ) [41.68%]
> >>> 158,746 branch-misses # 0.01% of all branches ( +- 1.60% ) [41.71%]
> >>> 3,168,102,115 L1-dcache-loads
> >>> # 903.693 M/sec ( +- 0.11% ) [41.70%]
> >>> 1,048,710,998 L1-dcache-misses
> >>> # 33.10% of all L1-dcache hits ( +- 0.11% ) [41.72%]
> >>> 1,047,699,685 LLC-load
> >>> # 298.854 M/sec ( +- 0.03% ) [33.38%]
> >>> 2,287 LLC-misses
> >>> # 0.00% of all LL-cache hits ( +- 8.27% ) [33.37%]
> >>> 3,166,187,367 dTLB-loads
> >>> # 903.147 M/sec ( +- 0.02% ) [33.35%]
> >>> 4,266,538 dTLB-misses
> >>> # 0.13% of all dTLB cache hits ( +- 0.03% ) [33.33%]
> >>>
> >>> 3.513339813 seconds time elapsed ( +- 0.26% )
> >>>
> >>>vhzp:
> >>> Performance counter stats for 'taskset -c 0 ./test_memcmp2' (5 runs):
> >>>
> >>> 27313.891128 task-clock # 0.998 CPUs utilized ( +- 0.24% )
> >>> 62 context-switches # 0.002 K/sec ( +- 0.61% )
> >>> 4,384 page-faults # 0.160 K/sec ( +- 0.01% )
> >>> 64,747,374,606 cycles # 2.370 GHz ( +- 0.24% ) [33.33%]
> >>> 61,341,580,278 stalled-cycles-frontend # 94.74% frontend cycles idle ( +- 0.26% ) [33.33%]
> >>> 56,702,237,511 stalled-cycles-backend # 87.57% backend cycles idle ( +- 0.07% ) [33.33%]
> >>> 10,033,724,846 instructions # 0.15 insns per cycle
> >>> # 6.11 stalled cycles per insn ( +- 0.09% ) [41.65%]
> >>> 2,190,424,932 branches # 80.195 M/sec ( +- 0.12% ) [41.66%]
> >>> 1,028,630 branch-misses # 0.05% of all branches ( +- 1.50% ) [41.66%]
> >>> 3,302,006,540 L1-dcache-loads
> >>> # 120.891 M/sec ( +- 0.11% ) [41.68%]
> >>> 271,374,358 L1-dcache-misses
> >>> # 8.22% of all L1-dcache hits ( +- 0.04% ) [41.66%]
> >>> 20,385,476 LLC-load
> >>> # 0.746 M/sec ( +- 1.64% ) [33.34%]
> >>> 76,754 LLC-misses
> >>> # 0.38% of all LL-cache hits ( +- 2.35% ) [33.34%]
> >>> 3,309,927,290 dTLB-loads
> >>> # 121.181 M/sec ( +- 0.03% ) [33.34%]
> >>> 2,098,967,427 dTLB-misses
> >>> # 63.41% of all dTLB cache hits ( +- 0.03% ) [33.34%]
> >>>
> >>> 27.364448741 seconds time elapsed ( +- 0.24% )
> >>For this case, the same question as above, thanks in adance. :-)
>

--
Kirill A. Shutemov

2012-10-16 11:37:56

by Ni zhan Chen

[permalink] [raw]
Subject: Re: [PATCH v4 00/10, REBASED] Introduce huge zero page

On 10/16/2012 07:28 PM, Kirill A. Shutemov wrote:
> On Tue, Oct 16, 2012 at 07:13:07PM +0800, Ni zhan Chen wrote:
>> On 10/16/2012 06:54 PM, Kirill A. Shutemov wrote:
>>> On Tue, Oct 16, 2012 at 05:53:07PM +0800, Ni zhan Chen wrote:
>>>>> By hpa request I've tried alternative approach for hzp implementation (see
>>>>> Virtual huge zero page patchset): pmd table with all entries set to zero
>>>>> page. This way should be more cache friendly, but it increases TLB
>>>>> pressure.
>>>> Thanks for your excellent works. But could you explain me why
>>>> current implementation not cache friendly and hpa's request cache
>>>> friendly? Thanks in advance.
>>> In workloads like microbenchmark1 you need N * size(zero page) cache
>>> space to get zero page fully cached, where N is cache associativity.
>>> If zero page is 2M, cache pressure is significant.
>>>
>>> On other hand with table of 4k zero pages (hpa's proposal) will increase
>>> pressure on TLB, since we have more pages for the same memory area. So we
>>> have to do more page translation in this case.
>>>
>>> On my test machine with simple memcmp() virtual huge zero page is faster.
>>> But it highly depends on TLB size, cache size, memory access and page
>>> translation costs.
>>>
>>> It looks like cache size in modern processors grows faster than TLB size.
>> Oh, I see, thanks for your quick response. Another one question below,
>>
>>>>> The problem with virtual huge zero page: it requires per-arch enabling.
>>>>> We need a way to mark that pmd table has all ptes set to zero page.
>>>>>
>>>>> Some numbers to compare two implementations (on 4s Westmere-EX):
>>>>>
>>>>> Mirobenchmark1
>>>>> ==============
>>>>>
>>>>> test:
>>>>> posix_memalign((void **)&p, 2 * MB, 8 * GB);
>>>>> for (i = 0; i < 100; i++) {
>>>>> assert(memcmp(p, p + 4*GB, 4*GB) == 0);
>>>>> asm volatile ("": : :"memory");
>>>>> }
>>>>>
>>>>> hzp:
>>>>> Performance counter stats for './test_memcmp' (5 runs):
>>>>>
>>>>> 32356.272845 task-clock # 0.998 CPUs utilized ( +- 0.13% )
>>>>> 40 context-switches # 0.001 K/sec ( +- 0.94% )
>>>>> 0 CPU-migrations # 0.000 K/sec
>>>>> 4,218 page-faults # 0.130 K/sec ( +- 0.00% )
>>>>> 76,712,481,765 cycles # 2.371 GHz ( +- 0.13% ) [83.31%]
>>>>> 36,279,577,636 stalled-cycles-frontend # 47.29% frontend cycles idle ( +- 0.28% ) [83.35%]
>>>>> 1,684,049,110 stalled-cycles-backend # 2.20% backend cycles idle ( +- 2.96% ) [66.67%]
>>>>> 134,355,715,816 instructions # 1.75 insns per cycle
>>>>> # 0.27 stalled cycles per insn ( +- 0.10% ) [83.35%]
>>>>> 13,526,169,702 branches # 418.039 M/sec ( +- 0.10% ) [83.31%]
>>>>> 1,058,230 branch-misses # 0.01% of all branches ( +- 0.91% ) [83.36%]
>>>>>
>>>>> 32.413866442 seconds time elapsed ( +- 0.13% )
>>>>>
>>>>> vhzp:
>>>>> Performance counter stats for './test_memcmp' (5 runs):
>>>>>
>>>>> 30327.183829 task-clock # 0.998 CPUs utilized ( +- 0.13% )
>>>>> 38 context-switches # 0.001 K/sec ( +- 1.53% )
>>>>> 0 CPU-migrations # 0.000 K/sec
>>>>> 4,218 page-faults # 0.139 K/sec ( +- 0.01% )
>>>>> 71,964,773,660 cycles # 2.373 GHz ( +- 0.13% ) [83.35%]
>>>>> 31,191,284,231 stalled-cycles-frontend # 43.34% frontend cycles idle ( +- 0.40% ) [83.32%]
>>>>> 773,484,474 stalled-cycles-backend # 1.07% backend cycles idle ( +- 6.61% ) [66.67%]
>>>>> 134,982,215,437 instructions # 1.88 insns per cycle
>>>>> # 0.23 stalled cycles per insn ( +- 0.11% ) [83.32%]
>>>>> 13,509,150,683 branches # 445.447 M/sec ( +- 0.11% ) [83.34%]
>>>>> 1,017,667 branch-misses # 0.01% of all branches ( +- 1.07% ) [83.32%]
>>>>>
>>>>> 30.381324695 seconds time elapsed ( +- 0.13% )
>>>> Could you tell me which data I should care in this performance
>>>> counter. And what's the benefit of your current implementation
>>>> compare to hpa's request?
>> Sorry for my unintelligent. Could you tell me which data I should
>> care in this performance counter stats. The same question about the
>> second benchmark counter stats, thanks in adance. :-)
> I've missed relevant counters in this run, you can see them in the second
> benchmark.
>
> Relevant counters:
> L1-dcache-*, LLC-*: shows cache related stats (hits/misses);
> dTLB-*: shows data TLB hits and misses.
>
> Indirect relevant counters:
> stalled-cycles-*: how long CPU pipeline has to wait for data.

Oh, I see, thanks for your patience. :-)

>
>>>>> Mirobenchmark2
>>>>> ==============
>>>>>
>>>>> test:
>>>>> posix_memalign((void **)&p, 2 * MB, 8 * GB);
>>>>> for (i = 0; i < 1000; i++) {
>>>>> char *_p = p;
>>>>> while (_p < p+4*GB) {
>>>>> assert(*_p == *(_p+4*GB));
>>>>> _p += 4096;
>>>>> asm volatile ("": : :"memory");
>>>>> }
>>>>> }
>>>>>
>>>>> hzp:
>>>>> Performance counter stats for 'taskset -c 0 ./test_memcmp2' (5 runs):
>>>>>
>>>>> 3505.727639 task-clock # 0.998 CPUs utilized ( +- 0.26% )
>>>>> 9 context-switches # 0.003 K/sec ( +- 4.97% )
>>>>> 4,384 page-faults # 0.001 M/sec ( +- 0.00% )
>>>>> 8,318,482,466 cycles # 2.373 GHz ( +- 0.26% ) [33.31%]
>>>>> 5,134,318,786 stalled-cycles-frontend # 61.72% frontend cycles idle ( +- 0.42% ) [33.32%]
>>>>> 2,193,266,208 stalled-cycles-backend # 26.37% backend cycles idle ( +- 5.51% ) [33.33%]
>>>>> 9,494,670,537 instructions # 1.14 insns per cycle
>>>>> # 0.54 stalled cycles per insn ( +- 0.13% ) [41.68%]
>>>>> 2,108,522,738 branches # 601.451 M/sec ( +- 0.09% ) [41.68%]
>>>>> 158,746 branch-misses # 0.01% of all branches ( +- 1.60% ) [41.71%]
>>>>> 3,168,102,115 L1-dcache-loads
>>>>> # 903.693 M/sec ( +- 0.11% ) [41.70%]
>>>>> 1,048,710,998 L1-dcache-misses
>>>>> # 33.10% of all L1-dcache hits ( +- 0.11% ) [41.72%]
>>>>> 1,047,699,685 LLC-load
>>>>> # 298.854 M/sec ( +- 0.03% ) [33.38%]
>>>>> 2,287 LLC-misses
>>>>> # 0.00% of all LL-cache hits ( +- 8.27% ) [33.37%]
>>>>> 3,166,187,367 dTLB-loads
>>>>> # 903.147 M/sec ( +- 0.02% ) [33.35%]
>>>>> 4,266,538 dTLB-misses
>>>>> # 0.13% of all dTLB cache hits ( +- 0.03% ) [33.33%]
>>>>>
>>>>> 3.513339813 seconds time elapsed ( +- 0.26% )
>>>>>
>>>>> vhzp:
>>>>> Performance counter stats for 'taskset -c 0 ./test_memcmp2' (5 runs):
>>>>>
>>>>> 27313.891128 task-clock # 0.998 CPUs utilized ( +- 0.24% )
>>>>> 62 context-switches # 0.002 K/sec ( +- 0.61% )
>>>>> 4,384 page-faults # 0.160 K/sec ( +- 0.01% )
>>>>> 64,747,374,606 cycles # 2.370 GHz ( +- 0.24% ) [33.33%]
>>>>> 61,341,580,278 stalled-cycles-frontend # 94.74% frontend cycles idle ( +- 0.26% ) [33.33%]
>>>>> 56,702,237,511 stalled-cycles-backend # 87.57% backend cycles idle ( +- 0.07% ) [33.33%]
>>>>> 10,033,724,846 instructions # 0.15 insns per cycle
>>>>> # 6.11 stalled cycles per insn ( +- 0.09% ) [41.65%]
>>>>> 2,190,424,932 branches # 80.195 M/sec ( +- 0.12% ) [41.66%]
>>>>> 1,028,630 branch-misses # 0.05% of all branches ( +- 1.50% ) [41.66%]
>>>>> 3,302,006,540 L1-dcache-loads
>>>>> # 120.891 M/sec ( +- 0.11% ) [41.68%]
>>>>> 271,374,358 L1-dcache-misses
>>>>> # 8.22% of all L1-dcache hits ( +- 0.04% ) [41.66%]
>>>>> 20,385,476 LLC-load
>>>>> # 0.746 M/sec ( +- 1.64% ) [33.34%]
>>>>> 76,754 LLC-misses
>>>>> # 0.38% of all LL-cache hits ( +- 2.35% ) [33.34%]
>>>>> 3,309,927,290 dTLB-loads
>>>>> # 121.181 M/sec ( +- 0.03% ) [33.34%]
>>>>> 2,098,967,427 dTLB-misses
>>>>> # 63.41% of all dTLB cache hits ( +- 0.03% ) [33.34%]
>>>>>
>>>>> 27.364448741 seconds time elapsed ( +- 0.24% )
>>>> For this case, the same question as above, thanks in adance. :-)

2012-10-18 23:45:06

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH v4 10/10] thp: implement refcounting for huge zero page

On Mon, 15 Oct 2012 09:00:59 +0300
"Kirill A. Shutemov" <[email protected]> wrote:

> H. Peter Anvin doesn't like huge zero page which sticks in memory forever
> after the first allocation. Here's implementation of lockless refcounting
> for huge zero page.
>
> We have two basic primitives: {get,put}_huge_zero_page(). They
> manipulate reference counter.
>
> If counter is 0, get_huge_zero_page() allocates a new huge page and
> takes two references: one for caller and one for shrinker. We free the
> page only in shrinker callback if counter is 1 (only shrinker has the
> reference).
>
> put_huge_zero_page() only decrements counter. Counter is never zero
> in put_huge_zero_page() since shrinker holds on reference.
>
> Freeing huge zero page in shrinker callback helps to avoid frequent
> allocate-free.

I'd like more details on this please. The cost of freeing then
reinstantiating that page is tremendous, because it has to be zeroed
out again. If there is any way at all in which the kernel can be made
to enter a high-frequency free/reinstantiate pattern then I expect the
effects would be quite bad.

Do we have sufficient mechanisms in there to prevent this from
happening in all cases? If so, what are they, because I'm not seeing
them?

> Refcounting has cost. On 4 socket machine I observe ~1% slowdown on
> parallel (40 processes) read page faulting comparing to lazy huge page
> allocation. I think it's pretty reasonable for synthetic benchmark.

2012-10-18 23:58:37

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCH v4 10/10] thp: implement refcounting for huge zero page

On Thu, Oct 18, 2012 at 04:45:02PM -0700, Andrew Morton wrote:
> On Mon, 15 Oct 2012 09:00:59 +0300
> "Kirill A. Shutemov" <[email protected]> wrote:
>
> > H. Peter Anvin doesn't like huge zero page which sticks in memory forever
> > after the first allocation. Here's implementation of lockless refcounting
> > for huge zero page.
> >
> > We have two basic primitives: {get,put}_huge_zero_page(). They
> > manipulate reference counter.
> >
> > If counter is 0, get_huge_zero_page() allocates a new huge page and
> > takes two references: one for caller and one for shrinker. We free the
> > page only in shrinker callback if counter is 1 (only shrinker has the
> > reference).
> >
> > put_huge_zero_page() only decrements counter. Counter is never zero
> > in put_huge_zero_page() since shrinker holds on reference.
> >
> > Freeing huge zero page in shrinker callback helps to avoid frequent
> > allocate-free.
>
> I'd like more details on this please. The cost of freeing then
> reinstantiating that page is tremendous, because it has to be zeroed
> out again. If there is any way at all in which the kernel can be made
> to enter a high-frequency free/reinstantiate pattern then I expect the
> effects would be quite bad.
>
> Do we have sufficient mechanisms in there to prevent this from
> happening in all cases? If so, what are they, because I'm not seeing
> them?

We only free the huge zero page in the shrinker callback if nobody in the
system uses it, never on put_huge_zero_page(). The shrinker runs only under
memory pressure or if the user asks for it (drop_caches).
Do you think we need an additional protection mechanism?
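
To spell out the protocol from the changelog above, a simple userspace model
(illustrative only - it ignores the lockless fast path and all the races the
real code has to handle):

#include <stdlib.h>

#define HPAGE_SIZE (2UL << 20)

static void *hzp;        /* stands in for the huge zero page */
static long refcount;    /* 0 == not allocated */

static void *get_huge_zero_page(void)
{
        if (refcount == 0) {
                hzp = calloc(1, HPAGE_SIZE);  /* allocate and zero */
                if (!hzp)
                        return NULL;          /* caller falls back to 4k zero page */
                refcount = 2;                 /* one for the caller, one for the shrinker */
        } else {
                refcount++;
        }
        return hzp;
}

static void put_huge_zero_page(void)
{
        refcount--;      /* never reaches zero: the shrinker keeps its reference */
}

static void shrinker_callback(void)
{
        if (refcount == 1) {    /* only the shrinker holds it: safe to free */
                free(hzp);
                hzp = NULL;
                refcount = 0;
        }
}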

>
> > Refcounting has cost. On 4 socket machine I observe ~1% slowdown on
> > parallel (40 processes) read page faulting comparing to lazy huge page
> > allocation. I think it's pretty reasonable for synthetic benchmark.

--
Kirill A. Shutemov

2012-10-23 06:34:21

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCH v4 10/10] thp: implement refcounting for huge zero page

On Fri, Oct 19, 2012 at 02:59:41AM +0300, Kirill A. Shutemov wrote:
> On Thu, Oct 18, 2012 at 04:45:02PM -0700, Andrew Morton wrote:
> > On Mon, 15 Oct 2012 09:00:59 +0300
> > "Kirill A. Shutemov" <[email protected]> wrote:
> >
> > > H. Peter Anvin doesn't like huge zero page which sticks in memory forever
> > > after the first allocation. Here's implementation of lockless refcounting
> > > for huge zero page.
> > >
> > > We have two basic primitives: {get,put}_huge_zero_page(). They
> > > manipulate reference counter.
> > >
> > > If counter is 0, get_huge_zero_page() allocates a new huge page and
> > > takes two references: one for caller and one for shrinker. We free the
> > > page only in shrinker callback if counter is 1 (only shrinker has the
> > > reference).
> > >
> > > put_huge_zero_page() only decrements counter. Counter is never zero
> > > in put_huge_zero_page() since shrinker holds on reference.
> > >
> > > Freeing huge zero page in shrinker callback helps to avoid frequent
> > > allocate-free.
> >
> > I'd like more details on this please. The cost of freeing then
> > reinstantiating that page is tremendous, because it has to be zeroed
> > out again. If there is any way at all in which the kernel can be made
> > to enter a high-frequency free/reinstantiate pattern then I expect the
> > effects would be quite bad.
> >
> > Do we have sufficient mechanisms in there to prevent this from
> > happening in all cases? If so, what are they, because I'm not seeing
> > them?
>
> We only free huge zero page in shrinker callback if nobody in the system
> uses it. Never on put_huge_zero_page(). Shrinker runs only under memory
> pressure or if user asks (drop_caches).
> Do you think we need an additional protection mechanism?

Andrew?

> > > Refcounting has cost. On 4 socket machine I observe ~1% slowdown on
> > > parallel (40 processes) read page faulting comparing to lazy huge page
> > > allocation. I think it's pretty reasonable for synthetic benchmark.
>
> --
> Kirill A. Shutemov

--
Kirill A. Shutemov

2012-10-23 06:44:03

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH v4 10/10] thp: implement refcounting for huge zero page

On Tue, 23 Oct 2012 09:35:32 +0300 "Kirill A. Shutemov" <[email protected]> wrote:

> On Fri, Oct 19, 2012 at 02:59:41AM +0300, Kirill A. Shutemov wrote:
> > On Thu, Oct 18, 2012 at 04:45:02PM -0700, Andrew Morton wrote:
> > > On Mon, 15 Oct 2012 09:00:59 +0300
> > > "Kirill A. Shutemov" <[email protected]> wrote:
> > >
> > > > H. Peter Anvin doesn't like huge zero page which sticks in memory forever
> > > > after the first allocation. Here's implementation of lockless refcounting
> > > > for huge zero page.
> > > >
> > > > We have two basic primitives: {get,put}_huge_zero_page(). They
> > > > manipulate reference counter.
> > > >
> > > > If counter is 0, get_huge_zero_page() allocates a new huge page and
> > > > takes two references: one for caller and one for shrinker. We free the
> > > > page only in shrinker callback if counter is 1 (only shrinker has the
> > > > reference).
> > > >
> > > > put_huge_zero_page() only decrements counter. Counter is never zero
> > > > in put_huge_zero_page() since shrinker holds on reference.
> > > >
> > > > Freeing huge zero page in shrinker callback helps to avoid frequent
> > > > allocate-free.
> > >
> > > I'd like more details on this please. The cost of freeing then
> > > reinstantiating that page is tremendous, because it has to be zeroed
> > > out again. If there is any way at all in which the kernel can be made
> > > to enter a high-frequency free/reinstantiate pattern then I expect the
> > > effects would be quite bad.
> > >
> > > Do we have sufficient mechanisms in there to prevent this from
> > > happening in all cases? If so, what are they, because I'm not seeing
> > > them?
> >
> > We only free huge zero page in shrinker callback if nobody in the system
> > uses it. Never on put_huge_zero_page(). Shrinker runs only under memory
> > pressure or if user asks (drop_caches).
> > Do you think we need an additional protection mechanism?
>
> Andrew?
>

Well, how hard is it to trigger the bad behavior? One can easily
create a situation in which that page's refcount frequently switches
from 0 to 1 and back again. And one can easily create a situation in
which the shrinkers are being called frequently. Run both at the same
time and what happens?

2012-10-23 06:59:31

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCH v4 10/10] thp: implement refcounting for huge zero page

On Mon, Oct 22, 2012 at 11:43:49PM -0700, Andrew Morton wrote:
> On Tue, 23 Oct 2012 09:35:32 +0300 "Kirill A. Shutemov" <[email protected]> wrote:
>
> > On Fri, Oct 19, 2012 at 02:59:41AM +0300, Kirill A. Shutemov wrote:
> > > On Thu, Oct 18, 2012 at 04:45:02PM -0700, Andrew Morton wrote:
> > > > On Mon, 15 Oct 2012 09:00:59 +0300
> > > > "Kirill A. Shutemov" <[email protected]> wrote:
> > > >
> > > > > H. Peter Anvin doesn't like huge zero page which sticks in memory forever
> > > > > after the first allocation. Here's implementation of lockless refcounting
> > > > > for huge zero page.
> > > > >
> > > > > We have two basic primitives: {get,put}_huge_zero_page(). They
> > > > > manipulate reference counter.
> > > > >
> > > > > If counter is 0, get_huge_zero_page() allocates a new huge page and
> > > > > takes two references: one for caller and one for shrinker. We free the
> > > > > page only in shrinker callback if counter is 1 (only shrinker has the
> > > > > reference).
> > > > >
> > > > > put_huge_zero_page() only decrements counter. Counter is never zero
> > > > > in put_huge_zero_page() since shrinker holds on reference.
> > > > >
> > > > > Freeing huge zero page in shrinker callback helps to avoid frequent
> > > > > allocate-free.
> > > >
> > > > I'd like more details on this please. The cost of freeing then
> > > > reinstantiating that page is tremendous, because it has to be zeroed
> > > > out again. If there is any way at all in which the kernel can be made
> > > > to enter a high-frequency free/reinstantiate pattern then I expect the
> > > > effects would be quite bad.
> > > >
> > > > Do we have sufficient mechanisms in there to prevent this from
> > > > happening in all cases? If so, what are they, because I'm not seeing
> > > > them?
> > >
> > > We only free huge zero page in shrinker callback if nobody in the system
> > > uses it. Never on put_huge_zero_page(). Shrinker runs only under memory
> > > pressure or if user asks (drop_caches).
> > > Do you think we need an additional protection mechanism?
> >
> > Andrew?
> >
>
> Well, how hard is it to trigger the bad behavior? One can easily
> create a situation in which that page's refcount frequently switches
> from 0 to 1 and back again. And one can easily create a situation in
> which the shrinkers are being called frequently. Run both at the same
> time and what happens?

If the goal is to trigger bad behavior then:

1. read from an area where a huge page can be mapped to get huge zero page
mapped. hzp is allocated here. refcounter == 2.
2. write to the same page. refcounter == 1.
3. echo 3 > /proc/sys/vm/drop_caches. refcounter == 0 -> free the hzp.
4. goto 1.

But it's unrealistic: /proc/sys/vm/drop_caches is only root-accessible.
Otherwise we can trigger the shrinker only under memory pressure, and in that
case we will most likely get -ENOMEM on the hzp allocation and take the
fallback path (4k zero page).

I don't see a problem here.
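
A minimal sketch of steps 1 and 2 above (it assumes THP set to "always" and
that the mapping happens to be 2M aligned; step 3 still needs root):

#include <assert.h>
#include <sys/mman.h>

#define HPAGE_SIZE (2UL << 20)

int main(void)
{
        for (;;) {
                char *p = mmap(NULL, HPAGE_SIZE, PROT_READ | PROT_WRITE,
                               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
                if (p == MAP_FAILED)
                        return 1;
                assert(p[0] == 0);      /* read fault: maps the huge zero page */
                p[0] = 1;               /* write fault: COW away from the hzp */
                munmap(p, HPAGE_SIZE);
                /* step 3: echo 3 > /proc/sys/vm/drop_caches (root only) */
        }
}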

--
Kirill A. Shutemov



2012-10-23 22:59:24

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH v4 10/10] thp: implement refcounting for huge zero page

On Tue, 23 Oct 2012 10:00:18 +0300
"Kirill A. Shutemov" <[email protected]> wrote:

> > Well, how hard is it to trigger the bad behavior? One can easily
> > create a situation in which that page's refcount frequently switches
> > from 0 to 1 and back again. And one can easily create a situation in
> > which the shrinkers are being called frequently. Run both at the same
> > time and what happens?
>
> If the goal is to trigger bad behavior then:
>
> 1. read from an area where a huge page can be mapped to get huge zero page
> mapped. hzp is allocated here. refcounter == 2.
> 2. write to the same page. refcounter == 1.
> 3. echo 3 > /proc/sys/vm/drop_caches. refcounter == 0 -> free the hzp.
> 4. goto 1.
>
> But it's unrealistic. /proc/sys/vm/drop_caches is only root-accessible.

Yes, drop_caches is uninteresting.

> We can trigger shrinker only under memory pressure. But in this, most
> likely we will get -ENOMEM on hzp allocation and will go to fallback path
> (4k zero page).

I disagree. If, for example, there is a large amount of clean
pagecache being generated then the shrinkers will be called frequently
and memory reclaim will be running at a 100% success rate. The
hugepage allocation will be successful in such a situation?

2012-10-23 23:36:48

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCH v4 10/10] thp: implement refcounting for huge zero page

On Tue, Oct 23, 2012 at 03:59:15PM -0700, Andrew Morton wrote:
> On Tue, 23 Oct 2012 10:00:18 +0300
> "Kirill A. Shutemov" <[email protected]> wrote:
>
> > > Well, how hard is it to trigger the bad behavior? One can easily
> > > create a situation in which that page's refcount frequently switches
> > > from 0 to 1 and back again. And one can easily create a situation in
> > > which the shrinkers are being called frequently. Run both at the same
> > > time and what happens?
> >
> > If the goal is to trigger bad behavior then:
> >
> > 1. read from an area where a huge page can be mapped to get huge zero page
> > mapped. hzp is allocated here. refcounter == 2.
> > 2. write to the same page. refcounter == 1.
> > 3. echo 3 > /proc/sys/vm/drop_caches. refcounter == 0 -> free the hzp.
> > 4. goto 1.
> >
> > But it's unrealistic. /proc/sys/vm/drop_caches is only root-accessible.
>
> Yes, drop_caches is uninteresting.
>
> > We can trigger shrinker only under memory pressure. But in this, most
> > likely we will get -ENOMEM on hzp allocation and will go to fallback path
> > (4k zero page).
>
> I disagree. If, for example, there is a large amount of clean
> pagecache being generated then the shrinkers will be called frequently
> and memory reclaim will be running at a 100% success rate. The
> hugepage allocation will be successful in such a situation?

Yes.

Shrinker callbacks are called from shrink_slab(), which happens after page
cache reclaim, so on the next reclaim round the page cache will be reclaimed
first and we will avoid a frequent alloc-free pattern.

One more thing we can do: increase shrinker->seeks to something like
DEFAULT_SEEKS * 4. In that case shrink_slab() will call our callback after
the callbacks with DEFAULT_SEEKS.
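
Something like this, sketched against the 3.7-era shrinker API (not the
actual patch; the refcount variable and the freeing details are simplified
here):

#include <linux/atomic.h>
#include <linux/huge_mm.h>
#include <linux/shrinker.h>

static atomic_t huge_zero_refcount;     /* the counter described in patch 10 */

static int shrink_huge_zero_page(struct shrinker *shrink,
                                 struct shrink_control *sc)
{
        if (!sc->nr_to_scan)
                /* report a freeable object only if we hold the last reference */
                return atomic_read(&huge_zero_refcount) == 1 ? HPAGE_PMD_NR : 0;

        if (atomic_cmpxchg(&huge_zero_refcount, 1, 0) == 1) {
                /* nobody else uses it: free the huge zero page here */
        }
        return 0;
}

static struct shrinker huge_zero_page_shrinker = {
        .shrink = shrink_huge_zero_page,
        .seeks  = DEFAULT_SEEKS * 4,    /* tell shrink_slab the page is expensive to recreate */
};

/* registered with register_shrinker(&huge_zero_page_shrinker) in hugepage_init() */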

--
Kirill A. Shutemov

2012-10-24 19:22:57

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH v4 10/10] thp: implement refcounting for huge zero page

On Wed, 24 Oct 2012 02:38:01 +0300
"Kirill A. Shutemov" <[email protected]> wrote:

> On Tue, Oct 23, 2012 at 03:59:15PM -0700, Andrew Morton wrote:
> > On Tue, 23 Oct 2012 10:00:18 +0300
> > "Kirill A. Shutemov" <[email protected]> wrote:
> >
> > > > Well, how hard is it to trigger the bad behavior? One can easily
> > > > create a situation in which that page's refcount frequently switches
> > > > from 0 to 1 and back again. And one can easily create a situation in
> > > > which the shrinkers are being called frequently. Run both at the same
> > > > time and what happens?
> > >
> > > If the goal is to trigger bad behavior then:
> > >
> > > 1. read from an area where a huge page can be mapped to get huge zero page
> > > mapped. hzp is allocated here. refcounter == 2.
> > > 2. write to the same page. refcounter == 1.
> > > 3. echo 3 > /proc/sys/vm/drop_caches. refcounter == 0 -> free the hzp.
> > > 4. goto 1.
> > >
> > > But it's unrealistic. /proc/sys/vm/drop_caches is only root-accessible.
> >
> > Yes, drop_caches is uninteresting.
> >
> > > We can trigger shrinker only under memory pressure. But in this, most
> > > likely we will get -ENOMEM on hzp allocation and will go to fallback path
> > > (4k zero page).
> >
> > I disagree. If, for example, there is a large amount of clean
> > pagecache being generated then the shrinkers will be called frequently
> > and memory reclaim will be running at a 100% success rate. The
> > hugepage allocation will be successful in such a situation?
>
> Yes.
>
> Shrinker callbacks are called from shrink_slab() which happens after page
> cache reclaim, so on next reclaim round page cache will reclaim first and
> we will avoid frequent alloc-free pattern.

I don't understand this. If reclaim is running continuously (which can
happen pretty easily: "dd if=/fast-disk/large-file") then the zero page
will be whipped away very shortly after its refcount has fallen to
zero.

> One more thing we can do: increase shrinker->seeks to something like
> DEFAULT_SEEKS * 4. In this case shrink_slab() will call our callback after
> callbacks with DEFAULT_SEEKS.

It would be useful if you could try to make this scenario happen. If
for some reason it doesn't happen then let's understand *why* it
doesn't happen.

I'm thinking that such a workload would be the above dd in parallel
with a small app which touches the huge page and then exits, then gets
executed again. That "small app" sounds realistic to me. Obviously
one could exercise the zero page's refcount at higher frequency with a
tight map/touch/unmap loop, but that sounds less realistic. It's worth
trying that exercise as well though.

Or do something else. But we should try to probe this code's
worst-case behaviour, get an understanding of its effects and then
decide whether any such workload is realistic enough to worry about.
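
A sketch of the "small app" half of the workload described above (run it in a
shell loop next to the dd; the 2M-aligned posix_memalign follows the
convention of the other test programs in this thread, and the fresh
mmap-backed allocation reads back as zeros):

#include <assert.h>
#include <stdlib.h>

#define MB (1024 * 1024)

int main(void)
{
        char *p;

        if (posix_memalign((void **)&p, 2 * MB, 2 * MB))
                return 1;
        assert(*p == 0);        /* read fault: takes a reference on the hzp */
        return 0;               /* exit drops the reference again */
}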

2012-10-24 19:55:12

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCH v4 10/10] thp: implement refcounting for huge zero page

On Wed, Oct 24, 2012 at 12:22:53PM -0700, Andrew Morton wrote:
> On Wed, 24 Oct 2012 02:38:01 +0300
> "Kirill A. Shutemov" <[email protected]> wrote:
>
> > On Tue, Oct 23, 2012 at 03:59:15PM -0700, Andrew Morton wrote:
> > > On Tue, 23 Oct 2012 10:00:18 +0300
> > > "Kirill A. Shutemov" <[email protected]> wrote:
> > >
> > > > > Well, how hard is it to trigger the bad behavior? One can easily
> > > > > create a situation in which that page's refcount frequently switches
> > > > > from 0 to 1 and back again. And one can easily create a situation in
> > > > > which the shrinkers are being called frequently. Run both at the same
> > > > > time and what happens?
> > > >
> > > > If the goal is to trigger bad behavior then:
> > > >
> > > > 1. read from an area where a huge page can be mapped to get huge zero page
> > > > mapped. hzp is allocated here. refcounter == 2.
> > > > 2. write to the same page. refcounter == 1.
> > > > 3. echo 3 > /proc/sys/vm/drop_caches. refcounter == 0 -> free the hzp.
> > > > 4. goto 1.
> > > >
> > > > But it's unrealistic. /proc/sys/vm/drop_caches is only root-accessible.
> > >
> > > Yes, drop_caches is uninteresting.
> > >
> > > > We can trigger shrinker only under memory pressure. But in this, most
> > > > likely we will get -ENOMEM on hzp allocation and will go to fallback path
> > > > (4k zero page).
> > >
> > > I disagree. If, for example, there is a large amount of clean
> > > pagecache being generated then the shrinkers will be called frequently
> > > and memory reclaim will be running at a 100% success rate. The
> > > hugepage allocation will be successful in such a situation?
> >
> > Yes.
> >
> > Shrinker callbacks are called from shrink_slab() which happens after page
> > cache reclaim, so on next reclaim round page cache will reclaim first and
> > we will avoid frequent alloc-free pattern.
>
> I don't understand this. If reclaim is running continuously (which can
> happen pretty easily: "dd if=/fast-disk/large-file") then the zero page
> will be whipped away very shortly after its refcount has fallen to
> zero.
>
> > One more thing we can do: increase shrinker->seeks to something like
> > DEFAULT_SEEKS * 4. In this case shrink_slab() will call our callback after
> > callbacks with DEFAULT_SEEKS.
>
> It would be useful if you could try to make this scenario happen. If
> for some reason it doesn't happen then let's understand *why* it
> doesn't happen.
>
> I'm thinking that such a workload would be the above dd in parallel
> with a small app which touches the huge page and then exits, then gets
> executed again. That "small app" sounds realistic to me. Obviously
> one could exercise the zero page's refcount at higher frequency with a
> tight map/touch/unmap loop, but that sounds less realistic. It's worth
> trying that exercise as well though.
>
> Or do something else. But we should try to probe this code's
> worst-case behaviour, get an understanding of its effects and then
> decide whether any such workload is realisic enough to worry about.

Okay, I'll try a few memory pressure scenarios.

Meanwhile, could you take patches 01-09? Patch 09 implements a simpler
allocation scheme. It would be nice to get all the other code tested.
Or do you see any other blocker?

--
Kirill A. Shutemov



2012-10-24 20:25:55

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH v4 10/10] thp: implement refcounting for huge zero page

On Wed, 24 Oct 2012 22:45:52 +0300
"Kirill A. Shutemov" <[email protected]> wrote:

> On Wed, Oct 24, 2012 at 12:22:53PM -0700, Andrew Morton wrote:
> >
> > I'm thinking that such a workload would be the above dd in parallel
> > with a small app which touches the huge page and then exits, then gets
> > executed again. That "small app" sounds realistic to me. Obviously
> > one could exercise the zero page's refcount at higher frequency with a
> > tight map/touch/unmap loop, but that sounds less realistic. It's worth
> > trying that exercise as well though.
> >
> > Or do something else. But we should try to probe this code's
> > worst-case behaviour, get an understanding of its effects and then
> > decide whether any such workload is realisic enough to worry about.
>
> Okay, I'll try few memory pressure scenarios.

Thanks.

> Meanwhile, could you take patches 01-09? Patch 09 implements simpler
> allocation scheme. It would be nice to get all other code tested.
> Or do you see any other blocker?

I think I would take them all, to get them tested while we're still
poking at the code. It's a matter of getting my lazy ass onto reviewing
the patches.

The patches have a disturbing lack of reviewed-by's, acked-by's and
tested-by's on them. Have any other of the MM lazy asses actually
spent some time with them yet?

2012-10-24 20:32:39

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCH v4 10/10] thp: implement refcounting for huge zero page

On Wed, Oct 24, 2012 at 01:25:52PM -0700, Andrew Morton wrote:
> On Wed, 24 Oct 2012 22:45:52 +0300
> "Kirill A. Shutemov" <[email protected]> wrote:
>
> > On Wed, Oct 24, 2012 at 12:22:53PM -0700, Andrew Morton wrote:
> > >
> > > I'm thinking that such a workload would be the above dd in parallel
> > > with a small app which touches the huge page and then exits, then gets
> > > executed again. That "small app" sounds realistic to me. Obviously
> > > one could exercise the zero page's refcount at higher frequency with a
> > > tight map/touch/unmap loop, but that sounds less realistic. It's worth
> > > trying that exercise as well though.
> > >
> > > Or do something else. But we should try to probe this code's
> > > worst-case behaviour, get an understanding of its effects and then
> > > decide whether any such workload is realisic enough to worry about.
> >
> > Okay, I'll try few memory pressure scenarios.
>
> Thanks.
>
> > Meanwhile, could you take patches 01-09? Patch 09 implements simpler
> > allocation scheme. It would be nice to get all other code tested.
> > Or do you see any other blocker?
>
> I think I would take them all, to get them tested while we're still
> poking at the code. It's a matter of getting my lazy ass onto reviewing
> the patches.
>
> The patches have a disturbing lack of reviewed-by's, acked-by's and
> tested-by's on them. Have any other of the MM lazy asses actually
> spent some time with them yet?

Andrea gave his Reviewed-by to the previous version of the patchset, but I've
dropped the tag after the rebase to v3.7-rc1 due to not-so-trivial conflicts.
Patches 2, 3, 4, 7 and 10 had conflicts, mostly due to the new MMU notifiers
interface.

I mentioned that in the cover letter.

--
Kirill A. Shutemov



2012-10-24 20:44:29

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH v4 10/10] thp: implement refcounting for huge zero page

> Andrea gave his Reviewed-by to the previous version of the patchset, but I've
> dropped the tag after the rebase to v3.7-rc1 due to not-so-trivial conflicts.
> Patches 2, 3, 4, 7 and 10 had conflicts, mostly due to the new MMU notifiers interface.

I reviewed it too, but I probably do not count as a real MM person.

-Andi

2012-10-25 20:49:10

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCH v4 10/10] thp: implement refcounting for huge zero page

On Wed, Oct 24, 2012 at 01:25:52PM -0700, Andrew Morton wrote:
> On Wed, 24 Oct 2012 22:45:52 +0300
> "Kirill A. Shutemov" <[email protected]> wrote:
>
> > On Wed, Oct 24, 2012 at 12:22:53PM -0700, Andrew Morton wrote:
> > >
> > > I'm thinking that such a workload would be the above dd in parallel
> > > with a small app which touches the huge page and then exits, then gets
> > > executed again. That "small app" sounds realistic to me. Obviously
> > > one could exercise the zero page's refcount at higher frequency with a
> > > tight map/touch/unmap loop, but that sounds less realistic. It's worth
> > > trying that exercise as well though.
> > >
> > > Or do something else. But we should try to probe this code's
> > > worst-case behaviour, get an understanding of its effects and then
> > > decide whether any such workload is realisic enough to worry about.
> >
> > Okay, I'll try few memory pressure scenarios.

A test program:

while (1) {
        posix_memalign((void **)&p, 2 * MB, 2 * MB);
        assert(*p == 0);
        free(p);
}

With this code running in the background we have a pretty good chance that
the huge zero page is freeable (refcount == 1) when the shrinker callback is
called - roughly one time in two.
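
For reference, a self-contained version of that loop, using the same MB
convention as the earlier test programs in this thread:

#include <assert.h>
#include <stdlib.h>

#define MB (1024 * 1024)

int main(void)
{
        char *p;

        while (1) {
                if (posix_memalign((void **)&p, 2 * MB, 2 * MB))
                        return 1;
                assert(*p == 0);  /* read fault: takes a reference on the hzp */
                free(p);          /* glibc unmaps a chunk this large, dropping the reference */
        }
}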

A pagecache hog (dd if=hugefile of=/dev/null bs=1M) creates enough pressure
to get the shrinker callback called, but it was only asked for the cache size
(nr_to_scan == 0).
I was not able to get it called with nr_to_scan > 0 in this scenario, so the
hzp was never freed.

I also tried another scenario: usemem -n16 100M -r 1000. It creates real
memory pressure - no easily reclaimable memory. This time the callback was
called with nr_to_scan > 0 and we freed the hzp. Under pressure we fail to
allocate the hzp and the code takes the fallback path as it is supposed to.

Do I need to check any other scenario?

--
Kirill A. Shutemov



2012-10-25 21:05:28

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH v4 10/10] thp: implement refcounting for huge zero page

On Thu, 25 Oct 2012 23:49:59 +0300
"Kirill A. Shutemov" <[email protected]> wrote:

> On Wed, Oct 24, 2012 at 01:25:52PM -0700, Andrew Morton wrote:
> > On Wed, 24 Oct 2012 22:45:52 +0300
> > "Kirill A. Shutemov" <[email protected]> wrote:
> >
> > > On Wed, Oct 24, 2012 at 12:22:53PM -0700, Andrew Morton wrote:
> > > >
> > > > I'm thinking that such a workload would be the above dd in parallel
> > > > with a small app which touches the huge page and then exits, then gets
> > > > executed again. That "small app" sounds realistic to me. Obviously
> > > > one could exercise the zero page's refcount at higher frequency with a
> > > > tight map/touch/unmap loop, but that sounds less realistic. It's worth
> > > > trying that exercise as well though.
> > > >
> > > > Or do something else. But we should try to probe this code's
> > > > worst-case behaviour, get an understanding of its effects and then
> > > > decide whether any such workload is realisic enough to worry about.
> > >
> > > Okay, I'll try few memory pressure scenarios.
>
> A test program:
>
> while (1) {
> posix_memalign((void **)&p, 2 * MB, 2 * MB);
> assert(*p == 0);
> free(p);
> }
>
> With this code in background we have pretty good chance to have huge zero
> page freeable (refcount == 1) when shrinker callback called - roughly one
> of two.
>
> Pagecache hog (dd if=hugefile of=/dev/null bs=1M) creates enough pressure
> to get shrinker callback called, but it was only asked about cache size
> (nr_to_scan == 0).
> I was not able to get it called with nr_to_scan > 0 on this scenario, so
> hzp never freed.

hm. It's odd that the kernel didn't try to shrink slabs in this case.
Why didn't it??

> I also tried another scenario: usemem -n16 100M -r 1000. It creates real
> memory pressure - no easy reclaimable memory. This time callback called
> with nr_to_scan > 0 and we freed hzp. Under pressure we fails to allocate
> hzp and code goes to fallback path as it supposed to.
>
> Do I need to check any other scenario?

I'm thinking that if we do hit problems in this area, we could avoid
freeing the hugepage unless the scan_control.priority is high enough.
That would involve adding a magic number or a tunable to set the
threshold.

Also, it would be beneficial if we can monitor this easily. Perhaps
add a counter to /proc/vmstat which tells us how many times that page
has been reallocated? And perhaps how many times we tried to allocate
it but failed?

2012-10-25 21:21:36

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCH v4 10/10] thp: implement refcounting for huge zero page

On Thu, Oct 25, 2012 at 02:05:24PM -0700, Andrew Morton wrote:
> On Thu, 25 Oct 2012 23:49:59 +0300
> "Kirill A. Shutemov" <[email protected]> wrote:
>
> > On Wed, Oct 24, 2012 at 01:25:52PM -0700, Andrew Morton wrote:
> > > On Wed, 24 Oct 2012 22:45:52 +0300
> > > "Kirill A. Shutemov" <[email protected]> wrote:
> > >
> > > > On Wed, Oct 24, 2012 at 12:22:53PM -0700, Andrew Morton wrote:
> > > > >
> > > > > I'm thinking that such a workload would be the above dd in parallel
> > > > > with a small app which touches the huge page and then exits, then gets
> > > > > executed again. That "small app" sounds realistic to me. Obviously
> > > > > one could exercise the zero page's refcount at higher frequency with a
> > > > > tight map/touch/unmap loop, but that sounds less realistic. It's worth
> > > > > trying that exercise as well though.
> > > > >
> > > > > Or do something else. But we should try to probe this code's
> > > > > worst-case behaviour, get an understanding of its effects and then
> > > > > decide whether any such workload is realisic enough to worry about.
> > > >
> > > > Okay, I'll try few memory pressure scenarios.
> >
> > A test program:
> >
> > while (1) {
> > posix_memalign((void **)&p, 2 * MB, 2 * MB);
> > assert(*p == 0);
> > free(p);
> > }
> >
> > With this code in background we have pretty good chance to have huge zero
> > page freeable (refcount == 1) when shrinker callback called - roughly one
> > of two.
> >
> > Pagecache hog (dd if=hugefile of=/dev/null bs=1M) creates enough pressure
> > to get shrinker callback called, but it was only asked about cache size
> > (nr_to_scan == 0).
> > I was not able to get it called with nr_to_scan > 0 on this scenario, so
> > hzp never freed.
>
> hm. It's odd that the kernel didn't try to shrink slabs in this case.
> Why didn't it??

nr_to_scan == 0 asks for the fast path. The shrinker callback can still
shrink if it thinks it's a good idea.

>
> > I also tried another scenario: usemem -n16 100M -r 1000. It creates real
> > memory pressure - no easy reclaimable memory. This time callback called
> > with nr_to_scan > 0 and we freed hzp. Under pressure we fails to allocate
> > hzp and code goes to fallback path as it supposed to.
> >
> > Do I need to check any other scenario?
>
> I'm thinking that if we do hit problems in this area, we could avoid
> freeing the hugepage unless the scan_control.priority is high enough.
> That would involve adding a magic number or a tunable to set the
> threshold.

What about a ratelimit on the alloc path to force the fallback if we allocate
too often? Is that a good idea?

> Also, it would be beneficial if we can monitor this easily. Perhaps
> add a counter to /proc/vmstat which tells us how many times that page
> has been reallocated? And perhaps how many times we tried to allocate
> it but failed?

Okay, I'll prepare a patch.

--
Kirill A. Shutemov

2012-10-25 21:37:11

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH v4 10/10] thp: implement refcounting for huge zero page

On Fri, 26 Oct 2012 00:22:51 +0300
"Kirill A. Shutemov" <[email protected]> wrote:

> On Thu, Oct 25, 2012 at 02:05:24PM -0700, Andrew Morton wrote:
> > On Thu, 25 Oct 2012 23:49:59 +0300
> > "Kirill A. Shutemov" <[email protected]> wrote:
> >
> > > On Wed, Oct 24, 2012 at 01:25:52PM -0700, Andrew Morton wrote:
> > > > On Wed, 24 Oct 2012 22:45:52 +0300
> > > > "Kirill A. Shutemov" <[email protected]> wrote:
> > > >
> > > > > On Wed, Oct 24, 2012 at 12:22:53PM -0700, Andrew Morton wrote:
> > > > > >
> > > > > > I'm thinking that such a workload would be the above dd in parallel
> > > > > > with a small app which touches the huge page and then exits, then gets
> > > > > > executed again. That "small app" sounds realistic to me. Obviously
> > > > > > one could exercise the zero page's refcount at higher frequency with a
> > > > > > tight map/touch/unmap loop, but that sounds less realistic. It's worth
> > > > > > trying that exercise as well though.
> > > > > >
> > > > > > Or do something else. But we should try to probe this code's
> > > > > > worst-case behaviour, get an understanding of its effects and then
> > > > > > decide whether any such workload is realistic enough to worry about.
> > > > >
> > > > > Okay, I'll try a few memory pressure scenarios.
> > >
> > > A test program:
> > >
> > > while (1) {
> > > 	posix_memalign((void **)&p, 2 * MB, 2 * MB);
> > > 	assert(*p == 0);
> > > 	free(p);
> > > }
> > >
> > > With this code running in the background we have a pretty good chance of
> > > having the huge zero page freeable (refcount == 1) when the shrinker
> > > callback is called - roughly one time in two.
> > >
> > > A pagecache hog (dd if=hugefile of=/dev/null bs=1M) creates enough
> > > pressure to get the shrinker callback called, but it was only asked
> > > about the cache size (nr_to_scan == 0).
> > > I was not able to get it called with nr_to_scan > 0 in this scenario, so
> > > hzp was never freed.
> >
> > hm. It's odd that the kernel didn't try to shrink slabs in this case.
> > Why didn't it??
>
> nr_to_scan == 0 asks for the fast path. The shrinker callback can still
> shrink if it thinks it's a good idea.

What nr_objects does your shrinker return in that case? If it's "1"
then it would be unsurprising that the core code decides not to
shrink.

> >
> > > I also tried another scenario: usemem -n16 100M -r 1000. It creates real
> > > memory pressure - no easily reclaimable memory. This time the callback
> > > was called with nr_to_scan > 0 and we freed hzp. Under pressure we fail
> > > to allocate hzp and the code goes to the fallback path, as it is supposed
> > > to.
> > >
> > > Do I need to check any other scenario?
> >
> > I'm thinking that if we do hit problems in this area, we could avoid
> > freeing the hugepage unless the scan_control.priority is high enough.
> > That would involve adding a magic number or a tunable to set the
> > threshold.
>
> What about a ratelimit on the alloc path to force the fallback if we
> allocate too often? Is that a good idea?

mmm... ratelimit via walltime is always a bad idea. We could
ratelimit by "number of times the shrinker was called", and maybe that
would work OK, unsure.

It *is* appropriate to use sc->priority to be more reluctant to release
expensive-to-reestablish objects. But there is already actually a
mechanism in the shrinker code to handle this: the shrinker.seeks
field. That was originally added to provide an estimate of "how
expensive will it be to recreate this object if we were to reclaim it".
So perhaps we could generalise that a bit, and state that the zero
hugepage is an expensive thing.

I don't think the shrinker.seeks facility has ever been used much,
so it's possible that it is presently mistuned or not working very
well.

2012-10-25 22:09:40

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCH v4 10/10] thp: implement refcounting for huge zero page

On Thu, Oct 25, 2012 at 02:37:07PM -0700, Andrew Morton wrote:
> On Fri, 26 Oct 2012 00:22:51 +0300
> "Kirill A. Shutemov" <[email protected]> wrote:
>
> > On Thu, Oct 25, 2012 at 02:05:24PM -0700, Andrew Morton wrote:
> > > hm. It's odd that the kernel didn't try to shrink slabs in this case.
> > > Why didn't it??
> >
> > nr_to_scan == 0 asks for the fast path. The shrinker callback can still
> > shrink if it thinks it's a good idea.
>
> What nr_objects does your shrinker return in that case?

HPAGE_PMD_NR if hzp is freeable, otherwise 0.
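
For a rough sense of why even that can leave nr_to_scan at zero under a
pure pagecache load, here is a simplified model of how shrink_slab()
(mm/vmscan.c, around v3.7) sizes the work it hands to a shrinker. The
function name hzp_scan_target is made up, the arithmetic is an
approximation for illustration only, and it ignores the leftover work the
real code carries between calls:

/*
 * Approximate scan target for one shrink_slab() pass.  Work is handed to
 * the callback in batches of SHRINK_BATCH (128), so the callback keeps
 * seeing nr_to_scan == 0 until the accumulated target reaches that size.
 */
static unsigned long hzp_scan_target(unsigned long reported_objects,
				     unsigned long nr_pages_scanned,
				     unsigned long lru_pages,
				     unsigned int seeks)
{
	unsigned long long delta = (4ULL * nr_pages_scanned) / seeks;

	delta *= reported_objects;	/* what we returned for nr_to_scan == 0 */
	delta /= lru_pages + 1;		/* normalise by the scanned LRU size */
	return (unsigned long)delta;
}

With reported_objects at HPAGE_PMD_NR (512), seeks at DEFAULT_SEEKS (2) and
a few GB worth of lru_pages, the target stays tiny unless a sizeable
fraction of the LRUs has already been scanned, which is consistent with the
pagecache hog above never producing nr_to_scan > 0. A report of 1 would
make it essentially always round down to zero.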

> > > > I also tried another scenario: usemem -n16 100M -r 1000. It creates real
> > > > memory pressure - no easily reclaimable memory. This time the callback
> > > > was called with nr_to_scan > 0 and we freed hzp. Under pressure we fail
> > > > to allocate hzp and the code goes to the fallback path, as it is supposed
> > > > to.
> > > >
> > > > Do I need to check any other scenario?
> > >
> > > I'm thinking that if we do hit problems in this area, we could avoid
> > > freeing the hugepage unless the scan_control.priority is high enough.
> > > That would involve adding a magic number or a tunable to set the
> > > threshold.
> >
> > What about a ratelimit on the alloc path to force the fallback if we
> > allocate too often? Is that a good idea?
>
> mmm... ratelimit via walltime is always a bad idea. We could
> ratelimit by "number of times the shrinker was called", and maybe that
> would work OK, unsure.
>
> It *is* appropriate to use sc->priority to be more reluctant to release
> expensive-to-reestablish objects. But there is already actually a
> mechanism in the shrinker code to handle this: the shrinker.seeks
> field. That was originally added to provide an estimate of "how
> expensive will it be to recreate this object if we were to reclaim it".
> So perhaps we could generalise that a bit, and state that the zero
> hugepage is an expensive thing.

I've proposed DEFAULT_SEEKS * 4 already.

> I don't think the shrinker.seeks facility has ever been used much,
> so it's possible that it is presently mistuned or not working very
> well.

Yeah, a non-default .seeks is only used in the kvm mmu_shrinker and in a
few places in staging/android/.
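
As a sketch of what a DEFAULT_SEEKS * 4 setting would mean in code here
(reusing the hypothetical hzp_shrink callback and hzp_shrinker declaration
from the sketch earlier in the thread; this is not the actual patch):

static struct shrinker hzp_shrinker = {
	.shrink	= hzp_shrink,
	/*
	 * DEFAULT_SEEKS is 2; quadrupling it tells shrink_slab() that the
	 * huge zero page is expensive to recreate, which cuts the scan
	 * target computed for it (see the model above) by the same factor.
	 */
	.seeks	= DEFAULT_SEEKS * 4,
};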

--
Kirill A. Shutemov


2012-10-26 15:23:18

by Kirill A. Shutemov

[permalink] [raw]
Subject: [PATCH] thp, vmstat: implement HZP_ALLOC and HZP_ALLOC_FAILED events

From: "Kirill A. Shutemov" <[email protected]>

The HZP_ALLOC event triggers on every huge zero page allocation, including
allocations which were dropped due to a race with another allocation.

The HZP_ALLOC_FAILED event triggers when a huge zero page allocation fails
(ENOMEM).

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
include/linux/vm_event_item.h | 2 ++
mm/huge_memory.c | 5 ++++-
mm/vmstat.c | 2 ++
3 files changed, 8 insertions(+), 1 deletions(-)

diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index 3d31145..d7156fb 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -58,6 +58,8 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
 		THP_COLLAPSE_ALLOC,
 		THP_COLLAPSE_ALLOC_FAILED,
 		THP_SPLIT,
+		HZP_ALLOC,
+		HZP_ALLOC_FAILED,
 #endif
 		NR_VM_EVENT_ITEMS
 };
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 92a1b66..492658a 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -183,8 +183,11 @@ retry:

 	zero_page = alloc_pages((GFP_TRANSHUGE | __GFP_ZERO) & ~__GFP_MOVABLE,
 			HPAGE_PMD_ORDER);
-	if (!zero_page)
+	if (!zero_page) {
+		count_vm_event(HZP_ALLOC_FAILED);
 		return 0;
+	}
+	count_vm_event(HZP_ALLOC);
 	preempt_disable();
 	if (cmpxchg(&huge_zero_pfn, 0, page_to_pfn(zero_page))) {
 		preempt_enable();
diff --git a/mm/vmstat.c b/mm/vmstat.c
index c737057..cb8901c 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -801,6 +801,8 @@ const char * const vmstat_text[] = {
"thp_collapse_alloc",
"thp_collapse_alloc_failed",
"thp_split",
+ "hzp_alloc",
+ "hzp_alloc_failed",
#endif

#endif /* CONFIG_VM_EVENTS_COUNTERS */
--
1.7.7.6