2012-10-02 15:19:01

by Kirill A. Shutemov

Subject: [PATCH v3 00/10] Introduce huge zero page

From: "Kirill A. Shutemov" <[email protected]>

During testing I noticed a big (up to 2.5 times) memory consumption overhead
on some workloads (e.g. ft.A from NPB) if THP is enabled.

The main reason for that big difference is the lack of a zero page in the THP
case: we have to allocate a real page on a read page fault.

A program to demonstrate the issue:
#include <assert.h>
#include <stdlib.h>
#include <unistd.h>

#define MB 1024*1024

int main(int argc, char **argv)
{
        char *p;
        int i;

        posix_memalign((void **)&p, 2 * MB, 200 * MB);
        for (i = 0; i < 200 * MB; i += 4096)
                assert(p[i] == 0);
        pause();
        return 0;
}

With thp-never, RSS is about 400k, but with thp-always it's 200M.
After the patchset, thp-always RSS is 400k too.

v3:
- fix a potential deadlock in the refcounting code on preemptible kernels.
- do not mark huge zero page as movable.
- fix typo in comment.
- Reviewed-by tag from Andrea Arcangeli.
v2:
- Avoid find_vma() if we already have the vma on the stack.
  Suggested by Andrea Arcangeli.
- Implement refcounting for huge zero page.

Kirill A. Shutemov (10):
thp: huge zero page: basic preparation
thp: zap_huge_pmd(): zap huge zero pmd
thp: copy_huge_pmd(): copy huge zero page
thp: do_huge_pmd_wp_page(): handle huge zero page
thp: change_huge_pmd(): keep huge zero page write-protected
thp: change split_huge_page_pmd() interface
thp: implement splitting pmd for huge zero page
thp: setup huge zero page on non-write page fault
thp: lazy huge zero page allocation
thp: implement refcounting for huge zero page

Documentation/vm/transhuge.txt | 4 +-
arch/x86/kernel/vm86_32.c | 2 +-
fs/proc/task_mmu.c | 2 +-
include/linux/huge_mm.h | 14 ++-
include/linux/mm.h | 8 +
mm/huge_memory.c | 307 ++++++++++++++++++++++++++++++++++++----
mm/memory.c | 11 +--
mm/mempolicy.c | 2 +-
mm/mprotect.c | 2 +-
mm/mremap.c | 2 +-
mm/pagewalk.c | 2 +-
11 files changed, 305 insertions(+), 51 deletions(-)

--
1.7.7.6


2012-10-02 15:19:00

by Kirill A. Shutemov

Subject: [PATCH v3 02/10] thp: zap_huge_pmd(): zap huge zero pmd

From: "Kirill A. Shutemov" <[email protected]>

We don't have a real page to zap in the huge zero page case. Let's just
clear the pmd and remove it from the TLB.

Signed-off-by: Kirill A. Shutemov <[email protected]>
Reviewed-by: Andrea Arcangeli <[email protected]>
---
mm/huge_memory.c | 27 +++++++++++++++++----------
1 files changed, 17 insertions(+), 10 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 50c44e9..140d858 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1072,16 +1072,23 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
struct page *page;
pgtable_t pgtable;
pgtable = get_pmd_huge_pte(tlb->mm);
- page = pmd_page(*pmd);
- pmd_clear(pmd);
- tlb_remove_pmd_tlb_entry(tlb, pmd, addr);
- page_remove_rmap(page);
- VM_BUG_ON(page_mapcount(page) < 0);
- add_mm_counter(tlb->mm, MM_ANONPAGES, -HPAGE_PMD_NR);
- VM_BUG_ON(!PageHead(page));
- tlb->mm->nr_ptes--;
- spin_unlock(&tlb->mm->page_table_lock);
- tlb_remove_page(tlb, page);
+ if (is_huge_zero_pmd(*pmd)) {
+ pmd_clear(pmd);
+ tlb_remove_pmd_tlb_entry(tlb, pmd, addr);
+ tlb->mm->nr_ptes--;
+ spin_unlock(&tlb->mm->page_table_lock);
+ } else {
+ page = pmd_page(*pmd);
+ pmd_clear(pmd);
+ tlb_remove_pmd_tlb_entry(tlb, pmd, addr);
+ page_remove_rmap(page);
+ VM_BUG_ON(page_mapcount(page) < 0);
+ add_mm_counter(tlb->mm, MM_ANONPAGES, -HPAGE_PMD_NR);
+ VM_BUG_ON(!PageHead(page));
+ tlb->mm->nr_ptes--;
+ spin_unlock(&tlb->mm->page_table_lock);
+ tlb_remove_page(tlb, page);
+ }
pte_free(tlb->mm, pgtable);
ret = 1;
}
--
1.7.7.6

2012-10-02 15:19:04

by Kirill A. Shutemov

Subject: [PATCH v3 08/10] thp: setup huge zero page on non-write page fault

From: "Kirill A. Shutemov" <[email protected]>

All code paths seem to be covered. Now we can map the huge zero page on a
read page fault.

Signed-off-by: Kirill A. Shutemov <[email protected]>
Reviewed-by: Andrea Arcangeli <[email protected]>
---
mm/huge_memory.c | 10 ++++++++++
1 files changed, 10 insertions(+), 0 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 3f1c59c..a5b9282 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -751,6 +751,16 @@ int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
return VM_FAULT_OOM;
if (unlikely(khugepaged_enter(vma)))
return VM_FAULT_OOM;
+ if (!(flags & FAULT_FLAG_WRITE)) {
+ pgtable_t pgtable;
+ pgtable = pte_alloc_one(mm, haddr);
+ if (unlikely(!pgtable))
+ goto out;
+ spin_lock(&mm->page_table_lock);
+ set_huge_zero_page(pgtable, mm, vma, haddr, pmd);
+ spin_unlock(&mm->page_table_lock);
+ return 0;
+ }
page = alloc_hugepage_vma(transparent_hugepage_defrag(vma),
vma, haddr, numa_node_id(), 0);
if (unlikely(!page)) {
--
1.7.7.6

2012-10-02 15:19:26

by Kirill A. Shutemov

Subject: [PATCH v3 06/10] thp: change split_huge_page_pmd() interface

From: "Kirill A. Shutemov" <[email protected]>

Pass the vma instead of the mm and add an address parameter.

In most cases we already have the vma on the stack. We provide
split_huge_page_pmd_mm() for the few cases where we have the mm but not the vma.

This change is preparation for the huge zero pmd splitting implementation.

Signed-off-by: Kirill A. Shutemov <[email protected]>
Reviewed-by: Andrea Arcangeli <[email protected]>
---
Documentation/vm/transhuge.txt | 4 ++--
arch/x86/kernel/vm86_32.c | 2 +-
fs/proc/task_mmu.c | 2 +-
include/linux/huge_mm.h | 14 ++++++++++----
mm/huge_memory.c | 24 +++++++++++++++++++-----
mm/memory.c | 4 ++--
mm/mempolicy.c | 2 +-
mm/mprotect.c | 2 +-
mm/mremap.c | 2 +-
mm/pagewalk.c | 2 +-
10 files changed, 39 insertions(+), 19 deletions(-)

diff --git a/Documentation/vm/transhuge.txt b/Documentation/vm/transhuge.txt
index f734bb2..677a599 100644
--- a/Documentation/vm/transhuge.txt
+++ b/Documentation/vm/transhuge.txt
@@ -276,7 +276,7 @@ unaffected. libhugetlbfs will also work fine as usual.
== Graceful fallback ==

Code walking pagetables but unware about huge pmds can simply call
-split_huge_page_pmd(mm, pmd) where the pmd is the one returned by
+split_huge_page_pmd(vma, pmd, addr) where the pmd is the one returned by
pmd_offset. It's trivial to make the code transparent hugepage aware
by just grepping for "pmd_offset" and adding split_huge_page_pmd where
missing after pmd_offset returns the pmd. Thanks to the graceful
@@ -299,7 +299,7 @@ diff --git a/mm/mremap.c b/mm/mremap.c
return NULL;

pmd = pmd_offset(pud, addr);
-+ split_huge_page_pmd(mm, pmd);
++ split_huge_page_pmd(vma, pmd, addr);
if (pmd_none_or_clear_bad(pmd))
return NULL;

diff --git a/arch/x86/kernel/vm86_32.c b/arch/x86/kernel/vm86_32.c
index 54abcc0..22840bb 100644
--- a/arch/x86/kernel/vm86_32.c
+++ b/arch/x86/kernel/vm86_32.c
@@ -182,7 +182,7 @@ static void mark_screen_rdonly(struct mm_struct *mm)
if (pud_none_or_clear_bad(pud))
goto out;
pmd = pmd_offset(pud, 0xA0000);
- split_huge_page_pmd(mm, pmd);
+ split_huge_page_pmd_mm(mm, 0xA0000, pmd);
if (pmd_none_or_clear_bad(pmd))
goto out;
pte = pte_offset_map_lock(mm, pmd, 0xA0000, &ptl);
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 4540b8f..766d5d7 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -597,7 +597,7 @@ static int clear_refs_pte_range(pmd_t *pmd, unsigned long addr,
spinlock_t *ptl;
struct page *page;

- split_huge_page_pmd(walk->mm, pmd);
+ split_huge_page_pmd(vma, addr, pmd);
if (pmd_trans_unstable(pmd))
return 0;

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 4c59b11..c68e073 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -92,12 +92,14 @@ extern int handle_pte_fault(struct mm_struct *mm,
struct vm_area_struct *vma, unsigned long address,
pte_t *pte, pmd_t *pmd, unsigned int flags);
extern int split_huge_page(struct page *page);
-extern void __split_huge_page_pmd(struct mm_struct *mm, pmd_t *pmd);
-#define split_huge_page_pmd(__mm, __pmd) \
+extern void __split_huge_page_pmd(struct vm_area_struct *vma,
+ unsigned long address, pmd_t *pmd);
+#define split_huge_page_pmd(__vma, __address, __pmd) \
do { \
pmd_t *____pmd = (__pmd); \
if (unlikely(pmd_trans_huge(*____pmd))) \
- __split_huge_page_pmd(__mm, ____pmd); \
+ __split_huge_page_pmd(__vma, __address, \
+ ____pmd); \
} while (0)
#define wait_split_huge_page(__anon_vma, __pmd) \
do { \
@@ -107,6 +109,8 @@ extern void __split_huge_page_pmd(struct mm_struct *mm, pmd_t *pmd);
BUG_ON(pmd_trans_splitting(*____pmd) || \
pmd_trans_huge(*____pmd)); \
} while (0)
+extern void split_huge_page_pmd_mm(struct mm_struct *mm, unsigned long address,
+ pmd_t *pmd);
#if HPAGE_PMD_ORDER > MAX_ORDER
#error "hugepages can't be allocated by the buddy allocator"
#endif
@@ -174,10 +178,12 @@ static inline int split_huge_page(struct page *page)
{
return 0;
}
-#define split_huge_page_pmd(__mm, __pmd) \
+#define split_huge_page_pmd(__vma, __address, __pmd) \
do { } while (0)
#define wait_split_huge_page(__anon_vma, __pmd) \
do { } while (0)
+#define split_huge_page_pmd_mm(__mm, __address, __pmd) \
+ do { } while (0)
#define compound_trans_head(page) compound_head(page)
static inline int hugepage_madvise(struct vm_area_struct *vma,
unsigned long *vm_flags, int advice)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index d864cd4..95032d3 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2503,19 +2503,23 @@ static int khugepaged(void *none)
return 0;
}

-void __split_huge_page_pmd(struct mm_struct *mm, pmd_t *pmd)
+void __split_huge_page_pmd(struct vm_area_struct *vma, unsigned long address,
+ pmd_t *pmd)
{
struct page *page;
+ unsigned long haddr = address & HPAGE_PMD_MASK;

- spin_lock(&mm->page_table_lock);
+ BUG_ON(vma->vm_start > haddr || vma->vm_end < haddr + HPAGE_PMD_SIZE);
+
+ spin_lock(&vma->vm_mm->page_table_lock);
if (unlikely(!pmd_trans_huge(*pmd))) {
- spin_unlock(&mm->page_table_lock);
+ spin_unlock(&vma->vm_mm->page_table_lock);
return;
}
page = pmd_page(*pmd);
VM_BUG_ON(!page_count(page));
get_page(page);
- spin_unlock(&mm->page_table_lock);
+ spin_unlock(&vma->vm_mm->page_table_lock);

split_huge_page(page);

@@ -2523,6 +2527,16 @@ void __split_huge_page_pmd(struct mm_struct *mm, pmd_t *pmd)
BUG_ON(pmd_trans_huge(*pmd));
}

+void split_huge_page_pmd_mm(struct mm_struct *mm, unsigned long address,
+ pmd_t *pmd)
+{
+ struct vm_area_struct *vma;
+
+ vma = find_vma(mm, address);
+ BUG_ON(vma == NULL);
+ split_huge_page_pmd(vma, address, pmd);
+}
+
static void split_huge_page_address(struct mm_struct *mm,
unsigned long address)
{
@@ -2547,7 +2561,7 @@ static void split_huge_page_address(struct mm_struct *mm,
* Caller holds the mmap_sem write mode, so a huge pmd cannot
* materialize from under us.
*/
- split_huge_page_pmd(mm, pmd);
+ split_huge_page_pmd_mm(mm, address, pmd);
}

void __vma_adjust_trans_huge(struct vm_area_struct *vma,
diff --git a/mm/memory.c b/mm/memory.c
index dbd92ba..312c21d 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1236,7 +1236,7 @@ static inline unsigned long zap_pmd_range(struct mmu_gather *tlb,
BUG();
}
#endif
- split_huge_page_pmd(vma->vm_mm, pmd);
+ split_huge_page_pmd(vma, addr, pmd);
} else if (zap_huge_pmd(tlb, vma, pmd, addr))
goto next;
/* fall through */
@@ -1505,7 +1505,7 @@ struct page *follow_page(struct vm_area_struct *vma, unsigned long address,
}
if (pmd_trans_huge(*pmd)) {
if (flags & FOLL_SPLIT) {
- split_huge_page_pmd(mm, pmd);
+ split_huge_page_pmd(vma, address, pmd);
goto split_fallthrough;
}
spin_lock(&mm->page_table_lock);
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 4ada3be..55ac3b6 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -511,7 +511,7 @@ static inline int check_pmd_range(struct vm_area_struct *vma, pud_t *pud,
pmd = pmd_offset(pud, addr);
do {
next = pmd_addr_end(addr, end);
- split_huge_page_pmd(vma->vm_mm, pmd);
+ split_huge_page_pmd(vma, addr, pmd);
if (pmd_none_or_trans_huge_or_clear_bad(pmd))
continue;
if (check_pte_range(vma, pmd, addr, next, nodes,
diff --git a/mm/mprotect.c b/mm/mprotect.c
index a409926..e8c3938 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -90,7 +90,7 @@ static inline void change_pmd_range(struct vm_area_struct *vma, pud_t *pud,
next = pmd_addr_end(addr, end);
if (pmd_trans_huge(*pmd)) {
if (next - addr != HPAGE_PMD_SIZE)
- split_huge_page_pmd(vma->vm_mm, pmd);
+ split_huge_page_pmd(vma, addr, pmd);
else if (change_huge_pmd(vma, pmd, addr, newprot))
continue;
/* fall through */
diff --git a/mm/mremap.c b/mm/mremap.c
index cc06d0e..292ec46 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -156,7 +156,7 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
need_flush = true;
continue;
} else if (!err) {
- split_huge_page_pmd(vma->vm_mm, old_pmd);
+ split_huge_page_pmd(vma, old_addr, old_pmd);
}
VM_BUG_ON(pmd_trans_huge(*old_pmd));
}
diff --git a/mm/pagewalk.c b/mm/pagewalk.c
index 6c118d0..35aa294 100644
--- a/mm/pagewalk.c
+++ b/mm/pagewalk.c
@@ -58,7 +58,7 @@ again:
if (!walk->pte_entry)
continue;

- split_huge_page_pmd(walk->mm, pmd);
+ split_huge_page_pmd_mm(walk->mm, addr, pmd);
if (pmd_none_or_trans_huge_or_clear_bad(pmd))
goto again;
err = walk_pte_range(pmd, addr, next, walk);
--
1.7.7.6

2012-10-02 15:19:28

by Kirill A. Shutemov

Subject: [PATCH v3 05/10] thp: change_huge_pmd(): keep huge zero page write-protected

From: "Kirill A. Shutemov" <[email protected]>

We want to get a page fault on a write attempt to the huge zero page, so
let's keep it write-protected.

Signed-off-by: Kirill A. Shutemov <[email protected]>
Reviewed-by: Andrea Arcangeli <[email protected]>
---
mm/huge_memory.c | 2 ++
1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index f30f39d..d864cd4 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1249,6 +1249,8 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
pmd_t entry;
entry = pmdp_get_and_clear(mm, addr, pmd);
entry = pmd_modify(entry, newprot);
+ if (is_huge_zero_pmd(entry))
+ entry = pmd_wrprotect(entry);
set_pmd_at(mm, addr, pmd, entry);
spin_unlock(&vma->vm_mm->page_table_lock);
ret = 1;
--
1.7.7.6

2012-10-02 15:19:25

by Kirill A. Shutemov

Subject: [PATCH v3 07/10] thp: implement splitting pmd for huge zero page

From: "Kirill A. Shutemov" <[email protected]>

We can't split the huge zero page itself, but we can split the pmd which
points to it.

On splitting the pmd we create a page table with all ptes set to the normal
zero page.

Signed-off-by: Kirill A. Shutemov <[email protected]>
Reviewed-by: Andrea Arcangeli <[email protected]>
---
mm/huge_memory.c | 32 ++++++++++++++++++++++++++++++++
1 files changed, 32 insertions(+), 0 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 95032d3..3f1c59c 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1600,6 +1600,7 @@ int split_huge_page(struct page *page)
struct anon_vma *anon_vma;
int ret = 1;

+ BUG_ON(is_huge_zero_pfn(page_to_pfn(page)));
BUG_ON(!PageAnon(page));
anon_vma = page_lock_anon_vma(page);
if (!anon_vma)
@@ -2503,6 +2504,32 @@ static int khugepaged(void *none)
return 0;
}

+static void __split_huge_zero_page_pmd(struct vm_area_struct *vma,
+ unsigned long haddr, pmd_t *pmd)
+{
+ pgtable_t pgtable;
+ pmd_t _pmd;
+ int i;
+
+ pmdp_clear_flush_notify(vma, haddr, pmd);
+ /* leave pmd empty until pte is filled */
+
+ pgtable = get_pmd_huge_pte(vma->vm_mm);
+ pmd_populate(vma->vm_mm, &_pmd, pgtable);
+
+ for (i = 0; i < HPAGE_PMD_NR; i++, haddr += PAGE_SIZE) {
+ pte_t *pte, entry;
+ entry = pfn_pte(my_zero_pfn(haddr), vma->vm_page_prot);
+ entry = pte_mkspecial(entry);
+ pte = pte_offset_map(&_pmd, haddr);
+ VM_BUG_ON(!pte_none(*pte));
+ set_pte_at(vma->vm_mm, haddr, pte, entry);
+ pte_unmap(pte);
+ }
+ smp_wmb(); /* make pte visible before pmd */
+ pmd_populate(vma->vm_mm, pmd, pgtable);
+}
+
void __split_huge_page_pmd(struct vm_area_struct *vma, unsigned long address,
pmd_t *pmd)
{
@@ -2516,6 +2543,11 @@ void __split_huge_page_pmd(struct vm_area_struct *vma, unsigned long address,
spin_unlock(&vma->vm_mm->page_table_lock);
return;
}
+ if (is_huge_zero_pmd(*pmd)) {
+ __split_huge_zero_page_pmd(vma, haddr, pmd);
+ spin_unlock(&vma->vm_mm->page_table_lock);
+ return;
+ }
page = pmd_page(*pmd);
VM_BUG_ON(!page_count(page));
get_page(page);
--
1.7.7.6

2012-10-02 15:19:23

by Kirill A. Shutemov

Subject: [PATCH v3 10/10] thp: implement refcounting for huge zero page

From: "Kirill A. Shutemov" <[email protected]>

H. Peter Anvin doesn't like a huge zero page that sticks in memory forever
after the first allocation. Here's an implementation of lockless refcounting
for the huge zero page.

We have two basic primitives: {get,put}_huge_zero_page(). They
manipulate a reference counter.

If the counter is 0, get_huge_zero_page() allocates a new huge page and
takes two references: one for the caller and one for the shrinker. We free
the page only in the shrinker callback, and only if the counter is 1 (i.e.
only the shrinker holds a reference).

put_huge_zero_page() only decrements the counter. The counter is never zero
in put_huge_zero_page() since the shrinker holds a reference.

Freeing the huge zero page in the shrinker callback helps to avoid frequent
allocate-free cycles.

Refcounting has a cost. On a 4-socket machine I observe a ~1% slowdown on
parallel (40 processes) read page faulting compared to lazy huge page
allocation. I think that's pretty reasonable for a synthetic benchmark.

Signed-off-by: Kirill A. Shutemov <[email protected]>
Reviewed-by: Andrea Arcangeli <[email protected]>
---
mm/huge_memory.c | 111 ++++++++++++++++++++++++++++++++++++++++++------------
1 files changed, 87 insertions(+), 24 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 3fdf1b4..6270a45 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -17,6 +17,7 @@
#include <linux/khugepaged.h>
#include <linux/freezer.h>
#include <linux/mman.h>
+#include <linux/shrinker.h>
#include <asm/tlb.h>
#include <asm/pgalloc.h>
#include "internal.h"
@@ -46,7 +47,6 @@ static unsigned int khugepaged_scan_sleep_millisecs __read_mostly = 10000;
/* during fragmentation poll the hugepage allocator once every minute */
static unsigned int khugepaged_alloc_sleep_millisecs __read_mostly = 60000;
static struct task_struct *khugepaged_thread __read_mostly;
-static unsigned long huge_zero_pfn __read_mostly;
static DEFINE_MUTEX(khugepaged_mutex);
static DEFINE_SPINLOCK(khugepaged_mm_lock);
static DECLARE_WAIT_QUEUE_HEAD(khugepaged_wait);
@@ -168,31 +168,74 @@ out:
return err;
}

-static int init_huge_zero_pfn(void)
+static atomic_t huge_zero_refcount;
+static unsigned long huge_zero_pfn __read_mostly;
+
+static inline bool is_huge_zero_pfn(unsigned long pfn)
{
- struct page *hpage;
- unsigned long pfn;
+ unsigned long zero_pfn = ACCESS_ONCE(huge_zero_pfn);
+ return zero_pfn && pfn == zero_pfn;
+}
+
+static inline bool is_huge_zero_pmd(pmd_t pmd)
+{
+ return is_huge_zero_pfn(pmd_pfn(pmd));
+}
+
+static unsigned long get_huge_zero_page(void)
+{
+ struct page *zero_page;
+retry:
+ if (likely(atomic_inc_not_zero(&huge_zero_refcount)))
+ return ACCESS_ONCE(huge_zero_pfn);

- hpage = alloc_pages((GFP_TRANSHUGE | __GFP_ZERO) & ~__GFP_MOVABLE,
+ zero_page = alloc_pages((GFP_TRANSHUGE | __GFP_ZERO) & ~__GFP_MOVABLE,
HPAGE_PMD_ORDER);
- if (!hpage)
- return -ENOMEM;
- pfn = page_to_pfn(hpage);
- if (cmpxchg(&huge_zero_pfn, 0, pfn))
- __free_page(hpage);
- return 0;
+ if (!zero_page)
+ return 0;
+ preempt_disable();
+ if (cmpxchg(&huge_zero_pfn, 0, page_to_pfn(zero_page))) {
+ preempt_enable();
+ __free_page(zero_page);
+ goto retry;
+ }
+
+ /* We take additional reference here. It will be put back by shrinker */
+ atomic_set(&huge_zero_refcount, 2);
+ preempt_enable();
+ return ACCESS_ONCE(huge_zero_pfn);
}

-static inline bool is_huge_zero_pfn(unsigned long pfn)
+static void put_huge_zero_page(void)
{
- return huge_zero_pfn && pfn == huge_zero_pfn;
+ /*
+ * Counter should never go to zero here. Only shrinker can put
+ * last reference.
+ */
+ BUG_ON(atomic_dec_and_test(&huge_zero_refcount));
}

-static inline bool is_huge_zero_pmd(pmd_t pmd)
+static int shrink_huge_zero_page(struct shrinker *shrink,
+ struct shrink_control *sc)
{
- return is_huge_zero_pfn(pmd_pfn(pmd));
+ if (!sc->nr_to_scan)
+ /* we can free zero page only if last reference remains */
+ return atomic_read(&huge_zero_refcount) == 1 ? HPAGE_PMD_NR : 0;
+
+ if (atomic_cmpxchg(&huge_zero_refcount, 1, 0) == 1) {
+ unsigned long zero_pfn = xchg(&huge_zero_pfn, 0);
+ BUG_ON(zero_pfn == 0);
+ __free_page(__pfn_to_page(zero_pfn));
+ }
+
+ return 0;
}

+static struct shrinker huge_zero_page_shrinker = {
+ .shrink = shrink_huge_zero_page,
+ .seeks = DEFAULT_SEEKS,
+};
+
#ifdef CONFIG_SYSFS

static ssize_t double_flag_show(struct kobject *kobj,
@@ -586,6 +629,8 @@ static int __init hugepage_init(void)
goto out;
}

+ register_shrinker(&huge_zero_page_shrinker);
+
/*
* By default disable transparent hugepages on smaller systems,
* where the extra memory used could hurt more than TLB overhead
@@ -723,10 +768,11 @@ static inline struct page *alloc_hugepage(int defrag)
#endif

static void set_huge_zero_page(pgtable_t pgtable, struct mm_struct *mm,
- struct vm_area_struct *vma, unsigned long haddr, pmd_t *pmd)
+ struct vm_area_struct *vma, unsigned long haddr, pmd_t *pmd,
+ unsigned long zero_pfn)
{
pmd_t entry;
- entry = pfn_pmd(huge_zero_pfn, vma->vm_page_prot);
+ entry = pfn_pmd(zero_pfn, vma->vm_page_prot);
entry = pmd_wrprotect(entry);
entry = pmd_mkhuge(entry);
set_pmd_at(mm, haddr, pmd, entry);
@@ -749,15 +795,19 @@ int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
return VM_FAULT_OOM;
if (!(flags & FAULT_FLAG_WRITE)) {
pgtable_t pgtable;
- if (unlikely(!huge_zero_pfn && init_huge_zero_pfn())) {
- count_vm_event(THP_FAULT_FALLBACK);
- goto out;
- }
+ unsigned long zero_pfn;
pgtable = pte_alloc_one(mm, haddr);
if (unlikely(!pgtable))
goto out;
+ zero_pfn = get_huge_zero_page();
+ if (unlikely(!zero_pfn)) {
+ pte_free(mm, pgtable);
+ count_vm_event(THP_FAULT_FALLBACK);
+ goto out;
+ }
spin_lock(&mm->page_table_lock);
- set_huge_zero_page(pgtable, mm, vma, haddr, pmd);
+ set_huge_zero_page(pgtable, mm, vma, haddr, pmd,
+ zero_pfn);
spin_unlock(&mm->page_table_lock);
return 0;
}
@@ -826,7 +876,15 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
goto out_unlock;
}
if (is_huge_zero_pmd(pmd)) {
- set_huge_zero_page(pgtable, dst_mm, vma, addr, dst_pmd);
+ unsigned long zero_pfn;
+ /*
+ * get_huge_zero_page() will never allocate a new page here,
+ * since we already have a zero page to copy. It just takes a
+ * reference.
+ */
+ zero_pfn = get_huge_zero_page();
+ set_huge_zero_page(pgtable, dst_mm, vma, addr, dst_pmd,
+ zero_pfn);
ret = 0;
goto out_unlock;
}
@@ -927,6 +985,7 @@ static int do_huge_pmd_wp_zero_page_fallback(struct mm_struct *mm,
smp_wmb(); /* make pte visible before pmd */
pmd_populate(mm, pmd, pgtable);
spin_unlock(&mm->page_table_lock);
+ put_huge_zero_page();

ret |= VM_FAULT_WRITE;
out:
@@ -1111,8 +1170,10 @@ alloc:
page_add_new_anon_rmap(new_page, vma, haddr);
set_pmd_at(mm, haddr, pmd, entry);
update_mmu_cache(vma, address, entry);
- if (is_huge_zero_pmd(orig_pmd))
+ if (is_huge_zero_pmd(orig_pmd)) {
add_mm_counter(mm, MM_ANONPAGES, HPAGE_PMD_NR);
+ put_huge_zero_page();
+ }
if (page) {
VM_BUG_ON(!PageHead(page));
page_remove_rmap(page);
@@ -1176,6 +1237,7 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
tlb_remove_pmd_tlb_entry(tlb, pmd, addr);
tlb->mm->nr_ptes--;
spin_unlock(&tlb->mm->page_table_lock);
+ put_huge_zero_page();
} else {
page = pmd_page(*pmd);
pmd_clear(pmd);
@@ -2538,6 +2600,7 @@ static void __split_huge_zero_page_pmd(struct vm_area_struct *vma,
}
smp_wmb(); /* make pte visible before pmd */
pmd_populate(vma->vm_mm, pmd, pgtable);
+ put_huge_zero_page();
}

void __split_huge_page_pmd(struct vm_area_struct *vma, unsigned long address,
--
1.7.7.6

2012-10-02 15:20:30

by Kirill A. Shutemov

Subject: [PATCH v3 09/10] thp: lazy huge zero page allocation

From: "Kirill A. Shutemov" <[email protected]>

Instead of allocating the huge zero page in hugepage_init() we can postpone it
until the first huge zero page mapping. This saves memory if THP is not in use.

cmpxchg() is used to avoid a race on huge_zero_pfn initialization.

Signed-off-by: Kirill A. Shutemov <[email protected]>
Reviewed-by: Andrea Arcangeli <[email protected]>
---
mm/huge_memory.c | 20 ++++++++++----------
1 files changed, 10 insertions(+), 10 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index a5b9282..3fdf1b4 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -168,22 +168,24 @@ out:
return err;
}

-static int init_huge_zero_page(void)
+static int init_huge_zero_pfn(void)
{
struct page *hpage;
+ unsigned long pfn;

hpage = alloc_pages((GFP_TRANSHUGE | __GFP_ZERO) & ~__GFP_MOVABLE,
HPAGE_PMD_ORDER);
if (!hpage)
return -ENOMEM;
-
- huge_zero_pfn = page_to_pfn(hpage);
+ pfn = page_to_pfn(hpage);
+ if (cmpxchg(&huge_zero_pfn, 0, pfn))
+ __free_page(hpage);
return 0;
}

static inline bool is_huge_zero_pfn(unsigned long pfn)
{
- return pfn == huge_zero_pfn;
+ return huge_zero_pfn && pfn == huge_zero_pfn;
}

static inline bool is_huge_zero_pmd(pmd_t pmd)
@@ -574,10 +576,6 @@ static int __init hugepage_init(void)
if (err)
return err;

- err = init_huge_zero_page();
- if (err)
- goto out;
-
err = khugepaged_slab_init();
if (err)
goto out;
@@ -602,8 +600,6 @@ static int __init hugepage_init(void)

return 0;
out:
- if (huge_zero_pfn)
- __free_page(pfn_to_page(huge_zero_pfn));
hugepage_exit_sysfs(hugepage_kobj);
return err;
}
@@ -753,6 +749,10 @@ int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
return VM_FAULT_OOM;
if (!(flags & FAULT_FLAG_WRITE)) {
pgtable_t pgtable;
+ if (unlikely(!huge_zero_pfn && init_huge_zero_pfn())) {
+ count_vm_event(THP_FAULT_FALLBACK);
+ goto out;
+ }
pgtable = pte_alloc_one(mm, haddr);
if (unlikely(!pgtable))
goto out;
--
1.7.7.6

2012-10-02 15:20:57

by Kirill A. Shutemov

Subject: [PATCH v3 01/10] thp: huge zero page: basic preparation

From: "Kirill A. Shutemov" <[email protected]>

For now let's allocate the page in hugepage_init(). We'll switch to lazy
allocation later.

We are not going to map the huge zero page until we can handle it
properly on all code paths.

The is_huge_zero_{pfn,pmd}() functions will be used by the following patches
to check whether a pfn/pmd is the huge zero page.

Signed-off-by: Kirill A. Shutemov <[email protected]>
Reviewed-by: Andrea Arcangeli <[email protected]>
---
mm/huge_memory.c | 30 ++++++++++++++++++++++++++++++
1 files changed, 30 insertions(+), 0 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 141dbb6..50c44e9 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -46,6 +46,7 @@ static unsigned int khugepaged_scan_sleep_millisecs __read_mostly = 10000;
/* during fragmentation poll the hugepage allocator once every minute */
static unsigned int khugepaged_alloc_sleep_millisecs __read_mostly = 60000;
static struct task_struct *khugepaged_thread __read_mostly;
+static unsigned long huge_zero_pfn __read_mostly;
static DEFINE_MUTEX(khugepaged_mutex);
static DEFINE_SPINLOCK(khugepaged_mm_lock);
static DECLARE_WAIT_QUEUE_HEAD(khugepaged_wait);
@@ -167,6 +168,29 @@ out:
return err;
}

+static int init_huge_zero_page(void)
+{
+ struct page *hpage;
+
+ hpage = alloc_pages((GFP_TRANSHUGE | __GFP_ZERO) & ~__GFP_MOVABLE,
+ HPAGE_PMD_ORDER);
+ if (!hpage)
+ return -ENOMEM;
+
+ huge_zero_pfn = page_to_pfn(hpage);
+ return 0;
+}
+
+static inline bool is_huge_zero_pfn(unsigned long pfn)
+{
+ return pfn == huge_zero_pfn;
+}
+
+static inline bool is_huge_zero_pmd(pmd_t pmd)
+{
+ return is_huge_zero_pfn(pmd_pfn(pmd));
+}
+
#ifdef CONFIG_SYSFS

static ssize_t double_flag_show(struct kobject *kobj,
@@ -550,6 +574,10 @@ static int __init hugepage_init(void)
if (err)
return err;

+ err = init_huge_zero_page();
+ if (err)
+ goto out;
+
err = khugepaged_slab_init();
if (err)
goto out;
@@ -574,6 +602,8 @@ static int __init hugepage_init(void)

return 0;
out:
+ if (huge_zero_pfn)
+ __free_page(pfn_to_page(huge_zero_pfn));
hugepage_exit_sysfs(hugepage_kobj);
return err;
}
--
1.7.7.6

2012-10-02 15:21:16

by Kirill A. Shutemov

Subject: [PATCH v3 03/10] thp: copy_huge_pmd(): copy huge zero page

From: "Kirill A. Shutemov" <[email protected]>

It's easy to copy the huge zero page: just set the destination pmd to the
huge zero page.

It's safe to copy the huge zero page since we have none yet :-p

Signed-off-by: Kirill A. Shutemov <[email protected]>
Reviewed-by: Andrea Arcangeli <[email protected]>
---
mm/huge_memory.c | 17 +++++++++++++++++
1 files changed, 17 insertions(+), 0 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 140d858..c8b157d 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -726,6 +726,18 @@ static inline struct page *alloc_hugepage(int defrag)
}
#endif

+static void set_huge_zero_page(pgtable_t pgtable, struct mm_struct *mm,
+ struct vm_area_struct *vma, unsigned long haddr, pmd_t *pmd)
+{
+ pmd_t entry;
+ entry = pfn_pmd(huge_zero_pfn, vma->vm_page_prot);
+ entry = pmd_wrprotect(entry);
+ entry = pmd_mkhuge(entry);
+ set_pmd_at(mm, haddr, pmd, entry);
+ prepare_pmd_huge_pte(pgtable, mm);
+ mm->nr_ptes++;
+}
+
int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
unsigned long address, pmd_t *pmd,
unsigned int flags)
@@ -803,6 +815,11 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
pte_free(dst_mm, pgtable);
goto out_unlock;
}
+ if (is_huge_zero_pmd(pmd)) {
+ set_huge_zero_page(pgtable, dst_mm, vma, addr, dst_pmd);
+ ret = 0;
+ goto out_unlock;
+ }
if (unlikely(pmd_trans_splitting(pmd))) {
/* split huge page running from under us */
spin_unlock(&src_mm->page_table_lock);
--
1.7.7.6

2012-10-02 15:21:32

by Kirill A. Shutemov

Subject: [PATCH v3 04/10] thp: do_huge_pmd_wp_page(): handle huge zero page

From: "Kirill A. Shutemov" <[email protected]>

On right access to huge zero page we alloc a new page and clear it.

In fallback path we create a new table and set pte around fault address
to the newly allocated page. All other ptes set to normal zero page.

Signed-off-by: Kirill A. Shutemov <[email protected]>
Reviewed-by: Andrea Arcangeli <[email protected]>
---
include/linux/mm.h | 8 ++++
mm/huge_memory.c | 102 ++++++++++++++++++++++++++++++++++++++++++++--------
mm/memory.c | 7 ----
3 files changed, 95 insertions(+), 22 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 311be90..179a41c 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -514,6 +514,14 @@ static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma)
}
#endif

+#ifndef my_zero_pfn
+static inline unsigned long my_zero_pfn(unsigned long addr)
+{
+ extern unsigned long zero_pfn;
+ return zero_pfn;
+}
+#endif
+
/*
* Multiple processes may "see" the same page. E.g. for untouched
* mappings of /dev/null, all processes see the same page full of
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index c8b157d..f30f39d 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -868,6 +868,61 @@ pgtable_t get_pmd_huge_pte(struct mm_struct *mm)
return pgtable;
}

+static int do_huge_pmd_wp_zero_page_fallback(struct mm_struct *mm,
+ struct vm_area_struct *vma, unsigned long address,
+ pmd_t *pmd, unsigned long haddr)
+{
+ pgtable_t pgtable;
+ pmd_t _pmd;
+ struct page *page;
+ int i, ret = 0;
+
+ page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, address);
+ if (!page) {
+ ret |= VM_FAULT_OOM;
+ goto out;
+ }
+
+ if (mem_cgroup_newpage_charge(page, mm, GFP_KERNEL)) {
+ put_page(page);
+ ret |= VM_FAULT_OOM;
+ goto out;
+ }
+
+ clear_user_highpage(page, address);
+ __SetPageUptodate(page);
+
+ spin_lock(&mm->page_table_lock);
+ pmdp_clear_flush_notify(vma, haddr, pmd);
+ /* leave pmd empty until pte is filled */
+
+ pgtable = get_pmd_huge_pte(mm);
+ pmd_populate(mm, &_pmd, pgtable);
+
+ for (i = 0; i < HPAGE_PMD_NR; i++, haddr += PAGE_SIZE) {
+ pte_t *pte, entry;
+ if (haddr == (address & PAGE_MASK)) {
+ entry = mk_pte(page, vma->vm_page_prot);
+ entry = maybe_mkwrite(pte_mkdirty(entry), vma);
+ page_add_new_anon_rmap(page, vma, haddr);
+ } else {
+ entry = pfn_pte(my_zero_pfn(haddr), vma->vm_page_prot);
+ entry = pte_mkspecial(entry);
+ }
+ pte = pte_offset_map(&_pmd, haddr);
+ VM_BUG_ON(!pte_none(*pte));
+ set_pte_at(mm, haddr, pte, entry);
+ pte_unmap(pte);
+ }
+ smp_wmb(); /* make pte visible before pmd */
+ pmd_populate(mm, pmd, pgtable);
+ spin_unlock(&mm->page_table_lock);
+
+ ret |= VM_FAULT_WRITE;
+out:
+ return ret;
+}
+
static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm,
struct vm_area_struct *vma,
unsigned long address,
@@ -965,17 +1020,19 @@ int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
unsigned long address, pmd_t *pmd, pmd_t orig_pmd)
{
int ret = 0;
- struct page *page, *new_page;
+ struct page *page = NULL, *new_page;
unsigned long haddr;

VM_BUG_ON(!vma->anon_vma);
+ haddr = address & HPAGE_PMD_MASK;
+ if (is_huge_zero_pmd(orig_pmd))
+ goto alloc;
spin_lock(&mm->page_table_lock);
if (unlikely(!pmd_same(*pmd, orig_pmd)))
goto out_unlock;

page = pmd_page(orig_pmd);
VM_BUG_ON(!PageCompound(page) || !PageHead(page));
- haddr = address & HPAGE_PMD_MASK;
if (page_mapcount(page) == 1) {
pmd_t entry;
entry = pmd_mkyoung(orig_pmd);
@@ -987,7 +1044,7 @@ int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
}
get_page(page);
spin_unlock(&mm->page_table_lock);
-
+alloc:
if (transparent_hugepage_enabled(vma) &&
!transparent_hugepage_debug_cow())
new_page = alloc_hugepage_vma(transparent_hugepage_defrag(vma),
@@ -997,28 +1054,39 @@ int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,

if (unlikely(!new_page)) {
count_vm_event(THP_FAULT_FALLBACK);
- ret = do_huge_pmd_wp_page_fallback(mm, vma, address,
- pmd, orig_pmd, page, haddr);
- if (ret & VM_FAULT_OOM)
- split_huge_page(page);
- put_page(page);
+ if (is_huge_zero_pmd(orig_pmd)) {
+ ret = do_huge_pmd_wp_zero_page_fallback(mm, vma,
+ address, pmd, haddr);
+ } else {
+ ret = do_huge_pmd_wp_page_fallback(mm, vma, address,
+ pmd, orig_pmd, page, haddr);
+ if (ret & VM_FAULT_OOM)
+ split_huge_page(page);
+ put_page(page);
+ }
goto out;
}
count_vm_event(THP_FAULT_ALLOC);

if (unlikely(mem_cgroup_newpage_charge(new_page, mm, GFP_KERNEL))) {
put_page(new_page);
- split_huge_page(page);
- put_page(page);
+ if (page) {
+ split_huge_page(page);
+ put_page(page);
+ }
ret |= VM_FAULT_OOM;
goto out;
}

- copy_user_huge_page(new_page, page, haddr, vma, HPAGE_PMD_NR);
+ if (is_huge_zero_pmd(orig_pmd))
+ clear_huge_page(new_page, haddr, HPAGE_PMD_NR);
+ else
+ copy_user_huge_page(new_page, page, haddr, vma, HPAGE_PMD_NR);
__SetPageUptodate(new_page);

spin_lock(&mm->page_table_lock);
- put_page(page);
+ if (page)
+ put_page(page);
if (unlikely(!pmd_same(*pmd, orig_pmd))) {
spin_unlock(&mm->page_table_lock);
mem_cgroup_uncharge_page(new_page);
@@ -1026,7 +1094,6 @@ int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
goto out;
} else {
pmd_t entry;
- VM_BUG_ON(!PageHead(page));
entry = mk_pmd(new_page, vma->vm_page_prot);
entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
entry = pmd_mkhuge(entry);
@@ -1034,8 +1101,13 @@ int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
page_add_new_anon_rmap(new_page, vma, haddr);
set_pmd_at(mm, haddr, pmd, entry);
update_mmu_cache(vma, address, entry);
- page_remove_rmap(page);
- put_page(page);
+ if (is_huge_zero_pmd(orig_pmd))
+ add_mm_counter(mm, MM_ANONPAGES, HPAGE_PMD_NR);
+ if (page) {
+ VM_BUG_ON(!PageHead(page));
+ page_remove_rmap(page);
+ put_page(page);
+ }
ret |= VM_FAULT_WRITE;
}
out_unlock:
diff --git a/mm/memory.c b/mm/memory.c
index 5736170..dbd92ba 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -724,13 +724,6 @@ static inline int is_zero_pfn(unsigned long pfn)
}
#endif

-#ifndef my_zero_pfn
-static inline unsigned long my_zero_pfn(unsigned long addr)
-{
- return zero_pfn;
-}
-#endif
-
/*
* vm_normal_page -- This function gets the "struct page" associated with a pte.
*
--
1.7.7.6

2012-10-02 15:36:14

by Brice Goglin

Subject: Re: [PATCH v3 04/10] thp: do_huge_pmd_wp_page(): handle huge zero page

Le 02/10/2012 17:19, Kirill A. Shutemov a écrit :
> From: "Kirill A. Shutemov" <[email protected]>
>
> On right access to huge zero page we alloc a new page and clear it.
>

s/right/write/ ?

Brice

2012-10-02 15:37:49

by Kirill A. Shutemov

Subject: Re: [PATCH v3 04/10] thp: do_huge_pmd_wp_page(): handle huge zero page

On Tue, Oct 02, 2012 at 05:35:59PM +0200, Brice Goglin wrote:
> Le 02/10/2012 17:19, Kirill A. Shutemov a écrit :
> > From: "Kirill A. Shutemov" <[email protected]>
> >
> > On right access to huge zero page we alloc a new page and clear it.
> >
>
> s/right/write/ ?

Oops... thanks.

--
Kirill A. Shutemov

2012-10-02 16:13:39

by Andrea Arcangeli

Subject: Re: [PATCH v3 00/10] Introduce huge zero page

On Tue, Oct 02, 2012 at 06:19:22PM +0300, Kirill A. Shutemov wrote:
> From: "Kirill A. Shutemov" <[email protected]>
>
> During testing I noticed big (up to 2.5 times) memory consumption overhead
> on some workloads (e.g. ft.A from NPB) if THP is enabled.
>
> The main reason for that big difference is lacking zero page in THP case.
> We have to allocate a real page on read page fault.
>
> A program to demonstrate the issue:
> #include <assert.h>
> #include <stdlib.h>
> #include <unistd.h>
>
> #define MB 1024*1024
>
> int main(int argc, char **argv)
> {
> char *p;
> int i;
>
> posix_memalign((void **)&p, 2 * MB, 200 * MB);
> for (i = 0; i < 200 * MB; i+= 4096)
> assert(p[i] == 0);
> pause();
> return 0;
> }
>
> With thp-never RSS is about 400k, but with thp-always it's 200M.
> After the patcheset thp-always RSS is 400k too.
>
> v3:
> - fix potential deadlock in refcounting code on preemptive kernel.
> - do not mark huge zero page as movable.
> - fix typo in comment.
> - Reviewed-by tag from Andrea Arcangeli.
> v2:
> - Avoid find_vma() if we've already had vma on stack.
> Suggested by Andrea Arcangeli.
> - Implement refcounting for huge zero page.
>
> Kirill A. Shutemov (10):
> thp: huge zero page: basic preparation
> thp: zap_huge_pmd(): zap huge zero pmd
> thp: copy_huge_pmd(): copy huge zero page
> thp: do_huge_pmd_wp_page(): handle huge zero page
> thp: change_huge_pmd(): keep huge zero page write-protected
> thp: change split_huge_page_pmd() interface
> thp: implement splitting pmd for huge zero page
> thp: setup huge zero page on non-write page fault
> thp: lazy huge zero page allocation
> thp: implement refcounting for huge zero page

Reviewed-by: Andrea Arcangeli <[email protected]>

2012-10-02 22:31:51

by Andrew Morton

Subject: Re: [PATCH v3 00/10] Introduce huge zero page

On Tue, 2 Oct 2012 18:19:22 +0300
"Kirill A. Shutemov" <[email protected]> wrote:

> During testing I noticed big (up to 2.5 times) memory consumption overhead
> on some workloads (e.g. ft.A from NPB) if THP is enabled.
>
> The main reason for that big difference is lacking zero page in THP case.
> We have to allocate a real page on read page fault.
>
> A program to demonstrate the issue:
> #include <assert.h>
> #include <stdlib.h>
> #include <unistd.h>
>
> #define MB 1024*1024
>
> int main(int argc, char **argv)
> {
> char *p;
> int i;
>
> posix_memalign((void **)&p, 2 * MB, 200 * MB);
> for (i = 0; i < 200 * MB; i+= 4096)
> assert(p[i] == 0);
> pause();
> return 0;
> }
>
> With thp-never RSS is about 400k, but with thp-always it's 200M.
> After the patcheset thp-always RSS is 400k too.

I'd like to see a full description of the design, please.

From reading the code, it appears that we initially allocate a huge
page and point the pmd at that. If/when there is a write fault against
that page we then populate the mm with ptes which point at the normal
4k zero page and populate the pte at the fault address with a newly
allocated page? Correct and complete? If not, please fix ;)

Also, IIRC, the early versions of the patch did not allocate the
initial huge page at all - it immediately filled the mm with ptes which
point at the normal 4k zero page. Is that a correct recollection?
If so, why the change?

Also IIRC, Andrea had a little test app which demonstrated the TLB
costs of the initial approach, and they were high?

Please, let's capture all this knowledge in a single place, right here
in the changelog. And in code comments, where appropriate. Otherwise
people won't know why we made these decisions unless they go off and
find lengthy, years-old and quite possibly obsolete email threads.


Also, you've presented some data on the memory savings, but no
quantitative testing results on the performance cost. Both you and
Andrea have run these tests and those results are important. Let's
capture them here. And when designing such tests we should not just
try to demonstrate the benefits of a code change - we should think of
test cases which might be adversely affected and run those as well.


It's not an appropriate time to be merging new features - please plan
on preparing this patchset against 3.7-rc1.

2012-10-02 22:55:20

by Andrea Arcangeli

Subject: Re: [PATCH v3 00/10] Introduce huge zero page

Hi Andrew,

On Tue, Oct 02, 2012 at 03:31:48PM -0700, Andrew Morton wrote:
> From reading the code, it appears that we initially allocate a huge
> page and point the pmd at that. If/when there is a write fault against
> that page we then populate the mm with ptes which point at the normal
> 4k zero page and populate the pte at the fault address with a newly
> allocated page? Correct and complete? If not, please fix ;)

During the cow, we never use 4k ptes, unless the 2m page allocation
fails.

> Also, IIRC, the early versions of the patch did not allocate the
> initial huge page at all - it immediately filled the mm with ptes which
> point at the normal 4k zero page. Is that a correct recollection?
> If so, why the change?

That was a different design yes. The design in this patchset will not
do that.

> Also IIRC, Andrea had a little test app which demonstrated the TLB
> costs of the inital approach, and they were high?

Yes, we ran the benchmarks yesterday; this version is the one that will
decrease the TLB cost, and that seems the safest tradeoff.

> Please, let's capture all this knowledge in a single place, right here
> in the changelog. And in code comments, where appropriate. Otherwise
> people won't know why we made these decisions unless they go off and
> find lengthy, years-old and quite possibly obsolete email threads.

Agreed ;).

> Also, you've presented some data on the memory savings, but no
> quantitative testing results on the performance cost. Both you and
> Andrea have run these tests and those results are important. Let's
> capture them here. And when designing such tests we should not just
> try to demonstrate the benefits of a code change - we should think of
> test cases whcih might be adversely affected and run those as well.

Right.

> It's not an appropriate time to be merging new features - please plan
> on preparing this patchset against 3.7-rc1.

Ok, I assume Kirill will take care of it.

Thanks,
Andrea

2012-10-03 00:03:29

by Kirill A. Shutemov

Subject: Re: [PATCH v3 00/10] Introduce huge zero page

On Tue, Oct 02, 2012 at 03:31:48PM -0700, Andrew Morton wrote:
> On Tue, 2 Oct 2012 18:19:22 +0300
> "Kirill A. Shutemov" <[email protected]> wrote:
>
> > During testing I noticed big (up to 2.5 times) memory consumption overhead
> > on some workloads (e.g. ft.A from NPB) if THP is enabled.
> >
> > The main reason for that big difference is lacking zero page in THP case.
> > We have to allocate a real page on read page fault.
> >
> > A program to demonstrate the issue:
> > #include <assert.h>
> > #include <stdlib.h>
> > #include <unistd.h>
> >
> > #define MB 1024*1024
> >
> > int main(int argc, char **argv)
> > {
> > char *p;
> > int i;
> >
> > posix_memalign((void **)&p, 2 * MB, 200 * MB);
> > for (i = 0; i < 200 * MB; i+= 4096)
> > assert(p[i] == 0);
> > pause();
> > return 0;
> > }
> >
> > With thp-never RSS is about 400k, but with thp-always it's 200M.
> > After the patcheset thp-always RSS is 400k too.
>
> I'd like to see a full description of the design, please.

Okay. Design overview.

The huge zero page (hzp) is a non-movable huge page (2M on x86-64) filled with
zeros. The way we allocate it changes over the patchset:

- [01/10] simplest way: hzp allocated at boot time in hugepage_init();
- [09/10] lazy allocation on first use;
- [10/10] lockless refcounting + shrinker-reclaimable hzp.

We set it up in do_huge_pmd_anonymous_page() if the area around the fault
address is suitable for THP and we've got a read page fault.
If we fail to set up the hzp (ENOMEM) we fall back to handle_pte_fault() as we
normally do in THP.

On a wp fault to the hzp we allocate real memory for the huge page and clear
it. If that fails with ENOMEM, we fall back gracefully: we create a new pmd
table and set the pte around the fault address to a newly allocated normal
(4k) page. All other ptes in the pmd are set to the normal zero page.

We cannot split the hzp itself (and it's a bug if we try), but we can split
the pmd which points to it. On splitting the pmd we create a page table with
all ptes set to the normal zero page.

The patchset is organized in a bisect-friendly way:
Patches 01-07: prepare all code paths for the hzp
Patch 08: all code paths are covered: safe to set up the hzp
Patch 09: lazy allocation
Patch 10: lockless refcounting for the hzp (sketched in userspace below)
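
For reference, here is a minimal, self-contained userspace sketch of the
patch 10 refcounting scheme, using C11 atomics in place of the kernel's
atomic_t/cmpxchg. The names and the calloc()-backed "page" are made up for
illustration only; the authoritative code is the kernel implementation in
mm/huge_memory.c from the patch itself:

#include <stdatomic.h>
#include <stdio.h>
#include <stdlib.h>

static atomic_uint hzp_refcount;        /* 0 == no zero page allocated */
static _Atomic(void *) hzp;             /* stands in for huge_zero_pfn */

/* Take a reference, allocating the shared page on first use. */
static void *get_hzp(void)
{
        unsigned int old;
retry:
        old = atomic_load(&hzp_refcount);
        while (old != 0) {              /* atomic_inc_not_zero() */
                if (atomic_compare_exchange_weak(&hzp_refcount, &old, old + 1))
                        return atomic_load(&hzp);
        }

        void *page = calloc(1, 2UL << 20);      /* "huge page" of zeros */
        if (!page)
                return NULL;
        void *expected = NULL;
        if (!atomic_compare_exchange_strong(&hzp, &expected, page)) {
                free(page);             /* lost the race, try again */
                goto retry;
        }
        /* one reference for the caller, one kept back for the "shrinker" */
        atomic_store(&hzp_refcount, 2);
        return page;
}

/* Drop a reference; never hits zero here, the "shrinker" holds one. */
static void put_hzp(void)
{
        atomic_fetch_sub(&hzp_refcount, 1);
}

/* "Shrinker" callback: free the page only if our reference is the last one. */
static void shrink_hzp(void)
{
        unsigned int expected = 1;
        if (atomic_compare_exchange_strong(&hzp_refcount, &expected, 0))
                free(atomic_exchange(&hzp, NULL));
}

int main(void)
{
        void *p = get_hzp();    /* allocates; refcount == 2 */
        printf("zero page at %p\n", p);
        shrink_hzp();           /* no-op: a user still holds a reference */
        put_hzp();              /* refcount drops back to 1 */
        shrink_hzp();           /* last reference: the page is freed */
        return 0;
}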

--------------------------------------------------------------------------

At hpa's request I've tried an alternative approach for the hzp implementation
(see the "Virtual huge zero page" patchset): a pmd table with all entries set
to the zero page. This way should be more cache friendly, but it increases TLB
pressure.

The problem with the virtual huge zero page: it requires per-arch enabling.
We need a way to mark that a pmd table has all ptes set to the zero page.

Some numbers comparing the two implementations (on a 4-socket Westmere-EX):

Microbenchmark1
===============

test:
        posix_memalign((void **)&p, 2 * MB, 8 * GB);
        for (i = 0; i < 100; i++) {
                assert(memcmp(p, p + 4*GB, 4*GB) == 0);
                asm volatile ("": : :"memory");
        }

hzp:
Performance counter stats for './test_memcmp' (5 runs):

32356.272845 task-clock # 0.998 CPUs utilized ( +- 0.13% )
40 context-switches # 0.001 K/sec ( +- 0.94% )
0 CPU-migrations # 0.000 K/sec
4,218 page-faults # 0.130 K/sec ( +- 0.00% )
76,712,481,765 cycles # 2.371 GHz ( +- 0.13% ) [83.31%]
36,279,577,636 stalled-cycles-frontend # 47.29% frontend cycles idle ( +- 0.28% ) [83.35%]
1,684,049,110 stalled-cycles-backend # 2.20% backend cycles idle ( +- 2.96% ) [66.67%]
134,355,715,816 instructions # 1.75 insns per cycle
# 0.27 stalled cycles per insn ( +- 0.10% ) [83.35%]
13,526,169,702 branches # 418.039 M/sec ( +- 0.10% ) [83.31%]
1,058,230 branch-misses # 0.01% of all branches ( +- 0.91% ) [83.36%]

32.413866442 seconds time elapsed ( +- 0.13% )

vhzp:
Performance counter stats for './test_memcmp' (5 runs):

30327.183829 task-clock # 0.998 CPUs utilized ( +- 0.13% )
38 context-switches # 0.001 K/sec ( +- 1.53% )
0 CPU-migrations # 0.000 K/sec
4,218 page-faults # 0.139 K/sec ( +- 0.01% )
71,964,773,660 cycles # 2.373 GHz ( +- 0.13% ) [83.35%]
31,191,284,231 stalled-cycles-frontend # 43.34% frontend cycles idle ( +- 0.40% ) [83.32%]
773,484,474 stalled-cycles-backend # 1.07% backend cycles idle ( +- 6.61% ) [66.67%]
134,982,215,437 instructions # 1.88 insns per cycle
# 0.23 stalled cycles per insn ( +- 0.11% ) [83.32%]
13,509,150,683 branches # 445.447 M/sec ( +- 0.11% ) [83.34%]
1,017,667 branch-misses # 0.01% of all branches ( +- 1.07% ) [83.32%]

30.381324695 seconds time elapsed ( +- 0.13% )

Microbenchmark2
===============

test:
        posix_memalign((void **)&p, 2 * MB, 8 * GB);
        for (i = 0; i < 1000; i++) {
                char *_p = p;
                while (_p < p+4*GB) {
                        assert(*_p == *(_p+4*GB));
                        _p += 4096;
                        asm volatile ("": : :"memory");
                }
        }

hzp:
Performance counter stats for 'taskset -c 0 ./test_memcmp2' (5 runs):

3505.727639 task-clock # 0.998 CPUs utilized ( +- 0.26% )
9 context-switches # 0.003 K/sec ( +- 4.97% )
4,384 page-faults # 0.001 M/sec ( +- 0.00% )
8,318,482,466 cycles # 2.373 GHz ( +- 0.26% ) [33.31%]
5,134,318,786 stalled-cycles-frontend # 61.72% frontend cycles idle ( +- 0.42% ) [33.32%]
2,193,266,208 stalled-cycles-backend # 26.37% backend cycles idle ( +- 5.51% ) [33.33%]
9,494,670,537 instructions # 1.14 insns per cycle
# 0.54 stalled cycles per insn ( +- 0.13% ) [41.68%]
2,108,522,738 branches # 601.451 M/sec ( +- 0.09% ) [41.68%]
158,746 branch-misses # 0.01% of all branches ( +- 1.60% ) [41.71%]
3,168,102,115 L1-dcache-loads # 903.693 M/sec ( +- 0.11% ) [41.70%]
1,048,710,998 L1-dcache-misses # 33.10% of all L1-dcache hits ( +- 0.11% ) [41.72%]
1,047,699,685 LLC-load # 298.854 M/sec ( +- 0.03% ) [33.38%]
2,287 LLC-misses # 0.00% of all LL-cache hits ( +- 8.27% ) [33.37%]
3,166,187,367 dTLB-loads # 903.147 M/sec ( +- 0.02% ) [33.35%]
4,266,538 dTLB-misses # 0.13% of all dTLB cache hits ( +- 0.03% ) [33.33%]

3.513339813 seconds time elapsed ( +- 0.26% )

vhzp:
Performance counter stats for 'taskset -c 0 ./test_memcmp2' (5 runs):

27313.891128 task-clock # 0.998 CPUs utilized ( +- 0.24% )
62 context-switches # 0.002 K/sec ( +- 0.61% )
4,384 page-faults # 0.160 K/sec ( +- 0.01% )
64,747,374,606 cycles # 2.370 GHz ( +- 0.24% ) [33.33%]
61,341,580,278 stalled-cycles-frontend # 94.74% frontend cycles idle ( +- 0.26% ) [33.33%]
56,702,237,511 stalled-cycles-backend # 87.57% backend cycles idle ( +- 0.07% ) [33.33%]
10,033,724,846 instructions # 0.15 insns per cycle
# 6.11 stalled cycles per insn ( +- 0.09% ) [41.65%]
2,190,424,932 branches # 80.195 M/sec ( +- 0.12% ) [41.66%]
1,028,630 branch-misses # 0.05% of all branches ( +- 1.50% ) [41.66%]
3,302,006,540 L1-dcache-loads # 120.891 M/sec ( +- 0.11% ) [41.68%]
271,374,358 L1-dcache-misses # 8.22% of all L1-dcache hits ( +- 0.04% ) [41.66%]
20,385,476 LLC-load # 0.746 M/sec ( +- 1.64% ) [33.34%]
76,754 LLC-misses # 0.38% of all LL-cache hits ( +- 2.35% ) [33.34%]
3,309,927,290 dTLB-loads # 121.181 M/sec ( +- 0.03% ) [33.34%]
2,098,967,427 dTLB-misses # 63.41% of all dTLB cache hits ( +- 0.03% ) [33.34%]

27.364448741 seconds time elapsed ( +- 0.24% )

--------------------------------------------------------------------------

I personally prefer the implementation present in this patchset. It doesn't
touch arch-specific code.


Is the overview complete enough? Have I answered all your questions here?

> It's not an appropriate time to be merging new features - please plan
> on preparing this patchset against 3.7-rc1.

Sure.

--
Kirill A. Shutemov

2012-10-03 00:11:05

by Andrew Morton

Subject: Re: [PATCH v3 00/10] Introduce huge zero page

On Wed, 3 Oct 2012 03:04:02 +0300
"Kirill A. Shutemov" <[email protected]> wrote:

> Is the overview complete enough? Have I answered all you questions here?

Yes, thanks!

The design overview is short enough to be put in as code comments in
suitable places.

2012-10-12 03:23:44

by Ni zhan Chen

Subject: Re: [PATCH v3 07/10] thp: implement splitting pmd for huge zero page

On 10/02/2012 11:19 PM, Kirill A. Shutemov wrote:
> From: "Kirill A. Shutemov" <[email protected]>
>
> We can't split huge zero page itself, but we can split the pmd which
> points to it.
>
> On splitting the pmd we create a table with all ptes set to normal zero
> page.
>
> Signed-off-by: Kirill A. Shutemov <[email protected]>
> Reviewed-by: Andrea Arcangeli <[email protected]>
> ---
> mm/huge_memory.c | 32 ++++++++++++++++++++++++++++++++
> 1 files changed, 32 insertions(+), 0 deletions(-)
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 95032d3..3f1c59c 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1600,6 +1600,7 @@ int split_huge_page(struct page *page)
> struct anon_vma *anon_vma;
> int ret = 1;
>
> + BUG_ON(is_huge_zero_pfn(page_to_pfn(page)));
> BUG_ON(!PageAnon(page));
> anon_vma = page_lock_anon_vma(page);
> if (!anon_vma)
> @@ -2503,6 +2504,32 @@ static int khugepaged(void *none)
> return 0;
> }
>
> +static void __split_huge_zero_page_pmd(struct vm_area_struct *vma,
> + unsigned long haddr, pmd_t *pmd)
> +{
> + pgtable_t pgtable;
> + pmd_t _pmd;
> + int i;
> +
> + pmdp_clear_flush_notify(vma, haddr, pmd);

Why can't I find the function pmdp_clear_flush_notify in the kernel source
code? Do you mean pmdp_clear_flush_young_notify or something like that?

> + /* leave pmd empty until pte is filled */
> +
> + pgtable = get_pmd_huge_pte(vma->vm_mm);
> + pmd_populate(vma->vm_mm, &_pmd, pgtable);
> +
> + for (i = 0; i < HPAGE_PMD_NR; i++, haddr += PAGE_SIZE) {
> + pte_t *pte, entry;
> + entry = pfn_pte(my_zero_pfn(haddr), vma->vm_page_prot);
> + entry = pte_mkspecial(entry);
> + pte = pte_offset_map(&_pmd, haddr);
> + VM_BUG_ON(!pte_none(*pte));
> + set_pte_at(vma->vm_mm, haddr, pte, entry);
> + pte_unmap(pte);
> + }
> + smp_wmb(); /* make pte visible before pmd */
> + pmd_populate(vma->vm_mm, pmd, pgtable);
> +}
> +
> void __split_huge_page_pmd(struct vm_area_struct *vma, unsigned long address,
> pmd_t *pmd)
> {
> @@ -2516,6 +2543,11 @@ void __split_huge_page_pmd(struct vm_area_struct *vma, unsigned long address,
> spin_unlock(&vma->vm_mm->page_table_lock);
> return;
> }
> + if (is_huge_zero_pmd(*pmd)) {
> + __split_huge_zero_page_pmd(vma, haddr, pmd);
> + spin_unlock(&vma->vm_mm->page_table_lock);
> + return;
> + }
> page = pmd_page(*pmd);
> VM_BUG_ON(!page_count(page));
> get_page(page);

2012-10-12 04:12:16

by Kirill A. Shutemov

Subject: Re: [PATCH v3 07/10] thp: implement splitting pmd for huge zero page

On Fri, Oct 12, 2012 at 11:23:37AM +0800, Ni zhan Chen wrote:
> On 10/02/2012 11:19 PM, Kirill A. Shutemov wrote:
> >From: "Kirill A. Shutemov" <[email protected]>
> >
> >We can't split huge zero page itself, but we can split the pmd which
> >points to it.
> >
> >On splitting the pmd we create a table with all ptes set to normal zero
> >page.
> >
> >Signed-off-by: Kirill A. Shutemov <[email protected]>
> >Reviewed-by: Andrea Arcangeli <[email protected]>
> >---
> > mm/huge_memory.c | 32 ++++++++++++++++++++++++++++++++
> > 1 files changed, 32 insertions(+), 0 deletions(-)
> >
> >diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> >index 95032d3..3f1c59c 100644
> >--- a/mm/huge_memory.c
> >+++ b/mm/huge_memory.c
> >@@ -1600,6 +1600,7 @@ int split_huge_page(struct page *page)
> > struct anon_vma *anon_vma;
> > int ret = 1;
> >+ BUG_ON(is_huge_zero_pfn(page_to_pfn(page)));
> > BUG_ON(!PageAnon(page));
> > anon_vma = page_lock_anon_vma(page);
> > if (!anon_vma)
> >@@ -2503,6 +2504,32 @@ static int khugepaged(void *none)
> > return 0;
> > }
> >+static void __split_huge_zero_page_pmd(struct vm_area_struct *vma,
> >+ unsigned long haddr, pmd_t *pmd)
> >+{
> >+ pgtable_t pgtable;
> >+ pmd_t _pmd;
> >+ int i;
> >+
> >+ pmdp_clear_flush_notify(vma, haddr, pmd);
>
> why I can't find function pmdp_clear_flush_notify in kernel source
> code? Do you mean pmdp_clear_flush_young_notify or something like
> that?

It was changed recently. See commit
2ec74c3 mm: move all mmu notifier invocations to be done outside the PT lock

--
Kirill A. Shutemov

2012-10-12 05:00:54

by Ni zhan Chen

Subject: Re: [PATCH v3 07/10] thp: implement splitting pmd for huge zero page

On 10/12/2012 12:13 PM, Kirill A. Shutemov wrote:
> On Fri, Oct 12, 2012 at 11:23:37AM +0800, Ni zhan Chen wrote:
>> On 10/02/2012 11:19 PM, Kirill A. Shutemov wrote:
>>> From: "Kirill A. Shutemov" <[email protected]>
>>>
>>> We can't split huge zero page itself, but we can split the pmd which
>>> points to it.
>>>
>>> On splitting the pmd we create a table with all ptes set to normal zero
>>> page.
>>>
>>> Signed-off-by: Kirill A. Shutemov <[email protected]>
>>> Reviewed-by: Andrea Arcangeli <[email protected]>
>>> ---
>>> mm/huge_memory.c | 32 ++++++++++++++++++++++++++++++++
>>> 1 files changed, 32 insertions(+), 0 deletions(-)
>>>
>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>>> index 95032d3..3f1c59c 100644
>>> --- a/mm/huge_memory.c
>>> +++ b/mm/huge_memory.c
>>> @@ -1600,6 +1600,7 @@ int split_huge_page(struct page *page)
>>> struct anon_vma *anon_vma;
>>> int ret = 1;
>>> + BUG_ON(is_huge_zero_pfn(page_to_pfn(page)));
>>> BUG_ON(!PageAnon(page));
>>> anon_vma = page_lock_anon_vma(page);
>>> if (!anon_vma)
>>> @@ -2503,6 +2504,32 @@ static int khugepaged(void *none)
>>> return 0;
>>> }
>>> +static void __split_huge_zero_page_pmd(struct vm_area_struct *vma,
>>> + unsigned long haddr, pmd_t *pmd)
>>> +{
>>> + pgtable_t pgtable;
>>> + pmd_t _pmd;
>>> + int i;
>>> +
>>> + pmdp_clear_flush_notify(vma, haddr, pmd);
>> why I can't find function pmdp_clear_flush_notify in kernel source
>> code? Do you mean pmdp_clear_flush_young_notify or something like
>> that?
> It was changed recently. See commit
> 2ec74c3 mm: move all mmu notifier invocations to be done outside the PT lock

Oh, thanks!

2012-10-17 02:32:32

by Ni zhan Chen

[permalink] [raw]
Subject: Re: [PATCH v3 00/10] Introduce huge zero page

On 10/03/2012 08:04 AM, Kirill A. Shutemov wrote:
> On Tue, Oct 02, 2012 at 03:31:48PM -0700, Andrew Morton wrote:
>> On Tue, 2 Oct 2012 18:19:22 +0300
>> "Kirill A. Shutemov" <[email protected]> wrote:
>>
>>> During testing I noticed big (up to 2.5 times) memory consumption overhead
>>> on some workloads (e.g. ft.A from NPB) if THP is enabled.
>>>
>>> The main reason for that big difference is lacking zero page in THP case.
>>> We have to allocate a real page on read page fault.
>>>
>>> A program to demonstrate the issue:
>>> #include <assert.h>
>>> #include <stdlib.h>
>>> #include <unistd.h>
>>>
>>> #define MB 1024*1024
>>>
>>> int main(int argc, char **argv)
>>> {
>>> char *p;
>>> int i;
>>>
>>> posix_memalign((void **)&p, 2 * MB, 200 * MB);
>>> for (i = 0; i < 200 * MB; i+= 4096)
>>> assert(p[i] == 0);
>>> pause();
>>> return 0;
>>> }
>>>
>>> With thp-never RSS is about 400k, but with thp-always it's 200M.
>>> After the patcheset thp-always RSS is 400k too.
>> I'd like to see a full description of the design, please.
> Okay. Design overview.
>
> Huge zero page (hzp) is a non-movable huge page (2M on x86-64) filled with
> zeros. The way we allocate it changes over the course of the patchset:
>
> - [01/10] simplest way: hzp allocated on boot time in hugepage_init();
> - [09/10] lazy allocation on first use;
> - [10/10] lockless refcounting + shrinker-reclaimable hzp;
>
> We set it up in do_huge_pmd_anonymous_page() if the area around the fault
> address is suitable for THP and we've got a read page fault.
> If we fail to set up hzp (ENOMEM) we fall back to handle_pte_fault() as we
> normally do for THP.
>
> On a wp fault to hzp we allocate real memory for the huge page and clear it.
> If ENOMEM, graceful fallback: we create a new pmd table and set the pte
> around the fault address to a newly allocated normal (4k) page. All other
> ptes in the pmd are set to the normal zero page.
>
> We cannot split hzp (and it's a bug if we try), but we can split the pmd
> which points to it. On splitting the pmd we create a table with all ptes
> set to the normal zero page.
>
> The patchset is organized in a bisect-friendly way:
> Patches 01-07: prepare all code paths for hzp
> Patch 08: all code paths are covered: safe to setup hzp
> Patch 09: lazy allocation
> Patch 10: lockless refcounting for hzp
>
> --------------------------------------------------------------------------
>
> At hpa's request I've tried an alternative approach to the hzp
> implementation (see the Virtual huge zero page patchset): a pmd table with
> all entries set to the zero page. That approach should be more
> cache-friendly, but it increases TLB pressure.
>
> The problem with the virtual huge zero page: it requires per-arch enabling.
> We need a way to mark that a pmd table has all its ptes set to the zero page.
>
> Some numbers to compare two implementations (on 4s Westmere-EX):
>
> Microbenchmark1
> ==============
>
> test:
> posix_memalign((void **)&p, 2 * MB, 8 * GB);
> for (i = 0; i < 100; i++) {
> assert(memcmp(p, p + 4*GB, 4*GB) == 0);
> asm volatile ("": : :"memory");
> }
>
> hzp:
> Performance counter stats for './test_memcmp' (5 runs):
>
> 32356.272845 task-clock # 0.998 CPUs utilized ( +- 0.13% )
> 40 context-switches # 0.001 K/sec ( +- 0.94% )
> 0 CPU-migrations # 0.000 K/sec
> 4,218 page-faults # 0.130 K/sec ( +- 0.00% )
> 76,712,481,765 cycles # 2.371 GHz ( +- 0.13% ) [83.31%]
> 36,279,577,636 stalled-cycles-frontend # 47.29% frontend cycles idle ( +- 0.28% ) [83.35%]
> 1,684,049,110 stalled-cycles-backend # 2.20% backend cycles idle ( +- 2.96% ) [66.67%]
> 134,355,715,816 instructions # 1.75 insns per cycle
> # 0.27 stalled cycles per insn ( +- 0.10% ) [83.35%]
> 13,526,169,702 branches # 418.039 M/sec ( +- 0.10% ) [83.31%]
> 1,058,230 branch-misses # 0.01% of all branches ( +- 0.91% ) [83.36%]
>
> 32.413866442 seconds time elapsed ( +- 0.13% )
>
> vhzp:
> Performance counter stats for './test_memcmp' (5 runs):
>
> 30327.183829 task-clock # 0.998 CPUs utilized ( +- 0.13% )
> 38 context-switches # 0.001 K/sec ( +- 1.53% )
> 0 CPU-migrations # 0.000 K/sec
> 4,218 page-faults # 0.139 K/sec ( +- 0.01% )
> 71,964,773,660 cycles # 2.373 GHz ( +- 0.13% ) [83.35%]
> 31,191,284,231 stalled-cycles-frontend # 43.34% frontend cycles idle ( +- 0.40% ) [83.32%]
> 773,484,474 stalled-cycles-backend # 1.07% backend cycles idle ( +- 6.61% ) [66.67%]
> 134,982,215,437 instructions # 1.88 insns per cycle
> # 0.23 stalled cycles per insn ( +- 0.11% ) [83.32%]
> 13,509,150,683 branches # 445.447 M/sec ( +- 0.11% ) [83.34%]
> 1,017,667 branch-misses # 0.01% of all branches ( +- 1.07% ) [83.32%]
>
> 30.381324695 seconds time elapsed ( +- 0.13% )
>
> Microbenchmark2
> ==============
>
> test:
> posix_memalign((void **)&p, 2 * MB, 8 * GB);
> for (i = 0; i < 1000; i++) {
> char *_p = p;
> while (_p < p+4*GB) {
> assert(*_p == *(_p+4*GB));
> _p += 4096;
> asm volatile ("": : :"memory");
> }
> }
>
> hzp:
> Performance counter stats for 'taskset -c 0 ./test_memcmp2' (5 runs):
>
> 3505.727639 task-clock # 0.998 CPUs utilized ( +- 0.26% )
> 9 context-switches # 0.003 K/sec ( +- 4.97% )
> 4,384 page-faults # 0.001 M/sec ( +- 0.00% )
> 8,318,482,466 cycles # 2.373 GHz ( +- 0.26% ) [33.31%]
> 5,134,318,786 stalled-cycles-frontend # 61.72% frontend cycles idle ( +- 0.42% ) [33.32%]
> 2,193,266,208 stalled-cycles-backend # 26.37% backend cycles idle ( +- 5.51% ) [33.33%]
> 9,494,670,537 instructions # 1.14 insns per cycle
> # 0.54 stalled cycles per insn ( +- 0.13% ) [41.68%]
> 2,108,522,738 branches # 601.451 M/sec ( +- 0.09% ) [41.68%]
> 158,746 branch-misses # 0.01% of all branches ( +- 1.60% ) [41.71%]
> 3,168,102,115 L1-dcache-loads
> # 903.693 M/sec ( +- 0.11% ) [41.70%]
> 1,048,710,998 L1-dcache-misses
> # 33.10% of all L1-dcache hits ( +- 0.11% ) [41.72%]
> 1,047,699,685 LLC-load
> # 298.854 M/sec ( +- 0.03% ) [33.38%]
> 2,287 LLC-misses
> # 0.00% of all LL-cache hits ( +- 8.27% ) [33.37%]
> 3,166,187,367 dTLB-loads
> # 903.147 M/sec ( +- 0.02% ) [33.35%]
> 4,266,538 dTLB-misses
> # 0.13% of all dTLB cache hits ( +- 0.03% ) [33.33%]
>
> 3.513339813 seconds time elapsed ( +- 0.26% )
>
> vhzp:
> Performance counter stats for 'taskset -c 0 ./test_memcmp2' (5 runs):
>
> 27313.891128 task-clock # 0.998 CPUs utilized ( +- 0.24% )
> 62 context-switches # 0.002 K/sec ( +- 0.61% )
> 4,384 page-faults # 0.160 K/sec ( +- 0.01% )
> 64,747,374,606 cycles # 2.370 GHz ( +- 0.24% ) [33.33%]
> 61,341,580,278 stalled-cycles-frontend # 94.74% frontend cycles idle ( +- 0.26% ) [33.33%]
> 56,702,237,511 stalled-cycles-backend # 87.57% backend cycles idle ( +- 0.07% ) [33.33%]
> 10,033,724,846 instructions # 0.15 insns per cycle
> # 6.11 stalled cycles per insn ( +- 0.09% ) [41.65%]
> 2,190,424,932 branches # 80.195 M/sec ( +- 0.12% ) [41.66%]
> 1,028,630 branch-misses # 0.05% of all branches ( +- 1.50% ) [41.66%]
> 3,302,006,540 L1-dcache-loads
> # 120.891 M/sec ( +- 0.11% ) [41.68%]
> 271,374,358 L1-dcache-misses
> # 8.22% of all L1-dcache hits ( +- 0.04% ) [41.66%]
> 20,385,476 LLC-load
> # 0.746 M/sec ( +- 1.64% ) [33.34%]
> 76,754 LLC-misses
> # 0.38% of all LL-cache hits ( +- 2.35% ) [33.34%]
> 3,309,927,290 dTLB-loads
> # 121.181 M/sec ( +- 0.03% ) [33.34%]
> 2,098,967,427 dTLB-misses
> # 63.41% of all dTLB cache hits ( +- 0.03% ) [33.34%]
>
> 27.364448741 seconds time elapsed ( +- 0.24% )
>
> --------------------------------------------------------------------------

Hi Kirill A. Shutemov,

I see that the kernel doc describing the benefits of THP says "the TLB
miss will run faster" (especially with virtualization using nested
pagetables, but almost always also on bare metal without virtualization).

Could you explain why a TLB miss runs faster? I think it only reduces the
TLB miss ratio.

Thanks,
Chen

>
> I personally prefer the implementation present in this patchset. It doesn't
> touch arch-specific code.
>
>
> Is the overview complete enough? Have I answered all your questions here?
>
>> It's not an appropriate time to be merging new features - please plan
>> on preparing this patchset against 3.7-rc1.
> Sure.
>
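
To make the read-fault setup described in the overview concrete: installing
hzp amounts to wiring a single write-protected huge pmd at the zero huge page,
so the whole 2M range reads as zeros without allocating any memory. A minimal
sketch, assuming huge_zero_pfn holds the pfn of the preallocated huge zero
page and that the caller already holds mm->page_table_lock:

	pmd_t entry;

	/* one pmd entry covers the whole 2M range */
	entry = pfn_pmd(huge_zero_pfn, vma->vm_page_prot);
	entry = pmd_wrprotect(entry);	/* read-only; any write must fault */
	entry = pmd_mkhuge(entry);
	set_pmd_at(mm, haddr, pmd, entry);
	mm->nr_ptes++;

A later write faults into the wp path, which allocates and clears a real huge
page (or falls back to 4k pages on ENOMEM), as the overview describes. The
patchset also deposits a preallocated page table at this point so that a later
split or zap of the pmd has something to work with.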

2012-10-18 14:49:30

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCH v3 00/10] Introduce huge zero page

On Wed, Oct 17, 2012 at 10:32:13AM +0800, Ni zhan Chen wrote:
>
> Hi Kirill A. Shutemov,
>
> I see that the kernel doc describing the benefits of THP says "the TLB
> miss will run faster" (especially with virtualization using nested
> pagetables, but almost always also on bare metal without
> virtualization).
>
> Could you explain why a TLB miss runs faster? I think it only reduces
> the TLB miss ratio.

A TLB miss for a huge page is resolved at the PMD level, not the PTE level,
so the hardware page-table walk is one lookup shorter.

--
Kirill A. Shutemov
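
To put "one lookup shorter" in concrete terms (bare-metal x86-64 with 4-level
paging, as a rough illustration):

	4k page,  TLB miss:  PGD -> PUD -> PMD -> PTE  = 4 memory accesses
	2M page,  TLB miss:  PGD -> PUD -> PMD         = 3 memory accesses

Under nested paging each level of the guest walk additionally requires a walk
of the host page tables, so shortening the guest walk by one level saves
proportionally more there; that is why the THP documentation singles out
virtualization. A 2M TLB entry also covers 512 times as much address space as
a 4k entry, which is the miss-ratio benefit Chen mentions.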