2012-08-09 09:08:52

by Kirill A. Shutemov

Subject: [PATCH, RFC 0/9] Introduce huge zero page

From: "Kirill A. Shutemov" <[email protected]>

During testing I noticed a big (up to 2.5 times) memory consumption overhead
on some workloads (e.g. ft.A from NPB) if THP is enabled.

The main reason for the big difference is the lack of a zero page in the THP
case: we have to allocate a real page even on a read page fault.

A program to demonstrate the issue:
#include <assert.h>
#include <stdlib.h>
#include <unistd.h>

#define MB 1024*1024

int main(int argc, char **argv)
{
        char *p;
        int i;

        posix_memalign((void **)&p, 2 * MB, 200 * MB);
        for (i = 0; i < 200 * MB; i += 4096)
                assert(p[i] == 0);
        pause();
        return 0;
}

With thp-never RSS is about 400k, but with thp-always it's 200M.
After the patchset, thp-always RSS is 400k too.
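
For reference, RSS can also be read from inside the test program itself with a
helper like the one below, called right before pause() (illustrative only, not
part of the patchset):

#include <stdio.h>
#include <string.h>

/* Print this process' own RSS from /proc/self/status, so thp-never and
 * thp-always runs can be compared without an external tool. */
static void print_rss(void)
{
        char line[128];
        FILE *f = fopen("/proc/self/status", "r");

        if (!f)
                return;
        while (fgets(line, sizeof(line), f))
                if (!strncmp(line, "VmRSS:", 6))
                        fputs(line, stdout);
        fclose(f);
}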

H. Peter Anvin proposed to use a "virtual huge zero page" -- a pmd table
with all ptes pointing to the 4k zero page. I haven't tried that approach and
I'm not sure it's a good idea (cache vs. TLB thrashing). I guess it will also
require more code to handle.
For now, I just allocate a 2M page and use it.

Kirill A. Shutemov (9):
thp: huge zero page: basic preparation
thp: zap_huge_pmd(): zap huge zero pmd
thp: copy_huge_pmd(): copy huge zero page
thp: do_huge_pmd_wp_page(): handle huge zero page
thp: change_huge_pmd(): keep huge zero page write-protected
thp: add address parameter to split_huge_page_pmd()
thp: implement splitting pmd for huge zero page
thp: setup huge zero page on non-write page fault
thp: lazy huge zero page allocation

Documentation/vm/transhuge.txt | 4 +-
arch/x86/kernel/vm86_32.c | 2 +-
fs/proc/task_mmu.c | 2 +-
include/linux/huge_mm.h | 10 +-
include/linux/mm.h | 8 ++
mm/huge_memory.c | 228 +++++++++++++++++++++++++++++++++++-----
mm/memory.c | 11 +--
mm/mempolicy.c | 2 +-
mm/mprotect.c | 2 +-
mm/mremap.c | 3 +-
mm/pagewalk.c | 2 +-
11 files changed, 226 insertions(+), 48 deletions(-)

--
1.7.7.6


2012-08-09 09:08:39

by Kirill A. Shutemov

Subject: [PATCH, RFC 1/9] thp: huge zero page: basic preparation

From: "Kirill A. Shutemov" <[email protected]>

For now let's allocate the page on hugepage_init(). We'll switch to lazy
allocation later.

We are not going to map the huge zero page until we can handle it
properly on all code paths.

The is_huge_zero_{pfn,pmd}() functions will be used by the following patches
to check whether a pfn/pmd is the huge zero page.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
mm/huge_memory.c | 29 +++++++++++++++++++++++++++++
1 files changed, 29 insertions(+), 0 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 57c4b93..88e0a7a 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -46,6 +46,7 @@ static unsigned int khugepaged_scan_sleep_millisecs __read_mostly = 10000;
/* during fragmentation poll the hugepage allocator once every minute */
static unsigned int khugepaged_alloc_sleep_millisecs __read_mostly = 60000;
static struct task_struct *khugepaged_thread __read_mostly;
+static unsigned long huge_zero_pfn __read_mostly;
static DEFINE_MUTEX(khugepaged_mutex);
static DEFINE_SPINLOCK(khugepaged_mm_lock);
static DECLARE_WAIT_QUEUE_HEAD(khugepaged_wait);
@@ -167,6 +168,28 @@ out:
return err;
}

+static int init_huge_zero_page(void)
+{
+ struct page *hpage;
+
+ hpage = alloc_pages(GFP_TRANSHUGE | __GFP_ZERO, HPAGE_PMD_ORDER);
+ if (!hpage)
+ return -ENOMEM;
+
+ huge_zero_pfn = page_to_pfn(hpage);
+ return 0;
+}
+
+static inline bool is_huge_zero_pfn(unsigned long pfn)
+{
+ return pfn == huge_zero_pfn;
+}
+
+static inline bool is_huge_zero_pmd(pmd_t pmd)
+{
+ return is_huge_zero_pfn(pmd_pfn(pmd));
+}
+
#ifdef CONFIG_SYSFS

static ssize_t double_flag_show(struct kobject *kobj,
@@ -550,6 +573,10 @@ static int __init hugepage_init(void)
if (err)
return err;

+ err = init_huge_zero_page();
+ if (err)
+ goto out;
+
err = khugepaged_slab_init();
if (err)
goto out;
@@ -574,6 +601,8 @@ static int __init hugepage_init(void)

return 0;
out:
+ if (huge_zero_pfn)
+ __free_page(pfn_to_page(huge_zero_pfn));
hugepage_exit_sysfs(hugepage_kobj);
return err;
}
--
1.7.7.6

2012-08-09 09:09:06

by Kirill A. Shutemov

Subject: [PATCH, RFC 9/9] thp: lazy huge zero page allocation

From: "Kirill A. Shutemov" <[email protected]>

Instead of allocating the huge zero page in hugepage_init(), we can postpone
the allocation until the first huge zero page mapping. This saves memory if
THP is not in use.

cmpxchg() is used to avoid a race on huge_zero_pfn initialization.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
mm/huge_memory.c | 20 ++++++++++----------
1 files changed, 10 insertions(+), 10 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 3a78677..6861230 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -168,21 +168,23 @@ out:
return err;
}

-static int init_huge_zero_page(void)
+static int init_huge_zero_pfn(void)
{
struct page *hpage;
+ unsigned long pfn;

hpage = alloc_pages(GFP_TRANSHUGE | __GFP_ZERO, HPAGE_PMD_ORDER);
if (!hpage)
return -ENOMEM;
-
- huge_zero_pfn = page_to_pfn(hpage);
+ pfn = page_to_pfn(hpage);
+ if (cmpxchg(&huge_zero_pfn, 0, pfn))
+ __free_page(hpage);
return 0;
}

static inline bool is_huge_zero_pfn(unsigned long pfn)
{
- return pfn == huge_zero_pfn;
+ return huge_zero_pfn && pfn == huge_zero_pfn;
}

static inline bool is_huge_zero_pmd(pmd_t pmd)
@@ -573,10 +575,6 @@ static int __init hugepage_init(void)
if (err)
return err;

- err = init_huge_zero_page();
- if (err)
- goto out;
-
err = khugepaged_slab_init();
if (err)
goto out;
@@ -601,8 +599,6 @@ static int __init hugepage_init(void)

return 0;
out:
- if (huge_zero_pfn)
- __free_page(pfn_to_page(huge_zero_pfn));
hugepage_exit_sysfs(hugepage_kobj);
return err;
}
@@ -752,6 +748,10 @@ int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
return VM_FAULT_OOM;
if (!(flags & FAULT_FLAG_WRITE)) {
pgtable_t pgtable;
+ if (unlikely(!huge_zero_pfn && init_huge_zero_pfn())) {
+ count_vm_event(THP_FAULT_FALLBACK);
+ goto out;
+ }
pgtable = pte_alloc_one(mm, haddr);
if (unlikely(!pgtable))
goto out;
--
1.7.7.6

2012-08-09 09:08:50

by Kirill A. Shutemov

Subject: [PATCH, RFC 5/9] thp: change_huge_pmd(): keep huge zero page write-protected

From: "Kirill A. Shutemov" <[email protected]>

We want to get a page fault on a write attempt to the huge zero page, so
let's keep it write-protected.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
mm/huge_memory.c | 2 ++
1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index f5029d4..4001f1a 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1248,6 +1248,8 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
pmd_t entry;
entry = pmdp_get_and_clear(mm, addr, pmd);
entry = pmd_modify(entry, newprot);
+ if (is_huge_zero_pmd(entry))
+ entry = pmd_wrprotect(entry);
set_pmd_at(mm, addr, pmd, entry);
spin_unlock(&vma->vm_mm->page_table_lock);
ret = 1;
--
1.7.7.6

2012-08-09 09:08:48

by Kirill A. Shutemov

Subject: [PATCH, RFC 7/9] thp: implement splitting pmd for huge zero page

From: "Kirill A. Shutemov" <[email protected]>

We can't split the huge zero page itself, but we can split the pmd which
points to it.

On splitting the pmd we create a page table with all ptes set to the normal
zero page.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
mm/huge_memory.c | 36 ++++++++++++++++++++++++++++++++++++
1 files changed, 36 insertions(+), 0 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index c8948d6..4df5841 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1599,6 +1599,7 @@ int split_huge_page(struct page *page)
struct anon_vma *anon_vma;
int ret = 1;

+ BUG_ON(is_huge_zero_pfn(page_to_pfn(page)));
BUG_ON(!PageAnon(page));
anon_vma = page_lock_anon_vma(page);
if (!anon_vma)
@@ -2503,6 +2504,36 @@ static int khugepaged(void *none)
return 0;
}

+static void __split_huge_zero_page_pmd(struct mm_struct *mm, pmd_t *pmd,
+ unsigned long address)
+{
+ pgtable_t pgtable;
+ pmd_t _pmd;
+ unsigned long haddr = address & HPAGE_PMD_MASK;
+ struct vm_area_struct *vma;
+ int i;
+
+ vma = find_vma(mm, address);
+ VM_BUG_ON(vma == NULL);
+ pmdp_clear_flush_notify(vma, haddr, pmd);
+ /* leave pmd empty until pte is filled */
+
+ pgtable = get_pmd_huge_pte(mm);
+ pmd_populate(mm, &_pmd, pgtable);
+
+ for (i = 0; i < HPAGE_PMD_NR; i++, haddr += PAGE_SIZE) {
+ pte_t *pte, entry;
+ entry = pfn_pte(my_zero_pfn(haddr), vma->vm_page_prot);
+ entry = pte_mkspecial(entry);
+ pte = pte_offset_map(&_pmd, haddr);
+ VM_BUG_ON(!pte_none(*pte));
+ set_pte_at(mm, haddr, pte, entry);
+ pte_unmap(pte);
+ }
+ smp_wmb(); /* make pte visible before pmd */
+ pmd_populate(mm, pmd, pgtable);
+}
+
void __split_huge_page_pmd(struct mm_struct *mm, pmd_t *pmd,
unsigned long address)
{
@@ -2513,6 +2544,11 @@ void __split_huge_page_pmd(struct mm_struct *mm, pmd_t *pmd,
spin_unlock(&mm->page_table_lock);
return;
}
+ if (is_huge_zero_pmd(*pmd)) {
+ __split_huge_zero_page_pmd(mm, pmd, address);
+ spin_unlock(&mm->page_table_lock);
+ return;
+ }
page = pmd_page(*pmd);
VM_BUG_ON(!page_count(page));
get_page(page);
--
1.7.7.6

2012-08-09 09:08:46

by Kirill A. Shutemov

Subject: [PATCH, RFC 2/9] thp: zap_huge_pmd(): zap huge zero pmd

From: "Kirill A. Shutemov" <[email protected]>

We don't have a real page to zap in the huge zero page case. Let's just
clear the pmd and remove it from the TLB.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
mm/huge_memory.c | 27 +++++++++++++++++----------
1 files changed, 17 insertions(+), 10 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 88e0a7a..9dcb9e6 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1071,16 +1071,23 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
struct page *page;
pgtable_t pgtable;
pgtable = get_pmd_huge_pte(tlb->mm);
- page = pmd_page(*pmd);
- pmd_clear(pmd);
- tlb_remove_pmd_tlb_entry(tlb, pmd, addr);
- page_remove_rmap(page);
- VM_BUG_ON(page_mapcount(page) < 0);
- add_mm_counter(tlb->mm, MM_ANONPAGES, -HPAGE_PMD_NR);
- VM_BUG_ON(!PageHead(page));
- tlb->mm->nr_ptes--;
- spin_unlock(&tlb->mm->page_table_lock);
- tlb_remove_page(tlb, page);
+ if (is_huge_zero_pmd(*pmd)) {
+ pmd_clear(pmd);
+ tlb_remove_pmd_tlb_entry(tlb, pmd, addr);
+ tlb->mm->nr_ptes--;
+ spin_unlock(&tlb->mm->page_table_lock);
+ } else {
+ page = pmd_page(*pmd);
+ pmd_clear(pmd);
+ tlb_remove_pmd_tlb_entry(tlb, pmd, addr);
+ page_remove_rmap(page);
+ VM_BUG_ON(page_mapcount(page) < 0);
+ add_mm_counter(tlb->mm, MM_ANONPAGES, -HPAGE_PMD_NR);
+ VM_BUG_ON(!PageHead(page));
+ tlb->mm->nr_ptes--;
+ spin_unlock(&tlb->mm->page_table_lock);
+ tlb_remove_page(tlb, page);
+ }
pte_free(tlb->mm, pgtable);
ret = 1;
}
--
1.7.7.6

2012-08-09 09:08:43

by Kirill A. Shutemov

Subject: [PATCH, RFC 3/9] thp: copy_huge_pmd(): copy huge zero page

From: "Kirill A. Shutemov" <[email protected]>

It's easy to copy the huge zero page: just set the destination pmd to the
huge zero page.

It's safe to copy the huge zero page since nothing maps it yet :-p

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
mm/huge_memory.c | 17 +++++++++++++++++
1 files changed, 17 insertions(+), 0 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 9dcb9e6..a534f84 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -725,6 +725,18 @@ static inline struct page *alloc_hugepage(int defrag)
}
#endif

+static void set_huge_zero_page(pgtable_t pgtable, struct mm_struct *mm,
+ struct vm_area_struct *vma, unsigned long haddr, pmd_t *pmd)
+{
+ pmd_t entry;
+ entry = pfn_pmd(huge_zero_pfn, vma->vm_page_prot);
+ entry = pmd_wrprotect(entry);
+ entry = pmd_mkhuge(entry);
+ set_pmd_at(mm, haddr, pmd, entry);
+ prepare_pmd_huge_pte(pgtable, mm);
+ mm->nr_ptes++;
+}
+
int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
unsigned long address, pmd_t *pmd,
unsigned int flags)
@@ -802,6 +814,11 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
pte_free(dst_mm, pgtable);
goto out_unlock;
}
+ if (is_huge_zero_pmd(pmd)) {
+ set_huge_zero_page(pgtable, dst_mm, vma, addr, dst_pmd);
+ ret = 0;
+ goto out_unlock;
+ }
if (unlikely(pmd_trans_splitting(pmd))) {
/* split huge page running from under us */
spin_unlock(&src_mm->page_table_lock);
--
1.7.7.6

2012-08-09 09:10:54

by Kirill A. Shutemov

Subject: [PATCH, RFC 8/9] thp: setup huge zero page on non-write page fault

From: "Kirill A. Shutemov" <[email protected]>

All code paths seem to be covered. Now we can map the huge zero page on a
read page fault.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
mm/huge_memory.c | 10 ++++++++++
1 files changed, 10 insertions(+), 0 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 4df5841..3a78677 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -750,6 +750,16 @@ int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
return VM_FAULT_OOM;
if (unlikely(khugepaged_enter(vma)))
return VM_FAULT_OOM;
+ if (!(flags & FAULT_FLAG_WRITE)) {
+ pgtable_t pgtable;
+ pgtable = pte_alloc_one(mm, haddr);
+ if (unlikely(!pgtable))
+ goto out;
+ spin_lock(&mm->page_table_lock);
+ set_huge_zero_page(pgtable, mm, vma, haddr, pmd);
+ spin_unlock(&mm->page_table_lock);
+ return 0;
+ }
page = alloc_hugepage_vma(transparent_hugepage_defrag(vma),
vma, haddr, numa_node_id(), 0);
if (unlikely(!page)) {
--
1.7.7.6

2012-08-09 09:11:12

by Kirill A. Shutemov

Subject: [PATCH, RFC 6/9] thp: add address parameter to split_huge_page_pmd()

From: "Kirill A. Shutemov" <[email protected]>

This is required to implement huge zero pmd splitting.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
Documentation/vm/transhuge.txt | 4 ++--
arch/x86/kernel/vm86_32.c | 2 +-
fs/proc/task_mmu.c | 2 +-
include/linux/huge_mm.h | 10 ++++++----
mm/huge_memory.c | 5 +++--
mm/memory.c | 4 ++--
mm/mempolicy.c | 2 +-
mm/mprotect.c | 2 +-
mm/mremap.c | 3 ++-
mm/pagewalk.c | 2 +-
10 files changed, 20 insertions(+), 16 deletions(-)

diff --git a/Documentation/vm/transhuge.txt b/Documentation/vm/transhuge.txt
index f734bb2..b1fe2ca 100644
--- a/Documentation/vm/transhuge.txt
+++ b/Documentation/vm/transhuge.txt
@@ -276,7 +276,7 @@ unaffected. libhugetlbfs will also work fine as usual.
== Graceful fallback ==

Code walking pagetables but unware about huge pmds can simply call
-split_huge_page_pmd(mm, pmd) where the pmd is the one returned by
+split_huge_page_pmd(mm, pmd, addr) where the pmd is the one returned by
pmd_offset. It's trivial to make the code transparent hugepage aware
by just grepping for "pmd_offset" and adding split_huge_page_pmd where
missing after pmd_offset returns the pmd. Thanks to the graceful
@@ -299,7 +299,7 @@ diff --git a/mm/mremap.c b/mm/mremap.c
return NULL;

pmd = pmd_offset(pud, addr);
-+ split_huge_page_pmd(mm, pmd);
++ split_huge_page_pmd(mm, pmd, addr);
if (pmd_none_or_clear_bad(pmd))
return NULL;

diff --git a/arch/x86/kernel/vm86_32.c b/arch/x86/kernel/vm86_32.c
index 255f58a..719ba0c 100644
--- a/arch/x86/kernel/vm86_32.c
+++ b/arch/x86/kernel/vm86_32.c
@@ -180,7 +180,7 @@ static void mark_screen_rdonly(struct mm_struct *mm)
if (pud_none_or_clear_bad(pud))
goto out;
pmd = pmd_offset(pud, 0xA0000);
- split_huge_page_pmd(mm, pmd);
+ split_huge_page_pmd(mm, pmd, 0xA0000);
if (pmd_none_or_clear_bad(pmd))
goto out;
pte = pte_offset_map_lock(mm, pmd, 0xA0000, &ptl);
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 4540b8f..27c1827 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -597,7 +597,7 @@ static int clear_refs_pte_range(pmd_t *pmd, unsigned long addr,
spinlock_t *ptl;
struct page *page;

- split_huge_page_pmd(walk->mm, pmd);
+ split_huge_page_pmd(walk->mm, pmd, addr);
if (pmd_trans_unstable(pmd))
return 0;

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 4c59b11..ce91199 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -92,12 +92,14 @@ extern int handle_pte_fault(struct mm_struct *mm,
struct vm_area_struct *vma, unsigned long address,
pte_t *pte, pmd_t *pmd, unsigned int flags);
extern int split_huge_page(struct page *page);
-extern void __split_huge_page_pmd(struct mm_struct *mm, pmd_t *pmd);
-#define split_huge_page_pmd(__mm, __pmd) \
+extern void __split_huge_page_pmd(struct mm_struct *mm, pmd_t *pmd,
+ unsigned long address);
+#define split_huge_page_pmd(__mm, __pmd, __address) \
do { \
pmd_t *____pmd = (__pmd); \
if (unlikely(pmd_trans_huge(*____pmd))) \
- __split_huge_page_pmd(__mm, ____pmd); \
+ __split_huge_page_pmd(__mm, ____pmd, \
+ __address); \
} while (0)
#define wait_split_huge_page(__anon_vma, __pmd) \
do { \
@@ -174,7 +176,7 @@ static inline int split_huge_page(struct page *page)
{
return 0;
}
-#define split_huge_page_pmd(__mm, __pmd) \
+#define split_huge_page_pmd(__mm, __pmd, __address) \
do { } while (0)
#define wait_split_huge_page(__anon_vma, __pmd) \
do { } while (0)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 4001f1a..c8948d6 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2503,7 +2503,8 @@ static int khugepaged(void *none)
return 0;
}

-void __split_huge_page_pmd(struct mm_struct *mm, pmd_t *pmd)
+void __split_huge_page_pmd(struct mm_struct *mm, pmd_t *pmd,
+ unsigned long address)
{
struct page *page;

@@ -2547,7 +2548,7 @@ static void split_huge_page_address(struct mm_struct *mm,
* Caller holds the mmap_sem write mode, so a huge pmd cannot
* materialize from under us.
*/
- split_huge_page_pmd(mm, pmd);
+ split_huge_page_pmd(mm, pmd, address);
}

void __vma_adjust_trans_huge(struct vm_area_struct *vma,
diff --git a/mm/memory.c b/mm/memory.c
index c281847..e202392 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1227,7 +1227,7 @@ static inline unsigned long zap_pmd_range(struct mmu_gather *tlb,
BUG();
}
#endif
- split_huge_page_pmd(vma->vm_mm, pmd);
+ split_huge_page_pmd(vma->vm_mm, pmd, addr);
} else if (zap_huge_pmd(tlb, vma, pmd, addr))
goto next;
/* fall through */
@@ -1493,7 +1493,7 @@ struct page *follow_page(struct vm_area_struct *vma, unsigned long address,
}
if (pmd_trans_huge(*pmd)) {
if (flags & FOLL_SPLIT) {
- split_huge_page_pmd(mm, pmd);
+ split_huge_page_pmd(mm, pmd, address);
goto split_fallthrough;
}
spin_lock(&mm->page_table_lock);
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 1d771e4..44c818d 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -511,7 +511,7 @@ static inline int check_pmd_range(struct vm_area_struct *vma, pud_t *pud,
pmd = pmd_offset(pud, addr);
do {
next = pmd_addr_end(addr, end);
- split_huge_page_pmd(vma->vm_mm, pmd);
+ split_huge_page_pmd(vma->vm_mm, pmd, addr);
if (pmd_none_or_trans_huge_or_clear_bad(pmd))
continue;
if (check_pte_range(vma, pmd, addr, next, nodes,
diff --git a/mm/mprotect.c b/mm/mprotect.c
index a409926..e85e29d 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -90,7 +90,7 @@ static inline void change_pmd_range(struct vm_area_struct *vma, pud_t *pud,
next = pmd_addr_end(addr, end);
if (pmd_trans_huge(*pmd)) {
if (next - addr != HPAGE_PMD_SIZE)
- split_huge_page_pmd(vma->vm_mm, pmd);
+ split_huge_page_pmd(vma->vm_mm, pmd, addr);
else if (change_huge_pmd(vma, pmd, addr, newprot))
continue;
/* fall through */
diff --git a/mm/mremap.c b/mm/mremap.c
index 21fed20..790df47 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -156,7 +156,8 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
need_flush = true;
continue;
} else if (!err) {
- split_huge_page_pmd(vma->vm_mm, old_pmd);
+ split_huge_page_pmd(vma->vm_mm, old_pmd,
+ old_addr);
}
VM_BUG_ON(pmd_trans_huge(*old_pmd));
}
diff --git a/mm/pagewalk.c b/mm/pagewalk.c
index 6c118d0..7e92ebd 100644
--- a/mm/pagewalk.c
+++ b/mm/pagewalk.c
@@ -58,7 +58,7 @@ again:
if (!walk->pte_entry)
continue;

- split_huge_page_pmd(walk->mm, pmd);
+ split_huge_page_pmd(walk->mm, pmd, addr);
if (pmd_none_or_trans_huge_or_clear_bad(pmd))
goto again;
err = walk_pte_range(pmd, addr, next, walk);
--
1.7.7.6

2012-08-09 09:11:33

by Kirill A. Shutemov

Subject: [PATCH, RFC 4/9] thp: do_huge_pmd_wp_page(): handle huge zero page

From: "Kirill A. Shutemov" <[email protected]>

On write access to the huge zero page we allocate a new page and clear it.

In the fallback path we create a new page table and set the pte at the fault
address to the newly allocated page. All other ptes are set to the normal
zero page.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
include/linux/mm.h | 8 ++++
mm/huge_memory.c | 102 ++++++++++++++++++++++++++++++++++++++++++++--------
mm/memory.c | 7 ----
3 files changed, 95 insertions(+), 22 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index b36d08c..c6eef63 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -514,6 +514,14 @@ static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma)
}
#endif

+#ifndef my_zero_pfn
+static inline unsigned long my_zero_pfn(unsigned long addr)
+{
+ extern unsigned long zero_pfn;
+ return zero_pfn;
+}
+#endif
+
/*
* Multiple processes may "see" the same page. E.g. for untouched
* mappings of /dev/null, all processes see the same page full of
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index a534f84..f5029d4 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -867,6 +867,61 @@ pgtable_t get_pmd_huge_pte(struct mm_struct *mm)
return pgtable;
}

+static int do_huge_pmd_wp_zero_page_fallback(struct mm_struct *mm,
+ struct vm_area_struct *vma, unsigned long address,
+ pmd_t *pmd, unsigned long haddr)
+{
+ pgtable_t pgtable;
+ pmd_t _pmd;
+ struct page *page;
+ int i, ret = 0;
+
+ page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, address);
+ if (!page) {
+ ret |= VM_FAULT_OOM;
+ goto out;
+ }
+
+ if (mem_cgroup_newpage_charge(page, mm, GFP_KERNEL)) {
+ put_page(page);
+ ret |= VM_FAULT_OOM;
+ goto out;
+ }
+
+ clear_user_highpage(page, address);
+ __SetPageUptodate(page);
+
+ spin_lock(&mm->page_table_lock);
+ pmdp_clear_flush_notify(vma, haddr, pmd);
+ /* leave pmd empty until pte is filled */
+
+ pgtable = get_pmd_huge_pte(mm);
+ pmd_populate(mm, &_pmd, pgtable);
+
+ for (i = 0; i < HPAGE_PMD_NR; i++, haddr += PAGE_SIZE) {
+ pte_t *pte, entry;
+ if (haddr == (address & PAGE_MASK)) {
+ entry = mk_pte(page, vma->vm_page_prot);
+ entry = maybe_mkwrite(pte_mkdirty(entry), vma);
+ page_add_new_anon_rmap(page, vma, haddr);
+ } else {
+ entry = pfn_pte(my_zero_pfn(haddr), vma->vm_page_prot);
+ entry = pte_mkspecial(entry);
+ }
+ pte = pte_offset_map(&_pmd, haddr);
+ VM_BUG_ON(!pte_none(*pte));
+ set_pte_at(mm, haddr, pte, entry);
+ pte_unmap(pte);
+ }
+ smp_wmb(); /* make pte visible before pmd */
+ pmd_populate(mm, pmd, pgtable);
+ spin_unlock(&mm->page_table_lock);
+
+ ret |= VM_FAULT_WRITE;
+out:
+ return ret;
+}
+
static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm,
struct vm_area_struct *vma,
unsigned long address,
@@ -964,17 +1019,19 @@ int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
unsigned long address, pmd_t *pmd, pmd_t orig_pmd)
{
int ret = 0;
- struct page *page, *new_page;
+ struct page *page = NULL, *new_page;
unsigned long haddr;

VM_BUG_ON(!vma->anon_vma);
+ haddr = address & HPAGE_PMD_MASK;
+ if (is_huge_zero_pmd(orig_pmd))
+ goto alloc;
spin_lock(&mm->page_table_lock);
if (unlikely(!pmd_same(*pmd, orig_pmd)))
goto out_unlock;

page = pmd_page(orig_pmd);
VM_BUG_ON(!PageCompound(page) || !PageHead(page));
- haddr = address & HPAGE_PMD_MASK;
if (page_mapcount(page) == 1) {
pmd_t entry;
entry = pmd_mkyoung(orig_pmd);
@@ -986,7 +1043,7 @@ int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
}
get_page(page);
spin_unlock(&mm->page_table_lock);
-
+alloc:
if (transparent_hugepage_enabled(vma) &&
!transparent_hugepage_debug_cow())
new_page = alloc_hugepage_vma(transparent_hugepage_defrag(vma),
@@ -996,28 +1053,39 @@ int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,

if (unlikely(!new_page)) {
count_vm_event(THP_FAULT_FALLBACK);
- ret = do_huge_pmd_wp_page_fallback(mm, vma, address,
- pmd, orig_pmd, page, haddr);
- if (ret & VM_FAULT_OOM)
- split_huge_page(page);
- put_page(page);
+ if (is_huge_zero_pmd(orig_pmd)) {
+ ret = do_huge_pmd_wp_zero_page_fallback(mm, vma,
+ address, pmd, haddr);
+ } else {
+ ret = do_huge_pmd_wp_page_fallback(mm, vma, address,
+ pmd, orig_pmd, page, haddr);
+ if (ret & VM_FAULT_OOM)
+ split_huge_page(page);
+ put_page(page);
+ }
goto out;
}
count_vm_event(THP_FAULT_ALLOC);

if (unlikely(mem_cgroup_newpage_charge(new_page, mm, GFP_KERNEL))) {
put_page(new_page);
- split_huge_page(page);
- put_page(page);
+ if (page) {
+ split_huge_page(page);
+ put_page(page);
+ }
ret |= VM_FAULT_OOM;
goto out;
}

- copy_user_huge_page(new_page, page, haddr, vma, HPAGE_PMD_NR);
+ if (is_huge_zero_pmd(orig_pmd))
+ clear_huge_page(new_page, haddr, HPAGE_PMD_NR);
+ else
+ copy_user_huge_page(new_page, page, haddr, vma, HPAGE_PMD_NR);
__SetPageUptodate(new_page);

spin_lock(&mm->page_table_lock);
- put_page(page);
+ if (page)
+ put_page(page);
if (unlikely(!pmd_same(*pmd, orig_pmd))) {
spin_unlock(&mm->page_table_lock);
mem_cgroup_uncharge_page(new_page);
@@ -1025,7 +1093,6 @@ int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
goto out;
} else {
pmd_t entry;
- VM_BUG_ON(!PageHead(page));
entry = mk_pmd(new_page, vma->vm_page_prot);
entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
entry = pmd_mkhuge(entry);
@@ -1033,8 +1100,13 @@ int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
page_add_new_anon_rmap(new_page, vma, haddr);
set_pmd_at(mm, haddr, pmd, entry);
update_mmu_cache(vma, address, entry);
- page_remove_rmap(page);
- put_page(page);
+ if (is_huge_zero_pmd(orig_pmd))
+ add_mm_counter(mm, MM_ANONPAGES, HPAGE_PMD_NR);
+ if (page) {
+ VM_BUG_ON(!PageHead(page));
+ page_remove_rmap(page);
+ put_page(page);
+ }
ret |= VM_FAULT_WRITE;
}
out_unlock:
diff --git a/mm/memory.c b/mm/memory.c
index 2466d12..c281847 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -720,13 +720,6 @@ static inline int is_zero_pfn(unsigned long pfn)
}
#endif

-#ifndef my_zero_pfn
-static inline unsigned long my_zero_pfn(unsigned long addr)
-{
- return zero_pfn;
-}
-#endif
-
/*
* vm_normal_page -- This function gets the "struct page" associated with a pte.
*
--
1.7.7.6

2012-08-10 10:33:07

by Kirill A. Shutemov

Subject: Re: [PATCH, RFC 0/9] Introduce huge zero page

On Fri, Aug 10, 2012 at 11:49:12AM +0800, Wanpeng Li wrote:
> On Thu, Aug 09, 2012 at 12:08:11PM +0300, Kirill A. Shutemov wrote:
> >From: "Kirill A. Shutemov" <[email protected]>
> >
> >During testing I noticed big (up to 2.5 times) memory consumption overhead
> >on some workloads (e.g. ft.A from NPB) if THP is enabled.
> >
> >The main reason for that big difference is lacking zero page in THP case.
> >We have to allocate a real page on read page fault.
> >
> >A program to demonstrate the issue:
> >#include <assert.h>
> >#include <stdlib.h>
> >#include <unistd.h>
> >
> >#define MB 1024*1024
> >
> >int main(int argc, char **argv)
> >{
> > char *p;
> > int i;
> >
> > posix_memalign((void **)&p, 2 * MB, 200 * MB);
> > for (i = 0; i < 200 * MB; i+= 4096)
> > assert(p[i] == 0);
> > pause();
> > return 0;
> >}
> >
> >With thp-never RSS is about 400k, but with thp-always it's 200M.
> >After the patchset thp-always RSS is 400k too.
> >
> Hi Kirill,
>
> Thank you for your patchset, I have some questions to ask.
>
> 1. In your patchset, on a read page fault the pmd will be populated by the huge
> zero page. IIUC, assert(p[i] == 0) is a read operation, so why is thp-always
> RSS 400K? You allocate 100 pages, why does each cost 4K? I think the
> overhead should be 2MB for the huge zero page instead of 400K. What am
> I missing?

The 400k comes not from the allocation, but from the libc runtime. The test
program consumes about the same without any allocation at all.

Zero page is a global resource. System owns it. It's not accounted to any
process.

>
> 2. If the user wants to allocate 200MB, 100 huge pages are needed in total. In
> your patchset's logic the code will allocate one 2MB huge zero page and
> populate all associated pmds with it. When the user attempts to write to the
> pages, a wp fault will be triggered, and if allocating a huge page fails it
> will fall back to do_huge_pmd_wp_zero_page_fallback, but you just
> create a new page table and set the pte at the fault address to the newly
> allocated page; all other ptes are set to the normal zero page. In this scenario
> the user only gets one 4K page and all the other zero pages, so how can the code
> continue to work? Why not fall back to allocating a normal page even if it is not
> physically contiguous?

Since we allocate a 4k page at the fault address, the fault is handled.
Userspace can use it.

If the process tries to write to any other 4k page of this area, a new
fault will be triggered and do_wp_page() will allocate a real page.

It's not reasonable to allocate all 4k pages in the fallback path. We can
postpone that until userspace really wants to use them. This way we reduce
memory pressure in the fallback path.
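
A small userspace sketch of that behaviour (illustrative only, assuming the
series is applied; none of this code is part of the patches):

#include <assert.h>
#include <stdlib.h>
#include <unistd.h>

#define MB (1024 * 1024)

int main(void)
{
        char *p;
        long i;

        assert(posix_memalign((void **)&p, 2 * MB, 200 * MB) == 0);

        /* Reads: every pmd maps the shared huge zero page, RSS stays tiny. */
        for (i = 0; i < 200 * MB; i += 4096)
                assert(p[i] == 0);

        /*
         * Writes: each touched 2M area takes a wp fault. It gets a real 2M
         * page if the THP allocation succeeds; otherwise only the written
         * 4k page is materialized and the rest of the pmd is filled with
         * the normal 4k zero page, as described above.
         */
        for (i = 0; i < 200 * MB; i += 2 * MB)
                p[i] = 1;

        pause();        /* inspect VmRSS in /proc/<pid>/status meanwhile */
        return 0;
}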

> 3. In your patchset logic:
> "In fallback path we create a new table and set pte around fault address
> to the newly allocated page. All other ptes set to normal zero page."
> When these zero pages will be replaced by real pages and add memcg charge?

I guess I've answered the question above.

> Look forward to your detail response, thank you! :)

Thanks for your questions.

--
Kirill A. Shutemov



2012-08-16 19:20:26

by Andrew Morton

Subject: Re: [PATCH, RFC 0/9] Introduce huge zero page

On Thu, 9 Aug 2012 12:08:11 +0300
"Kirill A. Shutemov" <[email protected]> wrote:

> During testing I noticed big (up to 2.5 times) memory consumption overhead
> on some workloads (e.g. ft.A from NPB) if THP is enabled.
>
> The main reason for that big difference is lacking zero page in THP case.
> We have to allocate a real page on read page fault.
>
> A program to demonstrate the issue:
> #include <assert.h>
> #include <stdlib.h>
> #include <unistd.h>
>
> #define MB 1024*1024
>
> int main(int argc, char **argv)
> {
> char *p;
> int i;
>
> posix_memalign((void **)&p, 2 * MB, 200 * MB);
> for (i = 0; i < 200 * MB; i+= 4096)
> assert(p[i] == 0);
> pause();
> return 0;
> }
>
> With thp-never RSS is about 400k, but with thp-always it's 200M.
> After the patchset thp-always RSS is 400k too.

That's a pretty big improvement for a rather fake test case. I wonder
how much benefit we'd see with real workloads?

Things are rather quiet at present, with summer and beaches and Kernel
Summit coming up. Please resend these patches early next month and
let's see if we can get a bit of action happening?

2012-08-16 19:27:47

by Andrea Arcangeli

Subject: Re: [PATCH, RFC 7/9] thp: implement splitting pmd for huge zero page

On Thu, Aug 09, 2012 at 12:08:18PM +0300, Kirill A. Shutemov wrote:
> +static void __split_huge_zero_page_pmd(struct mm_struct *mm, pmd_t *pmd,
> + unsigned long address)
> +{
> + pgtable_t pgtable;
> + pmd_t _pmd;
> + unsigned long haddr = address & HPAGE_PMD_MASK;
> + struct vm_area_struct *vma;
> + int i;
> +
> + vma = find_vma(mm, address);
> + VM_BUG_ON(vma == NULL);

I think you can use BUG_ON here just in case but see below how I would
change it.

> + pmdp_clear_flush_notify(vma, haddr, pmd);
> + /* leave pmd empty until pte is filled */
> +
> + pgtable = get_pmd_huge_pte(mm);
> + pmd_populate(mm, &_pmd, pgtable);
> +
> + for (i = 0; i < HPAGE_PMD_NR; i++, haddr += PAGE_SIZE) {
> + pte_t *pte, entry;
> + entry = pfn_pte(my_zero_pfn(haddr), vma->vm_page_prot);
> + entry = pte_mkspecial(entry);
> + pte = pte_offset_map(&_pmd, haddr);
> + VM_BUG_ON(!pte_none(*pte));
> + set_pte_at(mm, haddr, pte, entry);
> + pte_unmap(pte);
> + }
> + smp_wmb(); /* make pte visible before pmd */
> + pmd_populate(mm, pmd, pgtable);
> +}
> +

The last pmd_populate will corrupt memory.

See the comment in __split_huge_page_splitting(). If you set the pmd to
none at any given time, a new page fault will instantiate a hugepmd
thinking it's the first fault, and then you'll overwrite it, leaking
memory and corrupting userland.

The caller may be holding the mmap_sem in read mode too (pagewalk is
an example). The PSE bit must also remain on at all times.

The pmd must be made non-present and a TLB flush must happen before the
final pmd_populate with the regular pmd, to avoid tripping machine
checks on some CPUs (i.e. to avoid a 4k and a 2M TLB entry appearing for
the same vaddr).

I think you should replace pmdp_clear_flush_notify with:

pmdp_splitting_flush_notify(vma, haddr, pmd);

then build the 4k zero pages in the loop using the temporary _pmd set with
pmd_populate(&_pmd) and then:


/*
* Up to this point the pmd is present and huge and
* userland has the whole access to the hugepage
* during the split (which happens in place). If we
* overwrite the pmd with the not-huge version
* pointing to the pte here (which of course we could
* if all CPUs were bug free), userland could trigger
* a small page size TLB miss on the small sized TLB
* while the hugepage TLB entry is still established
* in the huge TLB. Some CPU doesn't like that. See
* http://support.amd.com/us/Processor_TechDocs/41322.pdf,
* Erratum 383 on page 93. Intel should be safe but is
* also warns that it's only safe if the permission
* and cache attributes of the two entries loaded in
* the two TLB is identical (which should be the case
* here). But it is generally safer to never allow
* small and huge TLB entries for the same virtual
* address to be loaded simultaneously. So instead of
* doing "pmd_populate(); flush_tlb_range();" we first
* mark the current pmd notpresent (atomically because
* here the pmd_trans_huge and pmd_trans_splitting
* must remain set at all times on the pmd until the
* split is complete for this pmd), then we flush the
* SMP TLB and finally we write the non-huge version
* of the pmd entry with pmd_populate.
*/
set_pmd_at(mm, address, pmd, pmd_mknotpresent(*pmd));
flush_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
pmd_populate(mm, pmd, pgtable);

note address above is actually haddr aligned (generated by
vma_address(page, vma) where page is a thp page)

> + if (is_huge_zero_pmd(*pmd)) {
> + __split_huge_zero_page_pmd(mm, pmd, address);

This will work fine but it's a bit sad having to add "address" at
every call, just to run a find_vma(). The only place that doesn't have
a vma already on the caller stack is actually pagewalk, all other
places already have a vma on the stack without having to find it with
the rbtree.

I think it may be better to change the param to
split_huge_page_pmd(vma, pmd).

Then have the standard split_huge_page_pmd obtain the mm with vma->vm_mm
(most callers already call it with split_huge_page_pmd(vma->vm_mm), so it
won't alter the cost whether vma->vm_mm is done in the caller or the callee).

split_huge_page_address should also take the vma (all callers invoke it
as split_huge_page_address(vma->vm_mm), so it'll be a zero-cost change).

Then we can add a split_huge_page_pmd_mm(mm, address, pmd) or
split_huge_page_pmd_address(mm, address, pmd) (call it as you
prefer...) only for the pagewalk caller that will do the find_vma and
BUG_ON if it's not found.

In that new split_huge_page_pmd_mm you can also add a BUG_ON checking
that vma->vm_start <= haddr and vma->vm_end >= haddr + HPAGE_PMD_SIZE,
in addition to the BUG_ON(!vma) above, for more robustness. I'm not aware
of any place calling it without mmap_sem held at least for reading,
and the vma must be stable, but more debug checks won't hurt.
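
Roughly, the calling convention suggested above would look something like this
(an illustrative sketch only, not code from the series; details such as how
haddr is derived are left out):

/* Callers that already have the vma on their stack pass it directly. */
#define split_huge_page_pmd(__vma, __pmd)                              \
        do {                                                           \
                pmd_t *____pmd = (__pmd);                              \
                if (unlikely(pmd_trans_huge(*____pmd)))                \
                        __split_huge_page_pmd(__vma, ____pmd);         \
        } while (0)

/* Pagewalk-only variant: looks up the vma itself and sanity-checks it. */
void split_huge_page_pmd_mm(struct mm_struct *mm, unsigned long address,
                            pmd_t *pmd);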

Thanks!
Andrea

2012-08-16 19:40:34

by Andrea Arcangeli

Subject: Re: [PATCH, RFC 0/9] Introduce huge zero page

Hi Andrew,

On Thu, Aug 16, 2012 at 12:20:23PM -0700, Andrew Morton wrote:
> That's a pretty big improvement for a rather fake test case. I wonder
> how much benefit we'd see with real workloads?

The same discussion happened about the zero page in general and
there's no easy answer. I seem to recall that it was dropped at some
point and then we reintroduced the zero page later.

Most of the time it won't be worth it; it's just a few pathological
compute loads that benefit, IIRC. So I'm overall positive about it
(once it's stable).

Because this is done the right way (i.e. allocating a hugepage at the
first wp fault, and falling back exclusively if compaction fails), it
will help much less than the 4k zero page if the zero mappings are
scattered over the address space and not contiguous (it only helps if
there are 512 of them in a row, i.e. a whole pmd's worth of 4k pages).
OTOH if they're contiguous, the huge zero page will perform better than
the 4k zero page.

2012-08-16 19:42:10

by Andrea Arcangeli

Subject: Re: [PATCH, RFC 6/9] thp: add address parameter to split_huge_page_pmd()

On Thu, Aug 09, 2012 at 12:08:17PM +0300, Kirill A. Shutemov wrote:
> From: "Kirill A. Shutemov" <[email protected]>
>
> It's required to implement huge zero pmd splitting.
>

This isn't bisectable with the next one; it'd fail on the wfg 0-DAY kernel
build testing backend. However, this is clearly to separate this patch
from the next one and keep the sizes small, so I don't mind.

2012-08-16 23:08:43

by H. Peter Anvin

Subject: Re: [PATCH, RFC 0/9] Introduce huge zero page

On 08/16/2012 12:40 PM, Andrea Arcangeli wrote:
> Hi Andrew,
>
> On Thu, Aug 16, 2012 at 12:20:23PM -0700, Andrew Morton wrote:
>> That's a pretty big improvement for a rather fake test case. I wonder
>> how much benefit we'd see with real workloads?
>
> The same discussion happened about the zero page in general and
> there's no easy answer. I seem to recall that it was dropped at some
> point and then we reintroduced the zero page later.
>
> Most of the time it won't be worth it, it's just a few pathological
> compute loads that benefits IIRC. So I'm overall positive about it
> (after it's stable).
>
> Because this is done the right way (i.e. to allocate an hugepage at
> the first wp fault, and to fallback exclusively if compaction fails)
> it will help much less than the 4k zero pages if the zero pages are
> scattered over the address space and not contiguous (it only helps if
> there are 512 of them in a row). OTOH if they're contiguous, the huge
> zero pages will perform better than the 4k zero pages.
>

One thing that I asked for was testing a "virtual zero page", where the same
page (or N pages for N-way page coloring) is reused across a page table.
It would have worse TLB performance but likely *much* better cache
behavior.

-hpa

2012-08-16 23:12:49

by Andi Kleen

Subject: Re: [PATCH, RFC 0/9] Introduce huge zero page

> Because this is done the right way (i.e. to allocate an hugepage at
> the first wp fault, and to fallback exclusively if compaction fails)
> it will help much less than the 4k zero pages if the zero pages are

The main benefit is that you have a zero page with THP enabled.
So it lowers the cost of having THP on (for workloads that benefit
from a zero page)

-Andi
--
[email protected] -- Speaking for myself only

2012-08-17 07:49:24

by Kirill A. Shutemov

Subject: Re: [PATCH, RFC 6/9] thp: add address parameter to split_huge_page_pmd()

On Thu, Aug 16, 2012 at 09:42:01PM +0200, Andrea Arcangeli wrote:
> On Thu, Aug 09, 2012 at 12:08:17PM +0300, Kirill A. Shutemov wrote:
> > From: "Kirill A. Shutemov" <[email protected]>
> >
> > It's required to implement huge zero pmd splitting.
> >
>
> This isn't bisectable with the next one, it'd fail on wfg 0-DAY kernel
> build testing backend, however this is clearly to separate this patch
> from the next, to keep the size small so I don't mind.

Hm. I don't see why it's not bisectable. It only adds a new parameter to
the function. The parameter is unused until the next patch.

Actually, I've checked build bisectability with aiaiai[1].

[1] http://git.infradead.org/users/dedekind/aiaiai.git

--
Kirill A. Shutemov



2012-08-17 08:12:34

by Kirill A. Shutemov

Subject: Re: [PATCH, RFC 7/9] thp: implement splitting pmd for huge zero page

On Thu, Aug 16, 2012 at 09:27:38PM +0200, Andrea Arcangeli wrote:
> On Thu, Aug 09, 2012 at 12:08:18PM +0300, Kirill A. Shutemov wrote:
> > +static void __split_huge_zero_page_pmd(struct mm_struct *mm, pmd_t *pmd,
> > + unsigned long address)
> > +{
> > + pgtable_t pgtable;
> > + pmd_t _pmd;
> > + unsigned long haddr = address & HPAGE_PMD_MASK;
> > + struct vm_area_struct *vma;
> > + int i;
> > +
> > + vma = find_vma(mm, address);
> > + VM_BUG_ON(vma == NULL);
>
> I think you can use BUG_ON here just in case but see below how I would
> change it.
>
> > + pmdp_clear_flush_notify(vma, haddr, pmd);
> > + /* leave pmd empty until pte is filled */
> > +
> > + pgtable = get_pmd_huge_pte(mm);
> > + pmd_populate(mm, &_pmd, pgtable);
> > +
> > + for (i = 0; i < HPAGE_PMD_NR; i++, haddr += PAGE_SIZE) {
> > + pte_t *pte, entry;
> > + entry = pfn_pte(my_zero_pfn(haddr), vma->vm_page_prot);
> > + entry = pte_mkspecial(entry);
> > + pte = pte_offset_map(&_pmd, haddr);
> > + VM_BUG_ON(!pte_none(*pte));
> > + set_pte_at(mm, haddr, pte, entry);
> > + pte_unmap(pte);
> > + }
> > + smp_wmb(); /* make pte visible before pmd */
> > + pmd_populate(mm, pmd, pgtable);
> > +}
> > +
>
> The last pmd_populate will corrupt memory.

Nice catch, thank you.

I've used do_huge_pmd_wp_page_fallback() as a template for my code.
What's the difference between these two code paths?
Why is do_huge_pmd_wp_page_fallback() safe?

> > + if (is_huge_zero_pmd(*pmd)) {
> > + __split_huge_zero_page_pmd(mm, pmd, address);
>
> This will work fine but it's a bit sad having to add "address" at
> every call, just to run a find_vma(). The only place that doesn't have
> a vma already on the caller stack is actually pagewalk, all other
> places already have a vma on the stack without having to find it with
> the rbtree.
>
> I think it may be better to change the param to
> split_huge_page_pmd(vma, pmd).
>
> Then have standard split_huge_page_pmd obtain the mm with vma->vm_mm
> (most callers already calles it with split_huge_page_pmd(vma->vm_mm)
> so it won't alter the cost to do vma->vm_mm in caller or callee).
>
> split_huge_page_address also should take the vma (all callers are
> invoking it as split_huge_page_address(vma->vm_mm) so it'll be zero
> cost change).
>
> Then we can add a split_huge_page_pmd_mm(mm, address, pmd) or
> split_huge_page_pmd_address(mm, address, pmd) (call it as you
> prefer...) only for the pagewalk caller that will do the find_vma and
> BUG_ON if it's not found.
>
> In that new split_huge_page_pmd_mm you can also add a BUG_ON checking
> vma->vm_start to be <= haddr and vma->vm_end >= haddr+HPAGE_PMD_SIZE
> in addition to BUG_ON(!vma) above, for more robustness. I'm not aware
> of any place calling it without mmap_sem hold at least for reading
> and the vma must be stable, but more debug checks won't hurt.

Looks reasonable. I'll update it in the next revision.

--
Kirill A. Shutemov



2012-08-17 16:34:02

by Andrea Arcangeli

Subject: Re: [PATCH, RFC 7/9] thp: implement splitting pmd for huge zero page

On Fri, Aug 17, 2012 at 11:12:33AM +0300, Kirill A. Shutemov wrote:
> I've used do_huge_pmd_wp_page_fallback() as template for my code.
> What's difference between these two code paths?
> Why is do_huge_pmd_wp_page_fallback() safe?

Good point. do_huge_pmd_wp_page_fallback() works only on the current
"mm", so it doesn't need the splitting transition, but thinking twice,
split_huge_zero_page_pmd also works only on the local "mm" because
you're not really splitting the zero page there (you're not affecting
other mms). As long as you keep holding the page_table_lock of the "mm"
that you're altering, your current version is safe.

I was mistaken because I'm very used to thinking of split_huge_page as
something that cannot rely on the page_table_lock, but this is a
simpler case that isn't splitting the "page" but only the "pmd" of a
single "mm", so you can safely rely on mm->page_table_lock :).

> Looks resonable. I'll update it in next revision.

Thanks. Of course the function parameter comments, about avoiding
unnecessary calls to find_vma(), weren't related to the above locking issue.

2012-08-31 14:06:13

by Kirill A. Shutemov

Subject: Re: [PATCH, RFC 7/9] thp: implement splitting pmd for huge zero page

On Thu, Aug 16, 2012 at 09:27:38PM +0200, Andrea Arcangeli wrote:
> On Thu, Aug 09, 2012 at 12:08:18PM +0300, Kirill A. Shutemov wrote:
> > + if (is_huge_zero_pmd(*pmd)) {
> > + __split_huge_zero_page_pmd(mm, pmd, address);
>
> This will work fine but it's a bit sad having to add "address" at
> every call, just to run a find_vma().

Hm. address is also used to calculate haddr...

It seems we need to pass address anyway. I mean vma + address.

--
Kirill A. Shutemov

