2011-03-03 20:00:35

by Andi Kleen

Subject: Fix NUMA problems in transparent hugepages and KSM

[Another updated version, with new Reviewed-bys
and the statistics issues Johannes pointed out fixed.]

The current transparent hugepages daemon can mess up local
memory affinity on NUMA systems. When it copies memory to a
huge page it does not necessarily keep it on the same
node as the local allocations.

While fixing this I also found some more related issues:
- The NUMA policy interleaving for THP was using the small
page size, not the large page size.
- KSM and THP copies also did not preserve the local node
- The accounting for local/remote allocations in the daemon
was misleading.
- There were no VM statistics counters for THP, which made it
impossible to analyze.

At least some of the bug fixes are 2.6.38 candidates IMHO
because some of the NUMA problems are pretty bad. In some workloads
this can cause performance problems.

What can be delayed are the __GFP_OTHER_NODE and statistics changes.

Git tree:

git://git.kernel.org/pub/scm/linux/kernel/git/ak/linux-misc-2.6.git thp-numa


2011-03-03 20:00:39

by Andi Kleen

Subject: [PATCH 4/8] Preserve original node for transparent huge page copies

From: Andi Kleen <[email protected]>

This makes a difference for LOCAL policy, where the node cannot
be determined from the policy itself, but has to be gotten
from the original page.

Acked-by: Andrea Arcangeli <[email protected]>
Signed-off-by: Andi Kleen <[email protected]>
Reviewed-by: KAMEZAWA Hiroyuki <[email protected]>
---
mm/huge_memory.c | 4 ++--
1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index c7c2cd9..1802db8 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -799,8 +799,8 @@ static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm,
}

for (i = 0; i < HPAGE_PMD_NR; i++) {
- pages[i] = alloc_page_vma(GFP_HIGHUSER_MOVABLE,
- vma, address);
+ pages[i] = alloc_page_vma_node(GFP_HIGHUSER_MOVABLE,
+ vma, address, page_to_nid(page));
if (unlikely(!pages[i] ||
mem_cgroup_newpage_charge(pages[i], mm,
GFP_KERNEL))) {
--
1.7.4

2011-03-03 20:00:49

by Andi Kleen

Subject: [PATCH 8/8] Add VM counters for transparent hugepages

From: Andi Kleen <[email protected]>

I found it difficult to make sense of transparent huge pages without
having any counters for its actions. Add some counters to vmstat
for allocation of transparent hugepages and fallback to smaller
pages.

Optional patch, but useful for development and understanding the system.

Contains improvements from Andrea Arcangeli and Johannes Weiner

Acked-by: Andrea Arcangeli <[email protected]>
Signed-off-by: Andi Kleen <[email protected]>
Reviewed-by: KAMEZAWA Hiroyuki <[email protected]>
---
include/linux/vmstat.h | 7 +++++++
mm/huge_memory.c | 25 +++++++++++++++++++++----
mm/vmstat.c | 8 ++++++++
3 files changed, 36 insertions(+), 4 deletions(-)

diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
index 9b5c63d..074e8fd 100644
--- a/include/linux/vmstat.h
+++ b/include/linux/vmstat.h
@@ -58,6 +58,13 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
UNEVICTABLE_PGCLEARED, /* on COW, page truncate */
UNEVICTABLE_PGSTRANDED, /* unable to isolate on unlock */
UNEVICTABLE_MLOCKFREED,
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+ THP_FAULT_ALLOC,
+ THP_FAULT_FALLBACK,
+ THP_COLLAPSE_ALLOC,
+ THP_COLLAPSE_ALLOC_FAILED,
+ THP_SPLIT,
+#endif
NR_VM_EVENT_ITEMS
};

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 8c6c4a7..f13a9a3 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -680,8 +680,11 @@ int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
return VM_FAULT_OOM;
page = alloc_hugepage_vma(transparent_hugepage_defrag(vma),
vma, haddr, numa_node_id(), 0);
- if (unlikely(!page))
+ if (unlikely(!page)) {
+ count_vm_event(THP_FAULT_FALLBACK);
goto out;
+ }
+ count_vm_event(THP_FAULT_ALLOC);
if (unlikely(mem_cgroup_newpage_charge(page, mm, GFP_KERNEL))) {
put_page(page);
goto out;
@@ -909,11 +912,13 @@ int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
new_page = NULL;

if (unlikely(!new_page)) {
+ count_vm_event(THP_FAULT_FALLBACK);
ret = do_huge_pmd_wp_page_fallback(mm, vma, address,
pmd, orig_pmd, page, haddr);
put_page(page);
goto out;
}
+ count_vm_event(THP_FAULT_ALLOC);

if (unlikely(mem_cgroup_newpage_charge(new_page, mm, GFP_KERNEL))) {
put_page(new_page);
@@ -1390,6 +1395,7 @@ int split_huge_page(struct page *page)

BUG_ON(!PageSwapBacked(page));
__split_huge_page(page, anon_vma);
+ count_vm_event(THP_SPLIT);

BUG_ON(PageCompound(page));
out_unlock:
@@ -1780,9 +1786,11 @@ static void collapse_huge_page(struct mm_struct *mm,
node, __GFP_OTHER_NODE);
if (unlikely(!new_page)) {
up_read(&mm->mmap_sem);
+ count_vm_event(THP_COLLAPSE_ALLOC_FAILED);
*hpage = ERR_PTR(-ENOMEM);
return;
}
+ count_vm_event(THP_COLLAPSE_ALLOC);
#endif
if (unlikely(mem_cgroup_newpage_charge(new_page, mm, GFP_KERNEL))) {
up_read(&mm->mmap_sem);
@@ -2147,8 +2155,11 @@ static void khugepaged_do_scan(struct page **hpage)
#ifndef CONFIG_NUMA
if (!*hpage) {
*hpage = alloc_hugepage(khugepaged_defrag());
- if (unlikely(!*hpage))
+ if (unlikely(!*hpage)) {
+ count_vm_event(THP_COLLAPSE_ALLOC_FAILED);
break;
+ }
+ count_vm_event(THP_COLLAPSE_ALLOC);
}
#else
if (IS_ERR(*hpage))
@@ -2188,8 +2199,11 @@ static struct page *khugepaged_alloc_hugepage(void)

do {
hpage = alloc_hugepage(khugepaged_defrag());
- if (!hpage)
+ if (!hpage) {
+ count_vm_event(THP_COLLAPSE_ALLOC_FAILED);
khugepaged_alloc_sleep();
+ } else
+ count_vm_event(THP_COLLAPSE_ALLOC);
} while (unlikely(!hpage) &&
likely(khugepaged_enabled()));
return hpage;
@@ -2206,8 +2220,11 @@ static void khugepaged_loop(void)
while (likely(khugepaged_enabled())) {
#ifndef CONFIG_NUMA
hpage = khugepaged_alloc_hugepage();
- if (unlikely(!hpage))
+ if (unlikely(!hpage)) {
+ count_vm_event(THP_COLLAPSE_ALLOC_FAILED);
break;
+ }
+ count_vm_event(THP_COLLAPSE_ALLOC);
#else
if (IS_ERR(hpage)) {
khugepaged_alloc_sleep();
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 2b461ed..7b5a6f2 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -946,6 +946,14 @@ static const char * const vmstat_text[] = {
"unevictable_pgs_stranded",
"unevictable_pgs_mlockfreed",
#endif
+
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+ "thp_fault_alloc",
+ "thp_fault_fallback",
+ "thp_collapse_alloc",
+ "thp_collapse_alloc_failed",
+ "thp_split",
+#endif
};

static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
--
1.7.4
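
The counters added above show up as "name value" lines in /proc/vmstat. As a minimal userspace sketch of how they might be consumed (not part of the patch), the helpers below look up one counter in text of that format and derive a fault fallback percentage. The function names vmstat_lookup and thp_fault_fallback_percent are hypothetical; only the thp_* counter names come from the patch.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Look up one counter in text formatted like /proc/vmstat
 * ("name value" per line). Returns 0 on success, -1 if the
 * counter is missing or its value cannot be parsed.
 */
int vmstat_lookup(const char *text, const char *name, unsigned long *value)
{
	size_t name_len = strlen(name);
	const char *line = text;

	assert(name_len > 0);
	while (line && *line) {
		/* Require a space after the name so that "thp_collapse_alloc"
		 * does not match "thp_collapse_alloc_failed". */
		if (strncmp(line, name, name_len) == 0 && line[name_len] == ' ') {
			if (sscanf(line + name_len, "%lu", value) == 1)
				return 0;
			return -1;
		}
		line = strchr(line, '\n');
		if (line)
			line++;
	}
	return -1;
}

/*
 * Percentage of huge page faults that fell back to small pages,
 * or -1 if the counters are missing or no fault was recorded.
 */
int thp_fault_fallback_percent(const char *text)
{
	unsigned long alloc, fallback;

	if (vmstat_lookup(text, "thp_fault_alloc", &alloc) ||
	    vmstat_lookup(text, "thp_fault_fallback", &fallback))
		return -1;
	if (alloc + fallback == 0)
		return -1;
	return (int)(fallback * 100 / (alloc + fallback));
}
```

To try this against a live system, read /proc/vmstat into a NUL-terminated buffer and pass it as text. The thp_* lines only appear on kernels that carry this patch and are built with CONFIG_TRANSPARENT_HUGEPAGE.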

2011-03-03 20:00:48

by Andi Kleen

Subject: [PATCH 6/8] Add __GFP_OTHER_NODE flag

From: Andi Kleen <[email protected]>

Add a new __GFP_OTHER_NODE flag to tell the low level numa statistics
in zone_statistics() that an allocation is on behalf of another thread.
This way the local and remote counters can be still correct, even
when background daemons like khugepaged are changing memory
mappings.

This only affects the accounting, but I think it's worth doing that
right to avoid confusing users.

I first tried to just pass down the right node, but this required
a lot of changes to pass down this parameter and at least one
addition of a 10th argument to a 9 argument function. Using
the flag is a lot less intrusive.

Open: should be also used for migration?

Cc: [email protected]
Signed-off-by: Andi Kleen <[email protected]>
Reviewed-by: KAMEZAWA Hiroyuki <[email protected]>
---
include/linux/gfp.h | 2 ++
include/linux/vmstat.h | 4 ++--
mm/page_alloc.c | 2 +-
mm/vmstat.c | 9 +++++++--
4 files changed, 12 insertions(+), 5 deletions(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 814d50e..a064724 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -35,6 +35,7 @@ struct vm_area_struct;
#define ___GFP_NOTRACK 0
#endif
#define ___GFP_NO_KSWAPD 0x400000u
+#define ___GFP_OTHER_NODE 0x800000u

/*
* GFP bitmasks..
@@ -83,6 +84,7 @@ struct vm_area_struct;
#define __GFP_NOTRACK ((__force gfp_t)___GFP_NOTRACK) /* Don't track with kmemcheck */

#define __GFP_NO_KSWAPD ((__force gfp_t)___GFP_NO_KSWAPD)
+#define __GFP_OTHER_NODE ((__force gfp_t)___GFP_OTHER_NODE) /* On behalf of other node */

/*
* This may seem redundant, but it's a way of annotating false positives vs.
diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
index 833e676..9b5c63d 100644
--- a/include/linux/vmstat.h
+++ b/include/linux/vmstat.h
@@ -220,12 +220,12 @@ static inline unsigned long node_page_state(int node,
zone_page_state(&zones[ZONE_MOVABLE], item);
}

-extern void zone_statistics(struct zone *, struct zone *);
+extern void zone_statistics(struct zone *, struct zone *, gfp_t gfp);

#else

#define node_page_state(node, item) global_page_state(item)
-#define zone_statistics(_zl,_z) do { } while (0)
+#define zone_statistics(_zl,_z, gfp) do { } while (0)

#endif /* CONFIG_NUMA */

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index a873e61..4ce06a6 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1333,7 +1333,7 @@ again:
}

__count_zone_vm_events(PGALLOC, zone, 1 << order);
- zone_statistics(preferred_zone, zone);
+ zone_statistics(preferred_zone, zone, gfp_flags);
local_irq_restore(flags);

VM_BUG_ON(bad_range(zone, page));
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 0c3b504..2b461ed 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -500,8 +500,12 @@ void refresh_cpu_vm_stats(int cpu)
* z = the zone from which the allocation occurred.
*
* Must be called with interrupts disabled.
+ *
+ * When __GFP_OTHER_NODE is set assume the node of the preferred
+ * zone is the local node. This is useful for daemons who allocate
+ * memory on behalf of other processes.
*/
-void zone_statistics(struct zone *preferred_zone, struct zone *z)
+void zone_statistics(struct zone *preferred_zone, struct zone *z, gfp_t flags)
{
if (z->zone_pgdat == preferred_zone->zone_pgdat) {
__inc_zone_state(z, NUMA_HIT);
@@ -509,7 +513,8 @@ void zone_statistics(struct zone *preferred_zone, struct zone *z)
__inc_zone_state(z, NUMA_MISS);
__inc_zone_state(preferred_zone, NUMA_FOREIGN);
}
- if (z->node == numa_node_id())
+ if (z->node == ((flags & __GFP_OTHER_NODE) ?
+ preferred_zone->node : numa_node_id()))
__inc_zone_state(z, NUMA_LOCAL);
else
__inc_zone_state(z, NUMA_OTHER);
--
1.7.4
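
The accounting rule in the zone_statistics() hunk above can be modelled in userspace. This is a minimal sketch under stated assumptions, not the kernel implementation: nodes are plain integers, struct model_numa_stats and model_zone_statistics are hypothetical names, and the other_node argument stands in for __GFP_OTHER_NODE being set in the gfp flags.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_MAX_NODES 4

/* Hypothetical per-node copy of the NUMA event counters. */
struct model_numa_stats {
	unsigned long numa_hit, numa_miss, numa_foreign;
	unsigned long numa_local, numa_other;
};

/*
 * preferred:  node the allocation was aimed at
 * actual:     node the page really came from
 * cpu_node:   node of the CPU running the allocating thread
 * other_node: models __GFP_OTHER_NODE; when set, the preferred node
 *             is treated as "local" instead of the CPU's node
 */
void model_zone_statistics(struct model_numa_stats stats[MODEL_MAX_NODES],
			   int preferred, int actual, int cpu_node,
			   bool other_node)
{
	int local;

	assert(preferred >= 0 && preferred < MODEL_MAX_NODES);
	assert(actual >= 0 && actual < MODEL_MAX_NODES);

	if (actual == preferred) {
		stats[actual].numa_hit++;
	} else {
		stats[actual].numa_miss++;
		stats[preferred].numa_foreign++;
	}

	local = other_node ? preferred : cpu_node;
	if (actual == local)
		stats[actual].numa_local++;
	else
		stats[actual].numa_other++;
}
```

For example, a daemon running on node 0 that allocates on node 1 on behalf of a process there is counted as numa_other on node 1 without the flag, and as numa_local with it. The hit/miss/foreign counters are unaffected either way.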

2011-03-03 20:01:17

by Andi Kleen

Subject: [PATCH 2/8] Change alloc_pages_vma to pass down the policy node for local policy

From: Andi Kleen <[email protected]>

Currently alloc_pages_vma always uses the local node as policy node
for the LOCAL policy. Pass this node down as an argument instead.

No behaviour change from this patch, but it will be needed for follow-on patches.

Acked-by: Andrea Arcangeli <[email protected]>
Signed-off-by: Andi Kleen <[email protected]>
Reviewed-by: KAMEZAWA Hiroyuki <[email protected]>
---
include/linux/gfp.h | 9 +++++----
mm/huge_memory.c | 2 +-
mm/mempolicy.c | 11 +++++------
3 files changed, 11 insertions(+), 11 deletions(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 0b84c61..782e74a 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -332,16 +332,17 @@ alloc_pages(gfp_t gfp_mask, unsigned int order)
return alloc_pages_current(gfp_mask, order);
}
extern struct page *alloc_pages_vma(gfp_t gfp_mask, int order,
- struct vm_area_struct *vma, unsigned long addr);
+ struct vm_area_struct *vma, unsigned long addr,
+ int node);
#else
#define alloc_pages(gfp_mask, order) \
alloc_pages_node(numa_node_id(), gfp_mask, order)
-#define alloc_pages_vma(gfp_mask, order, vma, addr) \
+#define alloc_pages_vma(gfp_mask, order, vma, addr, node) \
alloc_pages(gfp_mask, order)
#endif
#define alloc_page(gfp_mask) alloc_pages(gfp_mask, 0)
-#define alloc_page_vma(gfp_mask, vma, addr) \
- alloc_pages_vma(gfp_mask, 0, vma, addr)
+#define alloc_page_vma(gfp_mask, vma, addr) \
+ alloc_pages_vma(gfp_mask, 0, vma, addr, numa_node_id())

extern unsigned long __get_free_pages(gfp_t gfp_mask, unsigned int order);
extern unsigned long get_zeroed_page(gfp_t gfp_mask);
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 3e29781..c7c2cd9 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -653,7 +653,7 @@ static inline struct page *alloc_hugepage_vma(int defrag,
unsigned long haddr)
{
return alloc_pages_vma(alloc_hugepage_gfpmask(defrag),
- HPAGE_PMD_ORDER, vma, haddr);
+ HPAGE_PMD_ORDER, vma, haddr, numa_node_id());
}

#ifndef CONFIG_NUMA
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 49355a9..25a5a91 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1524,10 +1524,9 @@ static nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *policy)
}

/* Return a zonelist indicated by gfp for node representing a mempolicy */
-static struct zonelist *policy_zonelist(gfp_t gfp, struct mempolicy *policy)
+static struct zonelist *policy_zonelist(gfp_t gfp, struct mempolicy *policy,
+ int nd)
{
- int nd = numa_node_id();
-
switch (policy->mode) {
case MPOL_PREFERRED:
if (!(policy->flags & MPOL_F_LOCAL))
@@ -1679,7 +1678,7 @@ struct zonelist *huge_zonelist(struct vm_area_struct *vma, unsigned long addr,
zl = node_zonelist(interleave_nid(*mpol, vma, addr,
huge_page_shift(hstate_vma(vma))), gfp_flags);
} else {
- zl = policy_zonelist(gfp_flags, *mpol);
+ zl = policy_zonelist(gfp_flags, *mpol, numa_node_id());
if ((*mpol)->mode == MPOL_BIND)
*nodemask = &(*mpol)->v.nodes;
}
@@ -1820,7 +1819,7 @@ static struct page *alloc_page_interleave(gfp_t gfp, unsigned order,
*/
struct page *
alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,
- unsigned long addr)
+ unsigned long addr, int node)
{
struct mempolicy *pol = get_vma_policy(current, vma, addr);
struct zonelist *zl;
@@ -1836,7 +1835,7 @@ alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,
put_mems_allowed();
return page;
}
- zl = policy_zonelist(gfp, pol);
+ zl = policy_zonelist(gfp, pol, node);
if (unlikely(mpol_needs_cond_ref(pol))) {
/*
* slow path: ref counted shared policy
--
1.7.4
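
To illustrate what passing the node down buys, here is a small userspace model, offered as a sketch only. It assumes a policy reduced to a "local" flag (standing in for MPOL_F_LOCAL) and a preferred node; model_numa_node_id stands in for numa_node_id(), and all model_* names are hypothetical. The two wrappers mirror how alloc_page_vma passes the current node while a caller that knows better can pass its own.

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-in for numa_node_id(): node of the CPU we are running on. */
int model_numa_node_id = 0;

/* Hypothetical, heavily simplified memory policy. */
struct model_policy {
	bool local;		/* models MPOL_F_LOCAL */
	int preferred_node;	/* only used when local is false */
};

/* Models the node choice in policy_zonelist() with the new argument. */
int model_policy_node(const struct model_policy *pol, int nd)
{
	assert(nd >= 0);
	return pol->local ? nd : pol->preferred_node;
}

/* Models alloc_page_vma(): "local" means the current CPU's node. */
int model_alloc_page_vma(const struct model_policy *pol)
{
	return model_policy_node(pol, model_numa_node_id);
}

/* Models a caller that supplies the node to treat as local. */
int model_alloc_page_vma_with_node(const struct model_policy *pol, int node)
{
	return model_policy_node(pol, node);
}
```

With a local policy the two wrappers can disagree, which is the case the later patches in the series rely on. With an explicit preferred node the extra argument makes no difference.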

2011-03-03 20:00:38

by Andi Kleen

Subject: [PATCH 5/8] Use correct numa policy node for transparent hugepages

From: Andi Kleen <[email protected]>

Pass down the correct node for a transparent hugepage allocation.
Most callers continue to use the current node; however, the khugepaged
daemon now uses the node of the first page to be collapsed
instead. This ensures that khugepaged does not mess up local memory
for an existing process which uses local policy.

The choice of node is somewhat primitive currently: it just
uses the node of the first page in the pmd range. An alternative
would be to look at multiple pages and use the most popular
node. I used the simplest variant for now which should work
well enough for the case of all pages being on the same node.

Acked-by: Andrea Arcangeli <[email protected]>
Signed-off-by: Andi Kleen <[email protected]>
Reviewed-by: KAMEZAWA Hiroyuki <[email protected]>
---
mm/huge_memory.c | 24 +++++++++++++++++-------
mm/mempolicy.c | 3 ++-
2 files changed, 19 insertions(+), 8 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 1802db8..8a7f94c 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -650,10 +650,10 @@ static inline gfp_t alloc_hugepage_gfpmask(int defrag)

static inline struct page *alloc_hugepage_vma(int defrag,
struct vm_area_struct *vma,
- unsigned long haddr)
+ unsigned long haddr, int nd)
{
return alloc_pages_vma(alloc_hugepage_gfpmask(defrag),
- HPAGE_PMD_ORDER, vma, haddr, numa_node_id());
+ HPAGE_PMD_ORDER, vma, haddr, nd);
}

#ifndef CONFIG_NUMA
@@ -678,7 +678,7 @@ int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
if (unlikely(khugepaged_enter(vma)))
return VM_FAULT_OOM;
page = alloc_hugepage_vma(transparent_hugepage_defrag(vma),
- vma, haddr);
+ vma, haddr, numa_node_id());
if (unlikely(!page))
goto out;
if (unlikely(mem_cgroup_newpage_charge(page, mm, GFP_KERNEL))) {
@@ -902,7 +902,7 @@ int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
if (transparent_hugepage_enabled(vma) &&
!transparent_hugepage_debug_cow())
new_page = alloc_hugepage_vma(transparent_hugepage_defrag(vma),
- vma, haddr);
+ vma, haddr, numa_node_id());
else
new_page = NULL;

@@ -1745,7 +1745,8 @@ static void __collapse_huge_page_copy(pte_t *pte, struct page *page,
static void collapse_huge_page(struct mm_struct *mm,
unsigned long address,
struct page **hpage,
- struct vm_area_struct *vma)
+ struct vm_area_struct *vma,
+ int node)
{
pgd_t *pgd;
pud_t *pud;
@@ -1773,7 +1774,8 @@ static void collapse_huge_page(struct mm_struct *mm,
* mmap_sem in read mode is good idea also to allow greater
* scalability.
*/
- new_page = alloc_hugepage_vma(khugepaged_defrag(), vma, address);
+ new_page = alloc_hugepage_vma(khugepaged_defrag(), vma, address,
+ node);
if (unlikely(!new_page)) {
up_read(&mm->mmap_sem);
*hpage = ERR_PTR(-ENOMEM);
@@ -1919,6 +1921,7 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
struct page *page;
unsigned long _address;
spinlock_t *ptl;
+ int node = -1;

VM_BUG_ON(address & ~HPAGE_PMD_MASK);

@@ -1949,6 +1952,13 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
page = vm_normal_page(vma, _address, pteval);
if (unlikely(!page))
goto out_unmap;
+ /*
+ * Chose the node of the first page. This could
+ * be more sophisticated and look at more pages,
+ * but isn't for now.
+ */
+ if (node == -1)
+ node = page_to_nid(page);
VM_BUG_ON(PageCompound(page));
if (!PageLRU(page) || PageLocked(page) || !PageAnon(page))
goto out_unmap;
@@ -1965,7 +1975,7 @@ out_unmap:
pte_unmap_unlock(pte, ptl);
if (ret)
/* collapse_huge_page will return with the mmap_sem released */
- collapse_huge_page(mm, address, hpage, vma);
+ collapse_huge_page(mm, address, hpage, vma, node);
out:
return ret;
}
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 25a5a91..151c20c 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1891,7 +1891,8 @@ struct page *alloc_pages_current(gfp_t gfp, unsigned order)
page = alloc_page_interleave(gfp, order, interleave_nodes(pol));
else
page = __alloc_pages_nodemask(gfp, order,
- policy_zonelist(gfp, pol), policy_nodemask(gfp, pol));
+ policy_zonelist(gfp, pol, numa_node_id()),
+ policy_nodemask(gfp, pol));
put_mems_allowed();
return page;
}
--
1.7.4
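
The commit message mentions an alternative to "node of the first page", namely picking the most popular node in the pmd range. The sketch below contrasts the two in userspace. It is not what the patch implements for the second function, and it assumes the nodes of the scanned pages have already been gathered into an array; the model_* names and the tie-break (lowest node id wins) are hypothetical choices.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_MAX_NODES 8

/* Policy used by the patch: node of the first page, -1 if none. */
int model_first_page_node(const int *page_nids, size_t nr_pages)
{
	return nr_pages ? page_nids[0] : -1;
}

/*
 * Alternative mentioned in the commit message: the node holding
 * the most pages. Ties go to the lowest node id (an assumption).
 * Returns -1 if there are no pages.
 */
int model_most_popular_node(const int *page_nids, size_t nr_pages)
{
	size_t count[MODEL_MAX_NODES] = {0};
	int best = -1;

	for (size_t i = 0; i < nr_pages; i++) {
		int nid = page_nids[i];

		assert(nid >= 0 && nid < MODEL_MAX_NODES);
		count[nid]++;
	}
	for (int nid = 0; nid < MODEL_MAX_NODES; nid++) {
		if (count[nid] > 0 && (best < 0 || count[nid] > count[best]))
			best = nid;
	}
	return best;
}
```

When all pages sit on one node both functions agree, which is the case the patch targets. They differ only for mixed ranges, where the first page may be an outlier.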

2011-03-03 20:00:38

by Andi Kleen

Subject: [PATCH 7/8] Use GFP_OTHER_NODE for transparent huge pages

From: Andi Kleen <[email protected]>

Pass __GFP_OTHER_NODE for transparent hugepage NUMA allocations
done by the hugepages daemon. This way the low level accounting
for local versus remote pages works correctly.

Contains improvements from Andrea Arcangeli

Cc: [email protected]
Signed-off-by: Andi Kleen <[email protected]>
Reviewed-by: KAMEZAWA Hiroyuki <[email protected]>
---
mm/huge_memory.c | 20 +++++++++++---------
1 files changed, 11 insertions(+), 9 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 8a7f94c..8c6c4a7 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -643,23 +643,24 @@ static int __do_huge_pmd_anonymous_page(struct mm_struct *mm,
return ret;
}

-static inline gfp_t alloc_hugepage_gfpmask(int defrag)
+static inline gfp_t alloc_hugepage_gfpmask(int defrag, gfp_t extra_gfp)
{
- return GFP_TRANSHUGE & ~(defrag ? 0 : __GFP_WAIT);
+ return (GFP_TRANSHUGE & ~(defrag ? 0 : __GFP_WAIT)) | extra_gfp;
}

static inline struct page *alloc_hugepage_vma(int defrag,
struct vm_area_struct *vma,
- unsigned long haddr, int nd)
+ unsigned long haddr, int nd,
+ gfp_t extra_gfp)
{
- return alloc_pages_vma(alloc_hugepage_gfpmask(defrag),
+ return alloc_pages_vma(alloc_hugepage_gfpmask(defrag, extra_gfp),
HPAGE_PMD_ORDER, vma, haddr, nd);
}

#ifndef CONFIG_NUMA
static inline struct page *alloc_hugepage(int defrag)
{
- return alloc_pages(alloc_hugepage_gfpmask(defrag),
+ return alloc_pages(alloc_hugepage_gfpmask(defrag, 0),
HPAGE_PMD_ORDER);
}
#endif
@@ -678,7 +679,7 @@ int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
if (unlikely(khugepaged_enter(vma)))
return VM_FAULT_OOM;
page = alloc_hugepage_vma(transparent_hugepage_defrag(vma),
- vma, haddr, numa_node_id());
+ vma, haddr, numa_node_id(), 0);
if (unlikely(!page))
goto out;
if (unlikely(mem_cgroup_newpage_charge(page, mm, GFP_KERNEL))) {
@@ -799,7 +800,8 @@ static int do_huge_pmd_wp_page_fallback(struct mm_struct *mm,
}

for (i = 0; i < HPAGE_PMD_NR; i++) {
- pages[i] = alloc_page_vma_node(GFP_HIGHUSER_MOVABLE,
+ pages[i] = alloc_page_vma_node(GFP_HIGHUSER_MOVABLE |
+ __GFP_OTHER_NODE,
vma, address, page_to_nid(page));
if (unlikely(!pages[i] ||
mem_cgroup_newpage_charge(pages[i], mm,
@@ -902,7 +904,7 @@ int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
if (transparent_hugepage_enabled(vma) &&
!transparent_hugepage_debug_cow())
new_page = alloc_hugepage_vma(transparent_hugepage_defrag(vma),
- vma, haddr, numa_node_id());
+ vma, haddr, numa_node_id(), 0);
else
new_page = NULL;

@@ -1775,7 +1777,7 @@ static void collapse_huge_page(struct mm_struct *mm,
* scalability.
*/
new_page = alloc_hugepage_vma(khugepaged_defrag(), vma, address,
- node);
+ node, __GFP_OTHER_NODE);
if (unlikely(!new_page)) {
up_read(&mm->mmap_sem);
*hpage = ERR_PTR(-ENOMEM);
--
1.7.4

2011-03-03 20:00:34

by Andi Kleen

Subject: [PATCH 1/8] Fix interleaving for transparent hugepages v2

From: Andi Kleen <[email protected]>

Bugfix, independent from the rest of the series.

The THP code didn't pass the correct interleaving shift to the memory
policy code. Fix this here by adjusting for the order.

v2: Use + (thanks Christoph)
Acked-by: Andrea Arcangeli <[email protected]>
Reviewed-by: Christoph Lameter <[email protected]>
Signed-off-by: Andi Kleen <[email protected]>
---
mm/mempolicy.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 368fc9d..49355a9 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1830,7 +1830,7 @@ alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,
if (unlikely(pol->mode == MPOL_INTERLEAVE)) {
unsigned nid;

- nid = interleave_nid(pol, vma, addr, PAGE_SHIFT);
+ nid = interleave_nid(pol, vma, addr, PAGE_SHIFT + order);
mpol_cond_put(pol);
page = alloc_page_interleave(gfp, order, nid);
put_mems_allowed();
--
1.7.4
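
A worked example of why the shift matters, as a minimal userspace sketch rather than kernel code. It assumes 4 KiB base pages, 2 MiB huge pages (order 9), a mapping that starts at offset 0, and interleave nodes numbered 0..nr_nodes-1; the model_* names are hypothetical. The node index is modelled as (offset >> shift) % nr_nodes. With shift = PAGE_SHIFT, consecutive huge pages have page offsets that are multiples of 512, so for a node count that divides 512 they all land on the same node. With PAGE_SHIFT + order they rotate through the nodes.

```c
#include <assert.h>

#define MODEL_PAGE_SHIFT  12u	/* 4 KiB base pages (assumed) */
#define MODEL_HPAGE_ORDER  9u	/* 2 MiB huge page = 512 base pages (assumed) */

/* Interleave node index for a byte offset into the mapping. */
unsigned model_interleave_node(unsigned long offset, unsigned shift,
			       unsigned nr_nodes)
{
	assert(nr_nodes > 0);
	return (unsigned)((offset >> shift) % nr_nodes);
}

/* Number of distinct nodes touched by nr_hpages consecutive huge pages. */
unsigned model_nodes_used(unsigned shift, unsigned nr_nodes,
			  unsigned nr_hpages)
{
	unsigned long hpage_size = 1ul << (MODEL_PAGE_SHIFT + MODEL_HPAGE_ORDER);
	unsigned seen = 0;	/* bitmask of nodes already counted */
	unsigned used = 0;

	assert(nr_nodes > 0 && nr_nodes <= 32);
	for (unsigned i = 0; i < nr_hpages; i++) {
		unsigned nid = model_interleave_node(i * hpage_size, shift,
						     nr_nodes);

		if (!(seen & (1u << nid))) {
			seen |= 1u << nid;
			used++;
		}
	}
	return used;
}
```

With 4 nodes and 8 huge pages, the old shift touches a single node while the corrected shift touches all 4.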

2011-03-03 20:02:21

by Andi Kleen

Subject: [PATCH 3/8] Add alloc_page_vma_node

From: Andi Kleen <[email protected]>

Add an alloc_page_vma_node macro that allows passing the "local" node in.
Used in a follow-on patch.

Acked-by: Andrea Arcangeli <[email protected]>
Signed-off-by: Andi Kleen <[email protected]>
Reviewed-by: KAMEZAWA Hiroyuki <[email protected]>
---
include/linux/gfp.h | 2 ++
1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 782e74a..814d50e 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -343,6 +343,8 @@ extern struct page *alloc_pages_vma(gfp_t gfp_mask, int order,
#define alloc_page(gfp_mask) alloc_pages(gfp_mask, 0)
#define alloc_page_vma(gfp_mask, vma, addr) \
alloc_pages_vma(gfp_mask, 0, vma, addr, numa_node_id())
+#define alloc_page_vma_node(gfp_mask, vma, addr, node) \
+ alloc_pages_vma(gfp_mask, 0, vma, addr, node)

extern unsigned long __get_free_pages(gfp_t gfp_mask, unsigned int order);
extern unsigned long get_zeroed_page(gfp_t gfp_mask);
--
1.7.4

2011-03-07 08:28:52

by KOSAKI Motohiro

Subject: Re: [PATCH 8/8] Add VM counters for transparent hugepages

> From: Andi Kleen <[email protected]>
>
> I found it difficult to make sense of transparent huge pages without
> having any counters for its actions. Add some counters to vmstat
> for allocation of transparent hugepages and fallback to smaller
> pages.
>
> Optional patch, but useful for development and understanding the system.
>
> Contains improvements from Andrea Arcangeli and Johannes Weiner
>
> Acked-by: Andrea Arcangeli <[email protected]>
> Signed-off-by: Andi Kleen <[email protected]>
> Reviewed-by: KAMEZAWA Hiroyuki <[email protected]>
> ---
> include/linux/vmstat.h | 7 +++++++
> mm/huge_memory.c | 25 +++++++++++++++++++++----
> mm/vmstat.c | 8 ++++++++
> 3 files changed, 36 insertions(+), 4 deletions(-)
>
> diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
> index 9b5c63d..074e8fd 100644
> --- a/include/linux/vmstat.h
> +++ b/include/linux/vmstat.h
> @@ -58,6 +58,13 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
> UNEVICTABLE_PGCLEARED, /* on COW, page truncate */
> UNEVICTABLE_PGSTRANDED, /* unable to isolate on unlock */
> UNEVICTABLE_MLOCKFREED,
> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> + THP_FAULT_ALLOC,
> + THP_FAULT_FALLBACK,
> + THP_COLLAPSE_ALLOC,
> + THP_COLLAPSE_ALLOC_FAILED,
> + THP_SPLIT,
> +#endif
> NR_VM_EVENT_ITEMS
> };

Hmm...
Don't we need per-zone statistics? I'm afraid a small DMA zone
causes a lot of THP splitting and screws up these statistics.

Only a nit.

2011-03-07 08:34:18

by KOSAKI Motohiro

Subject: Re: [PATCH 6/8] Add __GFP_OTHER_NODE flag

> From: Andi Kleen <[email protected]>
>
> Add a new __GFP_OTHER_NODE flag to tell the low level numa statistics
> in zone_statistics() that an allocation is on behalf of another thread.
> This way the local and remote counters can be still correct, even
> when background daemons like khugepaged are changing memory
> mappings.
>
> This only affects the accounting, but I think it's worth doing that
> right to avoid confusing users.
>
> I first tried to just pass down the right node, but this required
> a lot of changes to pass down this parameter and at least one
> addition of a 10th argument to a 9 argument function. Using
> the flag is a lot less intrusive.

Yes, less intrusive. But are you using the current NUMA statistics on
practical systems?
I haven't used numastat at all in the last 5 years, so I'm curious about
your use case. IOW, I'm not yet convinced this is worth consuming a new
GFP_ flag bit.

For _now_, I can say I haven't found any bug in this patch.

>
> Open: should be also used for migration?
>
> Cc: [email protected]
> Signed-off-by: Andi Kleen <[email protected]>
> Reviewed-by: KAMEZAWA Hiroyuki <[email protected]>
> ---
> include/linux/gfp.h | 2 ++
> include/linux/vmstat.h | 4 ++--
> mm/page_alloc.c | 2 +-
> mm/vmstat.c | 9 +++++++--
> 4 files changed, 12 insertions(+), 5 deletions(-)
>
> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> index 814d50e..a064724 100644
> --- a/include/linux/gfp.h
> +++ b/include/linux/gfp.h
> @@ -35,6 +35,7 @@ struct vm_area_struct;
> #define ___GFP_NOTRACK 0
> #endif
> #define ___GFP_NO_KSWAPD 0x400000u
> +#define ___GFP_OTHER_NODE 0x800000u


2011-03-07 08:36:10

by KOSAKI Motohiro

Subject: Re: [PATCH 2/8] Change alloc_pages_vma to pass down the policy node for local policy

> From: Andi Kleen <[email protected]>
>
> Currently alloc_pages_vma always uses the local node as policy node
> for the LOCAL policy. Pass this node down as an argument instead.
>
> No behaviour change from this patch, but it will be needed for follow-on patches.
>
> Acked-by: Andrea Arcangeli <[email protected]>
> Signed-off-by: Andi Kleen <[email protected]>
> Reviewed-by: KAMEZAWA Hiroyuki <[email protected]>

Reviewed-by: KOSAKI Motohiro <[email protected]>


2011-03-07 08:37:12

by KOSAKI Motohiro

Subject: Re: [PATCH 3/8] Add alloc_page_vma_node

> From: Andi Kleen <[email protected]>
>
> Add an alloc_page_vma_node macro that allows passing the "local" node in.
> Used in a follow-on patch.
>
> Acked-by: Andrea Arcangeli <[email protected]>
> Signed-off-by: Andi Kleen <[email protected]>
> Reviewed-by: KAMEZAWA Hiroyuki <[email protected]>
> ---
> include/linux/gfp.h | 2 ++
> 1 files changed, 2 insertions(+), 0 deletions(-)
>
> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> index 782e74a..814d50e 100644
> --- a/include/linux/gfp.h
> +++ b/include/linux/gfp.h
> @@ -343,6 +343,8 @@ extern struct page *alloc_pages_vma(gfp_t gfp_mask, int order,
> #define alloc_page(gfp_mask) alloc_pages(gfp_mask, 0)
> #define alloc_page_vma(gfp_mask, vma, addr) \
> alloc_pages_vma(gfp_mask, 0, vma, addr, numa_node_id())
> +#define alloc_page_vma_node(gfp_mask, vma, addr, node) \
> + alloc_pages_vma(gfp_mask, 0, vma, addr, node)
>
> extern unsigned long __get_free_pages(gfp_t gfp_mask, unsigned int order);
> extern unsigned long get_zeroed_page(gfp_t gfp_mask);

Reviewed-by: KOSAKI Motohiro <[email protected]>

2011-03-07 08:37:48

by KOSAKI Motohiro

Subject: Re: [PATCH 4/8] Preserve original node for transparent huge page copies

> From: Andi Kleen <[email protected]>
>
> This makes a difference for LOCAL policy, where the node cannot
> be determined from the policy itself, but has to be gotten
> from the original page.
>
> Acked-by: Andrea Arcangeli <[email protected]>
> Signed-off-by: Andi Kleen <[email protected]>
> Reviewed-by: KAMEZAWA Hiroyuki <[email protected]>

Reviewed-by: KOSAKI Motohiro <[email protected]>



2011-03-07 08:39:03

by KOSAKI Motohiro

Subject: Re: [PATCH 5/8] Use correct numa policy node for transparent hugepages

> From: Andi Kleen <[email protected]>
>
> Pass down the correct node for a transparent hugepage allocation.
> Most callers continue to use the current node; however, the khugepaged
> daemon now uses the node of the first page to be collapsed
> instead. This ensures that khugepaged does not mess up local memory
> for an existing process which uses local policy.
>
> The choice of node is somewhat primitive currently: it just
> uses the node of the first page in the pmd range. An alternative
> would be to look at multiple pages and use the most popular
> node. I used the simplest variant for now which should work
> well enough for the case of all pages being on the same node.
>
> Acked-by: Andrea Arcangeli <[email protected]>
> Signed-off-by: Andi Kleen <[email protected]>
> Reviewed-by: KAMEZAWA Hiroyuki <[email protected]>

Reviewed-by: KOSAKI Motohiro <[email protected]>


2011-03-07 16:33:39

by Andi Kleen

Subject: Re: [PATCH 6/8] Add __GFP_OTHER_NODE flag

> Yes, less intrusive. But are you using the current NUMA statistics on
> practical systems?

Yes I do. I know users use it too.

We unfortunately still have enough NUMA locality problems in the kernel
so that overflowing nodes, causing fallbacks for process memory etc. are not uncommon.
If you get that then numastat is very useful to track down what happens.

In an ideal world with perfect NUMA balancing it wouldn't be needed,
but we're far from that.

Also the numactl test suite depends on them.


-Andi

2011-03-07 16:35:18

by Andi Kleen

Subject: Re: [PATCH 8/8] Add VM counters for transparent hugepages

> Don't we need per-zone statistics? I'm afraid a small DMA zone
> causes a lot of THP splitting and screws up these statistics.

Does it? I haven't seen that so far.

If it happens a lot it would be better to disable THP for the 16MB DMA
zone at least. Or did you mean the 4GB zone?

-Andi

2011-03-08 00:19:10

by KOSAKI Motohiro

Subject: Re: [PATCH 6/8] Add __GFP_OTHER_NODE flag

> > Yes, less intrusive. But are you using the current NUMA statistics on
> > practical systems?
>
> Yes I do. I know users use it too.
>
> We unfortunately still have enough NUMA locality problems in the kernel
> so that overflowing nodes, causing fallbacks for process memory etc. are not uncommon.
> If you get that then numastat is very useful to track down what happens.
>
> In an ideal world with perfect NUMA balancing it wouldn't be needed,
> but we're far from that.
>
> Also the numactl test suite depends on them.

If so, I have no objection, of course. :)

Thanks.


2011-03-08 02:43:41

by KOSAKI Motohiro

Subject: Re: [PATCH 8/8] Add VM counters for transparent hugepages

> > Don't we need per-zone statistics? I'm afraid a small DMA zone
> > causes a lot of THP splitting and screws up these statistics.
>
> Does it? I haven't seen that so far.
>
> If it happens a lot it would be better to disable THP for the 16MB DMA
> zone at least. Or did you mean the 4GB zone?

I assumed 4GB. And cpusets/mempolicy binding might cause a similar
issue. It can put only one zone under high pressure.

But, hmmm...
Do you mean you haven't hit any issue, then? I don't think you skipped
testing on NUMA machines. So if it causes no practical problem, I can agree
with this.

Thanks.


2011-03-30 21:56:11

by Andrew Morton

Subject: Re: [PATCH 8/8] Add VM counters for transparent hugepages

On Tue, 8 Mar 2011 11:43:23 +0900 (JST)
KOSAKI Motohiro <[email protected]> wrote:

> > > Don't we need to make these per-zone statistics? I'm afraid a small DMA zone
> > > causes a lot of THP splitting and screws up these statistics.
> >
> > Does it? I haven't seen that so far.
> >
> > If it happens a lot it would be better to disable THP for the 16MB DMA
> > zone at least. Or did you mean the 4GB zone?
>
> I assumed 4GB. And cpusets/mempolicy binding might cause a similar
> issue. It can put only one zone under high pressure.
>
> But, hmmm...
> Do you mean you haven't hit any issue, then? I don't think you skipped
> testing on NUMA machines. So if it causes no practical problem, I can agree
> with this.
>

I didn't actually merge this patch because I assumed you guys were
still arguing over it. But I now see you weren't.

Do we still want it? Are we sure we don't want the per-zone numbers?


From: Andi Kleen <[email protected]>

I found it difficult to make sense of transparent huge pages without
having any counters for its actions. Add some counters to vmstat for
allocation of transparent hugepages and fallback to smaller pages.

Optional patch, but useful for development and understanding the system.

Contains improvements from Andrea Arcangeli and Johannes Weiner

[[email protected]: coding-style fixes]
[[email protected]: fix vmstat_text[] entries]
Signed-off-by: Andi Kleen <[email protected]>
Acked-by: Andrea Arcangeli <[email protected]>
Reviewed-by: KAMEZAWA Hiroyuki <[email protected]>
Signed-off-by: Johannes Weiner <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
---

include/linux/vmstat.h | 7 +++++++
mm/huge_memory.c | 25 +++++++++++++++++++++----
mm/vmstat.c | 9 +++++++++
3 files changed, 37 insertions(+), 4 deletions(-)

diff -puN include/linux/vmstat.h~mm-add-vm-counters-for-transparent-hugepages include/linux/vmstat.h
--- a/include/linux/vmstat.h~mm-add-vm-counters-for-transparent-hugepages
+++ a/include/linux/vmstat.h
@@ -58,6 +58,13 @@ enum vm_event_item { PGPGIN, PGPGOUT, PS
UNEVICTABLE_PGCLEARED, /* on COW, page truncate */
UNEVICTABLE_PGSTRANDED, /* unable to isolate on unlock */
UNEVICTABLE_MLOCKFREED,
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+ THP_FAULT_ALLOC,
+ THP_FAULT_FALLBACK,
+ THP_COLLAPSE_ALLOC,
+ THP_COLLAPSE_ALLOC_FAILED,
+ THP_SPLIT,
+#endif
NR_VM_EVENT_ITEMS
};

diff -puN mm/huge_memory.c~mm-add-vm-counters-for-transparent-hugepages mm/huge_memory.c
--- a/mm/huge_memory.c~mm-add-vm-counters-for-transparent-hugepages
+++ a/mm/huge_memory.c
@@ -680,8 +680,11 @@ int do_huge_pmd_anonymous_page(struct mm
return VM_FAULT_OOM;
page = alloc_hugepage_vma(transparent_hugepage_defrag(vma),
vma, haddr, numa_node_id(), 0);
- if (unlikely(!page))
+ if (unlikely(!page)) {
+ count_vm_event(THP_FAULT_FALLBACK);
goto out;
+ }
+ count_vm_event(THP_FAULT_ALLOC);
if (unlikely(mem_cgroup_newpage_charge(page, mm, GFP_KERNEL))) {
put_page(page);
goto out;
@@ -909,11 +912,13 @@ int do_huge_pmd_wp_page(struct mm_struct
new_page = NULL;

if (unlikely(!new_page)) {
+ count_vm_event(THP_FAULT_FALLBACK);
ret = do_huge_pmd_wp_page_fallback(mm, vma, address,
pmd, orig_pmd, page, haddr);
put_page(page);
goto out;
}
+ count_vm_event(THP_FAULT_ALLOC);

if (unlikely(mem_cgroup_newpage_charge(new_page, mm, GFP_KERNEL))) {
put_page(new_page);
@@ -1390,6 +1395,7 @@ int split_huge_page(struct page *page)

BUG_ON(!PageSwapBacked(page));
__split_huge_page(page, anon_vma);
+ count_vm_event(THP_SPLIT);

BUG_ON(PageCompound(page));
out_unlock:
@@ -1784,9 +1790,11 @@ static void collapse_huge_page(struct mm
node, __GFP_OTHER_NODE);
if (unlikely(!new_page)) {
up_read(&mm->mmap_sem);
+ count_vm_event(THP_COLLAPSE_ALLOC_FAILED);
*hpage = ERR_PTR(-ENOMEM);
return;
}
+ count_vm_event(THP_COLLAPSE_ALLOC);
if (unlikely(mem_cgroup_newpage_charge(new_page, mm, GFP_KERNEL))) {
up_read(&mm->mmap_sem);
put_page(new_page);
@@ -2151,8 +2159,11 @@ static void khugepaged_do_scan(struct pa
#ifndef CONFIG_NUMA
if (!*hpage) {
*hpage = alloc_hugepage(khugepaged_defrag());
- if (unlikely(!*hpage))
+ if (unlikely(!*hpage)) {
+ count_vm_event(THP_COLLAPSE_ALLOC_FAILED);
break;
+ }
+ count_vm_event(THP_COLLAPSE_ALLOC);
}
#else
if (IS_ERR(*hpage))
@@ -2192,8 +2203,11 @@ static struct page *khugepaged_alloc_hug

do {
hpage = alloc_hugepage(khugepaged_defrag());
- if (!hpage)
+ if (!hpage) {
+ count_vm_event(THP_COLLAPSE_ALLOC_FAILED);
khugepaged_alloc_sleep();
+ } else
+ count_vm_event(THP_COLLAPSE_ALLOC);
} while (unlikely(!hpage) &&
likely(khugepaged_enabled()));
return hpage;
@@ -2210,8 +2224,11 @@ static void khugepaged_loop(void)
while (likely(khugepaged_enabled())) {
#ifndef CONFIG_NUMA
hpage = khugepaged_alloc_hugepage();
- if (unlikely(!hpage))
+ if (unlikely(!hpage)) {
+ count_vm_event(THP_COLLAPSE_ALLOC_FAILED);
break;
+ }
+ count_vm_event(THP_COLLAPSE_ALLOC);
#else
if (IS_ERR(hpage)) {
khugepaged_alloc_sleep();
diff -puN mm/vmstat.c~mm-add-vm-counters-for-transparent-hugepages mm/vmstat.c
--- a/mm/vmstat.c~mm-add-vm-counters-for-transparent-hugepages
+++ a/mm/vmstat.c
@@ -951,7 +951,16 @@ static const char * const vmstat_text[]
"unevictable_pgs_cleared",
"unevictable_pgs_stranded",
"unevictable_pgs_mlockfreed",
+
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+ "thp_fault_alloc",
+ "thp_fault_fallback",
+ "thp_collapse_alloc",
+ "thp_collapse_alloc_failed",
+ "thp_split",
#endif
+
+#endif /* CONFIG_VM_EVENTS_COUNTERS */
};

static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
_
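
For readers who want to consume the new counters, here is a minimal user-space sketch, not part of the patch. It assumes the counters appear in /proc/vmstat as "name value" lines under the names added to vmstat_text[] above; struct thp_stats and both function names are hypothetical.

```c
#include <stdio.h>
#include <string.h>

/* One field per counter name added to vmstat_text[] by the patch. */
struct thp_stats {
	unsigned long long fault_alloc;
	unsigned long long fault_fallback;
	unsigned long long collapse_alloc;
	unsigned long long collapse_alloc_failed;
	unsigned long long split;
};

/*
 * Fill *st from /proc/vmstat-style text ("name value" per line).
 * Returns the number of thp_* counters found (5 when all are present).
 * Parsing stops at the first line that is not "name number".
 */
static int thp_stats_parse(const char *text, struct thp_stats *st)
{
	const struct { const char *name; unsigned long long *dst; } map[] = {
		{ "thp_fault_alloc",           &st->fault_alloc },
		{ "thp_fault_fallback",        &st->fault_fallback },
		{ "thp_collapse_alloc",        &st->collapse_alloc },
		{ "thp_collapse_alloc_failed", &st->collapse_alloc_failed },
		{ "thp_split",                 &st->split },
	};
	char name[64];
	unsigned long long val;
	int found = 0, used;

	memset(st, 0, sizeof(*st));
	while (sscanf(text, "%63s %llu%n", name, &val, &used) == 2) {
		for (size_t i = 0; i < sizeof(map) / sizeof(map[0]); i++) {
			if (strcmp(name, map[i].name) == 0) {
				*map[i].dst = val;
				found++;
			}
		}
		text += used;	/* advance past the pair just consumed */
	}
	return found;
}

/* Share of fault-time huge page attempts that fell back to small pages. */
static int thp_fault_fallback_percent(const struct thp_stats *st)
{
	unsigned long long total = st->fault_alloc + st->fault_fallback;

	if (total == 0)
		return 0;
	return (int)(st->fault_fallback * 100 / total);
}
```

Sampling these values twice and comparing the deltas is usually more informative than the absolute numbers, since the counters accumulate from boot.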

2011-03-30 23:30:55

by Andi Kleen

Subject: Re: [PATCH 8/8] Add VM counters for transparent hugepages

> Do we still want it? Are we sure we don't want the per-zone numbers?

At least I still want it and Dave Hansen did too.

I don't need per-zone numbers personally, and I don't remember a strong
request from anyone. Or was there one?

-Andi

2011-03-31 00:52:29

by KOSAKI Motohiro

Subject: Re: [PATCH 8/8] Add VM counters for transparent hugepages

> > Do we still want it? Are we sure we don't want the per-zone numbers?
>
> At least I still want it and Dave Hansen did too.
>
> I don't need per-zone numbers personally, and I don't remember a strong
> request from anyone. Or was there one?

If I remember correctly, only I made a weak request for per-zone numbers.
To be honest, I never use this counter myself; my question was just curiosity.
So I'm OK with it if Andi didn't hit any issue.

Andi, if anyone makes a NUMA request or files a NUMA-related bug report
in the future, I might convert it to a per-zone one. Is that OK?


2011-03-31 00:58:13

by Andi Kleen

Subject: Re: [PATCH 8/8] Add VM counters for transparent hugepages

On Thu, Mar 31, 2011 at 09:52:24AM +0900, KOSAKI Motohiro wrote:
> > > Do we still want it? Are we sure we don't want the per-zone numbers?
> >
> > At least I still want it and Dave Hansen did too.
> >
> > I don't need per-zone numbers personally, and I don't remember a strong
> > request from anyone. Or was there one?
>
> If I remember correctly, only I made a weak request for per-zone numbers.
> To be honest, I never use this counter myself; my question was just curiosity.
> So I'm OK with it if Andi didn't hit any issue.

Thanks

> Andi, if anyone makes a NUMA request or files a NUMA-related bug report
> in the future, I might convert it to a per-zone one. Is that OK?

Sure. We can always change it later.

Andrew, this means you can merge it now I think.

-Andi

--
[email protected] -- Speaking for myself only