2012-11-22 22:50:28

by Ingo Molnar

Subject: [PATCH 00/33] Latest numa/core release, v17

This release mainly addresses one of the regressions Linus
(rightfully) complained about: the "4x JVM" SPECjbb run.

[ Note to testers: if possible please still run with
CONFIG_TRANSPARENT_HUGEPAGE=y enabled, to avoid the
!THP regression that is still not fully fixed.
It will be fixed next. ]

The new 4x JVM results on a 4-node, 32-CPU, 64 GB RAM system
(240-second runs, 8 warehouses in each of the 4 JVM instances):

spec1.txt: throughput = 177460.44 SPECjbb2005 bops
spec2.txt: throughput = 176175.08 SPECjbb2005 bops
spec3.txt: throughput = 175053.91 SPECjbb2005 bops
spec4.txt: throughput = 171383.52 SPECjbb2005 bops

This is close to (but does not yet completely match) the hard-binding
performance figures.

Mainline has the following 4x JVM performance:

spec1.txt: throughput = 157839.25 SPECjbb2005 bops
spec2.txt: throughput = 156969.15 SPECjbb2005 bops
spec3.txt: throughput = 157571.59 SPECjbb2005 bops
spec4.txt: throughput = 157873.86 SPECjbb2005 bops

This result is achieved through the following patches:

sched: Introduce staged average NUMA faults
sched: Track groups of shared tasks
sched: Use the best-buddy 'ideal cpu' in balancing decisions
sched, mm, mempolicy: Add per task mempolicy
sched: Average the fault stats longer
sched: Use the ideal CPU to drive active balancing
sched: Add hysteresis to p->numa_shared
sched: Track shared task's node groups and interleave their memory allocations

These patches make increasing use of the shared/private access
pattern distinction between tasks.

Automatic, task group accurate interleaving of memory is the
most important new placement optimization feature in -v17.

It works by first implementing a task CPU placement feature:

Using the shared/private distinction to handle 'private' and 'shared'
workloads separately, we enable active balancing of both:

- private tasks, via the sched_update_ideal_cpu_private() function,
try to 'spread' across the system as evenly as possible.

- shared-access tasks that also share their mm (threads), via the
sched_update_ideal_cpu_shared() function, try to 'compress'
onto as few nodes as possible, together with other shared tasks.

As tasks are tracked as distinct groups of 'shared access pattern'
tasks, they are compressed towards as few nodes as possible. While
the scheduler performs this compression, a mempolicy node mask can
be constructed almost for free - and in turn be used for the memory
allocations of the tasks.
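
As a rough illustration of the node mask construction (a hypothetical
sketch with made-up names - not the code from the "Track shared task's
node groups" patch):

  /*
   * Hypothetical sketch: derive an interleave mask from the nodes that
   * the members of a shared-task group currently run on. Because the
   * scheduler already compresses the group onto few nodes, this mask
   * is almost free to maintain.
   */
  #include <linux/sched.h>
  #include <linux/nodemask.h>
  #include <linux/topology.h>

  static void group_build_interleave_mask(struct task_struct **group,
                                          int nr_tasks, nodemask_t *mask)
  {
          int i;

          nodes_clear(*mask);

          for (i = 0; i < nr_tasks; i++)
                  node_set(cpu_to_node(task_cpu(group[i])), *mask);

          /*
           * A single bit set: interleaving degenerates into node-local
           * allocation. All bits set: memory is spread over the whole
           * machine's RAM bandwidth.
           */
  }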

There are two notable special cases of the interleaving:

- if a group of shared tasks fits on a single node: in this case
the interleaving happens over a single bit (a single node) and thus
turns into nice node-local allocations.

- if a large group spans the whole system: in this case the node
masks will cover the whole system, and all memory gets evenly
interleaved and available RAM bandwidth gets utilized. This is
preferable to allocating memory asymmetrically and overloading
certain CPU links and running into their bandwidth limitations.

"Private" and non-NUMA tasks on the other hand are not affected and
continue to do efficient node-local allocations.

With this approach we avoid most of the 'threading means shared access
patterns' heuristics that AutoNUMA uses, by automatically separating
out threads that have a private working set, without forcibly binding
them to the other threads.

The thread group heuristics are not completely eliminated though, as
can be seen in the "sched: Use the ideal CPU to drive active balancing"
patch. In any case it is not hard-coded into the design and could be
extended to other sources of task group information - the automatic
NUMA balancing of cgroups, for example.

Thanks,

Ingo

-------------------->

Andrea Arcangeli (1):
numa, mm: Support NUMA hinting page faults from gup/gup_fast

Ingo Molnar (14):
mm: Optimize the TLB flush of sys_mprotect() and change_protection()
users
sched, mm, numa: Create generic NUMA fault infrastructure, with
architectures overrides
sched, mm, x86: Add the ARCH_SUPPORTS_NUMA_BALANCING flag
sched, numa, mm: Interleave shared tasks
sched: Implement NUMA scanning backoff
sched: Improve convergence
sched: Introduce staged average NUMA faults
sched: Track groups of shared tasks
sched: Use the best-buddy 'ideal cpu' in balancing decisions
sched, mm, mempolicy: Add per task mempolicy
sched: Average the fault stats longer
sched: Use the ideal CPU to drive active balancing
sched: Add hysteresis to p->numa_shared
sched: Track shared task's node groups and interleave their memory
allocations

Mel Gorman (1):
mm/migration: Improve migrate_misplaced_page()

Peter Zijlstra (11):
mm: Count the number of pages affected in change_protection()
sched, numa, mm: Add last_cpu to page flags
sched: Make find_busiest_queue() a method
sched, numa, mm: Describe the NUMA scheduling problem formally
mm/migrate: Introduce migrate_misplaced_page()
sched, numa, mm, arch: Add variable locality exception
sched, numa, mm: Add the scanning page fault machinery
sched: Add adaptive NUMA affinity support
sched: Implement constant, per task Working Set Sampling (WSS) rate
sched, numa, mm: Count WS scanning against present PTEs, not virtual
memory ranges
sched: Implement slow start for working set sampling

Rik van Riel (6):
mm/generic: Only flush the local TLB in ptep_set_access_flags()
x86/mm: Only do a local tlb flush in ptep_set_access_flags()
x86/mm: Introduce pte_accessible()
mm: Only flush the TLB when clearing an accessible pte
x86/mm: Completely drop the TLB flush from ptep_set_access_flags()
sched, numa, mm: Add credits for NUMA placement

CREDITS | 1 +
Documentation/scheduler/numa-problem.txt | 236 +++++
arch/sh/mm/Kconfig | 1 +
arch/x86/Kconfig | 2 +
arch/x86/include/asm/pgtable.h | 6 +
arch/x86/mm/pgtable.c | 8 +-
include/asm-generic/pgtable.h | 59 ++
include/linux/huge_mm.h | 12 +
include/linux/hugetlb.h | 8 +-
include/linux/init_task.h | 8 +
include/linux/mempolicy.h | 47 +-
include/linux/migrate.h | 7 +
include/linux/mm.h | 99 +-
include/linux/mm_types.h | 50 +
include/linux/mmzone.h | 14 +-
include/linux/page-flags-layout.h | 83 ++
include/linux/sched.h | 54 +-
init/Kconfig | 81 ++
kernel/bounds.c | 4 +
kernel/sched/core.c | 105 ++-
kernel/sched/fair.c | 1464 ++++++++++++++++++++++++++----
kernel/sched/features.h | 13 +
kernel/sched/sched.h | 39 +-
kernel/sysctl.c | 45 +-
mm/Makefile | 1 +
mm/huge_memory.c | 163 ++++
mm/hugetlb.c | 10 +-
mm/internal.h | 5 +-
mm/memcontrol.c | 7 +-
mm/memory.c | 105 ++-
mm/mempolicy.c | 175 +++-
mm/migrate.c | 106 ++-
mm/mprotect.c | 69 +-
mm/numa.c | 73 ++
mm/pgtable-generic.c | 9 +-
35 files changed, 2818 insertions(+), 351 deletions(-)
create mode 100644 Documentation/scheduler/numa-problem.txt
create mode 100644 include/linux/page-flags-layout.h
create mode 100644 mm/numa.c

--
1.7.11.7


2012-11-22 22:50:34

by Ingo Molnar

Subject: [PATCH 02/33] x86/mm: Only do a local tlb flush in ptep_set_access_flags()

From: Rik van Riel <[email protected]>

Because we only ever upgrade a PTE when calling ptep_set_access_flags(),
it is safe to skip flushing entries on remote TLBs.

The worst that can happen is a spurious page fault on other CPUs, which
would flush that TLB entry.

Lazily letting another CPU incur a spurious page fault occasionally
is (much!) cheaper than aggressively flushing everybody else's TLB.

Signed-off-by: Rik van Riel <[email protected]>
Acked-by: Linus Torvalds <[email protected]>
Acked-by: Peter Zijlstra <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Michel Lespinasse <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Hugh Dickins <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
---
arch/x86/mm/pgtable.c | 9 ++++++++-
1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
index 8573b83..be3bb46 100644
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -301,6 +301,13 @@ void pgd_free(struct mm_struct *mm, pgd_t *pgd)
free_page((unsigned long)pgd);
}

+/*
+ * Used to set accessed or dirty bits in the page table entries
+ * on other architectures. On x86, the accessed and dirty bits
+ * are tracked by hardware. However, do_wp_page calls this function
+ * to also make the pte writeable at the same time the dirty bit is
+ * set. In that case we do actually need to write the PTE.
+ */
int ptep_set_access_flags(struct vm_area_struct *vma,
unsigned long address, pte_t *ptep,
pte_t entry, int dirty)
@@ -310,7 +317,7 @@ int ptep_set_access_flags(struct vm_area_struct *vma,
if (changed && dirty) {
*ptep = entry;
pte_update_defer(vma->vm_mm, address, ptep);
- flush_tlb_page(vma, address);
+ __flush_tlb_one(address);
}

return changed;
--
1.7.11.7

2012-11-22 22:50:44

by Ingo Molnar

Subject: [PATCH 06/33] mm: Count the number of pages affected in change_protection()

From: Peter Zijlstra <[email protected]>

This will be used for three purposes:

- to optimize mprotect()

- to speed up working set scanning for working set areas that
have not been touched

- to more accurately scan per real working set

No change in functionality from this patch.

Suggested-by: Ingo Molnar <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
---
include/linux/hugetlb.h | 8 +++++--
include/linux/mm.h | 3 +++
mm/hugetlb.c | 10 +++++++--
mm/mprotect.c | 58 +++++++++++++++++++++++++++++++++++++------------
4 files changed, 61 insertions(+), 18 deletions(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 2251648..06e691b 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -87,7 +87,7 @@ struct page *follow_huge_pud(struct mm_struct *mm, unsigned long address,
pud_t *pud, int write);
int pmd_huge(pmd_t pmd);
int pud_huge(pud_t pmd);
-void hugetlb_change_protection(struct vm_area_struct *vma,
+unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
unsigned long address, unsigned long end, pgprot_t newprot);

#else /* !CONFIG_HUGETLB_PAGE */
@@ -132,7 +132,11 @@ static inline void copy_huge_page(struct page *dst, struct page *src)
{
}

-#define hugetlb_change_protection(vma, address, end, newprot)
+static inline unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
+ unsigned long address, unsigned long end, pgprot_t newprot)
+{
+ return 0;
+}

static inline void __unmap_hugepage_range_final(struct mmu_gather *tlb,
struct vm_area_struct *vma, unsigned long start,
diff --git a/include/linux/mm.h b/include/linux/mm.h
index fa06804..8d86d5a 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1078,6 +1078,9 @@ extern unsigned long move_page_tables(struct vm_area_struct *vma,
extern unsigned long do_mremap(unsigned long addr,
unsigned long old_len, unsigned long new_len,
unsigned long flags, unsigned long new_addr);
+extern unsigned long change_protection(struct vm_area_struct *vma, unsigned long start,
+ unsigned long end, pgprot_t newprot,
+ int dirty_accountable);
extern int mprotect_fixup(struct vm_area_struct *vma,
struct vm_area_struct **pprev, unsigned long start,
unsigned long end, unsigned long newflags);
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 59a0059..712895e 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -3014,7 +3014,7 @@ same_page:
return i ? i : -EFAULT;
}

-void hugetlb_change_protection(struct vm_area_struct *vma,
+unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
unsigned long address, unsigned long end, pgprot_t newprot)
{
struct mm_struct *mm = vma->vm_mm;
@@ -3022,6 +3022,7 @@ void hugetlb_change_protection(struct vm_area_struct *vma,
pte_t *ptep;
pte_t pte;
struct hstate *h = hstate_vma(vma);
+ unsigned long pages = 0;

BUG_ON(address >= end);
flush_cache_range(vma, address, end);
@@ -3032,12 +3033,15 @@ void hugetlb_change_protection(struct vm_area_struct *vma,
ptep = huge_pte_offset(mm, address);
if (!ptep)
continue;
- if (huge_pmd_unshare(mm, &address, ptep))
+ if (huge_pmd_unshare(mm, &address, ptep)) {
+ pages++;
continue;
+ }
if (!huge_pte_none(huge_ptep_get(ptep))) {
pte = huge_ptep_get_and_clear(mm, address, ptep);
pte = pte_mkhuge(pte_modify(pte, newprot));
set_huge_pte_at(mm, address, ptep, pte);
+ pages++;
}
}
spin_unlock(&mm->page_table_lock);
@@ -3049,6 +3053,8 @@ void hugetlb_change_protection(struct vm_area_struct *vma,
*/
flush_tlb_range(vma, start, end);
mutex_unlock(&vma->vm_file->f_mapping->i_mmap_mutex);
+
+ return pages << h->order;
}

int hugetlb_reserve_pages(struct inode *inode,
diff --git a/mm/mprotect.c b/mm/mprotect.c
index a409926..1e265be 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -35,12 +35,13 @@ static inline pgprot_t pgprot_modify(pgprot_t oldprot, pgprot_t newprot)
}
#endif

-static void change_pte_range(struct mm_struct *mm, pmd_t *pmd,
+static unsigned long change_pte_range(struct mm_struct *mm, pmd_t *pmd,
unsigned long addr, unsigned long end, pgprot_t newprot,
int dirty_accountable)
{
pte_t *pte, oldpte;
spinlock_t *ptl;
+ unsigned long pages = 0;

pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
arch_enter_lazy_mmu_mode();
@@ -60,6 +61,7 @@ static void change_pte_range(struct mm_struct *mm, pmd_t *pmd,
ptent = pte_mkwrite(ptent);

ptep_modify_prot_commit(mm, addr, pte, ptent);
+ pages++;
} else if (IS_ENABLED(CONFIG_MIGRATION) && !pte_file(oldpte)) {
swp_entry_t entry = pte_to_swp_entry(oldpte);

@@ -72,18 +74,22 @@ static void change_pte_range(struct mm_struct *mm, pmd_t *pmd,
set_pte_at(mm, addr, pte,
swp_entry_to_pte(entry));
}
+ pages++;
}
} while (pte++, addr += PAGE_SIZE, addr != end);
arch_leave_lazy_mmu_mode();
pte_unmap_unlock(pte - 1, ptl);
+
+ return pages;
}

-static inline void change_pmd_range(struct vm_area_struct *vma, pud_t *pud,
+static inline unsigned long change_pmd_range(struct vm_area_struct *vma, pud_t *pud,
unsigned long addr, unsigned long end, pgprot_t newprot,
int dirty_accountable)
{
pmd_t *pmd;
unsigned long next;
+ unsigned long pages = 0;

pmd = pmd_offset(pud, addr);
do {
@@ -91,35 +97,42 @@ static inline void change_pmd_range(struct vm_area_struct *vma, pud_t *pud,
if (pmd_trans_huge(*pmd)) {
if (next - addr != HPAGE_PMD_SIZE)
split_huge_page_pmd(vma->vm_mm, pmd);
- else if (change_huge_pmd(vma, pmd, addr, newprot))
+ else if (change_huge_pmd(vma, pmd, addr, newprot)) {
+ pages += HPAGE_PMD_NR;
continue;
+ }
/* fall through */
}
if (pmd_none_or_clear_bad(pmd))
continue;
- change_pte_range(vma->vm_mm, pmd, addr, next, newprot,
+ pages += change_pte_range(vma->vm_mm, pmd, addr, next, newprot,
dirty_accountable);
} while (pmd++, addr = next, addr != end);
+
+ return pages;
}

-static inline void change_pud_range(struct vm_area_struct *vma, pgd_t *pgd,
+static inline unsigned long change_pud_range(struct vm_area_struct *vma, pgd_t *pgd,
unsigned long addr, unsigned long end, pgprot_t newprot,
int dirty_accountable)
{
pud_t *pud;
unsigned long next;
+ unsigned long pages = 0;

pud = pud_offset(pgd, addr);
do {
next = pud_addr_end(addr, end);
if (pud_none_or_clear_bad(pud))
continue;
- change_pmd_range(vma, pud, addr, next, newprot,
+ pages += change_pmd_range(vma, pud, addr, next, newprot,
dirty_accountable);
} while (pud++, addr = next, addr != end);
+
+ return pages;
}

-static void change_protection(struct vm_area_struct *vma,
+static unsigned long change_protection_range(struct vm_area_struct *vma,
unsigned long addr, unsigned long end, pgprot_t newprot,
int dirty_accountable)
{
@@ -127,6 +140,7 @@ static void change_protection(struct vm_area_struct *vma,
pgd_t *pgd;
unsigned long next;
unsigned long start = addr;
+ unsigned long pages = 0;

BUG_ON(addr >= end);
pgd = pgd_offset(mm, addr);
@@ -135,10 +149,30 @@ static void change_protection(struct vm_area_struct *vma,
next = pgd_addr_end(addr, end);
if (pgd_none_or_clear_bad(pgd))
continue;
- change_pud_range(vma, pgd, addr, next, newprot,
+ pages += change_pud_range(vma, pgd, addr, next, newprot,
dirty_accountable);
} while (pgd++, addr = next, addr != end);
+
flush_tlb_range(vma, start, end);
+
+ return pages;
+}
+
+unsigned long change_protection(struct vm_area_struct *vma, unsigned long start,
+ unsigned long end, pgprot_t newprot,
+ int dirty_accountable)
+{
+ struct mm_struct *mm = vma->vm_mm;
+ unsigned long pages;
+
+ mmu_notifier_invalidate_range_start(mm, start, end);
+ if (is_vm_hugetlb_page(vma))
+ pages = hugetlb_change_protection(vma, start, end, newprot);
+ else
+ pages = change_protection_range(vma, start, end, newprot, dirty_accountable);
+ mmu_notifier_invalidate_range_end(mm, start, end);
+
+ return pages;
}

int
@@ -213,12 +247,8 @@ success:
dirty_accountable = 1;
}

- mmu_notifier_invalidate_range_start(mm, start, end);
- if (is_vm_hugetlb_page(vma))
- hugetlb_change_protection(vma, start, end, vma->vm_page_prot);
- else
- change_protection(vma, start, end, vma->vm_page_prot, dirty_accountable);
- mmu_notifier_invalidate_range_end(mm, start, end);
+ change_protection(vma, start, end, vma->vm_page_prot, dirty_accountable);
+
vm_stat_account(mm, oldflags, vma->vm_file, -nrpages);
vm_stat_account(mm, newflags, vma->vm_file, nrpages);
perf_event_mmap(vma);
--
1.7.11.7

2012-11-22 22:50:47

by Ingo Molnar

Subject: [PATCH 07/33] mm: Optimize the TLB flush of sys_mprotect() and change_protection() users

Reuse the NUMA code's 'modified page protections' count that
change_protection() computes, and skip the TLB flush if there are
no changes to the range that sys_mprotect() modifies.

Given that mprotect() already optimizes the same-flags case
I expected this optimization to trigger predominantly on
CONFIG_NUMA_BALANCING=y kernels - but even with that feature
disabled it triggers rather often.

There are two reasons for that:

1)

sys_mprotect() already optimizes the same-flags case:

if (newflags == oldflags) {
*pprev = vma;
return 0;
}

This test works in many cases, but it is too sharp in some others:
it differentiates between protection values that the underlying PTE
format makes no distinction about, such as PROT_EXEC == PROT_READ
on x86.

2)

Even where the vma flag change necessitates a modification of the
pagetables at the PTE level, there might be no pagetables to modify
yet: they might not have been instantiated yet.
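
For illustration, here is a minimal (hypothetical) user-space sketch of
case 2: the mapping is never touched before mprotect(), so there are no
page tables to modify and the flush can be skipped:

  #include <stddef.h>
  #include <sys/mman.h>

  int main(void)
  {
          size_t len = 1024 * 4096;

          /* Reserve address space but never touch it: no PTEs get instantiated. */
          char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
          if (p == MAP_FAILED)
                  return 1;

          /*
           * vm_flags change, so the newflags == oldflags shortcut does not
           * apply - but there are no present PTEs to rewrite, so with this
           * patch change_protection() returns 0 and the TLB flush is skipped.
           */
          mprotect(p, len, PROT_READ);

          munmap(p, len);
          return 0;
  }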

During a regular desktop bootup this optimization hits a couple
of hundred times. During a Java test I measured thousands of
hits.

So this optimization improves sys_mprotect() in general, not just
CONFIG_NUMA_BALANCING=y kernels.

[ We could further increase the efficiency of this optimization if
change_pte_range() and change_huge_pmd() were a bit smarter about
recognizing exact-same-value protection masks - when the hardware
can do that safely. This would probably further speed up mprotect(). ]
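
A hypothetical sketch of what such a check could look like inside
change_pte_range() (not part of this series; it ignores the
dirty_accountable special case and assumes the hardware allows the
comparison to be done safely):

          pte_t oldpte = *pte;
          pte_t newpte = pte_modify(oldpte, newprot);

          /*
           * Hypothetical: if the new protection encodes to the exact same
           * PTE value, skip the write and do not bump 'pages', so that the
           * caller can also skip the TLB flush.
           */
          if (pte_same(oldpte, newpte))
                  continue;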

Cc: Linus Torvalds <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
---
mm/mprotect.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/mm/mprotect.c b/mm/mprotect.c
index 1e265be..7c3628a 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -153,7 +153,9 @@ static unsigned long change_protection_range(struct vm_area_struct *vma,
dirty_accountable);
} while (pgd++, addr = next, addr != end);

- flush_tlb_range(vma, start, end);
+ /* Only flush the TLB if we actually modified any entries: */
+ if (pages)
+ flush_tlb_range(vma, start, end);

return pages;
}
--
1.7.11.7

2012-11-22 22:50:39

by Ingo Molnar

Subject: [PATCH 04/33] mm: Only flush the TLB when clearing an accessible pte

From: Rik van Riel <[email protected]>

If ptep_clear_flush() is called to clear a page table entry that is
not accessible anyway by the CPU, e.g. a _PAGE_PROTNONE page table
entry, there is no need to flush the TLB on remote CPUs.

Signed-off-by: Rik van Riel <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Hugh Dickins <[email protected]>
Link: http://lkml.kernel.org/n/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
mm/pgtable-generic.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
index d8397da..0c8323f 100644
--- a/mm/pgtable-generic.c
+++ b/mm/pgtable-generic.c
@@ -88,7 +88,8 @@ pte_t ptep_clear_flush(struct vm_area_struct *vma, unsigned long address,
{
pte_t pte;
pte = ptep_get_and_clear((vma)->vm_mm, address, ptep);
- flush_tlb_page(vma, address);
+ if (pte_accessible(pte))
+ flush_tlb_page(vma, address);
return pte;
}
#endif
--
1.7.11.7

2012-11-22 22:50:54

by Ingo Molnar

Subject: [PATCH 08/33] sched, numa, mm: Add last_cpu to page flags

From: Peter Zijlstra <[email protected]>

Introduce a per-page last_cpu field and fold it into the struct
page::flags field whenever possible.

The unlikely/rare 32bit NUMA configs will likely grow the page-frame.

[ Completely dropping 32bit support for CONFIG_NUMA_BALANCING would simplify
things, but it would also remove the warning if we grow enough 64bit
only page-flags to push the last-cpu out. ]

Suggested-by: Rik van Riel <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
---
include/linux/mm.h | 90 +++++++++++++++++++++------------------
include/linux/mm_types.h | 5 +++
include/linux/mmzone.h | 14 +-----
include/linux/page-flags-layout.h | 83 ++++++++++++++++++++++++++++++++++++
kernel/bounds.c | 4 ++
mm/memory.c | 4 ++
6 files changed, 146 insertions(+), 54 deletions(-)
create mode 100644 include/linux/page-flags-layout.h

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 8d86d5a..5fc1d46 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -581,50 +581,11 @@ static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma)
* sets it, so none of the operations on it need to be atomic.
*/

-
-/*
- * page->flags layout:
- *
- * There are three possibilities for how page->flags get
- * laid out. The first is for the normal case, without
- * sparsemem. The second is for sparsemem when there is
- * plenty of space for node and section. The last is when
- * we have run out of space and have to fall back to an
- * alternate (slower) way of determining the node.
- *
- * No sparsemem or sparsemem vmemmap: | NODE | ZONE | ... | FLAGS |
- * classic sparse with space for node:| SECTION | NODE | ZONE | ... | FLAGS |
- * classic sparse no space for node: | SECTION | ZONE | ... | FLAGS |
- */
-#if defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP)
-#define SECTIONS_WIDTH SECTIONS_SHIFT
-#else
-#define SECTIONS_WIDTH 0
-#endif
-
-#define ZONES_WIDTH ZONES_SHIFT
-
-#if SECTIONS_WIDTH+ZONES_WIDTH+NODES_SHIFT <= BITS_PER_LONG - NR_PAGEFLAGS
-#define NODES_WIDTH NODES_SHIFT
-#else
-#ifdef CONFIG_SPARSEMEM_VMEMMAP
-#error "Vmemmap: No space for nodes field in page flags"
-#endif
-#define NODES_WIDTH 0
-#endif
-
-/* Page flags: | [SECTION] | [NODE] | ZONE | ... | FLAGS | */
+/* Page flags: | [SECTION] | [NODE] | ZONE | [LAST_CPU] | ... | FLAGS | */
#define SECTIONS_PGOFF ((sizeof(unsigned long)*8) - SECTIONS_WIDTH)
#define NODES_PGOFF (SECTIONS_PGOFF - NODES_WIDTH)
#define ZONES_PGOFF (NODES_PGOFF - ZONES_WIDTH)
-
-/*
- * We are going to use the flags for the page to node mapping if its in
- * there. This includes the case where there is no node, so it is implicit.
- */
-#if !(NODES_WIDTH > 0 || NODES_SHIFT == 0)
-#define NODE_NOT_IN_PAGE_FLAGS
-#endif
+#define LAST_CPU_PGOFF (ZONES_PGOFF - LAST_CPU_WIDTH)

/*
* Define the bit shifts to access each section. For non-existent
@@ -634,6 +595,7 @@ static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma)
#define SECTIONS_PGSHIFT (SECTIONS_PGOFF * (SECTIONS_WIDTH != 0))
#define NODES_PGSHIFT (NODES_PGOFF * (NODES_WIDTH != 0))
#define ZONES_PGSHIFT (ZONES_PGOFF * (ZONES_WIDTH != 0))
+#define LAST_CPU_PGSHIFT (LAST_CPU_PGOFF * (LAST_CPU_WIDTH != 0))

/* NODE:ZONE or SECTION:ZONE is used to ID a zone for the buddy allocator */
#ifdef NODE_NOT_IN_PAGE_FLAGS
@@ -655,6 +617,7 @@ static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma)
#define ZONES_MASK ((1UL << ZONES_WIDTH) - 1)
#define NODES_MASK ((1UL << NODES_WIDTH) - 1)
#define SECTIONS_MASK ((1UL << SECTIONS_WIDTH) - 1)
+#define LAST_CPU_MASK ((1UL << LAST_CPU_WIDTH) - 1)
#define ZONEID_MASK ((1UL << ZONEID_SHIFT) - 1)

static inline enum zone_type page_zonenum(const struct page *page)
@@ -693,6 +656,51 @@ static inline int page_to_nid(const struct page *page)
}
#endif

+#ifdef CONFIG_NUMA_BALANCING
+#ifdef LAST_CPU_NOT_IN_PAGE_FLAGS
+static inline int page_xchg_last_cpu(struct page *page, int cpu)
+{
+ return xchg(&page->_last_cpu, cpu);
+}
+
+static inline int page_last_cpu(struct page *page)
+{
+ return page->_last_cpu;
+}
+#else
+static inline int page_xchg_last_cpu(struct page *page, int cpu)
+{
+ unsigned long old_flags, flags;
+ int last_cpu;
+
+ do {
+ old_flags = flags = page->flags;
+ last_cpu = (flags >> LAST_CPU_PGSHIFT) & LAST_CPU_MASK;
+
+ flags &= ~(LAST_CPU_MASK << LAST_CPU_PGSHIFT);
+ flags |= (cpu & LAST_CPU_MASK) << LAST_CPU_PGSHIFT;
+ } while (unlikely(cmpxchg(&page->flags, old_flags, flags) != old_flags));
+
+ return last_cpu;
+}
+
+static inline int page_last_cpu(struct page *page)
+{
+ return (page->flags >> LAST_CPU_PGSHIFT) & LAST_CPU_MASK;
+}
+#endif /* LAST_CPU_NOT_IN_PAGE_FLAGS */
+#else /* CONFIG_NUMA_BALANCING */
+static inline int page_xchg_last_cpu(struct page *page, int cpu)
+{
+ return page_to_nid(page);
+}
+
+static inline int page_last_cpu(struct page *page)
+{
+ return page_to_nid(page);
+}
+#endif /* CONFIG_NUMA_BALANCING */
+
static inline struct zone *page_zone(const struct page *page)
{
return &NODE_DATA(page_to_nid(page))->node_zones[page_zonenum(page)];
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 31f8a3a..7e9f758 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -12,6 +12,7 @@
#include <linux/cpumask.h>
#include <linux/page-debug-flags.h>
#include <linux/uprobes.h>
+#include <linux/page-flags-layout.h>
#include <asm/page.h>
#include <asm/mmu.h>

@@ -175,6 +176,10 @@ struct page {
*/
void *shadow;
#endif
+
+#ifdef LAST_CPU_NOT_IN_PAGE_FLAGS
+ int _last_cpu;
+#endif
}
/*
* The struct page can be forced to be double word aligned so that atomic ops
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 50aaca8..7e116ed 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -15,7 +15,7 @@
#include <linux/seqlock.h>
#include <linux/nodemask.h>
#include <linux/pageblock-flags.h>
-#include <generated/bounds.h>
+#include <linux/page-flags-layout.h>
#include <linux/atomic.h>
#include <asm/page.h>

@@ -318,16 +318,6 @@ enum zone_type {
* match the requested limits. See gfp_zone() in include/linux/gfp.h
*/

-#if MAX_NR_ZONES < 2
-#define ZONES_SHIFT 0
-#elif MAX_NR_ZONES <= 2
-#define ZONES_SHIFT 1
-#elif MAX_NR_ZONES <= 4
-#define ZONES_SHIFT 2
-#else
-#error ZONES_SHIFT -- too many zones configured adjust calculation
-#endif
-
struct zone {
/* Fields commonly accessed by the page allocator */

@@ -1030,8 +1020,6 @@ static inline unsigned long early_pfn_to_nid(unsigned long pfn)
* PA_SECTION_SHIFT physical address to/from section number
* PFN_SECTION_SHIFT pfn to/from section number
*/
-#define SECTIONS_SHIFT (MAX_PHYSMEM_BITS - SECTION_SIZE_BITS)
-
#define PA_SECTION_SHIFT (SECTION_SIZE_BITS)
#define PFN_SECTION_SHIFT (SECTION_SIZE_BITS - PAGE_SHIFT)

diff --git a/include/linux/page-flags-layout.h b/include/linux/page-flags-layout.h
new file mode 100644
index 0000000..b258132
--- /dev/null
+++ b/include/linux/page-flags-layout.h
@@ -0,0 +1,83 @@
+#ifndef _LINUX_PAGE_FLAGS_LAYOUT
+#define _LINUX_PAGE_FLAGS_LAYOUT
+
+#include <linux/numa.h>
+#include <generated/bounds.h>
+
+#if MAX_NR_ZONES < 2
+#define ZONES_SHIFT 0
+#elif MAX_NR_ZONES <= 2
+#define ZONES_SHIFT 1
+#elif MAX_NR_ZONES <= 4
+#define ZONES_SHIFT 2
+#else
+#error ZONES_SHIFT -- too many zones configured adjust calculation
+#endif
+
+#ifdef CONFIG_SPARSEMEM
+#include <asm/sparsemem.h>
+
+/*
+ * SECTION_SHIFT #bits space required to store a section #
+ */
+#define SECTIONS_SHIFT (MAX_PHYSMEM_BITS - SECTION_SIZE_BITS)
+#endif
+
+/*
+ * page->flags layout:
+ *
+ * There are five possibilities for how page->flags get laid out. The first
+ * (and second) is for the normal case, without sparsemem. The third is for
+ * sparsemem when there is plenty of space for node and section. The last is
+ * when we have run out of space and have to fall back to an alternate (slower)
+ * way of determining the node.
+ *
+ * No sparsemem or sparsemem vmemmap: | NODE | ZONE | ... | FLAGS |
+ * " plus space for last_cpu:| NODE | ZONE | LAST_CPU | ... | FLAGS |
+ * classic sparse with space for node:| SECTION | NODE | ZONE | ... | FLAGS |
+ * " plus space for last_cpu:| SECTION | NODE | ZONE | LAST_CPU | ... | FLAGS |
+ * classic sparse no space for node: | SECTION | ZONE | ... | FLAGS |
+ */
+#if defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP)
+
+#define SECTIONS_WIDTH SECTIONS_SHIFT
+#else
+#define SECTIONS_WIDTH 0
+#endif
+
+#define ZONES_WIDTH ZONES_SHIFT
+
+#if SECTIONS_WIDTH+ZONES_WIDTH+NODES_SHIFT <= BITS_PER_LONG - NR_PAGEFLAGS
+#define NODES_WIDTH NODES_SHIFT
+#else
+#ifdef CONFIG_SPARSEMEM_VMEMMAP
+#error "Vmemmap: No space for nodes field in page flags"
+#endif
+#define NODES_WIDTH 0
+#endif
+
+#ifdef CONFIG_NUMA_BALANCING
+#define LAST_CPU_SHIFT NR_CPUS_BITS
+#else
+#define LAST_CPU_SHIFT 0
+#endif
+
+#if SECTIONS_WIDTH+ZONES_WIDTH+NODES_SHIFT+LAST_CPU_SHIFT <= BITS_PER_LONG - NR_PAGEFLAGS
+#define LAST_CPU_WIDTH LAST_CPU_SHIFT
+#else
+#define LAST_CPU_WIDTH 0
+#endif
+
+/*
+ * We are going to use the flags for the page to node mapping if its in
+ * there. This includes the case where there is no node, so it is implicit.
+ */
+#if !(NODES_WIDTH > 0 || NODES_SHIFT == 0)
+#define NODE_NOT_IN_PAGE_FLAGS
+#endif
+
+#if defined(CONFIG_NUMA_BALANCING) && LAST_CPU_WIDTH == 0
+#define LAST_CPU_NOT_IN_PAGE_FLAGS
+#endif
+
+#endif /* _LINUX_PAGE_FLAGS_LAYOUT */
diff --git a/kernel/bounds.c b/kernel/bounds.c
index 0c9b862..e8ca97b 100644
--- a/kernel/bounds.c
+++ b/kernel/bounds.c
@@ -10,6 +10,7 @@
#include <linux/mmzone.h>
#include <linux/kbuild.h>
#include <linux/page_cgroup.h>
+#include <linux/log2.h>

void foo(void)
{
@@ -17,5 +18,8 @@ void foo(void)
DEFINE(NR_PAGEFLAGS, __NR_PAGEFLAGS);
DEFINE(MAX_NR_ZONES, __MAX_NR_ZONES);
DEFINE(NR_PCG_FLAGS, __NR_PCG_FLAGS);
+#ifdef CONFIG_SMP
+ DEFINE(NR_CPUS_BITS, ilog2(CONFIG_NR_CPUS));
+#endif
/* End of constants */
}
diff --git a/mm/memory.c b/mm/memory.c
index fb135ba..24d3a4a 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -67,6 +67,10 @@

#include "internal.h"

+#ifdef LAST_CPU_NOT_IN_PAGE_FLAGS
+#warning Unfortunate NUMA config, growing page-frame for last_cpu.
+#endif
+
#ifndef CONFIG_NEED_MULTIPLE_NODES
/* use the per-pgdat data instead for discontigmem - mbligh */
unsigned long max_mapnr;
--
1.7.11.7

2012-11-22 22:51:04

by Ingo Molnar

Subject: [PATCH 10/33] sched: Make find_busiest_queue() a method

From: Peter Zijlstra <[email protected]>

It's a bit awkward, but it was the least painful means of modifying
the queue selection. It is used in a later patch to conditionally use
a random queue.

Signed-off-by: Peter Zijlstra <[email protected]>
Cc: Paul Turner <[email protected]>
Cc: Lee Schermerhorn <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Hugh Dickins <[email protected]>
Link: http://lkml.kernel.org/n/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
kernel/sched/fair.c | 20 ++++++++++++--------
1 file changed, 12 insertions(+), 8 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 59e072b..511fbb8 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3600,6 +3600,9 @@ struct lb_env {
unsigned int loop;
unsigned int loop_break;
unsigned int loop_max;
+
+ struct rq * (*find_busiest_queue)(struct lb_env *,
+ struct sched_group *);
};

/*
@@ -4779,13 +4782,14 @@ static int load_balance(int this_cpu, struct rq *this_rq,
struct cpumask *cpus = __get_cpu_var(load_balance_tmpmask);

struct lb_env env = {
- .sd = sd,
- .dst_cpu = this_cpu,
- .dst_rq = this_rq,
- .dst_grpmask = sched_group_cpus(sd->groups),
- .idle = idle,
- .loop_break = sched_nr_migrate_break,
- .cpus = cpus,
+ .sd = sd,
+ .dst_cpu = this_cpu,
+ .dst_rq = this_rq,
+ .dst_grpmask = sched_group_cpus(sd->groups),
+ .idle = idle,
+ .loop_break = sched_nr_migrate_break,
+ .cpus = cpus,
+ .find_busiest_queue = find_busiest_queue,
};

cpumask_copy(cpus, cpu_active_mask);
@@ -4804,7 +4808,7 @@ redo:
goto out_balanced;
}

- busiest = find_busiest_queue(&env, group);
+ busiest = env.find_busiest_queue(&env, group);
if (!busiest) {
schedstat_inc(sd, lb_nobusyq[idle]);
goto out_balanced;
--
1.7.11.7

2012-11-22 22:51:26

by Ingo Molnar

Subject: [PATCH 19/33] sched: Add adaptive NUMA affinity support

From: Peter Zijlstra <[email protected]>

The principal ideas behind this patch are the fundamental
difference between shared and privately used memory and the very
strong desire to only rely on per-task behavioral state for
scheduling decisions.

We define 'shared memory' as all user memory that is frequently
accessed by multiple tasks and conversely 'private memory' is
the user memory used predominantly by a single task.

To approximate the above strict definition we recognise that
task placement is dominantly per cpu, so using cpu-granular
page access state is a natural fit. Thus we introduce
page::last_cpu as the cpu that last accessed a page.

Using this, we can construct two per-task node-vectors, 'S_i'
and 'P_i', reflecting the number of shared and privately used
pages of this task respectively. Pages for which two consecutive
'hits' are of the same cpu are assumed private and the others
are shared.

[ This means that we will start evaluating this state when the
task has not migrated for at least 2 scans, see NUMA_SETTLE ]

Using these vectors we can compute the total number of
shared/private pages of this task and determine which dominates.

[ Note that for shared tasks we only see '1/n'-th of the total number
of shared pages; the other tasks will take the other faults, where
'n' is the number of tasks sharing the memory. So for an equal
comparison we should divide total private by 'n' as well, but we
don't have 'n', so we pick 2. ]
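
[ For example, with the check used in task_numa_placement() below
  (a task is marked shared if its shared-fault total >= private-fault
  total / 2): a task that saw 500 private and 300 shared faults is
  classified as shared (300 >= 250), while with only 200 shared faults
  it would be classified as private. ]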

We can also compute which node holds most of our memory; running
on this node will be called 'ideal placement'. (As per previous
patches we will prefer to pull memory towards wherever we run.)

We change the load-balancer to prefer moving tasks in order of:

1) !numa tasks and numa tasks in the direction of more faults
2) allow !ideal tasks getting worse in the direction of faults
3) allow private tasks to get worse
4) allow shared tasks to get worse

This order ensures we prefer increasing memory locality but when
we do have to make hard decisions we prefer spreading private
over shared, because spreading shared tasks significantly
increases interconnect bandwidth usage, since not all memory can
follow.

We also add an extra 'lateral' force to the load balancer that
perturbs the state when otherwise 'fairly' balanced. This
ensures we don't get 'stuck' in a state which is fair but
undesired from a memory locality POV (see can_do_numa_run()).

Lastly, we allow shared tasks to defeat the default spreading of
tasks such that, when possible, they can aggregate on a single
node.

Shared tasks aggregate for the very simple reason that there has
to be a single node that holds most of their memory, a second-most
node, and so on - and tasks want to move up the faults ladder.

Enable it on x86. A number of other architectures are
most likely fine too - but they should enable and test this
feature explicitly.

Signed-off-by: Peter Zijlstra <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Hugh Dickins <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
---
Documentation/scheduler/numa-problem.txt | 20 +-
arch/x86/Kconfig | 2 +
include/linux/sched.h | 1 +
kernel/sched/core.c | 53 +-
kernel/sched/fair.c | 975 +++++++++++++++++++++++++------
kernel/sched/features.h | 8 +
kernel/sched/sched.h | 38 +-
7 files changed, 900 insertions(+), 197 deletions(-)

diff --git a/Documentation/scheduler/numa-problem.txt b/Documentation/scheduler/numa-problem.txt
index a5d2fee..7f133e3 100644
--- a/Documentation/scheduler/numa-problem.txt
+++ b/Documentation/scheduler/numa-problem.txt
@@ -133,6 +133,8 @@ XXX properties of this M vs a potential optimal

2b) migrate memory towards 'n_i' using 2 samples.

+XXX include the statistical babble on double sampling somewhere near
+
This separates pages into those that will migrate and those that will not due
to the two samples not matching. We could consider the first to be of 'p_i'
(private) and the second to be of 's_i' (shared).
@@ -142,7 +144,17 @@ This interpretation can be motivated by the previously observed property that
's_i' (shared). (here we loose the need for memory limits again, since it
becomes indistinguishable from shared).

-XXX include the statistical babble on double sampling somewhere near
+ 2c) use cpu samples instead of node samples
+
+The problem with sampling on node granularity is that one looses 's_i' for
+the local node, since one cannot distinguish between two accesses from the
+same node.
+
+By increasing the granularity to per-cpu we gain the ability to have both an
+'s_i' and 'p_i' per node. Since we do all task placement per-cpu as well this
+seems like a natural match. The line where we overcommit cpus is where we loose
+granularity again, but when we loose overcommit we naturally spread tasks.
+Therefore it should work out nicely.

This reduces the problem further; we loose 'M' as per 2a, it further reduces
the 'T_k,l' (interconnect traffic) term to only include shared (since per the
@@ -150,12 +162,6 @@ above all private will be local):

T_k,l = \Sum_i bs_i,l for every n_i = k, l != k

-[ more or less matches the state of sched/numa and describes its remaining
- problems and assumptions. It should work well for tasks without significant
- shared memory usage between tasks. ]
-
-Possible future directions:
-
Motivated by the form of 'T_k,l', try and obtain each term of the sum, so we
can evaluate it;

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 46c3bff..95646fe 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -22,6 +22,8 @@ config X86
def_bool y
select HAVE_AOUT if X86_32
select HAVE_UNSTABLE_SCHED_CLOCK
+ select ARCH_SUPPORTS_NUMA_BALANCING
+ select ARCH_WANTS_NUMA_GENERIC_PGPROT
select HAVE_IDE
select HAVE_OPROFILE
select HAVE_PCSPKR_PLATFORM
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 418d405..bb12cc3 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -823,6 +823,7 @@ enum cpu_idle_type {
#define SD_ASYM_PACKING 0x0800 /* Place busy groups earlier in the domain */
#define SD_PREFER_SIBLING 0x1000 /* Prefer to place tasks in a sibling domain */
#define SD_OVERLAP 0x2000 /* sched_domains of this level overlap */
+#define SD_NUMA 0x4000 /* cross-node balancing */

extern int __weak arch_sd_sibiling_asym_packing(void);

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 3611f5f..7b58366 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1800,6 +1800,7 @@ static void finish_task_switch(struct rq *rq, struct task_struct *prev)
if (mm)
mmdrop(mm);
if (unlikely(prev_state == TASK_DEAD)) {
+ task_numa_free(prev);
/*
* Remove function-return probe instances associated with this
* task and put them back on the free list.
@@ -5510,7 +5511,9 @@ static void destroy_sched_domains(struct sched_domain *sd, int cpu)
DEFINE_PER_CPU(struct sched_domain *, sd_llc);
DEFINE_PER_CPU(int, sd_llc_id);

-static void update_top_cache_domain(int cpu)
+DEFINE_PER_CPU(struct sched_domain *, sd_node);
+
+static void update_domain_cache(int cpu)
{
struct sched_domain *sd;
int id = cpu;
@@ -5521,6 +5524,15 @@ static void update_top_cache_domain(int cpu)

rcu_assign_pointer(per_cpu(sd_llc, cpu), sd);
per_cpu(sd_llc_id, cpu) = id;
+
+ for_each_domain(cpu, sd) {
+ if (cpumask_equal(sched_domain_span(sd),
+ cpumask_of_node(cpu_to_node(cpu))))
+ goto got_node;
+ }
+ sd = NULL;
+got_node:
+ rcu_assign_pointer(per_cpu(sd_node, cpu), sd);
}

/*
@@ -5563,7 +5575,7 @@ cpu_attach_domain(struct sched_domain *sd, struct root_domain *rd, int cpu)
rcu_assign_pointer(rq->sd, sd);
destroy_sched_domains(tmp, cpu);

- update_top_cache_domain(cpu);
+ update_domain_cache(cpu);
}

/* cpus with isolated domains */
@@ -5985,6 +5997,37 @@ static struct sched_domain_topology_level default_topology[] = {

static struct sched_domain_topology_level *sched_domain_topology = default_topology;

+#ifdef CONFIG_NUMA_BALANCING
+/*
+ * Change a task's NUMA state - called from the placement tick.
+ */
+void sched_setnuma(struct task_struct *p, int node, int shared)
+{
+ unsigned long flags;
+ int on_rq, running;
+ struct rq *rq;
+
+ rq = task_rq_lock(p, &flags);
+ on_rq = p->on_rq;
+ running = task_current(rq, p);
+
+ if (on_rq)
+ dequeue_task(rq, p, 0);
+ if (running)
+ p->sched_class->put_prev_task(rq, p);
+
+ p->numa_shared = shared;
+ p->numa_max_node = node;
+
+ if (running)
+ p->sched_class->set_curr_task(rq);
+ if (on_rq)
+ enqueue_task(rq, p, 0);
+ task_rq_unlock(rq, p, &flags);
+}
+
+#endif /* CONFIG_NUMA_BALANCING */
+
#ifdef CONFIG_NUMA

static int sched_domains_numa_levels;
@@ -6030,6 +6073,7 @@ sd_numa_init(struct sched_domain_topology_level *tl, int cpu)
| 0*SD_SHARE_PKG_RESOURCES
| 1*SD_SERIALIZE
| 0*SD_PREFER_SIBLING
+ | 1*SD_NUMA
| sd_local_flags(level)
,
.last_balance = jiffies,
@@ -6884,7 +6928,6 @@ void __init sched_init(void)
rq->post_schedule = 0;
rq->active_balance = 0;
rq->next_balance = jiffies;
- rq->push_cpu = 0;
rq->cpu = i;
rq->online = 0;
rq->idle_stamp = 0;
@@ -6892,6 +6935,10 @@ void __init sched_init(void)

INIT_LIST_HEAD(&rq->cfs_tasks);

+#ifdef CONFIG_NUMA_BALANCING
+ rq->nr_shared_running = 0;
+#endif
+
rq_attach_root(rq, &def_root_domain);
#ifdef CONFIG_NO_HZ
rq->nohz_flags = 0;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8af0208..f3aeaac 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -29,6 +29,9 @@
#include <linux/slab.h>
#include <linux/profile.h>
#include <linux/interrupt.h>
+#include <linux/random.h>
+#include <linux/mempolicy.h>
+#include <linux/task_work.h>

#include <trace/events/sched.h>

@@ -774,6 +777,235 @@ update_stats_curr_start(struct cfs_rq *cfs_rq, struct sched_entity *se)
}

/**************************************************
+ * Scheduling class numa methods.
+ *
+ * The purpose of the NUMA bits are to maintain compute (task) and data
+ * (memory) locality.
+ *
+ * We keep a faults vector per task and use periodic fault scans to try and
+ * estalish a task<->page relation. This assumes the task<->page relation is a
+ * compute<->data relation, this is false for things like virt. and n:m
+ * threading solutions but its the best we can do given the information we
+ * have.
+ *
+ * We try and migrate such that we increase along the order provided by this
+ * vector while maintaining fairness.
+ *
+ * Tasks start out with their numa status unset (-1) this effectively means
+ * they act !NUMA until we've established the task is busy enough to bother
+ * with placement.
+ */
+
+#ifdef CONFIG_SMP
+static unsigned long task_h_load(struct task_struct *p);
+#endif
+
+#ifdef CONFIG_NUMA_BALANCING
+static void account_numa_enqueue(struct rq *rq, struct task_struct *p)
+{
+ if (task_numa_shared(p) != -1) {
+ p->numa_weight = task_h_load(p);
+ rq->nr_numa_running++;
+ rq->nr_shared_running += task_numa_shared(p);
+ rq->nr_ideal_running += (cpu_to_node(task_cpu(p)) == p->numa_max_node);
+ rq->numa_weight += p->numa_weight;
+ }
+}
+
+static void account_numa_dequeue(struct rq *rq, struct task_struct *p)
+{
+ if (task_numa_shared(p) != -1) {
+ rq->nr_numa_running--;
+ rq->nr_shared_running -= task_numa_shared(p);
+ rq->nr_ideal_running -= (cpu_to_node(task_cpu(p)) == p->numa_max_node);
+ rq->numa_weight -= p->numa_weight;
+ }
+}
+
+/*
+ * numa task sample period in ms: 5s
+ */
+unsigned int sysctl_sched_numa_scan_period_min = 5000;
+unsigned int sysctl_sched_numa_scan_period_max = 5000*16;
+
+/*
+ * Wait for the 2-sample stuff to settle before migrating again
+ */
+unsigned int sysctl_sched_numa_settle_count = 2;
+
+static void task_numa_migrate(struct task_struct *p, int next_cpu)
+{
+ p->numa_migrate_seq = 0;
+}
+
+static void task_numa_placement(struct task_struct *p)
+{
+ int seq = ACCESS_ONCE(p->mm->numa_scan_seq);
+ unsigned long total[2] = { 0, 0 };
+ unsigned long faults, max_faults = 0;
+ int node, priv, shared, max_node = -1;
+
+ if (p->numa_scan_seq == seq)
+ return;
+
+ p->numa_scan_seq = seq;
+
+ for (node = 0; node < nr_node_ids; node++) {
+ faults = 0;
+ for (priv = 0; priv < 2; priv++) {
+ faults += p->numa_faults[2*node + priv];
+ total[priv] += p->numa_faults[2*node + priv];
+ p->numa_faults[2*node + priv] /= 2;
+ }
+ if (faults > max_faults) {
+ max_faults = faults;
+ max_node = node;
+ }
+ }
+
+ if (max_node != p->numa_max_node)
+ sched_setnuma(p, max_node, task_numa_shared(p));
+
+ p->numa_migrate_seq++;
+ if (sched_feat(NUMA_SETTLE) &&
+ p->numa_migrate_seq < sysctl_sched_numa_settle_count)
+ return;
+
+ /*
+ * Note: shared is spread across multiple tasks and in the future
+ * we might want to consider a different equation below to reduce
+ * the impact of a little private memory accesses.
+ */
+ shared = (total[0] >= total[1] / 2);
+ if (shared != task_numa_shared(p)) {
+ sched_setnuma(p, p->numa_max_node, shared);
+ p->numa_migrate_seq = 0;
+ }
+}
+
+/*
+ * Got a PROT_NONE fault for a page on @node.
+ */
+void task_numa_fault(int node, int last_cpu, int pages)
+{
+ struct task_struct *p = current;
+ int priv = (task_cpu(p) == last_cpu);
+
+ if (unlikely(!p->numa_faults)) {
+ int size = sizeof(*p->numa_faults) * 2 * nr_node_ids;
+
+ p->numa_faults = kzalloc(size, GFP_KERNEL);
+ if (!p->numa_faults)
+ return;
+ }
+
+ task_numa_placement(p);
+ p->numa_faults[2*node + priv] += pages;
+}
+
+/*
+ * The expensive part of numa migration is done from task_work context.
+ * Triggered from task_tick_numa().
+ */
+void task_numa_work(struct callback_head *work)
+{
+ unsigned long migrate, next_scan, now = jiffies;
+ struct task_struct *p = current;
+ struct mm_struct *mm = p->mm;
+
+ WARN_ON_ONCE(p != container_of(work, struct task_struct, numa_work));
+
+ work->next = work; /* protect against double add */
+ /*
+ * Who cares about NUMA placement when they're dying.
+ *
+ * NOTE: make sure not to dereference p->mm before this check,
+ * exit_task_work() happens _after_ exit_mm() so we could be called
+ * without p->mm even though we still had it when we enqueued this
+ * work.
+ */
+ if (p->flags & PF_EXITING)
+ return;
+
+ /*
+ * Enforce maximal scan/migration frequency..
+ */
+ migrate = mm->numa_next_scan;
+ if (time_before(now, migrate))
+ return;
+
+ next_scan = now + 2*msecs_to_jiffies(sysctl_sched_numa_scan_period_min);
+ if (cmpxchg(&mm->numa_next_scan, migrate, next_scan) != migrate)
+ return;
+
+ ACCESS_ONCE(mm->numa_scan_seq)++;
+ {
+ struct vm_area_struct *vma;
+
+ down_write(&mm->mmap_sem);
+ for (vma = mm->mmap; vma; vma = vma->vm_next) {
+ if (!vma_migratable(vma))
+ continue;
+ change_prot_numa(vma, vma->vm_start, vma->vm_end);
+ }
+ up_write(&mm->mmap_sem);
+ }
+}
+
+/*
+ * Drive the periodic memory faults..
+ */
+void task_tick_numa(struct rq *rq, struct task_struct *curr)
+{
+ struct callback_head *work = &curr->numa_work;
+ u64 period, now;
+
+ /*
+ * We don't care about NUMA placement if we don't have memory.
+ */
+ if (!curr->mm || (curr->flags & PF_EXITING) || work->next != work)
+ return;
+
+ /*
+ * Using runtime rather than walltime has the dual advantage that
+ * we (mostly) drive the selection from busy threads and that the
+ * task needs to have done some actual work before we bother with
+ * NUMA placement.
+ */
+ now = curr->se.sum_exec_runtime;
+ period = (u64)curr->numa_scan_period * NSEC_PER_MSEC;
+
+ if (now - curr->node_stamp > period) {
+ curr->node_stamp = now;
+
+ /*
+ * We are comparing runtime to wall clock time here, which
+ * puts a maximum scan frequency limit on the task work.
+ *
+ * This, together with the limits in task_numa_work() filters
+ * us from over-sampling if there are many threads: if all
+ * threads happen to come in at the same time we don't create a
+ * spike in overhead.
+ *
+ * We also avoid multiple threads scanning at once in parallel to
+ * each other.
+ */
+ if (!time_before(jiffies, curr->mm->numa_next_scan)) {
+ init_task_work(work, task_numa_work); /* TODO: move this into sched_fork() */
+ task_work_add(curr, work, true);
+ }
+ }
+}
+#else /* !CONFIG_NUMA_BALANCING: */
+#ifdef CONFIG_SMP
+static inline void account_numa_enqueue(struct rq *rq, struct task_struct *p) { }
+#endif
+static inline void account_numa_dequeue(struct rq *rq, struct task_struct *p) { }
+static inline void task_tick_numa(struct rq *rq, struct task_struct *curr) { }
+static inline void task_numa_migrate(struct task_struct *p, int next_cpu) { }
+#endif /* CONFIG_NUMA_BALANCING */
+
+/**************************************************
* Scheduling class queueing methods:
*/

@@ -784,9 +1016,13 @@ account_entity_enqueue(struct cfs_rq *cfs_rq, struct sched_entity *se)
if (!parent_entity(se))
update_load_add(&rq_of(cfs_rq)->load, se->load.weight);
#ifdef CONFIG_SMP
- if (entity_is_task(se))
- list_add(&se->group_node, &rq_of(cfs_rq)->cfs_tasks);
-#endif
+ if (entity_is_task(se)) {
+ struct rq *rq = rq_of(cfs_rq);
+
+ account_numa_enqueue(rq, task_of(se));
+ list_add(&se->group_node, &rq->cfs_tasks);
+ }
+#endif /* CONFIG_SMP */
cfs_rq->nr_running++;
}

@@ -796,8 +1032,10 @@ account_entity_dequeue(struct cfs_rq *cfs_rq, struct sched_entity *se)
update_load_sub(&cfs_rq->load, se->load.weight);
if (!parent_entity(se))
update_load_sub(&rq_of(cfs_rq)->load, se->load.weight);
- if (entity_is_task(se))
+ if (entity_is_task(se)) {
list_del_init(&se->group_node);
+ account_numa_dequeue(rq_of(cfs_rq), task_of(se));
+ }
cfs_rq->nr_running--;
}

@@ -3177,20 +3415,8 @@ unlock:
return new_cpu;
}

-/*
- * Load-tracking only depends on SMP, FAIR_GROUP_SCHED dependency below may be
- * removed when useful for applications beyond shares distribution (e.g.
- * load-balance).
- */
#ifdef CONFIG_FAIR_GROUP_SCHED
-/*
- * Called immediately before a task is migrated to a new cpu; task_cpu(p) and
- * cfs_rq_of(p) references at time of call are still valid and identify the
- * previous cpu. However, the caller only guarantees p->pi_lock is held; no
- * other assumptions, including the state of rq->lock, should be made.
- */
-static void
-migrate_task_rq_fair(struct task_struct *p, int next_cpu)
+static void migrate_task_rq_entity(struct task_struct *p, int next_cpu)
{
struct sched_entity *se = &p->se;
struct cfs_rq *cfs_rq = cfs_rq_of(se);
@@ -3206,7 +3432,27 @@ migrate_task_rq_fair(struct task_struct *p, int next_cpu)
atomic64_add(se->avg.load_avg_contrib, &cfs_rq->removed_load);
}
}
+#else
+static void migrate_task_rq_entity(struct task_struct *p, int next_cpu) { }
#endif
+
+/*
+ * Load-tracking only depends on SMP, FAIR_GROUP_SCHED dependency below may be
+ * removed when useful for applications beyond shares distribution (e.g.
+ * load-balance).
+ */
+/*
+ * Called immediately before a task is migrated to a new cpu; task_cpu(p) and
+ * cfs_rq_of(p) references at time of call are still valid and identify the
+ * previous cpu. However, the caller only guarantees p->pi_lock is held; no
+ * other assumptions, including the state of rq->lock, should be made.
+ */
+static void
+migrate_task_rq_fair(struct task_struct *p, int next_cpu)
+{
+ migrate_task_rq_entity(p, next_cpu);
+ task_numa_migrate(p, next_cpu);
+}
#endif /* CONFIG_SMP */

static unsigned long
@@ -3580,7 +3826,10 @@ static unsigned long __read_mostly max_load_balance_interval = HZ/10;

#define LBF_ALL_PINNED 0x01
#define LBF_NEED_BREAK 0x02
-#define LBF_SOME_PINNED 0x04
+#define LBF_SOME_PINNED 0x04
+#define LBF_NUMA_RUN 0x08
+#define LBF_NUMA_SHARED 0x10
+#define LBF_KEEP_SHARED 0x20

struct lb_env {
struct sched_domain *sd;
@@ -3599,6 +3848,8 @@ struct lb_env {
struct cpumask *cpus;

unsigned int flags;
+ unsigned int failed;
+ unsigned int iteration;

unsigned int loop;
unsigned int loop_break;
@@ -3620,11 +3871,87 @@ static void move_task(struct task_struct *p, struct lb_env *env)
check_preempt_curr(env->dst_rq, p, 0);
}

+#ifdef CONFIG_NUMA_BALANCING
+
+static inline unsigned long task_node_faults(struct task_struct *p, int node)
+{
+ return p->numa_faults[2*node] + p->numa_faults[2*node + 1];
+}
+
+static int task_faults_down(struct task_struct *p, struct lb_env *env)
+{
+ int src_node, dst_node, node, down_node = -1;
+ unsigned long faults, src_faults, max_faults = 0;
+
+ if (!sched_feat_numa(NUMA_FAULTS_DOWN) || !p->numa_faults)
+ return 1;
+
+ src_node = cpu_to_node(env->src_cpu);
+ dst_node = cpu_to_node(env->dst_cpu);
+
+ if (src_node == dst_node)
+ return 1;
+
+ src_faults = task_node_faults(p, src_node);
+
+ for (node = 0; node < nr_node_ids; node++) {
+ if (node == src_node)
+ continue;
+
+ faults = task_node_faults(p, node);
+
+ if (faults > max_faults && faults <= src_faults) {
+ max_faults = faults;
+ down_node = node;
+ }
+ }
+
+ if (down_node == dst_node)
+ return 1; /* move towards the next node down */
+
+ return 0; /* stay here */
+}
+
+static int task_faults_up(struct task_struct *p, struct lb_env *env)
+{
+ unsigned long src_faults, dst_faults;
+ int src_node, dst_node;
+
+ if (!sched_feat_numa(NUMA_FAULTS_UP) || !p->numa_faults)
+ return 0; /* can't say it improved */
+
+ src_node = cpu_to_node(env->src_cpu);
+ dst_node = cpu_to_node(env->dst_cpu);
+
+ if (src_node == dst_node)
+ return 0; /* pointless, don't do that */
+
+ src_faults = task_node_faults(p, src_node);
+ dst_faults = task_node_faults(p, dst_node);
+
+ if (dst_faults > src_faults)
+ return 1; /* move to dst */
+
+ return 0; /* stay where we are */
+}
+
+#else /* !CONFIG_NUMA_BALANCING: */
+static inline int task_faults_up(struct task_struct *p, struct lb_env *env)
+{
+ return 0;
+}
+
+static inline int task_faults_down(struct task_struct *p, struct lb_env *env)
+{
+ return 0;
+}
+#endif
+
/*
* Is this task likely cache-hot:
*/
static int
-task_hot(struct task_struct *p, u64 now, struct sched_domain *sd)
+task_hot(struct task_struct *p, struct lb_env *env)
{
s64 delta;

@@ -3647,80 +3974,153 @@ task_hot(struct task_struct *p, u64 now, struct sched_domain *sd)
if (sysctl_sched_migration_cost == 0)
return 0;

- delta = now - p->se.exec_start;
+ delta = env->src_rq->clock_task - p->se.exec_start;

return delta < (s64)sysctl_sched_migration_cost;
}

/*
- * can_migrate_task - may task p from runqueue rq be migrated to this_cpu?
+ * We do not migrate tasks that cannot be migrated to this CPU
+ * due to cpus_allowed.
+ *
+ * NOTE: this function has env-> side effects, to help the balancing
+ * of pinned tasks.
*/
-static
-int can_migrate_task(struct task_struct *p, struct lb_env *env)
+static bool can_migrate_pinned_task(struct task_struct *p, struct lb_env *env)
{
- int tsk_cache_hot = 0;
+ int new_dst_cpu;
+
+ if (cpumask_test_cpu(env->dst_cpu, tsk_cpus_allowed(p)))
+ return true;
+
+ schedstat_inc(p, se.statistics.nr_failed_migrations_affine);
+
/*
- * We do not migrate tasks that are:
- * 1) running (obviously), or
- * 2) cannot be migrated to this CPU due to cpus_allowed, or
- * 3) are cache-hot on their current CPU.
+ * Remember if this task can be migrated to any other cpu in
+ * our sched_group. We may want to revisit it if we couldn't
+ * meet load balance goals by pulling other tasks on src_cpu.
+ *
+ * Also avoid computing new_dst_cpu if we have already computed
+ * one in current iteration.
*/
- if (!cpumask_test_cpu(env->dst_cpu, tsk_cpus_allowed(p))) {
- int new_dst_cpu;
-
- schedstat_inc(p, se.statistics.nr_failed_migrations_affine);
+ if (!env->dst_grpmask || (env->flags & LBF_SOME_PINNED))
+ return false;

- /*
- * Remember if this task can be migrated to any other cpu in
- * our sched_group. We may want to revisit it if we couldn't
- * meet load balance goals by pulling other tasks on src_cpu.
- *
- * Also avoid computing new_dst_cpu if we have already computed
- * one in current iteration.
- */
- if (!env->dst_grpmask || (env->flags & LBF_SOME_PINNED))
- return 0;
-
- new_dst_cpu = cpumask_first_and(env->dst_grpmask,
- tsk_cpus_allowed(p));
- if (new_dst_cpu < nr_cpu_ids) {
- env->flags |= LBF_SOME_PINNED;
- env->new_dst_cpu = new_dst_cpu;
- }
- return 0;
+ new_dst_cpu = cpumask_first_and(env->dst_grpmask, tsk_cpus_allowed(p));
+ if (new_dst_cpu < nr_cpu_ids) {
+ env->flags |= LBF_SOME_PINNED;
+ env->new_dst_cpu = new_dst_cpu;
}
+ return false;
+}

- /* Record that we found atleast one task that could run on dst_cpu */
- env->flags &= ~LBF_ALL_PINNED;
+/*
+ * We cannot (easily) migrate tasks that are currently running:
+ */
+static bool can_migrate_running_task(struct task_struct *p, struct lb_env *env)
+{
+ if (!task_running(env->src_rq, p))
+ return true;

- if (task_running(env->src_rq, p)) {
- schedstat_inc(p, se.statistics.nr_failed_migrations_running);
- return 0;
- }
+ schedstat_inc(p, se.statistics.nr_failed_migrations_running);
+ return false;
+}

+/*
+ * Can we migrate a NUMA task? The rules are rather involved:
+ */
+static bool can_migrate_numa_task(struct task_struct *p, struct lb_env *env)
+{
/*
- * Aggressive migration if:
- * 1) task is cache cold, or
- * 2) too many balance attempts have failed.
+ * iteration:
+ * 0 -- only allow improvement, or !numa
+ * 1 -- + worsen !ideal
+ * 2 priv
+ * 3 shared (everything)
+ *
+ * NUMA_HOT_DOWN:
+ * 1 .. nodes -- allow getting worse by step
+ * nodes+1 -- punt, everything goes!
+ *
+ * LBF_NUMA_RUN -- numa only, only allow improvement
+ * LBF_NUMA_SHARED -- shared only
+ *
+ * LBF_KEEP_SHARED -- do not touch shared tasks
*/

- tsk_cache_hot = task_hot(p, env->src_rq->clock_task, env->sd);
- if (!tsk_cache_hot ||
- env->sd->nr_balance_failed > env->sd->cache_nice_tries) {
-#ifdef CONFIG_SCHEDSTATS
- if (tsk_cache_hot) {
- schedstat_inc(env->sd, lb_hot_gained[env->idle]);
- schedstat_inc(p, se.statistics.nr_forced_migrations);
- }
-#endif
- return 1;
+ /* a numa run can only move numa tasks about to improve things */
+ if (env->flags & LBF_NUMA_RUN) {
+ if (task_numa_shared(p) < 0)
+ return false;
+ /* can only pull shared tasks */
+ if ((env->flags & LBF_NUMA_SHARED) && !task_numa_shared(p))
+ return false;
+ } else {
+ if (task_numa_shared(p) < 0)
+ goto try_migrate;
}

- if (tsk_cache_hot) {
- schedstat_inc(p, se.statistics.nr_failed_migrations_hot);
- return 0;
- }
- return 1;
+ /* can not move shared tasks */
+ if ((env->flags & LBF_KEEP_SHARED) && task_numa_shared(p) == 1)
+ return false;
+
+ if (task_faults_up(p, env))
+ return true; /* memory locality beats cache hotness */
+
+ if (env->iteration < 1)
+ return false;
+
+#ifdef CONFIG_NUMA_BALANCING
+ if (p->numa_max_node != cpu_to_node(task_cpu(p))) /* !ideal */
+ goto demote;
+#endif
+
+ if (env->iteration < 2)
+ return false;
+
+ if (task_numa_shared(p) == 0) /* private */
+ goto demote;
+
+ if (env->iteration < 3)
+ return false;
+
+demote:
+ if (env->iteration < 5)
+ return task_faults_down(p, env);
+
+try_migrate:
+ if (env->failed > env->sd->cache_nice_tries)
+ return true;
+
+ return !task_hot(p, env);
+}
+
+/*
+ * can_migrate_task() - may task p from runqueue rq be migrated to this_cpu?
+ */
+static int can_migrate_task(struct task_struct *p, struct lb_env *env)
+{
+ if (!can_migrate_pinned_task(p, env))
+ return false;
+
+ /* Record that we found atleast one task that could run on dst_cpu */
+ env->flags &= ~LBF_ALL_PINNED;
+
+ if (!can_migrate_running_task(p, env))
+ return false;
+
+ if (env->sd->flags & SD_NUMA)
+ return can_migrate_numa_task(p, env);
+
+ if (env->failed > env->sd->cache_nice_tries)
+ return true;
+
+ if (!task_hot(p, env))
+ return true;
+
+ schedstat_inc(p, se.statistics.nr_failed_migrations_hot);
+
+ return false;
}

/*
@@ -3735,6 +4135,7 @@ static int move_one_task(struct lb_env *env)
struct task_struct *p, *n;

list_for_each_entry_safe(p, n, &env->src_rq->cfs_tasks, se.group_node) {
+
if (throttled_lb_pair(task_group(p), env->src_rq->cpu, env->dst_cpu))
continue;

@@ -3742,6 +4143,7 @@ static int move_one_task(struct lb_env *env)
continue;

move_task(p, env);
+
/*
* Right now, this is only the second place move_task()
* is called, so we can safely collect move_task()
@@ -3753,8 +4155,6 @@ static int move_one_task(struct lb_env *env)
return 0;
}

-static unsigned long task_h_load(struct task_struct *p);
-
static const unsigned int sched_nr_migrate_break = 32;

/*
@@ -3766,7 +4166,6 @@ static const unsigned int sched_nr_migrate_break = 32;
*/
static int move_tasks(struct lb_env *env)
{
- struct list_head *tasks = &env->src_rq->cfs_tasks;
struct task_struct *p;
unsigned long load;
int pulled = 0;
@@ -3774,8 +4173,8 @@ static int move_tasks(struct lb_env *env)
if (env->imbalance <= 0)
return 0;

- while (!list_empty(tasks)) {
- p = list_first_entry(tasks, struct task_struct, se.group_node);
+ while (!list_empty(&env->src_rq->cfs_tasks)) {
+ p = list_first_entry(&env->src_rq->cfs_tasks, struct task_struct, se.group_node);

env->loop++;
/* We've more or less seen every task there is, call it quits */
@@ -3786,7 +4185,7 @@ static int move_tasks(struct lb_env *env)
if (env->loop > env->loop_break) {
env->loop_break += sched_nr_migrate_break;
env->flags |= LBF_NEED_BREAK;
- break;
+ goto out;
}

if (throttled_lb_pair(task_group(p), env->src_cpu, env->dst_cpu))
@@ -3794,7 +4193,7 @@ static int move_tasks(struct lb_env *env)

load = task_h_load(p);

- if (sched_feat(LB_MIN) && load < 16 && !env->sd->nr_balance_failed)
+ if (sched_feat(LB_MIN) && load < 16 && !env->failed)
goto next;

if ((load / 2) > env->imbalance)
@@ -3814,7 +4213,7 @@ static int move_tasks(struct lb_env *env)
* the critical section.
*/
if (env->idle == CPU_NEWLY_IDLE)
- break;
+ goto out;
#endif

/*
@@ -3822,13 +4221,13 @@ static int move_tasks(struct lb_env *env)
* weighted load.
*/
if (env->imbalance <= 0)
- break;
+ goto out;

continue;
next:
- list_move_tail(&p->se.group_node, tasks);
+ list_move_tail(&p->se.group_node, &env->src_rq->cfs_tasks);
}
-
+out:
/*
* Right now, this is one of only two places move_task() is called,
* so we can safely collect move_task() stats here rather than
@@ -3953,17 +4352,18 @@ static inline void update_blocked_averages(int cpu)
static inline void update_h_load(long cpu)
{
}
-
+#ifdef CONFIG_SMP
static unsigned long task_h_load(struct task_struct *p)
{
return p->se.load.weight;
}
#endif
+#endif

/********** Helpers for find_busiest_group ************************/
/*
* sd_lb_stats - Structure to store the statistics of a sched_domain
- * during load balancing.
+ * during load balancing.
*/
struct sd_lb_stats {
struct sched_group *busiest; /* Busiest group in this sd */
@@ -3976,7 +4376,7 @@ struct sd_lb_stats {
unsigned long this_load;
unsigned long this_load_per_task;
unsigned long this_nr_running;
- unsigned long this_has_capacity;
+ unsigned int this_has_capacity;
unsigned int this_idle_cpus;

/* Statistics of the busiest group */
@@ -3985,10 +4385,28 @@ struct sd_lb_stats {
unsigned long busiest_load_per_task;
unsigned long busiest_nr_running;
unsigned long busiest_group_capacity;
- unsigned long busiest_has_capacity;
+ unsigned int busiest_has_capacity;
unsigned int busiest_group_weight;

int group_imb; /* Is there imbalance in this sd */
+
+#ifdef CONFIG_NUMA_BALANCING
+ unsigned long this_numa_running;
+ unsigned long this_numa_weight;
+ unsigned long this_shared_running;
+ unsigned long this_ideal_running;
+ unsigned long this_group_capacity;
+
+ struct sched_group *numa;
+ unsigned long numa_load;
+ unsigned long numa_nr_running;
+ unsigned long numa_numa_running;
+ unsigned long numa_shared_running;
+ unsigned long numa_ideal_running;
+ unsigned long numa_numa_weight;
+ unsigned long numa_group_capacity;
+ unsigned int numa_has_capacity;
+#endif
};

/*
@@ -4004,6 +4422,13 @@ struct sg_lb_stats {
unsigned long group_weight;
int group_imb; /* Is there an imbalance in the group ? */
int group_has_capacity; /* Is there extra capacity in the group? */
+
+#ifdef CONFIG_NUMA_BALANCING
+ unsigned long sum_ideal_running;
+ unsigned long sum_numa_running;
+ unsigned long sum_numa_weight;
+#endif
+ unsigned long sum_shared_running; /* 0 on non-NUMA */
};

/**
@@ -4032,6 +4457,151 @@ static inline int get_sd_load_idx(struct sched_domain *sd,
return load_idx;
}

+#ifdef CONFIG_NUMA_BALANCING
+
+static inline bool pick_numa_rand(int n)
+{
+ return !(get_random_int() % n);
+}
+
+static inline void update_sg_numa_stats(struct sg_lb_stats *sgs, struct rq *rq)
+{
+ sgs->sum_ideal_running += rq->nr_ideal_running;
+ sgs->sum_shared_running += rq->nr_shared_running;
+ sgs->sum_numa_running += rq->nr_numa_running;
+ sgs->sum_numa_weight += rq->numa_weight;
+}
+
+static inline
+void update_sd_numa_stats(struct sched_domain *sd, struct sched_group *sg,
+ struct sd_lb_stats *sds, struct sg_lb_stats *sgs,
+ int local_group)
+{
+ if (!(sd->flags & SD_NUMA))
+ return;
+
+ if (local_group) {
+ sds->this_numa_running = sgs->sum_numa_running;
+ sds->this_numa_weight = sgs->sum_numa_weight;
+ sds->this_shared_running = sgs->sum_shared_running;
+ sds->this_ideal_running = sgs->sum_ideal_running;
+ sds->this_group_capacity = sgs->group_capacity;
+
+ } else if (sgs->sum_numa_running - sgs->sum_ideal_running) {
+ if (!sds->numa || pick_numa_rand(sd->span_weight / sg->group_weight)) {
+ sds->numa = sg;
+ sds->numa_load = sgs->avg_load;
+ sds->numa_nr_running = sgs->sum_nr_running;
+ sds->numa_numa_running = sgs->sum_numa_running;
+ sds->numa_shared_running = sgs->sum_shared_running;
+ sds->numa_ideal_running = sgs->sum_ideal_running;
+ sds->numa_numa_weight = sgs->sum_numa_weight;
+ sds->numa_has_capacity = sgs->group_has_capacity;
+ sds->numa_group_capacity = sgs->group_capacity;
+ }
+ }
+}
+
+static struct rq *
+find_busiest_numa_queue(struct lb_env *env, struct sched_group *sg)
+{
+ struct rq *rq, *busiest = NULL;
+ int cpu;
+
+ for_each_cpu_and(cpu, sched_group_cpus(sg), env->cpus) {
+ rq = cpu_rq(cpu);
+
+ if (!rq->nr_numa_running)
+ continue;
+
+ if (!(rq->nr_numa_running - rq->nr_ideal_running))
+ continue;
+
+ if ((env->flags & LBF_KEEP_SHARED) && !(rq->nr_running - rq->nr_shared_running))
+ continue;
+
+ if (!busiest || pick_numa_rand(sg->group_weight))
+ busiest = rq;
+ }
+
+ return busiest;
+}
+
+static bool can_do_numa_run(struct lb_env *env, struct sd_lb_stats *sds)
+{
+ /*
+ * if we're overloaded; don't pull when:
+ * - the other guy isn't
+ * - imbalance would become too great
+ */
+ if (!sds->this_has_capacity) {
+ if (sds->numa_has_capacity)
+ return false;
+ }
+
+ /*
+ * pull if we got easy trade
+ */
+ if (sds->this_nr_running - sds->this_numa_running)
+ return true;
+
+ /*
+ * If we got capacity allow stacking up on shared tasks.
+ */
+ if ((sds->this_shared_running < sds->this_group_capacity) && sds->numa_shared_running) {
+ env->flags |= LBF_NUMA_SHARED;
+ return true;
+ }
+
+ /*
+ * pull if we could possibly trade
+ */
+ if (sds->this_numa_running - sds->this_ideal_running)
+ return true;
+
+ return false;
+}
+
+/*
+ * Introduce some controlled imbalance to perturb the state so we allow the
+ * state to improve; this should be tightly controlled/co-ordinated with
+ * can_migrate_task().
+ */
+static int check_numa_busiest_group(struct lb_env *env, struct sd_lb_stats *sds)
+{
+ if (!sds->numa || !sds->numa_numa_running)
+ return 0;
+
+ if (!can_do_numa_run(env, sds))
+ return 0;
+
+ env->flags |= LBF_NUMA_RUN;
+ env->flags &= ~LBF_KEEP_SHARED;
+ env->imbalance = sds->numa_numa_weight / sds->numa_numa_running;
+ sds->busiest = sds->numa;
+ env->find_busiest_queue = find_busiest_numa_queue;
+
+ return 1;
+}
+
+#else /* !CONFIG_NUMA_BALANCING: */
+static inline
+void update_sd_numa_stats(struct sched_domain *sd, struct sched_group *sg,
+ struct sd_lb_stats *sds, struct sg_lb_stats *sgs,
+ int local_group)
+{
+}
+
+static inline void update_sg_numa_stats(struct sg_lb_stats *sgs, struct rq *rq)
+{
+}
+
+static inline int check_numa_busiest_group(struct lb_env *env, struct sd_lb_stats *sds)
+{
+ return 0;
+}
+#endif
+
unsigned long default_scale_freq_power(struct sched_domain *sd, int cpu)
{
return SCHED_POWER_SCALE;
@@ -4245,6 +4815,9 @@ static inline void update_sg_lb_stats(struct lb_env *env,
sgs->group_load += load;
sgs->sum_nr_running += nr_running;
sgs->sum_weighted_load += weighted_cpuload(i);
+
+ update_sg_numa_stats(sgs, rq);
+
if (idle_cpu(i))
sgs->idle_cpus++;
}
@@ -4336,6 +4909,13 @@ static bool update_sd_pick_busiest(struct lb_env *env,
return false;
}

+static void update_src_keep_shared(struct lb_env *env, bool keep_shared)
+{
+ env->flags &= ~LBF_KEEP_SHARED;
+ if (keep_shared)
+ env->flags |= LBF_KEEP_SHARED;
+}
+
/**
* update_sd_lb_stats - Update sched_domain's statistics for load balancing.
* @env: The load balancing environment.
@@ -4368,6 +4948,7 @@ static inline void update_sd_lb_stats(struct lb_env *env,
sds->total_load += sgs.group_load;
sds->total_pwr += sg->sgp->power;

+#ifdef CONFIG_NUMA_BALANCING
/*
* In case the child domain prefers tasks go to siblings
* first, lower the sg capacity to one so that we'll try
@@ -4378,8 +4959,11 @@ static inline void update_sd_lb_stats(struct lb_env *env,
* heaviest group when it is already under-utilized (possible
* with a large weight task outweighs the tasks on the system).
*/
- if (prefer_sibling && !local_group && sds->this_has_capacity)
- sgs.group_capacity = min(sgs.group_capacity, 1UL);
+ if (0 && prefer_sibling && !local_group && sds->this_has_capacity) {
+ sgs.group_capacity = clamp_val(sgs.sum_shared_running,
+ 1UL, sgs.group_capacity);
+ }
+#endif

if (local_group) {
sds->this_load = sgs.avg_load;
@@ -4398,8 +4982,13 @@ static inline void update_sd_lb_stats(struct lb_env *env,
sds->busiest_has_capacity = sgs.group_has_capacity;
sds->busiest_group_weight = sgs.group_weight;
sds->group_imb = sgs.group_imb;
+
+ update_src_keep_shared(env,
+ sgs.sum_shared_running <= sgs.group_capacity);
}

+ update_sd_numa_stats(env->sd, sg, sds, &sgs, local_group);
+
sg = sg->next;
} while (sg != env->sd->groups);
}
@@ -4652,14 +5241,14 @@ find_busiest_group(struct lb_env *env, int *balance)
* don't try and pull any tasks.
*/
if (sds.this_load >= sds.max_load)
- goto out_balanced;
+ goto out_imbalanced;

/*
* Don't pull any tasks if this group is already above the domain
* average load.
*/
if (sds.this_load >= sds.avg_load)
- goto out_balanced;
+ goto out_imbalanced;

if (env->idle == CPU_IDLE) {
/*
@@ -4685,9 +5274,18 @@ force_balance:
calculate_imbalance(env, &sds);
return sds.busiest;

+out_imbalanced:
+ /* if we've got capacity allow for secondary placement preference */
+ if (!sds.this_has_capacity)
+ goto ret;
+
out_balanced:
+ if (check_numa_busiest_group(env, &sds))
+ return sds.busiest;
+
ret:
env->imbalance = 0;
+
return NULL;
}

@@ -4723,6 +5321,9 @@ static struct rq *find_busiest_queue(struct lb_env *env,
if (capacity && rq->nr_running == 1 && wl > env->imbalance)
continue;

+ if ((env->flags & LBF_KEEP_SHARED) && !(rq->nr_running - rq->nr_shared_running))
+ continue;
+
/*
* For the load comparisons with the other cpu's, consider
* the weighted_cpuload() scaled with the cpu power, so that
@@ -4749,25 +5350,40 @@ static struct rq *find_busiest_queue(struct lb_env *env,
/* Working cpumask for load_balance and load_balance_newidle. */
DEFINE_PER_CPU(cpumask_var_t, load_balance_tmpmask);

-static int need_active_balance(struct lb_env *env)
-{
- struct sched_domain *sd = env->sd;
-
- if (env->idle == CPU_NEWLY_IDLE) {
+static int active_load_balance_cpu_stop(void *data);

+static void update_sd_failed(struct lb_env *env, int ld_moved)
+{
+ if (!ld_moved) {
+ schedstat_inc(env->sd, lb_failed[env->idle]);
/*
- * ASYM_PACKING needs to force migrate tasks from busy but
- * higher numbered CPUs in order to pack all tasks in the
- * lowest numbered CPUs.
+ * Increment the failure counter only on periodic balance.
+ * We do not want newidle balance, which can be very
+ * frequent, pollute the failure counter causing
+ * excessive cache_hot migrations and active balances.
*/
- if ((sd->flags & SD_ASYM_PACKING) && env->src_cpu > env->dst_cpu)
- return 1;
- }
-
- return unlikely(sd->nr_balance_failed > sd->cache_nice_tries+2);
+ if (env->idle != CPU_NEWLY_IDLE && !(env->flags & LBF_NUMA_RUN))
+ env->sd->nr_balance_failed++;
+ } else
+ env->sd->nr_balance_failed = 0;
}

-static int active_load_balance_cpu_stop(void *data);
+/*
+ * See can_migrate_numa_task()
+ */
+static int lb_max_iteration(struct lb_env *env)
+{
+ if (!(env->sd->flags & SD_NUMA))
+ return 0;
+
+ if (env->flags & LBF_NUMA_RUN)
+ return 0; /* NUMA_RUN may only improve */
+
+ if (sched_feat_numa(NUMA_FAULTS_DOWN))
+ return 5; /* nodes^2 would suck */
+
+ return 3;
+}

/*
* Check this_cpu to ensure it is balanced within domain. Attempt to move
@@ -4793,6 +5409,8 @@ static int load_balance(int this_cpu, struct rq *this_rq,
.loop_break = sched_nr_migrate_break,
.cpus = cpus,
.find_busiest_queue = find_busiest_queue,
+ .failed = sd->nr_balance_failed,
+ .iteration = 0,
};

cpumask_copy(cpus, cpu_active_mask);
@@ -4816,6 +5434,8 @@ redo:
schedstat_inc(sd, lb_nobusyq[idle]);
goto out_balanced;
}
+ env.src_rq = busiest;
+ env.src_cpu = busiest->cpu;

BUG_ON(busiest == env.dst_rq);

@@ -4895,92 +5515,72 @@ more_balance:
}

/* All tasks on this runqueue were pinned by CPU affinity */
- if (unlikely(env.flags & LBF_ALL_PINNED)) {
- cpumask_clear_cpu(cpu_of(busiest), cpus);
- if (!cpumask_empty(cpus)) {
- env.loop = 0;
- env.loop_break = sched_nr_migrate_break;
- goto redo;
- }
- goto out_balanced;
+ if (unlikely(env.flags & LBF_ALL_PINNED))
+ goto out_pinned;
+
+ if (!ld_moved && env.iteration < lb_max_iteration(&env)) {
+ env.iteration++;
+ env.loop = 0;
+ goto more_balance;
}
}

- if (!ld_moved) {
- schedstat_inc(sd, lb_failed[idle]);
+ if (!ld_moved && idle != CPU_NEWLY_IDLE) {
+ raw_spin_lock_irqsave(&busiest->lock, flags);
+
/*
- * Increment the failure counter only on periodic balance.
- * We do not want newidle balance, which can be very
- * frequent, pollute the failure counter causing
- * excessive cache_hot migrations and active balances.
+ * Don't kick the active_load_balance_cpu_stop,
+ * if the curr task on busiest cpu can't be
+ * moved to this_cpu
*/
- if (idle != CPU_NEWLY_IDLE)
- sd->nr_balance_failed++;
-
- if (need_active_balance(&env)) {
- raw_spin_lock_irqsave(&busiest->lock, flags);
-
- /* don't kick the active_load_balance_cpu_stop,
- * if the curr task on busiest cpu can't be
- * moved to this_cpu
- */
- if (!cpumask_test_cpu(this_cpu,
- tsk_cpus_allowed(busiest->curr))) {
- raw_spin_unlock_irqrestore(&busiest->lock,
- flags);
- env.flags |= LBF_ALL_PINNED;
- goto out_one_pinned;
- }
-
- /*
- * ->active_balance synchronizes accesses to
- * ->active_balance_work. Once set, it's cleared
- * only after active load balance is finished.
- */
- if (!busiest->active_balance) {
- busiest->active_balance = 1;
- busiest->push_cpu = this_cpu;
- active_balance = 1;
- }
+ if (!cpumask_test_cpu(this_cpu, tsk_cpus_allowed(busiest->curr))) {
raw_spin_unlock_irqrestore(&busiest->lock, flags);
-
- if (active_balance) {
- stop_one_cpu_nowait(cpu_of(busiest),
- active_load_balance_cpu_stop, busiest,
- &busiest->active_balance_work);
- }
-
- /*
- * We've kicked active balancing, reset the failure
- * counter.
- */
- sd->nr_balance_failed = sd->cache_nice_tries+1;
+ env.flags |= LBF_ALL_PINNED;
+ goto out_pinned;
}
- } else
- sd->nr_balance_failed = 0;

- if (likely(!active_balance)) {
- /* We were unbalanced, so reset the balancing interval */
- sd->balance_interval = sd->min_interval;
- } else {
/*
- * If we've begun active balancing, start to back off. This
- * case may not be covered by the all_pinned logic if there
- * is only 1 task on the busy runqueue (because we don't call
- * move_tasks).
+ * ->active_balance synchronizes accesses to
+ * ->active_balance_work. Once set, it's cleared
+ * only after active load balance is finished.
*/
- if (sd->balance_interval < sd->max_interval)
- sd->balance_interval *= 2;
+ if (!busiest->active_balance) {
+ busiest->active_balance = 1;
+ busiest->ab_dst_cpu = this_cpu;
+ busiest->ab_flags = env.flags;
+ busiest->ab_failed = env.failed;
+ busiest->ab_idle = env.idle;
+ active_balance = 1;
+ }
+ raw_spin_unlock_irqrestore(&busiest->lock, flags);
+
+ if (active_balance) {
+ stop_one_cpu_nowait(cpu_of(busiest),
+ active_load_balance_cpu_stop, busiest,
+ &busiest->ab_work);
+ }
}

- goto out;
+ if (!active_balance)
+ update_sd_failed(&env, ld_moved);
+
+ sd->balance_interval = sd->min_interval;
+out:
+ return ld_moved;
+
+out_pinned:
+ cpumask_clear_cpu(cpu_of(busiest), cpus);
+ if (!cpumask_empty(cpus)) {
+ env.loop = 0;
+ env.loop_break = sched_nr_migrate_break;
+ goto redo;
+ }

out_balanced:
schedstat_inc(sd, lb_balanced[idle]);

sd->nr_balance_failed = 0;

-out_one_pinned:
/* tune up the balancing interval */
if (((env.flags & LBF_ALL_PINNED) &&
sd->balance_interval < MAX_PINNED_INTERVAL) ||
@@ -4988,8 +5588,8 @@ out_one_pinned:
sd->balance_interval *= 2;

ld_moved = 0;
-out:
- return ld_moved;
+
+ goto out;
}

/*
@@ -5060,7 +5660,7 @@ static int active_load_balance_cpu_stop(void *data)
{
struct rq *busiest_rq = data;
int busiest_cpu = cpu_of(busiest_rq);
- int target_cpu = busiest_rq->push_cpu;
+ int target_cpu = busiest_rq->ab_dst_cpu;
struct rq *target_rq = cpu_rq(target_cpu);
struct sched_domain *sd;

@@ -5098,17 +5698,23 @@ static int active_load_balance_cpu_stop(void *data)
.sd = sd,
.dst_cpu = target_cpu,
.dst_rq = target_rq,
- .src_cpu = busiest_rq->cpu,
+ .src_cpu = busiest_cpu,
.src_rq = busiest_rq,
- .idle = CPU_IDLE,
+ .flags = busiest_rq->ab_flags,
+ .failed = busiest_rq->ab_failed,
+ .idle = busiest_rq->ab_idle,
};
+ env.iteration = lb_max_iteration(&env);

schedstat_inc(sd, alb_count);

- if (move_one_task(&env))
+ if (move_one_task(&env)) {
schedstat_inc(sd, alb_pushed);
- else
+ update_sd_failed(&env, 1);
+ } else {
schedstat_inc(sd, alb_failed);
+ update_sd_failed(&env, 0);
+ }
}
rcu_read_unlock();
double_unlock_balance(busiest_rq, target_rq);
@@ -5508,6 +6114,9 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
}

update_rq_runnable_avg(rq, 1);
+
+ if (sched_feat_numa(NUMA) && nr_node_ids > 1)
+ task_tick_numa(rq, curr);
}

/*
@@ -5902,9 +6511,7 @@ const struct sched_class fair_sched_class = {

#ifdef CONFIG_SMP
.select_task_rq = select_task_rq_fair,
-#ifdef CONFIG_FAIR_GROUP_SCHED
.migrate_task_rq = migrate_task_rq_fair,
-#endif
.rq_online = rq_online_fair,
.rq_offline = rq_offline_fair,

diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index e68e69a..a432eb8 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -66,3 +66,11 @@ SCHED_FEAT(TTWU_QUEUE, true)
SCHED_FEAT(FORCE_SD_OVERLAP, false)
SCHED_FEAT(RT_RUNTIME_SHARE, true)
SCHED_FEAT(LB_MIN, false)
+
+#ifdef CONFIG_NUMA_BALANCING
+/* Do the working set probing faults: */
+SCHED_FEAT(NUMA, true)
+SCHED_FEAT(NUMA_FAULTS_UP, true)
+SCHED_FEAT(NUMA_FAULTS_DOWN, false)
+SCHED_FEAT(NUMA_SETTLE, true)
+#endif
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 5eca173..bb9475c 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -3,6 +3,7 @@
#include <linux/mutex.h>
#include <linux/spinlock.h>
#include <linux/stop_machine.h>
+#include <linux/slab.h>

#include "cpupri.h"

@@ -420,17 +421,29 @@ struct rq {
unsigned long cpu_power;

unsigned char idle_balance;
- /* For active balancing */
int post_schedule;
+
+ /* For active balancing */
int active_balance;
- int push_cpu;
- struct cpu_stop_work active_balance_work;
+ int ab_dst_cpu;
+ int ab_flags;
+ int ab_failed;
+ int ab_idle;
+ struct cpu_stop_work ab_work;
+
/* cpu of this runqueue: */
int cpu;
int online;

struct list_head cfs_tasks;

+#ifdef CONFIG_NUMA_BALANCING
+ unsigned long numa_weight;
+ unsigned long nr_numa_running;
+ unsigned long nr_ideal_running;
+#endif
+ unsigned long nr_shared_running; /* 0 on non-NUMA */
+
u64 rt_avg;
u64 age_stamp;
u64 idle_stamp;
@@ -501,6 +514,18 @@ DECLARE_PER_CPU(struct rq, runqueues);
#define cpu_curr(cpu) (cpu_rq(cpu)->curr)
#define raw_rq() (&__raw_get_cpu_var(runqueues))

+#ifdef CONFIG_NUMA_BALANCING
+extern void sched_setnuma(struct task_struct *p, int node, int shared);
+static inline void task_numa_free(struct task_struct *p)
+{
+ kfree(p->numa_faults);
+}
+#else /* CONFIG_NUMA_BALANCING */
+static inline void task_numa_free(struct task_struct *p)
+{
+}
+#endif /* CONFIG_NUMA_BALANCING */
+
#ifdef CONFIG_SMP

#define rcu_dereference_check_sched_domain(p) \
@@ -544,6 +569,7 @@ static inline struct sched_domain *highest_flag_domain(int cpu, int flag)

DECLARE_PER_CPU(struct sched_domain *, sd_llc);
DECLARE_PER_CPU(int, sd_llc_id);
+DECLARE_PER_CPU(struct sched_domain *, sd_node);

extern int group_balance_cpu(struct sched_group *sg);

@@ -663,6 +689,12 @@ extern struct static_key sched_feat_keys[__SCHED_FEAT_NR];
#define sched_feat(x) (sysctl_sched_features & (1UL << __SCHED_FEAT_##x))
#endif /* SCHED_DEBUG && HAVE_JUMP_LABEL */

+#ifdef CONFIG_NUMA_BALANCING
+#define sched_feat_numa(x) sched_feat(x)
+#else
+#define sched_feat_numa(x) (0)
+#endif
+
static inline u64 global_rt_period(void)
{
return (u64)sysctl_sched_rt_period * NSEC_PER_USEC;
--
1.7.11.7

2012-11-22 22:51:17

by Ingo Molnar

[permalink] [raw]
Subject: [PATCH 16/33] sched, numa, mm: Add credits for NUMA placement

From: Rik van Riel <[email protected]>

The NUMA placement code has been rewritten several times, but
the basic ideas took a lot of work to develop. The people who
put in the work deserve credit for it. Thanks Andrea & Peter :)

[ The Documentation/scheduler/numa-problem.txt file should
probably be rewritten once we figure out the final details of
what the NUMA code needs to do, and why. ]

Signed-off-by: Rik van Riel <[email protected]>
Acked-by: Peter Zijlstra <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Hugh Dickins <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
----
This is against tip.git numa/core
---
CREDITS | 1 +
kernel/sched/fair.c | 3 +++
mm/memory.c | 2 ++
3 files changed, 6 insertions(+)

diff --git a/CREDITS b/CREDITS
index d8fe12a..b4cdc8f 100644
--- a/CREDITS
+++ b/CREDITS
@@ -125,6 +125,7 @@ D: Author of pscan that helps to fix lp/parport bugs
D: Author of lil (Linux Interrupt Latency benchmark)
D: Fixed the shm swap deallocation at swapoff time (try_to_unuse message)
D: VM hacker
+D: NUMA task placement
D: Various other kernel hacks
S: Imola 40026
S: Italy
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 511fbb8..8af0208 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -18,6 +18,9 @@
*
* Adaptive scheduling granularity, math enhancements by Peter Zijlstra
* Copyright (C) 2007 Red Hat, Inc., Peter Zijlstra <[email protected]>
+ *
+ * NUMA placement, statistics and algorithm by Andrea Arcangeli,
+ * CFS balancing changes by Peter Zijlstra. Copyright (C) 2012 Red Hat, Inc.
*/

#include <linux/latencytop.h>
diff --git a/mm/memory.c b/mm/memory.c
index 52ad29d..1f733dc 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -36,6 +36,8 @@
* ([email protected])
*
* Aug/Sep 2004 Changed to four level page tables (Andi Kleen)
+ *
+ * 2012 - NUMA placement page faults (Andrea Arcangeli, Peter Zijlstra)
*/

#include <linux/kernel_stat.h>
--
1.7.11.7

2012-11-22 22:51:35

by Ingo Molnar

[permalink] [raw]
Subject: [PATCH 22/33] sched: Implement slow start for working set sampling

From: Peter Zijlstra <[email protected]>

Add a 1 second delay before starting to scan the working set of
a task and starting to balance it amongst nodes.

[ note that before the constant per task WSS sampling rate patch
the initial scan would happen much later still; in effect that
patch caused this regression. ]

The theory is that short-run tasks benefit very little from NUMA
placement: they come and go, and they are better off sticking to the node
they were started on. As tasks mature and rebalance to other CPUs
and nodes, their NUMA placement has to change as well, and it starts
to matter more and more.

In practice this change fixes an observable kbuild regression:

# [ a perf stat --null --repeat 10 test of ten bzImage builds to /dev/shm ]

!NUMA:
45.291088843 seconds time elapsed ( +- 0.40% )
45.154231752 seconds time elapsed ( +- 0.36% )

+NUMA, no slow start:
46.172308123 seconds time elapsed ( +- 0.30% )
46.343168745 seconds time elapsed ( +- 0.25% )

+NUMA, 1 sec slow start:
45.224189155 seconds time elapsed ( +- 0.25% )
45.160866532 seconds time elapsed ( +- 0.17% )

and it also fixes an observable perf bench (hackbench) regression:

# perf stat --null --repeat 10 perf bench sched messaging

-NUMA: 0.246225691 seconds time elapsed ( +- 1.31% )
+NUMA no slow start: 0.252620063 seconds time elapsed ( +- 1.13% )

+NUMA 1sec delay: 0.248076230 seconds time elapsed ( +- 1.35% )

The implementation is simple and straightforward; most of the patch
deals with adding the /proc/sys/kernel/sched_numa_scan_delay_ms tunable
knob.
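
To make the slow-start timing concrete, here is a minimal user-space
sketch (an illustration of the defaults above, not code from the patch;
the struct and function names are made up):

/*
 * Minimal sketch of the slow-start timing: first scan after ~1000 ms,
 * subsequent scans every ~100 ms.
 */
#include <stdio.h>

static unsigned int scan_delay_ms      = 1000;  /* sched_numa_scan_delay_ms */
static unsigned int scan_period_min_ms = 100;   /* sched_numa_scan_period_min_ms */

struct task_sim {
        unsigned int  scan_period_ms;
        unsigned long node_stamp_ms;
};

static void fork_sim(struct task_sim *t, unsigned long now_ms)
{
        t->scan_period_ms = scan_delay_ms;      /* slow start */
        t->node_stamp_ms  = now_ms;
}

static int tick_sim(struct task_sim *t, unsigned long now_ms)
{
        if (now_ms - t->node_stamp_ms <= t->scan_period_ms)
                return 0;                       /* not yet time to scan */
        t->node_stamp_ms += t->scan_period_ms;
        t->scan_period_ms = scan_period_min_ms; /* regular rate from now on */
        return 1;                               /* would queue task_numa_work() */
}

int main(void)
{
        struct task_sim t;
        unsigned long ms;

        fork_sim(&t, 0);
        for (ms = 0; ms <= 1500; ms++) {
                if (tick_sim(&t, ms))
                        printf("scan at %lu ms\n", ms); /* 1001, 1101, 1201, ... */
        }
        return 0;
}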

Signed-off-by: Peter Zijlstra <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Hugh Dickins <[email protected]>
Link: http://lkml.kernel.org/n/[email protected]
[ Wrote the changelog, ran measurements, tuned the default. ]
Signed-off-by: Ingo Molnar <[email protected]>
---
include/linux/sched.h | 1 +
kernel/sched/core.c | 2 +-
kernel/sched/fair.c | 16 ++++++++++------
kernel/sysctl.c | 7 +++++++
4 files changed, 19 insertions(+), 7 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 3372aac..8f65323 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2045,6 +2045,7 @@ enum sched_tunable_scaling {
};
extern enum sched_tunable_scaling sysctl_sched_tunable_scaling;

+extern unsigned int sysctl_sched_numa_scan_delay;
extern unsigned int sysctl_sched_numa_scan_period_min;
extern unsigned int sysctl_sched_numa_scan_period_max;
extern unsigned int sysctl_sched_numa_scan_size;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 7b58366..af0602f 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1556,7 +1556,7 @@ static void __sched_fork(struct task_struct *p)
p->numa_scan_seq = p->mm ? p->mm->numa_scan_seq : 0;
p->numa_migrate_seq = 2;
p->numa_faults = NULL;
- p->numa_scan_period = sysctl_sched_numa_scan_period_min;
+ p->numa_scan_period = sysctl_sched_numa_scan_delay;
p->numa_work.next = &p->numa_work;
#endif /* CONFIG_NUMA_BALANCING */
}
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index da28315..8f0e6ba 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -823,11 +823,12 @@ static void account_numa_dequeue(struct rq *rq, struct task_struct *p)
}

/*
- * numa task sample period in ms: 5s
+ * Scan @scan_size MB every @scan_period after an initial @scan_delay.
*/
-unsigned int sysctl_sched_numa_scan_period_min = 100;
-unsigned int sysctl_sched_numa_scan_period_max = 100*16;
-unsigned int sysctl_sched_numa_scan_size = 256; /* MB */
+unsigned int sysctl_sched_numa_scan_delay = 1000; /* ms */
+unsigned int sysctl_sched_numa_scan_period_min = 100; /* ms */
+unsigned int sysctl_sched_numa_scan_period_max = 100*16;/* ms */
+unsigned int sysctl_sched_numa_scan_size = 256; /* MB */

/*
* Wait for the 2-sample stuff to settle before migrating again
@@ -938,10 +939,12 @@ void task_numa_work(struct callback_head *work)
if (time_before(now, migrate))
return;

- next_scan = now + 2*msecs_to_jiffies(sysctl_sched_numa_scan_period_min);
+ next_scan = now + msecs_to_jiffies(sysctl_sched_numa_scan_period_min);
if (cmpxchg(&mm->numa_next_scan, migrate, next_scan) != migrate)
return;

+ current->numa_scan_period += jiffies_to_msecs(2);
+
start = mm->numa_scan_offset;
pages = sysctl_sched_numa_scan_size;
pages <<= 20 - PAGE_SHIFT; /* MB in pages */
@@ -998,7 +1001,8 @@ void task_tick_numa(struct rq *rq, struct task_struct *curr)
period = (u64)curr->numa_scan_period * NSEC_PER_MSEC;

if (now - curr->node_stamp > period) {
- curr->node_stamp = now;
+ curr->node_stamp += period;
+ curr->numa_scan_period = sysctl_sched_numa_scan_period_min;

/*
* We are comparing runtime to wall clock time here, which
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index a14b8a4..6d2fe5b 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -353,6 +353,13 @@ static struct ctl_table kern_table[] = {
#endif /* CONFIG_SMP */
#ifdef CONFIG_NUMA_BALANCING
{
+ .procname = "sched_numa_scan_delay_ms",
+ .data = &sysctl_sched_numa_scan_delay,
+ .maxlen = sizeof(unsigned int),
+ .mode = 0644,
+ .proc_handler = proc_dointvec,
+ },
+ {
.procname = "sched_numa_scan_period_min_ms",
.data = &sysctl_sched_numa_scan_period_min,
.maxlen = sizeof(unsigned int),
--
1.7.11.7

2012-11-22 22:51:46

by Ingo Molnar

[permalink] [raw]
Subject: [PATCH 27/33] sched: Track groups of shared tasks

To be able to cluster memory-related tasks more efficiently, introduce
a new metric that tracks the 'best' buddy task:

Track our "memory buddies": the tasks we actively share memory with.

Firstly we establish the identity of some other task that we are
sharing memory with by looking at rq[page::last_cpu].curr - i.e.
we check the task that is running on that CPU right now.

This is not entirely correct as that task might have scheduled away or
migrated elsewhere since - but statistically there will be correlation to the
tasks that we share memory with, and correlation is all we need.

We map out the relation itself by picking, per working set scan
iteration, the highest-address task that is below our own task
address.

This creates a natural ordering relation between groups of tasks:

t1 < t2 < t3 < t4

t1->memory_buddy == NULL
t2->memory_buddy == t1
t3->memory_buddy == t2
t4->memory_buddy == t3

The load-balancer can then use this information to speed up NUMA
convergence, by moving such tasks together if capacity and load
constraints allow it.

(This is all statistical so there are no preemption or locking worries.)
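
As a rough user-space illustration of the address-ordering rule above
(a sketch of the idea only, not code from the patch; the types and
helper names are made up):

#include <stdio.h>

struct task { char pad; };      /* only the pointer value matters here */

/* Keep the highest-address candidate that is still below our own address: */
static struct task *
pick_buddy(struct task *self, struct task *cand, struct task *best)
{
        if (cand >= self)               /* only consider lower-address tasks */
                return best;
        if (best && best > cand)        /* keep the higher-address candidate */
                return best;
        return cand;                    /* new best buddy for this scan */
}

int main(void)
{
        struct task t[4];               /* &t[0] < &t[1] < &t[2] < &t[3] */
        struct task *buddy = NULL;

        /* t3 (here &t[2]) takes shared faults with t1, t4 and t2: */
        buddy = pick_buddy(&t[2], &t[0], buddy);
        buddy = pick_buddy(&t[2], &t[3], buddy);
        buddy = pick_buddy(&t[2], &t[1], buddy);

        printf("t3's buddy is t%d\n", (int)(buddy - t) + 1);    /* prints: t2 */
        return 0;
}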

Cc: Peter Zijlstra <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Hugh Dickins <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
---
include/linux/sched.h | 5 ++
kernel/sched/core.c | 5 ++
kernel/sched/fair.c | 144 ++++++++++++++++++++++++++++++++++++++++++++++++--
3 files changed, 151 insertions(+), 3 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 92b41b4..be73297 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1513,6 +1513,11 @@ struct task_struct {
unsigned long *numa_faults;
unsigned long *numa_faults_curr;
struct callback_head numa_work;
+
+ struct task_struct *shared_buddy, *shared_buddy_curr;
+ unsigned long shared_buddy_faults, shared_buddy_faults_curr;
+ int ideal_cpu, ideal_cpu_curr;
+
#endif /* CONFIG_NUMA_BALANCING */

struct rcu_head rcu;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index ec3cc74..39cf991 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1558,6 +1558,11 @@ static void __sched_fork(struct task_struct *p)
p->numa_faults = NULL;
p->numa_scan_period = sysctl_sched_numa_scan_delay;
p->numa_work.next = &p->numa_work;
+
+ p->shared_buddy = NULL;
+ p->shared_buddy_faults = 0;
+ p->ideal_cpu = -1;
+ p->ideal_cpu_curr = -1;
#endif /* CONFIG_NUMA_BALANCING */
}

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 1ab11be..67f7fd2 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -840,6 +840,43 @@ static void task_numa_migrate(struct task_struct *p, int next_cpu)
p->numa_migrate_seq = 0;
}

+/*
+ * Called for every full scan - here we consider switching to a new
+ * shared buddy, if the one we found during this scan is good enough:
+ */
+static void shared_fault_full_scan_done(struct task_struct *p)
+{
+ /*
+ * If we have a new maximum rate buddy task then pick it
+ * as our new best friend:
+ */
+ if (p->shared_buddy_faults_curr > p->shared_buddy_faults) {
+ WARN_ON_ONCE(!p->shared_buddy_curr);
+ p->shared_buddy = p->shared_buddy_curr;
+ p->shared_buddy_faults = p->shared_buddy_faults_curr;
+ p->ideal_cpu = p->ideal_cpu_curr;
+
+ goto clear_buddy;
+ }
+ /*
+ * If the new buddy is lower rate than the previous average
+ * fault rate then don't switch buddies yet but lower the average by
+ * averaging in the new rate, with a 1/3 weight.
+ *
+ * Eventually, if the current buddy is not a buddy anymore
+ * then we'll switch away from it: a higher rate buddy will
+ * replace it.
+ */
+ p->shared_buddy_faults *= 3;
+ p->shared_buddy_faults += p->shared_buddy_faults_curr;
+ p->shared_buddy_faults /= 4;
+
+clear_buddy:
+ p->shared_buddy_curr = NULL;
+ p->shared_buddy_faults_curr = 0;
+ p->ideal_cpu_curr = -1;
+}
+
static void task_numa_placement(struct task_struct *p)
{
int seq = ACCESS_ONCE(p->mm->numa_scan_seq);
@@ -852,6 +889,8 @@ static void task_numa_placement(struct task_struct *p)

p->numa_scan_seq = seq;

+ shared_fault_full_scan_done(p);
+
/*
* Update the fault average with the result of the latest
* scan:
@@ -906,23 +945,122 @@ out_backoff:
}

/*
+ * Track our "memory buddies" the tasks we actively share memory with.
+ *
+ * Firstly we establish the identity of some other task that we are
+ * sharing memory with by looking at rq[page::last_cpu].curr - i.e.
+ * we check the task that is running on that CPU right now.
+ *
+ * This is not entirely correct as that task might have scheduled away or
+ * migrated elsewhere since - but statistically there will be correlation to
+ * the tasks that we share memory with, and correlation is all we need.
+ *
+ * We map out the relation itself by picking, per working set scan
+ * iteration, the highest-address task that is below our own task
+ * address.
+ *
+ * This creates a natural ordering relation between groups of tasks:
+ *
+ * t1 < t2 < t3 < t4
+ *
+ * t1->memory_buddy == NULL
+ * t2->memory_buddy == t1
+ * t3->memory_buddy == t2
+ * t4->memory_buddy == t3
+ *
+ * The load-balancer can then use this information to speed up NUMA
+ * convergence, by moving such tasks together if capacity and load
+ * constraints allow it.
+ *
+ * (This is all statistical so there are no preemption or locking worries.)
+ */
+static void shared_fault_tick(struct task_struct *this_task, int node, int last_cpu, int pages)
+{
+ struct task_struct *last_task;
+ struct rq *last_rq;
+ int last_node;
+ int this_node;
+ int this_cpu;
+
+ last_node = cpu_to_node(last_cpu);
+ this_cpu = raw_smp_processor_id();
+ this_node = cpu_to_node(this_cpu);
+
+ /* Ignore private memory access faults: */
+ if (last_cpu == this_cpu)
+ return;
+
+ /*
+ * Ignore accesses from foreign nodes to our memory.
+ *
+ * Yet still recognize tasks accessing a third node - i.e. one that is
+ * remote to both of them.
+ */
+ if (node != this_node)
+ return;
+
+ /* We are in a shared fault - see which task we relate to: */
+ last_rq = cpu_rq(last_cpu);
+ last_task = last_rq->curr;
+
+ /* Task might be gone from that runqueue already: */
+ if (!last_task || last_task == last_rq->idle)
+ return;
+
+ if (last_task == this_task->shared_buddy_curr)
+ goto out_hit;
+
+ /* Order our memory buddies by address: */
+ if (last_task >= this_task)
+ return;
+
+ if (this_task->shared_buddy_curr > last_task)
+ return;
+
+ /* New shared buddy! */
+ this_task->shared_buddy_curr = last_task;
+ this_task->shared_buddy_faults_curr = 0;
+ this_task->ideal_cpu_curr = last_rq->cpu;
+
+out_hit:
+ /*
+ * Give threads that we share a process with an advantage,
+ * but don't stop the discovery of process level sharing
+ * either:
+ */
+ if (this_task->mm == last_task->mm)
+ pages *= 2;
+
+ this_task->shared_buddy_faults_curr += pages;
+}
+
+/*
* Got a PROT_NONE fault for a page on @node.
*/
void task_numa_fault(int node, int last_cpu, int pages)
{
struct task_struct *p = current;
int priv = (task_cpu(p) == last_cpu);
+ int idx = 2*node + priv;

if (unlikely(!p->numa_faults)) {
- int size = sizeof(*p->numa_faults) * 2 * nr_node_ids;
+ int entries = 2*nr_node_ids;
+ int size = sizeof(*p->numa_faults) * entries;

- p->numa_faults = kzalloc(size, GFP_KERNEL);
+ p->numa_faults = kzalloc(2*size, GFP_KERNEL);
if (!p->numa_faults)
return;
+ /*
+ * For efficiency reasons we allocate ->numa_faults[]
+ * and ->numa_faults_curr[] at once and split the
+ * buffer we get. They are separate otherwise.
+ */
+ p->numa_faults_curr = p->numa_faults + entries;
}

+ p->numa_faults_curr[idx] += pages;
+ shared_fault_tick(p, node, last_cpu, pages);
task_numa_placement(p);
- p->numa_faults[2*node + priv] += pages;
}

/*
--
1.7.11.7

2012-11-22 22:51:40

by Ingo Molnar

[permalink] [raw]
Subject: [PATCH 21/33] sched, numa, mm: Count WS scanning against present PTEs, not virtual memory ranges

From: Peter Zijlstra <[email protected]>

By accounting against the present PTEs, scanning speed reflects the
actual present (mapped) memory.
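
For example, with the default 256 MB scan size and assuming 4 KB pages, the
"pages <<= 20 - PAGE_SHIFT" conversion below budgets 256 << 8 = 65536
present PTEs per scan pass.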

Suggested-by: Ingo Molnar <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Hugh Dickins <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
---
kernel/sched/fair.c | 37 +++++++++++++++++++++----------------
1 file changed, 21 insertions(+), 16 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 151a3cd..da28315 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -914,8 +914,8 @@ void task_numa_work(struct callback_head *work)
struct task_struct *p = current;
struct mm_struct *mm = p->mm;
struct vm_area_struct *vma;
- unsigned long offset, end;
- long length;
+ unsigned long start, end;
+ long pages;

WARN_ON_ONCE(p != container_of(work, struct task_struct, numa_work));

@@ -942,30 +942,35 @@ void task_numa_work(struct callback_head *work)
if (cmpxchg(&mm->numa_next_scan, migrate, next_scan) != migrate)
return;

- offset = mm->numa_scan_offset;
- length = sysctl_sched_numa_scan_size;
- length <<= 20;
+ start = mm->numa_scan_offset;
+ pages = sysctl_sched_numa_scan_size;
+ pages <<= 20 - PAGE_SHIFT; /* MB in pages */
+ if (!pages)
+ return;

down_write(&mm->mmap_sem);
- vma = find_vma(mm, offset);
+ vma = find_vma(mm, start);
if (!vma) {
ACCESS_ONCE(mm->numa_scan_seq)++;
- offset = 0;
+ start = 0;
vma = mm->mmap;
}
- for (; vma && length > 0; vma = vma->vm_next) {
+ for (; vma; vma = vma->vm_next) {
if (!vma_migratable(vma))
continue;

- offset = max(offset, vma->vm_start);
- end = min(ALIGN(offset + length, HPAGE_SIZE), vma->vm_end);
- length -= end - offset;
-
- change_prot_numa(vma, offset, end);
-
- offset = end;
+ do {
+ start = max(start, vma->vm_start);
+ end = ALIGN(start + (pages << PAGE_SHIFT), HPAGE_SIZE);
+ end = min(end, vma->vm_end);
+ pages -= change_prot_numa(vma, start, end);
+ start = end;
+ if (pages <= 0)
+ goto out;
+ } while (end != vma->vm_end);
}
- mm->numa_scan_offset = offset;
+out:
+ mm->numa_scan_offset = start;
up_write(&mm->mmap_sem);
}

--
1.7.11.7

2012-11-22 22:51:51

by Ingo Molnar

[permalink] [raw]
Subject: [PATCH 29/33] sched, mm, mempolicy: Add per task mempolicy

We are going to make use of it in the NUMA code: each thread will
converge not just to a group of related tasks, but to a specific
group of memory nodes as well.
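
As a rough illustration of what the MPOL_INTERLEAVE default set up below
amounts to (a user-space sketch with made-up node numbers, not the
kernel's actual interleave bookkeeping): successive allocations rotate
over the nodes set in the task's nodemask.

#include <stdio.h>

int main(void)
{
        int nodemask[] = { 0, 2, 3 };   /* assumed set of allowed nodes */
        int nr = sizeof(nodemask) / sizeof(nodemask[0]);
        int page;

        /* Round-robin pages over the mask - the gist of interleaving: */
        for (page = 0; page < 8; page++)
                printf("page %d -> node %d\n", page, nodemask[page % nr]);
        return 0;
}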

Cc: Linus Torvalds <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Hugh Dickins <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
---
include/linux/mempolicy.h | 39 +--------------------------------------
include/linux/mm_types.h | 40 ++++++++++++++++++++++++++++++++++++++++
include/linux/sched.h | 3 ++-
kernel/sched/core.c | 7 +++++++
mm/mempolicy.c | 16 +++-------------
5 files changed, 53 insertions(+), 52 deletions(-)

diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
index c511e25..f44b7f3 100644
--- a/include/linux/mempolicy.h
+++ b/include/linux/mempolicy.h
@@ -6,11 +6,11 @@
#define _LINUX_MEMPOLICY_H 1


+#include <linux/mm_types.h>
#include <linux/mmzone.h>
#include <linux/slab.h>
#include <linux/rbtree.h>
#include <linux/spinlock.h>
-#include <linux/nodemask.h>
#include <linux/pagemap.h>
#include <uapi/linux/mempolicy.h>

@@ -19,43 +19,6 @@ struct mm_struct;
#ifdef CONFIG_NUMA

/*
- * Describe a memory policy.
- *
- * A mempolicy can be either associated with a process or with a VMA.
- * For VMA related allocations the VMA policy is preferred, otherwise
- * the process policy is used. Interrupts ignore the memory policy
- * of the current process.
- *
- * Locking policy for interlave:
- * In process context there is no locking because only the process accesses
- * its own state. All vma manipulation is somewhat protected by a down_read on
- * mmap_sem.
- *
- * Freeing policy:
- * Mempolicy objects are reference counted. A mempolicy will be freed when
- * mpol_put() decrements the reference count to zero.
- *
- * Duplicating policy objects:
- * mpol_dup() allocates a new mempolicy and copies the specified mempolicy
- * to the new storage. The reference count of the new object is initialized
- * to 1, representing the caller of mpol_dup().
- */
-struct mempolicy {
- atomic_t refcnt;
- unsigned short mode; /* See MPOL_* above */
- unsigned short flags; /* See set_mempolicy() MPOL_F_* above */
- union {
- short preferred_node; /* preferred */
- nodemask_t nodes; /* interleave/bind */
- /* undefined for default */
- } v;
- union {
- nodemask_t cpuset_mems_allowed; /* relative to these nodes */
- nodemask_t user_nodemask; /* nodemask passed by user */
- } w;
-};
-
-/*
* Support for managing mempolicy data objects (clone, copy, destroy)
* The default fast path of a NULL MPOL_DEFAULT policy is always inlined.
*/
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 5995652..cd2be76 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -13,6 +13,7 @@
#include <linux/page-debug-flags.h>
#include <linux/uprobes.h>
#include <linux/page-flags-layout.h>
+#include <linux/nodemask.h>
#include <asm/page.h>
#include <asm/mmu.h>

@@ -203,6 +204,45 @@ struct page_frag {

typedef unsigned long __nocast vm_flags_t;

+#ifdef CONFIG_NUMA
+/*
+ * Describe a memory policy.
+ *
+ * A mempolicy can be either associated with a process or with a VMA.
+ * For VMA related allocations the VMA policy is preferred, otherwise
+ * the process policy is used. Interrupts ignore the memory policy
+ * of the current process.
+ *
+ * Locking policy for interlave:
+ * In process context there is no locking because only the process accesses
+ * its own state. All vma manipulation is somewhat protected by a down_read on
+ * mmap_sem.
+ *
+ * Freeing policy:
+ * Mempolicy objects are reference counted. A mempolicy will be freed when
+ * mpol_put() decrements the reference count to zero.
+ *
+ * Duplicating policy objects:
+ * mpol_dup() allocates a new mempolicy and copies the specified mempolicy
+ * to the new storage. The reference count of the new object is initialized
+ * to 1, representing the caller of mpol_dup().
+ */
+struct mempolicy {
+ atomic_t refcnt;
+ unsigned short mode; /* See MPOL_* above */
+ unsigned short flags; /* See set_mempolicy() MPOL_F_* above */
+ union {
+ short preferred_node; /* preferred */
+ nodemask_t nodes; /* interleave/bind */
+ /* undefined for default */
+ } v;
+ union {
+ nodemask_t cpuset_mems_allowed; /* relative to these nodes */
+ nodemask_t user_nodemask; /* nodemask passed by user */
+ } w;
+};
+#endif
+
/*
* A region containing a mapping of a non-memory backed file under NOMMU
* conditions. These are held in a global tree and are pinned by the VMAs that
diff --git a/include/linux/sched.h b/include/linux/sched.h
index be73297..696492e 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1516,7 +1516,8 @@ struct task_struct {

struct task_struct *shared_buddy, *shared_buddy_curr;
unsigned long shared_buddy_faults, shared_buddy_faults_curr;
- int ideal_cpu, ideal_cpu_curr;
+ int ideal_cpu, ideal_cpu_curr, ideal_cpu_candidate;
+ struct mempolicy numa_policy;

#endif /* CONFIG_NUMA_BALANCING */

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 39cf991..794efa0 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -72,6 +72,7 @@
#include <linux/slab.h>
#include <linux/init_task.h>
#include <linux/binfmts.h>
+#include <uapi/linux/mempolicy.h>

#include <asm/switch_to.h>
#include <asm/tlb.h>
@@ -1563,6 +1564,12 @@ static void __sched_fork(struct task_struct *p)
p->shared_buddy_faults = 0;
p->ideal_cpu = -1;
p->ideal_cpu_curr = -1;
+ atomic_set(&p->numa_policy.refcnt, 1);
+ p->numa_policy.mode = MPOL_INTERLEAVE;
+ p->numa_policy.flags = 0;
+ p->numa_policy.v.preferred_node = 0;
+ p->numa_policy.v.nodes = node_online_map;
+
#endif /* CONFIG_NUMA_BALANCING */
}

diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 02890f2..d71a93d 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -118,20 +118,12 @@ static struct mempolicy default_policy_local = {
.flags = MPOL_F_LOCAL,
};

-/*
- * .v.nodes is set by numa_policy_init():
- */
-static struct mempolicy default_policy_shared = {
- .refcnt = ATOMIC_INIT(1), /* never free it */
- .mode = MPOL_INTERLEAVE,
- .flags = 0,
-};
-
static struct mempolicy *default_policy(void)
{
+#ifdef CONFIG_NUMA_BALANCING
if (task_numa_shared(current) == 1)
- return &default_policy_shared;
-
+ return &current->numa_policy;
+#endif
return &default_policy_local;
}

@@ -2518,8 +2510,6 @@ void __init numa_policy_init(void)
sizeof(struct sp_node),
0, SLAB_PANIC, NULL);

- default_policy_shared.v.nodes = node_online_map;
-
/*
* Set interleaving policy for system init. Interleaving is only
* enabled across suitably sized nodes (default is >= 16MB), or
--
1.7.11.7

2012-11-22 22:51:57

by Ingo Molnar

[permalink] [raw]
Subject: [PATCH 31/33] sched: Use the ideal CPU to drive active balancing

Use our shared/private distinction to allow the separate
handling of 'private' versus 'shared' workloads, which enables
the active-balancing of them:

- private tasks, via the sched_update_ideal_cpu_private() function,
try to 'spread' the system as evenly as possible.

- shared-access tasks that also share their mm (threads), via the
sched_update_ideal_cpu_shared() function, try to 'compress'
with other shared tasks on as few nodes as possible.

[ We'll be able to extend this grouping beyond threads in the
future, by using the existing p->shared_buddy directed graph. ]

Introduce the sched_rebalance_to() primitive to trigger active rebalancing.

The result of this patch is 2-3 times faster convergence and
much more stable convergence points.
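
For example (an illustrative use, not necessarily the literal call site
in the diff below): once task_numa_placement() settles on p->ideal_cpu
on a different node, the task can call sched_rebalance_to(p->ideal_cpu)
from its own context; the helper bails out if the destination CPU is not
in the task's cpus_allowed mask and otherwise uses stop_one_cpu() with
migration_cpu_stop() to push the current task over.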

Cc: Linus Torvalds <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Hugh Dickins <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
---
include/linux/sched.h | 1 +
kernel/sched/core.c | 19 ++++
kernel/sched/fair.c | 244 +++++++++++++++++++++++++++++++++++++++++++++---
kernel/sched/features.h | 7 +-
kernel/sched/sched.h | 1 +
5 files changed, 257 insertions(+), 15 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 696492e..8bc3a03 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2020,6 +2020,7 @@ task_sched_runtime(struct task_struct *task);
/* sched_exec is called by processes performing an exec */
#ifdef CONFIG_SMP
extern void sched_exec(void);
+extern void sched_rebalance_to(int dest_cpu);
#else
#define sched_exec() {}
#endif
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 794efa0..93f2561 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2596,6 +2596,22 @@ unlock:
raw_spin_unlock_irqrestore(&p->pi_lock, flags);
}

+/*
+ * sched_rebalance_to()
+ *
+ * Active load-balance to a target CPU.
+ */
+void sched_rebalance_to(int dest_cpu)
+{
+ struct task_struct *p = current;
+ struct migration_arg arg = { p, dest_cpu };
+
+ if (!cpumask_test_cpu(dest_cpu, tsk_cpus_allowed(p)))
+ return;
+
+ stop_one_cpu(raw_smp_processor_id(), migration_cpu_stop, &arg);
+}
+
#endif

DEFINE_PER_CPU(struct kernel_stat, kstat);
@@ -4753,6 +4769,9 @@ static int __migrate_task(struct task_struct *p, int src_cpu, int dest_cpu)
done:
ret = 1;
fail:
+#ifdef CONFIG_NUMA_BALANCING
+ rq_dest->curr_buddy = NULL;
+#endif
double_rq_unlock(rq_src, rq_dest);
raw_spin_unlock(&p->pi_lock);
return ret;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a5f3ad7..8aa4b36 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -848,6 +848,181 @@ static int task_ideal_cpu(struct task_struct *p)
return p->ideal_cpu;
}

+
+static int sched_update_ideal_cpu_shared(struct task_struct *p)
+{
+ int full_buddies;
+ int max_buddies;
+ int target_cpu;
+ int ideal_cpu;
+ int this_cpu;
+ int this_node;
+ int best_node;
+ int buddies;
+ int node;
+ int cpu;
+
+ if (!sched_feat(PUSH_SHARED_BUDDIES))
+ return -1;
+
+ ideal_cpu = -1;
+ best_node = -1;
+ max_buddies = 0;
+ this_cpu = task_cpu(p);
+ this_node = cpu_to_node(this_cpu);
+
+ for_each_online_node(node) {
+ full_buddies = cpumask_weight(cpumask_of_node(node));
+
+ buddies = 0;
+ target_cpu = -1;
+
+ for_each_cpu(cpu, cpumask_of_node(node)) {
+ struct task_struct *curr;
+ struct rq *rq;
+
+ WARN_ON_ONCE(cpu_to_node(cpu) != node);
+
+ rq = cpu_rq(cpu);
+
+ /*
+ * Don't take the rq lock for scalability,
+ * we only rely on rq->curr statistically:
+ */
+ curr = ACCESS_ONCE(rq->curr);
+
+ if (curr == p) {
+ buddies += 1;
+ continue;
+ }
+
+ /* Pick up idle tasks immediately: */
+ if (curr == rq->idle && !rq->curr_buddy)
+ target_cpu = cpu;
+
+ /* Leave alone non-NUMA tasks: */
+ if (task_numa_shared(curr) < 0)
+ continue;
+
+ /* Private task is an easy target: */
+ if (task_numa_shared(curr) == 0) {
+ if (!rq->curr_buddy)
+ target_cpu = cpu;
+ continue;
+ }
+ if (curr->mm != p->mm) {
+ /* Leave alone different groups on their ideal node: */
+ if (cpu_to_node(curr->ideal_cpu) == node)
+ continue;
+ if (!rq->curr_buddy)
+ target_cpu = cpu;
+ continue;
+ }
+
+ buddies++;
+ }
+ WARN_ON_ONCE(buddies > full_buddies);
+
+ /* Don't go to a node that is already at full capacity: */
+ if (buddies == full_buddies)
+ continue;
+
+ if (!buddies)
+ continue;
+
+ if (buddies > max_buddies && target_cpu != -1) {
+ max_buddies = buddies;
+ best_node = node;
+ WARN_ON_ONCE(target_cpu == -1);
+ ideal_cpu = target_cpu;
+ }
+ }
+
+ WARN_ON_ONCE(best_node == -1 && ideal_cpu != -1);
+ WARN_ON_ONCE(best_node != -1 && ideal_cpu == -1);
+
+ this_node = cpu_to_node(this_cpu);
+
+ /* If we'd stay within this node then stay put: */
+ if (ideal_cpu == -1 || cpu_to_node(ideal_cpu) == this_node)
+ ideal_cpu = this_cpu;
+
+ return ideal_cpu;
+}
+
+static int sched_update_ideal_cpu_private(struct task_struct *p)
+{
+ int full_idles;
+ int this_idles;
+ int max_idles;
+ int target_cpu;
+ int ideal_cpu;
+ int best_node;
+ int this_node;
+ int this_cpu;
+ int idles;
+ int node;
+ int cpu;
+
+ if (!sched_feat(PUSH_PRIVATE_BUDDIES))
+ return -1;
+
+ ideal_cpu = -1;
+ best_node = -1;
+ max_idles = 0;
+ this_idles = 0;
+ this_cpu = task_cpu(p);
+ this_node = cpu_to_node(this_cpu);
+
+ for_each_online_node(node) {
+ full_idles = cpumask_weight(cpumask_of_node(node));
+
+ idles = 0;
+ target_cpu = -1;
+
+ for_each_cpu(cpu, cpumask_of_node(node)) {
+ struct rq *rq;
+
+ WARN_ON_ONCE(cpu_to_node(cpu) != node);
+
+ rq = cpu_rq(cpu);
+ if (rq->curr == rq->idle) {
+ if (!rq->curr_buddy)
+ target_cpu = cpu;
+ idles++;
+ }
+ }
+ WARN_ON_ONCE(idles > full_idles);
+
+ if (node == this_node)
+ this_idles = idles;
+
+ if (!idles)
+ continue;
+
+ if (idles > max_idles && target_cpu != -1) {
+ max_idles = idles;
+ best_node = node;
+ WARN_ON_ONCE(target_cpu == -1);
+ ideal_cpu = target_cpu;
+ }
+ }
+
+ WARN_ON_ONCE(best_node == -1 && ideal_cpu != -1);
+ WARN_ON_ONCE(best_node != -1 && ideal_cpu == -1);
+
+ /* If the target is not idle enough, skip: */
+ if (max_idles <= this_idles+1)
+ ideal_cpu = this_cpu;
+
+ /* If we'd stay within this node then stay put: */
+ if (ideal_cpu == -1 || cpu_to_node(ideal_cpu) == this_node)
+ ideal_cpu = this_cpu;
+
+ return ideal_cpu;
+}
+
+
/*
* Called for every full scan - here we consider switching to a new
* shared buddy, if the one we found during this scan is good enough:
@@ -862,7 +1037,6 @@ static void shared_fault_full_scan_done(struct task_struct *p)
WARN_ON_ONCE(!p->shared_buddy_curr);
p->shared_buddy = p->shared_buddy_curr;
p->shared_buddy_faults = p->shared_buddy_faults_curr;
- p->ideal_cpu = p->ideal_cpu_curr;

goto clear_buddy;
}
@@ -891,14 +1065,13 @@ static void task_numa_placement(struct task_struct *p)
unsigned long total[2] = { 0, 0 };
unsigned long faults, max_faults = 0;
int node, priv, shared, max_node = -1;
+ int this_node;

if (p->numa_scan_seq == seq)
return;

p->numa_scan_seq = seq;

- shared_fault_full_scan_done(p);
-
/*
* Update the fault average with the result of the latest
* scan:
@@ -926,10 +1099,7 @@ static void task_numa_placement(struct task_struct *p)
}
}

- if (max_node != p->numa_max_node) {
- sched_setnuma(p, max_node, task_numa_shared(p));
- goto out_backoff;
- }
+ shared_fault_full_scan_done(p);

p->numa_migrate_seq++;
if (sched_feat(NUMA_SETTLE) &&
@@ -942,14 +1112,55 @@ static void task_numa_placement(struct task_struct *p)
* the impact of a little private memory accesses.
*/
shared = (total[0] >= total[1] / 2);
- if (shared != task_numa_shared(p)) {
- sched_setnuma(p, p->numa_max_node, shared);
+ if (shared)
+ p->ideal_cpu = sched_update_ideal_cpu_shared(p);
+ else
+ p->ideal_cpu = sched_update_ideal_cpu_private(p);
+
+ if (p->ideal_cpu >= 0) {
+ /* Filter migrations a bit - the same target twice in a row is picked: */
+ if (p->ideal_cpu == p->ideal_cpu_candidate) {
+ max_node = cpu_to_node(p->ideal_cpu);
+ } else {
+ p->ideal_cpu_candidate = p->ideal_cpu;
+ max_node = -1;
+ }
+ } else {
+ if (max_node < 0)
+ max_node = p->numa_max_node;
+ }
+
+ if (shared != task_numa_shared(p) || (max_node != -1 && max_node != p->numa_max_node)) {
+
p->numa_migrate_seq = 0;
- goto out_backoff;
+ /*
+ * Fix up node migration fault statistics artifact, as we
+ * migrate to another node we'll soon bring over our private
+ * working set - generating 'shared' faults as that happens.
+ * To counter-balance this effect, move this node's private
+ * stats to the new node.
+ */
+ if (max_node != -1 && p->numa_max_node != -1 && max_node != p->numa_max_node) {
+ int idx_oldnode = p->numa_max_node*2 + 1;
+ int idx_newnode = max_node*2 + 1;
+
+ p->numa_faults[idx_newnode] += p->numa_faults[idx_oldnode];
+ p->numa_faults[idx_oldnode] = 0;
+ }
+ sched_setnuma(p, max_node, shared);
+ } else {
+ /* node unchanged, back off: */
+ p->numa_scan_period = min(p->numa_scan_period * 2, sysctl_sched_numa_scan_period_max);
+ }
+
+ this_node = cpu_to_node(task_cpu(p));
+
+ if (max_node >= 0 && p->ideal_cpu >= 0 && max_node != this_node) {
+ struct rq *rq = cpu_rq(p->ideal_cpu);
+
+ rq->curr_buddy = p;
+ sched_rebalance_to(p->ideal_cpu);
}
- return;
-out_backoff:
- p->numa_scan_period = min(p->numa_scan_period * 2, sysctl_sched_numa_scan_period_max);
}

/*
@@ -1051,6 +1262,8 @@ void task_numa_fault(int node, int last_cpu, int pages)
int priv = (task_cpu(p) == last_cpu);
int idx = 2*node + priv;

+ WARN_ON_ONCE(last_cpu == -1 || node == -1);
+
if (unlikely(!p->numa_faults)) {
int entries = 2*nr_node_ids;
int size = sizeof(*p->numa_faults) * entries;
@@ -3545,6 +3758,11 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags)
if (p->nr_cpus_allowed == 1)
return prev_cpu;

+#ifdef CONFIG_NUMA_BALANCING
+ if (sched_feat(WAKE_ON_IDEAL_CPU) && p->ideal_cpu >= 0)
+ return p->ideal_cpu;
+#endif
+
if (sd_flag & SD_BALANCE_WAKE) {
if (cpumask_test_cpu(cpu, tsk_cpus_allowed(p)))
want_affine = 1;
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 737d2c8..c868a66 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -68,11 +68,14 @@ SCHED_FEAT(RT_RUNTIME_SHARE, true)
SCHED_FEAT(LB_MIN, false)
SCHED_FEAT(IDEAL_CPU, true)
SCHED_FEAT(IDEAL_CPU_THREAD_BIAS, false)
+SCHED_FEAT(PUSH_PRIVATE_BUDDIES, true)
+SCHED_FEAT(PUSH_SHARED_BUDDIES, true)
+SCHED_FEAT(WAKE_ON_IDEAL_CPU, false)

#ifdef CONFIG_NUMA_BALANCING
/* Do the working set probing faults: */
SCHED_FEAT(NUMA, true)
-SCHED_FEAT(NUMA_FAULTS_UP, true)
-SCHED_FEAT(NUMA_FAULTS_DOWN, true)
+SCHED_FEAT(NUMA_FAULTS_UP, false)
+SCHED_FEAT(NUMA_FAULTS_DOWN, false)
SCHED_FEAT(NUMA_SETTLE, true)
#endif
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index bb9475c..810a1a0 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -441,6 +441,7 @@ struct rq {
unsigned long numa_weight;
unsigned long nr_numa_running;
unsigned long nr_ideal_running;
+ struct task_struct *curr_buddy;
#endif
unsigned long nr_shared_running; /* 0 on non-NUMA */

--
1.7.11.7

2012-11-22 22:52:04

by Ingo Molnar

[permalink] [raw]
Subject: [PATCH 33/33] sched: Track shared task's node groups and interleave their memory allocations

This patch shows the power of the shared/private distinction: in
the shared tasks active balancing function (sched_update_ideal_cpu_shared())
we are able to build a per (shared) task node mask of the nodes that
it and its buddies occupy at the moment.

Private tasks on the other hand are not affected and continue to do
efficient node-local allocations.

There are two important special cases:

- if a group of shared tasks fits on a single node: in this case
the interleaving happens on a single bit, i.e. a single node, and thus
turns into nice node-local allocations.

- if a large group spans the whole system: in this case the node
masks will cover the whole system, and all memory gets evenly
interleaved and available RAM bandwidth gets utilized. This is
preferable to allocating memory asymmetrically and overloading
certain CPU links and running into their bandwidth limitations.
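
For illustration, here is a stand-alone user-space sketch of the
mechanism (not the kernel code - the nodemask helpers below are
simplified stand-ins for the kernel's nodemask_t API): per scan, set a
bit for every node that currently runs one of the task's buddies, clear
it otherwise, and round-robin allocations over the resulting mask. With
a single bit set this degenerates into node-local allocation, as
described above.

#include <stdio.h>

#define MAX_NODES 4

/* Simplified stand-in for the kernel's nodemask_t: */
typedef unsigned long nodemask_t;

static void node_set(int node, nodemask_t *mask)   { *mask |=  (1UL << node); }
static void node_clear(int node, nodemask_t *mask) { *mask &= ~(1UL << node); }
static int  node_isset(int node, nodemask_t mask)  { return !!(mask & (1UL << node)); }

/* Rebuild the per task interleave mask from this scan's buddy counts: */
static void update_interleave_mask(nodemask_t *mask, const int buddies[MAX_NODES])
{
	int node;

	for (node = 0; node < MAX_NODES; node++) {
		if (buddies[node])
			node_set(node, mask);
		else
			node_clear(node, mask);
	}
}

/* Round-robin the next allocation over the nodes present in the mask: */
static int interleave_next_node(nodemask_t mask, int *rotor)
{
	int i;

	for (i = 0; i < MAX_NODES; i++) {
		int node = (*rotor + i) % MAX_NODES;

		if (node_isset(node, mask)) {
			*rotor = node + 1;
			return node;
		}
	}
	return 0;	/* empty mask: fall back to node 0 */
}

int main(void)
{
	int buddies[MAX_NODES] = { 3, 2, 0, 0 };	/* buddies seen on nodes 0 and 1 */
	nodemask_t mask = 0;
	int rotor = 0, i;

	update_interleave_mask(&mask, buddies);
	for (i = 0; i < 6; i++)
		printf("allocation %d -> node %d\n", i, interleave_next_node(mask, &rotor));

	return 0;
}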

This patch, in combination with the private/shared buddies patch,
optimizes the "4x JVM", "single JVM" and "2x JVM" SPECjbb workloads
on a 4-node system to produce almost completely perfect memory placement.

For example a 4-JVM workload on a 4-node, 32-CPU system has
this performance (8 SPECjbb warehouses per JVM):

spec1.txt: throughput = 177460.44 SPECjbb2005 bops
spec2.txt: throughput = 176175.08 SPECjbb2005 bops
spec3.txt: throughput = 175053.91 SPECjbb2005 bops
spec4.txt: throughput = 171383.52 SPECjbb2005 bops

Which is close to the hard binding performance figures, while
previously it would regress compared to mainline.

Mainline has the following 4x JVM performance:

spec1.txt: throughput = 157839.25 SPECjbb2005 bops
spec2.txt: throughput = 156969.15 SPECjbb2005 bops
spec3.txt: throughput = 157571.59 SPECjbb2005 bops
spec4.txt: throughput = 157873.86 SPECjbb2005 bops

So the patch brings a 12% speedup.

This placement idea came while discussing interleaving strategies
with Christoph Lameter.

Suggested-by: Christoph Lameter <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Hugh Dickins <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
---
kernel/sched/fair.c | 4 ++++
1 file changed, 4 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ab4a7130..5cc3620 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -922,6 +922,10 @@ static int sched_update_ideal_cpu_shared(struct task_struct *p)
buddies++;
}
WARN_ON_ONCE(buddies > full_buddies);
+ if (buddies)
+ node_set(node, p->numa_policy.v.nodes);
+ else
+ node_clear(node, p->numa_policy.v.nodes);

/* Don't go to a node that is already at full capacity: */
if (buddies == full_buddies)
--
1.7.11.7

2012-11-22 22:52:35

by Ingo Molnar

[permalink] [raw]
Subject: [PATCH 32/33] sched: Add hysteresis to p->numa_shared

Make p->numa_shared flip-flop less around unstable equilibria;
instead, require a significant move in either direction to switch
between 'dominantly shared accesses' and 'dominantly private
accesses' NUMA status.
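
To make the rule concrete, here is a stand-alone user-space model of
the hysteresis below (illustrative only, with total[0]/total[1] passed
in as plain shared/private fault counts):

#include <stdio.h>

/*
 * Model of the hysteresis:
 *  shared < 0: not yet classified
 *  shared = 0: dominantly private
 *  shared = 1: dominantly shared
 */
static int update_shared(int shared, unsigned long shared_faults,
			 unsigned long private_faults)
{
	if (shared < 0)
		return shared_faults >= private_faults;

	if (shared == 0 && shared_faults >= private_faults * 2)
		return 1;	/* private -> shared needs a 2x dominance */

	if (shared == 1 && shared_faults * 2 <= private_faults)
		return 0;	/* shared -> private needs a 2x dominance */

	return shared;		/* otherwise keep the previous classification */
}

int main(void)
{
	int shared = -1;

	shared = update_shared(shared, 100,  90);	/* first classification: shared */
	shared = update_shared(shared,  80, 100);	/* stays shared: not a 2x move */
	shared = update_shared(shared,  40, 100);	/* flips: private dominates 2x */

	printf("final classification: %d\n", shared);
	return 0;
}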

Suggested-by: Rik van Riel <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Hugh Dickins <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
---
kernel/sched/fair.c | 15 ++++++++++++++-
1 file changed, 14 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8aa4b36..ab4a7130 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1111,7 +1111,20 @@ static void task_numa_placement(struct task_struct *p)
* we might want to consider a different equation below to reduce
* the impact of a little private memory accesses.
*/
- shared = (total[0] >= total[1] / 2);
+ shared = p->numa_shared;
+
+ if (shared < 0) {
+ shared = (total[0] >= total[1]);
+ } else if (shared == 0) {
+ /* If it was private before, make it harder to become shared: */
+ if (total[0] >= total[1]*2)
+ shared = 1;
+ } else if (shared == 1 ) {
+ /* If it was shared before, make it harder to become private: */
+ if (total[0]*2 <= total[1])
+ shared = 0;
+ }
+
if (shared)
p->ideal_cpu = sched_update_ideal_cpu_shared(p);
else
--
1.7.11.7

2012-11-22 22:52:54

by Ingo Molnar

[permalink] [raw]
Subject: [PATCH 30/33] sched: Average the fault stats longer

We will rely on the per-CPU fault statistics and their
shared/private derivative even more in the future, so
stabilize this metric further.

The staged updates introduced in commit:

sched: Introduce staged average NUMA faults

already stabilized this key metric significantly, but in
real workloads it was still reacting to temporary load
balancing transients too quickly.

Slow it down by weighting the running average more heavily towards
its previous value. The weighting value was found via experimentation.
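
In other words, the running average moves from avg = (avg + new)/2 to
avg = (avg*7 + new)/8 - an exponentially weighted average with a much
longer memory. A quick stand-alone comparison (illustrative only):

#include <stdio.h>

int main(void)
{
	/* One transient spike of 800 faults on top of a steady 100 per scan: */
	unsigned long samples[] = { 100, 100, 800, 100, 100, 100 };
	unsigned long fast = 100, slow = 100;
	unsigned int i;

	for (i = 0; i < sizeof(samples)/sizeof(samples[0]); i++) {
		fast = (fast + samples[i]) / 2;		/* old weighting */
		slow = (slow * 7 + samples[i]) / 8;	/* new weighting */
		printf("scan %u: sample=%lu fast_avg=%lu slow_avg=%lu\n",
		       i, samples[i], fast, slow);
	}
	/* The new weighting barely reacts to the transient spike. */
	return 0;
}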

Cc: Linus Torvalds <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Hugh Dickins <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
---
kernel/sched/fair.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 24a5588..a5f3ad7 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -914,8 +914,8 @@ static void task_numa_placement(struct task_struct *p)
p->numa_faults_curr[idx] = 0;

/* Keep a simple running average: */
- p->numa_faults[idx] += new_faults;
- p->numa_faults[idx] /= 2;
+ p->numa_faults[idx] = p->numa_faults[idx]*7 + new_faults;
+ p->numa_faults[idx] /= 8;

faults += p->numa_faults[idx];
total[priv] += p->numa_faults[idx];
--
1.7.11.7

2012-11-22 22:53:30

by Ingo Molnar

[permalink] [raw]
Subject: [PATCH 28/33] sched: Use the best-buddy 'ideal cpu' in balancing decisions

Now that we have a notion of (one of the) best CPUs we interrelate
with in terms of memory usage, use that information to improve
can_migrate_task() balancing decisions: allow the migration to
occur even if we are locally cache-hot, when we are on another node
and want to migrate towards our best buddy's node.

( Note that this is not hard affinity - if imbalance persists long
enough then the scheduler will eventually balance tasks anyway,
to maximize CPU utilization. )
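
As a stand-alone sketch of the resulting filter (user-space model, not
the kernel code; the CPU-to-node mapping below is a simplified stand-in,
while the 1000-fault threshold matches the patch):

#include <stdbool.h>
#include <stdio.h>

#define NR_CPUS_PER_NODE 8

static int cpu_to_node(int cpu)
{
	return cpu / NR_CPUS_PER_NODE;	/* simplified topology */
}

/*
 * Once a task has a well-established ideal CPU, only let the load
 * balancer pull it towards that CPU's node; otherwise fall back to
 * the normal balancing checks:
 */
static bool allow_numa_pull(int ideal_cpu, unsigned long buddy_faults, int dst_cpu)
{
	if (ideal_cpu < 0 || buddy_faults <= 1000)
		return true;	/* no strong preference yet */

	return cpu_to_node(ideal_cpu) == cpu_to_node(dst_cpu);
}

int main(void)
{
	printf("%d\n", allow_numa_pull(17, 5000, 20));	/* same node: allowed */
	printf("%d\n", allow_numa_pull(17, 5000,  3));	/* other node: refused */
	return 0;
}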

Cc: Peter Zijlstra <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Hugh Dickins <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
---
kernel/sched/fair.c | 35 ++++++++++++++++++++++++++++++++---
kernel/sched/features.h | 2 ++
2 files changed, 34 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 67f7fd2..24a5588 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -840,6 +840,14 @@ static void task_numa_migrate(struct task_struct *p, int next_cpu)
p->numa_migrate_seq = 0;
}

+static int task_ideal_cpu(struct task_struct *p)
+{
+ if (!sched_feat(IDEAL_CPU))
+ return -1;
+
+ return p->ideal_cpu;
+}
+
/*
* Called for every full scan - here we consider switching to a new
* shared buddy, if the one we found during this scan is good enough:
@@ -1028,7 +1036,7 @@ out_hit:
* but don't stop the discovery of process level sharing
* either:
*/
- if (this_task->mm == last_task->mm)
+ if (sched_feat(IDEAL_CPU_THREAD_BIAS) && this_task->mm == last_task->mm)
pages *= 2;

this_task->shared_buddy_faults_curr += pages;
@@ -1189,6 +1197,7 @@ void task_tick_numa(struct rq *rq, struct task_struct *curr)
}
#else /* !CONFIG_NUMA_BALANCING: */
#ifdef CONFIG_SMP
+static inline int task_ideal_cpu(struct task_struct *p) { return -1; }
static inline void account_numa_enqueue(struct rq *rq, struct task_struct *p) { }
#endif
static inline void account_numa_dequeue(struct rq *rq, struct task_struct *p) { }
@@ -4064,6 +4073,7 @@ struct lb_env {
static void move_task(struct task_struct *p, struct lb_env *env)
{
deactivate_task(env->src_rq, p, 0);
+
set_task_cpu(p, env->dst_cpu);
activate_task(env->dst_rq, p, 0);
check_preempt_curr(env->dst_rq, p, 0);
@@ -4242,15 +4252,17 @@ static bool can_migrate_numa_task(struct task_struct *p, struct lb_env *env)
*
* LBF_NUMA_RUN -- numa only, only allow improvement
* LBF_NUMA_SHARED -- shared only
+ * LBF_NUMA_IDEAL -- ideal only
*
* LBF_KEEP_SHARED -- do not touch shared tasks
*/

/* a numa run can only move numa tasks about to improve things */
if (env->flags & LBF_NUMA_RUN) {
- if (task_numa_shared(p) < 0)
+ if (task_numa_shared(p) < 0 && task_ideal_cpu(p) < 0)
return false;
- /* can only pull shared tasks */
+
+ /* If we are only allowed to pull shared tasks: */
if ((env->flags & LBF_NUMA_SHARED) && !task_numa_shared(p))
return false;
} else {
@@ -4307,6 +4319,23 @@ static int can_migrate_task(struct task_struct *p, struct lb_env *env)
if (!can_migrate_running_task(p, env))
return false;

+#ifdef CONFIG_NUMA_BALANCING
+ /* If we are only allowed to pull ideal tasks: */
+ if ((task_ideal_cpu(p) >= 0) && (p->shared_buddy_faults > 1000)) {
+ int ideal_node;
+ int dst_node;
+
+ BUG_ON(env->dst_cpu < 0);
+
+ ideal_node = cpu_to_node(p->ideal_cpu);
+ dst_node = cpu_to_node(env->dst_cpu);
+
+ if (ideal_node == dst_node)
+ return true;
+ return false;
+ }
+#endif
+
if (env->sd->flags & SD_NUMA)
return can_migrate_numa_task(p, env);

diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index b75a10d..737d2c8 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -66,6 +66,8 @@ SCHED_FEAT(TTWU_QUEUE, true)
SCHED_FEAT(FORCE_SD_OVERLAP, false)
SCHED_FEAT(RT_RUNTIME_SHARE, true)
SCHED_FEAT(LB_MIN, false)
+SCHED_FEAT(IDEAL_CPU, true)
+SCHED_FEAT(IDEAL_CPU_THREAD_BIAS, false)

#ifdef CONFIG_NUMA_BALANCING
/* Do the working set probing faults: */
--
1.7.11.7

2012-11-22 22:53:46

by Ingo Molnar

[permalink] [raw]
Subject: [PATCH 26/33] sched: Introduce staged average NUMA faults

The current way of building the p->numa_faults[2][node] faults
statistics has a sampling artifact:

The continuous and immediate nature of propagating new fault
stats to the numa_faults array creates a 'pulsating' dynamic
that starts at the average value at the beginning of the scan,
increases monotonically until, by the time we finish the scan,
it reaches about twice the average, and then drops back to half
of its value due to the running average.

Since we rely on these values to balance tasks, the pulsating
nature resulted in false migrations and general noise in the
stats.

To solve this, introduce buffering of the current scan via
p->numa_faults_curr[]. The array is co-allocated with the
p->numa_faults[] array for efficiency reasons, but it is otherwise
an ordinary separate array.

At the end of the scan we propagate the latest stats into the
average stats value. Most of the balancing code stays unmodified.
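
A minimal stand-alone model of the double buffering (illustrative, not
the kernel code):

#include <stdio.h>

#define NR_NODES 4
#define NR_STATS (2 * NR_NODES)		/* [shared, private] per node */

static unsigned long numa_faults[NR_STATS];	  /* long-term averages */
static unsigned long numa_faults_curr[NR_STATS];  /* current scan only  */

/* Called for every hinting fault during the scan: */
static void record_fault(int node, int priv)
{
	numa_faults_curr[2 * node + priv]++;
}

/* Called once per completed scan, from placement: */
static void fold_scan(void)
{
	int idx;

	for (idx = 0; idx < NR_STATS; idx++) {
		unsigned long new_faults = numa_faults_curr[idx];

		numa_faults_curr[idx] = 0;

		/* Keep a simple running average: */
		numa_faults[idx] += new_faults;
		numa_faults[idx] /= 2;
	}
}

int main(void)
{
	int i;

	for (i = 0; i < 500; i++)
		record_fault(1, 1);	/* private faults on node 1 */
	fold_scan();

	printf("node 1 private average: %lu\n", numa_faults[2 * 1 + 1]);
	return 0;
}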

The cost of this change is that we delay the effects of the latest
round of faults by 1 scan - but using the partial faults info was
creating artifacts.

This instantly stabilized the page fault stats and improved
numa02-alike workloads by making them faster to converge.

Cc: Peter Zijlstra <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Hugh Dickins <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
---
include/linux/sched.h | 1 +
kernel/sched/fair.c | 20 +++++++++++++++++---
2 files changed, 18 insertions(+), 3 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 8f65323..92b41b4 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1511,6 +1511,7 @@ struct task_struct {
u64 node_stamp; /* migration stamp */
unsigned long numa_weight;
unsigned long *numa_faults;
+ unsigned long *numa_faults_curr;
struct callback_head numa_work;
#endif /* CONFIG_NUMA_BALANCING */

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 9c46b45..1ab11be 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -852,12 +852,26 @@ static void task_numa_placement(struct task_struct *p)

p->numa_scan_seq = seq;

+ /*
+ * Update the fault average with the result of the latest
+ * scan:
+ */
for (node = 0; node < nr_node_ids; node++) {
faults = 0;
for (priv = 0; priv < 2; priv++) {
- faults += p->numa_faults[2*node + priv];
- total[priv] += p->numa_faults[2*node + priv];
- p->numa_faults[2*node + priv] /= 2;
+ unsigned int new_faults;
+ unsigned int idx;
+
+ idx = 2*node + priv;
+ new_faults = p->numa_faults_curr[idx];
+ p->numa_faults_curr[idx] = 0;
+
+ /* Keep a simple running average: */
+ p->numa_faults[idx] += new_faults;
+ p->numa_faults[idx] /= 2;
+
+ faults += p->numa_faults[idx];
+ total[priv] += p->numa_faults[idx];
}
if (faults > max_faults) {
max_faults = faults;
--
1.7.11.7

2012-11-22 22:53:48

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH 00/33] Latest numa/core release, v17



* Ingo Molnar <[email protected]> wrote:

> This release mainly addresses one of the regressions Linus
> (rightfully) complained about: the "4x JVM" SPECjbb run.
>
> [ Note to testers: if possible please still run with
> CONFIG_TRANSPARENT_HUGEPAGES=y enabled, to avoid the
> !THP regression that is still not fully fixed.
> It will be fixed next. ]

I forgot to include the Git link:

git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git master

Thanks,

Ingo

2012-11-22 22:54:25

by Ingo Molnar

[permalink] [raw]
Subject: [PATCH 25/33] sched: Improve convergence

- break out of can_do_numa_run() earlier if we can make no progress
- don't flip between siblings that often
- turn on bidirectional fault balancing
- improve the flow in task_numa_work()

Cc: Peter Zijlstra <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Hugh Dickins <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
---
kernel/sched/fair.c | 46 ++++++++++++++++++++++++++++++++--------------
kernel/sched/features.h | 2 +-
2 files changed, 33 insertions(+), 15 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 59fea2e..9c46b45 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -917,12 +917,12 @@ void task_numa_fault(int node, int last_cpu, int pages)
*/
void task_numa_work(struct callback_head *work)
{
+ long pages_total, pages_left, pages_changed;
unsigned long migrate, next_scan, now = jiffies;
+ unsigned long start0, start, end;
struct task_struct *p = current;
struct mm_struct *mm = p->mm;
struct vm_area_struct *vma;
- unsigned long start, end;
- long pages;

WARN_ON_ONCE(p != container_of(work, struct task_struct, numa_work));

@@ -951,35 +951,42 @@ void task_numa_work(struct callback_head *work)

current->numa_scan_period += jiffies_to_msecs(2);

- start = mm->numa_scan_offset;
- pages = sysctl_sched_numa_scan_size;
- pages <<= 20 - PAGE_SHIFT; /* MB in pages */
- if (!pages)
+ start0 = start = end = mm->numa_scan_offset;
+ pages_total = sysctl_sched_numa_scan_size;
+ pages_total <<= 20 - PAGE_SHIFT; /* MB in pages */
+ if (!pages_total)
return;

+ pages_left = pages_total;
+
down_write(&mm->mmap_sem);
vma = find_vma(mm, start);
if (!vma) {
ACCESS_ONCE(mm->numa_scan_seq)++;
- start = 0;
- vma = mm->mmap;
+ end = 0;
+ vma = find_vma(mm, end);
}
for (; vma; vma = vma->vm_next) {
if (!vma_migratable(vma))
continue;

do {
- start = max(start, vma->vm_start);
- end = ALIGN(start + (pages << PAGE_SHIFT), HPAGE_SIZE);
+ start = max(end, vma->vm_start);
+ end = ALIGN(start + (pages_left << PAGE_SHIFT), HPAGE_SIZE);
end = min(end, vma->vm_end);
- pages -= change_prot_numa(vma, start, end);
- start = end;
- if (pages <= 0)
+ pages_changed = change_prot_numa(vma, start, end);
+
+ WARN_ON_ONCE(pages_changed > pages_total);
+ BUG_ON(pages_changed < 0);
+
+ pages_left -= pages_changed;
+ if (pages_left <= 0)
goto out;
} while (end != vma->vm_end);
}
out:
- mm->numa_scan_offset = start;
+ mm->numa_scan_offset = end;
+
up_write(&mm->mmap_sem);
}

@@ -3306,6 +3313,13 @@ static int select_idle_sibling(struct task_struct *p, int target)
int i;

/*
+ * For NUMA tasks constant, reliable placement is more important
+ * than flipping tasks between siblings:
+ */
+ if (task_numa_shared(p) >= 0)
+ return target;
+
+ /*
* If the task is going to be woken-up on this cpu and if it is
* already idle, then it is the right target.
*/
@@ -4581,6 +4595,10 @@ static bool can_do_numa_run(struct lb_env *env, struct sd_lb_stats *sds)
* If we got capacity allow stacking up on shared tasks.
*/
if ((sds->this_shared_running < sds->this_group_capacity) && sds->numa_shared_running) {
+ /* There's no point in trying to move if all are here already: */
+ if (sds->numa_shared_running == sds->this_shared_running)
+ return false;
+
env->flags |= LBF_NUMA_SHARED;
return true;
}
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index a432eb8..b75a10d 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -71,6 +71,6 @@ SCHED_FEAT(LB_MIN, false)
/* Do the working set probing faults: */
SCHED_FEAT(NUMA, true)
SCHED_FEAT(NUMA_FAULTS_UP, true)
-SCHED_FEAT(NUMA_FAULTS_DOWN, false)
+SCHED_FEAT(NUMA_FAULTS_DOWN, true)
SCHED_FEAT(NUMA_SETTLE, true)
#endif
--
1.7.11.7

2012-11-22 22:55:08

by Ingo Molnar

[permalink] [raw]
Subject: [PATCH 24/33] sched: Implement NUMA scanning backoff

Back off slowly from scanning, up to sysctl_sched_numa_scan_period_max
(1.6 seconds). Scan faster again if we were forced to switch to
another node.

This makes sure that workloads in equilibrium don't get scanned as often
as workloads that are still converging.
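
A stand-alone sketch of the period adjustment (user-space model of the
logic below, with the 100 ms / 1.6 s bounds taken from the existing
sysctl defaults):

#include <stdio.h>

static unsigned int scan_period_min = 100;	/* ms */
static unsigned int scan_period_max = 1600;	/* ms */

static unsigned int backoff(unsigned int period)
{
	unsigned int next = period * 2;

	return next < scan_period_max ? next : scan_period_max;
}

int main(void)
{
	unsigned int period = scan_period_min;
	int scan;

	/* Workload in equilibrium: scanning slows down exponentially. */
	for (scan = 0; scan < 6; scan++) {
		printf("scan %d: period %u ms\n", scan, period);
		period = backoff(period);
	}

	/* The task was moved to another node: scan quickly again. */
	period = scan_period_min;
	printf("after node switch: period %u ms\n", period);
	return 0;
}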

Cc: Peter Zijlstra <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Hugh Dickins <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
---
kernel/sched/core.c | 6 ++++++
kernel/sched/fair.c | 8 +++++++-
2 files changed, 13 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index af0602f..ec3cc74 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6024,6 +6024,12 @@ void sched_setnuma(struct task_struct *p, int node, int shared)
if (on_rq)
enqueue_task(rq, p, 0);
task_rq_unlock(rq, p, &flags);
+
+ /*
+ * Reset the scanning period. If the task converges
+ * on this node then we'll back off again:
+ */
+ p->numa_scan_period = sysctl_sched_numa_scan_period_min;
}

#endif /* CONFIG_NUMA_BALANCING */
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8f0e6ba..59fea2e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -865,8 +865,10 @@ static void task_numa_placement(struct task_struct *p)
}
}

- if (max_node != p->numa_max_node)
+ if (max_node != p->numa_max_node) {
sched_setnuma(p, max_node, task_numa_shared(p));
+ goto out_backoff;
+ }

p->numa_migrate_seq++;
if (sched_feat(NUMA_SETTLE) &&
@@ -882,7 +884,11 @@ static void task_numa_placement(struct task_struct *p)
if (shared != task_numa_shared(p)) {
sched_setnuma(p, p->numa_max_node, shared);
p->numa_migrate_seq = 0;
+ goto out_backoff;
}
+ return;
+out_backoff:
+ p->numa_scan_period = min(p->numa_scan_period * 2, sysctl_sched_numa_scan_period_max);
}

/*
--
1.7.11.7

2012-11-22 22:55:37

by Ingo Molnar

[permalink] [raw]
Subject: [PATCH 23/33] sched, numa, mm: Interleave shared tasks

Interleave tasks that are 'shared' - i.e. whose memory access patterns
indicate that they are intensively sharing memory with other tasks.

If such a task ends up converging then it switches back into the lazy
node-local policy.
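
A minimal model of the policy selection (user-space sketch; the enum
values below are simplified stand-ins for the kernel's mempolicy modes):

#include <stdio.h>

enum mempolicy_mode { MPOL_LOCAL_PREFERRED, MPOL_INTERLEAVE_SHARED };

/* -1: non-NUMA task, 0: dominantly private, 1: dominantly shared */
static int task_numa_shared_status;

static enum mempolicy_mode default_policy(void)
{
	if (task_numa_shared_status == 1)
		return MPOL_INTERLEAVE_SHARED;	/* interleave over the online nodes */

	return MPOL_LOCAL_PREFERRED;		/* lazy node-local allocation */
}

int main(void)
{
	task_numa_shared_status = -1;
	printf("non-NUMA task    -> policy %d\n", default_policy());

	task_numa_shared_status = 1;
	printf("shared NUMA task -> policy %d\n", default_policy());
	return 0;
}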

Build-Bug-Reported-by: Fengguang Wu <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Hugh Dickins <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
---
mm/mempolicy.c | 56 ++++++++++++++++++++++++++++++++++++++++++--------------
1 file changed, 42 insertions(+), 14 deletions(-)

diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 318043a..02890f2 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -111,12 +111,30 @@ enum zone_type policy_zone = 0;
/*
* run-time system-wide default policy => local allocation
*/
-static struct mempolicy default_policy = {
- .refcnt = ATOMIC_INIT(1), /* never free it */
- .mode = MPOL_PREFERRED,
- .flags = MPOL_F_LOCAL,
+
+static struct mempolicy default_policy_local = {
+ .refcnt = ATOMIC_INIT(1), /* never free it */
+ .mode = MPOL_PREFERRED,
+ .flags = MPOL_F_LOCAL,
+};
+
+/*
+ * .v.nodes is set by numa_policy_init():
+ */
+static struct mempolicy default_policy_shared = {
+ .refcnt = ATOMIC_INIT(1), /* never free it */
+ .mode = MPOL_INTERLEAVE,
+ .flags = 0,
};

+static struct mempolicy *default_policy(void)
+{
+ if (task_numa_shared(current) == 1)
+ return &default_policy_shared;
+
+ return &default_policy_local;
+}
+
static const struct mempolicy_operations {
int (*create)(struct mempolicy *pol, const nodemask_t *nodes);
/*
@@ -789,7 +807,7 @@ out:
static void get_policy_nodemask(struct mempolicy *p, nodemask_t *nodes)
{
nodes_clear(*nodes);
- if (p == &default_policy)
+ if (p == default_policy())
return;

switch (p->mode) {
@@ -864,7 +882,7 @@ static long do_get_mempolicy(int *policy, nodemask_t *nmask,
return -EINVAL;

if (!pol)
- pol = &default_policy; /* indicates default behavior */
+ pol = default_policy(); /* indicates default behavior */

if (flags & MPOL_F_NODE) {
if (flags & MPOL_F_ADDR) {
@@ -880,7 +898,7 @@ static long do_get_mempolicy(int *policy, nodemask_t *nmask,
goto out;
}
} else {
- *policy = pol == &default_policy ? MPOL_DEFAULT :
+ *policy = pol == default_policy() ? MPOL_DEFAULT :
pol->mode;
/*
* Internal mempolicy flags must be masked off before exposing
@@ -1568,7 +1586,7 @@ struct mempolicy *get_vma_policy(struct task_struct *task,
}
}
if (!pol)
- pol = &default_policy;
+ pol = default_policy();
return pol;
}

@@ -1974,7 +1992,7 @@ struct page *alloc_pages_current(gfp_t gfp, unsigned order)
unsigned int cpuset_mems_cookie;

if (!pol || in_interrupt() || (gfp & __GFP_THISNODE))
- pol = &default_policy;
+ pol = default_policy();

retry_cpuset:
cpuset_mems_cookie = get_mems_allowed();
@@ -2255,7 +2273,6 @@ int mpol_misplaced(struct page *page, struct vm_area_struct *vma, unsigned long
int best_nid = -1, page_nid;
int cpu_last_access, this_cpu;
struct mempolicy *pol;
- unsigned long pgoff;
struct zone *zone;

BUG_ON(!vma);
@@ -2271,13 +2288,22 @@ int mpol_misplaced(struct page *page, struct vm_area_struct *vma, unsigned long

switch (pol->mode) {
case MPOL_INTERLEAVE:
+ {
+ int shift;
+
BUG_ON(addr >= vma->vm_end);
BUG_ON(addr < vma->vm_start);

- pgoff = vma->vm_pgoff;
- pgoff += (addr - vma->vm_start) >> PAGE_SHIFT;
- best_nid = offset_il_node(pol, vma, pgoff);
+#ifdef CONFIG_HUGETLB_PAGE
+ if (transparent_hugepage_enabled(vma) || vma->vm_flags & VM_HUGETLB)
+ shift = HPAGE_SHIFT;
+ else
+#endif
+ shift = PAGE_SHIFT;
+
+ best_nid = interleave_nid(pol, vma, addr, shift);
break;
+ }

case MPOL_PREFERRED:
if (pol->flags & MPOL_F_LOCAL)
@@ -2492,6 +2518,8 @@ void __init numa_policy_init(void)
sizeof(struct sp_node),
0, SLAB_PANIC, NULL);

+ default_policy_shared.v.nodes = node_online_map;
+
/*
* Set interleaving policy for system init. Interleaving is only
* enabled across suitably sized nodes (default is >= 16MB), or
@@ -2712,7 +2740,7 @@ int mpol_to_str(char *buffer, int maxlen, struct mempolicy *pol, int no_context)
*/
VM_BUG_ON(maxlen < strlen("interleave") + strlen("relative") + 16);

- if (!pol || pol == &default_policy)
+ if (!pol || pol == default_policy())
mode = MPOL_DEFAULT;
else
mode = pol->mode;
--
1.7.11.7

2012-11-22 22:55:58

by Ingo Molnar

[permalink] [raw]
Subject: [PATCH 20/33] sched: Implement constant, per task Working Set Sampling (WSS) rate

From: Peter Zijlstra <[email protected]>

Previously, to probe the working set of a task, we'd use
a very simple and crude method: mark all of its address
space PROT_NONE.

That method has various (obvious) disadvantages:

- it samples the working set at dissimilar rates,
giving some tasks a sampling quality advantage
over others.

- creates performance problems for tasks with very
large working sets

- over-samples processes with large address spaces but
which only very rarely execute

Improve that method by keeping a rotating offset into the
address space that marks the current position of the scan,
and advance it by a constant rate (in a CPU cycles execution
proportional manner). If the offset reaches the last mapped
address of the mm then it starts over at the first
address.

The per-task nature of the working set sampling functionality
in this tree allows such constant rate, per task,
execution-weight proportional sampling of the working set,
with an adaptive sampling interval/frequency that goes from
once per 100 msecs up to just once per 1.6 seconds.
The current sampling volume is 256 MB per interval.

As tasks mature and converge their working set, so does the
sampling rate slow down to just a trickle, 256 MB per 1.6
seconds of CPU time executed.

This, beyond being adaptive, also rate-limits rarely
executing systems and does not over-sample on overloaded
systems.
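
A stand-alone model of the rotating scan window (illustrative only - it
scans a flat address range rather than walking VMAs, and the 256 MB
window matches the default scan size):

#include <stdio.h>

#define MB		(1024UL * 1024UL)
#define SCAN_SIZE	(256 * MB)	/* default sysctl_sched_numa_scan_size */

/*
 * Advance a rotating offset through the address space by a fixed amount
 * per invocation, wrapping around at the last mapped address:
 */
static unsigned long scan_one_period(unsigned long offset, unsigned long mm_size)
{
	unsigned long end = offset + SCAN_SIZE;

	if (end > mm_size)
		end = mm_size;

	printf("scanned [%4lu MB, %4lu MB)\n", offset / MB, end / MB);

	return end >= mm_size ? 0 : end;
}

int main(void)
{
	unsigned long mm_size = 1024 * MB;	/* 1 GB of mapped address space */
	unsigned long offset = 0;
	int i;

	for (i = 0; i < 6; i++)
		offset = scan_one_period(offset, mm_size);
	return 0;
}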

[ In AutoNUMA speak, this patch deals with the effective sampling
rate of the 'hinting page fault'. AutoNUMA's scanning is
currently rate-limited, but it is also fundamentally
single-threaded, executing in the knuma_scand kernel thread,
so the limit in AutoNUMA is global and does not scale up with
the number of CPUs, nor does it scan tasks in an execution
proportional manner.

So the idea of rate-limiting the scanning was first implemented
in the AutoNUMA tree via a global rate limit. This patch goes
beyond that by implementing an execution rate proportional
working set sampling rate that is not implemented via a single
global scanning daemon. ]

[ Dan Carpenter pointed out a possible NULL pointer dereference in the
first version of this patch. ]

Based-on-idea-by: Andrea Arcangeli <[email protected]>
Bug-Found-By: Dan Carpenter <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Hugh Dickins <[email protected]>
Link: http://lkml.kernel.org/n/[email protected]
[ Wrote changelog and fixed bug. ]
Signed-off-by: Ingo Molnar <[email protected]>
---
include/linux/mm_types.h | 1 +
include/linux/sched.h | 1 +
kernel/sched/fair.c | 41 +++++++++++++++++++++++++++++------------
kernel/sysctl.c | 7 +++++++
4 files changed, 38 insertions(+), 12 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 48760e9..5995652 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -405,6 +405,7 @@ struct mm_struct {
#endif
#ifdef CONFIG_NUMA_BALANCING
unsigned long numa_next_scan;
+ unsigned long numa_scan_offset;
int numa_scan_seq;
#endif
struct uprobes_state uprobes_state;
diff --git a/include/linux/sched.h b/include/linux/sched.h
index bb12cc3..3372aac 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2047,6 +2047,7 @@ extern enum sched_tunable_scaling sysctl_sched_tunable_scaling;

extern unsigned int sysctl_sched_numa_scan_period_min;
extern unsigned int sysctl_sched_numa_scan_period_max;
+extern unsigned int sysctl_sched_numa_scan_size;
extern unsigned int sysctl_sched_numa_settle_count;

#ifdef CONFIG_SCHED_DEBUG
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f3aeaac..151a3cd 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -825,8 +825,9 @@ static void account_numa_dequeue(struct rq *rq, struct task_struct *p)
/*
* numa task sample period in ms: 5s
*/
-unsigned int sysctl_sched_numa_scan_period_min = 5000;
-unsigned int sysctl_sched_numa_scan_period_max = 5000*16;
+unsigned int sysctl_sched_numa_scan_period_min = 100;
+unsigned int sysctl_sched_numa_scan_period_max = 100*16;
+unsigned int sysctl_sched_numa_scan_size = 256; /* MB */

/*
* Wait for the 2-sample stuff to settle before migrating again
@@ -912,6 +913,9 @@ void task_numa_work(struct callback_head *work)
unsigned long migrate, next_scan, now = jiffies;
struct task_struct *p = current;
struct mm_struct *mm = p->mm;
+ struct vm_area_struct *vma;
+ unsigned long offset, end;
+ long length;

WARN_ON_ONCE(p != container_of(work, struct task_struct, numa_work));

@@ -938,18 +942,31 @@ void task_numa_work(struct callback_head *work)
if (cmpxchg(&mm->numa_next_scan, migrate, next_scan) != migrate)
return;

- ACCESS_ONCE(mm->numa_scan_seq)++;
- {
- struct vm_area_struct *vma;
+ offset = mm->numa_scan_offset;
+ length = sysctl_sched_numa_scan_size;
+ length <<= 20;

- down_write(&mm->mmap_sem);
- for (vma = mm->mmap; vma; vma = vma->vm_next) {
- if (!vma_migratable(vma))
- continue;
- change_prot_numa(vma, vma->vm_start, vma->vm_end);
- }
- up_write(&mm->mmap_sem);
+ down_write(&mm->mmap_sem);
+ vma = find_vma(mm, offset);
+ if (!vma) {
+ ACCESS_ONCE(mm->numa_scan_seq)++;
+ offset = 0;
+ vma = mm->mmap;
+ }
+ for (; vma && length > 0; vma = vma->vm_next) {
+ if (!vma_migratable(vma))
+ continue;
+
+ offset = max(offset, vma->vm_start);
+ end = min(ALIGN(offset + length, HPAGE_SIZE), vma->vm_end);
+ length -= end - offset;
+
+ change_prot_numa(vma, offset, end);
+
+ offset = end;
}
+ mm->numa_scan_offset = offset;
+ up_write(&mm->mmap_sem);
}

/*
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 7736b9e..a14b8a4 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -367,6 +367,13 @@ static struct ctl_table kern_table[] = {
.proc_handler = proc_dointvec,
},
{
+ .procname = "sched_numa_scan_size_mb",
+ .data = &sysctl_sched_numa_scan_size,
+ .maxlen = sizeof(unsigned int),
+ .mode = 0644,
+ .proc_handler = proc_dointvec,
+ },
+ {
.procname = "sched_numa_settle_count",
.data = &sysctl_sched_numa_settle_count,
.maxlen = sizeof(unsigned int),
--
1.7.11.7

2012-11-22 22:51:14

by Ingo Molnar

[permalink] [raw]
Subject: [PATCH 15/33] sched, numa, mm, arch: Add variable locality exception

From: Peter Zijlstra <[email protected]>

Some architectures (ab)use NUMA to represent different memory
regions all cpu-local but of different latencies, such as SuperH.

The naming comes from Mel Gorman.

Named-by: Mel Gorman <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Hugh Dickins <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
---
arch/sh/mm/Kconfig | 1 +
init/Kconfig | 7 +++++++
2 files changed, 8 insertions(+)

diff --git a/arch/sh/mm/Kconfig b/arch/sh/mm/Kconfig
index cb8f992..5d2a4df 100644
--- a/arch/sh/mm/Kconfig
+++ b/arch/sh/mm/Kconfig
@@ -111,6 +111,7 @@ config VSYSCALL
config NUMA
bool "Non Uniform Memory Access (NUMA) Support"
depends on MMU && SYS_SUPPORTS_NUMA && EXPERIMENTAL
+ select ARCH_WANTS_NUMA_VARIABLE_LOCALITY
default n
help
Some SH systems have many various memories scattered around
diff --git a/init/Kconfig b/init/Kconfig
index f36c83d..b8a4a58 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -718,6 +718,13 @@ config ARCH_USES_NUMA_GENERIC_PGPROT_HUGEPAGE
depends on ARCH_USES_NUMA_GENERIC_PGPROT
depends on TRANSPARENT_HUGEPAGE

+#
+# For architectures that (ab)use NUMA to represent different memory regions
+# all cpu-local but of different latencies, such as SuperH.
+#
+config ARCH_WANTS_NUMA_VARIABLE_LOCALITY
+ bool
+
menuconfig CGROUPS
boolean "Control Group support"
depends on EVENTFD
--
1.7.11.7

2012-11-22 22:56:27

by Ingo Molnar

[permalink] [raw]
Subject: [PATCH 18/33] sched, numa, mm: Add the scanning page fault machinery

From: Peter Zijlstra <[email protected]>

Add the NUMA working set scanning/hinting page fault machinery,
with no policy yet.

The numa_migration_target() function is a derivative of
Andrea Arcangeli's code in the AutoNUMA tree - including the
comment above the function.

An additional enhancement is that instead of node granular faults
we are recording CPU granular faults, which allows us to make
a distinction between:

- 'private' tasks (who access pages that have been accessed
from the same CPU before - i.e. by the same task)

- and 'shared' tasks: who access pages that tend to have been
accessed by another CPU, i.e. another task.

We later on use this fault metric to do enhanced task and
memory placement.
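
A stand-alone model of that classification (user-space sketch; the
kernel records the last accessing CPU per page, here modelled as a
simple array):

#include <stdio.h>

#define NR_PAGES 8

/* Last CPU that touched each page (-1: never touched): */
static int page_last_cpu[NR_PAGES] = { -1, -1, -1, -1, -1, -1, -1, -1 };

static unsigned long faults_private, faults_shared;

/* Hinting fault on 'page', taken by a task currently running on 'this_cpu': */
static void numa_hinting_fault(int page, int this_cpu)
{
	int last_cpu = page_last_cpu[page];

	page_last_cpu[page] = this_cpu;

	if (last_cpu == this_cpu)
		faults_private++;	/* same CPU as last time: likely the same task */
	else
		faults_shared++;	/* another CPU touched it: likely another task */
}

int main(void)
{
	int i;

	for (i = 0; i < NR_PAGES; i++)
		numa_hinting_fault(i, 0);	/* first touch counts as shared */
	for (i = 0; i < NR_PAGES; i++)
		numa_hinting_fault(i, 0);	/* re-touch from the same CPU: private */

	printf("private=%lu shared=%lu\n", faults_private, faults_shared);
	return 0;
}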

[ The earliest versions had the mpol_misplaced() function from
Lee Schermerhorn - this was heavily modified later on. ]

Many thanks to everyone involved for these great ideas and code.

Written-by: Andrea Arcangeli <[email protected]>
Also-written-by: Lee Schermerhorn <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Hugh Dickins <[email protected]>
[ split it out of the main policy patch - as suggested by Mel Gorman ]
Signed-off-by: Ingo Molnar <[email protected]>
---
include/linux/init_task.h | 8 +++
include/linux/mempolicy.h | 6 +-
include/linux/mm_types.h | 4 ++
include/linux/sched.h | 41 ++++++++++++--
init/Kconfig | 73 +++++++++++++++++++-----
kernel/sched/core.c | 15 +++++
kernel/sysctl.c | 31 ++++++++++-
mm/huge_memory.c | 1 +
mm/mempolicy.c | 137 ++++++++++++++++++++++++++++++++++++++++++++++
9 files changed, 294 insertions(+), 22 deletions(-)

diff --git a/include/linux/init_task.h b/include/linux/init_task.h
index 6d087c5..ed98982 100644
--- a/include/linux/init_task.h
+++ b/include/linux/init_task.h
@@ -143,6 +143,13 @@ extern struct task_group root_task_group;

#define INIT_TASK_COMM "swapper"

+#ifdef CONFIG_NUMA_BALANCING
+# define INIT_TASK_NUMA(tsk) \
+ .numa_shared = -1,
+#else
+# define INIT_TASK_NUMA(tsk)
+#endif
+
/*
* INIT_TASK is used to set up the first task table, touch at
* your own risk!. Base=0, limit=0x1fffff (=2MB)
@@ -210,6 +217,7 @@ extern struct task_group root_task_group;
INIT_TRACE_RECURSION \
INIT_TASK_RCU_PREEMPT(tsk) \
INIT_CPUSET_SEQ \
+ INIT_TASK_NUMA(tsk) \
}


diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
index f329306..c511e25 100644
--- a/include/linux/mempolicy.h
+++ b/include/linux/mempolicy.h
@@ -198,6 +198,8 @@ static inline int vma_migratable(struct vm_area_struct *vma)
return 1;
}

+extern int mpol_misplaced(struct page *, struct vm_area_struct *, unsigned long);
+
#else

struct mempolicy {};
@@ -323,11 +325,11 @@ static inline int mpol_to_str(char *buffer, int maxlen, struct mempolicy *pol,
return 0;
}

-#endif /* CONFIG_NUMA */
-
static inline int mpol_misplaced(struct page *page, struct vm_area_struct *vma,
unsigned long address)
{
return -1; /* no node preference */
}
+
+#endif /* CONFIG_NUMA */
#endif
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 7e9f758..48760e9 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -403,6 +403,10 @@ struct mm_struct {
#ifdef CONFIG_CPUMASK_OFFSTACK
struct cpumask cpumask_allocation;
#endif
+#ifdef CONFIG_NUMA_BALANCING
+ unsigned long numa_next_scan;
+ int numa_scan_seq;
+#endif
struct uprobes_state uprobes_state;
};

diff --git a/include/linux/sched.h b/include/linux/sched.h
index a0a2808..418d405 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1501,6 +1501,18 @@ struct task_struct {
short il_next;
short pref_node_fork;
#endif
+#ifdef CONFIG_NUMA_BALANCING
+ int numa_shared;
+ int numa_max_node;
+ int numa_scan_seq;
+ int numa_migrate_seq;
+ unsigned int numa_scan_period;
+ u64 node_stamp; /* migration stamp */
+ unsigned long numa_weight;
+ unsigned long *numa_faults;
+ struct callback_head numa_work;
+#endif /* CONFIG_NUMA_BALANCING */
+
struct rcu_head rcu;

/*
@@ -1575,7 +1587,25 @@ struct task_struct {
/* Future-safe accessor for struct task_struct's cpus_allowed. */
#define tsk_cpus_allowed(tsk) (&(tsk)->cpus_allowed)

+#ifdef CONFIG_NUMA_BALANCING
+extern void task_numa_fault(int node, int cpu, int pages);
+#else
static inline void task_numa_fault(int node, int cpu, int pages) { }
+#endif /* CONFIG_NUMA_BALANCING */
+
+/*
+ * -1: non-NUMA task
+ * 0: NUMA task with a dominantly 'private' working set
+ * 1: NUMA task with a dominantly 'shared' working set
+ */
+static inline int task_numa_shared(struct task_struct *p)
+{
+#ifdef CONFIG_NUMA_BALANCING
+ return p->numa_shared;
+#else
+ return -1;
+#endif
+}

/*
* Priority of a process goes from 0..MAX_PRIO-1, valid RT
@@ -2014,6 +2044,10 @@ enum sched_tunable_scaling {
};
extern enum sched_tunable_scaling sysctl_sched_tunable_scaling;

+extern unsigned int sysctl_sched_numa_scan_period_min;
+extern unsigned int sysctl_sched_numa_scan_period_max;
+extern unsigned int sysctl_sched_numa_settle_count;
+
#ifdef CONFIG_SCHED_DEBUG
extern unsigned int sysctl_sched_migration_cost;
extern unsigned int sysctl_sched_nr_migrate;
@@ -2024,18 +2058,17 @@ extern unsigned int sysctl_sched_shares_window;
int sched_proc_update_handler(struct ctl_table *table, int write,
void __user *buffer, size_t *length,
loff_t *ppos);
-#endif
-#ifdef CONFIG_SCHED_DEBUG
+
static inline unsigned int get_sysctl_timer_migration(void)
{
return sysctl_timer_migration;
}
-#else
+#else /* CONFIG_SCHED_DEBUG */
static inline unsigned int get_sysctl_timer_migration(void)
{
return 1;
}
-#endif
+#endif /* CONFIG_SCHED_DEBUG */
extern unsigned int sysctl_sched_rt_period;
extern int sysctl_sched_rt_runtime;

diff --git a/init/Kconfig b/init/Kconfig
index cf3e79c..9511f0d 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -697,6 +697,65 @@ config HAVE_UNSTABLE_SCHED_CLOCK
bool

#
+# For architectures that (ab)use NUMA to represent different memory regions
+# all cpu-local but of different latencies, such as SuperH.
+#
+config ARCH_WANTS_NUMA_VARIABLE_LOCALITY
+ bool
+
+#
+# For architectures that want to enable NUMA-affine scheduling
+# and memory placement:
+#
+config ARCH_SUPPORTS_NUMA_BALANCING
+ bool
+
+#
+# For architectures that want to reuse the PROT_NONE bits
+# to implement NUMA protection bits:
+#
+config ARCH_WANTS_NUMA_GENERIC_PGPROT
+ bool
+
+config NUMA_BALANCING
+ bool "NUMA-optimizing scheduler"
+ default n
+ depends on ARCH_SUPPORTS_NUMA_BALANCING
+ depends on !ARCH_WANTS_NUMA_VARIABLE_LOCALITY
+ depends on SMP && NUMA && MIGRATION
+ help
+ This option enables NUMA-aware, transparent, automatic
+ placement optimizations of memory, tasks and task groups.
+
+ The optimizations work by (transparently) runtime sampling the
+ workload sharing relationship between threads and processes
+ of long-run workloads, and scheduling them based on these
+ measured inter-task relationships (or the lack thereof).
+
+ ("Long-run" means several seconds of CPU runtime at least.)
+
+ Tasks that predominantly perform their own processing, without
+ interacting with other tasks much will be independently balanced
+ to a CPU and their working set memory will migrate to that CPU/node.
+
+ Tasks that share a lot of data with each other will be attempted to
+ be scheduled on as few nodes as possible, with their memory
+ following them there and being distributed between those nodes.
+
+ This optimization can improve the performance of long-run CPU-bound
+ workloads by 10% or more. The sampling and migration has a small
+ but nonzero cost, so if your NUMA workload is already perfectly
+ placed (for example by use of explicit CPU and memory bindings,
+ or because the stock scheduler does a good job already) then you
+ probably don't need this feature.
+
+ [ On non-NUMA systems this feature will not be active. You can query
+ whether your system is a NUMA system via looking at the output of
+ "numactl --hardware". ]
+
+ Say N if unsure.
+
+#
# Helper Kconfig switches to express compound feature dependencies
# and thus make the .h/.c code more readable:
#
@@ -718,20 +777,6 @@ config ARCH_USES_NUMA_GENERIC_PGPROT_HUGEPAGE
depends on ARCH_USES_NUMA_GENERIC_PGPROT
depends on TRANSPARENT_HUGEPAGE

-#
-# For architectures that (ab)use NUMA to represent different memory regions
-# all cpu-local but of different latencies, such as SuperH.
-#
-config ARCH_WANTS_NUMA_VARIABLE_LOCALITY
- bool
-
-#
-# For architectures that want to enable the PROT_NONE driven,
-# NUMA-affine scheduler balancing logic:
-#
-config ARCH_SUPPORTS_NUMA_BALANCING
- bool
-
menuconfig CGROUPS
boolean "Control Group support"
depends on EVENTFD
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 5dae0d2..3611f5f 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1544,6 +1544,21 @@ static void __sched_fork(struct task_struct *p)
#ifdef CONFIG_PREEMPT_NOTIFIERS
INIT_HLIST_HEAD(&p->preempt_notifiers);
#endif
+
+#ifdef CONFIG_NUMA_BALANCING
+ if (p->mm && atomic_read(&p->mm->mm_users) == 1) {
+ p->mm->numa_next_scan = jiffies;
+ p->mm->numa_scan_seq = 0;
+ }
+
+ p->numa_shared = -1;
+ p->node_stamp = 0ULL;
+ p->numa_scan_seq = p->mm ? p->mm->numa_scan_seq : 0;
+ p->numa_migrate_seq = 2;
+ p->numa_faults = NULL;
+ p->numa_scan_period = sysctl_sched_numa_scan_period_min;
+ p->numa_work.next = &p->numa_work;
+#endif /* CONFIG_NUMA_BALANCING */
}

/*
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index b0fa5ad..7736b9e 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -256,9 +256,11 @@ static int min_sched_granularity_ns = 100000; /* 100 usecs */
static int max_sched_granularity_ns = NSEC_PER_SEC; /* 1 second */
static int min_wakeup_granularity_ns; /* 0 usecs */
static int max_wakeup_granularity_ns = NSEC_PER_SEC; /* 1 second */
+#ifdef CONFIG_SMP
static int min_sched_tunable_scaling = SCHED_TUNABLESCALING_NONE;
static int max_sched_tunable_scaling = SCHED_TUNABLESCALING_END-1;
-#endif
+#endif /* CONFIG_SMP */
+#endif /* CONFIG_SCHED_DEBUG */

#ifdef CONFIG_COMPACTION
static int min_extfrag_threshold;
@@ -301,6 +303,7 @@ static struct ctl_table kern_table[] = {
.extra1 = &min_wakeup_granularity_ns,
.extra2 = &max_wakeup_granularity_ns,
},
+#ifdef CONFIG_SMP
{
.procname = "sched_tunable_scaling",
.data = &sysctl_sched_tunable_scaling,
@@ -347,7 +350,31 @@ static struct ctl_table kern_table[] = {
.extra1 = &zero,
.extra2 = &one,
},
-#endif
+#endif /* CONFIG_SMP */
+#ifdef CONFIG_NUMA_BALANCING
+ {
+ .procname = "sched_numa_scan_period_min_ms",
+ .data = &sysctl_sched_numa_scan_period_min,
+ .maxlen = sizeof(unsigned int),
+ .mode = 0644,
+ .proc_handler = proc_dointvec,
+ },
+ {
+ .procname = "sched_numa_scan_period_max_ms",
+ .data = &sysctl_sched_numa_scan_period_max,
+ .maxlen = sizeof(unsigned int),
+ .mode = 0644,
+ .proc_handler = proc_dointvec,
+ },
+ {
+ .procname = "sched_numa_settle_count",
+ .data = &sysctl_sched_numa_settle_count,
+ .maxlen = sizeof(unsigned int),
+ .mode = 0644,
+ .proc_handler = proc_dointvec,
+ },
+#endif /* CONFIG_NUMA_BALANCING */
+#endif /* CONFIG_SCHED_DEBUG */
{
.procname = "sched_rt_period_us",
.data = &sysctl_sched_rt_period,
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 814e3ea..92e101f 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1456,6 +1456,7 @@ static void __split_huge_page_refcount(struct page *page)
page_tail->mapping = page->mapping;

page_tail->index = page->index + i;
+ page_xchg_last_cpu(page, page_last_cpu(page_tail));

BUG_ON(!PageAnon(page_tail));
BUG_ON(!PageUptodate(page_tail));
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index d04a8a5..318043a 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -2175,6 +2175,143 @@ static void sp_free(struct sp_node *n)
kmem_cache_free(sn_cache, n);
}

+/*
+ * Multi-stage node selection is used in conjunction with a periodic
+ * migration fault to build a temporal task<->page relation. By
+ * using a two-stage filter we remove short/unlikely relations.
+ *
+ * Using P(p) ~ n_p / n_t as per frequentist probability, we can
+ * equate a task's usage of a particular page (n_p) per total usage
+ * of this page (n_t) (in a given time-span) to a probability.
+ *
+ * Our periodic faults will then sample this probability and getting
+ * the same result twice in a row, given these samples are fully
+ * independent, is then given by P(n)^2, provided our sample period
+ * is sufficiently short compared to the usage pattern.
+ *
+ * This quadric squishes small probabilities, making it less likely
+ * we act on an unlikely task<->page relation.
+ *
+ * Return the best node ID this page should be on, or -1 if it should
+ * stay where it is.
+ */
+static int
+numa_migration_target(struct page *page, int page_nid,
+ struct task_struct *p, int this_cpu,
+ int cpu_last_access)
+{
+ int nid_last_access;
+ int this_nid;
+
+ if (task_numa_shared(p) < 0)
+ return -1;
+
+ /*
+ * Possibly migrate towards the current node, depends on
+ * task_numa_placement() and access details.
+ */
+ nid_last_access = cpu_to_node(cpu_last_access);
+ this_nid = cpu_to_node(this_cpu);
+
+ if (nid_last_access != this_nid) {
+ /*
+ * 'Access miss': the page got last accessed from a remote node.
+ */
+ return -1;
+ }
+ /*
+ * 'Access hit': the page got last accessed from our node.
+ *
+ * Migrate the page if needed.
+ */
+
+ /* The page is already on this node: */
+ if (page_nid == this_nid)
+ return -1;
+
+ return this_nid;
+}
+
+/**
+ * mpol_misplaced - check whether current page node is valid in policy
+ *
+ * @page - page to be checked
+ * @vma - vm area where page mapped
+ * @addr - virtual address where page mapped
+ * @multi - use multi-stage node binding
+ *
+ * Lookup current policy node id for vma,addr and "compare to" page's
+ * node id.
+ *
+ * Returns:
+ * -1 - not misplaced, page is in the right node
+ * node - node id where the page should be
+ *
+ * Policy determination "mimics" alloc_page_vma().
+ * Called from fault path where we know the vma and faulting address.
+ */
+int mpol_misplaced(struct page *page, struct vm_area_struct *vma, unsigned long addr)
+{
+ int best_nid = -1, page_nid;
+ int cpu_last_access, this_cpu;
+ struct mempolicy *pol;
+ unsigned long pgoff;
+ struct zone *zone;
+
+ BUG_ON(!vma);
+
+ this_cpu = raw_smp_processor_id();
+ page_nid = page_to_nid(page);
+
+ cpu_last_access = page_xchg_last_cpu(page, this_cpu);
+
+ pol = get_vma_policy(current, vma, addr);
+ if (!(task_numa_shared(current) >= 0))
+ goto out_keep_page;
+
+ switch (pol->mode) {
+ case MPOL_INTERLEAVE:
+ BUG_ON(addr >= vma->vm_end);
+ BUG_ON(addr < vma->vm_start);
+
+ pgoff = vma->vm_pgoff;
+ pgoff += (addr - vma->vm_start) >> PAGE_SHIFT;
+ best_nid = offset_il_node(pol, vma, pgoff);
+ break;
+
+ case MPOL_PREFERRED:
+ if (pol->flags & MPOL_F_LOCAL)
+ best_nid = numa_migration_target(page, page_nid, current, this_cpu, cpu_last_access);
+ else
+ best_nid = pol->v.preferred_node;
+ break;
+
+ case MPOL_BIND:
+ /*
+ * allows binding to multiple nodes.
+ * use current page if in policy nodemask,
+ * else select nearest allowed node, if any.
+ * If no allowed nodes, use current [!misplaced].
+ */
+ if (node_isset(page_nid, pol->v.nodes))
+ goto out_keep_page;
+ (void)first_zones_zonelist(
+ node_zonelist(numa_node_id(), GFP_HIGHUSER),
+ gfp_zone(GFP_HIGHUSER),
+ &pol->v.nodes, &zone);
+ best_nid = zone->node;
+ break;
+
+ default:
+ BUG();
+ }
+
+out_keep_page:
+ mpol_cond_put(pol);
+
+ return best_nid;
+}
+
static void sp_delete(struct shared_policy *sp, struct sp_node *n)
{
pr_debug("deleting %lx-l%lx\n", n->start, n->end);
--
1.7.11.7

2012-11-22 22:51:11

by Ingo Molnar

[permalink] [raw]
Subject: [PATCH 14/33] mm/migration: Improve migrate_misplaced_page()

From: Mel Gorman <[email protected]>

Fix, improve and clean up migrate_misplaced_page() to
reuse migrate_pages() and to check for zone watermarks
to make sure we don't overload the node.

This was originally based on Peter's patch "mm/migrate: Introduce
migrate_misplaced_page()" but borrows extremely heavily from Andrea's
"autonuma: memory follows CPU algorithm and task/mm_autonuma stats
collection".

Based-on-work-by: Lee Schermerhorn <[email protected]>
Based-on-work-by: Peter Zijlstra <[email protected]>
Based-on-work-by: Andrea Arcangeli <[email protected]>
Signed-off-by: Mel Gorman <[email protected]>
Reviewed-by: Rik van Riel <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Linux-MM <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
[ Adapted to the numa/core tree. Kept Mel's patch separate to retain
original authorship for the authors. ]
Signed-off-by: Ingo Molnar <[email protected]>
---
include/linux/migrate_mode.h | 3 -
mm/memory.c | 13 ++--
mm/migrate.c | 143 +++++++++++++++++++++++++++----------------
3 files changed, 95 insertions(+), 64 deletions(-)

diff --git a/include/linux/migrate_mode.h b/include/linux/migrate_mode.h
index 40b37dc..ebf3d89 100644
--- a/include/linux/migrate_mode.h
+++ b/include/linux/migrate_mode.h
@@ -6,14 +6,11 @@
* on most operations but not ->writepage as the potential stall time
* is too significant
* MIGRATE_SYNC will block when migrating pages
- * MIGRATE_FAULT called from the fault path to migrate-on-fault for mempolicy
- * this path has an extra reference count
*/
enum migrate_mode {
MIGRATE_ASYNC,
MIGRATE_SYNC_LIGHT,
MIGRATE_SYNC,
- MIGRATE_FAULT,
};

#endif /* MIGRATE_MODE_H_INCLUDED */
diff --git a/mm/memory.c b/mm/memory.c
index 23ad2eb..52ad29d 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3492,28 +3492,25 @@ out_pte_upgrade_unlock:

out_unlock:
pte_unmap_unlock(ptep, ptl);
-out:
+
if (page) {
task_numa_fault(page_nid, last_cpu, 1);
put_page(page);
}
-
+out:
return 0;

migrate:
pte_unmap_unlock(ptep, ptl);

- if (!migrate_misplaced_page(page, node)) {
- page_nid = node;
+ if (migrate_misplaced_page(page, node)) {
goto out;
}
+ page = NULL;

ptep = pte_offset_map_lock(mm, pmd, address, &ptl);
- if (!pte_same(*ptep, entry)) {
- put_page(page);
- page = NULL;
+ if (!pte_same(*ptep, entry))
goto out_unlock;
- }

goto out_pte_upgrade_unlock;
}
diff --git a/mm/migrate.c b/mm/migrate.c
index b89062d..16a4709 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -225,7 +225,7 @@ static bool buffer_migrate_lock_buffers(struct buffer_head *head,
struct buffer_head *bh = head;

/* Simple case, sync compaction */
- if (mode != MIGRATE_ASYNC && mode != MIGRATE_FAULT) {
+ if (mode != MIGRATE_ASYNC) {
do {
get_bh(bh);
lock_buffer(bh);
@@ -282,19 +282,9 @@ static int migrate_page_move_mapping(struct address_space *mapping,
int expected_count = 0;
void **pslot;

- if (mode == MIGRATE_FAULT) {
- /*
- * MIGRATE_FAULT has an extra reference on the page and
- * otherwise acts like ASYNC, no point in delaying the
- * fault, we'll try again next time.
- */
- expected_count++;
- }
-
if (!mapping) {
/* Anonymous page without mapping */
- expected_count += 1;
- if (page_count(page) != expected_count)
+ if (page_count(page) != 1)
return -EAGAIN;
return 0;
}
@@ -304,7 +294,7 @@ static int migrate_page_move_mapping(struct address_space *mapping,
pslot = radix_tree_lookup_slot(&mapping->page_tree,
page_index(page));

- expected_count += 2 + page_has_private(page);
+ expected_count = 2 + page_has_private(page);
if (page_count(page) != expected_count ||
radix_tree_deref_slot_protected(pslot, &mapping->tree_lock) != page) {
spin_unlock_irq(&mapping->tree_lock);
@@ -323,7 +313,7 @@ static int migrate_page_move_mapping(struct address_space *mapping,
* the mapping back due to an elevated page count, we would have to
* block waiting on other references to be dropped.
*/
- if ((mode == MIGRATE_ASYNC || mode == MIGRATE_FAULT) && head &&
+ if (mode == MIGRATE_ASYNC && head &&
!buffer_migrate_lock_buffers(head, mode)) {
page_unfreeze_refs(page, expected_count);
spin_unlock_irq(&mapping->tree_lock);
@@ -531,7 +521,7 @@ int buffer_migrate_page(struct address_space *mapping,
* with an IRQ-safe spinlock held. In the sync case, the buffers
* need to be locked now
*/
- if (mode != MIGRATE_ASYNC && mode != MIGRATE_FAULT)
+ if (mode != MIGRATE_ASYNC)
BUG_ON(!buffer_migrate_lock_buffers(head, mode));

ClearPagePrivate(page);
@@ -697,7 +687,7 @@ static int __unmap_and_move(struct page *page, struct page *newpage,
struct anon_vma *anon_vma = NULL;

if (!trylock_page(page)) {
- if (!force || mode == MIGRATE_ASYNC || mode == MIGRATE_FAULT)
+ if (!force || mode == MIGRATE_ASYNC)
goto out;

/*
@@ -1415,55 +1405,102 @@ int migrate_vmas(struct mm_struct *mm, const nodemask_t *to,
}

/*
+ * Returns true if this is a safe migration target node for misplaced NUMA
+ * pages. Currently it only checks the watermarks which is a bit crude.
+ */
+static bool migrate_balanced_pgdat(struct pglist_data *pgdat,
+ int nr_migrate_pages)
+{
+ int z;
+
+ for (z = pgdat->nr_zones - 1; z >= 0; z--) {
+ struct zone *zone = pgdat->node_zones + z;
+
+ if (!populated_zone(zone))
+ continue;
+
+ if (zone->all_unreclaimable)
+ continue;
+
+ /* Avoid waking kswapd by allocating pages_to_migrate pages. */
+ if (!zone_watermark_ok(zone, 0,
+ high_wmark_pages(zone) +
+ nr_migrate_pages,
+ 0, 0))
+ continue;
+ return true;
+ }
+ return false;
+}
+
+static struct page *alloc_misplaced_dst_page(struct page *page,
+ unsigned long data,
+ int **result)
+{
+ int nid = (int) data;
+ struct page *newpage;
+
+ newpage = alloc_pages_exact_node(nid,
+ (GFP_HIGHUSER_MOVABLE | GFP_THISNODE |
+ __GFP_NOMEMALLOC | __GFP_NORETRY |
+ __GFP_NOWARN) &
+ ~GFP_IOFS, 0);
+ return newpage;
+}
+
+/*
* Attempt to migrate a misplaced page to the specified destination
- * node.
+ * node. Caller is expected to have an elevated reference count on
+ * the page that will be dropped by this function before returning.
*/
int migrate_misplaced_page(struct page *page, int node)
{
- struct address_space *mapping = page_mapping(page);
- int page_lru = page_is_file_cache(page);
- struct page *newpage;
- int ret = -EAGAIN;
- gfp_t gfp = GFP_HIGHUSER_MOVABLE;
+ int isolated = 0;
+ LIST_HEAD(migratepages);

/*
- * Never wait for allocations just to migrate on fault, but don't dip
- * into reserves. And, only accept pages from the specified node. No
- * sense migrating to a different "misplaced" page!
+ * Don't migrate pages that are mapped in multiple processes.
+ * TODO: Handle false sharing detection instead of this hammer
*/
- if (mapping)
- gfp = mapping_gfp_mask(mapping);
- gfp &= ~__GFP_WAIT;
- gfp |= __GFP_NOMEMALLOC | GFP_THISNODE;
-
- newpage = alloc_pages_node(node, gfp, 0);
- if (!newpage) {
- ret = -ENOMEM;
+ if (page_mapcount(page) != 1)
goto out;
- }

- if (isolate_lru_page(page)) {
- ret = -EBUSY;
- goto put_new;
+ /* Avoid migrating to a node that is nearly full */
+ if (migrate_balanced_pgdat(NODE_DATA(node), 1)) {
+ int page_lru;
+
+ if (isolate_lru_page(page)) {
+ put_page(page);
+ goto out;
+ }
+ isolated = 1;
+
+ /*
+ * Page is isolated which takes a reference count so now the
+ * callers reference can be safely dropped without the page
+ * disappearing underneath us during migration
+ */
+ put_page(page);
+
+ page_lru = page_is_file_cache(page);
+ inc_zone_page_state(page, NR_ISOLATED_ANON + page_lru);
+ list_add(&page->lru, &migratepages);
}

- inc_zone_page_state(page, NR_ISOLATED_ANON + page_lru);
- ret = __unmap_and_move(page, newpage, 0, 0, MIGRATE_FAULT);
- /*
- * A page that has been migrated has all references removed and will be
- * freed. A page that has not been migrated will have kepts its
- * references and be restored.
- */
- dec_zone_page_state(page, NR_ISOLATED_ANON + page_lru);
- putback_lru_page(page);
-put_new:
- /*
- * Move the new page to the LRU. If migration was not successful
- * then this will free the page.
- */
- putback_lru_page(newpage);
+ if (isolated) {
+ int nr_remaining;
+
+ nr_remaining = migrate_pages(&migratepages,
+ alloc_misplaced_dst_page,
+ node, false, MIGRATE_ASYNC);
+ if (nr_remaining) {
+ putback_lru_pages(&migratepages);
+ isolated = 0;
+ }
+ }
+ BUG_ON(!list_empty(&migratepages));
out:
- return ret;
+ return isolated;
}

#endif /* CONFIG_NUMA */
--
1.7.11.7

2012-11-22 22:56:50

by Ingo Molnar

[permalink] [raw]
Subject: [PATCH 17/33] sched, mm, x86: Add the ARCH_SUPPORTS_NUMA_BALANCING flag

Allow architectures to opt in to the adaptive affinity NUMA balancing code.
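
An architecture opts in by selecting this symbol from its Kconfig entry.
A minimal sketch of what that looks like (illustrative only - the actual
per-architecture enablement is done in separate patches of this series):

  # arch/x86/Kconfig (sketch):
  config X86
          ...
          select ARCH_SUPPORTS_NUMA_BALANCING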

Cc: Peter Zijlstra <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Hugh Dickins <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
---
init/Kconfig | 7 +++++++
1 file changed, 7 insertions(+)

diff --git a/init/Kconfig b/init/Kconfig
index b8a4a58..cf3e79c 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -725,6 +725,13 @@ config ARCH_USES_NUMA_GENERIC_PGPROT_HUGEPAGE
config ARCH_WANTS_NUMA_VARIABLE_LOCALITY
bool

+#
+# For architectures that want to enable the PROT_NONE driven,
+# NUMA-affine scheduler balancing logic:
+#
+config ARCH_SUPPORTS_NUMA_BALANCING
+ bool
+
menuconfig CGROUPS
boolean "Control Group support"
depends on EVENTFD
--
1.7.11.7

2012-11-22 22:57:25

by Ingo Molnar

[permalink] [raw]
Subject: [PATCH 13/33] mm/migrate: Introduce migrate_misplaced_page()

From: Peter Zijlstra <[email protected]>

Add migrate_misplaced_page() which deals with migrating pages from
faults.

This includes adding a new MIGRATE_FAULT migration mode to
deal with the extra page reference required due to having to look up
the page.
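
For reference, a rough sketch of the fault-side calling pattern this mode
is designed for - the caller picks up its own page reference while still
holding the PTE lock, which is the "extra" reference mentioned above
(names as used by do_numa_page() elsewhere in this series):

	/* under the PTE lock: */
	page = vm_normal_page(vma, address, entry);
	get_page(page);				/* the extra reference */
	pte_unmap_unlock(ptep, ptl);

	if (!migrate_misplaced_page(page, node))
		page_nid = node;		/* migration succeeded */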

Based-on-work-by: Lee Schermerhorn <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Reviewed-by: Rik van Riel <[email protected]>
Cc: Paul Turner <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Hugh Dickins <[email protected]>
Link: http://lkml.kernel.org/n/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
include/linux/migrate.h | 4 ++-
include/linux/migrate_mode.h | 3 ++
mm/migrate.c | 79 +++++++++++++++++++++++++++++++++++++++-----
3 files changed, 77 insertions(+), 9 deletions(-)

diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index afd9af1..72665c9 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -30,6 +30,7 @@ extern int migrate_vmas(struct mm_struct *mm,
extern void migrate_page_copy(struct page *newpage, struct page *page);
extern int migrate_huge_page_move_mapping(struct address_space *mapping,
struct page *newpage, struct page *page);
+extern int migrate_misplaced_page(struct page *page, int node);
#else

static inline void putback_lru_pages(struct list_head *l) {}
@@ -63,10 +64,11 @@ static inline int migrate_huge_page_move_mapping(struct address_space *mapping,
#define migrate_page NULL
#define fail_migrate_page NULL

-#endif /* CONFIG_MIGRATION */
static inline
int migrate_misplaced_page(struct page *page, int node)
{
return -EAGAIN; /* can't migrate now */
}
+
+#endif /* CONFIG_MIGRATION */
#endif /* _LINUX_MIGRATE_H */
diff --git a/include/linux/migrate_mode.h b/include/linux/migrate_mode.h
index ebf3d89..40b37dc 100644
--- a/include/linux/migrate_mode.h
+++ b/include/linux/migrate_mode.h
@@ -6,11 +6,14 @@
* on most operations but not ->writepage as the potential stall time
* is too significant
* MIGRATE_SYNC will block when migrating pages
+ * MIGRATE_FAULT called from the fault path to migrate-on-fault for mempolicy
+ * this path has an extra reference count
*/
enum migrate_mode {
MIGRATE_ASYNC,
MIGRATE_SYNC_LIGHT,
MIGRATE_SYNC,
+ MIGRATE_FAULT,
};

#endif /* MIGRATE_MODE_H_INCLUDED */
diff --git a/mm/migrate.c b/mm/migrate.c
index 4ba45f4..b89062d 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -225,7 +225,7 @@ static bool buffer_migrate_lock_buffers(struct buffer_head *head,
struct buffer_head *bh = head;

/* Simple case, sync compaction */
- if (mode != MIGRATE_ASYNC) {
+ if (mode != MIGRATE_ASYNC && mode != MIGRATE_FAULT) {
do {
get_bh(bh);
lock_buffer(bh);
@@ -279,12 +279,22 @@ static int migrate_page_move_mapping(struct address_space *mapping,
struct page *newpage, struct page *page,
struct buffer_head *head, enum migrate_mode mode)
{
- int expected_count;
+ int expected_count = 0;
void **pslot;

+ if (mode == MIGRATE_FAULT) {
+ /*
+ * MIGRATE_FAULT has an extra reference on the page and
+ * otherwise acts like ASYNC, no point in delaying the
+ * fault, we'll try again next time.
+ */
+ expected_count++;
+ }
+
if (!mapping) {
/* Anonymous page without mapping */
- if (page_count(page) != 1)
+ expected_count += 1;
+ if (page_count(page) != expected_count)
return -EAGAIN;
return 0;
}
@@ -294,7 +304,7 @@ static int migrate_page_move_mapping(struct address_space *mapping,
pslot = radix_tree_lookup_slot(&mapping->page_tree,
page_index(page));

- expected_count = 2 + page_has_private(page);
+ expected_count += 2 + page_has_private(page);
if (page_count(page) != expected_count ||
radix_tree_deref_slot_protected(pslot, &mapping->tree_lock) != page) {
spin_unlock_irq(&mapping->tree_lock);
@@ -313,7 +323,7 @@ static int migrate_page_move_mapping(struct address_space *mapping,
* the mapping back due to an elevated page count, we would have to
* block waiting on other references to be dropped.
*/
- if (mode == MIGRATE_ASYNC && head &&
+ if ((mode == MIGRATE_ASYNC || mode == MIGRATE_FAULT) && head &&
!buffer_migrate_lock_buffers(head, mode)) {
page_unfreeze_refs(page, expected_count);
spin_unlock_irq(&mapping->tree_lock);
@@ -521,7 +531,7 @@ int buffer_migrate_page(struct address_space *mapping,
* with an IRQ-safe spinlock held. In the sync case, the buffers
* need to be locked now
*/
- if (mode != MIGRATE_ASYNC)
+ if (mode != MIGRATE_ASYNC && mode != MIGRATE_FAULT)
BUG_ON(!buffer_migrate_lock_buffers(head, mode));

ClearPagePrivate(page);
@@ -687,7 +697,7 @@ static int __unmap_and_move(struct page *page, struct page *newpage,
struct anon_vma *anon_vma = NULL;

if (!trylock_page(page)) {
- if (!force || mode == MIGRATE_ASYNC)
+ if (!force || mode == MIGRATE_ASYNC || mode == MIGRATE_FAULT)
goto out;

/*
@@ -1403,4 +1413,57 @@ int migrate_vmas(struct mm_struct *mm, const nodemask_t *to,
}
return err;
}
-#endif
+
+/*
+ * Attempt to migrate a misplaced page to the specified destination
+ * node.
+ */
+int migrate_misplaced_page(struct page *page, int node)
+{
+ struct address_space *mapping = page_mapping(page);
+ int page_lru = page_is_file_cache(page);
+ struct page *newpage;
+ int ret = -EAGAIN;
+ gfp_t gfp = GFP_HIGHUSER_MOVABLE;
+
+ /*
+ * Never wait for allocations just to migrate on fault, but don't dip
+ * into reserves. And, only accept pages from the specified node. No
+ * sense migrating to a different "misplaced" page!
+ */
+ if (mapping)
+ gfp = mapping_gfp_mask(mapping);
+ gfp &= ~__GFP_WAIT;
+ gfp |= __GFP_NOMEMALLOC | GFP_THISNODE;
+
+ newpage = alloc_pages_node(node, gfp, 0);
+ if (!newpage) {
+ ret = -ENOMEM;
+ goto out;
+ }
+
+ if (isolate_lru_page(page)) {
+ ret = -EBUSY;
+ goto put_new;
+ }
+
+ inc_zone_page_state(page, NR_ISOLATED_ANON + page_lru);
+ ret = __unmap_and_move(page, newpage, 0, 0, MIGRATE_FAULT);
+ /*
+ * A page that has been migrated has all references removed and will be
+ * freed. A page that has not been migrated will have kepts its
+ * references and be restored.
+ */
+ dec_zone_page_state(page, NR_ISOLATED_ANON + page_lru);
+ putback_lru_page(page);
+put_new:
+ /*
+ * Move the new page to the LRU. If migration was not successful
+ * then this will free the page.
+ */
+ putback_lru_page(newpage);
+out:
+ return ret;
+}
+
+#endif /* CONFIG_NUMA */
--
1.7.11.7

2012-11-22 22:51:00

by Ingo Molnar

[permalink] [raw]
Subject: [PATCH 11/33] sched, numa, mm: Describe the NUMA scheduling problem formally

From: Peter Zijlstra <[email protected]>

This is probably a first: a formal description of a complex, high-level
computing problem within the kernel source.

Signed-off-by: Peter Zijlstra <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Mike Galbraith <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Hugh Dickins <[email protected]>
Link: http://lkml.kernel.org/n/[email protected]
[ Next step: generate the kernel source from such formal descriptions and retire to a tropical island! ]
Signed-off-by: Ingo Molnar <[email protected]>
---
Documentation/scheduler/numa-problem.txt | 230 +++++++++++++++++++++++++++++++
1 file changed, 230 insertions(+)
create mode 100644 Documentation/scheduler/numa-problem.txt

diff --git a/Documentation/scheduler/numa-problem.txt b/Documentation/scheduler/numa-problem.txt
new file mode 100644
index 0000000..a5d2fee
--- /dev/null
+++ b/Documentation/scheduler/numa-problem.txt
@@ -0,0 +1,230 @@
+
+
+Effective NUMA scheduling problem statement, described formally:
+
+ * minimize interconnect traffic
+
+For each task 't_i' we have memory; this memory can be spread over multiple
+physical nodes. Let us denote this as 'p_i,k': the memory task 't_i' has on
+node 'k', in [pages].
+
+If a task shares memory with another task let us denote this as:
+'s_i,k', the memory shared between tasks including 't_i' residing on node
+'k'.
+
+Let 'M' be the distribution that governs all 'p' and 's', ie. the page placement.
+
+Similarly, let's define 'fp_i,k' and 'fs_i,k' resp. as the (average) usage
+frequency over those memory regions [1/s] such that the product gives an
+(average) bandwidth 'bp' and 'bs' in [pages/s].
+
+(note: multiple tasks sharing memory naturally avoid duplicate accounting
+ because each task will have its own access frequency 'fs')
+
+(pjt: I think this frequency is more numerically consistent if you explicitly
+ restrict p/s above to be the working-set. (It also makes explicit the
+ requirement for <C0,M0> to change about a change in the working set.)
+
+ Doing this does have the nice property that it lets you use your frequency
+ measurement as a weak-ordering for the benefit a task would receive when
+ we can't fit everything.
+
+ e.g. task1 has working set 10mb, f=90%
+ task2 has working set 90mb, f=10%
+
+ Both are using 9mb/s of bandwidth, but we'd expect a much larger benefit
+ from task1 being on the right node than task2. )
+
+Let 'C' map every task 't_i' to a cpu 'c_i' and its corresponding node 'n_i':
+
+ C: t_i -> {c_i, n_i}
+
+This gives us the total interconnect traffic between nodes 'k' and 'l',
+'T_k,l', as:
+
+ T_k,l = \Sum_i bp_i,l + bs_i,l + \Sum bp_j,k + bs_j,k where n_i == k, n_j == l
+
+And our goal is to obtain C0 and M0 such that:
+
+ T_k,l(C0, M0) =< T_k,l(C, M) for all C, M where k != l
+
+(note: we could introduce 'nc(k,l)' as the cost function of accessing memory
+ on node 'l' from node 'k', this would be useful for bigger NUMA systems
+
+ pjt: I agree nice to have, but intuition suggests diminishing returns on more
+ usual systems given things like Haswell's enormous 35mb l3
+ cache and QPI being able to do a direct fetch.)
+
+(note: do we need a limit on the total memory per node?)
+
+
+ * fairness
+
+For each task 't_i' we have a weight 'w_i' (related to nice), and each cpu
+'c_n' has a compute capacity 'P_n'. Again, using our map 'C' we can formulate a
+load 'L_n':
+
+ L_n = 1/P_n * \Sum_i w_i for all c_i = n
+
+Using that, we can formulate a load difference between CPUs:
+
+ L_n,m = | L_n - L_m |
+
+This allows us to state the fairness goal as:
+
+ L_n,m(C0) =< L_n,m(C) for all C, n != m
+
+(pjt: It can also be usefully stated that, having converged at C0:
+
+ | L_n(C0) - L_m(C0) | <= 4/3 * | G_n( U(t_i, t_j) ) - G_m( U(t_i, t_j) ) |
+
+ Where G_n,m is the greedy partition of tasks between L_n and L_m. This is
+ the "worst" partition we should accept; but having it gives us a useful
+ bound on how much we can reasonably adjust L_n/L_m at a Pareto point to
+ favor T_n,m. )
+
+Together they give us the complete multi-objective optimization problem:
+
+ min_C,M [ L_n,m(C), T_k,l(C,M) ]
+
+
+
+Notes:
+
+ - the memory bandwidth problem is very much an inter-process problem; in
+ particular, there is no such concept as a process in the above problem.
+
+ - the naive solution would completely prefer fairness over interconnect
+ traffic; the more complicated solution could pick another Pareto point using
+ an aggregate objective function such that we balance the loss of work
+ efficiency against the gain of better locality. We'd want to more or less
+ require a fixed bound on the error from the Pareto line for any
+ such solution.
+
+References:
+
+ http://en.wikipedia.org/wiki/Mathematical_optimization
+ http://en.wikipedia.org/wiki/Multi-objective_optimization
+
+
+* warning, significant hand-waving ahead, improvements welcome *
+
+
+Partial solutions / approximations:
+
+ 1) have task node placement be a pure preference from the 'fairness' pov.
+
+This means we always prefer fairness over interconnect bandwidth. This reduces
+the problem to:
+
+ min_C,M [ T_k,l(C,M) ]
+
+ 2a) migrate memory towards 'n_i' (the task's node).
+
+This creates memory movement such that 'p_i,k for k != n_i' becomes 0 --
+provided 'n_i' stays stable enough and there's sufficient memory (looks like
+we might need memory limits for this).
+
+This does however not provide us with any 's_i' (shared) information. It does
+however remove 'M' since it defines memory placement in terms of task
+placement.
+
+XXX properties of this M vs a potential optimal
+
+ 2b) migrate memory towards 'n_i' using 2 samples.
+
+This separates pages into those that will migrate and those that will not due
+to the two samples not matching. We could consider the first to be of 'p_i'
+(private) and the second to be of 's_i' (shared).
+
+This interpretation can be motivated by the previously observed property that
+'p_i,k for k != n_i' should become 0 under sufficient memory, leaving only
+'s_i' (shared). (here we lose the need for memory limits again, since it
+becomes indistinguishable from shared).
+
+XXX include the statistical babble on double sampling somewhere near
+
+This reduces the problem further; we lose 'M' as per 2a, and it further reduces
+the 'T_k,l' (interconnect traffic) term to only include shared (since per the
+above all private will be local):
+
+ T_k,l = \Sum_i bs_i,l for every n_i = k, l != k
+
+[ more or less matches the state of sched/numa and describes its remaining
+ problems and assumptions. It should work well for tasks without significant
+ shared memory usage between tasks. ]
+
+Possible future directions:
+
+Motivated by the form of 'T_k,l', try and obtain each term of the sum, so we
+can evaluate it;
+
+ 3a) add per-task per node counters
+
+At fault time, count the number of pages the task faults on for each node.
+This should give an approximation of 'p_i' for the local node and 's_i,k' for
+all remote nodes.
+
+While these numbers provide pages per scan, and so have the unit [pages/s], they
+don't count repeat accesses and thus aren't actually representative of our
+bandwidth numbers.
+
+ 3b) additional frequency term
+
+Additionally (or instead if it turns out we don't need the raw 'p' and 's'
+numbers) we can approximate the repeat accesses by using the time since marking
+the pages as indication of the access frequency.
+
+Let 'I' be the interval of marking pages and 'e' the elapsed time since the
+last marking, then we could estimate the number of accesses 'a' as 'a = I / e'.
+If we then increment the node counters using 'a' instead of 1 we might get
+a better estimate of bandwidth terms.
+
+ 3c) additional averaging; can be applied on top of either a/b.
+
+[ Rik argues that decaying averages on 3a might be sufficient for bandwidth since
+ the decaying avg includes the old accesses and therefore has a measure of repeat
+ accesses.
+
+ Rik also argued that the sample frequency is too low to get accurate access
+ frequency measurements. I'm not entirely convinced; even at low sample
+ frequencies the avg elapsed time 'e' over multiple samples should still
+ give us a fair approximation of the avg access frequency 'a'.
+
+ So doing both b&c has a fair chance of working and allowing us to distinguish
+ between important and less important memory accesses.
+
+ Experimentation has shown no benefit from the added frequency term so far. ]
+
+This will give us 'bp_i' and 'bs_i,k' so that we can approximately compute
+'T_k,l'. Our optimization problem now reads:
+
+ min_C [ \Sum_i bs_i,l for every n_i = k, l != k ]
+
+And includes only shared terms, this makes sense since all task private memory
+will become local as per 2.
+
+This suggests that if there is significant shared memory, we should try and
+move towards it.
+
+ 4) move towards where 'most' memory is
+
+The simplest significance test is comparing the biggest shared 's_i,k' against
+the private 'p_i'. If we have more shared than private, move towards it.
+
+This effectively makes us move towards where most our memory is and forms a
+feed-back loop with 2. We migrate memory towards us and we migrate towards
+where 'most' memory is.
+
+(Note: even if there were two tasks fully thrashing the same shared memory, it
+ is very rare for there to be a 50/50 split in memory; lacking a perfect
+ split, the smaller will move towards the larger. In case of a perfect
+ split, we'll tie-break towards the lower node number.)
+
+ 5) 'throttle' 4's node placement
+
+Since per 2b our 's_i,k' and 'p_i' require at least two scans to 'stabilize'
+and show representative numbers, we should limit node-migration to not be
+faster than this.
+
+ n) poke holes in previous that require more stuff and describe it.
--
1.7.11.7

2012-11-22 22:57:49

by Ingo Molnar

[permalink] [raw]
Subject: [PATCH 12/33] numa, mm: Support NUMA hinting page faults from gup/gup_fast

From: Andrea Arcangeli <[email protected]>

Introduce FOLL_NUMA to tell follow_page to check
pte/pmd_numa. get_user_pages must use FOLL_NUMA, and it's safe to do
so because it always invokes handle_mm_fault and retries the
follow_page later.

KVM secondary MMU page faults will trigger the NUMA hinting page
faults through gup_fast -> get_user_pages -> follow_page ->
handle_mm_fault.

Other follow_page callers like KSM should not use FOLL_NUMA, or they
would fail to get the pages if they use follow_page instead of
get_user_pages.

[ This patch was picked up from the AutoNUMA tree. ]

Originally-by: Andrea Arcangeli <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Thomas Gleixner <[email protected]>
[ ported to this tree. ]
Signed-off-by: Ingo Molnar <[email protected]>
---
include/linux/mm.h | 1 +
mm/memory.c | 17 +++++++++++++++++
2 files changed, 18 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 246375c..f39a628 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1585,6 +1585,7 @@ struct page *follow_page(struct vm_area_struct *, unsigned long address,
#define FOLL_MLOCK 0x40 /* mark page as mlocked */
#define FOLL_SPLIT 0x80 /* don't return transhuge pages, split them */
#define FOLL_HWPOISON 0x100 /* check page is hwpoisoned */
+#define FOLL_NUMA 0x200 /* force NUMA hinting page fault */

typedef int (*pte_fn_t)(pte_t *pte, pgtable_t token, unsigned long addr,
void *data);
diff --git a/mm/memory.c b/mm/memory.c
index b9bb15c..23ad2eb 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1522,6 +1522,8 @@ struct page *follow_page(struct vm_area_struct *vma, unsigned long address,
page = follow_huge_pmd(mm, address, pmd, flags & FOLL_WRITE);
goto out;
}
+ if ((flags & FOLL_NUMA) && pmd_numa(vma, *pmd))
+ goto no_page_table;
if (pmd_trans_huge(*pmd)) {
if (flags & FOLL_SPLIT) {
split_huge_page_pmd(mm, pmd);
@@ -1551,6 +1553,8 @@ split_fallthrough:
pte = *ptep;
if (!pte_present(pte))
goto no_page;
+ if ((flags & FOLL_NUMA) && pte_numa(vma, pte))
+ goto no_page;
if ((flags & FOLL_WRITE) && !pte_write(pte))
goto unlock;

@@ -1702,6 +1706,19 @@ int __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
(VM_WRITE | VM_MAYWRITE) : (VM_READ | VM_MAYREAD);
vm_flags &= (gup_flags & FOLL_FORCE) ?
(VM_MAYREAD | VM_MAYWRITE) : (VM_READ | VM_WRITE);
+
+ /*
+ * If FOLL_FORCE and FOLL_NUMA are both set, handle_mm_fault
+ * would be called on PROT_NONE ranges. We must never invoke
+ * handle_mm_fault on PROT_NONE ranges or the NUMA hinting
+ * page faults would unprotect the PROT_NONE ranges if
+ * _PAGE_NUMA and _PAGE_PROTNONE are sharing the same pte/pmd
+ * bitflag. So to avoid that, don't set FOLL_NUMA if
+ * FOLL_FORCE is set.
+ */
+ if (!(gup_flags & FOLL_FORCE))
+ gup_flags |= FOLL_NUMA;
+
i = 0;

do {
--
1.7.11.7

2012-11-22 22:58:11

by Ingo Molnar

[permalink] [raw]
Subject: [PATCH 09/33] sched, mm, numa: Create generic NUMA fault infrastructure, with architectures overrides

This patch is based on patches written by multiple people:

Hugh Dickins <[email protected]>
Johannes Weiner <[email protected]>
Peter Zijlstra <[email protected]>

Of the "mm/mpol: Create special PROT_NONE infrastructure" patch
and its variants.

I have reworked the code so significantly that I had to
drop the acks and signoffs.

In order to facilitate a lazy -- fault driven -- migration of pages,
create a special transient PROT_NONE variant; we can then use the
resulting 'spurious' protection faults to drive our migrations.

Pages that already had an effective PROT_NONE mapping will not
generate these 'spurious' faults, for the simple reason that we
cannot distinguish them by their protection bits - see pte_numa().

This isn't a problem since PROT_NONE (and possibly PROT_WRITE with
dirty tracking) aren't used or are rare enough for us to not care
about their placement.

Architectures can set the CONFIG_ARCH_WANTS_NUMA_GENERIC_PGPROT Kconfig
variable, in which case they get the PROT_NONE variant. Alternatively
they can provide the basic primitives themselves:

bool pte_numa(struct vm_area_struct *vma, pte_t pte);
pte_t pte_mknuma(struct vm_area_struct *vma, pte_t pte);
bool pmd_numa(struct vm_area_struct *vma, pmd_t pmd);
pgprot_t pgprot_modify(pgprot_t oldprot, pgprot_t newprot)
unsigned long change_prot_numa(struct vm_area_struct *vma, unsigned long start, unsigned long end);

[ This non-generic angle is untested though. ]
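
To illustrate the non-generic path, an architecture with a spare software
PTE bit could provide the primitives along these lines (purely a
hypothetical sketch: '_PAGE_NUMA' is not defined by this series, and
x86-style pte_flags()/pte_set_flags() accessors are assumed):

	#define pte_numa pte_numa
	static inline bool pte_numa(struct vm_area_struct *vma, pte_t pte)
	{
		return (pte_flags(pte) & _PAGE_NUMA) != 0;
	}

	#define pte_mknuma pte_mknuma
	static inline pte_t pte_mknuma(struct vm_area_struct *vma, pte_t pte)
	{
		/*
		 * Tag the pte; a real implementation would also have to make
		 * such a pte actually fault, e.g. by clearing the hardware
		 * present/valid bit.
		 */
		return pte_set_flags(pte, _PAGE_NUMA);
	}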

Original-Idea-by: Rik van Riel <[email protected]>
Also-From: Johannes Weiner <[email protected]>
Also-From: Hugh Dickins <[email protected]>
Also-From: Peter Zijlstra <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
---
include/asm-generic/pgtable.h | 55 ++++++++++++++
include/linux/huge_mm.h | 12 ++++
include/linux/mempolicy.h | 6 ++
include/linux/migrate.h | 5 ++
include/linux/mm.h | 5 ++
include/linux/sched.h | 2 +
init/Kconfig | 22 ++++++
mm/Makefile | 1 +
mm/huge_memory.c | 162 ++++++++++++++++++++++++++++++++++++++++++
mm/internal.h | 5 +-
mm/memcontrol.c | 7 +-
mm/memory.c | 85 ++++++++++++++++++++--
mm/migrate.c | 2 +-
mm/mprotect.c | 7 --
mm/numa.c | 73 +++++++++++++++++++
15 files changed, 430 insertions(+), 19 deletions(-)
create mode 100644 mm/numa.c

diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
index 48fc1dc..d03d0a8 100644
--- a/include/asm-generic/pgtable.h
+++ b/include/asm-generic/pgtable.h
@@ -537,6 +537,61 @@ static inline int pmd_none_or_trans_huge_or_clear_bad(pmd_t *pmd)
}

/*
+ * Is this pte used for NUMA scanning?
+ */
+#ifdef CONFIG_ARCH_USES_NUMA_GENERIC_PGPROT
+extern bool pte_numa(struct vm_area_struct *vma, pte_t pte);
+#else
+# ifndef pte_numa
+static inline bool pte_numa(struct vm_area_struct *vma, pte_t pte)
+{
+ return false;
+}
+# endif
+#endif
+
+/*
+ * Turn a pte into a NUMA entry:
+ */
+#ifdef CONFIG_ARCH_USES_NUMA_GENERIC_PGPROT
+extern pte_t pte_mknuma(struct vm_area_struct *vma, pte_t pte);
+#else
+# ifndef pte_mknuma
+static inline pte_t pte_mknuma(struct vm_area_struct *vma, pte_t pte)
+{
+ return pte;
+}
+# endif
+#endif
+
+/*
+ * Is this pmd used for NUMA scanning?
+ */
+#ifdef CONFIG_ARCH_USES_NUMA_GENERIC_PGPROT_HUGEPAGE
+extern bool pmd_numa(struct vm_area_struct *vma, pmd_t pmd);
+#else
+# ifndef pmd_numa
+static inline bool pmd_numa(struct vm_area_struct *vma, pmd_t pmd)
+{
+ return false;
+}
+# endif
+#endif
+
+/*
+ * Some architectures (such as x86) may need to preserve certain pgprot
+ * bits, without complicating generic pgprot code.
+ *
+ * Most architectures don't care:
+ */
+#ifndef pgprot_modify
+static inline pgprot_t pgprot_modify(pgprot_t oldprot, pgprot_t newprot)
+{
+ return newprot;
+}
+#endif
+
+/*
* This is a noop if Transparent Hugepage Support is not built into
* the kernel. Otherwise it is equivalent to
* pmd_none_or_trans_huge_or_clear_bad(), and shall only be called in
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index b31cb7d..7f5a552 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -197,4 +197,16 @@ static inline int pmd_trans_huge_lock(pmd_t *pmd,
}
#endif /* CONFIG_TRANSPARENT_HUGEPAGE */

+#ifdef CONFIG_NUMA_BALANCING_HUGEPAGE
+extern void do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
+ unsigned long address, pmd_t *pmd,
+ unsigned int flags, pmd_t orig_pmd);
+#else
+static inline void do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
+ unsigned long address, pmd_t *pmd,
+ unsigned int flags, pmd_t orig_pmd)
+{
+}
+#endif
+
#endif /* _LINUX_HUGE_MM_H */
diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
index e5ccb9d..f329306 100644
--- a/include/linux/mempolicy.h
+++ b/include/linux/mempolicy.h
@@ -324,4 +324,10 @@ static inline int mpol_to_str(char *buffer, int maxlen, struct mempolicy *pol,
}

#endif /* CONFIG_NUMA */
+
+static inline int mpol_misplaced(struct page *page, struct vm_area_struct *vma,
+ unsigned long address)
+{
+ return -1; /* no node preference */
+}
#endif
diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index ce7e667..afd9af1 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -64,4 +64,9 @@ static inline int migrate_huge_page_move_mapping(struct address_space *mapping,
#define fail_migrate_page NULL

#endif /* CONFIG_MIGRATION */
+static inline
+int migrate_misplaced_page(struct page *page, int node)
+{
+ return -EAGAIN; /* can't migrate now */
+}
#endif /* _LINUX_MIGRATE_H */
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 5fc1d46..246375c 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1559,6 +1559,11 @@ static inline pgprot_t vm_get_page_prot(unsigned long vm_flags)
}
#endif

+#ifdef CONFIG_ARCH_USES_NUMA_GENERIC_PGPROT
+extern unsigned long
+change_prot_numa(struct vm_area_struct *vma, unsigned long start, unsigned long end);
+#endif
+
struct vm_area_struct *find_extend_vma(struct mm_struct *, unsigned long addr);
int remap_pfn_range(struct vm_area_struct *, unsigned long addr,
unsigned long pfn, unsigned long size, pgprot_t);
diff --git a/include/linux/sched.h b/include/linux/sched.h
index e1581a0..a0a2808 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1575,6 +1575,8 @@ struct task_struct {
/* Future-safe accessor for struct task_struct's cpus_allowed. */
#define tsk_cpus_allowed(tsk) (&(tsk)->cpus_allowed)

+static inline void task_numa_fault(int node, int cpu, int pages) { }
+
/*
* Priority of a process goes from 0..MAX_PRIO-1, valid RT
* priority is 0..MAX_RT_PRIO-1, and SCHED_NORMAL/SCHED_BATCH
diff --git a/init/Kconfig b/init/Kconfig
index 6fdd6e3..f36c83d 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -696,6 +696,28 @@ config LOG_BUF_SHIFT
config HAVE_UNSTABLE_SCHED_CLOCK
bool

+#
+# Helper Kconfig switches to express compound feature dependencies
+# and thus make the .h/.c code more readable:
+#
+config NUMA_BALANCING_HUGEPAGE
+ bool
+ default y
+ depends on NUMA_BALANCING
+ depends on TRANSPARENT_HUGEPAGE
+
+config ARCH_USES_NUMA_GENERIC_PGPROT
+ bool
+ default y
+ depends on ARCH_WANTS_NUMA_GENERIC_PGPROT
+ depends on NUMA_BALANCING
+
+config ARCH_USES_NUMA_GENERIC_PGPROT_HUGEPAGE
+ bool
+ default y
+ depends on ARCH_USES_NUMA_GENERIC_PGPROT
+ depends on TRANSPARENT_HUGEPAGE
+
menuconfig CGROUPS
boolean "Control Group support"
depends on EVENTFD
diff --git a/mm/Makefile b/mm/Makefile
index 6b025f8..26f7574 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -34,6 +34,7 @@ obj-$(CONFIG_FRONTSWAP) += frontswap.o
obj-$(CONFIG_HAS_DMA) += dmapool.o
obj-$(CONFIG_HUGETLBFS) += hugetlb.o
obj-$(CONFIG_NUMA) += mempolicy.o
+obj-$(CONFIG_ARCH_USES_NUMA_GENERIC_PGPROT) += numa.o
obj-$(CONFIG_SPARSEMEM) += sparse.o
obj-$(CONFIG_SPARSEMEM_VMEMMAP) += sparse-vmemmap.o
obj-$(CONFIG_SLOB) += slob.o
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 40f17c3..814e3ea 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -18,6 +18,7 @@
#include <linux/freezer.h>
#include <linux/mman.h>
#include <linux/pagemap.h>
+#include <linux/migrate.h>
#include <asm/tlb.h>
#include <asm/pgalloc.h>
#include "internal.h"
@@ -725,6 +726,165 @@ out:
return handle_pte_fault(mm, vma, address, pte, pmd, flags);
}

+#ifdef CONFIG_NUMA_BALANCING
+/*
+ * Handle a NUMA fault: check whether we should migrate and
+ * mark it accessible again.
+ */
+void do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
+ unsigned long address, pmd_t *pmd,
+ unsigned int flags, pmd_t entry)
+{
+ unsigned long haddr = address & HPAGE_PMD_MASK;
+ struct mem_cgroup *memcg = NULL;
+ struct page *new_page;
+ struct page *page = NULL;
+ int last_cpu;
+ int node = -1;
+
+ spin_lock(&mm->page_table_lock);
+ if (unlikely(!pmd_same(*pmd, entry)))
+ goto unlock;
+
+ if (unlikely(pmd_trans_splitting(entry))) {
+ spin_unlock(&mm->page_table_lock);
+ wait_split_huge_page(vma->anon_vma, pmd);
+ return;
+ }
+
+ page = pmd_page(entry);
+ if (page) {
+ int page_nid = page_to_nid(page);
+
+ VM_BUG_ON(!PageCompound(page) || !PageHead(page));
+ last_cpu = page_last_cpu(page);
+
+ get_page(page);
+ /*
+ * Note that migrating pages shared by others is safe, since
+ * get_user_pages() or GUP fast would have to fault this page
+ * present before they could proceed, and we are holding the
+ * pagetable lock here and are mindful of pmd races below.
+ */
+ node = mpol_misplaced(page, vma, haddr);
+ if (node != -1 && node != page_nid)
+ goto migrate;
+ }
+
+fixup:
+ /* change back to regular protection */
+ entry = pmd_modify(entry, vma->vm_page_prot);
+ set_pmd_at(mm, haddr, pmd, entry);
+ update_mmu_cache_pmd(vma, address, entry);
+
+unlock:
+ spin_unlock(&mm->page_table_lock);
+ if (page) {
+ task_numa_fault(page_to_nid(page), last_cpu, HPAGE_PMD_NR);
+ put_page(page);
+ }
+ return;
+
+migrate:
+ spin_unlock(&mm->page_table_lock);
+
+ lock_page(page);
+ spin_lock(&mm->page_table_lock);
+ if (unlikely(!pmd_same(*pmd, entry))) {
+ spin_unlock(&mm->page_table_lock);
+ unlock_page(page);
+ put_page(page);
+ return;
+ }
+ spin_unlock(&mm->page_table_lock);
+
+ new_page = alloc_pages_node(node,
+ (GFP_TRANSHUGE | GFP_THISNODE) & ~__GFP_WAIT, HPAGE_PMD_ORDER);
+ if (!new_page)
+ goto alloc_fail;
+
+ if (isolate_lru_page(page)) { /* Does an implicit get_page() */
+ put_page(new_page);
+ goto alloc_fail;
+ }
+
+ __set_page_locked(new_page);
+ SetPageSwapBacked(new_page);
+
+ /* anon mapping, we can simply copy page->mapping to the new page: */
+ new_page->mapping = page->mapping;
+ new_page->index = page->index;
+
+ migrate_page_copy(new_page, page);
+
+ WARN_ON(PageLRU(new_page));
+
+ spin_lock(&mm->page_table_lock);
+ if (unlikely(!pmd_same(*pmd, entry))) {
+ spin_unlock(&mm->page_table_lock);
+
+ /* Reverse changes made by migrate_page_copy() */
+ if (TestClearPageActive(new_page))
+ SetPageActive(page);
+ if (TestClearPageUnevictable(new_page))
+ SetPageUnevictable(page);
+ mlock_migrate_page(page, new_page);
+
+ unlock_page(new_page);
+ put_page(new_page); /* Free it */
+
+ unlock_page(page);
+ putback_lru_page(page);
+ put_page(page); /* Drop the local reference */
+ return;
+ }
+ /*
+ * Traditional migration needs to prepare the memcg charge
+ * transaction early to prevent the old page from being
+ * uncharged when installing migration entries. Here we can
+ * save the potential rollback and start the charge transfer
+ * only when migration is already known to end successfully.
+ */
+ mem_cgroup_prepare_migration(page, new_page, &memcg);
+
+ entry = mk_pmd(new_page, vma->vm_page_prot);
+ entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
+ entry = pmd_mkhuge(entry);
+
+ page_add_new_anon_rmap(new_page, vma, haddr);
+
+ set_pmd_at(mm, haddr, pmd, entry);
+ update_mmu_cache_pmd(vma, address, entry);
+ page_remove_rmap(page);
+ /*
+ * Finish the charge transaction under the page table lock to
+ * prevent split_huge_page() from dividing up the charge
+ * before it's fully transferred to the new page.
+ */
+ mem_cgroup_end_migration(memcg, page, new_page, true);
+ spin_unlock(&mm->page_table_lock);
+
+ task_numa_fault(node, last_cpu, HPAGE_PMD_NR);
+
+ unlock_page(new_page);
+ unlock_page(page);
+ put_page(page); /* Drop the rmap reference */
+ put_page(page); /* Drop the LRU isolation reference */
+ put_page(page); /* Drop the local reference */
+ return;
+
+alloc_fail:
+ unlock_page(page);
+ spin_lock(&mm->page_table_lock);
+ if (unlikely(!pmd_same(*pmd, entry))) {
+ put_page(page);
+ page = NULL;
+ goto unlock;
+ }
+ goto fixup;
+}
+#endif
+
int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
pmd_t *dst_pmd, pmd_t *src_pmd, unsigned long addr,
struct vm_area_struct *vma)
@@ -1363,6 +1523,8 @@ static int __split_huge_page_map(struct page *page,
BUG_ON(page_mapcount(page) != 1);
if (!pmd_young(*pmd))
entry = pte_mkold(entry);
+ if (pmd_numa(vma, *pmd))
+ entry = pte_mknuma(vma, entry);
pte = pte_offset_map(&_pmd, haddr);
BUG_ON(!pte_none(*pte));
set_pte_at(mm, haddr, pte, entry);
diff --git a/mm/internal.h b/mm/internal.h
index a4fa284..b84d571 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -212,11 +212,12 @@ static inline void mlock_migrate_page(struct page *newpage, struct page *page)
{
if (TestClearPageMlocked(page)) {
unsigned long flags;
+ int nr_pages = hpage_nr_pages(page);

local_irq_save(flags);
- __dec_zone_page_state(page, NR_MLOCK);
+ __mod_zone_page_state(page_zone(page), NR_MLOCK, -nr_pages);
SetPageMlocked(newpage);
- __inc_zone_page_state(newpage, NR_MLOCK);
+ __mod_zone_page_state(page_zone(newpage), NR_MLOCK, nr_pages);
local_irq_restore(flags);
}
}
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 7acf43b..011e510 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3255,15 +3255,18 @@ void mem_cgroup_prepare_migration(struct page *page, struct page *newpage,
struct mem_cgroup **memcgp)
{
struct mem_cgroup *memcg = NULL;
+ unsigned int nr_pages = 1;
struct page_cgroup *pc;
enum charge_type ctype;

*memcgp = NULL;

- VM_BUG_ON(PageTransHuge(page));
if (mem_cgroup_disabled())
return;

+ if (PageTransHuge(page))
+ nr_pages <<= compound_order(page);
+
pc = lookup_page_cgroup(page);
lock_page_cgroup(pc);
if (PageCgroupUsed(pc)) {
@@ -3325,7 +3328,7 @@ void mem_cgroup_prepare_migration(struct page *page, struct page *newpage,
* charged to the res_counter since we plan on replacing the
* old one and only one page is going to be left afterwards.
*/
- __mem_cgroup_commit_charge(memcg, newpage, 1, ctype, false);
+ __mem_cgroup_commit_charge(memcg, newpage, nr_pages, ctype, false);
}

/* remove redundant charge if migration failed*/
diff --git a/mm/memory.c b/mm/memory.c
index 24d3a4a..b9bb15c 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -57,6 +57,7 @@
#include <linux/swapops.h>
#include <linux/elf.h>
#include <linux/gfp.h>
+#include <linux/migrate.h>

#include <asm/io.h>
#include <asm/pgalloc.h>
@@ -3437,6 +3438,69 @@ static int do_nonlinear_fault(struct mm_struct *mm, struct vm_area_struct *vma,
return __do_fault(mm, vma, address, pmd, pgoff, flags, orig_pte);
}

+static int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
+ unsigned long address, pte_t *ptep, pmd_t *pmd,
+ unsigned int flags, pte_t entry)
+{
+ struct page *page = NULL;
+ int node, page_nid = -1;
+ int last_cpu = -1;
+ spinlock_t *ptl;
+
+ ptl = pte_lockptr(mm, pmd);
+ spin_lock(ptl);
+ if (unlikely(!pte_same(*ptep, entry)))
+ goto out_unlock;
+
+ page = vm_normal_page(vma, address, entry);
+ if (page) {
+ get_page(page);
+ page_nid = page_to_nid(page);
+ last_cpu = page_last_cpu(page);
+ node = mpol_misplaced(page, vma, address);
+ if (node != -1 && node != page_nid)
+ goto migrate;
+ }
+
+out_pte_upgrade_unlock:
+ flush_cache_page(vma, address, pte_pfn(entry));
+
+ ptep_modify_prot_start(mm, address, ptep);
+ entry = pte_modify(entry, vma->vm_page_prot);
+ ptep_modify_prot_commit(mm, address, ptep, entry);
+
+ /* No TLB flush needed because we upgraded the PTE */
+
+ update_mmu_cache(vma, address, ptep);
+
+out_unlock:
+ pte_unmap_unlock(ptep, ptl);
+out:
+ if (page) {
+ task_numa_fault(page_nid, last_cpu, 1);
+ put_page(page);
+ }
+
+ return 0;
+
+migrate:
+ pte_unmap_unlock(ptep, ptl);
+
+ if (!migrate_misplaced_page(page, node)) {
+ page_nid = node;
+ goto out;
+ }
+
+ ptep = pte_offset_map_lock(mm, pmd, address, &ptl);
+ if (!pte_same(*ptep, entry)) {
+ put_page(page);
+ page = NULL;
+ goto out_unlock;
+ }
+
+ goto out_pte_upgrade_unlock;
+}
+
/*
* These routines also need to handle stuff like marking pages dirty
* and/or accessed for architectures that don't do it in hardware (most
@@ -3475,6 +3539,9 @@ int handle_pte_fault(struct mm_struct *mm,
pte, pmd, flags, entry);
}

+ if (pte_numa(vma, entry))
+ return do_numa_page(mm, vma, address, pte, pmd, flags, entry);
+
ptl = pte_lockptr(mm, pmd);
spin_lock(ptl);
if (unlikely(!pte_same(*pte, entry)))
@@ -3539,13 +3606,16 @@ retry:
pmd, flags);
} else {
pmd_t orig_pmd = *pmd;
- int ret;
+ int ret = 0;

barrier();
- if (pmd_trans_huge(orig_pmd)) {
- if (flags & FAULT_FLAG_WRITE &&
- !pmd_write(orig_pmd) &&
- !pmd_trans_splitting(orig_pmd)) {
+ if (pmd_trans_huge(orig_pmd) && !pmd_trans_splitting(orig_pmd)) {
+ if (pmd_numa(vma, orig_pmd)) {
+ do_huge_pmd_numa_page(mm, vma, address, pmd,
+ flags, orig_pmd);
+ }
+
+ if ((flags & FAULT_FLAG_WRITE) && !pmd_write(orig_pmd)) {
ret = do_huge_pmd_wp_page(mm, vma, address, pmd,
orig_pmd);
/*
@@ -3555,12 +3625,13 @@ retry:
*/
if (unlikely(ret & VM_FAULT_OOM))
goto retry;
- return ret;
}
- return 0;
+
+ return ret;
}
}

+
/*
* Use __pte_alloc instead of pte_alloc_map, because we can't
* run pte_offset_map on the pmd, if an huge pmd could
diff --git a/mm/migrate.c b/mm/migrate.c
index 77ed2d7..4ba45f4 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -407,7 +407,7 @@ int migrate_huge_page_move_mapping(struct address_space *mapping,
*/
void migrate_page_copy(struct page *newpage, struct page *page)
{
- if (PageHuge(page))
+ if (PageHuge(page) || PageTransHuge(page))
copy_huge_page(newpage, page);
else
copy_highpage(newpage, page);
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 7c3628a..6ff2d5e 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -28,13 +28,6 @@
#include <asm/cacheflush.h>
#include <asm/tlbflush.h>

-#ifndef pgprot_modify
-static inline pgprot_t pgprot_modify(pgprot_t oldprot, pgprot_t newprot)
-{
- return newprot;
-}
-#endif
-
static unsigned long change_pte_range(struct mm_struct *mm, pmd_t *pmd,
unsigned long addr, unsigned long end, pgprot_t newprot,
int dirty_accountable)
diff --git a/mm/numa.c b/mm/numa.c
new file mode 100644
index 0000000..8d18800
--- /dev/null
+++ b/mm/numa.c
@@ -0,0 +1,73 @@
+/*
+ * Generic NUMA page table entry support. This code reuses
+ * PROT_NONE: an architecture can choose to use its own
+ * implementation, by setting CONFIG_ARCH_SUPPORTS_NUMA_BALANCING
+ * and not setting CONFIG_ARCH_WANTS_NUMA_GENERIC_PGPROT.
+ */
+#include <linux/mm.h>
+
+static inline pgprot_t vma_prot_none(struct vm_area_struct *vma)
+{
+ /*
+ * obtain PROT_NONE by removing READ|WRITE|EXEC privs
+ */
+ vm_flags_t vmflags = vma->vm_flags & ~(VM_READ|VM_WRITE|VM_EXEC);
+
+ return pgprot_modify(vma->vm_page_prot, vm_get_page_prot(vmflags));
+}
+
+bool pte_numa(struct vm_area_struct *vma, pte_t pte)
+{
+ /*
+ * For NUMA page faults, we use PROT_NONE ptes in VMAs with
+ * "normal" vma->vm_page_prot protections. Genuine PROT_NONE
+ * VMAs should never get here, because the fault handling code
+ * will notice that the VMA has no read or write permissions.
+ *
+ * This means we cannot get 'special' PROT_NONE faults from genuine
+ * PROT_NONE maps, nor from PROT_WRITE file maps that do dirty
+ * tracking.
+ *
+ * Neither case is really interesting for our current use though so we
+ * don't care.
+ */
+ if (pte_same(pte, pte_modify(pte, vma->vm_page_prot)))
+ return false;
+
+ return pte_same(pte, pte_modify(pte, vma_prot_none(vma)));
+}
+
+pte_t pte_mknuma(struct vm_area_struct *vma, pte_t pte)
+{
+ return pte_modify(pte, vma_prot_none(vma));
+}
+
+#ifdef CONFIG_ARCH_USES_NUMA_GENERIC_PGPROT_HUGEPAGE
+bool pmd_numa(struct vm_area_struct *vma, pmd_t pmd)
+{
+ /*
+ * See pte_numa() above
+ */
+ if (pmd_same(pmd, pmd_modify(pmd, vma->vm_page_prot)))
+ return false;
+
+ return pmd_same(pmd, pmd_modify(pmd, vma_prot_none(vma)));
+}
+#endif
+
+/*
+ * The scheduler uses this function to mark a range of virtual
+ * memory inaccessible to user-space, for the purposes of probing
+ * the composition of the working set.
+ *
+ * The resulting page faults will be demultiplexed into:
+ *
+ * mm/memory.c::do_numa_page()
+ * mm/huge_memory.c::do_huge_pmd_numa_page()
+ *
+ * This generic version simply uses PROT_NONE.
+ */
+unsigned long change_prot_numa(struct vm_area_struct *vma, unsigned long start, unsigned long end)
+{
+ return change_protection(vma, start, end, vma_prot_none(vma), 0);
+}
--
1.7.11.7

2012-11-22 22:58:42

by Ingo Molnar

[permalink] [raw]
Subject: [PATCH 05/33] x86/mm: Completely drop the TLB flush from ptep_set_access_flags()

From: Rik van Riel <[email protected]>

Intel has an architectural guarantee that the TLB entry causing
a page fault gets invalidated automatically. This means
we should be able to drop the local TLB invalidation.

Because of the way other areas of the page fault code work,
chances are good that all x86 CPUs do this. However, if
someone somewhere has an x86 CPU that does not invalidate
the TLB entry causing a page fault, this one-liner should
be easy to revert - or a CPU model specific quirk could
be added to retain this optimization on most CPUs.

Signed-off-by: Rik van Riel <[email protected]>
Acked-by: Linus Torvalds <[email protected]>
Acked-by: Peter Zijlstra <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Michel Lespinasse <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Hugh Dickins <[email protected]>
[ Applied changelog massage and moved this last in the series,
to create bisection distance. ]
Signed-off-by: Ingo Molnar <[email protected]>
---
arch/x86/mm/pgtable.c | 1 -
1 file changed, 1 deletion(-)

diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
index be3bb46..7353de3 100644
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -317,7 +317,6 @@ int ptep_set_access_flags(struct vm_area_struct *vma,
if (changed && dirty) {
*ptep = entry;
pte_update_defer(vma->vm_mm, address, ptep);
- __flush_tlb_one(address);
}

return changed;
--
1.7.11.7

2012-11-22 22:59:17

by Ingo Molnar

[permalink] [raw]
Subject: [PATCH 03/33] x86/mm: Introduce pte_accessible()

From: Rik van Riel <[email protected]>

We need pte_present to return true for _PAGE_PROTNONE pages, to indicate that
the pte is associated with a page.

However, for TLB flushing purposes, we would like to know whether the pte
points to an actually accessible page. This allows us to skip remote TLB
flushes for pages that are not actually accessible.

Fill in this method for x86 and provide a safe (but slower) method
on other architectures.
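
The intent is that a TLB flushing path can then skip the expensive
cross-CPU shootdown for ptes that were never accessible to begin with.
A rough sketch of such a check (the actual user of pte_accessible()
lives elsewhere in this series):

	pte_t pte = ptep_get_and_clear(mm, address, ptep);

	/* PROT_NONE / non-present ptes are never cached in remote TLBs: */
	if (pte_accessible(pte))
		flush_tlb_page(vma, address);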

Signed-off-by: Rik van Riel <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Fixed-by: Linus Torvalds <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Hugh Dickins <[email protected]>
Link: http://lkml.kernel.org/n/[email protected]
[ Added Linus's review fixes. ]
Signed-off-by: Ingo Molnar <[email protected]>
---
arch/x86/include/asm/pgtable.h | 6 ++++++
include/asm-generic/pgtable.h | 4 ++++
2 files changed, 10 insertions(+)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index a1f780d..5fe03aa 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -407,6 +407,12 @@ static inline int pte_present(pte_t a)
return pte_flags(a) & (_PAGE_PRESENT | _PAGE_PROTNONE);
}

+#define pte_accessible pte_accessible
+static inline int pte_accessible(pte_t a)
+{
+ return pte_flags(a) & _PAGE_PRESENT;
+}
+
static inline int pte_hidden(pte_t pte)
{
return pte_flags(pte) & _PAGE_HIDDEN;
diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
index b36ce40..48fc1dc 100644
--- a/include/asm-generic/pgtable.h
+++ b/include/asm-generic/pgtable.h
@@ -219,6 +219,10 @@ static inline int pmd_same(pmd_t pmd_a, pmd_t pmd_b)
#define move_pte(pte, prot, old_addr, new_addr) (pte)
#endif

+#ifndef pte_accessible
+# define pte_accessible(pte) ((void)(pte),1)
+#endif
+
#ifndef flush_tlb_fix_spurious_fault
#define flush_tlb_fix_spurious_fault(vma, address) flush_tlb_page(vma, address)
#endif
--
1.7.11.7

2012-11-22 23:00:00

by Ingo Molnar

[permalink] [raw]
Subject: [PATCH 01/33] mm/generic: Only flush the local TLB in ptep_set_access_flags()

From: Rik van Riel <[email protected]>

The function ptep_set_access_flags() is only ever used to upgrade
the access permissions of a page - i.e. to make them less restrictive.

That means the only negative side effect of not flushing remote
TLBs in this function is that other CPUs may incur spurious page
faults, if they happen to access the same address, and still have
a PTE with the old permissions cached in their TLB caches.

Having another CPU maybe incur a spurious page fault is faster
than always incurring the cost of a remote TLB flush, so replace
the remote TLB flush with a purely local one.

This should be safe on every architecture that correctly
implements flush_tlb_fix_spurious_fault() to actually invalidate
the local TLB entry that caused a page fault, as well as on
architectures where the hardware invalidates TLB entries that
cause page faults.

In the unlikely event that you are hitting what appears to be
an infinite loop of page faults, and 'git bisect' took you to
this changeset, your architecture needs to implement
flush_tlb_fix_spurious_fault() to actually flush the TLB entry.

Signed-off-by: Rik van Riel <[email protected]>
Acked-by: Linus Torvalds <[email protected]>
Acked-by: Peter Zijlstra <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Michel Lespinasse <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Hugh Dickins <[email protected]>
[ Changelog massage. ]
Signed-off-by: Ingo Molnar <[email protected]>
---
mm/pgtable-generic.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
index e642627..d8397da 100644
--- a/mm/pgtable-generic.c
+++ b/mm/pgtable-generic.c
@@ -12,8 +12,8 @@

#ifndef __HAVE_ARCH_PTEP_SET_ACCESS_FLAGS
/*
- * Only sets the access flags (dirty, accessed, and
- * writable). Furthermore, we know it always gets set to a "more
+ * Only sets the access flags (dirty, accessed), as well as write
+ * permission. Furthermore, we know it always gets set to a "more
* permissive" setting, which allows most architectures to optimize
* this. We return whether the PTE actually changed, which in turn
* instructs the caller to do things like update__mmu_cache. This
@@ -27,7 +27,7 @@ int ptep_set_access_flags(struct vm_area_struct *vma,
int changed = !pte_same(*ptep, entry);
if (changed) {
set_pte_at(vma->vm_mm, address, ptep, entry);
- flush_tlb_page(vma, address);
+ flush_tlb_fix_spurious_fault(vma, address);
}
return changed;
}
--
1.7.11.7

2012-11-23 06:45:55

by Zhouping Liu

[permalink] [raw]
Subject: Re: [PATCH 00/33] Latest numa/core release, v17

On 11/23/2012 06:53 AM, Ingo Molnar wrote:
>
> * Ingo Molnar <[email protected]> wrote:
>
>> This release mainly addresses one of the regressions Linus
>> (rightfully) complained about: the "4x JVM" SPECjbb run.
>>
>> [ Note to testers: if possible please still run with
>> CONFIG_TRANSPARENT_HUGEPAGES=y enabled, to avoid the
>> !THP regression that is still not fully fixed.
>> It will be fixed next. ]
> I forgot to include the Git link:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git master

Hi Ingo, I tested the latest tip/master tree on a 2-node system with 8
processors, with CONFIG_TRANSPARENT_HUGEPAGE disabled, and found some
issues:

One is that the command `stress -i 20 -m 30 -v` caused some hung tasks:

------------- snip ---------------------------
[ 1726.278382] Node 0 DMA free:15880kB min:20kB low:24kB high:28kB
active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB
unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15332kB
mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB
slab_reclaimable:0kB slab_unreclaimable:16kB kernel_stack:0kB
pagetables:0kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB
pages_scanned:0 all_unreclaimable? yes
[ 1726.366825] lowmem_reserve[]: 0 1957 3973 3973
[ 1726.388610] Node 0 DMA32 free:10856kB min:2796kB low:3492kB
high:4192kB active_anon:1479384kB inactive_anon:498788kB active_file:0kB
inactive_file:8kB unevictable:0kB isolated(anon):0kB isolated(file):0kB
present:2004184kB mlocked:0kB dirty:0kB writeback:0kB mapped:4kB
shmem:20kB slab_reclaimable:140kB slab_unreclaimable:96kB
kernel_stack:8kB pagetables:80kB unstable:0kB bounce:0kB free_cma:0kB
writeback_tmp:0kB pages_scanned:3502066 all_unreclaimable? yes
[ 1726.490163] lowmem_reserve[]: 0 0 2016 2016
[ 1726.515235] Node 0 Normal free:2880kB min:2880kB low:3600kB
high:4320kB active_anon:1453776kB inactive_anon:490196kB
active_file:748kB inactive_file:1140kB unevictable:3740kB
isolated(anon):0kB isolated(file):0kB present:2064384kB mlocked:3492kB
dirty:0kB writeback:0kB mapped:2748kB shmem:2116kB
slab_reclaimable:9260kB slab_unreclaimable:35880kB kernel_stack:1184kB
pagetables:3308kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB
pages_scanned:3437106 all_unreclaimable? yes
[ 1726.629591] lowmem_reserve[]: 0 0 0 0
[ 1726.657650] Node 1 Normal free:5748kB min:5760kB low:7200kB
high:8640kB active_anon:3383776kB inactive_anon:682376kB active_file:8kB
inactive_file:340kB unevictable:8kB isolated(anon):384kB
isolated(file):0kB present:4128768kB mlocked:8kB dirty:0kB writeback:0kB
mapped:20kB shmem:12kB slab_reclaimable:9364kB
slab_unreclaimable:13728kB kernel_stack:880kB pagetables:12492kB
unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB
pages_scanned:8696274 all_unreclaimable? yes
[ 1726.782732] lowmem_reserve[]: 0 0 0 0
[ 1726.814748] Node 0 DMA: 2*4kB 2*8kB 1*16kB 1*32kB 3*64kB 2*128kB
0*256kB 0*512kB 1*1024kB 1*2048kB 3*4096kB = 15880kB
[ 1726.854951] Node 0 DMA32: 12*4kB 25*8kB 21*16kB 19*32kB 15*64kB
8*128kB 4*256kB 1*512kB 0*1024kB 1*2048kB 1*4096kB = 10856kB
[ 1726.896378] Node 0 Normal: 556*4kB 11*8kB 6*16kB 7*32kB 2*64kB
1*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 2888kB
[ 1726.937928] Node 1 Normal: 392*4kB 22*8kB 1*16kB 0*32kB 0*64kB
0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 1*4096kB = 5856kB
[ 1726.979519] 481162 total pagecache pages
[ 1727.013687] 479540 pages in swap cache
[ 1727.047898] Swap cache stats: add 2176770, delete 1697230, find
701489/781867
[ 1727.085709] Free swap = 4371040kB
[ 1727.119839] Total swap = 8175612kB
[ 1727.187872] 2097136 pages RAM
[ 1727.221789] 56226 pages reserved
[ 1727.256273] 1721904 pages shared
[ 1727.290826] 1549555 pages non-shared
[ 1727.325708] [ pid ] uid tgid total_vm rss nr_ptes swapents
oom_score_adj name
[ 1727.366026] [ 303] 0 303 9750 184 23 500
-1000 systemd-udevd
[ 1727.407347] [ 448] 0 448 6558 178 13 56
-1000 auditd
[ 1727.447970] [ 516] 0 516 61311 172 22
105 0 rsyslogd
[ 1727.488827] [ 519] 0 519 24696 35 15
35 0 iscsiuio
[ 1727.529839] [ 529] 81 529 8201 246 27 93
-900 dbus-daemon
[ 1727.571553] [ 530] 0 530 1264 93 9
16 0 iscsid
[ 1727.612795] [ 531] 0 531 1389 843 9 0
-17 iscsid
[ 1727.654069] [ 538] 0 538 1608 157 10
29 0 mcelog
[ 1727.695659] [ 543] 0 543 5906 138 16
49 0 atd
[ 1727.736954] [ 551] 0 551 30146 211 22
123 0 crond
[ 1727.778589] [ 569] 0 569 26612 223 83
185 0 login
[ 1727.820932] [ 587] 0 587 86085 334 84
139 0 NetworkManager
[ 1727.863388] [ 643] 0 643 21931 200 65
171 -900 modem-manager
[ 1727.906015] [ 644] 0 644 5812 183 16
74 0 bluetoothd
[ 1727.948629] [ 672] 0 672 118364 228 179
1079 0 libvirtd
[ 1727.991230] [ 691] 0 691 20059 189 40 198
-1000 sshd
[ 1728.033700] [ 996] 0 996 28950 196 17
516 0 bash
[ 1728.076541] [ 1056] 0 1056 1803 112 10
22 0 stress
[ 1728.119495] [ 1057] 0 1057 1803 27 10
21 0 stress
[ 1728.162616] [ 1058] 0 1058 67340 54961 118
12 0 stress
[ 1728.205521] [ 1059] 0 1059 1803 27 10
21 0 stress
[ 1728.248414] [ 1060] 0 1060 67340 12209 35
12 0 stress
[ 1728.291529] [ 1061] 0 1061 1803 27 10
21 0 stress
[ 1728.335147] [ 1062] 0 1062 67340 23519 57
12 0 stress
[ 1728.378537] [ 1063] 0 1063 1803 27 10
21 0 stress
[ 1728.421877] [ 1064] 0 1064 67340 7673 138
57944 0 stress
[ 1728.465325] [ 1065] 0 1065 1803 27 10
21 0 stress
[ 1728.508955] [ 1066] 0 1066 67340 90 11
12 0 stress
[ 1728.553187] [ 1067] 0 1067 1803 27 10
21 0 stress
[ 1728.597554] [ 1068] 0 1068 67340 58628 126
12 0 stress
[ 1728.640668] [ 1069] 0 1069 1803 27 10
21 0 stress
[ 1728.683676] [ 1070] 0 1070 67340 59802 128
12 0 stress
[ 1728.726534] [ 1071] 0 1071 1803 27 10
21 0 stress
[ 1728.769082] [ 1072] 0 1072 67340 5924 138
59693 0 stress
[ 1728.811455] [ 1073] 0 1073 1803 27 10
21 0 stress
[ 1728.852798] [ 1074] 0 1074 67340 65103 138
14 0 stress
[ 1728.892605] [ 1075] 0 1075 1803 27 10
21 0 stress
[ 1728.931191] [ 1076] 0 1076 67340 60077 128
13 0 stress
[ 1728.969491] [ 1077] 0 1077 1803 27 10
21 0 stress
[ 1729.006394] [ 1078] 0 1078 67340 13262 138
52355 0 stress
[ 1729.042189] [ 1079] 0 1079 1803 27 10
21 0 stress
[ 1729.076890] [ 1080] 0 1080 67340 38640 87
12 0 stress
[ 1729.111443] [ 1081] 0 1081 1803 27 10
21 0 stress
[ 1729.144638] [ 1082] 0 1082 67340 8238 138
57379 0 stress
[ 1729.176403] [ 1083] 0 1083 1803 27 10
21 0 stress
[ 1729.206905] [ 1084] 0 1084 67340 55392 119
12 0 stress
[ 1729.237086] [ 1085] 0 1085 1803 27 10
21 0 stress
[ 1729.265883] [ 1086] 0 1086 67340 4169 138
61447 0 stress
[ 1729.293362] [ 1087] 0 1087 1803 27 10
21 0 stress
[ 1729.319405] [ 1088] 0 1088 67340 16042 42
12 0 stress
[ 1729.345380] [ 1089] 0 1089 1803 27 10
21 0 stress
[ 1729.370934] [ 1090] 0 1090 67340 1223 13
12 0 stress
[ 1729.395553] [ 1091] 0 1091 1803 27 10
21 0 stress
[ 1729.419544] [ 1092] 0 1092 67340 8318 138
57298 0 stress
[ 1729.443863] [ 1093] 0 1093 1803 27 10
21 0 stress
[ 1729.467471] [ 1094] 0 1094 67340 2342 16
12 0 stress
[ 1729.491074] [ 1095] 0 1095 1803 27 10
21 0 stress
[ 1729.514194] [ 1096] 0 1096 67340 59017 126
12 0 stress
[ 1729.536998] [ 1097] 0 1097 67340 36245 82
12 0 stress
[ 1729.559710] [ 1098] 0 1098 67340 57050 122
12 0 stress
[ 1729.582264] [ 1099] 0 1099 67340 29239 68
12 0 stress
[ 1729.604895] [ 1100] 0 1100 67340 30815 71
12 0 stress
[ 1729.627532] [ 1101] 0 1101 67340 6881 138
58735 0 stress
[ 1729.650016] [ 1102] 0 1102 67340 37447 84
12 0 stress
[ 1729.672130] [ 1103] 0 1103 67340 6891 24
12 0 stress
[ 1729.693897] [ 1104] 0 1104 67340 35463 80
12 0 stress
[ 1729.715565] [ 1105] 0 1105 67340 11843 34
12 0 stress
[ 1729.736992] [ 1106] 0 1106 67340 10279 138
55338 0 stress
[ 1729.758383] [ 1198] 0 1198 88549 5957 185
6739 -900 setroubleshootd
[ 1729.780776] [ 2309] 0 2309 3243 192 12
0 0 systemd-cgroups
[ 1729.803176] [ 2312] 0 2312 3243 179 12
0 0 systemd-cgroups
[ 1729.825560] [ 2314] 0 2314 3243 209 11
0 0 systemd-cgroups
[ 1729.847848] [ 2315] 0 2315 3243 165 11
0 0 systemd-cgroups
[ 1729.870223] [ 2317] 0 2317 1736 41 6
0 0 systemd-cgroups
[ 1729.892316] [ 2319] 0 2319 2688 46 7
0 0 systemd-cgroups
[ 1729.914310] [ 2320] 0 2320 681 34 4
0 0 systemd-cgroups
[ 1729.936223] [ 2321] 0 2321 42 1 2
0 0 systemd-cgroups
[ 1729.957811] Out of memory: Kill process 516 (rsyslogd) score 0 or
sacrifice child
[ 1729.978407] Killed process 516 (rsyslogd) total-vm:245244kB,
anon-rss:0kB, file-rss:688kB
[ 1923.469572] INFO: task kworker/4:2:232 blocked for more than 120 seconds.
[ 1923.490002] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[ 1923.511856] kworker/4:2 D ffff88027fc14080 0 232 2
0x00000000
[ 1923.533216] ffff88027977db88 0000000000000046 ffff8802797b5040
ffff88027977dfd8
[ 1923.556100] ffff88027977dfd8 ffff88027977dfd8 ffff880179063580
ffff8802797b5040
[ 1923.578569] ffff88027977db68 ffff88027977dd10 7fffffffffffffff
ffff8802797b5040
[ 1923.601034] Call Trace:
[ 1923.618253] [<ffffffff815e7b69>] schedule+0x29/0x70
[ 1923.638372] [<ffffffff815e6114>] schedule_timeout+0x1f4/0x2b0
[ 1923.659539] [<ffffffff815e79f0>] wait_for_common+0x120/0x170
[ 1923.680664] [<ffffffff81096920>] ? try_to_wake_up+0x2d0/0x2d0
[ 1923.702224] [<ffffffff815e7b3d>] wait_for_completion+0x1d/0x20
[ 1923.723820] [<ffffffff8107b0b9>] call_usermodehelper_fns+0x1d9/0x200
[ 1923.746167] [<ffffffff810d0b32>] cgroup_release_agent+0xe2/0x180
[ 1923.768233] [<ffffffff8107e638>] process_one_work+0x148/0x490
[ 1923.790179] [<ffffffff810d0a50>] ? init_root_id+0xb0/0xb0
[ 1923.811797] [<ffffffff8107f16e>] worker_thread+0x15e/0x450
[ 1923.833733] [<ffffffff8107f010>] ? busy_worker_rebind_fn+0x110/0x110
[ 1923.856703] [<ffffffff81084350>] kthread+0xc0/0xd0
[ 1923.878067] [<ffffffff81084290>] ? kthread_create_on_node+0x120/0x120
[ 1923.901388] [<ffffffff815f12ac>] ret_from_fork+0x7c/0xb0
[ 1923.923544] [<ffffffff81084290>] ? kthread_create_on_node+0x120/0x120
[ 1923.947290] INFO: task rs:main Q:Reg:534 blocked for more than 120
seconds.
[ 1923.971646] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[ 1923.997338] rs:main Q:Reg D ffff88027fc54080 0 534 1
0x00000084
[ 1924.022565] ffff880279913930 0000000000000082 ffff880279b29ac0
ffff880279913fd8
[ 1924.049053] ffff880279913fd8 ffff880279913fd8 ffff880279b2b580
ffff880279b29ac0
[ 1924.075230] 000000000003b55f ffff880279b29ac0 ffff880279bedde0
ffffffffffffffff
[ 1924.101476] Call Trace:
[ 1924.122766] [<ffffffff815e7b69>] schedule+0x29/0x70
[ 1924.146894] [<ffffffff815e8915>] rwsem_down_failed_common+0xb5/0x140
[ 1924.173018] [<ffffffff815e89d5>] rwsem_down_read_failed+0x15/0x17
[ 1924.198572] [<ffffffff812eb0c4>] call_rwsem_down_read_failed+0x14/0x30
[ 1924.224975] [<ffffffff810fecf3>] ? taskstats_exit+0x383/0x420
[ 1924.250579] [<ffffffff812e9e5f>] ? __get_user_8+0x1f/0x29
[ 1924.275797] [<ffffffff815e6df4>] ? down_read+0x24/0x2b
[ 1924.300955] [<ffffffff815eca16>] __do_page_fault+0x1c6/0x4e0
[ 1924.326726] [<ffffffff811797ef>] ? alloc_pages_current+0xcf/0x140
[ 1924.353121] [<ffffffff8118345e>] ? new_slab+0x20e/0x310
[ 1924.378729] [<ffffffff815ecd3e>] do_page_fault+0xe/0x10
[ 1924.404424] [<ffffffff815e9358>] page_fault+0x28/0x30
[ 1924.429991] [<ffffffff810fecf3>] ? taskstats_exit+0x383/0x420
[ 1924.456539] [<ffffffff812e9e5f>] ? __get_user_8+0x1f/0x29
[ 1924.482746] [<ffffffff810c047d>] ? exit_robust_list+0x5d/0x160
[ 1924.509604] [<ffffffff810feca9>] ? taskstats_exit+0x339/0x420
[ 1924.536514] [<ffffffff8105e5d7>] mm_release+0x147/0x160
[ 1924.563242] [<ffffffff81065186>] exit_mm+0x26/0x120
[ 1924.589384] [<ffffffff81066787>] do_exit+0x167/0x8d0
[ 1924.615888] [<ffffffff810be75b>] ? futex_wait+0x13b/0x2c0
[ 1924.642809] [<ffffffff81183060>] ? kmem_cache_free+0x20/0x160
[ 1924.670213] [<ffffffff8106733f>] do_group_exit+0x3f/0xa0
[ 1924.697387] [<ffffffff81075eca>] get_signal_to_deliver+0x1ca/0x5e0
[ 1924.726012] [<ffffffff8101437f>] do_signal+0x3f/0x610
[ 1924.753260] [<ffffffff810c06ad>] ? do_futex+0x12d/0x580
[ 1924.780688] [<ffffffff810149f0>] do_notify_resume+0x80/0xb0
[ 1924.808394] [<ffffffff815f1612>] int_signal+0x12/0x17
[ 1924.835765] INFO: task rsyslogd:536 blocked for more than 120 seconds.
[ 1924.864967] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[ 1924.896012] rsyslogd D ffff88017bc14080 0 536 1
0x00000086
[ 1924.926767] ffff8802791b7c38 0000000000000082 ffff88027a0a0000
ffff8802791b7fd8
[ 1924.958361] ffff8802791b7fd8 ffff8802791b7fd8 ffff880279b29ac0
ffff88027a0a0000
[ 1924.990164] ffff8802791b7c28 ffff88027a0a0000 ffff880279bedde0
ffffffffffffffff
[ 1925.021929] Call Trace:
[ 1925.048178] [<ffffffff815e7b69>] schedule+0x29/0x70
[ 1925.077075] [<ffffffff815e8915>] rwsem_down_failed_common+0xb5/0x140
[ 1925.107125] [<ffffffff815e89b3>] rwsem_down_write_failed+0x13/0x20
[ 1925.136665] [<ffffffff812eb0f3>] call_rwsem_down_write_failed+0x13/0x20
[ 1925.166430] [<ffffffff815e6dc2>] ? down_write+0x32/0x40
[ 1925.194797] [<ffffffff8109ae0b>] task_numa_work+0xeb/0x270
[ 1925.223254] [<ffffffff81081047>] task_work_run+0xa7/0xe0
[ 1925.251609] [<ffffffff810762bb>] get_signal_to_deliver+0x5bb/0x5e0
[ 1925.280624] [<ffffffff8115e619>] ? handle_mm_fault+0x149/0x210
[ 1925.309222] [<ffffffff8101437f>] do_signal+0x3f/0x610
[ 1925.337048] [<ffffffff8117cea1>] ? change_prot_numa+0x51/0x60
[ 1925.365604] [<ffffffff8109aef6>] ? task_numa_work+0x1d6/0x270
[ 1925.394295] [<ffffffff810149f0>] do_notify_resume+0x80/0xb0
[ 1925.422677] [<ffffffff815e917c>] retint_signal+0x48/0x8c
[ 1925.450753] INFO: task NetworkManager:587 blocked for more than 120
seconds.
[ 1925.480882] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[ 1925.512086] NetworkManager D ffff88027fc54080 0 587 1
0x00000080
[ 1925.542751] ffff880279fb7dc8 0000000000000086 ffff880279f29ac0
ffff880279fb7fd8
[ 1925.574086] ffff880279fb7fd8 ffff880279fb7fd8 ffff8801752c1ac0
ffff880279f29ac0
[ 1925.605328] ffffea0005e82c40 ffff880279f29ac0 ffff8801793cd860
ffffffffffffffff
[ 1925.636585] Call Trace:
[ 1925.662557] [<ffffffff815e7b69>] schedule+0x29/0x70
[ 1925.691518] [<ffffffff815e8915>] rwsem_down_failed_common+0xb5/0x140
[ 1925.721815] [<ffffffff8115e619>] ? handle_mm_fault+0x149/0x210
[ 1925.751814] [<ffffffff815e89b3>] rwsem_down_write_failed+0x13/0x20
[ 1925.781475] [<ffffffff812eb0f3>] call_rwsem_down_write_failed+0x13/0x20
[ 1925.811421] [<ffffffff815e6dc2>] ? down_write+0x32/0x40
[ 1925.839679] [<ffffffff8109ae0b>] task_numa_work+0xeb/0x270
[ 1925.868180] [<ffffffff810e307c>] ? __audit_syscall_exit+0x3ec/0x450
[ 1925.897544] [<ffffffff81081047>] task_work_run+0xa7/0xe0
[ 1925.925674] [<ffffffff810149e1>] do_notify_resume+0x71/0xb0
[ 1925.954159] [<ffffffff815e917c>] retint_signal+0x48/0x8c
[ 2045.984126] INFO: task kworker/4:2:232 blocked for more than 120 seconds.
[ 2046.013951] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[ 2046.045339] kworker/4:2 D ffff88027fc14080 0 232 2
0x00000000
[ 2046.075873] ffff88027977db88 0000000000000046 ffff8802797b5040
ffff88027977dfd8
[ 2046.107357] ffff88027977dfd8 ffff88027977dfd8 ffff880179063580
ffff8802797b5040
[ 2046.138576] ffff88027977db68 ffff88027977dd10 7fffffffffffffff
ffff8802797b5040
[ 2046.169696] Call Trace:
[ 2046.195647] [<ffffffff815e7b69>] schedule+0x29/0x70
[ 2046.224276] [<ffffffff815e6114>] schedule_timeout+0x1f4/0x2b0
[ 2046.253957] [<ffffffff815e79f0>] wait_for_common+0x120/0x170
[ 2046.283497] [<ffffffff81096920>] ? try_to_wake_up+0x2d0/0x2d0
[ 2046.313349] [<ffffffff815e7b3d>] wait_for_completion+0x1d/0x20
[ 2046.343061] [<ffffffff8107b0b9>] call_usermodehelper_fns+0x1d9/0x200
[ 2046.373636] [<ffffffff810d0b32>] cgroup_release_agent+0xe2/0x180
[ 2046.403705] [<ffffffff8107e638>] process_one_work+0x148/0x490
[ 2046.433485] [<ffffffff810d0a50>] ? init_root_id+0xb0/0xb0
[ 2046.462879] [<ffffffff8107f16e>] worker_thread+0x15e/0x450
[ 2046.492398] [<ffffffff8107f010>] ? busy_worker_rebind_fn+0x110/0x110
[ 2046.522898] [<ffffffff81084350>] kthread+0xc0/0xd0
[ 2046.551810] [<ffffffff81084290>] ? kthread_create_on_node+0x120/0x120
[ 2046.582425] [<ffffffff815f12ac>] ret_from_fork+0x7c/0xb0
[ 2046.612099] [<ffffffff81084290>] ? kthread_create_on_node+0x120/0x120
----------------------- snip ----------------------------

The other issue is that oom02 (LTP: testcases/kernel/mem/oom/oom02) made
the oom-killer kill unexpected processes
(oom02 is designed to try to hog memory on one node), and oom02 always
hung until killed manually.
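
For reference, here is a minimal sketch of what a single-node memory hog in
the spirit of oom02 looks like. This is not the LTP source; it assumes
libnuma is available and a build along the lines of `gcc hog.c -o hog -lnuma`:

	#include <numa.h>
	#include <string.h>

	int main(void)
	{
		size_t chunk = 64UL << 20;	/* allocate in 64MB chunks */

		if (numa_available() < 0)
			return 1;

		for (;;) {
			/* ask for memory backed by node 0 only ... */
			void *p = numa_alloc_onnode(chunk, 0);
			if (!p)
				break;
			/* ... and touch it so the pages are really allocated */
			memset(p, 0xaa, chunk);
		}
		return 0;
	}

so memory pressure is concentrated on a single node while the rest of the
machine stays comparatively idle.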

------------ snip -----------------------
[12449.554508] [ pid ] uid tgid total_vm rss nr_ptes swapents
oom_score_adj name
[12449.554524] [ 303] 0 303 9750 296 23 414
-1000 systemd-udevd
[12449.554527] [ 448] 0 448 6558 224 13 35
-1000 auditd
[12449.554532] [ 519] 0 519 24696 82 15
20 0 iscsiuio
[12449.554534] [ 529] 81 529 8201 389 27 35
-900 dbus-daemon
[12449.554536] [ 530] 0 530 1264 94 9
15 0 iscsid
[12449.554538] [ 531] 0 531 1389 876 9 0
-17 iscsid
[12449.554540] [ 538] 0 538 1608 158 10
28 0 mcelog
[12449.554542] [ 543] 0 543 5906 138 16
48 0 atd
[12449.554543] [ 551] 0 551 30146 301 22
56 0 crond
[12449.554545] [ 569] 0 569 26612 224 83
184 0 login
[12449.554547] [ 587] 0 587 108127 917 91
45 0 NetworkManager
[12449.554549] [ 643] 0 643 21931 290 65
120 -900 modem-manager
[12449.554551] [ 644] 0 644 5812 222 16
61 0 bluetoothd
[12449.554553] [ 672] 0 672 118388 469 179
935 0 libvirtd
[12449.554555] [ 691] 0 691 20059 229 40 177
-1000 sshd
[12449.554556] [ 996] 0 996 28975 654 18
186 0 bash
[12449.554558] [ 1198] 0 1198 88549 5958 185
6738 -900 setroubleshootd
[12449.554563] [ 2332] 0 2332 74107 3729 152
0 0 systemd-journal
[12449.554564] [ 2335] 0 2335 8222 388 20
0 0 systemd-logind
[12449.554567] [20909] 0 20909 61311 456 22
0 0 rsyslogd
[12449.554572] [12818] 0 12818 788031 6273 24
0 0 oom02
[12449.554574] [13193] 0 13193 23143 3749 46
0 0 dhclient
[12449.554576] [13217] 0 13217 33221 1188 67
0 0 sshd
[12449.554577] [13221] 0 13221 28949 879 19
0 0 bash
[12449.554579] [13308] 0 13308 23143 3749 45
0 0 dhclient
[12449.554581] [13388] 0 13388 36864 1063 45
0 0 vim
[12449.554583] Out of memory: Kill process 13193 (dhclient) score 1 or
sacrifice child
[12449.554584] Killed process 13193 (dhclient) total-vm:92572kB,
anon-rss:11976kB, file-rss:3020kB
[12451.878812] oom02 invoked oom-killer: gfp_mask=0x280da, order=0,
oom_score_adj=0
[12451.878815] oom02 cpuset=/ mems_allowed=0-1
[12451.878818] Pid: 12818, comm: oom02 Tainted: G W
3.7.0-rc6numacorev17+ #5
[12451.878819] Call Trace:
[12451.878829] [<ffffffff810d92d1>] ?
cpuset_print_task_mems_allowed+0x91/0xa0
[12451.878834] [<ffffffff815dd3b6>] dump_header.isra.12+0x70/0x19b
[12451.878838] [<ffffffff812e37d3>] ? ___ratelimit+0xa3/0x120
[12451.878843] [<ffffffff81139c0d>] oom_kill_process+0x1cd/0x320
[12451.878848] [<ffffffff8106d3c5>] ? has_ns_capability_noaudit+0x15/0x20
[12451.878850] [<ffffffff8113a377>] out_of_memory+0x447/0x480
[12451.878853] [<ffffffff8113ff5c>] __alloc_pages_nodemask+0x94c/0x960
[12451.878858] [<ffffffff8117b1a6>] alloc_pages_vma+0xb6/0x190
[12451.878861] [<ffffffff8115e094>] handle_pte_fault+0x8f4/0xb90
[12451.878865] [<ffffffff810fa237>] ? call_rcu_sched+0x17/0x20
[12451.878868] [<ffffffff8118ed82>] ? put_filp+0x52/0x60
[12451.878870] [<ffffffff8115e619>] handle_mm_fault+0x149/0x210
[12451.878873] [<ffffffff815ec9c2>] __do_page_fault+0x172/0x4e0
[12451.878875] [<ffffffff81183060>] ? kmem_cache_free+0x20/0x160
[12451.878878] [<ffffffff81198396>] ? final_putname+0x26/0x50
[12451.878880] [<ffffffff815ecd3e>] do_page_fault+0xe/0x10
[12451.878883] [<ffffffff815e9358>] page_fault+0x28/0x30
--------------- snip ------------------

-------------- snip ------------------
[12451.956997] [ pid ] uid tgid total_vm rss nr_ptes swapents
oom_score_adj name
[12451.957010] [ 303] 0 303 9750 296 23 414
-1000 systemd-udevd
[12451.957013] [ 448] 0 448 6558 224 13 35
-1000 auditd
[12451.957017] [ 519] 0 519 24696 82 15
20 0 iscsiuio
[12451.957019] [ 529] 81 529 8201 389 27 35
-900 dbus-daemon
[12451.957021] [ 530] 0 530 1264 94 9
15 0 iscsid
[12451.957022] [ 531] 0 531 1389 876 9 0
-17 iscsid
[12451.957024] [ 538] 0 538 1608 158 10
28 0 mcelog
[12451.957026] [ 543] 0 543 5906 138 16
48 0 atd
[12451.957028] [ 551] 0 551 30146 301 22
56 0 crond
[12451.957029] [ 569] 0 569 26612 224 83
184 0 login
[12451.957031] [ 587] 0 587 110177 922 92
45 0 NetworkManager
[12451.957033] [ 643] 0 643 21931 290 65
120 -900 modem-manager
[12451.957035] [ 644] 0 644 5812 222 16
61 0 bluetoothd
[12451.957036] [ 672] 0 672 118388 469 179
935 0 libvirtd
[12451.957038] [ 691] 0 691 20059 229 40 177
-1000 sshd
[12451.957040] [ 996] 0 996 28975 654 18
186 0 bash
[12451.957042] [ 1198] 0 1198 88549 5958 185
6738 -900 setroubleshootd
[12451.957045] [ 2332] 0 2332 74107 3897 152
0 0 systemd-journal
[12451.957047] [ 2335] 0 2335 8222 388 20
0 0 systemd-logind
[12451.957049] [20909] 0 20909 61311 456 22
0 0 rsyslogd
[12451.957052] [12818] 0 12818 788031 6273 24
0 0 oom02
[12451.957054] [13217] 0 13217 33221 1188 67
0 0 sshd
[12451.957055] [13221] 0 13221 28949 879 19
0 0 bash
[12451.957057] [13388] 0 13388 36864 1063 45
0 0 vim
[12451.957059] [13410] 0 13410 33476 510 58 0
-900 nm-dispatcher.a
[12451.957061] Out of memory: Kill process 2335 (systemd-logind) score 0
or sacrifice child
-------------------- snip ----------------------------

I also found that the oom-killer performed badly on the numa/core tree; you
can use the LTP testcases/kernel/mem/oom/oom* tests to verify it.

Please let me know if you need more details or any other testing work.

Thanks,
Zhouping

2012-11-23 17:32:18

by Mel Gorman

[permalink] [raw]
Subject: Comparison between three trees (was: Latest numa/core release, v17)

Warning: This is an insanely long mail and there is a lot of data here. Get
coffee or something.

This is another round of comparisons between the latest released versions
of each of three automatic numa balancing trees that are out there.

From the series "Automatic NUMA Balancing V5", the kernels tested were

stats-v5r1 Patches 1-10. TLB optimisations, migration stats
thpmigrate-v5r1 Patches 1-37. Basic placement policy, PMD handling, THP migration etc.
adaptscan-v5r1 Patches 1-38. Heavy handed PTE scan reduction
delaystart-v5r1 Patches 1-40. Delay the PTE scan until running on a new node

If I just say balancenuma, I mean the "delaystart-v5r1" kernel. The other
kernels are included so you can see the impact the scan rate adaption
patch has and what that might mean for a placement policy using a proper
feedback mechanism.

The other two kernels were

numacore-20121123 It was no longer clear what the deltas between releases and
the dependencies might be so I just pulled tip/master on November
23rd, 2012. An earlier pull had serious difficulties and the patch
responsible has been dropped since. This is not a like-with-like
comparison as the tree contains numerous other patches but it's
the best available given the timeframe

autonuma-v28fast This is a rebased version of Andrea's autonuma-v28fast
branch with Hugh's THP migration patch on top. Hopefully Andrea
and Hugh will not mind but I took the liberty of publishing the
result as the mm-autonuma-v28fastr4-mels-rebase branch in
git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux-balancenuma.git

I'm treating stats-v5r1 as the baseline as it has the same TLB optimisations
shared between balancenuma and numacore. As I write this I realise this may
not be fair to autonuma depending on how it avoids flushing the TLB. I'm
not digging into that right now, Andrea might comment.

All of these tests were run unattended via MMTests. Any errors in the
methodology would be applied evenly to all kernels tested. There were
monitors running but *not* profiling for the reported figures. All tests
were actually run in pairs, with and without profiling but none of the
profiles are included, nor have I looked at any of them yet. The heaviest
active monitor reads numa_maps every 10 seconds; numa_maps is only read once
per address space and the result is reused for all of its threads. This will
affect peak values because it means the monitors contend on some of the same
locks the PTE scanner does, for example. If time permits, I'll run a
no-monitor set.

Lets start with the usual autonumabench.

AUTONUMA BENCH
3.7.0 3.7.0 3.7.0 3.7.0 3.7.0 3.7.0
rc6-stats-v5r1 rc6-numacore-20121123 rc6-autonuma-v28fastr4 rc6-thpmigrate-v5r1 rc6-adaptscan-v5r1 rc6-delaystart-v5r4
User NUMA01 75064.91 ( 0.00%) 24837.09 ( 66.91%) 31651.70 ( 57.83%) 54454.75 ( 27.46%) 58561.99 ( 21.98%) 56747.85 ( 24.40%)
User NUMA01_THEADLOCAL 62045.39 ( 0.00%) 17582.23 ( 71.66%) 17173.01 ( 72.32%) 16906.80 ( 72.75%) 17813.47 ( 71.29%) 18021.32 ( 70.95%)
User NUMA02 6921.18 ( 0.00%) 2088.16 ( 69.83%) 2226.35 ( 67.83%) 2065.29 ( 70.16%) 2049.90 ( 70.38%) 2098.25 ( 69.68%)
User NUMA02_SMT 2924.84 ( 0.00%) 1006.42 ( 65.59%) 1069.26 ( 63.44%) 987.17 ( 66.25%) 995.65 ( 65.96%) 1000.24 ( 65.80%)
System NUMA01 48.75 ( 0.00%) 1138.62 (-2235.63%) 249.25 (-411.28%) 696.82 (-1329.37%) 273.76 (-461.56%) 271.95 (-457.85%)
System NUMA01_THEADLOCAL 46.05 ( 0.00%) 480.03 (-942.41%) 92.40 (-100.65%) 156.85 (-240.61%) 135.24 (-193.68%) 122.13 (-165.21%)
System NUMA02 1.73 ( 0.00%) 24.84 (-1335.84%) 7.73 (-346.82%) 8.74 (-405.20%) 6.35 (-267.05%) 9.02 (-421.39%)
System NUMA02_SMT 18.34 ( 0.00%) 11.02 ( 39.91%) 3.74 ( 79.61%) 3.31 ( 81.95%) 3.53 ( 80.75%) 3.55 ( 80.64%)
Elapsed NUMA01 1666.60 ( 0.00%) 585.34 ( 64.88%) 749.72 ( 55.02%) 1234.33 ( 25.94%) 1321.51 ( 20.71%) 1269.96 ( 23.80%)
Elapsed NUMA01_THEADLOCAL 1391.37 ( 0.00%) 392.39 ( 71.80%) 381.56 ( 72.58%) 370.06 ( 73.40%) 396.18 ( 71.53%) 397.63 ( 71.42%)
Elapsed NUMA02 176.41 ( 0.00%) 50.78 ( 71.21%) 53.35 ( 69.76%) 48.89 ( 72.29%) 50.66 ( 71.28%) 50.34 ( 71.46%)
Elapsed NUMA02_SMT 163.88 ( 0.00%) 48.09 ( 70.66%) 49.54 ( 69.77%) 46.83 ( 71.42%) 48.29 ( 70.53%) 47.63 ( 70.94%)
CPU NUMA01 4506.00 ( 0.00%) 4437.00 ( 1.53%) 4255.00 ( 5.57%) 4468.00 ( 0.84%) 4452.00 ( 1.20%) 4489.00 ( 0.38%)
CPU NUMA01_THEADLOCAL 4462.00 ( 0.00%) 4603.00 ( -3.16%) 4524.00 ( -1.39%) 4610.00 ( -3.32%) 4530.00 ( -1.52%) 4562.00 ( -2.24%)
CPU NUMA02 3924.00 ( 0.00%) 4160.00 ( -6.01%) 4187.00 ( -6.70%) 4241.00 ( -8.08%) 4058.00 ( -3.41%) 4185.00 ( -6.65%)
CPU NUMA02_SMT 1795.00 ( 0.00%) 2115.00 (-17.83%) 2165.00 (-20.61%) 2114.00 (-17.77%) 2068.00 (-15.21%) 2107.00 (-17.38%)

numacore is the best at running the adverse numa01 workload. autonuma does
respectably and balancenuma does not cope with this case. It improves on the
baseline but it does not know how to interleave for this type of workload.

For the other workloads that are friendlier to NUMA, the three trees
are roughly comparable in terms of elapsed time. There are no multiple runs
because that would take too long, but there is a strong chance we are within
the noise of each other for the other workloads.

Where we differ is in system CPU usage. In all cases, numacore uses more
system CPU. It is likely it is compensating better for this overhead
with better placement. With this higher overhead it ends up with a tie
on everything except the adverse workload. Take NUMA01_THREADLOCAL as
an example -- numacore uses roughly 4 times more system CPU than either
autonuma or balancenuma. autonuma's cost could be hidden in kernel threads
but that's not true for balancenuma.

MMTests Statistics: duration
3.7.0 3.7.0 3.7.0 3.7.0 3.7.0 3.7.0
rc6-stats-v5r1  rc6-numacore-20121123  rc6-autonuma-v28fastr4  rc6-thpmigrate-v5r1  rc6-adaptscan-v5r1  rc6-delaystart-v5r4
User 274653.21 92676.27 107399.17 130223.93 142154.84 146804.10
System 1329.11 5364.97 1093.69 2773.99 1453.79 1814.66
Elapsed 6827.56 2781.35 3046.92 3508.55 3757.51 3843.07

The difference in overall elapsed time comes down to how well numa01 is handled. There
are large differences in the system CPU time. It's using almost twice
the amount of CPU as either autonuma or balancenuma.

MMTests Statistics: vmstat
3.7.0 3.7.0 3.7.0 3.7.0 3.7.0 3.7.0
rc6-stats-v5r1  rc6-numacore-20121123  rc6-autonuma-v28fastr4  rc6-thpmigrate-v5r1  rc6-adaptscan-v5r1  rc6-delaystart-v5r4
Page Ins 195440 172116 168284 169788 167656 168860
Page Outs 355400 238756 247740 246860 264276 269304
Swap Ins 0 0 0 0 0 0
Swap Outs 0 0 0 0 0 0
Direct pages scanned 0 0 0 0 0 0
Kswapd pages scanned 0 0 0 0 0 0
Kswapd pages reclaimed 0 0 0 0 0 0
Direct pages reclaimed 0 0 0 0 0 0
Kswapd efficiency 100% 100% 100% 100% 100% 100%
Kswapd velocity 0.000 0.000 0.000 0.000 0.000 0.000
Direct efficiency 100% 100% 100% 100% 100% 100%
Direct velocity 0.000 0.000 0.000 0.000 0.000 0.000
Percentage direct scans 0% 0% 0% 0% 0% 0%
Page writes by reclaim 0 0 0 0 0 0
Page writes file 0 0 0 0 0 0
Page writes anon 0 0 0 0 0 0
Page reclaim immediate 0 0 0 0 0 0
Page rescued immediate 0 0 0 0 0 0
Slabs scanned 0 0 0 0 0 0
Direct inode steals 0 0 0 0 0 0
Kswapd inode steals 0 0 0 0 0 0
Kswapd skipped wait 0 0 0 0 0 0
THP fault alloc 42264 29117 37284 47486 32077 34343
THP collapse alloc 23 1 809 23 26 22
THP splits 5 1 47 6 5 4
THP fault fallback 0 0 0 0 0 0
THP collapse fail 0 0 0 0 0 0
Compaction stalls 0 0 0 0 0 0
Compaction success 0 0 0 0 0 0
Compaction failures 0 0 0 0 0 0
Page migrate success 0 0 0 523123 180790 209771
Page migrate failure 0 0 0 0 0 0
Compaction pages isolated 0 0 0 0 0 0
Compaction migrate scanned 0 0 0 0 0 0
Compaction free scanned 0 0 0 0 0 0
Compaction cost 0 0 0 543 187 217
NUMA PTE updates 0 0 0 842347410 295302723 301160396
NUMA hint faults 0 0 0 6924258 3277126 3189624
NUMA hint local faults 0 0 0 3757418 1824546 1872917
NUMA pages migrated 0 0 0 523123 180790 209771
AutoNUMA cost 0 0 0 40527 18456 18060

Not much to usefully interpret here other than noting we generally avoid
splitting THP. For balancenuma, note what the scan adaption does to the
number of PTE updates and the number of faults incurred. A policy may
not necessarily like this. It depends on its requirements but if it wants
higher PTE scan rates it will have to compensate for it.
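
To put numbers on that from the table above: scan rate adaption cuts NUMA PTE
updates from 842,347,410 (thpmigrate) to roughly 295-301 million
(adaptscan/delaystart) and NUMA hint faults from 6,924,258 to roughly 3.2
million, which is also reflected in the lower AutoNUMA cost figures.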

Next is the specjbb. There are 4 separate configurations

multi JVM, THP
multi JVM, no THP
single JVM, THP
single JVM, no THP

SPECJBB: Multi JVMs (one per node, 4 nodes), THP is enabled
3.7.0 3.7.0 3.7.0 3.7.0 3.7.0 3.7.0
rc6-stats-v5r1 rc6-numacore-20121123 rc6-autonuma-v28fastr4 rc6-thpmigrate-v5r1 rc6-adaptscan-v5r1 rc6-delaystart-v5r4
Mean 1 30969.75 ( 0.00%) 28318.75 ( -8.56%) 31542.00 ( 1.85%) 30427.75 ( -1.75%) 31192.25 ( 0.72%) 31216.75 ( 0.80%)
Mean 2 62036.50 ( 0.00%) 57323.50 ( -7.60%) 66167.25 ( 6.66%) 62900.25 ( 1.39%) 61826.75 ( -0.34%) 62239.00 ( 0.33%)
Mean 3 90075.50 ( 0.00%) 86045.25 ( -4.47%) 96151.25 ( 6.75%) 91035.75 ( 1.07%) 89128.25 ( -1.05%) 90692.25 ( 0.68%)
Mean 4 116062.50 ( 0.00%) 91439.25 (-21.22%) 125072.75 ( 7.76%) 116103.75 ( 0.04%) 115819.25 ( -0.21%) 117047.75 ( 0.85%)
Mean 5 136056.00 ( 0.00%) 97558.25 (-28.30%) 150854.50 ( 10.88%) 138629.75 ( 1.89%) 138712.25 ( 1.95%) 139477.00 ( 2.51%)
Mean 6 153827.50 ( 0.00%) 128628.25 (-16.38%) 175849.50 ( 14.32%) 157472.75 ( 2.37%) 158780.00 ( 3.22%) 158780.25 ( 3.22%)
Mean 7 151946.00 ( 0.00%) 136447.25 (-10.20%) 181675.50 ( 19.57%) 160388.25 ( 5.56%) 160378.75 ( 5.55%) 162787.50 ( 7.14%)
Mean 8 155941.50 ( 0.00%) 136351.25 (-12.56%) 185131.75 ( 18.72%) 158613.00 ( 1.71%) 159683.25 ( 2.40%) 164054.25 ( 5.20%)
Mean 9 146191.50 ( 0.00%) 125132.00 (-14.41%) 184833.50 ( 26.43%) 155988.50 ( 6.70%) 157664.75 ( 7.85%) 161319.00 ( 10.35%)
Mean 10 139189.50 ( 0.00%) 98594.50 (-29.17%) 179948.50 ( 29.28%) 150341.75 ( 8.01%) 152771.00 ( 9.76%) 155530.25 ( 11.74%)
Mean 11 133561.75 ( 0.00%) 105967.75 (-20.66%) 175904.50 ( 31.70%) 144335.75 ( 8.07%) 146147.00 ( 9.42%) 146832.50 ( 9.94%)
Mean 12 123752.25 ( 0.00%) 138392.25 ( 11.83%) 169482.50 ( 36.95%) 140328.50 ( 13.39%) 138498.50 ( 11.92%) 142362.25 ( 15.04%)
Mean 13 123578.50 ( 0.00%) 103236.50 (-16.46%) 166714.75 ( 34.91%) 136745.25 ( 10.65%) 138469.50 ( 12.05%) 140699.00 ( 13.85%)
Mean 14 123812.00 ( 0.00%) 113250.00 ( -8.53%) 164406.00 ( 32.79%) 138061.25 ( 11.51%) 134047.25 ( 8.27%) 139790.50 ( 12.91%)
Mean 15 123499.25 ( 0.00%) 130577.50 ( 5.73%) 162517.00 ( 31.59%) 133598.50 ( 8.18%) 132651.50 ( 7.41%) 134423.00 ( 8.85%)
Mean 16 118595.75 ( 0.00%) 127494.50 ( 7.50%) 160836.25 ( 35.62%) 129305.25 ( 9.03%) 131355.75 ( 10.76%) 132424.25 ( 11.66%)
Mean 17 115374.75 ( 0.00%) 121443.50 ( 5.26%) 157091.00 ( 36.16%) 127538.50 ( 10.54%) 128536.00 ( 11.41%) 128923.75 ( 11.74%)
Mean 18 120981.00 ( 0.00%) 119649.00 ( -1.10%) 155978.75 ( 28.93%) 126031.00 ( 4.17%) 127277.00 ( 5.20%) 131032.25 ( 8.31%)
Stddev 1 1256.20 ( 0.00%) 1649.69 (-31.32%) 1042.80 ( 16.99%) 1004.74 ( 20.02%) 1125.79 ( 10.38%) 965.75 ( 23.12%)
Stddev 2 894.02 ( 0.00%) 1299.83 (-45.39%) 153.62 ( 82.82%) 1757.03 (-96.53%) 1089.32 (-21.84%) 370.16 ( 58.60%)
Stddev 3 1354.13 ( 0.00%) 3221.35 (-137.89%) 452.26 ( 66.60%) 1169.99 ( 13.60%) 1387.57 ( -2.47%) 629.10 ( 53.54%)
Stddev 4 1505.56 ( 0.00%) 9559.15 (-534.92%) 597.48 ( 60.32%) 1046.60 ( 30.48%) 1285.40 ( 14.62%) 1320.74 ( 12.28%)
Stddev 5 513.85 ( 0.00%) 20854.29 (-3958.43%) 416.34 ( 18.98%) 760.85 (-48.07%) 1118.27 (-117.62%) 1382.28 (-169.00%)
Stddev 6 1393.16 ( 0.00%) 11554.27 (-729.36%) 1225.46 ( 12.04%) 1190.92 ( 14.52%) 1662.55 (-19.34%) 1814.39 (-30.24%)
Stddev 7 1645.51 ( 0.00%) 7300.33 (-343.65%) 1690.25 ( -2.72%) 2517.46 (-52.99%) 1882.02 (-14.37%) 2393.67 (-45.47%)
Stddev 8 4853.40 ( 0.00%) 10303.35 (-112.29%) 1724.63 ( 64.47%) 4280.27 ( 11.81%) 6680.41 (-37.64%) 1453.35 ( 70.05%)
Stddev 9 4366.96 ( 0.00%) 9683.51 (-121.74%) 3443.47 ( 21.15%) 7360.20 (-68.54%) 4560.06 ( -4.42%) 3269.18 ( 25.14%)
Stddev 10 4840.11 ( 0.00%) 7402.77 (-52.95%) 5808.63 (-20.01%) 4639.55 ( 4.14%) 1221.58 ( 74.76%) 3911.11 ( 19.19%)
Stddev 11 5208.04 ( 0.00%) 12657.33 (-143.03%) 10003.74 (-92.08%) 8961.02 (-72.06%) 3754.61 ( 27.91%) 4138.30 ( 20.54%)
Stddev 12 5015.66 ( 0.00%) 14749.87 (-194.08%) 14862.62 (-196.32%) 4554.52 ( 9.19%) 7436.76 (-48.27%) 3902.07 ( 22.20%)
Stddev 13 3348.23 ( 0.00%) 13349.42 (-298.70%) 15333.50 (-357.96%) 5121.75 (-52.97%) 6893.45 (-105.88%) 3633.54 ( -8.52%)
Stddev 14 2816.30 ( 0.00%) 3878.71 (-37.72%) 15707.34 (-457.73%) 1296.47 ( 53.97%) 4760.04 (-69.02%) 1540.51 ( 45.30%)
Stddev 15 2592.17 ( 0.00%) 777.61 ( 70.00%) 17317.35 (-568.06%) 3572.43 (-37.82%) 5510.05 (-112.57%) 2227.21 ( 14.08%)
Stddev 16 4163.07 ( 0.00%) 1239.57 ( 70.22%) 16770.00 (-302.83%) 3858.12 ( 7.33%) 2947.70 ( 29.19%) 3332.69 ( 19.95%)
Stddev 17 5959.34 ( 0.00%) 1602.88 ( 73.10%) 16890.90 (-183.44%) 4770.68 ( 19.95%) 4398.91 ( 26.18%) 3340.67 ( 43.94%)
Stddev 18 3040.65 ( 0.00%) 857.66 ( 71.79%) 19296.90 (-534.63%) 6344.77 (-108.67%) 4183.68 (-37.59%) 1278.14 ( 57.96%)
TPut 1 123879.00 ( 0.00%) 113275.00 ( -8.56%) 126168.00 ( 1.85%) 121711.00 ( -1.75%) 124769.00 ( 0.72%) 124867.00 ( 0.80%)
TPut 2 248146.00 ( 0.00%) 229294.00 ( -7.60%) 264669.00 ( 6.66%) 251601.00 ( 1.39%) 247307.00 ( -0.34%) 248956.00 ( 0.33%)
TPut 3 360302.00 ( 0.00%) 344181.00 ( -4.47%) 384605.00 ( 6.75%) 364143.00 ( 1.07%) 356513.00 ( -1.05%) 362769.00 ( 0.68%)
TPut 4 464250.00 ( 0.00%) 365757.00 (-21.22%) 500291.00 ( 7.76%) 464415.00 ( 0.04%) 463277.00 ( -0.21%) 468191.00 ( 0.85%)
TPut 5 544224.00 ( 0.00%) 390233.00 (-28.30%) 603418.00 ( 10.88%) 554519.00 ( 1.89%) 554849.00 ( 1.95%) 557908.00 ( 2.51%)
TPut 6 615310.00 ( 0.00%) 514513.00 (-16.38%) 703398.00 ( 14.32%) 629891.00 ( 2.37%) 635120.00 ( 3.22%) 635121.00 ( 3.22%)
TPut 7 607784.00 ( 0.00%) 545789.00 (-10.20%) 726702.00 ( 19.57%) 641553.00 ( 5.56%) 641515.00 ( 5.55%) 651150.00 ( 7.14%)
TPut 8 623766.00 ( 0.00%) 545405.00 (-12.56%) 740527.00 ( 18.72%) 634452.00 ( 1.71%) 638733.00 ( 2.40%) 656217.00 ( 5.20%)
TPut 9 584766.00 ( 0.00%) 500528.00 (-14.41%) 739334.00 ( 26.43%) 623954.00 ( 6.70%) 630659.00 ( 7.85%) 645276.00 ( 10.35%)
TPut 10 556758.00 ( 0.00%) 394378.00 (-29.17%) 719794.00 ( 29.28%) 601367.00 ( 8.01%) 611084.00 ( 9.76%) 622121.00 ( 11.74%)
TPut 11 534247.00 ( 0.00%) 423871.00 (-20.66%) 703618.00 ( 31.70%) 577343.00 ( 8.07%) 584588.00 ( 9.42%) 587330.00 ( 9.94%)
TPut 12 495009.00 ( 0.00%) 553569.00 ( 11.83%) 677930.00 ( 36.95%) 561314.00 ( 13.39%) 553994.00 ( 11.92%) 569449.00 ( 15.04%)
TPut 13 494314.00 ( 0.00%) 412946.00 (-16.46%) 666859.00 ( 34.91%) 546981.00 ( 10.65%) 553878.00 ( 12.05%) 562796.00 ( 13.85%)
TPut 14 495248.00 ( 0.00%) 453000.00 ( -8.53%) 657624.00 ( 32.79%) 552245.00 ( 11.51%) 536189.00 ( 8.27%) 559162.00 ( 12.91%)
TPut 15 493997.00 ( 0.00%) 522310.00 ( 5.73%) 650068.00 ( 31.59%) 534394.00 ( 8.18%) 530606.00 ( 7.41%) 537692.00 ( 8.85%)
TPut 16 474383.00 ( 0.00%) 509978.00 ( 7.50%) 643345.00 ( 35.62%) 517221.00 ( 9.03%) 525423.00 ( 10.76%) 529697.00 ( 11.66%)
TPut 17 461499.00 ( 0.00%) 485774.00 ( 5.26%) 628364.00 ( 36.16%) 510154.00 ( 10.54%) 514144.00 ( 11.41%) 515695.00 ( 11.74%)
TPut 18 483924.00 ( 0.00%) 478596.00 ( -1.10%) 623915.00 ( 28.93%) 504124.00 ( 4.17%) 509108.00 ( 5.20%) 524129.00 ( 8.31%)

numacore is not handling the multi JVM case well, with numerous regressions
for lower numbers of threads. It starts improving as it gets closer to the
expected peak of 12 warehouses for this configuration. There are also large
variances between the different JVMs' throughput, but note again that this
improves as the number of warehouses increases.

autonuma generally does very well in terms of throughput but the variance
between JVMs is massive.

balancenuma does reasonably well and improves upon the baseline kernel. It's
no longer regressing for small numbers of warehouses and is basically the
same as mainline. As the number of warehouses increases, it shows some
performance improvement and the variances are not too bad.

SPECJBB PEAKS
3.7.0 3.7.0 3.7.0 3.7.0 3.7.0 3.7.0
rc6-stats-v5r1 rc6-numacore-20121123 rc6-autonuma-v28fastr4 rc6-thpmigrate-v5r1 rc6-adaptscan-v5r1 rc6-delaystart-v5r4
Expctd Warehouse 12.00 ( 0.00%) 12.00 ( 0.00%) 12.00 ( 0.00%) 12.00 ( 0.00%) 12.00 ( 0.00%) 12.00 ( 0.00%)
Expctd Peak Bops 495009.00 ( 0.00%) 553569.00 ( 11.83%) 677930.00 ( 36.95%) 561314.00 ( 13.39%) 553994.00 ( 11.92%) 569449.00 ( 15.04%)
Actual Warehouse 8.00 ( 0.00%) 12.00 ( 50.00%) 8.00 ( 0.00%) 7.00 (-12.50%) 7.00 (-12.50%) 8.00 ( 0.00%)
Actual Peak Bops 623766.00 ( 0.00%) 553569.00 (-11.25%) 740527.00 ( 18.72%) 641553.00 ( 2.85%) 641515.00 ( 2.85%) 656217.00 ( 5.20%)
SpecJBB Bops 261413.00 ( 0.00%) 262783.00 ( 0.52%) 349854.00 ( 33.83%) 286648.00 ( 9.65%) 286412.00 ( 9.56%) 292202.00 ( 11.78%)
SpecJBB Bops/JVM 65353.00 ( 0.00%) 65696.00 ( 0.52%) 87464.00 ( 33.83%) 71662.00 ( 9.65%) 71603.00 ( 9.56%) 73051.00 ( 11.78%)

Note the peak numbers for numacore. The peak performance regresses 11.25%
from the baseline kernel. However, as it improves with the number of
warehouses, specjbb reports that it sees a 0.52% gain because it's using a
range of peak values.

autonuma sees an 18.72% performance gain at its peak and a 33.83% gain in
its specjbb score.

balancenuma does reasonably well with a 5.2% gain at its peak and 11.78% on its
overall specjbb score.

MMTests Statistics: duration
3.7.0 3.7.0 3.7.0 3.7.0 3.7.0 3.7.0
rc6-stats-v5r1  rc6-numacore-20121123  rc6-autonuma-v28fastr4  rc6-thpmigrate-v5r1  rc6-adaptscan-v5r1  rc6-delaystart-v5r4
User 204146.61 197898.85 203957.74 203331.16 203747.52 203740.33
System 314.90 6106.94 444.09 1278.71 703.78 688.21
Elapsed 5029.18 5041.34 5009.46 5022.41 5024.73 5021.80

Note the system CPU usage. numacore is using 9 times more system CPU
than balancenuma is and 4 times more than autonuma (usual disclaimer
about threads).

MMTests Statistics: vmstat
3.7.0 3.7.0 3.7.0 3.7.0 3.7.0 3.7.0
rc6-stats-v5r1  rc6-numacore-20121123  rc6-autonuma-v28fastr4  rc6-thpmigrate-v5r1  rc6-adaptscan-v5r1  rc6-delaystart-v5r4
Page Ins 164712 164556 160492 164020 160552 164364
Page Outs 509132 236136 430444 511088 471208 252540
Swap Ins 0 0 0 0 0 0
Swap Outs 0 0 0 0 0 0
Direct pages scanned 0 0 0 0 0 0
Kswapd pages scanned 0 0 0 0 0 0
Kswapd pages reclaimed 0 0 0 0 0 0
Direct pages reclaimed 0 0 0 0 0 0
Kswapd efficiency 100% 100% 100% 100% 100% 100%
Kswapd velocity 0.000 0.000 0.000 0.000 0.000 0.000
Direct efficiency 100% 100% 100% 100% 100% 100%
Direct velocity 0.000 0.000 0.000 0.000 0.000 0.000
Percentage direct scans 0% 0% 0% 0% 0% 0%
Page writes by reclaim 0 0 0 0 0 0
Page writes file 0 0 0 0 0 0
Page writes anon 0 0 0 0 0 0
Page reclaim immediate 0 0 0 0 0 0
Page rescued immediate 0 0 0 0 0 0
Slabs scanned 0 0 0 0 0 0
Direct inode steals 0 0 0 0 0 0
Kswapd inode steals 0 0 0 0 0 0
Kswapd skipped wait 0 0 0 0 0 0
THP fault alloc 105761 91276 94593 111724 106169 99366
THP collapse alloc 114 111 1059 119 114 115
THP splits 605 379 575 517 570 592
THP fault fallback 0 0 0 0 0 0
THP collapse fail 0 0 0 0 0 0
Compaction stalls 0 0 0 0 0 0
Compaction success 0 0 0 0 0 0
Compaction failures 0 0 0 0 0 0
Page migrate success 0 0 0 1031293 476756 398109
Page migrate failure 0 0 0 0 0 0
Compaction pages isolated 0 0 0 0 0 0
Compaction migrate scanned 0 0 0 0 0 0
Compaction free scanned 0 0 0 0 0 0
Compaction cost 0 0 0 1070 494 413
NUMA PTE updates 0 0 0 1089136813 514718304 515300823
NUMA hint faults 0 0 0 9147497 4661092 4580385
NUMA hint local faults 0 0 0 3005415 1332898 1599021
NUMA pages migrated 0 0 0 1031293 476756 398109
AutoNUMA cost 0 0 0 53381 26917 26516

The main takeaways here is that there were THP allocations and all the
trees split THPs at roughly the same rate overall. Migration stats are
not available for numacore or autonuma and the migration stats available
for balancenuma here are not reliable because it's not accounting for THP
properly. This is fixed, but not in the V5 tree released.


SPECJBB: Multi JVMs (one per node, 4 nodes), THP is disabled
3.7.0 3.7.0 3.7.0 3.7.0 3.7.0 3.7.0
rc6-stats-v5r1 rc6-numacore-20121123 rc6-autonuma-v28fastr4 rc6-thpmigrate-v5r1 rc6-adaptscan-v5r1 rc6-delaystart-v5r4
Mean 1 25269.25 ( 0.00%) 21623.50 (-14.43%) 25937.75 ( 2.65%) 25138.00 ( -0.52%) 25539.25 ( 1.07%) 25193.00 ( -0.30%)
Mean 2 53467.00 ( 0.00%) 38412.00 (-28.16%) 56598.75 ( 5.86%) 50813.00 ( -4.96%) 52803.50 ( -1.24%) 52637.50 ( -1.55%)
Mean 3 77112.50 ( 0.00%) 57653.25 (-25.23%) 83762.25 ( 8.62%) 75274.25 ( -2.38%) 76097.00 ( -1.32%) 76324.25 ( -1.02%)
Mean 4 99928.75 ( 0.00%) 68468.50 (-31.48%) 108700.75 ( 8.78%) 97444.75 ( -2.49%) 99426.75 ( -0.50%) 99767.25 ( -0.16%)
Mean 5 119616.75 ( 0.00%) 77222.25 (-35.44%) 132572.75 ( 10.83%) 117350.00 ( -1.90%) 118417.25 ( -1.00%) 118298.50 ( -1.10%)
Mean 6 133944.75 ( 0.00%) 89222.75 (-33.39%) 154110.25 ( 15.06%) 133565.75 ( -0.28%) 135268.75 ( 0.99%) 137512.50 ( 2.66%)
Mean 7 137063.00 ( 0.00%) 94944.25 (-30.73%) 159535.25 ( 16.40%) 136744.75 ( -0.23%) 139218.25 ( 1.57%) 138919.25 ( 1.35%)
Mean 8 130814.25 ( 0.00%) 98367.25 (-24.80%) 162045.75 ( 23.87%) 137088.25 ( 4.80%) 139649.50 ( 6.75%) 138273.00 ( 5.70%)
Mean 9 124815.00 ( 0.00%) 99183.50 (-20.54%) 162337.75 ( 30.06%) 135275.50 ( 8.38%) 137494.50 ( 10.16%) 137386.25 ( 10.07%)
Mean 10 123741.00 ( 0.00%) 91926.25 (-25.71%) 158733.00 ( 28.28%) 131418.00 ( 6.20%) 132662.00 ( 7.21%) 132379.25 ( 6.98%)
Mean 11 116966.25 ( 0.00%) 95283.00 (-18.54%) 155065.50 ( 32.57%) 125246.00 ( 7.08%) 124420.25 ( 6.37%) 128132.00 ( 9.55%)
Mean 12 106682.00 ( 0.00%) 92286.25 (-13.49%) 149946.25 ( 40.55%) 118489.50 ( 11.07%) 119624.25 ( 12.13%) 121050.75 ( 13.47%)
Mean 13 106395.00 ( 0.00%) 103168.75 ( -3.03%) 146355.50 ( 37.56%) 118143.75 ( 11.04%) 116799.25 ( 9.78%) 121032.25 ( 13.76%)
Mean 14 104384.25 ( 0.00%) 105417.75 ( 0.99%) 145206.50 ( 39.11%) 119562.75 ( 14.54%) 117898.75 ( 12.95%) 114255.25 ( 9.46%)
Mean 15 103699.00 ( 0.00%) 103878.75 ( 0.17%) 142139.75 ( 37.07%) 115845.50 ( 11.71%) 117527.25 ( 13.33%) 109329.50 ( 5.43%)
Mean 16 100955.00 ( 0.00%) 103582.50 ( 2.60%) 139864.00 ( 38.54%) 113216.75 ( 12.15%) 114046.50 ( 12.97%) 108669.75 ( 7.64%)
Mean 17 99528.25 ( 0.00%) 101783.25 ( 2.27%) 138544.50 ( 39.20%) 112736.50 ( 13.27%) 115917.00 ( 16.47%) 113464.50 ( 14.00%)
Mean 18 97694.00 ( 0.00%) 99978.75 ( 2.34%) 138034.00 ( 41.29%) 108930.00 ( 11.50%) 114137.50 ( 16.83%) 114161.25 ( 16.86%)
Stddev 1 898.91 ( 0.00%) 754.70 ( 16.04%) 815.97 ( 9.23%) 786.81 ( 12.47%) 756.10 ( 15.89%) 1061.69 (-18.11%)
Stddev 2 676.51 ( 0.00%) 2726.62 (-303.04%) 946.10 (-39.85%) 1591.35 (-135.23%) 968.21 (-43.12%) 919.08 (-35.86%)
Stddev 3 629.58 ( 0.00%) 1975.98 (-213.86%) 1403.79 (-122.97%) 291.72 ( 53.66%) 1181.68 (-87.69%) 701.90 (-11.49%)
Stddev 4 363.04 ( 0.00%) 2867.55 (-689.87%) 1810.59 (-398.73%) 1288.56 (-254.94%) 1757.87 (-384.21%) 2050.94 (-464.94%)
Stddev 5 437.02 ( 0.00%) 1159.08 (-165.22%) 2352.89 (-438.39%) 1148.94 (-162.90%) 1294.70 (-196.26%) 861.14 (-97.05%)
Stddev 6 1484.12 ( 0.00%) 1777.97 (-19.80%) 1045.24 ( 29.57%) 860.24 ( 42.04%) 1703.57 (-14.79%) 1367.56 ( 7.85%)
Stddev 7 3856.79 ( 0.00%) 857.26 ( 77.77%) 1369.61 ( 64.49%) 1517.99 ( 60.64%) 2676.34 ( 30.61%) 1818.15 ( 52.86%)
Stddev 8 4910.41 ( 0.00%) 2751.82 ( 43.96%) 1765.69 ( 64.04%) 5022.25 ( -2.28%) 3113.14 ( 36.60%) 3958.06 ( 19.39%)
Stddev 9 2107.95 ( 0.00%) 2348.33 (-11.40%) 1764.06 ( 16.31%) 2932.34 (-39.11%) 6568.79 (-211.62%) 7450.20 (-253.43%)
Stddev 10 2012.98 ( 0.00%) 1332.65 ( 33.80%) 3297.73 (-63.82%) 4649.56 (-130.98%) 2703.19 (-34.29%) 4193.34 (-108.31%)
Stddev 11 5263.81 ( 0.00%) 3810.66 ( 27.61%) 5676.52 ( -7.84%) 1647.81 ( 68.70%) 4683.05 ( 11.03%) 3702.45 ( 29.66%)
Stddev 12 4316.09 ( 0.00%) 731.69 ( 83.05%) 9685.19 (-124.40%) 2202.13 ( 48.98%) 2520.73 ( 41.60%) 3572.75 ( 17.22%)
Stddev 13 4116.97 ( 0.00%) 4217.04 ( -2.43%) 9249.57 (-124.67%) 3042.07 ( 26.11%) 1705.18 ( 58.58%) 464.36 ( 88.72%)
Stddev 14 4711.12 ( 0.00%) 925.12 ( 80.36%) 10672.49 (-126.54%) 1597.01 ( 66.10%) 1983.88 ( 57.89%) 1513.32 ( 67.88%)
Stddev 15 4582.30 ( 0.00%) 909.35 ( 80.16%) 11033.47 (-140.78%) 1966.56 ( 57.08%) 420.63 ( 90.82%) 1049.66 ( 77.09%)
Stddev 16 3805.96 ( 0.00%) 743.92 ( 80.45%) 10353.28 (-172.03%) 1493.18 ( 60.77%) 2524.84 ( 33.66%) 2030.46 ( 46.65%)
Stddev 17 4560.83 ( 0.00%) 1130.10 ( 75.22%) 9902.66 (-117.12%) 1709.65 ( 62.51%) 2449.37 ( 46.30%) 1259.00 ( 72.40%)
Stddev 18 4503.57 ( 0.00%) 1418.91 ( 68.49%) 12143.74 (-169.65%) 1334.37 ( 70.37%) 1693.93 ( 62.39%) 975.71 ( 78.33%)
TPut 1 101077.00 ( 0.00%) 86494.00 (-14.43%) 103751.00 ( 2.65%) 100552.00 ( -0.52%) 102157.00 ( 1.07%) 100772.00 ( -0.30%)
TPut 2 213868.00 ( 0.00%) 153648.00 (-28.16%) 226395.00 ( 5.86%) 203252.00 ( -4.96%) 211214.00 ( -1.24%) 210550.00 ( -1.55%)
TPut 3 308450.00 ( 0.00%) 230613.00 (-25.23%) 335049.00 ( 8.62%) 301097.00 ( -2.38%) 304388.00 ( -1.32%) 305297.00 ( -1.02%)
TPut 4 399715.00 ( 0.00%) 273874.00 (-31.48%) 434803.00 ( 8.78%) 389779.00 ( -2.49%) 397707.00 ( -0.50%) 399069.00 ( -0.16%)
TPut 5 478467.00 ( 0.00%) 308889.00 (-35.44%) 530291.00 ( 10.83%) 469400.00 ( -1.90%) 473669.00 ( -1.00%) 473194.00 ( -1.10%)
TPut 6 535779.00 ( 0.00%) 356891.00 (-33.39%) 616441.00 ( 15.06%) 534263.00 ( -0.28%) 541075.00 ( 0.99%) 550050.00 ( 2.66%)
TPut 7 548252.00 ( 0.00%) 379777.00 (-30.73%) 638141.00 ( 16.40%) 546979.00 ( -0.23%) 556873.00 ( 1.57%) 555677.00 ( 1.35%)
TPut 8 523257.00 ( 0.00%) 393469.00 (-24.80%) 648183.00 ( 23.87%) 548353.00 ( 4.80%) 558598.00 ( 6.75%) 553092.00 ( 5.70%)
TPut 9 499260.00 ( 0.00%) 396734.00 (-20.54%) 649351.00 ( 30.06%) 541102.00 ( 8.38%) 549978.00 ( 10.16%) 549545.00 ( 10.07%)
TPut 10 494964.00 ( 0.00%) 367705.00 (-25.71%) 634932.00 ( 28.28%) 525672.00 ( 6.20%) 530648.00 ( 7.21%) 529517.00 ( 6.98%)
TPut 11 467865.00 ( 0.00%) 381132.00 (-18.54%) 620262.00 ( 32.57%) 500984.00 ( 7.08%) 497681.00 ( 6.37%) 512528.00 ( 9.55%)
TPut 12 426728.00 ( 0.00%) 369145.00 (-13.49%) 599785.00 ( 40.55%) 473958.00 ( 11.07%) 478497.00 ( 12.13%) 484203.00 ( 13.47%)
TPut 13 425580.00 ( 0.00%) 412675.00 ( -3.03%) 585422.00 ( 37.56%) 472575.00 ( 11.04%) 467197.00 ( 9.78%) 484129.00 ( 13.76%)
TPut 14 417537.00 ( 0.00%) 421671.00 ( 0.99%) 580826.00 ( 39.11%) 478251.00 ( 14.54%) 471595.00 ( 12.95%) 457021.00 ( 9.46%)
TPut 15 414796.00 ( 0.00%) 415515.00 ( 0.17%) 568559.00 ( 37.07%) 463382.00 ( 11.71%) 470109.00 ( 13.33%) 437318.00 ( 5.43%)
TPut 16 403820.00 ( 0.00%) 414330.00 ( 2.60%) 559456.00 ( 38.54%) 452867.00 ( 12.15%) 456186.00 ( 12.97%) 434679.00 ( 7.64%)
TPut 17 398113.00 ( 0.00%) 407133.00 ( 2.27%) 554178.00 ( 39.20%) 450946.00 ( 13.27%) 463668.00 ( 16.47%) 453858.00 ( 14.00%)
TPut 18 390776.00 ( 0.00%) 399915.00 ( 2.34%) 552136.00 ( 41.29%) 435720.00 ( 11.50%) 456550.00 ( 16.83%) 456645.00 ( 16.86%)

numacore regresses badly without THP on multi JVM configurations. Note
that once again it improves as the number of warehouses increases. SpecJBB
reports based on peaks so this will be missed if only the peak figures
are quoted in other benchmark reports.

autonuma again does pretty well although its variance between JVMs is nuts.

Without THP, balancenuma shows small regressions for small numbers of
warehouses but recovers to show decent performance gains. Note that the gains
vary a lot between warehouses because it's completely at the mercy of the
default scheduler decisions which are getting no hints about NUMA placement.

SPECJBB PEAKS
3.7.0 3.7.0 3.7.0 3.7.0 3.7.0 3.7.0
rc6-stats-v5r1 rc6-numacore-20121123 rc6-autonuma-v28fastr4 rc6-thpmigrate-v5r1 rc6-adaptscan-v5r1 rc6-delaystart-v5r4
Expctd Warehouse 12.00 ( 0.00%) 12.00 ( 0.00%) 12.00 ( 0.00%) 12.00 ( 0.00%) 12.00 ( 0.00%) 12.00 ( 0.00%)
Expctd Peak Bops 426728.00 ( 0.00%) 369145.00 (-13.49%) 599785.00 ( 40.55%) 473958.00 ( 11.07%) 478497.00 ( 12.13%) 484203.00 ( 13.47%)
Actual Warehouse 7.00 ( 0.00%) 14.00 (100.00%) 9.00 ( 28.57%) 8.00 ( 14.29%) 8.00 ( 14.29%) 7.00 ( 0.00%)
Actual Peak Bops 548252.00 ( 0.00%) 421671.00 (-23.09%) 649351.00 ( 18.44%) 548353.00 ( 0.02%) 558598.00 ( 1.89%) 555677.00 ( 1.35%)
SpecJBB Bops 221334.00 ( 0.00%) 218491.00 ( -1.28%) 307720.00 ( 39.03%) 248285.00 ( 12.18%) 251062.00 ( 13.43%) 246759.00 ( 11.49%)
SpecJBB Bops/JVM 55334.00 ( 0.00%) 54623.00 ( -1.28%) 76930.00 ( 39.03%) 62071.00 ( 12.18%) 62766.00 ( 13.43%) 61690.00 ( 11.49%)

numacore regresses from the peak by 23.09% and the specjbb overall score is down 1.28%.

autonuma does well with a 18.44% gain on the peak and 39.03% overall.

balancenuma does reasonably well - 1.35% gain at the peak and 11.49%
gain overall.

MMTests Statistics: duration
3.7.0 3.7.0 3.7.0 3.7.0 3.7.0 3.7.0
rc6-stats-v5r1  rc6-numacore-20121123  rc6-autonuma-v28fastr4  rc6-thpmigrate-v5r1  rc6-adaptscan-v5r1  rc6-delaystart-v5r4
User 203906.38 167709.64 203858.75 200055.62 202076.09 201985.74
System 577.16 31263.34 692.24 4114.76 2129.71 2177.70
Elapsed 5030.84 5067.85 5009.06 5019.25 5026.83 5017.79

numacore's system CPU usage is nuts.

autonuma's is OK (kernel threads, blah blah).

balancenuma's is higher than I'd like. I want to describe it as "not crazy"
but it probably is to everybody else.

MMTests Statistics: vmstat
3.7.0 3.7.0 3.7.0 3.7.0 3.7.0 3.7.0
rc6-stats-v5r1  rc6-numacore-20121123  rc6-autonuma-v28fastr4  rc6-thpmigrate-v5r1  rc6-adaptscan-v5r1  rc6-delaystart-v5r4
Page Ins 157624 164396 165024 163492 164776 163348
Page Outs 322264 391416 271880 491668 401644 523684
Swap Ins 0 0 0 0 0 0
Swap Outs 0 0 0 0 0 0
Direct pages scanned 0 0 0 0 0 0
Kswapd pages scanned 0 0 0 0 0 0
Kswapd pages reclaimed 0 0 0 0 0 0
Direct pages reclaimed 0 0 0 0 0 0
Kswapd efficiency 100% 100% 100% 100% 100% 100%
Kswapd velocity 0.000 0.000 0.000 0.000 0.000 0.000
Direct efficiency 100% 100% 100% 100% 100% 100%
Direct velocity 0.000 0.000 0.000 0.000 0.000 0.000
Percentage direct scans 0% 0% 0% 0% 0% 0%
Page writes by reclaim 0 0 0 0 0 0
Page writes file 0 0 0 0 0 0
Page writes anon 0 0 0 0 0 0
Page reclaim immediate 0 0 0 0 0 0
Page rescued immediate 0 0 0 0 0 0
Slabs scanned 0 0 0 0 0 0
Direct inode steals 0 0 0 0 0 0
Kswapd inode steals 0 0 0 0 0 0
Kswapd skipped wait 0 0 0 0 0 0
THP fault alloc 2 2 3 2 1 3
THP collapse alloc 0 0 9 0 0 5
THP splits 0 0 0 0 0 0
THP fault fallback 0 0 0 0 0 0
THP collapse fail 0 0 0 0 0 0
Compaction stalls 0 0 0 0 0 0
Compaction success 0 0 0 0 0 0
Compaction failures 0 0 0 0 0 0
Page migrate success 0 0 0 100618401 47601498 49370903
Page migrate failure 0 0 0 0 0 0
Compaction pages isolated 0 0 0 0 0 0
Compaction migrate scanned 0 0 0 0 0 0
Compaction free scanned 0 0 0 0 0 0
Compaction cost 0 0 0 104441 49410 51246
NUMA PTE updates 0 0 0 783430956 381926529 389134805
NUMA hint faults 0 0 0 730273702 352415076 360742428
NUMA hint local faults 0 0 0 191790656 92208827 93522412
NUMA pages migrated 0 0 0 100618401 47601498 49370903
AutoNUMA cost 0 0 0 3658764 1765653 1807374

First take-away is the lack of THP activity.

Here the stats balancenuma reports are useful because we're only dealing
with base pages. balancenuma migrates 38MB/second which is really high. Note
what the scan rate adaption did to that figure. Without scan rate adaption
it's at 78MB/second on average which is nuts. Average migration rate is
something we should keep an eye on.
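
(For anyone wanting to check that arithmetic: from the vmstat figures above,
delaystart migrated 49,370,903 base pages over 5017.79 seconds of elapsed
time, i.e. 49,370,903 * 4KB / 5017.79s is roughly 38MB/second, while
thpmigrate migrated 100,618,401 pages over 5019.25 seconds, or roughly
78MB/second.)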

From here, we're onto the single JVM configuration. I suspect
this is tested much more commonly but note that it behaves very
differently to the multi JVM configuration as explained by Andrea
(http://choon.net/forum/read.php?21,1599976,page=4).

A concern with the single JVM results as reported here is the maximum
number of warehouses. In the Multi JVM configuration, the expected peak
was 12 warehouses so I ran up to 18 so that the tests could complete in a
reasonable amount of time. The expected peak for a single JVM is 48 (the
number of CPUs) but the configuration file was derived from the multi JVM
configuration so it was restricted to running up to 18 warehouses. Again,
the reason was so it would complete in a reasonable amount of time but
specjbb does not give a score for this type of configuration and I am
only reporting on the 1-18 warehouses it ran for. I've reconfigured the
4 specjbb configs to run a full config and it'll run over the weekend.

SPECJBB: Single JVMs (one per node, 4 nodes), THP is enabled

SPECJBB BOPS
3.7.0 3.7.0 3.7.0 3.7.0 3.7.0 3.7.0
rc6-stats-v5r1 rc6-numacore-20121123 rc6-autonuma-v28fastr4 rc6-thpmigrate-v5r1 rc6-adaptscan-v5r1 rc6-delaystart-v5r4
TPut 1 26802.00 ( 0.00%) 22808.00 (-14.90%) 24482.00 ( -8.66%) 25723.00 ( -4.03%) 24387.00 ( -9.01%) 25940.00 ( -3.22%)
TPut 2 57720.00 ( 0.00%) 51245.00 (-11.22%) 55018.00 ( -4.68%) 55498.00 ( -3.85%) 55259.00 ( -4.26%) 55581.00 ( -3.71%)
TPut 3 86940.00 ( 0.00%) 79172.00 ( -8.93%) 87705.00 ( 0.88%) 86101.00 ( -0.97%) 86894.00 ( -0.05%) 86875.00 ( -0.07%)
TPut 4 117203.00 ( 0.00%) 107315.00 ( -8.44%) 117382.00 ( 0.15%) 116282.00 ( -0.79%) 116322.00 ( -0.75%) 115263.00 ( -1.66%)
TPut 5 145375.00 ( 0.00%) 121178.00 (-16.64%) 145802.00 ( 0.29%) 142378.00 ( -2.06%) 144947.00 ( -0.29%) 144211.00 ( -0.80%)
TPut 6 169232.00 ( 0.00%) 157796.00 ( -6.76%) 173409.00 ( 2.47%) 171066.00 ( 1.08%) 173341.00 ( 2.43%) 169861.00 ( 0.37%)
TPut 7 195468.00 ( 0.00%) 169834.00 (-13.11%) 197201.00 ( 0.89%) 197536.00 ( 1.06%) 198347.00 ( 1.47%) 198047.00 ( 1.32%)
TPut 8 217863.00 ( 0.00%) 169975.00 (-21.98%) 222559.00 ( 2.16%) 224901.00 ( 3.23%) 226268.00 ( 3.86%) 218354.00 ( 0.23%)
TPut 9 240679.00 ( 0.00%) 197498.00 (-17.94%) 245997.00 ( 2.21%) 250022.00 ( 3.88%) 253838.00 ( 5.47%) 250264.00 ( 3.98%)
TPut 10 261454.00 ( 0.00%) 204909.00 (-21.63%) 269551.00 ( 3.10%) 275125.00 ( 5.23%) 274658.00 ( 5.05%) 274155.00 ( 4.86%)
TPut 11 281079.00 ( 0.00%) 230118.00 (-18.13%) 281588.00 ( 0.18%) 304383.00 ( 8.29%) 297198.00 ( 5.73%) 299131.00 ( 6.42%)
TPut 12 302007.00 ( 0.00%) 275511.00 ( -8.77%) 313281.00 ( 3.73%) 327826.00 ( 8.55%) 325324.00 ( 7.72%) 325372.00 ( 7.74%)
TPut 13 319139.00 ( 0.00%) 293501.00 ( -8.03%) 332581.00 ( 4.21%) 352389.00 ( 10.42%) 340169.00 ( 6.59%) 351215.00 ( 10.05%)
TPut 14 321069.00 ( 0.00%) 312088.00 ( -2.80%) 337911.00 ( 5.25%) 376198.00 ( 17.17%) 370669.00 ( 15.45%) 366491.00 ( 14.15%)
TPut 15 345851.00 ( 0.00%) 283856.00 (-17.93%) 369104.00 ( 6.72%) 389772.00 ( 12.70%) 392963.00 ( 13.62%) 389254.00 ( 12.55%)
TPut 16 346868.00 ( 0.00%) 317127.00 ( -8.57%) 380930.00 ( 9.82%) 420331.00 ( 21.18%) 412974.00 ( 19.06%) 408575.00 ( 17.79%)
TPut 17 357755.00 ( 0.00%) 349624.00 ( -2.27%) 387635.00 ( 8.35%) 441223.00 ( 23.33%) 426558.00 ( 19.23%) 435985.00 ( 21.87%)
TPut 18 357467.00 ( 0.00%) 360056.00 ( 0.72%) 399487.00 ( 11.75%) 464603.00 ( 29.97%) 442907.00 ( 23.90%) 453011.00 ( 26.73%)

numacore is not doing well here for low numbers of warehouses. However,
note that by 18 warehouses it had drawn level and the expected peak is 48
warehouses. The specjbb reported figure would be using the higher numbers
of warehouses. I'll run a full range over the weekend and report back. If
time permits, I'll also run a "monitors disabled" run in case the read of
numa_maps every 10 seconds is crippling it.

autonuma did reasonably well and was showing larger gains towards the 18
warehouse mark.

balancenuma regressed a little initially but was doing quite well by 18
warehouses.

SPECJBB PEAKS
3.7.0 3.7.0 3.7.0 3.7.0 3.7.0 3.7.0
rc6-stats-v5r1 rc6-numacore-20121123 rc6-autonuma-v28fastr4 rc6-thpmigrate-v5r1 rc6-adaptscan-v5r1 rc6-delaystart-v5r4
Expctd Warehouse 48.00 ( 0.00%) 48.00 ( 0.00%) 48.00 ( 0.00%) 48.00 ( 0.00%) 48.00 ( 0.00%) 48.00 ( 0.00%)
Expctd Peak Bops 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
Actual Warehouse 17.00 ( 0.00%) 18.00 ( 5.88%) 18.00 ( 5.88%) 18.00 ( 5.88%) 18.00 ( 5.88%) 18.00 ( 5.88%)
Actual Peak Bops 357755.00 ( 0.00%) 360056.00 ( 0.64%) 399487.00 ( 11.66%) 464603.00 ( 29.87%) 442907.00 ( 23.80%) 453011.00 ( 26.63%)
SpecJBB Bops 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
SpecJBB Bops/JVM 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)

Note that numacore's peak was 0.64% higher than the baseline and for a
higher number of warehouses so it was scaling better.

autonuma was 11.66% higher at the peak which was also at 18 warehouses.

balancenuma was at 26.63% and was still scaling at 18 warehouses.

The fact that the peak and maximum number of warehouses is the same
reinforces that this test needs to be rerun all the way up to 48 warehouses.

MMTests Statistics: duration
3.7.0 3.7.0 3.7.0 3.7.0 3.7.0 3.7.0
rc6-stats-v5r1  rc6-numacore-20121123  rc6-autonuma-v28fastr4  rc6-thpmigrate-v5r1  rc6-adaptscan-v5r1  rc6-delaystart-v5r4
User 10450.16 10006.88 10441.26 10421.00 10441.47 10447.30
System 115.84 549.28 107.70 167.83 129.14 142.34
Elapsed 1196.56 1228.13 1187.23 1196.37 1198.64 1198.75

numacore's system CPU usage is very high.

autonuma's is lower than baseline -- usual thread disclaimers.

balancenuma's system CPU usage is also a bit high, but it's not crazy.


MMTests Statistics: vmstat
3.7.0 3.7.0 3.7.0 3.7.0 3.7.0 3.7.0
rc6-stats-v5r1  rc6-numacore-20121123  rc6-autonuma-v28fastr4  rc6-thpmigrate-v5r1  rc6-adaptscan-v5r1  rc6-delaystart-v5r4
Page Ins 164228 164452 164436 163868 164440 164052
Page Outs 173972 132016 247080 257988 123724 255716
Swap Ins 0 0 0 0 0 0
Swap Outs 0 0 0 0 0 0
Direct pages scanned 0 0 0 0 0 0
Kswapd pages scanned 0 0 0 0 0 0
Kswapd pages reclaimed 0 0 0 0 0 0
Direct pages reclaimed 0 0 0 0 0 0
Kswapd efficiency 100% 100% 100% 100% 100% 100%
Kswapd velocity 0.000 0.000 0.000 0.000 0.000 0.000
Direct efficiency 100% 100% 100% 100% 100% 100%
Direct velocity 0.000 0.000 0.000 0.000 0.000 0.000
Percentage direct scans 0% 0% 0% 0% 0% 0%
Page writes by reclaim 0 0 0 0 0 0
Page writes file 0 0 0 0 0 0
Page writes anon 0 0 0 0 0 0
Page reclaim immediate 0 0 0 0 0 0
Page rescued immediate 0 0 0 0 0 0
Slabs scanned 0 0 0 0 0 0
Direct inode steals 0 0 0 0 0 0
Kswapd inode steals 0 0 0 0 0 0
Kswapd skipped wait 0 0 0 0 0 0
THP fault alloc 55438 46676 52240 48118 57618 53194
THP collapse alloc 56 8 323 54 28 19
THP splits 96 30 106 80 91 86
THP fault fallback 0 0 0 0 0 0
THP collapse fail 0 0 0 0 0 0
Compaction stalls 0 0 0 0 0 0
Compaction success 0 0 0 0 0 0
Compaction failures 0 0 0 0 0 0
Page migrate success 0 0 0 253855 111066 58659
Page migrate failure 0 0 0 0 0 0
Compaction pages isolated 0 0 0 0 0 0
Compaction migrate scanned 0 0 0 0 0 0
Compaction free scanned 0 0 0 0 0 0
Compaction cost 0 0 0 263 115 60
NUMA PTE updates 0 0 0 142021619 62920560 64394112
NUMA hint faults 0 0 0 2314850 1258884 1019745
NUMA hint local faults 0 0 0 1249300 756763 569808
NUMA pages migrated 0 0 0 253855 111066 58659
AutoNUMA cost 0 0 0 12573 6736 5550

THP was in use - collapses and splits in evidence.

For balancenuma, note how adaptscan affected the PTE scan rates. The
impact on the system CPU usage is obvious too -- fewer PTE scans means
fewer faults, fewer migrations etc. Obviously there needs to be enough
of these faults to actually do the NUMA balancing but there comes a point
where there are diminishing returns.
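
To make the scan-rate/fault/migration relationship concrete, here is a rough
sketch of what the periodic PTE scanner boils down to. This is a sketch only:
NUMA_SCAN_WINDOW, numa_scan_offset and the function name are illustrative
rather than the actual balancenuma code. Each pass marks a window of the
address space so that the next reference to those pages takes a hinting
fault, which is what feeds the fault and migration counters above:

    /*
     * Sketch only: a periodic NUMA PTE scanner. Each pass marks a window
     * of the address space so the next access faults and reports which
     * node touched the page. NUMA_SCAN_WINDOW and numa_scan_offset are
     * illustrative names, not the balancenuma implementation.
     */
    static void numa_scan_work(struct mm_struct *mm)
    {
            unsigned long start = mm->numa_scan_offset;
            unsigned long end = min(start + NUMA_SCAN_WINDOW, TASK_SIZE);
            struct vm_area_struct *vma;

            for (vma = find_vma(mm, start); vma && vma->vm_start < end;
                 vma = vma->vm_next) {
                    /* Make the PTEs fault on their next access */
                    change_prot_numa(vma, max(start, vma->vm_start),
                                     min(end, vma->vm_end));
            }

            /* Wrap around once the whole address space has been covered */
            mm->numa_scan_offset = (end >= TASK_SIZE) ? 0 : end;
    }

Slowing the scanner down, as adaptscan effectively does, directly caps how
many hinting faults and migrations can happen per second, which is why the
three columns move together in the table above.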

SPECJBB: Single JVMs (one per node, 4 nodes), THP is disabled

3.7.0 3.7.0 3.7.0 3.7.0 3.7.0 3.7.0
rc6-stats-v5r1 rc6-numacore-20121123 rc6-autonuma-v28fastr4 rc6-thpmigrate-v5r1 rc6-adaptscan-v5r1 rc6-delaystart-v5r4
TPut 1 20890.00 ( 0.00%) 18720.00 (-10.39%) 21127.00 ( 1.13%) 20376.00 ( -2.46%) 20806.00 ( -0.40%) 20698.00 ( -0.92%)
TPut 2 48259.00 ( 0.00%) 38121.00 (-21.01%) 47920.00 ( -0.70%) 47085.00 ( -2.43%) 48594.00 ( 0.69%) 48094.00 ( -0.34%)
TPut 3 73203.00 ( 0.00%) 60057.00 (-17.96%) 73630.00 ( 0.58%) 70241.00 ( -4.05%) 73418.00 ( 0.29%) 74016.00 ( 1.11%)
TPut 4 98694.00 ( 0.00%) 73669.00 (-25.36%) 98929.00 ( 0.24%) 96721.00 ( -2.00%) 96797.00 ( -1.92%) 97930.00 ( -0.77%)
TPut 5 122563.00 ( 0.00%) 98786.00 (-19.40%) 118969.00 ( -2.93%) 118045.00 ( -3.69%) 121553.00 ( -0.82%) 122781.00 ( 0.18%)
TPut 6 144095.00 ( 0.00%) 114485.00 (-20.55%) 145328.00 ( 0.86%) 141713.00 ( -1.65%) 142589.00 ( -1.05%) 143771.00 ( -0.22%)
TPut 7 166457.00 ( 0.00%) 112416.00 (-32.47%) 163503.00 ( -1.77%) 166971.00 ( 0.31%) 166788.00 ( 0.20%) 165188.00 ( -0.76%)
TPut 8 191067.00 ( 0.00%) 122996.00 (-35.63%) 189477.00 ( -0.83%) 183090.00 ( -4.17%) 187710.00 ( -1.76%) 192157.00 ( 0.57%)
TPut 9 210634.00 ( 0.00%) 141200.00 (-32.96%) 209639.00 ( -0.47%) 207968.00 ( -1.27%) 215216.00 ( 2.18%) 214222.00 ( 1.70%)
TPut 10 234121.00 ( 0.00%) 129508.00 (-44.68%) 231221.00 ( -1.24%) 221553.00 ( -5.37%) 219998.00 ( -6.03%) 227193.00 ( -2.96%)
TPut 11 257885.00 ( 0.00%) 131232.00 (-49.11%) 256568.00 ( -0.51%) 252734.00 ( -2.00%) 258433.00 ( 0.21%) 260534.00 ( 1.03%)
TPut 12 271751.00 ( 0.00%) 154763.00 (-43.05%) 277319.00 ( 2.05%) 277154.00 ( 1.99%) 265747.00 ( -2.21%) 262285.00 ( -3.48%)
TPut 13 297457.00 ( 0.00%) 119716.00 (-59.75%) 296068.00 ( -0.47%) 289716.00 ( -2.60%) 276527.00 ( -7.04%) 293199.00 ( -1.43%)
TPut 14 319074.00 ( 0.00%) 129730.00 (-59.34%) 311604.00 ( -2.34%) 308798.00 ( -3.22%) 316807.00 ( -0.71%) 275748.00 (-13.58%)
TPut 15 337859.00 ( 0.00%) 177494.00 (-47.47%) 329288.00 ( -2.54%) 300463.00 (-11.07%) 305116.00 ( -9.69%) 287814.00 (-14.81%)
TPut 16 356396.00 ( 0.00%) 145173.00 (-59.27%) 355616.00 ( -0.22%) 342598.00 ( -3.87%) 364077.00 ( 2.16%) 339649.00 ( -4.70%)
TPut 17 373925.00 ( 0.00%) 176956.00 (-52.68%) 368589.00 ( -1.43%) 360917.00 ( -3.48%) 366043.00 ( -2.11%) 345586.00 ( -7.58%)
TPut 18 388373.00 ( 0.00%) 150100.00 (-61.35%) 372873.00 ( -3.99%) 389062.00 ( 0.18%) 386779.00 ( -0.41%) 370871.00 ( -4.51%)

balancenuma suffered here. It is very likely that it was not able to handle
faults at the PMD level due to the lack of THP, and I would expect that the
pages within a PMD boundary are not on the same node, so pmd_numa is not
set. This results in its worst case of always having to deal with PTE
faults. Further, it must be migrating many or almost all of these because
the adaptscan patch made no difference. This is a worst-case scenario for
balancenuma. The scan rates later will indicate whether that was the case.
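
To illustrate why the lack of THP hurts so much here, consider a check of
this shape (a sketch of the idea only; the helper name is made up and this
is not the balancenuma implementation): a single PMD-level hinting fault is
only useful if the pages under that PMD agree on a node, otherwise the
handling falls back to individual PTE faults:

    /*
     * Sketch only: a PMD can only be marked for a single NUMA hinting
     * fault if the pages beneath it are all on the same node. The helper
     * name is illustrative.
     */
    static bool pmd_pages_on_one_node(pte_t *pte, unsigned long nr)
    {
            unsigned long i;
            int nid = -1;

            for (i = 0; i < nr; i++, pte++) {
                    struct page *page;

                    if (!pte_present(*pte))
                            continue;

                    page = pte_page(*pte);
                    if (nid == -1)
                            nid = page_to_nid(page);
                    else if (nid != page_to_nid(page))
                            return false;   /* mixed nodes: per-PTE faults */
            }
            return true;
    }

Without THP the base pages behind a PMD are allocated independently and end
up spread across nodes, so a check like this almost never passes and every
reference becomes a separate PTE fault -- the worst case described above.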

autonuma did ok in that it was roughly comparable with mainline. Small
regressions.

I do not know how to describe numacore's figures. Let's go with "not great".
Maybe it would have gotten better if it had run all the way up to 48
warehouses, or maybe the numa_maps reading is really kicking it harder than
it kicks autonuma or balancenuma. There is also the possibility that some
other patch in tip/master is causing the problems.

SPECJBB PEAKS
3.7.0 3.7.0 3.7.0 3.7.0 3.7.0 3.7.0
rc6-stats-v5r1 rc6-numacore-20121123 rc6-autonuma-v28fastr4 rc6-thpmigrate-v5r1 rc6-adaptscan-v5r1 rc6-delaystart-v5r4
Expctd Warehouse 48.00 ( 0.00%) 48.00 ( 0.00%) 48.00 ( 0.00%) 48.00 ( 0.00%) 48.00 ( 0.00%) 48.00 ( 0.00%)
Expctd Peak Bops 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
Actual Warehouse 18.00 ( 0.00%) 15.00 (-16.67%) 18.00 ( 0.00%) 18.00 ( 0.00%) 18.00 ( 0.00%) 18.00 ( 0.00%)
Actual Peak Bops 388373.00 ( 0.00%) 177494.00 (-54.30%) 372873.00 ( -3.99%) 389062.00 ( 0.18%) 386779.00 ( -0.41%) 370871.00 ( -4.51%)
SpecJBB Bops 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
SpecJBB Bops/JVM 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)

numacore regressed 54.30% at the actual peak of 15 warehouses which was
also fewer warehouses than the baseline kernel did.

autonuma and balancenuma both peaked at 18 warehouses (the maximum number
tested) so they were still scaling ok, but autonuma regressed 3.99% while
balancenuma regressed 4.51%.

MMTests Statistics: duration
3.7.0 3.7.0 3.7.0 3.7.0 3.7.0 3.7.0
rc6-stats-v5r1 rc6-numacore-20121123 rc6-autonuma-v28fastr4 rc6-thpmigrate-v5r1 rc6-adaptscan-v5r1 rc6-delaystart-v5r4
User 10405.85 7284.62 10826.33 10084.82 10134.62 10026.65
System 331.48 2505.16 432.62 506.52 538.50 529.03
Elapsed 1202.48 1242.71 1197.09 1204.03 1202.98 1201.74

numacore's system CPU usage was very high.

autonuma's and balancenuma's were both higher than I'd like.


MMTests Statistics: vmstat
3.7.0 3.7.0 3.7.0 3.7.0 3.7.0 3.7.0
rc6-stats-v5r1 rc6-numacore-20121123 rc6-autonuma-v28fastr4 rc6-thpmigrate-v5r1 rc6-adaptscan-v5r1 rc6-delaystart-v5r4
Page Ins 163780 164588 193572 163984 164068 164416
Page Outs 137692 130984 265672 230884 188836 117192
Swap Ins 0 0 0 0 0 0
Swap Outs 0 0 0 0 0 0
Direct pages scanned 0 0 0 0 0 0
Kswapd pages scanned 0 0 0 0 0 0
Kswapd pages reclaimed 0 0 0 0 0 0
Direct pages reclaimed 0 0 0 0 0 0
Kswapd efficiency 100% 100% 100% 100% 100% 100%
Kswapd velocity 0.000 0.000 0.000 0.000 0.000 0.000
Direct efficiency 100% 100% 100% 100% 100% 100%
Direct velocity 0.000 0.000 0.000 0.000 0.000 0.000
Percentage direct scans 0% 0% 0% 0% 0% 0%
Page writes by reclaim 0 0 0 0 0 0
Page writes file 0 0 0 0 0 0
Page writes anon 0 0 0 0 0 0
Page reclaim immediate 0 0 0 0 0 0
Page rescued immediate 0 0 0 0 0 0
Slabs scanned 0 0 0 0 0 0
Direct inode steals 0 0 0 0 0 0
Kswapd inode steals 0 0 0 0 0 0
Kswapd skipped wait 0 0 0 0 0 0
THP fault alloc 1 1 4 2 2 2
THP collapse alloc 0 0 12 0 0 0
THP splits 0 0 0 0 0 0
THP fault fallback 0 0 0 0 0 0
THP collapse fail 0 0 0 0 0 0
Compaction stalls 0 0 0 0 0 0
Compaction success 0 0 0 0 0 0
Compaction failures 0 0 0 0 0 0
Page migrate success 0 0 0 7816428 5725511 6869488
Page migrate failure 0 0 0 0 0 0
Compaction pages isolated 0 0 0 0 0 0
Compaction migrate scanned 0 0 0 0 0 0
Compaction free scanned 0 0 0 0 0 0
Compaction cost 0 0 0 8113 5943 7130
NUMA PTE updates 0 0 0 66123797 53516623 60445811
NUMA hint faults 0 0 0 63047742 51160357 58406746
NUMA hint local faults 0 0 0 18265709 14490652 16584428
NUMA pages migrated 0 0 0 7816428 5725511 6869488
AutoNUMA cost 0 0 0 315850 256285 292587

For balancenuma the scan rates are interesting. Note that adaptscan made
very little difference to the number of PTEs updated. This very strongly
implies that the scan rate is not being reduced because many of the NUMA
faults are resulting in a migration. This could be hit with a hammer by
always decreasing the scan rate on every fault, but it would be a really
blunt hammer.
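
For what it's worth, a less blunt version of that hammer would be feedback
along these lines (sketch only; the fields, helper and bounds are invented
for illustration and are not claimed to match the adaptscan patch): only
back the scan period off when the faults show the workload has mostly
settled, rather than slowing down unconditionally:

    /*
     * Sketch only: adapt the scan period based on how many of the recent
     * hinting faults actually led to a migration. All names and the
     * min/max bounds here are illustrative.
     */
    static void numa_adapt_scan_period(struct task_struct *p)
    {
            unsigned long faults = p->numa_faults_sample;
            unsigned long migrated = p->numa_migrations_sample;

            if (!faults)
                    return;

            if (migrated * 10 < faults) {
                    /* Mostly settled: scan less often */
                    p->numa_scan_period = min(p->numa_scan_period * 2,
                                              numa_scan_period_max);
            } else {
                    /* Still migrating heavily: keep scanning */
                    p->numa_scan_period = max(p->numa_scan_period / 2,
                                              numa_scan_period_min);
            }

            p->numa_faults_sample = 0;
            p->numa_migrations_sample = 0;
    }

In the no-THP specjbb case above the migrated/faults ratio stays high, so a
heuristic of this shape would correctly refuse to slow the scanner down --
which is exactly why the numbers cannot improve without better placement.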

As before, note that there was no THP activity because it was disabled.

Finally, the following are just rudimentary tests to check some basics.

KERNBENCH
3.7.0 3.7.0 3.7.0 3.7.0 3.7.0 3.7.0
rc6-stats-v5r1 rc6-numacore-20121123 rc6-autonuma-v28fastr4 rc6-thpmigrate-v5r1 rc6-adaptscan-v5r1 rc6-delaystart-v5r4
User min 1296.38 ( 0.00%) 1310.16 ( -1.06%) 1296.52 ( -0.01%) 1297.53 ( -0.09%) 1298.35 ( -0.15%) 1299.53 ( -0.24%)
User mean 1298.86 ( 0.00%) 1311.49 ( -0.97%) 1299.73 ( -0.07%) 1300.50 ( -0.13%) 1301.56 ( -0.21%) 1301.42 ( -0.20%)
User stddev 1.65 ( 0.00%) 0.90 ( 45.15%) 2.68 (-62.37%) 3.47 (-110.63%) 2.19 (-33.06%) 1.59 ( 3.45%)
User max 1301.52 ( 0.00%) 1312.87 ( -0.87%) 1303.09 ( -0.12%) 1306.88 ( -0.41%) 1304.60 ( -0.24%) 1304.05 ( -0.19%)
System min 118.74 ( 0.00%) 129.74 ( -9.26%) 122.34 ( -3.03%) 121.82 ( -2.59%) 121.21 ( -2.08%) 119.43 ( -0.58%)
System mean 119.34 ( 0.00%) 130.24 ( -9.14%) 123.20 ( -3.24%) 122.15 ( -2.35%) 121.52 ( -1.83%) 120.17 ( -0.70%)
System stddev 0.42 ( 0.00%) 0.49 (-14.52%) 0.56 (-30.96%) 0.25 ( 41.66%) 0.43 ( -0.96%) 0.56 (-31.84%)
System max 120.00 ( 0.00%) 131.07 ( -9.22%) 123.88 ( -3.23%) 122.53 ( -2.11%) 122.36 ( -1.97%) 120.83 ( -0.69%)
Elapsed min 40.42 ( 0.00%) 41.42 ( -2.47%) 40.55 ( -0.32%) 41.43 ( -2.50%) 40.66 ( -0.59%) 40.09 ( 0.82%)
Elapsed mean 41.60 ( 0.00%) 42.63 ( -2.48%) 41.65 ( -0.13%) 42.27 ( -1.62%) 41.57 ( 0.06%) 41.12 ( 1.13%)
Elapsed stddev 0.72 ( 0.00%) 0.82 (-13.62%) 0.80 (-10.77%) 0.65 ( 9.93%) 0.86 (-19.29%) 0.64 ( 11.92%)
Elapsed max 42.41 ( 0.00%) 43.90 ( -3.51%) 42.79 ( -0.90%) 43.03 ( -1.46%) 42.76 ( -0.83%) 41.87 ( 1.27%)
CPU min 3341.00 ( 0.00%) 3279.00 ( 1.86%) 3319.00 ( 0.66%) 3298.00 ( 1.29%) 3319.00 ( 0.66%) 3392.00 ( -1.53%)
CPU mean 3409.80 ( 0.00%) 3382.40 ( 0.80%) 3417.00 ( -0.21%) 3365.60 ( 1.30%) 3424.00 ( -0.42%) 3457.00 ( -1.38%)
CPU stddev 63.50 ( 0.00%) 66.38 ( -4.53%) 70.01 (-10.25%) 50.19 ( 20.97%) 74.58 (-17.45%) 56.25 ( 11.42%)
CPU max 3514.00 ( 0.00%) 3479.00 ( 1.00%) 3516.00 ( -0.06%) 3426.00 ( 2.50%) 3506.00 ( 0.23%) 3546.00 ( -0.91%)

numacore has improved a lot here. It only regressed 2.48%, which is an
improvement over earlier releases.

autonuma and balancenuma both show some system CPU overhead but averaged
over the multiple runs, it's not very obvious.


MMTests Statistics: duration
3.7.0 3.7.0 3.7.0 3.7.0 3.7.0 3.7.0
rc6-stats-v5r1 rc6-numacore-20121123 rc6-autonuma-v28fastr4 rc6-thpmigrate-v5r1 rc6-adaptscan-v5r1 rc6-delaystart-v5r4
User 7821.05 7900.01 7829.89 7837.23 7840.19 7835.43
System 735.84 802.86 758.93 753.98 749.44 740.47
Elapsed 298.72 305.17 298.52 300.67 296.84 296.20

System CPU overhead is a bit more obvious here. balancenuma adds 5ish
seconds (0.62%). autonuma adds around 23 seconds (3.04%). numacore adds
67 seconds (8.34%).

MMTests Statistics: vmstat
3.7.0 3.7.0 3.7.0 3.7.0 3.7.0 3.7.0
rc6-stats-v5r1 rc6-numacore-20121123 rc6-autonuma-v28fastr4 rc6-thpmigrate-v5r1 rc6-adaptscan-v5r1 rc6-delaystart-v5r4
Page Ins 156 0 28 148 8 16
Page Outs 1519504 1740760 1460708 1548820 1510256 1548792
Swap Ins 0 0 0 0 0 0
Swap Outs 0 0 0 0 0 0
Direct pages scanned 0 0 0 0 0 0
Kswapd pages scanned 0 0 0 0 0 0
Kswapd pages reclaimed 0 0 0 0 0 0
Direct pages reclaimed 0 0 0 0 0 0
Kswapd efficiency 100% 100% 100% 100% 100% 100%
Kswapd velocity 0.000 0.000 0.000 0.000 0.000 0.000
Direct efficiency 100% 100% 100% 100% 100% 100%
Direct velocity 0.000 0.000 0.000 0.000 0.000 0.000
Percentage direct scans 0% 0% 0% 0% 0% 0%
Page writes by reclaim 0 0 0 0 0 0
Page writes file 0 0 0 0 0 0
Page writes anon 0 0 0 0 0 0
Page reclaim immediate 0 0 0 0 0 0
Page rescued immediate 0 0 0 0 0 0
Slabs scanned 0 0 0 0 0 0
Direct inode steals 0 0 0 0 0 0
Kswapd inode steals 0 0 0 0 0 0
Kswapd skipped wait 0 0 0 0 0 0
THP fault alloc 323 351 365 374 378 316
THP collapse alloc 22 1 10071 30 7 28
THP splits 4 2 151 5 1 7
THP fault fallback 0 0 0 0 0 0
THP collapse fail 0 0 0 0 0 0
Compaction stalls 0 0 0 0 0 0
Compaction success 0 0 0 0 0 0
Compaction failures 0 0 0 0 0 0
Page migrate success 0 0 0 558483 50325 100470
Page migrate failure 0 0 0 0 0 0
Compaction pages isolated 0 0 0 0 0 0
Compaction migrate scanned 0 0 0 0 0 0
Compaction free scanned 0 0 0 0 0 0
Compaction cost 0 0 0 579 52 104
NUMA PTE updates 0 0 0 109735841 86018422 65125719
NUMA hint faults 0 0 0 68484623 53110294 40259527
NUMA hint local faults 0 0 0 65051361 50701491 37787066
NUMA pages migrated 0 0 0 558483 50325 100470
AutoNUMA cost 0 0 0 343201 266154 201755

And you can see where balancenuma's system CPU overhead is coming from.
Despite the fact that most of the processes are short-lived, they are still
living longer than 1 second and being scheduled on another node, which
triggers the PTE scanner.

Note how adaptscan affects the number of PTE updates as it reduces the scan rate.

Note too how delaystart reduces it further because PTE scanning is postponed
until the task is scheduled on a new node.
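
For reference, the delaystart idea amounts to something like the following
(sketch only; the field names are illustrative and this is not the patch
itself): remember which node the task last ran on and only arm the PTE
scanner once the task is seen running somewhere else:

    /*
     * Sketch only: defer NUMA PTE scanning until the task has actually
     * been scheduled on a different node. Field names are illustrative.
     */
    static void numa_tick_check(struct task_struct *p)
    {
            int nid = cpu_to_node(task_cpu(p));

            if (p->numa_last_nid == -1) {
                    /* First observation: just record where we are */
                    p->numa_last_nid = nid;
                    return;
            }

            if (nid == p->numa_last_nid)
                    return;         /* still node-local, nothing to do */

            p->numa_last_nid = nid;
            p->numa_scan_armed = 1; /* next scan window will update PTEs */
    }

That is consistent with the delaystart column above: processes that are
never moved off their node never arm the scanner, so the PTE update count
drops further.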

AIM9
3.7.0 3.7.0 3.7.0 3.7.0 3.7.0 3.7.0
rc6-stats-v5r1 rc6-numacore-20121123 rc6-autonuma-v28fastr4 rc6-thpmigrate-v5r1 rc6-adaptscan-v5r1 rc6-delaystart-v5r4
Min page_test 337620.00 ( 0.00%) 382584.94 ( 13.32%) 274380.00 (-18.73%) 386013.33 ( 14.33%) 367068.62 ( 8.72%) 389186.67 ( 15.27%)
Min brk_test 3189200.00 ( 0.00%) 3130446.37 ( -1.84%) 3036200.00 ( -4.80%) 3261733.33 ( 2.27%) 2729513.66 (-14.41%) 3232266.67 ( 1.35%)
Min exec_test 263.16 ( 0.00%) 270.49 ( 2.79%) 275.97 ( 4.87%) 263.49 ( 0.13%) 262.32 ( -0.32%) 263.33 ( 0.06%)
Min fork_test 1489.36 ( 0.00%) 1533.86 ( 2.99%) 1754.15 ( 17.78%) 1503.66 ( 0.96%) 1500.66 ( 0.76%) 1484.69 ( -0.31%)
Mean page_test 376537.21 ( 0.00%) 407175.97 ( 8.14%) 369202.58 ( -1.95%) 408484.43 ( 8.48%) 401734.17 ( 6.69%) 419007.65 ( 11.28%)
Mean brk_test 3217657.48 ( 0.00%) 3223631.95 ( 0.19%) 3142007.48 ( -2.35%) 3301305.55 ( 2.60%) 2815992.93 (-12.48%) 3270913.07 ( 1.66%)
Mean exec_test 266.09 ( 0.00%) 275.19 ( 3.42%) 280.30 ( 5.34%) 268.35 ( 0.85%) 265.03 ( -0.40%) 268.45 ( 0.89%)
Mean fork_test 1521.05 ( 0.00%) 1569.47 ( 3.18%) 1844.55 ( 21.27%) 1526.62 ( 0.37%) 1531.56 ( 0.69%) 1529.75 ( 0.57%)
Stddev page_test 26593.06 ( 0.00%) 11327.52 (-57.40%) 35313.32 ( 32.79%) 11484.61 (-56.81%) 15098.72 (-43.22%) 12553.59 (-52.79%)
Stddev brk_test 14591.07 ( 0.00%) 51911.60 (255.78%) 42645.66 (192.27%) 22593.16 ( 54.84%) 41088.23 (181.60%) 26548.94 ( 81.95%)
Stddev exec_test 2.18 ( 0.00%) 2.83 ( 29.93%) 3.47 ( 59.06%) 2.90 ( 33.05%) 2.01 ( -7.84%) 3.42 ( 56.74%)
Stddev fork_test 22.76 ( 0.00%) 18.41 (-19.10%) 68.22 (199.75%) 20.41 (-10.34%) 20.20 (-11.23%) 28.56 ( 25.48%)
Max page_test 407320.00 ( 0.00%) 421940.00 ( 3.59%) 398026.67 ( -2.28%) 421940.00 ( 3.59%) 426755.50 ( 4.77%) 438146.67 ( 7.57%)
Max brk_test 3240200.00 ( 0.00%) 3321800.00 ( 2.52%) 3227733.33 ( -0.38%) 3337666.67 ( 3.01%) 2863933.33 (-11.61%) 3321852.10 ( 2.52%)
Max exec_test 269.97 ( 0.00%) 281.96 ( 4.44%) 287.81 ( 6.61%) 272.67 ( 1.00%) 268.82 ( -0.43%) 273.67 ( 1.37%)
Max fork_test 1554.82 ( 0.00%) 1601.33 ( 2.99%) 1926.91 ( 23.93%) 1565.62 ( 0.69%) 1559.39 ( 0.29%) 1583.50 ( 1.84%)

This has much improved in general.

page_test is looking generally good on average although the large variances
make it a bit unreliable. brk_test is looking ok too. autonuma regressed
but with the large variances it is within the noise. exec_test and fork_test
both look fine.

MMTests Statistics: duration
3.7.0 3.7.0 3.7.0 3.7.0 3.7.0 3.7.0
rc6-stats-v5r1 rc6-numacore-20121123 rc6-autonuma-v28fastr4 rc6-thpmigrate-v5r1 rc6-adaptscan-v5r1 rc6-delaystart-v5r4
User 0.14 2.83 2.87 2.73 2.79 2.80
System 0.24 0.72 0.75 0.72 0.71 0.71
Elapsed 721.97 724.55 724.52 724.36 725.08 724.54


System CPU overhead is noticeable again but it's not really a factor for this load.

MMTests Statistics: vmstat
3.7.0 3.7.0 3.7.0 3.7.0 3.7.0 3.7.0
rc6-stats-v5r1 rc6-numacore-20121123 rc6-autonuma-v28fastr4 rc6-thpmigrate-v5r1 rc6-adaptscan-v5r1 rc6-delaystart-v5r4
Page Ins 7252 7180 7176 7416 7672 7168
Page Outs 72684 74080 74844 73980 74472 74844
Swap Ins 0 0 0 0 0 0
Swap Outs 0 0 0 0 0 0
Direct pages scanned 0 0 0 0 0 0
Kswapd pages scanned 0 0 0 0 0 0
Kswapd pages reclaimed 0 0 0 0 0 0
Direct pages reclaimed 0 0 0 0 0 0
Kswapd efficiency 100% 100% 100% 100% 100% 100%
Kswapd velocity 0.000 0.000 0.000 0.000 0.000 0.000
Direct efficiency 100% 100% 100% 100% 100% 100%
Direct velocity 0.000 0.000 0.000 0.000 0.000 0.000
Percentage direct scans 0% 0% 0% 0% 0% 0%
Page writes by reclaim 0 0 0 0 0 0
Page writes file 0 0 0 0 0 0
Page writes anon 0 0 0 0 0 0
Page reclaim immediate 0 0 0 0 0 0
Page rescued immediate 0 0 0 0 0 0
Slabs scanned 0 0 0 0 0 0
Direct inode steals 0 0 0 0 0 0
Kswapd inode steals 0 0 0 0 0 0
Kswapd skipped wait 0 0 0 0 0 0
THP fault alloc 0 15 0 36 18 19
THP collapse alloc 0 0 0 0 0 2
THP splits 0 0 0 0 0 1
THP fault fallback 0 0 0 0 0 0
THP collapse fail 0 0 0 0 0 0
Compaction stalls 0 0 0 0 0 0
Compaction success 0 0 0 0 0 0
Compaction failures 0 0 0 0 0 0
Page migrate success 0 0 0 75 842 581
Page migrate failure 0 0 0 0 0 0
Compaction pages isolated 0 0 0 0 0 0
Compaction migrate scanned 0 0 0 0 0 0
Compaction free scanned 0 0 0 0 0 0
Compaction cost 0 0 0 0 0 0
NUMA PTE updates 0 0 0 40740052 41937943 1669018
NUMA hint faults 0 0 0 20273 17880 9628
NUMA hint local faults 0 0 0 15901 15562 7259
NUMA pages migrated 0 0 0 75 842 581
AutoNUMA cost 0 0 0 386 382 59

The evidence is there that the load is active enough to trigger automatic
numa migration activity even though the processes are all small. For
balancenuma, being scheduled on a new node is enough.

HACKBENCH PIPES
3.7.0 3.7.0 3.7.0 3.7.0 3.7.0 3.7.0
rc6-stats-v5r1 rc6-numacore-20121123 rc6-autonuma-v28fastr4 rc6-thpmigrate-v5r1 rc6-adaptscan-v5r1 rc6-delaystart-v5r4
Procs 1 0.0537 ( 0.00%) 0.0282 ( 47.58%) 0.0233 ( 56.73%) 0.0400 ( 25.56%) 0.0220 ( 59.06%) 0.0269 ( 50.02%)
Procs 4 0.0755 ( 0.00%) 0.0710 ( 5.96%) 0.0540 ( 28.48%) 0.0721 ( 4.54%) 0.0679 ( 10.07%) 0.0684 ( 9.36%)
Procs 8 0.0795 ( 0.00%) 0.0933 (-17.39%) 0.1032 (-29.87%) 0.0859 ( -8.08%) 0.0736 ( 7.35%) 0.0954 (-20.11%)
Procs 12 0.1002 ( 0.00%) 0.1069 ( -6.62%) 0.1760 (-75.56%) 0.1051 ( -4.88%) 0.0809 ( 19.26%) 0.0926 ( 7.68%)
Procs 16 0.1086 ( 0.00%) 0.1282 (-18.07%) 0.1695 (-56.08%) 0.1380 (-27.07%) 0.1055 ( 2.85%) 0.1239 (-14.13%)
Procs 20 0.1455 ( 0.00%) 0.1450 ( 0.37%) 0.3690 (-153.54%) 0.1276 ( 12.36%) 0.1588 ( -9.12%) 0.1464 ( -0.56%)
Procs 24 0.1548 ( 0.00%) 0.1638 ( -5.82%) 0.4010 (-158.99%) 0.1648 ( -6.41%) 0.1575 ( -1.69%) 0.1621 ( -4.69%)
Procs 28 0.1995 ( 0.00%) 0.2089 ( -4.72%) 0.3936 (-97.31%) 0.1829 ( 8.33%) 0.2057 ( -3.09%) 0.1942 ( 2.66%)
Procs 32 0.2030 ( 0.00%) 0.2352 (-15.86%) 0.3780 (-86.21%) 0.2189 ( -7.85%) 0.2011 ( 0.92%) 0.2207 ( -8.71%)
Procs 36 0.2323 ( 0.00%) 0.2502 ( -7.70%) 0.4813 (-107.14%) 0.2449 ( -5.41%) 0.2492 ( -7.27%) 0.2250 ( 3.16%)
Procs 40 0.2708 ( 0.00%) 0.2734 ( -0.97%) 0.6089 (-124.84%) 0.2832 ( -4.57%) 0.2822 ( -4.20%) 0.2658 ( 1.85%)

Everyone is a bit all over the place here and autonuma is consistent with the
last results in that it's hurting hackbench pipes results. With such large
differences on each thread number it's difficult to draw any conclusion
here. I'd have to dig into the data more and see what's happening but
system CPU can be a proxy measure so onwards...


MMTests Statistics: duration
3.7.0 3.7.0 3.7.0 3.7.0 3.7.0 3.7.0
rc6-stats-v5r1 rc6-numacore-20121123 rc6-autonuma-v28fastr4 rc6-thpmigrate-v5r1 rc6-adaptscan-v5r1 rc6-delaystart-v5r4
User 57.28 61.04 61.94 61.00 59.64 58.88
System 1849.51 2011.94 1873.74 1918.32 1864.12 1916.33
Elapsed 96.56 100.27 145.82 97.88 96.59 98.28

Yep, system CPU usage is up. It is highest in numacore, and balancenuma is
adding a chunk as well. autonuma appears to add less, but the usual
threading disclaimer applies.


MMTests Statistics: vmstat
3.7.0 3.7.0 3.7.0 3.7.0 3.7.0 3.7.0
rc6-stats-v5r1 rc6-numacore-20121123 rc6-autonuma-v28fastr4 rc6-thpmigrate-v5r1 rc6-adaptscan-v5r1 rc6-delaystart-v5r4
Page Ins 24 24 24 24 24 24
Page Outs 1668 1772 2284 1752 2072 1756
Swap Ins 0 0 0 0 0 0
Swap Outs 0 0 0 0 0 0
Direct pages scanned 0 0 0 0 0 0
Kswapd pages scanned 0 0 0 0 0 0
Kswapd pages reclaimed 0 0 0 0 0 0
Direct pages reclaimed 0 0 0 0 0 0
Kswapd efficiency 100% 100% 100% 100% 100% 100%
Kswapd velocity 0.000 0.000 0.000 0.000 0.000 0.000
Direct efficiency 100% 100% 100% 100% 100% 100%
Direct velocity 0.000 0.000 0.000 0.000 0.000 0.000
Percentage direct scans 0% 0% 0% 0% 0% 0%
Page writes by reclaim 0 0 0 0 0 0
Page writes file 0 0 0 0 0 0
Page writes anon 0 0 0 0 0 0
Page reclaim immediate 0 0 0 0 0 0
Page rescued immediate 0 0 0 0 0 0
Slabs scanned 0 0 0 0 0 0
Direct inode steals 0 0 0 0 0 0
Kswapd inode steals 0 0 0 0 0 0
Kswapd skipped wait 0 0 0 0 0 0
THP fault alloc 0 5 0 6 6 0
THP collapse alloc 0 0 0 2 0 5
THP splits 0 0 0 0 0 0
THP fault fallback 0 0 0 0 0 0
THP collapse fail 0 0 0 0 0 0
Compaction stalls 0 0 0 0 0 0
Compaction success 0 0 0 0 0 0
Compaction failures 0 0 0 0 0 0
Page migrate success 0 0 0 2 0 28
Page migrate failure 0 0 0 0 0 0
Compaction pages isolated 0 0 0 0 0 0
Compaction migrate scanned 0 0 0 0 0 0
Compaction free scanned 0 0 0 0 0 0
Compaction cost 0 0 0 0 0 0
NUMA PTE updates 0 0 0 54736 1061 42752
NUMA hint faults 0 0 0 2247 518 71
NUMA hint local faults 0 0 0 29 1 0
NUMA pages migrated 0 0 0 2 0 28
AutoNUMA cost 0 0 0 11 2 0

And here is the evidence again. balancenuma at least is triggering the
migration logic while running hackbench. It may be that as the thread
counts grow it simply becomes more likely that a task gets scheduled on
another node and the PTE scanner starts up even though the workload is not
memory intensive.

I could avoid firing the PTE scanner if the process's RSS is low, I guess,
but that feels hacky.
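
That hack would look roughly like this (sketch only; the threshold and
helper name are invented, which is part of why it feels arbitrary):

    /*
     * Sketch only: skip NUMA PTE scanning for tasks with a small RSS on
     * the theory that they have little to gain from migration. The
     * threshold is arbitrary, which is the problem.
     */
    #define NUMA_SCAN_RSS_MIN       (16UL << (20 - PAGE_SHIFT)) /* ~16MB */

    static bool worth_numa_scanning(struct mm_struct *mm)
    {
            return get_mm_rss(mm) >= NUMA_SCAN_RSS_MIN;
    }

It would quieten hackbench, but any fixed cut-off like this will also exempt
some workloads that genuinely benefit from placement, so it stays in the
"hacky" bucket.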

HACKBENCH SOCKETS
3.7.0 3.7.0 3.7.0 3.7.0 3.7.0 3.7.0
rc6-stats-v5r1 rc6-numacore-20121123 rc6-autonuma-v28fastr4 rc6-thpmigrate-v5r1 rc6-adaptscan-v5r1 rc6-delaystart-v5r4
Procs 1 0.0220 ( 0.00%) 0.0240 ( -9.09%) 0.0276 (-25.34%) 0.0228 ( -3.83%) 0.0282 (-28.18%) 0.0207 ( 6.11%)
Procs 4 0.0535 ( 0.00%) 0.0490 ( 8.35%) 0.0888 (-66.12%) 0.0467 ( 12.70%) 0.0442 ( 17.27%) 0.0494 ( 7.52%)
Procs 8 0.0716 ( 0.00%) 0.0726 ( -1.33%) 0.1665 (-132.54%) 0.0718 ( -0.25%) 0.0700 ( 2.19%) 0.0701 ( 2.09%)
Procs 12 0.1026 ( 0.00%) 0.0975 ( 4.99%) 0.1290 (-25.73%) 0.0981 ( 4.34%) 0.0946 ( 7.76%) 0.0967 ( 5.71%)
Procs 16 0.1272 ( 0.00%) 0.1268 ( 0.25%) 0.3193 (-151.05%) 0.1229 ( 3.35%) 0.1224 ( 3.78%) 0.1270 ( 0.11%)
Procs 20 0.1487 ( 0.00%) 0.1537 ( -3.40%) 0.1793 (-20.57%) 0.1550 ( -4.25%) 0.1519 ( -2.17%) 0.1579 ( -6.18%)
Procs 24 0.1794 ( 0.00%) 0.1797 ( -0.16%) 0.4423 (-146.55%) 0.1851 ( -3.19%) 0.1807 ( -0.71%) 0.1904 ( -6.15%)
Procs 28 0.2165 ( 0.00%) 0.2156 ( 0.44%) 0.5012 (-131.50%) 0.2147 ( 0.85%) 0.2126 ( 1.82%) 0.2194 ( -1.34%)
Procs 32 0.2344 ( 0.00%) 0.2458 ( -4.89%) 0.7008 (-199.00%) 0.2498 ( -6.60%) 0.2449 ( -4.50%) 0.2528 ( -7.86%)
Procs 36 0.2623 ( 0.00%) 0.2752 ( -4.92%) 0.7469 (-184.73%) 0.2852 ( -8.72%) 0.2762 ( -5.30%) 0.2826 ( -7.72%)
Procs 40 0.2921 ( 0.00%) 0.3030 ( -3.72%) 0.7753 (-165.46%) 0.3085 ( -5.61%) 0.3046 ( -4.28%) 0.3182 ( -8.94%)

Mix of gains and losses except for autonuma which takes a hammering.

MMTests Statistics: duration
3.7.0 3.7.0 3.7.0 3.7.0 3.7.0 3.7.0
rc6-stats-v5r1 rc6-numacore-20121123 rc6-autonuma-v28fastr4 rc6-thpmigrate-v5r1 rc6-adaptscan-v5r1 rc6-delaystart-v5r4
User 39.43 38.44 48.79 41.48 39.54 42.47
System 2249.41 2273.39 2678.90 2285.03 2218.08 2302.44
Elapsed 104.91 105.83 173.39 105.50 104.38 106.55

Less system CPU overhead from numacore here. autonuma adds a lot. balancenuma
is adding more than it should.


MMTests Statistics: vmstat
3.7.0 3.7.0 3.7.0 3.7.0 3.7.0 3.7.0
rc6-stats-v5r1 rc6-numacore-20121123 rc6-autonuma-v28fastr4 rc6-thpmigrate-v5r1 rc6-adaptscan-v5r1 rc6-delaystart-v5r4
Page Ins 4 4 4 4 4 4
Page Outs 1952 2104 2812 1796 1952 2264
Swap Ins 0 0 0 0 0 0
Swap Outs 0 0 0 0 0 0
Direct pages scanned 0 0 0 0 0 0
Kswapd pages scanned 0 0 0 0 0 0
Kswapd pages reclaimed 0 0 0 0 0 0
Direct pages reclaimed 0 0 0 0 0 0
Kswapd efficiency 100% 100% 100% 100% 100% 100%
Kswapd velocity 0.000 0.000 0.000 0.000 0.000 0.000
Direct efficiency 100% 100% 100% 100% 100% 100%
Direct velocity 0.000 0.000 0.000 0.000 0.000 0.000
Percentage direct scans 0% 0% 0% 0% 0% 0%
Page writes by reclaim 0 0 0 0 0 0
Page writes file 0 0 0 0 0 0
Page writes anon 0 0 0 0 0 0
Page reclaim immediate 0 0 0 0 0 0
Page rescued immediate 0 0 0 0 0 0
Slabs scanned 0 0 0 0 0 0
Direct inode steals 0 0 0 0 0 0
Kswapd inode steals 0 0 0 0 0 0
Kswapd skipped wait 0 0 0 0 0 0
THP fault alloc 0 0 0 6 0 0
THP collapse alloc 0 0 1 0 0 0
THP splits 0 0 0 0 0 0
THP fault fallback 0 0 0 0 0 0
THP collapse fail 0 0 0 0 0 0
Compaction stalls 0 0 0 0 0 0
Compaction success 0 0 0 0 0 0
Compaction failures 0 0 0 0 0 0
Page migrate success 0 0 0 328 513 19
Page migrate failure 0 0 0 0 0 0
Compaction pages isolated 0 0 0 0 0 0
Compaction migrate scanned 0 0 0 0 0 0
Compaction free scanned 0 0 0 0 0 0
Compaction cost 0 0 0 0 0 0
NUMA PTE updates 0 0 0 21522 22448 21376
NUMA hint faults 0 0 0 1082 546 52
NUMA hint local faults 0 0 0 217 0 31
NUMA pages migrated 0 0 0 328 513 19
AutoNUMA cost 0 0 0 5 2 0

Again the PTE scanners are in there. They will not help hackbench figures.

PAGE FAULT TEST
3.7.0 3.7.0 3.7.0 3.7.0 3.7.0 3.7.0
rc6-stats-v5r1 rc6-numacore-20121123 rc6-autonuma-v28fastr4 rc6-thpmigrate-v5r1 rc6-adaptscan-v5r1 rc6-delaystart-v5r4
System 1 8.0195 ( 0.00%) 8.2535 ( -2.92%) 8.0495 ( -0.37%) 37.7675 (-370.95%) 38.0265 (-374.18%) 7.9775 ( 0.52%)
System 2 8.0095 ( 0.00%) 8.0905 ( -1.01%) 8.1415 ( -1.65%) 12.0595 (-50.56%) 11.4145 (-42.51%) 7.9900 ( 0.24%)
System 3 8.1025 ( 0.00%) 8.1725 ( -0.86%) 8.3525 ( -3.09%) 9.7380 (-20.19%) 9.4905 (-17.13%) 8.1110 ( -0.10%)
System 4 8.1635 ( 0.00%) 8.2875 ( -1.52%) 8.5415 ( -4.63%) 8.7440 ( -7.11%) 8.6145 ( -5.52%) 8.1800 ( -0.20%)
System 5 8.4600 ( 0.00%) 8.5900 ( -1.54%) 8.8910 ( -5.09%) 8.8365 ( -4.45%) 8.6755 ( -2.55%) 8.5105 ( -0.60%)
System 6 8.7565 ( 0.00%) 8.8120 ( -0.63%) 9.3630 ( -6.93%) 8.9460 ( -2.16%) 8.8490 ( -1.06%) 8.7390 ( 0.20%)
System 7 8.7390 ( 0.00%) 8.8430 ( -1.19%) 9.9310 (-13.64%) 9.0680 ( -3.76%) 8.9600 ( -2.53%) 8.8300 ( -1.04%)
System 8 8.7700 ( 0.00%) 8.9110 ( -1.61%) 10.1445 (-15.67%) 9.0435 ( -3.12%) 8.8060 ( -0.41%) 8.7615 ( 0.10%)
System 9 9.3455 ( 0.00%) 9.3505 ( -0.05%) 10.5340 (-12.72%) 9.4765 ( -1.40%) 9.3955 ( -0.54%) 9.2860 ( 0.64%)
System 10 9.4195 ( 0.00%) 9.4780 ( -0.62%) 11.6035 (-23.19%) 9.6500 ( -2.45%) 9.5350 ( -1.23%) 9.4735 ( -0.57%)
System 11 9.5405 ( 0.00%) 9.6495 ( -1.14%) 12.8475 (-34.66%) 9.7370 ( -2.06%) 9.5995 ( -0.62%) 9.5835 ( -0.45%)
System 12 9.7035 ( 0.00%) 9.7470 ( -0.45%) 13.2560 (-36.61%) 9.8445 ( -1.45%) 9.7260 ( -0.23%) 9.5890 ( 1.18%)
System 13 10.2745 ( 0.00%) 10.2270 ( 0.46%) 13.5490 (-31.87%) 10.3840 ( -1.07%) 10.1880 ( 0.84%) 10.1480 ( 1.23%)
System 14 10.5405 ( 0.00%) 10.6135 ( -0.69%) 13.9225 (-32.09%) 10.6915 ( -1.43%) 10.5255 ( 0.14%) 10.5620 ( -0.20%)
System 15 10.7190 ( 0.00%) 10.8635 ( -1.35%) 15.0760 (-40.65%) 10.9380 ( -2.04%) 10.8190 ( -0.93%) 10.7040 ( 0.14%)
System 16 11.2575 ( 0.00%) 11.2750 ( -0.16%) 15.0995 (-34.13%) 11.3315 ( -0.66%) 11.2615 ( -0.04%) 11.2345 ( 0.20%)
System 17 11.8090 ( 0.00%) 12.0865 ( -2.35%) 16.1715 (-36.94%) 11.8925 ( -0.71%) 11.7655 ( 0.37%) 11.7585 ( 0.43%)
System 18 12.3910 ( 0.00%) 12.4270 ( -0.29%) 16.7410 (-35.11%) 12.4425 ( -0.42%) 12.4235 ( -0.26%) 12.3295 ( 0.50%)
System 19 12.7915 ( 0.00%) 12.8340 ( -0.33%) 16.7175 (-30.69%) 12.7980 ( -0.05%) 12.9825 ( -1.49%) 12.7980 ( -0.05%)
System 20 13.5870 ( 0.00%) 13.3100 ( 2.04%) 16.5590 (-21.87%) 13.2725 ( 2.31%) 13.1720 ( 3.05%) 13.1855 ( 2.96%)
System 21 13.9325 ( 0.00%) 13.9705 ( -0.27%) 16.9110 (-21.38%) 13.8975 ( 0.25%) 14.0360 ( -0.74%) 13.8760 ( 0.41%)
System 22 14.5810 ( 0.00%) 14.7345 ( -1.05%) 18.1160 (-24.24%) 14.7635 ( -1.25%) 14.4805 ( 0.69%) 14.4130 ( 1.15%)
System 23 15.0710 ( 0.00%) 15.1400 ( -0.46%) 18.3805 (-21.96%) 15.2020 ( -0.87%) 15.1100 ( -0.26%) 15.0385 ( 0.22%)
System 24 15.8815 ( 0.00%) 15.7120 ( 1.07%) 19.7195 (-24.17%) 15.6205 ( 1.64%) 15.5965 ( 1.79%) 15.5950 ( 1.80%)
System 25 16.1480 ( 0.00%) 16.6115 ( -2.87%) 19.5480 (-21.06%) 16.2305 ( -0.51%) 16.1775 ( -0.18%) 16.1510 ( -0.02%)
System 26 17.1075 ( 0.00%) 17.1015 ( 0.04%) 19.7100 (-15.21%) 17.0800 ( 0.16%) 16.8955 ( 1.24%) 16.7845 ( 1.89%)
System 27 17.3015 ( 0.00%) 17.4120 ( -0.64%) 20.2640 (-17.12%) 17.2615 ( 0.23%) 17.2430 ( 0.34%) 17.2895 ( 0.07%)
System 28 17.8750 ( 0.00%) 17.9675 ( -0.52%) 21.2030 (-18.62%) 17.7305 ( 0.81%) 17.7480 ( 0.71%) 17.7615 ( 0.63%)
System 29 18.5260 ( 0.00%) 18.8165 ( -1.57%) 20.4045 (-10.14%) 18.3895 ( 0.74%) 18.2980 ( 1.23%) 18.4480 ( 0.42%)
System 30 19.0865 ( 0.00%) 19.1865 ( -0.52%) 21.0970 (-10.53%) 18.9800 ( 0.56%) 18.8510 ( 1.23%) 19.0500 ( 0.19%)
System 31 19.8095 ( 0.00%) 19.7210 ( 0.45%) 22.8030 (-15.11%) 19.7365 ( 0.37%) 19.6370 ( 0.87%) 19.9115 ( -0.51%)
System 32 20.3360 ( 0.00%) 20.3510 ( -0.07%) 23.3780 (-14.96%) 20.2040 ( 0.65%) 20.0695 ( 1.31%) 20.2110 ( 0.61%)
System 33 21.0240 ( 0.00%) 21.0225 ( 0.01%) 23.3495 (-11.06%) 20.8200 ( 0.97%) 20.6455 ( 1.80%) 21.0125 ( 0.05%)
System 34 21.6065 ( 0.00%) 21.9710 ( -1.69%) 23.2650 ( -7.68%) 21.4115 ( 0.90%) 21.4230 ( 0.85%) 21.8570 ( -1.16%)
System 35 22.3005 ( 0.00%) 22.3190 ( -0.08%) 23.2305 ( -4.17%) 22.1695 ( 0.59%) 22.0695 ( 1.04%) 22.2485 ( 0.23%)
System 36 23.0245 ( 0.00%) 22.9430 ( 0.35%) 24.8930 ( -8.12%) 22.7685 ( 1.11%) 22.7385 ( 1.24%) 23.0900 ( -0.28%)
System 37 23.8225 ( 0.00%) 23.7100 ( 0.47%) 24.9290 ( -4.64%) 23.5425 ( 1.18%) 23.3270 ( 2.08%) 23.6795 ( 0.60%)
System 38 24.5015 ( 0.00%) 24.4780 ( 0.10%) 25.3145 ( -3.32%) 24.3460 ( 0.63%) 24.1105 ( 1.60%) 24.5430 ( -0.17%)
System 39 25.1855 ( 0.00%) 25.1445 ( 0.16%) 25.1985 ( -0.05%) 25.1355 ( 0.20%) 24.9305 ( 1.01%) 25.0000 ( 0.74%)
System 40 25.8990 ( 0.00%) 25.8310 ( 0.26%) 26.5205 ( -2.40%) 25.7115 ( 0.72%) 25.5310 ( 1.42%) 25.9605 ( -0.24%)
System 41 26.5585 ( 0.00%) 26.7045 ( -0.55%) 27.5060 ( -3.57%) 26.5825 ( -0.09%) 26.3515 ( 0.78%) 26.5835 ( -0.09%)
System 42 27.3840 ( 0.00%) 27.5735 ( -0.69%) 27.3995 ( -0.06%) 27.2475 ( 0.50%) 27.1680 ( 0.79%) 27.3810 ( 0.01%)
System 43 28.1595 ( 0.00%) 28.2515 ( -0.33%) 27.5285 ( 2.24%) 27.9805 ( 0.64%) 27.8795 ( 0.99%) 28.1255 ( 0.12%)
System 44 28.8460 ( 0.00%) 29.0390 ( -0.67%) 28.4580 ( 1.35%) 28.9385 ( -0.32%) 28.7750 ( 0.25%) 28.8655 ( -0.07%)
System 45 29.5430 ( 0.00%) 29.8280 ( -0.96%) 28.5270 ( 3.44%) 29.8165 ( -0.93%) 29.6105 ( -0.23%) 29.5655 ( -0.08%)
System 46 30.3290 ( 0.00%) 30.6420 ( -1.03%) 29.1955 ( 3.74%) 30.6235 ( -0.97%) 30.4205 ( -0.30%) 30.2640 ( 0.21%)
System 47 30.9365 ( 0.00%) 31.3360 ( -1.29%) 29.2915 ( 5.32%) 31.3365 ( -1.29%) 31.3660 ( -1.39%) 30.9300 ( 0.02%)
System 48 31.5680 ( 0.00%) 32.1220 ( -1.75%) 29.3805 ( 6.93%) 32.1925 ( -1.98%) 31.9820 ( -1.31%) 31.6180 ( -0.16%)

autonuma is showing a lot of system CPU overhead here. numacore and
balancenuma are ok. There are some blips, but they are small enough that
there's nothing to get excited over.

Elapsed 1 8.7170 ( 0.00%) 8.9585 ( -2.77%) 8.7485 ( -0.36%) 38.5375 (-342.10%) 38.8065 (-345.18%) 8.6755 ( 0.48%)
Elapsed 2 4.4075 ( 0.00%) 4.4345 ( -0.61%) 4.5320 ( -2.82%) 6.5940 (-49.61%) 6.1920 (-40.49%) 4.4090 ( -0.03%)
Elapsed 3 2.9785 ( 0.00%) 2.9990 ( -0.69%) 3.0945 ( -3.89%) 3.5820 (-20.26%) 3.4765 (-16.72%) 2.9840 ( -0.18%)
Elapsed 4 2.2530 ( 0.00%) 2.3010 ( -2.13%) 2.3845 ( -5.84%) 2.4400 ( -8.30%) 2.4045 ( -6.72%) 2.2675 ( -0.64%)
Elapsed 5 1.9070 ( 0.00%) 1.9315 ( -1.28%) 1.9885 ( -4.27%) 2.0180 ( -5.82%) 1.9725 ( -3.43%) 1.9195 ( -0.66%)
Elapsed 6 1.6490 ( 0.00%) 1.6705 ( -1.30%) 1.7470 ( -5.94%) 1.6695 ( -1.24%) 1.6575 ( -0.52%) 1.6385 ( 0.64%)
Elapsed 7 1.4235 ( 0.00%) 1.4385 ( -1.05%) 1.6090 (-13.03%) 1.4590 ( -2.49%) 1.4495 ( -1.83%) 1.4200 ( 0.25%)
Elapsed 8 1.2500 ( 0.00%) 1.2600 ( -0.80%) 1.4345 (-14.76%) 1.2650 ( -1.20%) 1.2340 ( 1.28%) 1.2345 ( 1.24%)
Elapsed 9 1.2090 ( 0.00%) 1.2125 ( -0.29%) 1.3355 (-10.46%) 1.2275 ( -1.53%) 1.2185 ( -0.79%) 1.1975 ( 0.95%)
Elapsed 10 1.0885 ( 0.00%) 1.0900 ( -0.14%) 1.3390 (-23.01%) 1.1195 ( -2.85%) 1.1110 ( -2.07%) 1.0985 ( -0.92%)
Elapsed 11 0.9970 ( 0.00%) 1.0220 ( -2.51%) 1.3575 (-36.16%) 1.0210 ( -2.41%) 1.0145 ( -1.76%) 1.0005 ( -0.35%)
Elapsed 12 0.9355 ( 0.00%) 0.9375 ( -0.21%) 1.3060 (-39.60%) 0.9505 ( -1.60%) 0.9390 ( -0.37%) 0.9205 ( 1.60%)
Elapsed 13 0.9345 ( 0.00%) 0.9320 ( 0.27%) 1.2940 (-38.47%) 0.9435 ( -0.96%) 0.9200 ( 1.55%) 0.9195 ( 1.61%)
Elapsed 14 0.8815 ( 0.00%) 0.8960 ( -1.64%) 1.2755 (-44.70%) 0.8955 ( -1.59%) 0.8780 ( 0.40%) 0.8860 ( -0.51%)
Elapsed 15 0.8175 ( 0.00%) 0.8375 ( -2.45%) 1.3655 (-67.03%) 0.8470 ( -3.61%) 0.8260 ( -1.04%) 0.8170 ( 0.06%)
Elapsed 16 0.8135 ( 0.00%) 0.8045 ( 1.11%) 1.3165 (-61.83%) 0.8130 ( 0.06%) 0.8040 ( 1.17%) 0.7970 ( 2.03%)
Elapsed 17 0.8375 ( 0.00%) 0.8530 ( -1.85%) 1.4175 (-69.25%) 0.8380 ( -0.06%) 0.8405 ( -0.36%) 0.8305 ( 0.84%)
Elapsed 18 0.8045 ( 0.00%) 0.8100 ( -0.68%) 1.4135 (-75.70%) 0.8120 ( -0.93%) 0.8050 ( -0.06%) 0.8010 ( 0.44%)
Elapsed 19 0.7600 ( 0.00%) 0.7625 ( -0.33%) 1.3640 (-79.47%) 0.7700 ( -1.32%) 0.7870 ( -3.55%) 0.7720 ( -1.58%)
Elapsed 20 0.7860 ( 0.00%) 0.7410 ( 5.73%) 1.3125 (-66.98%) 0.7580 ( 3.56%) 0.7375 ( 6.17%) 0.7370 ( 6.23%)
Elapsed 21 0.8080 ( 0.00%) 0.7970 ( 1.36%) 1.2775 (-58.11%) 0.7960 ( 1.49%) 0.8175 ( -1.18%) 0.7970 ( 1.36%)
Elapsed 22 0.7930 ( 0.00%) 0.7840 ( 1.13%) 1.3940 (-75.79%) 0.8035 ( -1.32%) 0.7780 ( 1.89%) 0.7640 ( 3.66%)
Elapsed 23 0.7570 ( 0.00%) 0.7525 ( 0.59%) 1.3490 (-78.20%) 0.7915 ( -4.56%) 0.7710 ( -1.85%) 0.7800 ( -3.04%)
Elapsed 24 0.7705 ( 0.00%) 0.7280 ( 5.52%) 1.4550 (-88.84%) 0.7400 ( 3.96%) 0.7630 ( 0.97%) 0.7575 ( 1.69%)
Elapsed 25 0.8165 ( 0.00%) 0.8630 ( -5.70%) 1.3755 (-68.46%) 0.8790 ( -7.65%) 0.9015 (-10.41%) 0.8505 ( -4.16%)
Elapsed 26 0.8465 ( 0.00%) 0.8425 ( 0.47%) 1.3405 (-58.36%) 0.8790 ( -3.84%) 0.8660 ( -2.30%) 0.8360 ( 1.24%)
Elapsed 27 0.8025 ( 0.00%) 0.8045 ( -0.25%) 1.3655 (-70.16%) 0.8325 ( -3.74%) 0.8420 ( -4.92%) 0.8175 ( -1.87%)
Elapsed 28 0.7990 ( 0.00%) 0.7850 ( 1.75%) 1.3475 (-68.65%) 0.8075 ( -1.06%) 0.8185 ( -2.44%) 0.7885 ( 1.31%)
Elapsed 29 0.8010 ( 0.00%) 0.8005 ( 0.06%) 1.2595 (-57.24%) 0.8075 ( -0.81%) 0.8130 ( -1.50%) 0.7970 ( 0.50%)
Elapsed 30 0.7965 ( 0.00%) 0.7825 ( 1.76%) 1.2365 (-55.24%) 0.8105 ( -1.76%) 0.8050 ( -1.07%) 0.8095 ( -1.63%)
Elapsed 31 0.7820 ( 0.00%) 0.7740 ( 1.02%) 1.2670 (-62.02%) 0.7980 ( -2.05%) 0.8035 ( -2.75%) 0.7970 ( -1.92%)
Elapsed 32 0.7905 ( 0.00%) 0.7675 ( 2.91%) 1.3765 (-74.13%) 0.8000 ( -1.20%) 0.7935 ( -0.38%) 0.7725 ( 2.28%)
Elapsed 33 0.7980 ( 0.00%) 0.7640 ( 4.26%) 1.2225 (-53.20%) 0.7985 ( -0.06%) 0.7945 ( 0.44%) 0.7900 ( 1.00%)
Elapsed 34 0.7875 ( 0.00%) 0.7820 ( 0.70%) 1.1880 (-50.86%) 0.8030 ( -1.97%) 0.8175 ( -3.81%) 0.8090 ( -2.73%)
Elapsed 35 0.7910 ( 0.00%) 0.7735 ( 2.21%) 1.2100 (-52.97%) 0.8050 ( -1.77%) 0.8025 ( -1.45%) 0.7830 ( 1.01%)
Elapsed 36 0.7745 ( 0.00%) 0.7565 ( 2.32%) 1.3075 (-68.82%) 0.8010 ( -3.42%) 0.8095 ( -4.52%) 0.8000 ( -3.29%)
Elapsed 37 0.7960 ( 0.00%) 0.7660 ( 3.77%) 1.1970 (-50.38%) 0.8045 ( -1.07%) 0.7950 ( 0.13%) 0.8010 ( -0.63%)
Elapsed 38 0.7800 ( 0.00%) 0.7825 ( -0.32%) 1.1305 (-44.94%) 0.8095 ( -3.78%) 0.8015 ( -2.76%) 0.8065 ( -3.40%)
Elapsed 39 0.7915 ( 0.00%) 0.7635 ( 3.54%) 1.0915 (-37.90%) 0.8085 ( -2.15%) 0.8060 ( -1.83%) 0.7790 ( 1.58%)
Elapsed 40 0.7810 ( 0.00%) 0.7635 ( 2.24%) 1.1175 (-43.09%) 0.7870 ( -0.77%) 0.8025 ( -2.75%) 0.7895 ( -1.09%)
Elapsed 41 0.7675 ( 0.00%) 0.7730 ( -0.72%) 1.1610 (-51.27%) 0.8025 ( -4.56%) 0.7780 ( -1.37%) 0.7870 ( -2.54%)
Elapsed 42 0.7705 ( 0.00%) 0.7925 ( -2.86%) 1.1095 (-44.00%) 0.7850 ( -1.88%) 0.7890 ( -2.40%) 0.7950 ( -3.18%)
Elapsed 43 0.7830 ( 0.00%) 0.7680 ( 1.92%) 1.1470 (-46.49%) 0.7960 ( -1.66%) 0.7830 ( 0.00%) 0.7855 ( -0.32%)
Elapsed 44 0.7745 ( 0.00%) 0.7560 ( 2.39%) 1.1575 (-49.45%) 0.7870 ( -1.61%) 0.7950 ( -2.65%) 0.7835 ( -1.16%)
Elapsed 45 0.7665 ( 0.00%) 0.7635 ( 0.39%) 1.0200 (-33.07%) 0.7935 ( -3.52%) 0.7745 ( -1.04%) 0.7695 ( -0.39%)
Elapsed 46 0.7660 ( 0.00%) 0.7695 ( -0.46%) 1.0610 (-38.51%) 0.7835 ( -2.28%) 0.7830 ( -2.22%) 0.7725 ( -0.85%)
Elapsed 47 0.7575 ( 0.00%) 0.7710 ( -1.78%) 1.0340 (-36.50%) 0.7895 ( -4.22%) 0.7800 ( -2.97%) 0.7755 ( -2.38%)
Elapsed 48 0.7740 ( 0.00%) 0.7665 ( 0.97%) 1.0505 (-35.72%) 0.7735 ( 0.06%) 0.7795 ( -0.71%) 0.7630 ( 1.42%)

autonuma hurts here. numacore and balancenuma are ok.

Faults/cpu 1 379968.7014 ( 0.00%) 369716.7221 ( -2.70%) 378284.9642 ( -0.44%) 86427.8993 (-77.25%) 87036.4027 (-77.09%) 381109.9811 ( 0.30%)
Faults/cpu 2 379324.0493 ( 0.00%) 376624.9420 ( -0.71%) 372938.2576 ( -1.68%) 258617.9410 (-31.82%) 272229.5372 (-28.23%) 379332.1426 ( 0.00%)
Faults/cpu 3 374110.9252 ( 0.00%) 371809.0394 ( -0.62%) 362384.3379 ( -3.13%) 315364.3194 (-15.70%) 322932.0319 (-13.68%) 373740.6327 ( -0.10%)
Faults/cpu 4 371054.3320 ( 0.00%) 366010.1683 ( -1.36%) 354374.7659 ( -4.50%) 347925.4511 ( -6.23%) 351926.8213 ( -5.15%) 369718.8116 ( -0.36%)
Faults/cpu 5 357644.9509 ( 0.00%) 353116.2568 ( -1.27%) 340954.4156 ( -4.67%) 342873.2808 ( -4.13%) 348837.4032 ( -2.46%) 355357.9808 ( -0.64%)
Faults/cpu 6 345166.0268 ( 0.00%) 343605.5937 ( -0.45%) 324566.0244 ( -5.97%) 339177.9361 ( -1.73%) 341785.4988 ( -0.98%) 345830.4062 ( 0.19%)
Faults/cpu 7 346686.9164 ( 0.00%) 343254.5354 ( -0.99%) 307569.0063 (-11.28%) 334501.4563 ( -3.51%) 337715.4825 ( -2.59%) 342176.3071 ( -1.30%)
Faults/cpu 8 345617.2248 ( 0.00%) 341409.8570 ( -1.22%) 301005.0046 (-12.91%) 335797.8156 ( -2.84%) 344630.9102 ( -0.29%) 346313.4237 ( 0.20%)
Faults/cpu 9 324187.6755 ( 0.00%) 324493.4570 ( 0.09%) 292467.7328 ( -9.78%) 320295.6357 ( -1.20%) 321737.9910 ( -0.76%) 325867.9016 ( 0.52%)
Faults/cpu 10 323260.5270 ( 0.00%) 321706.2762 ( -0.48%) 267253.0641 (-17.33%) 314825.0722 ( -2.61%) 317861.8672 ( -1.67%) 320046.7340 ( -0.99%)
Faults/cpu 11 319485.7975 ( 0.00%) 315952.8672 ( -1.11%) 242837.3072 (-23.99%) 312472.4466 ( -2.20%) 316449.1894 ( -0.95%) 317039.2752 ( -0.77%)
Faults/cpu 12 314193.4166 ( 0.00%) 313068.6101 ( -0.36%) 235605.3115 (-25.01%) 309340.3850 ( -1.54%) 313383.0113 ( -0.26%) 317336.9315 ( 1.00%)
Faults/cpu 13 297642.2341 ( 0.00%) 299213.5432 ( 0.53%) 234437.1802 (-21.24%) 293494.9766 ( -1.39%) 299705.3429 ( 0.69%) 300624.5210 ( 1.00%)
Faults/cpu 14 290534.1543 ( 0.00%) 288426.1514 ( -0.73%) 224483.1714 (-22.73%) 285707.6328 ( -1.66%) 290879.5737 ( 0.12%) 289279.0242 ( -0.43%)
Faults/cpu 15 288135.4034 ( 0.00%) 283193.5948 ( -1.72%) 212413.0189 (-26.28%) 280349.0344 ( -2.70%) 284072.2862 ( -1.41%) 287647.8834 ( -0.17%)
Faults/cpu 16 272332.8272 ( 0.00%) 272814.3475 ( 0.18%) 207466.3481 (-23.82%) 270402.6579 ( -0.71%) 271763.7503 ( -0.21%) 274964.5255 ( 0.97%)
Faults/cpu 17 259801.4891 ( 0.00%) 254678.1893 ( -1.97%) 195438.3763 (-24.77%) 258832.2108 ( -0.37%) 260388.8630 ( 0.23%) 260959.0635 ( 0.45%)
Faults/cpu 18 247485.0166 ( 0.00%) 247528.4736 ( 0.02%) 188851.6906 (-23.69%) 246617.6952 ( -0.35%) 246672.7250 ( -0.33%) 248623.7380 ( 0.46%)
Faults/cpu 19 240874.3964 ( 0.00%) 240040.1762 ( -0.35%) 188854.7002 (-21.60%) 241091.5604 ( 0.09%) 235779.1526 ( -2.12%) 240054.8191 ( -0.34%)
Faults/cpu 20 230055.4776 ( 0.00%) 233739.6952 ( 1.60%) 189561.1074 (-17.60%) 232361.9801 ( 1.00%) 235648.3672 ( 2.43%) 235093.1838 ( 2.19%)
Faults/cpu 21 221089.0306 ( 0.00%) 222658.7857 ( 0.71%) 185501.7940 (-16.10%) 221778.3227 ( 0.31%) 220242.8822 ( -0.38%) 222037.5554 ( 0.43%)
Faults/cpu 22 212928.6223 ( 0.00%) 211709.9070 ( -0.57%) 173833.3256 (-18.36%) 210452.7972 ( -1.16%) 214426.3103 ( 0.70%) 214947.4742 ( 0.95%)
Faults/cpu 23 207494.8662 ( 0.00%) 206521.8192 ( -0.47%) 171758.7557 (-17.22%) 205407.2927 ( -1.01%) 206721.0393 ( -0.37%) 207409.9085 ( -0.04%)
Faults/cpu 24 198271.6218 ( 0.00%) 200140.9741 ( 0.94%) 162334.1621 (-18.13%) 201006.4327 ( 1.38%) 201252.9323 ( 1.50%) 200952.4305 ( 1.35%)
Faults/cpu 25 194049.1874 ( 0.00%) 188802.4110 ( -2.70%) 161943.4996 (-16.55%) 191462.4322 ( -1.33%) 191439.2795 ( -1.34%) 192108.4659 ( -1.00%)
Faults/cpu 26 183620.4998 ( 0.00%) 183343.6939 ( -0.15%) 160425.1497 (-12.63%) 182870.8145 ( -0.41%) 184395.3448 ( 0.42%) 186077.3626 ( 1.34%)
Faults/cpu 27 181390.7603 ( 0.00%) 180468.1260 ( -0.51%) 156356.5144 (-13.80%) 181196.8598 ( -0.11%) 181266.5928 ( -0.07%) 180640.5088 ( -0.41%)
Faults/cpu 28 176180.0531 ( 0.00%) 175634.1202 ( -0.31%) 150357.6004 (-14.66%) 177080.1177 ( 0.51%) 177119.5918 ( 0.53%) 176368.0055 ( 0.11%)
Faults/cpu 29 169650.2633 ( 0.00%) 168217.8595 ( -0.84%) 155420.2194 ( -8.39%) 170747.8837 ( 0.65%) 171278.7622 ( 0.96%) 170279.8400 ( 0.37%)
Faults/cpu 30 165035.8356 ( 0.00%) 164500.4660 ( -0.32%) 149498.3808 ( -9.41%) 165260.2440 ( 0.14%) 166184.8081 ( 0.70%) 164413.5702 ( -0.38%)
Faults/cpu 31 159436.3440 ( 0.00%) 160203.2927 ( 0.48%) 139138.4143 (-12.73%) 159857.9330 ( 0.26%) 160602.8294 ( 0.73%) 158802.3951 ( -0.40%)
Faults/cpu 32 155345.7802 ( 0.00%) 155688.0137 ( 0.22%) 136290.5101 (-12.27%) 156028.5649 ( 0.44%) 156660.6132 ( 0.85%) 156110.2021 ( 0.49%)
Faults/cpu 33 150219.6220 ( 0.00%) 150761.8116 ( 0.36%) 135744.4512 ( -9.64%) 151295.3001 ( 0.72%) 152374.5286 ( 1.43%) 149876.4226 ( -0.23%)
Faults/cpu 34 145772.3820 ( 0.00%) 144612.2751 ( -0.80%) 136039.8268 ( -6.68%) 147191.8811 ( 0.97%) 146490.6089 ( 0.49%) 144259.7221 ( -1.04%)
Faults/cpu 35 141844.4600 ( 0.00%) 141708.8606 ( -0.10%) 136089.5490 ( -4.06%) 141913.1720 ( 0.05%) 142196.7473 ( 0.25%) 141281.3582 ( -0.40%)
Faults/cpu 36 137593.5661 ( 0.00%) 138161.2436 ( 0.41%) 128386.3001 ( -6.69%) 138513.0778 ( 0.67%) 138313.7914 ( 0.52%) 136719.5046 ( -0.64%)
Faults/cpu 37 132889.3691 ( 0.00%) 133510.5699 ( 0.47%) 127211.5973 ( -4.27%) 133844.4348 ( 0.72%) 134542.6731 ( 1.24%) 133044.9847 ( 0.12%)
Faults/cpu 38 129464.8808 ( 0.00%) 129309.9659 ( -0.12%) 124991.9760 ( -3.45%) 129698.4299 ( 0.18%) 130383.7440 ( 0.71%) 128545.0900 ( -0.71%)
Faults/cpu 39 125847.2523 ( 0.00%) 126247.6919 ( 0.32%) 125720.8199 ( -0.10%) 125748.5172 ( -0.08%) 126184.8812 ( 0.27%) 126166.4376 ( 0.25%)
Faults/cpu 40 122497.3658 ( 0.00%) 122904.6230 ( 0.33%) 119592.8625 ( -2.37%) 122917.6924 ( 0.34%) 123206.4626 ( 0.58%) 121880.4385 ( -0.50%)
Faults/cpu 41 119450.0397 ( 0.00%) 119031.7169 ( -0.35%) 115547.9382 ( -3.27%) 118794.7652 ( -0.55%) 119418.5855 ( -0.03%) 118715.8560 ( -0.61%)
Faults/cpu 42 116004.5444 ( 0.00%) 115247.2406 ( -0.65%) 115673.3669 ( -0.29%) 115894.3102 ( -0.10%) 115924.0103 ( -0.07%) 115546.2484 ( -0.40%)
Faults/cpu 43 112825.6897 ( 0.00%) 112555.8521 ( -0.24%) 115351.1821 ( 2.24%) 113205.7203 ( 0.34%) 112896.3224 ( 0.06%) 112501.5505 ( -0.29%)
Faults/cpu 44 110221.9798 ( 0.00%) 109799.1269 ( -0.38%) 111690.2165 ( 1.33%) 109460.3398 ( -0.69%) 109736.3227 ( -0.44%) 109822.0646 ( -0.36%)
Faults/cpu 45 107808.1019 ( 0.00%) 106853.8230 ( -0.89%) 111211.9257 ( 3.16%) 106613.8474 ( -1.11%) 106835.5728 ( -0.90%) 107420.9722 ( -0.36%)
Faults/cpu 46 105338.7289 ( 0.00%) 104322.1338 ( -0.97%) 108688.1743 ( 3.18%) 103868.0598 ( -1.40%) 104019.1548 ( -1.25%) 105022.6610 ( -0.30%)
Faults/cpu 47 103330.7670 ( 0.00%) 102023.9900 ( -1.26%) 108331.5085 ( 4.84%) 101681.8182 ( -1.60%) 101245.4175 ( -2.02%) 102871.1021 ( -0.44%)
Faults/cpu 48 101441.4170 ( 0.00%) 99674.9779 ( -1.74%) 108007.0665 ( 6.47%) 99354.5932 ( -2.06%) 99252.9156 ( -2.16%) 100868.6868 ( -0.56%)

Same story on number of faults processed per CPU.

Faults/sec 1 379226.4553 ( 0.00%) 368933.2163 ( -2.71%) 377567.1922 ( -0.44%) 86267.2515 (-77.25%) 86875.1744 (-77.09%) 380376.2873 ( 0.30%)
Faults/sec 2 749973.6389 ( 0.00%) 745368.4598 ( -0.61%) 729046.6001 ( -2.79%) 501399.0067 (-33.14%) 533091.7531 (-28.92%) 748098.5102 ( -0.25%)
Faults/sec 3 1109387.2150 ( 0.00%) 1101815.4855 ( -0.68%) 1067844.4241 ( -3.74%) 922150.6228 (-16.88%) 948926.6753 (-14.46%) 1105559.1712 ( -0.35%)
Faults/sec 4 1466774.3100 ( 0.00%) 1436277.7333 ( -2.08%) 1386595.2563 ( -5.47%) 1352804.9587 ( -7.77%) 1373754.4330 ( -6.34%) 1455926.9804 ( -0.74%)
Faults/sec 5 1734004.1931 ( 0.00%) 1712341.4333 ( -1.25%) 1663159.2063 ( -4.09%) 1636827.0073 ( -5.60%) 1674262.7667 ( -3.45%) 1719713.1856 ( -0.82%)
Faults/sec 6 2005083.6885 ( 0.00%) 1980047.8898 ( -1.25%) 1892759.0575 ( -5.60%) 1978591.3286 ( -1.32%) 1990385.5922 ( -0.73%) 2012957.1946 ( 0.39%)
Faults/sec 7 2323523.7344 ( 0.00%) 2297209.3144 ( -1.13%) 2064475.4665 (-11.15%) 2260510.6371 ( -2.71%) 2278640.0597 ( -1.93%) 2324813.2040 ( 0.06%)
Faults/sec 8 2648167.0893 ( 0.00%) 2624742.9343 ( -0.88%) 2314968.6209 (-12.58%) 2606988.4580 ( -1.55%) 2671599.7800 ( 0.88%) 2673032.1950 ( 0.94%)
Faults/sec 9 2736925.7247 ( 0.00%) 2728207.1722 ( -0.32%) 2491913.1048 ( -8.95%) 2689604.9745 ( -1.73%) 2708047.0077 ( -1.06%) 2760248.2053 ( 0.85%)
Faults/sec 10 3039414.3444 ( 0.00%) 3038105.4345 ( -0.04%) 2492174.2233 (-18.00%) 2947139.9612 ( -3.04%) 2973073.5636 ( -2.18%) 3002803.7061 ( -1.20%)
Faults/sec 11 3321706.1658 ( 0.00%) 3239414.0527 ( -2.48%) 2456634.8702 (-26.04%) 3237117.6282 ( -2.55%) 3260521.6371 ( -1.84%) 3298132.1843 ( -0.71%)
Faults/sec 12 3532409.7672 ( 0.00%) 3534748.1800 ( 0.07%) 2556542.9426 (-27.63%) 3478409.1401 ( -1.53%) 3513285.3467 ( -0.54%) 3587238.4424 ( 1.55%)
Faults/sec 13 3537583.2973 ( 0.00%) 3555979.7240 ( 0.52%) 2643676.1015 (-25.27%) 3498887.6802 ( -1.09%) 3584695.8753 ( 1.33%) 3590044.7697 ( 1.48%)
Faults/sec 14 3746624.1500 ( 0.00%) 3689003.6175 ( -1.54%) 2630758.3449 (-29.78%) 3690864.4632 ( -1.49%) 3751840.8797 ( 0.14%) 3724950.8729 ( -0.58%)
Faults/sec 15 4051109.8741 ( 0.00%) 3953680.3643 ( -2.41%) 2541857.4723 (-37.26%) 3905515.7917 ( -3.59%) 3998526.1306 ( -1.30%) 4049199.2538 ( -0.05%)
Faults/sec 16 4078126.4712 ( 0.00%) 4123441.7643 ( 1.11%) 2549782.7076 (-37.48%) 4067671.7626 ( -0.26%) 4106454.4320 ( 0.69%) 4167569.6242 ( 2.19%)
Faults/sec 17 3946209.5066 ( 0.00%) 3886274.3946 ( -1.52%) 2405328.1767 (-39.05%) 3937304.5223 ( -0.23%) 3920485.2382 ( -0.65%) 3967957.4690 ( 0.55%)
Faults/sec 18 4115112.1063 ( 0.00%) 4079027.7233 ( -0.88%) 2385981.0332 (-42.02%) 4062940.8129 ( -1.27%) 4103770.0811 ( -0.28%) 4121303.7070 ( 0.15%)
Faults/sec 19 4354086.4908 ( 0.00%) 4333268.5610 ( -0.48%) 2501627.6834 (-42.55%) 4284800.1294 ( -1.59%) 4206148.7446 ( -3.40%) 4287512.8517 ( -1.53%)
Faults/sec 20 4263596.5894 ( 0.00%) 4472167.3677 ( 4.89%) 2564140.4929 (-39.86%) 4370659.6359 ( 2.51%) 4479581.9679 ( 5.07%) 4484166.9738 ( 5.17%)
Faults/sec 21 4098972.5089 ( 0.00%) 4151322.9576 ( 1.28%) 2626683.1075 (-35.92%) 4149013.2160 ( 1.22%) 4058372.3890 ( -0.99%) 4143527.1704 ( 1.09%)
Faults/sec 22 4175738.8898 ( 0.00%) 4237648.8102 ( 1.48%) 2388945.8252 (-42.79%) 4137584.2163 ( -0.91%) 4247730.7669 ( 1.72%) 4322814.4495 ( 3.52%)
Faults/sec 23 4373975.8159 ( 0.00%) 4395014.8420 ( 0.48%) 2491320.6893 (-43.04%) 4195839.4189 ( -4.07%) 4289031.3045 ( -1.94%) 4249735.3807 ( -2.84%)
Faults/sec 24 4343903.6909 ( 0.00%) 4539539.0281 ( 4.50%) 2367142.7680 (-45.51%) 4463459.6633 ( 2.75%) 4347883.8816 ( 0.09%) 4361808.4405 ( 0.41%)
Faults/sec 25 4049139.5490 ( 0.00%) 3836819.6187 ( -5.24%) 2452593.4879 (-39.43%) 3756917.3563 ( -7.22%) 3667462.3028 ( -9.43%) 3882470.4622 ( -4.12%)
Faults/sec 26 3923558.8580 ( 0.00%) 3926335.3913 ( 0.07%) 2497179.3566 (-36.35%) 3758947.5820 ( -4.20%) 3810590.6641 ( -2.88%) 3949958.5833 ( 0.67%)
Faults/sec 27 4120929.2726 ( 0.00%) 4111259.5839 ( -0.23%) 2444020.3202 (-40.69%) 3958866.4333 ( -3.93%) 3934181.7350 ( -4.53%) 4038502.1999 ( -2.00%)
Faults/sec 28 4148296.9993 ( 0.00%) 4208740.3644 ( 1.46%) 2508485.6715 (-39.53%) 4084949.7113 ( -1.53%) 4037661.6209 ( -2.67%) 4185738.4607 ( 0.90%)
Faults/sec 29 4124742.2486 ( 0.00%) 4142048.5869 ( 0.42%) 2672716.5715 (-35.20%) 4085761.2234 ( -0.95%) 4068650.8559 ( -1.36%) 4144694.1129 ( 0.48%)
Faults/sec 30 4160740.4979 ( 0.00%) 4236457.4748 ( 1.82%) 2695629.9415 (-35.21%) 4076825.3513 ( -2.02%) 4106802.5562 ( -1.30%) 4084027.7691 ( -1.84%)
Faults/sec 31 4237767.8919 ( 0.00%) 4262954.1215 ( 0.59%) 2622045.7226 (-38.13%) 4147492.6973 ( -2.13%) 4129507.3254 ( -2.55%) 4154591.8086 ( -1.96%)
Faults/sec 32 4193896.3492 ( 0.00%) 4313804.9370 ( 2.86%) 2486013.3793 (-40.72%) 4144234.0287 ( -1.18%) 4167653.2985 ( -0.63%) 4280308.2714 ( 2.06%)
Faults/sec 33 4162942.9767 ( 0.00%) 4324720.6943 ( 3.89%) 2705706.6138 (-35.00%) 4148215.3556 ( -0.35%) 4160800.6591 ( -0.05%) 4188855.2428 ( 0.62%)
Faults/sec 34 4204133.3523 ( 0.00%) 4246486.4313 ( 1.01%) 2801163.4164 (-33.37%) 4115498.6406 ( -2.11%) 4050464.9098 ( -3.66%) 4092430.9384 ( -2.66%)
Faults/sec 35 4189096.5835 ( 0.00%) 4271877.3268 ( 1.98%) 2763406.1657 (-34.03%) 4112864.6044 ( -1.82%) 4116065.7955 ( -1.74%) 4219699.5756 ( 0.73%)
Faults/sec 36 4277421.2521 ( 0.00%) 4373426.4356 ( 2.24%) 2692221.4270 (-37.06%) 4129438.5970 ( -3.46%) 4108075.3296 ( -3.96%) 4149259.8944 ( -3.00%)
Faults/sec 37 4168551.9047 ( 0.00%) 4319223.3874 ( 3.61%) 2836764.2086 (-31.95%) 4109725.0377 ( -1.41%) 4156874.2769 ( -0.28%) 4149515.4613 ( -0.46%)
Faults/sec 38 4247525.5670 ( 0.00%) 4229905.6978 ( -0.41%) 2938912.4587 (-30.81%) 4085058.1995 ( -3.82%) 4127366.4416 ( -2.83%) 4096271.9211 ( -3.56%)
Faults/sec 39 4190989.8515 ( 0.00%) 4329385.1325 ( 3.30%) 3061436.0988 (-26.95%) 4099026.7324 ( -2.19%) 4094648.2005 ( -2.30%) 4240087.0764 ( 1.17%)
Faults/sec 40 4238307.5210 ( 0.00%) 4337475.3368 ( 2.34%) 2988097.1336 (-29.50%) 4203501.6812 ( -0.82%) 4120604.7912 ( -2.78%) 4193144.8164 ( -1.07%)
Faults/sec 41 4317393.3854 ( 0.00%) 4282458.5094 ( -0.81%) 2949899.0149 (-31.67%) 4120836.6477 ( -4.55%) 4248620.8455 ( -1.59%) 4206700.7050 ( -2.56%)
Faults/sec 42 4299075.7581 ( 0.00%) 4181602.0005 ( -2.73%) 3037710.0530 (-29.34%) 4205958.7415 ( -2.17%) 4181449.1786 ( -2.74%) 4155578.2275 ( -3.34%)
Faults/sec 43 4234922.1492 ( 0.00%) 4301130.5970 ( 1.56%) 2996342.1505 (-29.25%) 4170975.0653 ( -1.51%) 4210039.9002 ( -0.59%) 4203158.8656 ( -0.75%)
Faults/sec 44 4270913.7498 ( 0.00%) 4376035.4745 ( 2.46%) 3054249.1521 (-28.49%) 4193693.1721 ( -1.81%) 4154034.6390 ( -2.74%) 4207031.5562 ( -1.50%)
Faults/sec 45 4313055.5348 ( 0.00%) 4342993.1271 ( 0.69%) 3263986.2960 (-24.32%) 4172891.7566 ( -3.25%) 4262028.6193 ( -1.18%) 4293905.9657 ( -0.44%)
Faults/sec 46 4323716.1160 ( 0.00%) 4306994.5183 ( -0.39%) 3198502.0716 (-26.02%) 4212553.2514 ( -2.57%) 4216000.7652 ( -2.49%) 4277511.4815 ( -1.07%)
Faults/sec 47 4364354.4986 ( 0.00%) 4290609.7996 ( -1.69%) 3274654.5504 (-24.97%) 4185908.2435 ( -4.09%) 4235166.8662 ( -2.96%) 4267607.2786 ( -2.22%)
Faults/sec 48 4280234.1143 ( 0.00%) 4312820.1724 ( 0.76%) 3168212.5669 (-25.98%) 4272168.2365 ( -0.19%) 4235504.6092 ( -1.05%) 4322535.9118 ( 0.99%)

More or less the same story.

MMTests Statistics: duration
3.7.0 3.7.0 3.7.0 3.7.0 3.7.0 3.7.0
rc6-stats-v5r1 rc6-numacore-20121123 rc6-autonuma-v28fastr4 rc6-thpmigrate-v5r1 rc6-adaptscan-v5r1 rc6-delaystart-v5r4
User 1076.65 935.93 1276.09 1089.84 1134.60 1097.18
System 18726.05 18738.26 22038.05 19395.18 19281.62 18688.61
Elapsed 1353.67 1346.72 1798.95 2022.47 2010.67 1355.63

autonuma's system CPU usage overhead is obvious here. balancenuma and
numacore are ok, although it's interesting to note that balancenuma required
the delaystart logic to keep the usage down here.

MMTests Statistics: vmstat
3.7.0 3.7.0 3.7.0 3.7.0 3.7.0 3.7.0
rc6-stats-v5r1 rc6-numacore-20121123 rc6-autonuma-v28fastr4 rc6-thpmigrate-v5r1 rc6-adaptscan-v5r1 rc6-delaystart-v5r4
Page Ins 680 536 536 540 540 540
Page Outs 16004 15496 19048 19052 19888 15892
Swap Ins 0 0 0 0 0 0
Swap Outs 0 0 0 0 0 0
Direct pages scanned 0 0 0 0 0 0
Kswapd pages scanned 0 0 0 0 0 0
Kswapd pages reclaimed 0 0 0 0 0 0
Direct pages reclaimed 0 0 0 0 0 0
Kswapd efficiency 100% 100% 100% 100% 100% 100%
Kswapd velocity 0.000 0.000 0.000 0.000 0.000 0.000
Direct efficiency 100% 100% 100% 100% 100% 100%
Direct velocity 0.000 0.000 0.000 0.000 0.000 0.000
Percentage direct scans 0% 0% 0% 0% 0% 0%
Page writes by reclaim 0 0 0 0 0 0
Page writes file 0 0 0 0 0 0
Page writes anon 0 0 0 0 0 0
Page reclaim immediate 0 0 0 0 0 0
Page rescued immediate 0 0 0 0 0 0
Slabs scanned 0 0 0 0 0 0
Direct inode steals 0 0 0 0 0 0
Kswapd inode steals 0 0 0 0 0 0
Kswapd skipped wait 0 0 0 0 0 0
THP fault alloc 0 0 0 0 0 0
THP collapse alloc 0 0 0 0 0 0
THP splits 0 0 0 1 0 0
THP fault fallback 0 0 0 0 0 0
THP collapse fail 0 0 0 0 0 0
Compaction stalls 0 0 0 0 0 0
Compaction success 0 0 0 0 0 0
Compaction failures 0 0 0 0 0 0
Page migrate success 0 0 0 1093 986 613
Page migrate failure 0 0 0 0 0 0
Compaction pages isolated 0 0 0 0 0 0
Compaction migrate scanned 0 0 0 0 0 0
Compaction free scanned 0 0 0 0 0 0
Compaction cost 0 0 0 1 1 0
NUMA PTE updates 0 0 0 505196235 493301672 515709
NUMA hint faults 0 0 0 2549799 2482875 105795
NUMA hint local faults 0 0 0 2545441 2480546 102428
NUMA pages migrated 0 0 0 1093 986 613
AutoNUMA cost 0 0 0 16285 15867 532

There you have it. Some good results, some great, some bad results, some
disastrous. Of course this is for only one machine and other machines
might report differently. I've outlined what other factors could impact the
results and will re-run tests if there is a complaint about one of them.

I'll keep my overall comments to balancenuma. I think it did pretty well
overall. It generally was an improvement on the baseline kernel and in only
one case did it heavily regress (specjbb, single JVM, no THP). Here it hit
its worst-case scenario of always dealing with PTE faults, almost always
migrating and not reducing the scan rate. I could try to be clever about
this, I could ignore it or I could hit it with a hammer. I have a hammer.

Other comments?

--
Mel Gorman
SUSE Labs

2012-11-25 08:47:17

by Hillf Danton

[permalink] [raw]
Subject: Re: Comparison between three trees (was: Latest numa/core release, v17)

On 11/24/12, Mel Gorman <[email protected]> wrote:
> Warning: This is an insanely long mail and there a lot of data here. Get
> coffee or something.
>
> This is another round of comparisons between the latest released versions
> of each of three automatic numa balancing trees that are out there.
>
> From the series "Automatic NUMA Balancing V5", the kernels tested were
>
> stats-v5r1 Patches 1-10. TLB optimisations, migration stats
> thpmigrate-v5r1 Patches 1-37. Basic placement policy, PMD handling, THP
> migration etc.
> adaptscan-v5r1 Patches 1-38. Heavy handed PTE scan reduction
> delaystart-v5r1 Patches 1-40. Delay the PTE scan until running on a new
> node
>
> If I just say balancenuma, I mean the "delaystart-v5r1" kernel. The other
> kernels are included so you can see the impact the scan rate adaption
> patch has and what that might mean for a placement policy using a proper
> feedback mechanism.
>
> The other two kernels were
>
> numacore-20121123 It was no longer clear what the deltas between releases
> and
> the dependencies might be so I just pulled tip/master on November
> 23rd, 2012. An earlier pull had serious difficulties and the patch
> responsible has been dropped since. This is not a like-with-like
> comparison as the tree contains numerous other patches but it's
> the best available given the timeframe
>
> autonuma-v28fast This is a rebased version of Andrea's autonuma-v28fast
> branch with Hugh's THP migration patch on top.

FYI, based on how target huge page is selected,

+
+ new_page = alloc_pages_node(numa_node_id(),
+ (GFP_TRANSHUGE | GFP_THISNODE) & ~__GFP_WAIT, HPAGE_PMD_ORDER);

the THP replacement policy is changed to be MORON,

+ /* Migrate the page towards the node whose CPU is referencing it */
+ if (pol->flags & MPOL_F_MORON)
+ polnid = numa_node_id();


described in
[PATCH 29/46] mm: numa: Migrate on reference policy
https://lkml.org/lkml/2012/11/21/228

Hillf

2012-11-25 23:37:43

by Mel Gorman

[permalink] [raw]
Subject: Re: Comparison between three trees (was: Latest numa/core release, v17)

On Fri, Nov 23, 2012 at 05:32:05PM +0000, Mel Gorman wrote:
> From here, we're onto the single JVM configuration. I suspect
> this is tested much more commonly but note that it behaves very
> differently to the multi JVM configuration as explained by Andrea
> (http://choon.net/forum/read.php?21,1599976,page=4).
>
> A concern with the single JVM results as reported here is the maximum
> number of warehouses. In the Multi JVM configuration, the expected peak
> was 12 warehouses so I ran up to 18 so that the tests could complete in a
> reasonable amount of time. The expected peak for a single JVM is 48 (the
> number of CPUs) but the configuration file was derived from the multi JVM
> configuration so it was restricted to running up to 18 warehouses. Again,
> the reason was so it would complete in a reasonable amount of time but
> specjbb does not give a score for this type of configuration and I am
> only reporting on the 1-18 warehouses it ran for. I've reconfigured the
> 4 specjbb configs to run a full config and it'll run over the weekend.
>

The use of just peak figures really is a factor. The THP configuration,
single JVM, is the best configuration for numacore but this is only visible
for peak numbers of warehouses. For lower numbers of warehouses it regresses,
but this is not reported by the specjbb benchmark and could easily have been
missed. It also mostly explains why I was seeing very different figures to
other testers.

More below.

> SPECJBB: Single JVMs (one per node, 4 nodes), THP is enabled
>
> SPECJBB BOPS
> 3.7.0 3.7.0 3.7.0 3.7.0 3.7.0 3.7.0
> rc6-stats-v5r1 rc6-numacore-20121123 rc6-autonuma-v28fastr4 rc6-thpmigrate-v5r1 rc6-adaptscan-v5r1 rc6-delaystart-v5r4
> TPut 1 26802.00 ( 0.00%) 22808.00 (-14.90%) 24482.00 ( -8.66%) 25723.00 ( -4.03%) 24387.00 ( -9.01%) 25940.00 ( -3.22%)
> TPut 2 57720.00 ( 0.00%) 51245.00 (-11.22%) 55018.00 ( -4.68%) 55498.00 ( -3.85%) 55259.00 ( -4.26%) 55581.00 ( -3.71%)
> TPut 3 86940.00 ( 0.00%) 79172.00 ( -8.93%) 87705.00 ( 0.88%) 86101.00 ( -0.97%) 86894.00 ( -0.05%) 86875.00 ( -0.07%)
> TPut 4 117203.00 ( 0.00%) 107315.00 ( -8.44%) 117382.00 ( 0.15%) 116282.00 ( -0.79%) 116322.00 ( -0.75%) 115263.00 ( -1.66%)
> TPut 5 145375.00 ( 0.00%) 121178.00 (-16.64%) 145802.00 ( 0.29%) 142378.00 ( -2.06%) 144947.00 ( -0.29%) 144211.00 ( -0.80%)
> TPut 6 169232.00 ( 0.00%) 157796.00 ( -6.76%) 173409.00 ( 2.47%) 171066.00 ( 1.08%) 173341.00 ( 2.43%) 169861.00 ( 0.37%)
> TPut 7 195468.00 ( 0.00%) 169834.00 (-13.11%) 197201.00 ( 0.89%) 197536.00 ( 1.06%) 198347.00 ( 1.47%) 198047.00 ( 1.32%)
> TPut 8 217863.00 ( 0.00%) 169975.00 (-21.98%) 222559.00 ( 2.16%) 224901.00 ( 3.23%) 226268.00 ( 3.86%) 218354.00 ( 0.23%)
> TPut 9 240679.00 ( 0.00%) 197498.00 (-17.94%) 245997.00 ( 2.21%) 250022.00 ( 3.88%) 253838.00 ( 5.47%) 250264.00 ( 3.98%)
> TPut 10 261454.00 ( 0.00%) 204909.00 (-21.63%) 269551.00 ( 3.10%) 275125.00 ( 5.23%) 274658.00 ( 5.05%) 274155.00 ( 4.86%)
> TPut 11 281079.00 ( 0.00%) 230118.00 (-18.13%) 281588.00 ( 0.18%) 304383.00 ( 8.29%) 297198.00 ( 5.73%) 299131.00 ( 6.42%)
> TPut 12 302007.00 ( 0.00%) 275511.00 ( -8.77%) 313281.00 ( 3.73%) 327826.00 ( 8.55%) 325324.00 ( 7.72%) 325372.00 ( 7.74%)
> TPut 13 319139.00 ( 0.00%) 293501.00 ( -8.03%) 332581.00 ( 4.21%) 352389.00 ( 10.42%) 340169.00 ( 6.59%) 351215.00 ( 10.05%)
> TPut 14 321069.00 ( 0.00%) 312088.00 ( -2.80%) 337911.00 ( 5.25%) 376198.00 ( 17.17%) 370669.00 ( 15.45%) 366491.00 ( 14.15%)
> TPut 15 345851.00 ( 0.00%) 283856.00 (-17.93%) 369104.00 ( 6.72%) 389772.00 ( 12.70%) 392963.00 ( 13.62%) 389254.00 ( 12.55%)
> TPut 16 346868.00 ( 0.00%) 317127.00 ( -8.57%) 380930.00 ( 9.82%) 420331.00 ( 21.18%) 412974.00 ( 19.06%) 408575.00 ( 17.79%)
> TPut 17 357755.00 ( 0.00%) 349624.00 ( -2.27%) 387635.00 ( 8.35%) 441223.00 ( 23.33%) 426558.00 ( 19.23%) 435985.00 ( 21.87%)
> TPut 18 357467.00 ( 0.00%) 360056.00 ( 0.72%) 399487.00 ( 11.75%) 464603.00 ( 29.97%) 442907.00 ( 23.90%) 453011.00 ( 26.73%)
>
> numacore is not doing well here for low numbers of warehouses. However,
> note that by 18 warehouses it had drawn level and the expected peak is 48
> warehouses. The specjbb reported figure would be using the higher numbers
> of warehouses. I'll run a full range over the weekend and report back. If
> time permits, I'll also run a "monitors disabled" run in case the read of
> numa_maps every 10 seconds is crippling it.
>

Over the weekend I ran a few configurations that used a large number of
warehouses. The numacore and autonuma kernels are as before. The balancenuma
kernel is a reshuffled tree that moves the THP patches towards the end of the
series. It's functionally very similar to delaystart-v5r4 from the earlier
report. The differences are bug fixes from Hillf and accounting fixes.

In terms of testing, the big difference is the number of warehouses
tested. Here are the results.

SPECJBB: Single JVM, THP is enabled
3.7.0 3.7.0 3.7.0 3.7.0
rc6-stats-v5r1 rc6-numacore-20121123 rc6-autonuma-v28fastr4 rc6-thpmigrate-v6r10
TPut 1 25598.00 ( 0.00%) 24938.00 ( -2.58%) 24663.00 ( -3.65%) 25641.00 ( 0.17%)
TPut 2 56182.00 ( 0.00%) 50701.00 ( -9.76%) 55059.00 ( -2.00%) 56300.00 ( 0.21%)
TPut 3 84856.00 ( 0.00%) 80000.00 ( -5.72%) 86692.00 ( 2.16%) 87656.00 ( 3.30%)
TPut 4 115406.00 ( 0.00%) 102629.00 (-11.07%) 118576.00 ( 2.75%) 117089.00 ( 1.46%)
TPut 5 143810.00 ( 0.00%) 131824.00 ( -8.33%) 142516.00 ( -0.90%) 143652.00 ( -0.11%)
TPut 6 168681.00 ( 0.00%) 138700.00 (-17.77%) 171938.00 ( 1.93%) 171625.00 ( 1.75%)
TPut 7 196629.00 ( 0.00%) 158003.00 (-19.64%) 184263.00 ( -6.29%) 196422.00 ( -0.11%)
TPut 8 219888.00 ( 0.00%) 173094.00 (-21.28%) 222689.00 ( 1.27%) 226163.00 ( 2.85%)
TPut 9 244790.00 ( 0.00%) 201543.00 (-17.67%) 247785.00 ( 1.22%) 252223.00 ( 3.04%)
TPut 10 265824.00 ( 0.00%) 224522.00 (-15.54%) 268362.00 ( 0.95%) 273253.00 ( 2.79%)
TPut 11 286745.00 ( 0.00%) 240431.00 (-16.15%) 297968.00 ( 3.91%) 303903.00 ( 5.98%)
TPut 12 312593.00 ( 0.00%) 278749.00 (-10.83%) 322880.00 ( 3.29%) 324283.00 ( 3.74%)
TPut 13 319508.00 ( 0.00%) 297467.00 ( -6.90%) 337332.00 ( 5.58%) 350443.00 ( 9.68%)
TPut 14 348575.00 ( 0.00%) 301683.00 (-13.45%) 374828.00 ( 7.53%) 371199.00 ( 6.49%)
TPut 15 350516.00 ( 0.00%) 357707.00 ( 2.05%) 370428.00 ( 5.68%) 400114.00 ( 14.15%)
TPut 16 370886.00 ( 0.00%) 326597.00 (-11.94%) 412694.00 ( 11.27%) 420616.00 ( 13.41%)
TPut 17 386422.00 ( 0.00%) 363441.00 ( -5.95%) 427190.00 ( 10.55%) 444268.00 ( 14.97%)
TPut 18 387031.00 ( 0.00%) 387802.00 ( 0.20%) 449808.00 ( 16.22%) 459404.00 ( 18.70%)
TPut 19 397352.00 ( 0.00%) 387513.00 ( -2.48%) 444231.00 ( 11.80%) 480527.00 ( 20.93%)
TPut 20 386512.00 ( 0.00%) 409861.00 ( 6.04%) 469152.00 ( 21.38%) 503000.00 ( 30.14%)
TPut 21 406441.00 ( 0.00%) 453321.00 ( 11.53%) 475290.00 ( 16.94%) 517443.00 ( 27.31%)
TPut 22 399667.00 ( 0.00%) 473069.00 ( 18.37%) 494780.00 ( 23.80%) 530384.00 ( 32.71%)
TPut 23 406795.00 ( 0.00%) 459549.00 ( 12.97%) 498187.00 ( 22.47%) 545605.00 ( 34.12%)
TPut 24 410499.00 ( 0.00%) 442373.00 ( 7.76%) 506758.00 ( 23.45%) 555870.00 ( 35.41%)
TPut 25 400845.00 ( 0.00%) 463657.00 ( 15.67%) 497653.00 ( 24.15%) 554370.00 ( 38.30%)
TPut 26 390073.00 ( 0.00%) 488957.00 ( 25.35%) 500685.00 ( 28.36%) 553714.00 ( 41.95%)
TPut 27 391689.00 ( 0.00%) 452545.00 ( 15.54%) 498155.00 ( 27.18%) 561167.00 ( 43.27%)
TPut 28 380903.00 ( 0.00%) 483782.00 ( 27.01%) 494085.00 ( 29.71%) 546296.00 ( 43.42%)
TPut 29 381805.00 ( 0.00%) 527448.00 ( 38.15%) 502872.00 ( 31.71%) 552729.00 ( 44.77%)
TPut 30 375810.00 ( 0.00%) 483409.00 ( 28.63%) 494412.00 ( 31.56%) 548433.00 ( 45.93%)
TPut 31 378324.00 ( 0.00%) 477776.00 ( 26.29%) 497701.00 ( 31.55%) 548419.00 ( 44.96%)
TPut 32 372322.00 ( 0.00%) 444958.00 ( 19.51%) 488683.00 ( 31.25%) 536867.00 ( 44.19%)
TPut 33 359918.00 ( 0.00%) 431751.00 ( 19.96%) 484478.00 ( 34.61%) 538970.00 ( 49.75%)
TPut 34 357685.00 ( 0.00%) 452866.00 ( 26.61%) 476558.00 ( 33.23%) 521906.00 ( 45.91%)
TPut 35 354902.00 ( 0.00%) 456795.00 ( 28.71%) 484244.00 ( 36.44%) 533609.00 ( 50.35%)
TPut 36 337517.00 ( 0.00%) 469182.00 ( 39.01%) 454640.00 ( 34.70%) 526363.00 ( 55.95%)
TPut 37 332136.00 ( 0.00%) 456822.00 ( 37.54%) 458413.00 ( 38.02%) 519400.00 ( 56.38%)
TPut 38 330084.00 ( 0.00%) 453377.00 ( 37.35%) 434666.00 ( 31.68%) 512187.00 ( 55.17%)
TPut 39 319024.00 ( 0.00%) 412778.00 ( 29.39%) 428688.00 ( 34.37%) 509798.00 ( 59.80%)
TPut 40 315002.00 ( 0.00%) 391376.00 ( 24.25%) 398529.00 ( 26.52%) 480411.00 ( 52.51%)
TPut 41 299693.00 ( 0.00%) 353819.00 ( 18.06%) 403541.00 ( 34.65%) 492599.00 ( 64.37%)
TPut 42 298226.00 ( 0.00%) 347563.00 ( 16.54%) 362189.00 ( 21.45%) 476979.00 ( 59.94%)
TPut 43 295595.00 ( 0.00%) 401208.00 ( 35.73%) 393026.00 ( 32.96%) 459142.00 ( 55.33%)
TPut 44 296490.00 ( 0.00%) 419443.00 ( 41.47%) 341222.00 ( 15.09%) 452357.00 ( 52.57%)
TPut 45 292584.00 ( 0.00%) 420579.00 ( 43.75%) 393112.00 ( 34.36%) 468680.00 ( 60.19%)
TPut 46 287256.00 ( 0.00%) 384628.00 ( 33.90%) 375230.00 ( 30.63%) 433550.00 ( 50.93%)
TPut 47 277411.00 ( 0.00%) 349226.00 ( 25.89%) 392540.00 ( 41.50%) 449038.00 ( 61.87%)
TPut 48 277058.00 ( 0.00%) 396594.00 ( 43.14%) 398184.00 ( 43.72%) 457085.00 ( 64.98%)
TPut 49 279962.00 ( 0.00%) 402671.00 ( 43.83%) 394294.00 ( 40.84%) 425650.00 ( 52.04%)
TPut 50 279948.00 ( 0.00%) 372190.00 ( 32.95%) 420082.00 ( 50.06%) 447108.00 ( 59.71%)
TPut 51 282160.00 ( 0.00%) 362593.00 ( 28.51%) 404464.00 ( 43.35%) 460767.00 ( 63.30%)
TPut 52 275574.00 ( 0.00%) 343943.00 ( 24.81%) 397754.00 ( 44.34%) 425609.00 ( 54.44%)
TPut 53 283902.00 ( 0.00%) 355129.00 ( 25.09%) 410938.00 ( 44.75%) 427099.00 ( 50.44%)
TPut 54 277341.00 ( 0.00%) 371739.00 ( 34.04%) 398662.00 ( 43.74%) 427941.00 ( 54.30%)
TPut 55 272116.00 ( 0.00%) 417531.00 ( 53.44%) 390286.00 ( 43.43%) 436491.00 ( 60.41%)
TPut 56 280207.00 ( 0.00%) 347432.00 ( 23.99%) 404331.00 ( 44.30%) 439342.00 ( 56.79%)
TPut 57 282146.00 ( 0.00%) 329932.00 ( 16.94%) 379562.00 ( 34.53%) 407568.00 ( 44.45%)
TPut 58 275901.00 ( 0.00%) 373810.00 ( 35.49%) 394333.00 ( 42.93%) 428118.00 ( 55.17%)
TPut 59 276583.00 ( 0.00%) 359812.00 ( 30.09%) 376969.00 ( 36.30%) 429891.00 ( 55.43%)
TPut 60 272523.00 ( 0.00%) 368938.00 ( 35.38%) 385033.00 ( 41.28%) 427636.00 ( 56.92%)
TPut 61 272427.00 ( 0.00%) 387343.00 ( 42.18%) 376525.00 ( 38.21%) 417755.00 ( 53.35%)
TPut 62 258730.00 ( 0.00%) 390303.00 ( 50.85%) 373770.00 ( 44.46%) 438145.00 ( 69.34%)
TPut 63 269246.00 ( 0.00%) 389464.00 ( 44.65%) 381536.00 ( 41.71%) 433943.00 ( 61.17%)
TPut 64 266261.00 ( 0.00%) 387660.00 ( 45.59%) 387200.00 ( 45.42%) 399805.00 ( 50.16%)
TPut 65 259147.00 ( 0.00%) 373458.00 ( 44.11%) 389666.00 ( 50.36%) 400191.00 ( 54.43%)
TPut 66 273445.00 ( 0.00%) 374637.00 ( 37.01%) 359764.00 ( 31.57%) 419330.00 ( 53.35%)
TPut 67 269350.00 ( 0.00%) 380035.00 ( 41.09%) 391560.00 ( 45.37%) 391418.00 ( 45.32%)
TPut 68 275532.00 ( 0.00%) 379096.00 ( 37.59%) 396028.00 ( 43.73%) 390213.00 ( 41.62%)
TPut 69 274195.00 ( 0.00%) 368116.00 ( 34.25%) 393802.00 ( 43.62%) 391539.00 ( 42.80%)
TPut 70 269523.00 ( 0.00%) 372521.00 ( 38.21%) 381988.00 ( 41.73%) 360330.00 ( 33.69%)
TPut 71 264778.00 ( 0.00%) 372533.00 ( 40.70%) 377377.00 ( 42.53%) 395088.00 ( 49.21%)
TPut 72 265705.00 ( 0.00%) 359686.00 ( 35.37%) 390037.00 ( 46.79%) 399126.00 ( 50.21%)

Note that for lower numbers of warehouses numacore regresses and then
improves as the warehouses increase. The expected peak is 48 warehouses
(one per CPU) and note how numacore gets a 43.14% improvement there,
autonuma sees a 43.72% gain and balancenuma sees a 64.98% gain.

This explains why there was a big difference in reported figures. First, I
was using multiple JVMs, as ordinarily one would expect one JVM per node
with each JVM bound to a node, and multiple-JVM and single-JVM
configurations generate very different results. Second, there are massive
differences depending on whether THP is enabled or disabled. Lastly, as we
can see here, numacore regresses for small numbers of warehouses, which is
what I initially saw, but does very well as the number of warehouses
increases. specjbb reports based on the peak number of warehouses, so if
people were using just the specjbb score or were only testing peak numbers
of warehouses, they would see the performance gains but miss the
regressions.

SPECJBB PEAKS
3.7.0 3.7.0 3.7.0 3.7.0
rc6-stats-v5r1 rc6-numacore-20121123 rc6-autonuma-v28fastr4 rc6-thpmigrate-v6r10
Expctd Warehouse 48.00 ( 0.00%) 48.00 ( 0.00%) 48.00 ( 0.00%) 48.00 ( 0.00%)
Expctd Peak Bops 277058.00 ( 0.00%) 396594.00 ( 43.14%) 398184.00 ( 43.72%) 457085.00 ( 64.98%)
Actual Warehouse 24.00 ( 0.00%) 29.00 ( 20.83%) 24.00 ( 0.00%) 27.00 ( 12.50%)
Actual Peak Bops 410499.00 ( 0.00%) 527448.00 ( 28.49%) 506758.00 ( 23.45%) 561167.00 ( 36.70%)
SpecJBB Bops 139464.00 ( 0.00%) 190554.00 ( 36.63%) 199064.00 ( 42.74%) 213820.00 ( 53.32%)
SpecJBB Bops/JVM 139464.00 ( 0.00%) 190554.00 ( 36.63%) 199064.00 ( 42.74%) 213820.00 ( 53.32%)

Here you can see that numacore scales to a higher number of warehouses. It
sees a 43.14% performance gain at the expected peak (48 warehouses), a
28.49% gain at its actual peak and a 36.63% gain on the specjbb score. The
peaks are great, just not the smaller numbers of warehouses.

autonuma sees a 23.45% performance gain at its actual peak and a 42.74%
performance gain on the specjbb score.

balancenuma gets a 36.7% performance gain at its actual peak and a 53.32%
gain on the specjbb score.

MMTests Statistics: duration
3.7.0 3.7.0 3.7.0 3.7.0
rc6-stats-v5r1  rc6-numacore-20121123  rc6-autonuma-v28fastr4  rc6-thpmigrate-v6r10
User 317241.10 311543.98 314980.59 315357.34
System 105.47 2989.96 341.54 431.13
Elapsed 7432.59 7439.32 7433.84 7433.72

Same comments about the system CPU usage apply. numacore's is really high.
balancenuma's is higher than I'd like.

MMTests Statistics: vmstat
3.7.0 3.7.0 3.7.0 3.7.0
rc6-stats-v5r1  rc6-numacore-20121123  rc6-autonuma-v28fastr4  rc6-thpmigrate-v6r10
Page Ins 38252 38036 38212 37976
Page Outs 55364 59772 55704 54824
Swap Ins 0 0 0 0
Swap Outs 0 0 0 0
Direct pages scanned 0 0 0 0
Kswapd pages scanned 0 0 0 0
Kswapd pages reclaimed 0 0 0 0
Direct pages reclaimed 0 0 0 0
Kswapd efficiency 100% 100% 100% 100%
Kswapd velocity 0.000 0.000 0.000 0.000
Direct efficiency 100% 100% 100% 100%
Direct velocity 0.000 0.000 0.000 0.000
Percentage direct scans 0% 0% 0% 0%
Page writes by reclaim 0 0 0 0
Page writes file 0 0 0 0
Page writes anon 0 0 0 0
Page reclaim immediate 0 0 0 0
Page rescued immediate 0 0 0 0
Slabs scanned 0 0 0 0
Direct inode steals 0 0 0 0
Kswapd inode steals 0 0 0 0
Kswapd skipped wait 0 0 0 0
THP fault alloc 51908 43137 46165 49523
THP collapse alloc 62 3 179 59
THP splits 72 45 86 75
THP fault fallback 0 0 0 0
THP collapse fail 0 0 0 0
Compaction stalls 0 0 0 0
Compaction success 0 0 0 0
Compaction failures 0 0 0 0
Page migrate success 0 0 0 46917509
Page migrate failure 0 0 0 0
Compaction pages isolated 0 0 0 0
Compaction migrate scanned 0 0 0 0
Compaction free scanned 0 0 0 0
Compaction cost 0 0 0 48700
NUMA PTE updates 0 0 0 356453719
NUMA hint faults 0 0 0 2056190
NUMA hint local faults 0 0 0 752408
NUMA pages migrated 0 0 0 46917509
AutoNUMA cost 0 0 0 13667

Note that THP was certainly in use here. balancenuma migrated a lot more
than I'd like but it cannot be compared with numacore or autonuma at
this point.


SPECJBB: Single JVMs (one per node, 4 nodes), THP is disabled
3.7.0 3.7.0 3.7.0 3.7.0
rc6-stats-v5r1 rc6-numacore-20121123 rc6-autonuma-v28fastr4 rc6-thpmigrate-v6r10
TPut 1 20507.00 ( 0.00%) 16702.00 (-18.55%) 19496.00 ( -4.93%) 19831.00 ( -3.30%)
TPut 2 48723.00 ( 0.00%) 36714.00 (-24.65%) 49452.00 ( 1.50%) 45973.00 ( -5.64%)
TPut 3 72618.00 ( 0.00%) 59086.00 (-18.63%) 69728.00 ( -3.98%) 71996.00 ( -0.86%)
TPut 4 98383.00 ( 0.00%) 76940.00 (-21.80%) 98216.00 ( -0.17%) 95339.00 ( -3.09%)
TPut 5 122240.00 ( 0.00%) 95981.00 (-21.48%) 119822.00 ( -1.98%) 117487.00 ( -3.89%)
TPut 6 144010.00 ( 0.00%) 100095.00 (-30.49%) 141127.00 ( -2.00%) 143931.00 ( -0.05%)
TPut 7 164690.00 ( 0.00%) 119577.00 (-27.39%) 159922.00 ( -2.90%) 164073.00 ( -0.37%)
TPut 8 190702.00 ( 0.00%) 125183.00 (-34.36%) 189187.00 ( -0.79%) 180400.00 ( -5.40%)
TPut 9 209898.00 ( 0.00%) 137179.00 (-34.64%) 160205.00 (-23.67%) 206052.00 ( -1.83%)
TPut 10 234064.00 ( 0.00%) 140225.00 (-40.09%) 220768.00 ( -5.68%) 218224.00 ( -6.77%)
TPut 11 252408.00 ( 0.00%) 134453.00 (-46.73%) 250953.00 ( -0.58%) 248507.00 ( -1.55%)
TPut 12 278689.00 ( 0.00%) 140355.00 (-49.64%) 271815.00 ( -2.47%) 255907.00 ( -8.17%)
TPut 13 298940.00 ( 0.00%) 153780.00 (-48.56%) 190433.00 (-36.30%) 289418.00 ( -3.19%)
TPut 14 315971.00 ( 0.00%) 126929.00 (-59.83%) 309899.00 ( -1.92%) 283315.00 (-10.34%)
TPut 15 340446.00 ( 0.00%) 132710.00 (-61.02%) 290484.00 (-14.68%) 327168.00 ( -3.90%)
TPut 16 362010.00 ( 0.00%) 156255.00 (-56.84%) 347844.00 ( -3.91%) 311160.00 (-14.05%)
TPut 17 376476.00 ( 0.00%) 95441.00 (-74.65%) 333508.00 (-11.41%) 366629.00 ( -2.62%)
TPut 18 399230.00 ( 0.00%) 132993.00 (-66.69%) 374946.00 ( -6.08%) 358280.00 (-10.26%)
TPut 19 414300.00 ( 0.00%) 129194.00 (-68.82%) 392675.00 ( -5.22%) 363700.00 (-12.21%)
TPut 20 429780.00 ( 0.00%) 90068.00 (-79.04%) 241891.00 (-43.72%) 413210.00 ( -3.86%)
TPut 21 439977.00 ( 0.00%) 136793.00 (-68.91%) 412629.00 ( -6.22%) 398914.00 ( -9.33%)
TPut 22 459593.00 ( 0.00%) 134292.00 (-70.78%) 426511.00 ( -7.20%) 414652.00 ( -9.78%)
TPut 23 473600.00 ( 0.00%) 137794.00 (-70.90%) 436081.00 ( -7.92%) 421456.00 (-11.01%)
TPut 24 483442.00 ( 0.00%) 139342.00 (-71.18%) 390536.00 (-19.22%) 453552.00 ( -6.18%)
TPut 25 484584.00 ( 0.00%) 144745.00 (-70.13%) 430863.00 (-11.09%) 397971.00 (-17.87%)
TPut 26 483041.00 ( 0.00%) 145326.00 (-69.91%) 333960.00 (-30.86%) 454575.00 ( -5.89%)
TPut 27 480788.00 ( 0.00%) 145395.00 (-69.76%) 402433.00 (-16.30%) 415528.00 (-13.57%)
TPut 28 470141.00 ( 0.00%) 146261.00 (-68.89%) 385008.00 (-18.11%) 445938.00 ( -5.15%)
TPut 29 476984.00 ( 0.00%) 147988.00 (-68.97%) 379719.00 (-20.39%) 395984.00 (-16.98%)
TPut 30 471709.00 ( 0.00%) 148658.00 (-68.49%) 417249.00 (-11.55%) 424000.00 (-10.11%)
TPut 31 470451.00 ( 0.00%) 147949.00 (-68.55%) 408792.00 (-13.11%) 384502.00 (-18.27%)
TPut 32 468377.00 ( 0.00%) 158685.00 (-66.12%) 414694.00 (-11.46%) 405441.00 (-13.44%)
TPut 33 463536.00 ( 0.00%) 159097.00 (-65.68%) 412259.00 (-11.06%) 399323.00 (-13.85%)
TPut 34 457678.00 ( 0.00%) 153025.00 (-66.56%) 408133.00 (-10.83%) 402190.00 (-12.12%)
TPut 35 448181.00 ( 0.00%) 154037.00 (-65.63%) 405535.00 ( -9.52%) 422016.00 ( -5.84%)
TPut 36 450490.00 ( 0.00%) 149057.00 (-66.91%) 407218.00 ( -9.61%) 381320.00 (-15.35%)
TPut 37 435425.00 ( 0.00%) 153996.00 (-64.63%) 400370.00 ( -8.05%) 403088.00 ( -7.43%)
TPut 38 434985.00 ( 0.00%) 158683.00 (-63.52%) 408266.00 ( -6.14%) 406860.00 ( -6.47%)
TPut 39 425064.00 ( 0.00%) 160263.00 (-62.30%) 397737.00 ( -6.43%) 385657.00 ( -9.27%)
TPut 40 428366.00 ( 0.00%) 161150.00 (-62.38%) 383404.00 (-10.50%) 405984.00 ( -5.22%)
TPut 41 417072.00 ( 0.00%) 155817.00 (-62.64%) 394627.00 ( -5.38%) 398389.00 ( -4.48%)
TPut 42 398350.00 ( 0.00%) 156774.00 (-60.64%) 388583.00 ( -2.45%) 329310.00 (-17.33%)
TPut 43 405526.00 ( 0.00%) 162938.00 (-59.82%) 371761.00 ( -8.33%) 396379.00 ( -2.26%)
TPut 44 400696.00 ( 0.00%) 167164.00 (-58.28%) 372067.00 ( -7.14%) 373746.00 ( -6.73%)
TPut 45 391357.00 ( 0.00%) 163075.00 (-58.33%) 365494.00 ( -6.61%) 348089.00 (-11.06%)
TPut 46 394109.00 ( 0.00%) 173557.00 (-55.96%) 357955.00 ( -9.17%) 372188.00 ( -5.56%)
TPut 47 383292.00 ( 0.00%) 168575.00 (-56.02%) 357946.00 ( -6.61%) 352658.00 ( -7.99%)
TPut 48 373607.00 ( 0.00%) 158491.00 (-57.58%) 358227.00 ( -4.12%) 373779.00 ( 0.05%)
TPut 49 372131.00 ( 0.00%) 145881.00 (-60.80%) 360147.00 ( -3.22%) 358224.00 ( -3.74%)
TPut 50 369060.00 ( 0.00%) 139450.00 (-62.21%) 355721.00 ( -3.61%) 367608.00 ( -0.39%)
TPut 51 375906.00 ( 0.00%) 139823.00 (-62.80%) 367783.00 ( -2.16%) 364796.00 ( -2.96%)
TPut 52 379731.00 ( 0.00%) 158706.00 (-58.21%) 381289.00 ( 0.41%) 370100.00 ( -2.54%)
TPut 53 366656.00 ( 0.00%) 178068.00 (-51.43%) 382147.00 ( 4.22%) 369301.00 ( 0.72%)
TPut 54 373531.00 ( 0.00%) 177087.00 (-52.59%) 374892.00 ( 0.36%) 367863.00 ( -1.52%)
TPut 55 374440.00 ( 0.00%) 174830.00 (-53.31%) 372036.00 ( -0.64%) 377606.00 ( 0.85%)
TPut 56 351285.00 ( 0.00%) 175761.00 (-49.97%) 370602.00 ( 5.50%) 371896.00 ( 5.87%)
TPut 57 366069.00 ( 0.00%) 172227.00 (-52.95%) 377253.00 ( 3.06%) 364024.00 ( -0.56%)
TPut 58 367753.00 ( 0.00%) 174523.00 (-52.54%) 376854.00 ( 2.47%) 372580.00 ( 1.31%)
TPut 59 364282.00 ( 0.00%) 176119.00 (-51.65%) 365806.00 ( 0.42%) 370299.00 ( 1.65%)
TPut 60 372531.00 ( 0.00%) 175673.00 (-52.84%) 354662.00 ( -4.80%) 365126.00 ( -1.99%)
TPut 61 359648.00 ( 0.00%) 174686.00 (-51.43%) 365387.00 ( 1.60%) 370039.00 ( 2.89%)
TPut 62 361856.00 ( 0.00%) 171420.00 (-52.63%) 366173.00 ( 1.19%) 345029.00 ( -4.65%)
TPut 63 363032.00 ( 0.00%) 171603.00 (-52.73%) 360794.00 ( -0.62%) 349379.00 ( -3.76%)
TPut 64 351549.00 ( 0.00%) 170967.00 (-51.37%) 354632.00 ( 0.88%) 352406.00 ( 0.24%)
TPut 65 360425.00 ( 0.00%) 170349.00 (-52.74%) 346205.00 ( -3.95%) 351510.00 ( -2.47%)
TPut 66 359197.00 ( 0.00%) 170037.00 (-52.66%) 355970.00 ( -0.90%) 330963.00 ( -7.86%)
TPut 67 356962.00 ( 0.00%) 168949.00 (-52.67%) 355577.00 ( -0.39%) 358511.00 ( 0.43%)
TPut 68 360411.00 ( 0.00%) 167892.00 (-53.42%) 337932.00 ( -6.24%) 358516.00 ( -0.53%)
TPut 69 354346.00 ( 0.00%) 166288.00 (-53.07%) 334951.00 ( -5.47%) 360614.00 ( 1.77%)
TPut 70 354596.00 ( 0.00%) 166214.00 (-53.13%) 333059.00 ( -6.07%) 337859.00 ( -4.72%)
TPut 71 351838.00 ( 0.00%) 167198.00 (-52.48%) 316732.00 ( -9.98%) 350369.00 ( -0.42%)
TPut 72 357716.00 ( 0.00%) 164325.00 (-54.06%) 309282.00 (-13.54%) 353090.00 ( -1.29%)

Without THP, numacore suffers really badly. Neither autonuma nor
balancenuma does great. The reasons why balancenuma suffers have already
been explained -- the scan rate is not reducing, but this can be
addressed with a big hammer. A patch already exists that does that but
is not included here.

SPECJBB PEAKS
3.7.0 3.7.0 3.7.0 3.7.0
rc6-stats-v5r1 rc6-numacore-20121123 rc6-autonuma-v28fastr4 rc6-thpmigrate-v6r10
Expctd Warehouse 48.00 ( 0.00%) 48.00 ( 0.00%) 48.00 ( 0.00%) 48.00 ( 0.00%)
Expctd Peak Bops 373607.00 ( 0.00%) 158491.00 (-57.58%) 358227.00 ( -4.12%) 373779.00 ( 0.05%)
Actual Warehouse 25.00 ( 0.00%) 53.00 (112.00%) 23.00 ( -8.00%) 26.00 ( 4.00%)
Actual Peak Bops 484584.00 ( 0.00%) 178068.00 (-63.25%) 436081.00 (-10.01%) 454575.00 ( -6.19%)
SpecJBB Bops 185685.00 ( 0.00%) 85236.00 (-54.10%) 182329.00 ( -1.81%) 183908.00 ( -0.96%)
SpecJBB Bops/JVM 185685.00 ( 0.00%) 85236.00 (-54.10%) 182329.00 ( -1.81%) 183908.00 ( -0.96%)

numacore regresses 63.25% at its peak and has a 54.10% loss on its
specjbb score.

autonuma regresses 10.01% at its peak, 1.81% on the specjbb score.

balancenuma does "best" in that it regresses the least.

MMTests Statistics: duration
3.7.0 3.7.0 3.7.0 3.7.0
rc6-stats-v5r1  rc6-numacore-20121123  rc6-autonuma-v28fastr4  rc6-thpmigrate-v6r10
User 316094.47 169409.35 0.00 308074.71
System 62.67 123927.05 0.00 1897.43
Elapsed 7434.12 7452.00 0.00 7438.16

The autonuma file that stored the system CPU usage was truncated for some
reason. I've set it to rerun.

numacore's system CPU usage is massive.

balancenuma's is also far too high due to it failing to reduce the scan
rate.

So, now I'm seeing figures comparable to those reported elsewhere.
To get those figures you must use a single JVM, THP must be enabled and it
must run with a large enough number of warehouses. For other configurations
or lower numbers of warehouses, it can suffer.

--
Mel Gorman
SUSE Labs

2012-11-25 23:40:10

by Mel Gorman

[permalink] [raw]
Subject: Re: Comparison between three trees (was: Latest numa/core release, v17)

On Fri, Nov 23, 2012 at 05:32:05PM +0000, Mel Gorman wrote:

> <SNIP>
> SPECJBB: Single JVMs (one per node, 4 nodes), THP is enabled
>
> <SNIP>
> SPECJBB: Single JVMs (one per node, 4 nodes), THP is disabled

Just to clarify, the "JVMs (one per node, 4 nodes)" was a cut&paste
error. Single JVM meant that there was just one JVM running and it was
configured to use 80% of available RAM.

--
Mel Gorman
SUSE Labs

2012-11-26 09:38:34

by Mel Gorman

[permalink] [raw]
Subject: Re: Comparison between three trees (was: Latest numa/core release, v17)

On Sun, Nov 25, 2012 at 04:47:15PM +0800, Hillf Danton wrote:
> On 11/24/12, Mel Gorman <[email protected]> wrote:
> > Warning: This is an insanely long mail and there a lot of data here. Get
> > coffee or something.
> >
> > This is another round of comparisons between the latest released versions
> > of each of three automatic numa balancing trees that are out there.
> >
> > From the series "Automatic NUMA Balancing V5", the kernels tested were
> >
> > stats-v5r1 Patches 1-10. TLB optimisations, migration stats
> > thpmigrate-v5r1 Patches 1-37. Basic placement policy, PMD handling, THP
> > migration etc.
> > adaptscan-v5r1 Patches 1-38. Heavy handed PTE scan reduction
> > delaystart-v5r1 Patches 1-40. Delay the PTE scan until running on a new
> > node
> >
> > If I just say balancenuma, I mean the "delaystart-v5r1" kernel. The other
> > kernels are included so you can see the impact the scan rate adaption
> > patch has and what that might mean for a placement policy using a proper
> > feedback mechanism.
> >
> > The other two kernels were
> >
> > numacore-20121123 It was no longer clear what the deltas between releases and
> > the dependencies might be so I just pulled tip/master on November
> > 23rd, 2012. An earlier pull had serious difficulties and the patch
> > responsible has been dropped since. This is not a like-with-like
> > comparison as the tree contains numerous other patches but it's
> > the best available given the timeframe
> >
> > autonuma-v28fast This is a rebased version of Andrea's autonuma-v28fast
> > branch with Hugh's THP migration patch on top.
>
> FYI, based on how target huge page is selected,
>
> +
> +	new_page = alloc_pages_node(numa_node_id(),
> +		(GFP_TRANSHUGE | GFP_THISNODE) & ~__GFP_WAIT, HPAGE_PMD_ORDER);
>
> the thp replacement policy is changed to be MORON,
>

That is likely true. When rebasing a policy on top of balancenuma it is
important to keep an eye on which node is used as the migration target and
which node is passed to task_numa_fault(), and to confirm these are the
nodes the policy expects.
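
As a rough illustration of that check, here is a minimal sketch of the
fault-handling flow these trees use (the function names exist in the
balancenuma tree, but the exact code differs and the reference and error
handling is omitted):

/* Sketch only: the node reported to task_numa_fault() should be the
 * node the page actually ended up on after the policy decision.
 */
int curr_nid = page_to_nid(page);
int target_nid = mpol_misplaced(page, vma, addr);	/* policy's choice */
bool migrated = false;

if (target_nid != -1 && migrate_misplaced_page(page, target_nid)) {
	curr_nid = target_nid;			/* the page really moved */
	migrated = true;
}

/* Account the fault against the node the page now lives on */
task_numa_fault(curr_nid, 1, migrated);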

--
Mel Gorman
SUSE Labs

2012-11-26 13:33:53

by Mel Gorman

[permalink] [raw]
Subject: Re: Comparison between three trees (was: Latest numa/core release, v17)

On Fri, Nov 23, 2012 at 05:32:05PM +0000, Mel Gorman wrote:
> SPECJBB: Single JVMs (one per node, 4 nodes), THP is disabled
>
> 3.7.0 3.7.0 3.7.0 3.7.0 3.7.0 3.7.0
> rc6-stats-v5r1 rc6-numacore-20121123 rc6-autonuma-v28fastr4 rc6-thpmigrate-v5r1 rc6-adaptscan-v5r1 rc6-delaystart-v5r4
> TPut 1 20890.00 ( 0.00%) 18720.00 (-10.39%) 21127.00 ( 1.13%) 20376.00 ( -2.46%) 20806.00 ( -0.40%) 20698.00 ( -0.92%)
> TPut 2 48259.00 ( 0.00%) 38121.00 (-21.01%) 47920.00 ( -0.70%) 47085.00 ( -2.43%) 48594.00 ( 0.69%) 48094.00 ( -0.34%)
> TPut 3 73203.00 ( 0.00%) 60057.00 (-17.96%) 73630.00 ( 0.58%) 70241.00 ( -4.05%) 73418.00 ( 0.29%) 74016.00 ( 1.11%)
> TPut 4 98694.00 ( 0.00%) 73669.00 (-25.36%) 98929.00 ( 0.24%) 96721.00 ( -2.00%) 96797.00 ( -1.92%) 97930.00 ( -0.77%)
> TPut 5 122563.00 ( 0.00%) 98786.00 (-19.40%) 118969.00 ( -2.93%) 118045.00 ( -3.69%) 121553.00 ( -0.82%) 122781.00 ( 0.18%)
> TPut 6 144095.00 ( 0.00%) 114485.00 (-20.55%) 145328.00 ( 0.86%) 141713.00 ( -1.65%) 142589.00 ( -1.05%) 143771.00 ( -0.22%)
> TPut 7 166457.00 ( 0.00%) 112416.00 (-32.47%) 163503.00 ( -1.77%) 166971.00 ( 0.31%) 166788.00 ( 0.20%) 165188.00 ( -0.76%)
> TPut 8 191067.00 ( 0.00%) 122996.00 (-35.63%) 189477.00 ( -0.83%) 183090.00 ( -4.17%) 187710.00 ( -1.76%) 192157.00 ( 0.57%)
> TPut 9 210634.00 ( 0.00%) 141200.00 (-32.96%) 209639.00 ( -0.47%) 207968.00 ( -1.27%) 215216.00 ( 2.18%) 214222.00 ( 1.70%)
> TPut 10 234121.00 ( 0.00%) 129508.00 (-44.68%) 231221.00 ( -1.24%) 221553.00 ( -5.37%) 219998.00 ( -6.03%) 227193.00 ( -2.96%)
> TPut 11 257885.00 ( 0.00%) 131232.00 (-49.11%) 256568.00 ( -0.51%) 252734.00 ( -2.00%) 258433.00 ( 0.21%) 260534.00 ( 1.03%)
> TPut 12 271751.00 ( 0.00%) 154763.00 (-43.05%) 277319.00 ( 2.05%) 277154.00 ( 1.99%) 265747.00 ( -2.21%) 262285.00 ( -3.48%)
> TPut 13 297457.00 ( 0.00%) 119716.00 (-59.75%) 296068.00 ( -0.47%) 289716.00 ( -2.60%) 276527.00 ( -7.04%) 293199.00 ( -1.43%)
> TPut 14 319074.00 ( 0.00%) 129730.00 (-59.34%) 311604.00 ( -2.34%) 308798.00 ( -3.22%) 316807.00 ( -0.71%) 275748.00 (-13.58%)
> TPut 15 337859.00 ( 0.00%) 177494.00 (-47.47%) 329288.00 ( -2.54%) 300463.00 (-11.07%) 305116.00 ( -9.69%) 287814.00 (-14.81%)
> TPut 16 356396.00 ( 0.00%) 145173.00 (-59.27%) 355616.00 ( -0.22%) 342598.00 ( -3.87%) 364077.00 ( 2.16%) 339649.00 ( -4.70%)
> TPut 17 373925.00 ( 0.00%) 176956.00 (-52.68%) 368589.00 ( -1.43%) 360917.00 ( -3.48%) 366043.00 ( -2.11%) 345586.00 ( -7.58%)
> TPut 18 388373.00 ( 0.00%) 150100.00 (-61.35%) 372873.00 ( -3.99%) 389062.00 ( 0.18%) 386779.00 ( -0.41%) 370871.00 ( -4.51%)
>
> balancenuma suffered here. It is very likely that it was not able to handle
> faults at a PMD level due to the lack of THP and I would expect that the
> pages within a PMD boundary are not on the same node so pmd_numa is not
> set. This results in its worst case of always having to deal with PTE
> faults. Further, it must be migrating many or almost all of these because
> the adaptscan patch made no difference. This is a worst-case scenario for
> balancenuma. The scan rates later will indicate if that was the case.
>

This worst case for balancenuma can be hit with a hammer to some extent
(patch below) but the results are too variable to be considered useful. The
headline figures say that balancenuma comes back in line with mainline so
it's not regressing but the devil is in the details. It regresses less,
but balancenuma's worst-case scenario still hurts. I'm not including the
patch in the tree because the right answer is to rebase a scheduling and
placement policy on top that results in fewer migrations.

However, for reference here is how the hammer affects the results for a
single JVM with THP disabled. adaptalways-v6r12 is the hammer.

SPECJBB BOPS
3.7.0 3.7.0 3.7.0 3.7.0 3.7.0
rc6-stats-v5r1 rc6-numacore-20121123 rc6-autonuma-v28fastr4 rc6-thpmigrate-v6r10 rc6-adaptalways-v6r12
TPut 1 20507.00 ( 0.00%) 16702.00 (-18.55%) 19496.00 ( -4.93%) 19831.00 ( -3.30%) 20539.00 ( 0.16%)
TPut 2 48723.00 ( 0.00%) 36714.00 (-24.65%) 49452.00 ( 1.50%) 45973.00 ( -5.64%) 47664.00 ( -2.17%)
TPut 3 72618.00 ( 0.00%) 59086.00 (-18.63%) 69728.00 ( -3.98%) 71996.00 ( -0.86%) 71917.00 ( -0.97%)
TPut 4 98383.00 ( 0.00%) 76940.00 (-21.80%) 98216.00 ( -0.17%) 95339.00 ( -3.09%) 96118.00 ( -2.30%)
TPut 5 122240.00 ( 0.00%) 95981.00 (-21.48%) 119822.00 ( -1.98%) 117487.00 ( -3.89%) 121080.00 ( -0.95%)
TPut 6 144010.00 ( 0.00%) 100095.00 (-30.49%) 141127.00 ( -2.00%) 143931.00 ( -0.05%) 141666.00 ( -1.63%)
TPut 7 164690.00 ( 0.00%) 119577.00 (-27.39%) 159922.00 ( -2.90%) 164073.00 ( -0.37%) 163861.00 ( -0.50%)
TPut 8 190702.00 ( 0.00%) 125183.00 (-34.36%) 189187.00 ( -0.79%) 180400.00 ( -5.40%) 187520.00 ( -1.67%)
TPut 9 209898.00 ( 0.00%) 137179.00 (-34.64%) 160205.00 (-23.67%) 206052.00 ( -1.83%) 214639.00 ( 2.26%)
TPut 10 234064.00 ( 0.00%) 140225.00 (-40.09%) 220768.00 ( -5.68%) 218224.00 ( -6.77%) 224924.00 ( -3.90%)
TPut 11 252408.00 ( 0.00%) 134453.00 (-46.73%) 250953.00 ( -0.58%) 248507.00 ( -1.55%) 247219.00 ( -2.06%)
TPut 12 278689.00 ( 0.00%) 140355.00 (-49.64%) 271815.00 ( -2.47%) 255907.00 ( -8.17%) 266701.00 ( -4.30%)
TPut 13 298940.00 ( 0.00%) 153780.00 (-48.56%) 190433.00 (-36.30%) 289418.00 ( -3.19%) 269335.00 ( -9.90%)
TPut 14 315971.00 ( 0.00%) 126929.00 (-59.83%) 309899.00 ( -1.92%) 283315.00 (-10.34%) 308350.00 ( -2.41%)
TPut 15 340446.00 ( 0.00%) 132710.00 (-61.02%) 290484.00 (-14.68%) 327168.00 ( -3.90%) 342031.00 ( 0.47%)
TPut 16 362010.00 ( 0.00%) 156255.00 (-56.84%) 347844.00 ( -3.91%) 311160.00 (-14.05%) 360196.00 ( -0.50%)
TPut 17 376476.00 ( 0.00%) 95441.00 (-74.65%) 333508.00 (-11.41%) 366629.00 ( -2.62%) 341397.00 ( -9.32%)
TPut 18 399230.00 ( 0.00%) 132993.00 (-66.69%) 374946.00 ( -6.08%) 358280.00 (-10.26%) 324370.00 (-18.75%)
TPut 19 414300.00 ( 0.00%) 129194.00 (-68.82%) 392675.00 ( -5.22%) 363700.00 (-12.21%) 368777.00 (-10.99%)
TPut 20 429780.00 ( 0.00%) 90068.00 (-79.04%) 241891.00 (-43.72%) 413210.00 ( -3.86%) 351444.00 (-18.23%)
TPut 21 439977.00 ( 0.00%) 136793.00 (-68.91%) 412629.00 ( -6.22%) 398914.00 ( -9.33%) 442260.00 ( 0.52%)
TPut 22 459593.00 ( 0.00%) 134292.00 (-70.78%) 426511.00 ( -7.20%) 414652.00 ( -9.78%) 422916.00 ( -7.98%)
TPut 23 473600.00 ( 0.00%) 137794.00 (-70.90%) 436081.00 ( -7.92%) 421456.00 (-11.01%) 359619.00 (-24.07%)
TPut 24 483442.00 ( 0.00%) 139342.00 (-71.18%) 390536.00 (-19.22%) 453552.00 ( -6.18%) 486759.00 ( 0.69%)
TPut 25 484584.00 ( 0.00%) 144745.00 (-70.13%) 430863.00 (-11.09%) 397971.00 (-17.87%) 396648.00 (-18.15%)
TPut 26 483041.00 ( 0.00%) 145326.00 (-69.91%) 333960.00 (-30.86%) 454575.00 ( -5.89%) 472979.00 ( -2.08%)
TPut 27 480788.00 ( 0.00%) 145395.00 (-69.76%) 402433.00 (-16.30%) 415528.00 (-13.57%) 418540.00 (-12.95%)
TPut 28 470141.00 ( 0.00%) 146261.00 (-68.89%) 385008.00 (-18.11%) 445938.00 ( -5.15%) 455615.00 ( -3.09%)
TPut 29 476984.00 ( 0.00%) 147988.00 (-68.97%) 379719.00 (-20.39%) 395984.00 (-16.98%) 479828.00 ( 0.60%)
TPut 30 471709.00 ( 0.00%) 148658.00 (-68.49%) 417249.00 (-11.55%) 424000.00 (-10.11%) 435163.00 ( -7.75%)
TPut 31 470451.00 ( 0.00%) 147949.00 (-68.55%) 408792.00 (-13.11%) 384502.00 (-18.27%) 415069.00 (-11.77%)
TPut 32 468377.00 ( 0.00%) 158685.00 (-66.12%) 414694.00 (-11.46%) 405441.00 (-13.44%) 468585.00 ( 0.04%)
TPut 33 463536.00 ( 0.00%) 159097.00 (-65.68%) 412259.00 (-11.06%) 399323.00 (-13.85%) 455622.00 ( -1.71%)
TPut 34 457678.00 ( 0.00%) 153025.00 (-66.56%) 408133.00 (-10.83%) 402190.00 (-12.12%) 432962.00 ( -5.40%)
TPut 35 448181.00 ( 0.00%) 154037.00 (-65.63%) 405535.00 ( -9.52%) 422016.00 ( -5.84%) 452914.00 ( 1.06%)
TPut 36 450490.00 ( 0.00%) 149057.00 (-66.91%) 407218.00 ( -9.61%) 381320.00 (-15.35%) 427438.00 ( -5.12%)
TPut 37 435425.00 ( 0.00%) 153996.00 (-64.63%) 400370.00 ( -8.05%) 403088.00 ( -7.43%) 381348.00 (-12.42%)
TPut 38 434985.00 ( 0.00%) 158683.00 (-63.52%) 408266.00 ( -6.14%) 406860.00 ( -6.47%) 404181.00 ( -7.08%)
TPut 39 425064.00 ( 0.00%) 160263.00 (-62.30%) 397737.00 ( -6.43%) 385657.00 ( -9.27%) 425414.00 ( 0.08%)
TPut 40 428366.00 ( 0.00%) 161150.00 (-62.38%) 383404.00 (-10.50%) 405984.00 ( -5.22%) 444815.00 ( 3.84%)
TPut 41 417072.00 ( 0.00%) 155817.00 (-62.64%) 394627.00 ( -5.38%) 398389.00 ( -4.48%) 391735.00 ( -6.07%)
TPut 42 398350.00 ( 0.00%) 156774.00 (-60.64%) 388583.00 ( -2.45%) 329310.00 (-17.33%) 430361.00 ( 8.04%)
TPut 43 405526.00 ( 0.00%) 162938.00 (-59.82%) 371761.00 ( -8.33%) 396379.00 ( -2.26%) 397849.00 ( -1.89%)
TPut 44 400696.00 ( 0.00%) 167164.00 (-58.28%) 372067.00 ( -7.14%) 373746.00 ( -6.73%) 388050.00 ( -3.16%)
TPut 45 391357.00 ( 0.00%) 163075.00 (-58.33%) 365494.00 ( -6.61%) 348089.00 (-11.06%) 414737.00 ( 5.97%)
TPut 46 394109.00 ( 0.00%) 173557.00 (-55.96%) 357955.00 ( -9.17%) 372188.00 ( -5.56%) 400373.00 ( 1.59%)
TPut 47 383292.00 ( 0.00%) 168575.00 (-56.02%) 357946.00 ( -6.61%) 352658.00 ( -7.99%) 395851.00 ( 3.28%)
TPut 48 373607.00 ( 0.00%) 158491.00 (-57.58%) 358227.00 ( -4.12%) 373779.00 ( 0.05%) 388631.00 ( 4.02%)
TPut 49 372131.00 ( 0.00%) 145881.00 (-60.80%) 360147.00 ( -3.22%) 358224.00 ( -3.74%) 377922.00 ( 1.56%)
TPut 50 369060.00 ( 0.00%) 139450.00 (-62.21%) 355721.00 ( -3.61%) 367608.00 ( -0.39%) 369852.00 ( 0.21%)
TPut 51 375906.00 ( 0.00%) 139823.00 (-62.80%) 367783.00 ( -2.16%) 364796.00 ( -2.96%) 353863.00 ( -5.86%)
TPut 52 379731.00 ( 0.00%) 158706.00 (-58.21%) 381289.00 ( 0.41%) 370100.00 ( -2.54%) 379472.00 ( -0.07%)
TPut 53 366656.00 ( 0.00%) 178068.00 (-51.43%) 382147.00 ( 4.22%) 369301.00 ( 0.72%) 376606.00 ( 2.71%)
TPut 54 373531.00 ( 0.00%) 177087.00 (-52.59%) 374892.00 ( 0.36%) 367863.00 ( -1.52%) 372560.00 ( -0.26%)
TPut 55 374440.00 ( 0.00%) 174830.00 (-53.31%) 372036.00 ( -0.64%) 377606.00 ( 0.85%) 375134.00 ( 0.19%)
TPut 56 351285.00 ( 0.00%) 175761.00 (-49.97%) 370602.00 ( 5.50%) 371896.00 ( 5.87%) 366349.00 ( 4.29%)
TPut 57 366069.00 ( 0.00%) 172227.00 (-52.95%) 377253.00 ( 3.06%) 364024.00 ( -0.56%) 367468.00 ( 0.38%)
TPut 58 367753.00 ( 0.00%) 174523.00 (-52.54%) 376854.00 ( 2.47%) 372580.00 ( 1.31%) 363218.00 ( -1.23%)
TPut 59 364282.00 ( 0.00%) 176119.00 (-51.65%) 365806.00 ( 0.42%) 370299.00 ( 1.65%) 367422.00 ( 0.86%)
TPut 60 372531.00 ( 0.00%) 175673.00 (-52.84%) 354662.00 ( -4.80%) 365126.00 ( -1.99%) 372139.00 ( -0.11%)
TPut 61 359648.00 ( 0.00%) 174686.00 (-51.43%) 365387.00 ( 1.60%) 370039.00 ( 2.89%) 368296.00 ( 2.40%)
TPut 62 361856.00 ( 0.00%) 171420.00 (-52.63%) 366173.00 ( 1.19%) 345029.00 ( -4.65%) 368224.00 ( 1.76%)
TPut 63 363032.00 ( 0.00%) 171603.00 (-52.73%) 360794.00 ( -0.62%) 349379.00 ( -3.76%) 364463.00 ( 0.39%)
TPut 64 351549.00 ( 0.00%) 170967.00 (-51.37%) 354632.00 ( 0.88%) 352406.00 ( 0.24%) 365522.00 ( 3.97%)
TPut 65 360425.00 ( 0.00%) 170349.00 (-52.74%) 346205.00 ( -3.95%) 351510.00 ( -2.47%) 360351.00 ( -0.02%)
TPut 66 359197.00 ( 0.00%) 170037.00 (-52.66%) 355970.00 ( -0.90%) 330963.00 ( -7.86%) 347958.00 ( -3.13%)
TPut 67 356962.00 ( 0.00%) 168949.00 (-52.67%) 355577.00 ( -0.39%) 358511.00 ( 0.43%) 371059.00 ( 3.95%)
TPut 68 360411.00 ( 0.00%) 167892.00 (-53.42%) 337932.00 ( -6.24%) 358516.00 ( -0.53%) 361518.00 ( 0.31%)
TPut 69 354346.00 ( 0.00%) 166288.00 (-53.07%) 334951.00 ( -5.47%) 360614.00 ( 1.77%) 367286.00 ( 3.65%)
TPut 70 354596.00 ( 0.00%) 166214.00 (-53.13%) 333059.00 ( -6.07%) 337859.00 ( -4.72%) 350505.00 ( -1.15%)
TPut 71 351838.00 ( 0.00%) 167198.00 (-52.48%) 316732.00 ( -9.98%) 350369.00 ( -0.42%) 353104.00 ( 0.36%)
TPut 72 357716.00 ( 0.00%) 164325.00 (-54.06%) 309282.00 (-13.54%) 353090.00 ( -1.29%) 339898.00 ( -4.98%)

adaptalways reduces the scanning rate on every fault. It mitigates many
of the worst of the regressions but does not eliminate them because there
are still remote faults and migrations.

SPECJBB PEAKS
3.7.0 3.7.0 3.7.0 3.7.0 3.7.0
rc6-stats-v5r1 rc6-numacore-20121123 rc6-autonuma-v28fastr4 rc6-thpmigrate-v6r10 rc6-adaptalways-v6r12
Expctd Warehouse 48.00 ( 0.00%) 48.00 ( 0.00%) 48.00 ( 0.00%) 48.00 ( 0.00%) 48.00 ( 0.00%)
Expctd Peak Bops 373607.00 ( 0.00%) 158491.00 (-57.58%) 358227.00 ( -4.12%) 373779.00 ( 0.05%) 388631.00 ( 4.02%)
Actual Warehouse 25.00 ( 0.00%) 53.00 (112.00%) 23.00 ( -8.00%) 26.00 ( 4.00%) 24.00 ( -4.00%)
Actual Peak Bops 484584.00 ( 0.00%) 178068.00 (-63.25%) 436081.00 (-10.01%) 454575.00 ( -6.19%) 486759.00 ( 0.45%)
SpecJBB Bops 185685.00 ( 0.00%) 85236.00 (-54.10%) 182329.00 ( -1.81%) 183908.00 ( -0.96%) 186711.00 ( 0.55%)
SpecJBB Bops/JVM 185685.00 ( 0.00%) 85236.00 (-54.10%) 182329.00 ( -1.81%) 183908.00 ( -0.96%) 186711.00 ( 0.55%)

The actual peak performance figures look ok though, and if you were just
looking at the headline figures you might be tempted to conclude that the
patch works, but the per-warehouse figures show that is not really the
case at all.

MMTests Statistics: duration
3.7.0 3.7.0 3.7.0 3.7.0 3.7.0
rc6-stats-v5r1  rc6-numacore-20121123  rc6-autonuma-v28fastr4  rc6-thpmigrate-v6r10  rc6-adaptalways-v6r12
User 316094.47 169409.35 308316.22 308074.71 309256.18
System 62.67 123927.05 4304.26 1897.43 1650.29
Elapsed 7434.12 7452.00 7439.70 7438.16 7437.24

It does reduce system CPU usage a bit but the fact is that it's still
migrating uselessly.

MMTests Statistics: vmstat
3.7.0 3.7.0 3.7.0 3.7.0 3.7.0
rc6-stats-v5r1  rc6-numacore-20121123  rc6-autonuma-v28fastr4  rc6-thpmigrate-v6r10  rc6-adaptalways-v6r12
Page Ins 34248 37888 38048 38148 38076
Page Outs 50932 60036 54448 55196 55368
Swap Ins 0 0 0 0 0
Swap Outs 0 0 0 0 0
Direct pages scanned 0 0 0 0 0
Kswapd pages scanned 0 0 0 0 0
Kswapd pages reclaimed 0 0 0 0 0
Direct pages reclaimed 0 0 0 0 0
Kswapd efficiency 100% 100% 100% 100% 100%
Kswapd velocity 0.000 0.000 0.000 0.000 0.000
Direct efficiency 100% 100% 100% 100% 100%
Direct velocity 0.000 0.000 0.000 0.000 0.000
Percentage direct scans 0% 0% 0% 0% 0%
Page writes by reclaim 0 0 0 0 0
Page writes file 0 0 0 0 0
Page writes anon 0 0 0 0 0
Page reclaim immediate 0 0 0 0 0
Page rescued immediate 0 0 0 0 0
Slabs scanned 0 0 0 0 0
Direct inode steals 0 0 0 0 0
Kswapd inode steals 0 0 0 0 0
Kswapd skipped wait 0 0 0 0 0
THP fault alloc 3 3 3 3 2
THP collapse alloc 0 0 12 0 0
THP splits 0 0 0 0 0
THP fault fallback 0 0 0 0 0
THP collapse fail 0 0 0 0 0
Compaction stalls 0 0 0 0 0
Compaction success 0 0 0 0 0
Compaction failures 0 0 0 0 0
Page migrate success 0 0 0 27257642 22698940
Page migrate failure 0 0 0 0 0
Compaction pages isolated 0 0 0 0 0
Compaction migrate scanned 0 0 0 0 0
Compaction free scanned 0 0 0 0 0
Compaction cost 0 0 0 28293 23561
NUMA PTE updates 0 0 0 220482204 187969232
NUMA hint faults 0 0 0 214660099 183397210
NUMA hint local faults 0 0 0 55657689 47359679
NUMA pages migrated 0 0 0 27257642 22698940
AutoNUMA cost 0 0 0 1075361 918733

Note that it alters the number of PTEs that are updated and the number
of faults but not enough to make a difference. Far too many of those
NUMA faults were remote and resulted in migration: only about 26% of the
hinting faults were local in either case (55657689 of 214660099 for
thpmigrate-v6r10 and 47359679 of 183397210 for adaptalways-v6r12).

Here is the "hammer" for reference but I'll not be including it.

---8<---
mm: sched: Adapt the scanning rate even if a NUMA hinting fault migrates

specjbb on a single JVM with balancenuma indicated that the scan rate was
not reducing and performance was impaired. The problem was that
the threads were getting scheduled between nodes and balancenuma was
migrating the pages around in circles uselessly. It needs a scheduling
policy that makes tasks stickier to a node if much of their memory is
there.

In the meantime, I have a hammer and this problem looks mighty like a
nail.

Signed-off-by: Mel Gorman <[email protected]>
---
kernel/sched/fair.c | 9 ++++++++-
1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index fd9c78c..ed54789 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -818,11 +818,18 @@ void task_numa_fault(int node, int pages, bool migrated)
 
 	/*
 	 * If pages are properly placed (did not migrate) then scan slower.
-	 * This is reset periodically in case of phase changes
+	 * This is reset periodically in case of phase changes. If the page
+	 * was migrated, we still slow the scan rate but less. If the
+	 * workload is not converging at all, at least it will update
+	 * fewer PTEs and stop trashing around but in ideal circumstances it
+	 * also means we converge slower.
 	 */
 	if (!migrated)
 		p->numa_scan_period = min(sysctl_balance_numa_scan_period_max,
 			p->numa_scan_period + jiffies_to_msecs(10));
+	else
+		p->numa_scan_period = min(sysctl_balance_numa_scan_period_max,
+			p->numa_scan_period + jiffies_to_msecs(5));
 
 	task_numa_placement(p);
 }

2012-11-26 20:34:22

by Sasha Levin

[permalink] [raw]
Subject: Re: [PATCH 19/33] sched: Add adaptive NUMA affinity support

Hi all,

On 11/22/2012 05:49 PM, Ingo Molnar wrote:
> +static void task_numa_placement(struct task_struct *p)
> +{
> + int seq = ACCESS_ONCE(p->mm->numa_scan_seq);

I was fuzzing with trinity on my fake numa setup, and discovered that this can
be called for task_structs with p->mm == NULL, which would cause things like:

[ 1140.001957] BUG: unable to handle kernel NULL pointer dereference at 00000000000006d0
[ 1140.010037] IP: [<ffffffff81157627>] task_numa_placement+0x27/0x1a0
[ 1140.015020] PGD 9b002067 PUD 9fb3c067 PMD 14a89067 PTE 5a4098bf040
[ 1140.015020] Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
[ 1140.015020] Dumping ftrace buffer:
[ 1140.015020] (ftrace buffer empty)
[ 1140.015020] CPU 1
[ 1140.015020] Pid: 3179, comm: ksmd Tainted: G W 3.7.0-rc6-next-20121126-sasha-00015-gb04382b-dirty #200
[ 1140.015020] RIP: 0010:[<ffffffff81157627>] [<ffffffff81157627>] task_numa_placement+0x27/0x1a0
[ 1140.015020] RSP: 0018:ffff8800bfae5b08 EFLAGS: 00010292
[ 1140.015020] RAX: 0000000000000000 RBX: ffff8800bfaeb000 RCX: 0000000000000001
[ 1140.015020] RDX: ffff880007c00000 RSI: 000000000000000e RDI: ffff8800bfaeb000
[ 1140.015020] RBP: ffff8800bfae5b38 R08: ffff8800bf805e00 R09: ffff880000369000
[ 1140.015020] R10: 0000000000000000 R11: 0000000000000000 R12: 000000000000000e
[ 1140.015020] R13: 0000000000000004 R14: 0000000000000001 R15: 0000000000000064
[ 1140.015020] FS: 0000000000000000(0000) GS:ffff880007c00000(0000) knlGS:0000000000000000
[ 1140.015020] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1140.015020] CR2: 00000000000006d0 CR3: 0000000097b18000 CR4: 00000000000406e0
[ 1140.015020] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 1140.015020] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 1140.015020] Process ksmd (pid: 3179, threadinfo ffff8800bfae4000, task ffff8800bfaeb000)
[ 1140.015020] Stack:
[ 1140.015020] 0000000000000000 0000000000000000 000000000000000e ffff8800bfaeb000
[ 1140.015020] 000000000000000e 0000000000000004 ffff8800bfae5b88 ffffffff8115a577
[ 1140.015020] ffff8800bfae5b68 ffffffff00000001 ffff88000c1d0068 ffffea0000ec1000
[ 1140.015020] Call Trace:
[ 1140.015020] [<ffffffff8115a577>] task_numa_fault+0xb7/0xd0
[ 1140.015020] [<ffffffff81230d96>] do_numa_page.isra.42+0x1b6/0x270
[ 1140.015020] [<ffffffff8126fe08>] ? mem_cgroup_count_vm_event+0x178/0x1a0
[ 1140.015020] [<ffffffff812333f4>] handle_pte_fault+0x174/0x220
[ 1140.015020] [<ffffffff819e7ad9>] ? __const_udelay+0x29/0x30
[ 1140.015020] [<ffffffff81234780>] handle_mm_fault+0x320/0x350
[ 1140.015020] [<ffffffff81256845>] break_ksm+0x65/0xc0
[ 1140.015020] [<ffffffff81256b4d>] break_cow+0x5d/0x80
[ 1140.015020] [<ffffffff81258442>] cmp_and_merge_page+0x122/0x1e0
[ 1140.015020] [<ffffffff81258565>] ksm_do_scan+0x65/0xa0
[ 1140.015020] [<ffffffff8125860f>] ksm_scan_thread+0x6f/0x2d0
[ 1140.015020] [<ffffffff8113b990>] ? abort_exclusive_wait+0xb0/0xb0
[ 1140.015020] [<ffffffff812585a0>] ? ksm_do_scan+0xa0/0xa0
[ 1140.015020] [<ffffffff8113a723>] kthread+0xe3/0xf0
[ 1140.015020] [<ffffffff8113a640>] ? __kthread_bind+0x40/0x40
[ 1140.015020] [<ffffffff83c8813c>] ret_from_fork+0x7c/0xb0
[ 1140.015020] [<ffffffff8113a640>] ? __kthread_bind+0x40/0x40
[ 1140.015020] Code: 00 00 00 00 55 48 89 e5 41 55 41 54 53 48 89 fb 48 83 ec 18 48 c7 45 d0 00 00 00 00 48 8b 87 a0 04 00 00 48
c7 45 d8 00 00 00 00 <8b> 80 d0 06 00 00 39 87 d4 15 00 00 0f 84 57 01 00 00 89 87 d4
[ 1140.015020] RIP [<ffffffff81157627>] task_numa_placement+0x27/0x1a0
[ 1140.015020] RSP <ffff8800bfae5b08>
[ 1140.015020] CR2: 00000000000006d0
[ 1140.660568] ---[ end trace 9f1fd31243556513 ]---
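
For what it's worth, a guard along the following lines avoids the
dereference (illustrative only, I'm not claiming this is the right fix;
kernel threads such as ksmd simply have no mm of their own):

/* Illustrative guard: bail out for tasks without an mm before
 * touching p->mm->numa_scan_seq.
 */
static void task_numa_placement(struct task_struct *p)
{
	int seq;

	if (!p->mm)
		return;

	seq = ACCESS_ONCE(p->mm->numa_scan_seq);
	/* ... rest of the placement logic unchanged ... */
}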

In exchange for this bug report, I have a couple of questions about this NUMA code which I wasn't
able to answer myself :)

- In this case, would it mean that KSM may run on one node, but scan the memory of a different node?
- If yes, we should migrate KSM to each node we scan, right? Or possibly start a dedicated KSM
thread for each NUMA node?
- Is there a class of per-numa threads in the works?


Thanks,
Sasha