2012-11-19 02:15:21

by Ingo Molnar

Subject: [PATCH 00/27] Latest numa/core release, v16

I'm pleased to announce the latest version of the numa/core tree.

Here are some quick, preliminary performance numbers on a 4-node,
32-way, 64 GB RAM system:

CONFIG_NUMA_BALANCING=y
-----------------------------------------------------------------------
[ seconds            ]    v3.7   AutoNUMA   |  numa/core-v16   [ vs. v3.7 ]
[ lower is better    ]   -----   --------   |  -------------    -----------
                                            |
numa01                   340.3      192.3   |          139.4        +144.1%
numa01_THREAD_ALLOC      425.1      135.1   |          121.1        +251.0%
numa02                    56.1       25.3   |           17.5        +220.5%
                                            |
[ SPECjbb transactions/sec ]                |
[ higher is better          ]               |
                                            |
SPECjbb single-1x32       524k       507k   |           638k         +21.7%
-----------------------------------------------------------------------

On my NUMA system the numa/core tree now significantly outperforms both
the vanilla kernel and the AutoNUMA (v28) kernel in these benchmarks.
No NUMA balancing kernel has ever performed this well on this system.

It is notable that workloads where 'private' processing dominates
(numa01_THREAD_ALLOC and numa02) now come very close to bare-metal
hard-binding performance.

These are the main changes in this release:

- There are countless performance improvements. The shared/private
distinction metric introduced in v15 has been refined further and is
now used in more places within the scheduler, so that convergence is
quicker and more directed.

- I restructured the whole tree to make it cleaner, to simplify its
mm/ impact and in general to make it more mergeable. It now contains
only agreed-upon patches, plus the bare essentials needed to make the
CONFIG_NUMA_BALANCING=y feature work. It is fully bisect-tested - it
builds and works at every point.

- The hard-coded "PROT_NONE" feature that reviewers complained about
is now factored out and selectable on a per-architecture basis.
(The arch porting aspect of this is untested, but the basic fabric
is there and should be pretty close to what we need.)

The generic PROT_NONE based facility can be used by architectures
to prototype this feature quickly.

- I tried to pick up all fixes that were sent. Many thanks go to
Hugh Dickins and Johannes Weiner! If I missed any fix or review
feedback, please re-send, as the code base has changed
significantly.

Future plans are to concentrate on converging 'shared' workloads even
better, to address any pending review feedback, and to fix any
remaining regressions.

Bug reports, review feedback and suggestions are welcome!

Thanks,

Ingo

------------>

Andrea Arcangeli (1):
numa, mm: Support NUMA hinting page faults from gup/gup_fast

Ingo Molnar (9):
mm: Optimize the TLB flush of sys_mprotect() and change_protection()
users
sched, mm, numa: Create generic NUMA fault infrastructure, with
architectures overrides
sched, mm, x86: Add the ARCH_SUPPORTS_NUMA_BALANCING flag
sched, numa, mm: Interleave shared tasks
sched: Implement NUMA scanning backoff
sched: Improve convergence
sched: Introduce staged average NUMA faults
sched: Track groups of shared tasks
sched: Use the best-buddy 'ideal cpu' in balancing decisions

Peter Zijlstra (11):
mm: Count the number of pages affected in change_protection()
sched, numa, mm: Add last_cpu to page flags
sched: Make find_busiest_queue() a method
sched, numa, mm: Describe the NUMA scheduling problem formally
mm/migrate: Introduce migrate_misplaced_page()
sched, numa, mm, arch: Add variable locality exception
sched, numa, mm: Add the scanning page fault machinery
sched: Add adaptive NUMA affinity support
sched: Implement constant, per task Working Set Sampling (WSS) rate
sched, numa, mm: Count WS scanning against present PTEs, not virtual
memory ranges
sched: Implement slow start for working set sampling

Rik van Riel (6):
mm/generic: Only flush the local TLB in ptep_set_access_flags()
x86/mm: Only do a local tlb flush in ptep_set_access_flags()
x86/mm: Introduce pte_accessible()
mm: Only flush the TLB when clearing an accessible pte
x86/mm: Completely drop the TLB flush from ptep_set_access_flags()
sched, numa, mm: Add credits for NUMA placement

CREDITS | 1 +
Documentation/scheduler/numa-problem.txt | 236 ++++++
arch/sh/mm/Kconfig | 1 +
arch/x86/Kconfig | 2 +
arch/x86/include/asm/pgtable.h | 6 +
arch/x86/mm/pgtable.c | 8 +-
include/asm-generic/pgtable.h | 59 ++
include/linux/huge_mm.h | 12 +
include/linux/hugetlb.h | 8 +-
include/linux/init_task.h | 8 +
include/linux/mempolicy.h | 8 +
include/linux/migrate.h | 7 +
include/linux/migrate_mode.h | 3 +
include/linux/mm.h | 99 ++-
include/linux/mm_types.h | 10 +
include/linux/mmzone.h | 14 +-
include/linux/page-flags-layout.h | 83 ++
include/linux/sched.h | 52 +-
init/Kconfig | 81 ++
kernel/bounds.c | 4 +
kernel/sched/core.c | 79 +-
kernel/sched/fair.c | 1227 +++++++++++++++++++++++++-----
kernel/sched/features.h | 10 +
kernel/sched/sched.h | 38 +-
kernel/sysctl.c | 45 +-
mm/Makefile | 1 +
mm/huge_memory.c | 163 ++++
mm/hugetlb.c | 10 +-
mm/internal.h | 5 +-
mm/memcontrol.c | 7 +-
mm/memory.c | 108 ++-
mm/mempolicy.c | 183 ++++-
mm/migrate.c | 81 +-
mm/mprotect.c | 69 +-
mm/numa.c | 73 ++
mm/pgtable-generic.c | 9 +-
36 files changed, 2492 insertions(+), 318 deletions(-)
create mode 100644 Documentation/scheduler/numa-problem.txt
create mode 100644 include/linux/page-flags-layout.h
create mode 100644 mm/numa.c

--
1.7.11.7


2012-11-19 02:15:29

by Ingo Molnar

Subject: [PATCH 01/27] mm/generic: Only flush the local TLB in ptep_set_access_flags()

From: Rik van Riel <[email protected]>

The function ptep_set_access_flags() is only ever used to upgrade
access permissions to a page - i.e. to make them less restrictive.

That means the only negative side effect of not flushing remote
TLBs in this function is that other CPUs may incur spurious page
faults if they happen to access the same address and still have
a PTE with the old permissions cached in their TLBs.

Having another CPU occasionally incur a spurious page fault is
cheaper than always paying the cost of a remote TLB flush, so
replace the remote TLB flush with a purely local one.

This should be safe on every architecture that correctly
implements flush_tlb_fix_spurious_fault() to actually invalidate
the local TLB entry that caused a page fault, as well as on
architectures where the hardware invalidates TLB entries that
cause page faults.

In the unlikely event that you are hitting what appears to be
an infinite loop of page faults, and 'git bisect' took you to
this changeset, your architecture needs to implement
flush_tlb_fix_spurious_fault() to actually flush the TLB entry.
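
For reference, the generic fallback (as found in include/asm-generic/pgtable.h
and quoted in a later patch of this series) simply maps the fixup to a local
flush of the faulting address:

  #ifndef flush_tlb_fix_spurious_fault
  #define flush_tlb_fix_spurious_fault(vma, address) flush_tlb_page(vma, address)
  #endif

An architecture whose hardware re-walks the page tables on a permission fault
and drops the stale entry itself could plausibly define this as a no-op - that
variant is an assumption, not something this patch introduces.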

Signed-off-by: Rik van Riel <[email protected]>
Acked-by: Linus Torvalds <[email protected]>
Acked-by: Peter Zijlstra <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Michel Lespinasse <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Hugh Dickins <[email protected]>
[ Changelog massage. ]
Signed-off-by: Ingo Molnar <[email protected]>
---
mm/pgtable-generic.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
index e642627..d8397da 100644
--- a/mm/pgtable-generic.c
+++ b/mm/pgtable-generic.c
@@ -12,8 +12,8 @@

#ifndef __HAVE_ARCH_PTEP_SET_ACCESS_FLAGS
/*
- * Only sets the access flags (dirty, accessed, and
- * writable). Furthermore, we know it always gets set to a "more
+ * Only sets the access flags (dirty, accessed), as well as write
+ * permission. Furthermore, we know it always gets set to a "more
* permissive" setting, which allows most architectures to optimize
* this. We return whether the PTE actually changed, which in turn
* instructs the caller to do things like update__mmu_cache. This
@@ -27,7 +27,7 @@ int ptep_set_access_flags(struct vm_area_struct *vma,
int changed = !pte_same(*ptep, entry);
if (changed) {
set_pte_at(vma->vm_mm, address, ptep, entry);
- flush_tlb_page(vma, address);
+ flush_tlb_fix_spurious_fault(vma, address);
}
return changed;
}
--
1.7.11.7

2012-11-19 02:15:33

by Ingo Molnar

Subject: [PATCH 03/27] x86/mm: Introduce pte_accessible()

From: Rik van Riel <[email protected]>

We need pte_present to return true for _PAGE_PROTNONE pages, to indicate that
the pte is associated with a page.

However, for TLB flushing purposes, we would like to know whether the pte
points to an actually accessible page. This allows us to skip remote TLB
flushes for pages that are not actually accessible.

Fill in this method for x86 and provide a safe (but slower) method
on other architectures.
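
A later patch in this series ("mm: Only flush the TLB when clearing an
accessible pte") is the main consumer; roughly, the caller pattern looks
like this sketch (simplified, not the exact hunk):

  pte_t ptep_clear_flush(struct vm_area_struct *vma, unsigned long address,
                         pte_t *ptep)
  {
          pte_t pte;

          pte = ptep_get_and_clear(vma->vm_mm, address, ptep);
          if (pte_accessible(pte))        /* no remote flush for PROT_NONE ptes */
                  flush_tlb_page(vma, address);
          return pte;
  }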

Signed-off-by: Rik van Riel <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Fixed-by: Linus Torvalds <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Hugh Dickins <[email protected]>
Link: http://lkml.kernel.org/n/[email protected]
[ Added Linus's review fixes. ]
Signed-off-by: Ingo Molnar <[email protected]>
---
arch/x86/include/asm/pgtable.h | 6 ++++++
include/asm-generic/pgtable.h | 4 ++++
2 files changed, 10 insertions(+)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index a1f780d..5fe03aa 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -407,6 +407,12 @@ static inline int pte_present(pte_t a)
return pte_flags(a) & (_PAGE_PRESENT | _PAGE_PROTNONE);
}

+#define pte_accessible pte_accessible
+static inline int pte_accessible(pte_t a)
+{
+ return pte_flags(a) & _PAGE_PRESENT;
+}
+
static inline int pte_hidden(pte_t pte)
{
return pte_flags(pte) & _PAGE_HIDDEN;
diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
index b36ce40..48fc1dc 100644
--- a/include/asm-generic/pgtable.h
+++ b/include/asm-generic/pgtable.h
@@ -219,6 +219,10 @@ static inline int pmd_same(pmd_t pmd_a, pmd_t pmd_b)
#define move_pte(pte, prot, old_addr, new_addr) (pte)
#endif

+#ifndef pte_accessible
+# define pte_accessible(pte) ((void)(pte),1)
+#endif
+
#ifndef flush_tlb_fix_spurious_fault
#define flush_tlb_fix_spurious_fault(vma, address) flush_tlb_page(vma, address)
#endif
--
1.7.11.7

2012-11-19 02:15:39

by Ingo Molnar

Subject: [PATCH 06/27] mm: Count the number of pages affected in change_protection()

From: Peter Zijlstra <[email protected]>

This will be used for three kinds of purposes:

- to optimize mprotect()

- to speed up working set scanning for working set areas that
have not been touched

- to more accurately scan per real working set

No change in functionality from this patch.
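
The returned count is what the later working-set scanning patches build on:
scanning progress gets accounted in pages whose protections were actually
changed, rather than in virtual address range covered. A rough sketch of that
usage (the change_prot_numa() helper and the pages_to_scan limit belong to
later patches and are shown here only as an illustration):

  unsigned long pages = 0;
  struct vm_area_struct *vma;

  for (vma = mm->mmap; vma; vma = vma->vm_next) {
          if (!vma_migratable(vma))
                  continue;
          /* change_protection() now reports how many ptes it modified: */
          pages += change_prot_numa(vma, vma->vm_start, vma->vm_end);
          if (pages >= pages_to_scan)
                  break;
  }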

Suggested-by: Ingo Molnar <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
---
include/linux/hugetlb.h | 8 +++++--
include/linux/mm.h | 3 +++
mm/hugetlb.c | 10 +++++++--
mm/mprotect.c | 58 +++++++++++++++++++++++++++++++++++++------------
4 files changed, 61 insertions(+), 18 deletions(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 2251648..06e691b 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -87,7 +87,7 @@ struct page *follow_huge_pud(struct mm_struct *mm, unsigned long address,
pud_t *pud, int write);
int pmd_huge(pmd_t pmd);
int pud_huge(pud_t pmd);
-void hugetlb_change_protection(struct vm_area_struct *vma,
+unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
unsigned long address, unsigned long end, pgprot_t newprot);

#else /* !CONFIG_HUGETLB_PAGE */
@@ -132,7 +132,11 @@ static inline void copy_huge_page(struct page *dst, struct page *src)
{
}

-#define hugetlb_change_protection(vma, address, end, newprot)
+static inline unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
+ unsigned long address, unsigned long end, pgprot_t newprot)
+{
+ return 0;
+}

static inline void __unmap_hugepage_range_final(struct mmu_gather *tlb,
struct vm_area_struct *vma, unsigned long start,
diff --git a/include/linux/mm.h b/include/linux/mm.h
index fa06804..8d86d5a 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1078,6 +1078,9 @@ extern unsigned long move_page_tables(struct vm_area_struct *vma,
extern unsigned long do_mremap(unsigned long addr,
unsigned long old_len, unsigned long new_len,
unsigned long flags, unsigned long new_addr);
+extern unsigned long change_protection(struct vm_area_struct *vma, unsigned long start,
+ unsigned long end, pgprot_t newprot,
+ int dirty_accountable);
extern int mprotect_fixup(struct vm_area_struct *vma,
struct vm_area_struct **pprev, unsigned long start,
unsigned long end, unsigned long newflags);
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 59a0059..712895e 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -3014,7 +3014,7 @@ same_page:
return i ? i : -EFAULT;
}

-void hugetlb_change_protection(struct vm_area_struct *vma,
+unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
unsigned long address, unsigned long end, pgprot_t newprot)
{
struct mm_struct *mm = vma->vm_mm;
@@ -3022,6 +3022,7 @@ void hugetlb_change_protection(struct vm_area_struct *vma,
pte_t *ptep;
pte_t pte;
struct hstate *h = hstate_vma(vma);
+ unsigned long pages = 0;

BUG_ON(address >= end);
flush_cache_range(vma, address, end);
@@ -3032,12 +3033,15 @@ void hugetlb_change_protection(struct vm_area_struct *vma,
ptep = huge_pte_offset(mm, address);
if (!ptep)
continue;
- if (huge_pmd_unshare(mm, &address, ptep))
+ if (huge_pmd_unshare(mm, &address, ptep)) {
+ pages++;
continue;
+ }
if (!huge_pte_none(huge_ptep_get(ptep))) {
pte = huge_ptep_get_and_clear(mm, address, ptep);
pte = pte_mkhuge(pte_modify(pte, newprot));
set_huge_pte_at(mm, address, ptep, pte);
+ pages++;
}
}
spin_unlock(&mm->page_table_lock);
@@ -3049,6 +3053,8 @@ void hugetlb_change_protection(struct vm_area_struct *vma,
*/
flush_tlb_range(vma, start, end);
mutex_unlock(&vma->vm_file->f_mapping->i_mmap_mutex);
+
+ return pages << h->order;
}

int hugetlb_reserve_pages(struct inode *inode,
diff --git a/mm/mprotect.c b/mm/mprotect.c
index a409926..1e265be 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -35,12 +35,13 @@ static inline pgprot_t pgprot_modify(pgprot_t oldprot, pgprot_t newprot)
}
#endif

-static void change_pte_range(struct mm_struct *mm, pmd_t *pmd,
+static unsigned long change_pte_range(struct mm_struct *mm, pmd_t *pmd,
unsigned long addr, unsigned long end, pgprot_t newprot,
int dirty_accountable)
{
pte_t *pte, oldpte;
spinlock_t *ptl;
+ unsigned long pages = 0;

pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
arch_enter_lazy_mmu_mode();
@@ -60,6 +61,7 @@ static void change_pte_range(struct mm_struct *mm, pmd_t *pmd,
ptent = pte_mkwrite(ptent);

ptep_modify_prot_commit(mm, addr, pte, ptent);
+ pages++;
} else if (IS_ENABLED(CONFIG_MIGRATION) && !pte_file(oldpte)) {
swp_entry_t entry = pte_to_swp_entry(oldpte);

@@ -72,18 +74,22 @@ static void change_pte_range(struct mm_struct *mm, pmd_t *pmd,
set_pte_at(mm, addr, pte,
swp_entry_to_pte(entry));
}
+ pages++;
}
} while (pte++, addr += PAGE_SIZE, addr != end);
arch_leave_lazy_mmu_mode();
pte_unmap_unlock(pte - 1, ptl);
+
+ return pages;
}

-static inline void change_pmd_range(struct vm_area_struct *vma, pud_t *pud,
+static inline unsigned long change_pmd_range(struct vm_area_struct *vma, pud_t *pud,
unsigned long addr, unsigned long end, pgprot_t newprot,
int dirty_accountable)
{
pmd_t *pmd;
unsigned long next;
+ unsigned long pages = 0;

pmd = pmd_offset(pud, addr);
do {
@@ -91,35 +97,42 @@ static inline void change_pmd_range(struct vm_area_struct *vma, pud_t *pud,
if (pmd_trans_huge(*pmd)) {
if (next - addr != HPAGE_PMD_SIZE)
split_huge_page_pmd(vma->vm_mm, pmd);
- else if (change_huge_pmd(vma, pmd, addr, newprot))
+ else if (change_huge_pmd(vma, pmd, addr, newprot)) {
+ pages += HPAGE_PMD_NR;
continue;
+ }
/* fall through */
}
if (pmd_none_or_clear_bad(pmd))
continue;
- change_pte_range(vma->vm_mm, pmd, addr, next, newprot,
+ pages += change_pte_range(vma->vm_mm, pmd, addr, next, newprot,
dirty_accountable);
} while (pmd++, addr = next, addr != end);
+
+ return pages;
}

-static inline void change_pud_range(struct vm_area_struct *vma, pgd_t *pgd,
+static inline unsigned long change_pud_range(struct vm_area_struct *vma, pgd_t *pgd,
unsigned long addr, unsigned long end, pgprot_t newprot,
int dirty_accountable)
{
pud_t *pud;
unsigned long next;
+ unsigned long pages = 0;

pud = pud_offset(pgd, addr);
do {
next = pud_addr_end(addr, end);
if (pud_none_or_clear_bad(pud))
continue;
- change_pmd_range(vma, pud, addr, next, newprot,
+ pages += change_pmd_range(vma, pud, addr, next, newprot,
dirty_accountable);
} while (pud++, addr = next, addr != end);
+
+ return pages;
}

-static void change_protection(struct vm_area_struct *vma,
+static unsigned long change_protection_range(struct vm_area_struct *vma,
unsigned long addr, unsigned long end, pgprot_t newprot,
int dirty_accountable)
{
@@ -127,6 +140,7 @@ static void change_protection(struct vm_area_struct *vma,
pgd_t *pgd;
unsigned long next;
unsigned long start = addr;
+ unsigned long pages = 0;

BUG_ON(addr >= end);
pgd = pgd_offset(mm, addr);
@@ -135,10 +149,30 @@ static void change_protection(struct vm_area_struct *vma,
next = pgd_addr_end(addr, end);
if (pgd_none_or_clear_bad(pgd))
continue;
- change_pud_range(vma, pgd, addr, next, newprot,
+ pages += change_pud_range(vma, pgd, addr, next, newprot,
dirty_accountable);
} while (pgd++, addr = next, addr != end);
+
flush_tlb_range(vma, start, end);
+
+ return pages;
+}
+
+unsigned long change_protection(struct vm_area_struct *vma, unsigned long start,
+ unsigned long end, pgprot_t newprot,
+ int dirty_accountable)
+{
+ struct mm_struct *mm = vma->vm_mm;
+ unsigned long pages;
+
+ mmu_notifier_invalidate_range_start(mm, start, end);
+ if (is_vm_hugetlb_page(vma))
+ pages = hugetlb_change_protection(vma, start, end, newprot);
+ else
+ pages = change_protection_range(vma, start, end, newprot, dirty_accountable);
+ mmu_notifier_invalidate_range_end(mm, start, end);
+
+ return pages;
}

int
@@ -213,12 +247,8 @@ success:
dirty_accountable = 1;
}

- mmu_notifier_invalidate_range_start(mm, start, end);
- if (is_vm_hugetlb_page(vma))
- hugetlb_change_protection(vma, start, end, vma->vm_page_prot);
- else
- change_protection(vma, start, end, vma->vm_page_prot, dirty_accountable);
- mmu_notifier_invalidate_range_end(mm, start, end);
+ change_protection(vma, start, end, vma->vm_page_prot, dirty_accountable);
+
vm_stat_account(mm, oldflags, vma->vm_file, -nrpages);
vm_stat_account(mm, newflags, vma->vm_file, nrpages);
perf_event_mmap(vma);
--
1.7.11.7

2012-11-19 02:15:48

by Ingo Molnar

Subject: [PATCH 10/27] sched: Make find_busiest_queue() a method

From: Peter Zijlstra <[email protected]>

It's a bit awkward, but it was the least painful way of modifying the
queue selection. A later patch uses this to conditionally pick a random
queue.
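
The indirection lets a later patch plug in an alternative selector without
touching load_balance() itself. A hedged sketch of what such an override
could look like (find_random_queue() is illustrative only, not the name used
later in the series):

  static struct rq *find_random_queue(struct lb_env *env,
                                      struct sched_group *group)
  {
          /* Pick some allowed cpu in the group rather than the busiest one: */
          int cpu = cpumask_any_and(sched_group_cpus(group), env->cpus);

          return cpu_rq(cpu);
  }

  /* ... and in the balancing path:  env.find_busiest_queue = find_random_queue; */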

Signed-off-by: Peter Zijlstra <[email protected]>
Cc: Paul Turner <[email protected]>
Cc: Lee Schermerhorn <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Hugh Dickins <[email protected]>
Link: http://lkml.kernel.org/n/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
kernel/sched/fair.c | 20 ++++++++++++--------
1 file changed, 12 insertions(+), 8 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 59e072b..511fbb8 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3600,6 +3600,9 @@ struct lb_env {
unsigned int loop;
unsigned int loop_break;
unsigned int loop_max;
+
+ struct rq * (*find_busiest_queue)(struct lb_env *,
+ struct sched_group *);
};

/*
@@ -4779,13 +4782,14 @@ static int load_balance(int this_cpu, struct rq *this_rq,
struct cpumask *cpus = __get_cpu_var(load_balance_tmpmask);

struct lb_env env = {
- .sd = sd,
- .dst_cpu = this_cpu,
- .dst_rq = this_rq,
- .dst_grpmask = sched_group_cpus(sd->groups),
- .idle = idle,
- .loop_break = sched_nr_migrate_break,
- .cpus = cpus,
+ .sd = sd,
+ .dst_cpu = this_cpu,
+ .dst_rq = this_rq,
+ .dst_grpmask = sched_group_cpus(sd->groups),
+ .idle = idle,
+ .loop_break = sched_nr_migrate_break,
+ .cpus = cpus,
+ .find_busiest_queue = find_busiest_queue,
};

cpumask_copy(cpus, cpu_active_mask);
@@ -4804,7 +4808,7 @@ redo:
goto out_balanced;
}

- busiest = find_busiest_queue(&env, group);
+ busiest = env.find_busiest_queue(&env, group);
if (!busiest) {
schedstat_inc(sd, lb_nobusyq[idle]);
goto out_balanced;
--
1.7.11.7

2012-11-19 02:15:56

by Ingo Molnar

Subject: [PATCH 14/27] sched, numa, mm, arch: Add variable locality exception

From: Peter Zijlstra <[email protected]>

Some architectures (ab)use NUMA to represent memory regions that are
all cpu-local but have different latencies, such as SuperH.

The naming comes from Mel Gorman.

Named-by: Mel Gorman <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Hugh Dickins <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
---
arch/sh/mm/Kconfig | 1 +
init/Kconfig | 7 +++++++
2 files changed, 8 insertions(+)

diff --git a/arch/sh/mm/Kconfig b/arch/sh/mm/Kconfig
index cb8f992..5d2a4df 100644
--- a/arch/sh/mm/Kconfig
+++ b/arch/sh/mm/Kconfig
@@ -111,6 +111,7 @@ config VSYSCALL
config NUMA
bool "Non Uniform Memory Access (NUMA) Support"
depends on MMU && SYS_SUPPORTS_NUMA && EXPERIMENTAL
+ select ARCH_WANTS_NUMA_VARIABLE_LOCALITY
default n
help
Some SH systems have many various memories scattered around
diff --git a/init/Kconfig b/init/Kconfig
index f36c83d..b8a4a58 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -718,6 +718,13 @@ config ARCH_USES_NUMA_GENERIC_PGPROT_HUGEPAGE
depends on ARCH_USES_NUMA_GENERIC_PGPROT
depends on TRANSPARENT_HUGEPAGE

+#
+# For architectures that (ab)use NUMA to represent different memory regions
+# all cpu-local but of different latencies, such as SuperH.
+#
+config ARCH_WANTS_NUMA_VARIABLE_LOCALITY
+ bool
+
menuconfig CGROUPS
boolean "Control Group support"
depends on EVENTFD
--
1.7.11.7

2012-11-19 02:15:44

by Ingo Molnar

Subject: [PATCH 08/27] sched, numa, mm: Add last_cpu to page flags

From: Peter Zijlstra <[email protected]>

Introduce a per-page last_cpu field, fold this into the struct
page::flags field whenever possible.

On the unlikely/rare 32-bit NUMA configs this will likely grow the
page frame.

[ Completely dropping 32-bit support for CONFIG_NUMA_BALANCING would
simplify things, but it would also remove the warning we get if enough
64-bit-only page flags are added to push last_cpu out of page->flags. ]
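
These accessors are what the NUMA hinting page fault path uses to classify
accesses; a hedged sketch of the intended use in the fault handler
(simplified, based on task_numa_fault() from a later patch in this series):

  /* In the hinting fault path, after the fault has been resolved: */
  int last_cpu = page_xchg_last_cpu(page, raw_smp_processor_id());

  /*
   * task_numa_fault() compares last_cpu against the current cpu: two
   * consecutive faults from the same cpu count as a 'private' access,
   * anything else as 'shared'.
   */
  task_numa_fault(page_to_nid(page), last_cpu, 1);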

Suggested-by: Rik van Riel <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
---
include/linux/mm.h | 90 +++++++++++++++++++++------------------
include/linux/mm_types.h | 5 +++
include/linux/mmzone.h | 14 +-----
include/linux/page-flags-layout.h | 83 ++++++++++++++++++++++++++++++++++++
kernel/bounds.c | 4 ++
mm/memory.c | 4 ++
6 files changed, 146 insertions(+), 54 deletions(-)
create mode 100644 include/linux/page-flags-layout.h

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 8d86d5a..5fc1d46 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -581,50 +581,11 @@ static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma)
* sets it, so none of the operations on it need to be atomic.
*/

-
-/*
- * page->flags layout:
- *
- * There are three possibilities for how page->flags get
- * laid out. The first is for the normal case, without
- * sparsemem. The second is for sparsemem when there is
- * plenty of space for node and section. The last is when
- * we have run out of space and have to fall back to an
- * alternate (slower) way of determining the node.
- *
- * No sparsemem or sparsemem vmemmap: | NODE | ZONE | ... | FLAGS |
- * classic sparse with space for node:| SECTION | NODE | ZONE | ... | FLAGS |
- * classic sparse no space for node: | SECTION | ZONE | ... | FLAGS |
- */
-#if defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP)
-#define SECTIONS_WIDTH SECTIONS_SHIFT
-#else
-#define SECTIONS_WIDTH 0
-#endif
-
-#define ZONES_WIDTH ZONES_SHIFT
-
-#if SECTIONS_WIDTH+ZONES_WIDTH+NODES_SHIFT <= BITS_PER_LONG - NR_PAGEFLAGS
-#define NODES_WIDTH NODES_SHIFT
-#else
-#ifdef CONFIG_SPARSEMEM_VMEMMAP
-#error "Vmemmap: No space for nodes field in page flags"
-#endif
-#define NODES_WIDTH 0
-#endif
-
-/* Page flags: | [SECTION] | [NODE] | ZONE | ... | FLAGS | */
+/* Page flags: | [SECTION] | [NODE] | ZONE | [LAST_CPU] | ... | FLAGS | */
#define SECTIONS_PGOFF ((sizeof(unsigned long)*8) - SECTIONS_WIDTH)
#define NODES_PGOFF (SECTIONS_PGOFF - NODES_WIDTH)
#define ZONES_PGOFF (NODES_PGOFF - ZONES_WIDTH)
-
-/*
- * We are going to use the flags for the page to node mapping if its in
- * there. This includes the case where there is no node, so it is implicit.
- */
-#if !(NODES_WIDTH > 0 || NODES_SHIFT == 0)
-#define NODE_NOT_IN_PAGE_FLAGS
-#endif
+#define LAST_CPU_PGOFF (ZONES_PGOFF - LAST_CPU_WIDTH)

/*
* Define the bit shifts to access each section. For non-existent
@@ -634,6 +595,7 @@ static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma)
#define SECTIONS_PGSHIFT (SECTIONS_PGOFF * (SECTIONS_WIDTH != 0))
#define NODES_PGSHIFT (NODES_PGOFF * (NODES_WIDTH != 0))
#define ZONES_PGSHIFT (ZONES_PGOFF * (ZONES_WIDTH != 0))
+#define LAST_CPU_PGSHIFT (LAST_CPU_PGOFF * (LAST_CPU_WIDTH != 0))

/* NODE:ZONE or SECTION:ZONE is used to ID a zone for the buddy allocator */
#ifdef NODE_NOT_IN_PAGE_FLAGS
@@ -655,6 +617,7 @@ static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma)
#define ZONES_MASK ((1UL << ZONES_WIDTH) - 1)
#define NODES_MASK ((1UL << NODES_WIDTH) - 1)
#define SECTIONS_MASK ((1UL << SECTIONS_WIDTH) - 1)
+#define LAST_CPU_MASK ((1UL << LAST_CPU_WIDTH) - 1)
#define ZONEID_MASK ((1UL << ZONEID_SHIFT) - 1)

static inline enum zone_type page_zonenum(const struct page *page)
@@ -693,6 +656,51 @@ static inline int page_to_nid(const struct page *page)
}
#endif

+#ifdef CONFIG_NUMA_BALANCING
+#ifdef LAST_CPU_NOT_IN_PAGE_FLAGS
+static inline int page_xchg_last_cpu(struct page *page, int cpu)
+{
+ return xchg(&page->_last_cpu, cpu);
+}
+
+static inline int page_last_cpu(struct page *page)
+{
+ return page->_last_cpu;
+}
+#else
+static inline int page_xchg_last_cpu(struct page *page, int cpu)
+{
+ unsigned long old_flags, flags;
+ int last_cpu;
+
+ do {
+ old_flags = flags = page->flags;
+ last_cpu = (flags >> LAST_CPU_PGSHIFT) & LAST_CPU_MASK;
+
+ flags &= ~(LAST_CPU_MASK << LAST_CPU_PGSHIFT);
+ flags |= (cpu & LAST_CPU_MASK) << LAST_CPU_PGSHIFT;
+ } while (unlikely(cmpxchg(&page->flags, old_flags, flags) != old_flags));
+
+ return last_cpu;
+}
+
+static inline int page_last_cpu(struct page *page)
+{
+ return (page->flags >> LAST_CPU_PGSHIFT) & LAST_CPU_MASK;
+}
+#endif /* LAST_CPU_NOT_IN_PAGE_FLAGS */
+#else /* CONFIG_NUMA_BALANCING */
+static inline int page_xchg_last_cpu(struct page *page, int cpu)
+{
+ return page_to_nid(page);
+}
+
+static inline int page_last_cpu(struct page *page)
+{
+ return page_to_nid(page);
+}
+#endif /* CONFIG_NUMA_BALANCING */
+
static inline struct zone *page_zone(const struct page *page)
{
return &NODE_DATA(page_to_nid(page))->node_zones[page_zonenum(page)];
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 31f8a3a..7e9f758 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -12,6 +12,7 @@
#include <linux/cpumask.h>
#include <linux/page-debug-flags.h>
#include <linux/uprobes.h>
+#include <linux/page-flags-layout.h>
#include <asm/page.h>
#include <asm/mmu.h>

@@ -175,6 +176,10 @@ struct page {
*/
void *shadow;
#endif
+
+#ifdef LAST_CPU_NOT_IN_PAGE_FLAGS
+ int _last_cpu;
+#endif
}
/*
* The struct page can be forced to be double word aligned so that atomic ops
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 50aaca8..7e116ed 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -15,7 +15,7 @@
#include <linux/seqlock.h>
#include <linux/nodemask.h>
#include <linux/pageblock-flags.h>
-#include <generated/bounds.h>
+#include <linux/page-flags-layout.h>
#include <linux/atomic.h>
#include <asm/page.h>

@@ -318,16 +318,6 @@ enum zone_type {
* match the requested limits. See gfp_zone() in include/linux/gfp.h
*/

-#if MAX_NR_ZONES < 2
-#define ZONES_SHIFT 0
-#elif MAX_NR_ZONES <= 2
-#define ZONES_SHIFT 1
-#elif MAX_NR_ZONES <= 4
-#define ZONES_SHIFT 2
-#else
-#error ZONES_SHIFT -- too many zones configured adjust calculation
-#endif
-
struct zone {
/* Fields commonly accessed by the page allocator */

@@ -1030,8 +1020,6 @@ static inline unsigned long early_pfn_to_nid(unsigned long pfn)
* PA_SECTION_SHIFT physical address to/from section number
* PFN_SECTION_SHIFT pfn to/from section number
*/
-#define SECTIONS_SHIFT (MAX_PHYSMEM_BITS - SECTION_SIZE_BITS)
-
#define PA_SECTION_SHIFT (SECTION_SIZE_BITS)
#define PFN_SECTION_SHIFT (SECTION_SIZE_BITS - PAGE_SHIFT)

diff --git a/include/linux/page-flags-layout.h b/include/linux/page-flags-layout.h
new file mode 100644
index 0000000..b258132
--- /dev/null
+++ b/include/linux/page-flags-layout.h
@@ -0,0 +1,83 @@
+#ifndef _LINUX_PAGE_FLAGS_LAYOUT
+#define _LINUX_PAGE_FLAGS_LAYOUT
+
+#include <linux/numa.h>
+#include <generated/bounds.h>
+
+#if MAX_NR_ZONES < 2
+#define ZONES_SHIFT 0
+#elif MAX_NR_ZONES <= 2
+#define ZONES_SHIFT 1
+#elif MAX_NR_ZONES <= 4
+#define ZONES_SHIFT 2
+#else
+#error ZONES_SHIFT -- too many zones configured adjust calculation
+#endif
+
+#ifdef CONFIG_SPARSEMEM
+#include <asm/sparsemem.h>
+
+/*
+ * SECTION_SHIFT #bits space required to store a section #
+ */
+#define SECTIONS_SHIFT (MAX_PHYSMEM_BITS - SECTION_SIZE_BITS)
+#endif
+
+/*
+ * page->flags layout:
+ *
+ * There are five possibilities for how page->flags get laid out. The first
+ * (and second) is for the normal case, without sparsemem. The third is for
+ * sparsemem when there is plenty of space for node and section. The last is
+ * when we have run out of space and have to fall back to an alternate (slower)
+ * way of determining the node.
+ *
+ * No sparsemem or sparsemem vmemmap: | NODE | ZONE | ... | FLAGS |
+ * " plus space for last_cpu:| NODE | ZONE | LAST_CPU | ... | FLAGS |
+ * classic sparse with space for node:| SECTION | NODE | ZONE | ... | FLAGS |
+ * " plus space for last_cpu:| SECTION | NODE | ZONE | LAST_CPU | ... | FLAGS |
+ * classic sparse no space for node: | SECTION | ZONE | ... | FLAGS |
+ */
+#if defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP)
+
+#define SECTIONS_WIDTH SECTIONS_SHIFT
+#else
+#define SECTIONS_WIDTH 0
+#endif
+
+#define ZONES_WIDTH ZONES_SHIFT
+
+#if SECTIONS_WIDTH+ZONES_WIDTH+NODES_SHIFT <= BITS_PER_LONG - NR_PAGEFLAGS
+#define NODES_WIDTH NODES_SHIFT
+#else
+#ifdef CONFIG_SPARSEMEM_VMEMMAP
+#error "Vmemmap: No space for nodes field in page flags"
+#endif
+#define NODES_WIDTH 0
+#endif
+
+#ifdef CONFIG_NUMA_BALANCING
+#define LAST_CPU_SHIFT NR_CPUS_BITS
+#else
+#define LAST_CPU_SHIFT 0
+#endif
+
+#if SECTIONS_WIDTH+ZONES_WIDTH+NODES_SHIFT+LAST_CPU_SHIFT <= BITS_PER_LONG - NR_PAGEFLAGS
+#define LAST_CPU_WIDTH LAST_CPU_SHIFT
+#else
+#define LAST_CPU_WIDTH 0
+#endif
+
+/*
+ * We are going to use the flags for the page to node mapping if its in
+ * there. This includes the case where there is no node, so it is implicit.
+ */
+#if !(NODES_WIDTH > 0 || NODES_SHIFT == 0)
+#define NODE_NOT_IN_PAGE_FLAGS
+#endif
+
+#if defined(CONFIG_NUMA_BALANCING) && LAST_CPU_WIDTH == 0
+#define LAST_CPU_NOT_IN_PAGE_FLAGS
+#endif
+
+#endif /* _LINUX_PAGE_FLAGS_LAYOUT */
diff --git a/kernel/bounds.c b/kernel/bounds.c
index 0c9b862..e8ca97b 100644
--- a/kernel/bounds.c
+++ b/kernel/bounds.c
@@ -10,6 +10,7 @@
#include <linux/mmzone.h>
#include <linux/kbuild.h>
#include <linux/page_cgroup.h>
+#include <linux/log2.h>

void foo(void)
{
@@ -17,5 +18,8 @@ void foo(void)
DEFINE(NR_PAGEFLAGS, __NR_PAGEFLAGS);
DEFINE(MAX_NR_ZONES, __MAX_NR_ZONES);
DEFINE(NR_PCG_FLAGS, __NR_PCG_FLAGS);
+#ifdef CONFIG_SMP
+ DEFINE(NR_CPUS_BITS, ilog2(CONFIG_NR_CPUS));
+#endif
/* End of constants */
}
diff --git a/mm/memory.c b/mm/memory.c
index fb135ba..24d3a4a 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -67,6 +67,10 @@

#include "internal.h"

+#ifdef LAST_CPU_NOT_IN_PAGE_FLAGS
+#warning Unfortunate NUMA config, growing page-frame for last_cpu.
+#endif
+
#ifndef CONFIG_NEED_MULTIPLE_NODES
/* use the per-pgdat data instead for discontigmem - mbligh */
unsigned long max_mapnr;
--
1.7.11.7

2012-11-19 02:15:51

by Ingo Molnar

Subject: [PATCH 12/27] numa, mm: Support NUMA hinting page faults from gup/gup_fast

From: Andrea Arcangeli <[email protected]>

Introduce FOLL_NUMA to tell follow_page to check
pte/pmd_numa. get_user_pages must use FOLL_NUMA, and it's safe to do
so because it always invokes handle_mm_fault and retries the
follow_page later.

KVM secondary MMU page faults will trigger the NUMA hinting page
faults through gup_fast -> get_user_pages -> follow_page ->
handle_mm_fault.

Other follow_page callers like KSM should not use FOLL_NUMA, or they
would fail to get the pages if they use follow_page instead of
get_user_pages.

[ This patch was picked up from the AutoNUMA tree. ]
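
A rough sketch of why this is safe for gup but not for bare follow_page()
callers - the loop below is a simplification of __get_user_pages(), shown
only for illustration:

  /* simplified __get_user_pages() inner loop: */
  while (!(page = follow_page(vma, start, foll_flags))) {
          /*
           * With FOLL_NUMA set, follow_page() returns NULL for
           * pte_numa()/pmd_numa() pages; handle_mm_fault() then resolves
           * the NUMA hinting fault and the lookup is retried.
           */
          int ret = handle_mm_fault(mm, vma, start,
                                    (foll_flags & FOLL_WRITE) ? FAULT_FLAG_WRITE : 0);
          if (ret & VM_FAULT_ERROR)
                  return i ? i : -EFAULT;
  }

A direct follow_page() user such as KSM has no such retry path, which is why
it must not pass FOLL_NUMA.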

Originally-by: Andrea Arcangeli <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Thomas Gleixner <[email protected]>
[ ported to this tree. ]
Signed-off-by: Ingo Molnar <[email protected]>
---
include/linux/mm.h | 1 +
mm/memory.c | 17 +++++++++++++++++
2 files changed, 18 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 246375c..f39a628 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1585,6 +1585,7 @@ struct page *follow_page(struct vm_area_struct *, unsigned long address,
#define FOLL_MLOCK 0x40 /* mark page as mlocked */
#define FOLL_SPLIT 0x80 /* don't return transhuge pages, split them */
#define FOLL_HWPOISON 0x100 /* check page is hwpoisoned */
+#define FOLL_NUMA 0x200 /* force NUMA hinting page fault */

typedef int (*pte_fn_t)(pte_t *pte, pgtable_t token, unsigned long addr,
void *data);
diff --git a/mm/memory.c b/mm/memory.c
index b9bb15c..23ad2eb 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1522,6 +1522,8 @@ struct page *follow_page(struct vm_area_struct *vma, unsigned long address,
page = follow_huge_pmd(mm, address, pmd, flags & FOLL_WRITE);
goto out;
}
+ if ((flags & FOLL_NUMA) && pmd_numa(vma, *pmd))
+ goto no_page_table;
if (pmd_trans_huge(*pmd)) {
if (flags & FOLL_SPLIT) {
split_huge_page_pmd(mm, pmd);
@@ -1551,6 +1553,8 @@ split_fallthrough:
pte = *ptep;
if (!pte_present(pte))
goto no_page;
+ if ((flags & FOLL_NUMA) && pte_numa(vma, pte))
+ goto no_page;
if ((flags & FOLL_WRITE) && !pte_write(pte))
goto unlock;

@@ -1702,6 +1706,19 @@ int __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
(VM_WRITE | VM_MAYWRITE) : (VM_READ | VM_MAYREAD);
vm_flags &= (gup_flags & FOLL_FORCE) ?
(VM_MAYREAD | VM_MAYWRITE) : (VM_READ | VM_WRITE);
+
+ /*
+ * If FOLL_FORCE and FOLL_NUMA are both set, handle_mm_fault
+ * would be called on PROT_NONE ranges. We must never invoke
+ * handle_mm_fault on PROT_NONE ranges or the NUMA hinting
+ * page faults would unprotect the PROT_NONE ranges if
+ * _PAGE_NUMA and _PAGE_PROTNONE are sharing the same pte/pmd
+ * bitflag. So to avoid that, don't set FOLL_NUMA if
+ * FOLL_FORCE is set.
+ */
+ if (!(gup_flags & FOLL_FORCE))
+ gup_flags |= FOLL_NUMA;
+
i = 0;

do {
--
1.7.11.7

2012-11-19 02:16:00

by Ingo Molnar

Subject: [PATCH 16/27] sched, mm, x86: Add the ARCH_SUPPORTS_NUMA_BALANCING flag

Allow architectures to opt in to the adaptive-affinity NUMA balancing code.

Cc: Peter Zijlstra <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Hugh Dickins <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
---
init/Kconfig | 7 +++++++
1 file changed, 7 insertions(+)

diff --git a/init/Kconfig b/init/Kconfig
index b8a4a58..cf3e79c 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -725,6 +725,13 @@ config ARCH_USES_NUMA_GENERIC_PGPROT_HUGEPAGE
config ARCH_WANTS_NUMA_VARIABLE_LOCALITY
bool

+#
+# For architectures that want to enable the PROT_NONE driven,
+# NUMA-affine scheduler balancing logic:
+#
+config ARCH_SUPPORTS_NUMA_BALANCING
+ bool
+
menuconfig CGROUPS
boolean "Control Group support"
depends on EVENTFD
--
1.7.11.7

2012-11-19 02:16:07

by Ingo Molnar

Subject: [PATCH 18/27] sched: Add adaptive NUMA affinity support

From: Peter Zijlstra <[email protected]>

The principal ideas behind this patch are the fundamental
difference between shared and privately used memory and the very
strong desire to only rely on per-task behavioral state for
scheduling decisions.

We define 'shared memory' as all user memory that is frequently
accessed by multiple tasks and conversely 'private memory' is
the user memory used predominantly by a single task.

To approximate the above strict definition we note that task
placement is done predominantly per cpu, so cpu-granular page
access state is a natural fit. We therefore introduce
page::last_cpu, recording the cpu that last accessed a page.

Using this, we can construct two per-task node-vectors, 'S_i'
and 'P_i', reflecting the amount of shared and privately used
pages of this task respectively. Pages for which two consecutive
'hits' come from the same cpu are assumed private; the others
are shared.

[ This means that we will start evaluating this state when the
task has not migrated for at least 2 scans, see NUMA_SETTLE ]

Using these vectors we can compute the total number of
shared/private pages of this task and determine which dominates.

[ Note that for shared tasks we only see '1/n'-th of the total
number of shared pages, since the other tasks take the other
faults; 'n' is the number of tasks sharing the memory. For an
equal comparison we should divide the private total by 'n' as
well, but we don't know 'n', so we pick 2. ]
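
In code this boils down to the faults vector being indexed by
[2*node + priv] and the shared/private decision below (condensed from
task_numa_placement() in the fair.c hunk further down; not a complete
function):

  for (node = 0; node < nr_node_ids; node++) {
          for (priv = 0; priv < 2; priv++)
                  total[priv] += p->numa_faults[2*node + priv];
  }

  /* compare shared faults against half the private faults - the '2'
   * standing in for the unknown number of sharers 'n': */
  shared = (total[0] >= total[1] / 2);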

We can also compute which node holds most of our memory; running
on that node is called 'ideal placement'. (As per previous
patches we will prefer to pull memory towards wherever we run.)

We change the load-balancer to prefer moving tasks in order of:

1) !numa tasks and numa tasks in the direction of more faults
2) allow !ideal tasks getting worse in the direction of faults
3) allow private tasks to get worse
4) allow shared tasks to get worse

This order ensures we prefer increasing memory locality, but when
we do have to make hard decisions we prefer spreading private
tasks over shared ones, because spreading shared tasks
significantly increases interconnect traffic since not all of
their memory can follow.

We also add an extra 'lateral' force to the load balancer that
perturbs the state when otherwise 'fairly' balanced. This
ensures we don't get 'stuck' in a state which is fair but
undesired from a memory-locality POV (see can_do_numa_run()).

Lastly, we allow shared tasks to defeat the default spreading of
tasks such that, when possible, they can aggregate on a single
node.

Shared tasks aggregate for the very simple reason that there has
to be a single node that holds most of their memory, a node that
holds the second most, and so on, and tasks want to move up the
faults ladder.

Enable it on x86. A number of other architectures are
most likely fine too - but they should enable and test this
feature explicitly.

Signed-off-by: Peter Zijlstra <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Hugh Dickins <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
---
Documentation/scheduler/numa-problem.txt | 20 +-
arch/x86/Kconfig | 2 +
include/linux/sched.h | 1 +
kernel/sched/core.c | 53 +-
kernel/sched/fair.c | 975 +++++++++++++++++++++++++------
kernel/sched/features.h | 8 +
kernel/sched/sched.h | 38 +-
7 files changed, 900 insertions(+), 197 deletions(-)

diff --git a/Documentation/scheduler/numa-problem.txt b/Documentation/scheduler/numa-problem.txt
index a5d2fee..7f133e3 100644
--- a/Documentation/scheduler/numa-problem.txt
+++ b/Documentation/scheduler/numa-problem.txt
@@ -133,6 +133,8 @@ XXX properties of this M vs a potential optimal

2b) migrate memory towards 'n_i' using 2 samples.

+XXX include the statistical babble on double sampling somewhere near
+
This separates pages into those that will migrate and those that will not due
to the two samples not matching. We could consider the first to be of 'p_i'
(private) and the second to be of 's_i' (shared).
@@ -142,7 +144,17 @@ This interpretation can be motivated by the previously observed property that
's_i' (shared). (here we loose the need for memory limits again, since it
becomes indistinguishable from shared).

-XXX include the statistical babble on double sampling somewhere near
+ 2c) use cpu samples instead of node samples
+
+The problem with sampling on node granularity is that one looses 's_i' for
+the local node, since one cannot distinguish between two accesses from the
+same node.
+
+By increasing the granularity to per-cpu we gain the ability to have both an
+'s_i' and 'p_i' per node. Since we do all task placement per-cpu as well this
+seems like a natural match. The line where we overcommit cpus is where we loose
+granularity again, but when we loose overcommit we naturally spread tasks.
+Therefore it should work out nicely.

This reduces the problem further; we loose 'M' as per 2a, it further reduces
the 'T_k,l' (interconnect traffic) term to only include shared (since per the
@@ -150,12 +162,6 @@ above all private will be local):

T_k,l = \Sum_i bs_i,l for every n_i = k, l != k

-[ more or less matches the state of sched/numa and describes its remaining
- problems and assumptions. It should work well for tasks without significant
- shared memory usage between tasks. ]
-
-Possible future directions:
-
Motivated by the form of 'T_k,l', try and obtain each term of the sum, so we
can evaluate it;

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 46c3bff..95646fe 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -22,6 +22,8 @@ config X86
def_bool y
select HAVE_AOUT if X86_32
select HAVE_UNSTABLE_SCHED_CLOCK
+ select ARCH_SUPPORTS_NUMA_BALANCING
+ select ARCH_WANTS_NUMA_GENERIC_PGPROT
select HAVE_IDE
select HAVE_OPROFILE
select HAVE_PCSPKR_PLATFORM
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 418d405..bb12cc3 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -823,6 +823,7 @@ enum cpu_idle_type {
#define SD_ASYM_PACKING 0x0800 /* Place busy groups earlier in the domain */
#define SD_PREFER_SIBLING 0x1000 /* Prefer to place tasks in a sibling domain */
#define SD_OVERLAP 0x2000 /* sched_domains of this level overlap */
+#define SD_NUMA 0x4000 /* cross-node balancing */

extern int __weak arch_sd_sibiling_asym_packing(void);

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 3611f5f..7b58366 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1800,6 +1800,7 @@ static void finish_task_switch(struct rq *rq, struct task_struct *prev)
if (mm)
mmdrop(mm);
if (unlikely(prev_state == TASK_DEAD)) {
+ task_numa_free(prev);
/*
* Remove function-return probe instances associated with this
* task and put them back on the free list.
@@ -5510,7 +5511,9 @@ static void destroy_sched_domains(struct sched_domain *sd, int cpu)
DEFINE_PER_CPU(struct sched_domain *, sd_llc);
DEFINE_PER_CPU(int, sd_llc_id);

-static void update_top_cache_domain(int cpu)
+DEFINE_PER_CPU(struct sched_domain *, sd_node);
+
+static void update_domain_cache(int cpu)
{
struct sched_domain *sd;
int id = cpu;
@@ -5521,6 +5524,15 @@ static void update_top_cache_domain(int cpu)

rcu_assign_pointer(per_cpu(sd_llc, cpu), sd);
per_cpu(sd_llc_id, cpu) = id;
+
+ for_each_domain(cpu, sd) {
+ if (cpumask_equal(sched_domain_span(sd),
+ cpumask_of_node(cpu_to_node(cpu))))
+ goto got_node;
+ }
+ sd = NULL;
+got_node:
+ rcu_assign_pointer(per_cpu(sd_node, cpu), sd);
}

/*
@@ -5563,7 +5575,7 @@ cpu_attach_domain(struct sched_domain *sd, struct root_domain *rd, int cpu)
rcu_assign_pointer(rq->sd, sd);
destroy_sched_domains(tmp, cpu);

- update_top_cache_domain(cpu);
+ update_domain_cache(cpu);
}

/* cpus with isolated domains */
@@ -5985,6 +5997,37 @@ static struct sched_domain_topology_level default_topology[] = {

static struct sched_domain_topology_level *sched_domain_topology = default_topology;

+#ifdef CONFIG_NUMA_BALANCING
+/*
+ * Change a task's NUMA state - called from the placement tick.
+ */
+void sched_setnuma(struct task_struct *p, int node, int shared)
+{
+ unsigned long flags;
+ int on_rq, running;
+ struct rq *rq;
+
+ rq = task_rq_lock(p, &flags);
+ on_rq = p->on_rq;
+ running = task_current(rq, p);
+
+ if (on_rq)
+ dequeue_task(rq, p, 0);
+ if (running)
+ p->sched_class->put_prev_task(rq, p);
+
+ p->numa_shared = shared;
+ p->numa_max_node = node;
+
+ if (running)
+ p->sched_class->set_curr_task(rq);
+ if (on_rq)
+ enqueue_task(rq, p, 0);
+ task_rq_unlock(rq, p, &flags);
+}
+
+#endif /* CONFIG_NUMA_BALANCING */
+
#ifdef CONFIG_NUMA

static int sched_domains_numa_levels;
@@ -6030,6 +6073,7 @@ sd_numa_init(struct sched_domain_topology_level *tl, int cpu)
| 0*SD_SHARE_PKG_RESOURCES
| 1*SD_SERIALIZE
| 0*SD_PREFER_SIBLING
+ | 1*SD_NUMA
| sd_local_flags(level)
,
.last_balance = jiffies,
@@ -6884,7 +6928,6 @@ void __init sched_init(void)
rq->post_schedule = 0;
rq->active_balance = 0;
rq->next_balance = jiffies;
- rq->push_cpu = 0;
rq->cpu = i;
rq->online = 0;
rq->idle_stamp = 0;
@@ -6892,6 +6935,10 @@ void __init sched_init(void)

INIT_LIST_HEAD(&rq->cfs_tasks);

+#ifdef CONFIG_NUMA_BALANCING
+ rq->nr_shared_running = 0;
+#endif
+
rq_attach_root(rq, &def_root_domain);
#ifdef CONFIG_NO_HZ
rq->nohz_flags = 0;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8af0208..f3aeaac 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -29,6 +29,9 @@
#include <linux/slab.h>
#include <linux/profile.h>
#include <linux/interrupt.h>
+#include <linux/random.h>
+#include <linux/mempolicy.h>
+#include <linux/task_work.h>

#include <trace/events/sched.h>

@@ -774,6 +777,235 @@ update_stats_curr_start(struct cfs_rq *cfs_rq, struct sched_entity *se)
}

/**************************************************
+ * Scheduling class numa methods.
+ *
+ * The purpose of the NUMA bits are to maintain compute (task) and data
+ * (memory) locality.
+ *
+ * We keep a faults vector per task and use periodic fault scans to try and
+ * estalish a task<->page relation. This assumes the task<->page relation is a
+ * compute<->data relation, this is false for things like virt. and n:m
+ * threading solutions but its the best we can do given the information we
+ * have.
+ *
+ * We try and migrate such that we increase along the order provided by this
+ * vector while maintaining fairness.
+ *
+ * Tasks start out with their numa status unset (-1) this effectively means
+ * they act !NUMA until we've established the task is busy enough to bother
+ * with placement.
+ */
+
+#ifdef CONFIG_SMP
+static unsigned long task_h_load(struct task_struct *p);
+#endif
+
+#ifdef CONFIG_NUMA_BALANCING
+static void account_numa_enqueue(struct rq *rq, struct task_struct *p)
+{
+ if (task_numa_shared(p) != -1) {
+ p->numa_weight = task_h_load(p);
+ rq->nr_numa_running++;
+ rq->nr_shared_running += task_numa_shared(p);
+ rq->nr_ideal_running += (cpu_to_node(task_cpu(p)) == p->numa_max_node);
+ rq->numa_weight += p->numa_weight;
+ }
+}
+
+static void account_numa_dequeue(struct rq *rq, struct task_struct *p)
+{
+ if (task_numa_shared(p) != -1) {
+ rq->nr_numa_running--;
+ rq->nr_shared_running -= task_numa_shared(p);
+ rq->nr_ideal_running -= (cpu_to_node(task_cpu(p)) == p->numa_max_node);
+ rq->numa_weight -= p->numa_weight;
+ }
+}
+
+/*
+ * numa task sample period in ms: 5s
+ */
+unsigned int sysctl_sched_numa_scan_period_min = 5000;
+unsigned int sysctl_sched_numa_scan_period_max = 5000*16;
+
+/*
+ * Wait for the 2-sample stuff to settle before migrating again
+ */
+unsigned int sysctl_sched_numa_settle_count = 2;
+
+static void task_numa_migrate(struct task_struct *p, int next_cpu)
+{
+ p->numa_migrate_seq = 0;
+}
+
+static void task_numa_placement(struct task_struct *p)
+{
+ int seq = ACCESS_ONCE(p->mm->numa_scan_seq);
+ unsigned long total[2] = { 0, 0 };
+ unsigned long faults, max_faults = 0;
+ int node, priv, shared, max_node = -1;
+
+ if (p->numa_scan_seq == seq)
+ return;
+
+ p->numa_scan_seq = seq;
+
+ for (node = 0; node < nr_node_ids; node++) {
+ faults = 0;
+ for (priv = 0; priv < 2; priv++) {
+ faults += p->numa_faults[2*node + priv];
+ total[priv] += p->numa_faults[2*node + priv];
+ p->numa_faults[2*node + priv] /= 2;
+ }
+ if (faults > max_faults) {
+ max_faults = faults;
+ max_node = node;
+ }
+ }
+
+ if (max_node != p->numa_max_node)
+ sched_setnuma(p, max_node, task_numa_shared(p));
+
+ p->numa_migrate_seq++;
+ if (sched_feat(NUMA_SETTLE) &&
+ p->numa_migrate_seq < sysctl_sched_numa_settle_count)
+ return;
+
+ /*
+ * Note: shared is spread across multiple tasks and in the future
+ * we might want to consider a different equation below to reduce
+ * the impact of a little private memory accesses.
+ */
+ shared = (total[0] >= total[1] / 2);
+ if (shared != task_numa_shared(p)) {
+ sched_setnuma(p, p->numa_max_node, shared);
+ p->numa_migrate_seq = 0;
+ }
+}
+
+/*
+ * Got a PROT_NONE fault for a page on @node.
+ */
+void task_numa_fault(int node, int last_cpu, int pages)
+{
+ struct task_struct *p = current;
+ int priv = (task_cpu(p) == last_cpu);
+
+ if (unlikely(!p->numa_faults)) {
+ int size = sizeof(*p->numa_faults) * 2 * nr_node_ids;
+
+ p->numa_faults = kzalloc(size, GFP_KERNEL);
+ if (!p->numa_faults)
+ return;
+ }
+
+ task_numa_placement(p);
+ p->numa_faults[2*node + priv] += pages;
+}
+
+/*
+ * The expensive part of numa migration is done from task_work context.
+ * Triggered from task_tick_numa().
+ */
+void task_numa_work(struct callback_head *work)
+{
+ unsigned long migrate, next_scan, now = jiffies;
+ struct task_struct *p = current;
+ struct mm_struct *mm = p->mm;
+
+ WARN_ON_ONCE(p != container_of(work, struct task_struct, numa_work));
+
+ work->next = work; /* protect against double add */
+ /*
+ * Who cares about NUMA placement when they're dying.
+ *
+ * NOTE: make sure not to dereference p->mm before this check,
+ * exit_task_work() happens _after_ exit_mm() so we could be called
+ * without p->mm even though we still had it when we enqueued this
+ * work.
+ */
+ if (p->flags & PF_EXITING)
+ return;
+
+ /*
+ * Enforce maximal scan/migration frequency..
+ */
+ migrate = mm->numa_next_scan;
+ if (time_before(now, migrate))
+ return;
+
+ next_scan = now + 2*msecs_to_jiffies(sysctl_sched_numa_scan_period_min);
+ if (cmpxchg(&mm->numa_next_scan, migrate, next_scan) != migrate)
+ return;
+
+ ACCESS_ONCE(mm->numa_scan_seq)++;
+ {
+ struct vm_area_struct *vma;
+
+ down_write(&mm->mmap_sem);
+ for (vma = mm->mmap; vma; vma = vma->vm_next) {
+ if (!vma_migratable(vma))
+ continue;
+ change_prot_numa(vma, vma->vm_start, vma->vm_end);
+ }
+ up_write(&mm->mmap_sem);
+ }
+}
+
+/*
+ * Drive the periodic memory faults..
+ */
+void task_tick_numa(struct rq *rq, struct task_struct *curr)
+{
+ struct callback_head *work = &curr->numa_work;
+ u64 period, now;
+
+ /*
+ * We don't care about NUMA placement if we don't have memory.
+ */
+ if (!curr->mm || (curr->flags & PF_EXITING) || work->next != work)
+ return;
+
+ /*
+ * Using runtime rather than walltime has the dual advantage that
+ * we (mostly) drive the selection from busy threads and that the
+ * task needs to have done some actual work before we bother with
+ * NUMA placement.
+ */
+ now = curr->se.sum_exec_runtime;
+ period = (u64)curr->numa_scan_period * NSEC_PER_MSEC;
+
+ if (now - curr->node_stamp > period) {
+ curr->node_stamp = now;
+
+ /*
+ * We are comparing runtime to wall clock time here, which
+ * puts a maximum scan frequency limit on the task work.
+ *
+ * This, together with the limits in task_numa_work() filters
+ * us from over-sampling if there are many threads: if all
+ * threads happen to come in at the same time we don't create a
+ * spike in overhead.
+ *
+ * We also avoid multiple threads scanning at once in parallel to
+ * each other.
+ */
+ if (!time_before(jiffies, curr->mm->numa_next_scan)) {
+ init_task_work(work, task_numa_work); /* TODO: move this into sched_fork() */
+ task_work_add(curr, work, true);
+ }
+ }
+}
+#else /* !CONFIG_NUMA_BALANCING: */
+#ifdef CONFIG_SMP
+static inline void account_numa_enqueue(struct rq *rq, struct task_struct *p) { }
+#endif
+static inline void account_numa_dequeue(struct rq *rq, struct task_struct *p) { }
+static inline void task_tick_numa(struct rq *rq, struct task_struct *curr) { }
+static inline void task_numa_migrate(struct task_struct *p, int next_cpu) { }
+#endif /* CONFIG_NUMA_BALANCING */
+
+/**************************************************
* Scheduling class queueing methods:
*/

@@ -784,9 +1016,13 @@ account_entity_enqueue(struct cfs_rq *cfs_rq, struct sched_entity *se)
if (!parent_entity(se))
update_load_add(&rq_of(cfs_rq)->load, se->load.weight);
#ifdef CONFIG_SMP
- if (entity_is_task(se))
- list_add(&se->group_node, &rq_of(cfs_rq)->cfs_tasks);
-#endif
+ if (entity_is_task(se)) {
+ struct rq *rq = rq_of(cfs_rq);
+
+ account_numa_enqueue(rq, task_of(se));
+ list_add(&se->group_node, &rq->cfs_tasks);
+ }
+#endif /* CONFIG_SMP */
cfs_rq->nr_running++;
}

@@ -796,8 +1032,10 @@ account_entity_dequeue(struct cfs_rq *cfs_rq, struct sched_entity *se)
update_load_sub(&cfs_rq->load, se->load.weight);
if (!parent_entity(se))
update_load_sub(&rq_of(cfs_rq)->load, se->load.weight);
- if (entity_is_task(se))
+ if (entity_is_task(se)) {
list_del_init(&se->group_node);
+ account_numa_dequeue(rq_of(cfs_rq), task_of(se));
+ }
cfs_rq->nr_running--;
}

@@ -3177,20 +3415,8 @@ unlock:
return new_cpu;
}

-/*
- * Load-tracking only depends on SMP, FAIR_GROUP_SCHED dependency below may be
- * removed when useful for applications beyond shares distribution (e.g.
- * load-balance).
- */
#ifdef CONFIG_FAIR_GROUP_SCHED
-/*
- * Called immediately before a task is migrated to a new cpu; task_cpu(p) and
- * cfs_rq_of(p) references at time of call are still valid and identify the
- * previous cpu. However, the caller only guarantees p->pi_lock is held; no
- * other assumptions, including the state of rq->lock, should be made.
- */
-static void
-migrate_task_rq_fair(struct task_struct *p, int next_cpu)
+static void migrate_task_rq_entity(struct task_struct *p, int next_cpu)
{
struct sched_entity *se = &p->se;
struct cfs_rq *cfs_rq = cfs_rq_of(se);
@@ -3206,7 +3432,27 @@ migrate_task_rq_fair(struct task_struct *p, int next_cpu)
atomic64_add(se->avg.load_avg_contrib, &cfs_rq->removed_load);
}
}
+#else
+static void migrate_task_rq_entity(struct task_struct *p, int next_cpu) { }
#endif
+
+/*
+ * Load-tracking only depends on SMP, FAIR_GROUP_SCHED dependency below may be
+ * removed when useful for applications beyond shares distribution (e.g.
+ * load-balance).
+ */
+/*
+ * Called immediately before a task is migrated to a new cpu; task_cpu(p) and
+ * cfs_rq_of(p) references at time of call are still valid and identify the
+ * previous cpu. However, the caller only guarantees p->pi_lock is held; no
+ * other assumptions, including the state of rq->lock, should be made.
+ */
+static void
+migrate_task_rq_fair(struct task_struct *p, int next_cpu)
+{
+ migrate_task_rq_entity(p, next_cpu);
+ task_numa_migrate(p, next_cpu);
+}
#endif /* CONFIG_SMP */

static unsigned long
@@ -3580,7 +3826,10 @@ static unsigned long __read_mostly max_load_balance_interval = HZ/10;

#define LBF_ALL_PINNED 0x01
#define LBF_NEED_BREAK 0x02
-#define LBF_SOME_PINNED 0x04
+#define LBF_SOME_PINNED 0x04
+#define LBF_NUMA_RUN 0x08
+#define LBF_NUMA_SHARED 0x10
+#define LBF_KEEP_SHARED 0x20

struct lb_env {
struct sched_domain *sd;
@@ -3599,6 +3848,8 @@ struct lb_env {
struct cpumask *cpus;

unsigned int flags;
+ unsigned int failed;
+ unsigned int iteration;

unsigned int loop;
unsigned int loop_break;
@@ -3620,11 +3871,87 @@ static void move_task(struct task_struct *p, struct lb_env *env)
check_preempt_curr(env->dst_rq, p, 0);
}

+#ifdef CONFIG_NUMA_BALANCING
+
+static inline unsigned long task_node_faults(struct task_struct *p, int node)
+{
+ return p->numa_faults[2*node] + p->numa_faults[2*node + 1];
+}
+
+static int task_faults_down(struct task_struct *p, struct lb_env *env)
+{
+ int src_node, dst_node, node, down_node = -1;
+ unsigned long faults, src_faults, max_faults = 0;
+
+ if (!sched_feat_numa(NUMA_FAULTS_DOWN) || !p->numa_faults)
+ return 1;
+
+ src_node = cpu_to_node(env->src_cpu);
+ dst_node = cpu_to_node(env->dst_cpu);
+
+ if (src_node == dst_node)
+ return 1;
+
+ src_faults = task_node_faults(p, src_node);
+
+ for (node = 0; node < nr_node_ids; node++) {
+ if (node == src_node)
+ continue;
+
+ faults = task_node_faults(p, node);
+
+ if (faults > max_faults && faults <= src_faults) {
+ max_faults = faults;
+ down_node = node;
+ }
+ }
+
+ if (down_node == dst_node)
+ return 1; /* move towards the next node down */
+
+ return 0; /* stay here */
+}
+
+static int task_faults_up(struct task_struct *p, struct lb_env *env)
+{
+ unsigned long src_faults, dst_faults;
+ int src_node, dst_node;
+
+ if (!sched_feat_numa(NUMA_FAULTS_UP) || !p->numa_faults)
+ return 0; /* can't say it improved */
+
+ src_node = cpu_to_node(env->src_cpu);
+ dst_node = cpu_to_node(env->dst_cpu);
+
+ if (src_node == dst_node)
+ return 0; /* pointless, don't do that */
+
+ src_faults = task_node_faults(p, src_node);
+ dst_faults = task_node_faults(p, dst_node);
+
+ if (dst_faults > src_faults)
+ return 1; /* move to dst */
+
+ return 0; /* stay where we are */
+}
+
+#else /* !CONFIG_NUMA_BALANCING: */
+static inline int task_faults_up(struct task_struct *p, struct lb_env *env)
+{
+ return 0;
+}
+
+static inline int task_faults_down(struct task_struct *p, struct lb_env *env)
+{
+ return 0;
+}
+#endif
+
/*
* Is this task likely cache-hot:
*/
static int
-task_hot(struct task_struct *p, u64 now, struct sched_domain *sd)
+task_hot(struct task_struct *p, struct lb_env *env)
{
s64 delta;

@@ -3647,80 +3974,153 @@ task_hot(struct task_struct *p, u64 now, struct sched_domain *sd)
if (sysctl_sched_migration_cost == 0)
return 0;

- delta = now - p->se.exec_start;
+ delta = env->src_rq->clock_task - p->se.exec_start;

return delta < (s64)sysctl_sched_migration_cost;
}

/*
- * can_migrate_task - may task p from runqueue rq be migrated to this_cpu?
+ * We do not migrate tasks that cannot be migrated to this CPU
+ * due to cpus_allowed.
+ *
+ * NOTE: this function has env-> side effects, to help the balancing
+ * of pinned tasks.
*/
-static
-int can_migrate_task(struct task_struct *p, struct lb_env *env)
+static bool can_migrate_pinned_task(struct task_struct *p, struct lb_env *env)
{
- int tsk_cache_hot = 0;
+ int new_dst_cpu;
+
+ if (cpumask_test_cpu(env->dst_cpu, tsk_cpus_allowed(p)))
+ return true;
+
+ schedstat_inc(p, se.statistics.nr_failed_migrations_affine);
+
/*
- * We do not migrate tasks that are:
- * 1) running (obviously), or
- * 2) cannot be migrated to this CPU due to cpus_allowed, or
- * 3) are cache-hot on their current CPU.
+ * Remember if this task can be migrated to any other cpu in
+ * our sched_group. We may want to revisit it if we couldn't
+ * meet load balance goals by pulling other tasks on src_cpu.
+ *
+ * Also avoid computing new_dst_cpu if we have already computed
+ * one in current iteration.
*/
- if (!cpumask_test_cpu(env->dst_cpu, tsk_cpus_allowed(p))) {
- int new_dst_cpu;
-
- schedstat_inc(p, se.statistics.nr_failed_migrations_affine);
+ if (!env->dst_grpmask || (env->flags & LBF_SOME_PINNED))
+ return false;

- /*
- * Remember if this task can be migrated to any other cpu in
- * our sched_group. We may want to revisit it if we couldn't
- * meet load balance goals by pulling other tasks on src_cpu.
- *
- * Also avoid computing new_dst_cpu if we have already computed
- * one in current iteration.
- */
- if (!env->dst_grpmask || (env->flags & LBF_SOME_PINNED))
- return 0;
-
- new_dst_cpu = cpumask_first_and(env->dst_grpmask,
- tsk_cpus_allowed(p));
- if (new_dst_cpu < nr_cpu_ids) {
- env->flags |= LBF_SOME_PINNED;
- env->new_dst_cpu = new_dst_cpu;
- }
- return 0;
+ new_dst_cpu = cpumask_first_and(env->dst_grpmask, tsk_cpus_allowed(p));
+ if (new_dst_cpu < nr_cpu_ids) {
+ env->flags |= LBF_SOME_PINNED;
+ env->new_dst_cpu = new_dst_cpu;
}
+ return false;
+}

- /* Record that we found atleast one task that could run on dst_cpu */
- env->flags &= ~LBF_ALL_PINNED;
+/*
+ * We cannot (easily) migrate tasks that are currently running:
+ */
+static bool can_migrate_running_task(struct task_struct *p, struct lb_env *env)
+{
+ if (!task_running(env->src_rq, p))
+ return true;

- if (task_running(env->src_rq, p)) {
- schedstat_inc(p, se.statistics.nr_failed_migrations_running);
- return 0;
- }
+ schedstat_inc(p, se.statistics.nr_failed_migrations_running);
+ return false;
+}

+/*
+ * Can we migrate a NUMA task? The rules are rather involved:
+ */
+static bool can_migrate_numa_task(struct task_struct *p, struct lb_env *env)
+{
/*
- * Aggressive migration if:
- * 1) task is cache cold, or
- * 2) too many balance attempts have failed.
+ * iteration:
+ * 0 -- only allow improvement, or !numa
+ * 1 -- + worsen !ideal
+ * 2 priv
+ * 3 shared (everything)
+ *
+ * NUMA_HOT_DOWN:
+ * 1 .. nodes -- allow getting worse by step
+ * nodes+1 -- punt, everything goes!
+ *
+ * LBF_NUMA_RUN -- numa only, only allow improvement
+ * LBF_NUMA_SHARED -- shared only
+ *
+ * LBF_KEEP_SHARED -- do not touch shared tasks
*/

- tsk_cache_hot = task_hot(p, env->src_rq->clock_task, env->sd);
- if (!tsk_cache_hot ||
- env->sd->nr_balance_failed > env->sd->cache_nice_tries) {
-#ifdef CONFIG_SCHEDSTATS
- if (tsk_cache_hot) {
- schedstat_inc(env->sd, lb_hot_gained[env->idle]);
- schedstat_inc(p, se.statistics.nr_forced_migrations);
- }
-#endif
- return 1;
+ /* a numa run can only move numa tasks about to improve things */
+ if (env->flags & LBF_NUMA_RUN) {
+ if (task_numa_shared(p) < 0)
+ return false;
+ /* can only pull shared tasks */
+ if ((env->flags & LBF_NUMA_SHARED) && !task_numa_shared(p))
+ return false;
+ } else {
+ if (task_numa_shared(p) < 0)
+ goto try_migrate;
}

- if (tsk_cache_hot) {
- schedstat_inc(p, se.statistics.nr_failed_migrations_hot);
- return 0;
- }
- return 1;
+ /* can not move shared tasks */
+ if ((env->flags & LBF_KEEP_SHARED) && task_numa_shared(p) == 1)
+ return false;
+
+ if (task_faults_up(p, env))
+ return true; /* memory locality beats cache hotness */
+
+ if (env->iteration < 1)
+ return false;
+
+#ifdef CONFIG_NUMA_BALANCING
+ if (p->numa_max_node != cpu_to_node(task_cpu(p))) /* !ideal */
+ goto demote;
+#endif
+
+ if (env->iteration < 2)
+ return false;
+
+ if (task_numa_shared(p) == 0) /* private */
+ goto demote;
+
+ if (env->iteration < 3)
+ return false;
+
+demote:
+ if (env->iteration < 5)
+ return task_faults_down(p, env);
+
+try_migrate:
+ if (env->failed > env->sd->cache_nice_tries)
+ return true;
+
+ return !task_hot(p, env);
+}
+
+/*
+ * can_migrate_task() - may task p from runqueue rq be migrated to this_cpu?
+ */
+static int can_migrate_task(struct task_struct *p, struct lb_env *env)
+{
+ if (!can_migrate_pinned_task(p, env))
+ return false;
+
+ /* Record that we found at least one task that could run on dst_cpu */
+ env->flags &= ~LBF_ALL_PINNED;
+
+ if (!can_migrate_running_task(p, env))
+ return false;
+
+ if (env->sd->flags & SD_NUMA)
+ return can_migrate_numa_task(p, env);
+
+ if (env->failed > env->sd->cache_nice_tries)
+ return true;
+
+ if (!task_hot(p, env))
+ return true;
+
+ schedstat_inc(p, se.statistics.nr_failed_migrations_hot);
+
+ return false;
}

/*
@@ -3735,6 +4135,7 @@ static int move_one_task(struct lb_env *env)
struct task_struct *p, *n;

list_for_each_entry_safe(p, n, &env->src_rq->cfs_tasks, se.group_node) {
+
if (throttled_lb_pair(task_group(p), env->src_rq->cpu, env->dst_cpu))
continue;

@@ -3742,6 +4143,7 @@ static int move_one_task(struct lb_env *env)
continue;

move_task(p, env);
+
/*
* Right now, this is only the second place move_task()
* is called, so we can safely collect move_task()
@@ -3753,8 +4155,6 @@ static int move_one_task(struct lb_env *env)
return 0;
}

-static unsigned long task_h_load(struct task_struct *p);
-
static const unsigned int sched_nr_migrate_break = 32;

/*
@@ -3766,7 +4166,6 @@ static const unsigned int sched_nr_migrate_break = 32;
*/
static int move_tasks(struct lb_env *env)
{
- struct list_head *tasks = &env->src_rq->cfs_tasks;
struct task_struct *p;
unsigned long load;
int pulled = 0;
@@ -3774,8 +4173,8 @@ static int move_tasks(struct lb_env *env)
if (env->imbalance <= 0)
return 0;

- while (!list_empty(tasks)) {
- p = list_first_entry(tasks, struct task_struct, se.group_node);
+ while (!list_empty(&env->src_rq->cfs_tasks)) {
+ p = list_first_entry(&env->src_rq->cfs_tasks, struct task_struct, se.group_node);

env->loop++;
/* We've more or less seen every task there is, call it quits */
@@ -3786,7 +4185,7 @@ static int move_tasks(struct lb_env *env)
if (env->loop > env->loop_break) {
env->loop_break += sched_nr_migrate_break;
env->flags |= LBF_NEED_BREAK;
- break;
+ goto out;
}

if (throttled_lb_pair(task_group(p), env->src_cpu, env->dst_cpu))
@@ -3794,7 +4193,7 @@ static int move_tasks(struct lb_env *env)

load = task_h_load(p);

- if (sched_feat(LB_MIN) && load < 16 && !env->sd->nr_balance_failed)
+ if (sched_feat(LB_MIN) && load < 16 && !env->failed)
goto next;

if ((load / 2) > env->imbalance)
@@ -3814,7 +4213,7 @@ static int move_tasks(struct lb_env *env)
* the critical section.
*/
if (env->idle == CPU_NEWLY_IDLE)
- break;
+ goto out;
#endif

/*
@@ -3822,13 +4221,13 @@ static int move_tasks(struct lb_env *env)
* weighted load.
*/
if (env->imbalance <= 0)
- break;
+ goto out;

continue;
next:
- list_move_tail(&p->se.group_node, tasks);
+ list_move_tail(&p->se.group_node, &env->src_rq->cfs_tasks);
}
-
+out:
/*
* Right now, this is one of only two places move_task() is called,
* so we can safely collect move_task() stats here rather than
@@ -3953,17 +4352,18 @@ static inline void update_blocked_averages(int cpu)
static inline void update_h_load(long cpu)
{
}
-
+#ifdef CONFIG_SMP
static unsigned long task_h_load(struct task_struct *p)
{
return p->se.load.weight;
}
#endif
+#endif

/********** Helpers for find_busiest_group ************************/
/*
* sd_lb_stats - Structure to store the statistics of a sched_domain
- * during load balancing.
+ * during load balancing.
*/
struct sd_lb_stats {
struct sched_group *busiest; /* Busiest group in this sd */
@@ -3976,7 +4376,7 @@ struct sd_lb_stats {
unsigned long this_load;
unsigned long this_load_per_task;
unsigned long this_nr_running;
- unsigned long this_has_capacity;
+ unsigned int this_has_capacity;
unsigned int this_idle_cpus;

/* Statistics of the busiest group */
@@ -3985,10 +4385,28 @@ struct sd_lb_stats {
unsigned long busiest_load_per_task;
unsigned long busiest_nr_running;
unsigned long busiest_group_capacity;
- unsigned long busiest_has_capacity;
+ unsigned int busiest_has_capacity;
unsigned int busiest_group_weight;

int group_imb; /* Is there imbalance in this sd */
+
+#ifdef CONFIG_NUMA_BALANCING
+ unsigned long this_numa_running;
+ unsigned long this_numa_weight;
+ unsigned long this_shared_running;
+ unsigned long this_ideal_running;
+ unsigned long this_group_capacity;
+
+ struct sched_group *numa;
+ unsigned long numa_load;
+ unsigned long numa_nr_running;
+ unsigned long numa_numa_running;
+ unsigned long numa_shared_running;
+ unsigned long numa_ideal_running;
+ unsigned long numa_numa_weight;
+ unsigned long numa_group_capacity;
+ unsigned int numa_has_capacity;
+#endif
};

/*
@@ -4004,6 +4422,13 @@ struct sg_lb_stats {
unsigned long group_weight;
int group_imb; /* Is there an imbalance in the group ? */
int group_has_capacity; /* Is there extra capacity in the group? */
+
+#ifdef CONFIG_NUMA_BALANCING
+ unsigned long sum_ideal_running;
+ unsigned long sum_numa_running;
+ unsigned long sum_numa_weight;
+#endif
+ unsigned long sum_shared_running; /* 0 on non-NUMA */
};

/**
@@ -4032,6 +4457,151 @@ static inline int get_sd_load_idx(struct sched_domain *sd,
return load_idx;
}

+#ifdef CONFIG_NUMA_BALANCING
+
+static inline bool pick_numa_rand(int n)
+{
+ return !(get_random_int() % n);
+}
+
+static inline void update_sg_numa_stats(struct sg_lb_stats *sgs, struct rq *rq)
+{
+ sgs->sum_ideal_running += rq->nr_ideal_running;
+ sgs->sum_shared_running += rq->nr_shared_running;
+ sgs->sum_numa_running += rq->nr_numa_running;
+ sgs->sum_numa_weight += rq->numa_weight;
+}
+
+static inline
+void update_sd_numa_stats(struct sched_domain *sd, struct sched_group *sg,
+ struct sd_lb_stats *sds, struct sg_lb_stats *sgs,
+ int local_group)
+{
+ if (!(sd->flags & SD_NUMA))
+ return;
+
+ if (local_group) {
+ sds->this_numa_running = sgs->sum_numa_running;
+ sds->this_numa_weight = sgs->sum_numa_weight;
+ sds->this_shared_running = sgs->sum_shared_running;
+ sds->this_ideal_running = sgs->sum_ideal_running;
+ sds->this_group_capacity = sgs->group_capacity;
+
+ } else if (sgs->sum_numa_running - sgs->sum_ideal_running) {
+ if (!sds->numa || pick_numa_rand(sd->span_weight / sg->group_weight)) {
+ sds->numa = sg;
+ sds->numa_load = sgs->avg_load;
+ sds->numa_nr_running = sgs->sum_nr_running;
+ sds->numa_numa_running = sgs->sum_numa_running;
+ sds->numa_shared_running = sgs->sum_shared_running;
+ sds->numa_ideal_running = sgs->sum_ideal_running;
+ sds->numa_numa_weight = sgs->sum_numa_weight;
+ sds->numa_has_capacity = sgs->group_has_capacity;
+ sds->numa_group_capacity = sgs->group_capacity;
+ }
+ }
+}
+
+static struct rq *
+find_busiest_numa_queue(struct lb_env *env, struct sched_group *sg)
+{
+ struct rq *rq, *busiest = NULL;
+ int cpu;
+
+ for_each_cpu_and(cpu, sched_group_cpus(sg), env->cpus) {
+ rq = cpu_rq(cpu);
+
+ if (!rq->nr_numa_running)
+ continue;
+
+ if (!(rq->nr_numa_running - rq->nr_ideal_running))
+ continue;
+
+ if ((env->flags & LBF_KEEP_SHARED) && !(rq->nr_running - rq->nr_shared_running))
+ continue;
+
+ if (!busiest || pick_numa_rand(sg->group_weight))
+ busiest = rq;
+ }
+
+ return busiest;
+}
+
+static bool can_do_numa_run(struct lb_env *env, struct sd_lb_stats *sds)
+{
+ /*
+ * if we're overloaded; don't pull when:
+ * - the other guy isn't
+ * - imbalance would become too great
+ */
+ if (!sds->this_has_capacity) {
+ if (sds->numa_has_capacity)
+ return false;
+ }
+
+ /*
+ * pull if we got easy trade
+ */
+ if (sds->this_nr_running - sds->this_numa_running)
+ return true;
+
+ /*
+ * If we got capacity allow stacking up on shared tasks.
+ */
+ if ((sds->this_shared_running < sds->this_group_capacity) && sds->numa_shared_running) {
+ env->flags |= LBF_NUMA_SHARED;
+ return true;
+ }
+
+ /*
+ * pull if we could possibly trade
+ */
+ if (sds->this_numa_running - sds->this_ideal_running)
+ return true;
+
+ return false;
+}
+
+/*
+ * Introduce some controlled imbalance to perturb the state and allow it
+ * to improve - this should be tightly controlled/co-ordinated with
+ * can_migrate_task().
+ */
+static int check_numa_busiest_group(struct lb_env *env, struct sd_lb_stats *sds)
+{
+ if (!sds->numa || !sds->numa_numa_running)
+ return 0;
+
+ if (!can_do_numa_run(env, sds))
+ return 0;
+
+ env->flags |= LBF_NUMA_RUN;
+ env->flags &= ~LBF_KEEP_SHARED;
+ env->imbalance = sds->numa_numa_weight / sds->numa_numa_running;
+ sds->busiest = sds->numa;
+ env->find_busiest_queue = find_busiest_numa_queue;
+
+ return 1;
+}
+
+#else /* !CONFIG_NUMA_BALANCING: */
+static inline
+void update_sd_numa_stats(struct sched_domain *sd, struct sched_group *sg,
+ struct sd_lb_stats *sds, struct sg_lb_stats *sgs,
+ int local_group)
+{
+}
+
+static inline void update_sg_numa_stats(struct sg_lb_stats *sgs, struct rq *rq)
+{
+}
+
+static inline int check_numa_busiest_group(struct lb_env *env, struct sd_lb_stats *sds)
+{
+ return 0;
+}
+#endif
+
unsigned long default_scale_freq_power(struct sched_domain *sd, int cpu)
{
return SCHED_POWER_SCALE;
@@ -4245,6 +4815,9 @@ static inline void update_sg_lb_stats(struct lb_env *env,
sgs->group_load += load;
sgs->sum_nr_running += nr_running;
sgs->sum_weighted_load += weighted_cpuload(i);
+
+ update_sg_numa_stats(sgs, rq);
+
if (idle_cpu(i))
sgs->idle_cpus++;
}
@@ -4336,6 +4909,13 @@ static bool update_sd_pick_busiest(struct lb_env *env,
return false;
}

+static void update_src_keep_shared(struct lb_env *env, bool keep_shared)
+{
+ env->flags &= ~LBF_KEEP_SHARED;
+ if (keep_shared)
+ env->flags |= LBF_KEEP_SHARED;
+}
+
/**
* update_sd_lb_stats - Update sched_domain's statistics for load balancing.
* @env: The load balancing environment.
@@ -4368,6 +4948,7 @@ static inline void update_sd_lb_stats(struct lb_env *env,
sds->total_load += sgs.group_load;
sds->total_pwr += sg->sgp->power;

+#ifdef CONFIG_NUMA_BALANCING
/*
* In case the child domain prefers tasks go to siblings
* first, lower the sg capacity to one so that we'll try
@@ -4378,8 +4959,11 @@ static inline void update_sd_lb_stats(struct lb_env *env,
* heaviest group when it is already under-utilized (possible
* with a large weight task outweighs the tasks on the system).
*/
- if (prefer_sibling && !local_group && sds->this_has_capacity)
- sgs.group_capacity = min(sgs.group_capacity, 1UL);
+ if (0 && prefer_sibling && !local_group && sds->this_has_capacity) {
+ sgs.group_capacity = clamp_val(sgs.sum_shared_running,
+ 1UL, sgs.group_capacity);
+ }
+#endif

if (local_group) {
sds->this_load = sgs.avg_load;
@@ -4398,8 +4982,13 @@ static inline void update_sd_lb_stats(struct lb_env *env,
sds->busiest_has_capacity = sgs.group_has_capacity;
sds->busiest_group_weight = sgs.group_weight;
sds->group_imb = sgs.group_imb;
+
+ update_src_keep_shared(env,
+ sgs.sum_shared_running <= sgs.group_capacity);
}

+ update_sd_numa_stats(env->sd, sg, sds, &sgs, local_group);
+
sg = sg->next;
} while (sg != env->sd->groups);
}
@@ -4652,14 +5241,14 @@ find_busiest_group(struct lb_env *env, int *balance)
* don't try and pull any tasks.
*/
if (sds.this_load >= sds.max_load)
- goto out_balanced;
+ goto out_imbalanced;

/*
* Don't pull any tasks if this group is already above the domain
* average load.
*/
if (sds.this_load >= sds.avg_load)
- goto out_balanced;
+ goto out_imbalanced;

if (env->idle == CPU_IDLE) {
/*
@@ -4685,9 +5274,18 @@ force_balance:
calculate_imbalance(env, &sds);
return sds.busiest;

+out_imbalanced:
+ /* if we've got capacity allow for secondary placement preference */
+ if (!sds.this_has_capacity)
+ goto ret;
+
out_balanced:
+ if (check_numa_busiest_group(env, &sds))
+ return sds.busiest;
+
ret:
env->imbalance = 0;
+
return NULL;
}

@@ -4723,6 +5321,9 @@ static struct rq *find_busiest_queue(struct lb_env *env,
if (capacity && rq->nr_running == 1 && wl > env->imbalance)
continue;

+ if ((env->flags & LBF_KEEP_SHARED) && !(rq->nr_running - rq->nr_shared_running))
+ continue;
+
/*
* For the load comparisons with the other cpu's, consider
* the weighted_cpuload() scaled with the cpu power, so that
@@ -4749,25 +5350,40 @@ static struct rq *find_busiest_queue(struct lb_env *env,
/* Working cpumask for load_balance and load_balance_newidle. */
DEFINE_PER_CPU(cpumask_var_t, load_balance_tmpmask);

-static int need_active_balance(struct lb_env *env)
-{
- struct sched_domain *sd = env->sd;
-
- if (env->idle == CPU_NEWLY_IDLE) {
+static int active_load_balance_cpu_stop(void *data);

+static void update_sd_failed(struct lb_env *env, int ld_moved)
+{
+ if (!ld_moved) {
+ schedstat_inc(env->sd, lb_failed[env->idle]);
/*
- * ASYM_PACKING needs to force migrate tasks from busy but
- * higher numbered CPUs in order to pack all tasks in the
- * lowest numbered CPUs.
+ * Increment the failure counter only on periodic balance.
+ * We do not want newidle balance, which can be very
+ * frequent, pollute the failure counter causing
+ * excessive cache_hot migrations and active balances.
*/
- if ((sd->flags & SD_ASYM_PACKING) && env->src_cpu > env->dst_cpu)
- return 1;
- }
-
- return unlikely(sd->nr_balance_failed > sd->cache_nice_tries+2);
+ if (env->idle != CPU_NEWLY_IDLE && !(env->flags & LBF_NUMA_RUN))
+ env->sd->nr_balance_failed++;
+ } else
+ env->sd->nr_balance_failed = 0;
}

-static int active_load_balance_cpu_stop(void *data);
+/*
+ * See can_migrate_numa_task()
+ */
+static int lb_max_iteration(struct lb_env *env)
+{
+ if (!(env->sd->flags & SD_NUMA))
+ return 0;
+
+ if (env->flags & LBF_NUMA_RUN)
+ return 0; /* NUMA_RUN may only improve */
+
+ if (sched_feat_numa(NUMA_FAULTS_DOWN))
+ return 5; /* nodes^2 would suck */
+
+ return 3;
+}

/*
* Check this_cpu to ensure it is balanced within domain. Attempt to move
@@ -4793,6 +5409,8 @@ static int load_balance(int this_cpu, struct rq *this_rq,
.loop_break = sched_nr_migrate_break,
.cpus = cpus,
.find_busiest_queue = find_busiest_queue,
+ .failed = sd->nr_balance_failed,
+ .iteration = 0,
};

cpumask_copy(cpus, cpu_active_mask);
@@ -4816,6 +5434,8 @@ redo:
schedstat_inc(sd, lb_nobusyq[idle]);
goto out_balanced;
}
+ env.src_rq = busiest;
+ env.src_cpu = busiest->cpu;

BUG_ON(busiest == env.dst_rq);

@@ -4895,92 +5515,72 @@ more_balance:
}

/* All tasks on this runqueue were pinned by CPU affinity */
- if (unlikely(env.flags & LBF_ALL_PINNED)) {
- cpumask_clear_cpu(cpu_of(busiest), cpus);
- if (!cpumask_empty(cpus)) {
- env.loop = 0;
- env.loop_break = sched_nr_migrate_break;
- goto redo;
- }
- goto out_balanced;
+ if (unlikely(env.flags & LBF_ALL_PINNED))
+ goto out_pinned;
+
+ if (!ld_moved && env.iteration < lb_max_iteration(&env)) {
+ env.iteration++;
+ env.loop = 0;
+ goto more_balance;
}
}

- if (!ld_moved) {
- schedstat_inc(sd, lb_failed[idle]);
+ if (!ld_moved && idle != CPU_NEWLY_IDLE) {
+ raw_spin_lock_irqsave(&busiest->lock, flags);
+
/*
- * Increment the failure counter only on periodic balance.
- * We do not want newidle balance, which can be very
- * frequent, pollute the failure counter causing
- * excessive cache_hot migrations and active balances.
+ * Don't kick the active_load_balance_cpu_stop,
+ * if the curr task on busiest cpu can't be
+ * moved to this_cpu
*/
- if (idle != CPU_NEWLY_IDLE)
- sd->nr_balance_failed++;
-
- if (need_active_balance(&env)) {
- raw_spin_lock_irqsave(&busiest->lock, flags);
-
- /* don't kick the active_load_balance_cpu_stop,
- * if the curr task on busiest cpu can't be
- * moved to this_cpu
- */
- if (!cpumask_test_cpu(this_cpu,
- tsk_cpus_allowed(busiest->curr))) {
- raw_spin_unlock_irqrestore(&busiest->lock,
- flags);
- env.flags |= LBF_ALL_PINNED;
- goto out_one_pinned;
- }
-
- /*
- * ->active_balance synchronizes accesses to
- * ->active_balance_work. Once set, it's cleared
- * only after active load balance is finished.
- */
- if (!busiest->active_balance) {
- busiest->active_balance = 1;
- busiest->push_cpu = this_cpu;
- active_balance = 1;
- }
+ if (!cpumask_test_cpu(this_cpu, tsk_cpus_allowed(busiest->curr))) {
raw_spin_unlock_irqrestore(&busiest->lock, flags);
-
- if (active_balance) {
- stop_one_cpu_nowait(cpu_of(busiest),
- active_load_balance_cpu_stop, busiest,
- &busiest->active_balance_work);
- }
-
- /*
- * We've kicked active balancing, reset the failure
- * counter.
- */
- sd->nr_balance_failed = sd->cache_nice_tries+1;
+ env.flags |= LBF_ALL_PINNED;
+ goto out_pinned;
}
- } else
- sd->nr_balance_failed = 0;

- if (likely(!active_balance)) {
- /* We were unbalanced, so reset the balancing interval */
- sd->balance_interval = sd->min_interval;
- } else {
/*
- * If we've begun active balancing, start to back off. This
- * case may not be covered by the all_pinned logic if there
- * is only 1 task on the busy runqueue (because we don't call
- * move_tasks).
+ * ->active_balance synchronizes accesses to
+ * ->active_balance_work. Once set, it's cleared
+ * only after active load balance is finished.
*/
- if (sd->balance_interval < sd->max_interval)
- sd->balance_interval *= 2;
+ if (!busiest->active_balance) {
+ busiest->active_balance = 1;
+ busiest->ab_dst_cpu = this_cpu;
+ busiest->ab_flags = env.flags;
+ busiest->ab_failed = env.failed;
+ busiest->ab_idle = env.idle;
+ active_balance = 1;
+ }
+ raw_spin_unlock_irqrestore(&busiest->lock, flags);
+
+ if (active_balance) {
+ stop_one_cpu_nowait(cpu_of(busiest),
+ active_load_balance_cpu_stop, busiest,
+ &busiest->ab_work);
+ }
}

- goto out;
+ if (!active_balance)
+ update_sd_failed(&env, ld_moved);
+
+ sd->balance_interval = sd->min_interval;
+out:
+ return ld_moved;
+
+out_pinned:
+ cpumask_clear_cpu(cpu_of(busiest), cpus);
+ if (!cpumask_empty(cpus)) {
+ env.loop = 0;
+ env.loop_break = sched_nr_migrate_break;
+ goto redo;
+ }

out_balanced:
schedstat_inc(sd, lb_balanced[idle]);

sd->nr_balance_failed = 0;

-out_one_pinned:
/* tune up the balancing interval */
if (((env.flags & LBF_ALL_PINNED) &&
sd->balance_interval < MAX_PINNED_INTERVAL) ||
@@ -4988,8 +5588,8 @@ out_one_pinned:
sd->balance_interval *= 2;

ld_moved = 0;
-out:
- return ld_moved;
+
+ goto out;
}

/*
@@ -5060,7 +5660,7 @@ static int active_load_balance_cpu_stop(void *data)
{
struct rq *busiest_rq = data;
int busiest_cpu = cpu_of(busiest_rq);
- int target_cpu = busiest_rq->push_cpu;
+ int target_cpu = busiest_rq->ab_dst_cpu;
struct rq *target_rq = cpu_rq(target_cpu);
struct sched_domain *sd;

@@ -5098,17 +5698,23 @@ static int active_load_balance_cpu_stop(void *data)
.sd = sd,
.dst_cpu = target_cpu,
.dst_rq = target_rq,
- .src_cpu = busiest_rq->cpu,
+ .src_cpu = busiest_cpu,
.src_rq = busiest_rq,
- .idle = CPU_IDLE,
+ .flags = busiest_rq->ab_flags,
+ .failed = busiest_rq->ab_failed,
+ .idle = busiest_rq->ab_idle,
};
+ env.iteration = lb_max_iteration(&env);

schedstat_inc(sd, alb_count);

- if (move_one_task(&env))
+ if (move_one_task(&env)) {
schedstat_inc(sd, alb_pushed);
- else
+ update_sd_failed(&env, 1);
+ } else {
schedstat_inc(sd, alb_failed);
+ update_sd_failed(&env, 0);
+ }
}
rcu_read_unlock();
double_unlock_balance(busiest_rq, target_rq);
@@ -5508,6 +6114,9 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
}

update_rq_runnable_avg(rq, 1);
+
+ if (sched_feat_numa(NUMA) && nr_node_ids > 1)
+ task_tick_numa(rq, curr);
}

/*
@@ -5902,9 +6511,7 @@ const struct sched_class fair_sched_class = {

#ifdef CONFIG_SMP
.select_task_rq = select_task_rq_fair,
-#ifdef CONFIG_FAIR_GROUP_SCHED
.migrate_task_rq = migrate_task_rq_fair,
-#endif
.rq_online = rq_online_fair,
.rq_offline = rq_offline_fair,

diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index e68e69a..a432eb8 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -66,3 +66,11 @@ SCHED_FEAT(TTWU_QUEUE, true)
SCHED_FEAT(FORCE_SD_OVERLAP, false)
SCHED_FEAT(RT_RUNTIME_SHARE, true)
SCHED_FEAT(LB_MIN, false)
+
+#ifdef CONFIG_NUMA_BALANCING
+/* Do the working set probing faults: */
+SCHED_FEAT(NUMA, true)
+SCHED_FEAT(NUMA_FAULTS_UP, true)
+SCHED_FEAT(NUMA_FAULTS_DOWN, false)
+SCHED_FEAT(NUMA_SETTLE, true)
+#endif
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 5eca173..bb9475c 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -3,6 +3,7 @@
#include <linux/mutex.h>
#include <linux/spinlock.h>
#include <linux/stop_machine.h>
+#include <linux/slab.h>

#include "cpupri.h"

@@ -420,17 +421,29 @@ struct rq {
unsigned long cpu_power;

unsigned char idle_balance;
- /* For active balancing */
int post_schedule;
+
+ /* For active balancing */
int active_balance;
- int push_cpu;
- struct cpu_stop_work active_balance_work;
+ int ab_dst_cpu;
+ int ab_flags;
+ int ab_failed;
+ int ab_idle;
+ struct cpu_stop_work ab_work;
+
/* cpu of this runqueue: */
int cpu;
int online;

struct list_head cfs_tasks;

+#ifdef CONFIG_NUMA_BALANCING
+ unsigned long numa_weight;
+ unsigned long nr_numa_running;
+ unsigned long nr_ideal_running;
+#endif
+ unsigned long nr_shared_running; /* 0 on non-NUMA */
+
u64 rt_avg;
u64 age_stamp;
u64 idle_stamp;
@@ -501,6 +514,18 @@ DECLARE_PER_CPU(struct rq, runqueues);
#define cpu_curr(cpu) (cpu_rq(cpu)->curr)
#define raw_rq() (&__raw_get_cpu_var(runqueues))

+#ifdef CONFIG_NUMA_BALANCING
+extern void sched_setnuma(struct task_struct *p, int node, int shared);
+static inline void task_numa_free(struct task_struct *p)
+{
+ kfree(p->numa_faults);
+}
+#else /* CONFIG_NUMA_BALANCING */
+static inline void task_numa_free(struct task_struct *p)
+{
+}
+#endif /* CONFIG_NUMA_BALANCING */
+
#ifdef CONFIG_SMP

#define rcu_dereference_check_sched_domain(p) \
@@ -544,6 +569,7 @@ static inline struct sched_domain *highest_flag_domain(int cpu, int flag)

DECLARE_PER_CPU(struct sched_domain *, sd_llc);
DECLARE_PER_CPU(int, sd_llc_id);
+DECLARE_PER_CPU(struct sched_domain *, sd_node);

extern int group_balance_cpu(struct sched_group *sg);

@@ -663,6 +689,12 @@ extern struct static_key sched_feat_keys[__SCHED_FEAT_NR];
#define sched_feat(x) (sysctl_sched_features & (1UL << __SCHED_FEAT_##x))
#endif /* SCHED_DEBUG && HAVE_JUMP_LABEL */

+#ifdef CONFIG_NUMA_BALANCING
+#define sched_feat_numa(x) sched_feat(x)
+#else
+#define sched_feat_numa(x) (0)
+#endif
+
static inline u64 global_rt_period(void)
{
return (u64)sysctl_sched_rt_period * NSEC_PER_USEC;
--
1.7.11.7

2012-11-19 02:16:19

by Ingo Molnar

[permalink] [raw]
Subject: [PATCH 24/27] sched: Improve convergence

- break out of can_do_numa_run() earlier if we can make no progress
- don't flip between siblings that often
- turn on bidirectional fault balancing
- improve the flow in task_numa_work()

Cc: Peter Zijlstra <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Hugh Dickins <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
---
kernel/sched/fair.c | 46 ++++++++++++++++++++++++++++++++--------------
kernel/sched/features.h | 2 +-
2 files changed, 33 insertions(+), 15 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 59fea2e..9c46b45 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -917,12 +917,12 @@ void task_numa_fault(int node, int last_cpu, int pages)
*/
void task_numa_work(struct callback_head *work)
{
+ long pages_total, pages_left, pages_changed;
unsigned long migrate, next_scan, now = jiffies;
+ unsigned long start0, start, end;
struct task_struct *p = current;
struct mm_struct *mm = p->mm;
struct vm_area_struct *vma;
- unsigned long start, end;
- long pages;

WARN_ON_ONCE(p != container_of(work, struct task_struct, numa_work));

@@ -951,35 +951,42 @@ void task_numa_work(struct callback_head *work)

current->numa_scan_period += jiffies_to_msecs(2);

- start = mm->numa_scan_offset;
- pages = sysctl_sched_numa_scan_size;
- pages <<= 20 - PAGE_SHIFT; /* MB in pages */
- if (!pages)
+ start0 = start = end = mm->numa_scan_offset;
+ pages_total = sysctl_sched_numa_scan_size;
+ pages_total <<= 20 - PAGE_SHIFT; /* MB in pages */
+ if (!pages_total)
return;

+ pages_left = pages_total;
+
down_write(&mm->mmap_sem);
vma = find_vma(mm, start);
if (!vma) {
ACCESS_ONCE(mm->numa_scan_seq)++;
- start = 0;
- vma = mm->mmap;
+ end = 0;
+ vma = find_vma(mm, end);
}
for (; vma; vma = vma->vm_next) {
if (!vma_migratable(vma))
continue;

do {
- start = max(start, vma->vm_start);
- end = ALIGN(start + (pages << PAGE_SHIFT), HPAGE_SIZE);
+ start = max(end, vma->vm_start);
+ end = ALIGN(start + (pages_left << PAGE_SHIFT), HPAGE_SIZE);
end = min(end, vma->vm_end);
- pages -= change_prot_numa(vma, start, end);
- start = end;
- if (pages <= 0)
+ pages_changed = change_prot_numa(vma, start, end);
+
+ WARN_ON_ONCE(pages_changed > pages_total);
+ BUG_ON(pages_changed < 0);
+
+ pages_left -= pages_changed;
+ if (pages_left <= 0)
goto out;
} while (end != vma->vm_end);
}
out:
- mm->numa_scan_offset = start;
+ mm->numa_scan_offset = end;
+
up_write(&mm->mmap_sem);
}

@@ -3306,6 +3313,13 @@ static int select_idle_sibling(struct task_struct *p, int target)
int i;

/*
+ * For NUMA tasks constant, reliable placement is more important
+ * than flipping tasks between siblings:
+ */
+ if (task_numa_shared(p) >= 0)
+ return target;
+
+ /*
* If the task is going to be woken-up on this cpu and if it is
* already idle, then it is the right target.
*/
@@ -4581,6 +4595,10 @@ static bool can_do_numa_run(struct lb_env *env, struct sd_lb_stats *sds)
* If we got capacity allow stacking up on shared tasks.
*/
if ((sds->this_shared_running < sds->this_group_capacity) && sds->numa_shared_running) {
+ /* There's no point in trying to move if all are here already: */
+ if (sds->numa_shared_running == sds->this_shared_running)
+ return false;
+
env->flags |= LBF_NUMA_SHARED;
return true;
}
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index a432eb8..b75a10d 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -71,6 +71,6 @@ SCHED_FEAT(LB_MIN, false)
/* Do the working set probing faults: */
SCHED_FEAT(NUMA, true)
SCHED_FEAT(NUMA_FAULTS_UP, true)
-SCHED_FEAT(NUMA_FAULTS_DOWN, false)
+SCHED_FEAT(NUMA_FAULTS_DOWN, true)
SCHED_FEAT(NUMA_SETTLE, true)
#endif
--
1.7.11.7

2012-11-19 02:16:10

by Ingo Molnar

[permalink] [raw]
Subject: [PATCH 22/27] sched, numa, mm: Interleave shared tasks

Interleave tasks that are 'shared' - i.e. whose memory access patterns
indicate that they are intensively sharing memory with other tasks.

If such a task ends up converging then it switches back into the lazy
node-local policy.
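
As a rough worked example of what this means for a shared task (the
node count is illustrative, and the names below are descriptive rather
than the kernel's; the exact node choice is whatever interleave_nid()
computes in the mpol_misplaced() hunk below): on a 4-node system a
THP-backed VMA is interleaved in 2 MB chunks, so consecutive huge
pages land on nodes 0, 1, 2, 3, 0, ... - roughly

    target_nid ~= (offset_within_mapping >> shift) % nodes_in_policy

with shift = HPAGE_SHIFT for THP/hugetlb mappings and PAGE_SHIFT
otherwise.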

Cc: Peter Zijlstra <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Hugh Dickins <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
---
mm/mempolicy.c | 54 ++++++++++++++++++++++++++++++++++++++++--------------
1 file changed, 40 insertions(+), 14 deletions(-)

diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 318043a..21bbb13 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -111,12 +111,30 @@ enum zone_type policy_zone = 0;
/*
* run-time system-wide default policy => local allocation
*/
-static struct mempolicy default_policy = {
- .refcnt = ATOMIC_INIT(1), /* never free it */
- .mode = MPOL_PREFERRED,
- .flags = MPOL_F_LOCAL,
+
+static struct mempolicy default_policy_local = {
+ .refcnt = ATOMIC_INIT(1), /* never free it */
+ .mode = MPOL_PREFERRED,
+ .flags = MPOL_F_LOCAL,
+};
+
+/*
+ * .v.nodes is set by numa_policy_init():
+ */
+static struct mempolicy default_policy_shared = {
+ .refcnt = ATOMIC_INIT(1), /* never free it */
+ .mode = MPOL_INTERLEAVE,
+ .flags = 0,
};

+static struct mempolicy *default_policy(void)
+{
+ if (task_numa_shared(current) == 1)
+ return &default_policy_shared;
+
+ return &default_policy_local;
+}
+
static const struct mempolicy_operations {
int (*create)(struct mempolicy *pol, const nodemask_t *nodes);
/*
@@ -789,7 +807,7 @@ out:
static void get_policy_nodemask(struct mempolicy *p, nodemask_t *nodes)
{
nodes_clear(*nodes);
- if (p == &default_policy)
+ if (p == default_policy())
return;

switch (p->mode) {
@@ -864,7 +882,7 @@ static long do_get_mempolicy(int *policy, nodemask_t *nmask,
return -EINVAL;

if (!pol)
- pol = &default_policy; /* indicates default behavior */
+ pol = default_policy(); /* indicates default behavior */

if (flags & MPOL_F_NODE) {
if (flags & MPOL_F_ADDR) {
@@ -880,7 +898,7 @@ static long do_get_mempolicy(int *policy, nodemask_t *nmask,
goto out;
}
} else {
- *policy = pol == &default_policy ? MPOL_DEFAULT :
+ *policy = pol == default_policy() ? MPOL_DEFAULT :
pol->mode;
/*
* Internal mempolicy flags must be masked off before exposing
@@ -1568,7 +1586,7 @@ struct mempolicy *get_vma_policy(struct task_struct *task,
}
}
if (!pol)
- pol = &default_policy;
+ pol = default_policy();
return pol;
}

@@ -1974,7 +1992,7 @@ struct page *alloc_pages_current(gfp_t gfp, unsigned order)
unsigned int cpuset_mems_cookie;

if (!pol || in_interrupt() || (gfp & __GFP_THISNODE))
- pol = &default_policy;
+ pol = default_policy();

retry_cpuset:
cpuset_mems_cookie = get_mems_allowed();
@@ -2255,7 +2273,6 @@ int mpol_misplaced(struct page *page, struct vm_area_struct *vma, unsigned long
int best_nid = -1, page_nid;
int cpu_last_access, this_cpu;
struct mempolicy *pol;
- unsigned long pgoff;
struct zone *zone;

BUG_ON(!vma);
@@ -2271,13 +2288,20 @@ int mpol_misplaced(struct page *page, struct vm_area_struct *vma, unsigned long

switch (pol->mode) {
case MPOL_INTERLEAVE:
+ {
+ int shift;
+
BUG_ON(addr >= vma->vm_end);
BUG_ON(addr < vma->vm_start);

- pgoff = vma->vm_pgoff;
- pgoff += (addr - vma->vm_start) >> PAGE_SHIFT;
- best_nid = offset_il_node(pol, vma, pgoff);
+ if (transparent_hugepage_enabled(vma) || vma->vm_flags & VM_HUGETLB)
+ shift = HPAGE_SHIFT;
+ else
+ shift = PAGE_SHIFT;
+
+ best_nid = interleave_nid(pol, vma, addr, shift);
break;
+ }

case MPOL_PREFERRED:
if (pol->flags & MPOL_F_LOCAL)
@@ -2492,6 +2516,8 @@ void __init numa_policy_init(void)
sizeof(struct sp_node),
0, SLAB_PANIC, NULL);

+ default_policy_shared.v.nodes = node_online_map;
+
/*
* Set interleaving policy for system init. Interleaving is only
* enabled across suitably sized nodes (default is >= 16MB), or
@@ -2712,7 +2738,7 @@ int mpol_to_str(char *buffer, int maxlen, struct mempolicy *pol, int no_context)
*/
VM_BUG_ON(maxlen < strlen("interleave") + strlen("relative") + 16);

- if (!pol || pol == &default_policy)
+ if (!pol || pol == default_policy())
mode = MPOL_DEFAULT;
else
mode = pol->mode;
--
1.7.11.7

2012-11-19 02:16:33

by Ingo Molnar

[permalink] [raw]
Subject: [PATCH 27/27] sched: Use the best-buddy 'ideal cpu' in balancing decisions

Now that we have a notion of (one of the) best CPUs we interrelate
with in terms of memory usage, use that information to improve
can_migrate_task() balancing decisions: allow the migration to
occur even if we are locally cache-hot, if we are on another node
and want to migrate towards our best buddy's node.

( Note that this is not hard affinity - if imbalance persists long
enough then the scheduler will eventually balance tasks anyway,
to maximize CPU utilization. )
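
A concrete (hypothetical) scenario for the can_migrate_task() hunk
below: a task running on node 0 whose ideal_cpu sits on node 1, and
which has accumulated more than 1000 shared_buddy_faults, will be
pulled by a regular balancing pass onto node 1 even while it is still
cache-hot on node 0 - while a pull of that same task towards any node
other than its buddy's node is refused at that point.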

Cc: Peter Zijlstra <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Hugh Dickins <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
---
kernel/sched/fair.c | 35 ++++++++++++++++++++++++++++++++---
kernel/sched/features.h | 2 ++
2 files changed, 34 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 67f7fd2..24a5588 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -840,6 +840,14 @@ static void task_numa_migrate(struct task_struct *p, int next_cpu)
p->numa_migrate_seq = 0;
}

+static int task_ideal_cpu(struct task_struct *p)
+{
+ if (!sched_feat(IDEAL_CPU))
+ return -1;
+
+ return p->ideal_cpu;
+}
+
/*
* Called for every full scan - here we consider switching to a new
* shared buddy, if the one we found during this scan is good enough:
@@ -1028,7 +1036,7 @@ out_hit:
* but don't stop the discovery of process level sharing
* either:
*/
- if (this_task->mm == last_task->mm)
+ if (sched_feat(IDEAL_CPU_THREAD_BIAS) && this_task->mm == last_task->mm)
pages *= 2;

this_task->shared_buddy_faults_curr += pages;
@@ -1189,6 +1197,7 @@ void task_tick_numa(struct rq *rq, struct task_struct *curr)
}
#else /* !CONFIG_NUMA_BALANCING: */
#ifdef CONFIG_SMP
+static inline int task_ideal_cpu(struct task_struct *p) { return -1; }
static inline void account_numa_enqueue(struct rq *rq, struct task_struct *p) { }
#endif
static inline void account_numa_dequeue(struct rq *rq, struct task_struct *p) { }
@@ -4064,6 +4073,7 @@ struct lb_env {
static void move_task(struct task_struct *p, struct lb_env *env)
{
deactivate_task(env->src_rq, p, 0);
+
set_task_cpu(p, env->dst_cpu);
activate_task(env->dst_rq, p, 0);
check_preempt_curr(env->dst_rq, p, 0);
@@ -4242,15 +4252,17 @@ static bool can_migrate_numa_task(struct task_struct *p, struct lb_env *env)
*
* LBF_NUMA_RUN -- numa only, only allow improvement
* LBF_NUMA_SHARED -- shared only
+ * LBF_NUMA_IDEAL -- ideal only
*
* LBF_KEEP_SHARED -- do not touch shared tasks
*/

/* a numa run can only move numa tasks about to improve things */
if (env->flags & LBF_NUMA_RUN) {
- if (task_numa_shared(p) < 0)
+ if (task_numa_shared(p) < 0 && task_ideal_cpu(p) < 0)
return false;
- /* can only pull shared tasks */
+
+ /* If we are only allowed to pull shared tasks: */
if ((env->flags & LBF_NUMA_SHARED) && !task_numa_shared(p))
return false;
} else {
@@ -4307,6 +4319,23 @@ static int can_migrate_task(struct task_struct *p, struct lb_env *env)
if (!can_migrate_running_task(p, env))
return false;

+#ifdef CONFIG_NUMA_BALANCING
+ /* If we are only allowed to pull ideal tasks: */
+ if ((task_ideal_cpu(p) >= 0) && (p->shared_buddy_faults > 1000)) {
+ int ideal_node;
+ int dst_node;
+
+ BUG_ON(env->dst_cpu < 0);
+
+ ideal_node = cpu_to_node(p->ideal_cpu);
+ dst_node = cpu_to_node(env->dst_cpu);
+
+ if (ideal_node == dst_node)
+ return true;
+ return false;
+ }
+#endif
+
if (env->sd->flags & SD_NUMA)
return can_migrate_numa_task(p, env);

diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index b75a10d..737d2c8 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -66,6 +66,8 @@ SCHED_FEAT(TTWU_QUEUE, true)
SCHED_FEAT(FORCE_SD_OVERLAP, false)
SCHED_FEAT(RT_RUNTIME_SHARE, true)
SCHED_FEAT(LB_MIN, false)
+SCHED_FEAT(IDEAL_CPU, true)
+SCHED_FEAT(IDEAL_CPU_THREAD_BIAS, false)

#ifdef CONFIG_NUMA_BALANCING
/* Do the working set probing faults: */
--
1.7.11.7

2012-11-19 02:16:48

by Ingo Molnar

[permalink] [raw]
Subject: [PATCH 25/27] sched: Introduce staged average NUMA faults

The current way of building the p->numa_faults[2][node] faults
statistics has a sampling artifact:

The continuous and immediate nature of propagating new fault
stats to the numa_faults array creates a 'pulsating' dynamic
that starts at the average value at the beginning of the scan,
increases monotonically to about twice the average by the time the
scan finishes, and then drops back to half of that peak due to
the running average.

Since we rely on these values to balance tasks, the pulsating
nature resulted in false migrations and general noise in the
stats.
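
As a rough numeric illustration (the numbers are made up): if a node's
settled per-scan average is 100 faults, immediate propagation lets the
running average climb towards ~200 while the scan streams new faults
in, and the per-scan halving then pulls it back towards ~100 - so the
value the balancer reads oscillates by roughly 2x every scan period.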

To solve this, introduce buffering of the current scan via
p->numa_faults_curr[]. The array is co-allocated with
p->numa_faults[] for efficiency reasons, but it is otherwise an
ordinary separate array.

At the end of the scan we propagate the latest stats into the
average stats value. Most of the balancing code stays unmodified.
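
A minimal sketch of the resulting flow (simplified - field names follow
this series, the allocation and the surrounding placement logic are
omitted):

	/* Fault path: only the staging buffer is updated during a scan: */
	p->numa_faults_curr[idx] += pages;

	/* Once per completed scan, fold the staged counts into the average: */
	for (idx = 0; idx < 2*nr_node_ids; idx++) {
		p->numa_faults[idx] = (p->numa_faults[idx] + p->numa_faults_curr[idx]) / 2;
		p->numa_faults_curr[idx] = 0;
	}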

The cost of this change is that we delay the effects of the latest
round of faults by 1 scan - but using the partial faults info was
creating artifacts.

This instantly stabilized the page fault stats and improved
numa02-alike workloads by making them faster to converge.

Cc: Peter Zijlstra <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Hugh Dickins <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
---
include/linux/sched.h | 1 +
kernel/sched/fair.c | 20 +++++++++++++++++---
2 files changed, 18 insertions(+), 3 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 8f65323..92b41b4 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1511,6 +1511,7 @@ struct task_struct {
u64 node_stamp; /* migration stamp */
unsigned long numa_weight;
unsigned long *numa_faults;
+ unsigned long *numa_faults_curr;
struct callback_head numa_work;
#endif /* CONFIG_NUMA_BALANCING */

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 9c46b45..1ab11be 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -852,12 +852,26 @@ static void task_numa_placement(struct task_struct *p)

p->numa_scan_seq = seq;

+ /*
+ * Update the fault average with the result of the latest
+ * scan:
+ */
for (node = 0; node < nr_node_ids; node++) {
faults = 0;
for (priv = 0; priv < 2; priv++) {
- faults += p->numa_faults[2*node + priv];
- total[priv] += p->numa_faults[2*node + priv];
- p->numa_faults[2*node + priv] /= 2;
+ unsigned int new_faults;
+ unsigned int idx;
+
+ idx = 2*node + priv;
+ new_faults = p->numa_faults_curr[idx];
+ p->numa_faults_curr[idx] = 0;
+
+ /* Keep a simple running average: */
+ p->numa_faults[idx] += new_faults;
+ p->numa_faults[idx] /= 2;
+
+ faults += p->numa_faults[idx];
+ total[priv] += p->numa_faults[idx];
}
if (faults > max_faults) {
max_faults = faults;
--
1.7.11.7

2012-11-19 02:16:46

by Ingo Molnar

[permalink] [raw]
Subject: [PATCH 26/27] sched: Track groups of shared tasks

To be able to cluster memory-related tasks more efficiently, introduce
a new metric that tracks the 'best' buddy task.

Track our "memory buddies": the tasks we actively share memory with.

Firstly we establish the identity of some other task that we are
sharing memory with by looking at rq[page::last_cpu].curr - i.e.
we check the task that is running on that CPU right now.

This is not entirely correct, as that task might have scheduled away or
migrated elsewhere in the meantime - but statistically there will be
correlation to the tasks that we share memory with, and correlation is
all we need.

We map out the relation itself by picking the highest-address task
that is below our own task's address, per working set scan
iteration.

This creates a natural ordering relation between groups of tasks:

t1 < t2 < t3 < t4

t1->memory_buddy == NULL
t2->memory_buddy == t1
t3->memory_buddy == t2
t4->memory_buddy == t3

The load-balancer can then use this information to speed up NUMA
convergence, by moving such tasks together if capacity and load
constraints allow it.

(This is all statistical so there are no preemption or locking worries.)
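
One detail worth spelling out from the hunks below: when the buddy
found in the current scan is not better than the current one, the
average is decayed as

    shared_buddy_faults = (3*shared_buddy_faults + shared_buddy_faults_curr) / 4

e.g. an old average of 400 with a current-scan rate of 200 decays to
350 - so a buddy that stops sharing memory with us is eventually
displaced by a higher-rate candidate.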

Cc: Peter Zijlstra <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Hugh Dickins <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
---
include/linux/sched.h | 5 ++
kernel/sched/core.c | 5 ++
kernel/sched/fair.c | 144 ++++++++++++++++++++++++++++++++++++++++++++++++--
3 files changed, 151 insertions(+), 3 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 92b41b4..be73297 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1513,6 +1513,11 @@ struct task_struct {
unsigned long *numa_faults;
unsigned long *numa_faults_curr;
struct callback_head numa_work;
+
+ struct task_struct *shared_buddy, *shared_buddy_curr;
+ unsigned long shared_buddy_faults, shared_buddy_faults_curr;
+ int ideal_cpu, ideal_cpu_curr;
+
#endif /* CONFIG_NUMA_BALANCING */

struct rcu_head rcu;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index ec3cc74..39cf991 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1558,6 +1558,11 @@ static void __sched_fork(struct task_struct *p)
p->numa_faults = NULL;
p->numa_scan_period = sysctl_sched_numa_scan_delay;
p->numa_work.next = &p->numa_work;
+
+ p->shared_buddy = NULL;
+ p->shared_buddy_faults = 0;
+ p->ideal_cpu = -1;
+ p->ideal_cpu_curr = -1;
#endif /* CONFIG_NUMA_BALANCING */
}

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 1ab11be..67f7fd2 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -840,6 +840,43 @@ static void task_numa_migrate(struct task_struct *p, int next_cpu)
p->numa_migrate_seq = 0;
}

+/*
+ * Called for every full scan - here we consider switching to a new
+ * shared buddy, if the one we found during this scan is good enough:
+ */
+static void shared_fault_full_scan_done(struct task_struct *p)
+{
+ /*
+ * If we have a new maximum rate buddy task then pick it
+ * as our new best friend:
+ */
+ if (p->shared_buddy_faults_curr > p->shared_buddy_faults) {
+ WARN_ON_ONCE(!p->shared_buddy_curr);
+ p->shared_buddy = p->shared_buddy_curr;
+ p->shared_buddy_faults = p->shared_buddy_faults_curr;
+ p->ideal_cpu = p->ideal_cpu_curr;
+
+ goto clear_buddy;
+ }
+ /*
+ * If the new buddy is lower rate than the previous average
+ * fault rate then don't switch buddies yet but lower the average by
+ * averaging in the new rate, with a 1/3 weight.
+ *
+ * Eventually, if the current buddy is not a buddy anymore
+ * then we'll switch away from it: a higher rate buddy will
+ * replace it.
+ */
+ p->shared_buddy_faults *= 3;
+ p->shared_buddy_faults += p->shared_buddy_faults_curr;
+ p->shared_buddy_faults /= 4;
+
+clear_buddy:
+ p->shared_buddy_curr = NULL;
+ p->shared_buddy_faults_curr = 0;
+ p->ideal_cpu_curr = -1;
+}
+
static void task_numa_placement(struct task_struct *p)
{
int seq = ACCESS_ONCE(p->mm->numa_scan_seq);
@@ -852,6 +889,8 @@ static void task_numa_placement(struct task_struct *p)

p->numa_scan_seq = seq;

+ shared_fault_full_scan_done(p);
+
/*
* Update the fault average with the result of the latest
* scan:
@@ -906,23 +945,122 @@ out_backoff:
}

/*
+ * Track our "memory buddies" the tasks we actively share memory with.
+ *
+ * Firstly we establish the identity of some other task that we are
+ * sharing memory with by looking at rq[page::last_cpu].curr - i.e.
+ * we check the task that is running on that CPU right now.
+ *
+ * This is not entirely correct, as that task might have scheduled away or
+ * migrated elsewhere - but statistically there will be correlation to the
+ * tasks that we share memory with, and correlation is all we need.
+ *
+ * We map out the relation itself by picking the highest-address task
+ * that is below our own task's address, per working set scan
+ * iteration.
+ *
+ * This creates a natural ordering relation between groups of tasks:
+ *
+ * t1 < t2 < t3 < t4
+ *
+ * t1->memory_buddy == NULL
+ * t2->memory_buddy == t1
+ * t3->memory_buddy == t2
+ * t4->memory_buddy == t3
+ *
+ * The load-balancer can then use this information to speed up NUMA
+ * convergence, by moving such tasks together if capacity and load
+ * constraints allow it.
+ *
+ * (This is all statistical so there are no preemption or locking worries.)
+ */
+static void shared_fault_tick(struct task_struct *this_task, int node, int last_cpu, int pages)
+{
+ struct task_struct *last_task;
+ struct rq *last_rq;
+ int last_node;
+ int this_node;
+ int this_cpu;
+
+ last_node = cpu_to_node(last_cpu);
+ this_cpu = raw_smp_processor_id();
+ this_node = cpu_to_node(this_cpu);
+
+ /* Ignore private memory access faults: */
+ if (last_cpu == this_cpu)
+ return;
+
+ /*
+ * Ignore accesses from foreign nodes to our memory.
+ *
+ * Yet still recognize tasks accessing a third node - i.e. one that is
+ * remote to both of them.
+ */
+ if (node != this_node)
+ return;
+
+ /* We are in a shared fault - see which task we relate to: */
+ last_rq = cpu_rq(last_cpu);
+ last_task = last_rq->curr;
+
+ /* Task might be gone from that runqueue already: */
+ if (!last_task || last_task == last_rq->idle)
+ return;
+
+ if (last_task == this_task->shared_buddy_curr)
+ goto out_hit;
+
+ /* Order our memory buddies by address: */
+ if (last_task >= this_task)
+ return;
+
+ if (this_task->shared_buddy_curr > last_task)
+ return;
+
+ /* New shared buddy! */
+ this_task->shared_buddy_curr = last_task;
+ this_task->shared_buddy_faults_curr = 0;
+ this_task->ideal_cpu_curr = last_rq->cpu;
+
+out_hit:
+ /*
+ * Give threads that we share a process with an advantage,
+ * but don't stop the discovery of process level sharing
+ * either:
+ */
+ if (this_task->mm == last_task->mm)
+ pages *= 2;
+
+ this_task->shared_buddy_faults_curr += pages;
+}
+
+/*
* Got a PROT_NONE fault for a page on @node.
*/
void task_numa_fault(int node, int last_cpu, int pages)
{
struct task_struct *p = current;
int priv = (task_cpu(p) == last_cpu);
+ int idx = 2*node + priv;

if (unlikely(!p->numa_faults)) {
- int size = sizeof(*p->numa_faults) * 2 * nr_node_ids;
+ int entries = 2*nr_node_ids;
+ int size = sizeof(*p->numa_faults) * entries;

- p->numa_faults = kzalloc(size, GFP_KERNEL);
+ p->numa_faults = kzalloc(2*size, GFP_KERNEL);
if (!p->numa_faults)
return;
+ /*
+ * For efficiency reasons we allocate ->numa_faults[]
+ * and ->numa_faults_curr[] at once and split the
+ * buffer we get. They are separate otherwise.
+ */
+ p->numa_faults_curr = p->numa_faults + entries;
}

+ p->numa_faults_curr[idx] += pages;
+ shared_fault_tick(p, node, last_cpu, pages);
task_numa_placement(p);
- p->numa_faults[2*node + priv] += pages;
}

/*
--
1.7.11.7

2012-11-19 02:17:24

by Ingo Molnar

[permalink] [raw]
Subject: [PATCH 23/27] sched: Implement NUMA scanning backoff

Back off slowly from scanning, up to sysctl_sched_numa_scan_period_max
(1.6 seconds). Scan faster again if we were forced to switch to
another node.

This makes sure that workloads in equilibrium don't get scanned as often
as workloads that are still converging.
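
As a worked example with the defaults in this series (100 ms minimum,
1600 ms maximum): the scan period backs off by doubling, 100 -> 200 ->
400 -> 800 -> 1600 ms, where it is clamped; a sched_setnuma() call -
i.e. the task being switched to another home node - resets it to the
100 ms minimum, so the task is sampled densely again while it
re-converges.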

Cc: Peter Zijlstra <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Hugh Dickins <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
---
kernel/sched/core.c | 6 ++++++
kernel/sched/fair.c | 8 +++++++-
2 files changed, 13 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index af0602f..ec3cc74 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6024,6 +6024,12 @@ void sched_setnuma(struct task_struct *p, int node, int shared)
if (on_rq)
enqueue_task(rq, p, 0);
task_rq_unlock(rq, p, &flags);
+
+ /*
+ * Reset the scanning period. If the task converges
+ * on this node then we'll back off again:
+ */
+ p->numa_scan_period = sysctl_sched_numa_scan_period_min;
}

#endif /* CONFIG_NUMA_BALANCING */
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8f0e6ba..59fea2e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -865,8 +865,10 @@ static void task_numa_placement(struct task_struct *p)
}
}

- if (max_node != p->numa_max_node)
+ if (max_node != p->numa_max_node) {
sched_setnuma(p, max_node, task_numa_shared(p));
+ goto out_backoff;
+ }

p->numa_migrate_seq++;
if (sched_feat(NUMA_SETTLE) &&
@@ -882,7 +884,11 @@ static void task_numa_placement(struct task_struct *p)
if (shared != task_numa_shared(p)) {
sched_setnuma(p, p->numa_max_node, shared);
p->numa_migrate_seq = 0;
+ goto out_backoff;
}
+ return;
+out_backoff:
+ p->numa_scan_period = min(p->numa_scan_period * 2, sysctl_sched_numa_scan_period_max);
}

/*
--
1.7.11.7

2012-11-19 02:17:57

by Ingo Molnar

[permalink] [raw]
Subject: [PATCH 21/27] sched: Implement slow start for working set sampling

From: Peter Zijlstra <[email protected]>

Add a 1 second delay before starting to scan the working set of
a task and starting to balance it amongst nodes.

[ note that before the constant per task WSS sampling rate patch
the initial scan would happen much later still, in effect that
patch caused this regression. ]

The theory is that short-run tasks benefit very little from NUMA
placement: they come and go, and they are better off sticking to the
node they were started on. As tasks mature and rebalance to other CPUs
and nodes, their NUMA placement has to change as well, and it starts
to matter more and more.

In practice this change fixes an observable kbuild regression:

# [ a perf stat --null --repeat 10 test of ten bzImage builds to /dev/shm ]

!NUMA:
45.291088843 seconds time elapsed ( +- 0.40% )
45.154231752 seconds time elapsed ( +- 0.36% )

+NUMA, no slow start:
46.172308123 seconds time elapsed ( +- 0.30% )
46.343168745 seconds time elapsed ( +- 0.25% )

+NUMA, 1 sec slow start:
45.224189155 seconds time elapsed ( +- 0.25% )
45.160866532 seconds time elapsed ( +- 0.17% )

and it also fixes an observable perf bench (hackbench) regression:

# perf stat --null --repeat 10 perf bench sched messaging

-NUMA: 0.246225691 seconds time elapsed ( +- 1.31% )
+NUMA no slow start: 0.252620063 seconds time elapsed ( +- 1.13% )

+NUMA 1sec delay: 0.248076230 seconds time elapsed ( +- 1.35% )

The implementation is simple and straightforward; most of the patch
deals with adding the /proc/sys/kernel/sched_numa_scan_delay_ms tunable
knob.
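
A stand-alone, user-space sketch of the slow-start logic described
above; the constants match the defaults in this patch, but the
structure and helper names are simplified stand-ins, not the real
scheduler code:

#include <stdio.h>

/* Defaults from this patch (ms); purely illustrative. */
static unsigned int scan_delay      = 1000;
static unsigned int scan_period_min = 100;

struct toy_task {
        unsigned int scan_period;       /* doubles as the initial delay */
        unsigned long long node_stamp;  /* last scan timestamp, in ns */
};

/* At fork: the first "period" is the 1 second slow-start delay. */
static void toy_fork(struct toy_task *t)
{
        t->scan_period = scan_delay;
        t->node_stamp = 0;
}

/* Per-tick check: once the delay elapses, fall back to the normal minimum. */
static int toy_tick(struct toy_task *t, unsigned long long now_ns)
{
        unsigned long long period = (unsigned long long)t->scan_period * 1000000ULL;

        if (now_ns - t->node_stamp > period) {
                t->node_stamp += period;
                t->scan_period = scan_period_min;
                return 1;       /* queue a working set scan */
        }
        return 0;
}

int main(void)
{
        struct toy_task t;
        unsigned long long now;

        toy_fork(&t);
        for (now = 0; now <= 1300000000ULL; now += 100000000ULL)  /* 100 ms ticks */
                if (toy_tick(&t, now))
                        printf("scan queued at %llu ms, period now %u ms\n",
                               now / 1000000ULL, t.scan_period);
        return 0;
}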

Signed-off-by: Peter Zijlstra <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Hugh Dickins <[email protected]>
Link: http://lkml.kernel.org/n/[email protected]
[ Wrote the changelog, ran measurements, tuned the default. ]
Signed-off-by: Ingo Molnar <[email protected]>
---
include/linux/sched.h | 1 +
kernel/sched/core.c | 2 +-
kernel/sched/fair.c | 16 ++++++++++------
kernel/sysctl.c | 7 +++++++
4 files changed, 19 insertions(+), 7 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 3372aac..8f65323 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2045,6 +2045,7 @@ enum sched_tunable_scaling {
};
extern enum sched_tunable_scaling sysctl_sched_tunable_scaling;

+extern unsigned int sysctl_sched_numa_scan_delay;
extern unsigned int sysctl_sched_numa_scan_period_min;
extern unsigned int sysctl_sched_numa_scan_period_max;
extern unsigned int sysctl_sched_numa_scan_size;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 7b58366..af0602f 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1556,7 +1556,7 @@ static void __sched_fork(struct task_struct *p)
p->numa_scan_seq = p->mm ? p->mm->numa_scan_seq : 0;
p->numa_migrate_seq = 2;
p->numa_faults = NULL;
- p->numa_scan_period = sysctl_sched_numa_scan_period_min;
+ p->numa_scan_period = sysctl_sched_numa_scan_delay;
p->numa_work.next = &p->numa_work;
#endif /* CONFIG_NUMA_BALANCING */
}
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index da28315..8f0e6ba 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -823,11 +823,12 @@ static void account_numa_dequeue(struct rq *rq, struct task_struct *p)
}

/*
- * numa task sample period in ms: 5s
+ * Scan @scan_size MB every @scan_period after an initial @scan_delay.
*/
-unsigned int sysctl_sched_numa_scan_period_min = 100;
-unsigned int sysctl_sched_numa_scan_period_max = 100*16;
-unsigned int sysctl_sched_numa_scan_size = 256; /* MB */
+unsigned int sysctl_sched_numa_scan_delay = 1000; /* ms */
+unsigned int sysctl_sched_numa_scan_period_min = 100; /* ms */
+unsigned int sysctl_sched_numa_scan_period_max = 100*16;/* ms */
+unsigned int sysctl_sched_numa_scan_size = 256; /* MB */

/*
* Wait for the 2-sample stuff to settle before migrating again
@@ -938,10 +939,12 @@ void task_numa_work(struct callback_head *work)
if (time_before(now, migrate))
return;

- next_scan = now + 2*msecs_to_jiffies(sysctl_sched_numa_scan_period_min);
+ next_scan = now + msecs_to_jiffies(sysctl_sched_numa_scan_period_min);
if (cmpxchg(&mm->numa_next_scan, migrate, next_scan) != migrate)
return;

+ current->numa_scan_period += jiffies_to_msecs(2);
+
start = mm->numa_scan_offset;
pages = sysctl_sched_numa_scan_size;
pages <<= 20 - PAGE_SHIFT; /* MB in pages */
@@ -998,7 +1001,8 @@ void task_tick_numa(struct rq *rq, struct task_struct *curr)
period = (u64)curr->numa_scan_period * NSEC_PER_MSEC;

if (now - curr->node_stamp > period) {
- curr->node_stamp = now;
+ curr->node_stamp += period;
+ curr->numa_scan_period = sysctl_sched_numa_scan_period_min;

/*
* We are comparing runtime to wall clock time here, which
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index a14b8a4..6d2fe5b 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -353,6 +353,13 @@ static struct ctl_table kern_table[] = {
#endif /* CONFIG_SMP */
#ifdef CONFIG_NUMA_BALANCING
{
+ .procname = "sched_numa_scan_delay_ms",
+ .data = &sysctl_sched_numa_scan_delay,
+ .maxlen = sizeof(unsigned int),
+ .mode = 0644,
+ .proc_handler = proc_dointvec,
+ },
+ {
.procname = "sched_numa_scan_period_min_ms",
.data = &sysctl_sched_numa_scan_period_min,
.maxlen = sizeof(unsigned int),
--
1.7.11.7

2012-11-19 02:16:05

by Ingo Molnar

[permalink] [raw]
Subject: [PATCH 19/27] sched: Implement constant, per task Working Set Sampling (WSS) rate

From: Peter Zijlstra <[email protected]>

Previously, to probe the working set of a task, we'd use
a very simple and crude method: mark all of its address
space PROT_NONE.

That method has various (obvious) disadvantages:

- it samples the working set at dissimilar rates,
giving some tasks a sampling quality advantage
over others.

- it creates performance problems for tasks with very
large working sets

- it over-samples processes with large address spaces
that only very rarely execute

Improve that method by keeping a rotating offset into the
address space that marks the current position of the scan, and
by advancing it at a constant rate, proportionally to the CPU
cycles the task executes. If the offset reaches the last mapped
address of the mm, it starts over at the first address.

The per-task nature of the working set sampling functionality
in this tree allows such constant rate, per task,
execution-weight proportional sampling of the working set,
with an adaptive sampling interval/frequency that goes from
once per 100 msecs up to just once per 1.6 seconds.
The current sampling volume is 256 MB per interval.

As tasks mature and their working set converges, the sampling
rate slows down to just a trickle: 256 MB per 1.6 seconds of CPU
time executed.

Beyond being adaptive, this also rate-limits rarely executing
tasks and avoids over-sampling on overloaded systems.
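
At the defaults quoted above, that works out to marking at most
256 MB of address space per 100 ms of task execution in the fast
case, falling to 256 MB per 1.6 s (roughly 160 MB per second of CPU
time) once a task has converged. The following stand-alone sketch
models only the rotating-offset bookkeeping; the address ranges,
per-pass budget and helper names are invented for illustration and
are not the kernel code:

#include <stdio.h>

#define NR_VMAS 3

/* Toy address space: three mapped ranges with holes in between. */
struct toy_vma { unsigned long start, end; };

static struct toy_vma vmas[NR_VMAS] = {
        { 0x00000000, 0x00400000 },
        { 0x00800000, 0x01800000 },
        { 0x02000000, 0x02100000 },
};

static unsigned long scan_offset;       /* persists across scan passes */

/* One scan pass: cover at most 'budget' bytes, starting where the
 * previous pass stopped, wrapping to the first mapping at the end. */
static void scan_pass(unsigned long budget)
{
        int i, started = 0;

        for (i = 0; i < NR_VMAS && budget; i++) {
                unsigned long start, end, len;

                if (vmas[i].end <= scan_offset && !started)
                        continue;
                started = 1;

                start = scan_offset > vmas[i].start ? scan_offset : vmas[i].start;
                len = vmas[i].end - start;
                if (len > budget)
                        len = budget;
                end = start + len;

                printf("  mark [%#lx, %#lx) for NUMA hinting faults\n", start, end);
                budget -= len;
                scan_offset = end;
        }
        if (i == NR_VMAS && budget)     /* ran off the end: wrap around */
                scan_offset = 0;
}

int main(void)
{
        int pass;

        for (pass = 0; pass < 5; pass++) {
                printf("pass %d (offset %#lx):\n", pass, scan_offset);
                scan_pass(0x00800000);  /* fixed per-pass budget */
        }
        return 0;
}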

[ In AutoNUMA speak, this patch deals with the effective sampling
rate of the 'hinting page fault'. AutoNUMA's scanning is
currently rate-limited, but it is also fundamentally
single-threaded, executing in the knuma_scand kernel thread,
so the limit in AutoNUMA is global and does not scale up with
the number of CPUs, nor does it scan tasks in an execution
proportional manner.

So the idea of rate-limiting the scanning was first implemented
in the AutoNUMA tree via a global rate limit. This patch goes
beyond that by implementing an execution rate proportional
working set sampling rate that is not implemented via a single
global scanning daemon. ]

[ Dan Carpenter pointed out a possible NULL pointer dereference in the
first version of this patch. ]

Based-on-idea-by: Andrea Arcangeli <[email protected]>
Bug-Found-By: Dan Carpenter <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Hugh Dickins <[email protected]>
Link: http://lkml.kernel.org/n/[email protected]
[ Wrote changelog and fixed bug. ]
Signed-off-by: Ingo Molnar <[email protected]>
---
include/linux/mm_types.h | 1 +
include/linux/sched.h | 1 +
kernel/sched/fair.c | 41 +++++++++++++++++++++++++++++------------
kernel/sysctl.c | 7 +++++++
4 files changed, 38 insertions(+), 12 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 48760e9..5995652 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -405,6 +405,7 @@ struct mm_struct {
#endif
#ifdef CONFIG_NUMA_BALANCING
unsigned long numa_next_scan;
+ unsigned long numa_scan_offset;
int numa_scan_seq;
#endif
struct uprobes_state uprobes_state;
diff --git a/include/linux/sched.h b/include/linux/sched.h
index bb12cc3..3372aac 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2047,6 +2047,7 @@ extern enum sched_tunable_scaling sysctl_sched_tunable_scaling;

extern unsigned int sysctl_sched_numa_scan_period_min;
extern unsigned int sysctl_sched_numa_scan_period_max;
+extern unsigned int sysctl_sched_numa_scan_size;
extern unsigned int sysctl_sched_numa_settle_count;

#ifdef CONFIG_SCHED_DEBUG
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f3aeaac..151a3cd 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -825,8 +825,9 @@ static void account_numa_dequeue(struct rq *rq, struct task_struct *p)
/*
* numa task sample period in ms: 5s
*/
-unsigned int sysctl_sched_numa_scan_period_min = 5000;
-unsigned int sysctl_sched_numa_scan_period_max = 5000*16;
+unsigned int sysctl_sched_numa_scan_period_min = 100;
+unsigned int sysctl_sched_numa_scan_period_max = 100*16;
+unsigned int sysctl_sched_numa_scan_size = 256; /* MB */

/*
* Wait for the 2-sample stuff to settle before migrating again
@@ -912,6 +913,9 @@ void task_numa_work(struct callback_head *work)
unsigned long migrate, next_scan, now = jiffies;
struct task_struct *p = current;
struct mm_struct *mm = p->mm;
+ struct vm_area_struct *vma;
+ unsigned long offset, end;
+ long length;

WARN_ON_ONCE(p != container_of(work, struct task_struct, numa_work));

@@ -938,18 +942,31 @@ void task_numa_work(struct callback_head *work)
if (cmpxchg(&mm->numa_next_scan, migrate, next_scan) != migrate)
return;

- ACCESS_ONCE(mm->numa_scan_seq)++;
- {
- struct vm_area_struct *vma;
+ offset = mm->numa_scan_offset;
+ length = sysctl_sched_numa_scan_size;
+ length <<= 20;

- down_write(&mm->mmap_sem);
- for (vma = mm->mmap; vma; vma = vma->vm_next) {
- if (!vma_migratable(vma))
- continue;
- change_prot_numa(vma, vma->vm_start, vma->vm_end);
- }
- up_write(&mm->mmap_sem);
+ down_write(&mm->mmap_sem);
+ vma = find_vma(mm, offset);
+ if (!vma) {
+ ACCESS_ONCE(mm->numa_scan_seq)++;
+ offset = 0;
+ vma = mm->mmap;
+ }
+ for (; vma && length > 0; vma = vma->vm_next) {
+ if (!vma_migratable(vma))
+ continue;
+
+ offset = max(offset, vma->vm_start);
+ end = min(ALIGN(offset + length, HPAGE_SIZE), vma->vm_end);
+ length -= end - offset;
+
+ change_prot_numa(vma, offset, end);
+
+ offset = end;
}
+ mm->numa_scan_offset = offset;
+ up_write(&mm->mmap_sem);
}

/*
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 7736b9e..a14b8a4 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -367,6 +367,13 @@ static struct ctl_table kern_table[] = {
.proc_handler = proc_dointvec,
},
{
+ .procname = "sched_numa_scan_size_mb",
+ .data = &sysctl_sched_numa_scan_size,
+ .maxlen = sizeof(unsigned int),
+ .mode = 0644,
+ .proc_handler = proc_dointvec,
+ },
+ {
.procname = "sched_numa_settle_count",
.data = &sysctl_sched_numa_settle_count,
.maxlen = sizeof(unsigned int),
--
1.7.11.7

2012-11-19 02:18:27

by Ingo Molnar

[permalink] [raw]
Subject: [PATCH 20/27] sched, numa, mm: Count WS scanning against present PTEs, not virtual memory ranges

From: Peter Zijlstra <[email protected]>

By accounting against the present PTEs, scanning speed reflects the
actual present (mapped) memory.
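
In other words, the per-pass budget is now charged with the number of
PTEs actually rewritten instead of the virtual span walked, so sparse
mappings no longer exhaust it. A small stand-alone example of the
difference, using an invented 1 GB mapping that is only 1/8 populated:

#include <stdio.h>

int main(void)
{
        const long page_size  = 4096;
        const long span_pages = (1024L << 20) / page_size;  /* virtual span  */
        const long present    = span_pages / 8;             /* mapped PTEs   */
        long budget           = (256L << 20) / page_size;   /* 256 MB / pass */

        printf("charging the virtual span : %ld passes to cover the range\n",
               (span_pages + budget - 1) / budget);
        printf("charging present PTEs only: %ld passes to cover the range\n",
               (present + budget - 1) / budget);
        return 0;
}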

Suggested-by: Ingo Molnar <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Hugh Dickins <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
---
kernel/sched/fair.c | 37 +++++++++++++++++++++----------------
1 file changed, 21 insertions(+), 16 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 151a3cd..da28315 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -914,8 +914,8 @@ void task_numa_work(struct callback_head *work)
struct task_struct *p = current;
struct mm_struct *mm = p->mm;
struct vm_area_struct *vma;
- unsigned long offset, end;
- long length;
+ unsigned long start, end;
+ long pages;

WARN_ON_ONCE(p != container_of(work, struct task_struct, numa_work));

@@ -942,30 +942,35 @@ void task_numa_work(struct callback_head *work)
if (cmpxchg(&mm->numa_next_scan, migrate, next_scan) != migrate)
return;

- offset = mm->numa_scan_offset;
- length = sysctl_sched_numa_scan_size;
- length <<= 20;
+ start = mm->numa_scan_offset;
+ pages = sysctl_sched_numa_scan_size;
+ pages <<= 20 - PAGE_SHIFT; /* MB in pages */
+ if (!pages)
+ return;

down_write(&mm->mmap_sem);
- vma = find_vma(mm, offset);
+ vma = find_vma(mm, start);
if (!vma) {
ACCESS_ONCE(mm->numa_scan_seq)++;
- offset = 0;
+ start = 0;
vma = mm->mmap;
}
- for (; vma && length > 0; vma = vma->vm_next) {
+ for (; vma; vma = vma->vm_next) {
if (!vma_migratable(vma))
continue;

- offset = max(offset, vma->vm_start);
- end = min(ALIGN(offset + length, HPAGE_SIZE), vma->vm_end);
- length -= end - offset;
-
- change_prot_numa(vma, offset, end);
-
- offset = end;
+ do {
+ start = max(start, vma->vm_start);
+ end = ALIGN(start + (pages << PAGE_SHIFT), HPAGE_SIZE);
+ end = min(end, vma->vm_end);
+ pages -= change_prot_numa(vma, start, end);
+ start = end;
+ if (pages <= 0)
+ goto out;
+ } while (end != vma->vm_end);
}
- mm->numa_scan_offset = offset;
+out:
+ mm->numa_scan_offset = start;
up_write(&mm->mmap_sem);
}

--
1.7.11.7

2012-11-19 02:18:47

by Ingo Molnar

[permalink] [raw]
Subject: [PATCH 17/27] sched, numa, mm: Add the scanning page fault machinery

From: Peter Zijlstra <[email protected]>

Add the NUMA working set scanning/hinting page fault machinery,
with no policy yet.

[ The earliest versions had the mpol_misplaced() function from
Lee Schermerhorn - this was heavily modified later on. ]

Also-written-by: Lee Schermerhorn <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Hugh Dickins <[email protected]>
[ split it out of the main policy patch - as suggested by Mel Gorman ]
Signed-off-by: Ingo Molnar <[email protected]>
---
include/linux/init_task.h | 8 +++
include/linux/mempolicy.h | 6 +-
include/linux/mm_types.h | 4 ++
include/linux/sched.h | 41 ++++++++++++--
init/Kconfig | 73 +++++++++++++++++++-----
kernel/sched/core.c | 15 +++++
kernel/sysctl.c | 31 ++++++++++-
mm/huge_memory.c | 1 +
mm/mempolicy.c | 137 ++++++++++++++++++++++++++++++++++++++++++++++
9 files changed, 294 insertions(+), 22 deletions(-)

diff --git a/include/linux/init_task.h b/include/linux/init_task.h
index 6d087c5..ed98982 100644
--- a/include/linux/init_task.h
+++ b/include/linux/init_task.h
@@ -143,6 +143,13 @@ extern struct task_group root_task_group;

#define INIT_TASK_COMM "swapper"

+#ifdef CONFIG_NUMA_BALANCING
+# define INIT_TASK_NUMA(tsk) \
+ .numa_shared = -1,
+#else
+# define INIT_TASK_NUMA(tsk)
+#endif
+
/*
* INIT_TASK is used to set up the first task table, touch at
* your own risk!. Base=0, limit=0x1fffff (=2MB)
@@ -210,6 +217,7 @@ extern struct task_group root_task_group;
INIT_TRACE_RECURSION \
INIT_TASK_RCU_PREEMPT(tsk) \
INIT_CPUSET_SEQ \
+ INIT_TASK_NUMA(tsk) \
}


diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
index f329306..c511e25 100644
--- a/include/linux/mempolicy.h
+++ b/include/linux/mempolicy.h
@@ -198,6 +198,8 @@ static inline int vma_migratable(struct vm_area_struct *vma)
return 1;
}

+extern int mpol_misplaced(struct page *, struct vm_area_struct *, unsigned long);
+
#else

struct mempolicy {};
@@ -323,11 +325,11 @@ static inline int mpol_to_str(char *buffer, int maxlen, struct mempolicy *pol,
return 0;
}

-#endif /* CONFIG_NUMA */
-
static inline int mpol_misplaced(struct page *page, struct vm_area_struct *vma,
unsigned long address)
{
return -1; /* no node preference */
}
+
+#endif /* CONFIG_NUMA */
#endif
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 7e9f758..48760e9 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -403,6 +403,10 @@ struct mm_struct {
#ifdef CONFIG_CPUMASK_OFFSTACK
struct cpumask cpumask_allocation;
#endif
+#ifdef CONFIG_NUMA_BALANCING
+ unsigned long numa_next_scan;
+ int numa_scan_seq;
+#endif
struct uprobes_state uprobes_state;
};

diff --git a/include/linux/sched.h b/include/linux/sched.h
index a0a2808..418d405 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1501,6 +1501,18 @@ struct task_struct {
short il_next;
short pref_node_fork;
#endif
+#ifdef CONFIG_NUMA_BALANCING
+ int numa_shared;
+ int numa_max_node;
+ int numa_scan_seq;
+ int numa_migrate_seq;
+ unsigned int numa_scan_period;
+ u64 node_stamp; /* migration stamp */
+ unsigned long numa_weight;
+ unsigned long *numa_faults;
+ struct callback_head numa_work;
+#endif /* CONFIG_NUMA_BALANCING */
+
struct rcu_head rcu;

/*
@@ -1575,7 +1587,25 @@ struct task_struct {
/* Future-safe accessor for struct task_struct's cpus_allowed. */
#define tsk_cpus_allowed(tsk) (&(tsk)->cpus_allowed)

+#ifdef CONFIG_NUMA_BALANCING
+extern void task_numa_fault(int node, int cpu, int pages);
+#else
static inline void task_numa_fault(int node, int cpu, int pages) { }
+#endif /* CONFIG_NUMA_BALANCING */
+
+/*
+ * -1: non-NUMA task
+ * 0: NUMA task with a dominantly 'private' working set
+ * 1: NUMA task with a dominantly 'shared' working set
+ */
+static inline int task_numa_shared(struct task_struct *p)
+{
+#ifdef CONFIG_NUMA_BALANCING
+ return p->numa_shared;
+#else
+ return -1;
+#endif
+}

/*
* Priority of a process goes from 0..MAX_PRIO-1, valid RT
@@ -2014,6 +2044,10 @@ enum sched_tunable_scaling {
};
extern enum sched_tunable_scaling sysctl_sched_tunable_scaling;

+extern unsigned int sysctl_sched_numa_scan_period_min;
+extern unsigned int sysctl_sched_numa_scan_period_max;
+extern unsigned int sysctl_sched_numa_settle_count;
+
#ifdef CONFIG_SCHED_DEBUG
extern unsigned int sysctl_sched_migration_cost;
extern unsigned int sysctl_sched_nr_migrate;
@@ -2024,18 +2058,17 @@ extern unsigned int sysctl_sched_shares_window;
int sched_proc_update_handler(struct ctl_table *table, int write,
void __user *buffer, size_t *length,
loff_t *ppos);
-#endif
-#ifdef CONFIG_SCHED_DEBUG
+
static inline unsigned int get_sysctl_timer_migration(void)
{
return sysctl_timer_migration;
}
-#else
+#else /* CONFIG_SCHED_DEBUG */
static inline unsigned int get_sysctl_timer_migration(void)
{
return 1;
}
-#endif
+#endif /* CONFIG_SCHED_DEBUG */
extern unsigned int sysctl_sched_rt_period;
extern int sysctl_sched_rt_runtime;

diff --git a/init/Kconfig b/init/Kconfig
index cf3e79c..9511f0d 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -697,6 +697,65 @@ config HAVE_UNSTABLE_SCHED_CLOCK
bool

#
+# For architectures that (ab)use NUMA to represent different memory regions
+# all cpu-local but of different latencies, such as SuperH.
+#
+config ARCH_WANTS_NUMA_VARIABLE_LOCALITY
+ bool
+
+#
+# For architectures that want to enable NUMA-affine scheduling
+# and memory placement:
+#
+config ARCH_SUPPORTS_NUMA_BALANCING
+ bool
+
+#
+# For architectures that want to reuse the PROT_NONE bits
+# to implement NUMA protection bits:
+#
+config ARCH_WANTS_NUMA_GENERIC_PGPROT
+ bool
+
+config NUMA_BALANCING
+ bool "NUMA-optimizing scheduler"
+ default n
+ depends on ARCH_SUPPORTS_NUMA_BALANCING
+ depends on !ARCH_WANTS_NUMA_VARIABLE_LOCALITY
+ depends on SMP && NUMA && MIGRATION
+ help
+ This option enables NUMA-aware, transparent, automatic
+ placement optimizations of memory, tasks and task groups.
+
+ The optimizations work by (transparently) runtime sampling the
+ workload sharing relationship between threads and processes
+ of long-run workloads, and scheduling them based on these
+ measured inter-task relationships (or the lack thereof).
+
+ ("Long-run" means several seconds of CPU runtime at least.)
+
+ Tasks that predominantly perform their own processing, without
+ interacting with other tasks much will be independently balanced
+ to a CPU and their working set memory will migrate to that CPU/node.
+
+ Tasks that share a lot of data with each other will be attempted to
+ be scheduled on as few nodes as possible, with their memory
+ following them there and being distributed between those nodes.
+
+ This optimization can improve the performance of long-run CPU-bound
+ workloads by 10% or more. The sampling and migration has a small
+ but nonzero cost, so if your NUMA workload is already perfectly
+ placed (for example by use of explicit CPU and memory bindings,
+ or because the stock scheduler does a good job already) then you
+ probably don't need this feature.
+
+ [ On non-NUMA systems this feature will not be active. You can query
+ whether your system is a NUMA system via looking at the output of
+ "numactl --hardware". ]
+
+ Say N if unsure.
+
+#
# Helper Kconfig switches to express compound feature dependencies
# and thus make the .h/.c code more readable:
#
@@ -718,20 +777,6 @@ config ARCH_USES_NUMA_GENERIC_PGPROT_HUGEPAGE
depends on ARCH_USES_NUMA_GENERIC_PGPROT
depends on TRANSPARENT_HUGEPAGE

-#
-# For architectures that (ab)use NUMA to represent different memory regions
-# all cpu-local but of different latencies, such as SuperH.
-#
-config ARCH_WANTS_NUMA_VARIABLE_LOCALITY
- bool
-
-#
-# For architectures that want to enable the PROT_NONE driven,
-# NUMA-affine scheduler balancing logic:
-#
-config ARCH_SUPPORTS_NUMA_BALANCING
- bool
-
menuconfig CGROUPS
boolean "Control Group support"
depends on EVENTFD
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 5dae0d2..3611f5f 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1544,6 +1544,21 @@ static void __sched_fork(struct task_struct *p)
#ifdef CONFIG_PREEMPT_NOTIFIERS
INIT_HLIST_HEAD(&p->preempt_notifiers);
#endif
+
+#ifdef CONFIG_NUMA_BALANCING
+ if (p->mm && atomic_read(&p->mm->mm_users) == 1) {
+ p->mm->numa_next_scan = jiffies;
+ p->mm->numa_scan_seq = 0;
+ }
+
+ p->numa_shared = -1;
+ p->node_stamp = 0ULL;
+ p->numa_scan_seq = p->mm ? p->mm->numa_scan_seq : 0;
+ p->numa_migrate_seq = 2;
+ p->numa_faults = NULL;
+ p->numa_scan_period = sysctl_sched_numa_scan_period_min;
+ p->numa_work.next = &p->numa_work;
+#endif /* CONFIG_NUMA_BALANCING */
}

/*
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index b0fa5ad..7736b9e 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -256,9 +256,11 @@ static int min_sched_granularity_ns = 100000; /* 100 usecs */
static int max_sched_granularity_ns = NSEC_PER_SEC; /* 1 second */
static int min_wakeup_granularity_ns; /* 0 usecs */
static int max_wakeup_granularity_ns = NSEC_PER_SEC; /* 1 second */
+#ifdef CONFIG_SMP
static int min_sched_tunable_scaling = SCHED_TUNABLESCALING_NONE;
static int max_sched_tunable_scaling = SCHED_TUNABLESCALING_END-1;
-#endif
+#endif /* CONFIG_SMP */
+#endif /* CONFIG_SCHED_DEBUG */

#ifdef CONFIG_COMPACTION
static int min_extfrag_threshold;
@@ -301,6 +303,7 @@ static struct ctl_table kern_table[] = {
.extra1 = &min_wakeup_granularity_ns,
.extra2 = &max_wakeup_granularity_ns,
},
+#ifdef CONFIG_SMP
{
.procname = "sched_tunable_scaling",
.data = &sysctl_sched_tunable_scaling,
@@ -347,7 +350,31 @@ static struct ctl_table kern_table[] = {
.extra1 = &zero,
.extra2 = &one,
},
-#endif
+#endif /* CONFIG_SMP */
+#ifdef CONFIG_NUMA_BALANCING
+ {
+ .procname = "sched_numa_scan_period_min_ms",
+ .data = &sysctl_sched_numa_scan_period_min,
+ .maxlen = sizeof(unsigned int),
+ .mode = 0644,
+ .proc_handler = proc_dointvec,
+ },
+ {
+ .procname = "sched_numa_scan_period_max_ms",
+ .data = &sysctl_sched_numa_scan_period_max,
+ .maxlen = sizeof(unsigned int),
+ .mode = 0644,
+ .proc_handler = proc_dointvec,
+ },
+ {
+ .procname = "sched_numa_settle_count",
+ .data = &sysctl_sched_numa_settle_count,
+ .maxlen = sizeof(unsigned int),
+ .mode = 0644,
+ .proc_handler = proc_dointvec,
+ },
+#endif /* CONFIG_NUMA_BALANCING */
+#endif /* CONFIG_SCHED_DEBUG */
{
.procname = "sched_rt_period_us",
.data = &sysctl_sched_rt_period,
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 814e3ea..92e101f 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1456,6 +1456,7 @@ static void __split_huge_page_refcount(struct page *page)
page_tail->mapping = page->mapping;

page_tail->index = page->index + i;
+ page_xchg_last_cpu(page, page_last_cpu(page_tail));

BUG_ON(!PageAnon(page_tail));
BUG_ON(!PageUptodate(page_tail));
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index d04a8a5..318043a 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -2175,6 +2175,143 @@ static void sp_free(struct sp_node *n)
kmem_cache_free(sn_cache, n);
}

+/*
+ * Multi-stage node selection is used in conjunction with a periodic
+ * migration fault to build a temporal task<->page relation. By
+ * using a two-stage filter we remove short/unlikely relations.
+ *
+ * Using P(p) ~ n_p / n_t as per frequentist probability, we can
+ * equate a task's usage of a particular page (n_p) per total usage
+ * of this page (n_t) (in a given time-span) to a probability.
+ *
+ * Our periodic faults will then sample this probability and getting
+ * the same result twice in a row, given these samples are fully
+ * independent, is then given by P(n)^2, provided our sample period
+ * is sufficiently short compared to the usage pattern.
+ *
+ * This quadric squishes small probabilities, making it less likely
+ * we act on an unlikely task<->page relation.
+ *
+ * Return the best node ID this page should be on, or -1 if it should
+ * stay where it is.
+ */
+static int
+numa_migration_target(struct page *page, int page_nid,
+ struct task_struct *p, int this_cpu,
+ int cpu_last_access)
+{
+ int nid_last_access;
+ int this_nid;
+
+ if (task_numa_shared(p) < 0)
+ return -1;
+
+ /*
+ * Possibly migrate towards the current node, depends on
+ * task_numa_placement() and access details.
+ */
+ nid_last_access = cpu_to_node(cpu_last_access);
+ this_nid = cpu_to_node(this_cpu);
+
+ if (nid_last_access != this_nid) {
+ /*
+ * 'Access miss': the page got last accessed from a remote node.
+ */
+ return -1;
+ }
+ /*
+ * 'Access hit': the page got last accessed from our node.
+ *
+ * Migrate the page if needed.
+ */
+
+ /* The page is already on this node: */
+ if (page_nid == this_nid)
+ return -1;
+
+ return this_nid;
+}
+
+/**
+ * mpol_misplaced - check whether current page node is valid in policy
+ *
+ * @page - page to be checked
+ * @vma - vm area where page mapped
+ * @addr - virtual address where page mapped
+ * @multi - use multi-stage node binding
+ *
+ * Lookup current policy node id for vma,addr and "compare to" page's
+ * node id.
+ *
+ * Returns:
+ * -1 - not misplaced, page is in the right node
+ * node - node id where the page should be
+ *
+ * Policy determination "mimics" alloc_page_vma().
+ * Called from fault path where we know the vma and faulting address.
+ */
+int mpol_misplaced(struct page *page, struct vm_area_struct *vma, unsigned long addr)
+{
+ int best_nid = -1, page_nid;
+ int cpu_last_access, this_cpu;
+ struct mempolicy *pol;
+ unsigned long pgoff;
+ struct zone *zone;
+
+ BUG_ON(!vma);
+
+ this_cpu = raw_smp_processor_id();
+ page_nid = page_to_nid(page);
+
+ cpu_last_access = page_xchg_last_cpu(page, this_cpu);
+
+ pol = get_vma_policy(current, vma, addr);
+ if (!(task_numa_shared(current) >= 0))
+ goto out_keep_page;
+
+ switch (pol->mode) {
+ case MPOL_INTERLEAVE:
+ BUG_ON(addr >= vma->vm_end);
+ BUG_ON(addr < vma->vm_start);
+
+ pgoff = vma->vm_pgoff;
+ pgoff += (addr - vma->vm_start) >> PAGE_SHIFT;
+ best_nid = offset_il_node(pol, vma, pgoff);
+ break;
+
+ case MPOL_PREFERRED:
+ if (pol->flags & MPOL_F_LOCAL)
+ best_nid = numa_migration_target(page, page_nid, current, this_cpu, cpu_last_access);
+ else
+ best_nid = pol->v.preferred_node;
+ break;
+
+ case MPOL_BIND:
+ /*
+ * allows binding to multiple nodes.
+ * use current page if in policy nodemask,
+ * else select nearest allowed node, if any.
+ * If no allowed nodes, use current [!misplaced].
+ */
+ if (node_isset(page_nid, pol->v.nodes))
+ goto out_keep_page;
+ (void)first_zones_zonelist(
+ node_zonelist(numa_node_id(), GFP_HIGHUSER),
+ gfp_zone(GFP_HIGHUSER),
+ &pol->v.nodes, &zone);
+ best_nid = zone->node;
+ break;
+
+ default:
+ BUG();
+ }
+
+out_keep_page:
+ mpol_cond_put(pol);
+
+ return best_nid;
+}
+
static void sp_delete(struct shared_policy *sp, struct sp_node *n)
{
pr_debug("deleting %lx-l%lx\n", n->start, n->end);
--
1.7.11.7

2012-11-19 02:19:06

by Ingo Molnar

[permalink] [raw]
Subject: [PATCH 15/27] sched, numa, mm: Add credits for NUMA placement

From: Rik van Riel <[email protected]>

The NUMA placement code has been rewritten several times, but
the basic ideas took a lot of work to develop. The people who
put in the work deserve credit for it. Thanks Andrea & Peter :)

[ The Documentation/scheduler/numa-problem.txt file should
probably be rewritten once we figure out the final details of
what the NUMA code needs to do, and why. ]

Signed-off-by: Rik van Riel <[email protected]>
Acked-by: Peter Zijlstra <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Hugh Dickins <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
----
This is against tip.git numa/core
---
CREDITS | 1 +
kernel/sched/fair.c | 3 +++
mm/memory.c | 2 ++
3 files changed, 6 insertions(+)

diff --git a/CREDITS b/CREDITS
index d8fe12a..b4cdc8f 100644
--- a/CREDITS
+++ b/CREDITS
@@ -125,6 +125,7 @@ D: Author of pscan that helps to fix lp/parport bugs
D: Author of lil (Linux Interrupt Latency benchmark)
D: Fixed the shm swap deallocation at swapoff time (try_to_unuse message)
D: VM hacker
+D: NUMA task placement
D: Various other kernel hacks
S: Imola 40026
S: Italy
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 511fbb8..8af0208 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -18,6 +18,9 @@
*
* Adaptive scheduling granularity, math enhancements by Peter Zijlstra
* Copyright (C) 2007 Red Hat, Inc., Peter Zijlstra <[email protected]>
+ *
+ * NUMA placement, statistics and algorithm by Andrea Arcangeli,
+ * CFS balancing changes by Peter Zijlstra. Copyright (C) 2012 Red Hat, Inc.
*/

#include <linux/latencytop.h>
diff --git a/mm/memory.c b/mm/memory.c
index 23ad2eb..6c2e0d4 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -36,6 +36,8 @@
* ([email protected])
*
* Aug/Sep 2004 Changed to four level page tables (Andi Kleen)
+ *
+ * 2012 - NUMA placement page faults (Andrea Arcangeli, Peter Zijlstra)
*/

#include <linux/kernel_stat.h>
--
1.7.11.7

2012-11-19 02:19:30

by Ingo Molnar

[permalink] [raw]
Subject: [PATCH 13/27] mm/migrate: Introduce migrate_misplaced_page()

From: Peter Zijlstra <[email protected]>

Add migrate_misplaced_page() which deals with migrating pages from
faults.

This includes adding a new MIGRATE_FAULT migration mode to
deal with the extra page reference required due to having to look up
the page.
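
A stand-alone toy model of the reference-count check behind that extra
reference: the fault path already holds a get_page() reference from
looking the page up, so the expected count during migration is one
higher than in the plain async case. The enum and helper below are
simplified stand-ins, not the kernel's:

#include <stdio.h>

/* Toy version of the check migrate_page_move_mapping() does for an
 * anonymous page with no address_space: migration proceeds only when
 * nobody else holds a reference, and the fault mode expects one more. */
enum toy_mode { TOY_ASYNC, TOY_FAULT };

static int can_migrate_anon(int page_count, enum toy_mode mode)
{
        int expected_count = 1;                 /* the page's own reference */

        if (mode == TOY_FAULT)
                expected_count++;               /* the fault path's get_page() */

        return page_count == expected_count;
}

int main(void)
{
        printf("async, refcount 1: %s\n", can_migrate_anon(1, TOY_ASYNC) ? "migrate" : "retry");
        printf("fault, refcount 2: %s\n", can_migrate_anon(2, TOY_FAULT) ? "migrate" : "retry");
        printf("fault, refcount 3: %s\n", can_migrate_anon(3, TOY_FAULT) ? "migrate" : "retry");
        return 0;
}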

Based-on-work-by: Lee Schermerhorn <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Reviewed-by: Rik van Riel <[email protected]>
Cc: Paul Turner <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Hugh Dickins <[email protected]>
Link: http://lkml.kernel.org/n/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
include/linux/migrate.h | 4 ++-
include/linux/migrate_mode.h | 3 ++
mm/migrate.c | 79 +++++++++++++++++++++++++++++++++++++++-----
3 files changed, 77 insertions(+), 9 deletions(-)

diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index afd9af1..72665c9 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -30,6 +30,7 @@ extern int migrate_vmas(struct mm_struct *mm,
extern void migrate_page_copy(struct page *newpage, struct page *page);
extern int migrate_huge_page_move_mapping(struct address_space *mapping,
struct page *newpage, struct page *page);
+extern int migrate_misplaced_page(struct page *page, int node);
#else

static inline void putback_lru_pages(struct list_head *l) {}
@@ -63,10 +64,11 @@ static inline int migrate_huge_page_move_mapping(struct address_space *mapping,
#define migrate_page NULL
#define fail_migrate_page NULL

-#endif /* CONFIG_MIGRATION */
static inline
int migrate_misplaced_page(struct page *page, int node)
{
return -EAGAIN; /* can't migrate now */
}
+
+#endif /* CONFIG_MIGRATION */
#endif /* _LINUX_MIGRATE_H */
diff --git a/include/linux/migrate_mode.h b/include/linux/migrate_mode.h
index ebf3d89..40b37dc 100644
--- a/include/linux/migrate_mode.h
+++ b/include/linux/migrate_mode.h
@@ -6,11 +6,14 @@
* on most operations but not ->writepage as the potential stall time
* is too significant
* MIGRATE_SYNC will block when migrating pages
+ * MIGRATE_FAULT called from the fault path to migrate-on-fault for mempolicy
+ * this path has an extra reference count
*/
enum migrate_mode {
MIGRATE_ASYNC,
MIGRATE_SYNC_LIGHT,
MIGRATE_SYNC,
+ MIGRATE_FAULT,
};

#endif /* MIGRATE_MODE_H_INCLUDED */
diff --git a/mm/migrate.c b/mm/migrate.c
index 4ba45f4..b89062d 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -225,7 +225,7 @@ static bool buffer_migrate_lock_buffers(struct buffer_head *head,
struct buffer_head *bh = head;

/* Simple case, sync compaction */
- if (mode != MIGRATE_ASYNC) {
+ if (mode != MIGRATE_ASYNC && mode != MIGRATE_FAULT) {
do {
get_bh(bh);
lock_buffer(bh);
@@ -279,12 +279,22 @@ static int migrate_page_move_mapping(struct address_space *mapping,
struct page *newpage, struct page *page,
struct buffer_head *head, enum migrate_mode mode)
{
- int expected_count;
+ int expected_count = 0;
void **pslot;

+ if (mode == MIGRATE_FAULT) {
+ /*
+ * MIGRATE_FAULT has an extra reference on the page and
+ * otherwise acts like ASYNC, no point in delaying the
+ * fault, we'll try again next time.
+ */
+ expected_count++;
+ }
+
if (!mapping) {
/* Anonymous page without mapping */
- if (page_count(page) != 1)
+ expected_count += 1;
+ if (page_count(page) != expected_count)
return -EAGAIN;
return 0;
}
@@ -294,7 +304,7 @@ static int migrate_page_move_mapping(struct address_space *mapping,
pslot = radix_tree_lookup_slot(&mapping->page_tree,
page_index(page));

- expected_count = 2 + page_has_private(page);
+ expected_count += 2 + page_has_private(page);
if (page_count(page) != expected_count ||
radix_tree_deref_slot_protected(pslot, &mapping->tree_lock) != page) {
spin_unlock_irq(&mapping->tree_lock);
@@ -313,7 +323,7 @@ static int migrate_page_move_mapping(struct address_space *mapping,
* the mapping back due to an elevated page count, we would have to
* block waiting on other references to be dropped.
*/
- if (mode == MIGRATE_ASYNC && head &&
+ if ((mode == MIGRATE_ASYNC || mode == MIGRATE_FAULT) && head &&
!buffer_migrate_lock_buffers(head, mode)) {
page_unfreeze_refs(page, expected_count);
spin_unlock_irq(&mapping->tree_lock);
@@ -521,7 +531,7 @@ int buffer_migrate_page(struct address_space *mapping,
* with an IRQ-safe spinlock held. In the sync case, the buffers
* need to be locked now
*/
- if (mode != MIGRATE_ASYNC)
+ if (mode != MIGRATE_ASYNC && mode != MIGRATE_FAULT)
BUG_ON(!buffer_migrate_lock_buffers(head, mode));

ClearPagePrivate(page);
@@ -687,7 +697,7 @@ static int __unmap_and_move(struct page *page, struct page *newpage,
struct anon_vma *anon_vma = NULL;

if (!trylock_page(page)) {
- if (!force || mode == MIGRATE_ASYNC)
+ if (!force || mode == MIGRATE_ASYNC || mode == MIGRATE_FAULT)
goto out;

/*
@@ -1403,4 +1413,57 @@ int migrate_vmas(struct mm_struct *mm, const nodemask_t *to,
}
return err;
}
-#endif
+
+/*
+ * Attempt to migrate a misplaced page to the specified destination
+ * node.
+ */
+int migrate_misplaced_page(struct page *page, int node)
+{
+ struct address_space *mapping = page_mapping(page);
+ int page_lru = page_is_file_cache(page);
+ struct page *newpage;
+ int ret = -EAGAIN;
+ gfp_t gfp = GFP_HIGHUSER_MOVABLE;
+
+ /*
+ * Never wait for allocations just to migrate on fault, but don't dip
+ * into reserves. And, only accept pages from the specified node. No
+ * sense migrating to a different "misplaced" page!
+ */
+ if (mapping)
+ gfp = mapping_gfp_mask(mapping);
+ gfp &= ~__GFP_WAIT;
+ gfp |= __GFP_NOMEMALLOC | GFP_THISNODE;
+
+ newpage = alloc_pages_node(node, gfp, 0);
+ if (!newpage) {
+ ret = -ENOMEM;
+ goto out;
+ }
+
+ if (isolate_lru_page(page)) {
+ ret = -EBUSY;
+ goto put_new;
+ }
+
+ inc_zone_page_state(page, NR_ISOLATED_ANON + page_lru);
+ ret = __unmap_and_move(page, newpage, 0, 0, MIGRATE_FAULT);
+ /*
+ * A page that has been migrated has all references removed and will be
+ * freed. A page that has not been migrated will have kepts its
+ * references and be restored.
+ */
+ dec_zone_page_state(page, NR_ISOLATED_ANON + page_lru);
+ putback_lru_page(page);
+put_new:
+ /*
+ * Move the new page to the LRU. If migration was not successful
+ * then this will free the page.
+ */
+ putback_lru_page(newpage);
+out:
+ return ret;
+}
+
+#endif /* CONFIG_NUMA */
--
1.7.11.7

2012-11-19 02:19:55

by Ingo Molnar

[permalink] [raw]
Subject: [PATCH 09/27] sched, mm, numa: Create generic NUMA fault infrastructure, with architectures overrides

This patch is based on patches written by multiple people:

Hugh Dickins <[email protected]>
Johannes Weiner <[email protected]>
Peter Zijlstra <[email protected]>

Of the "mm/mpol: Create special PROT_NONE infrastructure" patch
and its variants.

I have reworked the code so significantly that I had to
drop the acks and signoffs.

In order to facilitate a lazy -- fault driven -- migration of pages,
create a special transient PROT_NONE variant; we can then use the
resulting 'spurious' protection faults to drive our migrations.

Pages that already had an effective PROT_NONE mapping will not
generate these 'spurious' faults, for the simple reason that we
cannot distinguish them by their protection bits; see pte_numa().

This isn't a problem, since PROT_NONE (and possibly PROT_WRITE with
dirty tracking) aren't used, or are rare enough that we don't care
about their placement.

Architectures can set the CONFIG_ARCH_WANTS_NUMA_GENERIC_PGPROT Kconfig
variable, in which case they get the PROT_NONE variant. Alternatively
they can provide the basic primitives themselves:

bool pte_numa(struct vm_area_struct *vma, pte_t pte);
pte_t pte_mknuma(struct vm_area_struct *vma, pte_t pte);
bool pmd_numa(struct vm_area_struct *vma, pmd_t pmd);
pgprot_t pgprot_modify(pgprot_t oldprot, pgprot_t newprot)
unsigned long change_prot_numa(struct vm_area_struct *vma, unsigned long start, unsigned long end);

[ This non-generic angle is untested though. ]
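
Purely as an illustration of what such an architecture override could
look like, here is a stand-alone toy that marks NUMA-hinting entries
with an invented software PTE bit instead of reusing PROT_NONE. The
bit layout is hypothetical, no real architecture is implied, and the
vma argument of the real hooks is dropped here because a dedicated
bit does not need to consult the vma's page protections:

#include <stdbool.h>
#include <stdio.h>

/*
 * Hypothetical sketch only: pte_mknuma() makes the entry non-present
 * (so the next access faults) while the software bit lets the fault
 * handler distinguish it from a genuinely unmapped PTE.
 */
typedef unsigned long pte_t;

#define _PAGE_PRESENT   (1UL << 0)      /* invented: hardware "present" bit */
#define _PAGE_SW_NUMA   (1UL << 9)      /* invented: software-available bit */

static pte_t pte_mknuma(pte_t pte)
{
        return (pte & ~_PAGE_PRESENT) | _PAGE_SW_NUMA;
}

static bool pte_numa(pte_t pte)
{
        return (pte & (_PAGE_SW_NUMA | _PAGE_PRESENT)) == _PAGE_SW_NUMA;
}

int main(void)
{
        pte_t pte = _PAGE_PRESENT | 0x1000;     /* some present mapping */

        printf("before: present=%d numa=%d\n",
               !!(pte & _PAGE_PRESENT), pte_numa(pte));
        pte = pte_mknuma(pte);
        printf("after : present=%d numa=%d\n",
               !!(pte & _PAGE_PRESENT), pte_numa(pte));
        return 0;
}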

Original-Idea-by: Rik van Riel <[email protected]>
Also-From: Johannes Weiner <[email protected]>
Also-From: Hugh Dickins <[email protected]>
Also-From: Peter Zijlstra <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
---
include/asm-generic/pgtable.h | 55 ++++++++++++++
include/linux/huge_mm.h | 12 ++++
include/linux/mempolicy.h | 6 ++
include/linux/migrate.h | 5 ++
include/linux/mm.h | 5 ++
include/linux/sched.h | 2 +
init/Kconfig | 22 ++++++
mm/Makefile | 1 +
mm/huge_memory.c | 162 ++++++++++++++++++++++++++++++++++++++++++
mm/internal.h | 5 +-
mm/memcontrol.c | 7 +-
mm/memory.c | 85 ++++++++++++++++++++--
mm/migrate.c | 2 +-
mm/mprotect.c | 7 --
mm/numa.c | 73 +++++++++++++++++++
15 files changed, 430 insertions(+), 19 deletions(-)
create mode 100644 mm/numa.c

diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
index 48fc1dc..d03d0a8 100644
--- a/include/asm-generic/pgtable.h
+++ b/include/asm-generic/pgtable.h
@@ -537,6 +537,61 @@ static inline int pmd_none_or_trans_huge_or_clear_bad(pmd_t *pmd)
}

/*
+ * Is this pte used for NUMA scanning?
+ */
+#ifdef CONFIG_ARCH_USES_NUMA_GENERIC_PGPROT
+extern bool pte_numa(struct vm_area_struct *vma, pte_t pte);
+#else
+# ifndef pte_numa
+static inline bool pte_numa(struct vm_area_struct *vma, pte_t pte)
+{
+ return false;
+}
+# endif
+#endif
+
+/*
+ * Turn a pte into a NUMA entry:
+ */
+#ifdef CONFIG_ARCH_USES_NUMA_GENERIC_PGPROT
+extern pte_t pte_mknuma(struct vm_area_struct *vma, pte_t pte);
+#else
+# ifndef pte_mknuma
+static inline pte_t pte_mknuma(struct vm_area_struct *vma, pte_t pte)
+{
+ return pte;
+}
+# endif
+#endif
+
+/*
+ * Is this pmd used for NUMA scanning?
+ */
+#ifdef CONFIG_ARCH_USES_NUMA_GENERIC_PGPROT_HUGEPAGE
+extern bool pmd_numa(struct vm_area_struct *vma, pmd_t pmd);
+#else
+# ifndef pmd_numa
+static inline bool pmd_numa(struct vm_area_struct *vma, pmd_t pmd)
+{
+ return false;
+}
+# endif
+#endif
+
+/*
+ * Some architectures (such as x86) may need to preserve certain pgprot
+ * bits, without complicating generic pgprot code.
+ *
+ * Most architectures don't care:
+ */
+#ifndef pgprot_modify
+static inline pgprot_t pgprot_modify(pgprot_t oldprot, pgprot_t newprot)
+{
+ return newprot;
+}
+#endif
+
+/*
* This is a noop if Transparent Hugepage Support is not built into
* the kernel. Otherwise it is equivalent to
* pmd_none_or_trans_huge_or_clear_bad(), and shall only be called in
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index b31cb7d..7f5a552 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -197,4 +197,16 @@ static inline int pmd_trans_huge_lock(pmd_t *pmd,
}
#endif /* CONFIG_TRANSPARENT_HUGEPAGE */

+#ifdef CONFIG_NUMA_BALANCING_HUGEPAGE
+extern void do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
+ unsigned long address, pmd_t *pmd,
+ unsigned int flags, pmd_t orig_pmd);
+#else
+static inline void do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
+ unsigned long address, pmd_t *pmd,
+ unsigned int flags, pmd_t orig_pmd)
+{
+}
+#endif
+
#endif /* _LINUX_HUGE_MM_H */
diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
index e5ccb9d..f329306 100644
--- a/include/linux/mempolicy.h
+++ b/include/linux/mempolicy.h
@@ -324,4 +324,10 @@ static inline int mpol_to_str(char *buffer, int maxlen, struct mempolicy *pol,
}

#endif /* CONFIG_NUMA */
+
+static inline int mpol_misplaced(struct page *page, struct vm_area_struct *vma,
+ unsigned long address)
+{
+ return -1; /* no node preference */
+}
#endif
diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index ce7e667..afd9af1 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -64,4 +64,9 @@ static inline int migrate_huge_page_move_mapping(struct address_space *mapping,
#define fail_migrate_page NULL

#endif /* CONFIG_MIGRATION */
+static inline
+int migrate_misplaced_page(struct page *page, int node)
+{
+ return -EAGAIN; /* can't migrate now */
+}
#endif /* _LINUX_MIGRATE_H */
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 5fc1d46..246375c 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1559,6 +1559,11 @@ static inline pgprot_t vm_get_page_prot(unsigned long vm_flags)
}
#endif

+#ifdef CONFIG_ARCH_USES_NUMA_GENERIC_PGPROT
+extern unsigned long
+change_prot_numa(struct vm_area_struct *vma, unsigned long start, unsigned long end);
+#endif
+
struct vm_area_struct *find_extend_vma(struct mm_struct *, unsigned long addr);
int remap_pfn_range(struct vm_area_struct *, unsigned long addr,
unsigned long pfn, unsigned long size, pgprot_t);
diff --git a/include/linux/sched.h b/include/linux/sched.h
index e1581a0..a0a2808 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1575,6 +1575,8 @@ struct task_struct {
/* Future-safe accessor for struct task_struct's cpus_allowed. */
#define tsk_cpus_allowed(tsk) (&(tsk)->cpus_allowed)

+static inline void task_numa_fault(int node, int cpu, int pages) { }
+
/*
* Priority of a process goes from 0..MAX_PRIO-1, valid RT
* priority is 0..MAX_RT_PRIO-1, and SCHED_NORMAL/SCHED_BATCH
diff --git a/init/Kconfig b/init/Kconfig
index 6fdd6e3..f36c83d 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -696,6 +696,28 @@ config LOG_BUF_SHIFT
config HAVE_UNSTABLE_SCHED_CLOCK
bool

+#
+# Helper Kconfig switches to express compound feature dependencies
+# and thus make the .h/.c code more readable:
+#
+config NUMA_BALANCING_HUGEPAGE
+ bool
+ default y
+ depends on NUMA_BALANCING
+ depends on TRANSPARENT_HUGEPAGE
+
+config ARCH_USES_NUMA_GENERIC_PGPROT
+ bool
+ default y
+ depends on ARCH_WANTS_NUMA_GENERIC_PGPROT
+ depends on NUMA_BALANCING
+
+config ARCH_USES_NUMA_GENERIC_PGPROT_HUGEPAGE
+ bool
+ default y
+ depends on ARCH_USES_NUMA_GENERIC_PGPROT
+ depends on TRANSPARENT_HUGEPAGE
+
menuconfig CGROUPS
boolean "Control Group support"
depends on EVENTFD
diff --git a/mm/Makefile b/mm/Makefile
index 6b025f8..26f7574 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -34,6 +34,7 @@ obj-$(CONFIG_FRONTSWAP) += frontswap.o
obj-$(CONFIG_HAS_DMA) += dmapool.o
obj-$(CONFIG_HUGETLBFS) += hugetlb.o
obj-$(CONFIG_NUMA) += mempolicy.o
+obj-$(CONFIG_ARCH_USES_NUMA_GENERIC_PGPROT) += numa.o
obj-$(CONFIG_SPARSEMEM) += sparse.o
obj-$(CONFIG_SPARSEMEM_VMEMMAP) += sparse-vmemmap.o
obj-$(CONFIG_SLOB) += slob.o
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 40f17c3..814e3ea 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -18,6 +18,7 @@
#include <linux/freezer.h>
#include <linux/mman.h>
#include <linux/pagemap.h>
+#include <linux/migrate.h>
#include <asm/tlb.h>
#include <asm/pgalloc.h>
#include "internal.h"
@@ -725,6 +726,165 @@ out:
return handle_pte_fault(mm, vma, address, pte, pmd, flags);
}

+#ifdef CONFIG_NUMA_BALANCING
+/*
+ * Handle a NUMA fault: check whether we should migrate and
+ * mark it accessible again.
+ */
+void do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
+ unsigned long address, pmd_t *pmd,
+ unsigned int flags, pmd_t entry)
+{
+ unsigned long haddr = address & HPAGE_PMD_MASK;
+ struct mem_cgroup *memcg = NULL;
+ struct page *new_page;
+ struct page *page = NULL;
+ int last_cpu;
+ int node = -1;
+
+ spin_lock(&mm->page_table_lock);
+ if (unlikely(!pmd_same(*pmd, entry)))
+ goto unlock;
+
+ if (unlikely(pmd_trans_splitting(entry))) {
+ spin_unlock(&mm->page_table_lock);
+ wait_split_huge_page(vma->anon_vma, pmd);
+ return;
+ }
+
+ page = pmd_page(entry);
+ if (page) {
+ int page_nid = page_to_nid(page);
+
+ VM_BUG_ON(!PageCompound(page) || !PageHead(page));
+ last_cpu = page_last_cpu(page);
+
+ get_page(page);
+ /*
+ * Note that migrating pages shared by others is safe, since
+ * get_user_pages() or GUP fast would have to fault this page
+ * present before they could proceed, and we are holding the
+ * pagetable lock here and are mindful of pmd races below.
+ */
+ node = mpol_misplaced(page, vma, haddr);
+ if (node != -1 && node != page_nid)
+ goto migrate;
+ }
+
+fixup:
+ /* change back to regular protection */
+ entry = pmd_modify(entry, vma->vm_page_prot);
+ set_pmd_at(mm, haddr, pmd, entry);
+ update_mmu_cache_pmd(vma, address, entry);
+
+unlock:
+ spin_unlock(&mm->page_table_lock);
+ if (page) {
+ task_numa_fault(page_to_nid(page), last_cpu, HPAGE_PMD_NR);
+ put_page(page);
+ }
+ return;
+
+migrate:
+ spin_unlock(&mm->page_table_lock);
+
+ lock_page(page);
+ spin_lock(&mm->page_table_lock);
+ if (unlikely(!pmd_same(*pmd, entry))) {
+ spin_unlock(&mm->page_table_lock);
+ unlock_page(page);
+ put_page(page);
+ return;
+ }
+ spin_unlock(&mm->page_table_lock);
+
+ new_page = alloc_pages_node(node,
+ (GFP_TRANSHUGE | GFP_THISNODE) & ~__GFP_WAIT, HPAGE_PMD_ORDER);
+ if (!new_page)
+ goto alloc_fail;
+
+ if (isolate_lru_page(page)) { /* Does an implicit get_page() */
+ put_page(new_page);
+ goto alloc_fail;
+ }
+
+ __set_page_locked(new_page);
+ SetPageSwapBacked(new_page);
+
+ /* anon mapping, we can simply copy page->mapping to the new page: */
+ new_page->mapping = page->mapping;
+ new_page->index = page->index;
+
+ migrate_page_copy(new_page, page);
+
+ WARN_ON(PageLRU(new_page));
+
+ spin_lock(&mm->page_table_lock);
+ if (unlikely(!pmd_same(*pmd, entry))) {
+ spin_unlock(&mm->page_table_lock);
+
+ /* Reverse changes made by migrate_page_copy() */
+ if (TestClearPageActive(new_page))
+ SetPageActive(page);
+ if (TestClearPageUnevictable(new_page))
+ SetPageUnevictable(page);
+ mlock_migrate_page(page, new_page);
+
+ unlock_page(new_page);
+ put_page(new_page); /* Free it */
+
+ unlock_page(page);
+ putback_lru_page(page);
+ put_page(page); /* Drop the local reference */
+ return;
+ }
+ /*
+ * Traditional migration needs to prepare the memcg charge
+ * transaction early to prevent the old page from being
+ * uncharged when installing migration entries. Here we can
+ * save the potential rollback and start the charge transfer
+ * only when migration is already known to end successfully.
+ */
+ mem_cgroup_prepare_migration(page, new_page, &memcg);
+
+ entry = mk_pmd(new_page, vma->vm_page_prot);
+ entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
+ entry = pmd_mkhuge(entry);
+
+ page_add_new_anon_rmap(new_page, vma, haddr);
+
+ set_pmd_at(mm, haddr, pmd, entry);
+ update_mmu_cache_pmd(vma, address, entry);
+ page_remove_rmap(page);
+ /*
+ * Finish the charge transaction under the page table lock to
+ * prevent split_huge_page() from dividing up the charge
+ * before it's fully transferred to the new page.
+ */
+ mem_cgroup_end_migration(memcg, page, new_page, true);
+ spin_unlock(&mm->page_table_lock);
+
+ task_numa_fault(node, last_cpu, HPAGE_PMD_NR);
+
+ unlock_page(new_page);
+ unlock_page(page);
+ put_page(page); /* Drop the rmap reference */
+ put_page(page); /* Drop the LRU isolation reference */
+ put_page(page); /* Drop the local reference */
+ return;
+
+alloc_fail:
+ unlock_page(page);
+ spin_lock(&mm->page_table_lock);
+ if (unlikely(!pmd_same(*pmd, entry))) {
+ put_page(page);
+ page = NULL;
+ goto unlock;
+ }
+ goto fixup;
+}
+#endif
+
int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
pmd_t *dst_pmd, pmd_t *src_pmd, unsigned long addr,
struct vm_area_struct *vma)
@@ -1363,6 +1523,8 @@ static int __split_huge_page_map(struct page *page,
BUG_ON(page_mapcount(page) != 1);
if (!pmd_young(*pmd))
entry = pte_mkold(entry);
+ if (pmd_numa(vma, *pmd))
+ entry = pte_mknuma(vma, entry);
pte = pte_offset_map(&_pmd, haddr);
BUG_ON(!pte_none(*pte));
set_pte_at(mm, haddr, pte, entry);
diff --git a/mm/internal.h b/mm/internal.h
index a4fa284..b84d571 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -212,11 +212,12 @@ static inline void mlock_migrate_page(struct page *newpage, struct page *page)
{
if (TestClearPageMlocked(page)) {
unsigned long flags;
+ int nr_pages = hpage_nr_pages(page);

local_irq_save(flags);
- __dec_zone_page_state(page, NR_MLOCK);
+ __mod_zone_page_state(page_zone(page), NR_MLOCK, -nr_pages);
SetPageMlocked(newpage);
- __inc_zone_page_state(newpage, NR_MLOCK);
+ __mod_zone_page_state(page_zone(newpage), NR_MLOCK, nr_pages);
local_irq_restore(flags);
}
}
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 7acf43b..011e510 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3255,15 +3255,18 @@ void mem_cgroup_prepare_migration(struct page *page, struct page *newpage,
struct mem_cgroup **memcgp)
{
struct mem_cgroup *memcg = NULL;
+ unsigned int nr_pages = 1;
struct page_cgroup *pc;
enum charge_type ctype;

*memcgp = NULL;

- VM_BUG_ON(PageTransHuge(page));
if (mem_cgroup_disabled())
return;

+ if (PageTransHuge(page))
+ nr_pages <<= compound_order(page);
+
pc = lookup_page_cgroup(page);
lock_page_cgroup(pc);
if (PageCgroupUsed(pc)) {
@@ -3325,7 +3328,7 @@ void mem_cgroup_prepare_migration(struct page *page, struct page *newpage,
* charged to the res_counter since we plan on replacing the
* old one and only one page is going to be left afterwards.
*/
- __mem_cgroup_commit_charge(memcg, newpage, 1, ctype, false);
+ __mem_cgroup_commit_charge(memcg, newpage, nr_pages, ctype, false);
}

/* remove redundant charge if migration failed*/
diff --git a/mm/memory.c b/mm/memory.c
index 24d3a4a..b9bb15c 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -57,6 +57,7 @@
#include <linux/swapops.h>
#include <linux/elf.h>
#include <linux/gfp.h>
+#include <linux/migrate.h>

#include <asm/io.h>
#include <asm/pgalloc.h>
@@ -3437,6 +3438,69 @@ static int do_nonlinear_fault(struct mm_struct *mm, struct vm_area_struct *vma,
return __do_fault(mm, vma, address, pmd, pgoff, flags, orig_pte);
}

+static int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
+ unsigned long address, pte_t *ptep, pmd_t *pmd,
+ unsigned int flags, pte_t entry)
+{
+ struct page *page = NULL;
+ int node, page_nid = -1;
+ int last_cpu = -1;
+ spinlock_t *ptl;
+
+ ptl = pte_lockptr(mm, pmd);
+ spin_lock(ptl);
+ if (unlikely(!pte_same(*ptep, entry)))
+ goto out_unlock;
+
+ page = vm_normal_page(vma, address, entry);
+ if (page) {
+ get_page(page);
+ page_nid = page_to_nid(page);
+ last_cpu = page_last_cpu(page);
+ node = mpol_misplaced(page, vma, address);
+ if (node != -1 && node != page_nid)
+ goto migrate;
+ }
+
+out_pte_upgrade_unlock:
+ flush_cache_page(vma, address, pte_pfn(entry));
+
+ ptep_modify_prot_start(mm, address, ptep);
+ entry = pte_modify(entry, vma->vm_page_prot);
+ ptep_modify_prot_commit(mm, address, ptep, entry);
+
+ /* No TLB flush needed because we upgraded the PTE */
+
+ update_mmu_cache(vma, address, ptep);
+
+out_unlock:
+ pte_unmap_unlock(ptep, ptl);
+out:
+ if (page) {
+ task_numa_fault(page_nid, last_cpu, 1);
+ put_page(page);
+ }
+
+ return 0;
+
+migrate:
+ pte_unmap_unlock(ptep, ptl);
+
+ if (!migrate_misplaced_page(page, node)) {
+ page_nid = node;
+ goto out;
+ }
+
+ ptep = pte_offset_map_lock(mm, pmd, address, &ptl);
+ if (!pte_same(*ptep, entry)) {
+ put_page(page);
+ page = NULL;
+ goto out_unlock;
+ }
+
+ goto out_pte_upgrade_unlock;
+}
+
/*
* These routines also need to handle stuff like marking pages dirty
* and/or accessed for architectures that don't do it in hardware (most
@@ -3475,6 +3539,9 @@ int handle_pte_fault(struct mm_struct *mm,
pte, pmd, flags, entry);
}

+ if (pte_numa(vma, entry))
+ return do_numa_page(mm, vma, address, pte, pmd, flags, entry);
+
ptl = pte_lockptr(mm, pmd);
spin_lock(ptl);
if (unlikely(!pte_same(*pte, entry)))
@@ -3539,13 +3606,16 @@ retry:
pmd, flags);
} else {
pmd_t orig_pmd = *pmd;
- int ret;
+ int ret = 0;

barrier();
- if (pmd_trans_huge(orig_pmd)) {
- if (flags & FAULT_FLAG_WRITE &&
- !pmd_write(orig_pmd) &&
- !pmd_trans_splitting(orig_pmd)) {
+ if (pmd_trans_huge(orig_pmd) && !pmd_trans_splitting(orig_pmd)) {
+ if (pmd_numa(vma, orig_pmd)) {
+ do_huge_pmd_numa_page(mm, vma, address, pmd,
+ flags, orig_pmd);
+ }
+
+ if ((flags & FAULT_FLAG_WRITE) && !pmd_write(orig_pmd)) {
ret = do_huge_pmd_wp_page(mm, vma, address, pmd,
orig_pmd);
/*
@@ -3555,12 +3625,13 @@ retry:
*/
if (unlikely(ret & VM_FAULT_OOM))
goto retry;
- return ret;
}
- return 0;
+
+ return ret;
}
}

+
/*
* Use __pte_alloc instead of pte_alloc_map, because we can't
* run pte_offset_map on the pmd, if an huge pmd could
diff --git a/mm/migrate.c b/mm/migrate.c
index 77ed2d7..4ba45f4 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -407,7 +407,7 @@ int migrate_huge_page_move_mapping(struct address_space *mapping,
*/
void migrate_page_copy(struct page *newpage, struct page *page)
{
- if (PageHuge(page))
+ if (PageHuge(page) || PageTransHuge(page))
copy_huge_page(newpage, page);
else
copy_highpage(newpage, page);
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 7c3628a..6ff2d5e 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -28,13 +28,6 @@
#include <asm/cacheflush.h>
#include <asm/tlbflush.h>

-#ifndef pgprot_modify
-static inline pgprot_t pgprot_modify(pgprot_t oldprot, pgprot_t newprot)
-{
- return newprot;
-}
-#endif
-
static unsigned long change_pte_range(struct mm_struct *mm, pmd_t *pmd,
unsigned long addr, unsigned long end, pgprot_t newprot,
int dirty_accountable)
diff --git a/mm/numa.c b/mm/numa.c
new file mode 100644
index 0000000..8d18800
--- /dev/null
+++ b/mm/numa.c
@@ -0,0 +1,73 @@
+/*
+ * Generic NUMA page table entry support. This code reuses
+ * PROT_NONE: an architecture can choose to use its own
+ * implementation, by setting CONFIG_ARCH_SUPPORTS_NUMA_BALANCING
+ * and not setting CONFIG_ARCH_WANTS_NUMA_GENERIC_PGPROT.
+ */
+#include <linux/mm.h>
+
+static inline pgprot_t vma_prot_none(struct vm_area_struct *vma)
+{
+ /*
+ * obtain PROT_NONE by removing READ|WRITE|EXEC privs
+ */
+ vm_flags_t vmflags = vma->vm_flags & ~(VM_READ|VM_WRITE|VM_EXEC);
+
+ return pgprot_modify(vma->vm_page_prot, vm_get_page_prot(vmflags));
+}
+
+bool pte_numa(struct vm_area_struct *vma, pte_t pte)
+{
+ /*
+ * For NUMA page faults, we use PROT_NONE ptes in VMAs with
+ * "normal" vma->vm_page_prot protections. Genuine PROT_NONE
+ * VMAs should never get here, because the fault handling code
+ * will notice that the VMA has no read or write permissions.
+ *
+ * This means we cannot get 'special' PROT_NONE faults from genuine
+ * PROT_NONE maps, nor from PROT_WRITE file maps that do dirty
+ * tracking.
+ *
+ * Neither case is really interesting for our current use though so we
+ * don't care.
+ */
+ if (pte_same(pte, pte_modify(pte, vma->vm_page_prot)))
+ return false;
+
+ return pte_same(pte, pte_modify(pte, vma_prot_none(vma)));
+}
+
+pte_t pte_mknuma(struct vm_area_struct *vma, pte_t pte)
+{
+ return pte_modify(pte, vma_prot_none(vma));
+}
+
+#ifdef CONFIG_ARCH_USES_NUMA_GENERIC_PGPROT_HUGEPAGE
+bool pmd_numa(struct vm_area_struct *vma, pmd_t pmd)
+{
+ /*
+ * See pte_numa() above
+ */
+ if (pmd_same(pmd, pmd_modify(pmd, vma->vm_page_prot)))
+ return false;
+
+ return pmd_same(pmd, pmd_modify(pmd, vma_prot_none(vma)));
+}
+#endif
+
+/*
+ * The scheduler uses this function to mark a range of virtual
+ * memory inaccessible to user-space, for the purposes of probing
+ * the composition of the working set.
+ *
+ * The resulting page faults will be demultiplexed into:
+ *
+ * mm/memory.c::do_numa_page()
+ * mm/huge_memory.c::do_huge_pmd_numa_page()
+ *
+ * This generic version simply uses PROT_NONE.
+ */
+unsigned long change_prot_numa(struct vm_area_struct *vma, unsigned long start, unsigned long end)
+{
+ return change_protection(vma, start, end, vma_prot_none(vma), 0);
+}
--
1.7.11.7
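
[ For illustration: the inverse of pte_mknuma() above - restoring the vma's
  normal protections once a hinting fault has been handled - is just another
  pte_modify(), this time with vma->vm_page_prot. A minimal sketch, assuming
  only the helpers introduced by this patch (the helper name below is
  hypothetical; the series is expected to do the equivalent inside
  do_numa_page()):

        static inline pte_t pte_mknonnuma(struct vm_area_struct *vma, pte_t pte)
        {
                /* Drop the PROT_NONE based NUMA hint, restore normal protections: */
                return pte_modify(pte, vma->vm_page_prot);
        }
]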

2012-11-19 02:19:54

by Ingo Molnar

[permalink] [raw]
Subject: [PATCH 11/27] sched, numa, mm: Describe the NUMA scheduling problem formally

From: Peter Zijlstra <[email protected]>

This is probably a first: a formal description of a complex high-level
computing problem, within the kernel source.

Signed-off-by: Peter Zijlstra <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Mike Galbraith <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Hugh Dickins <[email protected]>
Link: http://lkml.kernel.org/n/[email protected]
[ Next step: generate the kernel source from such formal descriptions and retire to a tropical island! ]
Signed-off-by: Ingo Molnar <[email protected]>
---
Documentation/scheduler/numa-problem.txt | 230 +++++++++++++++++++++++++++++++
1 file changed, 230 insertions(+)
create mode 100644 Documentation/scheduler/numa-problem.txt

diff --git a/Documentation/scheduler/numa-problem.txt b/Documentation/scheduler/numa-problem.txt
new file mode 100644
index 0000000..a5d2fee
--- /dev/null
+++ b/Documentation/scheduler/numa-problem.txt
@@ -0,0 +1,230 @@
+
+
+Effective NUMA scheduling problem statement, described formally:
+
+ * minimize interconnect traffic
+
+For each task 't_i' we have memory, this memory can be spread over multiple
+physical nodes, let us denote this as: 'p_i,k', the memory task 't_i' has on
+node 'k' in [pages].
+
+If a task shares memory with another task let us denote this as:
+'s_i,k', the memory shared between tasks including 't_i' residing on node
+'k'.
+
+Let 'M' be the distribution that governs all 'p' and 's', ie. the page placement.
+
+Similarly, let's define 'fp_i,k' and 'fs_i,k' resp. as the (average) usage
+frequency over those memory regions [1/s] such that the product gives an
+(average) bandwidth 'bp' and 'bs' in [pages/s].
+
+(note: multiple tasks sharing memory naturally avoid duplicate accounting
+ because each task will have its own access frequency 'fs')
+
+(pjt: I think this frequency is more numerically consistent if you explicitly
+ restrict p/s above to be the working-set. (It also makes explicit the
+ requirement for <C0,M0> to change about a change in the working set.)
+
+ Doing this does have the nice property that it lets you use your frequency
+ measurement as a weak-ordering for the benefit a task would receive when
+ we can't fit everything.
+
+ e.g. task1 has working set 10mb, f=90%
+ task2 has working set 90mb, f=10%
+
+ Both are using 9mb/s of bandwidth, but we'd expect a much larger benefit
+ from task1 being on the right node than task2. )
+
+Let 'C' map every task 't_i' to a cpu 'c_i' and its corresponding node 'n_i':
+
+ C: t_i -> {c_i, n_i}
+
+This gives us the total interconnect traffic between nodes 'k' and 'l',
+'T_k,l', as:
+
+ T_k,l = \Sum_i bp_i,l + bs_i,l + \Sum bp_j,k + bs_j,k where n_i == k, n_j == l
+
+And our goal is to obtain C0 and M0 such that:
+
+ T_k,l(C0, M0) =< T_k,l(C, M) for all C, M where k != l
+
+(note: we could introduce 'nc(k,l)' as the cost function of accessing memory
+ on node 'l' from node 'k', this would be useful for bigger NUMA systems
+
+ pjt: I agree nice to have, but intuition suggests diminishing returns on more
+ usual systems given factors like Haswell's enormous 35mb l3
+ cache and QPI being able to do a direct fetch.)
+
+(note: do we need a limit on the total memory per node?)
+
+
+ * fairness
+
+For each task 't_i' we have a weight 'w_i' (related to nice), and each cpu
+'c_n' has a compute capacity 'P_n', again, using our map 'C' we can formulate a
+load 'L_n':
+
+ L_n = 1/P_n * \Sum_i w_i for all c_i = n
+
+using that we can formulate a load difference between CPUs
+
+ L_n,m = | L_n - L_m |
+
+Which allows us to state the fairness goal like:
+
+ L_n,m(C0) =< L_n,m(C) for all C, n != m
+
+(pjt: It can also be usefully stated that, having converged at C0:
+
+ | L_n(C0) - L_m(C0) | <= 4/3 * | G_n( U(t_i, t_j) ) - G_m( U(t_i, t_j) ) |
+
+ Where G_n,m is the greedy partition of tasks between L_n and L_m. This is
+ the "worst" partition we should accept; but having it gives us a useful
+ bound on how much we can reasonably adjust L_n/L_m at a Pareto point to
+ favor T_n,m. )
+
+Together they give us the complete multi-objective optimization problem:
+
+ min_C,M [ L_n,m(C), T_k,l(C,M) ]
+
+
+
+Notes:
+
+ - the memory bandwidth problem is very much an inter-process problem, in
+ particular there is no such concept as a process in the above problem.
+
+ - the naive solution would completely prefer fairness over interconnect
+ traffic, the more complicated solution could pick another Pareto point using
+ an aggregate objective function such that we balance the loss of work
+ efficiency against the gain of running, we'd want to more or less suggest
+ there to be a fixed bound on the error from the Pareto line for any
+ such solution.
+
+References:
+
+ http://en.wikipedia.org/wiki/Mathematical_optimization
+ http://en.wikipedia.org/wiki/Multi-objective_optimization
+
+
+* warning, significant hand-waving ahead, improvements welcome *
+
+
+Partial solutions / approximations:
+
+ 1) have task node placement be a pure preference from the 'fairness' pov.
+
+This means we always prefer fairness over interconnect bandwidth. This reduces
+the problem to:
+
+ min_C,M [ T_k,l(C,M) ]
+
+ 2a) migrate memory towards 'n_i' (the task's node).
+
+This creates memory movement such that 'p_i,k for k != n_i' becomes 0 --
+provided 'n_i' stays stable enough and there's sufficient memory (looks like
+we might need memory limits for this).
+
+This does however not provide us with any 's_i' (shared) information. It does
+however remove 'M' since it defines memory placement in terms of task
+placement.
+
+XXX properties of this M vs a potential optimal
+
+ 2b) migrate memory towards 'n_i' using 2 samples.
+
+This separates pages into those that will migrate and those that will not due
+to the two samples not matching. We could consider the first to be of 'p_i'
+(private) and the second to be of 's_i' (shared).
+
+This interpretation can be motivated by the previously observed property that
+'p_i,k for k != n_i' should become 0 under sufficient memory, leaving only
+'s_i' (shared). (here we lose the need for memory limits again, since it
+becomes indistinguishable from shared).
+
+XXX include the statistical babble on double sampling somewhere near
+
+This reduces the problem further; we lose 'M' as per 2a, it further reduces
+the 'T_k,l' (interconnect traffic) term to only include shared (since per the
+above all private will be local):
+
+ T_k,l = \Sum_i bs_i,l for every n_i = k, l != k
+
+[ more or less matches the state of sched/numa and describes its remaining
+ problems and assumptions. It should work well for tasks without significant
+ shared memory usage between tasks. ]
+
+Possible future directions:
+
+Motivated by the form of 'T_k,l', try and obtain each term of the sum, so we
+can evaluate it;
+
+ 3a) add per-task per node counters
+
+At fault time, count the number of pages the task faults on for each node.
+This should give an approximation of 'p_i' for the local node and 's_i,k' for
+all remote nodes.
+
+While these numbers provide pages per scan, and so have the unit [pages/s], they
+don't count repeat accesses and thus aren't actually representative of our
+bandwidth numbers.
+
+ 3b) additional frequency term
+
+Additionally (or instead if it turns out we don't need the raw 'p' and 's'
+numbers) we can approximate the repeat accesses by using the time since marking
+the pages as indication of the access frequency.
+
+Let 'I' be the interval of marking pages and 'e' the elapsed time since the
+last marking, then we could estimate the number of accesses 'a' as 'a = I / e'.
+If we then increment the node counters using 'a' instead of 1 we might get
+a better estimate of bandwidth terms.
+
+ 3c) additional averaging; can be applied on top of either a/b.
+
+[ Rik argues that decaying averages on 3a might be sufficient for bandwidth since
+ the decaying avg includes the old accesses and therefore has a measure of repeat
+ accesses.
+
+ Rik also argued that the sample frequency is too low to get accurate access
+ frequency measurements, I'm not entirely convinced; even at low sample
+ frequencies the avg elapsed time 'e' over multiple samples should still
+ give us a fair approximation of the avg access frequency 'a'.
+
+ So doing both b&c has a fair chance of working and allowing us to distinguish
+ between important and less important memory accesses.
+
+ Experimentation has shown no benefit from the added frequency term so far. ]
+
+This will give us 'bp_i' and 'bs_i,k' so that we can approximately compute
+'T_k,l'. Our optimization problem now reads:
+
+ min_C [ \Sum_i bs_i,l for every n_i = k, l != k ]
+
+And includes only shared terms, this makes sense since all task private memory
+will become local as per 2.
+
+This suggests that if there is significant shared memory, we should try and
+move towards it.
+
+ 4) move towards where 'most' memory is
+
+The simplest significance test is comparing the biggest shared 's_i,k' against
+the private 'p_i'. If we have more shared than private, move towards it.
+
+This effectively makes us move towards where most of our memory is and forms a
+feed-back loop with 2. We migrate memory towards us and we migrate towards
+where 'most' memory is.
+
+(Note: even if there were two tasks fully thrashing the same shared memory, it
+ is very rare for there to be a 50/50 split in memory; lacking a perfect
+ split, the smaller will move towards the larger. In case of the perfect
+ split, we'll tie-break towards the lower node number.)
+
+ 5) 'throttle' 4's node placement
+
+Since per 2b our 's_i,k' and 'p_i' require at least two scans to 'stabilize'
+and show representative numbers, we should limit node-migration to not be
+faster than this.
+
+ n) poke holes in previous that require more stuff and describe it.
--
1.7.11.7
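
[ Worked example of the 'T_k,l' term above (illustrative numbers, not from
  the patch): take two nodes, k=0 and l=1, and two tasks. t_1 runs on node 0
  (n_1 = 0) and generates bp_1,1 = 100 and bs_1,1 = 50 [pages/s] of private
  resp. shared bandwidth towards node 1; t_2 runs on node 1 (n_2 = 1) and
  generates bp_2,0 = 30 and bs_2,0 = 20 [pages/s] towards node 0. Then:

      T_0,1 = (bp_1,1 + bs_1,1) + (bp_2,0 + bs_2,0) = 150 + 50 = 200 [pages/s]

  If the memory migration of step 2a drives the private terms to zero, only
  the shared terms remain:

      T_0,1 = bs_1,1 + bs_2,0 = 70 [pages/s]

  which is exactly the reduced form derived in step 2b. ]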

2012-11-19 02:20:47

by Ingo Molnar

[permalink] [raw]
Subject: [PATCH 07/27] mm: Optimize the TLB flush of sys_mprotect() and change_protection() users

Reuse the NUMA code's 'modified page protections' count that
change_protection() computes and skip the TLB flush if there are
no changes to a range that sys_mprotect() modifies.

Given that mprotect() already optimizes the same-flags case,
I expected this optimization to trigger mostly on
CONFIG_NUMA_BALANCING=y kernels - but even with that feature
disabled it triggers rather often.

There are two reasons for that:

1)

sys_mprotect() already optimizes the same-flag case:

if (newflags == oldflags) {
*pprev = vma;
return 0;
}

This test works in many cases, but it is too sharp in some
others, where it differentiates between protection values that the
underlying PTE format makes no distinction about, such as
PROT_EXEC == PROT_READ on x86.

2)

Even where the pte format change implied by the vma flag change
necessitates a modification of the pagetables, there might be no pagetables
yet to modify: they might not be instantiated yet.

During a regular desktop bootup this optimization hits a couple
of hundred times. During a Java test I measured thousands of
hits.

So this optimization improves sys_mprotect() in general, not just
CONFIG_NUMA_BALANCING=y kernels.

[ We could further increase the efficiency of this optimization if
change_pte_range() and change_huge_pmd() were a bit smarter about
recognizing exact-same-value protection masks - when the hardware
can do that safely. This would probably further speed up mprotect(). ]
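
As a small illustration of case 2) above (a userspace sketch, not part of
the patch - the observable effect is simply that change_protection() finds
zero modified PTEs, so with this patch the TLB flush is skipped):

    #include <stddef.h>
    #include <sys/mman.h>

    int main(void)
    {
            size_t len = 64 * 4096;

            /* Anonymous mapping that is never touched: no pagetables get instantiated. */
            char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
            if (p == MAP_FAILED)
                    return 1;

            /*
             * newflags != oldflags, so the early same-flags check in
             * sys_mprotect() does not help - yet there are no PTEs to
             * modify, change_protection() returns 0 and the flush is
             * now skipped.
             */
            mprotect(p, len, PROT_READ);

            munmap(p, len);
            return 0;
    }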

Cc: Linus Torvalds <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
---
mm/mprotect.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/mm/mprotect.c b/mm/mprotect.c
index 1e265be..7c3628a 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -153,7 +153,9 @@ static unsigned long change_protection_range(struct vm_area_struct *vma,
dirty_accountable);
} while (pgd++, addr = next, addr != end);

- flush_tlb_range(vma, start, end);
+ /* Only flush the TLB if we actually modified any entries: */
+ if (pages)
+ flush_tlb_range(vma, start, end);

return pages;
}
--
1.7.11.7

2012-11-19 02:15:36

by Ingo Molnar

[permalink] [raw]
Subject: [PATCH 05/27] x86/mm: Completely drop the TLB flush from ptep_set_access_flags()

From: Rik van Riel <[email protected]>

Intel has an architectural guarantee that the TLB entry causing
a page fault gets invalidated automatically. This means
we should be able to drop the local TLB invalidation.

Because of the way other areas of the page fault code work,
chances are good that all x86 CPUs do this. However, if
someone somewhere has an x86 CPU that does not invalidate
the TLB entry causing a page fault, this one-liner should
be easy to revert - or a CPU model specific quirk could
be added to retain this optimization on most CPUs.

Signed-off-by: Rik van Riel <[email protected]>
Acked-by: Linus Torvalds <[email protected]>
Acked-by: Peter Zijlstra <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Michel Lespinasse <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Hugh Dickins <[email protected]>
[ Applied changelog massage and moved this last in the series,
to create bisection distance. ]
Signed-off-by: Ingo Molnar <[email protected]>
---
arch/x86/mm/pgtable.c | 1 -
1 file changed, 1 deletion(-)

diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
index be3bb46..7353de3 100644
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -317,7 +317,6 @@ int ptep_set_access_flags(struct vm_area_struct *vma,
if (changed && dirty) {
*ptep = entry;
pte_update_defer(vma->vm_mm, address, ptep);
- __flush_tlb_one(address);
}

return changed;
--
1.7.11.7

2012-11-19 02:21:23

by Ingo Molnar

[permalink] [raw]
Subject: [PATCH 04/27] mm: Only flush the TLB when clearing an accessible pte

From: Rik van Riel <[email protected]>

If ptep_clear_flush() is called to clear a page table entry that is
not accessible by the CPU anyway, e.g. a _PAGE_PROTNONE page table entry,
there is no need to flush the TLB on remote CPUs.
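
[ For reference, a minimal sketch of what the pte_accessible() test used
  below amounts to on x86 (illustrative - the real definition lives in the
  arch pgtable headers): only a pte with the present bit set can ever have
  been cached by a TLB, and a _PAGE_PROTNONE pte does not have it set:

        static inline int pte_accessible(pte_t pte)
        {
                /* Only present ptes can be cached by the TLB: */
                return pte_flags(pte) & _PAGE_PRESENT;
        }
]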

Signed-off-by: Rik van Riel <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Hugh Dickins <[email protected]>
Link: http://lkml.kernel.org/n/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
mm/pgtable-generic.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
index d8397da..0c8323f 100644
--- a/mm/pgtable-generic.c
+++ b/mm/pgtable-generic.c
@@ -88,7 +88,8 @@ pte_t ptep_clear_flush(struct vm_area_struct *vma, unsigned long address,
{
pte_t pte;
pte = ptep_get_and_clear((vma)->vm_mm, address, ptep);
- flush_tlb_page(vma, address);
+ if (pte_accessible(pte))
+ flush_tlb_page(vma, address);
return pte;
}
#endif
--
1.7.11.7

2012-11-19 02:21:48

by Ingo Molnar

[permalink] [raw]
Subject: [PATCH 02/27] x86/mm: Only do a local tlb flush in ptep_set_access_flags()

From: Rik van Riel <[email protected]>

Because we only ever upgrade a PTE when calling ptep_set_access_flags(),
it is safe to skip flushing entries on remote TLBs.

The worst that can happen is a spurious page fault on other CPUs, which
would flush that TLB entry.

Lazily letting another CPU incur a spurious page fault occasionally
is (much!) cheaper than aggressively flushing everybody else's TLB.

Signed-off-by: Rik van Riel <[email protected]>
Acked-by: Linus Torvalds <[email protected]>
Acked-by: Peter Zijlstra <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Michel Lespinasse <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Hugh Dickins <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
---
arch/x86/mm/pgtable.c | 9 ++++++++-
1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
index 8573b83..be3bb46 100644
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -301,6 +301,13 @@ void pgd_free(struct mm_struct *mm, pgd_t *pgd)
free_page((unsigned long)pgd);
}

+/*
+ * Used to set accessed or dirty bits in the page table entries
+ * on other architectures. On x86, the accessed and dirty bits
+ * are tracked by hardware. However, do_wp_page calls this function
+ * to also make the pte writeable at the same time the dirty bit is
+ * set. In that case we do actually need to write the PTE.
+ */
int ptep_set_access_flags(struct vm_area_struct *vma,
unsigned long address, pte_t *ptep,
pte_t entry, int dirty)
@@ -310,7 +317,7 @@ int ptep_set_access_flags(struct vm_area_struct *vma,
if (changed && dirty) {
*ptep = entry;
pte_update_defer(vma->vm_mm, address, ptep);
- flush_tlb_page(vma, address);
+ __flush_tlb_one(address);
}

return changed;
--
1.7.11.7

2012-11-19 16:29:18

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 00/27] Latest numa/core release, v16

On Mon, Nov 19, 2012 at 03:14:17AM +0100, Ingo Molnar wrote:
> I'm pleased to announce the latest version of the numa/core tree.
>
> Here are some quick, preliminary performance numbers on a 4-node,
> 32-way, 64 GB RAM system:
>
> CONFIG_NUMA_BALANCING=y
> -----------------------------------------------------------------------
> [ seconds ] v3.7 AutoNUMA | numa/core-v16 [ vs. v3.7]
> [ lower is better ] ----- -------- | ------------- -----------
> |
> numa01 340.3 192.3 | 139.4 +144.1%
> numa01_THREAD_ALLOC 425.1 135.1 | 121.1 +251.0%
> numa02 56.1 25.3 | 17.5 +220.5%
> |
> [ SPECjbb transactions/sec ] |
> [ higher is better ] |
> |
> SPECjbb single-1x32 524k 507k | 638k +21.7%
> -----------------------------------------------------------------------
>

I was not able to run a full set of tests today as I was distracted, so
all I have is a multi JVM comparison. I'll keep it shorter than average

3.7.0 3.7.0
rc5-stats-v4r2 rc5-schednuma-v16r1
TPut 1 101903.00 ( 0.00%) 77651.00 (-23.80%)
TPut 2 213825.00 ( 0.00%) 160285.00 (-25.04%)
TPut 3 307905.00 ( 0.00%) 237472.00 (-22.87%)
TPut 4 397046.00 ( 0.00%) 302814.00 (-23.73%)
TPut 5 477557.00 ( 0.00%) 364281.00 (-23.72%)
TPut 6 542973.00 ( 0.00%) 420810.00 (-22.50%)
TPut 7 540466.00 ( 0.00%) 448976.00 (-16.93%)
TPut 8 543226.00 ( 0.00%) 463568.00 (-14.66%)
TPut 9 513351.00 ( 0.00%) 468238.00 ( -8.79%)
TPut 10 484126.00 ( 0.00%) 457018.00 ( -5.60%)
TPut 11 467440.00 ( 0.00%) 457999.00 ( -2.02%)
TPut 12 430423.00 ( 0.00%) 447928.00 ( 4.07%)
TPut 13 445803.00 ( 0.00%) 434823.00 ( -2.46%)
TPut 14 427388.00 ( 0.00%) 430667.00 ( 0.77%)
TPut 15 437183.00 ( 0.00%) 423746.00 ( -3.07%)
TPut 16 423245.00 ( 0.00%) 416259.00 ( -1.65%)
TPut 17 417666.00 ( 0.00%) 407186.00 ( -2.51%)
TPut 18 413046.00 ( 0.00%) 398197.00 ( -3.59%)

This version of the patches manages to cripple performance entirely. I
do not have a single JVM comparison available as the machine has been in
use during the day. I accept that it is very possible that the single
JVM figures are better.

SPECJBB PEAKS
3.7.0 3.7.0
rc5-stats-v4r2 rc5-schednuma-v16r1
Expctd Warehouse 12.00 ( 0.00%) 12.00 ( 0.00%)
Expctd Peak Bops 430423.00 ( 0.00%) 447928.00 ( 4.07%)
Actual Warehouse 8.00 ( 0.00%) 9.00 ( 12.50%)
Actual Peak Bops 543226.00 ( 0.00%) 468238.00 (-13.80%)

Peaks at a higher number of warehouses but actual peak throughput is
hurt badly.

MMTests Statistics: duration
3.7.0 3.7.0
rc5-stats-v4r2 rc5-schednuma-v16r1
User 203918.23 176061.11
System 141.06 24846.23
Elapsed 4979.65 4964.45

Note the System CPU usage.

MMTests Statistics: vmstat

No meaningful stats are available with your series.

> On my NUMA system the numa/core tree now significantly outperforms both
> the vanilla kernel and the AutoNUMA (v28) kernel, in these benchmarks.
> No NUMA balancing kernel has ever performed so well on this system.
>

Maybe on your machine and maybe on your specjbb configuration it works
well. But for my machine and for a multi JVM configuration, it hurts
quite badly.

> It is notable that workloads where 'private' processing dominates
> (numa01_THREAD_ALLOC and numa02) are now very close to bare metal
> hard binding performance.
>
> These are the main changes in this release:
>
> - There are countless performance improvements. The new shared/private
> distinction metric we introduced in v15 is now further refined and
> is used in more places within the scheduler to converge in a better
> and more directed fashion.
>

Great.

> - I restructured the whole tree to make it cleaner, to simplify its
> mm/ impact and in general to make it more mergable. It now includes
> either agreed-upon patches, or bare essentials that are needed to
> make the CONFIG_NUMA_BALANCING=y feature work. It is fully bisect
> tested - it builds and works at every point.
>

It is a misrepresentation to say that all these patches have been agreed
upon.

You are still using MIGRATE_FAULT which has not been agreed upon at
all.

While you have renamed change_prot_none to change_prot_numa, it still
effectively hard-codes PROT_NONE. Even if an architecture redefines
pte_numa to use a bit other than _PAGE_PROTNONE it'll still not work
because change_protection() will not recognise it.

I still maintain that THP native migration was introduced too early, and now
it's worse because you've collapsed it with another patch. The risk is
that you might be depending on THP migration to reduce overhead for the
autonumabench test cases. I've said already that I believe that the correct
thing to do here is to handle regular PMDs in batch where possible and add
THP native migration as an optimisation on top. This avoids us accidentally
depending on THP to reduce system CPU usage.

While the series may be bisectable, it still is an all-or-nothing
approach. Consider "sched, numa, mm: Add the scanning page fault machinery"
for example. Despite its name it is very much orientated around schednuma
and adds the fields schednuma requires. This is a small example. The
bigger example continues to be "sched: Add adaptive NUMA affinity support"
which is a monolithic patch that introduces .... basically everything. It
would be extremely difficult to retrofit an alternative policy on top of
this and to make an apples-to-apples comparison. My whole point about
bisection was that we would have several major points that could be bisected

1. vanilla kernel
2. one set of optimisations, basic stats
3. basic placement policy
4. more optimisations if necessary
5. complex placement policy

The complex placement policy would either be schednuma, autonuma or some
combination and it could be evaluated in terms of a basic placement policy
-- how much better does it perform? How much system overhead does it add?
Otherwise it's too easy to fall into a trap where a complex placement policy
for the tested workloads hides all the cost of the underlying machinery
and then falls apart when tested by a larger number of users. If/when the
placement policy fails the system gets majorly bogged down and it'll not
be possible to break up the series in any meaningful way to see where
the problem was introduced. I'm running out of creative ways to repeat
myself on this.

> - The hard-coded "PROT_NONE" feature that reviewers complained about
> is now factored out and selectable on a per architecture basis.
> (the arch porting aspect of this is untested, but the basic fabric
> is there and should be pretty close to what we need.)
>
> The generic PROT_NONE based facility can be used by architectures
> to prototype this feature quickly.
>

They'll also need to alter change_protection() or reimplement it. It's
still effectively hard-coded although it's getting better in this
regard.

--
Mel Gorman
SUSE Labs

2012-11-19 19:13:46

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH 00/27] Latest numa/core release, v16


* Mel Gorman <[email protected]> wrote:

> On Mon, Nov 19, 2012 at 03:14:17AM +0100, Ingo Molnar wrote:
> > I'm pleased to announce the latest version of the numa/core tree.
> >
> > Here are some quick, preliminary performance numbers on a 4-node,
> > 32-way, 64 GB RAM system:
> >
> > CONFIG_NUMA_BALANCING=y
> > -----------------------------------------------------------------------
> > [ seconds ] v3.7 AutoNUMA | numa/core-v16 [ vs. v3.7]
> > [ lower is better ] ----- -------- | ------------- -----------
> > |
> > numa01 340.3 192.3 | 139.4 +144.1%
> > numa01_THREAD_ALLOC 425.1 135.1 | 121.1 +251.0%
> > numa02 56.1 25.3 | 17.5 +220.5%
> > |
> > [ SPECjbb transactions/sec ] |
> > [ higher is better ] |
> > |
> > SPECjbb single-1x32 524k 507k | 638k +21.7%
> > -----------------------------------------------------------------------
> >
>
> I was not able to run a full sets of tests today as I was
> distracted so all I have is a multi JVM comparison. I'll keep
> it shorter than average
>
> 3.7.0 3.7.0
> rc5-stats-v4r2 rc5-schednuma-v16r1

Thanks for the testing - I'll wait for your full results to see
whether the other regressions you reported before are
fixed/improved.

Exactly what tree/commit does "rc5-schednuma-v16r1" mean?

I am starting to have doubts about your testing methods. There
does seem to be some big disconnect between your testing and
mine - and I think we should clear that up by specifying exactly
*what* you have tested. Did you rebase my tree in any fashion?

You can find what I tested in tip:master and I'd encourage you
to test that too.

Other people within Red Hat have tested these same workloads as
well, on similarly sized (and even larger) systems as yours, and
they have already reported to me (much) improved numbers,
including improvements in the multi-JVM SPECjbb load that you
are concentrating on ...

> > - I restructured the whole tree to make it cleaner, to
> > simplify its mm/ impact and in general to make it more
> > mergable. It now includes either agreed-upon patches, or
> > bare essentials that are needed to make the
> > CONFIG_NUMA_BALANCING=y feature work. It is fully bisect
> > tested - it builds and works at every point.
>
> It is a misrepresentation to say that all these patches have
> been agreed upon.

That has not been claimed, please read the sentence above.

> You are still using MIGRATE_FAULT which has not been agreed
> upon at all.

See my followup patch yesterday.

> While you have renamed change_prot_none to change_prot_numa,
> it still effectively hard-codes PROT_NONE. Even if an
> architecture redefines pte_numa to use a bit other than
> _PAGE_PROTNONE it'll still not work because
> change_protection() will not recognise it.

This is not how it works.

The new generic PROT_NONE scheme is that an architecture that
wants to reuse the generic PROT_NONE code can define:

select ARCH_SUPPORTS_NUMA_BALANCING

and can set:

select ARCH_WANTS_NUMA_GENERIC_PGPROT

and it will get the generic code very easily. This is what x86
uses now. No architecture changes needed beyond these two lines
of Kconfig enablement.

If an architecture wants to provide its own open-coded, optimized
variant, it can do so by not defining
ARCH_SUPPORTS_NUMA_BALANCING, and by offering the following
methods:

bool pte_numa(struct vm_area_struct *vma, pte_t pte);
pte_t pte_mknuma(struct vm_area_struct *vma, pte_t pte);
bool pmd_numa(struct vm_area_struct *vma, pmd_t pmd);
pgprot_t pgprot_modify(pgprot_t oldprot, pgprot_t newprot)
unsigned long change_prot_numa(struct vm_area_struct *vma, unsigned long start, unsigned long end);

I have not tested the arch-defined portion but barring trivial
problems it should work. We can extend the list of methods if
you think more is needed, and we can offer library functions for
architectures that want to share some but not all generic code -
I'm personally happy with x86 using change_protection().
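
[ Purely illustrative - no such port exists in this series: an architecture
  with a spare software bit in its pte format could implement the first two
  methods roughly along these lines ('_PAGE_ARCH_NUMA' is a hypothetical
  bit, the pte_flags()/pte_{set,clear}_flags() helpers are the x86 ones):

        bool pte_numa(struct vm_area_struct *vma, pte_t pte)
        {
                return pte_flags(pte) & _PAGE_ARCH_NUMA;
        }

        pte_t pte_mknuma(struct vm_area_struct *vma, pte_t pte)
        {
                /* Mark the hint and clear present, so the next access faults: */
                pte = pte_set_flags(pte, _PAGE_ARCH_NUMA);
                return pte_clear_flags(pte, _PAGE_PRESENT);
        }
]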

I think this got bikeshed painted enough already.

> I still maintain that THP native migration was introduced too
> early now it's worse because you've collapsed it with another
> patch. The risk is that you might be depending on THP
> migration to reduce overhead for the autonumabench test cases.
> I've said already that I believe that the correct thing to do
> here is to handle regular PMDs in batch where possible and add
> THP native migration as an optimisation on top. This avoids us
> accidentally depending on THP to reduce system CPU usage.

Have you disabled THP in your testing of numa/core???

I think we need to stop the discussion now and clear up exactly
*what* you have tested. Commit ID and an exact description of
testing methodology please ...

Thanks,

Ingo

2012-11-19 20:07:15

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH 00/27] Latest numa/core release, v16


* Mel Gorman <[email protected]> wrote:

> > [ SPECjbb transactions/sec ] |
> > [ higher is better ] |
> > |
> > SPECjbb single-1x32 524k 507k | 638k +21.7%
> > -----------------------------------------------------------------------
> >
>
> I was not able to run a full sets of tests today as I was
> distracted so all I have is a multi JVM comparison. I'll keep
> it shorter than average
>
> 3.7.0 3.7.0
> rc5-stats-v4r2 rc5-schednuma-v16r1
> TPut 1 101903.00 ( 0.00%) 77651.00 (-23.80%)
> TPut 2 213825.00 ( 0.00%) 160285.00 (-25.04%)
> TPut 3 307905.00 ( 0.00%) 237472.00 (-22.87%)
> TPut 4 397046.00 ( 0.00%) 302814.00 (-23.73%)
> TPut 5 477557.00 ( 0.00%) 364281.00 (-23.72%)
> TPut 6 542973.00 ( 0.00%) 420810.00 (-22.50%)
> TPut 7 540466.00 ( 0.00%) 448976.00 (-16.93%)
> TPut 8 543226.00 ( 0.00%) 463568.00 (-14.66%)
> TPut 9 513351.00 ( 0.00%) 468238.00 ( -8.79%)
> TPut 10 484126.00 ( 0.00%) 457018.00 ( -5.60%)

These figures are IMO way too low for a 64-way system. I have a
32-way system with midrange server CPUs and get 650k+/sec
easily.

Have you tried to analyze the root cause, what does 'perf top'
show during the run and how much idle time is there?

Trying to reproduce your findings I have done 4x JVM tests
myself, using 4x 8-warehouse setups, with a sizing of -Xms8192m
-Xmx8192m -Xss256k, and here are the results:

v3.7 v3.7
SPECjbb single-1x32 524k 638k +21.7%
SPECjbb multi-4x8 633k 655k +3.4%

So while here we are only marginally better than the
single-instance numbers (I will try to improve that in numa/core
v17), they are still better than mainline - and they are
definitely not slower as your numbers suggest ...

So we need to go back to the basics to figure this out: please
outline exactly which commit ID of the numa/core tree you have
booted. Also, how does 'perf top' look like on your box?

Thanks,

Ingo

2012-11-19 21:18:12

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 00/27] Latest numa/core release, v16

On Mon, Nov 19, 2012 at 08:13:39PM +0100, Ingo Molnar wrote:
>
> * Mel Gorman <[email protected]> wrote:
>
> > On Mon, Nov 19, 2012 at 03:14:17AM +0100, Ingo Molnar wrote:
> > > I'm pleased to announce the latest version of the numa/core tree.
> > >
> > > Here are some quick, preliminary performance numbers on a 4-node,
> > > 32-way, 64 GB RAM system:
> > >
> > > CONFIG_NUMA_BALANCING=y
> > > -----------------------------------------------------------------------
> > > [ seconds ] v3.7 AutoNUMA | numa/core-v16 [ vs. v3.7]
> > > [ lower is better ] ----- -------- | ------------- -----------
> > > |
> > > numa01 340.3 192.3 | 139.4 +144.1%
> > > numa01_THREAD_ALLOC 425.1 135.1 | 121.1 +251.0%
> > > numa02 56.1 25.3 | 17.5 +220.5%
> > > |
> > > [ SPECjbb transactions/sec ] |
> > > [ higher is better ] |
> > > |
> > > SPECjbb single-1x32 524k 507k | 638k +21.7%
> > > -----------------------------------------------------------------------
> > >
> >
> > I was not able to run a full sets of tests today as I was
> > distracted so all I have is a multi JVM comparison. I'll keep
> > it shorter than average
> >
> > 3.7.0 3.7.0
> > rc5-stats-v4r2 rc5-schednuma-v16r1
>
> Thanks for the testing - I'll wait for your full results to see
> whether the other regressions you reported before are
> fixed/improved.
>

Ok.

In response to one of your later questions, I found that I had in fact
disabled THP without properly reporting it. I had configured MMTests to
run specjbb with "base" pages which is the default taken from the specjvm
configuration file. I had noted in different reports that no THP was used
and that THP was not a factor for those tests but had not properly checked
why that was.

However, this also means that *all* the tests disabled THP. Vanilla kernel,
balancenuma and autonuma all ran with THP disabled. The comparisons are
still valid but failing to report the lack of THP was a major mistake.

> Exactly what tree/commit does "rc5-schednuma-v16r1" mean?
>

Kernel 3.7-rc5
tip/sched/core from when it was last pulled
your patches as posted in this thread

JVM is Oracle JVM
java version "1.7.0_07"
Java(TM) SE Runtime Environment (build 1.7.0_07-b10)
Java HotSpot(TM) 64-Bit Server VM (build 23.3-b01, mixed mode)

I had been thinking that differences in the JVMs might be a factor but it is
almost certainly because I had configured mmtests to disable THP for the
specjbb test. Again, I'm sorry that this was not properly identified and
reported by me but will reiterate that all tests were run identically.

> I am starting to have doubts about your testing methods. There
> does seem to be some big disconnect between your testing and
> mine - and I think we should clear that up by specifying exactly
> *what* you have tested.

And the difference was in test methods - specifically I disabled THP.

FWIW, what I'm testing is implemented via mmtests 0.07 which I released a
few days ago. I use the same configuration file each time and the kernel
.config only differs in the name of the numa balancing option it needs
to enable. The THP disabling was a screw-up but it was the same screw up
every time.

> Did you rebase my tree in any fashion?
>

No.

However, what I will be running next will be rebased on top of rc6. I am
not expecting that to make any difference.

> You can find what I tested in tip:master and I'd encourage you
> to test that too.
>

I will.

> Other people within Red Hat have tested these same workloads as
> well, on similarly sided (and even larger) systems as yours, and
> they have already reported to me (much) improved numbers,
> including improvements in the multi-JVM SPECjbb load that you
> are concentrating on ...
>

It almost certainly came down to THP and not the JVM. Can you or one
of the other testers try with THP disabled and see what they find?

> > > - I restructured the whole tree to make it cleaner, to
> > > simplify its mm/ impact and in general to make it more
> > > mergable. It now includes either agreed-upon patches, or
> > > bare essentials that are needed to make the
> > > CONFIG_NUMA_BALANCING=y feature work. It is fully bisect
> > > tested - it builds and works at every point.
> >
> > It is a misrepresentation to say that all these patches have
> > been agreed upon.
>
> That has not been claimed, please read the sentence above.
>

Ok yes, it is one or the other.

> > You are still using MIGRATE_FAULT which has not been agreed
> > upon at all.
>
> See my followup patch yesterday.
>

I missed the follow-up patch. I was looking at the series as-posted.

> > While you have renamed change_prot_none to change_prot_numa,
> > it still effectively hard-codes PROT_NONE. Even if an
> > architecture redefines pte_numa to use a bit other than
> > _PAGE_PROTNONE it'll still not work because
> > change_protection() will not recognise it.
>
> This is not how it works.
>
> The new generic PROT_NONE scheme is that an architecture that
> wants to reuse the generic PROT_NONE code can define:
>
> select ARCH_SUPPORTS_NUMA_BALANCING
>
> and can set:
>
> select ARCH_WANTS_NUMA_GENERIC_PGPROT
>
> and it will get the generic code very easily. This is what x86
> uses now. No architecture changes needed beyond these two lines
> of Kconfig enablement.
>
> If an architecture wants to provide its own open-coded, optimizd
> variant, it can do so by not defining
> ARCH_SUPPORTS_NUMA_BALANCING, and by offering the following
> methods:
>
> bool pte_numa(struct vm_area_struct *vma, pte_t pte);
> pte_t pte_mknuma(struct vm_area_struct *vma, pte_t pte);
> bool pmd_numa(struct vm_area_struct *vma, pmd_t pmd);
> pgprot_t pgprot_modify(pgprot_t oldprot, pgprot_t newprot)
> unsigned long change_prot_numa(struct vm_area_struct *vma, unsigned long start, unsigned long end);
>

Ah ok. The arch would indeed need change_prot_numa().

> I have not tested the arch-defined portion but barring trivial
> problems it should work. We can extend the list of methods if
> you think more is needed, and we can offer library functions for
> architectures that want to share some but not all generic code -
> I'm personally happy with x86 using change_protection().
>

FWIW, I don't think the list needs to be extended. I'm still disappointed
that the pte_numa and friend implementations are relatively heavy and
require a function call to vm_get_page_prot(). I'm also disappointed
that regular PMDs cannot be easily handled in batch without further
modification.

On the batch PMD thing, I'm in the process of converting to using
change_protection. Preserving all the logic from the old change_prot_numa()
is a bit of a mess right now but so far I've found that at least two major
factors are;

1. If page_mapcount > 1 pages are marked pte_numa then the overhead
increases a lot. This is partially due to the placement policy being
really dumb and unable to handle shared pages.

2. With a partial series (I don't have test results with a full series
yet) I find that if I do not set pmd_numa and batch-handle the faults,
it takes 5 times longer to complete the numa01 test. This is even
slower than the vanilla kernel and is probably fault overhead.

In both cases the overhead could be overcome by having a smart placement
policy and then scanning backoff logic to reduce the number of faults
incurred. If for some reason the scanning rate cannot be reduced because
the placement policy bounces pages around nodes then the CPU usage will go
through the roof setting all the PTEs and handling the faults.

> I think this got bikeshed painted enough already.
>
> > I still maintain that THP native migration was introduced too
> > early now it's worse because you've collapsed it with another
> > patch. The risk is that you might be depending on THP
> > migration to reduce overhead for the autonumabench test cases.
> > I've said already that I believe that the correct thing to do
> > here is to handle regular PMDs in batch where possible and add
> > THP native migration as an optimisation on top. This avoids us
> > accidentally depending on THP to reduce system CPU usage.
>
> Have you disabled THP in your testing of numa/core???
>

As it turned out, yes, I did disable THP. The vmstats I posted in other
reports all showed that no THPs were used because of this.

If your internal testers run with THP disabled, it would be interesting
to know if they see similar regressions. If they do, it implies that the
performance of schednuma depends on THP. If they do not, something else
is also making a big difference.

The temptation is to just ignore the THP problem but if workloads cannot
use THP or the system gets into a state where it cannot allocate THP
then performance will be badly hurt. It'll be hard to spot that the lack
of THP is what is causing the regression.

> I think we need to stop the discussion now and clear up exactly
> *what* you have tested. Commit ID and an exact description of
> testing methodology please ...
>

Kernel and JVM used have already been mentioned. Testing methodology is
as implemented in mmtests 0.07. The configuration file for multi JVMs is
configs/config-global-dhp__jvm-specjbb and sets SPECJBB_PAGESIZES="base"
which is what disabled THP.

--
Mel Gorman
SUSE Labs

2012-11-19 21:37:16

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 00/27] Latest numa/core release, v16

On Mon, Nov 19, 2012 at 09:07:07PM +0100, Ingo Molnar wrote:
>
> * Mel Gorman <[email protected]> wrote:
>
> > > [ SPECjbb transactions/sec ] |
> > > [ higher is better ] |
> > > |
> > > SPECjbb single-1x32 524k 507k | 638k +21.7%
> > > -----------------------------------------------------------------------
> > >
> >
> > I was not able to run a full sets of tests today as I was
> > distracted so all I have is a multi JVM comparison. I'll keep
> > it shorter than average
> >
> > 3.7.0 3.7.0
> > rc5-stats-v4r2 rc5-schednuma-v16r1
> > TPut 1 101903.00 ( 0.00%) 77651.00 (-23.80%)
> > TPut 2 213825.00 ( 0.00%) 160285.00 (-25.04%)
> > TPut 3 307905.00 ( 0.00%) 237472.00 (-22.87%)
> > TPut 4 397046.00 ( 0.00%) 302814.00 (-23.73%)
> > TPut 5 477557.00 ( 0.00%) 364281.00 (-23.72%)
> > TPut 6 542973.00 ( 0.00%) 420810.00 (-22.50%)
> > TPut 7 540466.00 ( 0.00%) 448976.00 (-16.93%)
> > TPut 8 543226.00 ( 0.00%) 463568.00 (-14.66%)
> > TPut 9 513351.00 ( 0.00%) 468238.00 ( -8.79%)
> > TPut 10 484126.00 ( 0.00%) 457018.00 ( -5.60%)
>
> These figures are IMO way too low for a 64-way system. I have a
> 32-way system with midrange server CPUs and get 650k+/sec
> easily.
>

48-way as I said here https://lkml.org/lkml/2012/11/3/109. If I said
64-way somewhere else, it was a mistake. The lack of THP would account
for some of the difference. As I was looking for potential
locking-related issues, I also had CONFIG_DEBUG_VMA and
CONFIG_DEBUG_MUTEXES set which would account for more overhead. Any
options set are set for all tests that make up a group.

> Have you tried to analyze the root cause, what does 'perf top'
> show during the run and how much idle time is there?
>

No, I haven't and the machine is currently occupied. However, a second
profile run was run as part of the test above. The figures I reported are
based on a run without profiling. With profiling, oprofile reported

Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (No unit mask) count 6000
samples % image name app name symbol name
176552 42.9662 vmlinux-3.7.0-rc5-schednuma-v16r1 vmlinux-3.7.0-rc5-schednuma-v16r1 intel_idle
22790 5.5462 vmlinux-3.7.0-rc5-schednuma-v16r1 vmlinux-3.7.0-rc5-schednuma-v16r1 find_busiest_group
10533 2.5633 vmlinux-3.7.0-rc5-schednuma-v16r1 vmlinux-3.7.0-rc5-schednuma-v16r1 update_blocked_averages
10489 2.5526 vmlinux-3.7.0-rc5-schednuma-v16r1 vmlinux-3.7.0-rc5-schednuma-v16r1 rb_get_reader_page
9514 2.3154 vmlinux-3.7.0-rc5-schednuma-v16r1 vmlinux-3.7.0-rc5-schednuma-v16r1 native_write_msr_safe
8511 2.0713 vmlinux-3.7.0-rc5-schednuma-v16r1 vmlinux-3.7.0-rc5-schednuma-v16r1 ring_buffer_consume
7406 1.8023 vmlinux-3.7.0-rc5-schednuma-v16r1 vmlinux-3.7.0-rc5-schednuma-v16r1 idle_cpu
6549 1.5938 vmlinux-3.7.0-rc5-schednuma-v16r1 vmlinux-3.7.0-rc5-schednuma-v16r1 update_cfs_rq_blocked_load
6482 1.5775 vmlinux-3.7.0-rc5-schednuma-v16r1 vmlinux-3.7.0-rc5-schednuma-v16r1 rebalance_domains
5212 1.2684 vmlinux-3.7.0-rc5-schednuma-v16r1 vmlinux-3.7.0-rc5-schednuma-v16r1 run_rebalance_domains
5037 1.2258 perl perl /usr/bin/perl
4167 1.0141 vmlinux-3.7.0-rc5-schednuma-v16r1 vmlinux-3.7.0-rc5-schednuma-v16r1 page_fault
3885 0.9455 vmlinux-3.7.0-rc5-schednuma-v16r1 vmlinux-3.7.0-rc5-schednuma-v16r1 cpumask_next_and
3704 0.9014 vmlinux-3.7.0-rc5-schednuma-v16r1 vmlinux-3.7.0-rc5-schednuma-v16r1 find_next_bit
3498 0.8513 vmlinux-3.7.0-rc5-schednuma-v16r1 vmlinux-3.7.0-rc5-schednuma-v16r1 getnstimeofday
3345 0.8140 vmlinux-3.7.0-rc5-schednuma-v16r1 vmlinux-3.7.0-rc5-schednuma-v16r1 __update_cpu_load
3175 0.7727 vmlinux-3.7.0-rc5-schednuma-v16r1 vmlinux-3.7.0-rc5-schednuma-v16r1 load_balance
3018 0.7345 vmlinux-3.7.0-rc5-schednuma-v16r1 vmlinux-3.7.0-rc5-schednuma-v16r1 menu_select

> Trying to reproduce your findings I have done 4x JVM tests
> myself, using 4x 8-warehouse setups, with a sizing of -Xms8192m
> -Xmx8192m -Xss256k, and here are the results:
>
> v3.7 v3.7
> SPECjbb single-1x32 524k 638k +21.7%
> SPECjbb multi-4x8 633k 655k +3.4%
>

I'll re-run with THP enabled the next time and see what I find.

> So while here we are only marginally better than the
> single-instance numbers (I will try to improve that in numa/core
> v17), they are still better than mainline - and they are
> definitely not slower as your numbers suggest ...
>
> So we need to go back to the basics to figure this out: please
> outline exactly which commit ID of the numa/core tree you have
> booted. Also, how does 'perf top' look like on your box?
>

I'll find out what perf top looks like ASAP.

--
Mel Gorman
SUSE Labs

2012-11-19 22:36:13

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH 00/27] Latest numa/core release, v16


* Mel Gorman <[email protected]> wrote:

> Ok.
>
> In response to one of your later questions, I found that I had
> in fact disabled THP without properly reporting it. [...]

Hugepages is a must for most forms of NUMA/HPC. This alone
questions the relevance of most of your prior numa/core testing
results. I now have to strongly dispute your other conclusions
as well.

Just a look at 'perf top' output should have told you the story.

Yet time and time again you readily reported bad 'schednuma'
results for a slow 4K memory model that neither we nor other
NUMA testers I talked to actually used, without stopping to look
why that was so...

[ I suspect that if such terabytes-of-data workloads are forced
through such a slow 4K pages model then there's a bug or
mis-tuning in our code that explains the level of additional
slowdown you saw - we'll fix that.

But you should know that behavior under the slow 4K model
tells very little about the true scheduling and placement
quality of the patches... ]

Please report proper THP-enabled numbers before continuing.

Thanks,

Ingo

2012-11-19 23:00:41

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 00/27] Latest numa/core release, v16

On Mon, Nov 19, 2012 at 11:36:04PM +0100, Ingo Molnar wrote:
>
> * Mel Gorman <[email protected]> wrote:
>
> > Ok.
> >
> > In response to one of your later questions, I found that I had
> > in fact disabled THP without properly reporting it. [...]
>
> Hugepages is a must for most forms of NUMA/HPC.

Requiring huge pages to avoid a regression is a mistake.

> This alone
> questions the relevance of most of your prior numa/core testing
> results. I now have to strongly dispute your other conclusions
> as well.
>

I'll freely admit that disabling THP for specjbb was a mistake and I should
have caught why at the start. However, the autonumabench figures reported for
the last release had THP enabled, as did the kernel build benchmark figures.

> Just a look at 'perf top' output should have told you the story.
>

I knew THP were not in use and said so in earlier reports. Take this for
example -- https://lkml.org/lkml/2012/11/16/207 . For specjbb, note that
the THP fault alloc figures are close to 0 and due to that I said "THP is
not really a factor for this workload". What I failed to do was identify
why THP was not in use.

> Yet time and time again you readily reported bad 'schednuma'
> results for a slow 4K memory model that neither we nor other
> NUMA testers I talked to actually used, without stopping to look
> why that was so...
>

Again, I apologise for the THP mistake. The fact remains that the other
implementations did not suffer a performance slowdown due to the same
mistake.

> [ I suspect that if such terabytes-of-data workloads are forced
> through such a slow 4K pages model then there's a bug or
> mis-tuning in our code that explains the level of additional
> slowdown you saw - we'll fix that.
>
> But you should know that behavior under the slow 4K model
> tells very little about the true scheduling and placement
> quality of the patches... ]
>
> Please report proper THP-enabled numbers before continuing.
>

Will do. Are THP-disabled benchmark results to be ignored?

--
Mel Gorman
SUSE Labs

2012-11-20 00:41:52

by Rik van Riel

[permalink] [raw]
Subject: Re: [PATCH 00/27] Latest numa/core release, v16

On 11/19/2012 06:00 PM, Mel Gorman wrote:
> On Mon, Nov 19, 2012 at 11:36:04PM +0100, Ingo Molnar wrote:
>>
>> * Mel Gorman <[email protected]> wrote:
>>
>>> Ok.
>>>
>>> In response to one of your later questions, I found that I had
>>> in fact disabled THP without properly reporting it. [...]
>>
>> Hugepages is a must for most forms of NUMA/HPC.
>
> Requiring huge pages to avoid a regression is a mistake.

Not all architectures support THP. Not all workloads will end up
using THP effectively.

Mel, would you have numa/core profiles from !THP runs, so we can
find out the cause of the regression?

--
All rights reversed

2012-11-20 00:50:07

by David Rientjes

[permalink] [raw]
Subject: Re: [PATCH 00/27] Latest numa/core release, v16

On Mon, 19 Nov 2012, Mel Gorman wrote:

> I was not able to run a full sets of tests today as I was distracted so
> all I have is a multi JVM comparison. I'll keep it shorter than average
>
> 3.7.0 3.7.0
> rc5-stats-v4r2 rc5-schednuma-v16r1
> TPut 1 101903.00 ( 0.00%) 77651.00 (-23.80%)
> TPut 2 213825.00 ( 0.00%) 160285.00 (-25.04%)
> TPut 3 307905.00 ( 0.00%) 237472.00 (-22.87%)
> TPut 4 397046.00 ( 0.00%) 302814.00 (-23.73%)
> TPut 5 477557.00 ( 0.00%) 364281.00 (-23.72%)
> TPut 6 542973.00 ( 0.00%) 420810.00 (-22.50%)
> TPut 7 540466.00 ( 0.00%) 448976.00 (-16.93%)
> TPut 8 543226.00 ( 0.00%) 463568.00 (-14.66%)
> TPut 9 513351.00 ( 0.00%) 468238.00 ( -8.79%)
> TPut 10 484126.00 ( 0.00%) 457018.00 ( -5.60%)
> TPut 11 467440.00 ( 0.00%) 457999.00 ( -2.02%)
> TPut 12 430423.00 ( 0.00%) 447928.00 ( 4.07%)
> TPut 13 445803.00 ( 0.00%) 434823.00 ( -2.46%)
> TPut 14 427388.00 ( 0.00%) 430667.00 ( 0.77%)
> TPut 15 437183.00 ( 0.00%) 423746.00 ( -3.07%)
> TPut 16 423245.00 ( 0.00%) 416259.00 ( -1.65%)
> TPut 17 417666.00 ( 0.00%) 407186.00 ( -2.51%)
> TPut 18 413046.00 ( 0.00%) 398197.00 ( -3.59%)
>
> This version of the patches manages to cripple performance entirely. I
> do not have a single JVM comparison available as the machine has been in
> use during the day. I accept that it is very possible that the single
> JVM figures are better.
>

I confirm that SPECjbb2005 1.07 -Xmx4g regresses in terms of throughput on
my 16-way, 4 node system with 32GB of memory using 16 warehouses and 240
measurement seconds. I averaged the throughput for five runs on each
kernel.

Java(TM) SE Runtime Environment (build 1.6.0_06-b02)
Java HotSpot(TM) 64-Bit Server VM (build 10.0-b22, mixed mode)

Both kernels have
CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE=y
CONFIG_TRANSPARENT_HUGEPAGE=y
CONFIG_TRANSPARENT_HUGEPAGE_ALWAYS=y

numa/core at 01aa90068b12 ("sched: Use the best-buddy 'ideal cpu' in
balancing decisions") with

CONFIG_ARCH_SUPPORTS_NUMA_BALANCING=y
CONFIG_ARCH_WANTS_NUMA_GENERIC_PGPROT=y
CONFIG_NUMA_BALANCING=y
CONFIG_NUMA_BALANCING_HUGEPAGE=y
CONFIG_ARCH_USES_NUMA_GENERIC_PGPROT=y
CONFIG_ARCH_USES_NUMA_GENERIC_PGPROT_HUGEPAGE=y

had a throughput of 128315.19 SPECjbb2005 bops.

numa/core at ec05a2311c35 ("Merge branch 'sched/urgent' into sched/core")
had an average throughput of 136918.34 SPECjbb2005 bops, which is a 6.3%
regression.

2012-11-20 01:02:36

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 00/27] Latest numa/core release, v16

On Mon, Nov 19, 2012 at 12:36 PM, Ingo Molnar <[email protected]> wrote:
>
> Hugepages is a must for most forms of NUMA/HPC. This alone
> questions the relevance of most of your prior numa/core testing
> results. I now have to strongly dispute your other conclusions
> as well.

Ingo, stop doing this kind of crap.

Let's make it clear: if the NUMA patches continue to regress
performance for reasonable loads (and that very much includes "no
THP") then they won't be merged.

You seem to be in total denial. Every time Mel sends out results that
show that your patches MAKE PERFORMANCE WORSE you blame Mel, or blame
the load, and never seem to admit that performance got worse.

Stop it. That kind of "head-in-the-sand" behavior is not conducive to
good code, and I have absolutely *zero* interest in merging a branch
that has been tested with only your load on only your machine, and
performs better on that *one* load, and then regresses on other loads.

Seriously. If you can't make the non-THP case go faster, don't even
bother sending out the patches.

Similarly, if you cannot take performance results from others, don't
even bother sending out the patches. If all you care about is your own
special case, then keep the patches on your own machine, and stop
bothering others with your patches.

So stop ignoring the feedback, and stop shooting the messenger. Look
at the numbers, and admit that there is something that needs to be
fixed.

Linus

2012-11-20 01:05:16

by David Rientjes

[permalink] [raw]
Subject: Re: [PATCH 00/27] Latest numa/core release, v16

On Mon, 19 Nov 2012, David Rientjes wrote:

> I confirm that SPECjbb2005 1.07 -Xmx4g regresses in terms of throughput on
> my 16-way, 4 node system with 32GB of memory using 16 warehouses and 240
> measurement seconds. I averaged the throughput for five runs on each
> kernel.
>
> Java(TM) SE Runtime Environment (build 1.6.0_06-b02)
> Java HotSpot(TM) 64-Bit Server VM (build 10.0-b22, mixed mode)
>
> Both kernels have
> CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE=y
> CONFIG_TRANSPARENT_HUGEPAGE=y
> CONFIG_TRANSPARENT_HUGEPAGE_ALWAYS=y
>
> numa/core at 01aa90068b12 ("sched: Use the best-buddy 'ideal cpu' in
> balancing decisions") with
>
> CONFIG_ARCH_SUPPORTS_NUMA_BALANCING=y
> CONFIG_ARCH_WANTS_NUMA_GENERIC_PGPROT=y
> CONFIG_NUMA_BALANCING=y
> CONFIG_NUMA_BALANCING_HUGEPAGE=y
> CONFIG_ARCH_USES_NUMA_GENERIC_PGPROT=y
> CONFIG_ARCH_USES_NUMA_GENERIC_PGPROT_HUGEPAGE=y
>
> had a throughput of 128315.19 SPECjbb2005 bops.
>
> numa/core at ec05a2311c35 ("Merge branch 'sched/urgent' into sched/core")
> had an average throughput of 136918.34 SPECjbb2005 bops, which is a 6.3%
> regression.
>

perftop during the run on numa/core at 01aa90068b12 ("sched: Use the
best-buddy 'ideal cpu' in balancing decisions"):

15.99% [kernel] [k] page_fault
4.05% [kernel] [k] getnstimeofday
3.96% [kernel] [k] _raw_spin_lock
3.20% [kernel] [k] rcu_check_callbacks
2.93% [kernel] [k] generic_smp_call_function_interrupt
2.90% [kernel] [k] __do_page_fault
2.82% [kernel] [k] ktime_get
2.62% [kernel] [k] read_tsc
2.41% [kernel] [k] handle_mm_fault
2.01% [kernel] [k] flush_tlb_func
1.99% [kernel] [k] retint_swapgs
1.83% [kernel] [k] emulate_vsyscall
1.71% [kernel] [k] handle_pte_fault
1.63% [kernel] [k] task_tick_fair
1.57% [kernel] [k] clear_page_c
1.55% [kernel] [k] down_read_trylock
1.54% [kernel] [k] copy_user_generic_string
1.48% [kernel] [k] ktime_get_update_offsets
1.37% [kernel] [k] find_vma
1.23% [kernel] [k] mpol_misplaced
1.14% [kernel] [k] task_numa_fault
1.10% [kernel] [k] run_timer_softirq
1.06% [kernel] [k] up_read
0.87% [kernel] [k] __bad_area_nosemaphore
0.82% [kernel] [k] write_ok_or_segv
0.77% [kernel] [k] update_cfs_shares
0.76% [kernel] [k] update_curr
0.75% [kernel] [k] error_sti
0.75% [kernel] [k] get_vma_policy
0.73% [kernel] [k] smp_call_function_many
0.66% [kernel] [k] do_wp_page
0.60% [kernel] [k] error_entry
0.60% [kernel] [k] call_function_interrupt
0.59% [kernel] [k] error_exit
0.58% [kernel] [k] _raw_spin_lock_irqsave
0.58% [kernel] [k] tick_sched_timer
0.57% [kernel] [k] __do_softirq
0.57% [kernel] [k] mem_cgroup_count_vm_event
0.56% [kernel] [k] account_user_time
0.56% [kernel] [k] spurious_fault
0.54% [kernel] [k] acct_update_integrals
0.54% [kernel] [k] bad_area_nosemaphore

[ Both kernels for this test were booted with cgroup_disable=memory on
the command line; why mem_cgroup_count_vm_event shows up at all here is
strange... ]

2012-11-20 06:00:22

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH 00/27] Latest numa/core release, v16


* David Rientjes <[email protected]> wrote:

> > numa/core at ec05a2311c35 ("Merge branch 'sched/urgent' into
> > sched/core") had an average throughput of 136918.34
> > SPECjbb2005 bops, which is a 6.3% regression.
>
> perftop during the run on numa/core at 01aa90068b12 ("sched:
> Use the best-buddy 'ideal cpu' in balancing decisions"):
>
> 15.99% [kernel] [k] page_fault
> 4.05% [kernel] [k] getnstimeofday
> 3.96% [kernel] [k] _raw_spin_lock
> 3.20% [kernel] [k] rcu_check_callbacks
> 2.93% [kernel] [k] generic_smp_call_function_interrupt
> 2.90% [kernel] [k] __do_page_fault
> 2.82% [kernel] [k] ktime_get

Thanks for testing, that's very interesting - could you tell me
more about exactly what kind of hardware this is? I'll try to
find a similar system and reproduce the performance regression.

(A wild guess would be an older 4x Opteron system, 83xx series
or so?)

Also, the profile looks weird to me. Here is how perf top looks
like on my system during a similarly configured, "healthy"
SPECjbb run:

91.29% perf-6687.map [.] 0x00007fffed1e8f21
4.81% libjvm.so [.] 0x00000000007004a0
0.93% [vdso] [.] 0x00007ffff7ffe60c
0.72% [kernel] [k] do_raw_spin_lock
0.36% [kernel] [k] generic_smp_call_function_interrupt
0.10% [kernel] [k] format_decode
0.07% [kernel] [k] rcu_check_callbacks
0.07% [kernel] [k] apic_timer_interrupt
0.07% [kernel] [k] call_function_interrupt
0.06% libc-2.15.so [.] __strcmp_sse42
0.06% [kernel] [k] irqtime_account_irq
0.06% perf [.] 0x000000000004bb7c
0.05% [kernel] [k] x86_pmu_disable_all
0.04% libc-2.15.so [.] __memcpy_ssse3
0.04% [kernel] [k] ktime_get
0.04% [kernel] [k] account_group_user_time
0.03% [kernel] [k] vbin_printf

and that is what SPECjbb does: it spends 97% of its time in Java
code - yet there's no Java overhead visible in your profile -
how is that possible? Could you try a newer perf on that box:

cd tools/perf/
make -j install

to make sure perf picks up the Java symbols as well (or at least
includes them as a summary, as in the profile above). Note that
no page fault overhead is visible in my profile.
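
[ Side note on the symbols: the "perf-6687.map" DSO above is perf's
  JIT map convention - for anonymous executable mappings perf looks
  for a /tmp/perf-<pid>.map file with one line per JIT-ed symbol:
  hex start address, hex size and symbol name. Illustrative entries
  only, not from this run:

    7fffed1e8000 180 Interpreter
    7fffed1f2a40 260 java.lang.String::hashCode

  Without such a file - the JVM needs an agent to write it - the JIT
  samples get lumped under the map's name with raw addresses, as in
  the profiles here. ]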

Thanks,

Ingo

2012-11-20 06:20:43

by David Rientjes

[permalink] [raw]
Subject: Re: [PATCH 00/27] Latest numa/core release, v16

On Tue, 20 Nov 2012, Ingo Molnar wrote:

> > > numa/core at ec05a2311c35 ("Merge branch 'sched/urgent' into
> > > sched/core") had an average throughput of 136918.34
> > > SPECjbb2005 bops, which is a 6.3% regression.
> >
> > perftop during the run on numa/core at 01aa90068b12 ("sched:
> > Use the best-buddy 'ideal cpu' in balancing decisions"):
> >
> > 15.99% [kernel] [k] page_fault
> > 4.05% [kernel] [k] getnstimeofday
> > 3.96% [kernel] [k] _raw_spin_lock
> > 3.20% [kernel] [k] rcu_check_callbacks
> > 2.93% [kernel] [k] generic_smp_call_function_interrupt
> > 2.90% [kernel] [k] __do_page_fault
> > 2.82% [kernel] [k] ktime_get
>
> Thanks for testing, that's very interesting - could you tell me
> more about exactly what kind of hardware this is? I'll try to
> find a similar system and reproduce the performance regression.
>

This happened to be an Opteron (but not 83xx series), 2.4Ghz. Your
benchmarks were different in the number of cores but also in the amount of
memory, do you think numa/core would regress because this is 32GB and not
64GB?

> (A wild guess would be an older 4x Opteron system, 83xx series
> or so?)
>

Just curious, how you would guess that? Is there something about Opteron
83xx that make numa/core regress?

> Also, the profile looks weird to me. Here is how perf top looks
> like on my system during a similarly configured, "healthy"
> SPECjbb run:
>
> 91.29% perf-6687.map [.] 0x00007fffed1e8f21
> 4.81% libjvm.so [.] 0x00000000007004a0
> 0.93% [vdso] [.] 0x00007ffff7ffe60c
> 0.72% [kernel] [k] do_raw_spin_lock
> 0.36% [kernel] [k] generic_smp_call_function_interrupt
> 0.10% [kernel] [k] format_decode
> 0.07% [kernel] [k] rcu_check_callbacks
> 0.07% [kernel] [k] apic_timer_interrupt
> 0.07% [kernel] [k] call_function_interrupt
> 0.06% libc-2.15.so [.] __strcmp_sse42
> 0.06% [kernel] [k] irqtime_account_irq
> 0.06% perf [.] 0x000000000004bb7c
> 0.05% [kernel] [k] x86_pmu_disable_all
> 0.04% libc-2.15.so [.] __memcpy_ssse3
> 0.04% [kernel] [k] ktime_get
> 0.04% [kernel] [k] account_group_user_time
> 0.03% [kernel] [k] vbin_printf
>
> and that is what SPECjbb does: it spends 97% of its time in Java
> code - yet there's no Java overhead visible in your profile -
> how is that possible? Could you try a newer perf on that box:
>

It's perf top -U, the benchmark itself was unchanged so I didn't think it
was interesting to gather the user symbols. If that would be helpful, let
me know!

2012-11-20 07:17:12

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH 00/27] Latest numa/core release, v16


* Linus Torvalds <[email protected]> wrote:

> On Mon, Nov 19, 2012 at 12:36 PM, Ingo Molnar <[email protected]> wrote:
> >
> > Hugepages is a must for most forms of NUMA/HPC. This alone
> > questions the relevance of most of your prior numa/core testing
> > results. I now have to strongly dispute your other conclusions
> > as well.
>
> Ingo, stop doing this kind of crap.
>
> Let's make it clear: if the NUMA patches continue to regress
> performance for reasonable loads (and that very much includes
> "no THP") then they won't be merged.
>
> You seem to be in total denial. Every time Mel sends out
> results that show that your patches MAKE PERFORMANCE WORSE you
> blame Mel, or blame the load, and never seem to admit that
> performance got worse.

No doubt numa/core should not regress with THP off or on and
I'll fix that.

As a background, here's how SPECjbb gets slower on mainline
(v3.7-rc6) if you boot Mel's kernel config and turn THP forcibly
off:

(avg: 502395 ops/sec)
(avg: 505902 ops/sec)
(avg: 509271 ops/sec)

# echo never > /sys/kernel/mm/transparent_hugepage/enabled

(avg: 376989 ops/sec)
(avg: 379463 ops/sec)
(avg: 378131 ops/sec)

A ~30% slowdown.

[ How do I know? I asked for Mel's kernel config days ago and
actually booted Mel's very config in the past few days,
spending hours on testing it on 4 separate NUMA systems,
trying to find Mel's regression. In the past Mel was a
reliable tester so I blindly trusted his results. Was that
some weird sort of denial on my part? :-) ]

Every time a regression is reported I take it seriously - and
there were performance regression reports against numa/core not
just from Mel and I'm sure there will be more in the future. For
example I'm taking David Rientjes' fresh performance regression
report seriously as well.

What I have some problem with is Mel sending me his kernel
config as the thing he tested, which included:

CONFIG_TRANSPARENT_HUGEPAGE=y
CONFIG_TRANSPARENT_HUGEPAGE_ALWAYS=y

but he apparently went and explicitly disabled THP on top of
that - which was just a weird choice of 'negative test tuning'
to keep unreported. That made me waste quite some time booting
and debugging his config and made the finding of the root cause
of the testing difference unnecessarily hard for me.

Again, that's not an excuse for the performance regression in
the numa/core tree in any way and I'll fix it.

Thanks,

Ingo

2012-11-20 07:37:10

by David Rientjes

[permalink] [raw]
Subject: Re: [PATCH 00/27] Latest numa/core release, v16

On Tue, 20 Nov 2012, Ingo Molnar wrote:

> No doubt numa/core should not regress with THP off or on and
> I'll fix that.
>
> As a background, here's how SPECjbb gets slower on mainline
> (v3.7-rc6) if you boot Mel's kernel config and turn THP forcibly
> off:
>
> (avg: 502395 ops/sec)
> (avg: 505902 ops/sec)
> (avg: 509271 ops/sec)
>
> # echo never > /sys/kernel/mm/transparent_hugepage/enabled
>
> (avg: 376989 ops/sec)
> (avg: 379463 ops/sec)
> (avg: 378131 ops/sec)
>
> A ~30% slowdown.
>
> [ How do I know? I asked for Mel's kernel config days ago and
> actually booted Mel's very config in the past few days,
> spending hours on testing it on 4 separate NUMA systems,
> trying to find Mel's regression. In the past Mel was a
> reliable tester so I blindly trusted his results. Was that
> some weird sort of denial on my part? :-) ]
>

I confirm that numa/core regresses significantly more without thp than the
6.3% regression I reported with thp in terms of throughput on the same
system. numa/core at 01aa90068b12 ("sched: Use the best-buddy 'ideal cpu'
in balancing decisions") had 99389.49 SPECjbb2005 bops whereas
ec05a2311c35 ("Merge branch 'sched/urgent' into sched/core") had 122246.90
SPECjbb2005 bops, a 23.0% regression.

perf top -U for >=0.70% at 01aa90068b12 ("sched: Use the best-buddy 'ideal
cpu' in balancing decisions"):

16.34% [kernel] [k] page_fault
12.15% [kernel] [k] down_read_trylock
9.21% [kernel] [k] up_read
7.58% [kernel] [k] handle_pte_fault
6.10% [kernel] [k] handle_mm_fault
4.35% [kernel] [k] retint_swapgs
3.99% [kernel] [k] find_vma
3.95% [kernel] [k] __do_page_fault
3.81% [kernel] [k] mpol_misplaced
3.41% [kernel] [k] get_vma_policy
2.68% [kernel] [k] task_numa_fault
1.82% [kernel] [k] pte_numa
1.65% [kernel] [k] do_page_fault
1.46% [kernel] [k] _raw_spin_lock
1.28% [kernel] [k] do_wp_page
1.26% [kernel] [k] vm_normal_page
1.25% [kernel] [k] unlock_page
1.01% [kernel] [k] change_protection
0.80% [kernel] [k] getnstimeofday
0.79% [kernel] [k] ktime_get
0.76% [kernel] [k] __wake_up_bit
0.74% [kernel] [k] rcu_check_callbacks

and at ec05a2311c35 ("Merge branch 'sched/urgent' into sched/core"):

22.01% [kernel] [k] page_fault
6.54% [kernel] [k] rcu_check_callbacks
5.04% [kernel] [k] getnstimeofday
4.12% [kernel] [k] ktime_get
3.55% [kernel] [k] read_tsc
3.37% [kernel] [k] task_tick_fair
2.61% [kernel] [k] emulate_vsyscall
2.22% [kernel] [k] __do_page_fault
1.78% [kernel] [k] run_timer_softirq
1.71% [kernel] [k] write_ok_or_segv
1.55% [kernel] [k] copy_user_generic_string
1.48% [kernel] [k] __bad_area_nosemaphore
1.27% [kernel] [k] retint_swapgs
1.26% [kernel] [k] spurious_fault
1.15% [kernel] [k] update_rq_clock
1.12% [kernel] [k] update_cfs_shares
1.09% [kernel] [k] _raw_spin_lock
1.08% [kernel] [k] update_curr
1.07% [kernel] [k] error_entry
1.05% [kernel] [k] x86_pmu_disable_all
0.88% [kernel] [k] sys_gettimeofday
0.88% [kernel] [k] __do_softirq
0.87% [kernel] [k] _raw_spin_lock_irq
0.84% [kernel] [k] hrtimer_forward
0.81% [kernel] [k] ktime_get_update_offsets
0.79% [kernel] [k] __update_cpu_load
0.77% [kernel] [k] acct_update_integrals
0.77% [kernel] [k] hrtimer_interrupt
0.75% [kernel] [k] perf_adjust_freq_unthr_context.part.81
0.73% [kernel] [k] do_gettimeofday
0.73% [kernel] [k] apic_timer_interrupt
0.72% [kernel] [k] timerqueue_add
0.70% [kernel] [k] tick_sched_timer

This is in comparison to my earlier perftop results which were with thp
enabled. Keep in mind that this system has a NUMA configuration of

$ cat /sys/devices/system/node/node*/distance
10 20 20 30
20 10 20 20
20 20 10 20
30 20 20 10

so perhaps you would have better luck reproducing the problem using the
new ability to fake the distance between nodes that Peter introduced in
94c0dd3278dd ("x86/numa: Allow specifying node_distance() for numa=fake")
with numa=fake=4:10,20,20,30,20,10,20,20,20,20,10,20,30,20,20,10 ?

2012-11-20 07:44:54

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH 00/27] Latest numa/core release, v16


* David Rientjes <[email protected]> wrote:

> On Tue, 20 Nov 2012, Ingo Molnar wrote:
>
> > > > numa/core at ec05a2311c35 ("Merge branch 'sched/urgent' into
> > > > sched/core") had an average throughput of 136918.34
> > > > SPECjbb2005 bops, which is a 6.3% regression.
> > >
> > > perftop during the run on numa/core at 01aa90068b12 ("sched:
> > > Use the best-buddy 'ideal cpu' in balancing decisions"):
> > >
> > > 15.99% [kernel] [k] page_fault
> > > 4.05% [kernel] [k] getnstimeofday
> > > 3.96% [kernel] [k] _raw_spin_lock
> > > 3.20% [kernel] [k] rcu_check_callbacks
> > > 2.93% [kernel] [k] generic_smp_call_function_interrupt
> > > 2.90% [kernel] [k] __do_page_fault
> > > 2.82% [kernel] [k] ktime_get
> >
> > Thanks for testing, that's very interesting - could you tell me
> > more about exactly what kind of hardware this is? I'll try to
> > find a similar system and reproduce the performance regression.
> >
>
> This happened to be an Opteron (but not 83xx series), 2.4Ghz.

Ok - roughly which family/model from /proc/cpuinfo?

> Your benchmarks were different in the number of cores but also
> in the amount of memory, do you think numa/core would regress
> because this is 32GB and not 64GB?

I'd not expect much sensitivity to RAM size.

> > (A wild guess would be an older 4x Opteron system, 83xx
> > series or so?)
>
> Just curious, how you would guess that? [...]

I'm testing numa/core on many systems and the performance
figures seemed to roughly map to that range.

> [...] Is there something about Opteron 83xx that make
> numa/core regress?

Not that I knew of - but apparently there is! I'll try to find a
system that matches yours as closely as possible and have a
look.

> > Also, the profile looks weird to me. Here is how perf top looks
> > like on my system during a similarly configured, "healthy"
> > SPECjbb run:
> >
> > 91.29% perf-6687.map [.] 0x00007fffed1e8f21
> > 4.81% libjvm.so [.] 0x00000000007004a0
> > 0.93% [vdso] [.] 0x00007ffff7ffe60c
> > 0.72% [kernel] [k] do_raw_spin_lock
> > 0.36% [kernel] [k] generic_smp_call_function_interrupt
> > 0.10% [kernel] [k] format_decode
> > 0.07% [kernel] [k] rcu_check_callbacks
> > 0.07% [kernel] [k] apic_timer_interrupt
> > 0.07% [kernel] [k] call_function_interrupt
> > 0.06% libc-2.15.so [.] __strcmp_sse42
> > 0.06% [kernel] [k] irqtime_account_irq
> > 0.06% perf [.] 0x000000000004bb7c
> > 0.05% [kernel] [k] x86_pmu_disable_all
> > 0.04% libc-2.15.so [.] __memcpy_ssse3
> > 0.04% [kernel] [k] ktime_get
> > 0.04% [kernel] [k] account_group_user_time
> > 0.03% [kernel] [k] vbin_printf
> >
> > and that is what SPECjbb does: it spends 97% of its time in Java
> > code - yet there's no Java overhead visible in your profile -
> > how is that possible? Could you try a newer perf on that box:
> >
>
> It's perf top -U, the benchmark itself was unchanged so I
> didn't think it was interesting to gather the user symbols.
> If that would be helpful, let me know!

Yeah, regular perf top output would be very helpful to get a
general sense of proportion. Thanks!

Ingo

2012-11-20 07:48:57

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH 00/27] Latest numa/core release, v16


* David Rientjes <[email protected]> wrote:

> This is in comparison to my earlier perftop results which were with thp
> enabled. Keep in mind that this system has a NUMA configuration of
>
> $ cat /sys/devices/system/node/node*/distance
> 10 20 20 30
> 20 10 20 20
> 20 20 10 20
> 30 20 20 10

You could check whether the basic topology is scheduled right by
running Andrea's autonumabench:

git://gitorious.org/autonuma-benchmark/autonuma-benchmark.git

and do a "start.sh -A" and send the mainline versus numa/core
results.

Thanks,

Ingo

2012-11-20 07:49:19

by Paul Turner

[permalink] [raw]
Subject: Re: [PATCH 00/27] Latest numa/core release, v16

On Mon, Nov 19, 2012 at 11:44 PM, Ingo Molnar <[email protected]> wrote:
>
> * David Rientjes <[email protected]> wrote:
>
>> On Tue, 20 Nov 2012, Ingo Molnar wrote:
>>
>> > > > numa/core at ec05a2311c35 ("Merge branch 'sched/urgent' into
>> > > > sched/core") had an average throughput of 136918.34
>> > > > SPECjbb2005 bops, which is a 6.3% regression.
>> > >
>> > > perftop during the run on numa/core at 01aa90068b12 ("sched:
>> > > Use the best-buddy 'ideal cpu' in balancing decisions"):
>> > >
>> > > 15.99% [kernel] [k] page_fault
>> > > 4.05% [kernel] [k] getnstimeofday
>> > > 3.96% [kernel] [k] _raw_spin_lock
>> > > 3.20% [kernel] [k] rcu_check_callbacks
>> > > 2.93% [kernel] [k] generic_smp_call_function_interrupt
>> > > 2.90% [kernel] [k] __do_page_fault
>> > > 2.82% [kernel] [k] ktime_get
>> >
>> > Thanks for testing, that's very interesting - could you tell me
>> > more about exactly what kind of hardware this is? I'll try to
>> > find a similar system and reproduce the performance regression.
>> >
>>
>> This happened to be an Opteron (but not 83xx series), 2.4Ghz.
>
> Ok - roughly which family/model from /proc/cpuinfo?
>
>> Your benchmarks were different in the number of cores but also
>> in the amount of memory, do you think numa/core would regress
>> because this is 32GB and not 64GB?
>
> I'd not expect much sensitivity to RAM size.
>
>> > (A wild guess would be an older 4x Opteron system, 83xx
>> > series or so?)
>>
>> Just curious, how you would guess that? [...]
>
> I'm testing numa/core on many systems and the performance
> figures seemed to roughly map to that range.
>
>> [...] Is there something about Opteron 83xx that make
>> numa/core regress?
>
> Not that I knew of - but apparently there is! I'll try to find a
> system that matches yours as closely as possible and have a
> look.

Here I'd note the node-distances that David included above. This
system is not fully connected, having an (asymmetric) kite topology.
Only nodes 1 and 2 are fully connected.

This is sufficiently whacky that it seems a likely candidate :-).

- Paul

>
>> > Also, the profile looks weird to me. Here is how perf top looks
>> > like on my system during a similarly configured, "healthy"
>> > SPECjbb run:
>> >
>> > 91.29% perf-6687.map [.] 0x00007fffed1e8f21
>> > 4.81% libjvm.so [.] 0x00000000007004a0
>> > 0.93% [vdso] [.] 0x00007ffff7ffe60c
>> > 0.72% [kernel] [k] do_raw_spin_lock
>> > 0.36% [kernel] [k] generic_smp_call_function_interrupt
>> > 0.10% [kernel] [k] format_decode
>> > 0.07% [kernel] [k] rcu_check_callbacks
>> > 0.07% [kernel] [k] apic_timer_interrupt
>> > 0.07% [kernel] [k] call_function_interrupt
>> > 0.06% libc-2.15.so [.] __strcmp_sse42
>> > 0.06% [kernel] [k] irqtime_account_irq
>> > 0.06% perf [.] 0x000000000004bb7c
>> > 0.05% [kernel] [k] x86_pmu_disable_all
>> > 0.04% libc-2.15.so [.] __memcpy_ssse3
>> > 0.04% [kernel] [k] ktime_get
>> > 0.04% [kernel] [k] account_group_user_time
>> > 0.03% [kernel] [k] vbin_printf
>> >
>> > and that is what SPECjbb does: it spends 97% of its time in Java
>> > code - yet there's no Java overhead visible in your profile -
>> > how is that possible? Could you try a newer perf on that box:
>> >
>>
>> It's perf top -U, the benchmark itself was unchanged so I
>> didn't think it was interesting to gather the user symbols.
>> If that would be helpful, let me know!
>
> Yeah, regular perf top output would be very helpful to get a
> general sense of proportion. Thanks!
>
> Ingo

2012-11-20 08:01:20

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH 00/27] Latest numa/core release, v16


* David Rientjes <[email protected]> wrote:

> I confirm that numa/core regresses significantly more without
> thp than the 6.3% regression I reported with thp in terms of
> throughput on the same system. numa/core at 01aa90068b12
> ("sched: Use the best-buddy 'ideal cpu' in balancing
> decisions") had 99389.49 SPECjbb2005 bops whereas ec05a2311c35
> ("Merge branch 'sched/urgent' into sched/core") had 122246.90
> SPECjbb2005 bops, a 23.0% regression.

What is the base performance figure with THP disabled? Your
baseline was:

sched/core at ec05a2311c35: 136918.34 SPECjbb2005

Would be interesting to see how that kernel reacts to THP off.

Thanks,

Ingo

2012-11-20 08:11:11

by David Rientjes

[permalink] [raw]
Subject: Re: [PATCH 00/27] Latest numa/core release, v16

On Tue, 20 Nov 2012, Ingo Molnar wrote:

> > I confirm that numa/core regresses significantly more without
> > thp than the 6.3% regression I reported with thp in terms of
> > throughput on the same system. numa/core at 01aa90068b12
> > ("sched: Use the best-buddy 'ideal cpu' in balancing
> > decisions") had 99389.49 SPECjbb2005 bops whereas ec05a2311c35
> > ("Merge branch 'sched/urgent' into sched/core") had 122246.90
> > SPECjbb2005 bops, a 23.0% regression.
>
> What is the base performance figure with THP disabled? Your
> baseline was:
>
> sched/core at ec05a2311c35: 136918.34 SPECjbb2005
>
> Would be interesting to see how that kernel reacts to THP off.
>

In summary, the benchmarks that I've collected thus far are:

THP enabled:

numa/core at ec05a2311c35: 136918.34 SPECjbb2005 bops
numa/core at 01aa90068b12: 128315.19 SPECjbb2005 bops (-6.3%)

THP disabled:

numa/core at ec05a2311c35: 122246.90 SPECjbb2005 bops
numa/core at 01aa90068b12: 99389.49 SPECjbb2005 bops (-23.0%)

2012-11-20 08:20:10

by David Rientjes

[permalink] [raw]
Subject: Re: [PATCH 00/27] Latest numa/core release, v16

On Tue, 20 Nov 2012, Ingo Molnar wrote:

> > This happened to be an Opteron (but not 83xx series), 2.4Ghz.
>
> Ok - roughly which family/model from /proc/cpuinfo?
>

It's close enough, it's 23xx.

> > It's perf top -U, the benchmark itself was unchanged so I
> > didn't think it was interesting to gather the user symbols.
> > If that would be helpful, let me know!
>
> Yeah, regular perf top output would be very helpful to get a
> general sense of proportion. Thanks!
>

Ok, here it is:

91.24% perf-10971.map [.] 0x00007f116a6c6fb8
1.19% libjvm.so [.] instanceKlass::oop_push_contents(PSPromotionMa
1.04% libjvm.so [.] PSPromotionManager::drain_stacks_depth(bool)
0.79% libjvm.so [.] PSPromotionManager::copy_to_survivor_space(oop
0.60% libjvm.so [.] PSPromotionManager::claim_or_forward_internal_
0.58% [kernel] [k] page_fault
0.28% libc-2.3.6.so [.] __gettimeofday
0.26% libjvm.so [.] Copy::pd_disjoint_words(HeapWord*, HeapWord*, unsigned
0.22% [kernel] [k] getnstimeofday
0.18% libjvm.so [.] CardTableExtension::scavenge_contents_parallel(ObjectS
0.15% [kernel] [k] _raw_spin_lock
0.12% [kernel] [k] ktime_get_update_offsets
0.11% [kernel] [k] ktime_get
0.11% [kernel] [k] rcu_check_callbacks
0.10% [kernel] [k] generic_smp_call_function_interrupt
0.10% [kernel] [k] read_tsc
0.10% [kernel] [k] clear_page_c
0.10% [kernel] [k] __do_page_fault
0.08% [kernel] [k] handle_mm_fault
0.08% libjvm.so [.] os::javaTimeMillis()
0.08% [kernel] [k] emulate_vsyscall
0.08% [kernel] [k] flush_tlb_func
0.07% [kernel] [k] task_tick_fair
0.07% [kernel] [k] retint_swapgs
0.06% libjvm.so [.] oopDesc::size_given_klass(Klass*)
0.05% [kernel] [k] handle_pte_fault
0.05% perf [.] 0x0000000000033190
0.05% sanctuaryd [.] 0x00000000006f0ad3
0.05% [kernel] [k] copy_user_generic_string
0.05% libjvm.so [.] objArrayKlass::oop_push_contents(PSPromotionManager*,
0.04% [kernel] [k] find_vma
0.04% [kernel] [k] mpol_misplaced
0.04% [kernel] [k] __bad_area_nosemaphore
0.04% [kernel] [k] get_vma_policy
0.04% [kernel] [k] task_numa_fault
0.04% [kernel] [k] error_sti
0.03% [kernel] [k] write_ok_or_segv
0.03% [kernel] [k] do_gettimeofday
0.03% [kernel] [k] down_read_trylock
0.03% [kernel] [k] update_cfs_shares
0.03% [kernel] [k] error_entry
0.03% [kernel] [k] run_timer_softirq

2012-11-20 09:06:49

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH 00/27] Latest numa/core release, v16


* David Rientjes <[email protected]> wrote:

> On Tue, 20 Nov 2012, Ingo Molnar wrote:
>
> > > This happened to be an Opteron (but not 83xx series), 2.4Ghz.
> >
> > Ok - roughly which family/model from /proc/cpuinfo?
>
> It's close enough, it's 23xx.

Ok - which family/model number in /proc/cpuinfo?

I'm asking because that will matter most to page fault
micro-characteristics - the 23xx series existed in Barcelona
form (family/model 16/2) and still exists in its current
Shanghai form.

My guess is Barcelona 16/2?

If that is correct then the closest I can get to your topology
is a 4-socket 32-way Opteron system with 32 GB of RAM - which
seems close enough for testing purposes.

But checking numa/core on such a system still keeps me
absolutely puzzled, as I get the following with a similar
16-warehouses SPECjbb 2005 test, using java -Xms8192m -Xmx8192m
-Xss256k sizing, THP enabled, 2x 240 seconds runs (I tried to
configure it all very close to yours), using -tip-a07005cbd847:

  kernel          warehouses    transactions/sec
  ----------      ----------    ----------------
  v3.7-rc6:           16             197802
                      16             197997

  numa/core:          16             203086
                      16             203967

So sadly numa/core is about 2%-3% faster on this 4x4 system too!
:-/

But I have to say, your SPECjbb score is uncharacteristically
low even for an oddball-topology Barcelona system - which is the
oldest/slowest system I can think of. So there might be more to
this.

To further characterise a "good" SPECjbb run, there's no
page_fault overhead visible in perf top:

Mainline profile:

94.99% perf-1244.map [.] 0x00007f04cd1aa523
2.52% libjvm.so [.] 0x00000000007004a1
0.62% [vdso] [.] 0x0000000000000972
0.31% [kernel] [k] clear_page_c
0.17% [kernel] [k] timekeeping_get_ns.constprop.7
0.11% [kernel] [k] rep_nop
0.09% [kernel] [k] ktime_get
0.08% [kernel] [k] get_cycles
0.06% [kernel] [k] read_tsc
0.05% libc-2.15.so [.] __strcmp_sse2

numa/core profile:

95.66% perf-1201.map [.] 0x00007fe4ad1c8fc7
1.70% libjvm.so [.] 0x0000000000381581
0.59% [vdso] [.] 0x0000000000000607
0.19% [kernel] [k] do_raw_spin_lock
0.11% [kernel] [k] generic_smp_call_function_interrupt
0.11% [kernel] [k] timekeeping_get_ns.constprop.7
0.08% [kernel] [k] ktime_get
0.06% [kernel] [k] get_cycles
0.05% [kernel] [k] __native_flush_tlb
0.05% [kernel] [k] rep_nop
0.04% perf [.] add_hist_entry.isra.9
0.04% [kernel] [k] rcu_check_callbacks
0.04% [kernel] [k] ktime_get_update_offsets
0.04% libc-2.15.so [.] __strcmp_sse2

No page fault overhead (see the page fault rate further below) -
the NUMA scanning overhead shows up only through some mild TLB
flush activity (which I'll fix btw).

[ Stupid question: cpufreq is configured to always-2.4GHz,
right? If you could send me your kernel config (you can do
that privately as well) then I can try to boot it and see. ]

> > > It's perf top -U, the benchmark itself was unchanged so I
> > > didn't think it was interesting to gather the user
> > > symbols. If that would be helpful, let me know!
> >
> > Yeah, regular perf top output would be very helpful to get a
> > general sense of proportion. Thanks!
>
> Ok, here it is:
>
> 91.24% perf-10971.map [.] 0x00007f116a6c6fb8
> 1.19% libjvm.so [.] instanceKlass::oop_push_contents(PSPromotionMa
> 1.04% libjvm.so [.] PSPromotionManager::drain_stacks_depth(bool)
> 0.79% libjvm.so [.] PSPromotionManager::copy_to_survivor_space(oop
> 0.60% libjvm.so [.] PSPromotionManager::claim_or_forward_internal_
> 0.58% [kernel] [k] page_fault
> 0.28% libc-2.3.6.so [.] __gettimeofday
> 0.26% libjvm.so [.] Copy::pd_disjoint_words(HeapWord*, HeapWord*, unsigned
> 0.22% [kernel] [k] getnstimeofday
> 0.18% libjvm.so [.] CardTableExtension::scavenge_contents_parallel(ObjectS
> 0.15% [kernel] [k] _raw_spin_lock
> 0.12% [kernel] [k] ktime_get_update_offsets
> 0.11% [kernel] [k] ktime_get
> 0.11% [kernel] [k] rcu_check_callbacks
> 0.10% [kernel] [k] generic_smp_call_function_interrupt
> 0.10% [kernel] [k] read_tsc
> 0.10% [kernel] [k] clear_page_c
> 0.10% [kernel] [k] __do_page_fault
> 0.08% [kernel] [k] handle_mm_fault
> 0.08% libjvm.so [.] os::javaTimeMillis()
> 0.08% [kernel] [k] emulate_vsyscall

Oh, finally a clue: you seem to have vsyscall emulation
overhead!

Vsyscall emulation is fundamentally page fault driven - which
might explain why you are seeing page fault overhead. It might
also interact with other sources of faults - such as numa/core's
working set probing ...

Many JVMs try to be smart with the vsyscall. As a test, does the
vsyscall=native boot option change the results/behavior in any
way?
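
( Purely as an illustration - not code from any particular JVM - here
  is a minimal C sketch of the kind of legacy call that ends up in
  emulate_vsyscall(). 0xffffffffff600000 is the fixed legacy
  gettimeofday vsyscall entry, so under vsyscall=emulate every such
  call takes a page fault, while under vsyscall=native it does not:

    /* vsyscall-demo.c - build with: gcc -O2 -o vsyscall-demo vsyscall-demo.c */
    #include <stdio.h>
    #include <sys/time.h>

    /* Same prototype as gettimeofday(2): */
    typedef int (*legacy_gtod_t)(struct timeval *tv, struct timezone *tz);

    int main(void)
    {
            /* First entry of the fixed legacy vsyscall page: */
            legacy_gtod_t vgtod = (legacy_gtod_t)0xffffffffff600000UL;
            struct timeval tv;

            /* Faults into emulate_vsyscall() when vsyscall=emulate: */
            if (vgtod(&tv, NULL) == 0)
                    printf("%ld.%06ld\n", (long)tv.tv_sec, (long)tv.tv_usec);

            return 0;
    }

  Running that in a tight loop under "perf stat -e faults" makes the
  emulation faults directly visible. )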

Stupid question, if you apply the patch attached below and if
you do page fault profiling while the run is in steady state:

perf record -e faults -g -a sleep 10

do you see it often coming from the vsyscall page?

Also, this:

perf stat -e faults -a --repeat 10 sleep 1

should normally report something like this during SPECjbb steady
state, numa/core:

warmup: 3,895 faults/sec ( +- 12.11% )
steady state: 3,910 faults/sec ( +- 6.72% )

Which is about 250 faults/sec/CPU - i.e. it should be barely
recognizable in profiles - let alone be prominent as in yours.

Thanks,

Ingo

---
arch/x86/mm/fault.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)

Index: linux/arch/x86/mm/fault.c
===================================================================
--- linux.orig/arch/x86/mm/fault.c
+++ linux/arch/x86/mm/fault.c
@@ -1030,6 +1030,9 @@ __do_page_fault(struct pt_regs *regs, un
/* Get the faulting address: */
address = read_cr2();

+ /* Instrument as early as possible: */
+ perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address);
+
/*
* Detect and handle instructions that would cause a page fault for
* both a tracked kernel page and a userspace page.
@@ -1107,8 +1110,6 @@ __do_page_fault(struct pt_regs *regs, un
}
}

- perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address);
-
/*
* If we're in an interrupt, have no user context or are running
* in an atomic region then we must not take the fault:

2012-11-20 09:41:42

by Ingo Molnar

[permalink] [raw]
Subject: [patch] x86/vsyscall: Add Kconfig option to use native vsyscalls, switch to it


* Ingo Molnar <[email protected]> wrote:

> > 0.10% [kernel] [k] __do_page_fault
> > 0.08% [kernel] [k] handle_mm_fault
> > 0.08% libjvm.so [.] os::javaTimeMillis()
> > 0.08% [kernel] [k] emulate_vsyscall
>
> Oh, finally a clue: you seem to have vsyscall emulation
> overhead!
>
> Vsyscall emulation is fundamentally page fault driven - which
> might explain why you are seeing page fault overhead. It might
> also interact with other sources of faults - such as
> numa/core's working set probing ...
>
> Many JVMs try to be smart with the vsyscall. As a test, does
> the vsyscall=native boot option change the results/behavior in
> any way?

As a blind shot into the dark, does the attached patch help?

If that's the root cause then it should measurably help mainline
SPECjbb performance as well. It could turn numa/core from a
regression into a win on your system.

Thanks,

Ingo

----------------->
Subject: x86/vsyscall: Add Kconfig option to use native vsyscalls, switch to it
From: Ingo Molnar <[email protected]>

Apparently there's still plenty of systems out there triggering
the vsyscall emulation page faults - causing hard to track down
performance regressions on page fault intense workloads...

Some people seem to have run into that with threading-intense
Java workloads.

So until there's a better solution to this, add a Kconfig switch
to make the vsyscall mode configurable and turn native vsyscall
support back on by default.

Distributions whose libraries and applications use the vDSO and never
access the vsyscall page can change the config option to off.

Note, I don't think we want to expose the "none" option via a Kconfig
switch - it breaks the ABI. So it's "native" versus "emulate", with
"none" being available as a kernel boot option, for the super paranoid.

For more background, see these upstream commits:

3ae36655b97a x86-64: Rework vsyscall emulation and add vsyscall= parameter
5cec93c216db x86-64: Emulate legacy vsyscalls

Cc: Andy Lutomirski <[email protected]>
Cc: Linus Torvalds <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
---
arch/x86/Kconfig | 21 +++++++++++++++++++++
arch/x86/kernel/vsyscall_64.c | 8 +++++++-
2 files changed, 28 insertions(+), 1 deletion(-)

Index: linux/arch/x86/Kconfig
===================================================================
--- linux.orig/arch/x86/Kconfig
+++ linux/arch/x86/Kconfig
@@ -2234,6 +2234,27 @@ config X86_X32
elf32_x86_64 support enabled to compile a kernel with this
option set.

+config X86_VSYSCALL_COMPAT
+ bool "vsyscall compatibility"
+ default y
+ help
+ vsyscalls, as global executable pages, can be a security hole
+ escalation helper by exposing an easy shell code target with
+ a predictable address.
+
+ Many versions of glibc rely on the vsyscall page though, so it
+ cannot be eliminated unconditionally. If you disable this
+ option these systems will still work but might incur the overhead
+ of vsyscall emulation page faults.
+
+ The vsyscall=none, vsyscall=emulate, vsyscall=native kernel boot
+ option can be used to override this mode as well.
+
+ Keeping this option enabled leaves the vsyscall page enabled,
+ i.e. vsyscall=native. Disabling this option means vsyscall=emulate.
+
+ If unsure, say Y.
+
config COMPAT
def_bool y
depends on IA32_EMULATION || X86_X32
Index: linux/arch/x86/kernel/vsyscall_64.c
===================================================================
--- linux.orig/arch/x86/kernel/vsyscall_64.c
+++ linux/arch/x86/kernel/vsyscall_64.c
@@ -56,7 +56,13 @@
DEFINE_VVAR(int, vgetcpu_mode);
DEFINE_VVAR(struct vsyscall_gtod_data, vsyscall_gtod_data);

-static enum { EMULATE, NATIVE, NONE } vsyscall_mode = EMULATE;
+static enum { EMULATE, NATIVE, NONE } vsyscall_mode =
+#ifdef CONFIG_X86_VSYSCALL_COMPAT
+ NATIVE
+#else
+ EMULATE
+#endif
+;

static int __init vsyscall_setup(char *str)
{

2012-11-20 10:20:19

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 00/27] Latest numa/core release, v16

> > Ingo, stop doing this kind of crap.
> >
> > Let's make it clear: if the NUMA patches continue to regress
> > performance for reasonable loads (and that very much includes
> > "no THP") then they won't be merged.
> >
> > You seem to be in total denial. Every time Mel sends out
> > results that show that your patches MAKE PERFORMANCE WORSE you
> > blame Mel, or blame the load, and never seem to admit that
> > performance got worse.
>
> No doubt numa/core should not regress with THP off or on and
> I'll fix that.
>
> As a background, here's how SPECjbb gets slower on mainline
> (v3.7-rc6) if you boot Mel's kernel config and turn THP forcibly
> off:
>
> (avg: 502395 ops/sec)
> (avg: 505902 ops/sec)
> (avg: 509271 ops/sec)
>
> # echo never > /sys/kernel/mm/transparent_hugepage/enabled
>
> (avg: 376989 ops/sec)
> (avg: 379463 ops/sec)
> (avg: 378131 ops/sec)
>
> A ~30% slowdown.
>
> [ How do I know? I asked for Mel's kernel config days ago and
> actually booted Mel's very config in the past few days,
> spending hours on testing it on 4 separate NUMA systems,
> trying to find Mel's regression. In the past Mel was a
> reliable tester so I blindly trusted his results. Was that
> some weird sort of denial on my part? :-) ]
>
> Every time a regression is reported I take it seriously - and
> there were performance regression reports against numa/core not
> just from Mel and I'm sure there will be more in the future. For
> example I'm taking David Rientjes' fresh performance regression
> report seriously as well.
>
> What I have some problem with is Mel sending me his kernel
> config as the thing he tested, which included:
>
> CONFIG_TRANSPARENT_HUGEPAGE=y
> CONFIG_TRANSPARENT_HUGEPAGE_ALWAYS=y
>
> but he apparently went and explicitly disabled THP on top of
> that - which was just a weird choice of 'negative test tuning'
> to keep unreported.

I've already apologised for this omission and I'll apologise again if
necessary. The whole point of implementing MM Tests was to mitigate exactly
this sort of situation: when the testing conditions are suspicious, the
configuration file and scripts can be rechecked.
of THP usage was due to the JVM using shared memory because a few years ago
the JVM I was using at the time did this if configured for hugepage usage.
Assumptions made a complete ass out of me here. When I did the recheck of
what the JVM was actually doing and reexamined the configuration, I saw
the mistake and admitted it immediately.

I want to be absolutely clear that it was unintentional to disable THP like
this which is why I did not report it. I did not deliberately hide such
information because that would be completely unacceptable. The root of the
mistake actually dates from a few years ago, when tests would be configured
to run with base pages, huge pages and transparent hugepages -- similar to
how we might currently test vanilla kernel, hard bindings and automatic
migration. Because of their history, some mmtest scripts support running
with multiple page sizes and I neglected to properly identify this and
retrofit a "default hugepage" configuration.

I've also already been clear that this was done for *all* the specjbb
tests. It was still a mistake but it was evenly applied.

I've added two extra configuration files to run specjbb single and multi
JVMs with THP enabled. It takes about 1.5 to 2 hours to complete a single
test which means a full battery of tests for autonuma, vanilla kernel and
schednuma will take a little over 24 hours (4 specjbb tests, autobench and
a few basic performance tests like kernbench, hackbench etc). They will
not be running back-to-back as the machine is not dedicated to this. I'll
report when they're done.

--
Mel Gorman
SUSE Labs

2012-11-20 10:41:00

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH 00/27] Latest numa/core release, v16


btw., mind sending me a fuller/longer profile than the one
you've sent before? In particular does your system have any
vsyscall emulation page fault overhead?

If yes, does the patch below change anything for you?

Thanks,

Ingo

---------------->
Subject: x86/vsyscall: Add Kconfig option to use native vsyscalls, switch to it
From: Ingo Molnar <[email protected]>

Apparently there's still plenty of systems out there triggering
the vsyscall emulation page faults - causing hard to track down
performance regressions on page fault intense workloads...

Some people seem to have run into that with threading-intense
Java workloads.

So until there's a better solution to this, add a Kconfig switch
to make the vsyscall mode configurable and turn native vsyscall
support back on by default.

Distributions whose libraries and applications use the vDSO and never
access the vsyscall page can change the config option to off.

Note, I don't think we want to expose the "none" option via a Kconfig
switch - it breaks the ABI. So it's "native" versus "emulate", with
"none" being available as a kernel boot option, for the super paranoid.

For more background, see these upstream commits:

3ae36655b97a x86-64: Rework vsyscall emulation and add vsyscall= parameter
5cec93c216db x86-64: Emulate legacy vsyscalls

Cc: Andy Lutomirski <[email protected]>
Cc: Linus Torvalds <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
---
arch/x86/Kconfig | 21 +++++++++++++++++++++
arch/x86/kernel/vsyscall_64.c | 8 +++++++-
2 files changed, 28 insertions(+), 1 deletion(-)

Index: linux/arch/x86/Kconfig
===================================================================
--- linux.orig/arch/x86/Kconfig
+++ linux/arch/x86/Kconfig
@@ -2234,6 +2234,27 @@ config X86_X32
elf32_x86_64 support enabled to compile a kernel with this
option set.

+config X86_VSYSCALL_COMPAT
+ bool "vsyscall compatibility"
+ default y
+ help
+ vsyscalls, as global executable pages, can be a security hole
+ escalation helper by exposing an easy shell code target with
+ a predictable address.
+
+ Many versions of glibc rely on the vsyscall page though, so it
+ cannot be eliminated unconditionally. If you disable this
+ option these systems will still work but might incur the overhead
+ of vsyscall emulation page faults.
+
+ The vsyscall=none, vsyscall=emulate, vsyscall=native kernel boot
+ option can be used to override this mode as well.
+
+ Keeping this option enabled leaves the vsyscall page enabled,
+ i.e. vsyscall=native. Disabling this option means vsyscall=emulate.
+
+ If unsure, say Y.
+
config COMPAT
def_bool y
depends on IA32_EMULATION || X86_X32
Index: linux/arch/x86/kernel/vsyscall_64.c
===================================================================
--- linux.orig/arch/x86/kernel/vsyscall_64.c
+++ linux/arch/x86/kernel/vsyscall_64.c
@@ -56,7 +56,13 @@
DEFINE_VVAR(int, vgetcpu_mode);
DEFINE_VVAR(struct vsyscall_gtod_data, vsyscall_gtod_data);

-static enum { EMULATE, NATIVE, NONE } vsyscall_mode = EMULATE;
+static enum { EMULATE, NATIVE, NONE } vsyscall_mode =
+#ifdef CONFIG_X86_VSYSCALL_COMPAT
+ NATIVE
+#else
+ EMULATE
+#endif
+;

static int __init vsyscall_setup(char *str)
{

2012-11-20 10:47:54

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 00/27] Latest numa/core release, v16

On Tue, Nov 20, 2012 at 10:20:10AM +0000, Mel Gorman wrote:
> I've added two extra configuration files to run specjbb single and multi
> JVMs with THP enabled. It takes about 1.5 to 2 hours to complete a single

1.5 to 2 hours if running to the full set of warehouses required for a
compliant run. Configuration is currently limited to a smaller number of
threads so each specjbb should complete between 40 minutes and an hour.

--
Mel Gorman
SUSE Labs

2012-11-20 11:40:09

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 00/27] Latest numa/core release, v16

On Tue, Nov 20, 2012 at 11:40:53AM +0100, Ingo Molnar wrote:
>
> btw., mind sending me a fuller/longer profile than the one
> you've sent before? In particular does your system have any
> vsyscall emulation page fault overhead?
>

I can't, the results for specjbb got trashed after I moved to 3.7-rc6 and
the old run was incomplete :( I'll have to re-run for profiles. I generally
do not run with profiling enabled because it can distort results very badly.

I went back and re-examined anything else that could distort the results. I
have monitoring other than profiles enabled and the heaviest monitor by
far is reading smaps every 10 seconds to see how processes are currently
distributed between nodes and what CPU they are running on. The intention
was to be able to examine after the fact if the individual java processes
were ending up on the same nodes and scheduled properly.

However, this would impact the peak performance and affect contention on
mmap_sem every 10 seconds as it's possible for the PTE scanner and smaps
reader to contend. Care is taken to only read smaps once per process. So
for each JVM instance, it would read smaps once even though it would
examine where every thread is running. This minimises the distortion.
top is also updating every 10 seconds, which contends on locks as well.
The other monitors are relatively harmless and read files like
/proc/vmstat and each numastat file every 10 seconds. As before, all tests
had these monitors enabled so they would all have suffered evenly.

Normally when I'm looking at a series I run both with and without monitors. I
report without monitors and use monitors to help debug any problems that
are identified. Unfortunately, time pressure is not allowing me to do that
this time.

> If yes, does the patch below change anything for you?
>

I'll check it.

Thanks.

--
Mel Gorman
SUSE Labs

2012-11-20 12:03:00

by Ingo Molnar

[permalink] [raw]
Subject: [PATCH] x86/mm: Don't flush the TLB on #WP pmd fixups


* Ingo Molnar <[email protected]> wrote:

> numa/core profile:
>
> 95.66% perf-1201.map [.] 0x00007fe4ad1c8fc7
> 1.70% libjvm.so [.] 0x0000000000381581
> 0.59% [vdso] [.] 0x0000000000000607
> 0.19% [kernel] [k] do_raw_spin_lock
> 0.11% [kernel] [k] generic_smp_call_function_interrupt
> 0.11% [kernel] [k] timekeeping_get_ns.constprop.7
> 0.08% [kernel] [k] ktime_get
> 0.06% [kernel] [k] get_cycles
> 0.05% [kernel] [k] __native_flush_tlb
> 0.05% [kernel] [k] rep_nop
> 0.04% perf [.] add_hist_entry.isra.9
> 0.04% [kernel] [k] rcu_check_callbacks
> 0.04% [kernel] [k] ktime_get_update_offsets
> 0.04% libc-2.15.so [.] __strcmp_sse2
>
> No page fault overhead (see the page fault rate further below)
> - the NUMA scanning overhead shows up only through some mild
> TLB flush activity (which I'll fix btw).

The patch attached below should get rid of that mild TLB
flushing activity as well.

Thanks,

Ingo

--------------------------->
Subject: x86/mm: Don't flush the TLB on #WP pmd fixups
From: Ingo Molnar <[email protected]>
Date: Tue Nov 20 14:46:34 CET 2012

If we have a write protection #PF and fix up the pmd then the
hugetlb code [the only user of pmdp_set_access_flags], in its
do_huge_pmd_wp_page() page fault resolution function calls
pmdp_set_access_flags() to mark the pmd permissive again,
and flushes the TLB.

This TLB flush is unnecessary: a flush on #PF is guaranteed on
most (all?) x86 CPUs, and even in the worst-case we'll generate
a spurious fault.

So remove it.

Cc: Linus Torvalds <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Hugh Dickins <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
---
arch/x86/mm/pgtable.c | 7 ++++++-
1 file changed, 6 insertions(+), 1 deletion(-)

Index: linux/arch/x86/mm/pgtable.c
===================================================================
--- linux.orig/arch/x86/mm/pgtable.c
+++ linux/arch/x86/mm/pgtable.c
@@ -334,7 +334,12 @@ int pmdp_set_access_flags(struct vm_area
if (changed && dirty) {
*pmdp = entry;
pmd_update_defer(vma->vm_mm, address, pmdp);
- flush_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
+ /*
+ * We had a write-protection fault here and changed the pmd
+ * to be more permissive. No need to flush the TLB for that,
+ * #PF is architecturally guaranteed to do that and in the
+ * worst-case we'll generate a spurious fault.
+ */
}

return changed;

2012-11-20 12:32:04

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH] x86/mm: Don't flush the TLB on #WP pmd fixups


* Ingo Molnar <[email protected]> wrote:

> * Ingo Molnar <[email protected]> wrote:
>
> > numa/core profile:
> >
> > 95.66% perf-1201.map [.] 0x00007fe4ad1c8fc7
> > 1.70% libjvm.so [.] 0x0000000000381581
> > 0.59% [vdso] [.] 0x0000000000000607
> > 0.19% [kernel] [k] do_raw_spin_lock
> > 0.11% [kernel] [k] generic_smp_call_function_interrupt
> > 0.11% [kernel] [k] timekeeping_get_ns.constprop.7
> > 0.08% [kernel] [k] ktime_get
> > 0.06% [kernel] [k] get_cycles
> > 0.05% [kernel] [k] __native_flush_tlb
> > 0.05% [kernel] [k] rep_nop
> > 0.04% perf [.] add_hist_entry.isra.9
> > 0.04% [kernel] [k] rcu_check_callbacks
> > 0.04% [kernel] [k] ktime_get_update_offsets
> > 0.04% libc-2.15.so [.] __strcmp_sse2
> >
> > No page fault overhead (see the page fault rate further below)
> > - the NUMA scanning overhead shows up only through some mild
> > TLB flush activity (which I'll fix btw).
>
> The patch attached below should get rid of that mild TLB
> flushing activity as well.

This has further increased SPECjbb from 203k/sec to 207k/sec,
i.e. it's now 5% faster than mainline - THP enabled.

The profile is now totally flat even during a full 32-WH SPECjbb
run, with the highest overhead entries left all related to timer
IRQ processing or profiling. That is on a system that should be
very close to yours.

Thanks,

Ingo

2012-11-20 15:29:41

by Ingo Molnar

[permalink] [raw]
Subject: [PATCH] mm, numa: Turn 4K pte NUMA faults into effective hugepage ones


* Ingo Molnar <[email protected]> wrote:

> * Linus Torvalds <[email protected]> wrote:
>
> > On Mon, Nov 19, 2012 at 12:36 PM, Ingo Molnar <[email protected]> wrote:
> > >
> > > Hugepages is a must for most forms of NUMA/HPC. This alone
> > > questions the relevance of most of your prior numa/core testing
> > > results. I now have to strongly dispute your other conclusions
> > > as well.
> >
> > Ingo, stop doing this kind of crap.
> >
> > Let's make it clear: if the NUMA patches continue to regress
> > performance for reasonable loads (and that very much includes
> > "no THP") then they won't be merged.
> [...]
>
> No doubt numa/core should not regress with THP off or on and
> I'll fix that.

Once it was clear how Mel's workload was configured I could
reproduce it immediately myself as well and the fix was easy and
straightforward: the attached patch should do the trick.

(Lightly tested.)

Updated 32-warehouse SPECjbb test benchmarks on a 4x4 64 GB
system:

mainline: 395 k/sec
numa/core +patch: 512 k/sec [ +29.6% ]

mainline +THP: 524 k/sec
numa/core +patch +THP: 654 k/sec [ +24.8% ]

So here on my box the reported 32-warehouse SPECjbb regressions
are fixed to the best of my knowledge, and numa/core is now a
nice unconditional speedup over mainline.

CONFIG_NUMA_BALANCING=y brings roughly as much of a speedup to
mainline as CONFIG_TRANSPARENT_HUGEPAGE=y itself - and the
combination of the two features brings roughly a combination of
speedups: +65%, which looks pretty impressive.

This fix had no impact on the good "+THP +NUMA" results that
were reproducible with -v16 already.

Mel, David, could you give this patch too a whirl? It should
improve !THP workloads.

( The 4x JVM regression is still an open bug I think - I'll
re-check and fix that one next, no need to re-report it,
I'm on it. )

Thanks,

Ingo


----------------------------->
Subject: mm, numa: Turn 4K pte NUMA faults into effective hugepage ones
From: Ingo Molnar <[email protected]>
Date: Tue Nov 20 15:48:26 CET 2012

Reduce the 4K page fault count by looking around and processing
nearby pages if possible.

To keep the logic simple and straightforward we do a couple of
simplifications:

- we only scan in the HPAGE_SIZE range of the faulting address
- we only go as far as the vma allows us

Also simplify the do_numa_page() flow while at it and fix the
previous double faulting we incurred due to not properly fixing
up freshly migrated ptes.

Suggested-by: Mel Gorman <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Hugh Dickins <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
---
mm/memory.c | 101 +++++++++++++++++++++++++++++++++++++++---------------------
1 file changed, 66 insertions(+), 35 deletions(-)

Index: linux/mm/memory.c
===================================================================
--- linux.orig/mm/memory.c
+++ linux/mm/memory.c
@@ -3455,64 +3455,94 @@ static int do_nonlinear_fault(struct mm_
return __do_fault(mm, vma, address, pmd, pgoff, flags, orig_pte);
}

-static int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
+static int __do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
unsigned long address, pte_t *ptep, pmd_t *pmd,
- unsigned int flags, pte_t entry)
+ unsigned int flags, pte_t entry, spinlock_t *ptl)
{
- struct page *page = NULL;
- int node, page_nid = -1;
- int last_cpu = -1;
- spinlock_t *ptl;
-
- ptl = pte_lockptr(mm, pmd);
- spin_lock(ptl);
- if (unlikely(!pte_same(*ptep, entry)))
- goto out_unlock;
+ struct page *page;
+ int new_node;

page = vm_normal_page(vma, address, entry);
if (page) {
- get_page(page);
- page_nid = page_to_nid(page);
- last_cpu = page_last_cpu(page);
- node = mpol_misplaced(page, vma, address);
- if (node != -1 && node != page_nid)
+ int page_nid = page_to_nid(page);
+ int last_cpu = page_last_cpu(page);
+
+ new_node = mpol_misplaced(page, vma, address);
+ if (new_node != -1 && new_node != page_nid)
goto migrate;
+ task_numa_fault(page_nid, last_cpu, 1);
}

-out_pte_upgrade_unlock:
+out_pte_upgrade:
flush_cache_page(vma, address, pte_pfn(entry));
-
ptep_modify_prot_start(mm, address, ptep);
entry = pte_modify(entry, vma->vm_page_prot);
+ if (pte_dirty(entry))
+ entry = pte_mkwrite(entry);
ptep_modify_prot_commit(mm, address, ptep, entry);
-
/* No TLB flush needed because we upgraded the PTE */
-
update_mmu_cache(vma, address, ptep);
-
-out_unlock:
- pte_unmap_unlock(ptep, ptl);
-
- if (page) {
- task_numa_fault(page_nid, last_cpu, 1);
- put_page(page);
- }
out:
return 0;

migrate:
+ get_page(page);
pte_unmap_unlock(ptep, ptl);

- if (migrate_misplaced_page(page, node)) {
+ migrate_misplaced_page(page, new_node);
+
+ /* Re-check after migration: */
+
+ ptl = pte_lockptr(mm, pmd);
+ spin_lock(ptl);
+ entry = ACCESS_ONCE(*ptep);
+
+ if (!pte_numa(vma, entry))
goto out;
- }
- page = NULL;

- ptep = pte_offset_map_lock(mm, pmd, address, &ptl);
- if (!pte_same(*ptep, entry))
- goto out_unlock;
+ page = vm_normal_page(vma, address, entry);
+ goto out_pte_upgrade;
+}

- goto out_pte_upgrade_unlock;
+/*
+ * Add a simple loop to also fetch ptes within the same pmd:
+ */
+static int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
+ unsigned long addr0, pte_t *ptep0, pmd_t *pmd,
+ unsigned int flags, pte_t entry0)
+{
+ unsigned long addr0_pmd = addr0 & PMD_MASK;
+ unsigned long addr_start;
+ unsigned long addr;
+ spinlock_t *ptl;
+ int entries = 0;
+ pte_t *ptep;
+
+ addr_start = max(addr0_pmd, vma->vm_start);
+ ptep = pte_offset_map(pmd, addr_start);
+
+ ptl = pte_lockptr(mm, pmd);
+ spin_lock(ptl);
+
+ for (addr = addr_start; addr < vma->vm_end; addr += PAGE_SIZE, ptep++) {
+ pte_t entry;
+
+ entry = ACCESS_ONCE(*ptep);
+
+ if ((addr & PMD_MASK) != addr0_pmd)
+ break;
+ if (!pte_present(entry))
+ continue;
+ if (!pte_numa(vma, entry))
+ continue;
+
+ __do_numa_page(mm, vma, addr, ptep, pmd, flags, entry, ptl);
+ entries++;
+ }
+
+ pte_unmap_unlock(ptep, ptl);
+
+ return 0;
}

/*
@@ -3536,6 +3566,7 @@ int handle_pte_fault(struct mm_struct *m
spinlock_t *ptl;

entry = ACCESS_ONCE(*pte);
+
if (!pte_present(entry)) {
if (pte_none(entry)) {
if (vma->vm_ops) {

2012-11-20 16:09:28

by Ingo Molnar

[permalink] [raw]
Subject: [PATCH, v2] mm, numa: Turn 4K pte NUMA faults into effective hugepage ones


Ok, the patch withstood a bit more testing as well. Below is a
v2 version of it, with a couple of cleanups (no functional
changes).

Thanks,

Ingo

----------------->
Subject: mm, numa: Turn 4K pte NUMA faults into effective hugepage ones
From: Ingo Molnar <[email protected]>
Date: Tue Nov 20 15:48:26 CET 2012

Reduce the 4K page fault count by looking around and processing
nearby pages if possible.

To keep the logic and cache overhead simple and straightforward
we do a couple of simplifications:

- we only scan in the HPAGE_SIZE range of the faulting address
- we only go as far as the vma allows us

Also simplify the do_numa_page() flow while at it and fix the
previous double faulting we incurred due to not properly fixing
up freshly migrated ptes.

Suggested-by: Mel Gorman <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Hugh Dickins <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
---
mm/memory.c | 99 ++++++++++++++++++++++++++++++++++++++----------------------
1 file changed, 64 insertions(+), 35 deletions(-)

Index: linux/mm/memory.c
===================================================================
--- linux.orig/mm/memory.c
+++ linux/mm/memory.c
@@ -3455,64 +3455,93 @@ static int do_nonlinear_fault(struct mm_
return __do_fault(mm, vma, address, pmd, pgoff, flags, orig_pte);
}

-static int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
+static int __do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
unsigned long address, pte_t *ptep, pmd_t *pmd,
- unsigned int flags, pte_t entry)
+ unsigned int flags, pte_t entry, spinlock_t *ptl)
{
- struct page *page = NULL;
- int node, page_nid = -1;
- int last_cpu = -1;
- spinlock_t *ptl;
-
- ptl = pte_lockptr(mm, pmd);
- spin_lock(ptl);
- if (unlikely(!pte_same(*ptep, entry)))
- goto out_unlock;
+ struct page *page;
+ int new_node;

page = vm_normal_page(vma, address, entry);
if (page) {
- get_page(page);
- page_nid = page_to_nid(page);
- last_cpu = page_last_cpu(page);
- node = mpol_misplaced(page, vma, address);
- if (node != -1 && node != page_nid)
+ int page_nid = page_to_nid(page);
+ int last_cpu = page_last_cpu(page);
+
+ task_numa_fault(page_nid, last_cpu, 1);
+
+ new_node = mpol_misplaced(page, vma, address);
+ if (new_node != -1 && new_node != page_nid)
goto migrate;
}

-out_pte_upgrade_unlock:
+out_pte_upgrade:
flush_cache_page(vma, address, pte_pfn(entry));
-
ptep_modify_prot_start(mm, address, ptep);
entry = pte_modify(entry, vma->vm_page_prot);
+ if (pte_dirty(entry))
+ entry = pte_mkwrite(entry);
ptep_modify_prot_commit(mm, address, ptep, entry);
-
/* No TLB flush needed because we upgraded the PTE */
-
update_mmu_cache(vma, address, ptep);
-
-out_unlock:
- pte_unmap_unlock(ptep, ptl);
-
- if (page) {
- task_numa_fault(page_nid, last_cpu, 1);
- put_page(page);
- }
out:
return 0;

migrate:
+ get_page(page);
pte_unmap_unlock(ptep, ptl);

- if (migrate_misplaced_page(page, node)) {
+ migrate_misplaced_page(page, new_node); /* Drops the page reference */
+
+ /* Re-check after migration: */
+
+ ptl = pte_lockptr(mm, pmd);
+ spin_lock(ptl);
+ entry = ACCESS_ONCE(*ptep);
+
+ if (!pte_numa(vma, entry))
goto out;
- }
- page = NULL;

- ptep = pte_offset_map_lock(mm, pmd, address, &ptl);
- if (!pte_same(*ptep, entry))
- goto out_unlock;
+ goto out_pte_upgrade;
+}
+
+/*
+ * Add a simple loop to also fetch ptes within the same pmd:
+ */
+static int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
+ unsigned long addr0, pte_t *ptep0, pmd_t *pmd,
+ unsigned int flags, pte_t entry0)
+{
+ unsigned long addr0_pmd;
+ unsigned long addr_start;
+ unsigned long addr;
+ spinlock_t *ptl;
+ pte_t *ptep;
+
+ addr0_pmd = addr0 & PMD_MASK;
+ addr_start = max(addr0_pmd, vma->vm_start);

- goto out_pte_upgrade_unlock;
+ ptep = pte_offset_map(pmd, addr_start);
+ ptl = pte_lockptr(mm, pmd);
+ spin_lock(ptl);
+
+ for (addr = addr_start; addr < vma->vm_end; addr += PAGE_SIZE, ptep++) {
+ pte_t entry;
+
+ entry = ACCESS_ONCE(*ptep);
+
+ if ((addr & PMD_MASK) != addr0_pmd)
+ break;
+ if (!pte_present(entry))
+ continue;
+ if (!pte_numa(vma, entry))
+ continue;
+
+ __do_numa_page(mm, vma, addr, ptep, pmd, flags, entry, ptl);
+ }
+
+ pte_unmap_unlock(ptep, ptl);
+
+ return 0;
}

/*

2012-11-20 16:32:19

by Rik van Riel

[permalink] [raw]
Subject: Re: [PATCH, v2] mm, numa: Turn 4K pte NUMA faults into effective hugepage ones

On 11/20/2012 11:09 AM, Ingo Molnar wrote:

> Subject: mm, numa: Turn 4K pte NUMA faults into effective hugepage ones
> From: Ingo Molnar <[email protected]>
> Date: Tue Nov 20 15:48:26 CET 2012
>
> Reduce the 4K page fault count by looking around and processing
> nearby pages if possible.

This is essentially what autonuma does with PMD level NUMA
faults, so we know this idea works.

Performance measurements will show us how much of an impact
it makes, since I don't think we have ever done apples to apples
comparisons with just this thing toggled :)

The patch looks good to me, just a nit-pick on the comment
to do_numa_page().

Other than that:

Acked-by: Rik van Riel <[email protected]>

> Index: linux/mm/memory.c
> ===================================================================
> --- linux.orig/mm/memory.c
> +++ linux/mm/memory.c
> @@ -3455,64 +3455,93 @@ static int do_nonlinear_fault(struct mm_
> return __do_fault(mm, vma, address, pmd, pgoff, flags, orig_pte);
> }
>
> -static int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
> +static int __do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
> unsigned long address, pte_t *ptep, pmd_t *pmd,
> - unsigned int flags, pte_t entry)
> + unsigned int flags, pte_t entry, spinlock_t *ptl)
> {
> - struct page *page = NULL;
> - int node, page_nid = -1;
> - int last_cpu = -1;
> - spinlock_t *ptl;
> -
> - ptl = pte_lockptr(mm, pmd);
> - spin_lock(ptl);
> - if (unlikely(!pte_same(*ptep, entry)))
> - goto out_unlock;
> + struct page *page;
> + int new_node;
>
> page = vm_normal_page(vma, address, entry);
> if (page) {
> - get_page(page);
> - page_nid = page_to_nid(page);
> - last_cpu = page_last_cpu(page);
> - node = mpol_misplaced(page, vma, address);
> - if (node != -1 && node != page_nid)
> + int page_nid = page_to_nid(page);
> + int last_cpu = page_last_cpu(page);
> +
> + task_numa_fault(page_nid, last_cpu, 1);
> +
> + new_node = mpol_misplaced(page, vma, address);
> + if (new_node != -1 && new_node != page_nid)
> goto migrate;
> }
>
> -out_pte_upgrade_unlock:
> +out_pte_upgrade:
> flush_cache_page(vma, address, pte_pfn(entry));
> -
> ptep_modify_prot_start(mm, address, ptep);
> entry = pte_modify(entry, vma->vm_page_prot);
> + if (pte_dirty(entry))
> + entry = pte_mkwrite(entry);
> ptep_modify_prot_commit(mm, address, ptep, entry);
> -
> /* No TLB flush needed because we upgraded the PTE */
> -
> update_mmu_cache(vma, address, ptep);
> -
> -out_unlock:
> - pte_unmap_unlock(ptep, ptl);
> -
> - if (page) {
> - task_numa_fault(page_nid, last_cpu, 1);
> - put_page(page);
> - }
> out:
> return 0;
>
> migrate:
> + get_page(page);
> pte_unmap_unlock(ptep, ptl);
>
> - if (migrate_misplaced_page(page, node)) {
> + migrate_misplaced_page(page, new_node); /* Drops the page reference */
> +
> + /* Re-check after migration: */
> +
> + ptl = pte_lockptr(mm, pmd);
> + spin_lock(ptl);
> + entry = ACCESS_ONCE(*ptep);
> +
> + if (!pte_numa(vma, entry))
> goto out;
> - }
> - page = NULL;
>
> - ptep = pte_offset_map_lock(mm, pmd, address, &ptl);
> - if (!pte_same(*ptep, entry))
> - goto out_unlock;
> + goto out_pte_upgrade;
> +}
> +
> +/*
> + * Add a simple loop to also fetch ptes within the same pmd:
> + */

That's not a very useful comment. How about something like:

/*
* Also fault over nearby ptes from within the same pmd and vma,
* in order to minimize the overhead from page fault exceptions
* and TLB flushes.
*/

> +static int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
> + unsigned long addr0, pte_t *ptep0, pmd_t *pmd,
> + unsigned int flags, pte_t entry0)
> +{
> + unsigned long addr0_pmd;
> + unsigned long addr_start;
> + unsigned long addr;
> + spinlock_t *ptl;
> + pte_t *ptep;
> +
> + addr0_pmd = addr0 & PMD_MASK;
> + addr_start = max(addr0_pmd, vma->vm_start);
>
> - goto out_pte_upgrade_unlock;
> + ptep = pte_offset_map(pmd, addr_start);
> + ptl = pte_lockptr(mm, pmd);
> + spin_lock(ptl);
> +
> + for (addr = addr_start; addr < vma->vm_end; addr += PAGE_SIZE, ptep++) {
> + pte_t entry;
> +
> + entry = ACCESS_ONCE(*ptep);
> +
> + if ((addr & PMD_MASK) != addr0_pmd)
> + break;
> + if (!pte_present(entry))
> + continue;
> + if (!pte_numa(vma, entry))
> + continue;
> +
> + __do_numa_page(mm, vma, addr, ptep, pmd, flags, entry, ptl);
> + }
> +
> + pte_unmap_unlock(ptep, ptl);
> +
> + return 0;
> }
>
> /*
>


--
All rights reversed

2012-11-20 16:52:47

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH, v2] mm, numa: Turn 4K pte NUMA faults into effective hugepage ones


* Rik van Riel <[email protected]> wrote:

> Performance measurements will show us how much of an impact it
> makes, since I don't think we have ever done apples to apples
> comparisons with just this thing toggled :)

I've done a couple of quick measurements to characterise it: as
expected this patch simply does not matter much when THP is
enabled - and most testers I worked with had THP enabled.

Testing with THP off hurts most NUMA workloads dearly and tells
very little about the real NUMA story of these workloads. If you
turn off THP you are living with a constant ~25% regression -
just check the THP and no-THP numbers I posted:

[ 32-warehouse SPECjbb test benchmarks ]

mainline: 395 k/sec
mainline +THP: 524 k/sec

numa/core +patch: 512 k/sec [ +29.6% ]
numa/core +patch +THP: 654 k/sec [ +24.8% ]

The group of testers who had THP disabled was thus very small -
maybe only Mel? The testers I worked with all had THP
enabled.

I'd encourage everyone to report any unusual 'tweaks' made before
tests are reported - no matter how well intended the tweak. There
are only so many config variations we can test, and we obviously
check the most logical and most scalable system configurations
first.

Thanks,

Ingo

2012-11-20 17:57:01

by Ingo Molnar

[permalink] [raw]
Subject: numa/core regressions fixed - more testers wanted


* Ingo Molnar <[email protected]> wrote:

> ( The 4x JVM regression is still an open bug I think - I'll
> re-check and fix that one next, no need to re-report it,
> I'm on it. )

So I tested this on !THP too and the combined numbers are now:

|
[ SPECjbb multi-4x8 ] |
[ tx/sec ] v3.7 | numa/core-v16
[ higher is better ] ----- | -------------
|
+THP: 639k | 655k +2.5%
-THP: 510k | 517k +1.3%

So it's not a regression anymore, regardless of whether THP is
enabled or disabled.

The current updated table of performance results is:

-------------------------------------------------------------------------
[ seconds ] v3.7 AutoNUMA | numa/core-v16 [ vs. v3.7]
[ lower is better ] ----- -------- | ------------- -----------
|
numa01 340.3 192.3 | 139.4 +144.1%
numa01_THREAD_ALLOC 425.1 135.1 | 121.1 +251.0%
numa02 56.1 25.3 | 17.5 +220.5%
|
[ SPECjbb transactions/sec ] |
[ higher is better ] |
|
SPECjbb 1x32 +THP 524k 507k | 638k +21.7%
SPECjbb 1x32 !THP 395k | 512k +29.6%
|
-----------------------------------------------------------------------
|
[ SPECjbb multi-4x8 ] |
[ tx/sec ] v3.7 | numa/core-v16
[ higher is better ] ----- | -------------
|
+THP: 639k | 655k +2.5%
-THP: 510k | 517k +1.3%

So I think I've addressed all regressions reported so far - if
anyone can still see something odd, please let me know so I can
reproduce and fix it ASAP.

Next I'll work on making multi-JVM more of an improvement, and
I'll also address any incoming regression reports.

Those of you who would like to test all the latest patches are
welcome to pick up latest bits at tip:master:

git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git master

Thanks,

Ingo

2012-11-20 23:01:39

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [patch] x86/vsyscall: Add Kconfig option to use native vsyscalls, switch to it

On Tue, Nov 20, 2012 at 1:41 AM, Ingo Molnar <[email protected]> wrote:
>
> * Ingo Molnar <[email protected]> wrote:
>
>> > 0.10% [kernel] [k] __do_page_fault
>> > 0.08% [kernel] [k] handle_mm_fault
>> > 0.08% libjvm.so [.] os::javaTimeMillis()
>> > 0.08% [kernel] [k] emulate_vsyscall
>>
>> Oh, finally a clue: you seem to have vsyscall emulation
>> overhead!
>>
>> Vsyscall emulation is fundamentally page fault driven - which
>> might explain why you are seeing page fault overhead. It might
>> also interact with other sources of faults - such as
>> numa/core's working set probing ...
>>
>> Many JVMs try to be smart with the vsyscall. As a test, does
>> the vsyscall=native boot option change the results/behavior in
>> any way?
>
> As a blind shot into the dark, does the attached patch help?
>
> If that's the root cause then it should measurably help mainline
> SPECjbb performance as well. It could turn numa/core from a
> regression into a win on your system.
>
> Thanks,
>
> Ingo
>
> ----------------->
> Subject: x86/vsyscall: Add Kconfig option to use native vsyscalls, switch to it
> From: Ingo Molnar <[email protected]>
>
> Apparently there's still plenty of systems out there triggering
> the vsyscall emulation page faults - causing hard to track down
> performance regressions on page fault intense workloads...
>
> Some people seem to have run into that with threading-intense
> Java workloads.
>
> So until there's a better solution to this, add a Kconfig switch
> to make the vsyscall mode configurable and turn native vsyscall
> support back on by default.
>

I'm not sure the default should be changed. Presumably only a
smallish minority of users are affected, and all of their code still
*works* -- it's just a little bit slower.

>
> +config X86_VSYSCALL_COMPAT
> + bool "vsyscall compatibility"
> + default y
> + help

This is IMO misleading. Perhaps the option should be
X86_VSYSCALL_EMULATION. A description like "compatibility" makes
turning it on sound like a no-brainer.

Perhaps the vsyscall emulation code should be tweaked to warn if it's
getting called more than, say, 1k times per second. The kernel could
log something like "Detected large numbers of emulated vsyscalls.
Consider upgrading, setting vsyscall=native, or adjusting
CONFIG_X86_WHATEVER."


> + vsyscalls, as global executable pages, can be a security hole
> + escallation helper by exposing an easy shell code target with

escalation?

> + a predictable address.
> +
> + Many versions of glibc rely on the vsyscall page though, so it
> + cannot be eliminated unconditionally. If you disable this
> + option these systems will still work but might incur the overhead
> + of vsyscall emulation page faults.
> +
> + The vsyscall=none, vsyscall=emulate, vsyscall=native kernel boot
> + option can be used to override this mode as well.
> +
> + Keeping this option enabled leaves the vsyscall page enabled,
> + i.e. vsyscall=native. Disabling this option means vsyscall=emulate.
> +
> + If unsure, say Y.
> +

--Andy

2012-11-21 00:43:32

by David Rientjes

[permalink] [raw]
Subject: Re: [patch] x86/vsyscall: Add Kconfig option to use native vsyscalls, switch to it

On Tue, 20 Nov 2012, Ingo Molnar wrote:

> Subject: x86/vsyscall: Add Kconfig option to use native vsyscalls, switch to it
> From: Ingo Molnar <[email protected]>
>
> Apparently there's still plenty of systems out there triggering
> the vsyscall emulation page faults - causing hard to track down
> performance regressions on page fault intense workloads...
>
> Some people seem to have run into that with threading-intense
> Java workloads.
>
> So until there's a better solution to this, add a Kconfig switch
> to make the vsyscall mode configurable and turn native vsyscall
> support back on by default.
>
> Distributions whose libraries and applications use the vDSO and never
> access the vsyscall page can change the config option to off.
>
> Note, I don't think we want to expose the "none" option via a Kconfig
> switch - it breaks the ABI. So it's "native" versus "emulate", with
> "none" being available as a kernel boot option, for the super paranoid.
>
> For more background, see these upstream commits:
>
> 3ae36655b97a x86-64: Rework vsyscall emulation and add vsyscall= parameter
> 5cec93c216db x86-64: Emulate legacy vsyscalls
>
> Cc: Andy Lutomirski <[email protected]>
> Cc: Linus Torvalds <[email protected]>
> Signed-off-by: Ingo Molnar <[email protected]>

It's slightly better, but I'm not sure it's worth changing everybody's
default for the small speedup here.

THP enabled:

numa/core at ec05a2311c35: 136918.34 SPECjbb2005 bops
numa/core at 01aa90068b12: 128315.19 SPECjbb2005 bops (-6.3%)
numa/core at 01aa90068b12 + patch: 128589.34 SPECjbb2005 bops (-6.1%)

THP disabled:

numa/core at ec05a2311c35: 122246.90 SPECjbb2005 bops
numa/core at 01aa90068b12: 99389.49 SPECjbb2005 bops (-18.7%)
numa/core at 01aa90068b12 + patch: 100726.34 SPECjbb2005 bops (-17.6%)

perf top w/ THP enabled:

92.56% perf-13343.map [.] 0x00007fd513d2ba7a
1.24% libjvm.so [.] instanceKlass::oop_push_contents(PSPromoti
1.01% libjvm.so [.] PSPromotionManager::drain_stacks_depth(boo
0.77% libjvm.so [.] PSPromotionManager::copy_to_survivor_space
0.57% libjvm.so [.] PSPromotionManager::claim_or_forward_inter
0.26% libjvm.so [.] Copy::pd_disjoint_words(HeapWord*, HeapWord*, unsi
0.22% libjvm.so [.] CardTableExtension::scavenge_contents_parallel(Obj
0.20% [kernel] [k] read_tsc
0.15% [kernel] [k] _raw_spin_lock
0.13% [kernel] [k] getnstimeofday
0.12% [kernel] [k] page_fault
0.11% [kernel] [k] generic_smp_call_function_interrupt
0.10% [kernel] [k] ktime_get
0.10% [kernel] [k] rcu_check_callbacks
0.09% [kernel] [k] ktime_get_update_offsets
0.09% libjvm.so [.] objArrayKlass::oop_push_contents(PSPromotionManage
0.08% [kernel] [k] flush_tlb_func
0.07% [kernel] [k] system_call
0.07% libjvm.so [.] oopDesc::size_given_klass(Klass*)
0.06% [kernel] [k] handle_mm_fault
0.06% [kernel] [k] task_tick_fair
0.05% libc-2.3.6.so [.] __gettimeofday
0.05% libjvm.so [.] os::javaTimeMillis()
0.05% [kernel] [k] handle_pte_fault
0.05% [kernel] [k] system_call_after_swapgs
0.04% [kernel] [k] mpol_misplaced
0.04% [kernel] [k] __do_page_fault
0.04% perf [.] 0x0000000000035903
0.04% [kernel] [k] copy_user_generic_string
0.04% [kernel] [k] task_numa_fault
0.04% [acpi_cpufreq] [.] 0x000000005f51a009
0.04% [kernel] [k] find_vma
0.04% [kernel] [k] run_timer_softirq
0.03% [kernel] [k] sysret_check
0.03% [kernel] [k] smp_call_function_many
0.03% [kernel] [k] update_cfs_shares
0.02% [kernel] [k] do_wp_page
0.02% [kernel] [k] get_vma_policy
0.02% [kernel] [k] update_curr
0.02% [kernel] [k] down_read_trylock
0.02% [kernel] [k] apic_timer_interrupt

2012-11-21 01:22:09

by David Rientjes

[permalink] [raw]
Subject: Re: [PATCH] x86/mm: Don't flush the TLB on #WP pmd fixups

On Tue, 20 Nov 2012, Ingo Molnar wrote:

> Subject: x86/mm: Don't flush the TLB on #WP pmd fixups
> From: Ingo Molnar <[email protected]>
> Date: Tue Nov 20 14:46:34 CET 2012
>
> If we have a write protection #PF and fix up the pmd then the
> hugetlb code [the only user of pmdp_set_access_flags], in its
> do_huge_pmd_wp_page() page fault resolution function calls
> pmdp_set_access_flags() to mark the pmd permissive again,
> and flushes the TLB.
>
> This TLB flush is unnecessary: a flush on #PF is guaranteed on
> most (all?) x86 CPUs, and even in the worst-case we'll generate
> a spurious fault.
>
> So remove it.
>

For me, this patch did not produce the 2% speedup with THP enabled
that you reported:

numa/core at ec05a2311c35: 136918.34 SPECjbb2005 bops
numa/core at 01aa90068b12: 128315.19 SPECjbb2005 bops (-6.3%)
numa/core at 01aa90068b12 + patch: 128184.77 SPECjbb2005 bops (-6.4%)
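
(For reference, the flush discussed in the quoted changelog sits at the
tail of the x86 pmdp_set_access_flags() fixup path - a from-memory sketch
of the pre-patch code, not the exact source being patched:)

int pmdp_set_access_flags(struct vm_area_struct *vma,
			  unsigned long address, pmd_t *pmdp,
			  pmd_t entry, int dirty)
{
	int changed = !pmd_same(*pmdp, entry);

	VM_BUG_ON(address & ~HPAGE_PMD_MASK);

	if (changed && dirty) {
		*pmdp = entry;
		pmd_update_defer(vma->vm_mm, address, pmdp);
		/*
		 * This is the flush the patch removes: the #PF that got us
		 * here already invalidated the stale TLB entry, so at worst
		 * we take one spurious fault instead.
		 */
		flush_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
	}

	return changed;
}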

2012-11-21 01:54:24

by Andrew Theurer

[permalink] [raw]
Subject: Re: numa/core regressions fixed - more testers wanted

On Tue, 2012-11-20 at 18:56 +0100, Ingo Molnar wrote:
> * Ingo Molnar <[email protected]> wrote:
>
> > ( The 4x JVM regression is still an open bug I think - I'll
> > re-check and fix that one next, no need to re-report it,
> > I'm on it. )
>
> So I tested this on !THP too and the combined numbers are now:
>
> |
> [ SPECjbb multi-4x8 ] |
> [ tx/sec ] v3.7 | numa/core-v16
> [ higher is better ] ----- | -------------
> |
> +THP: 639k | 655k +2.5%
> -THP: 510k | 517k +1.3%
>
> So it's not a regression anymore, regardless of whether THP is
> enabled or disabled.
>
> The current updated table of performance results is:
>
> -------------------------------------------------------------------------
> [ seconds ] v3.7 AutoNUMA | numa/core-v16 [ vs. v3.7]
> [ lower is better ] ----- -------- | ------------- -----------
> |
> numa01 340.3 192.3 | 139.4 +144.1%
> numa01_THREAD_ALLOC 425.1 135.1 | 121.1 +251.0%
> numa02 56.1 25.3 | 17.5 +220.5%
> |
> [ SPECjbb transactions/sec ] |
> [ higher is better ] |
> |
> SPECjbb 1x32 +THP 524k 507k | 638k +21.7%
> SPECjbb 1x32 !THP 395k | 512k +29.6%
> |
> -----------------------------------------------------------------------
> |
> [ SPECjbb multi-4x8 ] |
> [ tx/sec ] v3.7 | numa/core-v16
> [ higher is better ] ----- | -------------
> |
> +THP: 639k | 655k +2.5%
> -THP: 510k | 517k +1.3%
>
> So I think I've addressed all regressions reported so far - if
> anyone can still see something odd, please let me know so I can
> reproduce and fix it ASAP.

I can confirm single JVM JBB is working well for me. I see a 30%
improvement over autoNUMA. What I can't make sense of is some perf
stats (taken at 80 warehouses on 4 x WST-EX, 512GB memory):

tips numa/core:

5,429,632,865 node-loads
3,806,419,082 node-load-misses(70.1%)
2,486,756,884 node-stores
2,042,557,277 node-store-misses(82.1%)
2,878,655,372 node-prefetches
2,201,441,900 node-prefetch-misses

autoNUMA:

4,538,975,144 node-loads
2,666,374,830 node-load-misses(58.7%)
2,148,950,354 node-stores
1,682,942,931 node-store-misses(78.3%)
2,191,139,475 node-prefetches
1,633,752,109 node-prefetch-misses

The percentage of misses is higher for numa/core. I would have expected
the performance increase to be due to lower "node-misses", but perhaps I am
misinterpreting the perf data.

One other thing I noticed was that both tests are not even using all CPU
(75-80%), so I suspect there's a JVM scalability issue with this
workload at this number of cpu threads (80). This is an IBM JVM, so
there may be some differences. I am curious if any of the others
testing JBB are getting 100% cpu utilization at their warehouse peak.

So, while the performance results are encouraging, I would like to
correlate them with some kind of perf data that confirms why we think it's
better.

>
> Next I'll work on making multi-JVM more of an improvement, and
> I'll also address any incoming regression reports.

I have issues with multiple KVM VMs running either JBB or
dbench-in-tmpfs, and I suspect whatever I am seeing is similar to
whatever the multi-JVM problem on bare metal is. What I typically see is
no real convergence onto a single node for resource usage for any of the
VMs. For example, when running 8 VMs, 10 vCPUs each, a VM may have the
following resource usage:

host cpu usage from cpuacct cgroup:
/cgroup/cpuacct/libvirt/qemu/at-vm01

node00 node01 node02 node03
199056918180|005% 752455339099|020% 1811704146176|049% 888803723722|024%

And VM memory placement in host(in pages):
node00 node01 node02 node03
107566|023% 115245|025% 117807|025% 119414|025%

Conversely, autoNUMA usually has 98+% for cpu and memory in one of the
host nodes for each of these VMs. AutoNUMA is about 30% better in these
tests.

That is data for the entire run time, and "not converged" could possibly
mean, "converged but moved around", but I doubt that's what happening.

Here's perf data for the dbench VMs:

numa/core:

468,634,508 node-loads
210,598,643 node-load-misses(44.9%)
172,735,053 node-stores
107,535,553 node-store-misses(51.1%)
208,064,103 node-prefetches
160,858,933 node-prefetch-misses

autoNUMA:

666,498,425 node-loads
222,643,141 node-load-misses(33.4%)
219,003,566 node-stores
99,243,370 node-store-misses(45.3%)
315,439,315 node-prefetches
254,888,403 node-prefetch-misses

These seem to make a little more sense to me, but the percentages for
autoNUMA still seem a little high (though at least lower than numa/core's).
I need to take a manually pinned measurement to compare.

> Those of you who would like to test all the latest patches are
> welcome to pick up latest bits at tip:master:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git master

I've been running on numa/core, but I'll switch to master and try these
again.

Thanks,

-Andrew Theurer

2012-11-21 02:41:18

by David Rientjes

[permalink] [raw]
Subject: Re: [PATCH, v2] mm, numa: Turn 4K pte NUMA faults into effective hugepage ones

On Tue, 20 Nov 2012, Ingo Molnar wrote:

> Reduce the 4K page fault count by looking around and processing
> nearby pages if possible.
>
> To keep the logic and cache overhead simple and straightforward
> we do a couple of simplifications:
>
> - we only scan in the HPAGE_SIZE range of the faulting address
> - we only go as far as the vma allows us
>
> Also simplify the do_numa_page() flow while at it and fix the
> previous double faulting we incurred due to not properly fixing
> up freshly migrated ptes.
>
> Suggested-by: Mel Gorman <[email protected]>
> Cc: Linus Torvalds <[email protected]>
> Cc: Andrew Morton <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Andrea Arcangeli <[email protected]>
> Cc: Rik van Riel <[email protected]>
> Cc: Hugh Dickins <[email protected]>
> Signed-off-by: Ingo Molnar <[email protected]>

Acked-by: David Rientjes <[email protected]>

Ok, this is significantly better, it almost cut the regression in half on
my system. With THP enabled:

numa/core at ec05a2311c35: 136918.34 SPECjbb2005 bops
numa/core at 01aa90068b12: 128315.19 SPECjbb2005 bops (-6.3%)
numa/core at 01aa90068b12 + patch: 132523.06 SPECjbb2005 bops (-3.2%)

Here's the newest perf top, which is radically different from before (nowhere
near the number of newly-added numa/core functions among the biggest
consumers) but still shows significant overhead from page faults.

92.18% perf-6697.map [.] 0x00007fe2c5afd079
1.20% libjvm.so [.] instanceKlass::oop_push_contents(PSPromotionManag
1.05% libjvm.so [.] PSPromotionManager::drain_stacks_depth(bool)
0.78% libjvm.so [.] PSPromotionManager::copy_to_survivor_space(oopDes
0.59% libjvm.so [.] PSPromotionManager::claim_or_forward_internal_dep
0.49% [kernel] [k] page_fault
0.27% libjvm.so [.] Copy::pd_disjoint_words(HeapWord*, HeapWord*, unsigned lo
0.27% libc-2.3.6.so [.] __gettimeofday
0.19% libjvm.so [.] CardTableExtension::scavenge_contents_parallel(ObjectStar
0.16% [kernel] [k] getnstimeofday
0.14% [kernel] [k] _raw_spin_lock
0.13% [kernel] [k] generic_smp_call_function_interrupt
0.11% [kernel] [k] ktime_get
0.11% [kernel] [k] rcu_check_callbacks
0.10% [kernel] [k] read_tsc
0.09% libjvm.so [.] os::javaTimeMillis()
0.09% [kernel] [k] clear_page_c
0.08% [kernel] [k] flush_tlb_func
0.08% [kernel] [k] ktime_get_update_offsets
0.07% [kernel] [k] task_tick_fair
0.06% [kernel] [k] emulate_vsyscall
0.06% libjvm.so [.] oopDesc::size_given_klass(Klass*)
0.06% [kernel] [k] __do_page_fault
0.04% [kernel] [k] __bad_area_nosemaphore
0.04% perf [.] 0x000000000003310b
0.04% libjvm.so [.] objArrayKlass::oop_push_contents(PSPromotionManager*, oop
0.04% [kernel] [k] run_timer_softirq
0.04% [kernel] [k] copy_user_generic_string
0.03% [kernel] [k] task_numa_fault
0.03% [kernel] [k] smp_call_function_many
0.03% [kernel] [k] retint_swapgs
0.03% [kernel] [k] update_cfs_shares
0.03% [kernel] [k] error_sti
0.03% [kernel] [k] _raw_spin_lock_irq
0.03% [kernel] [k] update_curr
0.02% [kernel] [k] write_ok_or_segv
0.02% [kernel] [k] call_function_interrupt
0.02% [kernel] [k] __do_softirq
0.02% [kernel] [k] acct_update_integrals
0.02% [kernel] [k] x86_pmu_disable_all
0.02% [kernel] [k] apic_timer_interrupt
0.02% [kernel] [k] tick_sched_timer

2012-11-21 03:23:31

by Rik van Riel

[permalink] [raw]
Subject: Re: numa/core regressions fixed - more testers wanted

On 11/20/2012 08:54 PM, Andrew Theurer wrote:

> I can confirm single JVM JBB is working well for me. I see a 30%
> improvement over autoNUMA. What I can't make sense of is some perf
> stats (taken at 80 warehouses on 4 x WST-EX, 512GB memory):

AutoNUMA does not have native THP migration; that may explain some
of the difference.

> tips numa/core:
>
> 5,429,632,865 node-loads
> 3,806,419,082 node-load-misses(70.1%)
> 2,486,756,884 node-stores
> 2,042,557,277 node-store-misses(82.1%)
> 2,878,655,372 node-prefetches
> 2,201,441,900 node-prefetch-misses
>
> autoNUMA:
>
> 4,538,975,144 node-loads
> 2,666,374,830 node-load-misses(58.7%)
> 2,148,950,354 node-stores
> 1,682,942,931 node-store-misses(78.3%)
> 2,191,139,475 node-prefetches
> 1,633,752,109 node-prefetch-misses
>
> The percentage of misses is higher for numa/core. I would have expected
> the performance increase to be due to lower "node-misses", but perhaps I am
> misinterpreting the perf data.

Lack of native THP migration may be enough to explain the
performance difference, despite autonuma having better node
locality.

>> Next I'll work on making multi-JVM more of an improvement, and
>> I'll also address any incoming regression reports.
>
> I have issues with multiple KVM VMs running either JBB or
> dbench-in-tmpfs, and I suspect whatever I am seeing is similar to
> whatever the multi-JVM problem on bare metal is. What I typically see is
> no real convergence onto a single node for resource usage for any of the
> VMs. For example, when running 8 VMs, 10 vCPUs each, a VM may have the
> following resource usage:

This is an issue. I have tried understanding the new local/shared
and shared task grouping code, but have not wrapped my mind around
that code yet.

I will have to look at that code a few more times, and ask more
questions of Ingo and Peter (and maybe ask some of the same questions
again - I see that some of my comments were addressed in the next
version of the patch, but the email never got a reply).

--
All rights reversed

2012-11-21 03:33:21

by David Rientjes

[permalink] [raw]
Subject: Re: numa/core regressions fixed - more testers wanted

On Tue, 20 Nov 2012, Ingo Molnar wrote:

> The current updated table of performance results is:
>
> -------------------------------------------------------------------------
> [ seconds ] v3.7 AutoNUMA | numa/core-v16 [ vs. v3.7]
> [ lower is better ] ----- -------- | ------------- -----------
> |
> numa01 340.3 192.3 | 139.4 +144.1%
> numa01_THREAD_ALLOC 425.1 135.1 | 121.1 +251.0%
> numa02 56.1 25.3 | 17.5 +220.5%
> |
> [ SPECjbb transactions/sec ] |
> [ higher is better ] |
> |
> SPECjbb 1x32 +THP 524k 507k | 638k +21.7%
> SPECjbb 1x32 !THP 395k | 512k +29.6%
> |
> -----------------------------------------------------------------------
> |
> [ SPECjbb multi-4x8 ] |
> [ tx/sec ] v3.7 | numa/core-v16
> [ higher is better ] ----- | -------------
> |
> +THP: 639k | 655k +2.5%
> -THP: 510k | 517k +1.3%
>
> So I think I've addressed all regressions reported so far - if
> anyone can still see something odd, please let me know so I can
> reproduce and fix it ASAP.
>

I started profiling on a new machine that is an exact duplicate of the
16-way, 4 node, 32GB machine I was profiling with earlier to rule out any
machine-specific problems. I pulled master and ran new comparisons with
THP enabled at c418de93e398 ("Merge branch 'x86/mm'"):

CONFIG_NUMA_BALANCING disabled 136521.55 SPECjbb2005 bops
CONFIG_NUMA_BALANCING enabled 132476.07 SPECjbb2005 bops (-3.0%)

Aside: neither 4739578c3ab3 ("x86/mm: Don't flush the TLB on #WP pmd
fixups") nor 01e9c2441eee ("x86/vsyscall: Add Kconfig option to use native
vsyscalls and switch to it") significantly improved upon the throughput on
this system.

Over the past 24 hours, however, throughput has significantly improved
from a 6.3% regression to a 3.0% regression because of 246c0b9a1caf ("mm,
numa: Turn 4K pte NUMA faults into effective hugepage ones")!

One request: I noticed that the entire patchset doesn't add any fields to
/proc/vmstat through count_vm_event() the way THP did, which I found very
useful when profiling that set while it was being reviewed. Would it be
possible to add some vm events to the balancing code so we can capture
data on how the NUMA balancing is behaving? Their usefulness would extend
beyond just the review period.
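
For illustration, a minimal sketch of the kind of thing meant, using the
existing count_vm_event() machinery (event names invented here, not actual
numa/core code):

/*
 * 1) Add the events to enum vm_event_item in
 *    include/linux/vm_event_item.h and the matching strings to
 *    vmstat_text[] in mm/vmstat.c:
 */
	NUMA_HINT_FAULTS,
	NUMA_HINT_FAULTS_LOCAL,
	NUMA_PAGE_MIGRATE,

	"numa_hint_faults",
	"numa_hint_faults_local",
	"numa_pages_migrated",

/*
 * 2) Bump them from the hinting fault / migration paths, e.g. in
 *    do_numa_page(), using page_nid as computed there:
 */
	count_vm_event(NUMA_HINT_FAULTS);
	if (page_nid == numa_node_id())
		count_vm_event(NUMA_HINT_FAULTS_LOCAL);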

2012-11-21 04:10:30

by Hugh Dickins

[permalink] [raw]
Subject: Re: numa/core regressions fixed - more testers wanted

On Tue, 20 Nov 2012, Rik van Riel wrote:
> On 11/20/2012 08:54 PM, Andrew Theurer wrote:
>
> > I can confirm single JVM JBB is working well for me. I see a 30%
> > improvement over autoNUMA. What I can't make sense of is some perf
> > stats (taken at 80 warehouses on 4 x WST-EX, 512GB memory):
>
> AutoNUMA does not have native THP migration; that may explain some
> of the difference.

When I made some fixes to the sched/numa native THP migration,
I did also try porting that (with Hannes's memcg fixes) to AutoNUMA.

Here's the patch below: it appeared to be working just fine, but
you might find that it doesn't quite apply to whatever tree you're
using. I started from 3.6 autonuma28fast in aa.git, but had folded
in some of the equally applicable TLB flush optimizations too.

There's also a little "Hack, remove after THP native migration"
retuning in mm/huge_memory.c which should probably be removed too.

No signoffs, but it's from work by Peter and Ingo and Hannes and Hugh.
---

include/linux/huge_mm.h | 4 -
mm/autonuma.c | 59 +++++-----------
mm/huge_memory.c | 140 +++++++++++++++++++++++++++++++++-----
mm/internal.h | 5 -
mm/memcontrol.c | 7 +
mm/memory.c | 4 -
mm/migrate.c | 2
7 files changed, 158 insertions(+), 63 deletions(-)

--- 306aa/include/linux/huge_mm.h 2012-11-04 10:21:30.965548793 -0800
+++ 306AA/include/linux/huge_mm.h 2012-11-04 17:14:32.232651475 -0800
@@ -11,8 +11,8 @@ extern int copy_huge_pmd(struct mm_struc
extern int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
unsigned long address, pmd_t *pmd,
pmd_t orig_pmd);
-extern int huge_pmd_numa_fixup(struct mm_struct *mm, unsigned long addr,
- pmd_t pmd, pmd_t *pmdp);
+extern int huge_pmd_numa_fixup(struct mm_struct *mm, struct vm_area_struct *vma,
+ unsigned long address, pmd_t *pmd, pmd_t orig_pmd);
extern pgtable_t get_pmd_huge_pte(struct mm_struct *mm);
extern struct page *follow_trans_huge_pmd(struct mm_struct *mm,
unsigned long addr,
--- 306aa/mm/autonuma.c 2012-11-04 10:22:13.993549816 -0800
+++ 306AA/mm/autonuma.c 2012-11-04 21:24:43.045622669 -0800
@@ -77,8 +77,7 @@ void autonuma_migrate_split_huge_page(st

static int sync_isolate_migratepages(struct list_head *migratepages,
struct page *page,
- struct pglist_data *pgdat,
- bool *migrated)
+ struct pglist_data *pgdat)
{
struct zone *zone;
struct lruvec *lruvec;
@@ -129,19 +128,9 @@ static int sync_isolate_migratepages(str
* __isolate_lru_page successd, the page could be freed and
* reallocated out from under us. Thus our previous checks on
* the page, and the split_huge_page, would be worthless.
- *
- * We really only need to do this if "ret > 0" but it doesn't
- * hurt to do it unconditionally as nobody can reference
- * "page" anymore after this and so we can avoid an "if (ret >
- * 0)" branch here.
*/
- put_page(page);
- /*
- * Tell the caller we already released its pin, to avoid a
- * double free.
- */
- *migrated = true;
-
+ if (ret > 0)
+ put_page(page);
out:
return ret;
}
@@ -215,13 +204,12 @@ static inline void autonuma_migrate_unlo
spin_unlock(&NODE_DATA(nid)->autonuma_migrate_lock);
}

-static bool autonuma_migrate_page(struct page *page, int dst_nid,
- int page_nid, bool *migrated)
+static bool autonuma_migrate_page(struct page *page, int dst_nid, int page_nid,
+ int nr_pages)
{
int isolated = 0;
LIST_HEAD(migratepages);
struct pglist_data *pgdat = NODE_DATA(dst_nid);
- int nr_pages = hpage_nr_pages(page);
unsigned long autonuma_migrate_nr_pages = 0;

autonuma_migrate_lock(dst_nid);
@@ -242,11 +230,12 @@ static bool autonuma_migrate_page(struct
autonuma_printk("migrated %lu pages to node %d\n",
autonuma_migrate_nr_pages, dst_nid);

- if (autonuma_balance_pgdat(pgdat, nr_pages))
+ if (autonuma_balance_pgdat(pgdat, nr_pages)) {
+ if (nr_pages > 1)
+ return true;
isolated = sync_isolate_migratepages(&migratepages,
- page, pgdat,
- migrated);
-
+ page, pgdat);
+ }
if (isolated) {
int err;
trace_numa_migratepages_begin(current, &migratepages,
@@ -381,15 +370,14 @@ static bool should_migrate_page(struct t
static int numa_hinting_fault_memory_follow_cpu(struct task_struct *p,
struct page *page,
int this_nid, int page_nid,
- bool *migrated)
+ int numpages)
{
if (!should_migrate_page(p, page, this_nid, page_nid))
goto out;
if (!PageLRU(page))
goto out;
if (this_nid != page_nid) {
- if (autonuma_migrate_page(page, this_nid, page_nid,
- migrated))
+ if (autonuma_migrate_page(page, this_nid, page_nid, numpages))
return this_nid;
}
out:
@@ -418,14 +406,17 @@ bool numa_hinting_fault(struct page *pag
VM_BUG_ON(this_nid < 0);
VM_BUG_ON(this_nid >= MAX_NUMNODES);
access_nid = numa_hinting_fault_memory_follow_cpu(p, page,
- this_nid,
- page_nid,
- &migrated);
- /* "page" has been already freed if "migrated" is true */
+ this_nid, page_nid, numpages);
numa_hinting_fault_cpu_follow_memory(p, access_nid, numpages);
+ migrated = access_nid != page_nid;
}

- return migrated;
+ /* small page was already freed if migrated */
+ if (!migrated) {
+ put_page(page);
+ return false;
+ }
+ return true;
}

/* NUMA hinting page fault entry point for ptes */
@@ -434,7 +425,6 @@ int pte_numa_fixup(struct mm_struct *mm,
{
struct page *page;
spinlock_t *ptl;
- bool migrated;

/*
* The "pte" at this point cannot be used safely without
@@ -455,9 +445,7 @@ int pte_numa_fixup(struct mm_struct *mm,
get_page(page);
pte_unmap_unlock(ptep, ptl);

- migrated = numa_hinting_fault(page, 1);
- if (!migrated)
- put_page(page);
+ numa_hinting_fault(page, 1);
out:
return 0;

@@ -476,7 +464,6 @@ int pmd_numa_fixup(struct mm_struct *mm,
spinlock_t *ptl;
bool numa = false;
struct vm_area_struct *vma;
- bool migrated;

spin_lock(&mm->page_table_lock);
pmd = *pmdp;
@@ -521,9 +508,7 @@ int pmd_numa_fixup(struct mm_struct *mm,
get_page(page);
pte_unmap_unlock(pte, ptl);

- migrated = numa_hinting_fault(page, 1);
- if (!migrated)
- put_page(page);
+ numa_hinting_fault(page, 1);

pte = pte_offset_map_lock(mm, pmdp, addr, &ptl);
}
--- 306aa/mm/huge_memory.c 2012-11-04 15:32:28.512793096 -0800
+++ 306AA/mm/huge_memory.c 2012-11-04 22:21:14.112450390 -0800
@@ -17,6 +17,7 @@
#include <linux/khugepaged.h>
#include <linux/freezer.h>
#include <linux/mman.h>
+#include <linux/migrate.h>
#include <linux/autonuma.h>
#include <asm/tlb.h>
#include <asm/pgalloc.h>
@@ -1037,35 +1038,140 @@ out:

#ifdef CONFIG_AUTONUMA
/* NUMA hinting page fault entry point for trans huge pmds */
-int huge_pmd_numa_fixup(struct mm_struct *mm, unsigned long addr,
- pmd_t pmd, pmd_t *pmdp)
+int huge_pmd_numa_fixup(struct mm_struct *mm, struct vm_area_struct *vma,
+ unsigned long address, pmd_t *pmd, pmd_t entry)
{
+ unsigned long haddr = address & HPAGE_PMD_MASK;
+ struct mem_cgroup *memcg = NULL;
+ struct page *new_page;
struct page *page;
- bool migrated;

spin_lock(&mm->page_table_lock);
- if (unlikely(!pmd_same(pmd, *pmdp)))
- goto out_unlock;
+ if (unlikely(!pmd_same(entry, *pmd)))
+ goto unlock;

- page = pmd_page(pmd);
- pmd = pmd_mknonnuma(pmd);
- set_pmd_at(mm, addr & HPAGE_PMD_MASK, pmdp, pmd);
- VM_BUG_ON(pmd_numa(*pmdp));
- if (unlikely(page_mapcount(page) != 1))
- goto out_unlock;
+ page = pmd_page(entry);
+ /*
+ * Do not migrate this page if it is mapped anywhere else.
+ * Do not migrate this page if its count has been raised.
+ * Our caller's down_read of mmap_sem excludes fork raising
+ * mapcount; but recheck page count below whenever we take
+ * page_table_lock - although it's unclear what pin we are
+ * protecting against, since get_user_pages() or GUP fast
+ * would have to fault it present before they could proceed.
+ */
+ if (unlikely(page_count(page) != 1))
+ goto fixup;
get_page(page);
spin_unlock(&mm->page_table_lock);

- migrated = numa_hinting_fault(page, HPAGE_PMD_NR);
- if (!migrated)
- put_page(page);
+ if (numa_hinting_fault(page, HPAGE_PMD_NR))
+ goto migrate;

-out:
+ spin_lock(&mm->page_table_lock);
+ if (unlikely(!pmd_same(entry, *pmd)))
+ goto unlock;
+fixup:
+ entry = pmd_mknonnuma(entry);
+ set_pmd_at(mm, haddr, pmd, entry);
+ VM_BUG_ON(pmd_numa(*pmd));
+ update_mmu_cache(vma, address, entry);
+
+unlock:
+ spin_unlock(&mm->page_table_lock);
return 0;

-out_unlock:
+migrate:
+ lock_page(page);
+ spin_lock(&mm->page_table_lock);
+ if (unlikely(!pmd_same(*pmd, entry) || page_count(page) != 2)) {
+ spin_unlock(&mm->page_table_lock);
+ unlock_page(page);
+ put_page(page);
+ return 0;
+ }
spin_unlock(&mm->page_table_lock);
- goto out;
+
+ new_page = alloc_pages_node(numa_node_id(),
+ (GFP_TRANSHUGE | GFP_THISNODE) & ~__GFP_WAIT, HPAGE_PMD_ORDER);
+ if (!new_page)
+ goto alloc_fail;
+
+ if (isolate_lru_page(page)) { /* Does an implicit get_page() */
+ put_page(new_page);
+ goto alloc_fail;
+ }
+
+ __set_page_locked(new_page);
+ SetPageSwapBacked(new_page);
+
+ /* anon mapping, we can simply copy page->mapping to the new page: */
+ new_page->mapping = page->mapping;
+ new_page->index = page->index;
+
+ migrate_page_copy(new_page, page);
+
+ WARN_ON(PageLRU(new_page));
+
+ spin_lock(&mm->page_table_lock);
+ if (unlikely(!pmd_same(*pmd, entry) || page_count(page) != 3)) {
+ spin_unlock(&mm->page_table_lock);
+
+ /* Reverse changes made by migrate_page_copy() */
+ if (TestClearPageActive(new_page))
+ SetPageActive(page);
+ if (TestClearPageUnevictable(new_page))
+ SetPageUnevictable(page);
+ mlock_migrate_page(page, new_page);
+
+ unlock_page(new_page);
+ put_page(new_page); /* Free it */
+
+ unlock_page(page);
+ putback_lru_page(page);
+ put_page(page); /* Drop the local reference */
+ return 0;
+ }
+ /*
+ * Traditional migration needs to prepare the memcg charge
+ * transaction early to prevent the old page from being
+ * uncharged when installing migration entries. Here we can
+ * save the potential rollback and start the charge transfer
+ * only when migration is already known to end successfully.
+ */
+ mem_cgroup_prepare_migration(page, new_page, &memcg);
+
+ entry = mk_pmd(new_page, vma->vm_page_prot);
+ entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
+ entry = pmd_mkhuge(entry);
+
+ page_add_new_anon_rmap(new_page, vma, haddr);
+
+ set_pmd_at(mm, haddr, pmd, entry);
+ update_mmu_cache(vma, address, entry);
+ page_remove_rmap(page);
+ /*
+ * Finish the charge transaction under the page table lock to
+ * prevent split_huge_page() from dividing up the charge
+ * before it's fully transferred to the new page.
+ */
+ mem_cgroup_end_migration(memcg, page, new_page, true);
+ spin_unlock(&mm->page_table_lock);
+
+ unlock_page(new_page);
+ unlock_page(page);
+ put_page(page); /* Drop the rmap reference */
+ put_page(page); /* Drop the LRU isolation reference */
+ put_page(page); /* Drop the local reference */
+ return 0;
+
+alloc_fail:
+ unlock_page(page);
+ put_page(page);
+ spin_lock(&mm->page_table_lock);
+ if (unlikely(!pmd_same(*pmd, entry)))
+ goto unlock;
+ goto fixup;
}
#endif

--- 306aa/mm/internal.h 2012-09-30 16:47:46.000000000 -0700
+++ 306AA/mm/internal.h 2012-11-04 16:16:21.760439246 -0800
@@ -216,11 +216,12 @@ static inline void mlock_migrate_page(st
{
if (TestClearPageMlocked(page)) {
unsigned long flags;
+ int nr_pages = hpage_nr_pages(page);

local_irq_save(flags);
- __dec_zone_page_state(page, NR_MLOCK);
+ __mod_zone_page_state(page_zone(page), NR_MLOCK, -nr_pages);
SetPageMlocked(newpage);
- __inc_zone_page_state(newpage, NR_MLOCK);
+ __mod_zone_page_state(page_zone(newpage), NR_MLOCK, nr_pages);
local_irq_restore(flags);
}
}
--- 306aa/mm/memcontrol.c 2012-09-30 16:47:46.000000000 -0700
+++ 306AA/mm/memcontrol.c 2012-11-04 16:15:55.264437693 -0800
@@ -3261,15 +3261,18 @@ void mem_cgroup_prepare_migration(struct
struct mem_cgroup **memcgp)
{
struct mem_cgroup *memcg = NULL;
+ unsigned int nr_pages = 1;
struct page_cgroup *pc;
enum charge_type ctype;

*memcgp = NULL;

- VM_BUG_ON(PageTransHuge(page));
if (mem_cgroup_disabled())
return;

+ if (PageTransHuge(page))
+ nr_pages <<= compound_order(page);
+
pc = lookup_page_cgroup(page);
lock_page_cgroup(pc);
if (PageCgroupUsed(pc)) {
@@ -3331,7 +3334,7 @@ void mem_cgroup_prepare_migration(struct
* charged to the res_counter since we plan on replacing the
* old one and only one page is going to be left afterwards.
*/
- __mem_cgroup_commit_charge(memcg, newpage, 1, ctype, false);
+ __mem_cgroup_commit_charge(memcg, newpage, nr_pages, ctype, false);
}

/* remove redundant charge if migration failed*/
--- 306aa/mm/memory.c 2012-11-04 10:21:34.181548869 -0800
+++ 306AA/mm/memory.c 2012-11-04 17:06:11.400620099 -0800
@@ -3546,8 +3546,8 @@ retry:
barrier();
if (pmd_trans_huge(orig_pmd)) {
if (pmd_numa(*pmd))
- return huge_pmd_numa_fixup(mm, address,
- orig_pmd, pmd);
+ return huge_pmd_numa_fixup(mm, vma, address,
+ pmd, orig_pmd);
if (flags & FAULT_FLAG_WRITE &&
!pmd_write(orig_pmd) &&
!pmd_trans_splitting(orig_pmd)) {
--- 306aa/mm/migrate.c 2012-09-30 16:47:46.000000000 -0700
+++ 306AA/mm/migrate.c 2012-11-04 17:10:13.084633509 -0800
@@ -407,7 +407,7 @@ int migrate_huge_page_move_mapping(struc
*/
void migrate_page_copy(struct page *newpage, struct page *page)
{
- if (PageHuge(page))
+ if (PageHuge(page) || PageTransHuge(page))
copy_huge_page(newpage, page);
else
copy_highpage(newpage, page);

2012-11-21 08:12:49

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH, v2] mm, numa: Turn 4K pte NUMA faults into effective hugepage ones


* Rik van Riel <[email protected]> wrote:

> >+}
> >+
> >+/*
> >+ * Add a simple loop to also fetch ptes within the same pmd:
> >+ */
>
> That's not a very useful comment. How about something like:
>
> /*
> * Also fault over nearby ptes from within the same pmd and vma,
> * in order to minimize the overhead from page fault exceptions
> * and TLB flushes.
> */

There are no TLB flushes here. But I'm fine with the other part, so
I've updated the comment to say:

/*
* Also fault over nearby ptes from within the same pmd and vma,
* in order to minimize the overhead from page fault exceptions:
*/

Thanks,

Ingo

2012-11-21 08:39:41

by Alex Shi

[permalink] [raw]
Subject: Re: numa/core regressions fixed - more testers wanted

>
> Those of you who would like to test all the latest patches are
> welcome to pick up latest bits at tip:master:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git master
>

I am wondering if this is a problem; it still exists at HEAD c418de93e39891:
http://article.gmane.org/gmane.linux.kernel.mm/90131/match=compiled+with+name+pl+and+start+it+on+my

For example, when just starting 4 pl tasks, often 3 were running on node 0
and 1 was running on node 1.
The old balancer would spread the tasks evenly across different nodes and
different cores.

Regards
Alex

2012-11-21 09:34:39

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH, v2] mm, numa: Turn 4K pte NUMA faults into effective hugepage ones


* David Rientjes <[email protected]> wrote:

> Ok, this is significantly better, it almost cut the regression
> in half on my system. [...]

The other half still seems to be related to the emulation faults
that I fixed in the other patch:

> 0.49% [kernel] [k] page_fault
> 0.06% [kernel] [k] emulate_vsyscall

Plus TLB flush costs:

> 0.13% [kernel] [k] generic_smp_call_function_interrupt
> 0.08% [kernel] [k] flush_tlb_func

for which you should try the third patch I sent.

So please try all my fixes - the easiest way to do that would be
to try the latest tip:master that has all related fixes
integrated and send me a new perf top output - most page fault
and TLB flush overhead should be gone from the profile.

Thanks,

Ingo

2012-11-21 09:38:10

by Ingo Molnar

[permalink] [raw]
Subject: Re: numa/core regressions fixed - more testers wanted


* David Rientjes <[email protected]> wrote:

> I started profiling on a new machine that is an exact
> duplicate of the 16-way, 4 node, 32GB machine I was profiling
> with earlier to rule out any machine-specific problems. I
> pulled master and ran new comparisons with THP enabled at
> c418de93e398 ("Merge branch 'x86/mm'"):
>
> CONFIG_NUMA_BALANCING disabled 136521.55 SPECjbb2005 bops
> CONFIG_NUMA_BALANCING enabled 132476.07 SPECjbb2005 bops (-3.0%)
>
> Aside: neither 4739578c3ab3 ("x86/mm: Don't flush the TLB on
> #WP pmd fixups") nor 01e9c2441eee ("x86/vsyscall: Add Kconfig
> option to use native vsyscalls and switch to it")
> significantly improved upon the throughput on this system.

Could you please send an updated profile done with latest -tip?
The last profile I got from you still had the vsyscall emulation
page faults in it.

Thanks,

Ingo

2012-11-21 10:39:12

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 00/27] Latest numa/core release, v16

On Mon, Nov 19, 2012 at 08:13:39PM +0100, Ingo Molnar wrote:
> > I was not able to run a full set of tests today as I was
> > distracted so all I have is a multi JVM comparison. I'll keep
> > it shorter than average
> >
> > 3.7.0 3.7.0
> > rc5-stats-v4r2 rc5-schednuma-v16r1
>
> Thanks for the testing - I'll wait for your full results to see
> whether the other regressions you reported before are
> fixed/improved.
>

Here are the latest figures I have available. They include figures from
"Automatic NUMA Balancing V4" which I just released. A very short summary
is as follows:

Even without a proper placement policy, balancenuma does fairly well in a
number of tests, shows improvements in places and for the most part does
not regress against mainline. I expect a proper placement policy on top
would only make it better. Its system CPU usage is still of concern, but
with proper feedback from a placement policy it could reduce the PTE scan
rate and keep it down.

schednuma has improved a lot, particularly in terms of system CPU usage.
However, even with THP enabled it is showing regressions for specjbb and a
noticeable regression when just building kernels. There have been follow-on
patches since testing started and maybe they'll make a difference.



Now, the long report... the very long report. The full set of tests is
still not complete but this should be enough to go on for now. A number
of kernels are compared. All use 3.7-rc6 as the base.

stats-v4r12 This is patches 10 from "Automatic NUMA Balancing V4" and
is just the TLB fixes and a few minor stats patches for
migration

schednuma-v16r2 tip/sched/core + the original series "Latest numa/core
release, v16". None of the follow up patches have been
applied because testing started after these were posted.

autonuma-v28fastr3 is the autonuma-v28fast branch from Andrea's tree rebased
to v3.7-rc6

moron-v4r38 is patches 1-19 from "Automatic NUMA Balancing V4" and is
the most basic available policy

twostage-v4r38 is patches 1-36 from "Automatic NUMA Balancing V4" and includes
PMD fault handling, migration backoff if there is too much
migration, the most rudimentary of scan rate adaptations and
a two-stage filter to mitigate ping-pong effects

thpmigrate-v4r38 is patches 1-37 from "Autonumatic NUMA Balancing". Patch 37
adds native THP migration so its effect can be observed

In all cases, tests were run via mmtests. Monitors were enabled but not
profiling, as profiling can distort results a lot. The monitors fire every
10 seconds and the heaviest one reads numa_maps. THP is generally enabled,
but the vmstats from each test are usually an obvious indicator.

There is a very important point to note about specjbb. specjbb itself
reports a single throughput figure and it bases this on a number of
warehouses around the expected peak; it ignores warehouses outside this
window, which can be misleading. I'm reporting on all warehouses, so if
you find that my figures do not match what specjbb tells you, it could be
because I'm reporting on low warehouse counts, or on counts outside the
window, even when the peak performance as reported by specjbb was good.

First, the autonumabenchmark.

AUTONUMA BENCH
3.7.0 3.7.0 3.7.0 3.7.0 3.7.0 3.7.0
rc6-stats-v4r12 rc6-schednuma-v16r2rc6-autonuma-v28fastr3 rc6-moron-v4r38 rc6-twostage-v4r38 rc6-thpmigrate-v4r38
User NUMA01 75014.43 ( 0.00%) 22510.33 ( 69.99%) 32944.96 ( 56.08%) 25431.57 ( 66.10%) 69422.58 ( 7.45%) 47343.91 ( 36.89%)
User NUMA01_THEADLOCAL 55479.76 ( 0.00%) 18960.30 ( 65.82%) 16960.54 ( 69.43%) 20381.77 ( 63.26%) 23673.65 ( 57.33%) 16862.01 ( 69.61%)
User NUMA02 6801.32 ( 0.00%) 2208.32 ( 67.53%) 1921.66 ( 71.75%) 2979.36 ( 56.19%) 2213.03 ( 67.46%) 2053.89 ( 69.80%)
User NUMA02_SMT 2973.96 ( 0.00%) 1011.45 ( 65.99%) 1018.84 ( 65.74%) 1135.76 ( 61.81%) 912.61 ( 69.31%) 989.03 ( 66.74%)
System NUMA01 47.87 ( 0.00%) 140.01 (-192.48%) 286.39 (-498.27%) 743.09 (-1452.31%) 896.21 (-1772.17%) 489.09 (-921.70%)
System NUMA01_THEADLOCAL 43.52 ( 0.00%) 1014.35 (-2230.77%) 172.10 (-295.45%) 475.68 (-993.01%) 593.89 (-1264.64%) 144.30 (-231.57%)
System NUMA02 1.94 ( 0.00%) 36.90 (-1802.06%) 20.06 (-934.02%) 22.86 (-1078.35%) 43.01 (-2117.01%) 9.28 (-378.35%)
System NUMA02_SMT 0.93 ( 0.00%) 11.42 (-1127.96%) 11.68 (-1155.91%) 11.87 (-1176.34%) 31.31 (-3266.67%) 3.61 (-288.17%)
Elapsed NUMA01 1668.03 ( 0.00%) 486.04 ( 70.86%) 794.10 ( 52.39%) 601.19 ( 63.96%) 1575.52 ( 5.55%) 1066.67 ( 36.05%)
Elapsed NUMA01_THEADLOCAL 1266.49 ( 0.00%) 433.14 ( 65.80%) 412.50 ( 67.43%) 514.30 ( 59.39%) 542.26 ( 57.18%) 369.38 ( 70.83%)
Elapsed NUMA02 175.75 ( 0.00%) 53.15 ( 69.76%) 63.25 ( 64.01%) 84.51 ( 51.91%) 68.64 ( 60.94%) 49.42 ( 71.88%)
Elapsed NUMA02_SMT 163.55 ( 0.00%) 50.54 ( 69.10%) 56.75 ( 65.30%) 68.85 ( 57.90%) 59.85 ( 63.41%) 46.21 ( 71.75%)
CPU NUMA01 4500.00 ( 0.00%) 4660.00 ( -3.56%) 4184.00 ( 7.02%) 4353.00 ( 3.27%) 4463.00 ( 0.82%) 4484.00 ( 0.36%)
CPU NUMA01_THEADLOCAL 4384.00 ( 0.00%) 4611.00 ( -5.18%) 4153.00 ( 5.27%) 4055.00 ( 7.50%) 4475.00 ( -2.08%) 4603.00 ( -5.00%)
CPU NUMA02 3870.00 ( 0.00%) 4224.00 ( -9.15%) 3069.00 ( 20.70%) 3552.00 ( 8.22%) 3286.00 ( 15.09%) 4174.00 ( -7.86%)
CPU NUMA02_SMT 1818.00 ( 0.00%) 2023.00 (-11.28%) 1815.00 ( 0.17%) 1666.00 ( 8.36%) 1577.00 ( 13.26%) 2147.00 (-18.10%)

In all cases, the baseline kernel is beaten in terms of elapsed time.

NUMA01 schednuma best
NUMA01_THREADLOCAL balancenuma best (required THP migration)
NUMA02 balancenuma best (required THP migration)
NUMA02_SMT balancenuma best (required THP migration)

Note that even without a placement policy, balancenuma was still quite
good, but it required native THP migration to get there. Not depending
on THP to avoid regressions is important, but it reinforces my point that
THP migration was introduced too early in schednuma and potentially hid
problems in the underlying mechanics.

System CPU usage -- schednuma has improved *dramatically* in this regard
for this test.

NUMA01 schednuma lowest overhead
NUMA01_THREADLOCAL balancenuma lowest overhead (THP again)
NUMA02 balancenuma lowest overhead (THP again)
NUMA02_SMT balancenuma lowest overhead (THP again)

Again, balancenuma had the lowest overhead. Note that much of this was due
to native THP migration. That patch was implemented in a hurry so it will
need close scrutiny to make sure I'm not cheating in there somewhere.

MMTests Statistics: duration
3.7.0 3.7.0 3.7.0 3.7.0 3.7.0 3.7.0
rc6-stats-v4r12rc6-schednuma-v16r2rc6-autonuma-v28fastr3rc6-moron-v4r38rc6-twostage-v4r38rc6-thpmigrate-v4r38
User 140276.51 44697.54 52853.24 49933.77 96228.41 67256.37
System 94.93 1203.53 490.84 1254.00 1565.05 646.94
Elapsed 3284.21 1033.02 1336.01 1276.08 2255.08 1542.05

schednuma completed the fastest overall because it completely kicked ass at
numa01. Its system CPU usage was apparently high but much of that was
incurred in just NUMA01_THREADLOCAL.

MMTests Statistics: vmstat
3.7.0 3.7.0 3.7.0 3.7.0 3.7.0 3.7.0
rc6-stats-v4r12rc6-schednuma-v16r2rc6-autonuma-v28fastr3rc6-moron-v4r38rc6-twostage-v4r38rc6-thpmigrate-v4r38
Page Ins 43580 43444 43416 39176 43604 44184
Page Outs 30944 11504 14012 13332 20576 15944
Swap Ins 0 0 0 0 0 0
Swap Outs 0 0 0 0 0 0
Direct pages scanned 0 0 0 0 0 0
Kswapd pages scanned 0 0 0 0 0 0
Kswapd pages reclaimed 0 0 0 0 0 0
Direct pages reclaimed 0 0 0 0 0 0
Kswapd efficiency 100% 100% 100% 100% 100% 100%
Kswapd velocity 0.000 0.000 0.000 0.000 0.000 0.000
Direct efficiency 100% 100% 100% 100% 100% 100%
Direct velocity 0.000 0.000 0.000 0.000 0.000 0.000
Percentage direct scans 0% 0% 0% 0% 0% 0%
Page writes by reclaim 0 0 0 0 0 0
Page writes file 0 0 0 0 0 0
Page writes anon 0 0 0 0 0 0
Page reclaim immediate 0 0 0 0 0 0
Page rescued immediate 0 0 0 0 0 0
Slabs scanned 0 0 0 0 0 0
Direct inode steals 0 0 0 0 0 0
Kswapd inode steals 0 0 0 0 0 0
Kswapd skipped wait 0 0 0 0 0 0
THP fault alloc 17076 13240 19254 17165 16207 17298
THP collapse alloc 7 0 8950 534 1020 8
THP splits 3 2 9486 7585 7426 2
THP fault fallback 0 0 0 0 0 0
THP collapse fail 0 0 0 0 0 0
Compaction stalls 0 0 0 0 0 0
Compaction success 0 0 0 0 0 0
Compaction failures 0 0 0 0 0 0
Page migrate success 0 0 0 2988728 8265970 14679
Page migrate failure 0 0 0 0 0 0
Compaction pages isolated 0 0 0 0 0 0
Compaction migrate scanned 0 0 0 0 0 0
Compaction free scanned 0 0 0 0 0 0
Compaction cost 0 0 0 3102 8580 15
NUMA PTE updates 0 0 0 712623229 221698496 458517124
NUMA hint faults 0 0 0 604878754 105489260 1431976
NUMA hint local faults 0 0 0 163366888 48972507 621116
NUMA pages migrated 0 0 0 2988728 8265970 14679
AutoNUMA cost 0 0 0 3029438 529155 10369

I don't have detailed stats for schednuma or autonuma, so I don't know how
many PTE updates they are doing. However, look at the "THP collapse alloc" and
"THP splits". You can see the effect of native THP migration. schednuma and
thpmigrate both have few collapses and splits due to the native migration.

Also note what thpmigrate does to "Page migrate success" as each THP
migration only counts as 1. I don't have the same stats for schednuma but
one would expect they would be similar if they existed.

SPECJBB BOPS Multiple JVMs, THP is DISABLED

3.7.0 3.7.0 3.7.0 3.7.0 3.7.0 3.7.0
rc6-stats-v4r12 rc6-schednuma-v16r2rc6-autonuma-v28fastr3 rc6-moron-v4r38 rc6-twostage-v4r38 rc6-thpmigrate-v4r38
Mean 1 25426.00 ( 0.00%) 17734.25 (-30.25%) 25828.25 ( 1.58%) 24972.75 ( -1.78%) 24944.25 ( -1.89%) 24557.25 ( -3.42%)
Mean 2 53316.50 ( 0.00%) 39883.50 (-25.19%) 56303.00 ( 5.60%) 51994.00 ( -2.48%) 51962.75 ( -2.54%) 49828.50 ( -6.54%)
Mean 3 77182.75 ( 0.00%) 58082.50 (-24.75%) 82874.00 ( 7.37%) 76428.50 ( -0.98%) 74272.75 ( -3.77%) 73934.50 ( -4.21%)
Mean 4 100698.25 ( 0.00%) 75740.25 (-24.78%) 107776.00 ( 7.03%) 98963.75 ( -1.72%) 96681.00 ( -3.99%) 95749.75 ( -4.91%)
Mean 5 120235.50 ( 0.00%) 87472.25 (-27.25%) 131299.75 ( 9.20%) 118226.50 ( -1.67%) 115981.25 ( -3.54%) 115904.50 ( -3.60%)
Mean 6 135085.00 ( 0.00%) 100947.25 (-25.27%) 152928.75 ( 13.21%) 133681.50 ( -1.04%) 134297.00 ( -0.58%) 133065.50 ( -1.49%)
Mean 7 135916.25 ( 0.00%) 112033.50 (-17.57%) 158917.50 ( 16.92%) 135273.25 ( -0.47%) 135100.50 ( -0.60%) 135286.00 ( -0.46%)
Mean 8 131696.25 ( 0.00%) 114805.25 (-12.83%) 160972.00 ( 22.23%) 126948.50 ( -3.61%) 135756.00 ( 3.08%) 135097.25 ( 2.58%)
Mean 9 129359.00 ( 0.00%) 113961.25 (-11.90%) 161584.00 ( 24.91%) 129655.75 ( 0.23%) 133621.50 ( 3.30%) 133027.00 ( 2.84%)
Mean 10 121682.75 ( 0.00%) 114095.25 ( -6.24%) 159302.75 ( 30.92%) 119806.00 ( -1.54%) 127338.50 ( 4.65%) 128388.50 ( 5.51%)
Mean 11 114355.25 ( 0.00%) 112794.25 ( -1.37%) 154468.75 ( 35.08%) 114229.75 ( -0.11%) 121907.00 ( 6.60%) 125957.00 ( 10.15%)
Mean 12 109110.00 ( 0.00%) 110618.00 ( 1.38%) 149917.50 ( 37.40%) 106851.00 ( -2.07%) 121331.50 ( 11.20%) 122557.25 ( 12.32%)
Mean 13 106055.00 ( 0.00%) 109073.25 ( 2.85%) 146731.75 ( 38.35%) 105273.75 ( -0.74%) 118965.25 ( 12.17%) 121129.25 ( 14.21%)
Mean 14 105102.25 ( 0.00%) 107065.00 ( 1.87%) 143996.50 ( 37.01%) 103972.00 ( -1.08%) 118018.50 ( 12.29%) 120379.50 ( 14.54%)
Mean 15 105070.00 ( 0.00%) 104714.50 ( -0.34%) 142079.50 ( 35.22%) 102753.50 ( -2.20%) 115214.50 ( 9.65%) 114074.25 ( 8.57%)
Mean 16 101610.50 ( 0.00%) 103741.25 ( 2.10%) 140463.75 ( 38.24%) 103084.75 ( 1.45%) 115000.25 ( 13.18%) 112132.75 ( 10.36%)
Mean 17 99653.00 ( 0.00%) 101577.25 ( 1.93%) 137886.50 ( 38.37%) 101658.00 ( 2.01%) 116072.25 ( 16.48%) 114797.75 ( 15.20%)
Mean 18 99804.25 ( 0.00%) 99625.75 ( -0.18%) 136973.00 ( 37.24%) 101557.25 ( 1.76%) 113653.75 ( 13.88%) 112361.00 ( 12.58%)
Stddev 1 956.30 ( 0.00%) 696.13 ( 27.21%) 729.45 ( 23.72%) 692.14 ( 27.62%) 344.73 ( 63.95%) 620.60 ( 35.10%)
Stddev 2 1105.71 ( 0.00%) 1219.79 (-10.32%) 819.00 ( 25.93%) 497.85 ( 54.97%) 1571.77 (-42.15%) 1584.30 (-43.28%)
Stddev 3 782.85 ( 0.00%) 1293.42 (-65.22%) 1016.53 (-29.85%) 777.41 ( 0.69%) 559.90 ( 28.48%) 1451.35 (-85.39%)
Stddev 4 1583.94 ( 0.00%) 1266.70 ( 20.03%) 1418.75 ( 10.43%) 1117.71 ( 29.43%) 879.59 ( 44.47%) 3081.68 (-94.56%)
Stddev 5 1361.30 ( 0.00%) 2958.17 (-117.31%) 1254.51 ( 7.84%) 1085.07 ( 20.29%) 821.75 ( 39.63%) 1971.46 (-44.82%)
Stddev 6 980.46 ( 0.00%) 2401.48 (-144.93%) 1693.67 (-72.74%) 865.73 ( 11.70%) 995.95 ( -1.58%) 1484.04 (-51.36%)
Stddev 7 1596.69 ( 0.00%) 1152.52 ( 27.82%) 1278.42 ( 19.93%) 2125.55 (-33.12%) 780.03 ( 51.15%) 7738.34 (-384.65%)
Stddev 8 5335.38 ( 0.00%) 2228.09 ( 58.24%) 720.44 ( 86.50%) 1425.78 ( 73.28%) 4981.34 ( 6.64%) 3015.77 ( 43.48%)
Stddev 9 2644.97 ( 0.00%) 2559.52 ( 3.23%) 1676.05 ( 36.63%) 6018.44 (-127.54%) 4856.12 (-83.60%) 2224.33 ( 15.90%)
Stddev 10 2887.45 ( 0.00%) 2237.65 ( 22.50%) 2592.28 ( 10.22%) 4871.48 (-68.71%) 3211.83 (-11.23%) 2934.03 ( -1.61%)
Stddev 11 4397.53 ( 0.00%) 1507.18 ( 65.73%) 5111.36 (-16.23%) 2741.08 ( 37.67%) 2954.59 ( 32.81%) 2812.71 ( 36.04%)
Stddev 12 4591.96 ( 0.00%) 313.48 ( 93.17%) 9008.19 (-96.17%) 3077.80 ( 32.97%) 888.55 ( 80.65%) 1665.82 ( 63.72%)
Stddev 13 3949.88 ( 0.00%) 743.20 ( 81.18%) 9978.16 (-152.62%) 2622.11 ( 33.62%) 1869.85 ( 52.66%) 1048.64 ( 73.45%)
Stddev 14 3727.46 ( 0.00%) 462.24 ( 87.60%) 9933.35 (-166.49%) 2702.25 ( 27.50%) 1596.33 ( 57.17%) 1276.03 ( 65.77%)
Stddev 15 2034.89 ( 0.00%) 490.28 ( 75.91%) 8688.84 (-326.99%) 2309.97 (-13.52%) 1212.53 ( 40.41%) 2088.72 ( -2.65%)
Stddev 16 3979.74 ( 0.00%) 648.50 ( 83.70%) 9606.85 (-141.39%) 2284.15 ( 42.61%) 1769.97 ( 55.53%) 2083.18 ( 47.66%)
Stddev 17 3619.30 ( 0.00%) 415.80 ( 88.51%) 9636.97 (-166.27%) 2838.78 ( 21.57%) 1034.92 ( 71.41%) 760.91 ( 78.98%)
Stddev 18 3276.41 ( 0.00%) 238.77 ( 92.71%) 11295.37 (-244.75%) 1061.62 ( 67.60%) 589.37 ( 82.01%) 881.04 ( 73.11%)
TPut 1 101704.00 ( 0.00%) 70937.00 (-30.25%) 103313.00 ( 1.58%) 99891.00 ( -1.78%) 99777.00 ( -1.89%) 98229.00 ( -3.42%)
TPut 2 213266.00 ( 0.00%) 159534.00 (-25.19%) 225212.00 ( 5.60%) 207976.00 ( -2.48%) 207851.00 ( -2.54%) 199314.00 ( -6.54%)
TPut 3 308731.00 ( 0.00%) 232330.00 (-24.75%) 331496.00 ( 7.37%) 305714.00 ( -0.98%) 297091.00 ( -3.77%) 295738.00 ( -4.21%)
TPut 4 402793.00 ( 0.00%) 302961.00 (-24.78%) 431104.00 ( 7.03%) 395855.00 ( -1.72%) 386724.00 ( -3.99%) 382999.00 ( -4.91%)
TPut 5 480942.00 ( 0.00%) 349889.00 (-27.25%) 525199.00 ( 9.20%) 472906.00 ( -1.67%) 463925.00 ( -3.54%) 463618.00 ( -3.60%)
TPut 6 540340.00 ( 0.00%) 403789.00 (-25.27%) 611715.00 ( 13.21%) 534726.00 ( -1.04%) 537188.00 ( -0.58%) 532262.00 ( -1.49%)
TPut 7 543665.00 ( 0.00%) 448134.00 (-17.57%) 635670.00 ( 16.92%) 541093.00 ( -0.47%) 540402.00 ( -0.60%) 541144.00 ( -0.46%)
TPut 8 526785.00 ( 0.00%) 459221.00 (-12.83%) 643888.00 ( 22.23%) 507794.00 ( -3.61%) 543024.00 ( 3.08%) 540389.00 ( 2.58%)
TPut 9 517436.00 ( 0.00%) 455845.00 (-11.90%) 646336.00 ( 24.91%) 518623.00 ( 0.23%) 534486.00 ( 3.30%) 532108.00 ( 2.84%)
TPut 10 486731.00 ( 0.00%) 456381.00 ( -6.24%) 637211.00 ( 30.92%) 479224.00 ( -1.54%) 509354.00 ( 4.65%) 513554.00 ( 5.51%)
TPut 11 457421.00 ( 0.00%) 451177.00 ( -1.37%) 617875.00 ( 35.08%) 456919.00 ( -0.11%) 487628.00 ( 6.60%) 503828.00 ( 10.15%)
TPut 12 436440.00 ( 0.00%) 442472.00 ( 1.38%) 599670.00 ( 37.40%) 427404.00 ( -2.07%) 485326.00 ( 11.20%) 490229.00 ( 12.32%)
TPut 13 424220.00 ( 0.00%) 436293.00 ( 2.85%) 586927.00 ( 38.35%) 421095.00 ( -0.74%) 475861.00 ( 12.17%) 484517.00 ( 14.21%)
TPut 14 420409.00 ( 0.00%) 428260.00 ( 1.87%) 575986.00 ( 37.01%) 415888.00 ( -1.08%) 472074.00 ( 12.29%) 481518.00 ( 14.54%)
TPut 15 420280.00 ( 0.00%) 418858.00 ( -0.34%) 568318.00 ( 35.22%) 411014.00 ( -2.20%) 460858.00 ( 9.65%) 456297.00 ( 8.57%)
TPut 16 406442.00 ( 0.00%) 414965.00 ( 2.10%) 561855.00 ( 38.24%) 412339.00 ( 1.45%) 460001.00 ( 13.18%) 448531.00 ( 10.36%)
TPut 17 398612.00 ( 0.00%) 406309.00 ( 1.93%) 551546.00 ( 38.37%) 406632.00 ( 2.01%) 464289.00 ( 16.48%) 459191.00 ( 15.20%)
TPut 18 399217.00 ( 0.00%) 398503.00 ( -0.18%) 547892.00 ( 37.24%) 406229.00 ( 1.76%) 454615.00 ( 13.88%) 449444.00 ( 12.58%)

In case you missed it in the header, THP is disabled in this test.

Overall, autonuma is the best, showing gains no matter how many warehouses
are used.

schednuma starts badly with a 30% regression but improves as the number of
warehouses increases until it is comparable with the baseline kernel. Remember
what I said about specjbb itself using the peak range of warehouses? I
checked and in this case it used warehouses 12-18 for its throughput figure,
which would have missed all the regressions at low warehouse counts. Watch for
this in your own testing.
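
To illustrate with the TPut figures above: averaging warehouses 12-18 gives
roughly 415k for the baseline kernel and roughly 421k for schednuma, so a
report based on that peak range would show schednuma about 1.4% ahead even
though it is 25-30% behind at 1-5 warehouses.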

moron-v4r38 does nothing, but it's not expected to; it lacks proper handling
of PMDs.

twostage-v4r38 does better. It also regresses at low warehouse counts but
from 8 warehouses on it has a decent improvement over the baseline kernel.

thpmigrate-v4r38 makes no real difference here. There are some changes but
it's likely just testing jitter as THP was disabled.

SPECJBB PEAKS
3.7.0 3.7.0 3.7.0 3.7.0 3.7.0 3.7.0
rc6-stats-v4r12 rc6-schednuma-v16r2 rc6-autonuma-v28fastr3 rc6-moron-v4r38 rc6-twostage-v4r38 rc6-thpmigrate-v4r38
Expctd Warehouse 12.00 ( 0.00%) 12.00 ( 0.00%) 12.00 ( 0.00%) 12.00 ( 0.00%) 12.00 ( 0.00%) 12.00 ( 0.00%)
Expctd Peak Bops 436440.00 ( 0.00%) 442472.00 ( 1.38%) 599670.00 ( 37.40%) 427404.00 ( -2.07%) 485326.00 ( 11.20%) 490229.00 ( 12.32%)
Actual Warehouse 7.00 ( 0.00%) 8.00 ( 14.29%) 9.00 ( 28.57%) 7.00 ( 0.00%) 8.00 ( 14.29%) 7.00 ( 0.00%)
Actual Peak Bops 543665.00 ( 0.00%) 459221.00 (-15.53%) 646336.00 ( 18.88%) 541093.00 ( -0.47%) 543024.00 ( -0.12%) 541144.00 ( -0.46%)

schednuma's actual peak throughput regressed 15% from the baseline kernel.

autonuma did best with an 18% improvement on the peak.

balancenuma does no worse at the peak. Note that the actual peak of 7
warehouses is around where it started showing improvements.

MMTests Statistics: duration
3.7.0 3.7.0 3.7.0 3.7.0 3.7.0 3.7.0
rc6-stats-v4r12rc6-schednuma-v16r2rc6-autonuma-v28fastr3rc6-moron-v4r38rc6-twostage-v4r38rc6-thpmigrate-v4r38
User 101947.42 88113.29 101723.29 100931.37 99788.91 99783.34
System 66.48 12389.75 174.59 906.21 1575.66 1576.91
Elapsed 2457.45 2459.94 2461.46 2451.58 2457.17 2452.21

schednuma's system CPU usage is through the roof.

autonuma's looks great but it could be hiding the cost in kernel threads.

balancenuma's is pretty poor but a lot lower than schednuma's.


MMTests Statistics: vmstat
3.7.0 3.7.0 3.7.0 3.7.0 3.7.0 3.7.0
rc6-stats-v4r12rc6-schednuma-v16r2rc6-autonuma-v28fastr3rc6-moron-v4r38rc6-twostage-v4r38rc6-thpmigrate-v4r38
Page Ins 38540 38240 38524 38224 38104 38284
Page Outs 33276 34448 31808 31928 32380 30676
Swap Ins 0 0 0 0 0 0
Swap Outs 0 0 0 0 0 0
Direct pages scanned 0 0 0 0 0 0
Kswapd pages scanned 0 0 0 0 0 0
Kswapd pages reclaimed 0 0 0 0 0 0
Direct pages reclaimed 0 0 0 0 0 0
Kswapd efficiency 100% 100% 100% 100% 100% 100%
Kswapd velocity 0.000 0.000 0.000 0.000 0.000 0.000
Direct efficiency 100% 100% 100% 100% 100% 100%
Direct velocity 0.000 0.000 0.000 0.000 0.000 0.000
Percentage direct scans 0% 0% 0% 0% 0% 0%
Page writes by reclaim 0 0 0 0 0 0
Page writes file 0 0 0 0 0 0
Page writes anon 0 0 0 0 0 0
Page reclaim immediate 0 0 0 0 0 0
Page rescued immediate 0 0 0 0 0 0
Slabs scanned 0 0 0 0 0 0
Direct inode steals 0 0 0 0 0 0
Kswapd inode steals 0 0 0 0 0 0
Kswapd skipped wait 0 0 0 0 0 0
THP fault alloc 2 1 2 2 2 2
THP collapse alloc 0 0 0 0 0 0
THP splits 0 0 8 1 2 0
THP fault fallback 0 0 0 0 0 0
THP collapse fail 0 0 0 0 0 0
Compaction stalls 0 0 0 0 0 0
Compaction success 0 0 0 0 0 0
Compaction failures 0 0 0 0 0 0
Page migrate success 0 0 0 520232 44930994 44969103
Page migrate failure 0 0 0 0 0 0
Compaction pages isolated 0 0 0 0 0 0
Compaction migrate scanned 0 0 0 0 0 0
Compaction free scanned 0 0 0 0 0 0
Compaction cost 0 0 0 540 46638 46677
NUMA PTE updates 0 0 0 2985879895 386687008 386289592
NUMA hint faults 0 0 0 2762800008 360149388 359807642
NUMA hint local faults 0 0 0 700107356 97822934 97064458
NUMA pages migrated 0 0 0 520232 44930994 44969103
AutoNUMA cost 0 0 0 13834911 1804307 1802596

You can see the possible source of balancenuma's overhead here. It updated
an extremely large number of PTEs and incurred a very large number of
faults. It needs better scan rate adaption, but that in turn needs a placement
policy to drive it and detect whether the workload is converging or not.
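
For what it's worth, the kind of scan rate adaption I have in mind would look
roughly like the standalone sketch below. This is illustrative C only, not
kernel code; the struct, helper and limits are made up. The idea is simply to
back the scan period off when most hinting faults are already local (i.e. the
workload looks converged) and to speed it up when they are not.

#include <stdint.h>

/* Hypothetical per-task scan state; names and limits are illustrative. */
struct numa_scan_state {
	unsigned int scan_period_ms;	/* delay between PTE scan passes */
	uint64_t faults_local;		/* hinting faults on the local node */
	uint64_t faults_remote;		/* hinting faults on remote nodes */
};

#define SCAN_PERIOD_MIN_MS	  100U
#define SCAN_PERIOD_MAX_MS	60000U

/* Called after each scan window: slow down if converged, speed up if not. */
static void numa_scan_rate_adapt(struct numa_scan_state *ns)
{
	uint64_t total = ns->faults_local + ns->faults_remote;

	if (!total)
		return;

	if (ns->faults_local * 2 >= total) {
		/* Mostly local faults: looks converged, back off. */
		ns->scan_period_ms *= 2;
		if (ns->scan_period_ms > SCAN_PERIOD_MAX_MS)
			ns->scan_period_ms = SCAN_PERIOD_MAX_MS;
	} else {
		/* Mostly remote faults: placement is wrong, scan harder. */
		ns->scan_period_ms /= 2;
		if (ns->scan_period_ms < SCAN_PERIOD_MIN_MS)
			ns->scan_period_ms = SCAN_PERIOD_MIN_MS;
	}

	/* Age the counters so the ratio tracks recent behaviour. */
	ns->faults_local >>= 1;
	ns->faults_remote >>= 1;
}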

Note the THP figures -- there is almost no activity because THP is disabled.

SPECJBB BOPS Multiple JVMs, THP is enabled
3.7.0 3.7.0 3.7.0 3.7.0 3.7.0 3.7.0
rc6-stats-v4r12 rc6-schednuma-v16r2rc6-autonuma-v28fastr3 rc6-moron-v4r38 rc6-twostage-v4r38 rc6-thpmigrate-v4r38
Mean 1 31245.50 ( 0.00%) 26282.75 (-15.88%) 29527.75 ( -5.50%) 28873.50 ( -7.59%) 29596.25 ( -5.28%) 31146.00 ( -0.32%)
Mean 2 61735.75 ( 0.00%) 57095.50 ( -7.52%) 62362.50 ( 1.02%) 51322.50 (-16.87%) 55991.50 ( -9.30%) 61055.00 ( -1.10%)
Mean 3 90068.00 ( 0.00%) 87035.50 ( -3.37%) 94382.50 ( 4.79%) 78299.25 (-13.07%) 77209.25 (-14.28%) 91018.25 ( 1.06%)
Mean 4 116542.75 ( 0.00%) 113082.00 ( -2.97%) 123228.75 ( 5.74%) 97686.50 (-16.18%) 100294.75 (-13.94%) 116657.50 ( 0.10%)
Mean 5 136686.50 ( 0.00%) 119901.75 (-12.28%) 150850.25 ( 10.36%) 104357.25 (-23.65%) 121599.50 (-11.04%) 139162.25 ( 1.81%)
Mean 6 154764.00 ( 0.00%) 148642.25 ( -3.96%) 175157.25 ( 13.18%) 115533.25 (-25.35%) 140291.75 ( -9.35%) 158279.25 ( 2.27%)
Mean 7 152353.50 ( 0.00%) 154544.50 ( 1.44%) 180972.50 ( 18.78%) 131652.75 (-13.59%) 142895.00 ( -6.21%) 162127.00 ( 6.42%)
Mean 8 153510.50 ( 0.00%) 156682.00 ( 2.07%) 184412.00 ( 20.13%) 134736.75 (-12.23%) 141980.00 ( -7.51%) 161740.00 ( 5.36%)
Mean 9 141531.25 ( 0.00%) 151687.00 ( 7.18%) 184020.50 ( 30.02%) 133901.75 ( -5.39%) 137555.50 ( -2.81%) 157858.25 ( 11.54%)
Mean 10 141536.00 ( 0.00%) 144682.75 ( 2.22%) 179991.50 ( 27.17%) 131299.75 ( -7.23%) 132871.00 ( -6.12%) 151339.75 ( 6.93%)
Mean 11 139880.50 ( 0.00%) 140449.25 ( 0.41%) 174480.75 ( 24.74%) 122725.75 (-12.26%) 126864.00 ( -9.31%) 145256.50 ( 3.84%)
Mean 12 122948.25 ( 0.00%) 136247.50 ( 10.82%) 169831.25 ( 38.13%) 116190.25 ( -5.50%) 124048.00 ( 0.89%) 137139.25 ( 11.54%)
Mean 13 123131.75 ( 0.00%) 133700.75 ( 8.58%) 166204.50 ( 34.98%) 113206.25 ( -8.06%) 119934.00 ( -2.60%) 138639.25 ( 12.59%)
Mean 14 124271.25 ( 0.00%) 131856.75 ( 6.10%) 163368.25 ( 31.46%) 112379.75 ( -9.57%) 122836.75 ( -1.15%) 131143.50 ( 5.53%)
Mean 15 120426.75 ( 0.00%) 128455.25 ( 6.67%) 162290.00 ( 34.76%) 110448.50 ( -8.29%) 121109.25 ( 0.57%) 135818.25 ( 12.78%)
Mean 16 120899.00 ( 0.00%) 124334.00 ( 2.84%) 160002.00 ( 32.34%) 108771.25 (-10.03%) 113568.75 ( -6.06%) 127873.50 ( 5.77%)
Mean 17 120508.25 ( 0.00%) 124564.50 ( 3.37%) 158369.25 ( 31.42%) 106233.50 (-11.85%) 116768.50 ( -3.10%) 129826.50 ( 7.73%)
Mean 18 113974.00 ( 0.00%) 121539.25 ( 6.64%) 156437.50 ( 37.26%) 108424.50 ( -4.87%) 114648.50 ( 0.59%) 129318.50 ( 13.46%)
Stddev 1 1030.82 ( 0.00%) 781.13 ( 24.22%) 276.53 ( 73.17%) 1216.87 (-18.05%) 1666.25 (-61.64%) 949.68 ( 7.87%)
Stddev 2 837.50 ( 0.00%) 1449.41 (-73.06%) 937.19 (-11.90%) 1758.28 (-109.94%) 2300.84 (-174.73%) 1191.02 (-42.21%)
Stddev 3 629.40 ( 0.00%) 1314.87 (-108.91%) 1606.92 (-155.31%) 1682.12 (-167.26%) 2028.25 (-222.25%) 788.05 (-25.21%)
Stddev 4 1234.97 ( 0.00%) 525.14 ( 57.48%) 617.46 ( 50.00%) 2162.57 (-75.11%) 522.03 ( 57.73%) 1389.65 (-12.52%)
Stddev 5 997.81 ( 0.00%) 4516.97 (-352.69%) 2366.16 (-137.14%) 5545.91 (-455.81%) 2477.82 (-148.33%) 396.92 ( 60.22%)
Stddev 6 1196.81 ( 0.00%) 2759.43 (-130.56%) 1680.54 (-40.42%) 3188.65 (-166.43%) 2534.28 (-111.75%) 1648.18 (-37.71%)
Stddev 7 2808.10 ( 0.00%) 6114.11 (-117.73%) 2004.86 ( 28.60%) 6714.17 (-139.10%) 3538.72 (-26.02%) 3334.99 (-18.76%)
Stddev 8 3059.06 ( 0.00%) 8582.09 (-180.55%) 3534.51 (-15.54%) 5823.74 (-90.38%) 4425.50 (-44.67%) 3089.27 ( -0.99%)
Stddev 9 2244.91 ( 0.00%) 4927.67 (-119.50%) 5014.87 (-123.39%) 3233.41 (-44.03%) 3622.19 (-61.35%) 2718.62 (-21.10%)
Stddev 10 4662.71 ( 0.00%) 905.03 ( 80.59%) 6637.16 (-42.35%) 3183.20 ( 31.73%) 6056.20 (-29.89%) 3339.35 ( 28.38%)
Stddev 11 3671.80 ( 0.00%) 1863.28 ( 49.25%) 12270.82 (-234.19%) 2186.10 ( 40.46%) 3335.54 ( 9.16%) 1388.36 ( 62.19%)
Stddev 12 6802.60 ( 0.00%) 1897.86 ( 72.10%) 16818.87 (-147.24%) 2461.95 ( 63.81%) 1908.58 ( 71.94%) 5683.00 ( 16.46%)
Stddev 13 4798.34 ( 0.00%) 225.34 ( 95.30%) 16911.42 (-252.44%) 2282.32 ( 52.44%) 1952.91 ( 59.30%) 3572.80 ( 25.54%)
Stddev 14 4266.81 ( 0.00%) 1311.71 ( 69.26%) 16842.35 (-294.73%) 1898.80 ( 55.50%) 1738.97 ( 59.24%) 5058.54 (-18.56%)
Stddev 15 2361.19 ( 0.00%) 926.70 ( 60.75%) 17701.84 (-649.70%) 1907.33 ( 19.22%) 1599.64 ( 32.25%) 2199.69 ( 6.84%)
Stddev 16 1927.00 ( 0.00%) 521.78 ( 72.92%) 19107.14 (-891.55%) 2704.74 (-40.36%) 2354.42 (-22.18%) 3355.74 (-74.14%)
Stddev 17 3098.03 ( 0.00%) 910.17 ( 70.62%) 18920.22 (-510.72%) 2214.42 ( 28.52%) 2290.00 ( 26.08%) 1939.87 ( 37.38%)
Stddev 18 4045.82 ( 0.00%) 798.22 ( 80.27%) 17789.94 (-339.71%) 1287.48 ( 68.18%) 2189.19 ( 45.89%) 2531.60 ( 37.43%)
TPut 1 124982.00 ( 0.00%) 105131.00 (-15.88%) 118111.00 ( -5.50%) 115494.00 ( -7.59%) 118385.00 ( -5.28%) 124584.00 ( -0.32%)
TPut 2 246943.00 ( 0.00%) 228382.00 ( -7.52%) 249450.00 ( 1.02%) 205290.00 (-16.87%) 223966.00 ( -9.30%) 244220.00 ( -1.10%)
TPut 3 360272.00 ( 0.00%) 348142.00 ( -3.37%) 377530.00 ( 4.79%) 313197.00 (-13.07%) 308837.00 (-14.28%) 364073.00 ( 1.06%)
TPut 4 466171.00 ( 0.00%) 452328.00 ( -2.97%) 492915.00 ( 5.74%) 390746.00 (-16.18%) 401179.00 (-13.94%) 466630.00 ( 0.10%)
TPut 5 546746.00 ( 0.00%) 479607.00 (-12.28%) 603401.00 ( 10.36%) 417429.00 (-23.65%) 486398.00 (-11.04%) 556649.00 ( 1.81%)
TPut 6 619056.00 ( 0.00%) 594569.00 ( -3.96%) 700629.00 ( 13.18%) 462133.00 (-25.35%) 561167.00 ( -9.35%) 633117.00 ( 2.27%)
TPut 7 609414.00 ( 0.00%) 618178.00 ( 1.44%) 723890.00 ( 18.78%) 526611.00 (-13.59%) 571580.00 ( -6.21%) 648508.00 ( 6.42%)
TPut 8 614042.00 ( 0.00%) 626728.00 ( 2.07%) 737648.00 ( 20.13%) 538947.00 (-12.23%) 567920.00 ( -7.51%) 646960.00 ( 5.36%)
TPut 9 566125.00 ( 0.00%) 606748.00 ( 7.18%) 736082.00 ( 30.02%) 535607.00 ( -5.39%) 550222.00 ( -2.81%) 631433.00 ( 11.54%)
TPut 10 566144.00 ( 0.00%) 578731.00 ( 2.22%) 719966.00 ( 27.17%) 525199.00 ( -7.23%) 531484.00 ( -6.12%) 605359.00 ( 6.93%)
TPut 11 559522.00 ( 0.00%) 561797.00 ( 0.41%) 697923.00 ( 24.74%) 490903.00 (-12.26%) 507456.00 ( -9.31%) 581026.00 ( 3.84%)
TPut 12 491793.00 ( 0.00%) 544990.00 ( 10.82%) 679325.00 ( 38.13%) 464761.00 ( -5.50%) 496192.00 ( 0.89%) 548557.00 ( 11.54%)
TPut 13 492527.00 ( 0.00%) 534803.00 ( 8.58%) 664818.00 ( 34.98%) 452825.00 ( -8.06%) 479736.00 ( -2.60%) 554557.00 ( 12.59%)
TPut 14 497085.00 ( 0.00%) 527427.00 ( 6.10%) 653473.00 ( 31.46%) 449519.00 ( -9.57%) 491347.00 ( -1.15%) 524574.00 ( 5.53%)
TPut 15 481707.00 ( 0.00%) 513821.00 ( 6.67%) 649160.00 ( 34.76%) 441794.00 ( -8.29%) 484437.00 ( 0.57%) 543273.00 ( 12.78%)
TPut 16 483596.00 ( 0.00%) 497336.00 ( 2.84%) 640008.00 ( 32.34%) 435085.00 (-10.03%) 454275.00 ( -6.06%) 511494.00 ( 5.77%)
TPut 17 482033.00 ( 0.00%) 498258.00 ( 3.37%) 633477.00 ( 31.42%) 424934.00 (-11.85%) 467074.00 ( -3.10%) 519306.00 ( 7.73%)
TPut 18 455896.00 ( 0.00%) 486157.00 ( 6.64%) 625750.00 ( 37.26%) 433698.00 ( -4.87%) 458594.00 ( 0.59%) 517274.00 ( 13.46%)

In case you missed it in the header, THP is enabled this time.

Again, autonuma is the best.

schednuma does much better here. It regresses at small numbers of warehouses,
and note that the specjbb reporting will have missed this because it focuses
on the peak. At higher numbers of warehouses it sees a nice improvement
of very roughly 2-8%. Again, it is worth double-checking whether the positive
specjbb reports were based on peak warehouses and looking at what the other
warehouse figures looked like.

twostage-v4r38 from balancenuma suffers here, which initially surprised me
until I looked at the THP figures below. It's splitting its huge pages
and trying to migrate them.

thpmigrate-v4r38 migrates THPs natively. It marginally regresses for 1-2
warehouses but shows decent performance gains after that.
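
To make the difference concrete, the choice the kernels are effectively making
when a misplaced THP takes a hinting fault is roughly the sketch below. This
is illustrative standalone C, not the actual patches; every helper here is a
hypothetical stand-in for the real machinery.

/*
 * Illustrative only: what to do with a misplaced THP at fault time.
 * None of these helpers are real kernel APIs.
 */
struct page;

extern int thp_native_migration_supported(void);
extern int migrate_thp_native(struct page *thp, int target_node);
extern int split_thp_to_base_pages(struct page *thp);
extern struct page *nth_base_page(struct page *thp, int i);
extern int migrate_base_page(struct page *page, int target_node);

#define BASE_PAGES_PER_THP 512	/* 2M THP / 4K base pages on x86-64 */

static int numa_migrate_misplaced_thp(struct page *thp, int target_node)
{
	int i, ret;

	/* thpmigrate: move the 2M page to the target node as one unit. */
	if (thp_native_migration_supported())
		return migrate_thp_native(thp, target_node);

	/*
	 * twostage without native support: split the huge page and migrate
	 * the base pages individually.  This is what shows up as a large
	 * "THP splits" count and a much higher "Page migrate success"
	 * figure in the vmstat tables.
	 */
	ret = split_thp_to_base_pages(thp);
	if (ret)
		return ret;

	for (i = 0; i < BASE_PAGES_PER_THP; i++)
		migrate_base_page(nth_base_page(thp, i), target_node);

	return 0;
}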

SPECJBB PEAKS
3.7.0 3.7.0 3.7.0 3.7.0 3.7.0 3.7.0
rc6-stats-v4r12 rc6-schednuma-v16r2 rc6-autonuma-v28fastr3 rc6-moron-v4r38 rc6-twostage-v4r38 rc6-thpmigrate-v4r38
Expctd Warehouse 12.00 ( 0.00%) 12.00 ( 0.00%) 12.00 ( 0.00%) 12.00 ( 0.00%) 12.00 ( 0.00%) 12.00 ( 0.00%)
Expctd Peak Bops 491793.00 ( 0.00%) 544990.00 ( 10.82%) 679325.00 ( 38.13%) 464761.00 ( -5.50%) 496192.00 ( 0.89%) 548557.00 ( 11.54%)
Actual Warehouse 6.00 ( 0.00%) 8.00 ( 33.33%) 8.00 ( 33.33%) 8.00 ( 33.33%) 7.00 ( 16.67%) 7.00 ( 16.67%)
Actual Peak Bops 619056.00 ( 0.00%) 626728.00 ( 1.24%) 737648.00 ( 19.16%) 538947.00 (-12.94%) 571580.00 ( -7.67%) 648508.00 ( 4.76%)

schednuma reports a 1.24% gain at the peak.
autonuma reports 19.16%.
balancenuma reports 4.76%, but note it needed native THP migration to do that.

MMTests Statistics: duration
3.7.0 3.7.0 3.7.0 3.7.0 3.7.0 3.7.0
rc6-stats-v4r12rc6-schednuma-v16r2rc6-autonuma-v28fastr3rc6-moron-v4r38rc6-twostage-v4r38rc6-thpmigrate-v4r38
User 102073.40 101389.03 101952.32 100475.04 99905.11 101627.79
System 145.14 586.45 157.47 1257.01 1582.86 546.22
Elapsed 2457.98 2461.43 2450.75 2459.24 2459.39 2456.16

schednuma's system CPU usage is much more acceptable here. Since it can handle
THPs here, a possible conclusion is that schednuma suffers when it has to
deal with individual PTE updates and faults.

autonuma had the lowest system CPU overhead. The usual disclaimers apply
about its kernel threads.

balancenuma had similar system CPU overhead to schednuma. Note how much of
a difference native THP migration made to the system CPU usage.

MMTests Statistics: vmstat
3.7.0 3.7.0 3.7.0 3.7.0 3.7.0 3.7.0
rc6-stats-v4r12rc6-schednuma-v16r2rc6-autonuma-v28fastr3rc6-moron-v4r38rc6-twostage-v4r38rc6-thpmigrate-v4r38
Page Ins 38416 38260 38272 38076 38384 38104
Page Outs 33340 34696 31912 31736 31980 31360
Swap Ins 0 0 0 0 0 0
Swap Outs 0 0 0 0 0 0
Direct pages scanned 0 0 0 0 0 0
Kswapd pages scanned 0 0 0 0 0 0
Kswapd pages reclaimed 0 0 0 0 0 0
Direct pages reclaimed 0 0 0 0 0 0
Kswapd efficiency 100% 100% 100% 100% 100% 100%
Kswapd velocity 0.000 0.000 0.000 0.000 0.000 0.000
Direct efficiency 100% 100% 100% 100% 100% 100%
Direct velocity 0.000 0.000 0.000 0.000 0.000 0.000
Percentage direct scans 0% 0% 0% 0% 0% 0%
Page writes by reclaim 0 0 0 0 0 0
Page writes file 0 0 0 0 0 0
Page writes anon 0 0 0 0 0 0
Page reclaim immediate 0 0 0 0 0 0
Page rescued immediate 0 0 0 0 0 0
Slabs scanned 0 0 0 0 0 0
Direct inode steals 0 0 0 0 0 0
Kswapd inode steals 0 0 0 0 0 0
Kswapd skipped wait 0 0 0 0 0 0
THP fault alloc 64863 53973 48980 61397 61028 62441
THP collapse alloc 60 53 2254 1667 1575 56
THP splits 342 175 2194 12729 11544 329
THP fault fallback 0 0 0 0 0 0
THP collapse fail 0 0 0 0 0 0
Compaction stalls 0 0 0 0 0 0
Compaction success 0 0 0 0 0 0
Compaction failures 0 0 0 0 0 0
Page migrate success 0 0 0 5087468 41144914 340035
Page migrate failure 0 0 0 0 0 0
Compaction pages isolated 0 0 0 0 0 0
Compaction migrate scanned 0 0 0 0 0 0
Compaction free scanned 0 0 0 0 0 0
Compaction cost 0 0 0 5280 42708 352
NUMA PTE updates 0 0 0 2997404728 393796213 521840907
NUMA hint faults 0 0 0 2739639942 328788995 3461566
NUMA hint local faults 0 0 0 709168519 83931322 815569
NUMA pages migrated 0 0 0 5087468 41144914 340035
AutoNUMA cost 0 0 0 13719278 1647483 20967

There are a lot of PTE updates and faults here but it's not completely crazy.

The main point to note is the THP figures. THP migration heavily reduces the
number of collapses and splits. Note however that all kernels showed some
THP activity, reflecting the fact that it's actually enabled this time.

I do not have data yet on running specjbb on single JVM instances. I probably
will not have it for a long time either, as I'm going to have to rerun more
schednuma tests with additional patches on top.

The remainder of this covers some more basic performance tests. Unfortunately I
do not have figures for the thpmigrate kernel as it's still running. However, I
would expect it to make very little difference to these results. If I'm wrong,
then whoops.

KERNBENCH
3.7.0 3.7.0 3.7.0 3.7.0 3.7.0
rc6-stats-v4r12 rc6-schednuma-v16r2rc6-autonuma-v28fastr3 rc6-moron-v4r38 rc6-twostage-v4r38
User min 1296.75 ( 0.00%) 1299.23 ( -0.19%) 1290.49 ( 0.48%) 1297.40 ( -0.05%) 1297.74 ( -0.08%)
User mean 1299.08 ( 0.00%) 1309.99 ( -0.84%) 1293.82 ( 0.41%) 1300.66 ( -0.12%) 1299.70 ( -0.05%)
User stddev 1.78 ( 0.00%) 7.65 (-329.49%) 3.62 (-103.18%) 1.90 ( -6.92%) 1.17 ( 34.25%)
User max 1301.82 ( 0.00%) 1319.59 ( -1.37%) 1300.12 ( 0.13%) 1303.27 ( -0.11%) 1301.23 ( 0.05%)
System min 121.16 ( 0.00%) 139.16 (-14.86%) 123.79 ( -2.17%) 124.58 ( -2.82%) 124.06 ( -2.39%)
System mean 121.26 ( 0.00%) 146.11 (-20.49%) 124.42 ( -2.60%) 124.97 ( -3.05%) 124.32 ( -2.52%)
System stddev 0.07 ( 0.00%) 3.59 (-4725.82%) 0.45 (-506.41%) 0.29 (-294.47%) 0.22 (-195.02%)
System max 121.37 ( 0.00%) 148.94 (-22.72%) 125.04 ( -3.02%) 125.48 ( -3.39%) 124.65 ( -2.70%)
Elapsed min 41.90 ( 0.00%) 44.92 ( -7.21%) 40.10 ( 4.30%) 40.85 ( 2.51%) 41.56 ( 0.81%)
Elapsed mean 42.47 ( 0.00%) 45.74 ( -7.69%) 41.23 ( 2.93%) 42.49 ( -0.05%) 42.42 ( 0.13%)
Elapsed stddev 0.44 ( 0.00%) 0.52 (-17.51%) 0.93 (-110.57%) 1.01 (-129.42%) 0.74 (-68.20%)
Elapsed max 43.06 ( 0.00%) 46.51 ( -8.01%) 42.19 ( 2.02%) 43.56 ( -1.16%) 43.70 ( -1.49%)
CPU min 3300.00 ( 0.00%) 3133.00 ( 5.06%) 3354.00 ( -1.64%) 3277.00 ( 0.70%) 3257.00 ( 1.30%)
CPU mean 3343.80 ( 0.00%) 3183.20 ( 4.80%) 3441.00 ( -2.91%) 3356.20 ( -0.37%) 3357.20 ( -0.40%)
CPU stddev 36.31 ( 0.00%) 39.99 (-10.14%) 82.80 (-128.06%) 81.41 (-124.23%) 59.23 (-63.13%)
CPU max 3395.00 ( 0.00%) 3242.00 ( 4.51%) 3552.00 ( -4.62%) 3489.00 ( -2.77%) 3428.00 ( -0.97%)

schednuma has improved a lot here. It used to be a 50% regression; now
it's just a 7.69% regression.

autonuma showed a small gain but it's within 2*stddev so I would not get
too excited.

balancenuma is comparable to the baseline kernel, which is what you'd expect.

MMTests Statistics: duration
3.7.0 3.7.0 3.7.0 3.7.0 3.7.0
rc6-stats-v4r12rc6-schednuma-v16r2rc6-autonuma-v28fastr3rc6-moron-v4r38rc6-twostage-v4r38
User 7809.47 8426.10 7798.15 7834.32 7831.34
System 748.23 967.97 767.00 771.10 767.15
Elapsed 303.48 340.40 297.36 304.79 303.16

schednuma is showing much higher system CPU usage. autonuma and balancenuma
are showing some increase too.

MMTests Statistics: vmstat
3.7.0 3.7.0 3.7.0 3.7.0 3.7.0
rc6-stats-v4r12rc6-schednuma-v16r2rc6-autonuma-v28fastr3rc6-moron-v4r38rc6-twostage-v4r38
Page Ins 336 96 0 84 60
Page Outs 1606596 1565384 1470956 1477020 1682808
Swap Ins 0 0 0 0 0
Swap Outs 0 0 0 0 0
Direct pages scanned 0 0 0 0 0
Kswapd pages scanned 0 0 0 0 0
Kswapd pages reclaimed 0 0 0 0 0
Direct pages reclaimed 0 0 0 0 0
Kswapd efficiency 100% 100% 100% 100% 100%
Kswapd velocity 0.000 0.000 0.000 0.000 0.000
Direct efficiency 100% 100% 100% 100% 100%
Direct velocity 0.000 0.000 0.000 0.000 0.000
Percentage direct scans 0% 0% 0% 0% 0%
Page writes by reclaim 0 0 0 0 0
Page writes file 0 0 0 0 0
Page writes anon 0 0 0 0 0
Page reclaim immediate 0 0 0 0 0
Page rescued immediate 0 0 0 0 0
Slabs scanned 0 0 0 0 0
Direct inode steals 0 0 0 0 0
Kswapd inode steals 0 0 0 0 0
Kswapd skipped wait 0 0 0 0 0
THP fault alloc 373 331 392 334 338
THP collapse alloc 7 1 9913 57 69
THP splits 2 2 340 45 18
THP fault fallback 0 0 0 0 0
THP collapse fail 0 0 0 0 0
Compaction stalls 0 0 0 0 0
Compaction success 0 0 0 0 0
Compaction failures 0 0 0 0 0
Page migrate success 0 0 0 20870 567171
Page migrate failure 0 0 0 0 0
Compaction pages isolated 0 0 0 0 0
Compaction migrate scanned 0 0 0 0 0
Compaction free scanned 0 0 0 0 0
Compaction cost 0 0 0 21 588
NUMA PTE updates 0 0 0 104807469 108314529
NUMA hint faults 0 0 0 67587495 67487394
NUMA hint local faults 0 0 0 53813675 64082455
NUMA pages migrated 0 0 0 20870 567171
AutoNUMA cost 0 0 0 338671 338205

Ok... wow. schednuma does not report how many updates it made, but look
at balancenuma. It's updating PTEs and migrating pages for short-lived
processes from a kernel build. Some of these updates will be against the
monitors themselves but the figure is too high to be only the monitors. This
is a big surprise to me and indicates that the delayed start of scanning is
still too early, or that there needs to be better identification of processes
that do not care about NUMA.
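
A simple mitigation would be to not start scanning at all until a task has
accumulated a reasonable amount of CPU time, along the lines of the standalone
sketch below. The names and the 1 second threshold are made up for
illustration; the point is that short-lived kernel-build children would never
reach the threshold and so never take the PTE update cost.

/* Illustrative only: delay NUMA PTE scanning for short-lived tasks. */
struct task_numa {
	unsigned long long runtime_ns;	 /* CPU time consumed so far */
	unsigned long long next_scan_ns; /* earliest time of the next scan */
};

#define NUMA_SCAN_DELAY_NS	(1000ULL * 1000 * 1000)	/* 1s of CPU time */

static int numa_should_scan(const struct task_numa *tn,
			    unsigned long long now_ns)
{
	/* A compiler invocation rarely lives this long, so it never scans. */
	if (tn->runtime_ns < NUMA_SCAN_DELAY_NS)
		return 0;

	return now_ns >= tn->next_scan_ns;
}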

AIM9
3.7.0 3.7.0 3.7.0 3.7.0 3.7.0
rc6-stats-v4r12 rc6-schednuma-v16r2rc6-autonuma-v28fastr3 rc6-moron-v4r38 rc6-twostage-v4r38
Min page_test 387600.00 ( 0.00%) 268486.67 (-30.73%) 356875.42 ( -7.93%) 342718.19 (-11.58%) 361405.73 ( -6.76%)
Min brk_test 2350099.93 ( 0.00%) 1996933.33 (-15.03%) 2198334.44 ( -6.46%) 2360733.33 ( 0.45%) 1856295.80 (-21.01%)
Min exec_test 255.99 ( 0.00%) 261.98 ( 2.34%) 273.15 ( 6.70%) 254.50 ( -0.58%) 257.33 ( 0.52%)
Min fork_test 1416.22 ( 0.00%) 1422.87 ( 0.47%) 1678.88 ( 18.55%) 1364.85 ( -3.63%) 1404.79 ( -0.81%)
Mean page_test 393893.69 ( 0.00%) 299688.63 (-23.92%) 374714.36 ( -4.87%) 377638.64 ( -4.13%) 373460.48 ( -5.19%)
Mean brk_test 2372673.79 ( 0.00%) 2221715.20 ( -6.36%) 2348968.24 ( -1.00%) 2394503.04 ( 0.92%) 2073987.04 (-12.59%)
Mean exec_test 258.91 ( 0.00%) 264.89 ( 2.31%) 280.17 ( 8.21%) 259.41 ( 0.19%) 260.94 ( 0.78%)
Mean fork_test 1428.88 ( 0.00%) 1447.96 ( 1.34%) 1812.08 ( 26.82%) 1398.49 ( -2.13%) 1430.22 ( 0.09%)
Stddev page_test 2689.70 ( 0.00%) 19221.87 (614.65%) 12994.24 (383.11%) 15871.82 (490.10%) 11104.15 (312.84%)
Stddev brk_test 11440.58 ( 0.00%) 174875.02 (1428.55%) 59011.99 (415.81%) 20870.31 ( 82.42%) 92043.46 (704.54%)
Stddev exec_test 1.42 ( 0.00%) 2.08 ( 46.59%) 6.06 (325.92%) 3.60 (152.88%) 1.80 ( 26.77%)
Stddev fork_test 8.30 ( 0.00%) 14.34 ( 72.70%) 48.64 (485.78%) 25.26 (204.22%) 17.05 (105.39%)
Max page_test 397800.00 ( 0.00%) 342833.33 (-13.82%) 396326.67 ( -0.37%) 393117.92 ( -1.18%) 391645.57 ( -1.55%)
Max brk_test 2386800.00 ( 0.00%) 2381133.33 ( -0.24%) 2416266.67 ( 1.23%) 2428733.33 ( 1.76%) 2245902.73 ( -5.90%)
Max exec_test 261.65 ( 0.00%) 267.82 ( 2.36%) 294.80 ( 12.67%) 266.00 ( 1.66%) 264.98 ( 1.27%)
Max fork_test 1446.58 ( 0.00%) 1468.44 ( 1.51%) 1869.59 ( 29.24%) 1454.18 ( 0.53%) 1475.08 ( 1.97%)

Straight up, I find aim9 to be generally unreliable; it can show regressions
and gains for all sorts of unrelated nonsense. I keep running it because
over long enough periods of time it can still identify trends.

schednuma is regressing 23% in the page fault microbenchmark. autonuma
and balancenuma are also showing regressions. Not as bad, but not great
by any means. brk_test is also showing regressions and here balancenuma
is showing quite a bit of hurt.

MMTests Statistics: duration
3.7.0 3.7.0 3.7.0 3.7.0 3.7.0
rc6-stats-v4r12rc6-schednuma-v16r2rc6-autonuma-v28fastr3rc6-moron-v4r38rc6-twostage-v4r38
User 2.77 2.81 2.88 2.76 2.76
System 0.76 0.72 0.74 0.74 0.74
Elapsed 724.78 724.58 724.40 724.61 724.53

This is not reflected in system CPU usage though. The cost is somewhere else.

MMTests Statistics: vmstat
3.7.0 3.7.0 3.7.0 3.7.0 3.7.0
rc6-stats-v4r12rc6-schednuma-v16r2rc6-autonuma-v28fastr3rc6-moron-v4r38rc6-twostage-v4r38
Page Ins 7124 7096 6964 7388 7032
Page Outs 74380 73996 74324 73800 74576
Swap Ins 0 0 0 0 0
Swap Outs 0 0 0 0 0
Direct pages scanned 0 0 0 0 0
Kswapd pages scanned 0 0 0 0 0
Kswapd pages reclaimed 0 0 0 0 0
Direct pages reclaimed 0 0 0 0 0
Kswapd efficiency 100% 100% 100% 100% 100%
Kswapd velocity 0.000 0.000 0.000 0.000 0.000
Direct efficiency 100% 100% 100% 100% 100%
Direct velocity 0.000 0.000 0.000 0.000 0.000
Percentage direct scans 0% 0% 0% 0% 0%
Page writes by reclaim 0 0 0 0 0
Page writes file 0 0 0 0 0
Page writes anon 0 0 0 0 0
Page reclaim immediate 0 0 0 0 0
Page rescued immediate 0 0 0 0 0
Slabs scanned 0 0 0 0 0
Direct inode steals 0 0 0 0 0
Kswapd inode steals 0 0 0 0 0
Kswapd skipped wait 0 0 0 0 0
THP fault alloc 36 2 23 0 1
THP collapse alloc 0 0 8 8 1
THP splits 0 0 8 8 1
THP fault fallback 0 0 0 0 0
THP collapse fail 0 0 0 0 0
Compaction stalls 0 0 0 0 0
Compaction success 0 0 0 0 0
Compaction failures 0 0 0 0 0
Page migrate success 0 0 0 236 475
Page migrate failure 0 0 0 0 0
Compaction pages isolated 0 0 0 0 0
Compaction migrate scanned 0 0 0 0 0
Compaction free scanned 0 0 0 0 0
Compaction cost 0 0 0 0 0
NUMA PTE updates 0 0 0 21404376 40316461
NUMA hint faults 0 0 0 76711 10144
NUMA hint local faults 0 0 0 21258 9628
NUMA pages migrated 0 0 0 236 475
AutoNUMA cost 0 0 0 533 332

In balancenuma, you can see that it's taking NUMA faults and migrating. Maybe
schednuma is doing the same.

HACKBENCH PIPES
3.7.0 3.7.0 3.7.0 3.7.0 3.7.0
rc6-stats-v4r12 rc6-schednuma-v16r2rc6-autonuma-v28fastr3 rc6-moron-v4r38 rc6-twostage-v4r38
Procs 1 0.0320 ( 0.00%) 0.0354 (-10.53%) 0.0410 (-28.28%) 0.0310 ( 3.00%) 0.0296 ( 7.55%)
Procs 4 0.0560 ( 0.00%) 0.0699 (-24.87%) 0.0641 (-14.47%) 0.0556 ( 0.79%) 0.0562 ( -0.36%)
Procs 8 0.0850 ( 0.00%) 0.1084 (-27.51%) 0.1397 (-64.30%) 0.0833 ( 1.96%) 0.0953 (-12.07%)
Procs 12 0.1047 ( 0.00%) 0.1084 ( -3.54%) 0.1789 (-70.91%) 0.0990 ( 5.44%) 0.1127 ( -7.72%)
Procs 16 0.1276 ( 0.00%) 0.1323 ( -3.67%) 0.1395 ( -9.34%) 0.1236 ( 3.16%) 0.1240 ( 2.83%)
Procs 20 0.1405 ( 0.00%) 0.1578 (-12.29%) 0.2452 (-74.52%) 0.1471 ( -4.73%) 0.1454 ( -3.50%)
Procs 24 0.1823 ( 0.00%) 0.1800 ( 1.24%) 0.3030 (-66.22%) 0.1776 ( 2.58%) 0.1574 ( 13.63%)
Procs 28 0.2019 ( 0.00%) 0.2143 ( -6.13%) 0.3403 (-68.52%) 0.2000 ( 0.94%) 0.1983 ( 1.78%)
Procs 32 0.2162 ( 0.00%) 0.2329 ( -7.71%) 0.6526 (-201.85%) 0.2235 ( -3.36%) 0.2158 ( 0.20%)
Procs 36 0.2354 ( 0.00%) 0.2577 ( -9.47%) 0.4468 (-89.77%) 0.2619 (-11.24%) 0.2451 ( -4.11%)
Procs 40 0.2600 ( 0.00%) 0.2850 ( -9.62%) 0.5247 (-101.79%) 0.2724 ( -4.77%) 0.2646 ( -1.75%)

The number of processes hackbench is running is too low here for a 48-core
machine. It should have been reconfigured, but this is better than nothing.

schednuma and autonuma both show large regressions in performance here.
I did not investigate why, but as there are a number of scheduler changes
it could be anything.

balancenuma shows some impact on the figures, but it's a mix of gains and losses.

MMTests Statistics: duration
3.7.0 3.7.0 3.7.0 3.7.0 3.7.0
rc6-stats-v4r12rc6-schednuma-v16r2rc6-autonuma-v28fastr3rc6-moron-v4r38rc6-twostage-v4r38
User 65.98 75.68 68.61 61.40 62.96
System 1934.87 2129.32 2104.72 1958.01 1902.99
Elapsed 100.52 106.29 153.66 102.06 99.96

Nothing major there. schednuma's system CPU usage is higher, which might explain it.

MMTests Statistics: vmstat
3.7.0 3.7.0 3.7.0 3.7.0 3.7.0
rc6-stats-v4r12rc6-schednuma-v16r2rc6-autonuma-v28fastr3rc6-moron-v4r38rc6-twostage-v4r38
Page Ins 24 24 24 24 24
Page Outs 2092 1840 2636 1948 1912
Swap Ins 0 0 0 0 0
Swap Outs 0 0 0 0 0
Direct pages scanned 0 0 0 0 0
Kswapd pages scanned 0 0 0 0 0
Kswapd pages reclaimed 0 0 0 0 0
Direct pages reclaimed 0 0 0 0 0
Kswapd efficiency 100% 100% 100% 100% 100%
Kswapd velocity 0.000 0.000 0.000 0.000 0.000
Direct efficiency 100% 100% 100% 100% 100%
Direct velocity 0.000 0.000 0.000 0.000 0.000
Percentage direct scans 0% 0% 0% 0% 0%
Page writes by reclaim 0 0 0 0 0
Page writes file 0 0 0 0 0
Page writes anon 0 0 0 0 0
Page reclaim immediate 0 0 0 0 0
Page rescued immediate 0 0 0 0 0
Slabs scanned 0 0 0 0 0
Direct inode steals 0 0 0 0 0
Kswapd inode steals 0 0 0 0 0
Kswapd skipped wait 0 0 0 0 0
THP fault alloc 6 0 0 0 0
THP collapse alloc 0 0 0 3 0
THP splits 0 0 0 0 0
THP fault fallback 0 0 0 0 0
THP collapse fail 0 0 0 0 0
Compaction stalls 0 0 0 0 0
Compaction success 0 0 0 0 0
Compaction failures 0 0 0 0 0
Page migrate success 0 0 0 84 0
Page migrate failure 0 0 0 0 0
Compaction pages isolated 0 0 0 0 0
Compaction migrate scanned 0 0 0 0 0
Compaction free scanned 0 0 0 0 0
Compaction cost 0 0 0 0 0
NUMA PTE updates 0 0 0 152332 0
NUMA hint faults 0 0 0 21271 3
NUMA hint local faults 0 0 0 6778 0
NUMA pages migrated 0 0 0 84 0
AutoNUMA cost 0 0 0 107 0

Big surprise: moron-v4r38 was updating PTEs, so some process was living long
enough. It could have been the monitors though.

HACKBENCH SOCKETS
3.7.0 3.7.0 3.7.0 3.7.0 3.7.0
rc6-stats-v4r12 rc6-schednuma-v16r2rc6-autonuma-v28fastr3 rc6-moron-v4r38 rc6-twostage-v4r38
Procs 1 0.0260 ( 0.00%) 0.0320 (-23.08%) 0.0259 ( 0.55%) 0.0285 ( -9.62%) 0.0274 ( -5.57%)
Procs 4 0.0512 ( 0.00%) 0.0471 ( 7.99%) 0.0864 (-68.81%) 0.0481 ( 5.97%) 0.0469 ( 8.37%)
Procs 8 0.0739 ( 0.00%) 0.0782 ( -5.84%) 0.0823 (-11.41%) 0.0699 ( 5.38%) 0.0762 ( -3.12%)
Procs 12 0.0999 ( 0.00%) 0.1011 ( -1.18%) 0.1130 (-13.09%) 0.0961 ( 3.86%) 0.0977 ( 2.27%)
Procs 16 0.1270 ( 0.00%) 0.1311 ( -3.24%) 0.3777 (-197.40%) 0.1252 ( 1.38%) 0.1286 ( -1.29%)
Procs 20 0.1568 ( 0.00%) 0.1624 ( -3.56%) 0.3955 (-152.14%) 0.1568 ( -0.00%) 0.1566 ( 0.13%)
Procs 24 0.1845 ( 0.00%) 0.1914 ( -3.75%) 0.4127 (-123.73%) 0.1853 ( -0.47%) 0.1844 ( 0.06%)
Procs 28 0.2172 ( 0.00%) 0.2247 ( -3.48%) 0.5268 (-142.60%) 0.2163 ( 0.40%) 0.2230 ( -2.71%)
Procs 32 0.2505 ( 0.00%) 0.2553 ( -1.93%) 0.5334 (-112.96%) 0.2489 ( 0.63%) 0.2487 ( 0.72%)
Procs 36 0.2830 ( 0.00%) 0.2872 ( -1.47%) 0.7256 (-156.39%) 0.2787 ( 1.53%) 0.2751 ( 2.79%)
Procs 40 0.3041 ( 0.00%) 0.3200 ( -5.22%) 0.9365 (-207.91%) 0.3100 ( -1.93%) 0.3134 ( -3.04%)

schednuma is showing small regressions here.

autonuma showed massive regressions here.

balancenuma is ok because scheduler decisions are mostly left alone; it's
the PTE NUMA updates where it kicks in.


MMTests Statistics: duration
3.7.0 3.7.0 3.7.0 3.7.0 3.7.0
rc6-stats-v4r12rc6-schednuma-v16r2rc6-autonuma-v28fastr3rc6-moron-v4r38rc6-twostage-v4r38
User 43.39 48.16 46.27 39.19 38.39
System 2305.48 2339.98 2461.69 2271.80 2265.79
Elapsed 109.65 111.15 173.41 108.75 108.52

Nothing major there.


MMTests Statistics: vmstat
3.7.0 3.7.0 3.7.0 3.7.0 3.7.0
rc6-stats-v4r12rc6-schednuma-v16r2rc6-autonuma-v28fastr3rc6-moron-v4r38rc6-twostage-v4r38
Page Ins 4 4 4 4 4
Page Outs 1848 1840 2672 1788 1896
Swap Ins 0 0 0 0 0
Swap Outs 0 0 0 0 0
Direct pages scanned 0 0 0 0 0
Kswapd pages scanned 0 0 0 0 0
Kswapd pages reclaimed 0 0 0 0 0
Direct pages reclaimed 0 0 0 0 0
Kswapd efficiency 100% 100% 100% 100% 100%
Kswapd velocity 0.000 0.000 0.000 0.000 0.000
Direct efficiency 100% 100% 100% 100% 100%
Direct velocity 0.000 0.000 0.000 0.000 0.000
Percentage direct scans 0% 0% 0% 0% 0%
Page writes by reclaim 0 0 0 0 0
Page writes file 0 0 0 0 0
Page writes anon 0 0 0 0 0
Page reclaim immediate 0 0 0 0 0
Page rescued immediate 0 0 0 0 0
Slabs scanned 0 0 0 0 0
Direct inode steals 0 0 0 0 0
Kswapd inode steals 0 0 0 0 0
Kswapd skipped wait 0 0 0 0 0
THP fault alloc 6 0 0 0 0
THP collapse alloc 1 0 3 0 0
THP splits 0 0 3 3 0
THP fault fallback 0 0 0 0 0
THP collapse fail 0 0 0 0 0
Compaction stalls 0 0 0 0 0
Compaction success 0 0 0 0 0
Compaction failures 0 0 0 0 0
Page migrate success 0 0 0 96 0
Page migrate failure 0 0 0 0 0
Compaction pages isolated 0 0 0 0 0
Compaction migrate scanned 0 0 0 0 0
Compaction free scanned 0 0 0 0 0
Compaction cost 0 0 0 0 0
NUMA PTE updates 0 0 0 117626 0
NUMA hint faults 0 0 0 11781 0
NUMA hint local faults 0 0 0 2785 0
NUMA pages migrated 0 0 0 96 0
AutoNUMA cost 0 0 0 59 0

Some PTE updates from moron-v4r38 again. Again, it could be the monitors.


I ran the STREAM benchmark but it's long and there was nothing interesting
to report. Performance was flat and there was some migration activity,
which is bad, but as STREAM is long-lived for larger amounts of memory
it was not too surprising. It deserves better investigation but is relatively
low priority when it showed no regressions.

PAGE FAULT TEST

This is a microbenchmark for page faults. The number of clients is badly
ordered in the output which, again, I really should fix.

3.7.0 3.7.0 3.7.0 3.7.0 3.7.0
rc6-stats-v4r12 rc6-schednuma-v16r2rc6-autonuma-v28fastr3 rc6-moron-v4r38 rc6-twostage-v4r38
System 1 8.0710 ( 0.00%) 8.1085 ( -0.46%) 8.0925 ( -0.27%) 8.0170 ( 0.67%) 37.3075 (-362.24%
System 10 9.4975 ( 0.00%) 9.5690 ( -0.75%) 12.0055 (-26.41%) 9.5915 ( -0.99%) 9.5835 ( -0.91%)
System 11 9.7740 ( 0.00%) 9.7915 ( -0.18%) 13.4890 (-38.01%) 9.7275 ( 0.48%) 9.6810 ( 0.95%)
System 12 9.6300 ( 0.00%) 9.7065 ( -0.79%) 13.6075 (-41.30%) 9.8320 ( -2.10%) 9.7365 ( -1.11%)
System 13 10.3300 ( 0.00%) 10.2560 ( 0.72%) 17.2815 (-67.29%) 10.2435 ( 0.84%) 10.2480 ( 0.79%)
System 14 10.7300 ( 0.00%) 10.6860 ( 0.41%) 13.5335 (-26.13%) 10.5975 ( 1.23%) 10.6490 ( 0.75%)
System 15 10.7860 ( 0.00%) 10.8695 ( -0.77%) 18.8370 (-74.64%) 10.7860 ( 0.00%) 10.7685 ( 0.16%)
System 16 11.2070 ( 0.00%) 11.3730 ( -1.48%) 17.6445 (-57.44%) 11.1970 ( 0.09%) 11.2270 ( -0.18%)
System 17 11.8695 ( 0.00%) 11.9420 ( -0.61%) 15.7420 (-32.63%) 11.8660 ( 0.03%) 11.8465 ( 0.19%)
System 18 12.3110 ( 0.00%) 12.3800 ( -0.56%) 18.7010 (-51.90%) 12.4065 ( -0.78%) 12.3975 ( -0.70%)
System 19 12.8610 ( 0.00%) 13.0375 ( -1.37%) 17.5450 (-36.42%) 12.9510 ( -0.70%) 13.0045 ( -1.12%)
System 2 8.0750 ( 0.00%) 8.1405 ( -0.81%) 8.2075 ( -1.64%) 8.0670 ( 0.10%) 11.5805 (-43.41%)
System 20 13.5975 ( 0.00%) 13.4650 ( 0.97%) 17.6630 (-29.90%) 13.4485 ( 1.10%) 13.2655 ( 2.44%)
System 21 13.9945 ( 0.00%) 14.1510 ( -1.12%) 16.6380 (-18.89%) 13.9305 ( 0.46%) 13.9215 ( 0.52%)
System 22 14.5055 ( 0.00%) 14.6145 ( -0.75%) 19.8770 (-37.03%) 14.5555 ( -0.34%) 14.6435 ( -0.95%)
System 23 15.0345 ( 0.00%) 15.2365 ( -1.34%) 19.6190 (-30.49%) 15.0930 ( -0.39%) 15.2005 ( -1.10%)
System 24 15.5565 ( 0.00%) 15.7380 ( -1.17%) 20.5575 (-32.15%) 15.5965 ( -0.26%) 15.6015 ( -0.29%)
System 25 16.1795 ( 0.00%) 16.3190 ( -0.86%) 21.6805 (-34.00%) 16.1595 ( 0.12%) 16.2315 ( -0.32%)
System 26 17.0595 ( 0.00%) 16.9270 ( 0.78%) 19.8575 (-16.40%) 16.9075 ( 0.89%) 16.7940 ( 1.56%)
System 27 17.3200 ( 0.00%) 17.4150 ( -0.55%) 19.2015 (-10.86%) 17.5160 ( -1.13%) 17.3045 ( 0.09%)
System 28 17.9900 ( 0.00%) 18.0230 ( -0.18%) 20.3495 (-13.12%) 18.0700 ( -0.44%) 17.8465 ( 0.80%)
System 29 18.5160 ( 0.00%) 18.6785 ( -0.88%) 21.1070 (-13.99%) 18.5375 ( -0.12%) 18.5735 ( -0.31%)
System 3 8.1575 ( 0.00%) 8.2200 ( -0.77%) 8.3190 ( -1.98%) 8.2200 ( -0.77%) 9.5105 (-16.59%)
System 30 19.2095 ( 0.00%) 19.4355 ( -1.18%) 22.2920 (-16.05%) 19.1850 ( 0.13%) 19.1160 ( 0.49%)
System 31 19.7165 ( 0.00%) 19.7785 ( -0.31%) 21.5625 ( -9.36%) 19.7635 ( -0.24%) 20.0735 ( -1.81%)
System 32 20.5370 ( 0.00%) 20.5395 ( -0.01%) 22.7315 (-10.69%) 20.2400 ( 1.45%) 20.2930 ( 1.19%)
System 33 20.9265 ( 0.00%) 21.3055 ( -1.81%) 22.2900 ( -6.52%) 20.9520 ( -0.12%) 21.0705 ( -0.69%)
System 34 21.9625 ( 0.00%) 21.7200 ( 1.10%) 24.1665 (-10.04%) 21.5605 ( 1.83%) 21.6485 ( 1.43%)
System 35 22.3010 ( 0.00%) 22.4145 ( -0.51%) 23.5105 ( -5.42%) 22.3475 ( -0.21%) 22.4405 ( -0.63%)
System 36 23.0040 ( 0.00%) 23.0160 ( -0.05%) 23.8965 ( -3.88%) 23.2190 ( -0.93%) 22.9625 ( 0.18%)
System 37 23.6785 ( 0.00%) 23.7325 ( -0.23%) 24.8125 ( -4.79%) 23.7495 ( -0.30%) 23.6925 ( -0.06%)
System 38 24.7495 ( 0.00%) 24.8330 ( -0.34%) 25.0045 ( -1.03%) 24.2465 ( 2.03%) 24.3775 ( 1.50%)
System 39 25.0975 ( 0.00%) 25.1845 ( -0.35%) 25.8640 ( -3.05%) 25.0515 ( 0.18%) 25.0655 ( 0.13%)
System 4 8.2660 ( 0.00%) 8.3770 ( -1.34%) 9.0370 ( -9.33%) 8.3380 ( -0.87%) 8.6195 ( -4.28%)
System 40 25.9170 ( 0.00%) 26.1390 ( -0.86%) 25.7945 ( 0.47%) 25.8330 ( 0.32%) 25.7755 ( 0.55%)
System 41 26.4745 ( 0.00%) 26.6030 ( -0.49%) 26.0005 ( 1.79%) 26.4665 ( 0.03%) 26.6990 ( -0.85%)
System 42 27.4050 ( 0.00%) 27.4030 ( 0.01%) 27.1415 ( 0.96%) 27.4045 ( 0.00%) 27.1995 ( 0.75%)
System 43 27.9820 ( 0.00%) 28.3140 ( -1.19%) 27.2640 ( 2.57%) 28.1045 ( -0.44%) 28.0070 ( -0.09%)
System 44 28.7245 ( 0.00%) 28.9940 ( -0.94%) 27.4990 ( 4.27%) 28.6740 ( 0.18%) 28.6515 ( 0.25%)
System 45 29.5315 ( 0.00%) 29.8435 ( -1.06%) 28.3015 ( 4.17%) 29.5350 ( -0.01%) 29.3825 ( 0.50%)
System 46 30.2260 ( 0.00%) 30.5220 ( -0.98%) 28.3505 ( 6.20%) 30.2100 ( 0.05%) 30.2865 ( -0.20%)
System 47 31.0865 ( 0.00%) 31.3480 ( -0.84%) 28.6695 ( 7.78%) 30.9940 ( 0.30%) 30.9930 ( 0.30%)
System 48 31.5745 ( 0.00%) 31.9750 ( -1.27%) 28.8480 ( 8.64%) 31.6925 ( -0.37%) 31.6355 ( -0.19%)
System 5 8.5895 ( 0.00%) 8.6365 ( -0.55%) 10.7745 (-25.44%) 8.6905 ( -1.18%) 8.7105 ( -1.41%)
System 6 8.8350 ( 0.00%) 8.8820 ( -0.53%) 10.7165 (-21.30%) 8.8105 ( 0.28%) 8.8090 ( 0.29%)
System 7 8.9120 ( 0.00%) 8.9095 ( 0.03%) 10.0140 (-12.37%) 8.9440 ( -0.36%) 9.0585 ( -1.64%)
System 8 8.8235 ( 0.00%) 8.9295 ( -1.20%) 10.3175 (-16.93%) 8.9185 ( -1.08%) 8.8695 ( -0.52%)
System 9 9.4775 ( 0.00%) 9.5080 ( -0.32%) 10.9855 (-15.91%) 9.4815 ( -0.04%) 9.4435 ( 0.36%)

autonuma shows high system CPU usage overhead here.

schednuma and balancenuma show some but it's not crazy. Processes are likely
too short-lived for NUMA scanning to matter much.

Elapsed 1 8.7755 ( 0.00%) 8.8080 ( -0.37%) 8.7870 ( -0.13%) 8.7060 ( 0.79%) 38.0820 (-333.96%)
Elapsed 10 1.0985 ( 0.00%) 1.0965 ( 0.18%) 1.3965 (-27.13%) 1.1120 ( -1.23%) 1.1070 ( -0.77%)
Elapsed 11 1.0280 ( 0.00%) 1.0340 ( -0.58%) 1.4540 (-41.44%) 1.0220 ( 0.58%) 1.0160 ( 1.17%)
Elapsed 12 0.9155 ( 0.00%) 0.9250 ( -1.04%) 1.3995 (-52.87%) 0.9430 ( -3.00%) 0.9455 ( -3.28%)
Elapsed 13 0.9500 ( 0.00%) 0.9325 ( 1.84%) 1.6625 (-75.00%) 0.9345 ( 1.63%) 0.9470 ( 0.32%)
Elapsed 14 0.8910 ( 0.00%) 0.9000 ( -1.01%) 1.2435 (-39.56%) 0.8835 ( 0.84%) 0.9005 ( -1.07%)
Elapsed 15 0.8245 ( 0.00%) 0.8290 ( -0.55%) 1.7575 (-113.16%) 0.8250 ( -0.06%) 0.8205 ( 0.49%)
Elapsed 16 0.8050 ( 0.00%) 0.8040 ( 0.12%) 1.5650 (-94.41%) 0.7980 ( 0.87%) 0.8140 ( -1.12%)
Elapsed 17 0.8365 ( 0.00%) 0.8440 ( -0.90%) 1.3350 (-59.59%) 0.8355 ( 0.12%) 0.8305 ( 0.72%)
Elapsed 18 0.8015 ( 0.00%) 0.8030 ( -0.19%) 1.5420 (-92.39%) 0.8040 ( -0.31%) 0.8000 ( 0.19%)
Elapsed 19 0.7700 ( 0.00%) 0.7720 ( -0.26%) 1.4410 (-87.14%) 0.7770 ( -0.91%) 0.7805 ( -1.36%)
Elapsed 2 4.4485 ( 0.00%) 4.4850 ( -0.82%) 4.5230 ( -1.67%) 4.4145 ( 0.76%) 6.2950 (-41.51%)
Elapsed 20 0.7725 ( 0.00%) 0.7565 ( 2.07%) 1.4245 (-84.40%) 0.7580 ( 1.88%) 0.7485 ( 3.11%)
Elapsed 21 0.7965 ( 0.00%) 0.8135 ( -2.13%) 1.2630 (-58.57%) 0.7995 ( -0.38%) 0.8055 ( -1.13%)
Elapsed 22 0.7785 ( 0.00%) 0.7785 ( 0.00%) 1.5505 (-99.17%) 0.7940 ( -1.99%) 0.7905 ( -1.54%)
Elapsed 23 0.7665 ( 0.00%) 0.7700 ( -0.46%) 1.5335 (-100.07%) 0.7605 ( 0.78%) 0.7905 ( -3.13%)
Elapsed 24 0.7655 ( 0.00%) 0.7630 ( 0.33%) 1.5210 (-98.69%) 0.7455 ( 2.61%) 0.7660 ( -0.07%)
Elapsed 25 0.8430 ( 0.00%) 0.8580 ( -1.78%) 1.6220 (-92.41%) 0.8565 ( -1.60%) 0.8640 ( -2.49%)
Elapsed 26 0.8585 ( 0.00%) 0.8385 ( 2.33%) 1.3195 (-53.70%) 0.8240 ( 4.02%) 0.8480 ( 1.22%)
Elapsed 27 0.8195 ( 0.00%) 0.8115 ( 0.98%) 1.2000 (-46.43%) 0.8165 ( 0.37%) 0.8060 ( 1.65%)
Elapsed 28 0.7985 ( 0.00%) 0.7845 ( 1.75%) 1.2925 (-61.87%) 0.8085 ( -1.25%) 0.8020 ( -0.44%)
Elapsed 29 0.7995 ( 0.00%) 0.7995 ( 0.00%) 1.3140 (-64.35%) 0.8135 ( -1.75%) 0.8050 ( -0.69%)
Elapsed 3 3.0140 ( 0.00%) 3.0110 ( 0.10%) 3.0735 ( -1.97%) 3.0230 ( -0.30%) 3.4670 (-15.03%)
Elapsed 30 0.8075 ( 0.00%) 0.7935 ( 1.73%) 1.3905 (-72.20%) 0.8045 ( 0.37%) 0.8000 ( 0.93%)
Elapsed 31 0.7895 ( 0.00%) 0.7735 ( 2.03%) 1.2075 (-52.94%) 0.8015 ( -1.52%) 0.8135 ( -3.04%)
Elapsed 32 0.8055 ( 0.00%) 0.7745 ( 3.85%) 1.3090 (-62.51%) 0.7705 ( 4.35%) 0.7815 ( 2.98%)
Elapsed 33 0.7860 ( 0.00%) 0.7710 ( 1.91%) 1.1485 (-46.12%) 0.7850 ( 0.13%) 0.7985 ( -1.59%)
Elapsed 34 0.7950 ( 0.00%) 0.7750 ( 2.52%) 1.4080 (-77.11%) 0.7800 ( 1.89%) 0.7870 ( 1.01%)
Elapsed 35 0.7900 ( 0.00%) 0.7720 ( 2.28%) 1.1245 (-42.34%) 0.7965 ( -0.82%) 0.8230 ( -4.18%)
Elapsed 36 0.7930 ( 0.00%) 0.7600 ( 4.16%) 1.1240 (-41.74%) 0.8150 ( -2.77%) 0.7875 ( 0.69%)
Elapsed 37 0.7830 ( 0.00%) 0.7565 ( 3.38%) 1.2870 (-64.37%) 0.7860 ( -0.38%) 0.7795 ( 0.45%)
Elapsed 38 0.8035 ( 0.00%) 0.7960 ( 0.93%) 1.1955 (-48.79%) 0.7700 ( 4.17%) 0.7695 ( 4.23%)
Elapsed 39 0.7760 ( 0.00%) 0.7680 ( 1.03%) 1.3305 (-71.46%) 0.7700 ( 0.77%) 0.7820 ( -0.77%)
Elapsed 4 2.2845 ( 0.00%) 2.3185 ( -1.49%) 2.4895 ( -8.97%) 2.3010 ( -0.72%) 2.4175 ( -5.82%)
Elapsed 40 0.7710 ( 0.00%) 0.7720 ( -0.13%) 1.0095 (-30.93%) 0.7655 ( 0.71%) 0.7670 ( 0.52%)
Elapsed 41 0.7880 ( 0.00%) 0.7510 ( 4.70%) 1.1440 (-45.18%) 0.7590 ( 3.68%) 0.7985 ( -1.33%)
Elapsed 42 0.7780 ( 0.00%) 0.7690 ( 1.16%) 1.2405 (-59.45%) 0.7845 ( -0.84%) 0.7815 ( -0.45%)
Elapsed 43 0.7650 ( 0.00%) 0.7760 ( -1.44%) 1.0820 (-41.44%) 0.7795 ( -1.90%) 0.7600 ( 0.65%)
Elapsed 44 0.7595 ( 0.00%) 0.7590 ( 0.07%) 1.1615 (-52.93%) 0.7590 ( 0.07%) 0.7540 ( 0.72%)
Elapsed 45 0.7730 ( 0.00%) 0.7535 ( 2.52%) 0.9845 (-27.36%) 0.7735 ( -0.06%) 0.7705 ( 0.32%)
Elapsed 46 0.7735 ( 0.00%) 0.7650 ( 1.10%) 0.9610 (-24.24%) 0.7625 ( 1.42%) 0.7660 ( 0.97%)
Elapsed 47 0.7645 ( 0.00%) 0.7670 ( -0.33%) 1.1040 (-44.41%) 0.7650 ( -0.07%) 0.7675 ( -0.39%)
Elapsed 48 0.7655 ( 0.00%) 0.7675 ( -0.26%) 1.2085 (-57.87%) 0.7590 ( 0.85%) 0.7700 ( -0.59%)
Elapsed 5 1.9355 ( 0.00%) 1.9425 ( -0.36%) 2.3495 (-21.39%) 1.9710 ( -1.83%) 1.9675 ( -1.65%)
Elapsed 6 1.6640 ( 0.00%) 1.6760 ( -0.72%) 1.9865 (-19.38%) 1.6430 ( 1.26%) 1.6405 ( 1.41%)
Elapsed 7 1.4405 ( 0.00%) 1.4295 ( 0.76%) 1.6215 (-12.57%) 1.4370 ( 0.24%) 1.4550 ( -1.01%)
Elapsed 8 1.2320 ( 0.00%) 1.2545 ( -1.83%) 1.4595 (-18.47%) 1.2465 ( -1.18%) 1.2440 ( -0.97%)
Elapsed 9 1.2260 ( 0.00%) 1.2270 ( -0.08%) 1.3955 (-13.83%) 1.2285 ( -0.20%) 1.2180 ( 0.65%)

Same story. autonuma takes a hit. schednuma and balancenuma are ok.

There are also faults/sec and faults/cpu/sec stats but they all tell more
or less the same story. autonuma took a hit. schednuma and balancenuma are ok.

MMTests Statistics: duration
3.7.0 3.7.0 3.7.0 3.7.0 3.7.0
rc6-stats-v4r12rc6-schednuma-v16r2rc6-autonuma-v28fastr3rc6-moron-v4r38rc6-twostage-v4r38
User 1097.70 963.35 1275.69 1095.71 1104.06
System 18926.22 18947.86 22664.44 18895.61 19587.47
Elapsed 1374.39 1360.35 1888.67 1369.07 2008.11

autonuma has higher system CPU usage, so that might account for its loss.
Again, balancenuma and schednuma are ok.

MMTests Statistics: vmstat
3.7.0 3.7.0 3.7.0 3.7.0 3.7.0
rc6-stats-v4r12rc6-schednuma-v16r2rc6-autonuma-v28fastr3rc6-moron-v4r38rc6-twostage-v4r38
Page Ins 364 364 364 364 364
Page Outs 14756 15188 20036 15152 19152
Swap Ins 0 0 0 0 0
Swap Outs 0 0 0 0 0
Direct pages scanned 0 0 0 0 0
Kswapd pages scanned 0 0 0 0 0
Kswapd pages reclaimed 0 0 0 0 0
Direct pages reclaimed 0 0 0 0 0
Kswapd efficiency 100% 100% 100% 100% 100%
Kswapd velocity 0.000 0.000 0.000 0.000 0.000
Direct efficiency 100% 100% 100% 100% 100%
Direct velocity 0.000 0.000 0.000 0.000 0.000
Percentage direct scans 0% 0% 0% 0% 0%
Page writes by reclaim 0 0 0 0 0
Page writes file 0 0 0 0 0
Page writes anon 0 0 0 0 0
Page reclaim immediate 0 0 0 0 0
Page rescued immediate 0 0 0 0 0
Slabs scanned 0 0 0 0 0
Direct inode steals 0 0 0 0 0
Kswapd inode steals 0 0 0 0 0
Kswapd skipped wait 0 0 0 0 0
THP fault alloc 0 0 0 0 0
THP collapse alloc 0 0 0 0 0
THP splits 0 0 5 1 0
THP fault fallback 0 0 0 0 0
THP collapse fail 0 0 0 0 0
Compaction stalls 0 0 0 0 0
Compaction success 0 0 0 0 0
Compaction failures 0 0 0 0 0
Page migrate success 0 0 0 938 2892
Page migrate failure 0 0 0 0 0
Compaction pages isolated 0 0 0 0 0
Compaction migrate scanned 0 0 0 0 0
Compaction free scanned 0 0 0 0 0
Compaction cost 0 0 0 0 3
NUMA PTE updates 0 0 0 297476912 497772489
NUMA hint faults 0 0 0 290139 2456411
NUMA hint local faults 0 0 0 115544 2449766
NUMA pages migrated 0 0 0 938 2892
AutoNUMA cost 0 0 0 3533 15766

Some NUMA update activity here. Again, it might be the monitors. As these
stats are collected before and after the test, they are gathered even if
monitors are disabled, so a run without monitors would show whether they are
making a difference. It could be some other long-lived process on the
system too.

So there you have it. balancenuma's foundation has many things in common
with schednuma but does a lot more in just the basic mechanics to keep the
overhead under control and to avoid falling apart when the placement policy
makes wrong decisions. Even without a placement policy it can beat schednuma
in a number of cases and, while I do not expect this to be universal to
all machines, it's encouraging.

Can the schednuma people please reconsider rebasing on top of this?
It should be able to show in all cases that it improves performance over no
placement policy and it'll be a bit more obvious how it does it. I would
also hope that the concepts of autonuma would be reimplemented on top of
this foundation so we can do a meaningful comparison between different
placement policies.

--
Mel Gorman
SUSE Labs

2012-11-21 10:58:45

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 00/27] Latest numa/core release, v16

On Mon, Nov 19, 2012 at 07:41:16PM -0500, Rik van Riel wrote:
> On 11/19/2012 06:00 PM, Mel Gorman wrote:
> >On Mon, Nov 19, 2012 at 11:36:04PM +0100, Ingo Molnar wrote:
> >>
> >>* Mel Gorman <[email protected]> wrote:
> >>
> >>>Ok.
> >>>
> >>>In response to one of your later questions, I found that I had
> >>>in fact disabled THP without properly reporting it. [...]
> >>
> >>Hugepages is a must for most forms of NUMA/HPC.
> >
> >Requiring huge pages to avoid a regression is a mistake.
>
> Not all architectures support THP. Not all workloads will end up
> using THP effectively.
>
> Mel, would you have numa/core profiles from !THP runs, so we can
> find out the cause of the regression?
>

Unfortunately not. I'll queue up a profile run again when I can but as
the !profile runs are still going it could take a while. It might be the
weekend before they even start.

--
Mel Gorman
SUSE Labs

2012-11-21 11:06:54

by Ingo Molnar

[permalink] [raw]
Subject: Re: numa/core regressions fixed - more testers wanted


* David Rientjes <[email protected]> wrote:

> Over the past 24 hours, however, throughput has significantly
> improved from a 6.3% regression to a 3.0% regression [...]

It's still a regression though, and I'd like to figure out the
root cause of that. An updated full profile from tip:master
[which has all the latest fixes applied] would be helpful as a
starting point.

Thanks,

Ingo

2012-11-21 11:14:58

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 00/27] Latest numa/core release, v16

On Mon, Nov 19, 2012 at 11:37:01PM -0800, David Rientjes wrote:
> On Tue, 20 Nov 2012, Ingo Molnar wrote:
>
> > No doubt numa/core should not regress with THP off or on and
> > I'll fix that.
> >
> > As a background, here's how SPECjbb gets slower on mainline
> > (v3.7-rc6) if you boot Mel's kernel config and turn THP forcibly
> > off:
> >
> > (avg: 502395 ops/sec)
> > (avg: 505902 ops/sec)
> > (avg: 509271 ops/sec)
> >
> > # echo never > /sys/kernel/mm/transparent_hugepage/enabled
> >
> > (avg: 376989 ops/sec)
> > (avg: 379463 ops/sec)
> > (avg: 378131 ops/sec)
> >
> > A ~30% slowdown.
> >
> > [ How do I know? I asked for Mel's kernel config days ago and
> > actually booted Mel's very config in the past few days,
> > spending hours on testing it on 4 separate NUMA systems,
> > trying to find Mel's regression. In the past Mel was a
> > reliable tester so I blindly trusted his results. Was that
> > some weird sort of denial on my part? :-) ]
> >
>
> I confirm that numa/core regresses significantly more without thp than the
> 6.3% regression I reported with thp in terms of throughput on the same
> system. numa/core at 01aa90068b12 ("sched: Use the best-buddy 'ideal cpu'
> in balancing decisions") had 99389.49 SPECjbb2005 bops whereas
> ec05a2311c35 ("Merge branch 'sched/urgent' into sched/core") had 122246.90
> SPECjbb2005 bops, a 23.0% regression.
>

I also see different regressions and gains depending on the number of
warehouses. For low numbers of warehouses without THP the regression was
severe, but it was flat for higher numbers of warehouses. I explained in
another mail that specjbb reports based on peak figures and that regressions
outside the peak can be missed as a result, so we should watch out for that.

--
Mel Gorman
SUSE Labs

2012-11-21 11:40:33

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH, v2] mm, numa: Turn 4K pte NUMA faults into effective hugepage ones

On Tue, Nov 20, 2012 at 05:09:18PM +0100, Ingo Molnar wrote:
>
> Ok, the patch withstood a bit more testing as well. Below is a
> v2 version of it, with a couple of cleanups (no functional
> changes).
>
> Thanks,
>
> Ingo
>
> ----------------->
> Subject: mm, numa: Turn 4K pte NUMA faults into effective hugepage ones
> From: Ingo Molnar <[email protected]>
> Date: Tue Nov 20 15:48:26 CET 2012
>
> Reduce the 4K page fault count by looking around and processing
> nearby pages if possible.
>
> To keep the logic and cache overhead simple and straightforward
> we do a couple of simplifications:
>
> - we only scan in the HPAGE_SIZE range of the faulting address
> - we only go as far as the vma allows us
>
> Also simplify the do_numa_page() flow while at it and fix the
> previous double faulting we incurred due to not properly fixing
> up freshly migrated ptes.
>
> Suggested-by: Mel Gorman <[email protected]>
> Cc: Linus Torvalds <[email protected]>
> Cc: Andrew Morton <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Andrea Arcangeli <[email protected]>
> Cc: Rik van Riel <[email protected]>
> Cc: Hugh Dickins <[email protected]>
> Signed-off-by: Ingo Molnar <[email protected]>
> ---
> mm/memory.c | 99 ++++++++++++++++++++++++++++++++++++++----------------------
> 1 file changed, 64 insertions(+), 35 deletions(-)
>

This is functionally similar to what balancenuma does but there is one key
difference worth noting. I only mark the PMD pmd_numa if all the pages
pointed to by the updated[*] PTEs underneath are on the same node. The
intention is that if the workload is converged on a PMD boundary then a
migration of all the pages underneath will be remote->local copies. If the
workload is not converged on a PMD boundary and you handle all the faults
then you are potentially incurring remote->remote copies.

It also means that if the workload is not converged on the PMD boundary
then a PTE fault is just one page. With yours, it will be the full PMD
every time, right?

[*] Note I said only the updated ptes are checked. I do not check every
PTE underneath. I could but felt the benefit would be marginal.
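
In rough standalone C the check I'm describing looks something like the sketch
below. This is not the actual balancenuma code; the pte walk helpers and
page_node_at() are hypothetical stand-ins, and only the PTEs touched by the
update are considered, as per [*].

/*
 * Illustrative only: mark the PMD pmd_numa only if every updated,
 * present page underneath it sits on the same node, so a later fault
 * on the PMD migrates pages remote->local rather than remote->remote.
 */
#define PTES_PER_PMD_SKETCH 512

extern int pte_present_at(void *pmd, int idx);	/* hypothetical helpers */
extern int pte_was_updated(void *pmd, int idx);
extern int page_node_at(void *pmd, int idx);

static int pmd_pages_on_one_node(void *pmd)
{
	int i, node = -1;

	for (i = 0; i < PTES_PER_PMD_SKETCH; i++) {
		if (!pte_present_at(pmd, i) || !pte_was_updated(pmd, i))
			continue;

		if (node == -1)
			node = page_node_at(pmd, i);
		else if (page_node_at(pmd, i) != node)
			return 0;	/* not converged on this boundary */
	}

	return 1;	/* safe to mark pmd_numa */
}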

--
Mel Gorman
SUSE Labs

2012-11-21 11:47:35

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH] x86/mm: Don't flush the TLB on #WP pmd fixups

On Tue, Nov 20, 2012 at 01:31:56PM +0100, Ingo Molnar wrote:
>
> * Ingo Molnar <[email protected]> wrote:
>
> > * Ingo Molnar <[email protected]> wrote:
> >
> > > numa/core profile:
> > >
> > > 95.66% perf-1201.map [.] 0x00007fe4ad1c8fc7
> > > 1.70% libjvm.so [.] 0x0000000000381581
> > > 0.59% [vdso] [.] 0x0000000000000607
> > > 0.19% [kernel] [k] do_raw_spin_lock
> > > 0.11% [kernel] [k] generic_smp_call_function_interrupt
> > > 0.11% [kernel] [k] timekeeping_get_ns.constprop.7
> > > 0.08% [kernel] [k] ktime_get
> > > 0.06% [kernel] [k] get_cycles
> > > 0.05% [kernel] [k] __native_flush_tlb
> > > 0.05% [kernel] [k] rep_nop
> > > 0.04% perf [.] add_hist_entry.isra.9
> > > 0.04% [kernel] [k] rcu_check_callbacks
> > > 0.04% [kernel] [k] ktime_get_update_offsets
> > > 0.04% libc-2.15.so [.] __strcmp_sse2
> > >
> > > No page fault overhead (see the page fault rate further below)
> > > - the NUMA scanning overhead shows up only through some mild
> > > TLB flush activity (which I'll fix btw).
> >
> > The patch attached below should get rid of that mild TLB
> > flushing activity as well.
>
> This has further increased SPECjbb from 203k/sec to 207k/sec,
> i.e. it's now 5% faster than mainline - THP enabled.
>
> The profile is now totally flat even during a full 32-WH SPECjbb
> run, with the highest overhead entries left all related to timer
> IRQ processing or profiling. That is on a system that should be
> very close to yours.
>

This is a stab in the dark but are you always running with profiling enabled?

I have not checked this with perf but a number of years ago I found that
oprofile could distort results really badly (7-30% depending on the workload
at the time) when I was evalating hugetlbfs and THP. In some cases I would
find that profiling would show that a patch series improved performance
when the same series showed regressions if profiling was disabled. The
sampling rate had to be reduced quite a bit to avoid this effect.

--
Mel Gorman
SUSE Labs

2012-11-21 11:53:04

by Mel Gorman

[permalink] [raw]
Subject: Re: numa/core regressions fixed - more testers wanted

On Tue, Nov 20, 2012 at 07:54:13PM -0600, Andrew Theurer wrote:
> On Tue, 2012-11-20 at 18:56 +0100, Ingo Molnar wrote:
> > * Ingo Molnar <[email protected]> wrote:
> >
> > > ( The 4x JVM regression is still an open bug I think - I'll
> > > re-check and fix that one next, no need to re-report it,
> > > I'm on it. )
> >
> > So I tested this on !THP too and the combined numbers are now:
> >
> > |
> > [ SPECjbb multi-4x8 ] |
> > [ tx/sec ] v3.7 | numa/core-v16
> > [ higher is better ] ----- | -------------
> > |
> > +THP: 639k | 655k +2.5%
> > -THP: 510k | 517k +1.3%
> >
> > So it's not a regression anymore, regardless of whether THP is
> > enabled or disabled.
> >
> > The current updated table of performance results is:
> >
> > -------------------------------------------------------------------------
> > [ seconds ] v3.7 AutoNUMA | numa/core-v16 [ vs. v3.7]
> > [ lower is better ] ----- -------- | ------------- -----------
> > |
> > numa01 340.3 192.3 | 139.4 +144.1%
> > numa01_THREAD_ALLOC 425.1 135.1 | 121.1 +251.0%
> > numa02 56.1 25.3 | 17.5 +220.5%
> > |
> > [ SPECjbb transactions/sec ] |
> > [ higher is better ] |
> > |
> > SPECjbb 1x32 +THP 524k 507k | 638k +21.7%
> > SPECjbb 1x32 !THP 395k | 512k +29.6%
> > |
> > -----------------------------------------------------------------------
> > |
> > [ SPECjbb multi-4x8 ] |
> > [ tx/sec ] v3.7 | numa/core-v16
> > [ higher is better ] ----- | -------------
> > |
> > +THP: 639k | 655k +2.5%
> > -THP: 510k | 517k +1.3%
> >
> > So I think I've addressed all regressions reported so far - if
> > anyone can still see something odd, please let me know so I can
> > reproduce and fix it ASAP.
>
> I can confirm single JVM JBB is working well for me. I see a 30%
> improvement over autoNUMA. What I can't make sense of is some perf
> stats (taken at 80 warehouses on 4 x WST-EX, 512GB memory):
>

I'm curious about possible effects with profiling. Can you rerun just
this test without any profiling and see if the gain is the same? My own
tests are running monitors but they only fire every 10 seconds and are
not running profiles.

--
Mel Gorman
SUSE Labs

2012-11-21 12:09:06

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH, v2] mm, numa: Turn 4K pte NUMA faults into effective hugepage ones

On Tue, Nov 20, 2012 at 05:52:39PM +0100, Ingo Molnar wrote:
>
> * Rik van Riel <[email protected]> wrote:
>
> > Performance measurements will show us how much of an impact it
> > makes, since I don't think we have ever done apples to apples
> > comparisons with just this thing toggled :)
>
> I've done a couple of quick measurements to characterise it: as
> expected this patch simply does not matter much when THP is
> enabled - and most testers I worked with had THP enabled.
>
> Testing with THP off hurts most NUMA workloads dearly and tells
> very little about the real NUMA story of these workloads. If you
> turn off THP you are living with a constant ~25% regression -
> just check the THP and no-THP numbers I posted:
>
> [ 32-warehouse SPECjbb test benchmarks ]
>
> mainline: 395 k/sec
> mainline +THP: 524 k/sec
>
> numa/core +patch: 512 k/sec [ +29.6% ]
> numa/core +patch +THP: 654 k/sec [ +24.8% ]
>
> The number of testers who had THP disabled was thus very low -
> maybe only Mel alone? The testers I worked with all had THP
> enabled.
>
> I'd encourage everyone to report unusual 'tweaks' done before
> tests are reported - no matter how well intended the purpose of
> that tweak.

Jeez, it was an oversight. Do I need to send a card or something?

> There's just so many config variations we can test
> and we obviously check the most logically and most scalably
> configured system variants first.
>

I managed to not break the !THP case for the most part in balancenuma
for the cases I looked at. As stated elsewhere, not all machines that
care about HPC can support THP -- ppc64 is a major example. THPs are
not always available, particularly when the node you are trying to migrate
to is fragmented. You can just fail the migration in this case of course
but unless you are willing to compact, this situation can persist for a
long time. You get to keep THP but on a remote node. If we are to ever
choose to split THP to get better placement then we must be able to cope
with the !THP case from the start. Lastly, not all workloads can use THP
if they depend heavily on large files or shared memory.

Focusing on the THP case initially will produce better figures but I worry
it will eventually kick us in the shins and be hard to back out of.

--
Mel Gorman
SUSE Labs

2012-11-21 12:45:58

by Hillf Danton

[permalink] [raw]
Subject: Re: numa/core regressions fixed - more testers wanted

On Wed, Nov 21, 2012 at 12:10 PM, Hugh Dickins <[email protected]> wrote:
> On Tue, 20 Nov 2012, Rik van Riel wrote:
>> On 11/20/2012 08:54 PM, Andrew Theurer wrote:
>>
>> > I can confirm single JVM JBB is working well for me. I see a 30%
>> > improvement over autoNUMA. What I can't make sense of is some perf
>> > stats (taken at 80 warehouses on 4 x WST-EX, 512GB memory):
>>
>> AutoNUMA does not have native THP migration, that may explain some
>> of the difference.

Plus, numa/core is sucking the milk of TLB-flush optimization from Rik.

BTW, I want to see results of numa/core without such TLB boosts.

>
> When I made some fixes to the sched/numa native THP migration,
> I did also try porting that (with Hannes's memcg fixes) to AutoNUMA.
>
Thanks a ton;)

> +
> + new_page = alloc_pages_node(numa_node_id(),
> + (GFP_TRANSHUGE | GFP_THISNODE) & ~__GFP_WAIT, HPAGE_PMD_ORDER);

Why is such a brand-new page selected as the migration target?

Hillf
> + if (!new_page)
> + goto alloc_fail;

2012-11-21 17:02:52

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 00/27] Latest numa/core release, v16

On Mon, Nov 19, 2012 at 11:06 PM, Ingo Molnar <[email protected]> wrote:
>
> Oh, finally a clue: you seem to have vsyscall emulation
> overhead!

Ingo, stop it already!

This is *exactly* the kind of "blame everybody else than yourself"
behavior that I was talking about earlier.

There have been an absolute *shitload* of patches to try to make up
for the schednuma regressions THAT HAVE ABSOLUTELY NOTHING TO DO WITH
SCHEDNUMA, and are all about trying to work around the fact that it
regresses. The whole TLB optimization, and now this kind of crap.

Ingo, look your code in the mirror some day, and ask yourself: why do
you think this fixes a "regression"?

The fact is, the *baseline* had the exact same vsyscall emulation too.
So by trying to make up for vsyscalls only in your numbers, you are
basically trying to lie about regressions, and try to cover up the
schednuma regression by fixing something else.

See? That's bogus. When you now compare numbers, YOU ARE LYING. You
have introduced a performance regression, and then trying to hide it
by making something else go faster.

The same is true of all your arguments about Mel's numbers wrt THP
etc. Your arguments are misleading - either intentionally, or because
you yourself didn't think things through. For schednuma, it's not
enough to be par with mainline with THP off - the competition
(autonuma) has been beating mainline soundly in Mel's configuration.
So the target to beat is not mainline, but the much *higher*
performance that autonuma got.

The fact that Mel has a different configuration from yours IS
IRRELEVANT. You should not blame his configuration for the regression,
you should instead ask yourself "Why does schednuma regress in that
configuration"? And not look at vsyscalls or anything, but look at
what schednuma does wrong!

Linus

2012-11-21 17:10:55

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH 00/27] Latest numa/core release, v16


* Linus Torvalds <[email protected]> wrote:

> On Mon, Nov 19, 2012 at 11:06 PM, Ingo Molnar <[email protected]> wrote:
> >
> > Oh, finally a clue: you seem to have vsyscall emulation
> > overhead!
>
> Ingo, stop it already!
>
> This is *exactly* the kind of "blame everybody else than
> yourself" behavior that I was talking about earlier.
>
> There have been an absolute *shitload* of patches to try to
> make up for the schednuma regressions THAT HAVE ABSOLUTELY
> NOTHING TO DO WITH SCHEDNUMA, and are all about trying to work
> around the fact that it regresses. The whole TLB optimization,
> and now this kind of crap.
>
> Ingo, look your code in the mirror some day, and ask yourself:
> why do you think this fixes a "regression"?

Because scalability slowdowns are often non-linear.

So with CONFIG_NUMA_BALANCING=y we are taking a higher page
fault rate, in exchange for a speedup.

But if some other factor further increases the page fault rate
(such as vsyscall emulation) then the speedup can be
non-linearly slower than the cost of the technique - washing it
out or even turning it into an outright regression.

So, for example:

- 10K page faults/sec from CONFIG_NUMA_BALANCING: 0.5% cost
- 10K page faults/sec from vsyscall emu: 0.5% cost

If the two are mixed together the slowdown is non-linear:

- 10K+10K page faults/sec overhead is not a linear 1%, but
might be 3%

So because I did not have an old-glibc system like David's, I
did not know the actual page fault rate. If it is high enough
then nonlinear effects might cause such effects.

This is an entirely valid line of inquiry IMO.
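
As a toy model only - made-up constants, not measurements - the
superlinear shape of that argument looks like this:

#include <stdio.h>

/*
 * Illustration only: per-second overhead of page faults modelled as a
 * linear per-fault term plus a contention term that grows with the
 * square of the combined fault rate. The constants are arbitrary; the
 * point is that doubling the rate more than doubles the cost.
 */
int main(void)
{
	const double a = 3e-5;	/* % cost per fault/sec (linear term)      */
	const double b = 2e-9;	/* % cost per (fault/sec)^2 (mixing term)  */
	const double rates[] = { 10000.0, 20000.0 };

	for (int i = 0; i < 2; i++) {
		double r = rates[i];
		printf("%8.0f faults/sec -> %.2f%% overhead\n",
		       r, a * r + b * r * r);
	}
	return 0;
}

With these (arbitrary) constants 10K faults/sec costs 0.50% while
20K faults/sec costs 1.40% - i.e. 2.8x rather than 2x.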

Thanks,

Ingo

2012-11-21 17:20:53

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH 00/27] Latest numa/core release, v16


* Ingo Molnar <[email protected]> wrote:

> This is an entirely valid line of inquiry IMO.

Btw., what I did was to simply look at David's profile on the
regressing system and I compared it to the profile I got on a
pretty similar (but unfortunately not identical and not
regressing) system. I saw 3 differences:

- the numa emulation faults
- the higher TLB miss cost
- numa/core's failure to handle 4K pages properly

And addressed those, in the hope of one of them making a
difference.

There's a fourth line of inquiry I'm pursuing as well: the node
asymmetry that David and Paul mentioned could have a performance
effect as well - resulting from non-ideal placement under
numa/core.

That is not easy to cure - I have written a patch to take the
node asymmetry into consideration, I'm still testing it with
David's topology simulated on a testbox:

numa=fake=4:10,20,20,30,20,10,20,20,20,20,10,20,30,20,20,10
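
Reading those 16 values as a 4x4 node distance table (that is my
assumption about the fake-NUMA syntax here), the simulated topology
would be:

/*
 * Assumed interpretation of the numa=fake= option above: 4 nodes with
 * SLIT-style distances, local distance 10, most remote pairs at 20,
 * and nodes 0 and 3 further apart at 30 - i.e. the asymmetric
 * topology being simulated.
 */
static const int fake_numa_distance[4][4] = {
	{ 10, 20, 20, 30 },
	{ 20, 10, 20, 20 },
	{ 20, 20, 10, 20 },
	{ 30, 20, 20, 10 },
};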

Will send the patch out later.

Thanks,

Ingo

2012-11-21 17:40:22

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH 00/27] Latest numa/core release, v16


* Ingo Molnar <[email protected]> wrote:

> So because I did not have an old-glibc system like David's, I
> did not know the actual page fault rate. If it is high enough
> then nonlinear effects might cause such effects.
>
> This is an entirely valid line of inquiry IMO.

Btw., when comparing against 'mainline' I routinely use a
vanilla kernel that has the same optimization applied. (first I
make sure it's not a regression to vanilla.)

I do that to factor out the linear component of the independent
speedup: it would not be valid to compare vanilla against
numa/core+optimization, but the comparison has to be:

vanilla + optimization
vs.
numa/core + optimization

I did that with last night's numbers as well.

So any of this can only address a regression if a non-linear
factor is in play.

Since I have no direct access to a regressing system I have to
work with the theories that I can think of: one had a larger
effect, the other had a smaller effect, the third one had no
effect on David's system.

How would you have done it instead?

Thanks,

Ingo

2012-11-21 17:45:56

by Rik van Riel

[permalink] [raw]
Subject: Re: [PATCH 00/27] Latest numa/core release, v16

On 11/21/2012 12:02 PM, Linus Torvalds wrote:

> The same is true of all your arguments about Mel's numbers wrt THP
> etc. Your arguments are misleading - either intentionally, or because
> you yourself didn't think things through. For schednuma, it's not
> enough to be par with mainline with THP off - the competition
> (autonuma) has been beating mainline soundly in Mel's configuration.
> So the target to beat is not mainline, but the much *higher*
> performance that autonuma got.

Once the numa base patches are upstream, the bar will be raised
automatically.

With the same infrastructure in place, we will be able to do
an apples to apples comparison of the NUMA placement policies,
and figure out which one will be the best to merge.

--
All rights reversed

2012-11-21 17:59:35

by Andrew Theurer

[permalink] [raw]
Subject: Re: numa/core regressions fixed - more testers wanted

On Tue, 2012-11-20 at 20:10 -0800, Hugh Dickins wrote:
> On Tue, 20 Nov 2012, Rik van Riel wrote:
> > On 11/20/2012 08:54 PM, Andrew Theurer wrote:
> >
> > > I can confirm single JVM JBB is working well for me. I see a 30%
> > > improvement over autoNUMA. What I can't make sense of is some perf
> > > stats (taken at 80 warehouses on 4 x WST-EX, 512GB memory):
> >
> > AutoNUMA does not have native THP migration, that may explain some
> > of the difference.
>
> When I made some fixes to the sched/numa native THP migration,
> I did also try porting that (with Hannes's memcg fixes) to AutoNUMA.
>
> Here's the patch below: it appeared to be working just fine, but
> you might find that it doesn't quite apply to whatever tree you're
> using. I started from 3.6 autonuma28fast in aa.git, but had folded
> in some of the equally applicable TLB flush optimizations too.
>
> There's also a little "Hack, remove after THP native migration"
> retuning in mm/huge_memory.c which should probably be removed too.

Thanks, this worked for me. The autoNUMA SPECjbb result is now much
closer, just 4% lower than the numa/core result. The number of anon and
anon-huge pages are now nearly the same.

-Andrew Theurer

2012-11-21 18:04:40

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH 00/27] Latest numa/core release, v16


* Linus Torvalds <[email protected]> wrote:

> [...] And not look at vsyscalls or anything, but look at what
> schednuma does wrong!

I have started 4 independent lines of inquiry to figure out
what's wrong on David's system, and all four are in the category
of 'what does our tree do to cause a regression':

- suboptimal (== regressive) 4K fault handling by numa/core

- suboptimal (== regressive) placement by numa/core on David's
asymmetric-topology system

- vsyscalls escalating numa/core page fault overhead
non-linearly

- TLB flushes escalating numa/core page fault overhead
non-linearly

I have sent patches for 3 of them, one is still work in
progress, because it's non-trivial.

I'm absolutely open to every possibility and obviously any
regression is numa/core's fault, full stop.

What would you have done differently to handle this particular
regression?

Thanks,

Ingo

2012-11-21 19:38:21

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [PATCH 00/27] Latest numa/core release, v16

Hi,

On Wed, Nov 21, 2012 at 10:38:59AM +0000, Mel Gorman wrote:
> HACKBENCH PIPES
> 3.7.0 3.7.0 3.7.0 3.7.0 3.7.0
> rc6-stats-v4r12 rc6-schednuma-v16r2 rc6-autonuma-v28fastr3 rc6-moron-v4r38 rc6-twostage-v4r38
> Procs 1 0.0320 ( 0.00%) 0.0354 (-10.53%) 0.0410 (-28.28%) 0.0310 ( 3.00%) 0.0296 ( 7.55%)
> Procs 4 0.0560 ( 0.00%) 0.0699 (-24.87%) 0.0641 (-14.47%) 0.0556 ( 0.79%) 0.0562 ( -0.36%)
> Procs 8 0.0850 ( 0.00%) 0.1084 (-27.51%) 0.1397 (-64.30%) 0.0833 ( 1.96%) 0.0953 (-12.07%)
> Procs 12 0.1047 ( 0.00%) 0.1084 ( -3.54%) 0.1789 (-70.91%) 0.0990 ( 5.44%) 0.1127 ( -7.72%)
> Procs 16 0.1276 ( 0.00%) 0.1323 ( -3.67%) 0.1395 ( -9.34%) 0.1236 ( 3.16%) 0.1240 ( 2.83%)
> Procs 20 0.1405 ( 0.00%) 0.1578 (-12.29%) 0.2452 (-74.52%) 0.1471 ( -4.73%) 0.1454 ( -3.50%)
> Procs 24 0.1823 ( 0.00%) 0.1800 ( 1.24%) 0.3030 (-66.22%) 0.1776 ( 2.58%) 0.1574 ( 13.63%)
> Procs 28 0.2019 ( 0.00%) 0.2143 ( -6.13%) 0.3403 (-68.52%) 0.2000 ( 0.94%) 0.1983 ( 1.78%)
> Procs 32 0.2162 ( 0.00%) 0.2329 ( -7.71%) 0.6526 (-201.85%) 0.2235 ( -3.36%) 0.2158 ( 0.20%)
> Procs 36 0.2354 ( 0.00%) 0.2577 ( -9.47%) 0.4468 (-89.77%) 0.2619 (-11.24%) 0.2451 ( -4.11%)
> Procs 40 0.2600 ( 0.00%) 0.2850 ( -9.62%) 0.5247 (-101.79%) 0.2724 ( -4.77%) 0.2646 ( -1.75%)
>
> The number of procs hackbench is running is too low here for a 48-core
> machine. It should have been reconfigured but this is better than nothing.
>
> schednuma and autonuma both show large regressions in the performance here.
> I do not investigate why but as there are a number of scheduler changes
> it could be anything.

Strange, last time I tested hackbench it was perfectly ok, I even had
this test shown in some of the PDFs.

Lately (post my last hackbench run) I disabled the affine wakeups
cross-node and pipes use sd_affine wakeups. That could matter for
these heavy scheduling tests as it practically disables the _sync in
wake_up_interruptible_sync_poll used by the pipe code, if the waker
CPU is in a different node than the wakee prev_cpu. I discussed this
with Mike and he liked this change IIRC but it's the first thing that
should be checked in light of the above regression.

> PAGE FAULT TEST
>
> This is a microbenchmark for page faults. The number of clients are badly ordered
> which again, I really should fix but anyway.
>
> 3.7.0 3.7.0 3.7.0 3.7.0 3.7.0
> rc6-stats-v4r12 rc6-schednuma-v16r2 rc6-autonuma-v28fastr3 rc6-moron-v4r38 rc6-twostage-v4r38
> System 1 8.0710 ( 0.00%) 8.1085 ( -0.46%) 8.0925 ( -0.27%) 8.0170 ( 0.67%) 37.3075 (-362.24%)
> System 10 9.4975 ( 0.00%) 9.5690 ( -0.75%) 12.0055 (-26.41%) 9.5915 ( -0.99%) 9.5835 ( -0.91%)
> System 11 9.7740 ( 0.00%) 9.7915 ( -0.18%) 13.4890 (-38.01%) 9.7275 ( 0.48%) 9.6810 ( 0.95%)

No real clue on this one as I should look in what the test does. It
might be related to THP splits though. I can't imagine anything else
because there's nothing at all in autonuma that alters the page faults
(except from arming NUMA hinting faults which should be lighter in
autonuma than in the other implementation using task work).

Chances are the faults are tested by touching bytes at different 4k
offsets in the same 2m naturally aligned virtual range.
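
Something like this minimal sketch - a guess at the access pattern,
not the actual PFT source:

#define _POSIX_C_SOURCE 200112L
#include <stdlib.h>

#define REGION_SIZE	(2UL * 1024 * 1024)	/* one 2M-aligned range */
#define PAGE_SIZE_4K	4096UL

int main(void)
{
	char *buf;
	unsigned long off;

	/* 2M naturally aligned anonymous region */
	if (posix_memalign((void **)&buf, REGION_SIZE, REGION_SIZE))
		return 1;

	/* One write per 4K page: each touch can fault separately if THP is off */
	for (off = 0; off < REGION_SIZE; off += PAGE_SIZE_4K)
		buf[off] = 1;

	free(buf);
	return 0;
}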

Hugh's THP native migration patch will clarify things on the above.

> also hope that the concepts of autonuma would be reimplemented on top of
> this foundation so we can do a meaningful comparison between different
> placement policies.

I'll try to help with this to see what could be added from autonuma on
top of your balancenuma foundation to improve it. Your current
foundation looks ideal for inclusion to me.

I noticed you haven't run any single instance specjbb workload, that
should be added to the battery of tests. But hey take your time, the
amount of data you provided is already very comprehensive and you were
so fast.

The thing is: single instance and multi instance are totally different
beasts.

multi instance is all about avoiding NUMA false sharing in the first
place (the anti false sharing algorithm becomes a noop), and it has a
trivial perfect solution with all cross node traffic guaranteed to
stop after convergence has been reached for the whole duration of the
workload.

single instance is all about NUMA false sharing detection and it has
no perfect solution and there's no way to fully converge and to stop
all cross node traffic. So it's a tradeoff between doing too many
CPU/memory spurious migrations (harmful, causes regressions) and doing
too few (i.e. not improving at all compared to upstream but not
regressing either).

autonuma27/28/28fast will perform identical on multi instance loads
(i.e. optimal, a few percent away from hard bindings).

I was just starting to improve the anti false sharing algorithm in
autonuma28/28fast to improve single instance specjbb too (this is why
you really need autonuma28 or autonuma28fast to test single instance
specjbb and not autonuma27).

About THP, normally when I was running benchmarks I was testing these
4 configs:

1) THP on PMD scan on
2) THP on PMD scan off
3) THP off PMD scan on
4) THP off PMD scan off

(1 and 2 are practically the same for the autonuma benchmark, because
all memory is backed by THP rendering the PMD level hinting faults for
4k pages very unlikely, but I was testing it anyway just in case)

THP off is going to hit KVM guests the most and much less host
workloads. But even for KVM it's good practice to test with THP off
too, to verify the cost of the numa hinting page faults remains very
low (the cost is much higher for guests than host because of the
vmexits).

The KVM benchmark run by IBM was also done in all 4 combinations: THP
on/off KSM on/off and showed improvement even for the "No THP" case
(btw, things should run much better these days than the old
autonuma13).

http://dl.dropbox.com/u/82832537/kvm-numa-comparison-0.png

2012-11-21 19:56:47

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 00/27] Latest numa/core release, v16

On Wed, Nov 21, 2012 at 08:37:12PM +0100, Andrea Arcangeli wrote:
> Hi,
>
> On Wed, Nov 21, 2012 at 10:38:59AM +0000, Mel Gorman wrote:
> > HACKBENCH PIPES
> > 3.7.0 3.7.0 3.7.0 3.7.0 3.7.0
> > rc6-stats-v4r12 rc6-schednuma-v16r2 rc6-autonuma-v28fastr3 rc6-moron-v4r38 rc6-twostage-v4r38
> > Procs 1 0.0320 ( 0.00%) 0.0354 (-10.53%) 0.0410 (-28.28%) 0.0310 ( 3.00%) 0.0296 ( 7.55%)
> > Procs 4 0.0560 ( 0.00%) 0.0699 (-24.87%) 0.0641 (-14.47%) 0.0556 ( 0.79%) 0.0562 ( -0.36%)
> > Procs 8 0.0850 ( 0.00%) 0.1084 (-27.51%) 0.1397 (-64.30%) 0.0833 ( 1.96%) 0.0953 (-12.07%)
> > Procs 12 0.1047 ( 0.00%) 0.1084 ( -3.54%) 0.1789 (-70.91%) 0.0990 ( 5.44%) 0.1127 ( -7.72%)
> > Procs 16 0.1276 ( 0.00%) 0.1323 ( -3.67%) 0.1395 ( -9.34%) 0.1236 ( 3.16%) 0.1240 ( 2.83%)
> > Procs 20 0.1405 ( 0.00%) 0.1578 (-12.29%) 0.2452 (-74.52%) 0.1471 ( -4.73%) 0.1454 ( -3.50%)
> > Procs 24 0.1823 ( 0.00%) 0.1800 ( 1.24%) 0.3030 (-66.22%) 0.1776 ( 2.58%) 0.1574 ( 13.63%)
> > Procs 28 0.2019 ( 0.00%) 0.2143 ( -6.13%) 0.3403 (-68.52%) 0.2000 ( 0.94%) 0.1983 ( 1.78%)
> > Procs 32 0.2162 ( 0.00%) 0.2329 ( -7.71%) 0.6526 (-201.85%) 0.2235 ( -3.36%) 0.2158 ( 0.20%)
> > Procs 36 0.2354 ( 0.00%) 0.2577 ( -9.47%) 0.4468 (-89.77%) 0.2619 (-11.24%) 0.2451 ( -4.11%)
> > Procs 40 0.2600 ( 0.00%) 0.2850 ( -9.62%) 0.5247 (-101.79%) 0.2724 ( -4.77%) 0.2646 ( -1.75%)
> >
> > The number of procs hackbench is running is too low here for a 48-core
> > machine. It should have been reconfigured but this is better than nothing.
> >
> > schednuma and autonuma both show large regressions in the performance here.
> > I do not investigate why but as there are a number of scheduler changes
> > it could be anything.
>
> Strange, last time I tested hackbench it was perfectly ok, I even had
> this test shown in some of the pdf.
>

It's been rebased to 3.7-rc6 since so there may be an incompatible
scheduler change somewhere.

> Lately (post my last hackbench run) I disabled the affine wakeups
> cross-node and pipes use sd_affine wakeups. That could matter for
> these heavy scheduling tests as it practically disables the _sync in
> wake_up_interruptible_sync_poll used by the pipe code, if the waker
> CPU is in a different node than the wakee prev_cpu. I discussed this
> with Mike and he liked this change IIRC but it's the first thing that
> should be checked in light of the above regression.
>

Understood. I found in early profiles that the mutex_spin_on_owner logic
was also relevant but did not pin down why. I expected it was contention
on mmap_sem due to the PTE scanner but have not had the chance to
verify.

> > PAGE FAULT TEST
> >
> > This is a microbenchmark for page faults. The number of clients are badly ordered
> > which again, I really should fix but anyway.
> >
> > 3.7.0 3.7.0 3.7.0 3.7.0 3.7.0
> > rc6-stats-v4r12 rc6-schednuma-v16r2 rc6-autonuma-v28fastr3 rc6-moron-v4r38 rc6-twostage-v4r38
> > System 1 8.0710 ( 0.00%) 8.1085 ( -0.46%) 8.0925 ( -0.27%) 8.0170 ( 0.67%) 37.3075 (-362.24%)
> > System 10 9.4975 ( 0.00%) 9.5690 ( -0.75%) 12.0055 (-26.41%) 9.5915 ( -0.99%) 9.5835 ( -0.91%)
> > System 11 9.7740 ( 0.00%) 9.7915 ( -0.18%) 13.4890 (-38.01%) 9.7275 ( 0.48%) 9.6810 ( 0.95%)
>
> No real clue on this one as I should look in what the test does.

It's the PFT test in MMTests and MMTests should run it by default out of
the box. Running it will fetch the relevant source and it'll be in
work/testsdisk/sources

> It
> might be related to THP splits though. I can't imagine anything else
> because there's nothing at all in autonuma that alters the page faults
> (except from arming NUMA hinting faults which should be lighter in
> autonuma than in the other implementation using task work).
>
> Chances are the faults are tested by touching bytes at different 4k
> offsets in the same 2m naturally aligned virtual range.
>
> Hugh's THP native migration patch will clarify things on the above.
>

The current sets of tests been run has Hugh's THP native migration patch
on top. There was a trivial conflict but otherwise it applied.

> > also hope that the concepts of autonuma would be reimplemented on top of
> > this foundation so we can do a meaningful comparison between different
> > placement policies.
>
> I'll try to help with this to see what could be added from autonuma on
> top of your balancenuma foundation to improve it. Your current
> foundation looks ideal for inclusion to me.
>

That would be really great. If this happened then potentially numacore
and autonuma can be directly compared in terms of placement and scheduler
policies if both depended on the same underlying infrastructure. If there
was a better implementation of the PTE scanner for example then it should
be usable by both.

> I noticed you haven't run any single instance specjbb workload, that
> should be added to the battery of tests. But hey take your time, the
> amount of data you provided is already very comprehensive and you were
> so fast.
>

No, I haven't. Each time they got cancelled due to patch updates before
they had a chance to run. They're still not queued because I want profiles
for the other tests first. When they complete I'll fire them up.

> The thing is: single instance and multi instance are totally different
> beasts.
>
> multi instance is all about avoiding NUMA false sharing in the first
> place (the anti false sharing algorithm becomes a noop), and it has a
> trivial perfect solution with all cross node traffic guaranteed to
> stop after convergence has been reached for the whole duration of the
> workload.
>
> single instance is all about NUMA false sharing detection and it has
> no perfect solution and there's no way to fully converge and to stop
> all cross node traffic. So it's a tradeoff between doing too many
> CPU/memory spurious migrations (harmful, causes regressions) and doing
> too few (i.e. not improving at all compared to upstream but not
> regressing either).
>

Thanks for that explanation. It does mean that for any specjbb results
it'll have to be declared whether it's a single or multi configuration, as
they cannot be directly compared in a meaningful manner. If the majority
of JVM tests are single-configuration then I'll prioritise those over the
multi-JVM configurations just to keep comparisons compatible.

> autonuma27/28/28fast will perform identical on multi instance loads
> (i.e. optimal, a few percent away from hard bindings).
>
> I was just starting to improve the anti false sharing algorithm in
> autonuma28/28fast to improve single instance specjbb too (this is why
> you really need autonuma28 or autonuma28fast to test single instance
> specjbb and not autonuma27).
>
> About THP, normally when I was running benchmarks I was testing these
> 4 configs:
>
> 1) THP on PMD scan on
> 2) THP on PMD scan off
> 3) THP off PMD scan on
> 4) THP off PMD scan off
>
> (1 and 2 are practically the same for the autonuma benchmark, because
> all memory is backed by THP rendering the PMD level hinting faults for
> 4k pages very unlikely, but I was testing it anyway just in case)
>

Testing the PMD and !PMD cases was important as I expect the results are
different depending on whether the workload converges on a 2M boundary or
not. A similar tunable is not available in my current tree but perhaps it
should be added to allow the same comparison to happen.

> THP off is going to hit KVM guests the most and much less host
> workloads. But even for KVM it's good practice to test with THP off
> too, to verify the cost of the numa hinting page faults remains very
> low (the cost is much higher for guests than host because of the
> vmexits).
>

Agreed.

> The KVM benchmark run by IBM was also done in all 4 combinations: THP
> on/off KSM on/off and showed improvement even for the "No THP" case
> (btw, things should run much better these days than the old
> autonuma13).
>
> http://dl.dropbox.com/u/82832537/kvm-numa-comparison-0.png

Thanks for that.

--
Mel Gorman
SUSE Labs

2012-11-22 19:27:18

by Andrew Theurer

[permalink] [raw]
Subject: Re: numa/core regressions fixed - more testers wanted

On Wed, 2012-11-21 at 11:52 +0000, Mel Gorman wrote:
> On Tue, Nov 20, 2012 at 07:54:13PM -0600, Andrew Theurer wrote:
> > On Tue, 2012-11-20 at 18:56 +0100, Ingo Molnar wrote:
> > > * Ingo Molnar <[email protected]> wrote:
> > >
> > > > ( The 4x JVM regression is still an open bug I think - I'll
> > > > re-check and fix that one next, no need to re-report it,
> > > > I'm on it. )
> > >
> > > So I tested this on !THP too and the combined numbers are now:
> > >
> > > |
> > > [ SPECjbb multi-4x8 ] |
> > > [ tx/sec ] v3.7 | numa/core-v16
> > > [ higher is better ] ----- | -------------
> > > |
> > > +THP: 639k | 655k +2.5%
> > > -THP: 510k | 517k +1.3%
> > >
> > > So it's not a regression anymore, regardless of whether THP is
> > > enabled or disabled.
> > >
> > > The current updated table of performance results is:
> > >
> > > -------------------------------------------------------------------------
> > > [ seconds ] v3.7 AutoNUMA | numa/core-v16 [ vs. v3.7]
> > > [ lower is better ] ----- -------- | ------------- -----------
> > > |
> > > numa01 340.3 192.3 | 139.4 +144.1%
> > > numa01_THREAD_ALLOC 425.1 135.1 | 121.1 +251.0%
> > > numa02 56.1 25.3 | 17.5 +220.5%
> > > |
> > > [ SPECjbb transactions/sec ] |
> > > [ higher is better ] |
> > > |
> > > SPECjbb 1x32 +THP 524k 507k | 638k +21.7%
> > > SPECjbb 1x32 !THP 395k | 512k +29.6%
> > > |
> > > -----------------------------------------------------------------------
> > > |
> > > [ SPECjbb multi-4x8 ] |
> > > [ tx/sec ] v3.7 numa/core-v16
> > > [ higher is better ] ----- | -------------
> > > |
> > > +THP: 639k | 655k +2.5%
> > > -THP: 510k | 517k +1.3%
> > >
> > > So I think I've addressed all regressions reported so far - if
> > > anyone can still see something odd, please let me know so I can
> > > reproduce and fix it ASAP.
> >
> > I can confirm single JVM JBB is working well for me. I see a 30%
> > improvement over autoNUMA. What I can't make sense of is some perf
> > stats (taken at 80 warehouses on 4 x WST-EX, 512GB memory):
> >
>
> I'm curious about possible effects with profiling. Can you rerun just
> this test without any profiling and see if the gain is the same? My own
> tests are running monitors but they only fire every 10 seconds and are
> not running profiles.

After using the patch Hugh provided, I did make a 2nd run, this time
with no profiling at all, and the run was 2% higher. Not sure if this
is due to the absence of profiling or just run-to-run variance, but
nevertheless it is a pretty small difference.

-Andrew Theurer

2012-11-22 20:16:20

by David Rientjes

[permalink] [raw]
Subject: Re: [PATCH 00/27] Latest numa/core release, v16

On Wed, 21 Nov 2012, Ingo Molnar wrote:

> Btw., what I did was to simply look at David's profile on the
> regressing system and I compared it to the profile I got on a
> pretty similar (but unfortunately not identical and not
> regressing) system. I saw 3 differences:
>
> - the numa emulation faults
> - the higher TLB miss cost
> - numa/core's failure to handle 4K pages properly
>
> And addressed those, in the hope of one of them making a
> difference.
>

I agree that it's worth a try and if it's helpful to your debugging then
I'll always commit to trying things out. I've pulled tip#master at
9f7b91a96bb6 ("Merge branch 'x86/urgent'") and performance improves 0.3%
with 16 warehouses with vsyscall=emulate, i.e. a revert of 01e9c2441eee
("x86/vsyscall: Add Kconfig option to use native vsyscalls and switch to
it") so I'd recommend that gets dropped based on my results and Andy's
feedback unless anybody can demonstrate it's better (which very well may
be the case on some systems, but then again that's why it's configurable
from the command line).

You're also completely right about the old glibc, mine is seven years old;
I can upgrade that since I need to install libnuma as well on this box
since you asked for the autonuma topology information that I haven't
forgotten about but will definitely get around to doing.

> There's a fourth line of inquiry I'm pursuing as well: the node
> asymmetry that David and Paul mentioned could have a performance
> effect as well - resulting from non-ideal placement under
> numa/core.
>
> That is not easy to cure - I have written a patch to take the
> node asymmetry into consideration, I'm still testing it with
> David's topology simulated on a testbox:
>
> numa=fake=4:10,20,20,30,20,10,20,20,20,20,10,20,30,20,20,10
>

This very much may be the case and that characteristic of this box is why
I picked it to test with first. Just curious what types of topologies
you've benchmarked on for your results if you have that available? An
early version of sched/numa used to panic on this machine and Peter was
interested in its topology (see https://lkml.org/lkml/2012/5/25/89) so
perhaps I'm the only one testing with such a thing thus far?

2012-11-22 20:45:25

by Ingo Molnar

[permalink] [raw]
Subject: Re: numa/core regressions fixed - more testers wanted


* Alex Shi <[email protected]> wrote:

> >
> > Those of you who would like to test all the latest patches are
> > welcome to pick up latest bits at tip:master:
> >
> > git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git master
> >
>
> I am wondering if it is a problem, but it still exists on HEAD: c418de93e39891
> http://article.gmane.org/gmane.linux.kernel.mm/90131/match=compiled+with+name+pl+and+start+it+on+my
>
> like when just starting 4 pl tasks, often 3 were running on node
> 0, and 1 was running on node 1. The old balancing would evenly
> assign tasks to different nodes, different cores.

This is "normal" in the sense that the current mainline
scheduler is (supposed to be) doing something similar: if the
node is still within capacity, then there's no reason to move
those threads.

OTOH, I think with NUMA balancing we indeed want to spread them
better, if those tasks do not share memory with each other but
use their own memory. If they share memory then they should
remain on the same node if possible.

Thanks,

Ingo

2012-11-22 20:49:55

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH 00/27] Latest numa/core release, v16


* Linus Torvalds <[email protected]> wrote:

> On Wed, Nov 21, 2012 at 7:10 AM, Ingo Molnar <[email protected]> wrote:
> >
> > Because scalability slowdowns are often non-linear.
>
> Only if you hold locks or have other non-cpu-private activity.
>
> Which the vsyscall code really shouldn't have.

Yeah, faults accessing any sort of thread-shared cache line were
my main thinking - the vsyscall faults are so hidden, and
David's transaction score was so low that I could not exclude
some extremely high page fault rate (which would not get
reported by anything other than a strange blip on the profile).
I was thinking of a hundred thousand vsyscall page faults per
second as a possibility - SPECjbb measures time for every
transaction.

So this was just a "maybe-that-has-an-effect" blind theory of
mine - and David's testing did not confirm it so we know it was
a bad idea.

I basically wanted to see a profile from David that looked as
flat as mine - that would have excluded a handful of unknown
unknowns.

> That said, it might be worth removing the
> "prefetchw(&mm->mmap_sem)" from the VM fault path. Partly
> because software prefetches have never ever worked on any
> reasonable hardware, and partly because it could seriously
> screw up things like the vsyscall stuff.

Yeah, I was wondering about that one too ...

> I think we only turn prefetchw into an actual prefetch
> instruction on 3DNOW hardware. Which is the *old* AMD chips. I
> don't think even the Athlon does that.
>
> Anyway, it might be interesting to see an instruction-level
> annotated profile of do_page_fault() or whatever

Yes.

> > So with CONFIG_NUMA_BALANCING=y we are taking a higher page
> > fault rate, in exchange for a speedup.
>
> The thing is, so is autonuma.
>
> And autonuma doesn't show any of these problems. [...]

AutoNUMA regresses on this workload, at least on my box:

v3.7 AutoNUMA | numa/core-v16 [ vs. v3.7]
----- -------- | ------------- -----------
|
[ SPECjbb transactions/sec ] |
[ higher is better ] |
|
SPECjbb single-1x32 524k 507k | 638k +21.7%

It regresses by 3.3% over mainline. [I have not measured a
THP-disabled number for AutoNUMA.]

Maybe it does not regress on David's box - I have just
re-checked all of David's mails and AFAICS he has not reported
AutoNUMA SPECjbb performance.

> Why are you ignoring that fact?

I'm not :-(

Thanks,

Ingo

2012-11-22 23:07:23

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 00/27] Latest numa/core release, v16

On Wed, Nov 21, 2012 at 7:10 AM, Ingo Molnar <[email protected]> wrote:
>
> Because scalability slowdowns are often non-linear.

Only if you hold locks or have other non-cpu-private activity.

Which the vsyscall code really shouldn't have.

That said, it might be worth removing the "prefetchw(&mm->mmap_sem)"
from the VM fault path. Partly because software prefetches have never
ever worked on any reasonable hardware, and partly because it could
seriously screw up things like the vsyscall stuff.

I think we only turn prefetchw into an actual prefetch instruction on
3DNOW hardware. Which is the *old* AMD chips. I don't think even the
Athlon does that.

Anyway, it might be interesting to see an instruction-level annotated
profile of do_page_fault() or whatever
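
For reference, the spot I mean is roughly this - from memory, not a
verbatim excerpt of arch/x86/mm/fault.c:

	tsk = current;
	mm = tsk->mm;

	/* The software prefetch in question, issued before we even know
	   whether mmap_sem will be taken at all: */
	prefetchw(&mm->mmap_sem);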

> So with CONFIG_NUMA_BALANCING=y we are taking a higher page
> fault rate, in exchange for a speedup.

The thing is, so is autonuma.

And autonuma doesn't show any of these problems. Autonuma didn't need
vsyscall hacks, autonuma didn't need TLB flushing optimizations,
autonuma just *worked*, and in fact got big speedups when Mel did the
exact same loads on that same machine, presumably with all the same
issues..

Why are you ignoring that fact?

Linus

2012-11-22 23:24:12

by Ingo Molnar

[permalink] [raw]
Subject: [tip:x86/mm] x86/mm: Don't flush the TLB on #WP pmd fixups

Commit-ID: 5e4bf1a55da976a5ed60901bb8801f1024ef9774
Gitweb: http://git.kernel.org/tip/5e4bf1a55da976a5ed60901bb8801f1024ef9774
Author: Ingo Molnar <[email protected]>
AuthorDate: Tue, 20 Nov 2012 13:02:51 +0100
Committer: Ingo Molnar <[email protected]>
CommitDate: Thu, 22 Nov 2012 21:52:06 +0100

x86/mm: Don't flush the TLB on #WP pmd fixups

If we have a write protection #PF and fix up the pmd then the
hugetlb code [the only user of pmdp_set_access_flags], in its
do_huge_pmd_wp_page() page fault resolution function calls
pmdp_set_access_flags() to mark the pmd permissive again,
and flushes the TLB.

This TLB flush is unnecessary: a flush on #PF is guaranteed on
most (all?) x86 CPUs, and even in the worst-case we'll generate
a spurious fault.

So remove it.

Cc: Linus Torvalds <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Paul Turner <[email protected]>
Cc: Lee Schermerhorn <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Hugh Dickins <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
arch/x86/mm/pgtable.c | 7 ++++++-
1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
index 8573b83..8a828d7 100644
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -328,7 +328,12 @@ int pmdp_set_access_flags(struct vm_area_struct *vma,
if (changed && dirty) {
*pmdp = entry;
pmd_update_defer(vma->vm_mm, address, pmdp);
- flush_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
+ /*
+ * We had a write-protection fault here and changed the pmd
+ * to be more permissive. No need to flush the TLB for that,
+ * #PF is architecturally guaranteed to do that and in the
+ * worst-case we'll generate a spurious fault.
+ */
}

return changed;

2012-11-23 01:26:14

by Alex Shi

[permalink] [raw]
Subject: Re: [PATCH, v2] mm, numa: Turn 4K pte NUMA faults into effective hugepage ones

This patch causes a boot hang on our SNB EP 2-socket machine with some
segmentation faults.
Reverting it recovers booting.

============
[ 8.290147] Freeing unused kernel memory: 1264k freed
[ 8.306140] Freeing unused kernel memory: 1592k freed
[ 8.342668] init[250]: segfault at 20da510 ip 00000000020da510 sp
00007fff26788040 error 15[ 8.350983] usb 2-1: New USB device found,
idVendor=8087, idProduct=0024
[ 8.350987] usb 2-1: New USB device strings: Mfr=0, Product=0, SerialNumber=0
[ 8.351266] hub 2-1:1.0: USB hub found
[ 8.351346] hub 2-1:1.0: 8 ports detected

Segmentation fault
[ 8.626633] usb 2-1.4: new full-speed USB device number 3 using ehci_hcd
[ 8.721391] usb 2-1.4: New USB device found, idVendor=046b, idProduct=ff10
[ 8.729536] usb 2-1.4: New USB device strings: Mfr=1, Product=2,
SerialNumber=3
[ 8.738540] usb 2-1.4: Product: Virtual Keyboard and Mouse
[ 8.745134] usb 2-1.4: Manufacturer: American Megatrends Inc.
[ 8.752026] usb 2-1.4: SerialNumber: serial
[ 8.758877] input: American Megatrends Inc. Virtual Keyboard and
Mouse as /devices/pci0000:00/0000:00:1d.0/usb2/2-1/2-1.4/2-1.4:1.0/input/input1
[ 8.774428] hid-generic 0003:046B:FF10.0001: input,hidraw0: USB HID
v1.10 Keyboard [American Megatrends Inc. Virtual Keyboard and Mouse]
on usb-0000:00:1d.0-1.4/input0
[ 8.793393] input: American Megatrends Inc. Virtual Keyboard and
Mouse as /devices/pci0000:00/0000:00:1d.0/usb2/2-1/2-1.4/2-1.4:1.1/input/input2
[ 8.809140] hid-generic 0003:046B:FF10.0002: input,hidraw1: USB HID
v1.10 Mouse [American Megatrends Inc. Virtual Keyboard and Mouse] on
usb-0000:00:1d.0-1.4/input1
[ 8.899511] usb 2-1.7: new low-speed USB device number 4 using ehci_hcd
[ 9.073473] usb 2-1.7: New USB device found, idVendor=0557, idProduct=2220
[ 9.081633] usb 2-1.7: New USB device strings: Mfr=1, Product=2,
SerialNumber=0
[ 9.090643] usb 2-1.7: Product: ATEN CS-1758/54
[ 9.096258] usb 2-1.7: Manufacturer: ATEN
[ 9.134093] input: ATEN ATEN CS-1758/54 as
/devices/pci0000:00/0000:00:1d.0/usb2/2-1/2-1.7/2-1.7:1.0/input/input3
[ 9.146804] hid-generic 0003:0557:2220.0003: input,hidraw2: USB HID
v1.10 Keyboard [ATEN ATEN CS-1758/54] on usb-0000:00:1d.0-1.7/input0
[ 9.184396] input: ATEN ATEN CS-1758/54 as
/devices/pci0000:00/0000:00:1d.0/usb2/2-1/2-1.7/2-1.7:1.1/input/input4
[ 9.197210] hid-generic 0003:0557:2220.0004: input,hidraw3: USB HID
v1.10 Mouse [ATEN ATEN CS-1758/54] on usb-0000:00:1d.0-1.7/input1

<hang here>

On Wed, Nov 21, 2012 at 12:09 AM, Ingo Molnar <[email protected]> wrote:
>
> Ok, the patch withstood a bit more testing as well. Below is a
> v2 version of it, with a couple of cleanups (no functional
> changes).
>
> Thanks,
>
> Ingo
>
> ----------------->
> Subject: mm, numa: Turn 4K pte NUMA faults into effective hugepage ones
> From: Ingo Molnar <[email protected]>
> Date: Tue Nov 20 15:48:26 CET 2012
>
> Reduce the 4K page fault count by looking around and processing
> nearby pages if possible.
>
> To keep the logic and cache overhead simple and straightforward
> we do a couple of simplifications:
>
> - we only scan in the HPAGE_SIZE range of the faulting address
> - we only go as far as the vma allows us
>
> Also simplify the do_numa_page() flow while at it and fix the
> previous double faulting we incurred due to not properly fixing
> up freshly migrated ptes.
>
> Suggested-by: Mel Gorman <[email protected]>
> Cc: Linus Torvalds <[email protected]>
> Cc: Andrew Morton <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Andrea Arcangeli <[email protected]>
> Cc: Rik van Riel <[email protected]>
> Cc: Hugh Dickins <[email protected]>
> Signed-off-by: Ingo Molnar <[email protected]>
> ---
> mm/memory.c | 99 ++++++++++++++++++++++++++++++++++++++----------------------
> 1 file changed, 64 insertions(+), 35 deletions(-)
>
> Index: linux/mm/memory.c
> ===================================================================
> --- linux.orig/mm/memory.c
> +++ linux/mm/memory.c
> @@ -3455,64 +3455,93 @@ static int do_nonlinear_fault(struct mm_
> return __do_fault(mm, vma, address, pmd, pgoff, flags, orig_pte);
> }
>
> -static int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
> +static int __do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
> unsigned long address, pte_t *ptep, pmd_t *pmd,
> - unsigned int flags, pte_t entry)
> + unsigned int flags, pte_t entry, spinlock_t *ptl)
> {
> - struct page *page = NULL;
> - int node, page_nid = -1;
> - int last_cpu = -1;
> - spinlock_t *ptl;
> -
> - ptl = pte_lockptr(mm, pmd);
> - spin_lock(ptl);
> - if (unlikely(!pte_same(*ptep, entry)))
> - goto out_unlock;
> + struct page *page;
> + int new_node;
>
> page = vm_normal_page(vma, address, entry);
> if (page) {
> - get_page(page);
> - page_nid = page_to_nid(page);
> - last_cpu = page_last_cpu(page);
> - node = mpol_misplaced(page, vma, address);
> - if (node != -1 && node != page_nid)
> + int page_nid = page_to_nid(page);
> + int last_cpu = page_last_cpu(page);
> +
> + task_numa_fault(page_nid, last_cpu, 1);
> +
> + new_node = mpol_misplaced(page, vma, address);
> + if (new_node != -1 && new_node != page_nid)
> goto migrate;
> }
>
> -out_pte_upgrade_unlock:
> +out_pte_upgrade:
> flush_cache_page(vma, address, pte_pfn(entry));
> -
> ptep_modify_prot_start(mm, address, ptep);
> entry = pte_modify(entry, vma->vm_page_prot);
> + if (pte_dirty(entry))
> + entry = pte_mkwrite(entry);
> ptep_modify_prot_commit(mm, address, ptep, entry);
> -
> /* No TLB flush needed because we upgraded the PTE */
> -
> update_mmu_cache(vma, address, ptep);
> -
> -out_unlock:
> - pte_unmap_unlock(ptep, ptl);
> -
> - if (page) {
> - task_numa_fault(page_nid, last_cpu, 1);
> - put_page(page);
> - }
> out:
> return 0;
>
> migrate:
> + get_page(page);
> pte_unmap_unlock(ptep, ptl);
>
> - if (migrate_misplaced_page(page, node)) {
> + migrate_misplaced_page(page, new_node); /* Drops the page reference */
> +
> + /* Re-check after migration: */
> +
> + ptl = pte_lockptr(mm, pmd);
> + spin_lock(ptl);
> + entry = ACCESS_ONCE(*ptep);
> +
> + if (!pte_numa(vma, entry))
> goto out;
> - }
> - page = NULL;
>
> - ptep = pte_offset_map_lock(mm, pmd, address, &ptl);
> - if (!pte_same(*ptep, entry))
> - goto out_unlock;
> + goto out_pte_upgrade;
> +}
> +
> +/*
> + * Add a simple loop to also fetch ptes within the same pmd:
> + */
> +static int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
> + unsigned long addr0, pte_t *ptep0, pmd_t *pmd,
> + unsigned int flags, pte_t entry0)
> +{
> + unsigned long addr0_pmd;
> + unsigned long addr_start;
> + unsigned long addr;
> + spinlock_t *ptl;
> + pte_t *ptep;
> +
> + addr0_pmd = addr0 & PMD_MASK;
> + addr_start = max(addr0_pmd, vma->vm_start);
>
> - goto out_pte_upgrade_unlock;
> + ptep = pte_offset_map(pmd, addr_start);
> + ptl = pte_lockptr(mm, pmd);
> + spin_lock(ptl);
> +
> + for (addr = addr_start; addr < vma->vm_end; addr += PAGE_SIZE, ptep++) {
> + pte_t entry;
> +
> + entry = ACCESS_ONCE(*ptep);
> +
> + if ((addr & PMD_MASK) != addr0_pmd)
> + break;
> + if (!pte_present(entry))
> + continue;
> + if (!pte_numa(vma, entry))
> + continue;
> +
> + __do_numa_page(mm, vma, addr, ptep, pmd, flags, entry, ptl);
> + }
> +
> + pte_unmap_unlock(ptep, ptl);
> +
> + return 0;
> }
>
> /*
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to [email protected]. For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"[email protected]"> [email protected] </a>



--
Thanks
Alex

2012-11-23 13:31:51

by Ingo Molnar

[permalink] [raw]
Subject: Re: numa/core regressions fixed - more testers wanted


* Ingo Molnar <[email protected]> wrote:

> * Alex Shi <[email protected]> wrote:
>
> > >
> > > Those of you who would like to test all the latest patches are
> > > welcome to pick up latest bits at tip:master:
> > >
> > > git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git master
> > >
> >
> > I am wondering if it is a problem, but it still exists on HEAD: c418de93e39891
> > http://article.gmane.org/gmane.linux.kernel.mm/90131/match=compiled+with+name+pl+and+start+it+on+my
> >
> > like when just starting 4 pl tasks, often 3 were running on node
> > 0, and 1 was running on node 1. The old balancing would evenly
> > assign tasks to different nodes, different cores.
>
> This is "normal" in the sense that the current mainline
> scheduler is (supposed to be) doing something similar: if the
> node is still within capacity, then there's no reason to move
> those threads.
>
> OTOH, I think with NUMA balancing we indeed want to spread
> them better, if those tasks do not share memory with each
> other but use their own memory. If they share memory then they
> should remain on the same node if possible.

Could you please check tip:master with -v17:

git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git master

?

It should place your workload better than v16 did.

Note, you might be able to find other combinations of tasks that
are not scheduled NUMA-perfectly yet, as task group placement is
not exhaustive yet.

You might want to check which combination looks the weirdest to
you and report it, so I can fix any remaining placement
inefficiencies in order of importance.

Thanks,

Ingo

2012-11-23 15:23:12

by Alex Shi

[permalink] [raw]
Subject: Re: numa/core regressions fixed - more testers wanted

On Fri, Nov 23, 2012 at 9:31 PM, Ingo Molnar <[email protected]> wrote:
>
> * Ingo Molnar <[email protected]> wrote:
>
>> * Alex Shi <[email protected]> wrote:
>>
>> > >
>> > > Those of you who would like to test all the latest patches are
>> > > welcome to pick up latest bits at tip:master:
>> > >
>> > > git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git master
>> > >
>> >
>> > I am wondering if it is a problem, but it still exists on HEAD: c418de93e39891
>> > http://article.gmane.org/gmane.linux.kernel.mm/90131/match=compiled+with+name+pl+and+start+it+on+my
>> >
>> > like when just starting 4 pl tasks, often 3 were running on node
>> > 0, and 1 was running on node 1. The old balancing would evenly
>> > assign tasks to different nodes, different cores.
>>
>> This is "normal" in the sense that the current mainline
>> scheduler is (supposed to be) doing something similar: if the
>> node is still within capacity, then there's no reason to move
>> those threads.
>>
>> OTOH, I think with NUMA balancing we indeed want to spread
>> them better, if those tasks do not share memory with each
>> other but use their own memory. If they share memory then they
>> should remain on the same node if possible.

There is no shared memory between them.
>
> Could you please check tip:master with -v17:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git master
>
> ?
>
> It should place your workload better than v16 did.

OK, I will try it next Monday, if that is not too late for you.
>
> Note, you might be able to find other combinations of tasks that
> are not scheduled NUMA-perfectly yet, as task group placement is
> not exhaustive yet.

I am not familiar with task groups, but anyway, I will try it too.
>
> You might want to check which combination looks the weirdest to
> you and report it, so I can fix any remaining placement
> inefficiencies in order of importance.

Any suggestions for combinations?
>
> Thanks,
>
> Ingo



--
Thanks
Alex

2012-11-26 02:11:22

by Alex Shi

[permalink] [raw]
Subject: Re: numa/core regressions fixed - more testers wanted

On Fri, Nov 23, 2012 at 9:31 PM, Ingo Molnar <[email protected]> wrote:
>
> * Ingo Molnar <[email protected]> wrote:
>
>> * Alex Shi <[email protected]> wrote:
>>
>> > >
>> > > Those of you who would like to test all the latest patches are
>> > > welcome to pick up latest bits at tip:master:
>> > >
>> > > git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git master
>> > >
>> >
>> > I am wondering if it is a problem, but it still exists on HEAD: c418de93e39891
>> > http://article.gmane.org/gmane.linux.kernel.mm/90131/match=compiled+with+name+pl+and+start+it+on+my
>> >
>> > like when just starting 4 pl tasks, often 3 were running on node
>> > 0, and 1 was running on node 1. The old balancing would evenly
>> > assign tasks to different nodes, different cores.
>>
>> This is "normal" in the sense that the current mainline
>> scheduler is (supposed to be) doing something similar: if the
>> node is still within capacity, then there's no reason to move
>> those threads.
>>
>> OTOH, I think with NUMA balancing we indeed want to spread
>> them better, if those tasks do not share memory with each
>> other but use their own memory. If they share memory then they
>> should remain on the same node if possible.

I rewrote the little test case in assembly:
==
.text

.global _start

_start:

do_nop:
nop
nop
jmp do_nop
==
It reproduced the problem on the latest tip/master, HEAD: 7cb989d0159a6f43104992f18:
with 4 of the above tasks running, 3 of them ran on node 0 and one
ran on node 1.

If the kernel can detect that aggregating tasks within a shared LLC is
acceptable, it's a nice feature. If not, the aggregation may cause more
cache misses.


>
> Could you please check tip:master with -v17:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git master
>
> ?
>
> It should place your workload better than v16 did.
>
> Note, you might be able to find other combinations of tasks that
> are not scheduled NUMA-perfectly yet, as task group placement is
> not exhaustive yet.
>
> You might want to check which combination looks the weirdest to
> you and report it, so I can fix any remaining placement
> inefficiencies in order of importance.
>
> Thanks,
>
> Ingo



--
Thanks
Alex

2012-11-28 14:21:18

by Alex Shi

[permalink] [raw]
Subject: Re: numa/core regressions fixed - more testers wanted

>
>>
>> Could you please check tip:master with -v17:
>>
>> git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git master
>>

Tested this version on our SNB EP 2-socket box, 8 cores * HT, with
specjbb2005 on JRockit.
With the single JVM setting it shows a 40% performance increase compared
to 3.7-rc6. Impressive!