Hello everyone,
It's time for a new autonuma-alpha14 milestone.
Removed the [RFC] from the Subject because 1) this is a release I'm quite
happy with (on the implementation side it allows the same kernel
image to boot optimally on NUMA and non-NUMA hardware and it avoids
altering the scheduler runtime most of the time) and 2) because of the
great benchmark results we got so far, which show this design has so
far proven to perform best.
I believe (realistically speaking) nobody is going to change
applications to specify which thread is using which memory (for
threaded apps), with the only exception of QEMU and a few others.
For non-threaded apps that fit in a NUMA node, there's no way a blind
home node can perform nearly as well as AutoNUMA: AutoNUMA monitors
the whole memory status of the running processes and dynamically
optimizes the memory placement and CPU placement
accordingly. There's a small memory and CPU cost in collecting so much
information to be able to make smart decisions, but the benefits
largely outweigh those costs.
If a big task was idle for a long while but suddenly starts
computing, AutoNUMA may totally change the memory and CPU placement of
the other running tasks according to what's best, because it has
enough information to make optimal NUMA placement decisions.
git clone --reference linux -b autonuma-alpha14 git://git.kernel.org/pub/scm/linux/kernel/git/andrea/aa.git autonuma-alpha14
Development autonuma branch (currently equal to autonuma-alpha14 ==
a49fedcc284a8e8b47175fbc23e9d3b075884e53):
git clone --reference linux -b autonuma git://git.kernel.org/pub/scm/linux/kernel/git/andrea/aa.git
to update: git fetch; git checkout -f origin/autonuma
Changelog from alpha13 to alpha14:
o page_autonuma introduction: no memory is wasted if the kernel is
booted on non-NUMA hardware. Tested with flatmem/sparsemem on x86 with
autonuma=y/n and sparsemem/vsparsemem on x86_64 with autonuma=y/n.
The "noautonuma" kernel param disables AutoNUMA permanently, even
when booted on NUMA hardware (no /sys/kernel/mm/autonuma, and no
page_autonuma allocations, similar to cgroup_disable=memory)
o autonuma_balance now only runs along with run_rebalance_domains, to
avoid altering the scheduler runtime. autonuma_balance gives a
"kick" to the scheduler only during the load balance events (it
overrides the load balance activity if needed). This change has not
yet been tested on specjbb or more scheduler-intensive benchmarks,
but I don't expect measurable NUMA affinity regressions. For
compute-intensive loads not involving a flood of scheduling activity
this has already been verified not to show any performance
regression, and it will boost the scheduler performance compared to
previous autonuma releases.
Note: autonuma_balance still runs from normal context (not softirq
context like run_rebalance_domains) so that it can wait on process
migration (avoiding _nowait), but most of the time it does nothing at
all.
Changelog from alpha11 to alpha13:
o autonuma_balance optimization (take the fast path when the process
is in the preferred NUMA node)
TODO:
o THP native migration (orthogonal and also needed for
cpuset/migrate_pages(2)/numa/sched).
o distribute pagecache to other nodes (and maybe shared memory or
other movable memory) if knuma_migrated stops because the local node
is full
Andrea Arcangeli (35):
mm: add unlikely to the mm allocation failure check
autonuma: make set_pmd_at always available
xen: document Xen is using an unused bit for the pagetables
autonuma: define _PAGE_NUMA_PTE and _PAGE_NUMA_PMD
autonuma: x86 pte_numa() and pmd_numa()
autonuma: generic pte_numa() and pmd_numa()
autonuma: teach gup_fast about pte_numa
autonuma: introduce kthread_bind_node()
autonuma: mm_autonuma and sched_autonuma data structures
autonuma: define the autonuma flags
autonuma: core autonuma.h header
autonuma: CPU follow memory algorithm
autonuma: add page structure fields
autonuma: knuma_migrated per NUMA node queues
autonuma: init knuma_migrated queues
autonuma: autonuma_enter/exit
autonuma: call autonuma_setup_new_exec()
autonuma: alloc/free/init sched_autonuma
autonuma: alloc/free/init mm_autonuma
autonuma: avoid CFS select_task_rq_fair to return -1
autonuma: teach CFS about autonuma affinity
autonuma: sched_set_autonuma_need_balance
autonuma: core
autonuma: follow_page check for pte_numa/pmd_numa
autonuma: default mempolicy follow AutoNUMA
autonuma: call autonuma_split_huge_page()
autonuma: make khugepaged pte_numa aware
autonuma: retain page last_nid information in khugepaged
autonuma: numa hinting page faults entry points
autonuma: reset autonuma page data when pages are freed
autonuma: initialize page structure fields
autonuma: link mm/autonuma.o and kernel/sched/numa.o
autonuma: add CONFIG_AUTONUMA and CONFIG_AUTONUMA_DEFAULT_ENABLED
autonuma: boost khugepaged scanning rate
autonuma: page_autonuma
arch/x86/include/asm/paravirt.h | 2 -
arch/x86/include/asm/pgtable.h | 51 ++-
arch/x86/include/asm/pgtable_types.h | 22 +-
arch/x86/mm/gup.c | 2 +-
fs/exec.c | 3 +
include/asm-generic/pgtable.h | 12 +
include/linux/autonuma.h | 53 ++
include/linux/autonuma_flags.h | 68 ++
include/linux/autonuma_sched.h | 50 ++
include/linux/autonuma_types.h | 88 ++
include/linux/huge_mm.h | 2 +
include/linux/kthread.h | 1 +
include/linux/mm_types.h | 5 +
include/linux/mmzone.h | 18 +
include/linux/page_autonuma.h | 53 ++
include/linux/sched.h | 3 +
init/main.c | 2 +
kernel/fork.c | 36 +-
kernel/kthread.c | 23 +
kernel/sched/Makefile | 1 +
kernel/sched/core.c | 12 +-
kernel/sched/fair.c | 72 ++-
kernel/sched/numa.c | 281 +++++++
kernel/sched/sched.h | 10 +
mm/Kconfig | 13 +
mm/Makefile | 1 +
mm/autonuma.c | 1464 ++++++++++++++++++++++++++++++++++
mm/huge_memory.c | 58 ++-
mm/memory.c | 36 +-
mm/mempolicy.c | 15 +-
mm/mmu_context.c | 2 +
mm/page_alloc.c | 6 +-
mm/page_autonuma.c | 234 ++++++
mm/sparse.c | 126 +++-
34 files changed, 2776 insertions(+), 49 deletions(-)
create mode 100644 include/linux/autonuma.h
create mode 100644 include/linux/autonuma_flags.h
create mode 100644 include/linux/autonuma_sched.h
create mode 100644 include/linux/autonuma_types.h
create mode 100644 include/linux/page_autonuma.h
create mode 100644 kernel/sched/numa.c
create mode 100644 mm/autonuma.c
create mode 100644 mm/page_autonuma.c
Invoke autonuma_balance only on the busy CPUs, at the same frequency
as the CFS load balance.
Signed-off-by: Andrea Arcangeli <[email protected]>
---
kernel/sched/fair.c | 3 +++
1 files changed, 3 insertions(+), 0 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 99d1d33..1357938 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4893,6 +4893,9 @@ static void run_rebalance_domains(struct softirq_action *h)
rebalance_domains(this_cpu, idle);
+ if (!this_rq->idle_balance)
+ sched_set_autonuma_need_balance();
+
/*
* If this cpu has a pending nohz_balance_kick, then do the
* balancing on behalf of the other idle cpus whose ticks are
Add the config options needed to build the kernel with AutoNUMA.
If CONFIG_AUTONUMA_DEFAULT_ENABLED is "=y", then
/sys/kernel/mm/autonuma/enabled will be equal to 1, and AutoNUMA will
be enabled automatically at boot.
CONFIG_AUTONUMA currently depends on X86, because no other arch
implements pte/pmd_numa yet and selecting =y would result in a
failed build; this restriction shall be relaxed in the future. Porting
AutoNUMA to other archs should be pretty simple.
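For reference, the pte side of a port boils down to providing helpers
like the ones in the sketch below. This is a self-contained user-space
illustration with a fake pte representation and made-up bit names
(SKETCH_PAGE_*); the real x86 helpers earlier in this series pick an
actual spare pagetable bit and an arch is free to choose differently.
The essential property is only that a pte_numa pte is not present, so
the next access triggers a numa hinting page fault:

#include <stdio.h>

typedef unsigned long pteval_t;

/* Fake bits for the sketch only; each arch picks its own spare bit. */
#define SKETCH_PAGE_PRESENT (1UL << 0)
#define SKETCH_PAGE_NUMA    (1UL << 1)

static inline int pte_numa(pteval_t pte)
{
	/* numa means: not present, but marked as a NUMA hinting pte */
	return (pte & (SKETCH_PAGE_NUMA|SKETCH_PAGE_PRESENT)) == SKETCH_PAGE_NUMA;
}

static inline pteval_t pte_mknuma(pteval_t pte)
{
	/* knuma_scand: make the pte fault on the next access */
	return (pte | SKETCH_PAGE_NUMA) & ~SKETCH_PAGE_PRESENT;
}

static inline pteval_t pte_mknotnuma(pteval_t pte)
{
	/* numa hinting fault: restore the pte after accounting the access */
	return (pte | SKETCH_PAGE_PRESENT) & ~SKETCH_PAGE_NUMA;
}

int main(void)
{
	pteval_t pte = SKETCH_PAGE_PRESENT;

	pte = pte_mknuma(pte);
	printf("after mknuma: pte_numa=%d\n", pte_numa(pte));
	pte = pte_mknotnuma(pte);
	printf("after mknotnuma: pte_numa=%d\n", pte_numa(pte));
	return 0;
}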
Signed-off-by: Andrea Arcangeli <[email protected]>
---
mm/Kconfig | 13 +++++++++++++
1 files changed, 13 insertions(+), 0 deletions(-)
diff --git a/mm/Kconfig b/mm/Kconfig
index e338407..cbfdb15 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -207,6 +207,19 @@ config MIGRATION
pages as migration can relocate pages to satisfy a huge page
allocation instead of reclaiming.
+config AUTONUMA
+ bool "Auto NUMA"
+ select MIGRATION
+ depends on NUMA && X86
+ help
+ Automatic NUMA CPU scheduling and memory migration.
+
+config AUTONUMA_DEFAULT_ENABLED
+ bool "Auto NUMA default enabled"
+ depends on AUTONUMA
+ help
+ Automatic NUMA CPU scheduling and memory migration enabled at boot.
+
config PHYS_ADDR_T_64BIT
def_bool 64BIT || ARCH_PHYS_ADDR_T_64BIT
This algorithm takes as input the statistical information filled in by
knuma_scand (mm->mm_autonuma) and by the NUMA hinting page faults
(p->sched_autonuma), evaluates it for the currently scheduled task, and
compares it against every other running process to see if it should
move the current task to another NUMA node.
For example, if there's an idle CPU in the NUMA node where the current
task prefers to be scheduled (according to the mm_autonuma and
sched_autonuma data structures), the task will be migrated there
instead of being kept running on the current CPU.
When the scheduler decides if the task should be migrated to a
different NUMA node or stay in the same NUMA node, the decision is
stored into p->sched_autonuma->autonuma_node. The fair scheduler
then tries to keep the task on that autonuma_node too.
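As a rough illustration, here is a standalone user-space sketch of the
comparison only, assuming two nodes and the same 0..1000 scaling
(AUTONUMA_BALANCE_SCALE) that the kernel code below uses; the real
implementation in kernel/sched/numa.c walks the runqueues and picks
per-thread or per-process statistics depending on what the remote CPU
is running:

#include <stdio.h>

#define NR_NODES 2

/* Return the node the current task should move to, or -1 to stay put. */
static int pick_node(const long our_weight[NR_NODES], int this_nid,
		     const long other_weight[NR_NODES])
{
	long w_cpu_nid = our_weight[this_nid];	/* our affinity to the home node */
	long best_delta = 0;
	int nid, selected = -1;

	for (nid = 0; nid < NR_NODES; nid++) {
		long w_nid = our_weight[nid];	  /* our affinity to the candidate node */
		long w_other = other_weight[nid]; /* affinity of the task we'd displace */

		if (nid == this_nid)
			continue;
		/* migrate only if we gain more than the displaced task loses */
		if (w_nid > w_other && w_nid > w_cpu_nid) {
			long delta = (w_nid - w_other) + (w_nid - w_cpu_nid);

			if (delta > best_delta) {
				best_delta = delta;
				selected = nid;
			}
		}
	}
	return selected;
}

int main(void)
{
	long ours[NR_NODES]   = { 200, 800 };	/* 80% of our faults hit node 1 */
	long theirs[NR_NODES] = { 700, 300 };	/* task on node 1 barely uses it */

	printf("move to node %d\n", pick_node(ours, 0, theirs));
	return 0;
}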
The code includes fixes and cleanups from Hillf Danton <[email protected]>.
Signed-off-by: Andrea Arcangeli <[email protected]>
---
include/linux/autonuma_sched.h | 50 +++++++
include/linux/mm_types.h | 5 +
include/linux/sched.h | 3 +
kernel/sched/core.c | 12 +-
kernel/sched/numa.c | 281 ++++++++++++++++++++++++++++++++++++++++
kernel/sched/sched.h | 10 ++
6 files changed, 353 insertions(+), 8 deletions(-)
create mode 100644 include/linux/autonuma_sched.h
create mode 100644 kernel/sched/numa.c
diff --git a/include/linux/autonuma_sched.h b/include/linux/autonuma_sched.h
new file mode 100644
index 0000000..9a4d945
--- /dev/null
+++ b/include/linux/autonuma_sched.h
@@ -0,0 +1,50 @@
+#ifndef _LINUX_AUTONUMA_SCHED_H
+#define _LINUX_AUTONUMA_SCHED_H
+
+#include <linux/autonuma_flags.h>
+
+static inline bool task_autonuma_cpu(struct task_struct *p, int cpu)
+{
+#ifdef CONFIG_AUTONUMA
+ int autonuma_node;
+ struct sched_autonuma *sched_autonuma = p->sched_autonuma;
+
+ if (!sched_autonuma)
+ return true;
+
+ autonuma_node = ACCESS_ONCE(sched_autonuma->autonuma_node);
+ if (autonuma_node < 0 || autonuma_node == cpu_to_node(cpu))
+ return true;
+ else
+ return false;
+#else
+ return true;
+#endif
+}
+
+static inline void sched_set_autonuma_need_balance(void)
+{
+#ifdef CONFIG_AUTONUMA
+ struct sched_autonuma *sa = current->sched_autonuma;
+
+ if (sa && current->mm)
+ sa->autonuma_flags |= SCHED_AUTONUMA_FLAG_NEED_BALANCE;
+#endif
+}
+
+#ifdef CONFIG_AUTONUMA
+extern void sched_autonuma_balance(void);
+extern bool sched_autonuma_can_migrate_task(struct task_struct *p,
+ int numa, int dst_cpu,
+ enum cpu_idle_type idle);
+#else /* CONFIG_AUTONUMA */
+static inline void sched_autonuma_balance(void) {}
+static inline bool sched_autonuma_can_migrate_task(struct task_struct *p,
+ int numa, int dst_cpu,
+ enum cpu_idle_type idle)
+{
+ return true;
+}
+#endif /* CONFIG_AUTONUMA */
+
+#endif /* _LINUX_AUTONUMA_SCHED_H */
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 26574c7..780ded7 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -13,6 +13,7 @@
#include <linux/cpumask.h>
#include <linux/page-debug-flags.h>
#include <linux/uprobes.h>
+#include <linux/autonuma_types.h>
#include <asm/page.h>
#include <asm/mmu.h>
@@ -390,6 +391,10 @@ struct mm_struct {
struct cpumask cpumask_allocation;
#endif
struct uprobes_state uprobes_state;
+#ifdef CONFIG_AUTONUMA
+ /* this is used by the scheduler and the page allocator */
+ struct mm_autonuma *mm_autonuma;
+#endif
};
static inline void mm_init_cpumask(struct mm_struct *mm)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index f45c0b2..60a699c 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1507,6 +1507,9 @@ struct task_struct {
struct mempolicy *mempolicy; /* Protected by alloc_lock */
short il_next;
short pref_node_fork;
+#ifdef CONFIG_AUTONUMA
+ struct sched_autonuma *sched_autonuma;
+#endif
#endif
struct rcu_head rcu;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 39eb601..e3e4c99 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -72,6 +72,7 @@
#include <linux/slab.h>
#include <linux/init_task.h>
#include <linux/binfmts.h>
+#include <linux/autonuma_sched.h>
#include <asm/switch_to.h>
#include <asm/tlb.h>
@@ -1117,13 +1118,6 @@ void set_task_cpu(struct task_struct *p, unsigned int new_cpu)
__set_task_cpu(p, new_cpu);
}
-struct migration_arg {
- struct task_struct *task;
- int dest_cpu;
-};
-
-static int migration_cpu_stop(void *data);
-
/*
* wait_task_inactive - wait for a thread to unschedule.
*
@@ -3274,6 +3268,8 @@ need_resched:
post_schedule(rq);
+ sched_autonuma_balance();
+
sched_preempt_enable_no_resched();
if (need_resched())
goto need_resched;
@@ -5106,7 +5102,7 @@ fail:
* and performs thread migration by bumping thread off CPU then
* 'pushing' onto another runqueue.
*/
-static int migration_cpu_stop(void *data)
+int migration_cpu_stop(void *data)
{
struct migration_arg *arg = data;
diff --git a/kernel/sched/numa.c b/kernel/sched/numa.c
new file mode 100644
index 0000000..499a197
--- /dev/null
+++ b/kernel/sched/numa.c
@@ -0,0 +1,281 @@
+/*
+ * Copyright (C) 2012 Red Hat, Inc.
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2. See
+ * the COPYING file in the top-level directory.
+ */
+
+#include <linux/sched.h>
+#include <linux/autonuma_sched.h>
+#include <asm/tlb.h>
+
+#include "sched.h"
+
+#define AUTONUMA_BALANCE_SCALE 1000
+
+enum {
+ W_TYPE_THREAD,
+ W_TYPE_PROCESS,
+};
+
+/*
+ * This function is responsible for deciding which is the best CPU
+ * each process should be running on according to the NUMA
+ * affinity. To do that it evaluates all CPUs and checks if there's
+ * any remote CPU where the current process has more NUMA affinity
+ * than with the current CPU, and where the process running on the
+ * remote CPU has less NUMA affinity than the current process to run
+ * on the remote CPU. Ideally this should be expanded to take all
+ * runnable processes into account but this is a good
+ * approximation. When we compare the NUMA affinity between the
+ * current and remote CPU we use the per-thread information if the
+ * remote CPU runs a thread of the same process that the current task
+ * belongs to, or the per-process information if the remote CPU runs a
+ * different process than the current one. If the remote CPU runs the
+ * idle task we require both the per-thread and per-process
+ * information to have more affinity with the remote CPU than with the
+ * current CPU for a migration to happen.
+ *
+ * This has O(N) complexity but N isn't the number of running
+ * processes, but the number of CPUs, so if you assume a constant
+ * number of CPUs (capped at NR_CPUS) it is O(1). O(1) misleading math
+ * aside, the number of cachelines touched with thousands of CPU might
+ * make it measurable. Calling this at every schedule may also be
+ * overkill and it may be enough to call it with a frequency similar
+ * to the load balancing, but by doing so we're also verifying the
+ * algorithm is a converging one in all workloads if performance is
+ * improved and there's no frequent CPU migration, so it's good in the
+ * short term for stressing the algorithm.
+ */
+void sched_autonuma_balance(void)
+{
+ int cpu, nid, selected_cpu, selected_nid;
+ int cpu_nid = numa_node_id();
+ int this_cpu = smp_processor_id();
+ unsigned long p_w, p_t, m_w, m_t, p_w_max, m_w_max;
+ unsigned long weight_delta_max, weight;
+ long s_w_nid = -1, s_w_cpu_nid = -1, s_w_other = -1;
+ int s_w_type = -1;
+ struct cpumask *allowed;
+ struct migration_arg arg;
+ struct task_struct *p = current;
+ struct sched_autonuma *sched_autonuma = p->sched_autonuma;
+
+ /* per-cpu statically allocated in runqueues */
+ long *weight_current;
+ long *weight_current_mm;
+
+ if (!sched_autonuma || !p->mm)
+ return;
+
+ if (!(sched_autonuma->autonuma_flags &
+ SCHED_AUTONUMA_FLAG_NEED_BALANCE))
+ return;
+ else
+ sched_autonuma->autonuma_flags &=
+ ~SCHED_AUTONUMA_FLAG_NEED_BALANCE;
+
+ if (sched_autonuma->autonuma_flags & SCHED_AUTONUMA_FLAG_STOP_ONE_CPU)
+ return;
+
+ if (!autonuma_enabled()) {
+ if (sched_autonuma->autonuma_node != -1)
+ sched_autonuma->autonuma_node = -1;
+ return;
+ }
+
+ allowed = tsk_cpus_allowed(p);
+
+ m_t = ACCESS_ONCE(p->mm->mm_autonuma->numa_fault_tot);
+ p_t = sched_autonuma->numa_fault_tot;
+ /*
+ * If a process still misses the per-thread or per-process
+ * information skip it.
+ */
+ if (!m_t || !p_t)
+ return;
+
+ weight_current = cpu_rq(this_cpu)->weight_current;
+ weight_current_mm = cpu_rq(this_cpu)->weight_current_mm;
+
+ p_w_max = m_w_max = 0;
+ selected_nid = -1;
+ for_each_online_node(nid) {
+ int hits = 0;
+ m_w = ACCESS_ONCE(p->mm->mm_autonuma->numa_fault[nid]);
+ p_w = sched_autonuma->numa_fault[nid];
+ if (m_w > m_t)
+ m_t = m_w;
+ weight_current_mm[nid] = m_w*AUTONUMA_BALANCE_SCALE/m_t;
+ if (p_w > p_t)
+ p_t = p_w;
+ weight_current[nid] = p_w*AUTONUMA_BALANCE_SCALE/p_t;
+ if (weight_current_mm[nid] > m_w_max) {
+ m_w_max = weight_current_mm[nid];
+ hits++;
+ }
+ if (weight_current[nid] > p_w_max) {
+ p_w_max = weight_current[nid];
+ hits++;
+ }
+ if (hits == 2)
+ selected_nid = nid;
+ }
+ if (selected_nid == cpu_nid) {
+ if (sched_autonuma->autonuma_node != selected_nid)
+ sched_autonuma->autonuma_node = selected_nid;
+ return;
+ }
+
+ selected_cpu = this_cpu;
+ selected_nid = cpu_nid;
+ weight = weight_delta_max = 0;
+
+ for_each_online_node(nid) {
+ if (nid == cpu_nid)
+ continue;
+ for_each_cpu_and(cpu, cpumask_of_node(nid), allowed) {
+ long w_nid, w_cpu_nid, w_other;
+ int w_type;
+ struct mm_struct *mm;
+ struct rq *rq = cpu_rq(cpu);
+ if (!cpu_online(cpu))
+ continue;
+
+ if (idle_cpu(cpu))
+ /*
+ * Offload the whole IDLE balancing
+ * and physical / logical imbalances
+ * to CFS.
+ */
+ continue;
+
+ mm = rq->curr->mm;
+ if (!mm)
+ continue;
+ raw_spin_lock_irq(&rq->lock);
+ /* recheck after implicit barrier() */
+ mm = rq->curr->mm;
+ if (!mm) {
+ raw_spin_unlock_irq(&rq->lock);
+ continue;
+ }
+ m_t = ACCESS_ONCE(mm->mm_autonuma->numa_fault_tot);
+ p_t = rq->curr->sched_autonuma->numa_fault_tot;
+ if (!m_t || !p_t) {
+ raw_spin_unlock_irq(&rq->lock);
+ continue;
+ }
+ m_w = ACCESS_ONCE(mm->mm_autonuma->numa_fault[nid]);
+ p_w = rq->curr->sched_autonuma->numa_fault[nid];
+ raw_spin_unlock_irq(&rq->lock);
+ if (mm == p->mm) {
+ if (p_w > p_t)
+ p_t = p_w;
+ w_other = p_w*AUTONUMA_BALANCE_SCALE/p_t;
+ w_nid = weight_current[nid];
+ w_cpu_nid = weight_current[cpu_nid];
+ w_type = W_TYPE_THREAD;
+ } else {
+ if (m_w > m_t)
+ m_t = m_w;
+ w_other = m_w*AUTONUMA_BALANCE_SCALE/m_t;
+ w_nid = weight_current_mm[nid];
+ w_cpu_nid = weight_current_mm[cpu_nid];
+ w_type = W_TYPE_PROCESS;
+ }
+
+ if (w_nid > w_other && w_nid > w_cpu_nid) {
+ weight = w_nid - w_other + w_nid - w_cpu_nid;
+
+ if (weight > weight_delta_max) {
+ weight_delta_max = weight;
+ selected_cpu = cpu;
+ selected_nid = nid;
+
+ s_w_other = w_other;
+ s_w_nid = w_nid;
+ s_w_cpu_nid = w_cpu_nid;
+ s_w_type = w_type;
+ }
+ }
+ }
+ }
+
+ if (sched_autonuma->autonuma_node != selected_nid)
+ sched_autonuma->autonuma_node = selected_nid;
+ if (selected_cpu != this_cpu) {
+ if (autonuma_debug()) {
+ char *w_type_str = NULL;
+ switch (s_w_type) {
+ case W_TYPE_THREAD:
+ w_type_str = "thread";
+ break;
+ case W_TYPE_PROCESS:
+ w_type_str = "process";
+ break;
+ }
+ printk("%p %d - %dto%d - %dto%d - %ld %ld %ld - %s\n",
+ p->mm, p->pid, cpu_nid, selected_nid,
+ this_cpu, selected_cpu,
+ s_w_other, s_w_nid, s_w_cpu_nid,
+ w_type_str);
+ }
+ BUG_ON(cpu_nid == selected_nid);
+ goto found;
+ }
+
+ return;
+
+found:
+ arg = (struct migration_arg) { p, selected_cpu };
+ /* Need help from migration thread: drop lock and wait. */
+ sched_autonuma->autonuma_flags |= SCHED_AUTONUMA_FLAG_STOP_ONE_CPU;
+ sched_preempt_enable_no_resched();
+ stop_one_cpu(this_cpu, migration_cpu_stop, &arg);
+ preempt_disable();
+ sched_autonuma->autonuma_flags &= ~SCHED_AUTONUMA_FLAG_STOP_ONE_CPU;
+ tlb_migrate_finish(p->mm);
+}
+
+bool sched_autonuma_can_migrate_task(struct task_struct *p,
+ int numa, int dst_cpu,
+ enum cpu_idle_type idle)
+{
+ if (!task_autonuma_cpu(p, dst_cpu)) {
+ if (numa)
+ return false;
+ if (autonuma_sched_load_balance_strict() &&
+ idle != CPU_NEWLY_IDLE && idle != CPU_IDLE)
+ return false;
+ }
+ return true;
+}
+
+void sched_autonuma_dump_mm(void)
+{
+ int nid, cpu;
+ cpumask_var_t x;
+
+ if (!alloc_cpumask_var(&x, GFP_KERNEL))
+ return;
+ cpumask_setall(x);
+ for_each_online_node(nid) {
+ for_each_cpu(cpu, cpumask_of_node(nid)) {
+ struct rq *rq = cpu_rq(cpu);
+ struct mm_struct *mm = rq->curr->mm;
+ int nr = 0, cpux;
+ if (!cpumask_test_cpu(cpu, x))
+ continue;
+ for_each_cpu(cpux, cpumask_of_node(nid)) {
+ struct rq *rqx = cpu_rq(cpux);
+ if (rqx->curr->mm == mm) {
+ nr++;
+ cpumask_clear_cpu(cpux, x);
+ }
+ }
+ printk("nid %d mm %p nr %d\n", nid, mm, nr);
+ }
+ }
+ free_cpumask_var(x);
+}
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index ba9dccf..b12b8cd 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -463,6 +463,10 @@ struct rq {
#ifdef CONFIG_SMP
struct llist_head wake_list;
#endif
+#ifdef CONFIG_AUTONUMA
+ long weight_current[MAX_NUMNODES];
+ long weight_current_mm[MAX_NUMNODES];
+#endif
};
static inline int cpu_of(struct rq *rq)
@@ -526,6 +530,12 @@ static inline struct sched_domain *highest_flag_domain(int cpu, int flag)
DECLARE_PER_CPU(struct sched_domain *, sd_llc);
DECLARE_PER_CPU(int, sd_llc_id);
+struct migration_arg {
+ struct task_struct *task;
+ int dest_cpu;
+};
+extern int migration_cpu_stop(void *data);
+
#endif /* CONFIG_SMP */
#include "stats.h"
set_pmd_at() will also be used for the knuma_scand/pmd = 1 (default)
mode even when TRANSPARENT_HUGEPAGE=n. Make it available so the build
won't fail.
Signed-off-by: Andrea Arcangeli <[email protected]>
---
arch/x86/include/asm/paravirt.h | 2 --
1 files changed, 0 insertions(+), 2 deletions(-)
diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index 6cbbabf..e99fb37 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -564,7 +564,6 @@ static inline void set_pte_at(struct mm_struct *mm, unsigned long addr,
PVOP_VCALL4(pv_mmu_ops.set_pte_at, mm, addr, ptep, pte.pte);
}
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
static inline void set_pmd_at(struct mm_struct *mm, unsigned long addr,
pmd_t *pmdp, pmd_t pmd)
{
@@ -575,7 +574,6 @@ static inline void set_pmd_at(struct mm_struct *mm, unsigned long addr,
PVOP_VCALL4(pv_mmu_ops.set_pmd_at, mm, addr, pmdp,
native_pmd_val(pmd));
}
-#endif
static inline void set_pmd(pmd_t *pmdp, pmd_t pmd)
{
If any of the ptes that khugepaged is collapsing was a pte_numa, the
resulting trans huge pmd will be a pmd_numa too.
Signed-off-by: Andrea Arcangeli <[email protected]>
---
mm/huge_memory.c | 13 +++++++++++--
1 files changed, 11 insertions(+), 2 deletions(-)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index b1c047b..d388517 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1790,12 +1790,13 @@ out:
return isolated;
}
-static void __collapse_huge_page_copy(pte_t *pte, struct page *page,
+static bool __collapse_huge_page_copy(pte_t *pte, struct page *page,
struct vm_area_struct *vma,
unsigned long address,
spinlock_t *ptl)
{
pte_t *_pte;
+ bool mknuma = false;
for (_pte = pte; _pte < pte+HPAGE_PMD_NR; _pte++) {
pte_t pteval = *_pte;
struct page *src_page;
@@ -1823,11 +1824,15 @@ static void __collapse_huge_page_copy(pte_t *pte, struct page *page,
page_remove_rmap(src_page);
spin_unlock(ptl);
free_page_and_swap_cache(src_page);
+
+ mknuma |= pte_numa(pteval);
}
address += PAGE_SIZE;
page++;
}
+
+ return mknuma;
}
static void collapse_huge_page(struct mm_struct *mm,
@@ -1845,6 +1850,7 @@ static void collapse_huge_page(struct mm_struct *mm,
spinlock_t *ptl;
int isolated;
unsigned long hstart, hend;
+ bool mknuma = false;
VM_BUG_ON(address & ~HPAGE_PMD_MASK);
#ifndef CONFIG_NUMA
@@ -1963,7 +1969,8 @@ static void collapse_huge_page(struct mm_struct *mm,
*/
anon_vma_unlock(vma->anon_vma);
- __collapse_huge_page_copy(pte, new_page, vma, address, ptl);
+ mknuma = pmd_numa(_pmd);
+ mknuma |= __collapse_huge_page_copy(pte, new_page, vma, address, ptl);
pte_unmap(pte);
__SetPageUptodate(new_page);
pgtable = pmd_pgtable(_pmd);
@@ -1973,6 +1980,8 @@ static void collapse_huge_page(struct mm_struct *mm,
_pmd = mk_pmd(new_page, vma->vm_page_prot);
_pmd = maybe_pmd_mkwrite(pmd_mkdirty(_pmd), vma);
_pmd = pmd_mkhuge(_pmd);
+ if (mknuma)
+ _pmd = pmd_mknuma(_pmd);
/*
* spin_lock() below is not the equivalent of smp_wmb(), so
Implement the generic version of the methods. They're used when
CONFIG_AUTONUMA=n, and they're a noop.
Signed-off-by: Andrea Arcangeli <[email protected]>
---
include/asm-generic/pgtable.h | 12 ++++++++++++
1 files changed, 12 insertions(+), 0 deletions(-)
diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
index fa596d9..780f707 100644
--- a/include/asm-generic/pgtable.h
+++ b/include/asm-generic/pgtable.h
@@ -521,6 +521,18 @@ static inline int pmd_trans_unstable(pmd_t *pmd)
#endif
}
+#ifndef CONFIG_AUTONUMA
+static inline int pte_numa(pte_t pte)
+{
+ return 0;
+}
+
+static inline int pmd_numa(pmd_t pmd)
+{
+ return 0;
+}
+#endif /* CONFIG_AUTONUMA */
+
#endif /* CONFIG_MMU */
#endif /* !__ASSEMBLY__ */
This is where the numa hinting page faults are detected and passed
over to the AutoNUMA core logic.
Signed-off-by: Andrea Arcangeli <[email protected]>
---
include/linux/huge_mm.h | 2 ++
mm/huge_memory.c | 17 +++++++++++++++++
mm/memory.c | 32 ++++++++++++++++++++++++++++++++
3 files changed, 51 insertions(+), 0 deletions(-)
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index c8af7a2..72eac1d 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -11,6 +11,8 @@ extern int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
extern int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
unsigned long address, pmd_t *pmd,
pmd_t orig_pmd);
+extern pmd_t __huge_pmd_numa_fixup(struct mm_struct *mm, unsigned long addr,
+ pmd_t pmd, pmd_t *pmdp);
extern pgtable_t get_pmd_huge_pte(struct mm_struct *mm);
extern struct page *follow_trans_huge_pmd(struct mm_struct *mm,
unsigned long addr,
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 76bdc48..017c0a3 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1030,6 +1030,23 @@ out:
return page;
}
+#ifdef CONFIG_AUTONUMA
+pmd_t __huge_pmd_numa_fixup(struct mm_struct *mm, unsigned long addr,
+ pmd_t pmd, pmd_t *pmdp)
+{
+ spin_lock(&mm->page_table_lock);
+ if (pmd_same(pmd, *pmdp)) {
+ struct page *page = pmd_page(pmd);
+ pmd = pmd_mknotnuma(pmd);
+ set_pmd_at(mm, addr & HPAGE_PMD_MASK, pmdp, pmd);
+ numa_hinting_fault(page, HPAGE_PMD_NR);
+ VM_BUG_ON(pmd_numa(pmd));
+ }
+ spin_unlock(&mm->page_table_lock);
+ return pmd;
+}
+#endif
+
int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
pmd_t *pmd, unsigned long addr)
{
diff --git a/mm/memory.c b/mm/memory.c
index e3aa47c..316ce54 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -57,6 +57,7 @@
#include <linux/swapops.h>
#include <linux/elf.h>
#include <linux/gfp.h>
+#include <linux/autonuma.h>
#include <asm/io.h>
#include <asm/pgalloc.h>
@@ -3398,6 +3399,32 @@ static int do_nonlinear_fault(struct mm_struct *mm, struct vm_area_struct *vma,
return __do_fault(mm, vma, address, pmd, pgoff, flags, orig_pte);
}
+static inline pte_t pte_numa_fixup(struct mm_struct *mm,
+ struct vm_area_struct *vma,
+ unsigned long addr, pte_t pte, pte_t *ptep)
+{
+ if (pte_numa(pte))
+ pte = __pte_numa_fixup(mm, vma, addr, pte, ptep);
+ return pte;
+}
+
+static inline void pmd_numa_fixup(struct mm_struct *mm,
+ struct vm_area_struct *vma,
+ unsigned long addr, pmd_t *pmd)
+{
+ if (pmd_numa(*pmd))
+ __pmd_numa_fixup(mm, vma, addr, pmd);
+}
+
+static inline pmd_t huge_pmd_numa_fixup(struct mm_struct *mm,
+ unsigned long addr,
+ pmd_t pmd, pmd_t *pmdp)
+{
+ if (pmd_numa(pmd))
+ pmd = __huge_pmd_numa_fixup(mm, addr, pmd, pmdp);
+ return pmd;
+}
+
/*
* These routines also need to handle stuff like marking pages dirty
* and/or accessed for architectures that don't do it in hardware (most
@@ -3440,6 +3467,7 @@ int handle_pte_fault(struct mm_struct *mm,
spin_lock(ptl);
if (unlikely(!pte_same(*pte, entry)))
goto unlock;
+ entry = pte_numa_fixup(mm, vma, address, entry, pte);
if (flags & FAULT_FLAG_WRITE) {
if (!pte_write(entry))
return do_wp_page(mm, vma, address,
@@ -3501,6 +3529,8 @@ int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
pmd_t orig_pmd = *pmd;
barrier();
if (pmd_trans_huge(orig_pmd)) {
+ orig_pmd = huge_pmd_numa_fixup(mm, address,
+ orig_pmd, pmd);
if (flags & FAULT_FLAG_WRITE &&
!pmd_write(orig_pmd) &&
!pmd_trans_splitting(orig_pmd))
@@ -3510,6 +3540,8 @@ int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
}
}
+ pmd_numa_fixup(mm, vma, address, pmd);
+
/*
* Use __pte_alloc instead of pte_alloc_map, because we can't
* run pte_offset_map on the pmd, if an huge pmd could
Xen has taken over the last reserved bit available for the pagetables,
which is set through ioremap. This documents it and makes the code
more readable.
Signed-off-by: Andrea Arcangeli <[email protected]>
---
arch/x86/include/asm/pgtable_types.h | 11 +++++++++--
1 files changed, 9 insertions(+), 2 deletions(-)
diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
index 013286a..b74cac9 100644
--- a/arch/x86/include/asm/pgtable_types.h
+++ b/arch/x86/include/asm/pgtable_types.h
@@ -17,7 +17,7 @@
#define _PAGE_BIT_PAT 7 /* on 4KB pages */
#define _PAGE_BIT_GLOBAL 8 /* Global TLB entry PPro+ */
#define _PAGE_BIT_UNUSED1 9 /* available for programmer */
-#define _PAGE_BIT_IOMAP 10 /* flag used to indicate IO mapping */
+#define _PAGE_BIT_UNUSED2 10
#define _PAGE_BIT_HIDDEN 11 /* hidden by kmemcheck */
#define _PAGE_BIT_PAT_LARGE 12 /* On 2MB or 1GB pages */
#define _PAGE_BIT_SPECIAL _PAGE_BIT_UNUSED1
@@ -41,7 +41,7 @@
#define _PAGE_PSE (_AT(pteval_t, 1) << _PAGE_BIT_PSE)
#define _PAGE_GLOBAL (_AT(pteval_t, 1) << _PAGE_BIT_GLOBAL)
#define _PAGE_UNUSED1 (_AT(pteval_t, 1) << _PAGE_BIT_UNUSED1)
-#define _PAGE_IOMAP (_AT(pteval_t, 1) << _PAGE_BIT_IOMAP)
+#define _PAGE_UNUSED2 (_AT(pteval_t, 1) << _PAGE_BIT_UNUSED2)
#define _PAGE_PAT (_AT(pteval_t, 1) << _PAGE_BIT_PAT)
#define _PAGE_PAT_LARGE (_AT(pteval_t, 1) << _PAGE_BIT_PAT_LARGE)
#define _PAGE_SPECIAL (_AT(pteval_t, 1) << _PAGE_BIT_SPECIAL)
@@ -49,6 +49,13 @@
#define _PAGE_SPLITTING (_AT(pteval_t, 1) << _PAGE_BIT_SPLITTING)
#define __HAVE_ARCH_PTE_SPECIAL
+/* flag used to indicate IO mapping */
+#ifdef CONFIG_XEN
+#define _PAGE_IOMAP (_AT(pteval_t, 1) << _PAGE_BIT_UNUSED2)
+#else
+#define _PAGE_IOMAP (_AT(pteval_t, 0))
+#endif
+
#ifdef CONFIG_KMEMCHECK
#define _PAGE_HIDDEN (_AT(pteval_t, 1) << _PAGE_BIT_HIDDEN)
#else
This resets all per-thread and per-process statistics across exec
syscalls or after kernel threads detach from the mm. The past
statistical NUMA information is unlikely to be relevant for the future
in these cases.
Signed-off-by: Andrea Arcangeli <[email protected]>
---
fs/exec.c | 3 +++
mm/mmu_context.c | 2 ++
2 files changed, 5 insertions(+), 0 deletions(-)
diff --git a/fs/exec.c b/fs/exec.c
index 52c9e2f..17330ba 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -55,6 +55,7 @@
#include <linux/pipe_fs_i.h>
#include <linux/oom.h>
#include <linux/compat.h>
+#include <linux/autonuma.h>
#include <asm/uaccess.h>
#include <asm/mmu_context.h>
@@ -1176,6 +1177,8 @@ void setup_new_exec(struct linux_binprm * bprm)
flush_signal_handlers(current, 0);
flush_old_files(current->files);
+
+ autonuma_setup_new_exec(current);
}
EXPORT_SYMBOL(setup_new_exec);
diff --git a/mm/mmu_context.c b/mm/mmu_context.c
index 3dcfaf4..40f0f13 100644
--- a/mm/mmu_context.c
+++ b/mm/mmu_context.c
@@ -7,6 +7,7 @@
#include <linux/mmu_context.h>
#include <linux/export.h>
#include <linux/sched.h>
+#include <linux/autonuma.h>
#include <asm/mmu_context.h>
@@ -58,5 +59,6 @@ void unuse_mm(struct mm_struct *mm)
/* active_mm is still 'mm' */
enter_lazy_tlb(mm, tsk);
task_unlock(tsk);
+ autonuma_setup_new_exec(tsk);
}
EXPORT_SYMBOL_GPL(unuse_mm);
This implements the knuma_migrated queues. Pages are added to these
queues through the NUMA hinting page faults (the memory-follow-CPU
algorithm with false sharing evaluation) and knuma_migrated is then
woken with a certain hysteresis to migrate the memory in round robin
from all remote nodes to its local node.
The head that belongs to the local node that knuma_migrated runs on
must be empty for now and is not being used.
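Schematically, the queuing layout works like the sketch below (a
user-space illustration only, with plain counters instead of the page
lists hooked into pgdat->autonuma_migrate_head[]): each destination
node keeps one head per source node, and its knuma_migrated drains
them in round robin.

#include <stdio.h>

#define NR_NODES 4

/* queued[dst][src]: pages queued on node "dst" that live on node "src" */
static unsigned long queued[NR_NODES][NR_NODES];

/* Round-robin drain, as knuma_migrated for node "dst" would do. */
static void drain_node(int dst)
{
	int src;

	for (src = 0; src < NR_NODES; src++) {
		if (src == dst)
			continue;	/* the local head stays empty for now */
		if (queued[dst][src]) {
			printf("node %d: migrating %lu pages from node %d\n",
			       dst, queued[dst][src], src);
			queued[dst][src] = 0;
		}
	}
}

int main(void)
{
	queued[0][1] = 32;
	queued[0][2] = 8;
	drain_node(0);
	return 0;
}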
Signed-off-by: Andrea Arcangeli <[email protected]>
---
include/linux/mmzone.h | 6 ++++++
1 files changed, 6 insertions(+), 0 deletions(-)
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 41aa49b..8e578e6 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -666,6 +666,12 @@ typedef struct pglist_data {
struct task_struct *kswapd;
int kswapd_max_order;
enum zone_type classzone_idx;
+#ifdef CONFIG_AUTONUMA
+ spinlock_t autonuma_lock;
+ struct list_head autonuma_migrate_head[MAX_NUMNODES];
+ unsigned long autonuma_nr_migrate_pages;
+ wait_queue_head_t autonuma_knuma_migrated_wait;
+#endif
} pg_data_t;
#define node_present_pages(nid) (NODE_DATA(nid)->node_present_pages)
Link the AutoNUMA core and scheduler object files into the kernel if
CONFIG_AUTONUMA=y.
Signed-off-by: Andrea Arcangeli <[email protected]>
---
kernel/sched/Makefile | 1 +
mm/Makefile | 1 +
2 files changed, 2 insertions(+), 0 deletions(-)
diff --git a/kernel/sched/Makefile b/kernel/sched/Makefile
index 173ea52..783a840 100644
--- a/kernel/sched/Makefile
+++ b/kernel/sched/Makefile
@@ -16,3 +16,4 @@ obj-$(CONFIG_SMP) += cpupri.o
obj-$(CONFIG_SCHED_AUTOGROUP) += auto_group.o
obj-$(CONFIG_SCHEDSTATS) += stats.o
obj-$(CONFIG_SCHED_DEBUG) += debug.o
+obj-$(CONFIG_AUTONUMA) += numa.o
diff --git a/mm/Makefile b/mm/Makefile
index 50ec00e..67c77bd 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -29,6 +29,7 @@ obj-$(CONFIG_SWAP) += page_io.o swap_state.o swapfile.o thrash.o
obj-$(CONFIG_HAS_DMA) += dmapool.o
obj-$(CONFIG_HUGETLBFS) += hugetlb.o
obj-$(CONFIG_NUMA) += mempolicy.o
+obj-$(CONFIG_AUTONUMA) += autonuma.o
obj-$(CONFIG_SPARSEMEM) += sparse.o
obj-$(CONFIG_SPARSEMEM_VMEMMAP) += sparse-vmemmap.o
obj-$(CONFIG_SLOB) += slob.o
When pages are collapsed try to keep the last_nid information from one
of the original pages.
Signed-off-by: Andrea Arcangeli <[email protected]>
---
mm/huge_memory.c | 11 +++++++++++
1 files changed, 11 insertions(+), 0 deletions(-)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index d388517..76bdc48 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1805,7 +1805,18 @@ static bool __collapse_huge_page_copy(pte_t *pte, struct page *page,
clear_user_highpage(page, address);
add_mm_counter(vma->vm_mm, MM_ANONPAGES, 1);
} else {
+#ifdef CONFIG_AUTONUMA
+ int autonuma_last_nid;
+#endif
src_page = pte_page(pteval);
+#ifdef CONFIG_AUTONUMA
+ /* pick the last one, better than nothing */
+ autonuma_last_nid =
+ ACCESS_ONCE(src_page->autonuma_last_nid);
+ if (autonuma_last_nid >= 0)
+ ACCESS_ONCE(page->autonuma_last_nid) =
+ autonuma_last_nid;
+#endif
copy_user_highpage(page, src_page, address, vma);
VM_BUG_ON(page_mapcount(src_page) != 1);
VM_BUG_ON(page_count(src_page) != 2);
Very minor optimization to hint gcc.
Signed-off-by: Andrea Arcangeli <[email protected]>
---
kernel/fork.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)
diff --git a/kernel/fork.c b/kernel/fork.c
index 47b4e4f..98db8b0 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -571,7 +571,7 @@ struct mm_struct *mm_alloc(void)
struct mm_struct *mm;
mm = allocate_mm();
- if (!mm)
+ if (unlikely(!mm))
return NULL;
memset(mm, 0, sizeof(*mm));
This is the generic autonuma.h header that defines the generic
AutoNUMA specific functions like autonuma_setup_new_exec,
autonuma_split_huge_page, numa_hinting_fault, etc...
As usual, functions like numa_hinting_fault that only matter for
builds with CONFIG_AUTONUMA=y are declared unconditionally, but they
are only linked into the kernel if CONFIG_AUTONUMA=y. With
CONFIG_AUTONUMA=n their call sites are
optimized away at build time (or the kernel won't link).
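The same trick can be reproduced in user space (sketch only, with
placeholder names mirroring pte_numa()/numa_hinting_fault(); build
with optimization enabled, e.g. gcc -O2, which is what the kernel
always does, otherwise the dead call survives and the link fails,
exactly as described above):

#include <stdio.h>

/* #define CONFIG_AUTONUMA */	/* leave undefined to mimic CONFIG_AUTONUMA=n */

#ifdef CONFIG_AUTONUMA
extern int pte_numa_bit(unsigned long pte);	/* would live in autonuma.o */
#else
static inline int pte_numa_bit(unsigned long pte) { return 0; }	/* generic noop */
#endif

/* Declared unconditionally; only linked in when CONFIG_AUTONUMA=y. */
extern void numa_hinting_fault_stub(unsigned long pte, int numpages);

int main(void)
{
	unsigned long pte = 3;

	/*
	 * Constant-false with =n, so the compiler drops the call and the
	 * missing definition of numa_hinting_fault_stub() never reaches
	 * the linker.
	 */
	if (pte_numa_bit(pte))
		numa_hinting_fault_stub(pte, 1);

	printf("pte_numa_bit=%d\n", pte_numa_bit(pte));
	return 0;
}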
Signed-off-by: Andrea Arcangeli <[email protected]>
---
include/linux/autonuma.h | 41 +++++++++++++++++++++++++++++++++++++++++
1 files changed, 41 insertions(+), 0 deletions(-)
create mode 100644 include/linux/autonuma.h
diff --git a/include/linux/autonuma.h b/include/linux/autonuma.h
new file mode 100644
index 0000000..a963dcb
--- /dev/null
+++ b/include/linux/autonuma.h
@@ -0,0 +1,41 @@
+#ifndef _LINUX_AUTONUMA_H
+#define _LINUX_AUTONUMA_H
+
+#ifdef CONFIG_AUTONUMA
+
+#include <linux/autonuma_flags.h>
+
+extern void autonuma_enter(struct mm_struct *mm);
+extern void autonuma_exit(struct mm_struct *mm);
+extern void __autonuma_migrate_page_remove(struct page *page);
+extern void autonuma_migrate_split_huge_page(struct page *page,
+ struct page *page_tail);
+extern void autonuma_setup_new_exec(struct task_struct *p);
+
+static inline void autonuma_migrate_page_remove(struct page *page)
+{
+ if (ACCESS_ONCE(page->autonuma_migrate_nid) >= 0)
+ __autonuma_migrate_page_remove(page);
+}
+
+#define autonuma_printk(format, args...) \
+ if (autonuma_debug()) printk(format, ##args)
+
+#else /* CONFIG_AUTONUMA */
+
+static inline void autonuma_enter(struct mm_struct *mm) {}
+static inline void autonuma_exit(struct mm_struct *mm) {}
+static inline void autonuma_migrate_page_remove(struct page *page) {}
+static inline void autonuma_migrate_split_huge_page(struct page *page,
+ struct page *page_tail) {}
+static inline void autonuma_setup_new_exec(struct task_struct *p) {}
+
+#endif /* CONFIG_AUTONUMA */
+
+extern pte_t __pte_numa_fixup(struct mm_struct *mm, struct vm_area_struct *vma,
+ unsigned long addr, pte_t pte, pte_t *ptep);
+extern void __pmd_numa_fixup(struct mm_struct *mm, struct vm_area_struct *vma,
+ unsigned long addr, pmd_t *pmd);
+extern void numa_hinting_fault(struct page *page, int numpages);
+
+#endif /* _LINUX_AUTONUMA_H */
This implements knuma_scand, the numa hinting faults started by
knuma_scand, knuma_migrated, which migrates the memory queued by the
NUMA hinting faults, the statistics gathering code that is done by
knuma_scand for the mm_autonuma and by the numa hinting page faults
for the sched_autonuma, and most of the rest of the AutoNUMA core
logic, like the false sharing detection, the sysfs interface and the
initialization routines.
When knuma_scand is not running, the AutoNUMA algorithm is a full
bypass and it must not alter the runtime of the memory management or
of the scheduler.
The whole AutoNUMA logic is a chain reaction resulting from the
actions of knuma_scand. The various parts of the code can be described
as different gears (gears as in glxgears).
knuma_scand is the first gear: it collects the mm_autonuma per-process
statistics and, at the same time, marks the ptes/pmds it scans as
pte_numa and pmd_numa.
The second gear is the numa hinting page faults. These are triggered
by the pte_numa/pmd_numa ptes/pmds. They collect the sched_autonuma
per-thread statistics. They also implement the memory-follow-CPU logic,
where we track whether pages are repeatedly accessed by remote nodes.
The memory-follow-CPU logic can decide to migrate pages across
NUMA nodes by queuing the pages for migration in the per-node
knuma_migrated queues.
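The false sharing check of this second gear can be sketched as follows
(user-space illustration only; the actual logic is in last_nid_set()
and numa_hinting_fault_memory_follow_cpu() in the patch below, which
also handle the LRU and compound_lock details): a page is only queued
for migration towards the faulting node if the previous hinting fault
did not come from a different node.

#include <stdbool.h>
#include <stdio.h>

struct sketch_page {
	int last_nid;		/* node of the previous hinting fault, -1 if none */
	int migrate_nid;	/* destination queue the page sits in, -1 if none */
};

static void hinting_fault(struct sketch_page *page, int cpu_nid, int page_nid)
{
	/* faults bouncing between nodes look like false sharing */
	bool bouncing = page->last_nid >= 0 && page->last_nid != cpu_nid;

	page->last_nid = cpu_nid;
	if (bouncing) {
		if (page->migrate_nid != cpu_nid)
			page->migrate_nid = -1;	/* drop any pending migration */
		return;
	}
	if (cpu_nid != page_nid)
		page->migrate_nid = cpu_nid;	/* queue for knuma_migrated */
	else
		page->migrate_nid = -1;		/* already local */
}

int main(void)
{
	struct sketch_page p = { .last_nid = -1, .migrate_nid = -1 };

	hinting_fault(&p, 1, 0);	/* remote access: queue towards node 1 */
	hinting_fault(&p, 2, 0);	/* different node: looks shared, dequeue */
	printf("migrate_nid = %d\n", p.migrate_nid);
	return 0;
}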
The third gear is knuma_migrated. There is one knuma_migrated daemon
per node. Pages pending migration are queued in a matrix of
lists. Each knuma_migrated (in parallel with the others) goes over
those lists and migrates the pages queued for migration, in round
robin, from each incoming node to the node it is running on.
The fourth gear is the NUMA scheduler balancing code. It uses the
statistical information collected in mm->mm_autonuma and
p->sched_autonuma and evaluates the status of all CPUs to decide if
tasks should be migrated to CPUs in remote nodes.
The code includes fixes from Hillf Danton <[email protected]>.
Signed-off-by: Andrea Arcangeli <[email protected]>
---
mm/autonuma.c | 1449 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
1 files changed, 1449 insertions(+), 0 deletions(-)
create mode 100644 mm/autonuma.c
diff --git a/mm/autonuma.c b/mm/autonuma.c
new file mode 100644
index 0000000..88c7ab3
--- /dev/null
+++ b/mm/autonuma.c
@@ -0,0 +1,1449 @@
+/*
+ * Copyright (C) 2012 Red Hat, Inc.
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2. See
+ * the COPYING file in the top-level directory.
+ *
+ * Boot with "numa=fake=2" to test on non-NUMA systems.
+ */
+
+#include <linux/mm.h>
+#include <linux/rmap.h>
+#include <linux/kthread.h>
+#include <linux/mmu_notifier.h>
+#include <linux/freezer.h>
+#include <linux/mm_inline.h>
+#include <linux/migrate.h>
+#include <linux/swap.h>
+#include <linux/autonuma.h>
+#include <asm/tlbflush.h>
+#include <asm/pgtable.h>
+
+unsigned long autonuma_flags __read_mostly =
+ (1<<AUTONUMA_SCHED_LOAD_BALANCE_STRICT_FLAG)|
+ (1<<AUTONUMA_SCHED_CLONE_RESET_FLAG)|
+ (1<<AUTONUMA_SCHED_FORK_RESET_FLAG)|
+#ifdef CONFIG_AUTONUMA_DEFAULT_ENABLED
+ (1<<AUTONUMA_FLAG)|
+#endif
+ (1<<AUTONUMA_SCAN_PMD_FLAG);
+
+static DEFINE_MUTEX(knumad_mm_mutex);
+
+/* knuma_scand */
+static unsigned int scan_sleep_millisecs __read_mostly = 100;
+static unsigned int scan_sleep_pass_millisecs __read_mostly = 5000;
+static unsigned int pages_to_scan __read_mostly = 128*1024*1024/PAGE_SIZE;
+static DECLARE_WAIT_QUEUE_HEAD(knuma_scand_wait);
+static unsigned long full_scans;
+static unsigned long pages_scanned;
+
+/* knuma_migrated */
+static unsigned int migrate_sleep_millisecs __read_mostly = 100;
+static unsigned int pages_to_migrate __read_mostly = 128*1024*1024/PAGE_SIZE;
+static volatile unsigned long pages_migrated;
+
+static struct knumad_scan {
+ struct list_head mm_head;
+ struct mm_struct *mm;
+ unsigned long address;
+} knumad_scan = {
+ .mm_head = LIST_HEAD_INIT(knumad_scan.mm_head),
+};
+
+static inline bool autonuma_impossible(void)
+{
+ return num_possible_nodes() <= 1 ||
+ test_bit(AUTONUMA_IMPOSSIBLE, &autonuma_flags);
+}
+
+static inline void autonuma_migrate_lock(int nid)
+{
+ spin_lock(&NODE_DATA(nid)->autonuma_lock);
+}
+
+static inline void autonuma_migrate_unlock(int nid)
+{
+ spin_unlock(&NODE_DATA(nid)->autonuma_lock);
+}
+
+static inline void autonuma_migrate_lock_irq(int nid)
+{
+ spin_lock_irq(&NODE_DATA(nid)->autonuma_lock);
+}
+
+static inline void autonuma_migrate_unlock_irq(int nid)
+{
+ spin_unlock_irq(&NODE_DATA(nid)->autonuma_lock);
+}
+
+/* caller already holds the compound_lock */
+void autonuma_migrate_split_huge_page(struct page *page,
+ struct page *page_tail)
+{
+ int nid, last_nid;
+
+ nid = page->autonuma_migrate_nid;
+ VM_BUG_ON(nid >= MAX_NUMNODES);
+ VM_BUG_ON(nid < -1);
+ VM_BUG_ON(page_tail->autonuma_migrate_nid != -1);
+ if (nid >= 0) {
+ VM_BUG_ON(page_to_nid(page) != page_to_nid(page_tail));
+ autonuma_migrate_lock(nid);
+ list_add_tail(&page_tail->autonuma_migrate_node,
+ &page->autonuma_migrate_node);
+ autonuma_migrate_unlock(nid);
+
+ page_tail->autonuma_migrate_nid = nid;
+ }
+
+ last_nid = ACCESS_ONCE(page->autonuma_last_nid);
+ if (last_nid >= 0)
+ page_tail->autonuma_last_nid = last_nid;
+}
+
+void __autonuma_migrate_page_remove(struct page *page)
+{
+ unsigned long flags;
+ int nid;
+
+ flags = compound_lock_irqsave(page);
+
+ nid = page->autonuma_migrate_nid;
+ VM_BUG_ON(nid >= MAX_NUMNODES);
+ VM_BUG_ON(nid < -1);
+ if (nid >= 0) {
+ int numpages = hpage_nr_pages(page);
+ autonuma_migrate_lock(nid);
+ list_del(&page->autonuma_migrate_node);
+ NODE_DATA(nid)->autonuma_nr_migrate_pages -= numpages;
+ autonuma_migrate_unlock(nid);
+
+ page->autonuma_migrate_nid = -1;
+ }
+
+ compound_unlock_irqrestore(page, flags);
+}
+
+static void __autonuma_migrate_page_add(struct page *page, int dst_nid,
+ int page_nid)
+{
+ unsigned long flags;
+ int nid;
+ int numpages;
+ unsigned long nr_migrate_pages;
+ wait_queue_head_t *wait_queue;
+
+ VM_BUG_ON(dst_nid >= MAX_NUMNODES);
+ VM_BUG_ON(dst_nid < -1);
+ VM_BUG_ON(page_nid >= MAX_NUMNODES);
+ VM_BUG_ON(page_nid < -1);
+
+ VM_BUG_ON(page_nid == dst_nid);
+ VM_BUG_ON(page_to_nid(page) != page_nid);
+
+ flags = compound_lock_irqsave(page);
+
+ numpages = hpage_nr_pages(page);
+ nid = page->autonuma_migrate_nid;
+ VM_BUG_ON(nid >= MAX_NUMNODES);
+ VM_BUG_ON(nid < -1);
+ if (nid >= 0) {
+ autonuma_migrate_lock(nid);
+ list_del(&page->autonuma_migrate_node);
+ NODE_DATA(nid)->autonuma_nr_migrate_pages -= numpages;
+ autonuma_migrate_unlock(nid);
+ }
+
+ autonuma_migrate_lock(dst_nid);
+ list_add(&page->autonuma_migrate_node,
+ &NODE_DATA(dst_nid)->autonuma_migrate_head[page_nid]);
+ NODE_DATA(dst_nid)->autonuma_nr_migrate_pages += numpages;
+ nr_migrate_pages = NODE_DATA(dst_nid)->autonuma_nr_migrate_pages;
+
+ autonuma_migrate_unlock(dst_nid);
+
+ page->autonuma_migrate_nid = dst_nid;
+
+ compound_unlock_irqrestore(page, flags);
+
+ if (!autonuma_migrate_defer()) {
+ wait_queue = &NODE_DATA(dst_nid)->autonuma_knuma_migrated_wait;
+ if (nr_migrate_pages >= pages_to_migrate &&
+ nr_migrate_pages - numpages < pages_to_migrate &&
+ waitqueue_active(wait_queue))
+ wake_up_interruptible(wait_queue);
+ }
+}
+
+static void autonuma_migrate_page_add(struct page *page, int dst_nid,
+ int page_nid)
+{
+ int migrate_nid = ACCESS_ONCE(page->autonuma_migrate_nid);
+ if (migrate_nid != dst_nid)
+ __autonuma_migrate_page_add(page, dst_nid, page_nid);
+}
+
+static bool balance_pgdat(struct pglist_data *pgdat,
+ int nr_migrate_pages)
+{
+ /* FIXME: this only check the wmarks, make it move
+ * "unused" memory or pagecache by queuing it to
+ * pgdat->autonuma_migrate_head[pgdat->node_id].
+ */
+ int z;
+ for (z = pgdat->nr_zones - 1; z >= 0; z--) {
+ struct zone *zone = pgdat->node_zones + z;
+
+ if (!populated_zone(zone))
+ continue;
+
+ if (zone->all_unreclaimable)
+ continue;
+
+ /*
+ * FIXME: deal with order with THP, maybe if
+ * kswapd will learn using compaction, otherwise
+ * order = 0 probably is ok.
+ * FIXME: in theory we're ok if we can obtain
+ * pages_to_migrate pages from all zones, it doesn't
+ * need to be all in a single zone. We care about the
+ * pgdat, not the zone.
+ */
+
+ /*
+ * Try not to wakeup kswapd by allocating
+ * pages_to_migrate pages.
+ */
+ if (!zone_watermark_ok(zone, 0,
+ high_wmark_pages(zone) +
+ nr_migrate_pages,
+ 0, 0))
+ continue;
+ return true;
+ }
+ return false;
+}
+
+static void cpu_follow_memory_pass(struct task_struct *p,
+ struct sched_autonuma *sched_autonuma,
+ unsigned long *numa_fault)
+{
+ int nid;
+ for_each_node(nid)
+ numa_fault[nid] >>= 1;
+ sched_autonuma->numa_fault_tot >>= 1;
+}
+
+static void numa_hinting_fault_cpu_follow_memory(struct task_struct *p,
+ int access_nid,
+ int numpages,
+ bool pass)
+{
+ struct sched_autonuma *sched_autonuma = p->sched_autonuma;
+ unsigned long *numa_fault = sched_autonuma->numa_fault;
+ if (unlikely(pass))
+ cpu_follow_memory_pass(p, sched_autonuma, numa_fault);
+ numa_fault[access_nid] += numpages;
+ sched_autonuma->numa_fault_tot += numpages;
+}
+
+static inline bool last_nid_set(struct task_struct *p,
+ struct page *page, int cpu_nid)
+{
+ bool ret = true;
+ int autonuma_last_nid = ACCESS_ONCE(page->autonuma_last_nid);
+ VM_BUG_ON(cpu_nid < 0);
+ VM_BUG_ON(cpu_nid >= MAX_NUMNODES);
+ if (autonuma_last_nid >= 0 && autonuma_last_nid != cpu_nid) {
+ int migrate_nid = ACCESS_ONCE(page->autonuma_migrate_nid);
+ if (migrate_nid >= 0 && migrate_nid != cpu_nid)
+ __autonuma_migrate_page_remove(page);
+ ret = false;
+ }
+ if (autonuma_last_nid != cpu_nid)
+ ACCESS_ONCE(page->autonuma_last_nid) = cpu_nid;
+ return ret;
+}
+
+static int __page_migrate_nid(struct page *page, int page_nid)
+{
+ int migrate_nid = ACCESS_ONCE(page->autonuma_migrate_nid);
+ if (migrate_nid < 0)
+ migrate_nid = page_nid;
+#if 0
+ return page_nid;
+#endif
+ return migrate_nid;
+}
+
+static int page_migrate_nid(struct page *page)
+{
+ return __page_migrate_nid(page, page_to_nid(page));
+}
+
+static int numa_hinting_fault_memory_follow_cpu(struct task_struct *p,
+ struct page *page,
+ int cpu_nid, int page_nid,
+ bool pass)
+{
+ if (!last_nid_set(p, page, cpu_nid))
+ return __page_migrate_nid(page, page_nid);
+ if (!PageLRU(page))
+ return page_nid;
+ if (cpu_nid != page_nid)
+ autonuma_migrate_page_add(page, cpu_nid, page_nid);
+ else
+ autonuma_migrate_page_remove(page);
+ return cpu_nid;
+}
+
+void numa_hinting_fault(struct page *page, int numpages)
+{
+ WARN_ON_ONCE(!current->mm);
+ if (likely(current->mm && !current->mempolicy && autonuma_enabled())) {
+ struct task_struct *p = current;
+ int cpu_nid, page_nid, access_nid;
+ bool pass;
+
+ pass = p->sched_autonuma->numa_fault_pass !=
+ p->mm->mm_autonuma->numa_fault_pass;
+ page_nid = page_to_nid(page);
+ cpu_nid = numa_node_id();
+ VM_BUG_ON(cpu_nid < 0);
+ VM_BUG_ON(cpu_nid >= MAX_NUMNODES);
+ access_nid = numa_hinting_fault_memory_follow_cpu(p, page,
+ cpu_nid,
+ page_nid,
+ pass);
+ numa_hinting_fault_cpu_follow_memory(p, access_nid,
+ numpages, pass);
+ if (unlikely(pass))
+ p->sched_autonuma->numa_fault_pass =
+ p->mm->mm_autonuma->numa_fault_pass;
+ }
+}
+
+pte_t __pte_numa_fixup(struct mm_struct *mm, struct vm_area_struct *vma,
+ unsigned long addr, pte_t pte, pte_t *ptep)
+{
+ struct page *page;
+ pte = pte_mknotnuma(pte);
+ set_pte_at(mm, addr, ptep, pte);
+ page = vm_normal_page(vma, addr, pte);
+ BUG_ON(!page);
+ numa_hinting_fault(page, 1);
+ return pte;
+}
+
+void __pmd_numa_fixup(struct mm_struct *mm, struct vm_area_struct *vma,
+ unsigned long addr, pmd_t *pmdp)
+{
+ pmd_t pmd;
+ pte_t *pte;
+ unsigned long _addr = addr & PMD_MASK;
+ spinlock_t *ptl;
+ bool numa = false;
+
+ spin_lock(&mm->page_table_lock);
+ pmd = *pmdp;
+ if (pmd_numa(pmd)) {
+ set_pmd_at(mm, _addr, pmdp, pmd_mknotnuma(pmd));
+ numa = true;
+ }
+ spin_unlock(&mm->page_table_lock);
+
+ if (!numa)
+ return;
+
+ pte = pte_offset_map_lock(mm, pmdp, _addr, &ptl);
+ for (addr = _addr; addr < _addr + PMD_SIZE; pte++, addr += PAGE_SIZE) {
+ pte_t pteval = *pte;
+ struct page * page;
+ if (!pte_present(pteval))
+ continue;
+ if (pte_numa(pteval)) {
+ pteval = pte_mknotnuma(pteval);
+ set_pte_at(mm, addr, pte, pteval);
+ }
+ page = vm_normal_page(vma, addr, pteval);
+ if (unlikely(!page))
+ continue;
+ /* only check non-shared pages */
+ if (page_mapcount(page) != 1)
+ continue;
+ numa_hinting_fault(page, 1);
+ }
+ pte_unmap_unlock(pte, ptl);
+}
+
+static inline int sched_autonuma_size(void)
+{
+ return sizeof(struct sched_autonuma) +
+ num_possible_nodes() * sizeof(unsigned long);
+}
+
+static inline int sched_autonuma_reset_size(void)
+{
+ struct sched_autonuma *sched_autonuma = NULL;
+ return sched_autonuma_size() -
+ (int)((char *)(&sched_autonuma->autonuma_flags) -
+ (char *)sched_autonuma);
+}
+
+static void sched_autonuma_reset(struct sched_autonuma *sched_autonuma)
+{
+ sched_autonuma->autonuma_node = -1;
+ memset(&sched_autonuma->autonuma_flags, 0,
+ sched_autonuma_reset_size());
+}
+
+static inline int mm_autonuma_fault_size(void)
+{
+ return num_possible_nodes() * sizeof(unsigned long);
+}
+
+static inline unsigned long *mm_autonuma_numa_fault_tmp(struct mm_struct *mm)
+{
+ return mm->mm_autonuma->numa_fault + num_possible_nodes();
+}
+
+static inline int mm_autonuma_size(void)
+{
+ return sizeof(struct mm_autonuma) + mm_autonuma_fault_size() * 2;
+}
+
+static inline int mm_autonuma_reset_size(void)
+{
+ struct mm_autonuma *mm_autonuma = NULL;
+ return mm_autonuma_size() -
+ (int)((char *)(&mm_autonuma->numa_fault_tot) -
+ (char *)mm_autonuma);
+}
+
+static void mm_autonuma_reset(struct mm_autonuma *mm_autonuma)
+{
+ memset(&mm_autonuma->numa_fault_tot, 0, mm_autonuma_reset_size());
+}
+
+void autonuma_setup_new_exec(struct task_struct *p)
+{
+ if (p->sched_autonuma)
+ sched_autonuma_reset(p->sched_autonuma);
+ if (p->mm && p->mm->mm_autonuma)
+ mm_autonuma_reset(p->mm->mm_autonuma);
+}
+
+static inline int knumad_test_exit(struct mm_struct *mm)
+{
+ return atomic_read(&mm->mm_users) == 0;
+}
+
+static int knumad_scan_pmd(struct mm_struct *mm,
+ struct vm_area_struct *vma,
+ unsigned long address)
+{
+ pgd_t *pgd;
+ pud_t *pud;
+ pmd_t *pmd;
+ pte_t *pte, *_pte;
+ struct page *page;
+ unsigned long _address, end;
+ spinlock_t *ptl;
+ int ret = 0;
+
+ VM_BUG_ON(address & ~PAGE_MASK);
+
+ pgd = pgd_offset(mm, address);
+ if (!pgd_present(*pgd))
+ goto out;
+
+ pud = pud_offset(pgd, address);
+ if (!pud_present(*pud))
+ goto out;
+
+ pmd = pmd_offset(pud, address);
+ if (pmd_none(*pmd))
+ goto out;
+ if (pmd_trans_huge(*pmd)) {
+ spin_lock(&mm->page_table_lock);
+ if (pmd_trans_huge(*pmd)) {
+ VM_BUG_ON(address & ~HPAGE_PMD_MASK);
+ if (unlikely(pmd_trans_splitting(*pmd))) {
+ spin_unlock(&mm->page_table_lock);
+ wait_split_huge_page(vma->anon_vma, pmd);
+ } else {
+ int page_nid;
+ unsigned long *numa_fault_tmp;
+ ret = HPAGE_PMD_NR;
+
+ if (autonuma_scan_use_working_set() &&
+ pmd_numa(*pmd)) {
+ spin_unlock(&mm->page_table_lock);
+ goto out;
+ }
+
+ page = pmd_page(*pmd);
+
+ /* only check non-shared pages */
+ if (page_mapcount(page) != 1) {
+ spin_unlock(&mm->page_table_lock);
+ goto out;
+ }
+
+ page_nid = page_migrate_nid(page);
+ numa_fault_tmp = mm_autonuma_numa_fault_tmp(mm);
+ numa_fault_tmp[page_nid] += ret;
+
+ if (pmd_numa(*pmd)) {
+ spin_unlock(&mm->page_table_lock);
+ goto out;
+ }
+
+ set_pmd_at(mm, address, pmd, pmd_mknuma(*pmd));
+ /* defer TLB flush to lower the overhead */
+ spin_unlock(&mm->page_table_lock);
+ goto out;
+ }
+ } else
+ spin_unlock(&mm->page_table_lock);
+ }
+
+ VM_BUG_ON(!pmd_present(*pmd));
+
+ end = min(vma->vm_end, (address + PMD_SIZE) & PMD_MASK);
+ pte = pte_offset_map_lock(mm, pmd, address, &ptl);
+ for (_address = address, _pte = pte; _address < end;
+ _pte++, _address += PAGE_SIZE) {
+ unsigned long *numa_fault_tmp;
+ pte_t pteval = *_pte;
+ if (!pte_present(pteval))
+ continue;
+ if (autonuma_scan_use_working_set() &&
+ pte_numa(pteval))
+ continue;
+ page = vm_normal_page(vma, _address, pteval);
+ if (unlikely(!page))
+ continue;
+ /* only check non-shared pages */
+ if (page_mapcount(page) != 1)
+ continue;
+
+ numa_fault_tmp = mm_autonuma_numa_fault_tmp(mm);
+ numa_fault_tmp[page_migrate_nid(page)]++;
+
+ if (pte_numa(pteval))
+ continue;
+
+ if (!autonuma_scan_pmd())
+ set_pte_at(mm, _address, _pte, pte_mknuma(pteval));
+
+ /* defer TLB flush to lower the overhead */
+ ret++;
+ }
+ pte_unmap_unlock(pte, ptl);
+
+ if (ret && !pmd_numa(*pmd) && autonuma_scan_pmd()) {
+ spin_lock(&mm->page_table_lock);
+ set_pmd_at(mm, address, pmd, pmd_mknuma(*pmd));
+ spin_unlock(&mm->page_table_lock);
+ /* defer TLB flush to lower the overhead */
+ }
+
+out:
+ return ret;
+}
+
+static void mm_numa_fault_flush(struct mm_struct *mm)
+{
+ int nid;
+ struct mm_autonuma *mma = mm->mm_autonuma;
+ unsigned long *numa_fault_tmp = mm_autonuma_numa_fault_tmp(mm);
+ unsigned long tot = 0;
+ /* FIXME: protect this with seqlock against autonuma_balance() */
+ for_each_node(nid) {
+ mma->numa_fault[nid] = numa_fault_tmp[nid];
+ tot += mma->numa_fault[nid];
+ numa_fault_tmp[nid] = 0;
+ }
+ mma->numa_fault_tot = tot;
+}
+
+static int knumad_do_scan(void)
+{
+ struct mm_struct *mm;
+ struct mm_autonuma *mm_autonuma;
+ unsigned long address;
+ struct vm_area_struct *vma;
+ int progress = 0;
+
+ mm = knumad_scan.mm;
+ if (!mm) {
+ if (unlikely(list_empty(&knumad_scan.mm_head)))
+ return pages_to_scan;
+ mm_autonuma = list_entry(knumad_scan.mm_head.next,
+ struct mm_autonuma, mm_node);
+ mm = mm_autonuma->mm;
+ knumad_scan.address = 0;
+ knumad_scan.mm = mm;
+ atomic_inc(&mm->mm_count);
+ mm_autonuma->numa_fault_pass++;
+ }
+ address = knumad_scan.address;
+
+ mutex_unlock(&knumad_mm_mutex);
+
+ down_read(&mm->mmap_sem);
+ if (unlikely(knumad_test_exit(mm)))
+ vma = NULL;
+ else
+ vma = find_vma(mm, address);
+
+ progress++;
+ for (; vma && progress < pages_to_scan; vma = vma->vm_next) {
+ unsigned long start_addr, end_addr;
+ cond_resched();
+ if (unlikely(knumad_test_exit(mm))) {
+ progress++;
+ break;
+ }
+
+ if (!vma->anon_vma || vma_policy(vma)) {
+ progress++;
+ continue;
+ }
+ if (is_vma_temporary_stack(vma)) {
+ progress++;
+ continue;
+ }
+
+ VM_BUG_ON(address & ~PAGE_MASK);
+ if (address < vma->vm_start)
+ address = vma->vm_start;
+
+ start_addr = address;
+ while (address < vma->vm_end) {
+ cond_resched();
+ if (unlikely(knumad_test_exit(mm)))
+ break;
+
+ VM_BUG_ON(address < vma->vm_start ||
+ address + PAGE_SIZE > vma->vm_end);
+ progress += knumad_scan_pmd(mm, vma, address);
+ /* move to next address */
+ address = (address + PMD_SIZE) & PMD_MASK;
+ if (progress >= pages_to_scan)
+ break;
+ }
+ end_addr = min(address, vma->vm_end);
+
+ /*
+ * Flush the TLB for the mm to start the numa
+ * hinting minor page faults after we finish
+ * scanning this vma part.
+ */
+ mmu_notifier_invalidate_range_start(vma->vm_mm, start_addr,
+ end_addr);
+ flush_tlb_range(vma, start_addr, end_addr);
+ mmu_notifier_invalidate_range_end(vma->vm_mm, start_addr,
+ end_addr);
+ }
+ up_read(&mm->mmap_sem); /* exit_mmap will destroy ptes after this */
+
+ mutex_lock(&knumad_mm_mutex);
+ VM_BUG_ON(knumad_scan.mm != mm);
+ knumad_scan.address = address;
+ /*
+ * Change the current mm if this mm is about to die, or if we
+ * scanned all vmas of this mm.
+ */
+ if (knumad_test_exit(mm) || !vma) {
+ mm_autonuma = mm->mm_autonuma;
+ if (mm_autonuma->mm_node.next != &knumad_scan.mm_head) {
+ mm_autonuma = list_entry(mm_autonuma->mm_node.next,
+ struct mm_autonuma, mm_node);
+ knumad_scan.mm = mm_autonuma->mm;
+ atomic_inc(&knumad_scan.mm->mm_count);
+ knumad_scan.address = 0;
+ knumad_scan.mm->mm_autonuma->numa_fault_pass++;
+ } else
+ knumad_scan.mm = NULL;
+
+ if (knumad_test_exit(mm))
+ list_del(&mm->mm_autonuma->mm_node);
+ else
+ mm_numa_fault_flush(mm);
+
+ mmdrop(mm);
+ }
+
+ return progress;
+}
+
+static void wake_up_knuma_migrated(void)
+{
+ int nid;
+
+ lru_add_drain();
+ for_each_online_node(nid) {
+ struct pglist_data *pgdat = NODE_DATA(nid);
+ if (pgdat->autonuma_nr_migrate_pages &&
+ waitqueue_active(&pgdat->autonuma_knuma_migrated_wait))
+ wake_up_interruptible(&pgdat->
+ autonuma_knuma_migrated_wait);
+ }
+}
+
+static void knuma_scand_disabled(void)
+{
+ if (!autonuma_enabled())
+ wait_event_freezable(knuma_scand_wait,
+ autonuma_enabled() ||
+ kthread_should_stop());
+}
+
+static int knuma_scand(void *none)
+{
+ struct mm_struct *mm = NULL;
+ int progress = 0, _progress;
+ unsigned long total_progress = 0;
+
+ set_freezable();
+
+ knuma_scand_disabled();
+
+ mutex_lock(&knumad_mm_mutex);
+
+ for (;;) {
+ if (unlikely(kthread_should_stop()))
+ break;
+ _progress = knumad_do_scan();
+ progress += _progress;
+ total_progress += _progress;
+ mutex_unlock(&knumad_mm_mutex);
+
+ if (unlikely(!knumad_scan.mm)) {
+ autonuma_printk("knuma_scand %lu\n", total_progress);
+ pages_scanned += total_progress;
+ total_progress = 0;
+ full_scans++;
+
+ wait_event_freezable_timeout(knuma_scand_wait,
+ kthread_should_stop(),
+ msecs_to_jiffies(
+ scan_sleep_pass_millisecs));
+ /* flush the last pending pages < pages_to_migrate */
+ wake_up_knuma_migrated();
+ wait_event_freezable_timeout(knuma_scand_wait,
+ kthread_should_stop(),
+ msecs_to_jiffies(
+ scan_sleep_pass_millisecs));
+
+ if (autonuma_debug()) {
+ extern void sched_autonuma_dump_mm(void);
+ sched_autonuma_dump_mm();
+ }
+
+ /* wait while there is no pinned mm */
+ knuma_scand_disabled();
+ }
+ if (progress > pages_to_scan) {
+ progress = 0;
+ wait_event_freezable_timeout(knuma_scand_wait,
+ kthread_should_stop(),
+ msecs_to_jiffies(
+ scan_sleep_millisecs));
+ }
+ cond_resched();
+ mutex_lock(&knumad_mm_mutex);
+ }
+
+ mm = knumad_scan.mm;
+ knumad_scan.mm = NULL;
+ if (mm)
+ list_del(&mm->mm_autonuma->mm_node);
+ mutex_unlock(&knumad_mm_mutex);
+
+ if (mm)
+ mmdrop(mm);
+
+ return 0;
+}
+
+static int isolate_migratepages(struct list_head *migratepages,
+ struct pglist_data *pgdat)
+{
+ int nr = 0, nid;
+ struct list_head *heads = pgdat->autonuma_migrate_head;
+
+ /* FIXME: THP balancing, restart from last nid */
+ for_each_online_node(nid) {
+ struct zone *zone;
+ struct page *page;
+ cond_resched();
+ VM_BUG_ON(numa_node_id() != pgdat->node_id);
+ if (nid == pgdat->node_id) {
+ VM_BUG_ON(!list_empty(&heads[nid]));
+ continue;
+ }
+ if (list_empty(&heads[nid]))
+ continue;
+ /* some page wants to go to this pgdat */
+ /*
+ * Take the lock with irqs disabled to avoid a lock
+ * inversion with the lru_lock which is taken before
+ * the autonuma_migrate_lock in split_huge_page, and
+ * that could be taken by interrupts after we obtained
+ * the autonuma_migrate_lock here, if we didn't disable
+ * irqs.
+ */
+ autonuma_migrate_lock_irq(pgdat->node_id);
+ if (list_empty(&heads[nid])) {
+ autonuma_migrate_unlock_irq(pgdat->node_id);
+ continue;
+ }
+ page = list_entry(heads[nid].prev,
+ struct page,
+ autonuma_migrate_node);
+ if (unlikely(!get_page_unless_zero(page))) {
+ /*
+ * Is getting freed and will remove self from the
+ * autonuma list shortly, skip it for now.
+ */
+ list_del(&page->autonuma_migrate_node);
+ list_add(&page->autonuma_migrate_node,
+ &heads[nid]);
+ autonuma_migrate_unlock_irq(pgdat->node_id);
+ autonuma_printk("autonuma migrate page is free\n");
+ continue;
+ }
+ if (!PageLRU(page)) {
+ autonuma_migrate_unlock_irq(pgdat->node_id);
+ autonuma_printk("autonuma migrate page not in LRU\n");
+ __autonuma_migrate_page_remove(page);
+ put_page(page);
+ continue;
+ }
+ autonuma_migrate_unlock_irq(pgdat->node_id);
+
+ VM_BUG_ON(nid != page_to_nid(page));
+
+ if (PageAnon(page) && PageTransHuge(page))
+ /* FIXME: remove split_huge_page */
+ split_huge_page(page);
+
+ __autonuma_migrate_page_remove(page);
+
+ zone = page_zone(page);
+ spin_lock_irq(&zone->lru_lock);
+ if (!__isolate_lru_page(page, ISOLATE_ACTIVE|ISOLATE_INACTIVE,
+ 0)) {
+ VM_BUG_ON(PageTransCompound(page));
+ del_page_from_lru_list(zone, page, page_lru(page));
+ inc_zone_state(zone, page_is_file_cache(page) ?
+ NR_ISOLATED_FILE : NR_ISOLATED_ANON);
+ spin_unlock_irq(&zone->lru_lock);
+ /*
+ * hold the page pin at least until
+ * __isolate_lru_page succeeds
+ * (__isolate_lru_page takes a second pin when
+ * it succeeds). If we release the pin before
+ * __isolate_lru_page returns, the page could
+ * have been freed and reallocated from under
+ * us, so rendering worthless our previous
+ * checks on the page including the
+ * split_huge_page call.
+ */
+ put_page(page);
+
+ list_add(&page->lru, migratepages);
+ nr += hpage_nr_pages(page);
+ } else {
+ /* FIXME: losing page, safest and simplest for now */
+ spin_unlock_irq(&zone->lru_lock);
+ put_page(page);
+ autonuma_printk("autonuma migrate page lost\n");
+ }
+ }
+
+ return nr;
+}
+
+static struct page *alloc_migrate_dst_page(struct page *page,
+ unsigned long data,
+ int **result)
+{
+ int nid = (int) data;
+ struct page *newpage;
+ newpage = alloc_pages_exact_node(nid,
+ GFP_HIGHUSER_MOVABLE | GFP_THISNODE,
+ 0);
+ if (newpage)
+ newpage->autonuma_last_nid = page->autonuma_last_nid;
+ return newpage;
+}
+
+static void knumad_do_migrate(struct pglist_data *pgdat)
+{
+ int nr_migrate_pages = 0;
+ LIST_HEAD(migratepages);
+
+ autonuma_printk("nr_migrate_pages %lu to node %d\n",
+ pgdat->autonuma_nr_migrate_pages, pgdat->node_id);
+ do {
+ int isolated = 0;
+ if (balance_pgdat(pgdat, nr_migrate_pages))
+ isolated = isolate_migratepages(&migratepages, pgdat);
+ /* FIXME: might need to check too many isolated */
+ if (!isolated)
+ break;
+ nr_migrate_pages += isolated;
+ } while (nr_migrate_pages < pages_to_migrate);
+
+ if (nr_migrate_pages) {
+ int err;
+ autonuma_printk("migrate %d to node %d\n", nr_migrate_pages,
+ pgdat->node_id);
+ pages_migrated += nr_migrate_pages; /* FIXME: per node */
+ err = migrate_pages(&migratepages, alloc_migrate_dst_page,
+ pgdat->node_id, false, true);
+ if (err)
+ /* FIXME: requeue failed pages */
+ putback_lru_pages(&migratepages);
+ }
+}
+
+static int knuma_migrated(void *arg)
+{
+ struct pglist_data *pgdat = (struct pglist_data *)arg;
+ int nid = pgdat->node_id;
+ DECLARE_WAIT_QUEUE_HEAD_ONSTACK(nowakeup);
+
+ set_freezable();
+
+ for (;;) {
+ if (unlikely(kthread_should_stop()))
+ break;
+ /* FIXME: scan the free levels of this node we may not
+ * be allowed to receive memory if the wmark of this
+ * pgdat are below high. In the future also add
+ * not-interesting pages like not-accessed pages to
+ * pgdat->autonuma_migrate_head[pgdat->node_id]; so we
+ * can move our memory away to other nodes in order
+ * to satisfy the high-wmark described above (so migration
+ * can continue).
+ */
+ knumad_do_migrate(pgdat);
+ if (!pgdat->autonuma_nr_migrate_pages) {
+ wait_event_freezable(
+ pgdat->autonuma_knuma_migrated_wait,
+ pgdat->autonuma_nr_migrate_pages ||
+ kthread_should_stop());
+ autonuma_printk("wake knuma_migrated %d\n", nid);
+ } else
+ wait_event_freezable_timeout(nowakeup,
+ kthread_should_stop(),
+ msecs_to_jiffies(
+ migrate_sleep_millisecs));
+ }
+
+ return 0;
+}
+
+void autonuma_enter(struct mm_struct *mm)
+{
+ if (autonuma_impossible())
+ return;
+
+ mutex_lock(&knumad_mm_mutex);
+ list_add_tail(&mm->mm_autonuma->mm_node, &knumad_scan.mm_head);
+ mutex_unlock(&knumad_mm_mutex);
+}
+
+void autonuma_exit(struct mm_struct *mm)
+{
+ bool serialize;
+
+ if (autonuma_impossible())
+ return;
+
+ serialize = false;
+ mutex_lock(&knumad_mm_mutex);
+ if (knumad_scan.mm == mm)
+ serialize = true;
+ else
+ list_del(&mm->mm_autonuma->mm_node);
+ mutex_unlock(&knumad_mm_mutex);
+
+ if (serialize) {
+ down_write(&mm->mmap_sem);
+ up_write(&mm->mmap_sem);
+ }
+}
+
+static int start_knuma_scand(void)
+{
+ int err = 0;
+ struct task_struct *knumad_thread;
+
+ knumad_thread = kthread_run(knuma_scand, NULL, "knuma_scand");
+ if (unlikely(IS_ERR(knumad_thread))) {
+ autonuma_printk(KERN_ERR
+ "knumad: kthread_run(knuma_scand) failed\n");
+ err = PTR_ERR(knumad_thread);
+ }
+ return err;
+}
+
+static int start_knuma_migrated(void)
+{
+ int err = 0;
+ struct task_struct *knumad_thread;
+ int nid;
+
+ for_each_online_node(nid) {
+ knumad_thread = kthread_create_on_node(knuma_migrated,
+ NODE_DATA(nid),
+ nid,
+ "knuma_migrated%d",
+ nid);
+ if (unlikely(IS_ERR(knumad_thread))) {
+ autonuma_printk(KERN_ERR
+ "knumad: "
+ "kthread_run(knuma_migrated%d) "
+ "failed\n", nid);
+ err = PTR_ERR(knumad_thread);
+ } else {
+ autonuma_printk("cpumask %d %lx\n", nid,
+ cpumask_of_node(nid)->bits[0]);
+ kthread_bind_node(knumad_thread, nid);
+ wake_up_process(knumad_thread);
+ }
+ }
+ return err;
+}
+
+
+#ifdef CONFIG_SYSFS
+
+static ssize_t flag_show(struct kobject *kobj,
+ struct kobj_attribute *attr, char *buf,
+ enum autonuma_flag flag)
+{
+ return sprintf(buf, "%d\n",
+ !!test_bit(flag, &autonuma_flags));
+}
+static ssize_t flag_store(struct kobject *kobj,
+ struct kobj_attribute *attr,
+ const char *buf, size_t count,
+ enum autonuma_flag flag)
+{
+ unsigned long value;
+ int ret;
+
+ ret = kstrtoul(buf, 10, &value);
+ if (ret < 0)
+ return ret;
+ if (value > 1)
+ return -EINVAL;
+
+ if (value)
+ set_bit(flag, &autonuma_flags);
+ else
+ clear_bit(flag, &autonuma_flags);
+
+ return count;
+}
+
+static ssize_t enabled_show(struct kobject *kobj,
+ struct kobj_attribute *attr, char *buf)
+{
+ return flag_show(kobj, attr, buf, AUTONUMA_FLAG);
+}
+static ssize_t enabled_store(struct kobject *kobj,
+ struct kobj_attribute *attr,
+ const char *buf, size_t count)
+{
+ ssize_t ret;
+
+ ret = flag_store(kobj, attr, buf, count, AUTONUMA_FLAG);
+
+ if (ret > 0 && autonuma_enabled())
+ wake_up_interruptible(&knuma_scand_wait);
+
+ return ret;
+}
+static struct kobj_attribute enabled_attr =
+ __ATTR(enabled, 0644, enabled_show, enabled_store);
+
+#define SYSFS_ENTRY(NAME, FLAG) \
+static ssize_t NAME ## _show(struct kobject *kobj, \
+ struct kobj_attribute *attr, char *buf) \
+{ \
+ return flag_show(kobj, attr, buf, FLAG); \
+} \
+ \
+static ssize_t NAME ## _store(struct kobject *kobj, \
+ struct kobj_attribute *attr, \
+ const char *buf, size_t count) \
+{ \
+ return flag_store(kobj, attr, buf, count, FLAG); \
+} \
+static struct kobj_attribute NAME ## _attr = \
+ __ATTR(NAME, 0644, NAME ## _show, NAME ## _store);
+
+SYSFS_ENTRY(debug, AUTONUMA_DEBUG_FLAG);
+SYSFS_ENTRY(pmd, AUTONUMA_SCAN_PMD_FLAG);
+SYSFS_ENTRY(working_set, AUTONUMA_SCAN_USE_WORKING_SET_FLAG);
+SYSFS_ENTRY(defer, AUTONUMA_MIGRATE_DEFER_FLAG);
+SYSFS_ENTRY(load_balance_strict, AUTONUMA_SCHED_LOAD_BALANCE_STRICT_FLAG);
+SYSFS_ENTRY(clone_reset, AUTONUMA_SCHED_CLONE_RESET_FLAG);
+SYSFS_ENTRY(fork_reset, AUTONUMA_SCHED_FORK_RESET_FLAG);
+
+#undef SYSFS_ENTRY
+
+enum {
+ SYSFS_KNUMA_SCAND_SLEEP_ENTRY,
+ SYSFS_KNUMA_SCAND_PAGES_ENTRY,
+ SYSFS_KNUMA_MIGRATED_SLEEP_ENTRY,
+ SYSFS_KNUMA_MIGRATED_PAGES_ENTRY,
+};
+
+#define SYSFS_ENTRY(NAME, SYSFS_TYPE) \
+static ssize_t NAME ## _show(struct kobject *kobj, \
+ struct kobj_attribute *attr, \
+ char *buf) \
+{ \
+ return sprintf(buf, "%u\n", NAME); \
+} \
+static ssize_t NAME ## _store(struct kobject *kobj, \
+ struct kobj_attribute *attr, \
+ const char *buf, size_t count) \
+{ \
+ unsigned long val; \
+ int err; \
+ \
+ err = strict_strtoul(buf, 10, &val); \
+ if (err || val > UINT_MAX) \
+ return -EINVAL; \
+ switch (SYSFS_TYPE) { \
+ case SYSFS_KNUMA_SCAND_PAGES_ENTRY: \
+ case SYSFS_KNUMA_MIGRATED_PAGES_ENTRY: \
+ if (!val) \
+ return -EINVAL; \
+ break; \
+ } \
+ \
+ NAME = val; \
+ switch (SYSFS_TYPE) { \
+ case SYSFS_KNUMA_SCAND_SLEEP_ENTRY: \
+ wake_up_interruptible(&knuma_scand_wait); \
+ break; \
+ case \
+ SYSFS_KNUMA_MIGRATED_SLEEP_ENTRY: \
+ wake_up_knuma_migrated(); \
+ break; \
+ } \
+ \
+ return count; \
+} \
+static struct kobj_attribute NAME ## _attr = \
+ __ATTR(NAME, 0644, NAME ## _show, NAME ## _store);
+
+SYSFS_ENTRY(scan_sleep_millisecs, SYSFS_KNUMA_SCAND_SLEEP_ENTRY);
+SYSFS_ENTRY(scan_sleep_pass_millisecs, SYSFS_KNUMA_SCAND_SLEEP_ENTRY);
+SYSFS_ENTRY(pages_to_scan, SYSFS_KNUMA_SCAND_PAGES_ENTRY);
+
+SYSFS_ENTRY(migrate_sleep_millisecs, SYSFS_KNUMA_MIGRATED_SLEEP_ENTRY);
+SYSFS_ENTRY(pages_to_migrate, SYSFS_KNUMA_MIGRATED_PAGES_ENTRY);
+
+#undef SYSFS_ENTRY
+
+static struct attribute *autonuma_attr[] = {
+ &enabled_attr.attr,
+ &debug_attr.attr,
+ NULL,
+};
+static struct attribute_group autonuma_attr_group = {
+ .attrs = autonuma_attr,
+};
+
+#define SYSFS_ENTRY(NAME) \
+static ssize_t NAME ## _show(struct kobject *kobj, \
+ struct kobj_attribute *attr, \
+ char *buf) \
+{ \
+ return sprintf(buf, "%lu\n", NAME); \
+} \
+static struct kobj_attribute NAME ## _attr = \
+ __ATTR_RO(NAME);
+
+SYSFS_ENTRY(full_scans);
+SYSFS_ENTRY(pages_scanned);
+SYSFS_ENTRY(pages_migrated);
+
+#undef SYSFS_ENTRY
+
+static struct attribute *knuma_scand_attr[] = {
+ &scan_sleep_millisecs_attr.attr,
+ &scan_sleep_pass_millisecs_attr.attr,
+ &pages_to_scan_attr.attr,
+ &pages_scanned_attr.attr,
+ &full_scans_attr.attr,
+ &pmd_attr.attr,
+ &working_set_attr.attr,
+ NULL,
+};
+static struct attribute_group knuma_scand_attr_group = {
+ .attrs = knuma_scand_attr,
+ .name = "knuma_scand",
+};
+
+static struct attribute *knuma_migrated_attr[] = {
+ &migrate_sleep_millisecs_attr.attr,
+ &pages_to_migrate_attr.attr,
+ &pages_migrated_attr.attr,
+ &defer_attr.attr,
+ NULL,
+};
+static struct attribute_group knuma_migrated_attr_group = {
+ .attrs = knuma_migrated_attr,
+ .name = "knuma_migrated",
+};
+
+static struct attribute *scheduler_attr[] = {
+ &clone_reset_attr.attr,
+ &fork_reset_attr.attr,
+ &load_balance_strict_attr.attr,
+ NULL,
+};
+static struct attribute_group scheduler_attr_group = {
+ .attrs = scheduler_attr,
+ .name = "scheduler",
+};
+
+static int __init autonuma_init_sysfs(struct kobject **autonuma_kobj)
+{
+ int err;
+
+ *autonuma_kobj = kobject_create_and_add("autonuma", mm_kobj);
+ if (unlikely(!*autonuma_kobj)) {
+ printk(KERN_ERR "autonuma: failed kobject create\n");
+ return -ENOMEM;
+ }
+
+ err = sysfs_create_group(*autonuma_kobj, &autonuma_attr_group);
+ if (err) {
+ printk(KERN_ERR "autonuma: failed register autonuma group\n");
+ goto delete_obj;
+ }
+
+ err = sysfs_create_group(*autonuma_kobj, &knuma_scand_attr_group);
+ if (err) {
+ printk(KERN_ERR
+ "autonuma: failed register knuma_scand group\n");
+ goto remove_autonuma;
+ }
+
+ err = sysfs_create_group(*autonuma_kobj, &knuma_migrated_attr_group);
+ if (err) {
+ printk(KERN_ERR
+ "autonuma: failed register knuma_migrated group\n");
+ goto remove_knuma_scand;
+ }
+
+ err = sysfs_create_group(*autonuma_kobj, &scheduler_attr_group);
+ if (err) {
+ printk(KERN_ERR
+ "autonuma: failed register scheduler group\n");
+ goto remove_knuma_migrated;
+ }
+
+ return 0;
+
+remove_knuma_migrated:
+ sysfs_remove_group(*autonuma_kobj, &knuma_migrated_attr_group);
+remove_knuma_scand:
+ sysfs_remove_group(*autonuma_kobj, &knuma_scand_attr_group);
+remove_autonuma:
+ sysfs_remove_group(*autonuma_kobj, &autonuma_attr_group);
+delete_obj:
+ kobject_put(*autonuma_kobj);
+ return err;
+}
+
+static void __init autonuma_exit_sysfs(struct kobject *autonuma_kobj)
+{
+ sysfs_remove_group(autonuma_kobj, &knuma_migrated_attr_group);
+ sysfs_remove_group(autonuma_kobj, &knuma_scand_attr_group);
+ sysfs_remove_group(autonuma_kobj, &autonuma_attr_group);
+ kobject_put(autonuma_kobj);
+}
+#else
+static inline int autonuma_init_sysfs(struct kobject **autonuma_kobj)
+{
+ return 0;
+}
+
+static inline void autonuma_exit_sysfs(struct kobject *autonuma_kobj)
+{
+}
+#endif /* CONFIG_SYSFS */
+
+static int __init noautonuma_setup(char *str)
+{
+ if (!autonuma_impossible()) {
+ printk("AutoNUMA permanently disabled\n");
+ set_bit(AUTONUMA_IMPOSSIBLE, &autonuma_flags);
+ BUG_ON(!autonuma_impossible());
+ }
+ return 1;
+}
+__setup("noautonuma", noautonuma_setup);
+
+static int __init autonuma_init(void)
+{
+ int err;
+ struct kobject *autonuma_kobj;
+
+ VM_BUG_ON(num_possible_nodes() < 1);
+ if (autonuma_impossible())
+ return -EINVAL;
+
+ err = autonuma_init_sysfs(&autonuma_kobj);
+ if (err)
+ return err;
+
+ err = start_knuma_scand();
+ if (err) {
+ printk("failed to start knuma_scand\n");
+ goto out;
+ }
+ err = start_knuma_migrated();
+ if (err) {
+ printk("failed to start knuma_migrated\n");
+ goto out;
+ }
+
+ printk("AutoNUMA initialized successfully\n");
+ return err;
+
+out:
+ autonuma_exit_sysfs(autonuma_kobj);
+ return err;
+}
+module_init(autonuma_init)
+
+static struct kmem_cache *sched_autonuma_cachep;
+
+int alloc_sched_autonuma(struct task_struct *tsk, struct task_struct *orig,
+ int node)
+{
+ int err = 1;
+ struct sched_autonuma *sched_autonuma;
+
+ if (autonuma_impossible())
+ goto no_numa;
+ sched_autonuma = kmem_cache_alloc_node(sched_autonuma_cachep,
+ GFP_KERNEL, node);
+ if (!sched_autonuma)
+ goto out;
+ if (autonuma_sched_clone_reset())
+ sched_autonuma_reset(sched_autonuma);
+ else {
+ memcpy(sched_autonuma, orig->sched_autonuma,
+ sched_autonuma_size());
+ BUG_ON(sched_autonuma->autonuma_flags &
+ SCHED_AUTONUMA_FLAG_STOP_ONE_CPU);
+ sched_autonuma->autonuma_flags = 0;
+ }
+ tsk->sched_autonuma = sched_autonuma;
+no_numa:
+ err = 0;
+out:
+ return err;
+}
+
+void free_sched_autonuma(struct task_struct *tsk)
+{
+ if (autonuma_impossible()) {
+ BUG_ON(tsk->sched_autonuma);
+ return;
+ }
+
+ BUG_ON(!tsk->sched_autonuma);
+ kmem_cache_free(sched_autonuma_cachep, tsk->sched_autonuma);
+ tsk->sched_autonuma = NULL;
+}
+
+void __init sched_autonuma_init(void)
+{
+ struct sched_autonuma *sched_autonuma;
+
+ BUG_ON(current != &init_task);
+
+ if (autonuma_impossible())
+ return;
+
+ sched_autonuma_cachep =
+ kmem_cache_create("sched_autonuma",
+ sched_autonuma_size(), 0,
+ SLAB_PANIC | SLAB_HWCACHE_ALIGN, NULL);
+
+ sched_autonuma = kmem_cache_alloc_node(sched_autonuma_cachep,
+ GFP_KERNEL, numa_node_id());
+ BUG_ON(!sched_autonuma);
+ sched_autonuma_reset(sched_autonuma);
+ BUG_ON(current->sched_autonuma);
+ current->sched_autonuma = sched_autonuma;
+}
+
+static struct kmem_cache *mm_autonuma_cachep;
+
+int alloc_mm_autonuma(struct mm_struct *mm)
+{
+ int err = 1;
+ struct mm_autonuma *mm_autonuma;
+
+ if (autonuma_impossible())
+ goto no_numa;
+ mm_autonuma = kmem_cache_alloc(mm_autonuma_cachep, GFP_KERNEL);
+ if (!mm_autonuma)
+ goto out;
+ if (autonuma_sched_fork_reset() || !mm->mm_autonuma)
+ mm_autonuma_reset(mm_autonuma);
+ else
+ memcpy(mm_autonuma, mm->mm_autonuma, mm_autonuma_size());
+ mm->mm_autonuma = mm_autonuma;
+ mm_autonuma->mm = mm;
+no_numa:
+ err = 0;
+out:
+ return err;
+}
+
+void free_mm_autonuma(struct mm_struct *mm)
+{
+ if (autonuma_impossible()) {
+ BUG_ON(mm->mm_autonuma);
+ return;
+ }
+
+ BUG_ON(!mm->mm_autonuma);
+ kmem_cache_free(mm_autonuma_cachep, mm->mm_autonuma);
+ mm->mm_autonuma = NULL;
+}
+
+void __init mm_autonuma_init(void)
+{
+ BUG_ON(current != &init_task);
+ BUG_ON(current->mm);
+
+ if (autonuma_impossible())
+ return;
+
+ mm_autonuma_cachep =
+ kmem_cache_create("mm_autonuma",
+ mm_autonuma_size(), 0,
+ SLAB_PANIC | SLAB_HWCACHE_ALIGN, NULL);
+}
This function makes it easy to bind the per-node knuma_migrated
threads to their respective NUMA nodes. Those threads take memory from
the other nodes (in round robin, with an incoming queue for each remote
node) and move that memory to their local node.
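As a rough illustration (the names match the start_knuma_migrated
helper in mm/autonuma.c elsewhere in this series), the intended usage
pattern is:
	/* sketch only: create the kthread stopped, bind it to the
	 * CPUs of its node, then wake it up */
	struct task_struct *k;

	k = kthread_create_on_node(knuma_migrated, NODE_DATA(nid), nid,
				   "knuma_migrated%d", nid);
	if (!IS_ERR(k)) {
		kthread_bind_node(k, nid);
		wake_up_process(k);
	}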
Signed-off-by: Andrea Arcangeli <[email protected]>
---
include/linux/kthread.h | 1 +
kernel/kthread.c | 23 +++++++++++++++++++++++
2 files changed, 24 insertions(+), 0 deletions(-)
diff --git a/include/linux/kthread.h b/include/linux/kthread.h
index 0714b24..e733f97 100644
--- a/include/linux/kthread.h
+++ b/include/linux/kthread.h
@@ -33,6 +33,7 @@ struct task_struct *kthread_create_on_node(int (*threadfn)(void *data),
})
void kthread_bind(struct task_struct *k, unsigned int cpu);
+void kthread_bind_node(struct task_struct *p, int nid);
int kthread_stop(struct task_struct *k);
int kthread_should_stop(void);
bool kthread_freezable_should_stop(bool *was_frozen);
diff --git a/kernel/kthread.c b/kernel/kthread.c
index 3d3de63..48b36f9 100644
--- a/kernel/kthread.c
+++ b/kernel/kthread.c
@@ -234,6 +234,29 @@ void kthread_bind(struct task_struct *p, unsigned int cpu)
EXPORT_SYMBOL(kthread_bind);
/**
+ * kthread_bind_node - bind a just-created kthread to the CPUs of a node.
+ * @p: thread created by kthread_create().
+ * @nid: node (might not be online, must be possible) for @k to run on.
+ *
+ * Description: This function is equivalent to set_cpus_allowed(),
+ * except that @nid doesn't need to be online, and the thread must be
+ * stopped (i.e., just returned from kthread_create()).
+ */
+void kthread_bind_node(struct task_struct *p, int nid)
+{
+ /* Must have done schedule() in kthread() before we set_task_cpu */
+ if (!wait_task_inactive(p, TASK_UNINTERRUPTIBLE)) {
+ WARN_ON(1);
+ return;
+ }
+
+ /* It's safe because the task is inactive. */
+ do_set_cpus_allowed(p, cpumask_of_node(nid));
+ p->flags |= PF_THREAD_BOUND;
+}
+EXPORT_SYMBOL(kthread_bind_node);
+
+/**
* kthread_stop - stop a thread created by kthread_create().
* @k: thread created by kthread_create().
*
The first gear in the whole AutoNUMA algorithm is knuma_scand. If
knuma_scand doesn't run, AutoNUMA is a full bypass. If knuma_scand is
stopped, soon all other AutoNUMA gears will settle down too.
knuma_scand is the daemon that sets pmd_numa and pte_numa, which
allows the NUMA hinting page faults to start; all other actions follow
as a reaction to that.
knuma_scand scans a list of "mm" structures, and this is where we
register and unregister each "mm" with AutoNUMA so knuma_scand can scan it.
Signed-off-by: Andrea Arcangeli <[email protected]>
---
kernel/fork.c | 3 +++
1 files changed, 3 insertions(+), 0 deletions(-)
diff --git a/kernel/fork.c b/kernel/fork.c
index 98db8b0..237c34e 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -70,6 +70,7 @@
#include <linux/khugepaged.h>
#include <linux/signalfd.h>
#include <linux/uprobes.h>
+#include <linux/autonuma.h>
#include <asm/pgtable.h>
#include <asm/pgalloc.h>
@@ -539,6 +540,7 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p)
if (likely(!mm_alloc_pgd(mm))) {
mm->def_flags = 0;
mmu_notifier_mm_init(mm);
+ autonuma_enter(mm);
return mm;
}
@@ -607,6 +609,7 @@ void mmput(struct mm_struct *mm)
exit_aio(mm);
ksm_exit(mm);
khugepaged_exit(mm); /* must run before exit_mmap */
+ autonuma_exit(mm); /* must run before exit_mmap */
exit_mmap(mm);
set_mm_exe_file(mm, NULL);
if (!list_empty(&mm->mmlist)) {
The CFS scheduler is still in charge of all scheduling
decisions. AutoNUMA balancing will at times override those. But
generally we just rely on the CFS scheduler to keep doing its
thing, while preferring the autonuma affine nodes when deciding
to move a process to a different runqueue or when waking it up.
For example, the idle balancing will look into the runqueues of the
busy CPUs, but it'll search first for a task that wants to run on
the idle CPU in AutoNUMA terms (task_autonuma_cpu() being true).
Most of this is encoded in can_migrate_task becoming AutoNUMA
aware and running two passes for each balancing pass, the first NUMA
aware, and the second one relaxed.
The idle/newidle balancing is always allowed to fall back to
non-affine AutoNUMA tasks. The load balancing (which is more a
fairness than a performance issue) is instead only able to cross over
the AutoNUMA affinity if the flag controlled by
/sys/kernel/mm/autonuma/scheduler/load_balance_strict is not set (it
is set by default).
Tasks that haven't been fully profiled yet are not affected by this,
because their p->sched_autonuma->autonuma_node is still set to the
original value of -1 and task_autonuma_cpu will always return true in
that case.
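task_autonuma_cpu() itself is not part of this hunk; a minimal sketch
of what it is assumed to check (the helper name and fields are taken
from this series, the exact implementation may differ):
	/* sketch, not the real helper: a task with no profile yet
	 * (autonuma_node == -1) is allowed everywhere, otherwise only
	 * the CPUs of its autonuma node count as affine */
	static inline bool task_autonuma_cpu(struct task_struct *p, int cpu)
	{
		int autonuma_node;

		if (!p->sched_autonuma)
			return true;
		autonuma_node = ACCESS_ONCE(p->sched_autonuma->autonuma_node);
		return autonuma_node < 0 || autonuma_node == cpu_to_node(cpu);
	}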
Includes fixes from Hillf Danton <[email protected]>.
Signed-off-by: Andrea Arcangeli <[email protected]>
---
kernel/sched/fair.c | 65 +++++++++++++++++++++++++++++++++++++++++++-------
1 files changed, 56 insertions(+), 9 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 137119f..99d1d33 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -26,6 +26,7 @@
#include <linux/slab.h>
#include <linux/profile.h>
#include <linux/interrupt.h>
+#include <linux/autonuma_sched.h>
#include <trace/events/sched.h>
@@ -2621,6 +2622,8 @@ find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
load = weighted_cpuload(i);
if (load < min_load || (load == min_load && i == this_cpu)) {
+ if (!task_autonuma_cpu(p, i))
+ continue;
min_load = load;
idlest = i;
}
@@ -2639,24 +2642,27 @@ static int select_idle_sibling(struct task_struct *p, int target)
struct sched_domain *sd;
struct sched_group *sg;
int i;
+ bool idle_target;
/*
* If the task is going to be woken-up on this cpu and if it is
* already idle, then it is the right target.
*/
- if (target == cpu && idle_cpu(cpu))
+ if (target == cpu && idle_cpu(cpu) && task_autonuma_cpu(p, cpu))
return cpu;
/*
* If the task is going to be woken-up on the cpu where it previously
* ran and if it is currently idle, then it the right target.
*/
- if (target == prev_cpu && idle_cpu(prev_cpu))
+ if (target == prev_cpu && idle_cpu(prev_cpu) &&
+ task_autonuma_cpu(p, prev_cpu))
return prev_cpu;
/*
* Otherwise, iterate the domains and find an elegible idle cpu.
*/
+ idle_target = false;
sd = rcu_dereference(per_cpu(sd_llc, target));
for_each_lower_domain(sd) {
sg = sd->groups;
@@ -2670,9 +2676,18 @@ static int select_idle_sibling(struct task_struct *p, int target)
goto next;
}
- target = cpumask_first_and(sched_group_cpus(sg),
- tsk_cpus_allowed(p));
- goto done;
+ for_each_cpu_and(i, sched_group_cpus(sg),
+ tsk_cpus_allowed(p)) {
+ /* Find autonuma cpu only in idle group */
+ if (task_autonuma_cpu(p, i)) {
+ target = i;
+ goto done;
+ }
+ if (!idle_target) {
+ idle_target = true;
+ target = i;
+ }
+ }
next:
sg = sg->next;
} while (sg != sd->groups);
@@ -2707,7 +2722,8 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags)
return prev_cpu;
if (sd_flag & SD_BALANCE_WAKE) {
- if (cpumask_test_cpu(cpu, tsk_cpus_allowed(p)))
+ if (cpumask_test_cpu(cpu, tsk_cpus_allowed(p)) &&
+ task_autonuma_cpu(p, cpu))
want_affine = 1;
new_cpu = prev_cpu;
}
@@ -3072,6 +3088,7 @@ static unsigned long __read_mostly max_load_balance_interval = HZ/10;
#define LBF_ALL_PINNED 0x01
#define LBF_NEED_BREAK 0x02
+#define LBF_NUMA 0x04
struct lb_env {
struct sched_domain *sd;
@@ -3142,13 +3159,14 @@ static
int can_migrate_task(struct task_struct *p, struct lb_env *env)
{
int tsk_cache_hot = 0;
+ struct cpumask *allowed = tsk_cpus_allowed(p);
/*
* We do not migrate tasks that are:
* 1) running (obviously), or
* 2) cannot be migrated to this CPU due to cpus_allowed, or
* 3) are cache-hot on their current CPU.
*/
- if (!cpumask_test_cpu(env->dst_cpu, tsk_cpus_allowed(p))) {
+ if (!cpumask_test_cpu(env->dst_cpu, allowed)) {
schedstat_inc(p, se.statistics.nr_failed_migrations_affine);
return 0;
}
@@ -3159,6 +3177,10 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
return 0;
}
+ if (!sched_autonuma_can_migrate_task(p, env->flags & LBF_NUMA,
+ env->dst_cpu, env->idle))
+ return 0;
+
/*
* Aggressive migration if:
* 1) task is cache cold, or
@@ -3195,6 +3217,8 @@ static int move_one_task(struct lb_env *env)
{
struct task_struct *p, *n;
+ env->flags |= LBF_NUMA;
+numa_repeat:
list_for_each_entry_safe(p, n, &env->src_rq->cfs_tasks, se.group_node) {
if (throttled_lb_pair(task_group(p), env->src_rq->cpu, env->dst_cpu))
continue;
@@ -3209,8 +3233,14 @@ static int move_one_task(struct lb_env *env)
* stats here rather than inside move_task().
*/
schedstat_inc(env->sd, lb_gained[env->idle]);
+ env->flags &= ~LBF_NUMA;
return 1;
}
+ if (env->flags & LBF_NUMA) {
+ env->flags &= ~LBF_NUMA;
+ goto numa_repeat;
+ }
+
return 0;
}
@@ -3235,6 +3265,8 @@ static int move_tasks(struct lb_env *env)
if (env->imbalance <= 0)
return 0;
+ env->flags |= LBF_NUMA;
+numa_repeat:
while (!list_empty(tasks)) {
p = list_first_entry(tasks, struct task_struct, se.group_node);
@@ -3274,9 +3306,13 @@ static int move_tasks(struct lb_env *env)
* kernels will stop after the first task is pulled to minimize
* the critical section.
*/
- if (env->idle == CPU_NEWLY_IDLE)
- break;
+ if (env->idle == CPU_NEWLY_IDLE) {
+ env->flags &= ~LBF_NUMA;
+ goto out;
+ }
#endif
+ /* not idle anymore after pulling first task */
+ env->idle = CPU_NOT_IDLE;
/*
* We only want to steal up to the prescribed amount of
@@ -3289,6 +3325,17 @@ static int move_tasks(struct lb_env *env)
next:
list_move_tail(&p->se.group_node, tasks);
}
+ if ((env->flags & (LBF_NUMA|LBF_NEED_BREAK)) == LBF_NUMA) {
+ env->flags &= ~LBF_NUMA;
+ if (env->imbalance > 0) {
+ env->loop = 0;
+ env->loop_break = sched_nr_migrate_break;
+ goto numa_repeat;
+ }
+ }
+#ifdef CONFIG_PREEMPT
+out:
+#endif
/*
* Right now, this is one of only two places move_task() is called,
Without this, follow_page wouldn't trigger the NUMA hinting faults.
Signed-off-by: Andrea Arcangeli <[email protected]>
---
mm/memory.c | 4 ++--
1 files changed, 2 insertions(+), 2 deletions(-)
diff --git a/mm/memory.c b/mm/memory.c
index 7f265fc..e3aa47c 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1483,7 +1483,7 @@ struct page *follow_page(struct vm_area_struct *vma, unsigned long address,
goto no_page_table;
pmd = pmd_offset(pud, address);
- if (pmd_none(*pmd))
+ if (pmd_none(*pmd) || pmd_numa(*pmd))
goto no_page_table;
if (pmd_huge(*pmd) && vma->vm_flags & VM_HUGETLB) {
BUG_ON(flags & FOLL_GET);
@@ -1517,7 +1517,7 @@ split_fallthrough:
ptep = pte_offset_map_lock(mm, pmd, address, &ptl);
pte = *ptep;
- if (!pte_present(pte))
+ if (!pte_present(pte) || pte_numa(pte))
goto no_page;
if ((flags & FOLL_WRITE) && !pte_write(pte))
goto unlock;
When pages are freed, abort any pending migration. If knuma_migrated
gets to the page first, it will notice anyway because
get_page_unless_zero will fail.
Signed-off-by: Andrea Arcangeli <[email protected]>
---
mm/page_alloc.c | 4 ++++
1 files changed, 4 insertions(+), 0 deletions(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 3d1ee70..1d3163f 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -614,6 +614,10 @@ static inline int free_pages_check(struct page *page)
bad_page(page);
return 1;
}
+ autonuma_migrate_page_remove(page);
+#ifdef CONFIG_AUTONUMA
+ ACCESS_ONCE(page->autonuma_last_nid) = -1;
+#endif
if (page->flags & PAGE_FLAGS_CHECK_AT_PREP)
page->flags &= ~PAGE_FLAGS_CHECK_AT_PREP;
return 0;
Initialize the AutoNUMA page structure fields at boot.
Signed-off-by: Andrea Arcangeli <[email protected]>
---
mm/page_alloc.c | 4 ++++
1 files changed, 4 insertions(+), 0 deletions(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 1d3163f..3c354d4 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3673,6 +3673,10 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
set_pageblock_migratetype(page, MIGRATE_MOVABLE);
INIT_LIST_HEAD(&page->lru);
+#ifdef CONFIG_AUTONUMA
+ page->autonuma_last_nid = -1;
+ page->autonuma_migrate_nid = -1;
+#endif
#ifdef WANT_PAGE_VIRTUAL
/* The shift won't overflow because ZONE_NORMAL is below 4G. */
if (!is_highmem_idx(zone))
Initialize the knuma_migrated queues at boot time.
Signed-off-by: Andrea Arcangeli <[email protected]>
---
mm/page_alloc.c | 11 +++++++++++
1 files changed, 11 insertions(+), 0 deletions(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 3d69735..3d1ee70 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -58,6 +58,7 @@
#include <linux/memcontrol.h>
#include <linux/prefetch.h>
#include <linux/page-debug-flags.h>
+#include <linux/autonuma.h>
#include <asm/tlbflush.h>
#include <asm/div64.h>
@@ -4295,8 +4296,18 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat,
int nid = pgdat->node_id;
unsigned long zone_start_pfn = pgdat->node_start_pfn;
int ret;
+#ifdef CONFIG_AUTONUMA
+ int node_iter;
+#endif
pgdat_resize_init(pgdat);
+#ifdef CONFIG_AUTONUMA
+ spin_lock_init(&pgdat->autonuma_lock);
+ init_waitqueue_head(&pgdat->autonuma_knuma_migrated_wait);
+ pgdat->autonuma_nr_migrate_pages = 0;
+ for_each_node(node_iter)
+ INIT_LIST_HEAD(&pgdat->autonuma_migrate_head[node_iter]);
+#endif
pgdat->nr_zones = 0;
init_waitqueue_head(&pgdat->kswapd_wait);
pgdat->kswapd_max_order = 0;
These flags are the ones tweaked through sysfs: they control the
behavior of AutoNUMA, from enabling/disabling it to selecting various
runtime options.
Signed-off-by: Andrea Arcangeli <[email protected]>
---
include/linux/autonuma_flags.h | 62 ++++++++++++++++++++++++++++++++++++++++
1 files changed, 62 insertions(+), 0 deletions(-)
create mode 100644 include/linux/autonuma_flags.h
diff --git a/include/linux/autonuma_flags.h b/include/linux/autonuma_flags.h
new file mode 100644
index 0000000..9c702fd
--- /dev/null
+++ b/include/linux/autonuma_flags.h
@@ -0,0 +1,62 @@
+#ifndef _LINUX_AUTONUMA_FLAGS_H
+#define _LINUX_AUTONUMA_FLAGS_H
+
+enum autonuma_flag {
+ AUTONUMA_FLAG,
+ AUTONUMA_IMPOSSIBLE,
+ AUTONUMA_DEBUG_FLAG,
+ AUTONUMA_SCHED_LOAD_BALANCE_STRICT_FLAG,
+ AUTONUMA_SCHED_CLONE_RESET_FLAG,
+ AUTONUMA_SCHED_FORK_RESET_FLAG,
+ AUTONUMA_SCAN_PMD_FLAG,
+ AUTONUMA_SCAN_USE_WORKING_SET_FLAG,
+ AUTONUMA_MIGRATE_DEFER_FLAG,
+};
+
+extern unsigned long autonuma_flags;
+
+static bool inline autonuma_enabled(void)
+{
+ return !!test_bit(AUTONUMA_FLAG, &autonuma_flags);
+}
+
+static bool inline autonuma_debug(void)
+{
+ return !!test_bit(AUTONUMA_DEBUG_FLAG, &autonuma_flags);
+}
+
+static bool inline autonuma_sched_load_balance_strict(void)
+{
+ return !!test_bit(AUTONUMA_SCHED_LOAD_BALANCE_STRICT_FLAG,
+ &autonuma_flags);
+}
+
+static bool inline autonuma_sched_clone_reset(void)
+{
+ return !!test_bit(AUTONUMA_SCHED_CLONE_RESET_FLAG,
+ &autonuma_flags);
+}
+
+static bool inline autonuma_sched_fork_reset(void)
+{
+ return !!test_bit(AUTONUMA_SCHED_FORK_RESET_FLAG,
+ &autonuma_flags);
+}
+
+static bool inline autonuma_scan_pmd(void)
+{
+ return !!test_bit(AUTONUMA_SCAN_PMD_FLAG, &autonuma_flags);
+}
+
+static bool inline autonuma_scan_use_working_set(void)
+{
+ return !!test_bit(AUTONUMA_SCAN_USE_WORKING_SET_FLAG,
+ &autonuma_flags);
+}
+
+static bool inline autonuma_migrate_defer(void)
+{
+ return !!test_bit(AUTONUMA_MIGRATE_DEFER_FLAG, &autonuma_flags);
+}
+
+#endif /* _LINUX_AUTONUMA_FLAGS_H */
Until THP native migration is implemented, it's safer to boost the
khugepaged scanning rate, because every memory migration splits the
hugepages. The regular scanning rate otherwise becomes too low when
lots of memory is migrated.
Signed-off-by: Andrea Arcangeli <[email protected]>
---
mm/huge_memory.c | 8 ++++++++
1 files changed, 8 insertions(+), 0 deletions(-)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 017c0a3..b919c0c 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -573,6 +573,14 @@ static int __init hugepage_init(void)
set_recommended_min_free_kbytes();
+#ifdef CONFIG_AUTONUMA
+ /* Hack, remove after THP native migration */
+ if (num_possible_nodes() > 1) {
+ khugepaged_scan_sleep_millisecs = 100;
+ khugepaged_alloc_sleep_millisecs = 10000;
+ }
+#endif
+
return 0;
out:
hugepage_exit_sysfs(hugepage_kobj);
gup_fast will skip over non-present ptes (pte_numa requires the pte to
be non-present), so no explicit check is needed for pte_numa in the
pte case.
gup_fast will also automatically skip over THP when the trans huge pmd
is non-present (pmd_numa requires the pmd to be non-present).
But for the special pmd mode scan of knuma_scand
(/sys/kernel/mm/autonuma/knuma_scand/pmd == 1), the pmd may be of numa
type (so non-present too) while the ptes under it may still be
present. gup_pte_range wouldn't notice that the pmd is of numa type.
So, to avoid losing a NUMA hinting page fault with gup_fast, we need
an explicit check for pmd_numa() here to be sure it will fault through
gup -> handle_mm_fault.
Signed-off-by: Andrea Arcangeli <[email protected]>
---
arch/x86/mm/gup.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)
diff --git a/arch/x86/mm/gup.c b/arch/x86/mm/gup.c
index dd74e46..bf36575 100644
--- a/arch/x86/mm/gup.c
+++ b/arch/x86/mm/gup.c
@@ -164,7 +164,7 @@ static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end,
* wait_split_huge_page() would never return as the
* tlb flush IPI wouldn't run.
*/
- if (pmd_none(pmd) || pmd_trans_splitting(pmd))
+ if (pmd_none(pmd) || pmd_trans_splitting(pmd) || pmd_numa(pmd))
return 0;
if (unlikely(pmd_large(pmd))) {
if (!gup_huge_pmd(pmd, addr, next, write, pages, nr))
Define the two data structures that collect the per-process (in the
mm) and per-thread (in the task_struct) statistical information that
is the input of the CPU-follow-memory algorithm in the NUMA
scheduler.
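The numa_fault[0] flexible arrays below are sized with one counter per
possible NUMA node; the mm_autonuma_size()/sched_autonuma_size()
helpers used when creating the slab caches are assumed to compute
roughly:
	/* assumed definitions, not part of this hunk */
	static inline int mm_autonuma_size(void)
	{
		return sizeof(struct mm_autonuma) +
			num_possible_nodes() * sizeof(unsigned long);
	}

	static inline int sched_autonuma_size(void)
	{
		return sizeof(struct sched_autonuma) +
			num_possible_nodes() * sizeof(unsigned long);
	}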
Signed-off-by: Andrea Arcangeli <[email protected]>
---
include/linux/autonuma_types.h | 57 ++++++++++++++++++++++++++++++++++++++++
1 files changed, 57 insertions(+), 0 deletions(-)
create mode 100644 include/linux/autonuma_types.h
diff --git a/include/linux/autonuma_types.h b/include/linux/autonuma_types.h
new file mode 100644
index 0000000..65b175b
--- /dev/null
+++ b/include/linux/autonuma_types.h
@@ -0,0 +1,57 @@
+#ifndef _LINUX_AUTONUMA_TYPES_H
+#define _LINUX_AUTONUMA_TYPES_H
+
+#ifdef CONFIG_AUTONUMA
+
+#include <linux/numa.h>
+
+struct mm_autonuma {
+ struct list_head mm_node;
+ struct mm_struct *mm;
+ unsigned long numa_fault_tot; /* reset from here */
+ unsigned long numa_fault_pass;
+ unsigned long numa_fault[0];
+};
+
+extern int alloc_mm_autonuma(struct mm_struct *mm);
+extern void free_mm_autonuma(struct mm_struct *mm);
+extern void __init mm_autonuma_init(void);
+
+#define SCHED_AUTONUMA_FLAG_STOP_ONE_CPU (1<<0)
+#define SCHED_AUTONUMA_FLAG_NEED_BALANCE (1<<1)
+
+struct sched_autonuma {
+ int autonuma_node;
+ unsigned int autonuma_flags; /* zeroed from here */
+ unsigned long numa_fault_pass;
+ unsigned long numa_fault_tot;
+ unsigned long numa_fault[0];
+};
+
+extern int alloc_sched_autonuma(struct task_struct *tsk,
+ struct task_struct *orig,
+ int node);
+extern void __init sched_autonuma_init(void);
+extern void free_sched_autonuma(struct task_struct *tsk);
+
+#else /* CONFIG_AUTONUMA */
+
+static inline int alloc_mm_autonuma(struct mm_struct *mm)
+{
+ return 0;
+}
+static inline void free_mm_autonuma(struct mm_struct *mm) {}
+static inline void mm_autonuma_init(void) {}
+
+static inline int alloc_sched_autonuma(struct task_struct *tsk,
+ struct task_struct *orig,
+ int node)
+{
+ return 0;
+}
+static inline void sched_autonuma_init(void) {}
+static inline void free_sched_autonuma(struct task_struct *tsk) {}
+
+#endif /* CONFIG_AUTONUMA */
+
+#endif /* _LINUX_AUTONUMA_TYPES_H */
Implement pte_numa and pmd_numa and related methods on x86 arch.
We must atomically set the numa bit and clear the present bit to
define a pte_numa or pmd_numa.
Whenever a pte or pmd is set as pte_numa or pmd_numa, the first time a
thread touches that virtual address a NUMA hinting page fault
triggers. The NUMA hinting page fault simply clears the NUMA bit and
sets the present bit again to resolve the page fault.
NUMA hinting page faults are used:
1) to fill in the per-thread NUMA statistic stored for each thread in
a current->sched_autonuma data structure
2) to track the per-page last_nid information in the page structure to
detect false sharing
3) to queue the page mapped by the pte_numa or pmd_numa for async
migration if there have been enough NUMA hinting page faults on the
page coming from remote CPUs
NUMA hinting page faults don't do anything except collecting
information and possibly adding pages to migrate queues. They're
extremely quick and absolutely non blocking. They don't allocate any
memory either.
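As a sketch of the fault side (the actual handler is in a separate
patch, not in this hunk), resolving a NUMA hinting fault with the
helpers below amounts to something like:
	/* sketch only: count the access for the CPU-follow-memory
	 * statistics, possibly queue the page for migration, then make
	 * the pte present again and update the MMU cache */
	if (pte_numa(pteval)) {
		/* update sched_autonuma/last_nid, maybe queue migration */
		pteval = pte_mknotnuma(pteval);
		set_pte_at(mm, address, ptep, pteval);
		update_mmu_cache(vma, address, ptep);
	}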
The only "input" information of the AutoNUMA algorithm that isn't
collected through NUMA hinting page faults are the per-process
(per-thread not) mm->mm_autonuma statistics. Those mm_autonuma
statistics are collected by the knuma_scand pmd/pte scans that are
also responsible for setting the pte_numa/pmd_numa to activate the
NUMA hinting page faults.
knuma_scand      ->      NUMA hinting page faults
     |                              |
    \|/                            \|/
mm_autonuma     <->     sched_autonuma (CPU follow memory, this is mm_autonuma too)
                        page last_nid (false thread sharing / thread shared memory detection)
                        queue or cancel page migration (memory follow CPU)
After pages are queued, there is one knuma_migratedN daemon per NUMA
node that takes care of migrating the pages at a perfectly steady
rate, in parallel on all nodes and in round robin across all incoming
nodes targeting the same destination node. That keeps all memory
channels of large boxes active at the same time and avoids hammering a
single memory channel for too long, to minimize memory-bus migration
latency effects.
Once pages are queued for async migration by knuma_migratedN, their
migration can still be canceled before they're actually migrated, if
false sharing is later detected.
Signed-off-by: Andrea Arcangeli <[email protected]>
---
arch/x86/include/asm/pgtable.h | 51 +++++++++++++++++++++++++++++++++++++--
1 files changed, 48 insertions(+), 3 deletions(-)
diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 49afb3f..7514fa6 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -109,7 +109,7 @@ static inline int pte_write(pte_t pte)
static inline int pte_file(pte_t pte)
{
- return pte_flags(pte) & _PAGE_FILE;
+ return (pte_flags(pte) & _PAGE_FILE) == _PAGE_FILE;
}
static inline int pte_huge(pte_t pte)
@@ -405,7 +405,9 @@ static inline int pte_same(pte_t a, pte_t b)
static inline int pte_present(pte_t a)
{
- return pte_flags(a) & (_PAGE_PRESENT | _PAGE_PROTNONE);
+ /* _PAGE_NUMA includes _PAGE_PROTNONE */
+ return pte_flags(a) & (_PAGE_PRESENT | _PAGE_PROTNONE |
+ _PAGE_NUMA_PTE);
}
static inline int pte_hidden(pte_t pte)
@@ -415,7 +417,46 @@ static inline int pte_hidden(pte_t pte)
static inline int pmd_present(pmd_t pmd)
{
- return pmd_flags(pmd) & _PAGE_PRESENT;
+ return pmd_flags(pmd) & (_PAGE_PRESENT | _PAGE_PROTNONE |
+ _PAGE_NUMA_PMD);
+}
+
+#ifdef CONFIG_AUTONUMA
+static inline int pte_numa(pte_t pte)
+{
+ return (pte_flags(pte) &
+ (_PAGE_NUMA_PTE|_PAGE_PRESENT)) == _PAGE_NUMA_PTE;
+}
+
+static inline int pmd_numa(pmd_t pmd)
+{
+ return (pmd_flags(pmd) &
+ (_PAGE_NUMA_PMD|_PAGE_PRESENT)) == _PAGE_NUMA_PMD;
+}
+#endif
+
+static inline pte_t pte_mknotnuma(pte_t pte)
+{
+ pte = pte_clear_flags(pte, _PAGE_NUMA_PTE);
+ return pte_set_flags(pte, _PAGE_PRESENT|_PAGE_ACCESSED);
+}
+
+static inline pmd_t pmd_mknotnuma(pmd_t pmd)
+{
+ pmd = pmd_clear_flags(pmd, _PAGE_NUMA_PMD);
+ return pmd_set_flags(pmd, _PAGE_PRESENT|_PAGE_ACCESSED);
+}
+
+static inline pte_t pte_mknuma(pte_t pte)
+{
+ pte = pte_set_flags(pte, _PAGE_NUMA_PTE);
+ return pte_clear_flags(pte, _PAGE_PRESENT);
+}
+
+static inline pmd_t pmd_mknuma(pmd_t pmd)
+{
+ pmd = pmd_set_flags(pmd, _PAGE_NUMA_PMD);
+ return pmd_clear_flags(pmd, _PAGE_PRESENT);
}
static inline int pmd_none(pmd_t pmd)
@@ -474,6 +515,10 @@ static inline pte_t *pte_offset_kernel(pmd_t *pmd, unsigned long address)
static inline int pmd_bad(pmd_t pmd)
{
+#ifdef CONFIG_AUTONUMA
+ if (pmd_numa(pmd))
+ return 0;
+#endif
return (pmd_flags(pmd) & ~_PAGE_USER) != _KERNPG_TABLE;
}
If the task has already been moved to an autonuma_node, try to allocate
memory from it even if it's temporarily not the local node. Chances
are that's where most of its memory is already located and where the
task will run in the future.
Signed-off-by: Andrea Arcangeli <[email protected]>
---
mm/mempolicy.c | 15 +++++++++++++--
1 files changed, 13 insertions(+), 2 deletions(-)
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 88f9422..b6b88f6 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1925,10 +1925,21 @@ retry_cpuset:
*/
if (pol->mode == MPOL_INTERLEAVE)
page = alloc_page_interleave(gfp, order, interleave_nodes(pol));
- else
+ else {
+ int nid;
+#ifdef CONFIG_AUTONUMA
+ nid = -1;
+ if (current->sched_autonuma)
+ nid = current->sched_autonuma->autonuma_node;
+ if (nid < 0)
+ nid = numa_node_id();
+#else
+ nid = numa_node_id();
+#endif
page = __alloc_pages_nodemask(gfp, order,
- policy_zonelist(gfp, pol, numa_node_id()),
+ policy_zonelist(gfp, pol, nid),
policy_nodemask(gfp, pol));
+ }
if (unlikely(!put_mems_allowed(cpuset_mems_cookie) && !page))
goto retry_cpuset;
This is needed to make sure the tail pages are also queued into the
migration queues of knuma_migrated across a transparent hugepage
split.
Signed-off-by: Andrea Arcangeli <[email protected]>
---
mm/huge_memory.c | 2 ++
1 files changed, 2 insertions(+), 0 deletions(-)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 383ae4d..b1c047b 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -17,6 +17,7 @@
#include <linux/khugepaged.h>
#include <linux/freezer.h>
#include <linux/mman.h>
+#include <linux/autonuma.h>
#include <asm/tlb.h>
#include <asm/pgalloc.h>
#include "internal.h"
@@ -1307,6 +1308,7 @@ static void __split_huge_page_refcount(struct page *page)
lru_add_page_tail(zone, page, page_tail);
+ autonuma_migrate_split_huge_page(page, page_tail);
}
atomic_sub(tail_count, &page->_count);
BUG_ON(__page_count(page) <= 0);
Move the AutoNUMA per-page information from the "struct page" to a
separate page_autonuma data structure allocated in the memsection
(with sparsemem) or in the pgdat (with flatmem).
This is done to avoid growing the size of the "struct page", and the
page_autonuma data is only allocated if the kernel has been booted on
real NUMA hardware and "noautonuma" hasn't been passed as a parameter
to the kernel.
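lookup_page_autonuma() (implemented in mm/page_autonuma.c, not quoted
in full here) is assumed to work like lookup_page_cgroup(), mapping a
page to its page_autonuma entry in the section (or pgdat) array,
roughly:
	/* assumed sparsemem variant, analogous to lookup_page_cgroup();
	 * the exact offset bookkeeping in mm/page_autonuma.c may differ */
	struct page_autonuma *lookup_page_autonuma(struct page *page)
	{
		unsigned long pfn = page_to_pfn(page);
		struct mem_section *section = __pfn_to_section(pfn);

		return section->section_page_autonuma +
			(pfn & (PAGES_PER_SECTION - 1));
	}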
Signed-off-by: Andrea Arcangeli <[email protected]>
---
include/linux/autonuma.h | 18 +++-
include/linux/autonuma_flags.h | 6 +
include/linux/autonuma_types.h | 31 ++++++
include/linux/mm_types.h | 25 -----
include/linux/mmzone.h | 14 +++-
include/linux/page_autonuma.h | 53 +++++++++
init/main.c | 2 +
mm/Makefile | 2 +-
mm/autonuma.c | 95 ++++++++++-------
mm/huge_memory.c | 11 ++-
mm/page_alloc.c | 23 +----
mm/page_autonuma.c | 234 ++++++++++++++++++++++++++++++++++++++++
mm/sparse.c | 126 ++++++++++++++++++++-
13 files changed, 543 insertions(+), 97 deletions(-)
create mode 100644 include/linux/page_autonuma.h
create mode 100644 mm/page_autonuma.c
diff --git a/include/linux/autonuma.h b/include/linux/autonuma.h
index a963dcb..1eb84d0 100644
--- a/include/linux/autonuma.h
+++ b/include/linux/autonuma.h
@@ -7,15 +7,26 @@
extern void autonuma_enter(struct mm_struct *mm);
extern void autonuma_exit(struct mm_struct *mm);
-extern void __autonuma_migrate_page_remove(struct page *page);
+extern void __autonuma_migrate_page_remove(struct page *,
+ struct page_autonuma *);
extern void autonuma_migrate_split_huge_page(struct page *page,
struct page *page_tail);
extern void autonuma_setup_new_exec(struct task_struct *p);
+extern struct page_autonuma *lookup_page_autonuma(struct page *page);
static inline void autonuma_migrate_page_remove(struct page *page)
{
- if (ACCESS_ONCE(page->autonuma_migrate_nid) >= 0)
- __autonuma_migrate_page_remove(page);
+ struct page_autonuma *page_autonuma = lookup_page_autonuma(page);
+ if (ACCESS_ONCE(page_autonuma->autonuma_migrate_nid) >= 0)
+ __autonuma_migrate_page_remove(page, page_autonuma);
+}
+
+static inline void autonuma_free_page(struct page *page)
+{
+ if (!autonuma_impossible()) {
+ autonuma_migrate_page_remove(page);
+ ACCESS_ONCE(lookup_page_autonuma(page)->autonuma_last_nid) = -1;
+ }
}
#define autonuma_printk(format, args...) \
@@ -29,6 +40,7 @@ static inline void autonuma_migrate_page_remove(struct page *page) {}
static inline void autonuma_migrate_split_huge_page(struct page *page,
struct page *page_tail) {}
static inline void autonuma_setup_new_exec(struct task_struct *p) {}
+static inline void autonuma_free_page(struct page *page) {}
#endif /* CONFIG_AUTONUMA */
diff --git a/include/linux/autonuma_flags.h b/include/linux/autonuma_flags.h
index 9c702fd..6ec837a 100644
--- a/include/linux/autonuma_flags.h
+++ b/include/linux/autonuma_flags.h
@@ -15,6 +15,12 @@ enum autonuma_flag {
extern unsigned long autonuma_flags;
+static inline bool autonuma_impossible(void)
+{
+ return num_possible_nodes() <= 1 ||
+ test_bit(AUTONUMA_IMPOSSIBLE, &autonuma_flags);
+}
+
static bool inline autonuma_enabled(void)
{
return !!test_bit(AUTONUMA_FLAG, &autonuma_flags);
diff --git a/include/linux/autonuma_types.h b/include/linux/autonuma_types.h
index 65b175b..28f64ec 100644
--- a/include/linux/autonuma_types.h
+++ b/include/linux/autonuma_types.h
@@ -28,6 +28,37 @@ struct sched_autonuma {
unsigned long numa_fault[0];
};
+struct page_autonuma {
+ /*
+ * FIXME: move to pgdat section along with the memcg and allocate
+ * at runtime only in presence of a numa system.
+ */
+ /*
+ * To modify autonuma_last_nid lockless the architecture,
+ * needs SMP atomic granularity < sizeof(long), not all archs
+ * have that, notably some alpha. Archs without that requires
+ * autonuma_last_nid to be a long.
+ */
+#if BITS_PER_LONG > 32
+ int autonuma_migrate_nid;
+ int autonuma_last_nid;
+#else
+#if MAX_NUMNODES >= 32768
+#error "too many nodes"
+#endif
+ /* FIXME: remember to check the updates are atomic */
+ short autonuma_migrate_nid;
+ short autonuma_last_nid;
+#endif
+ struct list_head autonuma_migrate_node;
+
+ /*
+ * To find the page starting from the autonuma_migrate_node we
+ * need a backlink.
+ */
+ struct page *page;
+};
+
extern int alloc_sched_autonuma(struct task_struct *tsk,
struct task_struct *orig,
int node);
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index e8dc82c..780ded7 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -126,31 +126,6 @@ struct page {
struct page *first_page; /* Compound tail pages */
};
-#ifdef CONFIG_AUTONUMA
- /*
- * FIXME: move to pgdat section along with the memcg and allocate
- * at runtime only in presence of a numa system.
- */
- /*
- * To modify autonuma_last_nid lockless the architecture,
- * needs SMP atomic granularity < sizeof(long), not all archs
- * have that, notably some alpha. Archs without that requires
- * autonuma_last_nid to be a long.
- */
-#if BITS_PER_LONG > 32
- int autonuma_migrate_nid;
- int autonuma_last_nid;
-#else
-#if MAX_NUMNODES >= 32768
-#error "too many nodes"
-#endif
- /* FIXME: remember to check the updates are atomic */
- short autonuma_migrate_nid;
- short autonuma_last_nid;
-#endif
- struct list_head autonuma_migrate_node;
-#endif
-
/*
* On machines where all RAM is mapped into kernel address space,
* we can simply calculate the virtual address. On machines with
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 8e578e6..89fa49f 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -667,10 +667,13 @@ typedef struct pglist_data {
int kswapd_max_order;
enum zone_type classzone_idx;
#ifdef CONFIG_AUTONUMA
- spinlock_t autonuma_lock;
+#if !defined(CONFIG_SPARSEMEM)
+ struct page_autonuma *node_page_autonuma;
+#endif
struct list_head autonuma_migrate_head[MAX_NUMNODES];
unsigned long autonuma_nr_migrate_pages;
wait_queue_head_t autonuma_knuma_migrated_wait;
+ spinlock_t autonuma_lock;
#endif
} pg_data_t;
@@ -1022,6 +1025,15 @@ struct mem_section {
* section. (see memcontrol.h/page_cgroup.h about this.)
*/
struct page_cgroup *page_cgroup;
+#endif
+#ifdef CONFIG_AUTONUMA
+ /*
+ * If !SPARSEMEM, pgdat doesn't have page_autonuma pointer. We use
+ * section.
+ */
+ struct page_autonuma *section_page_autonuma;
+#endif
+#if defined(CONFIG_CGROUP_MEM_RES_CTLR) ^ defined(CONFIG_AUTONUMA)
unsigned long pad;
#endif
};
diff --git a/include/linux/page_autonuma.h b/include/linux/page_autonuma.h
new file mode 100644
index 0000000..05d2862
--- /dev/null
+++ b/include/linux/page_autonuma.h
@@ -0,0 +1,53 @@
+#ifndef _LINUX_PAGE_AUTONUMA_H
+#define _LINUX_PAGE_AUTONUMA_H
+
+#if defined(CONFIG_AUTONUMA) && !defined(CONFIG_SPARSEMEM)
+extern void __init page_autonuma_init_flatmem(void);
+#else
+static inline void __init page_autonuma_init_flatmem(void) {}
+#endif
+
+#ifdef CONFIG_AUTONUMA
+
+#include <linux/autonuma_flags.h>
+
+extern void __meminit page_autonuma_map_init(struct page *page,
+ struct page_autonuma *page_autonuma,
+ int nr_pages);
+
+#ifdef CONFIG_SPARSEMEM
+#define PAGE_AUTONUMA_SIZE (sizeof(struct page_autonuma))
+#define SECTION_PAGE_AUTONUMA_SIZE (PAGE_AUTONUMA_SIZE * \
+ PAGES_PER_SECTION)
+#endif
+
+extern void __meminit pgdat_autonuma_init(struct pglist_data *);
+
+#else /* CONFIG_AUTONUMA */
+
+#ifdef CONFIG_SPARSEMEM
+struct page_autonuma;
+#define PAGE_AUTONUMA_SIZE 0
+#define SECTION_PAGE_AUTONUMA_SIZE 0
+
+#define autonuma_impossible() true
+
+#endif
+
+static inline void pgdat_autonuma_init(struct pglist_data *pgdat) {}
+
+#endif /* CONFIG_AUTONUMA */
+
+#ifdef CONFIG_SPARSEMEM
+extern struct page_autonuma * __meminit __kmalloc_section_page_autonuma(int nid,
+ unsigned long nr_pages);
+extern void __meminit __kfree_section_page_autonuma(struct page_autonuma *page_autonuma,
+ unsigned long nr_pages);
+extern void __init sparse_early_page_autonuma_alloc_node(struct page_autonuma **page_autonuma_map,
+ unsigned long pnum_begin,
+ unsigned long pnum_end,
+ unsigned long map_count,
+ int nodeid);
+#endif
+
+#endif /* _LINUX_PAGE_AUTONUMA_H */
diff --git a/init/main.c b/init/main.c
index 1ca6b32..275e914 100644
--- a/init/main.c
+++ b/init/main.c
@@ -68,6 +68,7 @@
#include <linux/shmem_fs.h>
#include <linux/slab.h>
#include <linux/perf_event.h>
+#include <linux/page_autonuma.h>
#include <asm/io.h>
#include <asm/bugs.h>
@@ -455,6 +456,7 @@ static void __init mm_init(void)
* bigger than MAX_ORDER unless SPARSEMEM.
*/
page_cgroup_init_flatmem();
+ page_autonuma_init_flatmem();
mem_init();
kmem_cache_init();
percpu_init_late();
diff --git a/mm/Makefile b/mm/Makefile
index 67c77bd..5410eba 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -29,7 +29,7 @@ obj-$(CONFIG_SWAP) += page_io.o swap_state.o swapfile.o thrash.o
obj-$(CONFIG_HAS_DMA) += dmapool.o
obj-$(CONFIG_HUGETLBFS) += hugetlb.o
obj-$(CONFIG_NUMA) += mempolicy.o
-obj-$(CONFIG_AUTONUMA) += autonuma.o
+obj-$(CONFIG_AUTONUMA) += autonuma.o page_autonuma.o
obj-$(CONFIG_SPARSEMEM) += sparse.o
obj-$(CONFIG_SPARSEMEM_VMEMMAP) += sparse-vmemmap.o
obj-$(CONFIG_SLOB) += slob.o
diff --git a/mm/autonuma.c b/mm/autonuma.c
index 88c7ab3..96c02a2 100644
--- a/mm/autonuma.c
+++ b/mm/autonuma.c
@@ -51,12 +51,6 @@ static struct knumad_scan {
.mm_head = LIST_HEAD_INIT(knumad_scan.mm_head),
};
-static inline bool autonuma_impossible(void)
-{
- return num_possible_nodes() <= 1 ||
- test_bit(AUTONUMA_IMPOSSIBLE, &autonuma_flags);
-}
-
static inline void autonuma_migrate_lock(int nid)
{
spin_lock(&NODE_DATA(nid)->autonuma_lock);
@@ -82,51 +76,57 @@ void autonuma_migrate_split_huge_page(struct page *page,
struct page *page_tail)
{
int nid, last_nid;
+ struct page_autonuma *page_autonuma, *page_tail_autonuma;
- nid = page->autonuma_migrate_nid;
+ page_autonuma = lookup_page_autonuma(page);
+ page_tail_autonuma = lookup_page_autonuma(page_tail);
+
+ nid = page_autonuma->autonuma_migrate_nid;
VM_BUG_ON(nid >= MAX_NUMNODES);
VM_BUG_ON(nid < -1);
- VM_BUG_ON(page_tail->autonuma_migrate_nid != -1);
+ VM_BUG_ON(page_tail_autonuma->autonuma_migrate_nid != -1);
if (nid >= 0) {
VM_BUG_ON(page_to_nid(page) != page_to_nid(page_tail));
autonuma_migrate_lock(nid);
- list_add_tail(&page_tail->autonuma_migrate_node,
- &page->autonuma_migrate_node);
+ list_add_tail(&page_tail_autonuma->autonuma_migrate_node,
+ &page_autonuma->autonuma_migrate_node);
autonuma_migrate_unlock(nid);
- page_tail->autonuma_migrate_nid = nid;
+ page_tail_autonuma->autonuma_migrate_nid = nid;
}
- last_nid = ACCESS_ONCE(page->autonuma_last_nid);
+ last_nid = ACCESS_ONCE(page_autonuma->autonuma_last_nid);
if (last_nid >= 0)
- page_tail->autonuma_last_nid = last_nid;
+ page_tail_autonuma->autonuma_last_nid = last_nid;
}
-void __autonuma_migrate_page_remove(struct page *page)
+void __autonuma_migrate_page_remove(struct page *page,
+ struct page_autonuma *page_autonuma)
{
unsigned long flags;
int nid;
flags = compound_lock_irqsave(page);
- nid = page->autonuma_migrate_nid;
+ nid = page_autonuma->autonuma_migrate_nid;
VM_BUG_ON(nid >= MAX_NUMNODES);
VM_BUG_ON(nid < -1);
if (nid >= 0) {
int numpages = hpage_nr_pages(page);
autonuma_migrate_lock(nid);
- list_del(&page->autonuma_migrate_node);
+ list_del(&page_autonuma->autonuma_migrate_node);
NODE_DATA(nid)->autonuma_nr_migrate_pages -= numpages;
autonuma_migrate_unlock(nid);
- page->autonuma_migrate_nid = -1;
+ page_autonuma->autonuma_migrate_nid = -1;
}
compound_unlock_irqrestore(page, flags);
}
-static void __autonuma_migrate_page_add(struct page *page, int dst_nid,
- int page_nid)
+static void __autonuma_migrate_page_add(struct page *page,
+ struct page_autonuma *page_autonuma,
+ int dst_nid, int page_nid)
{
unsigned long flags;
int nid;
@@ -145,25 +145,25 @@ static void __autonuma_migrate_page_add(struct page *page, int dst_nid,
flags = compound_lock_irqsave(page);
numpages = hpage_nr_pages(page);
- nid = page->autonuma_migrate_nid;
+ nid = page_autonuma->autonuma_migrate_nid;
VM_BUG_ON(nid >= MAX_NUMNODES);
VM_BUG_ON(nid < -1);
if (nid >= 0) {
autonuma_migrate_lock(nid);
- list_del(&page->autonuma_migrate_node);
+ list_del(&page_autonuma->autonuma_migrate_node);
NODE_DATA(nid)->autonuma_nr_migrate_pages -= numpages;
autonuma_migrate_unlock(nid);
}
autonuma_migrate_lock(dst_nid);
- list_add(&page->autonuma_migrate_node,
+ list_add(&page_autonuma->autonuma_migrate_node,
&NODE_DATA(dst_nid)->autonuma_migrate_head[page_nid]);
NODE_DATA(dst_nid)->autonuma_nr_migrate_pages += numpages;
nr_migrate_pages = NODE_DATA(dst_nid)->autonuma_nr_migrate_pages;
autonuma_migrate_unlock(dst_nid);
- page->autonuma_migrate_nid = dst_nid;
+ page_autonuma->autonuma_migrate_nid = dst_nid;
compound_unlock_irqrestore(page, flags);
@@ -179,9 +179,13 @@ static void __autonuma_migrate_page_add(struct page *page, int dst_nid,
static void autonuma_migrate_page_add(struct page *page, int dst_nid,
int page_nid)
{
- int migrate_nid = ACCESS_ONCE(page->autonuma_migrate_nid);
+ int migrate_nid;
+ struct page_autonuma *page_autonuma = lookup_page_autonuma(page);
+
+ migrate_nid = ACCESS_ONCE(page_autonuma->autonuma_migrate_nid);
if (migrate_nid != dst_nid)
- __autonuma_migrate_page_add(page, dst_nid, page_nid);
+ __autonuma_migrate_page_add(page, page_autonuma,
+ dst_nid, page_nid);
}
static bool balance_pgdat(struct pglist_data *pgdat,
@@ -252,23 +256,26 @@ static inline bool last_nid_set(struct task_struct *p,
struct page *page, int cpu_nid)
{
bool ret = true;
- int autonuma_last_nid = ACCESS_ONCE(page->autonuma_last_nid);
+ struct page_autonuma *page_autonuma = lookup_page_autonuma(page);
+ int autonuma_last_nid = ACCESS_ONCE(page_autonuma->autonuma_last_nid);
VM_BUG_ON(cpu_nid < 0);
VM_BUG_ON(cpu_nid >= MAX_NUMNODES);
if (autonuma_last_nid >= 0 && autonuma_last_nid != cpu_nid) {
- int migrate_nid = ACCESS_ONCE(page->autonuma_migrate_nid);
+ int migrate_nid;
+ migrate_nid = ACCESS_ONCE(page_autonuma->autonuma_migrate_nid);
if (migrate_nid >= 0 && migrate_nid != cpu_nid)
- __autonuma_migrate_page_remove(page);
+ __autonuma_migrate_page_remove(page, page_autonuma);
ret = false;
}
if (autonuma_last_nid != cpu_nid)
- ACCESS_ONCE(page->autonuma_last_nid) = cpu_nid;
+ ACCESS_ONCE(page_autonuma->autonuma_last_nid) = cpu_nid;
return ret;
}
static int __page_migrate_nid(struct page *page, int page_nid)
{
- int migrate_nid = ACCESS_ONCE(page->autonuma_migrate_nid);
+ struct page_autonuma *page_autonuma = lookup_page_autonuma(page);
+ int migrate_nid = ACCESS_ONCE(page_autonuma->autonuma_migrate_nid);
if (migrate_nid < 0)
migrate_nid = page_nid;
#if 0
@@ -780,6 +787,7 @@ static int isolate_migratepages(struct list_head *migratepages,
for_each_online_node(nid) {
struct zone *zone;
struct page *page;
+ struct page_autonuma *page_autonuma;
cond_resched();
VM_BUG_ON(numa_node_id() != pgdat->node_id);
if (nid == pgdat->node_id) {
@@ -802,16 +810,17 @@ static int isolate_migratepages(struct list_head *migratepages,
autonuma_migrate_unlock_irq(pgdat->node_id);
continue;
}
- page = list_entry(heads[nid].prev,
- struct page,
- autonuma_migrate_node);
+ page_autonuma = list_entry(heads[nid].prev,
+ struct page_autonuma,
+ autonuma_migrate_node);
+ page = page_autonuma->page;
if (unlikely(!get_page_unless_zero(page))) {
/*
* Is getting freed and will remove self from the
* autonuma list shortly, skip it for now.
*/
- list_del(&page->autonuma_migrate_node);
- list_add(&page->autonuma_migrate_node,
+ list_del(&page_autonuma->autonuma_migrate_node);
+ list_add(&page_autonuma->autonuma_migrate_node,
&heads[nid]);
autonuma_migrate_unlock_irq(pgdat->node_id);
autonuma_printk("autonuma migrate page is free\n");
@@ -820,7 +829,7 @@ static int isolate_migratepages(struct list_head *migratepages,
if (!PageLRU(page)) {
autonuma_migrate_unlock_irq(pgdat->node_id);
autonuma_printk("autonuma migrate page not in LRU\n");
- __autonuma_migrate_page_remove(page);
+ __autonuma_migrate_page_remove(page, page_autonuma);
put_page(page);
continue;
}
@@ -832,7 +841,7 @@ static int isolate_migratepages(struct list_head *migratepages,
/* FIXME: remove split_huge_page */
split_huge_page(page);
- __autonuma_migrate_page_remove(page);
+ __autonuma_migrate_page_remove(page, page_autonuma);
zone = page_zone(page);
spin_lock_irq(&zone->lru_lock);
@@ -875,11 +884,16 @@ static struct page *alloc_migrate_dst_page(struct page *page,
{
int nid = (int) data;
struct page *newpage;
+ struct page_autonuma *page_autonuma, *newpage_autonuma;
newpage = alloc_pages_exact_node(nid,
GFP_HIGHUSER_MOVABLE | GFP_THISNODE,
0);
- if (newpage)
- newpage->autonuma_last_nid = page->autonuma_last_nid;
+ if (newpage) {
+ page_autonuma = lookup_page_autonuma(page);
+ newpage_autonuma = lookup_page_autonuma(newpage);
+ newpage_autonuma->autonuma_last_nid =
+ page_autonuma->autonuma_last_nid;
+ }
return newpage;
}
@@ -1299,7 +1313,8 @@ static int __init noautonuma_setup(char *str)
}
return 1;
}
-__setup("noautonuma", noautonuma_setup);
+/* early so sparse.c also can see it */
+early_param("noautonuma", noautonuma_setup);
static int __init autonuma_init(void)
{
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index b919c0c..faaf73f 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1822,6 +1822,12 @@ static bool __collapse_huge_page_copy(pte_t *pte, struct page *page,
{
pte_t *_pte;
bool mknuma = false;
+#ifdef CONFIG_AUTONUMA
+ struct page_autonuma *src_page_an, *page_an;
+
+ page_an = lookup_page_autonuma(page);
+#endif
+
for (_pte = pte; _pte < pte+HPAGE_PMD_NR; _pte++) {
pte_t pteval = *_pte;
struct page *src_page;
@@ -1835,11 +1841,12 @@ static bool __collapse_huge_page_copy(pte_t *pte, struct page *page,
#endif
src_page = pte_page(pteval);
#ifdef CONFIG_AUTONUMA
+ src_page_an = lookup_page_autonuma(src_page);
/* pick the last one, better than nothing */
autonuma_last_nid =
- ACCESS_ONCE(src_page->autonuma_last_nid);
+ ACCESS_ONCE(src_page_an->autonuma_last_nid);
if (autonuma_last_nid >= 0)
- ACCESS_ONCE(page->autonuma_last_nid) =
+ ACCESS_ONCE(page_an->autonuma_last_nid) =
autonuma_last_nid;
#endif
copy_user_highpage(page, src_page, address, vma);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 3c354d4..b8c13ff 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -59,6 +59,7 @@
#include <linux/prefetch.h>
#include <linux/page-debug-flags.h>
#include <linux/autonuma.h>
+#include <linux/page_autonuma.h>
#include <asm/tlbflush.h>
#include <asm/div64.h>
@@ -614,10 +615,7 @@ static inline int free_pages_check(struct page *page)
bad_page(page);
return 1;
}
- autonuma_migrate_page_remove(page);
-#ifdef CONFIG_AUTONUMA
- ACCESS_ONCE(page->autonuma_last_nid) = -1;
-#endif
+ autonuma_free_page(page);
if (page->flags & PAGE_FLAGS_CHECK_AT_PREP)
page->flags &= ~PAGE_FLAGS_CHECK_AT_PREP;
return 0;
@@ -3673,10 +3671,6 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
set_pageblock_migratetype(page, MIGRATE_MOVABLE);
INIT_LIST_HEAD(&page->lru);
-#ifdef CONFIG_AUTONUMA
- page->autonuma_last_nid = -1;
- page->autonuma_migrate_nid = -1;
-#endif
#ifdef WANT_PAGE_VIRTUAL
/* The shift won't overflow because ZONE_NORMAL is below 4G. */
if (!is_highmem_idx(zone))
@@ -4304,23 +4298,14 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat,
int nid = pgdat->node_id;
unsigned long zone_start_pfn = pgdat->node_start_pfn;
int ret;
-#ifdef CONFIG_AUTONUMA
- int node_iter;
-#endif
pgdat_resize_init(pgdat);
-#ifdef CONFIG_AUTONUMA
- spin_lock_init(&pgdat->autonuma_lock);
- init_waitqueue_head(&pgdat->autonuma_knuma_migrated_wait);
- pgdat->autonuma_nr_migrate_pages = 0;
- for_each_node(node_iter)
- INIT_LIST_HEAD(&pgdat->autonuma_migrate_head[node_iter]);
-#endif
pgdat->nr_zones = 0;
init_waitqueue_head(&pgdat->kswapd_wait);
pgdat->kswapd_max_order = 0;
pgdat_page_cgroup_init(pgdat);
-
+ pgdat_autonuma_init(pgdat);
+
for (j = 0; j < MAX_NR_ZONES; j++) {
struct zone *zone = pgdat->node_zones + j;
unsigned long size, realsize, memmap_pages;
diff --git a/mm/page_autonuma.c b/mm/page_autonuma.c
new file mode 100644
index 0000000..131b5c9
--- /dev/null
+++ b/mm/page_autonuma.c
@@ -0,0 +1,234 @@
+#include <linux/mm.h>
+#include <linux/memory.h>
+#include <linux/autonuma_flags.h>
+#include <linux/page_autonuma.h>
+#include <linux/bootmem.h>
+
+void __meminit page_autonuma_map_init(struct page *page,
+ struct page_autonuma *page_autonuma,
+ int nr_pages)
+{
+ struct page *end;
+ for (end = page + nr_pages; page < end; page++, page_autonuma++) {
+ page_autonuma->autonuma_last_nid = -1;
+ page_autonuma->autonuma_migrate_nid = -1;
+ page_autonuma->page = page;
+ }
+}
+
+static void __meminit __pgdat_autonuma_init(struct pglist_data *pgdat)
+{
+ int node_iter;
+
+ spin_lock_init(&pgdat->autonuma_lock);
+ init_waitqueue_head(&pgdat->autonuma_knuma_migrated_wait);
+ pgdat->autonuma_nr_migrate_pages = 0;
+ for_each_node(node_iter)
+ INIT_LIST_HEAD(&pgdat->autonuma_migrate_head[node_iter]);
+}
+
+#if !defined(CONFIG_SPARSEMEM)
+
+static unsigned long total_usage;
+
+void __meminit pgdat_autonuma_init(struct pglist_data *pgdat)
+{
+ __pgdat_autonuma_init(pgdat);
+ pgdat->node_page_autonuma = NULL;
+}
+
+struct page_autonuma *lookup_page_autonuma(struct page *page)
+{
+ unsigned long pfn = page_to_pfn(page);
+ unsigned long offset;
+ struct page_autonuma *base;
+
+ base = NODE_DATA(page_to_nid(page))->node_page_autonuma;
+#ifdef CONFIG_DEBUG_VM
+ /*
+ * The sanity checks the page allocator does upon freeing a
+ * page can reach here before the page_autonuma arrays are
+ * allocated when feeding a range of pages to the allocator
+ * for the first time during bootup or memory hotplug.
+ */
+ if (unlikely(!base))
+ return NULL;
+#endif
+ offset = pfn - NODE_DATA(page_to_nid(page))->node_start_pfn;
+ return base + offset;
+}
+
+static int __init alloc_node_page_autonuma(int nid)
+{
+ struct page_autonuma *base;
+ unsigned long table_size;
+ unsigned long nr_pages;
+
+ nr_pages = NODE_DATA(nid)->node_spanned_pages;
+ if (!nr_pages)
+ return 0;
+
+ table_size = sizeof(struct page_autonuma) * nr_pages;
+
+ base = __alloc_bootmem_node_nopanic(NODE_DATA(nid),
+ table_size, PAGE_SIZE, __pa(MAX_DMA_ADDRESS));
+ if (!base)
+ return -ENOMEM;
+ NODE_DATA(nid)->node_page_autonuma = base;
+ total_usage += table_size;
+ page_autonuma_map_init(NODE_DATA(nid)->node_mem_map, base, nr_pages);
+ return 0;
+}
+
+void __init page_autonuma_init_flatmem(void)
+{
+
+ int nid, fail;
+
+ if (autonuma_impossible())
+ return;
+
+ for_each_online_node(nid) {
+ fail = alloc_node_page_autonuma(nid);
+ if (fail)
+ goto fail;
+ }
+ printk(KERN_INFO "allocated %lu KBytes of page_autonuma\n",
+ total_usage >> 10);
+ printk(KERN_INFO "please try the 'noautonuma' option if you"
+ " don't want to allocate page_autonuma memory\n");
+ return;
+fail:
+ printk(KERN_CRIT "allocation of page_autonuma failed.\n");
+ printk(KERN_CRIT "please try the 'noautonuma' boot option\n");
+ panic("Out of memory");
+}
+
+#else /* CONFIG_SPARSEMEM */
+
+struct page_autonuma *lookup_page_autonuma(struct page *page)
+{
+ unsigned long pfn = page_to_pfn(page);
+ struct mem_section *section = __pfn_to_section(pfn);
+
+ /* if it's not a power of two we may be wasting memory */
+ BUILD_BUG_ON(SECTION_PAGE_AUTONUMA_SIZE &
+ (SECTION_PAGE_AUTONUMA_SIZE-1));
+
+#ifdef CONFIG_DEBUG_VM
+ /*
+ * The sanity checks the page allocator does upon freeing a
+ * page can reach here before the page_autonuma arrays are
+ * allocated when feeding a range of pages to the allocator
+ * for the first time during bootup or memory hotplug.
+ */
+ if (!section->section_page_autonuma)
+ return NULL;
+#endif
+ return section->section_page_autonuma + pfn;
+}
+
+void __meminit pgdat_autonuma_init(struct pglist_data *pgdat)
+{
+ __pgdat_autonuma_init(pgdat);
+}
+
+struct page_autonuma * __meminit __kmalloc_section_page_autonuma(int nid,
+ unsigned long nr_pages)
+{
+ struct page_autonuma *ret;
+ struct page *page;
+ unsigned long memmap_size = PAGE_AUTONUMA_SIZE * nr_pages;
+
+ page = alloc_pages_node(nid, GFP_KERNEL|__GFP_NOWARN,
+ get_order(memmap_size));
+ if (page)
+ goto got_map_page_autonuma;
+
+ ret = vmalloc(memmap_size);
+ if (ret)
+ goto out;
+
+ return NULL;
+got_map_page_autonuma:
+ ret = (struct page_autonuma *)pfn_to_kaddr(page_to_pfn(page));
+out:
+ return ret;
+}
+
+void __meminit __kfree_section_page_autonuma(struct page_autonuma *page_autonuma,
+ unsigned long nr_pages)
+{
+ if (is_vmalloc_addr(page_autonuma))
+ vfree(page_autonuma);
+ else
+ free_pages((unsigned long)page_autonuma,
+ get_order(PAGE_AUTONUMA_SIZE * nr_pages));
+}
+
+static struct page_autonuma __init *sparse_page_autonuma_map_populate(unsigned long pnum,
+ int nid)
+{
+ struct page_autonuma *map;
+ unsigned long size;
+
+ map = alloc_remap(nid, SECTION_PAGE_AUTONUMA_SIZE);
+ if (map)
+ return map;
+
+ size = PAGE_ALIGN(SECTION_PAGE_AUTONUMA_SIZE);
+ map = __alloc_bootmem_node_high(NODE_DATA(nid), size,
+ PAGE_SIZE, __pa(MAX_DMA_ADDRESS));
+ return map;
+}
+
+void __init sparse_early_page_autonuma_alloc_node(struct page_autonuma **page_autonuma_map,
+ unsigned long pnum_begin,
+ unsigned long pnum_end,
+ unsigned long map_count,
+ int nodeid)
+{
+ void *map;
+ unsigned long pnum;
+ unsigned long size = SECTION_PAGE_AUTONUMA_SIZE;
+
+ map = alloc_remap(nodeid, size * map_count);
+ if (map) {
+ for (pnum = pnum_begin; pnum < pnum_end; pnum++) {
+ if (!present_section_nr(pnum))
+ continue;
+ page_autonuma_map[pnum] = map;
+ map += size;
+ }
+ return;
+ }
+
+ size = PAGE_ALIGN(size);
+ map = __alloc_bootmem_node_high(NODE_DATA(nodeid), size * map_count,
+ PAGE_SIZE, __pa(MAX_DMA_ADDRESS));
+ if (map) {
+ for (pnum = pnum_begin; pnum < pnum_end; pnum++) {
+ if (!present_section_nr(pnum))
+ continue;
+ page_autonuma_map[pnum] = map;
+ map += size;
+ }
+ return;
+ }
+
+ /* fallback */
+ for (pnum = pnum_begin; pnum < pnum_end; pnum++) {
+ struct mem_section *ms;
+
+ if (!present_section_nr(pnum))
+ continue;
+ page_autonuma_map[pnum] = sparse_page_autonuma_map_populate(pnum, nodeid);
+ if (page_autonuma_map[pnum])
+ continue;
+ ms = __nr_to_section(pnum);
+ printk(KERN_ERR "%s: sparsemem page_autonuma map backing failed "
+ "some memory will not be available.\n", __func__);
+ }
+}
+
+#endif
diff --git a/mm/sparse.c b/mm/sparse.c
index a8bc7d3..e20d891 100644
--- a/mm/sparse.c
+++ b/mm/sparse.c
@@ -9,6 +9,7 @@
#include <linux/export.h>
#include <linux/spinlock.h>
#include <linux/vmalloc.h>
+#include <linux/page_autonuma.h>
#include "internal.h"
#include <asm/dma.h>
#include <asm/pgalloc.h>
@@ -242,7 +243,8 @@ struct page *sparse_decode_mem_map(unsigned long coded_mem_map, unsigned long pn
static int __meminit sparse_init_one_section(struct mem_section *ms,
unsigned long pnum, struct page *mem_map,
- unsigned long *pageblock_bitmap)
+ unsigned long *pageblock_bitmap,
+ struct page_autonuma *page_autonuma)
{
if (!present_section(ms))
return -EINVAL;
@@ -251,6 +253,14 @@ static int __meminit sparse_init_one_section(struct mem_section *ms,
ms->section_mem_map |= sparse_encode_mem_map(mem_map, pnum) |
SECTION_HAS_MEM_MAP;
ms->pageblock_flags = pageblock_bitmap;
+#ifdef CONFIG_AUTONUMA
+ if (page_autonuma) {
+ ms->section_page_autonuma = page_autonuma - section_nr_to_pfn(pnum);
+ page_autonuma_map_init(mem_map, page_autonuma, PAGES_PER_SECTION);
+ }
+#else
+ BUG_ON(page_autonuma);
+#endif
return 1;
}
@@ -485,6 +495,9 @@ void __init sparse_init(void)
int size2;
struct page **map_map;
#endif
+ struct page_autonuma **uninitialized_var(page_autonuma_map);
+ struct page_autonuma *page_autonuma;
+ int size3;
/*
* map is using big page (aka 2M in x86 64 bit)
@@ -579,6 +592,62 @@ void __init sparse_init(void)
map_count, nodeid_begin);
#endif
+ if (!autonuma_impossible()) {
+ unsigned long total_page_autonuma;
+ unsigned long page_autonuma_count;
+
+ size3 = sizeof(struct page_autonuma *) * NR_MEM_SECTIONS;
+ page_autonuma_map = alloc_bootmem(size3);
+ if (!page_autonuma_map)
+ panic("can not allocate page_autonuma_map\n");
+
+ for (pnum = 0; pnum < NR_MEM_SECTIONS; pnum++) {
+ struct mem_section *ms;
+
+ if (!present_section_nr(pnum))
+ continue;
+ ms = __nr_to_section(pnum);
+ nodeid_begin = sparse_early_nid(ms);
+ pnum_begin = pnum;
+ break;
+ }
+ total_page_autonuma = 0;
+ page_autonuma_count = 1;
+ for (pnum = pnum_begin + 1; pnum < NR_MEM_SECTIONS; pnum++) {
+ struct mem_section *ms;
+ int nodeid;
+
+ if (!present_section_nr(pnum))
+ continue;
+ ms = __nr_to_section(pnum);
+ nodeid = sparse_early_nid(ms);
+ if (nodeid == nodeid_begin) {
+ page_autonuma_count++;
+ continue;
+ }
+ /* ok, we need to take care of pnum_begin to pnum - 1 */
+ sparse_early_page_autonuma_alloc_node(page_autonuma_map,
+ pnum_begin,
+ NR_MEM_SECTIONS,
+ page_autonuma_count,
+ nodeid_begin);
+ total_page_autonuma += SECTION_PAGE_AUTONUMA_SIZE * page_autonuma_count;
+ /* new start, update count etc*/
+ nodeid_begin = nodeid;
+ pnum_begin = pnum;
+ page_autonuma_count = 1;
+ }
+ /* ok, last chunk */
+ sparse_early_page_autonuma_alloc_node(page_autonuma_map, pnum_begin,
+ NR_MEM_SECTIONS,
+ page_autonuma_count, nodeid_begin);
+ total_page_autonuma += SECTION_PAGE_AUTONUMA_SIZE * page_autonuma_count;
+ printk("allocated %lu KBytes of page_autonuma\n",
+ total_page_autonuma >> 10);
+ printk(KERN_INFO "please try the 'noautonuma' option if you"
+ " don't want to allocate page_autonuma memory\n");
+ }
+
for (pnum = 0; pnum < NR_MEM_SECTIONS; pnum++) {
if (!present_section_nr(pnum))
continue;
@@ -587,6 +656,14 @@ void __init sparse_init(void)
if (!usemap)
continue;
+ if (autonuma_impossible())
+ page_autonuma = NULL;
+ else {
+ page_autonuma = page_autonuma_map[pnum];
+ if (!page_autonuma)
+ continue;
+ }
+
#ifdef CONFIG_SPARSEMEM_ALLOC_MEM_MAP_TOGETHER
map = map_map[pnum];
#else
@@ -596,11 +673,13 @@ void __init sparse_init(void)
continue;
sparse_init_one_section(__nr_to_section(pnum), pnum, map,
- usemap);
+ usemap, page_autonuma);
}
vmemmap_populate_print_last();
+ if (!autonuma_impossible())
+ free_bootmem(__pa(page_autonuma_map), size3);
#ifdef CONFIG_SPARSEMEM_ALLOC_MEM_MAP_TOGETHER
free_bootmem(__pa(map_map), size2);
#endif
@@ -687,7 +766,8 @@ static void free_map_bootmem(struct page *page, unsigned long nr_pages)
}
#endif /* CONFIG_SPARSEMEM_VMEMMAP */
-static void free_section_usemap(struct page *memmap, unsigned long *usemap)
+static void free_section_usemap(struct page *memmap, unsigned long *usemap,
+ struct page_autonuma *page_autonuma)
{
struct page *usemap_page;
unsigned long nr_pages;
@@ -701,8 +781,14 @@ static void free_section_usemap(struct page *memmap, unsigned long *usemap)
*/
if (PageSlab(usemap_page)) {
kfree(usemap);
- if (memmap)
+ if (memmap) {
__kfree_section_memmap(memmap, PAGES_PER_SECTION);
+ if (!autonuma_impossible())
+ __kfree_section_page_autonuma(page_autonuma,
+ PAGES_PER_SECTION);
+ else
+ BUG_ON(page_autonuma);
+ }
return;
}
@@ -719,6 +805,13 @@ static void free_section_usemap(struct page *memmap, unsigned long *usemap)
>> PAGE_SHIFT;
free_map_bootmem(memmap_page, nr_pages);
+
+ if (!autonuma_impossible()) {
+ struct page *page_autonuma_page;
+ page_autonuma_page = virt_to_page(page_autonuma);
+ free_map_bootmem(page_autonuma_page, nr_pages);
+ } else
+ BUG_ON(page_autonuma);
}
}
@@ -734,6 +827,7 @@ int __meminit sparse_add_one_section(struct zone *zone, unsigned long start_pfn,
struct pglist_data *pgdat = zone->zone_pgdat;
struct mem_section *ms;
struct page *memmap;
+ struct page_autonuma *page_autonuma;
unsigned long *usemap;
unsigned long flags;
int ret;
@@ -753,6 +847,16 @@ int __meminit sparse_add_one_section(struct zone *zone, unsigned long start_pfn,
__kfree_section_memmap(memmap, nr_pages);
return -ENOMEM;
}
+ if (!autonuma_impossible()) {
+ page_autonuma = __kmalloc_section_page_autonuma(pgdat->node_id,
+ nr_pages);
+ if (!page_autonuma) {
+ kfree(usemap);
+ __kfree_section_memmap(memmap, nr_pages);
+ return -ENOMEM;
+ }
+ } else
+ page_autonuma = NULL;
pgdat_resize_lock(pgdat, &flags);
@@ -764,11 +868,16 @@ int __meminit sparse_add_one_section(struct zone *zone, unsigned long start_pfn,
ms->section_mem_map |= SECTION_MARKED_PRESENT;
- ret = sparse_init_one_section(ms, section_nr, memmap, usemap);
+ ret = sparse_init_one_section(ms, section_nr, memmap, usemap,
+ page_autonuma);
out:
pgdat_resize_unlock(pgdat, &flags);
if (ret <= 0) {
+ if (!autonuma_impossible())
+ __kfree_section_page_autonuma(page_autonuma, nr_pages);
+ else
+ BUG_ON(page_autonuma);
kfree(usemap);
__kfree_section_memmap(memmap, nr_pages);
}
@@ -779,6 +888,7 @@ void sparse_remove_one_section(struct zone *zone, struct mem_section *ms)
{
struct page *memmap = NULL;
unsigned long *usemap = NULL;
+ struct page_autonuma *page_autonuma = NULL;
if (ms->section_mem_map) {
usemap = ms->pageblock_flags;
@@ -786,8 +896,12 @@ void sparse_remove_one_section(struct zone *zone, struct mem_section *ms)
__section_nr(ms));
ms->section_mem_map = 0;
ms->pageblock_flags = NULL;
+
+#ifdef CONFIG_AUTONUMA
+ page_autonuma = ms->section_page_autonuma;
+#endif
}
- free_section_usemap(memmap, usemap);
+ free_section_usemap(memmap, usemap, page_autonuma);
}
#endif
This is where the mm_autonuma structure is being handled. Just like
sched_autonuma, this is only allocated at runtime if the hardware the
kernel is running on has been detected as NUMA. On non-NUMA hardware
the memory cost is reduced to one pointer per mm.
To get rid of even that pointer in each mm, the kernel can be compiled
with CONFIG_AUTONUMA=n.
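Roughly, the two helpers called from the fork path below do something
like this (only a simplified sketch, not the actual code from the
autonuma branch; the slab cache name mm_autonuma_cachep and the exact
initialization are assumptions here):

static inline int alloc_mm_autonuma(struct mm_struct *mm)
{
	/* not-NUMA (or noautonuma): leave the pointer NULL, no allocation */
	if (autonuma_impossible()) {
		mm->mm_autonuma = NULL;
		return 0;
	}
	mm->mm_autonuma = kmem_cache_zalloc(mm_autonuma_cachep, GFP_KERNEL);
	if (!mm->mm_autonuma)
		return -ENOMEM;
	/* back pointer used by knuma_scand when walking the mm list */
	mm->mm_autonuma->mm = mm;
	return 0;
}

static inline void free_mm_autonuma(struct mm_struct *mm)
{
	if (mm->mm_autonuma) {
		kmem_cache_free(mm_autonuma_cachep, mm->mm_autonuma);
		mm->mm_autonuma = NULL;
	}
}

With CONFIG_AUTONUMA=n both helpers compile away to empty stubs and the
mm_autonuma pointer itself disappears from the mm.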
Signed-off-by: Andrea Arcangeli <[email protected]>
---
kernel/fork.c | 7 +++++++
1 files changed, 7 insertions(+), 0 deletions(-)
diff --git a/kernel/fork.c b/kernel/fork.c
index d323eb1..22f102e 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -526,6 +526,8 @@ static void mm_init_aio(struct mm_struct *mm)
static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p)
{
+ if (unlikely(alloc_mm_autonuma(mm)))
+ goto out_free_mm;
atomic_set(&mm->mm_users, 1);
atomic_set(&mm->mm_count, 1);
init_rwsem(&mm->mmap_sem);
@@ -548,6 +550,8 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p)
return mm;
}
+ free_mm_autonuma(mm);
+out_free_mm:
free_mm(mm);
return NULL;
}
@@ -597,6 +601,7 @@ void __mmdrop(struct mm_struct *mm)
destroy_context(mm);
mmu_notifier_mm_destroy(mm);
check_mm(mm);
+ free_mm_autonuma(mm);
free_mm(mm);
}
EXPORT_SYMBOL_GPL(__mmdrop);
@@ -880,6 +885,7 @@ fail_nocontext:
* If init_new_context() failed, we cannot use mmput() to free the mm
* because it calls destroy_context()
*/
+ free_mm_autonuma(mm);
mm_free_pgd(mm);
free_mm(mm);
return NULL;
@@ -1706,6 +1712,7 @@ void __init proc_caches_init(void)
mm_cachep = kmem_cache_create("mm_struct",
sizeof(struct mm_struct), ARCH_MIN_MMSTRUCT_ALIGN,
SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_NOTRACK, NULL);
+ mm_autonuma_init();
vm_area_cachep = KMEM_CACHE(vm_area_struct, SLAB_PANIC);
mmap_init();
nsproxy_cache_init();
On 64bit archs, 20 bytes are used for async memory migration (specific
to the knuma_migrated per-node threads), and 4 bytes are used for the
thread NUMA false sharing detection logic.
This is a bad implementation, due to lack of time to do a proper one.
These new AutoNUMA fields must be moved to the pgdat, like memcg does,
so that they're only allocated at boot time if the kernel is booted on
NUMA hardware, and so that they're not allocated at all, even on NUMA
hardware, if "noautonuma" is passed as a boot parameter to the kernel.
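To make the cost concrete, a tiny standalone calculation (assuming 4k
base pages and the 64bit field sizes listed above):

#include <stdio.h>

int main(void)
{
	/* 64bit: 16 (list_head) + 4 (autonuma_migrate_nid) = 20 bytes for
	 * async migration, plus 4 (autonuma_last_nid) for the false sharing
	 * detection = 24 bytes added to every struct page */
	const double per_page = 24.0;
	const double page_size = 4096.0;

	printf("%.2f%% of RAM\n", 100.0 * per_page / page_size);
	/* prints 0.59%, i.e. roughly 144MB on a 24GB machine */
	return 0;
}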
Signed-off-by: Andrea Arcangeli <[email protected]>
---
include/linux/mm_types.h | 25 +++++++++++++++++++++++++
1 files changed, 25 insertions(+), 0 deletions(-)
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 780ded7..e8dc82c 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -126,6 +126,31 @@ struct page {
struct page *first_page; /* Compound tail pages */
};
+#ifdef CONFIG_AUTONUMA
+ /*
+ * FIXME: move to pgdat section along with the memcg and allocate
+ * at runtime only in presence of a numa system.
+ */
+ /*
+ * To modify autonuma_last_nid lockless the architecture,
+ * needs SMP atomic granularity < sizeof(long), not all archs
+ * have that, notably some alpha. Archs without that requires
+ * autonuma_last_nid to be a long.
+ */
+#if BITS_PER_LONG > 32
+ int autonuma_migrate_nid;
+ int autonuma_last_nid;
+#else
+#if MAX_NUMNODES >= 32768
+#error "too many nodes"
+#endif
+ /* FIXME: remember to check the updates are atomic */
+ short autonuma_migrate_nid;
+ short autonuma_last_nid;
+#endif
+ struct list_head autonuma_migrate_node;
+#endif
+
/*
* On machines where all RAM is mapped into kernel address space,
* we can simply calculate the virtual address. On machines with
This is where the dynamically allocated sched_autonuma structure is
being handled.
The reason for keeping this outside of the task_struct, besides not
using too much kernel stack, is to only allocate it on NUMA
hardware. So non-NUMA hardware only pays the memory of a pointer on
the kernel stack (which remains NULL at all times in that case).
If the kernel is compiled with CONFIG_AUTONUMA=n, not even the pointer
is allocated on the kernel stack, of course.
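For reference, the allocation side called from dup_task_struct() below
looks roughly like this (a sketch only, not the exact code; the field
name tsk->sched_autonuma, the cache name and what gets inherited from
the parent are assumptions):

static inline int alloc_sched_autonuma(struct task_struct *tsk,
				       struct task_struct *orig, int node)
{
	/* not-NUMA (or noautonuma): only the NULL pointer is paid for */
	if (autonuma_impossible()) {
		tsk->sched_autonuma = NULL;
		return 0;
	}
	tsk->sched_autonuma = kmem_cache_alloc_node(sched_autonuma_cachep,
						    GFP_KERNEL, node);
	if (!tsk->sched_autonuma)
		return -ENOMEM;
	/* start from a clean slate; the real code decides what, if
	 * anything, to carry over from orig's NUMA statistics */
	memset(tsk->sched_autonuma, 0, sizeof(*tsk->sched_autonuma));
	return 0;
}

static inline void free_sched_autonuma(struct task_struct *tsk)
{
	if (tsk->sched_autonuma)
		kmem_cache_free(sched_autonuma_cachep, tsk->sched_autonuma);
}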
Signed-off-by: Andrea Arcangeli <[email protected]>
---
kernel/fork.c | 24 ++++++++++++++----------
1 files changed, 14 insertions(+), 10 deletions(-)
diff --git a/kernel/fork.c b/kernel/fork.c
index 237c34e..d323eb1 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -206,6 +206,7 @@ static void account_kernel_stack(struct thread_info *ti, int account)
void free_task(struct task_struct *tsk)
{
account_kernel_stack(tsk->stack, -1);
+ free_sched_autonuma(tsk);
free_thread_info(tsk->stack);
rt_mutex_debug_task_free(tsk);
ftrace_graph_exit_task(tsk);
@@ -260,6 +261,8 @@ void __init fork_init(unsigned long mempages)
/* do the arch specific task caches init */
arch_task_cache_init();
+ sched_autonuma_init();
+
/*
* The default maximum number of threads is set to a safe
* value: the thread structures can take up at most half
@@ -292,21 +295,21 @@ static struct task_struct *dup_task_struct(struct task_struct *orig)
struct thread_info *ti;
unsigned long *stackend;
int node = tsk_fork_get_node(orig);
- int err;
tsk = alloc_task_struct_node(node);
- if (!tsk)
+ if (unlikely(!tsk))
return NULL;
ti = alloc_thread_info_node(tsk, node);
- if (!ti) {
- free_task_struct(tsk);
- return NULL;
- }
+ if (unlikely(!ti))
+ goto out_task_struct;
- err = arch_dup_task_struct(tsk, orig);
- if (err)
- goto out;
+ if (unlikely(arch_dup_task_struct(tsk, orig)))
+ goto out_thread_info;
+
+ if (unlikely(alloc_sched_autonuma(tsk, orig, node)))
+ /* free_thread_info() undoes arch_dup_task_struct() too */
+ goto out_thread_info;
tsk->stack = ti;
@@ -334,8 +337,9 @@ static struct task_struct *dup_task_struct(struct task_struct *orig)
return tsk;
-out:
+out_thread_info:
free_thread_info(ti);
+out_task_struct:
free_task_struct(tsk);
return NULL;
}
We will set these bitflags only when the pmd or pte is not present.
They work like PROT_NONE, but they identify a request for the NUMA
hinting page fault to trigger.
Because we want to be able to set these bitflags in any established pte
or pmd (while clearing the present bit at the same time) without
losing information, these bitflags must never be set when the pte or
pmd is present.
For _PAGE_NUMA_PTE the pte bitflag used is _PAGE_PSE, which cannot be
set on ptes; it also fits in between _PAGE_FILE and _PAGE_PROTNONE,
which avoids having to alter the swp entry format.
For _PAGE_NUMA_PMD, we use a reserved bitflag. pmds never contain swap
entries, but if in the future we swap out transparent hugepages, we
must keep in mind not to use the _PAGE_UNUSED2 bitflag in the swap
entry format and to start the swap entry offset above it.
_PAGE_UNUSED2 is used by Xen, but only on ptes established by ioremap;
it's never used on pmds, so there's no risk of collision with Xen.
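For illustration, the pte_numa()/pmd_numa() patch later in the series
tests these bits roughly as follows (sketch reconstructed from the
description above, not the exact hunk): a NUMA hinting entry is simply
"NUMA bit set, present bit clear".

static inline int pte_numa(pte_t pte)
{
	/* NUMA hinting pte: _PAGE_NUMA_PTE set while _PAGE_PRESENT is clear */
	return (pte_flags(pte) & (_PAGE_NUMA_PTE | _PAGE_PRESENT)) ==
		_PAGE_NUMA_PTE;
}

static inline int pmd_numa(pmd_t pmd)
{
	/* same test for the pmd, using the reserved _PAGE_NUMA_PMD bit */
	return (pmd_flags(pmd) & (_PAGE_NUMA_PMD | _PAGE_PRESENT)) ==
		_PAGE_NUMA_PMD;
}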
Signed-off-by: Andrea Arcangeli <[email protected]>
---
arch/x86/include/asm/pgtable_types.h | 11 +++++++++++
1 files changed, 11 insertions(+), 0 deletions(-)
diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
index b74cac9..6e2d954 100644
--- a/arch/x86/include/asm/pgtable_types.h
+++ b/arch/x86/include/asm/pgtable_types.h
@@ -71,6 +71,17 @@
#define _PAGE_FILE (_AT(pteval_t, 1) << _PAGE_BIT_FILE)
#define _PAGE_PROTNONE (_AT(pteval_t, 1) << _PAGE_BIT_PROTNONE)
+/*
+ * Cannot be set on pte. The fact it's in between _PAGE_FILE and
+ * _PAGE_PROTNONE avoids having to alter the swp entries.
+ */
+#define _PAGE_NUMA_PTE _PAGE_PSE
+/*
+ * Cannot be set on pmd, if transparent hugepages will be swapped out
+ * the swap entry offset must start above it.
+ */
+#define _PAGE_NUMA_PMD _PAGE_UNUSED2
+
#define _PAGE_TABLE (_PAGE_PRESENT | _PAGE_RW | _PAGE_USER | \
_PAGE_ACCESSED | _PAGE_DIRTY)
#define _KERNPG_TABLE (_PAGE_PRESENT | _PAGE_RW | _PAGE_ACCESSED | \
Fix to avoid -1 retval.
Includes fixes from Hillf Danton <[email protected]>.
Signed-off-by: Andrea Arcangeli <[email protected]>
---
kernel/sched/fair.c | 4 ++++
1 files changed, 4 insertions(+), 0 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 940e6d1..137119f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2789,6 +2789,9 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags)
if (new_cpu == -1 || new_cpu == cpu) {
/* Now try balancing at a lower domain level of cpu */
sd = sd->child;
+ if (new_cpu < 0)
+ /* Return prev_cpu if find_idlest_cpu failed */
+ new_cpu = prev_cpu;
continue;
}
@@ -2807,6 +2810,7 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags)
unlock:
rcu_read_unlock();
+ BUG_ON(new_cpu < 0);
return new_cpu;
}
#endif /* CONFIG_SMP */
On Fri, May 25, 2012 at 07:02:07PM +0200, Andrea Arcangeli wrote:
> Xen has taken over the last reserved bit available for the pagetables
> which is set through ioremap, this documents it and makes the code
> more readable.
Andrea, my previous response had a question about this - was wondering
if you had a chance to look at that in your busy schedule and provide
some advice on how to remove the _PAGE_IOMAP altogether?
Hi Konrad,
On Fri, May 25, 2012 at 04:26:56PM -0400, Konrad Rzeszutek Wilk wrote:
> On Fri, May 25, 2012 at 07:02:07PM +0200, Andrea Arcangeli wrote:
> > Xen has taken over the last reserved bit available for the pagetables
> > which is set through ioremap, this documents it and makes the code
> > more readable.
>
> Andrea, my previous respone had a question about this - was wondering
> if you had a chance to look at that in your busy schedule and provide
> some advice on how to remove the _PAGE_IOMAP altogether?
I read your response but I didn't look into the P2M tree and
xen_val_pte code yet, sorry. Thanks for looking into this; if it's
possible to remove it without downsides, it would be a nice
cleanup. It's not urgent though, we're not running out of reserved
pte bits yet :).
On 05/25/2012 01:02 PM, Andrea Arcangeli wrote:
> I believe (realistically speaking) nobody is going to change
> applications to specify which thread is using which memory (for
> threaded apps) with the only exception of QEMU and a few others.
This is the point of contention. I believe that for some
programs these kinds of modifications might happen, but
that for some other programs - managed runtimes like JVMs -
it is fundamentally impossible to do proper NUMA hinting,
because the programming languages that run on top of the
runtimes have no concept of pointers or memory ranges, making
it impossible to do those kinds of hints without fundamentally
changing the programming languages in question.
It would be good to get everybody's ideas out there on this
topic, because this is the fundamental factor in deciding
between Peter's approach to NUMA and Andrea's approach.
Ingo? Andrew? Linus? Paul?
> For not threaded apps that fits in a NUMA node, there's no way a blind
> home node can perform nearly as good as AutoNUMA:
The small tasks are easy. I suspect that either implementation
can be tuned to produce good results there.
It is the large programs (that do not fit in a NUMA node, either
due to too much memory, or due to too many threads) that will
force our hand in deciding whether to go with Peter's approach
or your approach.
I believe we do need to carefully think about this issue, decide
on a NUMA approach based on the fundamental technical properties of
each approach.
After we figure out what we want to do, we can nit-pick on the
codebase in question, and make sure it gets completely fixed.
I am sure neither codebase is perfect right now, but both are
entirely fixable.
--
All rights reversed
On Sat, May 26, 2012 at 10:28 AM, Rik van Riel <[email protected]> wrote:
>
> It would be good to get everybody's ideas out there on this
> topic, because this is the fundamental factor in deciding
> between Peter's approach to NUMA and Andrea's approach.
>
> Ingo? Andrew? Linus? Paul?
I'm a *firm* believer that if it cannot be done automatically "well
enough", the absolute last thing we should ever do is worry about the
crazy people who think they can tweak it to perfection with complex
interfaces.
You can't do it, except for trivial loads (often benchmarks), and for
very specific machines.
So I think very strongly that we should entirely dismiss all the
people who want to do manual placement and claim that they know what
their loads do. They're either full of sh*t (most likely), or they
have a very specific benchmark and platform that they are tuning for
that is totally irrelevant to everybody else.
What we *should* try to aim for is a system that doesn't do horribly
badly right out of the box. IOW, no tuning what-so-ever (at most a
kind of "yes, I want you to try to do the NUMA thing" flag to just
enable it at all), and try to not suck.
Seriously. "Try to avoid sucking" is *way* superior to "We can let the
user tweak things to their hearts content". Because users won't get it
right.
Give the anal people a knob they can tweak, and tell them it does
something fancy. And never actually wire the damn thing up. They'll be
really happy with their OCD tweaking, and do lots of nice graphs that
just show how the error bars are so big that you can find any damn
pattern you want in random noise.
Linus
On Fri, May 25, 2012 at 07:02:27PM +0200, Andrea Arcangeli wrote:
> +static int knumad_do_scan(void)
> +{
...
> + if (knumad_test_exit(mm) || !vma) {
> + mm_autonuma = mm->mm_autonuma;
> + if (mm_autonuma->mm_node.next != &knumad_scan.mm_head) {
> + mm_autonuma = list_entry(mm_autonuma->mm_node.next,
> + struct mm_autonuma, mm_node);
> + knumad_scan.mm = mm_autonuma->mm;
> + atomic_inc(&knumad_scan.mm->mm_count);
> + knumad_scan.address = 0;
> + knumad_scan.mm->mm_autonuma->numa_fault_pass++;
> + } else
> + knumad_scan.mm = NULL;
knumad_scan.mm should be NULLed only after the list_del, otherwise you
will race with autonuma_exit() (see the ordering sketch at the end of
this mail):
[ 22.905208] ------------[ cut here ]------------
[ 23.003620] WARNING: at /home/kas/git/public/linux/lib/list_debug.c:50 __list_del_entry+0x63/0xd0()
[ 23.003621] Hardware name: QSSC-S4R
[ 23.003624] list_del corruption, ffff880771a49300->next is LIST_POISON1 (dead000000100100)
[ 23.003626] Modules linked in: megaraid_sas
[ 23.003629] Pid: 569, comm: udevd Not tainted 3.4.0+ #31
[ 23.003630] Call Trace:
[ 23.003640] [<ffffffff8105956f>] warn_slowpath_common+0x7f/0xc0
[ 23.003643] [<ffffffff81059666>] warn_slowpath_fmt+0x46/0x50
[ 23.003645] [<ffffffff813202e3>] __list_del_entry+0x63/0xd0
[ 23.003648] [<ffffffff81320361>] list_del+0x11/0x40
[ 23.003654] [<ffffffff8117b80f>] autonuma_exit+0x5f/0xb0
[ 23.003657] [<ffffffff810567ab>] mmput+0x7b/0x120
[ 23.003663] [<ffffffff8105e7d8>] exit_mm+0x108/0x130
[ 23.003674] [<ffffffff8165da5b>] ? _raw_spin_unlock_irq+0x2b/0x40
[ 23.003677] [<ffffffff8105e94a>] do_exit+0x14a/0x8d0
[ 23.003682] [<ffffffff811b71c6>] ? mntput+0x26/0x40
[ 23.003688] [<ffffffff8119a599>] ? fput+0x1c9/0x270
[ 23.003693] [<ffffffff81319dd9>] ? lockdep_sys_exit_thunk+0x35/0x67
[ 23.003696] [<ffffffff8105f41f>] do_group_exit+0x4f/0xc0
[ 23.003698] [<ffffffff8105f4a7>] sys_exit_group+0x17/0x20
[ 23.003703] [<ffffffff816663e9>] system_call_fastpath+0x16/0x1b
[ 23.003705] ---[ end trace 8b21c7adb0af191b ]---
> +
> + if (knumad_test_exit(mm))
> + list_del(&mm->mm_autonuma->mm_node);
> + else
> + mm_numa_fault_flush(mm);
> +
> + mmdrop(mm);
> + }
> +
> + return progress;
> +}
...
> +
> +static int knuma_scand(void *none)
> +{
...
> + mm = knumad_scan.mm;
> + knumad_scan.mm = NULL;
The same problem here.
> + if (mm)
> + list_del(&mm->mm_autonuma->mm_node);
> + mutex_unlock(&knumad_mm_mutex);
> +
> + if (mm)
> + mmdrop(mm);
> +
> + return 0;
> +}
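Something along these lines would keep knuma_scand and autonuma_exit()
from unlinking the same entry twice (only a sketch of the ordering;
taking mm_count on the next mm and bumping numa_fault_pass are omitted,
and everything is assumed to run under knumad_mm_mutex):

	struct mm_autonuma *next = NULL;

	if (mm->mm_autonuma->mm_node.next != &knumad_scan.mm_head)
		next = list_entry(mm->mm_autonuma->mm_node.next,
				  struct mm_autonuma, mm_node);

	if (knumad_test_exit(mm))
		list_del(&mm->mm_autonuma->mm_node);	/* unlink first... */
	else
		mm_numa_fault_flush(mm);

	/* ...and only then let the scanner move on; while the unlink above
	 * runs, autonuma_exit() still sees knumad_scan.mm == mm and won't
	 * try to unlink the entry itself */
	knumad_scan.mm = next ? next->mm : NULL;
	mmdrop(mm);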
--
Kirill A. Shutemov
On Fri, 2012-05-25 at 19:02 +0200, Andrea Arcangeli wrote:
> /**
> + * kthread_bind_node - bind a just-created kthread to the CPUs of a node.
> + * @p: thread created by kthread_create().
> + * @nid: node (might not be online, must be possible) for @k to run on.
> + *
> + * Description: This function is equivalent to set_cpus_allowed(),
> + * except that @nid doesn't need to be online, and the thread must be
> + * stopped (i.e., just returned from kthread_create()).
> + */
> +void kthread_bind_node(struct task_struct *p, int nid)
> +{
> + /* Must have done schedule() in kthread() before we set_task_cpu */
> + if (!wait_task_inactive(p, TASK_UNINTERRUPTIBLE)) {
> + WARN_ON(1);
> + return;
> + }
> +
> + /* It's safe because the task is inactive. */
> + do_set_cpus_allowed(p, cpumask_of_node(nid));
> + p->flags |= PF_THREAD_BOUND;
No, I've said before, this is wrong. You should only ever use
PF_THREAD_BOUND when it's strictly per-cpu. Moving your numa threads
to a different node is silly but not fatal in any way.
> +}
> +EXPORT_SYMBOL(kthread_bind_node);
On Fri, 2012-05-25 at 19:02 +0200, Andrea Arcangeli wrote:
> @@ -3274,6 +3268,8 @@ need_resched:
>
> post_schedule(rq);
>
> + sched_autonuma_balance();
> +
> sched_preempt_enable_no_resched();
> if (need_resched())
> goto need_resched;
> +void sched_autonuma_balance(void)
> +{
> + for_each_online_node(nid) {
> + }
> + for_each_online_node(nid) {
> + for_each_cpu_and(cpu, cpumask_of_node(nid), allowed) {
> + }
> + }
> + stop_one_cpu(this_cpu, migration_cpu_stop, &arg);
> +}
NAK
You do _NOT_ put a O(nr_cpus) or even O(nr_nodes) loop in the middle of
schedule().
I see you've made it conditional, but schedule() taking that long --
even occasionally -- is just not cool.
schedule() calling schedule() is also an absolute abomination.
You were told to fix this several times..
On Fri, 2012-05-25 at 19:02 +0200, Andrea Arcangeli wrote:
> + * This function is responsible for deciding which is the best CPU
> + * each process should be running on according to the NUMA
> + * affinity. To do that it evaluates all CPUs and checks if there's
> + * any remote CPU where the current process has more NUMA affinity
> + * than with the current CPU, and where the process running on the
> + * remote CPU has less NUMA affinity than the current process to run
> + * on the remote CPU. Ideally this should be expanded to take all
> + * runnable processes into account but this is a good
> + * approximation. When we compare the NUMA affinity between the
> + * current and remote CPU we use the per-thread information if the
> + * remote CPU runs a thread of the same process that the current task
> + * belongs to, or the per-process information if the remote CPU runs
> a
> + * different process than the current one. If the remote CPU runs the
> + * idle task we require both the per-thread and per-process
> + * information to have more affinity with the remote CPU than with
> the
> + * current CPU for a migration to happen.
This doesn't explain anything in the dense code that follows.
What statistics, how are they used, with what goal etc..
On Fri, 2012-05-25 at 19:02 +0200, Andrea Arcangeli wrote:
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 780ded7..e8dc82c 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -126,6 +126,31 @@ struct page {
> struct page *first_page; /* Compound tail pages */
> };
>
> +#ifdef CONFIG_AUTONUMA
> + /*
> + * FIXME: move to pgdat section along with the memcg and allocate
> + * at runtime only in presence of a numa system.
> + */
> + /*
> + * To modify autonuma_last_nid lockless the architecture,
> + * needs SMP atomic granularity < sizeof(long), not all archs
> + * have that, notably some alpha. Archs without that requires
> + * autonuma_last_nid to be a long.
> + */
> +#if BITS_PER_LONG > 32
> + int autonuma_migrate_nid;
> + int autonuma_last_nid;
> +#else
> +#if MAX_NUMNODES >= 32768
> +#error "too many nodes"
> +#endif
> + /* FIXME: remember to check the updates are atomic */
> + short autonuma_migrate_nid;
> + short autonuma_last_nid;
> +#endif
> + struct list_head autonuma_migrate_node;
> +#endif
> +
> /*
> * On machines where all RAM is mapped into kernel address space,
> * we can simply calculate the virtual address. On machines with
24 bytes per page.. or ~0.6% of memory gone. This is far too great a
price to pay.
At LSF/MM Rik already suggested you limit the number of pages that can
be migrated concurrently and use this to move the extra list_head out of
struct page and into a smaller amount of extra structures, reducing the
total overhead.
<2>[ 729.065896] kernel BUG at /home/kas/git/public/linux/mm/autonuma.c:850!
<4>[ 729.176966] invalid opcode: 0000 [#1] SMP
<4>[ 729.287517] CPU 24
<4>[ 729.397025] Modules linked in: sunrpc bnep bluetooth rfkill cpufreq_ondemand acpi_cpufreq freq_table mperf ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_filter ip6_tables nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack coretemp kvm asix usbnet igb i7core_edac crc32c_intel iTCO_wdt i2c_i801 ioatdma pcspkr tpm_tis microcode joydev mii i2c_core iTCO_vendor_support tpm edac_core dca ptp tpm_bios pps_core megaraid_sas [last unloaded: scsi_wait_scan]
<4>[ 729.870867]
<4>[ 729.989848] Pid: 342, comm: knuma_migrated0 Not tainted 3.4.0+ #32 QCI QSSC-S4R/QSSC-S4R
<4>[ 730.111497] RIP: 0010:[<ffffffff8117baf5>] [<ffffffff8117baf5>] knuma_migrated+0x915/0xa50
<4>[ 730.234615] RSP: 0018:ffff88026c8b7d40 EFLAGS: 00010006
<4>[ 730.357993] RAX: 0000000000000000 RBX: ffff88027ffea000 RCX: 0000000000000002
<4>[ 730.482959] RDX: 0000000000000002 RSI: ffffea0017b7001c RDI: ffffea0017b70000
<4>[ 730.607709] RBP: ffff88026c8b7e90 R08: 0000000000000001 R09: 0000000000000000
<4>[ 730.733082] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000002
<4>[ 730.858424] R13: 0000000000000200 R14: ffffea0017b70000 R15: ffff88067ffeae00
<4>[ 730.983686] FS: 0000000000000000(0000) GS:ffff880272200000(0000) knlGS:0000000000000000
<4>[ 731.110169] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
<4>[ 731.236396] CR2: 00007ff3463dd000 CR3: 0000000001c0b000 CR4: 00000000000007e0
<4>[ 731.363987] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
<4>[ 731.490875] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
<4>[ 731.616769] Process knuma_migrated0 (pid: 342, threadinfo ffff88026c8b6000, task ffff88026c8ac5a0)
<4>[ 731.745286] Stack:
<4>[ 731.871079] ffff88027fffdef0 ffff88026c8ac5a0 0000000000000082 ffff88026c8ac5a0
<4>[ 731.999154] ffff88026c8b7d98 0000000100000003 ffffea000f165e60 ffff88067ffeb2c0
<4>[ 732.126565] ffffea000f165e60 ffffea000f165e20 ffffffff8107df90 ffff88026c8b7d98
<4>[ 732.253488] Call Trace:
<4>[ 732.377354] [<ffffffff8107df90>] ? __init_waitqueue_head+0x60/0x60
<4>[ 732.501250] [<ffffffff8107e075>] ? finish_wait+0x45/0x90
<4>[ 732.623816] [<ffffffff8117b1e0>] ? __autonuma_migrate_page_remove+0x130/0x130
<4>[ 732.748194] [<ffffffff8107d437>] kthread+0xb7/0xc0
<4>[ 732.872468] [<ffffffff81668324>] kernel_thread_helper+0x4/0x10
<4>[ 732.997588] [<ffffffff8107d380>] ? __init_kthread_worker+0x70/0x70
<4>[ 733.120411] [<ffffffff81668320>] ? gs_change+0x13/0x13
<4>[ 733.240230] Code: 4e 00 48 8b 05 6d 05 b8 00 a8 04 0f 84 b5 f9 ff ff 48 c7 c7 b0 c9 9e 81 31 c0 e8 04 6c 4d 00 e9 a2 f9 ff ff 66 90 e8 8a 87 4d 00 <0f> 0b 48 c7 c7 d0 c9 9e 81 31 c0 e8 e8 6b 4d 00 e9 73 f9 ff ff
<1>[ 733.489612] RIP [<ffffffff8117baf5>] knuma_migrated+0x915/0xa50
<4>[ 733.614281] RSP <ffff88026c8b7d40>
<4>[ 733.736855] ---[ end trace 25052e4d75b2f1f6 ]---
--
Kirill A. Shutemov
On Fri, 2012-05-25 at 19:02 +0200, Andrea Arcangeli wrote:
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 41aa49b..8e578e6 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -666,6 +666,12 @@ typedef struct pglist_data {
> struct task_struct *kswapd;
> int kswapd_max_order;
> enum zone_type classzone_idx;
> +#ifdef CONFIG_AUTONUMA
> + spinlock_t autonuma_lock;
> + struct list_head autonuma_migrate_head[MAX_NUMNODES];
> + unsigned long autonuma_nr_migrate_pages;
> + wait_queue_head_t autonuma_knuma_migrated_wait;
> +#endif
> } pg_data_t;
>
> #define node_present_pages(nid) (NODE_DATA(nid)->node_present_pages)
O(nr_nodes^2) data.. ISTR people rewriting a certain slab allocator to
get rid of that :-)
Also, don't forget that MAX_NUMNODES is an unconditional 512 on distro
kernels, even when we only have 2.
Now the total wasted space isn't too bad since it's only 16 bytes,
totaling a whole 2M for a 256 node system. But still, something like
that wants at least a mention somewhere.
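For reference, that 2M comes from (sizes assumed, just to make the
numbers concrete):

#include <stdio.h>

int main(void)
{
	const long max_numnodes = 512;		/* distro kernel default */
	const long online_nodes = 256;
	const long list_head_bytes = 16;	/* 64bit struct list_head */

	/* one autonuma_migrate_head[] slot per possible node, in every pgdat */
	printf("%ld KB\n",
	       max_numnodes * list_head_bytes * online_nodes / 1024);
	/* -> 2048 KB */
	return 0;
}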
On 05/29/2012 09:00 AM, Peter Zijlstra wrote:
> On Fri, 2012-05-25 at 19:02 +0200, Andrea Arcangeli wrote:
>> @@ -3274,6 +3268,8 @@ need_resched:
>>
>> post_schedule(rq);
>>
>> + sched_autonuma_balance();
>> +
>> sched_preempt_enable_no_resched();
>> if (need_resched())
>> goto need_resched;
>
>
>
>> +void sched_autonuma_balance(void)
>> +{
>
>> + for_each_online_node(nid) {
>> + }
>
>> + for_each_online_node(nid) {
>> + for_each_cpu_and(cpu, cpumask_of_node(nid), allowed) {
>
>
>> + }
>> + }
>
>> + stop_one_cpu(this_cpu, migration_cpu_stop,&arg);
>> +}
>
> NAK
>
> You do _NOT_ put a O(nr_cpus) or even O(nr_nodes) loop in the middle of
> schedule().
>
> I see you've made it conditional, but schedule() taking that long --
> even occasionally -- is just not cool.
>
> schedule() calling schedule() is also an absolute abomination.
>
> You were told to fix this several times..
Do you have any suggestions for how Andrea could fix this?
Pairwise comparisons with a busy CPU/node?
--
All rights reversed
On 05/29/2012 09:16 AM, Peter Zijlstra wrote:
> On Fri, 2012-05-25 at 19:02 +0200, Andrea Arcangeli wrote:
> 24 bytes per page.. or ~0.6% of memory gone. This is far too great a
> price to pay.
>
> At LSF/MM Rik already suggested you limit the number of pages that can
> be migrated concurrently and use this to move the extra list_head out of
> struct page and into a smaller amount of extra structures, reducing the
> total overhead.
For THP, we should be able to track this NUMA info on a
2MB page granularity.
It is not like we will ever want to break up a large
page into small pages anyway (with different 4kB pages
going to different NUMA nodes), because the THP benefit
is on the same order of magnitude as the NUMA benefit.
--
All rights reversed
On Fri, 2012-05-25 at 19:02 +0200, Andrea Arcangeli wrote:
> Fix to avoid -1 retval.
>
> Includes fixes from Hillf Danton <[email protected]>.
This changelog is very much insufficient. It fails to mention why your
solution is the right one or if there's something else wrong with that
code.
> Signed-off-by: Andrea Arcangeli <[email protected]>
> ---
> kernel/sched/fair.c | 4 ++++
> 1 files changed, 4 insertions(+), 0 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 940e6d1..137119f 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -2789,6 +2789,9 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags)
> if (new_cpu == -1 || new_cpu == cpu) {
> /* Now try balancing at a lower domain level of cpu */
> sd = sd->child;
> + if (new_cpu < 0)
> + /* Return prev_cpu is find_idlest_cpu failed */
> + new_cpu = prev_cpu;
> continue;
> }
>
> @@ -2807,6 +2810,7 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags)
> unlock:
> rcu_read_unlock();
>
> + BUG_ON(new_cpu < 0);
> return new_cpu;
> }
> #endif /* CONFIG_SMP */
On Tue, 2012-05-29 at 09:56 -0400, Rik van Riel wrote:
> On 05/29/2012 09:16 AM, Peter Zijlstra wrote:
> > On Fri, 2012-05-25 at 19:02 +0200, Andrea Arcangeli wrote:
>
> > 24 bytes per page.. or ~0.6% of memory gone. This is far too great a
> > price to pay.
> >
> > At LSF/MM Rik already suggested you limit the number of pages that can
> > be migrated concurrently and use this to move the extra list_head out of
> > struct page and into a smaller amount of extra structures, reducing the
> > total overhead.
>
> For THP, we should be able to track this NUMA info on a
> 2MB page granularity.
Yeah, but that's another x86-only feature, _IF_ we're going to do this
it must be done for all archs that have CONFIG_NUMA, thus we're stuck
with 4k (or other base page size).
On Sat, May 26, 2012 at 05:59:12PM +0200, Andrea Arcangeli wrote:
> Hi Konrad,
>
> On Fri, May 25, 2012 at 04:26:56PM -0400, Konrad Rzeszutek Wilk wrote:
> > On Fri, May 25, 2012 at 07:02:07PM +0200, Andrea Arcangeli wrote:
> > > Xen has taken over the last reserved bit available for the pagetables
> > > which is set through ioremap, this documents it and makes the code
> > > more readable.
> >
> > Andrea, my previous respone had a question about this - was wondering
> > if you had a chance to look at that in your busy schedule and provide
> > some advice on how to remove the _PAGE_IOMAP altogether?
>
> I read you response but I didn't look into the P2M tree and
> xen_val_pte code yet sorry. Thanks for looking into this, if it's
> possible to remove it without downsides, it would be a nice
Yeah, I am not really thrilled about it.
> cleanup. It's not urgent though, we're not running out of reserved
> pte bits yet :).
Oh, your git comment says "the last reserved bit". Let me
look through all your patches to see how the AutoNUMA code works -
I am probably just missing something simple.
On Tue, 29 May 2012, Kirill A. Shutemov wrote:
> <4>[ 732.253488] Call Trace:
> <4>[ 732.377354] [<ffffffff8107df90>] ? __init_waitqueue_head+0x60/0x60
> <4>[ 732.501250] [<ffffffff8107e075>] ? finish_wait+0x45/0x90
> <4>[ 732.623816] [<ffffffff8117b1e0>] ? __autonuma_migrate_page_remove+0x130/0x130
> <4>[ 732.748194] [<ffffffff8107d437>] kthread+0xb7/0xc0
> <4>[ 732.872468] [<ffffffff81668324>] kernel_thread_helper+0x4/0x10
> <4>[ 732.997588] [<ffffffff8107d380>] ? __init_kthread_worker+0x70/0x70
> <4>[ 733.120411] [<ffffffff81668320>] ? gs_change+0x13/0x13
> <4>[ 733.240230] Code: 4e 00 48 8b 05 6d 05 b8 00 a8 04 0f 84 b5 f9 ff ff 48 c7 c7 b0 c9 9e 81 31 c0 e8 04 6c 4d 00 e9 a2 f9 ff ff 66 90 e8 8a 87 4d 00 <0f> 0b 48 c7 c7 d0 c9 9e 81 31 c0 e8 e8 6b 4d 00 e9 73 f9 ff ff
> <1>[ 733.489612] RIP [<ffffffff8117baf5>] knuma_migrated+0x915/0xa50
> <4>[ 733.614281] RSP <ffff88026c8b7d40>
> <4>[ 733.736855] ---[ end trace 25052e4d75b2f1f6 ]---
>
Similar problem with __autonuma_migrate_page_remove here.
[ 1945.516632] ------------[ cut here ]------------
[ 1945.516636] WARNING: at lib/list_debug.c:50 __list_del_entry+0x63/0xd0()
[ 1945.516642] Hardware name: ProLiant DL585 G5
[ 1945.516651] list_del corruption, ffff88017d68b068->next is LIST_POISON1 (dead000000100100)
[ 1945.516682] Modules linked in: ipt_MASQUERADE nf_conntrack_netbios_ns nf_conntrack_broadcast ip6table_mangle lockd ip6t_REJECT sunrpc nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_filter ip6_tables iptable_nat nf_nat iptable_mangle nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack mperf freq_table kvm_amd kvm pcspkr amd64_edac_mod edac_core serio_raw bnx2 microcode edac_mce_amd shpchp k10temp hpilo ipmi_si ipmi_msghandler hpwdt qla2xxx hpsa ata_generic pata_acpi scsi_transport_fc scsi_tgt cciss pata_amd radeon i2c_algo_bit drm_kms_helper ttm drm i2c_core [last unloaded: scsi_wait_scan]
[ 1945.516694] Pid: 150, comm: knuma_migrated0 Tainted: G W 3.4.0aa_alpha+ #3
[ 1945.516701] Call Trace:
[ 1945.516710] [<ffffffff8105788f>] warn_slowpath_common+0x7f/0xc0
[ 1945.516717] [<ffffffff81057986>] warn_slowpath_fmt+0x46/0x50
[ 1945.516726] [<ffffffff812f9713>] __list_del_entry+0x63/0xd0
[ 1945.516735] [<ffffffff812f9791>] list_del+0x11/0x40
[ 1945.516743] [<ffffffff81165b98>] __autonuma_migrate_page_remove+0x48/0x80
[ 1945.516746] [<ffffffff81165e66>] knuma_migrated+0x296/0x8a0
[ 1945.516749] [<ffffffff8107a200>] ? wake_up_bit+0x40/0x40
[ 1945.516758] [<ffffffff81165bd0>] ? __autonuma_migrate_page_remove+0x80/0x80
[ 1945.516766] [<ffffffff81079cc3>] kthread+0x93/0xa0
[ 1945.516780] [<ffffffff81626f24>] kernel_thread_helper+0x4/0x10
[ 1945.516791] [<ffffffff81079c30>] ? flush_kthread_worker+0x80/0x80
[ 1945.516798] [<ffffffff81626f20>] ? gs_change+0x13/0x13
[ 1945.516800] ---[ end trace 7cab294af87bd79f ]---
I am getting this warning continually during memory intensive operations,
e.g. AutoNUMA benchmarks from Andrea.
thanks,
Petr H
On Sat, 26 May 2012, Linus Torvalds wrote:
>
> I'm a *firm* believer that if it cannot be done automatically "well
> enough", the absolute last thing we should ever do is worry about the
> crazy people who think they can tweak it to perfection with complex
> interfaces.
>
> You can't do it, except for trivial loads (often benchmarks), and for
> very specific machines.
NUMA APIs already exist that allow tuning for the NUMA cases by allowing
the application to specify where to get memory from and where to run the
threads of a process. Those require the application to be aware of the
NUMA topology and exploit the capabilities there explicitly. Typically one
would like to reserve processors and memory for a single application that
then does the distribution of the load on its own. NUMA aware applications
like that do not benefit and do not need either of the mechanisms proposed
here.
What these automatic migration schemes (autonuma is really a bad term for
this. These are *migration* schemes where the memory is moved between NUMA
nodes automatically so call it AutoMigration if you like) try to do is to
avoid the tuning bits and automatically distribute generic process loads
in a NUMA aware fashion in order to improve performance. This is no easy
task since the cost of migrating a page is much more expensive than the
additional latency due to accessing memory from a distant node. A huge
number of accesses must occur in order to amortize the migration of a
page. Various companies in decades past have tried to implement
automigration schemes without too much success.
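A back-of-the-envelope example of what "huge" means here (all numbers
are made up for illustration and vary a lot between platforms):

#include <stdio.h>

int main(void)
{
	const double migrate_cost_ns = 10000.0;	/* ~10us to migrate one 4k page, assumed */
	const double remote_penalty_ns = 50.0;	/* extra latency per remote miss, assumed */

	printf("break-even after ~%.0f remote misses to that one page\n",
	       migrate_cost_ns / remote_penalty_ns);
	/* -> ~200 misses, and that ignores the faults and cache misses the
	 * migration itself causes, so the page had better stay hot on the
	 * new node for a long time */
	return 0;
}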
I think the proof that we need is that a general mix of applications
actually benefits from an auto migration scheme. I would also like to see
that it does no harm to existing NUMA aware applications.
Hi,
On Tue, May 29, 2012 at 10:10:49AM -0400, Konrad Rzeszutek Wilk wrote:
> Oh, your git comment says "the last reserved bit". Let me
> look through all your patches to see how the AutoNUMA code works -
> I am probably just missing something simple.
Ah, with "the last reserved bit" I didn't mean AutoNUMA is using
it. It just means there is nothing left if somebody in the future
needs it. AutoNUMA happened to need it initially, but I figured how I
was better off not using it. Initially I had to make AUTONUMA=y
mutually exclusive with XEN=y but it's not the case anymore. So at
this point the patch is only a cleanup, I could drop it too but I
thought it was cleaner to keep it.
On Fri, 2012-05-25 at 19:02 +0200, Andrea Arcangeli wrote:
> The CFS scheduler is still in charge of all scheduling
> decisions. AutoNUMA balancing at times will override those. But
> generally we'll just relay on the CFS scheduler to keep doing its
> thing, but while preferring the autonuma affine nodes when deciding
> to move a process to a different runqueue or when waking it up.
>
> For example the idle balancing will look into the runqueues of the
> busy CPUs, but it'll search first for a task that wants to run on
> the idle CPU in AutoNUMA terms (task_autonuma_cpu() being true).
>
> Most of this is encoded in the can_migrate_task becoming AutoNUMA
> aware and running two passes for each balancing pass, the first NUMA
> aware, and the second one relaxed.
>
> The idle/newidle balancing is always allowed to fallback into
> non-affine AutoNUMA tasks. The load_balancing (which is more a
> fairness than a performance issue) is instead only able to cross over
> the AutoNUMA affinity if the flag controlled by
> /sys/kernel/mm/autonuma/scheduler/load_balance_strict is not set (it
> is set by default).
This is unacceptable, and contradicts your earlier claim that you rely
on the regular load-balancer.
The strict mode needs to go; load-balancing is a best effort, and
fairness is important -- so much so to some people that I get complaints
that the current thing isn't strong enough.
Your strict mode basically supplants any and all balancing done at node
level and above.
Please use something like:
https://lkml.org/lkml/2012/5/19/53
with the sched_setnode() function from:
https://lkml.org/lkml/2012/5/18/109
Fairness matters because people expect similar throughput or runtimes,
so the order should be: first ensure equal load on the cpus, and only
then bother with node placement.
Furthermore, load-balancing does things like trying to place tasks that
wake each other close together; your strict mode completely breaks
that. Instead, if the balancer finds these tasks are related and should
be together, that should be a hint that the memory needs to come to
them, not the other way around.
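To make the two-pass balancing described in the quoted text concrete,
here is a rough sketch (illustrative pseudo-code, not the actual AutoNUMA
patch; task_autonuma_cpu() is the helper named in the quote, assumed here
to take a task and a destination cpu, and the runqueue fields are only
assumed to look like the mainline ones):

/* Hypothetical sketch: scan the source runqueue twice.  The first pass
 * only accepts tasks whose AutoNUMA preferred node contains dst_cpu;
 * the second, relaxed pass accepts anything so balancing can still make
 * progress.  With load_balance_strict set the relaxed pass is skipped,
 * which is the behaviour Peter objects to above. */
static struct task_struct *pick_task_two_pass(struct rq *src_rq,
					      int dst_cpu, bool strict)
{
	struct task_struct *p;
	int pass;

	for (pass = 0; pass < 2; pass++) {
		list_for_each_entry(p, &src_rq->cfs_tasks, se.group_node) {
			if (pass == 0 && !task_autonuma_cpu(p, dst_cpu))
				continue;	/* NUMA-affine pass only */
			return p;
		}
		if (strict)
			break;	/* never fall back to non-affine tasks */
	}
	return NULL;
}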
Hi,
On Tue, May 29, 2012 at 10:53:32AM -0500, Christoph Lameter wrote:
> then does the distribution of the load on its own. NUMA aware applications
> like that do not benefit and do not need either of the mechanisms proposed
> here.
Agreed. Whoever changes their apps to optimize things at that low level
is, I suspect, not going to want to risk hitting a migrate-on-fault (or
AutoNUMA async migration for that matter).
> I think the proof that we need is that a general mix of applications
> actually benefits from an auto migration scheme. I would also like to see
Agreed.
> that it does no harm to existing NUMA aware applications.
As far as AutoNUMA is concerned, it will be a total bypass whenever
the mpol isn't MPOL_DEFAULT. So it shouldn't harm. Shared memory is
also bypassed.
It only alters the behavior of MPOL_DEFAULT; any other kind of
mempolicy is unaffected, and all CPU bindings are also unaffected.
If an app has only a few vmas that are MPOL_DEFAULT, those few will be
handled by AutoNUMA.
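For illustration only (this is not the actual patch code and the helper
name is made up), the bypass described above amounts to a check of this
shape before AutoNUMA considers a vma at all:

/* Hypothetical sketch: leave a vma alone unless it uses the default
 * policy; explicit mempolicies and shared mappings are skipped, so
 * NUMA-aware applications keep full control of their placement. */
static bool autonuma_vma_eligible(struct vm_area_struct *vma)
{
	struct mempolicy *pol = vma->vm_policy;

	if (pol && pol->mode != MPOL_DEFAULT)
		return false;
	if (vma->vm_flags & (VM_SHARED | VM_HUGETLB | VM_PFNMAP))
		return false;
	return true;
}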
If people think AutoMigration is a better name I should rename
it. It's up to you. I thought using a "NUMA" suffix would make it
more intuitive that if your hardware isn't NUMA, this won't do
anything at all. Migration as a feature isn't limited to NUMA (see
compaction etc.). Comments welcome.
On Fri, 2012-05-25 at 19:02 +0200, Andrea Arcangeli wrote:
> Invoke autonuma_balance only on the busy CPUs at the same frequency of
> the CFS load balance.
>
> Signed-off-by: Andrea Arcangeli <[email protected]>
> ---
> kernel/sched/fair.c | 3 +++
> 1 files changed, 3 insertions(+), 0 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 99d1d33..1357938 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4893,6 +4893,9 @@ static void run_rebalance_domains(struct softirq_action *h)
>
> rebalance_domains(this_cpu, idle);
>
> + if (!this_rq->idle_balance)
> + sched_set_autonuma_need_balance();
> +
This just isn't enough.. the whole thing needs to move out of
schedule(). The only time schedule() should ever look at another cpu is
if it's idle.
As it stands, load-balance actually takes too much time as it is to live
in a softirq; -rt gets around that by pushing all softirqs into a thread,
and I was thinking of doing some of that for mainline too.
On Tue, May 29, 2012 at 02:49:13PM +0200, Peter Zijlstra wrote:
> On Fri, 2012-05-25 at 19:02 +0200, Andrea Arcangeli wrote:
> > /**
> > + * kthread_bind_node - bind a just-created kthread to the CPUs of a node.
> > + * @p: thread created by kthread_create().
> > + * @nid: node (might not be online, must be possible) for @p to run on.
> > + *
> > + * Description: This function is equivalent to set_cpus_allowed(),
> > + * except that @nid doesn't need to be online, and the thread must be
> > + * stopped (i.e., just returned from kthread_create()).
> > + */
> > +void kthread_bind_node(struct task_struct *p, int nid)
> > +{
> > + /* Must have done schedule() in kthread() before we set_task_cpu */
> > + if (!wait_task_inactive(p, TASK_UNINTERRUPTIBLE)) {
> > + WARN_ON(1);
> > + return;
> > + }
> > +
> > + /* It's safe because the task is inactive. */
> > + do_set_cpus_allowed(p, cpumask_of_node(nid));
> > + p->flags |= PF_THREAD_BOUND;
>
> No, I've said before, this is wrong. You should only ever use
> PF_THREAD_BOUND when it's strictly per-cpu. Moving your numa threads
> to a different node is silly but not fatal in any way.
I changed the semantics of that bitflag, now it means: userland isn't
allowed to shoot itself in the foot and mess with whatever CPU
bindings the kernel has set for the kernel thread.
It'd be a clear regression not to set PF_THREAD_BOUND there. It would be
even worse to remove the CPU binding to the node: it'd risk copying
memory with both src and dst being in nodes remote from the CPU that
knuma_migrated runs on (there aren't just 2-node systems out there).
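For context, the helper in the quoted hunk would be used roughly like
this when the per-node daemons are started (a sketch under stated
assumptions: knuma_migrated() is the daemon function from the patch set
and, as Andrea explains later in the thread, receives its pgdat as the
thread argument; kthread_bind_node() is the new helper quoted above):

#include <linux/kthread.h>
#include <linux/mmzone.h>

/* Sketch: create one knuma_migrated daemon per online node, hand it its
 * pgdat, and pin it to the CPUs of that node via kthread_bind_node(),
 * which also sets PF_THREAD_BOUND. */
static int __init start_knuma_migrated_daemons(void)
{
	int nid;

	for_each_online_node(nid) {
		struct task_struct *p;

		p = kthread_create(knuma_migrated, NODE_DATA(nid),
				   "knuma_migrated%d", nid);
		if (IS_ERR(p))
			return PTR_ERR(p);
		kthread_bind_node(p, nid);
		wake_up_process(p);
	}
	return 0;
}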
On Fri, 2012-05-25 at 19:02 +0200, Andrea Arcangeli wrote:
> This implements knuma_scand, the numa_hinting faults started by
> knuma_scand, the knuma_migrated that migrates the memory queued by the
> NUMA hinting faults, the statistics gathering code that is done by
> knuma_scand for the mm_autonuma and by the numa hinting page faults
> for the sched_autonuma, and most of the rest of the AutoNUMA core
> logics like the false sharing detection, sysfs and initialization
> routines.
>
> The AutoNUMA algorithm when knuma_scand is not running is a full
> bypass and it must not alter the runtime of memory management and
> scheduler.
>
> The whole AutoNUMA logic is a chain reaction as result of the actions
> of the knuma_scand. The various parts of the code can be described
> like different gears (gears as in glxgears).
>
> knuma_scand is the first gear and it collects the mm_autonuma per-process
> statistics and at the same time it sets the pte/pmd it scans as
> pte_numa and pmd_numa.
>
> The second gear are the numa hinting page faults. These are triggered
> by the pte_numa/pmd_numa pmd/ptes. They collect the sched_autonuma
> per-thread statistics. They also implement the memory follow CPU logic
> where we track if pages are repeatedly accessed by remote nodes. The
> memory follow CPU logic can decide to migrate pages across different
> NUMA nodes by queuing the pages for migration in the per-node
> knuma_migrated queues.
>
> The third gear is knuma_migrated. There is one knuma_migrated daemon
> per node. Pages pending migration are queued in a matrix of
> lists. Each knuma_migrated (in parallel with the others) goes over
> those lists and migrates the pages queued for migration, in round robin,
> from each incoming node to the node that knuma_migrated is running
> on.
>
> The fourth gear is the NUMA scheduler balancing code. That computes
> the statistical information collected in mm->mm_autonuma and
> p->sched_autonuma and evaluates the status of all CPUs to decide if
> tasks should be migrated to CPUs in remote nodes.
IOW:
"knuma_scand 'unmaps' ptes and collects mm stats, this triggers
numa_hinting pagefaults, using these we collect per task stats.
knuma_migrated migrates pages to their destination node. Something
queues pages.
The numa scheduling code uses the gathered stats to place tasks."
That covers just about all you said, now the interesting bits are still
missing:
- how do you do false sharing;
- what stats do you gather, and how are they used at each stage;
- what's your balance goal, and how is that expressed and
converged upon.
Also, what I've not seen anywhere are scheduling stats: what if, despite
you giving a hint that a particular process should run on a particular
node, it doesn't and sticks to where it's at (granted, with strict this
can't happen -- but it should).
On Fri, 2012-05-25 at 19:02 +0200, Andrea Arcangeli wrote:
> When pages are freed abort any pending migration. If knuma_migrated
> arrives first it will notice because get_page_unless_zero would fail.
But knuma_migrated can run on a different cpu than the one this free is
happening on; ACCESS_ONCE() won't cure that.
What's that ACCESS_ONCE() good for?
Also, you already have an autonuma_ hook right there, why add more
#ifdeffery ?
> Signed-off-by: Andrea Arcangeli <[email protected]>
> ---
> mm/page_alloc.c | 4 ++++
> 1 files changed, 4 insertions(+), 0 deletions(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 3d1ee70..1d3163f 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -614,6 +614,10 @@ static inline int free_pages_check(struct page *page)
> bad_page(page);
> return 1;
> }
> + autonuma_migrate_page_remove(page);
> +#ifdef CONFIG_AUTONUMA
> + ACCESS_ONCE(page->autonuma_last_nid) = -1;
> +#endif
> if (page->flags & PAGE_FLAGS_CHECK_AT_PREP)
> page->flags &= ~PAGE_FLAGS_CHECK_AT_PREP;
> return 0;
On Tue, May 29, 2012 at 03:16:25PM +0200, Peter Zijlstra wrote:
> 24 bytes per page.. or ~0.6% of memory gone. This is far too great a
> price to pay.
I don't think it's too great; memcg uses half of that, and yet
nobody is booting with cgroup_disable=memory even on not-NUMA servers
with less RAM.
> At LSF/MM Rik already suggested you limit the number of pages that can
> be migrated concurrently and use this to move the extra list_head out of
> struct page and into a smaller amount of extra structures, reducing the
> total overhead.
It would reduce the memory overhead, but it'll make the code more
complex and it'll require more locking. Plus, allowing for very long
migration lrus provides an additional means of false-sharing
avoidance. Those are lrus: even if the last_nid false sharing logic
passes, the page still has to reach the tail of the list before being
migrated, and if false sharing happens in the meanwhile we'll remove it
from the lru.
But I'm all for experimenting. It's just not something I had the time
to try yet. I will certainly love to see how it performs by reducing
the max size of the list. I totally agree it's a good idea to try it
out, and I don't exclude it will work fine, but it's not obvious it's
worth the memory saving.
On Fri, 2012-05-25 at 19:02 +0200, Andrea Arcangeli wrote:
> Move the AutoNUMA per page information from the "struct page" to a
> separate page_autonuma data structure allocated in the memsection
> (with sparsemem) or in the pgdat (with flatmem).
>
> This is done to avoid growing the size of the "struct page" and the
> page_autonuma data is only allocated if the kernel has been booted on
> real NUMA hardware (or if noautonuma is passed as parameter to the
> kernel).
>
Argh, please fold this change back into the series proper. If you want
to keep it.. as it is its not really an improvement IMO, see below.
> +struct page_autonuma {
> + /*
> + * FIXME: move to pgdat section along with the memcg and allocate
> + * at runtime only in presence of a numa system.
> + */
> + /*
> + * To modify autonuma_last_nid lockless the architecture,
> + * needs SMP atomic granularity < sizeof(long), not all archs
> + * have that, notably some alpha. Archs without that requires
> + * autonuma_last_nid to be a long.
> + */
Looking at arch/alpha/include/asm/xchg.h it looks to have that just
fine, so maybe we simply don't support SMP on those early Alphas that
had that weirdness.
> +#if BITS_PER_LONG > 32
> + int autonuma_migrate_nid;
> + int autonuma_last_nid;
> +#else
> +#if MAX_NUMNODES >= 32768
> +#error "too many nodes"
> +#endif
> + /* FIXME: remember to check the updates are atomic */
> + short autonuma_migrate_nid;
> + short autonuma_last_nid;
> +#endif
> + struct list_head autonuma_migrate_node;
> +
> + /*
> + * To find the page starting from the autonuma_migrate_node we
> + * need a backlink.
> + */
> + struct page *page;
> +};
This makes a shadow page frame of 32 bytes per page, or ~0.8% of memory.
This isn't in fact an improvement.
The suggestion done by Rik was to have something like a sqrt(nr_pages)
(?) scaled array of such things containing the list_head and page
pointer -- and leave the two nids in the regular page frame. Although I
think you've got to fight the memcg people over that last word in struct
page.
That places a limit on the amount of pages that can be in migration
concurrently, but also greatly reduces the memory overhead.
On 05/29/2012 12:38 PM, Andrea Arcangeli wrote:
> On Tue, May 29, 2012 at 03:16:25PM +0200, Peter Zijlstra wrote:
>> 24 bytes per page.. or ~0.6% of memory gone. This is far too great a
>> price to pay.
>
> I don't think it's too great, memcg uses for half of that and yet
> nobody is booting with cgroup_disable=memory even on not-NUMA servers
> with less RAM.
Not any more.
Ever since the memcg naturalization work by Johannes,
a page is only ever on one LRU list and the memcg
memory overhead is gone.
> But I'm all for experimenting. It's just not something I had the time
> to try yet. I will certainly love to see how it performs by reducing
> the max size of the list. I totally agree it's a good idea to try it
> out, and I don't exclude it will work fine, but it's not obvious it's
> worth the memory saving.
That's fair enough.
--
All rights reversed
On Tue, May 29, 2012 at 06:30:29PM +0200, Peter Zijlstra wrote:
> On Fri, 2012-05-25 at 19:02 +0200, Andrea Arcangeli wrote:
> > When pages are freed abort any pending migration. If knuma_migrated
> > arrives first it will notice because get_page_unless_zero would fail.
>
> But knuma_migrated can run on a different cpu than this free is
> happening, ACCESS_ONCE() won't cure that.
knuma_migrated won't alter the last_nid and it generally won't work on
any page that has page_count() == 0.
last_nid is the false sharing avoidance information (btw, that really
had better exist for every page, unlike the list node, which might be
limited maybe).
Then there's a second false sharing avoidance through the implicit
properties of the autonuma_migrate_head lrus and the
migration-cancellation in numa_hinting_fault_memory_follow_cpu (which
is why I wouldn't like the idea of an insert-only list, even if it
would save a pointer per page, but then I couldn't cancel the
migration when a false sharing is detected and knuma_migrated is
congested).
> What's that ACCESS_ONCE() good for?
The ACCESS_ONCE was used when setting last_nid, to tell gcc the value
can change from under it. It shouldn't alter the code emitted here and
it's probably superfluous in any case.
But considering that the page is being freed, I don't think it can
change from under us here, so this was definitely superfluous: numa
hinting page faults can't run on that page. I will remove it, thanks!
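For reference, in the kernels of that era ACCESS_ONCE() is only a
volatile cast (include/linux/compiler.h), so it constrains the compiler
rather than other CPUs, which is the point of the question above:

/* Forces the compiler to emit exactly one access to the location and
 * not cache or re-read it; it implies no memory ordering and no
 * cross-CPU visibility guarantees. */
#define ACCESS_ONCE(x) (*(volatile typeof(x) *)&(x))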
>
> Also, you already have an autonuma_ hook right there, why add more
> #ifdeffery ?
Agreed, the #ifdef is in fact already cleaned up in page_autonuma,
with autonuma_free_page:
- autonuma_migrate_page_remove(page);
-#ifdef CONFIG_AUTONUMA
- ACCESS_ONCE(page->autonuma_last_nid) = -1;
-#endif
+ autonuma_free_page(page);
>
> > Signed-off-by: Andrea Arcangeli <[email protected]>
> > ---
> > mm/page_alloc.c | 4 ++++
> > 1 files changed, 4 insertions(+), 0 deletions(-)
> >
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 3d1ee70..1d3163f 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -614,6 +614,10 @@ static inline int free_pages_check(struct page *page)
> > bad_page(page);
> > return 1;
> > }
> > + autonuma_migrate_page_remove(page);
> > +#ifdef CONFIG_AUTONUMA
> > + ACCESS_ONCE(page->autonuma_last_nid) = -1;
> > +#endif
> > if (page->flags & PAGE_FLAGS_CHECK_AT_PREP)
> > page->flags &= ~PAGE_FLAGS_CHECK_AT_PREP;
> > return 0;
>
On Tue, 2012-05-29 at 12:46 -0400, Rik van Riel wrote:
> > I don't think it's too great, memcg uses for half of that and yet
> > nobody is booting with cgroup_disable=memory even on not-NUMA servers
> > with less RAM.
Right, it was such a hit we had to disable that by default on RHEL6.
> Not any more.
Right, hnaz did great work there, but weren't there still a few pieces
of the shadow page frame left? ISTR LSF/MM talk of moving the last few
bits into the regular page frame, taking the word that became available
through: fc9bb8c768 ("mm: Rearrange struct page").
On Tue, 2012-05-29 at 18:11 +0200, Andrea Arcangeli wrote:
> On Tue, May 29, 2012 at 02:49:13PM +0200, Peter Zijlstra wrote:
> > On Fri, 2012-05-25 at 19:02 +0200, Andrea Arcangeli wrote:
> > > /**
> > > + * kthread_bind_node - bind a just-created kthread to the CPUs of a node.
> > > + * @p: thread created by kthread_create().
> > > + * @nid: node (might not be online, must be possible) for @k to run on.
> > > + *
> > > + * Description: This function is equivalent to set_cpus_allowed(),
> > > + * except that @nid doesn't need to be online, and the thread must be
> > > + * stopped (i.e., just returned from kthread_create()).
> > > + */
> > > +void kthread_bind_node(struct task_struct *p, int nid)
> > > +{
> > > + /* Must have done schedule() in kthread() before we set_task_cpu */
> > > + if (!wait_task_inactive(p, TASK_UNINTERRUPTIBLE)) {
> > > + WARN_ON(1);
> > > + return;
> > > + }
> > > +
> > > + /* It's safe because the task is inactive. */
> > > + do_set_cpus_allowed(p, cpumask_of_node(nid));
> > > + p->flags |= PF_THREAD_BOUND;
> >
> > No, I've said before, this is wrong. You should only ever use
> > PF_THREAD_BOUND when its strictly per-cpu. Moving the your numa threads
> > to a different node is silly but not fatal in any way.
>
> I changed the semantics of that bitflag, now it means: userland isn't
> allowed to shoot itself in the foot and mess with whatever CPU
> bindings the kernel has set for the kernel thread.
Yeah, and you did so without mentioning that in your changelog.
Furthermore I object to that change. I object even more strongly to
doing it without mention and keeping a misleading comment near the
definition.
> It'd be a clear regress not to set PF_THREAD_BOUND there. It would be
> even worse to remove the CPU binding to the node: it'd risk to copy
> memory with both src and dst being in remote nodes from the CPU where
> knuma_migrate runs on (there aren't just 2 node systems out there).
Just teach each knuma_migrated what node it represents and don't use
numa_node_id().
That way you can change the affinity just fine, it'll be sub-optimal,
copying memory from node x to node y through node z, but it'll still
work correctly.
numa isn't special in the way per-cpu stuff is special.
On Tue, May 29, 2012 at 06:44:15PM +0200, Peter Zijlstra wrote:
> On Fri, 2012-05-25 at 19:02 +0200, Andrea Arcangeli wrote:
> > Move the AutoNUMA per page information from the "struct page" to a
> > separate page_autonuma data structure allocated in the memsection
> > (with sparsemem) or in the pgdat (with flatmem).
> >
> > This is done to avoid growing the size of the "struct page" and the
> > page_autonuma data is only allocated if the kernel has been booted on
> > real NUMA hardware (or if noautonuma is passed as parameter to the
> > kernel).
> >
>
> Argh, please fold this change back into the series proper. If you want
> to keep it.. as it is its not really an improvement IMO, see below.
The whole objective of this patch is to avoid allocating the
page_autonuma structures when the kernel is booted on not-NUMA
hardware.
It's not an improvement when booting the kernel on NUMA hardware,
that's for sure.
I didn't merge it with the previous patch because this was the most
experimental recent change, so I wanted bisectability here. When
something goes wrong here the kernel won't boot, so unless you use
kvm with gdbstub it's a little tricky to debug (indeed I debugged it
with gdbstub, where it's trivial).
> > +struct page_autonuma {
> > + /*
> > + * FIXME: move to pgdat section along with the memcg and allocate
> > + * at runtime only in presence of a numa system.
> > + */
> > + /*
> > + * To modify autonuma_last_nid lockless the architecture,
> > + * needs SMP atomic granularity < sizeof(long), not all archs
> > + * have that, notably some alpha. Archs without that requires
> > + * autonuma_last_nid to be a long.
> > + */
>
> Looking at arch/alpha/include/asm/xchg.h it looks to have that just
> fine, so maybe we simply don't support SMP on those early Alphas that
> had that weirdness.
I agree we should never risk that.
> This makes a shadow page frame of 32 bytes per page, or ~0.8% of memory.
> This isn't in fact an improvement.
>
> The suggestion done by Rik was to have something like a sqrt(nr_pages)
> (?) scaled array of such things containing the list_head and page
> pointer -- and leave the two nids in the regular page frame. Although I
> think you've got to fight the memcg people over that last word in struct
> page.
>
> That places a limit on the amount of pages that can be in migration
> concurrently, but also greatly reduces the memory overhead.
Yes, however for the last_nid I'd still need it for every page (and if
I allocate it dynamically I still first need to find a way to remove the
struct page pointer).
I thought of adding a pointer in the memsection (or maybe of using a
vmemmap so that it won't even require a pointer in every memsection). I
have to check a few more things before I allow the autonuma->page
translation without a page pointer, notably to verify that the boot-time
allocation points won't just allocate power-of-two blocks of memory
(they shouldn't, but I didn't verify).
This is clearly a move in the right direction to avoid the memory
overhead when not booted on NUMA hardware, and I don't think there's
anything fundamental that prevents us from removing the page pointer
from the page_autonuma structure and later experimenting with a
limited-size array of async migration structures.
Hi Kirill,
The anon page was munmapped just after get_page_unless_zero obtained a
refcount in knuma_migrated. This can happen for example if a big
process exits while knuma_migrated starts to migrate the page. In that
case split_huge_page would do nothing, but when it does nothing it
notifies the caller by returning 1. When it returns 1, we just need to
put_page and bail out (the page isn't split in that case and it's
pointless to try to migrate a freed page).
I also made the code more strict now, to be sure the reason for the bug
wasn't a hugepage in the LRU that wasn't Anon; such a thing must not
exist, but this will verify it just in case.
I'll push it to the origin/autonuma branch of aa.git shortly
(rebased), could you try if it helps?
diff --git a/mm/autonuma.c b/mm/autonuma.c
index 3d4c2a7..c2a5a82 100644
--- a/mm/autonuma.c
+++ b/mm/autonuma.c
@@ -840,9 +840,17 @@ static int isolate_migratepages(struct list_head *migratepages,
VM_BUG_ON(nid != page_to_nid(page));
- if (PageAnon(page) && PageTransHuge(page))
+ if (PageTransHuge(page)) {
+ VM_BUG_ON(!PageAnon(page));
/* FIXME: remove split_huge_page */
- split_huge_page(page);
+ if (unlikely(split_huge_page(page))) {
+ autonuma_printk("autonuma migrate THP free\n");
+ __autonuma_migrate_page_remove(page,
+ page_autonuma);
+ put_page(page);
+ continue;
+ }
+ }
__autonuma_migrate_page_remove(page, page_autonuma);
Thanks a lot,
Andrea
BTW, it's interesting that knuma_migrated0 runs on CPU24; just in case,
you may also want to verify that it's correct with numactl --hardware
(in my case the highest cpuid in node0 is 17). It's not related to the
above, which is needed anyway.
On Tue, May 29, 2012 at 06:12:22PM +0200, Peter Zijlstra wrote:
> On Fri, 2012-05-25 at 19:02 +0200, Andrea Arcangeli wrote:
> > Invoke autonuma_balance only on the busy CPUs at the same frequency of
> > the CFS load balance.
> >
> > Signed-off-by: Andrea Arcangeli <[email protected]>
> > ---
> > kernel/sched/fair.c | 3 +++
> > 1 files changed, 3 insertions(+), 0 deletions(-)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 99d1d33..1357938 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -4893,6 +4893,9 @@ static void run_rebalance_domains(struct softirq_action *h)
> >
> > rebalance_domains(this_cpu, idle);
> >
> > + if (!this_rq->idle_balance)
> > + sched_set_autonuma_need_balance();
> > +
>
> This just isn't enough.. the whole thing needs to move out of
> schedule(). The only time schedule() should ever look at another cpu is
> if its idle.
>
> As it stands load-balance actually takes too much time as it is to live
> in a softirq, -rt gets around that by pushing all softirqs into a thread
> and I was thinking of doing some of that for mainline too.
No worries, I didn't mean to leave it like this forever. I was
considering using the stop cpu _nowait variant, but I didn't have
enough time to work out whether it would work for my case. I need to
rethink that.
I was thinking about which thread to use for that, or whether to use the
stop_cpu _nowait variant that active balancing is using, but it wasn't
so easy to change, and considering that from a practical standpoint it
already flies, I released it. It's already an improvement: the previous
approach was mostly a debug approach, to see if autonuma_balance would
flood the debug log and fail to converge.
autonuma_balance isn't fundamentally different from load_balance: they
both look around at the other runqueues to see if some task should be
moved.
If you move the load_balance to a kernel thread, I could move
autonuma_balance there too.
I just wasn't sure whether invoking a schedule() to actually call
autonuma_balance() made any sense, so I thought running it from
softirq too with the non-blocking _nowait variant (or keeping it in
schedule to be able to call stop_one_cpu without _nowait) would have
been more efficient.
The moment I gave up on the _nowait variant before releasing is when I
couldn't understand what tlb_migrate_finish is doing, and why it's not
present in the _nowait version in fair.c. Can you explain that to me?
Obviously it's only used by ia64 so I could as well ignore it, but it
was still an additional annoyance that made me think I needed a bit
more time to think about it.
I'm glad you acknowledge load_balance already takes the bulk of the
time, as it needs to find the busiest runqueue by checking other CPU
runqueues too... With autonuma14 there's no measurable difference in
hackbench with autonuma=y or the noautonuma boot parameter anymore, or
upstream without autonuma applied (not just autonuma=n). So the cost on
a 24-way SMP is 0.
Then I tried to measure it also with lockdep and all lock/mutex
debugging/stats enabled: there's a slightly measurable slowdown in
hackbench that may not be a measurement error, but it's barely
noticeable, and I expect that if I remove load_balance from the softirq
the gain would be bigger than removing autonuma_balance (it goes from 70
to 80 sec on avg IIRC, but the error is about 10 sec; just the avg
seems slightly higher). With lockdep and all other debug options
disabled it takes a fixed 6 sec for all configs and it's definitely not
measurable (tested both thread and process mode, not that it makes any
difference for this).
On Tue, May 29, 2012 at 9:38 AM, Andrea Arcangeli <[email protected]> wrote:
> On Tue, May 29, 2012 at 03:16:25PM +0200, Peter Zijlstra wrote:
>> 24 bytes per page.. or ~0.6% of memory gone. This is far too great a
>> price to pay.
>
> I don't think it's too great, memcg uses for half of that and yet
> nobody is booting with cgroup_disable=memory even on not-NUMA servers
> with less RAM.
A big fraction of one percent is absolutely unacceptable.
Our "struct page" is one of our biggest memory users, there's no way
we should cavalierly make it even bigger.
It's also a huge performance sink, the cache miss on struct page tends
to be one of the biggest problems in managing memory. We may not ever
fix that, but making struct page bigger certainly isn't going to help
the bad cache behavior.
Linus
On Tue, 2012-05-29 at 19:33 +0200, Andrea Arcangeli wrote:
> So the cost on a 24-way SMP
is irrelevant.. also, not every cpu gets to the 24 cpu domain, just 2
do.
When you do for_each_cpu() think at least 4096, if you do
for_each_node() think at least 256.
Add to that the knowledge that doing 4096 remote memory accesses will
cost multiple jiffies, then realize you're wanting to do that with
preemption disabled.
That's just a very big no go.
On Tue, May 29, 2012 at 07:04:51PM +0200, Peter Zijlstra wrote:
> doing it without mention and keeping a misleading comment near the
> definition.
Right, I forgot to update the comment, I fixed it now.
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 60a699c..0b84494 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1788,7 +1788,7 @@ extern void thread_group_times(struct task_struct *p, cputime_t *ut, cputime_t *
#define PF_SWAPWRITE 0x00800000 /* Allowed to write to swap */
#define PF_SPREAD_PAGE 0x01000000 /* Spread page cache over cpuset */
#define PF_SPREAD_SLAB 0x02000000 /* Spread some slab caches over cpuset */
-#define PF_THREAD_BOUND 0x04000000 /* Thread bound to specific cpu */
+#define PF_THREAD_BOUND 0x04000000 /* Thread bound to specific cpus */
#define PF_MCE_EARLY 0x08000000 /* Early kill for mce process policy */
#define PF_MEMPOLICY 0x10000000 /* Non-default NUMA mempolicy */
#define PF_MUTEX_TESTER 0x20000000 /* Thread belongs to the rt mutex tester */
> Just teach each knuma_migrated what node it represents and don't use
> numa_node_id().
It already works like that, I absolutely never use numa_node_id(), I
always use the pgdat passed as parameter to the kernel thread through
the pointer parameter.
But it'd be totally bad not to do the hard bindings to the cpu_s_ of
the node, and not using PF_THREAD_BOUND would just allow userland to
shoot itself in the foot. I mean, if PF_THREAD_BOUND didn't exist
already I wouldn't add it, but considering somebody bothered to
implement it to make the userland root user "safer", it'd be
really silly not to take advantage of that for knuma_migrated too
(even if it binds to more than 1 CPU).
Additionally I added a bugcheck in the main knuma_migrated loop:
VM_BUG_ON(numa_node_id() != pgdat->node_id);
to be sure it never goes wrong. This above bugcheck is what allowed me
to find a bug in the numa emulation fixed in commit
d71b5a73fe9af42752c4329b087f7911b35f8f79.
> That way you can change the affinity just fine, it'll be sub-optimal,
> copying memory from node x to node y through node z, but it'll still
> work correctly.
I don't think allowing userland to do suboptimal things (even if it
will only decrease performance and still work correctly) makes
sense (considering somebody added PF_THREAD_BOUND already and it's
zero cost to use).
> numa isn't special in the way per-cpu stuff is special.
Agreed that it won't be as bad as getting per-cpu stuff wrong; it's
only a 50% slowdown in the worst case, but it's a guaranteed regression
even in the best case, so there's no reason to allow root to shoot
itself in the foot.
On Tue, 2012-05-29 at 19:44 +0200, Andrea Arcangeli wrote:
>
> But it'd be totally bad not to do the hard bindings to the cpu_s_ of
> the node, and not using PF_THREAD_BOUND would just allow userland to
> shoot itself in the foot. I mean if PF_THREAD_BOUND wouldn't exist
> already I wouldn't add it, but considering somebody bothered to
> implement it for the sake to make userland root user "safer", it'd be
> really silly not to take advantage of that for knuma_migrated too
> (even if it binds to more than 1 CPU).
No, I'm absolutely ok with the user shooting himself in the foot. The
thing exists because you can crash stuff if you get it wrong with
per-cpu.
Crashing is not good, worse performance is his own damn fault.
On Tue, May 29, 2012 at 10:38:34AM -0700, Linus Torvalds wrote:
> A big fraction of one percent is absolutely unacceptable.
>
> Our "struct page" is one of our biggest memory users, there's no way
> we should cavalierly make it even bigger.
>
> It's also a huge performance sink, the cache miss on struct page tends
> to be one of the biggest problems in managing memory. We may not ever
> fix that, but making struct page bigger certainly isn't going to help
> the bad cache behavior.
The cache effects on the VM fast paths shouldn't be altered, and no
additional per-page memory is allocated when booting the same bzImage
on not-NUMA hardware.
But when booted on NUMA hardware it now takes 8 bytes more per page
than before: there are 32 bytes allocated for every page (with
autonuma13 it was only 24 bytes). The struct page itself isn't modified.
I want to remove the page pointer from the page_autonuma structure, to
keep the overhead at 0.58% instead of the current 0.78% (like it was
on autonuma alpha13 before the page_autonuma introduction). That
shouldn't be difficult and it's the next step.
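For reference, the percentages being traded here are just the per-page
cost divided by the 4KiB base page size:

	32 bytes / 4096 ~= 0.78%  (alpha14 page_autonuma, with the page back-pointer)
	24 bytes / 4096 ~= 0.58%  (the alpha13 layout, without the struct page pointer)
	16 bytes / 4096 ~= 0.39%  (what the memcg figure quoted below corresponds to)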
Those changes aren't visible to anything but *autonuma.* files and the
cache misses in accessing the page_autonuma structure shouldn't be
measurable (the only fast path access is from
autonuma_free_page). Even if we find a way to shrink it below 0.58%,
it won't be intrusive over the rest of the kernel.
memcg takes 0.39% on every system built with
CONFIG_CGROUP_MEM_RES_CTLR=y unless the kernel is booted with
cgroup_disable=memory (and nobody does).
I'll do my best to shrink it further; as mentioned, I'm very willing
to experiment with a fixed-size array sized as a function of the RAM per
node, to reduce the overhead (Michel and Rik suggested that at the MM
summit too). Maybe it'll just work fine even if the max size of the lru
is reduced by a factor of 10. In the worst case I personally believe
lots of people would be ok with paying 0.58%, considering they're paying
0.39% even on much smaller not-NUMA systems to boot with memcg. And I'm
sure I can reduce it at least to 0.58% without any downside.
It's lots of work to reduce it below 0.58%, so before doing that I
believe it's fair enough to do enough performance measurement and
reviews to be sure the design flies.
On Tue, May 29, 2012 at 07:48:06PM +0200, Peter Zijlstra wrote:
> On Tue, 2012-05-29 at 19:44 +0200, Andrea Arcangeli wrote:
> >
> > But it'd be totally bad not to do the hard bindings to the cpu_s_ of
> > the node, and not using PF_THREAD_BOUND would just allow userland to
> > shoot itself in the foot. I mean if PF_THREAD_BOUND wouldn't exist
> > already I wouldn't add it, but considering somebody bothered to
> > implement it for the sake to make userland root user "safer", it'd be
> > really silly not to take advantage of that for knuma_migrated too
> > (even if it binds to more than 1 CPU).
>
> No, I'm absolutely ok with the user shooting himself in the foot. The
> thing exists because you can crash stuff if you get it wrong with
> per-cpu.
>
> Crashing is not good, worse performance is his own damn fault.
Some people don't like root to write to /dev/mem or rm -r /
either. I'm not in that camp, but if you're not in that camp either,
then you should _never_ care to set PF_THREAD_BOUND, no matter whether
it's about crashing or just slowing down the kernel.
If such a thing exists, well, using it to keep the user from either
crashing or screwing with the system performance can only be a bonus.
On Tue, May 29, 2012 at 07:43:27PM +0200, Peter Zijlstra wrote:
> On Tue, 2012-05-29 at 19:33 +0200, Andrea Arcangeli wrote:
> > So the cost on a 24-way SMP
>
> is irrelevant.. also, not every cpu gets to the 24 cpu domain, just 2
> do.
>
> When you do for_each_cpu() think at least 4096, if you do
> for_each_node() think at least 256.
>
> Add to that the knowledge that doing 4096 remote memory accesses will
> cost multiple jiffies, then realize you're wanting to do that with
> preemption disabled.
>
> That's just a very big no go.
I'm thinking of 4096/256; this is why I mentioned it's a 24-way system.
I think the hackbench run should be repeated on a much bigger system to
see what happens, I'm not saying it'll work fine already.
But from autonuma13 to 14 it's a world of difference in hackbench
terms, to the point that the cost is zero on a 24-way.
My idea down the road, with multi-hop systems, is to balance across
the 1-hop nodes at the regular load_balance interval, the 2-hop nodes at
half frequency, the 3-hop nodes at 1/4th frequency, etc... That change
alone should help tremendously with 256 nodes and 5/6 hops. And it
should be quite easy to implement too, along the lines of the sketch
below.
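A minimal sketch of that interval scaling (a hypothetical helper, not
part of the posted patches):

/* Hypothetical: given the hop distance of a remote node and a running
 * balance-pass counter, decide whether to examine that node this pass.
 * 1-hop nodes are looked at every pass, 2-hop nodes every 2nd pass,
 * 3-hop nodes every 4th pass, and so on. */
static bool autonuma_balance_node_this_pass(int hops, unsigned long pass)
{
	unsigned int shift = hops > 1 ? hops - 1 : 0;

	return (pass & ((1UL << shift) - 1)) == 0;
}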
knuma_migrated also needs to learn more about the hops and probably
scan at higher frequency the lru heads coming from the closer hops.
The code is not "hops" aware yet and certainly there are still lots of
optimizations to do for the very big systems. I think it's already
quite ideal right now for most servers, and I don't see blockers to
optimizing it for the extremely big cases (and I expect it'd already
work better than nothing in the extreme setups). I removed [RFC]
because I'm quite happy with it now (there were things I wasn't happy
with before), but I didn't mean it's finished.
On Tue, May 29, 2012 at 06:56:53PM +0200, Peter Zijlstra wrote:
> On Tue, 2012-05-29 at 12:46 -0400, Rik van Riel wrote:
> > > I don't think it's too great, memcg uses for half of that and yet
> > > nobody is booting with cgroup_disable=memory even on not-NUMA servers
> > > with less RAM.
>
> Right, it was such a hit we had to disable that by default on RHEL6.
CONFIG_CGROUP_MEM_RES_CTLR is =y; do you mean it's set to
cgroup_disable=memory by default in grub? I didn't notice that.
If a certain number of users are passing cgroup_disable=memory at boot
because they don't need the feature, well, that's perfectly reasonable
and the way it should be. That's why such an option exists and why I
also provided a noautonuma parameter.
> Right, hnaz did great work there, but wasn't there still some few pieces
> of the shadow page frame left? ISTR LSF/MM talk of moving the last few
> bits into the regular page frame, taking the word that became available
> through: fc9bb8c768 ("mm: Rearrange struct page").
The memcg diet topic has been around for a long time; they started
working on it more than a year ago. I'm currently referring to current
upstream (maybe a week old).
But this is normal: first you focus on the algorithm, then you worry
about how to optimize the implementation to reduce the memory usage
without altering the runtime (well, without altering it too much at
least...).
On 05/29/2012 01:38 PM, Linus Torvalds wrote:
> On Tue, May 29, 2012 at 9:38 AM, Andrea Arcangeli<[email protected]> wrote:
>> On Tue, May 29, 2012 at 03:16:25PM +0200, Peter Zijlstra wrote:
>>> 24 bytes per page.. or ~0.6% of memory gone. This is far too great a
>>> price to pay.
>>
>> I don't think it's too great, memcg uses for half of that and yet
>> nobody is booting with cgroup_disable=memory even on not-NUMA servers
>> with less RAM.
>
> A big fraction of one percent is absolutely unacceptable.
Andrea, here is a quick back of the envelope idea.
In every zone, we keep an array of pointers to pages and
other needed info for knumad. We do not need as many as
we have pages in a zone, because we do not want to move
all that memory across anyway (especially in larger systems).
Maybe the number of entries can scale up with the square
root of the zone size?
struct numa_pq_entry {
	struct page *page;
	pg_data_t *destination;
};
For each zone, we can have a numa queueing struct:
struct numa_queue {
	struct numa_pq_entry *current_knumad;
	struct numa_pq_entry *current_queue;
	struct numa_pq_entry entries[];
};
Pages can get added to the knumad queue by filling
in a pointer and a destination node, and by setting
a page flag indicating that this page should be
moved to another NUMA node.
If something happens to the page that would cancel
the queuing, we simply clear that page flag.
When knumad gets around to an entry in the array,
it will check to see if the "should migrate" page
flag is still set. If it is not, it skips the entry.
The current_knumad and current_queue entries can
be used to simply keep circular buffer semantics.
Does this look reasonable?
--
All rights reversed
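If it helps, the queueing and skipping logic Rik describes might look
roughly like this on top of those structures (a sketch with hypothetical
names: the "should migrate" page flag accessors, numa_queue_advance()
and migrate_page_to_node() are stand-ins, not existing kernel APIs):

/* Sketch only: add a page to the fixed-size ring and mark it with the
 * hypothetical "should migrate" flag; knumad later skips entries whose
 * flag was cleared in the meantime (page freed, false sharing, ...). */
static void numa_queue_page(struct numa_queue *nq, struct page *page,
			    pg_data_t *dest)
{
	struct numa_pq_entry *e = nq->current_queue;

	SetPageShouldMigrate(page);		/* hypothetical page flag */
	e->page = page;
	e->destination = dest;
	nq->current_queue = numa_queue_advance(nq, e);	/* circular */
}

static void knumad_process_entry(struct numa_queue *nq)
{
	struct numa_pq_entry *e = nq->current_knumad;

	/* only migrate if nothing cancelled the request since queueing */
	if (e->page && TestClearPageShouldMigrate(e->page))
		migrate_page_to_node(e->page, e->destination);

	nq->current_knumad = numa_queue_advance(nq, e);
}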
On Tue, 2012-05-29 at 19:33 +0200, Andrea Arcangeli wrote:
> No worries, I didn't mean to leave it like this forever. I was
> considering using the stop cpu _nowait variant but I didn't have
> enough time to realize if it would work for my case. I need to rethink
> about that.
No, you're not going to use any stop_cpu variant at all. Nothing is
_that_ urgent. Your whole strict mode needs to go, it completely wrecks
the regular balancer.
> The moment I gave up on the _nowait variant before releasing is when I
> couldn't understand what is tlb_migrate_finish doing, and why it's not
> present in the _nowait version in fair.c. Can you explain me that?
It's an optional tlb flush; I guess they didn't find the active_balance
case worth the effort -- it should be fairly rare anyway.
> I'm glad you acknowledge load_balance already takes a bulk of the time
> as it needs to find the busiest runqueue checking other CPU runqueues
> too...
I've never said otherwise; it's always been about where you do it, and
in the middle of schedule() just isn't it. And I'm getting very tired of
having to repeat myself.
Also for regular load-balance only 2 cpus will ever scan all cpus, the
rest will only scan smaller ranges. Your thing does n-1 nodes worth of
cpus for every cpu.
On Tue, May 29, 2012 at 02:45:54PM +0300, Kirill A. Shutemov wrote:
> On Fri, May 25, 2012 at 07:02:27PM +0200, Andrea Arcangeli wrote:
>
> > +static int knumad_do_scan(void)
> > +{
>
> ...
>
> > + if (knumad_test_exit(mm) || !vma) {
> > + mm_autonuma = mm->mm_autonuma;
> > + if (mm_autonuma->mm_node.next != &knumad_scan.mm_head) {
> > + mm_autonuma = list_entry(mm_autonuma->mm_node.next,
> > + struct mm_autonuma, mm_node);
> > + knumad_scan.mm = mm_autonuma->mm;
> > + atomic_inc(&knumad_scan.mm->mm_count);
> > + knumad_scan.address = 0;
> > + knumad_scan.mm->mm_autonuma->numa_fault_pass++;
> > + } else
> > + knumad_scan.mm = NULL;
>
> knumad_scan.mm should be nulled only after list_del, otherwise you
> will have a race with autonuma_exit():
Thanks for noticing. I managed to reproduce it by setting
knuma_scand/scan_sleep_millisecs and
knuma_scand/scan_sleep_pass_millisecs both to 0 and running a loop of
"while :; do memhog -r10 10m &>/dev/null; done".
So the problem was that if knuma_scand changed knumad_scan.mm after
mm->mm_users reached 0 but before autonuma_exit ran, autonuma_exit
wouldn't notice that the mm->mm_autonuma was already unlinked and it
would unlink it again.
autonuma_exit itself doesn't need to tell anything to knuma_scand,
because if it notices knuma_scand.mm == mm, it will do nothing and it
_always_ relies on knumad_scan to unlink it.
And if instead knuma_scand.mm is != mm, then autonuma_exit knows the
knuma_scand daemon will never have a chance to see the "mm" in the
list again if it arrived first (setting mm_autonuma->mm = NULL there
is just a debug tweak according to the comment).
The "serialize" event is there only to wait the knuma_scand main loop
before taking down the mm (it's not related to the list management).
The mm_autonuma->mm is useless after the "mm_autonuma" has been
unlinked so it's ok to use that to track if knuma_scand arrives first.
The exit path of the kernel daemon also forgot to check for
knumad_test_exit(mm) before unlinking, but that only runs if
kthread_should_stop() is true, and nobody calls kthread_stop so it's
only a theoretical improvement.
So this seems to fix it.
diff --git a/mm/autonuma.c b/mm/autonuma.c
index c2a5a82..768250a 100644
--- a/mm/autonuma.c
+++ b/mm/autonuma.c
@@ -679,9 +679,12 @@ static int knumad_do_scan(void)
} else
knumad_scan.mm = NULL;
- if (knumad_test_exit(mm))
+ if (knumad_test_exit(mm)) {
list_del(&mm->mm_autonuma->mm_node);
- else
+ /* tell autonuma_exit not to list_del */
+ VM_BUG_ON(mm->mm_autonuma->mm != mm);
+ mm->mm_autonuma->mm = NULL;
+ } else
mm_numa_fault_flush(mm);
mmdrop(mm);
@@ -770,8 +773,12 @@ static int knuma_scand(void *none)
mm = knumad_scan.mm;
knumad_scan.mm = NULL;
- if (mm)
+ if (mm && knumad_test_exit(mm)) {
list_del(&mm->mm_autonuma->mm_node);
+ /* tell autonuma_exit not to list_del */
+ VM_BUG_ON(mm->mm_autonuma->mm != mm);
+ mm->mm_autonuma->mm = NULL;
+ }
mutex_unlock(&knumad_mm_mutex);
if (mm)
@@ -996,11 +1003,15 @@ void autonuma_exit(struct mm_struct *mm)
mutex_lock(&knumad_mm_mutex);
if (knumad_scan.mm == mm)
serialize = true;
- else
+ else if (mm->mm_autonuma->mm) {
+ VM_BUG_ON(mm->mm_autonuma->mm != mm);
+ mm->mm_autonuma->mm = NULL; /* debug */
list_del(&mm->mm_autonuma->mm_node);
+ }
mutex_unlock(&knumad_mm_mutex);
if (serialize) {
+ /* prevent the mm to go away under knumad_do_scan main loop */
down_write(&mm->mmap_sem);
up_write(&mm->mmap_sem);
}
Hi,
On Tue, May 29, 2012 at 03:51:08PM +0200, Peter Zijlstra wrote:
> On Fri, 2012-05-25 at 19:02 +0200, Andrea Arcangeli wrote:
>
>
> > diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> > index 41aa49b..8e578e6 100644
> > --- a/include/linux/mmzone.h
> > +++ b/include/linux/mmzone.h
> > @@ -666,6 +666,12 @@ typedef struct pglist_data {
> > struct task_struct *kswapd;
> > int kswapd_max_order;
> > enum zone_type classzone_idx;
> > +#ifdef CONFIG_AUTONUMA
> > + spinlock_t autonuma_lock;
> > + struct list_head autonuma_migrate_head[MAX_NUMNODES];
> > + unsigned long autonuma_nr_migrate_pages;
> > + wait_queue_head_t autonuma_knuma_migrated_wait;
> > +#endif
> > } pg_data_t;
> >
> > #define node_present_pages(nid) (NODE_DATA(nid)->node_present_pages)
>
> O(nr_nodes^2) data.. ISTR people rewriting a certain slab allocator to
> get rid of that :-)
>
> Also, don't forget that MAX_NUMNODES is an unconditional 512 on distro
> kernels, even when we only have 2.
>
> Now the total wasted space isn't too bad since its only 16 bytes,
> totaling a whole 2M for a 256 node system. But still, something like
> that wants at least a mention somewhere.
I fully agree; I prefer to fix it and I was fully aware of
this. It's not a big deal so it got low priority to be fixed, but I
intended to optimize it.
As long as num_possible_nodes() is initialized before the pgdat is
allocated, it shouldn't be difficult to optimize this by moving a
struct list_head autonuma_migrate_head[0] to the end of the structure.
mm_autonuma and sched_autonuma initially also had MAX_NUMNODES arrays
in them, then I converted them to dynamic allocations to be optimal.
The same needs to happen here.
Thanks,
Andrea
(5/29/12 10:54 AM), Peter Zijlstra wrote:
> On Tue, 2012-05-29 at 09:56 -0400, Rik van Riel wrote:
>> On 05/29/2012 09:16 AM, Peter Zijlstra wrote:
>>> On Fri, 2012-05-25 at 19:02 +0200, Andrea Arcangeli wrote:
>>
>>> 24 bytes per page.. or ~0.6% of memory gone. This is far too great a
>>> price to pay.
>>>
>>> At LSF/MM Rik already suggested you limit the number of pages that can
>>> be migrated concurrently and use this to move the extra list_head out of
>>> struct page and into a smaller amount of extra structures, reducing the
>>> total overhead.
>>
>> For THP, we should be able to track this NUMA info on a
>> 2MB page granularity.
>
> Yeah, but that's another x86-only feature, _IF_ we're going to do this
> it must be done for all archs that have CONFIG_NUMA, thus we're stuck
> with 4k (or other base page size).
Even if THP=n, we don't need 4k granularity. All modern malloc implementations have
a per-thread heap (e.g. glibc calls it an arena) and it is usually 1-8MB in size. So, if
it is larger than 2MB, we can always use per-pmd tracking; IOW, memory consumption
is reduced to 1/512.
My suggestion is: track at per-pmd (i.e. 2M size) granularity and fix glibc too (current
glibc malloc has a dynamic arena size adjusting feature and then the arena often becomes
less than 2M).
On Wed, 2012-05-30 at 04:25 -0400, KOSAKI Motohiro wrote:
> (5/29/12 10:54 AM), Peter Zijlstra wrote:
> > On Tue, 2012-05-29 at 09:56 -0400, Rik van Riel wrote:
> >> On 05/29/2012 09:16 AM, Peter Zijlstra wrote:
> >>> On Fri, 2012-05-25 at 19:02 +0200, Andrea Arcangeli wrote:
> >>
> >>> 24 bytes per page.. or ~0.6% of memory gone. This is far too great a
> >>> price to pay.
> >>>
> >>> At LSF/MM Rik already suggested you limit the number of pages that can
> >>> be migrated concurrently and use this to move the extra list_head out of
> >>> struct page and into a smaller amount of extra structures, reducing the
> >>> total overhead.
> >>
> >> For THP, we should be able to track this NUMA info on a
> >> 2MB page granularity.
> >
> > Yeah, but that's another x86-only feature, _IF_ we're going to do this
> > it must be done for all archs that have CONFIG_NUMA, thus we're stuck
> > with 4k (or other base page size).
>
> Even if THP=n, we don't need 4k granularity. All modern malloc implementation have
> per-thread heap (e.g. glibc call it as arena) and it is usually 1-8MB size. So, if
> it is larger than 2MB, we can always use per-pmd tracking. iow, memory consumption
> reduce to 1/512.
Yes, and we all know objects allocated in one thread are never shared
with other threads.. the producer-consumer pattern seems fairly popular
and will destroy your argument.
> My suggestion is, track per-pmd (i.e. 2M size) granularity and fix glibc too (current
> glibc malloc has dynamically arena size adjusting feature and then it often become
> less than 2M).
The trouble with making this per pmd is that you then get the false
sharing per pmd, so if there's shared data on the 2m page you'll not
know where to put it.
I also know of some folks who did a strict per-cpu allocator based on
some kernel patches I hope to see posted sometime soon. This because if
you have many more threads than cpus the wasted space in your areas is
tremendous.
(5/30/12 5:06 AM), Peter Zijlstra wrote:
> On Wed, 2012-05-30 at 04:25 -0400, KOSAKI Motohiro wrote:
>> (5/29/12 10:54 AM), Peter Zijlstra wrote:
>>> On Tue, 2012-05-29 at 09:56 -0400, Rik van Riel wrote:
>>>> On 05/29/2012 09:16 AM, Peter Zijlstra wrote:
>>>>> On Fri, 2012-05-25 at 19:02 +0200, Andrea Arcangeli wrote:
>>>>
>>>>> 24 bytes per page.. or ~0.6% of memory gone. This is far too great a
>>>>> price to pay.
>>>>>
>>>>> At LSF/MM Rik already suggested you limit the number of pages that can
>>>>> be migrated concurrently and use this to move the extra list_head out of
>>>>> struct page and into a smaller amount of extra structures, reducing the
>>>>> total overhead.
>>>>
>>>> For THP, we should be able to track this NUMA info on a
>>>> 2MB page granularity.
>>>
>>> Yeah, but that's another x86-only feature, _IF_ we're going to do this
>>> it must be done for all archs that have CONFIG_NUMA, thus we're stuck
>>> with 4k (or other base page size).
>>
>> Even if THP=n, we don't need 4k granularity. All modern malloc implementation have
>> per-thread heap (e.g. glibc call it as arena) and it is usually 1-8MB size. So, if
>> it is larger than 2MB, we can always use per-pmd tracking. iow, memory consumption
>> reduce to 1/512.
>
> Yes, and we all know objects allocated in one thread are never shared
> with other threads.. the producer-consumer pattern seems fairly popular
> and will destroy your argument.
THP also strikes the producer-consumer pattern. But, as far as I know, people haven't observed
significant performance degradation, thus I _guessed_ the performance-critical producer-consumer
pattern is rare. Just a guess.
>> My suggestion is, track per-pmd (i.e. 2M size) granularity and fix glibc too (current
>> glibc malloc has dynamically arena size adjusting feature and then it often become
>> less than 2M).
>
> The trouble with making this per pmd is that you then get the false
> sharing per pmd, so if there's shared data on the 2m page you'll not
> know where to put it.
>
> I also know of some folks who did a strict per-cpu allocator based on
> some kernel patches I hope to see posted sometime soon. This because if
> you have many more threads than cpus the wasted space in your areas is
> tremendous.
On Wed, 2012-05-30 at 05:41 -0400, KOSAKI Motohiro wrote:
> > Yes, and we all know objects allocated in one thread are never shared
> > with other threads.. the producer-consumer pattern seems fairly popular
> > and will destroy your argument.
>
> THP also strike producer-consumer pattern. But, as far as I know, people haven't observed
> significant performance degression. thus I _guessed_ performance critical producer-consumer
> pattern is rare. Just guess.
Not so: as long as the areas span PMDs, THP can back them using huge
pages, regardless of what objects live in that virtual space (or indeed
whether it's given out as objects at all or lives on the free-lists).
THP doesn't care about what lives in the virtual space; all it cares
about is ranges spanning PMDs that are populated densely enough.
On Wed, May 30, 2012 at 11:06:03AM +0200, Peter Zijlstra wrote:
> The trouble with making this per pmd is that you then get the false
> sharing per pmd, so if there's shared data on the 2m page you'll not
> know where to put it.
The numa hinting page fault is already scanning the pmd only, and it's
working fine. So reducing the page_autonuma to one per pmd would not
reduce the granularity of the information with the default settings
everyone has been using so far, but it would prevent this runtime
tweak from working:
echo 0 >/sys/kernel/mm/autonuma/knuma_scand/pmd
I'm thinking about it, but probably reducing the page_autonuma to one
per pmd is going to be the simplest solution, considering that by
default we only track the pmd anyway.
On Sat, 2012-05-26 at 13:42 -0700, Linus Torvalds wrote:
> I'm a *firm* believer that if it cannot be done automatically "well
> enough", the absolute last thing we should ever do is worry about the
> crazy people who think they can tweak it to perfection with complex
> interfaces.
>
> You can't do it, except for trivial loads (often benchmarks), and for
> very specific machines.
>
> So I think very strongly that we should entirely dismiss all the
> people who want to do manual placement and claim that they know what
> their loads do. They're either full of sh*t (most likely), or they
> have a very specific benchmark and platform that they are tuning for
> that is totally irrelevant to everybody else.
>
> What we *should* try to aim for is a system that doesn't do horribly
> badly right out of the box. IOW, no tuning what-so-ever (at most a
> kind of "yes, I want you to try to do the NUMA thing" flag to just
> enable it at all), and try to not suck.
>
> Seriously. "Try to avoid sucking" is *way* superior to "We can let the
> user tweak things to their hearts content". Because users won't get it
> right.
>
> Give the anal people a knob they can tweak, and tell them it does
> something fancy. And never actually wire the damn thing up. They'll be
> really happy with their OCD tweaking, and do lots of nice graphs that
> just show how the error bars are so big that you can find any damn
> pattern you want in random noise.
So the thing is, my homenode-per-process approach should work for
everything except the case where a single process out-strips a single
node in either cpu utilization or memory consumption.
Now I claim such processes are rare since nodes are big, typically 6-8
cores. Writing anything that can sustain parallel execution larger than
that is very specialist (and typically already employs strong data
separation).
Yes there are such things out there, some use JVMs some are virtual
machines some regular applications, but by and large processes are small
compared to nodes.
So my approach is focus on the normal case, and provide 2 system calls
to replace sched_setaffinity() and mbind() for the people who use those.
Now, maybe I shouldn't have bothered with the system calls.. but I
thought providing something better than hard-affinity would be nice.
Andrea went the other way and focused on these big processes. His
approach relies on a pte scanner and faults. His code builds a
page<->thread map using this data either moves memory around or
processes (I'm a little vague on the details simply because I haven't
seen it explained anywhere yet -- and the code is non-obvious).
I have a number of problems with both the approach as well as the
implementation.
On the approach my biggest complaints are:
- the complexity, it focuses on the rarest sort of processes and thus
results in a rather complex setup.
- load-balance state explosion, the page-tables become part of the
load-balance state -- this is a lot of extra state making
reproduction more 'interesting'.
- the overhead, since its per page, it needs per-page state.
- I don't see how it can reliably work for virtual machines, because
the host page<->thread (vcpu) relation doesn't reflect a
data<->compute relation in this case. The guest scheduler can move
the guest thread (the compute) part around between the vcpus at a
much higher rate than the host will update its page<->vcpu map.
On the implementation:
- he works around the scheduler instead of with it.
- it's x86 only (although he claims adding archs is trivial,
I've yet to see the first !x86 support).
- complete lack of useful comments describing the balancing goal and
approach.
The worst part is that I've asked for this stuff several times, but
nothing seems forthcoming.
Anyway, I prefer doing the simple thing first and then seeing if there's
need for more complexity, esp. given the overheads involved. But if you
prefer we can dive off the deep end :-)
* Peter Zijlstra <[email protected]> wrote:
> So the thing is, my homenode-per-process approach should work
> for everything except the case where a single process
> out-strips a single node in either cpu utilization or memory
> consumption.
>
> Now I claim such processes are rare since nodes are big,
> typically 6-8 cores. Writing anything that can sustain
> parallel execution larger than that is very specialist (and
> typically already employs strong data separation).
>
> Yes there are such things out there, some use JVMs some are
> virtual machines some regular applications, but by and large
> processes are small compared to nodes.
>
> So my approach is focus on the normal case, and provide 2
> system calls to replace sched_setaffinity() and mbind() for
> the people who use those.
We could certainly strike those from the first version, if Linus
agrees with the general approach.
This gives us degrees of freedom, as it's an obvious on/off kernel
feature which we can fix or remove if it does not work.
I'd even venture that it should be on by default, it's an
obvious placement strategy for everything sane that does not try
to nest some other execution environment within Linux (i.e.
specialist runtimes).
Thanks,
Ingo
On Wed, May 30, 2012 at 02:14:38AM +0200, Andrea Arcangeli wrote:
> I fully agree, I prefer to fix it and I was fully aware about
I did this yesterday; it saves a couple of pages on my NUMA system
with node shift = 9. However I'm not sure anymore if it's really worth
it... but since I did it I may as well keep it.
==
From: Andrea Arcangeli <[email protected]>
Subject: [PATCH] autonuma: autonuma_migrate_head[0] dynamic size
Reduce the autonuma_migrate_head array entries from MAX_NUMNODES to
num_possible_nodes() or zero if autonuma_impossible() is true.
Signed-off-by: Andrea Arcangeli <[email protected]>
---
arch/x86/mm/numa.c | 6 ++++--
arch/x86/mm/numa_32.c | 3 ++-
include/linux/memory_hotplug.h | 3 ++-
include/linux/mmzone.h | 8 +++++++-
include/linux/page_autonuma.h | 10 ++++++++--
mm/memory_hotplug.c | 2 +-
mm/page_autonuma.c | 5 +++--
7 files changed, 27 insertions(+), 10 deletions(-)
diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index 2d125be..a4a9e92 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -11,6 +11,7 @@
#include <linux/nodemask.h>
#include <linux/sched.h>
#include <linux/topology.h>
+#include <linux/page_autonuma.h>
#include <asm/e820.h>
#include <asm/proto.h>
@@ -192,7 +193,8 @@ int __init numa_add_memblk(int nid, u64 start, u64 end)
/* Initialize NODE_DATA for a node on the local memory */
static void __init setup_node_data(int nid, u64 start, u64 end)
{
- const size_t nd_size = roundup(sizeof(pg_data_t), PAGE_SIZE);
+ const size_t nd_size = roundup(autonuma_pglist_data_size(),
+ PAGE_SIZE);
bool remapped = false;
u64 nd_pa;
void *nd;
@@ -239,7 +241,7 @@ static void __init setup_node_data(int nid, u64 start, u64 end)
printk(KERN_INFO " NODE_DATA(%d) on node %d\n", nid, tnid);
node_data[nid] = nd;
- memset(NODE_DATA(nid), 0, sizeof(pg_data_t));
+ memset(NODE_DATA(nid), 0, autonuma_pglist_data_size());
NODE_DATA(nid)->node_id = nid;
NODE_DATA(nid)->node_start_pfn = start >> PAGE_SHIFT;
NODE_DATA(nid)->node_spanned_pages = (end - start) >> PAGE_SHIFT;
diff --git a/arch/x86/mm/numa_32.c b/arch/x86/mm/numa_32.c
index 534255a..d32d6cc 100644
--- a/arch/x86/mm/numa_32.c
+++ b/arch/x86/mm/numa_32.c
@@ -25,6 +25,7 @@
#include <linux/bootmem.h>
#include <linux/memblock.h>
#include <linux/module.h>
+#include <linux/page_autonuma.h>
#include "numa_internal.h"
@@ -194,7 +195,7 @@ void __init init_alloc_remap(int nid, u64 start, u64 end)
/* calculate the necessary space aligned to large page size */
size = node_memmap_size_bytes(nid, start_pfn, end_pfn);
- size += ALIGN(sizeof(pg_data_t), PAGE_SIZE);
+ size += ALIGN(autonuma_pglist_data_size(), PAGE_SIZE);
size = ALIGN(size, LARGE_PAGE_BYTES);
/* allocate node memory and the lowmem remap area */
diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index 910550f..76b1840 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -5,6 +5,7 @@
#include <linux/spinlock.h>
#include <linux/notifier.h>
#include <linux/bug.h>
+#include <linux/page_autonuma.h>
struct page;
struct zone;
@@ -130,7 +131,7 @@ extern void arch_refresh_nodedata(int nid, pg_data_t *pgdat);
*/
#define generic_alloc_nodedata(nid) \
({ \
- kzalloc(sizeof(pg_data_t), GFP_KERNEL); \
+ kzalloc(autonuma_pglist_data_size(), GFP_KERNEL); \
})
/*
* This definition is just for error path in node hotadd.
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index e66da74..ed5b0c0 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -701,10 +701,16 @@ typedef struct pglist_data {
#if !defined(CONFIG_SPARSEMEM)
struct page_autonuma *node_page_autonuma;
#endif
- struct list_head autonuma_migrate_head[MAX_NUMNODES];
unsigned long autonuma_nr_migrate_pages;
wait_queue_head_t autonuma_knuma_migrated_wait;
spinlock_t autonuma_lock;
+ /*
+ * Archs supporting AutoNUMA should allocate the pgdat with
+ * size autonuma_pglist_data_size() after including
+ * <linux/page_autonuma.h> and the below field must remain the
+ * last one of this structure.
+ */
+ struct list_head autonuma_migrate_head[0];
#endif
} pg_data_t;
diff --git a/include/linux/page_autonuma.h b/include/linux/page_autonuma.h
index 05d2862..1d02643 100644
--- a/include/linux/page_autonuma.h
+++ b/include/linux/page_autonuma.h
@@ -10,6 +10,7 @@ static inline void __init page_autonuma_init_flatmem(void) {}
#ifdef CONFIG_AUTONUMA
#include <linux/autonuma_flags.h>
+#include <linux/autonuma_types.h>
extern void __meminit page_autonuma_map_init(struct page *page,
struct page_autonuma *page_autonuma,
@@ -29,11 +30,10 @@ extern void __meminit pgdat_autonuma_init(struct pglist_data *);
struct page_autonuma;
#define PAGE_AUTONUMA_SIZE 0
#define SECTION_PAGE_AUTONUMA_SIZE 0
+#endif
#define autonuma_impossible() true
-#endif
-
static inline void pgdat_autonuma_init(struct pglist_data *pgdat) {}
#endif /* CONFIG_AUTONUMA */
@@ -50,4 +50,10 @@ extern void __init sparse_early_page_autonuma_alloc_node(struct page_autonuma **
int nodeid);
#endif
+/* inline won't work here */
+#define autonuma_pglist_data_size() (sizeof(struct pglist_data) + \
+ (autonuma_impossible() ? 0 : \
+ sizeof(struct list_head) * \
+ num_possible_nodes()))
+
#endif /* _LINUX_PAGE_AUTONUMA_H */
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 0d7e3ec..604995b 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -164,7 +164,7 @@ void register_page_bootmem_info_node(struct pglist_data *pgdat)
struct page *page;
struct zone *zone;
- nr_pages = PAGE_ALIGN(sizeof(struct pglist_data)) >> PAGE_SHIFT;
+ nr_pages = PAGE_ALIGN(autonuma_pglist_data_size()) >> PAGE_SHIFT;
page = virt_to_page(pgdat);
for (i = 0; i < nr_pages; i++, page++)
diff --git a/mm/page_autonuma.c b/mm/page_autonuma.c
index 131b5c9..c5c340b 100644
--- a/mm/page_autonuma.c
+++ b/mm/page_autonuma.c
@@ -23,8 +23,9 @@ static void __meminit __pgdat_autonuma_init(struct pglist_data *pgdat)
spin_lock_init(&pgdat->autonuma_lock);
init_waitqueue_head(&pgdat->autonuma_knuma_migrated_wait);
pgdat->autonuma_nr_migrate_pages = 0;
- for_each_node(node_iter)
- INIT_LIST_HEAD(&pgdat->autonuma_migrate_head[node_iter]);
+ if (!autonuma_impossible())
+ for_each_node(node_iter)
+ INIT_LIST_HEAD(&pgdat->autonuma_migrate_head[node_iter]);
}
#if !defined(CONFIG_SPARSEMEM)
On Fri, May 25, 2012 at 07:02:08PM +0200, Andrea Arcangeli wrote:
> We will set these bitflags only when the pmd and pte is non present.
>
> They work like PROT_NONE but they identify a request for the numa
> hinting page fault to trigger.
>
> Because we want to be able to set these bitflag in any established pte
> or pmd (while clearing the present bit at the same time) without
> losing information, these bitflags must never be set when the pte and
> pmd are present.
>
> For _PAGE_NUMA_PTE the pte bitflag used is _PAGE_PSE, which cannot be
> set on ptes and it also fits in between _PAGE_FILE and _PAGE_PROTNONE
> which avoids having to alter the swp entries format.
>
> For _PAGE_NUMA_PMD, we use a reserved bitflag. pmds never contain
> swap_entries but if in the future we'll swap transparent hugepages, we
> must keep in mind not to use the _PAGE_UNUSED2 bitflag in the swap
> entry format and to start the swap entry offset above it.
>
> PAGE_UNUSED2 is used by Xen but only on ptes established by ioremap,
> but it's never used on pmds so there's no risk of collision with Xen.
Thank you for looking at this from the Xen side. The interesting thing
is that I believe the _PAGE_PAT (or _PAGE_PSE) is actually used on
Xen on PTEs. It is used to mark the pages WC. <sigh>
>
> Signed-off-by: Andrea Arcangeli <[email protected]>
> ---
> arch/x86/include/asm/pgtable_types.h | 11 +++++++++++
> 1 files changed, 11 insertions(+), 0 deletions(-)
>
> diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
> index b74cac9..6e2d954 100644
> --- a/arch/x86/include/asm/pgtable_types.h
> +++ b/arch/x86/include/asm/pgtable_types.h
> @@ -71,6 +71,17 @@
> #define _PAGE_FILE (_AT(pteval_t, 1) << _PAGE_BIT_FILE)
> #define _PAGE_PROTNONE (_AT(pteval_t, 1) << _PAGE_BIT_PROTNONE)
>
> +/*
> + * Cannot be set on pte. The fact it's in between _PAGE_FILE and
> + * _PAGE_PROTNONE avoids having to alter the swp entries.
> + */
> +#define _PAGE_NUMA_PTE _PAGE_PSE
> +/*
> + * Cannot be set on pmd, if transparent hugepages will be swapped out
> + * the swap entry offset must start above it.
> + */
> +#define _PAGE_NUMA_PMD _PAGE_UNUSED2
> +
> #define _PAGE_TABLE (_PAGE_PRESENT | _PAGE_RW | _PAGE_USER | \
> _PAGE_ACCESSED | _PAGE_DIRTY)
> #define _KERNPG_TABLE (_PAGE_PRESENT | _PAGE_RW | _PAGE_ACCESSED | \
>
Hi Konrad,
On Wed, May 30, 2012 at 02:22:49PM -0400, Konrad Rzeszutek Wilk wrote:
> Thank you for loking at this from the xen side. The interesting thing
> is that I believe the _PAGE_PAT (or _PAGE_PSE) is actually used on
> Xen on PTEs. It is used to mark the pages WC. <sigh>
Oops, I'm using _PAGE_PSE too on the pte, but only when it's unmapped.
static inline int pte_numa(pte_t pte)
{
return (pte_flags(pte) &
(_PAGE_NUMA_PTE|_PAGE_PRESENT)) == _PAGE_NUMA_PTE;
}
And _PAGE_UNUSED2 (_PAGE_IOMAP) is used for the pmd but _PAGE_IOMAP by
Xen should only be set on ptes.
The only way to use _PAGE_PSE safely on the pte is if the pte is
non-present; is this what Xen is also doing? (in turn colliding with
pte_numa)
Now if I shrink the size of the page_autonuma to one entry per pmd
(instead of per pte) I may as well drop pte_numa entirely and only
leave pmd_numa. At the moment it's possible to switch between the two
models at runtime with sysctl (if one wants to do a more expensive
granular tracking). I'm still uncertain on the best way to shrink
the page_autonuma size; we'll see.
On Wed, May 30, 2012 at 08:34:06PM +0200, Andrea Arcangeli wrote:
> Hi Konrad,
>
> On Wed, May 30, 2012 at 02:22:49PM -0400, Konrad Rzeszutek Wilk wrote:
> > Thank you for loking at this from the xen side. The interesting thing
> > is that I believe the _PAGE_PAT (or _PAGE_PSE) is actually used on
> > Xen on PTEs. It is used to mark the pages WC. <sigh>
>
> Oops, I'm using _PAGE_PSE too on the pte, but only when it's unmapped.
>
> static inline int pte_numa(pte_t pte)
> {
> return (pte_flags(pte) &
> (_PAGE_NUMA_PTE|_PAGE_PRESENT)) == _PAGE_NUMA_PTE;
> }
>
> And _PAGE_UNUSED2 (_PAGE_IOMAP) is used for the pmd but _PAGE_IOMAP by
> Xen should only be set on ptes.
<nods>
>
> The only way to use _PAGE_PSE safe on the pte is if the pte is
> non-present, is this what Xen is also doing? (in turn colliding with
> pte_numa)
The only time the _PAGE_PSE (_PAGE_PAT) is set is when
_PAGE_PCD | _PAGE_PWT are set. It is this ugly transformation
of doing:
if (pat_enabled && _PAGE_PWT | _PAGE_PCD)
pte = ~(_PAGE_PWT | _PAGE_PCD) | _PAGE_PAT;
and then writing the pte with the 7th bit set instead of the
2nd and 3rd to mark it as WC. There is a corresponding reverse too
(to read the pte - so the pte_val calls) - so if _PAGE_PAT is
detected it will remove the _PAGE_PAT and return the PTE as
if it had _PAGE_PWT | _PAGE_PCD.
So that little bit of code will need some tweaking - as it does
that even if _PAGE_PRESENT is not set. Meaning it would
transform your _PAGE_PAT to _PAGE_PWT | _PAGE_PCD. Gah!
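Restating that in code form (an illustrative paraphrase of the above,
not the actual Xen pte_val/make_pte implementation; the helper names are
made up):

/* On the write side: Linux WC encoding -> the PAT-bit encoding. */
static pteval_t xen_wc_fixup_write(pteval_t val)	/* hypothetical */
{
	if (pat_enabled && (val & (_PAGE_PWT | _PAGE_PCD)))
		val = (val & ~(pteval_t)(_PAGE_PWT | _PAGE_PCD)) | _PAGE_PAT;
	return val;
}

/* On the read side: the reverse translation. */
static pteval_t xen_wc_fixup_read(pteval_t val)		/* hypothetical */
{
	if (pat_enabled && (val & _PAGE_PAT))
		val = (val & ~(pteval_t)_PAGE_PAT) | _PAGE_PWT | _PAGE_PCD;
	return val;
}

Since neither direction checks _PAGE_PRESENT, a non-present pte that
only carries _PAGE_NUMA_PTE (== _PAGE_PSE == _PAGE_PAT) would be
translated back into _PAGE_PWT | _PAGE_PCD on read, which is the
collision with pte_numa() described above.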
>
> Now if I shrink the size of the page_autonuma to one entry per pmd
> (instead of per pte) I may as well drop pte_numa entirely and only
> leave pmd_numa. At the moment it's possible to switch between the two
> models at runtime with sysctl (if one wants to do a more expensive
> granular tracking). I'm still uncertain on the best way to shrink
> the page_autonuma size we'll see.
OK. I can whip up a patch to deal with the 'Gah!' case easily if needed.
Thanks!
On Fri, May 25, 2012 at 07:02:10PM +0200, Andrea Arcangeli wrote:
> Implement generic version of the methods. They're used when
> CONFIG_AUTONUMA=n, and they're a noop.
I think you can roll that into the previous patch.
>
> Signed-off-by: Andrea Arcangeli <[email protected]>
> ---
> include/asm-generic/pgtable.h | 12 ++++++++++++
> 1 files changed, 12 insertions(+), 0 deletions(-)
>
> diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
> index fa596d9..780f707 100644
> --- a/include/asm-generic/pgtable.h
> +++ b/include/asm-generic/pgtable.h
> @@ -521,6 +521,18 @@ static inline int pmd_trans_unstable(pmd_t *pmd)
> #endif
> }
>
> +#ifndef CONFIG_AUTONUMA
> +static inline int pte_numa(pte_t pte)
> +{
> + return 0;
> +}
> +
> +static inline int pmd_numa(pmd_t pmd)
> +{
> + return 0;
> +}
> +#endif /* CONFIG_AUTONUMA */
> +
> #endif /* CONFIG_MMU */
>
> #endif /* !__ASSEMBLY__ */
>
On Fri, May 25, 2012 at 07:02:12PM +0200, Andrea Arcangeli wrote:
> This function makes it easy to bind the per-node knuma_migrated
> threads to their respective NUMA nodes. Those threads take memory from
> the other nodes (in round robin with a incoming queue for each remote
> node) and they move that memory to their local node.
>
> Signed-off-by: Andrea Arcangeli <[email protected]>
> ---
> include/linux/kthread.h | 1 +
> kernel/kthread.c | 23 +++++++++++++++++++++++
> 2 files changed, 24 insertions(+), 0 deletions(-)
>
> diff --git a/include/linux/kthread.h b/include/linux/kthread.h
> index 0714b24..e733f97 100644
> --- a/include/linux/kthread.h
> +++ b/include/linux/kthread.h
> @@ -33,6 +33,7 @@ struct task_struct *kthread_create_on_node(int (*threadfn)(void *data),
> })
>
> void kthread_bind(struct task_struct *k, unsigned int cpu);
> +void kthread_bind_node(struct task_struct *p, int nid);
> int kthread_stop(struct task_struct *k);
> int kthread_should_stop(void);
> bool kthread_freezable_should_stop(bool *was_frozen);
> diff --git a/kernel/kthread.c b/kernel/kthread.c
> index 3d3de63..48b36f9 100644
> --- a/kernel/kthread.c
> +++ b/kernel/kthread.c
> @@ -234,6 +234,29 @@ void kthread_bind(struct task_struct *p, unsigned int cpu)
> EXPORT_SYMBOL(kthread_bind);
>
> /**
> + * kthread_bind_node - bind a just-created kthread to the CPUs of a node.
> + * @p: thread created by kthread_create().
> + * @nid: node (might not be online, must be possible) for @k to run on.
> + *
> + * Description: This function is equivalent to set_cpus_allowed(),
> + * except that @nid doesn't need to be online, and the thread must be
> + * stopped (i.e., just returned from kthread_create()).
> + */
> +void kthread_bind_node(struct task_struct *p, int nid)
> +{
> + /* Must have done schedule() in kthread() before we set_task_cpu */
> + if (!wait_task_inactive(p, TASK_UNINTERRUPTIBLE)) {
> + WARN_ON(1);
> + return;
> + }
> +
> + /* It's safe because the task is inactive. */
> + do_set_cpus_allowed(p, cpumask_of_node(nid));
> + p->flags |= PF_THREAD_BOUND;
> +}
> +EXPORT_SYMBOL(kthread_bind_node);
_GPL?
> +
> +/**
> * kthread_stop - stop a thread created by kthread_create().
> * @k: thread created by kthread_create().
> *
>
On Fri, May 25, 2012 at 07:02:22PM +0200, Andrea Arcangeli wrote:
> This is where the dynamically allocated sched_autonuma structure is
> being handled.
>
> The reason for keeping this outside of the task_struct besides not
> using too much kernel stack, is to only allocate it on NUMA
> hardware. So the not NUMA hardware only pays the memory of a pointer
> in the kernel stack (which remains NULL at all times in that case).
>
> If the kernel is compiled with CONFIG_AUTONUMA=n, not even the pointer
> is allocated on the kernel stack of course.
>
> Signed-off-by: Andrea Arcangeli <[email protected]>
> ---
> kernel/fork.c | 24 ++++++++++++++----------
> 1 files changed, 14 insertions(+), 10 deletions(-)
>
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 237c34e..d323eb1 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -206,6 +206,7 @@ static void account_kernel_stack(struct thread_info *ti, int account)
> void free_task(struct task_struct *tsk)
> {
> account_kernel_stack(tsk->stack, -1);
> + free_sched_autonuma(tsk);
> free_thread_info(tsk->stack);
> rt_mutex_debug_task_free(tsk);
> ftrace_graph_exit_task(tsk);
> @@ -260,6 +261,8 @@ void __init fork_init(unsigned long mempages)
> /* do the arch specific task caches init */
> arch_task_cache_init();
>
> + sched_autonuma_init();
> +
> /*
> * The default maximum number of threads is set to a safe
> * value: the thread structures can take up at most half
> @@ -292,21 +295,21 @@ static struct task_struct *dup_task_struct(struct task_struct *orig)
> struct thread_info *ti;
> unsigned long *stackend;
> int node = tsk_fork_get_node(orig);
> - int err;
>
> tsk = alloc_task_struct_node(node);
> - if (!tsk)
> + if (unlikely(!tsk))
> return NULL;
>
> ti = alloc_thread_info_node(tsk, node);
> - if (!ti) {
> - free_task_struct(tsk);
> - return NULL;
> - }
> + if (unlikely(!ti))
Should those "unlikely" have their own commit? Did you
run this with the likely/unlikely tracer to confirm that it
does give a speedup?
> + goto out_task_struct;
>
> - err = arch_dup_task_struct(tsk, orig);
> - if (err)
> - goto out;
> + if (unlikely(arch_dup_task_struct(tsk, orig)))
> + goto out_thread_info;
> +
> + if (unlikely(alloc_sched_autonuma(tsk, orig, node)))
> + /* free_thread_info() undoes arch_dup_task_struct() too */
> + goto out_thread_info;
>
> tsk->stack = ti;
>
> @@ -334,8 +337,9 @@ static struct task_struct *dup_task_struct(struct task_struct *orig)
>
> return tsk;
>
> -out:
> +out_thread_info:
> free_thread_info(ti);
> +out_task_struct:
> free_task_struct(tsk);
> return NULL;
> }
>
Hi,
On Tue, May 29, 2012 at 05:43:09PM +0200, Petr Holasek wrote:
> Similar problem with __autonuma_migrate_page_remove here.
>
> [ 1945.516632] ------------[ cut here ]------------
> [ 1945.516636] WARNING: at lib/list_debug.c:50 __list_del_entry+0x63/0xd0()
> [ 1945.516642] Hardware name: ProLiant DL585 G5
> [ 1945.516651] list_del corruption, ffff88017d68b068->next is LIST_POISON1 (dead000000100100)
> [ 1945.516682] Modules linked in: ipt_MASQUERADE nf_conntrack_netbios_ns nf_conntrack_broadcast ip6table_mangle lockd ip6t_REJECT sunrpc nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_filter ip6_tables iptable_nat nf_nat iptable_mangle nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack mperf freq_table kvm_amd kvm pcspkr amd64_edac_mod edac_core serio_raw bnx2 microcode edac_mce_amd shpchp k10temp hpilo ipmi_si ipmi_msghandler hpwdt qla2xxx hpsa ata_generic pata_acpi scsi_transport_fc scsi_tgt cciss pata_amd radeon i2c_algo_bit drm_kms_helper ttm drm i2c_core [last unloaded: scsi_wait_scan]
> [ 1945.516694] Pid: 150, comm: knuma_migrated0 Tainted: G W 3.4.0aa_alpha+ #3
> [ 1945.516701] Call Trace:
> [ 1945.516710] [<ffffffff8105788f>] warn_slowpath_common+0x7f/0xc0
> [ 1945.516717] [<ffffffff81057986>] warn_slowpath_fmt+0x46/0x50
> [ 1945.516726] [<ffffffff812f9713>] __list_del_entry+0x63/0xd0
> [ 1945.516735] [<ffffffff812f9791>] list_del+0x11/0x40
> [ 1945.516743] [<ffffffff81165b98>] __autonuma_migrate_page_remove+0x48/0x80
> [ 1945.516746] [<ffffffff81165e66>] knuma_migrated+0x296/0x8a0
> [ 1945.516749] [<ffffffff8107a200>] ? wake_up_bit+0x40/0x40
> [ 1945.516758] [<ffffffff81165bd0>] ? __autonuma_migrate_page_remove+0x80/0x80
> [ 1945.516766] [<ffffffff81079cc3>] kthread+0x93/0xa0
> [ 1945.516780] [<ffffffff81626f24>] kernel_thread_helper+0x4/0x10
> [ 1945.516791] [<ffffffff81079c30>] ? flush_kthread_worker+0x80/0x80
> [ 1945.516798] [<ffffffff81626f20>] ? gs_change+0x13/0x13
> [ 1945.516800] ---[ end trace 7cab294af87bd79f ]---
I didn't manage to reproduce it on my hardware, but it seems this was
caused by autonuma_migrate_split_huge_page: the tail page list
linking wasn't surrounded by the compound lock to make the list insertion
and the migrate_nid setting atomic, as happens everywhere else (the
caller holding the lock on the head page wasn't enough to make the
tails stable too).
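In code terms the fix amounts to something like the sketch below (an
illustration of the locking rule only, not the actual AutoNUMA15 commit;
the field and helper names are assumptions):

static void autonuma_migrate_link_tail(struct page *tail_page,
				       struct page_autonuma *tail_pa,
				       int dst_nid,
				       struct list_head *migrate_head)
{
	unsigned long flags;

	/* The tail's own compound lock must cover both updates. */
	flags = compound_lock_irqsave(tail_page);
	tail_pa->autonuma_migrate_nid = dst_nid;
	list_add_tail(&tail_pa->autonuma_migrate_node, migrate_head);
	compound_unlock_irqrestore(tail_page, flags);
}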
I released an AutoNUMA15 branch that includes all pending fixes:
git clone --reference linux -b autonuma15 git://git.kernel.org/pub/scm/linux/kernel/git/andrea/aa.git
Thanks,
Andrea
On Wed, 2012-05-30 at 15:49 +0200, Andrea Arcangeli wrote:
>
> I'm thinking about it but probably reducing the page_autonuma to one
> per pmd is going to be the simplest solution considering by default we
> only track the pmd anyway.
Do also consider that some archs have a larger base page size, so their
effective PMD size is increased as well.
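For a rough feel of the numbers (assuming 8-byte page-table entries and
a full page per table level, which is arch-specific):

	4KB pages:   4096/8 =  512 ptes per pmd ->  512 * 4KB  =   2MB per pmd
	64KB pages: 65536/8 = 8192 ptes per pmd -> 8192 * 64KB = 512MB per pmd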
Hi Andrea, everyone..
AA> Changelog from alpha13 to alpha14:
AA> [...]
AA> o autonuma_balance only runs along with run_rebalance_domains, to
AA> avoid altering the scheduler runtime. [...]
AA> [...] This change has not
AA> yet been tested on specjbb or more schedule intensive benchmarks,
AA> but I don't expect measurable NUMA affinity regressions. [...]
Perhaps I can contribute a bit to the SPECjbb tests.
I got SPECjbb2005 results for 3.4-rc2 mainline, numasched,
autonuma-alpha10, and autonuma-alpha13. If you judge the data is OK, it
may suit a comparison between autonuma-alpha13 and 14 to verify NUMA
affinity regressions.
The system is an Intel 2-socket Blade. Each NUMA node has 6 cores (+6
hyperthreads) and 12 GB RAM. Different permutations of THP, KSM, and VM
memory size were tested for each kernel.
I'll have to leave the analysis of each variable for you, as I'm not
familiar w/ the code and expected impacts; but I'm perfectly fine with
providing more details about the tests, environment and procedures, and
even some reruns, if needed.
Please CC me on questions and comments.
Environment:
------------
Host:
- Enterprise Linux Distro
- Kernel: 3.4-rc2 (either mainline, or patched w/ numasched,
autonuma-alpha10, or autonuma-alpha13)
- 2 NUMA nodes. 6 cores + 6 hyperthreads/node, 12 GB RAM/node.
(total of 24 logical CPUs and 24 GB RAM)
- Hypervisor: qemu-kvm 1.0.50 (+ memsched patches only for numasched)
VMs:
- Enterprise Linux Distro
- Distro Kernel
1 Main VM (VM1) -- relevant benchmark score.
- 12 vCPUs
- 12 GB (for '< 1 Node' configuration) or 14 GB (for '> 1 Node'
configuration)
2 Noise VMs (VM2 and VM3)
- each noise VM has half of the remaining resources.
- 6 vCPUs
- 4 GB (for '< 1 Node' configuration) or 3 GB ('> 1 Node' configuration)
(to sum 20 GB w/ main VM + 4 GB for host = total 24 GB)
Settings:
- Swapping disabled on host and VMs.
- Memory Overcommit enabled on host and VMs.
- THP on host is a variable. THP disabled on VMs.
- KSM on host is a variable. KSM disabled on VMs.
Results
=======
Reference is mainline kernel with THP disabled (its score is
approximately 100%). It performed similarly (less than 2% difference) on
the 4 permutations of KSM and Main VM memory size.
For the results of all permutations, see chart [1].
One interesting permutation seems to be: No THP (disabled); KSM (enabled).
Interpretation:
- higher is better;
- main VM should perform better than noise VMs;
- noise VMs should perform similarly.
Main VM < 1 Node
----------------
              Main VM   Noise VM   Noise VM
mainline        ~100%        60%        60%
numasched *  50%/135%    30%/58%    40%/68%
autonuma-a10     125%        60%        60%
autonuma-a13     126%        32%        32%
* numasched yielded a wide range of scores. Is this behavior expected?
Main VM > 1 Node
----------------
              Main VM   Noise VM   Noise VM
mainline        ~100%        60%        59%
numasched         60%        48%        48%
autonuma-a10      62%        37%        38%
autonuma-a13     125%        61%        63%
Considerations:
---------------
The 3 VMs ran SPECjbb2005, synchronously starting the benchmark.
For the benchmark run to take about the same time on the 3 VMs, its
configuration for the Noise VMs is different than for the Main VM.
So comparing VM1 scores w/ VM2 or VM3 scores is not reasonable.
But comparing scores between VM2 and VM3 is perfectly fine (it's
evidence of the performed balancing).
Sometimes both autonuma and numasched prioritized one of the Noise VMs
over the other Noise VM, or even over the Main VM. In these cases, some
reruns would yield scores of 'expected proportion', given the VMs
configuration (Main VM w/ the highest score, both Noise VMs with lower
scores which are about the same).
The non-expected proportion scores happened less often w/
autonuma-alpha13, followed by autonuma-alpha10, and finally numasched
(i.e., numasched had the greatest rate of non-expected proportion scores).
For most permutations, numasched didn't yield scores of expected
proportion. I'd like to know how likely this is to happen, before
performing additional runs to confirm it. Could anyone provide
evidence or thoughts?
Links:
------
[1] http://dl.dropbox.com/u/82832537/kvm-numa-comparison-0.png
--
Mauricio Faria de Oliveira
IBM Linux Technology Center
Hi,
On Thu, May 31, 2012 at 08:18:59PM +0200, Peter Zijlstra wrote:
> On Wed, 2012-05-30 at 15:49 +0200, Andrea Arcangeli wrote:
> >
> > I'm thinking about it but probably reducing the page_autonuma to one
> > per pmd is going to be the simplest solution considering by default we
> > only track the pmd anyway.
>
> Do also consider that some archs have larger base page size. So their
> effective PMD size is increased as well.
With a larger PAGE_SIZE like 64k I doubt this would be a concern, it's
just 4k is too small.
Now I did a number of cleanups and already added a number of comments.
I'll write the badly needed docs on the autonuma_balance() function
ASAP, but at least those cleanups are already committed in the
autonuma branch of my git tree.
From my side, the thing that annoys me the most at the moment is the
page_autonuma size.
So I gave more thought to the idea outlined above, but I gave up after
less than a minute of thinking about what I could run into doing
that. The fact we do pmd tracking in knuma_scand by default (possible
to disable with sysfs) is irrelevant. Unless I'm only going to track
THP pages, 1 page_autonuma per pmd won't work: when the pmd_numa fault
triggers it's all nonlinear on whatever scattered 4k pages are pointed
to by the ptes, not shared pagecache especially.
I kept thinking more about it, and I should by now have figured out how
to reduce the page_autonuma to 12 bytes per 4k page on both 32bit and
64bit without losing information (no code written yet, but this one
should work). I just couldn't shrink it below 12 bytes without going
into ridiculously high and worthless complexity.
After this change AutoNUMA will bail out if any of the two below
conditions is true:
1) MAX_NUMNODES >= 65536
2) any NUMA node pgdat.node_spanned_pages >= 16TB/PAGE_SIZE
That means AutoNUMA will disengage itself automatically at boot on x86
NUMA systems with more than 1152921504606846976 bytes of RAM; that's 60
bits of physical address space, and no x86 CPU even gets that far in
terms of physical address space.
Other archs requiring more memory than that will hopefully have a
PAGE_SIZE > 4KB (in turn doubling the per-node limit of RAM at
every doubling of the PAGE_SIZE, without having to increase the size of
the page_autonuma beyond 12 bytes even on 64bit).
A packed 12 bytes per page should be all I need (maybe some arch with
alignment troubles may prefer to make it 16 bytes, but on x86 packed
should work). So on x86 that's 0.29% of RAM used for autonuma, and only
spent when booting on NUMA hardware (and trivial to get rid of by
passing "noautonuma" on the command line).
If I left the anti-false-sharing last_nid information in the page
structure plus a pointer to a dynamic structure, that would still be
about 12 bytes. So I'd rather spend those 12 bytes to avoid having to
point to a dynamic object, which in fact would waste even more memory
in addition to the 12 bytes of pointer+last_nid.
The details of the solution:
struct page_autonuma {
short autonuma_last_nid;
short autonuma_migrate_nid;
unsigned int pfn_offset_next;
unsigned int pfn_offset_prev;
} __attribute__((packed));
page_autonuma can only point to a page that belongs to the same node
(page_autonuma is queued into the
NODE_DATA(autonuma_migrate_nid)->autonuma_migrate_head[src_nid]) where
src_nid is the source node that page_autonuma belongs to, so all pages
in the autonuma_migrate_head[src_nid] lru must come from the same
src_nid. So the next page_autonuma in the list will be
lookup_page_autonuma(pfn_to_page(NODE_DATA(src_nid)->node_start_pfn +
page_autonuma->pfn_offset_next)) etc..
Of course all the list_add/del operations must be hardcoded specially for
this, but it's not a conceptually difficult solution; it's just that we
can't use list.h and straight pointers anymore, and some conversion must
happen.
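A minimal sketch of the lookup described above (the helper itself is
hypothetical; only lookup_page_autonuma() and the struct fields come
from the description):

static struct page_autonuma *autonuma_migrate_next(int src_nid,
						    struct page_autonuma *pa)
{
	unsigned long pfn;

	/* offsets are relative to the start of the source node */
	pfn = NODE_DATA(src_nid)->node_start_pfn + pa->pfn_offset_next;
	return lookup_page_autonuma(pfn_to_page(pfn));
}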
On Wed, May 30, 2012 at 04:01:51PM -0400, Konrad Rzeszutek Wilk wrote:
> The only time the _PAGE_PSE (_PAGE_PAT) is set is when
> _PAGE_PCD | _PAGE_PWT are set. It is this ugly transformation
> of doing:
>
> if (pat_enabled && _PAGE_PWT | _PAGE_PCD)
> pte = ~(_PAGE_PWT | _PAGE_PCD) | _PAGE_PAT;
>
> and then writting the pte with the 7th bit set instead of the
> 2nd and 3rd to mark it as WC. There is a corresponding reverse too
> (to read the pte - so the pte_val calls) - so if _PAGE_PAT is
> detected it will remove the _PAGE_PAT and return the PTE as
> if it had _PAGE_PWT | _PAGE_PCD.
>
> So that little bit of code will need some tweaking - as it does
> that even if _PAGE_PRESENT is not set. Meaning it would
> transform your _PAGE_PAT to _PAGE_PWT | _PAGE_PCD. Gah!
It looks like this is disabled in current upstream?
8eaffa67b43e99ae581622c5133e20b0f48bcef1
> OK. I can whip up a patch to deal with the 'Gah!' case easily if needed.
That would help! But again it looks disabled in Xen?
About the Linux host (no Xen): when I decided to use PSE I checked this part:
/* Set PWT to Write-Combining. All other bits stay the same */
/*
* PTE encoding used in Linux:
* PAT
* |PCD
* ||PWT
* |||
* 000 WB _PAGE_CACHE_WB
* 001 WC _PAGE_CACHE_WC
* 010 UC- _PAGE_CACHE_UC_MINUS
* 011 UC _PAGE_CACHE_UC
* PAT bit unused
*/
I need to go read the spec PDF and audit the code against it to be
sure, but if my interpretation is correct, PAT is never set on a Linux
host (no virt) the way the relevant MSRs are programmed.
If I couldn't use the PSE (/PAT) it'd screw with 32bit because I need
to poke a bit between _PAGE_BIT_DIRTY and _PAGE_BIT_GLOBAL to avoid
losing space on the swap entry, and there's just one bit in that range
(PSE).
_PAGE_UNUSED1 (besides being used by Xen) wouldn't work unless I changed
the swp entry format for 32bit x86, reducing the max amount of swap
(conditional to CONFIG_AUTONUMA, so it wouldn't be the end of the
world; plus the amount of swap on 32bit NUMA may not be so important).
On Tue, Jun 05, 2012 at 07:13:54PM +0200, Andrea Arcangeli wrote:
> On Wed, May 30, 2012 at 04:01:51PM -0400, Konrad Rzeszutek Wilk wrote:
> > The only time the _PAGE_PSE (_PAGE_PAT) is set is when
> > _PAGE_PCD | _PAGE_PWT are set. It is this ugly transformation
> > of doing:
> >
> > if (pat_enabled && _PAGE_PWT | _PAGE_PCD)
> > pte = ~(_PAGE_PWT | _PAGE_PCD) | _PAGE_PAT;
> >
> > and then writting the pte with the 7th bit set instead of the
> > 2nd and 3rd to mark it as WC. There is a corresponding reverse too
> > (to read the pte - so the pte_val calls) - so if _PAGE_PAT is
> > detected it will remove the _PAGE_PAT and return the PTE as
> > if it had _PAGE_PWT | _PAGE_PCD.
> >
> > So that little bit of code will need some tweaking - as it does
> > that even if _PAGE_PRESENT is not set. Meaning it would
> > transform your _PAGE_PAT to _PAGE_PWT | _PAGE_PCD. Gah!
>
> It looks like this is disabled in current upstream?
> 8eaffa67b43e99ae581622c5133e20b0f48bcef1
Yup. But it is a temporary bandaid that I hope to fix soon.
>
> > OK. I can whip up a patch to deal with the 'Gah!' case easily if needed.
>
> That would help! But again it looks disabled in Xen?
>
> About linux host (no xen) when I decided to use PSE I checked this part:
>
> /* Set PWT to Write-Combining. All other bits stay the same */
> /*
> * PTE encoding used in Linux:
> * PAT
> * |PCD
> * ||PWT
> * |||
> * 000 WB _PAGE_CACHE_WB
> * 001 WC _PAGE_CACHE_WC
> * 010 UC- _PAGE_CACHE_UC_MINUS
> * 011 UC _PAGE_CACHE_UC
> * PAT bit unused
> */
>
> I need to go read the specs pdf and audit the code against the specs
> to be sure but if my interpretation correct, PAT is never set on linux
> host (novirt) the way the relevant msr are programmed.
>
> If I couldn't use the PSE (/PAT) it'd screw with 32bit because I need
> to poke a bit between _PAGE_BIT_DIRTY and _PAGE_BIT_GLOBAL to avoid
> losing space on the swap entry, and there's just one bit in that range
> (PSE).
>
> _PAGE_UNUSED1 (besides it's used by Xen) wouldn't work unless I change
> the swp entry format for 32bit x86 reducing the max amount of swap
> (conditional to CONFIG_AUTONUMA so it wouldn't be the end of the
> world, plus the amount of swap on 32bit NUMA may not be so important)
Yeah, I concur. I think we should stick with _PAGE_PAT (/PSE), and I can
cook up the appropriate patch for it on the Xen side.
On Tue, Jun 05, 2012 at 01:17:27PM -0400, Konrad Rzeszutek Wilk wrote:
> Yup. But it is a temporary bandaid that I hope to fix soon.
Ok.
> Yeah, I concur. I think stick with _PAGE_PAT (/PSE) and I can cook
> up the appropiate patch for it on the Xen side.
Great, thanks!
Andrea
On 06/01/2012 02:08 AM, Andrea Arcangeli wrote:
> Hi,
>
> On Tue, May 29, 2012 at 05:43:09PM +0200, Petr Holasek wrote:
>> Similar problem with __autonuma_migrate_page_remove here.
>>
>> [ 1945.516632] ------------[ cut here ]------------
>> [ 1945.516636] WARNING: at lib/list_debug.c:50 __list_del_entry+0x63/0xd0()
>> [ 1945.516642] Hardware name: ProLiant DL585 G5
>> [ 1945.516651] list_del corruption, ffff88017d68b068->next is LIST_POISON1 (dead000000100100)
>> [ 1945.516682] Modules linked in: ipt_MASQUERADE nf_conntrack_netbios_ns nf_conntrack_broadcast ip6table_mangle lockd ip6t_REJECT sunrpc nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_filter ip6_tables iptable_nat nf_nat iptable_mangle nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack mperf freq_table kvm_amd kvm pcspkr amd64_edac_mod edac_core serio_raw bnx2 microcode edac_mce_amd shpchp k10temp hpilo ipmi_si ipmi_msghandler hpwdt qla2xxx hpsa ata_generic pata_acpi scsi_transport_fc scsi_tgt cciss pata_amd radeon i2c_algo_bit drm_kms_helper ttm drm i2c_core [last unloaded: scsi_wait_scan]
>> [ 1945.516694] Pid: 150, comm: knuma_migrated0 Tainted: G W 3.4.0aa_alpha+ #3
>> [ 1945.516701] Call Trace:
>> [ 1945.516710] [<ffffffff8105788f>] warn_slowpath_common+0x7f/0xc0
>> [ 1945.516717] [<ffffffff81057986>] warn_slowpath_fmt+0x46/0x50
>> [ 1945.516726] [<ffffffff812f9713>] __list_del_entry+0x63/0xd0
>> [ 1945.516735] [<ffffffff812f9791>] list_del+0x11/0x40
>> [ 1945.516743] [<ffffffff81165b98>] __autonuma_migrate_page_remove+0x48/0x80
>> [ 1945.516746] [<ffffffff81165e66>] knuma_migrated+0x296/0x8a0
>> [ 1945.516749] [<ffffffff8107a200>] ? wake_up_bit+0x40/0x40
>> [ 1945.516758] [<ffffffff81165bd0>] ? __autonuma_migrate_page_remove+0x80/0x80
>> [ 1945.516766] [<ffffffff81079cc3>] kthread+0x93/0xa0
>> [ 1945.516780] [<ffffffff81626f24>] kernel_thread_helper+0x4/0x10
>> [ 1945.516791] [<ffffffff81079c30>] ? flush_kthread_worker+0x80/0x80
>> [ 1945.516798] [<ffffffff81626f20>] ? gs_change+0x13/0x13
>> [ 1945.516800] ---[ end trace 7cab294af87bd79f ]---
> I didn't manage to reproduce it on my hardware but it seems this was
> caused by the autonuma_migrate_split_huge_page: the tail page list
> linking wasn't surrounded by the compound lock to make list insertion
> and migrate_nid setting atomic like it happens everywhere else (the
> caller holding the lock on the head page wasn't enough to make the
> tails stable too).
>
> I released an AutoNUMA15 branch that includes all pending fixes:
>
> git clone --reference linux -b autonuma15 git://git.kernel.org/pub/scm/linux/kernel/git/andrea/aa.git
Hi Andrea and all,
when I tested the autonuma patch set, a kernel panic occurred while
booting the newly compiled kernel. I also hit the issue in the latest
Linus tree (3.5.0-rc1). A partial call trace is:
[ 2.635443] kernel BUG at include/linux/gfp.h:318!
[ 2.642998] invalid opcode: 0000 [#1] SMP
[ 2.651148] CPU 0
[ 2.653911] Modules linked in:
[ 2.662388]
[ 2.664657] Pid: 1, comm: swapper/0 Not tainted 3.4.0+ #1 HP ProLiant
DL585 G7
[ 2.677609] RIP: 0010:[<ffffffff811b044d>] [<ffffffff811b044d>]
new_slab+0x26d/0x310
[ 2.692803] RSP: 0018:ffff880135ad3c80 EFLAGS: 00010246
[ 2.702541] RAX: 0000000000000000 RBX: ffff880137008c80 RCX:
ffff8801377db780
[ 2.716402] RDX: ffff880135bf8000 RSI: 0000000000000003 RDI:
00000000000052d0
[ 2.728471] RBP: ffff880135ad3cb0 R08: 0000000000000000 R09:
0000000000000000
[ 2.743791] R10: 0000000000000001 R11: 0000000000000000 R12:
00000000000040d0
[ 2.756111] R13: ffff880137008c80 R14: 0000000000000001 R15:
0000000000030027
[ 2.770428] FS: 0000000000000000(0000) GS:ffff880137600000(0000)
knlGS:0000000000000000
[ 2.786319] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 2.798100] CR2: 0000000000000000 CR3: 000000000196b000 CR4:
00000000000007f0
[ 2.810264] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
0000000000000000
[ 2.824889] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7:
0000000000000400
[ 2.836882] Process swapper/0 (pid: 1, threadinfo ffff880135ad2000,
task ffff880135bf8000)
[ 2.856452] Stack:
[ 2.859175] ffff880135ad3ca0 0000000000000002 0000000000000001
ffff880137008c80
[ 2.872325] ffff8801377db760 ffff880137008c80 ffff880135ad3db0
ffffffff8167632f
[ 2.887248] ffffffff8167e0e7 0000000000000000 ffff8801377db780
ffff8801377db770
[ 2.899666] Call Trace:
[ 2.906792] [<ffffffff8167632f>] __slab_alloc+0x351/0x4d2
[ 2.914238] [<ffffffff8167e0e7>] ? mutex_lock_nested+0x2e7/0x390
[ 2.925157] [<ffffffff813350d8>] ? alloc_cpumask_var_node+0x28/0x90
[ 2.939430] [<ffffffff81c81e50>] ? sched_init_smp+0x16a/0x3b4
[ 2.949790] [<ffffffff811b1a04>] kmem_cache_alloc_node_trace+0xa4/0x250
[ 2.964259] [<ffffffff8109e72f>] ? kzalloc+0xf/0x20
[ 2.976298] [<ffffffff81c81e50>] ? sched_init_smp+0x16a/0x3b4
[ 2.984664] [<ffffffff81c81e50>] sched_init_smp+0x16a/0x3b4
[ 2.997217] [<ffffffff81c66d57>] kernel_init+0xe3/0x215
[ 3.006848] [<ffffffff810d4c3d>] ? trace_hardirqs_on_caller+0x10d/0x1a0
[ 3.020673] [<ffffffff8168c3b4>] kernel_thread_helper+0x4/0x10
[ 3.031154] [<ffffffff81682470>] ? retint_restore_args+0x13/0x13
[ 3.040816] [<ffffffff81c66c74>] ? start_kernel+0x401/0x401
[ 3.052881] [<ffffffff8168c3b0>] ? gs_change+0x13/0x13
[ 3.061692] Code: 1f 80 00 00 00 00 fa 66 66 90 66 66 90 e8 cc e2 f1
ff e9 71 fe ff ff 0f 1f 80 00 00 00 00 e8 8b 25 ff ff 49 89 c5 e9 4a fe
ff ff <0f> 0b 0f 0b 49 8b 45 00 31 c9 f6 c4 40 74 04 41 8b 4d 68 ba 00
[ 3.095893] RIP [<ffffffff811b044d>] new_slab+0x26d/0x310
[ 3.107828] RSP <ffff880135ad3c80>
[ 3.114024] ---[ end trace e696d6ddf3adb276 ]---
[ 3.121541] swapper/0 used greatest stack depth: 4768 bytes left
[ 3.143784] Kernel panic - not syncing: Attempted to kill init!
exitcode=0x0000000b
[ 3.143784]
The above errors occurred on my two boxes:
on one machine, which has 120GB RAM and 8 NUMA nodes with AMD CPUs, the
kernel panic occurred with both autonuma15 and the Linus tree (3.5.0-rc1),
but on another one, which has 16GB RAM and 4 NUMA nodes with AMD CPUs,
the kernel panic only occurred with autonuma15, with no such issue in the
Linus tree.
The whole panic info is available at
http://www.sanweiying.org/download/kernel_panic_log
and the config file at http://www.sanweiying.org/download/config.
Please feel free to tell me if you need more detailed info.
Thanks,
Zhouping
Hi Zhouping
On Thu, Jun 7, 2012 at 10:30 AM, Zhouping Liu <[email protected]> wrote:
>
> [ 3.114024] ---[ end trace e696d6ddf3adb276 ]---
> [ 3.121541] swapper/0 used greatest stack depth: 4768 bytes left
> [ 3.143784] Kernel panic - not syncing: Attempted to kill init!
> exitcode=0x0000000b
> [ 3.143784]
>
> such above errors occurred in my two boxes:
> in one machine, which has 120Gb RAM and 8 numa nodes with AMD CPU, kernel
> panic occurred in autonuma15 and Linus tree(3.5.0-rc1)
> but in another one, which has 16Gb RAM and 4 numa nodes with AMD CPU, kernel
> panic only occurred in autonuma15, no such issues in Linus tree,
>
Related to fix at https://lkml.org/lkml/2012/6/5/31 ?
On Thu, Jun 07, 2012 at 07:44:33PM +0800, Hillf Danton wrote:
> Hi Zhouping
>
> On Thu, Jun 7, 2012 at 10:30 AM, Zhouping Liu <[email protected]> wrote:
> >
> > [ 3.114024] ---[ end trace e696d6ddf3adb276 ]---
> > [ 3.121541] swapper/0 used greatest stack depth: 4768 bytes left
> > [ 3.143784] Kernel panic - not syncing: Attempted to kill init!
> > exitcode=0x0000000b
> > [ 3.143784]
> >
> > such above errors occurred in my two boxes:
> > in one machine, which has 120Gb RAM and 8 numa nodes with AMD CPU, kernel
> > panic occurred in autonuma15 and Linus tree(3.5.0-rc1)
> > but in another one, which has 16Gb RAM and 4 numa nodes with AMD CPU, kernel
> > panic only occurred in autonuma15, no such issues in Linus tree,
> >
> Related to fix at https://lkml.org/lkml/2012/6/5/31 ?
Right thanks! I pushed an update after an upstream rebase to fix it.
git fetch; git checkout -f origin/autonuma
or:
git clone --reference linux -b autonuma git://git.kernel.org/pub/scm/linux/kernel/git/andrea/aa.git
Please let me know if you still have problems.
Andrea
> On Thu, Jun 7, 2012 at 10:30 AM, Zhouping Liu <[email protected]>
> wrote:
> >
> > [ 3.114024] ---[ end trace e696d6ddf3adb276 ]---
> > [ 3.121541] swapper/0 used greatest stack depth: 4768 bytes left
> > [ 3.143784] Kernel panic - not syncing: Attempted to kill init!
> > exitcode=0x0000000b
> > [ 3.143784]
> >
> > such above errors occurred in my two boxes:
> > in one machine, which has 120Gb RAM and 8 numa nodes with AMD CPU,
> > kernel
> > panic occurred in autonuma15 and Linus tree(3.5.0-rc1)
> > but in another one, which has 16Gb RAM and 4 numa nodes with AMD
> > CPU, kernel
> > panic only occurred in autonuma15, no such issues in Linus tree,
> >
> Related to fix at https://lkml.org/lkml/2012/6/5/31 ?
>
Hi Hillf,
Thanks! But the Linus tree I tested already contains that patch.
I also tested autonuma15 with the patch just now, and the panic is
still there, so maybe it's a new issue...
--
Thanks,
Zhouping
On Thu, Jun 07, 2012 at 10:08:52AM -0400, Zhouping Liu wrote:
> > On Thu, Jun 7, 2012 at 10:30 AM, Zhouping Liu <[email protected]>
> > wrote:
> > >
> > > [ 3.114024] ---[ end trace e696d6ddf3adb276 ]---
> > > [ 3.121541] swapper/0 used greatest stack depth: 4768 bytes left
> > > [ 3.143784] Kernel panic - not syncing: Attempted to kill init!
> > > exitcode=0x0000000b
> > > [ 3.143784]
> > >
> > > such above errors occurred in my two boxes:
> > > in one machine, which has 120Gb RAM and 8 numa nodes with AMD CPU,
> > > kernel
> > > panic occurred in autonuma15 and Linus tree(3.5.0-rc1)
> > > but in another one, which has 16Gb RAM and 4 numa nodes with AMD
> > > CPU, kernel
> > > panic only occurred in autonuma15, no such issues in Linus tree,
> > >
> > Related to fix at https://lkml.org/lkml/2012/6/5/31 ?
> >
>
> hi, Hillf
>
> Thanks! but the Linus tree I tested has contained the patch,
> also I tested it in autunuma15 with the patch just now, and
> the panic is still alive, so maybe it's a new issues...
I guess this 74a5ce20e6eeeb3751340b390e7ac1d1d07bbf55 or this
8e7fbcbc22c12414bcc9dfdd683637f58fb32759 may have introduced a problem
with sgp->power being null.
After applying the zalloc_node change it oopses in a different place, here:
/* Adjust by relative CPU power of the group */
sgs->avg_load = (sgs->group_load*SCHED_POWER_SCALE) / group->sgp->power;
power is zero.
[ 3.243773] divide error: 0000 [#1] SMP
[ 3.244564] CPU 5
[ 3.245016] Modules linked in:
[ 3.245642]
[ 3.245939] Pid: 0, comm: swapper/5 Not tainted 3.5.0-rc1+ #1 HP ProLiant DL785 G6
[ 3.247640] RIP: 0010:[<ffffffff810afbeb>] [<ffffffff810afbeb>] update_sd_lb_stats+0x27b/0x620
[ 3.249534] RSP: 0000:ffff880411207b48 EFLAGS: 00010056
[ 3.250636] RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffff880811496d00
[ 3.252174] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff8818116a0548
[ 3.253509] RBP: ffff880411207c28 R08: 0000000000000000 R09: 0000000000000000
[ 3.255073] R10: 0000000000000001 R11: 0000000000000000 R12: 0000000000000000
[ 3.256607] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000030
[ 3.258278] FS: 0000000000000000(0000) GS:ffff881817200000(0000) knlGS:0000000000000000
[ 3.260010] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 3.261250] CR2: 0000000000000000 CR3: 000000000196f000 CR4: 00000000000007e0
[ 3.262586] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 3.263912] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 3.265320] Process swapper/5 (pid: 0, threadinfo ffff880411206000, task ffff8804111fa680)
[ 3.267150] Stack:
[ 3.267670] 0000000000000001 ffff880411207e34 ffff880411207bb8 ffff880411207d90
[ 3.269344] 00000000ffffffff ffff8818116a0548 00000000001d4780 00000000001d4780
[ 3.270953] ffff880416c21000 ffff880411207c38 ffff8818116a0560 0000000000000000
[ 3.272379] Call Trace:
[ 3.272933] [<ffffffff810affc9>] find_busiest_group+0x39/0x4b0
[ 3.274214] [<ffffffff810b0545>] load_balance+0x105/0xac0
[ 3.275408] [<ffffffff810ceefd>] ? trace_hardirqs_off+0xd/0x10
[ 3.276695] [<ffffffff810aa26f>] ? local_clock+0x6f/0x80
[ 3.277925] [<ffffffff810b1500>] idle_balance+0x130/0x2d0
[ 3.279137] [<ffffffff810b1420>] ? idle_balance+0x50/0x2d0
[ 3.280224] [<ffffffff81683e40>] __schedule+0x910/0xa00
[ 3.281229] [<ffffffff81684269>] schedule+0x29/0x70
[ 3.282165] [<ffffffff8102352f>] cpu_idle+0x12f/0x140
[ 3.283130] [<ffffffff8166bf85>] start_secondary+0x262/0x264
Please let me know if it rings a bell; it looks like an upstream problem.
Thanks,
Andrea
Hi,
> >
> > Thanks! but the Linus tree I tested has contained the patch,
> > also I tested it in autunuma15 with the patch just now, and
> > the panic is still alive, so maybe it's a new issues...
>
> I guess this 74a5ce20e6eeeb3751340b390e7ac1d1d07bbf55 or this
> 8e7fbcbc22c12414bcc9dfdd683637f58fb32759 may have introduced a
> problem
> with sgp->power being null.
I have tested the kernel after reverting the two commits 74a5ce20e6ee &
8e7fbcbc22c1241, but unluckily the panic is still there, so I think it's
probably not related to those two commits. I will do more investigating
to find out what introduced the panic; please let me know if you need me
to test any specific commit.
Thanks,
Zhouping
On Fri, Jun 8, 2012 at 2:09 PM, Zhouping Liu <[email protected]> wrote:
>
> I have tested the kernel after removing the two commit 74a5ce20e6ee & 8e7fbcbc22c1241,
> but unluckily, the panic is still alive, so I think it's maybe not
> related to the two commit, and I will do more investigating to check
>
I see three reverts at
http://git.kernel.org/?p=linux/kernel/git/andrea/aa.git;a=log;h=refs/heads/autonuma
74a5ce20e Revert "sched: Fix SD_OVERLAP"
9f646389a Revert "sched/x86: Use cpu_llc_shared_mask(cpu) for coregroup_mask"
8e7fbcbc2 Revert "sched: Remove stale power aware scheduling remnants and
dysfunctional knobs"
Would you please take the revert of 9f646389a into account as well?
Good Weekend
Hillf
Don't hide reports like this in subjects like the above
On Thu, 2012-06-07 at 21:37 +0200, Andrea Arcangeli wrote:
> Please let me know if it rings a bell, it looks an upstream problem.
Please try tip/master, if it still fails, please report in a new thread
with appropriate subject.
On Thu, Jun 7, 2012 at 10:30 AM, Zhouping Liu <[email protected]> wrote:
> On 06/01/2012 02:08 AM, Andrea Arcangeli wrote:
> [...]
Killing init actually means there is a problem in a scheduling-related subsystem.
Check it and fix it yourself after reading the design. :-)
Hi,
>
> Please try tip/master, if it still fails, please report in a new
> thread
> with appropriate subject.
I tested tip/master (commit b2f5ce55c4e683); it's OK, without the above panic.
I also tested mainline (commit 48d212a2eecaca2), and the panic still exists.
I will open a new thread to track the issue.
--
Thanks,
Zhouping
Hi everyone,
On Tue, Jun 05, 2012 at 04:51:23PM +0200, Andrea Arcangeli wrote:
> The details of the solution:
>
> struct page_autonuma {
> short autonuma_last_nid;
> short autonuma_migrate_nid;
> unsigned int pfn_offset_next;
> unsigned int pfn_offset_prev;
> } __attribute__((packed));
>
> page_autonuma can only point to a page that belongs to the same node
> (page_autonuma is queued into the
> NODE_DATA(autonuma_migrate_nid)->autonuma_migrate_head[src_nid]) where
> src_nid is the source node that page_autonuma belongs to, so all pages
> in the autonuma_migrate_head[src_nid] lru must come from the same
> src_nid. So the next page_autonuma in the list will be
> lookup_page_autonuma(pfn_to_page(NODE_DATA(src_nid)->node_start_pfn +
> page_autonuma->pfn_offset_next)) etc..
>
> Of course all list_add/del must be hardcoded specially for this, but
> it's not a conceptually difficult solution; we just can't use list.h
> and straight pointers anymore, so some conversion must happen.
So here is the above idea implemented and apparently working fine (it
has been running for only half an hour, but all benchmark regression
tests passed with the same scores as before and I verified that memory
goes in all directions during the bench, so there's a good chance it's ok).
It actually works even if a node has more than 16TB, but in that case
it will WARN_ONCE on the first page that is migrated at an offset
above 16TB from the start of the node, and then it will continue,
simply skipping migration of those pages whose offset is too large.
The next part coming is the documentation of autonuma_balance() at the top of
kernel/sched/numa.c and a cleanup of the autonuma_balance callout location
(if I can figure out how to do an active balance of the running task from
softirq). The location at the moment is there just so it's invoked after
load_balance runs, so it shouldn't make a runtime difference after I
clean it up (hackbench already runs identically to upstream), but
it'll certainly be nice to micro-optimize away a call and a branch from
the schedule() fast path.
After that I'll write Documentation/vm/AutoNUMA.txt and finish
the THP native migration (the latter assuming nobody does it before
I get there; if somebody wants to do it sooner, we figured out the locking
details with Johannes during the MM summit, but it's some work to
implement).
===
From 17e1cbc02c1b41037248d9952179ff293a287d58 Mon Sep 17 00:00:00 2001
From: Andrea Arcangeli <[email protected]>
Date: Tue, 19 Jun 2012 18:55:25 +0200
Subject: [PATCH] autonuma: shrink the per-page page_autonuma struct size
From 32 to 12 bytes, so the AutoNUMA memory footprint is reduced to
0.29% of RAM.
This however will fail to migrate pages above a 16 Terabyte offset
from the start of each node (migration failure isn't fatal, simply
those pages will not follow the CPU, a warning will be printed in the
log just once in that case).
AutoNUMA will also fail to build if there are more than (2**15)-1
nodes supported by the MAX_NUMNODES at build time (it would be easy to
relax it to (2**16)-1 nodes without increasing the memory footprint,
but it's not even worth it, so let's keep the negative space reserved
for now).
This means the max RAM configuration fully supported by AutoNUMA
becomes AUTONUMA_LIST_MAX_PFN_OFFSET multiplied by 32767 nodes
multiplied by the PAGE_SIZE (assume 4096 here, but for some archs it's
bigger).
4096*32767*(0xffffffff-3)>>(10*5) = 511 PetaBytes.
Signed-off-by: Andrea Arcangeli <[email protected]>
---
include/linux/autonuma_list.h | 94 ++++++++++++++++++++++
include/linux/autonuma_types.h | 48 +++++++-----
include/linux/mmzone.h | 3 +-
include/linux/page_autonuma.h | 2 +-
mm/Makefile | 2 +-
mm/autonuma.c | 75 +++++++++++++-----
mm/autonuma_list.c | 167 ++++++++++++++++++++++++++++++++++++++++
mm/page_autonuma.c | 15 ++--
8 files changed, 355 insertions(+), 51 deletions(-)
create mode 100644 include/linux/autonuma_list.h
create mode 100644 mm/autonuma_list.c
diff --git a/include/linux/autonuma_list.h b/include/linux/autonuma_list.h
new file mode 100644
index 0000000..0f338e9
--- /dev/null
+++ b/include/linux/autonuma_list.h
@@ -0,0 +1,94 @@
+#ifndef __AUTONUMA_LIST_H
+#define __AUTONUMA_LIST_H
+
+#include <linux/types.h>
+#include <linux/kernel.h>
+
+typedef uint32_t autonuma_list_entry;
+#define AUTONUMA_LIST_MAX_PFN_OFFSET (AUTONUMA_LIST_HEAD-3)
+#define AUTONUMA_LIST_POISON1 (AUTONUMA_LIST_HEAD-2)
+#define AUTONUMA_LIST_POISON2 (AUTONUMA_LIST_HEAD-1)
+#define AUTONUMA_LIST_HEAD ((uint32_t)UINT_MAX)
+
+struct autonuma_list_head {
+ autonuma_list_entry anl_next_pfn;
+ autonuma_list_entry anl_prev_pfn;
+};
+
+static inline void AUTONUMA_INIT_LIST_HEAD(struct autonuma_list_head *anl)
+{
+ anl->anl_next_pfn = AUTONUMA_LIST_HEAD;
+ anl->anl_prev_pfn = AUTONUMA_LIST_HEAD;
+}
+
+/* abstraction conversion methods */
+extern struct page *autonuma_list_entry_to_page(int nid,
+ autonuma_list_entry pfn_offset);
+extern autonuma_list_entry autonuma_page_to_list_entry(int page_nid,
+ struct page *page);
+extern struct autonuma_list_head *__autonuma_list_head(int page_nid,
+ struct autonuma_list_head *head,
+ autonuma_list_entry pfn_offset);
+
+extern bool __autonuma_list_add(int page_nid,
+ struct page *page,
+ struct autonuma_list_head *head,
+ autonuma_list_entry prev,
+ autonuma_list_entry next);
+
+/*
+ * autonuma_list_add - add a new entry
+ *
+ * Insert a new entry after the specified head.
+ */
+static inline bool autonuma_list_add(int page_nid,
+ struct page *page,
+ autonuma_list_entry entry,
+ struct autonuma_list_head *head)
+{
+ struct autonuma_list_head *entry_head;
+ entry_head = __autonuma_list_head(page_nid, head, entry);
+ return __autonuma_list_add(page_nid, page, head,
+ entry, entry_head->anl_next_pfn);
+}
+
+/*
+ * autonuma_list_add_tail - add a new entry
+ *
+ * Insert a new entry before the specified head.
+ * This is useful for implementing queues.
+ */
+static inline bool autonuma_list_add_tail(int page_nid,
+ struct page *page,
+ autonuma_list_entry entry,
+ struct autonuma_list_head *head)
+{
+ struct autonuma_list_head *entry_head;
+ entry_head = __autonuma_list_head(page_nid, head, entry);
+ return __autonuma_list_add(page_nid, page, head,
+ entry_head->anl_prev_pfn, entry);
+}
+
+/*
+ * autonuma_list_del - deletes entry from list.
+ * @entry: the element to delete from the list.
+ */
+extern void autonuma_list_del(int page_nid,
+ struct autonuma_list_head *entry,
+ struct autonuma_list_head *head);
+
+extern bool autonuma_list_empty(const struct autonuma_list_head *head);
+
+#if 0 /* not needed so far */
+/*
+ * autonuma_list_is_singular - tests whether a list has just one entry.
+ * @head: the list to test.
+ */
+static inline int autonuma_list_is_singular(const struct autonuma_list_head *head)
+{
+ return !autonuma_list_empty(head) &&
+ (head->anl_next_pfn == head->anl_prev_pfn);
+}
+#endif
+
+#endif /* __AUTONUMA_LIST_H */
diff --git a/include/linux/autonuma_types.h b/include/linux/autonuma_types.h
index 6662990..1abde9c5 100644
--- a/include/linux/autonuma_types.h
+++ b/include/linux/autonuma_types.h
@@ -4,6 +4,7 @@
#ifdef CONFIG_AUTONUMA
#include <linux/numa.h>
+#include <linux/autonuma_list.h>
/*
* Per-mm (process) structure dynamically allocated only if autonuma
@@ -45,15 +46,36 @@ struct task_autonuma {
/*
* Per page (or per-pageblock) structure dynamically allocated only if
* autonuma is not impossible.
+ *
+ * This structure takes 12 bytes per page for all architectures. There
+ * are two constraints to make this work:
+ *
+ * 1) the build will abort if MAX_NUMNODES is too big according to
+ * the #error check below
+ *
+ * 2) AutoNUMA will not succeed to insert into the migration queue any
+ * page whose pfn offset value (offset with respect to the first
+ * pfn of the node) is bigger than AUTONUMA_LIST_MAX_PFN_OFFSET
+ * (NOTE: AUTONUMA_LIST_MAX_PFN_OFFSET is still a valid pfn offset
+ * value). This means with huge node sizes and small PAGE_SIZE,
+ * some pages may not be allowed to be migrated.
*/
struct page_autonuma {
/*
* To modify autonuma_last_nid lockless the architecture,
* needs SMP atomic granularity < sizeof(long), not all archs
- * have that, notably some alpha. Archs without that requires
+ * have that, notably some ancient alpha (but none of those
+ * should run in NUMA systems). Archs without that requires
* autonuma_last_nid to be a long.
*/
-#if BITS_PER_LONG > 32
+#if MAX_NUMNODES > 32767
+ /*
+ * Verify at build time that int16_t for autonuma_migrate_nid
+ * and autonuma_last_nid won't risk to overflow, max allowed
+ * nid value is (2**15)-1.
+ */
+#error "too many nodes"
+#endif
/*
* autonuma_migrate_nid is -1 if the page_autonuma structure
* is not linked into any
@@ -63,7 +85,7 @@ struct page_autonuma {
* page_nid is the nid that the page (referenced by the
* page_autonuma structure) belongs to.
*/
- int autonuma_migrate_nid;
+ int16_t autonuma_migrate_nid;
/*
* autonuma_last_nid records which is the NUMA nid that tried
* to access this page at the last NUMA hinting page fault.
@@ -72,28 +94,14 @@ struct page_autonuma {
* it will make different threads trashing on the same pages,
* converge on the same NUMA node (if possible).
*/
- int autonuma_last_nid;
-#else
-#if MAX_NUMNODES >= 32768
-#error "too many nodes"
-#endif
- short autonuma_migrate_nid;
- short autonuma_last_nid;
-#endif
+ int16_t autonuma_last_nid;
+
/*
* This is the list node that links the page (referenced by
* the page_autonuma structure) in the
* &NODE_DATA(dst_nid)->autonuma_migrate_head[page_nid] lru.
*/
- struct list_head autonuma_migrate_node;
-
- /*
- * To find the page starting from the autonuma_migrate_node we
- * need a backlink.
- *
- * FIXME: drop it;
- */
- struct page *page;
+ struct autonuma_list_head autonuma_migrate_node;
};
extern int alloc_task_autonuma(struct task_struct *tsk,
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index ed5b0c0..acefdfa 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -17,6 +17,7 @@
#include <linux/pageblock-flags.h>
#include <generated/bounds.h>
#include <linux/atomic.h>
+#include <linux/autonuma_list.h>
#include <asm/page.h>
/* Free memory management - zoned buddy allocator. */
@@ -710,7 +711,7 @@ typedef struct pglist_data {
* <linux/page_autonuma.h> and the below field must remain the
* last one of this structure.
*/
- struct list_head autonuma_migrate_head[0];
+ struct autonuma_list_head autonuma_migrate_head[0];
#endif
} pg_data_t;
diff --git a/include/linux/page_autonuma.h b/include/linux/page_autonuma.h
index bc7a629..e78beda 100644
--- a/include/linux/page_autonuma.h
+++ b/include/linux/page_autonuma.h
@@ -53,7 +53,7 @@ extern void __init sparse_early_page_autonuma_alloc_node(struct page_autonuma **
/* inline won't work here */
#define autonuma_pglist_data_size() (sizeof(struct pglist_data) + \
(autonuma_impossible() ? 0 : \
- sizeof(struct list_head) * \
+ sizeof(struct autonuma_list_head) * \
num_possible_nodes()))
#endif /* _LINUX_PAGE_AUTONUMA_H */
diff --git a/mm/Makefile b/mm/Makefile
index a4d8354..4aa90d4 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -33,7 +33,7 @@ obj-$(CONFIG_FRONTSWAP) += frontswap.o
obj-$(CONFIG_HAS_DMA) += dmapool.o
obj-$(CONFIG_HUGETLBFS) += hugetlb.o
obj-$(CONFIG_NUMA) += mempolicy.o
-obj-$(CONFIG_AUTONUMA) += autonuma.o page_autonuma.o
+obj-$(CONFIG_AUTONUMA) += autonuma.o page_autonuma.o autonuma_list.o
obj-$(CONFIG_SPARSEMEM) += sparse.o
obj-$(CONFIG_SPARSEMEM_VMEMMAP) += sparse-vmemmap.o
obj-$(CONFIG_SLOB) += slob.o
diff --git a/mm/autonuma.c b/mm/autonuma.c
index 9834f5d..8aed9af 100644
--- a/mm/autonuma.c
+++ b/mm/autonuma.c
@@ -89,12 +89,21 @@ void autonuma_migrate_split_huge_page(struct page *page,
VM_BUG_ON(nid < -1);
VM_BUG_ON(page_tail_autonuma->autonuma_migrate_nid != -1);
if (nid >= 0) {
- VM_BUG_ON(page_to_nid(page) != page_to_nid(page_tail));
+ int page_nid = page_to_nid(page);
+ struct autonuma_list_head *head;
+ autonuma_list_entry entry;
+ entry = autonuma_page_to_list_entry(page_nid, page);
+ head = &NODE_DATA(nid)->autonuma_migrate_head[page_nid];
+ VM_BUG_ON(page_nid != page_to_nid(page_tail));
+ VM_BUG_ON(page_nid == nid);
compound_lock(page_tail);
autonuma_migrate_lock(nid);
- list_add_tail(&page_tail_autonuma->autonuma_migrate_node,
- &page_autonuma->autonuma_migrate_node);
+ if (!autonuma_list_add_tail(page_nid,
+ page_tail,
+ entry,
+ head))
+ BUG();
autonuma_migrate_unlock(nid);
page_tail_autonuma->autonuma_migrate_nid = nid;
@@ -119,8 +128,15 @@ void __autonuma_migrate_page_remove(struct page *page,
VM_BUG_ON(nid < -1);
if (nid >= 0) {
int numpages = hpage_nr_pages(page);
+ int page_nid = page_to_nid(page);
+ struct autonuma_list_head *head;
+ VM_BUG_ON(nid == page_nid);
+ head = &NODE_DATA(nid)->autonuma_migrate_head[page_nid];
+
autonuma_migrate_lock(nid);
- list_del(&page_autonuma->autonuma_migrate_node);
+ autonuma_list_del(page_nid,
+ &page_autonuma->autonuma_migrate_node,
+ head);
NODE_DATA(nid)->autonuma_nr_migrate_pages -= numpages;
autonuma_migrate_unlock(nid);
@@ -139,6 +155,8 @@ static void __autonuma_migrate_page_add(struct page *page,
int numpages;
unsigned long nr_migrate_pages;
wait_queue_head_t *wait_queue;
+ struct autonuma_list_head *head;
+ bool added;
VM_BUG_ON(dst_nid >= MAX_NUMNODES);
VM_BUG_ON(dst_nid < -1);
@@ -155,25 +173,33 @@ static void __autonuma_migrate_page_add(struct page *page,
VM_BUG_ON(nid >= MAX_NUMNODES);
VM_BUG_ON(nid < -1);
if (nid >= 0) {
+ VM_BUG_ON(nid == page_nid);
+ head = &NODE_DATA(nid)->autonuma_migrate_head[page_nid];
+
autonuma_migrate_lock(nid);
- list_del(&page_autonuma->autonuma_migrate_node);
+ autonuma_list_del(page_nid,
+ &page_autonuma->autonuma_migrate_node,
+ head);
NODE_DATA(nid)->autonuma_nr_migrate_pages -= numpages;
autonuma_migrate_unlock(nid);
}
+ head = &NODE_DATA(dst_nid)->autonuma_migrate_head[page_nid];
+
autonuma_migrate_lock(dst_nid);
- list_add(&page_autonuma->autonuma_migrate_node,
- &NODE_DATA(dst_nid)->autonuma_migrate_head[page_nid]);
- NODE_DATA(dst_nid)->autonuma_nr_migrate_pages += numpages;
+ added = autonuma_list_add(page_nid, page, AUTONUMA_LIST_HEAD, head);
+ if (added)
+ NODE_DATA(dst_nid)->autonuma_nr_migrate_pages += numpages;
nr_migrate_pages = NODE_DATA(dst_nid)->autonuma_nr_migrate_pages;
autonuma_migrate_unlock(dst_nid);
- page_autonuma->autonuma_migrate_nid = dst_nid;
+ if (added)
+ page_autonuma->autonuma_migrate_nid = dst_nid;
compound_unlock_irqrestore(page, flags);
- if (!autonuma_migrate_defer()) {
+ if (added && !autonuma_migrate_defer()) {
wait_queue = &NODE_DATA(dst_nid)->autonuma_knuma_migrated_wait;
if (nr_migrate_pages >= pages_to_migrate &&
nr_migrate_pages - numpages < pages_to_migrate &&
@@ -813,7 +839,7 @@ static int isolate_migratepages(struct list_head *migratepages,
struct pglist_data *pgdat)
{
int nr = 0, nid;
- struct list_head *heads = pgdat->autonuma_migrate_head;
+ struct autonuma_list_head *heads = pgdat->autonuma_migrate_head;
/* FIXME: THP balancing, restart from last nid */
for_each_online_node(nid) {
@@ -825,10 +851,10 @@ static int isolate_migratepages(struct list_head *migratepages,
cond_resched();
VM_BUG_ON(numa_node_id() != pgdat->node_id);
if (nid == pgdat->node_id) {
- VM_BUG_ON(!list_empty(&heads[nid]));
+ VM_BUG_ON(!autonuma_list_empty(&heads[nid]));
continue;
}
- if (list_empty(&heads[nid]))
+ if (autonuma_list_empty(&heads[nid]))
continue;
/* some page wants to go to this pgdat */
/*
@@ -840,22 +866,29 @@ static int isolate_migratepages(struct list_head *migratepages,
* irqs.
*/
autonuma_migrate_lock_irq(pgdat->node_id);
- if (list_empty(&heads[nid])) {
+ if (autonuma_list_empty(&heads[nid])) {
autonuma_migrate_unlock_irq(pgdat->node_id);
continue;
}
- page_autonuma = list_entry(heads[nid].prev,
- struct page_autonuma,
- autonuma_migrate_node);
- page = page_autonuma->page;
+ page = autonuma_list_entry_to_page(nid,
+ heads[nid].anl_prev_pfn);
+ page_autonuma = lookup_page_autonuma(page);
if (unlikely(!get_page_unless_zero(page))) {
+ int page_nid = page_to_nid(page);
+ struct autonuma_list_head *entry_head;
+ VM_BUG_ON(nid == page_nid);
+
/*
* Is getting freed and will remove self from the
* autonuma list shortly, skip it for now.
*/
- list_del(&page_autonuma->autonuma_migrate_node);
- list_add(&page_autonuma->autonuma_migrate_node,
- &heads[nid]);
+ entry_head = &page_autonuma->autonuma_migrate_node;
+ autonuma_list_del(page_nid, entry_head,
+ &heads[nid]);
+ if (!autonuma_list_add(page_nid, page,
+ AUTONUMA_LIST_HEAD,
+ &heads[nid]))
+ BUG();
autonuma_migrate_unlock_irq(pgdat->node_id);
autonuma_printk("autonuma migrate page is free\n");
continue;
diff --git a/mm/autonuma_list.c b/mm/autonuma_list.c
new file mode 100644
index 0000000..2c840f7
--- /dev/null
+++ b/mm/autonuma_list.c
@@ -0,0 +1,167 @@
+/*
+ * Copyright 2006, Red Hat, Inc., Dave Jones
+ * Copyright 2012, Red Hat, Inc.
+ * Released under the General Public License (GPL).
+ *
+ * This file contains the linked list implementations for
+ * autonuma migration lists.
+ */
+
+#include <linux/mm.h>
+#include <linux/autonuma.h>
+
+/*
+ * Insert a new entry between two known consecutive entries.
+ *
+ * This is only for internal list manipulation where we know
+ * the prev/next entries already!
+ *
+ * return true if succeeded, or false if the (page_nid, pfn_offset)
+ * pair couldn't represent the pfn and the list_add didn't succeed.
+ */
+bool __autonuma_list_add(int page_nid,
+ struct page *page,
+ struct autonuma_list_head *head,
+ autonuma_list_entry prev,
+ autonuma_list_entry next)
+{
+ autonuma_list_entry new;
+
+ VM_BUG_ON(page_nid != page_to_nid(page));
+ new = autonuma_page_to_list_entry(page_nid, page);
+ if (new > AUTONUMA_LIST_MAX_PFN_OFFSET)
+ return false;
+
+ WARN(new == prev || new == next,
+ "autonuma_list_add double add: new=%u, prev=%u, next=%u.\n",
+ new, prev, next);
+
+ __autonuma_list_head(page_nid, head, next)->anl_prev_pfn = new;
+ __autonuma_list_head(page_nid, head, new)->anl_next_pfn = next;
+ __autonuma_list_head(page_nid, head, new)->anl_prev_pfn = prev;
+ __autonuma_list_head(page_nid, head, prev)->anl_next_pfn = new;
+ return true;
+}
+
+static inline void __autonuma_list_del_entry(int page_nid,
+ struct autonuma_list_head *entry,
+ struct autonuma_list_head *head)
+{
+ autonuma_list_entry prev, next;
+
+ prev = entry->anl_prev_pfn;
+ next = entry->anl_next_pfn;
+
+ if (WARN(next == AUTONUMA_LIST_POISON1,
+ "autonuma_list_del corruption, "
+ "%p->anl_next_pfn is AUTONUMA_LIST_POISON1 (%u)\n",
+ entry, AUTONUMA_LIST_POISON1) ||
+ WARN(prev == AUTONUMA_LIST_POISON2,
+ "autonuma_list_del corruption, "
+ "%p->anl_prev_pfn is AUTONUMA_LIST_POISON2 (%u)\n",
+ entry, AUTONUMA_LIST_POISON2))
+ return;
+
+ __autonuma_list_head(page_nid, head, next)->anl_prev_pfn = prev;
+ __autonuma_list_head(page_nid, head, prev)->anl_next_pfn = next;
+}
+
+/*
+ * autonuma_list_del - deletes entry from list.
+ *
+ * Note: autonuma_list_empty on entry does not return true after this,
+ * the entry is in an undefined state.
+ */
+void autonuma_list_del(int page_nid, struct autonuma_list_head *entry,
+ struct autonuma_list_head *head)
+{
+ __autonuma_list_del_entry(page_nid, entry, head);
+ entry->anl_next_pfn = AUTONUMA_LIST_POISON1;
+ entry->anl_prev_pfn = AUTONUMA_LIST_POISON2;
+}
+
+/*
+ * autonuma_list_empty - tests whether a list is empty
+ * @head: the list to test.
+ */
+bool autonuma_list_empty(const struct autonuma_list_head *head)
+{
+ bool ret = false;
+ if (head->anl_next_pfn == AUTONUMA_LIST_HEAD) {
+ ret = true;
+ BUG_ON(head->anl_prev_pfn != AUTONUMA_LIST_HEAD);
+ }
+ return ret;
+}
+
+/* abstraction conversion methods */
+
+static inline struct page *__autonuma_list_entry_to_page(int page_nid,
+ autonuma_list_entry pfn_offset)
+{
+ struct pglist_data *pgdat = NODE_DATA(page_nid);
+ unsigned long pfn = pgdat->node_start_pfn + pfn_offset;
+ return pfn_to_page(pfn);
+}
+
+struct page *autonuma_list_entry_to_page(int page_nid,
+ autonuma_list_entry pfn_offset)
+{
+ VM_BUG_ON(page_nid < 0);
+ BUG_ON(pfn_offset == AUTONUMA_LIST_POISON1);
+ BUG_ON(pfn_offset == AUTONUMA_LIST_POISON2);
+ BUG_ON(pfn_offset == AUTONUMA_LIST_HEAD);
+ return __autonuma_list_entry_to_page(page_nid, pfn_offset);
+}
+
+/*
+ * returns a value above AUTONUMA_LIST_MAX_PFN_OFFSET if the pfn is
+ * located a too big offset from the start of the node and cannot be
+ * represented by the (page_nid, pfn_offset) pair.
+ */
+autonuma_list_entry autonuma_page_to_list_entry(int page_nid,
+ struct page *page)
+{
+ unsigned long pfn = page_to_pfn(page);
+ struct pglist_data *pgdat = NODE_DATA(page_nid);
+ VM_BUG_ON(page_nid != page_to_nid(page));
+ BUG_ON(pfn < pgdat->node_start_pfn);
+ pfn -= pgdat->node_start_pfn;
+ if (pfn > AUTONUMA_LIST_MAX_PFN_OFFSET) {
+ WARN_ONCE(1, "autonuma_page_to_list_entry: "
+ "pfn_offset %lu, pgdat %p, "
+ "pgdat->node_start_pfn %lu\n",
+ pfn, pgdat, pgdat->node_start_pfn);
+ /*
+ * Any value bigger than AUTONUMA_LIST_MAX_PFN_OFFSET
+ * will work as an error retval, but better pick one
+ * that will cause noise if computed wrong by the
+ * caller.
+ */
+ return AUTONUMA_LIST_POISON1;
+ }
+ return pfn; /* convert to uint32_t without losing information */
+}
+
+static inline struct autonuma_list_head *____autonuma_list_head(int page_nid,
+ autonuma_list_entry pfn_offset)
+{
+ struct pglist_data *pgdat = NODE_DATA(page_nid);
+ unsigned long pfn = pgdat->node_start_pfn + pfn_offset;
+ struct page *page = pfn_to_page(pfn);
+ struct page_autonuma *page_autonuma = lookup_page_autonuma(page);
+ return &page_autonuma->autonuma_migrate_node;
+}
+
+struct autonuma_list_head *__autonuma_list_head(int page_nid,
+ struct autonuma_list_head *head,
+ autonuma_list_entry pfn_offset)
+{
+ VM_BUG_ON(page_nid < 0);
+ BUG_ON(pfn_offset == AUTONUMA_LIST_POISON1);
+ BUG_ON(pfn_offset == AUTONUMA_LIST_POISON2);
+ if (pfn_offset != AUTONUMA_LIST_HEAD)
+ return ____autonuma_list_head(page_nid, pfn_offset);
+ else
+ return head;
+}
diff --git a/mm/page_autonuma.c b/mm/page_autonuma.c
index f929d81..151f25c 100644
--- a/mm/page_autonuma.c
+++ b/mm/page_autonuma.c
@@ -12,7 +12,6 @@ void __meminit page_autonuma_map_init(struct page *page,
for (end = page + nr_pages; page < end; page++, page_autonuma++) {
page_autonuma->autonuma_last_nid = -1;
page_autonuma->autonuma_migrate_nid = -1;
- page_autonuma->page = page;
}
}
@@ -20,12 +19,18 @@ static void __meminit __pgdat_autonuma_init(struct pglist_data *pgdat)
{
int node_iter;
+ /* verify the per-page page_autonuma 12 byte fixed cost */
+ BUILD_BUG_ON((unsigned long) &((struct page_autonuma *)0)[1] != 12);
+
spin_lock_init(&pgdat->autonuma_lock);
init_waitqueue_head(&pgdat->autonuma_knuma_migrated_wait);
pgdat->autonuma_nr_migrate_pages = 0;
if (!autonuma_impossible())
- for_each_node(node_iter)
- INIT_LIST_HEAD(&pgdat->autonuma_migrate_head[node_iter]);
+ for_each_node(node_iter) {
+ struct autonuma_list_head *head;
+ head = &pgdat->autonuma_migrate_head[node_iter];
+ AUTONUMA_INIT_LIST_HEAD(head);
+ }
}
#if !defined(CONFIG_SPARSEMEM)
@@ -112,10 +117,6 @@ struct page_autonuma *lookup_page_autonuma(struct page *page)
unsigned long pfn = page_to_pfn(page);
struct mem_section *section = __pfn_to_section(pfn);
- /* if it's not a power of two we may be wasting memory */
- BUILD_BUG_ON(SECTION_PAGE_AUTONUMA_SIZE &
- (SECTION_PAGE_AUTONUMA_SIZE-1));
-
#ifdef CONFIG_DEBUG_VM
/*
* The sanity checks the page allocator does upon freeing a
> I released an AutoNUMA15 branch that includes all pending fixes:
>
> git clone --reference linux -b autonuma15 git://git.kernel.org/pub/scm/linux/kernel/git/andrea/aa.git
>
I did some quick testing with our
specjbb2005/oltp/hackbench/tbench/netperf-loop/fio/ffsb benchmarks on NHM EP/EX,
Core2 EP, and Romley EP machines. In general, no clear performance change was
found. Are these results expected for this patch set?
Regards!
Alex
On Thu, Jun 21, 2012 at 03:29:52PM +0800, Alex Shi wrote:
> > I released an AutoNUMA15 branch that includes all pending fixes:
> >
> > git clone --reference linux -b autonuma15 git://git.kernel.org/pub/scm/linux/kernel/git/andrea/aa.git
> >
>
> I did some quick testing with our
> specjbb2005/oltp/hackbench/tbench/netperf-loop/fio/ffsb benchmarks on NHM EP/EX,
> Core2 EP, and Romley EP machines. In general, no clear performance change was
> found. Are these results expected for this patch set?
hackbench and the network benchmarks won't get a benefit (the former
overschedules like crazy, so there's no way any autonuma balancing can
have an effect with such overscheduling and a zillion threads; the
latter are I/O dominated and usually take so little RAM that it doesn't
matter, since the memory accesses on the kernel side and the DMA issue
should dominate the CPU utilization). The same goes for filesystem
benchmarks like fio.
On all _system_ time dominated kernel benchmarks it is expected that you
won't measure a performance improvement, and if you don't measure a
regression that's more than enough.
The only benchmarks that get a benefit are userland ones, where the
user/nice time in top dominates. AutoNUMA cannot optimize or move kernel
memory around; it only optimizes userland computations.
So you should run HPC jobs. The only strange thing here is that
specjbb2005 gets a measurable significant boost with AutoNUMA so if
you didn't even get a boost with that you may want to verify:
cat /sys/kernel/mm/autonuma/enabled == 1
Also verify:
CONFIG_AUTONUMA_DEFAULT_ENABLED=y
If that's 1 well maybe the memory interconnect is so fast that there's
no benefit?
My numa01/02 benchmarks measure the best and worst case of the hardware
(not the software) with the -DINVERSE_BIND and -DHARD_BIND parameters; you
can consider running those to verify.
Probably there should be a little boot time kernel benchmark to
measure the inverse bind vs hard bind performance across the first two
nodes, if the difference is nil AutoNUMA should disengage and not even
allocate the page_autonuma (now only 12 bytes per page but anyway).
If you can retest with autonuma17 it would help too as there was some
performance issue fixed and it'd stress the new autonuma migration lru
code:
git clone --reference linux -b autonuma17 git://git.kernel.org/pub/scm/linux/kernel/git/andrea/aa.git autonuma17
And the very latest is always at the autonuma branch:
git clone --reference linux -b autonuma git://git.kernel.org/pub/scm/linux/kernel/git/andrea/aa.git autonuma
Thanks,
Andrea
On Tue, May 29, 2012 at 03:10:04PM +0200, Peter Zijlstra wrote:
> This doesn't explain anything in the dense code that follows.
>
> What statistics, how are they used, with what goal etc..
Right, sorry for taking so long to update the docs. So I tried to
write a more useful comment to explain why it converges and what the
objective of the math is. This is the current status and it includes
everything related to the autonuma balancing plus lots of documentation
on it (in fact almost more documentation than code), including the
part that makes CFS prioritize picking from the autonuma_node.
Considering this is _everything_ needed in terms of scheduler
code, I think it's quite simple, and with the docs it should be a lot
clearer.
Moving the callout out of schedule() is the next step, but it's only an
implementation issue (and I would already have done it if only I could
schedule from softirq, but this isn't preempt-rt...).
/*
* Copyright (C) 2012 Red Hat, Inc.
*
* This work is licensed under the terms of the GNU GPL, version 2. See
* the COPYING file in the top-level directory.
*/
#include <linux/sched.h>
#include <linux/autonuma_sched.h>
#include <asm/tlb.h>
#include "sched.h"
#define AUTONUMA_BALANCE_SCALE 1000
enum {
W_TYPE_THREAD,
W_TYPE_PROCESS,
};
/*
* This function is responsible for deciding which is the best CPU
* each process should be running on according to the NUMA statistics
* collected in mm->mm_autonuma and tsk->task_autonuma.
*
* The core math that evaluates the current CPU against the CPUs of
* all _other_ nodes is this:
*
* if (w_nid > w_other && w_nid > w_cpu_nid) {
* weight = w_nid - w_other + w_nid - w_cpu_nid;
*
* w_nid: worthiness of moving the current thread/process to the other
* CPU.
*
* w_other: worthiness of moving the thread/process running in the
* other CPU to the current CPU.
*
* w_cpu_nid: worthiness of keeping the current thread/process in the
* current CPU.
*
* We run the above math on every CPU not part of the current NUMA
* node, and we compare the current process against the other
* processes running in the other CPUs in the remote NUMA nodes. The
* objective is to select the cpu (in selected_cpu) with the biggest
* worthiness weight (calculated as w_nid - w_other + w_nid -
* w_cpu_nid). The bigger the worthiness weight of the other CPU, the
* bigger the gain we'll get by moving the current process to the
* selected_cpu (not only the biggest immediate CPU gain but also the
* fewest async memory migrations required to reach full
* convergence later). If we select a cpu we migrate the current
* process to it.
*
* Checking that the other process prefers to run here (w_nid >
* w_other) and not only that we prefer to run there (w_nid >
* w_cpu_nid) completely avoids ping pongs and ensures (temporary)
* convergence of the algorithm (at least from a CPU standpoint).
*
* It's then up to the idle balancing code that will run as soon as
* the current CPU goes idle to pick the other process and move it
* here.
*
* By only evaluating running processes against running processes we
* avoid interfering with the stock CFS active idle balancing, which is
* critical to perform optimally with HT enabled (getting HT wrong is
* worse than running on remote memory, so the active idle balancing
* has priority). The idle balancing (and all other CFS load
* balancing) is however NUMA aware through the introduction of
* sched_autonuma_can_migrate_task(). CFS searches the CPUs in
* tsk->autonuma_node first when it needs to find idle CPUs during
* idle balancing or tasks to pick during load balancing.
*
* Then in the background asynchronously the memory always slowly
* follows the CPU. Here the CPU follows the memory as fast as it can
* (as long as active idle balancing permits).
*
* One non-trivial bit of this logic that deserves an explanation is
* how the three crucial variables of the core math
* (w_nid/w_other/w_cpu_nid) change depending on whether the other
* CPU is running a thread of the current process or a thread of a
* different process. A simple example is required: assume there are 2
* processes with 4 threads each and two nodes with 4 CPUs
* each. Because the total of 8 threads belongs to two different
* processes, by using the process statistics when comparing threads
* of different processes we'll end up converging reliably and quickly
* on a configuration where the first process is entirely contained in
* the first node and the second process is entirely contained in the
* second node. Now if you knew in advance that all threads only use
* thread-local memory and there's no sharing of memory between the
* different threads, it wouldn't matter whether we use per-thread or
* per-mm statistics in w_nid/w_other/w_cpu_nid and we could use
* per-thread statistics at all times. But clearly with threads it's
* expected to get some sharing of memory, so to avoid false sharing
* it's better to keep all threads of the same process in the same
* node (or if they don't fit in a single node, in as few nodes as
* possible), and this is why we have to use process statistics in
* w_nid/w_other/w_cpu_nid when comparing threads of different
* processes. Why we instead have to use thread statistics when
* comparing threads of the same process should already be obvious if
* you're still reading (hint: the mm statistics are identical for
* threads of the same process). If some process doesn't fit in one
* node, the thread statistics will then distribute the threads to the
* best nodes within the group of nodes where the process is
* contained.
*
* This is an example of the CPU layout after the startup of 2
* processes with 12 threads each:
*
* nid 0 mm ffff880433367b80 nr 6
* nid 0 mm ffff880433367480 nr 5
* nid 1 mm ffff880433367b80 nr 6
* nid 1 mm ffff880433367480 nr 6
*
* And after a few seconds it becomes:
*
* nid 0 mm ffff880433367b80 nr 12
* nid 1 mm ffff880433367480 nr 11
*
* You can see it happening yourself by enabling debug with sysfs.
*
* Before scanning all other CPUs' runqueues to compute the above math,
* we also verify that we're not already in the preferred nid from the
* point of view of both the process statistics and the thread
* statistics. In that case we can return to the caller without having
* to check any other CPU's runqueue because full convergence has
* already been reached.
*
* Ideally this should be expanded to take all runnable processes into
* account, but this is a good enough approximation because some
* runnable processes may run only for a short time, so statistically
* there will always be a bias towards the processes that use most of
* the CPU, and that's ideal (if a process runs only for a short time,
* it won't matter too much if the NUMA balancing isn't optimal for
* it).
*
* This function is invoked at the same frequency as the load balancer
* and only if the CPU is not idle. The rest of the time we depend on
* CFS to keep sticking to the current CPU or to prioritize on the
* CPUs in the selected_nid recorded in the task autonuma_node.
*/
void sched_autonuma_balance(void)
{
int cpu, nid, selected_cpu, selected_nid, selected_nid_mm;
int cpu_nid = numa_node_id();
int this_cpu = smp_processor_id();
unsigned long t_w, t_t, m_w, m_t, t_w_max, m_w_max;
unsigned long weight_delta_max, weight;
long s_w_nid = -1, s_w_cpu_nid = -1, s_w_other = -1;
int s_w_type = -1;
struct cpumask *allowed;
struct migration_arg arg;
struct task_struct *p = current;
struct task_autonuma *task_autonuma = p->task_autonuma;
/* per-cpu statically allocated in runqueues */
long *task_numa_weight;
long *mm_numa_weight;
if (!task_autonuma || !p->mm)
return;
if (!(task_autonuma->autonuma_flags &
SCHED_AUTONUMA_FLAG_NEED_BALANCE))
return;
else
task_autonuma->autonuma_flags &=
~SCHED_AUTONUMA_FLAG_NEED_BALANCE;
if (task_autonuma->autonuma_flags & SCHED_AUTONUMA_FLAG_STOP_ONE_CPU)
return;
if (!autonuma_enabled()) {
if (task_autonuma->autonuma_node != -1)
task_autonuma->autonuma_node = -1;
return;
}
allowed = tsk_cpus_allowed(p);
m_t = ACCESS_ONCE(p->mm->mm_autonuma->mm_numa_fault_tot);
t_t = task_autonuma->task_numa_fault_tot;
/*
* If a process still misses the per-thread or per-process
* information skip it.
*/
if (!m_t || !t_t)
return;
task_numa_weight = cpu_rq(this_cpu)->task_numa_weight;
mm_numa_weight = cpu_rq(this_cpu)->mm_numa_weight;
/*
* See if we have already converged, to skip the more expensive loop
* below, if we can already predict here with only CPU-local
* information that it would select the current cpu_nid.
*/
t_w_max = m_w_max = 0;
selected_nid = selected_nid_mm = -1;
for_each_online_node(nid) {
m_w = ACCESS_ONCE(p->mm->mm_autonuma->mm_numa_fault[nid]);
t_w = task_autonuma->task_numa_fault[nid];
if (m_w > m_t)
m_t = m_w;
mm_numa_weight[nid] = m_w*AUTONUMA_BALANCE_SCALE/m_t;
if (t_w > t_t)
t_t = t_w;
task_numa_weight[nid] = t_w*AUTONUMA_BALANCE_SCALE/t_t;
if (mm_numa_weight[nid] > m_w_max) {
m_w_max = mm_numa_weight[nid];
selected_nid_mm = nid;
}
if (task_numa_weight[nid] > t_w_max) {
t_w_max = task_numa_weight[nid];
selected_nid = nid;
}
}
if (selected_nid == cpu_nid && selected_nid_mm == selected_nid) {
if (task_autonuma->autonuma_node != selected_nid)
task_autonuma->autonuma_node = selected_nid;
return;
}
selected_cpu = this_cpu;
/*
* Avoid the process migration if we don't find an ideal,
* non-idle CPU (hence the above selected_cpu = this_cpu), but
* keep the autonuma_node pointing to the node with most of
* the thread memory as selected above using the thread
* statistical data, so the idle balancing code keeps
* prioritizing it when selecting an idle CPU to run
* the task on. Do not set it to the cpu_nid, which would keep
* it in the current nid even if maybe the thread memory got
* allocated somewhere else because the current nid was
* already full.
*
* NOTE: selected_nid should never be below zero here, it's
* not a BUG_ON(selected_nid < 0), because it's nicer to keep
* the autonuma thread/mm statistics speculative.
*/
if (selected_nid < 0)
selected_nid = cpu_nid;
weight = weight_delta_max = 0;
for_each_online_node(nid) {
if (nid == cpu_nid)
continue;
for_each_cpu_and(cpu, cpumask_of_node(nid), allowed) {
long w_nid, w_cpu_nid, w_other;
int w_type;
struct mm_struct *mm;
struct rq *rq = cpu_rq(cpu);
if (!cpu_online(cpu))
continue;
if (idle_cpu(cpu))
/*
* Offload the whole IDLE balancing
* and physical / logical imbalances
* to CFS.
*/
continue;
mm = rq->curr->mm;
if (!mm)
continue;
raw_spin_lock_irq(&rq->lock);
/* recheck after implicit barrier() */
mm = rq->curr->mm;
if (!mm) {
raw_spin_unlock_irq(&rq->lock);
continue;
}
m_t = ACCESS_ONCE(mm->mm_autonuma->mm_numa_fault_tot);
t_t = rq->curr->task_autonuma->task_numa_fault_tot;
if (!m_t || !t_t) {
raw_spin_unlock_irq(&rq->lock);
continue;
}
m_w = ACCESS_ONCE(mm->mm_autonuma->mm_numa_fault[nid]);
t_w = rq->curr->task_autonuma->task_numa_fault[nid];
raw_spin_unlock_irq(&rq->lock);
if (mm == p->mm) {
if (t_w > t_t)
t_t = t_w;
w_other = t_w*AUTONUMA_BALANCE_SCALE/t_t;
w_nid = task_numa_weight[nid];
w_cpu_nid = task_numa_weight[cpu_nid];
w_type = W_TYPE_THREAD;
} else {
if (m_w > m_t)
m_t = m_w;
w_other = m_w*AUTONUMA_BALANCE_SCALE/m_t;
w_nid = mm_numa_weight[nid];
w_cpu_nid = mm_numa_weight[cpu_nid];
w_type = W_TYPE_PROCESS;
}
if (w_nid > w_other && w_nid > w_cpu_nid) {
weight = w_nid - w_other + w_nid - w_cpu_nid;
if (weight > weight_delta_max) {
weight_delta_max = weight;
selected_cpu = cpu;
selected_nid = nid;
s_w_other = w_other;
s_w_nid = w_nid;
s_w_cpu_nid = w_cpu_nid;
s_w_type = w_type;
}
}
}
}
if (task_autonuma->autonuma_node != selected_nid)
task_autonuma->autonuma_node = selected_nid;
if (selected_cpu != this_cpu) {
if (autonuma_debug()) {
char *w_type_str = NULL;
switch (s_w_type) {
case W_TYPE_THREAD:
w_type_str = "thread";
break;
case W_TYPE_PROCESS:
w_type_str = "process";
break;
}
printk("%p %d - %dto%d - %dto%d - %ld %ld %ld - %s\n",
p->mm, p->pid, cpu_nid, selected_nid,
this_cpu, selected_cpu,
s_w_other, s_w_nid, s_w_cpu_nid,
w_type_str);
}
BUG_ON(cpu_nid == selected_nid);
goto found;
}
return;
found:
arg = (struct migration_arg) { p, selected_cpu };
/* Need help from migration thread: drop lock and wait. */
task_autonuma->autonuma_flags |= SCHED_AUTONUMA_FLAG_STOP_ONE_CPU;
sched_preempt_enable_no_resched();
stop_one_cpu(this_cpu, migration_cpu_stop, &arg);
preempt_disable();
task_autonuma->autonuma_flags &= ~SCHED_AUTONUMA_FLAG_STOP_ONE_CPU;
tlb_migrate_finish(p->mm);
}
/*
* This is called by CFS can_migrate_task() to prioritize the
* selection of AutoNUMA affine tasks (according to the autonuma_node)
* during the CFS load balance, active balance, etc...
*
* This is first called with numa == true to skip non-AutoNUMA-affine
* tasks. If this is later called with a numa == false parameter, it
* means a first pass of CFS load balancing wasn't satisfied by an
* AutoNUMA affine task and so we can decide to fall back to allowing
* migration of non-affine tasks.
*
* If load_balance_strict is enabled, AutoNUMA will only allow
* migration of tasks for idle balancing purposes (the idle balancing
* of CFS is never altered by AutoNUMA). In non-strict mode the
* load balancing is not altered and the AutoNUMA affinity is
* disregarded in favor of higher fairness.
*
* The load_balance_strict mode (tunable through sysfs), if enabled,
* tends to partition the system, and in turn it may reduce
* scheduler fairness across different NUMA nodes, but it shall deliver
* higher global performance.
*/
bool sched_autonuma_can_migrate_task(struct task_struct *p,
int numa, int dst_cpu,
enum cpu_idle_type idle)
{
if (!task_autonuma_cpu(p, dst_cpu)) {
if (numa)
return false;
if (autonuma_sched_load_balance_strict() &&
idle != CPU_NEWLY_IDLE && idle != CPU_IDLE)
return false;
}
return true;
}
void sched_autonuma_dump_mm(void)
{
int nid, cpu;
cpumask_var_t x;
if (!alloc_cpumask_var(&x, GFP_KERNEL))
return;
cpumask_setall(x);
for_each_online_node(nid) {
for_each_cpu(cpu, cpumask_of_node(nid)) {
struct rq *rq = cpu_rq(cpu);
struct mm_struct *mm = rq->curr->mm;
int nr = 0, cpux;
if (!cpumask_test_cpu(cpu, x))
continue;
for_each_cpu(cpux, cpumask_of_node(nid)) {
struct rq *rqx = cpu_rq(cpux);
if (rqx->curr->mm == mm) {
nr++;
cpumask_clear_cpu(cpux, x);
}
}
printk("nid %d mm %p nr %d\n", nid, mm, nr);
}
}
free_cpumask_var(x);
}
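To make the core comparison documented in the big comment above concrete,
here is a tiny standalone sketch (with made-up fault statistics, purely
illustrative and not taken from any real run) of the w_nid/w_other/w_cpu_nid
math that sched_autonuma_balance() evaluates for each busy remote CPU:

/*
 * Illustration only: the core weight comparison from the comment
 * above, fed with made-up per-node fault statistics already scaled
 * to AUTONUMA_BALANCE_SCALE.
 */
#include <stdio.h>

#define AUTONUMA_BALANCE_SCALE 1000

int main(void)
{
	/* hypothetical: we took 800/1000 of our faults on the remote node,
	 * the task on the remote CPU took only 300/1000 of its faults there,
	 * and we took just 200/1000 of our faults on our current node */
	long w_nid = 800;	/* worthiness of moving us to the other CPU */
	long w_other = 300;	/* worthiness of moving the other task here */
	long w_cpu_nid = 200;	/* worthiness of staying where we are */

	if (w_nid > w_other && w_nid > w_cpu_nid) {
		long weight = w_nid - w_other + w_nid - w_cpu_nid;
		printf("candidate exchange, weight %ld (scale %d)\n",
		       weight, AUTONUMA_BALANCE_SCALE);
	} else {
		printf("no exchange: it would ping-pong or not improve\n");
	}
	return 0;
}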
Hi Mauricio and everyone,
On Fri, Jun 01, 2012 at 07:41:09PM -0300, Mauricio Faria de Oliveira wrote:
> I got SPECjbb2005 results for 3.4-rc2 mainline, numasched,
> autonuma-alpha10, and autonuma-alpha13. If you judge the data is OK it
> may suit a comparison between autonuma-alpha13/14 to verify NUMA
> affinity regressions.
>
> The system is an Intel 2-socket Blade. Each NUMA node has 6 cores (+6
> hyperthreads) and 12 GB RAM. Different permutations of THP, KSM, and VM
> memory size were tested for each kernel.
>
> I'll have to leave the analysis of each variable for you, as I'm not
> familiar w/ the code and expected impacts; but I'm perfectly fine with
> providing more details about the tests, environment and procedures, and
> even some reruns, if needed.
So autonuma10 didn't have fully working idle balancing yet; that's
why it's under-performing. My initial regression test didn't verify the
idle balancing; that got fixed in autonuma11 (notably, it also fixes
multi-instance streams).
Your testing methodology was perfect, because you tested with THP off
too, on the host. This rules out the possibility that different
khugepaged default settings could skew the results (when AutoNUMA
engages it boosts khugepaged to offset the fact that THP native migration
isn't available yet, so THP gets split across memory migrations and
we need to collapse them more aggressively).
Another thing I noticed is that with THP off, KSM off, and VM1 < node, on
autonuma13 VM1 gets slightly less priority and scores only 87%
(but VM2/3 score higher than on mainline). It may be a
non-reproducible hyperthreading effect that happens once in a while (the
active balancing probably isn't as fast as it should be and I'm seeing
some effect of that even on upstream without patches when half of the
hyperthreads are idle), but more likely it's one of the bugs that I've
been fixing lately.
If you have time, you may consider running this again on
autonuma18. Lots of changes and improvements have happened since
autonuma13. The amount of memory used by page_autonuma (per page) has
also been significantly reduced from 24 (or 32 since autonuma14) bytes
to 12, and the scheduler should be much faster when overscheduling.
git clone --reference linux -b autonuma18 git://git.kernel.org/pub/scm/linux/kernel/git/andrea/aa.git autonuma18
Two other tweaks to test (only if you have time):
echo 15000 >/sys/kernel/mm/autonuma/knuma_scand/scan_sleep_pass_millisecs
Thanks a lot for your great effort, this was very useful!
Andrea
On 06/21/2012 10:55 PM, Andrea Arcangeli wrote:
> On Thu, Jun 21, 2012 at 03:29:52PM +0800, Alex Shi wrote:
>>> I released an AutoNUMA15 branch that includes all pending fixes:
>>>
>>> git clone --reference linux -b autonuma15 git://git.kernel.org/pub/scm/linux/kernel/git/andrea/aa.git
>>>
>>
>> I did some quick testing with our
>> specjbb2005/oltp/hackbench/tbench/netperf-loop/fio/ffsb benchmarks on NHM EP/EX,
>> Core2 EP, and Romley EP machines. In general, no clear performance change was
>> found. Are these results expected for this patch set?
>
> hackbench and the network benchmarks won't get a benefit (the former
> overschedules like crazy, so there's no way any autonuma balancing can
> have an effect with such overscheduling and a zillion threads; the
> latter are I/O dominated and usually take so little RAM that it doesn't
> matter, since the memory accesses on the kernel side and the DMA issue
> should dominate the CPU utilization). The same goes for filesystem
> benchmarks like fio.
>
> On all _system_ time dominated kernel benchmarks it is expected that you
> won't measure a performance improvement, and if you don't measure a
> regression that's more than enough.
>
> The only benchmarks that get a benefit are userland ones, where the
> user/nice time in top dominates. AutoNUMA cannot optimize or move kernel
> memory around; it only optimizes userland computations.
>
> So you should run HPC jobs. The only strange thing here is that
> specjbb2005 gets a measurable significant boost with AutoNUMA so if
> you didn't even get a boost with that you may want to verify:
>
> cat /sys/kernel/mm/autonuma/enabled == 1
>
> Also verify:
>
> CONFIG_AUTONUMA_DEFAULT_ENABLED=y
>
> If that's 1 well maybe the memory interconnect is so fast that there's
> no benefit?
>
> My numa01/02 benchmarks measure the best and worst case of the hardware
> (not the software) with the -DINVERSE_BIND and -DHARD_BIND parameters; you
> can consider running those to verify.
Could you give a URL for the benchmarks?
>
> Probably there should be a little boot time kernel benchmark to
> measure the inverse bind vs hard bind performance across the first two
> nodes, if the difference is nil AutoNUMA should disengage and not even
> allocate the page_autonuma (now only 12 bytes per page but anyway).
>
> If you can retest with autonuma17 it would help too as there was some
> performance issue fixed and it'd stress the new autonuma migration lru
> code:
>
> git clone --reference linux -b autonuma17 git://git.kernel.org/pub/scm/linux/kernel/git/andrea/aa.git autonuma17
>
> And the very latest is always at the autonuma branch:
>
> git clone --reference linux -b autonuma git://git.kernel.org/pub/scm/linux/kernel/git/andrea/aa.git autonuma
I got the commits up to 2c7535e100805d and retested specjbb2005 with
jrockit and openjdk again on my Romley EP (2P * 8 cores * HT, with 64GB
of memory). I found that openjdk has about a 2% regression, while jrockit has no
clear change.
The testing used 2 instances, each of them pinned to a node. Some
settings are here:
per_jvm_warehouse_rampup = 3.0
per_jvm_warehouse_rampdown = 20.0
jvm_instances = 2
deterministic_random_seed = false
ramp_up_seconds = 30
measurement_seconds = 240
starting_number_warehouses = 1
increment_number_warehouses = 1
ending_number_warehouses = 34
expected_peak_warehouse = 16
openjdk
java options:
-Xmx8g -Xms8g -Xincgc
jrockit use hugetlb and its options:
-Xmx8g -Xms8g -Xns4g -XXaggressive -Xlargepages -XXlazyUnlocking
-Xgc:genpar -XXtlasize:min=16k,preferred=64k
>
> Thanks,
> Andrea
On Tue, Jun 26, 2012 at 03:52:26PM +0800, Alex Shi wrote:
> Could you give a URL for the benchmarks?
I posted them to lkml a few months ago; I'm attaching them here. There
is actually a more polished version around that I haven't had time to
test yet. For now I'm attaching the old version, which I'm still
using to verify the regressions.
If you edit the .c files to set the right hard/inverse binds, and
then build with -DHARD_BIND and later -DINVERSE_BIND, you can measure
the NUMA effects of your hardware. numactl --hardware will
give you the topology so you can check whether the bindings in the code
are right for your machine.
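For reference, a minimal sketch of what a HARD_BIND/INVERSE_BIND build switch
can look like with libnuma; this is only an illustration of the idea, not
Andrea's actual numa01.c, and it assumes a machine with at least two online
NUMA nodes:

/* Illustration only (not the actual numa01.c): bind the CPU and the
 * memory either to the same node (HARD_BIND) or to different nodes
 * (INVERSE_BIND), assuming libnuma and at least two online nodes.
 * Build e.g. with: gcc -O2 -DHARD_BIND bind_sketch.c -lnuma
 */
#include <numa.h>
#include <stdlib.h>
#include <string.h>

#define SIZE (512UL << 20)	/* 512MB working set, arbitrary */

int main(void)
{
	char *buf;
	int i;

	if (numa_available() < 0)
		return 1;
#if defined(HARD_BIND)
	numa_run_on_node(0);			/* CPU on node 0 */
	buf = numa_alloc_onnode(SIZE, 0);	/* memory on node 0 */
#elif defined(INVERSE_BIND)
	numa_run_on_node(0);			/* CPU on node 0 */
	buf = numa_alloc_onnode(SIZE, 1);	/* memory on node 1 */
#else
	buf = numa_alloc_local(SIZE);		/* let the kernel decide */
#endif
	if (!buf)
		return 1;
	/* touch the memory repeatedly so the interconnect cost shows up */
	for (i = 0; i < 100; i++)
		memset(buf, i, SIZE);
	numa_free(buf, SIZE);
	return 0;
}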
> of memory). I found that openjdk has about a 2% regression, while jrockit has no
A 2% regression is, in the worst case, the cost of the NUMA hinting page
faults (or, in the best case, a measurement error) when you get no benefit
from the vastly increased NUMA affinity.
You can reduce that overhead to below 1% by multiplying
/sys/kernel/mm/autonuma/knuma_scand/scan_sleep_millisecs and
/sys/kernel/mm/autonuma/knuma_scand/scan_sleep_pass_millisecs by 2 or 3
times. The latter especially, if set to 15000, will reduce the overhead by 1%.
The current AutoNUMA defaults are hyper aggressive; with benchmarks
running for several minutes you can easily reduce AutoNUMA's
aggressiveness to pay a lower fixed cost in the NUMA hinting page
faults without reducing overall performance.
The boost when you use AutoNUMA is >20%, sometimes as high as 100%, so
the 2% is lost in the noise, but over time we should reduce it
(especially with a hypervisor-tuned profile for those cloud nodes that
only run virtual machines, which in turn have quite constant loads where
there's no need to react that fast).
> The testing used 2 instances, each of them pinned to a node. Some
> settings are here:
Ok the problem is that you must not pin anything. If you hard pin
AutoNUMA won't do anything on those processes.
It is impossible to run faster than raw hard pinning, because AutoNUMA
also has to migrate memory, while hard pinning avoids all
memory migrations.
AutoNUMA aims to get as close to hard-pinning performance as
possible without having to use hard pinning; that's the whole point.
So this explains why you measure a 2% regression or no difference:
with hard pins used at all times, only the AutoNUMA worst-case overhead
can be measured (and I explained above how it can be reduced).
A plan I can suggest for this benchmark is this:
1) "upstream default"
- no hugetlbfs (AutoNUMA cannot migrate hugetlbfs memory)
- no hard pinning of CPUs or memory to nodes
- CONFIG_AUTONUMA=n
- CONFIG_TRANSPARENT_HUGEPAGE=y
2) "autonuma"
- no hugetlbfs (AutoNUMA cannot migrate hugetlbfs memory)
- no hard pinning of CPUs or memory to nodes
- CONFIG_AUTONUMA=y
- CONFIG_AUTONUMA_DEFAULT_ENABLED=y
- CONFIG_TRANSPARENT_HUGEPAGE=y
3) "autonuma lower numa hinting page fault overhead"
- no hugetlbfs (AutoNUMA cannot migrate hugetlbfs memory)
- no hard pinning of CPUs or memory to nodes
- CONFIG_AUTONUMA=y
- CONFIG_AUTONUMA_DEFAULT_ENABLED=y
- CONFIG_TRANSPARENT_HUGEPAGE=y
- echo 15000 >/sys/kernel/mm/autonuma/knuma_scand/scan_sleep_pass_millisecs
4) "upstream hard pinning and transparent hugepage"
- hard pinning of CPUs or memory to nodes
- CONFIG_AUTONUMA=n
- CONFIG_TRANSPARENT_HUGEPAGE=y
5) "upstream hard pinning and hugetlbfs"
- hugetlbfs
- hard pinning of CPUs or memory to nodes
- CONFIG_AUTONUMA=n
- CONFIG_TRANSPARENT_HUGEPAGE=y (y/n won't matter if you use hugetlbfs)
Then you can compare 1/2/3/4/5.
The minimum to make a meaningful comparison is 1 vs 2. The next best
comparison is 1 vs 2 vs 4 (4 is a very useful reference too because the
closer AutoNUMA gets to 4 the better! Beating 1 is trivial; getting
very close to 4 is less easy because 4 isn't migrating any memory).
Running 3 and 5 is optional; I mentioned 5 mainly because you
like to run with hugetlbfs and not just THP.
> jrockit use hugetlb and its options:
hugetlbfs should be disabled when AutoNUMA is enabled because AutoNUMA
won't try to migrate hugetlbfs memory (not that it makes any
difference if the memory is hard pinned). THP should deliver the same
performance as hugetlbfs for the JVM, and THP memory can be migrated by
AutoNUMA (as well as mmapped not-shared pagecache, not just anon
memory).
Thanks a lot, and looking forward to seeing how things go when you
remove the hard pins.
Andrea