Hello everyone,
It's time for a new AutoNUMA19 release.
The objective of AutoNUMA is to perform as close as possible to (and
sometimes faster than) hard NUMA CPU/memory binding setups, without
requiring the administrator to manually set up any NUMA hard bindings.
https://www.kernel.org/pub/linux/kernel/people/andrea/autonuma/autonuma_bench-20120530.pdf
(NOTE: the TODO slide is obsolete)
git clone --reference linux -b autonuma19 git://git.kernel.org/pub/scm/linux/kernel/git/andrea/aa.git autonuma19
Development autonuma branch:
git clone --reference linux -b autonuma git://git.kernel.org/pub/scm/linux/kernel/git/andrea/aa.git
To update:
git fetch
git checkout -f origin/autonuma
Changelog from AutoNUMA-alpha14 to AutoNUMA19:
o sched_autonuma_balance callout removed from schedule(); it now runs
in the softirq along with the CFS load balancing
o lots of documentation about the math in the sched_autonuma_balance algorithm
o fixed a bug in the fast path detection in sched_autonuma_balance that could
decrease performance with many nodes
o reduced the page_autonuma memory overhead from 32 to 12 bytes per page
o fixed a crash in __pmd_numa_fixup
o knuma_scand won't scan VM_MIXEDMAP|VM_PFNMAP vmas (it never touched those
ptes anyway)
o fixed a crash in autonuma_exit
o fixed a crash when split_huge_page returns 0 in knuma_migratedN as the page
has been freed already
o assorted cleanups and probably more
Changelog from alpha13 to alpha14:
o page_autonuma introduction, no memory wasted if the kernel is booted
on non-NUMA hardware. Tested with flatmem/sparsemem on x86 with
autonuma=y/n and sparsemem/vsparsemem on x86_64 with autonuma=y/n.
The "noautonuma" kernel param disables autonuma permanently, even when
booted on NUMA hardware (no /sys/kernel/mm/autonuma, and no
page_autonuma allocations, like cgroup_disable=memory)
o autonuma_balance only runs along with run_rebalance_domains, to
avoid altering the usual scheduler runtime. autonuma_balance gives a
"kick" to the scheduler after a rebalance (it overrides the load
balance activity if needed). It's not yet tested on specjbb or other
scheduling-intensive benchmarks; hopefully there's no NUMA
regression. For compute-intensive loads not involving a flood of
scheduling activity this doesn't show any performance regression,
and it avoids altering the strict scheduler performance. It goes in
the direction of being less intrusive with the stock scheduler
runtime.
Note: autonuma_balance still runs from normal context (not softirq
context like run_rebalance_domains) to be able to wait on process
migration (avoid _nowait), but most of the time it does nothing at
all.
Changelog from alpha11 to alpha13:
o autonuma_balance optimization (take the fast path when the process is in
the preferred NUMA node)
TODO:
o THP native migration (orthogonal and also needed for
cpuset/migrate_pages(2)/numa/sched).
o port to ppc64, Ben? Any arch able to support PROT_NONE can also support
AutoNUMA; in short, all archs should work fine with AutoNUMA.
Andrea Arcangeli (40):
mm: add unlikely to the mm allocation failure check
autonuma: make set_pmd_at always available
autonuma: export is_vma_temporary_stack() even if
CONFIG_TRANSPARENT_HUGEPAGE=n
xen: document Xen is using an unused bit for the pagetables
autonuma: define _PAGE_NUMA_PTE and _PAGE_NUMA_PMD
autonuma: x86 pte_numa() and pmd_numa()
autonuma: generic pte_numa() and pmd_numa()
autonuma: teach gup_fast about pte_numa
autonuma: introduce kthread_bind_node()
autonuma: mm_autonuma and sched_autonuma data structures
autonuma: define the autonuma flags
autonuma: core autonuma.h header
autonuma: CPU follow memory algorithm
autonuma: add page structure fields
autonuma: knuma_migrated per NUMA node queues
autonuma: init knuma_migrated queues
autonuma: autonuma_enter/exit
autonuma: call autonuma_setup_new_exec()
autonuma: alloc/free/init sched_autonuma
autonuma: alloc/free/init mm_autonuma
autonuma: avoid CFS select_task_rq_fair to return -1
autonuma: teach CFS about autonuma affinity
autonuma: sched_set_autonuma_need_balance
autonuma: core
autonuma: follow_page check for pte_numa/pmd_numa
autonuma: default mempolicy follow AutoNUMA
autonuma: call autonuma_split_huge_page()
autonuma: make khugepaged pte_numa aware
autonuma: retain page last_nid information in khugepaged
autonuma: numa hinting page faults entry points
autonuma: reset autonuma page data when pages are freed
autonuma: initialize page structure fields
autonuma: link mm/autonuma.o and kernel/sched/numa.o
autonuma: add CONFIG_AUTONUMA and CONFIG_AUTONUMA_DEFAULT_ENABLED
autonuma: boost khugepaged scanning rate
autonuma: page_autonuma
autonuma: page_autonuma change #include for sparse
autonuma: autonuma_migrate_head[0] dynamic size
autonuma: bugcheck page_autonuma fields on newly allocated pages
autonuma: shrink the per-page page_autonuma struct size
arch/x86/include/asm/paravirt.h | 2 -
arch/x86/include/asm/pgtable.h | 51 ++-
arch/x86/include/asm/pgtable_types.h | 22 +-
arch/x86/mm/gup.c | 2 +-
arch/x86/mm/numa.c | 6 +-
arch/x86/mm/numa_32.c | 3 +-
fs/exec.c | 3 +
include/asm-generic/pgtable.h | 12 +
include/linux/autonuma.h | 64 ++
include/linux/autonuma_flags.h | 68 ++
include/linux/autonuma_list.h | 94 ++
include/linux/autonuma_sched.h | 50 ++
include/linux/autonuma_types.h | 130 +++
include/linux/huge_mm.h | 6 +-
include/linux/kthread.h | 1 +
include/linux/memory_hotplug.h | 3 +-
include/linux/mm_types.h | 5 +
include/linux/mmzone.h | 25 +
include/linux/page_autonuma.h | 59 ++
include/linux/sched.h | 5 +-
init/main.c | 2 +
kernel/fork.c | 36 +-
kernel/kthread.c | 23 +
kernel/sched/Makefile | 1 +
kernel/sched/core.c | 1 +
kernel/sched/fair.c | 72 ++-
kernel/sched/numa.c | 586 +++++++++++++
kernel/sched/sched.h | 18 +
mm/Kconfig | 13 +
mm/Makefile | 1 +
mm/autonuma.c | 1549 ++++++++++++++++++++++++++++++++++
mm/autonuma_list.c | 167 ++++
mm/huge_memory.c | 59 ++-
mm/memory.c | 35 +-
mm/memory_hotplug.c | 2 +-
mm/mempolicy.c | 15 +-
mm/mmu_context.c | 2 +
mm/page_alloc.c | 5 +
mm/page_autonuma.c | 236 ++++++
mm/sparse.c | 126 +++-
40 files changed, 3512 insertions(+), 48 deletions(-)
create mode 100644 include/linux/autonuma.h
create mode 100644 include/linux/autonuma_flags.h
create mode 100644 include/linux/autonuma_list.h
create mode 100644 include/linux/autonuma_sched.h
create mode 100644 include/linux/autonuma_types.h
create mode 100644 include/linux/page_autonuma.h
create mode 100644 kernel/sched/numa.c
create mode 100644 mm/autonuma.c
create mode 100644 mm/autonuma_list.c
create mode 100644 mm/page_autonuma.c
This is where the mm_autonuma structure is being handled. Just like
sched_autonuma, this is only allocated at runtime if the hardware the
kernel is running on has been detected as NUMA. On non-NUMA hardware
the memory cost is reduced to one pointer per mm.
To get rid of the pointer in each mm, the kernel can be compiled
with CONFIG_AUTONUMA=n.
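The patch below only wires alloc_mm_autonuma()/free_mm_autonuma() into
fork and mmdrop; for readers who want to picture the allocation side,
here is a minimal sketch of the "pay only one pointer on non-NUMA
hardware" pattern described above. It is an illustration, not the code
added by this series: the mm_autonuma_cachep cache name and the
mm->mm_autonuma field layout are assumptions, only autonuma_impossible()
comes from the series itself.

#include <linux/slab.h>
#include <linux/mm_types.h>
#include <linux/autonuma.h>

/* assumed: a dedicated slab cache created by mm_autonuma_init() */
static struct kmem_cache *mm_autonuma_cachep;

int alloc_mm_autonuma(struct mm_struct *mm)
{
	if (autonuma_impossible())
		return 0;	/* non-NUMA: mm->mm_autonuma stays NULL */
	mm->mm_autonuma = kmem_cache_zalloc(mm_autonuma_cachep, GFP_KERNEL);
	return mm->mm_autonuma ? 0 : -ENOMEM;
}

void free_mm_autonuma(struct mm_struct *mm)
{
	if (mm->mm_autonuma) {
		kmem_cache_free(mm_autonuma_cachep, mm->mm_autonuma);
		mm->mm_autonuma = NULL;
	}
}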
Signed-off-by: Andrea Arcangeli <[email protected]>
---
kernel/fork.c | 7 +++++++
1 files changed, 7 insertions(+), 0 deletions(-)
diff --git a/kernel/fork.c b/kernel/fork.c
index 0adbe09..3e5a0d9 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -527,6 +527,8 @@ static void mm_init_aio(struct mm_struct *mm)
static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p)
{
+ if (unlikely(alloc_mm_autonuma(mm)))
+ goto out_free_mm;
atomic_set(&mm->mm_users, 1);
atomic_set(&mm->mm_count, 1);
init_rwsem(&mm->mmap_sem);
@@ -549,6 +551,8 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p)
return mm;
}
+ free_mm_autonuma(mm);
+out_free_mm:
free_mm(mm);
return NULL;
}
@@ -598,6 +602,7 @@ void __mmdrop(struct mm_struct *mm)
destroy_context(mm);
mmu_notifier_mm_destroy(mm);
check_mm(mm);
+ free_mm_autonuma(mm);
free_mm(mm);
}
EXPORT_SYMBOL_GPL(__mmdrop);
@@ -880,6 +885,7 @@ fail_nocontext:
* If init_new_context() failed, we cannot use mmput() to free the mm
* because it calls destroy_context()
*/
+ free_mm_autonuma(mm);
mm_free_pgd(mm);
free_mm(mm);
return NULL;
@@ -1702,6 +1708,7 @@ void __init proc_caches_init(void)
mm_cachep = kmem_cache_create("mm_struct",
sizeof(struct mm_struct), ARCH_MIN_MMSTRUCT_ALIGN,
SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_NOTRACK, NULL);
+ mm_autonuma_init();
vm_area_cachep = KMEM_CACHE(vm_area_struct, SLAB_PANIC);
mmap_init();
nsproxy_cache_init();
Link the AutoNUMA core and scheduler object files in the kernel if
CONFIG_AUTONUMA=y.
Signed-off-by: Andrea Arcangeli <[email protected]>
---
kernel/sched/Makefile | 1 +
mm/Makefile | 1 +
2 files changed, 2 insertions(+), 0 deletions(-)
diff --git a/kernel/sched/Makefile b/kernel/sched/Makefile
index 173ea52..783a840 100644
--- a/kernel/sched/Makefile
+++ b/kernel/sched/Makefile
@@ -16,3 +16,4 @@ obj-$(CONFIG_SMP) += cpupri.o
obj-$(CONFIG_SCHED_AUTOGROUP) += auto_group.o
obj-$(CONFIG_SCHEDSTATS) += stats.o
obj-$(CONFIG_SCHED_DEBUG) += debug.o
+obj-$(CONFIG_AUTONUMA) += numa.o
diff --git a/mm/Makefile b/mm/Makefile
index 2e2fbbe..15900fd 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -33,6 +33,7 @@ obj-$(CONFIG_FRONTSWAP) += frontswap.o
obj-$(CONFIG_HAS_DMA) += dmapool.o
obj-$(CONFIG_HUGETLBFS) += hugetlb.o
obj-$(CONFIG_NUMA) += mempolicy.o
+obj-$(CONFIG_AUTONUMA) += autonuma.o
obj-$(CONFIG_SPARSEMEM) += sparse.o
obj-$(CONFIG_SPARSEMEM_VMEMMAP) += sparse-vmemmap.o
obj-$(CONFIG_SLOB) += slob.o
Debug tweak.
Signed-off-by: Andrea Arcangeli <[email protected]>
---
include/linux/autonuma.h | 11 +++++++++++
mm/page_alloc.c | 1 +
2 files changed, 12 insertions(+), 0 deletions(-)
diff --git a/include/linux/autonuma.h b/include/linux/autonuma.h
index 67af86a..05bd8c1 100644
--- a/include/linux/autonuma.h
+++ b/include/linux/autonuma.h
@@ -29,6 +29,16 @@ static inline void autonuma_free_page(struct page *page)
}
}
+static inline void autonuma_check_new_page(struct page *page)
+{
+ struct page_autonuma *page_autonuma;
+ if (!autonuma_impossible()) {
+ page_autonuma = lookup_page_autonuma(page);
+ BUG_ON(page_autonuma->autonuma_migrate_nid != -1);
+ BUG_ON(page_autonuma->autonuma_last_nid != -1);
+ }
+}
+
#define autonuma_printk(format, args...) \
if (autonuma_debug()) printk(format, ##args)
@@ -41,6 +51,7 @@ static inline void autonuma_migrate_split_huge_page(struct page *page,
struct page *page_tail) {}
static inline void autonuma_setup_new_exec(struct task_struct *p) {}
static inline void autonuma_free_page(struct page *page) {}
+static inline void autonuma_check_new_page(struct page *page) {}
#endif /* CONFIG_AUTONUMA */
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 2d53a1f..5943ed2 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -833,6 +833,7 @@ static inline int check_new_page(struct page *page)
bad_page(page);
return 1;
}
+ autonuma_check_new_page(page);
return 0;
}
set_pmd_at() will also be used for the knuma_scand/pmd = 1 (default)
mode even when CONFIG_TRANSPARENT_HUGEPAGE=n. Make it available so the
build won't fail.
Signed-off-by: Andrea Arcangeli <[email protected]>
---
arch/x86/include/asm/paravirt.h | 2 --
1 files changed, 0 insertions(+), 2 deletions(-)
diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index 6cbbabf..e99fb37 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -564,7 +564,6 @@ static inline void set_pte_at(struct mm_struct *mm, unsigned long addr,
PVOP_VCALL4(pv_mmu_ops.set_pte_at, mm, addr, ptep, pte.pte);
}
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
static inline void set_pmd_at(struct mm_struct *mm, unsigned long addr,
pmd_t *pmdp, pmd_t pmd)
{
@@ -575,7 +574,6 @@ static inline void set_pmd_at(struct mm_struct *mm, unsigned long addr,
PVOP_VCALL4(pv_mmu_ops.set_pmd_at, mm, addr, pmdp,
native_pmd_val(pmd));
}
-#endif
static inline void set_pmd(pmd_t *pmdp, pmd_t pmd)
{
The first gear in the whole AutoNUMA algorithm is knuma_scand. If
knuma_scand doesn't run, AutoNUMA is a full bypass. If knuma_scand is
stopped, soon all other AutoNUMA gears will settle down too.
knuma_scand is the daemon that sets the pmd_numa and pte_numa bits and
allows the NUMA hinting page faults to start; all other actions then
follow as a reaction to that.
knuma_scand scans a list of "mm" structures, and this is where we
register and unregister each "mm" with AutoNUMA so that knuma_scand
can scan them.
Signed-off-by: Andrea Arcangeli <[email protected]>
---
kernel/fork.c | 3 +++
1 files changed, 3 insertions(+), 0 deletions(-)
diff --git a/kernel/fork.c b/kernel/fork.c
index 5fcfa70..d3c064c 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -70,6 +70,7 @@
#include <linux/khugepaged.h>
#include <linux/signalfd.h>
#include <linux/uprobes.h>
+#include <linux/autonuma.h>
#include <asm/pgtable.h>
#include <asm/pgalloc.h>
@@ -540,6 +541,7 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p)
if (likely(!mm_alloc_pgd(mm))) {
mm->def_flags = 0;
mmu_notifier_mm_init(mm);
+ autonuma_enter(mm);
return mm;
}
@@ -608,6 +610,7 @@ void mmput(struct mm_struct *mm)
exit_aio(mm);
ksm_exit(mm);
khugepaged_exit(mm); /* must run before exit_mmap */
+ autonuma_exit(mm); /* must run before exit_mmap */
exit_mmap(mm);
set_mm_exe_file(mm, NULL);
if (!list_empty(&mm->mmlist)) {
is_vma_temporary_stack() is needed by mm/autonuma.c too, and without
this the build breaks with CONFIG_TRANSPARENT_HUGEPAGE=n.
Reported-by: Petr Holasek <[email protected]>
Signed-off-by: Andrea Arcangeli <[email protected]>
---
include/linux/huge_mm.h | 4 ++--
1 files changed, 2 insertions(+), 2 deletions(-)
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 4c59b11..ad4e2e0 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -54,13 +54,13 @@ extern pmd_t *page_check_address_pmd(struct page *page,
#define HPAGE_PMD_ORDER (HPAGE_PMD_SHIFT-PAGE_SHIFT)
#define HPAGE_PMD_NR (1<<HPAGE_PMD_ORDER)
+extern bool is_vma_temporary_stack(struct vm_area_struct *vma);
+
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
#define HPAGE_PMD_SHIFT HPAGE_SHIFT
#define HPAGE_PMD_MASK HPAGE_MASK
#define HPAGE_PMD_SIZE HPAGE_SIZE
-extern bool is_vma_temporary_stack(struct vm_area_struct *vma);
-
#define transparent_hugepage_enabled(__vma) \
((transparent_hugepage_flags & \
(1<<TRANSPARENT_HUGEPAGE_FLAG) || \
This resets all per-thread and per-process statistics across exec
syscalls or after kernel threads detach from the mm. The past
statistical NUMA information is unlikely to be relevant for the future
in these cases.
Signed-off-by: Andrea Arcangeli <[email protected]>
---
fs/exec.c | 3 +++
mm/mmu_context.c | 2 ++
2 files changed, 5 insertions(+), 0 deletions(-)
diff --git a/fs/exec.c b/fs/exec.c
index da27b91..146ced2 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -55,6 +55,7 @@
#include <linux/pipe_fs_i.h>
#include <linux/oom.h>
#include <linux/compat.h>
+#include <linux/autonuma.h>
#include <asm/uaccess.h>
#include <asm/mmu_context.h>
@@ -1172,6 +1173,8 @@ void setup_new_exec(struct linux_binprm * bprm)
flush_signal_handlers(current, 0);
flush_old_files(current->files);
+
+ autonuma_setup_new_exec(current);
}
EXPORT_SYMBOL(setup_new_exec);
diff --git a/mm/mmu_context.c b/mm/mmu_context.c
index 3dcfaf4..40f0f13 100644
--- a/mm/mmu_context.c
+++ b/mm/mmu_context.c
@@ -7,6 +7,7 @@
#include <linux/mmu_context.h>
#include <linux/export.h>
#include <linux/sched.h>
+#include <linux/autonuma.h>
#include <asm/mmu_context.h>
@@ -58,5 +59,6 @@ void unuse_mm(struct mm_struct *mm)
/* active_mm is still 'mm' */
enter_lazy_tlb(mm, tsk);
task_unlock(tsk);
+ autonuma_setup_new_exec(tsk);
}
EXPORT_SYMBOL_GPL(unuse_mm);
This function makes it easy to bind the per-node knuma_migrated
threads to their respective NUMA nodes. Those threads take memory from
the other nodes (in round robin, with an incoming queue for each remote
node) and move that memory to their local node.
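A hedged usage sketch of the new helper (knuma_migrated() and
start_knuma_migrated() are placeholders here; the real callers appear
later in the series):

#include <linux/kthread.h>
#include <linux/mmzone.h>
#include <linux/sched.h>
#include <linux/err.h>

static int knuma_migrated(void *pgdat);	/* placeholder thread function */

static int start_knuma_migrated(int nid)
{
	struct task_struct *p;

	p = kthread_create_on_node(knuma_migrated, NODE_DATA(nid), nid,
				   "knuma_migrated%d", nid);
	if (IS_ERR(p))
		return PTR_ERR(p);
	/* restrict the daemon to the CPUs of its own node */
	kthread_bind_node(p, nid);
	wake_up_process(p);
	return 0;
}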
Signed-off-by: Andrea Arcangeli <[email protected]>
---
include/linux/kthread.h | 1 +
include/linux/sched.h | 2 +-
kernel/kthread.c | 23 +++++++++++++++++++++++
3 files changed, 25 insertions(+), 1 deletions(-)
diff --git a/include/linux/kthread.h b/include/linux/kthread.h
index 0714b24..e733f97 100644
--- a/include/linux/kthread.h
+++ b/include/linux/kthread.h
@@ -33,6 +33,7 @@ struct task_struct *kthread_create_on_node(int (*threadfn)(void *data),
})
void kthread_bind(struct task_struct *k, unsigned int cpu);
+void kthread_bind_node(struct task_struct *p, int nid);
int kthread_stop(struct task_struct *k);
int kthread_should_stop(void);
bool kthread_freezable_should_stop(bool *was_frozen);
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 4059c0f..699324c 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1792,7 +1792,7 @@ extern void thread_group_times(struct task_struct *p, cputime_t *ut, cputime_t *
#define PF_SWAPWRITE 0x00800000 /* Allowed to write to swap */
#define PF_SPREAD_PAGE 0x01000000 /* Spread page cache over cpuset */
#define PF_SPREAD_SLAB 0x02000000 /* Spread some slab caches over cpuset */
-#define PF_THREAD_BOUND 0x04000000 /* Thread bound to specific cpu */
+#define PF_THREAD_BOUND 0x04000000 /* Thread bound to specific cpus */
#define PF_MCE_EARLY 0x08000000 /* Early kill for mce process policy */
#define PF_MEMPOLICY 0x10000000 /* Non-default NUMA mempolicy */
#define PF_MUTEX_TESTER 0x20000000 /* Thread belongs to the rt mutex tester */
diff --git a/kernel/kthread.c b/kernel/kthread.c
index 3d3de63..48b36f9 100644
--- a/kernel/kthread.c
+++ b/kernel/kthread.c
@@ -234,6 +234,29 @@ void kthread_bind(struct task_struct *p, unsigned int cpu)
EXPORT_SYMBOL(kthread_bind);
/**
+ * kthread_bind_node - bind a just-created kthread to the CPUs of a node.
+ * @p: thread created by kthread_create().
+ * @nid: node (might not be online, must be possible) for @p to run on.
+ *
+ * Description: This function is equivalent to set_cpus_allowed(),
+ * except that @nid doesn't need to be online, and the thread must be
+ * stopped (i.e., just returned from kthread_create()).
+ */
+void kthread_bind_node(struct task_struct *p, int nid)
+{
+ /* Must have done schedule() in kthread() before we set_task_cpu */
+ if (!wait_task_inactive(p, TASK_UNINTERRUPTIBLE)) {
+ WARN_ON(1);
+ return;
+ }
+
+ /* It's safe because the task is inactive. */
+ do_set_cpus_allowed(p, cpumask_of_node(nid));
+ p->flags |= PF_THREAD_BOUND;
+}
+EXPORT_SYMBOL(kthread_bind_node);
+
+/**
* kthread_stop - stop a thread created by kthread_create().
* @k: thread created by kthread_create().
*
This implements the knuma_migrated queues. Pages are added to these
queues through the NUMA hinting page faults (the memory-follow-CPU
algorithm with false sharing evaluation), and knuma_migrated is then
woken with a certain hysteresis to migrate the memory in round robin
from all remote nodes to its local node.
The head that belongs to the local node that knuma_migrated runs on
must remain empty for now and is not being used.
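To make the queue layout concrete: the lists form a matrix indexed by
(destination node, source node). A page currently on node page_nid that
should move to dst_nid is queued on the destination node's pgdat, in
the slot of its current node. A simplified sketch follows (the real
__autonuma_migrate_page_add shown later in the series also takes the
compound lock, requeues pages already queued elsewhere and applies
hysteresis before waking the daemon):

/* simplified: queue a page for migration towards dst_nid */
static void queue_for_knuma_migrated(struct page *page, int dst_nid)
{
	int page_nid = page_to_nid(page);
	struct pglist_data *pgdat = NODE_DATA(dst_nid);

	spin_lock(&pgdat->autonuma_lock);
	list_add(&page->autonuma_migrate_node,
		 &pgdat->autonuma_migrate_head[page_nid]);
	pgdat->autonuma_nr_migrate_pages += hpage_nr_pages(page);
	spin_unlock(&pgdat->autonuma_lock);

	/* knuma_migrated of dst_nid sleeps on this waitqueue */
	wake_up_interruptible(&pgdat->autonuma_knuma_migrated_wait);
}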
Signed-off-by: Andrea Arcangeli <[email protected]>
---
include/linux/mmzone.h | 6 ++++++
1 files changed, 6 insertions(+), 0 deletions(-)
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 2427706..d53b26a 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -697,6 +697,12 @@ typedef struct pglist_data {
struct task_struct *kswapd;
int kswapd_max_order;
enum zone_type classzone_idx;
+#ifdef CONFIG_AUTONUMA
+ spinlock_t autonuma_lock;
+ struct list_head autonuma_migrate_head[MAX_NUMNODES];
+ unsigned long autonuma_nr_migrate_pages;
+ wait_queue_head_t autonuma_knuma_migrated_wait;
+#endif
} pg_data_t;
#define node_present_pages(nid) (NODE_DATA(nid)->node_present_pages)
Initialize the AutoNUMA page structure fields at boot.
Signed-off-by: Andrea Arcangeli <[email protected]>
---
mm/page_alloc.c | 4 ++++
1 files changed, 4 insertions(+), 0 deletions(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 841d964..8c4ae8e 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3729,6 +3729,10 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
set_pageblock_migratetype(page, MIGRATE_MOVABLE);
INIT_LIST_HEAD(&page->lru);
+#ifdef CONFIG_AUTONUMA
+ page->autonuma_last_nid = -1;
+ page->autonuma_migrate_nid = -1;
+#endif
#ifdef WANT_PAGE_VIRTUAL
/* The shift won't overflow because ZONE_NORMAL is below 4G. */
if (!is_highmem_idx(zone))
If any of the ptes that khugepaged is collapsing was a pte_numa, the
resulting trans huge pmd will be a pmd_numa too.
Signed-off-by: Andrea Arcangeli <[email protected]>
---
mm/huge_memory.c | 13 +++++++++++--
1 files changed, 11 insertions(+), 2 deletions(-)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 55fc72d..094f82b 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1799,12 +1799,13 @@ out:
return isolated;
}
-static void __collapse_huge_page_copy(pte_t *pte, struct page *page,
+static bool __collapse_huge_page_copy(pte_t *pte, struct page *page,
struct vm_area_struct *vma,
unsigned long address,
spinlock_t *ptl)
{
pte_t *_pte;
+ bool mknuma = false;
for (_pte = pte; _pte < pte+HPAGE_PMD_NR; _pte++) {
pte_t pteval = *_pte;
struct page *src_page;
@@ -1832,11 +1833,15 @@ static void __collapse_huge_page_copy(pte_t *pte, struct page *page,
page_remove_rmap(src_page);
spin_unlock(ptl);
free_page_and_swap_cache(src_page);
+
+ mknuma |= pte_numa(pteval);
}
address += PAGE_SIZE;
page++;
}
+
+ return mknuma;
}
static void collapse_huge_page(struct mm_struct *mm,
@@ -1854,6 +1859,7 @@ static void collapse_huge_page(struct mm_struct *mm,
spinlock_t *ptl;
int isolated;
unsigned long hstart, hend;
+ bool mknuma = false;
VM_BUG_ON(address & ~HPAGE_PMD_MASK);
#ifndef CONFIG_NUMA
@@ -1972,7 +1978,8 @@ static void collapse_huge_page(struct mm_struct *mm,
*/
anon_vma_unlock(vma->anon_vma);
- __collapse_huge_page_copy(pte, new_page, vma, address, ptl);
+ mknuma = pmd_numa(_pmd);
+ mknuma |= __collapse_huge_page_copy(pte, new_page, vma, address, ptl);
pte_unmap(pte);
__SetPageUptodate(new_page);
pgtable = pmd_pgtable(_pmd);
@@ -1982,6 +1989,8 @@ static void collapse_huge_page(struct mm_struct *mm,
_pmd = mk_pmd(new_page, vma->vm_page_prot);
_pmd = maybe_pmd_mkwrite(pmd_mkdirty(_pmd), vma);
_pmd = pmd_mkhuge(_pmd);
+ if (mknuma)
+ _pmd = pmd_mknuma(_pmd);
/*
* spin_lock() below is not the equivalent of smp_wmb(), so
Xen has taken over the last reserved bit available for the pagetables,
which is set through ioremap. This documents it and makes the code
more readable.
Signed-off-by: Andrea Arcangeli <[email protected]>
---
arch/x86/include/asm/pgtable_types.h | 11 +++++++++--
1 files changed, 9 insertions(+), 2 deletions(-)
diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
index 013286a..b74cac9 100644
--- a/arch/x86/include/asm/pgtable_types.h
+++ b/arch/x86/include/asm/pgtable_types.h
@@ -17,7 +17,7 @@
#define _PAGE_BIT_PAT 7 /* on 4KB pages */
#define _PAGE_BIT_GLOBAL 8 /* Global TLB entry PPro+ */
#define _PAGE_BIT_UNUSED1 9 /* available for programmer */
-#define _PAGE_BIT_IOMAP 10 /* flag used to indicate IO mapping */
+#define _PAGE_BIT_UNUSED2 10
#define _PAGE_BIT_HIDDEN 11 /* hidden by kmemcheck */
#define _PAGE_BIT_PAT_LARGE 12 /* On 2MB or 1GB pages */
#define _PAGE_BIT_SPECIAL _PAGE_BIT_UNUSED1
@@ -41,7 +41,7 @@
#define _PAGE_PSE (_AT(pteval_t, 1) << _PAGE_BIT_PSE)
#define _PAGE_GLOBAL (_AT(pteval_t, 1) << _PAGE_BIT_GLOBAL)
#define _PAGE_UNUSED1 (_AT(pteval_t, 1) << _PAGE_BIT_UNUSED1)
-#define _PAGE_IOMAP (_AT(pteval_t, 1) << _PAGE_BIT_IOMAP)
+#define _PAGE_UNUSED2 (_AT(pteval_t, 1) << _PAGE_BIT_UNUSED2)
#define _PAGE_PAT (_AT(pteval_t, 1) << _PAGE_BIT_PAT)
#define _PAGE_PAT_LARGE (_AT(pteval_t, 1) << _PAGE_BIT_PAT_LARGE)
#define _PAGE_SPECIAL (_AT(pteval_t, 1) << _PAGE_BIT_SPECIAL)
@@ -49,6 +49,13 @@
#define _PAGE_SPLITTING (_AT(pteval_t, 1) << _PAGE_BIT_SPLITTING)
#define __HAVE_ARCH_PTE_SPECIAL
+/* flag used to indicate IO mapping */
+#ifdef CONFIG_XEN
+#define _PAGE_IOMAP (_AT(pteval_t, 1) << _PAGE_BIT_UNUSED2)
+#else
+#define _PAGE_IOMAP (_AT(pteval_t, 0))
+#endif
+
#ifdef CONFIG_KMEMCHECK
#define _PAGE_HIDDEN (_AT(pteval_t, 1) << _PAGE_BIT_HIDDEN)
#else
This is the generic autonuma.h header that defines the generic
AutoNUMA-specific functions like autonuma_setup_new_exec,
autonuma_split_huge_page, numa_hinting_fault, etc...
As usual, functions like numa_hinting_fault that only matter for builds
with CONFIG_AUTONUMA=y are declared unconditionally, but they are only
linked into the kernel if CONFIG_AUTONUMA=y. Their call sites are
optimized away at build time (or the kernel wouldn't link).
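As an illustration of the "optimized away" point, consider a
hypothetical call site (not taken from this series). The assumption
here is that under CONFIG_AUTONUMA=n the guarding predicate, pte_numa()
in this sketch, is a build-time constant 0, so gcc drops the whole
branch and the unresolved numa_hinting_fault symbol never reaches the
linker even though its extern declaration is unconditional:

static inline void maybe_numa_hint(struct page *page, pte_t pte)
{
	/* dead code when AutoNUMA is compiled out */
	if (pte_numa(pte))
		numa_hinting_fault(page, 1);
}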
Signed-off-by: Andrea Arcangeli <[email protected]>
---
include/linux/autonuma.h | 41 +++++++++++++++++++++++++++++++++++++++++
1 files changed, 41 insertions(+), 0 deletions(-)
create mode 100644 include/linux/autonuma.h
diff --git a/include/linux/autonuma.h b/include/linux/autonuma.h
new file mode 100644
index 0000000..85ca5eb
--- /dev/null
+++ b/include/linux/autonuma.h
@@ -0,0 +1,41 @@
+#ifndef _LINUX_AUTONUMA_H
+#define _LINUX_AUTONUMA_H
+
+#ifdef CONFIG_AUTONUMA
+
+#include <linux/autonuma_flags.h>
+
+extern void autonuma_enter(struct mm_struct *mm);
+extern void autonuma_exit(struct mm_struct *mm);
+extern void __autonuma_migrate_page_remove(struct page *page);
+extern void autonuma_migrate_split_huge_page(struct page *page,
+ struct page *page_tail);
+extern void autonuma_setup_new_exec(struct task_struct *p);
+
+static inline void autonuma_migrate_page_remove(struct page *page)
+{
+ if (ACCESS_ONCE(page->autonuma_migrate_nid) >= 0)
+ __autonuma_migrate_page_remove(page);
+}
+
+#define autonuma_printk(format, args...) \
+ if (autonuma_debug()) printk(format, ##args)
+
+#else /* CONFIG_AUTONUMA */
+
+static inline void autonuma_enter(struct mm_struct *mm) {}
+static inline void autonuma_exit(struct mm_struct *mm) {}
+static inline void autonuma_migrate_page_remove(struct page *page) {}
+static inline void autonuma_migrate_split_huge_page(struct page *page,
+ struct page *page_tail) {}
+static inline void autonuma_setup_new_exec(struct task_struct *p) {}
+
+#endif /* CONFIG_AUTONUMA */
+
+extern pte_t __pte_numa_fixup(struct mm_struct *mm, struct vm_area_struct *vma,
+ unsigned long addr, pte_t pte, pte_t *ptep);
+extern void __pmd_numa_fixup(struct mm_struct *mm, unsigned long addr,
+ pmd_t *pmd);
+extern void numa_hinting_fault(struct page *page, int numpages);
+
+#endif /* _LINUX_AUTONUMA_H */
This is where the dynamically allocated sched_autonuma structure is
being handled.
The reason for keeping this outside of the task_struct, besides not
using too much kernel stack, is to allocate it only on NUMA
hardware. Non-NUMA hardware then only pays the memory of a pointer
in the kernel stack (which remains NULL at all times in that case).
If the kernel is compiled with CONFIG_AUTONUMA=n, not even the pointer
is allocated on the kernel stack, of course.
Signed-off-by: Andrea Arcangeli <[email protected]>
---
kernel/fork.c | 24 ++++++++++++++----------
1 files changed, 14 insertions(+), 10 deletions(-)
diff --git a/kernel/fork.c b/kernel/fork.c
index d3c064c..0adbe09 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -206,6 +206,7 @@ static void account_kernel_stack(struct thread_info *ti, int account)
void free_task(struct task_struct *tsk)
{
account_kernel_stack(tsk->stack, -1);
+ free_task_autonuma(tsk);
free_thread_info(tsk->stack);
rt_mutex_debug_task_free(tsk);
ftrace_graph_exit_task(tsk);
@@ -260,6 +261,8 @@ void __init fork_init(unsigned long mempages)
/* do the arch specific task caches init */
arch_task_cache_init();
+ task_autonuma_init();
+
/*
* The default maximum number of threads is set to a safe
* value: the thread structures can take up at most half
@@ -292,21 +295,21 @@ static struct task_struct *dup_task_struct(struct task_struct *orig)
struct thread_info *ti;
unsigned long *stackend;
int node = tsk_fork_get_node(orig);
- int err;
tsk = alloc_task_struct_node(node);
- if (!tsk)
+ if (unlikely(!tsk))
return NULL;
ti = alloc_thread_info_node(tsk, node);
- if (!ti) {
- free_task_struct(tsk);
- return NULL;
- }
+ if (unlikely(!ti))
+ goto out_task_struct;
- err = arch_dup_task_struct(tsk, orig);
- if (err)
- goto out;
+ if (unlikely(arch_dup_task_struct(tsk, orig)))
+ goto out_thread_info;
+
+ if (unlikely(alloc_task_autonuma(tsk, orig, node)))
+ /* free_thread_info() undoes arch_dup_task_struct() too */
+ goto out_thread_info;
tsk->stack = ti;
@@ -334,8 +337,9 @@ static struct task_struct *dup_task_struct(struct task_struct *orig)
return tsk;
-out:
+out_thread_info:
free_thread_info(ti);
+out_task_struct:
free_task_struct(tsk);
return NULL;
}
Invoke autonuma_balance only on the busy CPUs, at the same frequency as
the CFS load balance.
Signed-off-by: Andrea Arcangeli <[email protected]>
---
kernel/sched/fair.c | 3 +++
1 files changed, 3 insertions(+), 0 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index dab9bdd..ff288c0 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4906,6 +4906,9 @@ static void run_rebalance_domains(struct softirq_action *h)
rebalance_domains(this_cpu, idle);
+ if (!this_rq->idle_balance)
+ sched_set_autonuma_need_balance();
+
/*
* If this cpu has a pending nohz_balance_kick, then do the
* balancing on behalf of the other idle cpus whose ticks are
When pages are freed, abort any pending migration. If knuma_migrated
arrives first, it will notice because get_page_unless_zero() would fail.
Signed-off-by: Andrea Arcangeli <[email protected]>
---
mm/page_alloc.c | 4 ++++
1 files changed, 4 insertions(+), 0 deletions(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 48eabe9..841d964 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -615,6 +615,10 @@ static inline int free_pages_check(struct page *page)
bad_page(page);
return 1;
}
+ autonuma_migrate_page_remove(page);
+#ifdef CONFIG_AUTONUMA
+ page->autonuma_last_nid = -1;
+#endif
if (page->flags & PAGE_FLAGS_CHECK_AT_PREP)
page->flags &= ~PAGE_FLAGS_CHECK_AT_PREP;
return 0;
From 32 to 12 bytes, so the AutoNUMA memory footprint is reduced to
0.29% of RAM.
This however will fail to migrate pages above a 16 Terabyte offset
from the start of each node (migration failure isn't fatal, those
pages simply will not follow the CPU, and a warning is printed to the
log just once in that case).
AutoNUMA will also fail to build if MAX_NUMNODES allows more than
(2**15)-1 nodes at build time (it would be easy to relax this to
(2**16)-1 nodes without increasing the memory footprint, but it's not
even worth it, so let's keep the negative space reserved for now).
This means the max RAM configuration fully supported by AutoNUMA
becomes AUTONUMA_LIST_MAX_PFN_OFFSET multiplied by 32767 nodes
multiplied by the PAGE_SIZE (assume 4096 here, but for some archs it's
bigger).
4096*32767*(0xffffffff-3)>>(10*5) = 511 PetaBytes.
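For anyone who wants to double check the arithmetic, a trivial
userspace snippet reproducing the figure (the constants mirror
PAGE_SIZE, the 32767 node limit and AUTONUMA_LIST_MAX_PFN_OFFSET from
above):

#include <stdio.h>
#include <stdint.h>

int main(void)
{
	uint64_t page_size = 4096;
	uint64_t max_nodes = 32767;			/* (2**15)-1 */
	uint64_t max_pfn_offset = 0xffffffffULL - 3;	/* AUTONUMA_LIST_MAX_PFN_OFFSET */
	uint64_t bytes = page_size * max_nodes * max_pfn_offset;

	/* >> 50 converts bytes to (binary) PetaBytes */
	printf("%llu PetaBytes\n", (unsigned long long)(bytes >> 50));
	return 0;
}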
Signed-off-by: Andrea Arcangeli <[email protected]>
---
include/linux/autonuma_list.h | 94 ++++++++++++++++++++++
include/linux/autonuma_types.h | 45 ++++++-----
include/linux/mmzone.h | 3 +-
include/linux/page_autonuma.h | 2 +-
mm/Makefile | 2 +-
mm/autonuma.c | 86 +++++++++++++++------
mm/autonuma_list.c | 167 ++++++++++++++++++++++++++++++++++++++++
mm/page_autonuma.c | 15 ++--
8 files changed, 362 insertions(+), 52 deletions(-)
create mode 100644 include/linux/autonuma_list.h
create mode 100644 mm/autonuma_list.c
diff --git a/include/linux/autonuma_list.h b/include/linux/autonuma_list.h
new file mode 100644
index 0000000..0f338e9
--- /dev/null
+++ b/include/linux/autonuma_list.h
@@ -0,0 +1,94 @@
+#ifndef __AUTONUMA_LIST_H
+#define __AUTONUMA_LIST_H
+
+#include <linux/types.h>
+#include <linux/kernel.h>
+
+typedef uint32_t autonuma_list_entry;
+#define AUTONUMA_LIST_MAX_PFN_OFFSET (AUTONUMA_LIST_HEAD-3)
+#define AUTONUMA_LIST_POISON1 (AUTONUMA_LIST_HEAD-2)
+#define AUTONUMA_LIST_POISON2 (AUTONUMA_LIST_HEAD-1)
+#define AUTONUMA_LIST_HEAD ((uint32_t)UINT_MAX)
+
+struct autonuma_list_head {
+ autonuma_list_entry anl_next_pfn;
+ autonuma_list_entry anl_prev_pfn;
+};
+
+static inline void AUTONUMA_INIT_LIST_HEAD(struct autonuma_list_head *anl)
+{
+ anl->anl_next_pfn = AUTONUMA_LIST_HEAD;
+ anl->anl_prev_pfn = AUTONUMA_LIST_HEAD;
+}
+
+/* abstraction conversion methods */
+extern struct page *autonuma_list_entry_to_page(int nid,
+ autonuma_list_entry pfn_offset);
+extern autonuma_list_entry autonuma_page_to_list_entry(int page_nid,
+ struct page *page);
+extern struct autonuma_list_head *__autonuma_list_head(int page_nid,
+ struct autonuma_list_head *head,
+ autonuma_list_entry pfn_offset);
+
+extern bool __autonuma_list_add(int page_nid,
+ struct page *page,
+ struct autonuma_list_head *head,
+ autonuma_list_entry prev,
+ autonuma_list_entry next);
+
+/*
+ * autonuma_list_add - add a new entry
+ *
+ * Insert a new entry after the specified head.
+ */
+static inline bool autonuma_list_add(int page_nid,
+ struct page *page,
+ autonuma_list_entry entry,
+ struct autonuma_list_head *head)
+{
+ struct autonuma_list_head *entry_head;
+ entry_head = __autonuma_list_head(page_nid, head, entry);
+ return __autonuma_list_add(page_nid, page, head,
+ entry, entry_head->anl_next_pfn);
+}
+
+/*
+ * autonuma_list_add_tail - add a new entry
+ *
+ * Insert a new entry before the specified head.
+ * This is useful for implementing queues.
+ */
+static inline bool autonuma_list_add_tail(int page_nid,
+ struct page *page,
+ autonuma_list_entry entry,
+ struct autonuma_list_head *head)
+{
+ struct autonuma_list_head *entry_head;
+ entry_head = __autonuma_list_head(page_nid, head, entry);
+ return __autonuma_list_add(page_nid, page, head,
+ entry_head->anl_prev_pfn, entry);
+}
+
+/*
+ * autonuma_list_del - deletes entry from list.
+ * @entry: the element to delete from the list.
+ */
+extern void autonuma_list_del(int page_nid,
+ struct autonuma_list_head *entry,
+ struct autonuma_list_head *head);
+
+extern bool autonuma_list_empty(const struct autonuma_list_head *head);
+
+#if 0 /* not needed so far */
+/*
+ * autonuma_list_is_singular - tests whether a list has just one entry.
+ * @head: the list to test.
+ */
+static inline int autonuma_list_is_singular(const struct autonuma_list_head *head)
+{
+ return !autonuma_list_empty(head) &&
+ (head->anl_next_pfn == head->anl_prev_pfn);
+}
+#endif
+
+#endif /* __AUTONUMA_LIST_H */
diff --git a/include/linux/autonuma_types.h b/include/linux/autonuma_types.h
index 1e860f6..579e126 100644
--- a/include/linux/autonuma_types.h
+++ b/include/linux/autonuma_types.h
@@ -4,6 +4,7 @@
#ifdef CONFIG_AUTONUMA
#include <linux/numa.h>
+#include <linux/autonuma_list.h>
/*
* Per-mm (process) structure dynamically allocated only if autonuma
@@ -42,6 +43,19 @@ struct task_autonuma {
/*
* Per page (or per-pageblock) structure dynamically allocated only if
* autonuma is not impossible.
+ *
+ * This structure takes 12 bytes per page for all architectures. There
+ * are two constraints to make this work:
+ *
+ * 1) the build will abort if MAX_NUMNODES is too big according to
+ * the #error check below
+ *
+ * 2) AutoNUMA will not succeed to insert into the migration queue any
+ * page whose pfn offset value (offset with respect to the first
+ * pfn of the node) is bigger than AUTONUMA_LIST_MAX_PFN_OFFSET
+ * (NOTE: AUTONUMA_LIST_MAX_PFN_OFFSET is still a valid pfn offset
+ * value). This means with huge node sizes and small PAGE_SIZE,
+ * some pages may not be allowed to be migrated.
*/
struct page_autonuma {
/*
@@ -51,7 +65,14 @@ struct page_autonuma {
* should run in NUMA systems). Archs without that requires
* autonuma_last_nid to be a long.
*/
-#if BITS_PER_LONG > 32
+#if MAX_NUMNODES > 32767
+ /*
+ * Verify at build time that int16_t for autonuma_migrate_nid
+ * and autonuma_last_nid won't risk to overflow, max allowed
+ * nid value is (2**15)-1.
+ */
+#error "too many nodes"
+#endif
/*
* autonuma_migrate_nid is -1 if the page_autonuma structure
* is not linked into any
@@ -61,7 +82,7 @@ struct page_autonuma {
* page_nid is the nid that the page (referenced by the
* page_autonuma structure) belongs to.
*/
- int autonuma_migrate_nid;
+ int16_t autonuma_migrate_nid;
/*
* autonuma_last_nid records which is the NUMA nid that tried
* to access this page at the last NUMA hinting page fault.
@@ -70,28 +91,14 @@ struct page_autonuma {
* it will make different threads trashing on the same pages,
* converge on the same NUMA node (if possible).
*/
- int autonuma_last_nid;
-#else
-#if MAX_NUMNODES >= 32768
-#error "too many nodes"
-#endif
- short autonuma_migrate_nid;
- short autonuma_last_nid;
-#endif
+ int16_t autonuma_last_nid;
+
/*
* This is the list node that links the page (referenced by
* the page_autonuma structure) in the
* &NODE_DATA(dst_nid)->autonuma_migrate_head[page_nid] lru.
*/
- struct list_head autonuma_migrate_node;
-
- /*
- * To find the page starting from the autonuma_migrate_node we
- * need a backlink.
- *
- * FIXME: drop it;
- */
- struct page *page;
+ struct autonuma_list_head autonuma_migrate_node;
};
extern int alloc_task_autonuma(struct task_struct *tsk,
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index ed5b0c0..acefdfa 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -17,6 +17,7 @@
#include <linux/pageblock-flags.h>
#include <generated/bounds.h>
#include <linux/atomic.h>
+#include <linux/autonuma_list.h>
#include <asm/page.h>
/* Free memory management - zoned buddy allocator. */
@@ -710,7 +711,7 @@ typedef struct pglist_data {
* <linux/page_autonuma.h> and the below field must remain the
* last one of this structure.
*/
- struct list_head autonuma_migrate_head[0];
+ struct autonuma_list_head autonuma_migrate_head[0];
#endif
} pg_data_t;
diff --git a/include/linux/page_autonuma.h b/include/linux/page_autonuma.h
index bc7a629..e78beda 100644
--- a/include/linux/page_autonuma.h
+++ b/include/linux/page_autonuma.h
@@ -53,7 +53,7 @@ extern void __init sparse_early_page_autonuma_alloc_node(struct page_autonuma **
/* inline won't work here */
#define autonuma_pglist_data_size() (sizeof(struct pglist_data) + \
(autonuma_impossible() ? 0 : \
- sizeof(struct list_head) * \
+ sizeof(struct autonuma_list_head) * \
num_possible_nodes()))
#endif /* _LINUX_PAGE_AUTONUMA_H */
diff --git a/mm/Makefile b/mm/Makefile
index a4d8354..4aa90d4 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -33,7 +33,7 @@ obj-$(CONFIG_FRONTSWAP) += frontswap.o
obj-$(CONFIG_HAS_DMA) += dmapool.o
obj-$(CONFIG_HUGETLBFS) += hugetlb.o
obj-$(CONFIG_NUMA) += mempolicy.o
-obj-$(CONFIG_AUTONUMA) += autonuma.o page_autonuma.o
+obj-$(CONFIG_AUTONUMA) += autonuma.o page_autonuma.o autonuma_list.o
obj-$(CONFIG_SPARSEMEM) += sparse.o
obj-$(CONFIG_SPARSEMEM_VMEMMAP) += sparse-vmemmap.o
obj-$(CONFIG_SLOB) += slob.o
diff --git a/mm/autonuma.c b/mm/autonuma.c
index ec4d492..1873a7b 100644
--- a/mm/autonuma.c
+++ b/mm/autonuma.c
@@ -89,15 +89,30 @@ void autonuma_migrate_split_huge_page(struct page *page,
VM_BUG_ON(nid < -1);
VM_BUG_ON(page_tail_autonuma->autonuma_migrate_nid != -1);
if (nid >= 0) {
- VM_BUG_ON(page_to_nid(page) != page_to_nid(page_tail));
+ bool added;
+ int page_nid = page_to_nid(page);
+ struct autonuma_list_head *head;
+ autonuma_list_entry entry;
+ entry = autonuma_page_to_list_entry(page_nid, page);
+ head = &NODE_DATA(nid)->autonuma_migrate_head[page_nid];
+ VM_BUG_ON(page_nid != page_to_nid(page_tail));
+ VM_BUG_ON(page_nid == nid);
compound_lock(page_tail);
autonuma_migrate_lock(nid);
- list_add_tail(&page_tail_autonuma->autonuma_migrate_node,
- &page_autonuma->autonuma_migrate_node);
+ added = autonuma_list_add_tail(page_nid, page_tail, entry,
+ head);
+ /*
+ * AUTONUMA_LIST_MAX_PFN_OFFSET+1 isn't a power of 2
+ * so "added" may be false if there's a pfn overflow
+ * in the list.
+ */
+ if (!added)
+ NODE_DATA(nid)->autonuma_nr_migrate_pages--;
autonuma_migrate_unlock(nid);
- page_tail_autonuma->autonuma_migrate_nid = nid;
+ if (added)
+ page_tail_autonuma->autonuma_migrate_nid = nid;
compound_unlock(page_tail);
}
@@ -119,8 +134,15 @@ void __autonuma_migrate_page_remove(struct page *page,
VM_BUG_ON(nid < -1);
if (nid >= 0) {
int numpages = hpage_nr_pages(page);
+ int page_nid = page_to_nid(page);
+ struct autonuma_list_head *head;
+ VM_BUG_ON(nid == page_nid);
+ head = &NODE_DATA(nid)->autonuma_migrate_head[page_nid];
+
autonuma_migrate_lock(nid);
- list_del(&page_autonuma->autonuma_migrate_node);
+ autonuma_list_del(page_nid,
+ &page_autonuma->autonuma_migrate_node,
+ head);
NODE_DATA(nid)->autonuma_nr_migrate_pages -= numpages;
autonuma_migrate_unlock(nid);
@@ -139,6 +161,8 @@ static void __autonuma_migrate_page_add(struct page *page,
int numpages;
unsigned long nr_migrate_pages;
wait_queue_head_t *wait_queue;
+ struct autonuma_list_head *head;
+ bool added;
VM_BUG_ON(dst_nid >= MAX_NUMNODES);
VM_BUG_ON(dst_nid < -1);
@@ -155,25 +179,34 @@ static void __autonuma_migrate_page_add(struct page *page,
VM_BUG_ON(nid >= MAX_NUMNODES);
VM_BUG_ON(nid < -1);
if (nid >= 0) {
+ VM_BUG_ON(nid == page_nid);
+ head = &NODE_DATA(nid)->autonuma_migrate_head[page_nid];
+
autonuma_migrate_lock(nid);
- list_del(&page_autonuma->autonuma_migrate_node);
+ autonuma_list_del(page_nid,
+ &page_autonuma->autonuma_migrate_node,
+ head);
NODE_DATA(nid)->autonuma_nr_migrate_pages -= numpages;
autonuma_migrate_unlock(nid);
}
+ head = &NODE_DATA(dst_nid)->autonuma_migrate_head[page_nid];
+
autonuma_migrate_lock(dst_nid);
- list_add(&page_autonuma->autonuma_migrate_node,
- &NODE_DATA(dst_nid)->autonuma_migrate_head[page_nid]);
- NODE_DATA(dst_nid)->autonuma_nr_migrate_pages += numpages;
- nr_migrate_pages = NODE_DATA(dst_nid)->autonuma_nr_migrate_pages;
+ added = autonuma_list_add(page_nid, page, AUTONUMA_LIST_HEAD, head);
+ if (added) {
+ NODE_DATA(dst_nid)->autonuma_nr_migrate_pages += numpages;
+ nr_migrate_pages = NODE_DATA(dst_nid)->autonuma_nr_migrate_pages;
+ }
autonuma_migrate_unlock(dst_nid);
- page_autonuma->autonuma_migrate_nid = dst_nid;
+ if (added)
+ page_autonuma->autonuma_migrate_nid = dst_nid;
compound_unlock_irqrestore(page, flags);
- if (!autonuma_migrate_defer()) {
+ if (added && !autonuma_migrate_defer()) {
wait_queue = &NODE_DATA(dst_nid)->autonuma_knuma_migrated_wait;
if (nr_migrate_pages >= pages_to_migrate &&
nr_migrate_pages - numpages < pages_to_migrate &&
@@ -813,7 +846,7 @@ static int isolate_migratepages(struct list_head *migratepages,
struct pglist_data *pgdat)
{
int nr = 0, nid;
- struct list_head *heads = pgdat->autonuma_migrate_head;
+ struct autonuma_list_head *heads = pgdat->autonuma_migrate_head;
/* FIXME: THP balancing, restart from last nid */
for_each_online_node(nid) {
@@ -825,10 +858,10 @@ static int isolate_migratepages(struct list_head *migratepages,
cond_resched();
VM_BUG_ON(numa_node_id() != pgdat->node_id);
if (nid == pgdat->node_id) {
- VM_BUG_ON(!list_empty(&heads[nid]));
+ VM_BUG_ON(!autonuma_list_empty(&heads[nid]));
continue;
}
- if (list_empty(&heads[nid]))
+ if (autonuma_list_empty(&heads[nid]))
continue;
/* some page wants to go to this pgdat */
/*
@@ -840,22 +873,29 @@ static int isolate_migratepages(struct list_head *migratepages,
* irqs.
*/
autonuma_migrate_lock_irq(pgdat->node_id);
- if (list_empty(&heads[nid])) {
+ if (autonuma_list_empty(&heads[nid])) {
autonuma_migrate_unlock_irq(pgdat->node_id);
continue;
}
- page_autonuma = list_entry(heads[nid].prev,
- struct page_autonuma,
- autonuma_migrate_node);
- page = page_autonuma->page;
+ page = autonuma_list_entry_to_page(nid,
+ heads[nid].anl_prev_pfn);
+ page_autonuma = lookup_page_autonuma(page);
if (unlikely(!get_page_unless_zero(page))) {
+ int page_nid = page_to_nid(page);
+ struct autonuma_list_head *entry_head;
+ VM_BUG_ON(nid == page_nid);
+
/*
* Is getting freed and will remove self from the
* autonuma list shortly, skip it for now.
*/
- list_del(&page_autonuma->autonuma_migrate_node);
- list_add(&page_autonuma->autonuma_migrate_node,
- &heads[nid]);
+ entry_head = &page_autonuma->autonuma_migrate_node;
+ autonuma_list_del(page_nid, entry_head,
+ &heads[nid]);
+ if (!autonuma_list_add(page_nid, page,
+ AUTONUMA_LIST_HEAD,
+ &heads[nid]))
+ BUG();
autonuma_migrate_unlock_irq(pgdat->node_id);
autonuma_printk("autonuma migrate page is free\n");
continue;
diff --git a/mm/autonuma_list.c b/mm/autonuma_list.c
new file mode 100644
index 0000000..2c840f7
--- /dev/null
+++ b/mm/autonuma_list.c
@@ -0,0 +1,167 @@
+/*
+ * Copyright 2006, Red Hat, Inc., Dave Jones
+ * Copyright 2012, Red Hat, Inc.
+ * Released under the General Public License (GPL).
+ *
+ * This file contains the linked list implementations for
+ * autonuma migration lists.
+ */
+
+#include <linux/mm.h>
+#include <linux/autonuma.h>
+
+/*
+ * Insert a new entry between two known consecutive entries.
+ *
+ * This is only for internal list manipulation where we know
+ * the prev/next entries already!
+ *
+ * return true if succeeded, or false if the (page_nid, pfn_offset)
+ * pair couldn't represent the pfn and the list_add didn't succeed.
+ */
+bool __autonuma_list_add(int page_nid,
+ struct page *page,
+ struct autonuma_list_head *head,
+ autonuma_list_entry prev,
+ autonuma_list_entry next)
+{
+ autonuma_list_entry new;
+
+ VM_BUG_ON(page_nid != page_to_nid(page));
+ new = autonuma_page_to_list_entry(page_nid, page);
+ if (new > AUTONUMA_LIST_MAX_PFN_OFFSET)
+ return false;
+
+ WARN(new == prev || new == next,
+ "autonuma_list_add double add: new=%u, prev=%u, next=%u.\n",
+ new, prev, next);
+
+ __autonuma_list_head(page_nid, head, next)->anl_prev_pfn = new;
+ __autonuma_list_head(page_nid, head, new)->anl_next_pfn = next;
+ __autonuma_list_head(page_nid, head, new)->anl_prev_pfn = prev;
+ __autonuma_list_head(page_nid, head, prev)->anl_next_pfn = new;
+ return true;
+}
+
+static inline void __autonuma_list_del_entry(int page_nid,
+ struct autonuma_list_head *entry,
+ struct autonuma_list_head *head)
+{
+ autonuma_list_entry prev, next;
+
+ prev = entry->anl_prev_pfn;
+ next = entry->anl_next_pfn;
+
+ if (WARN(next == AUTONUMA_LIST_POISON1,
+ "autonuma_list_del corruption, "
+ "%p->anl_next_pfn is AUTONUMA_LIST_POISON1 (%u)\n",
+ entry, AUTONUMA_LIST_POISON1) ||
+ WARN(prev == AUTONUMA_LIST_POISON2,
+ "autonuma_list_del corruption, "
+ "%p->anl_prev_pfn is AUTONUMA_LIST_POISON2 (%u)\n",
+ entry, AUTONUMA_LIST_POISON2))
+ return;
+
+ __autonuma_list_head(page_nid, head, next)->anl_prev_pfn = prev;
+ __autonuma_list_head(page_nid, head, prev)->anl_next_pfn = next;
+}
+
+/*
+ * autonuma_list_del - deletes entry from list.
+ *
+ * Note: autonuma_list_empty on entry does not return true after this,
+ * the entry is in an undefined state.
+ */
+void autonuma_list_del(int page_nid, struct autonuma_list_head *entry,
+ struct autonuma_list_head *head)
+{
+ __autonuma_list_del_entry(page_nid, entry, head);
+ entry->anl_next_pfn = AUTONUMA_LIST_POISON1;
+ entry->anl_prev_pfn = AUTONUMA_LIST_POISON2;
+}
+
+/*
+ * autonuma_list_empty - tests whether a list is empty
+ * @head: the list to test.
+ */
+bool autonuma_list_empty(const struct autonuma_list_head *head)
+{
+ bool ret = false;
+ if (head->anl_next_pfn == AUTONUMA_LIST_HEAD) {
+ ret = true;
+ BUG_ON(head->anl_prev_pfn != AUTONUMA_LIST_HEAD);
+ }
+ return ret;
+}
+
+/* abstraction conversion methods */
+
+static inline struct page *__autonuma_list_entry_to_page(int page_nid,
+ autonuma_list_entry pfn_offset)
+{
+ struct pglist_data *pgdat = NODE_DATA(page_nid);
+ unsigned long pfn = pgdat->node_start_pfn + pfn_offset;
+ return pfn_to_page(pfn);
+}
+
+struct page *autonuma_list_entry_to_page(int page_nid,
+ autonuma_list_entry pfn_offset)
+{
+ VM_BUG_ON(page_nid < 0);
+ BUG_ON(pfn_offset == AUTONUMA_LIST_POISON1);
+ BUG_ON(pfn_offset == AUTONUMA_LIST_POISON2);
+ BUG_ON(pfn_offset == AUTONUMA_LIST_HEAD);
+ return __autonuma_list_entry_to_page(page_nid, pfn_offset);
+}
+
+/*
+ * returns a value above AUTONUMA_LIST_MAX_PFN_OFFSET if the pfn is
+ * located a too big offset from the start of the node and cannot be
+ * represented by the (page_nid, pfn_offset) pair.
+ */
+autonuma_list_entry autonuma_page_to_list_entry(int page_nid,
+ struct page *page)
+{
+ unsigned long pfn = page_to_pfn(page);
+ struct pglist_data *pgdat = NODE_DATA(page_nid);
+ VM_BUG_ON(page_nid != page_to_nid(page));
+ BUG_ON(pfn < pgdat->node_start_pfn);
+ pfn -= pgdat->node_start_pfn;
+ if (pfn > AUTONUMA_LIST_MAX_PFN_OFFSET) {
+ WARN_ONCE(1, "autonuma_page_to_list_entry: "
+ "pfn_offset %lu, pgdat %p, "
+ "pgdat->node_start_pfn %lu\n",
+ pfn, pgdat, pgdat->node_start_pfn);
+ /*
+ * Any value bigger than AUTONUMA_LIST_MAX_PFN_OFFSET
+ * will work as an error retval, but better pick one
+ * that will cause noise if computed wrong by the
+ * caller.
+ */
+ return AUTONUMA_LIST_POISON1;
+ }
+ return pfn; /* convert to autonuma_list_entry without losing information */
+}
+
+static inline struct autonuma_list_head *____autonuma_list_head(int page_nid,
+ autonuma_list_entry pfn_offset)
+{
+ struct pglist_data *pgdat = NODE_DATA(page_nid);
+ unsigned long pfn = pgdat->node_start_pfn + pfn_offset;
+ struct page *page = pfn_to_page(pfn);
+ struct page_autonuma *page_autonuma = lookup_page_autonuma(page);
+ return &page_autonuma->autonuma_migrate_node;
+}
+
+struct autonuma_list_head *__autonuma_list_head(int page_nid,
+ struct autonuma_list_head *head,
+ autonuma_list_entry pfn_offset)
+{
+ VM_BUG_ON(page_nid < 0);
+ BUG_ON(pfn_offset == AUTONUMA_LIST_POISON1);
+ BUG_ON(pfn_offset == AUTONUMA_LIST_POISON2);
+ if (pfn_offset != AUTONUMA_LIST_HEAD)
+ return ____autonuma_list_head(page_nid, pfn_offset);
+ else
+ return head;
+}
diff --git a/mm/page_autonuma.c b/mm/page_autonuma.c
index d7c5e4a..b629074 100644
--- a/mm/page_autonuma.c
+++ b/mm/page_autonuma.c
@@ -12,7 +12,6 @@ void __meminit page_autonuma_map_init(struct page *page,
for (end = page + nr_pages; page < end; page++, page_autonuma++) {
page_autonuma->autonuma_last_nid = -1;
page_autonuma->autonuma_migrate_nid = -1;
- page_autonuma->page = page;
}
}
@@ -20,12 +19,18 @@ static void __meminit __pgdat_autonuma_init(struct pglist_data *pgdat)
{
int node_iter;
+ /* verify the per-page page_autonuma 12 byte fixed cost */
+ BUILD_BUG_ON((unsigned long) &((struct page_autonuma *)0)[1] != 12);
+
spin_lock_init(&pgdat->autonuma_lock);
init_waitqueue_head(&pgdat->autonuma_knuma_migrated_wait);
pgdat->autonuma_nr_migrate_pages = 0;
if (!autonuma_impossible())
- for_each_node(node_iter)
- INIT_LIST_HEAD(&pgdat->autonuma_migrate_head[node_iter]);
+ for_each_node(node_iter) {
+ struct autonuma_list_head *head;
+ head = &pgdat->autonuma_migrate_head[node_iter];
+ AUTONUMA_INIT_LIST_HEAD(head);
+ }
}
#if !defined(CONFIG_SPARSEMEM)
@@ -112,10 +117,6 @@ struct page_autonuma *lookup_page_autonuma(struct page *page)
unsigned long pfn = page_to_pfn(page);
struct mem_section *section = __pfn_to_section(pfn);
- /* if it's not a power of two we may be wasting memory */
- BUILD_BUG_ON(SECTION_PAGE_AUTONUMA_SIZE &
- (SECTION_PAGE_AUTONUMA_SIZE-1));
-
#ifdef CONFIG_DEBUG_VM
/*
* The sanity checks the page allocator does upon freeing a
We will set these bitflags only when the pmd and pte are non-present.
They work like PROT_NONE, but they identify a request for the NUMA
hinting page fault to trigger.
Because we want to be able to set these bitflags in any established pte
or pmd (while clearing the present bit at the same time) without
losing information, these bitflags must never be set when the pte and
pmd are present.
For _PAGE_NUMA_PTE the pte bitflag used is _PAGE_PSE, which is not
otherwise set on ptes, and it also fits in between _PAGE_FILE and
_PAGE_PROTNONE, which avoids having to alter the swp entries format.
For _PAGE_NUMA_PMD we use a reserved bitflag. pmds never contain
swap entries, but if in the future we'll swap transparent hugepages, we
must keep in mind not to use the _PAGE_UNUSED2 bitflag in the swap
entry format and to start the swap entry offset above it.
_PAGE_UNUSED2 is used by Xen, but only on ptes established by ioremap;
it's never used on pmds, so there's no risk of collision with Xen.
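To illustrate how these bits are meant to be used, here is a sketch of
x86 helpers in the style of the "autonuma: x86 pte_numa() and
pmd_numa()" patch of this series; it is a reconstruction for
illustration, not a quote of that patch:

/* a NUMA hinting pte is non-present with the reserved bit set */
static inline int pte_numa(pte_t pte)
{
	return (pte_flags(pte) & (_PAGE_NUMA_PTE | _PAGE_PRESENT)) ==
		_PAGE_NUMA_PTE;
}

/* turn an established pte into a NUMA hinting pte without losing
 * the rest of its contents: set the hinting bit, clear present */
static inline pte_t pte_mknuma(pte_t pte)
{
	pte = pte_set_flags(pte, _PAGE_NUMA_PTE);
	return pte_clear_flags(pte, _PAGE_PRESENT);
}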
Signed-off-by: Andrea Arcangeli <[email protected]>
---
arch/x86/include/asm/pgtable_types.h | 11 +++++++++++
1 files changed, 11 insertions(+), 0 deletions(-)
diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
index b74cac9..6e2d954 100644
--- a/arch/x86/include/asm/pgtable_types.h
+++ b/arch/x86/include/asm/pgtable_types.h
@@ -71,6 +71,17 @@
#define _PAGE_FILE (_AT(pteval_t, 1) << _PAGE_BIT_FILE)
#define _PAGE_PROTNONE (_AT(pteval_t, 1) << _PAGE_BIT_PROTNONE)
+/*
+ * Cannot be set on pte. The fact it's in between _PAGE_FILE and
+ * _PAGE_PROTNONE avoids having to alter the swp entries.
+ */
+#define _PAGE_NUMA_PTE _PAGE_PSE
+/*
+ * Cannot be set on pmd, if transparent hugepages will be swapped out
+ * the swap entry offset must start above it.
+ */
+#define _PAGE_NUMA_PMD _PAGE_UNUSED2
+
#define _PAGE_TABLE (_PAGE_PRESENT | _PAGE_RW | _PAGE_USER | \
_PAGE_ACCESSED | _PAGE_DIRTY)
#define _KERNPG_TABLE (_PAGE_PRESENT | _PAGE_RW | _PAGE_ACCESSED | \
On 64bit archs, 20 bytes are used for async memory migration (specific
to the knuma_migrated per-node threads), and 4 bytes are used for the
thread NUMA false sharing detection logic.
This is a bad implementation due to lack of time to do a proper one.
These new AutoNUMA fields must be moved to the pgdat like memcg
does, so that they're only allocated at boot time if the kernel is
booted on NUMA hardware, and so that they're not allocated even on
NUMA hardware if "noautonuma" is passed as a boot parameter to
the kernel.
Signed-off-by: Andrea Arcangeli <[email protected]>
---
include/linux/mm_types.h | 26 ++++++++++++++++++++++++++
1 files changed, 26 insertions(+), 0 deletions(-)
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index f0c6379..d1248cf 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -136,6 +136,32 @@ struct page {
struct page *first_page; /* Compound tail pages */
};
+#ifdef CONFIG_AUTONUMA
+ /*
+ * FIXME: move to pgdat section along with the memcg and allocate
+ * at runtime only in presence of a numa system.
+ */
+ /*
+ * To modify autonuma_last_nid lockless the architecture,
+ * needs SMP atomic granularity < sizeof(long), not all archs
+ * have that, notably some ancient alpha (but none of those
+ * should run in NUMA systems). Archs without that requires
+ * autonuma_last_nid to be a long.
+ */
+#if BITS_PER_LONG > 32
+ int autonuma_migrate_nid;
+ int autonuma_last_nid;
+#else
+#if MAX_NUMNODES >= 32768
+#error "too many nodes"
+#endif
+ /* FIXME: remember to check the updates are atomic */
+ short autonuma_migrate_nid;
+ short autonuma_last_nid;
+#endif
+ struct list_head autonuma_migrate_node;
+#endif
+
/*
* On machines where all RAM is mapped into kernel address space,
* we can simply calculate the virtual address. On machines with
Very minor optimization to hint gcc.
Signed-off-by: Andrea Arcangeli <[email protected]>
---
kernel/fork.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)
diff --git a/kernel/fork.c b/kernel/fork.c
index ab5211b..5fcfa70 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -572,7 +572,7 @@ struct mm_struct *mm_alloc(void)
struct mm_struct *mm;
mm = allocate_mm();
- if (!mm)
+ if (unlikely(!mm))
return NULL;
memset(mm, 0, sizeof(*mm));
This implements knuma_scand, the NUMA hinting faults started by
knuma_scand, the knuma_migrated daemons that migrate the memory queued
by the NUMA hinting faults, the statistics gathering code that is done
by knuma_scand for the mm_autonuma and by the NUMA hinting page faults
for the sched_autonuma, and most of the rest of the AutoNUMA core
logic, like the false sharing detection, sysfs and initialization
routines.
When knuma_scand is not running, the AutoNUMA algorithm is a full
bypass and it must not alter the runtime of the memory management and
scheduler code.
The whole AutoNUMA logic is a chain reaction resulting from the actions
of knuma_scand. The various parts of the code can be described like
different gears (gears as in glxgears).
knuma_scand is the first gear: it collects the mm_autonuma per-process
statistics and at the same time sets the ptes/pmds it scans as
pte_numa and pmd_numa.
The second gear is the NUMA hinting page faults. These are triggered
by the pte_numa/pmd_numa ptes/pmds. They collect the sched_autonuma
per-thread statistics and implement the memory-follow-CPU logic, which
tracks whether pages are repeatedly accessed by remote nodes. That
logic can decide to migrate pages across NUMA nodes by queuing them
for migration in the per-node knuma_migrated queues.
The third gear is knuma_migrated. There is one knuma_migrated daemon
per node. Pages pending migration are queued in a matrix of
lists. Each knuma_migrated (in parallel with the others) walks those
lists and migrates the queued pages in round robin from each incoming
node to the node it is running on.
The fourth gear is the NUMA scheduler balancing code. It evaluates the
statistical information collected in mm->mm_autonuma and
p->sched_autonuma, together with the status of all CPUs, to decide
whether tasks should be migrated to CPUs on remote nodes.
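To make the chain reaction concrete, here is a minimal userspace model
of the three memory-side gears. Every name in it (model_page,
scan_pass, hinting_fault, migrate_pass) is a simplified stand-in
invented for illustration, not one of the kernel interfaces added by
this series:

/*
 * Toy model: gear 1 arms the pages, gear 2 resolves the "faults" and
 * records statistics plus migration requests, gear 3 drains the
 * per-node migration queue. Compile with: cc -std=c99 model.c
 */
#include <stdio.h>

#define NR_NODES 2
#define NR_PAGES 8

struct model_page {
	int nid;	/* node the page currently lives on */
	int numa_mark;	/* stand-in for pte_numa: armed by the scanner */
	int queued_to;	/* stand-in for autonuma_migrate_nid, -1 if unqueued */
};

static struct model_page pages[NR_PAGES];
static unsigned long task_numa_fault[NR_NODES];	/* per-thread statistics */

/* Gear 1: the scanner arms the NUMA hinting faults. */
static void scan_pass(void)
{
	for (int i = 0; i < NR_PAGES; i++)
		pages[i].numa_mark = 1;
}

/* Gear 2: a fault from cpu_nid records statistics and may queue migration. */
static void hinting_fault(int i, int cpu_nid)
{
	if (!pages[i].numa_mark)
		return;
	pages[i].numa_mark = 0;			/* like pte_mknotnuma */
	task_numa_fault[cpu_nid]++;		/* CPU-follow-memory input */
	if (pages[i].nid != cpu_nid)
		pages[i].queued_to = cpu_nid;	/* memory-follow-CPU queueing */
}

/* Gear 3: the per-node migration daemon drains its queue. */
static void migrate_pass(int dst_nid)
{
	for (int i = 0; i < NR_PAGES; i++)
		if (pages[i].queued_to == dst_nid) {
			pages[i].nid = dst_nid;
			pages[i].queued_to = -1;
		}
}

int main(void)
{
	for (int i = 0; i < NR_PAGES; i++)
		pages[i] = (struct model_page){ .nid = i % NR_NODES,
						.queued_to = -1 };
	scan_pass();
	for (int i = 0; i < NR_PAGES; i++)
		hinting_fault(i, 1);		/* all accesses from node 1 */
	migrate_pass(1);
	for (int i = 0; i < NR_PAGES; i++)
		printf("page %d now on node %d\n", i, pages[i].nid);
	printf("faults recorded by the node 1 thread: %lu\n",
	       task_numa_fault[1]);
	return 0;
}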
The code includes fixes from Hillf Danton <[email protected]>.
Signed-off-by: Andrea Arcangeli <[email protected]>
---
mm/autonuma.c | 1491 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
1 files changed, 1491 insertions(+), 0 deletions(-)
create mode 100644 mm/autonuma.c
diff --git a/mm/autonuma.c b/mm/autonuma.c
new file mode 100644
index 0000000..f44272b
--- /dev/null
+++ b/mm/autonuma.c
@@ -0,0 +1,1491 @@
+/*
+ * Copyright (C) 2012 Red Hat, Inc.
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2. See
+ * the COPYING file in the top-level directory.
+ *
+ * Boot with "numa=fake=2" to test on not NUMA systems.
+ */
+
+#include <linux/mm.h>
+#include <linux/rmap.h>
+#include <linux/kthread.h>
+#include <linux/mmu_notifier.h>
+#include <linux/freezer.h>
+#include <linux/mm_inline.h>
+#include <linux/migrate.h>
+#include <linux/swap.h>
+#include <linux/autonuma.h>
+#include <asm/tlbflush.h>
+#include <asm/pgtable.h>
+
+unsigned long autonuma_flags __read_mostly =
+ (1<<AUTONUMA_SCHED_LOAD_BALANCE_STRICT_FLAG)|
+ (1<<AUTONUMA_SCHED_CLONE_RESET_FLAG)|
+ (1<<AUTONUMA_SCHED_FORK_RESET_FLAG)|
+#ifdef CONFIG_AUTONUMA_DEFAULT_ENABLED
+ (1<<AUTONUMA_FLAG)|
+#endif
+ (1<<AUTONUMA_SCAN_PMD_FLAG);
+
+static DEFINE_MUTEX(knumad_mm_mutex);
+
+/* knuma_scand */
+static unsigned int scan_sleep_millisecs __read_mostly = 100;
+static unsigned int scan_sleep_pass_millisecs __read_mostly = 5000;
+static unsigned int pages_to_scan __read_mostly = 128*1024*1024/PAGE_SIZE;
+static DECLARE_WAIT_QUEUE_HEAD(knuma_scand_wait);
+static unsigned long full_scans;
+static unsigned long pages_scanned;
+
+/* knuma_migrated */
+static unsigned int migrate_sleep_millisecs __read_mostly = 100;
+static unsigned int pages_to_migrate __read_mostly = 128*1024*1024/PAGE_SIZE;
+static volatile unsigned long pages_migrated;
+
+static struct knumad_scan {
+ struct list_head mm_head;
+ struct mm_struct *mm;
+ unsigned long address;
+} knumad_scan = {
+ .mm_head = LIST_HEAD_INIT(knumad_scan.mm_head),
+};
+
+static inline bool autonuma_impossible(void)
+{
+ return num_possible_nodes() <= 1 ||
+ test_bit(AUTONUMA_IMPOSSIBLE_FLAG, &autonuma_flags);
+}
+
+static inline void autonuma_migrate_lock(int nid)
+{
+ spin_lock(&NODE_DATA(nid)->autonuma_lock);
+}
+
+static inline void autonuma_migrate_unlock(int nid)
+{
+ spin_unlock(&NODE_DATA(nid)->autonuma_lock);
+}
+
+static inline void autonuma_migrate_lock_irq(int nid)
+{
+ spin_lock_irq(&NODE_DATA(nid)->autonuma_lock);
+}
+
+static inline void autonuma_migrate_unlock_irq(int nid)
+{
+ spin_unlock_irq(&NODE_DATA(nid)->autonuma_lock);
+}
+
+/* caller already holds the compound_lock */
+void autonuma_migrate_split_huge_page(struct page *page,
+ struct page *page_tail)
+{
+ int nid, last_nid;
+
+ nid = page->autonuma_migrate_nid;
+ VM_BUG_ON(nid >= MAX_NUMNODES);
+ VM_BUG_ON(nid < -1);
+ VM_BUG_ON(page_tail->autonuma_migrate_nid != -1);
+ if (nid >= 0) {
+ VM_BUG_ON(page_to_nid(page) != page_to_nid(page_tail));
+
+ compound_lock(page_tail);
+ autonuma_migrate_lock(nid);
+ list_add_tail(&page_tail->autonuma_migrate_node,
+ &page->autonuma_migrate_node);
+ autonuma_migrate_unlock(nid);
+
+ page_tail->autonuma_migrate_nid = nid;
+ compound_unlock(page_tail);
+ }
+
+ last_nid = ACCESS_ONCE(page->autonuma_last_nid);
+ if (last_nid >= 0)
+ page_tail->autonuma_last_nid = last_nid;
+}
+
+void __autonuma_migrate_page_remove(struct page *page)
+{
+ unsigned long flags;
+ int nid;
+
+ flags = compound_lock_irqsave(page);
+
+ nid = page->autonuma_migrate_nid;
+ VM_BUG_ON(nid >= MAX_NUMNODES);
+ VM_BUG_ON(nid < -1);
+ if (nid >= 0) {
+ int numpages = hpage_nr_pages(page);
+ autonuma_migrate_lock(nid);
+ list_del(&page->autonuma_migrate_node);
+ NODE_DATA(nid)->autonuma_nr_migrate_pages -= numpages;
+ autonuma_migrate_unlock(nid);
+
+ page->autonuma_migrate_nid = -1;
+ }
+
+ compound_unlock_irqrestore(page, flags);
+}
+
+static void __autonuma_migrate_page_add(struct page *page, int dst_nid,
+ int page_nid)
+{
+ unsigned long flags;
+ int nid;
+ int numpages;
+ unsigned long nr_migrate_pages;
+ wait_queue_head_t *wait_queue;
+
+ VM_BUG_ON(dst_nid >= MAX_NUMNODES);
+ VM_BUG_ON(dst_nid < -1);
+ VM_BUG_ON(page_nid >= MAX_NUMNODES);
+ VM_BUG_ON(page_nid < -1);
+
+ VM_BUG_ON(page_nid == dst_nid);
+ VM_BUG_ON(page_to_nid(page) != page_nid);
+
+ flags = compound_lock_irqsave(page);
+
+ numpages = hpage_nr_pages(page);
+ nid = page->autonuma_migrate_nid;
+ VM_BUG_ON(nid >= MAX_NUMNODES);
+ VM_BUG_ON(nid < -1);
+ if (nid >= 0) {
+ autonuma_migrate_lock(nid);
+ list_del(&page->autonuma_migrate_node);
+ NODE_DATA(nid)->autonuma_nr_migrate_pages -= numpages;
+ autonuma_migrate_unlock(nid);
+ }
+
+ autonuma_migrate_lock(dst_nid);
+ list_add(&page->autonuma_migrate_node,
+ &NODE_DATA(dst_nid)->autonuma_migrate_head[page_nid]);
+ NODE_DATA(dst_nid)->autonuma_nr_migrate_pages += numpages;
+ nr_migrate_pages = NODE_DATA(dst_nid)->autonuma_nr_migrate_pages;
+
+ autonuma_migrate_unlock(dst_nid);
+
+ page->autonuma_migrate_nid = dst_nid;
+
+ compound_unlock_irqrestore(page, flags);
+
+ if (!autonuma_migrate_defer()) {
+ wait_queue = &NODE_DATA(dst_nid)->autonuma_knuma_migrated_wait;
+ if (nr_migrate_pages >= pages_to_migrate &&
+ nr_migrate_pages - numpages < pages_to_migrate &&
+ waitqueue_active(wait_queue))
+ wake_up_interruptible(wait_queue);
+ }
+}
+
+static void autonuma_migrate_page_add(struct page *page, int dst_nid,
+ int page_nid)
+{
+ int migrate_nid = ACCESS_ONCE(page->autonuma_migrate_nid);
+ if (migrate_nid != dst_nid)
+ __autonuma_migrate_page_add(page, dst_nid, page_nid);
+}
+
+static bool balance_pgdat(struct pglist_data *pgdat,
+ int nr_migrate_pages)
+{
+ /* FIXME: this only checks the wmarks, make it move
+ * "unused" memory or pagecache by queuing it to
+ * pgdat->autonuma_migrate_head[pgdat->node_id].
+ */
+ int z;
+ for (z = pgdat->nr_zones - 1; z >= 0; z--) {
+ struct zone *zone = pgdat->node_zones + z;
+
+ if (!populated_zone(zone))
+ continue;
+
+ if (zone->all_unreclaimable)
+ continue;
+
+ /*
+ * FIXME: deal with the order for THP, maybe once
+ * kswapd learns to use compaction; otherwise
+ * order = 0 is probably ok.
+ * FIXME: in theory we're ok if we can obtain
+ * pages_to_migrate pages from all zones, they don't
+ * need to all be in a single zone. We care about the
+ * pgdat, not the zone.
+ */
+
+ /*
+ * Try not to wakeup kswapd by allocating
+ * pages_to_migrate pages.
+ */
+ if (!zone_watermark_ok(zone, 0,
+ high_wmark_pages(zone) +
+ nr_migrate_pages,
+ 0, 0))
+ continue;
+ return true;
+ }
+ return false;
+}
+
+static void cpu_follow_memory_pass(struct task_struct *p,
+ struct task_autonuma *task_autonuma,
+ unsigned long *task_numa_fault)
+{
+ int nid;
+ for_each_node(nid)
+ task_numa_fault[nid] >>= 1;
+ task_autonuma->task_numa_fault_tot >>= 1;
+}
+
+static void numa_hinting_fault_cpu_follow_memory(struct task_struct *p,
+ int access_nid,
+ int numpages,
+ bool pass)
+{
+ struct task_autonuma *task_autonuma = p->task_autonuma;
+ unsigned long *task_numa_fault = task_autonuma->task_numa_fault;
+ if (unlikely(pass))
+ cpu_follow_memory_pass(p, task_autonuma, task_numa_fault);
+ task_numa_fault[access_nid] += numpages;
+ task_autonuma->task_numa_fault_tot += numpages;
+}
+
+static inline bool last_nid_set(struct task_struct *p,
+ struct page *page, int cpu_nid)
+{
+ bool ret = true;
+ int autonuma_last_nid = ACCESS_ONCE(page->autonuma_last_nid);
+ VM_BUG_ON(cpu_nid < 0);
+ VM_BUG_ON(cpu_nid >= MAX_NUMNODES);
+ if (autonuma_last_nid >= 0 && autonuma_last_nid != cpu_nid) {
+ int migrate_nid = ACCESS_ONCE(page->autonuma_migrate_nid);
+ if (migrate_nid >= 0 && migrate_nid != cpu_nid)
+ __autonuma_migrate_page_remove(page);
+ ret = false;
+ }
+ if (autonuma_last_nid != cpu_nid)
+ ACCESS_ONCE(page->autonuma_last_nid) = cpu_nid;
+ return ret;
+}
+
+static int __page_migrate_nid(struct page *page, int page_nid)
+{
+ int migrate_nid = ACCESS_ONCE(page->autonuma_migrate_nid);
+ if (migrate_nid < 0)
+ migrate_nid = page_nid;
+#if 0
+ return page_nid;
+#endif
+ return migrate_nid;
+}
+
+static int page_migrate_nid(struct page *page)
+{
+ return __page_migrate_nid(page, page_to_nid(page));
+}
+
+static int numa_hinting_fault_memory_follow_cpu(struct task_struct *p,
+ struct page *page,
+ int cpu_nid, int page_nid,
+ bool pass)
+{
+ if (!last_nid_set(p, page, cpu_nid))
+ return __page_migrate_nid(page, page_nid);
+ if (!PageLRU(page))
+ return page_nid;
+ if (cpu_nid != page_nid)
+ autonuma_migrate_page_add(page, cpu_nid, page_nid);
+ else
+ autonuma_migrate_page_remove(page);
+ return cpu_nid;
+}
+
+void numa_hinting_fault(struct page *page, int numpages)
+{
+ WARN_ON_ONCE(!current->mm);
+ if (likely(current->mm && !current->mempolicy && autonuma_enabled())) {
+ struct task_struct *p = current;
+ int cpu_nid, page_nid, access_nid;
+ bool pass;
+
+ pass = p->task_autonuma->task_numa_fault_pass !=
+ p->mm->mm_autonuma->mm_numa_fault_pass;
+ page_nid = page_to_nid(page);
+ cpu_nid = numa_node_id();
+ VM_BUG_ON(cpu_nid < 0);
+ VM_BUG_ON(cpu_nid >= MAX_NUMNODES);
+ access_nid = numa_hinting_fault_memory_follow_cpu(p, page,
+ cpu_nid,
+ page_nid,
+ pass);
+ numa_hinting_fault_cpu_follow_memory(p, access_nid,
+ numpages, pass);
+ if (unlikely(pass))
+ p->task_autonuma->task_numa_fault_pass =
+ p->mm->mm_autonuma->mm_numa_fault_pass;
+ }
+}
+
+pte_t __pte_numa_fixup(struct mm_struct *mm, struct vm_area_struct *vma,
+ unsigned long addr, pte_t pte, pte_t *ptep)
+{
+ struct page *page;
+ pte = pte_mknotnuma(pte);
+ set_pte_at(mm, addr, ptep, pte);
+ page = vm_normal_page(vma, addr, pte);
+ BUG_ON(!page);
+ numa_hinting_fault(page, 1);
+ return pte;
+}
+
+void __pmd_numa_fixup(struct mm_struct *mm,
+ unsigned long addr, pmd_t *pmdp)
+{
+ pmd_t pmd;
+ pte_t *pte;
+ unsigned long _addr = addr & PMD_MASK;
+ unsigned long offset;
+ spinlock_t *ptl;
+ bool numa = false;
+ struct vm_area_struct *vma;
+
+ spin_lock(&mm->page_table_lock);
+ pmd = *pmdp;
+ if (pmd_numa(pmd)) {
+ set_pmd_at(mm, _addr, pmdp, pmd_mknotnuma(pmd));
+ numa = true;
+ }
+ spin_unlock(&mm->page_table_lock);
+
+ if (!numa)
+ return;
+
+ vma = find_vma(mm, _addr);
+ /* we're in a page fault so some vma must be in the range */
+ BUG_ON(!vma);
+ BUG_ON(vma->vm_start >= _addr + PMD_SIZE);
+ offset = max(_addr, vma->vm_start) & ~PMD_MASK;
+ VM_BUG_ON(offset >= PMD_SIZE);
+ pte = pte_offset_map_lock(mm, pmdp, _addr, &ptl);
+ pte += offset >> PAGE_SHIFT;
+ for (addr = _addr + offset; addr < _addr + PMD_SIZE; pte++, addr += PAGE_SIZE) {
+ pte_t pteval = *pte;
+ struct page * page;
+ if (!pte_present(pteval))
+ continue;
+ if (addr >= vma->vm_end) {
+ vma = find_vma(mm, addr);
+ /* there's a pte present so there must be a vma */
+ BUG_ON(!vma);
+ BUG_ON(addr < vma->vm_start);
+ }
+ if (pte_numa(pteval)) {
+ pteval = pte_mknotnuma(pteval);
+ set_pte_at(mm, addr, pte, pteval);
+ }
+ page = vm_normal_page(vma, addr, pteval);
+ if (unlikely(!page))
+ continue;
+ /* only check non-shared pages */
+ if (page_mapcount(page) != 1)
+ continue;
+ numa_hinting_fault(page, 1);
+ }
+ pte_unmap_unlock(pte, ptl);
+}
+
+static inline int task_autonuma_size(void)
+{
+ return sizeof(struct task_autonuma) +
+ num_possible_nodes() * sizeof(unsigned long);
+}
+
+static inline int task_autonuma_reset_size(void)
+{
+ struct task_autonuma *task_autonuma = NULL;
+ return task_autonuma_size() -
+ (int)((char *)(&task_autonuma->task_numa_fault_pass) -
+ (char *)task_autonuma);
+}
+
+static void task_autonuma_reset(struct task_autonuma *task_autonuma)
+{
+ task_autonuma->autonuma_node = -1;
+ memset(&task_autonuma->task_numa_fault_pass, 0,
+ task_autonuma_reset_size());
+}
+
+static inline int mm_autonuma_fault_size(void)
+{
+ return num_possible_nodes() * sizeof(unsigned long);
+}
+
+static inline unsigned long *mm_autonuma_numa_fault_tmp(struct mm_struct *mm)
+{
+ return mm->mm_autonuma->mm_numa_fault + num_possible_nodes();
+}
+
+static inline int mm_autonuma_size(void)
+{
+ return sizeof(struct mm_autonuma) + mm_autonuma_fault_size() * 2;
+}
+
+static inline int mm_autonuma_reset_size(void)
+{
+ struct mm_autonuma *mm_autonuma = NULL;
+ return mm_autonuma_size() -
+ (int)((char *)(&mm_autonuma->mm_numa_fault_pass) -
+ (char *)mm_autonuma);
+}
+
+static void mm_autonuma_reset(struct mm_autonuma *mm_autonuma)
+{
+ memset(&mm_autonuma->mm_numa_fault_pass, 0, mm_autonuma_reset_size());
+}
+
+void autonuma_setup_new_exec(struct task_struct *p)
+{
+ if (p->task_autonuma)
+ task_autonuma_reset(p->task_autonuma);
+ if (p->mm && p->mm->mm_autonuma)
+ mm_autonuma_reset(p->mm->mm_autonuma);
+}
+
+static inline int knumad_test_exit(struct mm_struct *mm)
+{
+ return atomic_read(&mm->mm_users) == 0;
+}
+
+static int knumad_scan_pmd(struct mm_struct *mm,
+ struct vm_area_struct *vma,
+ unsigned long address)
+{
+ pgd_t *pgd;
+ pud_t *pud;
+ pmd_t *pmd;
+ pte_t *pte, *_pte;
+ struct page *page;
+ unsigned long _address, end;
+ spinlock_t *ptl;
+ int ret = 0;
+
+ VM_BUG_ON(address & ~PAGE_MASK);
+
+ pgd = pgd_offset(mm, address);
+ if (!pgd_present(*pgd))
+ goto out;
+
+ pud = pud_offset(pgd, address);
+ if (!pud_present(*pud))
+ goto out;
+
+ pmd = pmd_offset(pud, address);
+ if (pmd_none(*pmd))
+ goto out;
+ if (pmd_trans_huge(*pmd)) {
+ spin_lock(&mm->page_table_lock);
+ if (pmd_trans_huge(*pmd)) {
+ VM_BUG_ON(address & ~HPAGE_PMD_MASK);
+ if (unlikely(pmd_trans_splitting(*pmd))) {
+ spin_unlock(&mm->page_table_lock);
+ wait_split_huge_page(vma->anon_vma, pmd);
+ } else {
+ int page_nid;
+ unsigned long *numa_fault_tmp;
+ ret = HPAGE_PMD_NR;
+
+ if (autonuma_scan_use_working_set() &&
+ pmd_numa(*pmd)) {
+ spin_unlock(&mm->page_table_lock);
+ goto out;
+ }
+
+ page = pmd_page(*pmd);
+
+ /* only check non-shared pages */
+ if (page_mapcount(page) != 1) {
+ spin_unlock(&mm->page_table_lock);
+ goto out;
+ }
+
+ page_nid = page_migrate_nid(page);
+ numa_fault_tmp = mm_autonuma_numa_fault_tmp(mm);
+ numa_fault_tmp[page_nid] += ret;
+
+ if (pmd_numa(*pmd)) {
+ spin_unlock(&mm->page_table_lock);
+ goto out;
+ }
+
+ set_pmd_at(mm, address, pmd, pmd_mknuma(*pmd));
+ /* defer TLB flush to lower the overhead */
+ spin_unlock(&mm->page_table_lock);
+ goto out;
+ }
+ } else
+ spin_unlock(&mm->page_table_lock);
+ }
+
+ VM_BUG_ON(!pmd_present(*pmd));
+
+ end = min(vma->vm_end, (address + PMD_SIZE) & PMD_MASK);
+ pte = pte_offset_map_lock(mm, pmd, address, &ptl);
+ for (_address = address, _pte = pte; _address < end;
+ _pte++, _address += PAGE_SIZE) {
+ unsigned long *numa_fault_tmp;
+ pte_t pteval = *_pte;
+ if (!pte_present(pteval))
+ continue;
+ if (autonuma_scan_use_working_set() &&
+ pte_numa(pteval))
+ continue;
+ page = vm_normal_page(vma, _address, pteval);
+ if (unlikely(!page))
+ continue;
+ /* only check non-shared pages */
+ if (page_mapcount(page) != 1)
+ continue;
+
+ numa_fault_tmp = mm_autonuma_numa_fault_tmp(mm);
+ numa_fault_tmp[page_migrate_nid(page)]++;
+
+ if (pte_numa(pteval))
+ continue;
+
+ if (!autonuma_scan_pmd())
+ set_pte_at(mm, _address, _pte, pte_mknuma(pteval));
+
+ /* defer TLB flush to lower the overhead */
+ ret++;
+ }
+ pte_unmap_unlock(pte, ptl);
+
+ if (ret && !pmd_numa(*pmd) && autonuma_scan_pmd()) {
+ spin_lock(&mm->page_table_lock);
+ set_pmd_at(mm, address, pmd, pmd_mknuma(*pmd));
+ spin_unlock(&mm->page_table_lock);
+ /* defer TLB flush to lower the overhead */
+ }
+
+out:
+ return ret;
+}
+
+static void mm_numa_fault_flush(struct mm_struct *mm)
+{
+ int nid;
+ struct mm_autonuma *mma = mm->mm_autonuma;
+ unsigned long *numa_fault_tmp = mm_autonuma_numa_fault_tmp(mm);
+ unsigned long tot = 0;
+ /* FIXME: protect this with seqlock against autonuma_balance() */
+ for_each_node(nid) {
+ mma->mm_numa_fault[nid] = numa_fault_tmp[nid];
+ tot += mma->mm_numa_fault[nid];
+ numa_fault_tmp[nid] = 0;
+ }
+ mma->mm_numa_fault_tot = tot;
+}
+
+static int knumad_do_scan(void)
+{
+ struct mm_struct *mm;
+ struct mm_autonuma *mm_autonuma;
+ unsigned long address;
+ struct vm_area_struct *vma;
+ int progress = 0;
+
+ mm = knumad_scan.mm;
+ if (!mm) {
+ if (unlikely(list_empty(&knumad_scan.mm_head)))
+ return pages_to_scan;
+ mm_autonuma = list_entry(knumad_scan.mm_head.next,
+ struct mm_autonuma, mm_node);
+ mm = mm_autonuma->mm;
+ knumad_scan.address = 0;
+ knumad_scan.mm = mm;
+ atomic_inc(&mm->mm_count);
+ mm_autonuma->mm_numa_fault_pass++;
+ }
+ address = knumad_scan.address;
+
+ mutex_unlock(&knumad_mm_mutex);
+
+ down_read(&mm->mmap_sem);
+ if (unlikely(knumad_test_exit(mm)))
+ vma = NULL;
+ else
+ vma = find_vma(mm, address);
+
+ progress++;
+ for (; vma && progress < pages_to_scan; vma = vma->vm_next) {
+ unsigned long start_addr, end_addr;
+ cond_resched();
+ if (unlikely(knumad_test_exit(mm))) {
+ progress++;
+ break;
+ }
+
+ if (!vma->anon_vma || vma_policy(vma)) {
+ progress++;
+ continue;
+ }
+ if (vma->vm_flags & (VM_PFNMAP | VM_MIXEDMAP)) {
+ progress++;
+ continue;
+ }
+ if (is_vma_temporary_stack(vma)) {
+ progress++;
+ continue;
+ }
+
+ VM_BUG_ON(address & ~PAGE_MASK);
+ if (address < vma->vm_start)
+ address = vma->vm_start;
+
+ start_addr = address;
+ while (address < vma->vm_end) {
+ cond_resched();
+ if (unlikely(knumad_test_exit(mm)))
+ break;
+
+ VM_BUG_ON(address < vma->vm_start ||
+ address + PAGE_SIZE > vma->vm_end);
+ progress += knumad_scan_pmd(mm, vma, address);
+ /* move to next address */
+ address = (address + PMD_SIZE) & PMD_MASK;
+ if (progress >= pages_to_scan)
+ break;
+ }
+ end_addr = min(address, vma->vm_end);
+
+ /*
+ * Flush the TLB for the mm to start the numa
+ * hinting minor page faults after we finish
+ * scanning this vma part.
+ */
+ mmu_notifier_invalidate_range_start(vma->vm_mm, start_addr,
+ end_addr);
+ flush_tlb_range(vma, start_addr, end_addr);
+ mmu_notifier_invalidate_range_end(vma->vm_mm, start_addr,
+ end_addr);
+ }
+ up_read(&mm->mmap_sem); /* exit_mmap will destroy ptes after this */
+
+ mutex_lock(&knumad_mm_mutex);
+ VM_BUG_ON(knumad_scan.mm != mm);
+ knumad_scan.address = address;
+ /*
+ * Change the current mm if this mm is about to die, or if we
+ * scanned all vmas of this mm.
+ */
+ if (knumad_test_exit(mm) || !vma) {
+ mm_autonuma = mm->mm_autonuma;
+ if (mm_autonuma->mm_node.next != &knumad_scan.mm_head) {
+ mm_autonuma = list_entry(mm_autonuma->mm_node.next,
+ struct mm_autonuma, mm_node);
+ knumad_scan.mm = mm_autonuma->mm;
+ atomic_inc(&knumad_scan.mm->mm_count);
+ knumad_scan.address = 0;
+ knumad_scan.mm->mm_autonuma->mm_numa_fault_pass++;
+ } else
+ knumad_scan.mm = NULL;
+
+ if (knumad_test_exit(mm)) {
+ list_del(&mm->mm_autonuma->mm_node);
+ /* tell autonuma_exit not to list_del */
+ VM_BUG_ON(mm->mm_autonuma->mm != mm);
+ mm->mm_autonuma->mm = NULL;
+ } else
+ mm_numa_fault_flush(mm);
+
+ mmdrop(mm);
+ }
+
+ return progress;
+}
+
+static void wake_up_knuma_migrated(void)
+{
+ int nid;
+
+ lru_add_drain();
+ for_each_online_node(nid) {
+ struct pglist_data *pgdat = NODE_DATA(nid);
+ if (pgdat->autonuma_nr_migrate_pages &&
+ waitqueue_active(&pgdat->autonuma_knuma_migrated_wait))
+ wake_up_interruptible(&pgdat->
+ autonuma_knuma_migrated_wait);
+ }
+}
+
+static void knuma_scand_disabled(void)
+{
+ if (!autonuma_enabled())
+ wait_event_freezable(knuma_scand_wait,
+ autonuma_enabled() ||
+ kthread_should_stop());
+}
+
+static int knuma_scand(void *none)
+{
+ struct mm_struct *mm = NULL;
+ int progress = 0, _progress;
+ unsigned long total_progress = 0;
+
+ set_freezable();
+
+ knuma_scand_disabled();
+
+ mutex_lock(&knumad_mm_mutex);
+
+ for (;;) {
+ if (unlikely(kthread_should_stop()))
+ break;
+ _progress = knumad_do_scan();
+ progress += _progress;
+ total_progress += _progress;
+ mutex_unlock(&knumad_mm_mutex);
+
+ if (unlikely(!knumad_scan.mm)) {
+ autonuma_printk("knuma_scand %lu\n", total_progress);
+ pages_scanned += total_progress;
+ total_progress = 0;
+ full_scans++;
+
+ wait_event_freezable_timeout(knuma_scand_wait,
+ kthread_should_stop(),
+ msecs_to_jiffies(
+ scan_sleep_pass_millisecs));
+ /* flush the last pending pages < pages_to_migrate */
+ wake_up_knuma_migrated();
+ wait_event_freezable_timeout(knuma_scand_wait,
+ kthread_should_stop(),
+ msecs_to_jiffies(
+ scan_sleep_pass_millisecs));
+
+ if (autonuma_debug()) {
+ extern void sched_autonuma_dump_mm(void);
+ sched_autonuma_dump_mm();
+ }
+
+ /* wait while there is no pinned mm */
+ knuma_scand_disabled();
+ }
+ if (progress > pages_to_scan) {
+ progress = 0;
+ wait_event_freezable_timeout(knuma_scand_wait,
+ kthread_should_stop(),
+ msecs_to_jiffies(
+ scan_sleep_millisecs));
+ }
+ cond_resched();
+ mutex_lock(&knumad_mm_mutex);
+ }
+
+ mm = knumad_scan.mm;
+ knumad_scan.mm = NULL;
+ if (mm && knumad_test_exit(mm)) {
+ list_del(&mm->mm_autonuma->mm_node);
+ /* tell autonuma_exit not to list_del */
+ VM_BUG_ON(mm->mm_autonuma->mm != mm);
+ mm->mm_autonuma->mm = NULL;
+ }
+ mutex_unlock(&knumad_mm_mutex);
+
+ if (mm)
+ mmdrop(mm);
+
+ return 0;
+}
+
+static int isolate_migratepages(struct list_head *migratepages,
+ struct pglist_data *pgdat)
+{
+ int nr = 0, nid;
+ struct list_head *heads = pgdat->autonuma_migrate_head;
+
+ /* FIXME: THP balancing, restart from last nid */
+ for_each_online_node(nid) {
+ struct zone *zone;
+ struct page *page;
+ struct lruvec *lruvec;
+
+ cond_resched();
+ VM_BUG_ON(numa_node_id() != pgdat->node_id);
+ if (nid == pgdat->node_id) {
+ VM_BUG_ON(!list_empty(&heads[nid]));
+ continue;
+ }
+ if (list_empty(&heads[nid]))
+ continue;
+ /* some page wants to go to this pgdat */
+ /*
+ * Take the lock with irqs disabled to avoid a lock
+ * inversion with the lru_lock which is taken before
+ * the autonuma_migrate_lock in split_huge_page, and
+ * that could be taken by interrupts after we obtained
+ * the autonuma_migrate_lock here, if we didn't disable
+ * irqs.
+ */
+ autonuma_migrate_lock_irq(pgdat->node_id);
+ if (list_empty(&heads[nid])) {
+ autonuma_migrate_unlock_irq(pgdat->node_id);
+ continue;
+ }
+ page = list_entry(heads[nid].prev,
+ struct page,
+ autonuma_migrate_node);
+ if (unlikely(!get_page_unless_zero(page))) {
+ /*
+ * The page is getting freed and will remove itself from
+ * the autonuma list shortly, skip it for now.
+ */
+ list_del(&page->autonuma_migrate_node);
+ list_add(&page->autonuma_migrate_node,
+ &heads[nid]);
+ autonuma_migrate_unlock_irq(pgdat->node_id);
+ autonuma_printk("autonuma migrate page is free\n");
+ continue;
+ }
+ if (!PageLRU(page)) {
+ autonuma_migrate_unlock_irq(pgdat->node_id);
+ autonuma_printk("autonuma migrate page not in LRU\n");
+ __autonuma_migrate_page_remove(page);
+ put_page(page);
+ continue;
+ }
+ autonuma_migrate_unlock_irq(pgdat->node_id);
+
+ VM_BUG_ON(nid != page_to_nid(page));
+
+ if (PageTransHuge(page)) {
+ VM_BUG_ON(!PageAnon(page));
+ /* FIXME: remove split_huge_page */
+ if (unlikely(split_huge_page(page))) {
+ autonuma_printk("autonuma migrate THP free\n");
+ __autonuma_migrate_page_remove(page);
+ put_page(page);
+ continue;
+ }
+ }
+
+ __autonuma_migrate_page_remove(page);
+
+ zone = page_zone(page);
+ spin_lock_irq(&zone->lru_lock);
+
+ /* Must run under the lru_lock and before page isolation */
+ lruvec = mem_cgroup_page_lruvec(page, zone);
+
+ if (!__isolate_lru_page(page, 0)) {
+ VM_BUG_ON(PageTransCompound(page));
+ del_page_from_lru_list(page, lruvec, page_lru(page));
+ inc_zone_state(zone, page_is_file_cache(page) ?
+ NR_ISOLATED_FILE : NR_ISOLATED_ANON);
+ spin_unlock_irq(&zone->lru_lock);
+ /*
+ * hold the page pin at least until
+ * __isolate_lru_page succeeds
+ * (__isolate_lru_page takes a second pin when
+ * it succeeds). If we release the pin before
+ * __isolate_lru_page returns, the page could
+ * have been freed and reallocated from under
+ * us, so rendering worthless our previous
+ * checks on the page including the
+ * split_huge_page call.
+ */
+ put_page(page);
+
+ list_add(&page->lru, migratepages);
+ nr += hpage_nr_pages(page);
+ } else {
+ /* FIXME: losing page, safest and simplest for now */
+ spin_unlock_irq(&zone->lru_lock);
+ put_page(page);
+ autonuma_printk("autonuma migrate page lost\n");
+ }
+ }
+
+ return nr;
+}
+
+static struct page *alloc_migrate_dst_page(struct page *page,
+ unsigned long data,
+ int **result)
+{
+ int nid = (int) data;
+ struct page *newpage;
+ newpage = alloc_pages_exact_node(nid,
+ GFP_HIGHUSER_MOVABLE | GFP_THISNODE,
+ 0);
+ if (newpage)
+ newpage->autonuma_last_nid = page->autonuma_last_nid;
+ return newpage;
+}
+
+static void knumad_do_migrate(struct pglist_data *pgdat)
+{
+ int nr_migrate_pages = 0;
+ LIST_HEAD(migratepages);
+
+ autonuma_printk("nr_migrate_pages %lu to node %d\n",
+ pgdat->autonuma_nr_migrate_pages, pgdat->node_id);
+ do {
+ int isolated = 0;
+ if (balance_pgdat(pgdat, nr_migrate_pages))
+ isolated = isolate_migratepages(&migratepages, pgdat);
+ /* FIXME: might need to check too many isolated */
+ if (!isolated)
+ break;
+ nr_migrate_pages += isolated;
+ } while (nr_migrate_pages < pages_to_migrate);
+
+ if (nr_migrate_pages) {
+ int err;
+ autonuma_printk("migrate %d to node %d\n", nr_migrate_pages,
+ pgdat->node_id);
+ pages_migrated += nr_migrate_pages; /* FIXME: per node */
+ err = migrate_pages(&migratepages, alloc_migrate_dst_page,
+ pgdat->node_id, false, true);
+ if (err)
+ /* FIXME: requeue failed pages */
+ putback_lru_pages(&migratepages);
+ }
+}
+
+static int knuma_migrated(void *arg)
+{
+ struct pglist_data *pgdat = (struct pglist_data *)arg;
+ int nid = pgdat->node_id;
+ DECLARE_WAIT_QUEUE_HEAD_ONSTACK(nowakeup);
+
+ set_freezable();
+
+ for (;;) {
+ if (unlikely(kthread_should_stop()))
+ break;
+ /* FIXME: scan the free levels of this node; we may not
+ * be allowed to receive memory if the wmarks of this
+ * pgdat are below high. In the future also add
+ * not-interesting pages like not-accessed pages to
+ * pgdat->autonuma_migrate_head[pgdat->node_id], so we
+ * can move our memory away to other nodes in order
+ * to satisfy the high-wmark described above (so migration
+ * can continue).
+ */
+ knumad_do_migrate(pgdat);
+ if (!pgdat->autonuma_nr_migrate_pages) {
+ wait_event_freezable(
+ pgdat->autonuma_knuma_migrated_wait,
+ pgdat->autonuma_nr_migrate_pages ||
+ kthread_should_stop());
+ autonuma_printk("wake knuma_migrated %d\n", nid);
+ } else
+ wait_event_freezable_timeout(nowakeup,
+ kthread_should_stop(),
+ msecs_to_jiffies(
+ migrate_sleep_millisecs));
+ }
+
+ return 0;
+}
+
+void autonuma_enter(struct mm_struct *mm)
+{
+ if (autonuma_impossible())
+ return;
+
+ mutex_lock(&knumad_mm_mutex);
+ list_add_tail(&mm->mm_autonuma->mm_node, &knumad_scan.mm_head);
+ mutex_unlock(&knumad_mm_mutex);
+}
+
+void autonuma_exit(struct mm_struct *mm)
+{
+ bool serialize;
+
+ if (autonuma_impossible())
+ return;
+
+ serialize = false;
+ mutex_lock(&knumad_mm_mutex);
+ if (knumad_scan.mm == mm)
+ serialize = true;
+ else if (mm->mm_autonuma->mm) {
+ VM_BUG_ON(mm->mm_autonuma->mm != mm);
+ mm->mm_autonuma->mm = NULL; /* debug */
+ list_del(&mm->mm_autonuma->mm_node);
+ }
+ mutex_unlock(&knumad_mm_mutex);
+
+ if (serialize) {
+ /* prevent the mm from going away under knumad_do_scan's main loop */
+ down_write(&mm->mmap_sem);
+ up_write(&mm->mmap_sem);
+ }
+}
+
+static int start_knuma_scand(void)
+{
+ int err = 0;
+ struct task_struct *knumad_thread;
+
+ knumad_thread = kthread_run(knuma_scand, NULL, "knuma_scand");
+ if (unlikely(IS_ERR(knumad_thread))) {
+ autonuma_printk(KERN_ERR
+ "knumad: kthread_run(knuma_scand) failed\n");
+ err = PTR_ERR(knumad_thread);
+ }
+ return err;
+}
+
+static int start_knuma_migrated(void)
+{
+ int err = 0;
+ struct task_struct *knumad_thread;
+ int nid;
+
+ for_each_online_node(nid) {
+ knumad_thread = kthread_create_on_node(knuma_migrated,
+ NODE_DATA(nid),
+ nid,
+ "knuma_migrated%d",
+ nid);
+ if (unlikely(IS_ERR(knumad_thread))) {
+ autonuma_printk(KERN_ERR
+ "knumad: "
+ "kthread_run(knuma_migrated%d) "
+ "failed\n", nid);
+ err = PTR_ERR(knumad_thread);
+ } else {
+ autonuma_printk("cpumask %d %lx\n", nid,
+ cpumask_of_node(nid)->bits[0]);
+ kthread_bind_node(knumad_thread, nid);
+ wake_up_process(knumad_thread);
+ }
+ }
+ return err;
+}
+
+
+#ifdef CONFIG_SYSFS
+
+static ssize_t flag_show(struct kobject *kobj,
+ struct kobj_attribute *attr, char *buf,
+ enum autonuma_flag flag)
+{
+ return sprintf(buf, "%d\n",
+ !!test_bit(flag, &autonuma_flags));
+}
+static ssize_t flag_store(struct kobject *kobj,
+ struct kobj_attribute *attr,
+ const char *buf, size_t count,
+ enum autonuma_flag flag)
+{
+ unsigned long value;
+ int ret;
+
+ ret = kstrtoul(buf, 10, &value);
+ if (ret < 0)
+ return ret;
+ if (value > 1)
+ return -EINVAL;
+
+ if (value)
+ set_bit(flag, &autonuma_flags);
+ else
+ clear_bit(flag, &autonuma_flags);
+
+ return count;
+}
+
+static ssize_t enabled_show(struct kobject *kobj,
+ struct kobj_attribute *attr, char *buf)
+{
+ return flag_show(kobj, attr, buf, AUTONUMA_FLAG);
+}
+static ssize_t enabled_store(struct kobject *kobj,
+ struct kobj_attribute *attr,
+ const char *buf, size_t count)
+{
+ ssize_t ret;
+
+ ret = flag_store(kobj, attr, buf, count, AUTONUMA_FLAG);
+
+ if (ret > 0 && autonuma_enabled())
+ wake_up_interruptible(&knuma_scand_wait);
+
+ return ret;
+}
+static struct kobj_attribute enabled_attr =
+ __ATTR(enabled, 0644, enabled_show, enabled_store);
+
+#define SYSFS_ENTRY(NAME, FLAG) \
+static ssize_t NAME ## _show(struct kobject *kobj, \
+ struct kobj_attribute *attr, char *buf) \
+{ \
+ return flag_show(kobj, attr, buf, FLAG); \
+} \
+ \
+static ssize_t NAME ## _store(struct kobject *kobj, \
+ struct kobj_attribute *attr, \
+ const char *buf, size_t count) \
+{ \
+ return flag_store(kobj, attr, buf, count, FLAG); \
+} \
+static struct kobj_attribute NAME ## _attr = \
+ __ATTR(NAME, 0644, NAME ## _show, NAME ## _store);
+
+SYSFS_ENTRY(debug, AUTONUMA_DEBUG_FLAG);
+SYSFS_ENTRY(pmd, AUTONUMA_SCAN_PMD_FLAG);
+SYSFS_ENTRY(working_set, AUTONUMA_SCAN_USE_WORKING_SET_FLAG);
+SYSFS_ENTRY(defer, AUTONUMA_MIGRATE_DEFER_FLAG);
+SYSFS_ENTRY(load_balance_strict, AUTONUMA_SCHED_LOAD_BALANCE_STRICT_FLAG);
+SYSFS_ENTRY(clone_reset, AUTONUMA_SCHED_CLONE_RESET_FLAG);
+SYSFS_ENTRY(fork_reset, AUTONUMA_SCHED_FORK_RESET_FLAG);
+
+#undef SYSFS_ENTRY
+
+enum {
+ SYSFS_KNUMA_SCAND_SLEEP_ENTRY,
+ SYSFS_KNUMA_SCAND_PAGES_ENTRY,
+ SYSFS_KNUMA_MIGRATED_SLEEP_ENTRY,
+ SYSFS_KNUMA_MIGRATED_PAGES_ENTRY,
+};
+
+#define SYSFS_ENTRY(NAME, SYSFS_TYPE) \
+static ssize_t NAME ## _show(struct kobject *kobj, \
+ struct kobj_attribute *attr, \
+ char *buf) \
+{ \
+ return sprintf(buf, "%u\n", NAME); \
+} \
+static ssize_t NAME ## _store(struct kobject *kobj, \
+ struct kobj_attribute *attr, \
+ const char *buf, size_t count) \
+{ \
+ unsigned long val; \
+ int err; \
+ \
+ err = strict_strtoul(buf, 10, &val); \
+ if (err || val > UINT_MAX) \
+ return -EINVAL; \
+ switch (SYSFS_TYPE) { \
+ case SYSFS_KNUMA_SCAND_PAGES_ENTRY: \
+ case SYSFS_KNUMA_MIGRATED_PAGES_ENTRY: \
+ if (!val) \
+ return -EINVAL; \
+ break; \
+ } \
+ \
+ NAME = val; \
+ switch (SYSFS_TYPE) { \
+ case SYSFS_KNUMA_SCAND_SLEEP_ENTRY: \
+ wake_up_interruptible(&knuma_scand_wait); \
+ break; \
+ case \
+ SYSFS_KNUMA_MIGRATED_SLEEP_ENTRY: \
+ wake_up_knuma_migrated(); \
+ break; \
+ } \
+ \
+ return count; \
+} \
+static struct kobj_attribute NAME ## _attr = \
+ __ATTR(NAME, 0644, NAME ## _show, NAME ## _store);
+
+SYSFS_ENTRY(scan_sleep_millisecs, SYSFS_KNUMA_SCAND_SLEEP_ENTRY);
+SYSFS_ENTRY(scan_sleep_pass_millisecs, SYSFS_KNUMA_SCAND_SLEEP_ENTRY);
+SYSFS_ENTRY(pages_to_scan, SYSFS_KNUMA_SCAND_PAGES_ENTRY);
+
+SYSFS_ENTRY(migrate_sleep_millisecs, SYSFS_KNUMA_MIGRATED_SLEEP_ENTRY);
+SYSFS_ENTRY(pages_to_migrate, SYSFS_KNUMA_MIGRATED_PAGES_ENTRY);
+
+#undef SYSFS_ENTRY
+
+static struct attribute *autonuma_attr[] = {
+ &enabled_attr.attr,
+ &debug_attr.attr,
+ NULL,
+};
+static struct attribute_group autonuma_attr_group = {
+ .attrs = autonuma_attr,
+};
+
+#define SYSFS_ENTRY(NAME) \
+static ssize_t NAME ## _show(struct kobject *kobj, \
+ struct kobj_attribute *attr, \
+ char *buf) \
+{ \
+ return sprintf(buf, "%lu\n", NAME); \
+} \
+static struct kobj_attribute NAME ## _attr = \
+ __ATTR_RO(NAME);
+
+SYSFS_ENTRY(full_scans);
+SYSFS_ENTRY(pages_scanned);
+SYSFS_ENTRY(pages_migrated);
+
+#undef SYSFS_ENTRY
+
+static struct attribute *knuma_scand_attr[] = {
+ &scan_sleep_millisecs_attr.attr,
+ &scan_sleep_pass_millisecs_attr.attr,
+ &pages_to_scan_attr.attr,
+ &pages_scanned_attr.attr,
+ &full_scans_attr.attr,
+ &pmd_attr.attr,
+ &working_set_attr.attr,
+ NULL,
+};
+static struct attribute_group knuma_scand_attr_group = {
+ .attrs = knuma_scand_attr,
+ .name = "knuma_scand",
+};
+
+static struct attribute *knuma_migrated_attr[] = {
+ &migrate_sleep_millisecs_attr.attr,
+ &pages_to_migrate_attr.attr,
+ &pages_migrated_attr.attr,
+ &defer_attr.attr,
+ NULL,
+};
+static struct attribute_group knuma_migrated_attr_group = {
+ .attrs = knuma_migrated_attr,
+ .name = "knuma_migrated",
+};
+
+static struct attribute *scheduler_attr[] = {
+ &clone_reset_attr.attr,
+ &fork_reset_attr.attr,
+ &load_balance_strict_attr.attr,
+ NULL,
+};
+static struct attribute_group scheduler_attr_group = {
+ .attrs = scheduler_attr,
+ .name = "scheduler",
+};
+
+static int __init autonuma_init_sysfs(struct kobject **autonuma_kobj)
+{
+ int err;
+
+ *autonuma_kobj = kobject_create_and_add("autonuma", mm_kobj);
+ if (unlikely(!*autonuma_kobj)) {
+ printk(KERN_ERR "autonuma: failed kobject create\n");
+ return -ENOMEM;
+ }
+
+ err = sysfs_create_group(*autonuma_kobj, &autonuma_attr_group);
+ if (err) {
+ printk(KERN_ERR "autonuma: failed register autonuma group\n");
+ goto delete_obj;
+ }
+
+ err = sysfs_create_group(*autonuma_kobj, &knuma_scand_attr_group);
+ if (err) {
+ printk(KERN_ERR
+ "autonuma: failed register knuma_scand group\n");
+ goto remove_autonuma;
+ }
+
+ err = sysfs_create_group(*autonuma_kobj, &knuma_migrated_attr_group);
+ if (err) {
+ printk(KERN_ERR
+ "autonuma: failed register knuma_migrated group\n");
+ goto remove_knuma_scand;
+ }
+
+ err = sysfs_create_group(*autonuma_kobj, &scheduler_attr_group);
+ if (err) {
+ printk(KERN_ERR
+ "autonuma: failed register scheduler group\n");
+ goto remove_knuma_migrated;
+ }
+
+ return 0;
+
+remove_knuma_migrated:
+ sysfs_remove_group(*autonuma_kobj, &knuma_migrated_attr_group);
+remove_knuma_scand:
+ sysfs_remove_group(*autonuma_kobj, &knuma_scand_attr_group);
+remove_autonuma:
+ sysfs_remove_group(*autonuma_kobj, &autonuma_attr_group);
+delete_obj:
+ kobject_put(*autonuma_kobj);
+ return err;
+}
+
+static void __init autonuma_exit_sysfs(struct kobject *autonuma_kobj)
+{
+ sysfs_remove_group(autonuma_kobj, &knuma_migrated_attr_group);
+ sysfs_remove_group(autonuma_kobj, &knuma_scand_attr_group);
+ sysfs_remove_group(autonuma_kobj, &autonuma_attr_group);
+ kobject_put(autonuma_kobj);
+}
+#else
+static inline int autonuma_init_sysfs(struct kobject **autonuma_kobj)
+{
+ return 0;
+}
+
+static inline void autonuma_exit_sysfs(struct kobject *autonuma_kobj)
+{
+}
+#endif /* CONFIG_SYSFS */
+
+static int __init noautonuma_setup(char *str)
+{
+ if (!autonuma_impossible()) {
+ printk("AutoNUMA permanently disabled\n");
+ set_bit(AUTONUMA_IMPOSSIBLE_FLAG, &autonuma_flags);
+ BUG_ON(!autonuma_impossible());
+ }
+ return 1;
+}
+__setup("noautonuma", noautonuma_setup);
+
+static int __init autonuma_init(void)
+{
+ int err;
+ struct kobject *autonuma_kobj;
+
+ VM_BUG_ON(num_possible_nodes() < 1);
+ if (autonuma_impossible())
+ return -EINVAL;
+
+ err = autonuma_init_sysfs(&autonuma_kobj);
+ if (err)
+ return err;
+
+ err = start_knuma_scand();
+ if (err) {
+ printk("failed to start knuma_scand\n");
+ goto out;
+ }
+ err = start_knuma_migrated();
+ if (err) {
+ printk("failed to start knuma_migrated\n");
+ goto out;
+ }
+
+ printk("AutoNUMA initialized successfully\n");
+ return err;
+
+out:
+ autonuma_exit_sysfs(autonuma_kobj);
+ return err;
+}
+module_init(autonuma_init)
+
+static struct kmem_cache *task_autonuma_cachep;
+
+int alloc_task_autonuma(struct task_struct *tsk, struct task_struct *orig,
+ int node)
+{
+ int err = 1;
+ struct task_autonuma *task_autonuma;
+
+ if (autonuma_impossible())
+ goto no_numa;
+ task_autonuma = kmem_cache_alloc_node(task_autonuma_cachep,
+ GFP_KERNEL, node);
+ if (!task_autonuma)
+ goto out;
+ if (autonuma_sched_clone_reset())
+ task_autonuma_reset(task_autonuma);
+ else
+ memcpy(task_autonuma, orig->task_autonuma,
+ task_autonuma_size());
+ tsk->task_autonuma = task_autonuma;
+no_numa:
+ err = 0;
+out:
+ return err;
+}
+
+void free_task_autonuma(struct task_struct *tsk)
+{
+ if (autonuma_impossible()) {
+ BUG_ON(tsk->task_autonuma);
+ return;
+ }
+
+ BUG_ON(!tsk->task_autonuma);
+ kmem_cache_free(task_autonuma_cachep, tsk->task_autonuma);
+ tsk->task_autonuma = NULL;
+}
+
+void __init task_autonuma_init(void)
+{
+ struct task_autonuma *task_autonuma;
+
+ BUG_ON(current != &init_task);
+
+ if (autonuma_impossible())
+ return;
+
+ task_autonuma_cachep =
+ kmem_cache_create("task_autonuma",
+ task_autonuma_size(), 0,
+ SLAB_PANIC | SLAB_HWCACHE_ALIGN, NULL);
+
+ task_autonuma = kmem_cache_alloc_node(task_autonuma_cachep,
+ GFP_KERNEL, numa_node_id());
+ BUG_ON(!task_autonuma);
+ task_autonuma_reset(task_autonuma);
+ BUG_ON(current->task_autonuma);
+ current->task_autonuma = task_autonuma;
+}
+
+static struct kmem_cache *mm_autonuma_cachep;
+
+int alloc_mm_autonuma(struct mm_struct *mm)
+{
+ int err = 1;
+ struct mm_autonuma *mm_autonuma;
+
+ if (autonuma_impossible())
+ goto no_numa;
+ mm_autonuma = kmem_cache_alloc(mm_autonuma_cachep, GFP_KERNEL);
+ if (!mm_autonuma)
+ goto out;
+ if (autonuma_sched_fork_reset() || !mm->mm_autonuma)
+ mm_autonuma_reset(mm_autonuma);
+ else
+ memcpy(mm_autonuma, mm->mm_autonuma, mm_autonuma_size());
+ mm->mm_autonuma = mm_autonuma;
+ mm_autonuma->mm = mm;
+no_numa:
+ err = 0;
+out:
+ return err;
+}
+
+void free_mm_autonuma(struct mm_struct *mm)
+{
+ if (autonuma_impossible()) {
+ BUG_ON(mm->mm_autonuma);
+ return;
+ }
+
+ BUG_ON(!mm->mm_autonuma);
+ kmem_cache_free(mm_autonuma_cachep, mm->mm_autonuma);
+ mm->mm_autonuma = NULL;
+}
+
+void __init mm_autonuma_init(void)
+{
+ BUG_ON(current != &init_task);
+ BUG_ON(current->mm);
+
+ if (autonuma_impossible())
+ return;
+
+ mm_autonuma_cachep =
+ kmem_cache_create("mm_autonuma",
+ mm_autonuma_size(), 0,
+ SLAB_PANIC | SLAB_HWCACHE_ALIGN, NULL);
+}
Until THP native migration is implemented it's safer to boost the
khugepaged scanning rate, because every memory migration splits the
hugepages, so the regular scanning rate becomes too low when lots of
memory is being migrated.
Signed-off-by: Andrea Arcangeli <[email protected]>
---
mm/huge_memory.c | 8 ++++++++
1 files changed, 8 insertions(+), 0 deletions(-)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 4fcdaf7..bcaa8ac 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -573,6 +573,14 @@ static int __init hugepage_init(void)
set_recommended_min_free_kbytes();
+#ifdef CONFIG_AUTONUMA
+ /* Hack, remove after THP native migration */
+ if (num_possible_nodes() > 1) {
+ khugepaged_scan_sleep_millisecs = 100;
+ khugepaged_alloc_sleep_millisecs = 10000;
+ }
+#endif
+
return 0;
out:
hugepage_exit_sysfs(hugepage_kobj);
These flags are the ones tweaked through sysfs; they control the
behavior of AutoNUMA, from enabling/disabling it to selecting various
runtime options.
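As a rough guide to how these bits end up exposed (a sketch only; the
authoritative wiring is the sysfs code in mm/autonuma.c earlier in the
series), each flag is one bit in the single autonuma_flags word and
each sysfs file simply stores 0 or 1 into its bit:

/*
 * Userspace sketch of the flag scheme. The sysfs paths in the
 * comments are the ones wired up in mm/autonuma.c; the bit values
 * and helpers here are illustrative:
 *   AUTONUMA_FLAG               -> /sys/kernel/mm/autonuma/enabled
 *   AUTONUMA_DEBUG_FLAG         -> /sys/kernel/mm/autonuma/debug
 *   AUTONUMA_SCAN_PMD_FLAG      -> .../autonuma/knuma_scand/pmd
 *   AUTONUMA_MIGRATE_DEFER_FLAG -> .../autonuma/knuma_migrated/defer
 */
#include <stdio.h>

enum { AUTONUMA_FLAG, AUTONUMA_DEBUG_FLAG, AUTONUMA_SCAN_PMD_FLAG };

static unsigned long flags = 1UL << AUTONUMA_SCAN_PMD_FLAG;	/* default */

static int flag_show(int bit)		/* what the *_show handlers do */
{
	return !!(flags & (1UL << bit));
}

static void flag_store(int bit, int val)	/* what the *_store handlers do */
{
	if (val)
		flags |= 1UL << bit;
	else
		flags &= ~(1UL << bit);
}

int main(void)
{
	flag_store(AUTONUMA_FLAG, 1);	/* echo 1 > .../autonuma/enabled */
	printf("enabled=%d pmd=%d debug=%d\n",
	       flag_show(AUTONUMA_FLAG),
	       flag_show(AUTONUMA_SCAN_PMD_FLAG),
	       flag_show(AUTONUMA_DEBUG_FLAG));
	return 0;
}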
Signed-off-by: Andrea Arcangeli <[email protected]>
---
include/linux/autonuma_flags.h | 62 ++++++++++++++++++++++++++++++++++++++++
1 files changed, 62 insertions(+), 0 deletions(-)
create mode 100644 include/linux/autonuma_flags.h
diff --git a/include/linux/autonuma_flags.h b/include/linux/autonuma_flags.h
new file mode 100644
index 0000000..5e29a75
--- /dev/null
+++ b/include/linux/autonuma_flags.h
@@ -0,0 +1,62 @@
+#ifndef _LINUX_AUTONUMA_FLAGS_H
+#define _LINUX_AUTONUMA_FLAGS_H
+
+enum autonuma_flag {
+ AUTONUMA_FLAG,
+ AUTONUMA_IMPOSSIBLE_FLAG,
+ AUTONUMA_DEBUG_FLAG,
+ AUTONUMA_SCHED_LOAD_BALANCE_STRICT_FLAG,
+ AUTONUMA_SCHED_CLONE_RESET_FLAG,
+ AUTONUMA_SCHED_FORK_RESET_FLAG,
+ AUTONUMA_SCAN_PMD_FLAG,
+ AUTONUMA_SCAN_USE_WORKING_SET_FLAG,
+ AUTONUMA_MIGRATE_DEFER_FLAG,
+};
+
+extern unsigned long autonuma_flags;
+
+static inline bool autonuma_enabled(void)
+{
+ return !!test_bit(AUTONUMA_FLAG, &autonuma_flags);
+}
+
+static inline bool autonuma_debug(void)
+{
+ return !!test_bit(AUTONUMA_DEBUG_FLAG, &autonuma_flags);
+}
+
+static inline bool autonuma_sched_load_balance_strict(void)
+{
+ return !!test_bit(AUTONUMA_SCHED_LOAD_BALANCE_STRICT_FLAG,
+ &autonuma_flags);
+}
+
+static inline bool autonuma_sched_clone_reset(void)
+{
+ return !!test_bit(AUTONUMA_SCHED_CLONE_RESET_FLAG,
+ &autonuma_flags);
+}
+
+static inline bool autonuma_sched_fork_reset(void)
+{
+ return !!test_bit(AUTONUMA_SCHED_FORK_RESET_FLAG,
+ &autonuma_flags);
+}
+
+static inline bool autonuma_scan_pmd(void)
+{
+ return !!test_bit(AUTONUMA_SCAN_PMD_FLAG, &autonuma_flags);
+}
+
+static inline bool autonuma_scan_use_working_set(void)
+{
+ return !!test_bit(AUTONUMA_SCAN_USE_WORKING_SET_FLAG,
+ &autonuma_flags);
+}
+
+static inline bool autonuma_migrate_defer(void)
+{
+ return !!test_bit(AUTONUMA_MIGRATE_DEFER_FLAG, &autonuma_flags);
+}
+
+#endif /* _LINUX_AUTONUMA_FLAGS_H */
Add the config options to allow building the kernel with AutoNUMA.
If CONFIG_AUTONUMA_DEFAULT_ENABLED is "=y", then
/sys/kernel/mm/autonuma/enabled will be equal to 1, and AutoNUMA will
be enabled automatically at boot.
CONFIG_AUTONUMA currently depends on X86, because no other arch
implements pte_numa/pmd_numa yet and selecting =y elsewhere would
result in a build failure, but this will be relaxed in the future.
Porting AutoNUMA to other archs should be pretty simple.
Signed-off-by: Andrea Arcangeli <[email protected]>
---
mm/Kconfig | 13 +++++++++++++
1 files changed, 13 insertions(+), 0 deletions(-)
diff --git a/mm/Kconfig b/mm/Kconfig
index 82fed4e..330dd51 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -207,6 +207,19 @@ config MIGRATION
pages as migration can relocate pages to satisfy a huge page
allocation instead of reclaiming.
+config AUTONUMA
+ bool "Auto NUMA"
+ select MIGRATION
+ depends on NUMA && X86
+ help
+ Automatic NUMA CPU scheduling and memory migration.
+
+config AUTONUMA_DEFAULT_ENABLED
+ bool "Auto NUMA default enabled"
+ depends on AUTONUMA
+ help
+ Automatic NUMA CPU scheduling and memory migration enabled at boot.
+
config PHYS_ADDR_T_64BIT
def_bool 64BIT || ARCH_PHYS_ADDR_T_64BIT
This is where the numa hinting page faults are detected and are passed
over to the AutoNUMA core logic.
Signed-off-by: Andrea Arcangeli <[email protected]>
---
include/linux/huge_mm.h | 2 ++
mm/huge_memory.c | 17 +++++++++++++++++
mm/memory.c | 31 +++++++++++++++++++++++++++++++
3 files changed, 50 insertions(+), 0 deletions(-)
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index ad4e2e0..5270c81 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -11,6 +11,8 @@ extern int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
extern int do_huge_pmd_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
unsigned long address, pmd_t *pmd,
pmd_t orig_pmd);
+extern pmd_t __huge_pmd_numa_fixup(struct mm_struct *mm, unsigned long addr,
+ pmd_t pmd, pmd_t *pmdp);
extern pgtable_t get_pmd_huge_pte(struct mm_struct *mm);
extern struct page *follow_trans_huge_pmd(struct mm_struct *mm,
unsigned long addr,
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index ae20409..4fcdaf7 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1037,6 +1037,23 @@ out:
return page;
}
+#ifdef CONFIG_AUTONUMA
+pmd_t __huge_pmd_numa_fixup(struct mm_struct *mm, unsigned long addr,
+ pmd_t pmd, pmd_t *pmdp)
+{
+ spin_lock(&mm->page_table_lock);
+ if (pmd_same(pmd, *pmdp)) {
+ struct page *page = pmd_page(pmd);
+ pmd = pmd_mknotnuma(pmd);
+ set_pmd_at(mm, addr & HPAGE_PMD_MASK, pmdp, pmd);
+ numa_hinting_fault(page, HPAGE_PMD_NR);
+ VM_BUG_ON(pmd_numa(pmd));
+ }
+ spin_unlock(&mm->page_table_lock);
+ return pmd;
+}
+#endif
+
int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
pmd_t *pmd, unsigned long addr)
{
diff --git a/mm/memory.c b/mm/memory.c
index 78b6acc..d72aafd 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -57,6 +57,7 @@
#include <linux/swapops.h>
#include <linux/elf.h>
#include <linux/gfp.h>
+#include <linux/autonuma.h>
#include <asm/io.h>
#include <asm/pgalloc.h>
@@ -3406,6 +3407,31 @@ static int do_nonlinear_fault(struct mm_struct *mm, struct vm_area_struct *vma,
return __do_fault(mm, vma, address, pmd, pgoff, flags, orig_pte);
}
+static inline pte_t pte_numa_fixup(struct mm_struct *mm,
+ struct vm_area_struct *vma,
+ unsigned long addr, pte_t pte, pte_t *ptep)
+{
+ if (pte_numa(pte))
+ pte = __pte_numa_fixup(mm, vma, addr, pte, ptep);
+ return pte;
+}
+
+static inline void pmd_numa_fixup(struct mm_struct *mm,
+ unsigned long addr, pmd_t *pmd)
+{
+ if (pmd_numa(*pmd))
+ __pmd_numa_fixup(mm, addr, pmd);
+}
+
+static inline pmd_t huge_pmd_numa_fixup(struct mm_struct *mm,
+ unsigned long addr,
+ pmd_t pmd, pmd_t *pmdp)
+{
+ if (pmd_numa(pmd))
+ pmd = __huge_pmd_numa_fixup(mm, addr, pmd, pmdp);
+ return pmd;
+}
+
/*
* These routines also need to handle stuff like marking pages dirty
* and/or accessed for architectures that don't do it in hardware (most
@@ -3448,6 +3474,7 @@ int handle_pte_fault(struct mm_struct *mm,
spin_lock(ptl);
if (unlikely(!pte_same(*pte, entry)))
goto unlock;
+ entry = pte_numa_fixup(mm, vma, address, entry, pte);
if (flags & FAULT_FLAG_WRITE) {
if (!pte_write(entry))
return do_wp_page(mm, vma, address,
@@ -3512,6 +3539,8 @@ retry:
barrier();
if (pmd_trans_huge(orig_pmd)) {
+ orig_pmd = huge_pmd_numa_fixup(mm, address,
+ orig_pmd, pmd);
if (flags & FAULT_FLAG_WRITE &&
!pmd_write(orig_pmd) &&
!pmd_trans_splitting(orig_pmd)) {
@@ -3530,6 +3559,8 @@ retry:
}
}
+ pmd_numa_fixup(mm, address, pmd);
+
/*
* Use __pte_alloc instead of pte_alloc_map, because we can't
* run pte_offset_map on the pmd, if an huge pmd could
sparse (make C=1) warns about lookup_page_autonuma not being
declared. That's a false positive, but we can silence it by being less
strict in the includes.
Signed-off-by: Andrea Arcangeli <[email protected]>
---
mm/page_autonuma.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)
diff --git a/mm/page_autonuma.c b/mm/page_autonuma.c
index bace9b8..2468c9e 100644
--- a/mm/page_autonuma.c
+++ b/mm/page_autonuma.c
@@ -1,6 +1,6 @@
#include <linux/mm.h>
#include <linux/memory.h>
-#include <linux/autonuma_flags.h>
+#include <linux/autonuma.h>
#include <linux/page_autonuma.h>
#include <linux/bootmem.h>
Initialize the knuma_migrated queues at boot time.
Signed-off-by: Andrea Arcangeli <[email protected]>
---
mm/page_alloc.c | 11 +++++++++++
1 files changed, 11 insertions(+), 0 deletions(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index a9710a4..48eabe9 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -59,6 +59,7 @@
#include <linux/prefetch.h>
#include <linux/migrate.h>
#include <linux/page-debug-flags.h>
+#include <linux/autonuma.h>
#include <asm/tlbflush.h>
#include <asm/div64.h>
@@ -4348,8 +4349,18 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat,
int nid = pgdat->node_id;
unsigned long zone_start_pfn = pgdat->node_start_pfn;
int ret;
+#ifdef CONFIG_AUTONUMA
+ int node_iter;
+#endif
pgdat_resize_init(pgdat);
+#ifdef CONFIG_AUTONUMA
+ spin_lock_init(&pgdat->autonuma_lock);
+ init_waitqueue_head(&pgdat->autonuma_knuma_migrated_wait);
+ pgdat->autonuma_nr_migrate_pages = 0;
+ for_each_node(node_iter)
+ INIT_LIST_HEAD(&pgdat->autonuma_migrate_head[node_iter]);
+#endif
pgdat->nr_zones = 0;
init_waitqueue_head(&pgdat->kswapd_wait);
pgdat->kswapd_max_order = 0;
Fix to avoid -1 retval.
Includes fixes from Hillf Danton <[email protected]>.
Signed-off-by: Andrea Arcangeli <[email protected]>
---
kernel/sched/fair.c | 4 ++++
1 files changed, 4 insertions(+), 0 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c099cc6..fa96810 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2789,6 +2789,9 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags)
if (new_cpu == -1 || new_cpu == cpu) {
/* Now try balancing at a lower domain level of cpu */
sd = sd->child;
+ if (new_cpu < 0)
+ /* Return prev_cpu if find_idlest_cpu failed */
+ new_cpu = prev_cpu;
continue;
}
@@ -2807,6 +2810,7 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags)
unlock:
rcu_read_unlock();
+ BUG_ON(new_cpu < 0);
return new_cpu;
}
#endif /* CONFIG_SMP */
Without this, follow_page wouldn't trigger the NUMA hinting faults.
Signed-off-by: Andrea Arcangeli <[email protected]>
---
mm/memory.c | 4 ++--
1 files changed, 2 insertions(+), 2 deletions(-)
diff --git a/mm/memory.c b/mm/memory.c
index 2e9cab2..78b6acc 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1491,7 +1491,7 @@ struct page *follow_page(struct vm_area_struct *vma, unsigned long address,
goto no_page_table;
pmd = pmd_offset(pud, address);
- if (pmd_none(*pmd))
+ if (pmd_none(*pmd) || pmd_numa(*pmd))
goto no_page_table;
if (pmd_huge(*pmd) && vma->vm_flags & VM_HUGETLB) {
BUG_ON(flags & FOLL_GET);
@@ -1525,7 +1525,7 @@ split_fallthrough:
ptep = pte_offset_map_lock(mm, pmd, address, &ptl);
pte = *ptep;
- if (!pte_present(pte))
+ if (!pte_present(pte) || pte_numa(pte))
goto no_page;
if ((flags & FOLL_WRITE) && !pte_write(pte))
goto unlock;
Implement pte_numa and pmd_numa and related methods on x86 arch.
We must atomically set the numa bit and clear the present bit to
define a pte_numa or pmd_numa.
Once a pte or pmd is set as pte_numa or pmd_numa, the first time a
thread touches that virtual address a NUMA hinting page fault
triggers. The NUMA hinting page fault simply clears the NUMA bit and
sets the present bit again to resolve the page fault.
NUMA hinting page faults are used:
1) to fill in the per-thread NUMA statistics stored for each thread in
a current->sched_autonuma data structure
2) to track the per-node last_nid information in the page structure to
detect false sharing
3) to queue the page mapped by the pte_numa or pmd_numa for async
migration if there have been enough NUMA hinting page faults on the
page coming from remote CPUs
NUMA hinting page faults don't do anything except collecting
information and possibly adding pages to migrate queues. They're
extremely quick and absolutely non blocking. They don't allocate any
memory either.
The only "input" information of the AutoNUMA algorithm that isn't
collected through NUMA hinting page faults are the per-process
(per-thread not) mm->mm_autonuma statistics. Those mm_autonuma
statistics are collected by the knuma_scand pmd/pte scans that are
also responsible for setting the pte_numa/pmd_numa to activate the
NUMA hinting page faults.
knuma_scand -> NUMA hinting page faults
| |
\|/ \|/
mm_autonuma <-> sched_autonuma (CPU follow memory, this is mm_autonuma too)
page last_nid (false thread sharing/thread shared memory detection )
queue or cancel page migration (memory follow CPU)
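To see why the per-thread statistics react quickly to a change in the
access pattern, here is a standalone sketch of the aging they undergo:
the counters are halved the first time a hinting fault hits in a new
knuma_scand pass (mirroring cpu_follow_memory_pass() in mm/autonuma.c);
the numbers below are made up:

/* Userspace sketch of the per-thread fault statistics aging. */
#include <stdio.h>

#define NR_NODES 2

static unsigned long task_numa_fault[NR_NODES];
static unsigned long task_numa_fault_tot;

static void new_pass(void)		/* halving done at each new pass */
{
	for (int nid = 0; nid < NR_NODES; nid++)
		task_numa_fault[nid] >>= 1;
	task_numa_fault_tot >>= 1;
}

static void record_fault(int access_nid, int numpages)
{
	task_numa_fault[access_nid] += numpages;
	task_numa_fault_tot += numpages;
}

int main(void)
{
	record_fault(0, 512);		/* old pass: node 0 looked hot */
	new_pass();
	record_fault(1, 512);		/* new pass: node 1 is hot now */
	printf("node0=%lu node1=%lu tot=%lu\n",
	       task_numa_fault[0], task_numa_fault[1], task_numa_fault_tot);
	/* node1 (512) now outweighs node0 (256): the decay makes the
	 * scheduler side follow the most recent access pattern. */
	return 0;
}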
After pages are queued, there is one knuma_migratedN daemon per NUMA
node that takes care of migrating the pages at a steady rate, in
parallel from all nodes and in round robin from all incoming nodes
going to the same destination node. This keeps all memory channels of
large boxes active at the same time and avoids hammering a single
memory channel for too long, minimizing memory bus migration latency
effects.
Once pages are queued for async migration by knuma_migratedN, their
migration can still be canceled before they're actually migrated, if
false sharing is later detected.
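For illustration, here is a minimal userspace model of the encoding
described above. The flag values are invented for the model; only the
invariants (NUMA bit set and present bit clear while the fault is
armed, both reversed when it's resolved) mirror the pte_mknuma(),
pte_mknotnuma() and pte_numa() helpers added below:

/* Userspace model of the pte_numa encoding; not the real x86 bits. */
#include <stdio.h>

#define _PAGE_PRESENT	0x001UL		/* illustrative values only */
#define _PAGE_ACCESSED	0x020UL
#define _PAGE_NUMA_PTE	0x100UL

typedef unsigned long pte_t;

static pte_t pte_mknuma(pte_t pte)
{
	pte |= _PAGE_NUMA_PTE;
	return pte & ~_PAGE_PRESENT;
}

static pte_t pte_mknotnuma(pte_t pte)
{
	pte &= ~_PAGE_NUMA_PTE;
	return pte | _PAGE_PRESENT | _PAGE_ACCESSED;
}

static int pte_numa(pte_t pte)
{
	return (pte & (_PAGE_NUMA_PTE | _PAGE_PRESENT)) == _PAGE_NUMA_PTE;
}

int main(void)
{
	pte_t pte = _PAGE_PRESENT | _PAGE_ACCESSED;	/* a mapped pte */

	pte = pte_mknuma(pte);		/* knuma_scand arms the fault */
	printf("armed:    numa=%d present=%d\n",
	       pte_numa(pte), !!(pte & _PAGE_PRESENT));

	pte = pte_mknotnuma(pte);	/* the hinting fault resolves it */
	printf("resolved: numa=%d present=%d\n",
	       pte_numa(pte), !!(pte & _PAGE_PRESENT));
	return 0;
}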
Signed-off-by: Andrea Arcangeli <[email protected]>
---
arch/x86/include/asm/pgtable.h | 51 +++++++++++++++++++++++++++++++++++++--
1 files changed, 48 insertions(+), 3 deletions(-)
diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 49afb3f..7514fa6 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -109,7 +109,7 @@ static inline int pte_write(pte_t pte)
static inline int pte_file(pte_t pte)
{
- return pte_flags(pte) & _PAGE_FILE;
+ return (pte_flags(pte) & _PAGE_FILE) == _PAGE_FILE;
}
static inline int pte_huge(pte_t pte)
@@ -405,7 +405,9 @@ static inline int pte_same(pte_t a, pte_t b)
static inline int pte_present(pte_t a)
{
- return pte_flags(a) & (_PAGE_PRESENT | _PAGE_PROTNONE);
+ /* _PAGE_NUMA includes _PAGE_PROTNONE */
+ return pte_flags(a) & (_PAGE_PRESENT | _PAGE_PROTNONE |
+ _PAGE_NUMA_PTE);
}
static inline int pte_hidden(pte_t pte)
@@ -415,7 +417,46 @@ static inline int pte_hidden(pte_t pte)
static inline int pmd_present(pmd_t pmd)
{
- return pmd_flags(pmd) & _PAGE_PRESENT;
+ return pmd_flags(pmd) & (_PAGE_PRESENT | _PAGE_PROTNONE |
+ _PAGE_NUMA_PMD);
+}
+
+#ifdef CONFIG_AUTONUMA
+static inline int pte_numa(pte_t pte)
+{
+ return (pte_flags(pte) &
+ (_PAGE_NUMA_PTE|_PAGE_PRESENT)) == _PAGE_NUMA_PTE;
+}
+
+static inline int pmd_numa(pmd_t pmd)
+{
+ return (pmd_flags(pmd) &
+ (_PAGE_NUMA_PMD|_PAGE_PRESENT)) == _PAGE_NUMA_PMD;
+}
+#endif
+
+static inline pte_t pte_mknotnuma(pte_t pte)
+{
+ pte = pte_clear_flags(pte, _PAGE_NUMA_PTE);
+ return pte_set_flags(pte, _PAGE_PRESENT|_PAGE_ACCESSED);
+}
+
+static inline pmd_t pmd_mknotnuma(pmd_t pmd)
+{
+ pmd = pmd_clear_flags(pmd, _PAGE_NUMA_PMD);
+ return pmd_set_flags(pmd, _PAGE_PRESENT|_PAGE_ACCESSED);
+}
+
+static inline pte_t pte_mknuma(pte_t pte)
+{
+ pte = pte_set_flags(pte, _PAGE_NUMA_PTE);
+ return pte_clear_flags(pte, _PAGE_PRESENT);
+}
+
+static inline pmd_t pmd_mknuma(pmd_t pmd)
+{
+ pmd = pmd_set_flags(pmd, _PAGE_NUMA_PMD);
+ return pmd_clear_flags(pmd, _PAGE_PRESENT);
}
static inline int pmd_none(pmd_t pmd)
@@ -474,6 +515,10 @@ static inline pte_t *pte_offset_kernel(pmd_t *pmd, unsigned long address)
static inline int pmd_bad(pmd_t pmd)
{
+#ifdef CONFIG_AUTONUMA
+ if (pmd_numa(pmd))
+ return 0;
+#endif
return (pmd_flags(pmd) & ~_PAGE_USER) != _KERNPG_TABLE;
}
gup_fast will skip over non-present ptes (pte_numa requires the pte to
be non-present), so no explicit check is needed for pte_numa in the
pte case.
gup_fast will also automatically skip over THP when the trans huge pmd
is non-present (pmd_numa requires the pmd to be non-present).
But in the special pmd-mode scan of knuma_scand
(/sys/kernel/mm/autonuma/knuma_scand/pmd == 1), the pmd may be of NUMA
type (so non-present too) while the ptes under it are present, and
gup_pte_range wouldn't notice that the pmd is of NUMA type. So, to
avoid losing a NUMA hinting page fault with gup_fast, we need an
explicit check for pmd_numa() here to be sure it will fault through
gup -> handle_mm_fault.
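A toy model of the two scan modes may help visualize the reasoning;
the structures and helpers below are invented for illustration, only
the conclusion (the pte-only walk misses the NUMA mark when it lives
in the pmd) mirrors the patch:

/* Toy model: what a pte-only walker (like gup_pte_range) can observe. */
#include <stdio.h>
#include <stdbool.h>

struct toy_pmd {
	bool numa;		/* pmd_numa(): set in pmd scan mode */
	bool pte_numa[4];	/* pte_numa(): set in pte scan mode */
	bool pte_present[4];
};

static bool pte_walk_aborts(struct toy_pmd *pmd, int i)
{
	/* a pte-only walker bails out on non-present or numa ptes,
	 * so in pte scan mode the hinting fault is never lost */
	return pmd->pte_numa[i] || !pmd->pte_present[i];
}

int main(void)
{
	struct toy_pmd pte_mode = {
		.pte_numa = { true, true, true, true },
		.pte_present = { false, false, false, false },
	};
	struct toy_pmd pmd_mode = {
		.numa = true,
		.pte_present = { true, true, true, true },
	};

	printf("pte scan mode: pte walk aborts? %d\n",
	       pte_walk_aborts(&pte_mode, 0));
	/* in pmd scan mode the ptes are still present, the pte-only
	 * walk succeeds and the hinting fault would be lost unless
	 * the walker also checks pmd->numa, as the patch does */
	printf("pmd scan mode: pte walk aborts? %d  pmd numa? %d\n",
	       pte_walk_aborts(&pmd_mode, 0), pmd_mode.numa);
	return 0;
}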
Signed-off-by: Andrea Arcangeli <[email protected]>
---
arch/x86/mm/gup.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)
diff --git a/arch/x86/mm/gup.c b/arch/x86/mm/gup.c
index dd74e46..bf36575 100644
--- a/arch/x86/mm/gup.c
+++ b/arch/x86/mm/gup.c
@@ -164,7 +164,7 @@ static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end,
* wait_split_huge_page() would never return as the
* tlb flush IPI wouldn't run.
*/
- if (pmd_none(pmd) || pmd_trans_splitting(pmd))
+ if (pmd_none(pmd) || pmd_trans_splitting(pmd) || pmd_numa(pmd))
return 0;
if (unlikely(pmd_large(pmd))) {
if (!gup_huge_pmd(pmd, addr, next, write, pages, nr))
When pages are collapsed, try to keep the last_nid information from
one of the original pages.
Signed-off-by: Andrea Arcangeli <[email protected]>
---
mm/huge_memory.c | 11 +++++++++++
1 files changed, 11 insertions(+), 0 deletions(-)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 094f82b..ae20409 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1814,7 +1814,18 @@ static bool __collapse_huge_page_copy(pte_t *pte, struct page *page,
clear_user_highpage(page, address);
add_mm_counter(vma->vm_mm, MM_ANONPAGES, 1);
} else {
+#ifdef CONFIG_AUTONUMA
+ int autonuma_last_nid;
+#endif
src_page = pte_page(pteval);
+#ifdef CONFIG_AUTONUMA
+ /* pick the last one, better than nothing */
+ autonuma_last_nid =
+ ACCESS_ONCE(src_page->autonuma_last_nid);
+ if (autonuma_last_nid >= 0)
+ ACCESS_ONCE(page->autonuma_last_nid) =
+ autonuma_last_nid;
+#endif
copy_user_highpage(page, src_page, address, vma);
VM_BUG_ON(page_mapcount(src_page) != 1);
VM_BUG_ON(page_count(src_page) != 2);
Move the AutoNUMA per page information from the "struct page" to a
separate page_autonuma data structure allocated in the memsection
(with sparsemem) or in the pgdat (with flatmem).
This is done to avoid growing the size of the "struct page": the
page_autonuma data is only allocated if the kernel has been booted on
real NUMA hardware and "noautonuma" has not been passed as a kernel
parameter.
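As a rough sketch of the lookup idea in the flatmem case (the real
lookup_page_autonuma() in mm/page_autonuma.c also handles sparsemem
and differs in detail; names and layout here are illustrative):

/* Userspace sketch: one page_autonuma entry per pfn in a per-node
 * array, found by offsetting with (pfn - node_start_pfn). */
#include <stdio.h>
#include <stdlib.h>

struct page_autonuma {
	short autonuma_migrate_nid;
	short autonuma_last_nid;
	/* the migrate list linkage is omitted in this model */
};

struct model_pgdat {
	unsigned long node_start_pfn;
	unsigned long node_spanned_pages;
	struct page_autonuma *node_page_autonuma;
};

static struct page_autonuma *lookup(struct model_pgdat *pgdat,
				    unsigned long pfn)
{
	return pgdat->node_page_autonuma + (pfn - pgdat->node_start_pfn);
}

int main(void)
{
	struct model_pgdat pgdat = {
		.node_start_pfn = 0x1000,
		.node_spanned_pages = 16,
	};
	/* the real init sets the nids to -1; zero is fine for the toy */
	pgdat.node_page_autonuma = calloc(pgdat.node_spanned_pages,
					  sizeof(struct page_autonuma));

	lookup(&pgdat, 0x1003)->autonuma_last_nid = 1;
	printf("pfn 0x1003 last_nid=%d\n",
	       lookup(&pgdat, 0x1003)->autonuma_last_nid);
	free(pgdat.node_page_autonuma);
	return 0;
}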
Signed-off-by: Andrea Arcangeli <[email protected]>
---
include/linux/autonuma.h | 18 +++-
include/linux/autonuma_flags.h | 6 +
include/linux/autonuma_types.h | 55 ++++++++++
include/linux/mm_types.h | 26 -----
include/linux/mmzone.h | 14 +++-
include/linux/page_autonuma.h | 53 +++++++++
init/main.c | 2 +
mm/Makefile | 2 +-
mm/autonuma.c | 98 ++++++++++-------
mm/huge_memory.c | 26 +++--
mm/page_alloc.c | 21 +---
mm/page_autonuma.c | 234 ++++++++++++++++++++++++++++++++++++++++
mm/sparse.c | 126 ++++++++++++++++++++-
13 files changed, 577 insertions(+), 104 deletions(-)
create mode 100644 include/linux/page_autonuma.h
create mode 100644 mm/page_autonuma.c
diff --git a/include/linux/autonuma.h b/include/linux/autonuma.h
index 85ca5eb..67af86a 100644
--- a/include/linux/autonuma.h
+++ b/include/linux/autonuma.h
@@ -7,15 +7,26 @@
extern void autonuma_enter(struct mm_struct *mm);
extern void autonuma_exit(struct mm_struct *mm);
-extern void __autonuma_migrate_page_remove(struct page *page);
+extern void __autonuma_migrate_page_remove(struct page *,
+ struct page_autonuma *);
extern void autonuma_migrate_split_huge_page(struct page *page,
struct page *page_tail);
extern void autonuma_setup_new_exec(struct task_struct *p);
+extern struct page_autonuma *lookup_page_autonuma(struct page *page);
static inline void autonuma_migrate_page_remove(struct page *page)
{
- if (ACCESS_ONCE(page->autonuma_migrate_nid) >= 0)
- __autonuma_migrate_page_remove(page);
+ struct page_autonuma *page_autonuma = lookup_page_autonuma(page);
+ if (ACCESS_ONCE(page_autonuma->autonuma_migrate_nid) >= 0)
+ __autonuma_migrate_page_remove(page, page_autonuma);
+}
+
+static inline void autonuma_free_page(struct page *page)
+{
+ if (!autonuma_impossible()) {
+ autonuma_migrate_page_remove(page);
+ lookup_page_autonuma(page)->autonuma_last_nid = -1;
+ }
}
#define autonuma_printk(format, args...) \
@@ -29,6 +40,7 @@ static inline void autonuma_migrate_page_remove(struct page *page) {}
static inline void autonuma_migrate_split_huge_page(struct page *page,
struct page *page_tail) {}
static inline void autonuma_setup_new_exec(struct task_struct *p) {}
+static inline void autonuma_free_page(struct page *page) {}
#endif /* CONFIG_AUTONUMA */
diff --git a/include/linux/autonuma_flags.h b/include/linux/autonuma_flags.h
index 5e29a75..035d993 100644
--- a/include/linux/autonuma_flags.h
+++ b/include/linux/autonuma_flags.h
@@ -15,6 +15,12 @@ enum autonuma_flag {
extern unsigned long autonuma_flags;
+static inline bool autonuma_impossible(void)
+{
+ return num_possible_nodes() <= 1 ||
+ test_bit(AUTONUMA_IMPOSSIBLE_FLAG, &autonuma_flags);
+}
+
static inline bool autonuma_enabled(void)
{
return !!test_bit(AUTONUMA_FLAG, &autonuma_flags);
diff --git a/include/linux/autonuma_types.h b/include/linux/autonuma_types.h
index 9e697e3..1e860f6 100644
--- a/include/linux/autonuma_types.h
+++ b/include/linux/autonuma_types.h
@@ -39,6 +39,61 @@ struct task_autonuma {
unsigned long task_numa_fault[0];
};
+/*
+ * Per page (or per-pageblock) structure dynamically allocated only if
+ * autonuma is not impossible.
+ */
+struct page_autonuma {
+ /*
+ * To modify autonuma_last_nid locklessly, the architecture
+ * needs SMP atomic granularity < sizeof(long); not all archs
+ * have that, notably some ancient alphas (but none of those
+ * should run in NUMA systems). Archs without it require
+ * autonuma_last_nid to be a long.
+ */
+#if BITS_PER_LONG > 32
+ /*
+ * autonuma_migrate_nid is -1 if the page_autonuma structure
+ * is not linked into any
+ * pgdat->autonuma_migrate_head. Otherwise it means the
+ * page_autonuma structure is linked into the
+ * &NODE_DATA(autonuma_migrate_nid)->autonuma_migrate_head[page_nid].
+ * page_nid is the nid that the page (referenced by the
+ * page_autonuma structure) belongs to.
+ */
+ int autonuma_migrate_nid;
+ /*
+ * autonuma_last_nid records the NUMA nid that tried to
+ * access this page at the last NUMA hinting page fault.
+ * If it changed, AutoNUMA will not try to migrate the page
+ * to the nid the thread is running on; on the contrary, it
+ * will make different threads thrashing on the same pages
+ * converge on the same NUMA node (if possible).
+ */
+ int autonuma_last_nid;
+#else
+#if MAX_NUMNODES >= 32768
+#error "too many nodes"
+#endif
+ short autonuma_migrate_nid;
+ short autonuma_last_nid;
+#endif
+ /*
+ * This is the list node that links the page (referenced by
+ * the page_autonuma structure) in the
+ * &NODE_DATA(dst_nid)->autonuma_migrate_head[page_nid] lru.
+ */
+ struct list_head autonuma_migrate_node;
+
+ /*
+ * To find the page starting from the autonuma_migrate_node we
+ * need a backlink.
+ *
+ * FIXME: drop it;
+ */
+ struct page *page;
+};
+
extern int alloc_task_autonuma(struct task_struct *tsk,
struct task_struct *orig,
int node);
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index d1248cf..f0c6379 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -136,32 +136,6 @@ struct page {
struct page *first_page; /* Compound tail pages */
};
-#ifdef CONFIG_AUTONUMA
- /*
- * FIXME: move to pgdat section along with the memcg and allocate
- * at runtime only in presence of a numa system.
- */
- /*
- * To modify autonuma_last_nid lockless the architecture,
- * needs SMP atomic granularity < sizeof(long), not all archs
- * have that, notably some ancient alpha (but none of those
- * should run in NUMA systems). Archs without that requires
- * autonuma_last_nid to be a long.
- */
-#if BITS_PER_LONG > 32
- int autonuma_migrate_nid;
- int autonuma_last_nid;
-#else
-#if MAX_NUMNODES >= 32768
-#error "too many nodes"
-#endif
- /* FIXME: remember to check the updates are atomic */
- short autonuma_migrate_nid;
- short autonuma_last_nid;
-#endif
- struct list_head autonuma_migrate_node;
-#endif
-
/*
* On machines where all RAM is mapped into kernel address space,
* we can simply calculate the virtual address. On machines with
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index d53b26a..e66da74 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -698,10 +698,13 @@ typedef struct pglist_data {
int kswapd_max_order;
enum zone_type classzone_idx;
#ifdef CONFIG_AUTONUMA
- spinlock_t autonuma_lock;
+#if !defined(CONFIG_SPARSEMEM)
+ struct page_autonuma *node_page_autonuma;
+#endif
struct list_head autonuma_migrate_head[MAX_NUMNODES];
unsigned long autonuma_nr_migrate_pages;
wait_queue_head_t autonuma_knuma_migrated_wait;
+ spinlock_t autonuma_lock;
#endif
} pg_data_t;
@@ -1064,6 +1067,15 @@ struct mem_section {
* section. (see memcontrol.h/page_cgroup.h about this.)
*/
struct page_cgroup *page_cgroup;
+#endif
+#ifdef CONFIG_AUTONUMA
+ /*
+ * If !SPARSEMEM, pgdat doesn't have page_autonuma pointer. We use
+ * section.
+ */
+ struct page_autonuma *section_page_autonuma;
+#endif
+#if defined(CONFIG_CGROUP_MEM_RES_CTLR) ^ defined(CONFIG_AUTONUMA)
unsigned long pad;
#endif
};
diff --git a/include/linux/page_autonuma.h b/include/linux/page_autonuma.h
new file mode 100644
index 0000000..d748aa2
--- /dev/null
+++ b/include/linux/page_autonuma.h
@@ -0,0 +1,53 @@
+#ifndef _LINUX_PAGE_AUTONUMA_H
+#define _LINUX_PAGE_AUTONUMA_H
+
+#if defined(CONFIG_AUTONUMA) && !defined(CONFIG_SPARSEMEM)
+extern void __init page_autonuma_init_flatmem(void);
+#else
+static inline void __init page_autonuma_init_flatmem(void) {}
+#endif
+
+#ifdef CONFIG_AUTONUMA
+
+#include <linux/autonuma_flags.h>
+
+extern void __meminit page_autonuma_map_init(struct page *page,
+ struct page_autonuma *page_autonuma,
+ int nr_pages);
+
+#ifdef CONFIG_SPARSEMEM
+#define PAGE_AUTONUMA_SIZE (sizeof(struct page_autonuma))
+#define SECTION_PAGE_AUTONUMA_SIZE (PAGE_AUTONUMA_SIZE * \
+ PAGES_PER_SECTION)
+#endif
+
+extern void __meminit pgdat_autonuma_init(struct pglist_data *);
+
+#else /* CONFIG_AUTONUMA */
+
+#ifdef CONFIG_SPARSEMEM
+struct page_autonuma;
+#define PAGE_AUTONUMA_SIZE 0
+#define SECTION_PAGE_AUTONUMA_SIZE 0
+
+#define autonuma_impossible() true
+
+#endif
+
+static inline void pgdat_autonuma_init(struct pglist_data *pgdat) {}
+
+#endif /* CONFIG_AUTONUMA */
+
+#ifdef CONFIG_SPARSEMEM
+extern struct page_autonuma * __meminit __kmalloc_section_page_autonuma(int nid,
+ unsigned long nr_pages);
+extern void __kfree_section_page_autonuma(struct page_autonuma *page_autonuma,
+ unsigned long nr_pages);
+extern void __init sparse_early_page_autonuma_alloc_node(struct page_autonuma **page_autonuma_map,
+ unsigned long pnum_begin,
+ unsigned long pnum_end,
+ unsigned long map_count,
+ int nodeid);
+#endif
+
+#endif /* _LINUX_PAGE_AUTONUMA_H */
diff --git a/init/main.c b/init/main.c
index b5cc0a7..070a377 100644
--- a/init/main.c
+++ b/init/main.c
@@ -68,6 +68,7 @@
#include <linux/shmem_fs.h>
#include <linux/slab.h>
#include <linux/perf_event.h>
+#include <linux/page_autonuma.h>
#include <asm/io.h>
#include <asm/bugs.h>
@@ -455,6 +456,7 @@ static void __init mm_init(void)
* bigger than MAX_ORDER unless SPARSEMEM.
*/
page_cgroup_init_flatmem();
+ page_autonuma_init_flatmem();
mem_init();
kmem_cache_init();
percpu_init_late();
diff --git a/mm/Makefile b/mm/Makefile
index 15900fd..a4d8354 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -33,7 +33,7 @@ obj-$(CONFIG_FRONTSWAP) += frontswap.o
obj-$(CONFIG_HAS_DMA) += dmapool.o
obj-$(CONFIG_HUGETLBFS) += hugetlb.o
obj-$(CONFIG_NUMA) += mempolicy.o
-obj-$(CONFIG_AUTONUMA) += autonuma.o
+obj-$(CONFIG_AUTONUMA) += autonuma.o page_autonuma.o
obj-$(CONFIG_SPARSEMEM) += sparse.o
obj-$(CONFIG_SPARSEMEM_VMEMMAP) += sparse-vmemmap.o
obj-$(CONFIG_SLOB) += slob.o
diff --git a/mm/autonuma.c b/mm/autonuma.c
index f44272b..ec4d492 100644
--- a/mm/autonuma.c
+++ b/mm/autonuma.c
@@ -51,12 +51,6 @@ static struct knumad_scan {
.mm_head = LIST_HEAD_INIT(knumad_scan.mm_head),
};
-static inline bool autonuma_impossible(void)
-{
- return num_possible_nodes() <= 1 ||
- test_bit(AUTONUMA_IMPOSSIBLE_FLAG, &autonuma_flags);
-}
-
static inline void autonuma_migrate_lock(int nid)
{
spin_lock(&NODE_DATA(nid)->autonuma_lock);
@@ -82,54 +76,63 @@ void autonuma_migrate_split_huge_page(struct page *page,
struct page *page_tail)
{
int nid, last_nid;
+ struct page_autonuma *page_autonuma, *page_tail_autonuma;
- nid = page->autonuma_migrate_nid;
+ if (autonuma_impossible())
+ return;
+
+ page_autonuma = lookup_page_autonuma(page);
+ page_tail_autonuma = lookup_page_autonuma(page_tail);
+
+ nid = page_autonuma->autonuma_migrate_nid;
VM_BUG_ON(nid >= MAX_NUMNODES);
VM_BUG_ON(nid < -1);
- VM_BUG_ON(page_tail->autonuma_migrate_nid != -1);
+ VM_BUG_ON(page_tail_autonuma->autonuma_migrate_nid != -1);
if (nid >= 0) {
VM_BUG_ON(page_to_nid(page) != page_to_nid(page_tail));
compound_lock(page_tail);
autonuma_migrate_lock(nid);
- list_add_tail(&page_tail->autonuma_migrate_node,
- &page->autonuma_migrate_node);
+ list_add_tail(&page_tail_autonuma->autonuma_migrate_node,
+ &page_autonuma->autonuma_migrate_node);
autonuma_migrate_unlock(nid);
- page_tail->autonuma_migrate_nid = nid;
+ page_tail_autonuma->autonuma_migrate_nid = nid;
compound_unlock(page_tail);
}
- last_nid = ACCESS_ONCE(page->autonuma_last_nid);
+ last_nid = ACCESS_ONCE(page_autonuma->autonuma_last_nid);
if (last_nid >= 0)
- page_tail->autonuma_last_nid = last_nid;
+ page_tail_autonuma->autonuma_last_nid = last_nid;
}
-void __autonuma_migrate_page_remove(struct page *page)
+void __autonuma_migrate_page_remove(struct page *page,
+ struct page_autonuma *page_autonuma)
{
unsigned long flags;
int nid;
flags = compound_lock_irqsave(page);
- nid = page->autonuma_migrate_nid;
+ nid = page_autonuma->autonuma_migrate_nid;
VM_BUG_ON(nid >= MAX_NUMNODES);
VM_BUG_ON(nid < -1);
if (nid >= 0) {
int numpages = hpage_nr_pages(page);
autonuma_migrate_lock(nid);
- list_del(&page->autonuma_migrate_node);
+ list_del(&page_autonuma->autonuma_migrate_node);
NODE_DATA(nid)->autonuma_nr_migrate_pages -= numpages;
autonuma_migrate_unlock(nid);
- page->autonuma_migrate_nid = -1;
+ page_autonuma->autonuma_migrate_nid = -1;
}
compound_unlock_irqrestore(page, flags);
}
-static void __autonuma_migrate_page_add(struct page *page, int dst_nid,
- int page_nid)
+static void __autonuma_migrate_page_add(struct page *page,
+ struct page_autonuma *page_autonuma,
+ int dst_nid, int page_nid)
{
unsigned long flags;
int nid;
@@ -148,25 +151,25 @@ static void __autonuma_migrate_page_add(struct page *page, int dst_nid,
flags = compound_lock_irqsave(page);
numpages = hpage_nr_pages(page);
- nid = page->autonuma_migrate_nid;
+ nid = page_autonuma->autonuma_migrate_nid;
VM_BUG_ON(nid >= MAX_NUMNODES);
VM_BUG_ON(nid < -1);
if (nid >= 0) {
autonuma_migrate_lock(nid);
- list_del(&page->autonuma_migrate_node);
+ list_del(&page_autonuma->autonuma_migrate_node);
NODE_DATA(nid)->autonuma_nr_migrate_pages -= numpages;
autonuma_migrate_unlock(nid);
}
autonuma_migrate_lock(dst_nid);
- list_add(&page->autonuma_migrate_node,
+ list_add(&page_autonuma->autonuma_migrate_node,
&NODE_DATA(dst_nid)->autonuma_migrate_head[page_nid]);
NODE_DATA(dst_nid)->autonuma_nr_migrate_pages += numpages;
nr_migrate_pages = NODE_DATA(dst_nid)->autonuma_nr_migrate_pages;
autonuma_migrate_unlock(dst_nid);
- page->autonuma_migrate_nid = dst_nid;
+ page_autonuma->autonuma_migrate_nid = dst_nid;
compound_unlock_irqrestore(page, flags);
@@ -182,9 +185,13 @@ static void __autonuma_migrate_page_add(struct page *page, int dst_nid,
static void autonuma_migrate_page_add(struct page *page, int dst_nid,
int page_nid)
{
- int migrate_nid = ACCESS_ONCE(page->autonuma_migrate_nid);
+ int migrate_nid;
+ struct page_autonuma *page_autonuma = lookup_page_autonuma(page);
+
+ migrate_nid = ACCESS_ONCE(page_autonuma->autonuma_migrate_nid);
if (migrate_nid != dst_nid)
- __autonuma_migrate_page_add(page, dst_nid, page_nid);
+ __autonuma_migrate_page_add(page, page_autonuma,
+ dst_nid, page_nid);
}
static bool balance_pgdat(struct pglist_data *pgdat,
@@ -255,23 +262,26 @@ static inline bool last_nid_set(struct task_struct *p,
struct page *page, int cpu_nid)
{
bool ret = true;
- int autonuma_last_nid = ACCESS_ONCE(page->autonuma_last_nid);
+ struct page_autonuma *page_autonuma = lookup_page_autonuma(page);
+ int autonuma_last_nid = ACCESS_ONCE(page_autonuma->autonuma_last_nid);
VM_BUG_ON(cpu_nid < 0);
VM_BUG_ON(cpu_nid >= MAX_NUMNODES);
if (autonuma_last_nid >= 0 && autonuma_last_nid != cpu_nid) {
- int migrate_nid = ACCESS_ONCE(page->autonuma_migrate_nid);
+ int migrate_nid;
+ migrate_nid = ACCESS_ONCE(page_autonuma->autonuma_migrate_nid);
if (migrate_nid >= 0 && migrate_nid != cpu_nid)
- __autonuma_migrate_page_remove(page);
+ __autonuma_migrate_page_remove(page, page_autonuma);
ret = false;
}
if (autonuma_last_nid != cpu_nid)
- ACCESS_ONCE(page->autonuma_last_nid) = cpu_nid;
+ ACCESS_ONCE(page_autonuma->autonuma_last_nid) = cpu_nid;
return ret;
}
static int __page_migrate_nid(struct page *page, int page_nid)
{
- int migrate_nid = ACCESS_ONCE(page->autonuma_migrate_nid);
+ struct page_autonuma *page_autonuma = lookup_page_autonuma(page);
+ int migrate_nid = ACCESS_ONCE(page_autonuma->autonuma_migrate_nid);
if (migrate_nid < 0)
migrate_nid = page_nid;
#if 0
@@ -810,6 +820,7 @@ static int isolate_migratepages(struct list_head *migratepages,
struct zone *zone;
struct page *page;
struct lruvec *lruvec;
+ struct page_autonuma *page_autonuma;
cond_resched();
VM_BUG_ON(numa_node_id() != pgdat->node_id);
@@ -833,16 +844,17 @@ static int isolate_migratepages(struct list_head *migratepages,
autonuma_migrate_unlock_irq(pgdat->node_id);
continue;
}
- page = list_entry(heads[nid].prev,
- struct page,
- autonuma_migrate_node);
+ page_autonuma = list_entry(heads[nid].prev,
+ struct page_autonuma,
+ autonuma_migrate_node);
+ page = page_autonuma->page;
if (unlikely(!get_page_unless_zero(page))) {
/*
* Is getting freed and will remove self from the
* autonuma list shortly, skip it for now.
*/
- list_del(&page->autonuma_migrate_node);
- list_add(&page->autonuma_migrate_node,
+ list_del(&page_autonuma->autonuma_migrate_node);
+ list_add(&page_autonuma->autonuma_migrate_node,
&heads[nid]);
autonuma_migrate_unlock_irq(pgdat->node_id);
autonuma_printk("autonuma migrate page is free\n");
@@ -851,7 +863,7 @@ static int isolate_migratepages(struct list_head *migratepages,
if (!PageLRU(page)) {
autonuma_migrate_unlock_irq(pgdat->node_id);
autonuma_printk("autonuma migrate page not in LRU\n");
- __autonuma_migrate_page_remove(page);
+ __autonuma_migrate_page_remove(page, page_autonuma);
put_page(page);
continue;
}
@@ -871,7 +883,7 @@ static int isolate_migratepages(struct list_head *migratepages,
}
}
- __autonuma_migrate_page_remove(page);
+ __autonuma_migrate_page_remove(page, page_autonuma);
zone = page_zone(page);
spin_lock_irq(&zone->lru_lock);
@@ -917,11 +929,16 @@ static struct page *alloc_migrate_dst_page(struct page *page,
{
int nid = (int) data;
struct page *newpage;
+ struct page_autonuma *page_autonuma, *newpage_autonuma;
newpage = alloc_pages_exact_node(nid,
GFP_HIGHUSER_MOVABLE | GFP_THISNODE,
0);
- if (newpage)
- newpage->autonuma_last_nid = page->autonuma_last_nid;
+ if (newpage) {
+ page_autonuma = lookup_page_autonuma(page);
+ newpage_autonuma = lookup_page_autonuma(newpage);
+ newpage_autonuma->autonuma_last_nid =
+ page_autonuma->autonuma_last_nid;
+ }
return newpage;
}
@@ -1345,7 +1362,8 @@ static int __init noautonuma_setup(char *str)
}
return 1;
}
-__setup("noautonuma", noautonuma_setup);
+/* early so sparse.c also can see it */
+early_param("noautonuma", noautonuma_setup);
static int __init autonuma_init(void)
{
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index bcaa8ac..c5e47bc 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1831,6 +1831,13 @@ static bool __collapse_huge_page_copy(pte_t *pte, struct page *page,
{
pte_t *_pte;
bool mknuma = false;
+#ifdef CONFIG_AUTONUMA
+ struct page_autonuma *src_page_an, *page_an = NULL;
+
+ if (!autonuma_impossible())
+ page_an = lookup_page_autonuma(page);
+#endif
+
for (_pte = pte; _pte < pte+HPAGE_PMD_NR; _pte++) {
pte_t pteval = *_pte;
struct page *src_page;
@@ -1839,17 +1846,18 @@ static bool __collapse_huge_page_copy(pte_t *pte, struct page *page,
clear_user_highpage(page, address);
add_mm_counter(vma->vm_mm, MM_ANONPAGES, 1);
} else {
-#ifdef CONFIG_AUTONUMA
- int autonuma_last_nid;
-#endif
src_page = pte_page(pteval);
#ifdef CONFIG_AUTONUMA
- /* pick the last one, better than nothing */
- autonuma_last_nid =
- ACCESS_ONCE(src_page->autonuma_last_nid);
- if (autonuma_last_nid >= 0)
- ACCESS_ONCE(page->autonuma_last_nid) =
- autonuma_last_nid;
+ if (!autonuma_impossible()) {
+ int autonuma_last_nid;
+ src_page_an = lookup_page_autonuma(src_page);
+ /* pick the last one, better than nothing */
+ autonuma_last_nid =
+ ACCESS_ONCE(src_page_an->autonuma_last_nid);
+ if (autonuma_last_nid >= 0)
+ ACCESS_ONCE(page_an->autonuma_last_nid) =
+ autonuma_last_nid;
+ }
#endif
copy_user_highpage(page, src_page, address, vma);
VM_BUG_ON(page_mapcount(src_page) != 1);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 8c4ae8e..2d53a1f 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -60,6 +60,7 @@
#include <linux/migrate.h>
#include <linux/page-debug-flags.h>
#include <linux/autonuma.h>
+#include <linux/page_autonuma.h>
#include <asm/tlbflush.h>
#include <asm/div64.h>
@@ -615,10 +616,7 @@ static inline int free_pages_check(struct page *page)
bad_page(page);
return 1;
}
- autonuma_migrate_page_remove(page);
-#ifdef CONFIG_AUTONUMA
- page->autonuma_last_nid = -1;
-#endif
+ autonuma_free_page(page);
if (page->flags & PAGE_FLAGS_CHECK_AT_PREP)
page->flags &= ~PAGE_FLAGS_CHECK_AT_PREP;
return 0;
@@ -3729,10 +3727,6 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
set_pageblock_migratetype(page, MIGRATE_MOVABLE);
INIT_LIST_HEAD(&page->lru);
-#ifdef CONFIG_AUTONUMA
- page->autonuma_last_nid = -1;
- page->autonuma_migrate_nid = -1;
-#endif
#ifdef WANT_PAGE_VIRTUAL
/* The shift won't overflow because ZONE_NORMAL is below 4G. */
if (!is_highmem_idx(zone))
@@ -4357,22 +4351,13 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat,
int nid = pgdat->node_id;
unsigned long zone_start_pfn = pgdat->node_start_pfn;
int ret;
-#ifdef CONFIG_AUTONUMA
- int node_iter;
-#endif
pgdat_resize_init(pgdat);
-#ifdef CONFIG_AUTONUMA
- spin_lock_init(&pgdat->autonuma_lock);
- init_waitqueue_head(&pgdat->autonuma_knuma_migrated_wait);
- pgdat->autonuma_nr_migrate_pages = 0;
- for_each_node(node_iter)
- INIT_LIST_HEAD(&pgdat->autonuma_migrate_head[node_iter]);
-#endif
pgdat->nr_zones = 0;
init_waitqueue_head(&pgdat->kswapd_wait);
pgdat->kswapd_max_order = 0;
pgdat_page_cgroup_init(pgdat);
+ pgdat_autonuma_init(pgdat);
for (j = 0; j < MAX_NR_ZONES; j++) {
struct zone *zone = pgdat->node_zones + j;
diff --git a/mm/page_autonuma.c b/mm/page_autonuma.c
new file mode 100644
index 0000000..bace9b8
--- /dev/null
+++ b/mm/page_autonuma.c
@@ -0,0 +1,234 @@
+#include <linux/mm.h>
+#include <linux/memory.h>
+#include <linux/autonuma_flags.h>
+#include <linux/page_autonuma.h>
+#include <linux/bootmem.h>
+
+void __meminit page_autonuma_map_init(struct page *page,
+ struct page_autonuma *page_autonuma,
+ int nr_pages)
+{
+ struct page *end;
+ for (end = page + nr_pages; page < end; page++, page_autonuma++) {
+ page_autonuma->autonuma_last_nid = -1;
+ page_autonuma->autonuma_migrate_nid = -1;
+ page_autonuma->page = page;
+ }
+}
+
+static void __meminit __pgdat_autonuma_init(struct pglist_data *pgdat)
+{
+ int node_iter;
+
+ spin_lock_init(&pgdat->autonuma_lock);
+ init_waitqueue_head(&pgdat->autonuma_knuma_migrated_wait);
+ pgdat->autonuma_nr_migrate_pages = 0;
+ for_each_node(node_iter)
+ INIT_LIST_HEAD(&pgdat->autonuma_migrate_head[node_iter]);
+}
+
+#if !defined(CONFIG_SPARSEMEM)
+
+static unsigned long total_usage;
+
+void __meminit pgdat_autonuma_init(struct pglist_data *pgdat)
+{
+ __pgdat_autonuma_init(pgdat);
+ pgdat->node_page_autonuma = NULL;
+}
+
+struct page_autonuma *lookup_page_autonuma(struct page *page)
+{
+ unsigned long pfn = page_to_pfn(page);
+ unsigned long offset;
+ struct page_autonuma *base;
+
+ base = NODE_DATA(page_to_nid(page))->node_page_autonuma;
+#ifdef CONFIG_DEBUG_VM
+ /*
+ * The sanity checks the page allocator does upon freeing a
+ * page can reach here before the page_autonuma arrays are
+ * allocated when feeding a range of pages to the allocator
+ * for the first time during bootup or memory hotplug.
+ */
+ if (unlikely(!base))
+ return NULL;
+#endif
+ offset = pfn - NODE_DATA(page_to_nid(page))->node_start_pfn;
+ return base + offset;
+}
+
+static int __init alloc_node_page_autonuma(int nid)
+{
+ struct page_autonuma *base;
+ unsigned long table_size;
+ unsigned long nr_pages;
+
+ nr_pages = NODE_DATA(nid)->node_spanned_pages;
+ if (!nr_pages)
+ return 0;
+
+ table_size = sizeof(struct page_autonuma) * nr_pages;
+
+ base = __alloc_bootmem_node_nopanic(NODE_DATA(nid),
+ table_size, PAGE_SIZE, __pa(MAX_DMA_ADDRESS));
+ if (!base)
+ return -ENOMEM;
+ NODE_DATA(nid)->node_page_autonuma = base;
+ total_usage += table_size;
+ page_autonuma_map_init(NODE_DATA(nid)->node_mem_map, base, nr_pages);
+ return 0;
+}
+
+void __init page_autonuma_init_flatmem(void)
+{
+
+ int nid, fail;
+
+ if (autonuma_impossible())
+ return;
+
+ for_each_online_node(nid) {
+ fail = alloc_node_page_autonuma(nid);
+ if (fail)
+ goto fail;
+ }
+ printk(KERN_INFO "allocated %lu KBytes of page_autonuma\n",
+ total_usage >> 10);
+ printk(KERN_INFO "please try the 'noautonuma' option if you"
+ " don't want to allocate page_autonuma memory\n");
+ return;
+fail:
+ printk(KERN_CRIT "allocation of page_autonuma failed.\n");
+ printk(KERN_CRIT "please try the 'noautonuma' boot option\n");
+ panic("Out of memory");
+}
+
+#else /* CONFIG_SPARSEMEM */
+
+struct page_autonuma *lookup_page_autonuma(struct page *page)
+{
+ unsigned long pfn = page_to_pfn(page);
+ struct mem_section *section = __pfn_to_section(pfn);
+
+ /* if it's not a power of two we may be wasting memory */
+ BUILD_BUG_ON(SECTION_PAGE_AUTONUMA_SIZE &
+ (SECTION_PAGE_AUTONUMA_SIZE-1));
+
+#ifdef CONFIG_DEBUG_VM
+ /*
+ * The sanity checks the page allocator does upon freeing a
+ * page can reach here before the page_autonuma arrays are
+ * allocated when feeding a range of pages to the allocator
+ * for the first time during bootup or memory hotplug.
+ */
+ if (!section->section_page_autonuma)
+ return NULL;
+#endif
+ return section->section_page_autonuma + pfn;
+}
+
+void __meminit pgdat_autonuma_init(struct pglist_data *pgdat)
+{
+ __pgdat_autonuma_init(pgdat);
+}
+
+struct page_autonuma * __meminit __kmalloc_section_page_autonuma(int nid,
+ unsigned long nr_pages)
+{
+ struct page_autonuma *ret;
+ struct page *page;
+ unsigned long memmap_size = PAGE_AUTONUMA_SIZE * nr_pages;
+
+ page = alloc_pages_node(nid, GFP_KERNEL|__GFP_NOWARN,
+ get_order(memmap_size));
+ if (page)
+ goto got_map_page_autonuma;
+
+ ret = vmalloc(memmap_size);
+ if (ret)
+ goto out;
+
+ return NULL;
+got_map_page_autonuma:
+ ret = (struct page_autonuma *)pfn_to_kaddr(page_to_pfn(page));
+out:
+ return ret;
+}
+
+void __kfree_section_page_autonuma(struct page_autonuma *page_autonuma,
+ unsigned long nr_pages)
+{
+ if (is_vmalloc_addr(page_autonuma))
+ vfree(page_autonuma);
+ else
+ free_pages((unsigned long)page_autonuma,
+ get_order(PAGE_AUTONUMA_SIZE * nr_pages));
+}
+
+static struct page_autonuma __init *sparse_page_autonuma_map_populate(unsigned long pnum,
+ int nid)
+{
+ struct page_autonuma *map;
+ unsigned long size;
+
+ map = alloc_remap(nid, SECTION_PAGE_AUTONUMA_SIZE);
+ if (map)
+ return map;
+
+ size = PAGE_ALIGN(SECTION_PAGE_AUTONUMA_SIZE);
+ map = __alloc_bootmem_node_high(NODE_DATA(nid), size,
+ PAGE_SIZE, __pa(MAX_DMA_ADDRESS));
+ return map;
+}
+
+void __init sparse_early_page_autonuma_alloc_node(struct page_autonuma **page_autonuma_map,
+ unsigned long pnum_begin,
+ unsigned long pnum_end,
+ unsigned long map_count,
+ int nodeid)
+{
+ void *map;
+ unsigned long pnum;
+ unsigned long size = SECTION_PAGE_AUTONUMA_SIZE;
+
+ map = alloc_remap(nodeid, size * map_count);
+ if (map) {
+ for (pnum = pnum_begin; pnum < pnum_end; pnum++) {
+ if (!present_section_nr(pnum))
+ continue;
+ page_autonuma_map[pnum] = map;
+ map += size;
+ }
+ return;
+ }
+
+ size = PAGE_ALIGN(size);
+ map = __alloc_bootmem_node_high(NODE_DATA(nodeid), size * map_count,
+ PAGE_SIZE, __pa(MAX_DMA_ADDRESS));
+ if (map) {
+ for (pnum = pnum_begin; pnum < pnum_end; pnum++) {
+ if (!present_section_nr(pnum))
+ continue;
+ page_autonuma_map[pnum] = map;
+ map += size;
+ }
+ return;
+ }
+
+ /* fallback */
+ for (pnum = pnum_begin; pnum < pnum_end; pnum++) {
+ struct mem_section *ms;
+
+ if (!present_section_nr(pnum))
+ continue;
+ page_autonuma_map[pnum] = sparse_page_autonuma_map_populate(pnum, nodeid);
+ if (page_autonuma_map[pnum])
+ continue;
+ ms = __nr_to_section(pnum);
+ printk(KERN_ERR "%s: sparsemem page_autonuma map backing failed "
+ "some memory will not be available.\n", __func__);
+ }
+}
+
+#endif
diff --git a/mm/sparse.c b/mm/sparse.c
index 6a4bf91..1eb301e 100644
--- a/mm/sparse.c
+++ b/mm/sparse.c
@@ -9,6 +9,7 @@
#include <linux/export.h>
#include <linux/spinlock.h>
#include <linux/vmalloc.h>
+#include <linux/page_autonuma.h>
#include "internal.h"
#include <asm/dma.h>
#include <asm/pgalloc.h>
@@ -242,7 +243,8 @@ struct page *sparse_decode_mem_map(unsigned long coded_mem_map, unsigned long pn
static int __meminit sparse_init_one_section(struct mem_section *ms,
unsigned long pnum, struct page *mem_map,
- unsigned long *pageblock_bitmap)
+ unsigned long *pageblock_bitmap,
+ struct page_autonuma *page_autonuma)
{
if (!present_section(ms))
return -EINVAL;
@@ -251,6 +253,14 @@ static int __meminit sparse_init_one_section(struct mem_section *ms,
ms->section_mem_map |= sparse_encode_mem_map(mem_map, pnum) |
SECTION_HAS_MEM_MAP;
ms->pageblock_flags = pageblock_bitmap;
+#ifdef CONFIG_AUTONUMA
+ if (page_autonuma) {
+ ms->section_page_autonuma = page_autonuma - section_nr_to_pfn(pnum);
+ page_autonuma_map_init(mem_map, page_autonuma, PAGES_PER_SECTION);
+ }
+#else
+ BUG_ON(page_autonuma);
+#endif
return 1;
}
@@ -484,6 +494,9 @@ void __init sparse_init(void)
int size2;
struct page **map_map;
#endif
+ struct page_autonuma **uninitialized_var(page_autonuma_map);
+ struct page_autonuma *page_autonuma;
+ int size3;
/*
* map is using big page (aka 2M in x86 64 bit)
@@ -578,6 +591,62 @@ void __init sparse_init(void)
map_count, nodeid_begin);
#endif
+ if (!autonuma_impossible()) {
+ unsigned long total_page_autonuma;
+ unsigned long page_autonuma_count;
+
+ size3 = sizeof(struct page_autonuma *) * NR_MEM_SECTIONS;
+ page_autonuma_map = alloc_bootmem(size3);
+ if (!page_autonuma_map)
+ panic("can not allocate page_autonuma_map\n");
+
+ for (pnum = 0; pnum < NR_MEM_SECTIONS; pnum++) {
+ struct mem_section *ms;
+
+ if (!present_section_nr(pnum))
+ continue;
+ ms = __nr_to_section(pnum);
+ nodeid_begin = sparse_early_nid(ms);
+ pnum_begin = pnum;
+ break;
+ }
+ total_page_autonuma = 0;
+ page_autonuma_count = 1;
+ for (pnum = pnum_begin + 1; pnum < NR_MEM_SECTIONS; pnum++) {
+ struct mem_section *ms;
+ int nodeid;
+
+ if (!present_section_nr(pnum))
+ continue;
+ ms = __nr_to_section(pnum);
+ nodeid = sparse_early_nid(ms);
+ if (nodeid == nodeid_begin) {
+ page_autonuma_count++;
+ continue;
+ }
+ /* ok, we need to take care of the range from pnum_begin to pnum - 1 */
+ sparse_early_page_autonuma_alloc_node(page_autonuma_map,
+ pnum_begin,
+ NR_MEM_SECTIONS,
+ page_autonuma_count,
+ nodeid_begin);
+ total_page_autonuma += SECTION_PAGE_AUTONUMA_SIZE * page_autonuma_count;
+ /* new start, update count etc*/
+ nodeid_begin = nodeid;
+ pnum_begin = pnum;
+ page_autonuma_count = 1;
+ }
+ /* ok, last chunk */
+ sparse_early_page_autonuma_alloc_node(page_autonuma_map, pnum_begin,
+ NR_MEM_SECTIONS,
+ page_autonuma_count, nodeid_begin);
+ total_page_autonuma += SECTION_PAGE_AUTONUMA_SIZE * page_autonuma_count;
+ printk("allocated %lu KBytes of page_autonuma\n",
+ total_page_autonuma >> 10);
+ printk(KERN_INFO "please try the 'noautonuma' option if you"
+ " don't want to allocate page_autonuma memory\n");
+ }
+
for (pnum = 0; pnum < NR_MEM_SECTIONS; pnum++) {
if (!present_section_nr(pnum))
continue;
@@ -586,6 +655,14 @@ void __init sparse_init(void)
if (!usemap)
continue;
+ if (autonuma_impossible())
+ page_autonuma = NULL;
+ else {
+ page_autonuma = page_autonuma_map[pnum];
+ if (!page_autonuma)
+ continue;
+ }
+
#ifdef CONFIG_SPARSEMEM_ALLOC_MEM_MAP_TOGETHER
map = map_map[pnum];
#else
@@ -595,11 +672,13 @@ void __init sparse_init(void)
continue;
sparse_init_one_section(__nr_to_section(pnum), pnum, map,
- usemap);
+ usemap, page_autonuma);
}
vmemmap_populate_print_last();
+ if (!autonuma_impossible())
+ free_bootmem(__pa(page_autonuma_map), size3);
#ifdef CONFIG_SPARSEMEM_ALLOC_MEM_MAP_TOGETHER
free_bootmem(__pa(map_map), size2);
#endif
@@ -686,7 +765,8 @@ static void free_map_bootmem(struct page *page, unsigned long nr_pages)
}
#endif /* CONFIG_SPARSEMEM_VMEMMAP */
-static void free_section_usemap(struct page *memmap, unsigned long *usemap)
+static void free_section_usemap(struct page *memmap, unsigned long *usemap,
+ struct page_autonuma *page_autonuma)
{
struct page *usemap_page;
unsigned long nr_pages;
@@ -700,8 +780,14 @@ static void free_section_usemap(struct page *memmap, unsigned long *usemap)
*/
if (PageSlab(usemap_page)) {
kfree(usemap);
- if (memmap)
+ if (memmap) {
__kfree_section_memmap(memmap, PAGES_PER_SECTION);
+ if (!autonuma_impossible())
+ __kfree_section_page_autonuma(page_autonuma,
+ PAGES_PER_SECTION);
+ else
+ BUG_ON(page_autonuma);
+ }
return;
}
@@ -718,6 +804,13 @@ static void free_section_usemap(struct page *memmap, unsigned long *usemap)
>> PAGE_SHIFT;
free_map_bootmem(memmap_page, nr_pages);
+
+ if (!autonuma_impossible()) {
+ struct page *page_autonuma_page;
+ page_autonuma_page = virt_to_page(page_autonuma);
+ free_map_bootmem(page_autonuma_page, nr_pages);
+ } else
+ BUG_ON(page_autonuma);
}
}
@@ -733,6 +826,7 @@ int __meminit sparse_add_one_section(struct zone *zone, unsigned long start_pfn,
struct pglist_data *pgdat = zone->zone_pgdat;
struct mem_section *ms;
struct page *memmap;
+ struct page_autonuma *page_autonuma;
unsigned long *usemap;
unsigned long flags;
int ret;
@@ -752,6 +846,16 @@ int __meminit sparse_add_one_section(struct zone *zone, unsigned long start_pfn,
__kfree_section_memmap(memmap, nr_pages);
return -ENOMEM;
}
+ if (!autonuma_impossible()) {
+ page_autonuma = __kmalloc_section_page_autonuma(pgdat->node_id,
+ nr_pages);
+ if (!page_autonuma) {
+ kfree(usemap);
+ __kfree_section_memmap(memmap, nr_pages);
+ return -ENOMEM;
+ }
+ } else
+ page_autonuma = NULL;
pgdat_resize_lock(pgdat, &flags);
@@ -763,11 +867,16 @@ int __meminit sparse_add_one_section(struct zone *zone, unsigned long start_pfn,
ms->section_mem_map |= SECTION_MARKED_PRESENT;
- ret = sparse_init_one_section(ms, section_nr, memmap, usemap);
+ ret = sparse_init_one_section(ms, section_nr, memmap, usemap,
+ page_autonuma);
out:
pgdat_resize_unlock(pgdat, &flags);
if (ret <= 0) {
+ if (!autonuma_impossible())
+ __kfree_section_page_autonuma(page_autonuma, nr_pages);
+ else
+ BUG_ON(page_autonuma);
kfree(usemap);
__kfree_section_memmap(memmap, nr_pages);
}
@@ -778,6 +887,7 @@ void sparse_remove_one_section(struct zone *zone, struct mem_section *ms)
{
struct page *memmap = NULL;
unsigned long *usemap = NULL;
+ struct page_autonuma *page_autonuma = NULL;
if (ms->section_mem_map) {
usemap = ms->pageblock_flags;
@@ -785,8 +895,12 @@ void sparse_remove_one_section(struct zone *zone, struct mem_section *ms)
__section_nr(ms));
ms->section_mem_map = 0;
ms->pageblock_flags = NULL;
+
+#ifdef CONFIG_AUTONUMA
+ page_autonuma = ms->section_page_autonuma;
+#endif
}
- free_section_usemap(memmap, usemap);
+ free_section_usemap(memmap, usemap, page_autonuma);
}
#endif
Implement generic versions of the pte_numa and pmd_numa methods.
They're used when CONFIG_AUTONUMA=n, and they're a noop (they always
return 0).
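The point of the stubs is that common code can test the NUMA bits
unconditionally; a hedged illustration follows (the caller is
hypothetical, not part of the patch):

    if (pte_numa(pte) || pmd_numa(pmd)) {
        /*
         * NUMA hinting fault handling: with CONFIG_AUTONUMA=n both
         * stubs return 0 and the compiler drops this branch entirely.
         */
    }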
Signed-off-by: Andrea Arcangeli <[email protected]>
---
include/asm-generic/pgtable.h | 12 ++++++++++++
1 files changed, 12 insertions(+), 0 deletions(-)
diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
index ff4947b..0ff87ec 100644
--- a/include/asm-generic/pgtable.h
+++ b/include/asm-generic/pgtable.h
@@ -530,6 +530,18 @@ static inline int pmd_trans_unstable(pmd_t *pmd)
#endif
}
+#ifndef CONFIG_AUTONUMA
+static inline int pte_numa(pte_t pte)
+{
+ return 0;
+}
+
+static inline int pmd_numa(pmd_t pmd)
+{
+ return 0;
+}
+#endif /* CONFIG_AUTONUMA */
+
#endif /* CONFIG_MMU */
#endif /* !__ASSEMBLY__ */
Reduce the autonuma_migrate_head array entries from MAX_NUMNODES to
num_possible_nodes(), or to zero if autonuma_impossible() is true.
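Because autonuma_migrate_head becomes a zero-sized array at the tail
of pg_data_t, every pgdat allocation has to reserve the extra room
explicitly. A hedged sketch of the sizing rule (this is what the
autonuma_pglist_data_size() macro below expands to, and how
generic_alloc_nodedata() uses it):

    /* pg_data_t plus one list head per possible node, or just
     * pg_data_t when AutoNUMA is impossible on this boot */
    size_t size = sizeof(struct pglist_data) +
                  (autonuma_impossible() ? 0 :
                   sizeof(struct list_head) * num_possible_nodes());

    pg_data_t *pgdat = kzalloc(size, GFP_KERNEL);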
Signed-off-by: Andrea Arcangeli <[email protected]>
---
arch/x86/mm/numa.c | 6 ++++--
arch/x86/mm/numa_32.c | 3 ++-
include/linux/memory_hotplug.h | 3 ++-
include/linux/mmzone.h | 8 +++++++-
include/linux/page_autonuma.h | 10 ++++++++--
mm/memory_hotplug.c | 2 +-
mm/page_autonuma.c | 5 +++--
7 files changed, 27 insertions(+), 10 deletions(-)
diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index 2d125be..a4a9e92 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -11,6 +11,7 @@
#include <linux/nodemask.h>
#include <linux/sched.h>
#include <linux/topology.h>
+#include <linux/page_autonuma.h>
#include <asm/e820.h>
#include <asm/proto.h>
@@ -192,7 +193,8 @@ int __init numa_add_memblk(int nid, u64 start, u64 end)
/* Initialize NODE_DATA for a node on the local memory */
static void __init setup_node_data(int nid, u64 start, u64 end)
{
- const size_t nd_size = roundup(sizeof(pg_data_t), PAGE_SIZE);
+ const size_t nd_size = roundup(autonuma_pglist_data_size(),
+ PAGE_SIZE);
bool remapped = false;
u64 nd_pa;
void *nd;
@@ -239,7 +241,7 @@ static void __init setup_node_data(int nid, u64 start, u64 end)
printk(KERN_INFO " NODE_DATA(%d) on node %d\n", nid, tnid);
node_data[nid] = nd;
- memset(NODE_DATA(nid), 0, sizeof(pg_data_t));
+ memset(NODE_DATA(nid), 0, autonuma_pglist_data_size());
NODE_DATA(nid)->node_id = nid;
NODE_DATA(nid)->node_start_pfn = start >> PAGE_SHIFT;
NODE_DATA(nid)->node_spanned_pages = (end - start) >> PAGE_SHIFT;
diff --git a/arch/x86/mm/numa_32.c b/arch/x86/mm/numa_32.c
index 534255a..d32d6cc 100644
--- a/arch/x86/mm/numa_32.c
+++ b/arch/x86/mm/numa_32.c
@@ -25,6 +25,7 @@
#include <linux/bootmem.h>
#include <linux/memblock.h>
#include <linux/module.h>
+#include <linux/page_autonuma.h>
#include "numa_internal.h"
@@ -194,7 +195,7 @@ void __init init_alloc_remap(int nid, u64 start, u64 end)
/* calculate the necessary space aligned to large page size */
size = node_memmap_size_bytes(nid, start_pfn, end_pfn);
- size += ALIGN(sizeof(pg_data_t), PAGE_SIZE);
+ size += ALIGN(autonuma_pglist_data_size(), PAGE_SIZE);
size = ALIGN(size, LARGE_PAGE_BYTES);
/* allocate node memory and the lowmem remap area */
diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index 910550f..76b1840 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -5,6 +5,7 @@
#include <linux/spinlock.h>
#include <linux/notifier.h>
#include <linux/bug.h>
+#include <linux/page_autonuma.h>
struct page;
struct zone;
@@ -130,7 +131,7 @@ extern void arch_refresh_nodedata(int nid, pg_data_t *pgdat);
*/
#define generic_alloc_nodedata(nid) \
({ \
- kzalloc(sizeof(pg_data_t), GFP_KERNEL); \
+ kzalloc(autonuma_pglist_data_size(), GFP_KERNEL); \
})
/*
* This definition is just for error path in node hotadd.
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index e66da74..ed5b0c0 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -701,10 +701,16 @@ typedef struct pglist_data {
#if !defined(CONFIG_SPARSEMEM)
struct page_autonuma *node_page_autonuma;
#endif
- struct list_head autonuma_migrate_head[MAX_NUMNODES];
unsigned long autonuma_nr_migrate_pages;
wait_queue_head_t autonuma_knuma_migrated_wait;
spinlock_t autonuma_lock;
+ /*
+ * Archs supporting AutoNUMA should allocate the pgdat with
+ * size autonuma_pglist_data_size() after including
+ * <linux/page_autonuma.h> and the below field must remain the
+ * last one of this structure.
+ */
+ struct list_head autonuma_migrate_head[0];
#endif
} pg_data_t;
diff --git a/include/linux/page_autonuma.h b/include/linux/page_autonuma.h
index d748aa2..bc7a629 100644
--- a/include/linux/page_autonuma.h
+++ b/include/linux/page_autonuma.h
@@ -10,6 +10,7 @@ static inline void __init page_autonuma_init_flatmem(void) {}
#ifdef CONFIG_AUTONUMA
#include <linux/autonuma_flags.h>
+#include <linux/autonuma_types.h>
extern void __meminit page_autonuma_map_init(struct page *page,
struct page_autonuma *page_autonuma,
@@ -29,11 +30,10 @@ extern void __meminit pgdat_autonuma_init(struct pglist_data *);
struct page_autonuma;
#define PAGE_AUTONUMA_SIZE 0
#define SECTION_PAGE_AUTONUMA_SIZE 0
+#endif
#define autonuma_impossible() true
-#endif
-
static inline void pgdat_autonuma_init(struct pglist_data *pgdat) {}
#endif /* CONFIG_AUTONUMA */
@@ -50,4 +50,10 @@ extern void __init sparse_early_page_autonuma_alloc_node(struct page_autonuma **
int nodeid);
#endif
+/* inline won't work here */
+#define autonuma_pglist_data_size() (sizeof(struct pglist_data) + \
+ (autonuma_impossible() ? 0 : \
+ sizeof(struct list_head) * \
+ num_possible_nodes()))
+
#endif /* _LINUX_PAGE_AUTONUMA_H */
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 0d7e3ec..604995b 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -164,7 +164,7 @@ void register_page_bootmem_info_node(struct pglist_data *pgdat)
struct page *page;
struct zone *zone;
- nr_pages = PAGE_ALIGN(sizeof(struct pglist_data)) >> PAGE_SHIFT;
+ nr_pages = PAGE_ALIGN(autonuma_pglist_data_size()) >> PAGE_SHIFT;
page = virt_to_page(pgdat);
for (i = 0; i < nr_pages; i++, page++)
diff --git a/mm/page_autonuma.c b/mm/page_autonuma.c
index 2468c9e..d7c5e4a 100644
--- a/mm/page_autonuma.c
+++ b/mm/page_autonuma.c
@@ -23,8 +23,9 @@ static void __meminit __pgdat_autonuma_init(struct pglist_data *pgdat)
spin_lock_init(&pgdat->autonuma_lock);
init_waitqueue_head(&pgdat->autonuma_knuma_migrated_wait);
pgdat->autonuma_nr_migrate_pages = 0;
- for_each_node(node_iter)
- INIT_LIST_HEAD(&pgdat->autonuma_migrate_head[node_iter]);
+ if (!autonuma_impossible())
+ for_each_node(node_iter)
+ INIT_LIST_HEAD(&pgdat->autonuma_migrate_head[node_iter]);
}
#if !defined(CONFIG_SPARSEMEM)
This algorithm takes as input the statistical information filled by
knuma_scand (mm->mm_autonuma) and by the NUMA hinting page faults
(p->task_autonuma), evaluates it for the currently scheduled task, and
compares it against every other running process to see if it should
move the current task to another NUMA node.
When the scheduler decides whether the task should be migrated to a
different NUMA node or stay on the same NUMA node, the decision is
stored into p->task_autonuma->autonuma_node. The fair scheduler then
tries to keep the task on that autonuma_node too.
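As a quick worked example of the core comparison (made-up numbers,
weights scaled to AUTONUMA_BALANCE_SCALE = 1000): if the current task
would have w_nid = 800 on a remote node, the task currently running
there has w_other = 300, and the current task only has w_cpu_nid = 200
on its present node, then both conditions of
"w_nid > w_other && w_nid > w_cpu_nid" hold and the candidate weight is
(800 - 300) + (800 - 200) = 1100. The remote CPU with the largest such
weight becomes the migration destination.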
Code includes fixes and cleanups from Hillf Danton <[email protected]>.
Signed-off-by: Andrea Arcangeli <[email protected]>
---
include/linux/autonuma_sched.h | 50 ++++
include/linux/mm_types.h | 5 +
include/linux/sched.h | 3 +
kernel/sched/core.c | 1 +
kernel/sched/numa.c | 586 ++++++++++++++++++++++++++++++++++++++++
kernel/sched/sched.h | 18 ++
6 files changed, 663 insertions(+), 0 deletions(-)
create mode 100644 include/linux/autonuma_sched.h
create mode 100644 kernel/sched/numa.c
diff --git a/include/linux/autonuma_sched.h b/include/linux/autonuma_sched.h
new file mode 100644
index 0000000..aff31d4
--- /dev/null
+++ b/include/linux/autonuma_sched.h
@@ -0,0 +1,50 @@
+#ifndef _LINUX_AUTONUMA_SCHED_H
+#define _LINUX_AUTONUMA_SCHED_H
+
+#ifdef CONFIG_AUTONUMA
+#include <linux/autonuma_flags.h>
+
+extern void sched_autonuma_balance(void);
+extern bool sched_autonuma_can_migrate_task(struct task_struct *p,
+ int numa, int dst_cpu,
+ enum cpu_idle_type idle);
+#else /* CONFIG_AUTONUMA */
+static inline void sched_autonuma_balance(void) {}
+static inline bool sched_autonuma_can_migrate_task(struct task_struct *p,
+ int numa, int dst_cpu,
+ enum cpu_idle_type idle)
+{
+ return true;
+}
+#endif /* CONFIG_AUTONUMA */
+
+static bool inline task_autonuma_cpu(struct task_struct *p, int cpu)
+{
+#ifdef CONFIG_AUTONUMA
+ int autonuma_node;
+ struct task_autonuma *task_autonuma = p->task_autonuma;
+
+ if (!task_autonuma)
+ return true;
+
+ autonuma_node = ACCESS_ONCE(task_autonuma->autonuma_node);
+ if (autonuma_node < 0 || autonuma_node == cpu_to_node(cpu))
+ return true;
+ else
+ return false;
+#else
+ return true;
+#endif
+}
+
+static inline void sched_set_autonuma_need_balance(void)
+{
+#ifdef CONFIG_AUTONUMA
+ struct task_autonuma *ta = current->task_autonuma;
+
+ if (ta && current->mm)
+ sched_autonuma_balance();
+#endif
+}
+
+#endif /* _LINUX_AUTONUMA_SCHED_H */
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 704a626..f0c6379 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -13,6 +13,7 @@
#include <linux/cpumask.h>
#include <linux/page-debug-flags.h>
#include <linux/uprobes.h>
+#include <linux/autonuma_types.h>
#include <asm/page.h>
#include <asm/mmu.h>
@@ -389,6 +390,10 @@ struct mm_struct {
struct cpumask cpumask_allocation;
#endif
struct uprobes_state uprobes_state;
+#ifdef CONFIG_AUTONUMA
+ /* this is used by the scheduler and the page allocator */
+ struct mm_autonuma *mm_autonuma;
+#endif
};
static inline void mm_init_cpumask(struct mm_struct *mm)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 699324c..cb20347 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1514,6 +1514,9 @@ struct task_struct {
struct mempolicy *mempolicy; /* Protected by alloc_lock */
short il_next;
short pref_node_fork;
+#ifdef CONFIG_AUTONUMA
+ struct task_autonuma *task_autonuma;
+#endif
#endif
struct rcu_head rcu;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index d5594a4..a8f94b9 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -72,6 +72,7 @@
#include <linux/slab.h>
#include <linux/init_task.h>
#include <linux/binfmts.h>
+#include <linux/autonuma_sched.h>
#include <asm/switch_to.h>
#include <asm/tlb.h>
diff --git a/kernel/sched/numa.c b/kernel/sched/numa.c
new file mode 100644
index 0000000..72f6158
--- /dev/null
+++ b/kernel/sched/numa.c
@@ -0,0 +1,586 @@
+/*
+ * Copyright (C) 2012 Red Hat, Inc.
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2. See
+ * the COPYING file in the top-level directory.
+ */
+
+#include <linux/sched.h>
+#include <linux/autonuma_sched.h>
+#include <asm/tlb.h>
+
+#include "sched.h"
+
+/*
+ * autonuma_balance_cpu_stop() is a callback to be invoked by
+ * stop_one_cpu_nowait(). It is used by sched_autonuma_balance() to
+ * migrate the tasks to the selected_cpu, from softirq context.
+ */
+static int autonuma_balance_cpu_stop(void *data)
+{
+ struct rq *src_rq = data;
+ int src_cpu = cpu_of(src_rq);
+ int dst_cpu = src_rq->autonuma_balance_dst_cpu;
+ struct task_struct *p = src_rq->autonuma_balance_task;
+ struct rq *dst_rq = cpu_rq(dst_cpu);
+
+ raw_spin_lock_irq(&p->pi_lock);
+ raw_spin_lock(&src_rq->lock);
+
+ /* Make sure the selected cpu hasn't gone down in the meanwhile */
+ if (unlikely(src_cpu != smp_processor_id() ||
+ !src_rq->autonuma_balance))
+ goto out_unlock;
+
+ /* Check if the affinity changed in the meanwhile */
+ if (!cpumask_test_cpu(dst_cpu, tsk_cpus_allowed(p)))
+ goto out_unlock;
+
+ /* Is the task to migrate still there? */
+ if (task_cpu(p) != src_cpu)
+ goto out_unlock;
+
+ BUG_ON(src_rq == dst_rq);
+
+ /* Prepare to move the task from src_rq to dst_rq */
+ double_lock_balance(src_rq, dst_rq);
+
+ /*
+ * Supposedly pi_lock should have been enough but some code
+ * seems to call __set_task_cpu without pi_lock.
+ */
+ if (task_cpu(p) != src_cpu) {
+ WARN_ONCE(1, "autonuma_balance_cpu_stop: "
+ "not pi_lock protected");
+ goto out_double_unlock;
+ }
+
+ /*
+ * If the task is not on a rq, the autonuma_node will take
+ * care of the NUMA affinity at the next wake-up.
+ */
+ if (p->on_rq) {
+ deactivate_task(src_rq, p, 0);
+ set_task_cpu(p, dst_cpu);
+ activate_task(dst_rq, p, 0);
+ check_preempt_curr(dst_rq, p, 0);
+ }
+
+out_double_unlock:
+ double_unlock_balance(src_rq, dst_rq);
+out_unlock:
+ src_rq->autonuma_balance = false;
+ raw_spin_unlock(&src_rq->lock);
+ /* the spinlocks act as barrier() so p is kept in a local on the stack */
+ raw_spin_unlock_irq(&p->pi_lock);
+ put_task_struct(p);
+ return 0;
+}
+
+#define AUTONUMA_BALANCE_SCALE 1000
+
+enum {
+ W_TYPE_THREAD,
+ W_TYPE_PROCESS,
+};
+
+/*
+ * This function sched_autonuma_balance() is responsible for deciding
+ * which is the best CPU each process should be running on according
+ * to the NUMA statistics collected in mm->mm_autonuma and
+ * tsk->task_autonuma.
+ *
+ * The core math that evaluates the current CPU against the CPUs of
+ * all _other_ nodes is this:
+ *
+ * if (w_nid > w_other && w_nid > w_cpu_nid)
+ * weight = w_nid - w_other + w_nid - w_cpu_nid;
+ *
+ * w_nid: NUMA affinity of the current thread/process if run on the
+ * other CPU.
+ *
+ * w_other: NUMA affinity of the other thread/process if run on the
+ * other CPU.
+ *
+ * w_cpu_nid: NUMA affinity of the current thread/process if run on
+ * the current CPU.
+ *
+ * weight: combined NUMA affinity benefit in moving the current
+ * thread/process to the other CPU taking into account both the higher
+ * NUMA affinity of the current process if run on the other CPU, and
+ * the increase in NUMA affinity in the other CPU by replacing the
+ * other process.
+ *
+ * We run the above math on every CPU not part of the current NUMA
+ * node, and we compare the current process against the other
+ * processes running in the other CPUs in the remote NUMA nodes. The
+ * objective is to select the cpu (in selected_cpu) with a bigger
+ * "weight". The bigger the "weight" the biggest gain we'll get by
+ * moving the current process to the selected_cpu (not only the
+ * biggest immediate CPU gain but also the fewer async memory
+ * migrations that will be required to reach full convergence
+ * later). If we select a cpu we migrate the current process to it.
+ *
+ * Checking that the current process has higher NUMA affinity than the
+ * other process on the other CPU (w_nid > w_other) and not only that
+ * the current process has higher NUMA affinity on the other CPU than
+ * on the current CPU (w_nid > w_cpu_nid) completely avoids ping pongs
+ * and ensures (temporary) convergence of the algorithm (at least from
+ * a CPU standpoint).
+ *
+ * It's then up to the idle balancing code that will run as soon as
+ * the current CPU goes idle to pick the other process and move it
+ * here (or in some other idle CPU if any).
+ *
+ * By only evaluating running processes against running processes we
+ * avoid interfering with the CFS stock active idle balancing, which
+ * is critical to optimal performance with HT enabled. (getting HT
+ * wrong is worse than running on remote memory so the active idle
+ * balancing has priority)
+ *
+ * Idle balancing and all other CFS load balancing become NUMA
+ * affinity aware through the introduction of
+ * sched_autonuma_can_migrate_task(). CFS searches CPUs in the task's
+ * autonuma_node first when it needs to find idle CPUs during idle
+ * balancing or tasks to pick during load balancing.
+ *
+ * The task's autonuma_node is the node selected by
+ * sched_autonuma_balance() when it migrates a task to the
+ * selected_cpu in the selected_nid.
+ *
+ * Once a process/thread has been moved to another node, closer to
+ * most of the memory it has recently accessed, any memory for that task
+ * not in the new node moves slowly (asynchronously in the background)
+ * to the new node. This is done by the knuma_migratedN (where the
+ * suffix N is the node id) daemon described in mm/autonuma.c.
+ *
+ * One non trivial bit of this logic that deserves an explanation is
+ * how the three crucial variables of the core math
+ * (w_nid/w_other/w_cpu_nid) are going to change depending on whether
+ * the other CPU is running a thread of the current process, or a
+ * thread of a different process.
+ *
+ * A simple example is required. Given the following:
+ * - 2 processes
+ * - 4 threads per process
+ * - 2 NUMA nodes
+ * - 4 CPUS per NUMA node
+ *
+ * Because the 8 threads belong to 2 different processes, by using the
+ * process statistics when comparing threads of different processes,
+ * we will converge reliably and quickly to a configuration where the
+ * 1st process is entirely contained in one node and the 2nd process
+ * in the other node.
+ *
+ * If all threads only use thread local memory (no sharing of memory
+ * between the threads), it wouldn't matter if we use per-thread or
+ * per-mm statistics for w_nid/w_other/w_cpu_nid. We could then use
+ * per-thread statistics all the time.
+ *
+ * But clearly with threads it's expected to get some sharing of
+ * memory. To avoid false sharing it's better to keep all threads of
+ * the same process in the same node (or if they don't fit in a single
+ * node, in as few nodes as possible). This is why we have to use
+ * process statistics in w_nid/w_other/w_cpu_nid when comparing
+ * threads of different processes. Why instead do we have to use
+ * thread statistics when comparing threads of the same process? This
+ * should be obvious if you're still reading (hint: the mm statistics
+ * are identical for threads of the same process). If some process
+ * doesn't fit in one node, the thread statistics will then distribute
+ * the threads to the best nodes within the group of nodes where the
+ * process is contained.
+ *
+ * False sharing in the above sentence (and generally in AutoNUMA
+ * context) is intended as virtual memory accessed simultaneously (or
+ * frequently) by threads running in CPUs of different nodes. This
+ * doesn't refer to shared memory as in tmpfs, but it refers to
+ * CLONE_VM instead. If the threads access the same memory from CPUs
+ * of different nodes it means the memory accesses will be NUMA local
+ * for some thread and NUMA remote for some other thread. The only way
+ * to avoid NUMA false sharing is to schedule all threads accessing
+ * the same memory in the same node (which may or may not be possible,
+ * if it's not possible because there aren't enough CPUs in the node,
+ * the threads should be scheduled in as few nodes as possible and the
+ * nodes distance should be the lowest possible).
+ *
+ * This is an example of the CPU layout after the startup of 2
+ * processes with 12 threads each. This is some of the logs you will
+ * find in `dmesg` after running:
+ *
+ * echo 1 >/sys/kernel/mm/autonuma/debug
+ *
+ * nid is the node id
+ * mm is the pointer to the mm structure (kind of the "ID" of the process)
+ * nr is the number of threads that belong to that process on that node id.
+ *
+ * This dumps the raw content of the CPUs' runqueues, it doesn't show
+ * kernel threads (the kernel thread dumping the below stats is
+ * clearly using one CPU, hence only 23 CPUs are dumped, clearly the
+ * debug mode can be improved but it's good enough to see what's going
+ * on).
+ *
+ * nid 0 mm ffff880433367b80 nr 6
+ * nid 0 mm ffff880433367480 nr 5
+ * nid 1 mm ffff880433367b80 nr 6
+ * nid 1 mm ffff880433367480 nr 6
+ *
+ * Now, the process with mm == ffff880433367b80 has 6 threads in node0
+ * and 6 threads in node1, while the process with mm ==
+ * ffff880433367480 has 5 threads in node0 and 6 threads running in
+ * node1.
+ *
+ * And after a few seconds it becomes:
+ *
+ * nid 0 mm ffff880433367b80 nr 12
+ * nid 1 mm ffff880433367480 nr 11
+ *
+ * Now, 12 threads of one process are running on node 0 and 11 threads
+ * of the other process are running on node 1.
+ *
+ * Before scanning all other CPUs' runqueues to compute the above
+ * math, we also verify that the current CPU is not already in the
+ * preferred NUMA node from the point of view of both the process
+ * statistics and the thread statistics. In such case we can return to
+ * the caller without having to check any other CPUs' runqueues
+ * because full convergence has been already reached.
+ *
+ * This algorithm might be expanded to take all runnable processes
+ * into account but examining just the currently running processes is
+ * a good enough approximation because some runnable processes may run
+ * only for a short time so statistically there will always be a bias
+ * towards the processes that use the CPU the most. This is ideal
+ * because it doesn't matter if NUMA balancing isn't optimal for
+ * processes that run only for a short time.
+ *
+ * This function is invoked at the same frequency and in the same
+ * location of the CFS load balancer and only if the CPU is not
+ * idle. The rest of the time we depend on CFS to keep sticking to the
+ * current CPU or to prioritize on the CPUs in the selected_nid
+ * (recorded in the task's autonuma_node field).
+ */
+void sched_autonuma_balance(void)
+{
+ int cpu, nid, selected_cpu, selected_nid, selected_nid_mm;
+ int cpu_nid = numa_node_id();
+ int this_cpu = smp_processor_id();
+ /*
+ * w_t: node thread weight
+ * w_t_t: total sum of all node thread weights
+ * w_m: node mm/process weight
+ * w_m_t: total sum of all node mm/process weights
+ */
+ unsigned long w_t, w_t_t, w_m, w_m_t;
+ unsigned long w_t_max, w_m_max;
+ unsigned long weight_max, weight;
+ long s_w_nid = -1, s_w_cpu_nid = -1, s_w_other = -1;
+ int s_w_type = -1;
+ struct cpumask *allowed;
+ struct task_struct *p = current;
+ struct task_autonuma *task_autonuma = p->task_autonuma;
+ struct rq *rq;
+
+ /* per-cpu statically allocated in runqueues */
+ long *task_numa_weight;
+ long *mm_numa_weight;
+
+ if (!task_autonuma || !p->mm)
+ return;
+
+ if (!autonuma_enabled()) {
+ if (task_autonuma->autonuma_node != -1)
+ task_autonuma->autonuma_node = -1;
+ return;
+ }
+
+ allowed = tsk_cpus_allowed(p);
+
+ /*
+ * Do nothing if the task had no numa hinting page faults yet
+ * or if the mm hasn't been fully scanned by knuma_scand yet.
+ */
+ w_t_t = task_autonuma->task_numa_fault_tot;
+ if (!w_t_t)
+ return;
+ w_m_t = ACCESS_ONCE(p->mm->mm_autonuma->mm_numa_fault_tot);
+ if (!w_m_t)
+ return;
+
+ /*
+ * The below two arrays holds the NUMA affinity information of
+ * the current process if scheduled in the "nid". This is task
+ * local and mm local information. We compute this information
+ * for all nodes.
+ *
+ * task/mm_numa_weight[nid] will become w_nid.
+ * task/mm_numa_weight[cpu_nid] will become w_cpu_nid.
+ */
+ rq = cpu_rq(this_cpu);
+ task_numa_weight = rq->task_numa_weight;
+ mm_numa_weight = rq->mm_numa_weight;
+
+ w_t_max = w_m_max = 0;
+ selected_nid = selected_nid_mm = -1;
+ for_each_online_node(nid) {
+ w_m = ACCESS_ONCE(p->mm->mm_autonuma->mm_numa_fault[nid]);
+ w_t = task_autonuma->task_numa_fault[nid];
+ if (w_m > w_m_t)
+ w_m_t = w_m;
+ mm_numa_weight[nid] = w_m*AUTONUMA_BALANCE_SCALE/w_m_t;
+ if (w_t > w_t_t)
+ w_t_t = w_t;
+ task_numa_weight[nid] = w_t*AUTONUMA_BALANCE_SCALE/w_t_t;
+ if (mm_numa_weight[nid] > w_m_max) {
+ w_m_max = mm_numa_weight[nid];
+ selected_nid_mm = nid;
+ }
+ if (task_numa_weight[nid] > w_t_max) {
+ w_t_max = task_numa_weight[nid];
+ selected_nid = nid;
+ }
+ }
+ /*
+ * See if we already converged to skip the more expensive loop
+ * below. Return if we can already predict here, with only
+ * mm/task local information, that the below loop would
+ * select the current cpu_nid.
+ */
+ if (selected_nid == cpu_nid && selected_nid_mm == selected_nid) {
+ if (task_autonuma->autonuma_node != selected_nid)
+ task_autonuma->autonuma_node = selected_nid;
+ return;
+ }
+
+ selected_cpu = this_cpu;
+ selected_nid = cpu_nid;
+
+ weight = weight_max = 0;
+
+ /* check that the following raw_spin_lock_irq is safe */
+ BUG_ON(irqs_disabled());
+
+ for_each_online_node(nid) {
+ /*
+ * Calculate the "weight" for all CPUs that the
+ * current process is allowed to be migrated to,
+ * except the CPUs of the current nid (it would be
+ * worthless from a NUMA affinity standpoint to
+ * migrate the task to another CPU of the current
+ * node).
+ */
+ if (nid == cpu_nid)
+ continue;
+ for_each_cpu_and(cpu, cpumask_of_node(nid), allowed) {
+ long w_nid, w_cpu_nid, w_other;
+ int w_type;
+ struct mm_struct *mm;
+ rq = cpu_rq(cpu);
+ if (!cpu_online(cpu))
+ continue;
+
+ if (idle_cpu(cpu))
+ /*
+ * Offload the idle balancing
+ * and physical/logical imbalances
+ * to CFS.
+ */
+ continue;
+
+ mm = rq->curr->mm;
+ if (!mm)
+ continue;
+ /*
+ * Grab the w_m/w_t/w_m_t/w_t_t of the
+ * processes running in the other CPUs to
+ * compute w_other.
+ */
+ raw_spin_lock_irq(&rq->lock);
+ /* recheck after implicit barrier() */
+ mm = rq->curr->mm;
+ if (!mm) {
+ raw_spin_unlock_irq(&rq->lock);
+ continue;
+ }
+ w_m_t = ACCESS_ONCE(mm->mm_autonuma->mm_numa_fault_tot);
+ w_t_t = rq->curr->task_autonuma->task_numa_fault_tot;
+ if (!w_m_t || !w_t_t) {
+ raw_spin_unlock_irq(&rq->lock);
+ continue;
+ }
+ w_m = ACCESS_ONCE(mm->mm_autonuma->mm_numa_fault[nid]);
+ w_t = rq->curr->task_autonuma->task_numa_fault[nid];
+ raw_spin_unlock_irq(&rq->lock);
+ /*
+ * Generate the w_nid/w_cpu_nid from the
+ * pre-computed mm/task_numa_weight[] and
+ * compute w_other using the w_m/w_t info
+ * collected from the other process.
+ */
+ if (mm == p->mm) {
+ if (w_t > w_t_t)
+ w_t_t = w_t;
+ w_other = w_t*AUTONUMA_BALANCE_SCALE/w_t_t;
+ w_nid = task_numa_weight[nid];
+ w_cpu_nid = task_numa_weight[cpu_nid];
+ w_type = W_TYPE_THREAD;
+ } else {
+ if (w_m > w_m_t)
+ w_m_t = w_m;
+ w_other = w_m*AUTONUMA_BALANCE_SCALE/w_m_t;
+ w_nid = mm_numa_weight[nid];
+ w_cpu_nid = mm_numa_weight[cpu_nid];
+ w_type = W_TYPE_PROCESS;
+ }
+
+ /*
+ * Finally check if there's a combined gain in
+ * NUMA affinity. If there is and it's the
+ * biggest weight seen so far, record its
+ * weight and select this NUMA remote "cpu" as
+ * candidate migration destination.
+ */
+ if (w_nid > w_other && w_nid > w_cpu_nid) {
+ weight = w_nid - w_other + w_nid - w_cpu_nid;
+
+ if (weight > weight_max) {
+ weight_max = weight;
+ selected_cpu = cpu;
+ selected_nid = nid;
+
+ s_w_other = w_other;
+ s_w_nid = w_nid;
+ s_w_cpu_nid = w_cpu_nid;
+ s_w_type = w_type;
+ }
+ }
+ }
+ }
+
+ if (task_autonuma->autonuma_node != selected_nid)
+ task_autonuma->autonuma_node = selected_nid;
+ if (selected_cpu != this_cpu) {
+ if (autonuma_debug()) {
+ char *w_type_str = NULL;
+ switch (s_w_type) {
+ case W_TYPE_THREAD:
+ w_type_str = "thread";
+ break;
+ case W_TYPE_PROCESS:
+ w_type_str = "process";
+ break;
+ }
+ printk("%p %d - %dto%d - %dto%d - %ld %ld %ld - %s\n",
+ p->mm, p->pid, cpu_nid, selected_nid,
+ this_cpu, selected_cpu,
+ s_w_other, s_w_nid, s_w_cpu_nid,
+ w_type_str);
+ }
+ BUG_ON(cpu_nid == selected_nid);
+ goto found;
+ }
+
+ return;
+
+found:
+ rq = cpu_rq(this_cpu);
+
+ /*
+ * autonuma_balance synchronizes accesses to
+ * autonuma_balance_work. Once set, it's cleared by the
+ * callback once the migration work is finished.
+ */
+ raw_spin_lock_irq(&rq->lock);
+ if (rq->autonuma_balance) {
+ raw_spin_unlock_irq(&rq->lock);
+ return;
+ }
+ rq->autonuma_balance = true;
+ raw_spin_unlock_irq(&rq->lock);
+
+ rq->autonuma_balance_dst_cpu = selected_cpu;
+ rq->autonuma_balance_task = p;
+ get_task_struct(p);
+
+ stop_one_cpu_nowait(this_cpu,
+ autonuma_balance_cpu_stop, rq,
+ &rq->autonuma_balance_work);
+#ifdef __ia64__
+#error "NOTE: tlb_migrate_finish won't run here"
+#endif
+}
+
+/*
+ * The function sched_autonuma_can_migrate_task is called by CFS
+ * can_migrate_task() to prioritize on the task's autonuma_node. It is
+ * called during load_balancing, idle balancing and in general
+ * before any task CPU migration event happens.
+ *
+ * The caller first scans the CFS migration candidate tasks passing a
+ * non-zero numa parameter, to skip tasks without AutoNUMA affinity
+ * (according to the task's autonuma_node). If no task can be
+ * migrated in the first scan, a second scan is run with a zero numa
+ * parameter.
+ *
+ * If load_balance_strict is enabled, AutoNUMA will only allow
+ * migration of tasks for idle balancing purposes (the idle balancing
+ * of CFS is never altered by AutoNUMA). In non-strict mode the
+ * load balancing is not altered and the AutoNUMA affinity is
+ * disregarded in favor of higher fairness. The load_balance_strict
+ * knob is runtime tunable in sysfs.
+ *
+ * If load_balance_strict is enabled, it tends to partition the
+ * system. In turn it may reduce the scheduler fairness across NUMA
+ * nodes, but it should deliver higher global performance.
+ */
+bool sched_autonuma_can_migrate_task(struct task_struct *p,
+ int numa, int dst_cpu,
+ enum cpu_idle_type idle)
+{
+ if (!task_autonuma_cpu(p, dst_cpu)) {
+ if (numa)
+ return false;
+ if (autonuma_sched_load_balance_strict() &&
+ idle != CPU_NEWLY_IDLE && idle != CPU_IDLE)
+ return false;
+ }
+ return true;
+}
+
+/*
+ * sched_autonuma_dump_mm is a purely debugging function called at
+ * regular intervals when /sys/kernel/mm/autonuma/debug is
+ * enabled. This prints in the kernel logs how the threads and
+ * processes are distributed in all NUMA nodes to easily check if the
+ * threads of the same processes are converging in the same
+ * nodes. This doesn't take kernel threads into account and, because it
+ * runs from a kernel thread itself, it won't show what was running on
+ * the current CPU, but it's simple and good enough to get what we
+ * need in the debug logs. This function can be disabled or deleted
+ * later.
+ */
+void sched_autonuma_dump_mm(void)
+{
+ int nid, cpu;
+ cpumask_var_t x;
+
+ if (!alloc_cpumask_var(&x, GFP_KERNEL))
+ return;
+ cpumask_setall(x);
+ for_each_online_node(nid) {
+ for_each_cpu(cpu, cpumask_of_node(nid)) {
+ struct rq *rq = cpu_rq(cpu);
+ struct mm_struct *mm = rq->curr->mm;
+ int nr = 0, cpux;
+ if (!cpumask_test_cpu(cpu, x))
+ continue;
+ for_each_cpu(cpux, cpumask_of_node(nid)) {
+ struct rq *rqx = cpu_rq(cpux);
+ if (rqx->curr->mm == mm) {
+ nr++;
+ cpumask_clear_cpu(cpux, x);
+ }
+ }
+ printk("nid %d mm %p nr %d\n", nid, mm, nr);
+ }
+ }
+ free_cpumask_var(x);
+}
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 6d52cea..e5b7ae9 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -463,6 +463,24 @@ struct rq {
#ifdef CONFIG_SMP
struct llist_head wake_list;
#endif
+#ifdef CONFIG_AUTONUMA
+ /*
+ * Per-cpu arrays to compute the per-thread and per-process
+ * statistics. Allocated statically to avoid overflowing the
+ * stack with large MAX_NUMNODES values.
+ *
+ * FIXME: allocate dynamically and with num_possible_nodes()
+ * array sizes only if autonuma is not impossible, to save
+ * a few dozen KB of RAM when booting on non-NUMA (or small
+ * NUMA) systems.
+ */
+ long task_numa_weight[MAX_NUMNODES];
+ long mm_numa_weight[MAX_NUMNODES];
+ bool autonuma_balance;
+ int autonuma_balance_dst_cpu;
+ struct task_struct *autonuma_balance_task;
+ struct cpu_stop_work autonuma_balance_work;
+#endif
};
static inline int cpu_of(struct rq *rq)
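To make the numbers in the sched_autonuma_balance() comments above
concrete, here is a minimal standalone sketch (plain C; only the w_*
names, the AUTONUMA_BALANCE_SCALE name and the migration condition come
from the code above, while the scale value of 1000 and the fault counts
are assumptions for illustration) of how a per-node weight is derived
from the fault statistics and how the candidate condition is evaluated:

#include <stdio.h>

#define AUTONUMA_BALANCE_SCALE 1000

/* NUMA affinity of a task on a node: the share of its NUMA hinting
 * faults that hit that node, scaled to 0..AUTONUMA_BALANCE_SCALE */
static long numa_weight(unsigned long fault_nid, unsigned long fault_tot)
{
        if (fault_nid > fault_tot)      /* counters are updated racily */
                fault_tot = fault_nid;
        return fault_nid * AUTONUMA_BALANCE_SCALE / fault_tot;
}

int main(void)
{
        /* hypothetical: the current task took 900 of its 1000 faults on
         * the remote node and 100 on the local node; the task currently
         * running on the remote CPU took 300 of its 1000 faults there */
        long w_nid     = numa_weight(900, 1000); /* us, if moved there */
        long w_cpu_nid = numa_weight(100, 1000); /* us, if we stay here */
        long w_other   = numa_weight(300, 1000); /* the task we'd displace */

        if (w_nid > w_other && w_nid > w_cpu_nid)
                printf("candidate, combined gain %ld of %d\n",
                       w_nid - w_other + w_nid - w_cpu_nid,
                       2 * AUTONUMA_BALANCE_SCALE);
        else
                printf("not a candidate\n");
        return 0;
}

With these numbers the condition passes with a combined gain of 1400 out
of a possible 2000, so this remote CPU would be remembered as the best
candidate seen so far in the loop.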
If the task has already been moved to an autonuma_node, try to allocate
memory from it even if it's temporarily not the local node. Chances
are it's where most of its memory is already located and where it will
run in the future.
Signed-off-by: Andrea Arcangeli <[email protected]>
---
mm/mempolicy.c | 15 +++++++++++++--
1 files changed, 13 insertions(+), 2 deletions(-)
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 1d771e4..86c0df0 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1945,10 +1945,21 @@ retry_cpuset:
*/
if (pol->mode == MPOL_INTERLEAVE)
page = alloc_page_interleave(gfp, order, interleave_nodes(pol));
- else
+ else {
+ int nid;
+#ifdef CONFIG_AUTONUMA
+ nid = -1;
+ if (current->task_autonuma)
+ nid = current->task_autonuma->autonuma_node;
+ if (nid < 0)
+ nid = numa_node_id();
+#else
+ nid = numa_node_id();
+#endif
page = __alloc_pages_nodemask(gfp, order,
- policy_zonelist(gfp, pol, numa_node_id()),
+ policy_zonelist(gfp, pol, nid),
policy_nodemask(gfp, pol));
+ }
if (unlikely(!put_mems_allowed(cpuset_mems_cookie) && !page))
goto retry_cpuset;
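The fallback added above is small enough to restate as a standalone
helper; this is only a sketch (the helper name is made up), but it is the
same logic as the hunk: prefer the node the NUMA scheduler selected for
the task, otherwise keep the old behaviour of allocating locally.

static int autonuma_preferred_node(struct task_struct *p)
{
#ifdef CONFIG_AUTONUMA
        /* the node chosen by sched_autonuma_balance(), if any */
        if (p->task_autonuma && p->task_autonuma->autonuma_node >= 0)
                return p->task_autonuma->autonuma_node;
#endif
        /* no preference recorded: allocate from the local node as before */
        return numa_node_id();
}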
This is needed to make sure the tail pages are also queued into the
migration queues of knuma_migrated across a transparent hugepage
split.
Signed-off-by: Andrea Arcangeli <[email protected]>
---
mm/huge_memory.c | 2 ++
1 files changed, 2 insertions(+), 0 deletions(-)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 1598708..55fc72d 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -17,6 +17,7 @@
#include <linux/khugepaged.h>
#include <linux/freezer.h>
#include <linux/mman.h>
+#include <linux/autonuma.h>
#include <asm/tlb.h>
#include <asm/pgalloc.h>
#include "internal.h"
@@ -1316,6 +1317,7 @@ static void __split_huge_page_refcount(struct page *page)
BUG_ON(!PageSwapBacked(page_tail));
lru_add_page_tail(page, page_tail, lruvec);
+ autonuma_migrate_split_huge_page(page, page_tail);
}
atomic_sub(tail_count, &page->_count);
BUG_ON(__page_count(page) <= 0);
Define the two data structures that collect the per-process (in the
mm) and per-thread (in the task_struct) statistical information that
are the input of the CPU follow memory algorithms in the NUMA
scheduler.
Signed-off-by: Andrea Arcangeli <[email protected]>
---
include/linux/autonuma_types.h | 68 ++++++++++++++++++++++++++++++++++++++++
1 files changed, 68 insertions(+), 0 deletions(-)
create mode 100644 include/linux/autonuma_types.h
diff --git a/include/linux/autonuma_types.h b/include/linux/autonuma_types.h
new file mode 100644
index 0000000..9e697e3
--- /dev/null
+++ b/include/linux/autonuma_types.h
@@ -0,0 +1,68 @@
+#ifndef _LINUX_AUTONUMA_TYPES_H
+#define _LINUX_AUTONUMA_TYPES_H
+
+#ifdef CONFIG_AUTONUMA
+
+#include <linux/numa.h>
+
+/*
+ * Per-mm (process) structure dynamically allocated only if autonuma
+ * is not impossible. This links the mm to scan into the
+ * knuma_scand.mm_head and it contains the NUMA memory placement
+ * statistics for the process (generated by knuma_scand).
+ */
+struct mm_autonuma {
+ /* list node to link the "mm" into the knuma_scand.mm_head */
+ struct list_head mm_node;
+ struct mm_struct *mm;
+ unsigned long mm_numa_fault_pass; /* zeroed from here during allocation */
+ unsigned long mm_numa_fault_tot;
+ unsigned long mm_numa_fault[0];
+};
+
+extern int alloc_mm_autonuma(struct mm_struct *mm);
+extern void free_mm_autonuma(struct mm_struct *mm);
+extern void __init mm_autonuma_init(void);
+
+/*
+ * Per-task (thread) structure dynamically allocated only if autonuma
+ * is not impossible. This contains the preferred autonuma_node where
+ * the userland thread should be scheduled into (only relevant if
+ * tsk->mm is not null) and the per-thread NUMA accesses statistics
+ * (generated by the NUMA hinting page faults).
+ */
+struct task_autonuma {
+ int autonuma_node;
+ /* zeroed from the below field during allocation */
+ unsigned long task_numa_fault_pass;
+ unsigned long task_numa_fault_tot;
+ unsigned long task_numa_fault[0];
+};
+
+extern int alloc_task_autonuma(struct task_struct *tsk,
+ struct task_struct *orig,
+ int node);
+extern void __init task_autonuma_init(void);
+extern void free_task_autonuma(struct task_struct *tsk);
+
+#else /* CONFIG_AUTONUMA */
+
+static inline int alloc_mm_autonuma(struct mm_struct *mm)
+{
+ return 0;
+}
+static inline void free_mm_autonuma(struct mm_struct *mm) {}
+static inline void mm_autonuma_init(void) {}
+
+static inline int alloc_task_autonuma(struct task_struct *tsk,
+ struct task_struct *orig,
+ int node)
+{
+ return 0;
+}
+static inline void task_autonuma_init(void) {}
+static inline void free_task_autonuma(struct task_struct *tsk) {}
+
+#endif /* CONFIG_AUTONUMA */
+
+#endif /* _LINUX_AUTONUMA_TYPES_H */
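The zero-length arrays closing both structures are meant to be sized by
the number of NUMA nodes when the structure is allocated; the real
allocators come in a later patch, so the following is only a sketch of
the idea (the function name and the exact size computation are
assumptions for illustration):

static struct task_autonuma *task_autonuma_alloc_sketch(void)
{
        struct task_autonuma *ta;

        /* one fault counter per node follows the fixed fields */
        ta = kzalloc(sizeof(*ta) +
                     nr_node_ids * sizeof(ta->task_numa_fault[0]),
                     GFP_KERNEL);
        if (ta)
                ta->autonuma_node = -1; /* no preferred node yet */
        return ta;
}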
The CFS scheduler is still in charge of all scheduling
decisions. AutoNUMA balancing at times will override those. But
generally we'll just rely on the CFS scheduler to keep doing its
thing, while preferring the autonuma affine nodes when deciding
to move a process to a different runqueue or when waking it up.
For example, idle balancing will look into the runqueues of the
busy CPUs, but it will first search for a task that wants to run on
the idle CPU in AutoNUMA terms (task_autonuma_cpu() being true).
Most of this is encoded in the can_migrate_task becoming AutoNUMA
aware and running two passes for each balancing pass, the first NUMA
aware, and the second one relaxed.
The idle/newidle balancing is always allowed to fall back to
non-affine AutoNUMA tasks. The load balancing (which is more a
fairness than a performance issue) is instead only able to cross over
the AutoNUMA affinity if the flag controlled by
/sys/kernel/mm/autonuma/scheduler/load_balance_strict is not set (it
is set by default).
Tasks that haven't been fully profiled yet are not affected by this
because their p->sched_autonuma->autonuma_node is still set to the
original value of -1 and task_autonuma_cpu will always return true in
that case.
Includes fixes from Hillf Danton <[email protected]>.
Signed-off-by: Andrea Arcangeli <[email protected]>
---
kernel/sched/fair.c | 65 +++++++++++++++++++++++++++++++++++++++++++-------
1 files changed, 56 insertions(+), 9 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index fa96810..dab9bdd 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -26,6 +26,7 @@
#include <linux/slab.h>
#include <linux/profile.h>
#include <linux/interrupt.h>
+#include <linux/autonuma_sched.h>
#include <trace/events/sched.h>
@@ -2621,6 +2622,8 @@ find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
load = weighted_cpuload(i);
if (load < min_load || (load == min_load && i == this_cpu)) {
+ if (!task_autonuma_cpu(p, i))
+ continue;
min_load = load;
idlest = i;
}
@@ -2639,24 +2642,27 @@ static int select_idle_sibling(struct task_struct *p, int target)
struct sched_domain *sd;
struct sched_group *sg;
int i;
+ bool idle_target;
/*
* If the task is going to be woken-up on this cpu and if it is
* already idle, then it is the right target.
*/
- if (target == cpu && idle_cpu(cpu))
+ if (target == cpu && idle_cpu(cpu) && task_autonuma_cpu(p, cpu))
return cpu;
/*
* If the task is going to be woken-up on the cpu where it previously
* ran and if it is currently idle, then it the right target.
*/
- if (target == prev_cpu && idle_cpu(prev_cpu))
+ if (target == prev_cpu && idle_cpu(prev_cpu) &&
+ task_autonuma_cpu(p, prev_cpu))
return prev_cpu;
/*
* Otherwise, iterate the domains and find an elegible idle cpu.
*/
+ idle_target = false;
sd = rcu_dereference(per_cpu(sd_llc, target));
for_each_lower_domain(sd) {
sg = sd->groups;
@@ -2670,9 +2676,18 @@ static int select_idle_sibling(struct task_struct *p, int target)
goto next;
}
- target = cpumask_first_and(sched_group_cpus(sg),
- tsk_cpus_allowed(p));
- goto done;
+ for_each_cpu_and(i, sched_group_cpus(sg),
+ tsk_cpus_allowed(p)) {
+ /* Find autonuma cpu only in idle group */
+ if (task_autonuma_cpu(p, i)) {
+ target = i;
+ goto done;
+ }
+ if (!idle_target) {
+ idle_target = true;
+ target = i;
+ }
+ }
next:
sg = sg->next;
} while (sg != sd->groups);
@@ -2707,7 +2722,8 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags)
return prev_cpu;
if (sd_flag & SD_BALANCE_WAKE) {
- if (cpumask_test_cpu(cpu, tsk_cpus_allowed(p)))
+ if (cpumask_test_cpu(cpu, tsk_cpus_allowed(p)) &&
+ task_autonuma_cpu(p, cpu))
want_affine = 1;
new_cpu = prev_cpu;
}
@@ -3072,6 +3088,7 @@ static unsigned long __read_mostly max_load_balance_interval = HZ/10;
#define LBF_ALL_PINNED 0x01
#define LBF_NEED_BREAK 0x02
+#define LBF_NUMA 0x04
struct lb_env {
struct sched_domain *sd;
@@ -3142,13 +3159,14 @@ static
int can_migrate_task(struct task_struct *p, struct lb_env *env)
{
int tsk_cache_hot = 0;
+ struct cpumask *allowed = tsk_cpus_allowed(p);
/*
* We do not migrate tasks that are:
* 1) running (obviously), or
* 2) cannot be migrated to this CPU due to cpus_allowed, or
* 3) are cache-hot on their current CPU.
*/
- if (!cpumask_test_cpu(env->dst_cpu, tsk_cpus_allowed(p))) {
+ if (!cpumask_test_cpu(env->dst_cpu, allowed)) {
schedstat_inc(p, se.statistics.nr_failed_migrations_affine);
return 0;
}
@@ -3159,6 +3177,10 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
return 0;
}
+ if (!sched_autonuma_can_migrate_task(p, env->flags & LBF_NUMA,
+ env->dst_cpu, env->idle))
+ return 0;
+
/*
* Aggressive migration if:
* 1) task is cache cold, or
@@ -3195,6 +3217,8 @@ static int move_one_task(struct lb_env *env)
{
struct task_struct *p, *n;
+ env->flags |= LBF_NUMA;
+numa_repeat:
list_for_each_entry_safe(p, n, &env->src_rq->cfs_tasks, se.group_node) {
if (throttled_lb_pair(task_group(p), env->src_rq->cpu, env->dst_cpu))
continue;
@@ -3209,8 +3233,14 @@ static int move_one_task(struct lb_env *env)
* stats here rather than inside move_task().
*/
schedstat_inc(env->sd, lb_gained[env->idle]);
+ env->flags &= ~LBF_NUMA;
return 1;
}
+ if (env->flags & LBF_NUMA) {
+ env->flags &= ~LBF_NUMA;
+ goto numa_repeat;
+ }
+
return 0;
}
@@ -3235,6 +3265,8 @@ static int move_tasks(struct lb_env *env)
if (env->imbalance <= 0)
return 0;
+ env->flags |= LBF_NUMA;
+numa_repeat:
while (!list_empty(tasks)) {
p = list_first_entry(tasks, struct task_struct, se.group_node);
@@ -3274,9 +3306,13 @@ static int move_tasks(struct lb_env *env)
* kernels will stop after the first task is pulled to minimize
* the critical section.
*/
- if (env->idle == CPU_NEWLY_IDLE)
- break;
+ if (env->idle == CPU_NEWLY_IDLE) {
+ env->flags &= ~LBF_NUMA;
+ goto out;
+ }
#endif
+ /* not idle anymore after pulling first task */
+ env->idle = CPU_NOT_IDLE;
/*
* We only want to steal up to the prescribed amount of
@@ -3289,6 +3325,17 @@ static int move_tasks(struct lb_env *env)
next:
list_move_tail(&p->se.group_node, tasks);
}
+ if ((env->flags & (LBF_NUMA|LBF_NEED_BREAK)) == LBF_NUMA) {
+ env->flags &= ~LBF_NUMA;
+ if (env->imbalance > 0) {
+ env->loop = 0;
+ env->loop_break = sched_nr_migrate_break;
+ goto numa_repeat;
+ }
+ }
+#ifdef CONFIG_PREEMPT
+out:
+#endif
/*
* Right now, this is one of only two places move_task() is called,
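The control flow the hunks above add to move_one_task()/move_tasks()
can be condensed into the following sketch (pseudocode:
for_each_candidate_task() and move_task_succeeds() stand in for the
existing cfs_tasks walk and move_task() path, the rest mirrors the diff):

        /* first pass: only tasks whose autonuma_node matches the dst_cpu */
        env->flags |= LBF_NUMA;
numa_repeat:
        for_each_candidate_task(p, env) {
                if (!sched_autonuma_can_migrate_task(p, env->flags & LBF_NUMA,
                                                     env->dst_cpu, env->idle))
                        continue;       /* only skipped while LBF_NUMA is set */
                if (move_task_succeeds(p, env))
                        return 1;
        }
        if (env->flags & LBF_NUMA) {
                /* second pass: drop the NUMA filter so fairness still wins */
                env->flags &= ~LBF_NUMA;
                goto numa_repeat;
        }
        return 0;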
On Thu, 2012-06-28 at 14:55 +0200, Andrea Arcangeli wrote:
> +/*
> + * This function sched_autonuma_balance() is responsible for deciding
> + * which is the best CPU each process should be running on according
> + * to the NUMA statistics collected in mm->mm_autonuma and
> + * tsk->task_autonuma.
> + *
> + * The core math that evaluates the current CPU against the CPUs of
> + * all _other_ nodes is this:
> + *
> + * if (w_nid > w_other && w_nid > w_cpu_nid)
> + * weight = w_nid - w_other + w_nid - w_cpu_nid;
> + *
> + * w_nid: NUMA affinity of the current thread/process if run on the
> + * other CPU.
> + *
> + * w_other: NUMA affinity of the other thread/process if run on the
> + * other CPU.
> + *
> + * w_cpu_nid: NUMA affinity of the current thread/process if run on
> + * the current CPU.
> + *
> + * weight: combined NUMA affinity benefit in moving the current
> + * thread/process to the other CPU taking into account both the
> higher
> + * NUMA affinity of the current process if run on the other CPU, and
> + * the increase in NUMA affinity in the other CPU by replacing the
> + * other process.
A lot of words, all meaningless without a proper definition of w_*
stuff. How are they calculated and why.
> + * We run the above math on every CPU not part of the current NUMA
> + * node, and we compare the current process against the other
> + * processes running in the other CPUs in the remote NUMA nodes. The
> + * objective is to select the cpu (in selected_cpu) with a bigger
> + * "weight". The bigger the "weight" the biggest gain we'll get by
> + * moving the current process to the selected_cpu (not only the
> + * biggest immediate CPU gain but also the fewer async memory
> + * migrations that will be required to reach full convergence
> + * later). If we select a cpu we migrate the current process to it.
So you do something like:
max_(i, node(i) != curr_node) { weight_i }
That is, you have this weight, then what do you do?
> + * Checking that the current process has higher NUMA affinity than
> the
> + * other process on the other CPU (w_nid > w_other) and not only that
> + * the current process has higher NUMA affinity on the other CPU than
> + * on the current CPU (w_nid > w_cpu_nid) completely avoids ping
> pongs
> + * and ensures (temporary) convergence of the algorithm (at least
> from
> + * a CPU standpoint).
How does that follow?
> + * It's then up to the idle balancing code that will run as soon as
> + * the current CPU goes idle to pick the other process and move it
> + * here (or in some other idle CPU if any).
> + *
> + * By only evaluating running processes against running processes we
> + * avoid interfering with the CFS stock active idle balancing, which
> + * is critical to optimal performance with HT enabled. (getting HT
> + * wrong is worse than running on remote memory so the active idle
> + * balancing has priority)
what?
> + * Idle balancing and all other CFS load balancing become NUMA
> + * affinity aware through the introduction of
> + * sched_autonuma_can_migrate_task(). CFS searches CPUs in the task's
> + * autonuma_node first when it needs to find idle CPUs during idle
> + * balancing or tasks to pick during load balancing.
You talk a lot about idle balance, but there's zero mention of fairness.
This is worrisome.
> + * The task's autonuma_node is the node selected by
> + * sched_autonuma_balance() when it migrates a task to the
> + * selected_cpu in the selected_nid
I think I already said that strict was out of the question and hard
movement like that simply didn't make sense.
> + * Once a process/thread has been moved to another node, closer to
> the
> + * much of memory it has recently accessed,
closer to the recently accessed memory you mean?
> any memory for that task
> + * not in the new node moves slowly (asynchronously in the
> background)
> + * to the new node. This is done by the knuma_migratedN (where the
> + * suffix N is the node id) daemon described in mm/autonuma.c.
> + *
> + * One non trivial bit of this logic that deserves an explanation is
> + * how the three crucial variables of the core math
> + * (w_nid/w_other/wcpu_nid) are going to change depending on whether
> + * the other CPU is running a thread of the current process, or a
> + * thread of a different process.
No no no,.. its not a friggin detail, its absolutely crucial. Also, if
you'd given proper definition you wouldn't need to hand wave your way
around the dynamics either because that would simply follow from the
definition.
<snip terrible example>
> + * Before scanning all other CPUs' runqueues to compute the above
> + * math,
OK, let's stop calling the one isolated conditional you mentioned 'math'.
On its own it's useless.
> we also verify that the current CPU is not already in the
> + * preferred NUMA node from the point of view of both the process
> + * statistics and the thread statistics. In such case we can return
> to
> + * the caller without having to check any other CPUs' runqueues
> + * because full convergence has been already reached.
Things being in the 'preferred' place don't have much to do with
convergence. Does your model have local minima/maxima where it can get
stuck, or does it always find a global min/max?
> + * This algorithm might be expanded to take all runnable processes
> + * into account but examining just the currently running processes is
> + * a good enough approximation because some runnable processes may
> run
> + * only for a short time so statistically there will always be a bias
> + * on the processes that uses most the of the CPU. This is ideal
> + * because it doesn't matter if NUMA balancing isn't optimal for
> + * processes that run only for a short time.
Almost, but not quite.. it would be so if the sampling could be proven
to be unbiased. But its quite possible for a task to consume most cpu
time and never show up as the current task in your load-balance run.
As it stands you wrote a lot of words.. but none of them were really
helpful in understanding what you do.
On Thu, 2012-06-28 at 14:55 +0200, Andrea Arcangeli wrote:
> +#ifdef __ia64__
> +#error "NOTE: tlb_migrate_finish won't run here"
> +#endif
https://lkml.org/lkml/2012/5/29/359
It's an optional thing; not running it isn't fatal at all.
Also, ia64 has CONFIG_NUMA so all this code had better run on it.
That said, I've also already told you to stop using such forceful
migration, that simply doesn't make any sense, numa balancing isn't that
critical.
Unless you're going to listen to feedback I give you, I'm going to
completely stop reading your patches, I don't give a rats arse you work
for the same company anymore.
You're impossible to work with.
On 06/28/2012 05:55 AM, Andrea Arcangeli wrote:
> We will set these bitflags only when the pmd and pte is non present.
>
Just a couple grammar nitpicks.
> They work like PROT_NONE but they identify a request for the numa
> hinting page fault to trigger.
>
> Because we want to be able to set these bitflag in any established pte
these bitflags
> or pmd (while clearing the present bit at the same time) without
> losing information, these bitflags must never be set when the pte and
> pmd are present.
>
> For _PAGE_NUMA_PTE the pte bitflag used is _PAGE_PSE, which cannot be
> set on ptes and it also fits in between _PAGE_FILE and _PAGE_PROTNONE
> which avoids having to alter the swp entries format.
>
> For _PAGE_NUMA_PMD, we use a reserved bitflag. pmds never contain
> swap_entries but if in the future we'll swap transparent hugepages, we
> must keep in mind not to use the _PAGE_UNUSED2 bitflag in the swap
> entry format and to start the swap entry offset above it.
>
> PAGE_UNUSED2 is used by Xen but only on ptes established by ioremap,
> but it's never used on pmds so there's no risk of collision with Xen.
Maybe "but only on ptes established by ioremap, never on pmds so
there's no risk of collision with Xen." ? The extra "but" just
doesn't flow in the original.
Don Morris
>
> Signed-off-by: Andrea Arcangeli <[email protected]>
> ---
> arch/x86/include/asm/pgtable_types.h | 11 +++++++++++
> 1 files changed, 11 insertions(+), 0 deletions(-)
>
> diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
> index b74cac9..6e2d954 100644
> --- a/arch/x86/include/asm/pgtable_types.h
> +++ b/arch/x86/include/asm/pgtable_types.h
> @@ -71,6 +71,17 @@
> #define _PAGE_FILE (_AT(pteval_t, 1) << _PAGE_BIT_FILE)
> #define _PAGE_PROTNONE (_AT(pteval_t, 1) << _PAGE_BIT_PROTNONE)
>
> +/*
> + * Cannot be set on pte. The fact it's in between _PAGE_FILE and
> + * _PAGE_PROTNONE avoids having to alter the swp entries.
> + */
> +#define _PAGE_NUMA_PTE _PAGE_PSE
> +/*
> + * Cannot be set on pmd, if transparent hugepages will be swapped out
> + * the swap entry offset must start above it.
> + */
> +#define _PAGE_NUMA_PMD _PAGE_UNUSED2
> +
> #define _PAGE_TABLE (_PAGE_PRESENT | _PAGE_RW | _PAGE_USER | \
> _PAGE_ACCESSED | _PAGE_DIRTY)
> #define _KERNPG_TABLE (_PAGE_PRESENT | _PAGE_RW | _PAGE_ACCESSED | \
>
Hi Don,
On Thu, Jun 28, 2012 at 08:13:11AM -0700, Don Morris wrote:
> On 06/28/2012 05:55 AM, Andrea Arcangeli wrote:
> > We will set these bitflags only when the pmd and pte is non present.
> >
>
> Just a couple grammar nitpicks.
>
> > They work like PROT_NONE but they identify a request for the numa
> > hinting page fault to trigger.
> >
> > Because we want to be able to set these bitflag in any established pte
>
> these bitflags
>
> > or pmd (while clearing the present bit at the same time) without
> > losing information, these bitflags must never be set when the pte and
> > pmd are present.
> >
> > For _PAGE_NUMA_PTE the pte bitflag used is _PAGE_PSE, which cannot be
> > set on ptes and it also fits in between _PAGE_FILE and _PAGE_PROTNONE
> > which avoids having to alter the swp entries format.
> >
> > For _PAGE_NUMA_PMD, we use a reserved bitflag. pmds never contain
> > swap_entries but if in the future we'll swap transparent hugepages, we
> > must keep in mind not to use the _PAGE_UNUSED2 bitflag in the swap
> > entry format and to start the swap entry offset above it.
> >
> > PAGE_UNUSED2 is used by Xen but only on ptes established by ioremap,
> > but it's never used on pmds so there's no risk of collision with Xen.
>
> Maybe "but only on ptes established by ioremap, never on pmds so
> there's no risk of collision with Xen." ? The extra "but" just
> doesn't flow in the original.
Agreed and applied, thanks!
Andrea
On Thu, Jun 28, 2012 at 10:53 PM, Peter Zijlstra <[email protected]> wrote:
>
> Unless you're going to listen to feedback I give you, I'm going to
> completely stop reading your patches, I don't give a rats arse you work
> for the same company anymore.
>
Are you brought up, Peter, in dirty environment with mind polluted?
And stop shaming RedHat anymore, since you are no longer teenager.
If you did nothing wrong, give me the email address of your manager
at Red Hat in reply, please.
> You're impossible to work with.
Clearly you show that you are not sane enough and able to work with others.
* Hillf Danton <[email protected]> wrote:
> On Thu, Jun 28, 2012 at 10:53 PM, Peter Zijlstra <[email protected]> wrote:
> >
> > Unless you're going to listen to feedback I give you, I'm
> > going to completely stop reading your patches, I don't give
> > a rats arse you work for the same company anymore.
>
> Are you brought up, Peter, in dirty environment with mind
> polluted?
You do not seem to be aware of the history of this patch-set,
I suspect Peter got "polluted" by Andrea ignoring his repeated
review feedback...
If his multiple rounds of polite (and extensive) review didn't
have much of an effect then maybe some amount of not so nice
shouting has more of an effect?
The other option would be to NAK and ignore the patchset, in
that sense Peter is a lot more constructive and forward looking
than a polite NAK would be, even if the language is rough.
Thanks,
Ingo
On 06/28/2012 22:46, Peter Zijlstra wrote:
> On Thu, 2012-06-28 at 14:55 +0200, Andrea Arcangeli wrote:
>> +/*
>> + * This function sched_autonuma_balance() is responsible for deciding
>> + * which is the best CPU each process should be running on according
>> + * to the NUMA statistics collected in mm->mm_autonuma and
>> + * tsk->task_autonuma.
>> + *
>> + * The core math that evaluates the current CPU against the CPUs of
>> + * all _other_ nodes is this:
>> + *
>> + * if (w_nid > w_other && w_nid > w_cpu_nid)
>> + * weight = w_nid - w_other + w_nid - w_cpu_nid;
>> + *
>> + * w_nid: NUMA affinity of the current thread/process if run on the
>> + * other CPU.
>> + *
>> + * w_other: NUMA affinity of the other thread/process if run on the
>> + * other CPU.
>> + *
>> + * w_cpu_nid: NUMA affinity of the current thread/process if run on
>> + * the current CPU.
>> + *
>> + * weight: combined NUMA affinity benefit in moving the current
>> + * thread/process to the other CPU taking into account both the
>> higher
>> + * NUMA affinity of the current process if run on the other CPU, and
>> + * the increase in NUMA affinity in the other CPU by replacing the
>> + * other process.
>
> A lot of words, all meaningless without a proper definition of w_*
> stuff. How are they calculated and why.
>
>> + * We run the above math on every CPU not part of the current NUMA
>> + * node, and we compare the current process against the other
>> + * processes running in the other CPUs in the remote NUMA nodes. The
>> + * objective is to select the cpu (in selected_cpu) with a bigger
>> + * "weight". The bigger the "weight" the biggest gain we'll get by
>> + * moving the current process to the selected_cpu (not only the
>> + * biggest immediate CPU gain but also the fewer async memory
>> + * migrations that will be required to reach full convergence
>> + * later). If we select a cpu we migrate the current process to it.
>
> So you do something like:
>
> max_(i, node(i) != curr_node) { weight_i }
>
> That is, you have this weight, then what do you do?
>
>> + * Checking that the current process has higher NUMA affinity than
>> the
>> + * other process on the other CPU (w_nid > w_other) and not only that
>> + * the current process has higher NUMA affinity on the other CPU than
>> + * on the current CPU (w_nid > w_cpu_nid) completely avoids ping
>> pongs
>> + * and ensures (temporary) convergence of the algorithm (at least
>> from
>> + * a CPU standpoint).
>
> How does that follow?
>
>> + * It's then up to the idle balancing code that will run as soon as
>> + * the current CPU goes idle to pick the other process and move it
>> + * here (or in some other idle CPU if any).
>> + *
>> + * By only evaluating running processes against running processes we
>> + * avoid interfering with the CFS stock active idle balancing, which
>> + * is critical to optimal performance with HT enabled. (getting HT
>> + * wrong is worse than running on remote memory so the active idle
>> + * balancing has priority)
>
> what?
>
>> + * Idle balancing and all other CFS load balancing become NUMA
>> + * affinity aware through the introduction of
>> + * sched_autonuma_can_migrate_task(). CFS searches CPUs in the task's
>> + * autonuma_node first when it needs to find idle CPUs during idle
>> + * balancing or tasks to pick during load balancing.
>
> You talk a lot about idle balance, but there's zero mention of fairness.
> This is worrisome.
>
>> + * The task's autonuma_node is the node selected by
>> + * sched_autonuma_balance() when it migrates a task to the
>> + * selected_cpu in the selected_nid
>
> I think I already said that strict was out of the question and hard
> movement like that simply didn't make sense.
>
>> + * Once a process/thread has been moved to another node, closer to
>> the
>> + * much of memory it has recently accessed,
>
> closer to the recently accessed memory you mean?
>
>> any memory for that task
>> + * not in the new node moves slowly (asynchronously in the
>> background)
>> + * to the new node. This is done by the knuma_migratedN (where the
>> + * suffix N is the node id) daemon described in mm/autonuma.c.
>> + *
>> + * One non trivial bit of this logic that deserves an explanation is
>> + * how the three crucial variables of the core math
>> + * (w_nid/w_other/wcpu_nid) are going to change depending on whether
>> + * the other CPU is running a thread of the current process, or a
>> + * thread of a different process.
>
> No no no,.. its not a friggin detail, its absolutely crucial. Also, if
> you'd given proper definition you wouldn't need to hand wave your way
> around the dynamics either because that would simply follow from the
> definition.
>
> <snip terrible example>
>
>> + * Before scanning all other CPUs' runqueues to compute the above
>> + * math,
>
> OK, let's stop calling the one isolated conditional you mentioned 'math'.
> On its own it's useless.
>
>> we also verify that the current CPU is not already in the
>> + * preferred NUMA node from the point of view of both the process
>> + * statistics and the thread statistics. In such case we can return
>> to
>> + * the caller without having to check any other CPUs' runqueues
>> + * because full convergence has been already reached.
>
> Things being in the 'preferred' place don't have much to do with
> convergence. Does your model have local minima/maxima where it can get
> stuck, or does it always find a global min/max?
>
>
>> + * This algorithm might be expanded to take all runnable processes
>> + * into account but examining just the currently running processes is
>> + * a good enough approximation because some runnable processes may
>> run
>> + * only for a short time so statistically there will always be a bias
>> + * on the processes that uses most the of the CPU. This is ideal
>> + * because it doesn't matter if NUMA balancing isn't optimal for
>> + * processes that run only for a short time.
>
> Almost, but not quite.. it would be so if the sampling could be proven
> to be unbiased. But its quite possible for a task to consume most cpu
> time and never show up as the current task in your load-balance run.
Same here, I have another similar question regarding sampling:
If one process visits a small set of pages in this node very
intensively, but only occasionally visits a large set of pages in
another node, will this algorithm make a very bad judgment? I guess
the answer would be: it's possible, and the judgment depends on the
racing pattern between the process and your knuma_scand.
Usually, when we use sampling, we work on the assumption that if the
sampling is not accurate we only lose a chance for better
optimization, but we do NOT make bad/false judgments.
Andrea, sorry, I don't have enough time to look into all your patches'
details (and also since I'm not on the CCs ;-) ),
but my intuition tells me that your current sampling and weight
algorithm is far from optimal.
>
>
>
> As it stands you wrote a lot of words.. but none of them were really
> helpful in understanding what you do.
>
On 06/28/2012 08:55 AM, Andrea Arcangeli wrote:
> Very minor optimization to hint gcc.
>
> Signed-off-by: Andrea Arcangeli<[email protected]>
Acked-by: Rik van Riel <[email protected]>
This looks like something that could be submitted separately,
reducing the size of your autonuma patch series a little...
--
All rights reversed
On 06/28/2012 08:55 AM, Andrea Arcangeli wrote:
> set_pmd_at() will also be used for the knuma_scand/pmd = 1 (default)
> mode even when TRANSPARENT_HUGEPAGE=n. Make it available so the build
> won't fail.
>
> Signed-off-by: Andrea Arcangeli<[email protected]>
Acked-by: Rik van Riel <[email protected]>
--
All rights reversed
On 06/28/2012 08:55 AM, Andrea Arcangeli wrote:
> is_vma_temporary_stack() is needed by mm/autonuma.c too, and without
> this the build breaks with CONFIG_TRANSPARENT_HUGEPAGE=n.
>
> Reported-by: Petr Holasek<[email protected]>
> Signed-off-by: Andrea Arcangeli<[email protected]>
Acked-by: Rik van Riel <[email protected]>
--
All rights reversed
On 06/28/2012 08:55 AM, Andrea Arcangeli wrote:
> Xen has taken over the last reserved bit available for the pagetables
> which is set through ioremap, this documents it and makes the code
> more readable.
>
> Signed-off-by: Andrea Arcangeli<[email protected]>
> ---
> arch/x86/include/asm/pgtable_types.h | 11 +++++++++--
> 1 files changed, 9 insertions(+), 2 deletions(-)
>
> diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
> index 013286a..b74cac9 100644
> --- a/arch/x86/include/asm/pgtable_types.h
> +++ b/arch/x86/include/asm/pgtable_types.h
> @@ -17,7 +17,7 @@
> #define _PAGE_BIT_PAT 7 /* on 4KB pages */
> #define _PAGE_BIT_GLOBAL 8 /* Global TLB entry PPro+ */
> #define _PAGE_BIT_UNUSED1 9 /* available for programmer */
> -#define _PAGE_BIT_IOMAP 10 /* flag used to indicate IO mapping */
> +#define _PAGE_BIT_UNUSED2 10
Considering that Xen is using it, it is not really
unused, is it?
Not that I can think of a better name, considering
you are using this bit for something else at the PMD
level...
--
All rights reversed
On 06/28/2012 08:55 AM, Andrea Arcangeli wrote:
> +/*
> + * Cannot be set on pte. The fact it's in between _PAGE_FILE and
> + * _PAGE_PROTNONE avoids having to alter the swp entries.
> + */
> +#define _PAGE_NUMA_PTE _PAGE_PSE
> +/*
> + * Cannot be set on pmd, if transparent hugepages will be swapped out
> + * the swap entry offset must start above it.
> + */
> +#define _PAGE_NUMA_PMD _PAGE_UNUSED2
Those comments only tell us what the flags can NOT be
used for, not what they are actually used for.
That needs to be fixed.
--
All rights reversed
On 06/28/2012 08:55 AM, Andrea Arcangeli wrote:
> static inline int pte_file(pte_t pte)
> {
> - return pte_flags(pte) & _PAGE_FILE;
> + return (pte_flags(pte) & _PAGE_FILE) == _PAGE_FILE;
> }
Wait, why is this change made? Surely _PAGE_FILE is just
one single bit and this change is not useful?
If there is a reason for this change, please document it.
> @@ -405,7 +405,9 @@ static inline int pte_same(pte_t a, pte_t b)
>
> static inline int pte_present(pte_t a)
> {
> - return pte_flags(a) & (_PAGE_PRESENT | _PAGE_PROTNONE);
> + /* _PAGE_NUMA includes _PAGE_PROTNONE */
> + return pte_flags(a) & (_PAGE_PRESENT | _PAGE_PROTNONE |
> + _PAGE_NUMA_PTE);
> }
>
> static inline int pte_hidden(pte_t pte)
> @@ -415,7 +417,46 @@ static inline int pte_hidden(pte_t pte)
>
> static inline int pmd_present(pmd_t pmd)
> {
> - return pmd_flags(pmd) & _PAGE_PRESENT;
> + return pmd_flags(pmd) & (_PAGE_PRESENT | _PAGE_PROTNONE |
> + _PAGE_NUMA_PMD);
> +}
Somewhat subtle. Better documentation in patch 5 will
help explain this.
> +#ifdef CONFIG_AUTONUMA
> +static inline int pte_numa(pte_t pte)
> +{
> + return (pte_flags(pte) &
> + (_PAGE_NUMA_PTE|_PAGE_PRESENT)) == _PAGE_NUMA_PTE;
> +}
> +
> +static inline int pmd_numa(pmd_t pmd)
> +{
> + return (pmd_flags(pmd) &
> + (_PAGE_NUMA_PMD|_PAGE_PRESENT)) == _PAGE_NUMA_PMD;
> +}
> +#endif
These could use a little explanation of how _PAGE_NUMA_* is
used and what the flags mean.
> +static inline pte_t pte_mknotnuma(pte_t pte)
> +{
> + pte = pte_clear_flags(pte, _PAGE_NUMA_PTE);
> + return pte_set_flags(pte, _PAGE_PRESENT|_PAGE_ACCESSED);
> +}
> +
> +static inline pmd_t pmd_mknotnuma(pmd_t pmd)
> +{
> + pmd = pmd_clear_flags(pmd, _PAGE_NUMA_PMD);
> + return pmd_set_flags(pmd, _PAGE_PRESENT|_PAGE_ACCESSED);
> +}
> +
> +static inline pte_t pte_mknuma(pte_t pte)
> +{
> + pte = pte_set_flags(pte, _PAGE_NUMA_PTE);
> + return pte_clear_flags(pte, _PAGE_PRESENT);
> +}
> +
> +static inline pmd_t pmd_mknuma(pmd_t pmd)
> +{
> + pmd = pmd_set_flags(pmd, _PAGE_NUMA_PMD);
> + return pmd_clear_flags(pmd, _PAGE_PRESENT);
> }
These functions could use some explanation, too.
Why do the top ones set _PAGE_ACCESSED, while the bottom ones
leave _PAGE_ACCESSED alone?
I can guess the answer, but it should be documented so it is
also clear to people with less experience in the VM.
--
All rights reversed
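As a reading aid for the checks quoted above, a small self-contained
sketch of the encoding (plain C; the numeric flag values below are
illustrative stand-ins, all that matters is that the three bits are
distinct, as they are on x86):

#include <assert.h>

#define _PAGE_PRESENT   0x001
#define _PAGE_PROTNONE  0x100
#define _PAGE_NUMA_PTE  0x080   /* a bit never otherwise set on ptes */

/* a NUMA hinting pte must still count as present for the rest of the VM */
static int pte_present(unsigned long flags)
{
        return flags & (_PAGE_PRESENT | _PAGE_PROTNONE | _PAGE_NUMA_PTE);
}

/* NUMA hinting is signalled by _PAGE_NUMA_PTE set while the hardware
 * present bit is clear, so the access faults and the VM knows why */
static int pte_numa(unsigned long flags)
{
        return (flags & (_PAGE_NUMA_PTE | _PAGE_PRESENT)) == _PAGE_NUMA_PTE;
}

int main(void)
{
        unsigned long pte = _PAGE_PRESENT;               /* mapped pte */
        pte = (pte & ~_PAGE_PRESENT) | _PAGE_NUMA_PTE;   /* pte_mknuma */
        assert(pte_present(pte) && pte_numa(pte));
        pte = (pte & ~_PAGE_NUMA_PTE) | _PAGE_PRESENT;   /* pte_mknotnuma */
        assert(pte_present(pte) && !pte_numa(pte));
        return 0;
}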
On 06/28/2012 08:55 AM, Andrea Arcangeli wrote:
> Implement generic version of the methods. They're used when
> CONFIG_AUTONUMA=n, and they're a noop.
>
> Signed-off-by: Andrea Arcangeli<[email protected]>
Acked-by: Rik van Riel <[email protected]>
--
All rights reversed
On 06/28/2012 08:55 AM, Andrea Arcangeli wrote:
> gup_fast will skip over non present ptes (pte_numa requires the pte to
> be non present). So no explicit check is needed for pte_numa in the
> pte case.
>
> gup_fast will also automatically skip over THP when the trans huge pmd
> is non present (pmd_numa requires the pmd to be non present).
>
> But for the special pmd mode scan of knuma_scand
> (/sys/kernel/mm/autonuma/knuma_scand/pmd == 1), the pmd may be of numa
> type (so non present too), the pte may be present. gup_pte_range
> wouldn't notice the pmd is of numa type. So to avoid losing a NUMA
> hinting page fault with gup_fast we need an explicit check for
> pmd_numa() here to be sure it will fault through gup ->
> handle_mm_fault.
>
> Signed-off-by: Andrea Arcangeli<[email protected]>
Assuming pmd_numa will get the documentation I asked for a few
patches back, this patch is fine, since people will just be able
to look at a nice comment above pmd_numa and see what is going on.
Acked-by: Rik van Riel <[email protected]>
--
All rights reversed
On 06/28/2012 08:55 AM, Andrea Arcangeli wrote:
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1792,7 +1792,7 @@ extern void thread_group_times(struct task_struct *p, cputime_t *ut, cputime_t *
> #define PF_SWAPWRITE 0x00800000 /* Allowed to write to swap */
> #define PF_SPREAD_PAGE 0x01000000 /* Spread page cache over cpuset */
> #define PF_SPREAD_SLAB 0x02000000 /* Spread some slab caches over cpuset */
> -#define PF_THREAD_BOUND 0x04000000 /* Thread bound to specific cpu */
> +#define PF_THREAD_BOUND 0x04000000 /* Thread bound to specific cpus */
> #define PF_MCE_EARLY 0x08000000 /* Early kill for mce process policy */
> #define PF_MEMPOLICY 0x10000000 /* Non-default NUMA mempolicy */
> #define PF_MUTEX_TESTER 0x20000000 /* Thread belongs to the rt mutex tester */
Changing the semantics of PF_THREAD_BOUND without so much as
a comment in your changelog or buy-in from the scheduler
maintainers is a big no-no.
Is there any reason you even need PF_THREAD_BOUND in your
kernel numa threads?
I do not see much at all in the scheduler code that uses
PF_THREAD_BOUND and it is not clear at all that your
numa threads get any benefit from them...
Why do you think you need it?
--
All rights reversed
On 06/28/2012 08:55 AM, Andrea Arcangeli wrote:
You tell us when the data structures are not allocated, but
you do not tell us how the data structures are used, or what
the fields inside the data structures mean.
This makes it very hard for other people to figure out the
code later. Please document these kinds of things properly.
> +/*
> + * Per-mm (process) structure dynamically allocated only if autonuma
> + * is not impossible. This links the mm to scan into the
> + * knuma_scand.mm_head and it contains the NUMA memory placement
> + * statistics for the process (generated by knuma_scand).
> + */
> +struct mm_autonuma {
> + /* list node to link the "mm" into the knuma_scand.mm_head */
> + struct list_head mm_node;
> + struct mm_struct *mm;
> + unsigned long mm_numa_fault_pass; /* zeroed from here during allocation */
> + unsigned long mm_numa_fault_tot;
> + unsigned long mm_numa_fault[0];
> +};
> +/*
> + * Per-task (thread) structure dynamically allocated only if autonuma
> + * is not impossible. This contains the preferred autonuma_node where
> + * the userland thread should be scheduled into (only relevant if
> + * tsk->mm is not null) and the per-thread NUMA accesses statistics
> + * (generated by the NUMA hinting page faults).
> + */
> +struct task_autonuma {
> + int autonuma_node;
> + /* zeroed from the below field during allocation */
> + unsigned long task_numa_fault_pass;
> + unsigned long task_numa_fault_tot;
> + unsigned long task_numa_fault[0];
> +};
--
All rights reversed
On Fri, 2012-06-29 at 11:36 -0400, Rik van Riel wrote:
> On 06/28/2012 08:55 AM, Andrea Arcangeli wrote:
>
> > --- a/include/linux/sched.h
> > +++ b/include/linux/sched.h
> > @@ -1792,7 +1792,7 @@ extern void thread_group_times(struct task_struct *p, cputime_t *ut, cputime_t *
> > #define PF_SWAPWRITE 0x00800000 /* Allowed to write to swap */
> > #define PF_SPREAD_PAGE 0x01000000 /* Spread page cache over cpuset */
> > #define PF_SPREAD_SLAB 0x02000000 /* Spread some slab caches over cpuset */
> > -#define PF_THREAD_BOUND 0x04000000 /* Thread bound to specific cpu */
> > +#define PF_THREAD_BOUND 0x04000000 /* Thread bound to specific cpus */
> > #define PF_MCE_EARLY 0x08000000 /* Early kill for mce process policy */
> > #define PF_MEMPOLICY 0x10000000 /* Non-default NUMA mempolicy */
> > #define PF_MUTEX_TESTER 0x20000000 /* Thread belongs to the rt mutex tester */
>
> Changing the semantics of PF_THREAD_BOUND without so much as
> a comment in your changelog or buy-in from the scheduler
> maintainers is a big no-no.
In fact I've already said a number of times this patch isn't going
anywhere.
On 06/29/2012 12:04 PM, Peter Zijlstra wrote:
> On Fri, 2012-06-29 at 11:36 -0400, Rik van Riel wrote:
>> On 06/28/2012 08:55 AM, Andrea Arcangeli wrote:
>>
>>> --- a/include/linux/sched.h
>>> +++ b/include/linux/sched.h
>>> @@ -1792,7 +1792,7 @@ extern void thread_group_times(struct task_struct *p, cputime_t *ut, cputime_t *
>>> #define PF_SWAPWRITE 0x00800000 /* Allowed to write to swap */
>>> #define PF_SPREAD_PAGE 0x01000000 /* Spread page cache over cpuset */
>>> #define PF_SPREAD_SLAB 0x02000000 /* Spread some slab caches over cpuset */
>>> -#define PF_THREAD_BOUND 0x04000000 /* Thread bound to specific cpu */
>>> +#define PF_THREAD_BOUND 0x04000000 /* Thread bound to specific cpus */
>>> #define PF_MCE_EARLY 0x08000000 /* Early kill for mce process policy */
>>> #define PF_MEMPOLICY 0x10000000 /* Non-default NUMA mempolicy */
>>> #define PF_MUTEX_TESTER 0x20000000 /* Thread belongs to the rt mutex tester */
>>
>> Changing the semantics of PF_THREAD_BOUND without so much as
>> a comment in your changelog or buy-in from the scheduler
>> maintainers is a big no-no.
>
> In fact I've already said a number of times this patch isn't going
> anywhere.
Which is probably fine, because I see no reason why Andrea's
numa threads would need PF_THREAD_BOUND in the first place.
Andrea?
--
All rights reversed
On 06/28/2012 08:55 AM, Andrea Arcangeli wrote:
> These flags are the ones tweaked through sysfs, they control the
> behavior of autonuma, from enabling disabling it, to selecting various
> runtime options.
That's all fine and dandy, but what do these flags mean?
How do you expect people to be able to maintain this code,
or control autonuma behaviour, when these flags are not
documented at all?
Please document them.
> +enum autonuma_flag {
> + AUTONUMA_FLAG,
> + AUTONUMA_IMPOSSIBLE_FLAG,
> + AUTONUMA_DEBUG_FLAG,
> + AUTONUMA_SCHED_LOAD_BALANCE_STRICT_FLAG,
> + AUTONUMA_SCHED_CLONE_RESET_FLAG,
> + AUTONUMA_SCHED_FORK_RESET_FLAG,
> + AUTONUMA_SCAN_PMD_FLAG,
> + AUTONUMA_SCAN_USE_WORKING_SET_FLAG,
> + AUTONUMA_MIGRATE_DEFER_FLAG,
> +};
--
All rights reversed
On Fri, Jun 29, 2012 at 11:36:26AM -0400, Rik van Riel wrote:
> On 06/28/2012 08:55 AM, Andrea Arcangeli wrote:
>
> > --- a/include/linux/sched.h
> > +++ b/include/linux/sched.h
> > @@ -1792,7 +1792,7 @@ extern void thread_group_times(struct task_struct *p, cputime_t *ut, cputime_t *
> > #define PF_SWAPWRITE 0x00800000 /* Allowed to write to swap */
> > #define PF_SPREAD_PAGE 0x01000000 /* Spread page cache over cpuset */
> > #define PF_SPREAD_SLAB 0x02000000 /* Spread some slab caches over cpuset */
> > -#define PF_THREAD_BOUND 0x04000000 /* Thread bound to specific cpu */
> > +#define PF_THREAD_BOUND 0x04000000 /* Thread bound to specific cpus */
> > #define PF_MCE_EARLY 0x08000000 /* Early kill for mce process policy */
> > #define PF_MEMPOLICY 0x10000000 /* Non-default NUMA mempolicy */
> > #define PF_MUTEX_TESTER 0x20000000 /* Thread belongs to the rt mutex tester */
>
> Changing the semantics of PF_THREAD_BOUND without so much as
> a comment in your changelog or buy-in from the scheduler
> maintainers is a big no-no.
>
> Is there any reason you even need PF_THREAD_BOUND in your
> kernel numa threads?
>
> I do not see much at all in the scheduler code that uses
> PF_THREAD_BOUND and it is not clear at all that your
> numa threads get any benefit from them...
>
> Why do you think you need it?
Nobody needs that flag anyway, you can drop the flag from the kernel
in all places it is used, it is never "needed", but since somebody
bothered to add this reliability feature to the kernel, why not
take advantage of it whenever possible?
This flag is only used to prevent userland from messing with the kernel CPU
bindings of kernel threads. It is used to keep the root user from shooting
himself in the foot.
So far it has been used to prevent changing bindings to a single
CPU. I'm also setting it after making a multiple-CPU bind (all CPUs of
the node, instead of just 1 CPU). I hope it's clear to everybody that
this is perfectly OK usage and, whether the bind is done on 1 CPU, 10 CPUs
or all CPUs, nothing changes in how the bitflag works.
There's no legitimate reason to allow the root user to change the CPU
binding of knuma_migratedN. Any change would be a guaranteed
regression. So there's no reason not to enforce the node-wide binding,
if nothing else to document that the binding enforced is the ideal one
in all possible conditions.
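The usage pattern being defended boils down to something like the
following sketch (the function name is made up; set_cpus_allowed_ptr(),
cpumask_of_node() and PF_THREAD_BOUND are the existing pieces being
combined, and in practice this is done while the kthread is not running):

/* bind a kernel thread to every CPU of "nid", then mark it so that
 * userland, including root, can no longer override the affinity */
static void bind_kthread_to_node(struct task_struct *tsk, int nid)
{
        set_cpus_allowed_ptr(tsk, cpumask_of_node(nid));
        tsk->flags |= PF_THREAD_BOUND;  /* sched_setaffinity() now refuses */
}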
On 06/29/2012 08:55 AM, Ingo Molnar wrote:
>
> * Hillf Danton <[email protected]> wrote:
>
>> On Thu, Jun 28, 2012 at 10:53 PM, Peter Zijlstra <[email protected]> wrote:
>>>
>>> Unless you're going to listen to feedback I give you, I'm
>>> going to completely stop reading your patches, I don't give
>>> a rats arse you work for the same company anymore.
>>
>> Are you brought up, Peter, in dirty environment with mind
>> polluted?
>
> You do not seem to be aware of the history of this patch-set,
> I suspect Peter got "polluted" by Andrea ignoring his repeated
> review feedbacks...
AFAIK, Andrea answered many of Peter's requests by reducing the memory
overhead, adding documentation and changing the scheduler integration.
When someone plants 'crap' too often in his comments, it's not a surprise
that some will get ignored. Moreover, I don't think the decent comments
got ignored; sometimes both were talking in parallel lines - even in
this case, it's hard to say whether Peter would like ia64 support added or
would just like to get rid of the forceful migration as a whole.
Since it takes more time to fully understand the code than to write the
comments, I suggest going the extra mile there and making sure the review
is crystal clear.
>
> If his multiple rounds of polite (and extensive) review didn't
> have much of an effect then maybe some amount of not so nice
> shouting has more of an effect?
>
> The other option would be to NAK and ignore the patchset, in
> that sense Peter is a lot more constructive and forward looking
> than a polite NAK would be, even if the language is rough.
A NAK is better with further explanation or even a suggestion of
alternatives. The previous comments were not shouts but the mother of
all NAKs.
There are some in the Linux community who adore flames, but this is a
perfect example of how this approach slows innovation instead of advancing it.
Some developers have a thick skin and nothing gets in; others are human
and have feelings. With a tiny difference in behavior we can do much,
much better. What works in a loud face-to-face discussion doesn't play well in
email.
Or alternatively:
/*
* can_nice - check if folks on lkml can be nicer & more productive
* @p: person
* @nice: nice value
* Since nice isn't a negative property, nice is an uint here.
*/
int can_nice(const struct person *p, const unsigned int nice)
{
int nice_rlim = MAX_LIMIT_BEFORE_NAK;
BUG_ON(!capable(CAP_SYS_NICE));
if (nice_rlim >= task_rlimit(p, RLIMIT_NICE))
printk(KERN_INFO "Please NAK w/ decent explanation or \
submit an alternative patch");
return 0;
}
Ingo, what's your technical perspective of this particular patch?
Cheers,
Dor
>
> Thanks,
>
> Ingo
>
On 06/29/2012 12:38 PM, Andrea Arcangeli wrote:
> On Fri, Jun 29, 2012 at 11:36:26AM -0400, Rik van Riel wrote:
>> On 06/28/2012 08:55 AM, Andrea Arcangeli wrote:
>>
>>> --- a/include/linux/sched.h
>>> +++ b/include/linux/sched.h
>>> @@ -1792,7 +1792,7 @@ extern void thread_group_times(struct task_struct *p, cputime_t *ut, cputime_t *
>>> #define PF_SWAPWRITE 0x00800000 /* Allowed to write to swap */
>>> #define PF_SPREAD_PAGE 0x01000000 /* Spread page cache over cpuset */
>>> #define PF_SPREAD_SLAB 0x02000000 /* Spread some slab caches over cpuset */
>>> -#define PF_THREAD_BOUND 0x04000000 /* Thread bound to specific cpu */
>>> +#define PF_THREAD_BOUND 0x04000000 /* Thread bound to specific cpus */
>>> #define PF_MCE_EARLY 0x08000000 /* Early kill for mce process policy */
>>> #define PF_MEMPOLICY 0x10000000 /* Non-default NUMA mempolicy */
>>> #define PF_MUTEX_TESTER 0x20000000 /* Thread belongs to the rt mutex tester */
>>
>> Changing the semantics of PF_THREAD_BOUND without so much as
>> a comment in your changelog or buy-in from the scheduler
>> maintainers is a big no-no.
>>
>> Is there any reason you even need PF_THREAD_BOUND in your
>> kernel numa threads?
>>
>> I do not see much at all in the scheduler code that uses
>> PF_THREAD_BOUND and it is not clear at all that your
>> numa threads get any benefit from them...
>>
>> Why do you think you need it?
> This flag is only used to prevent userland to mess with the kernel CPU
> binds of kernel threads. It is used to avoid the root user to shoot
> itself in the foot.
>
> So far it has been used to prevent changing bindings to a single
> CPU. I'm setting it also after making a multiple-cpu bind (all CPUs of
> the node, instead of just 1 CPU).
Fair enough. Looking at the scheduler code some more, I
see that all PF_THREAD_BOUND seems to do is block userspace
from changing a thread's CPU bindings.
Peter and Ingo, what is the special magic in PF_THREAD_BOUND
that should make it only apply to kernel threads that are bound
to a single CPU?
Allowing it for threads that are bound to a NUMA node
could make some sense for eg. kswapd...
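If memory serves, the only check is in the affinity-setting paths,
something like this (a from-memory sketch of the 3.x sched_setaffinity()
/ set_cpus_allowed_ptr() test; the helper name is made up, not verified):

static bool cpumask_change_allowed(struct task_struct *p)
{
	/* userspace (even root) may not rebind a PF_THREAD_BOUND kthread */
	return !(p->flags & PF_THREAD_BOUND);
}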
--
All rights reversed
On 06/28/2012 08:55 AM, Andrea Arcangeli wrote:
> Define the two data structures that collect the per-process (in the
> mm) and per-thread (in the task_struct) statistical information that
> are the input of the CPU follow memory algorithms in the NUMA
> scheduler.
I just noticed the subject of this email is misleading, too.
This patch does not introduce sched_autonuma at all.
*searches around for the patch that does*
--
All rights reversed
On 06/28/2012 08:55 AM, Andrea Arcangeli wrote:
> This algorithm takes as input the statistical information filled by the
> knuma_scand (mm->mm_autonuma) and by the NUMA hinting page faults
> (p->sched_autonuma),
Somewhat confusing patch order, since the NUMA hinting page faults
appear to be later in the patch series.
At least the data structures got introduced earlier, albeit without
any documentation whatsoever (that needs fixing).
> evaluates it for the current scheduled task, and
> compares it against every other running process to see if it should
> move the current task to another NUMA node.
This is a little worrying. What if you are running on a system
with hundreds of NUMA nodes? How often does this code run?
> +static bool inline task_autonuma_cpu(struct task_struct *p, int cpu)
> +{
> +#ifdef CONFIG_AUTONUMA
> + int autonuma_node;
> + struct task_autonuma *task_autonuma = p->task_autonuma;
> +
> + if (!task_autonuma)
> + return true;
> +
> + autonuma_node = ACCESS_ONCE(task_autonuma->autonuma_node);
> + if (autonuma_node < 0 || autonuma_node == cpu_to_node(cpu))
> + return true;
> + else
> + return false;
> +#else
> + return true;
> +#endif
> +}
What is the return value of task_autonuma_cpu supposed
to represent? It is not at all clear what this function
is trying to do...
> +#ifdef CONFIG_AUTONUMA
> + /* this is used by the scheduler and the page allocator */
> + struct mm_autonuma *mm_autonuma;
> +#endif
> };
Great. What is it used for, and how?
Why is that not documented?
> @@ -1514,6 +1514,9 @@ struct task_struct {
> struct mempolicy *mempolicy; /* Protected by alloc_lock */
> short il_next;
> short pref_node_fork;
> +#ifdef CONFIG_AUTONUMA
> + struct task_autonuma *task_autonuma;
> +#endif
This could use a comment, too. I know task_struct has historically
been documented rather poorly, but it may be time to break that
tradition and add documentation.
> +/*
> + * autonuma_balance_cpu_stop() is a callback to be invoked by
> + * stop_one_cpu_nowait(). It is used by sched_autonuma_balance() to
> + * migrate the tasks to the selected_cpu, from softirq context.
> + */
> +static int autonuma_balance_cpu_stop(void *data)
> +{
Uhhh what? It does not look like anything ends up stopped
as a result of this function running.
It looks like the function migrates a task to another NUMA
node, and always returns 0. Maybe void should be the return
type, and not the argument type?
It would be nice if the function name described what the
function does.
> + struct rq *src_rq = data;
A void* as the function parameter, when you know what the
data pointer actually is?
Why are you doing this?
> + int src_cpu = cpu_of(src_rq);
> + int dst_cpu = src_rq->autonuma_balance_dst_cpu;
> + struct task_struct *p = src_rq->autonuma_balance_task;
Why is the task to be migrated an item in the runqueue struct,
and not a function argument?
This seems backwards from the way things are usually done.
Not saying it is wrong, but doing things this way needs a good
explanation.
> +out_unlock:
> + src_rq->autonuma_balance = false;
> + raw_spin_unlock(&src_rq->lock);
> + /* spinlocks acts as barrier() so p is stored local on the stack */
What race are you trying to protect against?
Surely the reason p continues to be valid is that you are
holding a refcount to the task?
> + raw_spin_unlock_irq(&p->pi_lock);
> + put_task_struct(p);
> + return 0;
> +}
> +enum {
> + W_TYPE_THREAD,
> + W_TYPE_PROCESS,
> +};
What is W? What is the difference between thread type
and process type Ws?
You wrote a lot of text describing sched_autonuma_balance(),
but none of it helps me understand what you are trying to do :(
> + * We run the above math on every CPU not part of the current NUMA
> + * node, and we compare the current process against the other
> + * processes running in the other CPUs in the remote NUMA nodes. The
> + * objective is to select the cpu (in selected_cpu) with a bigger
> + * "weight". The bigger the "weight" the biggest gain we'll get by
> + * moving the current process to the selected_cpu (not only the
> + * biggest immediate CPU gain but also the fewer async memory
> + * migrations that will be required to reach full convergence
> + * later). If we select a cpu we migrate the current process to it.
The one thing you have not described at all is what factors
go into the weight calculation, and why you are using those.
We can all read C and figure out what the code does, but
we need to know why.
What factors does the code use to weigh the NUMA nodes and processes?
Why are statistics kept both on a per process and a per thread basis?
What is the difference between those two?
What makes a particular NUMA node a good node for a thread to run on?
When is it worthwhile moving stuff around?
When is it not worthwhile?
> + * One non trivial bit of this logic that deserves an explanation is
> + * how the three crucial variables of the core math
> + * (w_nid/w_other/wcpu_nid) are going to change depending on whether
> + * the other CPU is running a thread of the current process, or a
> + * thread of a different process.
It would be nice to know what w_nid/w_other/w_cpu_nid mean.
You have a one-line description of them higher up in the comment,
but there is still no description at all of what factors go into
calculating the weights, or why...
> + * A simple example is required. Given the following:
> + * - 2 processes
> + * - 4 threads per process
> + * - 2 NUMA nodes
> + * - 4 CPUS per NUMA node
> + *
> + * Because the 8 threads belong to 2 different processes, by using the
> + * process statistics when comparing threads of different processes,
> + * we will converge reliably and quickly to a configuration where the
> + * 1st process is entirely contained in one node and the 2nd process
> + * in the other node.
> + *
> + * If all threads only use thread local memory (no sharing of memory
> + * between the threads), it wouldn't matter if we use per-thread or
> + * per-mm statistics for w_nid/w_other/w_cpu_nid. We could then use
> + * per-thread statistics all the time.
> + *
> + * But clearly with threads it's expected to get some sharing of
> + * memory. To avoid false sharing it's better to keep all threads of
> + * the same process in the same node (or if they don't fit in a single
> + * node, in as fewer nodes as possible). This is why we have to use
> + * processes statistics in w_nid/w_other/wcpu_nid when comparing
> + * threads of different processes. Why instead do we have to use
> + * thread statistics when comparing threads of the same process? This
> + * should be obvious if you're still reading
Nothing at all here is obvious, because you have not explained
what factors go into determining each weight.
You describe a lot of specific details, but are missing the
general overview that helps us make sense of things.
> +void sched_autonuma_balance(void)
> +{
> + int cpu, nid, selected_cpu, selected_nid, selected_nid_mm;
> + int cpu_nid = numa_node_id();
> + int this_cpu = smp_processor_id();
> + /*
> + * w_t: node thread weight
> + * w_t_t: total sum of all node thread weights
> + * w_m: node mm/process weight
> + * w_m_t: total sum of all node mm/process weights
> + */
> + unsigned long w_t, w_t_t, w_m, w_m_t;
> + unsigned long w_t_max, w_m_max;
> + unsigned long weight_max, weight;
> + long s_w_nid = -1, s_w_cpu_nid = -1, s_w_other = -1;
> + int s_w_type = -1;
> + struct cpumask *allowed;
> + struct task_struct *p = current;
> + struct task_autonuma *task_autonuma = p->task_autonuma;
Considering that p is always current, it may be better to just use
current throughout the function, that way people can see at a glance
that "p" cannot go away while the code is running, because current
is running the code on itself.
> + /*
> + * The below two arrays holds the NUMA affinity information of
> + * the current process if scheduled in the "nid". This is task
> + * local and mm local information. We compute this information
> + * for all nodes.
> + *
> + * task/mm_numa_weight[nid] will become w_nid.
> + * task/mm_numa_weight[cpu_nid] will become w_cpu_nid.
> + */
> + rq = cpu_rq(this_cpu);
> + task_numa_weight = rq->task_numa_weight;
> + mm_numa_weight = rq->mm_numa_weight;
It is a mystery to me why these items are allocated in the
runqueue structure. We have per-cpu allocations for things
like this, why are you adding them to the runqueue?
If there is a reason, you need to document it.
> + w_t_max = w_m_max = 0;
> + selected_nid = selected_nid_mm = -1;
> + for_each_online_node(nid) {
> + w_m = ACCESS_ONCE(p->mm->mm_autonuma->mm_numa_fault[nid]);
> + w_t = task_autonuma->task_numa_fault[nid];
> + if (w_m > w_m_t)
> + w_m_t = w_m;
> + mm_numa_weight[nid] = w_m*AUTONUMA_BALANCE_SCALE/w_m_t;
> + if (w_t > w_t_t)
> + w_t_t = w_t;
> + task_numa_weight[nid] = w_t*AUTONUMA_BALANCE_SCALE/w_t_t;
> + if (mm_numa_weight[nid] > w_m_max) {
> + w_m_max = mm_numa_weight[nid];
> + selected_nid_mm = nid;
> + }
> + if (task_numa_weight[nid] > w_t_max) {
> + w_t_max = task_numa_weight[nid];
> + selected_nid = nid;
> + }
> + }
What do the task and mm numa weights mean?
What factors go into calculating them?
Is it better to have a higher or a lower number? :)
We could use less documentation of what the code
does, and more explaining what the code is trying
to do, and why.
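For what it's worth, if I read the loop above correctly, each weight is
simply the node's share of the recent NUMA hinting faults scaled to
0..AUTONUMA_BALANCE_SCALE, i.e. roughly (restated sketch, not the actual
code; fault_tot assumed non-zero):

static unsigned long numa_weight(unsigned long fault, unsigned long fault_tot)
{
	if (fault > fault_tot)		/* same clamping as the loop above */
		fault_tot = fault;
	return fault * AUTONUMA_BALANCE_SCALE / fault_tot;
}

/* e.g. 900 faults out of 1000 total -> weight 900 out of SCALE 1000 */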
Under what circumstances do we continue into this loop?
What is it trying to do?
> + for_each_online_node(nid) {
> + /*
> + * Calculate the "weight" for all CPUs that the
> + * current process is allowed to be migrated to,
> + * except the CPUs of the current nid (it would be
> + * worthless from a NUMA affinity standpoint to
> + * migrate the task to another CPU of the current
> + * node).
> + */
> + if (nid == cpu_nid)
> + continue;
> + for_each_cpu_and(cpu, cpumask_of_node(nid), allowed) {
> + long w_nid, w_cpu_nid, w_other;
> + int w_type;
> + struct mm_struct *mm;
> + rq = cpu_rq(cpu);
> + if (!cpu_online(cpu))
> + continue;
> +
> + if (idle_cpu(cpu))
> + /*
> + * Offload the while IDLE balancing
> + * and physical / logical imbalances
> + * to CFS.
> + */
/* CFS idle balancing takes care of this */
> + continue;
> +
> + mm = rq->curr->mm;
> + if (!mm)
> + continue;
> + /*
> + * Grab the w_m/w_t/w_m_t/w_t_t of the
> + * processes running in the other CPUs to
> + * compute w_other.
> + */
> + raw_spin_lock_irq(&rq->lock);
> + /* recheck after implicit barrier() */
> + mm = rq->curr->mm;
> + if (!mm) {
> + raw_spin_unlock_irq(&rq->lock);
> + continue;
> + }
> + w_m_t = ACCESS_ONCE(mm->mm_autonuma->mm_numa_fault_tot);
> + w_t_t = rq->curr->task_autonuma->task_numa_fault_tot;
> + if (!w_m_t || !w_t_t) {
> + raw_spin_unlock_irq(&rq->lock);
> + continue;
> + }
> + w_m = ACCESS_ONCE(mm->mm_autonuma->mm_numa_fault[nid]);
> + w_t = rq->curr->task_autonuma->task_numa_fault[nid];
> + raw_spin_unlock_irq(&rq->lock);
Is this why the info is stored in the runqueue struct?
How do we know the other runqueue's data is consistent?
We seem to be doing our own updates outside of the lock...
How do we know the other runqueue's data is up to date?
How often is this function run?
> + /*
> + * Generate the w_nid/w_cpu_nid from the
> + * pre-computed mm/task_numa_weight[] and
> + * compute w_other using the w_m/w_t info
> + * collected from the other process.
> + */
> + if (mm == p->mm) {
if (mm == current->mm) {
> + if (w_t > w_t_t)
> + w_t_t = w_t;
> + w_other = w_t*AUTONUMA_BALANCE_SCALE/w_t_t;
> + w_nid = task_numa_weight[nid];
> + w_cpu_nid = task_numa_weight[cpu_nid];
> + w_type = W_TYPE_THREAD;
> + } else {
> + if (w_m > w_m_t)
> + w_m_t = w_m;
> + w_other = w_m*AUTONUMA_BALANCE_SCALE/w_m_t;
> + w_nid = mm_numa_weight[nid];
> + w_cpu_nid = mm_numa_weight[cpu_nid];
> + w_type = W_TYPE_PROCESS;
> + }
Wait, what?
Why is w_t used in one case, and w_m in the other?
Explaining the meaning of the two, and how each is used,
would help people understand this code.
> + /*
> + * Finally check if there's a combined gain in
> + * NUMA affinity. If there is and it's the
> + * biggest weight seen so far, record its
> + * weight and select this NUMA remote "cpu" as
> + * candidate migration destination.
> + */
> + if (w_nid > w_other && w_nid > w_cpu_nid) {
> + weight = w_nid - w_other + w_nid - w_cpu_nid;
I read this as "check if moving the current task to the other CPU,
and moving its task away, would increase overall NUMA affinity".
Is that correct?
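If that reading is right, a worked example with made-up numbers
(SCALE = 1000):

/*
 *   w_nid     = 800   our affinity to the remote node "nid"
 *   w_cpu_nid = 200   our affinity to the node we are running on now
 *   w_other   = 300   the remote task's affinity to its own node "nid"
 *
 *   w_nid > w_other && w_nid > w_cpu_nid  ->  candidate
 *   weight = (800 - 300) + (800 - 200) = 1100
 *
 * i.e. the gain for node "nid" from running us instead of its current
 * task, plus our own gain from leaving the node we are on.
 */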
> + stop_one_cpu_nowait(this_cpu,
> + autonuma_balance_cpu_stop, rq,
> + &rq->autonuma_balance_work);
> +#ifdef __ia64__
> +#error "NOTE: tlb_migrate_finish won't run here"
> +#endif
> +}
So that is why the function is called autonuma_balance_cpu_stop?
Even though its function is to migrate a task?
What will happen in the IA64 case?
Memory corruption?
What would the IA64 maintainers have to do to make things work?
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 6d52cea..e5b7ae9 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -463,6 +463,24 @@ struct rq {
> #ifdef CONFIG_SMP
> struct llist_head wake_list;
> #endif
> +#ifdef CONFIG_AUTONUMA
> + /*
> + * Per-cpu arrays to compute the per-thread and per-process
> + * statistics. Allocated statically to avoid overflowing the
> + * stack with large MAX_NUMNODES values.
> + *
> + * FIXME: allocate dynamically and with num_possible_nodes()
> + * array sizes only if autonuma is not impossible, to save
> + * some dozen KB of RAM when booting on not NUMA (or small
> + * NUMA) systems.
> + */
I have a second FIXME: document what these fields actually mean,
what they are used for, and why they are allocated as part of the
runqueue structure.
> + long task_numa_weight[MAX_NUMNODES];
> + long mm_numa_weight[MAX_NUMNODES];
> + bool autonuma_balance;
> + int autonuma_balance_dst_cpu;
> + struct task_struct *autonuma_balance_task;
> + struct cpu_stop_work autonuma_balance_work;
> +#endif
--
All rights reversed
On 06/28/2012 08:55 AM, Andrea Arcangeli wrote:
> On 64bit archs, 20 bytes are used for async memory migration (specific
> to the knuma_migrated per-node threads), and 4 bytes are used for the
> thread NUMA false sharing detection logic.
>
> This is a bad implementation due lack of time to do a proper one.
It is not ideal, no.
If you document what everything does, maybe somebody else
will understand the code well enough to help fix it.
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -136,6 +136,32 @@ struct page {
> struct page *first_page; /* Compound tail pages */
> };
>
> +#ifdef CONFIG_AUTONUMA
> + /*
> + * FIXME: move to pgdat section along with the memcg and allocate
> + * at runtime only in presence of a numa system.
> + */
Once you fix it, could you fold the fix into this patch?
> + /*
> + * To modify autonuma_last_nid lockless the architecture,
> + * needs SMP atomic granularity < sizeof(long), not all archs
> + * have that, notably some ancient alpha (but none of those
> + * should run in NUMA systems). Archs without that requires
> + * autonuma_last_nid to be a long.
> + */
> +#if BITS_PER_LONG > 32
> + int autonuma_migrate_nid;
> + int autonuma_last_nid;
> +#else
> +#if MAX_NUMNODES >= 32768
> +#error "too many nodes"
> +#endif
> + /* FIXME: remember to check the updates are atomic */
> + short autonuma_migrate_nid;
> + short autonuma_last_nid;
> +#endif
> + struct list_head autonuma_migrate_node;
> +#endif
Please document what these fields mean.
--
All rights reversed
On 06/30/2012 00:30, Andrea Arcangeli wrote:
> Hi Nai,
>
> On Fri, Jun 29, 2012 at 10:11:35PM +0800, Nai Xia wrote:
>> If one process do very intensive visit of a small set of pages in this
>> node, but occasional visit of a large set of pages in another node.
>> Will this algorithm do a very bad judgment? I guess the answer would
>> be: it's possible and this judgment depends on the racing pattern
>> between the process and your knuma_scand.
>
> Depending on whether knuma_scand/scan_pass_sleep_millisecs is more or
> less occasional than the visits to the large set of pages, it may
> behave differently, correct.
I bet this racing is more subtle than that, but since you admit this
judgment is a racing problem, it doesn't matter how subtle it would be.
>
> Note that every algorithm will have a limit on how smart it can be.
>
> Just to make a random example: if you lookup some pagecache a million
> times and some other pagecache a dozen times, their "aging"
> information in the pagecache will end up identical. Yet we know one
> set of pages is clearly higher priority than the other. We've only so
> many levels of lrus and so many referenced/active bitflags per
> page. Once you get at the top, then all is equal.
>
> Does this mean the "active" list working set detection is useless just
> because we can't differentiate a million of lookups on a few pages, vs
> a dozen of lookups on lots of pages?
I knew you would give us the LRU example. ;D
But unfortunately the LRU approximation cannot justify your case:
there are cases where the LRU approximation behaves very badly,
but enough research over the years has told us that 90% of workloads
conform to this kind of approximation, and every programmer has
been taught to write LRU-friendly programs.
But we have no idea how well real-world workloads will conform to your
algorithm, especially the racing pattern.
>
> Last but not the least, in the very example you mention it's not even
> clear that the process should be scheduled in the CPU where there is
> the small set of pages accessed frequently, or the CPU where there's
> the large set of pages accessed occasionally. If the small sets of
> pages fits in the 8MBytes of the L2 cache, then it's better to put the
> process in the other CPU where the large set of pages can't fit in the
> L2 cache. Lots of hardware details should be evaluated, to really know
> what's the right thing in such case even if it was you having to
> decide.
That's exactly why I think it is more subtle, and why I don't feel
confident about your algorithm -- its effectiveness depends on so
many uncertain things.
>
> But the real reason why the above isn't an issue and why we don't need
> to solve that problem perfectly: there's not just a CPU follow memory
> algorithm in AutoNUMA. There's also the memory follow CPU
> algorithm. AutoNUMA will do its best to change the layout of your
> example to one that has only one clear solution: the occasional lookup
> of the large set of pages, will make those eventually go in the node
> together with the small set of pages (or the other way around), and
> this is how it's solved.
Not sure I follow; if you fall back on this, then why all the complexity?
This fallback amounts to a "just group all the pages onto the running node" policy.
>
> In any case, whatever wrong decision it will take, it will at least be
> a better decision than the numa/sched where there's absolutely zero
> information about what pages the process is accessing. And best of all
> with AutoNUMA you also know which pages the _thread_ is accessing so
> it will also be able to take optimal decisions if there are more
> threads than CPUs in a node (as long as not all thread accesses are
> shared).
Yeah, we need the information. But how to make the best of that
information is a big problem.
I feel you cannot address my question by word reasoning alone,
if you currently have in hand no survey of the common page access
patterns of real-world workloads.
Maybe the assumption behind your algorithm is right, maybe not...
>
> Hope this explains things better.
> Andrea
Thanks,
Nai
On 06/28/2012 08:55 AM, Andrea Arcangeli wrote:
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 2427706..d53b26a 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -697,6 +697,12 @@ typedef struct pglist_data {
> struct task_struct *kswapd;
> int kswapd_max_order;
> enum zone_type classzone_idx;
> +#ifdef CONFIG_AUTONUMA
> + spinlock_t autonuma_lock;
> + struct list_head autonuma_migrate_head[MAX_NUMNODES];
> + unsigned long autonuma_nr_migrate_pages;
> + wait_queue_head_t autonuma_knuma_migrated_wait;
> +#endif
> } pg_data_t;
>
> #define node_present_pages(nid) (NODE_DATA(nid)->node_present_pages)
Once again, the data structure could use documentation.
What are these things for?
--
All rights reversed
On 06/28/2012 08:55 AM, Andrea Arcangeli wrote:
> Initialize the knuma_migrated queues at boot time.
>
> Signed-off-by: Andrea Arcangeli<[email protected]>
> ---
> mm/page_alloc.c | 11 +++++++++++
> 1 files changed, 11 insertions(+), 0 deletions(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index a9710a4..48eabe9 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -59,6 +59,7 @@
> #include <linux/prefetch.h>
> #include <linux/migrate.h>
> #include <linux/page-debug-flags.h>
> +#include <linux/autonuma.h>
>
> #include <asm/tlbflush.h>
> #include <asm/div64.h>
> @@ -4348,8 +4349,18 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat,
> int nid = pgdat->node_id;
> unsigned long zone_start_pfn = pgdat->node_start_pfn;
> int ret;
> +#ifdef CONFIG_AUTONUMA
> + int node_iter;
> +#endif
>
> pgdat_resize_init(pgdat);
> +#ifdef CONFIG_AUTONUMA
> + spin_lock_init(&pgdat->autonuma_lock);
> + init_waitqueue_head(&pgdat->autonuma_knuma_migrated_wait);
> + pgdat->autonuma_nr_migrate_pages = 0;
> + for_each_node(node_iter)
> + INIT_LIST_HEAD(&pgdat->autonuma_migrate_head[node_iter]);
> +#endif
Should this be a __paginginit function inside one of the
autonuma files, so we can avoid the ifdefs here?
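Something like the following, presumably (sketch only; the helper name is
invented, it just moves the same initialization out of page_alloc.c):

/* include/linux/autonuma.h */
#ifdef CONFIG_AUTONUMA
extern void __paginginit autonuma_pgdat_init(struct pglist_data *pgdat);
#else
static inline void autonuma_pgdat_init(struct pglist_data *pgdat) {}
#endif

/* mm/autonuma.c -- free_area_init_core() then calls this unconditionally */
void __paginginit autonuma_pgdat_init(struct pglist_data *pgdat)
{
	int node_iter;

	spin_lock_init(&pgdat->autonuma_lock);
	init_waitqueue_head(&pgdat->autonuma_knuma_migrated_wait);
	pgdat->autonuma_nr_migrate_pages = 0;
	for_each_node(node_iter)
		INIT_LIST_HEAD(&pgdat->autonuma_migrate_head[node_iter]);
}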
--
All rights reversed
Hi Nai,
On Fri, Jun 29, 2012 at 10:11:35PM +0800, Nai Xia wrote:
> If one process do very intensive visit of a small set of pages in this
> node, but occasional visit of a large set of pages in another node.
> Will this algorithm do a very bad judgment? I guess the answer would
> be: it's possible and this judgment depends on the racing pattern
> between the process and your knuma_scand.
Depending on whether knuma_scand/scan_pass_sleep_millisecs is more or
less occasional than the visits to the large set of pages, it may behave
differently, correct.
Note that every algorithm will have a limit on how smart it can be.
Just to make a random example: if you lookup some pagecache a million
times and some other pagecache a dozen times, their "aging"
information in the pagecache will end up identical. Yet we know one
set of pages is clearly higher priority than the other. We've only so
many levels of lrus and so many referenced/active bitflags per
page. Once you get at the top, then all is equal.
Does this mean the "active" list working set detection is useless just
because we can't differentiate a million of lookups on a few pages, vs
a dozen of lookups on lots of pages?
Last but not the least, in the very example you mention it's not even
clear that the process should be scheduled in the CPU where there is
the small set of pages accessed frequently, or the CPU where there's
the large set of pages accessed occasionally. If the small sets of
pages fits in the 8MBytes of the L2 cache, then it's better to put the
process in the other CPU where the large set of pages can't fit in the
L2 cache. Lots of hardware details should be evaluated, to really know
what's the right thing in such case even if it was you having to
decide.
But the real reason why the above isn't an issue and why we don't need
to solve that problem perfectly: there's not just a CPU follow memory
algorithm in AutoNUMA. There's also the memory follow CPU
algorithm. AutoNUMA will do its best to change the layout of your
example to one that has only one clear solution: the occasional lookup
of the large set of pages, will make those eventually go in the node
together with the small set of pages (or the other way around), and
this is how it's solved.
In any case, whatever wrong decision it will take, it will at least be
a better decision than the numa/sched where there's absolutely zero
information about what pages the process is accessing. And best of all
with AutoNUMA you also know which pages the _thread_ is accessing so
it will also be able to take optimal decisions if there are more
threads than CPUs in a node (as long as not all thread accesses are
shared).
Hope this explains things better.
Andrea
On 06/28/2012 08:55 AM, Andrea Arcangeli wrote:
> This resets all per-thread and per-process statistics across exec
> syscalls or after kernel threads detached from the mm. The past
> statistical NUMA information is unlikely to be relevant for the future
> in these cases.
>
> Signed-off-by: Andrea Arcangeli<[email protected]>
Acked-by: Rik van Riel <[email protected]>
--
All rights reversed
On 06/28/2012 08:55 AM, Andrea Arcangeli wrote:
> The first gear in the whole AutoNUMA algorithm is knuma_scand. If
> knuma_scand doesn't run AutoNUMA is a full bypass. If knuma_scand is
> stopped, soon all other AutoNUMA gears will settle down too.
>
> knuma_scand is the daemon that sets the pmd_numa and pte_numa and
> allows the NUMA hinting page faults to start and then all other
> actions follows as a reaction to that.
>
> knuma_scand scans a list of "mm" and this is where we register and
> unregister the "mm" into AutoNUMA for knuma_scand to scan them.
>
> Signed-off-by: Andrea Arcangeli<[email protected]>
Acked-by: Rik van Riel <[email protected]>
--
All rights reversed
On Fri, 2012-06-29 at 12:51 -0400, Dor Laor wrote:
> it's hard to say whether Peter would like to add ia64 support or
> would just like to get rid of the forceful migration as a whole.
I've stated several times that all archs that have CONFIG_NUMA must be
supported before we can consider any of this. I've no intention of doing
so myself. Andrea wants this, Andrea gets to do it.
I've also stated several times that forceful migration in the context of
numa balancing must go.
On 06/29/2012 02:41 PM, Peter Zijlstra wrote:
> On Fri, 2012-06-29 at 12:51 -0400, Dor Laor wrote:
>> it's hard to say whether Peter would like to add ia64 support or
>> would just like to get rid of the forceful migration as a whole.
>
> I've stated several times that all archs that have CONFIG_NUMA must be
> supported before we can consider any of this. I've no intention of doing
> so myself. Andrea wants this, Andrea gets to do it.
I am not convinced all architectures that have CONFIG_NUMA
need to be a requirement, since some of them (eg. Alpha)
seem to be lacking a maintainer nowadays.
It would be good if Andrea could touch base with the maintainers
of the actively maintained architectures with NUMA, and get them
to sign off on the way autonuma does things, and work with them
to get autonuma ported to those architectures.
> I've also stated several times that forceful migration in the context of
> numa balancing must go.
I am not convinced about this part either way.
I do not see how a migration numa thread (which could potentially
use idle cpu time) will be any worse than migrate on fault, which
will always take away time from the userspace process.
--
All rights reversed
On Fri, 2012-06-29 at 12:51 -0400, Dor Laor wrote:
> AFAIK, Andrea answered many of Peter's requests by reducing the memory
> overhead, adding documentation and changing the scheduler integration.
>
He ignored even more. Getting him to change anything is like pulling
teeth. I'm well tired of it. Take for instance this kthread_bind_node()
nonsense, I've said from the very beginning that wasn't good. Nor is it
required, yet he persists in including it.
The thing is, I would very much like to talk about the design of this
thing, but there's just nothing coming. Yes Andrea wrote a lot of words,
but they didn't explain anything much at all.
And while I have a fair idea of what and how its working, I still miss a
number of critical fundamentals of the whole approach.
And yes I'm tired and I'm cranky.. but wouldn't you be if you'd spent
days poring over dense and ill-documented code, giving comments only to
have your feedback dismissed and ignored?
On Fri, 2012-06-29 at 14:46 -0400, Rik van Riel wrote:
> > I've also stated several times that forceful migration in the context of
> > numa balancing must go.
>
> I am not convinced about this part either way.
>
> I do not see how a migration numa thread (which could potentially
> use idle cpu time) will be any worse than migrate on fault, which
> will always take away time from the userspace process.
Any NUMA stuff is long term, it really shouldn't matter on the timescale
of a few jiffies.
NUMA placement should also not over-ride fairness, esp. not by default.
On Fri, 2012-06-29 at 12:51 -0400, Dor Laor wrote:
> The previous comments were not shouts but the mother of all NAKs.
I never said any such thing. I just said why should I bother reading
your stuff if you're ignoring most of my feedback anyway.
If you want to read that as a NAK, not my problem.
On 06/28/2012 08:55 AM, Andrea Arcangeli wrote:
> This is where the dynamically allocated sched_autonuma structure is
> being handled.
>
> The reason for keeping this outside of the task_struct besides not
> using too much kernel stack, is to only allocate it on NUMA
> hardware. So the not NUMA hardware only pays the memory of a pointer
> in the kernel stack (which remains NULL at all times in that case).
What is not documented is the reason for keeping it at all.
What is in the data structure?
What is the data structure used for?
How do we use it?
> + if (unlikely(alloc_task_autonuma(tsk, orig, node)))
> + /* free_thread_info() undoes arch_dup_task_struct() too */
> + goto out_thread_info;
Oh, you mean task_autonuma, and not sched_autonuma?
Please fix the commit message and the subject.
--
All rights reversed
On 06/28/2012 08:56 AM, Andrea Arcangeli wrote:
> This is where the mm_autonuma structure is being handled. Just like
> sched_autonuma, this is only allocated at runtime if the hardware the
> kernel is running on has been detected as NUMA. On not NUMA hardware
> the memory cost is reduced to one pointer per mm.
>
> To get rid of the pointer in the each mm, the kernel can be compiled
> with CONFIG_AUTONUMA=n.
>
> Signed-off-by: Andrea Arcangeli<[email protected]>
Same comments as before. A description of what the data
structure is used for and how would be good.
--
All rights reversed
On Fri, 2012-06-29 at 14:46 -0400, Rik van Riel wrote:
>
> I am not convinced all architectures that have CONFIG_NUMA
> need to be a requirement, since some of them (eg. Alpha)
> seem to be lacking a maintainer nowadays.
Still, this NUMA balancing stuff is not a small tweak to load-balancing.
It's a very significant change in how you schedule. Having such great
differences across architectures isn't something I look forward to.
On 06/28/2012 08:56 AM, Andrea Arcangeli wrote:
> Fix to avoid -1 retval.
>
> Includes fixes from Hillf Danton<[email protected]>.
>
> Signed-off-by: Andrea Arcangeli<[email protected]>
> ---
> kernel/sched/fair.c | 4 ++++
> 1 files changed, 4 insertions(+), 0 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index c099cc6..fa96810 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -2789,6 +2789,9 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags)
> if (new_cpu == -1 || new_cpu == cpu) {
> /* Now try balancing at a lower domain level of cpu */
> sd = sd->child;
> + if (new_cpu < 0)
> + /* Return prev_cpu is find_idlest_cpu failed */
> + new_cpu = prev_cpu;
> continue;
> }
>
> @@ -2807,6 +2810,7 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags)
> unlock:
> rcu_read_unlock();
>
> + BUG_ON(new_cpu < 0);
> return new_cpu;
> }
> #endif /* CONFIG_SMP */
Wait, what?
Either this is a scheduler bugfix, in which case you
are better off submitting it separately and reducing
the size of your autonuma patch queue, or this is a
behaviour change in the scheduler that needs better
arguments than a 1-line changelog.
--
All rights reversed
On Fri, 2012-06-29 at 20:57 +0200, Peter Zijlstra wrote:
> On Fri, 2012-06-29 at 14:46 -0400, Rik van Riel wrote:
> >
> > I am not convinced all architectures that have CONFIG_NUMA
> > need to be a requirement, since some of them (eg. Alpha)
> > seem to be lacking a maintainer nowadays.
>
> Still, this NUMA balancing stuff is not a small tweak to load-balancing.
> It's a very significant change in how you schedule. Having such great
> differences across architectures isn't something I look forward to.
Also, Andrea keeps insisting arch support is trivial, so I don't see the
problem.
On Fri, 2012-06-29 at 12:51 -0400, Dor Laor wrote:
> Some developers have a thick skin and nothing gets in; others are human
> and have feelings. With a tiny difference in behavior we could do much,
> much better. What works in a loud face-to-face discussion doesn't play
> well in email.
We're all humans, we all have feelings, and I'm frigging upset.
As a maintainer I try and do my best to support and maintain the
subsystems I'm responsible for. I take this very seriously.
I don't agree with the approach Andrea takes, we all know that, yet I do
want to talk about it. The problem is, many of the crucial details are
non-obvious and no sane explanation seems forthcoming.
I really feel I'm talking to deaf ears.
On Fri, 2012-06-29 at 14:57 -0400, Rik van Riel wrote:
> Either this is a scheduler bugfix, in which case you
> are better off submitting it separately and reducing
> the size of your autonuma patch queue, or this is a
> behaviour change in the scheduler that needs better
> arguments than a 1-line changelog.
I've only said this like 2 or 3 times.. :/
On 06/29/2012 03:05 PM, Peter Zijlstra wrote:
> On Fri, 2012-06-29 at 14:57 -0400, Rik van Riel wrote:
>> Either this is a scheduler bugfix, in which case you
>> are better off submitting it separately and reducing
>> the size of your autonuma patch queue, or this is a
>> behaviour change in the scheduler that needs better
>> arguments than a 1-line changelog.
>
> I've only said this like 2 or 3 times.. :/
I'll keep saying it until Andrea has fixed it :)
--
All rights reversed
On 06/29/2012 03:03 PM, Peter Zijlstra wrote:
> On Fri, 2012-06-29 at 20:57 +0200, Peter Zijlstra wrote:
>> On Fri, 2012-06-29 at 14:46 -0400, Rik van Riel wrote:
>>>
>>> I am not convinced all architectures that have CONFIG_NUMA
>>> need to be a requirement, since some of them (eg. Alpha)
>>> seem to be lacking a maintainer nowadays.
>>
>> Still, this NUMA balancing stuff is not a small tweak to load-balancing.
>> It's a very significant change in how you schedule. Having such great
>> differences across architectures isn't something I look forward to.
I am not too worried about the performance of architectures
that are essentially orphaned :)
> Also, Andrea keeps insisting arch support is trivial, so I don't see the
> problem.
Getting it implemented in one or two additional architectures
would be good, to get a template out there that can be used by
other architecture maintainers.
--
All rights reversed
* Rik van Riel <[email protected]> wrote:
> On 06/29/2012 03:05 PM, Peter Zijlstra wrote:
> >On Fri, 2012-06-29 at 14:57 -0400, Rik van Riel wrote:
> >>Either this is a scheduler bugfix, in which case you
> >>are better off submitting it separately and reducing
> >>the size of your autonuma patch queue, or this is a
> >>behaviour change in the scheduler that needs better
> >>arguments than a 1-line changelog.
> >
> >I've only said this like 2 or 3 times.. :/
>
> I'll keep saying it until Andrea has fixed it :)
But that's just wrong - patch submitters *MUST* be responsive
and forthcoming. Mistakes are OK, but this goes well beyond
that. A patch-queue must generally not be resubmitted for yet
another review round, as long as there are yet unaddressed
review feedback items.
The thing is, core kernel code maintainers like PeterZ don't
scale and the number of patches to review is huge - yet Andrea
keeps wasting Peter's time with the same things again and
again... How much is too much?
Thanks,
Ingo
On Sat, Jun 30, 2012 at 12:30 AM, Andrea Arcangeli <[email protected]> wrote:
> Hi Nai,
>
> On Fri, Jun 29, 2012 at 10:11:35PM +0800, Nai Xia wrote:
>> If one process do very intensive visit of a small set of pages in this
>> node, but occasional visit of a large set of pages in another node.
>> Will this algorithm do a very bad judgment? I guess the answer would
>> be: it's possible and this judgment depends on the racing pattern
>> between the process and your knuma_scand.
>
> Depending on whether knuma_scand/scan_pass_sleep_millisecs is more or
> less occasional than the visits to the large set of pages, it may
> behave differently, correct.
>
> Note that every algorithm will have a limit on how smart it can be.
>
> Just to make a random example: if you lookup some pagecache a million
> times and some other pagecache a dozen times, their "aging"
> information in the pagecache will end up identical. Yet we know one
> set of pages is clearly higher priority than the other. We've only so
> many levels of lrus and so many referenced/active bitflags per
> page. Once you get at the top, then all is equal.
>
> Does this mean the "active" list working set detection is useless just
> because we can't differentiate a million of lookups on a few pages, vs
> a dozen of lookups on lots of pages?
>
> Last but not the least, in the very example you mention it's not even
> clear that the process should be scheduled in the CPU where there is
> the small set of pages accessed frequently, or the CPU where there's
> the large set of pages accessed occasionally. If the small sets of
> pages fits in the 8MBytes of the L2 cache, then it's better to put the
> process in the other CPU where the large set of pages can't fit in the
> L2 cache. Lots of hardware details should be evaluated, to really know
> what's the right thing in such case even if it was you having to
> decide.
>
> But the real reason why the above isn't an issue and why we don't need
> to solve that problem perfectly: there's not just a CPU follow memory
> algorithm in AutoNUMA. There's also the memory follow CPU
> algorithm. AutoNUMA will do its best to change the layout of your
> example to one that has only one clear solution: the occasional lookup
> of the large set of pages, will make those eventually go in the node
> together with the small set of pages (or the other way around), and
> this is how it's solved.
>
> In any case, whatever wrong decision it will take, it will at least be
> a better decision than the numa/sched where there's absolutely zero
> information about what pages the process is accessing. And best of all
> with AutoNUMA you also know which pages the _thread_ is accessing so
> it will also be able to take optimal decisions if there are more
> threads than CPUs in a node (as long as not all thread accesses are
> shared).
>
> Hope this explains things better.
> Andrea
Hi Andrea,
Sorry for being so negative, but this problem seems so clear to me.
I might have pointed all of this out earlier if you had CC'd me since the
first version; I am not always on the list watching posts...
Sincerely,
Nai
On Thu, Jun 28, 2012 at 02:55:44PM +0200, Andrea Arcangeli wrote:
> Xen has taken over the last reserved bit available for the pagetables
Some time ago when I saw this patch I asked about it (whether there is a
way to actually stop using this bit) and you mentioned it is not the last
bit available for pagemaps. Perhaps you should alter the comment in this
description?
> which is set through ioremap, this documents it and makes the code
It is actually set through ioremap, gntdev (to map another guest's memory),
and on PFNs which fall in the non-RAM and gap sections of the E820.
> more readable.
>
> Signed-off-by: Andrea Arcangeli <[email protected]>
> ---
> arch/x86/include/asm/pgtable_types.h | 11 +++++++++--
> 1 files changed, 9 insertions(+), 2 deletions(-)
>
> diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
> index 013286a..b74cac9 100644
> --- a/arch/x86/include/asm/pgtable_types.h
> +++ b/arch/x86/include/asm/pgtable_types.h
> @@ -17,7 +17,7 @@
> #define _PAGE_BIT_PAT 7 /* on 4KB pages */
> #define _PAGE_BIT_GLOBAL 8 /* Global TLB entry PPro+ */
> #define _PAGE_BIT_UNUSED1 9 /* available for programmer */
> -#define _PAGE_BIT_IOMAP 10 /* flag used to indicate IO mapping */
> +#define _PAGE_BIT_UNUSED2 10
> #define _PAGE_BIT_HIDDEN 11 /* hidden by kmemcheck */
> #define _PAGE_BIT_PAT_LARGE 12 /* On 2MB or 1GB pages */
> #define _PAGE_BIT_SPECIAL _PAGE_BIT_UNUSED1
> @@ -41,7 +41,7 @@
> #define _PAGE_PSE (_AT(pteval_t, 1) << _PAGE_BIT_PSE)
> #define _PAGE_GLOBAL (_AT(pteval_t, 1) << _PAGE_BIT_GLOBAL)
> #define _PAGE_UNUSED1 (_AT(pteval_t, 1) << _PAGE_BIT_UNUSED1)
> -#define _PAGE_IOMAP (_AT(pteval_t, 1) << _PAGE_BIT_IOMAP)
> +#define _PAGE_UNUSED2 (_AT(pteval_t, 1) << _PAGE_BIT_UNUSED2)
> #define _PAGE_PAT (_AT(pteval_t, 1) << _PAGE_BIT_PAT)
> #define _PAGE_PAT_LARGE (_AT(pteval_t, 1) << _PAGE_BIT_PAT_LARGE)
> #define _PAGE_SPECIAL (_AT(pteval_t, 1) << _PAGE_BIT_SPECIAL)
> @@ -49,6 +49,13 @@
> #define _PAGE_SPLITTING (_AT(pteval_t, 1) << _PAGE_BIT_SPLITTING)
> #define __HAVE_ARCH_PTE_SPECIAL
>
> +/* flag used to indicate IO mapping */
> +#ifdef CONFIG_XEN
> +#define _PAGE_IOMAP (_AT(pteval_t, 1) << _PAGE_BIT_UNUSED2)
> +#else
> +#define _PAGE_IOMAP (_AT(pteval_t, 0))
> +#endif
> +
> #ifdef CONFIG_KMEMCHECK
> #define _PAGE_HIDDEN (_AT(pteval_t, 1) << _PAGE_BIT_HIDDEN)
> #else
>
On Thu, Jun 28, 2012 at 02:55:49PM +0200, Andrea Arcangeli wrote:
> This function makes it easy to bind the per-node knuma_migrated
> threads to their respective NUMA nodes. Those threads take memory from
> the other nodes (in round robin with a incoming queue for each remote
> node) and they move that memory to their local node.
>
> Signed-off-by: Andrea Arcangeli <[email protected]>
> ---
> include/linux/kthread.h | 1 +
> include/linux/sched.h | 2 +-
> kernel/kthread.c | 23 +++++++++++++++++++++++
> 3 files changed, 25 insertions(+), 1 deletions(-)
>
> diff --git a/include/linux/kthread.h b/include/linux/kthread.h
> index 0714b24..e733f97 100644
> --- a/include/linux/kthread.h
> +++ b/include/linux/kthread.h
> @@ -33,6 +33,7 @@ struct task_struct *kthread_create_on_node(int (*threadfn)(void *data),
> })
>
> void kthread_bind(struct task_struct *k, unsigned int cpu);
> +void kthread_bind_node(struct task_struct *p, int nid);
> int kthread_stop(struct task_struct *k);
> int kthread_should_stop(void);
> bool kthread_freezable_should_stop(bool *was_frozen);
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 4059c0f..699324c 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1792,7 +1792,7 @@ extern void thread_group_times(struct task_struct *p, cputime_t *ut, cputime_t *
> #define PF_SWAPWRITE 0x00800000 /* Allowed to write to swap */
> #define PF_SPREAD_PAGE 0x01000000 /* Spread page cache over cpuset */
> #define PF_SPREAD_SLAB 0x02000000 /* Spread some slab caches over cpuset */
> -#define PF_THREAD_BOUND 0x04000000 /* Thread bound to specific cpu */
> +#define PF_THREAD_BOUND 0x04000000 /* Thread bound to specific cpus */
> #define PF_MCE_EARLY 0x08000000 /* Early kill for mce process policy */
> #define PF_MEMPOLICY 0x10000000 /* Non-default NUMA mempolicy */
> #define PF_MUTEX_TESTER 0x20000000 /* Thread belongs to the rt mutex tester */
> diff --git a/kernel/kthread.c b/kernel/kthread.c
> index 3d3de63..48b36f9 100644
> --- a/kernel/kthread.c
> +++ b/kernel/kthread.c
> @@ -234,6 +234,29 @@ void kthread_bind(struct task_struct *p, unsigned int cpu)
> EXPORT_SYMBOL(kthread_bind);
>
> /**
> + * kthread_bind_node - bind a just-created kthread to the CPUs of a node.
> + * @p: thread created by kthread_create().
> + * @nid: node (might not be online, must be possible) for @k to run on.
> + *
> + * Description: This function is equivalent to set_cpus_allowed(),
> + * except that @nid doesn't need to be online, and the thread must be
> + * stopped (i.e., just returned from kthread_create()).
> + */
> +void kthread_bind_node(struct task_struct *p, int nid)
> +{
> + /* Must have done schedule() in kthread() before we set_task_cpu */
> + if (!wait_task_inactive(p, TASK_UNINTERRUPTIBLE)) {
> + WARN_ON(1);
> + return;
> + }
> +
> + /* It's safe because the task is inactive. */
> + do_set_cpus_allowed(p, cpumask_of_node(nid));
> + p->flags |= PF_THREAD_BOUND;
> +}
> +EXPORT_SYMBOL(kthread_bind_node);
_GPL ?
> +
> +/**
> * kthread_stop - stop a thread created by kthread_create().
> * @k: thread created by kthread_create().
> *
>
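For context, I would expect the per-node callers to look roughly like this
(hypothetical sketch based on the changelog; knuma_migrated_fn and the
error handling are made up):

static int __init start_knuma_migrated(int nid)
{
	struct task_struct *tsk;

	tsk = kthread_create_on_node(knuma_migrated_fn, NODE_DATA(nid),
				     nid, "knuma_migrated%d", nid);
	if (IS_ERR(tsk))
		return PTR_ERR(tsk);

	/* pin the daemon to its node's CPUs and keep root from undoing it */
	kthread_bind_node(tsk, nid);
	wake_up_process(tsk);
	return 0;
}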
On Thu, Jun 28, 2012 at 02:55:51PM +0200, Andrea Arcangeli wrote:
> These flags are the ones tweaked through sysfs, they control the
> behavior of autonuma, from enabling disabling it, to selecting various
> runtime options.
>
> Signed-off-by: Andrea Arcangeli <[email protected]>
> ---
> include/linux/autonuma_flags.h | 62 ++++++++++++++++++++++++++++++++++++++++
> 1 files changed, 62 insertions(+), 0 deletions(-)
> create mode 100644 include/linux/autonuma_flags.h
>
> diff --git a/include/linux/autonuma_flags.h b/include/linux/autonuma_flags.h
> new file mode 100644
> index 0000000..5e29a75
> --- /dev/null
> +++ b/include/linux/autonuma_flags.h
> @@ -0,0 +1,62 @@
> +#ifndef _LINUX_AUTONUMA_FLAGS_H
> +#define _LINUX_AUTONUMA_FLAGS_H
> +
> +enum autonuma_flag {
These aren't really flags. They are bit-fields.
> + AUTONUMA_FLAG,
Looking at the code, this is to turn it on. Perhaps a better name such
as: AUTONUMA_ACTIVE_FLAG ?
> + AUTONUMA_IMPOSSIBLE_FLAG,
> + AUTONUMA_DEBUG_FLAG,
> + AUTONUMA_SCHED_LOAD_BALANCE_STRICT_FLAG,
I might have gotten my math wrong, but if you have
AUTONUMA_SCHED_LOAD_BALANCE... set (so 3), that also means
that bits 0 and 1 are on. In other words AUTONUMA_FLAG
and AUTONUMA_IMPOSSIBLE_FLAG are turned on.
> + AUTONUMA_SCHED_CLONE_RESET_FLAG,
> + AUTONUMA_SCHED_FORK_RESET_FLAG,
> + AUTONUMA_SCAN_PMD_FLAG,
So this is 6, which is 110 in binary. So AUTONUMA_FLAG
gets turned off.
You definitely want to convert these to #defines or
at least define the proper numbers.
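Something along these lines, for instance (illustrative only, keeping
them as bit numbers since they are consumed by test_bit()):

enum autonuma_flag {
	AUTONUMA_FLAG				= 0,
	AUTONUMA_IMPOSSIBLE_FLAG		= 1,
	AUTONUMA_DEBUG_FLAG			= 2,
	AUTONUMA_SCHED_LOAD_BALANCE_STRICT_FLAG	= 3,
	AUTONUMA_SCHED_CLONE_RESET_FLAG		= 4,
	AUTONUMA_SCHED_FORK_RESET_FLAG		= 5,
	AUTONUMA_SCAN_PMD_FLAG			= 6,
	AUTONUMA_SCAN_USE_WORKING_SET_FLAG	= 7,
	AUTONUMA_MIGRATE_DEFER_FLAG		= 8,
};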
> + AUTONUMA_SCAN_USE_WORKING_SET_FLAG,
> + AUTONUMA_MIGRATE_DEFER_FLAG,
> +};
> +
> +extern unsigned long autonuma_flags;
> +
> +static inline bool autonuma_enabled(void)
> +{
> + return !!test_bit(AUTONUMA_FLAG, &autonuma_flags);
> +}
> +
> +static inline bool autonuma_debug(void)
> +{
> + return !!test_bit(AUTONUMA_DEBUG_FLAG, &autonuma_flags);
> +}
> +
> +static inline bool autonuma_sched_load_balance_strict(void)
> +{
> + return !!test_bit(AUTONUMA_SCHED_LOAD_BALANCE_STRICT_FLAG,
> + &autonuma_flags);
> +}
> +
> +static inline bool autonuma_sched_clone_reset(void)
> +{
> + return !!test_bit(AUTONUMA_SCHED_CLONE_RESET_FLAG,
> + &autonuma_flags);
> +}
> +
> +static inline bool autonuma_sched_fork_reset(void)
> +{
> + return !!test_bit(AUTONUMA_SCHED_FORK_RESET_FLAG,
> + &autonuma_flags);
> +}
> +
> +static inline bool autonuma_scan_pmd(void)
> +{
> + return !!test_bit(AUTONUMA_SCAN_PMD_FLAG, &autonuma_flags);
> +}
> +
> +static inline bool autonuma_scan_use_working_set(void)
> +{
> + return !!test_bit(AUTONUMA_SCAN_USE_WORKING_SET_FLAG,
> + &autonuma_flags);
> +}
> +
> +static inline bool autonuma_migrate_defer(void)
> +{
> + return !!test_bit(AUTONUMA_MIGRATE_DEFER_FLAG, &autonuma_flags);
> +}
> +
> +#endif /* _LINUX_AUTONUMA_FLAGS_H */
>
On Thu, Jun 28, 2012 at 02:55:58PM +0200, Andrea Arcangeli wrote:
> This resets all per-thread and per-process statistics across exec
> syscalls or after kernel threads detached from the mm. The past
> statistical NUMA information is unlikely to be relevant for the future
> in these cases.
The previous patch mentioned that it can run in bypass mode. Is this
also able to do so, meaning that these calls end up as no-ops?
Thanks!
>
> Signed-off-by: Andrea Arcangeli <[email protected]>
> ---
> fs/exec.c | 3 +++
> mm/mmu_context.c | 2 ++
> 2 files changed, 5 insertions(+), 0 deletions(-)
>
> diff --git a/fs/exec.c b/fs/exec.c
> index da27b91..146ced2 100644
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -55,6 +55,7 @@
> #include <linux/pipe_fs_i.h>
> #include <linux/oom.h>
> #include <linux/compat.h>
> +#include <linux/autonuma.h>
>
> #include <asm/uaccess.h>
> #include <asm/mmu_context.h>
> @@ -1172,6 +1173,8 @@ void setup_new_exec(struct linux_binprm * bprm)
>
> flush_signal_handlers(current, 0);
> flush_old_files(current->files);
> +
> + autonuma_setup_new_exec(current);
> }
> EXPORT_SYMBOL(setup_new_exec);
>
> diff --git a/mm/mmu_context.c b/mm/mmu_context.c
> index 3dcfaf4..40f0f13 100644
> --- a/mm/mmu_context.c
> +++ b/mm/mmu_context.c
> @@ -7,6 +7,7 @@
> #include <linux/mmu_context.h>
> #include <linux/export.h>
> #include <linux/sched.h>
> +#include <linux/autonuma.h>
>
> #include <asm/mmu_context.h>
>
> @@ -58,5 +59,6 @@ void unuse_mm(struct mm_struct *mm)
> /* active_mm is still 'mm' */
> enter_lazy_tlb(mm, tsk);
> task_unlock(tsk);
> + autonuma_setup_new_exec(tsk);
> }
> EXPORT_SYMBOL_GPL(unuse_mm);
>
On Thu, Jun 28, 2012 at 02:55:51PM +0200, Andrea Arcangeli wrote:
> These flags are the ones tweaked through sysfs, they control the
> behavior of autonuma, from enabling disabling it, to selecting various
> runtime options.
>
> Signed-off-by: Andrea Arcangeli <[email protected]>
> ---
> include/linux/autonuma_flags.h | 62 ++++++++++++++++++++++++++++++++++++++++
> 1 files changed, 62 insertions(+), 0 deletions(-)
> create mode 100644 include/linux/autonuma_flags.h
>
> diff --git a/include/linux/autonuma_flags.h b/include/linux/autonuma_flags.h
> new file mode 100644
> index 0000000..5e29a75
> --- /dev/null
> +++ b/include/linux/autonuma_flags.h
> @@ -0,0 +1,62 @@
> +#ifndef _LINUX_AUTONUMA_FLAGS_H
> +#define _LINUX_AUTONUMA_FLAGS_H
> +
> +enum autonuma_flag {
> + AUTONUMA_FLAG,
> + AUTONUMA_IMPOSSIBLE_FLAG,
> + AUTONUMA_DEBUG_FLAG,
> + AUTONUMA_SCHED_LOAD_BALANCE_STRICT_FLAG,
> + AUTONUMA_SCHED_CLONE_RESET_FLAG,
> + AUTONUMA_SCHED_FORK_RESET_FLAG,
> + AUTONUMA_SCAN_PMD_FLAG,
> + AUTONUMA_SCAN_USE_WORKING_SET_FLAG,
> + AUTONUMA_MIGRATE_DEFER_FLAG,
> +};
> +
> +extern unsigned long autonuma_flags;
I could not find this variable in the preceding patches.
Which patch actually uses it?
Also, is there a way to keep the AutoNUMA framework
from initializing at all? Hold that thought, it probably
is in some of the other patches.
> +
> +static inline bool autonuma_enabled(void)
> +{
> + return !!test_bit(AUTONUMA_FLAG, &autonuma_flags);
> +}
> +
> +static inline bool autonuma_debug(void)
> +{
> + return !!test_bit(AUTONUMA_DEBUG_FLAG, &autonuma_flags);
> +}
> +
> +static inline bool autonuma_sched_load_balance_strict(void)
> +{
> + return !!test_bit(AUTONUMA_SCHED_LOAD_BALANCE_STRICT_FLAG,
> + &autonuma_flags);
> +}
> +
> +static inline bool autonuma_sched_clone_reset(void)
> +{
> + return !!test_bit(AUTONUMA_SCHED_CLONE_RESET_FLAG,
> + &autonuma_flags);
> +}
> +
> +static inline bool autonuma_sched_fork_reset(void)
> +{
> + return !!test_bit(AUTONUMA_SCHED_FORK_RESET_FLAG,
> + &autonuma_flags);
> +}
> +
> +static inline bool autonuma_scan_pmd(void)
> +{
> + return !!test_bit(AUTONUMA_SCAN_PMD_FLAG, &autonuma_flags);
> +}
> +
> +static inline bool autonuma_scan_use_working_set(void)
> +{
> + return !!test_bit(AUTONUMA_SCAN_USE_WORKING_SET_FLAG,
> + &autonuma_flags);
> +}
> +
> +static inline bool autonuma_migrate_defer(void)
> +{
> + return !!test_bit(AUTONUMA_MIGRATE_DEFER_FLAG, &autonuma_flags);
> +}
> +
> +#endif /* _LINUX_AUTONUMA_FLAGS_H */
>
On Thu, Jun 28, 2012 at 02:55:59PM +0200, Andrea Arcangeli wrote:
> This is where the dynamically allocated sched_autonuma structure is
> being handled.
>
> The reason for keeping this outside of the task_struct besides not
> using too much kernel stack, is to only allocate it on NUMA
> hardware. So the not NUMA hardware only pays the memory of a pointer
> in the kernel stack (which remains NULL at all times in that case).
.. snip..
> + if (unlikely(alloc_task_autonuma(tsk, orig, node)))
> + /* free_thread_info() undoes arch_dup_task_struct() too */
> + goto out_thread_info;
>
That looks (without seeing the implementation, and just from reading the
git commit) like it would fail on non-NUMA machines - and end up stopping
the creation of a task.
Perhaps a better name for the function: alloc_always_task_autonuma,
since the function (at least from the description of this patch) will
always succeed. Perhaps even remove the "if (unlikely(..))" bit?
On Thu, Jun 28, 2012 at 02:56:00PM +0200, Andrea Arcangeli wrote:
> This is where the mm_autonuma structure is being handled. Just like
> sched_autonuma, this is only allocated at runtime if the hardware the
> kernel is running on has been detected as NUMA. On not NUMA hardware
I think the correct wording is "non-NUMA", not "not NUMA".
> the memory cost is reduced to one pointer per mm.
>
> To get rid of the pointer in the each mm, the kernel can be compiled
> with CONFIG_AUTONUMA=n.
>
> Signed-off-by: Andrea Arcangeli <[email protected]>
> ---
> kernel/fork.c | 7 +++++++
> 1 files changed, 7 insertions(+), 0 deletions(-)
>
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 0adbe09..3e5a0d9 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -527,6 +527,8 @@ static void mm_init_aio(struct mm_struct *mm)
>
> static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p)
> {
> + if (unlikely(alloc_mm_autonuma(mm)))
> + goto out_free_mm;
Reading that, I would think that on non-NUMA machines this would fail
(since there is nothing to allocate). But that is not the case (I hope!?).
Perhaps just make the function not return any value?
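Presumably it short-circuits on non-NUMA hardware along these lines
(pure guesswork on my part; the cache name is made up):

int alloc_mm_autonuma(struct mm_struct *mm)
{
	if (autonuma_impossible()) {
		mm->mm_autonuma = NULL;	/* nothing to allocate, cannot fail */
		return 0;
	}
	mm->mm_autonuma = kmem_cache_zalloc(mm_autonuma_cachep, GFP_KERNEL);
	return mm->mm_autonuma ? 0 : -ENOMEM;
}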
On Thu, Jun 28, 2012 at 02:56:16PM +0200, Andrea Arcangeli wrote:
> Move the AutoNUMA per page information from the "struct page" to a
> separate page_autonuma data structure allocated in the memsection
> (with sparsemem) or in the pgdat (with flatmem).
>
> This is done to avoid growing the size of the "struct page" and the
> page_autonuma data is only allocated if the kernel has been booted on
> real NUMA hardware (or if noautonuma is passed as parameter to the
> kernel).
>
> Signed-off-by: Andrea Arcangeli <[email protected]>
> ---
> include/linux/autonuma.h | 18 +++-
> include/linux/autonuma_flags.h | 6 +
> include/linux/autonuma_types.h | 55 ++++++++++
> include/linux/mm_types.h | 26 -----
> include/linux/mmzone.h | 14 +++-
> include/linux/page_autonuma.h | 53 +++++++++
> init/main.c | 2 +
> mm/Makefile | 2 +-
> mm/autonuma.c | 98 ++++++++++-------
> mm/huge_memory.c | 26 +++--
> mm/page_alloc.c | 21 +---
> mm/page_autonuma.c | 234 ++++++++++++++++++++++++++++++++++++++++
> mm/sparse.c | 126 ++++++++++++++++++++-
> 13 files changed, 577 insertions(+), 104 deletions(-)
> create mode 100644 include/linux/page_autonuma.h
> create mode 100644 mm/page_autonuma.c
>
> diff --git a/include/linux/autonuma.h b/include/linux/autonuma.h
> index 85ca5eb..67af86a 100644
> --- a/include/linux/autonuma.h
> +++ b/include/linux/autonuma.h
> @@ -7,15 +7,26 @@
>
> extern void autonuma_enter(struct mm_struct *mm);
> extern void autonuma_exit(struct mm_struct *mm);
> -extern void __autonuma_migrate_page_remove(struct page *page);
> +extern void __autonuma_migrate_page_remove(struct page *,
> + struct page_autonuma *);
> extern void autonuma_migrate_split_huge_page(struct page *page,
> struct page *page_tail);
> extern void autonuma_setup_new_exec(struct task_struct *p);
> +extern struct page_autonuma *lookup_page_autonuma(struct page *page);
>
> static inline void autonuma_migrate_page_remove(struct page *page)
> {
> - if (ACCESS_ONCE(page->autonuma_migrate_nid) >= 0)
> - __autonuma_migrate_page_remove(page);
> + struct page_autonuma *page_autonuma = lookup_page_autonuma(page);
> + if (ACCESS_ONCE(page_autonuma->autonuma_migrate_nid) >= 0)
> + __autonuma_migrate_page_remove(page, page_autonuma);
> +}
> +
> +static inline void autonuma_free_page(struct page *page)
> +{
> + if (!autonuma_impossible()) {
I think you would be better off using a different name.
Perhaps 'if (autonuma_on())'?
> + autonuma_migrate_page_remove(page);
> + lookup_page_autonuma(page)->autonuma_last_nid = -1;
> + }
> }
>
> #define autonuma_printk(format, args...) \
> @@ -29,6 +40,7 @@ static inline void autonuma_migrate_page_remove(struct page *page) {}
> static inline void autonuma_migrate_split_huge_page(struct page *page,
> struct page *page_tail) {}
> static inline void autonuma_setup_new_exec(struct task_struct *p) {}
> +static inline void autonuma_free_page(struct page *page) {}
>
> #endif /* CONFIG_AUTONUMA */
>
> diff --git a/include/linux/autonuma_flags.h b/include/linux/autonuma_flags.h
> index 5e29a75..035d993 100644
> --- a/include/linux/autonuma_flags.h
> +++ b/include/linux/autonuma_flags.h
> @@ -15,6 +15,12 @@ enum autonuma_flag {
>
> extern unsigned long autonuma_flags;
>
> +static inline bool autonuma_impossible(void)
> +{
> + return num_possible_nodes() <= 1 ||
> + test_bit(AUTONUMA_IMPOSSIBLE_FLAG, &autonuma_flags);
> +}
> +
> static inline bool autonuma_enabled(void)
> {
> return !!test_bit(AUTONUMA_FLAG, &autonuma_flags);
> diff --git a/include/linux/autonuma_types.h b/include/linux/autonuma_types.h
> index 9e697e3..1e860f6 100644
> --- a/include/linux/autonuma_types.h
> +++ b/include/linux/autonuma_types.h
> @@ -39,6 +39,61 @@ struct task_autonuma {
> unsigned long task_numa_fault[0];
> };
>
> +/*
> + * Per page (or per-pageblock) structure dynamically allocated only if
> + * autonuma is not impossible.
not impossible? So possible?
> + */
> +struct page_autonuma {
> + /*
> + * To modify autonuma_last_nid lockless the architecture,
> + * needs SMP atomic granularity < sizeof(long), not all archs
> + * have that, notably some ancient alpha (but none of those
> + * should run in NUMA systems). Archs without that requires
> + * autonuma_last_nid to be a long.
> + */
> +#if BITS_PER_LONG > 32
> + /*
> + * autonuma_migrate_nid is -1 if the page_autonuma structure
> + * is not linked into any
> + * pgdat->autonuma_migrate_head. Otherwise it means the
> + * page_autonuma structure is linked into the
> + * &NODE_DATA(autonuma_migrate_nid)->autonuma_migrate_head[page_nid].
> + * page_nid is the nid that the page (referenced by the
> + * page_autonuma structure) belongs to.
> + */
> + int autonuma_migrate_nid;
> + /*
> + * autonuma_last_nid records which is the NUMA nid that tried
> + * to access this page at the last NUMA hinting page fault.
> + * If it changed, AutoNUMA will not try to migrate the page to
> + * the nid where the thread is running on and to the contrary,
> + * it will make different threads trashing on the same pages,
> + * converge on the same NUMA node (if possible).
> + */
> + int autonuma_last_nid;
> +#else
> +#if MAX_NUMNODES >= 32768
> +#error "too many nodes"
> +#endif
> + short autonuma_migrate_nid;
> + short autonuma_last_nid;
> +#endif
> + /*
> + * This is the list node that links the page (referenced by
> + * the page_autonuma structure) in the
> + * &NODE_DATA(dst_nid)->autonuma_migrate_head[page_nid] lru.
> + */
> + struct list_head autonuma_migrate_node;
> +
> + /*
> + * To find the page starting from the autonuma_migrate_node we
> + * need a backlink.
> + *
> + * FIXME: drop it;
> + */
> + struct page *page;
> +};
> +
> extern int alloc_task_autonuma(struct task_struct *tsk,
> struct task_struct *orig,
> int node);
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index d1248cf..f0c6379 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -136,32 +136,6 @@ struct page {
> struct page *first_page; /* Compound tail pages */
> };
>
> -#ifdef CONFIG_AUTONUMA
> - /*
> - * FIXME: move to pgdat section along with the memcg and allocate
> - * at runtime only in presence of a numa system.
> - */
> - /*
> - * To modify autonuma_last_nid lockless the architecture,
> - * needs SMP atomic granularity < sizeof(long), not all archs
> - * have that, notably some ancient alpha (but none of those
> - * should run in NUMA systems). Archs without that requires
> - * autonuma_last_nid to be a long.
> - */
> -#if BITS_PER_LONG > 32
> - int autonuma_migrate_nid;
> - int autonuma_last_nid;
> -#else
> -#if MAX_NUMNODES >= 32768
> -#error "too many nodes"
> -#endif
> - /* FIXME: remember to check the updates are atomic */
> - short autonuma_migrate_nid;
> - short autonuma_last_nid;
> -#endif
> - struct list_head autonuma_migrate_node;
> -#endif
> -
> /*
> * On machines where all RAM is mapped into kernel address space,
> * we can simply calculate the virtual address. On machines with
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index d53b26a..e66da74 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -698,10 +698,13 @@ typedef struct pglist_data {
> int kswapd_max_order;
> enum zone_type classzone_idx;
> #ifdef CONFIG_AUTONUMA
> - spinlock_t autonuma_lock;
> +#if !defined(CONFIG_SPARSEMEM)
> + struct page_autonuma *node_page_autonuma;
> +#endif
> struct list_head autonuma_migrate_head[MAX_NUMNODES];
> unsigned long autonuma_nr_migrate_pages;
> wait_queue_head_t autonuma_knuma_migrated_wait;
> + spinlock_t autonuma_lock;
> #endif
> } pg_data_t;
>
> @@ -1064,6 +1067,15 @@ struct mem_section {
> * section. (see memcontrol.h/page_cgroup.h about this.)
> */
> struct page_cgroup *page_cgroup;
> +#endif
> +#ifdef CONFIG_AUTONUMA
> + /*
> + * If !SPARSEMEM, pgdat doesn't have page_autonuma pointer. We use
> + * section.
> + */
> + struct page_autonuma *section_page_autonuma;
> +#endif
> +#if defined(CONFIG_CGROUP_MEM_RES_CTLR) ^ defined(CONFIG_AUTONUMA)
> unsigned long pad;
> #endif
> };
> diff --git a/include/linux/page_autonuma.h b/include/linux/page_autonuma.h
> new file mode 100644
> index 0000000..d748aa2
> --- /dev/null
> +++ b/include/linux/page_autonuma.h
> @@ -0,0 +1,53 @@
> +#ifndef _LINUX_PAGE_AUTONUMA_H
> +#define _LINUX_PAGE_AUTONUMA_H
> +
> +#if defined(CONFIG_AUTONUMA) && !defined(CONFIG_SPARSEMEM)
> +extern void __init page_autonuma_init_flatmem(void);
> +#else
> +static inline void __init page_autonuma_init_flatmem(void) {}
> +#endif
> +
> +#ifdef CONFIG_AUTONUMA
> +
> +#include <linux/autonuma_flags.h>
> +
> +extern void __meminit page_autonuma_map_init(struct page *page,
> + struct page_autonuma *page_autonuma,
> + int nr_pages);
> +
> +#ifdef CONFIG_SPARSEMEM
> +#define PAGE_AUTONUMA_SIZE (sizeof(struct page_autonuma))
> +#define SECTION_PAGE_AUTONUMA_SIZE (PAGE_AUTONUMA_SIZE * \
> + PAGES_PER_SECTION)
> +#endif
> +
> +extern void __meminit pgdat_autonuma_init(struct pglist_data *);
> +
> +#else /* CONFIG_AUTONUMA */
> +
> +#ifdef CONFIG_SPARSEMEM
> +struct page_autonuma;
> +#define PAGE_AUTONUMA_SIZE 0
> +#define SECTION_PAGE_AUTONUMA_SIZE 0
> +
> +#define autonuma_impossible() true
> +
> +#endif
> +
> +static inline void pgdat_autonuma_init(struct pglist_data *pgdat) {}
> +
> +#endif /* CONFIG_AUTONUMA */
> +
> +#ifdef CONFIG_SPARSEMEM
> +extern struct page_autonuma * __meminit __kmalloc_section_page_autonuma(int nid,
> + unsigned long nr_pages);
> +extern void __kfree_section_page_autonuma(struct page_autonuma *page_autonuma,
> + unsigned long nr_pages);
> +extern void __init sparse_early_page_autonuma_alloc_node(struct page_autonuma **page_autonuma_map,
> + unsigned long pnum_begin,
> + unsigned long pnum_end,
> + unsigned long map_count,
> + int nodeid);
> +#endif
> +
> +#endif /* _LINUX_PAGE_AUTONUMA_H */
> diff --git a/init/main.c b/init/main.c
> index b5cc0a7..070a377 100644
> --- a/init/main.c
> +++ b/init/main.c
> @@ -68,6 +68,7 @@
> #include <linux/shmem_fs.h>
> #include <linux/slab.h>
> #include <linux/perf_event.h>
> +#include <linux/page_autonuma.h>
>
> #include <asm/io.h>
> #include <asm/bugs.h>
> @@ -455,6 +456,7 @@ static void __init mm_init(void)
> * bigger than MAX_ORDER unless SPARSEMEM.
> */
> page_cgroup_init_flatmem();
> + page_autonuma_init_flatmem();
> mem_init();
> kmem_cache_init();
> percpu_init_late();
> diff --git a/mm/Makefile b/mm/Makefile
> index 15900fd..a4d8354 100644
> --- a/mm/Makefile
> +++ b/mm/Makefile
> @@ -33,7 +33,7 @@ obj-$(CONFIG_FRONTSWAP) += frontswap.o
> obj-$(CONFIG_HAS_DMA) += dmapool.o
> obj-$(CONFIG_HUGETLBFS) += hugetlb.o
> obj-$(CONFIG_NUMA) += mempolicy.o
> -obj-$(CONFIG_AUTONUMA) += autonuma.o
> +obj-$(CONFIG_AUTONUMA) += autonuma.o page_autonuma.o
> obj-$(CONFIG_SPARSEMEM) += sparse.o
> obj-$(CONFIG_SPARSEMEM_VMEMMAP) += sparse-vmemmap.o
> obj-$(CONFIG_SLOB) += slob.o
> diff --git a/mm/autonuma.c b/mm/autonuma.c
> index f44272b..ec4d492 100644
> --- a/mm/autonuma.c
> +++ b/mm/autonuma.c
> @@ -51,12 +51,6 @@ static struct knumad_scan {
> .mm_head = LIST_HEAD_INIT(knumad_scan.mm_head),
> };
>
> -static inline bool autonuma_impossible(void)
> -{
> - return num_possible_nodes() <= 1 ||
> - test_bit(AUTONUMA_IMPOSSIBLE_FLAG, &autonuma_flags);
> -}
> -
> static inline void autonuma_migrate_lock(int nid)
> {
> spin_lock(&NODE_DATA(nid)->autonuma_lock);
> @@ -82,54 +76,63 @@ void autonuma_migrate_split_huge_page(struct page *page,
> struct page *page_tail)
> {
> int nid, last_nid;
> + struct page_autonuma *page_autonuma, *page_tail_autonuma;
>
> - nid = page->autonuma_migrate_nid;
> + if (autonuma_impossible())
Would it be better to just call it 'autonuma_off()'?
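To spell out the rename, reusing the exact logic of the
autonuma_impossible() helper quoted earlier (only the names change here;
AUTONUMA_IMPOSSIBLE_FLAG would presumably want a matching rename as
well):

	/* sketch of the suggested rename, same logic as autonuma_impossible() */
	static inline bool autonuma_off(void)
	{
		return num_possible_nodes() <= 1 ||
			test_bit(AUTONUMA_IMPOSSIBLE_FLAG, &autonuma_flags);
	}

	static inline bool autonuma_on(void)
	{
		return !autonuma_off();
	}

Call sites then read as 'if (autonuma_off())' or 'if (autonuma_on())'
instead of negating 'autonuma_impossible()'.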
On Sat, 2012-06-30 at 14:58 +0800, Nai Xia wrote:
> If you insist on ignoring any constructive suggestions from others,
But there is nothing constructive about your criticism.
You are basically saying that the whole thing cannot work unless it's
based on 20 years of research. Duh !
Ben.
On 2012-07-01 07:55, Benjamin Herrenschmidt wrote:
> On Sat, 2012-06-30 at 14:58 +0800, Nai Xia wrote:
>> If you insist on ignoring any constructive suggestions from others,
>
> But there is nothing constructive about your criticism.
>
> You are basically saying that the whole thing cannot work unless it's
> based on 20 years of research. Duh !
1. You are quoting me wrong: I said "group all pages to one node" is
correct, and very likely plays the major role in your benchmarks.
Sampling is completely broken from my point of view. PZ's patch is
built on a similar "group all pages to one node" idea, which I think
is also correct.
2. My suggestion to Andrea: do some more comparative benchmarks to
see what is really happening inside, instead of only macro benchmarks.
A new algorithm needs something like 20 hours of carefully designed
survey research, not 20 minutes spent reading my mail before reaching
a conclusion.
If you cannot see the constructiveness of my suggestion, that is
your problem, not mine.
I understand the hard feelings of seeing possible brokenness in a
thing you have already spent a lot of time on. But that is how people
seeking the truth have to work.
You see, you guys have spent quite some time defending your points;
if that time had been used to follow my advice and do some further
analysis, maybe you would already have some valuable information.
Dor was right, we have all made our points. And we are all busy.
Let's stop it. Thanks.
>
> Ben.
>
>