Hi,
This is the latest iteration of our numa/core tree, which
implements adaptive NUMA affinity balancing.
Changes in this version:
https://lkml.org/lkml/2012/11/12/315
Performance figures:
https://lkml.org/lkml/2012/11/12/330
Any review feedback, comments and test results are welcome!
For testing purposes I'd suggest using the latest tip:master
integration tree, which has the latest numa/core tree merged:
git pull git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git master
(But you can also use the tip:numa/core tree directly.)
Thanks,
Ingo
----------------------->
Andrea Arcangeli (1):
numa, mm: Support NUMA hinting page faults from gup/gup_fast
Gerald Schaefer (1):
sched, numa, mm, s390/thp: Implement pmd_pgprot() for s390
Ingo Molnar (3):
mm/pgprot: Move the pgprot_modify() fallback definition to mm.h
sched, mm, x86: Add the ARCH_SUPPORTS_NUMA_BALANCING flag
mm: Allow the migration of shared pages
Lee Schermerhorn (3):
mm/mpol: Add MPOL_MF_NOOP
mm/mpol: Check for misplaced page
mm/mpol: Add MPOL_MF_LAZY
Peter Zijlstra (16):
sched, numa, mm: Make find_busiest_queue() a method
sched, numa, mm: Describe the NUMA scheduling problem formally
mm/thp: Preserve pgprot across huge page split
mm/mpol: Make MPOL_LOCAL a real policy
mm/mpol: Create special PROT_NONE infrastructure
mm/migrate: Introduce migrate_misplaced_page()
mm/mpol: Use special PROT_NONE to migrate pages
sched, numa, mm: Introduce sched_feat_numa()
sched, numa, mm: Implement THP migration
sched, numa, mm: Add last_cpu to page flags
sched, numa, mm, arch: Add variable locality exception
sched, numa, mm: Add the scanning page fault machinery
sched, numa, mm: Add adaptive NUMA affinity support
sched, numa, mm: Implement constant, per task Working Set Sampling (WSS) rate
sched, numa, mm: Count WS scanning against present PTEs, not virtual memory ranges
sched, numa, mm: Implement slow start for working set sampling
Ralf Baechle (1):
sched, numa, mm, MIPS/thp: Add pmd_pgprot() implementation
Rik van Riel (6):
mm/generic: Only flush the local TLB in ptep_set_access_flags()
x86/mm: Only do a local tlb flush in ptep_set_access_flags()
x86/mm: Introduce pte_accessible()
mm: Only flush the TLB when clearing an accessible pte
x86/mm: Completely drop the TLB flush from ptep_set_access_flags()
sched, numa, mm: Add credits for NUMA placement
---
CREDITS | 1 +
Documentation/scheduler/numa-problem.txt | 236 +++++++++++
arch/mips/include/asm/pgtable.h | 2 +
arch/s390/include/asm/pgtable.h | 13 +
arch/sh/mm/Kconfig | 1 +
arch/x86/Kconfig | 1 +
arch/x86/include/asm/pgtable.h | 7 +
arch/x86/mm/pgtable.c | 8 +-
include/asm-generic/pgtable.h | 4 +
include/linux/huge_mm.h | 19 +
include/linux/hugetlb.h | 8 +-
include/linux/init_task.h | 8 +
include/linux/mempolicy.h | 8 +
include/linux/migrate.h | 7 +
include/linux/migrate_mode.h | 3 +
include/linux/mm.h | 122 ++++--
include/linux/mm_types.h | 10 +
include/linux/mmzone.h | 14 +-
include/linux/page-flags-layout.h | 83 ++++
include/linux/sched.h | 46 ++-
include/uapi/linux/mempolicy.h | 16 +-
init/Kconfig | 23 ++
kernel/bounds.c | 2 +
kernel/sched/core.c | 68 +++-
kernel/sched/fair.c | 1032 ++++++++++++++++++++++++++++++++++++++++---------
kernel/sched/features.h | 8 +
kernel/sched/sched.h | 38 +-
kernel/sysctl.c | 45 ++-
mm/huge_memory.c | 253 +++++++++---
mm/hugetlb.c | 10 +-
mm/memory.c | 129 ++++++-
mm/mempolicy.c | 206 ++++++++--
mm/migrate.c | 81 +++-
mm/mprotect.c | 64 ++-
mm/pgtable-generic.c | 9 +-
35 files changed, 2200 insertions(+), 385 deletions(-)
create mode 100644 Documentation/scheduler/numa-problem.txt
create mode 100644 include/linux/page-flags-layout.h
From: Rik van Riel <[email protected]>
Because we only ever upgrade a PTE when calling ptep_set_access_flags(),
it is safe to skip flushing entries on remote TLBs.
The worst that can happen is a spurious page fault on other CPUs, which
would flush that TLB entry.
Lazily letting another CPU incur a spurious page fault occasionally
is (much!) cheaper than aggressively flushing everybody else's TLB.
Signed-off-by: Rik van Riel <[email protected]>
Acked-by: Linus Torvalds <[email protected]>
Acked-by: Peter Zijlstra <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Michel Lespinasse <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
---
arch/x86/mm/pgtable.c | 9 ++++++++-
1 file changed, 8 insertions(+), 1 deletion(-)
diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
index 8573b83..be3bb46 100644
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -301,6 +301,13 @@ void pgd_free(struct mm_struct *mm, pgd_t *pgd)
free_page((unsigned long)pgd);
}
+/*
+ * Used to set accessed or dirty bits in the page table entries
+ * on other architectures. On x86, the accessed and dirty bits
+ * are tracked by hardware. However, do_wp_page calls this function
+ * to also make the pte writeable at the same time the dirty bit is
+ * set. In that case we do actually need to write the PTE.
+ */
int ptep_set_access_flags(struct vm_area_struct *vma,
unsigned long address, pte_t *ptep,
pte_t entry, int dirty)
@@ -310,7 +317,7 @@ int ptep_set_access_flags(struct vm_area_struct *vma,
if (changed && dirty) {
*ptep = entry;
pte_update_defer(vma->vm_mm, address, ptep);
- flush_tlb_page(vma, address);
+ __flush_tlb_one(address);
}
return changed;
--
1.7.11.7
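A small standalone illustration of the 'upgrade only' argument in the
changelog above (not kernel code; the permission bits are made up): if the
new entry is a strict superset of the old one, a CPU that still caches the
old entry can only see fewer permissions than it is entitled to, so the
worst case is a spurious fault rather than a wrong access.

/* upgrade_sketch.c -- illustration only; these are not real x86 PTE bits */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define P_PRESENT  0x1u
#define P_WRITE    0x2u
#define P_DIRTY    0x4u

/* true if 'new' only adds bits relative to 'old' */
static bool is_upgrade(uint32_t old, uint32_t new)
{
        return (old & ~new) == 0;
}

int main(void)
{
        uint32_t old = P_PRESENT;                       /* read-only, clean */
        uint32_t new = P_PRESENT | P_WRITE | P_DIRTY;   /* made writable and dirty */

        printf("upgrade-only transition: %s\n", is_upgrade(old, new) ? "yes" : "no");
        return 0;
}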
From: Peter Zijlstra <[email protected]>
This is probably a first: formal description of a complex high-level
computing problem, within the kernel source.
Signed-off-by: Peter Zijlstra <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Mike Galbraith <[email protected]>
Cc: Rik van Riel <[email protected]>
Link: http://lkml.kernel.org/n/[email protected]
[ Next step: generate the kernel source from such formal descriptions and retire to a tropical island! ]
Signed-off-by: Ingo Molnar <[email protected]>
---
Documentation/scheduler/numa-problem.txt | 230 +++++++++++++++++++++++++++++++
1 file changed, 230 insertions(+)
create mode 100644 Documentation/scheduler/numa-problem.txt
diff --git a/Documentation/scheduler/numa-problem.txt b/Documentation/scheduler/numa-problem.txt
new file mode 100644
index 0000000..a5d2fee
--- /dev/null
+++ b/Documentation/scheduler/numa-problem.txt
@@ -0,0 +1,230 @@
+
+
+Effective NUMA scheduling problem statement, described formally:
+
+ * minimize interconnect traffic
+
+For each task 't_i' we have memory, this memory can be spread over multiple
+physical nodes, let us denote this as: 'p_i,k', the memory task 't_i' has on
+node 'k' in [pages].
+
+If a task shares memory with another task let us denote this as:
+'s_i,k', the memory shared between tasks including 't_i' residing on node
+'k'.
+
+Let 'M' be the distribution that governs all 'p' and 's', ie. the page placement.
+
+Similarly, let's define 'fp_i,k' and 'fs_i,k' resp. as the (average) usage
+frequency over those memory regions [1/s] such that the product gives an
+(average) bandwidth 'bp' and 'bs' in [pages/s].
+
+(note: multiple tasks sharing memory naturally avoid duplicate accounting
+ because each task will have its own access frequency 'fs')
+
+(pjt: I think this frequency is more numerically consistent if you explicitly
+ restrict p/s above to be the working-set. (It also makes explicit the
+ requirement for <C0,M0> to change about a change in the working set.)
+
+ Doing this does have the nice property that it lets you use your frequency
+ measurement as a weak-ordering for the benefit a task would receive when
+ we can't fit everything.
+
+ e.g. task1 has working set 10mb, f=90%
+ task2 has working set 90mb, f=10%
+
+ Both are using 9mb/s of bandwidth, but we'd expect a much larger benefit
+ from task1 being on the right node than task2. )
+
+Let 'C' map every task 't_i' to a cpu 'c_i' and its corresponding node 'n_i':
+
+ C: t_i -> {c_i, n_i}
+
+This gives us the total interconnect traffic between nodes 'k' and 'l',
+'T_k,l', as:
+
+ T_k,l = \Sum_i bp_i,l + bs_i,l + \Sum_j bp_j,k + bs_j,k where n_i == k, n_j == l
+
+And our goal is to obtain C0 and M0 such that:
+
+ T_k,l(C0, M0) =< T_k,l(C, M) for all C, M where k != l
+
+(note: we could introduce 'nc(k,l)' as the cost function of accessing memory
+ on node 'l' from node 'k', this would be useful for bigger NUMA systems
+
+ pjt: I agree nice to have, but intuition suggests diminishing returns on more
+ usual systems given factors like Haswell's enormous 35mb l3
+ cache and QPI being able to do a direct fetch.)
+
+(note: do we need a limit on the total memory per node?)
+
+
+ * fairness
+
+For each task 't_i' we have a weight 'w_i' (related to nice), and each cpu
+'c_n' has a compute capacity 'P_n', again, using our map 'C' we can formulate a
+load 'L_n':
+
+ L_n = 1/P_n * \Sum_i w_i for all c_i = n
+
+using that we can formulate a load difference between CPUs
+
+ L_n,m = | L_n - L_m |
+
+Which allows us to state the fairness goal like:
+
+ L_n,m(C0) =< L_n,m(C) for all C, n != m
+
+(pjt: It can also be usefully stated that, having converged at C0:
+
+ | L_n(C0) - L_m(C0) | <= 4/3 * | G_n( U(t_i, t_j) ) - G_m( U(t_i, t_j) ) |
+
+ Where G_n,m is the greedy partition of tasks between L_n and L_m. This is
+ the "worst" partition we should accept; but having it gives us a useful
+ bound on how much we can reasonably adjust L_n/L_m at a Pareto point to
+ favor T_n,m. )
+
+Together they give us the complete multi-objective optimization problem:
+
+ min_C,M [ L_n,m(C), T_k,l(C,M) ]
+
+
+
+Notes:
+
+ - the memory bandwidth problem is very much an inter-process problem, in
+ particular there is no such concept as a process in the above problem.
+
+ - the naive solution would completely prefer fairness over interconnect
+ traffic; the more complicated solution could pick another Pareto point using
+ an aggregate objective function such that we balance the loss of work
+ efficiency against the gain of running. We'd want to more or less suggest
+ there to be a fixed bound on the error from the Pareto line for any
+ such solution.
+
+References:
+
+ http://en.wikipedia.org/wiki/Mathematical_optimization
+ http://en.wikipedia.org/wiki/Multi-objective_optimization
+
+
+* warning, significant hand-waving ahead, improvements welcome *
+
+
+Partial solutions / approximations:
+
+ 1) have task node placement be a pure preference from the 'fairness' pov.
+
+This means we always prefer fairness over interconnect bandwidth. This reduces
+the problem to:
+
+ min_C,M [ T_k,l(C,M) ]
+
+ 2a) migrate memory towards 'n_i' (the task's node).
+
+This creates memory movement such that 'p_i,k for k != n_i' becomes 0 --
+provided 'n_i' stays stable enough and there's sufficient memory (looks like
+we might need memory limits for this).
+
+This does, however, not provide us with any 's_i' (shared) information. It does
+remove 'M', since it defines memory placement in terms of task
+placement.
+
+XXX properties of this M vs a potential optimal
+
+ 2b) migrate memory towards 'n_i' using 2 samples.
+
+This separates pages into those that will migrate and those that will not due
+to the two samples not matching. We could consider the first to be of 'p_i'
+(private) and the second to be of 's_i' (shared).
+
+This interpretation can be motivated by the previously observed property that
+'p_i,k for k != n_i' should become 0 under sufficient memory, leaving only
+'s_i' (shared). (here we lose the need for memory limits again, since it
+becomes indistinguishable from shared).
+
+XXX include the statistical babble on double sampling somewhere near
+
+This reduces the problem further; we lose 'M' as per 2a, and it further reduces
+the 'T_k,l' (interconnect traffic) term to only include shared (since per the
+above all private will be local):
+
+ T_k,l = \Sum_i bs_i,l for every n_i = k, l != k
+
+[ more or less matches the state of sched/numa and describes its remaining
+ problems and assumptions. It should work well for tasks without significant
+ shared memory usage between tasks. ]
+
+Possible future directions:
+
+Motivated by the form of 'T_k,l', try and obtain each term of the sum, so we
+can evaluate it;
+
+ 3a) add per-task per node counters
+
+At fault time, count the number of pages the task faults on for each node.
+This should give an approximation of 'p_i' for the local node and 's_i,k' for
+all remote nodes.
+
+While these numbers provide pages per scan, and so have the unit [pages/s], they
+don't count repeat accesses and thus aren't actually representative of our
+bandwidth numbers.
+
+ 3b) additional frequency term
+
+Additionally (or instead if it turns out we don't need the raw 'p' and 's'
+numbers) we can approximate the repeat accesses by using the time since marking
+the pages as indication of the access frequency.
+
+Let 'I' be the interval of marking pages and 'e' the elapsed time since the
+last marking, then we could estimate the number of accesses 'a' as 'a = I / e'.
+If we then increment the node counters using 'a' instead of 1 we might get
+a better estimate of bandwidth terms.
+
+ 3c) additional averaging; can be applied on top of either a/b.
+
+[ Rik argues that decaying averages on 3a might be sufficient for bandwidth since
+ the decaying avg includes the old accesses and therefore has a measure of repeat
+ accesses.
+
+ Rik also argued that the sample frequency is too low to get accurate access
+ frequency measurements; I'm not entirely convinced, even at low sample
+ frequencies the avg elapsed time 'e' over multiple samples should still
+ give us a fair approximation of the avg access frequency 'a'.
+
+ So doing both b&c has a fair chance of working and allowing us to distinguish
+ between important and less important memory accesses.
+
+ Experimentation has shown no benefit from the added frequency term so far. ]
+
+This will give us 'bp_i' and 'bs_i,k' so that we can approximately compute
+'T_k,l'. Our optimization problem now reads:
+
+ min_C [ \Sum_i bs_i,l for every n_i = k, l != k ]
+
+And includes only shared terms; this makes sense since all task private memory
+will become local as per 2.
+
+This suggests that if there is significant shared memory, we should try and
+move towards it.
+
+ 4) move towards where 'most' memory is
+
+The simplest significance test is comparing the biggest shared 's_i,k' against
+the private 'p_i'. If we have more shared than private, move towards it.
+
+This effectively makes us move towards where most of our memory is and forms a
+feed-back loop with 2. We migrate memory towards us and we migrate towards
+where 'most' memory is.
+
+(Note: even if there were two tasks fully thrashing the same shared memory, it
+ is very rare for there to be a 50/50 split in memory; lacking a perfect
+ split, the smaller will move towards the larger. In case of the perfect
+ split, we'll tie-break towards the lower node number.)
+
+ 5) 'throttle' 4's node placement
+
+Since per 2b our 's_i,k' and 'p_i' require at least two scans to 'stabilize'
+and show representative numbers, we should limit node-migration to not be
+faster than this.
+
+ n) poke holes in previous that require more stuff and describe it.
--
1.7.11.7
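For readers who prefer standard notation, here is a compact LaTeX restatement
of the objective defined in numa-problem.txt above; the symbols are the ones
used in the text (with 'bp'/'bs' written as b^p/b^s), nothing new is added:

  % interconnect traffic between nodes k and l under placement (C, M)
  T_{k,l}(C,M) = \sum_{i:\, n_i = k} \bigl( b^p_{i,l} + b^s_{i,l} \bigr)
               + \sum_{j:\, n_j = l} \bigl( b^p_{j,k} + b^s_{j,k} \bigr)

  % per-cpu load and pairwise load difference (fairness)
  L_n(C) = \frac{1}{P_n} \sum_{i:\, c_i = n} w_i , \qquad
  L_{n,m}(C) = \bigl| L_n(C) - L_m(C) \bigr|

  % the combined multi-objective problem
  \min_{C,M} \; \bigl[ \, L_{n,m}(C), \; T_{k,l}(C,M) \, \bigr]
  \qquad \text{for all } n \neq m, \; k \neq l .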
From: Rik van Riel <[email protected]>
If ptep_clear_flush() is called to clear a page table entry that is
accessible anyway by the CPU, eg. a _PAGE_PROTNONE page table entry,
there is no need to flush the TLB on remote CPUs.
Signed-off-by: Rik van Riel <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Andrew Morton <[email protected]>
Link: http://lkml.kernel.org/n/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
mm/pgtable-generic.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
index d8397da..0c8323f 100644
--- a/mm/pgtable-generic.c
+++ b/mm/pgtable-generic.c
@@ -88,7 +88,8 @@ pte_t ptep_clear_flush(struct vm_area_struct *vma, unsigned long address,
{
pte_t pte;
pte = ptep_get_and_clear((vma)->vm_mm, address, ptep);
- flush_tlb_page(vma, address);
+ if (pte_accessible(pte))
+ flush_tlb_page(vma, address);
return pte;
}
#endif
--
1.7.11.7
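The pte_accessible() helper used above is introduced by the separate
"x86/mm: Introduce pte_accessible()" patch in this series, which is not
quoted in this mail. As a standalone illustration (made-up flag bits, not
the real x86 definition), the idea is that only an entry the hardware can
actually load into the TLB needs a flush when it is cleared:

/* accessible_sketch.c -- illustration only, not the kernel implementation */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define F_PRESENT 0x1u          /* hypothetical "present" flag */

static bool pte_accessible(uint64_t pte)
{
        return pte & F_PRESENT;
}

static void clear_and_maybe_flush(uint64_t *pte)
{
        uint64_t old = *pte;

        *pte = 0;                               /* ptep_get_and_clear() analogue */
        if (pte_accessible(old))
                printf("flush_tlb_page()\n");   /* stand-in for the real flush */
        else
                printf("no flush needed\n");
}

int main(void)
{
        uint64_t present  = F_PRESENT | 0x1000; /* a hardware-visible entry */
        uint64_t protnone = 0x1000;             /* a PROT_NONE-style entry */

        clear_and_maybe_flush(&present);
        clear_and_maybe_flush(&protnone);
        return 0;
}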
From: Peter Zijlstra <[email protected]>
Make MPOL_LOCAL a real and exposed policy such that applications that
relied on the previous default behaviour can explicitly request it.
Requested-by: Christoph Lameter <[email protected]>
Reviewed-by: Rik van Riel <[email protected]>
Cc: Lee Schermerhorn <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Linus Torvalds <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
---
include/uapi/linux/mempolicy.h | 1 +
mm/mempolicy.c | 9 ++++++---
2 files changed, 7 insertions(+), 3 deletions(-)
diff --git a/include/uapi/linux/mempolicy.h b/include/uapi/linux/mempolicy.h
index 23e62e0..3e835c9 100644
--- a/include/uapi/linux/mempolicy.h
+++ b/include/uapi/linux/mempolicy.h
@@ -20,6 +20,7 @@ enum {
MPOL_PREFERRED,
MPOL_BIND,
MPOL_INTERLEAVE,
+ MPOL_LOCAL,
MPOL_MAX, /* always last member of enum */
};
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index d04a8a5..72f50ba 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -269,6 +269,10 @@ static struct mempolicy *mpol_new(unsigned short mode, unsigned short flags,
(flags & MPOL_F_RELATIVE_NODES)))
return ERR_PTR(-EINVAL);
}
+ } else if (mode == MPOL_LOCAL) {
+ if (!nodes_empty(*nodes))
+ return ERR_PTR(-EINVAL);
+ mode = MPOL_PREFERRED;
} else if (nodes_empty(*nodes))
return ERR_PTR(-EINVAL);
policy = kmem_cache_alloc(policy_cache, GFP_KERNEL);
@@ -2397,7 +2401,6 @@ void numa_default_policy(void)
* "local" is pseudo-policy: MPOL_PREFERRED with MPOL_F_LOCAL flag
* Used only for mpol_parse_str() and mpol_to_str()
*/
-#define MPOL_LOCAL MPOL_MAX
static const char * const policy_modes[] =
{
[MPOL_DEFAULT] = "default",
@@ -2450,12 +2453,12 @@ int mpol_parse_str(char *str, struct mempolicy **mpol, int no_context)
if (flags)
*flags++ = '\0'; /* terminate mode string */
- for (mode = 0; mode <= MPOL_LOCAL; mode++) {
+ for (mode = 0; mode < MPOL_MAX; mode++) {
if (!strcmp(str, policy_modes[mode])) {
break;
}
}
- if (mode > MPOL_LOCAL)
+ if (mode >= MPOL_MAX)
goto out;
switch (mode) {
--
1.7.11.7
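A minimal user-space sketch of requesting the now-explicit policy from the
patch above. The MPOL_LOCAL value of 4 follows the enum position added here
and is only defined locally in case installed uapi headers predate the patch;
note that a non-empty nodemask would be rejected by the mpol_new() check
shown above:

/* mpol_local.c -- sketch: ask for node-local allocations via set_mempolicy() */
#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef MPOL_LOCAL
#define MPOL_LOCAL 4    /* matches the enum position added by this patch */
#endif

int main(void)
{
        /* MPOL_LOCAL takes no nodemask: pass NULL / maxnode 0 */
        if (syscall(SYS_set_mempolicy, MPOL_LOCAL, NULL, 0UL) != 0) {
                perror("set_mempolicy(MPOL_LOCAL)");
                return 1;
        }

        printf("future allocations will prefer the local node\n");
        return 0;
}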
From: Peter Zijlstra <[email protected]>
In order to facilitate a lazy -- fault driven -- migration of pages,
create a special transient PROT_NONE variant; we can then use the
resulting 'spurious' protection faults to drive our migrations.
Pages that already had an effective PROT_NONE mapping will not
generate these 'spurious' faults, for the simple reason that we
cannot distinguish them by their protection bits; see pte_numa().
This isn't a problem, since PROT_NONE (and possibly PROT_WRITE with
dirty tracking) is either unused or rare enough for us not to care
about its placement.
Suggested-by: Rik van Riel <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Reviewed-by: Rik van Riel <[email protected]>
Cc: Paul Turner <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Link: http://lkml.kernel.org/n/[email protected]
[ fixed various cross-arch and THP/!THP details ]
Signed-off-by: Ingo Molnar <[email protected]>
---
include/linux/huge_mm.h | 19 +++++++++++++
include/linux/mm.h | 18 ++++++++++++
mm/huge_memory.c | 32 +++++++++++++++++++++
mm/memory.c | 75 ++++++++++++++++++++++++++++++++++++++++++++-----
mm/mprotect.c | 24 +++++++++++-----
5 files changed, 154 insertions(+), 14 deletions(-)
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index b31cb7d..4f0f948 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -159,6 +159,13 @@ static inline struct page *compound_trans_head(struct page *page)
}
return page;
}
+
+extern bool pmd_numa(struct vm_area_struct *vma, pmd_t pmd);
+
+extern void do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
+ unsigned long address, pmd_t *pmd,
+ unsigned int flags, pmd_t orig_pmd);
+
#else /* CONFIG_TRANSPARENT_HUGEPAGE */
#define HPAGE_PMD_SHIFT ({ BUILD_BUG(); 0; })
#define HPAGE_PMD_MASK ({ BUILD_BUG(); 0; })
@@ -195,6 +202,18 @@ static inline int pmd_trans_huge_lock(pmd_t *pmd,
{
return 0;
}
+
+static inline bool pmd_numa(struct vm_area_struct *vma, pmd_t pmd)
+{
+ return false;
+}
+
+static inline void do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
+ unsigned long address, pmd_t *pmd,
+ unsigned int flags, pmd_t orig_pmd)
+{
+}
+
#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
#endif /* _LINUX_HUGE_MM_H */
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 2a32cf8..0025bf9 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1091,6 +1091,9 @@ extern unsigned long move_page_tables(struct vm_area_struct *vma,
extern unsigned long do_mremap(unsigned long addr,
unsigned long old_len, unsigned long new_len,
unsigned long flags, unsigned long new_addr);
+extern void change_protection(struct vm_area_struct *vma, unsigned long start,
+ unsigned long end, pgprot_t newprot,
+ int dirty_accountable);
extern int mprotect_fixup(struct vm_area_struct *vma,
struct vm_area_struct **pprev, unsigned long start,
unsigned long end, unsigned long newflags);
@@ -1561,6 +1564,21 @@ static inline pgprot_t vm_get_page_prot(unsigned long vm_flags)
}
#endif
+static inline pgprot_t vma_prot_none(struct vm_area_struct *vma)
+{
+ /*
+ * obtain PROT_NONE by removing READ|WRITE|EXEC privs
+ */
+ vm_flags_t vmflags = vma->vm_flags & ~(VM_READ|VM_WRITE|VM_EXEC);
+ return pgprot_modify(vma->vm_page_prot, vm_get_page_prot(vmflags));
+}
+
+static inline void
+change_prot_none(struct vm_area_struct *vma, unsigned long start, unsigned long end)
+{
+ change_protection(vma, start, end, vma_prot_none(vma), 0);
+}
+
struct vm_area_struct *find_extend_vma(struct mm_struct *, unsigned long addr);
int remap_pfn_range(struct vm_area_struct *, unsigned long addr,
unsigned long pfn, unsigned long size, pgprot_t);
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 176fe3d..6924edf 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -725,6 +725,38 @@ out:
return handle_pte_fault(mm, vma, address, pte, pmd, flags);
}
+bool pmd_numa(struct vm_area_struct *vma, pmd_t pmd)
+{
+ /*
+ * See pte_numa().
+ */
+ if (pmd_same(pmd, pmd_modify(pmd, vma->vm_page_prot)))
+ return false;
+
+ return pmd_same(pmd, pmd_modify(pmd, vma_prot_none(vma)));
+}
+
+void do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
+ unsigned long address, pmd_t *pmd,
+ unsigned int flags, pmd_t entry)
+{
+ unsigned long haddr = address & HPAGE_PMD_MASK;
+
+ spin_lock(&mm->page_table_lock);
+ if (unlikely(!pmd_same(*pmd, entry)))
+ goto out_unlock;
+
+ /* do fancy stuff */
+
+ /* change back to regular protection */
+ entry = pmd_modify(entry, vma->vm_page_prot);
+ if (pmdp_set_access_flags(vma, haddr, pmd, entry, 1))
+ update_mmu_cache_pmd(vma, address, entry);
+
+out_unlock:
+ spin_unlock(&mm->page_table_lock);
+}
+
int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
pmd_t *dst_pmd, pmd_t *src_pmd, unsigned long addr,
struct vm_area_struct *vma)
diff --git a/mm/memory.c b/mm/memory.c
index fb135ba..e3e8ab2 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1464,6 +1464,25 @@ int zap_vma_ptes(struct vm_area_struct *vma, unsigned long address,
}
EXPORT_SYMBOL_GPL(zap_vma_ptes);
+static bool pte_numa(struct vm_area_struct *vma, pte_t pte)
+{
+ /*
+ * If we have the normal vma->vm_page_prot protections we're not a
+ * 'special' PROT_NONE page.
+ *
+ * This means we cannot get 'special' PROT_NONE faults from genuine
+ * PROT_NONE maps, nor from PROT_WRITE file maps that do dirty
+ * tracking.
+ *
+ * Neither case is really interesting for our current use though so we
+ * don't care.
+ */
+ if (pte_same(pte, pte_modify(pte, vma->vm_page_prot)))
+ return false;
+
+ return pte_same(pte, pte_modify(pte, vma_prot_none(vma)));
+}
+
/**
* follow_page - look up a page descriptor from a user-virtual address
* @vma: vm_area_struct mapping @address
@@ -3433,6 +3452,41 @@ static int do_nonlinear_fault(struct mm_struct *mm, struct vm_area_struct *vma,
return __do_fault(mm, vma, address, pmd, pgoff, flags, orig_pte);
}
+static int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
+ unsigned long address, pte_t *ptep, pmd_t *pmd,
+ unsigned int flags, pte_t entry)
+{
+ spinlock_t *ptl;
+ int ret = 0;
+
+ if (!pte_unmap_same(mm, pmd, ptep, entry))
+ goto out;
+
+ /*
+ * Do fancy stuff...
+ */
+
+ /*
+ * OK, nothing to do,.. change the protection back to what it
+ * ought to be.
+ */
+ ptep = pte_offset_map_lock(mm, pmd, address, &ptl);
+ if (unlikely(!pte_same(*ptep, entry)))
+ goto unlock;
+
+ flush_cache_page(vma, address, pte_pfn(entry));
+
+ ptep_modify_prot_start(mm, address, ptep);
+ entry = pte_modify(entry, vma->vm_page_prot);
+ ptep_modify_prot_commit(mm, address, ptep, entry);
+
+ update_mmu_cache(vma, address, ptep);
+unlock:
+ pte_unmap_unlock(ptep, ptl);
+out:
+ return ret;
+}
+
/*
* These routines also need to handle stuff like marking pages dirty
* and/or accessed for architectures that don't do it in hardware (most
@@ -3471,6 +3525,9 @@ int handle_pte_fault(struct mm_struct *mm,
pte, pmd, flags, entry);
}
+ if (pte_numa(vma, entry))
+ return do_numa_page(mm, vma, address, pte, pmd, flags, entry);
+
ptl = pte_lockptr(mm, pmd);
spin_lock(ptl);
if (unlikely(!pte_same(*pte, entry)))
@@ -3535,13 +3592,16 @@ retry:
pmd, flags);
} else {
pmd_t orig_pmd = *pmd;
- int ret;
+ int ret = 0;
barrier();
- if (pmd_trans_huge(orig_pmd)) {
- if (flags & FAULT_FLAG_WRITE &&
- !pmd_write(orig_pmd) &&
- !pmd_trans_splitting(orig_pmd)) {
+ if (pmd_trans_huge(orig_pmd) && !pmd_trans_splitting(orig_pmd)) {
+ if (pmd_numa(vma, orig_pmd)) {
+ do_huge_pmd_numa_page(mm, vma, address, pmd,
+ flags, orig_pmd);
+ }
+
+ if ((flags & FAULT_FLAG_WRITE) && !pmd_write(orig_pmd)) {
ret = do_huge_pmd_wp_page(mm, vma, address, pmd,
orig_pmd);
/*
@@ -3551,12 +3611,13 @@ retry:
*/
if (unlikely(ret & VM_FAULT_OOM))
goto retry;
- return ret;
}
- return 0;
+
+ return ret;
}
}
+
/*
* Use __pte_alloc instead of pte_alloc_map, because we can't
* run pte_offset_map on the pmd, if an huge pmd could
diff --git a/mm/mprotect.c b/mm/mprotect.c
index e97b0d6..392b124 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -112,7 +112,7 @@ static inline void change_pud_range(struct vm_area_struct *vma, pgd_t *pgd,
} while (pud++, addr = next, addr != end);
}
-static void change_protection(struct vm_area_struct *vma,
+static void change_protection_range(struct vm_area_struct *vma,
unsigned long addr, unsigned long end, pgprot_t newprot,
int dirty_accountable)
{
@@ -134,6 +134,20 @@ static void change_protection(struct vm_area_struct *vma,
flush_tlb_range(vma, start, end);
}
+void change_protection(struct vm_area_struct *vma, unsigned long start,
+ unsigned long end, pgprot_t newprot,
+ int dirty_accountable)
+{
+ struct mm_struct *mm = vma->vm_mm;
+
+ mmu_notifier_invalidate_range_start(mm, start, end);
+ if (is_vm_hugetlb_page(vma))
+ hugetlb_change_protection(vma, start, end, newprot);
+ else
+ change_protection_range(vma, start, end, newprot, dirty_accountable);
+ mmu_notifier_invalidate_range_end(mm, start, end);
+}
+
int
mprotect_fixup(struct vm_area_struct *vma, struct vm_area_struct **pprev,
unsigned long start, unsigned long end, unsigned long newflags)
@@ -206,12 +220,8 @@ success:
dirty_accountable = 1;
}
- mmu_notifier_invalidate_range_start(mm, start, end);
- if (is_vm_hugetlb_page(vma))
- hugetlb_change_protection(vma, start, end, vma->vm_page_prot);
- else
- change_protection(vma, start, end, vma->vm_page_prot, dirty_accountable);
- mmu_notifier_invalidate_range_end(mm, start, end);
+ change_protection(vma, start, end, vma->vm_page_prot, dirty_accountable);
+
vm_stat_account(mm, oldflags, vma->vm_file, -nrpages);
vm_stat_account(mm, newflags, vma->vm_file, nrpages);
perf_event_mmap(vma);
--
1.7.11.7
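A standalone illustration of the pte_numa() identity trick used in the diff
above (not kernel code; the protection values are plain made-up integers, not
pgprot_t bits): an entry is treated as a special NUMA PROT_NONE entry only if
its protection differs from the vma's normal protection yet matches the vma's
PROT_NONE protection.

/* protnone_sketch.c -- illustration only */
#include <stdbool.h>
#include <stdio.h>

struct fake_vma {
        unsigned int page_prot;         /* the vma's normal protection */
        unsigned int prot_none;         /* what vma_prot_none() would yield */
};

static bool pte_is_numa(unsigned int pte_prot, const struct fake_vma *vma)
{
        if (pte_prot == vma->page_prot)         /* normal mapping: not special */
                return false;

        return pte_prot == vma->prot_none;      /* special iff it matches PROT_NONE */
}

int main(void)
{
        struct fake_vma vma = { .page_prot = 0x25, .prot_none = 0x20 };

        printf("normal pte: %d\n", pte_is_numa(0x25, &vma));   /* 0 */
        printf("numa pte:   %d\n", pte_is_numa(0x20, &vma));   /* 1 */
        printf("other pte:  %d\n", pte_is_numa(0x27, &vma));   /* 0 */
        return 0;
}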
From: Rik van Riel <[email protected]>
Intel has an architectural guarantee that the TLB entry causing
a page fault gets invalidated automatically. This means
we should be able to drop the local TLB invalidation.
Because of the way other areas of the page fault code work,
chances are good that all x86 CPUs do this. However, if
someone somewhere has an x86 CPU that does not invalidate
the TLB entry causing a page fault, this one-liner should
be easy to revert - or a CPU model specific quirk could
be added to retain this optimization on most CPUs.
Signed-off-by: Rik van Riel <[email protected]>
Acked-by: Linus Torvalds <[email protected]>
Acked-by: Peter Zijlstra <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Michel Lespinasse <[email protected]>
[ Applied changelog massage and moved this last in the series,
to create bisection distance. ]
Signed-off-by: Ingo Molnar <[email protected]>
---
arch/x86/mm/pgtable.c | 1 -
1 file changed, 1 deletion(-)
diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
index be3bb46..7353de3 100644
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -317,7 +317,6 @@ int ptep_set_access_flags(struct vm_area_struct *vma,
if (changed && dirty) {
*ptep = entry;
pte_update_defer(vma->vm_mm, address, ptep);
- __flush_tlb_one(address);
}
return changed;
--
1.7.11.7
From: Peter Zijlstra <[email protected]>
Add migrate_misplaced_page() which deals with migrating pages from
faults.
This includes adding a new MIGRATE_FAULT migration mode to
deal with the extra page reference required due to having to look up
the page.
Based-on-work-by: Lee Schermerhorn <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Reviewed-by: Rik van Riel <[email protected]>
Cc: Paul Turner <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Andrew Morton <[email protected]>
Link: http://lkml.kernel.org/n/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
include/linux/migrate.h | 7 ++++
include/linux/migrate_mode.h | 3 ++
mm/migrate.c | 85 +++++++++++++++++++++++++++++++++++++++-----
3 files changed, 87 insertions(+), 8 deletions(-)
diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index ce7e667..9a5afea 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -30,6 +30,7 @@ extern int migrate_vmas(struct mm_struct *mm,
extern void migrate_page_copy(struct page *newpage, struct page *page);
extern int migrate_huge_page_move_mapping(struct address_space *mapping,
struct page *newpage, struct page *page);
+extern int migrate_misplaced_page(struct page *page, int node);
#else
static inline void putback_lru_pages(struct list_head *l) {}
@@ -63,5 +64,11 @@ static inline int migrate_huge_page_move_mapping(struct address_space *mapping,
#define migrate_page NULL
#define fail_migrate_page NULL
+static inline
+int migrate_misplaced_page(struct page *page, int node)
+{
+ return -EAGAIN; /* can't migrate now */
+}
#endif /* CONFIG_MIGRATION */
+
#endif /* _LINUX_MIGRATE_H */
diff --git a/include/linux/migrate_mode.h b/include/linux/migrate_mode.h
index ebf3d89..40b37dc 100644
--- a/include/linux/migrate_mode.h
+++ b/include/linux/migrate_mode.h
@@ -6,11 +6,14 @@
* on most operations but not ->writepage as the potential stall time
* is too significant
* MIGRATE_SYNC will block when migrating pages
+ * MIGRATE_FAULT is used from the fault path to migrate-on-fault for mempolicy;
+ * this path has an extra reference count
*/
enum migrate_mode {
MIGRATE_ASYNC,
MIGRATE_SYNC_LIGHT,
MIGRATE_SYNC,
+ MIGRATE_FAULT,
};
#endif /* MIGRATE_MODE_H_INCLUDED */
diff --git a/mm/migrate.c b/mm/migrate.c
index 77ed2d7..3299949 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -225,7 +225,7 @@ static bool buffer_migrate_lock_buffers(struct buffer_head *head,
struct buffer_head *bh = head;
/* Simple case, sync compaction */
- if (mode != MIGRATE_ASYNC) {
+ if (mode != MIGRATE_ASYNC && mode != MIGRATE_FAULT) {
do {
get_bh(bh);
lock_buffer(bh);
@@ -279,12 +279,22 @@ static int migrate_page_move_mapping(struct address_space *mapping,
struct page *newpage, struct page *page,
struct buffer_head *head, enum migrate_mode mode)
{
- int expected_count;
+ int expected_count = 0;
void **pslot;
+ if (mode == MIGRATE_FAULT) {
+ /*
+ * MIGRATE_FAULT has an extra reference on the page and
+ * otherwise acts like ASYNC, no point in delaying the
+ * fault, we'll try again next time.
+ */
+ expected_count++;
+ }
+
if (!mapping) {
/* Anonymous page without mapping */
- if (page_count(page) != 1)
+ expected_count += 1;
+ if (page_count(page) != expected_count)
return -EAGAIN;
return 0;
}
@@ -294,7 +304,7 @@ static int migrate_page_move_mapping(struct address_space *mapping,
pslot = radix_tree_lookup_slot(&mapping->page_tree,
page_index(page));
- expected_count = 2 + page_has_private(page);
+ expected_count += 2 + page_has_private(page);
if (page_count(page) != expected_count ||
radix_tree_deref_slot_protected(pslot, &mapping->tree_lock) != page) {
spin_unlock_irq(&mapping->tree_lock);
@@ -313,7 +323,7 @@ static int migrate_page_move_mapping(struct address_space *mapping,
* the mapping back due to an elevated page count, we would have to
* block waiting on other references to be dropped.
*/
- if (mode == MIGRATE_ASYNC && head &&
+ if ((mode == MIGRATE_ASYNC || mode == MIGRATE_FAULT) && head &&
!buffer_migrate_lock_buffers(head, mode)) {
page_unfreeze_refs(page, expected_count);
spin_unlock_irq(&mapping->tree_lock);
@@ -521,7 +531,7 @@ int buffer_migrate_page(struct address_space *mapping,
* with an IRQ-safe spinlock held. In the sync case, the buffers
* need to be locked now
*/
- if (mode != MIGRATE_ASYNC)
+ if (mode != MIGRATE_ASYNC && mode != MIGRATE_FAULT)
BUG_ON(!buffer_migrate_lock_buffers(head, mode));
ClearPagePrivate(page);
@@ -687,7 +697,7 @@ static int __unmap_and_move(struct page *page, struct page *newpage,
struct anon_vma *anon_vma = NULL;
if (!trylock_page(page)) {
- if (!force || mode == MIGRATE_ASYNC)
+ if (!force || mode == MIGRATE_ASYNC || mode == MIGRATE_FAULT)
goto out;
/*
@@ -1403,4 +1413,63 @@ int migrate_vmas(struct mm_struct *mm, const nodemask_t *to,
}
return err;
}
-#endif
+
+/*
+ * Attempt to migrate a misplaced page to the specified destination
+ * node.
+ */
+int migrate_misplaced_page(struct page *page, int node)
+{
+ struct address_space *mapping = page_mapping(page);
+ int page_lru = page_is_file_cache(page);
+ struct page *newpage;
+ int ret = -EAGAIN;
+ gfp_t gfp = GFP_HIGHUSER_MOVABLE;
+
+ /*
+ * Don't migrate pages that are mapped in multiple processes.
+ */
+ if (page_mapcount(page) != 1)
+ goto out;
+
+ /*
+ * Never wait for allocations just to migrate on fault, but don't dip
+ * into reserves. And, only accept pages from the specified node. No
+ * sense migrating to a different "misplaced" page!
+ */
+ if (mapping)
+ gfp = mapping_gfp_mask(mapping);
+ gfp &= ~__GFP_WAIT;
+ gfp |= __GFP_NOMEMALLOC | GFP_THISNODE;
+
+ newpage = alloc_pages_node(node, gfp, 0);
+ if (!newpage) {
+ ret = -ENOMEM;
+ goto out;
+ }
+
+ if (isolate_lru_page(page)) {
+ ret = -EBUSY;
+ goto put_new;
+ }
+
+ inc_zone_page_state(page, NR_ISOLATED_ANON + page_lru);
+ ret = __unmap_and_move(page, newpage, 0, 0, MIGRATE_FAULT);
+ /*
+ * A page that has been migrated has all references removed and will be
+ * freed. A page that has not been migrated will have kept its
+ * references and be restored.
+ */
+ dec_zone_page_state(page, NR_ISOLATED_ANON + page_lru);
+ putback_lru_page(page);
+put_new:
+ /*
+ * Move the new page to the LRU. If migration was not successful
+ * then this will free the page.
+ */
+ putback_lru_page(newpage);
+out:
+ return ret;
+}
+
+#endif /* CONFIG_NUMA */
--
1.7.11.7
Allow architectures to opt in to the adaptive affinity NUMA balancing code.
Cc: Peter Zijlstra <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Mel Gorman <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
---
init/Kconfig | 7 +++++++
1 file changed, 7 insertions(+)
diff --git a/init/Kconfig b/init/Kconfig
index ae412fd..78807b3 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -703,6 +703,13 @@ config HAVE_UNSTABLE_SCHED_CLOCK
config ARCH_WANT_NUMA_VARIABLE_LOCALITY
bool
+#
+# For architectures that want to enable the PROT_NONE driven,
+# NUMA-affine scheduler balancing logic:
+#
+config ARCH_SUPPORTS_NUMA_BALANCING
+ bool
+
menuconfig CGROUPS
boolean "Control Group support"
depends on EVENTFD
--
1.7.11.7
From: Peter Zijlstra <[email protected]>
The principal ideas behind this patch are the fundamental
difference between shared and privately used memory and the very
strong desire to only rely on per-task behavioral state for
scheduling decisions.
We define 'shared memory' as all user memory that is frequently
accessed by multiple tasks and conversely 'private memory' is
the user memory used predominantly by a single task.
To approximate the above strict definition we recognise that
task placement is dominantly per cpu and thus using cpu granular
page access state is a natural fit. Thus we introduce
page::last_cpu as the cpu that last accessed a page.
Using this, we can construct two per-task node-vectors, 'S_i'
and 'P_i' reflecting the amount of shared and privately used
pages of this task respectively. Pages for which two consecutive
'hits' are of the same cpu are assumed private and the others
are shared.
[ This means that we will start evaluating this state when the
task has not migrated for at least 2 scans, see NUMA_SETTLE ]
Using these vectors we can compute the total number of
shared/private pages of this task and determine which dominates.
[ Note that for shared tasks we only see '1/n' of the total number
of shared pages; the other tasks will take the other faults,
where 'n' is the number of tasks sharing the memory.
So for an equal comparison we should divide total private by
'n' as well, but we don't have 'n' so we pick 2. ]
We can also compute which node holds most of our memory; running
on this node will be called 'ideal placement'. (As per previous
patches we will prefer to pull memory towards wherever we run.)
We change the load-balancer to prefer moving tasks in order of:
1) !numa tasks and numa tasks in the direction of more faults
2) allow !ideal tasks getting worse in the direction of faults
3) allow private tasks to get worse
4) allow shared tasks to get worse
This order ensures we prefer increasing memory locality but when
we do have to make hard decisions we prefer spreading private
over shared, because spreading shared tasks significantly
increases the interconnect bandwidth since not all memory can
follow.
We also add an extra 'lateral' force to the load balancer that
perturbs the state when otherwise 'fairly' balanced. This
ensures we don't get 'stuck' in a state which is fair but
undesired from a memory location POV (see can_do_numa_run()).
Lastly, we allow shared tasks to defeat the default spreading of
tasks such that, when possible, they can aggregate on a single
node.
Shared tasks aggregate for the very simple reason that there has
to be a single node that holds most of their memory and a second
most, etc., and tasks want to move up the faults ladder.
Enable it on x86. A number of other architectures are
most likely fine too - but they should enable and test this
feature explicitly.
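A standalone sketch of the private/shared bookkeeping described above (the
real logic lives in task_numa_fault()/task_numa_placement() in the diff below;
this is a simplified model, not the kernel code): a fault on a page last
touched by the same CPU counts as a private hit, otherwise as shared, and
per-node totals then pick the preferred node.

/* numa_classify.c -- standalone illustration of the fault bookkeeping */
#include <stdio.h>

#define NR_NODES 4

/* faults[node][0] = shared hits, faults[node][1] = private hits */
static unsigned long faults[NR_NODES][2];

static void record_fault(int node, int last_cpu, int this_cpu, int pages)
{
        int priv = (last_cpu == this_cpu);      /* two hits from one cpu => private */

        faults[node][priv] += pages;
}

static int preferred_node(void)
{
        unsigned long best = 0;
        int node, best_node = -1;

        for (node = 0; node < NR_NODES; node++) {
                unsigned long f = faults[node][0] + faults[node][1];

                if (f > best) {
                        best = f;
                        best_node = node;
                }
        }
        return best_node;
}

int main(void)
{
        record_fault(0, 3, 3, 8);       /* same cpu as last access: private */
        record_fault(1, 7, 3, 2);       /* different cpu: shared */

        printf("preferred node: %d\n", preferred_node());
        return 0;
}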
Signed-off-by: Peter Zijlstra <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Mel Gorman <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
---
Documentation/scheduler/numa-problem.txt | 20 +-
arch/x86/Kconfig | 1 +
include/linux/sched.h | 1 +
kernel/sched/core.c | 53 +-
kernel/sched/fair.c | 983 +++++++++++++++++++++++++------
kernel/sched/features.h | 8 +
kernel/sched/sched.h | 32 +-
7 files changed, 901 insertions(+), 197 deletions(-)
diff --git a/Documentation/scheduler/numa-problem.txt b/Documentation/scheduler/numa-problem.txt
index a5d2fee..7f133e3 100644
--- a/Documentation/scheduler/numa-problem.txt
+++ b/Documentation/scheduler/numa-problem.txt
@@ -133,6 +133,8 @@ XXX properties of this M vs a potential optimal
2b) migrate memory towards 'n_i' using 2 samples.
+XXX include the statistical babble on double sampling somewhere near
+
This separates pages into those that will migrate and those that will not due
to the two samples not matching. We could consider the first to be of 'p_i'
(private) and the second to be of 's_i' (shared).
@@ -142,7 +144,17 @@ This interpretation can be motivated by the previously observed property that
's_i' (shared). (here we lose the need for memory limits again, since it
becomes indistinguishable from shared).
-XXX include the statistical babble on double sampling somewhere near
+ 2c) use cpu samples instead of node samples
+
+The problem with sampling on node granularity is that one loses 's_i' for
+the local node, since one cannot distinguish between two accesses from the
+same node.
+
+By increasing the granularity to per-cpu we gain the ability to have both an
+'s_i' and 'p_i' per node. Since we do all task placement per-cpu as well this
+seems like a natural match. The line where we overcommit cpus is where we lose
+granularity again, but when we lose overcommit we naturally spread tasks.
+Therefore it should work out nicely.
This reduces the problem further; we lose 'M' as per 2a, and it further reduces
the 'T_k,l' (interconnect traffic) term to only include shared (since per the
@@ -150,12 +162,6 @@ above all private will be local):
T_k,l = \Sum_i bs_i,l for every n_i = k, l != k
-[ more or less matches the state of sched/numa and describes its remaining
- problems and assumptions. It should work well for tasks without significant
- shared memory usage between tasks. ]
-
-Possible future directions:
-
Motivated by the form of 'T_k,l', try and obtain each term of the sum, so we
can evaluate it;
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 46c3bff..02d0f2a 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -22,6 +22,7 @@ config X86
def_bool y
select HAVE_AOUT if X86_32
select HAVE_UNSTABLE_SCHED_CLOCK
+ select ARCH_SUPPORTS_NUMA_BALANCING
select HAVE_IDE
select HAVE_OPROFILE
select HAVE_PCSPKR_PLATFORM
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 418d405..bb12cc3 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -823,6 +823,7 @@ enum cpu_idle_type {
#define SD_ASYM_PACKING 0x0800 /* Place busy groups earlier in the domain */
#define SD_PREFER_SIBLING 0x1000 /* Prefer to place tasks in a sibling domain */
#define SD_OVERLAP 0x2000 /* sched_domains of this level overlap */
+#define SD_NUMA 0x4000 /* cross-node balancing */
extern int __weak arch_sd_sibiling_asym_packing(void);
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 3611f5f..0cd9896 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1800,6 +1800,7 @@ static void finish_task_switch(struct rq *rq, struct task_struct *prev)
if (mm)
mmdrop(mm);
if (unlikely(prev_state == TASK_DEAD)) {
+ task_numa_free(prev);
/*
* Remove function-return probe instances associated with this
* task and put them back on the free list.
@@ -5510,7 +5511,9 @@ static void destroy_sched_domains(struct sched_domain *sd, int cpu)
DEFINE_PER_CPU(struct sched_domain *, sd_llc);
DEFINE_PER_CPU(int, sd_llc_id);
-static void update_top_cache_domain(int cpu)
+DEFINE_PER_CPU(struct sched_domain *, sd_node);
+
+static void update_domain_cache(int cpu)
{
struct sched_domain *sd;
int id = cpu;
@@ -5521,6 +5524,15 @@ static void update_top_cache_domain(int cpu)
rcu_assign_pointer(per_cpu(sd_llc, cpu), sd);
per_cpu(sd_llc_id, cpu) = id;
+
+ for_each_domain(cpu, sd) {
+ if (cpumask_equal(sched_domain_span(sd),
+ cpumask_of_node(cpu_to_node(cpu))))
+ goto got_node;
+ }
+ sd = NULL;
+got_node:
+ rcu_assign_pointer(per_cpu(sd_node, cpu), sd);
}
/*
@@ -5563,7 +5575,7 @@ cpu_attach_domain(struct sched_domain *sd, struct root_domain *rd, int cpu)
rcu_assign_pointer(rq->sd, sd);
destroy_sched_domains(tmp, cpu);
- update_top_cache_domain(cpu);
+ update_domain_cache(cpu);
}
/* cpus with isolated domains */
@@ -5985,6 +5997,37 @@ static struct sched_domain_topology_level default_topology[] = {
static struct sched_domain_topology_level *sched_domain_topology = default_topology;
+#ifdef CONFIG_NUMA_BALANCING
+
+/*
+ * Change a task's NUMA placement: dequeue it, update its preferred node and
+ * shared/private classification, then requeue it so the per-rq NUMA
+ * accounting (see account_numa_enqueue()) stays consistent.
+ */
+void sched_setnuma(struct task_struct *p, int node, int shared)
+{
+ unsigned long flags;
+ int on_rq, running;
+ struct rq *rq;
+
+ rq = task_rq_lock(p, &flags);
+ on_rq = p->on_rq;
+ running = task_current(rq, p);
+
+ if (on_rq)
+ dequeue_task(rq, p, 0);
+ if (running)
+ p->sched_class->put_prev_task(rq, p);
+
+ p->numa_shared = shared;
+ p->numa_max_node = node;
+
+ if (running)
+ p->sched_class->set_curr_task(rq);
+ if (on_rq)
+ enqueue_task(rq, p, 0);
+ task_rq_unlock(rq, p, &flags);
+}
+
+#endif /* CONFIG_NUMA_BALANCING */
+
#ifdef CONFIG_NUMA
static int sched_domains_numa_levels;
@@ -6030,6 +6073,7 @@ sd_numa_init(struct sched_domain_topology_level *tl, int cpu)
| 0*SD_SHARE_PKG_RESOURCES
| 1*SD_SERIALIZE
| 0*SD_PREFER_SIBLING
+ | 1*SD_NUMA
| sd_local_flags(level)
,
.last_balance = jiffies,
@@ -6884,7 +6928,6 @@ void __init sched_init(void)
rq->post_schedule = 0;
rq->active_balance = 0;
rq->next_balance = jiffies;
- rq->push_cpu = 0;
rq->cpu = i;
rq->online = 0;
rq->idle_stamp = 0;
@@ -6892,6 +6935,10 @@ void __init sched_init(void)
INIT_LIST_HEAD(&rq->cfs_tasks);
+#ifdef CONFIG_NUMA_BALANCING
+ rq->nr_shared_running = 0;
+#endif
+
rq_attach_root(rq, &def_root_domain);
#ifdef CONFIG_NO_HZ
rq->nohz_flags = 0;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 309a254..b1d8fea 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -29,6 +29,9 @@
#include <linux/slab.h>
#include <linux/profile.h>
#include <linux/interrupt.h>
+#include <linux/random.h>
+#include <linux/mempolicy.h>
+#include <linux/task_work.h>
#include <trace/events/sched.h>
@@ -774,6 +777,235 @@ update_stats_curr_start(struct cfs_rq *cfs_rq, struct sched_entity *se)
}
/**************************************************
+ * Scheduling class numa methods.
+ *
+ * The purpose of the NUMA bits is to maintain compute (task) and data
+ * (memory) locality.
+ *
+ * We keep a faults vector per task and use periodic fault scans to try and
+ * establish a task<->page relation. This assumes the task<->page relation is a
+ * compute<->data relation, which is false for things like virt. and n:m
+ * threading solutions, but it's the best we can do given the information we
+ * have.
+ *
+ * We try and migrate such that we increase along the order provided by this
+ * vector while maintaining fairness.
+ *
+ * Tasks start out with their numa status unset (-1); this effectively means
+ * they act !NUMA until we've established the task is busy enough to bother
+ * with placement.
+ */
+
+#ifdef CONFIG_SMP
+static unsigned long task_h_load(struct task_struct *p);
+#endif
+
+#ifdef CONFIG_NUMA_BALANCING
+static void account_numa_enqueue(struct rq *rq, struct task_struct *p)
+{
+ if (task_numa_shared(p) != -1) {
+ p->numa_weight = task_h_load(p);
+ rq->nr_numa_running++;
+ rq->nr_shared_running += task_numa_shared(p);
+ rq->nr_ideal_running += (cpu_to_node(task_cpu(p)) == p->numa_max_node);
+ rq->numa_weight += p->numa_weight;
+ }
+}
+
+static void account_numa_dequeue(struct rq *rq, struct task_struct *p)
+{
+ if (task_numa_shared(p) != -1) {
+ rq->nr_numa_running--;
+ rq->nr_shared_running -= task_numa_shared(p);
+ rq->nr_ideal_running -= (cpu_to_node(task_cpu(p)) == p->numa_max_node);
+ rq->numa_weight -= p->numa_weight;
+ }
+}
+
+/*
+ * numa task sample period in ms: 5s
+ */
+unsigned int sysctl_sched_numa_scan_period_min = 5000;
+unsigned int sysctl_sched_numa_scan_period_max = 5000*16;
+
+/*
+ * Wait for the 2-sample stuff to settle before migrating again
+ */
+unsigned int sysctl_sched_numa_settle_count = 2;
+
+static void task_numa_migrate(struct task_struct *p, int next_cpu)
+{
+ p->numa_migrate_seq = 0;
+}
+
+static void task_numa_placement(struct task_struct *p)
+{
+ int seq = ACCESS_ONCE(p->mm->numa_scan_seq);
+ unsigned long total[2] = { 0, 0 };
+ unsigned long faults, max_faults = 0;
+ int node, priv, shared, max_node = -1;
+
+ if (p->numa_scan_seq == seq)
+ return;
+
+ p->numa_scan_seq = seq;
+
+ for (node = 0; node < nr_node_ids; node++) {
+ faults = 0;
+ for (priv = 0; priv < 2; priv++) {
+ faults += p->numa_faults[2*node + priv];
+ total[priv] += p->numa_faults[2*node + priv];
+ p->numa_faults[2*node + priv] /= 2;
+ }
+ if (faults > max_faults) {
+ max_faults = faults;
+ max_node = node;
+ }
+ }
+
+ if (max_node != p->numa_max_node)
+ sched_setnuma(p, max_node, task_numa_shared(p));
+
+ p->numa_migrate_seq++;
+ if (sched_feat(NUMA_SETTLE) &&
+ p->numa_migrate_seq < sysctl_sched_numa_settle_count)
+ return;
+
+ /*
+ * Note: shared is spread across multiple tasks and in the future
+ * we might want to consider a different equation below to reduce
+ * the impact of a small amount of private memory accesses.
+ */
+ shared = (total[0] >= total[1] / 4);
+ if (shared != task_numa_shared(p)) {
+ sched_setnuma(p, p->numa_max_node, shared);
+ p->numa_migrate_seq = 0;
+ }
+}
+
+/*
+ * Got a PROT_NONE fault for a page on @node.
+ */
+void task_numa_fault(int node, int last_cpu, int pages)
+{
+ struct task_struct *p = current;
+ int priv = (task_cpu(p) == last_cpu);
+
+ if (unlikely(!p->numa_faults)) {
+ int size = sizeof(*p->numa_faults) * 2 * nr_node_ids;
+
+ p->numa_faults = kzalloc(size, GFP_KERNEL);
+ if (!p->numa_faults)
+ return;
+ }
+
+ task_numa_placement(p);
+ p->numa_faults[2*node + priv] += pages;
+}
+
+/*
+ * The expensive part of numa migration is done from task_work context.
+ * Triggered from task_tick_numa().
+ */
+void task_numa_work(struct callback_head *work)
+{
+ unsigned long migrate, next_scan, now = jiffies;
+ struct task_struct *p = current;
+ struct mm_struct *mm = p->mm;
+
+ WARN_ON_ONCE(p != container_of(work, struct task_struct, numa_work));
+
+ work->next = work; /* protect against double add */
+ /*
+ * Who cares about NUMA placement when they're dying.
+ *
+ * NOTE: make sure not to dereference p->mm before this check,
+ * exit_task_work() happens _after_ exit_mm() so we could be called
+ * without p->mm even though we still had it when we enqueued this
+ * work.
+ */
+ if (p->flags & PF_EXITING)
+ return;
+
+ /*
+ * Enforce maximal scan/migration frequency..
+ */
+ migrate = mm->numa_next_scan;
+ if (time_before(now, migrate))
+ return;
+
+ next_scan = now + 2*msecs_to_jiffies(sysctl_sched_numa_scan_period_min);
+ if (cmpxchg(&mm->numa_next_scan, migrate, next_scan) != migrate)
+ return;
+
+ ACCESS_ONCE(mm->numa_scan_seq)++;
+ {
+ struct vm_area_struct *vma;
+
+ down_write(&mm->mmap_sem);
+ for (vma = mm->mmap; vma; vma = vma->vm_next) {
+ if (!vma_migratable(vma))
+ continue;
+ change_protection(vma, vma->vm_start, vma->vm_end, vma_prot_none(vma), 0);
+ }
+ up_write(&mm->mmap_sem);
+ }
+}
+
+/*
+ * Drive the periodic memory faults..
+ */
+void task_tick_numa(struct rq *rq, struct task_struct *curr)
+{
+ struct callback_head *work = &curr->numa_work;
+ u64 period, now;
+
+ /*
+ * We don't care about NUMA placement if we don't have memory.
+ */
+ if (!curr->mm || (curr->flags & PF_EXITING) || work->next != work)
+ return;
+
+ /*
+ * Using runtime rather than walltime has the dual advantage that
+ * we (mostly) drive the selection from busy threads and that the
+ * task needs to have done some actual work before we bother with
+ * NUMA placement.
+ */
+ now = curr->se.sum_exec_runtime;
+ period = (u64)curr->numa_scan_period * NSEC_PER_MSEC;
+
+ if (now - curr->node_stamp > period) {
+ curr->node_stamp = now;
+
+ /*
+ * We are comparing runtime to wall clock time here, which
+ * puts a maximum scan frequency limit on the task work.
+ *
+ * This, together with the limits in task_numa_work() filters
+ * us from over-sampling if there are many threads: if all
+ * threads happen to come in at the same time we don't create a
+ * spike in overhead.
+ *
+ * We also avoid multiple threads scanning at once in parallel to
+ * each other.
+ */
+ if (!time_before(jiffies, curr->mm->numa_next_scan)) {
+ init_task_work(work, task_numa_work); /* TODO: move this into sched_fork() */
+ task_work_add(curr, work, true);
+ }
+ }
+}
+#else /* !CONFIG_NUMA_BALANCING: */
+#ifdef CONFIG_SMP
+static inline void account_numa_enqueue(struct rq *rq, struct task_struct *p) { }
+#endif
+static inline void account_numa_dequeue(struct rq *rq, struct task_struct *p) { }
+static inline void task_tick_numa(struct rq *rq, struct task_struct *curr) { }
+static inline void task_numa_migrate(struct task_struct *p, int next_cpu) { }
+#endif /* CONFIG_NUMA_BALANCING */
+
+/**************************************************
* Scheduling class queueing methods:
*/
@@ -784,9 +1016,13 @@ account_entity_enqueue(struct cfs_rq *cfs_rq, struct sched_entity *se)
if (!parent_entity(se))
update_load_add(&rq_of(cfs_rq)->load, se->load.weight);
#ifdef CONFIG_SMP
- if (entity_is_task(se))
- list_add(&se->group_node, &rq_of(cfs_rq)->cfs_tasks);
-#endif
+ if (entity_is_task(se)) {
+ struct rq *rq = rq_of(cfs_rq);
+
+ account_numa_enqueue(rq, task_of(se));
+ list_add(&se->group_node, &rq->cfs_tasks);
+ }
+#endif /* CONFIG_SMP */
cfs_rq->nr_running++;
}
@@ -796,8 +1032,10 @@ account_entity_dequeue(struct cfs_rq *cfs_rq, struct sched_entity *se)
update_load_sub(&cfs_rq->load, se->load.weight);
if (!parent_entity(se))
update_load_sub(&rq_of(cfs_rq)->load, se->load.weight);
- if (entity_is_task(se))
+ if (entity_is_task(se)) {
list_del_init(&se->group_node);
+ account_numa_dequeue(rq_of(cfs_rq), task_of(se));
+ }
cfs_rq->nr_running--;
}
@@ -3177,20 +3415,8 @@ unlock:
return new_cpu;
}
-/*
- * Load-tracking only depends on SMP, FAIR_GROUP_SCHED dependency below may be
- * removed when useful for applications beyond shares distribution (e.g.
- * load-balance).
- */
#ifdef CONFIG_FAIR_GROUP_SCHED
-/*
- * Called immediately before a task is migrated to a new cpu; task_cpu(p) and
- * cfs_rq_of(p) references at time of call are still valid and identify the
- * previous cpu. However, the caller only guarantees p->pi_lock is held; no
- * other assumptions, including the state of rq->lock, should be made.
- */
-static void
-migrate_task_rq_fair(struct task_struct *p, int next_cpu)
+static void migrate_task_rq_entity(struct task_struct *p, int next_cpu)
{
struct sched_entity *se = &p->se;
struct cfs_rq *cfs_rq = cfs_rq_of(se);
@@ -3206,7 +3432,27 @@ migrate_task_rq_fair(struct task_struct *p, int next_cpu)
atomic64_add(se->avg.load_avg_contrib, &cfs_rq->removed_load);
}
}
+#else
+static void migrate_task_rq_entity(struct task_struct *p, int next_cpu) { }
#endif
+
+/*
+ * Load-tracking only depends on SMP, FAIR_GROUP_SCHED dependency below may be
+ * removed when useful for applications beyond shares distribution (e.g.
+ * load-balance).
+ */
+/*
+ * Called immediately before a task is migrated to a new cpu; task_cpu(p) and
+ * cfs_rq_of(p) references at time of call are still valid and identify the
+ * previous cpu. However, the caller only guarantees p->pi_lock is held; no
+ * other assumptions, including the state of rq->lock, should be made.
+ */
+static void
+migrate_task_rq_fair(struct task_struct *p, int next_cpu)
+{
+ migrate_task_rq_entity(p, next_cpu);
+ task_numa_migrate(p, next_cpu);
+}
#endif /* CONFIG_SMP */
static unsigned long
@@ -3580,7 +3826,10 @@ static unsigned long __read_mostly max_load_balance_interval = HZ/10;
#define LBF_ALL_PINNED 0x01
#define LBF_NEED_BREAK 0x02
-#define LBF_SOME_PINNED 0x04
+#define LBF_SOME_PINNED 0x04
+#define LBF_NUMA_RUN 0x08
+#define LBF_NUMA_SHARED 0x10
+#define LBF_KEEP_SHARED 0x20
struct lb_env {
struct sched_domain *sd;
@@ -3599,6 +3848,8 @@ struct lb_env {
struct cpumask *cpus;
unsigned int flags;
+ unsigned int failed;
+ unsigned int iteration;
unsigned int loop;
unsigned int loop_break;
@@ -3620,11 +3871,87 @@ static void move_task(struct task_struct *p, struct lb_env *env)
check_preempt_curr(env->dst_rq, p, 0);
}
+#ifdef CONFIG_NUMA_BALANCING
+
+static inline unsigned long task_node_faults(struct task_struct *p, int node)
+{
+ return p->numa_faults[2*node] + p->numa_faults[2*node + 1];
+}
+
+static int task_faults_down(struct task_struct *p, struct lb_env *env)
+{
+ int src_node, dst_node, node, down_node = -1;
+ unsigned long faults, src_faults, max_faults = 0;
+
+ if (!sched_feat_numa(NUMA_FAULTS_DOWN) || !p->numa_faults)
+ return 1;
+
+ src_node = cpu_to_node(env->src_cpu);
+ dst_node = cpu_to_node(env->dst_cpu);
+
+ if (src_node == dst_node)
+ return 1;
+
+ src_faults = task_node_faults(p, src_node);
+
+ for (node = 0; node < nr_node_ids; node++) {
+ if (node == src_node)
+ continue;
+
+ faults = task_node_faults(p, node);
+
+ if (faults > max_faults && faults <= src_faults) {
+ max_faults = faults;
+ down_node = node;
+ }
+ }
+
+ if (down_node == dst_node)
+ return 1; /* move towards the next node down */
+
+ return 0; /* stay here */
+}
+
+static int task_faults_up(struct task_struct *p, struct lb_env *env)
+{
+ unsigned long src_faults, dst_faults;
+ int src_node, dst_node;
+
+ if (!sched_feat_numa(NUMA_FAULTS_UP) || !p->numa_faults)
+ return 0; /* can't say it improved */
+
+ src_node = cpu_to_node(env->src_cpu);
+ dst_node = cpu_to_node(env->dst_cpu);
+
+ if (src_node == dst_node)
+ return 0; /* pointless, don't do that */
+
+ src_faults = task_node_faults(p, src_node);
+ dst_faults = task_node_faults(p, dst_node);
+
+ if (dst_faults > src_faults)
+ return 1; /* move to dst */
+
+ return 0; /* stay where we are */
+}
+
+#else /* !CONFIG_NUMA_BALANCING: */
+static inline int task_faults_up(struct task_struct *p, struct lb_env *env)
+{
+ return 0;
+}
+
+static inline int task_faults_down(struct task_struct *p, struct lb_env *env)
+{
+ return 0;
+}
+#endif
+
/*
* Is this task likely cache-hot:
*/
static int
-task_hot(struct task_struct *p, u64 now, struct sched_domain *sd)
+task_hot(struct task_struct *p, struct lb_env *env)
{
s64 delta;
@@ -3647,80 +3974,153 @@ task_hot(struct task_struct *p, u64 now, struct sched_domain *sd)
if (sysctl_sched_migration_cost == 0)
return 0;
- delta = now - p->se.exec_start;
+ delta = env->src_rq->clock_task - p->se.exec_start;
return delta < (s64)sysctl_sched_migration_cost;
}
/*
- * can_migrate_task - may task p from runqueue rq be migrated to this_cpu?
+ * We do not migrate tasks that cannot be migrated to this CPU
+ * due to cpus_allowed.
+ *
+ * NOTE: this function has env-> side effects, to help the balancing
+ * of pinned tasks.
*/
-static
-int can_migrate_task(struct task_struct *p, struct lb_env *env)
+static bool can_migrate_pinned_task(struct task_struct *p, struct lb_env *env)
{
- int tsk_cache_hot = 0;
+ int new_dst_cpu;
+
+ if (cpumask_test_cpu(env->dst_cpu, tsk_cpus_allowed(p)))
+ return true;
+
+ schedstat_inc(p, se.statistics.nr_failed_migrations_affine);
+
/*
- * We do not migrate tasks that are:
- * 1) running (obviously), or
- * 2) cannot be migrated to this CPU due to cpus_allowed, or
- * 3) are cache-hot on their current CPU.
+ * Remember if this task can be migrated to any other cpu in
+ * our sched_group. We may want to revisit it if we couldn't
+ * meet load balance goals by pulling other tasks on src_cpu.
+ *
+ * Also avoid computing new_dst_cpu if we have already computed
+ * one in current iteration.
*/
- if (!cpumask_test_cpu(env->dst_cpu, tsk_cpus_allowed(p))) {
- int new_dst_cpu;
-
- schedstat_inc(p, se.statistics.nr_failed_migrations_affine);
+ if (!env->dst_grpmask || (env->flags & LBF_SOME_PINNED))
+ return false;
- /*
- * Remember if this task can be migrated to any other cpu in
- * our sched_group. We may want to revisit it if we couldn't
- * meet load balance goals by pulling other tasks on src_cpu.
- *
- * Also avoid computing new_dst_cpu if we have already computed
- * one in current iteration.
- */
- if (!env->dst_grpmask || (env->flags & LBF_SOME_PINNED))
- return 0;
-
- new_dst_cpu = cpumask_first_and(env->dst_grpmask,
- tsk_cpus_allowed(p));
- if (new_dst_cpu < nr_cpu_ids) {
- env->flags |= LBF_SOME_PINNED;
- env->new_dst_cpu = new_dst_cpu;
- }
- return 0;
+ new_dst_cpu = cpumask_first_and(env->dst_grpmask, tsk_cpus_allowed(p));
+ if (new_dst_cpu < nr_cpu_ids) {
+ env->flags |= LBF_SOME_PINNED;
+ env->new_dst_cpu = new_dst_cpu;
}
+ return false;
+}
- /* Record that we found atleast one task that could run on dst_cpu */
- env->flags &= ~LBF_ALL_PINNED;
+/*
+ * We cannot (easily) migrate tasks that are currently running:
+ */
+static bool can_migrate_running_task(struct task_struct *p, struct lb_env *env)
+{
+ if (!task_running(env->src_rq, p))
+ return true;
- if (task_running(env->src_rq, p)) {
- schedstat_inc(p, se.statistics.nr_failed_migrations_running);
- return 0;
- }
+ schedstat_inc(p, se.statistics.nr_failed_migrations_running);
+ return false;
+}
+/*
+ * Can we migrate a NUMA task? The rules are rather involved:
+ */
+static bool can_migrate_numa_task(struct task_struct *p, struct lb_env *env)
+{
/*
- * Aggressive migration if:
- * 1) task is cache cold, or
- * 2) too many balance attempts have failed.
+ * iteration:
+ * 0 -- only allow improvement, or !numa
+ * 1 -- + worsen !ideal
+ * 2 priv
+ * 3 shared (everything)
+ *
+ * NUMA_HOT_DOWN:
+ * 1 .. nodes -- allow getting worse by step
+ * nodes+1 -- punt, everything goes!
+ *
+ * LBF_NUMA_RUN -- numa only, only allow improvement
+ * LBF_NUMA_SHARED -- shared only
+ *
+ * LBF_KEEP_SHARED -- do not touch shared tasks
*/
- tsk_cache_hot = task_hot(p, env->src_rq->clock_task, env->sd);
- if (!tsk_cache_hot ||
- env->sd->nr_balance_failed > env->sd->cache_nice_tries) {
-#ifdef CONFIG_SCHEDSTATS
- if (tsk_cache_hot) {
- schedstat_inc(env->sd, lb_hot_gained[env->idle]);
- schedstat_inc(p, se.statistics.nr_forced_migrations);
- }
-#endif
- return 1;
+ /* a numa run can only move numa tasks about to improve things */
+ if (env->flags & LBF_NUMA_RUN) {
+ if (task_numa_shared(p) < 0)
+ return false;
+ /* can only pull shared tasks */
+ if ((env->flags & LBF_NUMA_SHARED) && !task_numa_shared(p))
+ return false;
+ } else {
+ if (task_numa_shared(p) < 0)
+ goto try_migrate;
}
- if (tsk_cache_hot) {
- schedstat_inc(p, se.statistics.nr_failed_migrations_hot);
- return 0;
- }
- return 1;
+ /* cannot move shared tasks */
+ if ((env->flags & LBF_KEEP_SHARED) && task_numa_shared(p) == 1)
+ return false;
+
+ if (task_faults_up(p, env))
+ return true; /* memory locality beats cache hotness */
+
+ if (env->iteration < 1)
+ return false;
+
+#ifdef CONFIG_NUMA_BALANCING
+ if (p->numa_max_node != cpu_to_node(task_cpu(p))) /* !ideal */
+ goto demote;
+#endif
+
+ if (env->iteration < 2)
+ return false;
+
+ if (task_numa_shared(p) == 0) /* private */
+ goto demote;
+
+ if (env->iteration < 3)
+ return false;
+
+demote:
+ if (env->iteration < 5)
+ return task_faults_down(p, env);
+
+try_migrate:
+ if (env->failed > env->sd->cache_nice_tries)
+ return true;
+
+ return !task_hot(p, env);
+}
+
+/*
+ * can_migrate_task() - may task p from runqueue rq be migrated to this_cpu?
+ */
+static int can_migrate_task(struct task_struct *p, struct lb_env *env)
+{
+ if (!can_migrate_pinned_task(p, env))
+ return false;
+
+ /* Record that we found at least one task that could run on dst_cpu */
+ env->flags &= ~LBF_ALL_PINNED;
+
+ if (!can_migrate_running_task(p, env))
+ return false;
+
+ if (env->sd->flags & SD_NUMA)
+ return can_migrate_numa_task(p, env);
+
+ if (env->failed > env->sd->cache_nice_tries)
+ return true;
+
+ if (!task_hot(p, env))
+ return true;
+
+ schedstat_inc(p, se.statistics.nr_failed_migrations_hot);
+
+ return false;
}
/*
@@ -3735,6 +4135,7 @@ static int move_one_task(struct lb_env *env)
struct task_struct *p, *n;
list_for_each_entry_safe(p, n, &env->src_rq->cfs_tasks, se.group_node) {
+
if (throttled_lb_pair(task_group(p), env->src_rq->cpu, env->dst_cpu))
continue;
@@ -3742,6 +4143,7 @@ static int move_one_task(struct lb_env *env)
continue;
move_task(p, env);
+
/*
* Right now, this is only the second place move_task()
* is called, so we can safely collect move_task()
@@ -3753,8 +4155,6 @@ static int move_one_task(struct lb_env *env)
return 0;
}
-static unsigned long task_h_load(struct task_struct *p);
-
static const unsigned int sched_nr_migrate_break = 32;
/*
@@ -3766,7 +4166,6 @@ static const unsigned int sched_nr_migrate_break = 32;
*/
static int move_tasks(struct lb_env *env)
{
- struct list_head *tasks = &env->src_rq->cfs_tasks;
struct task_struct *p;
unsigned long load;
int pulled = 0;
@@ -3774,8 +4173,8 @@ static int move_tasks(struct lb_env *env)
if (env->imbalance <= 0)
return 0;
- while (!list_empty(tasks)) {
- p = list_first_entry(tasks, struct task_struct, se.group_node);
+ while (!list_empty(&env->src_rq->cfs_tasks)) {
+ p = list_first_entry(&env->src_rq->cfs_tasks, struct task_struct, se.group_node);
env->loop++;
/* We've more or less seen every task there is, call it quits */
@@ -3786,7 +4185,7 @@ static int move_tasks(struct lb_env *env)
if (env->loop > env->loop_break) {
env->loop_break += sched_nr_migrate_break;
env->flags |= LBF_NEED_BREAK;
- break;
+ goto out;
}
if (throttled_lb_pair(task_group(p), env->src_cpu, env->dst_cpu))
@@ -3794,7 +4193,7 @@ static int move_tasks(struct lb_env *env)
load = task_h_load(p);
- if (sched_feat(LB_MIN) && load < 16 && !env->sd->nr_balance_failed)
+ if (sched_feat(LB_MIN) && load < 16 && !env->failed)
goto next;
if ((load / 2) > env->imbalance)
@@ -3814,7 +4213,7 @@ static int move_tasks(struct lb_env *env)
* the critical section.
*/
if (env->idle == CPU_NEWLY_IDLE)
- break;
+ goto out;
#endif
/*
@@ -3822,13 +4221,13 @@ static int move_tasks(struct lb_env *env)
* weighted load.
*/
if (env->imbalance <= 0)
- break;
+ goto out;
continue;
next:
- list_move_tail(&p->se.group_node, tasks);
+ list_move_tail(&p->se.group_node, &env->src_rq->cfs_tasks);
}
-
+out:
/*
* Right now, this is one of only two places move_task() is called,
* so we can safely collect move_task() stats here rather than
@@ -3953,17 +4352,18 @@ static inline void update_blocked_averages(int cpu)
static inline void update_h_load(long cpu)
{
}
-
+#ifdef CONFIG_SMP
static unsigned long task_h_load(struct task_struct *p)
{
return p->se.load.weight;
}
#endif
+#endif
/********** Helpers for find_busiest_group ************************/
/*
* sd_lb_stats - Structure to store the statistics of a sched_domain
- * during load balancing.
+ * during load balancing.
*/
struct sd_lb_stats {
struct sched_group *busiest; /* Busiest group in this sd */
@@ -3976,7 +4376,7 @@ struct sd_lb_stats {
unsigned long this_load;
unsigned long this_load_per_task;
unsigned long this_nr_running;
- unsigned long this_has_capacity;
+ unsigned int this_has_capacity;
unsigned int this_idle_cpus;
/* Statistics of the busiest group */
@@ -3985,10 +4385,28 @@ struct sd_lb_stats {
unsigned long busiest_load_per_task;
unsigned long busiest_nr_running;
unsigned long busiest_group_capacity;
- unsigned long busiest_has_capacity;
+ unsigned int busiest_has_capacity;
unsigned int busiest_group_weight;
int group_imb; /* Is there imbalance in this sd */
+
+#ifdef CONFIG_NUMA_BALANCING
+ unsigned long this_numa_running;
+ unsigned long this_numa_weight;
+ unsigned long this_shared_running;
+ unsigned long this_ideal_running;
+ unsigned long this_group_capacity;
+
+ struct sched_group *numa;
+ unsigned long numa_load;
+ unsigned long numa_nr_running;
+ unsigned long numa_numa_running;
+ unsigned long numa_shared_running;
+ unsigned long numa_ideal_running;
+ unsigned long numa_numa_weight;
+ unsigned long numa_group_capacity;
+ unsigned int numa_has_capacity;
+#endif
};
/*
@@ -4004,6 +4422,13 @@ struct sg_lb_stats {
unsigned long group_weight;
int group_imb; /* Is there an imbalance in the group ? */
int group_has_capacity; /* Is there extra capacity in the group? */
+
+#ifdef CONFIG_NUMA_BALANCING
+ unsigned long sum_ideal_running;
+ unsigned long sum_numa_running;
+ unsigned long sum_numa_weight;
+#endif
+ unsigned long sum_shared_running; /* 0 on non-NUMA */
};
/**
@@ -4032,6 +4457,160 @@ static inline int get_sd_load_idx(struct sched_domain *sd,
return load_idx;
}
+#ifdef CONFIG_NUMA_BALANCING
+
+static inline bool pick_numa_rand(int n)
+{
+ return !(get_random_int() % n);
+}
+
+static inline void update_sg_numa_stats(struct sg_lb_stats *sgs, struct rq *rq)
+{
+ sgs->sum_ideal_running += rq->nr_ideal_running;
+ sgs->sum_shared_running += rq->nr_shared_running;
+ sgs->sum_numa_running += rq->nr_numa_running;
+ sgs->sum_numa_weight += rq->numa_weight;
+}
+
+static inline
+void update_sd_numa_stats(struct sched_domain *sd, struct sched_group *sg,
+ struct sd_lb_stats *sds, struct sg_lb_stats *sgs,
+ int local_group)
+{
+ if (!(sd->flags & SD_NUMA))
+ return;
+
+ if (local_group) {
+ sds->this_numa_running = sgs->sum_numa_running;
+ sds->this_numa_weight = sgs->sum_numa_weight;
+ sds->this_shared_running = sgs->sum_shared_running;
+ sds->this_ideal_running = sgs->sum_ideal_running;
+ sds->this_group_capacity = sgs->group_capacity;
+
+ } else if (sgs->sum_numa_running - sgs->sum_ideal_running) {
+ if (!sds->numa || pick_numa_rand(sd->span_weight / sg->group_weight)) {
+ sds->numa = sg;
+ sds->numa_load = sgs->avg_load;
+ sds->numa_nr_running = sgs->sum_nr_running;
+ sds->numa_numa_running = sgs->sum_numa_running;
+ sds->numa_shared_running = sgs->sum_shared_running;
+ sds->numa_ideal_running = sgs->sum_ideal_running;
+ sds->numa_numa_weight = sgs->sum_numa_weight;
+ sds->numa_has_capacity = sgs->group_has_capacity;
+ sds->numa_group_capacity = sgs->group_capacity;
+ }
+ }
+}
+
+static struct rq *
+find_busiest_numa_queue(struct lb_env *env, struct sched_group *sg)
+{
+ struct rq *rq, *busiest = NULL;
+ int cpu;
+
+ for_each_cpu_and(cpu, sched_group_cpus(sg), env->cpus) {
+ rq = cpu_rq(cpu);
+
+ if (!rq->nr_numa_running)
+ continue;
+
+ if (!(rq->nr_numa_running - rq->nr_ideal_running))
+ continue;
+
+ if ((env->flags & LBF_KEEP_SHARED) && !(rq->nr_running - rq->nr_shared_running))
+ continue;
+
+ if (!busiest || pick_numa_rand(sg->group_weight))
+ busiest = rq;
+ }
+
+ return busiest;
+}
+
+#define TP_SG(_sg) \
+ (_sg) ? cpumask_first(sched_group_cpus(_sg)) : -1, \
+ (_sg) ? (_sg)->group_weight : -1
+
+static bool can_do_numa_run(struct lb_env *env, struct sd_lb_stats *sds)
+{
+ /*
+ * if we're overloaded; don't pull when:
+ * - the other guy isn't
+ * - imbalance would become too great
+ */
+ if (!sds->this_has_capacity) {
+ if (sds->numa_has_capacity)
+ return false;
+
+#if 0
+ if (sds->this_load * env->sd->imbalance_pct > sds->numa_load * 100)
+ return false;
+#endif
+ }
+
+ /*
+ * pull if we got easy trade
+ */
+ if (sds->this_nr_running - sds->this_numa_running)
+ return true;
+
+ /*
+ * If we got capacity allow stacking up on shared tasks.
+ */
+ if ((sds->this_shared_running < sds->this_group_capacity) && sds->numa_shared_running) {
+ env->flags |= LBF_NUMA_SHARED;
+ return true;
+ }
+
+ /*
+ * pull if we could possibly trade
+ */
+ if (sds->this_numa_running - sds->this_ideal_running)
+ return true;
+
+ return false;
+}
+
+/*
+ * Introduce some controlled imbalance to perturb the state so that we
+ * allow it to improve; this should be tightly controlled/co-ordinated
+ * with can_migrate_task().
+ */
+static int check_numa_busiest_group(struct lb_env *env, struct sd_lb_stats *sds)
+{
+ if (!sds->numa || !sds->numa_numa_running)
+ return 0;
+
+ if (!can_do_numa_run(env, sds))
+ return 0;
+
+ env->flags |= LBF_NUMA_RUN;
+ env->flags &= ~LBF_KEEP_SHARED;
+ env->imbalance = sds->numa_numa_weight / sds->numa_numa_running;
+ sds->busiest = sds->numa;
+ env->find_busiest_queue = find_busiest_numa_queue;
+
+ return 1;
+}
+
+#else /* !CONFIG_NUMA_BALANCING: */
+static inline
+void update_sd_numa_stats(struct sched_domain *sd, struct sched_group *sg,
+ struct sd_lb_stats *sds, struct sg_lb_stats *sgs,
+ int local_group)
+{
+}
+
+static inline void update_sg_numa_stats(struct sg_lb_stats *sgs, struct rq *rq)
+{
+}
+
+static inline int check_numa_busiest_group(struct lb_env *env, struct sd_lb_stats *sds)
+{
+ return 0;
+}
+#endif
+
unsigned long default_scale_freq_power(struct sched_domain *sd, int cpu)
{
return SCHED_POWER_SCALE;
@@ -4245,6 +4824,9 @@ static inline void update_sg_lb_stats(struct lb_env *env,
sgs->group_load += load;
sgs->sum_nr_running += nr_running;
sgs->sum_weighted_load += weighted_cpuload(i);
+
+ update_sg_numa_stats(sgs, rq);
+
if (idle_cpu(i))
sgs->idle_cpus++;
}
@@ -4336,6 +4918,13 @@ static bool update_sd_pick_busiest(struct lb_env *env,
return false;
}
+static void update_src_keep_shared(struct lb_env *env, bool keep_shared)
+{
+ env->flags &= ~LBF_KEEP_SHARED;
+ if (keep_shared)
+ env->flags |= LBF_KEEP_SHARED;
+}
+
/**
* update_sd_lb_stats - Update sched_domain's statistics for load balancing.
* @env: The load balancing environment.
@@ -4368,6 +4957,7 @@ static inline void update_sd_lb_stats(struct lb_env *env,
sds->total_load += sgs.group_load;
sds->total_pwr += sg->sgp->power;
+#ifdef CONFIG_NUMA_BALANCING
/*
* In case the child domain prefers tasks go to siblings
* first, lower the sg capacity to one so that we'll try
@@ -4378,8 +4968,11 @@ static inline void update_sd_lb_stats(struct lb_env *env,
* heaviest group when it is already under-utilized (possible
* with a large weight task outweighs the tasks on the system).
*/
- if (prefer_sibling && !local_group && sds->this_has_capacity)
- sgs.group_capacity = min(sgs.group_capacity, 1UL);
+ if (prefer_sibling && !local_group && sds->this_has_capacity) {
+ sgs.group_capacity = clamp_val(sgs.sum_shared_running,
+ 1UL, sgs.group_capacity);
+ }
+#endif
if (local_group) {
sds->this_load = sgs.avg_load;
@@ -4398,8 +4991,13 @@ static inline void update_sd_lb_stats(struct lb_env *env,
sds->busiest_has_capacity = sgs.group_has_capacity;
sds->busiest_group_weight = sgs.group_weight;
sds->group_imb = sgs.group_imb;
+
+ update_src_keep_shared(env,
+ sgs.sum_shared_running <= sgs.group_capacity);
}
+ update_sd_numa_stats(env->sd, sg, sds, &sgs, local_group);
+
sg = sg->next;
} while (sg != env->sd->groups);
}
@@ -4652,14 +5250,14 @@ find_busiest_group(struct lb_env *env, int *balance)
* don't try and pull any tasks.
*/
if (sds.this_load >= sds.max_load)
- goto out_balanced;
+ goto out_imbalanced;
/*
* Don't pull any tasks if this group is already above the domain
* average load.
*/
if (sds.this_load >= sds.avg_load)
- goto out_balanced;
+ goto out_imbalanced;
if (env->idle == CPU_IDLE) {
/*
@@ -4685,7 +5283,15 @@ force_balance:
calculate_imbalance(env, &sds);
return sds.busiest;
+out_imbalanced:
+ /* if we've got capacity allow for secondary placement preference */
+ if (!sds.this_has_capacity)
+ goto ret;
+
out_balanced:
+ if (check_numa_busiest_group(env, &sds))
+ return sds.busiest;
+
ret:
env->imbalance = 0;
return NULL;
@@ -4723,6 +5329,9 @@ static struct rq *find_busiest_queue(struct lb_env *env,
if (capacity && rq->nr_running == 1 && wl > env->imbalance)
continue;
+ if ((env->flags & LBF_KEEP_SHARED) && !(rq->nr_running - rq->nr_shared_running))
+ continue;
+
/*
* For the load comparisons with the other cpu's, consider
* the weighted_cpuload() scaled with the cpu power, so that
@@ -4749,25 +5358,40 @@ static struct rq *find_busiest_queue(struct lb_env *env,
/* Working cpumask for load_balance and load_balance_newidle. */
DEFINE_PER_CPU(cpumask_var_t, load_balance_tmpmask);
-static int need_active_balance(struct lb_env *env)
-{
- struct sched_domain *sd = env->sd;
-
- if (env->idle == CPU_NEWLY_IDLE) {
+static int active_load_balance_cpu_stop(void *data);
+static void update_sd_failed(struct lb_env *env, int ld_moved)
+{
+ if (!ld_moved) {
+ schedstat_inc(env->sd, lb_failed[env->idle]);
/*
- * ASYM_PACKING needs to force migrate tasks from busy but
- * higher numbered CPUs in order to pack all tasks in the
- * lowest numbered CPUs.
+ * Increment the failure counter only on periodic balance.
+ * We do not want newidle balance, which can be very
+ * frequent, pollute the failure counter causing
+ * excessive cache_hot migrations and active balances.
*/
- if ((sd->flags & SD_ASYM_PACKING) && env->src_cpu > env->dst_cpu)
- return 1;
- }
-
- return unlikely(sd->nr_balance_failed > sd->cache_nice_tries+2);
+ if (env->idle != CPU_NEWLY_IDLE && !(env->flags & LBF_NUMA_RUN))
+ env->sd->nr_balance_failed++;
+ } else
+ env->sd->nr_balance_failed = 0;
}
-static int active_load_balance_cpu_stop(void *data);
+/*
+ * See can_migrate_numa_task()
+ */
+static int lb_max_iteration(struct lb_env *env)
+{
+ if (!(env->sd->flags & SD_NUMA))
+ return 0;
+
+ if (env->flags & LBF_NUMA_RUN)
+ return 0; /* NUMA_RUN may only improve */
+
+ if (sched_feat_numa(NUMA_FAULTS_DOWN))
+ return 5; /* nodes^2 would suck */
+
+ return 3;
+}
/*
* Check this_cpu to ensure it is balanced within domain. Attempt to move
@@ -4793,6 +5417,8 @@ static int load_balance(int this_cpu, struct rq *this_rq,
.loop_break = sched_nr_migrate_break,
.cpus = cpus,
.find_busiest_queue = find_busiest_queue,
+ .failed = sd->nr_balance_failed,
+ .iteration = 0,
};
cpumask_copy(cpus, cpu_active_mask);
@@ -4816,6 +5442,8 @@ redo:
schedstat_inc(sd, lb_nobusyq[idle]);
goto out_balanced;
}
+ env.src_rq = busiest;
+ env.src_cpu = busiest->cpu;
BUG_ON(busiest == env.dst_rq);
@@ -4895,92 +5523,72 @@ more_balance:
}
/* All tasks on this runqueue were pinned by CPU affinity */
- if (unlikely(env.flags & LBF_ALL_PINNED)) {
- cpumask_clear_cpu(cpu_of(busiest), cpus);
- if (!cpumask_empty(cpus)) {
- env.loop = 0;
- env.loop_break = sched_nr_migrate_break;
- goto redo;
- }
- goto out_balanced;
+ if (unlikely(env.flags & LBF_ALL_PINNED))
+ goto out_pinned;
+
+ if (!ld_moved && env.iteration < lb_max_iteration(&env)) {
+ env.iteration++;
+ env.loop = 0;
+ goto more_balance;
}
}
- if (!ld_moved) {
- schedstat_inc(sd, lb_failed[idle]);
+ if (!ld_moved && idle != CPU_NEWLY_IDLE) {
+ raw_spin_lock_irqsave(&busiest->lock, flags);
+
/*
- * Increment the failure counter only on periodic balance.
- * We do not want newidle balance, which can be very
- * frequent, pollute the failure counter causing
- * excessive cache_hot migrations and active balances.
+ * Don't kick the active_load_balance_cpu_stop,
+ * if the curr task on busiest cpu can't be
+ * moved to this_cpu
*/
- if (idle != CPU_NEWLY_IDLE)
- sd->nr_balance_failed++;
-
- if (need_active_balance(&env)) {
- raw_spin_lock_irqsave(&busiest->lock, flags);
-
- /* don't kick the active_load_balance_cpu_stop,
- * if the curr task on busiest cpu can't be
- * moved to this_cpu
- */
- if (!cpumask_test_cpu(this_cpu,
- tsk_cpus_allowed(busiest->curr))) {
- raw_spin_unlock_irqrestore(&busiest->lock,
- flags);
- env.flags |= LBF_ALL_PINNED;
- goto out_one_pinned;
- }
-
- /*
- * ->active_balance synchronizes accesses to
- * ->active_balance_work. Once set, it's cleared
- * only after active load balance is finished.
- */
- if (!busiest->active_balance) {
- busiest->active_balance = 1;
- busiest->push_cpu = this_cpu;
- active_balance = 1;
- }
+ if (!cpumask_test_cpu(this_cpu, tsk_cpus_allowed(busiest->curr))) {
raw_spin_unlock_irqrestore(&busiest->lock, flags);
-
- if (active_balance) {
- stop_one_cpu_nowait(cpu_of(busiest),
- active_load_balance_cpu_stop, busiest,
- &busiest->active_balance_work);
- }
-
- /*
- * We've kicked active balancing, reset the failure
- * counter.
- */
- sd->nr_balance_failed = sd->cache_nice_tries+1;
+ env.flags |= LBF_ALL_PINNED;
+ goto out_pinned;
}
- } else
- sd->nr_balance_failed = 0;
- if (likely(!active_balance)) {
- /* We were unbalanced, so reset the balancing interval */
- sd->balance_interval = sd->min_interval;
- } else {
/*
- * If we've begun active balancing, start to back off. This
- * case may not be covered by the all_pinned logic if there
- * is only 1 task on the busy runqueue (because we don't call
- * move_tasks).
+ * ->active_balance synchronizes accesses to
+ * ->active_balance_work. Once set, it's cleared
+ * only after active load balance is finished.
*/
- if (sd->balance_interval < sd->max_interval)
- sd->balance_interval *= 2;
+ if (!busiest->active_balance) {
+ busiest->active_balance = 1;
+ busiest->ab_dst_cpu = this_cpu;
+ busiest->ab_flags = env.flags;
+ busiest->ab_failed = env.failed;
+ busiest->ab_idle = env.idle;
+ active_balance = 1;
+ }
+ raw_spin_unlock_irqrestore(&busiest->lock, flags);
+
+ if (active_balance) {
+ stop_one_cpu_nowait(cpu_of(busiest),
+ active_load_balance_cpu_stop, busiest,
+ &busiest->ab_work);
+ }
}
- goto out;
+ if (!active_balance)
+ update_sd_failed(&env, ld_moved);
+
+ sd->balance_interval = sd->min_interval;
+out:
+ return ld_moved;
+
+out_pinned:
+ cpumask_clear_cpu(cpu_of(busiest), cpus);
+ if (!cpumask_empty(cpus)) {
+ env.loop = 0;
+ env.loop_break = sched_nr_migrate_break;
+ goto redo;
+ }
out_balanced:
schedstat_inc(sd, lb_balanced[idle]);
sd->nr_balance_failed = 0;
-out_one_pinned:
/* tune up the balancing interval */
if (((env.flags & LBF_ALL_PINNED) &&
sd->balance_interval < MAX_PINNED_INTERVAL) ||
@@ -4988,8 +5596,8 @@ out_one_pinned:
sd->balance_interval *= 2;
ld_moved = 0;
-out:
- return ld_moved;
+
+ goto out;
}
/*
@@ -5060,7 +5668,7 @@ static int active_load_balance_cpu_stop(void *data)
{
struct rq *busiest_rq = data;
int busiest_cpu = cpu_of(busiest_rq);
- int target_cpu = busiest_rq->push_cpu;
+ int target_cpu = busiest_rq->ab_dst_cpu;
struct rq *target_rq = cpu_rq(target_cpu);
struct sched_domain *sd;
@@ -5098,17 +5706,23 @@ static int active_load_balance_cpu_stop(void *data)
.sd = sd,
.dst_cpu = target_cpu,
.dst_rq = target_rq,
- .src_cpu = busiest_rq->cpu,
+ .src_cpu = busiest_cpu,
.src_rq = busiest_rq,
- .idle = CPU_IDLE,
+ .flags = busiest_rq->ab_flags,
+ .failed = busiest_rq->ab_failed,
+ .idle = busiest_rq->ab_idle,
};
+ env.iteration = lb_max_iteration(&env);
schedstat_inc(sd, alb_count);
- if (move_one_task(&env))
+ if (move_one_task(&env)) {
schedstat_inc(sd, alb_pushed);
- else
+ update_sd_failed(&env, 1);
+ } else {
schedstat_inc(sd, alb_failed);
+ update_sd_failed(&env, 0);
+ }
}
rcu_read_unlock();
double_unlock_balance(busiest_rq, target_rq);
@@ -5508,6 +6122,9 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
}
update_rq_runnable_avg(rq, 1);
+
+ if (sched_feat_numa(NUMA) && nr_node_ids > 1)
+ task_tick_numa(rq, curr);
}
/*
@@ -5902,9 +6519,7 @@ const struct sched_class fair_sched_class = {
#ifdef CONFIG_SMP
.select_task_rq = select_task_rq_fair,
-#ifdef CONFIG_FAIR_GROUP_SCHED
.migrate_task_rq = migrate_task_rq_fair,
-#endif
.rq_online = rq_online_fair,
.rq_offline = rq_offline_fair,
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index eebefca..33cd556 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -61,3 +61,11 @@ SCHED_FEAT(TTWU_QUEUE, true)
SCHED_FEAT(FORCE_SD_OVERLAP, false)
SCHED_FEAT(RT_RUNTIME_SHARE, true)
SCHED_FEAT(LB_MIN, false)
+
+#ifdef CONFIG_NUMA_BALANCING
+/* Do the working set probing faults: */
+SCHED_FEAT(NUMA, true)
+SCHED_FEAT(NUMA_FAULTS_UP, true)
+SCHED_FEAT(NUMA_FAULTS_DOWN, false)
+SCHED_FEAT(NUMA_SETTLE, true)
+#endif
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 6ac4056..bb9475c 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -3,6 +3,7 @@
#include <linux/mutex.h>
#include <linux/spinlock.h>
#include <linux/stop_machine.h>
+#include <linux/slab.h>
#include "cpupri.h"
@@ -420,17 +421,29 @@ struct rq {
unsigned long cpu_power;
unsigned char idle_balance;
- /* For active balancing */
int post_schedule;
+
+ /* For active balancing */
int active_balance;
- int push_cpu;
- struct cpu_stop_work active_balance_work;
+ int ab_dst_cpu;
+ int ab_flags;
+ int ab_failed;
+ int ab_idle;
+ struct cpu_stop_work ab_work;
+
/* cpu of this runqueue: */
int cpu;
int online;
struct list_head cfs_tasks;
+#ifdef CONFIG_NUMA_BALANCING
+ unsigned long numa_weight;
+ unsigned long nr_numa_running;
+ unsigned long nr_ideal_running;
+#endif
+ unsigned long nr_shared_running; /* 0 on non-NUMA */
+
u64 rt_avg;
u64 age_stamp;
u64 idle_stamp;
@@ -501,6 +514,18 @@ DECLARE_PER_CPU(struct rq, runqueues);
#define cpu_curr(cpu) (cpu_rq(cpu)->curr)
#define raw_rq() (&__raw_get_cpu_var(runqueues))
+#ifdef CONFIG_NUMA_BALANCING
+extern void sched_setnuma(struct task_struct *p, int node, int shared);
+static inline void task_numa_free(struct task_struct *p)
+{
+ kfree(p->numa_faults);
+}
+#else /* CONFIG_NUMA_BALANCING */
+static inline void task_numa_free(struct task_struct *p)
+{
+}
+#endif /* CONFIG_NUMA_BALANCING */
+
#ifdef CONFIG_SMP
#define rcu_dereference_check_sched_domain(p) \
@@ -544,6 +569,7 @@ static inline struct sched_domain *highest_flag_domain(int cpu, int flag)
DECLARE_PER_CPU(struct sched_domain *, sd_llc);
DECLARE_PER_CPU(int, sd_llc_id);
+DECLARE_PER_CPU(struct sched_domain *, sd_node);
extern int group_balance_cpu(struct sched_group *sg);
--
1.7.11.7
From: Peter Zijlstra <[email protected]>
Previously, to probe the working set of a task, we'd use
a very simple and crude method: mark all of its address
space PROT_NONE.
That method has various (obvious) disadvantages:
- it samples the working set at dissimilar rates,
giving some tasks a sampling quality advantage
over others,
- it creates performance problems for tasks with
very large working sets,
- it over-samples processes with large address
spaces which only very rarely execute.
Improve that method by keeping a rotating offset into the
address space that marks the current position of the scan,
and advance it at a constant rate (proportional to the CPU
cycles the task executes). If the offset reaches the last
mapped address of the mm, it starts over at the first
address.
The per-task nature of the working set sampling functionality
in this tree allows such constant rate, per task,
execution-weight proportional sampling of the working set,
with an adaptive sampling interval/frequency that goes from
once per 100 msecs up to just once per 1.6 seconds.
The current sampling volume is 256 MB per interval.
As tasks mature and their working set converges, the
sampling rate slows down to just a trickle: 256 MB per 1.6
seconds of CPU time executed.
This, beyond being adaptive, also rate-limits scanning on
rarely executing systems and avoids over-sampling on
overloaded systems.
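To make the mechanism concrete, here is a minimal user-space sketch of
the rotating-offset, fixed-volume scan described above. The vma layout,
the mark_prot_none() helper and the scan size are illustrative
assumptions only; this is a model of the idea, not the kernel code in
this patch:

#include <stdio.h>

#define SCAN_SIZE	(256UL << 20)	/* 256 MB of address space per scan */

struct vma {
	unsigned long start;
	unsigned long end;
};

/* Hypothetical address space layout, for illustration only: */
static struct vma vmas[] = {
	{ 0x00400000UL, 0x10400000UL },		/* 256 MB */
	{ 0x20000000UL, 0x60000000UL },		/* 1 GB   */
};
#define NR_VMAS	(sizeof(vmas) / sizeof(vmas[0]))

static unsigned long scan_offset;		/* plays mm->numa_scan_offset */

static void mark_prot_none(unsigned long start, unsigned long end)
{
	printf("scan: %#lx - %#lx\n", start, end);
}

static void scan_one_interval(void)
{
	unsigned long left = SCAN_SIZE;
	unsigned int i;

	for (i = 0; i < NR_VMAS && left > 0; i++) {
		unsigned long start, end;

		if (scan_offset >= vmas[i].end)
			continue;

		/* Resume where the previous interval stopped: */
		start = scan_offset > vmas[i].start ? scan_offset : vmas[i].start;
		end   = vmas[i].end < start + left ? vmas[i].end : start + left;

		mark_prot_none(start, end);
		left -= end - start;
		scan_offset = end;
	}

	/* Reached the last mapped address: start over at the first one. */
	if (left > 0)
		scan_offset = 0;
}

int main(void)
{
	int i;

	for (i = 0; i < 6; i++)		/* six scan intervals */
		scan_one_interval();

	return 0;
}

Each invocation marks at most SCAN_SIZE bytes and remembers where it
stopped, so the whole address space is covered at a constant per-task
rate regardless of its total size.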
[ In AutoNUMA speak, this patch deals with the effective sampling
rate of the 'hinting page fault'. AutoNUMA's scanning is
currently rate-limited, but it is also fundamentally
single-threaded, executing in the knuma_scand kernel thread,
so the limit in AutoNUMA is global and does not scale up with
the number of CPUs, nor does it scan tasks in an execution
proportional manner.
So the idea of rate-limiting the scanning was first implemented
in the AutoNUMA tree via a global rate limit. This patch goes
beyond that by implementing an execution rate proportional
working set sampling rate that is not implemented via a single
global scanning daemon. ]
[ Dan Carpenter pointed out a possible NULL pointer dereference in the
first version of this patch. ]
Based-on-idea-by: Andrea Arcangeli <[email protected]>
Bug-Found-By: Dan Carpenter <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Mel Gorman <[email protected]>
Link: http://lkml.kernel.org/n/[email protected]
[ Wrote changelog and fixed bug. ]
Signed-off-by: Ingo Molnar <[email protected]>
---
include/linux/mm_types.h | 1 +
include/linux/sched.h | 1 +
kernel/sched/fair.c | 41 +++++++++++++++++++++++++++++------------
kernel/sysctl.c | 7 +++++++
4 files changed, 38 insertions(+), 12 deletions(-)
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 48760e9..5995652 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -405,6 +405,7 @@ struct mm_struct {
#endif
#ifdef CONFIG_NUMA_BALANCING
unsigned long numa_next_scan;
+ unsigned long numa_scan_offset;
int numa_scan_seq;
#endif
struct uprobes_state uprobes_state;
diff --git a/include/linux/sched.h b/include/linux/sched.h
index bb12cc3..3372aac 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2047,6 +2047,7 @@ extern enum sched_tunable_scaling sysctl_sched_tunable_scaling;
extern unsigned int sysctl_sched_numa_scan_period_min;
extern unsigned int sysctl_sched_numa_scan_period_max;
+extern unsigned int sysctl_sched_numa_scan_size;
extern unsigned int sysctl_sched_numa_settle_count;
#ifdef CONFIG_SCHED_DEBUG
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b1d8fea..2a843e7 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -825,8 +825,9 @@ static void account_numa_dequeue(struct rq *rq, struct task_struct *p)
/*
* numa task sample period in ms: 5s
*/
-unsigned int sysctl_sched_numa_scan_period_min = 5000;
-unsigned int sysctl_sched_numa_scan_period_max = 5000*16;
+unsigned int sysctl_sched_numa_scan_period_min = 100;
+unsigned int sysctl_sched_numa_scan_period_max = 100*16;
+unsigned int sysctl_sched_numa_scan_size = 256; /* MB */
/*
* Wait for the 2-sample stuff to settle before migrating again
@@ -912,6 +913,9 @@ void task_numa_work(struct callback_head *work)
unsigned long migrate, next_scan, now = jiffies;
struct task_struct *p = current;
struct mm_struct *mm = p->mm;
+ struct vm_area_struct *vma;
+ unsigned long offset, end;
+ long length;
WARN_ON_ONCE(p != container_of(work, struct task_struct, numa_work));
@@ -938,18 +942,31 @@ void task_numa_work(struct callback_head *work)
if (cmpxchg(&mm->numa_next_scan, migrate, next_scan) != migrate)
return;
- ACCESS_ONCE(mm->numa_scan_seq)++;
- {
- struct vm_area_struct *vma;
+ offset = mm->numa_scan_offset;
+ length = sysctl_sched_numa_scan_size;
+ length <<= 20;
- down_write(&mm->mmap_sem);
- for (vma = mm->mmap; vma; vma = vma->vm_next) {
- if (!vma_migratable(vma))
- continue;
- change_protection(vma, vma->vm_start, vma->vm_end, vma_prot_none(vma), 0);
- }
- up_write(&mm->mmap_sem);
+ down_write(&mm->mmap_sem);
+ vma = find_vma(mm, offset);
+ if (!vma) {
+ ACCESS_ONCE(mm->numa_scan_seq)++;
+ offset = 0;
+ vma = mm->mmap;
+ }
+ for (; vma && length > 0; vma = vma->vm_next) {
+ if (!vma_migratable(vma))
+ continue;
+
+ offset = max(offset, vma->vm_start);
+ end = min(ALIGN(offset + length, HPAGE_SIZE), vma->vm_end);
+ length -= end - offset;
+
+ change_prot_none(vma, offset, end);
+
+ offset = end;
}
+ mm->numa_scan_offset = offset;
+ up_write(&mm->mmap_sem);
}
/*
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index f6cd550..f1f6d8c 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -367,6 +367,13 @@ static struct ctl_table kern_table[] = {
.proc_handler = proc_dointvec,
},
{
+ .procname = "sched_numa_scan_size_mb",
+ .data = &sysctl_sched_numa_scan_size,
+ .maxlen = sizeof(unsigned int),
+ .mode = 0644,
+ .proc_handler = proc_dointvec,
+ },
+ {
.procname = "sched_numa_settle_count",
.data = &sysctl_sched_numa_settle_count,
.maxlen = sizeof(unsigned int),
--
1.7.11.7
There's no good reason to disallow the migration of pages shared
by multiple processes - the migration code itself is already
properly walking the rmap chain.
So allow it. We've tested this with various workloads and
no ill effects appear to have come from it.
Cc: Peter Zijlstra <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Mel Gorman <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
---
mm/migrate.c | 6 ------
1 file changed, 6 deletions(-)
diff --git a/mm/migrate.c b/mm/migrate.c
index 72d1056..b89062d 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1427,12 +1427,6 @@ int migrate_misplaced_page(struct page *page, int node)
gfp_t gfp = GFP_HIGHUSER_MOVABLE;
/*
- * Don't migrate pages that are mapped in multiple processes.
- */
- if (page_mapcount(page) != 1)
- goto out;
-
- /*
* Never wait for allocations just to migrate on fault, but don't dip
* into reserves. And, only accept pages from the specified node. No
* sense migrating to a different "misplaced" page!
--
1.7.11.7
From: Peter Zijlstra <[email protected]>
Add a 1 second delay before starting to scan the working set of
a task and starting to balance it amongst nodes.
[ note that before the constant per task WSS sampling rate patch
the initial scan would happen much later still; in effect that
patch caused this regression. ]
The theory is that short-run tasks benefit very little from NUMA
placement: they come and go, and they are better off sticking to
the node they were started on. As tasks mature and rebalance to
other CPUs and nodes, their NUMA placement has to change as well,
and it starts to matter more and more.
In practice this change fixes an observable kbuild regression:
# [ a perf stat --null --repeat 10 test of ten bzImage builds to /dev/shm ]
!NUMA:
45.291088843 seconds time elapsed ( +- 0.40% )
45.154231752 seconds time elapsed ( +- 0.36% )
+NUMA, no slow start:
46.172308123 seconds time elapsed ( +- 0.30% )
46.343168745 seconds time elapsed ( +- 0.25% )
+NUMA, 1 sec slow start:
45.224189155 seconds time elapsed ( +- 0.25% )
45.160866532 seconds time elapsed ( +- 0.17% )
and it also fixes an observable perf bench (hackbench) regression:
# perf stat --null --repeat 10 perf bench sched messaging
-NUMA: 0.246225691 seconds time elapsed ( +- 1.31% )
+NUMA no slow start: 0.252620063 seconds time elapsed ( +- 1.13% )
+NUMA 1sec delay: 0.248076230 seconds time elapsed ( +- 1.35% )
The implementation is simple and straightforward; most of the patch
deals with adding the /proc/sys/kernel/sched_numa_scan_delay_ms tunable
knob.
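As a rough user-space model of how the delay folds into the existing
scan-period logic (the field and sysctl names mirror the patch, but the
code below is an illustrative assumption, not the kernel implementation):

#include <stdio.h>

static unsigned int scan_delay = 1000;		/* ms, initial value */
static unsigned int scan_period_min = 100;	/* ms, steady state */

struct task {
	unsigned long long node_stamp;	/* ns of CPU time at last scan */
	unsigned int numa_scan_period;	/* ms */
};

static void task_fork(struct task *p)
{
	p->node_stamp = 0;
	p->numa_scan_period = scan_delay;	/* acts as the slow start */
}

static int task_tick_numa(struct task *p, unsigned long long now_ns)
{
	unsigned long long period = (unsigned long long)p->numa_scan_period * 1000000ULL;

	if (now_ns - p->node_stamp <= period)
		return 0;			/* not yet */

	p->node_stamp += period;
	p->numa_scan_period = scan_period_min;	/* full rate from now on */
	return 1;				/* would queue task_numa_work() */
}

int main(void)
{
	struct task t;
	unsigned long long ns;

	task_fork(&t);
	for (ns = 0; ns <= 1500000000ULL; ns += 100000000ULL)	/* 100 ms ticks */
		if (task_tick_numa(&t, ns))
			printf("scan triggered at %llu ms\n", ns / 1000000ULL);
	return 0;
}

The first tick-driven scan only fires after roughly scan_delay ms of
accumulated CPU time; afterwards the period drops back to
scan_period_min and scanning proceeds at the normal rate.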
Signed-off-by: Peter Zijlstra <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Mel Gorman <[email protected]>
Link: http://lkml.kernel.org/n/[email protected]
[ Wrote the changelog, ran measurements, tuned the default. ]
Signed-off-by: Ingo Molnar <[email protected]>
---
include/linux/sched.h | 1 +
kernel/sched/core.c | 2 +-
kernel/sched/fair.c | 16 ++++++++++------
kernel/sysctl.c | 7 +++++++
4 files changed, 19 insertions(+), 7 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 3372aac..8f65323 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2045,6 +2045,7 @@ enum sched_tunable_scaling {
};
extern enum sched_tunable_scaling sysctl_sched_tunable_scaling;
+extern unsigned int sysctl_sched_numa_scan_delay;
extern unsigned int sysctl_sched_numa_scan_period_min;
extern unsigned int sysctl_sched_numa_scan_period_max;
extern unsigned int sysctl_sched_numa_scan_size;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 0cd9896..9dbbe45 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1556,7 +1556,7 @@ static void __sched_fork(struct task_struct *p)
p->numa_scan_seq = p->mm ? p->mm->numa_scan_seq : 0;
p->numa_migrate_seq = 2;
p->numa_faults = NULL;
- p->numa_scan_period = sysctl_sched_numa_scan_period_min;
+ p->numa_scan_period = sysctl_sched_numa_scan_delay;
p->numa_work.next = &p->numa_work;
#endif /* CONFIG_NUMA_BALANCING */
}
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index adcad19..d4d708e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -823,11 +823,12 @@ static void account_numa_dequeue(struct rq *rq, struct task_struct *p)
}
/*
- * numa task sample period in ms: 5s
+ * Scan @scan_size MB every @scan_period after an initial @scan_delay.
*/
-unsigned int sysctl_sched_numa_scan_period_min = 100;
-unsigned int sysctl_sched_numa_scan_period_max = 100*16;
-unsigned int sysctl_sched_numa_scan_size = 256; /* MB */
+unsigned int sysctl_sched_numa_scan_delay = 1000; /* ms */
+unsigned int sysctl_sched_numa_scan_period_min = 100; /* ms */
+unsigned int sysctl_sched_numa_scan_period_max = 100*16;/* ms */
+unsigned int sysctl_sched_numa_scan_size = 256; /* MB */
/*
* Wait for the 2-sample stuff to settle before migrating again
@@ -938,10 +939,12 @@ void task_numa_work(struct callback_head *work)
if (time_before(now, migrate))
return;
- next_scan = now + 2*msecs_to_jiffies(sysctl_sched_numa_scan_period_min);
+ next_scan = now + msecs_to_jiffies(sysctl_sched_numa_scan_period_min);
if (cmpxchg(&mm->numa_next_scan, migrate, next_scan) != migrate)
return;
+ current->numa_scan_period += jiffies_to_msecs(2);
+
start = mm->numa_scan_offset;
pages = sysctl_sched_numa_scan_size;
pages <<= 20 - PAGE_SHIFT; /* MB in pages */
@@ -998,7 +1001,8 @@ void task_tick_numa(struct rq *rq, struct task_struct *curr)
period = (u64)curr->numa_scan_period * NSEC_PER_MSEC;
if (now - curr->node_stamp > period) {
- curr->node_stamp = now;
+ curr->node_stamp += period;
+ curr->numa_scan_period = sysctl_sched_numa_scan_period_min;
/*
* We are comparing runtime to wall clock time here, which
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index f1f6d8c..5b005d8 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -353,6 +353,13 @@ static struct ctl_table kern_table[] = {
#endif /* CONFIG_SMP */
#ifdef CONFIG_NUMA_BALANCING
{
+ .procname = "sched_numa_scan_delay_ms",
+ .data = &sysctl_sched_numa_scan_delay,
+ .maxlen = sizeof(unsigned int),
+ .mode = 0644,
+ .proc_handler = proc_dointvec,
+ },
+ {
.procname = "sched_numa_scan_period_min_ms",
.data = &sysctl_sched_numa_scan_period_min,
.maxlen = sizeof(unsigned int),
--
1.7.11.7
From: Peter Zijlstra <[email protected]>
Add the NUMA working set scanning/hinting page fault machinery,
with no policy yet.
Signed-off-by: Peter Zijlstra <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Mel Gorman <[email protected]>
[ split it out of the main policy patch - as suggested by Mel Gorman ]
Signed-off-by: Ingo Molnar <[email protected]>
---
include/linux/init_task.h | 8 ++++
include/linux/mm_types.h | 4 ++
include/linux/sched.h | 43 +++++++++++++++++++--
init/Kconfig | 9 +++++
kernel/sched/core.c | 15 ++++++++
kernel/sysctl.c | 31 +++++++++++++++-
mm/huge_memory.c | 7 +++-
mm/memory.c | 6 ++-
mm/mempolicy.c | 95 +++++++++++++++++++++++++++++++++++++++--------
9 files changed, 193 insertions(+), 25 deletions(-)
diff --git a/include/linux/init_task.h b/include/linux/init_task.h
index 6d087c5..ed98982 100644
--- a/include/linux/init_task.h
+++ b/include/linux/init_task.h
@@ -143,6 +143,13 @@ extern struct task_group root_task_group;
#define INIT_TASK_COMM "swapper"
+#ifdef CONFIG_NUMA_BALANCING
+# define INIT_TASK_NUMA(tsk) \
+ .numa_shared = -1,
+#else
+# define INIT_TASK_NUMA(tsk)
+#endif
+
/*
* INIT_TASK is used to set up the first task table, touch at
* your own risk!. Base=0, limit=0x1fffff (=2MB)
@@ -210,6 +217,7 @@ extern struct task_group root_task_group;
INIT_TRACE_RECURSION \
INIT_TASK_RCU_PREEMPT(tsk) \
INIT_CPUSET_SEQ \
+ INIT_TASK_NUMA(tsk) \
}
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 7e9f758..48760e9 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -403,6 +403,10 @@ struct mm_struct {
#ifdef CONFIG_CPUMASK_OFFSTACK
struct cpumask cpumask_allocation;
#endif
+#ifdef CONFIG_NUMA_BALANCING
+ unsigned long numa_next_scan;
+ int numa_scan_seq;
+#endif
struct uprobes_state uprobes_state;
};
diff --git a/include/linux/sched.h b/include/linux/sched.h
index e1581a0..418d405 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1501,6 +1501,18 @@ struct task_struct {
short il_next;
short pref_node_fork;
#endif
+#ifdef CONFIG_NUMA_BALANCING
+ int numa_shared;
+ int numa_max_node;
+ int numa_scan_seq;
+ int numa_migrate_seq;
+ unsigned int numa_scan_period;
+ u64 node_stamp; /* migration stamp */
+ unsigned long numa_weight;
+ unsigned long *numa_faults;
+ struct callback_head numa_work;
+#endif /* CONFIG_NUMA_BALANCING */
+
struct rcu_head rcu;
/*
@@ -1575,6 +1587,26 @@ struct task_struct {
/* Future-safe accessor for struct task_struct's cpus_allowed. */
#define tsk_cpus_allowed(tsk) (&(tsk)->cpus_allowed)
+#ifdef CONFIG_NUMA_BALANCING
+extern void task_numa_fault(int node, int cpu, int pages);
+#else
+static inline void task_numa_fault(int node, int cpu, int pages) { }
+#endif /* CONFIG_NUMA_BALANCING */
+
+/*
+ * -1: non-NUMA task
+ * 0: NUMA task with a dominantly 'private' working set
+ * 1: NUMA task with a dominantly 'shared' working set
+ */
+static inline int task_numa_shared(struct task_struct *p)
+{
+#ifdef CONFIG_NUMA_BALANCING
+ return p->numa_shared;
+#else
+ return -1;
+#endif
+}
+
/*
* Priority of a process goes from 0..MAX_PRIO-1, valid RT
* priority is 0..MAX_RT_PRIO-1, and SCHED_NORMAL/SCHED_BATCH
@@ -2012,6 +2044,10 @@ enum sched_tunable_scaling {
};
extern enum sched_tunable_scaling sysctl_sched_tunable_scaling;
+extern unsigned int sysctl_sched_numa_scan_period_min;
+extern unsigned int sysctl_sched_numa_scan_period_max;
+extern unsigned int sysctl_sched_numa_settle_count;
+
#ifdef CONFIG_SCHED_DEBUG
extern unsigned int sysctl_sched_migration_cost;
extern unsigned int sysctl_sched_nr_migrate;
@@ -2022,18 +2058,17 @@ extern unsigned int sysctl_sched_shares_window;
int sched_proc_update_handler(struct ctl_table *table, int write,
void __user *buffer, size_t *length,
loff_t *ppos);
-#endif
-#ifdef CONFIG_SCHED_DEBUG
+
static inline unsigned int get_sysctl_timer_migration(void)
{
return sysctl_timer_migration;
}
-#else
+#else /* CONFIG_SCHED_DEBUG */
static inline unsigned int get_sysctl_timer_migration(void)
{
return 1;
}
-#endif
+#endif /* CONFIG_SCHED_DEBUG */
extern unsigned int sysctl_sched_rt_period;
extern int sysctl_sched_rt_runtime;
diff --git a/init/Kconfig b/init/Kconfig
index 78807b3..4367c62 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -710,6 +710,15 @@ config ARCH_WANT_NUMA_VARIABLE_LOCALITY
config ARCH_SUPPORTS_NUMA_BALANCING
bool
+config NUMA_BALANCING
+ bool "Memory placement aware NUMA scheduler"
+ default n
+ depends on ARCH_SUPPORTS_NUMA_BALANCING
+ depends on !ARCH_WANT_NUMA_VARIABLE_LOCALITY
+ depends on SMP && NUMA && MIGRATION
+ help
+ This option adds support for automatic NUMA aware memory/task placement.
+
menuconfig CGROUPS
boolean "Control Group support"
depends on EVENTFD
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 5dae0d2..3611f5f 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1544,6 +1544,21 @@ static void __sched_fork(struct task_struct *p)
#ifdef CONFIG_PREEMPT_NOTIFIERS
INIT_HLIST_HEAD(&p->preempt_notifiers);
#endif
+
+#ifdef CONFIG_NUMA_BALANCING
+ if (p->mm && atomic_read(&p->mm->mm_users) == 1) {
+ p->mm->numa_next_scan = jiffies;
+ p->mm->numa_scan_seq = 0;
+ }
+
+ p->numa_shared = -1;
+ p->node_stamp = 0ULL;
+ p->numa_scan_seq = p->mm ? p->mm->numa_scan_seq : 0;
+ p->numa_migrate_seq = 2;
+ p->numa_faults = NULL;
+ p->numa_scan_period = sysctl_sched_numa_scan_period_min;
+ p->numa_work.next = &p->numa_work;
+#endif /* CONFIG_NUMA_BALANCING */
}
/*
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 26f65ea..f6cd550 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -256,9 +256,11 @@ static int min_sched_granularity_ns = 100000; /* 100 usecs */
static int max_sched_granularity_ns = NSEC_PER_SEC; /* 1 second */
static int min_wakeup_granularity_ns; /* 0 usecs */
static int max_wakeup_granularity_ns = NSEC_PER_SEC; /* 1 second */
+#ifdef CONFIG_SMP
static int min_sched_tunable_scaling = SCHED_TUNABLESCALING_NONE;
static int max_sched_tunable_scaling = SCHED_TUNABLESCALING_END-1;
-#endif
+#endif /* CONFIG_SMP */
+#endif /* CONFIG_SCHED_DEBUG */
#ifdef CONFIG_COMPACTION
static int min_extfrag_threshold;
@@ -301,6 +303,7 @@ static struct ctl_table kern_table[] = {
.extra1 = &min_wakeup_granularity_ns,
.extra2 = &max_wakeup_granularity_ns,
},
+#ifdef CONFIG_SMP
{
.procname = "sched_tunable_scaling",
.data = &sysctl_sched_tunable_scaling,
@@ -347,7 +350,31 @@ static struct ctl_table kern_table[] = {
.extra1 = &zero,
.extra2 = &one,
},
-#endif
+#endif /* CONFIG_SMP */
+#ifdef CONFIG_NUMA_BALANCING
+ {
+ .procname = "sched_numa_scan_period_min_ms",
+ .data = &sysctl_sched_numa_scan_period_min,
+ .maxlen = sizeof(unsigned int),
+ .mode = 0644,
+ .proc_handler = proc_dointvec,
+ },
+ {
+ .procname = "sched_numa_scan_period_max_ms",
+ .data = &sysctl_sched_numa_scan_period_max,
+ .maxlen = sizeof(unsigned int),
+ .mode = 0644,
+ .proc_handler = proc_dointvec,
+ },
+ {
+ .procname = "sched_numa_settle_count",
+ .data = &sysctl_sched_numa_settle_count,
+ .maxlen = sizeof(unsigned int),
+ .mode = 0644,
+ .proc_handler = proc_dointvec,
+ },
+#endif /* CONFIG_NUMA_BALANCING */
+#endif /* CONFIG_SCHED_DEBUG */
{
.procname = "sched_rt_period_us",
.data = &sysctl_sched_rt_period,
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index fbff718..088f23b 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -776,9 +776,10 @@ fixup:
unlock:
spin_unlock(&mm->page_table_lock);
- if (page)
+ if (page) {
+ task_numa_fault(page_to_nid(page), last_cpu, HPAGE_PMD_NR);
put_page(page);
-
+ }
return;
migrate:
@@ -847,6 +848,8 @@ migrate:
put_page(page); /* Drop the rmap reference */
+ task_numa_fault(node, last_cpu, HPAGE_PMD_NR);
+
if (lru)
put_page(page); /* drop the LRU isolation reference */
diff --git a/mm/memory.c b/mm/memory.c
index ebd18fd..a13da1e 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3484,6 +3484,7 @@ static int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
{
struct page *page = NULL;
int node, page_nid = -1;
+ int last_cpu = -1;
spinlock_t *ptl;
ptl = pte_lockptr(mm, pmd);
@@ -3495,6 +3496,7 @@ static int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
if (page) {
get_page(page);
page_nid = page_to_nid(page);
+ last_cpu = page_last_cpu(page);
node = mpol_misplaced(page, vma, address);
if (node != -1)
goto migrate;
@@ -3514,8 +3516,10 @@ out_pte_upgrade_unlock:
out_unlock:
pte_unmap_unlock(ptep, ptl);
out:
- if (page)
+ if (page) {
+ task_numa_fault(page_nid, last_cpu, 1);
put_page(page);
+ }
return 0;
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 5ee326c..e31571c 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -2194,12 +2194,70 @@ static void sp_free(struct sp_node *n)
kmem_cache_free(sn_cache, n);
}
+/*
+ * Multi-stage node selection is used in conjunction with a periodic
+ * migration fault to build a temporal task<->page relation. By
+ * using a two-stage filter we remove short/unlikely relations.
+ *
+ * Using P(p) ~ n_p / n_t as per frequentist probability, we can
+ * equate a task's usage of a particular page (n_p) per total usage
+ * of this page (n_t) (in a given time-span) to a probability.
+ *
+ * Our periodic faults will then sample this probability and getting
+ * the same result twice in a row, given these samples are fully
+ * independent, is then given by P(n)^2, provided our sample period
+ * is sufficiently short compared to the usage pattern.
+ *
+ * This quadric squishes small probabilities, making it less likely
+ * we act on an unlikely task<->page relation.
+ *
+ * Return the best node ID this page should be on, or -1 if it should
+ * stay where it is.
+ */
+static int
+numa_migration_target(struct page *page, int page_nid,
+ struct task_struct *p, int this_cpu,
+ int cpu_last_access)
+{
+ int nid_last_access;
+ int this_nid;
+
+ if (task_numa_shared(p) < 0)
+ return -1;
+
+ /*
+ * Possibly migrate towards the current node, depends on
+ * task_numa_placement() and access details.
+ */
+ nid_last_access = cpu_to_node(cpu_last_access);
+ this_nid = cpu_to_node(this_cpu);
+
+ if (nid_last_access != this_nid) {
+ /*
+ * 'Access miss': the page got last accessed from a remote node.
+ */
+ return -1;
+ }
+ /*
+ * 'Access hit': the page got last accessed from our node.
+ *
+ * Migrate the page if needed.
+ */
+
+ /* The page is already on this node: */
+ if (page_nid == this_nid)
+ return -1;
+
+ return this_nid;
+}
+
/**
* mpol_misplaced - check whether current page node is valid in policy
*
* @page - page to be checked
* @vma - vm area where page mapped
* @addr - virtual address where page mapped
+ * @multi - use multi-stage node binding
*
* Lookup current policy node id for vma,addr and "compare to" page's
* node id.
@@ -2213,18 +2271,22 @@ static void sp_free(struct sp_node *n)
*/
int mpol_misplaced(struct page *page, struct vm_area_struct *vma, unsigned long addr)
{
+ int best_nid = -1, page_nid;
+ int cpu_last_access, this_cpu;
struct mempolicy *pol;
- struct zone *zone;
- int curnid = page_to_nid(page);
unsigned long pgoff;
- int polnid = -1;
- int ret = -1;
+ struct zone *zone;
BUG_ON(!vma);
+ this_cpu = raw_smp_processor_id();
+ page_nid = page_to_nid(page);
+
+ cpu_last_access = page_xchg_last_cpu(page, this_cpu);
+
pol = get_vma_policy(current, vma, addr);
- if (!(pol->flags & MPOL_F_MOF))
- goto out;
+ if (!(pol->flags & MPOL_F_MOF) && !(task_numa_shared(current) >= 0))
+ goto out_keep_page;
switch (pol->mode) {
case MPOL_INTERLEAVE:
@@ -2233,14 +2295,14 @@ int mpol_misplaced(struct page *page, struct vm_area_struct *vma, unsigned long
pgoff = vma->vm_pgoff;
pgoff += (addr - vma->vm_start) >> PAGE_SHIFT;
- polnid = offset_il_node(pol, vma, pgoff);
+ best_nid = offset_il_node(pol, vma, pgoff);
break;
case MPOL_PREFERRED:
if (pol->flags & MPOL_F_LOCAL)
- polnid = numa_node_id();
+ best_nid = numa_node_id();
else
- polnid = pol->v.preferred_node;
+ best_nid = pol->v.preferred_node;
break;
case MPOL_BIND:
@@ -2250,24 +2312,25 @@ int mpol_misplaced(struct page *page, struct vm_area_struct *vma, unsigned long
* else select nearest allowed node, if any.
* If no allowed nodes, use current [!misplaced].
*/
- if (node_isset(curnid, pol->v.nodes))
- goto out;
+ if (node_isset(page_nid, pol->v.nodes))
+ goto out_keep_page;
(void)first_zones_zonelist(
node_zonelist(numa_node_id(), GFP_HIGHUSER),
gfp_zone(GFP_HIGHUSER),
&pol->v.nodes, &zone);
- polnid = zone->node;
+ best_nid = zone->node;
break;
default:
BUG();
}
- if (curnid != polnid)
- ret = polnid;
-out:
+
+ best_nid = numa_migration_target(page, page_nid, current, this_cpu, cpu_last_access);
+
+out_keep_page:
mpol_cond_put(pol);
- return ret;
+ return best_nid;
}
static void sp_delete(struct shared_policy *sp, struct sp_node *n)
--
1.7.11.7
From: Peter Zijlstra <[email protected]>
By accounting the scan volume against the present PTEs, the scanning
speed reflects the actually present (mapped) memory.
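A toy model of the accounting change, with an invented page map and
helper names used purely for illustration: the scan budget is decremented
by the number of PTEs actually changed rather than by the virtual span
covered, so sparsely mapped ranges no longer exhaust the sampling volume:

#include <stdio.h>

#define NR_PAGES	16

static int pte_present_tbl[NR_PAGES] = {	/* 1 = mapped, 0 = hole */
	1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1
};

/* Returns the number of PTEs it actually modified, like change_prot_none(): */
static long change_range(int start, int end)
{
	long changed = 0;
	int i;

	for (i = start; i < end; i++)
		if (pte_present_tbl[i])
			changed++;
	return changed;
}

int main(void)
{
	long budget = 4;	/* pages of scan budget per interval */
	int pos = 0;

	while (budget > 0 && pos < NR_PAGES) {
		int end = pos + 4 < NR_PAGES ? pos + 4 : NR_PAGES;

		budget -= change_range(pos, end);	/* present PTEs only */
		pos = end;
	}
	printf("scan stopped at page %d\n", pos);
	return 0;
}

Charging the budget by virtual span would have stopped the scan after
the first 4 pages; charging by present PTEs lets it cover the holes for
free and reach page 12 before the budget runs out.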
Suggested-by: Ingo Molnar <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Mel Gorman <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
---
include/linux/hugetlb.h | 8 ++++++--
include/linux/mm.h | 6 +++---
kernel/sched/fair.c | 37 +++++++++++++++++++++----------------
mm/hugetlb.c | 10 ++++++++--
mm/mprotect.c | 41 ++++++++++++++++++++++++++++++-----------
5 files changed, 68 insertions(+), 34 deletions(-)
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 2251648..06e691b 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -87,7 +87,7 @@ struct page *follow_huge_pud(struct mm_struct *mm, unsigned long address,
pud_t *pud, int write);
int pmd_huge(pmd_t pmd);
int pud_huge(pud_t pmd);
-void hugetlb_change_protection(struct vm_area_struct *vma,
+unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
unsigned long address, unsigned long end, pgprot_t newprot);
#else /* !CONFIG_HUGETLB_PAGE */
@@ -132,7 +132,11 @@ static inline void copy_huge_page(struct page *dst, struct page *src)
{
}
-#define hugetlb_change_protection(vma, address, end, newprot)
+static inline unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
+ unsigned long address, unsigned long end, pgprot_t newprot)
+{
+ return 0;
+}
static inline void __unmap_hugepage_range_final(struct mmu_gather *tlb,
struct vm_area_struct *vma, unsigned long start,
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 141a28f..e6df281 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1099,7 +1099,7 @@ extern unsigned long move_page_tables(struct vm_area_struct *vma,
extern unsigned long do_mremap(unsigned long addr,
unsigned long old_len, unsigned long new_len,
unsigned long flags, unsigned long new_addr);
-extern void change_protection(struct vm_area_struct *vma, unsigned long start,
+extern unsigned long change_protection(struct vm_area_struct *vma, unsigned long start,
unsigned long end, pgprot_t newprot,
int dirty_accountable);
extern int mprotect_fixup(struct vm_area_struct *vma,
@@ -1581,10 +1581,10 @@ static inline pgprot_t vma_prot_none(struct vm_area_struct *vma)
return pgprot_modify(vma->vm_page_prot, vm_get_page_prot(vmflags));
}
-static inline void
+static inline unsigned long
change_prot_none(struct vm_area_struct *vma, unsigned long start, unsigned long end)
{
- change_protection(vma, start, end, vma_prot_none(vma), 0);
+ return change_protection(vma, start, end, vma_prot_none(vma), 0);
}
struct vm_area_struct *find_extend_vma(struct mm_struct *, unsigned long addr);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 2a843e7..adcad19 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -914,8 +914,8 @@ void task_numa_work(struct callback_head *work)
struct task_struct *p = current;
struct mm_struct *mm = p->mm;
struct vm_area_struct *vma;
- unsigned long offset, end;
- long length;
+ unsigned long start, end;
+ long pages;
WARN_ON_ONCE(p != container_of(work, struct task_struct, numa_work));
@@ -942,30 +942,35 @@ void task_numa_work(struct callback_head *work)
if (cmpxchg(&mm->numa_next_scan, migrate, next_scan) != migrate)
return;
- offset = mm->numa_scan_offset;
- length = sysctl_sched_numa_scan_size;
- length <<= 20;
+ start = mm->numa_scan_offset;
+ pages = sysctl_sched_numa_scan_size;
+ pages <<= 20 - PAGE_SHIFT; /* MB in pages */
+ if (!pages)
+ return;
down_write(&mm->mmap_sem);
- vma = find_vma(mm, offset);
+ vma = find_vma(mm, start);
if (!vma) {
ACCESS_ONCE(mm->numa_scan_seq)++;
- offset = 0;
+ start = 0;
vma = mm->mmap;
}
- for (; vma && length > 0; vma = vma->vm_next) {
+ for (; vma; vma = vma->vm_next) {
if (!vma_migratable(vma))
continue;
- offset = max(offset, vma->vm_start);
- end = min(ALIGN(offset + length, HPAGE_SIZE), vma->vm_end);
- length -= end - offset;
-
- change_prot_none(vma, offset, end);
-
- offset = end;
+ do {
+ start = max(start, vma->vm_start);
+ end = ALIGN(start + (pages << PAGE_SHIFT), HPAGE_SIZE);
+ end = min(end, vma->vm_end);
+ pages -= change_prot_none(vma, start, end);
+ start = end;
+ if (pages <= 0)
+ goto out;
+ } while (end != vma->vm_end);
}
- mm->numa_scan_offset = offset;
+out:
+ mm->numa_scan_offset = start;
up_write(&mm->mmap_sem);
}
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 59a0059..712895e 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -3014,7 +3014,7 @@ same_page:
return i ? i : -EFAULT;
}
-void hugetlb_change_protection(struct vm_area_struct *vma,
+unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
unsigned long address, unsigned long end, pgprot_t newprot)
{
struct mm_struct *mm = vma->vm_mm;
@@ -3022,6 +3022,7 @@ void hugetlb_change_protection(struct vm_area_struct *vma,
pte_t *ptep;
pte_t pte;
struct hstate *h = hstate_vma(vma);
+ unsigned long pages = 0;
BUG_ON(address >= end);
flush_cache_range(vma, address, end);
@@ -3032,12 +3033,15 @@ void hugetlb_change_protection(struct vm_area_struct *vma,
ptep = huge_pte_offset(mm, address);
if (!ptep)
continue;
- if (huge_pmd_unshare(mm, &address, ptep))
+ if (huge_pmd_unshare(mm, &address, ptep)) {
+ pages++;
continue;
+ }
if (!huge_pte_none(huge_ptep_get(ptep))) {
pte = huge_ptep_get_and_clear(mm, address, ptep);
pte = pte_mkhuge(pte_modify(pte, newprot));
set_huge_pte_at(mm, address, ptep, pte);
+ pages++;
}
}
spin_unlock(&mm->page_table_lock);
@@ -3049,6 +3053,8 @@ void hugetlb_change_protection(struct vm_area_struct *vma,
*/
flush_tlb_range(vma, start, end);
mutex_unlock(&vma->vm_file->f_mapping->i_mmap_mutex);
+
+ return pages << h->order;
}
int hugetlb_reserve_pages(struct inode *inode,
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 392b124..ce0377b 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -28,12 +28,13 @@
#include <asm/cacheflush.h>
#include <asm/tlbflush.h>
-static void change_pte_range(struct mm_struct *mm, pmd_t *pmd,
+static unsigned long change_pte_range(struct mm_struct *mm, pmd_t *pmd,
unsigned long addr, unsigned long end, pgprot_t newprot,
int dirty_accountable)
{
pte_t *pte, oldpte;
spinlock_t *ptl;
+ unsigned long pages = 0;
pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
arch_enter_lazy_mmu_mode();
@@ -53,6 +54,7 @@ static void change_pte_range(struct mm_struct *mm, pmd_t *pmd,
ptent = pte_mkwrite(ptent);
ptep_modify_prot_commit(mm, addr, pte, ptent);
+ pages++;
} else if (IS_ENABLED(CONFIG_MIGRATION) && !pte_file(oldpte)) {
swp_entry_t entry = pte_to_swp_entry(oldpte);
@@ -65,18 +67,22 @@ static void change_pte_range(struct mm_struct *mm, pmd_t *pmd,
set_pte_at(mm, addr, pte,
swp_entry_to_pte(entry));
}
+ pages++;
}
} while (pte++, addr += PAGE_SIZE, addr != end);
arch_leave_lazy_mmu_mode();
pte_unmap_unlock(pte - 1, ptl);
+
+ return pages;
}
-static inline void change_pmd_range(struct vm_area_struct *vma, pud_t *pud,
+static inline unsigned long change_pmd_range(struct vm_area_struct *vma, pud_t *pud,
unsigned long addr, unsigned long end, pgprot_t newprot,
int dirty_accountable)
{
pmd_t *pmd;
unsigned long next;
+ unsigned long pages = 0;
pmd = pmd_offset(pud, addr);
do {
@@ -84,35 +90,42 @@ static inline void change_pmd_range(struct vm_area_struct *vma, pud_t *pud,
if (pmd_trans_huge(*pmd)) {
if (next - addr != HPAGE_PMD_SIZE)
split_huge_page_pmd(vma->vm_mm, pmd);
- else if (change_huge_pmd(vma, pmd, addr, newprot))
+ else if (change_huge_pmd(vma, pmd, addr, newprot)) {
+ pages += HPAGE_PMD_NR;
continue;
+ }
/* fall through */
}
if (pmd_none_or_clear_bad(pmd))
continue;
- change_pte_range(vma->vm_mm, pmd, addr, next, newprot,
+ pages += change_pte_range(vma->vm_mm, pmd, addr, next, newprot,
dirty_accountable);
} while (pmd++, addr = next, addr != end);
+
+ return pages;
}
-static inline void change_pud_range(struct vm_area_struct *vma, pgd_t *pgd,
+static inline unsigned long change_pud_range(struct vm_area_struct *vma, pgd_t *pgd,
unsigned long addr, unsigned long end, pgprot_t newprot,
int dirty_accountable)
{
pud_t *pud;
unsigned long next;
+ unsigned long pages = 0;
pud = pud_offset(pgd, addr);
do {
next = pud_addr_end(addr, end);
if (pud_none_or_clear_bad(pud))
continue;
- change_pmd_range(vma, pud, addr, next, newprot,
+ pages += change_pmd_range(vma, pud, addr, next, newprot,
dirty_accountable);
} while (pud++, addr = next, addr != end);
+
+ return pages;
}
-static void change_protection_range(struct vm_area_struct *vma,
+static unsigned long change_protection_range(struct vm_area_struct *vma,
unsigned long addr, unsigned long end, pgprot_t newprot,
int dirty_accountable)
{
@@ -120,6 +133,7 @@ static void change_protection_range(struct vm_area_struct *vma,
pgd_t *pgd;
unsigned long next;
unsigned long start = addr;
+ unsigned long pages = 0;
BUG_ON(addr >= end);
pgd = pgd_offset(mm, addr);
@@ -128,24 +142,29 @@ static void change_protection_range(struct vm_area_struct *vma,
next = pgd_addr_end(addr, end);
if (pgd_none_or_clear_bad(pgd))
continue;
- change_pud_range(vma, pgd, addr, next, newprot,
+ pages += change_pud_range(vma, pgd, addr, next, newprot,
dirty_accountable);
} while (pgd++, addr = next, addr != end);
flush_tlb_range(vma, start, end);
+
+ return pages;
}
-void change_protection(struct vm_area_struct *vma, unsigned long start,
+unsigned long change_protection(struct vm_area_struct *vma, unsigned long start,
unsigned long end, pgprot_t newprot,
int dirty_accountable)
{
struct mm_struct *mm = vma->vm_mm;
+ unsigned long pages;
mmu_notifier_invalidate_range_start(mm, start, end);
if (is_vm_hugetlb_page(vma))
- hugetlb_change_protection(vma, start, end, newprot);
+ pages = hugetlb_change_protection(vma, start, end, newprot);
else
- change_protection_range(vma, start, end, newprot, dirty_accountable);
+ pages = change_protection_range(vma, start, end, newprot, dirty_accountable);
mmu_notifier_invalidate_range_end(mm, start, end);
+
+ return pages;
}
int
--
1.7.11.7
From: Rik van Riel <[email protected]>
The NUMA placement code has been rewritten several times, but
the basic ideas took a lot of work to develop. The people who
put in the work deserve credit for it. Thanks Andrea & Peter :)
[ The Documentation/scheduler/numa-problem.txt file should
probably be rewritten once we figure out the final details of
what the NUMA code needs to do, and why. ]
Signed-off-by: Rik van Riel <[email protected]>
Acked-by: Peter Zijlstra <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Andrew Morton <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
----
This is against tip.git numa/core
---
CREDITS | 1 +
kernel/sched/fair.c | 3 +++
mm/memory.c | 2 ++
3 files changed, 6 insertions(+)
diff --git a/CREDITS b/CREDITS
index d8fe12a..b4cdc8f 100644
--- a/CREDITS
+++ b/CREDITS
@@ -125,6 +125,7 @@ D: Author of pscan that helps to fix lp/parport bugs
D: Author of lil (Linux Interrupt Latency benchmark)
D: Fixed the shm swap deallocation at swapoff time (try_to_unuse message)
D: VM hacker
+D: NUMA task placement
D: Various other kernel hacks
S: Imola 40026
S: Italy
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 93f4de4..309a254 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -18,6 +18,9 @@
*
* Adaptive scheduling granularity, math enhancements by Peter Zijlstra
* Copyright (C) 2007 Red Hat, Inc., Peter Zijlstra <[email protected]>
+ *
+ * NUMA placement, statistics and algorithm by Andrea Arcangeli,
+ * CFS balancing changes by Peter Zijlstra. Copyright (C) 2012 Red Hat, Inc.
*/
#include <linux/latencytop.h>
diff --git a/mm/memory.c b/mm/memory.c
index 1b9108c..ebd18fd 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -36,6 +36,8 @@
* ([email protected])
*
* Aug/Sep 2004 Changed to four level page tables (Andi Kleen)
+ *
+ * 2012 - NUMA placement page faults (Andrea Arcangeli, Peter Zijlstra)
*/
#include <linux/kernel_stat.h>
--
1.7.11.7
From: Andrea Arcangeli <[email protected]>
Introduce FOLL_NUMA to tell follow_page to check
pte/pmd_numa. get_user_pages must use FOLL_NUMA, and it's safe to do
so because it always invokes handle_mm_fault and retries the
follow_page later.
KVM secondary MMU page faults will trigger the NUMA hinting page
faults through gup_fast -> get_user_pages -> follow_page ->
handle_mm_fault.
Other follow_page callers like KSM should not use FOLL_NUMA, or they
would fail to get the pages if they use follow_page instead of
get_user_pages.
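To make the FOLL_FORCE interaction concrete, here is a minimal
stand-alone sketch of the gating rule (userspace C; FOLL_NUMA's value
matches the hunk below, FOLL_FORCE's value is only illustrative):

#include <stdio.h>

#define FOLL_FORCE	0x10	/* illustrative value only */
#define FOLL_NUMA	0x200	/* force NUMA hinting page fault */

static unsigned int gup_adjust_flags(unsigned int gup_flags)
{
	/*
	 * Never combine FOLL_FORCE with FOLL_NUMA: on PROT_NONE ranges the
	 * hinting fault would otherwise unprotect the mapping whenever
	 * _PAGE_NUMA and _PAGE_PROTNONE share the same pte/pmd bit.
	 */
	if (!(gup_flags & FOLL_FORCE))
		gup_flags |= FOLL_NUMA;

	return gup_flags;
}

int main(void)
{
	printf("plain gup:  %#x\n", gup_adjust_flags(0));
	printf("forced gup: %#x\n", gup_adjust_flags(FOLL_FORCE));
	return 0;
}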
[ This patch was picked up from the AutoNUMA tree. ]
Originally-by: Andrea Arcangeli <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Rik van Riel <[email protected]>
[ ported to this tree. ]
Signed-off-by: Ingo Molnar <[email protected]>
---
include/linux/mm.h | 1 +
mm/memory.c | 17 +++++++++++++++++
2 files changed, 18 insertions(+)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 0025bf9..1821629 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1600,6 +1600,7 @@ struct page *follow_page(struct vm_area_struct *, unsigned long address,
#define FOLL_MLOCK 0x40 /* mark page as mlocked */
#define FOLL_SPLIT 0x80 /* don't return transhuge pages, split them */
#define FOLL_HWPOISON 0x100 /* check page is hwpoisoned */
+#define FOLL_NUMA 0x200 /* force NUMA hinting page fault */
typedef int (*pte_fn_t)(pte_t *pte, pgtable_t token, unsigned long addr,
void *data);
diff --git a/mm/memory.c b/mm/memory.c
index e3e8ab2..a660fd0 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1536,6 +1536,8 @@ struct page *follow_page(struct vm_area_struct *vma, unsigned long address,
page = follow_huge_pmd(mm, address, pmd, flags & FOLL_WRITE);
goto out;
}
+ if ((flags & FOLL_NUMA) && pmd_numa(vma, *pmd))
+ goto no_page_table;
if (pmd_trans_huge(*pmd)) {
if (flags & FOLL_SPLIT) {
split_huge_page_pmd(mm, pmd);
@@ -1565,6 +1567,8 @@ split_fallthrough:
pte = *ptep;
if (!pte_present(pte))
goto no_page;
+ if ((flags & FOLL_NUMA) && pte_numa(vma, pte))
+ goto no_page;
if ((flags & FOLL_WRITE) && !pte_write(pte))
goto unlock;
@@ -1716,6 +1720,19 @@ int __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
(VM_WRITE | VM_MAYWRITE) : (VM_READ | VM_MAYREAD);
vm_flags &= (gup_flags & FOLL_FORCE) ?
(VM_MAYREAD | VM_MAYWRITE) : (VM_READ | VM_WRITE);
+
+ /*
+ * If FOLL_FORCE and FOLL_NUMA are both set, handle_mm_fault
+ * would be called on PROT_NONE ranges. We must never invoke
+ * handle_mm_fault on PROT_NONE ranges or the NUMA hinting
+ * page faults would unprotect the PROT_NONE ranges if
+ * _PAGE_NUMA and _PAGE_PROTNONE are sharing the same pte/pmd
+ * bitflag. So to avoid that, don't set FOLL_NUMA if
+ * FOLL_FORCE is set.
+ */
+ if (!(gup_flags & FOLL_FORCE))
+ gup_flags |= FOLL_NUMA;
+
i = 0;
do {
--
1.7.11.7
From: Peter Zijlstra <[email protected]>
Some architectures (ab)use NUMA to represent different memory
regions all cpu-local but of different latencies, such as SuperH.
The naming comes from Mel Gorman.
Named-by: Mel Gorman <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Mel Gorman <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
---
arch/sh/mm/Kconfig | 1 +
init/Kconfig | 7 +++++++
2 files changed, 8 insertions(+)
diff --git a/arch/sh/mm/Kconfig b/arch/sh/mm/Kconfig
index cb8f992..0f7c852 100644
--- a/arch/sh/mm/Kconfig
+++ b/arch/sh/mm/Kconfig
@@ -111,6 +111,7 @@ config VSYSCALL
config NUMA
bool "Non Uniform Memory Access (NUMA) Support"
depends on MMU && SYS_SUPPORTS_NUMA && EXPERIMENTAL
+ select ARCH_WANT_NUMA_VARIABLE_LOCALITY
default n
help
Some SH systems have many various memories scattered around
diff --git a/init/Kconfig b/init/Kconfig
index 6fdd6e3..ae412fd 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -696,6 +696,13 @@ config LOG_BUF_SHIFT
config HAVE_UNSTABLE_SCHED_CLOCK
bool
+#
+# For architectures that (ab)use NUMA to represent different memory regions
+# all cpu-local but of different latencies, such as SuperH.
+#
+config ARCH_WANT_NUMA_VARIABLE_LOCALITY
+ bool
+
menuconfig CGROUPS
boolean "Control Group support"
depends on EVENTFD
--
1.7.11.7
From: Peter Zijlstra <[email protected]>
Introduce a per-page last_cpu field, fold this into the struct
page::flags field whenever possible.
The unlikely/rare 32bit NUMA configs will likely grow the page-frame.
[ Completely dropping 32bit support for CONFIG_NUMA_BALANCING would simplify
things, but it would also remove the warning we get if enough 64bit-only
page-flags are added to push the last_cpu field out of page->flags. ]
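For reference, a stand-alone sketch of the shift/mask arithmetic used to
fold a last_cpu value into a flags word; the field width and position are
made up here, and the in-kernel page_xchg_last_cpu() additionally loops
on cmpxchg() to stay atomic against concurrent page-flag updates:

#include <stdio.h>

/* Field width and position are illustrative only: */
#define LAST_CPU_WIDTH		8
#define LAST_CPU_PGSHIFT	24
#define LAST_CPU_MASK		((1UL << LAST_CPU_WIDTH) - 1)

static int xchg_last_cpu(unsigned long *flags, int cpu)
{
	int last_cpu = (*flags >> LAST_CPU_PGSHIFT) & LAST_CPU_MASK;

	*flags &= ~(LAST_CPU_MASK << LAST_CPU_PGSHIFT);
	*flags |= ((unsigned long)cpu & LAST_CPU_MASK) << LAST_CPU_PGSHIFT;

	return last_cpu;
}

int main(void)
{
	unsigned long flags = 0;

	xchg_last_cpu(&flags, 3);
	printf("flags = %#lx, last_cpu = %lu\n",
	       flags, (flags >> LAST_CPU_PGSHIFT) & LAST_CPU_MASK);
	return 0;
}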
Suggested-by: Rik van Riel <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Mel Gorman <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
---
include/linux/mm.h | 90 +++++++++++++++++++++------------------
include/linux/mm_types.h | 5 +++
include/linux/mmzone.h | 14 +-----
include/linux/page-flags-layout.h | 83 ++++++++++++++++++++++++++++++++++++
kernel/bounds.c | 4 ++
mm/huge_memory.c | 3 ++
mm/memory.c | 4 ++
7 files changed, 149 insertions(+), 54 deletions(-)
create mode 100644 include/linux/page-flags-layout.h
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 1821629..141a28f 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -594,50 +594,11 @@ static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma)
* sets it, so none of the operations on it need to be atomic.
*/
-
-/*
- * page->flags layout:
- *
- * There are three possibilities for how page->flags get
- * laid out. The first is for the normal case, without
- * sparsemem. The second is for sparsemem when there is
- * plenty of space for node and section. The last is when
- * we have run out of space and have to fall back to an
- * alternate (slower) way of determining the node.
- *
- * No sparsemem or sparsemem vmemmap: | NODE | ZONE | ... | FLAGS |
- * classic sparse with space for node:| SECTION | NODE | ZONE | ... | FLAGS |
- * classic sparse no space for node: | SECTION | ZONE | ... | FLAGS |
- */
-#if defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP)
-#define SECTIONS_WIDTH SECTIONS_SHIFT
-#else
-#define SECTIONS_WIDTH 0
-#endif
-
-#define ZONES_WIDTH ZONES_SHIFT
-
-#if SECTIONS_WIDTH+ZONES_WIDTH+NODES_SHIFT <= BITS_PER_LONG - NR_PAGEFLAGS
-#define NODES_WIDTH NODES_SHIFT
-#else
-#ifdef CONFIG_SPARSEMEM_VMEMMAP
-#error "Vmemmap: No space for nodes field in page flags"
-#endif
-#define NODES_WIDTH 0
-#endif
-
-/* Page flags: | [SECTION] | [NODE] | ZONE | ... | FLAGS | */
+/* Page flags: | [SECTION] | [NODE] | ZONE | [LAST_CPU] | ... | FLAGS | */
#define SECTIONS_PGOFF ((sizeof(unsigned long)*8) - SECTIONS_WIDTH)
#define NODES_PGOFF (SECTIONS_PGOFF - NODES_WIDTH)
#define ZONES_PGOFF (NODES_PGOFF - ZONES_WIDTH)
-
-/*
- * We are going to use the flags for the page to node mapping if its in
- * there. This includes the case where there is no node, so it is implicit.
- */
-#if !(NODES_WIDTH > 0 || NODES_SHIFT == 0)
-#define NODE_NOT_IN_PAGE_FLAGS
-#endif
+#define LAST_CPU_PGOFF (ZONES_PGOFF - LAST_CPU_WIDTH)
/*
* Define the bit shifts to access each section. For non-existent
@@ -647,6 +608,7 @@ static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma)
#define SECTIONS_PGSHIFT (SECTIONS_PGOFF * (SECTIONS_WIDTH != 0))
#define NODES_PGSHIFT (NODES_PGOFF * (NODES_WIDTH != 0))
#define ZONES_PGSHIFT (ZONES_PGOFF * (ZONES_WIDTH != 0))
+#define LAST_CPU_PGSHIFT (LAST_CPU_PGOFF * (LAST_CPU_WIDTH != 0))
/* NODE:ZONE or SECTION:ZONE is used to ID a zone for the buddy allocator */
#ifdef NODE_NOT_IN_PAGE_FLAGS
@@ -668,6 +630,7 @@ static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma)
#define ZONES_MASK ((1UL << ZONES_WIDTH) - 1)
#define NODES_MASK ((1UL << NODES_WIDTH) - 1)
#define SECTIONS_MASK ((1UL << SECTIONS_WIDTH) - 1)
+#define LAST_CPU_MASK ((1UL << LAST_CPU_WIDTH) - 1)
#define ZONEID_MASK ((1UL << ZONEID_SHIFT) - 1)
static inline enum zone_type page_zonenum(const struct page *page)
@@ -706,6 +669,51 @@ static inline int page_to_nid(const struct page *page)
}
#endif
+#ifdef CONFIG_NUMA_BALANCING
+#ifdef LAST_CPU_NOT_IN_PAGE_FLAGS
+static inline int page_xchg_last_cpu(struct page *page, int cpu)
+{
+ return xchg(&page->_last_cpu, cpu);
+}
+
+static inline int page_last_cpu(struct page *page)
+{
+ return page->_last_cpu;
+}
+#else
+static inline int page_xchg_last_cpu(struct page *page, int cpu)
+{
+ unsigned long old_flags, flags;
+ int last_cpu;
+
+ do {
+ old_flags = flags = page->flags;
+ last_cpu = (flags >> LAST_CPU_PGSHIFT) & LAST_CPU_MASK;
+
+ flags &= ~(LAST_CPU_MASK << LAST_CPU_PGSHIFT);
+ flags |= (cpu & LAST_CPU_MASK) << LAST_CPU_PGSHIFT;
+ } while (unlikely(cmpxchg(&page->flags, old_flags, flags) != old_flags));
+
+ return last_cpu;
+}
+
+static inline int page_last_cpu(struct page *page)
+{
+ return (page->flags >> LAST_CPU_PGSHIFT) & LAST_CPU_MASK;
+}
+#endif /* LAST_CPU_NOT_IN_PAGE_FLAGS */
+#else /* CONFIG_NUMA_BALANCING */
+static inline int page_xchg_last_cpu(struct page *page, int cpu)
+{
+ return page_to_nid(page);
+}
+
+static inline int page_last_cpu(struct page *page)
+{
+ return page_to_nid(page);
+}
+#endif /* CONFIG_NUMA_BALANCING */
+
static inline struct zone *page_zone(const struct page *page)
{
return &NODE_DATA(page_to_nid(page))->node_zones[page_zonenum(page)];
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 31f8a3a..7e9f758 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -12,6 +12,7 @@
#include <linux/cpumask.h>
#include <linux/page-debug-flags.h>
#include <linux/uprobes.h>
+#include <linux/page-flags-layout.h>
#include <asm/page.h>
#include <asm/mmu.h>
@@ -175,6 +176,10 @@ struct page {
*/
void *shadow;
#endif
+
+#ifdef LAST_CPU_NOT_IN_PAGE_FLAGS
+ int _last_cpu;
+#endif
}
/*
* The struct page can be forced to be double word aligned so that atomic ops
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 50aaca8..7e116ed 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -15,7 +15,7 @@
#include <linux/seqlock.h>
#include <linux/nodemask.h>
#include <linux/pageblock-flags.h>
-#include <generated/bounds.h>
+#include <linux/page-flags-layout.h>
#include <linux/atomic.h>
#include <asm/page.h>
@@ -318,16 +318,6 @@ enum zone_type {
* match the requested limits. See gfp_zone() in include/linux/gfp.h
*/
-#if MAX_NR_ZONES < 2
-#define ZONES_SHIFT 0
-#elif MAX_NR_ZONES <= 2
-#define ZONES_SHIFT 1
-#elif MAX_NR_ZONES <= 4
-#define ZONES_SHIFT 2
-#else
-#error ZONES_SHIFT -- too many zones configured adjust calculation
-#endif
-
struct zone {
/* Fields commonly accessed by the page allocator */
@@ -1030,8 +1020,6 @@ static inline unsigned long early_pfn_to_nid(unsigned long pfn)
* PA_SECTION_SHIFT physical address to/from section number
* PFN_SECTION_SHIFT pfn to/from section number
*/
-#define SECTIONS_SHIFT (MAX_PHYSMEM_BITS - SECTION_SIZE_BITS)
-
#define PA_SECTION_SHIFT (SECTION_SIZE_BITS)
#define PFN_SECTION_SHIFT (SECTION_SIZE_BITS - PAGE_SHIFT)
diff --git a/include/linux/page-flags-layout.h b/include/linux/page-flags-layout.h
new file mode 100644
index 0000000..b258132
--- /dev/null
+++ b/include/linux/page-flags-layout.h
@@ -0,0 +1,83 @@
+#ifndef _LINUX_PAGE_FLAGS_LAYOUT
+#define _LINUX_PAGE_FLAGS_LAYOUT
+
+#include <linux/numa.h>
+#include <generated/bounds.h>
+
+#if MAX_NR_ZONES < 2
+#define ZONES_SHIFT 0
+#elif MAX_NR_ZONES <= 2
+#define ZONES_SHIFT 1
+#elif MAX_NR_ZONES <= 4
+#define ZONES_SHIFT 2
+#else
+#error ZONES_SHIFT -- too many zones configured adjust calculation
+#endif
+
+#ifdef CONFIG_SPARSEMEM
+#include <asm/sparsemem.h>
+
+/*
+ * SECTION_SHIFT #bits space required to store a section #
+ */
+#define SECTIONS_SHIFT (MAX_PHYSMEM_BITS - SECTION_SIZE_BITS)
+#endif
+
+/*
+ * page->flags layout:
+ *
+ * There are five possibilities for how page->flags get laid out. The first
+ * (and second) is for the normal case, without sparsemem. The third is for
+ * sparsemem when there is plenty of space for node and section. The last is
+ * when we have run out of space and have to fall back to an alternate (slower)
+ * way of determining the node.
+ *
+ * No sparsemem or sparsemem vmemmap: | NODE | ZONE | ... | FLAGS |
+ * " plus space for last_cpu:| NODE | ZONE | LAST_CPU | ... | FLAGS |
+ * classic sparse with space for node:| SECTION | NODE | ZONE | ... | FLAGS |
+ * " plus space for last_cpu:| SECTION | NODE | ZONE | LAST_CPU | ... | FLAGS |
+ * classic sparse no space for node: | SECTION | ZONE | ... | FLAGS |
+ */
+#if defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP)
+
+#define SECTIONS_WIDTH SECTIONS_SHIFT
+#else
+#define SECTIONS_WIDTH 0
+#endif
+
+#define ZONES_WIDTH ZONES_SHIFT
+
+#if SECTIONS_WIDTH+ZONES_WIDTH+NODES_SHIFT <= BITS_PER_LONG - NR_PAGEFLAGS
+#define NODES_WIDTH NODES_SHIFT
+#else
+#ifdef CONFIG_SPARSEMEM_VMEMMAP
+#error "Vmemmap: No space for nodes field in page flags"
+#endif
+#define NODES_WIDTH 0
+#endif
+
+#ifdef CONFIG_NUMA_BALANCING
+#define LAST_CPU_SHIFT NR_CPUS_BITS
+#else
+#define LAST_CPU_SHIFT 0
+#endif
+
+#if SECTIONS_WIDTH+ZONES_WIDTH+NODES_SHIFT+LAST_CPU_SHIFT <= BITS_PER_LONG - NR_PAGEFLAGS
+#define LAST_CPU_WIDTH LAST_CPU_SHIFT
+#else
+#define LAST_CPU_WIDTH 0
+#endif
+
+/*
+ * We are going to use the flags for the page to node mapping if its in
+ * there. This includes the case where there is no node, so it is implicit.
+ */
+#if !(NODES_WIDTH > 0 || NODES_SHIFT == 0)
+#define NODE_NOT_IN_PAGE_FLAGS
+#endif
+
+#if defined(CONFIG_NUMA_BALANCING) && LAST_CPU_WIDTH == 0
+#define LAST_CPU_NOT_IN_PAGE_FLAGS
+#endif
+
+#endif /* _LINUX_PAGE_FLAGS_LAYOUT */
diff --git a/kernel/bounds.c b/kernel/bounds.c
index 0c9b862..e8ca97b 100644
--- a/kernel/bounds.c
+++ b/kernel/bounds.c
@@ -10,6 +10,7 @@
#include <linux/mmzone.h>
#include <linux/kbuild.h>
#include <linux/page_cgroup.h>
+#include <linux/log2.h>
void foo(void)
{
@@ -17,5 +18,8 @@ void foo(void)
DEFINE(NR_PAGEFLAGS, __NR_PAGEFLAGS);
DEFINE(MAX_NR_ZONES, __MAX_NR_ZONES);
DEFINE(NR_PCG_FLAGS, __NR_PCG_FLAGS);
+#ifdef CONFIG_SMP
+ DEFINE(NR_CPUS_BITS, ilog2(CONFIG_NR_CPUS));
+#endif
/* End of constants */
}
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 931caf4..fbff718 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -745,6 +745,7 @@ void do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
struct page *new_page = NULL;
struct page *page = NULL;
int node, lru;
+ int last_cpu;
spin_lock(&mm->page_table_lock);
if (unlikely(!pmd_same(*pmd, entry)))
@@ -759,6 +760,7 @@ void do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
page = pmd_page(entry);
if (page) {
VM_BUG_ON(!PageCompound(page) || !PageHead(page));
+ last_cpu = page_last_cpu(page);
get_page(page);
node = mpol_misplaced(page, vma, haddr);
@@ -1440,6 +1442,7 @@ static void __split_huge_page_refcount(struct page *page)
page_tail->mapping = page->mapping;
page_tail->index = page->index + i;
+ page_xchg_last_cpu(page, page_last_cpu(page_tail));
BUG_ON(!PageAnon(page_tail));
BUG_ON(!PageUptodate(page_tail));
diff --git a/mm/memory.c b/mm/memory.c
index 0d26a28..1b9108c 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -68,6 +68,10 @@
#include "internal.h"
+#ifdef LAST_CPU_NOT_IN_PAGE_FLAGS
+#warning Unfortunate NUMA config, growing page-frame for last_cpu.
+#endif
+
#ifndef CONFIG_NEED_MULTIPLE_NODES
/* use the per-pgdat data instead for discontigmem - mbligh */
unsigned long max_mapnr;
--
1.7.11.7
From: Peter Zijlstra <[email protected]>
Combine our previous PROT_NONE, mpol_misplaced and
migrate_misplaced_page() pieces into an effective migrate on fault
scheme.
Note that (on x86) we rely on PROT_NONE pages being !present and avoid
the TLB flush from try_to_unmap(TTU_MIGRATION). This greatly improves
the page-migration performance.
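The resulting fault-path decision, reduced to a toy stand-alone model
with stubbed-out helpers (the real locking, refcounting and retry logic
is in the do_numa_page() hunk below):

#include <stdbool.h>
#include <stdio.h>

/* Stub: -1 means the page is already on the right node. */
static int mpol_misplaced_stub(int page_nid)
{
	return page_nid == 0 ? -1 : 0;	/* pretend the policy prefers node 0 */
}

/* Stub: pretend migration always succeeds. */
static bool migrate_misplaced_page_stub(int target_nid)
{
	(void)target_nid;
	return true;
}

static void numa_hinting_fault(int page_nid)
{
	int target = mpol_misplaced_stub(page_nid);

	if (target != -1 && migrate_misplaced_page_stub(target)) {
		printf("page migrated: node %d -> node %d\n", page_nid, target);
		return;
	}

	/*
	 * Not misplaced (or migration failed): restore the normal
	 * protections; upgrading the PTE needs no remote TLB flush.
	 */
	printf("page stays on node %d, protections restored\n", page_nid);
}

int main(void)
{
	numa_hinting_fault(0);
	numa_hinting_fault(1);
	return 0;
}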
Suggested-by: Rik van Riel <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Reviewed-by: Rik van Riel <[email protected]>
Cc: Paul Turner <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Link: http://lkml.kernel.org/n/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
mm/huge_memory.c | 41 +++++++++++++++++++++++++++++++++++-
mm/memory.c | 63 ++++++++++++++++++++++++++++++++++++++++----------------
2 files changed, 85 insertions(+), 19 deletions(-)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 6924edf..c4c0a57 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -18,6 +18,7 @@
#include <linux/freezer.h>
#include <linux/mman.h>
#include <linux/pagemap.h>
+#include <linux/migrate.h>
#include <asm/tlb.h>
#include <asm/pgalloc.h>
#include "internal.h"
@@ -741,12 +742,48 @@ void do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
unsigned int flags, pmd_t entry)
{
unsigned long haddr = address & HPAGE_PMD_MASK;
+ struct page *page = NULL;
+ int node;
spin_lock(&mm->page_table_lock);
if (unlikely(!pmd_same(*pmd, entry)))
goto out_unlock;
- /* do fancy stuff */
+ if (unlikely(pmd_trans_splitting(entry))) {
+ spin_unlock(&mm->page_table_lock);
+ wait_split_huge_page(vma->anon_vma, pmd);
+ return;
+ }
+
+#ifdef CONFIG_NUMA
+ page = pmd_page(entry);
+ VM_BUG_ON(!PageCompound(page) || !PageHead(page));
+
+ get_page(page);
+ spin_unlock(&mm->page_table_lock);
+
+ /*
+ * XXX should we serialize against split_huge_page ?
+ */
+
+ node = mpol_misplaced(page, vma, haddr);
+ if (node == -1)
+ goto do_fixup;
+
+ /*
+ * Due to lacking code to migrate thp pages, we'll split
+ * (which preserves the special PROT_NONE) and re-take the
+ * fault on the normal pages.
+ */
+ split_huge_page(page);
+ put_page(page);
+ return;
+
+do_fixup:
+ spin_lock(&mm->page_table_lock);
+ if (unlikely(!pmd_same(*pmd, entry)))
+ goto out_unlock;
+#endif
/* change back to regular protection */
entry = pmd_modify(entry, vma->vm_page_prot);
@@ -755,6 +792,8 @@ void do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
out_unlock:
spin_unlock(&mm->page_table_lock);
+ if (page)
+ put_page(page);
}
int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
diff --git a/mm/memory.c b/mm/memory.c
index a660fd0..0d26a28 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -57,6 +57,7 @@
#include <linux/swapops.h>
#include <linux/elf.h>
#include <linux/gfp.h>
+#include <linux/migrate.h>
#include <asm/io.h>
#include <asm/pgalloc.h>
@@ -1467,8 +1468,10 @@ EXPORT_SYMBOL_GPL(zap_vma_ptes);
static bool pte_numa(struct vm_area_struct *vma, pte_t pte)
{
/*
- * If we have the normal vma->vm_page_prot protections we're not a
- * 'special' PROT_NONE page.
+ * For NUMA page faults, we use PROT_NONE ptes in VMAs with
+ * "normal" vma->vm_page_prot protections. Genuine PROT_NONE
+ * VMAs should never get here, because the fault handling code
+ * will notice that the VMA has no read or write permissions.
*
* This means we cannot get 'special' PROT_NONE faults from genuine
* PROT_NONE maps, nor from PROT_WRITE file maps that do dirty
@@ -3473,35 +3476,59 @@ static int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
unsigned long address, pte_t *ptep, pmd_t *pmd,
unsigned int flags, pte_t entry)
{
+ struct page *page = NULL;
+ int node, page_nid = -1;
spinlock_t *ptl;
- int ret = 0;
-
- if (!pte_unmap_same(mm, pmd, ptep, entry))
- goto out;
- /*
- * Do fancy stuff...
- */
-
- /*
- * OK, nothing to do,.. change the protection back to what it
- * ought to be.
- */
- ptep = pte_offset_map_lock(mm, pmd, address, &ptl);
+ ptl = pte_lockptr(mm, pmd);
+ spin_lock(ptl);
if (unlikely(!pte_same(*ptep, entry)))
- goto unlock;
+ goto out_unlock;
+ page = vm_normal_page(vma, address, entry);
+ if (page) {
+ get_page(page);
+ page_nid = page_to_nid(page);
+ node = mpol_misplaced(page, vma, address);
+ if (node != -1)
+ goto migrate;
+ }
+
+out_pte_upgrade_unlock:
flush_cache_page(vma, address, pte_pfn(entry));
ptep_modify_prot_start(mm, address, ptep);
entry = pte_modify(entry, vma->vm_page_prot);
ptep_modify_prot_commit(mm, address, ptep, entry);
+ /* No TLB flush needed because we upgraded the PTE */
+
update_mmu_cache(vma, address, ptep);
-unlock:
+
+out_unlock:
pte_unmap_unlock(ptep, ptl);
out:
- return ret;
+ if (page)
+ put_page(page);
+
+ return 0;
+
+migrate:
+ pte_unmap_unlock(ptep, ptl);
+
+ if (!migrate_misplaced_page(page, node)) {
+ page_nid = node;
+ goto out;
+ }
+
+ ptep = pte_offset_map_lock(mm, pmd, address, &ptl);
+ if (!pte_same(*ptep, entry)) {
+ put_page(page);
+ page = NULL;
+ goto out_unlock;
+ }
+
+ goto out_pte_upgrade_unlock;
}
/*
--
1.7.11.7
From: Peter Zijlstra <[email protected]>
Add THP migration for the NUMA working set scanning fault case.
It uses the page lock to serialize. No migration pte dance is
necessary because the pte is already unmapped when we decide
to migrate.
Signed-off-by: Peter Zijlstra <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Linus Torvalds <[email protected]>
Link: http://lkml.kernel.org/n/[email protected]
[ Significant fixes and changelog. ]
Signed-off-by: Ingo Molnar <[email protected]>
---
mm/huge_memory.c | 131 +++++++++++++++++++++++++++++++++++++++++++------------
mm/migrate.c | 2 +-
2 files changed, 103 insertions(+), 30 deletions(-)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index c4c0a57..931caf4 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -742,12 +742,13 @@ void do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
unsigned int flags, pmd_t entry)
{
unsigned long haddr = address & HPAGE_PMD_MASK;
+ struct page *new_page = NULL;
struct page *page = NULL;
- int node;
+ int node, lru;
spin_lock(&mm->page_table_lock);
if (unlikely(!pmd_same(*pmd, entry)))
- goto out_unlock;
+ goto unlock;
if (unlikely(pmd_trans_splitting(entry))) {
spin_unlock(&mm->page_table_lock);
@@ -755,45 +756,117 @@ void do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
return;
}
-#ifdef CONFIG_NUMA
page = pmd_page(entry);
- VM_BUG_ON(!PageCompound(page) || !PageHead(page));
+ if (page) {
+ VM_BUG_ON(!PageCompound(page) || !PageHead(page));
- get_page(page);
- spin_unlock(&mm->page_table_lock);
+ get_page(page);
+ node = mpol_misplaced(page, vma, haddr);
+ if (node != -1)
+ goto migrate;
+ }
- /*
- * XXX should we serialize against split_huge_page ?
- */
+fixup:
+ /* change back to regular protection */
+ entry = pmd_modify(entry, vma->vm_page_prot);
+ set_pmd_at(mm, haddr, pmd, entry);
+ update_mmu_cache_pmd(vma, address, entry);
- node = mpol_misplaced(page, vma, haddr);
- if (node == -1)
- goto do_fixup;
+unlock:
+ spin_unlock(&mm->page_table_lock);
+ if (page)
+ put_page(page);
- /*
- * Due to lacking code to migrate thp pages, we'll split
- * (which preserves the special PROT_NONE) and re-take the
- * fault on the normal pages.
- */
- split_huge_page(page);
- put_page(page);
return;
-do_fixup:
+migrate:
+ spin_unlock(&mm->page_table_lock);
+
+ lock_page(page);
spin_lock(&mm->page_table_lock);
- if (unlikely(!pmd_same(*pmd, entry)))
- goto out_unlock;
-#endif
+ if (unlikely(!pmd_same(*pmd, entry))) {
+ spin_unlock(&mm->page_table_lock);
+ unlock_page(page);
+ put_page(page);
+ return;
+ }
+ spin_unlock(&mm->page_table_lock);
- /* change back to regular protection */
- entry = pmd_modify(entry, vma->vm_page_prot);
- if (pmdp_set_access_flags(vma, haddr, pmd, entry, 1))
- update_mmu_cache_pmd(vma, address, entry);
+ new_page = alloc_pages_node(node,
+ (GFP_TRANSHUGE | GFP_THISNODE) & ~__GFP_WAIT,
+ HPAGE_PMD_ORDER);
-out_unlock:
+ if (!new_page)
+ goto alloc_fail;
+
+ lru = PageLRU(page);
+
+ if (lru && isolate_lru_page(page)) /* does an implicit get_page() */
+ goto alloc_fail;
+
+ if (!trylock_page(new_page))
+ BUG();
+
+ /* anon mapping, we can simply copy page->mapping to the new page: */
+ new_page->mapping = page->mapping;
+ new_page->index = page->index;
+
+ migrate_page_copy(new_page, page);
+
+ WARN_ON(PageLRU(new_page));
+
+ spin_lock(&mm->page_table_lock);
+ if (unlikely(!pmd_same(*pmd, entry))) {
+ spin_unlock(&mm->page_table_lock);
+ if (lru)
+ putback_lru_page(page);
+
+ unlock_page(new_page);
+ ClearPageActive(new_page); /* Set by migrate_page_copy() */
+ new_page->mapping = NULL;
+ put_page(new_page); /* Free it */
+
+ unlock_page(page);
+ put_page(page); /* Drop the local reference */
+
+ return;
+ }
+
+ entry = mk_pmd(new_page, vma->vm_page_prot);
+ entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
+ entry = pmd_mkhuge(entry);
+
+ page_add_new_anon_rmap(new_page, vma, haddr);
+
+ set_pmd_at(mm, haddr, pmd, entry);
+ update_mmu_cache_pmd(vma, address, entry);
+ page_remove_rmap(page);
spin_unlock(&mm->page_table_lock);
- if (page)
+
+ put_page(page); /* Drop the rmap reference */
+
+ if (lru)
+ put_page(page); /* drop the LRU isolation reference */
+
+ unlock_page(new_page);
+ unlock_page(page);
+ put_page(page); /* Drop the local reference */
+
+ return;
+
+alloc_fail:
+ if (new_page)
+ put_page(new_page);
+
+ unlock_page(page);
+
+ spin_lock(&mm->page_table_lock);
+ if (unlikely(!pmd_same(*pmd, entry))) {
put_page(page);
+ page = NULL;
+ goto unlock;
+ }
+ goto fixup;
}
int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
diff --git a/mm/migrate.c b/mm/migrate.c
index 3299949..72d1056 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -417,7 +417,7 @@ int migrate_huge_page_move_mapping(struct address_space *mapping,
*/
void migrate_page_copy(struct page *newpage, struct page *page)
{
- if (PageHuge(page))
+ if (PageHuge(page) || PageTransHuge(page))
copy_huge_page(newpage, page);
else
copy_highpage(newpage, page);
--
1.7.11.7
From: Peter Zijlstra <[email protected]>
Avoid a few #ifdef's later on.
Signed-off-by: Peter Zijlstra <[email protected]>
Cc: Paul Turner <[email protected]>
Cc: Lee Schermerhorn <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Linus Torvalds <[email protected]>
Link: http://lkml.kernel.org/n/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
kernel/sched/sched.h | 6 ++++++
1 file changed, 6 insertions(+)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 5eca173..6ac4056 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -663,6 +663,12 @@ extern struct static_key sched_feat_keys[__SCHED_FEAT_NR];
#define sched_feat(x) (sysctl_sched_features & (1UL << __SCHED_FEAT_##x))
#endif /* SCHED_DEBUG && HAVE_JUMP_LABEL */
+#ifdef CONFIG_NUMA_BALANCING
+#define sched_feat_numa(x) sched_feat(x)
+#else
+#define sched_feat_numa(x) (0)
+#endif
+
static inline u64 global_rt_period(void)
{
return (u64)sysctl_sched_rt_period * NSEC_PER_USEC;
--
1.7.11.7
From: Lee Schermerhorn <[email protected]>
This patch adds another mbind() flag to request "lazy migration". The
flag, MPOL_MF_LAZY, modifies MPOL_MF_MOVE* such that the selected
pages are marked PROT_NONE. The pages will be migrated in the fault
path on "first touch", if the policy dictates at that time.
"Lazy Migration" will allow testing of migrate-on-fault via mbind().
Also allows applications to specify that only subsequently touched
pages be migrated to obey new policy, instead of all pages in range.
This can be useful for multi-threaded applications working on a
large shared data area that is initialized by a single thread,
leaving all pages on one node [or a few, if that node overflowed].
After the PROT_NONE marking, pages in the regions assigned to the
worker threads will automatically be migrated node-local to each
thread on first touch.
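A minimal userspace sketch of the intended usage (the node number and
region size are illustrative; MPOL_MF_LAZY is defined locally because it
is new in this series and not yet in installed headers; link with -lnuma
for the mbind() prototype):

#include <numaif.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#ifndef MPOL_MF_LAZY
#define MPOL_MF_LAZY	(1 << 3)	/* new in this series, see the uapi hunk below */
#endif

int main(void)
{
	size_t len = 64UL << 20;		/* 64MB shared data area */
	unsigned long nodemask = 1UL << 1;	/* target node 1 (illustrative) */
	char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (buf == MAP_FAILED)
		return 1;

	memset(buf, 0, len);	/* the init thread faults everything in locally */

	/*
	 * Lazy migration: this only marks the range PROT_NONE; each page is
	 * migrated to node 1 when a worker thread first touches it.
	 */
	if (mbind(buf, len, MPOL_BIND, &nodemask, sizeof(nodemask) * 8,
		  MPOL_MF_MOVE | MPOL_MF_LAZY) != 0)
		perror("mbind");

	return 0;
}

Dropping MPOL_MF_LAZY from the flags gives the old behaviour: all pages
in the range are migrated synchronously by the mbind() call itself.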
Signed-off-by: Lee Schermerhorn <[email protected]>
Reviewed-by: Rik van Riel <[email protected]>
Cc: Lee Schermerhorn <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Linus Torvalds <[email protected]>
[ nearly complete rewrite.. ]
Signed-off-by: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/n/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
include/uapi/linux/mempolicy.h | 13 ++++++++---
mm/mempolicy.c | 49 +++++++++++++++++++++++++++---------------
2 files changed, 42 insertions(+), 20 deletions(-)
diff --git a/include/uapi/linux/mempolicy.h b/include/uapi/linux/mempolicy.h
index 472de8a..6a1baae 100644
--- a/include/uapi/linux/mempolicy.h
+++ b/include/uapi/linux/mempolicy.h
@@ -49,9 +49,16 @@ enum mpol_rebind_step {
/* Flags for mbind */
#define MPOL_MF_STRICT (1<<0) /* Verify existing pages in the mapping */
-#define MPOL_MF_MOVE (1<<1) /* Move pages owned by this process to conform to mapping */
-#define MPOL_MF_MOVE_ALL (1<<2) /* Move every page to conform to mapping */
-#define MPOL_MF_INTERNAL (1<<3) /* Internal flags start here */
+#define MPOL_MF_MOVE (1<<1) /* Move pages owned by this process to conform
+ to policy */
+#define MPOL_MF_MOVE_ALL (1<<2) /* Move every page to conform to policy */
+#define MPOL_MF_LAZY (1<<3) /* Modifies '_MOVE: lazy migrate on fault */
+#define MPOL_MF_INTERNAL (1<<4) /* Internal flags start here */
+
+#define MPOL_MF_VALID (MPOL_MF_STRICT | \
+ MPOL_MF_MOVE | \
+ MPOL_MF_MOVE_ALL | \
+ MPOL_MF_LAZY)
/*
* Internal flags that share the struct mempolicy flags word with
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 1b2890c..5ee326c 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -583,22 +583,32 @@ check_range(struct mm_struct *mm, unsigned long start, unsigned long end,
return ERR_PTR(-EFAULT);
prev = NULL;
for (vma = first; vma && vma->vm_start < end; vma = vma->vm_next) {
+ unsigned long endvma = vma->vm_end;
+
+ if (endvma > end)
+ endvma = end;
+ if (vma->vm_start > start)
+ start = vma->vm_start;
+
if (!(flags & MPOL_MF_DISCONTIG_OK)) {
if (!vma->vm_next && vma->vm_end < end)
return ERR_PTR(-EFAULT);
if (prev && prev->vm_end < vma->vm_start)
return ERR_PTR(-EFAULT);
}
- if (!is_vm_hugetlb_page(vma) &&
- ((flags & MPOL_MF_STRICT) ||
+
+ if (is_vm_hugetlb_page(vma))
+ goto next;
+
+ if (flags & MPOL_MF_LAZY) {
+ change_prot_none(vma, start, endvma);
+ goto next;
+ }
+
+ if ((flags & MPOL_MF_STRICT) ||
((flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL)) &&
- vma_migratable(vma)))) {
- unsigned long endvma = vma->vm_end;
+ vma_migratable(vma))) {
- if (endvma > end)
- endvma = end;
- if (vma->vm_start > start)
- start = vma->vm_start;
err = check_pgd_range(vma, start, endvma, nodes,
flags, private);
if (err) {
@@ -606,6 +616,7 @@ check_range(struct mm_struct *mm, unsigned long start, unsigned long end,
break;
}
}
+next:
prev = vma;
}
return first;
@@ -1137,8 +1148,7 @@ static long do_mbind(unsigned long start, unsigned long len,
int err;
LIST_HEAD(pagelist);
- if (flags & ~(unsigned long)(MPOL_MF_STRICT |
- MPOL_MF_MOVE | MPOL_MF_MOVE_ALL))
+ if (flags & ~(unsigned long)MPOL_MF_VALID)
return -EINVAL;
if ((flags & MPOL_MF_MOVE_ALL) && !capable(CAP_SYS_NICE))
return -EPERM;
@@ -1161,6 +1171,9 @@ static long do_mbind(unsigned long start, unsigned long len,
if (IS_ERR(new))
return PTR_ERR(new);
+ if (flags & MPOL_MF_LAZY)
+ new->flags |= MPOL_F_MOF;
+
/*
* If we are using the default policy then operation
* on discontinuous address spaces is okay after all
@@ -1197,21 +1210,23 @@ static long do_mbind(unsigned long start, unsigned long len,
vma = check_range(mm, start, end, nmask,
flags | MPOL_MF_INVERT, &pagelist);
- err = PTR_ERR(vma);
- if (!IS_ERR(vma)) {
- int nr_failed = 0;
-
+ err = PTR_ERR(vma); /* maybe ... */
+ if (!IS_ERR(vma) && mode != MPOL_NOOP)
err = mbind_range(mm, start, end, new);
+ if (!err) {
+ int nr_failed = 0;
+
if (!list_empty(&pagelist)) {
+ WARN_ON_ONCE(flags & MPOL_MF_LAZY);
nr_failed = migrate_pages(&pagelist, new_vma_page,
- (unsigned long)vma,
- false, MIGRATE_SYNC);
+ (unsigned long)vma,
+ false, MIGRATE_SYNC);
if (nr_failed)
putback_lru_pages(&pagelist);
}
- if (!err && nr_failed && (flags & MPOL_MF_STRICT))
+ if (nr_failed && (flags & MPOL_MF_STRICT))
err = -EIO;
} else
putback_lru_pages(&pagelist);
--
1.7.11.7
From: Lee Schermerhorn <[email protected]>
This patch provides a new function to test whether a page resides
on a node that is appropriate for the mempolicy for the vma and
address where the page is supposed to be mapped. This involves
looking up the node where the page belongs. So, the function
returns that node so that it may be used to allocate the page
without consulting the policy again.
A subsequent patch will call this function from the fault path.
Because of this, I don't want to go ahead and allocate the page, e.g.,
via alloc_page_vma(), only to have to free it again if the existing
page turns out to already obey the policy. So, I just mimic the
alloc_page_vma() node computation
logic--sort of.
Note: we could use this function to implement a MPOL_MF_STRICT
behavior when migrating pages to match mbind() mempolicy--e.g.,
to ensure that pages in an interleaved range are reinterleaved
rather than left where they are when they reside on any page in
the interleave nodemask.
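A toy stand-alone model of the return contract, specialized here to a
preferred-node policy: -1 means the page is fine where it is, anything
else is the node the fault path should migrate the page to:

#include <stdio.h>

static int mpol_misplaced_model(int page_nid, int preferred_nid)
{
	if (page_nid == preferred_nid)
		return -1;		/* not misplaced */

	return preferred_nid;		/* misplaced: migrate here */
}

int main(void)
{
	printf("page on node 0, policy prefers 0 -> %d\n",
	       mpol_misplaced_model(0, 0));
	printf("page on node 0, policy prefers 1 -> %d\n",
	       mpol_misplaced_model(0, 1));
	return 0;
}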
Signed-off-by: Lee Schermerhorn <[email protected]>
Reviewed-by: Rik van Riel <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Linus Torvalds <[email protected]>
[ Added MPOL_F_LAZY to trigger migrate-on-fault;
simplified code now that we don't have to bother
with special crap for interleaved ]
Signed-off-by: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/n/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
include/linux/mempolicy.h | 8 +++++
include/uapi/linux/mempolicy.h | 1 +
mm/mempolicy.c | 76 ++++++++++++++++++++++++++++++++++++++++++
3 files changed, 85 insertions(+)
diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
index e5ccb9d..c511e25 100644
--- a/include/linux/mempolicy.h
+++ b/include/linux/mempolicy.h
@@ -198,6 +198,8 @@ static inline int vma_migratable(struct vm_area_struct *vma)
return 1;
}
+extern int mpol_misplaced(struct page *, struct vm_area_struct *, unsigned long);
+
#else
struct mempolicy {};
@@ -323,5 +325,11 @@ static inline int mpol_to_str(char *buffer, int maxlen, struct mempolicy *pol,
return 0;
}
+static inline int mpol_misplaced(struct page *page, struct vm_area_struct *vma,
+ unsigned long address)
+{
+ return -1; /* no node preference */
+}
+
#endif /* CONFIG_NUMA */
#endif
diff --git a/include/uapi/linux/mempolicy.h b/include/uapi/linux/mempolicy.h
index d23dca8..472de8a 100644
--- a/include/uapi/linux/mempolicy.h
+++ b/include/uapi/linux/mempolicy.h
@@ -61,6 +61,7 @@ enum mpol_rebind_step {
#define MPOL_F_SHARED (1 << 0) /* identify shared policies */
#define MPOL_F_LOCAL (1 << 1) /* preferred local allocation */
#define MPOL_F_REBINDING (1 << 2) /* identify policies in rebinding */
+#define MPOL_F_MOF (1 << 3) /* this policy wants migrate on fault */
#endif /* _UAPI_LINUX_MEMPOLICY_H */
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index c7c7c86..1b2890c 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -2179,6 +2179,82 @@ static void sp_free(struct sp_node *n)
kmem_cache_free(sn_cache, n);
}
+/**
+ * mpol_misplaced - check whether current page node is valid in policy
+ *
+ * @page - page to be checked
+ * @vma - vm area where page mapped
+ * @addr - virtual address where page mapped
+ *
+ * Lookup current policy node id for vma,addr and "compare to" page's
+ * node id.
+ *
+ * Returns:
+ * -1 - not misplaced, page is in the right node
+ * node - node id where the page should be
+ *
+ * Policy determination "mimics" alloc_page_vma().
+ * Called from fault path where we know the vma and faulting address.
+ */
+int mpol_misplaced(struct page *page, struct vm_area_struct *vma, unsigned long addr)
+{
+ struct mempolicy *pol;
+ struct zone *zone;
+ int curnid = page_to_nid(page);
+ unsigned long pgoff;
+ int polnid = -1;
+ int ret = -1;
+
+ BUG_ON(!vma);
+
+ pol = get_vma_policy(current, vma, addr);
+ if (!(pol->flags & MPOL_F_MOF))
+ goto out;
+
+ switch (pol->mode) {
+ case MPOL_INTERLEAVE:
+ BUG_ON(addr >= vma->vm_end);
+ BUG_ON(addr < vma->vm_start);
+
+ pgoff = vma->vm_pgoff;
+ pgoff += (addr - vma->vm_start) >> PAGE_SHIFT;
+ polnid = offset_il_node(pol, vma, pgoff);
+ break;
+
+ case MPOL_PREFERRED:
+ if (pol->flags & MPOL_F_LOCAL)
+ polnid = numa_node_id();
+ else
+ polnid = pol->v.preferred_node;
+ break;
+
+ case MPOL_BIND:
+ /*
+ * allows binding to multiple nodes.
+ * use current page if in policy nodemask,
+ * else select nearest allowed node, if any.
+ * If no allowed nodes, use current [!misplaced].
+ */
+ if (node_isset(curnid, pol->v.nodes))
+ goto out;
+ (void)first_zones_zonelist(
+ node_zonelist(numa_node_id(), GFP_HIGHUSER),
+ gfp_zone(GFP_HIGHUSER),
+ &pol->v.nodes, &zone);
+ polnid = zone->node;
+ break;
+
+ default:
+ BUG();
+ }
+ if (curnid != polnid)
+ ret = polnid;
+out:
+ mpol_cond_put(pol);
+
+ return ret;
+}
+
static void sp_delete(struct shared_policy *sp, struct sp_node *n)
{
pr_debug("deleting %lx-l%lx\n", n->start, n->end);
--
1.7.11.7
From: Lee Schermerhorn <[email protected]>
This patch augments the MPOL_MF_LAZY feature by adding a "NOOP" policy
to mbind(). When the NOOP policy is used with the MPOL_MF_MOVE* and MPOL_MF_LAZY
flags, mbind() will map the pages PROT_NONE so that they will be
migrated on the next touch.
This allows an application to prepare for a new phase of operation
where different regions of shared storage will be assigned to
worker threads, w/o changing policy. Note that we could just use
"default" policy in this case. However, this also allows an
application to request that pages be migrated, only if necessary,
to follow any arbitrary policy that might currently apply to a
range of pages, without knowing the policy, or without specifying
multiple mbind()s for ranges with different policies.
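A minimal userspace sketch of that usage; the MPOL_NOOP and MPOL_MF_*
values are assumptions derived from the uapi hunks in this series, not
taken from an installed header:

#include <stdio.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Assumed values, see the uapi hunks in this series: */
#define MPOL_NOOP	5		/* retain existing policy for range */
#define MPOL_MF_MOVE	(1 << 1)
#define MPOL_MF_LAZY	(1 << 3)

int main(void)
{
	size_t len = 64UL << 20;
	char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (buf == MAP_FAILED)
		return 1;

	/*
	 * Keep whatever policy already applies to the range; just mark it
	 * for lazy migration so each page follows that policy on its next
	 * touch.
	 */
	if (syscall(SYS_mbind, buf, len, MPOL_NOOP, NULL, 0UL,
		    MPOL_MF_MOVE | MPOL_MF_LAZY) != 0)
		perror("mbind");

	return 0;
}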
[ Bug in early version of mpol_parse_str() reported by Fengguang Wu. ]
Reported-by: Fengguang Wu <[email protected]>
Signed-off-by: Lee Schermerhorn <[email protected]>
Reviewed-by: Rik van Riel <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Linus Torvalds <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
---
include/uapi/linux/mempolicy.h | 1 +
mm/mempolicy.c | 11 ++++++-----
2 files changed, 7 insertions(+), 5 deletions(-)
diff --git a/include/uapi/linux/mempolicy.h b/include/uapi/linux/mempolicy.h
index 3e835c9..d23dca8 100644
--- a/include/uapi/linux/mempolicy.h
+++ b/include/uapi/linux/mempolicy.h
@@ -21,6 +21,7 @@ enum {
MPOL_BIND,
MPOL_INTERLEAVE,
MPOL_LOCAL,
+ MPOL_NOOP, /* retain existing policy for range */
MPOL_MAX, /* always last member of enum */
};
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 72f50ba..c7c7c86 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -251,10 +251,10 @@ static struct mempolicy *mpol_new(unsigned short mode, unsigned short flags,
pr_debug("setting mode %d flags %d nodes[0] %lx\n",
mode, flags, nodes ? nodes_addr(*nodes)[0] : -1);
- if (mode == MPOL_DEFAULT) {
+ if (mode == MPOL_DEFAULT || mode == MPOL_NOOP) {
if (nodes && !nodes_empty(*nodes))
return ERR_PTR(-EINVAL);
- return NULL; /* simply delete any existing policy */
+ return NULL;
}
VM_BUG_ON(!nodes);
@@ -1146,7 +1146,7 @@ static long do_mbind(unsigned long start, unsigned long len,
if (start & ~PAGE_MASK)
return -EINVAL;
- if (mode == MPOL_DEFAULT)
+ if (mode == MPOL_DEFAULT || mode == MPOL_NOOP)
flags &= ~MPOL_MF_STRICT;
len = (len + PAGE_SIZE - 1) & PAGE_MASK;
@@ -2407,7 +2407,8 @@ static const char * const policy_modes[] =
[MPOL_PREFERRED] = "prefer",
[MPOL_BIND] = "bind",
[MPOL_INTERLEAVE] = "interleave",
- [MPOL_LOCAL] = "local"
+ [MPOL_LOCAL] = "local",
+ [MPOL_NOOP] = "noop", /* should not actually be used */
};
@@ -2458,7 +2459,7 @@ int mpol_parse_str(char *str, struct mempolicy **mpol, int no_context)
break;
}
}
- if (mode >= MPOL_MAX)
+ if (mode >= MPOL_MAX || mode == MPOL_NOOP)
goto out;
switch (mode) {
--
1.7.11.7
pgprot_modify() is available on x86, but on other architectures it only
gets defined in mm/mprotect.c - breaking the build if anything outside
of mprotect.c tries to make use of this function.
Move it to the generic pgprot area in mm.h, so that an upcoming patch
can make use of it.
Acked-by: Peter Zijlstra <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Paul Turner <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Andrew Morton <[email protected]>
Link: http://lkml.kernel.org/n/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
include/linux/mm.h | 13 +++++++++++++
mm/mprotect.c | 7 -------
2 files changed, 13 insertions(+), 7 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index fa06804..2a32cf8 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -164,6 +164,19 @@ extern pgprot_t protection_map[16];
#define FAULT_FLAG_TRIED 0x40 /* second try */
/*
+ * Some architectures (such as x86) may need to preserve certain pgprot
+ * bits, without complicating generic pgprot code.
+ *
+ * Most architectures don't care:
+ */
+#ifndef pgprot_modify
+static inline pgprot_t pgprot_modify(pgprot_t oldprot, pgprot_t newprot)
+{
+ return newprot;
+}
+#endif
+
+/*
* vm_fault is filled by the the pagefault handler and passed to the vma's
* ->fault function. The vma's ->fault is responsible for returning a bitmask
* of VM_FAULT_xxx flags that give details about how the fault was handled.
diff --git a/mm/mprotect.c b/mm/mprotect.c
index a409926..e97b0d6 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -28,13 +28,6 @@
#include <asm/cacheflush.h>
#include <asm/tlbflush.h>
-#ifndef pgprot_modify
-static inline pgprot_t pgprot_modify(pgprot_t oldprot, pgprot_t newprot)
-{
- return newprot;
-}
-#endif
-
static void change_pte_range(struct mm_struct *mm, pmd_t *pmd,
unsigned long addr, unsigned long end, pgprot_t newprot,
int dirty_accountable)
--
1.7.11.7
From: Peter Zijlstra <[email protected]>
We're going to play games with page protections; ensure we don't lose
them across a THP split.
Collapse seems to always allocate a new (huge) page, which should
already end up on the new target node, so losing protections there
isn't a problem.
Signed-off-by: Peter Zijlstra <[email protected]>
Reviewed-by: Rik van Riel <[email protected]>
Cc: Paul Turner <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Link: http://lkml.kernel.org/n/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
arch/x86/include/asm/pgtable.h | 1 +
mm/huge_memory.c | 103 ++++++++++++++++++++---------------------
2 files changed, 50 insertions(+), 54 deletions(-)
diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index a1f780d..f85dccd 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -349,6 +349,7 @@ static inline pgprot_t pgprot_modify(pgprot_t oldprot, pgprot_t newprot)
}
#define pte_pgprot(x) __pgprot(pte_flags(x) & PTE_FLAGS_MASK)
+#define pmd_pgprot(x) __pgprot(pmd_val(x) & ~_HPAGE_CHG_MASK)
#define canon_pgprot(p) __pgprot(massage_pgprot(p))
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 40f17c3..176fe3d 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1343,63 +1343,60 @@ static int __split_huge_page_map(struct page *page,
int ret = 0, i;
pgtable_t pgtable;
unsigned long haddr;
+ pgprot_t prot;
spin_lock(&mm->page_table_lock);
pmd = page_check_address_pmd(page, mm, address,
PAGE_CHECK_ADDRESS_PMD_SPLITTING_FLAG);
- if (pmd) {
- pgtable = pgtable_trans_huge_withdraw(mm);
- pmd_populate(mm, &_pmd, pgtable);
-
- haddr = address;
- for (i = 0; i < HPAGE_PMD_NR; i++, haddr += PAGE_SIZE) {
- pte_t *pte, entry;
- BUG_ON(PageCompound(page+i));
- entry = mk_pte(page + i, vma->vm_page_prot);
- entry = maybe_mkwrite(pte_mkdirty(entry), vma);
- if (!pmd_write(*pmd))
- entry = pte_wrprotect(entry);
- else
- BUG_ON(page_mapcount(page) != 1);
- if (!pmd_young(*pmd))
- entry = pte_mkold(entry);
- pte = pte_offset_map(&_pmd, haddr);
- BUG_ON(!pte_none(*pte));
- set_pte_at(mm, haddr, pte, entry);
- pte_unmap(pte);
- }
+ if (!pmd)
+ goto unlock;
- smp_wmb(); /* make pte visible before pmd */
- /*
- * Up to this point the pmd is present and huge and
- * userland has the whole access to the hugepage
- * during the split (which happens in place). If we
- * overwrite the pmd with the not-huge version
- * pointing to the pte here (which of course we could
- * if all CPUs were bug free), userland could trigger
- * a small page size TLB miss on the small sized TLB
- * while the hugepage TLB entry is still established
- * in the huge TLB. Some CPU doesn't like that. See
- * http://support.amd.com/us/Processor_TechDocs/41322.pdf,
- * Erratum 383 on page 93. Intel should be safe but is
- * also warns that it's only safe if the permission
- * and cache attributes of the two entries loaded in
- * the two TLB is identical (which should be the case
- * here). But it is generally safer to never allow
- * small and huge TLB entries for the same virtual
- * address to be loaded simultaneously. So instead of
- * doing "pmd_populate(); flush_tlb_range();" we first
- * mark the current pmd notpresent (atomically because
- * here the pmd_trans_huge and pmd_trans_splitting
- * must remain set at all times on the pmd until the
- * split is complete for this pmd), then we flush the
- * SMP TLB and finally we write the non-huge version
- * of the pmd entry with pmd_populate.
- */
- pmdp_invalidate(vma, address, pmd);
- pmd_populate(mm, pmd, pgtable);
- ret = 1;
+ prot = pmd_pgprot(*pmd);
+ pgtable = pgtable_trans_huge_withdraw(mm);
+ pmd_populate(mm, &_pmd, pgtable);
+
+ for (i = 0, haddr = address; i < HPAGE_PMD_NR; i++, haddr += PAGE_SIZE) {
+ pte_t *pte, entry;
+
+ BUG_ON(PageCompound(page+i));
+ entry = mk_pte(page + i, prot);
+ entry = pte_mkdirty(entry);
+ if (!pmd_young(*pmd))
+ entry = pte_mkold(entry);
+ pte = pte_offset_map(&_pmd, haddr);
+ BUG_ON(!pte_none(*pte));
+ set_pte_at(mm, haddr, pte, entry);
+ pte_unmap(pte);
}
+
+ smp_wmb(); /* make ptes visible before pmd, see __pte_alloc */
+ /*
+ * Up to this point the pmd is present and huge.
+ *
+ * If we overwrite the pmd with the not-huge version, we could trigger
+ * a small page size TLB miss on the small sized TLB while the hugepage
+ * TLB entry is still established in the huge TLB.
+ *
+ * Some CPUs don't like that. See
+ * http://support.amd.com/us/Processor_TechDocs/41322.pdf, Erratum 383
+ * on page 93.
+ *
+ * Thus it is generally safer to never allow small and huge TLB entries
+ * for overlapping virtual addresses to be loaded. So we first mark the
+ * current pmd not present, then we flush the TLB and finally we write
+ * the non-huge version of the pmd entry with pmd_populate.
+ *
+ * The above needs to be done under the ptl because pmd_trans_huge and
+ * pmd_trans_splitting must remain set on the pmd until the split is
+ * complete. The ptl also protects against concurrent faults due to
+ * making the pmd not-present.
+ */
+ set_pmd_at(mm, address, pmd, pmd_mknotpresent(*pmd));
+ flush_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
+ pmd_populate(mm, pmd, pgtable);
+ ret = 1;
+
+unlock:
spin_unlock(&mm->page_table_lock);
return ret;
@@ -2287,10 +2284,8 @@ static void khugepaged_do_scan(void)
{
struct page *hpage = NULL;
unsigned int progress = 0, pass_through_head = 0;
- unsigned int pages = khugepaged_pages_to_scan;
bool wait = true;
-
- barrier(); /* write khugepaged_pages_to_scan to local stack */
+ unsigned int pages = ACCESS_ONCE(khugepaged_pages_to_scan);
while (progress < pages) {
if (!khugepaged_prealloc_page(&hpage, &wait))
--
1.7.11.7
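The last hunk replaces the barrier()-plus-local-copy idiom with a single
ACCESS_ONCE() read. A minimal, hedged sketch of that idiom as it might sit
in mm/huge_memory.c (the helper name is made up for illustration;
khugepaged_pages_to_scan is the sysctl-controlled tunable touched above):

static unsigned int khugepaged_scan_quota(void)
{
	/*
	 * One volatile read: the loop bound stays fixed for this scan pass
	 * even if the sysctl is rewritten concurrently, and the compiler
	 * cannot re-read the shared variable on every loop iteration.
	 */
	return ACCESS_ONCE(khugepaged_pages_to_scan);
}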
From: Ralf Baechle <[email protected]>
Add the pmd_pgprot() method that will be needed
by the new NUMA code.
Reported-by: Stephen Rothwell <[email protected]>
Signed-off-by: Ralf Baechle <[email protected]>
Cc: Martin Schwidefsky <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
---
arch/mips/include/asm/pgtable.h | 2 ++
1 file changed, 2 insertions(+)
diff --git a/arch/mips/include/asm/pgtable.h b/arch/mips/include/asm/pgtable.h
index c02158b..bbe4cda 100644
--- a/arch/mips/include/asm/pgtable.h
+++ b/arch/mips/include/asm/pgtable.h
@@ -89,6 +89,8 @@ static inline int is_zero_pfn(unsigned long pfn)
extern void paging_init(void);
+#define pmd_pgprot(x) __pgprot(pmd_val(x) & ~_PAGE_CHG_MASK)
+
/*
* Conversion functions: convert a page and protection to a page entry,
* and a page entry and page directory to the page they refer to.
--
1.7.11.7
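The helper's intended consumer is generic THP code. A hedged sketch of the
usage it enables, mirroring the huge-page-split change earlier in the series
(the function and its parameters are illustrative, not actual kernel code):
carry the protection bits of the huge pmd over to the small ptes created
during a split.

static void split_one_pte_sketch(struct mm_struct *mm, pmd_t *huge_pmd,
				 pmd_t *_pmd, struct page *page,
				 unsigned long haddr)
{
	pgprot_t prot = pmd_pgprot(*huge_pmd);	/* protection bits of the huge entry */
	pte_t *pte = pte_offset_map(_pmd, haddr);
	pte_t entry = mk_pte(page, prot);	/* small pte with the same protections */

	set_pte_at(mm, haddr, pte, entry);
	pte_unmap(pte);
}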
From: Rik van Riel <[email protected]>
We need pte_present to return true for _PAGE_PROTNONE pages, to indicate that
the pte is associated with a page.
However, for TLB flushing purposes, we would like to know whether the pte
points to an actually accessible page. This allows us to skip remote TLB
flushes for pages that are not actually accessible.
Fill in this method for x86 and provide a safe (but slower) method
on other architectures.
Signed-off-by: Rik van Riel <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Fixed-by: Linus Torvalds <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/n/[email protected]
[ Added Linus's review fixes. ]
Signed-off-by: Ingo Molnar <[email protected]>
---
arch/x86/include/asm/pgtable.h | 6 ++++++
include/asm-generic/pgtable.h | 4 ++++
2 files changed, 10 insertions(+)
diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index f85dccd..a984cf9 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -408,6 +408,12 @@ static inline int pte_present(pte_t a)
return pte_flags(a) & (_PAGE_PRESENT | _PAGE_PROTNONE);
}
+#define pte_accessible pte_accessible
+static inline int pte_accessible(pte_t a)
+{
+ return pte_flags(a) & _PAGE_PRESENT;
+}
+
static inline int pte_hidden(pte_t pte)
{
return pte_flags(pte) & _PAGE_HIDDEN;
diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
index b36ce40..48fc1dc 100644
--- a/include/asm-generic/pgtable.h
+++ b/include/asm-generic/pgtable.h
@@ -219,6 +219,10 @@ static inline int pmd_same(pmd_t pmd_a, pmd_t pmd_b)
#define move_pte(pte, prot, old_addr, new_addr) (pte)
#endif
+#ifndef pte_accessible
+# define pte_accessible(pte) ((void)(pte),1)
+#endif
+
#ifndef flush_tlb_fix_spurious_fault
#define flush_tlb_fix_spurious_fault(vma, address) flush_tlb_page(vma, address)
#endif
--
1.7.11.7
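A hedged sketch of the kind of caller pte_accessible() enables, in the
spirit of the later "mm: Only flush the TLB when clearing an accessible pte"
patch in this series (simplified; the function name is illustrative, not the
kernel's actual ptep_clear_flush()): a PROT_NONE pte is "present" as far as
pte_present() is concerned, but it can never have been loaded into any TLB,
so clearing it needs no shootdown.

static pte_t clear_pte_sketch(struct vm_area_struct *vma,
			      unsigned long address, pte_t *ptep)
{
	pte_t pte = ptep_get_and_clear(vma->vm_mm, address, ptep);

	/* Only flush translations that some TLB could actually be caching. */
	if (pte_accessible(pte))
		flush_tlb_page(vma, address);

	return pte;
}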
From: Gerald Schaefer <[email protected]>
This patch adds an implementation of pmd_pgprot() for s390,
in preparation for future THP changes.
Reported-by: Stephen Rothwell <[email protected]>
Signed-off-by: Gerald Schaefer <[email protected]>
Cc: Martin Schwidefsky <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Ralf Baechle <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
---
arch/s390/include/asm/pgtable.h | 13 +++++++++++++
1 file changed, 13 insertions(+)
diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h
index dd647c9..098fc5a 100644
--- a/arch/s390/include/asm/pgtable.h
+++ b/arch/s390/include/asm/pgtable.h
@@ -1240,6 +1240,19 @@ static inline void set_pmd_at(struct mm_struct *mm, unsigned long addr,
*pmdp = entry;
}
+static inline pgprot_t pmd_pgprot(pmd_t pmd)
+{
+ pgprot_t prot = PAGE_RW;
+
+ if (pmd_val(pmd) & _SEGMENT_ENTRY_RO) {
+ if (pmd_val(pmd) & _SEGMENT_ENTRY_INV)
+ prot = PAGE_NONE;
+ else
+ prot = PAGE_RO;
+ }
+ return prot;
+}
+
static inline unsigned long massage_pgprot_pmd(pgprot_t pgprot)
{
unsigned long pgprot_pmd = 0;
--
1.7.11.7
From: Peter Zijlstra <[email protected]>
It's a bit awkward, but it was the least painful means of modifying the
queue selection. It is used in a later patch to conditionally use a random
queue.
Signed-off-by: Peter Zijlstra <[email protected]>
Cc: Paul Turner <[email protected]>
Cc: Lee Schermerhorn <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Linus Torvalds <[email protected]>
Link: http://lkml.kernel.org/n/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
kernel/sched/fair.c | 20 ++++++++++++--------
1 file changed, 12 insertions(+), 8 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6b800a1..6ab627e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3063,6 +3063,9 @@ struct lb_env {
unsigned int loop;
unsigned int loop_break;
unsigned int loop_max;
+
+ struct rq * (*find_busiest_queue)(struct lb_env *,
+ struct sched_group *);
};
/*
@@ -4236,13 +4239,14 @@ static int load_balance(int this_cpu, struct rq *this_rq,
struct cpumask *cpus = __get_cpu_var(load_balance_tmpmask);
struct lb_env env = {
- .sd = sd,
- .dst_cpu = this_cpu,
- .dst_rq = this_rq,
- .dst_grpmask = sched_group_cpus(sd->groups),
- .idle = idle,
- .loop_break = sched_nr_migrate_break,
- .cpus = cpus,
+ .sd = sd,
+ .dst_cpu = this_cpu,
+ .dst_rq = this_rq,
+ .dst_grpmask = sched_group_cpus(sd->groups),
+ .idle = idle,
+ .loop_break = sched_nr_migrate_break,
+ .cpus = cpus,
+ .find_busiest_queue = find_busiest_queue,
};
cpumask_copy(cpus, cpu_active_mask);
@@ -4261,7 +4265,7 @@ redo:
goto out_balanced;
}
- busiest = find_busiest_queue(&env, group);
+ busiest = env.find_busiest_queue(&env, group);
if (!busiest) {
schedstat_inc(sd, lb_nobusyq[idle]);
goto out_balanced;
--
1.7.11.7
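The change boils down to a small strategy-callback pattern: the balancing
environment carries the queue-selection function, so a later patch can swap
in a different policy without touching load_balance() itself. A
self-contained, hedged sketch of the pattern (all names below are
illustrative, not scheduler code):

#include <stddef.h>

struct queue;
struct group;

struct balance_env {
	/* queue-selection policy, chosen by whoever sets up the env */
	struct queue *(*find_busiest_queue)(struct balance_env *env,
					    struct group *grp);
	/* ... other balancing state ... */
};

static struct queue *default_find_busiest(struct balance_env *env,
					  struct group *grp)
{
	return NULL;			/* placeholder policy */
}

static void balance_sketch(struct group *grp)
{
	struct balance_env env = {
		.find_busiest_queue	= default_find_busiest,
	};

	/* the call site never hard-codes the policy */
	(void)env.find_busiest_queue(&env, grp);
}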
From: Rik van Riel <[email protected]>
The function ptep_set_access_flags() is only ever used to upgrade the
access permissions of a page - i.e. to make them less restrictive.
That means the only negative side effect of not flushing remote
TLBs in this function is that other CPUs may incur spurious page
faults, if they happen to access the same address, and still have
a PTE with the old permissions cached in their TLB caches.
Having another CPU maybe incur a spurious page fault is faster
than always incurring the cost of a remote TLB flush, so replace
the remote TLB flush with a purely local one.
This should be safe on every architecture that correctly
implements flush_tlb_fix_spurious_fault() to actually invalidate
the local TLB entry that caused a page fault, as well as on
architectures where the hardware invalidates TLB entries that
cause page faults.
In the unlikely event that you are hitting what appears to be
an infinite loop of page faults, and 'git bisect' took you to
this changeset, your architecture needs to implement
flush_tlb_fix_spurious_fault() to actually flush the TLB entry.
Signed-off-by: Rik van Riel <[email protected]>
Acked-by: Linus Torvalds <[email protected]>
Acked-by: Peter Zijlstra <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Michel Lespinasse <[email protected]>
[ Changelog massage. ]
Signed-off-by: Ingo Molnar <[email protected]>
---
mm/pgtable-generic.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
index e642627..d8397da 100644
--- a/mm/pgtable-generic.c
+++ b/mm/pgtable-generic.c
@@ -12,8 +12,8 @@
#ifndef __HAVE_ARCH_PTEP_SET_ACCESS_FLAGS
/*
- * Only sets the access flags (dirty, accessed, and
- * writable). Furthermore, we know it always gets set to a "more
+ * Only sets the access flags (dirty, accessed), as well as write
+ * permission. Furthermore, we know it always gets set to a "more
* permissive" setting, which allows most architectures to optimize
* this. We return whether the PTE actually changed, which in turn
* instructs the caller to do things like update__mmu_cache. This
@@ -27,7 +27,7 @@ int ptep_set_access_flags(struct vm_area_struct *vma,
int changed = !pte_same(*ptep, entry);
if (changed) {
set_pte_at(vma->vm_mm, address, ptep, entry);
- flush_tlb_page(vma, address);
+ flush_tlb_fix_spurious_fault(vma, address);
}
return changed;
}
--
1.7.11.7
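For an architecture maintainer who does bisect to this change, a hedged
sketch of the override the last paragraph asks for. The generic fallback in
include/asm-generic/pgtable.h maps flush_tlb_fix_spurious_fault() to
flush_tlb_page(), which this patch now relies on being at least a local
flush; an arch with a cheaper local-only primitive could define something
along these lines (local_flush_tlb_page() is assumed for illustration - the
actual primitive and its name vary by architecture):

/* Must at least invalidate the local stale TLB entry for this address. */
#define flush_tlb_fix_spurious_fault(vma, address) \
	local_flush_tlb_page(vma, address)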
On Tue, Nov 13, 2012 at 06:13:23PM +0100, Ingo Molnar wrote:
> Hi,
>
> This is the latest iteration of our numa/core tree, which
> implements adaptive NUMA affinity balancing.
>
> Changes in this version:
>
> https://lkml.org/lkml/2012/11/12/315
>
> Performance figures:
>
> https://lkml.org/lkml/2012/11/12/330
>
> Any review feedback, comments and test results are welcome!
>
For the purposes of review and testing, this is going to be hard to pick
apart and compare. It doesn't apply against 3.7-rc5 and when trying to
resolve the conflicts it quickly becomes obvious that the series depends
on other scheduler patches such as
sched: Add an rq migration call-back to sched_class
sched: Introduce temporary FAIR_GROUP_SCHED dependency for load-tracking
This is not a full list; they were just the first I hit. What are the other
scheduler patches you depend on? Knowing that will probably help pick
apart some of the larger patches like "sched, numa, mm: Add adaptive
NUMA affinity support", a massive monolithic patch I have not even
attempted to read yet, though the diffstat for it alone says a lot:
7 files changed, 901 insertions(+), 197 deletions(-)
--
Mel Gorman
SUSE Labs
[Put Hugh back on CC]
What happened to Hugh's fixes to the LRU handling? I believe the race went
beyond affecting memcg; it's just that the memcg code had a BUG_ON in
the right place to point it out.
On Tue, Nov 13, 2012 at 06:13:44PM +0100, Ingo Molnar wrote:
> From: Peter Zijlstra <[email protected]>
>
> Add THP migration for the NUMA working set scanning fault case.
>
> It uses the page lock to serialize. No migration pte dance is
> necessary because the pte is already unmapped when we decide
> to migrate.
>
> Signed-off-by: Peter Zijlstra <[email protected]>
> Cc: Johannes Weiner <[email protected]>
> Cc: Mel Gorman <[email protected]>
> Cc: Andrea Arcangeli <[email protected]>
> Cc: Andrew Morton <[email protected]>
> Cc: Linus Torvalds <[email protected]>
> Link: http://lkml.kernel.org/n/[email protected]
> [ Significant fixes and changelog. ]
> Signed-off-by: Ingo Molnar <[email protected]>
> ---
> mm/huge_memory.c | 131 +++++++++++++++++++++++++++++++++++++++++++------------
> mm/migrate.c | 2 +-
> 2 files changed, 103 insertions(+), 30 deletions(-)
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index c4c0a57..931caf4 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -742,12 +742,13 @@ void do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
> unsigned int flags, pmd_t entry)
> {
> unsigned long haddr = address & HPAGE_PMD_MASK;
> + struct page *new_page = NULL;
> struct page *page = NULL;
> - int node;
> + int node, lru;
>
> spin_lock(&mm->page_table_lock);
> if (unlikely(!pmd_same(*pmd, entry)))
> - goto out_unlock;
> + goto unlock;
>
> if (unlikely(pmd_trans_splitting(entry))) {
> spin_unlock(&mm->page_table_lock);
> @@ -755,45 +756,117 @@ void do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
> return;
> }
>
> -#ifdef CONFIG_NUMA
> page = pmd_page(entry);
> - VM_BUG_ON(!PageCompound(page) || !PageHead(page));
> + if (page) {
> + VM_BUG_ON(!PageCompound(page) || !PageHead(page));
>
> - get_page(page);
> - spin_unlock(&mm->page_table_lock);
> + get_page(page);
> + node = mpol_misplaced(page, vma, haddr);
> + if (node != -1)
> + goto migrate;
> + }
>
> - /*
> - * XXX should we serialize against split_huge_page ?
> - */
> +fixup:
> + /* change back to regular protection */
> + entry = pmd_modify(entry, vma->vm_page_prot);
> + set_pmd_at(mm, haddr, pmd, entry);
> + update_mmu_cache_pmd(vma, address, entry);
>
> - node = mpol_misplaced(page, vma, haddr);
> - if (node == -1)
> - goto do_fixup;
> +unlock:
> + spin_unlock(&mm->page_table_lock);
> + if (page)
> + put_page(page);
>
> - /*
> - * Due to lacking code to migrate thp pages, we'll split
> - * (which preserves the special PROT_NONE) and re-take the
> - * fault on the normal pages.
> - */
> - split_huge_page(page);
> - put_page(page);
> return;
>
> -do_fixup:
> +migrate:
> + spin_unlock(&mm->page_table_lock);
> +
> + lock_page(page);
> spin_lock(&mm->page_table_lock);
> - if (unlikely(!pmd_same(*pmd, entry)))
> - goto out_unlock;
> -#endif
> + if (unlikely(!pmd_same(*pmd, entry))) {
> + spin_unlock(&mm->page_table_lock);
> + unlock_page(page);
> + put_page(page);
> + return;
> + }
> + spin_unlock(&mm->page_table_lock);
>
> - /* change back to regular protection */
> - entry = pmd_modify(entry, vma->vm_page_prot);
> - if (pmdp_set_access_flags(vma, haddr, pmd, entry, 1))
> - update_mmu_cache_pmd(vma, address, entry);
> + new_page = alloc_pages_node(node,
> + (GFP_TRANSHUGE | GFP_THISNODE) & ~__GFP_WAIT,
> + HPAGE_PMD_ORDER);
>
> -out_unlock:
> + if (!new_page)
> + goto alloc_fail;
> +
> + lru = PageLRU(page);
> +
> + if (lru && isolate_lru_page(page)) /* does an implicit get_page() */
> + goto alloc_fail;
> +
> + if (!trylock_page(new_page))
> + BUG();
> +
> + /* anon mapping, we can simply copy page->mapping to the new page: */
> + new_page->mapping = page->mapping;
> + new_page->index = page->index;
> +
> + migrate_page_copy(new_page, page);
> +
> + WARN_ON(PageLRU(new_page));
> +
> + spin_lock(&mm->page_table_lock);
> + if (unlikely(!pmd_same(*pmd, entry))) {
> + spin_unlock(&mm->page_table_lock);
> + if (lru)
> + putback_lru_page(page);
> +
> + unlock_page(new_page);
> + ClearPageActive(new_page); /* Set by migrate_page_copy() */
> + new_page->mapping = NULL;
> + put_page(new_page); /* Free it */
> +
> + unlock_page(page);
> + put_page(page); /* Drop the local reference */
> +
> + return;
> + }
> +
> + entry = mk_pmd(new_page, vma->vm_page_prot);
> + entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
> + entry = pmd_mkhuge(entry);
> +
> + page_add_new_anon_rmap(new_page, vma, haddr);
> +
> + set_pmd_at(mm, haddr, pmd, entry);
> + update_mmu_cache_pmd(vma, address, entry);
> + page_remove_rmap(page);
> spin_unlock(&mm->page_table_lock);
> - if (page)
> +
> + put_page(page); /* Drop the rmap reference */
> +
> + if (lru)
> + put_page(page); /* drop the LRU isolation reference */
> +
> + unlock_page(new_page);
> + unlock_page(page);
> + put_page(page); /* Drop the local reference */
> +
> + return;
> +
> +alloc_fail:
> + if (new_page)
> + put_page(new_page);
> +
> + unlock_page(page);
> +
> + spin_lock(&mm->page_table_lock);
> + if (unlikely(!pmd_same(*pmd, entry))) {
> put_page(page);
> + page = NULL;
> + goto unlock;
> + }
> + goto fixup;
> }
>
> int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 3299949..72d1056 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -417,7 +417,7 @@ int migrate_huge_page_move_mapping(struct address_space *mapping,
> */
> void migrate_page_copy(struct page *newpage, struct page *page)
> {
> - if (PageHuge(page))
> + if (PageHuge(page) || PageTransHuge(page))
> copy_huge_page(newpage, page);
> else
> copy_highpage(newpage, page);
> --
> 1.7.11.7
>
On Tue, 13 Nov 2012, Johannes Weiner wrote:
> [Put Hugh back on CC]
>
> What happened to Hugh's fixes to the LRU handling? I believe it was
> racy beyond affecting memcg, it's just that memcg code had a BUG_ON in
> the right place to point it out.
Right, yes, I made several fixes there, but never sent out the final
version. And I can't see your fix to the memcg end of THP migration
either - though I've not done much more than scan the Subjects so far,
I've not yet got to grips with yesterday/today's latest approach.
I'll send out two patches, yours and mine, against what's in the current
linux-next and the current mmotm, which I have been testing with in
several contexts; they probably both apply easily to Ingo's new tree.
But I vehemently hope that this all very soon vanishes from linux-next,
and the new stuff is worked on properly for a while, in a separate
development branch of tip, hopefully converging with Mel's.
Hugh
Refuse to migrate a THPage if its page count is raised: that covers both
the case when it is mapped into another address space, and the case
when it has been pinned by get_user_pages(), fast or slow.
Repeat this check each time we recheck pmd_same() under page_table_lock:
it is unclear how necessary this is, perhaps once after lock_page() or
once after isolate_lru_page() would be enough; but normal page migration
certainly checks page count, and we often think "ah, this is safe against
page migration because its page count is raised" - though sadly without
thinking through what serialization supports that.
Do not proceed with migration when PageLRU is unset: such a page may
well be on a private list or a pagevec, about to be added to the LRU at
any instant; checking PageLRU under the zone lock, as isolate_lru_page()
does, is essential before proceeding safely.
Replace the trylock_page() and BUG() with __set_page_locked(): the page
was allocated only a few lines earlier, so nothing else can hold its lock.
And SetPageSwapBacked(): it is set later anyway, but there may be an
error path which needs it set earlier.
On error path reverse the Active, Unevictable, Mlocked changes made
by migrate_page_copy(). Update mlock_migrate_page() to account for
THPages correctly now that it can get called on them.
Cleanup: rearrange unwinding slightly, removing a few blank lines.
Previous-Version-Tested-by: Zhouping Liu <[email protected]>
Signed-off-by: Hugh Dickins <[email protected]>
---
I did not understand at all how pmd_page(entry) might be NULL, but
assumed that check is there for good reason and did not remove it.
mm/huge_memory.c | 60 +++++++++++++++++++++++----------------------
mm/internal.h | 5 ++-
2 files changed, 34 insertions(+), 31 deletions(-)
--- mmotm/mm/huge_memory.c 2012-11-13 14:51:04.000321370 -0800
+++ linux/mm/huge_memory.c 2012-11-13 15:01:01.892335579 -0800
@@ -751,9 +751,9 @@ void do_huge_pmd_numa_page(struct mm_str
{
unsigned long haddr = address & HPAGE_PMD_MASK;
struct mem_cgroup *memcg = NULL;
- struct page *new_page = NULL;
+ struct page *new_page;
struct page *page = NULL;
- int node, lru;
+ int node = -1;
spin_lock(&mm->page_table_lock);
if (unlikely(!pmd_same(*pmd, entry)))
@@ -770,7 +770,17 @@ void do_huge_pmd_numa_page(struct mm_str
VM_BUG_ON(!PageCompound(page) || !PageHead(page));
get_page(page);
- node = mpol_misplaced(page, vma, haddr);
+ /*
+ * Do not migrate this page if it is mapped anywhere else.
+ * Do not migrate this page if its count has been raised.
+ * Our caller's down_read of mmap_sem excludes fork raising
+ * mapcount; but recheck page count below whenever we take
+ * page_table_lock - although it's unclear what pin we are
+ * protecting against, since get_user_pages() or GUP fast
+ * would have to fault it present before they could proceed.
+ */
+ if (page_count(page) == 2)
+ node = mpol_misplaced(page, vma, haddr);
if (node != -1)
goto migrate;
}
@@ -794,7 +804,7 @@ migrate:
lock_page(page);
spin_lock(&mm->page_table_lock);
- if (unlikely(!pmd_same(*pmd, entry))) {
+ if (unlikely(!pmd_same(*pmd, entry) || page_count(page) != 2)) {
spin_unlock(&mm->page_table_lock);
unlock_page(page);
put_page(page);
@@ -803,19 +813,17 @@ migrate:
spin_unlock(&mm->page_table_lock);
new_page = alloc_pages_node(node,
- (GFP_TRANSHUGE | GFP_THISNODE) & ~__GFP_WAIT,
- HPAGE_PMD_ORDER);
-
+ (GFP_TRANSHUGE | GFP_THISNODE) & ~__GFP_WAIT, HPAGE_PMD_ORDER);
if (!new_page)
goto alloc_fail;
- lru = PageLRU(page);
-
- if (lru && isolate_lru_page(page)) /* does an implicit get_page() */
+ if (isolate_lru_page(page)) { /* Does an implicit get_page() */
+ put_page(new_page);
goto alloc_fail;
+ }
- if (!trylock_page(new_page))
- BUG();
+ __set_page_locked(new_page);
+ SetPageSwapBacked(new_page);
/* anon mapping, we can simply copy page->mapping to the new page: */
new_page->mapping = page->mapping;
@@ -826,19 +834,22 @@ migrate:
WARN_ON(PageLRU(new_page));
spin_lock(&mm->page_table_lock);
- if (unlikely(!pmd_same(*pmd, entry))) {
+ if (unlikely(!pmd_same(*pmd, entry) || page_count(page) != 3)) {
spin_unlock(&mm->page_table_lock);
- if (lru)
- putback_lru_page(page);
+
+ /* Reverse changes made by migrate_page_copy() */
+ if (TestClearPageActive(new_page))
+ SetPageActive(page);
+ if (TestClearPageUnevictable(new_page))
+ SetPageUnevictable(page);
+ mlock_migrate_page(page, new_page);
unlock_page(new_page);
- ClearPageActive(new_page); /* Set by migrate_page_copy() */
- new_page->mapping = NULL;
put_page(new_page); /* Free it */
unlock_page(page);
+ putback_lru_page(page);
put_page(page); /* Drop the local reference */
-
return;
}
/*
@@ -867,26 +878,17 @@ migrate:
mem_cgroup_end_migration(memcg, page, new_page, true);
spin_unlock(&mm->page_table_lock);
- put_page(page); /* Drop the rmap reference */
-
task_numa_fault(node, HPAGE_PMD_NR);
- if (lru)
- put_page(page); /* drop the LRU isolation reference */
-
unlock_page(new_page);
-
unlock_page(page);
+ put_page(page); /* Drop the rmap reference */
+ put_page(page); /* Drop the LRU isolation reference */
put_page(page); /* Drop the local reference */
-
return;
alloc_fail:
- if (new_page)
- put_page(new_page);
-
unlock_page(page);
-
spin_lock(&mm->page_table_lock);
if (unlikely(!pmd_same(*pmd, entry))) {
put_page(page);
--- mmotm/mm/internal.h 2012-11-09 09:43:46.896046342 -0800
+++ linux/mm/internal.h 2012-11-13 15:01:01.892335579 -0800
@@ -218,11 +218,12 @@ static inline void mlock_migrate_page(st
{
if (TestClearPageMlocked(page)) {
unsigned long flags;
+ int nr_pages = hpage_nr_pages(page);
local_irq_save(flags);
- __dec_zone_page_state(page, NR_MLOCK);
+ __mod_zone_page_state(page_zone(page), NR_MLOCK, -nr_pages);
SetPageMlocked(newpage);
- __inc_zone_page_state(newpage, NR_MLOCK);
+ __mod_zone_page_state(page_zone(newpage), NR_MLOCK, nr_pages);
local_irq_restore(flags);
}
}
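For readers counting references along with the checks above, here is a
minimal, hedged helper that encodes the rule the patch relies on (the helper
and its name are illustrative only, not part of the patch): a privately
mapped anonymous THP holds one reference for its single pmd mapping, plus
whatever references this path has taken itself - the local get_page(), so
the first recheck expects a count of 2, and after isolation the implicit
get_page() from isolate_lru_page(), so the later recheck expects 3. Any
extra reference - a second mapping, a GUP pin, a pagevec - makes the
migration back off.

static inline bool thp_numa_migrate_refs_ok(struct page *page, int ours)
{
	/* 1 == the single pmd mapping; "ours" == references taken locally */
	return page_count(page) == 1 + ours;
}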
From: Johannes Weiner <[email protected]>
Add mem_cgroup_prepare_migration() and mem_cgroup_end_migration() calls
into do_huge_pmd_numa_page(), and fix mem_cgroup_prepare_migration() to
account for a Transparent Huge Page correctly without bugging.
Tested-by: Zhouping Liu <[email protected]>
Signed-off-by: Johannes Weiner <[email protected]>
Signed-off-by: Hugh Dickins <[email protected]>
---
mm/huge_memory.c | 16 ++++++++++++++++
mm/memcontrol.c | 7 +++++--
2 files changed, 21 insertions(+), 2 deletions(-)
--- mmotm/mm/huge_memory.c 2012-11-09 09:43:46.892046342 -0800
+++ linux/mm/huge_memory.c 2012-11-13 14:51:04.000321370 -0800
@@ -750,6 +750,7 @@ void do_huge_pmd_numa_page(struct mm_str
unsigned int flags, pmd_t entry)
{
unsigned long haddr = address & HPAGE_PMD_MASK;
+ struct mem_cgroup *memcg = NULL;
struct page *new_page = NULL;
struct page *page = NULL;
int node, lru;
@@ -840,6 +841,14 @@ migrate:
return;
}
+ /*
+ * Traditional migration needs to prepare the memcg charge
+ * transaction early to prevent the old page from being
+ * uncharged when installing migration entries. Here we can
+ * save the potential rollback and start the charge transfer
+ * only when migration is already known to end successfully.
+ */
+ mem_cgroup_prepare_migration(page, new_page, &memcg);
entry = mk_pmd(new_page, vma->vm_page_prot);
entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
@@ -850,6 +859,12 @@ migrate:
set_pmd_at(mm, haddr, pmd, entry);
update_mmu_cache_pmd(vma, address, entry);
page_remove_rmap(page);
+ /*
+ * Finish the charge transaction under the page table lock to
+ * prevent split_huge_page() from dividing up the charge
+ * before it's fully transferred to the new page.
+ */
+ mem_cgroup_end_migration(memcg, page, new_page, true);
spin_unlock(&mm->page_table_lock);
put_page(page); /* Drop the rmap reference */
@@ -860,6 +875,7 @@ migrate:
put_page(page); /* drop the LRU isolation reference */
unlock_page(new_page);
+
unlock_page(page);
put_page(page); /* Drop the local reference */
--- mmotm/mm/memcontrol.c 2012-11-09 09:43:46.896046342 -0800
+++ linux/mm/memcontrol.c 2012-11-13 14:51:04.004321370 -0800
@@ -4186,15 +4186,18 @@ void mem_cgroup_prepare_migration(struct
struct mem_cgroup **memcgp)
{
struct mem_cgroup *memcg = NULL;
+ unsigned int nr_pages = 1;
struct page_cgroup *pc;
enum charge_type ctype;
*memcgp = NULL;
- VM_BUG_ON(PageTransHuge(page));
if (mem_cgroup_disabled())
return;
+ if (PageTransHuge(page))
+ nr_pages <<= compound_order(page);
+
pc = lookup_page_cgroup(page);
lock_page_cgroup(pc);
if (PageCgroupUsed(pc)) {
@@ -4256,7 +4259,7 @@ void mem_cgroup_prepare_migration(struct
* charged to the res_counter since we plan on replacing the
* old one and only one page is going to be left afterwards.
*/
- __mem_cgroup_commit_charge(memcg, newpage, 1, ctype, false);
+ __mem_cgroup_commit_charge(memcg, newpage, nr_pages, ctype, false);
}
/* remove redundant charge if migration failed*/
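Condensing the two comments in the diff, a hedged sketch of the call
ordering (the two memcg calls are the real API used by the patch; the
skeleton around them is illustrative and omits everything else
do_huge_pmd_numa_page() does):

static void thp_numa_memcg_ordering_sketch(struct page *page,
					   struct page *new_page)
{
	struct mem_cgroup *memcg = NULL;

	/* Start the charge transfer only once migration is known to succeed. */
	mem_cgroup_prepare_migration(page, new_page, &memcg);

	/* ... install the new pmd and drop the old rmap here ... */

	/*
	 * Finish while still holding mm->page_table_lock, so a concurrent
	 * split_huge_page() cannot divide up the charge before it has been
	 * fully transferred to the new page.
	 */
	mem_cgroup_end_migration(memcg, page, new_page, true);
}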
On Tue, 13 Nov 2012 18:23:13 -0800 (PST) Hugh Dickins <[email protected]> wrote:
> But I vehemently hope that this all very soon vanishes from linux-next,
> and the new stuff is worked on properly for a while, in a separate
> development branch of tip, hopefully converging with Mel's.
Yes please.
The old code in -next has been causing MM integration problems for
months, and -next shuts down from Nov 15 to Nov 26, reopening around
3.7-rc7. rc7 is too late for this material - let's shoot for
integration in -next at 3.8-rc1.
* Mel Gorman <[email protected]> wrote:
> On Tue, Nov 13, 2012 at 06:13:23PM +0100, Ingo Molnar wrote:
> > Hi,
> >
> > This is the latest iteration of our numa/core tree, which
> > implements adaptive NUMA affinity balancing.
> >
> > Changes in this version:
> >
> > https://lkml.org/lkml/2012/11/12/315
> >
> > Performance figures:
> >
> > https://lkml.org/lkml/2012/11/12/330
> >
> > Any review feedback, comments and test results are welcome!
> >
>
> For the purposes of review and testing, this is going to be
> hard to pick apart and compare. It doesn't apply against
> 3.7-rc5 [...]
Because the scheduler changes are highly non-trivial, the series is based
on top of the scheduler tree:
git pull git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git sched/core
I just tested the patches; they all apply cleanly, with zero
fuzz and no offsets.
Thanks,
Ingo
On Wed, Nov 14, 2012 at 08:52:22AM +0100, Ingo Molnar wrote:
>
> * Mel Gorman <[email protected]> wrote:
>
> > On Tue, Nov 13, 2012 at 06:13:23PM +0100, Ingo Molnar wrote:
> > > Hi,
> > >
> > > This is the latest iteration of our numa/core tree, which
> > > implements adaptive NUMA affinity balancing.
> > >
> > > Changes in this version:
> > >
> > > https://lkml.org/lkml/2012/11/12/315
> > >
> > > Performance figures:
> > >
> > > https://lkml.org/lkml/2012/11/12/330
> > >
> > > Any review feedback, comments and test results are welcome!
> > >
> >
> > For the purposes of review and testing, this is going to be
> > hard to pick apart and compare. It doesn't apply against
> > 3.7-rc5 [...]
>
> Because the scheduler changes are highly non-trivial it's on top
> of the scheduler tree:
>
> git pull git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git sched/core
>
> I just tested the patches, they all apply cleanly, with zero
> fuzz and offsets.
>
The actual numa patches don't apply on top of that, but at least the
conflicts are obvious to resolve. I'll queue up a test to run overnight,
but in the meantime: does the current implementation of the NUMA patches
*depend* on any of those scheduler patches? Normally I would say it would be
obvious from the series, except in this case it just isn't, because of the
monolithic nature of some of the patches.
--
Mel Gorman
SUSE Labs
On Wed, Nov 14, 2012 at 08:52:22AM +0100, Ingo Molnar wrote:
>
> * Mel Gorman <[email protected]> wrote:
>
> > On Tue, Nov 13, 2012 at 06:13:23PM +0100, Ingo Molnar wrote:
> > > Hi,
> > >
> > > This is the latest iteration of our numa/core tree, which
> > > implements adaptive NUMA affinity balancing.
> > >
> > > Changes in this version:
> > >
> > > https://lkml.org/lkml/2012/11/12/315
> > >
> > > Performance figures:
> > >
> > > https://lkml.org/lkml/2012/11/12/330
> > >
> > > Any review feedback, comments and test results are welcome!
> > >
> >
> > For the purposes of review and testing, this is going to be
> > hard to pick apart and compare. It doesn't apply against
> > 3.7-rc5 [...]
>
> Because the scheduler changes are highly non-trivial it's on top
> of the scheduler tree:
>
> git pull git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git sched/core
>
> I just tested the patches, they all apply cleanly, with zero
> fuzz and offsets.
>
My apologies about the merge complaint. I used the wrong baseline and
the problem was on my side. The series does indeed apply cleanly once
the scheduler patches are pulled in too.
--
Mel Gorman
SUSE Labs
I caught an oops on my 2-socket SNB-EP server, but cannot reproduce it.
Sending it out as a reminder:
On tip/master, head: a7b7a8ad4476bb641c8455a4e0d7d0fd3eb86f90
Oops: 0000 [#1] SMP
[ 21.967103] Modules linked in: iTCO_wdt iTCO_vendor_support
i2c_i801 igb microcode lpc_ich ioatdma i2c_core joydev mfd_core hed
dca ipv6 isci libsas scsi_transport_sas
[ 21.967109] CPU 7
[ 21.967109] Pid: 754, comm: systemd-readahe Not tainted
3.7.0-rc5-tip+ #20 Intel Corporation S2600CP/S2600CP
[ 21.967115] RIP: 0010:[<ffffffff8114987f>] [<ffffffff8114987f>]
__fd_install+0x2d/0x4f
[ 21.967117] RSP: 0018:ffff8808187f7de8 EFLAGS: 00010246
[ 21.967118] RAX: ffff881018bfb700 RBX: ffff88081c2f5d80 RCX: ffff880818dfc620
[ 21.967120] RDX: ffff881019b10000 RSI: 00000000ffffffff RDI: ffff88081c2f5e00
[ 21.967122] RBP: ffff8808187f7e08 R08: ffff88101b37e008 R09: ffffffff811644a6
[ 21.967123] R10: ffff880818005e00 R11: ffff880818005e00 R12: 00000000ffffffff
[ 21.967125] R13: 0000000000000000 R14: 00000000fffffff2 R15: 0000000000000000
[ 21.967128] FS: 00007ffa79ead7e0(0000) GS:ffff88081fce0000(0000)
knlGS:0000000000000000
[ 21.967130] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 21.967131] CR2: ffff881819b0fff8 CR3: 000000081be54000 CR4: 00000000000407e0
[ 21.967133] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 21.967135] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 21.967137] Process systemd-readahe (pid: 754, threadinfo
ffff8808187f6000, task ffff880818dfc620)
[ 21.967138] Stack:
[ 21.967145] ffff880818005e00 ffff88101b37e000 ffff880818005e00
00007fff57d29378
[ 21.967150] ffff8808187f7e18 ffffffff811498c6 ffff8808187f7ed8
ffffffff81167a7c
[ 21.967155] ffff880818dfc620 ffff880818dfc620 ffff880818004d00
ffff880818005e40
[ 21.967156] Call Trace:
[ 21.967162] [<ffffffff811498c6>] fd_install+0x25/0x27
[ 21.967168] [<ffffffff81167a7c>] fanotify_read+0x38d/0x475
[ 21.967176] [<ffffffff8106716e>] ? remove_wait_queue+0x3a/0x3a
[ 21.967181] [<ffffffff81133e21>] vfs_read+0xa9/0xf0
[ 21.967186] [<ffffffff811422cb>] ? poll_select_set_timeout+0x63/0x81
[ 21.967189] [<ffffffff81133ec1>] sys_read+0x59/0x7e
[ 21.967195] [<ffffffff814bd699>] system_call_fastpath+0x16/0x1b
[ 21.967222] Code: 66 66 90 55 48 89 e5 41 55 49 89 d5 41 54 41 89
f4 53 48 89 fb 48 8d bf 80 00 00 00 41 53 e8 69 ce 36 00 48 8b 43 08
48 8b 50 08 <4a> 83 3c e2 00 74 02 0f 0b 48 8b 40 08 4e 89 2c e0 66 83
83 80
[ 21.967226] RIP [<ffffffff8114987f>] __fd_install+0x2d/0x4f
[ 21.967227] RSP <ffff8808187f7de8>
[ 21.967228] CR2: ffff881819b0fff8
On Fri, Nov 16, 2012 at 10:45 PM, Alex Shi <[email protected]> wrote:
> I caught an oops on my 2-socket SNB-EP server, but cannot reproduce it.
> Sending it out as a reminder:
> On tip/master, head: a7b7a8ad4476bb641c8455a4e0d7d0fd3eb86f90
This is an independent bug, nothing to do with the NUMA stuff. Fixed
in my tree now (commit 3587b1b097d70).
Of course, it's entirely possible that the NUMA patches are subtly
buggy and helped trigger the fanotify OVERFLOW event that had this
particular bug. But the oops itself is due to a real bug.
Linus