This is the split-out series of mm/ patches that got no objections
from the latest (v15) posting of numa/core. If everyone is still
fine with these, they will be merge candidates for v3.8.
I left out the more contentious policy bits that people are still
arguing about.
The numa/base tree can also be found here:
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git numa/base
Thanks,
Ingo
------------------->
Andrea Arcangeli (1):
numa, mm: Support NUMA hinting page faults from gup/gup_fast
Gerald Schaefer (1):
sched, numa, mm, s390/thp: Implement pmd_pgprot() for s390
Ingo Molnar (1):
mm/pgprot: Move the pgprot_modify() fallback definition to mm.h
Lee Schermerhorn (3):
mm/mpol: Add MPOL_MF_NOOP
mm/mpol: Check for misplaced page
mm/mpol: Add MPOL_MF_LAZY
Peter Zijlstra (7):
sched, numa, mm: Make find_busiest_queue() a method
sched, numa, mm: Describe the NUMA scheduling problem formally
mm/thp: Preserve pgprot across huge page split
mm/mpol: Make MPOL_LOCAL a real policy
mm/mpol: Create special PROT_NONE infrastructure
mm/migrate: Introduce migrate_misplaced_page()
mm/mpol: Use special PROT_NONE to migrate pages
Ralf Baechle (1):
sched, numa, mm, MIPS/thp: Add pmd_pgprot() implementation
Rik van Riel (5):
mm/generic: Only flush the local TLB in ptep_set_access_flags()
x86/mm: Only do a local tlb flush in ptep_set_access_flags()
x86/mm: Introduce pte_accessible()
mm: Only flush the TLB when clearing an accessible pte
x86/mm: Completely drop the TLB flush from ptep_set_access_flags()
Documentation/scheduler/numa-problem.txt | 230 +++++++++++++++++++++++++++++++
arch/mips/include/asm/pgtable.h | 2 +
arch/s390/include/asm/pgtable.h | 13 ++
arch/x86/include/asm/pgtable.h | 7 +
arch/x86/mm/pgtable.c | 8 +-
include/asm-generic/pgtable.h | 4 +
include/linux/huge_mm.h | 19 +++
include/linux/mempolicy.h | 8 ++
include/linux/migrate.h | 7 +
include/linux/migrate_mode.h | 3 +
include/linux/mm.h | 32 +++++
include/uapi/linux/mempolicy.h | 16 ++-
kernel/sched/fair.c | 20 +--
mm/huge_memory.c | 174 +++++++++++++++--------
mm/memory.c | 119 +++++++++++++++-
mm/mempolicy.c | 143 +++++++++++++++----
mm/migrate.c | 85 ++++++++++--
mm/mprotect.c | 31 +++--
mm/pgtable-generic.c | 9 +-
19 files changed, 807 insertions(+), 123 deletions(-)
create mode 100644 Documentation/scheduler/numa-problem.txt
--
1.7.11.7
From: Rik van Riel <[email protected]>
The function ptep_set_access_flags() is only ever used to upgrade
access permissions to a page, i.e. to make them less restrictive.
That means the only negative side effect of not flushing remote
TLBs in this function is that other CPUs may incur spurious page
faults, if they happen to access the same address, and still have
a PTE with the old permissions cached in their TLBs.
An occasional spurious page fault on another CPU is cheaper
than always incurring the cost of a remote TLB flush, so replace
the remote TLB flush with a purely local one.
This should be safe on every architecture that correctly
implements flush_tlb_fix_spurious_fault() to actually invalidate
the local TLB entry that caused a page fault, as well as on
architectures where the hardware invalidates TLB entries that
cause page faults.
In the unlikely event that you are hitting what appears to be
an infinite loop of page faults, and 'git bisect' took you to
this changeset, your architecture needs to implement
flush_tlb_fix_spurious_fault() to actually flush the TLB entry.
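The reasoning above can be illustrated with a small user-space model. This
is only a sketch, not kernel code: the toy_* names, the single-PTE page
table and the two-entry "TLB" are all assumptions made up for the example.
It shows a remote CPU with a stale read-only entry taking exactly one
spurious fault after a local-only flush, and none afterwards.

```c
#include <stdbool.h>

#define NCPUS 2

/* Toy model (hypothetical): one PTE and a per-CPU cached copy of its write bit. */
struct toy_pte { bool writable; };
struct toy_tlb { bool valid; bool writable; };

static struct toy_pte pte;
static struct toy_tlb tlb[NCPUS];
static int spurious_faults;

/* Fill the TLB of @cpu from the page table if it has no cached entry. */
void tlb_fill(int cpu)
{
	if (!tlb[cpu].valid) {
		tlb[cpu].valid = true;
		tlb[cpu].writable = pte.writable;
	}
}

/* Upgrade permissions and invalidate only the local CPU's cached entry. */
void toy_set_access_flags(int cpu)
{
	pte.writable = true;
	tlb[cpu].valid = false;		/* local flush only, no remote flush */
}

/* A write from @cpu; returns the number of faults taken before it succeeds. */
int toy_write(int cpu)
{
	int faults = 0;

	for (;;) {
		tlb_fill(cpu);
		if (tlb[cpu].writable)
			return faults;

		/* Fault: drop the faulting entry, then retry the access. */
		tlb[cpu].valid = false;
		faults++;

		if (!pte.writable)
			toy_set_access_flags(cpu);	/* genuine fault */
		else
			spurious_faults++;		/* stale TLB entry only */
	}
}
```

If the faulting entry were not dropped in toy_write(), the retry would hit
the same stale entry forever, which is the infinite fault loop described
above.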
Signed-off-by: Rik van Riel <[email protected]>
Acked-by: Linus Torvalds <[email protected]>
Acked-by: Peter Zijlstra <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Michel Lespinasse <[email protected]>
[ Changelog massage. ]
Signed-off-by: Ingo Molnar <[email protected]>
---
mm/pgtable-generic.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
index e642627..d8397da 100644
--- a/mm/pgtable-generic.c
+++ b/mm/pgtable-generic.c
@@ -12,8 +12,8 @@
#ifndef __HAVE_ARCH_PTEP_SET_ACCESS_FLAGS
/*
- * Only sets the access flags (dirty, accessed, and
- * writable). Furthermore, we know it always gets set to a "more
+ * Only sets the access flags (dirty, accessed), as well as write
+ * permission. Furthermore, we know it always gets set to a "more
* permissive" setting, which allows most architectures to optimize
* this. We return whether the PTE actually changed, which in turn
* instructs the caller to do things like update__mmu_cache. This
@@ -27,7 +27,7 @@ int ptep_set_access_flags(struct vm_area_struct *vma,
int changed = !pte_same(*ptep, entry);
if (changed) {
set_pte_at(vma->vm_mm, address, ptep, entry);
- flush_tlb_page(vma, address);
+ flush_tlb_fix_spurious_fault(vma, address);
}
return changed;
}
--
1.7.11.7
From: Peter Zijlstra <[email protected]>
This is probably a first: formal description of a complex high-level
computing problem, within the kernel source.
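As a reading aid for the fairness term of the document added by this patch
(load 'L_n' of a cpu is the sum of the weights of its tasks divided by the
cpu's capacity 'P_n', and 'L_n,m' is the absolute load difference), here is
a minimal user-space sketch. The toy_* names and the data layout are
assumptions for illustration only and do not correspond to scheduler code.

```c
#include <stddef.h>

/* Hypothetical task record: weight 'w_i' and the cpu 'c_i' it is mapped to. */
struct toy_task {
	double weight;
	int cpu;
};

/* L_n = (1 / P_n) * sum of w_i over all tasks with c_i == n */
double toy_cpu_load(const struct toy_task *tasks, size_t nr,
		    int cpu, double capacity)
{
	double sum = 0.0;
	size_t i;

	for (i = 0; i < nr; i++) {
		if (tasks[i].cpu == cpu)
			sum += tasks[i].weight;
	}
	return sum / capacity;
}

/* L_n,m = | L_n - L_m | */
double toy_load_diff(double l_n, double l_m)
{
	return l_n > l_m ? l_n - l_m : l_m - l_n;
}
```

The fairness goal then amounts to choosing the task-to-cpu mapping that
makes toy_load_diff() as small as possible for every pair of cpus.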
Signed-off-by: Peter Zijlstra <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: Mike Galbraith <[email protected]>
Cc: Rik van Riel <[email protected]>
Link: http://lkml.kernel.org/n/[email protected]
[ Next step: generate the kernel source from such formal descriptions and retire to a tropical island! ]
Signed-off-by: Ingo Molnar <[email protected]>
---
Documentation/scheduler/numa-problem.txt | 230 +++++++++++++++++++++++++++++++
1 file changed, 230 insertions(+)
create mode 100644 Documentation/scheduler/numa-problem.txt
diff --git a/Documentation/scheduler/numa-problem.txt b/Documentation/scheduler/numa-problem.txt
new file mode 100644
index 0000000..a5d2fee
--- /dev/null
+++ b/Documentation/scheduler/numa-problem.txt
@@ -0,0 +1,230 @@
+
+
+Effective NUMA scheduling problem statement, described formally:
+
+ * minimize interconnect traffic
+
+For each task 't_i' we have memory, this memory can be spread over multiple
+physical nodes, let us denote this as: 'p_i,k', the memory task 't_i' has on
+node 'k' in [pages].
+
+If a task shares memory with another task let us denote this as:
+'s_i,k', the memory shared between tasks including 't_i' residing on node
+'k'.
+
+Let 'M' be the distribution that governs all 'p' and 's', ie. the page placement.
+
+Similarly, let's define 'fp_i,k' and 'fs_i,k' resp. as the (average) usage
+frequency over those memory regions [1/s] such that the product gives an
+(average) bandwidth 'bp' and 'bs' in [pages/s].
+
+(note: multiple tasks sharing memory naturally avoid duplicate accounting
+ because each task will have its own access frequency 'fs')
+
+(pjt: I think this frequency is more numerically consistent if you explicitly
+ restrict p/s above to be the working-set. (It also makes explicit the
+ requirement for <C0,M0> to change about a change in the working set.)
+
+ Doing this does have the nice property that it lets you use your frequency
+ measurement as a weak-ordering for the benefit a task would receive when
+ we can't fit everything.
+
+ e.g. task1 has working set 10mb, f=90%
+ task2 has working set 90mb, f=10%
+
+ Both are using 9mb/s of bandwidth, but we'd expect a much larger benefit
+ from task1 being on the right node than task2. )
+
+Let 'C' map every task 't_i' to a cpu 'c_i' and its corresponding node 'n_i':
+
+ C: t_i -> {c_i, n_i}
+
+This gives us the total interconnect traffic between nodes 'k' and 'l',
+'T_k,l', as:
+
+ T_k,l = \Sum_i bp_i,l + bs_i,l + \Sum_j bp_j,k + bs_j,k where n_i == k, n_j == l
+
+And our goal is to obtain C0 and M0 such that:
+
+ T_k,l(C0, M0) =< T_k,l(C, M) for all C, M where k != l
+
+(note: we could introduce 'nc(k,l)' as the cost function of accessing memory
+ on node 'l' from node 'k', this would be useful for bigger NUMA systems
+
+ pjt: I agree nice to have, but intuition suggests diminishing returns on more
+ usual systems given factors like Haswell's enormous 35mb l3
+ cache and QPI being able to do a direct fetch.)
+
+(note: do we need a limit on the total memory per node?)
+
+
+ * fairness
+
+For each task 't_i' we have a weight 'w_i' (related to nice), and each cpu
+'c_n' has a compute capacity 'P_n', again, using our map 'C' we can formulate a
+load 'L_n':
+
+ L_n = 1/P_n * \Sum_i w_i for all c_i = n
+
+using that we can formulate a load difference between CPUs
+
+ L_n,m = | L_n - L_m |
+
+Which allows us to state the fairness goal like:
+
+ L_n,m(C0) =< L_n,m(C) for all C, n != m
+
+(pjt: It can also be usefully stated that, having converged at C0:
+
+ | L_n(C0) - L_m(C0) | <= 4/3 * | G_n( U(t_i, t_j) ) - G_m( U(t_i, t_j) ) |
+
+ Where G_n,m is the greedy partition of tasks between L_n and L_m. This is
+ the "worst" partition we should accept; but having it gives us a useful
+ bound on how much we can reasonably adjust L_n/L_m at a Pareto point to
+ favor T_n,m. )
+
+Together they give us the complete multi-objective optimization problem:
+
+ min_C,M [ L_n,m(C), T_k,l(C,M) ]
+
+
+
+Notes:
+
+ - the memory bandwidth problem is very much an inter-process problem, in
+ particular there is no such concept as a process in the above problem.
+
+ - the naive solution would completely prefer fairness over interconnect
+ traffic, the more complicated solution could pick another Pareto point using
+ an aggregate objective function such that we balance the loss of work
+ efficiency against the gain of running, we'd want to more or less suggest
+ there to be a fixed bound on the error from the Pareto line for any
+ such solution.
+
+References:
+
+ http://en.wikipedia.org/wiki/Mathematical_optimization
+ http://en.wikipedia.org/wiki/Multi-objective_optimization
+
+
+* warning, significant hand-waving ahead, improvements welcome *
+
+
+Partial solutions / approximations:
+
+ 1) have task node placement be a pure preference from the 'fairness' pov.
+
+This means we always prefer fairness over interconnect bandwidth. This reduces
+the problem to:
+
+ min_C,M [ T_k,l(C,M) ]
+
+ 2a) migrate memory towards 'n_i' (the task's node).
+
+This creates memory movement such that 'p_i,k for k != n_i' becomes 0 --
+provided 'n_i' stays stable enough and there's sufficient memory (looks like
+we might need memory limits for this).
+
+This does however not provide us with any 's_i' (shared) information. It does
+however remove 'M' since it defines memory placement in terms of task
+placement.
+
+XXX properties of this M vs a potential optimal
+
+ 2b) migrate memory towards 'n_i' using 2 samples.
+
+This separates pages into those that will migrate and those that will not due
+to the two samples not matching. We could consider the first to be of 'p_i'
+(private) and the second to be of 's_i' (shared).
+
+This interpretation can be motivated by the previously observed property that
+'p_i,k for k != n_i' should become 0 under sufficient memory, leaving only
+'s_i' (shared). (here we lose the need for memory limits again, since it
+becomes indistinguishable from shared).
+
+XXX include the statistical babble on double sampling somewhere near
+
+This reduces the problem further; we lose 'M' as per 2a, it further reduces
+the 'T_k,l' (interconnect traffic) term to only include shared (since per the
+above all private will be local):
+
+ T_k,l = \Sum_i bs_i,l for every n_i = k, l != k
+
+[ more or less matches the state of sched/numa and describes its remaining
+ problems and assumptions. It should work well for tasks without significant
+ shared memory usage between tasks. ]
+
+Possible future directions:
+
+Motivated by the form of 'T_k,l', try and obtain each term of the sum, so we
+can evaluate it;
+
+ 3a) add per-task per node counters
+
+At fault time, count the number of pages the task faults on for each node.
+This should give an approximation of 'p_i' for the local node and 's_i,k' for
+all remote nodes.
+
+While these numbers provide pages per scan, and so have the unit [pages/s] they
+don't count repeat accesses and thus aren't actually representative of our
+bandwidth numbers.
+
+ 3b) additional frequency term
+
+Additionally (or instead if it turns out we don't need the raw 'p' and 's'
+numbers) we can approximate the repeat accesses by using the time since marking
+the pages as indication of the access frequency.
+
+Let 'I' be the interval of marking pages and 'e' the elapsed time since the
+last marking, then we could estimate the number of accesses 'a' as 'a = I / e'.
+If we then increment the node counters using 'a' instead of 1 we might get
+a better estimate of bandwidth terms.
+
+ 3c) additional averaging; can be applied on top of either a/b.
+
+[ Rik argues that decaying averages on 3a might be sufficient for bandwidth since
+ the decaying avg includes the old accesses and therefore has a measure of repeat
+ accesses.
+
+ Rik also argued that the sample frequency is too low to get accurate access
+ frequency measurements, I'm not entirely convinced, even at low sample
+ frequencies the avg elapsed time 'e' over multiple samples should still
+ give us a fair approximation of the avg access frequency 'a'.
+
+ So doing both b&c has a fair chance of working and allowing us to distinguish
+ between important and less important memory accesses.
+
+ Experimentation has shown no benefit from the added frequency term so far. ]
+
+This will give us 'bp_i' and 'bs_i,k' so that we can approximately compute
+'T_k,l'. Our optimization problem now reads:
+
+ min_C [ \Sum_i bs_i,l for every n_i = k, l != k ]
+
+And includes only shared terms, this makes sense since all task private memory
+will become local as per 2.
+
+This suggests that if there is significant shared memory, we should try and
+move towards it.
+
+ 4) move towards where 'most' memory is
+
+The simplest significance test is comparing the biggest shared 's_i,k' against
+the private 'p_i'. If we have more shared than private, move towards it.
+
+This effectively makes us move towards where most our memory is and forms a
+feed-back loop with 2. We migrate memory towards us and we migrate towards
+where 'most' memory is.
+
+(Note: even if there were two tasks fully thrashing the same shared memory, it
+ is very rare for there to be a 50/50 split in memory, lacking a perfect
+ split, the smaller will move towards the larger. In case of the perfect
+ split, we'll tie-break towards the lower node number.)
+
+ 5) 'throttle' 4's node placement
+
+Since per 2b our 's_i,k' and 'p_i' require at least two scans to 'stabilize'
+and show representative numbers, we should limit node-migration to not be
+faster than this.
+
+ n) poke holes in previous that require more stuff and describe it.
--
1.7.11.7
From: Peter Zijlstra <[email protected]>
It's a bit awkward but it was the least painful means of modifying the
queue selection. Used in a later patch to conditionally use a random
queue.
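The pattern used here, a selection function stored as a pointer in the
balancing environment, can be sketched in plain C as below. This is a
simplified illustration, not the scheduler code: struct toy_env, struct
toy_rq and both selectors are hypothetical, and the alternative selector is
just a stand-in for whatever a later patch plugs in.

```c
#include <stddef.h>

/* Hypothetical run queue: only a cpu id and a load figure. */
struct toy_rq {
	int cpu;
	unsigned long load;
};

/* Hypothetical balancing environment carrying the selection method. */
struct toy_env {
	struct toy_rq *rqs;
	size_t nr_rqs;
	struct toy_rq *(*find_busiest_queue)(struct toy_env *env);
};

/* Default method: the queue with the highest non-zero load, or NULL. */
struct toy_rq *toy_find_busiest(struct toy_env *env)
{
	struct toy_rq *busiest = NULL;
	size_t i;

	for (i = 0; i < env->nr_rqs; i++) {
		struct toy_rq *rq = &env->rqs[i];

		if (rq->load && (!busiest || rq->load > busiest->load))
			busiest = rq;
	}
	return busiest;
}

/* Alternative method (made up for the example): first non-empty queue. */
struct toy_rq *toy_find_first_loaded(struct toy_env *env)
{
	size_t i;

	for (i = 0; i < env->nr_rqs; i++) {
		if (env->rqs[i].load)
			return &env->rqs[i];
	}
	return NULL;
}

/* The caller only ever goes through the pointer; returns -1 if balanced. */
int toy_pick_cpu(struct toy_env *env)
{
	struct toy_rq *rq = env->find_busiest_queue(env);

	return rq ? rq->cpu : -1;
}
```

Swapping the selection policy is then a matter of assigning a different
function to env.find_busiest_queue before the balancing pass.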
Signed-off-by: Peter Zijlstra <[email protected]>
Cc: Paul Turner <[email protected]>
Cc: Lee Schermerhorn <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Linus Torvalds <[email protected]>
Link: http://lkml.kernel.org/n/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
kernel/sched/fair.c | 20 ++++++++++++--------
1 file changed, 12 insertions(+), 8 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6b800a1..6ab627e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3063,6 +3063,9 @@ struct lb_env {
unsigned int loop;
unsigned int loop_break;
unsigned int loop_max;
+
+ struct rq * (*find_busiest_queue)(struct lb_env *,
+ struct sched_group *);
};
/*
@@ -4236,13 +4239,14 @@ static int load_balance(int this_cpu, struct rq *this_rq,
struct cpumask *cpus = __get_cpu_var(load_balance_tmpmask);
struct lb_env env = {
- .sd = sd,
- .dst_cpu = this_cpu,
- .dst_rq = this_rq,
- .dst_grpmask = sched_group_cpus(sd->groups),
- .idle = idle,
- .loop_break = sched_nr_migrate_break,
- .cpus = cpus,
+ .sd = sd,
+ .dst_cpu = this_cpu,
+ .dst_rq = this_rq,
+ .dst_grpmask = sched_group_cpus(sd->groups),
+ .idle = idle,
+ .loop_break = sched_nr_migrate_break,
+ .cpus = cpus,
+ .find_busiest_queue = find_busiest_queue,
};
cpumask_copy(cpus, cpu_active_mask);
@@ -4261,7 +4265,7 @@ redo:
goto out_balanced;
}
- busiest = find_busiest_queue(&env, group);
+ busiest = env.find_busiest_queue(&env, group);
if (!busiest) {
schedstat_inc(sd, lb_nobusyq[idle]);
goto out_balanced;
--
1.7.11.7
From: Peter Zijlstra <[email protected]>
Make MPOL_LOCAL a real and exposed policy such that applications that
relied on the previous default behaviour can explicitly request it.
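The mode handling this patch adds to mpol_new() can be modelled roughly as
follows. This is a user-space sketch under simplifying assumptions: the
TOY_* enum mirrors the uapi enum after this patch, the nodemask is a plain
unsigned long, and the MPOL_F_* flag checks are left out entirely.

```c
#include <errno.h>
#include <stdbool.h>

/* Mirrors the policy enum after this patch (names are local to the sketch). */
enum toy_mode {
	TOY_MPOL_DEFAULT,
	TOY_MPOL_PREFERRED,
	TOY_MPOL_BIND,
	TOY_MPOL_INTERLEAVE,
	TOY_MPOL_LOCAL,
	TOY_MPOL_MAX,
};

/*
 * Returns the mode that would be stored internally, or a negative errno.
 * @nodemask stands in for nodemask_t; mode flags are ignored here.
 */
int toy_mpol_resolve(enum toy_mode mode, unsigned long nodemask)
{
	bool empty = (nodemask == 0);

	switch (mode) {
	case TOY_MPOL_DEFAULT:
		return empty ? TOY_MPOL_DEFAULT : -EINVAL;
	case TOY_MPOL_LOCAL:
		/* Local allocation takes no node list; stored as "preferred". */
		return empty ? TOY_MPOL_PREFERRED : -EINVAL;
	case TOY_MPOL_PREFERRED:
		/* An empty mask is accepted here and means local allocation. */
		return TOY_MPOL_PREFERRED;
	case TOY_MPOL_BIND:
	case TOY_MPOL_INTERLEAVE:
		return empty ? -EINVAL : (int)mode;
	default:
		return -EINVAL;
	}
}
```

The point of the sketch is that MPOL_LOCAL is accepted as its own mode at
the interface but collapses to the preferred policy with an empty node list
internally.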
Requested-by: Christoph Lameter <[email protected]>
Reviewed-by: Rik van Riel <[email protected]>
Cc: Lee Schermerhorn <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Linus Torvalds <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
---
include/uapi/linux/mempolicy.h | 1 +
mm/mempolicy.c | 9 ++++++---
2 files changed, 7 insertions(+), 3 deletions(-)
diff --git a/include/uapi/linux/mempolicy.h b/include/uapi/linux/mempolicy.h
index 23e62e0..3e835c9 100644
--- a/include/uapi/linux/mempolicy.h
+++ b/include/uapi/linux/mempolicy.h
@@ -20,6 +20,7 @@ enum {
MPOL_PREFERRED,
MPOL_BIND,
MPOL_INTERLEAVE,
+ MPOL_LOCAL,
MPOL_MAX, /* always last member of enum */
};
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index d04a8a5..72f50ba 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -269,6 +269,10 @@ static struct mempolicy *mpol_new(unsigned short mode, unsigned short flags,
(flags & MPOL_F_RELATIVE_NODES)))
return ERR_PTR(-EINVAL);
}
+ } else if (mode == MPOL_LOCAL) {
+ if (!nodes_empty(*nodes))
+ return ERR_PTR(-EINVAL);
+ mode = MPOL_PREFERRED;
} else if (nodes_empty(*nodes))
return ERR_PTR(-EINVAL);
policy = kmem_cache_alloc(policy_cache, GFP_KERNEL);
@@ -2397,7 +2401,6 @@ void numa_default_policy(void)
* "local" is pseudo-policy: MPOL_PREFERRED with MPOL_F_LOCAL flag
* Used only for mpol_parse_str() and mpol_to_str()
*/
-#define MPOL_LOCAL MPOL_MAX
static const char * const policy_modes[] =
{
[MPOL_DEFAULT] = "default",
@@ -2450,12 +2453,12 @@ int mpol_parse_str(char *str, struct mempolicy **mpol, int no_context)
if (flags)
*flags++ = '\0'; /* terminate mode string */
- for (mode = 0; mode <= MPOL_LOCAL; mode++) {
+ for (mode = 0; mode < MPOL_MAX; mode++) {
if (!strcmp(str, policy_modes[mode])) {
break;
}
}
- if (mode > MPOL_LOCAL)
+ if (mode >= MPOL_MAX)
goto out;
switch (mode) {
--
1.7.11.7
From: Lee Schermerhorn <[email protected]>
This patch augments the MPOL_MF_LAZY feature by adding a "NOOP" policy
to mbind(). When the NOOP policy is used with the 'MOVE' and 'LAZY'
flags, mbind() will map the pages PROT_NONE so that they will be
migrated on the next touch.
This allows an application to prepare for a new phase of operation
where different regions of shared storage will be assigned to
worker threads, w/o changing policy. Note that we could just use
"default" policy in this case. However, this also allows an
application to request that pages be migrated, only if necessary,
to follow any arbitrary policy that might currently apply to a
range of pages, without knowing the policy, or without specifying
multiple mbind()s for ranges with different policies.
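On the textual side, the patch gives NOOP a name in the mode table but keeps
the string parser from accepting it. A stand-alone sketch of that lookup is
shown below; the MODE_* enum and toy_parse_mode() are illustrative names,
and the real mpol_parse_str() does considerably more than this.

```c
#include <string.h>

/* Local stand-ins for the policy modes after this patch. */
enum {
	MODE_DEFAULT,
	MODE_PREFERRED,
	MODE_BIND,
	MODE_INTERLEAVE,
	MODE_LOCAL,
	MODE_NOOP,
	MODE_MAX,
};

static const char * const toy_policy_modes[] = {
	[MODE_DEFAULT]    = "default",
	[MODE_PREFERRED]  = "prefer",
	[MODE_BIND]       = "bind",
	[MODE_INTERLEAVE] = "interleave",
	[MODE_LOCAL]      = "local",
	[MODE_NOOP]       = "noop",	/* named, but rejected by the parser */
};

/* Returns the mode index, or -1 if unknown or not allowed in textual form. */
int toy_parse_mode(const char *str)
{
	int mode;

	for (mode = 0; mode < MODE_MAX; mode++) {
		if (!strcmp(str, toy_policy_modes[mode]))
			break;
	}
	if (mode >= MODE_MAX || mode == MODE_NOOP)
		return -1;

	return mode;
}
```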
[ Bug in early version of mpol_parse_str() reported by Fengguang Wu. ]
Reported-by: Fengguang Wu <[email protected]>
Signed-off-by: Lee Schermerhorn <[email protected]>
Reviewed-by: Rik van Riel <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Linus Torvalds <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
---
include/uapi/linux/mempolicy.h | 1 +
mm/mempolicy.c | 11 ++++++-----
2 files changed, 7 insertions(+), 5 deletions(-)
diff --git a/include/uapi/linux/mempolicy.h b/include/uapi/linux/mempolicy.h
index 3e835c9..d23dca8 100644
--- a/include/uapi/linux/mempolicy.h
+++ b/include/uapi/linux/mempolicy.h
@@ -21,6 +21,7 @@ enum {
MPOL_BIND,
MPOL_INTERLEAVE,
MPOL_LOCAL,
+ MPOL_NOOP, /* retain existing policy for range */
MPOL_MAX, /* always last member of enum */
};
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 72f50ba..c7c7c86 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -251,10 +251,10 @@ static struct mempolicy *mpol_new(unsigned short mode, unsigned short flags,
pr_debug("setting mode %d flags %d nodes[0] %lx\n",
mode, flags, nodes ? nodes_addr(*nodes)[0] : -1);
- if (mode == MPOL_DEFAULT) {
+ if (mode == MPOL_DEFAULT || mode == MPOL_NOOP) {
if (nodes && !nodes_empty(*nodes))
return ERR_PTR(-EINVAL);
- return NULL; /* simply delete any existing policy */
+ return NULL;
}
VM_BUG_ON(!nodes);
@@ -1146,7 +1146,7 @@ static long do_mbind(unsigned long start, unsigned long len,
if (start & ~PAGE_MASK)
return -EINVAL;
- if (mode == MPOL_DEFAULT)
+ if (mode == MPOL_DEFAULT || mode == MPOL_NOOP)
flags &= ~MPOL_MF_STRICT;
len = (len + PAGE_SIZE - 1) & PAGE_MASK;
@@ -2407,7 +2407,8 @@ static const char * const policy_modes[] =
[MPOL_PREFERRED] = "prefer",
[MPOL_BIND] = "bind",
[MPOL_INTERLEAVE] = "interleave",
- [MPOL_LOCAL] = "local"
+ [MPOL_LOCAL] = "local",
+ [MPOL_NOOP] = "noop", /* should not actually be used */
};
@@ -2458,7 +2459,7 @@ int mpol_parse_str(char *str, struct mempolicy **mpol, int no_context)
break;
}
}
- if (mode >= MPOL_MAX)
+ if (mode >= MPOL_MAX || mode == MPOL_NOOP)
goto out;
switch (mode) {
--
1.7.11.7
From: Peter Zijlstra <[email protected]>
In order to facilitate a lazy -- fault driven -- migration of pages,
create a special transient PROT_NONE variant. We can then use the
'spurious' protection faults to drive our migrations.
Pages that already had an effective PROT_NONE mapping will not
be detected to generate these 'spurious' faults for the simple reason
that we cannot distinguish them on their protection bits, see
pte_numa().
This isn't a problem since PROT_NONE (and possibly PROT_WRITE with
dirty tracking) mappings aren't used or are rare enough for us to not
care about their placement.
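The detection rule can be illustrated with a toy PTE encoding. This is a
sketch only: the three-bit protection field, toy_pte_t and the toy_*
helpers are assumptions for the example, whereas the real pte_numa() works
on architecture-specific pte_t and pgprot_t values.

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy PTE (hypothetical): low three bits are protection, the rest is the pfn. */
#define TOY_PROT_MASK	0x7u
#define TOY_PROT_READ	0x1u
#define TOY_PROT_WRITE	0x2u
#define TOY_PROT_EXEC	0x4u

typedef uint64_t toy_pte_t;

struct toy_vma {
	unsigned int prot;	/* the VMA's normal page protections */
};

/* Replace the protection bits of @pte, keeping the pfn. */
toy_pte_t toy_pte_modify(toy_pte_t pte, unsigned int prot)
{
	return (pte & ~(toy_pte_t)TOY_PROT_MASK) | (prot & TOY_PROT_MASK);
}

/* PROT_NONE for this VMA: its protections minus READ|WRITE|EXEC. */
unsigned int toy_vma_prot_none(const struct toy_vma *vma)
{
	return vma->prot & ~(TOY_PROT_READ | TOY_PROT_WRITE | TOY_PROT_EXEC);
}

bool toy_pte_numa(const struct toy_vma *vma, toy_pte_t pte)
{
	/* The PTE already carries the VMA's normal protections: not special. */
	if (pte == toy_pte_modify(pte, vma->prot))
		return false;

	/* Special only if it matches the PROT_NONE variant exactly. */
	return pte == toy_pte_modify(pte, toy_vma_prot_none(vma));
}
```

In a VMA that is genuinely PROT_NONE the first test already matches, so such
a mapping can never be told apart from the transient variant, which is the
limitation described above.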
Suggested-by: Rik van Riel <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Reviewed-by: Rik van Riel <[email protected]>
Cc: Paul Turner <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Link: http://lkml.kernel.org/n/[email protected]
[ fixed various cross-arch and THP/!THP details ]
Signed-off-by: Ingo Molnar <[email protected]>
---
include/linux/huge_mm.h | 19 +++++++++++++
include/linux/mm.h | 18 ++++++++++++
mm/huge_memory.c | 32 +++++++++++++++++++++
mm/memory.c | 75 ++++++++++++++++++++++++++++++++++++++++++++-----
mm/mprotect.c | 24 +++++++++++-----
5 files changed, 154 insertions(+), 14 deletions(-)
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index b31cb7d..4f0f948 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -159,6 +159,13 @@ static inline struct page *compound_trans_head(struct page *page)
}
return page;
}
+
+extern bool pmd_numa(struct vm_area_struct *vma, pmd_t pmd);
+
+extern void do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
+ unsigned long address, pmd_t *pmd,
+ unsigned int flags, pmd_t orig_pmd);
+
#else /* CONFIG_TRANSPARENT_HUGEPAGE */
#define HPAGE_PMD_SHIFT ({ BUILD_BUG(); 0; })
#define HPAGE_PMD_MASK ({ BUILD_BUG(); 0; })
@@ -195,6 +202,18 @@ static inline int pmd_trans_huge_lock(pmd_t *pmd,
{
return 0;
}
+
+static inline bool pmd_numa(struct vm_area_struct *vma, pmd_t pmd)
+{
+ return false;
+}
+
+static inline void do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
+ unsigned long address, pmd_t *pmd,
+ unsigned int flags, pmd_t orig_pmd)
+{
+}
+
#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
#endif /* _LINUX_HUGE_MM_H */
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 2a32cf8..0025bf9 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1091,6 +1091,9 @@ extern unsigned long move_page_tables(struct vm_area_struct *vma,
extern unsigned long do_mremap(unsigned long addr,
unsigned long old_len, unsigned long new_len,
unsigned long flags, unsigned long new_addr);
+extern void change_protection(struct vm_area_struct *vma, unsigned long start,
+ unsigned long end, pgprot_t newprot,
+ int dirty_accountable);
extern int mprotect_fixup(struct vm_area_struct *vma,
struct vm_area_struct **pprev, unsigned long start,
unsigned long end, unsigned long newflags);
@@ -1561,6 +1564,21 @@ static inline pgprot_t vm_get_page_prot(unsigned long vm_flags)
}
#endif
+static inline pgprot_t vma_prot_none(struct vm_area_struct *vma)
+{
+ /*
+ * obtain PROT_NONE by removing READ|WRITE|EXEC privs
+ */
+ vm_flags_t vmflags = vma->vm_flags & ~(VM_READ|VM_WRITE|VM_EXEC);
+ return pgprot_modify(vma->vm_page_prot, vm_get_page_prot(vmflags));
+}
+
+static inline void
+change_prot_none(struct vm_area_struct *vma, unsigned long start, unsigned long end)
+{
+ change_protection(vma, start, end, vma_prot_none(vma), 0);
+}
+
struct vm_area_struct *find_extend_vma(struct mm_struct *, unsigned long addr);
int remap_pfn_range(struct vm_area_struct *, unsigned long addr,
unsigned long pfn, unsigned long size, pgprot_t);
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 176fe3d..6924edf 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -725,6 +725,38 @@ out:
return handle_pte_fault(mm, vma, address, pte, pmd, flags);
}
+bool pmd_numa(struct vm_area_struct *vma, pmd_t pmd)
+{
+ /*
+ * See pte_numa().
+ */
+ if (pmd_same(pmd, pmd_modify(pmd, vma->vm_page_prot)))
+ return false;
+
+ return pmd_same(pmd, pmd_modify(pmd, vma_prot_none(vma)));
+}
+
+void do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
+ unsigned long address, pmd_t *pmd,
+ unsigned int flags, pmd_t entry)
+{
+ unsigned long haddr = address & HPAGE_PMD_MASK;
+
+ spin_lock(&mm->page_table_lock);
+ if (unlikely(!pmd_same(*pmd, entry)))
+ goto out_unlock;
+
+ /* do fancy stuff */
+
+ /* change back to regular protection */
+ entry = pmd_modify(entry, vma->vm_page_prot);
+ if (pmdp_set_access_flags(vma, haddr, pmd, entry, 1))
+ update_mmu_cache_pmd(vma, address, entry);
+
+out_unlock:
+ spin_unlock(&mm->page_table_lock);
+}
+
int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
pmd_t *dst_pmd, pmd_t *src_pmd, unsigned long addr,
struct vm_area_struct *vma)
diff --git a/mm/memory.c b/mm/memory.c
index fb135ba..e3e8ab2 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1464,6 +1464,25 @@ int zap_vma_ptes(struct vm_area_struct *vma, unsigned long address,
}
EXPORT_SYMBOL_GPL(zap_vma_ptes);
+static bool pte_numa(struct vm_area_struct *vma, pte_t pte)
+{
+ /*
+ * If we have the normal vma->vm_page_prot protections we're not a
+ * 'special' PROT_NONE page.
+ *
+ * This means we cannot get 'special' PROT_NONE faults from genuine
+ * PROT_NONE maps, nor from PROT_WRITE file maps that do dirty
+ * tracking.
+ *
+ * Neither case is really interesting for our current use though so we
+ * don't care.
+ */
+ if (pte_same(pte, pte_modify(pte, vma->vm_page_prot)))
+ return false;
+
+ return pte_same(pte, pte_modify(pte, vma_prot_none(vma)));
+}
+
/**
* follow_page - look up a page descriptor from a user-virtual address
* @vma: vm_area_struct mapping @address
@@ -3433,6 +3452,41 @@ static int do_nonlinear_fault(struct mm_struct *mm, struct vm_area_struct *vma,
return __do_fault(mm, vma, address, pmd, pgoff, flags, orig_pte);
}
+static int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
+ unsigned long address, pte_t *ptep, pmd_t *pmd,
+ unsigned int flags, pte_t entry)
+{
+ spinlock_t *ptl;
+ int ret = 0;
+
+ if (!pte_unmap_same(mm, pmd, ptep, entry))
+ goto out;
+
+ /*
+ * Do fancy stuff...
+ */
+
+ /*
+ * OK, nothing to do,.. change the protection back to what it
+ * ought to be.
+ */
+ ptep = pte_offset_map_lock(mm, pmd, address, &ptl);
+ if (unlikely(!pte_same(*ptep, entry)))
+ goto unlock;
+
+ flush_cache_page(vma, address, pte_pfn(entry));
+
+ ptep_modify_prot_start(mm, address, ptep);
+ entry = pte_modify(entry, vma->vm_page_prot);
+ ptep_modify_prot_commit(mm, address, ptep, entry);
+
+ update_mmu_cache(vma, address, ptep);
+unlock:
+ pte_unmap_unlock(ptep, ptl);
+out:
+ return ret;
+}
+
/*
* These routines also need to handle stuff like marking pages dirty
* and/or accessed for architectures that don't do it in hardware (most
@@ -3471,6 +3525,9 @@ int handle_pte_fault(struct mm_struct *mm,
pte, pmd, flags, entry);
}
+ if (pte_numa(vma, entry))
+ return do_numa_page(mm, vma, address, pte, pmd, flags, entry);
+
ptl = pte_lockptr(mm, pmd);
spin_lock(ptl);
if (unlikely(!pte_same(*pte, entry)))
@@ -3535,13 +3592,16 @@ retry:
pmd, flags);
} else {
pmd_t orig_pmd = *pmd;
- int ret;
+ int ret = 0;
barrier();
- if (pmd_trans_huge(orig_pmd)) {
- if (flags & FAULT_FLAG_WRITE &&
- !pmd_write(orig_pmd) &&
- !pmd_trans_splitting(orig_pmd)) {
+ if (pmd_trans_huge(orig_pmd) && !pmd_trans_splitting(orig_pmd)) {
+ if (pmd_numa(vma, orig_pmd)) {
+ do_huge_pmd_numa_page(mm, vma, address, pmd,
+ flags, orig_pmd);
+ }
+
+ if ((flags & FAULT_FLAG_WRITE) && !pmd_write(orig_pmd)) {
ret = do_huge_pmd_wp_page(mm, vma, address, pmd,
orig_pmd);
/*
@@ -3551,12 +3611,13 @@ retry:
*/
if (unlikely(ret & VM_FAULT_OOM))
goto retry;
- return ret;
}
- return 0;
+
+ return ret;
}
}
+
/*
* Use __pte_alloc instead of pte_alloc_map, because we can't
* run pte_offset_map on the pmd, if an huge pmd could
diff --git a/mm/mprotect.c b/mm/mprotect.c
index e97b0d6..392b124 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -112,7 +112,7 @@ static inline void change_pud_range(struct vm_area_struct *vma, pgd_t *pgd,
} while (pud++, addr = next, addr != end);
}
-static void change_protection(struct vm_area_struct *vma,
+static void change_protection_range(struct vm_area_struct *vma,
unsigned long addr, unsigned long end, pgprot_t newprot,
int dirty_accountable)
{
@@ -134,6 +134,20 @@ static void change_protection(struct vm_area_struct *vma,
flush_tlb_range(vma, start, end);
}
+void change_protection(struct vm_area_struct *vma, unsigned long start,
+ unsigned long end, pgprot_t newprot,
+ int dirty_accountable)
+{
+ struct mm_struct *mm = vma->vm_mm;
+
+ mmu_notifier_invalidate_range_start(mm, start, end);
+ if (is_vm_hugetlb_page(vma))
+ hugetlb_change_protection(vma, start, end, newprot);
+ else
+ change_protection_range(vma, start, end, newprot, dirty_accountable);
+ mmu_notifier_invalidate_range_end(mm, start, end);
+}
+
int
mprotect_fixup(struct vm_area_struct *vma, struct vm_area_struct **pprev,
unsigned long start, unsigned long end, unsigned long newflags)
@@ -206,12 +220,8 @@ success:
dirty_accountable = 1;
}
- mmu_notifier_invalidate_range_start(mm, start, end);
- if (is_vm_hugetlb_page(vma))
- hugetlb_change_protection(vma, start, end, vma->vm_page_prot);
- else
- change_protection(vma, start, end, vma->vm_page_prot, dirty_accountable);
- mmu_notifier_invalidate_range_end(mm, start, end);
+ change_protection(vma, start, end, vma->vm_page_prot, dirty_accountable);
+
vm_stat_account(mm, oldflags, vma->vm_file, -nrpages);
vm_stat_account(mm, newflags, vma->vm_file, nrpages);
perf_event_mmap(vma);
--
1.7.11.7
From: Rik van Riel <[email protected]>
Intel has an architectural guarantee that the TLB entry causing
a page fault gets invalidated automatically. This means
we should be able to drop the local TLB invalidation.
Because of the way other areas of the page fault code work,
chances are good that all x86 CPUs do this. However, if
someone somewhere has an x86 CPU that does not invalidate
the TLB entry causing a page fault, this one-liner should
be easy to revert - or a CPU model specific quirk could
be added to retain this optimization on most CPUs.
Signed-off-by: Rik van Riel <[email protected]>
Acked-by: Linus Torvalds <[email protected]>
Acked-by: Peter Zijlstra <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Michel Lespinasse <[email protected]>
[ Applied changelog massage and moved this last in the series,
to create bisection distance. ]
Signed-off-by: Ingo Molnar <[email protected]>
---
arch/x86/mm/pgtable.c | 1 -
1 file changed, 1 deletion(-)
diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
index be3bb46..7353de3 100644
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -317,7 +317,6 @@ int ptep_set_access_flags(struct vm_area_struct *vma,
if (changed && dirty) {
*ptep = entry;
pte_update_defer(vma->vm_mm, address, ptep);
- __flush_tlb_one(address);
}
return changed;
--
1.7.11.7
From: Peter Zijlstra <[email protected]>
Combine our previous PROT_NONE, mpol_misplaced and
migrate_misplaced_page() pieces into an effective migrate on fault
scheme.
Note that (on x86) we rely on PROT_NONE pages being !present and avoid
the TLB flush from try_to_unmap(TTU_MIGRATION). This greatly improves
the page-migration performance.
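The control flow this patch gives the NUMA hinting fault can be sketched as a small user-space model. This is only an illustration under stated assumptions: numa_fault_model() and its enum are hypothetical names, target_node stands for the mpol_misplaced() result and migrate_rc for the migrate_misplaced_page() return value.

```c
/*
 * Hypothetical user-space model of the decision made in do_numa_page();
 * the names below are illustrative and do not exist in the kernel.
 */
enum numa_fault_result {
	NUMA_FAULT_MIGRATED,		/* page moved, fault will be retaken */
	NUMA_FAULT_PROT_RESTORED,	/* pte upgraded back to vma->vm_page_prot */
};

/*
 * target_node: -1 when the page is already well placed, else the node
 *              the policy wants (the mpol_misplaced() convention).
 * migrate_rc:  0 when migration succeeded, nonzero otherwise.
 */
static enum numa_fault_result numa_fault_model(int target_node, int migrate_rc)
{
	if (target_node == -1)
		return NUMA_FAULT_PROT_RESTORED;	/* nothing to migrate */

	if (migrate_rc == 0)
		return NUMA_FAULT_MIGRATED;

	/* Migration failed: fall back to restoring normal protections. */
	return NUMA_FAULT_PROT_RESTORED;
}
```

The model leaves out locking and the pte_same() recheck that the real code performs after retaking the page table lock.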
Suggested-by: Rik van Riel <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Reviewed-by: Rik van Riel <[email protected]>
Cc: Paul Turner <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Link: http://lkml.kernel.org/n/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
mm/huge_memory.c | 41 +++++++++++++++++++++++++++++++++++-
mm/memory.c | 63 ++++++++++++++++++++++++++++++++++++++++----------------
2 files changed, 85 insertions(+), 19 deletions(-)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 6924edf..c4c0a57 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -18,6 +18,7 @@
#include <linux/freezer.h>
#include <linux/mman.h>
#include <linux/pagemap.h>
+#include <linux/migrate.h>
#include <asm/tlb.h>
#include <asm/pgalloc.h>
#include "internal.h"
@@ -741,12 +742,48 @@ void do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
unsigned int flags, pmd_t entry)
{
unsigned long haddr = address & HPAGE_PMD_MASK;
+ struct page *page = NULL;
+ int node;
spin_lock(&mm->page_table_lock);
if (unlikely(!pmd_same(*pmd, entry)))
goto out_unlock;
- /* do fancy stuff */
+ if (unlikely(pmd_trans_splitting(entry))) {
+ spin_unlock(&mm->page_table_lock);
+ wait_split_huge_page(vma->anon_vma, pmd);
+ return;
+ }
+
+#ifdef CONFIG_NUMA
+ page = pmd_page(entry);
+ VM_BUG_ON(!PageCompound(page) || !PageHead(page));
+
+ get_page(page);
+ spin_unlock(&mm->page_table_lock);
+
+ /*
+ * XXX should we serialize against split_huge_page ?
+ */
+
+ node = mpol_misplaced(page, vma, haddr);
+ if (node == -1)
+ goto do_fixup;
+
+ /*
+ * Due to lacking code to migrate thp pages, we'll split
+ * (which preserves the special PROT_NONE) and re-take the
+ * fault on the normal pages.
+ */
+ split_huge_page(page);
+ put_page(page);
+ return;
+
+do_fixup:
+ spin_lock(&mm->page_table_lock);
+ if (unlikely(!pmd_same(*pmd, entry)))
+ goto out_unlock;
+#endif
/* change back to regular protection */
entry = pmd_modify(entry, vma->vm_page_prot);
@@ -755,6 +792,8 @@ void do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
out_unlock:
spin_unlock(&mm->page_table_lock);
+ if (page)
+ put_page(page);
}
int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
diff --git a/mm/memory.c b/mm/memory.c
index a660fd0..0d26a28 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -57,6 +57,7 @@
#include <linux/swapops.h>
#include <linux/elf.h>
#include <linux/gfp.h>
+#include <linux/migrate.h>
#include <asm/io.h>
#include <asm/pgalloc.h>
@@ -1467,8 +1468,10 @@ EXPORT_SYMBOL_GPL(zap_vma_ptes);
static bool pte_numa(struct vm_area_struct *vma, pte_t pte)
{
/*
- * If we have the normal vma->vm_page_prot protections we're not a
- * 'special' PROT_NONE page.
+ * For NUMA page faults, we use PROT_NONE ptes in VMAs with
+ * "normal" vma->vm_page_prot protections. Genuine PROT_NONE
+ * VMAs should never get here, because the fault handling code
+ * will notice that the VMA has no read or write permissions.
*
* This means we cannot get 'special' PROT_NONE faults from genuine
* PROT_NONE maps, nor from PROT_WRITE file maps that do dirty
@@ -3473,35 +3476,59 @@ static int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
unsigned long address, pte_t *ptep, pmd_t *pmd,
unsigned int flags, pte_t entry)
{
+ struct page *page = NULL;
+ int node, page_nid = -1;
spinlock_t *ptl;
- int ret = 0;
-
- if (!pte_unmap_same(mm, pmd, ptep, entry))
- goto out;
- /*
- * Do fancy stuff...
- */
-
- /*
- * OK, nothing to do,.. change the protection back to what it
- * ought to be.
- */
- ptep = pte_offset_map_lock(mm, pmd, address, &ptl);
+ ptl = pte_lockptr(mm, pmd);
+ spin_lock(ptl);
if (unlikely(!pte_same(*ptep, entry)))
- goto unlock;
+ goto out_unlock;
+ page = vm_normal_page(vma, address, entry);
+ if (page) {
+ get_page(page);
+ page_nid = page_to_nid(page);
+ node = mpol_misplaced(page, vma, address);
+ if (node != -1)
+ goto migrate;
+ }
+
+out_pte_upgrade_unlock:
flush_cache_page(vma, address, pte_pfn(entry));
ptep_modify_prot_start(mm, address, ptep);
entry = pte_modify(entry, vma->vm_page_prot);
ptep_modify_prot_commit(mm, address, ptep, entry);
+ /* No TLB flush needed because we upgraded the PTE */
+
update_mmu_cache(vma, address, ptep);
-unlock:
+
+out_unlock:
pte_unmap_unlock(ptep, ptl);
out:
- return ret;
+ if (page)
+ put_page(page);
+
+ return 0;
+
+migrate:
+ pte_unmap_unlock(ptep, ptl);
+
+ if (!migrate_misplaced_page(page, node)) {
+ page_nid = node;
+ goto out;
+ }
+
+ ptep = pte_offset_map_lock(mm, pmd, address, &ptl);
+ if (!pte_same(*ptep, entry)) {
+ put_page(page);
+ page = NULL;
+ goto out_unlock;
+ }
+
+ goto out_pte_upgrade_unlock;
}
/*
--
1.7.11.7
From: Lee Schermerhorn <[email protected]>
This patch adds another mbind() flag to request "lazy migration". The
flag, MPOL_MF_LAZY, modifies MPOL_MF_MOVE* such that the selected
pages are marked PROT_NONE. The pages will be migrated in the fault
path on "first touch", if the policy dictates at that time.
"Lazy Migration" will allow testing of migrate-on-fault via mbind().
It also allows applications to specify that only subsequently touched
pages be migrated to obey the new policy, instead of all pages in range.
This can be useful for multi-threaded applications working on a
large shared data area that is initialized by an initial thread
resulting in all pages on one [or a few, if overflowed] nodes.
Once those pages are marked PROT_NONE, the pages in regions assigned to
the worker threads will be automatically migrated local to the threads on
first touch.
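A minimal sketch of the flag handling described above, using the flag values this patch defines in include/uapi/linux/mempolicy.h. The two helper functions are hypothetical and exist only to illustrate the validity check in do_mbind() and the lazy path in check_range().

```c
#include <stdbool.h>

/* Flag values as defined by this patch in include/uapi/linux/mempolicy.h. */
#define MPOL_MF_STRICT   (1 << 0)
#define MPOL_MF_MOVE     (1 << 1)
#define MPOL_MF_MOVE_ALL (1 << 2)
#define MPOL_MF_LAZY     (1 << 3)
#define MPOL_MF_VALID    (MPOL_MF_STRICT | MPOL_MF_MOVE | \
			  MPOL_MF_MOVE_ALL | MPOL_MF_LAZY)

/* Hypothetical helper: mirrors the -EINVAL check at the top of do_mbind(). */
static bool mbind_flags_valid(unsigned long flags)
{
	return !(flags & ~(unsigned long)MPOL_MF_VALID);
}

/*
 * Hypothetical helper: with MPOL_MF_LAZY set, the range is marked
 * PROT_NONE and migration is deferred to the first touch.
 */
static bool mbind_defers_migration(unsigned long flags)
{
	return mbind_flags_valid(flags) && (flags & MPOL_MF_LAZY) != 0;
}
```

An application would pass MPOL_MF_MOVE | MPOL_MF_LAZY to mbind(); any bit outside MPOL_MF_VALID is rejected.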
Signed-off-by: Lee Schermerhorn <[email protected]>
Reviewed-by: Rik van Riel <[email protected]>
Cc: Lee Schermerhorn <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Linus Torvalds <[email protected]>
[ nearly complete rewrite.. ]
Signed-off-by: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/n/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
include/uapi/linux/mempolicy.h | 13 ++++++++---
mm/mempolicy.c | 49 +++++++++++++++++++++++++++---------------
2 files changed, 42 insertions(+), 20 deletions(-)
diff --git a/include/uapi/linux/mempolicy.h b/include/uapi/linux/mempolicy.h
index 472de8a..6a1baae 100644
--- a/include/uapi/linux/mempolicy.h
+++ b/include/uapi/linux/mempolicy.h
@@ -49,9 +49,16 @@ enum mpol_rebind_step {
/* Flags for mbind */
#define MPOL_MF_STRICT (1<<0) /* Verify existing pages in the mapping */
-#define MPOL_MF_MOVE (1<<1) /* Move pages owned by this process to conform to mapping */
-#define MPOL_MF_MOVE_ALL (1<<2) /* Move every page to conform to mapping */
-#define MPOL_MF_INTERNAL (1<<3) /* Internal flags start here */
+#define MPOL_MF_MOVE (1<<1) /* Move pages owned by this process to conform
+ to policy */
+#define MPOL_MF_MOVE_ALL (1<<2) /* Move every page to conform to policy */
+#define MPOL_MF_LAZY (1<<3) /* Modifies '_MOVE: lazy migrate on fault */
+#define MPOL_MF_INTERNAL (1<<4) /* Internal flags start here */
+
+#define MPOL_MF_VALID (MPOL_MF_STRICT | \
+ MPOL_MF_MOVE | \
+ MPOL_MF_MOVE_ALL | \
+ MPOL_MF_LAZY)
/*
* Internal flags that share the struct mempolicy flags word with
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 1b2890c..5ee326c 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -583,22 +583,32 @@ check_range(struct mm_struct *mm, unsigned long start, unsigned long end,
return ERR_PTR(-EFAULT);
prev = NULL;
for (vma = first; vma && vma->vm_start < end; vma = vma->vm_next) {
+ unsigned long endvma = vma->vm_end;
+
+ if (endvma > end)
+ endvma = end;
+ if (vma->vm_start > start)
+ start = vma->vm_start;
+
if (!(flags & MPOL_MF_DISCONTIG_OK)) {
if (!vma->vm_next && vma->vm_end < end)
return ERR_PTR(-EFAULT);
if (prev && prev->vm_end < vma->vm_start)
return ERR_PTR(-EFAULT);
}
- if (!is_vm_hugetlb_page(vma) &&
- ((flags & MPOL_MF_STRICT) ||
+
+ if (is_vm_hugetlb_page(vma))
+ goto next;
+
+ if (flags & MPOL_MF_LAZY) {
+ change_prot_none(vma, start, endvma);
+ goto next;
+ }
+
+ if ((flags & MPOL_MF_STRICT) ||
((flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL)) &&
- vma_migratable(vma)))) {
- unsigned long endvma = vma->vm_end;
+ vma_migratable(vma))) {
- if (endvma > end)
- endvma = end;
- if (vma->vm_start > start)
- start = vma->vm_start;
err = check_pgd_range(vma, start, endvma, nodes,
flags, private);
if (err) {
@@ -606,6 +616,7 @@ check_range(struct mm_struct *mm, unsigned long start, unsigned long end,
break;
}
}
+next:
prev = vma;
}
return first;
@@ -1137,8 +1148,7 @@ static long do_mbind(unsigned long start, unsigned long len,
int err;
LIST_HEAD(pagelist);
- if (flags & ~(unsigned long)(MPOL_MF_STRICT |
- MPOL_MF_MOVE | MPOL_MF_MOVE_ALL))
+ if (flags & ~(unsigned long)MPOL_MF_VALID)
return -EINVAL;
if ((flags & MPOL_MF_MOVE_ALL) && !capable(CAP_SYS_NICE))
return -EPERM;
@@ -1161,6 +1171,9 @@ static long do_mbind(unsigned long start, unsigned long len,
if (IS_ERR(new))
return PTR_ERR(new);
+ if (flags & MPOL_MF_LAZY)
+ new->flags |= MPOL_F_MOF;
+
/*
* If we are using the default policy then operation
* on discontinuous address spaces is okay after all
@@ -1197,21 +1210,23 @@ static long do_mbind(unsigned long start, unsigned long len,
vma = check_range(mm, start, end, nmask,
flags | MPOL_MF_INVERT, &pagelist);
- err = PTR_ERR(vma);
- if (!IS_ERR(vma)) {
- int nr_failed = 0;
-
+ err = PTR_ERR(vma); /* maybe ... */
+ if (!IS_ERR(vma) && mode != MPOL_NOOP)
err = mbind_range(mm, start, end, new);
+ if (!err) {
+ int nr_failed = 0;
+
if (!list_empty(&pagelist)) {
+ WARN_ON_ONCE(flags & MPOL_MF_LAZY);
nr_failed = migrate_pages(&pagelist, new_vma_page,
- (unsigned long)vma,
- false, MIGRATE_SYNC);
+ (unsigned long)vma,
+ false, MIGRATE_SYNC);
if (nr_failed)
putback_lru_pages(&pagelist);
}
- if (!err && nr_failed && (flags & MPOL_MF_STRICT))
+ if (nr_failed && (flags & MPOL_MF_STRICT))
err = -EIO;
} else
putback_lru_pages(&pagelist);
--
1.7.11.7
From: Peter Zijlstra <[email protected]>
Add migrate_misplaced_page() which deals with migrating pages from
faults.
This includes adding a new MIGRATE_FAULT migration mode to
deal with the extra page reference required due to having to look up
the page.
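The reference-count accounting that MIGRATE_FAULT changes in migrate_page_move_mapping() can be modelled in user space as follows. The function name expected_page_count() and its boolean parameters are assumptions made for illustration; only the arithmetic is taken from the diff.

```c
#include <stdbool.h>

enum migrate_mode {
	MIGRATE_ASYNC,
	MIGRATE_SYNC_LIGHT,
	MIGRATE_SYNC,
	MIGRATE_FAULT,
};

/*
 * Hypothetical model of the page_count() value the migration code
 * expects before it will move the page.
 */
static int expected_page_count(enum migrate_mode mode, bool has_mapping,
			       bool has_private)
{
	int expected = 0;

	/* The fault path holds one extra reference from its page lookup. */
	if (mode == MIGRATE_FAULT)
		expected++;

	/* Anonymous page without mapping: one reference. */
	if (!has_mapping)
		return expected + 1;

	/* Page cache page: two references, plus one for private data. */
	return expected + 2 + (has_private ? 1 : 0);
}
```

Any other count means someone else holds a reference, and the move is abandoned with -EAGAIN.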
Based-on-work-by: Lee Schermerhorn <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Reviewed-by: Rik van Riel <[email protected]>
Cc: Paul Turner <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Andrew Morton <[email protected]>
Link: http://lkml.kernel.org/n/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
include/linux/migrate.h | 7 ++++
include/linux/migrate_mode.h | 3 ++
mm/migrate.c | 85 +++++++++++++++++++++++++++++++++++++++-----
3 files changed, 87 insertions(+), 8 deletions(-)
diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index ce7e667..9a5afea 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -30,6 +30,7 @@ extern int migrate_vmas(struct mm_struct *mm,
extern void migrate_page_copy(struct page *newpage, struct page *page);
extern int migrate_huge_page_move_mapping(struct address_space *mapping,
struct page *newpage, struct page *page);
+extern int migrate_misplaced_page(struct page *page, int node);
#else
static inline void putback_lru_pages(struct list_head *l) {}
@@ -63,5 +64,11 @@ static inline int migrate_huge_page_move_mapping(struct address_space *mapping,
#define migrate_page NULL
#define fail_migrate_page NULL
+static inline
+int migrate_misplaced_page(struct page *page, int node)
+{
+ return -EAGAIN; /* can't migrate now */
+}
#endif /* CONFIG_MIGRATION */
+
#endif /* _LINUX_MIGRATE_H */
diff --git a/include/linux/migrate_mode.h b/include/linux/migrate_mode.h
index ebf3d89..40b37dc 100644
--- a/include/linux/migrate_mode.h
+++ b/include/linux/migrate_mode.h
@@ -6,11 +6,14 @@
* on most operations but not ->writepage as the potential stall time
* is too significant
* MIGRATE_SYNC will block when migrating pages
+ * MIGRATE_FAULT called from the fault path to migrate-on-fault for mempolicy
+ * this path has an extra reference count
*/
enum migrate_mode {
MIGRATE_ASYNC,
MIGRATE_SYNC_LIGHT,
MIGRATE_SYNC,
+ MIGRATE_FAULT,
};
#endif /* MIGRATE_MODE_H_INCLUDED */
diff --git a/mm/migrate.c b/mm/migrate.c
index 77ed2d7..3299949 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -225,7 +225,7 @@ static bool buffer_migrate_lock_buffers(struct buffer_head *head,
struct buffer_head *bh = head;
/* Simple case, sync compaction */
- if (mode != MIGRATE_ASYNC) {
+ if (mode != MIGRATE_ASYNC && mode != MIGRATE_FAULT) {
do {
get_bh(bh);
lock_buffer(bh);
@@ -279,12 +279,22 @@ static int migrate_page_move_mapping(struct address_space *mapping,
struct page *newpage, struct page *page,
struct buffer_head *head, enum migrate_mode mode)
{
- int expected_count;
+ int expected_count = 0;
void **pslot;
+ if (mode == MIGRATE_FAULT) {
+ /*
+ * MIGRATE_FAULT has an extra reference on the page and
+ * otherwise acts like ASYNC, no point in delaying the
+ * fault, we'll try again next time.
+ */
+ expected_count++;
+ }
+
if (!mapping) {
/* Anonymous page without mapping */
- if (page_count(page) != 1)
+ expected_count += 1;
+ if (page_count(page) != expected_count)
return -EAGAIN;
return 0;
}
@@ -294,7 +304,7 @@ static int migrate_page_move_mapping(struct address_space *mapping,
pslot = radix_tree_lookup_slot(&mapping->page_tree,
page_index(page));
- expected_count = 2 + page_has_private(page);
+ expected_count += 2 + page_has_private(page);
if (page_count(page) != expected_count ||
radix_tree_deref_slot_protected(pslot, &mapping->tree_lock) != page) {
spin_unlock_irq(&mapping->tree_lock);
@@ -313,7 +323,7 @@ static int migrate_page_move_mapping(struct address_space *mapping,
* the mapping back due to an elevated page count, we would have to
* block waiting on other references to be dropped.
*/
- if (mode == MIGRATE_ASYNC && head &&
+ if ((mode == MIGRATE_ASYNC || mode == MIGRATE_FAULT) && head &&
!buffer_migrate_lock_buffers(head, mode)) {
page_unfreeze_refs(page, expected_count);
spin_unlock_irq(&mapping->tree_lock);
@@ -521,7 +531,7 @@ int buffer_migrate_page(struct address_space *mapping,
* with an IRQ-safe spinlock held. In the sync case, the buffers
* need to be locked now
*/
- if (mode != MIGRATE_ASYNC)
+ if (mode != MIGRATE_ASYNC && mode != MIGRATE_FAULT)
BUG_ON(!buffer_migrate_lock_buffers(head, mode));
ClearPagePrivate(page);
@@ -687,7 +697,7 @@ static int __unmap_and_move(struct page *page, struct page *newpage,
struct anon_vma *anon_vma = NULL;
if (!trylock_page(page)) {
- if (!force || mode == MIGRATE_ASYNC)
+ if (!force || mode == MIGRATE_ASYNC || mode == MIGRATE_FAULT)
goto out;
/*
@@ -1403,4 +1413,63 @@ int migrate_vmas(struct mm_struct *mm, const nodemask_t *to,
}
return err;
}
-#endif
+
+/*
+ * Attempt to migrate a misplaced page to the specified destination
+ * node.
+ */
+int migrate_misplaced_page(struct page *page, int node)
+{
+ struct address_space *mapping = page_mapping(page);
+ int page_lru = page_is_file_cache(page);
+ struct page *newpage;
+ int ret = -EAGAIN;
+ gfp_t gfp = GFP_HIGHUSER_MOVABLE;
+
+ /*
+ * Don't migrate pages that are mapped in multiple processes.
+ */
+ if (page_mapcount(page) != 1)
+ goto out;
+
+ /*
+ * Never wait for allocations just to migrate on fault, but don't dip
+ * into reserves. And, only accept pages from the specified node. No
+ * sense migrating to a different "misplaced" page!
+ */
+ if (mapping)
+ gfp = mapping_gfp_mask(mapping);
+ gfp &= ~__GFP_WAIT;
+ gfp |= __GFP_NOMEMALLOC | GFP_THISNODE;
+
+ newpage = alloc_pages_node(node, gfp, 0);
+ if (!newpage) {
+ ret = -ENOMEM;
+ goto out;
+ }
+
+ if (isolate_lru_page(page)) {
+ ret = -EBUSY;
+ goto put_new;
+ }
+
+ inc_zone_page_state(page, NR_ISOLATED_ANON + page_lru);
+ ret = __unmap_and_move(page, newpage, 0, 0, MIGRATE_FAULT);
+ /*
+ * A page that has been migrated has all references removed and will be
+ * freed. A page that has not been migrated will have kept its
+ * references and be restored.
+ */
+ dec_zone_page_state(page, NR_ISOLATED_ANON + page_lru);
+ putback_lru_page(page);
+put_new:
+ /*
+ * Move the new page to the LRU. If migration was not successful
+ * then this will free the page.
+ */
+ putback_lru_page(newpage);
+out:
+ return ret;
+}
+
+#endif /* CONFIG_NUMA */
--
1.7.11.7
From: Andrea Arcangeli <[email protected]>
Introduce FOLL_NUMA to tell follow_page to check
pte/pmd_numa. get_user_pages must use FOLL_NUMA, and it's safe to do
so because it always invokes handle_mm_fault and retries the
follow_page later.
KVM secondary MMU page faults will trigger the NUMA hinting page
faults through gup_fast -> get_user_pages -> follow_page ->
handle_mm_fault.
Other follow_page callers like KSM should not use FOLL_NUMA, or they
would fail to get the pages if they use follow_page instead of
get_user_pages.
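The rule that get_user_pages sets FOLL_NUMA, together with the FOLL_FORCE exception this patch adds in __get_user_pages(), can be sketched as below. FOLL_NUMA is the value added by this patch; the FOLL_FORCE value is an assumption, since the hunk shown here does not include it, and gup_effective_flags() is a hypothetical helper name.

```c
#define FOLL_FORCE 0x10		/* assumed value, not shown in this patch */
#define FOLL_NUMA  0x200	/* force NUMA hinting page fault */

/*
 * Hypothetical helper: returns the flags follow_page() would see.
 * FOLL_FORCE callers may touch genuine PROT_NONE ranges, so they must
 * not trigger NUMA hinting faults there.
 */
static unsigned int gup_effective_flags(unsigned int gup_flags)
{
	if (!(gup_flags & FOLL_FORCE))
		gup_flags |= FOLL_NUMA;
	return gup_flags;
}
```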
[ This patch was picked up from the AutoNUMA tree. ]
Originally-by: Andrea Arcangeli <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Rik van Riel <[email protected]>
[ ported to this tree. ]
Signed-off-by: Ingo Molnar <[email protected]>
---
include/linux/mm.h | 1 +
mm/memory.c | 17 +++++++++++++++++
2 files changed, 18 insertions(+)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 0025bf9..1821629 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1600,6 +1600,7 @@ struct page *follow_page(struct vm_area_struct *, unsigned long address,
#define FOLL_MLOCK 0x40 /* mark page as mlocked */
#define FOLL_SPLIT 0x80 /* don't return transhuge pages, split them */
#define FOLL_HWPOISON 0x100 /* check page is hwpoisoned */
+#define FOLL_NUMA 0x200 /* force NUMA hinting page fault */
typedef int (*pte_fn_t)(pte_t *pte, pgtable_t token, unsigned long addr,
void *data);
diff --git a/mm/memory.c b/mm/memory.c
index e3e8ab2..a660fd0 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1536,6 +1536,8 @@ struct page *follow_page(struct vm_area_struct *vma, unsigned long address,
page = follow_huge_pmd(mm, address, pmd, flags & FOLL_WRITE);
goto out;
}
+ if ((flags & FOLL_NUMA) && pmd_numa(vma, *pmd))
+ goto no_page_table;
if (pmd_trans_huge(*pmd)) {
if (flags & FOLL_SPLIT) {
split_huge_page_pmd(mm, pmd);
@@ -1565,6 +1567,8 @@ split_fallthrough:
pte = *ptep;
if (!pte_present(pte))
goto no_page;
+ if ((flags & FOLL_NUMA) && pte_numa(vma, pte))
+ goto no_page;
if ((flags & FOLL_WRITE) && !pte_write(pte))
goto unlock;
@@ -1716,6 +1720,19 @@ int __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
(VM_WRITE | VM_MAYWRITE) : (VM_READ | VM_MAYREAD);
vm_flags &= (gup_flags & FOLL_FORCE) ?
(VM_MAYREAD | VM_MAYWRITE) : (VM_READ | VM_WRITE);
+
+ /*
+ * If FOLL_FORCE and FOLL_NUMA are both set, handle_mm_fault
+ * would be called on PROT_NONE ranges. We must never invoke
+ * handle_mm_fault on PROT_NONE ranges or the NUMA hinting
+ * page faults would unprotect the PROT_NONE ranges if
+ * _PAGE_NUMA and _PAGE_PROTNONE are sharing the same pte/pmd
+ * bitflag. So to avoid that, don't set FOLL_NUMA if
+ * FOLL_FORCE is set.
+ */
+ if (!(gup_flags & FOLL_FORCE))
+ gup_flags |= FOLL_NUMA;
+
i = 0;
do {
--
1.7.11.7
From: Lee Schermerhorn <[email protected]>
This patch provides a new function to test whether a page resides
on a node that is appropriate for the mempolicy for the vma and
address where the page is supposed to be mapped. This involves
looking up the node where the page belongs. So, the function
returns that node so that it may be used to allocate the page
without consulting the policy again.
A subsequent patch will call this function from the fault path.
Because of this, I don't want to go ahead and allocate the page, e.g.,
via alloc_page_vma() only to have to free it if it has the correct
policy. So, I just mimic the alloc_page_vma() node computation
logic--sort of.
Note: we could use this function to implement an MPOL_MF_STRICT
behavior when migrating pages to match mbind() mempolicy--e.g.,
to ensure that pages in an interleaved range are reinterleaved
rather than left where they are when they reside on any node in
the interleave nodemask.
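The return convention of mpol_misplaced() can be illustrated with a simplified user-space model. Everything here is a sketch under assumptions: struct policy, model_misplaced() and the bitmask nodemask are hypothetical, interleave is omitted, and "nearest allowed node" is replaced by "lowest allowed node" to keep the example short.

```c
enum policy_mode { POL_PREFERRED, POL_BIND };

/* Hypothetical, simplified stand-in for struct mempolicy. */
struct policy {
	enum policy_mode mode;
	int mof;		/* nonzero when migrate-on-fault is requested */
	int preferred_node;	/* used by POL_PREFERRED */
	unsigned long nodes;	/* allowed-node bitmask, used by POL_BIND */
};

/* Returns -1 when the page is fine where it is, else the target node. */
static int model_misplaced(const struct policy *pol, int curnid)
{
	int polnid = -1;

	if (!pol->mof)
		return -1;

	switch (pol->mode) {
	case POL_PREFERRED:
		polnid = pol->preferred_node;
		break;
	case POL_BIND:
		/* Current node allowed, or no allowed nodes: not misplaced. */
		if (!pol->nodes || (pol->nodes & (1UL << curnid)))
			return -1;
		/* Stand-in for the zonelist walk: pick the lowest allowed node. */
		for (polnid = 0; !(pol->nodes & (1UL << polnid)); polnid++)
			;
		break;
	default:
		return -1;
	}

	return curnid != polnid ? polnid : -1;
}
```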
Signed-off-by: Lee Schermerhorn <[email protected]>
Reviewed-by: Rik van Riel <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Linus Torvalds <[email protected]>
[ Added MPOL_F_MOF to trigger migrate-on-fault;
simplified code now that we don't have to bother
with special crap for interleaved ]
Signed-off-by: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/n/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
include/linux/mempolicy.h | 8 +++++
include/uapi/linux/mempolicy.h | 1 +
mm/mempolicy.c | 76 ++++++++++++++++++++++++++++++++++++++++++
3 files changed, 85 insertions(+)
diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
index e5ccb9d..c511e25 100644
--- a/include/linux/mempolicy.h
+++ b/include/linux/mempolicy.h
@@ -198,6 +198,8 @@ static inline int vma_migratable(struct vm_area_struct *vma)
return 1;
}
+extern int mpol_misplaced(struct page *, struct vm_area_struct *, unsigned long);
+
#else
struct mempolicy {};
@@ -323,5 +325,11 @@ static inline int mpol_to_str(char *buffer, int maxlen, struct mempolicy *pol,
return 0;
}
+static inline int mpol_misplaced(struct page *page, struct vm_area_struct *vma,
+ unsigned long address)
+{
+ return -1; /* no node preference */
+}
+
#endif /* CONFIG_NUMA */
#endif
diff --git a/include/uapi/linux/mempolicy.h b/include/uapi/linux/mempolicy.h
index d23dca8..472de8a 100644
--- a/include/uapi/linux/mempolicy.h
+++ b/include/uapi/linux/mempolicy.h
@@ -61,6 +61,7 @@ enum mpol_rebind_step {
#define MPOL_F_SHARED (1 << 0) /* identify shared policies */
#define MPOL_F_LOCAL (1 << 1) /* preferred local allocation */
#define MPOL_F_REBINDING (1 << 2) /* identify policies in rebinding */
+#define MPOL_F_MOF (1 << 3) /* this policy wants migrate on fault */
#endif /* _UAPI_LINUX_MEMPOLICY_H */
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index c7c7c86..1b2890c 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -2179,6 +2179,82 @@ static void sp_free(struct sp_node *n)
kmem_cache_free(sn_cache, n);
}
+/**
+ * mpol_misplaced - check whether current page node is valid in policy
+ *
+ * @page - page to be checked
+ * @vma - vm area where page mapped
+ * @addr - virtual address where page mapped
+ *
+ * Lookup current policy node id for vma,addr and "compare to" page's
+ * node id.
+ *
+ * Returns:
+ * -1 - not misplaced, page is in the right node
+ * node - node id where the page should be
+ *
+ * Policy determination "mimics" alloc_page_vma().
+ * Called from fault path where we know the vma and faulting address.
+ */
+int mpol_misplaced(struct page *page, struct vm_area_struct *vma, unsigned long addr)
+{
+ struct mempolicy *pol;
+ struct zone *zone;
+ int curnid = page_to_nid(page);
+ unsigned long pgoff;
+ int polnid = -1;
+ int ret = -1;
+
+ BUG_ON(!vma);
+
+ pol = get_vma_policy(current, vma, addr);
+ if (!(pol->flags & MPOL_F_MOF))
+ goto out;
+
+ switch (pol->mode) {
+ case MPOL_INTERLEAVE:
+ BUG_ON(addr >= vma->vm_end);
+ BUG_ON(addr < vma->vm_start);
+
+ pgoff = vma->vm_pgoff;
+ pgoff += (addr - vma->vm_start) >> PAGE_SHIFT;
+ polnid = offset_il_node(pol, vma, pgoff);
+ break;
+
+ case MPOL_PREFERRED:
+ if (pol->flags & MPOL_F_LOCAL)
+ polnid = numa_node_id();
+ else
+ polnid = pol->v.preferred_node;
+ break;
+
+ case MPOL_BIND:
+ /*
+ * allows binding to multiple nodes.
+ * use current page if in policy nodemask,
+ * else select nearest allowed node, if any.
+ * If no allowed nodes, use current [!misplaced].
+ */
+ if (node_isset(curnid, pol->v.nodes))
+ goto out;
+ (void)first_zones_zonelist(
+ node_zonelist(numa_node_id(), GFP_HIGHUSER),
+ gfp_zone(GFP_HIGHUSER),
+ &pol->v.nodes, &zone);
+ polnid = zone->node;
+ break;
+
+ default:
+ BUG();
+ }
+ if (curnid != polnid)
+ ret = polnid;
+out:
+ mpol_cond_put(pol);
+
+ return ret;
+}
+
static void sp_delete(struct shared_policy *sp, struct sp_node *n)
{
pr_debug("deleting %lx-l%lx\n", n->start, n->end);
--
1.7.11.7
From: Ralf Baechle <[email protected]>
Add the pmd_pgprot() method that will be needed
by the new NUMA code.
Reported-by: Stephen Rothwell <[email protected]>
Signed-off-by: Ralf Baechle <[email protected]>
Cc: Martin Schwidefsky <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
---
arch/mips/include/asm/pgtable.h | 2 ++
1 file changed, 2 insertions(+)
diff --git a/arch/mips/include/asm/pgtable.h b/arch/mips/include/asm/pgtable.h
index c02158b..bbe4cda 100644
--- a/arch/mips/include/asm/pgtable.h
+++ b/arch/mips/include/asm/pgtable.h
@@ -89,6 +89,8 @@ static inline int is_zero_pfn(unsigned long pfn)
extern void paging_init(void);
+#define pmd_pgprot(x) __pgprot(pmd_val(x) & ~_PAGE_CHG_MASK)
+
/*
* Conversion functions: convert a page and protection to a page entry,
* and a page entry and page directory to the page they refer to.
--
1.7.11.7
From: Peter Zijlstra <[email protected]>
We're going to play games with page protections, so ensure we don't lose
them over a THP split.
Collapse seems to always allocate a new (huge) page, which should
already end up on the new target node, so losing protections there
isn't a problem.
Signed-off-by: Peter Zijlstra <[email protected]>
Reviewed-by: Rik van Riel <[email protected]>
Cc: Paul Turner <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Link: http://lkml.kernel.org/n/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
arch/x86/include/asm/pgtable.h | 1 +
mm/huge_memory.c | 103 ++++++++++++++++++++---------------------
2 files changed, 50 insertions(+), 54 deletions(-)
diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index a1f780d..f85dccd 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -349,6 +349,7 @@ static inline pgprot_t pgprot_modify(pgprot_t oldprot, pgprot_t newprot)
}
#define pte_pgprot(x) __pgprot(pte_flags(x) & PTE_FLAGS_MASK)
+#define pmd_pgprot(x) __pgprot(pmd_val(x) & ~_HPAGE_CHG_MASK)
#define canon_pgprot(p) __pgprot(massage_pgprot(p))
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 40f17c3..176fe3d 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1343,63 +1343,60 @@ static int __split_huge_page_map(struct page *page,
int ret = 0, i;
pgtable_t pgtable;
unsigned long haddr;
+ pgprot_t prot;
spin_lock(&mm->page_table_lock);
pmd = page_check_address_pmd(page, mm, address,
PAGE_CHECK_ADDRESS_PMD_SPLITTING_FLAG);
- if (pmd) {
- pgtable = pgtable_trans_huge_withdraw(mm);
- pmd_populate(mm, &_pmd, pgtable);
-
- haddr = address;
- for (i = 0; i < HPAGE_PMD_NR; i++, haddr += PAGE_SIZE) {
- pte_t *pte, entry;
- BUG_ON(PageCompound(page+i));
- entry = mk_pte(page + i, vma->vm_page_prot);
- entry = maybe_mkwrite(pte_mkdirty(entry), vma);
- if (!pmd_write(*pmd))
- entry = pte_wrprotect(entry);
- else
- BUG_ON(page_mapcount(page) != 1);
- if (!pmd_young(*pmd))
- entry = pte_mkold(entry);
- pte = pte_offset_map(&_pmd, haddr);
- BUG_ON(!pte_none(*pte));
- set_pte_at(mm, haddr, pte, entry);
- pte_unmap(pte);
- }
+ if (!pmd)
+ goto unlock;
- smp_wmb(); /* make pte visible before pmd */
- /*
- * Up to this point the pmd is present and huge and
- * userland has the whole access to the hugepage
- * during the split (which happens in place). If we
- * overwrite the pmd with the not-huge version
- * pointing to the pte here (which of course we could
- * if all CPUs were bug free), userland could trigger
- * a small page size TLB miss on the small sized TLB
- * while the hugepage TLB entry is still established
- * in the huge TLB. Some CPU doesn't like that. See
- * http://support.amd.com/us/Processor_TechDocs/41322.pdf,
- * Erratum 383 on page 93. Intel should be safe but is
- * also warns that it's only safe if the permission
- * and cache attributes of the two entries loaded in
- * the two TLB is identical (which should be the case
- * here). But it is generally safer to never allow
- * small and huge TLB entries for the same virtual
- * address to be loaded simultaneously. So instead of
- * doing "pmd_populate(); flush_tlb_range();" we first
- * mark the current pmd notpresent (atomically because
- * here the pmd_trans_huge and pmd_trans_splitting
- * must remain set at all times on the pmd until the
- * split is complete for this pmd), then we flush the
- * SMP TLB and finally we write the non-huge version
- * of the pmd entry with pmd_populate.
- */
- pmdp_invalidate(vma, address, pmd);
- pmd_populate(mm, pmd, pgtable);
- ret = 1;
+ prot = pmd_pgprot(*pmd);
+ pgtable = pgtable_trans_huge_withdraw(mm);
+ pmd_populate(mm, &_pmd, pgtable);
+
+ for (i = 0, haddr = address; i < HPAGE_PMD_NR; i++, haddr += PAGE_SIZE) {
+ pte_t *pte, entry;
+
+ BUG_ON(PageCompound(page+i));
+ entry = mk_pte(page + i, prot);
+ entry = pte_mkdirty(entry);
+ if (!pmd_young(*pmd))
+ entry = pte_mkold(entry);
+ pte = pte_offset_map(&_pmd, haddr);
+ BUG_ON(!pte_none(*pte));
+ set_pte_at(mm, haddr, pte, entry);
+ pte_unmap(pte);
}
+
+ smp_wmb(); /* make ptes visible before pmd, see __pte_alloc */
+ /*
+ * Up to this point the pmd is present and huge.
+ *
+ * If we overwrite the pmd with the not-huge version, we could trigger
+ * a small page size TLB miss on the small sized TLB while the hugepage
+ * TLB entry is still established in the huge TLB.
+ *
+ * Some CPUs don't like that. See
+ * http://support.amd.com/us/Processor_TechDocs/41322.pdf, Erratum 383
+ * on page 93.
+ *
+ * Thus it is generally safer to never allow small and huge TLB entries
+ * for overlapping virtual addresses to be loaded. So we first mark the
+ * current pmd not present, then we flush the TLB and finally we write
+ * the non-huge version of the pmd entry with pmd_populate.
+ *
+ * The above needs to be done under the ptl because pmd_trans_huge and
+ * pmd_trans_splitting must remain set on the pmd until the split is
+ * complete. The ptl also protects against concurrent faults due to
+ * making the pmd not-present.
+ */
+ set_pmd_at(mm, address, pmd, pmd_mknotpresent(*pmd));
+ flush_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
+ pmd_populate(mm, pmd, pgtable);
+ ret = 1;
+
+unlock:
spin_unlock(&mm->page_table_lock);
return ret;
@@ -2287,10 +2284,8 @@ static void khugepaged_do_scan(void)
{
struct page *hpage = NULL;
unsigned int progress = 0, pass_through_head = 0;
- unsigned int pages = khugepaged_pages_to_scan;
bool wait = true;
-
- barrier(); /* write khugepaged_pages_to_scan to local stack */
+ unsigned int pages = ACCESS_ONCE(khugepaged_pages_to_scan);
while (progress < pages) {
if (!khugepaged_prealloc_page(&hpage, &wait))
--
1.7.11.7
pgprot_modify() is available on x86, but on other architectures it only
gets defined in mm/mprotect.c - breaking the build if anything outside
of mprotect.c tries to make use of this function.
Move it to the generic pgprot area in mm.h, so that an upcoming patch
can make use of it.
Acked-by: Peter Zijlstra <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Paul Turner <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Andrew Morton <[email protected]>
Link: http://lkml.kernel.org/n/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
include/linux/mm.h | 13 +++++++++++++
mm/mprotect.c | 7 -------
2 files changed, 13 insertions(+), 7 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index fa06804..2a32cf8 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -164,6 +164,19 @@ extern pgprot_t protection_map[16];
#define FAULT_FLAG_TRIED 0x40 /* second try */
/*
+ * Some architectures (such as x86) may need to preserve certain pgprot
+ * bits, without complicating generic pgprot code.
+ *
+ * Most architectures don't care:
+ */
+#ifndef pgprot_modify
+static inline pgprot_t pgprot_modify(pgprot_t oldprot, pgprot_t newprot)
+{
+ return newprot;
+}
+#endif
+
+/*
* vm_fault is filled by the the pagefault handler and passed to the vma's
* ->fault function. The vma's ->fault is responsible for returning a bitmask
* of VM_FAULT_xxx flags that give details about how the fault was handled.
diff --git a/mm/mprotect.c b/mm/mprotect.c
index a409926..e97b0d6 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -28,13 +28,6 @@
#include <asm/cacheflush.h>
#include <asm/tlbflush.h>
-#ifndef pgprot_modify
-static inline pgprot_t pgprot_modify(pgprot_t oldprot, pgprot_t newprot)
-{
- return newprot;
-}
-#endif
-
static void change_pte_range(struct mm_struct *mm, pmd_t *pmd,
unsigned long addr, unsigned long end, pgprot_t newprot,
int dirty_accountable)
--
1.7.11.7
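
For readers unfamiliar with the "#ifndef" fallback idiom this patch relies on, here is a minimal user-space sketch. The pgprot_t stand-in, DEMO_ARCH_PRESERVES_BITS and DEMO_PRESERVE_MASK are made up for illustration and are not kernel definitions; only the shape of the generic fallback follows the patch.

```c
/* Hypothetical stand-in for the kernel's pgprot_t; illustration only. */
typedef struct { unsigned long pgprot; } pgprot_t;

/*
 * An architecture that must preserve some bits supplies its own version
 * and defines the macro of the same name, so the generic fallback below
 * is skipped. DEMO_PRESERVE_MASK is an invented mask, not a real one.
 */
#ifdef DEMO_ARCH_PRESERVES_BITS
#define DEMO_PRESERVE_MASK 0xf0UL
static inline pgprot_t pgprot_modify(pgprot_t oldprot, pgprot_t newprot)
{
	pgprot_t ret;

	ret.pgprot = (oldprot.pgprot & DEMO_PRESERVE_MASK) |
		     (newprot.pgprot & ~DEMO_PRESERVE_MASK);
	return ret;
}
#define pgprot_modify pgprot_modify
#endif

/* Generic fallback, same shape as in the patch: most arches don't care. */
#ifndef pgprot_modify
static inline pgprot_t pgprot_modify(pgprot_t oldprot, pgprot_t newprot)
{
	(void)oldprot;	/* nothing to preserve in the generic case */
	return newprot;
}
#endif
```

Built without DEMO_ARCH_PRESERVES_BITS, the fallback is used and the old protection is simply ignored. The pte_accessible() patch later in this series uses the same "#define name name" trick.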
From: Rik van Riel <[email protected]>
We need pte_present to return true for _PAGE_PROTNONE pages, to indicate that
the pte is associated with a page.
However, for TLB flushing purposes, we would like to know whether the pte
points to an actually accessible page. This allows us to skip remote TLB
flushes for pages that are not actually accessible.
Fill in this method for x86 and provide a safe (but slower) method
on other architectures.
Signed-off-by: Rik van Riel <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Fixed-by: Linus Torvalds <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/n/[email protected]
[ Added Linus's review fixes. ]
Signed-off-by: Ingo Molnar <[email protected]>
---
arch/x86/include/asm/pgtable.h | 6 ++++++
include/asm-generic/pgtable.h | 4 ++++
2 files changed, 10 insertions(+)
diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index f85dccd..a984cf9 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -408,6 +408,12 @@ static inline int pte_present(pte_t a)
return pte_flags(a) & (_PAGE_PRESENT | _PAGE_PROTNONE);
}
+#define pte_accessible pte_accessible
+static inline int pte_accessible(pte_t a)
+{
+ return pte_flags(a) & _PAGE_PRESENT;
+}
+
static inline int pte_hidden(pte_t pte)
{
return pte_flags(pte) & _PAGE_HIDDEN;
diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
index b36ce40..48fc1dc 100644
--- a/include/asm-generic/pgtable.h
+++ b/include/asm-generic/pgtable.h
@@ -219,6 +219,10 @@ static inline int pmd_same(pmd_t pmd_a, pmd_t pmd_b)
#define move_pte(pte, prot, old_addr, new_addr) (pte)
#endif
+#ifndef pte_accessible
+# define pte_accessible(pte) ((void)(pte),1)
+#endif
+
#ifndef flush_tlb_fix_spurious_fault
#define flush_tlb_fix_spurious_fault(vma, address) flush_tlb_page(vma, address)
#endif
--
1.7.11.7
From: Rik van Riel <[email protected]>
If ptep_clear_flush() is called to clear a page table entry that is
not accessible by the CPU anyway, e.g. a _PAGE_PROTNONE page table entry,
there is no need to flush the TLB on remote CPUs.
Signed-off-by: Rik van Riel <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Andrew Morton <[email protected]>
Link: http://lkml.kernel.org/n/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
mm/pgtable-generic.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
index d8397da..0c8323f 100644
--- a/mm/pgtable-generic.c
+++ b/mm/pgtable-generic.c
@@ -88,7 +88,8 @@ pte_t ptep_clear_flush(struct vm_area_struct *vma, unsigned long address,
{
pte_t pte;
pte = ptep_get_and_clear((vma)->vm_mm, address, ptep);
- flush_tlb_page(vma, address);
+ if (pte_accessible(pte))
+ flush_tlb_page(vma, address);
return pte;
}
#endif
--
1.7.11.7
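
To illustrate how pte_present() and pte_accessible() differ, and why the flush can be skipped, here is a toy user-space model. The flag values, the demo_pte_t type and the flush counter are assumptions made for this sketch; real flag values are architecture-specific and a real flush is of course not a counter.

```c
#include <stdbool.h>

/* Made-up flag values for a toy pte; not the real x86 bit positions. */
#define DEMO_PAGE_PRESENT	0x1UL
#define DEMO_PAGE_PROTNONE	0x2UL

typedef struct { unsigned long flags; } demo_pte_t;

static int demo_flush_count;	/* counts simulated TLB flushes */

/* True if the pte is associated with a page, PROT_NONE included. */
static bool demo_pte_present(demo_pte_t pte)
{
	return pte.flags & (DEMO_PAGE_PRESENT | DEMO_PAGE_PROTNONE);
}

/* True only if the CPU could actually have loaded this pte into a TLB. */
static bool demo_pte_accessible(demo_pte_t pte)
{
	return pte.flags & DEMO_PAGE_PRESENT;
}

/* Clear the pte; flush only if a TLB entry could exist for it. */
static demo_pte_t demo_ptep_clear_flush(demo_pte_t *ptep)
{
	demo_pte_t old = *ptep;

	ptep->flags = 0;
	if (demo_pte_accessible(old))
		demo_flush_count++;
	return old;
}
```

In this model, clearing a PROT_NONE pte leaves the counter untouched, while clearing a normal present pte bumps it by one.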
From: Gerald Schaefer <[email protected]>
This patch adds an implementation of pmd_pgprot() for s390,
in preparation for future THP changes.
Reported-by: Stephen Rothwell <[email protected]>
Signed-off-by: Gerald Schaefer <[email protected]>
Cc: Martin Schwidefsky <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Ralf Baechle <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
---
arch/s390/include/asm/pgtable.h | 13 +++++++++++++
1 file changed, 13 insertions(+)
diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h
index dd647c9..098fc5a 100644
--- a/arch/s390/include/asm/pgtable.h
+++ b/arch/s390/include/asm/pgtable.h
@@ -1240,6 +1240,19 @@ static inline void set_pmd_at(struct mm_struct *mm, unsigned long addr,
*pmdp = entry;
}
+static inline pgprot_t pmd_pgprot(pmd_t pmd)
+{
+ pgprot_t prot = PAGE_RW;
+
+ if (pmd_val(pmd) & _SEGMENT_ENTRY_RO) {
+ if (pmd_val(pmd) & _SEGMENT_ENTRY_INV)
+ prot = PAGE_NONE;
+ else
+ prot = PAGE_RO;
+ }
+ return prot;
+}
+
static inline unsigned long massage_pgprot_pmd(pgprot_t pgprot)
{
unsigned long pgprot_pmd = 0;
--
1.7.11.7
From: Rik van Riel <[email protected]>
Because we only ever upgrade a PTE when calling ptep_set_access_flags(),
it is safe to skip flushing entries on remote TLBs.
The worst that can happen is a spurious page fault on other CPUs, which
would flush that TLB entry.
Lazily letting another CPU incur a spurious page fault occasionally
is (much!) cheaper than aggressively flushing everybody else's TLB.
Signed-off-by: Rik van Riel <[email protected]>
Acked-by: Linus Torvalds <[email protected]>
Acked-by: Peter Zijlstra <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Michel Lespinasse <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
---
arch/x86/mm/pgtable.c | 9 ++++++++-
1 file changed, 8 insertions(+), 1 deletion(-)
diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
index 8573b83..be3bb46 100644
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -301,6 +301,13 @@ void pgd_free(struct mm_struct *mm, pgd_t *pgd)
free_page((unsigned long)pgd);
}
+/*
+ * Used to set accessed or dirty bits in the page table entries
+ * on other architectures. On x86, the accessed and dirty bits
+ * are tracked by hardware. However, do_wp_page calls this function
+ * to also make the pte writeable at the same time the dirty bit is
+ * set. In that case we do actually need to write the PTE.
+ */
int ptep_set_access_flags(struct vm_area_struct *vma,
unsigned long address, pte_t *ptep,
pte_t entry, int dirty)
@@ -310,7 +317,7 @@ int ptep_set_access_flags(struct vm_area_struct *vma,
if (changed && dirty) {
*ptep = entry;
pte_update_defer(vma->vm_mm, address, ptep);
- flush_tlb_page(vma, address);
+ __flush_tlb_one(address);
}
return changed;
--
1.7.11.7
I just found an imbalance issue with this patchset.
I wrote a one-line program:
int main ()
{
int i;
for (i=0; i< 1; )
__asm__ __volatile__ ("nop");
}
It was compiled as 'pl' and started on my 2-socket * 4-core *
HT NUMA machine.
The CPU domain topology looks like this:
domain 0: span 4,12 level SIBLING
groups: 4 (cpu_power = 589) 12 (cpu_power = 589)
domain 1: span 0,2,4,6,8,10,12,14 level MC
groups: 4,12 (cpu_power = 1178) 6,14 (cpu_power = 1178) 0,8
(cpu_power = 1178) 2,10 (cpu_power = 1178)
domain 2: span 0,2,4,6,8,10,12,14 level CPU
groups: 0,2,4,6,8,10,12,14 (cpu_power = 4712)
domain 3: span 0-15 level NUMA
groups: 0,2,4,6,8,10,12,14 (cpu_power = 4712) 1,3,5,7,9,11,13,15
(cpu_power = 4712)
$for ((i=0; i< I; i++)); do ./pl & done
when I = 2, they are running on cpu 0,12
I = 4, they are running on cpu 0,9,12,14
I = 8, they are running on cpu 0,4,9,10,11,12,13,14
Regards!
Alex
On Sat, Nov 17, 2012 at 12:25 AM, Ingo Molnar <[email protected]> wrote:
> This is the split-out series of mm/ patches that got no objections
> from the latest (v15) posting of numa/core. If everyone is still
> fine with these then these will be merge candidates for v3.8.
>
> I left out the more contentious policy bits that people are still
> arguing about.
>
> The numa/base tree can also be found here:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git numa/base
>
> Thanks,
>
> Ingo
>
> ------------------->
>
> Andrea Arcangeli (1):
> numa, mm: Support NUMA hinting page faults from gup/gup_fast
>
> Gerald Schaefer (1):
> sched, numa, mm, s390/thp: Implement pmd_pgprot() for s390
>
> Ingo Molnar (1):
> mm/pgprot: Move the pgprot_modify() fallback definition to mm.h
>
> Lee Schermerhorn (3):
> mm/mpol: Add MPOL_MF_NOOP
> mm/mpol: Check for misplaced page
> mm/mpol: Add MPOL_MF_LAZY
>
> Peter Zijlstra (7):
> sched, numa, mm: Make find_busiest_queue() a method
> sched, numa, mm: Describe the NUMA scheduling problem formally
> mm/thp: Preserve pgprot across huge page split
> mm/mpol: Make MPOL_LOCAL a real policy
> mm/mpol: Create special PROT_NONE infrastructure
> mm/migrate: Introduce migrate_misplaced_page()
> mm/mpol: Use special PROT_NONE to migrate pages
>
> Ralf Baechle (1):
> sched, numa, mm, MIPS/thp: Add pmd_pgprot() implementation
>
> Rik van Riel (5):
> mm/generic: Only flush the local TLB in ptep_set_access_flags()
> x86/mm: Only do a local tlb flush in ptep_set_access_flags()
> x86/mm: Introduce pte_accessible()
> mm: Only flush the TLB when clearing an accessible pte
> x86/mm: Completely drop the TLB flush from ptep_set_access_flags()
>
> Documentation/scheduler/numa-problem.txt | 230 +++++++++++++++++++++++++++++++
> arch/mips/include/asm/pgtable.h | 2 +
> arch/s390/include/asm/pgtable.h | 13 ++
> arch/x86/include/asm/pgtable.h | 7 +
> arch/x86/mm/pgtable.c | 8 +-
> include/asm-generic/pgtable.h | 4 +
> include/linux/huge_mm.h | 19 +++
> include/linux/mempolicy.h | 8 ++
> include/linux/migrate.h | 7 +
> include/linux/migrate_mode.h | 3 +
> include/linux/mm.h | 32 +++++
> include/uapi/linux/mempolicy.h | 16 ++-
> kernel/sched/fair.c | 20 +--
> mm/huge_memory.c | 174 +++++++++++++++--------
> mm/memory.c | 119 +++++++++++++++-
> mm/mempolicy.c | 143 +++++++++++++++----
> mm/migrate.c | 85 ++++++++++--
> mm/mprotect.c | 31 +++--
> mm/pgtable-generic.c | 9 +-
> 19 files changed, 807 insertions(+), 123 deletions(-)
> create mode 100644 Documentation/scheduler/numa-problem.txt
>
> --
> 1.7.11.7
>
--
Thanks
Alex
On Sat, Nov 17, 2012 at 4:35 PM, Alex Shi <[email protected]> wrote:
> I just found an imbalance issue with this patchset.
>
> I wrote a one-line program:
> int main ()
> {
> int i;
> for (i=0; i< 1; )
> __asm__ __volatile__ ("nop");
> }
> It was compiled as 'pl' and started on my 2-socket * 4-core *
> HT NUMA machine.
> The CPU domain topology looks like this:
> domain 0: span 4,12 level SIBLING
> groups: 4 (cpu_power = 589) 12 (cpu_power = 589)
> domain 1: span 0,2,4,6,8,10,12,14 level MC
> groups: 4,12 (cpu_power = 1178) 6,14 (cpu_power = 1178) 0,8
> (cpu_power = 1178) 2,10 (cpu_power = 1178)
> domain 2: span 0,2,4,6,8,10,12,14 level CPU
> groups: 0,2,4,6,8,10,12,14 (cpu_power = 4712)
> domain 3: span 0-15 level NUMA
> groups: 0,2,4,6,8,10,12,14 (cpu_power = 4712) 1,3,5,7,9,11,13,15
> (cpu_power = 4712)
>
> $for ((i=0; i< I; i++)); do ./pl & done
> when I = 2, they are running on cpu 0,12
> I = 4, they are running on cpu 0,9,12,14
> I = 8, they are running on cpu 0,4,9,10,11,12,13,14
>
Oops, this was tested on the latest v15 tip/master tree, head is
a7b7a8ad4476bb641c8455a4e0d7d0fd3eb86f90,
not on this series.
Sorry.
* Ingo Molnar <[email protected]> wrote:
> From: Peter Zijlstra <[email protected]>
>
> Add migrate_misplaced_page() which deals with migrating pages from
> faults.
>
> This includes adding a new MIGRATE_FAULT migration mode to
> deal with the extra page reference required due to having to look up
> the page.
[...]
> --- a/include/linux/migrate_mode.h
> +++ b/include/linux/migrate_mode.h
> @@ -6,11 +6,14 @@
> * on most operations but not ->writepage as the potential stall time
> * is too significant
> * MIGRATE_SYNC will block when migrating pages
> + * MIGRATE_FAULT called from the fault path to migrate-on-fault for mempolicy
> + * this path has an extra reference count
> */
Note, this is still the older, open-coded version.
The newer replacement version created from Mel's patch which
reuses migrate_pages() and is nicer on out-of-node-memory
conditions and is cleaner all around can be found below.
I tested it today and it appears to work fine. I noticed no
performance improvement or performance drop from it - if it
holds up in testing it will be part of the -v17 release of
numa/core.
Thanks,
Ingo
-------------------------->
Subject: mm/migration: Introduce migrate_misplaced_page()
From: Mel Gorman <[email protected]>
Date: Fri, 16 Nov 2012 11:22:23 +0000
Note: This was originally based on Peter's patch "mm/migrate: Introduce
migrate_misplaced_page()" but borrows extremely heavily from Andrea's
"autonuma: memory follows CPU algorithm and task/mm_autonuma stats
collection". The end result is barely recognisable so signed-offs
had to be dropped. If original authors are ok with it, I'll
re-add the signed-off-bys.
Add migrate_misplaced_page() which deals with migrating pages
from faults.
Based-on-work-by: Lee Schermerhorn <[email protected]>
Based-on-work-by: Peter Zijlstra <[email protected]>
Based-on-work-by: Andrea Arcangeli <[email protected]>
Signed-off-by: Mel Gorman <[email protected]>
Reviewed-by: Rik van Riel <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Linux-MM <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
[ Adapted to the numa/core tree. ]
Signed-off-by: Ingo Molnar <[email protected]>
---
mm/memory.c | 13 ++-----
mm/migrate.c | 103 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++--
2 files changed, 106 insertions(+), 10 deletions(-)
Index: linux/mm/memory.c
===================================================================
--- linux.orig/mm/memory.c
+++ linux/mm/memory.c
@@ -3494,28 +3494,25 @@ out_pte_upgrade_unlock:
out_unlock:
pte_unmap_unlock(ptep, ptl);
-out:
+
if (page) {
task_numa_fault(page_nid, last_cpu, 1);
put_page(page);
}
-
+out:
return 0;
migrate:
pte_unmap_unlock(ptep, ptl);
- if (!migrate_misplaced_page(page, node)) {
- page_nid = node;
+ if (migrate_misplaced_page(page, node)) {
goto out;
}
+ page = NULL;
ptep = pte_offset_map_lock(mm, pmd, address, &ptl);
- if (!pte_same(*ptep, entry)) {
- put_page(page);
- page = NULL;
+ if (!pte_same(*ptep, entry))
goto out_unlock;
- }
goto out_pte_upgrade_unlock;
}
Index: linux/mm/migrate.c
===================================================================
--- linux.orig/mm/migrate.c
+++ linux/mm/migrate.c
@@ -279,7 +279,7 @@ static int migrate_page_move_mapping(str
struct page *newpage, struct page *page,
struct buffer_head *head, enum migrate_mode mode)
{
- int expected_count;
+ int expected_count = 0;
void **pslot;
if (!mapping) {
@@ -1403,4 +1403,103 @@ int migrate_vmas(struct mm_struct *mm, c
}
return err;
}
-#endif
+
+/*
+ * Returns true if this is a safe migration target node for misplaced NUMA
+ * pages. Currently it only checks the watermarks, which is crude.
+ */
+static bool migrate_balanced_pgdat(struct pglist_data *pgdat,
+ int nr_migrate_pages)
+{
+ int z;
+ for (z = pgdat->nr_zones - 1; z >= 0; z--) {
+ struct zone *zone = pgdat->node_zones + z;
+
+ if (!populated_zone(zone))
+ continue;
+
+ if (zone->all_unreclaimable)
+ continue;
+
+ /* Avoid waking kswapd by allocating nr_migrate_pages pages. */
+ if (!zone_watermark_ok(zone, 0,
+ high_wmark_pages(zone) +
+ nr_migrate_pages,
+ 0, 0))
+ continue;
+ return true;
+ }
+ return false;
+}
+
+static struct page *alloc_misplaced_dst_page(struct page *page,
+ unsigned long data,
+ int **result)
+{
+ int nid = (int) data;
+ struct page *newpage;
+
+ newpage = alloc_pages_exact_node(nid,
+ (GFP_HIGHUSER_MOVABLE | GFP_THISNODE |
+ __GFP_NOMEMALLOC | __GFP_NORETRY |
+ __GFP_NOWARN) &
+ ~GFP_IOFS, 0);
+ return newpage;
+}
+
+/*
+ * Attempt to migrate a misplaced page to the specified destination
+ * node. Caller is expected to have an elevated reference count on
+ * the page that will be dropped by this function before returning.
+ */
+int migrate_misplaced_page(struct page *page, int node)
+{
+ int isolated = 0;
+ LIST_HEAD(migratepages);
+
+ /*
+ * Don't migrate pages that are mapped in multiple processes.
+ * TODO: Handle false sharing detection instead of this hammer
+ */
+ if (page_mapcount(page) != 1)
+ goto out;
+
+ /* Avoid migrating to a node that is nearly full */
+ if (migrate_balanced_pgdat(NODE_DATA(node), 1)) {
+ int page_lru;
+
+ if (isolate_lru_page(page)) {
+ put_page(page);
+ goto out;
+ }
+ isolated = 1;
+
+ /*
+ * Page is isolated, which takes a reference count, so now the
+ * caller's reference can be safely dropped without the page
+ * disappearing underneath us during migration
+ */
+ put_page(page);
+
+ page_lru = page_is_file_cache(page);
+ inc_zone_page_state(page, NR_ISOLATED_ANON + page_lru);
+ list_add(&page->lru, &migratepages);
+ }
+
+ if (isolated) {
+ int nr_remaining;
+
+ nr_remaining = migrate_pages(&migratepages,
+ alloc_misplaced_dst_page,
+ node, false, MIGRATE_ASYNC);
+ if (nr_remaining) {
+ putback_lru_pages(&migratepages);
+ isolated = 0;
+ }
+ }
+ BUG_ON(!list_empty(&migratepages));
+out:
+ return isolated;
+}
+
+#endif /* CONFIG_NUMA */
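
As a rough illustration of the "is the target node nearly full?" test in migrate_balanced_pgdat() above, here is a user-space sketch. struct demo_zone and its fields are invented for this example, and the comparison is a simplification of what zone_watermark_ok() really does (it ignores lowmem reserves, allocation order and alloc flags).

```c
#include <stdbool.h>
#include <stddef.h>

/* Invented, simplified view of a zone; not the kernel's struct zone. */
struct demo_zone {
	unsigned long present_pages;	/* 0 means the zone is unpopulated */
	unsigned long free_pages;
	unsigned long high_wmark;
	bool all_unreclaimable;
};

/*
 * A node is treated as a safe migration target if at least one usable
 * zone would stay above its high watermark after taking
 * nr_migrate_pages more pages. Zones are walked from highest to lowest,
 * as in the patch.
 */
static bool demo_node_is_safe_target(const struct demo_zone *zones,
				     size_t nr_zones,
				     unsigned long nr_migrate_pages)
{
	size_t z;

	for (z = nr_zones; z-- > 0; ) {
		const struct demo_zone *zone = &zones[z];

		if (zone->present_pages == 0)
			continue;
		if (zone->all_unreclaimable)
			continue;
		/* Simplified stand-in for zone_watermark_ok(). */
		if (zone->free_pages <= zone->high_wmark + nr_migrate_pages)
			continue;
		return true;
	}
	return false;
}
```

With a high watermark of 100 and one page to migrate, this sketch accepts the node at 102 free pages and rejects it at 101.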
On 11/18/2012 09:25 PM, Ingo Molnar wrote:
>
> * Ingo Molnar <[email protected]> wrote:
>
>> From: Peter Zijlstra <[email protected]>
>>
>> Add migrate_misplaced_page() which deals with migrating pages from
>> faults.
>>
>> This includes adding a new MIGRATE_FAULT migration mode to
>> deal with the extra page reference required due to having to look up
>> the page.
> [...]
>
>> --- a/include/linux/migrate_mode.h
>> +++ b/include/linux/migrate_mode.h
>> @@ -6,11 +6,14 @@
>> * on most operations but not ->writepage as the potential stall time
>> * is too significant
>> * MIGRATE_SYNC will block when migrating pages
>> + * MIGRATE_FAULT called from the fault path to migrate-on-fault for mempolicy
>> + * this path has an extra reference count
>> */
>
> Note, this is still the older, open-coded version.
>
> The newer replacement version created from Mel's patch which
> reuses migrate_pages() and is nicer on out-of-node-memory
> conditions and is cleaner all around can be found below.
>
> I tested it today and it appears to work fine. I noticed no
> performance improvement or performance drop from it - if it
> holds up in testing it will be part of the -v17 release of
> numa/core.
Excellent. That gets rid of the last issue with numa/base :)
--
All rights reversed
As per 4), move towards where "most" memory is: if we have more shared
memory than private memory, why not just move the process towards the
memory, instead of moving the memory towards the node? I guess this
will be less cumbersome than moving all the shared memory.
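
A small sketch of the significance test from step 4 of the quoted document may help here. It assumes per-node fault counters as proposed in 3a; the function name and array layout are hypothetical, and applying the lower-node tie-break to equal shared counts is my reading of the note in step 4, not something the patch implements.

```c
/*
 * Pick the node a task should prefer, per step 4: compare the biggest
 * remote shared count 's_i,k' against the private count 'p_i' and move
 * only if there is more shared than private memory.
 *
 * shared_pages[k] holds the (hypothetical) fault counter for node k;
 * the entry for cur_node is ignored.
 */
static int demo_preferred_node(int cur_node, unsigned long private_pages,
			       const unsigned long *shared_pages, int nr_nodes)
{
	unsigned long best_shared = 0;
	int best = -1;
	int k;

	for (k = 0; k < nr_nodes; k++) {
		if (k == cur_node)
			continue;
		/* Strict '>' keeps the lower node number on ties. */
		if (best < 0 || shared_pages[k] > best_shared) {
			best = k;
			best_shared = shared_pages[k];
		}
	}

	if (best >= 0 && best_shared > private_pages)
		return best;	/* move the task towards the memory */
	return cur_node;	/* stay; memory migrates towards us per 2 */
}
```

So in this sketch the task only moves when shared memory dominates; otherwise it stays put and relies on page migration.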
On Fri, Nov 16, 2012 at 9:55 PM, Ingo Molnar <[email protected]> wrote:
> From: Peter Zijlstra <[email protected]>
>
> This is probably a first: formal description of a complex high-level
> computing problem, within the kernel source.
>
> Signed-off-by: Peter Zijlstra <[email protected]>
> Cc: Linus Torvalds <[email protected]>
> Cc: Andrew Morton <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: "H. Peter Anvin" <[email protected]>
> Cc: Mike Galbraith <[email protected]>
> Cc: Rik van Riel <[email protected]>
> Link: http://lkml.kernel.org/n/[email protected]
> [ Next step: generate the kernel source from such formal descriptions and retire to a tropical island! ]
> Signed-off-by: Ingo Molnar <[email protected]>
> ---
> Documentation/scheduler/numa-problem.txt | 230 +++++++++++++++++++++++++++++++
> 1 file changed, 230 insertions(+)
> create mode 100644 Documentation/scheduler/numa-problem.txt
>
> diff --git a/Documentation/scheduler/numa-problem.txt b/Documentation/scheduler/numa-problem.txt
> new file mode 100644
> index 0000000..a5d2fee
> --- /dev/null
> +++ b/Documentation/scheduler/numa-problem.txt
> @@ -0,0 +1,230 @@
> +
> +
> +Effective NUMA scheduling problem statement, described formally:
> +
> + * minimize interconnect traffic
> +
> +For each task 't_i' we have memory, this memory can be spread over multiple
> +physical nodes, let us denote this as: 'p_i,k', the memory task 't_i' has on
> +node 'k' in [pages].
> +
> +If a task shares memory with another task let us denote this as:
> +'s_i,k', the memory shared between tasks including 't_i' residing on node
> +'k'.
> +
> +Let 'M' be the distribution that governs all 'p' and 's', ie. the page placement.
> +
> +Similarly, let's define 'fp_i,k' and 'fs_i,k' resp. as the (average) usage
> +frequency over those memory regions [1/s] such that the product gives an
> +(average) bandwidth 'bp' and 'bs' in [pages/s].
> +
> +(note: multiple tasks sharing memory naturally avoid duplicate accounting
> + because each task will have its own access frequency 'fs')
> +
> +(pjt: I think this frequency is more numerically consistent if you explicitly
> + restrict p/s above to be the working-set. (It also makes explicit the
> + requirement for <C0,M0> to change about a change in the working set.)
> +
> + Doing this does have the nice property that it lets you use your frequency
> + measurement as a weak-ordering for the benefit a task would receive when
> + we can't fit everything.
> +
> + e.g. task1 has working set 10mb, f=90%
> + task2 has working set 90mb, f=10%
> +
> + Both are using 9mb/s of bandwidth, but we'd expect a much larger benefit
> + from task1 being on the right node than task2. )
> +
> +Let 'C' map every task 't_i' to a cpu 'c_i' and its corresponding node 'n_i':
> +
> + C: t_i -> {c_i, n_i}
> +
> +This gives us the total interconnect traffic between nodes 'k' and 'l',
> +'T_k,l', as:
> +
> + T_k,l = \Sum_i bp_i,l + bs_i,l + \Sum_j bp_j,k + bs_j,k where n_i == k, n_j == l
> +
> +And our goal is to obtain C0 and M0 such that:
> +
> + T_k,l(C0, M0) =< T_k,l(C, M) for all C, M where k != l
> +
> +(note: we could introduce 'nc(k,l)' as the cost function of accessing memory
> + on node 'l' from node 'k', this would be useful for bigger NUMA systems
> +
> + pjt: I agree nice to have, but intuition suggests diminishing returns on more
> + usual systems given factors like Haswell's enormous 35mb l3
> + cache and QPI being able to do a direct fetch.)
> +
> +(note: do we need a limit on the total memory per node?)
> +
> +
> + * fairness
> +
> +For each task 't_i' we have a weight 'w_i' (related to nice), and each cpu
> +'c_n' has a compute capacity 'P_n', again, using our map 'C' we can formulate a
> +load 'L_n':
> +
> + L_n = 1/P_n * \Sum_i w_i for all c_i = n
> +
> +using that we can formulate a load difference between CPUs
> +
> + L_n,m = | L_n - L_m |
> +
> +Which allows us to state the fairness goal like:
> +
> + L_n,m(C0) =< L_n,m(C) for all C, n != m
> +
> +(pjt: It can also be usefully stated that, having converged at C0:
> +
> + | L_n(C0) - L_m(C0) | <= 4/3 * | G_n( U(t_i, t_j) ) - G_m( U(t_i, t_j) ) |
> +
> + Where G_n,m is the greedy partition of tasks between L_n and L_m. This is
> + the "worst" partition we should accept; but having it gives us a useful
> + bound on how much we can reasonably adjust L_n/L_m at a Pareto point to
> + favor T_n,m. )
> +
> +Together they give us the complete multi-objective optimization problem:
> +
> + min_C,M [ L_n,m(C), T_k,l(C,M) ]
> +
> +
> +
> +Notes:
> +
> + - the memory bandwidth problem is very much an inter-process problem, in
> + particular there is no such concept as a process in the above problem.
> +
> + - the naive solution would completely prefer fairness over interconnect
> + traffic, the more complicated solution could pick another Pareto point using
> + an aggregate objective function such that we balance the loss of work
> + efficiency against the gain of running, we'd want to more or less suggest
> + there to be a fixed bound on the error from the Pareto line for any
> + such solution.
> +
> +References:
> +
> + http://en.wikipedia.org/wiki/Mathematical_optimization
> + http://en.wikipedia.org/wiki/Multi-objective_optimization
> +
> +
> +* warning, significant hand-waving ahead, improvements welcome *
> +
> +
> +Partial solutions / approximations:
> +
> + 1) have task node placement be a pure preference from the 'fairness' pov.
> +
> +This means we always prefer fairness over interconnect bandwidth. This reduces
> +the problem to:
> +
> + min_C,M [ T_k,l(C,M) ]
> +
> + 2a) migrate memory towards 'n_i' (the task's node).
> +
> +This creates memory movement such that 'p_i,k for k != n_i' becomes 0 --
> +provided 'n_i' stays stable enough and there's sufficient memory (looks like
> +we might need memory limits for this).
> +
> +This does however not provide us with any 's_i' (shared) information. It does
> +however remove 'M' since it defines memory placement in terms of task
> +placement.
> +
> +XXX properties of this M vs a potential optimal
> +
> + 2b) migrate memory towards 'n_i' using 2 samples.
> +
> +This separates pages into those that will migrate and those that will not due
> +to the two samples not matching. We could consider the first to be of 'p_i'
> +(private) and the second to be of 's_i' (shared).
> +
> +This interpretation can be motivated by the previously observed property that
> +'p_i,k for k != n_i' should become 0 under sufficient memory, leaving only
> +'s_i' (shared). (here we lose the need for memory limits again, since it
> +becomes indistinguishable from shared).
> +
> +XXX include the statistical babble on double sampling somewhere near
> +
> +This reduces the problem further; we lose 'M' as per 2a, it further reduces
> +the 'T_k,l' (interconnect traffic) term to only include shared (since per the
> +above all private will be local):
> +
> + T_k,l = \Sum_i bs_i,l for every n_i = k, l != k
> +
> +[ more or less matches the state of sched/numa and describes its remaining
> + problems and assumptions. It should work well for tasks without significant
> + shared memory usage between tasks. ]
> +
> +Possible future directions:
> +
> +Motivated by the form of 'T_k,l', try and obtain each term of the sum, so we
> +can evaluate it;
> +
> + 3a) add per-task per node counters
> +
> +At fault time, count the number of pages the task faults on for each node.
> +This should give an approximation of 'p_i' for the local node and 's_i,k' for
> +all remote nodes.
> +
> +While these numbers provide pages per scan, and so have the unit [pages/s] they
> +don't count repeat access and thus aren't actually representative of our
> +bandwidth numbers.
> +
> + 3b) additional frequency term
> +
> +Additionally (or instead if it turns out we don't need the raw 'p' and 's'
> +numbers) we can approximate the repeat accesses by using the time since marking
> +the pages as indication of the access frequency.
> +
> +Let 'I' be the interval of marking pages and 'e' the elapsed time since the
> +last marking, then we could estimate the number of accesses 'a' as 'a = I / e'.
> +If we then increment the node counters using 'a' instead of 1 we might get
> +a better estimate of bandwidth terms.
> +
> + 3c) additional averaging; can be applied on top of either a/b.
> +
> +[ Rik argues that decaying averages on 3a might be sufficient for bandwidth since
> + the decaying avg includes the old accesses and therefore has a measure of repeat
> + accesses.
> +
> + Rik also argued that the sample frequency is too low to get accurate access
> + frequency measurements, I'm not entirely convinced, even at low sample
> + frequencies the avg elapsed time 'e' over multiple samples should still
> + give us a fair approximation of the avg access frequency 'a'.
> +
> + So doing both b&c has a fair chance of working and allowing us to distinguish
> + between important and less important memory accesses.
> +
> + Experimentation has shown no benefit from the added frequency term so far. ]
> +
> +This will give us 'bp_i' and 'bs_i,k' so that we can approximately compute
> +'T_k,l'. Our optimization problem now reads:
> +
> + min_C [ \Sum_i bs_i,l for every n_i = k, l != k ]
> +
> +And includes only shared terms, this makes sense since all task private memory
> +will become local as per 2.
> +
> +This suggests that if there is significant shared memory, we should try and
> +move towards it.
> +
> + 4) move towards where 'most' memory is
> +
> +The simplest significance test is comparing the biggest shared 's_i,k' against
> +the private 'p_i'. If we have more shared than private, move towards it.
> +
> +This effectively makes us move towards where most our memory is and forms a
> +feed-back loop with 2. We migrate memory towards us and we migrate towards
> +where 'most' memory is.
> +
> +(Note: even if there were two tasks fully thrashing the same shared memory, it
> + is very rare for there to be a 50/50 split in memory, lacking a perfect
> + split, the smaller will move towards the larger. In case of the perfect
> + split, we'll tie-break towards the lower node number.)
> +
> + 5) 'throttle' 4's node placement
> +
> +Since per 2b our 's_i,k' and 'p_i' require at least two scans to 'stabilize'
> +and show representative numbers, we should limit node-migration to not be
> +faster than this.
> +
> + n) poke holes in previous that require more stuff and describe it.
> --
> 1.7.11.7
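
For anyone who prefers code to formulas, here is a minimal sketch of two pieces of the quoted document: the fairness terms 'L_n' and 'L_n,m', and the access estimate 'a = I / e' from 3b. The struct and function names are invented for illustration, and plain doubles are used instead of whatever fixed-point representation a kernel implementation would need.

```c
#include <stddef.h>

/* Hypothetical task record: weight 'w_i' and the cpu 'c_i' it maps to. */
struct demo_task {
	double weight;
	int cpu;
};

/* L_n = 1/P_n * \Sum_i w_i for all c_i = n */
static double demo_cpu_load(const struct demo_task *tasks, size_t nr_tasks,
			    int cpu, double capacity)
{
	double sum = 0.0;
	size_t i;

	for (i = 0; i < nr_tasks; i++) {
		if (tasks[i].cpu == cpu)
			sum += tasks[i].weight;
	}
	return sum / capacity;
}

/* L_n,m = | L_n - L_m | */
static double demo_load_diff(double l_n, double l_m)
{
	return l_n > l_m ? l_n - l_m : l_m - l_n;
}

/*
 * 3b: estimate the number of accesses 'a' from the marking interval 'I'
 * and the elapsed time 'e' since the last marking: a = I / e.
 * The caller must make sure elapsed is non-zero.
 */
static double demo_access_estimate(double interval, double elapsed)
{
	return interval / elapsed;
}
```

For example, two tasks of weight 1024 on cpu 0 and one on cpu 1, with equal capacity 1024, give loads 2.0 and 1.0 and a load difference of 1.0; a page touched 25 time units after a marking interval of 100 would bump its node counter by 4 instead of 1.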
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/