Add another layer of fallback policy to make the home node concept
useful from a memory allocation PoV.

This changes the mpol order to:

 - vma->vm_ops->get_policy	[if applicable]
 - vma->vm_policy		[if applicable]
 - task->mempolicy
 - tsk_home_node() preferred	[NEW]
 - default_policy

Note that the tsk_home_node() policy has Migrate-on-Fault enabled to
facilitate efficient on-demand memory migration.

Signed-off-by: Peter Zijlstra <[email protected]>
Cc: Paul Turner <[email protected]>
Cc: Lee Schermerhorn <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Linus Torvalds <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
---
mm/mempolicy.c | 29 +++++++++++++++++++++++++++--
1 file changed, 27 insertions(+), 2 deletions(-)
Index: tip/mm/mempolicy.c
===================================================================
--- tip.orig/mm/mempolicy.c
+++ tip/mm/mempolicy.c
@@ -117,6 +117,22 @@ static struct mempolicy default_policy =
 	.flags = MPOL_F_LOCAL,
 };
 
+static struct mempolicy preferred_node_policy[MAX_NUMNODES];
+
+static struct mempolicy *get_task_policy(struct task_struct *p)
+{
+	struct mempolicy *pol = p->mempolicy;
+	int node;
+
+	if (!pol) {
+		node = tsk_home_node(p);
+		if (node != -1)
+			pol = &preferred_node_policy[node];
+	}
+
+	return pol;
+}
+
 static const struct mempolicy_operations {
 	int (*create)(struct mempolicy *pol, const nodemask_t *nodes);
 	/*
@@ -1565,7 +1581,7 @@ asmlinkage long compat_sys_mbind(compat_
 struct mempolicy *get_vma_policy(struct task_struct *task,
 		struct vm_area_struct *vma, unsigned long addr)
 {
-	struct mempolicy *pol = task->mempolicy;
+	struct mempolicy *pol = get_task_policy(task);
 
 	if (vma) {
 		if (vma->vm_ops && vma->vm_ops->get_policy) {
@@ -1965,7 +1981,7 @@ retry_cpuset:
  */
 struct page *alloc_pages_current(gfp_t gfp, unsigned order)
 {
-	struct mempolicy *pol = current->mempolicy;
+	struct mempolicy *pol = get_task_policy(current);
 	struct page *page;
 	unsigned int cpuset_mems_cookie;
 
@@ -2424,6 +2440,15 @@ void __init numa_policy_init(void)
 					 sizeof(struct sp_node),
 					 0, SLAB_PANIC, NULL);
 
+	for_each_node(nid) {
+		preferred_node_policy[nid] = (struct mempolicy) {
+			.refcnt = ATOMIC_INIT(1),
+			.mode = MPOL_PREFERRED,
+			.flags = MPOL_F_MOF,
+			.v = { .preferred_node = nid, },
+		};
+	}
+
 	/*
 	 * Set interleaving policy for system init. Interleaving is only
 	 * enabled across suitably sized nodes (default is >= 16MB), or

On Thu, Oct 25, 2012 at 02:16:37PM +0200, Peter Zijlstra wrote:
> Add another layer of fallback policy to make the home node concept
> useful from a memory allocation PoV.
>
> This changes the mpol order to:
>
> - vma->vm_ops->get_policy [if applicable]
> - vma->vm_policy [if applicable]
> - task->mempolicy
> - tsk_home_node() preferred [NEW]
> - default_policy
>
> Note that the tsk_home_node() policy has Migrate-on-Fault enabled to
> facilitate efficient on-demand memory migration.
>
Makes sense and it looks like a VMA policy, if set, will still override
the home_node policy as you'd expect. At some point this may need to cope
with node hot-remove. Also, at some point this must be dealing with the
case where mbind() is called but the home_node is not in the nodemask.
Does that happen somewhere else in the series? (maybe I'll see it later)
--
Mel Gorman
SUSE Labs
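
For readers following the thread, the lookup order from the changelog, and the observation that a VMA policy still overrides the home-node policy, can be modelled in user space. The following is only a minimal sketch, not kernel code: every `model_*` and `MODEL_*` name is hypothetical and merely mirrors the shape of get_task_policy() and get_vma_policy() in the patch, with the reference counting, shared-policy handling and the `addr` argument left out.

```c
#include <stddef.h>

#define MODEL_MAX_NODES	4	/* hypothetical stand-in for MAX_NUMNODES */
#define MODEL_NO_NODE	(-1)	/* "no home node", like the -1 check in the patch */

enum model_mode { MODEL_DEFAULT, MODEL_PREFERRED, MODEL_BIND };

struct model_policy {
	enum model_mode mode;
	int node;		/* preferred node, or MODEL_NO_NODE */
	int migrate_on_fault;	/* models MPOL_F_MOF */
};

struct model_vma {
	struct model_policy *(*get_policy)(struct model_vma *vma);
	struct model_policy *vm_policy;
};

struct model_task {
	struct model_policy *mempolicy;
	int home_node;
};

struct model_policy model_default_policy = { MODEL_DEFAULT, MODEL_NO_NODE, 0 };
struct model_policy model_preferred[MODEL_MAX_NODES];

/* Build one "preferred node, migrate on fault" policy per node. */
void model_policy_init(void)
{
	for (int nid = 0; nid < MODEL_MAX_NODES; nid++)
		model_preferred[nid] =
			(struct model_policy){ MODEL_PREFERRED, nid, 1 };
}

/* Task policy first; fall back to the home-node policy if there is none. */
struct model_policy *model_task_policy(const struct model_task *t)
{
	if (t->mempolicy)
		return t->mempolicy;
	if (t->home_node >= 0 && t->home_node < MODEL_MAX_NODES)
		return &model_preferred[t->home_node];
	return NULL;
}

/* VMA-level policies override whatever the task level produced. */
struct model_policy *model_lookup(const struct model_task *t,
				  struct model_vma *vma)
{
	struct model_policy *pol = model_task_policy(t);

	if (vma) {
		if (vma->get_policy) {
			struct model_policy *vpol = vma->get_policy(vma);

			if (vpol)
				pol = vpol;
		} else if (vma->vm_policy) {
			pol = vma->vm_policy;
		}
	}
	if (!pol)
		pol = &model_default_policy;
	return pol;
}
```

In this model, a task with home node 2 and no policy of its own resolves to `model_preferred[2]`, while the same task faulting in a VMA that carries a bind-style `vm_policy` resolves to that VMA policy and the home node is never consulted. Whether the real series needs extra handling when the home node lies outside an mbind() nodemask is exactly the open question above; the sketch does not answer it.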

On 11/01/2012 06:58 AM, Mel Gorman wrote:
> On Thu, Oct 25, 2012 at 02:16:37PM +0200, Peter Zijlstra wrote:
>> Add another layer of fallback policy to make the home node concept
>> useful from a memory allocation PoV.
>>
>> This changes the mpol order to:
>>
>> - vma->vm_ops->get_policy [if applicable]
>> - vma->vm_policy [if applicable]
>> - task->mempolicy
>> - tsk_home_node() preferred [NEW]
>> - default_policy
>>
>> Note that the tsk_home_node() policy has Migrate-on-Fault enabled to
>> facilitate efficient on-demand memory migration.
>>
>
> Makes sense and it looks like a VMA policy, if set, will still override
> the home_node policy as you'd expect. At some point this may need to cope
> with node hot-remove. Also, at some point this must be dealing with the
> case where mbind() is called but the home_node is not in the nodemask.
> Does that happen somewhere else in the series? (maybe I'll see it later)
>
I'd expect one of the first steps in hot-removing a node to be taking
its CPUs offline (or at least making them unschedulable). The tasks
would then be migrated to other nodes/processors, which should result
in a home node update, the same as if the scheduler had simply chosen
a better home for them anyway. The memory would then migrate either
via the tasks' own home node change or via the migration that
evacuates the outgoing node (with the new home node as the preferred
migration target).

As long as no one wants to do something crazy like offlining a node
before taking its resources away from the scheduler and memory
management, it should all work out.

Don Morris
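
The ordering argument above can be illustrated with a toy simulation. This is a sketch under loud assumptions: all `sim_*` and `SIM_*` names are hypothetical, "re-homing" simply picks the first node that still has CPUs online, and "evacuation" moves a page to its owner's home node. None of this reflects how the scheduler or the hot-remove code actually chooses targets.

```c
#include <stdbool.h>
#include <stddef.h>

#define SIM_NODES	2
#define SIM_NO_NODE	(-1)

struct sim_task {
	int home_node;
};

struct sim_page {
	int node;			/* node the page currently lives on */
	const struct sim_task *owner;	/* task that uses the page */
};

bool sim_cpus_online[SIM_NODES] = { true, true };

/*
 * Step 1: take the node's CPUs away from the scheduler and re-home every
 * task that lived there. Returns the new home node, or SIM_NO_NODE if the
 * node id is invalid or no other node has CPUs left.
 */
int sim_offline_cpus(int nid, struct sim_task *tasks, size_t ntasks)
{
	int target = SIM_NO_NODE;

	if (nid < 0 || nid >= SIM_NODES)
		return SIM_NO_NODE;

	sim_cpus_online[nid] = false;
	for (int n = 0; n < SIM_NODES; n++) {
		if (sim_cpus_online[n]) {
			target = n;
			break;
		}
	}
	if (target == SIM_NO_NODE)
		return SIM_NO_NODE;

	for (size_t i = 0; i < ntasks; i++)
		if (tasks[i].home_node == nid)
			tasks[i].home_node = target;
	return target;
}

/*
 * Step 2: evacuate the outgoing node, preferring each owner's home node as
 * the migration target. Pages whose owner is still homed on the outgoing
 * node have nowhere better to go and stay put. Returns the pages moved.
 */
size_t sim_evacuate(int nid, struct sim_page *pages, size_t npages)
{
	size_t moved = 0;

	for (size_t i = 0; i < npages; i++) {
		int home = pages[i].owner->home_node;

		if (pages[i].node != nid)
			continue;
		if (home == SIM_NO_NODE || home == nid)
			continue;
		pages[i].node = home;
		moved++;
	}
	return moved;
}
```

Calling `sim_evacuate()` before `sim_offline_cpus()` moves nothing in this model, because the owning tasks are still homed on the outgoing node; that is the "crazy" ordering the message warns about. Calling them in the described order re-homes the tasks first, after which the pages follow.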