2009-11-04 08:12:17

by Kamezawa Hiroyuki

Subject: [BUGFIX][PATCH] oom-kill: fix NUMA constraint check with nodemask

If the clarification is not enough, please let me know.
Thank you all for your reviews of the RFC version.
Thanks,
-Kame
==
From: KAMEZAWA Hiroyuki <[email protected]>

Fix node-oriented allocation handling in oom_kill.c.
I myself consider this a bugfix, not an enhancement.

These days, things have changed:
- alloc_pages() takes a nodemask as an argument, via __alloc_pages_nodemask().
- mempolicy no longer maintains its own private zonelists.
(And cpuset doesn't pass a nodemask to __alloc_pages_nodemask())

So the current oom-killer's check function is wrong.

This patch
- checks the nodemask: if nodemask && nodemask != node_states[N_HIGH_MEMORY],
this can never be CONSTRAINT_NONE. We assume such a nodemask comes from
mempolicy.
- scans all zones in the zonelist under the nodemask; if the walk hits
cpuset's wall, this failure is from cpuset.
And
- modifies the caller of out_of_memory() not to invoke the OOM killer for
__GFP_THISNODE. This doesn't change "current" behavior: callers that use
__GFP_THISNODE should handle "page allocation failure" by themselves.
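
For reference, the path by which a mempolicy nodemask now reaches the
allocator looks roughly like this (a sketch of mm/mempolicy.c from this era;
details trimmed, so treat it as illustrative rather than the exact source):

	/* mm/mempolicy.c (sketch): MPOL_BIND hands its nodemask down */
	struct page *alloc_pages_current(gfp_t gfp, unsigned order)
	{
		struct mempolicy *pol = current->mempolicy;

		if (!pol || in_interrupt() || (gfp & __GFP_THISNODE))
			pol = &default_policy;
		/* interleave handling elided */
		return __alloc_pages_nodemask(gfp, order,
				policy_zonelist(gfp, pol),	/* zonelist to walk */
				policy_nodemask(gfp, pol));	/* non-NULL for MPOL_BIND */
	}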

Changelog: 2009/11/04
- cut out as a bugfix for passing the nodemask.
- added a __GFP_THISNODE sanity check.

Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
---
drivers/char/sysrq.c | 2 +-
mm/oom_kill.c | 38 ++++++++++++++++++++++++++------------
mm/page_alloc.c | 10 ++++++++--
3 files changed, 35 insertions(+), 15 deletions(-)

Index: mmotm-2.6.32-Nov2/drivers/char/sysrq.c
===================================================================
--- mmotm-2.6.32-Nov2.orig/drivers/char/sysrq.c
+++ mmotm-2.6.32-Nov2/drivers/char/sysrq.c
@@ -339,7 +339,7 @@ static struct sysrq_key_op sysrq_term_op

static void moom_callback(struct work_struct *ignored)
{
- out_of_memory(node_zonelist(0, GFP_KERNEL), GFP_KERNEL, 0);
+ out_of_memory(node_zonelist(0, GFP_KERNEL), GFP_KERNEL, 0, NULL);
}

static DECLARE_WORK(moom_work, moom_callback);
Index: mmotm-2.6.32-Nov2/mm/oom_kill.c
===================================================================
--- mmotm-2.6.32-Nov2.orig/mm/oom_kill.c
+++ mmotm-2.6.32-Nov2/mm/oom_kill.c
@@ -196,27 +196,40 @@ unsigned long badness(struct task_struct
/*
* Determine the type of allocation constraint.
*/
+#ifdef CONFIG_NUMA
static inline enum oom_constraint constrained_alloc(struct zonelist *zonelist,
- gfp_t gfp_mask)
+ gfp_t gfp_mask, nodemask_t *nodemask)
{
-#ifdef CONFIG_NUMA
struct zone *zone;
struct zoneref *z;
enum zone_type high_zoneidx = gfp_zone(gfp_mask);
- nodemask_t nodes = node_states[N_HIGH_MEMORY];
+ int ret = CONSTRAINT_NONE;

- for_each_zone_zonelist(zone, z, zonelist, high_zoneidx)
- if (cpuset_zone_allowed_softwall(zone, gfp_mask))
- node_clear(zone_to_nid(zone), nodes);
- else
+ /*
+ * The nodemask here is the nodemask passed to alloc_pages(). Now,
+ * cpuset doesn't use this nodemask for its hardwall/softwall/hierarchy
+ * feature, so mempolicy is the only user of this nodemask.
+ */
+ if (nodemask && nodes_equal(*nodemask, node_states[N_HIGH_MEMORY]))
+ ret = CONSTRAINT_MEMORY_POLICY;
+
+ /* Check whether this allocation failure is caused by cpuset's wall function */
+ for_each_zone_zonelist_nodemask(zone, z, zonelist,
+ high_zoneidx, nodemask)
+ if (!cpuset_zone_allowed_softwall(zone, gfp_mask))
return CONSTRAINT_CPUSET;

- if (!nodes_empty(nodes))
- return CONSTRAINT_MEMORY_POLICY;
-#endif
+ /* __GFP_THISNODE never calls OOM.*/

+ return ret;
+}
+#else
+static inline enum oom_constraint constrained_alloc(struct zonelist *zonelist,
+ gfp_t gfp_mask, nodemask_t *nodemask)
+{
return CONSTRAINT_NONE;
}
+#endif

/*
* Simple selection loop. We chose the process with the highest
@@ -603,7 +616,8 @@ rest_and_return:
* OR try to be smart about which process to kill. Note that we
* don't have to be perfect here, we just have to be good.
*/
-void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask, int order)
+void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
+ int order, nodemask_t *nodemask)
{
unsigned long freed = 0;
enum oom_constraint constraint;
@@ -622,7 +636,7 @@ void out_of_memory(struct zonelist *zone
* Check if there were limitations on the allocation (only relevant for
* NUMA) that may require different handling.
*/
- constraint = constrained_alloc(zonelist, gfp_mask);
+ constraint = constrained_alloc(zonelist, gfp_mask, nodemask);
read_lock(&tasklist_lock);

switch (constraint) {
Index: mmotm-2.6.32-Nov2/mm/page_alloc.c
===================================================================
--- mmotm-2.6.32-Nov2.orig/mm/page_alloc.c
+++ mmotm-2.6.32-Nov2/mm/page_alloc.c
@@ -1667,9 +1667,15 @@ __alloc_pages_may_oom(gfp_t gfp_mask, un
/* The OOM killer will not help higher order allocs */
if (order > PAGE_ALLOC_COSTLY_ORDER && !(gfp_mask & __GFP_NOFAIL))
goto out;
-
+ /*
+ * Usually, GFP_THISNODE contains __GFP_NORETRY and we never hit this.
+ * Sanity check for bare calls of __GFP_THISNODE, not real OOM.
+ * Note: Hugepage uses it but will hit PAGE_ALLOC_COSTLY_ORDER.
+ */
+ if (gfp_mask & __GFP_THISNODE)
+ goto out;
/* Exhausted what can be done so it's blamo time */
- out_of_memory(zonelist, gfp_mask, order);
+ out_of_memory(zonelist, gfp_mask, order, nodemask);

out:
clear_zonelist_oom(zonelist, gfp_mask);


2009-11-06 00:04:36

by Kamezawa Hiroyuki

Subject: [BUGFIX][PATCH] oom-kill: fix NUMA constraint check with nodemask v2

On Wed, 4 Nov 2009 17:09:44 +0900
KAMEZAWA Hiroyuki <[email protected]> wrote:
> drivers/char/sysrq.c | 2 +-
> mm/oom_kill.c | 38 ++++++++++++++++++++++++++------------
> mm/page_alloc.c | 10 ++++++++--

Very sorry, this patch didn't include the changes to oom.h...

From: KAMEZAWA Hiroyuki <[email protected]>

Fix node-oriented allocation handling in oom_kill.c.
I myself consider this a bugfix, not an enhancement.

These days, things have changed:
- alloc_pages() takes a nodemask as an argument, via __alloc_pages_nodemask().
- mempolicy no longer maintains its own private zonelists.
(And cpuset doesn't pass a nodemask to __alloc_pages_nodemask())

So the current oom-killer's check function is wrong.

This patch
- checks the nodemask: if nodemask && nodemask != node_states[N_HIGH_MEMORY],
this can never be CONSTRAINT_NONE. We assume such a nodemask comes from
mempolicy.
- scans all zones in the zonelist under the nodemask; if the walk hits
cpuset's wall, this failure is from cpuset.
And
- modifies the caller of out_of_memory() not to invoke the OOM killer for
__GFP_THISNODE. This doesn't change "current" behavior: callers that use
__GFP_THISNODE should handle "page allocation failure" by themselves.

Changelog: 2009/11/06
- fixed lack of oom.h

Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
---
drivers/char/sysrq.c | 2 +-
include/linux/oom.h | 3 ++-
mm/oom_kill.c | 38 ++++++++++++++++++++++++++------------
mm/page_alloc.c | 10 ++++++++--
4 files changed, 37 insertions(+), 16 deletions(-)

Index: mmotm-2.6.32-Nov2/drivers/char/sysrq.c
===================================================================
--- mmotm-2.6.32-Nov2.orig/drivers/char/sysrq.c
+++ mmotm-2.6.32-Nov2/drivers/char/sysrq.c
@@ -339,7 +339,7 @@ static struct sysrq_key_op sysrq_term_op

static void moom_callback(struct work_struct *ignored)
{
- out_of_memory(node_zonelist(0, GFP_KERNEL), GFP_KERNEL, 0);
+ out_of_memory(node_zonelist(0, GFP_KERNEL), GFP_KERNEL, 0, NULL);
}

static DECLARE_WORK(moom_work, moom_callback);
Index: mmotm-2.6.32-Nov2/mm/oom_kill.c
===================================================================
--- mmotm-2.6.32-Nov2.orig/mm/oom_kill.c
+++ mmotm-2.6.32-Nov2/mm/oom_kill.c
@@ -196,27 +196,40 @@ unsigned long badness(struct task_struct
/*
* Determine the type of allocation constraint.
*/
+#ifdef CONFIG_NUMA
static inline enum oom_constraint constrained_alloc(struct zonelist *zonelist,
- gfp_t gfp_mask)
+ gfp_t gfp_mask, nodemask_t *nodemask)
{
-#ifdef CONFIG_NUMA
struct zone *zone;
struct zoneref *z;
enum zone_type high_zoneidx = gfp_zone(gfp_mask);
- nodemask_t nodes = node_states[N_HIGH_MEMORY];
+ int ret = CONSTRAINT_NONE;

- for_each_zone_zonelist(zone, z, zonelist, high_zoneidx)
- if (cpuset_zone_allowed_softwall(zone, gfp_mask))
- node_clear(zone_to_nid(zone), nodes);
- else
+ /*
+ * The nodemask here is the nodemask passed to alloc_pages(). Now,
+ * cpuset doesn't use this nodemask for its hardwall/softwall/hierarchy
+ * feature, so mempolicy is the only user of this nodemask.
+ */
+ if (nodemask && nodes_equal(*nodemask, node_states[N_HIGH_MEMORY]))
+ ret = CONSTRAINT_MEMORY_POLICY;
+
+ /* Check whether this allocation failure is caused by cpuset's wall function */
+ for_each_zone_zonelist_nodemask(zone, z, zonelist,
+ high_zoneidx, nodemask)
+ if (!cpuset_zone_allowed_softwall(zone, gfp_mask))
return CONSTRAINT_CPUSET;

- if (!nodes_empty(nodes))
- return CONSTRAINT_MEMORY_POLICY;
-#endif
+ /* __GFP_THISNODE never calls OOM.*/

+ return ret;
+}
+#else
+static inline enum oom_constraint constrained_alloc(struct zonelist *zonelist,
+ gfp_t gfp_mask, nodemask_t *nodemask)
+{
return CONSTRAINT_NONE;
}
+#endif

/*
* Simple selection loop. We chose the process with the highest
@@ -603,7 +616,8 @@ rest_and_return:
* OR try to be smart about which process to kill. Note that we
* don't have to be perfect here, we just have to be good.
*/
-void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask, int order)
+void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
+ int order, nodemask_t *nodemask)
{
unsigned long freed = 0;
enum oom_constraint constraint;
@@ -622,7 +636,7 @@ void out_of_memory(struct zonelist *zone
* Check if there were limitations on the allocation (only relevant for
* NUMA) that may require different handling.
*/
- constraint = constrained_alloc(zonelist, gfp_mask);
+ constraint = constrained_alloc(zonelist, gfp_mask, nodemask);
read_lock(&tasklist_lock);

switch (constraint) {
Index: mmotm-2.6.32-Nov2/mm/page_alloc.c
===================================================================
--- mmotm-2.6.32-Nov2.orig/mm/page_alloc.c
+++ mmotm-2.6.32-Nov2/mm/page_alloc.c
@@ -1667,9 +1667,15 @@ __alloc_pages_may_oom(gfp_t gfp_mask, un
/* The OOM killer will not help higher order allocs */
if (order > PAGE_ALLOC_COSTLY_ORDER && !(gfp_mask & __GFP_NOFAIL))
goto out;
-
+ /*
+ * Usually, GFP_THISNODE contains __GFP_NORETRY and we never hit this.
+ * Sanity check for bare calls of __GFP_THISNODE, not real OOM.
+ * Note: Hugepage uses it but will hit PAGE_ALLOC_COSTLY_ORDER.
+ */
+ if (gfp_mask & __GFP_THISNODE)
+ goto out;
/* Exhausted what can be done so it's blamo time */
- out_of_memory(zonelist, gfp_mask, order);
+ out_of_memory(zonelist, gfp_mask, order, nodemask);

out:
clear_zonelist_oom(zonelist, gfp_mask);
Index: mmotm-2.6.32-Nov2/include/linux/oom.h
===================================================================
--- mmotm-2.6.32-Nov2.orig/include/linux/oom.h
+++ mmotm-2.6.32-Nov2/include/linux/oom.h
@@ -27,7 +27,8 @@ enum oom_constraint {
extern int try_set_zone_oom(struct zonelist *zonelist, gfp_t gfp_flags);
extern void clear_zonelist_oom(struct zonelist *zonelist, gfp_t gfp_flags);

-extern void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask, int order);
+extern void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
+ int order, nodemask_t *mask);
extern int register_oom_notifier(struct notifier_block *nb);
extern int unregister_oom_notifier(struct notifier_block *nb);

2009-11-10 07:24:20

by KOSAKI Motohiro

Subject: Re: [BUGFIX][PATCH] oom-kill: fix NUMA constraint check with nodemask v2

Hi

> ===================================================================
> --- mmotm-2.6.32-Nov2.orig/mm/oom_kill.c
> +++ mmotm-2.6.32-Nov2/mm/oom_kill.c
> @@ -196,27 +196,40 @@ unsigned long badness(struct task_struct
> /*
> * Determine the type of allocation constraint.
> */
> +#ifdef CONFIG_NUMA
> static inline enum oom_constraint constrained_alloc(struct zonelist *zonelist,
> - gfp_t gfp_mask)
> + gfp_t gfp_mask, nodemask_t *nodemask)
> {
> -#ifdef CONFIG_NUMA
> struct zone *zone;
> struct zoneref *z;
> enum zone_type high_zoneidx = gfp_zone(gfp_mask);
> - nodemask_t nodes = node_states[N_HIGH_MEMORY];
> + int ret = CONSTRAINT_NONE;
>
> - for_each_zone_zonelist(zone, z, zonelist, high_zoneidx)
> - if (cpuset_zone_allowed_softwall(zone, gfp_mask))
> - node_clear(zone_to_nid(zone), nodes);
> - else
> + /*
> + * The nodemask here is the nodemask passed to alloc_pages(). Now,
> + * cpuset doesn't use this nodemask for its hardwall/softwall/hierarchy
> + * feature, so mempolicy is the only user of this nodemask.
> + */
> + if (nodemask && nodes_equal(*nodemask, node_states[N_HIGH_MEMORY]))
> + ret = CONSTRAINT_MEMORY_POLICY;

!nodes_equal() ?


> +
> + /* Check whether this allocation failure is caused by cpuset's wall function */
> + for_each_zone_zonelist_nodemask(zone, z, zonelist,
> + high_zoneidx, nodemask)
> + if (!cpuset_zone_allowed_softwall(zone, gfp_mask))
> return CONSTRAINT_CPUSET;

If cpuset and MPOL_BIND are both used, CONSTRAINT_MEMORY_POLICY is probably
the better choice.



>
> - if (!nodes_empty(nodes))
> - return CONSTRAINT_MEMORY_POLICY;
> -#endif
> + /* __GFP_THISNODE never calls OOM.*/
>
> + return ret;
> +}
> +#else
> +static inline enum oom_constraint constrained_alloc(struct zonelist *zonelist,
> + gfp_t gfp_mask, nodemask_t *nodemask)
> +{
> return CONSTRAINT_NONE;
> }
> +#endif

2009-11-10 07:27:27

by Kamezawa Hiroyuki

Subject: Re: [BUGFIX][PATCH] oom-kill: fix NUMA constraint check with nodemask v2

On Tue, 10 Nov 2009 16:24:22 +0900 (JST)
KOSAKI Motohiro <[email protected]> wrote:

> Hi
>
> > ===================================================================
> > --- mmotm-2.6.32-Nov2.orig/mm/oom_kill.c
> > +++ mmotm-2.6.32-Nov2/mm/oom_kill.c
> > @@ -196,27 +196,40 @@ unsigned long badness(struct task_struct
> > /*
> > * Determine the type of allocation constraint.
> > */
> > +#ifdef CONFIG_NUMA
> > static inline enum oom_constraint constrained_alloc(struct zonelist *zonelist,
> > - gfp_t gfp_mask)
> > + gfp_t gfp_mask, nodemask_t *nodemask)
> > {
> > -#ifdef CONFIG_NUMA
> > struct zone *zone;
> > struct zoneref *z;
> > enum zone_type high_zoneidx = gfp_zone(gfp_mask);
> > - nodemask_t nodes = node_states[N_HIGH_MEMORY];
> > + int ret = CONSTRAINT_NONE;
> >
> > - for_each_zone_zonelist(zone, z, zonelist, high_zoneidx)
> > - if (cpuset_zone_allowed_softwall(zone, gfp_mask))
> > - node_clear(zone_to_nid(zone), nodes);
> > - else
> > + /*
> > + * The nodemask here is the nodemask passed to alloc_pages(). Now,
> > + * cpuset doesn't use this nodemask for its hardwall/softwall/hierarchy
> > + * feature, so mempolicy is the only user of this nodemask.
> > + */
> > + if (nodemask && nodes_equal(*nodemask, node_states[N_HIGH_MEMORY]))
> > + ret = CONSTRAINT_MEMORY_POLICY;
>
> !nodes_equal() ?
>
yes. will fix.

>
> > +
> > + /* Check whether this allocation failure is caused by cpuset's wall function */
> > + for_each_zone_zonelist_nodemask(zone, z, zonelist,
> > + high_zoneidx, nodemask)
> > + if (!cpuset_zone_allowed_softwall(zone, gfp_mask))
> > return CONSTRAINT_CPUSET;
>
> If cpuset and MPOL_BIND are both used, CONSTRAINT_MEMORY_POLICY is probably
> the better choice.
>

No. This memory allocation failed because of the limitation of cpuset's alloc mask,
not because of mempolicy.


Thanks,
-Kame

2009-11-10 07:39:01

by KOSAKI Motohiro

Subject: Re: [BUGFIX][PATCH] oom-kill: fix NUMA constraint check with nodemask v2

> > > +
> > > + /* Check whether this allocation failure is caused by cpuset's wall function */
> > > + for_each_zone_zonelist_nodemask(zone, z, zonelist,
> > > + high_zoneidx, nodemask)
> > > + if (!cpuset_zone_allowed_softwall(zone, gfp_mask))
> > > return CONSTRAINT_CPUSET;
> >
> > If cpuset and MPOL_BIND are both used, CONSTRAINT_MEMORY_POLICY is probably
> > the better choice.
>
> No. This memory allocation failed because of the limitation of cpuset's alloc mask,
> not because of mempolicy.

But CONSTRAINT_CPUSET doesn't help free the necessary node's memory. It isn't
your fault; the original code is wrong too, but I hope we can fix it.


2009-11-10 07:43:28

by Kamezawa Hiroyuki

Subject: Re: [BUGFIX][PATCH] oom-kill: fix NUMA constraint check with nodemask v2

On Tue, 10 Nov 2009 16:39:02 +0900 (JST)
KOSAKI Motohiro <[email protected]> wrote:

> > > > +
> > > > + /* Check whether this allocation failure is caused by cpuset's wall function */
> > > > + for_each_zone_zonelist_nodemask(zone, z, zonelist,
> > > > + high_zoneidx, nodemask)
> > > > + if (!cpuset_zone_allowed_softwall(zone, gfp_mask))
> > > > return CONSTRAINT_CPUSET;
> > >
> > > If cpuset and MPOL_BIND are both used, CONSTRAINT_MEMORY_POLICY is probably
> > > the better choice.
> >
> > No. This memory allocation failed because of the limitation of cpuset's alloc mask,
> > not because of mempolicy.
>
> But CONSTRAINT_CPUSET doesn't help free the necessary node's memory. It isn't
> your fault; the original code is wrong too, but I hope we can fix it.
>
Hmm, maybe fair enough.

My 3rd version will use the "always kill current (CONSTRAINT_MEMORY_POLICY does this)
if it uses mempolicy" logic.

Objections ?

Thanks,
-Kame

2009-11-10 08:05:01

by Daisuke Nishimura

Subject: Re: [BUGFIX][PATCH] oom-kill: fix NUMA constraint check with nodemask v2

On Tue, 10 Nov 2009 16:40:55 +0900, KAMEZAWA Hiroyuki <[email protected]> wrote:
> On Tue, 10 Nov 2009 16:39:02 +0900 (JST)
> KOSAKI Motohiro <[email protected]> wrote:
>
> > > > > +
> > > > > + /* Check whether this allocation failure is caused by cpuset's wall function */
> > > > > + for_each_zone_zonelist_nodemask(zone, z, zonelist,
> > > > > + high_zoneidx, nodemask)
> > > > > + if (!cpuset_zone_allowed_softwall(zone, gfp_mask))
> > > > > return CONSTRAINT_CPUSET;
> > > >
> > > > If cpuset and MPOL_BIND are both used, CONSTRAINT_MEMORY_POLICY is probably
> > > > the better choice.
> > >
> > > No. This memory allocation failed because of the limitation of cpuset's alloc mask,
> > > not because of mempolicy.
> >
> > But CONSTRAINT_CPUSET doesn't help free the necessary node's memory. It isn't
> > your fault; the original code is wrong too, but I hope we can fix it.
> >
I think so too.

> Hmm, maybe fair enough.
>
> > My 3rd version will use the "always kill current (CONSTRAINT_MEMORY_POLICY does this)
> > if it uses mempolicy" logic.
>
"if it uses mempoicy" ?
You mean "kill allways current if memory allocation has failed by limitation of
cpuset's mask"(i.e. CONSTRAINT_CPUSET case) ?


Thanks,
Daisuke Nishimura.

2009-11-10 08:19:43

by Kamezawa Hiroyuki

Subject: Re: [BUGFIX][PATCH] oom-kill: fix NUMA constraint check with nodemask v2

On Tue, 10 Nov 2009 17:03:38 +0900
Daisuke Nishimura <[email protected]> wrote:

> On Tue, 10 Nov 2009 16:40:55 +0900, KAMEZAWA Hiroyuki <[email protected]> wrote:
> > On Tue, 10 Nov 2009 16:39:02 +0900 (JST)
> > KOSAKI Motohiro <[email protected]> wrote:
> >
> > > > > > +
> > > > > > + /* Check whether this allocation failure is caused by cpuset's wall function */
> > > > > > + for_each_zone_zonelist_nodemask(zone, z, zonelist,
> > > > > > + high_zoneidx, nodemask)
> > > > > > + if (!cpuset_zone_allowed_softwall(zone, gfp_mask))
> > > > > > return CONSTRAINT_CPUSET;
> > > > >
> > > > > If cpuset and MPOL_BIND are both used, CONSTRAINT_MEMORY_POLICY is probably
> > > > > the better choice.
> > > >
> > > > No. This memory allocation failed because of the limitation of cpuset's alloc mask,
> > > > not because of mempolicy.
> > >
> > > But CONSTRAINT_CPUSET doesn't help free the necessary node's memory. It isn't
> > > your fault; the original code is wrong too, but I hope we can fix it.
> > >
> I think so too.
>
> > Hmm, maybe fair enough.
> >
> > My 3rd version will use the "always kill current (CONSTRAINT_MEMORY_POLICY does this)
> > if it uses mempolicy" logic.
> >
> "if it uses mempoicy" ?
> You mean "kill allways current if memory allocation has failed by limitation of
> cpuset's mask"(i.e. CONSTRAINT_CPUSET case) ?
>

No. "kill always current process if memory allocation uses mempolicy"
regardless of cpuset. If the task doesn't use mempolicy allocation,
usual CONSTRAINT_CPUSET/CONSTRAINT_NONE oom handler will be invoked.

Now, without the patch, CONSTRAINT_MEMORY_POLICY is not returned at all. I'd
like to limit the scope of this patch to returning it. If it's returned,
current will be killed.
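
In other words, the decision flow being proposed (a sketch; constraint names
as in include/linux/oom.h):

	/*
	 * constrained_alloc() outcome, as described above:
	 *
	 *   nodemask given && it doesn't cover all of N_HIGH_MEMORY
	 *       -> CONSTRAINT_MEMORY_POLICY  (always kill current)
	 *   else some zone in the zonelist is rejected by the cpuset
	 *       -> CONSTRAINT_CPUSET         (usual victim selection)
	 *   else
	 *       -> CONSTRAINT_NONE           (usual victim selection)
	 */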

Finally, we'll have to consider the "how to manage oom under cpuset"
problem again. It's not handled in a good way now.

The main problems are:
- Cpuset allows intersections of nodes among groups.
- A task can be migrated to another cpuset without moving its memory.
- We don't have per-node RSS information per task.

Then,
- We have to scan all tasks.
- We have to invoke Totally-Random-Innocent-Task-Killer and pray that
someone bad will be killed.

IMHO, "find correct one" is too heavy to the kernel (under cpuset).
If we can have notifier to userland, some daemon can check numa_maps of all
tasks and will do something reasonbale.


Thanks,
-Kame

2009-11-11 02:26:44

by Kamezawa Hiroyuki

Subject: [BUGFIX][PATCH] oom-kill: fix NUMA constraint check with nodemask v3

From: KAMEZAWA Hiroyuki <[email protected]>

Fix node-oriented allocation handling in oom_kill.c.
I myself consider this a bugfix, not an enhancement.

These days, things have changed:
- alloc_pages() takes a nodemask as an argument, via __alloc_pages_nodemask().
- mempolicy no longer maintains its own private zonelists.
(And cpuset doesn't pass a nodemask to __alloc_pages_nodemask())

But the current oom-killer doesn't handle the nodemask, and that's wrong.

This patch
- checks the nodemask: if nodemask && the nodemask doesn't cover all of
node_states[N_HIGH_MEMORY], this is CONSTRAINT_MEMORY_POLICY.
- scans all zones in the zonelist under the nodemask; if the walk hits
cpuset's wall, this failure is from cpuset.
And
- modifies the caller of out_of_memory() not to invoke the OOM killer for
__GFP_THISNODE. This doesn't change "current" behavior: callers that use
__GFP_THISNODE should handle "page allocation failure" by themselves.

Changelog: 2009/11/11
- fixed the nodes_equal() calculation.
- always return CONSTRAINT_MEMORY_POLICY if the given nodemask is not big enough.

Changelog: 2009/11/06
- fixed lack of oom.h

Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
---
drivers/char/sysrq.c | 2 +-
include/linux/oom.h | 4 +++-
mm/oom_kill.c | 44 ++++++++++++++++++++++++++++++++------------
mm/page_alloc.c | 10 ++++++++--
4 files changed, 44 insertions(+), 16 deletions(-)

Index: mm-test-kernel/drivers/char/sysrq.c
===================================================================
--- mm-test-kernel.orig/drivers/char/sysrq.c
+++ mm-test-kernel/drivers/char/sysrq.c
@@ -339,7 +339,7 @@ static struct sysrq_key_op sysrq_term_op

static void moom_callback(struct work_struct *ignored)
{
- out_of_memory(node_zonelist(0, GFP_KERNEL), GFP_KERNEL, 0);
+ out_of_memory(node_zonelist(0, GFP_KERNEL), GFP_KERNEL, 0, NULL);
}

static DECLARE_WORK(moom_work, moom_callback);
Index: mm-test-kernel/mm/oom_kill.c
===================================================================
--- mm-test-kernel.orig/mm/oom_kill.c
+++ mm-test-kernel/mm/oom_kill.c
@@ -196,27 +196,45 @@ unsigned long badness(struct task_struct
/*
* Determine the type of allocation constraint.
*/
+#ifdef CONFIG_NUMA
static inline enum oom_constraint constrained_alloc(struct zonelist *zonelist,
- gfp_t gfp_mask)
+ gfp_t gfp_mask, nodemask_t *nodemask)
{
-#ifdef CONFIG_NUMA
struct zone *zone;
struct zoneref *z;
enum zone_type high_zoneidx = gfp_zone(gfp_mask);
- nodemask_t nodes = node_states[N_HIGH_MEMORY];
+ int ret = CONSTRAINT_NONE;

- for_each_zone_zonelist(zone, z, zonelist, high_zoneidx)
- if (cpuset_zone_allowed_softwall(zone, gfp_mask))
- node_clear(zone_to_nid(zone), nodes);
- else
+ /*
+ * The nodemask here is the nodemask passed to alloc_pages(). Now,
+ * cpuset doesn't use this nodemask for its hardwall/softwall/hierarchy
+ * feature; mempolicy is the only user of the nodemask here.
+ */
+ if (nodemask) {
+ nodemask_t mask;
+ /* check mempolicy's nodemask contains all N_HIGH_MEMORY */
+ nodes_and(mask, *nodemask, node_states[N_HIGH_MEMORY]);
+ if (!nodes_equal(mask, node_states[N_HIGH_MEMORY]))
+ return CONSTRAINT_MEMORY_POLICY;
+ }
+
+ /* Check whether this allocation failure is caused by cpuset's wall function */
+ for_each_zone_zonelist_nodemask(zone, z, zonelist,
+ high_zoneidx, nodemask)
+ if (!cpuset_zone_allowed_softwall(zone, gfp_mask))
return CONSTRAINT_CPUSET;

- if (!nodes_empty(nodes))
- return CONSTRAINT_MEMORY_POLICY;
-#endif
+ /* __GFP_THISNODE never calls OOM.*/

return CONSTRAINT_NONE;
}
+#else
+static inline enum oom_constraint constrained_alloc(struct zonelist *zonelist,
+ gfp_t gfp_mask, nodemask_t *nodemask)
+{
+ return CONSTRAINT_NONE;
+}
+#endif

/*
* Simple selection loop. We chose the process with the highest
@@ -603,7 +621,8 @@ rest_and_return:
* OR try to be smart about which process to kill. Note that we
* don't have to be perfect here, we just have to be good.
*/
-void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask, int order)
+void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
+ int order, nodemask_t *nodemask)
{
unsigned long freed = 0;
enum oom_constraint constraint;
@@ -622,11 +641,12 @@ void out_of_memory(struct zonelist *zone
* Check if there were limitations on the allocation (only relevant for
* NUMA) that may require different handling.
*/
- constraint = constrained_alloc(zonelist, gfp_mask);
+ constraint = constrained_alloc(zonelist, gfp_mask, nodemask);
read_lock(&tasklist_lock);

switch (constraint) {
case CONSTRAINT_MEMORY_POLICY:
+ /* kill by process's its own memory alloc limitation */
oom_kill_process(current, gfp_mask, order, 0, NULL,
"No available memory (MPOL_BIND)");
break;
Index: mm-test-kernel/mm/page_alloc.c
===================================================================
--- mm-test-kernel.orig/mm/page_alloc.c
+++ mm-test-kernel/mm/page_alloc.c
@@ -1667,9 +1667,15 @@ __alloc_pages_may_oom(gfp_t gfp_mask, un
/* The OOM killer will not help higher order allocs */
if (order > PAGE_ALLOC_COSTLY_ORDER && !(gfp_mask & __GFP_NOFAIL))
goto out;
-
+ /*
+ * Usually, GFP_THISNODE contains __GFP_NORETRY and we never hit this.
+ * Sanity check for bare calls of __GFP_THISNODE, not real OOM.
+ * Note: Hugepage uses it but will hit PAGE_ALLOC_COSTLY_ORDER.
+ */
+ if (gfp_mask & __GFP_THISNODE)
+ goto out;
/* Exhausted what can be done so it's blamo time */
- out_of_memory(zonelist, gfp_mask, order);
+ out_of_memory(zonelist, gfp_mask, order, nodemask);

out:
clear_zonelist_oom(zonelist, gfp_mask);
Index: mm-test-kernel/include/linux/oom.h
===================================================================
--- mm-test-kernel.orig/include/linux/oom.h
+++ mm-test-kernel/include/linux/oom.h
@@ -10,6 +10,7 @@
#ifdef __KERNEL__

#include <linux/types.h>
+#include <linux/nodemask.h>

struct zonelist;
struct notifier_block;
@@ -26,7 +27,8 @@ enum oom_constraint {
extern int try_set_zone_oom(struct zonelist *zonelist, gfp_t gfp_flags);
extern void clear_zonelist_oom(struct zonelist *zonelist, gfp_t gfp_flags);

-extern void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask, int order);
+extern void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
+ int order, nodemask_t *mask);
extern int register_oom_notifier(struct notifier_block *nb);
extern int unregister_oom_notifier(struct notifier_block *nb);

2009-11-11 02:37:05

by KOSAKI Motohiro

Subject: Re: [BUGFIX][PATCH] oom-kill: fix NUMA constraint check with nodemask v3

> From: KAMEZAWA Hiroyuki <[email protected]>
>
> Fix node-oriented allocation handling in oom_kill.c.
> I myself consider this a bugfix, not an enhancement.
>
> These days, things have changed:
> - alloc_pages() takes a nodemask as an argument, via __alloc_pages_nodemask().
> - mempolicy no longer maintains its own private zonelists.
> (And cpuset doesn't pass a nodemask to __alloc_pages_nodemask())
>
> But the current oom-killer doesn't handle the nodemask, and that's wrong.
>
> This patch
> - checks the nodemask: if nodemask && the nodemask doesn't cover all of
> node_states[N_HIGH_MEMORY], this is CONSTRAINT_MEMORY_POLICY.
> - scans all zones in the zonelist under the nodemask; if the walk hits
> cpuset's wall, this failure is from cpuset.
> And
> - modifies the caller of out_of_memory() not to invoke the OOM killer for
> __GFP_THISNODE. This doesn't change "current" behavior: callers that use
> __GFP_THISNODE should handle "page allocation failure" by themselves.
>
> Changelog: 2009/11/11
> - fixed the nodes_equal() calculation.
> - always return CONSTRAINT_MEMORY_POLICY if the given nodemask is not big enough.
>
> Changelog: 2009/11/06
> - fixed lack of oom.h
>
> Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
> ---
> drivers/char/sysrq.c | 2 +-
> include/linux/oom.h | 4 +++-
> mm/oom_kill.c | 44 ++++++++++++++++++++++++++++++++------------
> mm/page_alloc.c | 10 ++++++++--
> 4 files changed, 44 insertions(+), 16 deletions(-)
>

Looks good to me.
Reviewed-by: KOSAKI Motohiro <[email protected]>


2009-11-11 02:49:55

by David Rientjes

Subject: Re: [BUGFIX][PATCH] oom-kill: fix NUMA constraint check with nodemask v3

On Wed, 11 Nov 2009, KAMEZAWA Hiroyuki wrote:

> Index: mm-test-kernel/drivers/char/sysrq.c
> ===================================================================
> --- mm-test-kernel.orig/drivers/char/sysrq.c
> +++ mm-test-kernel/drivers/char/sysrq.c
> @@ -339,7 +339,7 @@ static struct sysrq_key_op sysrq_term_op
>
> static void moom_callback(struct work_struct *ignored)
> {
> - out_of_memory(node_zonelist(0, GFP_KERNEL), GFP_KERNEL, 0);
> + out_of_memory(node_zonelist(0, GFP_KERNEL), GFP_KERNEL, 0, NULL);
> }
>
> static DECLARE_WORK(moom_work, moom_callback);
> Index: mm-test-kernel/mm/oom_kill.c
> ===================================================================
> --- mm-test-kernel.orig/mm/oom_kill.c
> +++ mm-test-kernel/mm/oom_kill.c
> @@ -196,27 +196,45 @@ unsigned long badness(struct task_struct
> /*
> * Determine the type of allocation constraint.
> */
> +#ifdef CONFIG_NUMA
> static inline enum oom_constraint constrained_alloc(struct zonelist *zonelist,
> - gfp_t gfp_mask)
> + gfp_t gfp_mask, nodemask_t *nodemask)

We should probably remove the inline specifier, there's only one caller
currently and if additional ones were added in the future this function
should probably not be replicated.

> {
> -#ifdef CONFIG_NUMA
> struct zone *zone;
> struct zoneref *z;
> enum zone_type high_zoneidx = gfp_zone(gfp_mask);
> - nodemask_t nodes = node_states[N_HIGH_MEMORY];
> + int ret = CONSTRAINT_NONE;
>
> - for_each_zone_zonelist(zone, z, zonelist, high_zoneidx)
> - if (cpuset_zone_allowed_softwall(zone, gfp_mask))
> - node_clear(zone_to_nid(zone), nodes);
> - else
> + /*
> + * The nodemask here is the nodemask passed to alloc_pages(). Now,
> + * cpuset doesn't use this nodemask for its hardwall/softwall/hierarchy
> + * feature; mempolicy is the only user of the nodemask here.
> + */
> + if (nodemask) {
> + nodemask_t mask;
> + /* check mempolicy's nodemask contains all N_HIGH_MEMORY */
> + nodes_and(mask, *nodemask, node_states[N_HIGH_MEMORY]);
> + if (!nodes_equal(mask, node_states[N_HIGH_MEMORY]))
> + return CONSTRAINT_MEMORY_POLICY;
> + }

Although a nodemask_t was previously allocated on the stack, we should
probably change this to use NODEMASK_ALLOC() for kernels with higher
CONFIG_NODES_SHIFT since allocations can happen very deep into the stack.

There should be a way around that, however. Shouldn't

if (nodes_subset(node_states[N_HIGH_MEMORY], *nodemask))
return CONSTRAINT_MEMORY_POLICY;

be sufficient?
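
A quick way to convince oneself that the nodes_and()+nodes_equal() pair and
nodes_subset() compute the same predicate, with plain bitmasks standing in
for nodemask_t (a self-contained userspace sketch, not kernel code):

	#include <assert.h>

	int main(void)
	{
		unsigned long all = 0xf;	/* stand-in for node_states[N_HIGH_MEMORY] */
		unsigned long mask;		/* stand-in for the mempolicy nodemask */

		for (mask = 0; mask <= 0xf; mask++) {
			/* nodes_and(m, mask, all); nodes_equal(m, all) */
			int and_then_equal = ((mask & all) == all);
			/* nodes_subset(all, mask): every node in "all" is in "mask" */
			int subset = ((all & ~mask) == 0);

			assert(and_then_equal == subset);
		}
		return 0;
	}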

> +
> + /* Check whether this allocation failure is caused by cpuset's wall function */
> + for_each_zone_zonelist_nodemask(zone, z, zonelist,
> + high_zoneidx, nodemask)
> + if (!cpuset_zone_allowed_softwall(zone, gfp_mask))
> return CONSTRAINT_CPUSET;
>
> - if (!nodes_empty(nodes))
> - return CONSTRAINT_MEMORY_POLICY;
> -#endif
> + /* __GFP_THISNODE never calls OOM.*/
>
> return CONSTRAINT_NONE;
> }
> +#else
> +static inline enum oom_constraint constrained_alloc(struct zonelist *zonelist,
> + gfp_t gfp_mask, nodemask_t *nodemask)
> +{
> + return CONSTRAINT_NONE;
> +}
> +#endif
>
> /*
> * Simple selection loop. We chose the process with the highest
> @@ -603,7 +621,8 @@ rest_and_return:
> * OR try to be smart about which process to kill. Note that we
> * don't have to be perfect here, we just have to be good.
> */
> -void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask, int order)
> +void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
> + int order, nodemask_t *nodemask)
> {
> unsigned long freed = 0;
> enum oom_constraint constraint;
> @@ -622,11 +641,12 @@ void out_of_memory(struct zonelist *zone
> * Check if there were limitations on the allocation (only relevant for
> * NUMA) that may require different handling.
> */
> - constraint = constrained_alloc(zonelist, gfp_mask);
> + constraint = constrained_alloc(zonelist, gfp_mask, nodemask);
> read_lock(&tasklist_lock);
>
> switch (constraint) {
> case CONSTRAINT_MEMORY_POLICY:
> + /* kill by process's its own memory alloc limitation */

I don't understand this comment.

> oom_kill_process(current, gfp_mask, order, 0, NULL,
> "No available memory (MPOL_BIND)");
> break;
> Index: mm-test-kernel/mm/page_alloc.c
> ===================================================================
> --- mm-test-kernel.orig/mm/page_alloc.c
> +++ mm-test-kernel/mm/page_alloc.c
> @@ -1667,9 +1667,15 @@ __alloc_pages_may_oom(gfp_t gfp_mask, un
> /* The OOM killer will not help higher order allocs */
> if (order > PAGE_ALLOC_COSTLY_ORDER && !(gfp_mask & __GFP_NOFAIL))
> goto out;
> -
> + /*
> > + * Usually, GFP_THISNODE contains __GFP_NORETRY and we never hit this.
> + * Sanity check for bare calls of __GFP_THISNODE, not real OOM.
> + * Note: Hugepage uses it but will hit PAGE_ALLOC_COSTLY_ORDER.
> + */
> + if (gfp_mask & __GFP_THISNODE)
> + goto out;
> /* Exhausted what can be done so it's blamo time */
> - out_of_memory(zonelist, gfp_mask, order);
> + out_of_memory(zonelist, gfp_mask, order, nodemask);
>
> out:
> clear_zonelist_oom(zonelist, gfp_mask);

This doesn't seem like the right place for this check; should we even try
direct reclaim for bare users of __GFP_THISNODE? If we're adding it for
sanity even though no callers would currently hit it, it also is a
potential escape route for __GFP_NOFAIL.
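
(The v4 repost below closes that hole by nesting both early exits under a
!(gfp_mask & __GFP_NOFAIL) check; a sketch of the resulting shape:)

	if (!(gfp_mask & __GFP_NOFAIL)) {
		/* The OOM killer will not help higher order allocs */
		if (order > PAGE_ALLOC_COSTLY_ORDER)
			goto out;
		/* Bare __GFP_THISNODE callers handle the failure themselves */
		if (gfp_mask & __GFP_THISNODE)
			goto out;
	}
	/* Exhausted what can be done so it's blamo time */
	out_of_memory(zonelist, gfp_mask, order, nodemask);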

2009-11-11 03:02:10

by KOSAKI Motohiro

Subject: Re: [BUGFIX][PATCH] oom-kill: fix NUMA constraint check with nodemask v3

Hi

> On Wed, 11 Nov 2009, KAMEZAWA Hiroyuki wrote:
>
> > Index: mm-test-kernel/drivers/char/sysrq.c
> > ===================================================================
> > --- mm-test-kernel.orig/drivers/char/sysrq.c
> > +++ mm-test-kernel/drivers/char/sysrq.c
> > @@ -339,7 +339,7 @@ static struct sysrq_key_op sysrq_term_op
> >
> > static void moom_callback(struct work_struct *ignored)
> > {
> > - out_of_memory(node_zonelist(0, GFP_KERNEL), GFP_KERNEL, 0);
> > + out_of_memory(node_zonelist(0, GFP_KERNEL), GFP_KERNEL, 0, NULL);
> > }
> >
> > static DECLARE_WORK(moom_work, moom_callback);
> > Index: mm-test-kernel/mm/oom_kill.c
> > ===================================================================
> > --- mm-test-kernel.orig/mm/oom_kill.c
> > +++ mm-test-kernel/mm/oom_kill.c
> > @@ -196,27 +196,45 @@ unsigned long badness(struct task_struct
> > /*
> > * Determine the type of allocation constraint.
> > */
> > +#ifdef CONFIG_NUMA
> > static inline enum oom_constraint constrained_alloc(struct zonelist *zonelist,
> > - gfp_t gfp_mask)
> > + gfp_t gfp_mask, nodemask_t *nodemask)
>
> We should probably remove the inline specifier, there's only one caller
> currently and if additional ones were added in the future this function
> should probably not be replicated.

Good catch.


> > {
> > -#ifdef CONFIG_NUMA
> > struct zone *zone;
> > struct zoneref *z;
> > enum zone_type high_zoneidx = gfp_zone(gfp_mask);
> > - nodemask_t nodes = node_states[N_HIGH_MEMORY];
> > + int ret = CONSTRAINT_NONE;
> >
> > - for_each_zone_zonelist(zone, z, zonelist, high_zoneidx)
> > - if (cpuset_zone_allowed_softwall(zone, gfp_mask))
> > - node_clear(zone_to_nid(zone), nodes);
> > - else
> > + /*
> > > + * The nodemask here is the nodemask passed to alloc_pages(). Now,
> > > + * cpuset doesn't use this nodemask for its hardwall/softwall/hierarchy
> > > + * feature; mempolicy is the only user of the nodemask here.
> > + */
> > + if (nodemask) {
> > + nodemask_t mask;
> > + /* check mempolicy's nodemask contains all N_HIGH_MEMORY */
> > + nodes_and(mask, *nodemask, node_states[N_HIGH_MEMORY]);
> > + if (!nodes_equal(mask, node_states[N_HIGH_MEMORY]))
> > + return CONSTRAINT_MEMORY_POLICY;
> > + }
>
> Although a nodemask_t was previously allocated on the stack, we should
> probably change this to use NODEMASK_ALLOC() for kernels with higher
> CONFIG_NODES_SHIFT since allocations can happen very deep into the stack.

No. NODEMASK_ALLOC() is crap; we should remove it.
Btw, CPUMASK_ALLOC() was already removed.


> There should be a way around that, however. Shouldn't
>
> if (nodes_subset(node_states[N_HIGH_MEMORY], *nodemask))
> return CONSTRAINT_MEMORY_POLICY;
>
> be sufficient?

Is this safe in the memory hotplug case?



2009-11-11 03:06:54

by Kamezawa Hiroyuki

Subject: Re: [BUGFIX][PATCH] oom-kill: fix NUMA constraint check with nodemask v3

On Tue, 10 Nov 2009 18:49:51 -0800 (PST)
David Rientjes <[email protected]> wrote:

> On Wed, 11 Nov 2009, KAMEZAWA Hiroyuki wrote:
>
> > Index: mm-test-kernel/drivers/char/sysrq.c
> > ===================================================================
> > --- mm-test-kernel.orig/drivers/char/sysrq.c
> > +++ mm-test-kernel/drivers/char/sysrq.c
> > @@ -339,7 +339,7 @@ static struct sysrq_key_op sysrq_term_op
> >
> > static void moom_callback(struct work_struct *ignored)
> > {
> > - out_of_memory(node_zonelist(0, GFP_KERNEL), GFP_KERNEL, 0);
> > + out_of_memory(node_zonelist(0, GFP_KERNEL), GFP_KERNEL, 0, NULL);
> > }
> >
> > static DECLARE_WORK(moom_work, moom_callback);
> > Index: mm-test-kernel/mm/oom_kill.c
> > ===================================================================
> > --- mm-test-kernel.orig/mm/oom_kill.c
> > +++ mm-test-kernel/mm/oom_kill.c
> > @@ -196,27 +196,45 @@ unsigned long badness(struct task_struct
> > /*
> > * Determine the type of allocation constraint.
> > */
> > +#ifdef CONFIG_NUMA
> > static inline enum oom_constraint constrained_alloc(struct zonelist *zonelist,
> > - gfp_t gfp_mask)
> > + gfp_t gfp_mask, nodemask_t *nodemask)
>
> We should probably remove the inline specifier, there's only one caller
> currently and if additional ones were added in the future this function
> should probably not be replicated.
>
Hmm, ok, I'll remove it.


> > {
> > -#ifdef CONFIG_NUMA
> > struct zone *zone;
> > struct zoneref *z;
> > enum zone_type high_zoneidx = gfp_zone(gfp_mask);
> > - nodemask_t nodes = node_states[N_HIGH_MEMORY];
> > + int ret = CONSTRAINT_NONE;
> >
> > - for_each_zone_zonelist(zone, z, zonelist, high_zoneidx)
> > - if (cpuset_zone_allowed_softwall(zone, gfp_mask))
> > - node_clear(zone_to_nid(zone), nodes);
> > - else
> > + /*
> > + * The nodemask here is the nodemask passed to alloc_pages(). Now,
> > + * cpuset doesn't use this nodemask for its hardwall/softwall/hierarchy
> > + * feature; mempolicy is the only user of the nodemask here.
> > + */
> > + if (nodemask) {
> > + nodemask_t mask;
> > + /* check mempolicy's nodemask contains all N_HIGH_MEMORY */
> > + nodes_and(mask, *nodemask, node_states[N_HIGH_MEMORY]);
> > + if (!nodes_equal(mask, node_states[N_HIGH_MEMORY]))
> > + return CONSTRAINT_MEMORY_POLICY;
> > + }
>
> Although a nodemask_t was previously allocated on the stack, we should
> probably change this to use NODEMASK_ALLOC() for kernels with higher
> CONFIG_NODES_SHIFT since allocations can happen very deep into the stack.
>
> There should be a way around that, however. Shouldn't
>
> if (nodes_subset(node_states[N_HIGH_MEMORY], *nodemask))
> return CONSTRAINT_MEMORY_POLICY;
>
> be sufficient?
>

Ah, I didn't notice nodes_subset(). Thank you, I'll use it.
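
(As it lands in v4 below, the check carries a negation, since
nodes_subset(a, b) tests whether a is a subset of b:)

	/* mempolicy's nodemask does not cover all of N_HIGH_MEMORY? */
	if (nodemask && !nodes_subset(node_states[N_HIGH_MEMORY], *nodemask))
		return CONSTRAINT_MEMORY_POLICY;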


> > +
> > + /* Check whether this allocation failure is caused by cpuset's wall function */
> > + for_each_zone_zonelist_nodemask(zone, z, zonelist,
> > + high_zoneidx, nodemask)
> > + if (!cpuset_zone_allowed_softwall(zone, gfp_mask))
> > return CONSTRAINT_CPUSET;
> >
> > - if (!nodes_empty(nodes))
> > - return CONSTRAINT_MEMORY_POLICY;
> > -#endif
> > + /* __GFP_THISNODE never calls OOM.*/
> >
> > return CONSTRAINT_NONE;
> > }
> > +#else
> > +static inline enum oom_constraint constrained_alloc(struct zonelist *zonelist,
> > + gfp_t gfp_mask, nodemask_t *nodemask)
> > +{
> > + return CONSTRAINT_NONE;
> > +}
> > +#endif
> >
> > /*
> > * Simple selection loop. We chose the process with the highest
> > @@ -603,7 +621,8 @@ rest_and_return:
> > * OR try to be smart about which process to kill. Note that we
> > * don't have to be perfect here, we just have to be good.
> > */
> > -void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask, int order)
> > +void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
> > + int order, nodemask_t *nodemask)
> > {
> > unsigned long freed = 0;
> > enum oom_constraint constraint;
> > @@ -622,11 +641,12 @@ void out_of_memory(struct zonelist *zone
> > * Check if there were limitations on the allocation (only relevant for
> > * NUMA) that may require different handling.
> > */
> > - constraint = constrained_alloc(zonelist, gfp_mask);
> > + constraint = constrained_alloc(zonelist, gfp_mask, nodemask);
> > read_lock(&tasklist_lock);
> >
> > switch (constraint) {
> > case CONSTRAINT_MEMORY_POLICY:
> > + /* kill by process's its own memory alloc limitation */
>
> I don't understand this comment.
>
I'll remove this. But it seems not to be well known that current is always killed
under CONSTRAINT_MEMORY_POLICY.

> > oom_kill_process(current, gfp_mask, order, 0, NULL,
> > "No available memory (MPOL_BIND)");
> > break;
> > Index: mm-test-kernel/mm/page_alloc.c
> > ===================================================================
> > --- mm-test-kernel.orig/mm/page_alloc.c
> > +++ mm-test-kernel/mm/page_alloc.c
> > @@ -1667,9 +1667,15 @@ __alloc_pages_may_oom(gfp_t gfp_mask, un
> > /* The OOM killer will not help higher order allocs */
> > if (order > PAGE_ALLOC_COSTLY_ORDER && !(gfp_mask & __GFP_NOFAIL))
> > goto out;
> > -
> > + /*
> > + * Usually, GFP_THISNODE contains __GFP_NORETRY and we never hit this.
> > + * Sanity check for bare calls of __GFP_THISNODE, not real OOM.
> > + * Note: Hugepage uses it but will hit PAGE_ALLOC_COSTLY_ORDER.
> > + */
> > + if (gfp_mask & __GFP_THISNODE)
> > + goto out;
> > /* Exhausted what can be done so it's blamo time */
> > - out_of_memory(zonelist, gfp_mask, order);
> > + out_of_memory(zonelist, gfp_mask, order, nodemask);
> >
> > out:
> > clear_zonelist_oom(zonelist, gfp_mask);
>
> This doesn't seem like the right place for this check; should we even try
> direct reclaim for bare users of __GFP_THISNODE?
No, hugepage has to do reclaim.

> If we're adding it for sanity even though no callers would currently hit it,
> it also is a potential escape route for __GFP_NOFAIL.

I'll add a __GFP_NOFAIL check.

Thanks,
-Kame




2009-11-11 03:13:04

by Kamezawa Hiroyuki

Subject: Re: [BUGFIX][PATCH] oom-kill: fix NUMA constraint check with nodemask v3

On Wed, 11 Nov 2009 12:02:06 +0900 (JST)
KOSAKI Motohiro <[email protected]> wrote:

> > There should be a way around that, however. Shouldn't
> >
> > if (nodes_subset(node_states[N_HIGH_MEMORY], *nodemask))
> > return CONSTRAINT_MEMORY_POLICY;
> >
> > be sufficient?
>
> Is this safe in the memory hotplug case?
>
N_HIGH_MEMORY is updated at memory hotplug.

Thanks,
-Kame

2009-11-11 03:14:30

by David Rientjes

Subject: Re: [BUGFIX][PATCH] oom-kill: fix NUMA constraint check with nodemask v3

On Wed, 11 Nov 2009, KOSAKI Motohiro wrote:

> > > {
> > > -#ifdef CONFIG_NUMA
> > > struct zone *zone;
> > > struct zoneref *z;
> > > enum zone_type high_zoneidx = gfp_zone(gfp_mask);
> > > - nodemask_t nodes = node_states[N_HIGH_MEMORY];
> > > + int ret = CONSTRAINT_NONE;
> > >
> > > - for_each_zone_zonelist(zone, z, zonelist, high_zoneidx)
> > > - if (cpuset_zone_allowed_softwall(zone, gfp_mask))
> > > - node_clear(zone_to_nid(zone), nodes);
> > > - else
> > > + /*
> > > + * The nodemask here is the nodemask passed to alloc_pages(). Now,
> > > + * cpuset doesn't use this nodemask for its hardwall/softwall/hierarchy
> > > + * feature; mempolicy is the only user of the nodemask here.
> > > + */
> > > + if (nodemask) {
> > > + nodemask_t mask;
> > > + /* check mempolicy's nodemask contains all N_HIGH_MEMORY */
> > > + nodes_and(mask, *nodemask, node_states[N_HIGH_MEMORY]);
> > > + if (!nodes_equal(mask, node_states[N_HIGH_MEMORY]))
> > > + return CONSTRAINT_MEMORY_POLICY;
> > > + }
> >
> > Although a nodemask_t was previously allocated on the stack, we should
> > probably change this to use NODEMASK_ALLOC() for kernels with higher
> > CONFIG_NODES_SHIFT since allocations can happen very deep into the stack.
>
> > No. NODEMASK_ALLOC() is crap; we should remove it.

I've booted 1K node systems and have found it to be helpful to ensure that
the stack will not overflow especially in areas where we normally are deep
already, such as in the page allocator.

> Btw, CPUMASK_ALLOC() was already removed.

I don't remember CPUMASK_ALLOC() actually being merged. I know the
comment exists in nodemask.h, but I don't recall any CPUMASK_ALLOC() users
in the tree.

2009-11-11 03:23:41

by KOSAKI Motohiro

Subject: Re: [BUGFIX][PATCH] oom-kill: fix NUMA constraint check with nodemask v3

> On Wed, 11 Nov 2009, KOSAKI Motohiro wrote:
>
> > > > {
> > > > -#ifdef CONFIG_NUMA
> > > > struct zone *zone;
> > > > struct zoneref *z;
> > > > enum zone_type high_zoneidx = gfp_zone(gfp_mask);
> > > > - nodemask_t nodes = node_states[N_HIGH_MEMORY];
> > > > + int ret = CONSTRAINT_NONE;
> > > >
> > > > - for_each_zone_zonelist(zone, z, zonelist, high_zoneidx)
> > > > - if (cpuset_zone_allowed_softwall(zone, gfp_mask))
> > > > - node_clear(zone_to_nid(zone), nodes);
> > > > - else
> > > > + /*
> > > > + * The nodemask here is the nodemask passed to alloc_pages(). Now,
> > > > + * cpuset doesn't use this nodemask for its hardwall/softwall/hierarchy
> > > > + * feature; mempolicy is the only user of the nodemask here.
> > > > + */
> > > > + if (nodemask) {
> > > > + nodemask_t mask;
> > > > + /* check mempolicy's nodemask contains all N_HIGH_MEMORY */
> > > > + nodes_and(mask, *nodemask, node_states[N_HIGH_MEMORY]);
> > > > + if (!nodes_equal(mask, node_states[N_HIGH_MEMORY]))
> > > > + return CONSTRAINT_MEMORY_POLICY;
> > > > + }
> > >
> > > Although a nodemask_t was previously allocated on the stack, we should
> > > probably change this to use NODEMASK_ALLOC() for kernels with higher
> > > CONFIG_NODES_SHIFT since allocations can happen very deep into the stack.
> >
> > No. NODEMASK_ALLOC() is crap; we should remove it.
>
> I've booted 1K node systems and have found it to be helpful to ensure that
> the stack will not overflow especially in areas where we normally are deep
> already, such as in the page allocator.

Linux doesn't support 1K nodes (and only SGI's huge machines use 512 nodes).

At least, NODEMASK_ALLOC() should get a cleaner interface. The current one
and struct nodemask_scratch are pretty ugly.


> > Btw, CPUMASK_ALLOC() was already removed.
>
> I don't remember CPUMASK_ALLOC() actually being merged. I know the
> comment exists in nodemask.h, but I don't recall any CPUMASK_ALLOC() users
> in the tree.


2009-11-11 03:27:46

by David Rientjes

Subject: Re: [BUGFIX][PATCH] oom-kill: fix NUMA constraint check with nodemask v3

On Wed, 11 Nov 2009, KOSAKI Motohiro wrote:

> Linux doesn't support 1K nodes (and only SGI's huge machines use 512 nodes).
>

I know for a fact that it does on x86 if you adjust CONFIG_NODES_SHIFT;
I've booted kernels all the way back to 2.6.18 with 1K nodes.

> At least, NODEMASK_ALLOC() should get a cleaner interface. The current one
> and struct nodemask_scratch are pretty ugly.
>

I agree; I haven't been a fan of nodemask_scratch because I think its use
case is pretty limited, but I do advocate using NODEMASK_ALLOC() when deep
in the stack. In -mm, we've made sure that most of the mempolicy code does
that where manipulating nodemasks is common.
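
For context, the NODEMASK_ALLOC() pattern under discussion has roughly this
shape (modeled on include/linux/nodemask.h; the exact size threshold and
argument list were still in flux in -mm, so treat them as assumptions):

	#if NODES_SHIFT > 8 /* nodemask_t is bigger than 32 bytes */
	/* too large for a deep stack: heap-allocate the nodemask */
	#define NODEMASK_ALLOC(type, name, gfp_flags)	\
			type *name = kmalloc(sizeof(*name), gfp_flags)
	#define NODEMASK_FREE(m)	kfree(m)
	#else
	/* small enough: keep it on the stack behind the same interface */
	#define NODEMASK_ALLOC(type, name, gfp_flags)	type _##name, *name = &_##name
	#define NODEMASK_FREE(m)	do {} while (0)
	#endif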

2009-11-11 04:47:57

by Kamezawa Hiroyuki

Subject: [BUGFIX][PATCH] oom-kill: fix NUMA constraint check with nodemask v4

From: KAMEZAWA Hiroyuki <[email protected]>

Fix node-oriented allocation handling in oom_kill.c.
I myself consider this a bugfix, not an enhancement.

These days, things have changed:
- alloc_pages() takes a nodemask as an argument, via __alloc_pages_nodemask().
- mempolicy no longer maintains its own private zonelists.
(And cpuset doesn't pass a nodemask to __alloc_pages_nodemask())

So the current oom-killer's check function is wrong.

This patch
- checks the nodemask: if nodemask && the nodemask doesn't cover all of
node_states[N_HIGH_MEMORY], this is CONSTRAINT_MEMORY_POLICY.
- scans all zones in the zonelist under the nodemask; if the walk hits
cpuset's wall, this failure is from cpuset.
And
- modifies the caller of out_of_memory() not to invoke the OOM killer for
__GFP_THISNODE. This doesn't change "current" behavior: callers that use
__GFP_THISNODE should handle "page allocation failure" by themselves.

- handles the __GFP_NOFAIL + __GFP_THISNODE path.
This is something like a FIXME, but this gfp_mask combination is not used now.

Changelog: 2009/11/11(2)
- uses nodes_subset().
- clean up.
- added the __GFP_NOFAIL case, and added a warning.

Changelog: 2009/11/11
- fixed the nodes_equal() calculation.
- always return CONSTRAINT_MEMORY_POLICY if the given nodemask is not big enough.

Changelog: 2009/11/06
- fixed lack of oom.h

Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
---
drivers/char/sysrq.c | 2 +-
include/linux/oom.h | 4 +++-
mm/oom_kill.c | 45 +++++++++++++++++++++++++++++++++------------
mm/page_alloc.c | 20 +++++++++++++++-----
4 files changed, 52 insertions(+), 19 deletions(-)

Index: mm-test-kernel/drivers/char/sysrq.c
===================================================================
--- mm-test-kernel.orig/drivers/char/sysrq.c
+++ mm-test-kernel/drivers/char/sysrq.c
@@ -339,7 +339,7 @@ static struct sysrq_key_op sysrq_term_op

static void moom_callback(struct work_struct *ignored)
{
- out_of_memory(node_zonelist(0, GFP_KERNEL), GFP_KERNEL, 0);
+ out_of_memory(node_zonelist(0, GFP_KERNEL), GFP_KERNEL, 0, NULL);
}

static DECLARE_WORK(moom_work, moom_callback);
Index: mm-test-kernel/mm/oom_kill.c
===================================================================
--- mm-test-kernel.orig/mm/oom_kill.c
+++ mm-test-kernel/mm/oom_kill.c
@@ -196,27 +196,47 @@ unsigned long badness(struct task_struct
/*
* Determine the type of allocation constraint.
*/
+#ifdef CONFIG_NUMA
static inline enum oom_constraint constrained_alloc(struct zonelist *zonelist,
- gfp_t gfp_mask)
+ gfp_t gfp_mask, nodemask_t *nodemask)
{
-#ifdef CONFIG_NUMA
struct zone *zone;
struct zoneref *z;
enum zone_type high_zoneidx = gfp_zone(gfp_mask);
- nodemask_t nodes = node_states[N_HIGH_MEMORY];
+ int ret = CONSTRAINT_NONE;

- for_each_zone_zonelist(zone, z, zonelist, high_zoneidx)
- if (cpuset_zone_allowed_softwall(zone, gfp_mask))
- node_clear(zone_to_nid(zone), nodes);
- else
- return CONSTRAINT_CPUSET;
+ /*
+ * We reach here only when __GFP_NOFAIL is used, so we should avoid
+ * killing current. We have to do a random task kill in this case.
+ * Hopefully CONSTRAINT_THISNODE...but there is no way to handle it now.
+ */
+ if (gfp_mask & __GFP_THISNODE)
+ return ret;

- if (!nodes_empty(nodes))
+ /*
+ * The nodemask here is the nodemask passed to alloc_pages(). Now,
+ * cpuset doesn't use this nodemask for its hardwall/softwall/hierarchy
+ * feature; mempolicy is the only user of the nodemask here.
+ * Check whether mempolicy's nodemask contains all of N_HIGH_MEMORY.
+ */
+ if (nodemask && !nodes_subset(node_states[N_HIGH_MEMORY], *nodemask))
return CONSTRAINT_MEMORY_POLICY;
-#endif

+ /* Check whether this allocation failure is caused by cpuset's wall function */
+ for_each_zone_zonelist_nodemask(zone, z, zonelist,
+ high_zoneidx, nodemask)
+ if (!cpuset_zone_allowed_softwall(zone, gfp_mask))
+ return CONSTRAINT_CPUSET;
+
+ return CONSTRAINT_NONE;
+}
+#else
+static inline enum oom_constraint constrained_alloc(struct zonelist *zonelist,
+ gfp_t gfp_mask, nodemask_t *nodemask)
+{
return CONSTRAINT_NONE;
}
+#endif

/*
* Simple selection loop. We chose the process with the highest
@@ -603,7 +623,8 @@ rest_and_return:
* OR try to be smart about which process to kill. Note that we
* don't have to be perfect here, we just have to be good.
*/
-void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask, int order)
+void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
+ int order, nodemask_t *nodemask)
{
unsigned long freed = 0;
enum oom_constraint constraint;
@@ -622,7 +643,7 @@ void out_of_memory(struct zonelist *zone
* Check if there were limitations on the allocation (only relevant for
* NUMA) that may require different handling.
*/
- constraint = constrained_alloc(zonelist, gfp_mask);
+ constraint = constrained_alloc(zonelist, gfp_mask, nodemask);
read_lock(&tasklist_lock);

switch (constraint) {
Index: mm-test-kernel/mm/page_alloc.c
===================================================================
--- mm-test-kernel.orig/mm/page_alloc.c
+++ mm-test-kernel/mm/page_alloc.c
@@ -1664,12 +1664,22 @@ __alloc_pages_may_oom(gfp_t gfp_mask, un
if (page)
goto out;

- /* The OOM killer will not help higher order allocs */
- if (order > PAGE_ALLOC_COSTLY_ORDER && !(gfp_mask & __GFP_NOFAIL))
- goto out;
-
+ if (!(gfp_mask & __GFP_NOFAIL)) {
+ /* The OOM killer will not help higher order allocs */
+ if (order > PAGE_ALLOC_COSTLY_ORDER)
+ goto out;
+ /*
+ * GFP_THISNODE contains __GFP_NORETRY and we never hit this.
+ * Sanity check for bare calls of __GFP_THISNODE, not real OOM.
+ * The caller should handle page allocation failure by itself if
+ * it specifies __GFP_THISNODE.
+ * Note: Hugepage uses it but will hit PAGE_ALLOC_COSTLY_ORDER.
+ */
+ if (gfp_mask & __GFP_THISNODE)
+ goto out;
+ }
/* Exhausted what can be done so it's blamo time */
- out_of_memory(zonelist, gfp_mask, order);
+ out_of_memory(zonelist, gfp_mask, order, nodemask);

out:
clear_zonelist_oom(zonelist, gfp_mask);
Index: mm-test-kernel/include/linux/oom.h
===================================================================
--- mm-test-kernel.orig/include/linux/oom.h
+++ mm-test-kernel/include/linux/oom.h
@@ -10,6 +10,7 @@
#ifdef __KERNEL__

#include <linux/types.h>
+#include <linux/nodemask.h>

struct zonelist;
struct notifier_block;
@@ -26,7 +27,8 @@ enum oom_constraint {
extern int try_set_zone_oom(struct zonelist *zonelist, gfp_t gfp_flags);
extern void clear_zonelist_oom(struct zonelist *zonelist, gfp_t gfp_flags);

-extern void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask, int order);
+extern void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
+ int order, nodemask_t *mask);
extern int register_oom_notifier(struct notifier_block *nb);
extern int unregister_oom_notifier(struct notifier_block *nb);

2009-11-11 05:30:50

by Kamezawa Hiroyuki

Subject: [BUGFIX][PATCH] oom-kill: fix NUMA constraint check with nodemask v4.1

Sorry, I missed removing the 'inline'...
==
From: KAMEZAWA Hiroyuki <[email protected]>

Fix node-oriented allocation handling in oom_kill.c.
I myself consider this a bugfix, not an enhancement.

These days, things have changed:
- alloc_pages() takes a nodemask as an argument, via __alloc_pages_nodemask().
- mempolicy no longer maintains its own private zonelists.
(And cpuset doesn't pass a nodemask to __alloc_pages_nodemask())

So the current oom-killer's check function is wrong.

This patch
- checks the nodemask: if nodemask && the nodemask doesn't cover all of
node_states[N_HIGH_MEMORY], this is CONSTRAINT_MEMORY_POLICY.
- scans all zones in the zonelist under the nodemask; if the walk hits
cpuset's wall, this failure is from cpuset.
And
- modifies the caller of out_of_memory() not to invoke the OOM killer for
__GFP_THISNODE. This doesn't change "current" behavior: callers that use
__GFP_THISNODE should handle "page allocation failure" by themselves.

- handles the __GFP_NOFAIL + __GFP_THISNODE path.
This is something like a FIXME, but this gfp_mask combination is not used now.

Changelog: 2009/11/11(2)
- uses nodes_subset().
- clean up.
- added the __GFP_NOFAIL case, and added a warning.
- removed inline

Changelog: 2009/11/11
- fixed the nodes_equal() calculation.
- always return CONSTRAINT_MEMORY_POLICY if the given nodemask is not big enough.

Changelog: 2009/11/06
- fixed lack of oom.h

Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
---
drivers/char/sysrq.c | 2 +-
include/linux/oom.h | 4 +++-
mm/oom_kill.c | 47 ++++++++++++++++++++++++++++++++++-------------
mm/page_alloc.c | 20 +++++++++++++++-----
4 files changed, 53 insertions(+), 20 deletions(-)

Index: mm-test-kernel/drivers/char/sysrq.c
===================================================================
--- mm-test-kernel.orig/drivers/char/sysrq.c
+++ mm-test-kernel/drivers/char/sysrq.c
@@ -339,7 +339,7 @@ static struct sysrq_key_op sysrq_term_op

static void moom_callback(struct work_struct *ignored)
{
- out_of_memory(node_zonelist(0, GFP_KERNEL), GFP_KERNEL, 0);
+ out_of_memory(node_zonelist(0, GFP_KERNEL), GFP_KERNEL, 0, NULL);
}

static DECLARE_WORK(moom_work, moom_callback);
Index: mm-test-kernel/mm/oom_kill.c
===================================================================
--- mm-test-kernel.orig/mm/oom_kill.c
+++ mm-test-kernel/mm/oom_kill.c
@@ -196,27 +196,47 @@ unsigned long badness(struct task_struct
/*
* Determine the type of allocation constraint.
*/
-static inline enum oom_constraint constrained_alloc(struct zonelist *zonelist,
- gfp_t gfp_mask)
-{
#ifdef CONFIG_NUMA
+static enum oom_constraint constrained_alloc(struct zonelist *zonelist,
+ gfp_t gfp_mask, nodemask_t *nodemask)
+{
struct zone *zone;
struct zoneref *z;
enum zone_type high_zoneidx = gfp_zone(gfp_mask);
- nodemask_t nodes = node_states[N_HIGH_MEMORY];
+ int ret = CONSTRAINT_NONE;

- for_each_zone_zonelist(zone, z, zonelist, high_zoneidx)
- if (cpuset_zone_allowed_softwall(zone, gfp_mask))
- node_clear(zone_to_nid(zone), nodes);
- else
- return CONSTRAINT_CPUSET;
+ /*
+ * We reach here with __GFP_THISNODE only when __GFP_NOFAIL is also used.
+ * So we should avoid killing current; a random task kill has to happen.
+ * CONSTRAINT_THISNODE would be ideal, but there is no way to handle it now.
+ */
+ if (gfp_mask & __GPF_THISNODE)
+ return ret;

- if (!nodes_empty(nodes))
+ /*
+ * The nodemask here is the nodemask passed to alloc_pages(). Now,
+ * cpuset doesn't use this nodemask for its hardwall/softwall/hierarchy
+ * feature; mempolicy is the only user of the nodemask here.
+ * Check whether mempolicy's nodemask contains all of N_HIGH_MEMORY.
+ */
+ if (nodemask && !nodes_subset(node_states[N_HIGH_MEMORY], *nodemask))
return CONSTRAINT_MEMORY_POLICY;
-#endif

+ /* Check whether this allocation failure is caused by cpuset's wall function */
+ for_each_zone_zonelist_nodemask(zone, z, zonelist,
+ high_zoneidx, nodemask)
+ if (!cpuset_zone_allowed_softwall(zone, gfp_mask))
+ return CONSTRAINT_CPUSET;
+
+ return CONSTRAINT_NONE;
+}
+#else
+static enum oom_constraint constrained_alloc(struct zonelist *zonelist,
+ gfp_t gfp_mask, nodemask_t *nodemask)
+{
return CONSTRAINT_NONE;
}
+#endif

/*
* Simple selection loop. We chose the process with the highest
@@ -603,7 +623,8 @@ rest_and_return:
* OR try to be smart about which process to kill. Note that we
* don't have to be perfect here, we just have to be good.
*/
-void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask, int order)
+void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
+ int order, nodemask_t *nodemask)
{
unsigned long freed = 0;
enum oom_constraint constraint;
@@ -622,7 +643,7 @@ void out_of_memory(struct zonelist *zone
* Check if there were limitations on the allocation (only relevant for
* NUMA) that may require different handling.
*/
- constraint = constrained_alloc(zonelist, gfp_mask);
+ constraint = constrained_alloc(zonelist, gfp_mask, nodemask);
read_lock(&tasklist_lock);

switch (constraint) {
Index: mm-test-kernel/mm/page_alloc.c
===================================================================
--- mm-test-kernel.orig/mm/page_alloc.c
+++ mm-test-kernel/mm/page_alloc.c
@@ -1664,12 +1664,22 @@ __alloc_pages_may_oom(gfp_t gfp_mask, un
if (page)
goto out;

- /* The OOM killer will not help higher order allocs */
- if (order > PAGE_ALLOC_COSTLY_ORDER && !(gfp_mask & __GFP_NOFAIL))
- goto out;
-
+ if (!(gfp_mask & __GFP_NOFAIL)) {
+ /* The OOM killer will not help higher order allocs */
+ if (order > PAGE_ALLOC_COSTLY_ORDER)
+ goto out;
+ /*
+ * GFP_THISNODE contains __GFP_NORETRY, so we never hit this path.
+ * This is a sanity check for bare uses of __GFP_THISNODE, not a real OOM.
+ * The caller should handle page allocation failure by itself if
+ * it specifies __GFP_THISNODE.
+ * Note: hugepage uses it but will hit PAGE_ALLOC_COSTLY_ORDER first.
+ */
+ if (gfp_mask & __GFP_THISNODE)
+ goto out;
+ }
/* Exhausted what can be done so it's blamo time */
- out_of_memory(zonelist, gfp_mask, order);
+ out_of_memory(zonelist, gfp_mask, order, nodemask);

out:
clear_zonelist_oom(zonelist, gfp_mask);
Index: mm-test-kernel/include/linux/oom.h
===================================================================
--- mm-test-kernel.orig/include/linux/oom.h
+++ mm-test-kernel/include/linux/oom.h
@@ -10,6 +10,7 @@
#ifdef __KERNEL__

#include <linux/types.h>
+#include <linux/nodemask.h>

struct zonelist;
struct notifier_block;
@@ -26,7 +27,8 @@ enum oom_constraint {
extern int try_set_zone_oom(struct zonelist *zonelist, gfp_t gfp_flags);
extern void clear_zonelist_oom(struct zonelist *zonelist, gfp_t gfp_flags);

-extern void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask, int order);
+extern void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
+ int order, nodemask_t *mask);
extern int register_oom_notifier(struct notifier_block *nb);
extern int unregister_oom_notifier(struct notifier_block *nb);

2009-11-11 05:58:38

by David Rientjes

[permalink] [raw]
Subject: Re: [BUGFIX][PATCH] oom-kill: fix NUMA constraint check with nodemask v4.1

On Wed, 11 Nov 2009, KAMEZAWA Hiroyuki wrote:

> Index: mm-test-kernel/drivers/char/sysrq.c
> ===================================================================
> --- mm-test-kernel.orig/drivers/char/sysrq.c
> +++ mm-test-kernel/drivers/char/sysrq.c
> @@ -339,7 +339,7 @@ static struct sysrq_key_op sysrq_term_op
>
> static void moom_callback(struct work_struct *ignored)
> {
> - out_of_memory(node_zonelist(0, GFP_KERNEL), GFP_KERNEL, 0);
> + out_of_memory(node_zonelist(0, GFP_KERNEL), GFP_KERNEL, 0, NULL);
> }
>
> static DECLARE_WORK(moom_work, moom_callback);
> Index: mm-test-kernel/mm/oom_kill.c
> ===================================================================
> --- mm-test-kernel.orig/mm/oom_kill.c
> +++ mm-test-kernel/mm/oom_kill.c
> @@ -196,27 +196,47 @@ unsigned long badness(struct task_struct
> /*
> * Determine the type of allocation constraint.
> */
> -static inline enum oom_constraint constrained_alloc(struct zonelist *zonelist,
> - gfp_t gfp_mask)
> -{
> #ifdef CONFIG_NUMA
> +static enum oom_constraint constrained_alloc(struct zonelist *zonelist,
> + gfp_t gfp_mask, nodemask_t *nodemask)
> +{
> struct zone *zone;
> struct zoneref *z;
> enum zone_type high_zoneidx = gfp_zone(gfp_mask);
> - nodemask_t nodes = node_states[N_HIGH_MEMORY];
> + int ret = CONSTRAINT_NONE;
>
> - for_each_zone_zonelist(zone, z, zonelist, high_zoneidx)
> - if (cpuset_zone_allowed_softwall(zone, gfp_mask))
> - node_clear(zone_to_nid(zone), nodes);
> - else
> - return CONSTRAINT_CPUSET;
> + /*
> + * We reach here with __GFP_THISNODE only when __GFP_NOFAIL is also used.
> + * So we should avoid killing current; a random task kill has to happen.
> + * CONSTRAINT_THISNODE would be ideal, but there is no way to handle it now.
> + */
> + if (gfp_mask & __GPF_THISNODE)
> + return ret;
>

That shouldn't compile.

> - if (!nodes_empty(nodes))
> + /*
> + * The nodemask here is the nodemask passed to alloc_pages(). Now,
> + * cpuset doesn't use this nodemask for its hardwall/softwall/hierarchy
> + * feature; mempolicy is the only user of the nodemask here.
> + * Check whether mempolicy's nodemask contains all of N_HIGH_MEMORY.
> + */
> + if (nodemask && !nodes_subset(node_states[N_HIGH_MEMORY], *nodemask))
> return CONSTRAINT_MEMORY_POLICY;
> -#endif
>
> + /* Check whether this allocation failure is caused by cpuset's wall function */
> + for_each_zone_zonelist_nodemask(zone, z, zonelist,
> + high_zoneidx, nodemask)
> + if (!cpuset_zone_allowed_softwall(zone, gfp_mask))
> + return CONSTRAINT_CPUSET;
> +
> + return CONSTRAINT_NONE;
> +}
> +#else
> +static enum oom_constraint constrained_alloc(struct zonelist *zonelist,
> + gfp_t gfp_mask, nodemask_t *nodemask)

inline seems appropriate in this case; gcc will optimize it anyway.

2009-11-11 06:22:40

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [BUGFIX][PATCH] oom-kill: fix NUMA constraint check with nodemask v4.1

On Tue, 10 Nov 2009 21:58:31 -0800 (PST)
David Rientjes <[email protected]> wrote:

> On Wed, 11 Nov 2009, KAMEZAWA Hiroyuki wrote:
>
> > Index: mm-test-kernel/drivers/char/sysrq.c
> > ===================================================================
> > --- mm-test-kernel.orig/drivers/char/sysrq.c
> > +++ mm-test-kernel/drivers/char/sysrq.c
> > @@ -339,7 +339,7 @@ static struct sysrq_key_op sysrq_term_op
> >
> > static void moom_callback(struct work_struct *ignored)
> > {
> > - out_of_memory(node_zonelist(0, GFP_KERNEL), GFP_KERNEL, 0);
> > + out_of_memory(node_zonelist(0, GFP_KERNEL), GFP_KERNEL, 0, NULL);
> > }
> >
> > static DECLARE_WORK(moom_work, moom_callback);
> > Index: mm-test-kernel/mm/oom_kill.c
> > ===================================================================
> > --- mm-test-kernel.orig/mm/oom_kill.c
> > +++ mm-test-kernel/mm/oom_kill.c
> > @@ -196,27 +196,47 @@ unsigned long badness(struct task_struct
> > /*
> > * Determine the type of allocation constraint.
> > */
> > -static inline enum oom_constraint constrained_alloc(struct zonelist *zonelist,
> > - gfp_t gfp_mask)
> > -{
> > #ifdef CONFIG_NUMA
> > +static enum oom_constraint constrained_alloc(struct zonelist *zonelist,
> > + gfp_t gfp_mask, nodemask_t *nodemask)
> > +{
> > struct zone *zone;
> > struct zoneref *z;
> > enum zone_type high_zoneidx = gfp_zone(gfp_mask);
> > - nodemask_t nodes = node_states[N_HIGH_MEMORY];
> > + int ret = CONSTRAINT_NONE;
> >
> > - for_each_zone_zonelist(zone, z, zonelist, high_zoneidx)
> > - if (cpuset_zone_allowed_softwall(zone, gfp_mask))
> > - node_clear(zone_to_nid(zone), nodes);
> > - else
> > - return CONSTRAINT_CPUSET;
> > + /*
> > + * We reach here with __GFP_THISNODE only when __GFP_NOFAIL is also used.
> > + * So we should avoid killing current; a random task kill has to happen.
> > + * CONSTRAINT_THISNODE would be ideal, but there is no way to handle it now.
> > + */
> > + if (gfp_mask & __GPF_THISNODE)
> > + return ret;
> >
>
> That shouldn't compile.
>
Why?

Thanks,
-Kame

2009-11-11 06:26:24

by David Rientjes

[permalink] [raw]
Subject: Re: [BUGFIX][PATCH] oom-kill: fix NUMA constraint check with nodemask v4.1

On Wed, 11 Nov 2009, KAMEZAWA Hiroyuki wrote:

> > > Index: mm-test-kernel/mm/oom_kill.c
> > > ===================================================================
> > > --- mm-test-kernel.orig/mm/oom_kill.c
> > > +++ mm-test-kernel/mm/oom_kill.c
> > > @@ -196,27 +196,47 @@ unsigned long badness(struct task_struct
> > > /*
> > > * Determine the type of allocation constraint.
> > > */
> > > -static inline enum oom_constraint constrained_alloc(struct zonelist *zonelist,
> > > - gfp_t gfp_mask)
> > > -{
> > > #ifdef CONFIG_NUMA
> > > +static enum oom_constraint constrained_alloc(struct zonelist *zonelist,
> > > + gfp_t gfp_mask, nodemask_t *nodemask)
> > > +{
> > > struct zone *zone;
> > > struct zoneref *z;
> > > enum zone_type high_zoneidx = gfp_zone(gfp_mask);
> > > - nodemask_t nodes = node_states[N_HIGH_MEMORY];
> > > + int ret = CONSTRAINT_NONE;
> > >
> > > - for_each_zone_zonelist(zone, z, zonelist, high_zoneidx)
> > > - if (cpuset_zone_allowed_softwall(zone, gfp_mask))
> > > - node_clear(zone_to_nid(zone), nodes);
> > > - else
> > > - return CONSTRAINT_CPUSET;
> > > + /*
> > > + * We reach here with __GFP_THISNODE only when __GFP_NOFAIL is also used.
> > > + * So we should avoid killing current; a random task kill has to happen.
> > > + * CONSTRAINT_THISNODE would be ideal, but there is no way to handle it now.
> > > + */
> > > + if (gfp_mask & __GPF_THISNODE)
> > > + return ret;
> > >
> >
> > That shouldn't compile.
> >
> Why?
>

Even when I pointed it out, you still didn't bother to try compiling it?
You need s/GPF/GFP; it stands for "get free pages."

You can also eliminate the ret variable; it's not doing anything.
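
For reference, the corrected guard (which the v4.2 patch below adopts)
reduces to:

	if (gfp_mask & __GFP_THISNODE)
		return CONSTRAINT_NONE;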

2009-11-11 06:36:52

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: [BUGFIX][PATCH] oom-kill: fix NUMA constraint check with nodemask v4.2

On Tue, 10 Nov 2009 22:26:19 -0800 (PST)
David Rientjes <[email protected]> wrote:
> > > > + */
> > > > + if (gfp_mask & __GPF_THISNODE)
> > > > + return ret;
> > > >
> > >
> > > That shouldn't compile.
> > >
> > Why?
> >
>
> Even when I pointed it out, you still didn't bother to try compiling it?
> You need s/GPF/GFP; it stands for "get free pages."
>
You're right.
Ah, I noticed I had CONFIG_NUMA=n.
.... (of course, I usually set it to =y)

> You can also eliminate the ret variable; it's not doing anything.
>
OK, here is the fixed version. Thank you.
==
From: KAMEZAWA Hiroyuki <[email protected]>

Fixing node-oriented allocation handling in oom_kill.c.
I myself think of this as a bugfix, not an enhancement.

These days, things have changed:
- alloc_pages() takes a nodemask as an argument, via __alloc_pages_nodemask().
- mempolicy doesn't maintain its own private zonelists.
(And cpuset doesn't use a nodemask for __alloc_pages_nodemask().)

So, the current oom-killer's check function is wrong.

This patch does the following:
- checks the nodemask: if nodemask && nodemask doesn't cover all of
node_states[N_HIGH_MEMORY], this is CONSTRAINT_MEMORY_POLICY.
- scans all zonelists under the nodemask; if the scan hits cpuset's wall,
this failure is from cpuset.
And
- modifies the caller of out_of_memory() not to call oom if __GFP_THISNODE.
This doesn't change "current" behavior. Callers that use __GFP_THISNODE
should handle "page allocation failure" by themselves.

- handles the __GFP_NOFAIL+__GFP_THISNODE path.
This is something like a FIXME, but this gfp mask combination is not used now.

Changelog: 2009/11/11(2)
- Use nodes_subset().
- Clean up.
- Added the __GFP_NOFAIL case, and added a warning.
- Removed 'inline'.
- Removed 'ret'.

Changelog: 2009/11/11
- Fixed the nodes_equal() calculation.
- Always return CONSTRAINT_MEMORY_POLICY if the given nodemask is not big enough.

Changelog: 2009/11/06
- Fixed the missing oom.h include.

Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
---
drivers/char/sysrq.c | 2 +-
include/linux/oom.h | 4 +++-
mm/oom_kill.c | 46 +++++++++++++++++++++++++++++++++-------------
mm/page_alloc.c | 20 +++++++++++++++-----
4 files changed, 52 insertions(+), 20 deletions(-)

Index: mm-test-kernel/drivers/char/sysrq.c
===================================================================
--- mm-test-kernel.orig/drivers/char/sysrq.c
+++ mm-test-kernel/drivers/char/sysrq.c
@@ -339,7 +339,7 @@ static struct sysrq_key_op sysrq_term_op

static void moom_callback(struct work_struct *ignored)
{
- out_of_memory(node_zonelist(0, GFP_KERNEL), GFP_KERNEL, 0);
+ out_of_memory(node_zonelist(0, GFP_KERNEL), GFP_KERNEL, 0, NULL);
}

static DECLARE_WORK(moom_work, moom_callback);
Index: mm-test-kernel/mm/oom_kill.c
===================================================================
--- mm-test-kernel.orig/mm/oom_kill.c
+++ mm-test-kernel/mm/oom_kill.c
@@ -196,27 +196,46 @@ unsigned long badness(struct task_struct
/*
* Determine the type of allocation constraint.
*/
-static inline enum oom_constraint constrained_alloc(struct zonelist *zonelist,
- gfp_t gfp_mask)
-{
#ifdef CONFIG_NUMA
+static enum oom_constraint constrained_alloc(struct zonelist *zonelist,
+ gfp_t gfp_mask, nodemask_t *nodemask)
+{
struct zone *zone;
struct zoneref *z;
enum zone_type high_zoneidx = gfp_zone(gfp_mask);
- nodemask_t nodes = node_states[N_HIGH_MEMORY];

- for_each_zone_zonelist(zone, z, zonelist, high_zoneidx)
- if (cpuset_zone_allowed_softwall(zone, gfp_mask))
- node_clear(zone_to_nid(zone), nodes);
- else
- return CONSTRAINT_CPUSET;
+ /*
+ * We reach here with __GFP_THISNODE only when __GFP_NOFAIL is also used.
+ * So we should avoid killing current; a random task kill has to happen.
+ * CONSTRAINT_THISNODE would be ideal, but there is no way to handle it now.
+ */
+ if (gfp_mask & __GFP_THISNODE)
+ return CONSTRAINT_NONE;

- if (!nodes_empty(nodes))
+ /*
+ * The nodemask here is the nodemask passed to alloc_pages(). Now,
+ * cpuset doesn't use this nodemask for its hardwall/softwall/hierarchy
+ * feature; mempolicy is the only user of the nodemask here.
+ * Check whether mempolicy's nodemask contains all of N_HIGH_MEMORY.
+ */
+ if (nodemask && !nodes_subset(node_states[N_HIGH_MEMORY], *nodemask))
return CONSTRAINT_MEMORY_POLICY;
-#endif

+ /* Check whether this allocation failure is caused by cpuset's wall function */
+ for_each_zone_zonelist_nodemask(zone, z, zonelist,
+ high_zoneidx, nodemask)
+ if (!cpuset_zone_allowed_softwall(zone, gfp_mask))
+ return CONSTRAINT_CPUSET;
+
+ return CONSTRAINT_NONE;
+}
+#else
+static enum oom_constraint constrained_alloc(struct zonelist *zonelist,
+ gfp_t gfp_mask, nodemask_t *nodemask)
+{
return CONSTRAINT_NONE;
}
+#endif

/*
* Simple selection loop. We chose the process with the highest
@@ -603,7 +622,8 @@ rest_and_return:
* OR try to be smart about which process to kill. Note that we
* don't have to be perfect here, we just have to be good.
*/
-void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask, int order)
+void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
+ int order, nodemask_t *nodemask)
{
unsigned long freed = 0;
enum oom_constraint constraint;
@@ -622,7 +642,7 @@ void out_of_memory(struct zonelist *zone
* Check if there were limitations on the allocation (only relevant for
* NUMA) that may require different handling.
*/
- constraint = constrained_alloc(zonelist, gfp_mask);
+ constraint = constrained_alloc(zonelist, gfp_mask, nodemask);
read_lock(&tasklist_lock);

switch (constraint) {
Index: mm-test-kernel/mm/page_alloc.c
===================================================================
--- mm-test-kernel.orig/mm/page_alloc.c
+++ mm-test-kernel/mm/page_alloc.c
@@ -1664,12 +1664,22 @@ __alloc_pages_may_oom(gfp_t gfp_mask, un
if (page)
goto out;

- /* The OOM killer will not help higher order allocs */
- if (order > PAGE_ALLOC_COSTLY_ORDER && !(gfp_mask & __GFP_NOFAIL))
- goto out;
-
+ if (!(gfp_mask & __GFP_NOFAIL)) {
+ /* The OOM killer will not help higher order allocs */
+ if (order > PAGE_ALLOC_COSTLY_ORDER)
+ goto out;
+ /*
+ * GFP_THISNODE contains __GFP_NORETRY, so we never hit this path.
+ * This is a sanity check for bare uses of __GFP_THISNODE, not a real OOM.
+ * The caller should handle page allocation failure by itself if
+ * it specifies __GFP_THISNODE.
+ * Note: hugepage uses it but will hit PAGE_ALLOC_COSTLY_ORDER first.
+ */
+ if (gfp_mask & __GFP_THISNODE)
+ goto out;
+ }
/* Exhausted what can be done so it's blamo time */
- out_of_memory(zonelist, gfp_mask, order);
+ out_of_memory(zonelist, gfp_mask, order, nodemask);

out:
clear_zonelist_oom(zonelist, gfp_mask);
Index: mm-test-kernel/include/linux/oom.h
===================================================================
--- mm-test-kernel.orig/include/linux/oom.h
+++ mm-test-kernel/include/linux/oom.h
@@ -10,6 +10,7 @@
#ifdef __KERNEL__

#include <linux/types.h>
+#include <linux/nodemask.h>

struct zonelist;
struct notifier_block;
@@ -26,7 +27,8 @@ enum oom_constraint {
extern int try_set_zone_oom(struct zonelist *zonelist, gfp_t gfp_flags);
extern void clear_zonelist_oom(struct zonelist *zonelist, gfp_t gfp_flags);

-extern void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask, int order);
+extern void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
+ int order, nodemask_t *mask);
extern int register_oom_notifier(struct notifier_block *nb);
extern int unregister_oom_notifier(struct notifier_block *nb);

2009-11-11 07:32:22

by David Rientjes

[permalink] [raw]
Subject: Re: [BUGFIX][PATCH] oom-kill: fix NUMA constraint check with nodemask v4.2

On Wed, 11 Nov 2009, KAMEZAWA Hiroyuki wrote:

> From: KAMEZAWA Hiroyuki <[email protected]>
>
> Fixing node-oriented allocation handling in oom_kill.c.
> I myself think of this as a bugfix, not an enhancement.
>
> These days, things have changed:
> - alloc_pages() takes a nodemask as an argument, via __alloc_pages_nodemask().
> - mempolicy doesn't maintain its own private zonelists.
> (And cpuset doesn't use a nodemask for __alloc_pages_nodemask().)
>
> So, the current oom-killer's check function is wrong.
>
> This patch does the following:
> - checks the nodemask: if nodemask && nodemask doesn't cover all of
> node_states[N_HIGH_MEMORY], this is CONSTRAINT_MEMORY_POLICY.
> - scans all zonelists under the nodemask; if the scan hits cpuset's wall,
> this failure is from cpuset.
> And
> - modifies the caller of out_of_memory() not to call oom if __GFP_THISNODE.
> This doesn't change "current" behavior. Callers that use __GFP_THISNODE
> should handle "page allocation failure" by themselves.
>
> - handles the __GFP_NOFAIL+__GFP_THISNODE path.
> This is something like a FIXME, but this gfp mask combination is not used now.
>
> Changelog: 2009/11/11(2)
> - Use nodes_subset().
> - Clean up.
> - Added the __GFP_NOFAIL case, and added a warning.
> - Removed 'inline'.
> - Removed 'ret'.
>
> Changelog: 2009/11/11
> - Fixed the nodes_equal() calculation.
> - Always return CONSTRAINT_MEMORY_POLICY if the given nodemask is not big enough.
>
> Changelog: 2009/11/06
> - Fixed the missing oom.h include.
>
> Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>

Acked-by: David Rientjes <[email protected]>

2009-11-18 00:12:08

by David Rientjes

[permalink] [raw]
Subject: Re: [BUGFIX][PATCH] oom-kill: fix NUMA constraint check with nodemask v4.2

On Wed, 11 Nov 2009, KAMEZAWA Hiroyuki wrote:

> Fixing node-oriented allocation handling in oom_kill.c.
> I myself think of this as a bugfix, not an enhancement.
>
> These days, things have changed:
> - alloc_pages() takes a nodemask as an argument, via __alloc_pages_nodemask().
> - mempolicy doesn't maintain its own private zonelists.
> (And cpuset doesn't use a nodemask for __alloc_pages_nodemask().)
>
> So, the current oom-killer's check function is wrong.
>
> This patch does the following:
> - checks the nodemask: if nodemask && nodemask doesn't cover all of
> node_states[N_HIGH_MEMORY], this is CONSTRAINT_MEMORY_POLICY.
> - scans all zonelists under the nodemask; if the scan hits cpuset's wall,
> this failure is from cpuset.
> And
> - modifies the caller of out_of_memory() not to call oom if __GFP_THISNODE.
> This doesn't change "current" behavior. Callers that use __GFP_THISNODE
> should handle "page allocation failure" by themselves.
>
> - handles the __GFP_NOFAIL+__GFP_THISNODE path.
> This is something like a FIXME, but this gfp mask combination is not used now.
>

Now that we're passing the nodemask into the oom killer, we should be able
to do more intelligent CONSTRAINT_MEMORY_POLICY selection. current is not
always the ideal task to kill, so it's better to scan the tasklist and
determine the best task depending on our heuristics, similar to how we
penalize candidates if they do not share the same cpuset.

Something like the following (untested) patch. Comments?
---
diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
--- a/include/linux/mempolicy.h
+++ b/include/linux/mempolicy.h
@@ -201,7 +201,9 @@ extern void mpol_fix_fork_child_flag(struct task_struct *p);
extern struct zonelist *huge_zonelist(struct vm_area_struct *vma,
unsigned long addr, gfp_t gfp_flags,
struct mempolicy **mpol, nodemask_t **nodemask);
-extern bool init_nodemask_of_mempolicy(nodemask_t *mask);
+extern bool init_nodemask_of_task_mempolicy(struct task_struct *tsk,
+ nodemask_t *mask);
+extern bool init_nodemask_of_current_mempolicy(nodemask_t *mask);
extern unsigned slab_node(struct mempolicy *policy);

extern enum zone_type policy_zone;
@@ -329,7 +331,16 @@ static inline struct zonelist *huge_zonelist(struct vm_area_struct *vma,
return node_zonelist(0, gfp_flags);
}

-static inline bool init_nodemask_of_mempolicy(nodemask_t *m) { return false; }
+static inline bool init_nodemask_of_task_mempolicy(struct task_struct *tsk,
+ nodemask_t *mask)
+{
+ return false;
+}
+
+static inline bool init_nodemask_of_current_mempolicy(nodemask_t *mask)
+{
+ return false;
+}

static inline int do_migrate_pages(struct mm_struct *mm,
const nodemask_t *from_nodes,
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index ed3f392..3ab3021 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1373,7 +1373,7 @@ static ssize_t nr_hugepages_store_common(bool obey_mempolicy,
* global hstate attribute
*/
if (!(obey_mempolicy &&
- init_nodemask_of_mempolicy(nodes_allowed))) {
+ init_nodemask_of_current_mempolicy(nodes_allowed))) {
NODEMASK_FREE(nodes_allowed);
nodes_allowed = &node_states[N_HIGH_MEMORY];
}
@@ -1860,7 +1860,7 @@ static int hugetlb_sysctl_handler_common(bool obey_mempolicy,
NODEMASK_ALLOC(nodemask_t, nodes_allowed,
GFP_KERNEL | __GFP_NORETRY);
if (!(obey_mempolicy &&
- init_nodemask_of_mempolicy(nodes_allowed))) {
+ init_nodemask_of_current_mempolicy(nodes_allowed))) {
NODEMASK_FREE(nodes_allowed);
nodes_allowed = &node_states[N_HIGH_MEMORY];
}
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index f11fdad..23c84bb 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1568,24 +1568,18 @@ struct zonelist *huge_zonelist(struct vm_area_struct *vma, unsigned long addr,
}
return zl;
}
+#endif

/*
- * init_nodemask_of_mempolicy
- *
- * If the current task's mempolicy is "default" [NULL], return 'false'
- * to indicate default policy. Otherwise, extract the policy nodemask
- * for 'bind' or 'interleave' policy into the argument nodemask, or
- * initialize the argument nodemask to contain the single node for
- * 'preferred' or 'local' policy and return 'true' to indicate presence
- * of non-default mempolicy.
- *
- * We don't bother with reference counting the mempolicy [mpol_get/put]
- * because the current task is examining it's own mempolicy and a task's
- * mempolicy is only ever changed by the task itself.
+ * If tsk's mempolicy is "default" [NULL], return 'false' to indicate default
+ * policy. Otherwise, extract the policy nodemask for 'bind' or 'interleave'
+ * policy into the argument nodemask, or initialize the argument nodemask to
+ * contain the single node for 'preferred' or 'local' policy and return 'true'
+ * to indicate presence of non-default mempolicy.
*
* N.B., it is the caller's responsibility to free a returned nodemask.
*/
-bool init_nodemask_of_mempolicy(nodemask_t *mask)
+bool init_nodemask_of_task_mempolicy(struct task_struct *tsk, nodemask_t *mask)
{
struct mempolicy *mempolicy;
int nid;
@@ -1615,7 +1609,16 @@ bool init_nodemask_of_mempolicy(nodemask_t *mask)

return true;
}
-#endif
+
+/*
+ * We don't bother with reference counting the mempolicy [mpol_get/put]
+ * because the current task is examining it's own mempolicy and a task's
+ * mempolicy is only ever changed by the task itself.
+ */
+bool init_nodemask_of_current_mempolicy(nodemask_t *mask)
+{
+ return init_nodemask_of_task_mempolicy(current, mask);
+}

/* Allocate a page in interleaved policy.
Own path because it needs to do special accounting. */
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index ab04537..4c5c58b 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -27,6 +27,7 @@
#include <linux/notifier.h>
#include <linux/memcontrol.h>
#include <linux/security.h>
+#include <linux/mempolicy.h>

int sysctl_panic_on_oom;
int sysctl_oom_kill_allocating_task;
@@ -35,18 +36,30 @@ static DEFINE_SPINLOCK(zone_scan_lock);
/* #define DEBUG */

/*
- * Is all threads of the target process nodes overlap ours?
+ * Do the nodes allowed by any of tsk's threads overlap ours?
*/
-static int has_intersects_mems_allowed(struct task_struct *tsk)
+static int has_intersects_mems_allowed(struct task_struct *tsk,
+ nodemask_t *nodemask)
{
- struct task_struct *t;
+ struct task_struct *start = tsk;
+ NODEMASK_ALLOC(nodemask_t, mpol_nodemask, GFP_KERNEL);

- t = tsk;
+ if (!nodemask)
+ mpol_nodemask = NULL;
do {
- if (cpuset_mems_allowed_intersects(current, t))
+ if (mpol_nodemask) {
+ mpol_get(tsk->mempolicy);
+ if (init_nodemask_of_task_mempolicy(tsk, mpol_nodemask) &&
+ nodes_intersects(*nodemask, *mpol_nodemask)) {
+ mpol_put(tsk->mempolicy);
+ return 1;
+ }
+ mpol_put(tsk->mempolicy);
+ }
+ if (cpuset_mems_allowed_intersects(current, tsk))
return 1;
- t = next_thread(t);
- } while (t != tsk);
+ tsk = next_thread(tsk);
+ } while (tsk != start);

return 0;
}
@@ -55,6 +68,8 @@ static int has_intersects_mems_allowed(struct task_struct *tsk)
* badness - calculate a numeric value for how bad this task has been
* @p: task struct of which task we should calculate
* @uptime: current uptime in seconds
+ * @constraint: type of oom constraint
+ * @nodemask: nodemask passed to page allocator
*
* The formula used is relatively simple and documented inline in the
* function. The main rationale is that we want to select a good task
@@ -70,7 +85,8 @@ static int has_intersects_mems_allowed(struct task_struct *tsk)
* of least surprise ... (be careful when you change it)
*/

-unsigned long badness(struct task_struct *p, unsigned long uptime)
+unsigned long badness(struct task_struct *p, unsigned long uptime,
+ enum oom_constraint constraint, nodemask_t *nodemask)
{
unsigned long points, cpu_time, run_time;
struct mm_struct *mm;
@@ -171,7 +187,9 @@ unsigned long badness(struct task_struct *p, unsigned long uptime)
* because p may have allocated or otherwise mapped memory on
* this node before. However it will be less likely.
*/
- if (!has_intersects_mems_allowed(p))
+ if (!has_intersects_mems_allowed(p,
+ constraint == CONSTRAINT_MEMORY_POLICY ? nodemask :
+ NULL))
points /= 8;

/*
@@ -244,7 +262,8 @@ static enum oom_constraint constrained_alloc(struct zonelist *zonelist,
* (not docbooked, we don't want this one cluttering up the manual)
*/
static struct task_struct *select_bad_process(unsigned long *ppoints,
- struct mem_cgroup *mem)
+ struct mem_cgroup *mem, enum oom_constraint constraint,
+ nodemask_t *nodemask)
{
struct task_struct *p;
struct task_struct *chosen = NULL;
@@ -300,7 +319,7 @@ static struct task_struct *select_bad_process(unsigned long *ppoints,
if (p->signal->oom_adj == OOM_DISABLE)
continue;

- points = badness(p, uptime.tv_sec);
+ points = badness(p, uptime.tv_sec, constraint, nodemask);
if (points > *ppoints || !chosen) {
chosen = p;
*ppoints = points;
@@ -472,7 +491,7 @@ void mem_cgroup_out_of_memory(struct mem_cgroup *mem, gfp_t gfp_mask)

read_lock(&tasklist_lock);
retry:
- p = select_bad_process(&points, mem);
+ p = select_bad_process(&points, mem, NULL);
if (PTR_ERR(p) == -1UL)
goto out;

@@ -554,7 +573,8 @@ void clear_zonelist_oom(struct zonelist *zonelist, gfp_t gfp_mask)
/*
* Must be called with tasklist_lock held for read.
*/
-static void __out_of_memory(gfp_t gfp_mask, int order)
+static void __out_of_memory(gfp_t gfp_mask, int order,
+ enum oom_constraint constraint, nodemask_t *nodemask)
{
struct task_struct *p;
unsigned long points;
@@ -568,7 +588,7 @@ retry:
* Rambo mode: Shoot down a process and hope it solves whatever
* issues we may have.
*/
- p = select_bad_process(&points, NULL);
+ p = select_bad_process(&points, NULL, constraint, nodemask);

if (PTR_ERR(p) == -1UL)
return;
@@ -609,7 +629,8 @@ void pagefault_out_of_memory(void)
panic("out of memory from page fault. panic_on_oom is selected.\n");

read_lock(&tasklist_lock);
- __out_of_memory(0, 0); /* unknown gfp_mask and order */
+ /* unknown gfp_mask and order */
+ __out_of_memory(0, 0, CONSTRAINT_NONE, NULL);
read_unlock(&tasklist_lock);

/*
@@ -656,11 +677,6 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
read_lock(&tasklist_lock);

switch (constraint) {
- case CONSTRAINT_MEMORY_POLICY:
- oom_kill_process(current, gfp_mask, order, 0, NULL,
- "No available memory (MPOL_BIND)");
- break;
-
case CONSTRAINT_NONE:
if (sysctl_panic_on_oom) {
dump_header(gfp_mask, order, NULL);
@@ -668,7 +684,8 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
}
/* Fall-through */
case CONSTRAINT_CPUSET:
- __out_of_memory(gfp_mask, order);
+ case CONSTRAINT_MEMORY_POLICY:
+ __out_of_memory(gfp_mask, order, constraint, nodemask);
break;
}

2009-11-18 01:01:25

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [BUGFIX][PATCH] oom-kill: fix NUMA constraint check with nodemask v4.2

Hi,

On Tue, 17 Nov 2009 16:11:58 -0800 (PST)
David Rientjes <[email protected]> wrote:

> On Wed, 11 Nov 2009, KAMEZAWA Hiroyuki wrote:
>
> > Fixing node-oriented allocation handling in oom_kill.c.
> > I myself think of this as a bugfix, not an enhancement.
> >
> > These days, things have changed:
> > - alloc_pages() takes a nodemask as an argument, via __alloc_pages_nodemask().
> > - mempolicy doesn't maintain its own private zonelists.
> > (And cpuset doesn't use a nodemask for __alloc_pages_nodemask().)
> >
> > So, the current oom-killer's check function is wrong.
> >
> > This patch does the following:
> > - checks the nodemask: if nodemask && nodemask doesn't cover all of
> > node_states[N_HIGH_MEMORY], this is CONSTRAINT_MEMORY_POLICY.
> > - scans all zonelists under the nodemask; if the scan hits cpuset's wall,
> > this failure is from cpuset.
> > And
> > - modifies the caller of out_of_memory() not to call oom if __GFP_THISNODE.
> > This doesn't change "current" behavior. Callers that use __GFP_THISNODE
> > should handle "page allocation failure" by themselves.
> >
> > - handles the __GFP_NOFAIL+__GFP_THISNODE path.
> > This is something like a FIXME, but this gfp mask combination is not used now.
> >
>
> Now that we're passing the nodemask into the oom killer, we should be able
> to do more intelligent CONSTRAINT_MEMORY_POLICY selection. current is not
> always the ideal task to kill, so it's better to scan the tasklist and
> determine the best task depending on our heuristics, similar to how we
> penalize candidates if they do not share the same cpuset.
>
> Something like the following (untested) patch. Comments?

Hm, yes. I think this direction is good.
I have my own version, but yours looks nicer.
(I'm busy with other troubles these days, sorry.)


> ---
> diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
> --- a/include/linux/mempolicy.h
> +++ b/include/linux/mempolicy.h
> @@ -201,7 +201,9 @@ extern void mpol_fix_fork_child_flag(struct task_struct *p);
> extern struct zonelist *huge_zonelist(struct vm_area_struct *vma,
> unsigned long addr, gfp_t gfp_flags,
> struct mempolicy **mpol, nodemask_t **nodemask);
> -extern bool init_nodemask_of_mempolicy(nodemask_t *mask);
> +extern bool init_nodemask_of_task_mempolicy(struct task_struct *tsk,
> + nodemask_t *mask);
> +extern bool init_nodemask_of_current_mempolicy(nodemask_t *mask);
> extern unsigned slab_node(struct mempolicy *policy);
>
> extern enum zone_type policy_zone;
> @@ -329,7 +331,16 @@ static inline struct zonelist *huge_zonelist(struct vm_area_struct *vma,
> return node_zonelist(0, gfp_flags);
> }
>
> -static inline bool init_nodemask_of_mempolicy(nodemask_t *m) { return false; }
> +static inline bool init_nodemask_of_task_mempolicy(struct task_struct *tsk,
> + nodemask_t *mask)
> +{
> + return false;
> +}
> +
> +static inline bool init_nodemask_of_current_mempolicy(nodemask_t *mask)
> +{
> + return false;
> +}
>
> static inline int do_migrate_pages(struct mm_struct *mm,
> const nodemask_t *from_nodes,
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index ed3f392..3ab3021 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -1373,7 +1373,7 @@ static ssize_t nr_hugepages_store_common(bool obey_mempolicy,
> * global hstate attribute
> */
> if (!(obey_mempolicy &&
> - init_nodemask_of_mempolicy(nodes_allowed))) {
> + init_nodemask_of_current_mempolicy(nodes_allowed))) {
> NODEMASK_FREE(nodes_allowed);
> nodes_allowed = &node_states[N_HIGH_MEMORY];
> }
> @@ -1860,7 +1860,7 @@ static int hugetlb_sysctl_handler_common(bool obey_mempolicy,
> NODEMASK_ALLOC(nodemask_t, nodes_allowed,
> GFP_KERNEL | __GFP_NORETRY);
> if (!(obey_mempolicy &&
> - init_nodemask_of_mempolicy(nodes_allowed))) {
> + init_nodemask_of_current_mempolicy(nodes_allowed))) {
> NODEMASK_FREE(nodes_allowed);
> nodes_allowed = &node_states[N_HIGH_MEMORY];
> }
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index f11fdad..23c84bb 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -1568,24 +1568,18 @@ struct zonelist *huge_zonelist(struct vm_area_struct *vma, unsigned long addr,
> }
> return zl;
> }
> +#endif
>
> /*
> - * init_nodemask_of_mempolicy
> - *
> - * If the current task's mempolicy is "default" [NULL], return 'false'
> - * to indicate default policy. Otherwise, extract the policy nodemask
> - * for 'bind' or 'interleave' policy into the argument nodemask, or
> - * initialize the argument nodemask to contain the single node for
> - * 'preferred' or 'local' policy and return 'true' to indicate presence
> - * of non-default mempolicy.
> - *
> - * We don't bother with reference counting the mempolicy [mpol_get/put]
> - * because the current task is examining it's own mempolicy and a task's
> - * mempolicy is only ever changed by the task itself.
> + * If tsk's mempolicy is "default" [NULL], return 'false' to indicate default
> + * policy. Otherwise, extract the policy nodemask for 'bind' or 'interleave'
> + * policy into the argument nodemask, or initialize the argument nodemask to
> + * contain the single node for 'preferred' or 'local' policy and return 'true'
> + * to indicate presence of non-default mempolicy.
> *
> * N.B., it is the caller's responsibility to free a returned nodemask.
> */
> -bool init_nodemask_of_mempolicy(nodemask_t *mask)
> +bool init_nodemask_of_task_mempolicy(struct task_struct *tsk, nodemask_t *mask)
> {
> struct mempolicy *mempolicy;
> int nid;
> @@ -1615,7 +1609,16 @@ bool init_nodemask_of_mempolicy(nodemask_t *mask)
>
> return true;
> }
> -#endif
> +
> +/*
> + * We don't bother with reference counting the mempolicy [mpol_get/put]
> + * because the current task is examining its own mempolicy and a task's
> + * mempolicy is only ever changed by the task itself.
> + */
> +bool init_nodemask_of_current_mempolicy(nodemask_t *mask)
> +{
> + return init_nodemask_of_task_mempolicy(current, mask);
> +}
>
> /* Allocate a page in interleaved policy.
> Own path because it needs to do special accounting. */
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index ab04537..4c5c58b 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -27,6 +27,7 @@
> #include <linux/notifier.h>
> #include <linux/memcontrol.h>
> #include <linux/security.h>
> +#include <linux/mempolicy.h>
>
> int sysctl_panic_on_oom;
> int sysctl_oom_kill_allocating_task;
> @@ -35,18 +36,30 @@ static DEFINE_SPINLOCK(zone_scan_lock);
> /* #define DEBUG */
>
> /*
> - * Is all threads of the target process nodes overlap ours?
> + * Do the nodes allowed by any of tsk's threads overlap ours?
> */
> -static int has_intersects_mems_allowed(struct task_struct *tsk)
> +static int has_intersects_mems_allowed(struct task_struct *tsk,
> + nodemask_t *nodemask)
> {
> - struct task_struct *t;
> + struct task_struct *start = tsk;
> + NODEMASK_ALLOC(nodemask_t, mpol_nodemask, GFP_KERNEL);
>
> - t = tsk;
> + if (!nodemask)
> + mpol_nodemask = NULL;
> do {
> - if (cpuset_mems_allowed_intersects(current, t))
> + if (mpol_nodemask) {
> + mpol_get(tsk->mempolicy);
> + if (init_nodemask_of_task_mempolicy(tsk, mpol_nodemask) &&
> + nodes_intersects(*nodemask, *mpol_nodemask)) {
> + mpol_put(tsk->mempolicy);
> + return 1;
> + }
> + mpol_put(tsk->mempolicy);
> + }

Hmm, are these mpol_get()/mpol_put() calls necessary while tasklist_lock is held?
And... I wonder about:

	if (!init_nodemask_of_task_mempolicy(tsk, mpol_nodemask))
		return 1; /* this task uses default policy */


> + if (cpuset_mems_allowed_intersects(current, tsk))
> return 1;
> - t = next_thread(t);
> - } while (t != tsk);
> + tsk = next_thread(tsk);
> + } while (tsk != start);
>

Sigh... we have to scan all threads, again.
Do you have an idea to improve this?

For example,
mm->mask_of_nodes_which_a_page_was_allocated_on
or
mm->mask_of_nodes_made_by_some_magical_technique
or something like that?
(Maybe per-node rss is overkill.)
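
(A purely hypothetical sketch of that idea -- no such field or update hook
exists in any tree; the names below are made up for illustration only:)

	/* imagine: nodemask_t nodes_used; added to struct mm_struct */

	/* record, at allocation time, every node this mm got a page from */
	static inline void mm_track_alloc_node(struct mm_struct *mm, int nid)
	{
		node_set(nid, mm->nodes_used);	/* hypothetical field */
	}

	/* the OOM-side check then becomes a simple mask intersection */
	static int mm_may_use_oom_nodes(struct mm_struct *mm,
					const nodemask_t *oom_nodes)
	{
		return nodes_intersects(mm->nodes_used, *oom_nodes);
	}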


> return 0;
> }
> @@ -55,6 +68,8 @@ static int has_intersects_mems_allowed(struct task_struct *tsk)
> * badness - calculate a numeric value for how bad this task has been
> * @p: task struct of which task we should calculate
> * @uptime: current uptime in seconds
> + * @constraint: type of oom constraint
> + * @nodemask: nodemask passed to page allocator
> *
> * The formula used is relatively simple and documented inline in the
> * function. The main rationale is that we want to select a good task
> @@ -70,7 +85,8 @@ static int has_intersects_mems_allowed(struct task_struct *tsk)
> * of least surprise ... (be careful when you change it)
> */
>
> -unsigned long badness(struct task_struct *p, unsigned long uptime)
> +unsigned long badness(struct task_struct *p, unsigned long uptime,
> + enum oom_constraint constraint, nodemask_t *nodemask)
> {
> unsigned long points, cpu_time, run_time;
> struct mm_struct *mm;
> @@ -171,7 +187,9 @@ unsigned long badness(struct task_struct *p, unsigned long uptime)
> * because p may have allocated or otherwise mapped memory on
> * this node before. However it will be less likely.
> */
> - if (!has_intersects_mems_allowed(p))
> + if (!has_intersects_mems_allowed(p,
> + constraint == CONSTRAINT_MEMORY_POLICY ? nodemask :
> + NULL))
> points /= 8;
>
> /*
> @@ -244,7 +262,8 @@ static enum oom_constraint constrained_alloc(struct zonelist *zonelist,
> * (not docbooked, we don't want this one cluttering up the manual)
> */
> static struct task_struct *select_bad_process(unsigned long *ppoints,
> - struct mem_cgroup *mem)
> + struct mem_cgroup *mem, enum oom_constraint constraint,
> + nodemask_t *nodemask)
> {
> struct task_struct *p;
> struct task_struct *chosen = NULL;
> @@ -300,7 +319,7 @@ static struct task_struct *select_bad_process(unsigned long *ppoints,
> if (p->signal->oom_adj == OOM_DISABLE)
> continue;
>
> - points = badness(p, uptime.tv_sec);
> + points = badness(p, uptime.tv_sec, constraint, nodemask);
> if (points > *ppoints || !chosen) {
> chosen = p;
> *ppoints = points;
> @@ -472,7 +491,7 @@ void mem_cgroup_out_of_memory(struct mem_cgroup *mem, gfp_t gfp_mask)
>
> read_lock(&tasklist_lock);
> retry:
> - p = select_bad_process(&points, mem);
> + p = select_bad_process(&points, mem, NULL);
> if (PTR_ERR(p) == -1UL)
> goto out;
>
> @@ -554,7 +573,8 @@ void clear_zonelist_oom(struct zonelist *zonelist, gfp_t gfp_mask)
> /*
> * Must be called with tasklist_lock held for read.
> */
> -static void __out_of_memory(gfp_t gfp_mask, int order)
> +static void __out_of_memory(gfp_t gfp_mask, int order,
> + enum oom_constraint constraint, nodemask_t *nodemask)
> {
> struct task_struct *p;
> unsigned long points;
> @@ -568,7 +588,7 @@ retry:
> * Rambo mode: Shoot down a process and hope it solves whatever
> * issues we may have.
> */
> - p = select_bad_process(&points, NULL);
> + p = select_bad_process(&points, NULL, constraint, nodemask);
>
> if (PTR_ERR(p) == -1UL)
> return;
> @@ -609,7 +629,8 @@ void pagefault_out_of_memory(void)
> panic("out of memory from page fault. panic_on_oom is selected.\n");
>
> read_lock(&tasklist_lock);
> - __out_of_memory(0, 0); /* unknown gfp_mask and order */
> + /* unknown gfp_mask and order */
> + __out_of_memory(0, 0, CONSTRAINT_NONE, NULL);
> read_unlock(&tasklist_lock);
>
> /*
> @@ -656,11 +677,6 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
> read_lock(&tasklist_lock);
>
> switch (constraint) {
> - case CONSTRAINT_MEMORY_POLICY:
> - oom_kill_process(current, gfp_mask, order, 0, NULL,
> - "No available memory (MPOL_BIND)");
> - break;
> -
> case CONSTRAINT_NONE:
> if (sysctl_panic_on_oom) {
> dump_header(gfp_mask, order, NULL);
> @@ -668,7 +684,8 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
> }
> /* Fall-through */
> case CONSTRAINT_CPUSET:
> - __out_of_memory(gfp_mask, order);
> + case CONSTRAINT_MEMORY_POLICY:
> + __out_of_memory(gfp_mask, order, constraint, nodemask);
> break;
> }
Maybe good. But hmm... does this work well with per-vma mempolicies?

I wonder whether
mm->mask_of_nodes_made_by_some_magical_technique
will be necessary for completeness.

Thanks,
-Kame

2009-11-18 01:48:55

by Daisuke Nishimura

[permalink] [raw]
Subject: Re: [BUGFIX][PATCH] oom-kill: fix NUMA constraint check with nodemask v4.2

Hi.

On Tue, 17 Nov 2009 16:11:58 -0800 (PST), David Rientjes <[email protected]> wrote:
> On Wed, 11 Nov 2009, KAMEZAWA Hiroyuki wrote:
>
> > Fixing node-oriented allocation handling in oom_kill.c.
> > I myself think of this as a bugfix, not an enhancement.
> >
> > These days, things have changed:
> > - alloc_pages() takes a nodemask as an argument, via __alloc_pages_nodemask().
> > - mempolicy doesn't maintain its own private zonelists.
> > (And cpuset doesn't use a nodemask for __alloc_pages_nodemask().)
> >
> > So, the current oom-killer's check function is wrong.
> >
> > This patch does the following:
> > - checks the nodemask: if nodemask && nodemask doesn't cover all of
> > node_states[N_HIGH_MEMORY], this is CONSTRAINT_MEMORY_POLICY.
> > - scans all zonelists under the nodemask; if the scan hits cpuset's wall,
> > this failure is from cpuset.
> > And
> > - modifies the caller of out_of_memory() not to call oom if __GFP_THISNODE.
> > This doesn't change "current" behavior. Callers that use __GFP_THISNODE
> > should handle "page allocation failure" by themselves.
> >
> > - handles the __GFP_NOFAIL+__GFP_THISNODE path.
> > This is something like a FIXME, but this gfp mask combination is not used now.
> >
>
> Now that we're passing the nodemask into the oom killer, we should be able
> to do more intelligent CONSTRAINT_MEMORY_POLICY selection. current is not
> always the ideal task to kill, so it's better to scan the tasklist and
> determine the best task depending on our heuristics, similar to how we
> penalize candidates if they do not share the same cpuset.
>
> Something like the following (untested) patch. Comments?
I agree with this direction.

Taking into account the per-node usage covered by the nodemask might be useful,
but we don't have a per-node rss counter per task now, and it would add some
overhead, so I think this would be enough (at least for now).

Just a minor nitpick:

> @@ -472,7 +491,7 @@ void mem_cgroup_out_of_memory(struct mem_cgroup *mem, gfp_t gfp_mask)
>
> read_lock(&tasklist_lock);
> retry:
> - p = select_bad_process(&points, mem);
> + p = select_bad_process(&points, mem, NULL);
> if (PTR_ERR(p) == -1UL)
> goto out;
>
You need to pass "CONSTRAINT_NONE" here too.
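
That is, presumably:

	p = select_bad_process(&points, mem, CONSTRAINT_NONE, NULL);

to match the new four-argument signature.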


Thanks,
Daisuke Nishimura.

2009-11-18 02:13:50

by David Rientjes

[permalink] [raw]
Subject: Re: [BUGFIX][PATCH] oom-kill: fix NUMA constraint check with nodemask v4.2

On Wed, 18 Nov 2009, KAMEZAWA Hiroyuki wrote:

> > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > index ab04537..4c5c58b 100644
> > --- a/mm/oom_kill.c
> > +++ b/mm/oom_kill.c
> > @@ -27,6 +27,7 @@
> > #include <linux/notifier.h>
> > #include <linux/memcontrol.h>
> > #include <linux/security.h>
> > +#include <linux/mempolicy.h>
> >
> > int sysctl_panic_on_oom;
> > int sysctl_oom_kill_allocating_task;
> > @@ -35,18 +36,30 @@ static DEFINE_SPINLOCK(zone_scan_lock);
> > /* #define DEBUG */
> >
> > /*
> > - * Is all threads of the target process nodes overlap ours?
> > + * Do the nodes allowed by any of tsk's threads overlap ours?
> > */
> > -static int has_intersects_mems_allowed(struct task_struct *tsk)
> > +static int has_intersects_mems_allowed(struct task_struct *tsk,
> > + nodemask_t *nodemask)
> > {
> > - struct task_struct *t;
> > + struct task_struct *start = tsk;
> > + NODEMASK_ALLOC(nodemask_t, mpol_nodemask, GFP_KERNEL);
> >
> > - t = tsk;
> > + if (!nodemask)
> > + mpol_nodemask = NULL;
> > do {
> > - if (cpuset_mems_allowed_intersects(current, t))
> > + if (mpol_nodemask) {
> > + mpol_get(tsk->mempolicy);
> > + if (init_nodemask_of_task_mempolicy(tsk, mpol_nodemask) &&
> > + nodes_intersects(*nodemask, *mpol_nodemask)) {
> > + mpol_put(tsk->mempolicy);
> > + return 1;
> > + }
> > + mpol_put(tsk->mempolicy);
> > + }
>
> Hmm, are these mpol_get()/mpol_put() calls necessary while tasklist_lock is held?

They are; we don't hold tasklist_lock while dropping the reference count
in do_exit().

> And... I wonder about:
>
> 	if (!init_nodemask_of_task_mempolicy(tsk, mpol_nodemask))
> 		return 1; /* this task uses default policy */
>
>
> > + if (cpuset_mems_allowed_intersects(current, tsk))
> > return 1;
> > - t = next_thread(t);
> > - } while (t != tsk);
> > + tsk = next_thread(tsk);
> > + } while (tsk != start);
> >
>
> Sigh... we have to scan all threads, again.
> Do you have an idea to improve this?
>
> For example,
> mm->mask_of_nodes_which_a_page_was_allocated_on
> or
> mm->mask_of_nodes_made_by_some_magical_technique
> or something like that?
> (Maybe per-node rss is overkill.)
>

The same criticism could be made of CONSTRAINT_CPUSET. We don't
actually know in either case, mempolicy or cpusets, whether memory was ever
allocated on a particular node in tsk->mempolicy->v.nodes or
tsk->mems_allowed, respectively. We assume, however, that if a node is
included in a mempolicy nodemask or cpuset mems, it is an allowed
node to allocate from for all attached tasks (and those tasks aren't
solely allocating on a subset), so that killing tasks based on their
potential for allocating on oom nodes is actually helpful.

>
> > return 0;
> > }
> > @@ -55,6 +68,8 @@ static int has_intersects_mems_allowed(struct task_struct *tsk)
> > * badness - calculate a numeric value for how bad this task has been
> > * @p: task struct of which task we should calculate
> > * @uptime: current uptime in seconds
> > + * @constraint: type of oom constraint
> > + * @nodemask: nodemask passed to page allocator
> > *
> > * The formula used is relatively simple and documented inline in the
> > * function. The main rationale is that we want to select a good task
> > @@ -70,7 +85,8 @@ static int has_intersects_mems_allowed(struct task_struct *tsk)
> > * of least surprise ... (be careful when you change it)
> > */
> >
> > -unsigned long badness(struct task_struct *p, unsigned long uptime)
> > +unsigned long badness(struct task_struct *p, unsigned long uptime,
> > + enum oom_constraint constraint, nodemask_t *nodemask)
> > {
> > unsigned long points, cpu_time, run_time;
> > struct mm_struct *mm;
> > @@ -171,7 +187,9 @@ unsigned long badness(struct task_struct *p, unsigned long uptime)
> > * because p may have allocated or otherwise mapped memory on
> > * this node before. However it will be less likely.
> > */
> > - if (!has_intersects_mems_allowed(p))
> > + if (!has_intersects_mems_allowed(p,
> > + constraint == CONSTRAINT_MEMORY_POLICY ? nodemask :
> > + NULL))
> > points /= 8;
> >
> > /*
> > @@ -244,7 +262,8 @@ static enum oom_constraint constrained_alloc(struct zonelist *zonelist,
> > * (not docbooked, we don't want this one cluttering up the manual)
> > */
> > static struct task_struct *select_bad_process(unsigned long *ppoints,
> > - struct mem_cgroup *mem)
> > + struct mem_cgroup *mem, enum oom_constraint constraint,
> > + nodemask_t *nodemask)
> > {
> > struct task_struct *p;
> > struct task_struct *chosen = NULL;
> > @@ -300,7 +319,7 @@ static struct task_struct *select_bad_process(unsigned long *ppoints,
> > if (p->signal->oom_adj == OOM_DISABLE)
> > continue;
> >
> > - points = badness(p, uptime.tv_sec);
> > + points = badness(p, uptime.tv_sec, constraint, nodemask);
> > if (points > *ppoints || !chosen) {
> > chosen = p;
> > *ppoints = points;
> > @@ -472,7 +491,7 @@ void mem_cgroup_out_of_memory(struct mem_cgroup *mem, gfp_t gfp_mask)
> >
> > read_lock(&tasklist_lock);
> > retry:
> > - p = select_bad_process(&points, mem);
> > + p = select_bad_process(&points, mem, NULL);
> > if (PTR_ERR(p) == -1UL)
> > goto out;
> >
> > @@ -554,7 +573,8 @@ void clear_zonelist_oom(struct zonelist *zonelist, gfp_t gfp_mask)
> > /*
> > * Must be called with tasklist_lock held for read.
> > */
> > -static void __out_of_memory(gfp_t gfp_mask, int order)
> > +static void __out_of_memory(gfp_t gfp_mask, int order,
> > + enum oom_constraint constraint, nodemask_t *nodemask)
> > {
> > struct task_struct *p;
> > unsigned long points;
> > @@ -568,7 +588,7 @@ retry:
> > * Rambo mode: Shoot down a process and hope it solves whatever
> > * issues we may have.
> > */
> > - p = select_bad_process(&points, NULL);
> > + p = select_bad_process(&points, NULL, constraint, nodemask);
> >
> > if (PTR_ERR(p) == -1UL)
> > return;
> > @@ -609,7 +629,8 @@ void pagefault_out_of_memory(void)
> > panic("out of memory from page fault. panic_on_oom is selected.\n");
> >
> > read_lock(&tasklist_lock);
> > - __out_of_memory(0, 0); /* unknown gfp_mask and order */
> > + /* unknown gfp_mask and order */
> > + __out_of_memory(0, 0, CONSTRAINT_NONE, NULL);
> > read_unlock(&tasklist_lock);
> >
> > /*
> > @@ -656,11 +677,6 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
> > read_lock(&tasklist_lock);
> >
> > switch (constraint) {
> > - case CONSTRAINT_MEMORY_POLICY:
> > - oom_kill_process(current, gfp_mask, order, 0, NULL,
> > - "No available memory (MPOL_BIND)");
> > - break;
> > -
> > case CONSTRAINT_NONE:
> > if (sysctl_panic_on_oom) {
> > dump_header(gfp_mask, order, NULL);
> > @@ -668,7 +684,8 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
> > }
> > /* Fall-through */
> > case CONSTRAINT_CPUSET:
> > - __out_of_memory(gfp_mask, order);
> > + case CONSTRAINT_MEMORY_POLICY:
> > + __out_of_memory(gfp_mask, order, constraint, nodemask);
> > break;
> > }
> Maybe good. But hmm... does this work well with per-vma mempolicies?
>
> I wonder whether
> mm->mask_of_nodes_made_by_some_magical_technique
> will be necessary for completeness.
>

I think that would probably be rejected because of its implications for the
allocation fastpath. The change here isn't making the oom killing any
less ideal; current may never have allocated any memory on its
mempolicy nodes prior to the oom, so killing it may be entirely
useless. It's better to use our heuristics to determine the ideal task
to kill in that case and to restrict our subset of eligible tasks by the
same criteria that we use for cpusets.

2009-12-15 01:18:37

by Andrew Morton

[permalink] [raw]
Subject: Re: [BUGFIX][PATCH] oom-kill: fix NUMA constraint check with nodemask v4.2


So I have a note-to-self here that these patches:

oom_kill-use-rss-value-instead-of-vm-size-for-badness.patch
oom-kill-show-virtual-size-and-rss-information-of-the-killed-process.patch
oom-kill-fix-numa-constraint-check-with-nodemask-v42.patch

are tentative and it was unclear whether I should merge them.

What do we think?

2009-12-15 01:35:23

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [BUGFIX][PATCH] oom-kill: fix NUMA constraint check with nodemask v4.2

On Mon, 14 Dec 2009 17:16:32 -0800
Andrew Morton <[email protected]> wrote:

>
> So I have a note-to-self here that these patches:
>
> oom_kill-use-rss-value-instead-of-vm-size-for-badness.patch
> oom-kill-show-virtual-size-and-rss-information-of-the-killed-process.patch
> oom-kill-fix-numa-constraint-check-with-nodemask-v42.patch
>
> are tentative and it was unclear whether I should merge them.
>
> What do we think?
>

In my view,
oom-kill-show-virtual-size-and-rss-information-of-the-killed-process.patch
- should be merged. We tend to get several OOM reports a month, and
more precise information is always welcome.

oom-kill-fix-numa-constraint-check-with-nodemask-v42.patch
- should be merged. This is a bug fix.

oom_kill-use-rss-value-instead-of-vm-size-for-badness.patch
- should not be merged.
I'm now preparing more counters for mm's statistics. It's better to
wait and see what more we can do. Other patches for overall
oom-killer improvement are under development.

And there is a compatibility problem.
As David says, this may break some crazy software which uses
fake_numa+cpuset+oom_killer+oom_adj for resource control.
(Even though I'd recommend they use memcg rather than crazy tricks...)

Two ideas I can think of now are:

1) Add sysctl_oom_calc_on_committed_memory.
If this is set, use vm size instead of rss.

2) Add /proc/<pid>/oom_guard_size.
This allows users to specify the "valid/expected size" of a task. After

#echo 10M > /proc/<pid>/oom_guard_size

10MB is subtracted from the rss size at OOM-calculation time.
(The best way would be to estimate this automatically from vm_size... but...)
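
(A rough sketch of idea 2 -- oom_guard_pages is hypothetical and exists
nowhere; this only illustrates the subtraction:)

	static unsigned long guarded_rss(struct task_struct *p)
	{
		unsigned long rss = get_mm_rss(p->mm);	/* caller checked p->mm */
		unsigned long guard = p->signal->oom_guard_pages; /* hypothetical */

		/* treat up to 'guard' pages of rss as expected, not as badness */
		return rss > guard ? rss - guard : 0;
	}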



Thanks,
-Kame

2009-12-15 01:38:23

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [BUGFIX][PATCH] oom-kill: fix NUMA constraint check with nodemask v4.2

> On Mon, 14 Dec 2009 17:16:32 -0800
> Andrew Morton <[email protected]> wrote:
>
> >
> > So I have a note-to-self here that these patches:
> >
> > oom_kill-use-rss-value-instead-of-vm-size-for-badness.patch
> > oom-kill-show-virtual-size-and-rss-information-of-the-killed-process.patch
> > oom-kill-fix-numa-constraint-check-with-nodemask-v42.patch
> >
> > are tentative and it was unclear whether I should merge them.
> >
> > What do we think?
>
> In my view,
> oom-kill-show-virtual-size-and-rss-information-of-the-killed-process.patch
> - should be merged.
>
> oom-kill-fix-numa-constraint-check-with-nodemask-v42.patch
> - should be merged.
>
> oom_kill-use-rss-value-instead-of-vm-size-for-badness.patch
> - should not be merged.

all agree.


2009-12-15 04:31:01

by David Rientjes

[permalink] [raw]
Subject: Re: [BUGFIX][PATCH] oom-kill: fix NUMA constraint check with nodemask v4.2

On Tue, 15 Dec 2009, KAMEZAWA Hiroyuki wrote:

> I'm now preparing more counters for mm's statistics. It's better to
> wait and see what more we can do. Other patches for overall
> oom-killer improvement are under development.
>
> And there is a compatibility problem.
> As David says, this may break some crazy software which uses
> fake_numa+cpuset+oom_killer+oom_adj for resource control.
> (Even though I'd recommend they use memcg rather than crazy tricks...)
>

That's not at all what I said. I said using total_vm as a baseline allows
users to define when a process is to be considered "rogue," that is, using
more memory than expected. Using rss would be inappropriate since it is
highly dynamic and depends on the state of the VM at the time of oom,
which userspace cannot possibly keep updated.

You consistently ignore that point: the power of /proc/pid/oom_adj to
influence when a process, such as a memory leaker, is to be considered as
a high priority for an oom kill. It has absolutely nothing to do with
fake NUMA, cpusets, or memcg.

> 2 ideas which I can think of now are..
> 1) add sysctl_oom_calc_on_committed_memory
> If this is set, use vm-size instead of rss.
>

I would agree only if the oom killer used total_vm as the default; it is
long-standing and allows for the aforementioned capability that you lose
with rss. I have no problem with the added sysctl to use rss as the
baseline when enabled.

> 2) add /proc/<pid>/oom_guard_size
> This allows users to specify "valid/expected size" of a task.
> When
> #echo 10M > /proc/<pid>/oom_guard_size
> At OOM calculation, 10Mbytes is subtracted from rss size.
> (The best way is to estimate this automatically from vm_size..but...)

Expected rss is almost impossible to tune for cpusets that have a highly
dynamic set of mems, let alone without containment.

2009-12-15 04:39:19

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [BUGFIX][PATCH] oom-kill: fix NUMA consraint check with nodemask v4.2

On Mon, 14 Dec 2009 20:30:37 -0800 (PST)
David Rientjes <[email protected]> wrote:
> > 2 ideas which I can think of now are..
> > 1) add sysctl_oom_calc_on_committed_memory
> > If this is set, use vm-size instead of rss.
> >
>
> I would agree only if the oom killer used total_vm as a the default, it is
> long-standing and allows for the aforementioned capability that you lose
> with rss. I have no problem with the added sysctl to use rss as the
> baseline when enabled.
>
I'll prepare a patch that adds

sysctl_oom_kill_based_on_rss (default=0)

OK?
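
The wiring would be the usual vm sysctl boilerplate, something like this
(a sketch only; exact ctl_table fields vary by kernel version):

int sysctl_oom_kill_based_on_rss;	/* default = 0: keep total_vm */

static struct ctl_table vm_table[] = {
	/* ... */
	{
		.ctl_name	= CTL_UNNUMBERED,
		.procname	= "oom_kill_based_on_rss",
		.data		= &sysctl_oom_kill_based_on_rss,
		.maxlen		= sizeof(int),
		.mode		= 0644,
		.proc_handler	= &proc_dointvec_minmax,
		.extra1		= &zero,	/* clamp to 0..1 */
		.extra2		= &one,
	},
	/* ... */
};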

Thanks,
-Kame

2009-12-15 04:46:32

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [BUGFIX][PATCH] oom-kill: fix NUMA consraint check with nodemask v4.2

On Mon, 14 Dec 2009 20:30:37 -0800 (PST)
David Rientjes <[email protected]> wrote:

> On Tue, 15 Dec 2009, KAMEZAWA Hiroyuki wrote:
>
> > I'm now preparing more counters for mm's statistics. It's better to
> > wait and to see what we can do more. And other patches for total
> > oom-killer improvement is under development.
> >
> > And, there is a compatibility problem.
> > As David says, this may break some crazy software which uses
> > fake_numa+cpuset+oom_killer+oom_adj for resource controlling.
> > (even if I recommend them to use memcg rather than crazy tricks...)
> >
>
> That's not at all what I said. I said using total_vm as a baseline allows
> users to define when a process is to be considered "rogue," that is, using
> more memory than expected. Using rss would be inappropriate since it is
> highly dynamic and depends on the state of the VM at the time of oom,
> which userspace cannot possibly keep updated.
>
> You consistently ignore that point: the power of /proc/pid/oom_adj to
> influence when a process, such as a memory leaker, is to be considered as
> a high priority for an oom kill. It has absolutely nothing to do with
> fake NUMA, cpusets, or memcg.
>
You also ignore that it's not sane to use the oom killer for resource control ;)

Anyway, the rss patch was dropped as you wanted.
I'll prepare the other ones.

Thanks,
-Kame

2009-12-15 04:48:08

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [BUGFIX][PATCH] oom-kill: fix NUMA consraint check with nodemask v4.2

> On Tue, 15 Dec 2009, KAMEZAWA Hiroyuki wrote:
>
> > I'm now preparing more counters for mm's statistics. It's better to
> > wait and to see what we can do more. And other patches for total
> > oom-killer improvement is under development.
> >
> > And, there is a compatibility problem.
> > As David says, this may break some crazy software which uses
> > fake_numa+cpuset+oom_killer+oom_adj for resource controlling.
> > (even if I recommend them to use memcg rather than crazy tricks...)
> >
>
> That's not at all what I said. I said using total_vm as a baseline allows
> users to define when a process is to be considered "rogue," that is, using
> more memory than expected. Using rss would be inappropriate since it is
> highly dynamic and depends on the state of the VM at the time of oom,
> which userspace cannot possibly keep updated.
>
> You consistently ignore that point: the power of /proc/pid/oom_adj to
> influence when a process, such as a memory leaker, is to be considered as
> a high priority for an oom kill. It has absolutely nothing to do with
> fake NUMA, cpusets, or memcg.

Comparing vsz is only meaningful when instances of the same program are
compared, but the oom killer doesn't do that. Comparing vsz across
different programs does NOT detect a memory leak.


>
> > 2 ideas which I can think of now are..
> > 1) add sysctl_oom_calc_on_committed_memory
> > If this is set, use vm-size instead of rss.
> >
>
> I would agree only if the oom killer used total_vm as a the default, it is
> long-standing and allows for the aforementioned capability that you lose
> with rss. I have no problem with the added sysctl to use rss as the
> baseline when enabled.

Probably nobody agrees with you; your opinion doesn't solve the original
issue. Kernel developers can't ignore bug reports.


>
> > 2) add /proc/<pid>/oom_guard_size
> > This allows users to specify "valid/expected size" of a task.
> > When
> > #echo 10M > /proc/<pid>/oom_guard_size
> > At OOM calculation, 10Mbytes is subtracted from rss size.
> > (The best way is to estimate this automatically from vm_size..but...)
>
> Expected rss is almost impossible to tune for cpusets that have a highly
> dynamic set of mems, let alone without containment.


2009-12-15 04:54:47

by David Rientjes

[permalink] [raw]
Subject: Re: [BUGFIX][PATCH] oom-kill: fix NUMA consraint check with nodemask v4.2

On Tue, 15 Dec 2009, KAMEZAWA Hiroyuki wrote:

> > I would agree only if the oom killer used total_vm as a the default, it is
> > long-standing and allows for the aforementioned capability that you lose
> > with rss. I have no problem with the added sysctl to use rss as the
> > baseline when enabled.
> >
> I'll prepare a patch for adds
>
> sysctl_oom_kill_based_on_rss (default=0)
>
> ok ?
>

I have no strong feelings either for or against that, I guess users who
want to always kill the biggest memory hogger even when single page
__GFP_WAIT allocations fail could use it. I'm not sure it would get much
use, though.

I think we should methodically work out an oom killer badness rewrite that
won't compound the problem by adding more and more userspace knobs. In
other words, we should slow down, construct a list of goals that we want
to achieve, and then see what type of solution we can create.

A few requirements that I have:

- we must be able to define when a task is a memory hogger; this is
currently done by /proc/pid/oom_adj relying on the overall total_vm
size of the task as a baseline. Most users should have a good sense
of when their task is using more memory than expected and killing a
memory leaker should always be the optimal oom killer result. A better
set of units than the current shift on total_vm would be helpful,
though; a sketch of that shift scheme follows after this list.

- we must prefer tasks that run on a cpuset or mempolicy's nodes if the
oom condition is constrained by that cpuset or mempolicy and it's not a
system-wide issue.

- we must be able to polarize the badness heuristic to always select a
particular task first if it's very low priority, or to disable oom
killing for a task if it's must-run.

The proposal may be to remove /proc/pid/oom_adj completely since I know
both you and KOSAKI-san dislike it, but we'd need an alternative which
keeps the above functionality intact.
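
For concreteness, the shift scheme referred to in the first point looks
roughly like this (simplified from the 2.6.32-era badness(); details
omitted):

unsigned long points = mm->total_vm;	/* baseline, in pages */

if (oom_adj > 0) {
	if (!points)
		points = 1;
	points <<= oom_adj;	/* +n multiplies the score by 2^n */
} else
	points >>= -oom_adj;	/* -n divides it by 2^n */

/*
 * Each oom_adj step is thus a power of two on total_vm: at +15 a 1MB
 * (256-page) task scores like a 32GB one, hence the wish for a better
 * set of units.
 */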

2009-12-15 05:00:46

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [BUGFIX][PATCH] oom-kill: fix NUMA consraint check with nodemask v4.2

On Tue, 15 Dec 2009 13:35:46 +0900
KAMEZAWA Hiroyuki <[email protected]> wrote:

> On Mon, 14 Dec 2009 20:30:37 -0800 (PST)
> David Rientjes <[email protected]> wrote:
> > > 2 ideas which I can think of now are..
> > > 1) add sysctl_oom_calc_on_committed_memory
> > > If this is set, use vm-size instead of rss.
> > >
> >
> > I would agree only if the oom killer used total_vm as a the default, it is
> > long-standing and allows for the aforementioned capability that you lose
> > with rss. I have no problem with the added sysctl to use rss as the
> > baseline when enabled.
> >
> I'll prepare a patch for adds
>
> sysctl_oom_kill_based_on_rss (default=0)
>
Hmm..

But for usual desktop users, using rss as the default, as in
"memory-eater-should-die", sounds reasonable to me.
Hmm...maybe some automatic detection logic is required.

As my 1st version shows,

CONSTRAINT_CPUSET -> use vm_size
CONSTRAINT_LOWMEM -> use lowmem_rss
CONSTRAINT_NONE   -> use rss

seems like a landing point for all stakeholders, no?
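
A hypothetical sketch of that landing point (CONSTRAINT_LOWMEM and
get_lowmem_rss() are proposed names, not mainline code):

static unsigned long oom_baseline_by_constraint(struct task_struct *p,
					enum oom_constraint constraint)
{
	switch (constraint) {
	case CONSTRAINT_CPUSET:
		return p->mm->total_vm;		/* vm_size */
	case CONSTRAINT_LOWMEM:
		return get_lowmem_rss(p->mm);	/* new lowmem counter */
	default:				/* CONSTRAINT_NONE, ... */
		return get_mm_rss(p->mm);	/* rss */
	}
}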

Thanks,
-Kame

2009-12-15 04:58:00

by David Rientjes

[permalink] [raw]
Subject: Re: [BUGFIX][PATCH] oom-kill: fix NUMA consraint check with nodemask v4.2

On Tue, 15 Dec 2009, KAMEZAWA Hiroyuki wrote:

> > That's not at all what I said. I said using total_vm as a baseline allows
> > users to define when a process is to be considered "rogue," that is, using
> > more memory than expected. Using rss would be inappropriate since it is
> > highly dynamic and depends on the state of the VM at the time of oom,
> > which userspace cannot possibly keep updated.
> >
> > You consistently ignore that point: the power of /proc/pid/oom_adj to
> > influence when a process, such as a memory leaker, is to be considered as
> > a high priority for an oom kill. It has absolutely nothing to do with
> > fake NUMA, cpusets, or memcg.
> >
> You also ignore that it's not sane to use oom kill for resource control ;)
>

Please read my email. Did I say anything about resource control AT ALL?
I said /proc/pid/oom_adj currently allows userspace to define when a task
is "rogue," meaning its consuming much more memory than expected. Those
memory leakers should always be the optimal result for the oom killer to
kill. Using rss as the baseline would not allow userspace to effectively
do the same thing since it's dynamic and depends on the state of the VM at
the time of oom which is probably not reflected in the /proc/pid/oom_adj
values for all tasks. It has absolutely nothing to do with resource
control, so please address this very trivial issue without going off on
tangents. Thanks.

2009-12-15 05:03:16

by David Rientjes

[permalink] [raw]
Subject: Re: [BUGFIX][PATCH] oom-kill: fix NUMA consraint check with nodemask v4.2

On Tue, 15 Dec 2009, KOSAKI Motohiro wrote:

> To compare vsz is only meaningful when the same program are compared.
> But oom killer don't. To compare vsz between another program DONT detect
> any memory leak.
>

You're losing the ability to detect that memory leak because you'd be
using a baseline that userspace cannot possibly know at the time of oom.
You cannot possibly insist that users understand the amount of resident
memory for all applications when tuning the heuristic adjuster from
userspace.

In other words, how do you plan on userspace being able to identify tasks
that are memory leakers if you change the baseline to rss? Unless you
have an answer to this question, you're not admitting the problem that
the oom killer is primarily designed to address.

2009-12-15 05:12:19

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [BUGFIX][PATCH] oom-kill: fix NUMA consraint check with nodemask v4.2

On Mon, 14 Dec 2009 20:57:53 -0800 (PST)
David Rientjes <[email protected]> wrote:

> On Tue, 15 Dec 2009, KAMEZAWA Hiroyuki wrote:
>
> > > That's not at all what I said. I said using total_vm as a baseline allows
> > > users to define when a process is to be considered "rogue," that is, using
> > > more memory than expected. Using rss would be inappropriate since it is
> > > highly dynamic and depends on the state of the VM at the time of oom,
> > > which userspace cannot possibly keep updated.
> > >
> > > You consistently ignore that point: the power of /proc/pid/oom_adj to
> > > influence when a process, such as a memory leaker, is to be considered as
> > > a high priority for an oom kill. It has absolutely nothing to do with
> > > fake NUMA, cpusets, or memcg.
> > >
> > You also ignore that it's not sane to use oom kill for resource control ;)
> >
>
> Please read my email. Did I say anything about resource control AT ALL?
> I said /proc/pid/oom_adj currently allows userspace to define when a task
> is "rogue," meaning its consuming much more memory than expected. Those
> memory leakers should always be the optimal result for the oom killer to
> kill. Using rss as the baseline would not allow userspace to effectively
> do the same thing since it's dynamic and depends on the state of the VM at
> the time of oom which is probably not reflected in the /proc/pid/oom_adj
> values for all tasks. It has absolutely nothing to do with resource
> control, so please address this very trivial issue without going off on
> tangents. Thanks.

What I can't understand is the technique for knowing whether an (unknown)
process is leaking memory or not by checking vm_size.
And why don't you use overcommit_memory when you can depend on vm_size?

Thanks,
-Kame

2009-12-15 05:19:54

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [BUGFIX][PATCH] oom-kill: fix NUMA consraint check with nodemask v4.2

> On Tue, 15 Dec 2009, KAMEZAWA Hiroyuki wrote:
>
> > > I would agree only if the oom killer used total_vm as a the default, it is
> > > long-standing and allows for the aforementioned capability that you lose
> > > with rss. I have no problem with the added sysctl to use rss as the
> > > baseline when enabled.
> > >
> > I'll prepare a patch for adds
> >
> > sysctl_oom_kill_based_on_rss (default=0)
> >
> > ok ?
> >
>
> I have no strong feelings either for or against that, I guess users who
> want to always kill the biggest memory hogger even when single page
> __GFP_WAIT allocations fail could use it. I'm not sure it would get much
> use, though.
>
> I think we should methodically work out an oom killer badness rewrite that
> won't compound the problem by adding more and more userspace knobs. In
> other words, we should slow down, construct a list of goals that we want
> to achieve, and then see what type of solution we can create.
>
> A few requirements that I have:

Um, good analysis! really.

>
> - we must be able to define when a task is a memory hogger; this is
> currently done by /proc/pid/oom_adj relying on the overall total_vm
> size of the task as a baseline. Most users should have a good sense
> of when their task is using more memory than expected and killing a
> memory leaker should always be the optimal oom killer result. A better
> set of units other than a shift on total_vm would be helpful, though.

nit: what does "Most users" mean? Desktop users (one of the largest user
groups) don't have any expectation of memory usage.

But if admins do have a memory expectation, they should be able to tune
the optimal oom result.

I think you pointed out the right thing.


> - we must prefer tasks that run on a cpuset or mempolicy's nodes if the
> oom condition is constrained by that cpuset or mempolicy and its not a
> system-wide issue.

agreed. (who would disagree?)


> - we must be able to polarize the badness heuristic to always select a
> particular task is if its very low priority or disable oom killing for
> a task if its must-run.

Probably I haven't caught your point. What does "polarize" mean? Can you
please describe it more?


> The proposal may be to remove /proc/pid/oom_adj completely since I know
> both you and KOSAKI-san dislike it, but we'd need an alternative which
> keeps the above functionality intact.

Yes, providing an alternative way is a must.


2009-12-17 22:22:07

by David Rientjes

[permalink] [raw]
Subject: Re: [BUGFIX][PATCH] oom-kill: fix NUMA consraint check with nodemask v4.2

On Tue, 15 Dec 2009, KOSAKI Motohiro wrote:

> > A few requirements that I have:
>
> Um, good analysis! really.
>
> >
> > - we must be able to define when a task is a memory hogger; this is
> > currently done by /proc/pid/oom_adj relying on the overall total_vm
> > size of the task as a baseline. Most users should have a good sense
> > of when their task is using more memory than expected and killing a
> > memory leaker should always be the optimal oom killer result. A better
> > set of units other than a shift on total_vm would be helpful, though.
>
> nit: What's mean "Most users"? desktop user(one of most majority users)
> don't have any expection of memory usage.
>
> but, if admin have memory expection, they should be able to tune
> optimal oom result.
>
> I think you pointed right thing.
>

This is mostly referring to production server users where memory
consumption by particular applications can be estimated, which allows the
kernel to determine when a task is using a wildly unexpected amount that
happens to become egregious enough to force the oom killer into killing a
task.

That is in contrast to using rss as a baseline, where we prefer killing the
application with the most resident RAM. It is not always ideal to kill a
task with 8GB of rss when we fail to allocate a single page for a low
priority task.

> > - we must prefer tasks that run on a cpuset or mempolicy's nodes if the
> > oom condition is constrained by that cpuset or mempolicy and its not a
> > system-wide issue.
>
> agreed. (who disagree it?)
>

It's possible to nullify the current penalization in the badness heuristic
(order 3 reduction) if a candidate task does not share nodes with
current's allowed set either by way of cpusets or mempolicies. For
example, an oom caused by an application with an MPOL_BIND on a single
node can easily kill a task that has no memory resident on that node if
its usage (or rss) is 3 orders higher than any candidate that is allowed
on my bound node.
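
For reference, the penalization in question is a single divide in
badness(), roughly (2.6.32-era code; the helper name may differ by
version):

if (!has_intersects_mems_allowed(p))
	points /= 8;	/* the "order 3" (2^3) reduction */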

> > - we must be able to polarize the badness heuristic to always select a
> > particular task is if its very low priority or disable oom killing for
> > a task if its must-run.
>
> Probably I haven't catch your point. What's mean "polarize"? Can you
> please describe more?
>

We need to be able to polarize tasks so they are always killed regardless
of any kernel heuristic (/proc/pid/oom_adj of +15, currently) or always
chosen last (-16, currently). We also need a way of completely disabling
oom killing for certain tasks such as with OOM_DISABLE.
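
For reference, the knob's 2.6.32-era endpoints (from include/linux/oom.h)
and the disable check are roughly:

#define OOM_DISABLE	(-17)
#define OOM_ADJUST_MIN	(-16)
#define OOM_ADJUST_MAX	15

/* in badness(): a task at OOM_DISABLE is never selected at all */
if (oom_adj == OOM_DISABLE)
	return 0;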

2009-12-17 22:23:47

by David Rientjes

[permalink] [raw]
Subject: Re: [BUGFIX][PATCH] oom-kill: fix NUMA consraint check with nodemask v4.2

On Tue, 15 Dec 2009, KAMEZAWA Hiroyuki wrote:

> What I can't undestand is the technique to know whether a (unknown) process is
> leaking memory or not by checking vm_size.

Memory leaks are better identified via total_vm since leaked memory has a
lower probability of staying resident in physical memory.

2009-12-17 23:37:20

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [BUGFIX][PATCH] oom-kill: fix NUMA consraint check with nodemask v4.2

On Thu, 17 Dec 2009 14:23:39 -0800 (PST)
David Rientjes <[email protected]> wrote:

> On Tue, 15 Dec 2009, KAMEZAWA Hiroyuki wrote:
>
> > What I can't undestand is the technique to know whether a (unknown) process is
> > leaking memory or not by checking vm_size.
>
> Memory leaks are better identified via total_vm since leaked memory has a
> lower probability of staying resident in physical memory.
>
Because malloc() writes a header on newly allocated memory, (vm_size - rss)
cannot grow large in practice; the exception is some important program which
wakes up only once a day, or sleeps in the day and works in the night.

I hope users know the expected memory size of their applications, but I know
they can't. Sigh...

Thanks,
-Kame

2009-12-18 04:30:46

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [BUGFIX][PATCH] oom-kill: fix NUMA consraint check with nodemask v4.2

> On Tue, 15 Dec 2009, KOSAKI Motohiro wrote:
>
> > > A few requirements that I have:
> >
> > Um, good analysis! really.
> >
> > >
> > > - we must be able to define when a task is a memory hogger; this is
> > > currently done by /proc/pid/oom_adj relying on the overall total_vm
> > > size of the task as a baseline. Most users should have a good sense
> > > of when their task is using more memory than expected and killing a
> > > memory leaker should always be the optimal oom killer result. A better
> > > set of units other than a shift on total_vm would be helpful, though.
> >
> > nit: What's mean "Most users"? desktop user(one of most majority users)
> > don't have any expection of memory usage.
> >
> > but, if admin have memory expection, they should be able to tune
> > optimal oom result.
> >
> > I think you pointed right thing.
> >
>
> This is mostly referring to production server users where memory
> consumption by particular applications can be estimated, which allows the
> kernel to determine when a task is using a wildly unexpected amount that
> happens to become egregious enough to force the oom killer into killing a
> task.
>
> That is contrast to using rss as a baseline where we prefer on killing the
> application with the most resident RAM. It is not always ideal to kill a
> task with 8GB of rss when we fail to allocate a single page for a low
> priority task.

VSZ has the same problem if a low priority task allocates the last single page.


> > > - we must prefer tasks that run on a cpuset or mempolicy's nodes if the
> > > oom condition is constrained by that cpuset or mempolicy and its not a
> > > system-wide issue.
> >
> > agreed. (who disagree it?)
> >
>
> It's possible to nullify the current penalization in the badness heuristic
> (order 3 reduction) if a candidate task does not share nodes with
> current's allowed set either by way of cpusets or mempolicies. For
> example, an oom caused by an application with an MPOL_BIND on a single
> node can easily kill a task that has no memory resident on that node if
> its usage (or rss) is 3 orders higher than any candidate that is allowed
> on my bound node.

Yes, it's possible. However, that heuristic is intentional. The code comment says:

/*
 * If p's nodes don't overlap ours, it may still help to kill p
 * because p may have allocated or otherwise mapped memory on
 * this node before. However it will be less likely.
 */

Do you have an alternative plan? How do we know the task doesn't have any
pages on the memory-busted node? We can't add any statistics just for oom,
because almost all systems never use oom; many developers would oppose such
a slowdown.


> > > - we must be able to polarize the badness heuristic to always select a
> > > particular task is if its very low priority or disable oom killing for
> > > a task if its must-run.
> >
> > Probably I haven't catch your point. What's mean "polarize"? Can you
> > please describe more?
>
> We need to be able to polarize tasks so they are always killed regardless
> of any kernel heuristic (/proc/pid/oom_adj of +15, currently) or always
> chosen last (-16, currently). We also need a way of completely disabling
> oom killing for certain tasks such as with OOM_DISABLE.

AFAIK, when admins use a +15 or -16 adjustment, they usually hope to bypass
the kernel heuristic. This is the reason I proposed the new
/proc/pid/oom_priority tunable knob.


2009-12-18 10:05:10

by David Rientjes

[permalink] [raw]
Subject: Re: [BUGFIX][PATCH] oom-kill: fix NUMA consraint check with nodemask v4.2

On Fri, 18 Dec 2009, KOSAKI Motohiro wrote:

> > That is contrast to using rss as a baseline where we prefer on killing the
> > application with the most resident RAM. It is not always ideal to kill a
> > task with 8GB of rss when we fail to allocate a single page for a low
> > priority task.
>
> VSZ has the same problem if low priority task allocate last single page.
>

I don't understand what you're trying to say, sorry. Why, in your mind,
do we always want to prefer to kill the application with the largest
amount of memory present in physical RAM for a single, failed order-0
allocation attempt from a lower priority task?

Additionally, when would it be sufficient to simply fail a ~__GFP_NOFAIL
allocation instead of killing anything?

> yes, possible. however its heuristic is intensional. the code comment says:
>
> /*
> * If p's nodes don't overlap ours, it may still help to kill p
> * because p may have allocated or otherwise mapped memory on
> * this node before. However it will be less likely.
> */
>
> do you have alternative plan? How do we know the task don't have any
> page in memory busted node? we can't add any statistics for oom because
> almost systems never ever use oom. thus, many developer oppose such slowdown.
>

There's nothing wrong with that currently (except it doesn't work for
mempolicies), I'm stating that it is a requirement that we keep such a
penalization in our heuristic if we plan on rewriting it. I was
attempting to get a list of requirements for oom killing decisions so that
we can write a sane heuristic and you're simply defending the status quo
which you insist we should change.

> > We need to be able to polarize tasks so they are always killed regardless
> > of any kernel heuristic (/proc/pid/oom_adj of +15, currently) or always
> > chosen last (-16, currently). We also need a way of completely disabling
> > oom killing for certain tasks such as with OOM_DISABLE.
>
> afaik, when admin use +15 or -16 adjustment, usually they hope to don't use
> kernel heuristic.

That's exactly what I said above.

> This is the reason that I proposed /proc/pid/oom_priority
> new tunable knob.
>

In addition to /proc/pid/oom_adj?? oom_priority on its own does not
allow us to define when a task is a memory leaker based on the expected
memory consumption of a single application. That should be the single
biggest consideration in the new badness heuristic: to define when a task
should be killed because it is rogue.