2006-02-08 18:05:58

by Christoph Lameter

Subject: Terminate process that fails on a constrained allocation

Some allocations are restricted to a limited set of nodes (due to memory
policies or cpuset constraints). If the page allocator is not able to find
enough memory then that does not mean that overall system memory is low.

In particular going postal and more or less randomly shooting at processes
is not likely to help the situation but may just lead to suicide (the
whole system coming down).

It is better to signal to the process that no memory exists given the
constraints that the process (or the configuration of the process) has
placed on the allocation behavior. The process may be killed but then the
sysadmin or developer can investigate the situation. The solution is similar
to what we do when we run out of hugepages.

This patch adds a check before the out of memory killer is invoked. At that
point performance considerations do not matter much so we just scan the zonelist
and reconstruct a list of nodes. If the list of nodes does not contain all
online nodes then this is a constrained allocation and we should not call
the OOM killer.

Signed-off-by: Christoph Lameter <[email protected]>

Index: linux-2.6.16-rc2/mm/page_alloc.c
===================================================================
--- linux-2.6.16-rc2.orig/mm/page_alloc.c 2006-02-02 22:03:08.000000000 -0800
+++ linux-2.6.16-rc2/mm/page_alloc.c 2006-02-08 09:55:21.000000000 -0800
@@ -1011,6 +1011,28 @@ rebalance:
if (page)
goto got_pg;

+#ifdef CONFIG_NUMA
+ /*
+ * In the NUMA case we may have gotten here because the
+ * memory policies or cpusets have restricted the allocation.
+ */
+ {
+ nodemask_t nodes;
+
+ nodes_clear(nodes);
+ for (z = zonelist->zones; *z; z++)
+ if (cpuset_zone_allowed(*z, gfp_mask))
+ node_set((*z)->zone_pgdat->node_id,
+ nodes);
+ /*
+ * If we were only allowed to allocate from
+ * a subset of nodes then terminate the process.
+ */
+ if (!nodes_subset(node_online_map, nodes))
+ return NULL;
+ }
+#endif
+
out_of_memory(gfp_mask, order);
goto restart;
}


2006-02-08 18:14:34

by Andi Kleen

Subject: Re: Terminate process that fails on a constrained allocation

On Wednesday 08 February 2006 19:05, Christoph Lameter wrote:

> This patch adds a check before the out of memory killer is invoked. At that
> point performance considerations do not matter much so we just scan the zonelist
> and reconstruct a list of nodes. If the list of nodes does not contain all
> online nodes then this is a constrained allocation and we should not call
> the OOM killer.

Looks good.

Ok I would have used a noinline function instead of putting the code inline
to prevent register pressure etc. and make it more readable.

-Andi

2006-02-08 18:33:32

by Paul Jackson

Subject: Re: Terminate process that fails on a constrained allocation

A couple of comments ...

It took me an extra couple of passes to understand this code.

I wonder if the following, essentially equivalent version (if I didn't
break something - never tested this), is easier to understand:

#ifdef CONFIG_NUMA
/*
* In the NUMA case we may have gotten here because the
* memory policies or cpusets have restricted the allocation.
*/
{
nodemask_t nodes; /* compute nodes not allowed */

nodes = node_online_map;
for (z = zonelist->zones; *z; z++)
if (cpuset_zone_allowed(*z, gfp_mask))
node_clear((*z)->zone_pgdat->node_id,
nodes);
/*
* If there are any nodes left set in 'nodes', these
* are nodes the cpuset or mempolicy settings aren't
* letting us use. In that case, return NULL to the
* current task, rather than invoking out_of_memory()
* on the system.
*/
if (!nodes_empty(nodes))
return NULL;
}
#endif

Second point - I thought I had already throttled the oom_killer
to some degree, with the lines, in mm/oom_kill.c select_bad_process():

/* If p's nodes don't overlap ours, it won't help to kill p. */
if (!cpuset_excl_nodes_overlap(p))
continue;

What your patch is doing effectively disables the oom_killer for
big numa systems, rather than having it operate within the set
of tasks using overlapping resources.

Do we need this more radical constraint on the oom_killer?

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.925.600.0401

2006-02-08 18:34:57

by Paul Jackson

Subject: Re: Terminate process that fails on a constrained allocation

Andi wrote:
> Ok I would have used a noinline function instead of putting the code inline
> to prevent register pressure etc. and make it more readable.

good idea.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.925.600.0401

2006-02-08 18:42:32

by Christoph Lameter

Subject: Re: Terminate process that fails on a constrained allocation

On Wed, 8 Feb 2006, Paul Jackson wrote:

> What your patch is doing effectively disables the oom_killer for
> big numa systems, rather than having it operate within the set
> of tasks using overlapping resources.

No it only disables the oom killer for constrained allocations. If the big
numa system uses a cpuset that limits the number of nodes then those
allocations occurring within these cpusets are constrained allocations
which will then lead to the killing of the application that allocated too
much memory. I think this is much more consistent than trying to tame the
OOM killer enough to stay in a certain cpuset.

F.e. a sysadmin may mistakenly start a process allocating too much memory
in a cpuset. The OOM killer will then start randomly shooting other
processes one of which may be a critical process.. Ouch.

> Do we need this more radical constraint on the oom_killer?

Yes. One can make the OOM killer go postal if one specifies a memory
policy that contains only one node and then allocates enough memory.

Actually a good Denial of service attack that can be used with cpusets
and memory policies.

2006-02-08 18:54:41

by Christoph Lameter

Subject: Re: Terminate process that fails on a constrained allocation

On Wed, 8 Feb 2006, Paul Jackson wrote:

> good idea.

Ok. Here is V2:

Terminate process that fails on a constrained allocation

Some allocations are restricted to a limited set of nodes (due to memory
policies or cpuset constraints). If the page allocator is not able to find
enough memory then that does not mean that overall system memory is low.

In particular going postal and more or less randomly shooting at processes
is not likely to help the situation but may just lead to suicide (the
whole system coming down).

It is better to signal to the process that no memory exists given the
constraints that the process (or the configuration of the process) has
placed on the allocation behavior. The process may be killed but then the
sysadmin or developer can investigate the situation. The solution is similar
to what we do when running out of hugepages.

This patch adds a check before the out of memory killer is invoked. At that
point performance considerations do not matter much so we just scan the zonelist
and reconstruct a list of nodes. If the list of nodes does not contain all
online nodes then this is a constrained allocation and we should not call
the OOM killer.

Signed-off-by: Christoph Lameter <[email protected]>

Index: linux-2.6.16-rc2/mm/page_alloc.c
===================================================================
--- linux-2.6.16-rc2.orig/mm/page_alloc.c 2006-02-02 22:03:08.000000000 -0800
+++ linux-2.6.16-rc2/mm/page_alloc.c 2006-02-08 10:53:08.000000000 -0800
@@ -817,6 +818,27 @@ failed:
#define ALLOC_CPUSET 0x40 /* check for correct cpuset */

/*
+ * check if a given zonelist allows allocation over all the nodes
+ * in the system.
+ */
+int zonelist_incomplete(struct zonelist *zonelist, gfp_t gfp_mask)
+{
+#ifdef CONFIG_NUMA
+ struct zone **z;
+ nodemask_t nodes;
+
+ nodes = node_online_map;
+ for (z = zonelist->zones; *z; z++)
+ if (cpuset_zone_allowed(*z, gfp_mask))
+ node_clear((*z)->zone_pgdat->node_id,
+ nodes);
+ return !nodes_empty(nodes);
+#else
+ return 0;
+#endif
+}
+
+/*
* Return 1 if free pages are above 'mark'. This takes into account the order
* of the allocation.
*/
@@ -1011,6 +1033,9 @@ rebalance:
if (page)
goto got_pg;

+ if (zonelist_incomplete(zonelist, gfp_mask))
+ return NULL;
+
out_of_memory(gfp_mask, order);
goto restart;
}

2006-02-08 18:57:22

by Paul Jackson

Subject: Re: Terminate process that fails on a constrained allocation

> No it only disables the oom killer for constrained allocations.

But on many big numa systems, the way they are administered,
that effectively disables the oom killer.

> F.e. a sysadmin may mistakenly start a process allocating too much memory
> in a cpuset. The OOM killer will then start randomly shooting other
> processes one of which may be a critical process.. Ouch.

That same exact claim could be made to justify a patch that
removed mm/oom_kill.c entirely. And the same exact claim
could be made, dropping the "in a cpuset" phrase, on a UMA
desktop PC.

The basic premise of the oom killer is that, instead of blowing
off the caller, who might innocently have asked for one page
too many after some other hog used up all the available memory,
we try to pick the worst offender.

Get the worst offender, not just who ever finally hit the limit.

> Actually a good Denial of service attack that can be used with cpusets
> and memory policies.

Nothing special about cpusets there. The same DOS opportunity
exists on simple one cpu, one node systems.

...

Granted, I am objecting to this patch with mixed feelings.

I've yet to be convinced that the oom killer is our friend,
and half of me (not seriously) is almost wishing it were
gone.

Would another option be to continue to fine tune the heuristics
that the oom killer uses to pick its next victim?

What situation did you hit that motivated this change?

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.925.600.0401

2006-02-08 19:03:00

by Christoph Lameter

Subject: Re: Terminate process that fails on a constrained allocation

On Wed, 8 Feb 2006, Paul Jackson wrote:

> > F.e. a sysadmin may mistakenly start a process allocating too much memory
> > in a cpuset. The OOM killer will then start randomly shooting other
> > processes one of which may be a critical process.. Ouch.
>
> That same exact claim could be made to justify a patch that
> removed mm/oom_kill.c entirely. And the same exact claim
> could be made, dropping the "in a cpuset" phrase, on a UMA
> desktop PC.

Well the justification for the OOM Killer was that it keeps the system
as a whole alive ..... cpusets are not vital to the system. Usually the
folks using cpusets are very aware of the memory usage of their processes
and I think they would hate to have a random process killed.

The obvious case is that someone starts a process that allocates a
large amount of memory while other processes are already running.

> The basic premise of the oom killer is that, instead of blowing
> off the caller, who might innocently have asked for one page
> too many after some other hog used up all the available memory,
> we try to pick the worst offender.
>
> Get the worst offender, not just who ever finally hit the limit.

Typically the worst offender is the one who just asked for the page.

> I've yet to be convinced that the oom killer is our friend,
> and half of me (not seriously) is almost wishing it were
> gone.

I definitely agree with you. Let's at least keep it from doing harm as much
as possible.

> Would another option be to continue to fine tune the heuristics
> that the oom killer uses to pick its next victim?
>
> What situation did you hit that motivated this change?

mbind to one node. Allocate on that node until you trigger the OOM killer.
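
For concreteness, a minimal userspace sketch of that scenario (not from
this thread; it assumes libnuma's mbind() wrapper from <numaif.h>, linking
with -lnuma, and a machine that has a node 0):

#define _GNU_SOURCE
#include <numaif.h>		/* mbind(), MPOL_BIND */
#include <sys/mman.h>
#include <string.h>

int main(void)
{
	unsigned long nodemask = 1UL << 0;	/* allow node 0 only */
	size_t len = 1UL << 30;			/* 1 GB per mapping */

	for (;;) {
		char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
			       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		if (p == MAP_FAILED)
			break;
		/* restrict the mapping to node 0, then fault it in */
		mbind(p, len, MPOL_BIND, &nodemask, sizeof(nodemask) * 8, 0);
		memset(p, 1, len);
	}
	return 0;
}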

2006-02-08 19:07:42

by Andi Kleen

Subject: Re: Terminate process that fails on a constrained allocation

On Wednesday 08 February 2006 19:54, Christoph Lameter wrote:

> Index: linux-2.6.16-rc2/mm/page_alloc.c
> ===================================================================
> --- linux-2.6.16-rc2.orig/mm/page_alloc.c 2006-02-02 22:03:08.000000000 -0800
> +++ linux-2.6.16-rc2/mm/page_alloc.c 2006-02-08 10:53:08.000000000 -0800
> @@ -817,6 +818,27 @@ failed:
> #define ALLOC_CPUSET 0x40 /* check for correct cpuset */
>
> /*
> + * check if a given zonelist allows allocation over all the nodes
> + * in the system.
> + */
> +int zonelist_incomplete(struct zonelist *zonelist, gfp_t gfp_mask)

static noinline

-Andi

2006-02-08 19:07:42

by Andi Kleen

Subject: Re: Terminate process that fails on a constrained allocation

On Wednesday 08 February 2006 19:57, Paul Jackson wrote:
> > No it only disables the oom killer for constrained allocations.
>
> But on many big numa systems, the way they are administered,
> that effectively disables the oom killer.

I guess it won't matter because they are administered with appropriate
ulimits. And to be honest the OOM killer never really worked all that
well, so it's not a big loss.

[still often wish we had that "global virtual memory ulimit for uid"]


> I've yet to be convinced that the oom killer is our friend,
> and half of me (not seriously) is almost wishing it were
> gone.

It's more than half of me near seriously agreeing with you.

> Would another option be to continue to fine tune the heuristics
> that the oom killer uses to pick its next victim?
>
> What situation did you hit that motivated this change?

It's a long-known design bug of the NUMA policy, but it was recently
hit again by some test program.


-Andi

2006-02-08 19:15:59

by Christoph Lameter

Subject: Re: Terminate process that fails on a constrained allocation

On Wed, 8 Feb 2006, Andi Kleen wrote:

> static noinline

Next rev:

Terminate process that fails on a constrained allocation

Some allocations are restricted to a limited set of nodes (due to memory
policies or cpuset constraints). If the page allocator is not able to find
enough memory then that does not mean that overall system memory is low.

In particular going postal and more or less randomly shooting at processes
is not likely to help the situation but may just lead to suicide (the
whole system coming down).

It is better to signal to the process that no memory exists given the
constraints that the process (or the configuration of the process) has
placed on the allocation behavior. The process may be killed but then the
sysadmin or developer can investigate the situation. The solution is similar
to what we do when running out of hugepages.

This patch adds a check before the out of memory killer is invoked. At that
point performance considerations do not matter much so we just scan the zonelist
and reconstruct a list of nodes. If the list of nodes does not contain all
online nodes then this is a constrained allocation and we should not call
the OOM killer.

Signed-off-by: Christoph Lameter <[email protected]>

Index: linux-2.6.16-rc2/mm/page_alloc.c
===================================================================
--- linux-2.6.16-rc2.orig/mm/page_alloc.c 2006-02-02 22:03:08.000000000 -0800
+++ linux-2.6.16-rc2/mm/page_alloc.c 2006-02-08 11:14:32.000000000 -0800
@@ -612,6 +612,7 @@ void drain_remote_pages(void)
}
local_irq_restore(flags);
}
+
#endif

#if defined(CONFIG_PM) || defined(CONFIG_HOTPLUG_CPU)
@@ -817,6 +818,27 @@ failed:
#define ALLOC_CPUSET 0x40 /* check for correct cpuset */

/*
+ * Check if a given zonelist is complete and allows allocation over all
+ * the nodes in the system.
+ */
+static noinline int zonelist_incomplete(struct zonelist *zonelist, gfp_t gfp_mask)
+{
+#ifdef CONFIG_NUMA
+ struct zone **z;
+ nodemask_t nodes;
+
+ nodes = node_online_map;
+ for (z = zonelist->zones; *z; z++)
+ if (cpuset_zone_allowed(*z, gfp_mask))
+ node_clear((*z)->zone_pgdat->node_id,
+ nodes);
+ return !nodes_empty(nodes);
+#else
+ return 0;
+#endif
+}
+
+/*
* Return 1 if free pages are above 'mark'. This takes into account the order
* of the allocation.
*/
@@ -1011,6 +1033,15 @@ rebalance:
if (page)
goto got_pg;

+ /*
+ * If we allocated using a zonelist that does not include
+ * all nodes in the system then the OOM situation may be
+ * due to configured limitations. Just abort the application
+ * instead of calling the OOM killer.
+ */
+ if (zonelist_incomplete(zonelist, gfp_mask))
+ return NULL;
+
out_of_memory(gfp_mask, order);
goto restart;
}

2006-02-08 20:14:44

by George Spelvin

Subject: Re: Terminate process that fails on a constrained allocation

Would perhaps a less drastic solution, that at least supports the common
partitioned-system configuration, be to limit the oom killer to processes
whose nodes are a *subset* of ours?

That way, a limited process won't kill any unlimited processes, but it
can fight with other processes with the same limits. However, an
unlimited process discovering oom can kill anything on the system if
necessary.

(This requires a modified version of cpuset_excl_nodes_overlap that
calls nodes_subset() instead of nodes_intersects(), but it's pretty
straightforward.)
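
For illustration, the shape of that change (hypothetical name; simplified
to compare the tasks' mems_allowed directly rather than walking up to the
nearest exclusive ancestor cpuset as the real helper does):

/* like cpuset_excl_nodes_overlap(), but only true when p is at least
 * as constrained as the current task */
static int cpuset_excl_nodes_subset(const struct task_struct *p)
{
	return nodes_subset(p->mems_allowed, current->mems_allowed);
}

	/* select_bad_process() would then do */
	if (!cpuset_excl_nodes_subset(p))
		continue;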

2006-02-08 20:22:37

by Paul Jackson

Subject: Re: Terminate process that fails on a constrained allocation

If the oom killer is of any use on a numa system using set_mempolicy,
mbind and/or cpusets, then it would be of use in the bootcpuset -- that
cpuset which is often set up to hold the classic Unix daemons. I'm not
sure we want to do that.

If we changed the mm/oom_kill.c select_bad_process() constraint from

/* If p's nodes don't overlap ours, it won't help to kill p. */
if (!cpuset_excl_nodes_overlap(p))
continue;

which allows killing any task whose nodes overlap with those of the current task:

1) to the much tighter limit of only killing tasks that are on the same or
a subset of the same nodes, and

2) changed it to look at -both- the cpuset and MPOL_BIND constraints,

then oom killer would work in a bootcpuset rather as it does now on
simple UMA systems.

The new logic would be -- only kill tasks that are constrained to
the same or a subset of the same nodes as the current task.
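
A rough sketch of what that could look like (hypothetical helpers, not from
any of the patches in this thread; note that in 2.6.16 an MPOL_BIND policy
stores a zonelist rather than a nodemask, so mpol_bind_nodes() below stands
in for the needed conversion):

static nodemask_t task_allowed_nodes(struct task_struct *p)
{
	nodemask_t nodes = p->mems_allowed;	/* cpuset constraint */

	/* further narrowed by a bind policy, if the task has one */
	if (p->mempolicy && p->mempolicy->policy == MPOL_BIND)
		nodes_and(nodes, nodes, mpol_bind_nodes(p->mempolicy));
	return nodes;
}

	/* in select_bad_process(): skip tasks that are less constrained */
	if (!nodes_subset(task_allowed_nodes(p), task_allowed_nodes(current)))
		continue;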

With this, the problem you describe with the oom killer and a task that
has used mbind or cpuset to just allow a single node would be fixed --
only tasks constrained to just that node, no more, would be considered
for killing.

At the same time, a typical bootcpuset would have oom killer behaviour
resembling what people have become accustomed to.

If we are going to neuter or eliminate the oom killer, it seems like
we should do it across the board, not just on numa boxes using
some form of memory constraint (mempolicy or cpuset).

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.925.600.0401

2006-02-08 20:36:59

by Christoph Lameter

Subject: Re: Terminate process that fails on a constrained allocation

On Wed, 8 Feb 2006, Paul Jackson wrote:

> The new logic would be -- only kill tasks that are constrained to
> the same or a subset of the same nodes as the current task.

If a task has restricted its memory allocation to one node and does
excessive allocations then that process needs to die, not other processes
that are harmlessly running on the node and that may not be allocating
memory at the time.

> At the same time, a typical bootcpuset would have oom killer behaviour
> resembling what people have become accustomed to.

People are accustomed to having random processes killed? <shudder>

> If we are going to neuter or eliminate the oom killer, it seems like
> we should do it across the board, not just on numa boxes using
> some form of memory constraint (mempolicy or cpuset).

We are terminating processes that perform restricted allocations. Restricted
allocations are only possible on NUMA boxes so the phrase "numa boxes
using some form of memory constraint" is a tautology. OOM killing makes
sense for global allocations if the system is really tight on memory and
survival is the main goal, even if we have to cannibalize processes.
However, cannibalism is still a taboo.

2006-02-08 20:55:28

by Paul Jackson

Subject: Re: Terminate process that fails on a constrained allocation

Can anyone give us a good reason why we shouldn't just remove the oom
killer, entirely?

Christoph wrote:
> If a task has restricted its memory allocation to one node and does
> excessive allocations then that process needs to die, not other processes
> that are harmlessly running on the node and that may not be allocating
> memory at the time.

That _exact_ same argument applies to a system that only has one node.

If we want to remove the oom killer, lets just remove the oom killer.


> People are accustomed to having random processes killed? <shudder>

That's what the oom killer does ... well, it makes an honest effort
not to be random.

So, yes, since it has been there a long time, people are used to
it. Maybe they don't like it, maybe with good reason. But it
is there.


> OOM killing makes
> sense for global allocations if the system is really tight on memory and
> survival is the main goal

If that argument justifies OOM killing on a simple UMA system, then
surely, for -some- critical tasks, it justifies it on a big NUMA system.

Either OOM is useful in some cases or it is not.


--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.925.600.0401

2006-02-08 21:01:33

by Andi Kleen

Subject: Re: Terminate process that fails on a constrained allocation

On Wednesday 08 February 2006 21:55, Paul Jackson wrote:

> If that argument justifies OOM killing on a simple UMA system, then
> surely, for -some- critical tasks, it justifies it on a big NUMA system.
>
> Either OOM is useful in some cases or it is not.


I don't think you really want to open a full scale "is the oom killer needed"
thread. Check the archives - there have been some going on for months.

But I think we can agree that together with mbind the oom killer is pretty
useless, can't we?


-Andi (who is definitely in favour of Christoph's latest patch)

2006-02-08 21:03:56

by Paul Jackson

Subject: Re: Terminate process that fails on a constrained allocation

> I don't think you really want to open a full scale "is the oom killer needed"
> thread. Check the archives - there have been some going on for months.
>
> But I think we can agree that together with mbind the oom killer is pretty
> useless, can't we?

Excellent points.

I approve this patch.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.925.600.0401

2006-02-08 21:21:58

by Andi Kleen

Subject: Re: Terminate process that fails on a constrained allocation

On Wednesday 08 February 2006 22:03, Paul Jackson wrote:
> > I don't think you really want to open a full scale "is the oom killer needed"
> > thread. Check the archives - there have been some going on for months.
> >
> > But I think we can agree that together with mbind the oom killer is pretty
> > useless, can't we?
>
> Excellent points.
>
> I approve this patch.

I think it should be put into 2.6.16. Andrew?

I had a small objection about adding static noinline, but it's really not important
and the patch can be used as is.

-Andi


2006-02-08 21:39:40

by Andrew Morton

Subject: Re: Terminate process that fails on a constrained allocation

Andi Kleen <[email protected]> wrote:
>
> On Wednesday 08 February 2006 22:03, Paul Jackson wrote:
> > > I don't think you really want to open a full scale "is the oom killer needed"
> > > thread. Check the archives - there have been some going on for months.
> > >
> > > But I think we can agree that together with mbind the oom killer is pretty
> > > useless, can't we?
> >
> > Excellent points.
> >
> > I approve this patch.
>
> I think it should be put into 2.6.16. Andrew?

Does every single caller of __alloc_pages(__GFP_FS) correctly handle a NULL
return? I doubt it, in which case this patch will cause oopses and hangs.

2006-02-08 22:12:01

by Christoph Lameter

Subject: Re: Terminate process that fails on a constrained allocation

On Wed, 8 Feb 2006, Andrew Morton wrote:

> > I think it should be put into 2.6.16. Andrew?
>
> Does every single caller of __alloc_pages(__GFP_FS) correctly handle a NULL
> return? I doubt it, in which case this patch will cause oopses and hangs.

I sent you a patch with static inline..... But I am having second thoughts
about this patch. Paul is partially right. Maybe we can move the logic
into the out_of_memory handler for now? That would allow us to implement
more sophisticated things later (for example, page migration would allow us
to move the memory of processes that can also allocate on other nodes off the
nodes where we lack memory), and Paul may put something in there to
address his concerns.

---

Terminate process that fails on a constrained allocation

Some allocations are restricted to a limited set of nodes (due to memory
policies or cpuset constraints). If the page allocator is not able to find
enough memory then that does not mean that overall system memory is low.

In particular going postal and more or less randomly shooting at processes
is not likely to help the situation but may just lead to suicide (the
whole system coming down).

It is better to signal to the process that no memory exists given the
constraints that the process (or the configuration of the process) has
placed on the allocation behavior. The process may be killed but then the
sysadmin or developer can investigate the situation. The solution is similar
to what we do when running out of hugepages.

This patch adds a check before the out of memory killer is invoked. At that
point performance considerations do not matter much so we just scan the zonelist
and reconstruct a list of nodes. If the list of nodes does not contain all
online nodes then this is a constrained allocation and we should not call
the OOM killer.

Signed-off-by: Christoph Lameter <[email protected]>

Index: linux-2.6.16-rc2/mm/page_alloc.c
===================================================================
--- linux-2.6.16-rc2.orig/mm/page_alloc.c 2006-02-02 22:03:08.000000000 -0800
+++ linux-2.6.16-rc2/mm/page_alloc.c 2006-02-08 14:06:12.000000000 -0800
@@ -1011,8 +1012,10 @@ rebalance:
if (page)
goto got_pg;

- out_of_memory(gfp_mask, order);
- goto restart;
+ if (out_of_memory(zonelist, gfp_mask, order))
+ goto restart;
+
+ return NULL;
}

/*
Index: linux-2.6.16-rc2/include/linux/swap.h
===================================================================
--- linux-2.6.16-rc2.orig/include/linux/swap.h 2006-02-02 22:03:08.000000000 -0800
+++ linux-2.6.16-rc2/include/linux/swap.h 2006-02-08 14:10:02.000000000 -0800
@@ -147,7 +147,7 @@ struct swap_list_t {
#define vm_swap_full() (nr_swap_pages*2 < total_swap_pages)

/* linux/mm/oom_kill.c */
-extern void out_of_memory(gfp_t gfp_mask, int order);
+extern int out_of_memory(struct zonelist *zl, gfp_t gfp_mask, int order);

/* linux/mm/memory.c */
extern void swapin_readahead(swp_entry_t, unsigned long, struct vm_area_struct *);
Index: linux-2.6.16-rc2/mm/oom_kill.c
===================================================================
--- linux-2.6.16-rc2.orig/mm/oom_kill.c 2006-02-02 22:03:08.000000000 -0800
+++ linux-2.6.16-rc2/mm/oom_kill.c 2006-02-08 14:09:43.000000000 -0800
@@ -131,6 +131,27 @@ unsigned long badness(struct task_struct
}

/*
+ * check if a given zonelist allows allocation over all the nodes
+ * in the system.
+ */
+static noinline int zonelist_incomplete(struct zonelist *zonelist, gfp_t gfp_mask)
+{
+#ifdef CONFIG_NUMA
+ struct zone **z;
+ nodemask_t nodes;
+
+ nodes = node_online_map;
+ for (z = zonelist->zones; *z; z++)
+ if (cpuset_zone_allowed(*z, gfp_mask))
+ node_clear((*z)->zone_pgdat->node_id,
+ nodes);
+ return !nodes_empty(nodes);
+#else
+ return 0;
+#endif
+}
+
+/*
* Simple selection loop. We chose the process with the highest
* number of 'points'. We expect the caller will lock the tasklist.
*
@@ -262,12 +283,23 @@ static struct mm_struct *oom_kill_proces
* killing a random task (bad), letting the system crash (worse)
* OR try to be smart about which process to kill. Note that we
* don't have to be perfect here, we just have to be good.
+ *
+ * Function returns 1 if the allocation should be retried.
+ * zero if the allocation should fail.
*/
-void out_of_memory(gfp_t gfp_mask, int order)
+int out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask, int order)
{
struct mm_struct *mm = NULL;
task_t * p;

+ /*
+ * Simply fail an allocation that does not allow
+ * allocation on all nodes. We may want to have
+ * more sophisticated logic here in the future.
+ */
+ if (zonelist_incomplete(zonelist, gfp_mask))
+ return 0;
+
if (printk_ratelimit()) {
printk("oom-killer: gfp_mask=0x%x, order=%d\n",
gfp_mask, order);
@@ -306,4 +338,5 @@ retry:
*/
if (!test_thread_flag(TIF_MEMDIE))
schedule_timeout_interruptible(1);
+ return 1;
}

2006-02-08 22:48:24

by Christoph Lameter

Subject: Re: Terminate process that fails on a constrained allocation

Cleaned up and tested version. Better comments, and I had forgotten to modify
drivers/char/sysrq.c:


Terminate process that fails on a constrained allocation

Some allocations are restricted to a limited set of nodes (due to memory
policies or cpuset constraints). If the page allocator is not able to find
enough memory then that does not mean that overall system memory is low.

In particular going postal and more or less randomly shooting at processes
is not likely to help the situation but may just lead to suicide (the
whole system coming down).

It is better to signal to the process that no memory exists given the
constraints that the process (or the configuration of the process) has
placed on the allocation behavior. The process may be killed but then the
sysadmin or developer can investigate the situation. The solution is similar
to what we do when running out of hugepages.

This patch adds a check before we kill processes. At that
point performance considerations do not matter much so we just scan the zonelist
and reconstruct a list of nodes. If the list of nodes does not contain all
online nodes then this is a constrained allocation and we should not kill
processes but fail the allocation.

Signed-off-by: Christoph Lameter <[email protected]>

Index: linux-2.6.16-rc2/mm/page_alloc.c
===================================================================
--- linux-2.6.16-rc2.orig/mm/page_alloc.c 2006-02-02 22:03:08.000000000 -0800
+++ linux-2.6.16-rc2/mm/page_alloc.c 2006-02-08 14:28:50.000000000 -0800
@@ -1011,8 +1011,10 @@ rebalance:
if (page)
goto got_pg;

- out_of_memory(gfp_mask, order);
- goto restart;
+ if (out_of_memory(zonelist, gfp_mask, order))
+ goto restart;
+
+ return NULL;
}

/*
Index: linux-2.6.16-rc2/include/linux/swap.h
===================================================================
--- linux-2.6.16-rc2.orig/include/linux/swap.h 2006-02-02 22:03:08.000000000 -0800
+++ linux-2.6.16-rc2/include/linux/swap.h 2006-02-08 14:28:50.000000000 -0800
@@ -147,7 +147,7 @@ struct swap_list_t {
#define vm_swap_full() (nr_swap_pages*2 < total_swap_pages)

/* linux/mm/oom_kill.c */
-extern void out_of_memory(gfp_t gfp_mask, int order);
+extern int out_of_memory(struct zonelist *zl, gfp_t gfp_mask, int order);

/* linux/mm/memory.c */
extern void swapin_readahead(swp_entry_t, unsigned long, struct vm_area_struct *);
Index: linux-2.6.16-rc2/mm/oom_kill.c
===================================================================
--- linux-2.6.16-rc2.orig/mm/oom_kill.c 2006-02-02 22:03:08.000000000 -0800
+++ linux-2.6.16-rc2/mm/oom_kill.c 2006-02-08 14:36:27.000000000 -0800
@@ -131,6 +131,27 @@ unsigned long badness(struct task_struct
}

/*
+ * check if a given zonelist allows allocation over all the nodes
+ * in the system.
+ */
+static noinline int zonelist_incomplete(struct zonelist *zonelist, gfp_t gfp_mask)
+{
+#ifdef CONFIG_NUMA
+ struct zone **z;
+ nodemask_t nodes;
+
+ nodes = node_online_map;
+ for (z = zonelist->zones; *z; z++)
+ if (cpuset_zone_allowed(*z, gfp_mask))
+ node_clear((*z)->zone_pgdat->node_id,
+ nodes);
+ return !nodes_empty(nodes);
+#else
+ return 0;
+#endif
+}
+
+/*
* Simple selection loop. We chose the process with the highest
* number of 'points'. We expect the caller will lock the tasklist.
*
@@ -256,18 +277,33 @@ static struct mm_struct *oom_kill_proces
}

/**
- * oom_kill - kill the "best" process when we run out of memory
+ * out_of_memory - deal with out of memory situations
+ *
+ * If we are out of memory then check if this was due to the allocation
+ * being restricted to only a part of system memory. If so then
+ * fail the allocation.
*
* If we run out of memory, we have the choice between either
* killing a random task (bad), letting the system crash (worse)
* OR try to be smart about which process to kill. Note that we
* don't have to be perfect here, we just have to be good.
+ *
+ * Returns 1 if the allocation should be retried.
+ * 0 if the allocation should fail.
*/
-void out_of_memory(gfp_t gfp_mask, int order)
+int out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask, int order)
{
struct mm_struct *mm = NULL;
task_t * p;

+ /*
+ * Simply fail an allocation that does not allow
+ * allocation on all nodes. We may want to have
+ * more sophisticated logic here in the future.
+ */
+ if (zonelist_incomplete(zonelist, gfp_mask))
+ return 0;
+
if (printk_ratelimit()) {
printk("oom-killer: gfp_mask=0x%x, order=%d\n",
gfp_mask, order);
@@ -306,4 +342,5 @@ retry:
*/
if (!test_thread_flag(TIF_MEMDIE))
schedule_timeout_interruptible(1);
+ return 1;
}
Index: linux-2.6.16-rc2/drivers/char/sysrq.c
===================================================================
--- linux-2.6.16-rc2.orig/drivers/char/sysrq.c 2006-02-02 22:03:08.000000000 -0800
+++ linux-2.6.16-rc2/drivers/char/sysrq.c 2006-02-08 14:29:17.000000000 -0800
@@ -243,7 +243,7 @@ static struct sysrq_key_op sysrq_term_op

static void moom_callback(void *ignored)
{
- out_of_memory(GFP_KERNEL, 0);
+ out_of_memory(&NODE_DATA(0)->node_zonelists[ZONE_NORMAL], GFP_KERNEL, 0);
}

static DECLARE_WORK(moom_work, moom_callback, NULL);

2006-02-08 23:23:52

by Andi Kleen

Subject: Re: Terminate process that fails on a constrained allocation

On Wednesday 08 February 2006 23:11, Christoph Lameter wrote:
> On Wed, 8 Feb 2006, Andrew Morton wrote:
>
> > > I think it should be put into 2.6.16. Andrew?
> >
> > Does every single caller of __alloc_pages(__GFP_FS) correctly handle a NULL
> > return? I doubt it, in which case this patch will cause oopses and hangs.
>
> I sent you a patch with static inline.....

noinline

> But I am having second thoughts
> about this patch. Paul is partially right. Maybe we can move the logic
> into the out_of_memory handler for now? That would allow us to implement
> more sophisticated things later

I have my doubts that's really worth it, but ok.

> (for example page migration would allow us
> to move memory of processes that can also allocate on other nodes from the
> nodes where we lack memory) and Paul may put something in there to
> address his concerns.
>
> ---
>
> Terminate process that fails on a constrained allocation

Patch looks good to me too. Thanks.

Unfortunately Andrew's point with the GFP_NOFS still applies :/
But I would consider any caller of this not handling NULL to be broken.
Andrew do you have any stronger evidence it's a real problem?

Another way would be to force a default non-strict policy with GFP_NOFS, but
that would be somewhat ugly again and impact the fast paths.

-Andi

2006-02-08 23:28:21

by Christoph Lameter

Subject: Re: Terminate process that fails on a constrained allocation

On Wed, 8 Feb 2006, Andrew Morton wrote:

> > I think it should be put into 2.6.16. Andrew?
>
> Does every single caller of __alloc_pages(__GFP_FS) correctly handle a NULL
> return? I doubt it, in which case this patch will cause oopses and hangs.

If a caller cannot handle NULL then __GFP_NOFAIL has to be set, right?

So we will never get to the piece of code under discussion.
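
In caller terms the distinction is roughly this (illustrative only):

	/* a caller that can cope with failure must check for NULL */
	page = alloc_pages(GFP_KERNEL, order);
	if (!page)
		return -ENOMEM;

	/* a caller that cannot cope asks the allocator to retry forever */
	page = alloc_pages(GFP_KERNEL | __GFP_NOFAIL, order);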

2006-02-08 23:30:21

by Christoph Lameter

Subject: Re: Terminate process that fails on a constrained allocation

On Wed, 8 Feb 2006, Andi Kleen wrote:

> Unfortunately Andrew's point with the GFP_NOFS still applies :/
> But I would consider any caller of this not handling NULL to be broken.
> Andrew do you have any stronger evidence it's a real problem?

I would think that a caller will have to set __GFP_NOFAIL if a NULL is
unacceptable.

2006-02-08 23:36:33

by Andrew Morton

Subject: Re: Terminate process that fails on a constrained allocation

Andi Kleen <[email protected]> wrote:
>
>
> Unfortunately Andrew's point with the GFP_NOFS still applies :/
> But I would consider any caller of this not handling NULL to be broken.

Sure.

> Andrew do you have any stronger evidence it's a real problem?

No. About a million years ago I had a make-alloc_pages-fail-at-1%-rate
debug patch. Doing that again would be an interesting exercise.

Another option is to only fail userspace allocations (__GFP_HIGHMEM is set,
or a new flag in GFP_HIGHUSER). For the workloads which the NUMA guys care
about I expect this is a 99% solution and the amount of code which it puts
at risk is vastly less.
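
A rough sketch of that variant (not an actual patch from this thread),
expressed against the check discussed above:

	if (zonelist_incomplete(zonelist, gfp_mask)) {
		if (gfp_mask & __GFP_HIGHMEM)
			return NULL;	/* userspace page: fail the caller */
		/* kernel allocation: fall through to the OOM killer */
	}
	out_of_memory(gfp_mask, order);
	goto restart;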

2006-02-08 23:44:11

by Andrew Morton

Subject: Re: Terminate process that fails on a constrained allocation

Christoph Lameter <[email protected]> wrote:
>
> On Wed, 8 Feb 2006, Andrew Morton wrote:
>
> > > I think it should be put into 2.6.16. Andrew?
> >
> > Does every single caller of __alloc_pages(__GFP_FS) correctly handle a NULL
> > return? I doubt it, in which case this patch will cause oopses and hangs.
>
> If a caller cannot handle NULL then __GFP_NOFAIL has to be set, right?

That would assume non-buggy code. I'm talking about the exercising of
hitherto-unused codepaths. We've fixed many, many pieces of code which
simply assumed that kmalloc(GFP_KERNEL) succeeds. I doubt if many such
simple bugs still exist, but there will be more subtle ones in there.

2006-02-08 23:54:28

by Christoph Lameter

Subject: Re: Terminate process that fails on a constrained allocation

On Wed, 8 Feb 2006, Andrew Morton wrote:

> > If a caller cannot handle NULL then __GFP_NOFAIL has to be set, right?
>
> That would assume non-buggy code. I'm talking about the exercising of
> hitherto-unused codepaths. We've fixed many, many pieces of code which
> simply assumed that kmalloc(GFP_KERNEL) succeeds. I doubt if many such
> simple bugs still exist, but there will be more subtle ones in there.

We could add __GFP_NOFAIL to kmem_getpages in slab.c to ensure that
kmalloc waits rather than returning NULL. Also too drastic a measure, right?
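
Roughly, that would amount to one line in slab.c (illustrative; the
surrounding kmem_getpages() code is elided):

	/* in kmem_getpages(), before the page allocation */
	flags |= __GFP_NOFAIL;	/* never hand NULL back up to kmalloc() */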

2006-02-08 23:57:23

by Andi Kleen

Subject: Re: Terminate process that fails on a constrained allocation

On Thursday 09 February 2006 00:54, Christoph Lameter wrote:
> On Wed, 8 Feb 2006, Andrew Morton wrote:
>
> > > If a caller cannot handle NULL then __GFP_NOFAIL has to be set, right?
> >
> > That would assume non-buggy code. I'm talking about the exercising of
> > hitherto-unused codepaths. We've fixed many, many pieces of code which
> > simply assumed that kmalloc(GFP_KERNEL) succeeds. I doubt if many such
> > simple bugs still exist, but there will be more subtle ones in there.
>
> We could add __GFP_NOFAIL to kmem_getpages in slab.c to ensure that
> kmalloc waits rather than returning NULL. Also too drastic a measure, right?

Definitely too drastic. I bet that would cause deadlocks in quite some loads.

-Andi