2006-02-06 20:59:09

by Christoph Lameter

Subject: OOM behavior in constrained memory situations

There are situations in which memory allocations are restricted by policy,
by a cpuset or by type of allocation.

I propose that we need different OOM behavior for the cases in which the
user has imposed a limit on what type of memory may be allocated. In that
case the application should be terminated with an OOM condition. The OOM
killer should not run.

The huge page allocator has already been modified to return a Bus Error
because it would otherwise trigger the OOM killer. It's a bit strange
for an app to get a Bus Error because it is out of memory.

Could we modify the system so that the application requesting
memory is terminated with an out of memory condition if

1. No huge pages are available anymore.

2. The application has set a policy that restricts allocation to
certain nodes.

3. An application is restricted by a cpuset to certain nodes.

4. An application has requested large amounts of memory and the
allocation fails.

That should avoid the OOM killer in most situations.


2006-02-06 21:10:55

by Andrew Morton

Subject: Re: OOM behavior in constrained memory situations

Christoph Lameter <[email protected]> wrote:
>
> There are situations in which memory allocations are restricted by policy,
> by a cpuset or by type of allocation.
>
> I propose that we need different OOM behavior for the cases in which the
> user has imposed a limit on what type of memory may be allocated. In that
> case the application should be terminated with an OOM condition. The OOM
> killer should not run.
>
> The huge page allocator has already been modified to return a Bus Error
> because it would otherwise trigger the OOM killer. It's a bit strange
> for an app to get a Bus Error because it is out of memory.
>
> Could we modify the system so that the application requesting
> memory is terminated with an out of memory condition if
>
> 1. No huge pages are available anymore.
>
> 2. The application has set a policy that restricts allocation to
> certain nodes.
>
> 3. An application is restricted by a cpuset to certain nodes.
>
> 4. An application has requested large amounts of memory and the
> allocation fails.
>
> That should avoid the OOM killer in most situations.

Do we really want to kill the application? A more conventional response
would be to return NULL from the page allocator and let that trickle back.

The hugepage thing is special, because it's a pagefault, not a syscall.

2006-02-06 21:22:56

by Andi Kleen

Subject: Re: OOM behavior in constrained memory situations

On Monday 06 February 2006 22:10, Andrew Morton wrote:

> Do we really want to kill the application? A more conventional response
> would be to return NULL from the page allocator and let that trickle back.

Yes that is what it's supposed to be doing.

> The hugepage thing is special, because it's a pagefault, not a syscall.

At least remnants from my old 80% hack to avoid this (huge_page_needed)
seem to be still there in mainline:

fs/hugetlbfs/inode.c:hugetlbfs_file_mmap

bytes = huge_pages_needed(mapping, vma);
if (!is_hugepage_mem_enough(bytes))
return -ENOMEM;


So something must be broken if this doesn't work. Or did you allocate
the pages in some other way?

From taking a quick look at ipc/shm.c it might be missing an equivalent
check when allocating a huge page segment.

-Andi

2006-02-06 22:11:45

by Christoph Lameter

Subject: Re: OOM behavior in constrained memory situations

On Mon, 6 Feb 2006, Andrew Morton wrote:

> Do we really want to kill the application? A more conventional response
> would be to return NULL from the page allocator and let that trickle back.

Ok. But ultimately that will lead to an application fault or the
termination of the application.

> The hugepage thing is special, because it's a pagefault, not a syscall.

The same can happen if a pagefault occurs in the application but the page
allocator cannot satisfy the allocation. At that point we need to
determine if the allocation was restricted. If so then we are not really
in an OOM situation and the app could be terminated.

2006-02-06 22:17:09

by Christoph Lameter

Subject: Re: OOM behavior in constrained memory situations

On Mon, 6 Feb 2006, Andi Kleen wrote:

> At least remnants from my old 80% hack to avoid this (huge_page_needed)
> seem to be still there in mainline:
>
> fs/hugetlbfs/inode.c:hugetlbfs_file_mmap
>
> bytes = huge_pages_needed(mapping, vma);
> if (!is_hugepage_mem_enough(bytes))
> return -ENOMEM;
>
>
> So something must be broken if this doesn't work. Or did you allocate
> the pages in some other way?

Huge pages are now allocated in the huge fault handler. If it returned
an OOM, the OOM killer might be activated.

2006-02-06 22:24:46

by Andrew Morton

Subject: Re: OOM behavior in constrained memory situations

Christoph Lameter <[email protected]> wrote:
>
> On Mon, 6 Feb 2006, Andrew Morton wrote:
>
> > Do we really want to kill the application? A more conventional response
> > would be to return NULL from the page allocator and let that trickle back.
>
> Ok. But ultimately that will lead to an application fault or the
> termination of the application.

Yup. The application gets to decide what to do if its stat() or read() or
whatever failed.

> > The hugepage thing is special, because it's a pagefault, not a syscall.
>
> The same can happen if a pagefault occurs in the application but the page
> allocator cannot satisfy the allocation. At that point we need to
> determine if the allocation was restricted. If so then we are not really
> in an OOM situation and the app could be terminated.
>

Not too sure what you mean here.

The current behaviour of an oom-in-pagefault is to kill the caller via
do_exit(SIGKILL). (Perhaps hugetlb pages should be doing that too).

If the page allocator decides "hey, this was a restricted allocation
attempt and we cannot satisfy it" then it should return NULL and if it's a
pagefault the app will do the do_exit(SIGKILL). If it's a syscall, that
syscall will return an error indication (and there's a decent chance that
the application will then misbehave, but that's life).

2006-02-06 22:26:28

by Andi Kleen

Subject: Re: OOM behavior in constrained memory situations

On Monday 06 February 2006 23:16, Christoph Lameter wrote:
> On Mon, 6 Feb 2006, Andi Kleen wrote:
>
> > At least remnants from my old 80% hack to avoid this (huge_page_needed)
> > seem to be still there in mainline:
> >
> > fs/hugetlbfs/inode.c:hugetlbfs_file_mmap
> >
> > bytes = huge_pages_needed(mapping, vma);
> > if (!is_hugepage_mem_enough(bytes))
> > return -ENOMEM;
> >
> >
> > So something must be broken if this doesn't work. Or did you allocate
> > the pages in some other way?
>
> Huge pages are now allocated in the huge fault handler. If it returned
> an OOM, the OOM killer might be activated.

Sorry Christoph - somehow I have the feeling we've been miscommunicating
all day.

Of course they are allocated in the huge fault handler. But the point
of that check is to verify up front that there are enough pages free and
fail the allocation early, just to catch the easy mistakes (it has
races, which is why I called it an 80% solution). Just like Linux mmap
traditionally worked, and still does if you don't enable strict
overcommit checking.

-Andi

2006-02-06 22:28:24

by Andrew Morton

Subject: Re: OOM behavior in constrained memory situations

Christoph Lameter <[email protected]> wrote:
>
> On Mon, 6 Feb 2006, Andi Kleen wrote:
>
> > At least remnants from my old 80% hack to avoid this (huge_page_needed)
> > seem to be still there in mainline:
> >
> > fs/hugetlbfs/inode.c:hugetlbfs_file_mmap
> >
> > bytes = huge_pages_needed(mapping, vma);
> > if (!is_hugepage_mem_enough(bytes))
> > return -ENOMEM;
> >
> >
> > So something must be broken if this doesn't work. Or did you allocate
> > the pages in some other way?
>
> Huge pages are now allocated in the huge fault handler. If it returned
> an OOM, the OOM killer might be activated.

?

The oom-killer is invoked from the page allocator. A hugetlb pagefault
won't use the page allocator. So there shouldn't be an oom-killing on
hugepage exhaustion.

I think this comment is just wrong:

/* Logically this is OOM, not a SIGBUS, but an OOM
* could cause the kernel to go killing other
* processes which won't help the hugepage situation
* at all (?) */

A VM_FAULT_OOM from there won't cause the oom-killer to do anything. We
should return VM_FAULT_OOM and let do_page_fault() commit suicide with
SIGKILL.

2006-02-06 22:59:30

by Paul Jackson

Subject: Re: OOM behavior in constrained memory situations

Christoph wrote:
> There are situations in which memory allocations are restricted by policy,
> by a cpuset or by type of allocation.
>
> I propose that we need different OOM behavior for the cases in which the
> user has imposed a limit on what type of memory may be allocated. In that
> case the application should be terminated with an OOM condition. The OOM
> killer should not run.

I'll duck the discussion that followed your post as to whether some
sort of error or null return would be better than killing something.

If it is the case that some code path leads to the OOM killer, then
I don't agree that memory restrictions such as cpuset constraints
should mean we avoid the OOM killer.

I've already changed the OOM killer to only go after tasks in or
overlapping with the same cpuset.

static struct task_struct *select_bad_process(unsigned long *ppoints)
{
...
do_each_thread(g, p) {
...
/* If p's nodes don't overlap ours, it won't help to kill p. */
if (!cpuset_excl_nodes_overlap(p))
continue;

I am guessing (you don't say) that your concern is that it seems
unfair for some app in some small cpuset to be able to trigger the
system-wide OOM killer. The basic problem this caused, killing
unrelated processes in entirely non-overlapping cpusets (which did
nothing to reduce the memory stress in the faulting cpuset), is no
longer an issue.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.925.600.0401

2006-02-07 00:04:05

by Christoph Lameter

Subject: Re: OOM behavior in constrained memory situations

On Mon, 6 Feb 2006, Andrew Morton wrote:

> The oom-killer is invoked from the page allocator. A hugetlb pagefault
> won't use the page allocator. So there shouldn't be an oom-killing on
> hugepage exhaustion.

Right..... and the arch specific fault code (at least ia64) does not call
the OOM killer.

> I think this comment is just wrong:
>
> /* Logically this is OOM, not a SIGBUS, but an OOM
> * could cause the kernel to go killing other
> * processes which won't help the hugepage situation
> * at all (?) */
>
> A VM_FAULT_OOM from there won't cause the oom-killer to do anything. We
> should return VM_FAULT_OOM and let do_page_fault() commit suicide with
> SIGKILL.

Drop my patch that adds the comments explaining the bus error and add this
fix instead. This will terminate the application with an out-of-memory
condition instead of a bus error, and it removes the comment that you
mentioned.

Signed-off-by: Christoph Lameter <[email protected]>

Index: linux-2.6.16-rc2/mm/hugetlb.c
===================================================================
--- linux-2.6.16-rc2.orig/mm/hugetlb.c 2006-02-02 22:03:08.000000000 -0800
+++ linux-2.6.16-rc2/mm/hugetlb.c 2006-02-06 16:02:53.000000000 -0800
@@ -391,12 +391,7 @@ static int hugetlb_cow(struct mm_struct

if (!new_page) {
page_cache_release(old_page);
-
- /* Logically this is OOM, not a SIGBUS, but an OOM
- * could cause the kernel to go killing other
- * processes which won't help the hugepage situation
- * at all (?) */
- return VM_FAULT_SIGBUS;
+ return VM_FAULT_OOM;
}

spin_unlock(&mm->page_table_lock);
@@ -444,6 +439,7 @@ retry:
page = alloc_huge_page(vma, address);
if (!page) {
hugetlb_put_quota(mapping);
+ ret = VM_FAULT_OOM;
goto out;
}

2006-02-07 00:39:14

by Christoph Lameter

Subject: Re: OOM behavior in constrained memory situations

On Mon, 6 Feb 2006, Paul Jackson wrote:

> If it is the case that some code path leads to the OOM killer, then
> I don't agree that memory restrictions such as cpuset constraints
> should mean we avoid the OOM killer.
>
> I've already changed the OOM killer to only go after tasks in or
> overlapping with the same cpuset.

That is not enough. Memory access may also be restricted by policies,
and then the OOM killer will still go postal. If a process has restricted
its memory allocation by either setting up a policy or a cpuset constraint,
then the OOM killer should not be called. Instead either the application
needs to be terminated (if it set the stupid policy) or the cpuset can
perform something OOM-killer-like on the processes it controls.

In its simplest form this could look like the following patch (missing
support for vma policies, and it's not certain that a process has just
attempted an allocation constrained by policy if mempolicy is set).

Maybe we need to set a gfp flag for constrained allocations that will
terminate the app?

Index: linux-2.6.16-rc2/mm/page_alloc.c
===================================================================
--- linux-2.6.16-rc2.orig/mm/page_alloc.c 2006-02-02 22:03:08.000000000 -0800
+++ linux-2.6.16-rc2/mm/page_alloc.c 2006-02-06 16:36:20.000000000 -0800
@@ -1011,6 +1011,15 @@ rebalance:
if (page)
goto got_pg;

+ if (current->mempolicy && current->mempolicy->policy == MPOL_BIND)
+ /*
+ * Process has set up a memory policy which may
+ * constrain allocations. This is not a case
+ * for the OOM killer. Terminate the application
+ * instead.
+ */
+ return NULL;
+
out_of_memory(gfp_mask, order);
goto restart;
}

2006-02-07 01:55:33

by Christoph Lameter

Subject: Re: OOM behavior in constrained memory situations

I just tried to oom a process that has restricted its mem allocation to
node 0 using a memory policy. Instead of an OOM the system began to swap
on node zero. The swapping is restricted to the zones passed to
__alloc_pages. It was thus swapping node zero alone.

If a cpuset restriction is in place then we will still swap to all the
zones passed to alloc_pages(). This includes zones not in the current
cpuset. That's somewhat strange behavior.

Only if swap is off will we reach the OOM killer when node zero or the
cpuset is exhausted.

So swap memory fulfills all requirements of memory policies and cpusets?
Weird.


Here is, first of all, a proposed patch that kills the process that
constrained the allocation instead of starting the OOM killer (in the
no-swap case).

Do not invoke OOM killer for constrained allocations

Add some code to detect policies that do not include all nodes in the system.
For those policies invoke the page allocator with the newly added
__GFP_NO_OOM_KILLER flag. Allocations will then not invoke the OOM killer
(finding no page just means that the system has no available page for
that particular process) but terminate the application instead.

This may interfere with the cpusets mechanism for killing processes
through the OOM killer. This could be fixed by setting __GFP_NO_OOM_KILLER
if the __alloc_pages detects that the active cpuset is constraining allocations.

Signed-off-by: Christoph Lameter <[email protected]>

Index: linux-2.6.16-rc2/mm/page_alloc.c
===================================================================
--- linux-2.6.16-rc2.orig/mm/page_alloc.c 2006-02-02 22:03:08.000000000 -0800
+++ linux-2.6.16-rc2/mm/page_alloc.c 2006-02-06 17:07:41.000000000 -0800
@@ -1011,6 +1011,13 @@ rebalance:
if (page)
goto got_pg;

+ if (gfp_mask & __GFP_NO_OOM_KILLER)
+ /*
+ * Process uses constrained allocations.
+ * Terminate the application instead.
+ */
+ return NULL;
+
out_of_memory(gfp_mask, order);
goto restart;
}
Index: linux-2.6.16-rc2/mm/mempolicy.c
===================================================================
--- linux-2.6.16-rc2.orig/mm/mempolicy.c 2006-02-02 22:03:08.000000000 -0800
+++ linux-2.6.16-rc2/mm/mempolicy.c 2006-02-06 17:07:41.000000000 -0800
@@ -161,6 +161,11 @@ static struct mempolicy *mpol_new(int mo
if (!policy)
return ERR_PTR(-ENOMEM);
atomic_set(&policy->refcnt, 1);
+
+ policy->gfp_flags = 0;
+ if (!nodes_equal(*nodes, node_online_map))
+ policy->gfp_flags |= __GFP_NO_OOM_KILLER;
+
switch (mode) {
case MPOL_INTERLEAVE:
policy->v.nodes = *nodes;
@@ -1219,6 +1224,7 @@ alloc_page_vma(gfp_t gfp, struct vm_area

cpuset_update_task_memory_state();

+ gfp |= pol->gfp_flags;
if (unlikely(pol->policy == MPOL_INTERLEAVE)) {
unsigned nid;

@@ -1255,6 +1261,7 @@ struct page *alloc_pages_current(gfp_t g
cpuset_update_task_memory_state();
if (!pol || in_interrupt())
pol = &default_policy;
+ gfp |= pol->gfp_flags;
if (pol->policy == MPOL_INTERLEAVE)
return alloc_page_interleave(gfp, order, interleave_nodes(pol));
return __alloc_pages(gfp, order, zonelist_policy(gfp, pol));
@@ -1590,6 +1597,9 @@ void mpol_rebind_policy(struct mempolicy
if (nodes_equal(*mpolmask, *newmask))
return;

+ if (!nodes_equal(*newmask, node_online_map))
+ pol->gfp_flags |= __GFP_NO_OOM_KILLER;
+
switch (pol->policy) {
case MPOL_DEFAULT:
break;
Index: linux-2.6.16-rc2/include/linux/mempolicy.h
===================================================================
--- linux-2.6.16-rc2.orig/include/linux/mempolicy.h 2006-02-02 22:03:08.000000000 -0800
+++ linux-2.6.16-rc2/include/linux/mempolicy.h 2006-02-06 17:07:41.000000000 -0800
@@ -62,6 +62,7 @@ struct vm_area_struct;
struct mempolicy {
atomic_t refcnt;
short policy; /* See MPOL_* above */
+ gfp_t gfp_flags; /* flags ORed into gfp_flags for each allocation */
union {
struct zonelist *zonelist; /* bind */
short preferred_node; /* preferred */
Index: linux-2.6.16-rc2/include/linux/gfp.h
===================================================================
--- linux-2.6.16-rc2.orig/include/linux/gfp.h 2006-02-02 22:03:08.000000000 -0800
+++ linux-2.6.16-rc2/include/linux/gfp.h 2006-02-06 17:07:41.000000000 -0800
@@ -47,6 +47,7 @@ struct vm_area_struct;
#define __GFP_ZERO ((__force gfp_t)0x8000u)/* Return zeroed page on success */
#define __GFP_NOMEMALLOC ((__force gfp_t)0x10000u) /* Don't use emergency reserves */
#define __GFP_HARDWALL ((__force gfp_t)0x20000u) /* Enforce hardwall cpuset memory allocs */
+#define __GFP_NO_OOM_KILLER ((__force gfp_t)0x40000u) /* Terminate process do not call OOM killer */

#define __GFP_BITS_SHIFT 20 /* Room for 20 __GFP_FOO bits */
#define __GFP_BITS_MASK ((__force gfp_t)((1 << __GFP_BITS_SHIFT) - 1))

2006-02-07 09:40:30

by Andi Kleen

Subject: Re: OOM behavior in constrained memory situations

On Tuesday 07 February 2006 02:55, Christoph Lameter wrote:
> I just tried to oom a process that has restricted its mem allocation to
> node 0 using a memory policy. Instead of an OOM the system began to swap
> on node zero. The swapping is restricted to the zones passed to
> __alloc_pages. It was thus swapping node zero alone.

Thanks for doing that work. It's needed imho and was on my todo list.


> switch (pol->policy) {
> case MPOL_DEFAULT:
> break;
> Index: linux-2.6.16-rc2/include/linux/mempolicy.h
> ===================================================================
> --- linux-2.6.16-rc2.orig/include/linux/mempolicy.h 2006-02-02 22:03:08.000000000 -0800
> +++ linux-2.6.16-rc2/include/linux/mempolicy.h 2006-02-06 17:07:41.000000000 -0800
> @@ -62,6 +62,7 @@ struct vm_area_struct;
> struct mempolicy {
> atomic_t refcnt;
> short policy; /* See MPOL_* above */
> + gfp_t gfp_flags; /* flags ORed into gfp_flags for each allocation */

I don't think it's a good idea to add it to the struct mempolicy. I've tried to
make it as memory efficient as possible and it would be a waste to add such
a mostly unused field. Better to pass that information around in some other way.

(in the worst case it could be an upper bit in policy, but I would prefer
function arguments, I think)

The rest looks good.

-Andi

2006-02-07 17:29:08

by Christoph Lameter

Subject: Re: OOM behavior in constrained memory situations

On Tue, 7 Feb 2006, Andi Kleen wrote:

> On Tuesday 07 February 2006 02:55, Christoph Lameter wrote:
> > I just tried to oom a process that has restricted its mem allocation to
> > node 0 using a memory policy. Instead of an OOM the system began to swap
> > on node zero. The swapping is restricted to the zones passed to
> > __alloc_pages. It was thus swapping node zero alone.
>
> Thanks for doing that work. It's needed imho and was on my todo list.

This is not talking about the text above but about what comes later, right?
The OOM behavior for a constrained allocation with no swap?

> > + gfp_t gfp_flags; /* flags ORed into gfp_flags for each allocation */
>
> I don't think it's a good idea to add it to the struct mempolicy. I've tried to
> make it as memory efficient as possible and it would be a waste to add such
> a mostly unused field. Better to pass that information around in some other way.

Memory policies are rare and this would be insignificant on any NUMA
system.

> (in the worst case it could be an upper bit in policy, but I would prefer
> function arguments, I think)

An upper bit in policy would require special processing in hot code paths.
The current implementation can simply OR in a value from a cacheline that
is already in the data cache.

I'd rather keep it separate.

Function arguments? Add a function pointer to mempolicy for allocation?

Then there is the other issue:

Should the system swap if an MPOL_BIND request does not find enough
memory? Maybe it would be good to not swap, rely on zone_reclaim only and
fail if there is no local memory?

We could change __GFP_NO_OOM_KILLER to __GFP_CONSTRAINED_ALLOC and then
invoke neither kswapd nor the OOM killer on a constrained allocation.

2006-02-07 17:45:43

by Andi Kleen

Subject: Re: OOM behavior in constrained memory situations

On Tuesday 07 February 2006 18:29, Christoph Lameter wrote:
> On Tue, 7 Feb 2006, Andi Kleen wrote:
>
> > On Tuesday 07 February 2006 02:55, Christoph Lameter wrote:
> > > I just tried to oom a process that has restricted its mem allocation to
> > > node 0 using a memory policy. Instead of an OOM the system began to swap
> > > on node zero. The swapping is restricted to the zones passed to
> > > __alloc_pages. It was thus swapping node zero alone.
> >
> > Thanks for doing that work. It's needed imho and was on my todo list.
>
> This is not talking about the text above but about what comes later, right?
> The OOM behavior for a constrained allocation with no swap?
>
> > > + gfp_t gfp_flags; /* flags ORed into gfp_flags for each allocation */
> >
> > I don't think it's a good idea to add it to the struct mempolicy. I've tried to
> > make it as memory efficient as possible and it would be a waste to add such
> > a mostly unused field. Better to pass that information around in some other way.
>
> Memory policies are rare and this would be insignificant on any NUMA
> system.

It could be a problem on those 32bit NUMA systems with only 1GB of lowmem.
There are some workloads with lots of VMAs and it's in theory possible
some application wants to set a lot of policy for them.

I back then spent some time to make the data structure as small as possible
and I would hate to destroy it with such thoughtless changes.


>
> (in the worst case it could be an upper bit in policy, but I would prefer
> function arguments, I think)
>
> An upper bit in policy would require special processing in hot code paths.
> The current implementation can simply OR in a value that is in a cacheline
> already in the data cache.
>
> I'd rather keep it separate.
>
> Function arguments? Add function pointer to mempolicy for allocation?

I was more thinking:

when MPOL_BIND == node_online_map, automatically revert to MPOL_PREFERRED with an empty mask.
Then on the allocation only set the gfp flag for MPOL_BIND.

Ok, there might be small trouble with node hotplug, but that could probably
be ignored for now.

> Then there is the other issue:
>
> Should the system swap if an MPOL_BIND request does not find enough
> memory? Maybe it would be good to not swap, rely on zone_reclaim only and
> fail if there is no local memory?

Not sure. I guess it depends. Maybe it needs a nodeswappiness sysctl.

>
> > We could change __GFP_NO_OOM_KILLER to __GFP_CONSTRAINED_ALLOC and then
> > invoke neither kswapd nor the OOM killer on a constrained allocation.

That could be a problem if one node is filled with dirty file cache pages,
no? There needs to be some action to free it. I had a few reports of this case.
It needs to make at least some effort to wait for these pages and push them out.

On the other hand, I would like to have less swapping for MPOL_BIND by
default than the global VM does. I remember driving the system into quite
severe swap storms when doing early mempolicy testing.

-Andi

2006-02-07 17:51:10

by Christoph Lameter

Subject: Re: OOM behavior in constrained memory situations

On Tue, 7 Feb 2006, Andi Kleen wrote:

> > Memory policies are rare and this would be insignificant on any NUMA
> > system
>
> It could be a problem on those 32bit NUMA systems with only 1GB of lowmem.
> There are some workloads with lots of VMAs and it's in theory possible
> some application wants to set a lot of policy for them.

But this is still only 4 additional bytes per policy on 32bit NUMA.

> I back then spent some time to make the data structure as small as possible
> and I would hate to destroy it with such thoughtless changes.

Well, the ability to set bits for policy-controlled allocations may come in
handy if we want to support memory policies controlling some other aspect
of page allocation.

> when MPOL_BIND == node_online_map, automatically revert to MPOL_PREFERRED with an empty mask.
> Then on the allocation only set the gfp flag for MPOL_BIND.

That would work for MPOL_BIND. How about MPOL_INTERLEAVE with a restricted
set of nodes? cpusets may cause any policy to have a restricted nodeset.

> > Should the system swap if an MPOL_BIND request does not find enough
> > memory? Maybe it would be good to not swap, rely on zone_reclaim only and
> > fail if there is no local memory?
>
> Not sure. I guess it depends. Maybe it needs a nodeswappiness sysctl.

Hmm.... Maybe make this a separate post?

> > We could change __GFP_NO_OOM_KILLER to __GFP_CONSTRAINED_ALLOC and then
> > invoke neither kswapd nor the OOM killer on a constrained allocation.
>
> That could be a problem if one node is filled with dirty file cache pages,
> no? There needs to be some action to free it. I had a few reports of this case.
> It needs to make at least some effort to wait for these pages and push them out.

zone_reclaim can be configured to write out dirty file cache pages during
reclaim. So this could be addressed.

> On the other hand, I would like to have less swapping for MPOL_BIND by
> default than the global VM does. I remember driving the system into quite
> severe swap storms when doing early mempolicy testing.

Hmmm.. We could now add some other control bits to the mempolicy gfp
mask to affect the behavior of the page allocator.

2006-02-07 18:02:17

by Andi Kleen

Subject: Re: OOM behavior in constrained memory situations

On Tuesday 07 February 2006 18:51, Christoph Lameter wrote:

> > I back then spent some time to make the data structure as small as possible
> > and I would hate to destroy it with such thoughtless changes.
>
> Well, the ability to set bits for policy-controlled allocations may come in
> handy if we want to support memory policies controlling some other aspect
> of page allocation.

Actually, looking at it again, "v" should be aligned to 4 bytes anyway, so
there is an unused 2-byte hole that you can use for this.

I still think the way of converting the policy is more elegant, though.

> > when MPOL_BIND == node_online_map, automatically revert to MPOL_PREFERRED with an empty mask.
> > Then on the allocation only set the gfp flag for MPOL_BIND.
>
> That would work for MPOL_BIND. How about MPOL_INTERLEAVE with a restricted
> set of nodes? cpusets may cause any policy to have a restricted nodeset.

MPOL_INTERLEAVE isn't strict, so it can never cause OOMs unless the complete
system is out of memory. It just changes which node is tried first.

>
> > > Should the system swap if an MPOL_BIND request does not find enough
> > > memory? Maybe it would be good to not swap, rely on zone_reclaim only and
> > > fail if there is no local memory?
> >
> > Not sure. I guess it depends. Maybe it needs a nodeswappiness sysctl.
>
> Hmm.... Maybe make this a separate post?

Do you have enough new thoughts on it for one?

> > > We could change __GFP_NO_OOM_KILLER to __GFP_CONSTRAINED_ALLOC and then
> > > invoke neither kswapd nor the OOM killer on a constrained allocation.
> >
> > That could be a problem if one node is filled with dirty file cache pages,
> > no? There needs to be some action to free it. I had a few reports of this case.
> > It needs to make at least some effort to wait for these pages and push them out.
>
> zone_reclaim can be configured to write out dirty file cache pages during
> reclaim. So this could be addressed.

Would be good.

-Andi

2006-02-07 18:11:10

by Christoph Lameter

Subject: Re: OOM behavior in constrained memory situations

On Tue, 7 Feb 2006, Andi Kleen wrote:

> On Tuesday 07 February 2006 18:51, Christoph Lameter wrote:
>
> > > I back then spent some time to make the data structure as small as possible
> > > and I would hate to destroy it with such thoughtless changes.
> >
> > Well, the ability to set bits for policy-controlled allocations may come in
> > handy if we want to support memory policies controlling some other aspect
> > of page allocation.
>
> Actually, looking at it again, "v" should be aligned to 4 bytes anyway, so
> there is an unused 2-byte hole that you can use for this.
>
> I still think the way of converting the policy is more elegant, though.

True, I could use these two spare bytes to store the two highest bytes of
the GFP flags. But then I'd still have some more complicated bit arithmetic
in the hot code paths of the page allocator to wedge these bits into the
gfp flags.

> > > > Should the system swap if an MPOL_BIND request does not find enough
> > > > memory? Maybe it would be good to not swap, rely on zone_reclaim only and
> > > > fail if there is no local memory?
> > >
> > > Not sure. I guess it depends. Maybe it needs a nodeswappiness sysctl.
> >
> > Hmm.... Maybe make this a separate post?
>
> Do you have enough new thoughts on it for one?

Right now I think it could just work by simply not swapping for
constrained allocations: do an aggressive zone_reclaim that writes out
potentially dirty pages if needed. But that is a significant change of
memory allocation behavior. Maybe that would need some kind of option
during bind that sets something in the gfp_flags to trigger the
aggressive zone_reclaim?

2006-02-07 18:19:14

by Christoph Lameter

[permalink] [raw]
Subject: Re: OOM behavior in constrained memory situations

On Tue, 7 Feb 2006, Andi Kleen wrote:

> Actually, looking at it again, "v" should be aligned to 4 bytes anyway, so
> there is an unused 2-byte hole that you can use for this.

Ok. Here is the hack you wanted:

Index: linux-2.6.16-rc2/mm/mempolicy.c
===================================================================
--- linux-2.6.16-rc2.orig/mm/mempolicy.c 2006-02-07 10:14:43.000000000 -0800
+++ linux-2.6.16-rc2/mm/mempolicy.c 2006-02-07 10:17:42.000000000 -0800
@@ -162,9 +162,9 @@ static struct mempolicy *mpol_new(int mo
return ERR_PTR(-ENOMEM);
atomic_set(&policy->refcnt, 1);

- policy->gfp_flags = 0;
+ policy->gfp_flags_high = 0;
if (!nodes_equal(*nodes, node_online_map))
- policy->gfp_flags |= __GFP_NO_OOM_KILLER;
+ policy->gfp_flags_high |= (__GFP_NO_OOM_KILLER >> 16);

switch (mode) {
case MPOL_INTERLEAVE:
@@ -1224,7 +1224,7 @@ alloc_page_vma(gfp_t gfp, struct vm_area

cpuset_update_task_memory_state();

- gfp |= pol->gfp_flags;
+ gfp |= (pol->gfp_flags_high << 16);
if (unlikely(pol->policy == MPOL_INTERLEAVE)) {
unsigned nid;

@@ -1261,7 +1261,7 @@ struct page *alloc_pages_current(gfp_t g
cpuset_update_task_memory_state();
if (!pol || in_interrupt())
pol = &default_policy;
- gfp |= pol->gfp_flags;
+ gfp |= (pol->gfp_flags_high << 16);
if (pol->policy == MPOL_INTERLEAVE)
return alloc_page_interleave(gfp, order, interleave_nodes(pol));
return __alloc_pages(gfp, order, zonelist_policy(gfp, pol));
@@ -1598,7 +1598,7 @@ void mpol_rebind_policy(struct mempolicy
return;

if (!nodes_equal(*newmask, node_online_map))
- pol->gfp_flags |= __GFP_NO_OOM_KILLER;
+ pol->gfp_flags_high |= (__GFP_NO_OOM_KILLER >> 16);

switch (pol->policy) {
case MPOL_DEFAULT:
Index: linux-2.6.16-rc2/include/linux/mempolicy.h
===================================================================
--- linux-2.6.16-rc2.orig/include/linux/mempolicy.h 2006-02-07 10:14:43.000000000 -0800
+++ linux-2.6.16-rc2/include/linux/mempolicy.h 2006-02-07 10:15:39.000000000 -0800
@@ -62,7 +62,7 @@ struct vm_area_struct;
struct mempolicy {
atomic_t refcnt;
short policy; /* See MPOL_* above */
- gfp_t gfp_flags; /* flags ORed into gfp_flags for each allocation */
+ short gfp_flags_high; /* bits 16-31 ORed into gfp_flags for each allocation */
union {
struct zonelist *zonelist; /* bind */
short preferred_node; /* preferred */

2006-02-07 18:31:24

by Andi Kleen

[permalink] [raw]
Subject: Re: OOM behavior in constrained memory situations

On Tuesday 07 February 2006 19:19, Christoph Lameter wrote:

> @@ -1261,7 +1261,7 @@ struct page *alloc_pages_current(gfp_t g
> cpuset_update_task_memory_state();
> if (!pol || in_interrupt())
> pol = &default_policy;
> - gfp |= pol->gfp_flags;
> + gfp |= (pol->gfp_flags_high << 16);

Why don't you put it into zonelist_policy and pass a pointer to gfp
and then only do it for MPOL_BIND? The code should be similar because
it should be inline anyways (perhaps add a force_inline to be sure)

-Andi

2006-02-07 19:01:04

by Christoph Lameter

[permalink] [raw]
Subject: Re: OOM behavior in constrained memory situations

On Tue, 7 Feb 2006, Andi Kleen wrote:

> On Tuesday 07 February 2006 19:19, Christoph Lameter wrote:
>
> > @@ -1261,7 +1261,7 @@ struct page *alloc_pages_current(gfp_t g
> > cpuset_update_task_memory_state();
> > if (!pol || in_interrupt())
> > pol = &default_policy;
> > - gfp |= pol->gfp_flags;
> > + gfp |= (pol->gfp_flags_high << 16);
>
> Why don't you put it into zonelist_policy and pass a pointer to gfp
> and then only do it for MPOL_BIND? The code should be similar because
> it should be inline anyways (perhaps add a force_inline to be sure)

Like this?

Index: linux-2.6.16-rc2/mm/mempolicy.c
===================================================================
--- linux-2.6.16-rc2.orig/mm/mempolicy.c 2006-02-07 10:14:43.000000000 -0800
+++ linux-2.6.16-rc2/mm/mempolicy.c 2006-02-07 10:58:55.000000000 -0800
@@ -162,9 +162,9 @@ static struct mempolicy *mpol_new(int mo
return ERR_PTR(-ENOMEM);
atomic_set(&policy->refcnt, 1);

- policy->gfp_flags = 0;
+ policy->flags = 0;
if (!nodes_equal(*nodes, node_online_map))
- policy->gfp_flags |= __GFP_NO_OOM_KILLER;
+ policy->flags = 1;

switch (mode) {
case MPOL_INTERLEAVE:
@@ -1064,7 +1064,7 @@ static struct mempolicy * get_vma_policy
}

/* Return a zonelist representing a mempolicy */
-static struct zonelist *zonelist_policy(gfp_t gfp, struct mempolicy *policy)
+static struct zonelist *zonelist_policy(gfp_t *gfp, struct mempolicy *policy)
{
int nd;

@@ -1077,9 +1077,12 @@ static struct zonelist *zonelist_policy(
case MPOL_BIND:
/* Lower zones don't get a policy applied */
/* Careful: current->mems_allowed might have moved */
- if (gfp_zone(gfp) >= policy_zone)
- if (cpuset_zonelist_valid_mems_allowed(policy->v.zonelist))
+ if (gfp_zone(*gfp) >= policy_zone)
+ if (cpuset_zonelist_valid_mems_allowed(policy->v.zonelist)) {
+ if (policy->flags)
+ *gfp |= __GFP_NO_OOM_KILLER;
return policy->v.zonelist;
+ }
/*FALL THROUGH*/
case MPOL_INTERLEAVE: /* should not happen */
case MPOL_DEFAULT:
@@ -1089,7 +1092,7 @@ static struct zonelist *zonelist_policy(
nd = 0;
BUG();
}
- return NODE_DATA(nd)->node_zonelists + gfp_zone(gfp);
+ return NODE_DATA(nd)->node_zonelists + gfp_zone(*gfp);
}

/* Do dynamic interleaving for a process */
@@ -1168,14 +1171,15 @@ static inline unsigned interleave_nid(st
struct zonelist *huge_zonelist(struct vm_area_struct *vma, unsigned long addr)
{
struct mempolicy *pol = get_vma_policy(current, vma, addr);
-
+ gfp_t gfp = GFP_HIGHUSER;
+
if (pol->policy == MPOL_INTERLEAVE) {
unsigned nid;

nid = interleave_nid(pol, vma, addr, HPAGE_SHIFT);
- return NODE_DATA(nid)->node_zonelists + gfp_zone(GFP_HIGHUSER);
+ return NODE_DATA(nid)->node_zonelists + gfp_zone(gfp);
}
- return zonelist_policy(GFP_HIGHUSER, pol);
+ return zonelist_policy(&gfp, pol);
}

/* Allocate a page in interleaved policy.
@@ -1224,14 +1228,13 @@ alloc_page_vma(gfp_t gfp, struct vm_area

cpuset_update_task_memory_state();

- gfp |= pol->gfp_flags;
if (unlikely(pol->policy == MPOL_INTERLEAVE)) {
unsigned nid;

nid = interleave_nid(pol, vma, addr, PAGE_SHIFT);
return alloc_page_interleave(gfp, 0, nid);
}
- return __alloc_pages(gfp, 0, zonelist_policy(gfp, pol));
+ return __alloc_pages(gfp, 0, zonelist_policy(&gfp, pol));
}

/**
@@ -1261,10 +1264,9 @@ struct page *alloc_pages_current(gfp_t g
cpuset_update_task_memory_state();
if (!pol || in_interrupt())
pol = &default_policy;
- gfp |= pol->gfp_flags;
if (pol->policy == MPOL_INTERLEAVE)
return alloc_page_interleave(gfp, order, interleave_nodes(pol));
- return __alloc_pages(gfp, order, zonelist_policy(gfp, pol));
+ return __alloc_pages(gfp, order, zonelist_policy(&gfp, pol));
}
EXPORT_SYMBOL(alloc_pages_current);

@@ -1598,7 +1600,7 @@ void mpol_rebind_policy(struct mempolicy
return;

if (!nodes_equal(*newmask, node_online_map))
- pol->gfp_flags |= __GFP_NO_OOM_KILLER;
+ pol->flags = 1;

switch (pol->policy) {
case MPOL_DEFAULT:
Index: linux-2.6.16-rc2/include/linux/mempolicy.h
===================================================================
--- linux-2.6.16-rc2.orig/include/linux/mempolicy.h 2006-02-07 10:14:43.000000000 -0800
+++ linux-2.6.16-rc2/include/linux/mempolicy.h 2006-02-07 10:57:30.000000000 -0800
@@ -62,7 +62,7 @@ struct vm_area_struct;
struct mempolicy {
atomic_t refcnt;
short policy; /* See MPOL_* above */
- gfp_t gfp_flags; /* flags ORed into gfp_flags for each allocation */
+ short flags; /* 1 = Set __GFP_NO_OOM_KILLER on allocation for BIND */
union {
struct zonelist *zonelist; /* bind */
short preferred_node; /* preferred */

2006-02-09 23:08:34

by David Gibson

[permalink] [raw]
Subject: Re: OOM behavior in constrained memory situations

On Mon, Feb 06, 2006 at 04:03:53PM -0800, Christoph Lameter wrote:
> On Mon, 6 Feb 2006, Andrew Morton wrote:
>
> > The oom-killer is invoked from the page allocator. A hugetlb pagefault
> > won't use the page allocator. So there shouldn't be an oom-killing on
> > hugepage exhaustion.
>
> Right..... and the arch specific fault code (at least ia64) does not call
> the OOM killer.
>
> > I think this comment is just wrong:
> >
> > /* Logically this is OOM, not a SIGBUS, but an OOM
> > * could cause the kernel to go killing other
> > * processes which won't help the hugepage situation
> > * at all (?) */
> >
> > A VM_FAULT_OOM from there won't cause the oom-killer to do anything. We
> > should return VM_FAULT_OOM and let do_page_fault() commit suicide with
> > SIGKILL.
>
> Drop my patch that adds the comments explaining the bus error and add this
> fix instead. This will terminate an application with out of memory instead
> of a bus error and remove the comment that you mentioned.

Looks good, except I think a comment should go in there so the next
person to look at it doesn't make the same wrong assumption I did,
that returning VM_FAULT_OOM will trigger the OOM killer.

> Signed-off-by: Christoph Lameter <[email protected]>
>
> Index: linux-2.6.16-rc2/mm/hugetlb.c
> ===================================================================
> --- linux-2.6.16-rc2.orig/mm/hugetlb.c 2006-02-02 22:03:08.000000000 -0800
> +++ linux-2.6.16-rc2/mm/hugetlb.c 2006-02-06 16:02:53.000000000 -0800
> @@ -391,12 +391,7 @@ static int hugetlb_cow(struct mm_struct
>
> if (!new_page) {
> page_cache_release(old_page);
> -
> - /* Logically this is OOM, not a SIGBUS, but an OOM
> - * could cause the kernel to go killing other
> - * processes which won't help the hugepage situation
> - * at all (?) */
> - return VM_FAULT_SIGBUS;
> + return VM_FAULT_OOM;
> }
>
> spin_unlock(&mm->page_table_lock);
> @@ -444,6 +439,7 @@ retry:
> page = alloc_huge_page(vma, address);
> if (!page) {
> hugetlb_put_quota(mapping);
> + ret = VM_FAULT_OOM;
> goto out;
> }
>

--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson