2004-11-05 23:16:15

by Marcelo Tosatti

Subject: [PATCH] Remove OOM killer from try_to_free_pages / all_unreclaimable braindamage

Hi,

As you know, the OOM killer is very problematic in 2.6 right now, so I went
to investigate it.

Currently the oom killer is invoked from the task reclaim
code (try_to_free_pages), which IMO is fundamentally broken,
because it's non-deterministic: the chance that the OOM killer
will be triggered increases as the number of tasks inside
reclaim increases. And kswapd is freeing pages in parallel,
which is completely ignored by this approach.

In my opinion the correct approach is to trigger the OOM killer
when kswapd is unable to free pages. Once that is done, the number
of tasks inside page reclaim is irrelevant.

So the following patch moves the out_of_memory() call to
balance_pgdat(), and makes it conditional on having failed to reclaim
any pages when scanning work was actually done and priority has reached zero.

It also removes the "If it's been a long time since the last failure, don't
OOM kill" logic, which in my view just papers over a bigger issue.

Relying on this information (kswapd failure after DEF_PRIORITY passes)
to trigger the OOM killer seems to be very reliable - it needs some
more testing though.

With this in place I don't see spurious OOM kills; we just need to guarantee
that it reliably OOM kills when we are really out of memory.

While doing this, I noticed that kswapd will happily go to sleep
if all zones have all_unreclaimable set. I bet this is the reason
for the page allocation failures we are seeing. So the patch
also makes balance_pgdat() NOT return and go to "loop_again"
instead in case of page shortage - even if all_unreclaimable is set.

Basically the "loop_again" logic IS NOT WORKING!

Comments?

My wife is almost killing me: it's Friday night and I've been telling her
"just another minute" for hours. Have to run.

diff -Nur --show-c-function --exclude='*.orig' linux-2.6.10-rc1-mm2.orig/mm/oom_kill.c linux-2.6.10-rc1-mm2/mm/oom_kill.c
--- linux-2.6.10-rc1-mm2.orig/mm/oom_kill.c 2004-11-04 22:50:50.000000000 -0200
+++ linux-2.6.10-rc1-mm2/mm/oom_kill.c 2004-11-05 18:33:29.918459072 -0200
@@ -240,23 +240,23 @@ void out_of_memory(int gfp_mask)
* If it's been a long time since last failure,
* we're not oom.
*/
- if (since > 5*HZ)
- goto reset;
+ //if (since > 5*HZ)
+ // goto reset;

/*
* If we haven't tried for at least one second,
* we're not really oom.
*/
- since = now - first;
- if (since < HZ)
- goto out_unlock;
+ //since = now - first;
+ //if (since < HZ)
+ // goto out_unlock;

/*
* If we have gotten only a few failures,
* we're not really oom.
*/
- if (++count < 10)
- goto out_unlock;
+// if (++count < 10)
+// goto out_unlock;

/*
* If we just killed a process, wait a while
diff -Nur --show-c-function --exclude='*.orig' linux-2.6.10-rc1-mm2.orig/mm/vmscan.c linux-2.6.10-rc1-mm2/mm/vmscan.c
--- linux-2.6.10-rc1-mm2.orig/mm/vmscan.c 2004-11-04 22:50:50.000000000 -0200
+++ linux-2.6.10-rc1-mm2/mm/vmscan.c 2004-11-05 19:53:35.915836416 -0200
@@ -952,8 +952,6 @@ int try_to_free_pages(struct zone **zone
if (sc.nr_scanned && priority < DEF_PRIORITY - 2)
blk_congestion_wait(WRITE, HZ/10);
}
- if ((gfp_mask & __GFP_FS) && !(gfp_mask & __GFP_NORETRY))
- out_of_memory(gfp_mask);
out:
for (i = 0; zones[i] != 0; i++) {
struct zone *zone = zones[i];
@@ -997,13 +995,15 @@ static int balance_pgdat(pg_data_t *pgda
int all_zones_ok;
int priority;
int i;
- int total_scanned, total_reclaimed;
+ int total_scanned, total_reclaimed, worked;
struct reclaim_state *reclaim_state = current->reclaim_state;
struct scan_control sc;

+
loop_again:
total_scanned = 0;
total_reclaimed = 0;
+ worked = 0;
sc.gfp_mask = GFP_KERNEL;
sc.may_writepage = 0;
sc.nr_mapped = read_page_state(nr_mapped);
@@ -1033,6 +1033,10 @@ loop_again:
if (zone->present_pages == 0)
continue;

+ if (!zone_watermark_ok(zone, order,
+ zone->pages_high, 0, 0, 0))
+ all_zones_ok = 0;
+
if (zone->all_unreclaimable &&
priority != DEF_PRIORITY)
continue;
@@ -1072,6 +1076,9 @@ scan:
if (zone->all_unreclaimable && priority != DEF_PRIORITY)
continue;

+ if (priority == 0)
+ worked = 1;
+
if (nr_pages == 0) { /* Not software suspend */
if (!zone_watermark_ok(zone, order,
zone->pages_high, end_zone, 0, 0))
@@ -1129,6 +1136,9 @@ out:
zone->prev_priority = zone->temp_priority;
}
if (!all_zones_ok) {
+ if (priority == 0 && !total_reclaimed && worked)
+ out_of_memory(GFP_KERNEL);
+
cond_resched();
goto loop_again;
}


2004-11-05 23:36:44

by Jesse Barnes

Subject: Re: [PATCH] Remove OOM killer from try_to_free_pages / all_unreclaimable braindamage

On Friday, November 05, 2004 12:01 pm, Marcelo Tosatti wrote:
> Hi,
>
> As you know the OOM is very problematic in 2.6 right now - so I went
> to investigate it.
>
> Currently the oom killer is invoked from the task reclaim
> code (try_to_free_pages), which IMO is fundamentally broken,
> because its non deterministic - the chance the OOM killer
> will be triggered increases as the number of tasks inside
> reclaiming increases. And kswapd is freeing pages in parallel,
> which is completly ignored by this approach.
>
> In my opinion the correct approach is to trigger the OOM killer
> when kswapd is unable to free pages. Once that is done, the number
> of tasks inside page reclaim is irrelevant.

That makes sense.

> With this in place I can't see spurious OOM kills - just need to guarantee
> that it reliably OOM kills when we are really out of memory.

That's good. I can test it on a large machine (hopefully next week).

> Comments?

Sounds good, though we may want to do a couple more things: we shouldn't
kill root tasks quite so easily, and we should avoid zombies, since they may be
large apps in the process of exiting, and killing them would be bad (IIRC
it'll cause a panic).

Thanks,
Jesse


Attachments:
oom-fixes.patch (623.00 B)

2004-11-05 23:56:01

by Thomas Gleixner

Subject: Re: [PATCH] Remove OOM killer from try_to_free_pages / all_unreclaimable braindamage

On Fri, 2004-11-05 at 15:32 -0800, Jesse Barnes wrote:
> On Friday, November 05, 2004 12:01 pm, Marcelo Tosatti wrote:
> > Comments?
>
> Sounds good, though we may want to do a couple of more things, we shouldn't
> kill root tasks quite as easily and we should avoid zombies since they may be
> large apps in the process of exiting, and killing them would be bad (iirc
> it'll cause a panic).
>

Yep, it makes sense, but it still does not fix the selection problem,
where e.g. sshd is killed while an out-of-control forking server floods
the machine with child processes.

Patch to address this:
http://marc.theaimsgroup.com/?l=linux-kernel&m=109922680000746&w=2

tglx

2004-11-06 01:21:17

by Andrea Arcangeli

Subject: Re: [PATCH] Remove OOM killer from try_to_free_pages / all_unreclaimable braindamage

On Fri, Nov 05, 2004 at 03:32:50PM -0800, Jesse Barnes wrote:
> On Friday, November 05, 2004 12:01 pm, Marcelo Tosatti wrote:
> > In my opinion the correct approach is to trigger the OOM killer
> > when kswapd is unable to free pages. Once that is done, the number
> > of tasks inside page reclaim is irrelevant.
>
> That makes sense.

I don't like it. kswapd may fail balancing because a GFP_DMA
allocation ate the last DMA page, but we should not kill tasks if
we fail to balance in kswapd; we should kill tasks only when no fail
path exists (i.e. only during page faults; everything else in the kernel
has a fail path and should never trigger oom).

If you move it into kswapd there's no way to prevent oom-killing from a
syscall allocation (I guess even right now it would go wrong in this
sense, but at least right now it's more fixable). I want to move the oom
kill outside the alloc_page paths. The oom killing is all about page
faults not having a fail path, and in turn the oom killing should be
moved into the page fault code, not the allocator. Everything else
should keep returning -ENOMEM to the caller.

So to me moving the oom killer into kswapd looks like a regression.

2004-11-06 01:27:06

by Nick Piggin

Subject: Re: [PATCH] Remove OOM killer from try_to_free_pages / all_unreclaimable braindamage



Andrea Arcangeli wrote:

>On Fri, Nov 05, 2004 at 03:32:50PM -0800, Jesse Barnes wrote:
>
>>On Friday, November 05, 2004 12:01 pm, Marcelo Tosatti wrote:
>>
>>>In my opinion the correct approach is to trigger the OOM killer
>>>when kswapd is unable to free pages. Once that is done, the number
>>>of tasks inside page reclaim is irrelevant.
>>>
>>That makes sense.
>>
>
>I don't like it, kswapd may fail balancing because there's a GFP_DMA
>allocation that eat the last dma page, but we should not kill tasks if
>we fail to balance in kswapd, we should kill tasks only when no fail
>path exists (i.e. only during page faults, everything else in the kernel
>has a fail path and it should never trigger oom).
>
>If you move it in kswapd there's no way to prevent oom-killing from a
>syscall allocation (I guess even right now it would go wrong in this
>sense, but at least right now it's more fixable). I want to move the oom
>kill outside the alloc_page paths. The oom killing is all about the page
>faults not having a fail path, and in turn the oom killing should be
>moved in the page fault code, not in the allocator. Everything else
>should keep returning -ENOMEM to the caller.
>
>

Probably a good idea. OTOH, some kernel allocations might really
need to be performed and have no failure path. For example __GFP_REPEAT.

I think maybe __GFP_REPEAT allocations at least should be able to
cause an OOM. Not sure though.

>So to me moving the oom killer into kswapd looks a regression.
>
>
>

Also, I think it would do the wrong thing on NUMA machines, because
those have a per-node kswapd.

2004-11-06 01:37:48

by Jesse Barnes

Subject: Re: [PATCH] Remove OOM killer from try_to_free_pages / all_unreclaimable braindamage

On Friday, November 05, 2004 5:26 pm, Nick Piggin wrote:
> >If you move it in kswapd there's no way to prevent oom-killing from a
> >syscall allocation (I guess even right now it would go wrong in this
> >sense, but at least right now it's more fixable). I want to move the oom
> >kill outside the alloc_page paths. The oom killing is all about the page
> >faults not having a fail path, and in turn the oom killing should be
> >moved in the page fault code, not in the allocator. Everything else
> >should keep returning -ENOMEM to the caller.
>
> Probably a good idea. OTOH, some kernel allocations might really
> need to be performed and have no failure path. For example __GFP_REPEAT.

Ah, I see what you're saying, yes, that makes even more sense :)

> I think maybe __GFP_REPEAT allocations at least should be able to
> cause an OOM. Not sure though.
>
> >So to me moving the oom killer into kswapd looks a regression.
>
> Also, I think it would do the wrong thing on NUMA machines because
> that has a per-node kswapd.

Yep, Andrea's explanation is clear, I just had to read it a few times.
Anyway, the fixes I posted are still necessary, I think.

Thanks,
Jesse

2004-11-06 01:51:44

by Andrea Arcangeli

Subject: Re: [PATCH] Remove OOM killer from try_to_free_pages / all_unreclaimable braindamage

On Sat, Nov 06, 2004 at 12:26:57PM +1100, Nick Piggin wrote:
> need to be performed and have no failure path. For example __GFP_REPEAT.

all allocations should have a failure path to avoid deadlocks. But in
the meantime __GFP_REPEAT at least localizes the problematic places ;)

> I think maybe __GFP_REPEAT allocations at least should be able to
> cause an OOM. Not sure though.

probably it should, because this is also a case where no fail path exists.

My point was only that when a fail path exists, it's more reliable not
to invoke the oom killer and to let userspace handle the failure.

> Also, I think it would do the wrong thing on NUMA machines because
> that has a per-node kswapd.

yep.

2004-11-06 02:04:16

by Thomas Gleixner

Subject: Re: [PATCH] Remove OOM killer from try_to_free_pages / all_unreclaimable braindamage

On Sat, 2004-11-06 at 02:20 +0100, Andrea Arcangeli wrote:
> On Fri, Nov 05, 2004 at 03:32:50PM -0800, Jesse Barnes wrote:
> > On Friday, November 05, 2004 12:01 pm, Marcelo Tosatti wrote:
> > > In my opinion the correct approach is to trigger the OOM killer
> > > when kswapd is unable to free pages. Once that is done, the number
> > > of tasks inside page reclaim is irrelevant.
> >
> > That makes sense.
>
> I don't like it, kswapd may fail balancing because there's a GFP_DMA
> allocation that eat the last dma page, but we should not kill tasks if
> we fail to balance in kswapd, we should kill tasks only when no fail
> path exists (i.e. only during page faults, everything else in the kernel
> has a fail path and it should never trigger oom).
>
> If you move it in kswapd there's no way to prevent oom-killing from a
> syscall allocation (I guess even right now it would go wrong in this
> sense, but at least right now it's more fixable). I want to move the oom
> kill outside the alloc_page paths. The oom killing is all about the page
> faults not having a fail path, and in turn the oom killing should be
> moved in the page fault code, not in the allocator. Everything else
> should keep returning -ENOMEM to the caller.
>
> So to me moving the oom killer into kswapd looks a regression.

My point is not where the oom-killer is triggered. My point is the
decision criterion of the oom-killer, when it is finally invoked, for
which process to kill. That's kind of independent of your patch. Your
patch corrects the context in which the oom-killer is called. My concern
is that the criterion for which process should be killed is not
sufficient. In my case it kills sshd instead of a process which forks a
bunch of child processes. That's just wrong, because it takes away the
chance to log into the machine remotely and fix the problem.

tglx


2004-11-06 09:48:20

by Hugh Dickins

Subject: Re: [PATCH] Remove OOM killer from try_to_free_pages / all_unreclaimable braindamage

On Sat, 6 Nov 2004, Andrea Arcangeli wrote:
>
> all allocations should have a failure path to avoid deadlocks. But in
> the meantime __GFP_REPEAT is at least localizing the problematic places ;)

Problematic, yes: don't overlook that GFP_REPEAT and GFP_NOFAIL _can_
fail, returning NULL: when the process is being OOM-killed (PF_MEMDIE).

Hugh

2004-11-06 10:53:10

by Nick Piggin

Subject: Re: [PATCH] Remove OOM killer from try_to_free_pages / all_unreclaimable braindamage



Hugh Dickins wrote:

>On Sat, 6 Nov 2004, Andrea Arcangeli wrote:
>
>>all allocations should have a failure path to avoid deadlocks. But in
>>the meantime __GFP_REPEAT is at least localizing the problematic places ;)
>>
>
>Problematic, yes: don't overlook that GFP_REPEAT and GFP_NOFAIL _can_
>fail, returning NULL: when the process is being OOM-killed (PF_MEMDIE).
>
>

Yeah, right you are. I think NOFAIL is a bug and should really not fail.
It looks like it is only used in fs/jbd/*, and things will crash if it
fails. Maybe those allocations are only done from the kjournald threads,
which can't be OOM killed, but that is still a pretty subtle dependency.

2004-11-06 11:37:23

by Nikita Danilov

Subject: Re: [PATCH] Remove OOM killer from try_to_free_pages / all_unreclaimable braindamage

Andrea Arcangeli writes:
> On Sat, Nov 06, 2004 at 12:26:57PM +1100, Nick Piggin wrote:
> > need to be performed and have no failure path. For example __GFP_REPEAT.
>
> all allocations should have a failure path to avoid deadlocks. But in

This is not currently possible for a complex operation that allocates
multiple pages and always has to complete as a whole.

We need a page-reservation API of some sort. There have been several
attempts to introduce one, but none got into mainline.

Nikita.

2004-11-06 12:53:38

by Andries Brouwer

Subject: Re: [PATCH] Remove OOM killer ...

On Fri, Nov 05, 2004 at 06:01:18PM -0200, Marcelo Tosatti wrote:

> My wife is almost killing me, its Friday night and I've been telling her
> "just another minute" for hours. Have to run.

:-)

> As you know the OOM is very problematic in 2.6 right now - so I went
> to investigate it.

I have always been surprised that so few people have investigated
doing things right, that is, entirely without the OOM killer.
Apparently developers do not think about using Linux for serious work,
where it can be a disaster, possibly even a life-threatening disaster,
when any process can be killed at any time.

Ten years ago it was a bad waste of resources to have swap space
lying around that would be used essentially 0% of the time.
But with today's disk sizes it is entirely feasible to have
a few hundred MB of "unused" swap space. A small price to
pay for the guarantee that no process will be OOM killed.

A month ago I showed a patch that made overcommit mode 2
work for me. Google finds it in http://lwn.net/Articles/104959/

So far, nobody commented.

This is not in a state such that I would like to submit it,
but I think it would be good to focus some energy into
offering a Linux that is guaranteed free of OOM surprises.

So, let me repeat the RFC.
Apply the above patch, and do "echo 2 > /proc/sys/vm/overcommit_memory".
Now test. In case you have no, or only a small amount of swap space,
also do "echo 80 > /proc/sys/vm/overcommit_ratio" or so.

Andries

2004-11-06 13:18:05

by Marcelo Tosatti

Subject: Re: [PATCH] Remove OOM killer from try_to_free_pages / all_unreclaimable braindamage

On Sat, Nov 06, 2004 at 02:20:18AM +0100, Andrea Arcangeli wrote:
> On Fri, Nov 05, 2004 at 03:32:50PM -0800, Jesse Barnes wrote:
> > On Friday, November 05, 2004 12:01 pm, Marcelo Tosatti wrote:
> > > In my opinion the correct approach is to trigger the OOM killer
> > > when kswapd is unable to free pages. Once that is done, the number
> > > of tasks inside page reclaim is irrelevant.
> >
> > That makes sense.

Hi Andrea,

> I don't like it, kswapd may fail balancing because there's a GFP_DMA
> allocation that eat the last dma page, but we should not kill tasks if
> we fail to balance in kswapd, we should kill tasks only when no fail
> path exists (i.e. only during page faults, everything else in the kernel
> has a fail path and it should never trigger oom).

The OOM killer is only going to be triggered if kswapd is not able
to make _any_ progress in any zone. So it won't "fail balancing because
there's a GFP_DMA allocation that ate the last DMA page".

As long as it frees _one_ page during all the passes from DEF_PRIORITY
down to priority=0, it won't kill any task. See?

I don't get your point.

> If you move it in kswapd there's no way to prevent oom-killing from a
> syscall allocation (I guess even right now it would go wrong in this
> sense, but at least right now it's more fixable).

I don't understand what you mean by "prevent oom-killing from a syscall allocation".

> I want to move the oom
> kill outside the alloc_page paths. The oom killing is all about the page
> faults not having a fail path, and in turn the oom killing should be
> moved in the page fault code, not in the allocator. Everything else
> should keep returning -ENOMEM to the caller.

Isn't OOM killing all about the reclaiming efforts not being able to make progress?

> So to me moving the oom killer into kswapd looks a regression.

To me having tasks trigger the OOM kill is fundamentally broken,
because it doesn't take into account the kswapd page freeing
efforts which are in progress at that very moment.

That makes a lot of sense to me - I would love to be proved
wrong.

See, it's completely screwed right now. The code inside out_of_memory(),
which only triggers OOM if it has happened several times during the
past few seconds, is horrible and shows how bad things are.

2004-11-06 13:23:53

by Marcelo Tosatti

Subject: Re: [PATCH] Remove OOM killer from try_to_free_pages / all_unreclaimable braindamage

On Sat, Nov 06, 2004 at 12:26:57PM +1100, Nick Piggin wrote:
>
>
> Andrea Arcangeli wrote:
>
> >On Fri, Nov 05, 2004 at 03:32:50PM -0800, Jesse Barnes wrote:
> >
> >>On Friday, November 05, 2004 12:01 pm, Marcelo Tosatti wrote:
> >>
> >>>In my opinion the correct approach is to trigger the OOM killer
> >>>when kswapd is unable to free pages. Once that is done, the number
> >>>of tasks inside page reclaim is irrelevant.
> >>>
> >>That makes sense.
> >>
> >
> >I don't like it, kswapd may fail balancing because there's a GFP_DMA
> >allocation that eat the last dma page, but we should not kill tasks if
> >we fail to balance in kswapd, we should kill tasks only when no fail
> >path exists (i.e. only during page faults, everything else in the kernel
> >has a fail path and it should never trigger oom).
> >
> >If you move it in kswapd there's no way to prevent oom-killing from a
> >syscall allocation (I guess even right now it would go wrong in this
> >sense, but at least right now it's more fixable). I want to move the oom
> >kill outside the alloc_page paths. The oom killing is all about the page
> >faults not having a fail path, and in turn the oom killing should be
> >moved in the page fault code, not in the allocator. Everything else
> >should keep returning -ENOMEM to the caller.
> >
> >
>
> Probably a good idea. OTOH, some kernel allocations might really
> need to be performed and have no failure path. For example __GFP_REPEAT.
>
> I think maybe __GFP_REPEAT allocations at least should be able to
> cause an OOM. Not sure though.

As I said in my answer to Andrea, OOM killing is about allocations not
being able to succeed (i.e. the VM not being able to free pages).

kswapd is freeing pages; it is the one that knows about progress.

> >So to me moving the oom killer into kswapd looks a regression.
>
> Also, I think it would do the wrong thing on NUMA machines because
> that has a per-node kswapd.

Right, we need to handle the NUMA case correctly (we need a check which
does "don't kill if other nodes have available memory").

But still, it looks like the right thing to do to me.

2004-11-06 13:41:43

by Marcelo Tosatti

Subject: Re: [PATCH] Remove OOM killer from try_to_free_pages / all_unreclaimable braindamage

On Sat, Nov 06, 2004 at 02:55:50AM +0100, Thomas Gleixner wrote:
> On Sat, 2004-11-06 at 02:20 +0100, Andrea Arcangeli wrote:
> > On Fri, Nov 05, 2004 at 03:32:50PM -0800, Jesse Barnes wrote:
> > > On Friday, November 05, 2004 12:01 pm, Marcelo Tosatti wrote:
> > > > In my opinion the correct approach is to trigger the OOM killer
> > > > when kswapd is unable to free pages. Once that is done, the number
> > > > of tasks inside page reclaim is irrelevant.
> > >
> > > That makes sense.
> >
> > I don't like it, kswapd may fail balancing because there's a GFP_DMA
> > allocation that eat the last dma page, but we should not kill tasks if
> > we fail to balance in kswapd, we should kill tasks only when no fail
> > path exists (i.e. only during page faults, everything else in the kernel
> > has a fail path and it should never trigger oom).
> >
> > If you move it in kswapd there's no way to prevent oom-killing from a
> > syscall allocation (I guess even right now it would go wrong in this
> > sense, but at least right now it's more fixable). I want to move the oom
> > kill outside the alloc_page paths. The oom killing is all about the page
> > faults not having a fail path, and in turn the oom killing should be
> > moved in the page fault code, not in the allocator. Everything else
> > should keep returning -ENOMEM to the caller.
> >
> > So to me moving the oom killer into kswapd looks a regression.
>
> My point is not where oom-killer is triggered. My point is the decision
> criteria of oom-killer, when it is finally invoked, which process to
> kill. That's kind of independend of your patch. Your patch corrects the
> context in which oom-killer is called. My concern is that the decision
> critrion which process should be killed is not sufficient. In my case it
> kills sshd instead of a process which forks a bunch of child processes.
> Thats just wrong, because it takes away the chance to log into the
> machine remotely and fix the problem.

Hi Thomas,

Yes, your patches are correct and needed independently of where the OOM
killer is triggered from.

2004-11-06 13:54:37

by Marcelo Tosatti

Subject: Re: [PATCH] Remove OOM killer ...


Hi Andries,

On Sat, Nov 06, 2004 at 01:53:17PM +0100, Andries Brouwer wrote:
> On Fri, Nov 05, 2004 at 06:01:18PM -0200, Marcelo Tosatti wrote:
>
> > My wife is almost killing me, its Friday night and I've been telling her
> > "just another minute" for hours. Have to run.
>
> :-)
>
> > As you know the OOM is very problematic in 2.6 right now - so I went
> > to investigate it.
>
> I have always been surprised that so few people investigated
> doing things right, that is, entirely without OOM killer.
> Apparently developers do not think about using Linux for serious work
> where it can be a disaster, possibly even a life-threatening disaster,
> when any process can be killed at any time.

It's just that the majority of users use total overcommit (the default),
but you have a point.

> Ten years ago it was a bad waste of resources to have swapspace
> lying around that would be used essentially 0% of the time.
> But with todays disk sizes it is entirely feasible to have
> a few hundred MB of "unused" swap space. A small price to
> pay for the guarantee that no process will be OOM killed.
>
> A month ago I showed a patch that made overcommit mode 2
> work for me. Google finds it in http://lwn.net/Articles/104959/
>
> So far, nobody commented.
>
> This is not in a state such that I would like to submit it,
> but I think it would be good to focus some energy into
> offering a Linux that is guaranteed free of OOM surprises.

I don't have any useful comments on the patch from a quick look at it -
but yes, non-overcommit should work correctly.

> So, let me repeat the RFC.
> Apply the above patch, and do "echo 2 > /proc/sys/vm/overcommit_memory".
> Now test. In case you have no, or only a small amount of swap space,
> also do "echo 80 > /proc/sys/vm/overcommit_ratio" or so.

Will test your patch later this weekend and take a slower look
at it, hopefully with useful comments.

2004-11-06 15:30:04

by Andrea Arcangeli

Subject: Re: [PATCH] Remove OOM killer from try_to_free_pages / all_unreclaimable braindamage

On Sat, Nov 06, 2004 at 09:47:56AM +0000, Hugh Dickins wrote:
> Problematic, yes: don't overlook that GFP_REPEAT and GFP_NOFAIL _can_
> fail, returning NULL: when the process is being OOM-killed (PF_MEMDIE).

that looks weird - why is that? The oom killer must be robust against a
task not going away regardless of this (the task can be stuck in nfs or
similar). If a fail path ever existed, __GFP_NOFAIL should not have been
used in the first place. I don't see many valid excuses to use
__GFP_NOFAIL if we can return NULL without the caller running into an
infinite loop.

btw, PF_MEMDIE has always been racy in the way it's set, so it can
corrupt p->flags, but the race window is very hard to trigger
(and even if it triggers, it probably wouldn't be fatal). That's why I
don't use PF_MEMDIE in 2.4-aa.

2004-11-06 15:33:21

by Andrea Arcangeli

Subject: Re: [PATCH] Remove OOM killer from try_to_free_pages / all_unreclaimable braindamage

On Sat, Nov 06, 2004 at 02:37:05PM +0300, Nikita Danilov wrote:
> We need page-reservation API of some sort. There were several attempts
> to introduce this, but none get into mainline.

they're already in, under the name of mempools.

I'm perfectly aware the fs tends to be the least correct place in terms
of allocations, and luckily it's not a heavy memory user, so I have yet
to see a deadlock in getblk or create_buffers or similar. It's
mostly a correctness issue (there is no mathematical proof that it can't
deadlock; right now it can, if several tasks all get stuck in getblk at
the same time during a hard oom condition etc.).

2004-11-06 15:33:52

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [PATCH] Remove OOM killer from try_to_free_pages / all_unreclaimable braindamage

On Sat, Nov 06, 2004 at 09:53:00PM +1100, Nick Piggin wrote:
> Yeah right you are. I think NOFAIL is a bug and should really not fail.

agreed.

2004-11-06 15:45:16

by Andrea Arcangeli

Subject: Re: [PATCH] Remove OOM killer from try_to_free_pages / all_unreclaimable braindamage

Hi Marcelo,

On Sat, Nov 06, 2004 at 08:05:16AM -0200, Marcelo Tosatti wrote:
> The OOM killer is only going to get triggered if kswapd is not able
> to make _any_ progress in all zones. So it wont "fail balancing because there's
> a GFP_DMA allocation that eat the last dma page".
>
> As long as frees _one_ page during all passes from DEF_PRIORITY till priority=0,
> it wont kill any task. See?

It's still wrong on NUMA: the machine isn't oom even though kswapd
couldn't free any pages (the local node will fall back to the other
nodes instead).

> > If you move it in kswapd there's no way to prevent oom-killing from a
> > syscall allocation (I guess even right now it would go wrong in this
> > sense, but at least right now it's more fixable).
>
> I dont understand what you mean. "prevent oom-killing from a syscall allocation" ?

yes. oom killing should be avoided as far as we can avoid it. Ideally we
should never invoke the oom killer and we should always return -ENOMEM
to applications. If a syscall runs oom then we can return -ENOMEM and
handle the failure gracefully instead of getting a sigkill.

With 2.4 -ENOMEM is returned and the machine doesn't deadlock when the
zone normal is full and that works fine.

> Isnt OOM killing all about the reclaiming efforts not being able to make progress?

it's invoked when we're not able to make progress and no fail path
exists.

> To me having tasks trigger the OOM kill is fundamentally broken
> because it doesnt take into account kswapd page freeing
> efforts which are in-progress at the very moment.

kswapd's page freeing efforts are not very useful here. kswapd is a
helper; it's not the thing that can or should guarantee that allocations
succeed.

The rule is that if you want to allocate 1 page, you have to free the
page yourself. Then if kswapd frees a page too, that's welcome. But keep
in mind also that kswapd may be running on another cpu, and it will put
the pages back into the per-cpu queue of that other cpu. So you should
really free a page yourself to be guaranteed to find that page later on.

kswapd is more for keeping the balance between the low and high
watermarks, so that we never block freeing memory, and for keeping the
disk running.

> See, its completly screwed right now. The code inside out_of_memory()
> which only triggers OOM if it has happened several times during the
> past few seconds is horrible and shows how bad it is.

that's very bad indeed. But anything happening inside out_of_memory has
nothing to do with what we discussed above like Thomas Gleixner pointed
out yesterday.

these are two different things:

1) choose when we need to invoke out_of_memory
2) choose what to do inside out_of_memory

I definitely agree that the 5 sec waiting is a hack; to me even
blk_congestion_wait looks like a hack.

2004-11-06 15:53:29

by Arjan van de Ven

Subject: Re: [PATCH] Remove OOM killer from try_to_free_pages / all_unreclaimable braindamage


> yes. oom killing should be avoided as far as we can avoid it. Ideally we
> should never invoke the oom killer and we should always return -ENOMEM
> to applications. If a syscall runs oom then we can return -ENOMEM and
> handle the failure gracefully instead of getting a sigkill.
>
> With 2.4 -ENOMEM is returned and the machine doesn't deadlock when the
> zone normal is full and that works fine.

the harder case is where you do an mmap and then, in the fault path, find out that there's no memory to allocate the PMD ...
killing the task that hits that failure isn't per se the right answer.

2004-11-06 16:21:58

by Hugh Dickins

Subject: Re: [PATCH] Remove OOM killer from try_to_free_pages / all_unreclaimable braindamage

On Sat, 6 Nov 2004, Andrea Arcangeli wrote:
> On Sat, Nov 06, 2004 at 09:47:56AM +0000, Hugh Dickins wrote:
> > Problematic, yes: don't overlook that GFP_REPEAT and GFP_NOFAIL _can_
> > fail, returning NULL: when the process is being OOM-killed (PF_MEMDIE).
>
> that looks weird, why that? The oom killer must be robust against a task
> not going anyway regardless of this (task can be stuck in nfs or
> similar).

Oh, sure, it is, that's not the problem.

> If a fail path ever existed, __GFP_NOFAIL should not have been
> used in the first place. I don't see many valid excuses to use
> __GFP_NOFAIL if we can return NULL without the caller running into an
> infinite loop.

I took exception to the misleadingness of the name GFP_NOFAIL, and did
send Andrew a patch to remove it once upon a time, but he didn't bite.

Your view, that it's better to hang repeating indefinitely than ever
return a NULL when caller said not to, is probably the better view.

> btw, PF_MEMDIE has always been racy in the way it's being set, so it can
> corrupt the p->flags, but the race window is very small to trigger it
> (and even if it triggers, it probably wouldn't be fatal). That's why I
> don't use PF_MEMDIE in 2.4-aa.

I expect so, yes, the PF_ flags don't have proper locking. Those
places which set or clear PF_MEMALLOC are more likely to hit races,
but last time I went there I didn't think there was a really serious problem.

Hugh

2004-11-06 16:54:45

by Nikita Danilov

Subject: Re: [PATCH] Remove OOM killer from try_to_free_pages / all_unreclaimable braindamage

Andrea Arcangeli writes:
> On Sat, Nov 06, 2004 at 02:37:05PM +0300, Nikita Danilov wrote:
> > We need page-reservation API of some sort. There were several attempts
> > to introduce this, but none get into mainline.
>
> they're already in under the name of mempools

I am talking about a slightly different thing. Think of some operation
that calls find_or_create_page(). find_or_create_page() doesn't know
about memory reserved in mempools; it uses alloc_page() directly. If one
wants to guarantee that a compound operation has enough memory to
complete, memory should be reserved at the lowest level, in the page
allocator.

>
> I'm perfectly aware the fs tends to be the less correct places in terms
> of allocations, and luckily it's not an heavy memory user, so I still

Either you are kidding, or we are facing very different workloads. In
the world of file-system development, the file system is (not
surprisingly) the single largest memory consumer.

> have to see a deadlock in getblk or create_buffers or similar. It's
> mostly a correctness issue (math proof it can't deadlock, right now it
> can if more tasks all get stuck in getblk at the same time during a hard
> oom condition etc..).

Add to that mmap, which can dirty all physical memory behind your back,
and delayed disk block allocation, which forces ->writepage() to allocate
a potentially huge extent when memory is already tight, and the hope of
having a proof becomes quite remote.

Nikita.

2004-11-06 17:45:46

by Andrea Arcangeli

Subject: Re: [PATCH] Remove OOM killer from try_to_free_pages / all_unreclaimable braindamage

On Sat, Nov 06, 2004 at 07:54:12PM +0300, Nikita Danilov wrote:
> Andrea Arcangeli writes:
> > On Sat, Nov 06, 2004 at 02:37:05PM +0300, Nikita Danilov wrote:
> > > We need page-reservation API of some sort. There were several attempts
> > > to introduce this, but none get into mainline.
> >
> > they're already in under the name of mempools
>
> I am talking about slightly different thing. Think of some operation
> that calls find_or_create_page(). find_or_create_page() doesn't know
> about memory reserved in mempools, it uses alloc_page() directly. If one
> wants to guarantee that compound operation has enough memory to
> complete, memory should be reserved at the lowest level---in the page
> allocator.

The page allocator only reserves memory in order to swap out; that's
PF_MEMALLOC.

For other purposes not related to swapping (which is not a deterministic
thing, given there can be multiple layers of I/O and fs operations to
do), you should use a mempool and change find_or_create_page() to take
your reserved page as a parameter.
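A userspace sketch of what "take the reserved page as a parameter" could
look like; find_or_create_page_reserved() and the tiny cache here are
invented for illustration, not the real pagecache API:

```c
#include <assert.h>
#include <stddef.h>

#define NSLOTS 4
static int  cache_index[NSLOTS] = { -1, -1, -1, -1 };  /* -1 = empty */
static int *cache_page[NSLOTS];

/* Hypothetical variant of find_or_create_page(): on a miss it consumes
 * a page the caller already reserved (e.g. from a mempool) instead of
 * calling the allocator, which could fail under oom. */
static int *find_or_create_page_reserved(int index, int **reserved)
{
    int free_slot = -1;

    for (int i = 0; i < NSLOTS; i++) {
        if (cache_index[i] == index)
            return cache_page[i];        /* hit: reservation untouched */
        if (cache_index[i] < 0 && free_slot < 0)
            free_slot = i;
    }
    if (free_slot < 0 || !*reserved)
        return NULL;                     /* cache full or no reserve */
    cache_index[free_slot] = index;
    cache_page[free_slot] = *reserved;   /* miss: consume the reserve */
    *reserved = NULL;
    return cache_page[free_slot];
}
```

The caller draws the page from its own pool up front, so the lookup
itself can never fail for lack of memory.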

> > I'm perfectly aware the fs tends to be the less correct places in terms
> > of allocations, and luckily it's not an heavy memory user, so I still
>
> Either you are kidding, or we are facing very different workloads. In
> the world of file-system development, file-system is (not surprisingly)
> single largest memory consumer.

When the machine runs oom, the fs allocations mean nothing. When the
machine runs oom, it's because somebody entered a malloc loop or
something like that; all allocations come from page faults.

Try running your box oom with getblk allocations yourself. Only then
will you run into the deadlock.

You have to keep in mind that an oom condition happens once in a while,
and when it happens the userspace memory allocation load is huge
compared to any fs operation.

> > have to see a deadlock in getblk or create_buffers or similar. It's
> > mostly a correctness issue (math proof it can't deadlock, right now it
> > can if more tasks all get stuck in getblk at the same time during a hard
> > oom condition etc..).
>
> Add here mmap that can dirty all physical memory behind your back, and
> delayed disk block allocation that forces ->writepage() to allocate
> potentially huge extent when memory is already tight and hope of having
> a proof becomes quite remote.

That's the PF_MEMALLOC path. A reservation already exists, or it would
never have worked since 2.2. PF_MEMALLOC and the min/2 watermark are
meant to allow writepage to allocate ram. However, the amount reserved
is limited, so it's not perfect. The only way to make it perfect, I
believe, is to reserve the stuff inside the fs with mempools as
described above.

2004-11-06 19:25:42

by Nikita Danilov

Subject: Re: [PATCH] Remove OOM killer from try_to_free_pages / all_unreclaimable braindamage

Andrea Arcangeli writes:
> On Sat, Nov 06, 2004 at 07:54:12PM +0300, Nikita Danilov wrote:
> > Andrea Arcangeli writes:
> > > On Sat, Nov 06, 2004 at 02:37:05PM +0300, Nikita Danilov wrote:
> > > > We need page-reservation API of some sort. There were several attempts
> > > > to introduce this, but none get into mainline.
> > >
> > > they're already in under the name of mempools
> >
> > I am talking about slightly different thing. Think of some operation
> > that calls find_or_create_page(). find_or_create_page() doesn't know
> > about memory reserved in mempools, it uses alloc_page() directly. If one
> > wants to guarantee that compound operation has enough memory to
> > complete, memory should be reserved at the lowest level---in the page
> > allocator.
>
> the page allocator reserve only memory in order to swapout, that's
> PF_MEMALLOC.
>
> For other purposes not related to swapping (which is not a deterministic
> thing, given there can be multiple layers of I/O and fs operations to
> do), you should use mempool and change find_or_create_page to get your
> reserved page as parameter.

This means breaking all layering and passing a mempool pointer all the
way down to the lowest-layer allocators (like bio and drivers). The only
practical way to do this is to put a mempool pointer into the current
task_struct. At which point it's no different from having a per-thread
list of pages that __alloc_pages() looks into before falling back to the
per-cpu page-sets and the buddy. _Except_ in the latter case, the
reservation is handled transparently in __alloc_pages(), and code
needn't be adjusted to check for a mempool in a zillion places.
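The alternative Nikita describes might look roughly like this userspace
model (struct task, the numbers, and the function names are all
invented): the allocator itself checks the caller's reserved list before
giving up, so no mempool pointer needs to be threaded through the layers.

```c
#include <assert.h>

/* Toy model of a per-thread page reservation that __alloc_pages()-like
 * code consults transparently.  Illustrative only, not kernel code. */
struct task { int reserved[4]; int nreserved; };

static int shared_pool;                  /* pages in buddy/per-cpu sets */

static int alloc_page_for(struct task *cur)
{
    if (shared_pool > 0) {               /* normal path first */
        shared_pool--;
        return 1;                        /* some ordinary page */
    }
    if (cur->nreserved > 0)              /* transparent fallback */
        return cur->reserved[--cur->nreserved];
    return -1;                           /* genuine allocation failure */
}
```

Callers simply call the allocator as usual; only threads that set up a
reservation beforehand get the fallback.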

>
> > > I'm perfectly aware the fs tends to be the less correct places in terms
> > > of allocations, and luckily it's not an heavy memory user, so I still
> >
> > Either you are kidding, or we are facing very different workloads. In
> > the world of file-system development, file-system is (not surprisingly)
> > single largest memory consumer.
>
> when the machine runs oom the fs allocations means nothing. when the
> machine runs oom is because somebody entered the malloc loop or
> something like that. all allocations come from page faults.
>
> Try yourself to run your box oom with getblk allocations. Only then
> you'll run into the deadlock.
>
> You've to keep in mind an oom condition happens once in a while, and
> when it happens the userspace memory allocation load is huge compared to
> any fs operation.

I think you are confusing "file system" and "ext2". I definitely know
from experience that with some file system types, the system can be
oommed without any significant user-level allocation activity. Now, one
can say either that such file systems are broken, or that the Linux MM
lacks support for features (like reservation) they need.

>
> > > have to see a deadlock in getblk or create_buffers or similar. It's
> > > mostly a correctness issue (math proof it can't deadlock, right now it
> > > can if more tasks all get stuck in getblk at the same time during a hard
> > > oom condition etc..).
> >
> > Add here mmap that can dirty all physical memory behind your back, and
> > delayed disk block allocation that forces ->writepage() to allocate
> > potentially huge extent when memory is already tight and hope of having
> > a proof becomes quite remote.
>
> that's the PF_MEMALLOC path. A reservation already exists, or it would
> never work since 2.2. PF_MEMALLOC and the min/2 watermark are meant to
> allow writepage to allocate ram. however the amount reserved is limited,

The low-mem watermark is mostly useless in the face of direct reclaim,
when an unbounded number of threads enter try_to_free_pages() and call
->writepage() simultaneously.

> so it's not perfect. The only way to make it perfect I believe is to
> reserve the stuff inside the fs with mempools as described above.

I don't see what advantages mempools have over page reservation handled
directly by page allocator, like in

ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.9-rc4/2.6.9-rc4-mm1/broken-out/reiser4-perthread-pages.patch

Nikita.

2004-11-06 20:23:22

by Marcelo Tosatti

Subject: Re: [PATCH] Remove OOM killer from try_to_free_pages / all_unreclaimable braindamage

On Sat, Nov 06, 2004 at 04:44:15PM +0100, Andrea Arcangeli wrote:
> Hi Marcelo,

Hi again!

> On Sat, Nov 06, 2004 at 08:05:16AM -0200, Marcelo Tosatti wrote:
> > The OOM killer is only going to get triggered if kswapd is not able
> > to make _any_ progress in all zones. So it wont "fail balancing because there's
> > a GFP_DMA allocation that eat the last dma page".
> >
> > As long as frees _one_ page during all passes from DEF_PRIORITY till priority=0,
> > it wont kill any task. See?
>
> It's still wrong on numa. the machine isn't oom despite kswapd couldn't
> free any page (the local node will fallback in the other nodes instead)

Sure NUMA can and has to be special cased, as I answered Nick.

"dont kill if we can allocate from other nodes", should be pretty simple.

> > > If you move it in kswapd there's no way to prevent oom-killing from a
> > > syscall allocation (I guess even right now it would go wrong in this
> > > sense, but at least right now it's more fixable).
> >
> > I dont understand what you mean. "prevent oom-killing from a syscall allocation" ?
>
> yes. oom killing should be avoided as far as we can avoid it. Ideally we
> should never invoke the oom killer and we should always return -ENOMEM
> to applications. If a syscall runs oom then we can return -ENOMEM and
> handle the failure gracefully instead of getting a sigkill.
>
> With 2.4 -ENOMEM is returned and the machine doesn't deadlock when the
> zone normal is full and that works fine.

I agree with you here. But then there are the cases in which you can't
return -ENOMEM: the fault paths you have mentioned, the PMD/PTE
allocation on mmap mentioned by Arjan, and probably others.

We should be returning -ENOMEM to syscalls right now in v2.6, but that's
not the problem here. The problem is the page faults.

If v2.6 is failing to return -ENOMEM to syscalls then it's indeed screwed,
but that's not the same problem.

Have you done any tests in this respect?

> > Isnt OOM killing all about the reclaiming efforts not being able to make progress?
>
> it's invoked when we're not able to make progress and no fail path
> exists.

The system will, in the vast majority of cases, be OOM due to page faults
which can't be handled (anonymous memory mappings created by brk/sbrk) anyway.

So "not being able to make progress freeing pages" seems to be reliable
information on whether to trigger OOM. Note that reaching priority
0 means we've tried VERY VERY hard already.

> > To me having tasks trigger the OOM kill is fundamentally broken
> > because it doesnt take into account kswapd page freeing
> > efforts which are in-progress at the very moment.
>
> kswapd page freeing efforts are not very useful. kswapd is an helper,
> it's not the thing that can or should guarantee allocations to succeed.

Oh wait, kswapd's job is to guarantee that allocations succeed. We used
to wait on kswapd back in v2.3 VM development - then we switched to
task-goes-to-memory-reclaim for _performance_ reasons (parallelism).

My point here is, kswapd is the entity responsible for freeing pages.

Triggering the OOM killer from inside a task context (whether it's from
the alloc_pages path or the fault path is irrelevant here) is WRONG
because at the same time kswapd, which is the main entity freeing pages,
is also running the memory reclaim code - it might just have freed a
bunch of pages, but we have no way of knowing that from normal task context.

> The rule is that if you want to allocate 1 page, you've to free the page
> yourself. Then if kswapd frees a page too, that's welcome. But keep also
> in mind kswapd may be running in another cpu, and it will put the pages
> back into the per-cpu queue of the other cpu.

Exactly - another reason for _NOT_ triggering the OOM killer from task
context: pages which have been freed might be in a per-CPU queue (but a
task running on another CPU can't see them).

We should be flushing the per-cpu queues quite often under these circumstances.

> So you should really free
> a page yourself to be guaranteed to find that page later on.
>
> kswapd is more for keeping the balance between low and high so that we
> never block freeing memory, and to keep the disk running.
>
> > See, its completly screwed right now. The code inside out_of_memory()
> > which only triggers OOM if it has happened several times during the
> > past few seconds is horrible and shows how bad it is.
>
> that's very bad indeed. But anything happening inside out_of_memory has
> nothing to do with what we discussed above like Thomas Gleixner pointed
> out yesterday.
>
> these are two different things:
>
> 1) choose when we need to invoke out_of_memory
> 2) choose what to do inside out_of_memory

OK - agreed.

> I definitely agree about the 5 sec waiting being an hack, to me even
> blk_congest_wait looks an hack.

Why's that? blk_congestion_wait looks clean and correct to me - if the
queue is full, don't bother queueing more pages to this device.

OK - so you seem to be agreeing with me that triggering the OOM killer
from kswapd context is the correct thing to do now?

2004-11-07 03:01:52

by Andrea Arcangeli

Subject: Re: [PATCH] Remove OOM killer from try_to_free_pages / all_unreclaimable braindamage

On Sat, Nov 06, 2004 at 03:09:25PM -0200, Marcelo Tosatti wrote:
> Sure NUMA can and has to be special cased, as I answered Nick.
>
> "dont kill if we can allocate from other nodes", should be pretty simple.

then how do you call the oom killer if some highmem page is still free
but there's a GFP_KERNEL|__GFP_NOFAIL allocation failing and looping
forever?

> If v2.6 is failing to return -ENOMEM to syscalls then its indeed screwed,
> but its not the same problem.
>
> Have you done any tests on this respect?

The only tests I have are based on 2.6.5, and the box simply deadlocks
there; the oom killer is forbidden there if nr_swap_pages > 0. Not
anymore in 2.6.9rc; however, with 2.6.9 and more recent kernels the oom
killer is invoked too early.

> So "not being able to make progress freeing pages" seems to be reliable
> information on whether to trigger OOM. Note that reaching priority
> 0 means we've tried VERY VERY hard already.

Yes, "not being able to make progress freeing pages" has always been the
only reliable information in linux. The early 2.6 and some older 2.4
kernels deadlocked in corner cases (like mlock, for example) by trying
to guess the oom time from a few stat numbers (nr_swap_pages, for
example).

Though if I'm getting spurious ooms, then clearly when we reach prio 0
we didn't try hard enough. It's also possible that the pages are being
freed into other cpus' per-cpu queues and we lose visibility of them, so
we do hard work and still cannot allocate anything. Unfortunately I've
never been able to reproduce the early oom kill here, so I could never
figure out what's going on. And yes, I know there are patches floating
around claiming to fix it, but I'd like to hear positive feedback on those.

> > kswapd page freeing efforts are not very useful. kswapd is an helper,
> > it's not the thing that can or should guarantee allocations to succeed.
>
> Oh wait, kswapd job is to guarantee that allocations succeed. We used to

kswapd is anything but a guarantee. kswapd is a pure helper.

> wait on kswapd before on v2.3 VM development - then we switched to
> task-goes-to-memory-reclaim for _performance_ reasons (parallelism).

It wasn't parallelism. The only way you could make it safe is to create
a message-passing mechanism where you post a request to kswapd and
kswapd wakes you back up. But that'd be inefficient compared to the
current model, where kswapd is a helper.

In fact, kswapd right now only hurts during heavy paging, since it will
prevent the freed pages from going into the right per-cpu queue. Since
kswapd only hurts during paging, we should stop it as long as somebody
is inside the critical section for a certain numa node.

> My point here is, kswapd is the entity responsible for freeing pages.

it can't even know which per-cpu queue it has to put the pages back
into.

> The action of triggering OOM killer from inside a task context (whether
> its from the alloc_pages path or the fault path is irrelevant here)
> is WRONG because at the same time, kswapd, who is the main entity freeing
> pages, is also running the memory reclaim code - it might just have freed
> a bunch of pages but we have no way of knowing that from normal task context.

We definitely have a way of knowing; the fact that the current code is
buggy doesn't mean we don't have a way of knowing. The 2.4 VM knows
perfectly well when kswapd did the right thing and helped. Though I
agree kswapd generally hurts during paging, and we'd better stop it to
reduce the amount of synchronous work.

The allocator must check whether the levels are above pages.low before
killing; if it doesn't do that, it's broken. Moving the oom killer into
kswapd cannot fix this problem, because obviously then it'll be the task
context that has freed the additional pages instead of kswapd.

The rule is to do:

	paging
	check the levels and kill

If you just do paging and then oom kill when paging has failed, there's
no way it can work. 2.6 is broken here, and that could be the reason for
the oom kills too.

There will always be a race condition even with the above, since
checking the levels and oom killing isn't an atomic operation and we
don't block all other cpus in that path, but it's an insignificant
window we don't have to worry about (only theoretical).

But if you check the levels, then page, then kill, like current 2.6,
there is a huge window while we wait for I/O. After we finish the I/O
the whole VM status may have changed, and we may be full of free pages
in the per-cpu queues and in the buddy as well. So we have to recheck
the levels before killing anything. This is again why doing the oom kill
inside try_to_free_pages (or in kswapd anyway) is flawed.
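The ordering argued for here - do the paging work, then re-check the
levels, and only kill when repeated passes made no progress - can be
sketched like this (the names, numbers, and the trivial reclaim model
are all invented for illustration):

```c
#include <assert.h>

static int free_pages;                   /* pages currently free */
static int reclaimable;                  /* pages reclaim can still find */
static const int pages_low = 4;          /* the watermark to re-check */

/* Pretend reclaim: each pass frees at most two reclaimable pages. */
static void do_reclaim(void)
{
    int got = reclaimable > 2 ? 2 : reclaimable;

    reclaimable -= got;
    free_pages += got;
}

enum outcome { ALLOC_OK, OOM_KILL };

static enum outcome alloc_slow_path(void)
{
    for (int priority = 12; priority >= 0; priority--) {
        do_reclaim();                    /* 1. paging */
        /* 2. check the levels AFTER paging: kswapd or another cpu may
         *    have freed pages while we waited for I/O */
        if (free_pages >= pages_low)
            return ALLOC_OK;
    }
    return OOM_KILL;                     /* no progress at any priority */
}
```

The check-before-page-before-kill ordering criticized above would test
free_pages once, sleep on I/O, and then kill without ever looking again.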

> > The rule is that if you want to allocate 1 page, you've to free the page
> > yourself. Then if kswapd frees a page too, that's welcome. But keep also
> > in mind kswapd may be running in another cpu, and it will put the pages
> > back into the per-cpu queue of the other cpu.
>
> Exactly another reason for _NOT_ triggering the OOM killer from task context
> - pages which have been freed might be in the per-CPU queue (but a task
> running on another CPU can't see them).
>
> We should be flushing the per-cpu queues quite often under these circumstances.

We should never flush the per-cpu pages; that'd hurt performance, and
per-cpu pages are lost memory. This is also why we must give up freeing
memory only when everything else is unavailable; in 2.4 I had to stop
after 50 tries or so. We must keep going until all per-cpu queues are
full, because if kswapd is in our way, every other cpu may get the ram
before us. This is why stopping kswapd would be beneficial while we're
working on it: it'd probabilistically reduce the amount of synchronous
work.

> Why's that? blk_congestion_wait looks clean and correct to me - if the queue
> is full dont bother queueing more pages at this device.

blk_congestion_wait is waiting on random I/O; it doesn't mean it's
waiting on any substantial VM-related paging (an O_DIRECT I/O would fool
blk_congestion_wait), and if there's no I/O it just wakes up after a
fixed number of seconds.

The VM should only throttle on locked pages or locked bhs it can see,
never on random I/O happening at the blkdev layer just because somebody
is rolling some directio. Throttling on random I/O will lead to oom
failures too, and that's another bug in the 2.6 VM. (And if it's not a
bug, and we throttle elsewhere too, then it's simply useless and should
be replaced with a yield; if there's no I/O, waiting there is nonsense,
especially given that even if the oom killer triggers there won't be any
additional ram to free, since the oom killer will generate free memory,
not memory to free.)

> OK - so you seem to be agreeing with me that triggering OOM killer
> from kswapd context is the correct thing to do now?

I disagree about that, sorry. Not even try_to_free_pages should ever
call the oom killer (unless you want to move the watermark checks from
page_alloc.c to vmscan.c, which would not be clean at all). Taking the
decision on when to oom kill inside vmscan.c (like current 2.6 does)
looks wrong to me.

2004-11-07 03:19:44

by Andrea Arcangeli

Subject: Re: [PATCH] Remove OOM killer from try_to_free_pages / all_unreclaimable braindamage

On Sat, Nov 06, 2004 at 10:24:58PM +0300, Nikita Danilov wrote:
> This means breaking all layering and passing mempool pointer all the way
> down to the lowest layer allocators (like bio and drivers). The only

bio and the drivers already have their own mempools. The blkdev layer is
guaranteed to succeed, and it can only try GFP_NOIO allocations (if
those fail it'll fall back to the reserved mempool).

> practical way to do this, is to put mempool pointer into current
> task_struct. At which point it's no different from having per-thread
> list of pages that __alloc_pages() looks into before falling back to
> per-cpu page-sets and buddy. _Except_ in the latter case, reservation is
> handled transparently in __alloc_pages() and code shouldn't be adjusted
> to check for mempool in zillion of places.

That's surely reasonable, to avoid changing lots of code.

> I think you are confusing "file system" and "ext2". I definitely know
> from experience that with some file system types, system can be oommed
> without any significant user-level allocation activity. Now, one can say
> that either such file-systems are broken, or Linux MM lacks support for
> features (like reservation) they need.

the latter is true, I agree.

> > that's the PF_MEMALLOC path. A reservation already exists, or it would
> > never work since 2.2. PF_MEMALLOC and the min/2 watermark are meant to
> > allow writepage to allocate ram. however the amount reserved is limited,
>
> low-mem watermark is mostly useless in the face of direct reclaim, when
> unbounded number of threads enter try_to_free_pages() and call
> ->writepage() simultaneously.

agreed.

> > so it's not perfect. The only way to make it perfect I believe is to
> > reserve the stuff inside the fs with mempools as described above.
>
> I don't see what advantages mempools have over page reservation handled
> directly by page allocator, like in
>
> ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.9-rc4/2.6.9-rc4-mm1/broken-out/reiser4-perthread-pages.patch

Guess what, that patch is running in my kernel right now. However, I
believe this approach is very wasteful. I agree it'll work right, but
you're wasting loads of ram and you're less efficient as well.

The efficient fix for your problem is to have a global pool, protected
by a global semaphore (definitely not per-thread), so that when you hit
oom (and when you hit true oom, the last thing you can care about is
parallelism or scalability on such a global semaphore), the VM will
transparently take the semaphore and start using the pool. This will
still require you to mark the start and end of your critical section,
like this:

reiser4_writepage()
{
	enable_reserved_pages_pool();

	find_or_create_page();
	/* journal something */
	getblk();
	/* bio, whatever */

	disable_reserved_pages_pool();
}

disable_reserved_pages_pool() has to check a per-thread flag that the VM
will set if it has used the reserved pool and taken the semaphore; but
by that time the I/O can be guaranteed to complete, and the memory will
be guaranteed to be unlocked eventually when the bio I/O completes. So
you can freely alloc_pages to refill the pool inside
disable_reserved_pages_pool() and then drop the semaphore.
enable_reserved_pages_pool() is only needed to set a per-thread flag to
tell the VM it's allowed to fall back on the global pool by blocking on
the global semaphore if the box is oom (instead of returning NULL).

In disable_reserved_pages_pool() you'll also have to clear the
per-thread flag before calling alloc_pages again, to avoid deadlocking
on the semaphore if another oom condition happens, of course.

Then you need a create_reserved_pages_pool(nr_pages) when you mount the
fs, and a destroy_reserved_pages_pool(nr_pages) when you unmount it,
where many different users (i.e. different filesystems) are allowed to
reserve a different size for the global pool. They will all share the
same pool; you only need to track each user's nr_pages to know the max
reservation you need.

That's still entirely transparent; it'll work in the thread context
thanks to the global semaphore, but it'll avoid the waste of ram where
every task has to pin the ram into itself before starting the
writepage I/O.

I mean, I understand the only point of the perthread-pages patch is
deadlock avoidance during OOM. So you definitely don't need a per-thread
reservation; the global pool methods I described above should be more
than enough, and they'll save ram and make your system faster as well.

I agree PF_MEMALLOC has nothing to do with this.
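A single-threaded userspace model of the scheme described above (the
semaphore is elided, and everything beyond the two function names Andrea
gives is an assumption made for illustration):

```c
#include <assert.h>
#include <stdbool.h>

#define POOL_SIZE 4
static int pool_pages = POOL_SIZE;  /* global reserved pool */
static int free_pages;              /* normal allocator supply */

/* Per-thread flags in a real kernel; one thread suffices here. */
static bool pool_allowed;           /* set between enable/disable */
static bool pool_used;              /* VM dipped into the pool */

static void enable_reserved_pages_pool(void) { pool_allowed = true; }

static int alloc_page(void)
{
    if (free_pages > 0) {
        free_pages--;
        return 1;                   /* normal path */
    }
    if (pool_allowed && pool_pages > 0) {
        pool_used = true;           /* remember to refill later */
        pool_pages--;
        return 1;                   /* fell back on the reserve */
    }
    return 0;                       /* would be NULL / blocking */
}

static void disable_reserved_pages_pool(void)
{
    pool_allowed = false;           /* avoid recursing into the pool */
    while (pool_used && pool_pages < POOL_SIZE && alloc_page())
        pool_pages++;               /* I/O completed, refill the pool */
    pool_used = false;
}
```

The key property is that the reserve is shared by everyone and only
pinned for the duration of one writepage-style critical section, instead
of every thread carrying its own pages as in the perthread-pages patch.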

2004-11-07 09:26:37

by Marko Macek

Subject: Re: [PATCH] Remove OOM killer ...

Andries Brouwer wrote:

> I have always been surprised that so few people investigated
> doing things right, that is, entirely without OOM killer.

Agreed.

> This is not in a state such that I would like to submit it,
> but I think it would be good to focus some energy into
> offering a Linux that is guaranteed free of OOM surprises.

A good thing would be to make the OOM killer kill only processes
that actually overcommit (independent of overcommit mode).

The first step would be adding a value in /proc/$pid/...
somewhere that shows how much a process is overcommitted when
overcommit is enabled. This would allow important processes to be
fixed for all overcommit modes.


Mark

2004-11-07 12:03:33

by Anton Ertl

Subject: memory overcommit (was: [PATCH] Remove OOM killer ...)

Marko Macek <[email protected]> writes:
>Andries Brouwer wrote:
>
>> I have always been surprised that so few people investigated
>> doing things right, that is, entirely without OOM killer.

I.e., without overcommitment. That's not necessarily the right thing
for all processes, because many programs are not written to do anything
useful when a memory allocation or other system call fails.
For these programs it's better to let the allocation succeed, let them
use all the unused (but possibly committed) memory and swap space in
the system, and kill the process if the system runs out of memory
later.

In a recent posting <[email protected]> in
c.o.l.d.s, I proposed separating the processes into two classes:

- a no-overcommit class for which memory commitment is accounted. It
may get ENOMEM on allocation when the system runs out of committable
memory, and processes in this class are never OOM-killed.

- an overcommitting class for which memory commitment is not accounted.
It normally does not get ENOMEM on allocation, but if the system runs
out of memory (virtual memory, not committable memory), processes from
this class are OOM-killed. Note that these processes can use memory
that has been committed to, but has not been used by, no-overcommit
class processes.

Ideally, all the important applications would be able to handle failed
allocations gracefully, and would be marked as no-overcommit, and thus
would be safe from the OOM killer.

And all the other applications would often continue running long after
they would have crashed or become otherwise useless from ENOMEM on a
pure no-overcommitment system.

>> This is not in a state such that I would like to submit it,
>> but I think it would be good to focus some energy into
>> offering a Linux that is guaranteed free of OOM surprises.
>
>A good thing would be to make the OOM killer only kill
>processes that actually overcommit (independant of overcommit mode).

What does that mean? Overcommitment is normally something that all
processes do together (each one usually asks for less than the total
virtual memory). In my proposal it means that the process would be
marked as overcommitting or not through something like "nice", maybe
with a default coming from a flag in the executable.

- anton
--
M. Anton Ertl Some things have to be seen to be believed
[email protected] Most things have to be believed to be seen
http://www.complang.tuwien.ac.at/anton/home.html

2004-11-07 14:36:09

by Marcelo Tosatti

Subject: Re: [PATCH] Remove OOM killer from try_to_free_pages / all_unreclaimable braindamage


On Sun, Nov 07, 2004 at 01:48:09AM +0100, Andrea Arcangeli wrote:
> On Sat, Nov 06, 2004 at 03:09:25PM -0200, Marcelo Tosatti wrote:
> > Sure NUMA can and has to be special cased, as I answered Nick.
> >
> > "dont kill if we can allocate from other nodes", should be pretty simple.
>
> then how do you call the oom killer if some highmem page is still free
> but there's a GFP_KERNEL|__GFP_NOFAIL allocation failing and looping
> forever?

We should probably kill a task if a GFP_KERNEL allocation is failing and
looping forever. We need some mechanism to indicate to kswapd that there
is an allocator failing repeatedly (an indication that we are really
getting into an OOM condition).

With that information kswapd can decide whether to kill or not.

> > but its not the same problem.
> >
> > Have you done any tests on this respect?
>
> the only test I have are based on 2.6.5 and the box simply deadlock
> there, the oom killer is forbidden there if nr_swap_pages > 0. Not
> anymore in 2.6.9rc, however with 2.6.9 and more recent the oom killer is
> invoked too early.

Right.

> > So "not being able to make progress freeing pages" seems to be reliable
> > information on whether to trigger OOM. Note that reaching priority
> > 0 means we've tried VERY VERY hard already.
>
> yes, "not being able to make progress freeing pages" has always been the
> only reliable information in linux. The early 2.6 and some older 2.4
> deadlocked in corner cases (like mlock for example) when trying to guess
> the oom time by looking at a few stat numbers (nr_swap_pages for
> example).

Right, and William removed the "nr_swap_pages" check, which fixed
those - but introduced the spurious OOM kills.

> Though when we reach prio 0 clearly we didn't try hard enough if I'm
> getting spurious oom. It's also possible that the pages are being freed
> in other per-cpu queues and we lose visibility on them, so we do hard
> work and still we cannot allocate anything. Unfortunately I've never
> been able to reproduce the early oom kill here, so I could never figure
> out what's going on. and yes, I know there are patches floating around
> claiming they fixed it, but I'd like to hear positive feedback on those.

Just run a program which allocates a lot of anonymous memory on a 100M setup
and you will see it.

The OOM killer will be triggered even if there is anonymous memory to be
swapped out and swap space available. Which is plain wrong.

> > > kswapd page freeing efforts are not very useful. kswapd is a helper,
> > > it's not the thing that can or should guarantee allocations to succeed.
> >
> > Oh wait, kswapd job is to guarantee that allocations succeed. We used to
>
> kswapd is anything but a guarantee. kswapd is a pure helper.
>
> > wait on kswapd before on v2.3 VM development - then we switched to
> > task-goes-to-memory-reclaim for _performance_ reasons (parallelism).
>
> it wasn't parallelism. The only way you could make it safe is that you
> create a message passing mechanism where you post a request to kswapd
> and kswapd wakes you back up. But that'd be inefficient compared to the
> current model where kswapd is a helper.

Yes, it was inefficient, but it used to work like that. There was a
kswapd waitqueue in which tasks used to sleep. By parallelism I meant
parallelism in freeing pages -> efficiency.

OK - it's not like that anymore - but kswapd is not simply a "helper" -
it's more than that: it's responsible for keeping enough pages free.

> In fact kswapd right now only hurts during heavy paging since it will
> prevent the freed pages from going into the right per-cpu queue. kswapd only
> hurts during paging, we should stop it as far as somebody is inside the
> critical section for a certain numa node.

Indeed! But then if you stop kswapd under heavy load, you're pretty much
guaranteeing tasks will have to free pages themselves - synchronously -
which is not a good thing.

It's not trivial to know when to stop/when to start kswapd.

Another related problem is that an unlimited number of tasks can go
into reclaim - it should be limited to a sane number.
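
A cap on concurrent direct reclaimers could be sketched like this (a
user-space model with an invented limit, not actual kernel code):

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical cap on direct reclaimers: a task may enter the reclaim
 * path only if fewer than MAX_RECLAIMERS are already inside; otherwise
 * it should throttle and retry.  The limit is made up. */

#define MAX_RECLAIMERS 8

static atomic_int nr_reclaimers;

bool reclaim_enter(void)
{
    int cur = atomic_load(&nr_reclaimers);
    while (cur < MAX_RECLAIMERS) {
        /* On failure, cur is reloaded and the limit rechecked. */
        if (atomic_compare_exchange_weak(&nr_reclaimers, &cur, cur + 1))
            return true;    /* slot acquired: proceed into reclaim */
    }
    return false;           /* too many reclaimers: caller should wait */
}

void reclaim_exit(void)
{
    atomic_fetch_sub(&nr_reclaimers, 1);
}
```

The compare-and-swap loop keeps the check-and-increment atomic, so the
cap cannot be overshot by tasks racing into reclaim at the same time.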

> > My point here is, kswapd is the entity responsible for freeing pages.
>
> it can't even know which is the per-cpu queue where it has to put the
> pages back.

No need - just send them back to the buddy allocator - which happens
anyway when the per-cpu queue reaches its high limit.
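
That drain-at-high-limit behaviour can be modelled in a few lines (a
user-space toy with made-up batch sizes, not the real per-cpu pageset
code):

```c
/* Toy per-cpu free-page queue with a high watermark: once `count`
 * exceeds PCP_HIGH, a batch of pages is returned to the (global) buddy
 * allocator, where tasks on other CPUs can see them.  Sizes invented. */

#define PCP_HIGH  6
#define PCP_BATCH 4

struct pcp_queue { int count; };

static int buddy_free_pages;   /* stands in for the buddy allocator */

void pcp_free_page(struct pcp_queue *q)
{
    q->count++;
    if (q->count > PCP_HIGH) {          /* high limit hit: drain a batch */
        q->count -= PCP_BATCH;
        buddy_free_pages += PCP_BATCH;  /* now globally visible */
    }
}
```

Until the high limit is hit, the freed pages stay invisible to other
CPUs - which is exactly the visibility problem discussed above.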

> > The action of triggering OOM killer from inside a task context (whether
> > its from the alloc_pages path or the fault path is irrelevant here)
> > is WRONG because at the same time, kswapd, who is the main entity freeing
> > pages, is also running the memory reclaim code - it might just have freed
> > a bunch of pages but we have no way of knowing that from normal task context.
>
> we definitely have a way of knowing, the fact the current code is buggy
> doesn't mean we don't have a way of knowing, 2.4 VM perfectly knows when
> kswapd did the right thing and helped. Though I agree kswapd generally
> hurts during paging and we'd better stop it to reduce the synchronous
> amount of work.
>
> the allocator must check if the levels are above pages.low before
> killing, if it doesn't do that it's broken, moving the oom killer in
> kswapd cannot fix this problem, because obviously then it'll be the task
> context that will have freed the additional pages instead of kswapd.

OK - the pages.low check before killing makes total sense.

We don't do that right now - and we should - whichever caller
invokes the OOM killer.

> The rule is to do:
>
> paging
> check the levels and kill
>
> If you just do paging and oom kill if paging has failed there's no way
> it can work. 2.6 is broken here, and that could be the reason of the oom
> kills too.
>
> There will be always a race condition even with the above, since the
> check for the levels and oom kill isn't an atomic operation and we don't
> block all other cpus in that path, but it's an insignificant window we
> don't have to worry about (only theoretical).
>
> But if you check the levels; paging; kill, like current 2.6, there is a
> huge window while we wait for I/O. After we finished the I/O the whole
> VM status may have changed and we may be full of free pages in the
> per-cpu queue and in the buddy as well. so we've to recheck the levels
> before killing anything. This is again why doing oom_kill inside the
> try_to_free_pages (or in kswapd anyways) is flawed.

OK, it could be a task that has now freed pages instead of kswapd - but I
still prefer having the OOM killer called from a centralized place - the
task mainly responsible for freeing pages - rather than from each
task's try_to_free_pages() path.

Anyway, where do you suggest oom_kill be called from, if you think
both try_to_free_pages and kswapd are the wrong callers?

Failure of handle_mm_fault?
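
For what it's worth, Andrea's ordering rule - page first, then recheck
the watermarks just before killing - can be contrasted with the
check-before-paging order in a toy model (invented names, not kernel
code):

```c
#include <stdbool.h>

/* Toy model of the ordering argument.  free_pages/pages_low stand in
 * for the real zone watermarks; `reclaimed` for pages freed while the
 * reclaim I/O was in flight (by us, kswapd, or other tasks). */

struct zone_model {
    long free_pages;
    long pages_low;
};

/* Broken order: decide to kill based on levels sampled *before* the
 * reclaim I/O completed - may kill spuriously. */
bool oom_check_before_paging(struct zone_model *z, long reclaimed)
{
    bool below = z->free_pages < z->pages_low;  /* stale sample */
    z->free_pages += reclaimed;                 /* paging finishes later */
    return below;
}

/* Correct order: page first, then recheck the watermark just before
 * killing, so pages freed meanwhile are taken into account. */
bool oom_check_after_paging(struct zone_model *z, long reclaimed)
{
    z->free_pages += reclaimed;                 /* paging first */
    return z->free_pages < z->pages_low;        /* fresh sample */
}
```

With the same reclaim outcome, the first order kills and the second
doesn't - which is the spurious-OOM window being described.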

> > > The rule is that if you want to allocate 1 page, you've to free the page
> > > yourself. Then if kswapd frees a page too, that's welcome. But keep also
> > > in mind kswapd may be running in another cpu, and it will put the pages
> > > back into the per-cpu queue of the other cpu.
> >
> > Exactly another reason for _NOT_ triggering the OOM killer from task context
> > - pages which have been freed might be in the per-CPU queue (but a task
> > running on another CPU can't see them).
> >
> > We should be flushing the per-cpu queues quite often under these circumstances.
>
> we should never flush per-cpu pages, that'd hurt performance, per-cpu
> pages are lost memory. this is also why we must give up freeing memory
> only if everything else is not available, in 2.4 I had to stop after 50
> tries or so. We must keep going until all per-cpu queues are full,
> because if kswapd is in our way every other cpu may get the ram before
> us. This is why stopping kswapd would be beneficial while we're working
> on it, it'd probabilistically reduce the amount of synchronous work.

OK - flushing the per-cpu queues would hurt performance.

I meant, if we're kswapd, send them back to the buddy allocator
so that tasks on other CPUs can see the just-freed memory.

But that's indeed bad from a performance point of view.

> > Why's that? blk_congestion_wait looks clean and correct to me - if the queue
> > is full dont bother queueing more pages at this device.
>
> blk_congestion_wait is waiting on random I/O, it doesn't mean it's
> waiting on any substantial VM-related paging (an O_DIRECT I/O would fool
> blk_congestion_wait), and if there's no I/O it just wakes up after an
> arbitrary number of seconds.
>
> the VM should only throttle on locked pages or locked bh it can see,
> never on random I/O happening at the blkdev layer just because somebody
> is rolling some directio. Throttling on random I/O will lead to oom
> failures too and that's another bug in the 2.6 VM (and if it's not a
> bug, and we throttle elsewhere too, then it's simply useless and it
> should be replaced with a yield; if there's no I/O, waiting there is
> nonsense, especially given that even if the oom killer triggers there
> won't be any additional ram to free, since the oom killer will generate
> free memory, not memory to free).
>
> > OK - so you seem to be agreeing with me that triggering OOM killer
> > from kswapd context is the correct thing to do now?
>
> I disagree about that sorry. not even try_to_free_pages should ever call
> the oom killer (unless you want to move the watermark checks from
> page_alloc.c to vmscan.c that would not be clean at all). Taking the
> decision on when to oom kill inside vmscan.c (like current 2.6 does)
> looks wrong to me.

OK, so, please answer my question above on "where do you think is
the correct caller".

IMHO OOM killing from kswapd is way better than the current approach
- and I will try to enhance my patch to handle the NUMA and GFP_KERNEL
allocation cases (with a notification mechanism used by allocators
to report successive failures to kswapd).

Either way - this discussion is being productive.

2004-11-08 19:47:24

by Marcelo Tosatti

Subject: Re: [PATCH] Remove OOM killer from try_to_free_pages / all_unreclaimable braindamage

On Fri, Nov 05, 2004 at 06:01:18PM -0200, Marcelo Tosatti wrote:

> While doing this, I noticed that kswapd will happily go to sleep
> if all zones have all_unreclaimable set. I bet this is the reason
> for the page allocation failures we are seeing. So the patch
> also makes balance_pgdat() NOT return and go to "loop_again"
> instead in case of page shortage - even if all_unreclaimable is set.
>
> Basically the "loop_again" logic IS NOT WORKING!

Wrong, the loop_again logic is working, all_zones_ok will be
set when DEF_PRIORITY = 0.

So the page allocation failures are happening for some other
reason(s).

2004-11-08 22:16:31

by Marcelo Tosatti

Subject: Re: [PATCH] Remove OOM killer from try_to_free_pages / all_unreclaimable braindamage

On Mon, Nov 08, 2004 at 02:27:31PM -0200, Marcelo Tosatti wrote:
> On Fri, Nov 05, 2004 at 06:01:18PM -0200, Marcelo Tosatti wrote:
>
> > While doing this, I noticed that kswapd will happily go to sleep
> > if all zones have all_unreclaimable set. I bet this is the reason
> > for the page allocation failures we are seeing. So the patch
> > also makes balance_pgdat() NOT return and go to "loop_again"
> > instead in case of page shortage - even if all_unreclaimable is set.
> >
> > Basically the "loop_again" logic IS NOT WORKING!
>
> Wrong, the loop_again logic is working, all_zones_ok will be
> set when DEF_PRIORITY = 0.

I meant priority=DEF_PRIORITY.

> So the page allocation failures are happening for some other
> reason(s).

2004-11-09 02:23:48

by Nick Piggin

Subject: Re: [PATCH] Remove OOM killer from try_to_free_pages / all_unreclaimable braindamage



Marcelo Tosatti wrote:

>On Mon, Nov 08, 2004 at 02:27:31PM -0200, Marcelo Tosatti wrote:
>
>>On Fri, Nov 05, 2004 at 06:01:18PM -0200, Marcelo Tosatti wrote:
>>
>>
>>>While doing this, I noticed that kswapd will happily go to sleep
>>>if all zones have all_unreclaimable set. I bet this is the reason
>>>for the page allocation failures we are seeing. So the patch
>>>also makes balance_pgdat() NOT return and go to "loop_again"
>>>instead in case of page shortage - even if all_unreclaimable is set.
>>>
>>>Basically the "loop_again" logic IS NOT WORKING!
>>>
>>Wrong, the loop_again logic is working, all_zones_ok will be
>>set when DEF_PRIORITY = 0.
>>
>
>I meant priority=DEF_PRIORITY.
>
>

Yep

>>So the page allocation failures are happening for some other
>>reason(s).
>>

Pre alloc_pages / kswapd shakeup, the watermark stuff had been pretty
broken. For example, allocations would wakeup kswapd at the *same*
watermark as they would start synchronous reclaim (or fail in the case
of !wait allocations).

Why there have been apparently more reports of allocation failures
since those patches is a mystery to me. I've looked but can't find
anything to explain it. Perhaps the initial watermark calculation had
been changed slightly? I'm not sure... it could also just be a fluke
due to chaotic effects in the mm, I suppose :|

2004-11-09 02:36:06

by Andrew Morton

Subject: Re: [PATCH] Remove OOM killer from try_to_free_pages / all_unreclaimable braindamage

Nick Piggin <[email protected]> wrote:
>
> I'm not sure... it could also just be a fluke
> due to chaotic effects in the mm, I suppose :|

2.6 scans less than 2.4 before declaring oom. I looked at the 2.4
implementation and thought "whoa, that's crazy - let's reduce it and see
who complains". My three-year-old memory tells me it was reduced by 2x to
3x.

We need to find testcases (dammit) and do the analysis. It could be that
we're simply not scanning far enough.

2004-11-09 02:46:35

by Nick Piggin

Subject: Re: [PATCH] Remove OOM killer from try_to_free_pages / all_unreclaimable braindamage



Andrew Morton wrote:

>Nick Piggin <[email protected]> wrote:
>
>>I'm not sure... it could also just be a fluke
>> due to chaotic effects in the mm, I suppose :|
>>
>
>2.6 scans less than 2.4 before declaring oom. I looked at the 2.4
>implementation and thought "whoa, that's crazy - let's reduce it and see
>who complains". My three-year-old memory tells me it was reduced by 2x to
>3x.
>
>We need to find testcases (dammit) and do the analysis. It could be that
>we're simply not scanning far enough.
>
>
>

Oh yeah, there definitely seems to be OOM problems as well (although
luckily not _too_ many people seem to be complaining).

I thought Marcelo was talking about increased incidents of people
reporting eg. order-0 atomic allocation failures though, after the
recentish code from you and I to fix up alloc_pages.

2004-11-09 10:37:27

by Marcelo Tosatti

Subject: Re: [PATCH] Remove OOM killer from try_to_free_pages / all_unreclaimable braindamage

On Mon, Nov 08, 2004 at 06:35:52PM -0800, Andrew Morton wrote:
> Nick Piggin <[email protected]> wrote:
> >
> > I'm not sure... it could also just be a fluke
> > due to chaotic effects in the mm, I suppose :|
>
> 2.6 scans less than 2.4 before declaring oom. I looked at the 2.4
> implementation and thought "whoa, that's crazy - let's reduce it and see
> who complains". My three-year-old memory tells me it was reduced by 2x to
> 3x.
>
> We need to find testcases (dammit) and do the analysis. It could be that
> we're simply not scanning far enough.

Andrew,

When reading the code I was really suspicious of the all_unreclaimable code.
It basically stops scanning when reaching OOM conditions - that might be it.

I tried to disable it (ignore it if priority==0) - result: very slow progress
under extreme load.

2004-11-09 10:54:43

by Marcelo Tosatti

Subject: Re: [PATCH] Remove OOM killer from try_to_free_pages / all_unreclaimable braindamage

On Tue, Nov 09, 2004 at 01:46:27PM +1100, Nick Piggin wrote:
>
>
> Andrew Morton wrote:
>
> >Nick Piggin <[email protected]> wrote:
> >
> >>I'm not sure... it could also just be a fluke
> >>due to chaotic effects in the mm, I suppose :|
> >>
> >
> >2.6 scans less than 2.4 before declaring oom. I looked at the 2.4
> >implementation and thought "whoa, that's crazy - let's reduce it and see
> >who complains". My three-year-old memory tells me it was reduced by 2x to
> >3x.
> >
> >We need to find testcases (dammit) and do the analysis. It could be that
> >we're simply not scanning far enough.
> >
> >
> >
>
> Oh yeah, there definitely seems to be OOM problems as well (although
> luckily not _too_ many people seem to be complaining).
>
> I thought Marcelo was talking about increased incidents of people
> reporting eg. order-0 atomic allocation failures though, after the
> recentish code from you and I to fix up alloc_pages.

Yes, that is what I'm talking about - it shouldn't be happening.

The number of reports is _too high_. I see at least one report
of a 0-order page allocation failure a day.

2004-11-10 01:11:27

by Nick Piggin

Subject: Re: [PATCH] Remove OOM killer from try_to_free_pages / all_unreclaimable braindamage



Marcelo Tosatti wrote:

>On Mon, Nov 08, 2004 at 06:35:52PM -0800, Andrew Morton wrote:
>
>>Nick Piggin <[email protected]> wrote:
>>
>>>I'm not sure... it could also just be a fluke
>>> due to chaotic effects in the mm, I suppose :|
>>>
>>2.6 scans less than 2.4 before declaring oom. I looked at the 2.4
>>implementation and thought "whoa, that's crazy - let's reduce it and see
>>who complains". My three-year-old memory tells me it was reduced by 2x to
>>3x.
>>
>>We need to find testcases (dammit) and do the analysis. It could be that
>>we're simply not scanning far enough.
>>
>
>Andrew,
>
>When reading the code I was really suspicious of the all_unreclaimable code.
>It basically stops scanning when reaching OOM conditions - that might be it.
>
>

Yeah, I saw a pretty good correlation between OOM killing and
all_unreclaimable.

We've got some code to spit that out during an OOM kill now, so that
might be helpful.

>I tried to disable it (ignore it if priority==0) - result: very slow progress
>on extreme load.
>
>

I had a patch that caused try_to_free_pages to ignore all_unreclaimable and
go 'round the loop again if we reached oom-kill conditions. Basically that
guarantees you'll scan ~ pages_present*2 before going OOM. I think it may
be a good thing to do, but I wasn't really able to reproduce these early
OOM killings.

2004-11-17 22:59:42

by Werner Almesberger

Subject: Re: [PATCH] Remove OOM killer from try_to_free_pages / all_unreclaimable braindamage

Entering an old discussion ...

Thomas Gleixner wrote:
> context in which oom-killer is called. My concern is that the decision
> criterion for which process should be killed is not sufficient. In my case it
> kills sshd instead of a process which forks a bunch of child processes.

It recently occurred to me that we could have relatively light-weight
voluntary victimization for known trouble-makers. E.g. in a desktop
environment, the cause for trouble seems to be almost always the Web
browser, or something closely related to it.

A process could declare itself a usual suspect. This would then be
recorded as a per-task flag, to be inherited by children. Now, one
could write a launcher like this:

#include <stdio.h>
#include <unistd.h>

/* sys_suspect_me() is the proposed syscall that marks the calling
   task as an OOM "usual suspect"; it doesn't exist yet. */

int main(int argc, char **argv)
{
	if (argc < 2) {
		fprintf(stderr, "usage: %s command [arguments...]\n", *argv);
		return 1;
	}
	sys_suspect_me();
	execvp(argv[1], argv + 1);
	perror(argv[1]);
	return 1;
}

And then something like

# mv /usr/bin/browser /usr/bin/browser.bin
# echo '#!/bin/sh' >/usr/bin/browser
# echo 'suspect_me /usr/bin/browser.bin "$@"' >>/usr/bin/browser
# chmod 555 /usr/bin/browser

or use an alias if you like your package manager.

Not sure if this would actually be useful in real life, but it looks
at least like a relatively simple and flexible solution to a part of
the selection problem.

One could even consider getting rid of the suspects a while before
hitting OOM, so that the system doesn't have to slow down before the
inevitable killing.

Not that I'm getting many OOMs these days - my VNC setup is quite good
at dying well before anything serious turns up :-(

- Werner

--
_________________________________________________________________________
/ Werner Almesberger, Buenos Aires, Argentina [email protected] /
/_http://www.almesberger.net/____________________________________________/

2004-11-17 23:30:05

by Chris Ross

Subject: Re: [PATCH] Remove OOM killer from try_to_free_pages / all_unreclaimable braindamage



Werner Almesberger escreveu:
> A process could declare itself as usual suspect. This would then be
> recorded as a per-task flag, to be inherited by children.

I don't think this "I know I'm buggy, please kill me" flag is the right
approach even if it can be made to work. The operating system has an
overview of all the memory and can see when a particular process is
basically making the machine unusable. It's quite likely that the
process causing the trouble doesn't know (or hasn't admitted) that it's
buggy and hasn't volunteered for early termination. As this means the
kernel must be able to deal with a problematic process completely
irrespective of whether it has set the "kill me" flag or not, the flag
doesn't really buy you anything.

It is also specific to runaway processes that are clearly at fault.
There is the related case where no particular process is faulty as such
but the system as a whole can't cope with the demands being made.

On a related note, I would prefer to see victim processes who are not
determined to be the cause of the trouble swapped out (i.e. *all* their
pages pushed out to swap) and suspended (not allowed to run) as a first
resort. The example I have in mind is on my machine when the daily cron
run over commits causing standard daemons such as ntpd to be killed to
make room. It would be preferable if the daemon was swapped out and just
didn't run for minutes, or even hours if need be, but was allowed to run
again once the system had settled down.

Of course, from recent discussion the system should not actually be
killing off these daemons at all but that does seem to be resolved now.
There are circumstances when there simply isn't enough RAM and swapping
something out is preferable to killing it off. Of course, if there isn't
sufficient swap space killing it should be the second resort. The last
resort being panic.

So, the problem breaks down into three parts:

i) When should the oom killer be invoked.
ii) How do we pick a victim process
iii) How can we deal with the process in the most useful manner

Regards,
Chris R.

2004-11-18 00:07:34

by Werner Almesberger

Subject: Re: [PATCH] Remove OOM killer from try_to_free_pages / all_unreclaimable braindamage

Chris Ross wrote:
> The operating system has an
> overview of all the memory and can see when a particular process is
> basically making the machine unusable.

The underlying hypothesis for suggesting explicitly flagging
candidates for killing is of course that it doesn't see who
exactly is misbehaving :-) Since this issue has been around for
> a number of years, I think it's fair to assume that the OOM
killers indeed have a problem in that area.

> It's quite likely that the
> process causing the trouble doesn't know (or hasn't admitted) that it's
> buggy and hasn't volunteered for early termination.

I guess that depends a lot on your scenario. If your system is
the typical undergrad mainframe where an army of students is
hard at work trying to fork-bomb it out of existence, you're
absolutely right.

However, on a system where new programs are rarely added to the
mix, the distinction should be easier. You can still get
unexpected problems, e.g. vi trying to load a huge file, but
you should be in a much better position to profile your system
behaviour.

It could of course be that this scenario is overly specific.

> As this means the
> kernel must be able to deal with a problematic process completely
> irrespective of whether it has set "kill me" flag or not the flag
> doesn't really buy you anything.

I'd view it as an additional hint that killing that process is
likely to help, a) because it may be the culprit, or b) because
it is likely to hold lots of memory, and its death will not be
mourned.

I'm not suggesting to use this as a replacement for an adaptive
OOM killer. The OOM killer could first make its pick among the
suspects, and only if it runs out of them (or maybe if it finds
overwhelming evidence that it's something else), then it would
go after non-suspects.

> There is the related case where no particular process is faulty as such
> but the system as a whole can't cope with the demands being made.

Yes, that's yet another scenario. Even then, having a list of
things we can kill to give us some room would be useful.

> The example I have in mind is on my machine when the daily cron
> run over commits causing standard daemons such as ntpd to be killed to
> make room. It would be preferable if the daemon was swapped out and just
> didn't run for minutes, or even hours if need be, but was allowed to run
> again once the system had settled down.

Ah, now I understand why you'd want to swap. Interesting. Now,
depending on the time of day, you have typically "interactive"
processes, like your idle desktop, turn into "non-interactive"
ones, which can then be subjected to swapping. Nice example
against static classification :-)

> So, the problem breaks down into three parts:
>
> i) When should the oom killer be invoked.
> ii) How do we pick a victim process
> iii) How can we deal with the process in the most useful manner

iii) may also affect i). If you're going to swap, you don't want
to wait until you're fighting for the last available page in the
system.

- Werner

--
_________________________________________________________________________
/ Werner Almesberger, Buenos Aires, Argentina [email protected] /
/_http://www.almesberger.net/____________________________________________/

2004-11-18 00:40:38

by Chris Ross

Subject: Re: [PATCH] Remove OOM killer from try_to_free_pages / all_unreclaimable braindamage


Werner Almesberger escreveu:
> Chris Ross wrote:
> The underlying hypothesis for suggesting explicitly flagging
> candidates for killing is of course that it doesn't see who
> exactly is misbehaving :-) Since this issue has been around for
> a nummber of years, I think it's fair to assume that the OOM
> killers indeed have a problem in that area.

That's my point ii) below, which is what Thomas's patch is trying to
address. I doubt you'd find much disagreement that this area still needs
work :)

>>The example I have in mind is on my machine when the daily cron
>>run overcommits, causing standard daemons such as ntpd to be killed to
>>make room. It would be preferable if the daemon was swapped out and just
>>didn't run for minutes, or even hours if need be, but was allowed to run
>>again once the system had settled down.
>
> Ah, now I understand why you'd want to swap. Interesting. Now,
> depending on the time of day, you have typically "interactive"
> processes, like your idle desktop, turn into "non-interactive"
> ones, which can then be subjected to swapping. Nice example
> against static classification :-)

A better example than the ntpd daemon (which mightn't take kindly to
finding minutes just passed in a blink of its eye) is Thomas's example
with the sshd. If the daemon was swapped out you wouldn't be able to log
into the box while it was thrashing, but in practice you can't really
anyway. At least once the system had recovered sufficiently you could
get back in; under the present system you can never log in again.

>>So, the problem breaks down into three parts:
>>
>> i) When should the oom killer be invoked.
>> ii) How do we pick a victim process
>> iii) How can we deal with the process in the most useful manner
>
> iii) may also affect i). If you're going to swap, you don't want
> to wait until you're fighting for the last available page in the
> system.

Well yes, in typical fashion everything depends on everything else. That
in a nutshell is also my argument against the kill-me flag.

Regards,
Chris R.

2004-11-18 01:17:59

by Werner Almesberger

Subject: Re: [PATCH] Remove OOM killer from try_to_free_pages / all_unreclaimable braindamage

Chris Ross wrote:
> with the sshd. If the daemon was swapped out you wouldn't be able to log
> into the box while it was thrashing, but in practice you can't really
> anyway.

Nor would you want to, in the scenario you're describing, because
the system is doing housekeeping while you're away/asleep. I
agree that this makes sense.

The tricky bit is now to identify such part-time interactive tasks,
i.e. the ones who won't receive a trigger for a while. To make
things worse, there are those who may be happily doing something,
like spinning some animated GIF, which would be perfectly fine
being put to a long sleep. That in turn may make the X server idle,
etc.

Again, if you have such a clearly defined scenario, perhaps the
cron jobs should just loudly announce that housekeeping is now
starting and that this changes some of the rules. Or perhaps,
there could be a SIGSWAP to swap out a process (maybe SIGSUSP it
first so that it doesn't come back on its own).

> Well yes, in typical fashion everything depends on everything else. That
> in a nutshell is also my argument against the kill-me flag.

I think it may be more subtle: everybody seems to have a set of
scenarios where the best solution is quite obvious and could
be easily implemented. Also, every once in a while, you find
that system loads which clearly demand a specific action in
scenario A need something very different in scenario B.

E.g. if you go by load spike, you'll be able to contain some
of the less inspired experiments on that undergrad mainframe,
but you may end up killing the cron jobs in your housekeeping
scenario. (And in this case, swapping wouldn't even help.) Or,
if you never kill anything big with a long run time, you'll
protect that simulation of a universe that's just on the
verge of developing intelligent life, but you may completely
miss the Web browser that's been rotating banner ads for weeks.
(Here, swapping might help.)

So I think that you also need to know what the processes are,
and not only what they're doing. This should greatly improve
predictions of what they will do in the future, and why
they're doing it in the first place.

It's ultimately policy decisions, and that's where I see a place
for light-weight markup mechanisms like a "kill me first" bit.

- Werner

--
_________________________________________________________________________
/ Werner Almesberger, Buenos Aires, Argentina [email protected] /
/_http://www.almesberger.net/____________________________________________/

2004-11-18 08:21:55

by Chris Ross

Subject: Re: [PATCH] Remove OOM killer from try_to_free_pages / all_unreclaimable braindamage



Werner Almesberger escreveu:
> The tricky bit is now to identify such part-time interactive tasks,
> i.e. the ones who won't receive a trigger for a while. To make
> things worse, there are those who may be happily doing something,
> like spinning some animated GIF, which would be perfectly fine
> being put to a long sleep. That in turn may make the X server idle,
> etc.

I don't think you need to be that subtle about it, though I agree
perfection would be nice :) The present behaviour is just to kill
something. All I'm advocating is just swapping something out if possible
instead. Yes by definition we probably have picked something you would
have preferred to leave running, but the machine simply can't cope with
everything being asked of it at the moment, and that something got the
short straw. At least swapped out we will get round to running it when
we can.

> Again, if you have such a clearly defined scenario, perhaps the
> cron jobs should just loudly announce that housekeeping is now
> starting and that this changes some of the rules. Or perhaps,
> there could be a SIGSWAP to swap out a process (maybe SIGSUSP it
> first so that it doesn't come back on its own).

Sounds like a job for priorities and sensible use of batch scheduling.

I still feel that special casing things is basically wrong. We could
work around the specific example that the cron.daily on my test machines
tends to cause things to be oom_killed, but it's better to fix the
problem. What about when I try to build umlsim again -- my standard test
case for triggering the oom killer ;)

Let's not forget that oom killing (when it works) is a last resort,
something we do only if we have to to avoid a panic. Too often at
present the machine just doesn't know what to do, runs around confused
and makes things worse by shooting its own leg off. Which is pretty much
a real-world definition of panicking*. Let's at least try to avoid
causing permanent damage, such as killing off sshd.

[ * I just looked it up: "of, relating to, or resembling the mental or
emotional state believed induced by the god Pan". Cool ]

Regards,
Chris R.

2004-11-18 10:02:23

by Werner Almesberger

[permalink] [raw]
Subject: Re: [PATCH] Remove OOM killer from try_to_free_pages / all_unreclaimable braindamage

Chris Ross wrote:
> All I'm advocating is just swapping something out if possible
> instead.

Yes, but this only works if a) your system can make progress
towards lowering its memory needs without the process(es) you've
picked for swapping, and b) these processes don't happen to be
something that cannot tolerate long suspension, and c) the total
memory needs are such that they can be better satisfied after
these processes have been swapped out.

Examples where this isn't the case: a) if you swap out your
housekeeping cron job, the system will just sit idle, then you
swap it in again after a few minutes, and the agony repeats.
b) if you swap out my X server while I'm sitting at the machine,
all you've done is force me to press the big red switch
manually. c) if there's a process with excessive memory demands
that can't be met anyway, it's better to end its misery quickly,
instead of spending a day thrashing.

So again, your automatic OOM kill^H^H^H^Hcounsellor doesn't only
have to follow a fixed policy, but it also has to sense what kind
of situation we're in.

A SIGSWAP would help with a) and b). In case a), the cron jobs
would signal anything that's not them. In case b), by definition,
I'd not be working when this happens. This can be assisted by
user detection heuristics as used in some batch distribution
systems. (Now we have a fairly complex user space already, with
lots of policy.) The usual "runaway process" heuristics can
probably take care of c).

> Too often at present the machine just doesn't know what to do,

See, that's exactly what I mean :-) So, why not just tell it ?
"Hey, things are going to get a little rough for a while. Why
don't you take a nap on that comfy swap disk while I clean up
the house ?"

> [ * I just looked it up: "of, relating to, or resembling the mental or
> emotional state believed induced by the god Pan". Cool ]

Hmm, you're suggesting we follow Morpheus instead of Pan then ?
And I always thought the OOM killer was more like Eris' work :-)

- Werner

--
_________________________________________________________________________
/ Werner Almesberger, Buenos Aires, Argentina [email protected] /
/_http://www.almesberger.net/____________________________________________/

2004-11-18 14:58:17

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH] Remove OOM killer from try_to_free_pages / all_unreclaimable braindamage

On Thu, 2004-11-18 at 07:01 -0300, Werner Almesberger wrote:
> > Too often at present the machine just doesn't know what to do,
>
> See, that's exactly what I mean :-) So, why not just tell it ?
> "Hey, things are going to get a little rough for a while. Why
> don't you take a nap on that comfy swap disk while I clean up
> the house ?"
>
> > [ * I just looked it up: "of, relating to, or resembling the mental or
> > emotional state believed induced by the god Pan". Cool ]
>
> Hmm, you're suggesting we follow Morpheus instead of Pan then ?
> And I always thought the OOM killer was more like Eris' work :-)

Hmm, what about embedded boxes without swap ? There you have only one
choice. Kill anything appropriate.

I tested Marcelo's, Andrea's and Andrew's changes and they all change the
trigger of the oom-killer to prevent the spurious unnecessary kills. But
in the case where an oom-kill is necessary I still need my modifications
to the whom-to-kill decision and to the oom-killer itself.

1. Not to kill the innocent and maybe important processes like sshd

This one is solved by taking the children into account.

2. To prevent overkill.

This still happens with all the modifications I have tested.
The scenario is that two processes request memory in an OOM situation.
The first caller to oom-kill makes room, and then the second one also
goes on to kill something.

The oom-killer must be protected against reentrancy and it must check
for the free memory level before finally doing the kill.
See my previous patch in this thread.

BTW, I found out that hackbench is a quite good replacement for my real
application in triggering these corner cases. Just increase the number
of processes until you reach the limit of the machine.

tglx


2004-11-18 15:14:23

by Chris Friesen

[permalink] [raw]
Subject: Re: [PATCH] Remove OOM killer from try_to_free_pages / all_unreclaimable braindamage

Thomas Gleixner wrote:

> Hmm, what about embedded boxes without swap ? There you have only one
> choice. Kill anything appropriate.

I worked on a project that took the opposite approach from the "I'm a suspect"
flag mentioned earlier. Processes could request immunity from the OOM killer as
long as they were under a specified memory usage. Critical apps were thus
protected as long as they were sane, while noncritical stuff could be killed at
will.

Chris

2004-11-18 21:18:49

by Werner Almesberger

[permalink] [raw]
Subject: Re: [PATCH] Remove OOM killer from try_to_free_pages / all_unreclaimable braindamage

Bodo Eggert wrote:
> You'll have some precompiled binaries causing trouble, while other
> precompiled binaries will be killed while you want them to stay alive.

That's why you could use a wrapper.

> Sometimes you'll have the same binary (e.g. perl or java) running a
> "notme"-task like watching the log for intrusion while at the same time
> processing a very large image.

The wrapper wouldn't have to be part of the regular execution;
you'd only use it if you really needed it, much like nice, chroot,
etc.

> The best solution I can think of is attaching a kill priority (similar to
> the nice value). Before killing, this value would be added to lg_2(memsize),
> and the least desirable process would "win", even if it's sshd running wild.

I'm extremely sceptical about solutions that require the user to
quantify things. In the world of QoS, if you give users a knob
to play with, they'll stare at it in confusion, and ask for the
"faster" button. I don't think the OOM case is much different.

A "victim" (or a "precious") flag has the advantage that the user
doesn't need to estimate peak demands, but still doesn't depend
solely on the verdict of some arcane algorithm working behind
the scenes.

> For the thrashing problem: I like the idea of sending a signal to stop the
> process, but it should rather be a request to stop that can be caught by
> the process.

Good idea. That would also help with the problem of browsers
immediately asking to be brought back to life, so that they can
spin the banner ads some more.

- Werner

--
_________________________________________________________________________
/ Werner Almesberger, Buenos Aires, Argentina [email protected] /
/_http://www.almesberger.net/____________________________________________/

2004-11-18 23:41:33

by Bodo Eggert

[permalink] [raw]
Subject: Re: [PATCH] Remove OOM killer from try_to_free_pages / all_unreclaimable braindamage

Werner Almesberger wrote:

> A process could declare itself as usual suspect. This would then be
> recorded as a per-task flag, to be inherited by children. Now, one
> could write a launcher like this:

You'll have some precompiled binaries causing trouble, while other
precompiled binaries will be killed while you want them to stay alive.
Sometimes you'll have the same binary (e.g. perl or java) running a
"notme"-task like watching the log for intrusion while at the same time
processing a very large image.

The best solution I can think of is attaching a kill priority (similar to
the nice value). Before killing, this value would be added to lg_2(memsize),
and the least desirable process would "win", even if it's sshd running wild.



For the thrashing problem: I like the idea of sending a signal to stop the
process, but it should rather be a request to stop that can be caught by
the process. A SETI-like task could save its workset and free the memory
instead, a browser would discard its memory cache and pause loading
images for the sites etc.
--
The newest and least experienced soldier will usually win the Congressional
Medal Of Honor.

Friß, Spammer: [email protected] [email protected]

2004-11-19 00:20:24

by Andreas Dilger

[permalink] [raw]
Subject: Re: [PATCH] Remove OOM killer from try_to_free_pages / all_unreclaimable braindamage

On Nov 18, 2004 21:48 +0100, Bodo Eggert wrote:
> You'll have some precompiled binaries causing trouble, while other
> precompiled binaries will be killed while you want them to stay alive.
> Sometimes you'll have the same binary (e.g. perl or java) running a
> "notme"-task like watching the log for intrusion while at the same time
> processing a very large image.
>
> The best solution I can think of is attaching a kill priority (similar to
> the nice value). Before killing, this value would be added to lg_2(memsize),
> and the least desirable process would "win", even if it's sshd running wild.
>
> For the thrashing problem: I like the idea of sending a signal to stop the
> process, but it should rather be a request to stop that can be caught by
> the process. A SETI-like task could save its workset and free the memory
> instead, a browser would discard its memory cache and pause loading
> images for the sites etc.

Sounds familiar. AIX has had this for years. "SIGDANGER" can be
caught by applications which care to register a handler, but is
otherwise fatal. Usage scenarios are exactly as proposed above.

Cheers, Andreas
--
Andreas Dilger
http://sourceforge.net/projects/ext2resize/
http://members.shaw.ca/adilger/ http://members.shaw.ca/golinux/



2004-11-19 01:10:24

by Bodo Eggert

[permalink] [raw]
Subject: Re: [PATCH] Remove OOM killer from try_to_free_pages / all_unreclaimable braindamage

On Thu, 18 Nov 2004, Werner Almesberger wrote:
> Bodo Eggert wrote:

> > You'll have some precompiled binaries causing trouble, while other
> > precompiled binaries will be killed while you want them to stay alive.
>
> That's why you could use a wrapper.

That's why I thought about a nice-like value.

> > The best solution I can think of is attaching a kill priority (similar to
> > the nice value). Before killing, this value would be added to lg_2(memsize),
> > and the least desirable process would "win", even if it's sshd running wild.
>
> I'm extremely sceptical about solutions that require the user to
> quantify things. In the world of QoS, if you give users a knob
> to play with, they'll stare at it in confusion, and ask for the
> "faster" button. I don't think the OOM case is much different.

> A "victim" (or a "precious") flag has the advantage that the user
> doesn't need to estimate peak demands, but still doesn't depend
> solely on the verdict of some arcane algorithm working behind
> the scenes.

There would usually be no need to tune. The OOM killer would select the
biggest process by default, and that would almost certainly be the best
decision.

If it isn't, you'll either need to flag all suspicious tasks before they
run and prevent your users from avoiding your wrappers, or you'll need to
flag the few important "notme"s. But what happens if your "notme" is dhcpd
running wild (as it just happened to me because it didn't handle
"permission denied" correctly)? Bye bye userland?

With the "notme"-level, your default system will kill the biggest process.
This will usually be the best choice, but on some systems it would
kill the DB engine with the login table instead. You'll need to tune here,
since no OS will know when your DB is just big and when it's eating
memory. In this case, you would adjust only the database process, and
you'd choose a small value. For a 2-GB process to "lose" against a 500 MB
mozilla, you'll need a "notme"-value ("OOM adjustment"?) of -2; -4 would
be almost overkill to the system, and -sizeof(void*)*8 will create a
privileged class of processes that will only be killed if all other
processes are killed (I repeat: you don't want that).

OTOH, there is nothing stopping you from creating goat processes by
running your web-browser at +127, if you like it being killed as soon as
your vi starts eating memory. (Even here I'd limit the value to +2.)


Summary:

The "notme"-value will autotune, while the "victim"-flag needs to be
adjusted on every system. In rare cases, "notme" will need to be adjusted
for large daemons, and even in those cases, it won't need much adjustment.
The "notme"-value _is_ more complicated, but you only need to count to 5.

The "victim"-flag can be circumvented by users, while the "notme"-value
will be as safe as "nice".

The "precious"-flag does not protect against mad processes, while the
"notme"-value can be adjusted to match your specific need.


Further considerations:

Both systems will need an additional per-process value, but they can share
their space with existing flags.
(Even the 8 bits per value I assumed above may be overkill, and they aren't
supposed to be touched often (set, fork and OOM only).)

You'll usually want to kill forked processes before the parent (e.g.
inetd), and legacy applications won't adjust the setting for you.
Therefore you'll need a second value for the children.

The adjustment to the OOM killer for "victim" is a flag recording the flag
status of the last candidate found, and it could skip calculating the
memory size if there is an already-flagged, better victim.

"notme" will require a level (lg_2()+"notme") instead of the flag, and it
would usurally have to calculate the memory size, especially if the
adjustment is limited to <=5 bit.
-or-
You'd use a larger data type for the calculated memory size and bit-shift
the result by (adjustment+minval). I guess it's cheaper, but it only
works for small adjustments (5 or 6 bits, depending on arch).

Both should be easy to implement if somebody else is doing the work.-)



[0] This can be difficult: Shall we kill process A even if most memory is
shared and we don't gain much? What if it's a large forkbomb?
:(){:|:};:
--
'.... now touch these wires to your tongue!'