2001-11-06 13:00:36

by Marcelo Tosatti

Subject: out_of_memory() heuristic broken for different mem configurations



Hi people,

While testing the 2.4.14 VM on the 16GB testbox I've been able to
make the OOM killer fail to trigger correctly. The usual workload: lots
of fillmem processes.

Looking at out_of_memory() I've found out that we will only kill a task if
we happen to call out_of_memory() ten times in one second.

That is completely dependent on system load: taking into account that we
have lots of processes inside try_to_free_pages() and the LRU list is
insanely HUGE (almost all pages were on the inactive list), I think this
"ten times in one second" rule just does not work.
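For reference, the heuristic under discussion works roughly like this: a userspace sketch of the 2.4-era out_of_memory() timing logic, with the swap-space test and kill throttling omitted; the HZ value and the simulated `jiffies` driver are assumptions for illustration only.

```c
#include <assert.h>

#define HZ 100                 /* clock ticks per second, typical for 2.4 on x86 */

static unsigned long jiffies;  /* simulated kernel clock */
static unsigned long first, last, count;
static int kills;              /* times we would have called oom_kill() */

/* Sketch of the 2.4-era check: kill only after ten calls, none more
 * than 5 seconds apart, spread over at least one second. */
static void out_of_memory(void)
{
    unsigned long now = jiffies;
    unsigned long since = now - last;

    last = now;
    if (since > 5 * HZ)
        goto reset;            /* calls too far apart: not considered OOM */

    since = now - first;
    if (since < HZ)
        return;                /* failing for less than one second */

    if (++count < 10)
        return;                /* too few failures so far */

    kills++;                   /* the real kernel calls oom_kill() here */

reset:
    first = now;
    count = 0;
}
```

Under heavy load on a big box, nothing guarantees the calls line up with these windows, which is exactly the failure reported above.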

Well, yes, it seems to be just a wrong magic number for this
setup/workload.

Linus, any suggestion to "fix" that ?

/proc tunable (eeek) ?


vmstat output:

13 7 2 3812816 2884 228 7648 42 28454 58 28430 295 1070 0 70 30
10 12 2 3880664 2640 204 7644 24 27754 28 27766 277 14650 0 92 8
17 5 1 3943404 2964 228 7656 22 31258 38 31276 303 10067 0 80 20
18 3 1 3998948 2708 220 7652 32 19434 46 19422 224 857 0 83 17

(end of swap)

23 1 1 4032960 2892 232 7648 16 24246 22 24252 235 997 0 72 28
22 1 1 4032960 2872 232 7648 0 0 0 0 106 13 0 99 0
23 0 2 4032960 2764 232 7648 0 0 0 0 116 11 0 100 0
21 0 1 4032960 2856 232 7648 0 0 0 0 122 45 0 100 0
21 0 1 4032960 2848 232 7648 0 0 0 0 117 21 0 100 0
21 0 1 4032960 2588 232 7648 38 0 38 0 118 41 0 100 0
21 0 1 4032960 2584 232 7648 0 0 0 0 123 10 0 100 0



2001-11-06 13:32:04

by Stephan von Krawczynski

Subject: Re: out_of_memory() heuristic broken for different mem configurations

On Tue, 6 Nov 2001 09:40:51 -0200 (BRST) Marcelo Tosatti
<[email protected]> wrote:

> Well, yes, it seems to be just a wrong magic number for this
> setup/workload.

Well, the first time I read the code I thought this would happen. Simply think
of a _slow_ system with _lots_ of mem. Chances are high you cannot satisfy the
seconds rule.

> Linus, any suggestion to "fix" that ?

How about this really stupid idea: oom means allocs fail, so why not simply
count failed 0-order allocs; if one succeeds, reset the counter. If a page is
freed, reset the counter. If the counter reaches <new magic number> then you're
oom. No timing involved, which means you can have as much mem or as slow a host
as you like. It isn't even really interesting if you have swap or not, because a
failed 0-order alloc tells you that whatever mem you have, there is surely not
much left. I'd try about 100 as the magic number.
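A minimal sketch of that scheme; the function names and the threshold are hypothetical, and in a real kernel these would be called from the allocator failure path and from the page-free path respectively.

```c
#include <assert.h>

#define OOM_FAIL_LIMIT 100          /* the proposed magic number */

static unsigned int failed_order0;  /* consecutive failed 0-order allocations */

/* Call this when a 0-order allocation fails; returns nonzero once
 * the system should be declared OOM. */
static int oom_note_failure(void)
{
    return ++failed_order0 >= OOM_FAIL_LIMIT;
}

/* Call this when any allocation succeeds or any page is freed:
 * the system made progress, so it is not OOM. */
static void oom_note_progress(void)
{
    failed_order0 = 0;
}
```

No timers are involved, so the behaviour is independent of machine speed and memory size.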

> /proc tunable (eeek) ?

NoNoNo, please don't do that!

Regards,
Stephan

2001-11-06 13:41:34

by Marcelo Tosatti

Subject: Re: out_of_memory() heuristic broken for different mem configurations



On Tue, 6 Nov 2001, Stephan von Krawczynski wrote:

> On Tue, 6 Nov 2001 09:40:51 -0200 (BRST) Marcelo Tosatti
> <[email protected]> wrote:
>
> > Well, yes, it seems to be just a wrong magic number for this
> > setup/workload.
>
> Well, first time I read the code I thought that this will happen. Simply think
> of a _slow_ system with _lots_ of mem. Chances are high you cannot match the
> seconds-rule.
>
> > Linus, any suggestion to "fix" that ?
>
> How about this really stupid idea: oom means allocs fail, so why not simply
> count failed 0-order allocs, if one succeeds, reset counter. If a page is freed
> reset counter. If counter reaches <new magic number> then you're oom. No timing
> involved, which means you can have as much mem or as slow host as you like.

> It isn't even really interesting, if you have swap or not, because a
> failed 0-order alloc tells you whatever mem you have, there is surely
> not much left.

Wrong. If we have swap available, we are able to swap out anonymous data,
so we are _not_ OOM. This is an important point in this whole OOM killer
nightmare.

Keep in mind that we don't want to destroy anonymous data from userspace
(OOM kill).

> I'd try about 100 as magic number.

I think your suggestion will work well in practice (except that we have to
check the swap).

I'll try that later.

> > /proc tunable (eeek) ?
>
> NoNoNo, please don't do that!

Note that even if your suggestion works, we may want to make the magic
value /proc tunable.

The thing is that the point where tasks should be killed is also an admin
decision, not a complete kernel decision.

2001-11-06 14:22:14

by Stephan von Krawczynski

Subject: Re: out_of_memory() heuristic broken for different mem configurations

On Tue, 6 Nov 2001 10:22:02 -0200 (BRST) Marcelo Tosatti
<[email protected]> wrote:

> > How about this really stupid idea: oom means allocs fail, so why not simply
> > count failed 0-order allocs, if one succeeds, reset counter. If a page is freed
> > reset counter. If counter reaches <new magic number> then you're oom. No timing
> > involved, which means you can have as much mem or as slow host as you like.
>
> > It isn't even really interesting, if you have swap or not, because a
> > failed 0-order alloc tells you whatever mem you have, there is surely
> > not much left.
>
> Wrong. If we have swap available, we are able to swapout anonymous data,
> so we are _not_ OOM. This is an important point on this whole OOM killer
> nightmare.

I guess this is not the complete picture, either. There may as well be a
situation where there is nothing left to swap out, but swap space is still
available. Either way you would be deadlocked in this situation. The only thing
you can see is the failing allocs (and of course no frees). You will never enter
OOM state if you make "available swap" a negative trigger. It _sounds_ good,
but _is_ wrong.

> Keep in mind that we don't want to destroy anonymous data from userspace
> (OOM kill).
>
> > I'd try about 100 as magic number.
>
> I think your suggestion will work well in practice (except that we have to
> check the swap).
>
> I'll try that later.
>
> > > /proc tunable (eeek) ?
> >
> > NoNoNo, please don't do that!
>
> Note that even if your suggestion works, we may want to make the magic
> value /proc tunable.

Well, in fact I really think my suggestion may be better than the current
implementation, but I do believe it is not quite like "42". Whenever you
hear someone talk about magic numbers/limits, keep in mind it's only because he
doesn't have the _complete_ answer to the question. I'm in no way different. I
don't like my magic number, only I have no better answer.
>
> The thing is that the point where tasks should be killed is also an admin
> decision, not a complete kernel decision.

I completely disagree. There can only be two completely independent ways for
this oom stuff:
1) the kernel knows
2) the admin knows

You suggest 2), but then you have to take a totally different approach to the
problem. Because if the admin knows, then it's very likely that he even knows
_which_ application should be killed, or even better, which should _not_ be
killed.
He (the admin) would like to have an option to choose this, for sure. You cannot
really solve this _inside_ the kernel, I guess. I think this would better be
solved as an oom-daemon with a config file in /etc, where you tell it,
"whatever is bad, don't kill google". This would be Ben's config file :-)

Regards,
Stephan


2001-11-06 15:22:19

by Marcelo Tosatti

Subject: Re: out_of_memory() heuristic broken for different mem configurations



On Tue, 6 Nov 2001, Stephan von Krawczynski wrote:

> On Tue, 6 Nov 2001 10:22:02 -0200 (BRST) Marcelo Tosatti
> <[email protected]> wrote:
>
> > > How about this really stupid idea: oom means allocs fail, so why not simply
> > > count failed 0-order allocs, if one succeeds, reset counter. If a page is freed
> > > reset counter. If counter reaches <new magic number> then you're oom. No timing
> > > involved, which means you can have as much mem or as slow host as you like.
> >
> > > It isn't even really interesting, if you have swap or not, because a
> > > failed 0-order alloc tells you whatever mem you have, there is surely
> > > not much left.
> >
> > Wrong. If we have swap available, we are able to swapout anonymous data,
> > so we are _not_ OOM. This is an important point on this whole OOM killer
> > nightmare.
>
> I guess this is not the complete picture, either. There may as well be a
> situation, where there is nothing to swap out left, but still swap-space
> available. Anyway you would be deadlocked in this situation.

Memory used by userspace tasks is either cache or anonymous memory.

If there is no anonymous memory to swap out (the case you just described),
and we are in an OOM condition, there _has_ to be cache available --- clean
or dirty.

We can easily drop clean cache from memory. Dirty cache has to be
written out (cleaned) first, then it can be easily freed.

So having no anonymous memory available to swap out _and_ an OOM condition
means that we should either drop clean cache, or clean dirty cache and then
drop it: the OOM killer has nothing to do with that.

It all depends on which kind of pressure you have on the system.
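The decision tree being argued here can be summarized schematically; this is not kernel code, and the function and argument names are made up for illustration.

```c
#include <assert.h>
#include <string.h>

/* Schematic of the reclaim choice under memory pressure: which action
 * applies, given what kinds of reclaimable memory are left. */
static const char *pressure_action(int anon_pages, int swap_space,
                                   int clean_cache, int dirty_cache)
{
    if (anon_pages && swap_space)
        return "swap out anonymous memory";     /* not OOM yet */
    if (clean_cache)
        return "drop clean cache";
    if (dirty_cache)
        return "write back dirty cache, then drop it";
    return "OOM: kill a task";                  /* nothing reclaimable left */
}
```

Only the last branch belongs to the OOM killer; everything before it is ordinary reclaim.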


> can see is the failing allocs (and of course no frees). You will never enter
> oom-state, if you make "available swap" a negative-trigger. It _sounds_ good,
> but _is_ wrong.

Read the above.

>
> > Keep in mind that we don't want to destroy anonymous data from userspace
> > (OOM kill).
> >
> > > I'd try about 100 as magic number.
> >
> > I think your suggestion will work well in practice (except that we have to
> > check the swap).
> >
> > I'll try that later.
> >
> > > > /proc tunable (eeek) ?
> > >
> > > NoNoNo, please don't do that!
> >
> > Note that even if your suggestion works, we may want to make the magic
> > value /proc tunable.
>
> Well, in fact I really think my suggestion may be better than the current
> implementation, but do believe that it is not quite like "42". Whenever you
> hear someone talk about magic numbers/limits, keep in mind its only because he
> doesn't have the _complete_ answer to the question. I'm in no way different. I
> don't like my magic number, only I have no better answer.
>
> > The thing is that the point where tasks should be killed is also an admin
> > decision, not a complete kernel decision.
>
> I completely disagree. There can only be two completely independent ways for
> this oom stuff:
> 1) the kernel knows
> 2) the admin knows
>
> You suggest 2), but then you have to make a totally different approach to the
> problem.

Please read what I said again:

"The thing is that the point where tasks should be killed is also an admin
decision, not a complete kernel decision."

The kernel knows when the system runs out of memory. Period.

If we are completely out of memory, the kernel is going to choose a
userspace task to kill and free its memory, otherwise the system can't make
any progress anymore.

> Because if admin knows, then it's very likely, that he even knows
> _which_ application should be killed, or even better, which should
> _not_ be killed.

Exactly.

> He (the admin) would like to have an option to choose this for sure.
> You cannot really solve this idea _inside_ the kernel, I guess. I
> think this would better be solved as an oom-daemon with a config-file
> in /etc, where you tell him, "whatever is bad, don't kill google".
> This would be Bens' config file :-)

Or something similar, yes.



2001-11-06 15:29:19

by Linus Torvalds

Subject: Re: out_of_memory() heuristic broken for different mem configurations


On Tue, 6 Nov 2001, Marcelo Tosatti wrote:
>
> Looking at out_of_memory() I've found out that we will only kill a task if
> we happen to call out_of_memory() ten times in one second.

It should be "10 times in at _least_ one second, and at most 5 seconds
apart", but yes, I can imagine that it doesn't work very well if you have
tons of memory.

The problem became more pronounced when we started freeing the swap cache:
it seems that allows the machine to shuffle memory around so efficiently
that I've seen it go for half a minute with vmstat claiming zero IO, and
yet very few out-of-memory messages - I don't know why shrink_cache()
_claims_ success under those circumstances, but I've seen it myself.
Shrink_cache _should_ only return success when it has actually dropped a
page that needs re-loading, but..

> (end of swap)
>
> 23 1 1 4032960 2892 232 7648 16 24246 22 24252 235 997 0 72 28
> 22 1 1 4032960 2872 232 7648 0 0 0 0 106 13 0 99 0
> 23 0 2 4032960 2764 232 7648 0 0 0 0 116 11 0 100 0
> 21 0 1 4032960 2856 232 7648 0 0 0 0 122 45 0 100 0
> 21 0 1 4032960 2848 232 7648 0 0 0 0 117 21 0 100 0
> 21 0 1 4032960 2588 232 7648 38 0 38 0 118 41 0 100 0
> 21 0 1 4032960 2584 232 7648 0 0 0 0 123 10 0 100 0

Note how you also go for seconds with no IO and no shrinking of the
caches, while shrink_cache() is apparently happy (and no, it does not take
several seconds to traverse even a 16GB inactive queue, there's something
else going on)

With the more aggressive max_mapped, the oom failure count could be
dropped to something smaller, as a false positive from shrink_caches
should be fairly rare. I don't think it needs to be tunable on memory
size, I just didn't even try any other values on my machines (I noticed
that the old values were too high once max_mapped was upped and the swap
cache reclaiming was re-done, but I didn't try if five seconds and ten
failures was any better than 10 seconds and five failures, for example)

Linus

2001-11-06 15:39:39

by Marcelo Tosatti

Subject: Re: out_of_memory() heuristic broken for different mem configurations



On Tue, 6 Nov 2001, Linus Torvalds wrote:

>
> On Tue, 6 Nov 2001, Marcelo Tosatti wrote:
> >
> > Looking at out_of_memory() I've found out that we will only kill a task if
> > we happen to call out_of_memory() ten times in one second.
>
> It should be "10 times in at _least_ one second, and at most 5 seconds
> apart", but yes, I can imagine that it doesn't work very well if you have
> tons of memory.
>
> The problem became more pronounced when we started freeing the swap cache:
> it seems that allows the machine to shuffle memory around so efficiently
> that I've seen it go for half a minute with vmstat claiming zero IO, and
> yet very few out-of-memory messages - I don't know why shrink_cache()
> _claims_ success under those circumstances, but I've seen it myself.
> Shrink_cache _should_ only return success when it has actually dropped a
> page that needs re-loading, but..
>
> > (end of swap)
> >
> > 23 1 1 4032960 2892 232 7648 16 24246 22 24252 235 997 0 72 28
> > 22 1 1 4032960 2872 232 7648 0 0 0 0 106 13 0 99 0
> > 23 0 2 4032960 2764 232 7648 0 0 0 0 116 11 0 100 0
> > 21 0 1 4032960 2856 232 7648 0 0 0 0 122 45 0 100 0
> > 21 0 1 4032960 2848 232 7648 0 0 0 0 117 21 0 100 0
> > 21 0 1 4032960 2588 232 7648 38 0 38 0 118 41 0 100 0
> > 21 0 1 4032960 2584 232 7648 0 0 0 0 123 10 0 100 0
>
> Note how you also go for seconds with no IO and no shrinking of the
> caches, while shrink_cache() is apparently happy (and no, it does not take
> several seconds to traverse even a 16GB inactive queue, there's something
> else going on)
>
> With the more aggressive max_mapped, the oom failure count could be
> dropped to something smaller, as a false positive from shrink_caches
> should be fairly rare. I don't think it needs to be tunable on memory
> size, I just didn't even try any other values on my machines (I noticed
> that the old values were too high once max_mapped was upped and the swap
> cache reclaiming was re-done, but I didn't try if five seconds and ten
> failures was any better than 10 seconds and five failures, for example)

Ok, I'll take a careful look at shrink_cache()/try_to_free_pages() path
later and find out "saner" magic numbers for big/small memory workloads.

2001-11-06 16:08:34

by Stephan von Krawczynski

Subject: Re: out_of_memory() heuristic broken for different mem configurations

On Tue, 6 Nov 2001 07:25:40 -0800 (PST) Linus Torvalds <[email protected]>
wrote:

> Note how you also go for seconds with no IO and no shrinking of the
> caches, while shrink_cache() is apparently happy (and no, it does not take
> several seconds to traverse even a 16GB inactive queue, there's something
> else going on)

Did you time it? There are a lot of things going on in the shrink_cache loop,
including swap_out, wait_on_page, locks, ...
It's not really a simple traversal of a queue.

2001-11-06 18:43:10

by Marcelo Tosatti

Subject: Re: out_of_memory() heuristic broken for different mem configurations



On Tue, 6 Nov 2001, Marcelo Tosatti wrote:

>
> > With the more aggressive max_mapped, the oom failure count could be
> > dropped to something smaller, as a false positive from shrink_caches
> > should be fairly rare. I don't think it needs to be tunable on memory
> > size, I just didn't even try any other values on my machines (I noticed
> > that the old values were too high once max_mapped was upped and the swap
> > cache reclaiming was re-done, but I didn't try if five seconds and ten
> > failures was any better than 10 seconds and five failures, for example)
>
> Ok, I'll take a careful look at shrink_cache()/try_to_free_pages() path
> later and find out "saner" magic numbers for big/small memory workloads.

Ok, I found it. The problem is that swap_out() tries to scan the _whole_
address space looking for ptes to deactivate, and try_to_swap_out() does
not return a value indicating the lack of swap space, so on each
swap_out() call we simply loop around the whole VM when there is no swap
space available.

Here goes the tested fix.


--- linux.orig/mm/vmscan.c Sun Nov 4 22:54:44 2001
+++ linux/mm/vmscan.c Tue Nov 6 16:06:08 2001
@@ -36,7 +36,8 @@
/*
* The swap-out function returns 1 if it successfully
* scanned all the pages it was asked to (`count').
- * It returns zero if it couldn't do anything,
+ * It returns zero if it couldn't free the given pte or -1
+ * if there was no swap space left.
*
* rss may decrease because pages are shared, but this
* doesn't count as having freed a page.
@@ -142,7 +143,7 @@
/* No swap space left */
set_pte(page_table, pte);
UnlockPage(page);
- return 0;
+ return -1;
}

/* mm->page_table_lock is held. mmap_sem is not held */
@@ -170,7 +171,12 @@
struct page *page = pte_page(*pte);

if (VALID_PAGE(page) && !PageReserved(page)) {
- count -= try_to_swap_out(mm, vma, address, pte, page, classzone);
+ int ret = try_to_swap_out(mm, vma, address, pte, page, classzone);
+ if (ret < 0)
+ return ret;
+
+ count -= ret;
+
if (!count) {
address += PAGE_SIZE;
break;
@@ -205,7 +211,11 @@
end = pgd_end;

do {
- count = swap_out_pmd(mm, vma, pmd, address, end, count, classzone);
+ int ret = swap_out_pmd(mm, vma, pmd, address, end, count, classzone);
+
+ if (ret < 0)
+ return ret;
+ count = ret;
if (!count)
break;
address = (address + PMD_SIZE) & PMD_MASK;
@@ -230,7 +240,10 @@
if (address >= end)
BUG();
do {
- count = swap_out_pgd(mm, vma, pgdir, address, end, count, classzone);
+ int ret = swap_out_pgd(mm, vma, pgdir, address, end, count, classzone);
+ if (ret < 0)
+ return ret;
+ count = ret;
if (!count)
break;
address = (address + PGDIR_SIZE) & PGDIR_MASK;
@@ -287,7 +300,7 @@
static int FASTCALL(swap_out(unsigned int priority, unsigned int gfp_mask, zone_t * classzone));
static int swap_out(unsigned int priority, unsigned int gfp_mask, zone_t * classzone)
{
- int counter, nr_pages = SWAP_CLUSTER_MAX;
+ int counter, nr_pages = SWAP_CLUSTER_MAX, ret;
struct mm_struct *mm;

counter = mmlist_nr;
@@ -311,9 +324,15 @@
atomic_inc(&mm->mm_users);
spin_unlock(&mmlist_lock);

- nr_pages = swap_out_mm(mm, nr_pages, &counter, classzone);
+ ret = swap_out_mm(mm, nr_pages, &counter, classzone);

mmput(mm);
+
+ /* No more swap space ? */
+ if (ret < 0)
+ return nr_pages;
+
+ nr_pages = ret;

if (!nr_pages)
return 1;