2003-07-10 11:00:26

by Miquel van Smoorenburg

Subject: 2.5.74-mm3 OOM killer fubared ?

I was running 2.5.74 on our newsfeeder box for a day without
problems (and 2.5.72-mm2 for weeks before that).

Now with 2.5.74-mm3 (booted 11 hours ago) it keeps killing processes
for no apparent reason:

Jul 10 11:59:01 quantum kernel: Out of Memory: Killed process 9952 (innfeed).
Jul 10 12:25:48 quantum kernel: Out of Memory: Killed process 10498 (innfeed).
Jul 10 12:25:48 quantum kernel: Fixed up OOM kill of mm-less task
Jul 10 12:45:41 quantum kernel: Out of Memory: Killed process 11894 (innfeed).
Jul 10 12:47:14 quantum kernel: Out of Memory: Killed process 13128 (innfeed).
Jul 10 12:53:09 quantum kernel: Out of Memory: Killed process 13221 (innfeed).
Jul 10 12:55:12 quantum kernel: Out of Memory: Killed process 13649 (innfeed).

I checked, seconds before the last kill:

USER       PID %CPU %MEM   VSZ  RSS TTY      STAT START   TIME COMMAND
news     13649  0.2  0.1 60368 1108 ?        SN   12:53   0:00 innfeed -y

# free
             total       used       free     shared    buffers     cached
Mem:       1035212    1030104       5108          0     548208     307148
-/+ buffers/cache:     174748     860464
Swap:       996020      41708     954312

Enough memory free, no problems at all .. yet every few minutes
the OOM killer kills one of my innfeed processes.

I notice that in -mm3 this was deleted relative to -vanilla:

-
-	/*
-	 * Enough swap space left? Not OOM.
-	 */
-	if (nr_swap_pages > 0)
-		return;

.. is that what causes this ? In any case, that shouldn't even matter -
there's plenty of memory in this box, all buffers and cached, but that
should be easily freed ..

Related mm question - this box is a news server, which does a lot
of streaming I/O, and also keeps a history database open. I have the
idea that the streaming I/O evicts the history database hash and
index file caches from memory, which I do not want. Any chance of
a control on a file descriptor that tells it how persistent to be
in caching file data ? E.g. a sort of "nice" for the cache, so that
I could say that streaming data may be flushed from buffers/cache
earlier than other data (where the other data would be the
database files) ?

Mike.


2003-07-10 11:12:32

by William Lee Irwin III

Subject: Re: 2.5.74-mm3 OOM killer fubared ?

On Thu, Jul 10, 2003 at 11:14:59AM +0000, Miquel van Smoorenburg wrote:
> Enough memory free, no problems at all .. yet every few minutes
> the OOM killer kills one of my innfeed processes.
> I notice that in -mm3 this was deleted relative to -vanilla:
>
> -
> - /*
> - * Enough swap space left? Not OOM.
> - */
> - if (nr_swap_pages > 0)
> - return;
> .. is that what causes this ? In any case, that shouldn't even matter -
> there's plenty of memory in this box, all buffers and cached, but that
> should be easily freed ..

This means we're calling into it more often than we should be.
Basically, we hit __alloc_pages() with __GFP_WAIT set, find nothing
we're allowed to touch, dive into try_to_free_pages(), fall through
scanning there, sleep in blk_congestion_wait(), wake up again, try
to shrink_slab(), find nothing there either, repeat that 11 more times,
and then fall through to out_of_memory()... and this happens at at
least 10Hz.

	since = now - lastkill;
	if (since < HZ*5)
		goto out_unlock;

try s/goto out_unlock/goto reset/ and let me know how it goes.
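
(For reference, the surrounding logic in mm/oom_kill.c looks roughly like
the sketch below; this is reconstructed from memory and simplified, so
names and exact thresholds may not match the -mm3 tree. In vanilla the
nr_swap_pages check quoted earlier sits at the top of this function. The
sketch shows where the lastkill check and the "reset" label sit.)

void out_of_memory(void)
{
	static unsigned long first, last, count, lastkill;
	unsigned long now, since;

	spin_lock(&oom_lock);
	now = jiffies;
	since = now - last;
	last = now;

	/* Long time since the last allocation failure?  Not OOM. */
	if (since > 5*HZ)
		goto reset;

	/* Failures haven't been sustained for a second yet?  Not OOM. */
	since = now - first;
	if (since < HZ)
		goto out_unlock;

	/* Only a handful of failures so far?  Not OOM either. */
	if (++count < 10)
		goto out_unlock;

	/* Killed something within the last 5 seconds?  Give that task a
	 * chance to exit before killing again.  (The suggestion above is
	 * to make this "goto reset", so that declining to kill also
	 * restarts the count.) */
	since = now - lastkill;
	if (since < HZ*5)
		goto out_unlock;

	/* Really out of memory: pick a victim. */
	lastkill = now;
	oom_kill();

reset:
	first = now;
	count = 0;
out_unlock:
	spin_unlock(&oom_lock);
}

With that one-line change, each kill has to be preceded by a fresh second
of sustained failures and another ten failed attempts, instead of the
accumulated count carrying straight over to the next kill.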


-- wli

2003-07-10 12:39:23

by Miquel van Smoorenburg

Subject: Re: 2.5.74-mm3 OOM killer fubared ?

In article <[email protected]>,
William Lee Irwin III <[email protected]> wrote:
>On Thu, Jul 10, 2003 at 11:14:59AM +0000, Miquel van Smoorenburg wrote:
>> Enough memory free, no problems at all .. yet every few minutes
>> the OOM killer kills one of my innfeed processes.
>> I notice that in -mm3 this was deleted relative to -vanilla:
>>
>> -
>> - /*
>> - * Enough swap space left? Not OOM.
>> - */
>> - if (nr_swap_pages > 0)
>> - return;
>> .. is that what causes this ? In any case, that shouldn't even matter -
>> there's plenty of memory in this box, all buffers and cached, but that
>> should be easily freed ..
>
>This means we're calling into it more often than we should be.
>Basically, we hit __alloc_pages() with __GFP_WAIT set, find nothing
>we're allowed to touch, dive into try_to_free_pages(), fall through
>scanning there, sleep in blk_congestion_wait(), wake up again, try
>to shrink_slab(), find nothing there either, repeat that 11 more times,
>and then fall through to out_of_memory()... and this happens at at
>least 10Hz.
>
> since = now - lastkill;
> if (since < HZ*5)
> goto out_unlock;
>
>try s/goto out_unlock/goto reset/ and let me know how it goes.

But that will only change the rate at which processes are killed,
not the fact that they are killed in the first place, right ?

As I said I've got plenty of memory free ... perhaps I need to tune
/proc/sys/vm because I've got so much streaming I/O ? Possibly,
there are too many dirty pages, so cleaning them out faster might
help (and let pdflush do it instead of my single-threaded app):

# cd /proc/sys/vm
# echo 200 > dirty_writeback_centisecs
# echo 60 > dirty_ratio
# echo 500 > dirty_expire_centisecs

I'll let it run like this for a while first.

If this helps, perhaps it means that the VM doesn't write out
dirty pages fast enough when low on memory ?

Mike.

2003-07-10 13:19:58

by Richard B. Johnson

Subject: Re: 2.5.74-mm3 OOM killer fubared ?

On Thu, 10 Jul 2003, Miquel van Smoorenburg wrote:

> In article <[email protected]>,
> William Lee Irwin III <[email protected]> wrote:
> >On Thu, Jul 10, 2003 at 11:14:59AM +0000, Miquel van Smoorenburg wrote:
> >> Enough memory free, no problems at all .. yet every few minutes
> >> the OOM killer kills one of my innfeed processes.
> >> I notice that in -mm3 this was deleted relative to -vanilla:
> >>
> >> -
> >> - /*
> >> - * Enough swap space left? Not OOM.
> >> - */
> >> - if (nr_swap_pages > 0)
> >> - return;
> >> .. is that what causes this ? In any case, that shouldn't even matter -
> >> there's plenty of memory in this box, all buffers and cached, but that
> >> should be easily freed ..
> >
> >This means we're calling into it more often than we should be.
> >Basically, we hit __alloc_pages() with __GFP_WAIT set, find nothing
> >we're allowed to touch, dive into try_to_free_pages(), fall through
> >scanning there, sleep in blk_congestion_wait(), wake up again, try
> >to shrink_slab(), find nothing there either, repeat that 11 more times,
> >and then fall through to out_of_memory()... and this happens at at
> >least 10Hz.
> >
> > since = now - lastkill;
> > if (since < HZ*5)
> > goto out_unlock;
> >
> >try s/goto out_unlock/goto reset/ and let me know how it goes.
>
> But that will only change the rate at which processes are killed,
> not the fact that they are killed in the first place, right ?
>
> As I said I've got plenty of memory free ... perhaps I need to tune
> /proc/sys/vm because I've got so much streaming I/O ? Possibly,
> there are too many dirty pages, so cleaning them out faster might
> help (and let pdflush do it instead of my single-threaded app)
>

The problem, as I see it, is that you can dirty pages 10-15 times
faster than they can be written to disk. So, you will always
have the possibility of an OOM situation as long as you are I/O
bound. FYI, you can read/write RAM at 1,000+ megabytes/second, but
you can only write to disk at 80 megabytes/second with the fastest
SCSI around, 40 megabytes/second with ATA, 20 megabytes/second with
IDE/DMA, 10 megabytes/second with PIOW, etc. There just aren't
any disks around that will run at RAM speeds so buffered I/O will
always result in full buffers if the I/O is sustained. To completely
solve the OOM situation requires throttling the generation of data.

It is only when the data generation rate is less than or equal to
the data storage rate that you can generate data forever.

A possibility may be to not return control to the writing process
(including swap) until the write completes, if RAM gets low. In
other words, stop buffering data in RAM in tight memory situations.
This forces all the tasks to wait and therefore slows down the
dirty-page and data generation rate to match the RAM available.
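
(An illustration of that idea from user space: a writer that blocks until
its data is on disk every few megabytes throttles itself to disk speed, so
dirty pages cannot pile up ahead of the disk. The file name, the 1 GB total
and the 4 MB sync interval below are made up for the example.)

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	char buf[65536];
	long long total, dirtied = 0;
	int fd = open("/var/tmp/stream.dat", O_WRONLY | O_CREAT | O_TRUNC, 0644);

	if (fd < 0) {
		perror("open");
		return 1;
	}
	memset(buf, 'x', sizeof(buf));

	for (total = 0; total < 1LL << 30; total += sizeof(buf)) {
		if (write(fd, buf, sizeof(buf)) != (ssize_t)sizeof(buf)) {
			perror("write");
			break;
		}
		dirtied += sizeof(buf);
		if (dirtied >= 4 * 1024 * 1024) {
			fdatasync(fd);	/* block until the data is on disk */
			dirtied = 0;
		}
	}
	close(fd);
	return 0;
}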


Cheers,
Dick Johnson
Penguin : Linux version 2.4.20 on an i686 machine (797.90 BogoMips).
Why is the government concerned about the lunatic fringe? Think about it.

2003-07-10 13:30:51

by Miquel van Smoorenburg

Subject: Re: 2.5.74-mm3 OOM killer fubared ?

In article <Pine.LNX.4.53.0307100918410.203@chaos>,
Richard B. Johnson <[email protected]> wrote:
>On Thu, 10 Jul 2003, Miquel van Smoorenburg wrote:
>
>> As I said I've got plenty of memory free ... perhaps I need to tune
>> /proc/sys/vm because I've got so much streaming I/O ? Possibly,
>> there are too many dirty pages, so cleaning them out faster might
>> help (and let pdflush do it instead of my single-threaded app)
>>

I did the tuning now, but it did not help much. Alas.

>The problem, as I see it, is that you can dirty pages 10-15 times
>faster than they can be written to disk. So, you will always
>have the possibility of an OOM situation as long as you are I/O
>bound. FYI, you can read/write RAM at 1,000+ megabytes/second, but
>you can only write to disk at 80 megabytes/second with the fastest
>SCSI around, 40 megabytes/second with ATA, 20 megabytes/second with
>IDE/DMA, 10 megabytes/second with PIOW, etc. There just aren't
>any disks around that will run at RAM speeds so buffered I/O will
>always result in full buffers if the I/O is sustained. To completely
>solve the OOM situation requires throttling the generation of data.

My disks are fast enough - under 2.5.74-vanilla, no problem.

>It is only when the data generation rate is less than or equal to
>the data storage rate that you can generate data forever.
>
>A possibility may be to not return control to the writing process
>(including swap), until the write completes if RAM gets low.

That's what can be tuned with /proc/sys/vm/dirty_ratio, right ?
If I understand Documentation/filesystems/proc.txt correctly.

Mike.

2003-07-10 14:18:14

by Rik van Riel

Subject: Re: 2.5.74-mm3 OOM killer fubared ?

On Thu, 10 Jul 2003, Richard B. Johnson wrote:

> The problem, as I see it, is that you can dirty pages 10-15 times
> faster than they can be written to disk. So, you will always
> have the possibility of an OOM situation as long as you are I/O
> bound.

That's not a problem at all. I think the VM never starts
pageout IO on a page that doesn't look like it'll become
freeable after the IO is done, so we simply shouldn't go
into the OOM killer as long as there are pages waiting on
pageout IO to finish.

Once we really are OOM we shouldn't have pages in pageout
IO.

This is what I am doing in current 2.4-rmap and it seems
to do the right thing in both the "heavy IO" and the "out
of memory" corner cases.
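
(Not the actual 2.4-rmap code, just the shape of the check being described;
nr_pages_under_writeback() is a made-up name for whatever counter tracks
in-flight pageout IO.)

	/* After a full scan that freed nothing: */
	if (nr_pages_under_writeback() > 0)
		return;		/* pageout IO we queued will free memory
				 * when it completes, so don't kill yet */
	out_of_memory();	/* nothing freed and nothing in flight */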

--
Great minds drink alike.

2003-07-10 15:41:05

by William Lee Irwin III

Subject: Re: 2.5.74-mm3 OOM killer fubared ?

In article <[email protected]>, William Lee Irwin III <[email protected]> wrote:
>> since = now - lastkill;
>> if (since < HZ*5)
>> goto out_unlock;
>> try s/goto out_unlock/goto reset/ and let me know how it goes.

On Thu, Jul 10, 2003 at 12:54:01PM +0000, Miquel van Smoorenburg wrote:
> But that will only change the rate at which processes are killed,
> not the fact that they are killed in the first place, right ?
> As I said I've got plenty of memory free ... perhaps I need to tune
> /proc/sys/vm because I've got so much streaming I/O ? Possibly,
> there are too many dirty pages, so cleaning them out faster might
> help (and let pdflush do it instead of my single-threaded app)

That's not what it's supposed to do. The thought behind it is that
out_of_memory()'s failure count is only reset once it's been 5s since the
last time it was invoked, so if it is being invoked regularly the kills
will keep happening on a regular basis after the first one. The change
comes a bit too late, since something's already been killed, but it
should make a larger difference than merely altering the rate.

-- wli

2003-07-10 17:28:52

by Mikulas Patocka

Subject: Re: 2.5.74-mm3 OOM killer fubared ?

> > The problem, as I see it, is that you can dirty pages 10-15 times
> > faster than they can be written to disk. So, you will always
> > have the possibility of an OOM situation as long as you are I/O
> > bound.
>
> That's not a problem at all. I think the VM never starts
> pageout IO on a page that doesn't look like it'll become
> freeable after the IO is done, so we simply shouldn't go
> into the OOM killer as long as there are pages waiting on
> pageout IO to finish.
>
> Once we really are OOM we shouldn't have pages in pageout
> IO.

What piece of code prevents that?
As I see it, OOM is triggered if no pages were freed in a few loops. It
doesn't care about pages that are already under writeback or pages for
which a write operation has just been started.

The only thing that prevents a total OOM when writing a lot of dirty pages
is blk_congestion_wait, but it's pretty unreliable because
1) it uses a timeout
2) not all page writes go through block devices (NFS, NBD etc.)
blk_congestion_wait may be used for improving performance, but not for
ensuring stability.
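
(A very rough sketch of the reclaim loop being criticized here, simplified
from memory; the names, arguments and pass count are approximate. The point
is that nothing in it looks at how much of the writeback it queued is still
outstanding; only the fixed-timeout congestion wait slows it down.)

	for (priority = DEF_PRIORITY; priority >= 0; priority--) {
		nr_reclaimed += shrink_caches(zones, priority, gfp_mask);
		if (nr_reclaimed >= nr_pages)
			return 1;			/* made progress, not OOM */
		wakeup_bdflush(total_scanned);		/* kick background writeout */
		blk_congestion_wait(WRITE, HZ/10);	/* 1) just a timed sleep */
		shrink_slab(total_scanned, gfp_mask);
	}
	/* Every pass came up empty, even though pages queued for writeout
	 * may still be in flight, and 2) NFS or NBD writeback never shows
	 * up as block-device congestion at all. */
	out_of_memory();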

> This is what I am doing in current 2.4-rmap and it seems
> to do the right thing in both the "heavy IO" and the "out
> of memory" corner cases.

I remember there was (and probably still is, only with smaller
probability) a similar bug in 2.2.

Mikulas

2003-07-10 17:30:12

by Andrew Morton

Subject: Re: 2.5.74-mm3 OOM killer fubared ?

"Miquel van Smoorenburg" <[email protected]> wrote:
>
> I was running 2.5.74 on our newsfeeder box for a day without
> problems (and 2.5.72-mm2 for weeks before that).

And how was 2.5.72-mm2 performing, generally?

> Now with 2.5.74-mm3 (booted 11 hours ago) it keeps killing processes
> for no apparent reason:
> ...
> I notice that in -mm3 this was deleted relative to -vanilla:
>
> -
> - /*
> - * Enough swap space left? Not OOM.
> - */
> - if (nr_swap_pages > 0)
> - return;
>
> .. is that what causes this?

Yes. That was a "hmm, I wonder what happens if I do this" patch. It's
interesting that we're going down that code path.

Is your INND configured to use the strange mmap(MAP_SHARED) of a blockdev
thing? That could explain the scanning hysteria.

> Related mm question - this box is a news server, which does a lot
> of streaming I/O, and also keeps a history database open. I have the
> idea that the streaming I/O evicts the history database hash and
> index file caches from memory, which I do not want. Any chance of
> a control on a file descriptor that tells it how persistent to be
> in caching file data ? E.g. a sort of "nice" for the cache, so that
> I could say that streaming data may be flushed from buffers/cache
> earlier than other data (where the other data would be the
> database files) ?

Makes sense. We can use posix_fadvise(POSIX_FADV_NOREUSE) as a hint to
tell the VM/VFS to throw away old pages. I'll take a look at that.
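
(For reference, this is roughly how an application would pass that hint
once the VM honours it; posix_fadvise() is the standard interface, but the
helper and its use here are only an illustration, and error checking of the
advice calls is omitted.)

#define _XOPEN_SOURCE 600	/* for posix_fadvise() */
#include <fcntl.h>

/* Open a streaming spool file and hint that its pages are read once and
 * need not be kept, so the cache can drop them before the history
 * database's pages. */
int open_streaming(const char *path)
{
	int fd = open(path, O_RDONLY);

	if (fd >= 0) {
		/* offset 0, len 0 means "the whole file" */
		posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);
		posix_fadvise(fd, 0, 0, POSIX_FADV_NOREUSE);
	}
	return fd;
}

(POSIX_FADV_DONTNEED, issued after the data has actually been read or
written, is the heavier after-the-fact variant of the same hint.)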

2003-07-10 17:32:33

by Mikulas Patocka

Subject: Re: 2.5.74-mm3 OOM killer fubared ?

> Enough memory free, no problems at all .. yet every few minutes
> the OOM killer kills one of my innfeed processes.
>
> I notice that in -mm3 this was deleted relative to -vanilla:
>
> -
> - /*
> - * Enough swap space left? Not OOM.
> - */
> - if (nr_swap_pages > 0)
> - return;
>
> .. is that what causes this ? In any case, that shouldn't even matter -
> there's plenty of memory in this box, all buffers and cached, but that
> should be easily freed ..

This is not the cause, it only makes the bug more obvious.

With that code it is wrong too: it would trigger exactly the same
unnecessary OOM kills if you didn't have swap, or if your swap was full
and a lot of memory was tied up as dirty filesystem cache.

Mikulas

2003-07-10 22:50:59

by Miquel van Smoorenburg

Subject: Re: 2.5.74-mm3 OOM killer fubared ?

In article <[email protected]>,
William Lee Irwin III <[email protected]> wrote:
>In article <[email protected]>, William Lee Irwin
>III <[email protected]> wrote:
>>> since = now - lastkill;
>>> if (since < HZ*5)
>>> goto out_unlock;
>>> try s/goto out_unlock/goto reset/ and let me know how it goes.
>
>On Thu, Jul 10, 2003 at 12:54:01PM +0000, Miquel van Smoorenburg wrote:
>> But that will only change the rate at which processes are killed,
>> not the fact that they are killed in the first place, right ?
>> As I said I've got plenty of memory free ... perhaps I need to tune
>> /proc/sys/vm because I've got so much streaming I/O ? Possibly,
>> there are too many dirty pages, so cleaning them out faster might
>> help (and let pdflush do it instead of my single-threaded app)
>
>That's not what it's supposed to do. The thought behind it is that since
>out_of_memory()'s count is not reset unless it's been 5s since the last
>time this was ever invoked, it will happen on a regular basis after the
>first kill if it is invoked regularly. It's actually a bit too late,
>since something's already been killed, but it should make a larger
>difference than merely altering the rate.

Well, that won't help in my case, as my problem is not that too many
processes are killed - it's just that every few minutes (sometimes
3 minutes, sometimes 30, sometimes an hour) an innocent process
gets killed (just one) with 2.5.74-mm3. And that did not happen
with 2.5.74 or 2.5.72-mm2.

Mike.

2003-07-10 23:17:22

by Miquel van Smoorenburg

Subject: Re: 2.5.74-mm3 OOM killer fubared ?

In article <[email protected]>,
Andrew Morton <[email protected]> wrote:
>"Miquel van Smoorenburg" <[email protected]> wrote:
>>
>> I was running 2.5.74 on our newsfeeder box for a day without
>> problems (and 2.5.72-mm2 for weeks before that).
>
>And how was 2.5.72-mm2 performing, generally?

Quite all right, but it had (known) ext3 problems so I went to
2.5.74. Linus fixed the issue with siimage yesterday, so that
was stable. No fun in that, so I went to -mm3 ;)

The VM appears to run better with 2.5 than with 2.4, but I do not
experience a real difference between 2.5.74-vanilla and -mm3.

The difference between 2.4.21 and 2.5.7x is that interactive response
under load is much better, and that the VM is better. With 2.4.21,
the nightly expire process (which rebuilds the history database) was
running for 3 days without making progress if the news server (INN)
was running at the same time, because its working set kept getting
evicted to swap. 2.5 behaves better - expire finished in a few hours.
Of course, now I stop the INN process before expire and restart it
afterwards, and the expire process runs in 5 minutes ... which simply
means I need more memory.

>> Now with 2.5.74-mm3 (booted 11 hours ago) it keeps killing processes
>> for no apparent reason:
>> ...
>> I notice that in -mm3 this was deleted relative to -vanilla:
>>
>> -
>> - /*
>> - * Enough swap space left? Not OOM.
>> - */
>> - if (nr_swap_pages > 0)
>> - return;
>>
>> .. is that what causes this?
>
>Yes. That was a "hmm, I wonder what happens if I do this" patch. It's
>interesting that we're going down that code path.

Yes, and it's really not happening all that often. For example:

Jul 10 12:25:48 quantum kernel: Fixed up OOM kill of mm-less task
Jul 10 12:45:41 quantum kernel: Out of Memory: Killed process 11894 (innfeed).
Jul 10 12:47:14 quantum kernel: Out of Memory: Killed process 13128 (innfeed).
Jul 10 12:53:09 quantum kernel: Out of Memory: Killed process 13221 (innfeed).
Jul 10 12:55:12 quantum kernel: Out of Memory: Killed process 13649 (innfeed).
Jul 10 15:18:24 quantum kernel: Out of Memory: Killed process 13754 (innfeed).
Jul 10 15:35:54 quantum kernel: Out of Memory: Killed process 22667 (innfeed).
Jul 10 16:38:05 quantum kernel: Out of Memory: Killed process 23781 (innfeed).
Jul 10 17:04:32 quantum kernel: Out of Memory: Killed process 27640 (innfeed).
Jul 10 20:26:12 quantum kernel: Out of Memory: Killed process 29298 (innfeed).
Jul 10 22:05:43 quantum kernel: Out of Memory: Killed process 9333 (innfeed).
Jul 10 22:24:00 quantum kernel: Out of Memory: Killed process 15476 (innfeed).
Jul 10 22:24:00 quantum kernel: Fixed up OOM kill of mm-less task
Jul 10 22:44:47 quantum kernel: Out of Memory: Killed process 16616 (innfeed).
Jul 11 01:23:32 quantum kernel: Out of Memory: Killed process 17896 (innfeed).
Jul 11 01:24:50 quantum kernel: Out of Memory: Killed process 28108 (innfeed).

Sometimes it happens every 5 minutes though.

>Is your INND configured to use the strange mmap(MAP_SHARED) of a blockdev
>thing? That could explain the scanning hysteria.

Yes. In fact, INND itself just mmaps small parts of the blockdev
(just a few 100K of bitmap at the beginning), but the outgoing processes
called "innfeed" actually mmap whole articles from the blockdevs,
which could be several megs. Not hundreds of megs, though. Maybe 20.
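
(Roughly what such a mapping looks like; the helper below only illustrates
mapping a single article out of a spool kept on a block device, it is not
innfeed's actual code.)

#include <sys/types.h>
#include <sys/mman.h>
#include <unistd.h>

/* Map art_len bytes at art_off in the already-open spool fd.  mmap()
 * offsets must be page aligned, so round down and hand back a pointer
 * into the mapping; the caller munmap()s *map_base / *map_len later. */
static char *map_article(int fd, off_t art_off, size_t art_len,
			 void **map_base, size_t *map_len)
{
	long pagesize = sysconf(_SC_PAGESIZE);
	off_t start = art_off & ~((off_t)pagesize - 1);
	size_t len = art_len + (size_t)(art_off - start);
	void *p = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, start);

	if (p == MAP_FAILED)
		return NULL;
	*map_base = p;
	*map_len = len;
	return (char *)p + (art_off - start);
}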

>> Related mm question - this box is a news server, which does a lot
>> of streaming I/O, and also keeps a history database open. I have the
>> idea that the streaming I/O evicts the history database hash and
>> index file caches from memory, which I do not want. Any chance of
>> a control on a file descriptor that tells it how persistent to be
>> in caching file data ? E.g. a sort of "nice" for the cache, so that
>> I could say that streaming data may be flushed from buffers/cache
>> earlier than other data (where the other data would be the
>> database files) ?
>
>Makes sense. We can use posix_fadvise(POSIX_FADV_NOREUSE) as a hint to
>tell the VM/VFS to throw away old pages. I'll take a look at that.

Well, throw away earlier than other pages. If nothing else needs
those pages, I'd like to keep them around anyway. What comes in
streaming, goes out streaming - but usually within seconds.

Mike.

2003-07-11 00:45:32

by William Lee Irwin III

Subject: Re: 2.5.74-mm3 OOM killer fubared ?

In article <[email protected]>, William Lee Irwin III <[email protected]> wrote:
>> That's not what it's supposed to do. The thought behind it is that since
>> out_of_memory()'s count is not reset unless it's been 5s since the last
>> time this was ever invoked, it will happen on a regular basis after the
>> first kill if it is invoked regularly. It's actually a bit too late,
>> since something's already been killed, but it should make a larger
>> difference than merely altering the rate.

On Thu, Jul 10, 2003 at 11:05:37PM +0000, Miquel van Smoorenburg wrote:
> Well, that won't help in my case, as my problem is not that many
> processes are killed - it's just that every few minutes (sometimes
> 3 minutes, sometimes 30, sometimes an hour) an innocent process
> gets killed (just one) with 2.5.74-mm3. And that did not happen
> with 2.5.74 or 2.5.72-mm2

Okay, it won't help your case, then. I've had this improvement to its
heuristics on the back burner for a while but haven't gone so far as
to dig up a case it directly benefits. It's small enough I'll think
about it later.


-- wli