2005-01-02 17:29:27

by Andrea Arcangeli

Subject: Re: [PATCH][2/2] do not OOM kill if we skip writing many pages

On Mon, Dec 20, 2004 at 10:17:28AM -0500, Rik van Riel wrote:
> Simply running "dd if=/dev/zero of=/dev/hd<one you can miss>" will
> result in OOM kills, with the dirty pagecache completely filling up
> lowmem. This patch is part 2 to fixing that problem.
>
> Note that this test case demonstrates that the false OOM kills can
> also be reproduced with pages that are not "pinned" by the swap token
> at all, so there are some serious VM problems left still...
>
> If we cannot write out a number of pages because of congestion on
> the filesystem or block device, do not cause an OOM kill. These
> pages will become freeable later, when the congestion clears.
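
A minimal self-contained sketch of the behavior described above (all
names and the threshold value are hypothetical illustrations, not the
actual patch):

#include <stdio.h>

/* Hypothetical threshold: how many congestion-skipped pages count as
 * evidence that memory will become freeable soon. */
#define CONGESTION_SKIP_THRESHOLD 32

/*
 * Decide whether a failed reclaim pass should trigger the OOM killer.
 * If nothing was reclaimed but many dirty pages were skipped only
 * because the backing device was congested, those pages will become
 * freeable once the congestion clears, so OOM killing is premature.
 */
static int should_oom_kill(long nr_reclaimed, long nr_skipped_congested)
{
        if (nr_reclaimed > 0)
                return 0;       /* reclaim made progress */
        if (nr_skipped_congested > CONGESTION_SKIP_THRESHOLD)
                return 0;       /* writeback will free these pages later */
        return 1;               /* no progress, nothing pending: real OOM */
}

int main(void)
{
        /* Heavy dd against a congested device: many skips, no OOM kill. */
        printf("congested: oom=%d\n", should_oom_kill(0, 1000));
        /* Genuine memory exhaustion: no skips, the OOM kill proceeds. */
        printf("exhausted: oom=%d\n", should_oom_kill(0, 0));
        return 0;
}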

I don't like this one; it's much less obvious than 1/2. After your
obviously correct 1/2 we're already guaranteed that at least a
percentage of the RAM will not be dirty. Is the below really needed
even after 1/2 + Andrew's fix? Are you sure this isn't a workaround
for the lack of Andrew's fix?

This 2/2 is absolutely generic, not related to highmem, and I at least
am not having problems with Andrew's patch applied.

The conditional on out_of_memory in particular doesn't look good, and
I'm afraid it could generate livelocks.

I'm going to apply your 1/2, and I already applied Andrew's
total_scanned fix, but for my part I'm not applying this 2/2. I
believe we're already safe with total_scanned + 1/2.


2005-01-03 04:20:49

by Rik van Riel

Subject: Re: [PATCH][2/2] do not OOM kill if we skip writing many pages

On Sun, 2 Jan 2005, Andrea Arcangeli wrote:

> I don't like this one; it's much less obvious than 1/2. After your
> obviously correct 1/2 we're already guaranteed that at least a
> percentage of the RAM will not be dirty. Is the below really needed
> even after 1/2 + Andrew's fix? Are you sure this isn't a workaround
> for the lack of Andrew's fix?

Agreed, Andrew's fix should in theory be enough and only my
1/2 should be needed.

However, in practice people are still generating OOM kills
even with both Andrew's fix and my own patch applied, so I
suspect there's another hole left open somewhere...

--
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan

2005-01-03 15:26:37

by Marcelo Tosatti

Subject: Re: [PATCH][2/2] do not OOM kill if we skip writing many pages

On Sun, Jan 02, 2005 at 11:20:28PM -0500, Rik van Riel wrote:
> On Sun, 2 Jan 2005, Andrea Arcangeli wrote:
>
> >I don't like this one; it's much less obvious than 1/2. After your
> >obviously correct 1/2 we're already guaranteed that at least a
> >percentage of the RAM will not be dirty. Is the below really needed
> >even after 1/2 + Andrew's fix? Are you sure this isn't a workaround
> >for the lack of Andrew's fix?
>
> Agreed, Andrew's fix should in theory be enough and only my
> 1/2 should be needed.
>
> However, in practice people are still generating OOM kills
> even with both Andrew's fix and my own patch applied, so I
> suspect there's another hole left open somewhere...

Hi Rik,

What are the details of the OOM kills (output, workload, configuration, etc)?

Are these running 2.6.10-mm?


2005-01-03 16:24:53

by Andrea Arcangeli

Subject: Re: [PATCH][2/2] do not OOM kill if we skip writing many pages

On Mon, Jan 03, 2005 at 10:22:41AM -0200, Marcelo Tosatti wrote:
> What are the details of the OOM kills (output, workload, configuration, etc)?
>
> Are these running 2.6.10-mm?

And did they apply Con's patch? (i.e. my 3/4, which I posted a few days ago)

2005-01-03 16:41:12

by Rik van Riel

Subject: Re: [PATCH][2/2] do not OOM kill if we skip writing many pages

On Mon, 3 Jan 2005, Andrea Arcangeli wrote:
> On Mon, Jan 03, 2005 at 10:22:41AM -0200, Marcelo Tosatti wrote:
>> What are the details of the OOM kills (output, workload, configuration, etc)?

The workload is a simple dd to a block device, on a system
with highmem. The mapping for the block device can only be
cached in lowmem.
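
In 2.6 that restriction comes from the blockdev inode setup; roughly,
from fs/block_dev.c's bdget() (quoted from memory, details may vary by
tree):

        /* Blockdev pagecache is allocated with GFP_USER, which lacks
         * __GFP_HIGHMEM, so these pages can only come from lowmem. */
        inode->i_data.a_ops = &def_blk_aops;
        mapping_set_gfp_mask(&inode->i_data, GFP_USER);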

kernel: oom-killer: gfp_mask=0xd0
...
kernel: Free pages: 968016kB (966400kB HighMem)
kernel: Active:31932 inactive:185316 dirty:8 writeback:165518 unstable:0 free:242004 slab:55830 mapped:33266 pagetables:1135
kernel: DMA free:16kB min:16kB low:32kB high:48kB active:0kB inactive:9656kB present:16384kB
kernel: protections[]: 0 0 0
kernel: Normal free:1600kB min:936kB low:1872kB high:2808kB active:208kB inactive:653148kB present:901120kB
kernel: protections[]: 0 0 0
kernel: HighMem free:966400kB min:512kB low:1024kB high:1536kB active:127520kB inactive:78464kB present:1179584kB
kernel: protections[]: 0 0 0
...

If you run on a system with more highmem, you'll simply get
an OOM kill with more free highmem pages. The only thing
that lives in highmem is the process code, which the VM is
not scanning for obvious reasons.

>> Are these running 2.6.10-mm?

The latest rawhide kernel, with a few VM fixes, including all
the important ones that I could see from -mm.

Reading balance_dirty_pages, I do not understand how we could
end up with so many pages in writeback state yet still continue
writing out more; surely we should have run out of dirty pages
long ago and stalled in blk_congestion_wait() until lots of IO
had completed?

Why can we build up 660MB of pages in writeback state, for a
mapping that can only live in the low 900MB of memory?
Yes, it has my patch 1/2 applied (lowering the dirty limit
for lowmem-only mappings)...

> And did they apply Con's patch? (i.e. my 3/4, which I posted a few days ago)

Con's patch is not relevant for this bug, since there are so few
mapped pages (and those almost certainly live in highmem, which
the VM is not scanning).

--
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan

2005-01-03 17:11:25

by Rik van Riel

Subject: Re: [PATCH][2/2] do not OOM kill if we skip writing many pages

On Mon, 3 Jan 2005, Rik van Riel wrote:

>> And did they apply Con's patch? (i.e. my 3/4, which I posted a few days ago)
>
> Con's patch is not relevant for this bug, since there are so few
> mapped pages (and those almost certainly live in highmem, which
> the VM is not scanning).

To quantify this: literally 99.8% of the inactive lowmem
pages are in writeback state, yet the VM is OOM killing.
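
That figure can be checked against the report quoted earlier,
assuming 4 KiB pages:

#include <stdio.h>

int main(void)
{
        /* Numbers taken from the OOM report above. */
        long writeback_pages  = 165518;           /* global writeback count */
        long inactive_dma_kb  = 9656;             /* DMA inactive */
        long inactive_norm_kb = 653148;           /* Normal inactive */

        long writeback_kb = writeback_pages * 4;  /* 4 KiB per page */
        long inactive_lowmem_kb = inactive_dma_kb + inactive_norm_kb;

        printf("writeback: %ld kB\n", writeback_kb);  /* 662072 kB, ~660MB */
        printf("inactive lowmem: %ld kB\n", inactive_lowmem_kb);
        printf("share: %.2f%%\n",                     /* prints 99.89% */
               100.0 * writeback_kb / inactive_lowmem_kb);
        return 0;
}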

--
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan

2005-01-03 22:00:41

by Marcelo Tosatti

Subject: Re: [PATCH][2/2] do not OOM kill if we skip writing many pages


Rik!

On Mon, Jan 03, 2005 at 11:40:41AM -0500, Rik van Riel wrote:
> On Mon, 3 Jan 2005, Andrea Arcangeli wrote:
> >On Mon, Jan 03, 2005 at 10:22:41AM -0200, Marcelo Tosatti wrote:
> >>What are the details of the OOM kills (output, workload, configuration, etc)?
>
> The workload is a simple dd to a block device, on a system
> with highmem. The mapping for the block device can only be
> cached in lowmem.
>
> kernel: oom-killer: gfp_mask=0xd0
> ...
> kernel: Free pages: 968016kB (966400kB HighMem)
> kernel: Active:31932 inactive:185316 dirty:8 writeback:165518 unstable:0 free:242004 slab:55830 mapped:33266 pagetables:1135
> kernel: DMA free:16kB min:16kB low:32kB high:48kB active:0kB inactive:9656kB present:16384kB
> kernel: protections[]: 0 0 0
> kernel: Normal free:1600kB min:936kB low:1872kB high:2808kB active:208kB inactive:653148kB present:901120kB
> kernel: protections[]: 0 0 0
> kernel: HighMem free:966400kB min:512kB low:1024kB high:1536kB active:127520kB inactive:78464kB present:1179584kB
> kernel: protections[]: 0 0 0
> ...
>
> If you run on a system with more highmem, you'll simply get
> an OOM kill with more free highmem pages. The only thing
> that lives in highmem is the process code, which the VM is
> not scanning for obvious reasons.
>
> >>Are these running 2.6.10-mm?
>
> The latest rawhide kernel, with a few VM fixes, including all
> the important ones that I could see from -mm.
>
> Reading balance_dirty_pages, I do not understand how we could
> end up with so many pages in writeback state yet still continue
> writing out more; surely we should have run out of dirty pages
> long ago and stalled in blk_congestion_wait() until lots of IO
> had completed?

Yes - Andrew's throttle_vm_writeout() should be handling that.

        /*
         * Boost the allowable dirty threshold a bit for page
         * allocators so they don't get DoS'ed by heavy writers
         */
        dirty_thresh += dirty_thresh / 10;      /* wheeee... */

        if (wbs.nr_unstable + wbs.nr_writeback <= dirty_thresh)
                break;
        blk_congestion_wait(WRITE, HZ/10);
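
For context, in 2.6.10 that excerpt sits inside a retry loop, roughly
as follows (reconstructed from memory; the get_dirty_limits() argument
list differs between trees, so treat this as a sketch):

void throttle_vm_writeout(void)
{
        struct writeback_state wbs;
        long background_thresh;
        long dirty_thresh;

        for ( ; ; ) {
                get_dirty_limits(&wbs, &background_thresh, &dirty_thresh);

                /*
                 * Boost the allowable dirty threshold a bit for page
                 * allocators so they don't get DoS'ed by heavy writers
                 */
                dirty_thresh += dirty_thresh / 10;      /* wheeee... */

                if (wbs.nr_unstable + wbs.nr_writeback <= dirty_thresh)
                        break;
                blk_congestion_wait(WRITE, HZ/10);
        }
}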


Are you sure the above logic is working in the RH kernels?

I can't see how it could fail with this in place.

>
> Why can we build up 660MB of pages in writeback state, for a
> mapping that can only live in the low 900MB of memory?
> Yes, it has my patch 1/2 applied (lowering the dirty limit
> for lowmem-only mappings)...
>
> >And did they apply Con's patch? (i.e. my 3/4, which I posted a few days ago)
>
> Con's patch is not relevant for this bug, since there are so few
> mapped pages (and those almost certainly live in highmem, which
> the VM is not scanning).

2005-01-03 22:08:22

by Rik van Riel

Subject: Re: [PATCH][2/2] do not OOM kill if we skip writing many pages

On Mon, 3 Jan 2005, Marcelo Tosatti wrote:

> Yes - Andrew's throttle_vm_writeout() should be handling that.

> You sure the above logic is working on RH kernels?

Exactly the same code.

> I can't see how it could fail with this in place.

Neither can I, except perhaps if the IO subsystem is sized to handle
more in-flight IO than all the lowmem pages simultaneously?

The patch I just posted to lkml ([5/?]) should fix another issue
related to this one, and might just make the problem go away.
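
One way the numbers could work out, assuming dirty_thresh is computed
against total RAM, highmem included, with the default vm_dirty_ratio
of 40, while the blockdev pages are confined to lowmem. A
back-of-envelope sketch using the zone sizes from the report:

#include <stdio.h>

int main(void)
{
        /* "present" sizes from the OOM report, in kB. */
        long total_kb  = 16384 + 901120 + 1179584;    /* DMA+Normal+HighMem */
        long lowmem_kb = 16384 + 901120;              /* DMA+Normal only */

        long dirty_thresh_kb = total_kb * 40 / 100;   /* vm_dirty_ratio = 40 */
        dirty_thresh_kb += dirty_thresh_kb / 10;      /* the "wheeee" boost */

        printf("dirty_thresh: %ld kB\n", dirty_thresh_kb);  /* 922718 kB */
        printf("lowmem total: %ld kB\n", lowmem_kb);        /* 917504 kB */
        /*
         * The boosted threshold exceeds all of lowmem, so for a
         * lowmem-only mapping nr_writeback can never reach dirty_thresh
         * and throttle_vm_writeout() never stalls before lowmem is gone.
         */
        return 0;
}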

--
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan