2001-12-10 20:25:36

by Marcelo Tosatti

Subject: Re: 2.4.16 & OOM killer screw up (fwd)


Andrea,

Could you please start looking at any 2.4 VM issues which show up ?

Just please make sure that when sending a fix for something, send me _one_
problem and a patch which fixes _that_ problem.

I'm tempted to look at VM, but I think I'll spend my limited time in a
better way if I review other people's work instead.

---------- Forwarded message ----------
Date: Mon, 10 Dec 2001 16:46:02 -0200 (BRST)
From: Marcelo Tosatti <[email protected]>
To: Abraham vd Merwe <[email protected]>
Cc: Linux Kernel Development <[email protected]>
Subject: Re: 2.4.16 & OOM killer screw up



On Mon, 10 Dec 2001, Abraham vd Merwe wrote:

> Hi!
>
> If I leave my machine on for a day or two without doing anything on it (e.g.
> my machine at work over a weekend) and I come back then 1) all my memory is
> used for buffers/caches and when I try running an application, the OOM killer
> kicks in, tries to allocate swap space (which I don't have) and kills
> whatever I try to start (that's with 300M+ memory in buffers/caches).

Abraham,

I'll take a look at this issue as soon as pre8 is released.


2001-12-10 20:48:50

by Andrew Morton

Subject: Re: 2.4.16 & OOM killer screw up (fwd)

Marcelo Tosatti wrote:
>
> Andrea,
>
> Could you please start looking at any 2.4 VM issues which show up ?
>

Just fwiw, I did some testing on this yesterday.

Buffers and cache data are sitting on the active list, and shrink_caches()
is *not* getting them off the active list, and onto the inactive list
where they can be freed.

So we end up with enormous amounts of anon memory on the inactive
list, so this code:

        /* try to keep the active list 2/3 of the size of the cache */
        ratio = (unsigned long) nr_pages * nr_active_pages / ((nr_inactive_pages + 1) * 2);
        refill_inactive(ratio);

just calls refill_inactive(0) all the time. Nothing gets moved
onto the inactive list - it remains full of unfreeable anon
allocations. And with no swap, there's nowhere to go.
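
For illustration (the numbers below are invented, not from my test box),
the integer arithmetic shows why the ratio rounds down to zero once the
inactive list dwarfs the active list:

#include <stdio.h>

int main(void)
{
        /* hypothetical figures for a box whose inactive list is full of
         * unfreeable anon pages while little remains on the active list */
        unsigned long nr_pages = 32;              /* a typical small request */
        unsigned long nr_active_pages = 8000;     /* ~32 MB active */
        unsigned long nr_inactive_pages = 160000; /* ~640 MB inactive */

        /* same expression as in shrink_caches() */
        unsigned long ratio = nr_pages * nr_active_pages /
                              ((nr_inactive_pages + 1) * 2);

        printf("ratio = %lu\n", ratio);   /* prints 0, so nothing gets refilled */
        return 0;
}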

I think a little fix is to add

        if (ratio < nr_pages)
                ratio = nr_pages;

so we at least move *something* onto the inactive list.

Also refill_inactive needs to be changed so that it counts
the number of pages which it actually moved, rather than
the number of pages which it inspected.

In my swapless testing, I burnt HUGE amounts of CPU in flush_tlb_others().
So we're madly trying to swap pages out and finding that there's no swap
space. I believe that when we find there's no swap left we should move
the page onto the active list so we don't keep rescanning it pointlessly.
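
A minimal sketch of that idea (illustrative only, not a tested patch;
the exact placement in the 2.4 swap-out path and the error label are
assumptions):

        swp_entry_t entry;

        /* in the path that allocates swap for an anonymous page */
        entry = get_swap_page();
        if (!entry.val) {
                /*
                 * Completely out of swap: promote the page back to the
                 * active list so the scanner stops re-inspecting it (and
                 * flushing TLBs for it) on every pass, instead of leaving
                 * it where it will be rescanned pointlessly.
                 */
                activate_page(page);
                goto out_failed;        /* hypothetical label */
        }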

A fix may be to just remove the use-once stuff. It is one of the
sources of this problem, because it's overpopulating the inactive list.

In my testing last night, I tried to allocate 650 megs on a 768 meg
swapless box. Got oom-killed when there was almost 100 megs of freeable
memory: half buffercache, half filecache. Presumably, all of it was
stuck on the active list with no way to get off.

We also need to do something about shrink_[di]cache_memory(),
which seem to be called in the wrong place.

There's also the report concerning modify_ldt() failure in a
similar situation. I'm not sure why this one occurred. It
vmallocs 64k of memory and that seems to fail.

I did some similar testing a week or so ago, also tested
the -aa patches. They seemed to maybe help a tiny bit,
but not significantly.

-

2001-12-10 20:59:32

by Marcelo Tosatti

Subject: Re: 2.4.16 & OOM killer screw up (fwd)



On Mon, 10 Dec 2001, Andrew Morton wrote:

> Marcelo Tosatti wrote:
> >
> > Andrea,
> >
> > Could you please start looking at any 2.4 VM issues which show up ?
> >
>
> Just fwiw, I did some testing on this yesterday.
>
> Buffers and cache data are sitting on the active list, and shrink_caches()
> is *not* getting them off the active list, and onto the inactive list
> where they can be freed.
>
> So we end up with enormous amounts of anon memory on the inactive
> list, so this code:
>
> /* try to keep the active list 2/3 of the size of the cache */
> ratio = (unsigned long) nr_pages * nr_active_pages / ((nr_inactive_pages + 1) * 2);
> refill_inactive(ratio);
>
> just calls refill_inactive(0) all the time. Nothing gets moved
> onto the inactive list - it remains full of unfreeable anon
> allocations. And with no swap, there's nowhere to go.
>
> I think a little fix is to add
>
> if (ratio < nr_pages)
> ratio = nr_pages;
>
> so we at least move *something* onto the inactive list.
>
> Also refill_inactive needs to be changed so that it counts
> the number of pages which it actually moved, rather than
> the number of pages which it inspected.
>
> In my swapless testing, I burnt HUGE amounts of CPU in flush_tlb_others().
> So we're madly trying to swap pages out and finding that there's no swap
> space. I believe that when we find there's no swap left we should move
> the page onto the active list so we don't keep rescanning it pointlessly.
>
> A fix may be to just remove the use-once stuff. It is one of the
> sources of this problem, because it's overpopulating the inactive list.
>
> In my testing last night, I tried to allocate 650 megs on a 768 meg
> swapless box. Got oom-killed when there was almost 100 megs of freeable
> memory: half buffercache, half filecache. Presumably, all of it was
> stuck on the active list with no way to get off.
>
> We also need to do something about shrink_[di]cache_memory(),
> which seem to be called in the wrong place.
>
> There's also the report concerning modify_ldt() failure in a
> similar situation. I'm not sure why this one occurred. It
> vmallocs 64k of memory and that seems to fail.

I haven't applied the modify_ldt() patch because I want to make sure it's
needed: it may just be a bad effect of this one bug.

> I did some similar testing a week or so ago, also tested
> the -aa patches. They seemed to maybe help a tiny bit,
> but not significantly.


2001-12-11 00:11:55

by Andrea Arcangeli

Subject: Re: 2.4.16 & OOM killer screw up (fwd)

On Mon, Dec 10, 2001 at 12:47:55PM -0800, Andrew Morton wrote:
> Marcelo Tosatti wrote:
> >
> > Andrea,
> >
> > Could you please start looking at any 2.4 VM issues which show up ?
> >
>
> Just fwiw, I did some testing on this yesterday.
>
> Buffers and cache data are sitting on the active list, and shrink_caches()
> is *not* getting them off the active list, and onto the inactive list
> where they can be freed.

please check 2.4.17pre4aa1, see the per-classzone info; it will
prevent all the problems with refill_inactive with highmem.

>
> So we end up with enormous amounts of anon memory on the inactive
> list, so this code:
>
> /* try to keep the active list 2/3 of the size of the cache */
> ratio = (unsigned long) nr_pages * nr_active_pages / ((nr_inactive_pages + 1) * 2);
> refill_inactive(ratio);
>
> just calls refill_inactive(0) all the time. Nothing gets moved
> onto the inactive list - it remains full of unfreeable anon
> allocations. And with no swap, there's nowhere to go.
>
> I think a little fix is to add
>
> if (ratio < nr_pages)
> ratio = nr_pages;
>
> so we at least move *something* onto the inactive list.
>
> Also refill_inactive needs to be changed so that it counts
> the number of pages which it actually moved, rather than
> the number of pages which it inspected.

done ages ago here.

>
> In my swapless testing, I burnt HUGE amounts of CPU in flush_tlb_others().
> So we're madly trying to swap pages out and finding that there's no swap
> space. I believe that when we find there's no swap left we should move
> the page onto the active list so we don't keep rescanning it pointlessly.

yes, however I think the swap-flood with no swap isn't a very
interesting case to optimize.

>
> A fix may be to just remove the use-once stuff. It is one of the
> sources of this problem, because it's overpopulating the inactive list.
>
> In my testing last night, I tried to allocate 650 megs on a 768 meg
> swapless box. Got oom-killed when there was almost 100 megs of freeable
> memory: half buffercache, half filecache. Presumably, all of it was
> stuck on the active list with no way to get off.
>
> We also need to do something about shrink_[di]cache_memory(),
> which seem to be called in the wrong place.
>
> There's also the report concerning modify_ldt() failure in a
> similar situation. I'm not sure why this one occurred. It
> vmallocs 64k of memory and that seems to fail.

dunno about this modify_ldt failure.

>
> I did some similar testing a week or so ago, also tested
> the -aa patches. They seemed to maybe help a tiny bit,
> but not significantly.

I don't have any pending bug report. AFAIK those bugs are only in
mainline. If you can reproduce with -aa please send me a bug report.
thanks,

Andrea

2001-12-11 00:43:32

by Andrea Arcangeli

Subject: Re: 2.4.16 & OOM killer screw up (fwd)

On Mon, Dec 10, 2001 at 05:08:44PM -0200, Marcelo Tosatti wrote:
>
> Andrea,
>
> Could you please start looking at any 2.4 VM issues which show up ?

well, as far as I can tell no VM bug should be present in my latest -aa,
so I think I'm finished. At the very least I know people are using
2.4.15aa1 and 2.4.17pre1aa1 in production on multigigabyte boxes under
heavy VM load and I haven't got any bug reports back yet.

>
> Just please make sure that when sending a fix for something, send me _one_
> problem and a patch which fixes _that_ problem.

I will split something out for you soon; at the moment I'm doing some
further benchmarking.

>
> I'm tempted to look at VM, but I think I'll spend my limited time in a
> better way if I review other people's work instead.

Until I split something out, you can see all the VM-related changes in
the 10_vm-* patches in my ftp area.

Andrea

2001-12-11 07:08:26

by Andrew Morton

Subject: Re: 2.4.16 & OOM killer screw up (fwd)

Andrea Arcangeli wrote:
>
> On Mon, Dec 10, 2001 at 12:47:55PM -0800, Andrew Morton wrote:
> > Marcelo Tosatti wrote:
> > >
> > > Andrea,
> > >
> > > Could you please start looking at any 2.4 VM issues which show up ?
> > >
> >
> > Just fwiw, I did some testing on this yesterday.
> >
> > Buffers and cache data are sitting on the active list, and shrink_caches()
> > is *not* getting them off the active list, and onto the inactive list
> > where they can be freed.
>
> please check 2.4.17pre4aa1, see the per-classzone info, they will
> prevent all the problems with the refill inactive with highmem.

This is not highmem-related. But the latest -aa patch does
appear to have fixed this bug. Stale memory is no longer being
left on the active list, and all buffercache memory is being reclaimed
before the oom-killer kicks in (swapless case).

Also, (and this is in fact the same problem), the patched kernel
has less tendency to push in-use memory out to swap while leaving
tens of megs of old memory on the active list. This is all good.

Which of your changes has caused this?

Could you please separate this out into one or more specific patches for
the 2.4.17 series?





Why does this code exist at the end of refill_inactive()?

        if (entry != &active_list) {
                list_del(&active_list);
                list_add(&active_list, entry);
        }





This test on a 64 megabyte machine, on ext2:

time (tar xfz /nfsserver/linux-2.4.16.tar.gz ; sync)

On 2.4.17-pre7 it takes 21 seconds. On -aa it is much slower: 36 seconds.

This is probably due to the write scheduling changes in fs/buffer.c.
This chunk especially will, under some conditions, cause bdflush
to madly spin in a loop unplugging all the disk queues:

@@ -2787,7 +2795,7 @@
 
                spin_lock(&lru_list_lock);
                if (!write_some_buffers(NODEV) || balance_dirty_state() < 0) {
-                       wait_for_some_buffers(NODEV);
+                       run_task_queue(&tq_disk);
                        interruptible_sleep_on(&bdflush_wait);
                }
        }

Why did you make this change?





Execution time for `make -j12 bzImage' on a 64meg RAM/512 meg swap
dual x86:

-aa: 4 minutes 20 seconds
2.4.17-pre8: 4 minutes 8 seconds
2.4.17-pre8 plus the below patch: 3 minutes 55 seconds

Now it could be that this performance regression is due to the
write merging mistake in fs/buffer.c. But with so much unrelated
material in the same patch it's hard to pinpoint the source.



--- linux-2.4.17-pre8/mm/vmscan.c	Thu Nov 22 23:02:59 2001
+++ linux-akpm/mm/vmscan.c	Mon Dec 10 22:34:18 2001
@@ -537,7 +537,7 @@ static void refill_inactive(int nr_pages
 
        spin_lock(&pagemap_lru_lock);
        entry = active_list.prev;
-       while (nr_pages-- && entry != &active_list) {
+       while (nr_pages && entry != &active_list) {
                struct page * page;
 
                page = list_entry(entry, struct page, lru);
@@ -551,6 +551,7 @@ static void refill_inactive(int nr_pages
                del_page_from_active_list(page);
                add_page_to_inactive_list(page);
                SetPageReferenced(page);
+               nr_pages--;
        }
        spin_unlock(&pagemap_lru_lock);
 }
@@ -561,6 +562,12 @@ static int shrink_caches(zone_t * classz
        int chunk_size = nr_pages;
        unsigned long ratio;
 
+       shrink_dcache_memory(priority, gfp_mask);
+       shrink_icache_memory(priority, gfp_mask);
+#ifdef CONFIG_QUOTA
+       shrink_dqcache_memory(DEF_PRIORITY, gfp_mask);
+#endif
+
        nr_pages -= kmem_cache_reap(gfp_mask);
        if (nr_pages <= 0)
                return 0;
@@ -568,17 +575,13 @@ static int shrink_caches(zone_t * classz
        nr_pages = chunk_size;
        /* try to keep the active list 2/3 of the size of the cache */
        ratio = (unsigned long) nr_pages * nr_active_pages / ((nr_inactive_pages + 1) * 2);
+       if (ratio == 0)
+               ratio = nr_pages;
        refill_inactive(ratio);
 
        nr_pages = shrink_cache(nr_pages, classzone, gfp_mask, priority);
        if (nr_pages <= 0)
                return 0;
-
-       shrink_dcache_memory(priority, gfp_mask);
-       shrink_icache_memory(priority, gfp_mask);
-#ifdef CONFIG_QUOTA
-       shrink_dqcache_memory(DEF_PRIORITY, gfp_mask);
-#endif
 
        return nr_pages;
 }

> ...
>
> >
> > In my swapless testing, I burnt HUGE amounts of CPU in flush_tlb_others().
> > So we're madly trying to swap pages out and finding that there's no swap
> > space. I believe that when we find there's no swap left we should move
> > the page onto the active list so we don't keep rescanning it pointlessly.
>
> yes, however I think the swap-flood with no swap isn't a very
> interesting case to optimize.

Running swapless is a valid configuration, and the kernel is doing
great amounts of pointless work. I would expect a diskless workstation
to suffer from this. The problem remains in latest -aa. It would be
useful to find a fix.

>
> > I don't have any pending bug report. AFAIK those bugs are only in
> mainline. If you can reproduce with -aa please send me a bug report.
> thanks,

Bugs which are only fixed in -aa aren't much use to anyone.

The VM code lacks comments, and nobody except yourself understands
what it is supposed to be doing. That's a bug, don't you think?

-

2001-12-11 13:33:12

by Rik van Riel

Subject: Re: 2.4.16 & OOM killer screw up (fwd)

On Mon, 10 Dec 2001, Andrew Morton wrote:

> This test on a 64 megabyte machine, on ext2:
>
> time (tar xfz /nfsserver/linux-2.4.16.tar.gz ; sync)
>
> On 2.4.17-pre7 it takes 21 seconds. On -aa it is much slower: 36 seconds.

> Execution time for `make -j12 bzImage' on a 64meg RAM/512 meg swap
> dual x86:
>
> -aa: 4 minutes 20 seconds
> 2.4.7-pre8 4 minutes 8 seconds
> 2.4.7-pre8 plus the below patch: 3 minutes 55 seconds


Andrea, it seems -aa is not the holy grail VM-wise. If you want
to merge your good stuff with Marcelo, please do it in the
"one patch with explanation per problem" style Marcelo asked for.

If nothing happens I'll take my chainsaw and remove the whole
use-once stuff just so 2.4 will avoid the worst cases, even if
it happens to remove some of the nice stuff you've been working
on.

regards,

Rik
--
Shortwave goes a long way: irc.starchat.net #swl

http://www.surriel.com/ http://distro.conectiva.com/

2001-12-11 13:42:44

by Andrea Arcangeli

Subject: Re: 2.4.16 & OOM killer screw up (fwd)

On Mon, Dec 10, 2001 at 11:07:31PM -0800, Andrew Morton wrote:
> Why does this code exist at the end of refill_inactive()?
>
> if (entry != &active_list) {
> list_del(&active_list);
> list_add(&active_list, entry);
> }

so that we restart next time at the point where we stopped browsing the
active list.
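
Restating the snippet with comments (the code is the same as quoted
above; only the comments are added here for illustration):

        if (entry != &active_list) {
                /* take the list head itself out of the ring ... */
                list_del(&active_list);
                /* ... and splice it back in right after the entry where
                 * the scan stopped, so that active_list.prev is now that
                 * entry and the next refill_inactive() resumes there. */
                list_add(&active_list, entry);
        }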

> This test on a 64 megabyte machine, on ext2:
>
> time (tar xfz /nfsserver/linux-2.4.16.tar.gz ; sync)
>
> On 2.4.17-pre7 it takes 21 seconds. On -aa it is much slower: 36 seconds.
>
> This is probably due to the write scheduling changes in fs/buffer.c.

yes, I also lowered the percentage of dirty memory in the system by
default, so that a write flood is less likely to stall the system.

Plus I made the elevator more latency oriented, rather than throughput
oriented. Did you also test how responsive the system was during the test?

Do you remember the thread about a 'tar xzf' hanging the machine? It
doesn't hang with -aa, but of course you'll run slower if it has to do
seeks.

> This chunk especially will, under some conditions, cause bdflush
> to madly spin in a loop unplugging all the disk queues:
>
> @@ -2787,7 +2795,7 @@
>
> spin_lock(&lru_list_lock);
> if (!write_some_buffers(NODEV) || balance_dirty_state() < 0) {
> - wait_for_some_buffers(NODEV);
> + run_task_queue(&tq_disk);
> interruptible_sleep_on(&bdflush_wait);
> }
> }
>
> Why did you make this change?

to make bdflush spin less badly in a loop unplugging all the disk
queues.

We need to unplug only once, to submit the I/O, but we don't need to
wait on every single buffer that we previously wrote. Note that
run_task_queue() has nothing to do with wait_on_buffer; the above should
be much better in terms of "spinning in a loop unplugging all the disk
queues". It will do it only once at least.

In fact all the wait_for_some_buffers calls are broken (particularly the
one in balance_dirty()); they're not necessary, and they can only slow
down the machine.

The only reason would be to refile the buffers into the clean list, but
nothing else. That's a total waste of I/O pipelining. And yes, that's
something to fix too.

> Execution time for `make -j12 bzImage' on a 64meg RAM/512 meg swap
> dual x86:
>
> -aa: 4 minutes 20 seconds
> 2.4.7-pre8 4 minutes 8 seconds
> 2.4.7-pre8 plus the below patch: 3 minutes 55 seconds
>
> Now it could be that this performance regression is due to the
> write merging mistake in fs/buffer.c. But with so much unrelated
> material in the same patch it's hard to pinpoint the source.
>
>
>
> --- linux-2.4.17-pre8/mm/vmscan.c Thu Nov 22 23:02:59 2001
> +++ linux-akpm/mm/vmscan.c Mon Dec 10 22:34:18 2001
> @@ -537,7 +537,7 @@ static void refill_inactive(int nr_pages
>
> spin_lock(&pagemap_lru_lock);
> entry = active_list.prev;
> - while (nr_pages-- && entry != &active_list) {
> + while (nr_pages && entry != &active_list) {
> struct page * page;
>
> page = list_entry(entry, struct page, lru);
> @@ -551,6 +551,7 @@ static void refill_inactive(int nr_pages
> del_page_from_active_list(page);
> add_page_to_inactive_list(page);
> SetPageReferenced(page);
> + nr_pages--;
> }
> spin_unlock(&pagemap_lru_lock);
> }
> @@ -561,6 +562,12 @@ static int shrink_caches(zone_t * classz
> int chunk_size = nr_pages;
> unsigned long ratio;
>
> + shrink_dcache_memory(priority, gfp_mask);
> + shrink_icache_memory(priority, gfp_mask);
> +#ifdef CONFIG_QUOTA
> + shrink_dqcache_memory(DEF_PRIORITY, gfp_mask);
> +#endif
> +
> nr_pages -= kmem_cache_reap(gfp_mask);
> if (nr_pages <= 0)
> return 0;
> @@ -568,17 +575,13 @@ static int shrink_caches(zone_t * classz
> nr_pages = chunk_size;
> /* try to keep the active list 2/3 of the size of the cache */
> ratio = (unsigned long) nr_pages * nr_active_pages / ((nr_inactive_pages + 1) * 2);
> + if (ratio == 0)
> + ratio = nr_pages;
> refill_inactive(ratio);
>
> nr_pages = shrink_cache(nr_pages, classzone, gfp_mask, priority);
> if (nr_pages <= 0)
> return 0;
> -
> - shrink_dcache_memory(priority, gfp_mask);
> - shrink_icache_memory(priority, gfp_mask);
> -#ifdef CONFIG_QUOTA
> - shrink_dqcache_memory(DEF_PRIORITY, gfp_mask);
> -#endif
>
> return nr_pages;
> }

it should be simple: mainline swaps out more, so it's less likely to
trash away some useful cache.

just try -aa after a:

echo 10 >/proc/sys/vm/vm_mapped_ratio

it should swap out more and better preserve the cache.

> > > In my swapless testing, I burnt HUGE amounts of CPU in flush_tlb_others().
> > > So we're madly trying to swap pages out and finding that there's no swap
> > > space. I believe that when we find there's no swap left we should move
> > > the page onto the active list so we don't keep rescanning it pointlessly.
> >
> > yes, however I think the swap-flood with no swap isn't a very
> > interesting case to optimize.
>
> Running swapless is a valid configuration, and the kernel is doing

I'm not saying it's not valid or non interesting.

It's the mix "I'm running out of memory and I'm swapless" that is the
case not interesting to optimize.

If you're swapless it means you've enough memory and that you're not
running out of swap. Otherwise _you_ (not the kernel) are wrong not
having swap.

> great amounts of pointless work. I would expect a diskless workstation
> to suffer from this. The problem remains in latest -aa. It would be
> useful to find a fix.

It can be optimized by making the other cases slower. I believe that if
swap_out is called heavily in a swapless configuration, either some
other part of the kernel or the user is wrong, not swap_out. So it's at
least not obvious to me that it would be useful to fix it inside
swap_out.

> > I don't have any pending bug report. AFAIK those bugs are only in
> > mainline. If you can reproduce with -aa please send me a bug report.
> > thanks,
>
> Bugs which are only fixed in -aa aren't much use to anyone.

Then there are no other bugs, that's fine; this is why I said I'm
finished (except for the minor performance work, like the buffer
flushing in buffer.c that certainly cannot affect stability, or the
swap-triggering etc., all minor things that don't affect stability and
where there's no perfect solution anyway).

> The VM code lacks comments, and nobody except yourself understands
> what it is supposed to be doing. That's a bug, don't you think?

Lack of documentation is not a bug, period. Also it's not true that I'm
the only one who understands it. For instance Linus understands it
completely, I am 100% sure.

Anyway, I wrote a dozen slides on the VM with some graphs showing the
design of the VM, in case anybody can learn better from a slide than from
the code.

I believe the slides are useful to understand the design, but if you
want to change one line of code, slides or not, you have to read the code.
Everybody is complaining about documentation. This is a red herring.
There's no documentation that allows you to hack the previous VM code.
I'd ask how many of the people happy with the previous documentation
were actually VM developers. Except for some possibly misleading
comments in the current code that we may not have updated yet, I don't
think there's been a regression in documentation.

Andrea

2001-12-11 13:46:34

by Andrea Arcangeli

Subject: Re: 2.4.16 & OOM killer screw up (fwd)

On Tue, Dec 11, 2001 at 11:32:25AM -0200, Rik van Riel wrote:
> On Mon, 10 Dec 2001, Andrew Morton wrote:
>
> > This test on a 64 megabyte machine, on ext2:
> >
> > time (tar xfz /nfsserver/linux-2.4.16.tar.gz ; sync)
> >
> > On 2.4.17-pre7 it takes 21 seconds. On -aa it is much slower: 36 seconds.
>
> > Execution time for `make -j12 bzImage' on a 64meg RAM/512 meg swap
> > dual x86:
> >
> > -aa: 4 minutes 20 seconds
> > 2.4.7-pre8 4 minutes 8 seconds
> > 2.4.7-pre8 plus the below patch: 3 minutes 55 seconds
>
>
> Andrea, it seems -aa is not the holy grail VM-wise. If you want

it may not be a holy grail in swap benchmarks and floods of writes to
disk (those are minor performance regressions), but I have not one single
bug report related to "stability".

The only thing I got back from Andrew has been "it runs a little slower"
in those two tests.

and of course he didn't even attempt to benchmark the interactive
feel, which was the _whole_ point of my buffer.c and elevator changes.

So as far as I'm concerned 2.4.15aa1 and 2.4.17pre?aa? are just rock
solid and usable in production.

We'll keep doing background benchmarking and changes that cannot
affect stability, but the core design is finished as far as I can tell.

Andrea

2001-12-11 13:56:44

by Abraham vd Merwe

Subject: Re: 2.4.16 & OOM killer screw up (fwd)

Hi Andrea!

> > > > In my swapless testing, I burnt HUGE amounts of CPU in flush_tlb_others().
> > > > So we're madly trying to swap pages out and finding that there's no swap
> > > > space. I beleive that when we find there's no swap left we should move
> > > > the page onto the active list so we don't keep rescanning it pointlessly.
> > >
> > > yes, however I think the swap-flood with no swap isn't a very
> > > interesting case to optimize.
> >
> > Running swapless is a valid configuration, and the kernel is doing
>
> I'm not saying it's not valid or non interesting.
>
> It's the mix "I'm running out of memory and I'm swapless" that is the
> case not interesting to optimize.
>
> If you're swapless it means you've enough memory and that you're not
> running out of swap. Otherwise _you_ (not the kernel) are wrong not
> having swap.

The problem is that your VM is unnecessarily eating up memory and then wants
swap. That is unacceptable. Having 90% of your memory in buffers/cache and
then having the OOM killer kick in because nothing is free is what we're
moaning about.

--

Regards
Abraham

Did you hear about the model who sat on a broken bottle and cut a nice figure?

__________________________________________________________
Abraham vd Merwe - 2d3D, Inc.

Device Driver Development, Outsourcing, Embedded Systems

Cell: +27 82 565 4451 Snailmail:
Tel: +27 21 761 7549 Block C, Antree Park
Fax: +27 21 761 7648 Doncaster Road
Email: [email protected] Kenilworth, 7700
Http: http://www.2d3d.com South Africa



2001-12-11 14:01:04

by Andrea Arcangeli

Subject: Re: 2.4.16 & OOM killer screw up (fwd)

On Tue, Dec 11, 2001 at 03:59:22PM +0200, Abraham vd Merwe wrote:
> Hi Andrea!
>
> > > > > In my swapless testing, I burnt HUGE amounts of CPU in flush_tlb_others().
> > > > > So we're madly trying to swap pages out and finding that there's no swap
> > > > > space. I believe that when we find there's no swap left we should move
> > > > > the page onto the active list so we don't keep rescanning it pointlessly.
> > > >
> > > > yes, however I think the swap-flood with no swap isn't a very
> > > > interesting case to optimize.
> > >
> > > Running swapless is a valid configuration, and the kernel is doing
> >
> > I'm not saying it's not valid or non interesting.
> >
> > It's the mix "I'm running out of memory and I'm swapless" that is the
> > case not interesting to optimize.
> >
> > If you're swapless it means you've enough memory and that you're not
> > running out of swap. Otherwise _you_ (not the kernel) are wrong not
> > having swap.
>
> The problem is that your VM is unnecesarily eating up memory and then wants
> swap. That is unacceptable. Having 90% of your memory in buffers/cache and
> then the OOM killer kicks in because nothing is free is what we're moaning
> about.

Dear Abraham, please apply this patch:

ftp://ftp.us.kernel.org/pub/linux/kernel/people/andrea/kernels/v2.4/2.4.17pre4aa1.bz2

on top of a 2.4.17pre4 and then recompile, try again and send me a
bugreport if you can reproduce. thanks,

Andrea

2001-12-11 13:59:54

by Rik van Riel

Subject: Re: 2.4.16 & OOM killer screw up (fwd)

On Tue, 11 Dec 2001, Andrea Arcangeli wrote:

> > The VM code lacks comments, and nobody except yourself understands
> > what it is supposed to be doing. That's a bug, don't you think?
>
> Lack of documentation is not a bug, period. Also it's not true that
> I'm the only one who understands it.

Without documentation, you can only know what the code
does, never what it is supposed to do or why it does it.

This makes fixing problems a lot harder, especially since
people will never agree on what a piece of code is supposed
to do.

regards,

Rik
--
Shortwave goes a long way: irc.starchat.net #swl

http://www.surriel.com/ http://distro.conectiva.com/

2001-12-11 14:23:54

by Andrea Arcangeli

Subject: Re: 2.4.16 & OOM killer screw up (fwd)

On Tue, Dec 11, 2001 at 11:59:06AM -0200, Rik van Riel wrote:
> On Tue, 11 Dec 2001, Andrea Arcangeli wrote:
>
> > > The VM code lacks comments, and nobody except yourself understands
> > > what it is supposed to be doing. That's a bug, don't you think?
> >
> > Lack of documentation is not a bug, period. Also it's not true that
> > I'm the only one who understands it.
>
> Without documentation, you can only know what the code
> does, never what it is supposed to do or why it does it.

I only care about "what the code does" and "what the results and the
bug reports are". Anything else is vapourware and I don't care about that.

As I said, I wrote some documentation on the VM for my last speech at
one of the most important Italian Linux events; it explains the basic
design. It should be published on their website as soon as I find the
time to send them the slides. I can post a link once it is online.
It should allow non-VM-developers to understand the logic behind the VM
algorithm, but understanding those slides is far from enough to allow
anyone to hack the VM.

I _totally_ agree with Linus when he said "real world is totally
dominated by the implementation details". I was thinking this way before
reading his recent email to l-k (however I totally disagree about
evolution being random and the other kernel-offtopic part of such thread :).

For developers the real freedom is the code, not the documentation and
the code is there. And I think it's much easier to understand the
current code (ok I'm biased, but still I believe for outsiders it's
simpler).

Andrea

2001-12-11 15:25:59

by Daniel Phillips

Subject: Re: 2.4.16 & OOM killer screw up (fwd)

On December 11, 2001 03:23 pm, Andrea Arcangeli wrote:
> As said I wrote some documentation on the VM for my last speech at the
> one of the most important italian linux events, it explains the basic
> design. It should be published on their webside as soon as I find the
> time to send them the slides. I can post a link once it will be online.

Why not also post the whole thing as an email, right here?

> It shoud allow non VM-developers to understand the logic behind the VM
> algorithm, but understanding those slides it's far from allowing anyone
> to hack the VM.

It's a start.

> I _totally_ agree with Linus when he said "real world is totally
> dominated by the implementation details".

Linus didn't say anything about not documenting the implementation details,
nor did he say anything about not documenting in general.

> For developers the real freedom is the code, not the documentation and
> the code is there. And I think it's much easier to understand the
> current code (ok I'm biased, but still I believe for outsiders it's
> simpler).

Judging by the number of complaints, it's not easy enough. I know that,
personally, decoding your vm is something that's always on my 'things I could
do if I didn't have a lot of other things to do' list. So far, only Linus,
Marcelo, Andrew and maybe Rik seem to have made the investment. You'd have a
lot more helpers by now if you gave just a little higher priority to
documentation.

--
Daniel

2001-12-11 15:47:24

by Luigi Genoni

Subject: Re: 2.4.16 & OOM killer screw up (fwd)



On Tue, 11 Dec 2001, Andrea Arcangeli wrote:

> On Mon, Dec 10, 2001 at 05:08:44PM -0200, Marcelo Tosatti wrote:
> >
> > Andrea,
> >
> > Could you please start looking at any 2.4 VM issues which show up ?
>
> well, as far I can tell no VM bug should be present in my latest -aa, so
> I think I'm finished. At the very least I know people is using 2.4.15aa1
> and 2.4.17pre1aa1 in production on multigigabyte boxes under heavy VM
> load and I didn't got any bugreport back yet.
2.4.17pre1aa1 is quite rock solid on all my 2 and 4 GB machines.
But I have to admit that actually I did not really stress the VM on my
servers, since, guys, we are getting close to Christmas :)



by Henning P. Schmiedehausen

Subject: Re: 2.4.16 & OOM killer screw up (fwd)

Andrea Arcangeli <[email protected]> writes:

>Lack of documentation is not a bug, period. Also it's not true that I'm

It scares me shitless that you, as the one responsible for something
as crucial as MM in the Linux kernel, have such an attitude towards
software development, especially when people like RvR ask for docs.

Sorry, but to me this sounds like something from M$ (MAPI? You don't
need MAPI documentation. We know what we're doing. You don't need to
know how Windows XX works. It's enough that we know).

Actually, you _do_ get documentation from M$. Something one can't say
about the Linux MM-sprinkled-with-holy-penguin-pee subsystem.

I'm not happy about your usage of magic numbers, either. So it is
still running on solid 2.2.19 until further notice (or until Rik loses
his patience. ;-) )

Regards
Henning

--
Dipl.-Inf. (Univ.) Henning P. Schmiedehausen -- Geschaeftsfuehrer
INTERMETA - Gesellschaft fuer Mehrwertdienste mbH [email protected]

Am Schwabachgrund 22 Fon.: 09131 / 50654-0 [email protected]
D-91054 Buckenhof Fax.: 09131 / 50654-20

2001-12-11 15:52:45

by Alan

Subject: Re: 2.4.16 & OOM killer screw up (fwd)

> I'm not happy about your usage of magic numbers, either. So it is
> still running on solid 2.2.19 until further notice (or until Rik loses
> his patience. ;-) )

Andrea did the 2.2.19 VM as well, but that one is somewhat better
documented, and doesn't have the use-once funnies.

Alan

2001-12-11 16:37:58

by Hubert Mantel

Subject: Re: 2.4.16 & OOM killer screw up (fwd)

Hi,

On Tue, Dec 11, Henning P. Schmiedehausen wrote:

> Andrea Arcangeli <[email protected]> writes:
>
> >Lack of documentation is not a bug, period. Also it's not true that I'm
>
> I scare myself shitless that you as the one responsible for something
> as crucial as MM in the Linux kernel, has such an attitude towards
> software development especially when people like RvR as for docs.
>
> Sorry, but to me this sounds like something from M$ (MAPI? You don't
> need MAPI documentation. We know what we're doing. You don't need to
> know how Windows XX works. It's enough that we know).
>
> Actually, you _do_ get documentation from M$. Something, one can't say

How do you know the documentation matches the actual code?

> about the Linux MM-sprikled-with holy-penguin-pee subsystem.

In Linux, you get even more: You can look at the code itself.

> I'm not happy about your usage of magic numbers, either. So it is
> still running on solid 2.2.19 until further notice (or until Rik loses
> his patience. ;-) )

Oh, the 2.2.19 VM is from Andrea ;)

> Regards
> Henning
Hubert Mantel            Goodbye, dots...          -o)
                                                   /\\
                                                  _\_v

2001-12-11 17:10:19

by Rik van Riel

Subject: Re: 2.4.16 & OOM killer screw up (fwd)

On Tue, 11 Dec 2001, Henning P. Schmiedehausen wrote:

> I'm not happy about your usage of magic numbers, either. So it is
> still running on solid 2.2.19 until further notice (or until Rik loses
> his patience. ;-) )

I've lost patience and have decided to move development away
from the main tree. http://linuxvm.bkbits.net/ ;)

cheers,

Rik
--
DMCA, SSSCA, W3C? Who cares? http://thefreeworld.net/

http://www.surriel.com/ http://distro.conectiva.com/

2001-12-11 17:19:02

by Alan

Subject: Re: 2.4.16 & OOM killer screw up (fwd)

> > I'm not happy about your usage of magic numbers, either. So it is
> > still running on solid 2.2.19 until further notice (or until Rik loses
> > his patience. ;-) )
>
> I've lost patience and have decided to move development away
> from the main tree. http://linuxvm.bkbits.net/ ;)

Are your patches available in a format that is accessible using free
software ?

(Now where did I put the troll sign 8))

2001-12-11 17:22:52

by Rik van Riel

Subject: Re: 2.4.16 & OOM killer screw up (fwd)

On Tue, 11 Dec 2001, Alan Cox wrote:

> > > I'm not happy about your usage of magic numbers, either. So it is
> > > still running on solid 2.2.19 until further notice (or until Rik loses
> > > his patience. ;-) )
> >
> > I've lost patience and have decided to move development away
> > from the main tree. http://linuxvm.bkbits.net/ ;)
>
> Are your patches available in a format that is accessible using free
> software ?

Yes, I'm making patches available on my home page:

http://surriel.com/patches/

Note that development isn't too fast due to the fact
that I try to clean up all code I touch instead of
just making the changes needed for the functionality.

kind regards,

Rik
--
DMCA, SSSCA, W3C? Who cares? http://thefreeworld.net/

http://www.surriel.com/ http://distro.conectiva.com/

2001-12-11 17:24:52

by Christoph Hellwig

Subject: Re: 2.4.16 & OOM killer screw up (fwd)

In article <[email protected]> you wrote:
>> > I'm not happy about your usage of magic numbers, either. So it is
>> > still running on solid 2.2.19 until further notice (or until Rik loses
>> > his patience. ;-) )
>>
>> I've lost patience and have decided to move development away
>> from the main tree. http://linuxvm.bkbits.net/ ;)
>
> Are your patches available in a format that is accessible using free
> software ?

As bitkeeper-ignorant I've found nice snapshots on
http://www.surriel.com/patches/.

For BSD advocates it might be a problem that these are unified diffs
that can only be applied with the GPL-licensed version of patch(1)..

Christoph

--
Of course it doesn't work. We've performed a software upgrade.

2001-12-11 17:31:02

by Leigh Orf

Subject: Re: 2.4.16 & OOM killer screw up (fwd)



Andrea Arcangeli wrote:

| > The problem is that your VM is unnecesarily eating up
| > memory and then wants swap. That is unacceptable. Having
| > 90% of your memory in buffers/cache and then the OOM killer
| > kicks in because nothing is free is what we're moaning
| > about.
|

| Dear, Abraham please apply this patch:
|
| ftp://ftp.us.kernel.org/pub/linux/kernel/people/andrea/kernels/v2.4/2.4.17pre4aa1.bz2
|
| on top of a 2.4.17pre4 and then recompile, try again and send me a
| bugreport if you can reproduce. thanks,

Andrea,

I applied your patch and it didn't fix the problem.
I reported this earlier to the kernel list but I'm not sure if you got
it. See http://groups.google.com/groups?hl=en&rnum=1&selm=linux.kernel.200112081539.fB8FdFj03048%40orp.orf.cx
or see the recent thread "2.4.16 memory badness (reproducible)". The
behavior I cite with 2.4.16 is identical to what happens with
2.4.17pre4aa1, but here it is again. It is reproducible.
Machine is 1.4GHZ Athlon with 1 GB memory, 2 GB swap, RH 7.2 with
updates.

home[1001]:/home/orf% uname -a
Linux orp.orf.cx 2.4.17-pre4 #1 Mon Dec 10 22:09:16 EST 2001 i686 unknown
(it's been patched with 2.4.17pre4aa1.bz2)
(updatedb updates RedHat's file database, does lots of file I/O)

home[1005]:/home/orf% free
             total       used       free     shared    buffers     cached
Mem:       1029780     207976     821804          0      49468      71856
-/+ buffers/cache:      86652     943128
Swap:      2064344       6324    2058020

home[1006]:/home/orf% sudo updatedb
Password:

home[1007]:/home/orf% free
             total       used       free     shared    buffers     cached
Mem:       1029780    1017576      12204          0     471548      70924
-/+ buffers/cache:     475104     554676
Swap:      2064344       6312    2058032

home[1008]:/home/orf% xmms
Memory fault

home[1009]:/home/orf% strace xmms 2>&1 | tail
old_mmap(NULL, 1291080, PROT_READ|PROT_EXEC, MAP_PRIVATE, 3, 0) = 0x40316000
mprotect(0x40448000, 37704, PROT_NONE) = 0
old_mmap(0x40448000, 24576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED, 3, 0x131000) = 0x40448000
old_mmap(0x4044e000, 13128, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x4044e000
close(3) = 0
old_mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x40452000
munmap(0x40018000, 72492) = 0
modify_ldt(0x1, 0xbffff33c, 0x10) = -1 ENOMEM (Cannot allocate memory)
--- SIGSEGV (Segmentation fault) ---
+++ killed by SIGSEGV +++

Note that some applications don't mem fault this way, but all the ones
that do die at modify_ldt (see my previous post).

home[1010]:/home/orf% cat /proc/meminfo
        total:       used:       free:   shared:    buffers:     cached:
Mem:  1054494720  1041756160    12738560        0  481837056    77209600
Swap: 2113888256     6463488  2107424768
MemTotal: 1029780 kB
MemFree: 12440 kB
MemShared: 0 kB
Buffers: 470544 kB
Cached: 71388 kB
SwapCached: 4012 kB
Active: 367796 kB
Inactive: 232088 kB
HighTotal: 130992 kB
HighFree: 2044 kB
LowTotal: 898788 kB
LowFree: 10396 kB
SwapTotal: 2064344 kB
SwapFree: 2058032 kB


home[1011]:/home/orf% cat /proc/slabinfo
slabinfo - version: 1.1
kmem_cache 65 68 112 2 2 1
ip_conntrack 22 50 384 5 5 1
nfs_write_data 0 0 384 0 0 1
nfs_read_data 0 0 384 0 0 1
nfs_page 0 0 128 0 0 1
ip_fib_hash 10 112 32 1 1 1
urb_priv 0 0 64 0 0 1
clip_arp_cache 0 0 128 0 0 1
ip_mrt_cache 0 0 128 0 0 1
tcp_tw_bucket 0 0 128 0 0 1
tcp_bind_bucket 17 112 32 1 1 1
tcp_open_request 0 0 128 0 0 1
inet_peer_cache 2 59 64 1 1 1
ip_dst_cache 56 80 192 4 4 1
arp_cache 3 30 128 1 1 1
blkdev_requests 640 660 128 22 22 1
journal_head 0 0 48 0 0 1
revoke_table 0 0 12 0 0 1
revoke_record 0 0 32 0 0 1
dnotify cache 0 0 20 0 0 1
file lock cache 2 42 92 1 1 1
fasync cache 2 202 16 1 1 1
uid_cache 7 112 32 1 1 1
skbuff_head_cache 293 320 192 16 16 1
sock 131 132 1280 44 44 1
sigqueue 4 29 132 1 1 1
cdev_cache 2313 2360 64 40 40 1
bdev_cache 8 59 64 1 1 1
mnt_cache 19 59 64 1 1 1
inode_cache 452259 452263 512 64609 64609 1
dentry_cache 469963 469980 128 15666 15666 1
dquot 0 0 128 0 0 1
filp 1633 1650 128 55 55 1
names_cache 0 2 4096 0 2 1
buffer_head 136268 164880 128 5496 5496 1
mm_struct 54 60 192 3 3 1
vm_area_struct 2186 2250 128 73 75 1
fs_cache 53 59 64 1 1 1
files_cache 53 63 448 6 7 1
signal_act 61 63 1344 21 21 1
size-131072(DMA) 0 0 131072 0 0 32
size-131072 0 0 131072 0 0 32
size-65536(DMA) 0 0 65536 0 0 16
size-65536 1 1 65536 1 1 16
size-32768(DMA) 0 0 32768 0 0 8
size-32768 1 1 32768 1 1 8
size-16384(DMA) 0 0 16384 0 0 4
size-16384 1 3 16384 1 3 4
size-8192(DMA) 0 0 8192 0 0 2
size-8192 5 7 8192 5 7 2
size-4096(DMA) 0 0 4096 0 0 1
size-4096 70 73 4096 70 73 1
size-2048(DMA) 0 0 2048 0 0 1
size-2048 64 68 2048 34 34 1
size-1024(DMA) 0 0 1024 0 0 1
size-1024 11028 11032 1024 2757 2758 1
size-512(DMA) 0 0 512 0 0 1
size-512 12029 12032 512 1504 1504 1
size-256(DMA) 0 0 256 0 0 1
size-256 1609 1635 256 109 109 1
size-128(DMA) 2 30 128 1 1 1
size-128 29383 29430 128 980 981 1
size-64(DMA) 0 0 64 0 0 1
size-64 9105 9145 64 155 155 1
size-32(DMA) 34 59 64 1 1 1
size-32 70942 70977 64 1203 1203 1

Leigh Orf

2001-12-12 08:40:47

by Andrew Morton

Subject: Re: 2.4.16 & OOM killer screw up (fwd)

Andrea Arcangeli wrote:
>
>
> [ big snip. Addressed in other email ]
>
> it should be simple, mainline swapouts more, so it's less likely to
> trash away some useful cache.
>
> just try -aa after a:
>
> echo 10 >/proc/sys/vm/vm_mapped_ratio
>
> it should swapout more and better preserve the cache.

-aa swapout balancing seems very good indeed to me.

> > > > In my swapless testing, I burnt HUGE amounts of CPU in flush_tlb_others().
> > > > So we're madly trying to swap pages out and finding that there's no swap
> > > > space. I believe that when we find there's no swap left we should move
> > > > the page onto the active list so we don't keep rescanning it pointlessly.
> > >
> > > yes, however I think the swap-flood with no swap isn't a very
> > > interesting case to optimize.
> >
> > Running swapless is a valid configuration, and the kernel is doing
>
> I'm not saying it's not valid or non interesting.
>
> It's the mix "I'm running out of memory and I'm swapless" that is the
> case not interesting to optimize.
>
> If you're swapless it means you've enough memory and that you're not
> running out of swap. Otherwise _you_ (not the kernel) are wrong not
> having swap.

um. Spose so.

> ...
>
> > The VM code lacks comments, and nobody except yourself understands
> > what it is supposed to be doing. That's a bug, don't you think?
>
> Lack of documentation is not a bug, period. Also it's not true that I'm
> the only one who understands it. For istance Linus understand it
> completly, I am 100% sure.
>
> Anyways I wrote a dozen of slides on the VM with some graph showing the
> design of the VM if anybody can better learn from a slide than from the
> code.

That's good. Your elevator design slides were very helpful. However
offline documentation tends to go stale. A nice big block comment
maintained by a programmer who cares goes a loooong way.

> I believe the slides are useful to understand the design, but if you
> want to change one line of code slides or not you've to read the code.
> Everybody is complaining about documentation. This is a red-herring.
> There's no documentation that allows you to hack the previous VM code.
> I'd ask how many of the people happy with the previous documentation
> were effectively VM developers. Except for some possible misleading
> comment in the current code that we may have not updated yet, I don't
> think there's been a regression in documentation.
>

Sigh. Just because the current core kernel looks like it was
scrawled in crayon by an infant doesn't mean that everyone has
to eschew literate, mature, competent and maintainable programming
practices.

-

2001-12-12 08:45:27

by Andrew Morton

Subject: Re: 2.4.16 & OOM killer screw up (fwd)

Andrea Arcangeli wrote:
>
> On Tue, Dec 11, 2001 at 11:32:25AM -0200, Rik van Riel wrote:
> > On Mon, 10 Dec 2001, Andrew Morton wrote:
> >
> > > This test on a 64 megabyte machine, on ext2:
> > >
> > > time (tar xfz /nfsserver/linux-2.4.16.tar.gz ; sync)
> > >
> > > On 2.4.17-pre7 it takes 21 seconds. On -aa it is much slower: 36 seconds.
> >
> > > Execution time for `make -j12 bzImage' on a 64meg RAM/512 meg swap
> > > dual x86:
> > >
> > > -aa: 4 minutes 20 seconds
> > > 2.4.7-pre8 4 minutes 8 seconds
> > > 2.4.7-pre8 plus the below patch: 3 minutes 55 seconds
> >
> >
> > Andrea, it seems -aa is not the holy grail VM-wise. If you want
>
> it may be not a holy grail in swap benchmarks and flood of writes to
> disk, those are minor performance regressions, but I have no one single
> bug report related to "stability".

Your patch increases the time to untar a kernel tree by seventy five
percent. That's a fairly major minor regression.

> The only thing I got back from Andrew is been "it runs a little slower"
> in those two tests.

The swapstorm I agree is uninteresting. The slowdown with a heavy write
load impacts a very common usage, and I've told you how to mostly fix
it. You need to back out the change to bdflush.

> and of course he didn't even attempted to benchmark the interactive
> feeling that was the _whole_ point of my buffer.c and elevator changes.

As far as I know, at no point in time have you told anyone that
this was an objective of your latest patch. So of course I
didn't test for it.

Interactivity is indeed improved. It has gone from catastrophic to
horrid.

There are four basic tests I use to quantify this, all with 64 megs of
memory:

1: Start a continuous write, and on a different partition, time how
long it takes to read a 16 megabyte file.

Here, -aa takes 40 seconds. Stock 2.4.17-pre8 takes 71 seconds.
2.4.17-pre8 with the same elevator settings as in -aa takes
40 seconds.

Large writes are slowing reads by a factor of 100.

2: Start a continuous write and, from another machine, run

time ssh -X otherhost xterm -e true

On -aa this takes 68 seconds. On 2.4.17-pre8 it takes over
three minutes. I got bored and killed it. The problem can't
be fixed on 2.4.17-pre8 with tuning - it's probably due to the
poor page replacement - stuff is getting swapped out. This is
a significant problem in 2.4.17-pre and we need a fix for it.

3: Run `cp -a linux/ junk'. Time how long it takes to read a 16 meg file.

There's no appreciable difference between any of the kernels here.
It varies from 2 seconds to 10, and is generally OK.

4: Run `cp -a linux/ junk'. time ssh -X otherhost xterm -e true

Varies between three and five seconds, depending on elvtune settings.
No noticeable difference between any kernels.

It's tests 1 and 2 which are interesting, because we perform so
very badly. And no amount of fiddling with buffer.c or elvtune settings
is going to fix it, because they don't address the core problem.

Which is: when the elevator can't merge a read it sticks it at the
end of the request queue, behind all the writes.

I'll be submitting a little patch for 2.4.18-pre which allows the user
to tunably promote reads ahead of most of the writes. It improves
tests 1 and 2 by a factor of eight to twelve.

> So as far as I'm concerned 2.4.15aa1 and 2.4.17pre?aa? are just rock
> solid and usable in production.

I haven't done much stability testing - without a description of what the
changes are trying to do, I can't test them - all I could do is blindly
run stress tests and I'm sure your QA team can do that as well as I,
on bigger boxes.

But I don't doubt that it's stable. However Red Hat's QA guys are
pretty good at knocking kernels over...

gargh. Ninety seconds of bash-shared-mapping and I get "end-request:
buffer-list destroyed" against the swap device. Borked IDE driver.
Seems stable on SCSI.

The -aa VM is still a little prone to tossing out "0-order allocation
failures" when there's tons of swap available and when much memory
is freeable by dropping or writing back to shared mappings. But
this doesn't seem to cause any problems, as long as there's some
memory available for atomic allocations, and I never saw free
memory go below 800 kbytes...

> We'll keep doing background benchmarking and changes that cannot
> affect stability, but the core design is finished as far I can tell.

We'll know when it gets wider testing in the runup to 2.4.18. The
fact that I found a major (although easily fixed) performance problem
in the first ten minutes indicates that caution is needed, yes?

What's the thinking with the changes to dcache/icache flushing?
A single d/icache entry can save three seeks, which is _enormous_ value for
just a few hundred bytes of memory. You appear to be shrinking the i/dcache
by 12% each time you try to swap out or evict 32 pages. What this means
is that as soon we start to get a bit short on memory, the i/dcache vanishes.
And it takes ages to read that stuff back in. How did you test this? Without
having done (or even devised) any quantitative testing myself, I have a gut
feel that we need to preserve the i/dcache (versus file data) much more than
this.



Oh. Maybe the core design (whatever it is :)) is not finished,
because it retains the bone-headed, dumb-to-the-point-of-astonishing
misfeature which Linux VM has always had:

If someone is linearly writing (or reading) a gigabyte file on a 64
megabyte box they *don't* want the VM to evict every last little scrap
of cache on behalf of data which they *obviously* do not want
cached.

It's good that -aa VM doesn't summarily dump the i/dcache and plonk
everything you want into swap when this happens. Progress.


So. To summarise.

- Your attempt to address read latencies didn't work out, and should
be dropped (hopefully Marcelo and Jens are OK with an elevator hack :))

- We urgently need a fix for 2.4.17's page replacement problems.

- aa is good. Believe it or not, I like it. The mm/* portions fix
significant performance problems in our current VM. I guess we
should bite the bullet and merge it all in 2.4.18-pre

-

2001-12-12 09:21:28

by Andrea Arcangeli

Subject: Re: 2.4.16 & OOM killer screw up (fwd)

On Wed, Dec 12, 2001 at 12:44:17AM -0800, Andrew Morton wrote:
> Andrea Arcangeli wrote:
> >
> > On Tue, Dec 11, 2001 at 11:32:25AM -0200, Rik van Riel wrote:
> > > On Mon, 10 Dec 2001, Andrew Morton wrote:
> > >
> > > > This test on a 64 megabyte machine, on ext2:
> > > >
> > > > time (tar xfz /nfsserver/linux-2.4.16.tar.gz ; sync)
> > > >
> > > > On 2.4.17-pre7 it takes 21 seconds. On -aa it is much slower: 36 seconds.
> > >
> > > > Execution time for `make -j12 bzImage' on a 64meg RAM/512 meg swap
> > > > dual x86:
> > > >
> > > > -aa: 4 minutes 20 seconds
> > > > 2.4.7-pre8 4 minutes 8 seconds
> > > > 2.4.7-pre8 plus the below patch: 3 minutes 55 seconds
> > >
> > >
> > > Andrea, it seems -aa is not the holy grail VM-wise. If you want
> >
> > it may be not a holy grail in swap benchmarks and flood of writes to
> > disk, those are minor performance regressions, but I have no one single
> > bug report related to "stability".
>
> Your patch increases the time to untar a kernel tree by seventy five
> percent. That's a fairly major minor regression.
>
> > The only thing I got back from Andrew is been "it runs a little slower"
> > in those two tests.
>
> The swapstorm I agree is uninteresting. The slowdown with a heavy write
> load impacts a very common usage, and I've told you how to mostly fix
> it. You need to back out the change to bdflush.

I guess I should drop the run_task_queue(&tq_disk) instead of replacing
it back with a wait_for_some_buffers().

> > and of course he didn't even attempted to benchmark the interactive
> > feeling that was the _whole_ point of my buffer.c and elevator changes.
>
> As far as I know, at no point in time have you told anyone that
> this was an objective of your latest patch. So of course I
> didn't test for it.
>
> Interactivity is indeed improved. It has gone from catastrophic to
> horrid.

:)

>
> There are four basic tests I use to quantify this, all with 64 megs of
> memory:
>
> 1: Start a continuous write, and on a different partition, time how
> long it takes to read a 16 megabyte file.
>
> Here, -aa takes 40 seconds. Stock 2.4.17-pre8 takes 71 seconds.
> 2.4.17-pre8 with the same elevator settings as in -aa takes
> 40 seconds.
>
> Large writes are slowing reads by a factor of 100.
>
> 2: Start a continuous write and, from another machine, run
>
> time ssh -X otherhost xterm -e true
>
> On -aa this takes 68 seconds. On 2.4.17-pre8 it takes over
> three minutes. I got bored and killed it. The problem can't
> be fixed on 2.4.17-pre8 with tuning - it's probably due to the
> poor page replacement - stuff is getting swapped out. This is
> a significant problem in 2.4.17-pre and we need a fix for it.
>
> 3: Run `cp -a linux/ junk'. Time how long it takes to read a 16 meg file.
>
> There's no appreciable difference between any of the kernels here.
> It varies from 2 seconds to 10, and is generally OK.
>
> 4: Run `cp -a linux/ junk'. time ssh -X otherhost xterm -e true
>
> Varies between three and five seconds, depending on elvtune settings.
> No noticeable difference between any kernels.
>
> It's tests 1 and 2 which are interesting, because we perform so
> very badly. And no amount of fiddling buffer.c or elvtune settings
> is going to fix it, because they don't address the core problem.
>
> Which is: when the elevator can't merge a read it sticks it at the
> end of the request queue, behind all the writes.
>
> I'll be submitting a little patch for 2.4.18-pre which allows the user
> to tunably promote reads ahead of most of the writes. It improves
> tests 1 and 2 by a factor of eight to twelve.

Note that the first elevator (not elevator_linus) could handle this
case, however it was too complicated and I've been told it was hurting
the performance of things like dbench etc. too much. But it would have
allowed your test number 2 to take only a few seconds, for example. Quite
frankly all my benchmarks were latency oriented, and I couldn't notice
a huge drop in performance, but OTOH at that time my test box had a
10 Mbyte/sec HD, and I know from experience that on such an HD the
numbers tend to be very different than on fast SCSI (my current test HD
is 33 Mbyte/sec IDE), so I think they were right.

> > So as far as I'm concerned 2.4.15aa1 and 2.4.17pre?aa? are just rock
> > solid and usable in production.
>
> I haven't done much stability testing - without a description of what the
> changes are trying to do, I can't test them - all I could do is blindly
> run stress tests and I'm sure your QA team can do that as well as I,
> on bigger boxes.
>
> But I don't doubt that it's stable. However Red Hat's QA guys are
> pretty good at knocking kernels over...
>
> gargh. Ninety seconds of bash-shared-mapping and I get "end-request:
> buffer-list destroyed" against the swap device. Borked IDE driver.
> Seems stable on SCSI.
>
> The -aa VM is still a little prone to tossing out "0-order allocation
> failures" when there's tons of swap available and when much memory
> is freeable by dropping or writing back to shared mappings. But
> this doesn't seem to cause any problems, as long as there's some
> memory available for atomic allocations, and I never saw free
> memory go below 800 kbytes...

It mostly tends to fail for GFP_NOIO and friends, where the allocator
cannot block, and I believe that's correct: looping forever inside the
allocator can only lead to deadlocks. The GFP_NOIO users have retry
loops outside the allocator where required.

A failure means that unless somebody else does something for us, we
couldn't allocate anything. Thus SCHED_YIELD and try again.
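
In userspace terms, the caller-side pattern is roughly the following
(just a standalone sketch of the idea, not kernel code; try_alloc_page()
and alloc_page_retry() are invented names for the illustration):

#include <sched.h>
#include <stdio.h>
#include <stdlib.h>

/* Stand-in for a GFP_NOIO-style allocation that may simply fail. */
static void *try_alloc_page(void)
{
	return malloc(4096);
}

/* The retry loop lives in the caller, not inside the allocator. */
static void *alloc_page_retry(void)
{
	void *page;

	while ((page = try_alloc_page()) == NULL)
		sched_yield();	/* let somebody else free memory for us */
	return page;
}

int main(void)
{
	void *page = alloc_page_retry();
	printf("allocated %p\n", page);
	free(page);
	return 0;
}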

> > We'll keep doing background benchmarking and changes that cannot
> > affect stability, but the core design is finished as far I can tell.
>
> We'll know when it gets wider testing in the runup to 2.4.18. The
> fact that I found a major (although easily fixed) performance problem
> in the first ten minutes indicates that caution is needed, yes?

I consider that minor tuning (as you said, removing the run_task_queue()
in bdflush may be enough to cure the tar xzf case; I will run some tests).

> What's the thinking with the changes to dcache/icache flushing?
> A single d/icache entry can save three seeks, which is _enormous_ value for
> just a few hundred bytes of memory. You appear to be shrinking the i/dcache
> by 12% each time you try to swap out or evict 32 pages. What this means

yes.

> is that as soon we start to get a bit short on memory, the i/dcache vanishes.
> And it takes ages to read that stuff back in. How did you test this? Without
> having done (or even devised) any quantitative testing myself, I have a gut
> feel that we need to preserve the i/dcache (versus file data) much more than
> this.

The problem is ZONE_NORMAL: if we fail to shrink the page cache we
_must_ shrink the dcache/icache as well to be correct (at the very least
when the classzone is < ZONE_HIGHMEM). Otherwise ZONE_NORMAL/DMA
allocations can fail forever and you won't be able to fork a new task
any longer. I tested this with a ZONE_NORMAL of 1/2 MB using highmem
emulation. Of course that makes the problem trivially reproducible, but
it could happen on larger boxes as well, at least in theory, and I want
to cover all the cases as best as I can.
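
In outline the ordering is something like this (a standalone, compilable
sketch of the idea, not the real shrink_caches(); the names, the
numbers and the 12% figure are only illustrative):

#include <stdio.h>

enum classzone { CLASS_DMA, CLASS_NORMAL, CLASS_HIGHMEM };

static int page_cache_pages = 40;
static int dentry_inode_objects = 1000;

static int shrink_page_cache(int want)
{
	int got = page_cache_pages < want ? page_cache_pages : want;
	page_cache_pages -= got;
	return got;
}

static int shrink_dcache_icache(int percent)
{
	int got = dentry_inode_objects * percent / 100;
	dentry_inode_objects -= got;
	return got;	/* pretend each object releases about a page */
}

static int shrink_caches(enum classzone classzone, int want)
{
	int freed = shrink_page_cache(want);

	/*
	 * If the page cache alone was not enough and the request must be
	 * satisfied below highmem, the dcache/icache must give memory
	 * back too, otherwise ZONE_NORMAL/DMA can stay pinned forever.
	 */
	if (freed < want && classzone < CLASS_HIGHMEM)
		freed += shrink_dcache_icache(12);
	return freed;
}

int main(void)
{
	printf("highmem request freed %d\n", shrink_caches(CLASS_HIGHMEM, 32));
	printf("lowmem request freed %d\n", shrink_caches(CLASS_NORMAL, 32));
	return 0;
}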

> Oh. Maybe the core design (whatever it is :)) is not finished,
> because it retains the bone-headed, dumb-to-the-point-of-astonishing
> misfeature which Linux VM has always had:
>
> If someone is linearly writing (or reading) a gigabyte file on a 64
> megabyte box they *don't* want the VM to evict every last little scrap
> of cache on behalf of data which they *obviously* do not want
> cached.

The current design tries to detect this, at least much better than 2.2
did. This is why I disagree with Rik's patch of yesterday: detecting
cache pollution is valuable on lowmem boxes too (not only for DBs).

> It's good that -aa VM doesn't summarily dump the i/dcache and plonk
> everything you want into swap when this happens. Progress.
>
>
> So. To summarise.
>
> - Your attempt to address read latencies didn't work out, and should
> be dropped (hopefully Marcelo and Jens are OK with an elevator hack :))

It should not be dropped. And it's not a hack; I only enabled code that
was effectively disabled due to the huge numbers. It will behave like
2.2.20.

Now, what you want to add is a hack to move reads to the top of the
request queue, and if you go back to 2.3.5x you'll see I was doing
exactly that; it was the first thing I did while playing with the
elevator. Latency-wise it worked great. I'm sure somebody remembers the
kind of latency you could get with such an elevator.

Then I got flames from Linus and Ingo claiming that I had screwed up the
elevator and was the source of the bad 2.3.x I/O performance, so they
required the elevator to be nearly rewritten in a way that obviously
couldn't hurt the benchmarks. So Jens dropped part of my latency-capable
elevator and did elevator_linus, which of course cannot hurt benchmark
performance, but which has the usual problem that you need to wait a
minute for an xterm to start under a write flood.

However, my objective was to avoid nearly infinite starvation, and
elevator_linus avoids it (you can start the xterm within a minute;
previously, in early 2.3 and 2.2, you would have had to wait for the
disk to fill up, which could take days with a terabyte of data). So I
was pretty much fine with elevator_linus too, but we knew very well that
reads would again be starved significantly (even if not indefinitely).

Many thanks for the help!!

Andrea

2001-12-12 09:46:32

by Rik van Riel

[permalink] [raw]
Subject: Re: 2.4.16 & OOM killer screw up (fwd)

On Wed, 12 Dec 2001, Andrea Arcangeli wrote:
> On Wed, Dec 12, 2001 at 12:44:17AM -0800, Andrew Morton wrote:
> > Oh. Maybe the core design (whatever it is :)) is not finished,
> > because it retains the bone-headed, dumb-to-the-point-of-astonishing
> > misfeature which Linux VM has always had:
> >
> > If someone is linearly writing (or reading) a gigabyte file on a 64
> > megabyte box they *don't* want the VM to evict every last little scrap
> > of cache on behalf of data which they *obviously* do not want
> > cached.
>
> The current design tries to detect this, at least much better than 2.2
> did. This is why I disagree with Rik's patch of yesterday: detecting
> cache pollution is valuable on lowmem boxes too (not only for DBs).

Oh, absolutely. The problem is just that the current design
has an even worse problem: it doesn't put any pressure on
pages which were touched twice an hour ago.

This leads to the situation that applications get OOM-killed
to preserve buffer cache memory which hasn't been touched
since bootup time.

There are ways to both have good behaviour on bulk IO and
flush out old data which was in active use but no longer is.
I believe these are called page aging and drop-behind.
I've been thinking about achieving the wanted behaviour
without these two, but haven't been able to come up with
any algorithm which doesn't have some very bad side effects.
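
As a toy illustration of what drop-behind buys you (a standalone sketch,
not kernel code; the single LRU array and all the names are made up):

#include <stdio.h>

#define LRU_SIZE 8

/* lru[0] is the reclaim end, lru[used-1] is the most recently used. */
static int lru[LRU_SIZE];
static int used;

static void reclaim_one(void)
{
	for (int i = 1; i < used; i++)
		lru[i - 1] = lru[i];
	used--;
}

/* Normal touch: the page goes to the young end. */
static void touch_page(int page)
{
	if (used == LRU_SIZE)
		reclaim_one();
	lru[used++] = page;
}

/* Drop-behind: a streaming page goes straight to the reclaim end. */
static void touch_streaming_page(int page)
{
	if (used == LRU_SIZE)
		reclaim_one();
	for (int i = used; i > 0; i--)
		lru[i] = lru[i - 1];
	lru[0] = page;
	used++;
}

int main(void)
{
	for (int page = 0; page < 4; page++)
		touch_page(page);		/* the working set */
	for (int page = 100; page < 120; page++)
		touch_streaming_page(page);	/* a big sequential read */

	/* The bulk I/O recycled its own pages; pages 0-3 are still here. */
	printf("resident:");
	for (int i = 0; i < used; i++)
		printf(" %d", lru[i]);
	printf("\n");
	return 0;
}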

If you know a way of doing bulk IO properly and flushing out
an old working set correctly, please let us know.

regards,

Rik
--
Shortwave goes a long way: irc.starchat.net #swl

http://www.surriel.com/ http://distro.conectiva.com/

2001-12-12 10:09:43

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: 2.4.16 & OOM killer screw up (fwd)

On Wed, Dec 12, 2001 at 07:45:45AM -0200, Rik van Riel wrote:
> On Wed, 12 Dec 2001, Andrea Arcangeli wrote:
> > On Wed, Dec 12, 2001 at 12:44:17AM -0800, Andrew Morton wrote:
> > > Oh. Maybe the core design (whatever it is :)) is not finished,
> > > because it retains the bone-headed, dumb-to-the-point-of-astonishing
> > > misfeature which Linux VM has always had:
> > >
> > > If someone is linearly writing (or reading) a gigabyte file on a 64
> > > megabyte box they *don't* want the VM to evict every last little scrap
> > > of cache on behalf of data which they *obviously* do not want
> > > cached.
> >
> > The current design tries to detect this, at least much better than 2.2
> > did. This is why I disagree with Rik's patch of yesterday: detecting
> > cache pollution is valuable on lowmem boxes too (not only for DBs).
>
> Oh, absolutely. The problem is just that the current design
> has an even worse problem: it doesn't put any pressure on
> pages which were touched twice an hour ago.

It does. See the refill_inactive pass.

> This leads to the situation that applications get OOM-killed
> to preserve buffer cache memory which hasn't been touched
> since bootup time.

It doesn't happen here.

At the very least, the fix is the two-liner from Andrew that forces a
refile of nr_pages from the active list; that will guarantee that,
whatever happens, we always roll the active list too. But the OOM
killing you are experiencing is a problem of mainline, it definitely
doesn't happen here, and refill_inactive(0) cannot be the culprit
because the active list always grows to a relevant size. And if during
oom a few pages stay untouched on the active list, that's fine: those
few pages couldn't save us anyway, so they'd better stay there so we
don't thrash.

>
> There are ways to both have good behaviour on bulk IO and
> flush out old data which was in active use but no longer is.
> I believe these are called page aging and drop-behind.
> I've been thinking about achieving the wanted behaviour
> without these two, but haven't been able to come up with
> any algorithm which doesn't have some very bad side effects.
>
> If you know a way of doing bulk IO properly and flushing out
> an old working set correctly, please let us know.
>
> regards,
>
> Rik
> --
> Shortwave goes a long way: irc.starchat.net #swl
>
> http://www.surriel.com/ http://distro.conectiva.com/


Andrea

2001-12-12 10:01:24

by Andrew Morton

[permalink] [raw]
Subject: Re: 2.4.16 & OOM killer screw up (fwd)

Andrea Arcangeli wrote:
>
> ...
> > The swapstorm I agree is uninteresting. The slowdown with a heavy write
> > load impacts a very common usage, and I've told you how to mostly fix
> > it. You need to back out the change to bdflush.
>
> > I guess I should drop the run_task_queue(&tq_disk) instead of replacing
> > it with a wait_for_some_buffers() again.

hum. Nope, it definitely wants the wait_for_locked_buffers() in there.
36 seconds versus 25. (21 on stock kernel)

My theory is that balance_dirty() is directing heaps of wakeups
to bdflush, so bdflush just keeps on running. I'll take a look
tomorrow.

(If we're sending that many wakeups, we should do a waitqueue_active
test in wakeup_bdflush...)
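
Something along these lines (a standalone sketch of the idea only, not
the real wakeup_bdflush(); the boolean below stands in for
waitqueue_active(&bdflush_wait)):

#include <stdbool.h>
#include <stdio.h>

static bool bdflush_sleeping;	/* stand-in for waitqueue_active() */
static unsigned long wakeups_delivered;

static void wakeup_bdflush_sketch(void)
{
	/* Skip the wakeup entirely if nobody is waiting on the queue. */
	if (!bdflush_sleeping)
		return;
	bdflush_sleeping = false;
	wakeups_delivered++;
}

int main(void)
{
	bdflush_sleeping = true;

	/* A burst of balance_dirty()-style callers. */
	for (int i = 0; i < 1000; i++)
		wakeup_bdflush_sketch();

	printf("wakeups actually delivered: %lu\n", wakeups_delivered);
	return 0;
}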

> ...
>
> > Note that the first elevator (not elevator_linus) could handle this
> > case, but it was too complicated and I was told it hurt the performance
> > of things like dbench etc. too much. It would have let your test number
> > 2 finish in a few seconds, for example. Quite frankly, all my benchmarks
> > were latency oriented and I couldn't see a huge drop in throughput, but
> > OTOH at that time my test box had a 10 MB/sec disk, and I know from
> > experience that numbers on such a disk tend to be very different from a
> > fast SCSI disk or my current 33 MB/sec IDE test disk, so I think they
> > were right.

OK, well I think I'll make it so the feature defaults to "off" - no
change in behaviour. People need to run `elvtune -b non-zero-value'
to turn it on.

So what is then needed is testing to determine the latency-versus-throughput
tradeoff. Andries takes manpage patches :)

> ...
> > - Your attempt to address read latencies didn't work out, and should
> > be dropped (hopefully Marcelo and Jens are OK with an elevator hack :))
>
> It should not be dropped. And it's not a hack; I only enabled code that
> was effectively disabled due to the huge numbers. It will behave like
> 2.2.20.

Sorry, I was referring to the elevator-bypass patch. Jens called
it a hack ;)

> Now, what you want to add is a hack to move reads to the top of the
> request queue, and if you go back to 2.3.5x you'll see I was doing
> exactly that; it was the first thing I did while playing with the
> elevator. Latency-wise it worked great. I'm sure somebody remembers the
> kind of latency you could get with such an elevator.
>
> Then I got flames from Linus and Ingo claiming that I had screwed up the
> elevator and was the source of the bad 2.3.x I/O performance, so they
> required the elevator to be nearly rewritten in a way that obviously
> couldn't hurt the benchmarks. So Jens dropped part of my latency-capable
> elevator and did elevator_linus, which of course cannot hurt benchmark
> performance, but which has the usual problem that you need to wait a
> minute for an xterm to start under a write flood.
>
> However, my objective was to avoid nearly infinite starvation, and
> elevator_linus avoids it (you can start the xterm within a minute;
> previously, in early 2.3 and 2.2, you would have had to wait for the
> disk to fill up, which could take days with a terabyte of data). So I
> was pretty much fine with elevator_linus too, but we knew very well that
> reads would again be starved significantly (even if not indefinitely).
>

OK, thanks.

As long as the elevator-bypass tunable gives a good range of
latency-versus-throughput tuning then I'll be happy. It's a
bit sad that in even the best case, reads are penalised by a
factor of ten when there are writes happening.

But fixing that would require major readahead surgery, and perhaps
implementation of anticipatory scheduling, as described in
http://www.cse.ucsc.edu/~sbrandt/290S/anticipatoryscheduling.pdf
which is out of scope.
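
For reference, the read-bypass idea itself can be sketched in a few
lines (a standalone toy, not the actual patch; the request list, the
per-write bypass budget and all names here are invented):

#include <stdio.h>
#include <stdlib.h>

struct req {
	int write;		/* 0 = read, 1 = write */
	int bypass_left;	/* how many more reads may overtake this write */
	struct req *next;
};

static struct req *queue;

static struct req *new_req(int write, int budget)
{
	struct req *r = calloc(1, sizeof(*r));
	r->write = write;
	r->bypass_left = budget;
	return r;
}

static void queue_write(struct req *r)
{
	struct req **p = &queue;
	while (*p)
		p = &(*p)->next;
	*p = r;			/* writes always go to the tail */
}

static void queue_read(struct req *r)
{
	struct req **insert = &queue, *q;

	/*
	 * A read may jump ahead of trailing writes, but never past another
	 * read or past a write whose bypass budget is already used up, so
	 * writes cannot be starved forever.
	 */
	for (q = queue; q; q = q->next)
		if (!q->write || q->bypass_left == 0)
			insert = &q->next;
	for (q = *insert; q; q = q->next)
		q->bypass_left--;	/* charge the writes we overtook */
	r->next = *insert;
	*insert = r;
}

int main(void)
{
	for (int i = 0; i < 4; i++)
		queue_write(new_req(1, 2));
	for (int i = 0; i < 3; i++)
		queue_read(new_req(0, 0));

	for (struct req *q = queue; q; q = q->next)
		printf("%s ", q->write ? "W" : "R");
	printf("\n");		/* prints: R R W W W W R */
	return 0;
}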

-

2001-12-12 10:15:43

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: 2.4.16 & OOM killer screw up (fwd)

On Wed, Dec 12, 2001 at 01:59:38AM -0800, Andrew Morton wrote:
> Andrea Arcangeli wrote:
> >
> > ...
> > > The swapstorm I agree is uninteresting. The slowdown with a heavy write
> > > load impacts a very common usage, and I've told you how to mostly fix
> > > it. You need to back out the change to bdflush.
> >
> > I guess I should drop the run_task_queue(&tq_disk) instead of replacing
> > it with a wait_for_some_buffers() again.
>
> hum. Nope, it definitely wants the wait_for_locked_buffers() in there.
> 36 seconds versus 25. (21 on stock kernel)

Please try without the wait_for_locked_buffers() and without the
run_task_queue(); just delete that line.

>
> My theory is that balance_dirty() is directing heaps of wakeups
> to bdflush, so bdflush just keeps on running. I'll take a look
> tomorrow.

Please delete the wait_on_buffers from balance_dirty() too, it's totally
broken there as well.

wait_on_something() _does_ wake up the queue just like a
run_task_queue() does; otherwise it would be a noop.

However, I need to take a closer look at the refiling of clean buffers
from the locked list to the clean lists; we should make sure not to
spend too much time there the first time a wait_on_buffers() is called
again...

> (If we're sending that many wakeups, we should do a waitqueue_active
> test in wakeup_bdflush...)
>
> > ...
> >
> > Note that the first elevator (not elevator_linus) could handle this
> > case, but it was too complicated and I was told it hurt the performance
> > of things like dbench etc. too much. It would have let your test number
> > 2 finish in a few seconds, for example. Quite frankly, all my benchmarks
> > were latency oriented and I couldn't see a huge drop in throughput, but
> > OTOH at that time my test box had a 10 MB/sec disk, and I know from
> > experience that numbers on such a disk tend to be very different from a
> > fast SCSI disk or my current 33 MB/sec IDE test disk, so I think they
> > were right.
>
> OK, well I think I'll make it so the feature defaults to "off" - no
> change in behaviour. People need to run `elvtune -b non-zero-value'
> to turn it on.

OK. BTW, I guess on this front it's only worth working on it in 2.5. We
know latency isn't very good in 2.4 and 2.2; we're more throughput
oriented there.

Ah, and of course to improve latency we could also reduce the size of
the I/O queue; I bet the queues are way oversized for a normal desktop.

>
> So what is then needed is testing to determine the latency-versus-throughput
> tradeoff. Andries takes manpage patches :)
>
> > ...
> > > - Your attempt to address read latencies didn't work out, and should
> > > be dropped (hopefully Marcelo and Jens are OK with an elevator hack :))
> >
> > It should not be dropped. And it's not a hack; I only enabled code that
> > was effectively disabled due to the huge numbers. It will behave like
> > 2.2.20.
>
> Sorry, I was referring to the elevator-bypass patch. Jens called
> it a hack ;)

Oh yes, that's a "hack" :), and it definitely works well for latency.

>
> > Now, what you want to add is a hack to move reads to the top of the
> > request queue, and if you go back to 2.3.5x you'll see I was doing
> > exactly that; it was the first thing I did while playing with the
> > elevator. Latency-wise it worked great. I'm sure somebody remembers the
> > kind of latency you could get with such an elevator.
> >
> > Then I got flames from Linus and Ingo claiming that I had screwed up the
> > elevator and was the source of the bad 2.3.x I/O performance, so they
> > required the elevator to be nearly rewritten in a way that obviously
> > couldn't hurt the benchmarks. So Jens dropped part of my latency-capable
> > elevator and did elevator_linus, which of course cannot hurt benchmark
> > performance, but which has the usual problem that you need to wait a
> > minute for an xterm to start under a write flood.
> >
> > However, my objective was to avoid nearly infinite starvation, and
> > elevator_linus avoids it (you can start the xterm within a minute;
> > previously, in early 2.3 and 2.2, you would have had to wait for the
> > disk to fill up, which could take days with a terabyte of data). So I
> > was pretty much fine with elevator_linus too, but we knew very well that
> > reads would again be starved significantly (even if not indefinitely).
> >
>
> OK, thanks.
>
> As long as the elevator-bypass tunable gives a good range of
> latency-versus-throughput tuning then I'll be happy. It's a
> bit sad that in even the best case, reads are penalised by a
> factor of ten when there are writes happening.
>
> But fixing that would require major readahead surgery, and perhaps
> implementation of anticipatory scheduling, as described in
> http://www.cse.ucsc.edu/~sbrandt/290S/anticipatoryscheduling.pdf
> which is out of scope.
>
> -


Andrea

2001-12-12 11:16:23

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: 2.4.16 & OOM killer screw up (fwd)

On Tue, Dec 11, 2001 at 04:27:23PM +0100, Daniel Phillips wrote:
> On December 11, 2001 03:23 pm, Andrea Arcangeli wrote:
> > As I said, I wrote some documentation on the VM for my last speech at
> > one of the most important Italian Linux events; it explains the basic
> > design. It should be published on their website as soon as I find the
> > time to send them the slides. I can post a link once it is online.
>
> Why not also post the whole thing as an email, right here?

I uploaded it here:

ftp://ftp.suse.com//pub/people/andrea/talks/english/2001/pluto-dec-pub-0.tar.gz

Hopefully it's understandable standalone.

> > It should allow non-VM-developers to understand the logic behind the VM
> > algorithm, but understanding those slides is far from enough to allow
> > anyone to hack the VM.
>
> It's a start.
>
> > I _totally_ agree with Linus when he said "real world is totally
> > dominated by the implementation details".
>
> Linus didn't say anything about not documenting the implementation details,
> nor did he say anything about not documenting in general.

Yes, my only point was that "documentation" isn't nearly enough, and
that it's not mandatory (given that the changes don't affect any user
API), but I certainly agree documentation helps.

Andrea

2001-12-12 20:01:41

by Daniel Phillips

[permalink] [raw]
Subject: Re: 2.4.16 & OOM killer screw up (fwd)

On December 12, 2001 12:16 pm, Andrea Arcangeli wrote:
> On Tue, Dec 11, 2001 at 04:27:23PM +0100, Daniel Phillips wrote:
> > On December 11, 2001 03:23 pm, Andrea Arcangeli wrote:
> > > As I said, I wrote some documentation on the VM for my last speech at
> > > one of the most important Italian Linux events; it explains the basic
> > > design. It should be published on their website as soon as I find the
> > > time to send them the slides. I can post a link once it is online.
> >
> > Why not also post the whole thing as an email, right here?
>
> I uploaded it here:

ftp://ftp.suse.com//pub/people/andrea/talks/english/2001/pluto-dec-pub-0.tar.gz

This is really, really useful.

Helpful hint: to run the slideshow, get magicpoint (debian users: apt-get
install mgp) and do:

mv pluto.mpg pluto.mgp # ;)
mgp pluto.mgp -x vflib

Helpful hint #2: Actually, just gv pluto.ps gets all the content.

Helpful hint #3: Actually, less pluto.mgp will do the trick (after the
rename) and lets you cut and paste the text, as I'm about to do...

Nit: "vm shrinking is not serialized with any other subsystem, it is also
only---^^^^
threaded against itself."

The big thing I see missing from this presentation is a discussion of how
icache, dcache etc fit into the picture, i.e., shrink_caches.

--
Daniel

2001-12-12 21:26:20

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: 2.4.16 & OOM killer screw up (fwd)

On Wed, Dec 12, 2001 at 09:03:20PM +0100, Daniel Phillips wrote:
> On December 12, 2001 12:16 pm, Andrea Arcangeli wrote:
> > On Tue, Dec 11, 2001 at 04:27:23PM +0100, Daniel Phillips wrote:
> > > On December 11, 2001 03:23 pm, Andrea Arcangeli wrote:
> > > > As I said, I wrote some documentation on the VM for my last speech at
> > > > one of the most important Italian Linux events; it explains the basic
> > > > design. It should be published on their website as soon as I find the
> > > > time to send them the slides. I can post a link once it is online.
> > >
> > > Why not also post the whole thing as an email, right here?
> >
> > I uploaded it here:
>
> ftp://ftp.suse.com//pub/people/andrea/talks/english/2001/pluto-dec-pub-0.tar.gz
>
> This is really, really useful.
>
> Helpful hint: to run the slideshow, get magicpoint (debian users: apt-get
> install mgp) and do:
>
> mv pluto.mpg pluto.mgp # ;)

8)

> mgp pluto.mgp -x vflib
>
> Helpful hint #2: Actually, just gv pluto.ps gets all the content.
>
> Helpful hint #3: Actually, less pluto.mgp will do the trick (after the
> rename) and lets you cut and paste the text, as I'm about to do...
>
> Nit: "vm shrinking is not serialized with any other subsystem, it is also
> only---^^^^
> threaded against itself."

correct.

> The big thing I see missing from this presentation is a discussion of how
> icache, dcache etc fit into the picture, i.e., shrink_caches.

Going into the differences between the icache/dcache and the pagecache
would have been too low level (and I should have spent some time
explaining what the icache and dcache are first ;), so as you noticed I
intentionally ignored those higher-level VFS caches in the slides. The
concept of "pages of cache" is usually well known by most people, so I
only considered the pagecache, which incidentally is also the most
interesting case for the VM. For seasoned kernel developers it would
have been interesting to integrate more of that material, of course, but
as you said this is a start at least :).

As for the icache/dcache shrinking, that's probably the roughest thing
we have in the VM at the moment. It just works.

Andrea

2001-12-12 22:06:25

by Ken Brownfield

[permalink] [raw]
Subject: Re: 2.4.16 & OOM killer screw up (fwd)

On Tue, Dec 11, 2001 at 01:43:46AM +0100, Andrea Arcangeli wrote:
| On Mon, Dec 10, 2001 at 05:08:44PM -0200, Marcelo Tosatti wrote:
| > Andrea,
| > Could you please start looking at any 2.4 VM issues which show up ?
|
| well, as far as I can tell no VM bug should be present in my latest -aa,
| so I think I'm finished. At the very least I know people are using
| 2.4.15aa1 and 2.4.17pre1aa1 in production on multigigabyte boxes under
| heavy VM load and I didn't get any bug report back yet.
[...]

I look forward to this stuff. 2.4 mainline falls down reliably and
completely when running updatedb on systems with a large number of used
inodes. Linus' VM/mmap patch helped a ton, but between general VM
issues and the i/dcache bloat I'm hoping that I won't have to redirect
my irritated users' ire into a karma pool to get these changes merged
into mainline where all of the knowledgeable folks here can beat out the
details.

I do think that the vast majority of users don't see this issue on
small-ish UP desktops. But I'm about to buy >100 SMP systems for
production expansion which will most likely be affected by this issue.
For me that emphasizes that these so-called corner cases really are
show-stoppers for Linux-as-more-than-toy.

Gimme the /proc interface (bdflush?) and let's bang on this stuff in
mainline. I need to stick with the latest -pre so I can track progress,
so 2.4.17pre4aa1 (or 10_vm-19) hasn't been a possibility for me... :-(

Cheers, just venting,
--
Ken.
[email protected]

PS: Nice catch on the NTFS vmalloc() issue.

| > Just please make sure that when sending a fix for something, send me _one_
| > problem and a patch which fixes _that_ problem.
|
| I will split something for you soon, at the moment I was doing some
| further benchmark.
|
| >
| > I'm tempted to look at VM, but I think I'll spend my limited time in a
| > better way if I review's others people work instead.
|
| until I split something out, you can see all the vm related changes in
| the 10_vm-* patches in my ftp area.
|
| Andrea
| -
| To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
| the body of a message to [email protected]
| More majordomo info at http://vger.kernel.org/majordomo-info.html
| Please read the FAQ at http://www.tux.org/lkml/

2001-12-12 22:31:26

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: 2.4.16 & OOM killer screw up (fwd)

On Wed, Dec 12, 2001 at 04:05:51PM -0600, Ken Brownfield wrote:
> On Tue, Dec 11, 2001 at 01:43:46AM +0100, Andrea Arcangeli wrote:
> | On Mon, Dec 10, 2001 at 05:08:44PM -0200, Marcelo Tosatti wrote:
> | > Andrea,
> | > Could you please start looking at any 2.4 VM issues which show up ?
> |
> | well, as far as I can tell no VM bug should be present in my latest -aa,
> | so I think I'm finished. At the very least I know people are using
> | 2.4.15aa1 and 2.4.17pre1aa1 in production on multigigabyte boxes under
> | heavy VM load and I didn't get any bug report back yet.
> [...]
>
> I look forward to this stuff. 2.4 mainline falls down reliably and
> completely when running updatedb on systems with a large number of used
> inodes. Linus' VM/mmap patch helped a ton, but between general VM
> issues and the i/dcache bloat I'm hoping that I won't have to redirect
> my irritated users' ire into a karma pool to get these changes merged
> into mainline where all of the knowledgeable folks here can beat out the
> details.
>
> I do think that the vast majority of users don't see this issue on
> small-ish UP desktops. But I'm about to buy >100 SMP systems for
> production expansion which will most likely be affected by this issue.
> For me that emphasizes that these so-called corner cases really are
> show-stoppers for Linux-as-more-than-toy.
>
> Gimme the /proc interface (bdflush?) and let's bang on this stuff in
> mainline. I need to stick with the latest -pre so I can track progress,
> so 2.4.17pre4aa1 (or 10_vm-19) hasn't been a possibility for me... :-(

I finished fixing the bdflush stuff that Andrew kindly pointed out.
Async writes are as fast as possible again now, and I also introduced
some hysteresis for bdflush to reduce the wakeup rate, plus I'm forcing
bdflush to do a significant amount of work rather than just NRSYNC
buffers. But I'm doing some other swapout benchmarking before releasing
a new -aa; I hope to finish tomorrow. Once I feel I'm finished I'll
split something out.

Anyway, here is a preview of the bdflush fixes for Andrew. It
definitely cures the performance problem for me; previously there were
too many reschedules. I also wonder whether balance_dirty() should write
nfract of the buffers, instead of only NRSYNC (or maybe something less
than ndirty but more than NRSYNC). Comments?

(Then BUF_LOCKED will contain all the clean buffers too, so it cannot
be accounted in balance_dirty() anymore; the VM will throttle on those
locked buffers, so that's not a problem.)

--- 2.4.17pre7aa1/fs/buffer.c.~1~ Mon Dec 10 16:10:40 2001
+++ 2.4.17pre7aa1/fs/buffer.c Wed Dec 12 19:16:23 2001
@@ -105,22 +105,23 @@
struct {
int nfract; /* Percentage of buffer cache dirty to
activate bdflush */
- int dummy1; /* old "ndirty" */
+ int ndirty; /* Maximum number of dirty blocks to write out per
+ wake-cycle */
int dummy2; /* old "nrefill" */
int dummy3; /* unused */
int interval; /* jiffies delay between kupdate flushes */
int age_buffer; /* Time for normal buffer to age before we flush it */
int nfract_sync;/* Percentage of buffer cache dirty to
activate bdflush synchronously */
- int dummy4; /* unused */
+ int nfract_stop_bdflush; /* Percentage of buffer cache dirty to stop bdflush */
int dummy5; /* unused */
} b_un;
unsigned int data[N_PARAM];
-} bdf_prm = {{20, 0, 0, 0, 5*HZ, 30*HZ, 40, 0, 0}};
+} bdf_prm = {{30, 500, 0, 0, 5*HZ, 30*HZ, 60, 20, 0}};

/* These are the min and max parameter values that we will allow to be assigned */
-int bdflush_min[N_PARAM] = { 0, 0, 0, 0, 0, 1*HZ, 0, 0, 0};
-int bdflush_max[N_PARAM] = {100,50000, 20000, 20000,10000*HZ, 10000*HZ, 100, 0, 0};
+int bdflush_min[N_PARAM] = { 0, 1, 0, 0, 0, 1*HZ, 0, 0, 0};
+int bdflush_max[N_PARAM] = {100,50000, 20000, 20000,10000*HZ, 10000*HZ, 100, 100, 0};

void unlock_buffer(struct buffer_head *bh)
{
@@ -181,7 +182,6 @@
bh->b_end_io = end_buffer_io_sync;
clear_bit(BH_Pending_IO, &bh->b_state);
submit_bh(WRITE, bh);
- conditional_schedule();
} while (--count);
}

@@ -217,11 +217,10 @@
array[count++] = bh;
if (count < NRSYNC)
continue;
-
spin_unlock(&lru_list_lock);
- conditional_schedule();

write_locked_buffers(array, count);
+ conditional_schedule();
return -EAGAIN;
}
unlock_buffer(bh);
@@ -282,12 +281,6 @@
return 0;
}

-static inline void wait_for_some_buffers(kdev_t dev)
-{
- spin_lock(&lru_list_lock);
- wait_for_buffers(dev, BUF_LOCKED, 1);
-}
-
static int wait_for_locked_buffers(kdev_t dev, int index, int refile)
{
do
@@ -1043,7 +1036,6 @@
unsigned long dirty, tot, hard_dirty_limit, soft_dirty_limit;

dirty = size_buffers_type[BUF_DIRTY] >> PAGE_SHIFT;
- dirty += size_buffers_type[BUF_LOCKED] >> PAGE_SHIFT;
tot = nr_free_buffer_pages();

dirty *= 100;
@@ -1060,6 +1052,21 @@
return -1;
}

+static int bdflush_stop(void)
+{
+ unsigned long dirty, tot, dirty_limit;
+
+ dirty = size_buffers_type[BUF_DIRTY] >> PAGE_SHIFT;
+ tot = nr_free_buffer_pages();
+
+ dirty *= 100;
+ dirty_limit = tot * bdf_prm.b_un.nfract_stop_bdflush;
+
+ if (dirty > dirty_limit)
+ return 0;
+ return 1;
+}
+
/*
* if a new dirty buffer is created we need to balance bdflush.
*
@@ -1084,7 +1091,6 @@
if (state > 0) {
spin_lock(&lru_list_lock);
write_some_buffers(NODEV);
- wait_for_some_buffers(NODEV);
}
}

@@ -2789,13 +2795,18 @@
complete((struct completion *)startup);

for (;;) {
+ int ndirty = bdf_prm.b_un.ndirty;
+
CHECK_EMERGENCY_SYNC

- spin_lock(&lru_list_lock);
- if (!write_some_buffers(NODEV) || balance_dirty_state() < 0) {
- run_task_queue(&tq_disk);
- interruptible_sleep_on(&bdflush_wait);
+ while (ndirty > 0) {
+ spin_lock(&lru_list_lock);
+ if (!write_some_buffers(NODEV))
+ break;
+ ndirty -= NRSYNC;
}
+ if (ndirty > 0 || bdflush_stop())
+ interruptible_sleep_on(&bdflush_wait);
}
}




Andrea

2001-12-12 23:24:24

by Rik van Riel

[permalink] [raw]
Subject: Re: 2.4.16 & OOM killer screw up (fwd)

On Wed, 12 Dec 2001, Ken Brownfield wrote:

> I'm hoping that I won't have to redirect my irritated users' ire into
> a karma pool to get these changes merged into mainline

Actually, Marcelo has already indicated that he's willing to
take VM code from Andrea, as long as the parts are merged one
by one and come with proper argumentation.

This means you'll either have to split out Andrea's patch
yourself or you'll have to convince Andrea to play by the
rules ;))

regards,

Rik
--
Shortwave goes a long way: irc.starchat.net #swl

http://www.surriel.com/ http://distro.conectiva.com/

2001-12-13 06:36:45

by Rob Landley

[permalink] [raw]
Subject: Re: 2.4.16 & OOM killer screw up (fwd)

On Tuesday 11 December 2001 12:23 pm, Christoph Hellwig wrote:

> For BSD advocates it might be a problem that these are unified diffs
> that are only applyable with GPL-licensed patch(1) version..

Why would BSD advocates be applying patches to the linux kernel? (You don't
need the tool to read a patch for ideas, do you?) Why would BSD advocates
apply a GPL-licensed patch to the GPL-licensed Linux kernel, and then
complain that the tool they're using to do so is GPL-licensed?

I'm confused. (Not SURPRISED, mind you. Just easily confused.)

Rob

2001-12-13 08:40:09

by Alan

[permalink] [raw]
Subject: Re: 2.4.16 & OOM killer screw up (fwd)

> On Tuesday 11 December 2001 12:23 pm, Christoph Hellwig wrote:
>
> > For BSD advocates it might be a problem that these are unified diffs
> > that are only applyable with GPL-licensed patch(1) version..
>
> Why would BSD advocates be applying patches to the linux kernel? (You don't
> need the tool to read a patch for ideas, do you?) Why would BSD advocates
> apply a GPL-licensed patch to the GPL-licensed Linux kernel, and then
> complain that the tool they're using to do so is GPL-licensed?
>
> I'm confused. (Not SURPRISED, mind you. Just easily confused.)

Christoph, please remember that irony is not available between the Canadian
and Mexican border.... you are confusing them again 8)

Alan

2001-12-13 08:48:40

by David Miller

[permalink] [raw]
Subject: Re: 2.4.16 & OOM killer screw up (fwd)


> > For BSD advocates it might be a problem that these are unified diffs
> > that are only applyable with GPL-licensed patch(1) version..

I'm back quoting twice, sorry I've lost the original attribution.

But anyways didn't the original Larry Wall patch do unified diffs?
I thought it did, and I recall that wasn't GPL licensed.

2001-12-13 18:24:40

by Rob Landley

[permalink] [raw]
Subject: [OT] Re: 2.4.16 & OOM killer screw up (fwd)

On Thursday 13 December 2001 03:48 am, Alan Cox wrote:
> > On Tuesday 11 December 2001 12:23 pm, Christoph Hellwig wrote:
> > > For BSD advocates it might be a problem that these are unified diffs
> > > that are only applyable with GPL-licensed patch(1) version..
> >
> > Why would BSD advocates be applying patches to the linux kernel? (You
> > don't need the tool to read a patch for ideas, do you?) Why would BSD
> > advocates apply a GPL-licensed patch to the GPL-licensed Linux kernel,
> > and then complain that the tool they're using to do so is GPL-licensed?
> >
> > I'm confused. (Not SURPRISED, mind you. Just easily confused.)
>
> Christoph, please remember that irony is not available between the Canadian
> and Mexican border.... you are confusing them again 8)

We'll get it back when the whole "everything has changed" fad dies down.
Average together how long the OJ simpson trial lasted, the monica lewinsky
thing, elian gonzalez down in miami, the press coverage of hurricane andrew,
the original gulf war, nancy kerrigan, john wayne bobbit, joey buttafuoco,
the military interventions in somalia and bosnia, the outcry over alar and
malathion in california back in the 80's, dan quayle attacking murphy brown,
the anti-nuke sentiment following chernobyl and three mile island...

That's our national attention span. A year, maybe a year and change.
Anybody who thinks some nut with a beard can keep this country permanently
nervous obviously doesn't remember the cuban missile crisis. (And of course
there are a lot of people who don't, again because of our short attention
span...) Our military may be rather impressive, but our sarcastic
self-centered indifference is legendary. We're STILL bombing Iraq, and most
of the US has forgotten that country even exists...

</off topic thread>

> Alan

Rob

2001-12-13 22:43:45

by Matthias Andree

[permalink] [raw]
Subject: Re: 2.4.16 & OOM killer screw up (fwd)

On Thu, 13 Dec 2001, David S. Miller wrote:

> But anyways didn't the original Larry Wall patch do unified diffs?
> I thought it did, and I recall that wasn't GPL licensed.

Nope, it did context diffs however.

--
Matthias Andree

"They that can give up essential liberty to obtain a little temporary
safety deserve neither liberty nor safety." Benjamin Franklin