LinuxLists.cc - VM: qsbench

[permalink] [raw]

Subject: Re: new OOM heuristic failure (was: Re: VM: qsbench)

On Wed, 31 Oct 2001, Rik van Riel wrote:
>
> Linus, it seems Lorenzo's test program gets killed due
> to the new out_of_memory() heuristic ...

Hmm.. The oom killer really only gets invoced if we're really down to zero
swapspace (that's the _only_ non-rate-based heuristic in the whole thing).

Lorenzo, can you do a "vmstat 1" and show the output of it during the
interesting part of the test (ie around the kill).

I could probably argue that the machine really _is_ out of memory at this
point: no swap, and it obviously has to work very hard to free any pages.
Read the "out_of_memory()" code (which is _really_ simple), with the
realization that it only gets called when "try_to_free_pages()" fails and
I think you'll agree.

That said, it may be "try_to_free_pages()" itself that just gives up way
too easily - it simply didn't matter before, because all callers just
looped around and asked for more memory if it failed. So the code could
still trigger too easily not because the oom() logic itself is all that
bad, but simply because it makes the assumption that try_to_free_pages()
only fails in bad situations.

Linus

2001-10-31 16:04:51

by Rik van Riel

[permalink] [raw]

Subject: Re: new OOM heuristic failure (was: Re: VM: qsbench)

On Wed, 31 Oct 2001, Linus Torvalds wrote:

> I could probably argue that the machine really _is_ out of memory at this
> point: no swap, and it obviously has to work very hard to free any pages.
> Read the "out_of_memory()" code (which is _really_ simple), with the
> realization that it only gets called when "try_to_free_pages()" fails and
> I think you'll agree.

Absolutely agreed, an earlier out_of_memory() is probably a good
thing for most systems. The only "but" is that Lorenzo's test
program runs fine with other kernels, but you could argue that
it's a corner case anyway...

Rik
--
DMCA, SSSCA, W3C? Who cares? http://thefreeworld.net/

http://www.surriel.com/ http://distro.conectiva.com/

2001-10-31 17:43:32

[permalink] [raw]

Subject: Re: new OOM heuristic failure (was: Re: VM: qsbench)

On Wed, 31 Oct 2001 14:04:45 -0200 (BRST) Rik van Riel <[email protected]>
wrote:

> On Wed, 31 Oct 2001, Linus Torvalds wrote:
>
> > I could probably argue that the machine really _is_ out of memory at this
> > point: no swap, and it obviously has to work very hard to free any pages.
> > Read the "out_of_memory()" code (which is _really_ simple), with the
> > realization that it only gets called when "try_to_free_pages()" fails and
> > I think you'll agree.
>
> Absolutely agreed, an earlier out_of_memory() is probably a good
> thing for most systems. The only "but" is that Lorenzo's test
> program runs fine with other kernels, but you could argue that
> it's a corner case anyway...

I took a deep look into this code and wonder how this benchmark manages to get
killed. If I read that right this would imply that shrink_cache has run a
hundred times through the _complete_ inactive_list finding no free-able pages,
with one exception that I read across:

int max_mapped = nr_pages*10;
...
page_mapped:
if (--max_mapped >= 0)
continue;

/*
* Alert! We've found too many mapped pages on the
* inactive list, so we start swapping out now!
*/
spin_unlock(&pagemap_lru_lock);
swap_out(priority, gfp_mask, classzone);
return nr_pages;

Is it possible, that this does a too early exit from shrink_cache?
I don't know how much mem Lorenzo has, but running only once through several
hundred MB of inactive list is a notable time in my system, running a hundred
times through could be far more than 70 s. But if there's no complete run, you
cannot state to really be oom.
Does it make sense to stop shrink_cache when having detected 4k * 32 * 10 =
1280 k of mapped mem on the inactive list of possibly several hundred MB in
size?

Regards,
Stephan

2001-10-31 17:52:32

[permalink] [raw]

Subject: Re: VM: qsbench

At 07.23 31/10/01 -0500, Jeff Garzik wrote:
>Lorenzo Allegrucci wrote:
>> Linux-2.4.14-pre6:
>> lenstra:~/src/qsort> time ./qsbench -n 90000000 -p 1 -s 140175100
>> Out of Memory: Killed process 224 (qsbench).
>> 69.890u 3.430s 2:12.48 55.3% 0+0k 0+0io 16374pf+0w
>> lenstra:~/src/qsort> time ./qsbench -n 90000000 -p 1 -s 140175100
>> Out of Memory: Killed process 226 (qsbench).
>> 69.550u 2.990s 2:11.31 55.2% 0+0k 0+0io 15374pf+0w
>> lenstra:~/src/qsort> time ./qsbench -n 90000000 -p 1 -s 140175100
>> Out of Memory: Killed process 228 (qsbench).
>> 69.480u 3.100s 2:13.33 54.4% 0+0k 0+0io 15950pf+0w
>> 0:01 kswapd
>>
>> This is interesting, -pre6 killed qsbench _just_ before qsbench exited.
>> Unreliable results.
>
>Can you give us some idea of the memory usage of this application? Your
>amount of RAM and swap?

256M of RAM + 200M of swap, qsbench allocates about 343M.

--
Lorenzo

2001-10-31 17:54:24

[permalink] [raw]

Subject: Re: new OOM heuristic failure (was: Re: VM: qsbench)

At 07.52 31/10/01 >
>On Wed, 31 >>
>> Linus, >> to the >
>Hmm.. The >swapspace >
>Lorenzo, >interesting
procs cpu
r b w swpd id
2 0 0 139908 3588 0
1 0 0 139908 3588 0
1 0 0 139908 3588 0
0 1 0 139336 1568 2 41
1 0 0 140296 2996 3 94
1 0 0 139968 2708 4
1 0 0 139968 2708 0
1 0 0 139968 2708 0
1 0 0 139968 2708 0
1 0 0 139968 2708 0
1 0 0 139968 2708 0
1 0 0 139968 2708 0
0 1 0 144064 1620 5 67
1 0 0 146168 2952 8 92
0 1 0 151672 3580 7 90
0 1 0 165496 1620 6 83
1 0 0 177912 1592 2 93
0 1 0 182392 1548 11 78
0 1 1 195320 2692 8 89
1 0 0 195512 3516 8 90
1 0 1 195512 1664 8 74
1 0 0 195512 1544 7 80
procs cpu
r b w swpd id
2 0 0 195512 1660 10 80
0 0 0 5384 250420 7 92
0 0 0 5384 250420 100
0 0 0 5384 250416 100
0 0 0 5384 250400 100
0 0 0 5384 250400 100
0 0 0 5384 250400 100

Until swpd is free swap (I >From that because qsbench I guess Linux whole array to I wonder why..

Linux-2.4.13:
1 0 0 109864 3820 0
1 0 0 109864 3816 0
1 0 0 109864 3816 0
1 0 0 109864 3816 0
1 0 0 109864 3816 0
1 0 0 109864 3816 0
0 1 0 112156 3224 1 31
procs cpu
r b w swpd id
1 1 0 121372 3416 5 91
1 0 0 130460 3340 2 93
0 1 0 139932 3168 4 91
0 1 1 149532 3488 5 93
1 0 1 158308 4008 15 80
0 1 0 166244 3724 13 83
0 1 0 175716 4092 4 88
0 1 0 185188 4076 8 85
0 1 1 192100 3624 13 85
1 0 0 195512 3972 23 69
1 0 0 195512 4184 25 61
0 1 0 195512 4164 21 70
1 0 0 195512 3880 16 74
1 0 0 195512 3752 21 66
1 0 0 195512 3096 23 66
1 0 0 195512 3344 5 44
0 0 0 5948 250640 67
0 0 0 5948 250640 100
0 0 0 5948 250640 100
0 0 0 5948 250640 100
0 0 0 5948 250640 100
0 0 0 5948 250640 100

Same behaviour.

Linux-2.4.14-pre5:
1 0 0 142648 3268 0
1 0 0 142648 3268 0
1 0 0 142648 3268 0
1 0 0 142648 3268 0
1 0 0 142648 3268 0
1 0 0 142648 3268 0
0 1 0 143404 3624 2 37
0 1 0 148324 1632 7 89
1 0 0 153572 3412 4 90
1 0 0 165604 1584 10 80
1 1 0 175076 1624 6 90
0 1 0 181604 1540 4 84
1 0 0 194276 2812 6 86
1 0 0 195512 1640 5 91
1 0 0 195512 1572 9 77
1 1 0 195512 1580 10 76
procs cpu
r b w swpd id
1 1 0 195512 1608 8 77
1 0 0 195512 1728 6 47
1 0 0 195512 1612 11 70
0 1 0 5144 250652 9 61
0 0 0 4984 250236 100
0 0 0 4984 250228 100
0 0 0 4984 250228 100
0 0 0 4984 250228 100
0 0 0 4984 250228 100
0 0 0 4984 250228 100
0 0 0 4984 250228 100
0 1 0 4984 250196 100

Same behaviour, Both kernels qsbench.

>I could probably >point: no >Read the >realization >I think you'll agree.
>
>That said, >too easily >looped around >still trigger >bad, but >only fails in bad situations.
>
> Linus

--
Lorenzo
-0800, Linus Torvalds wrote:
Oct 2001, Rik van Riel wrote:
it seems Lorenzo's test program gets killed due
new out_of_memory() heuristic ...
oom killer really only gets invoced if we're really down to zero
(that's the _only_ non-rate-based heuristic in the whole thing).
can you do a "vmstat 1" and show the output of it during the
part of the test (ie around the kill).
memory swap io system
free buff cache si so bi bo in cs us sy
68 3588 0 0 0 0 101 3 100 0
68 3588 0 0 0 0 101 7 100 0
68 3588 0 0 0 0 101 5 100 0
68 3588 2524 696 2524 696 192 180 57
68 3588 3776 4208 3776 4208 287 304 3
68 3588 288 0 288 0 110 21 96 0
68 3588 0 0 0 0 101 5 100 0
68 3588 0 0 0 0 101 5 100 0
68 3588 0 0 0 0 101 5 99 1
68 3588 0 0 0 0 101 3 100 0
68 3588 0 0 0 0 101 3 100 0
68 3588 0 0 0 12 104 9 100 0
64 3588 7256 6880 7256 6880 395 517 28
60 3584 5780 6720 5780 6720 396 401 0
64 3584 12744 10076 12748 10076 579 870 3
64 3388 14684 4108 14684 4108 629 1131 11
64 1624 4544 14196 4544 14200 377 355 5
60 1624 14648 8064 14648 8064 633 935 11
64 1624 14156 9600 14160 9600 605 943 3
64 400 5312 8376 5312 8376 378 374 2
64 400 22256 0 22256 0 797 1419 18
60 400 23520 0 23520 4 837 1540 13
memory swap io system
free buff cache si so bi bo in cs us sy
60 400 23292 0 23292 0 832 1546 10
76 784 2212 24 2672 24 201 208 1
76 784 0 0 0 0 101 3 0 0
76 788 0 0 0 0 101 3 0 0
92 788 0 0 16 0 105 15 0 0
92 788 0 0 0 0 101 3 0 0
92 788 0 0 0 0 101 7 0 0
"139968" everything is fine and I have about 60M of
have 256M RAM + 200M of swap and qsbench uses about 343M).
point Linux starts swapping without any apparent reason (?)
allocates its memory just once at the beginning.
starts swapping when qsbench scans sequentially the
check for errors after sorting, in the final stage.
64 396 0 0 0 0 101 3 100 0
68 396 0 0 4 0 107 23 100 0
68 396 0 0 0 0 101 5 98 2
68 396 0 0 0 0 101 3 100 0
68 396 0 0 0 0 101 3 100 0
68 396 0 0 0 0 102 5 100 0
64 508 2676 2048 2888 2052 235 239 68
memory swap io system
free buff cache si so bi bo in cs us sy
64 508 8896 9216 8896 9216 519 686 4
64 508 9420 9216 9420 9216 559 737 5
64 508 9644 9472 9644 9472 547 717 5
64 508 9356 9572 9356 9576 550 725 2
64 500 8484 8736 8492 8744 502 655 5
64 500 8204 8004 8204 8004 452 601 4
64 500 9104 9344 9104 9344 525 681 8
64 500 9356 9344 9356 9344 545 690 7
64 500 7544 7040 7544 7040 444 548 2
64 348 11260 3924 11264 3928 521 767 8
64 348 16812 0 16812 0 632 1074 14
64 364 19828 0 19856 0 722 1251 9
64 364 19740 0 19740 0 721 1240 10
64 396 20676 0 20736 0 752 1307 13
64 372 16260 4 16264 8 617 1040 11
68 372 7548 0 7560 0 346 493 51
80 800 328 0 768 0 132 64 29 4
80 800 0 0 0 0 101 3 0 0
80 800 0 0 0 0 104 10 0 0
80 800 0 0 0 0 119 44 0 0
80 800 0 0 0 0 104 11 0 0
80 800 0 0 0 0 130 61 0 0
80 3784 0 0 0 0 101 5 99 1
80 3784 0 0 0 0 101 9 100 0
80 3784 0 0 0 0 101 3 100 0
80 3784 0 0 0 0 101 3 100 0
80 3784 0 0 0 0 101 3 100 0
80 3784 0 0 0 0 101 5 99 1
80 3784 5380 2108 5380 2116 298 346 61
76 3780 9452 7808 9452 7808 480 601 4
72 3780 11492 6044 11492 6044 560 737 6
72 2860 13952 7972 13952 7972 615 889 10
72 1624 5232 13536 5232 13536 390 339 4
76 1624 13360 7924 13364 7924 593 852 12
76 1624 12696 7704 12696 7704 575 804 8
76 556 7624 11412 7624 11412 449 488 4
72 496 21768 52 21768 56 784 1367 14
72 496 23196 0 23196 0 827 1460 14
memory swap io system
free buff cache si so bi bo in cs us sy
76 496 19208 0 19212 0 704 1220 15
76 496 15040 0 15040 4 572 946 48
72 496 21664 0 21664 0 782 1363 19
84 564 12120 0 12196 0 495 790 30
84 748 368 0 552 0 122 44 0 0
92 748 0 0 8 0 106 20 0 0
92 748 0 0 0 0 102 5 0 0
92 748 0 0 0 0 105 12 0 0
92 748 0 0 0 0 102 5 0 0
92 748 0 0 0 0 101 3 0 0
92 748 0 0 0 0 102 11 0 0
92 748 32 0 32 0 112 26 0 0
but 2.4.13 uses less swap space.
above seem to fall in OOM conditions, but they don't kill
argue that the machine really _is_ out of memory at this
swap, and it obviously has to work very hard to free any pages.
"out_of_memory()" code (which is _really_ simple), with the
that it only gets called when "try_to_free_pages()" fails and
it may be "try_to_free_pages()" itself that just gives up way
- it simply didn't matter before, because all callers just
and asked for more memory if it failed. So the code could
too easily not because the oom() logic itself is all that
simply because it makes the assumption that try_to_free_pages()

2001-10-31 18:09:22

[permalink] [raw]

Subject: Re: new OOM heuristic failure (was: Re: VM: qsbench)

On Wed, 31 Oct 2001, Lorenzo Allegrucci wrote:
>
> Until swpd is "139968" everything is fine and I have about 60M of
> free swap (I have 256M RAM + 200M of swap and qsbench uses about 343M).

Ok, that's the problem. The swap free on swap-in logic got removed, try
this simple patch, and I bet it ends up working ok for you

You should see better performance with a bigger swapspace, though. Linux
would prefer to keep the swap cache allocated as long as possible, and not
drop the pages just because swap is smaller than the working set.

(Ie the best setup is not when "RAM + SWAP > working set", but when you
have "SWAP > working set").

Can you re-do the numbers with this one on top of pre6?

Thanks,

Linus

-----
diff -u --recursive pre6/linux/mm/memory.c linux/mm/memory.c
--- pre6/linux/mm/memory.c Wed Oct 31 10:04:11 2001
+++ linux/mm/memory.c Wed Oct 31 10:02:33 2001
@@ -1158,6 +1158,8 @@
pte = mk_pte(page, vma->vm_page_prot);

swap_free(entry);
+ if (vm_swap_full())
+ remove_exclusive_swap_page(page);

flush_page_to_ram(page);
flush_icache_page(vma, page);

2001-10-31 18:24:42

[permalink] [raw]

Subject: Re: new OOM heuristic failure (was: Re: VM: qsbench)

In article <[email protected]>,
Stephan von Krawczynski <[email protected]> wrote:
>
>I took a deep look into this code and wonder how this benchmark manages to get
>killed. If I read that right this would imply that shrink_cache has run a
>hundred times through the _complete_ inactive_list finding no free-able pages,
>with one exception that I read across:

That's a red herring. The real reason it is killed is that the machine
really _is_ out of memory, but that, in turn, is because the swap space
is totally filled up - with pages we have in memory in the swap cache.

The swap cache is wonderful for many thing, but Linux has historically
had swap as "additional" memory, and the swap cache really really wants
to have backing store for the _whole_ working set, not just for the
pages we have to get rid of.

Thus the two-line patch elsewhere in this thread, which says "ok, if
we're low on swap space, let's start decimating the swap cache entries
for stuff we have in memory".

Linus

2001-10-31 19:49:04

[permalink] [raw]

Subject: Re: new OOM heuristic failure (was: Re: VM: qsbench)

[ Cc'd to linux-kernel just in case other people are wondering ]

On Wed, 31 Oct 2001, Bernt Hansen wrote:
>
> Do I need to rebuild my systems with my swap partitions >= my physical
> memory size for the 2.4.x kernels? All of my systems have total swap
> space less than their physical memory size and are running 2.4.13 kernels.

No. With the two-liner patch on linux-kernel, your old setup should work
as-is.

And performance will be fine, _except_ if you regularly actually have your
swap usage up in the 75%+ range. But if you do work that typically puts a
lot of pressure on swap, and you find that you almost always end up using
clearly more than half your swapspace, that implies that you should
consider perhaps reconfiguring so that you have a bigger swap partition.

When I pointed out the performance problems to Lorenzo, I specifically
meant only that one load that he is testing - the fact that the load fills
up the swap device implies that for _that_ load, performance could be
improved by making sure he has enough swap to cover it.

I bet Lorenzo doesn't even come _close_ to 80% full swap under normal
usage, so he probably wouldn't see any performance impact normally. It's
just that when you report VM benchmarks, maybe you want to try to improve
the numbers..

[ It's equally valid to say that Lorenzo's numbers are _especially_
interesting exactly because they also test the behaviour when we need to
start pruning the swap cache, though. So I'm in no way trying to
criticise his benchmark - I think the qsort benchmark is actually one of
the more valid VM patterns we have ever had as a benchmark, and I
really like how it mixes random accesses with non-random ones ]

So don't worry.

Linus

2001-10-31 21:33:43

[permalink] [raw]

Subject: Re: new OOM heuristic failure (was: Re: VM: qsbench)

At 10.06 31/10/01 -0800, Linus Torvalds wrote:
>
>On Wed, 31 Oct 2001, Lorenzo Allegrucci wrote:
>>
>> Until swpd is "139968" everything is fine and I have about 60M of
>> free swap (I have 256M RAM + 200M of swap and qsbench uses about 343M).
>
>Ok, that's the problem. The swap free on swap-in logic got removed, try
>this simple patch, and I bet it ends up working ok for you
>
>You should see better performance with a bigger swapspace, though. Linux
>would prefer to keep the swap cache allocated as long as possible, and not
>drop the pages just because swap is smaller than the working set.
>
>(Ie the best setup is not when "RAM + SWAP > working set", but when you
>have "SWAP > working set").
>
>Can you re-do the numbers with this one on top of pre6?
>
>Thanks,
>
> Linus
>
>-----
>diff -u --recursive pre6/linux/mm/memory.c linux/mm/memory.c
>--- pre6/linux/mm/memory.c Wed Oct 31 10:04:11 2001
>+++ linux/mm/memory.c Wed Oct 31 10:02:33 2001
>@@ -1158,6 +1158,8 @@
> pte = mk_pte(page, vma->vm_page_prot);
>
> swap_free(entry);
>+ if (vm_swap_full())
>+ remove_exclusive_swap_page(page);
>
> flush_page_to_ram(page);
> flush_icache_page(vma, page);

Linus,

your patch seems to help one case out of three.
(even though I have not any meaningful statistical data)

lenstra:~/src/qsort> time ./qsbench -n 90000000 -p 1 -s 140175100
Out of Memory: Killed process 225 (qsbench).
69.500u 3.200s 2:11.23 55.3% 0+0k 0+0io 15297pf+0w
lenstra:~/src/qsort> time ./qsbench -n 90000000 -p 1 -s 140175100
Out of Memory: Killed process 228 (qsbench).
69.720u 3.190s 2:12.23 55.1% 0+0k 0+0io 15561pf+0w
lenstra:~/src/qsort> time ./qsbench -n 90000000 -p 1 -s 140175100
70.250u 3.470s 2:15.88 54.2% 0+0k 0+0io 17170pf+0w

procs memory swap io system cpu
r b w swpd free buff cache si so bi bo in cs us sy id
2 0 0 136320 3644 72 284 0 0 0 0 101 5 100 0 0
1 0 0 136320 3644 72 284 0 0 0 0 101 3 100 0 0
1 0 0 136320 3644 72 284 0 0 0 0 101 3 100 0 0
1 0 0 136320 3644 72 284 0 0 0 0 101 9 100 0 0
0 1 0 133140 2608 72 284 3344 768 3344 768 215 215 47 2 51
0 1 0 132276 1608 72 284 3552 6376 3552 6376 280 227 2 3 95
1 0 0 128648 3240 72 284 768 3656 768 3660 162 54 57 1 42
1 0 0 128648 3240 72 284 0 0 0 0 101 3 100 0 0
1 0 0 128648 3240 72 284 0 0 0 4 102 9 100 0 0
1 0 0 128648 3240 72 284 0 0 0 0 101 3 100 0 0
1 0 0 128648 3240 72 284 0 0 0 0 101 3 100 0 0
1 0 0 128648 3240 72 284 0 0 0 0 101 3 100 0 0
1 0 0 128648 3240 72 284 0 0 0 0 101 3 100 0 0
1 0 0 129672 3316 68 284 4328 2860 4328 2860 265 282 62 2 36
1 0 0 137992 1644 68 280 19216 3172 19216 3172 743 1227 7 5 88
0 1 0 153096 3648 68 280 3072 17788 3072 17788 353 218 2 6 92
0 1 1 160136 1660 68 280 15240 4740 15240 4740 647 963 16 10 74
1 0 0 177288 1588 68 280 5868 14220 5868 14220 422 393 0 7 93
0 1 0 188680 1620 68 280 8144 11904 8144 11904 473 544 4 5 91
0 1 0 192136 1552 68 280 17136 5860 17136 5860 689 1081 8 9 83
1 0 0 195512 2948 68 280 7672 9008 7672 9008 476 512 2 8 90
1 0 0 195512 1556 68 280 21688 356 21688 356 786 1375 11 8 81
procs memory swap io system cpu
r b w swpd free buff cache si so bi bo in cs us sy id
2 0 0 195512 1608 68 276 22352 0 22352 0 801 1422 10 17 73
1 0 0 195512 1588 68 276 22748 0 22748 0 812 1431 14 12 74
1 0 0 195512 1560 68 276 12768 0 12768 0 502 809 55 4 41
1 0 0 195512 1552 68 280 23012 0 23012 0 823 1446 11 6 83
0 1 0 4696 250440 80 632 9048 0 9412 4 409 609 27 7 66
0 0 0 4564 250284 84 752 32 0 156 0 108 17 0 0 100
0 0 0 4564 250280 88 752 0 0 4 0 106 18 0 0 100
0 0 0 4564 250280 88 752 0 0 0 0 101 3 0 0 100
0 0 0 4564 250280 88 752 0 0 0 0 109 21 0 0 100
0 0 0 4564 250280 88 752 0 0 0 0 121 44 0 0 100
0 0 0 4564 250280 88 752 0 0 0 0 101 3 0 0 100

Then, I repeated the test with a bigger swap partition (400M):
qsbench working set is about 343M, so now SWAP > working set.

lenstra:~/src/qsort> time ./qsbench -n 90000000 -p 1 -s 140175100
70.770u 3.630s 2:14.21 55.4% 0+0k 0+0io 16545pf+0w
lenstra:~/src/qsort> time ./qsbench -n 90000000 -p 1 -s 140175100
70.720u 3.370s 2:16.66 54.2% 0+0k 0+0io 17444pf+0w
lenstra:~/src/qsort> time ./qsbench -n 90000000 -p 1 -s 140175100
70.050u 3.380s 2:15.05 54.3% 0+0k 0+0io 17045pf+0w

procs memory swap io system cpu
r b w swpd free buff cache si so bi bo in cs us sy id
2 0 0 124040 3652 68 3428 0 0 0 0 101 7 100 0 0
1 0 0 122096 1640 68 3428 2880 1656 2880 1656 208 184 51 1 48
1 0 0 129972 1896 68 3428 2328 1836 2328 1836 195 292 45 3 52
1 0 0 130740 2000 68 3428 0 168 0 168 106 6 100 0 0
1 0 0 131508 2104 68 3428 0 252 0 252 101 8 100 0 0
1 0 0 132660 2340 68 3428 0 336 0 336 107 10 100 0 0
1 0 0 133428 2460 68 3428 0 208 0 208 101 6 100 0 0
1 0 0 134196 2560 68 3428 0 212 0 212 105 6 100 0 0
1 0 0 134196 2560 68 3428 0 0 0 0 101 5 100 0 0
0 1 1 138932 1664 68 3428 1856 2052 1856 2052 178 156 83 0 17
0 1 0 145076 1612 68 3428 6900 9956 6900 9964 451 532 3 5 92
1 0 0 149044 3648 68 3424 3232 9556 3232 9556 333 259 2 5 93
1 0 0 154036 1580 64 3424 13816 4736 13816 4736 635 951 6 4 90
0 1 0 171444 1648 64 2404 14328 6544 14328 6544 620 1155 5 13 82
0 1 0 182580 1648 64 1584 6180 21916 6180 21912 438 422 1 7 92
0 1 0 184500 1628 64 1584 13800 3980 13800 3984 602 878 11 5 84
1 0 0 196532 1624 64 1584 10876 7576 10876 7576 522 707 6 5 89
0 1 0 210612 1540 64 1584 8992 13760 8992 13760 492 592 5 9 86
0 1 0 214452 2412 64 1584 12928 10176 12928 10176 593 817 11 4 85
1 0 0 225460 1632 64 1584 11704 8380 11704 8380 564 766 5 8 87
1 0 0 230976 1592 64 1224 8012 10008 8012 10008 465 525 2 6 92
1 0 0 233340 1556 80 288 17748 888 17764 888 674 1136 7 12 81
procs memory swap io system cpu
r b w swpd free buff cache si so bi bo in cs us sy id
2 0 0 233340 3524 64 284 20276 2392 20276 2392 771 1315 10 6 84
1 0 0 233340 3632 64 284 14948 0 14948 0 569 957 44 5 51
1 0 0 233340 1556 64 284 24448 0 24448 0 865 1575 12 8 80
0 1 0 240920 1580 68 288 18208 4656 18212 4656 717 1186 7 5 88
1 0 0 240920 2704 68 288 16672 2924 16672 2928 656 1069 25 11 64
0 0 0 4536 250948 84 760 4384 0 4872 0 270 340 3 8 89
0 0 0 4536 250948 84 760 0 0 0 0 101 3 0 0 100
0 0 0 4536 250948 84 760 0 0 0 0 101 3 0 0 100
0 0 0 4536 250948 84 760 0 0 0 0 101 7 0 0 100

--
Lorenzo

2001-11-01 21:57:29

[permalink] [raw]

Subject: Re: new OOM heuristic failure (was: Re: VM: qsbench)

At 22.08 01/11/01 +0100, you wrote:
>> At 15.44 01/11/01 +0100, Stephan von Krawczynski wrote:
>> >On Wed, 31 Oct 2001 22:31:40 +0100 Lorenzo Allegrucci
><[email protected]>
>> >wrote:
>> >
>> >> Linus,
>> >>
>> >> your patch seems to help one case out of three.
>> >> (even though I have not any meaningful statistical data)
>> >
>> >Hm, I will not say that I expected that :-), he knows by far more
>than me.
>> >But can you try my patch below in addition or comparison to linus'
>?
>> >Give me a hint what happens.
>>
>> Well, your patch works but it hurts performance :(
>>
>> lenstra:~/src/qsort> time ./qsbench -n 90000000 -p 1 -s 140175100
>> 71.500u 1.790s 2:29.18 49.1% 0+0k 0+0io 18498pf+0w
>> lenstra:~/src/qsort> time ./qsbench -n 90000000 -p 1 -s 140175100
>> 71.460u 1.990s 2:26.87 50.0% 0+0k 0+0io 18257pf+0w
>> lenstra:~/src/qsort> time ./qsbench -n 90000000 -p 1 -s 140175100
>> 71.220u 2.200s 2:26.82 50.0% 0+0k 0+0io 18326pf+0w
>> 0:55 kswapd
>>
>> Linux-2.4.14-pre5:
>> lenstra:~/src/qsort> time ./qsbench -n 90000000 -p 1 -s 140175100
>> 70.340u 3.450s 2:13.62 55.2% 0+0k 0+0io 16829pf+0w
>> lenstra:~/src/qsort> time ./qsbench -n 90000000 -p 1 -s 140175100
>> 70.590u 2.940s 2:15.48 54.2% 0+0k 0+0io 17182pf+0w
>> lenstra:~/src/qsort> time ./qsbench -n 90000000 -p 1 -s 140175100
>> 70.140u 3.480s 2:14.66 54.6% 0+0k 0+0io 17122pf+0w
>> 0:01 kswapd
>
>Hello Lorenzo,
>
>to be honest: I expected that. The patch according to my knowledge
>fixes a "definition hole" in the shrink_cache algorithm. I tend to say
>it is the right thing to do it this way, but I am sure it is not as
>fast as immediate exit to swap. It would be interesting to know if it
>does hurt performance in not-near-oom environment. I'd say Andrea or
>Linus might know that, or you can try, of course :-)

400M of swap now (from 200M), Linux-2.4.14-pre6 + your vmscan-patch:

lenstra:~/src/qsort> time ./qsbench -n 90000000 -p 1 -s 140175100
71.320u 2.260s 2:28.92 49.4% 0+0k 0+0io 18755pf+0w
lenstra:~/src/qsort> time ./qsbench -n 90000000 -p 1 -s 140175100
71.330u 2.120s 2:28.40 49.4% 0+0k 0+0io 18838pf+0w
lenstra:~/src/qsort> time ./qsbench -n 90000000 -p 1 -s 140175100
71.880u 2.100s 2:28.31 49.8% 0+0k 0+0io 18646pf+0w
0:56 kswapd

qsbench vsize is just 343M, definitely not-near-oom environment :)

>Anyway may I beg you to post my patch and your answer to the list,
>because I currently cannot do it (I am not in office right now, but on
>a web-terminal somewhere in the outbacks ;-). I have neither patch at
>hand nor am I able to attach it with this mailer...
>
>Thanks,
>Stephan

your vmscan-patch:

--- linux-orig/mm/vmscan.c Wed Oct 31 12:32:11 2001
+++ linux/mm/vmscan.c Thu Nov 1 15:38:13 2001
@@ -469,16 +469,10 @@
spin_unlock(&pagecache_lock);
UnlockPage(page);
page_mapped:
- if (--max_mapped >= 0)
- continue;
+ if (max_mapped > 0)
+ max_mapped--;
+ continue;

- /*
- * Alert! We've found too many mapped pages on the
- * inactive list, so we start swapping out now!
- */
- spin_unlock(&pagemap_lru_lock);
- swap_out(priority, gfp_mask, classzone);
- return nr_pages;
}

/*
@@ -514,6 +508,14 @@
break;
}
spin_unlock(&pagemap_lru_lock);
+
+ /*
+ * Alert! We've found too many mapped pages on the
+ * inactive list, so we start swapping out - delayed!
+ * -skraw
+ */
+ if (max_mapped==0)
+ swap_out(priority, gfp_mask, classzone);

return nr_pages;
}

--
Lorenzo

2001-11-01 23:35:42

[permalink] [raw]

Subject: Re: new OOM heuristic failure (was: Re: VM: qsbench)

> At 22.08 01/11/01 +0100, you wrote:
> >> Well, your patch works but it hurts performance :(

> >>
> >> lenstra:~/src/qsort> time ./qsbench -n 90000000 -p 1 -s 140175100

> >> 71.500u 1.790s 2:29.18 49.1% 0+0k 0+0io 18498pf+0w

> >> lenstra:~/src/qsort> time ./qsbench -n 90000000 -p 1 -s 140175100

> >> 71.460u 1.990s 2:26.87 50.0% 0+0k 0+0io 18257pf+0w

> >> lenstra:~/src/qsort> time ./qsbench -n 90000000 -p 1 -s 140175100

> >> 71.220u 2.200s 2:26.82 50.0% 0+0k 0+0io 18326pf+0w

> >> 0:55 kswapd

> >>

> >> Linux-2.4.14-pre5:

> >> lenstra:~/src/qsort> time ./qsbench -n 90000000 -p 1 -s 140175100

> >> 70.340u 3.450s 2:13.62 55.2% 0+0k 0+0io 16829pf+0w

> >> lenstra:~/src/qsort> time ./qsbench -n 90000000 -p 1 -s 140175100

> >> 70.590u 2.940s 2:15.48 54.2% 0+0k 0+0io 17182pf+0w

> >> lenstra:~/src/qsort> time ./qsbench -n 90000000 -p 1 -s 140175100

> >> 70.140u 3.480s 2:14.66 54.6% 0+0k 0+0io 17122pf+0w

> >> 0:01 kswapd

> >

> >Hello Lorenzo,

> >

> >to be honest: I expected that. The patch according to my knowledge

> >fixes a "definition hole" in the shrink_cache algorithm. I tend to
say
> >it is the right thing to do it this way, but I am sure it is not as

> >fast as immediate exit to swap. It would be interesting to know if
it
> >does hurt performance in not-near-oom environment. I'd say Andrea
or
> >Linus might know that, or you can try, of course :-)

To clarify this one a bit:
shrink_cache is thought to do what it says, it is given a number of
pages it should somehow manage to free by shrinking the cache. What my
patch does is go after the _whole_ list to fulfill that. One cannot
really say that this is the wrong thing to do, I guess. If it takes
time to _find_ free pages with shrink_cache, then probably the idea to
use it was wrong in the first place (which is not the fault of the
function itself). Or the number of free-pages to find is to high, or
(as a last but guess unrealistic approach) the swap_out eats the time
and shouldn't be called when nr_pages (return value) is equal to zero.
This last one could be checked (hint hint Lorenzo ;-) by simply
modifiying

if (max_swapped==0)

to

if (max_swapped==0 && nr_pages>0)

at the end of shrink_cache.
Thinking again about this it really sounds like the right choice,
because there is no need to swap when we fulfilled the requested
number of free-pages.

You should try.

Thank you for your patience Lorenzo

Regards,
Stephan

PS: just fishing for lobster, Linus ;-)

2001-11-02 00:40:39

[permalink] [raw]

Subject: Re: new OOM heuristic failure (was: Re: VM: qsbench)

On Fri, 2 Nov 2001, Stephan von Krawczynski wrote:
>
> To clarify this one a bit:
> shrink_cache is thought to do what it says, it is given a number of
> pages it should somehow manage to free by shrinking the cache. What my
> patch does is go after the _whole_ list to fulfill that.

I would suggest a slight modification: make "max_mapped" grow as the
priority goes up.

Right now max_mapped is fixed at "nr_pages*10".

You could have something like

max_mapped = nr_pages * 60 / priority;

instead, which might also alleviate the problem with not even bothering to
scan much of the inactive list simply because 99% of all pages are mapped.

That way you don't waste time on looking at the rest of the inactive list
until you _need_ to.

Linus

2001-11-02 02:17:50

[permalink] [raw]

Subject: Re: new OOM heuristic failure (was: Re: VM: qsbench)

>
> On Fri, 2 Nov 2001, Stephan von Krawczynski wrote:
> >
> > To clarify this one a bit:
> > shrink_cache is thought to do what it says, it is given a number
of
> > pages it should somehow manage to free by shrinking the cache.
What my
> > patch does is go after the _whole_ list to fulfill that.
>
> I would suggest a slight modification: make "max_mapped" grow as the
> priority goes up.
>
> Right now max_mapped is fixed at "nr_pages*10".
>
> You could have something like
>
> max_mapped = nr_pages * 60 / priority;
>
> instead, which might also alleviate the problem with not even
bothering to
> scan much of the inactive list simply because 99% of all pages are
mapped.
>
> That way you don't waste time on looking at the rest of the inactive
list
> until you _need_ to.

Wait a minute: there is something illogical in this approach:
Basically you say by making max_mapped bigger that the "early exit"
from shrink_cache shouldn't be that early. But if you _know_ that
merely all pages are mapped, then why don't you just go to swap_out
right away without even walking through the list, because in the end,
you will go to swap_out anyway (simply because of the high percentage
of mapped pages). That makes scanning somehow superfluous. Making it
priority-dependant sounds like you want to swap_out earlier the
_lower_ memory pressure is. In the end it sounds just like a hack to
hold up the early exit against every logic (but not against some
benchmark of course).
It doesn't sound like the right thing.
Is the inactive list somehow sorted currently? If not, could it be
implicitly sorted to match this criteria (not mapped versa mapped), so
that shrink_cache finds the not-mapped first (with a chance to fulfill
nr_pages-request). If it isn't fulfilled and hits the first mapped
page, it can go to swap_out right away, because more scanning doesn't
make sense and can only end in swap_out anyways.

I am no fan of complete list scanning, but if you are looking for
something you have to scan until you find it.

Regards,
Stephan

PS: I am still no pro in this area, so I try to go after the global
picture and find the right direction...

2001-11-02 02:24:50

[permalink] [raw]

Subject: Re: new OOM heuristic failure (was: Re: VM: qsbench)

On Fri, 2 Nov 2001, Stephan von Krawczynski wrote:
>
> Wait a minute: there is something illogical in this approach:
> Basically you say by making max_mapped bigger that the "early exit"
> from shrink_cache shouldn't be that early. But if you _know_ that
> merely all pages are mapped, then why don't you just go to swap_out
> right away without even walking through the list, because in the end,
> you will go to swap_out anyway (simply because of the high percentage
> of mapped pages). That makes scanning somehow superfluous.

Well, no.

There's two things: sure, we know we have tons of mapped pages, and we
obviously will have done the "swap_out()" for th efirst iteration (and
probably the second and third ones too).

But at some point you have to say "Ok, _this_ process has done its due
work to clean up the VM pressure, and now this process needs to get on
with its life and stop caring about other peoples bad memory usage".

Remember: everybody who calls "swap_out()" will free several pages from
the pag tables. And everybody starts off with a low priority (ie 6). So if
we're truly 99% mapped, then every single allocator will start off doing
swap_out(), but at some point they obviously need to do other things too
(ie they need to get to the point int he inactive queue where those
swapped out pages are now, and try to write them out to disk).

Imagine a inactive queue that is a million entries. That's 4GB worth of
RAM, sure, but there are lots of machines like that. If we only allow
shrink_cache() to look at 320 pages at a time, we'll never get a life of
our own.

(Yeah, sure, if you have all that 4GB on the inactive list, and it's all
mapped, you're going to spend some time cleaning it up _regardless_ of
what you do. That's life.)

Linus

2001-11-02 02:31:10

[permalink] [raw]

Subject: Re: new OOM heuristic failure (was: Re: VM: qsbench)

>
> On Fri, 2 Nov 2001, Stephan von Krawczynski wrote:
> >
> > To clarify this one a bit:
> > shrink_cache is thought to do what it says, it is given a number
of
> > pages it should somehow manage to free by shrinking the cache.
What my
> > patch does is go after the _whole_ list to fulfill that.
>
> I would suggest a slight modification: make "max_mapped" grow as the
> priority goes up.
>
> Right now max_mapped is fixed at "nr_pages*10".
>
> You could have something like
>
> max_mapped = nr_pages * 60 / priority;
>
> instead, which might also alleviate the problem with not even
bothering to
> scan much of the inactive list simply because 99% of all pages are
mapped.
>
> That way you don't waste time on looking at the rest of the inactive
list
> until you _need_ to.

Ok. I re-checked the code and found out this approach cannot stand.
the list scan _is_ already exited early when priority is low:

int max_scan = nr_inactive_pages / priority;

while (--max_scan >= 0 && (entry = inactive_list.prev) !=
&inactive_list) {

It will not make big sense to do it again in max_mapped.

On the other hand I am also very sure, that refining:

if (max_mapped==0)
swap_out(priority, gfp_mask, classzone);

return nr_pages;

in the end to:

if (max_mapped==0 && nr_pages>0)
swap_out(priority, gfp_mask, classzone);

return nr_pages;

is a good thing. We don't need swap_out if we gained all the pages
requested, no matter if we _could_ do it or not.

Is there some performance difference in this approach, Lorenzo? I
guess it should.

Regards,
Stephan

2001-11-02 02:42:14

by Ed Tomlinson

[permalink] [raw]

Subject: Re: new OOM heuristic failure (was: Re: VM: qsbench)

Hi,

shrink_caches can end up lying. shrink_dcache_memory and friends do not tell
shrink_caches how many pages they free so nr_pages can be bogus... Is it worth
fixing? The simpliest, harmlessly racey and not too pretty, code follows. It
would also not be hard to change the shrink_ calls to return the number of pages
shrunk, but this would hit more code...

Comments?

Ed Tomlinson

--- linux/mm/vmscan.c.orig Wed Oct 31 14:11:33 2001
+++ linux/mm/vmscan.c Wed Oct 31 14:51:58 2001
@@ -552,6 +552,7 @@
static int shrink_caches(zone_t * classzone, int priority, unsigned int gfp_mask, int nr_pages)
{
int chunk_size = nr_pages;
+ int nr_shrunk;
unsigned long ratio;

nr_pages -= kmem_cache_reap(gfp_mask);
@@ -567,11 +568,21 @@
if (nr_pages <= 0)
return 0;

+ nr_shrunk = nr_free_pages;
+
shrink_dcache_memory(priority, gfp_mask);
shrink_icache_memory(priority, gfp_mask);
#ifdef CONFIG_QUOTA
shrink_dqcache_memory(DEF_PRIORITY, gfp_mask);
#endif
+
+ /* racey - calculate how many pages we got from shrinks */
+ nr_shrunk = nr_free_pages - nr_shrunk;
+ if (nr_shrunk > 0) {
+ nr_pages -= nr_shrunk;
+ if (nr_pages <= 0)
+ return 0;
+ }

return nr_pages;
}

2001-11-02 02:55:36

[permalink] [raw]

Subject: Re: new OOM heuristic failure (was: Re: VM: qsbench)

> Ok. I re-checked the code and found out this approach cannot stand.

> the list scan _is_ already exited early when priority is low:

Sorry for followup on my own mail, but there is another thing that
comes to my mind:

swap_out is currently in no way priority-dependant. But it could be
(the parameter is there). How about swapping more pages in tighter
memory situation? The basic idea is that if there is a rising need for
mem it cannot be wrong to do a bit more than under normal
circumstances. One could achieve this simply by:

int counter, nr_pages = SWAP_CLUSTER_MAX;

to

int counter, nr_pages = SWAP_CLUSTER_MAX * DEF_PRIORITY /
priority;

in swap_out.
The idea behind is to reduce the overhead in finding out if swapping
is needed by simply swapping more everytime we already gone "the long
way to knowing".

Regards,
Stephan

2001-11-02 03:02:05

[permalink] [raw]

Subject: Re: new OOM heuristic failure (was: Re: VM: qsbench)

> Hi,
>
> shrink_caches can end up lying. shrink_dcache_memory and friends do
not tell
> shrink_caches how many pages they free so nr_pages can be bogus...
Is it worth
> fixing? The simpliest, harmlessly racey and not too pretty, code
follows. It
> would also not be hard to change the shrink_ calls to return the
number of pages
> shrunk, but this would hit more code...
>
> Comments?

I believe the idea of having a more precise nr_pages value can make a
difference. We are trying to estimate if swapping is needed, which is
pretty expensive. If we can avoid it by more accurately knowing what
is really going on (without _too_ much costs) we can only win.

Regards,
Stephan

2001-11-02 13:00:41

[permalink] [raw]

Subject: Re: new OOM heuristic failure (was: Re: VM: qsbench)

Hello Lorenzo,

please find attached next vmscan.c patch which sums up the delayed swap_out
(first patch), the fix for not swapping when nr_pages is reached, and (new) the
idea to swap more pages in one call to swap_out if priority gets higher.

I have not the slightest idea what all this does to the performance. Especially
the "more" swap_out code is a pure try-and-error type of thing. Can you do some
testing please?

Thanks,
Stephan

Attachments:

vmscan-patch2 (1.48 kB)

2001-11-02 17:34:58