2001-10-16 12:14:40

by Randy Hron

Subject: VM test on 2.4.13-pre3aa1 (compared to 2.4.12-aa1 and 2.4.13-pre2aa1)


Summary:

Wall clock time for this test has dropped dramatically (which
is good) over the last 3 Andrea Arcangeli patched kernels.
mp3blaster sounds less pleasant though.


Test:

Run a loop of 10 iterations of the Linux Test Project's "mtest01 -p80 -w".
This test attempts to allocate 80% of virtual memory and write to
each page. Simultaneously, listen to mp3blaster.


Reboot before each test.
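
For reference, the core of the test boils down to something like this
(a sketch of what "mtest01 -p80 -w" does, not the actual LTP source;
the chunk size and fill byte here are my own assumptions):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/sysinfo.h>

int main(void)
{
	struct sysinfo si;
	unsigned long long target, allocated = 0;
	const size_t chunk = 1024 * 1024;
	char *p;

	if (sysinfo(&si) != 0)
		return 1;
	/* -p80: target 80% of (RAM + swap) */
	target = (unsigned long long)(si.totalram + si.totalswap)
		 * si.mem_unit * 80 / 100;
	while (allocated < target) {
		p = malloc(chunk);	/* never freed: this is a memory hog */
		if (!p)
			break;
		/* -w: write to each page so none stays untouched */
		memset(p, 0xa5, chunk);
		allocated += chunk;
	}
	printf("%llu bytes allocated.\n", allocated);
	return 0;
}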

Hardware:
Athlon 1333
512 MB RAM
1024 MB swap


I've shown the last two results in a previous message, but the
side-by-side comparison is pretty exciting.

2.4.13pre3aa1

Averages for 10 mtest01 runs
bytes allocated: 1240045977
User time (seconds): 2.106
System time (seconds): 2.738
Elapsed (wall clock) time: 39.408
Percent of CPU this job got: 11.70
Major (requiring I/O) page faults: 110.0
Minor (reclaiming a frame) faults: 303527.4

2.4.13pre2aa1

Averages for 10 mtest01 runs
bytes allocated: 1245184000
User time (seconds): 2.050
System time (seconds): 2.874
Elapsed (wall clock) time: 49.513
Percent of CPU this job got: 9.70
Major (requiring I/O) page faults: 115.6
Minor (reclaiming a frame) faults: 304781.9

2.4.12aa1

Averages for 10 mtest01 runs
bytes allocated: 1253362892
User time (seconds): 2.099
System time (seconds): 2.823
Elapsed (wall clock) time: 64.109
Percent of CPU this job got: 7.50
Major (requiring I/O) page faults: 135.2
Minor (reclaiming a frame) faults: 306779.8


The rest of the results below are just from 2.4.13pre3aa1.

mtest01 passes each time with the expected 1.2 gigabytes of
memory allocated:

PASS ... 1215299584 bytes allocated.
PASS ... 1242562560 bytes allocated.
PASS ... 1240465408 bytes allocated.
PASS ... 1241513984 bytes allocated.
PASS ... 1244659712 bytes allocated.
PASS ... 1241513984 bytes allocated.
PASS ... 1245708288 bytes allocated.
PASS ... 1242562560 bytes allocated.
PASS ... 1243611136 bytes allocated.
PASS ... 1242562560 bytes allocated.

mp3blaster is much less pleasant as the wall clock time for the VM test improves.
With 2.4.13pre3aa1, mp3blaster stutters through almost the entire run.
The last 3-4 seconds of each iteration sound good though (the highest vmstat
swpd value and the next 2 low values). This "sounds good" window may actually
be the first 3-4 seconds of the test.


vmstat 1 output for 1 iteration:

The vmstat output starts towards the end of one iteration, goes through a
complete cycle, then into the beginning of another.

procs memory swap io system cpu
r b w swpd free buff cache si so bi bo in cs us sy id
0 6 1 685252 1548 1188 1136 72 23192 380 23192 419 310 3 8 89
0 6 1 707740 1648 1196 1140 44 20532 412 20552 368 335 5 9 86
0 4 0 725628 3624 1176 1152 32 20264 512 20264 343 300 2 10 88

mp3blaster sounds good

1 4 0 738192 3312 1216 1908 516 11276 1528 11276 467 435 3 4 93
2 0 0 15928 387480 1264 3148 352 0 1632 0 477 686 19 24 56
2 0 0 15756 122780 1280 3172 0 0 24 24 285 563 35 65 0

mp3blaster stutters until the end of the test iteration.

3 3 0 47424 3788 1172 1412 860 40228 892 40236 789 819 12 23 66
0 5 1 90244 1656 1184 1416 1032 39568 1076 39572 653 425 6 5 89
1 3 0 129592 3744 1176 1416 236 40960 276 40988 588 432 5 8 87
0 2 1 159260 3584 1172 1540 132 27676 300 27680 396 270 7 9 84
0 5 1 187764 2572 1184 1416 312 29632 368 29636 534 448 5 7 88
0 5 1 218844 1648 1176 1416 220 31268 256 31272 560 486 5 7 89
0 2 1 242820 2548 1172 1416 124 24576 168 24600 419 376 3 8 89
1 1 1 280660 3052 1176 1416 60 36352 116 36356 554 439 3 10 87
0 3 1 325164 2036 1176 1416 40 44832 76 44836 586 467 4 10 86
0 3 1 350204 1660 1172 1420 44 25824 88 25852 432 319 3 12 85
1 2 1 396728 3564 1184 1416 72 45780 120 45784 637 528 3 12 86
0 3 1 423816 3572 1180 1416 48 27020 80 27024 420 361 2 14 84
procs memory swap io system cpu
r b w swpd free buff cache si so bi bo in cs us sy id
0 3 2 467284 1644 1200 1420 52 42816 100 42832 627 482 5 10 85
0 3 1 490292 2040 1180 1420 32 23648 40 23656 344 242 6 12 82
0 3 1 512764 1660 1172 956 292 23604 340 23604 426 273 5 14 81
0 3 1 539844 2108 1184 968 56 26728 316 26728 463 338 3 11 86
0 3 1 563852 2036 1184 976 56 23500 512 23500 440 357 3 11 86
1 2 1 579656 1908 1172 1004 332 17356 1324 17380 411 352 3 8 89
1 3 1 605720 1652 1184 1024 48 24656 516 24656 456 375 0 11 89
0 6 1 627676 3176 1200 1028 64 21432 316 21436 386 283 3 9 88
1 3 0 642980 3804 1180 1048 56 16376 888 16376 356 280 4 7 89
2 3 1 661348 1776 1180 1064 312 18816 724 18860 390 340 2 9 89
0 4 1 686888 2148 1184 1256 68 23992 848 23992 443 359 3 9 88
1 4 1 705276 1896 1188 1116 44 19880 836 19880 431 331 2 6 92

mp3blaster sounds good

0 5 1 724676 1652 1192 1124 240 18388 1084 18388 371 336 2 13 85
1 4 1 16348 491352 1220 1796 512 12108 1332 12132 393 403 2 11 87
2 2 0 15872 489360 1264 3004 732 0 1984 0 472 691 2 3 95
3 0 0 15692 266700 1284 3168 116 0 296 0 344 639 41 46 14

mp3blaster begins to stutter again

2 0 0 14604 4480 1284 3196 316 0 344 0 300 587 46 54 0
0 4 0 32952 3572 1176 1464 372 21932 392 21944 393 313 9 12 79


vmstat 1 output from 2.4.12aa1 and 2.4.13pre2aa1 is in previous messages; the
subject is something like "VM test on {kernel versions}". Two separate tiny
email threads.

--
Randy Hron


2001-10-17 00:17:47

by Andrea Arcangeli

Subject: Re: VM test on 2.4.13-pre3aa1 (compared to 2.4.12-aa1 and 2.4.13-pre2aa1)

On Tue, Oct 16, 2001 at 08:16:39AM -0400, [email protected] wrote:
>
> Summary:
>
> Wall clock time for this test has dropped dramatically (which
> is good) over the last 3 Andrea Arcangeli patched kernels.

:) I worked the last two days to make it faster under swap, and it's nice to
see that your tests confirm that. I'm only worried it swaps too
much when swap is not needed, but if that turns out to be the case it will
be very easy to fix. And a very minor bit of very occasional background
pagetable scanning shouldn't hurt anyway. So far on my desktop it seems
not to swap too much.

> mp3blaster sounds less pleasant though.

A (very) optimistic theory could be that the increase in swap
throughput is decreasing the bandwidth available to read the mp3 8). Do
you swap on the same physical disk where you keep the mp3? But it may be
that I'm blocking too easily waiting for I/O completion instead, or that
the mp3blast routines needed for playback have been swapped out;
dunno with only this info. You can rule out the "mp3blast has been
swapped out" theory by running mp3blast after an mlockall. And you can avoid
the disk bandwidth problem by putting the mp3 on a separate disk.

So far I've received very good feedback about 2.4.13pre3aa1 [also Luigi's
and Mario's problems have gone away completely] (I'm also happy myself on my
machine). It may need further tuning, but I'd hope it's only a matter of
changing a few lines of code.

> 3 3 0 47424 3788 1172 1412 860 40228 892 40236 789 819 12 23 66
> 0 5 1 90244 1656 1184 1416 1032 39568 1076 39572 653 425 6 5 89

those swapins could be due to mp3blast getting swapped out
continuously while it sleeps. It's not easy for the VM to understand it has
to stay in cache, and it makes sense it gets swapped out faster the
faster the swap rate is. Could you also make sure to run mp3blast at
-20 priority and the swap-hog at +19 priority, just in case?

thanks for the feedback!

Andrea

2001-10-17 01:32:26

by Beau Kuiper

Subject: Re: VM test on 2.4.13-pre3aa1 (compared to 2.4.12-aa1 and 2.4.13-pre2aa1)

On Wed, 17 Oct 2001, Andrea Arcangeli wrote:

> On Tue, Oct 16, 2001 at 08:16:39AM -0400, [email protected] wrote:
> >
> > Summary:
> >
> > Wall clock time for this test has dropped dramatically (which
> > is good) over the last 3 Andrea Arcangeli patched kernels.
>
> :) I worked the last two days to make it faster under swap, and it's nice to
> see that your tests confirm that. I'm only worried it swaps too
> much when swap is not needed, but if that turns out to be the case it will
> be very easy to fix. And a very minor bit of very occasional background
> pagetable scanning shouldn't hurt anyway. So far on my desktop it seems
> not to swap too much.

Swapping too much probably has a lot to do with the particular hard drive
and its performance. Is there any way of adding a configurable option (via
sysctl) to let administrators tune how aggressively the kernel swaps out
data versus throwing out disk cache? If set to aggressive, the kernel
would try hard to use swap to free up memory; if set to conservative, it
would try to free disk cache (up to a limit) instead of swapping stuff
out to free memory.

Beau Kuiper
[email protected]


2001-10-17 02:14:44

by Andrea Arcangeli

Subject: Re: VM test on 2.4.13-pre3aa1 (compared to 2.4.12-aa1 and 2.4.13-pre2aa1)

On Wed, Oct 17, 2001 at 09:32:12AM +0800, Beau Kuiper wrote:
> On Wed, 17 Oct 2001, Andrea Arcangeli wrote:
>
> > On Tue, Oct 16, 2001 at 08:16:39AM -0400, [email protected] wrote:
> > >
> > > Summary:
> > >
> > > Wall clock time for this test has dropped dramatically (which
> > > is good) over the last 3 Andrea Arcangeli patched kernels.
> >
> > :) I worked the last two days to make it faster under swap, and it's nice to
> > see that your tests confirm that. I'm only worried it swaps too
> > much when swap is not needed, but if that turns out to be the case it will
> > be very easy to fix. And a very minor bit of very occasional background
> > pagetable scanning shouldn't hurt anyway. So far on my desktop it seems
> > not to swap too much.
>
> Swapping too much probably has a lot to do with the particular hard drive
> and its performance. Is there any way of adding a configurable option (via
> sysctl) to let administrators tune how aggressively the kernel swaps out
> data versus throwing out disk cache? If set to aggressive, the kernel
> would try hard to use swap to free up memory; if set to conservative, it
> would try to free disk cache (up to a limit) instead of swapping stuff
> out to free memory.

I could add a sysctl to control that. In short, the change consists of
making DEF_PRIORITY in mm/vmscan.c a variable rather than a
preprocessor #define. That's the "ratio" number I was talking about in
my last email to Rik, and if you read ac/mm/vmscan.c you'll find it
there too.
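
(As a sketch, the change being described would be something like the
following in mm/vmscan.c; this is not the actual patch, the old value
is from memory, and the new variable name is invented:)

/* before: a compile-time constant */
#define DEF_PRIORITY (6)

/* after: a variable that a new sysctl entry could point at */
int vm_def_priority = 6;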

That's basically the only number I left in the code; everything
else should be completely dynamic behaviour. Anyway, this number
isn't critical either; as said, it shouldn't make a huge difference,
but yes, it could be tunable.

However, one of the reasons I didn't do that is that I still believe the VM
should be autotuning and should get its behaviour from concepts, not from
random tweaking. But I cannot imagine at the moment how to make even
such a fixed number go away :), so at the moment it could make some
sense to make it a sysctl.

The probe of the cache that lets me start swapouts before we have really
failed at shrinking the cache doesn't sound like random tweaking to me
either (maybe I'm biased 8); it instead lets us free memory and swap out at
the very same time, and this seems beneficial.

Andrea

2001-10-17 02:36:27

by Andrea Arcangeli

Subject: Re: VM test on 2.4.13-pre3aa1 (compared to 2.4.12-aa1 and 2.4.13-pre2aa1)

On Wed, Oct 17, 2001 at 02:12:42AM +0200, Andrea Arcangeli wrote:
> > 3 3 0 47424 3788 1172 1412 860 40228 892 40236 789 819 12 23 66
> > 0 5 1 90244 1656 1184 1416 1032 39568 1076 39572 653 425 6 5 89
>
> those swapins could be due to mp3blast getting swapped out
> continuously while it sleeps. It's not easy for the VM to understand it has

I noticed that another thing that changed between vanilla 2.4.13pre2 and
2.4.13pre3 is the setting of page_cluster on machines with lots of RAM.

You'll now find page_cluster set to 6, which means "1 << 6 << 12"
bytes will be paged in at each major fault, while previously only "1 <<
4 << 12" bytes were paged in.

So I'd suggest trying again after "echo 4 > /proc/sys/vm/page-cluster"
to see if it makes any difference.

Andrea

2001-10-17 03:57:35

by Randy Hron

Subject: Re: VM test on 2.4.13-pre3aa1 (compared to 2.4.12-aa1 and 2.4.13-pre2aa1)

On Wed, Oct 17, 2001 at 02:12:42AM +0200, Andrea Arcangeli wrote:
> On Tue, Oct 16, 2001 at 08:16:39AM -0400, [email protected] wrote:
> >
> > Wall clock time for this test has dropped dramatically (which
> > is good) over the last 3 Andrea Arcangeli patched kernels.
>
> > mp3blaster sounds less pleasant though.
>
> A (very) optimistic theory could be that the increase in swap
> throughput is decreasing the bandwidth available to read the mp3 8). Do
> you swap on the same physical disk where you keep the mp3? But it may be
> that I'm blocking too easily waiting for I/O completion instead, or that
> the mp3blast routines needed for playback have been swapped out;

That theory makes sense. 2.4.13-pre3aa1 seems more aggressive at
making memory (swap) available to memory hogs. On this machine,
2.4.12aa1 would be aggressive from a small swpd up to about 130000,
2.4.13-pre2aa1 was aggressive until swpd reached around 280000,
and 2.4.13-pre3aa1 is aggressive as long as swap is needed.

I say "aggressive" based on when mp3blaster starts to sputter.

The mp3 is on the same disk as swap and everything else.

> dunno with only this info. You can rule out the "mp3blast has been
> swapped out" theory by running mp3blast after an mlockall. And you can avoid
> the disk bandwidth problem by putting the mp3 on a separate disk.

I didn't find a userspace mlockall program on freshmeat or icewalkers.
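
(One way to improvise one: since memory locks are dropped across exec,
a plain wrapper binary wouldn't stick, but an LD_PRELOAD constructor
would. A sketch, untested against mp3blaster, and it needs root for
the lock to be permitted:)

/* mlockall_preload.c
 * build: gcc -shared -fPIC -o mlockall_preload.so mlockall_preload.c
 * use:   LD_PRELOAD=./mlockall_preload.so mp3blaster ...
 */
#include <stdio.h>
#include <sys/mman.h>

__attribute__((constructor))
static void lock_everything(void)
{
	/* pin all current and future pages of this process in RAM */
	if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0)
		perror("mlockall");
}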


> > 3 3 0 47424 3788 1172 1412 860 40228 892 40236 789 819 12 23 66
> > 0 5 1 90244 1656 1184 1416 1032 39568 1076 39572 653 425 6 5 89
>
> those swapins could be due to mp3blast getting swapped out
> continuously while it sleeps. It's not easy for the VM to understand it has
> to stay in cache, and it makes sense it gets swapped out faster the
> faster the swap rate is. Could you also make sure to run mp3blast at
> -20 priority and the swap-hog at +19 priority, just in case?

I did 3 tests using "nice".

1) nothing niced
2) mp3blaster not nice
3) mtest01 very nice, and mp3blaster not nice

mp3blaster uses about 11 seconds of CPU time to play a 3 minute mp3 on this machine.

Here is a bit of ps output with mtest01 very nice and mp3blaster un-nice:

F S UID PID PPID C PRI NI ADDR SZ WCHAN TTY TIME CMD
000 S 18008 15643 93 0 59 -20 - 7455 nanosl tty3 00:00:00 mp3blaster
002 S 18008 15644 15643 0 59 -20 - 7455 do_pol tty3 00:00:00 mp3blaster
002 S 18008 15645 15644 1 59 -20 - 7455 nanosl tty3 00:00:15 mp3blaster
002 S 18008 15710 15644 0 59 -20 - 7455 end tty3 00:00:00 mp3blaster
004 S 0 15711 91 0 79 19 - 530 wait4 tty1 00:00:00 mmtest
004 S 0 15714 15711 0 79 19 - 331 nanosl tty1 00:00:00 vmstat
000 R 18008 15717 98 5 75 0 - 727 - tty8 00:00:00 ps
000 S 0 15718 15711 0 79 19 - 318 wait4 tty1 00:00:00 time
000 R 0 15719 15718 0 79 19 - 4686 - tty1 00:00:00 mtest01

Changing nice values didn't really have any effect on mp3blaster's sound quality.

mp3blaster not nice, mtest01 very nice

Averages for 10 mtest01 runs
bytes allocated: 1238577971
User time (seconds): 2.062
System time (seconds): 2.715
Elapsed (wall clock) time: 40.606
Percent of CPU this job got: 11.50
Major (requiring I/O) page faults: 108.3
Minor (reclaiming a frame) faults: 303169.0

mp3blaster not nice

Averages for 10 mtest01 runs
bytes allocated: 1221800755
User time (seconds): 2.059
System time (seconds): 2.697
Elapsed (wall clock) time: 37.597
Percent of CPU this job got: 12.10
Major (requiring I/O) page faults: 115.2
Minor (reclaiming a frame) faults: 299073.0

no nice processes

Averages for 10 mtest01 runs
bytes allocated: 1240045977
User time (seconds): 2.106
System time (seconds): 2.738
Elapsed (wall clock) time: 39.408
Percent of CPU this job got: 11.70
Major (requiring I/O) page faults: 110.0
Minor (reclaiming a frame) faults: 303527.4

Note the total test time is around 400 seconds (wall clock * 10).
The mp3 would play just over 120 seconds by the time mtest01 completed
10 iterations.


I did a fourth run with strace -p 15645 (the mp3blaster PID using the most CPU time).

read(6, "\20Ks\303\303\222\236o\272\231\177\32\316\360\341\314z"..., 4096) = 4096
nanosleep({0, 200000}, NULL) = 0 (5 calls to nanosleep)
time([1003288001]) = 1003288001
nanosleep({0, 200000}, NULL) = 0 (21 calls to nanosleep)
read(6, "\356$\365\274)\332\336\277c\375\356>+\234\307q\213\6\4"..., 4096) = 4096


When not running mtest01, strace is like this:

read(6, "\317W\234\311i\230\273\221\276J5\245\310A\251\226C?\202"..., 4096) = 4096
nanosleep({0, 200000}, NULL) = 0 (4 calls to nanosleep)
time([1003287905]) = 1003287905
nanosleep({0, 200000}, NULL) = 0 (3 calls to nanosleep)
read(6, "$Q\17\357aL\264\301e\357S\370h{4\322L\246\344\273y\232"..., 4096) = 4096


Oddly, it appears there are more calls to nanosleep when mp3blaster is sputtering
(and fighting for I/O or memory?).


> thanks for the feedback!
>
> Andrea

My pleasure!

--
Randy Hron

2001-10-17 04:46:23

by Randy Hron

Subject: Re: VM test on 2.4.13-pre3aa1 (compared to 2.4.12-aa1 and 2.4.13-pre2aa1)

On Wed, Oct 17, 2001 at 04:31:03AM +0200, Andrea Arcangeli wrote:
> I noticed that another thing that changed between vanilla 2.4.13pre2 and
> 2.4.13pre3 is the setting of page_cluster on machines with lots of RAM.
>
> You'll now find page_cluster set to 6, which means "1 << 6 << 12"
> bytes will be paged in at each major fault, while previously only "1 <<
> 4 << 12" bytes were paged in.
>
> So I'd suggest trying again after "echo 4 > /proc/sys/vm/page-cluster"
> to see if it makes any difference.
>
> Andrea

You Rule!

The tweak to page-cluster is basically magic for this test.

With page-cluster=4, the mp3blaster sputtered like 2.4.13pre2aa1.
Better, but not beautiful.

Real beauty happens with page-cluster=2. There is virtually no sputter.
And the wall clock time is a little better than 2.4.13pre2aa1!

I don't know what page-cluster size is best for everything, but
2.4.12aa1 (which was very good IMHO) sputtered about 10 seconds per
iteration, and each iteration took 64 seconds.

2.4.13pre3aa1 with no sputters: 48 seconds.

Amazing!

Also, interactive "feel" is much better. This test used to really
brutalize keyboard response. With 2.4.13-pre3aa1 and
page-cluster=2, the box is still usable (for more than listening
to mp3's :)).

page-cluster = 6

Averages for 10 mtest01 runs
bytes allocated: 1236166246
User time (seconds): 2.299
System time (seconds): 2.951
Elapsed (wall clock) time: 41.969
Percent of CPU this job got: 12.00
Major (requiring I/O) page faults: 113.5
Minor (reclaiming a frame) faults: 302580.3

page-cluster = 4

Averages for 10 mtest01 runs
bytes allocated: 1237529395
User time (seconds): 2.097
System time (seconds): 2.788
Elapsed (wall clock) time: 49.394
Percent of CPU this job got: 9.50
Major (requiring I/O) page faults: 120.3
Minor (reclaiming a frame) faults: 302914.1

page-cluster = 2

Averages for 10 mtest01 runs
bytes allocated: 1239521689
User time (seconds): 2.051
System time (seconds): 2.785
Elapsed (wall clock) time: 47.878
Percent of CPU this job got: 9.80
Major (requiring I/O) page faults: 114.0
Minor (reclaiming a frame) faults: 303399.7

The wall clock time went up somewhat compared to page-cluster=6.
Here is where we were before:

2.4.13-pre2aa1

Averages for 10 mtest01 runs
bytes allocated: 1245184000
User time (seconds): 2.050
System time (seconds): 2.874
Elapsed (wall clock) time: 49.513
Percent of CPU this job got: 9.70
Major (requiring I/O) page faults: 115.6
Minor (reclaiming a frame) faults: 304781.9

2.4.12aa1

Averages for 10 mtest01 runs
bytes allocated: 1253362892
User time (seconds): 2.099
System time (seconds): 2.823
Elapsed (wall clock) time: 64.109
Percent of CPU this job got: 7.50
Major (requiring I/O) page faults: 135.2
Minor (reclaiming a frame) faults: 306779.8


--
Randy Hron

2001-10-17 16:28:19

by Linus Torvalds

Subject: Re: VM test on 2.4.13-pre3aa1 (compared to 2.4.12-aa1 and 2.4.13-pre2aa1)

In article <[email protected]>, <[email protected]> wrote:
>>
>> So I'd suggest trying again after "echo 4 > /proc/sys/vm/page-cluster"
>> to see if it makes any difference.
>>
>> Andrea
>
>You Rule!
>
>The tweak to page-cluster is basically magic for this test.
>
>With page-cluster=4, the mp3blaster sputtered like 2.4.13pre2aa1.
>Better, but not beautiful.
>
>Real beauty happens with page-cluster=2. There is virtually no sputter.
>And the wall clock time is a little better than 2.4.13pre2aa1!

This is good information.

The problem is that "page-cluster" is actually used for two different
things: it's used for mmap page-in clustering, and it's used for swap
page-in clustering, and they probably have rather different behaviours.

Setting page-cluster to 2 means that both mmap and swap page-in will
cluster only four pages, which might slow down mmap throughput when
not swapping
(and make program loading in particular slow down under disk load). At
the same time it's probably perfectly fine for swapping - I think
Marcelo eventually wants to re-do the swapin read-clustering anyway.

And wall-clock time apparently did increase with page-clustering
lowered, although personally I like latency more than throughput, so that
doesn't really bother me.

However, I'd really like to know whether it is mmap or swap clustering
that matters more, so it would be interesting to hear what happens if
you remove the "swapin_readahead(entry)" line in mm/memory.c (in
do_swap_page()). Does a large page-cluster value still make matters
worse when it's disabled for swapping? (In other words: does
page-cluster actually hurt for mmap too, or is the problem strictly
related to swapping?)
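
(For context, the spot in question looks roughly like this in 2.4's
do_swap_page(); abbreviated from memory, so the surrounding code may
differ:)

	page = lookup_swap_cache(entry);
	if (!page) {
		swapin_readahead(entry);	/* <-- the line to remove */
		page = read_swap_cache_async(entry);
		/* ... */
	}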

Willing to test your load?

Thanks,
Linus

2001-10-17 16:40:29

by Marcelo Tosatti

Subject: Re: VM test on 2.4.13-pre3aa1 (compared to 2.4.12-aa1 and 2.4.13-pre2aa1)



On Wed, 17 Oct 2001, Linus Torvalds wrote:

> In article <[email protected]>, <[email protected]> wrote:
> >>
> >> So I'd suggest trying again after "echo 4 > /proc/sys/vm/page-cluster"
> >> to see if it makes any difference.
> >>
> >> Andrea
> >
> >You Rule!
> >
> >The tweak to page-cluster is basically magic for this test.
> >
> >With page-cluster=4, the mp3blaster sputtered like 2.4.13pre2aa1.
> >Better, but not beautiful.
> >
> >Real beauty happens with page-cluster=2. There is virtually no sputter.
> >And the wall clock time is a little better than 2.4.13pre2aa1!
>
> This is good information.
>
> The problem is that "page-cluster" is actually used for two different
> things: it's used for mmap page-in clustering, and it's used for swap
> page-in clustering, and they probably have rather different behaviours.

It's also used to limit the number of in-flight swapouts. That
different-meanings thingie sucks: I would say we need to separate them :)

2001-10-18 19:36:11

by Bill Davidsen

Subject: Re: VM test on 2.4.13-pre3aa1 (compared to 2.4.12-aa1 and 2.4.13-pre2aa1)

In article <[email protected]> [email protected] wrote:
>On Wed, Oct 17, 2001 at 09:32:12AM +0800, Beau Kuiper wrote:

>> Swapping too much probably has a lot to do with the particular hard drive
>> and its performance. Is there any way of adding a configurable option (via
>> sysctl) to let administrators tune how aggressively the kernel swaps out
>> data versus throwing out disk cache? If set to aggressive, the kernel
>> would try hard to use swap to free up memory; if set to conservative, it
>> would try to free disk cache (up to a limit) instead of swapping stuff
>> out to free memory.
>
>I could add a sysctl to control that. In short, the change consists of
>making DEF_PRIORITY in mm/vmscan.c a variable rather than a
>preprocessor #define. That's the "ratio" number I was talking about in
>my last email to Rik, and if you read ac/mm/vmscan.c you'll find it
>there too.

I think that would give people a sense of control.

>That's basically the only number I left in the code; everything
>else should be completely dynamic behaviour. Anyway, this number
>isn't critical either; as said, it shouldn't make a huge difference,
>but yes, it could be tunable.
>
>However, one of the reasons I didn't do that is that I still believe the VM
>should be autotuning and should get its behaviour from concepts, not from
>random tweaking. But I cannot imagine at the moment how to make even
>such a fixed number go away :), so at the moment it could make some
>sense to make it a sysctl.

I think it's desirable for the VM to run well on autopilot in as many
cases as possible, because Linux is going to be used by some very
non-technical users. However, it is also strong on small machines, old
PCs, embedded uses, etc. These would benefit from tuning the ratio of
swap to buffer use, and also from being able to specify a large
available page pool for applications which suddenly need memory that
is a large percentage of physical memory.

| The probe of the cache that lets me start swapouts before we have really
| failed at shrinking the cache doesn't sound like random tweaking to me
| either (maybe I'm biased 8); it instead lets us free memory and swap out at
| the very same time, and this seems beneficial.

If it can work well in common cases for average users, and still allow
tuning by people who have special needs, I'm all for it. I'm sure you
understand the problems with self-tuning VM as well as anyone; I just
want to suggest that for uncommon situations you provide a way for
knowledgeable users to handle special situations which need info not
otherwise available to the VM.

--
bill davidsen <[email protected]>
His first management concern is not solving the problem, but covering
his ass. If he lived in the middle ages he'd wear his codpiece backward.

2001-10-18 19:45:31

by Bill Davidsen

Subject: Re: VM test on 2.4.13-pre3aa1 (compared to 2.4.12-aa1 and 2.4.13-pre2aa1)

In article <[email protected]> [email protected] wrote:
>On Wed, Oct 17, 2001 at 04:31:03AM +0200, Andrea Arcangeli wrote:
>> I noticed that another thing that changed between vanilla 2.4.13pre2 and
>> 2.4.13pre3 is the setting of page_cluster on machines with lots of RAM.
>>
>> You'll now find page_cluster set to 6, which means "1 << 6 << 12"
>> bytes will be paged in at each major fault, while previously only "1 <<
>> 4 << 12" bytes were paged in.
>>
>> So I'd suggest trying again after "echo 4 > /proc/sys/vm/page-cluster"
>> to see if it makes any difference.
>>
>> Andrea
>
>You Rule!
>
>The tweak to page-cluster is basically magic for this test.

Out of curiosity, did you play with the 'preempt' patch at all?

--
bill davidsen <[email protected]>
His first management concern is not solving the problem, but covering
his ass. If he lived in the middle ages he'd wear his codpiece backward.

2001-10-18 23:29:15

by Andrea Arcangeli

Subject: Re: VM test on 2.4.13-pre3aa1 (compared to 2.4.12-aa1 and 2.4.13-pre2aa1)

On Thu, Oct 18, 2001 at 03:36:21PM -0400, bill davidsen wrote:
> In article <[email protected]> [email protected] wrote:
> >On Wed, Oct 17, 2001 at 09:32:12AM +0800, Beau Kuiper wrote:
>
> >> Swapping too much probably has a lot to do with the particular hard drive
> >> and its performance. Is there any way of adding a configurable option (via
> >> sysctl) to let administrators tune how aggressively the kernel swaps out
> >> data versus throwing out disk cache? If set to aggressive, the kernel
> >> would try hard to use swap to free up memory; if set to conservative, it
> >> would try to free disk cache (up to a limit) instead of swapping stuff
> >> out to free memory.
> >
> >I could add a sysctl to control that. In short, the change consists of
> >making DEF_PRIORITY in mm/vmscan.c a variable rather than a
> >preprocessor #define. That's the "ratio" number I was talking about in
> >my last email to Rik, and if you read ac/mm/vmscan.c you'll find it
> >there too.
>
> I think that would give people a sense of control.

ok, I added three sysctls:

andrea@laser:/misc/andrea-athlon > ls /proc/sys/vm/vm_*
/proc/sys/vm/vm_balance_ratio /proc/sys/vm/vm_mapped_ratio /proc/sys/vm/vm_scan_ratio
andrea@laser:/misc/andrea-athlon >

with some commentary in the source code:

/*
* The "vm_scan_ratio" is how much of the queues we will scan
* in one go. A value of 6 for vm_scan_ratio implies that we'll
* scan 1/6 of the inactive list during a normal aging round.
*/
int vm_scan_ratio = 8;

/*
* The "vm_mapped_ratio" controls when to start early-paging, we probe
* the inactive list during shrink_cache() and if there are too many
* mapped unfreeable pages we have an indication that we'd better
* start paging. The bigger vm_mapped_ratio is, the eaerlier the
* machine will run into swapping activities.
*/
int vm_mapped_ratio = 32;

/*
* The "vm_balance_ratio" controls the balance between active and
* inactive cache. The bigger vm_balance_ratio is, the easier the
* active cache will grow, because we'll rotate the active list
* slowly. A value of 4 means we'll go towards a balance of
* 1/5 of the cache being inactive.
*/
int vm_balance_ratio = 16;

I'm still testing though, so it's not guaranteed that the above will
remain the same :).
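
(For anyone wanting to poke at these: exposing such a variable in
2.4's kernel/sysctl.c is one more ctl_table entry, roughly like the
sketch below, where VM_SCAN_RATIO stands in for a new ctl_name enum
value in linux/sysctl.h:)

static ctl_table vm_table[] = {
	/* ... existing /proc/sys/vm entries ... */
	{VM_SCAN_RATIO, "vm_scan_ratio", &vm_scan_ratio,
	 sizeof(int), 0644, NULL, &proc_dointvec},
	/* ... */
	{0}
};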

> If it can work well in common cases for average users, and still allow
> tuning by people who have special needs, I'm all for it. I'm sure you
> understand the problems with self-tuning VM as well as anyone, I just
> want to suggest that for uncommon situations you provide a way for
> knowledgable users to handle special situations which need info not
> available to the VM otherwise.

ok. Another argument is that by making those sysctls tunable, people can
test and report the best numbers for their workloads.
Those are fixed numbers anyway; they're magic numbers, they're not
perfect, they just tend to do the right thing, and changing them slightly
isn't going to make a big difference if the machine has enough RAM for
doing its work.

Andrea