2001-10-01 18:32:23

by Lorenzo Allegrucci

[permalink] [raw]
Subject: VM: 2.4.10 vs. 2.4.10-ac2 and qsort()


Disclaimer:
I don't know if this "benchmark" is meaningful or not, but anyhow..

As workload I used qsort() on an array of 90000000 32 bit ints.
(trivial code to the end of my email).

The VSIZE of the resulting process is about 343Mb.
Tested machine has 256Mb of RAM + 200Mb of swap.
Same srand() in all tests.

Below are linux-2.4.10 results

run 1:
real 4m54.728s
user 2m47.910s
sys 0m2.520s

run 2:
real 4m55.109s
user 2m46.050s
sys 0m2.530s

kswapd CPU time: 3 seconds
qs RSS always on 238-240M, very stable never below 235M.

.. and 2.4.10-ac2 results

run 1:
real 6m2.139s
user 2m44.390s
sys 0m3.210s

run 2:
real 6m57.140s
user 2m47.050s
sys 0m3.560s

kswapd CPU time: 20 seconds
qs RSS never above 204M, average value 150M.


Comments?

------------- qs.c ---------------
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <malloc.h>
#include <time.h>

int cmp(const void * x, const void * y)
{
int *a, *b;

a = (int *)x;
b = (int *)y;

if (*a == *b)
return 0;
else
if (*a > *b)
return 1;
else
return -1;
}

int main(int argc, char *argv[])
{
int *a, n, i, errors = 0;

n = atoi(argv[1]);

if ((a = malloc(sizeof(int) * n)) == NULL) {
perror("malloc");
exit(1);
}
srand(1);
for (i = 0; i < n; i++)
a[i] = rand();

qsort(a, n, sizeof(int), cmp);

for (i = 0; i < n - 1; i++)
if (a[i] > a[i + 1])
errors++;
printf("%d errors.\n", errors);
free(a);
return 0;
}
----------------- qs.c -----------------



--
Lorenzo


2001-10-01 19:24:12

by Rik van Riel

[permalink] [raw]
Subject: Re: VM: 2.4.10 vs. 2.4.10-ac2 and qsort()

On Mon, 1 Oct 2001, Lorenzo Allegrucci wrote:

> Disclaimer:
> I don't know if this "benchmark" is meaningful or not, but anyhow..

I'm not sure either, since qsort doesn't really have much
locality of reference but just walks all over the place.

This is direct contrast with the basic assumption on which
VM and CPU caches are built ;)

I wonder how eg. merge sort would perform ...

> Below are linux-2.4.10 results
> real 4m54.728s
>
> kswapd CPU time: 3 seconds
> qs RSS always on 238-240M, very stable never below 235M.

> .. and 2.4.10-ac2 results
> real 6m2.139s
>
> kswapd CPU time: 20 seconds
> qs RSS never above 204M, average value 150M.

The RSS thing is just a side effect of how swap is allocated
and should have little or no influence on which pages are
kept in memory.

One thing which could make 2.4.10 faster for this single case
is the fact that it doesn't keep any page aging info, so IO
clustering won't be confused by the process accessing its
pages ;)

cheers,

Rik
--
IA64: a worthy successor to i860.

http://www.surriel.com/ http://distro.conectiva.com/

Send all your spam to [email protected] (spam digging piggy)

2001-10-01 19:37:52

by Alan

[permalink] [raw]
Subject: Re: VM: 2.4.10 vs. 2.4.10-ac2 and qsort()

> I'm not sure either, since qsort doesn't really have much
> locality of reference but just walks all over the place.

qsort can be made to perform reasonably well providing you try to cache
colour the objects you sort and try to use prefetches a bit.

> I wonder how eg. merge sort would perform ...

Generally better but thats seperate to the VM issues.

> One thing which could make 2.4.10 faster for this single case
> is the fact that it doesn't keep any page aging info, so IO
> clustering won't be confused by the process accessing its
> pages ;)

I don't think that is too unusual a case. If the smarter vm is making poorer
I/O clustering decisions it wants investigating

2001-10-01 19:45:12

by Rik van Riel

[permalink] [raw]
Subject: Re: VM: 2.4.10 vs. 2.4.10-ac2 and qsort()

On Mon, 1 Oct 2001, Alan Cox wrote:

> > I'm not sure either, since qsort doesn't really have much
> > locality of reference but just walks all over the place.
>
> qsort can be made to perform reasonably well providing you try to cache
> colour the objects you sort and try to use prefetches a bit.

That won't quite work when the qsort in question is 150%
the size of your RAM ;)

> > One thing which could make 2.4.10 faster for this single case
> > is the fact that it doesn't keep any page aging info, so IO
> > clustering won't be confused by the process accessing its
> > pages ;)
>
> I don't think that is too unusual a case. If the smarter vm is making
> poorer I/O clustering decisions it wants investigating

Absolutely, this is something we really want to know ...

I guess I'll play with Lorenzo's program a bit to see how
the system behaves and how it can be improved.

regards,

Rik
--
IA64: a worthy successor to i860.

http://www.surriel.com/ http://distro.conectiva.com/

Send all your spam to [email protected] (spam digging piggy)

2001-10-01 21:31:29

by Daniel Phillips

[permalink] [raw]
Subject: Re: VM: 2.4.10 vs. 2.4.10-ac2 and qsort()

On October 1, 2001 08:33 pm, Lorenzo Allegrucci wrote:
> Disclaimer:
> I don't know if this "benchmark" is meaningful or not, but anyhow..
>
> As workload I used qsort() on an array of 90000000 32 bit ints.
> (trivial code to the end of my email).
>
> The VSIZE of the resulting process is about 343Mb.
> Tested machine has 256Mb of RAM + 200Mb of swap.
> Same srand() in all tests.
>
> Below are linux-2.4.10 results
>
> run 1:
> real 4m54.728s
> user 2m47.910s
> sys 0m2.520s
>
> run 2:
> real 4m55.109s
> user 2m46.050s
> sys 0m2.530s
>
> kswapd CPU time: 3 seconds
> qs RSS always on 238-240M, very stable never below 235M.
>
> .. and 2.4.10-ac2 results
>
> run 1:
> real 6m2.139s
> user 2m44.390s
> sys 0m3.210s
>
> run 2:
> real 6m57.140s
> user 2m47.050s
> sys 0m3.560s
>
> kswapd CPU time: 20 seconds
> qs RSS never above 204M, average value 150M.
>
>
> Comments?

Nice test, it's simpl and fairly easy to analyze. It provides a nice mix of
nonlocalized and localized memory activity, in that order.

The first pass of the quicksort will make one pass over all of memory with a
working set within the array at any given time of just two pages,
corresponding to the pattern of exchanges. Somewhat more than the available
physical memory will be dirtied, causing some swapouts. A prescient mm would
page back in each of those swapped out pages exactly once later in the sort,
by choosing all the eviction victims from the set of pages that will not be
used in the next recursive semi-sort. I can't imagine any general-purpose
page replacement policy that would be this smart, so we should assume that a
good page replacement algorithm will evict about 50% more dirty pages than
the prescient algorithm.

There is some danger that any given page replacement algorithm might
accidently behave the same as the prescient algorithm. I can't think of a
good way to test for this other than by knowing how much physical memory is
actually available for the array and becoming suspicious if the mm does
better than expected according to the above analysis. Odds are, the mm won't
get lucky, but it could.

Aside from obvious bugs, one mm stategy can beat another by making more
memory available for caching the array data and by closely approaching 150%
of the number of swapouts that the prescient algorithm would make.

Andrea's mm apparently did better than Rik's, but you were too polite to
point that out ;-) Some additional statistics would help tell us why:
swapin/swapout counts, and run time with no swapping, which you would have to
obtain using a smaller data size and an artificial memory limit. It would
also be nice to know the maximum array size that can be sorted with no or
minimal swapping, which tells us how much cache the mm is able to make
available for program data. Ideally, that would amount to nearly all of
physical memory since very little else is going on.

--
Daniel

2001-10-01 21:50:04

by Lorenzo Allegrucci

[permalink] [raw]
Subject: Re: VM: 2.4.10 vs. 2.4.10-ac2 and qsort()

At 16.23 01/10/01 -0300, you wrote:
>On Mon, 1 Oct 2001, Lorenzo Allegrucci wrote:
>
>> Disclaimer:
>> I don't know if this "benchmark" is meaningful or not, but anyhow..
>
>I'm not sure either, since qsort doesn't really have much
>locality of reference but just walks all over the place.

Yes, it was exactly my goal :)

>This is direct contrast with the basic assumption on which
>VM and CPU caches are built ;)

Indeed, it put strain the VM by a pseudo random-sequential access pattern.

>I wonder how eg. merge sort would perform ...

It would perform better, but merge sort doesn't trash the system :)
I wanted to test the system in trashing conditions.
Just curious.


--
Lorenzo

2001-10-02 01:07:57

by Matthias Andree

[permalink] [raw]
Subject: Re: VM: 2.4.10 vs. 2.4.10-ac2 and qsort()

On Mon, 01 Oct 2001, Rik van Riel wrote:

> I'm not sure either, since qsort doesn't really have much
> locality of reference but just walks all over the place.
>
> This is direct contrast with the basic assumption on which
> VM and CPU caches are built ;)
>
> I wonder how eg. merge sort would perform ...

Just rip it off NetBSD and there you go. (FreeBSD's breaks on machines
like SPARC, NetBSD's does not.)

http://www.de.freebsd.org/cgi/cvsweb.cgi/basesrc/lib/libc/stdlib/merge.c?rev=1.10&cvsroot=netbsd

2001-10-05 00:02:42

by Alexei Podtelezhnikov

[permalink] [raw]
Subject: Re: VM: 2.4.10 vs. 2.4.10-ac2 and qsort()

Hi guys,

I've already expressed my concern about using srand(1) in private e-mails.
I think it's unscientific to use one particular random sequence. Since
no one checked if that matters, I changed srand(1) to srand(time(NULL))
and I'm posting my results. I don't do testing of Alan or Linus's kernels,
but use recent Red Hat kernel. I think I've shown that it does matter.

Six quick consecutive runs of modified qs on a small set of 8 million
integers (obviously no swap activity):

> time ./a.out 8000000
0 errors.
24.250u 0.310s 0:24.55 100.0% 0+0k 0+0io 116pf+0w
0 errors.
24.290u 0.260s 0:24.55 100.0% 0+0k 0+0io 116pf+0w
0 errors.
24.300u 0.260s 0:24.55 100.0% 0+0k 0+0io 116pf+0w
0 errors.
24.270u 0.300s 0:24.57 100.0% 0+0k 0+0io 116pf+0w
0 errors.
24.290u 0.270s 0:24.56 100.0% 0+0k 0+0io 116pf+0w
0 errors.
24.280u 0.280s 0:24.55 100.0% 0+0k 0+0io 116pf+0w

Apparently, no significant deviations in computing times.

Six runs of modified qs on a large set of 80 million integers (a lot of
swapping!)

> time ./a.out 80000000
0 errors.
261.580u 4.250s 11:09.21 39.7% 0+0k 0+0io 17379pf+0w
0 errors.
260.460u 3.660s 9:09.72 48.0% 0+0k 0+0io 13194pf+0w
0 errors.
260.620u 4.510s 10:39.80 41.4% 0+0k 0+0io 16714pf+0w
0 errors.
261.790u 4.150s 10:09.58 43.6% 0+0k 0+0io 16331pf+0w
0 errors.
260.400u 4.140s 9:23.46 46.9% 0+0k 0+0io 13722pf+0w
0 errors.
259.980u 3.940s 9:10.22 47.9% 0+0k 0+0io 14240pf+0w

mean = 9m57s; standard deviation = 50s.

Apparently, the random sequence does matter (to the Rik's algorithm at
least since it's in RH kernel).

I wonder how big the deviation is for official and AC trees.
Now Lorenzo's results seem inconclusive.

Regards,
Alexei

2001-10-05 02:02:55

by Rob Landley

[permalink] [raw]
Subject: Re: VM: 2.4.10 vs. 2.4.10-ac2 and qsort()

On Thursday 04 October 2001 20:03, Alexei Podtelezhnikov wrote:
> Hi guys,
>
> I've already expressed my concern about using srand(1) in private e-mails.
> I think it's unscientific to use one particular random sequence. Since
> no one checked if that matters, I changed srand(1) to srand(time(NULL))
> and I'm posting my results. I don't do testing of Alan or Linus's kernels,
> but use recent Red Hat kernel. I think I've shown that it does matter.

One advantage of srand(1) is that it has a chance of being repeatable. You
don't WANT a truly random sequence, you want a sequence that simulates a
reproducable work load. That's why it makes sense to initialize the random
number generator with a known seed, so we can do it again after applying a
patch and see what kind of difference it made.

If you vary the initialization of the random number generator, your work load
isn't reproducible. It's the same TYPE of load, but not the same actual load
from one run to the next. If we're talking a 2% improvement expected out of
a particular tweak, you want to reduce random variation in the test case as
much as possible that might swamp your change. (We get enough variation
already from scheduling and random interrupts flushing cache state and
such...)

There's nothing wrong with using different seed values for srand, but testing
different VMs with different seed values has the possibility of being apples
to oranges.

If you want to run your test in a loop for a while to invoke statistics to
counter the randomness of your tests, you could do that too. But that's a
different kind of test. And one that's not nearly as convenient to run from
tweak to tweak...

Rob

2001-10-05 03:54:51

by Alexei Podtelezhnikov

[permalink] [raw]
Subject: Re: VM: 2.4.10 vs. 2.4.10-ac2 and qsort()

On Thu, 4 Oct 2001, Rob Landley wrote:

> On Thursday 04 October 2001 20:03, Alexei Podtelezhnikov wrote:
> > Hi guys,
> >
> > I've already expressed my concern about using srand(1) in private e-mails.
> > I think it's unscientific to use one particular random sequence. Since
> > no one checked if that matters, I changed srand(1) to srand(time(NULL))
> > and I'm posting my results. I don't do testing of Alan or Linus's kernels,
> > but use recent Red Hat kernel. I think I've shown that it does matter.
>
> One advantage of srand(1) is that it has a chance of being repeatable. You
> don't WANT a truly random sequence, you want a sequence that simulates a
> reproducable work load. That's why it makes sense to initialize the random
> number generator with a known seed, so we can do it again after applying a
> patch and see what kind of difference it made.

I agree with this. All I was trying to show is that srand(1) makes it ONE
VERY PARTICULAR LOAD. Optimizing for this load doesn't make sense at all
because it is so particular. Optimizing for lower average make more sense.
That is what you would call TYPE of load. And the average is reproducible
statistically.

> There's nothing wrong with using different seed values for srand, but testing
> different VMs with different seed values has the possibility of being apples
> to oranges.
>

Again, the averages and deviations are reproducible for each particular VM
algorithm, and provide a good comparison on this type of load. They are
the fruits.

Speaking of deviations, I was very surprised to see such a significant
deviation. I think we want it to be smaller. That is what we would call
predictability of a VM scheme. We need this quality, e.g. for predicting
the imminent OOM when possible. It would be interesting to compare these
too for Rik's and Andrea's schemes.

Lastly, I think this sort of simple tests can be adjusted to mimic some
certain interesting loads by cooking non-random sequences. This is
different from srand(1), because those sequences can make sense. That's
were the test becomes deterministic rather than statistical, and qsort()
becomes irrelevant too.

Alexei

2001-10-05 15:34:17

by Lorenzo Allegrucci

[permalink] [raw]
Subject: Re: VM: 2.4.10 vs. 2.4.10-ac2 and qsort()

At 17.03 04/10/01 -0700, Alexei Podtelezhnikov wrote:
>Hi guys,
>
>I've already expressed my concern about using srand(1) in private e-mails.
>I think it's unscientific to use one particular random sequence. Since
>no one checked if that matters, I changed srand(1) to srand(time(NULL))
>and I'm posting my results. I don't do testing of Alan or Linus's kernels,
>but use recent Red Hat kernel. I think I've shown that it does matter.
>
>Six quick consecutive runs of modified qs on a small set of 8 million
>integers (obviously no swap activity):
>
>> time ./a.out 8000000
>0 errors.
>24.250u 0.310s 0:24.55 100.0% 0+0k 0+0io 116pf+0w
>0 errors.
>24.290u 0.260s 0:24.55 100.0% 0+0k 0+0io 116pf+0w
>0 errors.
>24.300u 0.260s 0:24.55 100.0% 0+0k 0+0io 116pf+0w
>0 errors.
>24.270u 0.300s 0:24.57 100.0% 0+0k 0+0io 116pf+0w
>0 errors.
>24.290u 0.270s 0:24.56 100.0% 0+0k 0+0io 116pf+0w
>0 errors.
>24.280u 0.280s 0:24.55 100.0% 0+0k 0+0io 116pf+0w
>
>Apparently, no significant deviations in computing times.
>
>Six runs of modified qs on a large set of 80 million integers (a lot of
>swapping!)
>
>> time ./a.out 80000000
>0 errors.
>261.580u 4.250s 11:09.21 39.7% 0+0k 0+0io 17379pf+0w
>0 errors.
>260.460u 3.660s 9:09.72 48.0% 0+0k 0+0io 13194pf+0w
>0 errors.
>260.620u 4.510s 10:39.80 41.4% 0+0k 0+0io 16714pf+0w
>0 errors.
>261.790u 4.150s 10:09.58 43.6% 0+0k 0+0io 16331pf+0w
>0 errors.
>260.400u 4.140s 9:23.46 46.9% 0+0k 0+0io 13722pf+0w
>0 errors.
>259.980u 3.940s 9:10.22 47.9% 0+0k 0+0io 14240pf+0w
>
>mean = 9m57s; standard deviation = 50s.
>
>Apparently, the random sequence does matter (to the Rik's algorithm at
>least since it's in RH kernel).
>
>I wonder how big the deviation is for official and AC trees.
>Now Lorenzo's results seem inconclusive.

Yours too, as you have not compared two or more kernels yet.
You have just proved that the random sequence does matter on that
particular kernel.



--
Lorenzo