http://www.ozlabs.org/people/dgibson/maptest.tar.gz
has a small set of test programs which perform a naive matrix multiply
in memory obtained by several different methods: one is
MAP_PRIVATE|MAP_ANONYMOUS, one is MAP_SHARED|MAP_ANONYMOUS and the
third attempts to map from hugetlbfs.
On a number of machines I've tested - both ppc64 and x86 - the SHARED
version is consistently and significantly (50-100%) slower than the
PRIVATE version. Increasing the matrix size does not appear to make
the situation significantly better. The routine that does the actual
multiply is identical (same .o) in each case; only the wrapper that
allocates the memory differs.
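For the record, the three wrappers presumably boil down to something
like the sketch below (the hugetlbfs mount point, file name and error
handling are illustrative assumptions, not copied from the tarball):

#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Sketch only: the real maptest wrappers may differ in detail. */
static void *alloc_private(size_t len)
{
        void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        return (p == MAP_FAILED) ? NULL : p;
}

static void *alloc_shared(size_t len)
{
        void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_SHARED | MAP_ANONYMOUS, -1, 0);
        return (p == MAP_FAILED) ? NULL : p;
}

static void *alloc_hugetlb(size_t len)
{
        /* assumes hugetlbfs is mounted at /mnt/huge */
        int fd = open("/mnt/huge/maptest", O_CREAT | O_RDWR, 0600);
        void *p;

        if (fd < 0)
                return NULL;
        p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        close(fd);
        return (p == MAP_FAILED) ? NULL : p;
}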
I am at a complete loss to explain this behaviour, and I'm sure it
didn't use to happen (unfortunately I can't remember what kernel
version we were on at the time). oprofile appears to show that
essentially all the time is spent in userspace in both cases. Can
anyone explain what's going on?
I've also seen anomalies with the hugepage version, but they seem to
be less consistent.
--
David Gibson | For every complex problem there is a
david AT gibson.dropbear.id.au | solution which is simple, neat and
| wrong.
http://www.ozlabs.org/people/dgibson
On Wed, Oct 27, 2004 at 12:23:00AM -0700, James Cloos wrote:
> >>>>> "David" == David Gibson <[email protected]> writes:
>
> David> http://www.ozlabs.org/people/dgibson/maptest.tar.gz
>
> David> On a number of machines I've tested - both ppc64 and x86 - the
> David> SHARED version is consistently and significantly (50-100%)
> David> slower than the PRIVATE version.
>
> Just gave it a test on my laptop and server. Both are p3. The
> laptop is under heavier mem pressure; the server has just under
> a gig with most free/cache/buff. Laptop is still running 2.6.7
> whereas the server is bk as of 2004-10-24.
>
> Both took about 11 seconds for the private and around 30 seconds
> for the shared tests.
>
> So if this is a regression, it predates v2.6.7.
Actually, I think I've figured this one out now, and I think it may
have been a very subtle change in my test case.
The difference between MAP_SHARED and MAP_PRIVATE is that when a page
is touched for any reason on MAP_SHARED, a new real page is allocated,
whereas a MAP_PRIVATE page that is only ever read is simply mapped to
the shared zero page. My test wasn't initializing the matrices, just
multiplying whatever was in memory, so it never write-touched the
input matrices.
With the entire input matrices backed by the single zero page, cache
performance, oddly enough, would have been rather better...
<sticks head in bucket>
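A quick way to see the asymmetry outside of maptest is to read-touch a
private and a shared anonymous mapping and watch the resident set. A
standalone sketch (assumes 4 KB pages and a kernel that backs anonymous
read faults with the zero page; the region size is arbitrary):

#include <stdio.h>
#include <sys/mman.h>

#define LEN (64UL * 1024 * 1024)

static long resident_pages(void)
{
        long size = 0, resident = 0;
        FILE *f = fopen("/proc/self/statm", "r");

        if (f) {
                fscanf(f, "%ld %ld", &size, &resident);
                fclose(f);
        }
        return resident;
}

static void report(const char *name, int flags)
{
        volatile char *p = mmap(NULL, LEN, PROT_READ | PROT_WRITE,
                                flags | MAP_ANONYMOUS, -1, 0);
        long before;
        unsigned long i;
        char sum = 0;

        if (p == MAP_FAILED)
                return;
        before = resident_pages();
        for (i = 0; i < LEN; i += 4096)
                sum += p[i];            /* read-only touch, never a write */
        printf("%s: RSS grew by %ld pages (sum %d)\n",
               name, resident_pages() - before, sum);
        munmap((void *)p, LEN);
}

int main(void)
{
        report("private", MAP_PRIVATE);   /* all reads hit the zero page */
        report("shared ", MAP_SHARED);    /* every read allocates a page */
        return 0;
}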
--
David Gibson | For every complex problem there is a
david AT gibson.dropbear.id.au | solution which is simple, neat and
| wrong.
http://www.ozlabs.org/people/dgibson
>>>>> "Andrew" == Andrew Morton <[email protected]> writes:
JimC> Both took about 11 seconds for the private and around 30 seconds
JimC> for the shared tests.
Andrew> I get the exact opposite, on a P4:
Interesting. I gave it a try on a couple of my UMLs. One is on a P4
(possibly a Xeon; not sure) and the other is on an Athlon. The P4 ran
shared about twice as fast as private, and the Athlon ran shared about
50% faster. (The P4 uses UML kernel 2.4.27; the Athlon 2.6.6; no idea
what the hosts run.)
-JimC
James Cloos <[email protected]> wrote:
>
> >>>>> "David" == David Gibson <[email protected]> writes:
>
> David> http://www.ozlabs.org/people/dgibson/maptest.tar.gz
>
> David> On a number of machines I've tested - both ppc64 and x86 - the
> David> SHARED version is consistently and significantly (50-100%)
> David> slower than the PRIVATE version.
>
> Just gave it a test on my laptop and server. Both are p3. The
> laptop is under heavier mem pressure; the server has just under
> a gig with most free/cache/buff. Laptop is still running 2.6.7
> whereas the server is bk as of 2004-10-24.
>
> > Both took about 11 seconds for the private and around 30 seconds
> for the shared tests.
>
I get the exact opposite, on a P4:
vmm:/home/akpm/maptest> time ./mm-sharemmap
./mm-sharemmap 10.81s user 0.05s system 100% cpu 10.855 total
vmm:/home/akpm/maptest> time ./mm-sharemmap
./mm-sharemmap 11.04s user 0.05s system 100% cpu 11.086 total
vmm:/home/akpm/maptest> time ./mm-privmmap
./mm-privmmap 26.91s user 0.02s system 100% cpu 26.903 total
vmm:/home/akpm/maptest> time ./mm-privmmap
./mm-privmmap 26.89s user 0.02s system 100% cpu 26.894 total
vmm:/home/akpm/maptest> uname -a
Linux vmm 2.6.10-rc1-mm1 #14 SMP Tue Oct 26 23:23:23 PDT 2004 i686 i686 i386 GNU/Linux
It's all user time so I can think of no reason apart from physical page
allocation order causing additional TLB reloads in one case. One is using
anonymous pages and the other is using shmem-backed pages, although I can't
think why that would make a difference.
Let's back out the no-buddy-bitmap patches:
vmm:/home/akpm/maptest> time ./mm-sharemmap
./mm-sharemmap 12.01s user 0.06s system 99% cpu 12.087 total
vmm:/home/akpm/maptest> time ./mm-sharemmap
./mm-sharemmap 12.56s user 0.05s system 100% cpu 12.607 total
vmm:/home/akpm/maptest> time ./mm-privmmap
./mm-privmmap 26.74s user 0.03s system 99% cpu 26.776 total
vmm:/home/akpm/maptest> time ./mm-privmmap
./mm-privmmap 26.66s user 0.02s system 100% cpu 26.674 total
much the same.
Backing out "[PATCH] tweak the buddy allocator for better I/O merging" from
June 24 makes no difference.
>>>>> "David" == David Gibson <[email protected]> writes:
David> http://www.ozlabs.org/people/dgibson/maptest.tar.gz
David> On a number of machines I've tested - both ppc64 and x86 - the
David> SHARED version is consistently and significantly (50-100%)
David> slower than the PRIVATE version.
Just gave it a test on my laptop and server. Both are p3. The
laptop is under heavier mem pressure; the server has just under
a gig with most free/cache/buff. Laptop is still running 2.6.7
whereas the server is bk as of 2004-10-24.
Both took about 11 seconds for the private and around 30 seconds
for the shared tests.
So if this is a regression, it predates v2.6.7.
-JimC
--
James H. Cloos, Jr. <[email protected]> <http://jhcloos.com>
On Wed, Oct 27, 2004 at 01:06:59AM -0700, Andrew Morton wrote:
> James Cloos <[email protected]> wrote:
> >
> > >>>>> "David" == David Gibson <[email protected]> writes:
> >
> > David> http://www.ozlabs.org/people/dgibson/maptest.tar.gz
> >
> > David> On a number of machines I've tested - both ppc64 and x86 - the
> > David> SHARED version is consistently and significantly (50-100%)
> > David> slower than the PRIVATE version.
> >
> > Just gave it a test on my laptop and server. Both are p3. The
> > laptop is under heavier mem pressure; the server has just under
> > a gig with most free/cache/buff. Laptop is still running 2.6.7
> > whereas the server is bk as of 2004-10-24.
> >
> > Both took about 11 seconds for the private and around 30 seconds
> > for the shared tests.
> >
>
> I get the exact opposite, on a P4:
>
> vmm:/home/akpm/maptest> time ./mm-sharemmap
> ./mm-sharemmap 10.81s user 0.05s system 100% cpu 10.855 total
> vmm:/home/akpm/maptest> time ./mm-sharemmap
> ./mm-sharemmap 11.04s user 0.05s system 100% cpu 11.086 total
> vmm:/home/akpm/maptest> time ./mm-privmmap
> ./mm-privmmap 26.91s user 0.02s system 100% cpu 26.903 total
> vmm:/home/akpm/maptest> time ./mm-privmmap
> ./mm-privmmap 26.89s user 0.02s system 100% cpu 26.894 total
> vmm:/home/akpm/maptest> uname -a
> Linux vmm 2.6.10-rc1-mm1 #14 SMP Tue Oct 26 23:23:23 PDT 2004 i686 i686 i386 GNU/Linux
How very odd. I've now understood what was happening (see other
post), but I'm not sure what could reverse the situation. Can you
download the test tarball again? I've put up an updated version which
pretouches the pages and gives some extra info. Running it both with
and without pretouch would be interesting (toggle the #if 0/1 in
matmul.h).
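For reference, the pretouch is just a write pass over the inputs before
the timed multiply. A minimal sketch (the function name and the idea of
passing the matrix as a flat array are illustrative, not necessarily
what matmul.h actually does):

#include <stddef.h>

/* Sketch of a pretouch pass; names are illustrative. Writing every
 * element forces a real page behind each PTE, so the PRIVATE and
 * SHARED versions start the timed multiply with the same footprint. */
static void pretouch(double *m, size_t n)
{
        size_t i;

        for (i = 0; i < n * n; i++)
                m[i] = 0.0;     /* write fault => real, writable page */
}

Called on each input matrix right after allocation, before timing starts.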
> It's all user time so I can think of no reason apart from physical page
> allocation order causing additional TLB reloads in one case. One is using
> anonymous pages and the other is using shmem-backed pages, although I can't
> think why that would make a difference.
>
>
> Let's back out the no-buddy-bitmap patches:
>
> vmm:/home/akpm/maptest> time ./mm-sharemmap
> ./mm-sharemmap 12.01s user 0.06s system 99% cpu 12.087 total
> vmm:/home/akpm/maptest> time ./mm-sharemmap
> ./mm-sharemmap 12.56s user 0.05s system 100% cpu 12.607 total
> vmm:/home/akpm/maptest> time ./mm-privmmap
> ./mm-privmmap 26.74s user 0.03s system 99% cpu 26.776 total
> vmm:/home/akpm/maptest> time ./mm-privmmap
> ./mm-privmmap 26.66s user 0.02s system 100% cpu 26.674 total
>
> much the same.
>
> Backing out "[PATCH] tweak the buddy allocator for better I/O merging" from
> June 24 makes no difference.
>
--
David Gibson | For every complex problem there is a
david AT gibson.dropbear.id.au | solution which is simple, neat and
| wrong.
http://www.ozlabs.org/people/dgibson
Andrew Morton wrote:
> James Cloos <[email protected]> wrote:
>
>>>>>>>"David" == David Gibson <[email protected]> writes:
>>
>>David> http://www.ozlabs.org/people/dgibson/maptest.tar.gz
>>
>>David> On a number of machines I've tested - both ppc64 and x86 - the
>>David> SHARED version is consistently and significantly (50-100%)
>>David> slower than the PRIVATE version.
>>
>>Just gave it a test on my laptop and server. Both are p3. The
>>laptop is under heavier mem pressure; the server has just under
>>a gig with most free/cache/buff. Laptop is still running 2.6.7
>>whereas the server is bk as of 2004-10-24.
>>
>>Both took about 11 seconds for the private and around 30 seconds
>>for the shared tests.
>>
>
>
> I get the exact opposite, on a P4:
>
> vmm:/home/akpm/maptest> time ./mm-sharemmap
> ./mm-sharemmap 10.81s user 0.05s system 100% cpu 10.855 total
> vmm:/home/akpm/maptest> time ./mm-sharemmap
> ./mm-sharemmap 11.04s user 0.05s system 100% cpu 11.086 total
> vmm:/home/akpm/maptest> time ./mm-privmmap
> ./mm-privmmap 26.91s user 0.02s system 100% cpu 26.903 total
> vmm:/home/akpm/maptest> time ./mm-privmmap
> ./mm-privmmap 26.89s user 0.02s system 100% cpu 26.894 total
> vmm:/home/akpm/maptest> uname -a
> Linux vmm 2.6.10-rc1-mm1 #14 SMP Tue Oct 26 23:23:23 PDT 2004 i686 i686 i386 GNU/Linux
>
> It's all user time so I can think of no reason apart from physical page
> allocation order causing additional TLB reloads in one case. One is using
> anonymous pages and the other is using shmem-backed pages, although I can't
> think why that would make a difference.
I think the cause was covered in another post; I'm surprised that the
page overhead is reported as user time. It would have been a good hint
if the big jump were in system time.
Yes, I know some kernel time is charged to the user; I'm just not sure
diddling the page tables should be, since it might mask the effect of
VM changes, etc.
That's a comment, not a suggestion.
--
-bill davidsen ([email protected])
"The secret to procrastination is to put things off until the
last possible moment - but no longer" -me
On Wed, Oct 27, 2004 at 04:54:42PM -0400, Bill Davidsen wrote:
> Andrew Morton wrote:
> >James Cloos <[email protected]> wrote:
> >
> >>>>>>>"David" == David Gibson <[email protected]> writes:
> >>
> >>David> http://www.ozlabs.org/people/dgibson/maptest.tar.gz
> >>
> >>David> On a number of machines I've tested - both ppc64 and x86 - the
> >>David> SHARED version is consistently and significantly (50-100%)
> >>David> slower than the PRIVATE version.
> >>
> >>Just gave it a test on my laptop and server. Both are p3. The
> >>laptop is under heavier mem pressure; the server has just under
> >>a gig with most free/cache/buff. Laptop is still running 2.6.7
> >>whereas the server is bk as of 2004-10-24.
> >>
> >>Both took about 11 seconds for the private and around 30 seconds
> >>for the shared tests.
> >>
> >
> >
> >I get the exact opposite, on a P4:
> >
> >vmm:/home/akpm/maptest> time ./mm-sharemmap
> >./mm-sharemmap 10.81s user 0.05s system 100% cpu 10.855 total
> >vmm:/home/akpm/maptest> time ./mm-sharemmap
> >./mm-sharemmap 11.04s user 0.05s system 100% cpu 11.086 total
> >vmm:/home/akpm/maptest> time ./mm-privmmap
> >./mm-privmmap 26.91s user 0.02s system 100% cpu 26.903 total
> >vmm:/home/akpm/maptest> time ./mm-privmmap
> >./mm-privmmap 26.89s user 0.02s system 100% cpu 26.894 total
> >vmm:/home/akpm/maptest> uname -a
> >Linux vmm 2.6.10-rc1-mm1 #14 SMP Tue Oct 26 23:23:23 PDT 2004 i686 i686
> >i386 GNU/Linux
> >
> >It's all user time so I can think of no reason apart from physical page
> >allocation order causing additional TLB reloads in one case. One is using
> >anonymous pages and the other is using shmem-backed pages, although I can't
> >think why that would make a difference.
>
> I think the cause was covered in another post; I'm surprised that the
> page overhead is reported as user time. It would have been a good hint
> if the big jump were in system time.
The cause isn't page overhead. The problem is that the SHARED version
actually uses a whole lot more real memory, so cache performance is
much worse. So the time really is in userland.
--
David Gibson | For every complex problem there is a
david AT gibson.dropbear.id.au | solution which is simple, neat and
| wrong.
http://www.ozlabs.org/people/dgibson
On Wed, Oct 27, 2004 at 05:59:44PM +1000, David Gibson wrote:
> With the entire input matrices all copies of the zero page, cache
> performance, oddly enough, would have been rather better...
I agree with your analysis. Just for my own fun playing with
profiling, I ran the two tests on a 1.5GHz Itanium 2 (I compiled with
the Intel compiler lest you all laugh at how slow the "Itanic" is at
multiplying matrices).
ianw@baci:/usr/src/tmp/maptest$ /usr/bin/time ./mm-sharemmap
9.58user 0.02system 0:09.61elapsed 100%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+1540minor)pagefaults 0swaps
ianw@baci:/usr/src/tmp/maptest$ /usr/bin/time ./mm-sharemmap
9.47user 0.02system 0:09.50elapsed 100%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+1540minor)pagefaults 0swaps
ianw@baci:/usr/src/tmp/maptest$ /usr/bin/time ./mm-privmmap
8.63user 0.00system 0:08.63elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+1541minor)pagefaults 0swaps
ianw@baci:/usr/src/tmp/maptest$ /usr/bin/time ./mm-privmmap
8.63user 0.00system 0:08.63elapsed 99%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+1541minor)pagefaults 0swaps
Both are close, certainly nothing like some of the other reports. But
as with all benchmarking, the devil is in the details; watching the
cache misses:
ianw@baci:/usr/src/tmp/maptest$ pfmon --events=L3_MISSES ./mm-privmmap
112678 L3_MISSES
ianw@baci:/usr/src/tmp/maptest$ pfmon --events=L3_MISSES ./mm-sharemmap
68600586 L3_MISSES
So it's no wonder shared mmap takes a little longer. And indeed,
modifying your program to touch the memory in the privmmap call brings
the two into line.
Also, looking at the kernel profiling via q-syscollect, the only
significant difference is that the private mapping spends about 19% of
its time in clear_page, whilst shared spends around 29% of its time in
clear_page. All things being equal, you can thus expect the shared case
to run ~10% slower, which is pretty close to what you actually see.
-i
[email protected]
http://www.gelato.unsw.edu.au
On Wed, 27 Oct 2004, Andrew Morton wrote:
> I get the exact opposite, on a P4:
>
> vmm:/home/akpm/maptest> time ./mm-sharemmap
> ./mm-sharemmap 10.81s user 0.05s system 100% cpu 10.855 total
> vmm:/home/akpm/maptest> time ./mm-sharemmap
> ./mm-sharemmap 11.04s user 0.05s system 100% cpu 11.086 total
> vmm:/home/akpm/maptest> time ./mm-privmmap
> ./mm-privmmap 26.91s user 0.02s system 100% cpu 26.903 total
> vmm:/home/akpm/maptest> time ./mm-privmmap
> ./mm-privmmap 26.89s user 0.02s system 100% cpu 26.894 total
> vmm:/home/akpm/maptest> uname -a
> Linux vmm 2.6.10-rc1-mm1 #14 SMP Tue Oct 26 23:23:23 PDT 2004 i686 i686 i386 GNU/Linux
>
> It's all user time so I can think of no reason apart from physical page
> allocation order causing additional TLB reloads in one case. One is using
> anonymous pages and the other is using shmem-backed pages, although I can't
> think why that would make a difference.
you're experiencing the wonder of the L1 data cache on the P4 ... based
on its behaviour i'm pretty sure that early in the pipeline they use the
virtual address to match a virtual tag and proceed with that data as if
it's correct. not until the TLB lookup and physical tag check many
cycles later does it realise that it's done something wrong and
kill/flush the pipeline.
when you set up a virtual alias, like you have with the shared zero page,
it becomes very confused.
in fact if you do something as simple as a 4-element pointer chase
where the cache lines for elements 0 and 2 alias, and the cache lines
for 1 and 3 alias, then you can watch some P4s take up to 3000 cycles
per reference.
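something along these lines, for example (a sketch; the 64KB stride is
an assumption based on the usual P4 aliasing distance, the iteration
count is arbitrary, and i can't promise this exact layout reproduces the
3000-cycle case):

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define STRIDE (64 * 1024)      /* assumed aliasing distance on the P4 */
#define ITERS  100000000L

int main(void)
{
        char *buf = malloc(4 * STRIDE + 64);
        void **elt[4], **p;
        clock_t t0, t1;
        long i;

        if (!buf)
                return 1;

        /* elements 0/2 land on aliasing lines, as do 1/3; consecutive
         * hops in the chase never alias with each other */
        elt[0] = (void **)(buf + 0 * STRIDE);
        elt[1] = (void **)(buf + 1 * STRIDE + 64);
        elt[2] = (void **)(buf + 2 * STRIDE);
        elt[3] = (void **)(buf + 3 * STRIDE + 64);

        *elt[0] = elt[1];
        *elt[1] = elt[2];
        *elt[2] = elt[3];
        *elt[3] = elt[0];

        p = elt[0];
        t0 = clock();
        for (i = 0; i < ITERS; i++)
                p = *p;                 /* dependent load chain */
        t1 = clock();

        printf("%.2f ns/ref (end %p)\n",
               (double)(t1 - t0) * 1e9 / CLOCKS_PER_SEC / ITERS, (void *)p);
        free(buf);
        return 0;
}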
-dean