2002-09-07 12:14:20

by Paolo Ciarrocchi

Subject: LMbench2.0 results

Hi all,
I've just run lmbench2.0 on my laptop.
Here are the results (again, 2.5.33 seems to be "slow", I don't know why...)

cd results && make summary percent 2>/dev/null | more
make[1]: Entering directory `/usr/src/LMbench/results'

L M B E N C H 2 . 0 S U M M A R Y
------------------------------------


Basic system parameters
----------------------------------------------------
Host OS Description Mhz

--------- ------------- ----------------------- ----
frodo Linux 2.4.18 i686-pc-linux-gnu 797
frodo Linux 2.4.19 i686-pc-linux-gnu 797
frodo Linux 2.5.33 i686-pc-linux-gnu 797

Processor, Processes - times in microseconds - smaller is better
----------------------------------------------------------------
Host OS Mhz null null open selct sig sig fork exec sh
call I/O stat clos TCP inst hndl proc proc proc
--------- ------------- ---- ---- ---- ---- ---- ----- ---- ---- ---- ---- ----
frodo Linux 2.4.18 797 0.40 0.56 3.18 3.97 1.00 3.18 115. 1231 13.K
frodo Linux 2.4.19 797 0.40 0.56 3.07 3.88 1.00 3.19 129. 1113 13.K
frodo Linux 2.5.33 797 0.40 0.61 3.78 4.76 1.02 3.37 201. 1458 13.K

Context switching - times in microseconds - smaller is better
-------------------------------------------------------------
Host OS 2p/0K 2p/16K 2p/64K 8p/16K 8p/64K 16p/16K 16p/64K
ctxsw ctxsw ctxsw ctxsw ctxsw ctxsw ctxsw
--------- ------------- ----- ------ ------ ------ ------ ------- -------
frodo Linux 2.4.18 0.990 4.4200 13.8 6.2700 309.8 58.6 310.5
frodo Linux 2.4.19 0.900 4.2900 15.3 5.9100 309.6 57.7 309.9
frodo Linux 2.5.33 1.620 5.2800 15.3 9.3500 312.7 54.9 312.7

*Local* Communication latencies in microseconds - smaller is better
-------------------------------------------------------------------
Host OS 2p/0K Pipe AF UDP RPC/ TCP RPC/ TCP
ctxsw UNIX UDP TCP conn
--------- ------------- ----- ----- ---- ----- ----- ----- ----- ----
frodo Linux 2.4.18 0.990 4.437 8.66
frodo Linux 2.4.19 0.900 4.561 7.76
frodo Linux 2.5.33 1.620 6.497 9.11

File & VM system latencies in microseconds - smaller is better
--------------------------------------------------------------
Host OS 0K File 10K File Mmap Prot Page
Create Delete Create Delete Latency Fault Fault
--------- ------------- ------ ------ ------ ------ ------- ----- -----
frodo Linux 2.4.18 68.9 16.0 185.8 31.6 425.0 0.789 2.00000
frodo Linux 2.4.19 68.9 14.9 186.5 29.8 416.0 0.798 2.00000
frodo Linux 2.5.33 77.8 19.1 211.6 38.3 774.0 0.832 3.00000

*Local* Communication bandwidths in MB/s - bigger is better
-----------------------------------------------------------
Host OS Pipe AF TCP File Mmap Bcopy Bcopy Mem Mem
UNIX reread reread (libc) (hand) read write
--------- ------------- ---- ---- ---- ------ ------ ------ ------ ---- -----
frodo Linux 2.4.18 810. 650. 181.7 203.7 101.5 101.4 203. 195.3
frodo Linux 2.4.19 808. 680. 187.2 203.8 101.5 101.4 203. 190.1
frodo Linux 2.5.33 571. 636. 185.6 202.5 100.5 100.4 202. 190.3

Memory latencies in nanoseconds - smaller is better
(WARNING - may not be correct, check graphs)
---------------------------------------------------
Host OS Mhz L1 $ L2 $ Main mem Guesses
--------- ------------- ---- ----- ------ -------- -------
frodo Linux 2.4.18 797 3.767 8.7890 158.9
frodo Linux 2.4.19 797 3.767 8.7980 158.9
frodo Linux 2.5.33 797 3.798 8.8660 160.1
make[1]: Leaving directory `/usr/src/LMbench/results'

Comments?

Let me know if you need further information (.config, info about my hardware) or if you want me to run other tests.

Ciao,
Paolo
--
Get your free email from http://www.linuxmail.org


Powered by Outblaze


2002-09-07 12:23:48

by Jeff Garzik

Subject: Re: LMbench2.0 results

Paolo Ciarrocchi wrote:
> Comments?

Yeah: "ouch" because I don't see a single category that's faster.

Oh well, it still needs to be tuned....


2002-09-07 12:38:32

by Paolo Ciarrocchi

Subject: Re: LMbench2.0 results

From: Jeff Garzik <[email protected]>

> Paolo Ciarrocchi wrote:
> > Comments?
>
> Yeah: "ouch" because I don't see a single category that's faster.
Indeed!!

> Oh well, it still needs to be tuned....
Yes, but it seems to me really strange...

Ciao,
Paolo
--
Get your free email from http://www.linuxmail.org


Powered by Outblaze

2002-09-07 14:05:02

by Shane Shrybman

Subject: Re: LMbench2.0 results

Hi,

Is it possible that there is still some debugging stuff turned on in
2.5.33?

Shane




2002-09-07 14:29:18

by James Morris

Subject: Re: LMbench2.0 results

On Sat, 7 Sep 2002, Paolo Ciarrocchi wrote:

> Let me know if you need further information (.config, info about my
> hardware) or if you want me to run other tests.

Would you be able to run the tests for 2.5.31? I'm looking into a
slowdown in 2.5.32/33 which may be related. Some hardware info might be
useful too.


- James
--
James Morris
<[email protected]>


2002-09-07 16:02:17

by Andrew Morton

Subject: Re: LMbench2.0 results

Paolo Ciarrocchi wrote:
>
> Hi all,
> I've just run lmbench2.0 on my laptop.
> Here are the results (again, 2.5.33 seems to be "slow", I don't know why...)
>

The fork/exec/mmap slowdown is the rmap overhead. I have some stuff
which partially improves it.

The many-small-file-create slowdown is known but its cause is not.
I need to get oprofile onto it.

2002-09-07 18:02:44

by Paolo Ciarrocchi

Subject: Re: LMbench2.0 results

From: James Morris <[email protected]>

> On Sat, 7 Sep 2002, Paolo Ciarrocchi wrote:
>
> > Let me know if you need further information (.config, info about my
> > hardware) or if you want me to run other tests.
>
> Would you be able to run the tests for 2.5.31? I'm looking into a
> slowdown in 2.5.32/33 which may be related. Some hardware info might be
> useful too.
I don't have the 2.5.31, and now I've only a slow
internet connection... I'll try to download it on Monday.

The hw is a Laptop, a standard HP Omnibook 6000, 256 MiB of RAM, PIII@800.
Do you need more information?

Ciao,
Paolo
--
Get your free email from http://www.linuxmail.org


Powered by Outblaze

2002-09-07 18:05:03

by Paolo Ciarrocchi

Subject: Re: LMbench2.0 results

From: Andrew Morton <[email protected]>

> Paolo Ciarrocchi wrote:
> >
> > Hi all,
> > I've just run lmbench2.0 on my laptop.
> > Here are the results (again, 2.5.33 seems to be "slow", I don't know why...)
> >
>
> The fork/exec/mmap slowdown is the rmap overhead. I have some stuff
> which partially improves it.
>
> The many-small-file-create slowdown is known but its cause is not.
> I need to get oprofile onto it.
Let me know if I can do something useful for you.
Now I've compiled 2.5.33 with _NO_ preemption (the x-tagged kernel).
Performance is better, but it's still "slow".

cd results && make summary percent 2>/dev/null | more
make[1]: Entering directory `/usr/src/LMbench/results'

L M B E N C H 2 . 0 S U M M A R Y
------------------------------------


Basic system parameters
----------------------------------------------------
Host OS Description Mhz

--------- ------------- ----------------------- ----
frodo Linux 2.4.18 i686-pc-linux-gnu 797
frodo Linux 2.4.19 i686-pc-linux-gnu 797
frodo Linux 2.5.33 i686-pc-linux-gnu 797
frodo Linux 2.5.33x i686-pc-linux-gnu 797

Processor, Processes - times in microseconds - smaller is better
----------------------------------------------------------------
Host OS Mhz null null open selct sig sig fork exec sh
call I/O stat clos TCP inst hndl proc proc proc
--------- ------------- ---- ---- ---- ---- ---- ----- ---- ---- ---- ---- ----
frodo Linux 2.4.18 797 0.40 0.56 3.18 3.97 1.00 3.18 115. 1231 13.K
frodo Linux 2.4.19 797 0.40 0.56 3.07 3.88 1.00 3.19 129. 1113 13.K
frodo Linux 2.5.33 797 0.40 0.61 3.78 4.76 1.02 3.37 201. 1458 13.K
frodo Linux 2.5.33x 797 0.40 0.60 3.51 4.38 1.02 3.27 159. 1430 13.K

Context switching - times in microseconds - smaller is better
-------------------------------------------------------------
Host OS 2p/0K 2p/16K 2p/64K 8p/16K 8p/64K 16p/16K 16p/64K
ctxsw ctxsw ctxsw ctxsw ctxsw ctxsw ctxsw
--------- ------------- ----- ------ ------ ------ ------ ------- -------
frodo Linux 2.4.18 0.990 4.4200 13.8 6.2700 309.8 58.6 310.5
frodo Linux 2.4.19 0.900 4.2900 15.3 5.9100 309.6 57.7 309.9
frodo Linux 2.5.33 1.620 5.2800 15.3 9.3500 312.7 54.9 312.7
frodo Linux 2.5.33x 1.040 4.3200 17.8 7.6200 312.5 49.9 312.5

*Local* Communication latencies in microseconds - smaller is better
-------------------------------------------------------------------
Host OS 2p/0K Pipe AF UDP RPC/ TCP RPC/ TCP
ctxsw UNIX UDP TCP conn
--------- ------------- ----- ----- ---- ----- ----- ----- ----- ----
frodo Linux 2.4.18 0.990 4.437 8.66
frodo Linux 2.4.19 0.900 4.561 7.76
frodo Linux 2.5.33 1.620 6.497 9.11
frodo Linux 2.5.33x 1.040 4.888 8.70

File & VM system latencies in microseconds - smaller is better
--------------------------------------------------------------
Host OS 0K File 10K File Mmap Prot Page
Create Delete Create Delete Latency Fault Fault
--------- ------------- ------ ------ ------ ------ ------- ----- -----
frodo Linux 2.4.18 68.9 16.0 185.8 31.6 425.0 0.789 2.00000
frodo Linux 2.4.19 68.9 14.9 186.5 29.8 416.0 0.798 2.00000
frodo Linux 2.5.33 77.8 19.1 211.6 38.3 774.0 0.832 3.00000
frodo Linux 2.5.33x 77.2 18.8 206.7 37.0 769.0 0.823 3.00000

*Local* Communication bandwidths in MB/s - bigger is better
-----------------------------------------------------------
Host OS Pipe AF TCP File Mmap Bcopy Bcopy Mem Mem
UNIX reread reread (libc) (hand) read write
--------- ------------- ---- ---- ---- ------ ------ ------ ------ ---- -----
frodo Linux 2.4.18 810. 650. 181.7 203.7 101.5 101.4 203. 195.3
frodo Linux 2.4.19 808. 680. 187.2 203.8 101.5 101.4 203. 190.1
frodo Linux 2.5.33 571. 636. 185.6 202.5 100.5 100.4 202. 190.3
frodo Linux 2.5.33x 768. 710. 185.4 202.5 100.5 100.4 202. 189.5

Memory latencies in nanoseconds - smaller is better
(WARNING - may not be correct, check graphs)
---------------------------------------------------
Host OS Mhz L1 $ L2 $ Main mem Guesses
--------- ------------- ---- ----- ------ -------- -------
frodo Linux 2.4.18 797 3.767 8.7890 158.9
frodo Linux 2.4.19 797 3.767 8.7980 158.9
frodo Linux 2.5.33 797 3.798 8.8660 160.1
frodo Linux 2.5.33x 797 3.796 45.5 160.2
make[1]: Leaving directory `/usr/src/LMbench/results'

Hope it helps.

Ciao,
Paolo


--
Get your free email from http://www.linuxmail.org


Powered by Outblaze

2002-09-07 18:49:08

by Rik van Riel

Subject: Re: LMbench2.0 results

On Sat, 7 Sep 2002, Jeff Garzik wrote:
> Paolo Ciarrocchi wrote:
> > Comments?
>
> Yeah: "ouch" because I don't see a single category that's faster.

HZ went to 1000, which should help multimedia latencies a lot.

> Oh well, it still needs to be tuned....

For throughput or for latency ? ;)

Rik
--
Bravely reimplemented by the knights who say "NIH".

http://www.surriel.com/ http://distro.conectiva.com/

Spamtraps of the month: [email protected] [email protected]

2002-09-07 20:01:06

by William Lee Irwin III

Subject: Re: LMbench2.0 results

Paolo Ciarrocchi wrote:
>> Hi all,
>> I've just run lmbench2.0 on my laptop.
>> Here are the results (again, 2.5.33 seems to be "slow", I don't know why...)

On Sat, Sep 07, 2002 at 09:20:56AM -0700, Andrew Morton wrote:
> The fork/exec/mmap slowdown is the rmap overhead. I have some stuff
> which partially improves it.

Hmm, Where does it enter the mmap() path? PTE instantiation is only done
for the VM_LOCKED case IIRC. Otherwise it should be invisible.

Perhaps testing with overcommit on would be useful.


Cheers,
Bill

2002-09-07 21:38:12

by Alan

Subject: Re: LMbench2.0 results

On Sat, 2002-09-07 at 19:53, Rik van Riel wrote:
> On Sat, 7 Sep 2002, Jeff Garzik wrote:
> > Paolo Ciarrocchi wrote:
> > > Comments?
> >
> > Yeah: "ouch" because I don't see a single category that's faster.
>
> HZ went to 1000, which should help multimedia latencies a lot.

It shouldn't materially damage performance unless we have other things
extremely wrong. It's easy enough to verify by putting HZ back to 100 and
rebenching.


2002-09-07 22:54:03

by Andrew Morton

Subject: Re: LMbench2.0 results

William Lee Irwin III wrote:
>
> Paolo Ciarrocchi wrote:
> >> Hi all,
> >> I've just run lmbench2.0 on my laptop.
> >> Here are the results (again, 2.5.33 seems to be "slow", I don't know why...)
>
> On Sat, Sep 07, 2002 at 09:20:56AM -0700, Andrew Morton wrote:
> > The fork/exec/mmap slowdown is the rmap overhead. I have some stuff
> > which partially improves it.
>
> Hmm, Where does it enter the mmap() path? PTE instantiation is only done
> for the VM_LOCKED case IIRC. Otherwise it should be invisible.

Oh, is that just the mmap() call itself?

> Perhaps testing with overcommit on would be useful.

Well yes - the new overcommit code was a significant hit on the 16ways
was it not? You have some numbers on that?

2002-09-07 22:58:34

by William Lee Irwin III

Subject: Re: LMbench2.0 results

On Sat, Sep 07, 2002 at 09:20:56AM -0700, Andrew Morton wrote:
>>> The fork/exec/mmap slowdown is the rmap overhead. I have some stuff
>>> which partially improves it.

William Lee Irwin III wrote:
>> Hmm, Where does it enter the mmap() path? PTE instantiation is only done
>> for the VM_LOCKED case IIRC. Otherwise it should be invisible.

On Sat, Sep 07, 2002 at 04:12:49PM -0700, Andrew Morton wrote:
> Oh, is that just the mmap() call itself?

I'm not actually sure what lmbench is doing.


William Lee Irwin III wrote:
>> Perhaps testing with overcommit on would be useful.

On Sat, Sep 07, 2002 at 04:12:49PM -0700, Andrew Morton wrote:
> Well yes - the new overcommit code was a significant hit on the 16ways
> was it not? You have some numbers on that?

I don't remember the before/after numbers, but I can collect some.


Cheers,
Bill

2002-09-07 23:43:11

by Martin J. Bligh

Subject: Re: LMbench2.0 results

>> Perhaps testing with overcommit on would be useful.
>
> Well yes - the new overcommit code was a significant hit on the 16ways
> was it not? You have some numbers on that?

About 20% hit on system time for kernel compiles.

M.

2002-09-08 07:32:28

by Andrew Morton

Subject: Re: LMbench2.0 results

Paolo Ciarrocchi wrote:
>
> ...
> File & VM system latencies in microseconds - smaller is better
> --------------------------------------------------------------
> Host OS 0K File 10K File Mmap Prot Page
> Create Delete Create Delete Latency Fault Fault
> --------- ------------- ------ ------ ------ ------ ------- ----- -----
> frodo Linux 2.4.18 68.9 16.0 185.8 31.6 425.0 0.789 2.00000
> frodo Linux 2.4.19 68.9 14.9 186.5 29.8 416.0 0.798 2.00000
> frodo Linux 2.5.33 77.8 19.1 211.6 38.3 774.0 0.832 3.00000
> frodo Linux 2.5.33x 77.2 18.8 206.7 37.0 769.0 0.823 3.00000
>

The create/delete performance is filesystem-specific.

profiling lat_fs on ext3:

c0170b70 236 0.372293 ext3_get_inode_loc
c014354c 278 0.438548 __find_get_block
c017cbf0 284 0.448013 journal_cancel_revoke
c017ee24 291 0.459056 journal_add_journal_head
c0171030 307 0.484296 ext3_do_update_inode
c017856c 353 0.556861 journal_get_write_access
c0178088 487 0.768248 do_get_write_access
c0114744 530 0.836081 smp_apic_timer_interrupt
c0178a84 559 0.881829 journal_dirty_metadata
c0130644 832 1.31249 generic_file_write_nolock
c0172654 2903 4.57951 ext3_add_entry
c016ca10 3636 5.73583 ext3_check_dir_entry
c0107048 47078 74.2661 poll_idle

ext3_check_dir_entry is just sanity checking. hmm.

on ext2:

c017f3ec 138 0.239971 ext2_free_blocks
c012f560 147 0.255621 unlock_page
c017f954 148 0.25736 ext2_new_block
c017f2f0 154 0.267793 ext2_get_group_desc
c0181958 162 0.281705 ext2_new_inode
c014354c 182 0.316483 __find_get_block
c0154f64 184 0.319961 __d_lookup
c0109bc0 232 0.403429 apic_timer_interrupt
c0143cc4 455 0.791208 __block_prepare_write
c0114744 459 0.798164 smp_apic_timer_interrupt
c0130644 1634 2.84139 generic_file_write_nolock
c0180c64 6084 10.5796 ext2_add_link
c0107048 42472 73.8554 poll_idle

This is mostly in ext2_match() - comparing strings while
searching the directory. memcmp().

ext3 with hashed index directories:

c01803dc 292 0.495251 journal_unlock_journal_head
c0170b70 313 0.530868 ext3_get_inode_loc
c01801a4 412 0.698779 journal_add_journal_head
c014354c 455 0.77171 __find_get_block
c0171030 489 0.829376 ext3_do_update_inode
c017df70 515 0.873474 journal_cancel_revoke
c01798ec 555 0.941316 journal_get_write_access
c0173208 568 0.963365 ext3_add_entry
c0179408 804 1.36364 do_get_write_access
c0179e04 838 1.4213 journal_dirty_metadata
c0130644 1127 1.91147 generic_file_write_nolock
c0107048 44117 74.8253 poll_idle

And yet the test (which tries to run for a fixed walltime)
seems to do the same amount of work. No idea what's up
with that.

Lessons: use an indexed-directory filesystem, and consistency
checking costs.
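
(For reference, the 0K create/delete columns come from lat_fs hammering a
single directory with small files.  Roughly the shape below -- reconstructed
from memory rather than from the lmbench source, so the file count and
naming are made up, and the timing is left out:)

/* rough model of an lmbench-style 0K file create/delete pass
 * (not the real lat_fs source; count and names invented) */
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/stat.h>

#define NFILES 1000

int main(void)
{
	char name[64];
	int i, fd;

	for (i = 0; i < NFILES; i++) {		/* "0K File Create" */
		sprintf(name, "lat_fs_%d", i);
		fd = creat(name, 0666);		/* new dirent + inode */
		if (fd < 0)
			exit(1);
		close(fd);			/* nothing written: 0K */
	}
	for (i = 0; i < NFILES; i++) {		/* "0K File Delete" */
		sprintf(name, "lat_fs_%d", i);
		unlink(name);
	}
	return 0;
}

With ~1000 entries in one directory, each creat() rescans the existing
entries, which is where the ext2_add_link/memcmp time above goes; the
hashed-index directory avoids the linear scan.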

2002-09-08 07:32:26

by Andrew Morton

Subject: Re: LMbench2.0 results

William Lee Irwin III wrote:
>
> Paolo Ciarrocchi wrote:
> >> Hi all,
> >> I've just run lmbench2.0 on my laptop.
> >> Here are the results (again, 2.5.33 seems to be "slow", I don't know why...)
>
> On Sat, Sep 07, 2002 at 09:20:56AM -0700, Andrew Morton wrote:
> > The fork/exec/mmap slowdown is the rmap overhead. I have some stuff
> > which partially improves it.
>
> Hmm, Where does it enter the mmap() path? PTE instantiation is only done
> for the VM_LOCKED case IIRC. Otherwise it should be invisible.
>

lat_mmap seems to do an mmap(), fault in ten pages and then
do a munmap(). Most of the CPU cost is in cache misses against
the pagetables in munmap().
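
Roughly this shape (my reconstruction of the loop from the above, not the
actual lat_mmap source; the file name, mapping size and iteration count
are invented):

/* rough model of the mmap/fault/munmap cycle described above */
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>

int main(void)
{
	size_t len = 1024 * 1024;		/* mapping size: made up */
	int i, iter;
	int fd = open("/tmp/lat_mmap_file", O_RDWR | O_CREAT, 0666);

	if (fd < 0 || ftruncate(fd, len) < 0)
		exit(1);

	for (iter = 0; iter < 10000; iter++) {
		char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
			       MAP_SHARED, fd, 0);
		if (p == MAP_FAILED)
			exit(1);
		for (i = 0; i < 10; i++)	/* fault in just ten pages */
			p[i * (len / 10)] = 1;	/* do_no_page, page_add_rmap */
		munmap(p, len);			/* zap_pte_range, page_remove_rmap */
	}
	close(fd);
	unlink("/tmp/lat_mmap_file");
	return 0;
}

Only ten PTEs per iteration get populated, so the munmap() side spends its
time walking mostly-empty pagetables -- hence the zap_pte_range lines below.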

c012d54c 153 0.569493 do_mmap_pgoff
c012db5c 158 0.588104 find_vma
c01301ec 172 0.640214 filemap_nopage
c0134e84 172 0.640214 release_pages
c0114744 184 0.684881 smp_apic_timer_interrupt
c012ce3c 248 0.9231 handle_mm_fault
c012f738 282 1.04965 find_get_page
c013e2b0 356 1.32509 __set_page_dirty_buffers
c0116294 377 1.40326 do_page_fault
c013e72c 383 1.42559 page_add_rmap
c013e8bc 398 1.48143 page_remove_rmap
c012cb10 425 1.58193 do_no_page
c0109d70 629 2.34125 page_fault
c012b2f4 1036 3.85618 zap_pte_range
c0107048 20205 75.2066 poll_idle

(Multiply everything by four - it's a quad)

Instruction-level profile for -mm5:

c012b2f4 1036 3.85618 0 0 zap_pte_range /usr/src/25/mm/memory.c:325
c012b2f5 2 0.19305 0 0 /usr/src/25/mm/memory.c:325
c012b2fd 1 0.0965251 0 0 /usr/src/25/mm/memory.c:325
c012b300 2 0.19305 0 0 /usr/src/25/mm/memory.c:325
c012b306 1 0.0965251 0 0 /usr/src/25/mm/memory.c:329
c012b309 1 0.0965251 0 0 /usr/src/25/mm/memory.c:329
c012b30f 1 0.0965251 0 0 /usr/src/25/mm/memory.c:331
c012b319 1 0.0965251 0 0 /usr/src/25/mm/memory.c:331
c012b340 1 0.0965251 0 0 /usr/src/25/mm/memory.c:336
c012b348 1 0.0965251 0 0 /usr/src/25/include/asm/highmem.h:80
c012b350 1 0.0965251 0 0 /usr/src/25/include/asm/thread_info.h:75
c012b35a 2 0.19305 0 0 /usr/src/25/include/asm/highmem.h:85
c012b365 2 0.19305 0 0 /usr/src/25/include/asm/highmem.h:86
c012b3c3 2 0.19305 0 0 /usr/src/25/mm/memory.c:337
c012b3d6 1 0.0965251 0 0 /usr/src/25/mm/memory.c:338
c012b3e9 3 0.289575 0 0 /usr/src/25/mm/memory.c:341
c012b3f5 106 10.2317 0 0 /usr/src/25/mm/memory.c:342
c012b3f8 2 0.19305 0 0 /usr/src/25/mm/memory.c:342
c012b3fa 26 2.50965 0 0 /usr/src/25/mm/memory.c:343
c012b3fc 124 11.9691 0 0 /usr/src/25/mm/memory.c:343
c012b405 13 1.25483 0 0 /usr/src/25/mm/memory.c:345
c012b40b 1 0.0965251 0 0 /usr/src/25/mm/memory.c:346
c012b410 2 0.19305 0 0 /usr/src/25/mm/memory.c:348
c012b412 1 0.0965251 0 0 /usr/src/25/mm/memory.c:348
c012b414 62 5.98456 0 0 /usr/src/25/mm/memory.c:349
c012b41b 1 0.0965251 0 0 /usr/src/25/mm/memory.c:350
c012b421 21 2.02703 0 0 /usr/src/25/mm/memory.c:350
c012b427 2 0.19305 0 0 /usr/src/25/mm/memory.c:351
c012b432 2 0.19305 0 0 /usr/src/25/include/asm/bitops.h:244
c012b434 10 0.965251 0 0 /usr/src/25/mm/memory.c:352
c012b437 1 0.0965251 0 0 /usr/src/25/mm/memory.c:352
c012b43d 5 0.482625 0 0 /usr/src/25/mm/memory.c:353
c012b446 7 0.675676 0 0 /usr/src/25/include/linux/mm.h:389
c012b44b 1 0.0965251 0 0 /usr/src/25/include/linux/mm.h:392
c012b44e 1 0.0965251 0 0 /usr/src/25/include/linux/mm.h:392
c012b451 7 0.675676 0 0 /usr/src/25/include/linux/mm.h:393
c012b453 2 0.19305 0 0 /usr/src/25/include/linux/mm.h:393
c012b461 6 0.579151 0 0 /usr/src/25/include/linux/mm.h:396
c012b466 8 0.772201 0 0 /usr/src/25/include/linux/mm.h:396
c012b46f 6 0.579151 0 0 /usr/src/25/mm/memory.c:356
c012b476 15 1.44788 0 0 /usr/src/25/include/asm-generic/tlb.h:105
c012b481 3 0.289575 0 0 /usr/src/25/include/asm-generic/tlb.h:106
c012b490 5 0.482625 0 0 /usr/src/25/include/asm-generic/tlb.h:110
c012b493 7 0.675676 0 0 /usr/src/25/include/asm-generic/tlb.h:110
c012b49a 1 0.0965251 0 0 /usr/src/25/include/asm-generic/tlb.h:110
c012b49d 3 0.289575 0 0 /usr/src/25/include/asm-generic/tlb.h:110
c012b4a0 1 0.0965251 0 0 /usr/src/25/include/asm-generic/tlb.h:110
c012b4a3 8 0.772201 0 0 /usr/src/25/include/asm-generic/tlb.h:111
c012b4aa 13 1.25483 0 0 /usr/src/25/include/asm-generic/tlb.h:111
c012b500 128 12.3552 0 0 /usr/src/25/mm/memory.c:341
c012b504 108 10.4247 0 0 /usr/src/25/mm/memory.c:341
c012b50b 111 10.7143 0 0 /usr/src/25/mm/memory.c:341
c012b50e 99 9.55598 0 0 /usr/src/25/mm/memory.c:341
c012b511 86 8.30116 0 0 /usr/src/25/mm/memory.c:341
c012b51c 4 0.3861 0 0 /usr/src/25/include/asm/thread_info.h:75
c012b521 3 0.289575 0 0 /usr/src/25/mm/memory.c:366
c012b525 1 0.0965251 0 0 /usr/src/25/mm/memory.c:366
c012b526 1 0.0965251 0 0 /usr/src/25/mm/memory.c:366

So it's a bit of rmap in there. I'd have to compare with a 2.4
profile and fiddle a few kernel parameters. But I'm not sure
that munmap of extremely sparsely populated pagetables is very
interesting?

2002-09-08 07:40:00

by David Miller

Subject: Re: LMbench2.0 results

From: Andrew Morton <[email protected]>
Date: Sun, 08 Sep 2002 00:51:19 -0700

So it's a bit of rmap in there. I'd have to compare with a 2.4
profile and fiddle a few kernel parameters. But I'm not sure
that munmap of extremely sparsely populated pagetables is very
interesting?

Another issue is that x86 doesn't use a pagetable cache. I think it
got killed from x86 when the pagetables in highmem went in.

This is all from memory.

2002-09-08 08:28:15

by David Miller

Subject: Re: LMbench2.0 results

From: William Lee Irwin III <[email protected]>
Date: Sun, 8 Sep 2002 01:28:21 -0700

But if this were truly the issue, the allocation and deallocation
overhead for pagetables should show up as additional pressure
against zone->lock.

The big gain is not only that allocation/free is cheap, also
page table entries tend to hit in cpu cache for even freshly
allocated page tables.

I think that is the bit that would show up in the mmap lmbench
test.
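
Roughly what those quicklists did, from memory (illustrative only, not code
from any tree; it ignores pte-highmem, preemption and any reclaim of the
cached pages):

/* minimal per-CPU pagetable "quicklist" sketch -- names made up here */
static unsigned long *pte_quicklist[NR_CPUS];

static inline unsigned long *pte_alloc_one_fast(void)
{
	int cpu = smp_processor_id();
	unsigned long *page = pte_quicklist[cpu];

	if (page) {
		/* next free page is chained through its first word */
		pte_quicklist[cpu] = (unsigned long *)page[0];
		page[0] = 0;
		return page;	/* cheap, and likely still cache-warm */
	}
	/* slow path: the buddy allocator (zone->lock, probably a cold page) */
	return (unsigned long *)get_zeroed_page(GFP_KERNEL);
}

static inline void pte_free_fast(unsigned long *page)
{
	int cpu = smp_processor_id();

	/* assumes pte pages are already clear when they are freed */
	page[0] = (unsigned long)pte_quicklist[cpu];
	pte_quicklist[cpu] = page;
}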

2002-09-08 08:25:55

by William Lee Irwin III

Subject: Re: LMbench2.0 results

From: Andrew Morton <[email protected]>
> So it's a bit of rmap in there. I'd have to compare with a 2.4
> profile and fiddle a few kernel parameters. But I'm not sure
> that munmap of extremely sparsely populated pagetables is very
> interesting?

On Sun, Sep 08, 2002 at 12:37:00AM -0700, David S. Miller wrote:
> Another issue is that x86 doesn't use a pagetable cache. I think it
> got killed from x86 when the pagetables in highmem went in.
> This is all from memory.

They seemed to have some other issues related to extreme memory
pressure (routine for me). But if this were truly the issue, the
allocation and deallocation overhead for pagetables should show up as
additional pressure against zone->lock. I can't tell at the moment
because zone->lock is hammered quite hard to begin with and no one's
gone out and done a pagetable caching patch for the stuff since. It
should be simple to chain with links in struct page instead of links
embedded in the pagetables & smp_call_function() to reclaim. But this
raises questions of generality.

Cheers,
Bill

2002-09-08 09:09:40

by William Lee Irwin III

Subject: Re: LMbench2.0 results

From: William Lee Irwin III <[email protected]>
Date: Sun, 8 Sep 2002 01:28:21 -0700
> But if this were truly the issue, the allocation and deallocation
> overhead for pagetables should show up as additional pressure
> against zone->lock.

On Sun, Sep 08, 2002 at 01:25:26AM -0700, David S. Miller wrote:
> The big gain is not only that allocation/free is cheap, also
> page table entries tend to hit in cpu cache for even freshly
> allocated page tables.
> I think that is the bit that would show up in the mmap lmbench
> test.

I'd have to doublecheck to see how parallelized lat_mmap is. My
machines are considerably more sensitive to locking uglies than cache
warmth. (They're taking my machines out, not just slowing them down.)
Cache warmth goodies are certainly nice optimizations, though.


Cheers,
Bill

2002-09-08 17:01:23

by Alan

Subject: Re: LMbench2.0 results

On Sun, 2002-09-08 at 00:44, Martin J. Bligh wrote:
> >> Perhaps testing with overcommit on would be useful.
> >
> > Well yes - the new overcommit code was a significant hit on the 16ways
> > was it not? You have some numbers on that?
>
> About 20% hit on system time for kernel compiles.

That surprises me a lot. On a 2 way and 4 way the 2.4 memory overcommit
check code didn't show up. That may be down to the 2 way being on a CPU
that has no measurable cost for locked operations and the 4 way being an
ancient ppro a friend has.

If it is the memory overcommit handling then there are plenty of ways to
deal with it efficiently in the non-preempt case at least. I had
wondered originally about booking chunks of pages off per CPU (take the
remaining overcommit divide by four and only when a CPU finds its
private block is empty take a lock and redistribute the remaining
allocation). Since boxes almost never get that close to overcommit
kicking in then it should mean we close to never touch a locked count.
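
As an untested sketch, with made-up names, ignoring preemption and the
uncharge/free side, and refilling a fixed-size chunk rather than literally
re-dividing the remainder:

#define RESERVE_CHUNK	1024		/* pages handed to a CPU per refill */

static spinlock_t commit_lock = SPIN_LOCK_UNLOCKED;
static long global_uncommitted;		/* set from the commit limit; under the lock */
static long cpu_reserve[NR_CPUS];	/* this CPU's private block */

static int charge_pages(long pages)
{
	int cpu = smp_processor_id();
	long grab;

	if (cpu_reserve[cpu] >= pages) {
		cpu_reserve[cpu] -= pages;	/* common case: no locked op */
		return 0;
	}

	/* private block exhausted: take the lock and redistribute */
	spin_lock(&commit_lock);
	global_uncommitted += cpu_reserve[cpu];
	cpu_reserve[cpu] = 0;
	if (global_uncommitted < pages) {
		spin_unlock(&commit_lock);
		return -ENOMEM;			/* would exceed the limit */
	}
	global_uncommitted -= pages;
	grab = RESERVE_CHUNK;			/* book a fresh private block */
	if (grab > global_uncommitted)
		grab = global_uncommitted;
	cpu_reserve[cpu] = grab;
	global_uncommitted -= grab;
	spin_unlock(&commit_lock);
	return 0;
}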

2002-09-08 18:08:35

by Martin J. Bligh

Subject: Re: LMbench2.0 results

>> >> Perhaps testing with overcommit on would be useful.
>> >
>> > Well yes - the new overcommit code was a significant hit on the 16ways
>> > was it not? You have some numbers on that?
>>
>> About 20% hit on system time for kernel compiles.
>
> That surprises me a lot. On a 2 way and 4 way the 2.4 memory overcommit
> check code didn't show up. That may be down to the 2 way being on a CPU
> that has no measurable cost for locked operations and the 4 way being an
> ancient ppro a friend has.

Remember this is a NUMA machine - gathering global information
is extremely expensive. On an SMP system, I wouldn't expect it
to show up so much, though it still doesn't seem terribly efficient.
The code admits it's broken anyway, for the overcommit = 2 case
(which was NOT what I was running - the 20% is for 1). Below is a
simple patch that I've never got around to testing, that I think
will improve that case (not that I'm that interested in setting
overcommit to 2 ;-)).

> If it is the memory overcommit handling then there are plenty of ways to
> deal with it efficiently in the non-preempt case at least. I had
> wondered originally about booking chunks of pages off per CPU (take the
> remaining overcommit divide by four and only when a CPU finds its
> private block is empty take a lock and redistribute the remaining
> allocation). Since boxes almost never get that close to overcommit
> kicking in then it should mean we close to never touch a locked count.

Can you use per-zone stats rather than global ones? That tends to
fix things pretty efficently on these type of machines - per zone
LRUs made a huge impact.

Here's a little patch (untested!). I'll go look at the other case
and see if there's something easy to do, but I think it needs some
significant rework to do anything.

--- virgin-2.5.30.full/mm/mmap.c Thu Aug 1 14:16:05 2002
+++ linux-2.5.30-vm_enough_memory/mm/mmap.c Wed Aug 7 13:26:46 2002
@@ -74,7 +74,6 @@
 int vm_enough_memory(long pages)
 {
 	unsigned long free, allowed;
-	struct sysinfo i;
 
 	atomic_add(pages, &vm_committed_space);
 
@@ -115,12 +114,7 @@
 		return 0;
 	}
 
-	/*
-	 * FIXME: need to add arch hooks to get the bits we need
-	 * without this higher overhead crap
-	 */
-	si_meminfo(&i);
-	allowed = i.totalram * sysctl_overcommit_ratio / 100;
+	allowed = totalram_pages * sysctl_overcommit_ratio / 100;
 	allowed += total_swap_pages;
 
 	if (atomic_read(&vm_committed_space) < allowed)


2002-09-08 18:21:02

by Andrew Morton

Subject: Re: LMbench2.0 results

Alan Cox wrote:
>
> On Sun, 2002-09-08 at 00:44, Martin J. Bligh wrote:
> > >> Perhaps testing with overcommit on would be useful.
> > >
> > > Well yes - the new overcommit code was a significant hit on the 16ways
> > > was it not? You have some numbers on that?
> >
> > About 20% hit on system time for kernel compiles.
>
> That surprises me a lot. On a 2 way and 4 way the 2.4 memory overcommit
> check code didn't show up. That may be down to the 2 way being on a CPU
> that has no measurable cost for locked operations and the 4 way being an
> ancient ppro a friend has.
>
> If it is the memory overcommit handling then there are plenty of ways to
> deal with it efficiently in the non-preempt case at least. I had
> wondered originally about booking chunks of pages off per CPU (take the
> remaining overcommit divide by four and only when a CPU finds its
> private block is empty take a lock and redistribute the remaining
> allocation). Since boxes almost never get that close to overcommit
> kicking in then it should mean we close to never touch a locked count.

Martin had this profile for a kernel build on 2.5.31-mm1:



c01299d0 6761 1.28814 vm_enough_memory
c0114584 8085 1.5404 load_balance
c01334c0 8292 1.57984 __free_pages_ok
c011193c 11559 2.20228 smp_apic_timer_interrupt
c0113040 12075 2.3006 do_page_fault
c012bf08 12075 2.3006 find_get_page
c0114954 12912 2.46007 scheduler_tick
c012c430 13199 2.51475 file_read_actor
c01727e8 20440 3.89434 __generic_copy_from_user
c0133fb8 25792 4.91403 nr_free_pages
c01337c0 27318 5.20478 rmqueue
c0129588 36955 7.04087 handle_mm_fault
c013a65c 38391 7.31447 page_remove_rmap
c0134094 43755 8.33645 get_page_state
c0105300 57699 10.9931 default_idle
c0128e64 58735 11.1905 do_anonymous_page

We can make nr_free_pages go away by adding global free page
accounting to struct page_states. So we're accounting it in
two places, but it'll be simple.

The global page accounting is very much optimised for the fast path at
the expense of get_page_state(). (And that kernel didn't have the
rmap speedups).

We need to find some way of making vm_enough_memory not call get_page_state
so often. One way of doing that might be to make get_page_state dump
its latest result into a global copy, and make vm_enough_memory()
only get_page_state once per N invocations. A speed/accuracy tradeoff there.
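
As an untested sketch of the "global copy" idea (invented names for the new
bits), vm_enough_memory() would call something like this in place of a
direct get_page_state():

#define PS_REFRESH_INTERVAL	64	/* full recomputation every N calls */

static struct page_state cached_ps;
static int ps_calls = PS_REFRESH_INTERVAL;	/* force a refresh first time */
static spinlock_t ps_lock = SPIN_LOCK_UNLOCKED;

static void get_page_state_cached(struct page_state *ps)
{
	spin_lock(&ps_lock);
	if (++ps_calls >= PS_REFRESH_INTERVAL) {
		get_page_state(&cached_ps);	/* the expensive per-cpu walk */
		ps_calls = 0;
	}
	*ps = cached_ps;			/* everyone else gets the stale copy */
	spin_unlock(&ps_lock);
}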

2002-09-08 20:43:37

by Hugh Dickins

Subject: Re: LMbench2.0 results

On Sun, 8 Sep 2002, Andrew Morton wrote:
>
> We need to find some way of making vm_enough_memory not call get_page_state
> so often. One way of doing that might be to make get_page_state dump
> its latest result into a global copy, and make vm_enough_memory()
> only get_page_state once per N invocations. A speed/accuracy tradeoff there.

Accuracy is not very important in that sysctl_overcommit_memory 0 case
e.g. the swapper_space.nr_pages addition was brought in at a time when
it was very necessary, but usually overestimates now (or last time I
thought about it). The main thing to look out for is running the same
memory grabber twice in quick succession: not nice if it succeeds the
first time, but not the second, just because of some transient effect
that its old pages are temporarily uncounted.

Hugh

2002-09-08 21:31:55

by Andrew Morton

Subject: Re: LMbench2.0 results

Hugh Dickins wrote:
>
> On Sun, 8 Sep 2002, Andrew Morton wrote:
> >
> > We need to find some way of making vm_enough_memory not call get_page_state
> > so often. One way of doing that might be to make get_page_state dump
> > its latest result into a global copy, and make vm_enough_memory()
> > only get_page_state once per N invocations. A speed/accuracy tradeoff there.
>
> Accuracy is not very important in that sysctl_overcommit_memory 0 case
> e.g. the swapper_space.nr_pages addition was brought in at a time when
> it was very necessary, but usually overestimates now (or last time I
> thought about it). The main thing to look out for is running the same
> memory grabber twice in quick succession: not nice if it succeeds the
> first time, but not the second, just because of some transient effect
> that its old pages are temporarily uncounted.
>

That's right - there can be sudden and huge changes in pages used/free.

So any rate limiting tweak in there would have to be in terms of
number-of-pages rather than number-of-seconds.

2002-09-09 04:24:14

by Daniel Phillips

Subject: Re: LMbench2.0 results

On Saturday 07 September 2002 18:20, Andrew Morton wrote:
> Paolo Ciarrocchi wrote:
> >
> > Hi all,
> > I've just run lmbench2.0 on my laptop.
> > Here are the results (again, 2.5.33 seems to be "slow", I don't know why...)
> >
>
> The fork/exec/mmap slowdown is the rmap overhead. I have some stuff
> which partially improves it.

It only seems like a big deal if you get out your microscope and focus on
the fork times. On the other hand, look at the sh times: the rmap setup
time gets lost in the noise. The latter looks more like reality to me.

I suspect the overall performance loss on the laptop has more to do with
several months of focussing exclusively on the needs of 4-way and higher
smp machines.

--
Daniel

2002-09-09 13:32:54

by Rik van Riel

Subject: Re: LMbench2.0 results

On Sun, 8 Sep 2002, Daniel Phillips wrote:

> I suspect the overall performance loss on the laptop has more to do with
> several months of focussing exclusively on the needs of 4-way and higher
> smp machines.

Probably true, we're pulling off an indecent number of tricks
for 4-way and 8-way SMP performance. This overhead shouldn't
be too bad on UP and 2-way machines, but might easily be a
percent or so.

Rik
--
Bravely reimplemented by the knights who say "NIH".

http://www.surriel.com/ http://distro.conectiva.com/

Spamtraps of the month: [email protected] [email protected]

2002-09-09 16:08:56

by Daniel Phillips

Subject: Re: LMbench2.0 results

On Monday 09 September 2002 15:37, Rik van Riel wrote:
> On Sun, 8 Sep 2002, Daniel Phillips wrote:
>
> > I suspect the overall performance loss on the laptop has more to do with
> > several months of focussing exclusively on the needs of 4-way and higher
> > smp machines.
>
> Probably true, we're pulling off an indecent number of tricks
> for 4-way and 8-way SMP performance. This overhead shouldn't
> be too bad on UP and 2-way machines, but might easily be a
> percent or so.

Though to be fair, it's smart to concentrate on the high end with a
view to achieving world domination sooner. And it's a stretch to call
the low end performance 'slow'.

An idea that's looking more and more attractive as time goes by is to
have a global config option that specifies that we want to choose the
simple way of doing things wherever possible, over the enterprise way.
We want this especially for embedded. On low end processors, it's even
possible that the small way will be faster in some cases than the
enterprise way, due to cache effects.

--
Daniel

2002-09-09 16:23:42

by Martin J. Bligh

Subject: Re: LMbench2.0 results

>> Probably true, we're pulling off an indecent number of tricks
>> for 4-way and 8-way SMP performance. This overhead shouldn't
>> be too bad on UP and 2-way machines, but might easily be a
>> percent or so.
>
> Though to be fair, it's smart to concentrate on the high end with a
> view to achieving world domination sooner. And it's a stretch to call
> the low end performance 'slow'.

I don't think there's that much overhead, it's just not where people
have been focusing tuning efforts recently. If you run the numbers,
and point out specific problems, I'm sure people will fix them ;-)
In other words, I don't think the recent focus has caused a problem
for low end machines, it just hasn't really looked at solving one.

> An idea that's looking more and more attractive as time goes by is to
> have a global config option that specifies that we want to choose the
> simple way of doing things wherever possible, over the enterprise way.
> We want this especially for embedded. On low end processors, it's even
> possible that the small way will be faster in some cases than the
> enterprise way, due to cache effects.

Can't we just use the existing config options instead? CONFIG_SMP is
a good start ;-) How many embedded systems with SMP do you have?

M.

2002-09-09 16:32:49

by Andrew Morton

Subject: Re: LMbench2.0 results

Daniel Phillips wrote:
>
> On Monday 09 September 2002 15:37, Rik van Riel wrote:
> > On Sun, 8 Sep 2002, Daniel Phillips wrote:
> >
> > > I suspect the overall performance loss on the laptop has more to do with
> > > several months of focussing exclusively on the needs of 4-way and higher
> > > smp machines.
> >
> > Probably true, we're pulling off an indecent number of tricks
> > for 4-way and 8-way SMP performance. This overhead shouldn't
> > be too bad on UP and 2-way machines, but might easily be a
> > percent or so.
>
> Though to be fair, it's smart to concentrate on the high end with a
> view to achieving world domination sooner. And it's a stretch to call
> the low end performance 'slow'.

It's on the larger machines where 2.4 has problems. Fixing them up
makes the kernel broader, more general purpose. We're seeing 50-100%
gains in some areas there. Giving away a few percent on smaller machines
at this stage is OK. But yup, we need to go and get that back later.

> An idea that's looking more and more attractive as time goes by is to
> have a global config option that specifies that we want to choose the
> simple way of doing things wherever possible, over the enterprise way.

Prefer not to. We've been able to cover all bases moderately well
thus far without adding a big boolean switch.

> We want this especially for embedded. On low end processors, it's even
> possible that the small way will be faster in some cases than the
> enterprise way, due to cache effects.

The main thing we can do for smaller systems is to not allocate as much
memory at boot time. Some more careful scaling is needed there. I'll
generate a list soon.

2002-09-09 16:52:46

by Daniel Phillips

Subject: Re: LMbench2.0 results

On Monday 09 September 2002 18:26, Martin J. Bligh wrote:
> > An idea that's looking more and more attractive as time goes by is to
> > have a global config option that specifies that we want to choose the
> > simple way of doing things wherever possible, over the enterprise way.
> > We want this especially for embedded. On low end processors, it's even
> > possible that the small way will be faster in some cases than the
> > enterprise way, due to cache effects.
>
> Can't we just use the existing config options instead? CONFIG_SMP is
> a good start ;-) How many embedded systems with SMP do you have?

You need to look at it from the other direction: how do the needs of a
uniprocessor Clawhammer box differ from a Linksys adsl router?

--
Daniel

2002-09-09 17:21:55

by Martin J. Bligh

Subject: Re: LMbench2.0 results

>> > An idea that's looking more and more attractive as time goes by is to
>> > have a global config option that specifies that we want to choose the
>> > simple way of doing things wherever possible, over the enterprise way.
>> > We want this especially for embedded. On low end processors, it's even
>> > possible that the small way will be faster in some cases than the
>> > enterprise way, due to cache effects.
>>
>> Can't we just use the existing config options instead? CONFIG_SMP is
>> a good start ;-) How many embedded systems with SMP do you have?
>
> You need to look at it from the other direction: how do the needs of a
> uniprocessor Clawhammer box differ from a Linksys adsl router?

I wouldn't call uniprocessor Clawhammer the "enterprise way" type
machine. But other than that, I see your point. You're in a far
better position to answer your own question than I am, so I'll leave
that as rhetorical ;-)

M.

2002-09-09 21:04:53

by Alan

Subject: Re: LMbench2.0 results

On Mon, 2002-09-09 at 17:55, Daniel Phillips wrote:
> You need to look at it from the other direction: how do the needs of a
> uniprocessor Clawhammer box differ from a Linksys adsl router?

I've advocated several times having a single config option for "fine
tuning" that sane people say "N" to and which if set lets you force
small hash sizes, disable block layer support and kill various other
'always needed' PC crap. Tell me - on a 4Mb embedded 386 running your
toaster do you really care if the TCP hash lookup is a little slower
than perfect scaling, and do you need a 64Kbyte mount hash ?

2002-09-09 21:07:06

by Alan

Subject: Re: LMbench2.0 results

On Sun, 2002-09-08 at 19:40, Andrew Morton wrote:
> We need to find some way of making vm_enough_memory not call get_page_state
> so often. One way of doing that might be to make get_page_state dump
> its latest result into a global copy, and make vm_enough_memory()
> only get_page_state once per N invocations. A speed/accuracy tradeoff there.

Unless the error always falls on the same side, the accuracy tradeoff is
fatal to the entire scheme of things. Sorting out the use of
get_page_state is worth doing if that is the bottleneck, and
snapshotting such that we only look at it if we might be close to the
limit would work, but we'd need to know when the limit had shifted too
much.
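
An untested sketch of that (names invented; compute_allowed() stands for
the existing get_page_state()-based arithmetic):

static long safe_until;			/* commit level known to be acceptable */
static spinlock_t safe_lock = SPIN_LOCK_UNLOCKED;

int vm_enough_memory(long pages)
{
	long committed, allowed;

	atomic_add(pages, &vm_committed_space);
	committed = atomic_read(&vm_committed_space);

	/* fast path: still well inside the window computed last time */
	if (committed < safe_until)
		return 1;

	/* possibly close to the limit: do the real accounting and move the
	 * window, keeping a margin so a transient dip doesn't bite the next
	 * caller; if the limit itself shifts a lot the window goes stale
	 * until we next fall through here, which is the open problem above */
	spin_lock(&safe_lock);
	allowed = compute_allowed();		/* get_page_state() etc. */
	if (committed <= allowed) {
		safe_until = committed + (allowed - committed) / 2;
		spin_unlock(&safe_lock);
		return 1;
	}
	spin_unlock(&safe_lock);
	atomic_sub(pages, &vm_committed_space);
	return 0;
}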

2002-09-09 21:40:16

by Andrew Morton

Subject: Re: LMbench2.0 results

Alan Cox wrote:
>
> On Sun, 2002-09-08 at 19:40, Andrew Morton wrote:
> > We need to find some way of making vm_enough_memory not call get_page_state
> > so often. One way of doing that might be to make get_page_state dump
> > its latest result into a global copy, and make vm_enough_memory()
> > only get_page_state once per N invocations. A speed/accuracy tradeoff there.
>
> Unless the error always falls on the same side, the accuracy tradeoff is
> fatal to the entire scheme of things. Sorting out the use of
> get_page_state is worth doing if that is the bottleneck, and
> snapshotting such that we only look at it if we might be close to the
> limit would work, but we'd need to know when the limit had shifted too
> much.

It could be that the cost is only present on the IBM whackomatics,
so they can twiddle the /proc setting and we can all be happy.
Certainly I did not see any problems on the quad.

Does "heuristic" overcommit handling need so much accuracy?
Perhaps we can push some of the cost over into mode 2 somehow.

Or we could turn it the other way up and, in __add_to_page_cache(),
do:

	if (overcommit_mode == anal)
		atomic_inc(&nr_pagecache_pages);
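
The other half would be the matching decrement on the remove side, plus a
plain counter read where the full page state walk used to be (hypothetical
names, as above):

	/* ...and in remove_from_page_cache(): */
	if (overcommit_mode == anal)
		atomic_dec(&nr_pagecache_pages);

	/* vm_enough_memory() then reads the counter instead of summing
	   the per-cpu page state: */
	free = atomic_read(&nr_pagecache_pages) + nr_free_pages();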

2002-09-09 22:02:50

by Alan

Subject: Re: LMbench2.0 results

On Mon, 2002-09-09 at 22:44, Andrew Morton wrote:
> Does "heuristic" overcommit handling need so much accuracy?
> Perhaps we can push some of the cost over into mode 2 somehow.

It's only needed in mode 2, but it's also only computed for modes 2 and 3 in
2.4 8)

2002-09-09 22:17:49

by Cliff White

Subject: Re: LMbench2.0 results

> On Sat, 7 Sep 2002, Paolo Ciarrocchi wrote:
>
> > Let me know if you need further information (.config, info about my
> > hardware) or if you want me to run other tests.
>
> Would you be able to run the tests for 2.5.31? I'm looking into a
> slowdown in 2.5.32/33 which may be related. Some hardware info might be
> useful too.
>
>
Certainly, we have those in the STP database, and here's a quick summary
(of course you can search these yourself). The full reports have the hardware
summary also; see web links at the end. Full reports have each test run 5x.


Processor, Processes - times in microseconds - smaller is better
----------------------------------------------------------------
Host OS Mhz null null open selct sig sig fork exec sh
call I/O stat clos TCP inst hndl proc proc proc
--------- ------------- ---- ---- ---- ---- ---- ----- ---- ---- ---- ---- ----
stp1-000. Linux 2.5.33 1000 0.33 0.49 2.84 3.52 0.79 2.62 168. 1279 4475
stp1-002. Linux 2.5.32 1000 0.32 0.47 2.94 4.41 15.7 0.80 2.63 202. 1292 4603
stp1-003. Linux 2.5.31 1000 0.32 0.46 2.85 6.92 14.4 0.80 2.60 856. 2596 8122

Context switching - times in microseconds - smaller is better
-------------------------------------------------------------
Host OS 2p/0K 2p/16K 2p/64K 8p/16K 8p/64K 16p/16K 16p/64K
ctxsw ctxsw ctxsw ctxsw ctxsw ctxsw ctxsw
--------- ------------- ----- ------ ------ ------ ------ ------- -------
stp1-000. Linux 2.5.33 1.530 4.1100 12.2 6.4700 136.4 32.7 136.2
stp1-002. Linux 2.5.32 1.590 4.2200 12.4 5.4000 139.1 26.6 136.7
stp1-003. Linux 2.5.31 1.830 46.4 142.6 47.5 141.7 47.6 141.2

*Local* Communication latencies in microseconds - smaller is better
-------------------------------------------------------------------
Host OS 2p/0K Pipe AF UDP RPC/ TCP RPC/ TCP
ctxsw UNIX UDP TCP conn
--------- ------------- ----- ----- ---- ----- ----- ----- ----- ----
stp1-000. Linux 2.5.33 1.530 5.320 10.7 13.8 30.5 19.3 42.1 65.3
stp1-002. Linux 2.5.32 1.570 5.456 11.3 14.2 31.3 21.1 42.6 67.4
stp1-003. Linux 2.5.31 1.810 7.377 14.9 50.5 173.7 117.1 263.8 414.

File & VM system latencies in microseconds - smaller is better
--------------------------------------------------------------
Host OS 0K File 10K File Mmap Prot Page
Create Delete Create Delete Latency Fault Fault
--------- ------------- ------ ------ ------ ------ ------- ----- -----
stp1-000. Linux 2.5.33 32.9 5.4600 117.0 13.3 1261.0 0.575 3.00000
stp1-002. Linux 2.5.32 34.0 5.9460 118.6 14.0 1265.0 0.619 3.00000
stp1-003. Linux 2.5.31 72.5 15.3 225.5 38.2 2062.0 0.657 4.00000

*Local* Communication bandwidths in MB/s - bigger is better
-----------------------------------------------------------
Host OS Pipe AF TCP File Mmap Bcopy Bcopy Mem Mem
UNIX reread reread (libc) (hand) read write
--------- ------------- ---- ---- ---- ------ ------ ------ ------ ---- -----
stp1-000. Linux 2.5.33 699. 855. 68.0 407.9 460.1 168.5 157.8 460. 233.8
stp1-002. Linux 2.5.32 690. 297. 93.7 397.5 459.2 162.1 150.0 458. 233.1
stp1-003. Linux 2.5.31 145. 74.8 58.5 118.6 456.9 169.7 156.8 456. 269.0

Full list:
2.5.33 http://khack.osdl.org/stp/4925 -1cpu
http://khack.osdl.org/stp/4915 -1cpu
http://khack.osdl.org/stp/4932 -2cpu
http://khack.osdl.org/stp/4926 -2cpu
2.5.32
http://khack.osdl.org/stp/4758 -2cpu
http://khack.osdl.org/stp/4752 -2cpu
http://khack.osdl.org/stp/4751 -1cpu
http://khack.osdl.org/stp/4741 -1cpu
2.5.31
http://khack.osdl.org/stp/4302 -1cpu
http://khack.osdl.org/stp/4312 -1cpu
http://khack.osdl.org/stp/4313 -2cpu
http://khack.osdl.org/stp/4319 -2cpu
cliffw
OSDL

> - James
> --
> James Morris
> <[email protected]>
>


2002-09-14 14:51:15

by Pavel Machek

Subject: Re: LMbench2.0 results

Hi!

> > > Let me know if you need further information (.config, info about my
> > > hardware) or if you want me to run other tests.
> >
> > Would you be able to run the tests for 2.5.31? I'm looking into a
> > slowdown in 2.5.32/33 which may be related. Some hardware info might be
> > useful too.
> I don't have the 2.5.31, and now I've only a slow
> internet connection... I'll try to download it on Monday.
>
> The hw is a Laptop, a standard HP Omnibook 6000, 256 MiB of RAM, PIII@800.
> Do you need more information?

I hope power management is completely disabled this time.
Pavel
--
Worst form of spam? Adding advertisement signatures ala sourceforge.net.
What goes next? Inserting advertisement *into* email?

2002-09-14 14:51:14

by Pavel Machek

Subject: Re: LMbench2.0 results

Hi!

> > > > Comments?
> > >
> > > Yeah: "ouch" because I don't see a single category that's faster.
> >
> > HZ went to 1000, which should help multimedia latencies a lot.
>
> It shouldn't materially damage performance unless we have other things
> extremely wrong. It's easy enough to verify by putting HZ back to 100 and
> rebenching.

1000 times per second: enter the timer interrupt, acknowledge it, exit the
interrupt. A few I/O accesses, a few TLB entries kicked out, some L1
cache consumed?

Is 10 usec per timer interrupt reasonable on a modern system? That's 10
msec per second spent in the timer with HZ=1000, that's 1% overall. So it
seems to me it is possible for HZ=1000 to have a performance impact...

Pavel

--
Worst form of spam? Adding advertisement signatures ala sourceforge.net.
What goes next? Inserting advertisement *into* email?

2002-09-14 18:24:01

by Paolo Ciarrocchi

Subject: Re: LMbench2.0 results

From: Pavel Machek <[email protected]>
[...]
> I hope power management is completely disabled this time.
> Pavel
Yes.
Pavel, is there a way to disable apm at boot time with a lilo parameter?

Paolo
--
Get your free email from http://www.linuxmail.org


Powered by Outblaze

2002-09-15 18:03:30

by Pavel Machek

Subject: Re: LMbench2.0 results

Hi!

> [...]
> > I hope power management is completely disabled this time.
> > Pavel
> Yes.
> Pavel, is there a way to disable apm at boot time with a lilo
parameter?

apm=off
Pavel

--
Casualties in World Trade Center: ~3k dead inside the building,
cryptography in U.S.A. and free speech in Czech Republic.