I've been looking at the CPU cost of the write() system call. Time how long
it takes to write a million bytes to an ext2 file, via a million
one-byte-writes:
time dd if=/dev/zero of=foo bs=1 count=1M
This only uses one CPU. It takes twice as long on SMP.
On a 2.7GHz P4-HT:
2.5.65-mm4, UP:
0.34s user 1.00s system 99% cpu 1.348 total
2.5.65-mm4, SMP:
0.41s user 2.04s system 100% cpu 2.445 total
2.4.21-pre5, UP:
0.34s user 0.96s system 106% cpu 1.224 total
2.4.21-pre5, SMP:
0.42s user 1.95s system 99% cpu 2.372 total
(The small additional overhead in 2.5 is expected - there are more function
calls due to the addition of AIO and there is more setup due to the (large)
writev speedups).
On a 500MHz PIII Xeon:
500MHz PIII, UP:
1.08s user 2.90s system 100% cpu 3.971 total
500MHz PIII, SMP:
1.13s user 4.86s system 99% cpu 5.999 total
This is pretty gross.  About six months back I worked out that across the
lifecycle of a pagecache page (creation via write() through to reclaim via
the page LRU) we take 27 spinlocks and rwlocks. And this does not even
include semaphores and atomic bitops (I used lockmeter). I'm not sure it is
this high any more - quite a few things were fixed up, but it is still high.
Profiles for 2.5 on the P4 show that it's all in fget(), fput() and
find_get_page(). Those locked operations are really hurting.
One thing is noteworthy: ia32's read_unlock() is buslocked, whereas
spin_unlock() is not. So let's see what happens if we convert file_lock
from an rwlock to a spinlock:
2.5.65-mm4, SMP:
0.34s user 2.00s system 100% cpu 2.329 total
That's a 5% speedup.
And if we were to convert file->f_count to be a nonatomic "int", protected by
files_lock it would probably speed things up further.
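The shape of those two changes, as a loose sketch only (the field and
function names here only approximate the 2.5-era VFS; this is not the
actual patch):

/*
 * Loose sketch only, not the actual patch; names approximate the
 * 2.5-era VFS.
 *
 * Idea 1: files_struct->file_lock becomes a spinlock, so on ia32 the
 * unlock in fget()/fput() is a plain store rather than the bus-locked
 * operation that read_unlock() needs.
 */
struct files_struct {
	spinlock_t file_lock;		/* was: rwlock_t file_lock */
	/* ... */
};

/*
 * Idea 2: file->f_count becomes a plain int protected by files_lock,
 * so taking a reference is an ordinary increment under the lock
 * rather than a LOCK-prefixed atomic_inc().
 */
static inline void sketch_get_file(struct file *file)
{
	spin_lock(&files_lock);
	file->f_count++;		/* was: atomic_inc(&file->f_count) */
	spin_unlock(&files_lock);	/* plain store on ia32 */
}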
I've always been a bit skeptical about rwlocks - if you're holding the lock
for long enough for a significant amount of reader concurrency, you're
holding it for too long. eg: tasklist_lock.
SMP:
c0254a14 read_zero 189 0.3841
c0250f5c clear_user 217 3.3906
c0148a64 sys_write 251 3.9219
c0148a24 sys_read 268 4.1875
c014b00c __block_prepare_write 485 0.5226
c0130dd4 unlock_page 493 7.7031
c0120220 current_kernel_time 560 8.7500
c0163190 __mark_inode_dirty 775 3.5227
c01488ec vfs_write 983 3.1506
c0132d9c generic_file_write 996 10.3750
c01486fc vfs_read 1010 3.2372
c014b3ac __block_commit_write 1100 7.6389
c0149500 fput 1110 39.6429
c01322e4 generic_file_aio_write_nolock 1558 0.6292
c0130fc0 find_lock_page 2301 13.0739
c01495f0 fget 2780 33.0952
c0108ddc system_call 4047 91.9773
c0106f64 default_idle 24624 473.5385
00000000 total 45321 0.0200
UP:
c023c9ac radix_tree_lookup 13 0.1711
c015362c inode_times_differ 14 0.1944
c015372c inode_update_time 14 0.1000
c012cb9c generic_file_write 15 0.1630
c012ca90 generic_file_write_nolock 17 0.1214
c0140e1c fget 17 0.3542
c023deb8 __copy_from_user_ll 17 0.1545
c0241514 read_zero 17 0.0346
c014001c vfs_read 18 0.0682
c01401dc vfs_write 20 0.0758
c01430d8 generic_commit_write 20 0.1786
c01402e4 sys_read 21 0.3281
c0142968 __block_commit_write 29 0.2014
c023dc4c clear_user 34 0.5312
c0140324 sys_write 38 0.5938
c01425cc __block_prepare_write 39 0.0422
c012c0f4 generic_file_aio_write_nolock 89 0.0362
c0108b54 system_call 406 9.2273
00000000 total 944 0.0004
On Sat, 2003-03-22 at 17:58, Andrew Morton wrote:
> I've always been a bit skeptical about rwlocks - if you're holding the lock
> for long enough for a significant amount of reader concurrency, you're
> holding it for too long. eg: tasklist_lock.
I totally agree with you, Andrew: on modern SMP systems rwlocks
are basically worthless. I think we should kill them off and
convert all instances to spinlocks or some better primitive (perhaps
a more generalized big-reader lock; Roman Zippel had something...)
--
David S. Miller <[email protected]>
> One thing is noteworthy: ia32's read_unlock() is buslocked, whereas
> spin_unlock() is not.
Don't forget bitops/atomics used as spinlocks, like the rmap pte chain lock.
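For reference, a bit-based lock of that kind looks roughly like this (a
sketch modelled on the 2.5 pte_chain lock, not the exact code): both the
acquire and the release are atomic read-modify-writes on page->flags,
because the other flag bits must not be disturbed.

static inline void sketch_pte_chain_lock(struct page *page)
{
	/* spin with an atomic bitop until it returns "was clear" */
	while (test_and_set_bit(PG_chainlock, &page->flags))
		cpu_relax();
}

static inline void sketch_pte_chain_unlock(struct page *page)
{
	/* the unlock is also an atomic op, unlike spin_unlock() */
	smp_mb__before_clear_bit();
	clear_bit(PG_chainlock, &page->flags);
}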
Anton
Anton Blanchard <[email protected]> wrote:
>
>
> > One thing is noteworthy: ia32's read_unlock() is buslocked, whereas
> > spin_unlock() is not.
>
> Don't forget bitops/atomics used as spinlocks, like the rmap pte chain lock.
>
Did you end up deciding it was worthwhile putting a spinlock in the ppc64
pageframe for that? Or hashing for it.
> Did you end up deciding it was worthwhile putting a spinlock in the ppc64
> pageframe for that? Or hashing for it.
Here's an old email I dug up. It was running an SDET-like thing.
Anton
From [email protected] Fri Jul 26 06:32:52 2002
Date: Fri, 26 Jul 2002 06:32:52 +1000
From: Anton Blanchard <[email protected]>
To: Andrew Morton <[email protected]>
Cc: Rik van Riel <[email protected]>,
William Lee Irwin III <[email protected]>,
"Martin J. Bligh" <[email protected]>
Subject: Re: [[email protected]: raw data] (fwd)
> > 26274 .page_remove_rmap
> > 19059 .page_add_rmap
> > 13378 .save_remaining_regs
> > 11126 .page_cache_release
> > 10031 .clear_user_page
> > 9616 .copy_page
> > 9539 .do_page_fault
> > 8876 .pSeries_insert_hpte
> > 8488 .pSeries_flush_hash_range
> > 8106 .find_get_page
> > 7496 .lru_cache_add
> > 6789 .copy_page_range
> > 6177 .zap_pte_range
>
> Holy cow.
Here are the results when I embedded a spinlock in the page struct.
19624 .page_remove_rmap
15682 .save_remaining_regs
13835 .page_cache_release
12827 .zap_pte_range
12624 .do_page_fault
12390 .copy_page_range
10908 .pSeries_insert_hpte
10492 .pSeries_flush_hash_range
9219 .find_get_page
8879 .page_add_rmap
7482 .copy_page
6672 .kmem_cache_free
5672 .kmem_cache_alloc
page_add_rmap has dropped an awful lot. I need to check that it's
actually the change from bitop spinlock to regular spinlock and not the
increase in page struct size (which might reduce cacheline sharing
between adjacent mem_map entries).
As expected, the difference between page_add_rmap and page_remove_rmap
looks to be the linked-list walk. From the profile below, lock
acquisition is about 15% and list walking is about 40% of the function.
The two things that seem to help us here are:
1. Half as many atomic operations. The lock drop is a simple store
instead of the atomic sequence needed to avoid changing the other bits in
the flags word.
2. Not completing the atomic sequence when we know we aren't going to get
the lock. The spinlock backs off to plain loads when someone else has the
lock, avoiding the costly store with reservation. Perhaps I should make
our bitops do a similar check so we do a load with reservation and
jump straight out if the bit is set. I'm guessing you can't do the same on
Intel.
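In plain C, the pattern in point 2 (and the cheap unlock in point 1)
looks roughly like this - a minimal test-and-test-and-set sketch, not the
actual ppc64 or generic kernel spinlock:

#include <stdatomic.h>

typedef struct { atomic_int locked; } tts_lock_t;

static void tts_lock(tts_lock_t *l)
{
	for (;;) {
		/* back off to ordinary loads while the lock is held:
		   no reserved/bus-locked stores are issued here */
		while (atomic_load_explicit(&l->locked, memory_order_relaxed))
			;
		/* one atomic attempt once the lock looks free */
		if (!atomic_exchange_explicit(&l->locked, 1,
					      memory_order_acquire))
			return;
	}
}

static void tts_unlock(tts_lock_t *l)
{
	/* the drop is a simple store plus a release barrier */
	atomic_store_explicit(&l->locked, 0, memory_order_release);
}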
Anton
00094864 .page_remove_rmap 19784
0.3 c000000000094864 std r30,-16(r1)
0.2 c000000000094868 mflr r0
0.0 c00000000009486c ld r30,-18960(r2)
0.0 c000000000094870 std r0,16(r1)
0.0 c000000000094874 subfic r0,r3,0
0.3 c000000000094878 adde r11,r0,r3
0.0 c00000000009487c subfic r10,r4,0
0.0 c000000000094880 adde r0,r10,r4
0.0 c000000000094884 std r27,-40(r1)
0.3 c000000000094888 or r0,r0,r11
0.0 c00000000009488c li r27,0
0.0 c000000000094890 std r28,-32(r1)
0.0 c000000000094894 cmpwi r0,0
0.3 c000000000094898 mr r28,r4
0.0 c00000000009489c std r29,-24(r1)
0.0 c0000000000948a0 mr r29,r3
0.0 c0000000000948a4 std r31,-8(r1)
0.3 c0000000000948a8 std r26,-48(r1)
0.0 c0000000000948ac stdu r1,-160(r1)
0.0 c0000000000948b0 lbz r9,40(r3)
0.5 c0000000000948b4 ld r8,-32752(r30)
0.0 c0000000000948b8 rldicr r9,r9,3,60
0.0 c0000000000948bc ld r10,-32744(r30)
0.0 c0000000000948c0 ldx r11,r9,r8
0.9 c0000000000948c4 ld r0,4616(r11)
0.0 c0000000000948c8 ld r9,4624(r11)
0.0 c0000000000948cc subf r0,r0,r3
0.0 c0000000000948d0 sradi r0,r0,3
1.1 c0000000000948d4 rldicl r9,r9,52,12
0.0 c0000000000948d8 mulld r0,r0,r10
0.0 c0000000000948dc add r31,r0,r9
0.0 c0000000000948e0 bne- c000000000094a34
1.6 c0000000000948e4 ld r9,-32768(r30)
0.0 c0000000000948e8 ld r0,0(r9)
0.0 c0000000000948ec cmpld r31,r0
0.0 c0000000000948f0 bge- c00000000009497c
1.1 c0000000000948f4 ld r0,40(r29)
0.0 c0000000000948f8 addi r31,r29,40
0.0 c0000000000948fc rldicl r0,r0,53,63
0.1 c000000000094900 cmpwi r0,0
0.0 c000000000094904 bne- c00000000009497c
0.7 c000000000094908 addi r11,r29,80
0.0 c00000000009490c li r9,1
0.0 c000000000094910 b c000000000094920
3.4 c000000000094914 lwzx r10,r0,r11
0.0 c000000000094918 cmpwi r10,0
0.0 c00000000009491c bne- c000000000094914
0.9 c000000000094920 lwarx r10,r0,r11
3.8 c000000000094924 cmpwi r10,0
0.0 c000000000094928 bne+ c000000000094914
0.5 c00000000009492c stwcx. r9,r0,r11
14.9 c000000000094930 bne+ c000000000094920
0.8 c000000000094934 isync
4.7 c000000000094938 ld r0,40(r29)
6.8 c00000000009493c rldicl r0,r0,48,63
0.2 c000000000094940 cmpwi r0,0
0.0 c000000000094944 beq- c0000000000949c0
0.1 c000000000094948 ld r0,64(r29)
0.0 c00000000009494c cmpd r0,r28
0.0 c000000000094950 beq- c0000000000949a4
2.4 c000000000094954 lhz r10,24(r13)
0.0 c000000000094958 ld r0,-32760(r30)
0.0 c00000000009495c rldicr r10,r10,7,56
0.1 c000000000094960 add r10,r10,r0
0.2 c000000000094964 ld r11,48(r10)
0.0 c000000000094968 addi r11,r11,-1
0.0 c00000000009496c std r11,48(r10)
0.3 c000000000094970 eieio
0.3 c000000000094974 li r0,0
0.0 c000000000094978 stw r0,80(r29)
0.0 c00000000009497c addi r1,r1,160
0.0 c000000000094980 ld r0,16(r1)
0.4 c000000000094984 ld r26,-48(r1)
0.2 c000000000094988 mtlr r0
0.0 c00000000009498c ld r27,-40(r1)
0.0 c000000000094990 ld r28,-32(r1)
0.0 c000000000094994 ld r29,-24(r1)
0.2 c000000000094998 ld r30,-16(r1)
0.0 c00000000009499c ld r31,-8(r1)
0.0 c0000000000949a0 blr
0.1 c0000000000949a4 std r27,64(r29)
0.0 c0000000000949a8 lis r0,1
0.0 c0000000000949ac ldarx r9,r0,r31
1.2 c0000000000949b0 andc r9,r9,r0
0.1 c0000000000949b4 stdcx. r9,r0,r31
1.4 c0000000000949b8 bne+ c0000000000949ac
0.1 c0000000000949bc b c000000000094954
0.8 c0000000000949c0 ld r3,64(r29)
0.0 c0000000000949c4 cmpdi r3,0
0.0 c0000000000949c8 beq+ c000000000094954
0.7 c0000000000949cc lis r26,1
1.4 c0000000000949d0 ld r0,8(r3)
0.0 c0000000000949d4 cmpd r0,r28
0.0 c0000000000949d8 beq- c0000000000949f0
38.4 c0000000000949dc mr r27,r3
0.1 c0000000000949e0 ld r3,0(r3)
0.0 c0000000000949e4 cmpdi r3,0
0.0 c0000000000949e8 bne+ c0000000000949d0
0.0 c0000000000949ec b c000000000094954
5.4 c0000000000949f0 mr r4,r27
0.0 c0000000000949f4 mr r5,r29
0.0 c0000000000949f8 bl c000000000094f14 <.pte_chain_free>
0.1 c0000000000949fc ld r3,64(r29)
0.1 c000000000094a00 ld r0,0(r3)
0.0 c000000000094a04 cmpdi r0,0
0.0 c000000000094a08 bne+ c000000000094954
0.5 c000000000094a0c ld r0,8(r3)
0.0 c000000000094a10 std r0,64(r29)
0.0 c000000000094a14 ldarx r9,r0,r31
0.2 c000000000094a18 or r9,r9,r26
0.0 c000000000094a1c stdcx. r9,r0,r31
1.3 c000000000094a20 bne+ c000000000094a14
0.1 c000000000094a24 li r4,0
0.0 c000000000094a28 li r5,0
0.0 c000000000094a2c bl c000000000094f14 <.pte_chain_free>
0.0 c000000000094a30 b c000000000094954
0.0 c000000000094a34 ld r3,-32736(r30)
0.0 c000000000094a38 li r5,172
0.0 c000000000094a3c ld r4,-32728(r30)
0.0 c000000000094a40 bl c000000000059620 <.printk>
0.0 c000000000094a44 nop
0.0 c000000000094a48 li r3,0
0.0 c000000000094a4c bl c0000000001ce80c <.xmon>
0.0 c000000000094a50 nop
0.0 c000000000094a54 b c0000000000948e4
000946f8 .page_add_rmap 9001
0.7 c0000000000946f8 std r30,-16(r1)
0.4 c0000000000946fc mflr r0
0.0 c000000000094700 ld r30,-18960(r2)
0.0 c000000000094704 std r28,-32(r1)
0.0 c000000000094708 mr r28,r4
0.5 c00000000009470c ld r11,-32768(r30)
0.1 c000000000094710 std r29,-24(r1)
0.0 c000000000094714 addi r29,r3,40
0.0 c000000000094718 std r31,-8(r1)
0.6 c00000000009471c mr r31,r3
0.0 c000000000094720 std r0,16(r1)
0.0 c000000000094724 stdu r1,-144(r1)
0.4 c000000000094728 ld r0,0(r4)
0.1 c00000000009472c ld r9,0(r11)
0.0 c000000000094730 rldicl r0,r0,48,16
0.0 c000000000094734 cmpld r0,r9
0.0 c000000000094738 bge- c0000000000947d8
10.3 c00000000009473c ld r0,40(r3)
0.0 c000000000094740 addi r9,r3,80
0.0 c000000000094744 li r11,1
0.0 c000000000094748 rldicl r0,r0,53,63
1.2 c00000000009474c cmpwi r0,0
0.0 c000000000094750 bne- c0000000000947d8
0.7 c000000000094754 b c000000000094764
4.1 c000000000094758 lwzx r0,r0,r9
0.0 c00000000009475c cmpwi r0,0
2.8 c000000000094760 bne- c000000000094758
0.6 c000000000094764 lwarx r0,r0,r9
8.9 c000000000094768 cmpwi r0,0
0.0 c00000000009476c bne+ c000000000094758
0.9 c000000000094770 stwcx. r11,r0,r9
24.6 c000000000094774 bne+ c000000000094764
1.3 c000000000094778 isync
9.7 c00000000009477c ld r9,40(r3)
10.0 c000000000094780 rldicl r9,r9,48,63
0.0 c000000000094784 cmpwi r9,0
0.0 c000000000094788 bne- c000000000094810
2.3 c00000000009478c ld r0,64(r31)
0.0 c000000000094790 lis r9,1
0.0 c000000000094794 cmpdi r0,0
0.0 c000000000094798 bne- c0000000000947f8
0.3 c00000000009479c std r28,64(r31)
0.1 c0000000000947a0 ldarx r0,r0,r29
2.8 c0000000000947a4 or r0,r0,r9
0.1 c0000000000947a8 stdcx. r0,r0,r29
2.5 c0000000000947ac bne+ c0000000000947a0
1.1 c0000000000947b0 eieio
0.9 c0000000000947b4 li r0,0
0.0 c0000000000947b8 stw r0,80(r31)
0.0 c0000000000947bc lhz r9,24(r13)
0.2 c0000000000947c0 ld r0,-32760(r30)
0.6 c0000000000947c4 rldicr r9,r9,7,56
0.0 c0000000000947c8 add r9,r9,r0
0.0 c0000000000947cc ld r10,48(r9)
0.4 c0000000000947d0 addi r10,r10,1
2.3 c0000000000947d4 std r10,48(r9)
0.0 c0000000000947d8 addi r1,r1,144
0.0 c0000000000947dc ld r0,16(r1)
0.3 c0000000000947e0 ld r28,-32(r1)
0.7 c0000000000947e4 mtlr r0
0.0 c0000000000947e8 ld r29,-24(r1)
0.0 c0000000000947ec ld r30,-16(r1)
0.0 c0000000000947f0 ld r31,-8(r1)
0.0 c0000000000947f4 blr
1.2 c0000000000947f8 bl c000000000094f98 <.pte_chain_alloc>
0.4 c0000000000947fc std r28,8(r3)
1.0 c000000000094800 ld r0,64(r31)
0.0 c000000000094804 std r0,0(r3)
0.0 c000000000094808 std r3,64(r31)
0.0 c00000000009480c b c0000000000947b0
0.4 c000000000094810 bl c000000000094f98 <.pte_chain_alloc>
0.1 c000000000094814 li r0,0
0.0 c000000000094818 ld r11,64(r31)
0.0 c00000000009481c lis r9,1
0.2 c000000000094820 std r0,0(r3)
0.1 c000000000094824 std r11,8(r3)
0.0 c000000000094828 std r3,64(r31)
0.2 c00000000009482c ldarx r0,r0,r29
0.4 c000000000094830 andc r0,r0,r9
0.0 c000000000094834 stdcx. r0,r0,r29
3.3 c000000000094838 bne+ c00000000009482c
0.3 c00000000009483c b c00000000009478c
On 2003-03-23, Andrew Morton wrote:
>I've been looking at the CPU cost of the write() system call. Time how long
>it takes to write a million bytes to an ext2 file, via a million
>one-byte-writes:
[...]
>One thing is noteworthy: ia32's read_unlock() is buslocked, whereas
>spin_unlock() is not. So let's see what happens if we convert file_lock
>from an rwlock to a spinlock:
>
>
>2.5.65-mm4, SMP:
> 0.34s user 2.00s system 100% cpu 2.329 total
>
>That's a 5% speedup.
>
>
>And if we were to convert file->f_count to be a nonatomic "int", protected by
>files_lock it would probably speed things up further.
>
>I've always been a bit skeptical about rwlocks - if you're holding the lock
>for long enough for a significant amount of reader concurrency, you're
>holding it for too long. eg: tasklist_lock.
	A million one-byte writes is probably not the use case we want
to optimize for at the expense of other cases that are more common or
more performance critical.
I'm not saying that you're necessarily wrong. I haven't
looked into typical rwlock usage enough to have a well-founded opinion
about it, and it's obvious that you have done some detailed research
on at least this specific case. I am saying that I think I'd have to
see essentially no negative effects on workloads that we care more
about before being convinced that converting rwlocks to spinlocks is a
good trade-off either globally or in specific cases.
	Also, what is optimal on a one-CPU system running an SMP
kernel may be different from what is optimal on a bigger machine. So,
if it turns out that you're right in the 1-2 CPU case and wrong in the
64-CPU NUMA case, then perhaps we want some maybe_rwlock primitive
that is a spinlock if a CONFIG_RWLOCK_EXPENSIVE flag is set and an
rwlock if it is not set, for code that does not *rely* on the ability
to have more than one owner of a read lock. I want to emphasize that
I'm not suggesting that anyone implement this complexity unless this
prediction turns out to be true. I'm just speculating on one
potential scenario.
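For what it's worth, such a primitive could be little more than a
config-dependent wrapper. This is purely speculative and all the names
here (maybe_rwlock_t, the maybe_* macros, CONFIG_RWLOCK_EXPENSIVE) are
hypothetical:

/* Speculative sketch of the maybe_rwlock idea above; only code that
 * does not depend on multiple concurrent readers could use it. */
#include <linux/spinlock.h>

#ifdef CONFIG_RWLOCK_EXPENSIVE
typedef spinlock_t maybe_rwlock_t;
#define maybe_read_lock(l)	spin_lock(l)
#define maybe_read_unlock(l)	spin_unlock(l)
#define maybe_write_lock(l)	spin_lock(l)
#define maybe_write_unlock(l)	spin_unlock(l)
#else
typedef rwlock_t maybe_rwlock_t;
#define maybe_read_lock(l)	read_lock(l)
#define maybe_read_unlock(l)	read_unlock(l)
#define maybe_write_lock(l)	write_lock(l)
#define maybe_write_unlock(l)	write_unlock(l)
#endif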
	Speaking of scaling up, I think it might be useful to have a
version of rw_semaphore and perhaps of rwlock that used per_cpu memory
(probably through dcounter) so that calls to read_{up,down} without any
intervening calls to write_{up,down} would cause no inter-CPU cache
consistency traffic. (This idea was inspired by Roman Zippel's posting
of an implementation of module usage counters along these lines.)
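A bare-bones sketch of that shape, in userspace C11 for brevity
(illustrative only - this is neither Roman Zippel's code nor a real
kernel primitive, and it ignores fairness, nesting and preemption):

#include <stdatomic.h>

#define NR_SLOTS 8	/* hypothetical: one slot per CPU */

struct percpu_rwlock {
	struct {
		atomic_int readers;
		char pad[60];	/* keep each count on its own cache line */
	} slot[NR_SLOTS];
	atomic_int writer;
};

static void read_down(struct percpu_rwlock *l, int cpu)
{
	for (;;) {
		atomic_fetch_add(&l->slot[cpu].readers, 1);
		if (!atomic_load(&l->writer))
			return;		/* fast path: local memory only */
		/* a writer is active: back out and wait */
		atomic_fetch_sub(&l->slot[cpu].readers, 1);
		while (atomic_load(&l->writer))
			;
	}
}

static void read_up(struct percpu_rwlock *l, int cpu)
{
	atomic_fetch_sub(&l->slot[cpu].readers, 1);
}

static void write_down(struct percpu_rwlock *l)
{
	while (atomic_exchange(&l->writer, 1))
		;				/* exclude other writers */
	for (int i = 0; i < NR_SLOTS; i++)	/* wait for readers to drain */
		while (atomic_load(&l->slot[i].readers))
			;
}

static void write_up(struct percpu_rwlock *l)
{
	atomic_store(&l->writer, 0);
}

With no writer around, the shared "writer" word is only ever read, so it
stays clean in every CPU's cache and readers generate no coherency
traffic; write_down() pays for that by having to sweep every slot.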
Adam J. Richter __ ______________ 575 Oroville Road
[email protected] \ / Milpitas, California 95035
+1 408 309-6081 | g g d r a s i l United States of America
"Free Software For The Rest Of Us."
Hi Andrew,
I would like to point out that SMP capacity can't be used by a single process under Linux.
When you run 'time dd if=/dev/zero of=foo bs=1 count=1M', only one processor's capacity will be used, since your commands are executed in ONE process.
This differs from FreeBSD, where there is a processor STATE _per_ process, so the kernel can 'load balance' the current syscall execution.
To 'SMP-ize' your program, you have to create X processes (X corresponding to the number of processors on the machine) and issue your syscalls from those processes.
Here is a small sample from my test:
[DEBUG] at 3e7da90d - Scheduler.c: Initialized.
[DEBUG] at 3e7da90d - Interrupt.c: Entrering in Linux process management.
[INFOS] at 3e7da90d - LinuxProcessor.c: There is 4 proccessor(s) on this machine.
[INFOS] at 3e7da90d - LinuxProcessor.c: Initialize processor #0 (Intel(R) Xeon(TM) CPU 2.40GHz)
[INFOS] at 3e7da90d - LinuxProcessor.c: Initialize processor #1 (Intel(R) Xeon(TM) CPU 2.40GHz)
[INFOS] at 3e7da90d - LinuxProcessor.c: Initialize processor #2 (Intel(R) Xeon(TM) CPU 2.40GHz)
[INFOS] at 3e7da90d - LinuxProcessor.c: Initialize processor #3 (Intel(R) Xeon(TM) CPU 2.40GHz)
[INFOS] at 3e7da90d - LinuxProcessor.c: Waiting processor(s) initialization...
[INFOS] at 3e7da97b - LinuxProcessor.c: Processor #2 initialized. (pid=9771)
[INFOS] at 3e7da97c - LinuxProcessor.c: Processor #3 initialized. (pid=9772)
[INFOS] at 3e7da97d - LinuxProcessor.c: Processor #0 initialized. (pid=9769)
[INFOS] at 3e7da97e - LinuxProcessor.c: Processor #1 initialized. (pid=9770)
[INFOS] at 3e7da915 - LinuxProcessor.c: All processor(s) is ready.
[DEBUG] at 3e7da915 - LinuxInterrupt.c: Interrupt execution trip average : 24.220 ms
[DEBUG] at 3e7da916 - LinuxInterrupt.c: Interrupt execution trip average : 1.200 ms
[DEBUG] at 3e7da916 - LinuxInterrupt.c: Interrupt execution trip average : 1.240 ms
[DEBUG] at 3e7da917 - LinuxInterrupt.c: Interrupt execution trip average : 1.200 ms
[DEBUG] at 3e7da917 - LinuxInterrupt.c: Interrupt execution trip average : 1.240 ms
[DEBUG] at 3e7da918 - LinuxInterrupt.c: Interrupt execution trip average : 1.360 ms
[DEBUG] at 3e7da918 - LinuxInterrupt.c: Interrupt execution trip average : 1.300 ms
[DEBUG] at 3e7da919 - LinuxInterrupt.c: Interrupt execution trip average : 1.440 ms
[DEBUG] at 3e7da919 - LinuxInterrupt.c: Interrupt execution trip average : 1.360 ms
[DEBUG] at 3e7da91a - LinuxInterrupt.c: Interrupt execution trip average : 1.260 ms
[DEBUG] at 3e7da91a - LinuxInterrupt.c: Interrupt execution trip average : 1.500 ms
[DEBUG] at 3e7da91b - LinuxInterrupt.c: Interrupt execution trip average : 1.340 ms
[DEBUG] at 3e7da91b - LinuxInterrupt.c: Interrupt execution trip average : 1.460 ms
[DEBUG] at 3e7da91c - LinuxInterrupt.c: Interrupt execution trip average : 1.460 ms
[DEBUG] at 3e7da91c - LinuxInterrupt.c: Interrupt execution trip average : 1.340 ms
[DEBUG] at 3e7da91d - LinuxInterrupt.c: Interrupt execution trip average : 1.300 ms
[DEBUG] at 3e7da91e - LinuxInterrupt.c: Interrupt execution trip average : 1.420 ms
[DEBUG] at 3e7da91e - LinuxInterrupt.c: Interrupt execution trip average : 1.560 ms
[DEBUG] at 3e7da91f - LinuxInterrupt.c: Interrupt execution trip average : 1.360 ms
[DEBUG] at 3e7da91f - LinuxInterrupt.c: Interrupt execution trip average : 1.360 ms
Sorry for my bad English :/
Best regards,
Michael
On Sun, 2003-03-23 at 12:33, Michael Vergoz wrote:
> Hi Andrew,
>
> I would like to point out that SMP capacity can't be used by a single process under Linux.
>
> When you run 'time dd if=/dev/zero of=foo bs=1 count=1M', only one processor's capacity will
> be used, since your commands are executed in ONE process.
Your dd is benchmarking the lock operations in the C library, I suspect.
The kernel will happily use both processors, and a given syscall can
even start on one CPU and complete on another, or have the IRQ tasks
executed on its behalf on another CPU.
There are *good* reasons, btw, for avoiding splitting stuff too far: the
cost of copying data between processor caches is very high.
On Sat, Mar 22, 2003 at 05:58:16PM -0800, Andrew Morton wrote:
>
> I've been looking at the CPU cost of the write() system call. Time how long
> it takes to write a million bytes to an ext2 file, via a million
> one-byte-writes:
Are you using a sysenter-capable C library?
Aaron Lehmann <[email protected]> wrote:
>
> On Sat, Mar 22, 2003 at 05:58:16PM -0800, Andrew Morton wrote:
> >
> > I've been looking at the CPU cost of the write() system call. Time how long
> > it takes to write a million bytes to an ext2 file, via a million
> > one-byte-writes:
>
> Are you using a sysenter-capable C library?
No. That would certainly help the numbers.
But it is unrelated to the lock overhead problem.