2004-09-16 11:33:20

by Denis Vlasenko

Subject: top hogs CPU in 2.6: kallsyms_lookup is very slow

Hi,

I see 2.6.9-rc2 being slower than 2.4.27-rc3
on a Pentium 66, 80 MB RAM.

Specifically:
top on idle machine sucks ~40% CPU while in 2.4
it takes only ~6%

I recompiled 2.6 with HZ=100 and with slab debugging
off. This helped a bit (wget was slow too),
but top still hogs CPU.

I did 'strace -T -tt top b n 1' under both kernels,
postprocessed it a bit:

cat 24topbn1.strace | sed 's/(.* </ </' | sed 's/^[^ ]* //' >24
cat 26topbn1.strace | sed 's/(.* </ </' | sed 's/^[^ ]* //' >26
exec 24<24
exec 26<26
while true; do
read -r l24 <&24 || exit
read -r l26 <&26 || exit
printf "%-30s %-30s\n" "$l24" "$l26"
done

and got the following result. Note that many syscalls
are taking 50-100% more time to execute under 2.6.
Why?

uname <0.000142> uname <0.000217>
brk <0.000176> brk <0.000174>
open <0.000218> open <0.000335>
open <0.000248> open <0.000335>
fstat64 <0.000104> fstat64 <0.000191>
old_mmap <0.000129> old_mmap <0.000233>
close <0.000096> close <0.000177>
open <0.000148> open <0.000234>
stat64 <0.000139> stat64 <0.000221>
open <0.000237> open <0.000308>
read <0.000181> read <0.000257>
fstat64 <0.000106> fstat64 <0.000260>
old_mmap <0.000138> old_mmap <0.000210>
mprotect <0.000140> mprotect <0.000226>
old_mmap <0.000169> old_mmap <0.000281>
old_mmap <0.000162> old_mmap <0.000242>
close <0.000097> close <0.000157>
open <0.000214> open <0.000288>
read <0.000174> read <0.000260>
fstat64 <0.000106> fstat64 <0.000166>
old_mmap <0.000122> old_mmap <0.000209>
old_mmap <0.000133> old_mmap <0.000221>
mprotect <0.000137> mprotect <0.000224>
old_mmap <0.000177> old_mmap <0.000280>
old_mmap <0.000166> old_mmap <0.000237>
close <0.000097> close <0.000158>
open <0.000252> open <0.000323>
read <0.000172> read <0.000253>
fstat64 <0.000107> fstat64 <0.000166>
old_mmap <0.000140> old_mmap <0.000222>
mprotect <0.000134> mprotect <0.000221>
old_mmap <0.000171> old_mmap <0.000282>
old_mmap <0.000164> old_mmap <0.000242>
close <0.000099> close <0.000156>
open <0.000310> open <0.000384>
read <0.000175> read <0.000252>
fstat64 <0.000106> fstat64 <0.000168>
old_mmap <0.000141> old_mmap <0.000223>
mprotect <0.000136> mprotect <0.000316>
old_mmap <0.000174> old_mmap <0.000257>
close <0.000098> close <0.000157>
munmap <0.000237> munmap <0.000332>
uname <0.000335> uname <0.000414>
brk <0.000097> brk <0.000155>
brk <0.000124> brk <0.000275>
brk <0.000111> open <0.000345>
...
(here straces start to diverge a bit and go out of sync.
unabridged straces and .configs are at
http://195.66.192.168/linux/slowtop/)

This slowdown worries me a bit, but it alone cannot explain
the 6% -> 40% CPU usage regression.

I ran kerneltop while running 'top b n 1' on a 2.6
in an endless loop:

...
Sampling_step: 4 | Address range: 0xc0100240 - 0xc045780e
address function ..... 2004-09-16/13:41:30 ...... ticks
c0117b30 __might_sleep 1
c012fd00 kallsyms_lookup 201
c0134c40 buffered_rmqueue 1
c013d560 zap_pte_range 1
c013e2e0 do_wp_page 1
c013ea50 do_anonymous_page 14
c014ab80 dentry_open 1
c014c260 get_empty_filp 1
c0160230 dput 1
c0245a60 __copy_to_user_ll 22
00000000 total 244
Sampling_step: 4 | Address range: 0xc0100240 - 0xc045780e
address function ..... 2004-09-16/13:41:32 ...... ticks
c0104070 default_idle 1
c0111e50 do_page_fault 1
c012fd00 kallsyms_lookup 98
c013d560 zap_pte_range 1
c013e2e0 do_wp_page 1
c013ea50 do_anonymous_page 12
c0244860 vsnprintf 1
c0245a60 __copy_to_user_ll 14
00000000 total 129
Sampling_step: 4 | Address range: 0xc0100240 - 0xc045780e
address function ..... 2004-09-16/13:41:36 ...... ticks
c0104af0 get_wchan 1
c0111e50 do_page_fault 1
c0117b30 __might_sleep 1
c011a0b0 acquire_console_sem 1
c012fd00 kallsyms_lookup 192
c0134b20 free_hot_cold_page 1
c0134c40 buffered_rmqueue 2
c013a620 __pagevec_lru_add_active 1
c013e2e0 do_wp_page 2
c013ea50 do_anonymous_page 18
c013ee50 handle_mm_fault 1
c014c570 file_kill 1
c02445f0 number 2
c0245a60 __copy_to_user_ll 18
c027a860 opost_block 1
c02816d0 conv_uni_to_pc 1
c0287100 do_con_write 1
00000000 total 245
...

kallsyms_lookup visibly stands out. What's that?

2.4 does not even have kallsyms_lookup;
a kerneltop run looks like this in 2.4:

Sampling_step: 4 | Address range: 0xc0105000 - 0xc032f491
address function ..... 2004-09-16/14:15:50 ...... ticks
c0106e04 default_idle 54
c0122744 do_anonymous_page 10
c03257c0 __generic_copy_to_user 10
c014dc3c statm_pgd_range 4
c013a7f4 link_path_walk 2
c0142938 d_alloc 2
c014c914 proc_pid_lookup 2
c0325ea4 number 2
c01221bc do_wp_page 1
c012286c do_no_page 1
c012bcac rmqueue 1
c013264c get_empty_filp 1
c01327bc fput 1
c013a3a8 getname 1
c013a450 vfs_permission 1
c013a550 permission 1
c01433bc alloc_inode 1
00000000 total 102
--
vda


2004-09-16 11:51:39

by William Lee Irwin III

Subject: Re: top hogs CPU in 2.6: kallsyms_lookup is very slow

On Thu, Sep 16, 2004 at 02:28:26PM +0300, Denis Vlasenko wrote:
> I see 2.6.9-rc2 being slower than 2.4.27-rc3
> on a Pentium 66, 80 MB RAM.
> Specifically:
> top on idle machine sucks ~40% CPU while in 2.4
> it takes only ~6%
> I recompiled 2.6 with HZ=100 and with slab debugging
> off. This helped a bit (wget was slow too),
> but top still hogs CPU.
> I did 'strace -T -tt top b n 1' under both kernels,
> postprocessed it a bit:

The following patches in -mm are likely to help top(1).

kallsyms-data-size-reduction--lookup-speedup.patch
kallsyms data size reduction / lookup speedup

inconsistent-kallsyms-fix.patch
Inconsistent kallsyms fix

kallsyms-correct-type-char-in-proc-kallsyms.patch
kallsyms: correct type char in /proc/kallsyms

kallsyms-fix-sparc-gibberish.patch
kallsyms: fix sparc gibberish

As for all syscalls/etc. being slower by 50%-100%, I suggest toning
down HZ (we desperately need to go tickless) and seeing if it persists.
Also please check that time isn't twice as fast as it should be in 2.6.


-- wli

2004-09-16 12:00:06

by Denis Vlasenko

Subject: Re: top hogs CPU in 2.6: kallsyms_lookup is very slow

> > Specifically:
> > top on idle machine sucks ~40% CPU while in 2.4
> > it takes only ~6%
> > I recompiled 2.6 with HZ=100 and with slab debugging
> > off. This helped a bit (wget was slow too),
> > but top still hogs CPU.
> > I did 'strace -T -tt top b n 1' under both kernels,
> > postprocessed it a bit:
>
> The following patches in -mm are likely to help top(1).
>
> kallsyms-data-size-reduction--lookup-speedup.patch
> kallsyms data size reduction / lookup speedup
>
> inconsistent-kallsyms-fix.patch
> Inconsistent kallsyms fix
>
> kallsyms-correct-type-char-in-proc-kallsyms.patch
> kallsyms: correct type char in /proc/kallsyms
>
> kallsyms-fix-sparc-gibberish.patch
> kallsyms: fix sparc gibberish

Thanks.

> As for all syscalls/etc. being slower by 50%-100%, I suggest toning

s/all/many/:

uname <0.000142>               uname <0.000217>   ~53% slower
brk <0.000176>                 brk <0.000174>     no change
open <0.000218>                open <0.000335>    ~54% slower
fstat64 <0.000104>             fstat64 <0.000191> ~84% slower

or maybe strace simply isn't very accurate and adds significant
noise to the measured delta?

> down HZ (we desperately need to go tickless) and seeing if it persists.
> Also please check that time isn't twice as fast as it should be in 2.6.

I recompiled 2.6 with HZ=100. It's not it.
Time is running normally too.
--
vda

2004-09-16 12:19:38

by William Lee Irwin III

Subject: Re: top hogs CPU in 2.6: kallsyms_lookup is very slow

At some point in the past, I wrote:
>> As for all syscalls/etc. being slower by 50%-100%, I suggest toning

On Thu, Sep 16, 2004 at 02:57:08PM +0300, Denis Vlasenko wrote:
> s/all/many/:
> uname <0.000142>               uname <0.000217>   ~53% slower
> brk <0.000176>                 brk <0.000174>     no change
> open <0.000218>                open <0.000335>    ~54% slower
> fstat64 <0.000104>             fstat64 <0.000191> ~84% slower
> or maybe strace simply isn't very accurate and adds significant
> noise to the measured delta?

Could you try to estimate the resolution of whatever timer strace uses?

At some point in the past, I wrote:
>> down HZ (we desperately need to go tickless) and seeing if it persists.
>> Also please check that time isn't twice as fast as it should be in 2.6.

On Thu, Sep 16, 2004 at 02:57:08PM +0300, Denis Vlasenko wrote:
> I recompiled 2.6 with HZ=100. It's not it.
> Time is running normally too.

Did the kallsyms patches reduce the cpu overhead from get_wchan()? I take
this to mean reducing HZ to 100 did not alleviate the syscall problems?
How do microbenchmarks fare, e.g. lmbench?


-- wli

2004-09-16 12:35:09

by Denis Vlasenko

Subject: Re: top hogs CPU in 2.6: kallsyms_lookup is very slow

On Thursday 16 September 2004 15:17, William Lee Irwin III wrote:
> At some point in the past, I wrote:
> >> As for all syscalls/etc. being slower by 50%-100%, I suggest toning
>
> On Thu, Sep 16, 2004 at 02:57:08PM +0300, Denis Vlasenko wrote:
> > s/all/many/:
> > uname <0.000142>               uname <0.000217>   ~53% slower
> > brk <0.000176>                 brk <0.000174>     no change
> > open <0.000218>                open <0.000335>    ~54% slower
> > fstat64 <0.000104>             fstat64 <0.000191> ~84% slower
> > or maybe strace simply isn't very accurate and adds significant
> > noise to the measured delta?
>
> Could you try to estimate the resolution of whatever timer strace uses?
>
> At some point in the past, I wrote:
> >> down HZ (we desperately need to go tickless) and seeing if it persists.
> >> Also please check that time isn't twice as fast as it should be in 2.6.
>
> On Thu, Sep 16, 2004 at 02:57:08PM +0300, Denis Vlasenko wrote:
> > I recompiled 2.6 with HZ=100. It's not it.
> > Time is running normally too.
>
> Did the kallsyms patches reduce the cpu overhead from get_wchan()? I take

I am applying them right now...

> this to mean reducing HZ to 100 did not alleviate the syscall problems?

Correct, it didn't help.

> How do microbenchmarks fare, e.g. lmbench?

I did not run them (yet).
--
vda

2004-09-17 10:48:05

by Denis Vlasenko

Subject: Re: top hogs CPU in 2.6: kallsyms_lookup is very slow

On Thursday 16 September 2004 15:17, William Lee Irwin III wrote:
> At some point in the past, I wrote:
> >> As for all syscalls/etc. being slower by 50%-100%, I suggest toning
>
> On Thu, Sep 16, 2004 at 02:57:08PM +0300, Denis Vlasenko wrote:
> > s/all/many/:
> > uname <0.000142>               uname <0.000217>   ~53% slower
> > brk <0.000176>                 brk <0.000174>     no change
> > open <0.000218>                open <0.000335>    ~54% slower
> > fstat64 <0.000104>             fstat64 <0.000191> ~84% slower
> > or maybe strace simply isn't very accurate and adds significant
> > noise to the measured delta?
>
> Could you try to estimate the resolution of whatever timer strace uses?
>
> At some point in the past, I wrote:
> >> down HZ (we desperately need to go tickless) and seeing if it persists.
> >> Also please check that time isn't twice as fast as it should be in 2.6.
>
> On Thu, Sep 16, 2004 at 02:57:08PM +0300, Denis Vlasenko wrote:
> > I recompiled 2.6 with HZ=100. It's not it.
> > Time is running normally too.
>
> Did the kallsyms patches reduce the cpu overhead from get_wchan()? I take
> this to mean reducing HZ to 100 did not alleviate the syscall problems?
> How do microbenchmarks fare, e.g. lmbench?

Not an lmbench run, but:

#include <fcntl.h>
#include <unistd.h>

int main(void)
{
	int fd;
	int i = 100000;

	while (i--) {
		fd = open("/etc/ld.so.cache", O_RDONLY);
		close(fd);
	}
	return 0;
}

2.4:
# time ./openclose
real 0m7.455s
user 0m0.300s
sys 0m7.150s

2.6:
# time ./openclose
real 0m8.170s
user 0m0.370s
sys 0m7.800s

2.6 is at HZ=100 here. /etc is on ramfs.
configs are in attached tarball.
--
vda


Attachments:
c.tar.bz2 (9.88 kB)

2004-09-17 10:52:03

by Denis Vlasenko

Subject: Re: top hogs CPU in 2.6: kallsyms_lookup is very slow

On Thursday 16 September 2004 15:17, William Lee Irwin III wrote:
> At some point in the past, I wrote:
> >> As for all syscalls/etc. being slower by 50%-100%, I suggest toning
>
> On Thu, Sep 16, 2004 at 02:57:08PM +0300, Denis Vlasenko wrote:
> > s/all/many/:
> > uname <0.000142>               uname <0.000217>   ~53% slower
> > brk <0.000176>                 brk <0.000174>     no change
> > open <0.000218>                open <0.000335>    ~54% slower
> > fstat64 <0.000104>             fstat64 <0.000191> ~84% slower
> > or maybe strace simply isn't very accurate and adds significant
> > noise to the measured delta?
>
> Could you try to estimate the resolution of whatever timer strace uses?
>
> At some point in the past, I wrote:
> >> down HZ (we desperately need to go tickless) and seeing if it persists.
> >> Also please check that time isn't twice as fast as it should be in 2.6.
>
> On Thu, Sep 16, 2004 at 02:57:08PM +0300, Denis Vlasenko wrote:
> > I recompiled 2.6 with HZ=100. It's not it.
> > Time is running normally too.
>
> Did the kallsyms patches reduce the cpu overhead from get_wchan()?

Yes. top does not hog CPU anymore. It even takes a little bit *less*
CPU than in 2.4 now.
--
vda

2004-09-17 10:56:42

by William Lee Irwin III

Subject: Re: top hogs CPU in 2.6: kallsyms_lookup is very slow

On Thursday 16 September 2004 15:17, William Lee Irwin III wrote:
>> Did the kallsyms patches reduce the cpu overhead from get_wchan()?

On Fri, Sep 17, 2004 at 12:01:15PM +0300, Denis Vlasenko wrote:
> Yes. top does not hog CPU anymore. It even takes a little bit *less*
> CPU than in 2.4 now.

Brilliant! Profile comparisons (merely booting the same kernels with
profile=2 or whatever shift you deem acceptable given that little RAM),
if feasible, would also be appreciated.

Thanks.


-- wli

2004-09-17 11:05:04

by William Lee Irwin III

Subject: Re: top hogs CPU in 2.6: kallsyms_lookup is very slow

On Thursday 16 September 2004 15:17, William Lee Irwin III wrote:
>> How do microbenchmarks fare, e.g. lmbench?

On Fri, Sep 17, 2004 at 11:57:24AM +0300, Denis Vlasenko wrote:
> Not an lmbench run, but:
[...]
> 2.4:
> # time ./openclose
> real 0m7.455s
> user 0m0.300s
> sys 0m7.150s
> 2.6:
> # time ./openclose
> real 0m8.170s
> user 0m0.370s
> sys 0m7.800s
> 2.6 is at HZ=100 here. /etc is on ramfs.
> configs are in attached tarball.

To address this in a meaningful way, we're going to have to get some
profiling data. The built-in kernel profiler should suffice, though you
may want to run the test for a longer, fixed period of time (I
recommend making the test run as long as 60s and recording the number
of operations completed). Also, please snapshot the profile state with
readprofile(1) immediately before and after the microbenchmark runs on
both kernels. This should only require booting into the kernels you've
already built with an additional commandline parameter, e.g. profile=2.

The number after the = sign is the shift used for the granularity of
accounting. With a shift of 0, the profile buffer will contain one
atomic_t (4B on ia32, e.g. your box) for every byte of kernel text.
Adding 1 to the shift halves the space needed for the profile buffer,
albeit with a concomitant decrease in the accuracy of its accounting.
profile=2 should be equivalent to the space required for a second copy
of the kernel text.

Thanks.


-- wli

2004-09-17 11:22:44

by Denis Vlasenko

Subject: Re: top hogs CPU in 2.6: kallsyms_lookup is very slow

> On Thu, Sep 16, 2004 at 02:57:08PM +0300, Denis Vlasenko wrote:
> > I recompiled 2.6 with HZ=100. It's not it.
> > Time is running normally too.
>
> Did the kallsyms patches reduce the cpu overhead from get_wchan()? I take
> this to mean reducing HZ to 100 did not alleviate the syscall problems?
> How do microbenchmarks fare, e.g. lmbench?

2x3 lmbench runs. HZ=100, configs attached.

i586-pc-linux-gnu: no data for Socket bandwidth using localhost
i586-pc-linux-gnu: no data for Socket bandwidth using localhost
i586-pc-linux-gnu: no data for Socket bandwidth using localhost
i586-pc-linux-gnu: no data for Socket bandwidth using localhost
i586-pc-linux-gnu: no data for Socket bandwidth using localhost
i586-pc-linux-gnu: no data for Socket bandwidth using localhost

L M B E N C H 3 . 0 S U M M A R Y
------------------------------------
(Alpha software, do not distribute)

Basic system parameters
------------------------------------------------------------------------------
Host OS Description Mhz tlb cache mem scal
pages line par load
bytes
--------- ------------- ----------------------- ---- ----- ----- ------ ----
hunter Linux 2.6.9-r i586-pc-linux-gnu 67 32 1
hunter Linux 2.6.9-r i586-pc-linux-gnu 67 32 1
hunter Linux 2.6.9-r i586-pc-linux-gnu 67 32 1
hunter Linux 2.4.27- i586-pc-linux-gnu 67 32 1
hunter Linux 2.4.27- i586-pc-linux-gnu 67 32 1
hunter Linux 2.4.27- i586-pc-linux-gnu 67 32 1

Processor, Processes - times in microseconds - smaller is better
------------------------------------------------------------------------------
Host OS Mhz null null open slct sig sig fork exec sh
call I/O stat clos TCP inst hndl proc proc proc
--------- ------------- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ----
hunter Linux 2.6.9-r 67 2.27 5.25 106. 129. 12.7 64.3 4312 31.K 86.K
hunter Linux 2.6.9-r 67 2.27 4.77 100. 122. 13.3 62.2 4338 31.K 85.K
hunter Linux 2.6.9-r 67 2.27 5.48 107. 129. 14.3 62.5 4420 31.K 86.K
hunter Linux 2.4.27- 67 2.26 4.98 3073 3110 9.98 23.0 4200 77.K 240K
hunter Linux 2.4.27- 67 2.26 4.88 3034 3028 12.3 23.0 3852 78.K 237K
hunter Linux 2.4.27- 67 2.32 5.15 3009 3192 10.0 23.7 4047 78.K 238K

Context switching - times in microseconds - smaller is better
-------------------------------------------------------------------------
Host OS 2p/0K 2p/16K 2p/64K 8p/16K 8p/64K 16p/16K 16p/64K
ctxsw ctxsw ctxsw ctxsw ctxsw ctxsw ctxsw
--------- ------------- ------ ------ ------ ------ ------ ------- -------
hunter Linux 2.6.9-r 31.9 64.0 120.8 115.7 322.2 136.0 356.4
hunter Linux 2.6.9-r 29.1 47.2 76.7 102.3 321.5 136.0 354.2
hunter Linux 2.6.9-r 29.3 56.0 144.5 101.9 305.5 145.3 351.0
hunter Linux 2.4.27- 19.8 39.4 190.3 77.8 368.0 104.1 401.5
hunter Linux 2.4.27- 19.7 30.9 105.0 87.2 316.9 107.2 359.9
hunter Linux 2.4.27- 19.5 34.6 95.5 74.3 279.1 109.5 325.0

*Local* Communication latencies in microseconds - smaller is better
---------------------------------------------------------------------
Host OS 2p/0K Pipe AF UDP RPC/ TCP RPC/ TCP
ctxsw UNIX UDP TCP conn
--------- ------------- ----- ----- ---- ----- ----- ----- ----- ----
hunter Linux 2.6.9-r 31.9 129.8 230.
hunter Linux 2.6.9-r 29.1 130.1 244.
hunter Linux 2.6.9-r 29.3 136.9 233.
hunter Linux 2.4.27- 19.8 74.0 146.
hunter Linux 2.4.27- 19.7 74.4 134.
hunter Linux 2.4.27- 19.5 77.8 137.

File & VM system latencies in microseconds - smaller is better
-------------------------------------------------------------------------------
Host OS 0K File 10K File Mmap Prot Page 100fd
Create Delete Create Delete Latency Fault Fault selct
--------- ------------- ------ ------ ------ ------ ------- ----- ------- -----
hunter Linux 2.6.9-r 837.5 1272.3 2421.3 1751.3 1289.0 1.145 18.9 129.6
hunter Linux 2.6.9-r 862.8 1234.6 2463.1 1683.5 1270.0 3.732 18.4 128.1
hunter Linux 2.6.9-r 805.2 1122.3 2283.1 1550.4 1157.0 3.415 18.3 128.0
hunter Linux 2.4.27- 4219.4 4807.7 5586.6 4444.4 910.0 10.3 14.6 164.1
hunter Linux 2.4.27- 3759.4 4065.0 5586.6 4444.4 870.0 10.3 15.8 163.6
hunter Linux 2.4.27- 4048.6 4830.9 5714.3 4761.9 985.0 9.553 16.7 163.1

*Local* Communication bandwidths in MB/s - bigger is better
-----------------------------------------------------------------------------
Host OS Pipe AF TCP File Mmap Bcopy Bcopy Mem Mem
UNIX reread reread (libc) (hand) read write
--------- ------------- ---- ---- ---- ------ ------ ------ ------ ---- -----
hunter Linux 2.6.9-r 13.1 11.1 16.8 54.3 18.4 18.5 54.2 26.1
hunter Linux 2.6.9-r 12.7 12.1 16.4 54.3 18.5 18.5 54.3 26.2
hunter Linux 2.6.9-r 13.0 11.2 18.3 54.3 18.5 18.5 54.3 26.1
hunter Linux 2.4.27- 15.7 11.9 17.6 54.2 18.6 18.6 54.4 26.0
hunter Linux 2.4.27- 15.7 10.7 18.3 54.2 18.4 18.5 54.4 26.1
hunter Linux 2.4.27- 15.5 10.9 17.8 54.3 18.6 18.5 54.4 26.1
--
vda


Attachments:
c.tar.bz2 (9.88 kB)

2004-09-17 11:31:40

by Paulo Marques

Subject: Re: top hogs CPU in 2.6: kallsyms_lookup is very slow

Denis Vlasenko wrote:
> On Thursday 16 September 2004 15:17, William Lee Irwin III wrote:
>>Did the kallsyms patches reduce the cpu overhead from get_wchan()?
>
>
> Yes. top does not hog CPU anymore. It even takes a little bit *less*
> CPU than in 2.4 now.

Great!

I was the one who wrote those patches, so I'm glad to know that they
actually made a difference in real world workloads (like using "top").

Reading /proc/kallsyms should be a lot faster too, although there is no
comparison with the 2.4 kernel, because there was no kallsyms there. It
can be compared with previous 2.6 kernels, though.

--
Paulo Marques - http://www.grupopie.com

To err is human, but to really foul things up requires a computer.
Farmers' Almanac, 1978

2004-09-17 12:12:28

by William Lee Irwin III

Subject: Re: top hogs CPU in 2.6: kallsyms_lookup is very slow

On Thu, Sep 16, 2004 at 02:57:08PM +0300, Denis Vlasenko wrote:
>> Did the kallsyms patches reduce the cpu overhead from get_wchan()? I take
>> this to mean reducing HZ to 100 did not alleviate the syscall problems?
>> How do microbenchmarks fare, e.g. lmbench?

On Fri, Sep 17, 2004 at 02:21:15PM +0300, Denis Vlasenko wrote:
> 2x3 lmbench runs. HZ=100, configs attached.
> Context switching - times in microseconds - smaller is better
> -------------------------------------------------------------------------
> Host OS 2p/0K 2p/16K 2p/64K 8p/16K 8p/64K 16p/16K 16p/64K
> ctxsw ctxsw ctxsw ctxsw ctxsw ctxsw ctxsw
> --------- ------------- ------ ------ ------ ------ ------ ------- -------
> hunter Linux 2.6.9-r 31.9 64.0 120.8 115.7 322.2 136.0 356.4
> hunter Linux 2.6.9-r 29.1 47.2 76.7 102.3 321.5 136.0 354.2
> hunter Linux 2.6.9-r 29.3 56.0 144.5 101.9 305.5 145.3 351.0
> hunter Linux 2.4.27- 19.8 39.4 190.3 77.8 368.0 104.1 401.5
> hunter Linux 2.4.27- 19.7 30.9 105.0 87.2 316.9 107.2 359.9
> hunter Linux 2.4.27- 19.5 34.6 95.5 74.3 279.1 109.5 325.0

Bizarre; context switching latency is actually one of the 2.6
scheduler's strong points AFAIK. This generally needs NMI's to
instrument as the critical sections here have interrupts disabled.
Someone more knowledgeable regarding the i8259A PIC may have ideas about
how to go about arranging for NMI-based profiling on a system such as
yours. I presume this is a Pentium "Classic", not Pentium Pro.

Alan, any ideas?


On Fri, Sep 17, 2004 at 02:21:15PM +0300, Denis Vlasenko wrote:
> *Local* Communication latencies in microseconds - smaller is better
> ---------------------------------------------------------------------
> Host OS 2p/0K Pipe AF UDP RPC/ TCP RPC/ TCP
> ctxsw UNIX UDP TCP conn
> --------- ------------- ----- ----- ---- ----- ----- ----- ----- ----
> hunter Linux 2.6.9-r 31.9 129.8 230.
> hunter Linux 2.6.9-r 29.1 130.1 244.
> hunter Linux 2.6.9-r 29.3 136.9 233.
> hunter Linux 2.4.27- 19.8 74.0 146.
> hunter Linux 2.4.27- 19.7 74.4 134.
> hunter Linux 2.4.27- 19.5 77.8 137.

Degradation #2: pipe and AF_UNIX latencies. This can likely be profiled
without NMI's.


On Fri, Sep 17, 2004 at 02:21:15PM +0300, Denis Vlasenko wrote:
> File & VM system latencies in microseconds - smaller is better
[...]

2.6 looks clean here, which directly contradicts your prior results.
Hmm.


On Fri, Sep 17, 2004 at 02:21:15PM +0300, Denis Vlasenko wrote:
> *Local* Communication bandwidths in MB/s - bigger is better
> -----------------------------------------------------------------------------
> Host OS Pipe AF TCP File Mmap Bcopy Bcopy Mem Mem
> UNIX reread reread (libc) (hand) read write
> --------- ------------- ---- ---- ---- ------ ------ ------ ------ ---- -----
> hunter Linux 2.6.9-r 13.1 11.1 16.8 54.3 18.4 18.5 54.2 26.1
> hunter Linux 2.6.9-r 12.7 12.1 16.4 54.3 18.5 18.5 54.3 26.2
> hunter Linux 2.6.9-r 13.0 11.2 18.3 54.3 18.5 18.5 54.3 26.1
> hunter Linux 2.4.27- 15.7 11.9 17.6 54.2 18.6 18.6 54.4 26.0
> hunter Linux 2.4.27- 15.7 10.7 18.3 54.2 18.4 18.5 54.4 26.1
> hunter Linux 2.4.27- 15.5 10.9 17.8 54.3 18.6 18.5 54.4 26.1

So pipes also degraded slightly wrt. bandwidth, and AF_UNIX bandwidth is
slightly superior. This can likely also be profiled without NMI's.

One thing we're going to have to do to instrument your specific
regressions is to run the tests for whatever you see degradations for
in isolation. None of this should require building new kernels, except
for whatever we (later) devise to instrument your context switching
latencies.

There is also a question of how the context switching benchmark was
implemented. If it was via pipes, then degradations in pipe performance
will affect it. So we should likely ask Larry to comment on how that
part of his benchmark was implemented; if it is indeed via pipes, then
we should ignore the scheduler until fs/pipe.c has been addressed.
There are likely other ways to measure context switching latency (well,
sched_yield() probably won't DTRT on 2.6.x unless they're RT tasks with
a policy of SCHED_RR, but that's one alternative).


-- wli

2004-09-17 12:36:31

by Denis Vlasenko

Subject: Re: top hogs CPU in 2.6: kallsyms_lookup is very slow

On Friday 17 September 2004 14:03, William Lee Irwin III wrote:
> On Thursday 16 September 2004 15:17, William Lee Irwin III wrote:
> >> How do microbenchmarks fare, e.g. lmbench?
>
> On Fri, Sep 17, 2004 at 11:57:24AM +0300, Denis Vlasenko wrote:
> > Not an lmbench run, but:
>
> [...]
>
> > 2.4:
> > # time ./openclose
> > real 0m7.455s
> > user 0m0.300s
> > sys 0m7.150s
> > 2.6:
> > # time ./openclose
> > real 0m8.170s
> > user 0m0.370s
> > sys 0m7.800s
> > 2.6 is at HZ=100 here. /etc is on ramfs.
> > configs are in attached tarball.
>
> To address this in a meaningful way, we're going to have to get some
> profiling data. The built-in kernel profiler should suffice, though you
> may want to run the test for a longer, fixed period of time (I
> recommend making the test run as long as 60s and recording the number
> of operations completed). Also, please snapshot the profile state with
> readprofile(1) immediately before and after the microbenchmark runs on
> both kernels. This should only require booting into the kernels you've
> already built with an additional commandline parameter, e.g. profile=2.

I have profile=2. Profiles were collected with this script:

./openclose100000
readprofile -m System.map -r
./openclose100000
./openclose100000
./openclose100000
./openclose100000
readprofile -m System.map | sort -r

2.4:

2994 total 0.0013
620 link_path_walk 0.2892
285 d_lookup 1.2076
156 dput 0.4815
118 kmem_cache_free 0.7564
109 system_call 1.9464
106 kmem_cache_alloc 0.5096
103 strncpy_from_user 1.2875
97 open_namei 0.0767
92 fput 0.3710
88 get_unused_fd 0.2529
85 vfs_permission 0.3320
79 page_follow_link 0.2264
79 __constant_c_and_count_memset 0.6583
78 sys_close 0.9286
72 permission 1.5000
72 path_init 0.3000
67 lookup_mnt 0.8816
64 cached_lookup 0.8000
54 get_empty_filp 0.2045
54 fd_install 1.1250
45 page_getlink 0.2679
44 sys_open 0.3438
35 getname 0.2083
31 read_cache_page 0.1020
30 path_release 0.5769
30 dnotify_flush 0.2778
30 dentry_open 0.0708
27 filp_open 0.3553
26 __free_pages 0.7222
25 __find_get_page 0.5682
24 update_atime 0.2727
23 mark_page_accessed 0.4423
22 file_move 0.4583
19 path_lookup 0.4318
19 filp_close 0.2065
12 ret_from_sys_call 0.7059
12 locks_remove_posix 0.0390
12 locks_remove_flock 0.1364
11 __generic_copy_to_user 0.1146
8 write_profile 0.1538
6 path_walk 0.2143
6 find_trylock_page 0.0600
4 do_truncate 0.0312
3 fs_may_remount_ro 0.0417
3 do_wp_page 0.0044
3 clear_user 0.0312
3 check_mnt 0.0577
1 inet_rtm_newrule 0.0017
1 fib_lookup 0.0045
1 copy_page_range 0.0023

2.6:

3200 total 0.0013
790 link_path_walk 0.2759
287 __d_lookup 1.2756
277 get_empty_filp 1.4503
146 strncpy_from_user 1.8250
110 dput 0.3254
109 system_call 2.4773
98 find_trylock_page 2.0000
92 do_lookup 0.8070
69 path_lookup 0.2594
68 get_unused_fd 0.4172
61 dentry_open 0.1709
60 may_open 0.1339
55 follow_mount 0.3873
54 sys_close 0.6506
54 __fput 0.2784
53 kmem_cache_free 1.0000
53 kmem_cache_alloc 1.0192
50 read_cache_page 0.0978
45 open_namei 0.0323
42 locks_remove_flock 0.2642
41 find_next_zero_bit 0.2847
41 filp_open 0.5775
40 vfs_permission 0.1465
35 sys_open 0.3333
34 getname 0.2125
34 filp_close 0.3333
29 dnotify_flush 0.2762
28 page_getlink 0.1931
28 file_ra_state_init 1.1667
28 fd_install 0.6364
26 page_put_link 0.1733
25 file_move 0.6098
23 update_atime 0.1361
23 syscall_exit 2.0909
23 lookup_mnt 0.2706
18 locks_remove_posix 0.0732
17 fs_may_remount_ro 0.1910
15 permission 0.2273
14 path_release 0.2500
12 page_follow_link_light 0.3333
10 file_kill 0.3704
9 zap_pte_range 0.0171
9 write_profile 0.1765
9 eventpoll_init_file 0.3462
8 fput 0.3200
7 mark_page_accessed 0.1346
6 eventpoll_release_file 0.0444
6 clear_user 0.1071
4 syscall_call 0.3636
4 __copy_to_user_ll 0.0101
3 do_wp_page 0.0043
3 detach_mnt 0.0400
2 read_cache_pages 0.0066
2 page_remove_rmap 0.0317
1 zap_pmd_range 0.0097
1 unmap_vmas 0.0024
1 pte_alloc_one 0.0154
1 process_backlog 0.0039
1 free_page_and_swap_cache 0.0109
1 free_hot_cold_page 0.0045
1 find_get_page 0.0323
1 dst_alloc 0.0074
1 do_no_page 0.0016
1 do_anonymous_page 0.0033
1 __up_write 0.0044

Re-sorted by 3rd column:

2.4:

109 system_call 1.9464
72 permission 1.5000
103 strncpy_from_user 1.2875
285 d_lookup 1.2076
54 fd_install 1.1250
78 sys_close 0.9286
67 lookup_mnt 0.8816
64 cached_lookup 0.8000
118 kmem_cache_free 0.7564
26 __free_pages 0.7222
12 ret_from_sys_call 0.7059
79 __constant_c_and_count_memset 0.6583
30 path_release 0.5769
25 __find_get_page 0.5682
106 kmem_cache_alloc 0.5096
156 dput 0.4815
22 file_move 0.4583
23 mark_page_accessed 0.4423
19 path_lookup 0.4318
92 fput 0.3710
27 filp_open 0.3553
44 sys_open 0.3438
85 vfs_permission 0.3320
72 path_init 0.3000
620 link_path_walk 0.2892
30 dnotify_flush 0.2778
24 update_atime 0.2727
45 page_getlink 0.2679
88 get_unused_fd 0.2529
79 page_follow_link 0.2264
6 path_walk 0.2143
35 getname 0.2083
19 filp_close 0.2065
54 get_empty_filp 0.2045
8 write_profile 0.1538
12 locks_remove_flock 0.1364
11 __generic_copy_to_user 0.1146
31 read_cache_page 0.1020
97 open_namei 0.0767
30 dentry_open 0.0708
6 find_trylock_page 0.0600
3 check_mnt 0.0577
3 fs_may_remount_ro 0.0417
12 locks_remove_posix 0.0390
4 do_truncate 0.0312
3 clear_user 0.0312
1 fib_lookup 0.0045
3 do_wp_page 0.0044
1 copy_page_range 0.0023
1 inet_rtm_newrule 0.0017
2994 total 0.0013

2.6:

109 system_call 2.4773
23 syscall_exit 2.0909
98 find_trylock_page 2.0000
146 strncpy_from_user 1.8250
277 get_empty_filp 1.4503
287 __d_lookup 1.2756
28 file_ra_state_init 1.1667
53 kmem_cache_alloc 1.0192
53 kmem_cache_free 1.0000
92 do_lookup 0.8070
54 sys_close 0.6506
28 fd_install 0.6364
25 file_move 0.6098
41 filp_open 0.5775
68 get_unused_fd 0.4172
55 follow_mount 0.3873
10 file_kill 0.3704
4 syscall_call 0.3636
9 eventpoll_init_file 0.3462
35 sys_open 0.3333
34 filp_close 0.3333
12 page_follow_link_light 0.3333
110 dput 0.3254
8 fput 0.3200
41 find_next_zero_bit 0.2847
54 __fput 0.2784
29 dnotify_flush 0.2762
790 link_path_walk 0.2759
23 lookup_mnt 0.2706
42 locks_remove_flock 0.2642
69 path_lookup 0.2594
14 path_release 0.2500
15 permission 0.2273
34 getname 0.2125
28 page_getlink 0.1931
17 fs_may_remount_ro 0.1910
9 write_profile 0.1765
26 page_put_link 0.1733
61 dentry_open 0.1709
40 vfs_permission 0.1465
23 update_atime 0.1361
7 mark_page_accessed 0.1346
60 may_open 0.1339
6 clear_user 0.1071
50 read_cache_page 0.0978
18 locks_remove_posix 0.0732
6 eventpoll_release_file 0.0444
3 detach_mnt 0.0400
45 open_namei 0.0323
1 find_get_page 0.0323
2 page_remove_rmap 0.0317
9 zap_pte_range 0.0171
1 pte_alloc_one 0.0154
1 free_page_and_swap_cache 0.0109
4 __copy_to_user_ll 0.0101
1 zap_pmd_range 0.0097
1 dst_alloc 0.0074
2 read_cache_pages 0.0066
1 free_hot_cold_page 0.0045
1 __up_write 0.0044
3 do_wp_page 0.0043
1 process_backlog 0.0039
1 do_anonymous_page 0.0033
1 unmap_vmas 0.0024
1 do_no_page 0.0016
3200 total 0.0013

--
vda

2004-09-17 13:04:52

by Denis Vlasenko

Subject: Re: top hogs CPU in 2.6: kallsyms_lookup is very slow

On Friday 17 September 2004 15:10, William Lee Irwin III wrote:
> On Thu, Sep 16, 2004 at 02:57:08PM +0300, Denis Vlasenko wrote:
> >> Did the kallsyms patches reduce the cpu overhead from get_wchan()? I take
> >> this to mean reducing HZ to 100 did not alleviate the syscall problems?
> >> How do microbenchmarks fare, e.g. lmbench?
>
> On Fri, Sep 17, 2004 at 02:21:15PM +0300, Denis Vlasenko wrote:
> > 2x3 lmbench runs. HZ=100, configs attached.
> > Context switching - times in microseconds - smaller is better
> > -------------------------------------------------------------------------
> > Host OS 2p/0K 2p/16K 2p/64K 8p/16K 8p/64K 16p/16K 16p/64K
> > ctxsw ctxsw ctxsw ctxsw ctxsw ctxsw ctxsw
> > --------- ------------- ------ ------ ------ ------ ------ ------- -------
> > hunter Linux 2.6.9-r 31.9 64.0 120.8 115.7 322.2 136.0 356.4
> > hunter Linux 2.6.9-r 29.1 47.2 76.7 102.3 321.5 136.0 354.2
> > hunter Linux 2.6.9-r 29.3 56.0 144.5 101.9 305.5 145.3 351.0
> > hunter Linux 2.4.27- 19.8 39.4 190.3 77.8 368.0 104.1 401.5
> > hunter Linux 2.4.27- 19.7 30.9 105.0 87.2 316.9 107.2 359.9
> > hunter Linux 2.4.27- 19.5 34.6 95.5 74.3 279.1 109.5 325.0
>
> Bizarre; context switching latency is actually one of the 2.6
> scheduler's strong points AFAIK. This generally needs NMI's to
> instrument as the critical sections here have interrupts disabled.
> Someone more knowledgeable regarding the i8259A PIC may have ideas about
> how to go about arranging for NMI-based profiling on a system such as
> yours. I presume this is a Pentium "Classic", not Pentium Pro.

Yes. There is no Pentium 66 MMX on this planet.

> Alan, any ideas?
>
>
> On Fri, Sep 17, 2004 at 02:21:15PM +0300, Denis Vlasenko wrote:
> > *Local* Communication latencies in microseconds - smaller is better
> > ---------------------------------------------------------------------
> > Host OS 2p/0K Pipe AF UDP RPC/ TCP RPC/ TCP
> > ctxsw UNIX UDP TCP conn
> > --------- ------------- ----- ----- ---- ----- ----- ----- ----- ----
> > hunter Linux 2.6.9-r 31.9 129.8 230.
> > hunter Linux 2.6.9-r 29.1 130.1 244.
> > hunter Linux 2.6.9-r 29.3 136.9 233.
> > hunter Linux 2.4.27- 19.8 74.0 146.
> > hunter Linux 2.4.27- 19.7 74.4 134.
> > hunter Linux 2.4.27- 19.5 77.8 137.
>
> Degradation #2: pipe and AF_UNIX latencies. This can likely be profiled
> without NMI's.

Is there an easy way to run only this one? I can profile that.
Larry?

--
vda

2004-09-17 13:33:56

by William Lee Irwin III

Subject: Re: top hogs CPU in 2.6: kallsyms_lookup is very slow

On Friday 17 September 2004 14:03, William Lee Irwin III wrote:
>> To address this in a meaningful way, we're going to have to get some
>> profiling data. The built-in kernel profiler should suffice, though you
>> may want to run the test for a longer, fixed period of time (I
>> recommend making the test run as long as 60s and recording the number
>> of operations completed). Also, please snapshot the profile state with
>> readprofile(1) immediately before and after the microbenchmark runs on
>> both kernels. This should only require booting into the kernels you've
>> already built with an additional commandline parameter, e.g. profile=2.

On Fri, Sep 17, 2004 at 03:34:59PM +0300, Denis Vlasenko wrote:
> I have profile=2. Profiles were collected with this script:
> ./openclose100000
> readprofile -m System.map -r
> ./openclose100000
> ./openclose100000
> ./openclose100000
> ./openclose100000
> readprofile -m System.map | sort -r

Excellent!


On Fri, Sep 17, 2004 at 03:34:59PM +0300, Denis Vlasenko wrote:
> 2.4:
> 2994 total 0.0013
> 620 link_path_walk 0.2892
> 285 d_lookup 1.2076
> 156 dput 0.4815
> 118 kmem_cache_free 0.7564
> 109 system_call 1.9464
> 106 kmem_cache_alloc 0.5096
> 103 strncpy_from_user 1.2875
[...]
> 2.6:
> 3200 total 0.0013
> 790 link_path_walk 0.2759
> 287 __d_lookup 1.2756
> 277 get_empty_filp 1.4503
> 146 strncpy_from_user 1.8250
> 110 dput 0.3254
> 109 system_call 2.4773

Very odd that get_empty_filp() and strncpy_from_user() should be so
high on your profiles. They're not tremendously different between 2.4
and 2.6. It may be the case that 2.6 makes some inappropriate choice of
algorithms for usercopying on systems of your vintage. get_empty_filp()
is more mysterious still.

What was the relative performance of 2.4 vs. 2.6? i.e. what percentage
of 2.4's operation count did 2.6 complete in the given time?

Additive differential profile follows. I wonder if certain symbols'
hits should be folded together for a proper comparison.

8.9688% __d_lookup
6.8526% get_empty_filp
3.9794% link_path_walk
2.8750% do_lookup
2.8621% find_trylock_page
1.8750% may_open
1.7188% follow_mount
1.6875% __fput
1.5216% path_lookup
1.2812% find_next_zero_bit
1.1223% strncpy_from_user
0.9117% locks_remove_flock
0.9042% dentry_open
0.8750% file_ra_state_init
0.8125% page_put_link
0.7188% syscall_exit
0.5271% read_cache_page
0.4310% fs_may_remount_ro
0.4279% filp_close
0.3794% filp_open
0.3750% page_follow_link_light
0.3125% file_kill
0.2812% zap_pte_range
0.2812% eventpoll_init_file
0.1875% eventpoll_release_file
0.1617% locks_remove_posix
0.1250% syscall_call
0.1250% __copy_to_user_ll
0.0938% detach_mnt
0.0873% clear_user
0.0625% read_cache_pages
0.0625% page_remove_rmap
0.0464% file_move
0.0312% zap_pmd_range
0.0312% unmap_vmas
0.0312% pte_alloc_one
0.0312% process_backlog
0.0312% free_page_and_swap_cache
0.0312% free_hot_cold_page
0.0312% find_get_page
0.0312% dst_alloc
0.0312% do_no_page
0.0312% do_anonymous_page
0.0312% __up_write
0.0140% write_profile
0.0000% ret_from_sys_call
0.0000% path_walk
0.0000% path_init
0.0000% page_follow_link
0.0000% inet_rtm_newrule
0.0000% fib_lookup
0.0000% do_truncate
0.0000% d_lookup
0.0000% copy_page_range
0.0000% check_mnt
0.0000% cached_lookup
0.0000% __generic_copy_to_user
0.0000% __free_pages
0.0000% __find_get_page
0.0000% __constant_c_and_count_memset
-0.0065% do_wp_page
-0.0334% inet_rtm_newrule
-0.0334% fib_lookup
-0.0334% copy_page_range
-0.0829% update_atime
-0.0958% dnotify_flush
-0.1002% check_mnt
-0.1065% getname
-0.1336% do_truncate
-0.2004% path_walk
-0.2344% system_call
-0.3674% __generic_copy_to_user
-0.3759% sys_open
-0.4008% ret_from_sys_call
-0.5495% mark_page_accessed
-0.5645% path_release
-0.6280% page_getlink
-0.8142% get_unused_fd
-0.8350% __find_get_page
-0.8684% __free_pages
-0.9177% sys_close
-0.9286% fd_install
-1.5191% lookup_mnt
-1.5890% vfs_permission
-1.7729% dput
-1.8336% open_namei
-1.8842% kmem_cache_alloc
-1.9361% permission
-2.1376% cached_lookup
-2.2850% kmem_cache_free
-2.4048% path_init
-2.6386% page_follow_link
-2.6386% __constant_c_and_count_memset
-2.8228% fput
-9.5190% d_lookup


-- wli

2004-09-17 13:57:41

by Larry McVoy

Subject: Re: top hogs CPU in 2.6: kallsyms_lookup is very slow

> Is there an easy way to run only this one? I can profile that.
> Larry?

Absolutely, when you are in the source directory you can see how all this
is run, it's just a shell script. Look in ../scripts/lmbench for lat_pipe
and run that.

In general, the code in LMbench is very very small and it's easy to tinker
with the system and add more tests. I've been very surprised over the
years that people haven't done so.

For what it is worth, we're looking at porting LMbench to Windows using
the Unix API layer we did in BitKeeper.
--
---
Larry McVoy lm at bitmover.com http://www.bitkeeper.com

2004-09-17 20:39:31

by Denis Vlasenko

Subject: Re: top hogs CPU in 2.6: kallsyms_lookup is very slow

> On Fri, Sep 17, 2004 at 03:34:59PM +0300, Denis Vlasenko wrote:
> > I have profile=2. Profiles were collected with this script:
> > ./openclose100000
> > readprofile -m System.map -r
> > ./openclose100000
> > ./openclose100000
> > ./openclose100000
> > ./openclose100000
> > readprofile -m System.map | sort -r
>
> Excellent!
>
> On Fri, Sep 17, 2004 at 03:34:59PM +0300, Denis Vlasenko wrote:
> > 2.4:
> > 2994 total 0.0013
> > 620 link_path_walk 0.2892
> > 285 d_lookup 1.2076
> > 156 dput 0.4815
> > 118 kmem_cache_free 0.7564
> > 109 system_call 1.9464
> > 106 kmem_cache_alloc 0.5096
> > 103 strncpy_from_user 1.2875
>
> [...]
>
> > 2.6:
> > 3200 total 0.0013
> > 790 link_path_walk 0.2759
> > 287 __d_lookup 1.2756
> > 277 get_empty_filp 1.4503
> > 146 strncpy_from_user 1.8250
> > 110 dput 0.3254
> > 109 system_call 2.4773
>
> Very odd that get_empty_filp() and strncpy_from_user() should be so
> high on your profiles. They're not tremendously different between 2.4
> and 2.6. It may be the case that 2.6 makes some inappropriate choice of
> algorithms for usercopying on systems of your vintage. get_empty_filp()
> is more mysterious still.
>
> What was the relative performance of 2.4 vs. 2.6? i.e. what percentage
> of 2.4's operation count did 2.6 complete in the given time?

The test merely did 4x100000 open()/close() syscalls.

2.4: 2994 total -> took ~29.9 secs
2.6: 3200 total -> 32.0 secs

There is no reason to believe that link_path_walk was called more
frequently in 2.6 than in 2.4. So, it must be taking more time to execute.

These are the diffs (2.4.27-pre3, not -rc3, because I don't have rc3
handy at home):

strncpy_from_user:

--- linux-2.4.27-pre3/arch/i386/lib/usercopy.c Fri Jun 13 17:51:29 2003
+++ linux-2.6.9-rc2.src/arch/i386/lib/usercopy.c Mon Sep 13 22:33:12 2004
@@ -70,6 +31,7 @@ __generic_copy_from_user(void *to, const
#define __do_strncpy_from_user(dst,src,count,res) \
do { \
int __d0, __d1, __d2; \
+ might_sleep(); \
__asm__ __volatile__( \
" testl %1,%1\n" \
" jz 2f\n" \
@@ -116,7 +78,7 @@ do { \
* and returns @count.
*/
long
-__strncpy_from_user(char *dst, const char *src, long count)
+__strncpy_from_user(char *dst, const char __user *src, long count)
{
long res;
__do_strncpy_from_user(dst, src, count, res);
@@ -142,7 +104,7 @@ __strncpy_from_user(char *dst, const cha
* and returns @count.
*/
long
-strncpy_from_user(char *dst, const char *src, long count)
+strncpy_from_user(char *dst, const char __user *src, long count)
{
long res = -EFAULT;
if (access_ok(VERIFY_READ, src, 1))
@@ -158,6 +120,7 @@ strncpy_from_user(char *dst, const char
#define __do_clear_user(addr,size) \
do { \
int __d0; \
+ might_sleep(); \
__asm__ __volatile__( \
"0: rep; stosl\n" \
" movl %2,%0\n" \

Aha! Maybe it's just a might_sleep() overhead?


link_path_walk:

--- linux-2.4.27-pre3/fs/namei.c Mon Aug 25 14:44:43 2003
+++ linux-2.6.9-rc2.src/fs/namei.c Mon Sep 13 22:33:30 2004
@@ -447,20 +654,20 @@ static inline void follow_dotdot(struct
*
* We expect 'base' to be positive and a directory.
*/
-int link_path_walk(const char * name, struct nameidata *nd)
+int fastcall link_path_walk(const char * name, struct nameidata *nd)
{
- struct dentry *dentry;
+ struct path next;
struct inode *inode;
int err;
unsigned int lookup_flags = nd->flags;
-
+
while (*name=='/')
name++;
if (!*name)
goto return_reval;

inode = nd->dentry->d_inode;
- if (current->link_count)
+ if (nd->depth)
lookup_flags = LOOKUP_FOLLOW;

/* At this point we know we have a real path component. */
@@ -469,8 +676,10 @@ int link_path_walk(const char * name, st
struct qstr this;
unsigned int c;

- err = permission(inode, MAY_EXEC);
- dentry = ERR_PTR(err);
+ err = exec_permission_lite(inode, nd);
+ if (err == -EAGAIN) {
+ err = permission(inode, MAY_EXEC, nd);
+ }
if (err)
break;

@@ -504,7 +713,7 @@ int link_path_walk(const char * name, st
case 2:
if (this.name[1] != '.')
break;
- follow_dotdot(nd);
+ follow_dotdot(&nd->mnt, &nd->dentry);
inode = nd->dentry->d_inode;
/* fallthrough */
case 1:
@@ -519,20 +728,16 @@ int link_path_walk(const char * name, st
if (err < 0)
break;
}
+ nd->flags |= LOOKUP_CONTINUE;
/* This does the actual lookups.. */
- dentry = cached_lookup(nd->dentry, &this, LOOKUP_CONTINUE);
- if (!dentry) {
- dentry = real_lookup(nd->dentry, &this, LOOKUP_CONTINUE);
- err = PTR_ERR(dentry);
- if (IS_ERR(dentry))
- break;
- }
+ err = do_lookup(nd, &this, &next);
+ if (err)
+ break;
/* Check mountpoints.. */
- while (d_mountpoint(dentry) && __follow_down(&nd->mnt, &dentry))
- ;
+ follow_mount(&next.mnt, &next.dentry);

err = -ENOENT;
- inode = dentry->d_inode;
+ inode = next.dentry->d_inode;
if (!inode)
goto out_dput;
err = -ENOTDIR;
@@ -540,8 +745,10 @@ int link_path_walk(const char * name, st
goto out_dput;

if (inode->i_op->follow_link) {
- err = do_follow_link(dentry, nd);
- dput(dentry);
+ mntget(next.mnt);
+ err = do_follow_link(next.dentry, nd);
+ dput(next.dentry);
+ mntput(next.mnt);
if (err)
goto return_err;
err = -ENOENT;
@@ -553,7 +760,8 @@ int link_path_walk(const char * name, st
break;
} else {
dput(nd->dentry);
- nd->dentry = dentry;
+ nd->mnt = next.mnt;
+ nd->dentry = next.dentry;
}
err = -ENOTDIR;
if (!inode->i_op->lookup)
@@ -564,6 +772,7 @@ int link_path_walk(const char * name, st
last_with_slashes:
lookup_flags |= LOOKUP_FOLLOW | LOOKUP_DIRECTORY;
last_component:
+ nd->flags &= ~LOOKUP_CONTINUE;
if (lookup_flags & LOOKUP_PARENT)
goto lookup_parent;
if (this.name[0] == '.') switch (this.len) {
@@ -572,7 +781,7 @@ last_component:
case 2:
if (this.name[1] != '.')
break;
- follow_dotdot(nd);
+ follow_dotdot(&nd->mnt, &nd->dentry);
inode = nd->dentry->d_inode;
/* fallthrough */
case 1:
@@ -583,41 +792,34 @@ last_component:
if (err < 0)
break;
}
- dentry = cached_lookup(nd->dentry, &this, 0);
- if (!dentry) {
- dentry = real_lookup(nd->dentry, &this, 0);
- err = PTR_ERR(dentry);
- if (IS_ERR(dentry))
- break;
- }
- while (d_mountpoint(dentry) && __follow_down(&nd->mnt, &dentry))
- ;
- inode = dentry->d_inode;
+ err = do_lookup(nd, &this, &next);
+ if (err)
+ break;
+ follow_mount(&next.mnt, &next.dentry);
+ inode = next.dentry->d_inode;
if ((lookup_flags & LOOKUP_FOLLOW)
&& inode && inode->i_op && inode->i_op->follow_link) {
- err = do_follow_link(dentry, nd);
- dput(dentry);
+ mntget(next.mnt);
+ err = do_follow_link(next.dentry, nd);
+ dput(next.dentry);
+ mntput(next.mnt);
if (err)
goto return_err;
inode = nd->dentry->d_inode;
} else {
dput(nd->dentry);
- nd->dentry = dentry;
+ nd->mnt = next.mnt;
+ nd->dentry = next.dentry;
}
err = -ENOENT;
if (!inode)
- goto no_inode;
+ break;
if (lookup_flags & LOOKUP_DIRECTORY) {
err = -ENOTDIR;
if (!inode->i_op || !inode->i_op->lookup)
break;
}
goto return_base;
-no_inode:
- err = -ENOENT;
- if (lookup_flags & (LOOKUP_POSITIVE|LOOKUP_DIRECTORY))
- break;
- goto return_base;
lookup_parent:
nd->last = this;
nd->last_type = LAST_NORM;
@@ -632,20 +834,19 @@ lookup_parent:
return_reval:
/*
* We bypassed the ordinary revalidation routines.
- * Check the cached dentry for staleness.
+ * We may need to check the cached dentry for staleness.
*/
- dentry = nd->dentry;
- if (dentry && dentry->d_op && dentry->d_op->d_revalidate) {
+ if (nd->dentry && nd->dentry->d_sb &&
+ (nd->dentry->d_sb->s_type->fs_flags & FS_REVAL_DOT)) {
err = -ESTALE;
- if (!dentry->d_op->d_revalidate(dentry, 0)) {
- d_invalidate(dentry);
+ /* Note: we do not d_invalidate() */
+ if (!nd->dentry->d_op->d_revalidate(nd->dentry, nd))
break;
- }
}
return_base:
return 0;
out_dput:
- dput(dentry);
+ dput(next.dentry);
break;
}
path_release(nd);
@@ -653,7 +854,7 @@ return_err:
return err;
}


Nothing obvious stands out...
CC'ing Al, maybe he can spot something.
--
vda

2004-09-17 21:33:16

by William Lee Irwin III

Subject: Re: top hogs CPU in 2.6: kallsyms_lookup is very slow

On Fri, Sep 17, 2004 at 11:36:25PM +0300, Denis Vlasenko wrote:
> Aha! Maybe it's just a might_sleep() overhead?

How does 2.6 do with those might_sleep() calls removed?


-- wli