2002-09-26 07:52:54

by Andrew Morton

Subject: 2.5.38-mm3


url: http://www.zip.com.au/~akpm/linux/patches/2.5/2.5.38/2.5.38-mm3/

Includes a SARD update from Rick. The SARD disk accounting is
pretty much final now.

I moved the remaining disk accounting numbers (pgpgin, pgpgout) out of
/proc/stat, and this will confuse vmstat. Again. Updated versions
are at http://surriel.com/procps, but they're not up to date enough.

To get a current procps, grab the cygnus CVS (instructions are at
Rik's site) and then apply
http://www.zip.com.au/~akpm/linux/patches/2.5/2.5.38/2.5.38-mm3/vmstat.patch
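
For reference, the fallback logic a vmstat-like tool needs is roughly
this (a minimal sketch, assuming the field names these kernels use; it
is not the actual procps patch):

#include <stdio.h>

static int read_paging(unsigned long *in, unsigned long *out)
{
	char line[256];
	FILE *f;
	int got = 0;

	f = fopen("/proc/vmstat", "r");	/* new home of pgpgin/pgpgout */
	if (f) {
		while (fgets(line, sizeof(line), f)) {
			if (sscanf(line, "pgpgin %lu", in) == 1)
				got |= 1;
			else if (sscanf(line, "pgpgout %lu", out) == 1)
				got |= 2;
		}
		fclose(f);
		if (got == 3)
			return 0;
	}

	f = fopen("/proc/stat", "r");	/* old kernels: "page <in> <out>" */
	if (!f)
		return -1;
	while (fgets(line, sizeof(line), f)) {
		if (sscanf(line, "page %lu %lu", in, out) == 2) {
			fclose(f);
			return 0;
		}
	}
	fclose(f);
	return -1;
}

int main(void)
{
	unsigned long in, out;

	if (read_paging(&in, &out))
		fprintf(stderr, "no paging stats found\n");
	else
		printf("pgpgin %lu pgpgout %lu\n", in, out);
	return 0;
}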


Since 2.5.38-mm2:

-ide-block-fix-1.patch

Merged (Jens)

-ext3-htree.patch
+ext3-dxdir.patch

Switch to Ted's ext3-htree patch.

-might_sleep.patch
-unbreak-writeback-mode.patch
-queue-congestion.patch
-nonblocking-ext2-preread.patch
-nonblocking-pdflush.patch
-nonblocking-vm.patch
-set_page_dirty-locking-fix.patch
-prepare_to_wait.patch
-vm-wakeups.patch
-sync-helper.patch
-slabasap.patch

Merged

+misc.patch

A comment fix.

+topology_fixes.patch

Some topology API fixlets from Matthew.

+dio-bio-add-page.patch

Convert direct-io.c to use bio_add_page(). (Badari)

It will now build BIOs as large as the device supports.
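
Schematically, the pattern is (a sketch against the 2.5-era bio API, not
the actual direct-io.c code; completion handling is omitted):

#include <linux/bio.h>
#include <linux/blkdev.h>
#include <linux/mm.h>

static void dio_submit_pages(struct block_device *bdev, sector_t sector,
			     struct page **pages, int nr_pages, int rw)
{
	struct bio *bio = NULL;
	int i = 0;

	while (i < nr_pages) {
		if (bio == NULL) {
			bio = bio_alloc(GFP_KERNEL, nr_pages - i);
			bio->bi_bdev = bdev;
			bio->bi_sector = sector;
			/* real code also sets bi_end_io/bi_private here */
		}
		if (bio_add_page(bio, pages[i], PAGE_SIZE, 0)) {
			/* page accepted: the bio keeps growing */
			sector += PAGE_SIZE >> 9;
			i++;
		} else {
			/* queue limit hit: this bio is as large as the
			 * device supports, so send it and start another */
			submit_bio(rw, bio);
			bio = NULL;
		}
	}
	if (bio)
		submit_bio(rw, bio);
}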

+dio-bio-fixes.patch

Some alterations to the above.

-read-latency.patch

"I have to say, that elevator thing is the ugliest code I've seen
in a long while." -- Linus

+deadline-update.patch

Latest deadline scheduler fixes from Jens.

+akpm-deadline.patch

Expose the deadline scheduler tunables into /proc/sys/vm, and set
the default fifo_batch to 16.
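
A quick way to check the exposed default (the exact procfs path is
inferred from the description above, so adjust if your tree differs):

#include <stdio.h>

int main(void)
{
	FILE *f = fopen("/proc/sys/vm/fifo_batch", "r");	/* path assumed */
	int fifo_batch;

	if (f && fscanf(f, "%d", &fifo_batch) == 1)
		printf("fifo_batch = %d\n", fifo_batch);	/* default: 16 */
	else
		fprintf(stderr, "fifo_batch tunable not found\n");
	if (f)
		fclose(f);
	return 0;
}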



linus.patch
cset-1.579.3.4-to-1.605.1.31.txt.gz

ide-high-1.patch

scsi_hack.patch
Fix block-highmem for scsi

ext3-dxdir.patch

spin-lock-check.patch
spinlock/rwlock checking infrastructure

rd-cleanup.patch
Cleanup and fix the ramdisk driver (doesn't work right yet)

misc.patch
misc fixes

write-deadlock.patch
Fix the generic_file_write-from-same-mmapped-page deadlock

buddyinfo.patch
Add /proc/buddyinfo - stats on the free pages pool

free_area.patch
Remove struct free_area_struct and free_area_t, use `struct free_area'

per-node-kswapd.patch
Per-node kswapd instance

topology-api.patch
Simple topology API

topology_fixes.patch
topology-api cleanups

radix_tree_gang_lookup.patch
radix tree gang lookup

truncate_inode_pages.patch
truncate/invalidate_inode_pages rewrite

proc_vmstat.patch
Move the vm accounting out of /proc/stat

kswapd-reclaim-stats.patch
Add kswapd_steal to /proc/vmstat

iowait.patch
I/O wait statistics

sard.patch
SARD disk accounting

dio-bio-add-page.patch
Use bio_add_page() in direct-io.c

dio-bio-fixes.patch
dio-bio-add-page fixes

remove-gfp_nfs.patch
remove GFP_NFS

tcp-wakeups.patch
Use fast wakeups in TCP/IPV4

swapoff-deadlock.patch
Fix a tmpfs swapoff deadlock

dirty-and-uptodate.patch
page state cleanup

shmem_rename.patch
shmem_rename() directory link count fix

dirent-size.patch
tmpfs: show a non-zero size for directories

tmpfs-trivia.patch
tmpfs: small fixlets

per-zone-vm.patch
separate the kswapd and direct reclaim code paths

swsusp-feature.patch
add shrink_all_memory() for swsusp

adaptec-fix.patch
partial fix for aic7xxx error recovery

remove-page-virtual.patch
remove page->virtual for !WANT_PAGE_VIRTUAL

dirty-memory-clamp.patch
sterner dirty-memory clamping

mempool-wakeup-fix.patch
Fix for stuck tasks in mempool_alloc()

remove-write_mapping_buffers.patch
Remove write_mapping_buffers

buffer_boundary-scheduling.patch
I/O scheduling for indirect blocks

ll_rw_block-cleanup.patch
cleanup ll_rw_block()

lseek-ext2_readdir.patch
remove lock_kernel() from ext2_readdir()

discontig-no-contig_page_data.patch
undefine contig_page_data for discontigmem

per-node-zone_normal.patch
ia32 NUMA: per-node ZONE_NORMAL

alloc_pages_node-cleanup.patch
alloc_pages_node cleanup

read_barrier_depends.patch
extended barrier primitives

rcu_ltimer.patch
RCU core

dcache_rcu.patch
Use RCU for dcache

deadline-update.patch
deadline scheduler updates

akpm-deadline.patch


2002-09-26 12:14:04

by Dipankar Sarma

Subject: Re: 2.5.38-mm3

On Thu, Sep 26, 2002 at 07:59:21AM +0000, Andrew Morton wrote:
> url: http://www.zip.com.au/~akpm/linux/patches/2.5/2.5.38/2.5.38-mm3/
>
> Includes a SARD update from Rick. The SARD disk accounting is
> pretty much final now.
>
> read_barrier_depends.patch
> extended barrier primitives
>
> rcu_ltimer.patch
> RCU core
>
> dcache_rcu.patch
> Use RCU for dcache
>

Hi Andrew,

Updated 2.5.38 RCU core and dcache_rcu patches are now available
at http://sourceforge.net/project/showfiles.php?group_id=8875&release_id=112473

The differences since the earlier versions are -

rcu_ltimer - call_rcu() fixed for preemption, plus the earlier fix I had
sent to you.
read_barrier_depends - fixes a list_for_each_rcu macro compilation error.
dcache_rcu - uses list_add_rcu in d_rehash and list_for_each_rcu in
d_lookup, making the read_barrier_depends() fix I had sent to you
earlier unnecessary.
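
Schematically, the d_rehash/d_lookup pattern is (a sketch using the RCU
list primitives from these patches, not the actual dcache_rcu code):

#include <linux/dcache.h>
#include <linux/list.h>
#include <linux/rcupdate.h>

/* Writer side, as in d_rehash(), called with dcache_lock held:
 * list_add_rcu() publishes the entry with the required write barrier. */
static void sketch_rehash(struct dentry *dentry, struct list_head *head)
{
	list_add_rcu(&dentry->d_hash, head);
}

/* Reader side, as in d_lookup(), lock-free: list_for_each_rcu() does the
 * read_barrier_depends() itself, which is why the separate barrier fix
 * is no longer needed. */
static struct dentry *sketch_lookup(struct list_head *head, unsigned int hash)
{
	struct list_head *p;

	rcu_read_lock();
	list_for_each_rcu(p, head) {
		struct dentry *dentry = list_entry(p, struct dentry, d_hash);

		if (dentry->d_name.hash == hash) {
			/* real code also compares the name and takes a
			 * reference before dropping out of the loop */
			rcu_read_unlock();
			return dentry;
		}
	}
	rcu_read_unlock();
	return NULL;
}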

Thanks
--
Dipankar Sarma <[email protected]> http://lse.sourceforge.net
Linux Technology Center, IBM Software Lab, Bangalore, India.

2002-09-26 12:23:57

by William Lee Irwin III

Subject: Re: 2.5.38-mm3

On Thu, Sep 26, 2002 at 05:54:45PM +0530, Dipankar Sarma wrote:
> Updated 2.5.38 RCU core and dcache_rcu patches are now available
> at http://sourceforge.net/project/showfiles.php?group_id=8875&release_id=112473
> The differences since earlier versions are -
> rcu_ltimer - call_rcu() fixed for preemption and the earlier fix I had sent
> to you.
> read_barrier_depends - fixes list_for_each_rcu macro compilation error.
> dcache_rcu - uses list_add_rcu in d_rehash and list_for_each_rcu in d_lookup
> making the read_barrier_depends() fix I had sent to you
> earlier unnecessary.

Is there an update to the files_struct stuff too? I'm seeing large
overheads there also.


Thanks,
Bill

2002-09-26 12:30:43

by Dipankar Sarma

Subject: Re: 2.5.38-mm3

On Thu, Sep 26, 2002 at 05:29:09AM -0700, William Lee Irwin III wrote:
> On Thu, Sep 26, 2002 at 05:54:45PM +0530, Dipankar Sarma wrote:
> > Updated 2.5.38 RCU core and dcache_rcu patches are now available
> > at http://sourceforge.net/project/showfiles.php?group_id=8875&release_id=112473
> > The differences since earlier versions are -
> > rcu_ltimer - call_rcu() fixed for preemption and the earlier fix I had sent
> > to you.
> > read_barrier_depends - fixes list_for_each_rcu macro compilation error.
> > dcache_rcu - uses list_add_rcu in d_rehash and list_for_each_rcu in d_lookup
> > making the read_barrier_depends() fix I had sent to you
> > earlier unnecessary.
>
> Is there an update to the files_struct stuff too? I'm seeing large
> overheads there also.

files_struct_rcu is not in the -mm kernels, but I will upload the most
recent version to the same download directory on LSE.

I would be interested in the fget() profile count change with that patch.

Thanks
--
Dipankar Sarma <[email protected]> http://lse.sourceforge.net
Linux Technology Center, IBM Software Lab, Bangalore, India.

2002-09-26 12:37:32

by William Lee Irwin III

Subject: Re: 2.5.38-mm3

On Thu, Sep 26, 2002 at 05:29:09AM -0700, William Lee Irwin III wrote:
>> Is there an update to the files_struct stuff too? I'm seeing large
>> overheads there also.

On Thu, Sep 26, 2002 at 06:10:52PM +0530, Dipankar Sarma wrote:
> files_struct_rcu is not in mm kernels, but I will upload the most
> recent version to the same download directory in LSE.
> I would be interested in fget() profile count change with that patch.

In my experience fget() is large even on UP kernels. For instance, a
profile from a long-running UP box under interactive load (my home machine):

228542527 total 169.5902
216163353 default_idle 4503403.1875
850707 number 781.8998
829885 handle_IRQ_event 8644.6354
687351 proc_getdata 1227.4125
454401 system_call 8114.3036
446452 csum_partial_copy_generic 1800.2097
330157 tcp_sendmsg 76.4252
300022 vsnprintf 284.1117
271134 __generic_copy_to_user 3389.1750
237151 fget 3705.4844
222390 proc_pid_stat 308.8750
210759 fput 878.1625
186408 tcp_ioctl 314.8784
179146 sys_ioctl 238.2261
177419 do_softirq 1232.0764
167881 kmem_cache_free 1165.8403
154854 skb_clone 387.1350
149377 d_lookup 444.5744
139131 kmem_cache_alloc 668.8990
138638 kfree 866.4875
132555 sys_write 637.2837

This is only aggravated by cacheline bouncing on SMP. The reduction
of system CPU time will doubtless be beneficial for all.


Thanks,
Bill

2002-09-26 12:55:48

by Dipankar Sarma

Subject: Re: 2.5.38-mm3

On Thu, Sep 26, 2002 at 05:42:44AM -0700, William Lee Irwin III wrote:
> This is only aggravated by cacheline bouncing on SMP. The reductions
> of system cpu time will doubtless be beneficial for all.

On SMP, I would have thought that only sharing the fd table across
cloned tasks (CLONE_FILES) affects performance, by bouncing the rwlock
cache line. Are there a lot of common workloads where this happens?
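
For illustration, this is how such sharing arises (a minimal sketch;
CLONE_VM is added only so the child can report the fd number back):

#define _GNU_SOURCE
#include <fcntl.h>
#include <sched.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

static volatile int shared_fd = -1;

static int child_fn(void *arg)
{
	/* A descriptor opened here lands in the shared fd table. */
	shared_fd = open("/dev/null", O_WRONLY);
	return 0;
}

int main(void)
{
	const size_t stack_size = 64 * 1024;
	char *stack = malloc(stack_size);
	pid_t pid;

	if (!stack)
		return 1;

	/* CLONE_FILES shares the fd table, so both tasks then hit the
	 * same file_lock cache line in fget()/fput(). */
	pid = clone(child_fn, stack + stack_size,
		    CLONE_VM | CLONE_FILES | SIGCHLD, NULL);
	if (pid < 0) {
		perror("clone");
		return 1;
	}
	waitpid(pid, NULL, 0);

	/* The parent can use the descriptor the child opened. */
	if (shared_fd >= 0 && write(shared_fd, "x", 1) == 1)
		printf("fd %d opened by the child works here too\n", shared_fd);

	free(stack);
	return 0;
}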

Anyway the files_struct_rcu patch for 2.5.38 is up at
http://sourceforge.net/project/showfiles.php?group_id=8875&release_id=112473

Thanks
--
Dipankar Sarma <[email protected]> http://lse.sourceforge.net
Linux Technology Center, IBM Software Lab, Bangalore, India.

2002-09-26 13:12:28

by William Lee Irwin III

Subject: Re: 2.5.38-mm3

On Thu, Sep 26, 2002 at 05:42:44AM -0700, William Lee Irwin III wrote:
>> This is only aggravated by cacheline bouncing on SMP. The reductions
>> of system cpu time will doubtless be beneficial for all.

On Thu, Sep 26, 2002 at 06:35:58PM +0530, Dipankar Sarma wrote:
> On SMP, I would have thought that only sharing the fd table
> while cloning tasks (CLONE_FILES) affects performance by bouncing the rwlock
> cache line. Are there a lot of common workloads where this happens ?
> Anyway the files_struct_rcu patch for 2.5.38 is up at
> http://sourceforge.net/project/showfiles.php?group_id=8875&release_id=112473

It looks very unusual, but it is very real. Some of my prior profile
results show this. I'll run a before/after profile with this either
tonight or tomorrow night (it's 6:06AM PST here -- tonight is unlikely).


Cheers,
Bill

2002-09-26 13:25:07

by Zwane Mwaikambo

Subject: Re: 2.5.38-mm3

On Thu, 26 Sep 2002, William Lee Irwin III wrote:

> In my experience fget() is large even on UP kernels. For instance, a UP
> profile from a long-running interactive load UP box (my home machine):

I can affirm that:

6124639 total 4.1414
4883005 default_idle 101729.2708
380218 ata_input_data 1697.4018
242647 ata_output_data 1083.2455
35989 do_select 60.7922
34931 unix_poll 218.3187
33561 schedule 52.4391
29823 do_softirq 155.3281
27021 fget 422.2031
25270 sock_poll 526.4583
18224 preempt_schedule 379.6667
17895 sys_select 15.5339
17741 __generic_copy_from_user 184.8021
15397 __generic_copy_to_user 240.5781
13214 fput 55.0583
13088 add_wait_queue 163.6000
12637 system_call 225.6607

--
function.linuxpower.ca

2002-09-26 13:34:11

by William Lee Irwin III

Subject: Re: 2.5.38-mm3

On Thu, Sep 26, 2002 at 09:29:36AM -0400, Zwane Mwaikambo wrote:
> I can affirmative that;
> 6124639 total 4.1414
> 4883005 default_idle 101729.2708
> 380218 ata_input_data 1697.4018
> 242647 ata_output_data 1083.2455
> 35989 do_select 60.7922
> 34931 unix_poll 218.3187
> 33561 schedule 52.4391
> 29823 do_softirq 155.3281
> 27021 fget 422.2031
> 25270 sock_poll 526.4583

Interesting, can you narrow down the poll overheads any? No immediate
needs (read as: leave your box up, but watch for it when you can),
but I'd be interested in knowing if it's fd chunk or poll table setup
overhead.


Thanks,
Bill

2002-09-26 13:41:56

by Zwane Mwaikambo

Subject: Re: 2.5.38-mm3

On Thu, 26 Sep 2002, William Lee Irwin III wrote:

> On Thu, Sep 26, 2002 at 09:29:36AM -0400, Zwane Mwaikambo wrote:
> > I can affirmative that;
> > 6124639 total 4.1414
> > 4883005 default_idle 101729.2708
> > 380218 ata_input_data 1697.4018
> > 242647 ata_output_data 1083.2455
> > 35989 do_select 60.7922
> > 34931 unix_poll 218.3187
> > 33561 schedule 52.4391
> > 29823 do_softirq 155.3281
> > 27021 fget 422.2031
> > 25270 sock_poll 526.4583
>
> Interesting, can you narrow down the poll overheads any? No immediate
> needs (read as: leave your box up, but watch for it when you can),
> but I'd be interested in knowing if it's fd chunk or poll table setup
> overhead.

Sure, I'm pretty sure I know which application is doing that, so I can
reproduce it easily enough.

Zwane
--
function.linuxpower.ca

2002-09-27 08:17:07

by Dipankar Sarma

Subject: Re: 2.5.38-mm3

On Thu, Sep 26, 2002 at 06:39:19AM -0700, William Lee Irwin III wrote:
> Interesting, can you narrow down the poll overheads any? No immediate
> needs (read as: leave your box up, but watch for it when you can),
> but I'd be interested in knowing if it's fd chunk or poll table setup
> overhead.

Hmm.. I don't see this by just leaving the box up (and a few interactive
commands) (4-CPU P3, 2.5.38-vanilla) -

8744695 default_idle 136635.8594
4371 __rdtsc_delay 136.5938
22793 do_softirq 118.7135
1734 serial_in 21.6750
261 .text.lock.serio 13.7368
8777715 total 6.2461
422 tasklet_hi_action 2.0288
106 bh_action 1.3250
46 system_call 1.0455
56 __generic_copy_to_user 0.8750
575 timer_bh 0.8168
70 __cpu_up 0.7292
57 cpu_idle 0.5089
24 __const_udelay 0.3750
35 mdio_read 0.3646
120 probe_irq_on 0.3571
134 page_remove_rmap 0.3102
108 page_add_rmap 0.3068
18 find_get_page 0.2812
189 do_wp_page 0.2513
7 fput 0.2188
27 pte_alloc_one 0.1875
135 __free_pages_ok 0.1834
2 syscall_call 0.1818
11 pgd_alloc 0.1719
11 __free_pages 0.1719
65 i8042_interrupt 0.1693
8 __wake_up 0.1667
16 find_vma 0.1667
15 serial_out 0.1562
15 radix_tree_lookup 0.1339
17 kmem_cache_free 0.1328
17 get_page_state 0.1328
62 zap_pte_range 0.1292
6 mdio_sync 0.1250
3 ret_from_intr 0.1250
2 cap_inode_permission_lite 0.1250
2 cap_file_permission 0.1250
49 do_anonymous_page 0.1178
9 lru_cache_add 0.1125
9 fget 0.1125

What application were you all running?

Thanks
--
Dipankar Sarma <[email protected]> http://lse.sourceforge.net
Linux Technology Center, IBM Software Lab, Bangalore, India.

2002-09-27 09:15:33

by William Lee Irwin III

Subject: Re: 2.5.38-mm3

On Fri, Sep 27, 2002 at 01:57:43PM +0530, Dipankar Sarma wrote:
> What application were you all running ?
> Thanks

Basically, the workload on my "desktop" system consists of numerous ssh
sessions in and out of the machine, half a dozen IRC clients, xmms,
Mozilla, and X overhead.



Cheers,
Bill

2002-09-27 09:48:27

by Dipankar Sarma

Subject: Re: 2.5.38-mm3

On Fri, Sep 27, 2002 at 02:20:20AM -0700, William Lee Irwin III wrote:
> On Fri, Sep 27, 2002 at 01:57:43PM +0530, Dipankar Sarma wrote:
> > What application were you all running ?
> > Thanks
>
> Basically, the workload on my "desktop" system consists of numerous ssh
> sessions in and out of the machine, half a dozen IRC clients, xmms,
> Mozilla, and X overhead.

Ok, from a relatively idle system (4 CPUs) running an SMP kernel -

18 fget 0.2250
0 0.00 c013d460: push %ebx
0 0.00 c013d461: mov $0xffffe000,%edx
0 0.00 c013d466: mov %eax,%ecx
0 0.00 c013d468: and %esp,%edx
0 0.00 c013d46a: mov (%edx),%eax
1 5.56 c013d46c: mov 0x674(%eax),%ebx
1 5.56 c013d472: lea 0x4(%ebx),%eax
0 0.00 c013d475: lock subl $0x1,(%eax)
3 16.67 c013d479: js c013d61b <.text.lock.file_table+0x30>
0 0.00 c013d47f: mov (%edx),%eax
1 5.56 c013d481: mov 0x674(%eax),%edx
0 0.00 c013d487: xor %eax,%eax
0 0.00 c013d489: cmp 0x8(%edx),%ecx
0 0.00 c013d48c: jae c013d494 <fget+0x34>
0 0.00 c013d48e: mov 0x14(%edx),%eax
0 0.00 c013d491: mov (%eax,%ecx,4),%eax
0 0.00 c013d494: test %eax,%eax
0 0.00 c013d496: je c013d49c <fget+0x3c>
0 0.00 c013d498: lock incl 0x14(%eax)
0 0.00 c013d49c: lock incl 0x4(%ebx)
5 27.78 c013d4a0: pop %ebx
0 0.00 c013d4a1: ret
7 38.89 c013d4a2: lea 0x0(%esi,1),%esi

I tried an SMP kernel on 1 CPU -

15 fget 0.1875
0 0.00 c013d460: push %ebx
2 13.33 c013d461: mov $0xffffe000,%edx
0 0.00 c013d466: mov %eax,%ecx
0 0.00 c013d468: and %esp,%edx
0 0.00 c013d46a: mov (%edx),%eax
0 0.00 c013d46c: mov 0x674(%eax),%ebx
0 0.00 c013d472: lea 0x4(%ebx),%eax
0 0.00 c013d475: lock subl $0x1,(%eax)
3 20.00 c013d479: js c013d61b <.text.lock.file_table+0x30>
0 0.00 c013d47f: mov (%edx),%eax
0 0.00 c013d481: mov 0x674(%eax),%edx
0 0.00 c013d487: xor %eax,%eax
0 0.00 c013d489: cmp 0x8(%edx),%ecx
0 0.00 c013d48c: jae c013d494 <fget+0x34>
0 0.00 c013d48e: mov 0x14(%edx),%eax
0 0.00 c013d491: mov (%eax,%ecx,4),%eax
0 0.00 c013d494: test %eax,%eax
0 0.00 c013d496: je c013d49c <fget+0x3c>
0 0.00 c013d498: lock incl 0x14(%eax)
0 0.00 c013d49c: lock incl 0x4(%ebx)
4 26.67 c013d4a0: pop %ebx
0 0.00 c013d4a1: ret
6 40.00 c013d4a2: lea 0x0(%esi,1),%esi

The counts are off by one.

With a UP kernel, I see that the fget() cost is negligible.
So it is most likely the atomic operations for rwlock acquisition/release
in fget() that are adding to its cost. Unless, of course, my sampling
is too sparse.
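
As a rough user-space analogy of that cost (build with gcc -O2 -pthread;
this is an illustration, not kernel code), compare an rwlock read-side
pair of atomic operations with a plain load:

#include <pthread.h>
#include <stdio.h>
#include <sys/time.h>

#define ITERS 10000000

static volatile int table[1] = { 42 };

static double now(void)
{
	struct timeval tv;

	gettimeofday(&tv, NULL);
	return tv.tv_sec + tv.tv_usec / 1e6;
}

int main(void)
{
	pthread_rwlock_t lock;
	volatile int sink = 0;
	double t0, t1, t2;
	int i;

	pthread_rwlock_init(&lock, NULL);

	t0 = now();
	for (i = 0; i < ITERS; i++) {
		pthread_rwlock_rdlock(&lock);	/* atomic op, like read_lock() */
		sink += table[0];
		pthread_rwlock_unlock(&lock);	/* atomic op, like read_unlock() */
	}
	t1 = now();
	for (i = 0; i < ITERS; i++)
		sink += table[0];		/* plain load: the lock-free case */
	t2 = now();

	printf("rwlock read side: %.1f ns/lookup\n", (t1 - t0) * 1e9 / ITERS);
	printf("plain load:       %.1f ns/lookup\n", (t2 - t1) * 1e9 / ITERS);
	(void)sink;
	return 0;
}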

Please try running the files_struct_rcu patch where fget() is lockfree
and let me know what you see.

Thanks
--
Dipankar Sarma <[email protected]> http://lse.sourceforge.net
Linux Technology Center, IBM Software Lab, Bangalore, India.

2002-09-27 15:14:08

by Martin J. Bligh

Subject: Re: 2.5.38-mm3

>> > What application were you all running ?

Kernel compile on NUMA-Q looks like this:

125673 total
82183 default_idle
6134 do_anonymous_page
4431 page_remove_rmap
2345 page_add_rmap
2288 d_lookup
1921 vm_enough_memory
1883 __generic_copy_from_user
1566 file_read_actor
1381 .text.lock.file_table <-------------
1168 find_get_page
1116 get_empty_filp

Presumably that's the same thing? Interestingly, if I look back at
previous results, I see it's about twice the cost in -mm as it is
in mainline, not sure why ... at least against 2.5.37 virgin it was.

> Please try running the files_struct_rcu patch where fget() is lockfree
> and let me know what you see.

Will do ... if you tell me where it is ;-)

M.

2002-09-27 17:04:46

by Dipankar Sarma

Subject: Re: 2.5.38-mm3

On Fri, Sep 27, 2002 at 08:04:31AM -0700, Martin J. Bligh wrote:
> >> > What application were you all running ?
>
> Kernel compile on NUMA-Q looks like this:
>
> 125673 total
> 82183 default_idle
> 2288 d_lookup
> 1921 vm_enough_memory
> 1883 __generic_copy_from_user
> 1566 file_read_actor
> 1381 .text.lock.file_table <-------------

More likely, this is contention for files_lock. Do you have any
lockmeter data? That should give us more information. If it is
files_lock contention, the files_struct_rcu patch isn't likely to help.

> 1168 find_get_page
> 1116 get_empty_filp
>
> Presumably that's the same thing? Interestingly, if I look back at
> previous results, I see it's about twice the cost in -mm as it is
> in mainline, not sure why ... at least against 2.5.37 virgin it was.

Not sure why it shows up more in -mm, but likely because -mm has a
lot less contention on other locks like dcache_lock.

>
> > Please try running the files_struct_rcu patch where fget() is lockfree
> > and let me know what you see.
>
> Will do ... if you tell me where it is ;-)

Oh, the usual place -
http://sourceforge.net/project/showfiles.php?group_id=8875&release_id=112473
I wish the SourceForge FRS still allowed direct links to patches.

Thanks
--
Dipankar Sarma <[email protected]> http://lse.sourceforge.net
Linux Technology Center, IBM Software Lab, Bangalore, India.

2002-09-27 22:50:49

by William Lee Irwin III

Subject: Re: 2.5.38-mm3

On Fri, Sep 27, 2002 at 10:44:24PM +0530, Dipankar Sarma wrote:
> Not sure why it shows up more in -mm, but likely because -mm has
> lot less contention on other locks like dcache_lock.

Well, the profile I posted was an interactive UP workload, and it's
fairly high there. Trimming cycles off this is good for everyone.

Small SMP boxen (dual?) used similarly will probably see additional
gains as the number of locked operations in fget() will be reduced.
There's clearly no contention or cacheline bouncing in my workloads as
none of them have tasks sharing file tables, nor is anything else
messing with the cachelines.


Cheers,
Bill

2002-09-28 04:26:28

by Zwane Mwaikambo

Subject: Re: 2.5.38-mm3

On Fri, 27 Sep 2002, William Lee Irwin III wrote:

> On Fri, Sep 27, 2002 at 01:57:43PM +0530, Dipankar Sarma wrote:
> > What application were you all running ?
> > Thanks
>
> Basically, the workload on my "desktop" system consists of numerous ssh
> sessions in and out of the machine, half a dozen IRC clients, xmms,
> Mozilla, and X overhead.

That box is my development/main box; I run a lot of xterms, xmms, and
network applications (ssh, browsers, irc, etc.). There's heavy simulator
usage (I believe that's where the poll stuff comes from, due to its
virtual ethernet interface), all done in X, and the box is also a local
NFS server for the various test boxes I have (heavy I/O and disk load),
as well as doing kernel compiles.

Zwane

--
function.linuxpower.ca

2002-09-28 04:33:35

by William Lee Irwin III

Subject: Re: 2.5.38-mm3

> On Fri, 27 Sep 2002, Dipankar Sarma wrote:
>> The counts are off by one.
>> With a UP kernel, I see that fget() cost is negligible.
>> So it is most likely the atomic operations for rwlock acquisition/release
>> in fget() that is adding to its cost. Unless of course my sampling
>> is too less.

On Sat, Sep 28, 2002 at 12:35:30AM -0400, Zwane Mwaikambo wrote:
> Mine is a UP box not an SMP kernel, although preempt is enabled;
> 0xc013d370 <fget>: push %ebx
> 0xc013d371 <fget+1>: mov %eax,%ecx
> 0xc013d373 <fget+3>: mov $0xffffe000,%edx
> 0xc013d378 <fget+8>: and %esp,%edx
> 0xc013d37a <fget+10>: incl 0x4(%edx)

Do you have instruction-level profiles to show where the cost is on UP?


Thanks,
Bill

2002-09-28 04:42:15

by Zwane Mwaikambo

Subject: Re: 2.5.38-mm3

On Fri, 27 Sep 2002, Dipankar Sarma wrote:

> The counts are off by one.
>
> With a UP kernel, I see that fget() cost is negligible.
> So it is most likely the atomic operations for rwlock acquisition/release
> in fget() that is adding to its cost. Unless of course my sampling
> is too less.

Mine is a UP box, not an SMP kernel, although preempt is enabled:

0xc013d370 <fget>: push %ebx
0xc013d371 <fget+1>: mov %eax,%ecx
0xc013d373 <fget+3>: mov $0xffffe000,%edx
0xc013d378 <fget+8>: and %esp,%edx
0xc013d37a <fget+10>: incl 0x4(%edx)
0xc013d37d <fget+13>: xor %ebx,%ebx
0xc013d37f <fget+15>: mov 0x554(%edx),%eax
0xc013d385 <fget+21>: cmp 0x8(%eax),%ecx
0xc013d388 <fget+24>: jae 0xc013d390 <fget+32>
0xc013d38a <fget+26>: mov 0x14(%eax),%eax
0xc013d38d <fget+29>: mov (%eax,%ecx,4),%ebx
0xc013d390 <fget+32>: test %ebx,%ebx
0xc013d392 <fget+34>: je 0xc013d397 <fget+39>
0xc013d394 <fget+36>: incl 0x14(%ebx)
0xc013d397 <fget+39>: decl 0x4(%edx)
0xc013d39a <fget+42>: mov 0x14(%edx),%eax
0xc013d39d <fget+45>: cmp %eax,0x4(%edx)
0xc013d3a0 <fget+48>: jge 0xc013d3a7 <fget+55>
0xc013d3a2 <fget+50>: call 0xc01179b0 <preempt_schedule>
0xc013d3a7 <fget+55>: mov %ebx,%eax
0xc013d3a9 <fget+57>: pop %ebx
0xc013d3aa <fget+58>: ret
0xc013d3ab <fget+59>: nop
0xc013d3ac <fget+60>: lea 0x0(%esi,1),%esi

> Please try running the files_struct_rcu patch where fget() is lockfree
> and let me know what you see.

Lock acquisition/release should be painless on this system, no?

Zwane
--
function.linuxpower.ca

2002-09-28 04:50:25

by Zwane Mwaikambo

Subject: Re: 2.5.38-mm3

On Fri, 27 Sep 2002, William Lee Irwin III wrote:

> On Sat, Sep 28, 2002 at 12:35:30AM -0400, Zwane Mwaikambo wrote:
> > Mine is a UP box not an SMP kernel, although preempt is enabled;
> > 0xc013d370 <fget>: push %ebx
> > 0xc013d371 <fget+1>: mov %eax,%ecx
> > 0xc013d373 <fget+3>: mov $0xffffe000,%edx
> > 0xc013d378 <fget+8>: and %esp,%edx
> > 0xc013d37a <fget+10>: incl 0x4(%edx)
>
> Do you have instruction-level profiles to show where the cost is on UP?

Unfortunately no, I was lucky to even remember to be running profile=n on
this box.

--
function.linuxpower.ca

2002-09-28 05:18:48

by Dipankar Sarma

Subject: Re: 2.5.38-mm3

On Sat, Sep 28, 2002 at 12:54:39AM -0400, Zwane Mwaikambo wrote:
> On Fri, 27 Sep 2002, William Lee Irwin III wrote:
>
> > On Sat, Sep 28, 2002 at 12:35:30AM -0400, Zwane Mwaikambo wrote:
> > > Mine is a UP box not an SMP kernel, although preempt is enabled;
> > > 0xc013d370 <fget>: push %ebx
> > > 0xc013d371 <fget+1>: mov %eax,%ecx
> > > 0xc013d373 <fget+3>: mov $0xffffe000,%edx
> > > 0xc013d378 <fget+8>: and %esp,%edx
> > > 0xc013d37a <fget+10>: incl 0x4(%edx)
> >
> > Do you have instruction-level profiles to show where the cost is on UP?
>
> Unfortunately no, i was lucky to remember to even be running profile=n on
> this box.

That is sufficient to get an instruction-level profile. Just use
the hacked readprofile by tridge (it's available somewhere on his
samba.org web page).

I suspect that inlining fget() will help; I'm not sure whether that is
clean code-wise.

Thanks
--
Dipankar Sarma <[email protected]> http://lse.sourceforge.net
Linux Technology Center, IBM Software Lab, Bangalore, India.

2002-09-28 05:30:10

by Dipankar Sarma

Subject: Re: 2.5.38-mm3

On Fri, Sep 27, 2002 at 03:54:55PM -0700, William Lee Irwin III wrote:
> On Fri, Sep 27, 2002 at 10:44:24PM +0530, Dipankar Sarma wrote:
> > Not sure why it shows up more in -mm, but likely because -mm has
> > lot less contention on other locks like dcache_lock.
>
> Well, the profile I posted was an interactive UP workload, and it's
> fairly high there. Trimming cycles off this is good for everyone.

Oh, I was commenting on possible files_lock contention on mbligh's
NUMA-Q.

>
> Small SMP boxen (dual?) used similarly will probably see additional
> gains as the number of locked operations in fget() will be reduced.
> There's clearly no contention or cacheline bouncing in my workloads as
> none of them have tasks sharing file tables, nor is anything else
> messing with the cachelines.

I remember seeing fget() high up in SPECweb profiles. I suspect that the
fget() profile count is high because it just happens to get called very
often for most workloads (all file syscalls), and the atomic operations
(on SMP) and the function call overhead just add to the cost.
If possible, we should try inlining it too.

Thanks
--
Dipankar Sarma <[email protected]> http://lse.sourceforge.net
Linux Technology Center, IBM Software Lab, Bangalore, India.