2002-09-19 22:32:08

by William Lee Irwin III

Subject: 2.5.36-mm1 dbench 512 profiles

Well, from some private responses, it appears this is of more general
interest than linux-mm, so I'm reposting here.

I'll follow up with some suggested patches for addressing some of the
performance issues that may have been encountered.

From dbench 512 on a 32x NUMA-Q with 32GB of RAM running 2.5.36-mm1:

c01053a4 14040139 35.542 default_idle
c0114ab8 4436882 11.2318 load_balance
c015c5c6 4243413 10.742 .text.lock.dcache
c01317f4 2229431 5.64371 generic_file_write_nolock
c0130d10 2182906 5.52593 file_read_actor
c0114f30 2126191 5.38236 scheduler_tick
c0154b83 1905648 4.82407 .text.lock.namei
c011749d 1344623 3.40386 .text.lock.sched
c019f8ab 1102566 2.7911 .text.lock.dec_and_lock
c01066a8 612167 1.54968 .text.lock.semaphore
c015ba5c 440889 1.11609 d_lookup
c013f81c 314222 0.79544 blk_queue_bounce
c0111798 310317 0.785554 smp_apic_timer_interrupt
c013fac4 228103 0.577433 .text.lock.highmem
c01523b8 206811 0.523533 path_lookup
c0115274 164177 0.415607 do_schedule
c019f830 143365 0.362922 atomic_dec_and_lock
c0114628 136075 0.344468 try_to_wake_up
c01062dc 125245 0.317052 __down
c010d9d8 121864 0.308494 timer_interrupt
c015ae30 114653 0.290239 prune_dcache
c0144e00 102093 0.258444 generic_file_llseek
c015b714 83273 0.210802 d_instantiate

with akpm's removal of lock section directives:

c01053a4 31781009 38.3441 default_idle
c0114968 13184373 15.9071 load_balance
c0114de0 6545861 7.89765 scheduler_tick
c0151718 4514372 5.44664 path_lookup
c015ac4c 3314721 3.99924 d_lookup
c0130560 3153290 3.80448 file_read_actor
c0131044 2816477 3.39811 generic_file_write_nolock
c015a8e4 1980809 2.38987 d_instantiate
c019e1b0 1959187 2.36378 atomic_dec_and_lock
c0111668 1447604 1.74655 smp_apic_timer_interrupt
c0159fc0 1291884 1.55867 prune_dcache
c015a714 1089696 1.31473 d_alloc
c01062cc 1030194 1.24294 __down
c015b0dc 625279 0.754405 d_rehash
c013edac 554017 0.668427 blk_queue_bounce
c0115128 508229 0.613183 do_schedule
c01144c8 441818 0.533058 try_to_wake_up
c010d8f8 403607 0.486956 timer_interrupt
c01229a4 333023 0.401796 update_one_process
c015af70 322781 0.389439 d_delete
c01508a0 248442 0.299748 do_lookup
c01155f4 213738 0.257877 __wake_up
c013e63c 185472 0.223774 kmap_high


2002-09-19 23:10:23

by Hanna Linder

Subject: Re: 2.5.36-mm1 dbench 512 profiles

--On Thursday, September 19, 2002 15:30:07 -0700 William Lee Irwin III <[email protected]> wrote:

> From dbench 512 on a 32x NUMA-Q with 32GB of RAM running 2.5.36-mm1:
>
> c015c5c6 4243413 10.742 .text.lock.dcache
> c01317f4 2229431 5.64371 generic_file_write_nolock
> c0130d10 2182906 5.52593 file_read_actor
> c0114f30 2126191 5.38236 scheduler_tick
> c0154b83 1905648 4.82407 .text.lock.namei
> c011749d 1344623 3.40386 .text.lock.sched
> c019f8ab 1102566 2.7911 .text.lock.dec_and_lock
> c01066a8 612167 1.54968 .text.lock.semaphore
> c015ba5c 440889 1.11609 d_lookup

>
> with akpm's removal of lock section directives:
>
> c0114de0 6545861 7.89765 scheduler_tick
> c0151718 4514372 5.44664 path_lookup
> c015ac4c 3314721 3.99924 d_lookup
> c0130560 3153290 3.80448 file_read_actor
> c0131044 2816477 3.39811 generic_file_write_nolock
> c015a8e4 1980809 2.38987 d_instantiate
> c019e1b0 1959187 2.36378 atomic_dec_and_lock
> c0111668 1447604 1.74655 smp_apic_timer_interrupt
> c0159fc0 1291884 1.55867 prune_dcache
> c015a714 1089696 1.31473 d_alloc
> c01062cc 1030194 1.24294 __down

So akpm's removal of lock section directives breaks down the
functions holding locks that previously were reported under the
.text.lock.filename? Looks like fastwalk might not behave so well
on this 32 cpu numa system...

Hanna

2002-09-19 23:33:18

by Andrew Morton

Subject: Re: 2.5.36-mm1 dbench 512 profiles

Hanna Linder wrote:
>
> ...
> So akpm's removal of lock section directives breaks down the
> functions holding locks that previously were reported under the
> .text.lock.filename?

Yup. It makes the profiler report the spinlock cost at the
actual callsite. Patch below.

> Looks like fastwalk might not behave so well
> on this 32 cpu numa system...

I've rather lost the plot. Have any of the dcache speedup
patches been merged into 2.5?

It would be interesting to know the context switch rate
during this test, and to see what things look like with HZ=100.



--- 2.5.24/include/asm-i386/spinlock.h~spinlock-inline Fri Jun 21 13:12:01 2002
+++ 2.5.24-akpm/include/asm-i386/spinlock.h Fri Jun 21 13:18:12 2002
@@ -46,13 +46,13 @@ typedef struct {
"\n1:\t" \
"lock ; decb %0\n\t" \
"js 2f\n" \
- LOCK_SECTION_START("") \
+ "jmp 3f\n" \
"2:\t" \
"cmpb $0,%0\n\t" \
"rep;nop\n\t" \
"jle 2b\n\t" \
"jmp 1b\n" \
- LOCK_SECTION_END
+ "3:\t" \

/*
* This works. Despite all the confusion.
--- 2.5.24/include/asm-i386/rwlock.h~spinlock-inline Fri Jun 21 13:18:33 2002
+++ 2.5.24-akpm/include/asm-i386/rwlock.h Fri Jun 21 13:22:09 2002
@@ -22,25 +22,19 @@

#define __build_read_lock_ptr(rw, helper) \
asm volatile(LOCK "subl $1,(%0)\n\t" \
- "js 2f\n" \
- "1:\n" \
- LOCK_SECTION_START("") \
- "2:\tcall " helper "\n\t" \
- "jmp 1b\n" \
- LOCK_SECTION_END \
+ "jns 1f\n\t" \
+ "call " helper "\n\t" \
+ "1:\t" \
::"a" (rw) : "memory")

#define __build_read_lock_const(rw, helper) \
asm volatile(LOCK "subl $1,%0\n\t" \
- "js 2f\n" \
- "1:\n" \
- LOCK_SECTION_START("") \
- "2:\tpushl %%eax\n\t" \
+ "jns 1f\n\t" \
+ "pushl %%eax\n\t" \
"leal %0,%%eax\n\t" \
"call " helper "\n\t" \
"popl %%eax\n\t" \
- "jmp 1b\n" \
- LOCK_SECTION_END \
+ "1:\t" \
:"=m" (*(volatile int *)rw) : : "memory")

#define __build_read_lock(rw, helper) do { \
@@ -52,25 +46,19 @@

#define __build_write_lock_ptr(rw, helper) \
asm volatile(LOCK "subl $" RW_LOCK_BIAS_STR ",(%0)\n\t" \
- "jnz 2f\n" \
+ "jz 1f\n\t" \
+ "call " helper "\n\t" \
"1:\n" \
- LOCK_SECTION_START("") \
- "2:\tcall " helper "\n\t" \
- "jmp 1b\n" \
- LOCK_SECTION_END \
::"a" (rw) : "memory")

#define __build_write_lock_const(rw, helper) \
asm volatile(LOCK "subl $" RW_LOCK_BIAS_STR ",%0\n\t" \
- "jnz 2f\n" \
- "1:\n" \
- LOCK_SECTION_START("") \
- "2:\tpushl %%eax\n\t" \
+ "jz 1f\n\t" \
+ "pushl %%eax\n\t" \
"leal %0,%%eax\n\t" \
"call " helper "\n\t" \
"popl %%eax\n\t" \
- "jmp 1b\n" \
- LOCK_SECTION_END \
+ "1:\n" \
:"=m" (*(volatile int *)rw) : : "memory")

#define __build_write_lock(rw, helper) do { \

-
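
For the curious: LOCK_SECTION_START()/LOCK_SECTION_END expand to roughly
the following in the 2.5-era include/linux/spinlock.h (a from-memory
sketch, not the exact header text). The contended spin loop gets emitted
into a .text.lock.<file> subsection, which is why the profiler piles all
of the contention onto the .text.lock.* symbols instead of the callers:

#define LOCK_SECTION_NAME \
        ".text.lock." __stringify(KBUILD_BASENAME)

/* everything between these two markers goes out of line, into the
 * .text.lock.<basename> subsection, so samples taken while spinning
 * are attributed there rather than to the locking function */
#define LOCK_SECTION_START(extra) \
        ".subsection 1\n\t" \
        extra \
        ".ifndef " LOCK_SECTION_NAME "\n\t" \
        LOCK_SECTION_NAME ":\n\t" \
        ".endif\n\t"

#define LOCK_SECTION_END \
        ".previous\n\t"

With the patch above the spin loop stays inline, so the samples land on
whichever function actually took the contended lock.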

2002-09-19 23:37:23

by Hanna Linder

Subject: Re: 2.5.36-mm1 dbench 512 profiles


--On Thursday, September 19, 2002 16:38:14 -0700 Andrew Morton <[email protected]> wrote:

> Hanna Linder wrote:
>>
>> ...
>> So akpm's removal of lock section directives breaks down the
>> functions holding locks that previously were reported under the
>> .text.lock.filename?
>
> Yup. It makes the profiler report the spinlock cost at the
> actual callsite. Patch below.

Thanks. We've needed that for quite some time.
>
>> Looks like fastwalk might not behave so well
>> on this 32 cpu numa system...
>
> I've rather lost the plot. Have any of the dcache speedup
> patches been merged into 2.5?

Yes, starting with 2.5.11. Al Viro made some changes to
it and it went in. Haven't heard anything about it since...

Hanna

2002-09-20 00:09:17

by William Lee Irwin III

Subject: Re: 2.5.36-mm1 dbench 512 profiles

Hanna Linder wrote:
>> Looks like fastwalk might not behave so well
>> on this 32 cpu numa system...

On Thu, Sep 19, 2002 at 04:38:14PM -0700, Andrew Morton wrote:
> I've rather lost the plot. Have any of the dcache speedup
> patches been merged into 2.5?

As far as the dcache goes, I'll stick to observing and reporting.
I'll rerun with dcache patches applied, though.


On Thu, Sep 19, 2002 at 04:38:14PM -0700, Andrew Morton wrote:
> It would be interesting to know the context switch rate
> during this test, and to see what things look like with HZ=100.

The context switch rate was 60 or 70 cs/sec. during the steady
state of the test, and around 10K cs/sec for ramp-up and ramp-down.

I've already prepared a kernel with a lowered HZ, but stopped briefly to
debug a calibrate_delay() oops and chat with folks around the workplace.


Thanks,
Bill

2002-09-20 04:03:30

by William Lee Irwin III

Subject: Re: 2.5.36-mm1 dbench 512 profiles

On Thu, Sep 19, 2002 at 04:38:14PM -0700, Andrew Morton wrote:
>> It would be interesting to know the context switch rate
>> during this test, and to see what things look like with HZ=100.

On Thu, Sep 19, 2002 at 05:08:15PM -0700, William Lee Irwin III wrote:
> The context switch rate was 60 or 70 cs/sec. during the steady
> state of the test, and around 10K cs/sec for ramp-up and ramp-down.
> I've already prepared a kernel with a lowered HZ, but stopped briefly to
> debug a calibrate_delay() oops and chat with folks around the workplace.

Okay, figured that one out (cf. the x86_udelay_tsc thread). I'll grind out
another one in about 90-120 minutes or thereabouts with HZ == 100. I'm
going to take a wild guess that param.h should have an #ifdef there for
NR_CPUS >= WLI_SAW_EXCESS_TIMER_INTS_HERE or something. It's probably
possible to figure out what the service time vs. arrival rate numbers say,
but it's too easy to fix to be worth analyzing, and we don't exist to
process timer ticks anyway. Hrm, yet another cry for i386 subarches?
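
Something like this, say (purely illustrative, threshold made up -- not a
real or proposed patch):

/* hypothetical sketch for include/asm-i386/param.h, threshold made up */
#if NR_CPUS >= WLI_SAW_EXCESS_TIMER_INTS_HERE
#define HZ 100          /* big boxes drown in timer ticks at HZ == 1000 */
#else
#define HZ 1000
#endif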

Hanna, did you have a particular dcache patch in mind? ISTR there were
several flavors. (I can of course sift through them myself as well.)

Cheers,
Bill

2002-09-20 05:15:57

by William Lee Irwin III

Subject: Re: 2.5.36-mm1 dbench 512 profiles

On Thu, Sep 19, 2002 at 04:38:14PM -0700, Andrew Morton wrote:
> It would be interesting to know the context switch rate
> during this test, and to see what things look like with HZ=100.

Ow! Throughput went down by something like 25% by just bumping HZ down
to 100. (or so it appears)

One of the odd things here is that this is absolutely not being allowed
to sample times when the test is not running. So default_idle is some
kind of actual scheduling artifact, though I'm not entirely sure when
it's sitting idle. It may just be the predominant sampled thing within
the kernel despite 98% non-idle system time, in which case HZ == 1000 is
not the issue. Not sure.


Cheers,
Bill

Out-of-line lock version:

c01053a4 20974286 42.2901 default_idle
c015c586 5759482 11.6127 .text.lock.dcache
c0154b43 5747223 11.588 .text.lock.namei
c01317e4 4653534 9.38284 generic_file_write_nolock
c0130d00 1861383 3.75308 file_read_actor
c0114a98 1230049 2.48013 load_balance
c019f6bb 866076 1.74625 .text.lock.dec_and_lock
c01066a8 796042 1.60505 .text.lock.semaphore
c013f7fc 749976 1.51216 blk_queue_bounce
c019f640 497122 1.00234 atomic_dec_and_lock
c0114f10 321897 0.649036 scheduler_tick
c0152378 262290 0.528851 path_lookup
c0144dc0 223189 0.450012 generic_file_llseek
c015adf0 207285 0.417945 prune_dcache
c011748d 185193 0.373402 .text.lock.sched
c0115258 184852 0.372714 do_schedule
c0114628 171719 0.346234 try_to_wake_up
c013676c 170725 0.34423 .text.lock.slab
c013faa4 143586 0.28951 .text.lock.highmem
c01062dc 142571 0.287464 __down
c015ba1c 140737 0.283766 d_lookup
c014675c 130760 0.263649 __fput
c0152a30 126688 0.255439 open_namei

2002-09-20 07:00:56

by William Lee Irwin III

Subject: Re: 2.5.36-mm1 dbench 512 profiles

On Thu, Sep 19, 2002 at 04:38:14PM -0700, Andrew Morton wrote:
> It would be interesting to know the context switch rate
> during this test, and to see what things look like with HZ=100.

There is no obvious time when the machine appears idle, but regardless:
HZ == 100 with idle=poll. The numbers are still down, and I'm not 100%
sure why, so I'm backing out HZ = 100 and trying something else.

c01053dc 296114302 79.5237 poll_idle
c01053a4 20974286 5.6328 default_idle
c015c586 11098020 2.98046 .text.lock.dcache
c0154b43 10046044 2.69794 .text.lock.namei
c01317e4 9712304 2.60831 generic_file_write_nolock
c0130d00 3133118 0.841423 file_read_actor
c0114a98 2300611 0.617847 load_balance
c013f7fc 1615733 0.433917 blk_queue_bounce
c019f6bb 1591350 0.427369 .text.lock.dec_and_lock
c01066a8 1554624 0.417506 .text.lock.semaphore
c019f640 1000989 0.268823 atomic_dec_and_lock
c0114f10 851084 0.228565 scheduler_tick
c0152378 639577 0.171763 path_lookup
c0144dc0 456594 0.122622 generic_file_llseek
c0114628 407284 0.109379 try_to_wake_up
c0115258 399893 0.107394 do_schedule
c015adf0 370906 0.0996096 prune_dcache
c011748d 366470 0.0984183 .text.lock.sched
c015b6d4 306140 0.0822162 d_instantiate
c013faa4 292987 0.0786839 .text.lock.highmem
c013676c 291664 0.0783286 .text.lock.slab
c01062dc 282983 0.0759972 __down
c014675c 281106 0.0754932 __fput

2002-09-20 07:43:45

by Maneesh Soni

Subject: Re: 2.5.36-mm1 dbench 512 profiles

On Fri, 20 Sep 2002 05:48:38 +0530, William Lee Irwin III wrote:

> Hanna Linder wrote:
>>> Looks like fastwalk might not behave so well on this 32 cpu numa
>>> system...
>
> On Thu, Sep 19, 2002 at 04:38:14PM -0700, Andrew Morton wrote:
>> I've rather lost the plot. Have any of the dcache speedup patches been
>> merged into 2.5?
>
> As far as the dcache goes, I'll stick to observing and reporting. I'll
> rerun with dcache patches applied, though.
>
..
> Thanks,
> Bill
> -

For a 32-way system fastwalk will perform badly from dcache_lock point of
view, basically due to increased lock hold time. dcache_rcu-12 should reduce
dcache_lock contention and hold time. The patch uses RCU infrastructure patch and
read_barrier_depends patch. The patches are available in Read-Copy-Update
section on lse site at

http://sourceforge.net/projects/lse

Regards
Maneesh

--
Maneesh Soni
IBM Linux Technology Center,
IBM India Software Lab, Bangalore.
Phone: +91-80-5044999 email: [email protected]
http://lse.sourceforge.net/

2002-09-20 08:07:31

by William Lee Irwin III

Subject: Re: 2.5.36-mm1 dbench 512 profiles

On Fri, 20 Sep 2002 05:48:38 +0530, William Lee Irwin III wrote:
>> As far as the dcache goes, I'll stick to observing and reporting. I'll
>> rerun with dcache patches applied, though.

On Fri, Sep 20, 2002 at 01:29:28PM +0530, Maneesh Soni wrote:
> For a 32-way system fastwalk will perform badly from dcache_lock
> point of view, basically due to increased lock hold time.
> dcache_rcu-12 should reduce dcache_lock contention and hold time. The
> patch uses RCU infrastructure patch and read_barrier_depends patch.
> The patches are available in Read-Copy-Update section on lse site at
> http://sourceforge.net/projects/lse

ISTR Hubertus mentioning this at OLS, and it sounded like a problem to
me. I'm doing some runs with this to see if it fixes the problem.


Cheers,
Bill

2002-09-20 12:05:03

by William Lee Irwin III

Subject: Re: 2.5.36-mm1 dbench 512 profiles

On Fri, Sep 20, 2002 at 01:29:28PM +0530, Maneesh Soni wrote:
>> For a 32-way system fastwalk will perform badly from dcache_lock
>> point of view, basically due to increased lock hold time.
>> dcache_rcu-12 should reduce dcache_lock contention and hold time. The
>> patch uses RCU infrastructure patch and read_barrier_depends patch.
>> The patches are available in Read-Copy-Update section on lse site at
>> http://sourceforge.net/projects/lse

On Fri, Sep 20, 2002 at 01:06:28AM -0700, William Lee Irwin III wrote:
> ISTR Hubertus mentioning this at OLS, and it sounded like a problem to
> me. I'm doing some runs with this to see if it fixes the problem.

AFAICT, with one bottleneck out of the way, a new one merely arises to
take its place. Ugly. OTOH the qualitative difference is striking. The
interactive responsiveness of the machine, even when entirely unloaded,
is drastically improved, along with such nice things as init scripts
and kernel compiles also markedly faster. I suspect this is just the
wrong benchmark to show throughput benefits with.

Also notable is that the system time was significantly reduced though
I didn't log it. Essentially a long period of 100% system time is
entered after a certain point in the benchmark, during which there are
few (around 60 or 70) context switches in a second, and the duration
of this period was shortened.

The results here contradict my prior conclusions wrt. HZ 100 vs. 1000.

IMHO this worked, and the stuff around generic_file_write_nolock(),
file_read_actor(), whatever is hammering semaphore.c, and reducing
blk_queue_bounce() traffic are the next issues to address. Any ideas?


dcache_rcu, HZ == 1000:
Throughput 36.5059 MB/sec (NB=45.6324 MB/sec 365.059 MBit/sec) 512 procs
---------------------------------------------------------------------------
c01053dc 320521015 90.6236 poll_idle
c0114ab8 13559139 3.83369 load_balance
c0114f30 3146028 0.889502 scheduler_tick
c011751d 2702819 0.76419 .text.lock.sched
c0131110 2534516 0.716605 file_read_actor
c0131bf4 1307874 0.369786 generic_file_write_nolock
c0111798 1243507 0.351587 smp_apic_timer_interrupt
c01066a8 1108969 0.313548 .text.lock.semaphore
c013fc0c 772807 0.218502 blk_queue_bounce
c01152e4 559869 0.158296 do_schedule
c0114628 323975 0.0916001 try_to_wake_up
c010d9d8 304144 0.0859931 timer_interrupt
c01062dc 271440 0.0767465 __down
c013feb4 240824 0.0680902 .text.lock.highmem
c01450b0 224874 0.0635805 generic_file_llseek
c019f55b 214729 0.0607121 .text.lock.dec_and_lock
c0136b7c 208790 0.0590329 .text.lock.slab
c0122ef4 185013 0.0523103 update_one_process
c0146dee 135391 0.0382802 .text.lock.file_table
c01472dc 127782 0.0361289 __find_get_block_slow
c015ba4c 122577 0.0346572 d_lookup
c0173cd0 115446 0.032641 ext2_new_block
c0132958 114472 0.0323656 generic_file_write


dcache_rcu, HZ == 100:
Throughput 39.1471 MB/sec (NB=48.9339 MB/sec 391.471 MBit/sec) 512 procs
--------------------------------------------------------------------------
c01053dc 331775731 95.9799 poll_idle
c0131be4 3310063 0.957573 generic_file_write_nolock
c0131100 1552058 0.448997 file_read_actor
c0114a98 1491802 0.431565 load_balance
c01066a8 1048138 0.303217 .text.lock.semaphore
c013fbec 570986 0.165181 blk_queue_bounce
c0114f10 532451 0.154033 scheduler_tick
c01152c8 311667 0.0901626 do_schedule
c013fe94 239497 0.0692844 .text.lock.highmem
c0114628 222569 0.0643873 try_to_wake_up
c019f36b 220632 0.0638269 .text.lock.dec_and_lock
c01062dc 191477 0.0553926 __down
c0136b6c 164682 0.0476411 .text.lock.slab
c011750d 160221 0.0463506 .text.lock.sched
c014729c 123385 0.0356942 __find_get_block_slow
c0173b00 120967 0.0349947 ext2_new_block
c01387f0 111699 0.0323136 __free_pages_ok
c0146dae 104794 0.030316 .text.lock.file_table
c019f2f0 102715 0.0297146 atomic_dec_and_lock
c0145070 96505 0.0279181 generic_file_llseek
c01367c4 95436 0.0276088 s_show
c0138b24 91321 0.0264184 rmqueue
c01523a8 87421 0.0252901 path_lookup

mm1, HZ == 1000:
Throughput 36.3452 MB/sec (NB=45.4315 MB/sec 363.452 MBit/sec) 512 procs
--------------------------------------------------------------------------
c01053dc 291824934 78.5936 poll_idle
c0114ab8 15361229 4.13705 load_balance
c01053a4 14040139 3.78126 default_idle
c015c5c6 7489522 2.01706 .text.lock.dcache
c01317f4 5707336 1.53709 generic_file_write_nolock
c0114f30 5425740 1.46125 scheduler_tick
c0130d10 5397721 1.4537 file_read_actor
c0154b83 3917278 1.05499 .text.lock.namei
c011749d 3508427 0.944882 .text.lock.sched
c019f8ab 2415903 0.650646 .text.lock.dec_and_lock
c01066a8 1615952 0.435205 .text.lock.semaphore
c0111798 1461670 0.393654 smp_apic_timer_interrupt
c013f81c 1330609 0.358357 blk_queue_bounce
c015ba5c 780847 0.210296 d_lookup
c013fac4 578235 0.155729 .text.lock.highmem
c0115274 542453 0.146092 do_schedule
c0114628 441528 0.118911 try_to_wake_up
c010d9d8 437417 0.117804 timer_interrupt
c01523b8 399484 0.107588 path_lookup
c01062dc 362925 0.0977422 __down
c019f830 275515 0.0742011 atomic_dec_and_lock
c0122e94 271817 0.0732051 update_one_process
c0144e00 260097 0.0700487 generic_file_llseek

mm1, HZ == 100:
Throughput 39.0368 MB/sec (NB=48.796 MB/sec 390.368 MBit/sec) 512 procs
-------------------------------------------------------------------------
c01053dc 572091962 84.309 poll_idle
c01053a4 20974286 3.09097 default_idle
c015c586 17014849 2.50747 .text.lock.dcache
c0154b43 16074116 2.36884 .text.lock.namei
c01317e4 14653053 2.15942 generic_file_write_nolock
c0130d00 5295158 0.780346 file_read_actor
c0114a98 3437483 0.506581 load_balance
c019f6bb 2455126 0.361811 .text.lock.dec_and_lock
c013f7fc 2428344 0.357864 blk_queue_bounce
c01066a8 2379650 0.350688 .text.lock.semaphore
c019f640 1525996 0.224886 atomic_dec_and_lock
c0114f10 1328712 0.195812 scheduler_tick
c0152378 923439 0.136087 path_lookup
c0144dc0 692727 0.102087 generic_file_llseek
c0115258 599269 0.0883141 do_schedule
c0114628 593380 0.0874462 try_to_wake_up
c011748d 574637 0.0846841 .text.lock.sched
c015adf0 516917 0.0761779 prune_dcache
c013676c 496571 0.0731795 .text.lock.slab
c013faa4 471971 0.0695542 .text.lock.highmem
c015b6d4 444406 0.065492 d_instantiate
c01062dc 436983 0.064398 __down
c014675c 420142 0.0619162 __fput

2002-09-20 14:29:56

by Dave Hansen

Subject: Re: 2.5.36-mm1 dbench 512 profiles

Maneesh Soni wrote:
> On Fri, 20 Sep 2002 05:48:38 +0530, William Lee Irwin III wrote:
>
>>Hanna Linder wrote:
>>
>>>>Looks like fastwalk might not behave so well on this 32 cpu numa
>>>>system...
>>
>>On Thu, Sep 19, 2002 at 04:38:14PM -0700, Andrew Morton wrote:
>>
>>>I've rather lost the plot. Have any of the dcache speedup patches been
>>>merged into 2.5?
>>
>>As far as the dcache goes, I'll stick to observing and reporting. I'll
>>rerun with dcache patches applied, though.
>>
> For a 32-way system fastwalk will perform badly from dcache_lock point of
> view, basically due to increased lock hold time. dcache_rcu-12 should reduce
> dcache_lock contention and hold time.

Isn't increased hold time _good_ on NUMA-Q? I thought that the really
costly operation was bouncing the lock around the interconnect, not
holding it. Has fastwalk ever been tested on NUMA-Q?

Remember when John Stultz tried MCS (fair) locks on NUMA-Q? They
sucked because low hold times, which result from fairness, aren't
efficient. It is actually faster to somewhat starve remote CPUs.
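
For reference, MCS locks hand the lock off in strict arrival (FIFO) order,
with each waiter spinning on its own queue node; a compressed C11-atomics
sketch of the algorithm (my own, not the code John used):

#include <stdatomic.h>
#include <stddef.h>

struct mcs_node {
        _Atomic(struct mcs_node *) next;
        atomic_bool locked;
};

struct mcs_lock {
        _Atomic(struct mcs_node *) tail;        /* NULL when free */
};

static void mcs_acquire(struct mcs_lock *l, struct mcs_node *me)
{
        struct mcs_node *prev;

        atomic_store_explicit(&me->next, NULL, memory_order_relaxed);
        atomic_store_explicit(&me->locked, true, memory_order_relaxed);
        prev = atomic_exchange_explicit(&l->tail, me, memory_order_acq_rel);
        if (prev) {
                /* queue up behind prev and spin on our own node only */
                atomic_store_explicit(&prev->next, me, memory_order_release);
                while (atomic_load_explicit(&me->locked, memory_order_acquire))
                        ;
        }
}

static void mcs_release(struct mcs_lock *l, struct mcs_node *me)
{
        struct mcs_node *next =
                atomic_load_explicit(&me->next, memory_order_acquire);

        if (!next) {
                struct mcs_node *expect = me;
                /* no visible successor: try to reset the lock to free */
                if (atomic_compare_exchange_strong_explicit(&l->tail, &expect,
                                NULL, memory_order_acq_rel, memory_order_acquire))
                        return;
                /* a successor is mid-enqueue; wait for it to link in */
                while (!(next = atomic_load_explicit(&me->next,
                                                     memory_order_acquire)))
                        ;
        }
        /* hand off strictly to the next waiter in arrival order */
        atomic_store_explicit(&next->locked, false, memory_order_release);
}

int main(void)
{
        struct mcs_lock l = { NULL };
        struct mcs_node me;

        mcs_acquire(&l, &me);
        /* critical section */
        mcs_release(&l, &me);
        return 0;
}

The strict FIFO handoff is what makes it fair, and also what forces the
lock (and the data it protects) to hop to whichever node the next waiter
happens to be on.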

In any case, we all know often acquired global locks are a bad idea on
a 32-way, and should be avoided like the plague. I just wish we had a
dcache solution that didn't even need locks as much... :)

--
Dave Hansen
[email protected]

2002-09-20 16:04:21

by Martin J. Bligh

Subject: Re: 2.5.36-mm1 dbench 512 profiles

>> For a 32-way system fastwalk will perform badly from dcache_lock point of
>> view, basically due to increased lock hold time. dcache_rcu-12 should reduce
>> dcache_lock contention and hold time.
>
> Isn't increased hold time _good_ on NUMA-Q? I thought that the
> really costly operation was bouncing the lock around the interconnect,
> not holding it.

Depends what you get in return. The object of fastwalk was to stop the
cacheline bouncing on all the individual dentry counters, at the cost
of increased dcache_lock hold times. It's a tradeoff ... and in this
instance it wins. In general, long lock hold times are bad.
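
To make the tradeoff concrete, here is a toy userspace comparison of the
two locking patterns (illustration only; it has nothing to do with the
real namei/dcache code, and it is single-threaded just to show the
structure -- the interesting behaviour needs many CPUs hammering the same
lock):

#include <pthread.h>
#include <stdio.h>

#define COMPONENTS 8    /* stand-in for path components per lookup */

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static long shared_counter;

/* pre-fastwalk style: one acquisition per component -- short holds,
 * but the lock (and each dentry counter) bounces once per component */
static void walk_per_component(void)
{
        for (int i = 0; i < COMPONENTS; i++) {
                pthread_mutex_lock(&lock);
                shared_counter++;       /* stands in for d_lookup() + dget() */
                pthread_mutex_unlock(&lock);
        }
}

/* fastwalk style: one acquisition for the whole walk -- the lock
 * bounces once, at the cost of a much longer single hold time */
static void walk_whole_path(void)
{
        pthread_mutex_lock(&lock);
        for (int i = 0; i < COMPONENTS; i++)
                shared_counter++;       /* stands in for __d_lookup() */
        pthread_mutex_unlock(&lock);
}

int main(void)
{
        walk_per_component();
        walk_whole_path();
        printf("components looked up: %ld\n", shared_counter);
        return 0;
}

Which one wins depends on whether cacheline transfers or hold time
dominates, which is exactly the argument in the rest of this thread.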

> Has fastwalk ever been tested on NUMA-Q?

Yes, in 2.4. Gave good results, I forget exactly what ... something
like 5-10% off kernel compile times.

> Remember when John Stultz tried MCS (fair) locks on NUMA-Q? They
> sucked because low hold times, which result from fairness, aren't
> efficient. It is actually faster to somewhat starve remote CPUs.

Nothing to do with low hold times - it's to do with bouncing the
lock between nodes.

> In any case, we all know often acquired global locks are a bad idea
> on a 32-way, and should be avoided like the plague. I just wish we
> had a dcache solution that didn't even need locks as much... :)

Well, avoiding data corruption is a preferable goal too. The point of
RCU is not to have to take a lock for the common read case. I'd expect
good results from it on the NUMA machines - never been benchmarked, as
far as I recall.
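
To make the read-side pattern concrete, a standalone userspace sketch
written against the liburcu API rather than against dcache_rcu itself
(illustration only; the struct and names here are made up, and the kernel
patch uses the in-kernel RCU infrastructure):

#include <urcu.h>               /* userspace RCU; link with -lurcu */
#include <stdio.h>
#include <stdlib.h>

struct entry {
        const char *name;
};

static struct entry *shared;    /* pointer readers traverse locklessly */

static void reader(void)
{
        struct entry *e;

        rcu_read_lock();                        /* no spinlock, no refcount */
        e = rcu_dereference(shared);
        if (e)
                printf("reader saw %s\n", e->name);
        rcu_read_unlock();
}

static void update(const char *name)
{
        struct entry *new = malloc(sizeof(*new));
        struct entry *old = shared;             /* single updater assumed */

        new->name = name;
        rcu_assign_pointer(shared, new);        /* publish the new version */
        synchronize_rcu();                      /* wait out existing readers */
        free(old);                              /* now safe to reclaim */
}

int main(void)
{
        rcu_register_thread();
        update("first");
        reader();
        update("second");
        reader();
        rcu_unregister_thread();
        return 0;
}

The point is that the read side costs no lock and no shared-counter write
at all; updaters pay for that with the grace-period wait.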

M.

2002-09-20 17:31:16

by Dipankar Sarma

Subject: Re: 2.5.36-mm1 dbench 512 profiles

On Fri, Sep 20, 2002 at 02:37:41PM +0000, Dave Hansen wrote:
> Isn't increased hold time _good_ on NUMA-Q? I thought that the really
> costly operation was bouncing the lock around the interconnect, not

Increased hold time isn't necessarily good. If you acquire the lock
often, your lock wait time will increase correspondingly. The ultimate
goal should be to decrease the total number of acquisitions.
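
A crude back-of-envelope for that (my own numbers, assuming every
acquisition is fully contended by all C CPUs, with H the hold time,
W the wait time and A the number of acquisitions):

\[
W \approx (C - 1)\,H
\qquad\Longrightarrow\qquad
T_{\mathrm{lock}} \approx A\,(H + W) \approx A\,C\,H
\]

With C = 32 every acquisition costs on the order of 32 hold times, so
cutting the acquisition count A (as fastwalk and dcache_rcu do) pays off
even if each individual hold gets somewhat longer.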

> holding it. Has fastwalk ever been tested on NUMA-Q?

Fastwalk is in 2.5. You can see wli's profile numbers for dbench 512
earlier in this thread.

>
> Remember when John Stultz tried MCS (fair) locks on NUMA-Q? They
> sucked because low hold times, which result from fairness, aren't
> efficient. It is actually faster to somewhat starve remote CPUs.

One workaround is to keep scheduling the lock within the CPUs of
a node as much as possible and release it to a different node
only if there isn't any CPU available in the current node. Anyway
these are not real solutions, just band-aids.

>
> In any case, we all know often acquired global locks are a bad idea on
> a 32-way, and should be avoided like the plague. I just wish we had a
> dcache solution that didn't even need locks as much... :)

You have one - dcache_rcu. It reduces the dcache_lock acquisition
by about 65% over fastwalk.

Thanks
--
Dipankar Sarma <[email protected]> http://lse.sourceforge.net
Linux Technology Center, IBM Software Lab, Bangalore, India.

2002-09-20 17:38:47

by Dipankar Sarma

Subject: Re: 2.5.36-mm1 dbench 512 profiles

On Fri, Sep 20, 2002 at 04:17:10PM +0000, Martin J. Bligh wrote:
> > Isn't increased hold time _good_ on NUMA-Q? I thought that the
> > really costly operation was bouncing the lock around the interconnect,
> > not holding it.
>
> Depends what you get in return. The object of fastwalk was to stop the
> cacheline bouncing on all the individual dentry counters, at the cost
> of increased dcache_lock hold times. It's a tradeoff ... and in this
> instance it wins. In general, long lock hold times are bad.

I don't think individual dentry counters are as much a problem as
acquisition of dcache_lock for every path component lookup as done
by the earlier path walking algorithm. The big deal with fastwalk
is that it decreases the number of acquisitions of dcache_lock
for a webserver workload by 70% on an 8-CPU machine. That is avoiding
a lot of possible cacheline bouncing of dcache_lock.


> > In any case, we all know often acquired global locks are a bad idea
> > on a 32-way, and should be avoided like the plague. I just wish we
> > had a dcache solution that didn't even need locks as much... :)
>
> Well, avoiding data corruption is a preferable goal too. The point of
> RCU is not to have to take a lock for the common read case. I'd expect
> good results from it on the NUMA machines - never been benchmarked, as
> far as I recall.

You can see that in wli's dbench 512 results on his NUMA box.

Thanks
--
Dipankar Sarma <[email protected]> http://lse.sourceforge.net
Linux Technology Center, IBM Software Lab, Bangalore, India.

2002-09-20 18:41:49

by Hanna Linder

Subject: Re: 2.5.36-mm1 dbench 512 profiles

--On Friday, September 20, 2002 05:03:58 -0700 William Lee Irwin III <[email protected]> wrote:

> On Fri, Sep 20, 2002 at 01:29:28PM +0530, Maneesh Soni wrote:
>>> For a 32-way system fastwalk will perform badly from dcache_lock
>>> point of view, basically due to increased lock hold time.
>>> dcache_rcu-12 should reduce dcache_lock contention and hold time. The
>>> patch uses RCU infrastructure patch and read_barrier_depends patch.
>>> The patches are available in Read-Copy-Update section on lse site at
>>> http://sourceforge.net/projects/lse
>
> On Fri, Sep 20, 2002 at 01:06:28AM -0700, William Lee Irwin III wrote:
>> ISTR Hubertus mentioning this at OLS, and it sounded like a problem to
>> me. I'm doing some runs with this to see if it fixes the problem.

I mentioned it at OLS too. It was the point of my talk. Next
time I will request a non 10am time slot!

> take its place. Ugly. OTOH the qualitative difference is striking. The
> interactive responsiveness of the machine, even when entirely unloaded,
> is drastically improved, along with such nice things as init scripts
> and kernel compiles also markedly faster. I suspect this is just the
> wrong benchmark to show throughput benefits with.
>
> Also notable is that the system time was significantly reduced though
> I didn't log it. Essentially a long period of 100% system time is
> entered after a certain point in the benchmark, during which there are
> few (around 60 or 70) context switches in a second, and the duration
> of this period was shortened.

Bill, are you saying that replacing fastwalk with dcache_rcu significantly
improved system response time, among other things?

Perhaps it is time to reconsider replacing fastwalk with dcache_rcu.

Viro? What are your objections?

Thanks.

Hanna

2002-09-20 20:19:30

by Dipankar Sarma

Subject: Re: 2.5.36-mm1 dbench 512 profiles

On Fri, Sep 20, 2002 at 11:10:20PM +0530, Dipankar Sarma wrote:
> >
> > In any case, we all know often acquired global locks are a bad idea on
> > a 32-way, and should be avoided like the plague. I just wish we had a
> > dcache solution that didn't even need locks as much... :)
>
> You have one - dcache_rcu. It reduces the dcache_lock acquisition
> by about 65% over fastwalk.

I should clarify, this was with a webserver benchmark.

For those who want to use them, Maneesh's dcache_rcu-12 patch and my
RCU "performance" infrastructure patches are in -

http://sourceforge.net/project/showfiles.php?group_id=8875&release_id=111743

The latest release is 2.5.36-mm1.
rcu_ltimer and read_barrier_depends are pre-requisites for dcache_rcu.

Thanks
--
Dipankar Sarma <[email protected]> http://lse.sourceforge.net
Linux Technology Center, IBM Software Lab, Bangalore, India.

2002-09-20 20:24:37

by Hanna Linder

Subject: Re: 2.5.36-mm1 dbench 512 profiles

--On Friday, September 20, 2002 11:51:13 -0700 Hanna Linder <[email protected]> wrote:

>
> Perhaps it is time to reconsider replacing fastwalk with dcache_rcu.

These patches were written by Maneesh Soni. Since the Read-Copy Update
infrastructure has not been accepted into the mainline kernel yet (although
there were murmurings of it being acceptable) you will need to apply
those first. Here they are, apply in this order. Too big to post
inline text though. These are provided against 2.5.36-mm1.


http://prdownloads.sourceforge.net/lse/rcu_ltimer-2.5.36-mm1

http://prdownloads.sourceforge.net/lse/read_barrier_depends-2.5.36-mm1

http://prdownloads.sourceforge.net/lse/dcache_rcu-12-2.5.36-mm1

There has been quite a bit of testing done on this and it has proven
quite stable. If anyone wants to do any additional testing that would
be great.

Thanks.

Hanna

2002-09-20 20:45:04

by Dipankar Sarma

Subject: Re: 2.5.36-mm1 dbench 512 profiles

On Fri, Sep 20, 2002 at 08:32:48PM +0000, Hanna Linder wrote:
> --On Friday, September 20, 2002 11:51:13 -0700 Hanna Linder <[email protected]> wrote:
>
> >
> > Perhaps it is time to reconsider replacing fastwalk with dcache_rcu.
>
> These patches were written by Maneesh Soni. Since the Read-Copy Update
> infrastructure has not been accepted into the mainline kernel yet (although
> there were murmurings of it being acceptable) you will need to apply
> those first. Here they are, apply in this order. Too big to post
> inline text though. These are provided against 2.5.36-mm1.
>
>
> http://prdownloads.sourceforge.net/lse/rcu_ltimer-2.5.36-mm1
>
> http://prdownloads.sourceforge.net/lse/read_barrier_depends-2.5.36-mm1
>
> http://prdownloads.sourceforge.net/lse/dcache_rcu-12-2.5.36-mm1
>
> There has been quite a bit of testing done on this and it has proven
> quite stable. If anyone wants to do any additional testing that would
> be great.

Thanks for the vote of confidence :)

Now for some results (out of date, but they also include results with code
backported from 2.5), see http://lse.sf.net/locking/dcache/dcache.html.

Preliminary profiling of webserver benchmarks in 2.5.3X shows similar potential
for dcache_rcu. I will have actual results published when we can
get formal runs done.

Thanks
--
Dipankar Sarma <[email protected]> http://lse.sourceforge.net
Linux Technology Center, IBM Software Lab, Bangalore, India.

2002-09-20 20:40:47

by William Lee Irwin III

Subject: Re: 2.5.36-mm1 dbench 512 profiles

On Fri, Sep 20, 2002 at 11:51:13AM -0700, Hanna Linder wrote:
> I mentioned it at OLS too. It was the point of my talk. Next
> time I will request a non 10am time slot!

10AM is relatively early in the morning for me. =)


On Friday, September 20, 2002 05:03:58 -0700 William Lee Irwin III <[email protected]> wrote:
>> take its place. Ugly. OTOH the qualitative difference is striking. The
>> interactive responsiveness of the machine, even when entirely unloaded,
>> is drastically improved, along with such nice things as init scripts
>> and kernel compiles also markedly faster. I suspect this is just the
>> wrong benchmark to show throughput benefits with.
>> Also notable is that the system time was significantly reduced though
>> I didn't log it. Essentially a long period of 100% system time is
>> entered after a certain point in the benchmark, during which there are
>> few (around 60 or 70) context switches in a second, and the duration
>> of this period was shortened.

On Fri, Sep 20, 2002 at 11:51:13AM -0700, Hanna Linder wrote:
> Bill, are you saying that replacing fastwalk with dcache_rcu significantly
> improved system response time, among other things?
> Perhaps it is time to reconsider replacing fastwalk with dcache_rcu.
> Viro? What are your objections?

Basically, the big ones get laggy, and laggier with more cpus. This fixed
a decent amount of that.

Another thing to note is that the max bandwidth of these disks is 40MB/s,
so we're running pretty close to peak anyway. I need to either get an FC
cable or something to see larger bandwidth gains.


Cheers,
Bill

2002-09-20 21:30:12

by Martin J. Bligh

Subject: Re: 2.5.36-mm1 dbench 512 profiles

> AFAICT, with one bottleneck out of the way, a new one merely arises to
> take its place. Ugly. OTOH the qualitative difference is striking. The
> interactive responsiveness of the machine, even when entirely unloaded,
> is drastically improved, along with such nice things as init scripts
> and kernel compiles also markedly faster. I suspect this is just the
> wrong benchmark to show throughput benefits with.
>
> Also notable is that the system time was significantly reduced though
> I didn't log it. Essentially a long period of 100% system time is
> entered after a certain point in the benchmark, during which there are
> few (around 60 or 70) context switches in a second, and the duration
> of this period was shortened.
>
> The results here contradict my prior conclusions wrt. HZ 100 vs. 1000.

Hmmm ... I think you need the NUMA aware scheduler ;-)
On the plus side, that does look like RCU pretty much obliterated the dcache
problems ....

M.


2002-09-20 23:12:33

by William Lee Irwin III

Subject: Re: 2.5.36-mm1 dbench 512 profiles

At some point in the past, I wrote:
>> AFAICT, with one bottleneck out of the way, a new one merely arises to
>> take its place. Ugly. OTOH the qualitative difference is striking. The
>> interactive responsiveness of the machine, even when entirely unloaded,
>> is drastically improved, along with such nice things as init scripts
>> and kernel compiles also markedly faster. I suspect this is just the
>> wrong benchmark to show throughput benefits with.

On Fri, Sep 20, 2002 at 02:30:23PM -0700, Martin J. Bligh wrote:
> Hmmm ... I think you need the NUMA aware scheduler ;-)
> On the plus side, that does look like RCU pretty much obliterated the dcache
> problems ....

This sounds like a likely solution to the expense of load_balance().
Do you have a patch for it floating around?


Thanks,
Bill

2002-09-20 23:20:44

by Martin J. Bligh

Subject: Re: 2.5.36-mm1 dbench 512 profiles

>>> AFAICT, with one bottleneck out of the way, a new one merely arises to
>>> take its place. Ugly. OTOH the qualitative difference is striking. The
>>> interactive responsiveness of the machine, even when entirely unloaded,
>>> is drastically improved, along with such nice things as init scripts
>>> and kernel compiles also markedly faster. I suspect this is just the
>>> wrong benchmark to show throughput benefits with.
>
> On Fri, Sep 20, 2002 at 02:30:23PM -0700, Martin J. Bligh wrote:
>> Hmmm ... I think you need the NUMA aware scheduler ;-)
>> On the plus side, that does look like RCU pretty much obliterated the dcache
>> problems ....
>
> This sounds like a likely solution to the expense of load_balance().
> Do you have a patch for it floating around?

I have a really old hacky one from Mike Kravetz, or Michael Hohnbaum
is working on something new, but I don't think it's ready yet ....
I think Mike's will need some rework. Will send it to you ...

M.

2002-09-21 07:53:28

by William Lee Irwin III

Subject: Re: 2.5.36-mm1 dbench 512 profiles

On Fri, Sep 20, 2002 at 05:03:58AM -0700, William Lee Irwin III wrote:
> Also notable is that the system time was significantly reduced though
> I didn't log it. Essentially a long period of 100% system time is
> entered after a certain point in the benchmark, during which there are
> few (around 60 or 70) context switches in a second, and the duration
> of this period was shortened.

A radical difference is present in 2.5.37: the long period of 100%
system time is instead a long period of idle time.

I don't have an oprofile run against 2.5.37 yet, but I'll report back when I do.


Cheers,
Bill