2002-12-08 13:02:17

by Anton Blanchard

[permalink] [raw]
Subject: 2.5.50-BK + 24 CPUs


Hi,

I found time to run a few benchmarks over a largish machine (24 way
ppc64) running 2.5.50-BK from a few days ago.

1. kernel compile benchmark (ie build an x86 2.4.18 kernel)

I hijacked /proc/profile to log functions where we call schedule from.
It shows:

schedules:
56283 total
41984 pipe_wait
9746 do_work
1949 do_exit
1834 sys_wait4

ie during the compile we scheduled 56283 times, and 41984 of them were
caused by pipes. Simple fix: remove -pipe from the Makefile of the
kernel I was building:

schedules:
8497 total
3665 do_work
1878 do_exit
1824 sys_wait4
306 cpu_idle
260 open_namei
256 pipe_wait

Much nicer. Does it make sense to use -pipe in our kernel Makefile these
days? Note "do_work" is a ppc64 assembly function which checks
need_resched and calls schedule if the timeslice has been exceeded. So
it's nice to see almost all of the schedules are due to timeslice
expiration, processes exiting or processes doing a wait().

Now we can look at the profile:

profile:
66260 total
54227 cpu_idle
1000 page_remove_rmap
909 __get_page_state
830 page_add_rmap
753 save_remaining_regs
646 do_anonymous_page
529 do_page_fault
475 release_pages
468 pSeries_flush_hash_range
462 pSeries_hpte_insert
266 __copy_tofrom_user
215 zap_pte_range
214 sys_brk
210 __pagevec_lru_add_active
209 buffered_rmqueue
201 find_get_page
185 vm_enough_memory
183 nr_free_pages

Mostly idle time; there's a limit to how much we can parallelize here.
Note: save_remaining_regs is the syscall/interrupt entry path for ppc64.

2. dbench 24

Let's not pay too much attention here, but there are a few things to keep
in mind:

schedules:
1635314 total
753694 cpu_idle
357910 ext2_new_block
289189 ext2_free_blocks
123788 ext2_new_inode
95025 ext2_free_inode

Whee, look at all the schedules we took inside the ext2 code. Of course
it's due to the superblock lock semaphore.

profile:
370142 total
302615 cpu_idle
8600 __copy_tofrom_user
3119 schedule
2760 current_kernel_time

Lots of idle time, in part due to the superblock lock (oh yeah, and my
slow-to-react finger stopping profiling after the benchmark finished).
current_kernel_time makes a recent appearance in the profile; we are
working on a number of things to address this.

3. "University workload"

A benchmark that does lots of shell scripts, cc, troff, etc.

schedules:
470212 total
126262 do_work
86986 ext2_free_blocks
58039 ext2_new_block
53627 cpu_idle
43140 ext2_new_inode
30934 ext2_free_inode
19849 do_exit
18526 sys_wait4

The superblock lock semaphore makes an appearance in the schedule
summary again (ext2_*). Now for the profile:

profile:
136296 total
41592 cpu_idle
16319 page_remove_rmap
7338 page_add_rmap
3583 save_remaining_regs
3072 pSeries_flush_hash_range
2832 release_pages
2584 do_page_fault
2281 find_get_page
2238 pSeries_hpte_insert
2117 copy_page_range
2085 current_kernel_time
2028 zap_pte_range
1886 __get_page_state
1689 atomic_dec_and_lock

No big surprises in the profile. This benchmark tends to be a worst-case
scenario for rmap; think of 100s of shells all mapping the same text
pages.

Anton


2002-12-08 14:41:36

by Rik van Riel

[permalink] [raw]
Subject: Re: 2.5.50-BK + 24 CPUs

On Mon, 9 Dec 2002, Anton Blanchard wrote:

> profile:
> 66260 total
> 54227 cpu_idle
> 1000 page_remove_rmap
> 909 __get_page_state
> 830 page_add_rmap

Looks like the bitflag locking in rmap is hurting you.
How does it work with a real spinlock in the struct page
instead of using a bit in page->flags ?

regards,

Rik
--
Bravely reimplemented by the knights who say "NIH".
http://www.surriel.com/ http://guru.conectiva.com/
Current spamtrap: [email protected]

2002-12-08 16:38:35

by Rik van Riel

[permalink] [raw]
Subject: Re: 2.5.50-BK + 24 CPUs

On Sun, 8 Dec 2002, Rik van Riel wrote:
> On Mon, 9 Dec 2002, Anton Blanchard wrote:
>
> > profile:
> > 66260 total
> > 54227 cpu_idle
> > 1000 page_remove_rmap
> > 909 __get_page_state
> > 830 page_add_rmap
>
> Looks like the bitflag locking in rmap is hurting you.
> How does it work with a real spinlock in the struct page
> instead of using a bit in page->flags ?

In particular, something like the (completely untested) patch
below. Yes, this patch is on the wrong side of the space/time
tradeoff for machines with highmem, but it might be worth it
for 64 bit machines, especially those with slow bitops.

regards,

Rik
--
Bravely reimplemented by the knights who say "NIH".
http://www.surriel.com/ http://guru.conectiva.com/
Current spamtrap: [email protected]


===== include/linux/mm.h 1.97 vs edited =====
--- 1.97/include/linux/mm.h Thu Nov 7 08:48:53 2002
+++ edited/include/linux/mm.h Sun Dec 8 14:36:44 2002
@@ -169,6 +169,7 @@
* protected by PG_chainlock */
pte_addr_t direct;
} pte;
+ spinlock_t ptechain_lock; /* Lock for pte.chain and pte.direct */
unsigned long private; /* mapping-private opaque data */

/*
===== include/linux/rmap-locking.h 1.1 vs edited =====
--- 1.1/include/linux/rmap-locking.h Sun Sep 1 17:56:32 2002
+++ edited/include/linux/rmap-locking.h Sun Dec 8 14:37:49 2002
@@ -14,20 +14,10 @@
* busywait with less bus contention for a good time to
* attempt to acquire the lock bit.
*/
- preempt_disable();
-#ifdef CONFIG_SMP
- while (test_and_set_bit(PG_chainlock, &page->flags)) {
- while (test_bit(PG_chainlock, &page->flags))
- cpu_relax();
- }
-#endif
+ spin_lock(&page->ptechain_lock);
}

static inline void pte_chain_unlock(struct page *page)
{
-#ifdef CONFIG_SMP
- smp_mb__before_clear_bit();
- clear_bit(PG_chainlock, &page->flags);
-#endif
- preempt_enable();
+ spin_unlock(&page->ptechain_lock);
}
===== mm/page_alloc.c 1.135 vs edited =====
--- 1.135/mm/page_alloc.c Mon Dec 2 18:31:01 2002
+++ edited/mm/page_alloc.c Sun Dec 8 14:39:06 2002
@@ -1129,6 +1129,7 @@
struct page *page = lmem_map + local_offset + i;
set_page_zone(page, nid * MAX_NR_ZONES + j);
set_page_count(page, 0);
+ page->ptechain_lock = SPIN_LOCK_UNLOCKED;
SetPageReserved(page);
INIT_LIST_HEAD(&page->list);
#ifdef WANT_PAGE_VIRTUAL

2002-12-08 21:14:28

by Manfred Spraul

[permalink] [raw]
Subject: Re: 2.5.50-BK + 24 CPUs

Anton wrote:

>schedules:
> 56283 total
> 41984 pipe_wait
> 9746 do_work
> 1949 do_exit
> 1834 sys_wait4
>
>ie during the compile we scheduled 56283 times, and 41984 of them were
>caused by pipes.
>
The Linux pipe implementation has only a page-sized buffer - with 4 kB
pages, transferring 1 MB through a pipe means at least 512 context switches.

--
Manfred

2002-12-08 21:20:55

by William Lee Irwin III

[permalink] [raw]
Subject: Re: 2.5.50-BK + 24 CPUs

Anton wrote:
>> ie during the compile we scheduled 56283 times, and 41984 of them were
>> caused by pipes.

On Sun, Dec 08, 2002 at 10:22:03PM +0100, Manfred Spraul wrote:
> The linux pipe implementation has only a page sized buffer - with 4 kB
> pages, transfering 1 MB through a pipe means at 512 context switches.

Hmm. What happened to that pipe buffer size increase patch? That sounds
like it might help here, but only if those things are trying to shove
more than 4KB through the pipe at a time.


Bill

2002-12-08 22:50:11

by David Miller

[permalink] [raw]
Subject: Re: 2.5.50-BK + 24 CPUs

On Sun, 2002-12-08 at 13:28, William Lee Irwin III wrote:
> Hmm. What happened to that pipe buffer size increase patch? That sounds
> like it might help here, but only if those things are trying to shove
> more than 4KB through the pipe at a time.

You probably mean the zero-copy pipe patches, which I think really
should go in. The most recent version of the diffs I saw didn't
use the zero copy bits unless the transfers were quite large, so it
should be ok and not pessimize small transfers.

That patch has been gathering cobwebs for more than a year since I
first did it, let's push this in already :-)

2002-12-08 22:53:55

by William Lee Irwin III

[permalink] [raw]
Subject: Re: 2.5.50-BK + 24 CPUs

On Sun, 2002-12-08 at 13:28, William Lee Irwin III wrote:
>> Hmm. What happened to that pipe buffer size increase patch? That sounds
>> like it might help here, but only if those things are trying to shove
>> more than 4KB through the pipe at a time.

On Sun, Dec 08, 2002 at 03:22:58PM -0800, David S. Miller wrote:
> You probably mean the zero-copy pipe patches, which I think really
> should go in. The most recent version of the diffs I saw didn't
> use the zero copy bits unless the transfers were quite large, so it
> should be ok and not pessimize small transfers.
> That patch has been gathering cobwebs for more than a year since I
> first did it, let's push this in already :-)

I was actually referring to one that explicitly used larger pipe
buffers, but this sounds useful too.


Bill

2002-12-09 16:55:53

by Manfred Spraul

[permalink] [raw]
Subject: Re: 2.5.50-BK + 24 CPUs

David S. Miller wrote:

>On Sun, 2002-12-08 at 13:28, William Lee Irwin III wrote:
>
>>Hmm. What happened to that pipe buffer size increase patch? That sounds
>>like it might help here, but only if those things are trying to shove
>>more than 4KB through the pipe at a time.
>
>You probably mean the zero-copy pipe patches, which I think really
>should go in. The most recent version of the diffs I saw didn't
>use the zero copy bits unless the transfers were quite large so it
>should be ok and not pessimize small transfers.
>
>That patch has been gathering cobwebs for more than a year since I
>first did it, let's push this in already :-)
>
Unfortunately zero-copy doesn't help to avoid the schedules: zero copy
just avoids the copy to kernel - you still need one schedule for each
page to be transferred.

The writer calls

	for (;;) {
		prepare_data(buf);
		write(fd, buf, PAGE_SIZE);
	}

and the reader calls

	for (;;) {
		read(fd, buf, PAGE_SIZE);
		use_data(buf);
	}

What's needed is a large kernel buffer - I've seen buffers between 64
and 256 kB in other unices. Zero copy only helps lmbench and other apps
where the whole working set fits into the CPU cache.

The difference between
main-mem->cache;cache->main_mem [non-zerocopy]
and
main-mem->main-mem [zerocopy, the copy to kernel is skipped]
is small.

--
Manfred

2002-12-09 14:04:05

by Anton Blanchard

[permalink] [raw]
Subject: Re: 2.5.50-BK + 24 CPUs


Thanks Rik,

> In particular, something like the (completely untested) patch
> below. Yes, this patch is on the wrong side of the space/time
> tradeoff for machines with highmem, but it might be worth it
> for 64 bit machines, especially those with slow bitops.

I'll give it a spin when I next get some time on the machine.

Anton

2002-12-09 19:41:42

by Daniel Drown

[permalink] [raw]
Subject: page_remove_rmap

On Sun, 8 Dec 2002, Rik van Riel wrote:
> Looks like the bitflag locking in rmap is hurting you.
> How does it work with a real spinlock in the struct page
> instead of using a bit in page->flags ?

On Sun, 08 Dec 2002, Rik van Riel wrote:
> In particular, something like the (completely untested) patch
> below. Yes, this patch is on the wrong side of the space/time
> tradeoff for machines with highmem, but it might be worth it
> for 64 bit machines, especially those with slow bitops.
[removed patch]

I also have a workload where page_remove_rmap appears at the top of the
profile output. So I took your suggestion, but I made the changes to
2.4.20-ac1 (which looks like rmap-14b).

Machine is 2x 2.2GHz Xeon P4 (HT enabled), 2GB RAM.

Nagios is configured with 8,000 services on 4,000 hosts with each service
being checked every 2 minutes. The service checks just execute /bin/true. No
"real" status checks are used for the purposes of this benchmark.

When nagios wants to execute a service check, it forks off a process to
monitor the service check and report its status back to the master process.
That process forks off another process to actually call exec. (I have
changed the default nagios behaviour, which calls fork three times to
execute a service check.)

The default behaviour of nagios is to call free() on most of the allocated
memory in the monitoring child, but this just led to higher memory usage
(glibc not able to free memory it allocated with brk because it's
fragmented, copies from freelist updates?). I changed this to just leave
the allocated memory alone, so this leads to high amounts of shared memory
(~20MB shared between the processes).

cpu time - cpu time spent in nagios and child processes
# checks - number of checks executed by nagios (rounded to the nearest 100)
nagios latency - difference between when nagios wanted to schedule a check and
when it ran
check exec time - how long it took for the check to finish

kernel cpu time # checks nagios latency check exec time
(user/sys/real) (rounded) (max/avg) (max/avg)
2.4.19-O1 36s/61s/141s 8500 1s/0.130s 1s/0.031s
2.4.20 40s/73s/141s 8500 1s/0.151s 1s/0.009s
2.4.20-ac1 21s/1029s/308s 4000 79s/3.456s 1s/0.011s
2.4.20-ac1+spin 21s/899s/276s 4000 72s/5.428s 1s/0.013s
2.4.18-18.8.0 23s/600s/187s 3800 84s/9.494s 2s/0.019s

The run queue on the -rmap based kernels (2.4.20-ac1*, RH 8.0's
2.4.18-18.8.0) gets up to 1500 at points. The machine is still pretty
responsive at this point, but when I kill all the nagios processes, the
box stops cold for a minute or two. (Kernel still responsive via ICMP;
userland not.)

2.4.19-O1 (sched-2.4.19-rc2-A4 + 2G/2G split) and 2.4.20 (vanilla) run queues
are mostly 0 with some jumps to 200.

These boxes aren't in production yet, so I can do testing on them.

Full kernel profiling output for everything but the rh kernel is available at:
http://dan.drown.org/nagios/profiling.html

--
It's looking like if MySQL AB doesn't make a movie based on the manual,
nobody's ever gonna learn how to use a database.
- r.

2002-12-09 20:11:55

by David Miller

[permalink] [raw]
Subject: Re: 2.5.50-BK + 24 CPUs

From: Manfred Spraul <[email protected]>
Date: Mon, 09 Dec 2002 18:03:10 +0100

Unfortunately zero-copy doesn't help to avoid the schedules:
Zero copy just avoids the copy to kernel - you still need one schedule
for each page to be transferred.

The zerocopy patches copied up to 64k (or rather, 16 pages, something
like that) at once; that's going to lead to 16 times fewer schedules.

The 64k number was decided arbitrarily (it's what freebsd's pipe code
uses) and it can be experimented with.

2002-12-09 21:05:26

by Manfred Spraul

[permalink] [raw]
Subject: Re: 2.5.50-BK + 24 CPUs

David S. Miller wrote:

> From: Manfred Spraul <[email protected]>
> Date: Mon, 09 Dec 2002 18:03:10 +0100
>
> Unfortunately zero-copy doesn't help to avoid the schedules:
> Zero copy just avoids the copy to kernel - you still need one schedule
> for each page to be transferred.
>
>The zerocopy patches copied up to 64k (or rather, 16 pages, something
>like that) at once, that's going to lead to 16 times less schedules.
>
>The 64k number was decided arbitrarily (it's what freebsd's pipe code
>uses) and it can be experimented with.
>
>
Only if user space writes in 64 kB chunks - if user space writes 4 kB
chunks, then zerocopy doesn't help much with the schedules [depending on
the implementation, it halves the number of schedules].
And page table (COW) tricks are not acceptable.

--
Manfred