2005-03-17 12:19:58

by Ian Pratt

Subject: 2.6.11 vs 2.6.10 slowdown on i686


Folks,

When we upgraded arch xen/x86 to kernel 2.6.11, we noticed a slowdown
on a number of micro-benchmarks. In order to investigate, I built
native (non Xen) i686 uniprocessor kernels for 2.6.10 and 2.6.11 with
the same configuration and ran lmbench-3.0-a3 on them. The test
machine was a 2.4GHz Xeon box, gcc 3.3.3 (FC3 default) was used to
compile the kernels, NOHIGHMEM=y (2-level only).

On the i686 fork and exec benchmarks I found that there's been a
significant slowdown between 2.6.10 and 2.6.11. Some of the other
numbers are a bit ugly too (see attached).

fork: 166 -> 235 (40% slowdown)
exec: 857 -> 1003 (17% slowdown)

I'm guessing this is down to the 4 level pagetables. This is rather a
surprise as I thought the compiler would optimise most of these
changes away. Apparently not.
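
For anyone who hasn't looked at the folding trick: roughly speaking, on a
2-level configuration the new pud/pmd levels are meant to collapse into
trivial inlines, something like the sketch below (an illustration only, not
the actual asm-generic headers), so in theory the extra levels should cost
nothing:

/*
 * Illustration only -- not the real folded-level headers.  A 2-level
 * architecture wraps each upper level around the one above it, and the
 * offset helpers just hand the same pointer back, so no extra memory is
 * touched and the compiler can, in principle, optimise the walk down to
 * the old 2-level one.
 */
typedef struct { unsigned long pgd; } pgd_t;
typedef struct { pgd_t pgd; } pud_t;            /* folded into the pgd */
typedef struct { pud_t pud; } pmd_t;            /* folded into the pud */

static inline pud_t *pud_offset(pgd_t *pgd, unsigned long address)
{
        return (pud_t *)pgd;                    /* same entry, no extra load */
}

static inline pmd_t *pmd_offset(pud_t *pud, unsigned long address)
{
        return (pmd_t *)pud;                    /* same entry, no extra load */
}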

Anyhow, this explains the arch Xen results we were seeing.

Results appended, median of 6 runs.

Best,
Ian


Processor, Processes - times in microseconds - smaller is better
------------------------------------------------------------------------------
Host OS Mhz null null open slct sig sig fork exec sh
call I/O stat clos TCP inst hndl proc proc proc
--------- ------------- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ----
commando- Linux 2.6.10 2400 0.49 0.57 2.06 3.06 19.6 0.89 2.70 166. 857. 2972
commando- Linux 2.6.11 2400 0.49 0.60 2.12 3.35 20.8 0.92 2.73 235. 1003 3168

Context switching - times in microseconds - smaller is better
-------------------------------------------------------------------------
Host OS 2p/0K 2p/16K 2p/64K 8p/16K 8p/64K 16p/16K 16p/64K
ctxsw ctxsw ctxsw ctxsw ctxsw ctxsw ctxsw
--------- ------------- ------ ------ ------ ------ ------ ------- -------
commando- Linux 2.6.10 7.5800 4.3300 8.1900 5.1100 33.1 8.37000 41.9
commando- Linux 2.6.11 7.9200 8.3200 8.3200 5.8300 26.6 9.46000 40.4

*Local* Communication latencies in microseconds - smaller is better
---------------------------------------------------------------------
Host OS 2p/0K Pipe AF UDP RPC/ TCP RPC/ TCP
ctxsw UNIX UDP TCP conn
--------- ------------- ----- ----- ---- ----- ----- ----- ----- ----
commando- Linux 2.6.10 7.750 19.4 21.3 37.2 45.5 42.5 53.2 76.
commando- Linux 2.6.11 7.920 20.3 23.6 40.2 50.1 46.5 57.6 87.

File & VM system latencies in microseconds - smaller is better
-------------------------------------------------------------------------------
Host OS 0K File 10K File Mmap Prot Page 100fd
Create Delete Create Delete Latency Fault Fault selct
--------- ------------- ------ ------ ------ ------ ------- ----- ------- -----
commando- Linux 2.6.10 39.3 16.2 92.7 35.2 122.0 1.200 2.14310 18.3
commando- Linux 2.6.11 40.8 16.8 99.5 36.7 163.0 1.075 2.27760 18.8

*Local* Communication bandwidths in MB/s - bigger is better
-----------------------------------------------------------------------------
Host OS Pipe AF TCP File Mmap Bcopy Bcopy Mem Mem
UNIX reread reread (libc) (hand) read write
--------- ------------- ---- ---- ---- ------ ------ ------ ------ ---- -----
commando- Linux 2.6.10 313. 440. 222. 1551.7 1528.5 549.1 566.8 1550 784.8
commando- Linux 2.6.11 554. 450. 224. 1564.8 1548.3 549.9 574.6 1528 760.5


2005-03-17 12:37:45

by Nick Piggin

Subject: Re: 2.6.11 vs 2.6.10 slowdown on i686

Ian Pratt wrote:
> Folks,
>
> When we upgraded arch xen/x86 to kernel 2.6.11, we noticed a slowdown
> on a number of micro-benchmarks. In order to investigate, I built
> native (non Xen) i686 uniprocessor kernels for 2.6.10 and 2.6.11 with
> the same configuration and ran lmbench-3.0-a3 on them. The test
> machine was a 2.4GHz Xeon box, gcc 3.3.3 (FC3 default) was used to
> compile the kernels, NOHIGHMEM=y (2-level only).
>
> On the i686 fork and exec benchmarks I found that there's been a
> significant slowdown between 2.6.10 and 2.6.11. Some of the other
> numbers are a bit ugly too (see attached).
>
> fork: 166 -> 235 (40% slowdown)
> exec: 857 -> 1003 (17% slowdown)
>
> I'm guessing this is down to the 4 level pagetables. This is rather a
> surprise as I thought the compiler would optimise most of these
> changes away. Apparently not.
>

There are some changes in the current -bk tree (which are a
bit in-flux at the moment) which introduce some optimisations.

They should bring 2-level performance close to par with 2.6.10.
If not, complain again :)

Thanks,
Nick

2005-03-17 18:39:27

by Andi Kleen

Subject: Re: 2.6.11 vs 2.6.10 slowdown on i686

On Thu, Mar 17, 2005 at 12:16:40PM +0000, Ian Pratt wrote:
>
> Folks,
>
> When we upgraded arch xen/x86 to kernel 2.6.11, we noticed a slowdown
> on a number of micro-benchmarks. In order to investigate, I built
> native (non Xen) i686 uniprocessor kernels for 2.6.10 and 2.6.11 with
> the same configuration and ran lmbench-3.0-a3 on them. The test
> machine was a 2.4GHz Xeon box, gcc 3.3.3 (FC3 default) was used to
> compile the kernels, NOHIGHMEM=y (2-level only).

Hmm, it is known that x86-64 performance is down because it touches
a lot more memory now on fork/exit. I have some optimizations planned to fix
that; in fact, it should be faster in the end.

i386 slowdowns are unexpected though.

I remember I tested i386 briefly with lmbench with my original 4level
patch, and there weren't any significant slowdowns. However, the patch
that eventually went into mainline was very different; in particular,
clear_page_range(), which is very critical, looks completely different
now and does more work than before. Perhaps the slowdown happens in this
area.
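
As a rough picture of where the extra work could come from (a stand-alone
toy sketch, not the code from mm/memory.c -- the types and names are made
up), the clearing walk now has an extra level of entries to test, and the
new code also frees page table pages more aggressively, so every
fork/exec/exit touches more memory:

/*
 * Toy outline only -- not the real clear_page_range().  The stand-in types
 * below just model the walk structure: the 4-level code adds another loop
 * and another set of present checks, and the new version also frees page
 * table pages as it goes, so the walk touches more memory than before.
 */
#define PTRS_PER_LEVEL 1024

typedef unsigned long pte_entry;
typedef struct { pte_entry *ptes; } pmd_entry;
typedef struct { pmd_entry *pmds; } pud_entry;  /* level added in 2.6.11 */
typedef struct { pud_entry *puds; } pgd_entry;

static void toy_clear_range(pgd_entry *pgd)
{
        int g, u, m;

        for (g = 0; g < PTRS_PER_LEVEL; g++) {
                if (!pgd[g].puds)
                        continue;
                for (u = 0; u < PTRS_PER_LEVEL; u++) {          /* extra loop */
                        if (!pgd[g].puds[u].pmds)
                                continue;
                        for (m = 0; m < PTRS_PER_LEVEL; m++) {
                                /* test each pmd entry, clear the pte page it
                                 * points to, and free the page table page */
                        }
                }
        }
}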

A diffprofile of before and after would be interesting.

-Andi

2005-03-17 20:23:33

by Ian Pratt

Subject: Re: 2.6.11 vs 2.6.10 slowdown on i686


> There are some changes in the current -bk tree (which are a
> bit in-flux at the moment) which introduce some optimisations.
>
> They should bring 2-level performance close to par with 2.6.10.
> If not, complain again :)

The good news is that with a BK snapshot from today
[md5key=4238cb8e36_Z5Cgys8rTovspboIJpw] performance is rather
improved relative to 2.6.11:

fork: 166 -> 187 (13% slowdown)
exec: 857 -> 909 (6% slowdown)

Rather better than the 40% slowdown we saw with 2.6.11, but still not brilliant.

Any more improvements in the pipeline?

Ian



Processor, Processes - times in microseconds - smaller is better
------------------------------------------------------------------------------
Host OS Mhz null null open slct sig sig fork exec sh
call I/O stat clos TCP inst hndl proc proc proc
--------- ------------- ---- ---- ---- ---- ---- ---- ---- ---- ---- ---- ----
commando- Linux 2.6.10 2400 0.49 0.57 2.06 3.06 19.6 0.89 2.70 166. 857. 2972
commando- Linux 2.6.12 2400 0.49 0.60 2.37 3.43 20.9 0.91 2.64 187. 909. 3076

Context switching - times in microseconds - smaller is better
-------------------------------------------------------------------------
Host OS 2p/0K 2p/16K 2p/64K 8p/16K 8p/64K 16p/16K 16p/64K
ctxsw ctxsw ctxsw ctxsw ctxsw ctxsw ctxsw
--------- ------------- ------ ------ ------ ------ ------ ------- -------
commando- Linux 2.6.10 7.5800 4.3300 8.1900 5.1100 33.1 8.37000 41.9
commando- Linux 2.6.12 7.7400 7.9200 8.3700 5.1600 27.0 9.32000 36.5

*Local* Communication latencies in microseconds - smaller is better
---------------------------------------------------------------------
Host OS 2p/0K Pipe AF UDP RPC/ TCP RPC/ TCP
ctxsw UNIX UDP TCP conn
--------- ------------- ----- ----- ---- ----- ----- ----- ----- ----
commando- Linux 2.6.10 7.750 19.4 21.3 37.2 45.5 42.5 53.2 76.
commando- Linux 2.6.12 7.740 18.2 23.1 37.4 45.6 42.6 54.9 80.

File & VM system latencies in microseconds - smaller is better
-------------------------------------------------------------------------------
Host OS 0K File 10K File Mmap Prot Page 100fd
Create Delete Create Delete Latency Fault Fault selct
--------- ------------- ------ ------ ------ ------ ------- ----- ------- -----
commando- Linux 2.6.10 39.3 16.2 92.7 35.2 122.0 1.200 2.14310 18.3
commando- Linux 2.6.12 38.7 16.4 94.1 35.1 148.0 1.029 2.25100 18.0

*Local* Communication bandwidths in MB/s - bigger is better
-----------------------------------------------------------------------------
Host OS Pipe AF TCP File Mmap Bcopy Bcopy Mem Mem
UNIX reread reread (libc) (hand) read write
--------- ------------- ---- ---- ---- ------ ------ ------ ------ ---- -----
commando- Linux 2.6.10 313. 440. 222. 1551.7 1528.5 549.1 566.8 1550 784.8
commando- Linux 2.6.12 556. 477. 224. 1540.3 1551.4 566.5 566.6 1551 786.2


2005-03-18 08:27:00

by Kurt Garloff

Subject: Re: 2.6.11 vs 2.6.10 slowdown on i686

Hi Nick,

On Thu, Mar 17, 2005 at 11:37:24PM +1100, Nick Piggin wrote:
> Ian Pratt wrote:
> >fork: 166 -> 235 (40% slowdown)
> >exec: 857 -> 1003 (17% slowdown)
> >
> >I'm guessing this is down to the 4 level pagetables. This is rather a
> >surprise as I thought the compiler would optimise most of these
> >changes away. Apparently not.
>
> There are some changes in the current -bk tree (which are a
> bit in-flux at the moment) which introduce some optimisations.
>
> They should bring 2-level performance close to par with 2.6.10.
> If not, complain again :)

Is there a clean patchset that we should look at to test?

Regards,
--
Kurt Garloff, Director SUSE Labs, Novell Inc.



2005-03-18 08:46:56

by Nick Piggin

Subject: Re: 2.6.11 vs 2.6.10 slowdown on i686

Kurt Garloff wrote:
> Hi Nick,
>

Hi Kurt!

> On Thu, Mar 17, 2005 at 11:37:24PM +1100, Nick Piggin wrote:
>
>>Ian Pratt wrote:
>>
>>>fork: 166 -> 235 (40% slowdown)
>>>exec: 857 -> 1003 (17% slowdown)
>>>
>>>I'm guessing this is down to the 4 level pagetables. This is rather a
>>>surprise as I thought the compiler would optimise most of these
>>>changes away. Apparently not.
>>
>>There are some changes in the current -bk tree (which are a
>>bit in-flux at the moment) which introduce some optimisations.
>>
>>They should bring 2-level performance close to par with 2.6.10.
>>If not, complain again :)
>
>
> Is there a clean patchset that we should look at to test?
>

Probably the best thing would be to wait and see what happens
with the ptwalk patches. There is a fix in there for ia64 now,
but I think that may be a temporary one.

Andi is probably keeping an eye on that, but if not, then I
could put a patchset together when things finalise in 2.6.

From the profiles I have seen, the ptwalk patches bring page
table walking performance pretty well back to 2.6.10 levels;
however, the "aggressive page table freeing" (clear_page_range)
changes that went in at the same time as the 4level stuff
seem to be what is slowing down exit() and unmapping performance.

Not by a huge amount, mind you, and it is not completely wasted
performance, because it provides better page table freeing.
But it is enough to be annoying! I haven't had much time to look
at it lately, but I hope to get onto it soon.

Nick