2009-04-23 16:50:40

by Styner, Douglas W

Subject: Mainline kernel OLTP performance update


Summary: Measured the mainline kernel from kernel.org (2.6.30-rc2).

The regression for 2.6.30-rc2 against the 2.6.24.2 baseline is 1.95% (i.e., a speedup of ~0.981x, as shown in the summary below). Note the dip in the database's share of cycles in the oprofile data even though us% in the summary is unchanged.

Linux OLTP Performance summary
Kernel       Speedup(x)  Intr/s  CtxSw/s  us%  sys%  idle%  iowait%
2.6.24.2     1.000       22106   43709    75   24    0      0
2.6.30-rc2   0.981       30755   43072    75   25    0      0

Server configurations:
Intel Xeon quad-core 2.0 GHz, 2 CPUs / 8 cores / 8 threads
64 GB memory, 3 QLogic QLE2462 FC HBAs, 450 spindles (30 logical units)


======oprofile 0.9.3 CPU_CLK_UNHALTED for top 30 functions
Cycles% 2.6.24.2 Cycles% 2.6.30-rc2
74.8578 <database> 67.6966 <database>
1.0500 qla24xx_start_scsi 1.1724 qla24xx_start_scsi
0.8089 schedule 1.0578 qla24xx_intr_handler
0.5864 kmem_cache_alloc 0.8259 __schedule
0.4989 __blockdev_direct_IO 0.7451 kmem_cache_alloc
0.4357 __sigsetjmp 0.4872 __blockdev_direct_IO
0.4152 copy_user_generic_string 0.4390 task_rq_lock
0.3953 qla24xx_intr_handler 0.4338 __sigsetjmp
0.3850 memcpy 0.4195 __switch_to
0.3596 scsi_request_fn 0.3713 copy_user_generic_string
0.3188 __switch_to 0.3608 __list_add
0.2889 lock_timer_base 0.3595 rb_get_reader_page
0.2750 memmove 0.3309 ring_buffer_consume
0.2519 task_rq_lock 0.3152 scsi_request_fn
0.2474 aio_complete 0.3048 try_to_wake_up
0.2460 scsi_alloc_sgtable 0.2983 tcp_sendmsg
0.2445 generic_make_request 0.2931 lock_timer_base
0.2263 qla2x00_process_completed_re 0.2840 aio_complete
0.2118 blk_queue_end_tag 0.2697 memset_c
0.2085 dio_bio_complete 0.2527 mod_timer
0.2021 e1000_xmit_frame 0.2462 qla2x00_process_completed_re
0.2006 __end_that_request_first 0.2449 memmove
0.1954 generic_file_aio_read 0.2358 blk_queue_end_tag
0.1949 kfree 0.2241 generic_make_request
0.1915 tcp_sendmsg 0.2215 scsi_device_unbusy
0.1901 try_to_wake_up 0.2162 mempool_free
0.1895 kref_get 0.2097 e1000_xmit_frame
0.1864 __mod_timer 0.2097 kmem_cache_free
0.1863 thread_return 0.2058 kfree
0.1854 math_state_restore 0.1993 sched_clock_cpu
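
(For reference, the per-symbol table is generated along the lines of the following -- a sketch; the session name is illustrative and the number of opreport header lines may differ by version:)

opreport -l session:run1 2>/dev/null | head -33   # a few header lines plus the top 30 symbols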

Thanks,
Doug


2009-04-27 07:03:08

by Andi Kleen

Subject: Re: Mainline kernel OLTP performance update

"Styner, Douglas W" <[email protected]> writes:

>
> ======oprofile 0.9.3 CPU_CLK_UNHALTED for top 30 functions
> Cycles% 2.6.24.2 Cycles% 2.6.30-rc2
> 74.8578 <database> 67.6966 <database>

The dip in database cycles is indeed worrying.

> 1.0500 qla24xx_start_scsi 1.1724 qla24xx_start_scsi
> 0.8089 schedule 1.0578 qla24xx_intr_handler
> 0.5864 kmem_cache_alloc 0.8259 __schedule
> 0.4989 __blockdev_direct_IO 0.7451 kmem_cache_alloc
> 0.4357 __sigsetjmp 0.4872 __blockdev_direct_IO
> 0.4152 copy_user_generic_string 0.4390 task_rq_lock
> 0.3953 qla24xx_intr_handler 0.4338 __sigsetjmp

And also why the qla24xx_intr_handler became ~2.5x as expensive.
Cc linux-scsi and qla24xx maintainers.

-Andi

--
[email protected] -- Speaking for myself only.

2009-04-27 18:38:23

by Styner, Douglas W

Subject: Discrepancies between Oprofile and vmstat

We are using an OLTP-type workload on a fully saturated system and have begun seeing a discrepancy between the output of oprofile and vmstat on upstream kernels that was first seen with the 2.6.27 kernel.

Workload:
OLTP database workload using QLogic FC cards with MSI-X enabled doing ~50,000 IOPS

Oprofile (v0.9.3) collection methodology:
opcontrol --init                      # load the oprofile kernel module
opcontrol --vmlinux=/usr/src/linux-`uname -r`/vmlinux
opcontrol -e=CPU_CLK_UNHALTED:80000 -e=LLC_MISSES:6000
opcontrol --start-daemon              # start the daemon without sampling yet
sleep 5                               # let the daemon settle
opcontrol --start                     # begin sampling
sleep 60                              # 60-second measurement window
opcontrol --stop
opcontrol --save=$1                   # $1 = session name passed to the script
opcontrol --deinit

Function classification:
Samples attributed to vmlinux and to *.ko modules are classified as kernel; all other images are classified as user.
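
For example, the same classification can be scripted roughly as follows (a sketch; the session name is illustrative, and the opreport image-summary layout -- samples, %, image name, single event -- is an assumption about the 0.9.3 default output):

opreport session:run1 2>/dev/null | awk '
    $1 ~ /^[0-9]+$/ && NF >= 3 {
        # vmlinux and *.ko modules count as kernel, all else as user
        if ($3 ~ /vmlinux$/ || $3 ~ /\.ko$/) kernel += $2
        else                                 user   += $2
    }
    END { printf "kernel %.4f%%  user %.4f%%\n", kernel, user }'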

Results (2.6.30-rc3 kernel):
vmstat reports ~75% user, 25% kernel
oprofile reports 70.3808% user, 29.6142% kernel

Using another profiling tool to confirm, we see 74.784% user, 25.174% kernel.
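
(The vmstat numbers are the average over the run, gathered along these lines -- a sketch; the us/sy field positions assume the classic 16-column vmstat output with no "st" column:)

vmstat 5 12 | awk '
    NR > 3 && $1 ~ /^[0-9]+$/ {   # skip the headers and the first since-boot sample
        us += $13; sy += $14; n++
    }
    END { printf "user %.1f%%  system %.1f%%\n", us/n, sy/n }'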

Doug Styner
[email protected]

2009-04-27 18:46:46

by Andi Kleen

Subject: Re: Discrepancies between Oprofile and vmstat

On Mon, Apr 27, 2009 at 11:38:08AM -0700, Styner, Douglas W wrote:
> We are using an OLTP-type workload on a fully saturated system and have begun seeing a discrepancy between the output of oprofile and vmstat on upstream kernels that was first seen with the 2.6.27 kernel.

Rafael,

something to track for the regression list.

-Andi

2009-04-27 18:49:20

by Andi Kleen

Subject: Re: Discrepancies between Oprofile and vmstat II

> opcontrol -e=CPU_CLK_UNHALTED:80000 -e=LLC_MISSES:6000
>
> Using another profiling tool to confirm, we see 74.784% user, 25.174% kernel.

Just verifying -- you also see it when you use a shorter period than 80000 right?
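
(e.g. something like the line below -- 20000 is just an illustrative count; any value well under 80000 gives a shorter period, i.e. more frequent samples:)

opcontrol -e=CPU_CLK_UNHALTED:20000 -e=LLC_MISSES:6000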

-Andi

--
[email protected] -- Speaking for myself only.

2009-04-27 18:51:55

by Styner, Douglas W

Subject: RE: Discrepancies between Oprofile and vmstat II

I believe so, but will confirm.

>> opcontrol -e=CPU_CLK_UNHALTED:80000 -e=LLC_MISSES:6000
>>
>> Using another profiling tool to confirm, we see 74.784% user, 25.174% kernel.
>
> Just verifying -- you also see it when you use a shorter period than 80000 right?

2009-04-28 16:57:28

by Chuck Ebbert

Subject: Re: Mainline kernel OLTP performance update

On Mon, 27 Apr 2009 09:02:40 +0200
Andi Kleen <[email protected]> wrote:

> "Styner, Douglas W" <[email protected]> writes:
>
> >
> > ======oprofile 0.9.3 CPU_CLK_UNHALTED for top 30 functions
> > Cycles% 2.6.24.2 Cycles% 2.6.30-rc2
> > 74.8578 <database> 67.6966 <database>
>
> The dip in database cycles is indeed worrying.
>
> > 1.0500 qla24xx_start_scsi 1.1724 qla24xx_start_scsi
> > 0.8089 schedule 1.0578 qla24xx_intr_handler
> > 0.5864 kmem_cache_alloc 0.8259 __schedule
> > 0.4989 __blockdev_direct_IO 0.7451 kmem_cache_alloc
> > 0.4357 __sigsetjmp 0.4872 __blockdev_direct_IO
> > 0.4152 copy_user_generic_string 0.4390 task_rq_lock
> > 0.3953 qla24xx_intr_handler 0.4338 __sigsetjmp
>
> And also why the qla24xx_intr_handler became ~2.5x as expensive.
> Cc linux-scsi and qla24xx maintainers.
>

They are getting 31000 interrupts/sec vs. 22000/sec on older kernels.
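
(Easy to cross-check from /proc/interrupts -- a rough sketch, assuming the qla2xxx MSI-X vectors carry the driver name in their /proc/interrupts lines:)

count() { grep qla2xxx /proc/interrupts |
          awk '{ for (i = 2; i <= NF && $i ~ /^[0-9]+$/; i++) s += $i }
               END { print s + 0 }'; }   # sum the per-CPU columns
a=$(count); sleep 10; b=$(count)
echo "$(( (b - a) / 10 )) interrupts/sec"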

2009-04-28 17:16:15

by James Bottomley

Subject: Re: Mainline kernel OLTP performance update

On Tue, 2009-04-28 at 12:57 -0400, Chuck Ebbert wrote:
> On Mon, 27 Apr 2009 09:02:40 +0200
> Andi Kleen <[email protected]> wrote:
>
> > "Styner, Douglas W" <[email protected]> writes:
> >
> > >
> > > ======oprofile 0.9.3 CPU_CLK_UNHALTED for top 30 functions
> > > Cycles% 2.6.24.2 Cycles% 2.6.30-rc2
> > > 74.8578 <database> 67.6966 <database>
> >
> > The dip in database cycles is indeed worrying.
> >
> > > 1.0500 qla24xx_start_scsi 1.1724 qla24xx_start_scsi
> > > 0.8089 schedule 1.0578 qla24xx_intr_handler
> > > 0.5864 kmem_cache_alloc 0.8259 __schedule
> > > 0.4989 __blockdev_direct_IO 0.7451 kmem_cache_alloc
> > > 0.4357 __sigsetjmp 0.4872 __blockdev_direct_IO
> > > 0.4152 copy_user_generic_string 0.4390 task_rq_lock
> > > 0.3953 qla24xx_intr_handler 0.4338 __sigsetjmp
> >
> > And also why the qla24xx_intr_handler became ~2.5x as expensive.
> > Cc linux-scsi and qla24xx maintainers.
> >
>
> They are getting 31000 interrupts/sec vs. 22000/sec on older kernels.

Should be fixed by:

http://marc.info/?l=linux-scsi&m=124093712114937

If someone could verify, I'd be grateful.

Thanks,

James

2009-04-28 17:17:48

by Styner, Douglas W

Subject: RE: Mainline kernel OLTP performance update

Working on it as we speak...

>-----Original Message-----
>From: James Bottomley [mailto:[email protected]]
>Sent: Tuesday, April 28, 2009 10:16 AM
>To: Chuck Ebbert
>Cc: Andi Kleen; Styner, Douglas W; [email protected]; linux-
>[email protected]; [email protected]
>Subject: Re: Mainline kernel OLTP performance update
>
>On Tue, 2009-04-28 at 12:57 -0400, Chuck Ebbert wrote:
>> On Mon, 27 Apr 2009 09:02:40 +0200
>> Andi Kleen <[email protected]> wrote:
>>
>> > "Styner, Douglas W" <[email protected]> writes:
>> >
>> > >
>> > > ======oprofile 0.9.3 CPU_CLK_UNHALTED for top 30 functions
>> > > Cycles% 2.6.24.2 Cycles% 2.6.30-rc2
>> > > 74.8578 <database> 67.6966 <database>
>> >
>> > The dip in database cycles is indeed worrying.
>> >
>> > > 1.0500 qla24xx_start_scsi 1.1724 qla24xx_start_scsi
>> > > 0.8089 schedule 1.0578 qla24xx_intr_handler
>> > > 0.5864 kmem_cache_alloc 0.8259 __schedule
>> > > 0.4989 __blockdev_direct_IO 0.7451 kmem_cache_alloc
>> > > 0.4357 __sigsetjmp 0.4872 __blockdev_direct_IO
>> > > 0.4152 copy_user_generic_string 0.4390 task_rq_lock
>> > > 0.3953 qla24xx_intr_handler 0.4338 __sigsetjmp
>> >
>> > And also why the qla24xx_intr_handler became ~2.5x as expensive.
>> > Cc linux-scsi and qla24xx maintainers.
>> >
>>
>> They are getting 31000 interrupts/sec vs. 22000/sec on older kernels.
>
>Should be fixed by:
>
>http://marc.info/?l=linux-scsi&m=124093712114937
>
>If someone could verify, I'd be grateful.
>
>Thanks,
>
>James
>