LinuxLists.cc - Mainline kernel OLTP performance update

[permalink] [raw]

Subject: RE: Mainline kernel OLTP performance update

One encouraging thing is that we don't see a significant drop-off between 2.6.28 and 2.6.29-rc1, which I think is the first time we've not seen a big problem with -rc1.

To compare the top 30 functions between 2.6.28 and 2.6.29-rc1:

1.4257 qla24xx_start_scsi 1.0691 qla24xx_intr_handler
0.8784 kmem_cache_alloc 0.7701 copy_user_generic_string
0.6876 qla24xx_intr_handler 0.7339 qla24xx_wrt_req_reg
0.5834 copy_user_generic_string 0.6458 kmem_cache_alloc
0.4945 scsi_request_fn 0.5794 qla24xx_start_scsi
0.4846 __blockdev_direct_IO 0.5505 unmap_vmas
0.4187 try_to_wake_up 0.4869 __blockdev_direct_IO
0.3518 aio_complete 0.4493 try_to_wake_up
0.3513 __end_that_request_first 0.4291 scsi_request_fn
0.3483 __switch_to 0.4118 clear_page_c
0.3271 memset_c 0.4002 __switch_to
0.2976 qla2x00_process_completed_re 0.3381 ring_buffer_consume
0.2905 __list_add 0.3366 rb_get_reader_page
0.2901 generic_make_request 0.3222 aio_complete
0.2755 lock_timer_base 0.3135 memset_c
0.2741 blk_queue_end_tag 0.2875 __list_add
0.2593 kmem_cache_free 0.2673 task_rq_lock
0.2445 disk_map_sector_rcu 0.2658 __end_that_request_first
0.2370 pick_next_highest_task_rt 0.2615 qla2x00_process_completed_re
0.2323 scsi_device_unbusy 0.2615 lock_timer_base
0.2321 task_rq_lock 0.2456 disk_map_sector_rcu
0.2316 scsi_dispatch_cmd 0.2427 tcp_sendmsg
0.2239 kref_get 0.2413 e1000_xmit_frame
0.2237 dio_bio_complete 0.2398 kmem_cache_free
0.2194 push_rt_task 0.2384 pick_next_highest_task_rt
0.2145 __aio_get_req 0.2225 blk_queue_end_tag
0.2143 kfree 0.2211 sd_prep_fn
0.2138 __mod_timer 0.2167 qla24xx_queuecommand
0.2131 e1000_irq_enable 0.2109 scsi_device_unbusy
0.2091 scsi_softirq_done 0.2095 kref_get

It looks like a number of functions in the qla2x00 driver were split up, so it's probably best to ignore all the changes in qla* functions.

unmap_vmas is a new hot function. It's been around since before git history started, and hasn't changed substantially between 2.6.28 and 2.6.29-rc1, so I suspect we're calling it more often. I don't know why we'd be doing that.

clear_page_c is also new to the hot list. I haven't tried to understand why this might be so.

The ring_buffer_consume() and rb_get_reader_page() functions are part of the oprofile code. This seems to indicate a bug -- they should not be the #12 and #13 hottest functions in the kernel when monitoring a database run!

That seems to be about it for regressions.

> -----Original Message-----
> From: Ma, Chinang
> Sent: Tuesday, January 13, 2009 1:11 PM
> To: [email protected]
> Cc: Tripathi, Sharad C; [email protected]; Wilcox, Matthew R; Kleen,
> Andi; Siddha, Suresh B; Chilukuri, Harita; Styner, Douglas W; Wang, Peter
> Xihong; Nueckel, Hubert; Chris Mason
> Subject: Mainline kernel OLTP performance update
>
> This is latest 2.6.29-rc1 kernel OLTP performance result. Compare to
> 2.6.24.2 the regression is around 3.5%.
>
> Linux OLTP Performance summary
> Kernel# Speedup(x) Intr/s CtxSw/s us% sys% idle% iowait%
> 2.6.24.2 1.000 21969 43425 76 24 0 0
> 2.6.27.2 0.973 30402 43523 74 25 0 1
> 2.6.29-rc1 0.965 30331 41970 74 26 0 0
>
> Server configurations:
> Intel Xeon Quad-core 2.0GHz 2 cpus/8 cores/8 threads
> 64GB memory, 3 qle2462 FC HBA, 450 spindles (30 logical units)
>
> ======oprofile CPU_CLK_UNHALTED for top 30 functions
> Cycles% 2.6.24.2 Cycles% 2.6.27.2
> 1.0500 qla24xx_start_scsi 1.2125 qla24xx_start_scsi
> 0.8089 schedule 0.6962 kmem_cache_alloc
> 0.5864 kmem_cache_alloc 0.6209 qla24xx_intr_handler
> 0.4989 __blockdev_direct_IO 0.4895 copy_user_generic_string
> 0.4152 copy_user_generic_string 0.4591 __blockdev_direct_IO
> 0.3953 qla24xx_intr_handler 0.4409 __end_that_request_first
> 0.3596 scsi_request_fn 0.3729 __switch_to
> 0.3188 __switch_to 0.3716 try_to_wake_up
> 0.2889 lock_timer_base 0.3531 lock_timer_base
> 0.2519 task_rq_lock 0.3393 scsi_request_fn
> 0.2474 aio_complete 0.3038 aio_complete
> 0.2460 scsi_alloc_sgtable 0.2989 memset_c
> 0.2445 generic_make_request 0.2633 qla2x00_process_completed_re
> 0.2263 qla2x00_process_completed_re0.2583 pick_next_highest_task_rt
> 0.2118 blk_queue_end_tag 0.2578 generic_make_request
> 0.2085 dio_bio_complete 0.2510 __list_add
> 0.2021 e1000_xmit_frame 0.2459 task_rq_lock
> 0.2006 __end_that_request_first 0.2322 kmem_cache_free
> 0.1954 generic_file_aio_read 0.2206 blk_queue_end_tag
> 0.1949 kfree 0.2205 __mod_timer
> 0.1915 tcp_sendmsg 0.2179 update_curr_rt
> 0.1901 try_to_wake_up 0.2164 sd_prep_fn
> 0.1895 kref_get 0.2130 kref_get
> 0.1864 __mod_timer 0.2075 dio_bio_complete
> 0.1863 thread_return 0.2066 push_rt_task
> 0.1854 math_state_restore 0.1974 qla24xx_msix_default
> 0.1775 __list_add 0.1935 generic_file_aio_read
> 0.1721 memset_c 0.1870 scsi_device_unbusy
> 0.1706 find_vma 0.1861 tcp_sendmsg
> 0.1688 read_tsc 0.1843 e1000_xmit_frame
>
> ======oprofile CPU_CLK_UNHALTED for top 30 functions
> Cycles% 2.6.24.2 Cycles% 2.6.29-rc1
> 1.0500 qla24xx_start_scsi 1.0691 qla24xx_intr_handler
> 0.8089 schedule 0.7701 copy_user_generic_string
> 0.5864 kmem_cache_alloc 0.7339 qla24xx_wrt_req_reg
> 0.4989 __blockdev_direct_IO 0.6458 kmem_cache_alloc
> 0.4152 copy_user_generic_string 0.5794 qla24xx_start_scsi
> 0.3953 qla24xx_intr_handler 0.5505 unmap_vmas
> 0.3596 scsi_request_fn 0.4869 __blockdev_direct_IO
> 0.3188 __switch_to 0.4493 try_to_wake_up
> 0.2889 lock_timer_base 0.4291 scsi_request_fn
> 0.2519 task_rq_lock 0.4118 clear_page_c
> 0.2474 aio_complete 0.4002 __switch_to
> 0.2460 scsi_alloc_sgtable 0.3381 ring_buffer_consume
> 0.2445 generic_make_request 0.3366 rb_get_reader_page
> 0.2263 qla2x00_process_completed_re0.3222 aio_complete
> 0.2118 blk_queue_end_tag 0.3135 memset_c
> 0.2085 dio_bio_complete 0.2875 __list_add
> 0.2021 e1000_xmit_frame 0.2673 task_rq_lock
> 0.2006 __end_that_request_first 0.2658 __end_that_request_first
> 0.1954 generic_file_aio_read 0.2615 qla2x00_process_completed_re
> 0.1949 kfree 0.2615 lock_timer_base
> 0.1915 tcp_sendmsg 0.2456 disk_map_sector_rcu
> 0.1901 try_to_wake_up 0.2427 tcp_sendmsg
> 0.1895 kref_get 0.2413 e1000_xmit_frame
> 0.1864 __mod_timer 0.2398 kmem_cache_free
> 0.1863 thread_return 0.2384 pick_next_highest_task_rt
> 0.1854 math_state_restore 0.2225 blk_queue_end_tag
> 0.1775 __list_add 0.2211 sd_prep_fn
> 0.1721 memset_c 0.2167 qla24xx_queuecommand
> 0.1706 find_vma 0.2109 scsi_device_unbusy
> 0.1688 read_tsc 0.2095 kref_get

????{.n?+???????+%?????ݶ??w??{.n?+????{??G?????{ay?ʇڙ?,j??f???h?????????z_??(?階?ݢj"???m??????G????????????&???~???iO???z??v?^?m????????????I?

2009-01-15 00:36:30

[permalink] [raw]

Subject: Re: Mainline kernel OLTP performance update

On Tue, 13 Jan 2009 15:44:17 -0700
"Wilcox, Matthew R" <[email protected]> wrote:
>

(top-posting repaired. That @intel.com address is a bad influence ;))

(cc linux-scsi)

> > -----Original Message-----
> > From: Ma, Chinang
> > Sent: Tuesday, January 13, 2009 1:11 PM
> > To: [email protected]
> > Cc: Tripathi, Sharad C; [email protected]; Wilcox, Matthew R; Kleen,
> > Andi; Siddha, Suresh B; Chilukuri, Harita; Styner, Douglas W; Wang, Peter
> > Xihong; Nueckel, Hubert; Chris Mason
> > Subject: Mainline kernel OLTP performance update
> >
> > This is latest 2.6.29-rc1 kernel OLTP performance result. Compare to
> > 2.6.24.2 the regression is around 3.5%.
> >
> > Linux OLTP Performance summary
> > Kernel# Speedup(x) Intr/s CtxSw/s us% sys% idle% iowait%
> > 2.6.24.2 1.000 21969 43425 76 24 0 0
> > 2.6.27.2 0.973 30402 43523 74 25 0 1
> > 2.6.29-rc1 0.965 30331 41970 74 26 0 0
> >
> > Server configurations:
> > Intel Xeon Quad-core 2.0GHz 2 cpus/8 cores/8 threads
> > 64GB memory, 3 qle2462 FC HBA, 450 spindles (30 logical units)
>
>
> One encouraging thing is that we don't see a significant drop-off between 2.6.28 and 2.6.29-rc1, which I think is the first time we've not seen a big problem with -rc1.
>
> To compare the top 30 functions between 2.6.28 and 2.6.29-rc1:
>
> 1.4257 qla24xx_start_scsi 1.0691 qla24xx_intr_handler
> 0.8784 kmem_cache_alloc 0.7701 copy_user_generic_string
> 0.6876 qla24xx_intr_handler 0.7339 qla24xx_wrt_req_reg
> 0.5834 copy_user_generic_string 0.6458 kmem_cache_alloc
> 0.4945 scsi_request_fn 0.5794 qla24xx_start_scsi
> 0.4846 __blockdev_direct_IO 0.5505 unmap_vmas
> 0.4187 try_to_wake_up 0.4869 __blockdev_direct_IO
> 0.3518 aio_complete 0.4493 try_to_wake_up
> 0.3513 __end_that_request_first 0.4291 scsi_request_fn
> 0.3483 __switch_to 0.4118 clear_page_c
> 0.3271 memset_c 0.4002 __switch_to
> 0.2976 qla2x00_process_completed_re 0.3381 ring_buffer_consume
> 0.2905 __list_add 0.3366 rb_get_reader_page
> 0.2901 generic_make_request 0.3222 aio_complete
> 0.2755 lock_timer_base 0.3135 memset_c
> 0.2741 blk_queue_end_tag 0.2875 __list_add
> 0.2593 kmem_cache_free 0.2673 task_rq_lock
> 0.2445 disk_map_sector_rcu 0.2658 __end_that_request_first
> 0.2370 pick_next_highest_task_rt 0.2615 qla2x00_process_completed_re
> 0.2323 scsi_device_unbusy 0.2615 lock_timer_base
> 0.2321 task_rq_lock 0.2456 disk_map_sector_rcu
> 0.2316 scsi_dispatch_cmd 0.2427 tcp_sendmsg
> 0.2239 kref_get 0.2413 e1000_xmit_frame
> 0.2237 dio_bio_complete 0.2398 kmem_cache_free
> 0.2194 push_rt_task 0.2384 pick_next_highest_task_rt
> 0.2145 __aio_get_req 0.2225 blk_queue_end_tag
> 0.2143 kfree 0.2211 sd_prep_fn
> 0.2138 __mod_timer 0.2167 qla24xx_queuecommand
> 0.2131 e1000_irq_enable 0.2109 scsi_device_unbusy
> 0.2091 scsi_softirq_done 0.2095 kref_get
>
> It looks like a number of functions in the qla2x00 driver were split up, so it's probably best to ignore all the changes in qla* functions.
>
> unmap_vmas is a new hot function. It's been around since before git history started, and hasn't changed substantially between 2.6.28 and 2.6.29-rc1, so I suspect we're calling it more often. I don't know why we'd be doing that.
>
> clear_page_c is also new to the hot list. I haven't tried to understand why this might be so.
>
> The ring_buffer_consume() and rb_get_reader_page() functions are part of the oprofile code. This seems to indicate a bug -- they should not be the #12 and #13 hottest functions in the kernel when monitoring a database run!
>
> That seems to be about it for regressions.
>

But the interrupt rate went through the roof.

A 3.5% slowdown in this workload is considered pretty serious, isn't it?

2009-01-15 01:22:20

[permalink] [raw]

Subject: Re: Mainline kernel OLTP performance update

On Wed, Jan 14, 2009 at 04:35:57PM -0800, Andrew Morton wrote:
> On Tue, 13 Jan 2009 15:44:17 -0700
> "Wilcox, Matthew R" <[email protected]> wrote:
> >
>
> (top-posting repaired. That @intel.com address is a bad influence ;))

Alas, that email address goes to an Outlook client. Not much to be done
about that.

> (cc linux-scsi)
>
> > > This is latest 2.6.29-rc1 kernel OLTP performance result. Compare to
> > > 2.6.24.2 the regression is around 3.5%.
> > >
> > > Linux OLTP Performance summary
> > > Kernel# Speedup(x) Intr/s CtxSw/s us% sys% idle% iowait%
> > > 2.6.24.2 1.000 21969 43425 76 24 0 0
> > > 2.6.27.2 0.973 30402 43523 74 25 0 1
> > > 2.6.29-rc1 0.965 30331 41970 74 26 0 0

> But the interrupt rate went through the roof.

Yes. I forget why that was; I'll have to dig through my archives for
that.

> A 3.5% slowdown in this workload is considered pretty serious, isn't it?

Yes. Anything above 0.3% is statistically significant. 1% is a big
deal. The fact that we've lost 3.5% in the last year doesn't make
people happy. There's a few things we've identified that have a big
effect:

- Per-partition statistics. Putting in a sysctl to stop doing them gets
some of that back, but not as much as taking them out (even when
the sysctl'd variable is in a __read_mostly section). We tried a
patch from Jens to speed up the search for a new partition, but it
had no effect.

- The RT scheduler changes. They're better for some RT tasks, but not
the database benchmark workload. Chinang has posted about
this before, but the thread didn't really go anywhere.
http://marc.info/?t=122903815000001&r=1&w=2

SLUB would have had a huge negative effect if we were using it -- on the
order of 7% iirc. SLQB is at least performance-neutral with SLAB.

--
Matthew Wilcox Intel Open Source Technology Centre
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."

2009-01-15 02:05:47

[permalink] [raw]

Subject: Re: Mainline kernel OLTP performance update

On Wed, 14 Jan 2009 18:21:47 -0700 Matthew Wilcox <[email protected]> wrote:

> On Wed, Jan 14, 2009 at 04:35:57PM -0800, Andrew Morton wrote:
> > On Tue, 13 Jan 2009 15:44:17 -0700
> > "Wilcox, Matthew R" <[email protected]> wrote:
> > >
> >
> > (top-posting repaired. That @intel.com address is a bad influence ;))
>
> Alas, that email address goes to an Outlook client. Not much to be done
> about that.

aspirin?

> > (cc linux-scsi)
> >
> > > > This is latest 2.6.29-rc1 kernel OLTP performance result. Compare to
> > > > 2.6.24.2 the regression is around 3.5%.
> > > >
> > > > Linux OLTP Performance summary
> > > > Kernel# Speedup(x) Intr/s CtxSw/s us% sys% idle% iowait%
> > > > 2.6.24.2 1.000 21969 43425 76 24 0 0
> > > > 2.6.27.2 0.973 30402 43523 74 25 0 1
> > > > 2.6.29-rc1 0.965 30331 41970 74 26 0 0
>
> > But the interrupt rate went through the roof.
>
> Yes. I forget why that was; I'll have to dig through my archives for
> that.

Oh. I'd have thought that this alone could account for 3.5%.

> > A 3.5% slowdown in this workload is considered pretty serious, isn't it?
>
> Yes. Anything above 0.3% is statistically significant. 1% is a big
> deal. The fact that we've lost 3.5% in the last year doesn't make
> people happy. There's a few things we've identified that have a big
> effect:
>
> - Per-partition statistics. Putting in a sysctl to stop doing them gets
> some of that back, but not as much as taking them out (even when
> the sysctl'd variable is in a __read_mostly section). We tried a
> patch from Jens to speed up the search for a new partition, but it
> had no effect.

I find this surprising.

> - The RT scheduler changes. They're better for some RT tasks, but not
> the database benchmark workload. Chinang has posted about
> this before, but the thread didn't really go anywhere.
> http://marc.info/?t=122903815000001&r=1&w=2

Well. It's more a case that it wasn't taken anywhere. I appear to
have recently been informed that there have never been any
CPU-scheduler-caused regressions. Please persist!

> SLUB would have had a huge negative effect if we were using it -- on the
> order of 7% iirc. SLQB is at least performance-neutral with SLAB.

We really need to unblock that problem somehow. I assume that
enterprise distros are shipping slab?

2009-01-15 02:28:18

[permalink] [raw]

Subject: Re: Mainline kernel OLTP performance update

(added Ingo, Thomas, Peter and Gregory)

On Wed, 2009-01-14 at 18:04 -0800, Andrew Morton wrote:
> On Wed, 14 Jan 2009 18:21:47 -0700 Matthew Wilcox <[email protected]> wrote:
>
> > On Wed, Jan 14, 2009 at 04:35:57PM -0800, Andrew Morton wrote:
> > > On Tue, 13 Jan 2009 15:44:17 -0700
> > > "Wilcox, Matthew R" <[email protected]> wrote:
> > > >
> > >
> > > (top-posting repaired. That @intel.com address is a bad influence ;))
> >
> > Alas, that email address goes to an Outlook client. Not much to be done
> > about that.
>
> aspirin?
>
> > > (cc linux-scsi)
> > >
> > > > > This is latest 2.6.29-rc1 kernel OLTP performance result. Compare to
> > > > > 2.6.24.2 the regression is around 3.5%.
> > > > >
> > > > > Linux OLTP Performance summary
> > > > > Kernel# Speedup(x) Intr/s CtxSw/s us% sys% idle% iowait%
> > > > > 2.6.24.2 1.000 21969 43425 76 24 0 0
> > > > > 2.6.27.2 0.973 30402 43523 74 25 0 1
> > > > > 2.6.29-rc1 0.965 30331 41970 74 26 0 0
> >
> > > But the interrupt rate went through the roof.
> >
> > Yes. I forget why that was; I'll have to dig through my archives for
> > that.
>
> Oh. I'd have thought that this alone could account for 3.5%.
>
> > > A 3.5% slowdown in this workload is considered pretty serious, isn't it?
> >
> > Yes. Anything above 0.3% is statistically significant. 1% is a big
> > deal. The fact that we've lost 3.5% in the last year doesn't make
> > people happy. There's a few things we've identified that have a big
> > effect:
> >
> > - Per-partition statistics. Putting in a sysctl to stop doing them gets
> > some of that back, but not as much as taking them out (even when
> > the sysctl'd variable is in a __read_mostly section). We tried a
> > patch from Jens to speed up the search for a new partition, but it
> > had no effect.
>
> I find this surprising.
>
> > - The RT scheduler changes. They're better for some RT tasks, but not
> > the database benchmark workload. Chinang has posted about
> > this before, but the thread didn't really go anywhere.
> > http://marc.info/?t=122903815000001&r=1&w=2

I read the whole thread before I found what you were talking about here:

http://marc.info/?l=linux-kernel&m=122937424114658&w=2

With this comment:

"When setting foreground and log writer to rt-prio, the log latency reduced to 4.8ms. \
Performance is about 1.5% higher than the CFS result.
On a side note, we had been using rt-prio on all DBMS processes and log writer ( in \
higher priority) for the best OLTP performance. That has worked pretty well until \
2.6.25 when the new rt scheduler introduced the pull/push task for lower scheduling \
latency for rt-task. That has negative impact on this workload, probably due to the \
more elaborated load calculation/balancing for hundred of foreground rt-prio \
processes. Also, there is that question of no production environment would run DBMS \
with rt-prio. That is why I am going back to explore CFS and see whether I can drop \
rt-prio for good."

A couple of questions:

1) how does the latest rt scheduler compare? There has been a lot of improvements.
2) how many rt tasks?
3) what were the prios, producer compared to consumers, not actual numbers
4) have you tried pinning tasks?

RT is more about determinism than performance. The old scheduler
migrated rt tasks the same as other tasks. This helps with performance
because it will keep several rt tasks on the same CPU and cache hot even
when a rt task can migrate. This helps performance, but kills
determinism (I was seeing 10 ms wake up times from the next-highest-prio
task on a cpu, even when another CPU was available).

If you pin a task to a cpu, then it skips over the push and pull logic
and will help with performance too.

-- Steve

>
> Well. It's more a case that it wasn't taken anywhere. I appear to
> have recently been informed that there have never been any
> CPU-scheduler-caused regressions. Please persist!
>
> > SLUB would have had a huge negative effect if we were using it -- on the
> > order of 7% iirc. SLQB is at least performance-neutral with SLAB.
>
> We really need to unblock that problem somehow. I assume that
> enterprise distros are shipping slab?
>

2009-01-15 02:39:55

by Andi Kleen

[permalink] [raw]

Subject: Re: Mainline kernel OLTP performance update

Andrew Morton <[email protected]> writes:

>> some of that back, but not as much as taking them out (even when
>> the sysctl'd variable is in a __read_mostly section). We tried a
>> patch from Jens to speed up the search for a new partition, but it
>> had no effect.
>
> I find this surprising.

The test system has thousands of disks/LUNs which it writes to
all the time, in addition to a workload which is a real cache pig.
So any increase in the per LUN overhead directly leads to a lot
more cache misses in the kernel because it increases the working set
there sigificantly.

>
>> - The RT scheduler changes. They're better for some RT tasks, but not
>> the database benchmark workload. Chinang has posted about
>> this before, but the thread didn't really go anywhere.
>> http://marc.info/?t=122903815000001&r=1&w=2
>
> Well. It's more a case that it wasn't taken anywhere. I appear to
> have recently been informed that there have never been any
> CPU-scheduler-caused regressions. Please persist!

Just to clarify: the non RT scheduler has never performed well on this
workload (although it seems to get slightly worse too), mostly
because of log writer starvation.

RT at some point performed significantly better, but then as
the RT behaviour was improved to be more fair on MP there were signficant
regressions when running under RT.
I wouldn't really advocate to make RT less fair again, it would
be better to just fix the non RT scheduler to perform reasonably.
Unfortunately the thread above which was supposed to do that
didn't go anywhere.

>> SLUB would have had a huge negative effect if we were using it -- on the
>> order of 7% iirc. SLQB is at least performance-neutral with SLAB.
>
> We really need to unblock that problem somehow. I assume that
> enterprise distros are shipping slab?

The released ones all do.

-Andi
--
[email protected]

2009-01-15 02:47:56

[permalink] [raw]

Subject: Re: Mainline kernel OLTP performance update

On Thu, Jan 15, 2009 at 03:39:05AM +0100, Andi Kleen wrote:
> Andrew Morton <[email protected]> writes:
> >> some of that back, but not as much as taking them out (even when
> >> the sysctl'd variable is in a __read_mostly section). We tried a
> >> patch from Jens to speed up the search for a new partition, but it
> >> had no effect.
> >
> > I find this surprising.
>
> The test system has thousands of disks/LUNs which it writes to
> all the time, in addition to a workload which is a real cache pig.
> So any increase in the per LUN overhead directly leads to a lot
> more cache misses in the kernel because it increases the working set
> there sigificantly.

This particular system has 450 spindles, but they're amalgamated into
30 logical volumes by the hardware or firmware. Linux sees 30 LUNs.
Each one, though, has fifteen partitions on it, so that brings us back
up to 450 partitions.

This system, btw, is a scale model of the full system that would be used
to get published results. If I remember correctly, a 1% performance
regression on this system is likely to translate to a 2% regression on
the full-scale system.

--
Matthew Wilcox Intel Open Source Technology Centre
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."

2009-01-15 03:21:51

by Andi Kleen

[permalink] [raw]

Subject: Re: Mainline kernel OLTP performance update

> This particular system has 450 spindles, but they're amalgamated into
> 30 logical volumes by the hardware or firmware. Linux sees 30 LUNs.
> Each one, though, has fifteen partitions on it, so that brings us back
> up to 450 partitions.

Thanks for the correction.

-Andi
--
[email protected] -- Speaking for myself only.

2009-01-15 07:12:20

[permalink] [raw]

Subject: RE: Mainline kernel OLTP performance update

Trying to answer to some of the question below:

-Chinang

>-----Original Message-----
>From: Steven Rostedt [mailto:[email protected]]
>Sent: Wednesday, January 14, 2009 6:27 PM
>To: Andrew Morton
>Cc: Matthew Wilcox; Wilcox, Matthew R; Ma, Chinang; linux-
>[email protected]; Tripathi, Sharad C; [email protected]; Kleen,
>Andi; Siddha, Suresh B; Chilukuri, Harita; Styner, Douglas W; Wang, Peter
>Xihong; Nueckel, Hubert; [email protected]; [email protected];
>Andrew Vasquez; Anirban Chakraborty; Ingo Molnar; Thomas Gleixner; Peter
>Zijlstra; Gregory Haskins
>Subject: Re: Mainline kernel OLTP performance update
>
>(added Ingo, Thomas, Peter and Gregory)
>
>On Wed, 2009-01-14 at 18:04 -0800, Andrew Morton wrote:
>> On Wed, 14 Jan 2009 18:21:47 -0700 Matthew Wilcox <[email protected]> wrote:
>>
>> > On Wed, Jan 14, 2009 at 04:35:57PM -0800, Andrew Morton wrote:
>> > > On Tue, 13 Jan 2009 15:44:17 -0700
>> > > "Wilcox, Matthew R" <[email protected]> wrote:
>> > > >
>> > >
>> > > (top-posting repaired. That @intel.com address is a bad influence ;))
>> >
>> > Alas, that email address goes to an Outlook client. Not much to be
>done
>> > about that.
>>
>> aspirin?
>>
>> > > (cc linux-scsi)
>> > >
>> > > > > This is latest 2.6.29-rc1 kernel OLTP performance result. Compare
>to
>> > > > > 2.6.24.2 the regression is around 3.5%.
>> > > > >
>> > > > > Linux OLTP Performance summary
>> > > > > Kernel# Speedup(x) Intr/s CtxSw/s us% sys% idle%
>iowait%
>> > > > > 2.6.24.2 1.000 21969 43425 76 24 0
>0
>> > > > > 2.6.27.2 0.973 30402 43523 74 25 0
>1
>> > > > > 2.6.29-rc1 0.965 30331 41970 74 26 0
>0
>> >
>> > > But the interrupt rate went through the roof.
>> >
>> > Yes. I forget why that was; I'll have to dig through my archives for
>> > that.
>>
>> Oh. I'd have thought that this alone could account for 3.5%.
>>
>> > > A 3.5% slowdown in this workload is considered pretty serious, isn't
>it?
>> >
>> > Yes. Anything above 0.3% is statistically significant. 1% is a big
>> > deal. The fact that we've lost 3.5% in the last year doesn't make
>> > people happy. There's a few things we've identified that have a big
>> > effect:
>> >
>> > - Per-partition statistics. Putting in a sysctl to stop doing them
>gets
>> > some of that back, but not as much as taking them out (even when
>> > the sysctl'd variable is in a __read_mostly section). We tried a
>> > patch from Jens to speed up the search for a new partition, but it
>> > had no effect.
>>
>> I find this surprising.
>>
>> > - The RT scheduler changes. They're better for some RT tasks, but not
>> > the database benchmark workload. Chinang has posted about
>> > this before, but the thread didn't really go anywhere.
>> > http://marc.info/?t=122903815000001&r=1&w=2
>
>I read the whole thread before I found what you were talking about here:
>
>http://marc.info/?l=linux-kernel&m=122937424114658&w=2
>
>With this comment:
>
>"When setting foreground and log writer to rt-prio, the log latency reduced
>to 4.8ms. \
>Performance is about 1.5% higher than the CFS result.
>On a side note, we had been using rt-prio on all DBMS processes and log
>writer ( in \
>higher priority) for the best OLTP performance. That has worked pretty well
>until \
>2.6.25 when the new rt scheduler introduced the pull/push task for lower
>scheduling \
>latency for rt-task. That has negative impact on this workload, probably
>due to the \
>more elaborated load calculation/balancing for hundred of foreground rt-
>prio \
>processes. Also, there is that question of no production environment would
>run DBMS \
>with rt-prio. That is why I am going back to explore CFS and see whether I
>can drop \
>rt-prio for good."
>

>A couple of questions:
>
>1) how does the latest rt scheduler compare? There has been a lot of
>improvements.

It is difficult for me to isolate the recent rt scheduler improvement as so many other changes were introduced to the kernel at the same time. A more accurate comparison should just revert the rt-scheduler back to the previous version and test the delta. I am not sure how to get that done.

>2) how many rt tasks?
Around 250 rt tasks.

>3) what were the prios, producer compared to consumers, not actual numbers
I suppose the single log writer is the main producer (rt-prio 49, higheset rt-prio in this workload) which wake up all foreground process when the log write is done. The 240 foreground processes are the consumer (rt-prio 48). At any given time some number of the 240 foreground was waiting for log writer to finish flushing out the log data.

>4) have you tried pinning tasks?
>
We did try pin foreground rt-process to cpu. That recovered about 1% performance but introduce idle time in some cpu. Without load balancing, my solution is to pin more processes to the idle cpu. I don't think this is a practical solution for the idle time problem as the process distribution need to be adjusted again when upgrade to a different server.

>RT is more about determinism than performance. The old scheduler
>migrated rt tasks the same as other tasks. This helps with performance
>because it will keep several rt tasks on the same CPU and cache hot even
>when a rt task can migrate. This helps performance, but kills
>determinism (I was seeing 10 ms wake up times from the next-highest-prio
>task on a cpu, even when another CPU was available).
>
>If you pin a task to a cpu, then it skips over the push and pull logic
>and will help with performance too.
>
>-- Steve
>
>
>
>>
>> Well. It's more a case that it wasn't taken anywhere. I appear to
>> have recently been informed that there have never been any
>> CPU-scheduler-caused regressions. Please persist!
>>
>> > SLUB would have had a huge negative effect if we were using it -- on
>the
>> > order of 7% iirc. SLQB is at least performance-neutral with SLAB.
>>
>> We really need to unblock that problem somehow. I assume that
>> enterprise distros are shipping slab?
>>

2009-01-15 07:25:41

[permalink] [raw]

Subject: Re: Mainline kernel OLTP performance update

On Thursday 15 January 2009 13:04:31 Andrew Morton wrote:
> On Wed, 14 Jan 2009 18:21:47 -0700 Matthew Wilcox <[email protected]> wrote:

> > SLUB would have had a huge negative effect if we were using it -- on the
> > order of 7% iirc. SLQB is at least performance-neutral with SLAB.
>
> We really need to unblock that problem somehow. I assume that
> enterprise distros are shipping slab?

SLES11 will ship with SLAB, FWIW. As I said in the SLQB thread, this was
not due to my input. But I think it was probably the right choice to make
in that situation.

The biggest problem with SLAB for SGI I think is alien caches bloating the
kmem cache footprint to many GB each on their huge systems, but SLAB has a
parameter to turn off alien caches anyway so I think that is a reasonable
workaround.

Given the OLTP regression, and also I'd hate to have to deal with even
more reports of people's order-N allocations failing... basically with the
regression potential there, I don't think there was a compelling case
found to use SLUB (ie. where does it actually help?).

I'm going to propose to try to unblock the problem by asking to merge SLQB
with a plan to end up picking just one general allocator (and SLOB).

Given that SLAB and SLUB are fairly mature, I wonder what you'd think of
taking SLQB into -mm and making it the default there for a while, to see
if anybody reports a problem?

2009-01-15 09:46:53

[permalink] [raw]

Subject: Re: Mainline kernel OLTP performance update

On Thu, Jan 15, 2009 at 9:24 AM, Nick Piggin <[email protected]> wrote:
> SLES11 will ship with SLAB, FWIW. As I said in the SLQB thread, this was
> not due to my input. But I think it was probably the right choice to make
> in that situation.
>
> The biggest problem with SLAB for SGI I think is alien caches bloating the
> kmem cache footprint to many GB each on their huge systems, but SLAB has a
> parameter to turn off alien caches anyway so I think that is a reasonable
> workaround.
>
> Given the OLTP regression, and also I'd hate to have to deal with even
> more reports of people's order-N allocations failing... basically with the
> regression potential there, I don't think there was a compelling case
> found to use SLUB (ie. where does it actually help?).
>
> I'm going to propose to try to unblock the problem by asking to merge SLQB
> with a plan to end up picking just one general allocator (and SLOB).

It would also be nice if someone could do the performance analysis on
the SLUB bug. I ran sysbench in oltp mode here and the results look
like this:

[ number of transactions per second from 10 runs. ]

min max avg sd
2.6.29-rc1-slab 833.77 852.32 845.10 4.72
2.6.29-rc1-slub 823.61 851.94 836.74 8.57

I used the following sysbench parameters:

sysbench --test=oltp \
--oltp-table-size=1000000 \
--mysql-socket=/var/run/mysqld/mysqld.sock \
prepare

sysbench --num-threads=16 \
--max-requests=100000 \
--test=oltp --oltp-table-size=1000000 \
--mysql-socket=/var/run/mysqld/mysqld.sock \
--oltp-read-only run

And no, the numbers are not flipped, SLUB beats SLAB here. :(

Pekka

$ mysql --version
mysql Ver 14.12 Distrib 5.0.51a, for debian-linux-gnu (x86_64) using
readline 5.2

$ cat /proc/cpuinfo
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 15
model name : Intel(R) Core(TM)2 CPU T7200 @ 2.00GHz
stepping : 6
cpu MHz : 1000.000
cache size : 4096 KB
physical id : 0
siblings : 2
core id : 0
cpu cores : 2
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 10
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov
pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx
lm constant_tsc arch_perfmon pebs bts rep_good pni dtes64 monitor
ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm lahf_lm tpr_shadow
bogomips : 3989.99
clflush size : 64
cache_alignment : 64
address sizes : 36 bits physical, 48 bits virtual
power management:

processor : 1
vendor_id : GenuineIntel
cpu family : 6
model : 15
model name : Intel(R) Core(TM)2 CPU T7200 @ 2.00GHz
stepping : 6
cpu MHz : 1000.000
cache size : 4096 KB
physical id : 0
siblings : 2
core id : 1
cpu cores : 2
apicid : 1
initial apicid : 1
fpu : yes
fpu_exception : yes
cpuid level : 10
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov
pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx
lm constant_tsc arch_perfmon pebs bts rep_good pni dtes64 monitor
ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm lahf_lm tpr_shadow
bogomips : 3990.04
clflush size : 64
cache_alignment : 64
address sizes : 36 bits physical, 48 bits virtual
power management:

$ lspci
00:00.0 Host bridge: Intel Corporation Mobile 945GM/PM/GMS, 943/940GML
and 945GT Express Memory Controller Hub (rev 03)
00:02.0 VGA compatible controller: Intel Corporation Mobile 945GM/GMS,
943/940GML Express Integrated Graphics Controller (rev 03)
00:02.1 Display controller: Intel Corporation Mobile 945GM/GMS/GME,
943/940GML Express Integrated Graphics Controller (rev 03)
00:07.0 Performance counters: Intel Corporation Unknown device 27a3 (rev 03)
00:1b.0 Audio device: Intel Corporation 82801G (ICH7 Family) High
Definition Audio Controller (rev 02)
00:1c.0 PCI bridge: Intel Corporation 82801G (ICH7 Family) PCI Express
Port 1 (rev 02)
00:1c.1 PCI bridge: Intel Corporation 82801G (ICH7 Family) PCI Express
Port 2 (rev 02)
00:1d.0 USB Controller: Intel Corporation 82801G (ICH7 Family) USB
UHCI Controller #1 (rev 02)
00:1d.1 USB Controller: Intel Corporation 82801G (ICH7 Family) USB
UHCI Controller #2 (rev 02)
00:1d.2 USB Controller: Intel Corporation 82801G (ICH7 Family) USB
UHCI Controller #3 (rev 02)
00:1d.3 USB Controller: Intel Corporation 82801G (ICH7 Family) USB
UHCI Controller #4 (rev 02)
00:1d.7 USB Controller: Intel Corporation 82801G (ICH7 Family) USB2
EHCI Controller (rev 02)
00:1e.0 PCI bridge: Intel Corporation 82801 Mobile PCI Bridge (rev e2)
00:1f.0 ISA bridge: Intel Corporation 82801GBM (ICH7-M) LPC Interface
Bridge (rev 02)
00:1f.1 IDE interface: Intel Corporation 82801G (ICH7 Family) IDE
Controller (rev 02)
00:1f.2 IDE interface: Intel Corporation 82801GBM/GHM (ICH7 Family)
SATA IDE Controller (rev 02)
00:1f.3 SMBus: Intel Corporation 82801G (ICH7 Family) SMBus Controller (rev 02)
01:00.0 Ethernet controller: Marvell Technology Group Ltd. 88E8053
PCI-E Gigabit Ethernet Controller (rev 22)
02:00.0 Network controller: Atheros Communications Inc. AR5418
802.11abgn Wireless PCI Express Adapter (rev 01)
03:03.0 FireWire (IEEE 1394): Agere Systems FW323 (rev 61)

2009-01-15 13:52:52

[permalink] [raw]

Subject: Re: Mainline kernel OLTP performance update

On Thu, Jan 15, 2009 at 11:46:09AM +0200, Pekka Enberg wrote:
> It would also be nice if someone could do the performance analysis on
> the SLUB bug. I ran sysbench in oltp mode here and the results look
> like this:
>
> [ number of transactions per second from 10 runs. ]
>
> min max avg sd
> 2.6.29-rc1-slab 833.77 852.32 845.10 4.72
> 2.6.29-rc1-slub 823.61 851.94 836.74 8.57
>
> And no, the numbers are not flipped, SLUB beats SLAB here. :(

Um. More transactions per second is good. Your numbers show SLAB
beating SLUB (even on your dual-CPU system). And SLAB shows a lower
standard deviation, which is also good.

--
Matthew Wilcox Intel Open Source Technology Centre
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."

2009-01-15 14:13:16

by James Bottomley

[permalink] [raw]

Subject: Re: Mainline kernel OLTP performance update

On Wed, 2009-01-14 at 18:04 -0800, Andrew Morton wrote:
> On Wed, 14 Jan 2009 18:21:47 -0700 Matthew Wilcox <[email protected]> wrote:
> > On Wed, Jan 14, 2009 at 04:35:57PM -0800, Andrew Morton wrote:
> > > > > Linux OLTP Performance summary
> > > > > Kernel# Speedup(x) Intr/s CtxSw/s us% sys% idle% iowait%
> > > > > 2.6.24.2 1.000 21969 43425 76 24 0 0
> > > > > 2.6.27.2 0.973 30402 43523 74 25 0 1
> > > > > 2.6.29-rc1 0.965 30331 41970 74 26 0 0
> >
> > > But the interrupt rate went through the roof.
> >
> > Yes. I forget why that was; I'll have to dig through my archives for
> > that.
>
> Oh. I'd have thought that this alone could account for 3.5%.

Me too. Anecdotally, I haven't noticed this in my lab machines, but
what I have noticed is on someone else's laptop (a hyperthreaded atom)
that I was trying to demo powertop on was that IPI reschedule interrupts
seem to be out of control ... they were ticking over at a really high
rate and preventing the CPU from spending much time in the low C and P
states. To me this implicates some scheduler problem since that's the
primary producer of IPI reschedules ... I think it wouldn't be a
significant extrapolation to predict that the scheduler might be the
cause of the above problem as well.

James

2009-01-15 14:47:39

[permalink] [raw]

Subject: Re: Mainline kernel OLTP performance update

Matthew Wilcox wrote:
> On Thu, Jan 15, 2009 at 11:46:09AM +0200, Pekka Enberg wrote:
>> It would also be nice if someone could do the performance analysis on
>> the SLUB bug. I ran sysbench in oltp mode here and the results look
>> like this:
>>
>> [ number of transactions per second from 10 runs. ]
>>
>> min max avg sd
>> 2.6.29-rc1-slab 833.77 852.32 845.10 4.72
>> 2.6.29-rc1-slub 823.61 851.94 836.74 8.57
>>
>> And no, the numbers are not flipped, SLUB beats SLAB here. :(
>
> Um. More transactions per second is good. Your numbers show SLAB
> beating SLUB (even on your dual-CPU system). And SLAB shows a lower
> standard deviation, which is also good.

*blush*

Will do oprofile tomorrow. Thanks Matthew.

2009-01-15 16:48:43

[permalink] [raw]

Subject: RE: Mainline kernel OLTP performance update

>-----Original Message-----
>From: Matthew Wilcox [mailto:[email protected]]
>Sent: Wednesday, January 14, 2009 5:22 PM
>To: Andrew Morton
>Cc: Wilcox, Matthew R; Ma, Chinang; [email protected]; Tripathi,
>Sharad C; [email protected]; Kleen, Andi; Siddha, Suresh B; Chilukuri,
>Harita; Styner, Douglas W; Wang, Peter Xihong; Nueckel, Hubert;
>[email protected]; [email protected]; [email protected];
>Andrew Vasquez; Anirban Chakraborty
>Subject: Re: Mainline kernel OLTP performance update
>
>On Wed, Jan 14, 2009 at 04:35:57PM -0800, Andrew Morton wrote:
>> On Tue, 13 Jan 2009 15:44:17 -0700
>> "Wilcox, Matthew R" <[email protected]> wrote:
>> >
>>
>> (top-posting repaired. That @intel.com address is a bad influence ;))
>
>Alas, that email address goes to an Outlook client. Not much to be done
>about that.
>
>> (cc linux-scsi)
>>
>> > > This is latest 2.6.29-rc1 kernel OLTP performance result. Compare to
>> > > 2.6.24.2 the regression is around 3.5%.
>> > >
>> > > Linux OLTP Performance summary
>> > > Kernel# Speedup(x) Intr/s CtxSw/s us% sys% idle%
>iowait%
>> > > 2.6.24.2 1.000 21969 43425 76 24 0 0
>> > > 2.6.27.2 0.973 30402 43523 74 25 0 1
>> > > 2.6.29-rc1 0.965 30331 41970 74 26 0 0
>
>> But the interrupt rate went through the roof.
>
>Yes. I forget why that was; I'll have to dig through my archives for
>that.

I took a quick look at the interrupts figure between 2.6.24 and 2.6.27. i/o interuputs is slightly down in 2.6.27 (due to reduce throughput). But both NMI and reschedule interrupt increased. Reschedule interrupts is 2x of 2.6.24.

>
>> A 3.5% slowdown in this workload is considered pretty serious, isn't it?
>
>Yes. Anything above 0.3% is statistically significant. 1% is a big
>deal. The fact that we've lost 3.5% in the last year doesn't make
>people happy. There's a few things we've identified that have a big
>effect:
>
> - Per-partition statistics. Putting in a sysctl to stop doing them gets
> some of that back, but not as much as taking them out (even when
> the sysctl'd variable is in a __read_mostly section). We tried a
> patch from Jens to speed up the search for a new partition, but it
> had no effect.
>
> - The RT scheduler changes. They're better for some RT tasks, but not
> the database benchmark workload. Chinang has posted about
> this before, but the thread didn't really go anywhere.
> http://marc.info/?t=122903815000001&r=1&w=2
>
>SLUB would have had a huge negative effect if we were using it -- on the
>order of 7% iirc. SLQB is at least performance-neutral with SLAB.
>
>--
>Matthew Wilcox Intel Open Source Technology Centre
>"Bill, look, we understand that you're interested in selling us this
>operating system, but compare it to ours. We can't possibly take such
>a retrograde step."

-Chinang

2009-01-15 17:46:01

[permalink] [raw]

Subject: Re: Mainline kernel OLTP performance update

On Thu, 15 Jan 2009 09:12:46 -0500 James Bottomley <[email protected]> wrote:

> On Wed, 2009-01-14 at 18:04 -0800, Andrew Morton wrote:
> > On Wed, 14 Jan 2009 18:21:47 -0700 Matthew Wilcox <[email protected]> wrote:
> > > On Wed, Jan 14, 2009 at 04:35:57PM -0800, Andrew Morton wrote:
> > > > > > Linux OLTP Performance summary
> > > > > > Kernel# Speedup(x) Intr/s CtxSw/s us% sys% idle% iowait%
> > > > > > 2.6.24.2 1.000 21969 43425 76 24 0 0
> > > > > > 2.6.27.2 0.973 30402 43523 74 25 0 1
> > > > > > 2.6.29-rc1 0.965 30331 41970 74 26 0 0
> > >
> > > > But the interrupt rate went through the roof.
> > >
> > > Yes. I forget why that was; I'll have to dig through my archives for
> > > that.
> >
> > Oh. I'd have thought that this alone could account for 3.5%.
>
> Me too. Anecdotally, I haven't noticed this in my lab machines, but
> what I have noticed is on someone else's laptop (a hyperthreaded atom)
> that I was trying to demo powertop on was that IPI reschedule interrupts
> seem to be out of control ... they were ticking over at a really high
> rate and preventing the CPU from spending much time in the low C and P
> states. To me this implicates some scheduler problem since that's the
> primary producer of IPI reschedules ... I think it wouldn't be a
> significant extrapolation to predict that the scheduler might be the
> cause of the above problem as well.
>

Good point.

The context switch rate actually went down a bit.

I wonder if the Intel test people have records of /proc/interrupts for
the various kernel versions.

2009-01-15 18:01:16

[permalink] [raw]

Subject: Re: Mainline kernel OLTP performance update

On Thu, Jan 15, 2009 at 09:44:42AM -0800, Andrew Morton wrote:
> > Me too. Anecdotally, I haven't noticed this in my lab machines, but
> > what I have noticed is on someone else's laptop (a hyperthreaded atom)
> > that I was trying to demo powertop on was that IPI reschedule interrupts
> > seem to be out of control ... they were ticking over at a really high
> > rate and preventing the CPU from spending much time in the low C and P
> > states. To me this implicates some scheduler problem since that's the
> > primary producer of IPI reschedules ... I think it wouldn't be a
> > significant extrapolation to predict that the scheduler might be the
> > cause of the above problem as well.
> >
>
> Good point.
>
> The context switch rate actually went down a bit.
>
> I wonder if the Intel test people have records of /proc/interrupts for
> the various kernel versions.

I think Chinang does, but he's out of office today. He did say in an
earlier reply:

> I took a quick look at the interrupts figure between 2.6.24 and 2.6.27.
> i/o interuputs is slightly down in 2.6.27 (due to reduce throughput).
> But both NMI and reschedule interrupt increased. Reschedule interrupts
> is 2x of 2.6.24.

So if the reschedule interrupt is happening twice as often, and the
context switch rate is basically unchanged, I guess that means the
scheduler is doing a lot more work to get approximately the same
results. And that seems like a bad thing.

Again, it's worth bearing in mind that these are all RT tasks, so the
underlying problem may be very different from the one that both James and
I have observed with an Atom laptop running predominantly non-RT tasks.

--
Matthew Wilcox Intel Open Source Technology Centre
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."

2009-01-15 18:15:35

[permalink] [raw]

Subject: Re: Mainline kernel OLTP performance update

On Thu, 2009-01-15 at 11:00 -0700, Matthew Wilcox wrote:
> On Thu, Jan 15, 2009 at 09:44:42AM -0800, Andrew Morton wrote:
> > > Me too. Anecdotally, I haven't noticed this in my lab machines, but
> > > what I have noticed is on someone else's laptop (a hyperthreaded atom)
> > > that I was trying to demo powertop on was that IPI reschedule interrupts
> > > seem to be out of control ... they were ticking over at a really high
> > > rate and preventing the CPU from spending much time in the low C and P
> > > states. To me this implicates some scheduler problem since that's the
> > > primary producer of IPI reschedules ... I think it wouldn't be a
> > > significant extrapolation to predict that the scheduler might be the
> > > cause of the above problem as well.
> > >
> >
> > Good point.
> >
> > The context switch rate actually went down a bit.
> >
> > I wonder if the Intel test people have records of /proc/interrupts for
> > the various kernel versions.
>
> I think Chinang does, but he's out of office today. He did say in an
> earlier reply:
>
> > I took a quick look at the interrupts figure between 2.6.24 and 2.6.27.
> > i/o interuputs is slightly down in 2.6.27 (due to reduce throughput).
> > But both NMI and reschedule interrupt increased. Reschedule interrupts
> > is 2x of 2.6.24.
>
> So if the reschedule interrupt is happening twice as often, and the
> context switch rate is basically unchanged, I guess that means the
> scheduler is doing a lot more work to get approximately the same
> results. And that seems like a bad thing.
>
> Again, it's worth bearing in mind that these are all RT tasks, so the
> underlying problem may be very different from the one that both James and
> I have observed with an Atom laptop running predominantly non-RT tasks.
>

The RT scheduler is a bit more aggressive than it use to be. It use to
just migrate RT tasks when the migration thread woke up, and did that in
"bulk". Now, when an individual RT task wakes up and it can not run on
the current CPU but can on another CPU, it is scheduled immediately, and
an IPI is sent out.

As for context switching, it would be the same amount as before, but the
difference is that the RT task will try to wake up as soon as possible.
This also causes RT tasks to bounce around CPUs more often.

If there are many threads, they should not be RT, unless there is some
design behind it.

Forgive me if you already did this and said so, but what is the result
of just making the writer an RT task and keeping all the readers as
SCHED_OTHER?

-- Steve

2009-01-15 18:42:20

by Gregory Haskins

[permalink] [raw]

Subject: Re: Mainline kernel OLTP performance update

Steven Rostedt wrote:
> On Thu, 2009-01-15 at 11:00 -0700, Matthew Wilcox wrote:
>
>> On Thu, Jan 15, 2009 at 09:44:42AM -0800, Andrew Morton wrote:
>>
>>>> Me too. Anecdotally, I haven't noticed this in my lab machines, but
>>>> what I have noticed is on someone else's laptop (a hyperthreaded atom)
>>>> that I was trying to demo powertop on was that IPI reschedule interrupts
>>>> seem to be out of control ... they were ticking over at a really high
>>>> rate and preventing the CPU from spending much time in the low C and P
>>>> states. To me this implicates some scheduler problem since that's the
>>>> primary producer of IPI reschedules ... I think it wouldn't be a
>>>> significant extrapolation to predict that the scheduler might be the
>>>> cause of the above problem as well.
>>>>
>>>>
>>> Good point.
>>>
>>> The context switch rate actually went down a bit.
>>>
>>> I wonder if the Intel test people have records of /proc/interrupts for
>>> the various kernel versions.
>>>
>> I think Chinang does, but he's out of office today. He did say in an
>> earlier reply:
>>
>>
>>> I took a quick look at the interrupts figure between 2.6.24 and 2.6.27.
>>> i/o interuputs is slightly down in 2.6.27 (due to reduce throughput).
>>> But both NMI and reschedule interrupt increased. Reschedule interrupts
>>> is 2x of 2.6.24.
>>>
>> So if the reschedule interrupt is happening twice as often, and the
>> context switch rate is basically unchanged, I guess that means the
>> scheduler is doing a lot more work to get approximately the same
>> results. And that seems like a bad thing.
>>

I would be very interested in gathering some data in this area. One
thing that pops to mind is to instrument the resched-ipi with
ftrace_printk() and gather a trace of this system in action. I assume
that I wouldn't have access to this OLTP suite, so I may need a
volunteer to try this for me. I could put together an instrumentation
patch for the testers convenience if they prefer.

Another data-point I wouldn't mind seeing is looking at the scheduler
statistics, particularly with my sched-top utility, which you can find here:

http://rt.wiki.kernel.org/index.php/Schedtop_utility

(Note you may want to exclude the sched_info stats, as they are
inherently noisy and make it hard to see the real trends. To do this
run it with: 'schedtop -x "sched_info"'

In the meantime, I will try similar approaches here on other non-OLTP
based workloads to see if I spy anything that looks amiss.

-Greg

Attachments:

signature.asc (257.00 B)
OpenPGP digital signature

2009-01-15 18:47:59

[permalink] [raw]

Subject: RE: Mainline kernel OLTP performance update

Gregory Haskins [mailto:[email protected]] wrote:
> > On Thu, 2009-01-15 at 11:00 -0700, Matthew Wilcox wrote:
> >> So if the reschedule interrupt is happening twice as often, and the
> >> context switch rate is basically unchanged, I guess that means the
> >> scheduler is doing a lot more work to get approximately the same
> >> results. And that seems like a bad thing.
>
> I would be very interested in gathering some data in this area. One
> thing that pops to mind is to instrument the resched-ipi with
> ftrace_printk() and gather a trace of this system in action. I assume
> that I wouldn't have access to this OLTP suite, so I may need a
> volunteer to try this for me. I could put together an instrumentation
> patch for the testers convenience if they prefer.

I don't know whether Novell have an arrangement with the Well-Known Commercial Database and the Well-Known OLTP Benchmark to do runs like this. Chinang is normally only too happy to build his own kernels with patches from people who are interested in helping, so that's probably the best way to do it.

I'm leaving for LCA in an hour or so, so further responses from me to this thread are unlikely ;-)
????{.n?+???????+%?????ݶ??w??{.n?+????{??G?????{ay?ʇڙ?,j??f???h?????????z_??(?階?ݢj"???m??????G????????????&???~???iO???z??v?^?m????????????I?

2009-01-15 19:29:14

[permalink] [raw]

Subject: RE: Mainline kernel OLTP performance update

>-----Original Message-----
>From: Steven Rostedt [mailto:[email protected]]
>Sent: Thursday, January 15, 2009 10:15 AM
>To: Matthew Wilcox
>Cc: Andrew Morton; James Bottomley; Wilcox, Matthew R; Ma, Chinang; linux-
>[email protected]; Tripathi, Sharad C; [email protected]; Kleen,
>Andi; Siddha, Suresh B; Chilukuri, Harita; Styner, Douglas W; Wang, Peter
>Xihong; Nueckel, Hubert; [email protected]; [email protected];
>Andrew Vasquez; Anirban Chakraborty; Gregory Haskins
>Subject: Re: Mainline kernel OLTP performance update
>
>
>On Thu, 2009-01-15 at 11:00 -0700, Matthew Wilcox wrote:
>> On Thu, Jan 15, 2009 at 09:44:42AM -0800, Andrew Morton wrote:
>> > > Me too. Anecdotally, I haven't noticed this in my lab machines, but
>> > > what I have noticed is on someone else's laptop (a hyperthreaded atom)
>> > > that I was trying to demo powertop on was that IPI reschedule
>interrupts
>> > > seem to be out of control ... they were ticking over at a really high
>> > > rate and preventing the CPU from spending much time in the low C and
>P
>> > > states. To me this implicates some scheduler problem since that's
>the
>> > > primary producer of IPI reschedules ... I think it wouldn't be a
>> > > significant extrapolation to predict that the scheduler might be the
>> > > cause of the above problem as well.
>> > >
>> >
>> > Good point.
>> >
>> > The context switch rate actually went down a bit.
>> >
>> > I wonder if the Intel test people have records of /proc/interrupts for
>> > the various kernel versions.
>>
>> I think Chinang does, but he's out of office today. He did say in an
>> earlier reply:
>>
>> > I took a quick look at the interrupts figure between 2.6.24 and 2.6.27.
>> > i/o interuputs is slightly down in 2.6.27 (due to reduce throughput).
>> > But both NMI and reschedule interrupt increased. Reschedule interrupts
>> > is 2x of 2.6.24.
>>
>> So if the reschedule interrupt is happening twice as often, and the
>> context switch rate is basically unchanged, I guess that means the
>> scheduler is doing a lot more work to get approximately the same
>> results. And that seems like a bad thing.
>>
>> Again, it's worth bearing in mind that these are all RT tasks, so the
>> underlying problem may be very different from the one that both James and
>> I have observed with an Atom laptop running predominantly non-RT tasks.
>>
>
>The RT scheduler is a bit more aggressive than it use to be. It use to
>just migrate RT tasks when the migration thread woke up, and did that in
>"bulk". Now, when an individual RT task wakes up and it can not run on
>the current CPU but can on another CPU, it is scheduled immediately, and
>an IPI is sent out.
>
>As for context switching, it would be the same amount as before, but the
>difference is that the RT task will try to wake up as soon as possible.
>This also causes RT tasks to bounce around CPUs more often.
>
>If there are many threads, they should not be RT, unless there is some
>design behind it.
>
>Forgive me if you already did this and said so, but what is the result
>of just making the writer an RT task and keeping all the readers as
>SCHED_OTHER?
>
>-- Steve
>

I think the high OLTP throughtput with rt-prio is due to the fixed time-slice. It is better to give DBMS process a bigger timeslice for getting a data buffer lock, process data, release the lock and switch out due to waiting on i/o instead of being force to switch out while still holding a data lock.

I suppose SCHED_OTHER is the default policy for user processes. We tried setting only the log writer to RT and left all other DBMS orocess in default sched policy and the performance is ~1.5% lower than the all rt-prio process result.

2009-01-15 19:44:40

[permalink] [raw]

Subject: RE: Mainline kernel OLTP performance update

Gregory.
I will test the resched-ipi instrumentation patch with our OLTP if you can post the patch and some instructions.
Thanks,
-Chinang

>-----Original Message-----
>From: Wilcox, Matthew R
>Sent: Thursday, January 15, 2009 10:47 AM
>To: Gregory Haskins; Steven Rostedt
>Cc: Matthew Wilcox; Andrew Morton; James Bottomley; Ma, Chinang; linux-
>[email protected]; Tripathi, Sharad C; [email protected]; Kleen,
>Andi; Siddha, Suresh B; Chilukuri, Harita; Styner, Douglas W; Wang, Peter
>Xihong; Nueckel, Hubert; [email protected]; [email protected];
>Andrew Vasquez; Anirban Chakraborty
>Subject: RE: Mainline kernel OLTP performance update
>
>Gregory Haskins [mailto:[email protected]] wrote:
>> > On Thu, 2009-01-15 at 11:00 -0700, Matthew Wilcox wrote:
>> >> So if the reschedule interrupt is happening twice as often, and the
>> >> context switch rate is basically unchanged, I guess that means the
>> >> scheduler is doing a lot more work to get approximately the same
>> >> results. And that seems like a bad thing.
>>
>> I would be very interested in gathering some data in this area. One
>> thing that pops to mind is to instrument the resched-ipi with
>> ftrace_printk() and gather a trace of this system in action. I assume
>> that I wouldn't have access to this OLTP suite, so I may need a
>> volunteer to try this for me. I could put together an instrumentation
>> patch for the testers convenience if they prefer.
>
>I don't know whether Novell have an arrangement with the Well-Known
>Commercial Database and the Well-Known OLTP Benchmark to do runs like this.
>Chinang is normally only too happy to build his own kernels with patches
>from people who are interested in helping, so that's probably the best way
>to do it.
>
>I'm leaving for LCA in an hour or so, so further responses from me to this
>thread are unlikely ;-)

2009-01-16 00:30:11

[permalink] [raw]

Subject: Re: Mainline kernel OLTP performance update

On Thu, 15 Jan 2009 18:24:36 +1100
Nick Piggin <[email protected]> wrote:

> Given that SLAB and SLUB are fairly mature, I wonder what you'd think of
> taking SLQB into -mm and making it the default there for a while, to see
> if anybody reports a problem?

Nobody would test it in interesting ways.

We'd get more testing in linux-next, but still not enough, and not of
the right type.

It would be better to just make the desision, merge it and forge ahead.

Me, I'd be 100% behind the idea if it had a credible prospect of a net
reduction in the number of slab allocator implementations.

I guess the naming convention will limit us to 26 of them. Fortunate
indeed that the kernel isn't written in cyrillic!

2009-01-16 04:03:54

[permalink] [raw]

Subject: Re: Mainline kernel OLTP performance update

On Friday 16 January 2009 11:27:35 Andrew Morton wrote:
> On Thu, 15 Jan 2009 18:24:36 +1100
>
> Nick Piggin <[email protected]> wrote:
> > Given that SLAB and SLUB are fairly mature, I wonder what you'd think of
> > taking SLQB into -mm and making it the default there for a while, to see
> > if anybody reports a problem?
>
> Nobody would test it in interesting ways.
>
> We'd get more testing in linux-next, but still not enough, and not of
> the right type.

It would be better than nothing, for SLQB, I guess.

> It would be better to just make the desision, merge it and forge ahead.
>
> Me, I'd be 100% behind the idea if it had a credible prospect of a net
> reduction in the number of slab allocator implementations.

>From the data we have so far, I think SLQB is a "credible prospect" to
replace SLUB and SLAB. But then again, apparently SLUB was a credible
prospect to replace SLAB when it was merged.

Unfortunately I can't honestly say that some serious regression will not
be discovered in SLQB that cannot be fixed. I guess that's never stopped
us merging other rewrites before, though.

I would like to see SLQB merged in mainline, made default, and wait for
some number releases. Then we take what we know, and try to make an
informed decision about the best one to take. I guess that is problematic
in that the rest of the kernel is moving underneath us. Do you have
another idea?

> I guess the naming convention will limit us to 26 of them. Fortunate
> indeed that the kernel isn't written in cyrillic!

I could have called it SL4B. 4 would be somehow fitting...

2009-01-16 04:13:22

[permalink] [raw]

Subject: Re: Mainline kernel OLTP performance update

On Fri, 16 Jan 2009 15:03:12 +1100 Nick Piggin <[email protected]> wrote:

> I would like to see SLQB merged in mainline, made default, and wait for
> some number releases. Then we take what we know, and try to make an
> informed decision about the best one to take. I guess that is problematic
> in that the rest of the kernel is moving underneath us. Do you have
> another idea?

Nope. If it doesn't work out, we can remove it again I guess.

2009-01-16 06:47:07

[permalink] [raw]

Subject: Re: Mainline kernel OLTP performance update

On Friday 16 January 2009 15:12:10 Andrew Morton wrote:
> On Fri, 16 Jan 2009 15:03:12 +1100 Nick Piggin <[email protected]>
wrote:
> > I would like to see SLQB merged in mainline, made default, and wait for
> > some number releases. Then we take what we know, and try to make an
> > informed decision about the best one to take. I guess that is problematic
> > in that the rest of the kernel is moving underneath us. Do you have
> > another idea?
>
> Nope. If it doesn't work out, we can remove it again I guess.

OK, I have these numbers to show I'm not completely off my rocker to suggest
we merge SLQB :) Given these results, how about I ask to merge SLQB as default
in linux-next, then if nothing catastrophic happens, merge it upstream in the
next merge window, then a couple of releases after that, given some time to
test and tweak SLQB, then we plan to bite the bullet and emerge with just one
main slab allocator (plus SLOB).

System is a 2socket, 4 core AMD. All debug and stats options turned off for
all the allocators; default parameters (ie. SLUB using higher order pages,
and the others tend to be using order-0). SLQB is the version I recently
posted, with some of the prefetching removed according to Pekka's review
(probably a good idea to only add things like that in if/when they prove to
be an improvement).

time fio examples/netio (10 runs, lower better):
SLAB AVG=13.19 STD=0.40
SLQB AVG=13.78 STD=0.24
SLUB AVG=14.47 STD=0.23

SLAB makes a good showing here. The allocation/freeing pattern seems to be
very regular and easy (fast allocs and frees). So it could be some "lucky"
caching behaviour, I'm not exactly sure. I'll have to run more tests and
profiles here.

hackbench (10 runs, lower better):
1 GROUP
SLAB AVG=1.34 STD=0.05
SLQB AVG=1.31 STD=0.06
SLUB AVG=1.46 STD=0.07

2 GROUPS
SLAB AVG=1.20 STD=0.09
SLQB AVG=1.22 STD=0.12
SLUB AVG=1.21 STD=0.06

4 GROUPS
SLAB AVG=0.84 STD=0.05
SLQB AVG=0.81 STD=0.10
SLUB AVG=0.98 STD=0.07

8 GROUPS
SLAB AVG=0.79 STD=0.10
SLQB AVG=0.76 STD=0.15
SLUB AVG=0.89 STD=0.08

16 GROUPS
SLAB AVG=0.78 STD=0.08
SLQB AVG=0.79 STD=0.10
SLUB AVG=0.86 STD=0.05

32 GROUPS
SLAB AVG=0.86 STD=0.05
SLQB AVG=0.78 STD=0.06
SLUB AVG=0.88 STD=0.06

64 GROUPS
SLAB AVG=1.03 STD=0.05
SLQB AVG=0.90 STD=0.04
SLUB AVG=1.05 STD=0.06

128 GROUPS
SLAB AVG=1.31 STD=0.19
SLQB AVG=1.16 STD=0.36
SLUB AVG=1.29 STD=0.11

SLQB tends to be the winner here. SLAB is close at lower numbers of
groups, but drops behind a bit more as they increase.

tbench (10 runs, higher better):
1 THREAD
SLAB AVG=239.25 STD=31.74
SLQB AVG=257.75 STD=33.89
SLUB AVG=223.02 STD=14.73

2 THREADS
SLAB AVG=649.56 STD=9.77
SLQB AVG=647.77 STD=7.48
SLUB AVG=634.50 STD=7.66

4 THREADS
SLAB AVG=1294.52 STD=13.19
SLQB AVG=1266.58 STD=35.71
SLUB AVG=1228.31 STD=48.08

8 THREADS
SLAB AVG=2750.78 STD=26.67
SLQB AVG=2758.90 STD=18.86
SLUB AVG=2685.59 STD=22.41

16 THREADS
SLAB AVG=2669.11 STD=58.34
SLQB AVG=2671.69 STD=31.84
SLUB AVG=2571.05 STD=45.39

SLAB and SLQB seem to be pretty close, winning some and losing some.
They're always within a standard deviation of one another, so we can't
make conclusions between them. SLUB seems to be a bit slower.

Netperf UDP unidirectional send test (10 runs, higher better):

Server and client bound to same CPU
SLAB AVG=60.111 STD=1.59382
SLQB AVG=60.167 STD=0.685347
SLUB AVG=58.277 STD=0.788328

Server and client bound to same socket, different CPUs
SLAB AVG=85.938 STD=0.875794
SLQB AVG=93.662 STD=2.07434
SLUB AVG=81.983 STD=0.864362

Server and client bound to different sockets
SLAB AVG=78.801 STD=1.44118
SLQB AVG=78.269 STD=1.10457
SLUB AVG=71.334 STD=1.16809

SLQB is up with SLAB for the first and last cases, and faster in
the second case. SLUB trails in each case. (Any ideas for better types
of netperf tests?)

Kbuild numbers don't seem to be significantly different. SLAB and SLQB
actually got exactly the same average over 10 runs. The user+sys times
tend to be almost identical between allocators, with elapsed time mainly
depending on how much time the CPU was not idle.

Intel's OLTP shows SLQB is "neutral" to SLAB. That is, literally within
their measurement confidence interval. If it comes down to it, I think we
could get them to do more runs to narrow that down, but we're talking a
couple of tenths of a percent already.

I haven't done any non-local network tests. Networking is the one of the
subsystems most heavily dependent on slab performance, so if anybody
cares to run their favourite tests, that would be really helpful.

Disclaimer
----------
Now remember this is just one specific HW configuration, and some
allocators for some reason give significantly (and sometimes perplexingly)
different results between different CPU and system architectures.

The other frustrating thing is that sometimes you happen to get a lucky
or unlucky cache or NUMA layout depending on the compile, the boot, etc.
So sometimes results get a little "skewed" in a way that isn't reflected
in the STDDEV. But I've tried to minimise that. Dropping caches and
restarting services etc. between individual runs.

2009-01-16 06:56:11

[permalink] [raw]

Subject: Re: Mainline kernel OLTP performance update

On Fri, Jan 16, 2009 at 05:46:23PM +1100, Nick Piggin wrote:
> Intel's OLTP shows SLQB is "neutral" to SLAB. That is, literally within
> their measurement confidence interval. If it comes down to it, I think we
> could get them to do more runs to narrow that down, but we're talking a
> couple of tenths of a percent already.

I think I can speak with some measure of confidence for at least the
OLTP-testing part of my company when I say that I have no objection to
Nick's planned merge scheme.

I believe the kernel benchmark group have also done some testing with
SLQB and have generally positive things to say about it (Yanmin added to
the gargantuan cc).

Did slabtop get fixed to work with SLQB?

--
Matthew Wilcox Intel Open Source Technology Centre
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."

2009-01-16 07:02:01

[permalink] [raw]

Subject: Re: Mainline kernel OLTP performance update

On Fri, 16 Jan 2009 17:46:23 +1100 Nick Piggin <[email protected]> wrote:

> On Friday 16 January 2009 15:12:10 Andrew Morton wrote:
> > On Fri, 16 Jan 2009 15:03:12 +1100 Nick Piggin <[email protected]>
> wrote:
> > > I would like to see SLQB merged in mainline, made default, and wait for
> > > some number releases. Then we take what we know, and try to make an
> > > informed decision about the best one to take. I guess that is problematic
> > > in that the rest of the kernel is moving underneath us. Do you have
> > > another idea?
> >
> > Nope. If it doesn't work out, we can remove it again I guess.
>
> OK, I have these numbers to show I'm not completely off my rocker to suggest
> we merge SLQB :) Given these results, how about I ask to merge SLQB as default
> in linux-next, then if nothing catastrophic happens, merge it upstream in the
> next merge window, then a couple of releases after that, given some time to
> test and tweak SLQB, then we plan to bite the bullet and emerge with just one
> main slab allocator (plus SLOB).

That's a plan.

> SLQB tends to be the winner here.

Can you think of anything with which it will be the loser?

2009-01-16 07:07:40

[permalink] [raw]

Subject: Re: Mainline kernel OLTP performance update

On Friday 16 January 2009 17:55:47 Matthew Wilcox wrote:
> On Fri, Jan 16, 2009 at 05:46:23PM +1100, Nick Piggin wrote:
> > Intel's OLTP shows SLQB is "neutral" to SLAB. That is, literally within
> > their measurement confidence interval. If it comes down to it, I think we
> > could get them to do more runs to narrow that down, but we're talking a
> > couple of tenths of a percent already.
>
> I think I can speak with some measure of confidence for at least the
> OLTP-testing part of my company when I say that I have no objection to
> Nick's planned merge scheme.
>
> I believe the kernel benchmark group have also done some testing with
> SLQB and have generally positive things to say about it (Yanmin added to
> the gargantuan cc).
>
> Did slabtop get fixed to work with SLQB?

Yes the old slabtop that works on /proc/slabinfo works with SLQB (ie. SLQB
implements /proc/slabinfo).

Lin Ming recently also ported the SLUB /sys/kernel/slab/ specific slabinfo
tool to SLQB. Basically it reports in-depth internal event counts etc. and
can operate on individual caches, making it very useful for performance
"observability" and tuning.

It is hard to come up with a single set of statistics that apply usefully
to all the allocators. FWIW, it would be a useful tool to port over to
SLAB too, if we end up deciding to go with SLAB.

2009-01-16 07:25:50

[permalink] [raw]

Subject: Re: Mainline kernel OLTP performance update

On Friday 16 January 2009 18:00:43 Andrew Morton wrote:
> On Fri, 16 Jan 2009 17:46:23 +1100 Nick Piggin <[email protected]>
wrote:
> > On Friday 16 January 2009 15:12:10 Andrew Morton wrote:
> > > On Fri, 16 Jan 2009 15:03:12 +1100 Nick Piggin
> > > <[email protected]>
> >
> > wrote:
> > > > I would like to see SLQB merged in mainline, made default, and wait
> > > > for some number releases. Then we take what we know, and try to make
> > > > an informed decision about the best one to take. I guess that is
> > > > problematic in that the rest of the kernel is moving underneath us.
> > > > Do you have another idea?
> > >
> > > Nope. If it doesn't work out, we can remove it again I guess.
> >
> > OK, I have these numbers to show I'm not completely off my rocker to
> > suggest we merge SLQB :) Given these results, how about I ask to merge
> > SLQB as default in linux-next, then if nothing catastrophic happens,
> > merge it upstream in the next merge window, then a couple of releases
> > after that, given some time to test and tweak SLQB, then we plan to bite
> > the bullet and emerge with just one main slab allocator (plus SLOB).
>
> That's a plan.
>
> > SLQB tends to be the winner here.
>
> Can you think of anything with which it will be the loser?

Well, that fio test showed it was behind SLAB. I just discovered that
yesterday during running these tests, so I'll take a look at that. The
Intel performance guys I think have one or two cases where it is slower.
They don't seem to be too serious, and tend to be specific to some
machines (eg. the same test with a different CPU architecture turns out
to be faster). So I'll be looking into these things, but I haven't seen
anything too serious yet. I'm mostly interested in macro benchmarks and
more real world workloads.

At a higher level, SLAB has some interesting features. It basically has
"crossbars" of queues, that basically provide queues for allocating and
freeing to and from different CPUs and nodes. This is what bloats up
the kmem_cache data structures to tens or hundreds of gigabytes each
on SGI size systems. But it is also has good properties. On smaller
multiprocessor and NUMA systems, it might be the case that SLAB does
better in workloads that involve objects being allocated on one CPU and
freed on another. I haven't actually observed problems here, but I don't
have a lot of good tests.

SLAB is also fundamentally different from SLUB and SLQB in that it uses
arrays to store pointers to objects in its queues, rather than having
a linked list using pointers embedded in the objects. This might in some
cases make it easier to prefetch objects in parallel with finding the
object itself. I haven't actually been able to attribute a particular
regression to this interesting difference, but it might turn up as an
issue.

These are two big differences between SLAB and SLQB.

The linked lists of objects were used in favour of arrays again because of
the memory overhead, and to have a better ability to tune the size of the
queues, and reduced overhead in copying around arrays of pointers (SLQB can
just copy the head of one the list to the tail of another in order to move
objects around), and eliminated the need to have additional metadata beyond
the struct page for each slab.

The crossbars of queues were removed because of the bloating and memory
overhead issues. The fact that we now have linked lists helps a little bit
with this, because moving lists of objects around gets a bit easier.

2009-01-16 07:54:26

[permalink] [raw]

Subject: Re: Mainline kernel OLTP performance update

On Thu, 2009-01-15 at 23:55 -0700, Matthew Wilcox wrote:
> On Fri, Jan 16, 2009 at 05:46:23PM +1100, Nick Piggin wrote:
> > Intel's OLTP shows SLQB is "neutral" to SLAB. That is, literally within
> > their measurement confidence interval. If it comes down to it, I think we
> > could get them to do more runs to narrow that down, but we're talking a
> > couple of tenths of a percent already.
>
> I think I can speak with some measure of confidence for at least the
> OLTP-testing part of my company when I say that I have no objection to
> Nick's planned merge scheme.
>
> I believe the kernel benchmark group have also done some testing with
> SLQB and have generally positive things to say about it (Yanmin added to
> the gargantuan cc).
We did run lots of benchmarks with SLQB. Comparing with SLUB, one highlighting of
SLQB is with netperf UDP-U-4k. On my x86-64 machines, if I start 1 client and 1 server
process and bind them to different physical cpus, the result of SLQB is about 20% better
than SLUB's. If I start CPU_NUM clients and the same number of servers without binding,
the results of SLQB is about 100% better than SLUB's. I think that's because SLQB
doesn't pass through big object allocation to page allocator.
netperf UDP-U-1k has less improvement with SLQB.

The results of other benchmarks have variations. They are good on some machines,
but bad on other machines. However, the variation is small. For example, hackbench's result
with SLQB is about 1 second than with SLUB on 8-core stoakley. After we worked with
Nick to do small code changing, SLQB's result is a little better than SLUB's
with hackbench on stoakley.

We consider other variations as fluctuation.

All the testing use default SLUB and SLQB configuration.

>
> Did slabtop get fixed to work with SLQB?
>

2009-01-16 09:00:47

[permalink] [raw]

Subject: Re: Mainline kernel OLTP performance update

On Friday 16 January 2009 18:00:43 Andrew Morton wrote:
> On Fri, 16 Jan 2009 17:46:23 +1100 Nick Piggin <[email protected]>
> > SLQB tends to be the winner here.
>
> Can you think of anything with which it will be the loser?

Here are some more performance numbers with "slub_test" kernel module.
It's basically a really tiny microbenchmark, so I don't really consider
it gives too useful results, except it does show up some problems in
SLAB's scalability that may start to bite as we continue to get more
threads per socket.

(I ran a few of these tests on one of Dave's 2 socket, 128 thread
systems, and slab gets really painful... these kinds of thread counts
may only be a couple of years away from x86).

All numbers are in CPU cycles.

Single thread testing
=====================
1. Kmalloc: Repeatedly allocate 10000 objs then free them
obj size SLAB SLQB SLUB
8 77+ 128 69+ 47 61+ 77
16 69+ 104 116+ 70 77+ 80
32 66+ 101 82+ 81 71+ 89
64 82+ 116 95+ 81 94+105
128 100+ 148 106+ 94 114+163
256 153+ 136 134+ 98 124+186
512 209+ 161 170+186 134+276
1024 331+ 249 236+245 134+283
2048 608+ 443 380+386 172+312
4096 1109+ 624 678+661 239+372
8192 1166+1077 767+683 535+433
16384 1213+1160 914+731 577+682

We can see SLAB has a fair bit more overhead in this case. SLUB starts
doing higher order allocations I think around size 256, which reduces
costs there. Don't know what the SLQB artifact at 16 is caused by...

2. Kmalloc: alloc/free test (repeatedly allocate and free)
SLAB SLQB SLUB
8 98 90 94
16 98 90 93
32 98 90 93
64 99 90 94
128 100 92 93
256 104 93 95
512 105 94 97
1024 106 93 97
2048 107 95 95
4096 111 92 97
8192 111 94 631
16384 114 92 741

Here we see SLUB's allocator passthrough (or is the the lack of queueing?).
Straight line speed at small sizes is probably due to instructions in the
fastpaths. It's pretty meaningless though because it probably changes if
there is any actual load on the CPU, or another CPU architecture. Doesn't
look bad for SLQB though :)

Concurrent allocs
=================
1. Like the first single thread test, lots of allocs, then lots of frees.
But running on all CPUs. Average over all CPUs.
SLAB SLQB SLUB
8 251+ 322 73+ 47 65+ 76
16 240+ 331 84+ 53 67+ 82
32 235+ 316 94+ 57 77+ 92
64 338+ 303 120+ 66 105+ 136
128 549+ 355 139+ 166 127+ 344
256 1129+ 456 189+ 178 236+ 404
512 2085+ 872 240+ 217 244+ 419
1024 3895+1373 347+ 333 251+ 440
2048 7725+2579 616+ 695 373+ 588
4096 15320+4534 1245+1442 689+1002

A problem with SLAB scalability starts showing up on this system with only
4 threads per socket. Again, SLUB sees a benefit from higher order
allocations.

2. Same as 2nd single threaded test, alloc then free, on all CPUs.
SLAB SLQB SLUB
8 99 90 93
16 99 90 93
32 99 90 93
64 100 91 94
128 102 90 93
256 105 94 97
512 106 93 97
1024 108 93 97
2048 109 93 96
4096 110 93 96

No surprises. Objects always fit in queues (or unqueues, in the case of
SLUB), so there is no cross cache traffic.

Remote free test
================
1. Allocate N objects on CPUs 1-7, then free them all from CPU 0. Average cost
of all kmalloc+kfree
SLAB SLQB SLUB
8 191+ 142 53+ 64 56+99
16 180+ 141 82+ 69 60+117
32 173+ 142 100+ 71 78+151
64 240+ 147 131+ 73 117+216
128 441+ 162 158+114 114+251
256 833+ 181 179+119 185+263
512 1546+ 243 220+132 194+292
1024 2886+ 341 299+135 201+312
2048 5737+ 577 517+139 291+370
4096 11288+1201 976+153 528+482

2. All CPUs allocate on objects on CPU N, then freed by CPU N+1 % NR_CPUS
(ie. CPU1 frees objects allocated by CPU0).
SLAB SLQB SLUB
8 236+ 331 72+123 64+ 114
16 232+ 345 80+125 71+ 139
32 227+ 342 85+134 82+ 183
64 324+ 336 140+138 111+ 219
128 569+ 384 245+201 145+ 337
256 1111+ 448 243+222 238+ 447
512 2091+ 871 249+244 247+ 470
1024 3923+1593 254+256 254+ 503
2048 7700+2968 273+277 369+ 699
4096 15154+5061 310+323 693+1220

SLAB's concurrent allocation bottlnecks show up again in these tests.

Unfortunately these are not too realistic tests of remote freeing pattern,
because normally you would expect remote freeing and allocation happening
concurrently, rather than all allocations up front, then all frees. If
the test behaved like that, then object could probably fit in SLAB's
queues and it might see some good numbers.

2009-01-16 10:16:49

[permalink] [raw]

Subject: Re: Mainline kernel OLTP performance update

On Thu, Jan 15, 2009 at 11:46:09AM +0200, Pekka Enberg wrote:
>> It would also be nice if someone could do the performance analysis on
>> the SLUB bug. I ran sysbench in oltp mode here and the results look
>> like this:
>>
>> [ number of transactions per second from 10 runs. ]
>>
>> min max avg sd
>> 2.6.29-rc1-slab 833.77 852.32 845.10 4.72
>> 2.6.29-rc1-slub 823.61 851.94 836.74 8.57
>>
>> And no, the numbers are not flipped, SLUB beats SLAB here. :(

On Thu, Jan 15, 2009 at 3:52 PM, Matthew Wilcox <[email protected]> wrote:
> Um. More transactions per second is good. Your numbers show SLAB
> beating SLUB (even on your dual-CPU system). And SLAB shows a lower
> standard deviation, which is also good.

I had lockdep enabled in my config so I ran the tests again with
x86-64 defconfig and I'm back to square one:

[ number of transactions per second from 10 runs, bigger is better ]

min max avg sd
2.6.29-rc1-slab 802.02 805.37 803.93 0.97
2.6.29-rc1-slub 807.78 811.20 809.86 1.05

Pekka

2009-01-16 10:21:38

by Andi Kleen

[permalink] [raw]

Subject: Re: Mainline kernel OLTP performance update

"Zhang, Yanmin" <[email protected]> writes:

> I think that's because SLQB
> doesn't pass through big object allocation to page allocator.
> netperf UDP-U-1k has less improvement with SLQB.

That sounds like just the page allocator needs to be improved.
That would help everyone. We talked a bit about this earlier,
some of the heuristics for hot/cold pages are quite outdated
and have been tuned for obsolete machines and also its fast path
is quite long. Unfortunately no code currently.

-Andi

--
[email protected] -- Speaking for myself only.

2009-01-16 10:22:05

[permalink] [raw]

Subject: Re: Mainline kernel OLTP performance update

On Friday 16 January 2009 21:16:31 Pekka Enberg wrote:
> On Thu, Jan 15, 2009 at 11:46:09AM +0200, Pekka Enberg wrote:
> >> It would also be nice if someone could do the performance analysis on
> >> the SLUB bug. I ran sysbench in oltp mode here and the results look
> >> like this:
> >>
> >> [ number of transactions per second from 10 runs. ]
> >>
> >> min max avg sd
> >> 2.6.29-rc1-slab 833.77 852.32 845.10 4.72
> >> 2.6.29-rc1-slub 823.61 851.94 836.74 8.57

> I had lockdep enabled in my config so I ran the tests again with
> x86-64 defconfig and I'm back to square one:
>
> [ number of transactions per second from 10 runs, bigger is better ]
>
> min max avg sd
> 2.6.29-rc1-slab 802.02 805.37 803.93 0.97
> 2.6.29-rc1-slub 807.78 811.20 809.86 1.05

Hm, I wonder why it is going slower with lockdep disabled?
Did something else change?

2009-01-16 10:31:21

[permalink] [raw]

Subject: Re: Mainline kernel OLTP performance update

On Friday 16 January 2009 21:16:31 Pekka Enberg wrote:
>> I had lockdep enabled in my config so I ran the tests again with
>> x86-64 defconfig and I'm back to square one:
>>
>> [ number of transactions per second from 10 runs, bigger is better ]
>>
>> min max avg sd
>> 2.6.29-rc1-slab 802.02 805.37 803.93 0.97
>> 2.6.29-rc1-slub 807.78 811.20 809.86 1.05

On Fri, Jan 16, 2009 at 12:21 PM, Nick Piggin <[email protected]> wrote:
> Hm, I wonder why it is going slower with lockdep disabled?
> Did something else change?

I don't have the exact config for the previous tests but it's was just
my laptop regular config whereas the new tests are x86-64 defconfig.
So I think I'm just hitting some of the other OLTP regressions here,
aren't I? There's some scheduler related options such as
CONFIG_GROUP_SCHED and CONFIG_FAIR_GROUP_SCHED enabled in defconfig
that I didn't have in the original tests. I can try without them if
you want but I'm not sure it's relevant for SLAB vs SLUB tests.

Pekka

2009-01-16 10:43:50

[permalink] [raw]

Subject: Re: Mainline kernel OLTP performance update

On Friday 16 January 2009 21:31:03 Pekka Enberg wrote:
> On Friday 16 January 2009 21:16:31 Pekka Enberg wrote:
> >> I had lockdep enabled in my config so I ran the tests again with
> >> x86-64 defconfig and I'm back to square one:
> >>
> >> [ number of transactions per second from 10 runs, bigger is better ]
> >>
> >> min max avg sd
> >> 2.6.29-rc1-slab 802.02 805.37 803.93 0.97
> >> 2.6.29-rc1-slub 807.78 811.20 809.86 1.05
>
> On Fri, Jan 16, 2009 at 12:21 PM, Nick Piggin <[email protected]>
wrote:
> > Hm, I wonder why it is going slower with lockdep disabled?
> > Did something else change?
>
> I don't have the exact config for the previous tests but it's was just
> my laptop regular config whereas the new tests are x86-64 defconfig.
> So I think I'm just hitting some of the other OLTP regressions here,
> aren't I? There's some scheduler related options such as
> CONFIG_GROUP_SCHED and CONFIG_FAIR_GROUP_SCHED enabled in defconfig
> that I didn't have in the original tests. I can try without them if
> you want but I'm not sure it's relevant for SLAB vs SLUB tests.

Oh no that's fine. It just looked like you repeated the test but
with lockdep disabled (and no other changes).

2009-01-16 10:55:50

[permalink] [raw]

Subject: Re: Mainline kernel OLTP performance update

Hi Nick,

On Fri, Jan 16, 2009 at 12:42 PM, Nick Piggin <[email protected]> wrote:
>> I don't have the exact config for the previous tests but it's was just
>> my laptop regular config whereas the new tests are x86-64 defconfig.
>> So I think I'm just hitting some of the other OLTP regressions here,
>> aren't I? There's some scheduler related options such as
>> CONFIG_GROUP_SCHED and CONFIG_FAIR_GROUP_SCHED enabled in defconfig
>> that I didn't have in the original tests. I can try without them if
>> you want but I'm not sure it's relevant for SLAB vs SLUB tests.
>
> Oh no that's fine. It just looked like you repeated the test but
> with lockdep disabled (and no other changes).

Right. In any case, I am still unable to reproduce the OLTP issue and
I've seen SLUB beat SLAB on my machine in most of the benchmarks
you've posted. So I have very mixed feelings about SLQB. It's very
nice that it works for OLTP but we still don't have much insight (i.e.
numbers) on why it's better. I'm also bit worried if SLQB has gotten
enough attention from the NUMA and HPC folks that brought us SLUB.

The good news is that SLQB can replace SLAB so either way, we're not
going to end up with four allocators. Whether it can replace SLUB
remains to be seen.

Pekka

2009-01-16 18:16:32

[permalink] [raw]

Subject: Re: Mainline kernel OLTP performance update

Nick Piggin wrote:
> OK, I have these numbers to show I'm not completely off my rocker to suggest
> we merge SLQB :) Given these results, how about I ask to merge SLQB as default
> in linux-next, then if nothing catastrophic happens, merge it upstream in the
> next merge window, then a couple of releases after that, given some time to
> test and tweak SLQB, then we plan to bite the bullet and emerge with just one
> main slab allocator (plus SLOB).
>
>
> System is a 2socket, 4 core AMD.

Not exactly a large system :) Barely NUMA even with just two sockets.

> All debug and stats options turned off for
> all the allocators; default parameters (ie. SLUB using higher order pages,
> and the others tend to be using order-0). SLQB is the version I recently
> posted, with some of the prefetching removed according to Pekka's review
> (probably a good idea to only add things like that in if/when they prove to
> be an improvement).
>
> ...
>
> Netperf UDP unidirectional send test (10 runs, higher better):
>
> Server and client bound to same CPU
> SLAB AVG=60.111 STD=1.59382
> SLQB AVG=60.167 STD=0.685347
> SLUB AVG=58.277 STD=0.788328
>
> Server and client bound to same socket, different CPUs
> SLAB AVG=85.938 STD=0.875794
> SLQB AVG=93.662 STD=2.07434
> SLUB AVG=81.983 STD=0.864362
>
> Server and client bound to different sockets
> SLAB AVG=78.801 STD=1.44118
> SLQB AVG=78.269 STD=1.10457
> SLUB AVG=71.334 STD=1.16809
> ...
> I haven't done any non-local network tests. Networking is the one of the
> subsystems most heavily dependent on slab performance, so if anybody
> cares to run their favourite tests, that would be really helpful.

I'm guessing, but then are these Mbit/s figures? Would that be the sending
throughput or the receiving throughput?

I love to see netperf used, but why UDP and loopback? Also, how about the
service demands?

rick jones

2009-01-16 18:16:58

by Gregory Haskins

[permalink] [raw]

Subject: Re: Mainline kernel OLTP performance update

Ma, Chinang wrote:
> Gregory.
> I will test the resched-ipi instrumentation patch with our OLTP if you can post the patch and some instructions.
> Thanks,
> -Chinang
>

Hi Chinang,
Please find a patch attached which applies to linus.git as of today.
You will also want to enable CONFIG_FUNCTION_TRACER as well as the trace
components. Here is my system:

ghaskins@dev:~/sandbox/git/linux-2.6-rt> grep TRACE .config
CONFIG_STACKTRACE_SUPPORT=y
CONFIG_TRACEPOINTS=y
CONFIG_HAVE_ARCH_TRACEHOOK=y
CONFIG_BLK_DEV_IO_TRACE=y
# CONFIG_TREE_RCU_TRACE is not set
# CONFIG_PREEMPT_RCU_TRACE is not set
CONFIG_X86_PTRACE_BTS=y
# CONFIG_ACPI_DEBUG_FUNC_TRACE is not set
CONFIG_NETFILTER_XT_TARGET_TRACE=m
CONFIG_SOUND_TRACEINIT=y
CONFIG_TRACE_IRQFLAGS_SUPPORT=y
CONFIG_TRACE_IRQFLAGS=y
CONFIG_STACKTRACE=y
# CONFIG_BACKTRACE_SELF_TEST is not set
CONFIG_USER_STACKTRACE_SUPPORT=y
CONFIG_NOP_TRACER=y
CONFIG_HAVE_FUNCTION_TRACER=y
CONFIG_HAVE_FUNCTION_GRAPH_TRACER=y
CONFIG_HAVE_FUNCTION_TRACE_MCOUNT_TEST=y
CONFIG_HAVE_DYNAMIC_FTRACE=y
CONFIG_HAVE_FTRACE_MCOUNT_RECORD=y
CONFIG_HAVE_HW_BRANCH_TRACER=y
CONFIG_TRACER_MAX_TRACE=y
CONFIG_FUNCTION_TRACER=y
CONFIG_FUNCTION_GRAPH_TRACER=y
CONFIG_IRQSOFF_TRACER=y
CONFIG_SYSPROF_TRACER=y
CONFIG_SCHED_TRACER=y
CONFIG_CONTEXT_SWITCH_TRACER=y
# CONFIG_BOOT_TRACER is not set
# CONFIG_TRACE_BRANCH_PROFILING is not set
CONFIG_POWER_TRACER=y
CONFIG_STACK_TRACER=y
CONFIG_HW_BRANCH_TRACER=y
CONFIG_DYNAMIC_FTRACE=y
CONFIG_FTRACE_MCOUNT_RECORD=y
# CONFIG_FTRACE_STARTUP_TEST is not set
# CONFIG_MMIOTRACE is not set
# CONFIG_KVM_TRACE is not set

Then on your booted system, do:

echo sched_switch > /sys/kernel/debug/tracing/current_tracer
echo 1 > /sys/kernel/debug/tracing/tracing_enabled
$run_oltp && echo 0 > /sys/kernel/debug/tracing/tracing_enabled

(where $run_oltp is your suite)

Then, email the contents of /sys/kernel/debug/tracing/trace to me

-Greg

Attachments:

instrumentation.patch (3.08 kB)
signature.asc (257.00 B)
OpenPGP digital signature Download all attachments

2009-01-16 19:10:52

[permalink] [raw]

Subject: Re: Mainline kernel OLTP performance update

On Fri, 2009-01-16 at 13:14 -0500, Gregory Haskins wrote:
> Ma, Chinang wrote:
> > Gregory.
> > I will test the resched-ipi instrumentation patch with our OLTP if you can post the patch and some instructions.
> > Thanks,
> > -Chinang
> >
>
> Hi Chinang,
> Please find a patch attached which applies to linus.git as of today.
> You will also want to enable CONFIG_FUNCTION_TRACER as well as the trace
> components. Here is my system:
>

I don't see why CONFIG_FUNCTION_TRACER is needed.

> ghaskins@dev:~/sandbox/git/linux-2.6-rt> grep TRACE .config
> CONFIG_STACKTRACE_SUPPORT=y
> CONFIG_TRACEPOINTS=y
> CONFIG_HAVE_ARCH_TRACEHOOK=y
> CONFIG_BLK_DEV_IO_TRACE=y
> # CONFIG_TREE_RCU_TRACE is not set
> # CONFIG_PREEMPT_RCU_TRACE is not set
> CONFIG_X86_PTRACE_BTS=y
> # CONFIG_ACPI_DEBUG_FUNC_TRACE is not set
> CONFIG_NETFILTER_XT_TARGET_TRACE=m
> CONFIG_SOUND_TRACEINIT=y
> CONFIG_TRACE_IRQFLAGS_SUPPORT=y
> CONFIG_TRACE_IRQFLAGS=y
> CONFIG_STACKTRACE=y
> # CONFIG_BACKTRACE_SELF_TEST is not set
> CONFIG_USER_STACKTRACE_SUPPORT=y
> CONFIG_NOP_TRACER=y
> CONFIG_HAVE_FUNCTION_TRACER=y
> CONFIG_HAVE_FUNCTION_GRAPH_TRACER=y
> CONFIG_HAVE_FUNCTION_TRACE_MCOUNT_TEST=y
> CONFIG_HAVE_DYNAMIC_FTRACE=y
> CONFIG_HAVE_FTRACE_MCOUNT_RECORD=y
> CONFIG_HAVE_HW_BRANCH_TRACER=y
> CONFIG_TRACER_MAX_TRACE=y
> CONFIG_FUNCTION_TRACER=y
> CONFIG_FUNCTION_GRAPH_TRACER=y
> CONFIG_IRQSOFF_TRACER=y
> CONFIG_SYSPROF_TRACER=y
> CONFIG_SCHED_TRACER=y

This CONFIG_SCHED_TRACER should be enough.

-- Steve

> CONFIG_CONTEXT_SWITCH_TRACER=y
> # CONFIG_BOOT_TRACER is not set
> # CONFIG_TRACE_BRANCH_PROFILING is not set
> CONFIG_POWER_TRACER=y
> CONFIG_STACK_TRACER=y
> CONFIG_HW_BRANCH_TRACER=y
> CONFIG_DYNAMIC_FTRACE=y
> CONFIG_FTRACE_MCOUNT_RECORD=y
> # CONFIG_FTRACE_STARTUP_TEST is not set
> # CONFIG_MMIOTRACE is not set
> # CONFIG_KVM_TRACE is not set
>
>
> Then on your booted system, do:
>
> echo sched_switch > /sys/kernel/debug/tracing/current_tracer
> echo 1 > /sys/kernel/debug/tracing/tracing_enabled
> $run_oltp && echo 0 > /sys/kernel/debug/tracing/tracing_enabled
>
> (where $run_oltp is your suite)
>
> Then, email the contents of /sys/kernel/debug/tracing/trace to me
>
> -Greg
>

2009-01-16 21:02:00

[permalink] [raw]

Subject: Re: Mainline kernel OLTP performance update

On Fri, 16 Jan 2009, Pekka Enberg wrote:

> aren't I? There's some scheduler related options such as
> CONFIG_GROUP_SCHED and CONFIG_FAIR_GROUP_SCHED enabled in defconfig
> that I didn't have in the original tests. I can try without them if
> you want but I'm not sure it's relevant for SLAB vs SLUB tests.

I have seen CONFIG_GROUP_SCHED to affect latency tests in significant
ways.

2009-01-19 07:13:43

[permalink] [raw]

Subject: Re: Mainline kernel OLTP performance update

On Friday 16 January 2009 21:55:30 Pekka Enberg wrote:
> Hi Nick,
>
> On Fri, Jan 16, 2009 at 12:42 PM, Nick Piggin <[email protected]>
wrote:
> >> I don't have the exact config for the previous tests but it's was just
> >> my laptop regular config whereas the new tests are x86-64 defconfig.
> >> So I think I'm just hitting some of the other OLTP regressions here,
> >> aren't I? There's some scheduler related options such as
> >> CONFIG_GROUP_SCHED and CONFIG_FAIR_GROUP_SCHED enabled in defconfig
> >> that I didn't have in the original tests. I can try without them if
> >> you want but I'm not sure it's relevant for SLAB vs SLUB tests.
> >
> > Oh no that's fine. It just looked like you repeated the test but
> > with lockdep disabled (and no other changes).
>
> Right. In any case, I am still unable to reproduce the OLTP issue and
> I've seen SLUB beat SLAB on my machine in most of the benchmarks
> you've posted.

SLUB was distinctly slower on the tbench, netperf, and hackbench
tests that I ran. These were faster with SLUB on your machine?
What kind of system is it?

> So I have very mixed feelings about SLQB. It's very
> nice that it works for OLTP but we still don't have much insight (i.e.
> numbers) on why it's better.

According to estimates in this thread, I think Matthew said SLUB would
be around 6% slower? SLQB is within measurement error of SLAB.

Fair point about personally reproducing the OLTP problem yourself. But
the fact is that we will get problem reports that cannot be reproduced.
That does not make them less relevant. I can't reproduce the OLTP
benchmark myself. And I'm fully expecting to get problem reports for
SLQB against insanely sized SGI systems, which I will take very seriously
and try to fix them.

> I'm also bit worried if SLQB has gotten
> enough attention from the NUMA and HPC folks that brought us SLUB.

It hasn't, but that's the problem we're hoping to solve by getting it
merged. People can give it more attention, and we can try to fix any
problems. SLUB has been default for quite a while now and not able to
solve all problems it has had reported against it. So I hope SLQB will
be able to unblock this situation.

> The good news is that SLQB can replace SLAB so either way, we're not
> going to end up with four allocators. Whether it can replace SLUB
> remains to be seen.

Well I think being able to simply replace SLAB is not ideal. The plan
I'm hoping is to have four allocators for a few releases, and then
go back to having two. That is going to mean some groups might not
have their ideal allocator merged... but I think it is crazy to settle
with more than one main compile-time allocator for the long term.

I don't know what the next redhat enterprise release is going to do,
but if they go with SLAB, then I think that means no SGI systems would
run in production with SLUB anyway, so what would be the purpose of
having a special "HPC/huge system" allocator? Or... what other reasons
should users select SLUB vs SLAB? (in terms of core allocator behaviour,
versus extras that can be ported from one to the other) If we can't even
make up our own minds, then will others be able to?

2009-01-19 07:44:23

[permalink] [raw]

Subject: Re: Mainline kernel OLTP performance update

On Saturday 17 January 2009 05:11:02 Rick Jones wrote:
> Nick Piggin wrote:
> > OK, I have these numbers to show I'm not completely off my rocker to
> > suggest we merge SLQB :) Given these results, how about I ask to merge
> > SLQB as default in linux-next, then if nothing catastrophic happens,
> > merge it upstream in the next merge window, then a couple of releases
> > after that, given some time to test and tweak SLQB, then we plan to bite
> > the bullet and emerge with just one main slab allocator (plus SLOB).
> >
> >
> > System is a 2socket, 4 core AMD.
>
> Not exactly a large system :) Barely NUMA even with just two sockets.

You're right ;)

But at least it is exercising the NUMA paths in the allocator, and
represents a pretty common size of system...

I can run some tests on bigger systems at SUSE, but it is not always
easy to set up "real" meaningful workloads on them or configure
significant IO for them.

> > Netperf UDP unidirectional send test (10 runs, higher better):
> >
> > Server and client bound to same CPU
> > SLAB AVG=60.111 STD=1.59382
> > SLQB AVG=60.167 STD=0.685347
> > SLUB AVG=58.277 STD=0.788328
> >
> > Server and client bound to same socket, different CPUs
> > SLAB AVG=85.938 STD=0.875794
> > SLQB AVG=93.662 STD=2.07434
> > SLUB AVG=81.983 STD=0.864362
> >
> > Server and client bound to different sockets
> > SLAB AVG=78.801 STD=1.44118
> > SLQB AVG=78.269 STD=1.10457
> > SLUB AVG=71.334 STD=1.16809
> >
> > ...
> >
> > I haven't done any non-local network tests. Networking is the one of the
> > subsystems most heavily dependent on slab performance, so if anybody
> > cares to run their favourite tests, that would be really helpful.
>
> I'm guessing, but then are these Mbit/s figures? Would that be the sending
> throughput or the receiving throughput?

Yes, Mbit/s. They were... hmm, sending throughput I think, but each pair
of numbers seemed to be identical IIRC?

> I love to see netperf used, but why UDP and loopback?

No really good reason. I guess I was hoping to keep other variables as
small as possible. But I guess a real remote test would be a lot more
realistic as a networking test. Hmm, but I could probably set up a test
over a simple GbE link here. I'll try that.

> Also, how about the
> service demands?

Well, over loopback and using CPU binding, I was hoping it wouldn't
change much... but I see netperf does some measurements for you. I
will consider those in future too.

BTW. is it possible to do parallel netperf tests?

2009-01-19 08:05:25

[permalink] [raw]

Subject: Re: Mainline kernel OLTP performance update

Hi Nick,

On Mon, 2009-01-19 at 18:13 +1100, Nick Piggin wrote:
> SLUB was distinctly slower on the tbench, netperf, and hackbench
> tests that I ran. These were faster with SLUB on your machine?

I was trying to bisect a somewhat recent SLAB vs. SLUB regression in
tbench that seems to be triggered by CONFIG_SLUB as suggested by Evgeniy
Polyakov performance tests. Unfortunately I bisected it down to a bogus
commit so while I saw SLUB beating SLAB, I also saw the reverse in
nearby commits which didn't touch anything interesting. So for tbench,
SLUB _used to_ dominate SLAB on my machine but the current situation is
not as clear with all the tbench regressions in other subsystems.

SLUB has been a consistent winner for hackbench after Christoph fixed
the regression reported by Ingo Molnar two years (?) ago. I don't think
I've ran netperf, but for the fio test you mentioned, SLUB is beating
SLAB here.

On Mon, 2009-01-19 at 18:13 +1100, Nick Piggin wrote:
> What kind of system is it?

2-way Core2. I posted my /proc/cpuinfo in this thread if you're
interested.

On Mon, 2009-01-19 at 18:13 +1100, Nick Piggin wrote:
> > So I have very mixed feelings about SLQB. It's very
> > nice that it works for OLTP but we still don't have much insight (i.e.
> > numbers) on why it's better.

On Mon, 2009-01-19 at 18:13 +1100, Nick Piggin wrote:
> According to estimates in this thread, I think Matthew said SLUB would
> be around 6% slower? SLQB is within measurement error of SLAB.

Yeah but I say that we don't know _why_ it's better. There's the
kmalloc()/kfree() CPU ping-pong hypothesis but it could also be due to
page allocator interaction or just a plain bug in SLUB. And lets not
forget bad interaction with some random subsystem (SCSI, for example).

On Mon, 2009-01-19 at 18:13 +1100, Nick Piggin wrote:
> Fair point about personally reproducing the OLTP problem yourself. But
> the fact is that we will get problem reports that cannot be reproduced.
> That does not make them less relevant. I can't reproduce the OLTP
> benchmark myself. And I'm fully expecting to get problem reports for
> SLQB against insanely sized SGI systems, which I will take very seriously
> and try to fix them.

Again, it's not that I don't take the OLTP regression seriously (I do)
but as a "part-time maintainer" I simply don't have the time and
resources to attempt to fix it without either (a) being able to
reproduce the problem or (b) have someone who can reproduce it who is
willing to do oprofile and so on.

So as much as I would have preferred that you had at least attempted to
fix SLUB, I'm more than happy that we have a very active developer
working on the problem now. I mean, I don't really care which allocator
we decide to go forward with, if all the relevant regressions are dealt
with.

All I am saying is that I don't like how we're fixing a performance bug
with a shiny new allocator without a credible explanation why the
current approach is not fixable.

On Mon, 2009-01-19 at 18:13 +1100, Nick Piggin wrote:
> > The good news is that SLQB can replace SLAB so either way, we're not
> > going to end up with four allocators. Whether it can replace SLUB
> > remains to be seen.
>
> Well I think being able to simply replace SLAB is not ideal. The plan
> I'm hoping is to have four allocators for a few releases, and then
> go back to having two. That is going to mean some groups might not
> have their ideal allocator merged... but I think it is crazy to settle
> with more than one main compile-time allocator for the long term.

So now the HPC folk will be screwed over by the OLTP folk? I guess
that's okay as the latter have been treated rather badly for the past
two years.... ;-)

Pekka

2009-01-19 08:34:09

[permalink] [raw]

Subject: Re: Mainline kernel OLTP performance update

On Monday 19 January 2009 19:05:03 Pekka Enberg wrote:
> Hi Nick,
>
> On Mon, 2009-01-19 at 18:13 +1100, Nick Piggin wrote:
> > SLUB was distinctly slower on the tbench, netperf, and hackbench
> > tests that I ran. These were faster with SLUB on your machine?
>
> I was trying to bisect a somewhat recent SLAB vs. SLUB regression in
> tbench that seems to be triggered by CONFIG_SLUB as suggested by Evgeniy
> Polyakov performance tests. Unfortunately I bisected it down to a bogus
> commit so while I saw SLUB beating SLAB, I also saw the reverse in
> nearby commits which didn't touch anything interesting. So for tbench,
> SLUB _used to_ dominate SLAB on my machine but the current situation is
> not as clear with all the tbench regressions in other subsystems.

OK.

> SLUB has been a consistent winner for hackbench after Christoph fixed
> the regression reported by Ingo Molnar two years (?) ago. I don't think
> I've ran netperf, but for the fio test you mentioned, SLUB is beating
> SLAB here.

Hmm, netperf, hackbench, and fio are all faster with SLAB than SLUB.

> On Mon, 2009-01-19 at 18:13 +1100, Nick Piggin wrote:
> > What kind of system is it?
>
> 2-way Core2. I posted my /proc/cpuinfo in this thread if you're
> interested.

Thanks. I guess one of three obvious differences, mine is a K10, is
NUMA, and has significantly more cores. I can try setting it to
interleave cachelines over nodes or use fewer cores to see if the
picture changes...

> On Mon, 2009-01-19 at 18:13 +1100, Nick Piggin wrote:
> > > So I have very mixed feelings about SLQB. It's very
> > > nice that it works for OLTP but we still don't have much insight (i.e.
> > > numbers) on why it's better.
>
> On Mon, 2009-01-19 at 18:13 +1100, Nick Piggin wrote:
> > According to estimates in this thread, I think Matthew said SLUB would
> > be around 6% slower? SLQB is within measurement error of SLAB.
>
> Yeah but I say that we don't know _why_ it's better. There's the
> kmalloc()/kfree() CPU ping-pong hypothesis but it could also be due to
> page allocator interaction or just a plain bug in SLUB. And lets not
> forget bad interaction with some random subsystem (SCSI, for example).
>
> On Mon, 2009-01-19 at 18:13 +1100, Nick Piggin wrote:
> > Fair point about personally reproducing the OLTP problem yourself. But
> > the fact is that we will get problem reports that cannot be reproduced.
> > That does not make them less relevant. I can't reproduce the OLTP
> > benchmark myself. And I'm fully expecting to get problem reports for
> > SLQB against insanely sized SGI systems, which I will take very seriously
> > and try to fix them.
>
> Again, it's not that I don't take the OLTP regression seriously (I do)
> but as a "part-time maintainer" I simply don't have the time and
> resources to attempt to fix it without either (a) being able to
> reproduce the problem or (b) have someone who can reproduce it who is
> willing to do oprofile and so on.
>
> So as much as I would have preferred that you had at least attempted to
> fix SLUB, I'm more than happy that we have a very active developer
> working on the problem now. I mean, I don't really care which allocator
> we decide to go forward with, if all the relevant regressions are dealt
> with.

OK, good to know.

> All I am saying is that I don't like how we're fixing a performance bug
> with a shiny new allocator without a credible explanation why the
> current approach is not fixable.

To be honest, my biggest concern with SLUB is the higher order pages
thing. But Christoph always poo poos me when I raise that concern, and
it's hard to get concrete numbers showing real fragmentation problems
when it can take days or months to start biting.

It really stems from queueing versus not queueing I guess. And I think
SLUB is flawed due to its avoidance of queueing.

> On Mon, 2009-01-19 at 18:13 +1100, Nick Piggin wrote:
> > > The good news is that SLQB can replace SLAB so either way, we're not
> > > going to end up with four allocators. Whether it can replace SLUB
> > > remains to be seen.
> >
> > Well I think being able to simply replace SLAB is not ideal. The plan
> > I'm hoping is to have four allocators for a few releases, and then
> > go back to having two. That is going to mean some groups might not
> > have their ideal allocator merged... but I think it is crazy to settle
> > with more than one main compile-time allocator for the long term.
>
> So now the HPC folk will be screwed over by the OLTP folk?

No. I'm imagining there will be a discussion of the 3, and at some
point an executive decision will be made if an agreement can't be
reached. At this point, I think that is a better and fairer option
than just asserting one allocator is better than another and making
it the default.

And... we have no indication that SLQB will be worse for HPC than
SLUB ;)

> I guess
> that's okay as the latter have been treated rather badly for the past
> two years.... ;-)

I don't know if that is meant to be sarcastic, but the OLTP performance
numbers almost never get better from one kernel to the next. Actually
the trend is downward. Mainly due to bloat or new features being added.

I think that at some level, controlled addition of features that may
add some cycles to these paths is not a bad idea (what good is Moore's
Law if we can't have shiny new features? :) But on the other hand, this
OLTP test is incredibly valuable to monitor the general performance-
health of this area of the kernel.

2009-01-19 08:43:23

[permalink] [raw]

Subject: Re: Mainline kernel OLTP performance update

On Monday 19 January 2009 19:33:27 Nick Piggin wrote:
> On Monday 19 January 2009 19:05:03 Pekka Enberg wrote:

> > All I am saying is that I don't like how we're fixing a performance bug
> > with a shiny new allocator without a credible explanation why the
> > current approach is not fixable.
>
> To be honest, my biggest concern with SLUB is the higher order pages
> thing. But Christoph always poo poos me when I raise that concern, and
> it's hard to get concrete numbers showing real fragmentation problems
> when it can take days or months to start biting.
>
> It really stems from queueing versus not queueing I guess. And I think
> SLUB is flawed due to its avoidance of queueing.

And FWIW, Christoph was also not able to fix the OLTP problem although
I think it has been known for nearly two years ago now (I remember we
talked about it at 2007 KS, although I wasn't following slab development
very keenly back then).

At this point I feel spending time working on SLUB isn't a good idea if
a) Christoph himself hadn't fixed this problem; and b) we disagree about
fundamental design choices (see the "SLQB slab allocator" thread).

Anyway, nobody has disagreed with my proposal to merge SLQB, so in the
worst case I don't think it will cause too much harm, and in the best
case it might turn out to make the best tradeoffs and who knows, it
might actually not be catastrophic for HPC ;)

2009-01-19 08:47:40

[permalink] [raw]

Subject: Re: Mainline kernel OLTP performance update

On Mon, Jan 19, 2009 at 10:42 AM, Nick Piggin <[email protected]> wrote:
> Anyway, nobody has disagreed with my proposal to merge SLQB, so in the
> worst case I don't think it will cause too much harm, and in the best
> case it might turn out to make the best tradeoffs and who knows, it
> might actually not be catastrophic for HPC ;)

Yeah. If Andrew/Linus doesn't want to merge SLQB to 2.6.29, we can
stick it in linux-next through slab.git if you want.

Pekka

2009-01-19 08:58:40

[permalink] [raw]

Subject: Re: Mainline kernel OLTP performance update

On Monday 19 January 2009 19:47:24 Pekka Enberg wrote:
> On Mon, Jan 19, 2009 at 10:42 AM, Nick Piggin <[email protected]>
wrote:
> > Anyway, nobody has disagreed with my proposal to merge SLQB, so in the
> > worst case I don't think it will cause too much harm, and in the best
> > case it might turn out to make the best tradeoffs and who knows, it
> > might actually not be catastrophic for HPC ;)
>
> Yeah. If Andrew/Linus doesn't want to merge SLQB to 2.6.29, we can

I would prefer not. Apart from not practicing what I preach about
merging, if it has stupid bugs on some systems or obvious performance
problems, it will not be a good start ;)

> stick it in linux-next through slab.git if you want.

That would be appreciated. It's not quite ready yet...

Thanks.
Nick

2009-01-19 09:49:14

[permalink] [raw]

Subject: Re: Mainline kernel OLTP performance update

Hi Nick,

On Mon, Jan 19, 2009 at 10:33 AM, Nick Piggin <[email protected]> wrote:
>> All I am saying is that I don't like how we're fixing a performance bug
>> with a shiny new allocator without a credible explanation why the
>> current approach is not fixable.
>
> To be honest, my biggest concern with SLUB is the higher order pages
> thing. But Christoph always poo poos me when I raise that concern, and
> it's hard to get concrete numbers showing real fragmentation problems
> when it can take days or months to start biting.

To be fair to SLUB, we do have the pending slab defragmentation
patches in my tree. Not that we have any numbers on if defragmentation
helps and how much. IIRC, Christoph said one of the reasons for
avoiding queues in SLUB is to be able to do defragmentation. But I
suppose with SLQB we can do the same thing as long as we flush the
queues before attempting to defrag.

Pekka

2009-01-19 10:04:45

[permalink] [raw]

Subject: Re: Mainline kernel OLTP performance update

On Monday 19 January 2009 20:48:52 Pekka Enberg wrote:
> Hi Nick,
>
> On Mon, Jan 19, 2009 at 10:33 AM, Nick Piggin <[email protected]>
wrote:
> >> All I am saying is that I don't like how we're fixing a performance bug
> >> with a shiny new allocator without a credible explanation why the
> >> current approach is not fixable.
> >
> > To be honest, my biggest concern with SLUB is the higher order pages
> > thing. But Christoph always poo poos me when I raise that concern, and
> > it's hard to get concrete numbers showing real fragmentation problems
> > when it can take days or months to start biting.
>
> To be fair to SLUB, we do have the pending slab defragmentation
> patches in my tree. Not that we have any numbers on if defragmentation
> helps and how much. IIRC, Christoph said one of the reasons for
> avoiding queues in SLUB is to be able to do defragmentation. But I
> suppose with SLQB we can do the same thing as long as we flush the
> queues before attempting to defrag.

I have had a look at them, (and I raised some concerns about races with
the bufferhead "defragmentation" patch which I didn't get a reply to,
but now's not the time to get into that).

Christoph's design AFAIKS is not impossible with queued slab allocators,
but they would just need to do either some kind of per-cpu processing,
at least a way to flush queues of objects. This should not be impossible.

But in my reply, I also outlined an idea for a possibly better design for
targetted slab reclaim that could have fewer of the locking complexitiesin
other subsystems like the slub defrag patches do. I plan to look at this
at some point, but I think we need to sort out the basics first.

2009-01-19 18:06:27

by Chris Mason

[permalink] [raw]

Subject: RE: Mainline kernel OLTP performance update

On Thu, 2009-01-15 at 00:11 -0700, Ma, Chinang wrote:
> >> > > > >
> >> > > > > Linux OLTP Performance summary
> >> > > > > Kernel# Speedup(x) Intr/s CtxSw/s us% sys% idle%
> >iowait%
> >> > > > > 2.6.24.2 1.000 21969 43425 76 24 0
> >0
> >> > > > > 2.6.27.2 0.973 30402 43523 74 25 0
> >1
> >> > > > > 2.6.29-rc1 0.965 30331 41970 74 26 0
> >0
> >> >
> >> > > But the interrupt rate went through the roof.
> >> >
> >> > Yes. I forget why that was; I'll have to dig through my archives for
> >> > that.
> >>
> >> Oh. I'd have thought that this alone could account for 3.5%.

A later email indicated the reschedule interrupt count doubled since
2.6.24, and so I poked around a bit at the causes of resched_task.

I think the -rt version of check_preempt_equal_prio has gotten much more
expensive since 2.6.24.

I'm sure these changes were made for good reasons, and this workload may
not be a good reason to change it back. But, what does the patch below
do to performance on 2.6.29-rcX?

-chris

diff --git a/kernel/sched_rt.c b/kernel/sched_rt.c
index 954e1a8..bbe3492 100644
--- a/kernel/sched_rt.c
+++ b/kernel/sched_rt.c
@@ -842,6 +842,7 @@ static void check_preempt_curr_rt(struct rq *rq,
struct task_struct *p, int sync
resched_task(rq->curr);
return;
}
+ return;

#ifdef CONFIG_SMP
/*

2009-01-19 18:40:48

[permalink] [raw]

Subject: RE: Mainline kernel OLTP performance update

(added Rusty)

On Mon, 2009-01-19 at 13:04 -0500, Chris Mason wrote:
> On Thu, 2009-01-15 at 00:11 -0700, Ma, Chinang wrote:
> > >> > > > >
> > >> > > > > Linux OLTP Performance summary
> > >> > > > > Kernel# Speedup(x) Intr/s CtxSw/s us% sys% idle%
> > >iowait%
> > >> > > > > 2.6.24.2 1.000 21969 43425 76 24 0
> > >0
> > >> > > > > 2.6.27.2 0.973 30402 43523 74 25 0
> > >1
> > >> > > > > 2.6.29-rc1 0.965 30331 41970 74 26 0
> > >0
> > >> >
> > >> > > But the interrupt rate went through the roof.
> > >> >
> > >> > Yes. I forget why that was; I'll have to dig through my archives for
> > >> > that.
> > >>
> > >> Oh. I'd have thought that this alone could account for 3.5%.
>
> A later email indicated the reschedule interrupt count doubled since
> 2.6.24, and so I poked around a bit at the causes of resched_task.
>
> I think the -rt version of check_preempt_equal_prio has gotten much more
> expensive since 2.6.24.
>
> I'm sure these changes were made for good reasons, and this workload may
> not be a good reason to change it back. But, what does the patch below
> do to performance on 2.6.29-rcX?
>
> -chris
>
> diff --git a/kernel/sched_rt.c b/kernel/sched_rt.c
> index 954e1a8..bbe3492 100644
> --- a/kernel/sched_rt.c
> +++ b/kernel/sched_rt.c
> @@ -842,6 +842,7 @@ static void check_preempt_curr_rt(struct rq *rq,
> struct task_struct *p, int sync
> resched_task(rq->curr);
> return;
> }
> + return;
>
> #ifdef CONFIG_SMP
> /*

That should not cause much of a problem if the scheduling task is not
pinned to an CPU. But!!!!!

A recent change makes it expensive:

commit 24600ce89a819a8f2fb4fd69fd777218a82ade20
Author: Rusty Russell <[email protected]>
Date: Tue Nov 25 02:35:13 2008 +1030

sched: convert check_preempt_equal_prio to cpumask_var_t.

Impact: stack reduction for large NR_CPUS

which has:

static void check_preempt_equal_prio(struct rq *rq, struct task_struct
*p)
{
- cpumask_t mask;
+ cpumask_var_t mask;

if (rq->curr->rt.nr_cpus_allowed == 1)
return;

- if (p->rt.nr_cpus_allowed != 1
- && cpupri_find(&rq->rd->cpupri, p, &mask))
+ if (!alloc_cpumask_var(&mask, GFP_ATOMIC))
return;

check_preempt_equal_prio is in a scheduling hot path!!!!!

WTF are we allocating there for?

-- Steve

2009-01-19 18:56:36

by Chris Mason

[permalink] [raw]

Subject: RE: Mainline kernel OLTP performance update

On Mon, 2009-01-19 at 13:37 -0500, Steven Rostedt wrote:
> (added Rusty)
>
> On Mon, 2009-01-19 at 13:04 -0500, Chris Mason wrote:
> >
> > I think the -rt version of check_preempt_equal_prio has gotten much more
> > expensive since 2.6.24.
> >
> > I'm sure these changes were made for good reasons, and this workload may
> > not be a good reason to change it back. But, what does the patch below
> > do to performance on 2.6.29-rcX?
> >
> > -chris
> >
> > diff --git a/kernel/sched_rt.c b/kernel/sched_rt.c
> > index 954e1a8..bbe3492 100644
> > --- a/kernel/sched_rt.c
> > +++ b/kernel/sched_rt.c
> > @@ -842,6 +842,7 @@ static void check_preempt_curr_rt(struct rq *rq,
> > struct task_struct *p, int sync
> > resched_task(rq->curr);
> > return;
> > }
> > + return;
> >
> > #ifdef CONFIG_SMP
> > /*
>
> That should not cause much of a problem if the scheduling task is not
> pinned to an CPU. But!!!!!
>
> A recent change makes it expensive:

> + if (!alloc_cpumask_var(&mask, GFP_ATOMIC))
> return;

> check_preempt_equal_prio is in a scheduling hot path!!!!!
>
> WTF are we allocating there for?

I wasn't actually looking at the cost of the checks, even though they do
look higher (if they are using CONFIG_CPUMASK_OFFSTACK anyway).

The 2.6.24 code would trigger a rescheduling interrupt only when the
prio of the inbound task was higher than the running task.

This workload has a large number of equal priority rt tasks that are not
bound to a single CPU, and so I think it should trigger more
preempts/reschedules with the today's check_preempt_equal_prio().

-chris

2009-01-19 19:08:49

[permalink] [raw]

Subject: RE: Mainline kernel OLTP performance update

On Mon, 2009-01-19 at 13:55 -0500, Chris Mason wrote:

> I wasn't actually looking at the cost of the checks, even though they do
> look higher (if they are using CONFIG_CPUMASK_OFFSTACK anyway).
>
> The 2.6.24 code would trigger a rescheduling interrupt only when the
> prio of the inbound task was higher than the running task.
>
> This workload has a large number of equal priority rt tasks that are not
> bound to a single CPU, and so I think it should trigger more
> preempts/reschedules with the today's check_preempt_equal_prio().

Ah yeah. This is one of the things that shows RT being more "responsive"
but less on performance. An RT task wants to run ASAP even if that means
there's a chance of more interrupts and higher cache misses.

The old way would be much faster in general through put, but I measured
RT tasks taking up to tens of milliseconds to get scheduled. This is
unacceptable for an RT task.

-- Steve

2009-01-19 22:20:03

[permalink] [raw]

Subject: Re: Mainline kernel OLTP performance update

>>>System is a 2socket, 4 core AMD.
>>
>>Not exactly a large system :) Barely NUMA even with just two sockets.
>
>
> You're right ;)
>
> But at least it is exercising the NUMA paths in the allocator, and
> represents a pretty common size of system...
>
> I can run some tests on bigger systems at SUSE, but it is not always
> easy to set up "real" meaningful workloads on them or configure
> significant IO for them.

Not sure if I know enough git to pull your trees, or if this cobbler's child will
have much in the way of bigger systems, but there is a chance I might - contact
me offline with some pointers on how to pull and build the bits and such.

>>>Netperf UDP unidirectional send test (10 runs, higher better):
>>>
>>>Server and client bound to same CPU
>>>SLAB AVG=60.111 STD=1.59382
>>>SLQB AVG=60.167 STD=0.685347
>>>SLUB AVG=58.277 STD=0.788328
>>>
>>>Server and client bound to same socket, different CPUs
>>>SLAB AVG=85.938 STD=0.875794
>>>SLQB AVG=93.662 STD=2.07434
>>>SLUB AVG=81.983 STD=0.864362
>>>
>>>Server and client bound to different sockets
>>>SLAB AVG=78.801 STD=1.44118
>>>SLQB AVG=78.269 STD=1.10457
>>>SLUB AVG=71.334 STD=1.16809
>>>
>>
>> > ...
>>
>>>I haven't done any non-local network tests. Networking is the one of the
>>>subsystems most heavily dependent on slab performance, so if anybody
>>>cares to run their favourite tests, that would be really helpful.
>>
>>I'm guessing, but then are these Mbit/s figures? Would that be the sending
>>throughput or the receiving throughput?
>
>
> Yes, Mbit/s. They were... hmm, sending throughput I think, but each pair
> of numbers seemed to be identical IIRC?

Mega *bits* per second? And those were 4K sends right? That seems rather low
for loopback - I would have expected nearly two orders of magnitude more. I
wonder if the intra-stack flow control kicked-in? You might try adding test
specific -S and -s options to set much larger socket buffers to try to avoid
that. Or simply use TCP.

netperf -H <foo> ... -- -s 1M -S 1M -m 4K

>>I love to see netperf used, but why UDP and loopback?
>
>
> No really good reason. I guess I was hoping to keep other variables as
> small as possible. But I guess a real remote test would be a lot more
> realistic as a networking test. Hmm, but I could probably set up a test
> over a simple GbE link here. I'll try that.

If bandwidth is an issue, that is to say one saturates the link before much of
anything "interesting" happens in the host you can use something like aggregate
TCP_RR - ./configure with --enable_burst and then something like

netperf -H <remote> -t TCP_RR -- -D -b 32

and it will have as many as 33 discrete transactions in flight at one time on the
one connection. The -D is there to set TCP_NODELAY to preclude TCP chunking the
single-byte (default, take your pick of a more reasonable size) transactions into
one segment.

>>Also, how about the service demands?
>
>
> Well, over loopback and using CPU binding, I was hoping it wouldn't
> change much...

Hope... but verify :)

> but I see netperf does some measurements for you. I
> will consider those in future too.
>
> BTW. is it possible to do parallel netperf tests?

Yes, by (ab)using the confidence intervals code. Poke around in
http://www.netperf.org/svn/netperf2/doc/netperf.html in the "Aggregates" section,
and I can go into further details offline (or here if folks want to see the
discussion).

rick jones

2009-01-19 23:42:17

by Ingo Molnar

[permalink] [raw]

Subject: Re: Mainline kernel OLTP performance update

* Steven Rostedt <[email protected]> wrote:

> (added Rusty)
>
> On Mon, 2009-01-19 at 13:04 -0500, Chris Mason wrote:
> > On Thu, 2009-01-15 at 00:11 -0700, Ma, Chinang wrote:
> > > >> > > > >
> > > >> > > > > Linux OLTP Performance summary
> > > >> > > > > Kernel# Speedup(x) Intr/s CtxSw/s us% sys% idle%
> > > >iowait%
> > > >> > > > > 2.6.24.2 1.000 21969 43425 76 24 0
> > > >0
> > > >> > > > > 2.6.27.2 0.973 30402 43523 74 25 0
> > > >1
> > > >> > > > > 2.6.29-rc1 0.965 30331 41970 74 26 0
> > > >0
> > > >> >
> > > >> > > But the interrupt rate went through the roof.
> > > >> >
> > > >> > Yes. I forget why that was; I'll have to dig through my archives for
> > > >> > that.
> > > >>
> > > >> Oh. I'd have thought that this alone could account for 3.5%.
> >
> > A later email indicated the reschedule interrupt count doubled since
> > 2.6.24, and so I poked around a bit at the causes of resched_task.
> >
> > I think the -rt version of check_preempt_equal_prio has gotten much more
> > expensive since 2.6.24.
> >
> > I'm sure these changes were made for good reasons, and this workload may
> > not be a good reason to change it back. But, what does the patch below
> > do to performance on 2.6.29-rcX?
> >
> > -chris
> >
> > diff --git a/kernel/sched_rt.c b/kernel/sched_rt.c
> > index 954e1a8..bbe3492 100644
> > --- a/kernel/sched_rt.c
> > +++ b/kernel/sched_rt.c
> > @@ -842,6 +842,7 @@ static void check_preempt_curr_rt(struct rq *rq,
> > struct task_struct *p, int sync
> > resched_task(rq->curr);
> > return;
> > }
> > + return;
> >
> > #ifdef CONFIG_SMP
> > /*
>
> That should not cause much of a problem if the scheduling task is not
> pinned to an CPU. But!!!!!
>
> A recent change makes it expensive:
>
> commit 24600ce89a819a8f2fb4fd69fd777218a82ade20
> Author: Rusty Russell <[email protected]>
> Date: Tue Nov 25 02:35:13 2008 +1030
>
> sched: convert check_preempt_equal_prio to cpumask_var_t.
>
> Impact: stack reduction for large NR_CPUS
>
>
>
> which has:
>
> static void check_preempt_equal_prio(struct rq *rq, struct task_struct
> *p)
> {
> - cpumask_t mask;
> + cpumask_var_t mask;
>
> if (rq->curr->rt.nr_cpus_allowed == 1)
> return;
>
> - if (p->rt.nr_cpus_allowed != 1
> - && cpupri_find(&rq->rd->cpupri, p, &mask))
> + if (!alloc_cpumask_var(&mask, GFP_ATOMIC))
> return;
>
>
>
>
> check_preempt_equal_prio is in a scheduling hot path!!!!!
>
> WTF are we allocating there for?

Agreed - this needs to be fixed. Since this runs under the runqueue lock
we can have a temporary cpumask in the runqueue itself, not on the stack.

Ingo

2009-01-20 05:16:48

[permalink] [raw]

Subject: Re: Mainline kernel OLTP performance update

On Fri, 2009-01-16 at 11:20 +0100, Andi Kleen wrote:
> "Zhang, Yanmin" <[email protected]> writes:
>
>
> > I think that's because SLQB
> > doesn't pass through big object allocation to page allocator.
> > netperf UDP-U-1k has less improvement with SLQB.
>
> That sounds like just the page allocator needs to be improved.
> That would help everyone. We talked a bit about this earlier,
> some of the heuristics for hot/cold pages are quite outdated
> and have been tuned for obsolete machines and also its fast path
> is quite long. Unfortunately no code currently.
Andi,

Thanks for your kind information. I did more investigation with SLUB
on netperf UDP-U-4k issue.

oprofile shows:
328058 30.1342 linux-2.6.29-rc2 copy_user_generic_string
134666 12.3699 linux-2.6.29-rc2 __free_pages_ok
125447 11.5231 linux-2.6.29-rc2 get_page_from_freelist
22611 2.0770 linux-2.6.29-rc2 __sk_mem_reclaim
21442 1.9696 linux-2.6.29-rc2 list_del
21187 1.9462 linux-2.6.29-rc2 __ip_route_output_key

So __free_pages_ok and get_page_from_freelist consume too much cpu time.
With SLQB, these 2 functions almost don't consume time.

Command 'slabinfo -AD' shows:
Name Objects Alloc Free %Fast
:0000256 1685 29611065 29609548 99 99
:0000168 2987 164689 161859 94 39
:0004096 1471 114918 113490 99 97

So kmem_cache :0000256 is very active.

Kernel stack dump in __free_pages_ok shows
[<ffffffff8027010f>] __free_pages_ok+0x109/0x2e0
[<ffffffff8024bb34>] autoremove_wake_function+0x0/0x2e
[<ffffffff8060f387>] __kfree_skb+0x9/0x6f
[<ffffffff8061204b>] skb_free_datagram+0xc/0x31
[<ffffffff8064b528>] udp_recvmsg+0x1e7/0x26f
[<ffffffff8060b509>] sock_common_recvmsg+0x30/0x45
[<ffffffff80609acd>] sock_recvmsg+0xd5/0xed

The callchain is:
__kfree_skb =>
kfree_skbmem =>
kmem_cache_free(skbuff_head_cache, skb);

kmem_cache skbuff_head_cache's object size is just 256, so it shares the kmem_cache
with :0000256. Their order is 1 which means every slab consists of 2 physical pages.

netperf UDP-U-4k is a UDP stream testing. client process keeps sending 4k-size packets
to server process and server process just receives the packets one by one.

If we start CPU_NUM clients and the same number of servers, every client will send lots
of packets within one sched slice, then process scheduler schedules the server to receive
many packets within one sched slice; then client resends again. So there are many packets
in the queue. When server receive the packets, it frees skbuff_head_cache. When the slab's
objects are all free, the slab will be released by calling __free_pages. Such batch
sending/receiving creates lots of slab free activity.

Page allocator has an array at zone_pcp(zone, cpu)->pcp to keep a page buffer for page order 0.
But here skbuff_head_cache's order is 1, so UDP-U-4k couldn't benefit from the page buffer.

SLQB has no such issue, because:
1) SLQB has a percpu freelist. Free objects are put to the list firstly and can be picked up
later on quickly without lock. A batch parameter to control the free object recollection is mostly
1024.
2) SLQB slab order mostly is 0, so although sometimes it calls alloc_pages/free_pages, it can
benefit from zone_pcp(zone, cpu)->pcp page buffer.

So SLUB need resolve such issues that one process allocates a batch of objects and another process
frees them batchly.

yanmin

2009-01-20 12:42:52

by Gregory Haskins

[permalink] [raw]

Subject: Re: Mainline kernel OLTP performance update

Gregory Haskins wrote:
>
> Then, email the contents of /sys/kernel/debug/tracing/trace to me
>
>
>

[ Chinang has performed the trace as requested, but replied with a
reduced CC to avoid spamming people with a large file. This is
restoring the original list]

Ma, Chinang wrote:
> Hi Gregory,
> Trace in attachment. I trim down the distribution list. As the attachment is quite big.
>
> Thanks,
> -Chinang
>
Hi Chinang,

Thank you very much for taking the time to do this. I have analyzed
the trace: I do not see any smoking gun w.r.t. the theory that we are
over IPI'ing the system. There were holes in the data due to trace
limitations that rendered some of the data inconclusive. However, the
places where we did not run into trace limitations looked like
everything was functioning as designed.

That being said, I do see that you have a ton of prio 48(ish) threads
that are over-straining the RT push logic. The interesting thing here
is I recently pushed some patches to tip that have potential to help you
here. Could you try your test using the sched/rt branch from -tip?
Here is a clone link, for your convenience:

git://git.kernel.org/pub/scm/linux/kernel/git/x86/linux-2.6-tip.git sched/rt

For this run, do _not_ use the trace patch/config. I just want to see
if you observe performance improvements with OLTP configured for RT prio
when compared to historic rt-push/pull based kernels (including HEAD on
linus.git, as tested in the last run)

Thanks!
-Greg

Attachments:

signature.asc (257.00 B)
OpenPGP digital signature

2009-01-20 13:29:11

by Jens Axboe

[permalink] [raw]

Subject: Re: Mainline kernel OLTP performance update

On Wed, Jan 14 2009, Matthew Wilcox wrote:
> On Thu, Jan 15, 2009 at 03:39:05AM +0100, Andi Kleen wrote:
> > Andrew Morton <[email protected]> writes:
> > >> some of that back, but not as much as taking them out (even when
> > >> the sysctl'd variable is in a __read_mostly section). We tried a
> > >> patch from Jens to speed up the search for a new partition, but it
> > >> had no effect.
> > >
> > > I find this surprising.
> >
> > The test system has thousands of disks/LUNs which it writes to
> > all the time, in addition to a workload which is a real cache pig.
> > So any increase in the per LUN overhead directly leads to a lot
> > more cache misses in the kernel because it increases the working set
> > there sigificantly.
>
> This particular system has 450 spindles, but they're amalgamated into
> 30 logical volumes by the hardware or firmware. Linux sees 30 LUNs.
> Each one, though, has fifteen partitions on it, so that brings us back
> up to 450 partitions.
>
> This system, btw, is a scale model of the full system that would be used
> to get published results. If I remember correctly, a 1% performance
> regression on this system is likely to translate to a 2% regression on
> the full-scale system.

Matthew, lets see if we can get this a little closer to disappearing. I
don't see lookup problems in the current kernel with the one-hit cache,
but perhaps it's either not getting enough hits in this bigger test case
or perhaps it's simply the rcu locking and preempt disables that build
up enough to cause a slowdown.

First things first, can you get a run of 2.6.29-rc2 with this patch?
It'll enable you to turn off per-partition stats in sysfs. I'd suggest
doing a run with a 2.6.29-rc2 booted with this patch, and then another
run with part_stats set to 0 for every exposed spindle. Then post those
profiles!

diff --git a/block/blk-core.c b/block/blk-core.c
index a824e49..6f693ae 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -600,7 +600,8 @@ blk_init_queue_node(request_fn_proc *rfn, spinlock_t *lock, int node_id)
q->prep_rq_fn = NULL;
q->unplug_fn = generic_unplug_device;
q->queue_flags = (1 << QUEUE_FLAG_CLUSTER |
- 1 << QUEUE_FLAG_STACKABLE);
+ 1 << QUEUE_FLAG_STACKABLE |
+ 1 << QUEUE_FLAG_PART_STAT);
q->queue_lock = lock;

blk_queue_segment_boundary(q, BLK_SEG_BOUNDARY_MASK);
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index a29cb78..a6ec2e3 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -158,6 +158,29 @@ static ssize_t queue_rq_affinity_show(struct request_queue *q, char *page)
return queue_var_show(set != 0, page);
}

+static ssize_t queue_part_stat_store(struct request_queue *q, const char *page,
+ size_t count)
+{
+ unsigned long nm;
+ ssize_t ret = queue_var_store(&nm, page, count);
+
+ spin_lock_irq(q->queue_lock);
+ if (nm)
+ queue_flag_set(QUEUE_FLAG_PART_STAT, q);
+ else
+ queue_flag_clear(QUEUE_FLAG_PART_STAT, q);
+
+ spin_unlock_irq(q->queue_lock);
+ return ret;
+}
+
+static ssize_t queue_part_stat_show(struct request_queue *q, char *page)
+{
+ unsigned int set = test_bit(QUEUE_FLAG_PART_STAT, &q->queue_flags);
+
+ return queue_var_show(set != 0, page);
+}
+
static ssize_t
queue_rq_affinity_store(struct request_queue *q, const char *page, size_t count)
{
@@ -222,6 +245,12 @@ static struct queue_sysfs_entry queue_rq_affinity_entry = {
.store = queue_rq_affinity_store,
};

+static struct queue_sysfs_entry queue_part_stat_entry = {
+ .attr = {.name = "part_stats", .mode = S_IRUGO | S_IWUSR },
+ .show = queue_part_stat_show,
+ .store = queue_part_stat_store,
+};
+
static struct attribute *default_attrs[] = {
&queue_requests_entry.attr,
&queue_ra_entry.attr,
@@ -231,6 +260,7 @@ static struct attribute *default_attrs[] = {
&queue_hw_sector_size_entry.attr,
&queue_nomerges_entry.attr,
&queue_rq_affinity_entry.attr,
+ &queue_part_stat_entry.attr,
NULL,
};

diff --git a/block/genhd.c b/block/genhd.c
index 397960c..09cbac2 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -208,6 +208,9 @@ struct hd_struct *disk_map_sector_rcu(struct gendisk *disk, sector_t sector)
struct hd_struct *part;
int i;

+ if (!blk_queue_part_stat(disk->queue))
+ goto part0;
+
ptbl = rcu_dereference(disk->part_tbl);

part = rcu_dereference(ptbl->last_lookup);
@@ -222,6 +225,7 @@ struct hd_struct *disk_map_sector_rcu(struct gendisk *disk, sector_t sector)
return part;
}
}
+part0:
return &disk->part0;
}
EXPORT_SYMBOL_GPL(disk_map_sector_rcu);
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 044467e..4d45842 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -449,6 +449,7 @@ struct request_queue
#define QUEUE_FLAG_STACKABLE 13 /* supports request stacking */
#define QUEUE_FLAG_NONROT 14 /* non-rotational device (SSD) */
#define QUEUE_FLAG_VIRT QUEUE_FLAG_NONROT /* paravirt device */
+#define QUEUE_FLAG_PART_STAT 15 /* per-partition stats enabled */

static inline int queue_is_locked(struct request_queue *q)
{
@@ -568,6 +569,8 @@ enum {
#define blk_queue_flushing(q) ((q)->ordseq)
#define blk_queue_stackable(q) \
test_bit(QUEUE_FLAG_STACKABLE, &(q)->queue_flags)
+#define blk_queue_part_stat(q) \
+ test_bit(QUEUE_FLAG_PART_STAT, &(q)->queue_flags)

#define blk_fs_request(rq) ((rq)->cmd_type == REQ_TYPE_FS)
#define blk_pc_request(rq) ((rq)->cmd_type == REQ_TYPE_BLOCK_PC)

--
Jens Axboe

2009-01-22 00:31:50

[permalink] [raw]

Subject: Re: Mainline kernel OLTP performance update

On Tue, 20 Jan 2009, Zhang, Yanmin wrote:

> kmem_cache skbuff_head_cache's object size is just 256, so it shares the kmem_cache
> with :0000256. Their order is 1 which means every slab consists of 2 physical pages.

That order can be changed. Try specifying slub_max_order=0 on the kernel
command line to force an order 0 alloc.

The queues of the page allocator are of limited use due to their overhead.
Order-1 allocations can actually be 5% faster than order-0. order-0 makes
sense if pages are pushed rapidly to the page allocator and are then
reissues elsewhere. If there is a linear consumption then the page
allocator queues are just overhead.

> Page allocator has an array at zone_pcp(zone, cpu)->pcp to keep a page buffer for page order 0.
> But here skbuff_head_cache's order is 1, so UDP-U-4k couldn't benefit from the page buffer.

That usually does not matter because of partial list avoiding page
allocator actions.

> SLQB has no such issue, because:
> 1) SLQB has a percpu freelist. Free objects are put to the list firstly and can be picked up
> later on quickly without lock. A batch parameter to control the free object recollection is mostly
> 1024.
> 2) SLQB slab order mostly is 0, so although sometimes it calls alloc_pages/free_pages, it can
> benefit from zone_pcp(zone, cpu)->pcp page buffer.
>
> So SLUB need resolve such issues that one process allocates a batch of objects and another process
> frees them batchly.

SLUB has a percpu freelist but its bounded by the basic allocation unit.
You can increase that by modifying the allocation order. Writing a 3 or 5
into the order value in /sys/kernel/slab/xxx/order would do the trick.

2009-01-22 08:37:01

[permalink] [raw]

Subject: Re: Mainline kernel OLTP performance update

On Wed, 2009-01-21 at 18:58 -0500, Christoph Lameter wrote:
> On Tue, 20 Jan 2009, Zhang, Yanmin wrote:
>
> > kmem_cache skbuff_head_cache's object size is just 256, so it shares the kmem_cache
> > with :0000256. Their order is 1 which means every slab consists of 2 physical pages.
>
> That order can be changed. Try specifying slub_max_order=0 on the kernel
> command line to force an order 0 alloc.
I tried slub_max_order=0 and there is no improvement on this UDP-U-4k issue.
Both get_page_from_freelist and __free_pages_ok's cpu time are still very high.

I checked my instrumentation in kernel and found it's caused by large object allocation/free
whose size is more than PAGE_SIZE. Here its order is 1.

The right free callchain is __kfree_skb => skb_release_all => skb_release_data.

So this case isn't the issue that batch of allocation/free might erase partial page
functionality.

'#slaninfo -AD' couldn't show statistics of large object allocation/free. Can we add
such info? That will be more helpful.

In addition, I didn't find such issue wih TCP stream testing.

>
> The queues of the page allocator are of limited use due to their overhead.
> Order-1 allocations can actually be 5% faster than order-0. order-0 makes
> sense if pages are pushed rapidly to the page allocator and are then
> reissues elsewhere. If there is a linear consumption then the page
> allocator queues are just overhead.
>
> > Page allocator has an array at zone_pcp(zone, cpu)->pcp to keep a page buffer for page order 0.
> > But here skbuff_head_cache's order is 1, so UDP-U-4k couldn't benefit from the page buffer.
>
> That usually does not matter because of partial list avoiding page
> allocator actions.

>
> > SLQB has no such issue, because:
> > 1) SLQB has a percpu freelist. Free objects are put to the list firstly and can be picked up
> > later on quickly without lock. A batch parameter to control the free object recollection is mostly
> > 1024.
> > 2) SLQB slab order mostly is 0, so although sometimes it calls alloc_pages/free_pages, it can
> > benefit from zone_pcp(zone, cpu)->pcp page buffer.
> >
> > So SLUB need resolve such issues that one process allocates a batch of objects and another process
> > frees them batchly.
>
> SLUB has a percpu freelist but its bounded by the basic allocation unit.
> You can increase that by modifying the allocation order. Writing a 3 or 5
> into the order value in /sys/kernel/slab/xxx/order would do the trick.

2009-01-22 09:16:13

[permalink] [raw]

Subject: Re: Mainline kernel OLTP performance update

On Thu, 2009-01-22 at 16:36 +0800, Zhang, Yanmin wrote:
> On Wed, 2009-01-21 at 18:58 -0500, Christoph Lameter wrote:
> > On Tue, 20 Jan 2009, Zhang, Yanmin wrote:
> >
> > > kmem_cache skbuff_head_cache's object size is just 256, so it shares the kmem_cache
> > > with :0000256. Their order is 1 which means every slab consists of 2 physical pages.
> >
> > That order can be changed. Try specifying slub_max_order=0 on the kernel
> > command line to force an order 0 alloc.
> I tried slub_max_order=0 and there is no improvement on this UDP-U-4k issue.
> Both get_page_from_freelist and __free_pages_ok's cpu time are still very high.
>
> I checked my instrumentation in kernel and found it's caused by large object allocation/free
> whose size is more than PAGE_SIZE. Here its order is 1.
>
> The right free callchain is __kfree_skb => skb_release_all => skb_release_data.
>
> So this case isn't the issue that batch of allocation/free might erase partial page
> functionality.

So is this the kfree(skb->head) in skb_release_data() or the put_page()
calls in the same function in a loop?

If it's the former, with big enough size passed to __alloc_skb(), the
networking code might be taking a hit from the SLUB page allocator
pass-through.

Pekka

2009-01-22 09:29:07

[permalink] [raw]

Subject: Re: Mainline kernel OLTP performance update

On Thu, 2009-01-22 at 11:15 +0200, Pekka Enberg wrote:
> On Thu, 2009-01-22 at 16:36 +0800, Zhang, Yanmin wrote:
> > On Wed, 2009-01-21 at 18:58 -0500, Christoph Lameter wrote:
> > > On Tue, 20 Jan 2009, Zhang, Yanmin wrote:
> > >
> > > > kmem_cache skbuff_head_cache's object size is just 256, so it shares the kmem_cache
> > > > with :0000256. Their order is 1 which means every slab consists of 2 physical pages.
> > >
> > > That order can be changed. Try specifying slub_max_order=0 on the kernel
> > > command line to force an order 0 alloc.
> > I tried slub_max_order=0 and there is no improvement on this UDP-U-4k issue.
> > Both get_page_from_freelist and __free_pages_ok's cpu time are still very high.
> >
> > I checked my instrumentation in kernel and found it's caused by large object allocation/free
> > whose size is more than PAGE_SIZE. Here its order is 1.
> >
> > The right free callchain is __kfree_skb => skb_release_all => skb_release_data.
> >
> > So this case isn't the issue that batch of allocation/free might erase partial page
> > functionality.
>
> So is this the kfree(skb->head) in skb_release_data() or the put_page()
> calls in the same function in a loop?
It's kfree(skb->head).

>
> If it's the former, with big enough size passed to __alloc_skb(), the
> networking code might be taking a hit from the SLUB page allocator
> pass-through.

2009-01-22 09:48:17

[permalink] [raw]

Subject: Re: Mainline kernel OLTP performance update

On Thu, 2009-01-22 at 17:28 +0800, Zhang, Yanmin wrote:
> On Thu, 2009-01-22 at 11:15 +0200, Pekka Enberg wrote:
> > On Thu, 2009-01-22 at 16:36 +0800, Zhang, Yanmin wrote:
> > > On Wed, 2009-01-21 at 18:58 -0500, Christoph Lameter wrote:
> > > > On Tue, 20 Jan 2009, Zhang, Yanmin wrote:
> > > >
> > > > > kmem_cache skbuff_head_cache's object size is just 256, so it shares the kmem_cache
> > > > > with :0000256. Their order is 1 which means every slab consists of 2 physical pages.
> > > >
> > > > That order can be changed. Try specifying slub_max_order=0 on the kernel
> > > > command line to force an order 0 alloc.
> > > I tried slub_max_order=0 and there is no improvement on this UDP-U-4k issue.
> > > Both get_page_from_freelist and __free_pages_ok's cpu time are still very high.
> > >
> > > I checked my instrumentation in kernel and found it's caused by large object allocation/free
> > > whose size is more than PAGE_SIZE. Here its order is 1.
> > >
> > > The right free callchain is __kfree_skb => skb_release_all => skb_release_data.
> > >
> > > So this case isn't the issue that batch of allocation/free might erase partial page
> > > functionality.
> >
> > So is this the kfree(skb->head) in skb_release_data() or the put_page()
> > calls in the same function in a loop?
> It's kfree(skb->head).
>
> >
> > If it's the former, with big enough size passed to __alloc_skb(), the
> > networking code might be taking a hit from the SLUB page allocator
> > pass-through.

Do we know what kind of size is being passed to __alloc_skb() in this
case? Maybe we want to do something like this.

Pekka

SLUB: revert page allocator pass-through

This is a revert of commit aadb4bc4a1f9108c1d0fbd121827c936c2ed4217 ("SLUB:
direct pass through of page size or higher kmalloc requests").
---

diff --git a/include/linux/slub_def.h b/include/linux/slub_def.h
index 2f5c16b..3bd3662 100644
--- a/include/linux/slub_def.h
+++ b/include/linux/slub_def.h
@@ -124,7 +124,7 @@ struct kmem_cache {
* We keep the general caches in an array of slab caches that are used for
* 2^x bytes of allocations.
*/
-extern struct kmem_cache kmalloc_caches[PAGE_SHIFT + 1];
+extern struct kmem_cache kmalloc_caches[KMALLOC_SHIFT_HIGH + 1];

/*
* Sorry that the following has to be that ugly but some versions of GCC
@@ -135,6 +135,9 @@ static __always_inline int kmalloc_index(size_t size)
if (!size)
return 0;

+ if (size > KMALLOC_MAX_SIZE)
+ return -1;
+
if (size <= KMALLOC_MIN_SIZE)
return KMALLOC_SHIFT_LOW;

@@ -154,10 +157,6 @@ static __always_inline int kmalloc_index(size_t size)
if (size <= 1024) return 10;
if (size <= 2 * 1024) return 11;
if (size <= 4 * 1024) return 12;
-/*
- * The following is only needed to support architectures with a larger page
- * size than 4k.
- */
if (size <= 8 * 1024) return 13;
if (size <= 16 * 1024) return 14;
if (size <= 32 * 1024) return 15;
@@ -167,6 +166,10 @@ static __always_inline int kmalloc_index(size_t size)
if (size <= 512 * 1024) return 19;
if (size <= 1024 * 1024) return 20;
if (size <= 2 * 1024 * 1024) return 21;
+ if (size <= 4 * 1024 * 1024) return 22;
+ if (size <= 8 * 1024 * 1024) return 23;
+ if (size <= 16 * 1024 * 1024) return 24;
+ if (size <= 32 * 1024 * 1024) return 25;
return -1;

/*
@@ -191,6 +194,19 @@ static __always_inline struct kmem_cache *kmalloc_slab(size_t size)
if (index == 0)
return NULL;

+ /*
+ * This function only gets expanded if __builtin_constant_p(size), so
+ * testing it here shouldn't be needed. But some versions of gcc need
+ * help.
+ */
+ if (__builtin_constant_p(size) && index < 0) {
+ /*
+ * Generate a link failure. Would be great if we could
+ * do something to stop the compile here.
+ */
+ extern void __kmalloc_size_too_large(void);
+ __kmalloc_size_too_large();
+ }
return &kmalloc_caches[index];
}

@@ -204,17 +220,9 @@ static __always_inline struct kmem_cache *kmalloc_slab(size_t size)
void *kmem_cache_alloc(struct kmem_cache *, gfp_t);
void *__kmalloc(size_t size, gfp_t flags);

-static __always_inline void *kmalloc_large(size_t size, gfp_t flags)
-{
- return (void *)__get_free_pages(flags | __GFP_COMP, get_order(size));
-}
-
static __always_inline void *kmalloc(size_t size, gfp_t flags)
{
if (__builtin_constant_p(size)) {
- if (size > PAGE_SIZE)
- return kmalloc_large(size, flags);
-
if (!(flags & SLUB_DMA)) {
struct kmem_cache *s = kmalloc_slab(size);

diff --git a/mm/slub.c b/mm/slub.c
index 6392ae5..8fad23f 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2475,7 +2475,7 @@ EXPORT_SYMBOL(kmem_cache_destroy);
* Kmalloc subsystem
*******************************************************************/

-struct kmem_cache kmalloc_caches[PAGE_SHIFT + 1] __cacheline_aligned;
+struct kmem_cache kmalloc_caches[KMALLOC_SHIFT_HIGH + 1] __cacheline_aligned;
EXPORT_SYMBOL(kmalloc_caches);

static int __init setup_slub_min_order(char *str)
@@ -2537,7 +2537,7 @@ panic:
}

#ifdef CONFIG_ZONE_DMA
-static struct kmem_cache *kmalloc_caches_dma[PAGE_SHIFT + 1];
+static struct kmem_cache *kmalloc_caches_dma[KMALLOC_SHIFT_HIGH + 1];

static void sysfs_add_func(struct work_struct *w)
{
@@ -2643,8 +2643,12 @@ static struct kmem_cache *get_slab(size_t size, gfp_t flags)
return ZERO_SIZE_PTR;

index = size_index[(size - 1) / 8];
- } else
+ } else {
+ if (size > KMALLOC_MAX_SIZE)
+ return NULL;
+
index = fls(size - 1);
+ }

#ifdef CONFIG_ZONE_DMA
if (unlikely((flags & SLUB_DMA)))
@@ -2658,9 +2662,6 @@ void *__kmalloc(size_t size, gfp_t flags)
{
struct kmem_cache *s;

- if (unlikely(size > PAGE_SIZE))
- return kmalloc_large(size, flags);
-
s = get_slab(size, flags);

if (unlikely(ZERO_OR_NULL_PTR(s)))
@@ -2670,25 +2671,11 @@ void *__kmalloc(size_t size, gfp_t flags)
}
EXPORT_SYMBOL(__kmalloc);

-static void *kmalloc_large_node(size_t size, gfp_t flags, int node)
-{
- struct page *page = alloc_pages_node(node, flags | __GFP_COMP,
- get_order(size));
-
- if (page)
- return page_address(page);
- else
- return NULL;
-}
-
#ifdef CONFIG_NUMA
void *__kmalloc_node(size_t size, gfp_t flags, int node)
{
struct kmem_cache *s;

- if (unlikely(size > PAGE_SIZE))
- return kmalloc_large_node(size, flags, node);
-
s = get_slab(size, flags);

if (unlikely(ZERO_OR_NULL_PTR(s)))
@@ -2746,11 +2733,8 @@ void kfree(const void *x)
return;

page = virt_to_head_page(x);
- if (unlikely(!PageSlab(page))) {
- BUG_ON(!PageCompound(page));
- put_page(page);
+ if (unlikely(WARN_ON(!PageSlab(page)))) /* XXX */
return;
- }
slab_free(page->slab, page, object, _RET_IP_);
}
EXPORT_SYMBOL(kfree);
@@ -2985,7 +2969,7 @@ void __init kmem_cache_init(void)
caches++;
}

- for (i = KMALLOC_SHIFT_LOW; i <= PAGE_SHIFT; i++) {
+ for (i = KMALLOC_SHIFT_LOW; i <= KMALLOC_SHIFT_HIGH; i++) {
create_kmalloc_cache(&kmalloc_caches[i],
"kmalloc", 1 << i, GFP_KERNEL);
caches++;
@@ -3022,7 +3006,7 @@ void __init kmem_cache_init(void)
slab_state = UP;

/* Provide the correct kmalloc names now that the caches are up */
- for (i = KMALLOC_SHIFT_LOW; i <= PAGE_SHIFT; i++)
+ for (i = KMALLOC_SHIFT_LOW; i <= KMALLOC_SHIFT_HIGH; i++)
kmalloc_caches[i]. name =
kasprintf(GFP_KERNEL, "kmalloc-%d", 1 << i);

@@ -3222,9 +3206,6 @@ void *__kmalloc_track_caller(size_t size, gfp_t gfpflags, unsigned long caller)
{
struct kmem_cache *s;

- if (unlikely(size > PAGE_SIZE))
- return kmalloc_large(size, gfpflags);
-
s = get_slab(size, gfpflags);

if (unlikely(ZERO_OR_NULL_PTR(s)))
@@ -3238,9 +3219,6 @@ void *__kmalloc_node_track_caller(size_t size, gfp_t gfpflags,
{
struct kmem_cache *s;

- if (unlikely(size > PAGE_SIZE))
- return kmalloc_large_node(size, gfpflags, node);
-
s = get_slab(size, gfpflags);

if (unlikely(ZERO_OR_NULL_PTR(s)))

2009-01-22 11:31:43

by Jens Axboe

[permalink] [raw]

Subject: Re: Mainline kernel OLTP performance update

On Wed, Jan 21 2009, Chilukuri, Harita wrote:
> Jen, we work with Matthew on the OLTP workload and have tested the part_stats patch on 2.6.29-rc2. Below are the details:
>
> Disabling the part_stats has positive impact on the OLTP workload.
>
> Linux OLTP Performance summary
> Kernel# Speedup(x) Intr/s CtxSw/s us% sys% idle% iowait%
> 2.6.29-rc2-part_stats 1.000 30329 41716 74 26 0 0
> 2.6.29-rc2-disable-part_stats 1.006 30413 42582 74 25 0 0
>
> Server configurations:
> Intel Xeon Quad-core 2.0GHz 2 cpus/8 cores/8 threads
> 64GB memory, 3 qle2462 FC HBA, 450 spindles (30 logical units)
>
>
> ======oprofile CPU_CLK_UNHALTED for top 30 functions
> Cycles% 2.6.29-rc2-part_stats Cycles% 2.6.29-rc2-disable-part_stats
> 0.9634 qla24xx_intr_handler 1.0372 qla24xx_intr_handler
> 0.9057 copy_user_generic_string 0.7461 qla24xx_wrt_req_reg
> 0.7583 unmap_vmas 0.7130 kmem_cache_alloc
> 0.6280 qla24xx_wrt_req_reg 0.6876 copy_user_generic_string
> 0.6088 kmem_cache_alloc 0.5656 qla24xx_start_scsi
> 0.5468 clear_page_c 0.4881 __blockdev_direct_IO
> 0.5191 qla24xx_start_scsi 0.4728 try_to_wake_up
> 0.4892 try_to_wake_up 0.4588 unmap_vmas
> 0.4870 __blockdev_direct_IO 0.4360 scsi_request_fn
> 0.4187 scsi_request_fn 0.3711 __switch_to
> 0.3717 __switch_to 0.3699 aio_complete
> 0.3567 rb_get_reader_page 0.3648 rb_get_reader_page
> 0.3396 aio_complete 0.3597 ring_buffer_consume
> 0.3012 __end_that_request_first 0.3292 memset_c
> 0.2926 memset_c 0.3076 __list_add
> 0.2926 ring_buffer_consume 0.2771 clear_page_c
> 0.2884 page_remove_rmap 0.2745 task_rq_lock
> 0.2691 disk_map_sector_rcu 0.2733 generic_make_request
> 0.2670 copy_page_c 0.2555 tcp_sendmsg
> 0.2670 lock_timer_base 0.2529 qla2x00_process_completed_re
> 0.2606 qla2x00_process_completed_re0.2440 e1000_xmit_frame
> 0.2521 task_rq_lock 0.2390 lock_timer_base
> 0.2328 __list_add 0.2364 qla24xx_queuecommand
> 0.2286 generic_make_request 0.2301 kmem_cache_free
> 0.2286 pick_next_highest_task_rt 0.2262 blk_queue_end_tag
> 0.2136 push_rt_task 0.2262 kref_get
> 0.2115 blk_queue_end_tag 0.2250 push_rt_task
> 0.2115 kmem_cache_free 0.2135 scsi_dispatch_cmd
> 0.2051 e1000_xmit_frame 0.2084 sd_prep_fn
> 0.2051 scsi_device_unbusy 0.2059 kfree

Alright, so that 0.6%. IIRC, 0.1% (or there abouts) is significant with
this benchmark, correct? To get a feel for the rest of the accounting
overhead, could you try with this patch that just disables the whole
thing?

diff --git a/block/blk-core.c b/block/blk-core.c
index a824e49..eec9126 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -64,6 +64,7 @@ static struct workqueue_struct *kblockd_workqueue;

static void drive_stat_acct(struct request *rq, int new_io)
{
+#if 0
struct hd_struct *part;
int rw = rq_data_dir(rq);
int cpu;
@@ -82,6 +83,7 @@ static void drive_stat_acct(struct request *rq, int new_io)
}

part_stat_unlock();
+#endif
}

void blk_queue_congestion_threshold(struct request_queue *q)
@@ -1014,6 +1017,7 @@ static inline void add_request(struct request_queue *q, struct request *req)
__elv_add_request(q, req, ELEVATOR_INSERT_SORT, 0);
}

+#if 0
static void part_round_stats_single(int cpu, struct hd_struct *part,
unsigned long now)
{
@@ -1027,6 +1031,7 @@ static void part_round_stats_single(int cpu, struct hd_struct *part,
}
part->stamp = now;
}
+#endif

/**
* part_round_stats() - Round off the performance stats on a struct disk_stats.
@@ -1046,11 +1051,13 @@ static void part_round_stats_single(int cpu, struct hd_struct *part,
*/
void part_round_stats(int cpu, struct hd_struct *part)
{
+#if 0
unsigned long now = jiffies;

if (part->partno)
part_round_stats_single(cpu, &part_to_disk(part)->part0, now);
part_round_stats_single(cpu, part, now);
+#endif
}
EXPORT_SYMBOL_GPL(part_round_stats);

@@ -1690,6 +1697,7 @@ static int __end_that_request_first(struct request *req, int error,
(unsigned long long)req->sector);
}

+#if 0
if (blk_fs_request(req) && req->rq_disk) {
const int rw = rq_data_dir(req);
struct hd_struct *part;
@@ -1700,6 +1708,7 @@ static int __end_that_request_first(struct request *req, int error,
part_stat_add(cpu, part, sectors[rw], nr_bytes >> 9);
part_stat_unlock();
}
+#endif

total_bytes = bio_nbytes = 0;
while ((bio = req->bio) != NULL) {
@@ -1779,7 +1788,9 @@ static int __end_that_request_first(struct request *req, int error,
*/
static void end_that_request_last(struct request *req, int error)
{
+#if 0
struct gendisk *disk = req->rq_disk;
+#endif

if (blk_rq_tagged(req))
blk_queue_end_tag(req->q, req);
@@ -1797,6 +1808,7 @@ static void end_that_request_last(struct request *req, int error)
* IO on queueing nor completion. Accounting the containing
* request is enough.
*/
+#if 0
if (disk && blk_fs_request(req) && req != &req->q->bar_rq) {
unsigned long duration = jiffies - req->start_time;
const int rw = rq_data_dir(req);
@@ -1813,6 +1825,7 @@ static void end_that_request_last(struct request *req, int error)

part_stat_unlock();
}
+#endif

if (req->end_io)
req->end_io(req, error);

--
Jens Axboe

2009-01-23 03:03:23

[permalink] [raw]

Subject: Re: Mainline kernel OLTP performance update

On Thu, 2009-01-22 at 11:47 +0200, Pekka Enberg wrote:
> On Thu, 2009-01-22 at 17:28 +0800, Zhang, Yanmin wrote:
> > On Thu, 2009-01-22 at 11:15 +0200, Pekka Enberg wrote:
> > > On Thu, 2009-01-22 at 16:36 +0800, Zhang, Yanmin wrote:
> > > > On Wed, 2009-01-21 at 18:58 -0500, Christoph Lameter wrote:
> > > > > On Tue, 20 Jan 2009, Zhang, Yanmin wrote:
> > > > >
> > > > > > kmem_cache skbuff_head_cache's object size is just 256, so it shares the kmem_cache
> > > > > > with :0000256. Their order is 1 which means every slab consists of 2 physical pages.
> > > > >
> > > > > That order can be changed. Try specifying slub_max_order=0 on the kernel
> > > > > command line to force an order 0 alloc.
> > > > I tried slub_max_order=0 and there is no improvement on this UDP-U-4k issue.
> > > > Both get_page_from_freelist and __free_pages_ok's cpu time are still very high.
> > > >
> > > > I checked my instrumentation in kernel and found it's caused by large object allocation/free
> > > > whose size is more than PAGE_SIZE. Here its order is 1.
> > > >
> > > > The right free callchain is __kfree_skb => skb_release_all => skb_release_data.
> > > >
> > > > So this case isn't the issue that batch of allocation/free might erase partial page
> > > > functionality.
> > >
> > > So is this the kfree(skb->head) in skb_release_data() or the put_page()
> > > calls in the same function in a loop?
> > It's kfree(skb->head).
> >
> > >
> > > If it's the former, with big enough size passed to __alloc_skb(), the
> > > networking code might be taking a hit from the SLUB page allocator
> > > pass-through.
>
> Do we know what kind of size is being passed to __alloc_skb() in this
> case?
In function __alloc_skb, original parameter size=4155,
SKB_DATA_ALIGN(size)=4224, sizeof(struct skb_shared_info)=472, so
__kmalloc_track_caller's parameter size=4696.

> Maybe we want to do something like this.
>
> Pekka
>
> SLUB: revert page allocator pass-through
This patch amost fixes the netperf UDP-U-4k issue.

#slabinfo -AD
Name Objects Alloc Free %Fast
:0000256 1658 70350463 70348946 99 99
kmalloc-8192 31 70322309 70322293 99 99
:0000168 2592 143154 140684 93 28
:0004096 1456 91072 89644 99 96
:0000192 3402 63838 60491 89 11
:0000064 6177 49635 43743 98 77

So kmalloc-8192 appears. Without the patch, kmalloc-8192 hides.
kmalloc-8192's default order on my 8-core stoakley is 2.

1) If I start CPU_NUM clients and servers, SLUB's result is about 2% better than SLQB's;
2) If I start 1 clinet and 1 server, and bind them to different physical cpu, SLQB's result
is about 10% better than SLUB's.

I don't know why there is still 10% difference with item 2). Maybe cachemiss causes it?

>
> This is a revert of commit aadb4bc4a1f9108c1d0fbd121827c936c2ed4217 ("SLUB:
> direct pass through of page size or higher kmalloc requests").
> ---
>
> diff --git a/include/linux/slub_def.h b/include/linux/slub_def.h
> index 2f5c16b..3bd3662 100644
> --- a/include/linux/slub_def.h
> +++ b/include/linux/slub_def.h

2009-01-23 06:56:57

[permalink] [raw]

Subject: Re: Mainline kernel OLTP performance update

Zhang, Yanmin wrote:
>>>> If it's the former, with big enough size passed to __alloc_skb(), the
>>>> networking code might be taking a hit from the SLUB page allocator
>>>> pass-through.
>> Do we know what kind of size is being passed to __alloc_skb() in this
>> case?
> In function __alloc_skb, original parameter size=4155,
> SKB_DATA_ALIGN(size)=4224, sizeof(struct skb_shared_info)=472, so
> __kmalloc_track_caller's parameter size=4696.

OK, so all allocations go straight to the page allocator.

>
>> Maybe we want to do something like this.
>>
>> SLUB: revert page allocator pass-through
> This patch amost fixes the netperf UDP-U-4k issue.
>
> #slabinfo -AD
> Name Objects Alloc Free %Fast
> :0000256 1658 70350463 70348946 99 99
> kmalloc-8192 31 70322309 70322293 99 99
> :0000168 2592 143154 140684 93 28
> :0004096 1456 91072 89644 99 96
> :0000192 3402 63838 60491 89 11
> :0000064 6177 49635 43743 98 77
>
> So kmalloc-8192 appears. Without the patch, kmalloc-8192 hides.
> kmalloc-8192's default order on my 8-core stoakley is 2.

Christoph, should we merge my patch as-is or do you have an alternative
fix in mind? We could, of course, increase kmalloc() caches one level up
to 8192 or higher.

>
> 1) If I start CPU_NUM clients and servers, SLUB's result is about 2% better than SLQB's;
> 2) If I start 1 clinet and 1 server, and bind them to different physical cpu, SLQB's result
> is about 10% better than SLUB's.
>
> I don't know why there is still 10% difference with item 2). Maybe cachemiss causes it?

Maybe we can use the perfstat and/or kerneltop utilities of the new perf
counters patch to diagnose this:

http://lkml.org/lkml/2009/1/21/273

And do oprofile, of course. Thanks!

Pekka

2009-01-23 08:06:56

[permalink] [raw]

Subject: Re: Mainline kernel OLTP performance update

On Fri, 2009-01-23 at 08:52 +0200, Pekka Enberg wrote:
> > 1) If I start CPU_NUM clients and servers, SLUB's result is about 2% better than SLQB's;
> > 2) If I start 1 clinet and 1 server, and bind them to different physical cpu, SLQB's result
> > is about 10% better than SLUB's.
> >
> > I don't know why there is still 10% difference with item 2). Maybe cachemiss causes it?
>
> Maybe we can use the perfstat and/or kerneltop utilities of the new perf
> counters patch to diagnose this:
>
> http://lkml.org/lkml/2009/1/21/273
>
> And do oprofile, of course. Thanks!

I assume binding the client and the server to different physical CPUs
also means that the SKB is always allocated on CPU 1 and freed on CPU
2? If so, we will be taking the __slab_free() slow path all the time on
kfree() which will cause cache effects, no doubt.

But there's another potential performance hit we're taking because the
object size of the cache is so big. As allocations from CPU 1 keep
coming in, we need to allocate new pages and unfreeze the per-cpu page.
That in turn causes __slab_free() to be more eager to discard the slab
(see the PageSlubFrozen check there).

So before going for cache profiling, I'd really like to see an oprofile
report. I suspect we're still going to see much more page allocator
activity there than with SLAB or SLQB which is why we're still behaving
so badly here.

Pekka

2009-01-23 08:30:34

[permalink] [raw]

Subject: Re: Mainline kernel OLTP performance update

On Fri, 2009-01-23 at 10:06 +0200, Pekka Enberg wrote:
> On Fri, 2009-01-23 at 08:52 +0200, Pekka Enberg wrote:
> > > 1) If I start CPU_NUM clients and servers, SLUB's result is about 2% better than SLQB's;
> > > 2) If I start 1 clinet and 1 server, and bind them to different physical cpu, SLQB's result
> > > is about 10% better than SLUB's.
> > >
> > > I don't know why there is still 10% difference with item 2). Maybe cachemiss causes it?
> >
> > Maybe we can use the perfstat and/or kerneltop utilities of the new perf
> > counters patch to diagnose this:
> >
> > http://lkml.org/lkml/2009/1/21/273
> >
> > And do oprofile, of course. Thanks!
>
> I assume binding the client and the server to different physical CPUs
> also means that the SKB is always allocated on CPU 1 and freed on CPU
> 2? If so, we will be taking the __slab_free() slow path all the time on
> kfree() which will cause cache effects, no doubt.
>
> But there's another potential performance hit we're taking because the
> object size of the cache is so big. As allocations from CPU 1 keep
> coming in, we need to allocate new pages and unfreeze the per-cpu page.
> That in turn causes __slab_free() to be more eager to discard the slab
> (see the PageSlubFrozen check there).
>
> So before going for cache profiling, I'd really like to see an oprofile
> report. I suspect we're still going to see much more page allocator
> activity
Theoretically, it should, but oprofile doesn't show that.

> there than with SLAB or SLQB which is why we're still behaving
> so badly here.

oprofile output with 2.6.29-rc2-slubrevertlarge:
CPU: Core 2, speed 2666.71 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (Unhalted core cycles) count 100000
samples % app name symbol name
132779 32.9951 vmlinux copy_user_generic_string
25334 6.2954 vmlinux schedule
21032 5.2264 vmlinux tg_shares_up
17175 4.2679 vmlinux __skb_recv_datagram
9091 2.2591 vmlinux sock_def_readable
8934 2.2201 vmlinux mwait_idle
8796 2.1858 vmlinux try_to_wake_up
6940 1.7246 vmlinux __slab_free

#slaninfo -AD
Name Objects Alloc Free %Fast
:0000256 1643 5215544 5214027 94 0
kmalloc-8192 28 5189576 5189560 0 0
:0000168 2631 141466 138976 92 28
:0004096 1452 88697 87269 99 96
:0000192 3402 63050 59732 89 11
:0000064 6265 46611 40721 98 82
:0000128 1895 30429 28654 93 32

oprofile output with kernel 2.6.29-rc2-slqb0121:
CPU: Core 2, speed 2666.76 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (Unhalted core cycles) count 100000
samples % image name app name symbol name
114793 28.7163 vmlinux vmlinux copy_user_generic_string
27880 6.9744 vmlinux vmlinux tg_shares_up
22218 5.5580 vmlinux vmlinux schedule
12238 3.0614 vmlinux vmlinux mwait_idle
7395 1.8499 vmlinux vmlinux task_rq_lock
7348 1.8382 vmlinux vmlinux sock_def_readable
7202 1.8016 vmlinux vmlinux sched_clock_cpu
6981 1.7464 vmlinux vmlinux __skb_recv_datagram
6566 1.6425 vmlinux vmlinux udp_queue_rcv_skb

2009-01-23 08:33:48

[permalink] [raw]

Subject: Re: Mainline kernel OLTP performance update

On Friday 23 January 2009 14:02:53 Zhang, Yanmin wrote:

> 1) If I start CPU_NUM clients and servers, SLUB's result is about 2% better
> than SLQB's;

I'll have to look into this too. Could be evidence of the possible
TLB improvement from using bigger pages and/or page-specific freelist,
I suppose.

Do you have a scripted used to start netperf in that configuration?

2009-01-23 08:40:31

[permalink] [raw]

Subject: Re: Mainline kernel OLTP performance update

On Fri, 2009-01-23 at 16:30 +0800, Zhang, Yanmin wrote:
> > I assume binding the client and the server to different physical CPUs
> > also means that the SKB is always allocated on CPU 1 and freed on CPU
> > 2? If so, we will be taking the __slab_free() slow path all the time on
> > kfree() which will cause cache effects, no doubt.
> >
> > But there's another potential performance hit we're taking because the
> > object size of the cache is so big. As allocations from CPU 1 keep
> > coming in, we need to allocate new pages and unfreeze the per-cpu page.
> > That in turn causes __slab_free() to be more eager to discard the slab
> > (see the PageSlubFrozen check there).
> >
> > So before going for cache profiling, I'd really like to see an oprofile
> > report. I suspect we're still going to see much more page allocator
> > activity
> Theoretically, it should, but oprofile doesn't show that.
>
> > there than with SLAB or SLQB which is why we're still behaving
> > so badly here.
>
> oprofile output with 2.6.29-rc2-slubrevertlarge:
> CPU: Core 2, speed 2666.71 MHz (estimated)
> Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (Unhalted core cycles) count 100000
> samples % app name symbol name
> 132779 32.9951 vmlinux copy_user_generic_string
> 25334 6.2954 vmlinux schedule
> 21032 5.2264 vmlinux tg_shares_up
> 17175 4.2679 vmlinux __skb_recv_datagram
> 9091 2.2591 vmlinux sock_def_readable
> 8934 2.2201 vmlinux mwait_idle
> 8796 2.1858 vmlinux try_to_wake_up
> 6940 1.7246 vmlinux __slab_free
>
> #slaninfo -AD
> Name Objects Alloc Free %Fast
> :0000256 1643 5215544 5214027 94 0
> kmalloc-8192 28 5189576 5189560 0 0
^^^^^^

This looks bit funny. Hmm.

> :0000168 2631 141466 138976 92 28
> :0004096 1452 88697 87269 99 96
> :0000192 3402 63050 59732 89 11
> :0000064 6265 46611 40721 98 82
> :0000128 1895 30429 28654 93 32

2009-01-23 09:03:03

[permalink] [raw]

Subject: Re: Mainline kernel OLTP performance update

On Fri, 2009-01-23 at 19:33 +1100, Nick Piggin wrote:
> On Friday 23 January 2009 14:02:53 Zhang, Yanmin wrote:
>
> > 1) If I start CPU_NUM clients and servers, SLUB's result is about 2% better
> > than SLQB's;
>
> I'll have to look into this too. Could be evidence of the possible
> TLB improvement from using bigger pages and/or page-specific freelist,
> I suppose.
>
> Do you have a scripted used to start netperf in that configuration?
See the attachment.

Steps to run testing:
1) compile netperf;
2) Change PROG_DIR to path/to/netperf/src;
3) ./start_netperf_udp_v4.sh 8 #Assume your machine has 8 logical cpus.

Attachments:

start_netperf_udp_v4.sh (1.33 kB)

2009-01-23 09:46:50

[permalink] [raw]

Subject: Re: Mainline kernel OLTP performance update

On Fri, 2009-01-23 at 16:30 +0800, Zhang, Yanmin wrote:
> On Fri, 2009-01-23 at 10:06 +0200, Pekka Enberg wrote:
> > On Fri, 2009-01-23 at 08:52 +0200, Pekka Enberg wrote:
> > > > 1) If I start CPU_NUM clients and servers, SLUB's result is about 2% better than SLQB's;
> > > > 2) If I start 1 clinet and 1 server, and bind them to different physical cpu, SLQB's result
> > > > is about 10% better than SLUB's.
> > > >
> > > > I don't know why there is still 10% difference with item 2). Maybe cachemiss causes it?
> > >
> > > Maybe we can use the perfstat and/or kerneltop utilities of the new perf
> > > counters patch to diagnose this:
> > >
> > > http://lkml.org/lkml/2009/1/21/273
> > >
> > > And do oprofile, of course. Thanks!
> >
> > I assume binding the client and the server to different physical CPUs
> > also means that the SKB is always allocated on CPU 1 and freed on CPU
> > 2? If so, we will be taking the __slab_free() slow path all the time on
> > kfree() which will cause cache effects, no doubt.
> >
> > But there's another potential performance hit we're taking because the
> > object size of the cache is so big. As allocations from CPU 1 keep
> > coming in, we need to allocate new pages and unfreeze the per-cpu page.
> > That in turn causes __slab_free() to be more eager to discard the slab
> > (see the PageSlubFrozen check there).
> >
> > So before going for cache profiling, I'd really like to see an oprofile
> > report. I suspect we're still going to see much more page allocator
> > activity
> Theoretically, it should, but oprofile doesn't show that.

That's bit surprising, actually. FWIW, I've included a patch for empty
slab lists. But it's probably not going to help here.

> > there than with SLAB or SLQB which is why we're still behaving
> > so badly here.
>
> oprofile output with 2.6.29-rc2-slubrevertlarge:
> CPU: Core 2, speed 2666.71 MHz (estimated)
> Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (Unhalted core cycles) count 100000
> samples % app name symbol name
> 132779 32.9951 vmlinux copy_user_generic_string
> 25334 6.2954 vmlinux schedule
> 21032 5.2264 vmlinux tg_shares_up
> 17175 4.2679 vmlinux __skb_recv_datagram
> 9091 2.2591 vmlinux sock_def_readable
> 8934 2.2201 vmlinux mwait_idle
> 8796 2.1858 vmlinux try_to_wake_up
> 6940 1.7246 vmlinux __slab_free
>
> #slaninfo -AD
> Name Objects Alloc Free %Fast
> :0000256 1643 5215544 5214027 94 0
> kmalloc-8192 28 5189576 5189560 0 0
> :0000168 2631 141466 138976 92 28
> :0004096 1452 88697 87269 99 96
> :0000192 3402 63050 59732 89 11
> :0000064 6265 46611 40721 98 82
> :0000128 1895 30429 28654 93 32

Looking at __slab_free(), unless page->inuse is constantly zero and we
discard the slab, it really is just cache effects (10% sounds like a
lot, though!). AFAICT, the only way to optimize that is with Christoph's
unfinished pointer freelists patches or with a remote free list like in
SLQB.

Pekka

diff --git a/include/linux/slub_def.h b/include/linux/slub_def.h
index 3bd3662..41a4c1a 100644
--- a/include/linux/slub_def.h
+++ b/include/linux/slub_def.h
@@ -48,6 +48,9 @@ struct kmem_cache_node {
unsigned long nr_partial;
unsigned long min_partial;
struct list_head partial;
+ unsigned long nr_empty;
+ unsigned long max_empty;
+ struct list_head empty;
#ifdef CONFIG_SLUB_DEBUG
atomic_long_t nr_slabs;
atomic_long_t total_objects;
diff --git a/mm/slub.c b/mm/slub.c
index 8fad23f..5a12597 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -134,6 +134,11 @@
*/
#define MAX_PARTIAL 10

+/*
+ * Maximum number of empty slabs.
+ */
+#define MAX_EMPTY 1
+
#define DEBUG_DEFAULT_FLAGS (SLAB_DEBUG_FREE | SLAB_RED_ZONE | \
SLAB_POISON | SLAB_STORE_USER)

@@ -1205,6 +1210,24 @@ static void discard_slab(struct kmem_cache *s, struct page *page)
free_slab(s, page);
}

+static void discard_or_cache_slab(struct kmem_cache *s, struct page *page)
+{
+ struct kmem_cache_node *n;
+ int node;
+
+ node = page_to_nid(page);
+ n = get_node(s, node);
+
+ dec_slabs_node(s, node, page->objects);
+
+ if (likely(n->nr_empty >= n->max_empty)) {
+ free_slab(s, page);
+ } else {
+ n->nr_empty++;
+ list_add(&page->lru, &n->partial);
+ }
+}
+
/*
* Per slab locking using the pagelock
*/
@@ -1252,7 +1275,7 @@ static void remove_partial(struct kmem_cache *s, struct page *page)
}

/*
- * Lock slab and remove from the partial list.
+ * Lock slab and remove from the partial or empty list.
*
* Must hold list_lock.
*/
@@ -1261,7 +1284,6 @@ static inline int lock_and_freeze_slab(struct kmem_cache_node *n,
{
if (slab_trylock(page)) {
list_del(&page->lru);
- n->nr_partial--;
__SetPageSlubFrozen(page);
return 1;
}
@@ -1271,7 +1293,7 @@ static inline int lock_and_freeze_slab(struct kmem_cache_node *n,
/*
* Try to allocate a partial slab from a specific node.
*/
-static struct page *get_partial_node(struct kmem_cache_node *n)
+static struct page *get_partial_or_empty_node(struct kmem_cache_node *n)
{
struct page *page;

@@ -1281,13 +1303,22 @@ static struct page *get_partial_node(struct kmem_cache_node *n)
* partial slab and there is none available then get_partials()
* will return NULL.
*/
- if (!n || !n->nr_partial)
+ if (!n || (!n->nr_partial && !n->nr_empty))
return NULL;

spin_lock(&n->list_lock);
+
list_for_each_entry(page, &n->partial, lru)
- if (lock_and_freeze_slab(n, page))
+ if (lock_and_freeze_slab(n, page)) {
+ n->nr_partial--;
+ goto out;
+ }
+
+ list_for_each_entry(page, &n->empty, lru)
+ if (lock_and_freeze_slab(n, page)) {
+ n->nr_empty--;
goto out;
+ }
page = NULL;
out:
spin_unlock(&n->list_lock);
@@ -1297,7 +1328,7 @@ out:
/*
* Get a page from somewhere. Search in increasing NUMA distances.
*/
-static struct page *get_any_partial(struct kmem_cache *s, gfp_t flags)
+static struct page *get_any_partial_or_empty(struct kmem_cache *s, gfp_t flags)
{
#ifdef CONFIG_NUMA
struct zonelist *zonelist;
@@ -1336,7 +1367,7 @@ static struct page *get_any_partial(struct kmem_cache *s, gfp_t flags)

if (n && cpuset_zone_allowed_hardwall(zone, flags) &&
n->nr_partial > n->min_partial) {
- page = get_partial_node(n);
+ page = get_partial_or_empty_node(n);
if (page)
return page;
}
@@ -1346,18 +1377,19 @@ static struct page *get_any_partial(struct kmem_cache *s, gfp_t flags)
}

/*
- * Get a partial page, lock it and return it.
+ * Get a partial or empty page, lock it and return it.
*/
-static struct page *get_partial(struct kmem_cache *s, gfp_t flags, int node)
+static struct page *
+get_partial_or_empty(struct kmem_cache *s, gfp_t flags, int node)
{
struct page *page;
int searchnode = (node == -1) ? numa_node_id() : node;

- page = get_partial_node(get_node(s, searchnode));
+ page = get_partial_or_empty_node(get_node(s, searchnode));
if (page || (flags & __GFP_THISNODE))
return page;

- return get_any_partial(s, flags);
+ return get_any_partial_or_empty(s, flags);
}

/*
@@ -1403,7 +1435,7 @@ static void unfreeze_slab(struct kmem_cache *s, struct page *page, int tail)
} else {
slab_unlock(page);
stat(get_cpu_slab(s, raw_smp_processor_id()), FREE_SLAB);
- discard_slab(s, page);
+ discard_or_cache_slab(s, page);
}
}
}
@@ -1542,7 +1574,7 @@ another_slab:
deactivate_slab(s, c);

new_slab:
- new = get_partial(s, gfpflags, node);
+ new = get_partial_or_empty(s, gfpflags, node);
if (new) {
c->page = new;
stat(c, ALLOC_FROM_PARTIAL);
@@ -1693,7 +1725,7 @@ slab_empty:
}
slab_unlock(page);
stat(c, FREE_SLAB);
- discard_slab(s, page);
+ discard_or_cache_slab(s, page);
return;

debug:
@@ -1927,6 +1959,8 @@ static void init_kmem_cache_cpu(struct kmem_cache *s,
static void
init_kmem_cache_node(struct kmem_cache_node *n, struct kmem_cache *s)
{
+ spin_lock_init(&n->list_lock);
+
n->nr_partial = 0;

/*
@@ -1939,8 +1973,18 @@ init_kmem_cache_node(struct kmem_cache_node *n, struct kmem_cache *s)
else if (n->min_partial > MAX_PARTIAL)
n->min_partial = MAX_PARTIAL;

- spin_lock_init(&n->list_lock);
INIT_LIST_HEAD(&n->partial);
+
+ n->nr_empty = 0;
+ /*
+ * XXX: This needs to take object size into account. We don't need
+ * empty slabs for caches which will have plenty of partial slabs
+ * available. Only caches that have either full or empty slabs need
+ * this kind of optimization.
+ */
+ n->max_empty = MAX_EMPTY;
+ INIT_LIST_HEAD(&n->empty);
+
#ifdef CONFIG_SLUB_DEBUG
atomic_long_set(&n->nr_slabs, 0);
atomic_long_set(&n->total_objects, 0);
@@ -2427,6 +2471,32 @@ static void free_partial(struct kmem_cache *s, struct kmem_cache_node *n)
spin_unlock_irqrestore(&n->list_lock, flags);
}

+static void free_empty_slabs(struct kmem_cache *s)
+{
+ int node;
+
+ for_each_node_state(node, N_NORMAL_MEMORY) {
+ struct kmem_cache_node *n;
+ struct page *page, *t;
+ unsigned long flags;
+
+ n = get_node(s, node);
+
+ if (!n->nr_empty)
+ continue;
+
+ spin_lock_irqsave(&n->list_lock, flags);
+
+ list_for_each_entry_safe(page, t, &n->empty, lru) {
+ list_del(&page->lru);
+ n->nr_empty--;
+
+ free_slab(s, page);
+ }
+ spin_unlock_irqrestore(&n->list_lock, flags);
+ }
+}
+
/*
* Release all resources used by a slab cache.
*/
@@ -2436,6 +2506,8 @@ static inline int kmem_cache_close(struct kmem_cache *s)

flush_all(s);

+ free_empty_slabs(s);
+
/* Attempt to free all objects */
free_kmem_cache_cpus(s);
for_each_node_state(node, N_NORMAL_MEMORY) {
@@ -2765,6 +2837,7 @@ int kmem_cache_shrink(struct kmem_cache *s)
return -ENOMEM;

flush_all(s);
+ free_empty_slabs(s);
for_each_node_state(node, N_NORMAL_MEMORY) {
n = get_node(s, node);

2009-01-23 15:31:49

[permalink] [raw]

Subject: Re: Mainline kernel OLTP performance update

2009-01-23 15:58:00

[permalink] [raw]

Subject: Re: Mainline kernel OLTP performance update

On Fri, 23 Jan 2009, Pekka Enberg wrote:

> Looking at __slab_free(), unless page->inuse is constantly zero and we
> discard the slab, it really is just cache effects (10% sounds like a
> lot, though!). AFAICT, the only way to optimize that is with Christoph's
> unfinished pointer freelists patches or with a remote free list like in
> SLQB.

No there is another way. Increase the allocator order to 3 for the
kmalloc-8192 slab then multiple 8k blocks can be allocated from one of the
larger chunks of data gotten from the page allocator. That will allow slub
to do fast allocs.

2009-01-23 16:01:56

[permalink] [raw]

Subject: Re: Mainline kernel OLTP performance update

On Fri, 23 Jan 2009, Pekka Enberg wrote:
> > I wonder why that doesn't happen already, actually. The slub_max_order
> > know is capped to PAGE_ALLOC_COSTLY_ORDER ("3") by default and obviously
> > order 3 should be as good fit as order 2 so 'fraction' can't be too high
> > either. Hmm.

On Fri, 2009-01-23 at 10:55 -0500, Christoph Lameter wrote:
> The kmalloc-8192 is new. Look at slabinfo output to see what allocation
> orders are chosen.

Yes, yes, I know the new cache a result of my patch. I'm just saying
that AFAICT, the existing logic should set the order to 3 but IIRC
Yanmin said it's 2.

Pekka

2009-01-23 16:57:37

[permalink] [raw]

Subject: Re: Mainline kernel OLTP performance update

On Fri, 23 Jan 2009, Pekka Enberg wrote:

> I wonder why that doesn't happen already, actually. The slub_max_order
> know is capped to PAGE_ALLOC_COSTLY_ORDER ("3") by default and obviously
> order 3 should be as good fit as order 2 so 'fraction' can't be too high
> either. Hmm.

The kmalloc-8192 is new. Look at slabinfo output to see what allocation
orders are chosen.

2009-01-23 18:41:10

[permalink] [raw]

Subject: care and feeding of netperf (Re: Mainline kernel OLTP performance update)

> 3) ./start_netperf_udp_v4.sh 8 #Assume your machine has 8 logical cpus.

Some comments on the script:

> #!/bin/sh
>
> PROG_DIR=/home/ymzhang/test/netperf/src
> date=`date +%H%M%N`
> #PROG_DIR=/root/netperf/netperf/src
> client_num=$1
> pin_cpu=$2
>
> start_port_server=12384
> start_port_client=15888
>
> killall netserver
> ${PROG_DIR}/netserver
> sleep 2

Any particular reason for killing-off the netserver daemon?

> if [ ! -d result ]; then
> mkdir result
> fi
>
> all_result_files=""
> for i in `seq 1 ${client_num}`; do
> if [ "${pin_cpu}" == "pin" ]; then
> pin_param="-T ${i} ${i}"

The -T option takes arguments of the form:

N - bind both netperf and netserver to core N
N, - bind only netperf to core N, float netserver
,M - float netperf, bind only netserver to core M
N,M - bind netperf to core N and netserver to core M

Without a comma between N and M knuth only knows what the command line parser
will do :)

> fi
> result_file=result/netperf_${start_port_client}.${date}
> #./netperf -t UDP_STREAM -l 60 -H 127.0.0.1 -- -P 15895 12391 -s 32768 -S 32768 -m 4096
> #./netperf -t UDP_STREAM -l 60 -H 127.0.0.1 -i 50 3 -I 99 5 -- -P 12384 12888 -s 32768 -S 32768 -m 4096
> #${PROG_DIR}/netperf -p ${port_num} -t TCP_RR -l 60 -H 127.0.0.1 ${pin_param} -- -r 1,1 >${result_file} &
> ${PROG_DIR}/netperf -t UDP_STREAM -l 60 -H 127.0.0.1 ${pin_param} -- -P ${start_port_client} ${start_port_server} -s 32768 -S 32768 -m 4096 >${result_file} &

Same thing here for the -P option - there needs to be a comma between the two
port numbers otherwise, the best case is that the second port number is ignored.
Worst case is that netperf starts doing knuth only knows what.

To get quick profiles, that form of aggregate netperf is OK - just the one
iteration with background processes using a moderatly long run time. However,
for result reporting, it is best to (ab)use the confidence intervals
functionality to try to avoid skew errors. I tend to add-in a global -i 30
option to get each netperf to repeat its measurments 30 times. That way one is
reasonably confident that skew issues are minimized.

http://www.netperf.org/svn/netperf2/trunk/doc/netperf.html#Using-Netperf-to-Measure-Aggregate-Performance

And I would probably add the -c and -C options to have netperf report service
demands.

> sub_pid="${sub_pid} `echo $!`"
> port_num=$((${port_num}+1))
> all_result_files="${all_result_files} ${result_file}"
> start_port_server=$((${start_port_server}+1))
> start_port_client=$((${start_port_client}+1))
> done;
>
> wait ${sub_pid}
> killall netserver
>
> result="0"
> for i in `echo ${all_result_files}`; do
> sub_result=`awk '/Throughput/ {getline; getline; getline; print " "$6}' ${i}`
> result=`echo "${result}+${sub_result}"|bc`
> done;

The documented-only-in-source :( "omni" tests in top-of-trunk netperf:

http://www.netperf.org/svn/netperf2/trunk

./configure --enable-omni

allow one to specify which result values one wants, in which order, either as
more or less traditional netperf output (test-specific -O), CSV (test-specific
-o) or keyval (test-specific -k). All three take an optional filename as an
argument with the file containing a list of desired output values. You can give
a "filename" of '?' to get the list of output values known to that version of
netperf.

Might help simplify parsing and whatnot.

happy benchmarking,

rick jones

>
> echo $result

>

2009-01-23 18:51:41

by Grant Grundler

[permalink] [raw]

Subject: Re: care and feeding of netperf (Re: Mainline kernel OLTP performance update)

On Fri, Jan 23, 2009 at 10:40 AM, Rick Jones <[email protected]> wrote:
...
> And I would probably add the -c and -C options to have netperf report
> service demands.

For performance analysis, the service demand is often more interesting
than the absolute performance (which typically only varies a few Mb/s
for gigE NICs). I strongly encourage adding -c and -C.

grant

2009-01-24 02:56:01

[permalink] [raw]

Subject: Re: Mainline kernel OLTP performance update

On Fri, 2009-01-23 at 10:22 -0500, Christoph Lameter wrote:
> On Fri, 23 Jan 2009, Pekka Enberg wrote:
>
> > Looking at __slab_free(), unless page->inuse is constantly zero and we
> > discard the slab, it really is just cache effects (10% sounds like a
> > lot, though!). AFAICT, the only way to optimize that is with Christoph's
> > unfinished pointer freelists patches or with a remote free list like in
> > SLQB.
>
> No there is another way. Increase the allocator order to 3 for the
> kmalloc-8192 slab then multiple 8k blocks can be allocated from one of the
> larger chunks of data gotten from the page allocator. That will allow slub
> to do fast allocs.
After I change kmalloc-8192/order to 3, the result(pinned netperf UDP-U-4k)
difference between SLUB and SLQB becomes 1% which can be considered as fluctuation.

But when trying to increased it to 4, I got:
[root@lkp-st02-x8664 slab]# echo "3">kmalloc-8192/order
[root@lkp-st02-x8664 slab]# echo "4">kmalloc-8192/order
-bash: echo: write error: Invalid argument

Comparing with SLQB, it seems SLUB needs too many investigation/manual finer-tuning
against specific benchmarks. One hard is to tune page order number. Although SLQB also
has many tuning options, I almost doesn't tune it manually, just run benchmark and
collect results to compare. Does that mean the scalability of SLQB is better?

2009-01-24 03:03:33

[permalink] [raw]

Subject: Re: care and feeding of netperf (Re: Mainline kernel OLTP performance update)

On Fri, 2009-01-23 at 10:40 -0800, Rick Jones wrote:
> > 3) ./start_netperf_udp_v4.sh 8 #Assume your machine has 8 logical cpus.
>
> Some comments on the script:
Thanks. I wanted to run the testing to get result quickly as long as
the result has no big fluctuation.

>
> > #!/bin/sh
> >
> > PROG_DIR=/home/ymzhang/test/netperf/src
> > date=`date +%H%M%N`
> > #PROG_DIR=/root/netperf/netperf/src
> > client_num=$1
> > pin_cpu=$2
> >
> > start_port_server=12384
> > start_port_client=15888
> >
> > killall netserver
> > ${PROG_DIR}/netserver
> > sleep 2
>
> Any particular reason for killing-off the netserver daemon?
I'm not sure if prior running might leave any impact on later running, so
just kill netserver.

>
> > if [ ! -d result ]; then
> > mkdir result
> > fi
> >
> > all_result_files=""
> > for i in `seq 1 ${client_num}`; do
> > if [ "${pin_cpu}" == "pin" ]; then
> > pin_param="-T ${i} ${i}"
>
> The -T option takes arguments of the form:
>
> N - bind both netperf and netserver to core N
> N, - bind only netperf to core N, float netserver
> ,M - float netperf, bind only netserver to core M
> N,M - bind netperf to core N and netserver to core M
>
> Without a comma between N and M knuth only knows what the command line parser
> will do :)
>
> > fi
> > result_file=result/netperf_${start_port_client}.${date}
> > #./netperf -t UDP_STREAM -l 60 -H 127.0.0.1 -- -P 15895 12391 -s 32768 -S 32768 -m 4096
> > #./netperf -t UDP_STREAM -l 60 -H 127.0.0.1 -i 50 3 -I 99 5 -- -P 12384 12888 -s 32768 -S 32768 -m 4096
> > #${PROG_DIR}/netperf -p ${port_num} -t TCP_RR -l 60 -H 127.0.0.1 ${pin_param} -- -r 1,1 >${result_file} &
> > ${PROG_DIR}/netperf -t UDP_STREAM -l 60 -H 127.0.0.1 ${pin_param} -- -P ${start_port_client} ${start_port_server} -s 32768 -S 32768 -m 4096 >${result_file} &
>
> Same thing here for the -P option - there needs to be a comma between the two
> port numbers otherwise, the best case is that the second port number is ignored.
> Worst case is that netperf starts doing knuth only knows what.
Thanks.

>
>
> To get quick profiles, that form of aggregate netperf is OK - just the one
> iteration with background processes using a moderatly long run time. However,
> for result reporting, it is best to (ab)use the confidence intervals
> functionality to try to avoid skew errors.
Yes. My formal testing uses -i 50. I just wanted a quick testing. If I need
finer-tuning or investigation, I would turn on more options.

> I tend to add-in a global -i 30
> option to get each netperf to repeat its measurments 30 times. That way one is
> reasonably confident that skew issues are minimized.
>
> http://www.netperf.org/svn/netperf2/trunk/doc/netperf.html#Using-Netperf-to-Measure-Aggregate-Performance
>
> And I would probably add the -c and -C options to have netperf report service
> demands.
Yes. That's good. I'm used to start vmstat or mpstat to monitor cpu utilization
in real time.

>
>
> > sub_pid="${sub_pid} `echo $!`"
> > port_num=$((${port_num}+1))
> > all_result_files="${all_result_files} ${result_file}"
> > start_port_server=$((${start_port_server}+1))
> > start_port_client=$((${start_port_client}+1))
> > done;
> >
> > wait ${sub_pid}
> > killall netserver
> >
> > result="0"
> > for i in `echo ${all_result_files}`; do
> > sub_result=`awk '/Throughput/ {getline; getline; getline; print " "$6}' ${i}`
> > result=`echo "${result}+${sub_result}"|bc`
> > done;
>
> The documented-only-in-source :( "omni" tests in top-of-trunk netperf:
>
> http://www.netperf.org/svn/netperf2/trunk
>
> ./configure --enable-omni
>
> allow one to specify which result values one wants, in which order, either as
> more or less traditional netperf output (test-specific -O), CSV (test-specific
> -o) or keyval (test-specific -k). All three take an optional filename as an
> argument with the file containing a list of desired output values. You can give
> a "filename" of '?' to get the list of output values known to that version of
> netperf.
>
> Might help simplify parsing and whatnot.
Yes, it does.

>
> happy benchmarking,
>
> rick jones
Thanks again. I learned a lot.

>
> >
> > echo $result
>
> >

2009-01-24 07:37:46

[permalink] [raw]

Subject: Re: Mainline kernel OLTP performance update

On Fri, 2009-01-23 at 10:22 -0500, Christoph Lameter wrote:
>> No there is another way. Increase the allocator order to 3 for the
>> kmalloc-8192 slab then multiple 8k blocks can be allocated from one of the
>> larger chunks of data gotten from the page allocator. That will allow slub
>> to do fast allocs.

On Sat, Jan 24, 2009 at 4:55 AM, Zhang, Yanmin
<[email protected]> wrote:
> After I change kmalloc-8192/order to 3, the result(pinned netperf UDP-U-4k)
> difference between SLUB and SLQB becomes 1% which can be considered as fluctuation.

Great. We should fix calculate_order() to be order 3 for kmalloc-8192.
Are you interested in doing that?

On Sat, Jan 24, 2009 at 4:55 AM, Zhang, Yanmin
<[email protected]> wrote:
> But when trying to increased it to 4, I got:
> [root@lkp-st02-x8664 slab]# echo "3">kmalloc-8192/order
> [root@lkp-st02-x8664 slab]# echo "4">kmalloc-8192/order
> -bash: echo: write error: Invalid argument

That's probably because max order is capped to 3. You can change that
by passing slub_max_order=<n> as kernel parameter.

On Sat, Jan 24, 2009 at 4:55 AM, Zhang, Yanmin
<[email protected]> wrote:
> Comparing with SLQB, it seems SLUB needs too many investigation/manual finer-tuning
> against specific benchmarks. One hard is to tune page order number. Although SLQB also
> has many tuning options, I almost doesn't tune it manually, just run benchmark and
> collect results to compare. Does that mean the scalability of SLQB is better?

One thing is sure, SLUB seems to be hard to tune. Probably because
it's dependent on the page order so much.

2009-01-26 17:57:58

[permalink] [raw]

Subject: Re: Mainline kernel OLTP performance update

On Sat, 24 Jan 2009, Zhang, Yanmin wrote:

> But when trying to increased it to 4, I got:
> [root@lkp-st02-x8664 slab]# echo "3">kmalloc-8192/order
> [root@lkp-st02-x8664 slab]# echo "4">kmalloc-8192/order
> -bash: echo: write error: Invalid argument

This is because 4 is more than the maximum allowed order. You can
reconfigure that by setting

slub_max_order=5

or so on boot.

2009-01-26 18:26:39

[permalink] [raw]

Subject: Re: care and feeding of netperf (Re: Mainline kernel OLTP performance update)

>>To get quick profiles, that form of aggregate netperf is OK - just the one
>>iteration with background processes using a moderatly long run time. However,
>>for result reporting, it is best to (ab)use the confidence intervals
>>functionality to try to avoid skew errors.
>
> Yes. My formal testing uses -i 50. I just wanted a quick testing. If I need
> finer-tuning or investigation, I would turn on more options.

Netperf will silently clip that to 30 as that is all the built-in tables know.

> Thanks again. I learned a lot.

Feel free to wander over to netperf-talk over at netperf.org if you want to talk
some more about the care and feeding of netperf.

happy benchmarking,

rick jones

2009-01-27 08:30:16

by Jens Axboe

[permalink] [raw]

Subject: Re: Mainline kernel OLTP performance update

On Mon, Jan 26 2009, Chilukuri, Harita wrote:
> Jens, we did test the patch that disables the whole stats. We get 0.5% gain with this patch on 2.6.29-rc2 comparing to 2.6.29-rc2-disbale_part_stats
>
> Below is the description of the result:
>
> Linux OLTP Performance summary
> Kernel# Speedup(x) Intr/s CtxSw/s us% sys% idle% iowait%
> 2.6.29-rc2-disbale_partition_stats 1.000 30413 42582 74 25 0 0
> 2.6.29-rc2-disable_all 1.005 30401 42656 74 25 0 0
>
> Server configurations:
> Intel Xeon Quad-core 2.0GHz 2 cpus/8 cores/8 threads
> 64GB memory, 3 qle2462 FC HBA, 450 spindles (30 logical units)

OK, so about the same, which means the lookup is likely the expensive
bit. I have merged this patch:

http://git.kernel.dk/?p=linux-2.6-block.git;a=commit;h=e5b74b703da41fab060adc335a0b98fa5a5ea61d

which exposes an 'iostats' toggle that allows users to disable disk
statistics completely.

--
Jens Axboe

2009-02-01 02:53:24