2002-09-19 17:10:11

by Bond, Andrew

Subject: TPC-C benchmark used standard RH kernel

I believe I need to clarify my earlier posting about kernel features that gave the benchmark a boost. The kernel that we used in the benchmark was an unmodified Red Hat Advanced Server 2.1 kernel. We did tune the kernel via standard user space tuning, but the kernel was not patched. HP, Red Hat, and Oracle have worked closely together to make sure that the features I mentioned were in the Advanced Server kernel "out of the box."
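
As a rough illustration of the kind of user space tuning meant here, the
settings below are the sort of thing reachable through /proc on a 2.4 kernel
without patching or rebuilding anything; the values are made up for the
example, not the ones used in the benchmark:

    # Shared memory and semaphore limits sized for a large Oracle SGA
    echo 3000000000 > /proc/sys/kernel/shmmax
    echo 250 32000 100 128 > /proc/sys/kernel/sem
    # Raise the system-wide file handle limit for a few thousand users
    echo 65536 > /proc/sys/fs/file-max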

Could we have gotten better performance by patching the kernel? Sure. There are many new features in 2.5 that would enhance database performance. However, the fairly strict support requirements of TPC benchmarking mean that we need to benchmark a kernel that a Linux distributor ships and can support.

Modifications could also be taken to the extreme: we could have built a screamer kernel that runs Oracle TPC-C and nothing else. However, that doesn't really tell us anything useful, and it doesn't help customers thinking about running Linux. The question also becomes "Who would provide customer support for that kernel?"

Regards,
Andy


2002-09-19 18:10:29

by Dave Hansen

Subject: Re: TPC-C benchmark used standard RH kernel

Bond, Andrew wrote:
> I believe I need to clarify my earlier posting about kernel features that
> gave the benchmark a boost. The kernel that we used in the benchmark was an
> unmodified Red Hat Advanced Server 2.1 kernel. We did tune the kernel via
> standard user space tuning, but the kernel was not patched. HP, Red Hat, and
> Oracle have worked closely together to make sure that the features I
> mentioned were in the Advanced Server kernel "out of the box."

Have you done much profiling of that kernel? I'm sure a lot of people would be
very interested to see even readprofile results from a piece of the cluster
during a TPC run.
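
For anyone who hasn't used it, a rough sketch of how a profile like that is
usually gathered; the kernel must have been booted with the profile= option,
and the System.map path is just an assumption:

    # Boot with e.g. profile=2 on the kernel command line, then as root:
    readprofile -r                                  # clear the tick counters
    # ... run the benchmark at steady state for a while ...
    readprofile -v -m /boot/System.map | sort -nr -k 3 | head -75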

--
Dave Hansen
[email protected]

2002-09-19 19:02:31

by Martin J. Bligh

Subject: Re: TPC-C benchmark used standard RH kernel

> Could we have gotten better performance by patching the kernel? Sure. There are many new features in 2.5 that would enhance database performance. However, the fairly strict support requirements of TPC benchmarking mean that we need to benchmark a kernel that a Linux distributor ships and can support.
> Modifications could also be taken to the extreme, and we could have built a screamer kernel that runs Oracle TPC-C's and nothing else. However, that doesn't really tell us anything useful and doesn't help those customers thinking about running Linux. The question also becomes "Who would provide customer support for that kernel?"

Unofficial results for 2.5 vs 2.4 (or 2.4-redhatAS) would be most
interesting if you're able to gather them, and still have the
machine. Most times you can avoid their draconian rules by saying
"on a large benchmark test that I can't name but you all know what
it is ..." instead of naming it ... ;-)

M.

2002-09-19 19:10:53

by Bond, Andrew

Subject: RE: TPC-C benchmark used standard RH kernel

I am actually moving in that direction. I don't know if I will be able to use the same setup or not, but I will post once I get some data. I can post that I saw X% delta going from 2.4 to 2.5. I can't help it if anyone extrapolates data from there ;-)

Andy

> -----Original Message-----
> From: Martin J. Bligh [mailto:[email protected]]
> Sent: Thursday, September 19, 2002 3:06 PM
> To: Bond, Andrew; [email protected]
> Subject: Re: TPC-C benchmark used standard RH kernel
>
>
> > Could we have gotten better performance by patching the kernel? Sure.
> > There are many new features in 2.5 that would enhance database
> > performance. However, the fairly strict support requirements of TPC
> > benchmarking mean that we need to benchmark a kernel that a Linux
> > distributor ships and can support.
> > Modifications could also be taken to the extreme, and we could have
> > built a screamer kernel that runs Oracle TPC-C's and nothing else.
> > However, that doesn't really tell us anything useful and doesn't help
> > those customers thinking about running Linux. The question also
> > becomes "Who would provide customer support for that kernel?"
>
> Unofficial results for 2.5 vs 2.4 (or 2.4-redhatAS) would be most
> interesting if you're able to gather them, and still have the
> machine. Most times you can avoid their draconian rules by saying
> "on a large benchmark test that I can't name but you all know what
> it is ..." instead of naming it ... ;-)
>
> M.
>
>

2002-09-19 19:22:29

by Bond, Andrew

Subject: RE: TPC-C benchmark used standard RH kernel

This isn't as recent as I would like, but it will give you an idea. Top 75 from readprofile. This run was not using bigpages though.

Andy

00000000 total 7872 0.0066
c0105400 default_idle 1367 21.3594
c012ea20 find_vma_prev 462 2.2212
c0142840 create_bounce 378 1.1250
c0142540 bounce_end_io_read 332 0.9881
c0197740 __make_request 256 0.1290
c012af20 zap_page_range 231 0.1739
c012e9a0 find_vma 214 1.6719
c012e780 avl_rebalance 160 0.4762
c0118d80 schedule 157 0.1609
c010ba50 do_gettimeofday 145 1.0069
c0130c30 __find_lock_page 144 0.4500
c0119150 __wake_up 142 0.9861
c01497c0 end_buffer_io_kiobuf_async 140 0.6250
c0113020 flush_tlb_mm 128 1.0000
c0168000 proc_pid_stat 125 0.2003
c012d010 do_no_page 125 0.2056
c0107128 system_call 107 1.9107
c01488d0 end_buffer_io_kiobuf 91 0.9479
c0142760 alloc_bounce_bh 91 0.4062
c01489a0 brw_kiovec 90 0.1250
c011fe90 sys_gettimeofday 89 0.5563
c012edd0 do_munmap 88 0.1310
c0142690 alloc_bounce_page 87 0.4183
c01402b0 shmem_getpage 84 0.3281
c01498a0 brw_kvec_async 77 0.0859
c0142490 bounce_end_io_write 72 0.4091
c012f3d0 __insert_vm_struct 70 0.1823
c012e1d0 do_mmap_pgoff 66 0.0581
c0137cb0 kmem_cache_alloc 59 0.2169
c012e090 lock_vma_mappings 57 1.1875
c0137ba0 free_block 52 0.1912
c012d450 handle_mm_fault 51 0.1226
c013cb50 __free_pages 50 1.5625
c0198020 submit_bh 47 0.3672
c0147260 set_bh_page 46 0.7188
c0145730 fget 45 0.7031
c012b450 mm_follow_page 45 0.1339
c01182d0 try_to_wake_up 45 0.1278
c0107380 page_fault 45 3.7500
c0199730 elevator_linus_merge 44 0.1375
c0137f20 kmem_cache_free 44 0.3438
c0178b50 try_atomic_semop 39 0.1283
c0183000 batch_entropy_store 37 0.1927
c0197440 account_io_start 36 0.4500
c0181f30 rw_raw_dev 32 0.0444
c0137aa0 kmem_cache_alloc_batch 32 0.1250
c012e8d0 avl_remove 32 0.1538
c0162490 sys_io_submit 30 0.0193
c0161d00 aio_complete 29 0.1133
c0117870 do_page_fault 29 0.0224
c0196fd0 generic_plug_device 28 0.2917
c012d800 mm_map_user_kvec 28 0.0312
c0107171 restore_all 28 1.2174
c011a4a0 remove_wait_queue 26 0.4062
c0197410 disk_round_stats 25 0.5208
c0148930 wait_kio 25 0.2232
c012a980 pte_alloc_map 25 0.0473
c0227668 csum_partial_copy_generic 24 0.0968
c01283f0 getrusage 24 0.0326
c0144ce0 sys_pread 22 0.0688
c013c3d0 rmqueue 22 0.0255
c0118690 load_balance 22 0.0241
c01973a0 locate_hd_struct 21 0.1875
c010c0e0 sys_mmap2 21 0.1313
c015bfd0 end_kio_request 20 0.2083
c012a800 __free_pte 20 0.1562
c0123ea0 add_timer 20 0.0735
c01ef250 ip_queue_xmit 19 0.0154
c0197490 account_io_end 19 0.2375
c015c300 kiobuf_wait_for_io 19 0.1080
c012b5c0 map_user_kiobuf 19 0.0247
c0161e00 aio_read_evt 18 0.1875
c012db80 unmap_kvec 18 0.1607
c010c8b0 restore_fpu 18 0.5625
c01f4130 tcp_sendmsg 17 0.0036

> -----Original Message-----
> From: Dave Hansen [mailto:[email protected]]
> Sent: Thursday, September 19, 2002 2:15 PM
> To: Bond, Andrew
> Cc: [email protected]
> Subject: Re: TPC-C benchmark used standard RH kernel
>
>
> Bond, Andrew wrote:
> > I believe I need to clarify my earlier posting about kernel features
> > that gave the benchmark a boost. The kernel that we used in the
> > benchmark was an unmodified Red Hat Advanced Server 2.1 kernel. We did
> > tune the kernel via standard user space tuning, but the kernel was not
> > patched. HP, Red Hat, and Oracle have worked closely together to make
> > sure that the features I mentioned were in the Advanced Server kernel
> > "out of the box."
>
> Have you done much profiling of that kernel? I'm sure a lot
> of people would be
> very interested to see even readprofile results from a piece
> of the cluster
> during a TPC run.
>
> --
> Dave Hansen
> [email protected]
>
>

2002-09-19 20:36:23

by Dave Hansen

Subject: Re: TPC-C benchmark used standard RH kernel

Bond, Andrew wrote:
> This isn't as recent as I would like, but it will give you an idea.
> Top 75 from readprofile. This run was not using bigpages though.
>
> 00000000 total 7872 0.0066
> c0105400 default_idle 1367 21.3594
> c012ea20 find_vma_prev 462 2.2212
> c0142840 create_bounce 378 1.1250
> c0142540 bounce_end_io_read 332 0.9881
> c0197740 __make_request 256 0.1290
> c012af20 zap_page_range 231 0.1739
> c012e9a0 find_vma 214 1.6719
> c012e780 avl_rebalance 160 0.4762
> c0118d80 schedule 157 0.1609
> c010ba50 do_gettimeofday 145 1.0069
> c0130c30 __find_lock_page 144 0.4500
> c0119150 __wake_up 142 0.9861
> c01497c0 end_buffer_io_kiobuf_async 140 0.6250
> c0113020 flush_tlb_mm 128 1.0000
> c0168000 proc_pid_stat 125 0.2003

Forgive my complete ignorance about TPC-C... Why do you have so much
idle time? Are you I/O bound? (With that many disks, I sure hope not
:) ) Or is it as simple as leaving profiling running for a bit before
or after the benchmark was run?

Earlier, I got a little over-excited because I was thinking that the
machines under test were 8-ways, but it looks like the DL580 is a
4x PIII-Xeon and you have 8 of them. I know you haven't published it,
but do you do any testing on 8-ways?

For most of our work (Specweb, dbench, plain kernel compiles), the
kernel tends to blow up a lot worse at 8 CPUs than 4. It really dies
on the 32-way NUMA-Qs, but that's a whole other story...

--
Dave Hansen
[email protected]

2002-09-19 21:13:27

by Bond, Andrew

Subject: RE: TPC-C benchmark used standard RH kernel

> -----Original Message-----
> From: Dave Hansen [mailto:[email protected]]
> Sent: Thursday, September 19, 2002 4:41 PM
> To: Bond, Andrew
> Cc: [email protected]
> Subject: Re: TPC-C benchmark used standard RH kernel
>
>
> Bond, Andrew wrote:
> > This isn't as recent as I would like, but it will give you an idea.
> > Top 75 from readprofile. This run was not using bigpages though.
> >
> > 00000000 total 7872 0.0066
> > c0105400 default_idle 1367 21.3594
> > c012ea20 find_vma_prev 462 2.2212
> > c0142840 create_bounce 378 1.1250
> > c0142540 bounce_end_io_read 332 0.9881
> > c0197740 __make_request 256 0.1290
> > c012af20 zap_page_range 231 0.1739
> > c012e9a0 find_vma 214 1.6719
> > c012e780 avl_rebalance 160 0.4762
> > c0118d80 schedule 157 0.1609
> > c010ba50 do_gettimeofday 145 1.0069
> > c0130c30 __find_lock_page 144 0.4500
> > c0119150 __wake_up 142 0.9861
> > c01497c0 end_buffer_io_kiobuf_async 140 0.6250
> > c0113020 flush_tlb_mm 128 1.0000
> > c0168000 proc_pid_stat 125 0.2003
>
> Forgive my complete ignorane about TPC-C... Why do you have so much
> idle time? Are you I/O bound? (with that many disks, I sure hope not
> :) ) Or is it as simple as leaving profiling running for a bit before
> or after the benchmark was run?
>

We were never able to run the system at 100%. This run looks like it may have had more idle time than normal. We always had around 5% idle time that we were not able to get rid of by adding more user load, so we were definitely hitting a bottleneck somewhere. Initial attempts at identifying that bottleneck yielded no results, so we ended up living with it for the benchmark, intending to find the root cause in a post-mortem.
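
A rough way to tell whether idle time like that is really I/O wait is just
to watch the system while the load is applied; iostat -x comes from the
sysstat package, and the five-second interval is arbitrary:

    # Run queue, blocked tasks and block I/O rates
    vmstat 5
    # Per-device utilization and queue depth; a saturated array shows up here
    iostat -x 5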

> Earlier, I got a little over-excited because I thinking that the
> machines under test were 8-ways, but it looks like the DL580 is a
> 4xPIII-Xeon, and you have 8 of them. I know you haven't published it,
> but do you do any testing on 8-ways?
>
> For most of our work (Specweb, dbench, plain kernel compiles), the
> kernel tends to blow up a lot worse at 8 CPUs than 4. It really dies
> on the 32-way NUMA-Qs, but that's a whole other story...
>
> --
> Dave Hansen
> [email protected]
>
>

Don't have any data yet on 8-ways. Our focus for the cluster was 4-ways because those are what HP uses for most Oracle RAC configurations. We had done some testing last year that showed very bad scaling from 4 to 8 CPUs (only around a 10% gain), but that was in the days of 2.4.5. The kernel has come a long way since then, but, like you said, there is more work to do in the 8-way arena.

Are the 8-ways you are talking about 8 full processors, or 4 with Hyperthreading?

Regards,
Andy

2002-09-19 21:44:29

by Dave Hansen

Subject: Re: TPC-C benchmark used standard RH kernel

Bond, Andrew wrote:
> Don't have any data yet on 8-ways. Our focus for the cluster was
> 4-ways because those are what HP uses for most Oracle RAC
> configurations. We had done some testing last year that showed
> very bad scaling from 4 to 8 cpus (only around 10% gain), but that
> was in the days of 2.4.5. The kernel has come a long way from
> then, but like you said there is more work to do in the 8-way
> arena.
>
> Are the 8-way's you are talking about 8 full processors, or 4 with
> Hyperthreading?

The machines I was talking about have 8 full processors. They're only
PIIIs, so we don't even have the Hyperthreading option.

--
Dave Hansen
[email protected]

2002-09-20 17:14:43

by Mike Anderson

Subject: Re: TPC-C benchmark used standard RH kernel


Dave Hansen [[email protected]] wrote:
> Bond, Andrew wrote:
> > This isn't as recent as I would like, but it will give you an idea.
> > Top 75 from readprofile. This run was not using bigpages though.
> >
> > 00000000 total 7872 0.0066
> > c0105400 default_idle 1367 21.3594
> > c012ea20 find_vma_prev 462 2.2212

> > c0142840 create_bounce 378 1.1250
> > c0142540 bounce_end_io_read 332 0.9881

.. snip..
>
> Forgive my complete ignorane about TPC-C... Why do you have so much
> idle time? Are you I/O bound? (with that many disks, I sure hope not
> :) ) Or is it as simple as leaving profiling running for a bit before
> or after the benchmark was run?

The calls to create_bounce and bounce_end_io_read are indications that
some of your IO is being bounced and will not be running at peak
performance.

This is avoided by using the highmem IO changes, which I believe are not
in the standard RH kernel. I don't know whether that would address your
idle time question.
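
A quick way to confirm how much bouncing is going on, reusing the same
readprofile setup (the sleep length and System.map path are arbitrary):

    readprofile -r                        # reset counters (as root)
    sleep 60                              # sample steady-state load
    readprofile -m /boot/System.map | grep -i bounce

If create_bounce and friends keep accumulating ticks under load, IO on that
path is still being bounced.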

-andmike

--
Michael Anderson
[email protected]

2002-09-20 17:26:21

by Jens Axboe

Subject: Re: TPC-C benchmark used standard RH kernel

On Fri, Sep 20 2002, Mike Anderson wrote:
>
> Dave Hansen [[email protected]] wrote:
> > Bond, Andrew wrote:
> > > This isn't as recent as I would like, but it will give you an idea.
> > > Top 75 from readprofile. This run was not using bigpages though.
> > >
> > > 00000000 total 7872 0.0066
> > > c0105400 default_idle 1367 21.3594
> > > c012ea20 find_vma_prev 462 2.2212
>
> > > c0142840 create_bounce 378 1.1250
> > > c0142540 bounce_end_io_read 332 0.9881
>
> .. snip..
> >
> > Forgive my complete ignorane about TPC-C... Why do you have so much
> > idle time? Are you I/O bound? (with that many disks, I sure hope not
> > :) ) Or is it as simple as leaving profiling running for a bit before
> > or after the benchmark was run?
>
> The calls to create_bounce and bounce_end_io_read are indications that
> some of your IO is being bounced and will not be running a peak
> performance.
>
> This is avoided by using the highmem IO changes which I believe are not
> in the standard RH kernel. Unknown if that would address your idle time
> question.

They benched RHAS IIRC, and that has the block-highmem patch. They also
had more than 4GB of memory, so, alas, there is bouncing. Block-highmem
doesn't work with all hardware and all drivers.
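
An easy way to see how much of a box's RAM is highmem, i.e. the memory that
becomes a candidate for bouncing when a driver can't reach it directly
(these /proc/meminfo fields are present on 2.4 highmem kernels):

    grep -E 'MemTotal|HighTotal|HighFree' /proc/meminfo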

--
Jens Axboe

2002-09-20 17:38:04

by Bond, Andrew

Subject: RE: TPC-C benchmark used standard RH kernel


> -----Original Message-----
> From: Mike Anderson [mailto:[email protected]]
> Sent: Friday, September 20, 2002 1:21 PM
> To: Dave Hansen
> Cc: Bond, Andrew; [email protected]
> Subject: Re: TPC-C benchmark used standard RH kernel
>
>
>
> Dave Hansen [[email protected]] wrote:
> > Bond, Andrew wrote:
> > > This isn't as recent as I would like, but it will give you an idea.
> > > Top 75 from readprofile. This run was not using bigpages though.
> > >
> > > 00000000 total 7872 0.0066
> > > c0105400 default_idle 1367 21.3594
> > > c012ea20 find_vma_prev 462 2.2212
>
> > > c0142840 create_bounce 378 1.1250
> > > c0142540 bounce_end_io_read 332 0.9881
>
> .. snip..
> >
> > Forgive my complete ignorane about TPC-C... Why do you have so much
> > idle time? Are you I/O bound? (with that many disks, I sure hope not
> > :) ) Or is it as simple as leaving profiling running for a bit before
> > or after the benchmark was run?
>
> The calls to create_bounce and bounce_end_io_read are indications that
> some of your IO is being bounced and will not be running a peak
> performance.
>
> This is avoided by using the highmem IO changes which I believe are not
> in the standard RH kernel. Unknown if that would address your idle time
> question.
>
> -andmike
>
> --
> Michael Anderson
> [email protected]
>
>

Yes, bounce buffers were definitely a problem and could be contributing to our idle time issues. Highmem IO is in the RH Advanced Server kernel. Our problem was that 64-bit DMA for SCSI devices wasn't working, so all our IO to memory above 4GB still required bounce buffers, since the Qlogic controllers present their drives through the SCSI layer. So we weren't paying the full penalty we would have paid without any highmem support, but since 3/4 of our memory was above 4GB it was still a heavy penalty.

If we had used block device IO with our HP cciss driver, we could have done 64-bit DMA in the RHAS 2.1 environment. However, we needed shared storage capability for the cluster. Hence, a fibre channel HBA.

The problem was more of a "what can we support in this benchmark timeframe" issue rather than a technical one. The technical problems are already solved.

Regards,
Andy

2002-09-20 17:59:29

by Mike Anderson

Subject: Re: TPC-C benchmark used standard RH kernel

Jens Axboe [[email protected]] wrote:
> On Fri, Sep 20 2002, Mike Anderson wrote:
>
> They benched RHAS iirc, and that has the block-highmem patch. They also
> had more than 4GB of memory, alas, there is bouncing. That doesn't work
> on all hardware, and all drivers.

Yes, I have seen that. Normally a lot of these greater-than-4GB
interfaces are keyed on BITS_PER_LONG. We have passed a few changes on
to adapter driver maintainers to also activate these interfaces when
CONFIG_HIGHMEM64G is set. This has helped on these 32-bit systems with
more than 4GB of memory.
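
A trivial check on a Red Hat box that the running kernel was actually built
with the 64GB highmem option those driver changes key on (the config file
path is the usual Red Hat location):

    grep CONFIG_HIGHMEM /boot/config-$(uname -r)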

What driver does the FCA 2214 use?

-andmike
--
Michael Anderson
[email protected]

2002-09-20 20:59:36

by William Lee Irwin III

Subject: Re: TPC-C benchmark used standard RH kernel

On Thu, Sep 19, 2002 at 02:27:22PM -0500, Bond, Andrew wrote:
> This isn't as recent as I would like, but it will give you an idea.
> Top 75 from readprofile. This run was not using bigpages though.
> Andy

> 00000000 total 7872 0.0066
> c0105400 default_idle 1367 21.3594
> c012ea20 find_vma_prev 462 2.2212
> c0142840 create_bounce 378 1.1250
> c0142540 bounce_end_io_read 332 0.9881
> c0197740 __make_request 256 0.1290
> c012af20 zap_page_range 231 0.1739
> c012e9a0 find_vma 214 1.6719
> c012e780 avl_rebalance 160 0.4762

Looks like you're doing a lot of mmapping or faulting requiring VMA
lookups, or the number of VMAs associated with a task makes the
various VMA manipulations extremely expensive.

Can you dump /proc/pid/maps on some of these processes?
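
A hypothetical one-liner that summarizes the same data, counting VMAs per
Oracle server process (the pgrep pattern is a guess):

    for pid in $(pgrep -f oracle); do
        echo "$pid $(wc -l < /proc/$pid/maps) vmas"
    done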

Thanks,
Bill

2002-09-20 22:06:53

by Martin J. Bligh

Subject: Re: TPC-C benchmark used standard RH kernel

>> This isn't as recent as I would like, but it will give you an idea.
>> Top 75 from readprofile. This run was not using bigpages though.
>> Andy
>
>> 00000000 total 7872 0.0066
>> c0105400 default_idle 1367 21.3594
>> c012ea20 find_vma_prev 462 2.2212
>> c0142840 create_bounce 378 1.1250
>> c0142540 bounce_end_io_read 332 0.9881
>> c0197740 __make_request 256 0.1290
>> c012af20 zap_page_range 231 0.1739
>> c012e9a0 find_vma 214 1.6719
>> c012e780 avl_rebalance 160 0.4762
>
> Looks like you're doing a lot of mmapping or faulting requiring VMA
> lookups, or the number of VMA's associated with a task makes the
> various VMA manipulations extremely expensive.
>
> Can you dump /proc/pid/maps on some of these processes?

Isn't that the magic Oracle 32Kb mmap hack at work here, in order
to get a >2Gb SGA?
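
One hypothetical way to check that from the maps Bill asked for is to group
the mappings by backing object and see whether thousands of small SGA
windows dominate the VMA count:

    awk '{print $6}' /proc/<pid>/maps | sort | uniq -c | sort -nr | head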

M.