I am trying to get very fast disk drive performance and I am seeing
some interesting bottlenecks. We are trying to get 800 MB/sec or more
(yes, that is megabytes per second). We are currently using
PCI-Express with a 16 drive raid card (SATA drives). We have achieved
that speed, but only through the SG (SCSI generic) driver. This is
running the stock 2.6.10 kernel. And the device is not mounted as a
file system. I also set the read-ahead size on the device to 16K
sectors (which speeds things up a lot):
blockdev --setra 16384 /dev/sdb
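(The --setra argument is in 512-byte sectors, so 16384 gives an 8 MB
read-ahead window; blockdev --getra /dev/sdb reads the current value
back.)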
So here are the results:
$ time dd if=/dev/sdb of=/dev/null bs=64k count=1000000
1000000+0 records in
1000000+0 records out
0.27user 86.19system 2:40.68elapsed 53%CPU (0avgtext+0avgdata
0maxresident)k
0inputs+0outputs (0major+177minor)pagefaults 0swaps
64k * 1000000 / 160.68 = 398.3 MB/sec
Using sg_dd just to make sure it works the same:
$ time sg_dd if=/dev/sdb of=/dev/null bs=64k count=1000000
1000000+0 records in
1000000+0 records out
0.05user 144.27system 2:41.55elapsed 89%CPU (0avgtext+0avgdata
0maxresident)k
0inputs+0outputs (17major+5375minor)pagefaults 0swaps
Pretty much the same speed. Now using the SG device (sg1 is tied to
sdb):
$ time sg_dd if=/dev/sg1 of=/dev/null bs=64k count=1000000
Reducing read to 16 blocks per loop
1000000+0 records in
1000000+0 records out
0.22user 66.21system 1:10.23elapsed 94%CPU (0avgtext+0avgdata
0maxresident)k
0inputs+0outputs (0major+2327minor)pagefaults 0swaps
64k * 1000000 / 70.23 = 911.3 MB/sec
Now that's more like the speeds we expected. I understand that the
SG device uses direct I/O and/or mmap memory from the kernel. What I
cannot believe is that there is that much overhead in going through the
page buffer/cache system in Linux.
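From reading the sg docs, the mmap-ed zero-copy path looks roughly like
the sketch below. Error handling is mostly trimmed, and the READ(10)
command, 512-byte sectors, and the 64 KB transfer size are illustrative
assumptions, not necessarily what sg_dd does internally:

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <scsi/sg.h>

int main(void)
{
    int fd = open("/dev/sg1", O_RDWR);   /* read-write needed for mmap */
    int len = 64 * 1024;

    /* grow sg's reserved buffer so the whole transfer fits in it */
    ioctl(fd, SG_SET_RESERVED_SIZE, &len);

    /* map the reserved buffer; the HBA DMAs into these pages, so no
       copy_to_user() is needed when the command completes */
    void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (buf == MAP_FAILED) { perror("mmap"); return 1; }

    /* READ(10): 128 blocks (64 KB at 512 bytes/sector) from LBA 0 */
    unsigned char cdb[10] = { 0x28, 0, 0, 0, 0, 0, 0, 0, 128, 0 };
    unsigned char sense[32];

    struct sg_io_hdr io;
    memset(&io, 0, sizeof(io));
    io.interface_id = 'S';
    io.cmdp = cdb;
    io.cmd_len = sizeof(cdb);
    io.dxfer_direction = SG_DXFER_FROM_DEV;
    io.dxfer_len = len;               /* data lands in the mapped buffer */
    io.flags = SG_FLAG_MMAP_IO;
    io.sbp = sense;
    io.mx_sb_len = sizeof(sense);
    io.timeout = 20000;               /* milliseconds */

    if (ioctl(fd, SG_IO, &io) < 0)
        perror("SG_IO");
    /* the first 64 KB of the device is now readable at buf */
    return 0;
}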
We also tried going through a file system (JFS, XFS, ReiserFS, ext3);
they all seem to bottleneck at around 400 MB/sec, much like /dev/sdb
does. We also have a "real" SCSI raid system, which bottlenecks right
at 400 MB/sec as well. Under Windows XP both of these systems run at
650 (SCSI) or 800 (SATA) MB/sec.
Other variations I've tried: setting the read-ahead to larger or
smaller values (1K, 2K, 4K, 8K, 16K, 32K, 64K sectors); 8K or 16K
seems to be optimal. I also tried different block sizes in the dd
command (again 1k through 64k); 16k, 32k, and 64k are pretty much
identical and fastest.
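A sweep along these lines reproduces the read-ahead test (a sketch;
the count is scaled down so each pass stays short), and a similar loop
over bs= covers the block sizes:

for ra in 1024 2048 4096 8192 16384 32768 65536; do
    blockdev --setra $ra /dev/sdb
    time dd if=/dev/sdb of=/dev/null bs=64k count=100000
done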
Below is an oprofile (truncated) of (the same) dd running on /dev/sdb.
So is the overhead really that high? Hopefully there's a bottleneck
in there that no one has come across yet, and it can be optimized.
Anyone else trying to pull close to 1GB/sec from disk? :) The kernel
has changed a lot since the last time I really worked with it (2.2), so
any suggestions are appreciated.
Ian Godin
Senior Software Developer
DTS/Lowry Digital Images
-----------
CPU: P4 / Xeon, speed 3402.13 MHz (estimated)
Counted GLOBAL_POWER_EVENTS events (time during which processor is not
stopped) with a unit mask of 0x01 (mandatory) count 100000
samples % symbol name
848185 8.3510 __copy_to_user_ll
772172 7.6026 do_anonymous_page
701579 6.9076 _spin_lock_irq
579024 5.7009 __copy_user_intel
361634 3.5606 _spin_lock
343018 3.3773 _spin_lock_irqsave
307462 3.0272 kmap_atomic
193327 1.9035 page_fault
181040 1.7825 schedule
174502 1.7181 radix_tree_delete
158967 1.5652 end_buffer_async_read
124124 1.2221 free_hot_cold_page
119057 1.1722 sysenter_past_esp
117384 1.1557 shrink_list
112762 1.1102 buffered_rmqueue
105490 1.0386 smp_call_function
101568 1.0000 kmem_cache_alloc
97404 0.9590 kmem_cache_free
95826 0.9435 __rmqueue
95443 0.9397 __copy_from_user_ll
93181 0.9174 free_pages_bulk
92732 0.9130 release_pages
86912 0.8557 shrink_cache
85896 0.8457 block_read_full_page
79629 0.7840 free_block
78304 0.7710 mempool_free
72264 0.7115 create_empty_buffers
71303 0.7020 do_syslog
70769 0.6968 emit_log_char
66413 0.6539 mark_offset_tsc
64333 0.6334 vprintk
63468 0.6249 file_read_actor
63292 0.6232 add_to_page_cache
62281 0.6132 unlock_page
61655 0.6070 _spin_unlock_irqrestore
59486 0.5857 find_get_page
58901 0.5799 drop_buffers
58775 0.5787 do_generic_mapping_read
55070 0.5422 __wake_up_bit
48681 0.4793 __end_that_request_first
47121 0.4639 bad_range
47102 0.4638 submit_bh
45009 0.4431 journal_add_journal_head
41270 0.4063 __alloc_pages
41247 0.4061 page_waitqueue
39520 0.3891 generic_file_buffered_write
38520 0.3793 __pagevec_lru_add
38142 0.3755 do_select
38105 0.3752 do_mpage_readpage
37020 0.3645 vsnprintf
36541 0.3598 __clear_page_buffers
35932 0.3538 journal_put_journal_head
35769 0.3522 radix_tree_lookup
35636 0.3509 bio_put
34904 0.3437 jfs_get_blocks
34865 0.3433 mark_page_accessed
33686 0.3317 bio_alloc
33273 0.3276 __switch_to
32928 0.3242 __generic_file_aio_write_nolock
32781 0.3228 mempool_alloc
31457 0.3097 number
31160 0.3068 do_get_write_access
30922 0.3045 sys_select
30201 0.2974 blk_rq_map_sg
29488 0.2903 smp_call_function_interrupt
28815 0.2837 do_page_fault
27654 0.2723 radix_tree_insert
27650 0.2722 __block_prepare_write
27185 0.2677 cache_alloc_refill
26910 0.2649 page_remove_rmap
26378 0.2597 handle_mm_fault
26192 0.2579 invalidate_complete_page
25762 0.2536 page_add_anon_rmap
25580 0.2519 max_block
25497 0.2510 do_gettimeofday
In article <[email protected]> you wrote:
> Below is an oprofile (truncated) of (the same) dd running on /dev/sdb.
Do you also have the oprofile of the sg_dd run handy?
Greetings
Bernd
On Wed, Feb 02 2005, Ian Godin wrote:
>
> I am trying to get very fast disk drive performance and I am seeing
> some interesting bottlenecks. We are trying to get 800 MB/sec or more
> (yes, that is megabytes per second). We are currently using
> PCI-Express with a 16 drive raid card (SATA drives). We have achieved
> that speed, but only through the SG (SCSI generic) driver. This is
> running the stock 2.6.10 kernel. And the device is not mounted as a
> file system. I also set the read-ahead size on the device to 16K
> sectors (which speeds things up a lot):
>
> blockdev --setra 16384 /dev/sdb
>
> So here are the results:
>
> $ time dd if=/dev/sdb of=/dev/null bs=64k count=1000000
> 1000000+0 records in
> 1000000+0 records out
> 0.27user 86.19system 2:40.68elapsed 53%CPU (0avgtext+0avgdata
> 0maxresident)k
> 0inputs+0outputs (0major+177minor)pagefaults 0swaps
>
> 64k * 1000000 / 160.68 = 398.3 MB/sec
>
> Using sg_dd just to make sure it works the same:
>
> $ time sg_dd if=/dev/sdb of=/dev/null bs=64k count=1000000
> 1000000+0 records in
> 1000000+0 records out
> 0.05user 144.27system 2:41.55elapsed 89%CPU (0avgtext+0avgdata
> 0maxresident)k
> 0inputs+0outputs (17major+5375minor)pagefaults 0swaps
>
> Pretty much the same speed. Now using the SG device (sg1 is tied to
> sdb):
>
> $ time sg_dd if=/dev/sg1 of=/dev/null bs=64k count=1000000
> Reducing read to 16 blocks per loop
> 1000000+0 records in
> 1000000+0 records out
> 0.22user 66.21system 1:10.23elapsed 94%CPU (0avgtext+0avgdata
> 0maxresident)k
> 0inputs+0outputs (0major+2327minor)pagefaults 0swaps
>
> 64k * 1000000 / 70.23 = 911.3 MB/sec
>
> Now that's more like the speeds we expected. I understand that the
> SG device uses direct I/O and/or mmap memory from the kernel. What I
> cannot believe is that there is that much overhead in going through the
> page buffer/cache system in Linux.
It's not going through the page cache that is the problem; it's the
copying to user space. Have you tried using O_DIRECT? What kind of
speeds are you getting with that?
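Something along these lines is enough for a quick O_DIRECT test (a
minimal sketch; it assumes 4096-byte alignment satisfies the device's
O_DIRECT requirements, and the device path and count just mirror the
dd run above):

#define _GNU_SOURCE             /* for O_DIRECT */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/dev/sdb", O_RDONLY | O_DIRECT);
    if (fd < 0) { perror("open"); return 1; }

    size_t bs = 64 * 1024;
    void *buf;
    /* O_DIRECT requires the user buffer to be block-aligned */
    if (posix_memalign(&buf, 4096, bs) != 0) return 1;

    long long total = 0;
    long i;
    for (i = 0; i < 1000000; i++) {   /* same data volume as the dd run */
        ssize_t n = read(fd, buf, bs);
        if (n <= 0) break;
        total += n;
    }
    printf("read %lld bytes\n", total);
    free(buf);
    close(fd);
    return 0;
}

That reads the device with no page-cache copy at all, so it should show
what the array can really sustain.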
--
Jens Axboe
On Feb 2, 2005, at 7:56 PM, Bernd Eckenfels wrote:
> In article <[email protected]> you
> wrote:
>> Below is an oprofile (truncated) of (the same) dd running on
>> /dev/sdb.
>
> Do you also have the oprofile of the sg_dd run handy?
>
> Greetings
> Bernd
>
Just ran it on the sg_dd (using /dev/sg1):
CPU: P4 / Xeon, speed 3402.13 MHz (estimated)
Counted GLOBAL_POWER_EVENTS events (time during which processor is not
stopped) with a unit mask of 0x01 (mandatory) count 100000
samples % symbol name
2145136 89.7474 __copy_to_user_ll
64720 2.7077 lock_kernel
17883 0.7482 mark_offset_tsc
16212 0.6783 page_address
8091 0.3385 schedule
5380 0.2251 timer_interrupt
5314 0.2223 _spin_lock
5311 0.2222 mwait_idle
4034 0.1688 sysenter_past_esp
3569 0.1493 do_anonymous_page
3530 0.1477 apic_timer_interrupt
3368 0.1409 _spin_lock_irqsave
3150 0.1318 sg_read_xfer
3043 0.1273 kmem_cache_alloc
2590 0.1084 find_busiest_group
2553 0.1068 scheduler_tick
2319 0.0970 __copy_from_user_ll
2299 0.0962 sg_ioctl
2131 0.0892 irq_entries_start
1905 0.0797 sys_ioctl
1773 0.0742 copy_page_range
1678 0.0702 fget
1648 0.0689 __switch_to
1648 0.0689 scsi_block_when_processing_errors
1632 0.0683 sg_start_req
1569 0.0656 increment_tail
1511 0.0632 try_to_wake_up
1506 0.0630 update_one_process
1454 0.0608 fput
1397 0.0584 do_gettimeofday
1396 0.0584 zap_pte_range
1371 0.0574 recalc_task_prio
1357 0.0568 sched_clock
1352 0.0566 sg_link_reserve
1282 0.0536 __elv_add_request
1229 0.0514 kmap_atomic
1195 0.0500 page_fault
Oddly enough, it appears to also be copying to user space
(__copy_to_user_ll is almost 90% of the samples)...
Ian.
On Feb 3, 2005, at 9:40 AM, Nuno Silva wrote:
> Ian Godin wrote:
>> I am trying to get very fast disk drive performance and I am seeing
>> some interesting bottlenecks. We are trying to get 800 MB/sec or
>> more (yes, that is megabytes per second). We are currently using
>> PCI-Express with a 16 drive raid card (SATA drives). We have
>> achieved that speed, but only through the SG (SCSI generic) driver.
>> This is running the stock 2.6.10 kernel. And the device is not
>> mounted as a file system. I also set the read-ahead size on the
>> device to 16K sectors (which speeds things up a lot):
>
> I was trying to reproduce but got distracted by this:
> (use page down, if you just want to see the odd result)
>
> puma:/tmp/dd# sg_map
> /dev/sg0 /dev/sda
> /dev/sg1 /dev/sdb
> /dev/sg2 /dev/scd0
> /dev/sg3 /dev/sdc
> puma:/tmp/dd# time sg_dd if=/dev/sg1 of=/tmp/dd/sg1 bs=64k count=1000
> Reducing read to 64 blocks per loop
> 1000+0 records in
> 1000+0 records out
>
> real 0m0.187s
> user 0m0.001s
> sys 0m0.141s
> puma:/tmp/dd# time dd if=/dev/sdb of=/tmp/dd/sdb bs=64k count=1000
> 1000+0 records in
> 1000+0 records out
> 65536000 bytes transferred in 1.203468 seconds (54455956 bytes/sec)
>
> real 0m1.219s
> user 0m0.001s
> sys 0m0.138s
> puma:/tmp/dd# ls -l
> total 128000
> -rw-r--r-- 1 root root 65536000 Feb 3 17:16 sdb
> -rw-r--r-- 1 root root 65536000 Feb 3 17:16 sg1
> puma:/tmp/dd# md5sum *
> ec31224970ddd3fb74501c8e68327e7b sdb
> 60d4689227d60e6122f1ffe0ec1b2ad7 sg1
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>
> See? dd from sdb is not the same as sg1! Is this supposed to happen?
>
> About the 900MB/sec:
> This same sg1 (= sdb, which is a single Hitachi SATA drive) performs
> like this:
>
> puma:/tmp/dd# time sg_dd if=/dev/sg1 of=/dev/null bs=64k count=1000000
> time=1
> Reducing read to 64 blocks per loop
> time to transfer data was 69.784784 secs, 939.12 MB/sec
> 1000000+0 records in
> 1000000+0 records out
>
> real 1m9.787s
> user 0m0.063s
> sys 0m58.115s
>
> I can assure you that this drive can't do more than 60MB/sec sustained.
>
> My only conclusion is that sg (or sg_dd) is broken? ;)
>
> Peace,
> Nuno Silva
>
I've definitely been able to repeat that here, so the SG driver does
appear to be broken. At least I'm glad I am not going insane; I was
starting to wonder :)
I'll run some more tests with O_DIRECT and such things, see if I can
figure out what the REAL max speed is.
Thanks for the help everyone,
Ian.
Ian Godin wrote:
>
> I am trying to get very fast disk drive performance and I am seeing
> some interesting bottlenecks. We are trying to get 800 MB/sec or more
> (yes, that is megabytes per second). We are currently using PCI-Express
> with a 16 drive raid card (SATA drives). We have achieved that speed,
> but only through the SG (SCSI generic) driver. This is running the
> stock 2.6.10 kernel. And the device is not mounted as a file system. I
> also set the read-ahead size on the device to 16K sectors (which
> speeds things up a lot):
I was trying to reproduce but got distracted by this:
(use page down, if you just want to see the odd result)
puma:/tmp/dd# sg_map
/dev/sg0 /dev/sda
/dev/sg1 /dev/sdb
/dev/sg2 /dev/scd0
/dev/sg3 /dev/sdc
puma:/tmp/dd# time sg_dd if=/dev/sg1 of=/tmp/dd/sg1 bs=64k count=1000
Reducing read to 64 blocks per loop
1000+0 records in
1000+0 records out
real 0m0.187s
user 0m0.001s
sys 0m0.141s
puma:/tmp/dd# time dd if=/dev/sdb of=/tmp/dd/sdb bs=64k count=1000
1000+0 records in
1000+0 records out
65536000 bytes transferred in 1.203468 seconds (54455956 bytes/sec)
real 0m1.219s
user 0m0.001s
sys 0m0.138s
puma:/tmp/dd# ls -l
total 128000
-rw-r--r-- 1 root root 65536000 Feb 3 17:16 sdb
-rw-r--r-- 1 root root 65536000 Feb 3 17:16 sg1
puma:/tmp/dd# md5sum *
ec31224970ddd3fb74501c8e68327e7b sdb
60d4689227d60e6122f1ffe0ec1b2ad7 sg1
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
See? dd from sdb is not the same as sg1! Is this supposed to happen?
About the 900MB/sec:
This same sg1 (= sdb, which is a single Hitachi SATA drive) performs
like this:
puma:/tmp/dd# time sg_dd if=/dev/sg1 of=/dev/null bs=64k count=1000000
time=1
Reducing read to 64 blocks per loop
time to transfer data was 69.784784 secs, 939.12 MB/sec
1000000+0 records in
1000000+0 records out
real 1m9.787s
user 0m0.063s
sys 0m58.115s
I can assure you that this drive can't do more than 60MB/sec sustained.
My only conclusion is that sg (or sg_dd) is broken? ;)
Peace,
Nuno Silva
Ian Godin wrote:
> [...]
> I've definitely been able to repeat that here, so the SG driver does
> appear to be broken. At least I'm glad I am not going insane; I was
> starting to wonder :)
>
> I'll run some more tests with O_DIRECT and such things, see if I can
> figure out what the REAL max speed is.
FYI there was a patch running around last April that made a new option
for "dd" to make it use O_DIRECT. You can get it here:
http://marc.theaimsgroup.com/?l=linux-kernel&m=108135935629589&w=2
Unfortunately this hasn't made it into coreutils. IIRC there were
issues with dd being multi-platform and with the way O_DIRECT was
implemented on other systems.
Anyway, you can patch dd yourself and have a tool for debugging with
O_DIRECT.
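With the patch applied, the read test from the start of the thread
becomes something like this (assuming the iflag= spelling used by the
version merged into coreutils CVS):

dd if=/dev/sdb of=/dev/null bs=64k count=1000000 iflag=direct

I hope this helps,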
--
Paulo Marques - http://www.grupopie.com
All that is necessary for the triumph of evil is that good men do nothing.
Edmund Burke (1729 - 1797)
Ian Godin <[email protected]> wrote:
>
>
> I am trying to get very fast disk drive performance and I am seeing
> some interesting bottlenecks. We are trying to get 800 MB/sec or more
> (yes, that is megabytes per second). We are currently using
> PCI-Express with a 16 drive raid card (SATA drives). We have achieved
> that speed, but only through the SG (SCSI generic) driver. This is
> running the stock 2.6.10 kernel. And the device is not mounted as a
> file system. I also set the read-ahead size on the device to 16K
> sectors (which speeds things up a lot):
> ...
> samples % symbol name
> 848185 8.3510 __copy_to_user_ll
> 772172 7.6026 do_anonymous_page
> 701579 6.9076 _spin_lock_irq
> 579024 5.7009 __copy_user_intel
> 361634 3.5606 _spin_lock
> 343018 3.3773 _spin_lock_irqsave
> 307462 3.0272 kmap_atomic
> 193327 1.9035 page_fault
Something funny is happening here - it looks like there's plenty of CPU
capacity left over.
It's also odd that you're getting a lot of pagefaults in this test but
not with the sg_dd test. I wonder why dd is getting so many pagefaults?
(I recall that sg_dd did something cheaty, but I forget what it was.)
Could you monitor the CPU load during the various tests? If the `dd'
workload isn't pegging the CPU then it could be that there's something
wrong with the I/O submission patterns.
On Thu, Feb 03, 2005 at 07:03:48PM +0000, Paulo Marques wrote:
> FYI there was a patch running around last April that made a new option
> for "dd" to make it use O_DIRECT. You can get it here:
>
> http://marc.theaimsgroup.com/?l=linux-kernel&m=108135935629589&w=2
>
> Unfortunately this hasn't made it into coreutils.
Follow down the thread and you'll see that it was merged to coreutils
CVS (message-id <[email protected]>), but there apparently
hasn't been a coreutils release since then. Nor has the patch been
added to the Debian package.
-andy
At 08:32 PM 4/02/2005, Andrew Morton wrote:
>Something funny is happening here - it looks like there's plenty of CPU
>capacity left over.
[..]
>Could you monitor the CPU load during the various tests? If the `dd'
>workload isn't pegging the CPU then it could be that there's something
>wrong with the I/O submission patterns.
As an educated guess, I'd say that the workload is running out of
memory bandwidth.

Let's say the RAM is single-channel DDR400. Peak bandwidth = 3.2 GB/s
(400 x 10^6 transfers/sec x 64 bits / 8 bits per byte). It's fair to
say that peak bandwidth is a pretty rare thing to achieve with SDRAM
given real-world access patterns -- let's take a conservative "it'll
be 50% efficient" -- so DDR400's realistic peak = 1.6 GB/s.

As far as memory accesses go, a standard user-space read() from disk
results in four passes over the data (1. DMA from HBA to RAM, 2. the
read side of copy_to_user(), 3. the write side of copy_to_user(),
4. userspace accessing that data).

1.6 GB/s / 4 = 400 MB/s -- or roughly what Ian was seeing.

sg_dd maps a window onto a kernel DMA buffer. As such, two of the four
passes over the data are cut out, leaving (1. DMA from HBA to RAM,
2. userspace accessing the data).

1.6 GB/s / 2 = 800 MB/s -- or roughly what Ian was seeing with sg_dd.

O_DIRECT should achieve numbers similar to sg_dd, though perhaps not
quite as efficient.
cheers,
lincoln.
Lincoln Dale <[email protected]> wrote:
>
> sg_dd maps a window onto a kernel DMA buffer. As such, two of the four
> passes over the data are cut out, leaving (1. DMA from HBA to RAM,
> 2. userspace accessing the data).
> 1.6 GB/s / 2 = 800 MB/s -- or roughly what Ian was seeing with sg_dd.
Right. That's a fancy way of saying "cheating" ;)
But from the oprofile output it appears to me that there is plenty of CPU
capacity left over. Maybe I'm misreading it due to oprofile adding in the
SMP factor (25% CPU on a 4-way means we've exhausted CPU capacity).
> DIRECT_IO should achieve similar numbers to sg_dd, but perhaps not quite as
> efficient.
Probably so.
There are various tools in
http://www.zip.com.au/~akpm/linux/patches/stuff/ext3-tools.tar.gz which are
more useful than dd, btw. `odread' and `odwrite' are useful for this sort
of thing.
Andrew Morton wrote:
> Lincoln Dale <[email protected]> wrote:
>
>>sg_dd maps a window onto a kernel DMA buffer. As such, two of the four
>>passes over the data are cut out, leaving (1. DMA from HBA to RAM,
>>2. userspace accessing the data).
>>1.6 GB/s / 2 = 800 MB/s -- or roughly what Ian was seeing with sg_dd.
>
>
> Right. That's a fancy way of saying "cheating" ;)
>
> But from the oprofile output it appears to me that there is plenty of CPU
> capacity left over. Maybe I'm misreading it due to oprofile adding in the
> SMP factor (25% CPU on a 4-way means we've exhausted CPU capacity).
sg_dd is lying or /dev/sg* is broken. Try that sg_dd test on any
single drive and you'll get 20 times the performance you're supposed
to achieve:
puma:/tmp/dd# time sg_dd if=/dev/sg1 of=/dev/null bs=64k count=1000000
time=1
Reducing read to 64 blocks per loop
time to transfer data was 69.784784 secs, 939.12 MB/sec
1000000+0 records in
1000000+0 records out
This is a single SATA drive. I'm lucky, am I not? ;-)
Regards,
Nuno Silva