2007-02-12 15:22:36

by Martin A. Fink

Subject: SATA-performance: Linux vs. FreeBSD

Dear all,

I did some performance tests that made me really wonder:

My Hardware:
Asus P5LD2 board with Intel i945P chipset, ICH7R southbridge
CPU Intel Core 2 Duo E6300 at 1.86 GHz, 2 MB Cache
1 GB RAM
My Software:
OpenSuSE 10.2 with Linux kernel 2.6.18, x86-64 architecture
FreeBSD 6.2

Test drives:
1. HDD (hard disk drive): Seagate ST3250820AS RPM 7200.9, 8 MB Cache, 250 GB, SATA-II
2. SSD (solid state disk): Adtron AF25FB, 27GB, SATA Revision 1.0a

What I did:
I wrote 1 MB blocks to a file. After every 1 GB I called fsync and recorded
the time. For the filesystem tests I wrote 1 GB files; otherwise I wrote
directly to the raw device.

Results: -1-

Test            OpenSuSE (AHCI)              FreeBSD (AHCI)
----------------------------------------------------------------------------
SSD(vfat 25GB)  41+/-2 MB/s at 4-10% CPU     15+/-0 MB/s at 2% CPU
SSD(raw  25GB)  26+/-1 MB/s at 4-10% CPU     48+/-0 MB/s at 1% CPU
SSD(ext3 25GB)  39+/-5 MB/s at 10-15% CPU    34+/-0 MB/s at 14% CPU
SSD(ext2 25GB)  42+/-1 MB/s at 10-15% CPU    32+/-0 MB/s at 10% CPU
----------------------------------------------------------------------------

Test            OpenSuSE (AHCI off)          FreeBSD (AHCI off)
----------------------------------------------------------------------------
SSD(vfat 25GB)  22+/-4 MB/s at 6-19% CPU     --
SSD(raw  25GB)  33+/-4 MB/s at 7-14% CPU     41+/-0 MB/s at 1% CPU
SSD(ext2 25GB)  27+/-6 MB/s at 6-14% CPU     --
----------------------------------------------------------------------------

Question 1:
Can anybody explain why writing to a SATA-I device with AHCI consumes
so much CPU time on Linux, while it takes almost no CPU time on FreeBSD
6.2? Especially when comparing the values for writing to the raw device?

Question 2:
Can anybody explain why writing to a solid state disk (a kind of memory
that always has the same constant bandwidth) shows such large standard errors
in the write rate on Linux (between 1 and 6 MB/s), while FreeBSD gives an
almost constant write rate (as one would expect for an SSD)?

Question 3:
Why is writing to a raw device on Linux slower than using e.g. ext2? And why
is the Linux write rate much lower (-12.5% in the best case) than the
FreeBSD write rate?

Question 4:
When writing to the SATA-II HDD Linux is around 10% slower than FreeBSD when
using ext3, but around as fast as FreeBSD when writing raw. Why?


How can I improve the speed of Linux?
Thanks for any advice.

Martin

PS: part of my testcode:

    int fd = open(fileName, O_WRONLY | O_CREAT | O_TRUNC, 0666);
    (void)gettimeofday(&start, 0);
    for (long bl = 0; bl < blocksPerGigaByte; ++bl)
        write(fd, block, blockSize);  /* 1 MB per call; return value unchecked */
    fsync(fd);                        /* flush the 1 GB before stopping the clock */
    (void)gettimeofday(&ende, 0);


2007-02-12 16:04:47

by Andi Kleen

Subject: Re: SATA-performance: Linux vs. FreeBSD

"Martin A. Fink" <[email protected]> writes:
>
> What I did:
> I wrote blocks of 1 MB size to file. Each 1 GB I made a fsync and took the
> time. For those tests with filesystems I wrote files of 1 GB size, otherwise
> I just wrote to the raw device.

Newer Linux versions will, depending on the disk and the file system, tell
the disk to flush its buffers to disk on fsync. FreeBSD might or might not
do that, but if it doesn't, that would explain the difference.

>
> Results: -1-
>
> Test OpenSuSE(AHCI) FreeBSD(AHCI)
> ---------------------------------------------------------------------------------------------------------------------------------------
> SSD(vfat 25GB) 41+/-2 MB/s at 4-10% 15+/-0 MB/s at 2% CPU

vfat is certainly not a performance optimized file system.

> SSD(raw 25GB) 26+/-1 MB/s at 4-10% CPU 48+/-0 MB/s at 1% CPU
> SSD(ext3 25GB) 39+/-5 MB/s at 10-15% CPU 34+/-0 MB/s at 14% CPU
> SSD(ext2 25GB) 42+/-1 MB/s at 10-15% CPU 32+/-0 MB/s at 10% CPU


You could use oprofile (http://oprofile.sourceforge.net) to find out
where the CPU is being used.


> ---------------------------------------------------------------------------------------------------------------------------------------
>
> Test OpenSuSE (AHCI off) FreeBSD (AHCI off)
> ---------------------------------------------------------------------------------------------------------------------------------------
> SSD(vfat 25GB) 22+/-4 MB/s at 6-19% CPU --
> SSD(raw 25GB) 33+/-4 MB/s at 7-14% CPU 41+/-0 MB/s at 1% CPU

I remember vaguely (but I might be wrong here) the standard block
character devices on FreeBSD are buffered, while raw is truly
unbuffered on Linux. Naive programs (no optimized IO threads or aio)
on truly unbuffered devices tend to perform poorly because they
don't do any write behind.

It might also be useful if you posted the libata-related parts of your
boot log.
>
> Question 2:
> Can anybody explain to me, why writing to a solid state disk (a kind of memory
> that always has the same constant bandwidth) has such big standard errors in
> writing rate using Linux (between 1 to 6 MB/s error) while FreeBSD gives an
> almost constant writing rate (as one would expect it for a SSD) ?

Could be buffered vs unbuffered. Unbuffered single threaded writes
tend to be quite variable.

> Question 3:
> Why is writing to a raw device in Linux slower than using e.g. ext2 ? And why
> is Linux writing rate much lower (-12.5 % for the best case) compared to
> writing rate of FreeBSD?

It's really hard to make raw IO perform well without complicated
efforts because nobody will hide the IO latencies. That is why
buffered IO is normally recommended.

-Andi

2007-02-12 16:27:19

by Martin A. Fink

Subject: Re: SATA-performance: Linux vs. FreeBSD

On Monday, 12 February 2007 18:04, Andi Kleen wrote:
> "Martin A. Fink" <[email protected]> writes:
> >
> > What I did:
> > I wrote blocks of 1 MB size to file. Each 1 GB I made a fsync and took the
> > time. For those tests with filesystems I wrote files of 1 GB size, otherwise
> > I just wrote to the raw device.
>
> Newer Linux versions depending on the disk and the file system will tell
> the disk to flush the buffers to disk on fsync. FreeBSD might or might not
> do that, but if it doesn't it would explain the difference.

If you call fsync on BSD then you get what you expect: anything that is still
not on disk will be written, and afterwards fsync returns. So this should be
the same as on Linux?!
>
> >
> > Results: -1-
> >
> > Test OpenSuSE(AHCI) FreeBSD(AHCI)
> > ------------------------------------------------------------------------
> > SSD(vfat 25GB) 41+/-2 MB/s at 4-10% 15+/-0 MB/s at 2% CPU
>
> vfat is certainly not a performance optimized file system.
That is just a minor test.
>
> > SSD(raw 25GB) 26+/-1 MB/s at 4-10% CPU 48+/-0 MB/s at 1% CPU

The line above is what makes me wonder!

> > SSD(ext3 25GB) 39+/-5 MB/s at 10-15% CPU 34+/-0 MB/s at 14% CPU
> > SSD(ext2 25GB) 42+/-1 MB/s at 10-15% CPU 32+/-0 MB/s at 10% CPU
>
>
> You could use oprofile (http://oprofile.sourceforge.net) to find out
> where the CPU is being used.
>
>
> > ------------------------------------------------------------------------
> >
> > Test OpenSuSE (AHCI off) FreeBSD (AHCI off)
> > ------------------------------------------------------------------------
> > SSD(vfat 25GB) 22+/-4 MB/s at 6-19% CPU --
> > SSD(raw 25GB) 33+/-4 MB/s at 7-14% CPU 41+/-0 MB/s at 1% CPU
>
> I remember vaguely (but I might be wrong here) the standard block
> character devices on FreeBSD are buffered, while raw is truly
> unbuffered on Linux. Naive programs (no optimized IO threads or aio)
> on truly unbuffered devices tend to perform poorly because they
> don't do any write behind.

But the big question remains: buffered or not, where do the big
variations on Linux come from? I am not writing small blocks. I write
huge amounts of data. So the buffer will always be full. And Linux is even
slower than BSD when it can use a buffer. The maximum performance of Linux is
42 MB/s (buffered) while the maximum performance of BSD is 48 MB/s (buffered
or not, I don't know).
If I use a normal SATA-II disk, there are no differences between BSD and Linux
when writing to the raw device... So it can't be a buffer problem alone.
>
> It might also useful if you post the libata related parts of your
> boot log.

> >
> > Question 2:
> > Can anybody explain to me, why writing to a solid state disk (a kind of
> > memory that always has the same constant bandwidth) has such big standard
> > errors in writing rate using Linux (between 1 to 6 MB/s error) while
> > FreeBSD gives an almost constant writing rate (as one would expect it for
> > a SSD) ?
>
> Could be buffered vs unbuffered. Unbuffered single threaded writes
> tend to be quite variable.
This does not explain the large variation of +/- 5 MB/s when writing with ext3.

I still don't understand the buffer argument. If one writes 25 GB in blocks of
1 MB, the buffer should always be full...
>
> > Question 3:
> > Why is writing to a raw device in Linux slower than using e.g. ext2 ? And
> > why is Linux writing rate much lower (-12.5 % for the best case) compared
> > to writing rate of FreeBSD?
>
> It's really hard to make raw io perform well without complicated
> efforts because nobody will hide the IO latencies. That is why
> buffered IO is normally recommend

Is there a buffered io device that I can use, but that does not use a
filesystem?

>
> -Andi
>

--
Dipl. Physiker
Martin Anton Fink
Max Planck Institute for extraterrestrial Physics
Giessenbachstrasse
85741 Garching
Germany
Tel. +49-(0)89-30000-3645
Fax. +49-(0)89-30000-3569

2007-02-12 16:37:22

by Martin A. Fink

Subject: Re: SATA-performance: Linux vs. FreeBSD

Some more info:

:~> strace -c -T -o trace.out dd if=/dev/zero of=test.txt bs=10MB count=200

200+0 records in
200+0 records out
2000000000 bytes (2.0 GB) copied, 52.8632 seconds, 37.8 MB/s

test.txt:

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 93.26    6.845265       33555       204           write
  6.41    0.470283       11757        40        18 open
  0.32    0.023687         116       205           read
  0.00    0.000149           9        16           mmap2
  0.00    0.000119          40         3           munmap
  0.00    0.000081           3        24           close
  0.00    0.000068           6        11           old_mmap
  0.00    0.000064           3        20           fstat64
  0.00    0.000040           4        10           rt_sigaction
  0.00    0.000036          12         3           madvise
  0.00    0.000014           7         2           clock_gettime
  0.00    0.000010           3         3           brk
  0.00    0.000008           8         1           _sysctl
  0.00    0.000007           7         1         1 access
  0.00    0.000006           6         1           mprotect
  0.00    0.000005           5         1           futex
  0.00    0.000004           4         1           uname
  0.00    0.000004           4         1           _llseek
  0.00    0.000003           3         1           rt_sigprocmask
  0.00    0.000003           3         1           getrlimit
  0.00    0.000003           3         1           set_thread_area
  0.00    0.000003           3         1           set_tid_address
------ ----------- ----------- --------- --------- ----------------
100.00    7.339862                   551        19 total

This means that the CPU is working for only 7.3 of the 52.8 seconds. You can
even hear it: when I run programs whose run time matches the time strace
reports, I have 100% CPU load and the CPU fan starts to blow heavily. In this
case the fan does not do anything. It looks like the SATA driver simply
blocks the CPU while doing whatever...
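
As a rough check of those figures: 7.339862 s out of 52.8632 s is about 14%
of the wall-clock time. The helper below is only a sketch with made-up names.
Note that strace -c by default counts time spent inside system calls, so this
is a share of kernel time, not the total CPU load.

```c
#include <stdio.h>

/*
 * Share of wall-clock time covered by the strace -c total.
 * Returns -1.0 if wall_s is not positive.
 * (Hypothetical helper, not part of the original test program.)
 */
double counted_share(double counted_s, double wall_s)
{
    if (wall_s <= 0.0)
        return -1.0;
    return counted_s / wall_s;
}

/* Print the share for the dd run above, using the two totals it reported. */
void print_dd_share(void)
{
    printf("%.1f%% of wall-clock time\n",
           100.0 * counted_share(7.339862, 52.8632));
}
```

Calling print_dd_share() from main() prints the percentage; the rest of the
run is time the process spent waiting rather than computing.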

2007-02-12 17:41:14

by Andi Kleen

Subject: Re: SATA-performance: Linux vs. FreeBSD

"Martin A. Fink" <[email protected]> writes:

Your mailer seems to be broken. It drops cc.
>
> If you call fsync in BSD then you get what you expect. anything that is still
> not on disk will be written. Afterwards fsync returns... So this should be
> the same like with linux?!

Not necessarily. The disk may buffer additionally. Handling that
differs widely, but modern Linux forces flushes to platter if the hardware
supports it.

> But the big question still is -- buffered or not -- where do the big
> variations within linux come frome? I am not writing small blocks. I write
> huge amounts of data.

1MB is nowhere near huge by modern standards. Many IO subsystems are
only happy with multi MB requests.

> So the buffer will always be full.

Hardly. Especially not if you do a synchronous fsync in between.

> If I use a normal SATA-II disk, there are no differences between BSD and Linux
> when writing to the raw device... So it cant be a buffer-problem alone.

Yes, that is something that needs to be investigated. That is why I suggested
oprofile, if your assertion of more CPU overhead on Linux is true.

> I still don't understand the buffer argument. If one writes 25 GB in blocks of
> 1 MB your buffer should be always full...

Your mental model of an IO subsystem seems to be quite off.
Think about what happens when you fsync and submit synchronously.

It's like sending something down a long pipe and waiting until it arrives
at the bottom and you hear the echo of the impact. Then, and only then, you
send again. There will always be long periods when the pipe is empty.

If you use large enough blocks these gaps will be quite small and
might effectively become unimportant, but 1MB is nowhere near big enough
for that.
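
The pipe picture can be put into rough numbers. This is a minimal sketch,
assuming a very simple model in which every synchronous request is followed
by a fixed idle gap before the next one reaches the device. The function
names and the example figures (48 MB/s device rate, 5 ms gap) are
illustrative assumptions, not measurements.

```c
#include <stdio.h>

/*
 * Effective throughput of strictly synchronous writes under a simple model
 * (an assumption, not a measured property of any driver): each request of
 * request_mb megabytes is transferred at device_mb_s, then the device sits
 * idle for gap_s seconds until the next request arrives.
 */
double effective_mb_per_s(double request_mb, double device_mb_s, double gap_s)
{
    double transfer_s = request_mb / device_mb_s;  /* time spent moving data */
    return request_mb / (transfer_s + gap_s);      /* data per total time */
}

/* Print the model for a few request sizes; all inputs are hypothetical. */
void print_gap_model(void)
{
    const double sizes_mb[] = { 0.008, 0.1, 1.0, 16.0 };
    for (int i = 0; i < 4; ++i)
        printf("%8.3f MB requests -> %5.1f MB/s\n",
               sizes_mb[i], effective_mb_per_s(sizes_mb[i], 48.0, 0.005));
}
```

In this model the gap matters less as the request grows, because the fixed
idle time is spread over more data. How large the gap really is on a given
controller and drive has to be measured.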

> Is there a buffered io device that I can use, but that does not use a
> filesystem?

/dev/sdX*. However it has some other issues that also don't make
it ideal. File systems are usually best.

-Andi

2007-02-12 17:42:45

by Martin A. Fink

Subject: Re: SATA-performance: Linux vs. FreeBSD

System Details:

dmesg: (parts)

Bootdata ok (command line is root=/dev/sda7 vga=0x31a resume=/dev/sda5
splash=silent)
Linux version 2.6.18.2-34-default (geeko@buildhost) (gcc version 4.1.2
20061115 (prerelease) (SUSE Linux)) #1 SMP Mon Nov 27 11:46:27 UTC 2006
...
Using ACPI (MADT) for SMP configuration information
...
Intel(R) Core(TM)2 CPU 6300 @ 1.86GHz stepping 06
Brought up 2 CPUs
...
ACPI: Processor [CPU1] (supports 8 throttling states)
ACPI: Processor [CPU2] (supports 8 throttling states)
...
ICH7: IDE controller at PCI slot 0000:00:1f.1
GSI 18 sharing vector 0xD9 and IRQ 18
ACPI: PCI Interrupt 0000:00:1f.1[A] -> GSI 22 (level, low) -> IRQ 217
ICH7: chipset revision 1
ICH7: not 100% native mode: will probe irqs later
ide0: BM-DMA at 0xffa0-0xffa7, BIOS settings: hda:DMA, hdb:pio
ide1: BM-DMA at 0xffa8-0xffaf, BIOS settings: hdc:pio, hdd:pio
Probing IDE interface ide0...
hda: HL-DT-STDVD-RAM GSA-H22N, ATAPI CD/DVD-ROM drive
ide0 at 0x1f0-0x1f7,0x3f6 on irq 14
Probing IDE interface ide1...
libata version 2.00 loaded.
ahci 0000:00:1f.2: version 2.0
GSI 19 sharing vector 0xE1 and IRQ 19
ACPI: PCI Interrupt 0000:00:1f.2[B] -> GSI 23 (level, low) -> IRQ 225
PCI: Setting latency timer of device 0000:00:1f.2 to 64
ahci 0000:00:1f.2: AHCI 0001.0100 32 slots 4 ports 3 Gbps 0xf impl SATA mode
ahci 0000:00:1f.2: flags: 64bit ncq led clo pio slum part
ata1: SATA max UDMA/133 cmd 0xFFFFC20000026D00 ctl 0x0 bmdma 0x0 irq 233
ata2: SATA max UDMA/133 cmd 0xFFFFC20000026D80 ctl 0x0 bmdma 0x0 irq 233
ata3: SATA max UDMA/133 cmd 0xFFFFC20000026E00 ctl 0x0 bmdma 0x0 irq 233
ata4: SATA max UDMA/133 cmd 0xFFFFC20000026E80 ctl 0x0 bmdma 0x0 irq 233
scsi0 : ahci
ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
ata1.00: ATA-7, max UDMA/133, 156301488 sectors: LBA48 NCQ (depth 31/32)
ata1.00: ata1: dev 0 multi count 16
ata1.00: configured for UDMA/133
scsi1 : ahci
ata2: SATA link down (SStatus 0 SControl 300)
scsi2 : ahci
ata3: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
ata3.00: ATA-6, max UDMA/100, 57337056 sectors: LBA
ata3.00: ata3: dev 0 multi count 1
ata3.00: applying bridge limits
ata3.00: configured for UDMA/100
scsi3 : ahci
ata4: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
ata4.00: ATA-7, max UDMA/133, 488397168 sectors: LBA48 NCQ (depth 31/32)
ata4.00: ata4: dev 0 multi count 16
ata4.00: configured for UDMA/133
Vendor: ATA Model: ST380811AS Rev: 3.AA
Losing some ticks... checking if CPU frequency changed.
Type: Direct-Access ANSI SCSI revision: 05
SCSI device sda: 156301488 512-byte hdwr sectors (80026 MB)
sda: Write Protect is off
sda: Mode Sense: 00 3a 00 00
SCSI device sda: drive cache: write back
SCSI device sda: 156301488 512-byte hdwr sectors (80026 MB)
sda: Write Protect is off
sda: Mode Sense: 00 3a 00 00
SCSI device sda: drive cache: write back
sda: sda1 sda2 sda3 sda4 < sda5 sda6 sda7 sda8 >
sda2: <bsd: sda9 sda10 sda11 sda12 sda13 >
sd 0:0:0:0: Attached scsi disk sda
Vendor: ATA Model: Adtron A25FB-28G Rev: BF22
Type: Direct-Access ANSI SCSI revision: 05
SCSI device sdb: 57337056 512-byte hdwr sectors (29357 MB)
sdb: Write Protect is off
sdb: Mode Sense: 00 3a 00 00
SCSI device sdb: drive cache: write through
SCSI device sdb: 57337056 512-byte hdwr sectors (29357 MB)
sdb: Write Protect is off
sdb: Mode Sense: 00 3a 00 00
SCSI device sdb: drive cache: write through
sdb: sdb1
sd 2:0:0:0: Attached scsi disk sdb
Vendor: ATA Model: ST3250820AS Rev: 3.AA
Type: Direct-Access ANSI SCSI revision: 05
SCSI device sdc: 488397168 512-byte hdwr sectors (250059 MB)
sdc: Write Protect is off
sdc: Mode Sense: 00 3a 00 00
SCSI device sdc: drive cache: write back
sd 0:0:0:0: Attached scsi generic sg0 type 0
sd 2:0:0:0: Attached scsi generic sg1 type 0
SCSI device sdc: 488397168 512-byte hdwr sectors (250059 MB)
sdc: Write Protect is off
sdc: Mode Sense: 00 3a 00 00
SCSI device sdc: drive cache: write back
sdc:
sd 3:0:0:0: Attached scsi disk sdc
sd 3:0:0:0: Attached scsi generic sg2 type 0
...


strace output:

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 73.73   49.904049        1947     25627           write
 25.66   17.365062      694602        25           fsync
  0.62    0.416500       59500         7           close
  0.00    0.000000           0         4           read
  0.00    0.000000           0         7           open
  0.00    0.000000           0         5           fstat
  0.00    0.000000           0        16           mmap
  0.00    0.000000           0         7           mprotect
  0.00    0.000000           0         1           munmap
  0.00    0.000000           0         3           brk
  0.00    0.000000           0         1         1 access
  0.00    0.000000           0         1           execve
  0.00    0.000000           0         1           uname
  0.00    0.000000           0         1           arch_prctl
  0.00    0.000000           0         4           fadvise64
------ ----------- ----------- --------- --------- ----------------
100.00   67.685611                 25710         1 total

as a result of $> strace -c -T -o trace.out PTestY

PTestY ran for 12.5 min = 752 seconds.
Thus it was stuck somewhere in the system for around 690 seconds!

2007-02-12 17:56:36

by Martin A. Fink

Subject: Re: SATA-performance: Linux vs. FreeBSD

On Monday, 12 February 2007 19:41, you wrote:
> "Martin A. Fink" <[email protected]> writes:
>
> Your mailer seems to be broken. It drops cc.
> >
> > If you call fsync in BSD then you get what you expect. anything that is
> > still not on disk will be written. Afterwards fsync returns... So this
> > should be the same like with linux?!
>
> Not necessarily. The disk may buffer additionally. Handling that
> differs widely, but modern Linux forces flushes to platter if the hardware
> support it.
>
> > But the big question still is -- buffered or not -- where do the big
> > variations within linux come frome? I am not writing small blocks. I write
> > huge amounts of data.
>
> 1MB is nowhere near huge by modern standards. Many IO subsystems are
> only happy with multi MB requests.
>
> > So the buffer will always be full.
>
> Hardly. Especially not if you do synchronous fsync inbetween.

Well no. I write 1 GB in blocks of 1 MB. After that I call fsync. Then I
process the next Gigabyte...
>
> > If I use a normal SATA-II disk, there are no differences between BSD and
> > Linux when writing to the raw device... So it cant be a buffer-problem
> > alone.
>
> Yes that is something that needs to be investigated. That is why I suggested
> oprofile if your assertation of a more CPU overhead on Linux is true.
>
> > I still don't understand the buffer argument. If one writes 25 GB in
> > blocks of 1 MB your buffer should be always full...
>
> Your mental model of a IO subsystem seems to be quite off.
> Think what happens when you fsync and submit synchronously.

See above for how I do the writing.
>
> It's like sending something down a long pipe and waiting until it arrives
> at the bottom and you hear the echo of the impact. Then only then you send
> again. There will be always long periods when the pipe will be empty.
>
> If you use large enough blocks these gaps will be quite small and
> might effectively become unimportant, but 1MB is nowhere near big enough
> for that.

I tested this: When I write in blocks of 8kB or less the effect you describe
happens. But above 100kB blocksize there is no more increase of speed.

>
> > Is there a buffered io device that I can use, but that does not use a
> > filesystem?
>
> /dev/sdX*. However it has some other issues that also don't make
> it ideal. File systems are usually best.

My experience with filesystems is this: I write some data and the write
function returns almost immediately. So I write again. Sometimes it returns
only after some 100-300 ms. I think this always happens when the buffer is
full and Linux therefore starts writing to disk. After that it again returns
almost immediately, and after a while the same trouble happens again, but
not at regular intervals...

I have to store large amounts of data coming from 2 digital cameras to disk.
Thus I have to write blocks of around 1 MB at 30 to 50 frames per second for
a long period of time. So it is important for me that the hard disk drive is
reliable in the sense of "if it is capable of 50 MB/s then it should operate
at this speed. Constantly."
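
The occasional long write() calls can be made visible by timing every call
individually. The following is only a sketch: the function name is made up,
it is not the test program used for the numbers in this thread, and it
reports just the single worst call.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L   /* for clock_gettime() */
#endif
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

/* Seconds elapsed between two CLOCK_MONOTONIC timestamps. */
static double elapsed_s(const struct timespec *a, const struct timespec *b)
{
    return (double)(b->tv_sec - a->tv_sec)
         + (double)(b->tv_nsec - a->tv_nsec) / 1e9;
}

/*
 * Write `count` blocks of `block_size` bytes to `path` and return the
 * longest time a single write() call took, in seconds (-1.0 on error).
 * Hypothetical helper for illustration only.
 */
double max_write_latency(const char *path, size_t block_size, long count)
{
    char *block = malloc(block_size);
    if (block == NULL)
        return -1.0;
    memset(block, 0xA5, block_size);        /* arbitrary fill pattern */

    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0666);
    if (fd < 0) {
        free(block);
        return -1.0;
    }

    double worst = 0.0;
    for (long i = 0; i < count; ++i) {
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        ssize_t n = write(fd, block, block_size);
        clock_gettime(CLOCK_MONOTONIC, &t1);
        if (n != (ssize_t)block_size) {     /* error or short write */
            worst = -1.0;
            break;
        }
        double dt = elapsed_s(&t0, &t1);
        if (dt > worst)
            worst = dt;
    }

    close(fd);
    free(block);
    return worst;
}
```

For the camera case one would call it with a block size of about 1 MB and
compare the result with the frame interval (20 to 33 ms at 30 to 50 frames
per second).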

>
> -Andi
>

2007-02-12 18:17:08

by Ray Lee

Subject: Re: SATA-performance: Linux vs. FreeBSD

On 2/12/07, Martin A. Fink <[email protected]> wrote:
> On Monday, 12 February 2007 19:41, you wrote:
> I have to store big amounts of data coming from 2 digital cameras to disk.
> Thus I have to write blocks of around 1 MB at 30 to 50 frames per second for
> a long period of time. So it is important for me that the harddisk drive is
> reliable in the sense of "if it is capable of 50 MB/s then it should operate
> at this speed. Constantly."

Ah, here is a misunderstanding, I think. By default, Linux won't start
writing out dirty buffers until something like 40% of memory is used.
This is to help common workloads where many temporary files are
created and destroyed, or even data that gets written then overwritten
shortly after.

If the kernel were to immediately write out that dirty data, it would
be slower than leaving it in memory for those workloads. But since
that isn't best for everyone, there's a parameter that controls that
dirty threshold. Setting that to a lower value will help even out the
writeout, and start it early, just as you seem to be requesting.

Hmm, it may be one of:

/proc/sys/vm/dirty_ratio
/proc/sys/vm/dirty_background_ratio

Try tweaking those to much lower values and see if that helps.
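
Before changing anything it helps to see the current values. A minimal sketch
for reading the two files named above; the helper names are hypothetical, and
it only reads, it does not change any setting.

```c
#include <stdio.h>

/*
 * Read a single integer tunable such as /proc/sys/vm/dirty_ratio.
 * Returns the value, or -1 if the file cannot be opened or parsed.
 */
long read_vm_tunable(const char *path)
{
    FILE *f = fopen(path, "r");
    if (f == NULL)
        return -1;
    long value = -1;
    if (fscanf(f, "%ld", &value) != 1)
        value = -1;
    fclose(f);
    return value;
}

/* Print the two writeback thresholds mentioned above. */
void print_dirty_thresholds(void)
{
    printf("dirty_ratio            = %ld\n",
           read_vm_tunable("/proc/sys/vm/dirty_ratio"));
    printf("dirty_background_ratio = %ld\n",
           read_vm_tunable("/proc/sys/vm/dirty_background_ratio"));
}
```

Writing a new value to these files needs root, and which values suit a
steady 50 MB/s stream has to be found by experiment.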

Ray

2007-02-12 18:19:10

by Stefan Richter

Subject: Re: SATA-performance: Linux vs. FreeBSD

Martin A. Fink wrote:
> This means, that the CPU is only 7.3 of 52.8 seconds working.
...
> It looks like
> the SATA driver simply blocks the CPU while doing whatever...

The system sleeps while waiting for the disk (actually, for the SATA
host port) to be done with its work.

As Andi explained, if the system gives the disk a small task, waits for
the task to be completed, then gives it a next task and so on, latencies
add up and eat into effective bandwidth. Give the disk a whole set of
tasks so that
- it has immediately something new to do when it finished one task,
- deep pipes are not mostly empty due to "bubbles" in the pipe,
- tasks can be reordered to be executed in optimized manner for good
bandwidth utilization (if software/ firmware/ hardware is present
which supports this; e.g. the Linux kernel itself),
etc.
Also make each task large so that the ratio of protocol overhead to net
data payload stays minimal.
--
Stefan Richter
-=====-=-=== --=- -==--
http://arcgraph.de/sr/

2007-02-12 18:56:00

by Alan

Subject: Re: SATA-performance: Linux vs. FreeBSD

On Mon, 12 Feb 2007 18:56:29 +0100
"Martin A. Fink" <[email protected]> wrote:

> I have to store big amounts of data coming from 2 digital cameras to disk.
> Thus I have to write blocks of around 1 MB at 30 to 50 frames per second for
> a long period of time. So it is important for me that the harddisk drive is
> reliable in the sense of "if it is capable of 50 MB/s then it should operate
> at this speed. Constantly."

Hard disks don't do this. They support operations/second based upon
physical and rotational latency constraints, vibration levels, mechanism,
internal layout policy and the need to do housekeeping.

If you have an ATA7 drive with suitable firmware sets you can talk to it
directly via the SG_IO interface and use the streaming feature set which
is quite different to filesystem type operations and lets you ask the
drive to do this sort of stuff - if you can find any general PC firmware
ones that support it anyway.

I'm not sure you'll get 50MB/sec sustained to work although you might
with a good current drive used for nothing else, a linear stream of data
(no seeking and file system overhead), and a non PCI controller (PCI
Express, host chipset bus etc).

If you are using a file system then the more you fsync the more I'd
expect you to see stalling, as you keep draining what's effectively an 8MB
plus pipeline on a modern drive, precisely because fsync gives "hitting
disk" guarantees. You also want to be sure you are not journalling data.

Alan


2007-02-12 20:35:13

by Nigel Cunningham

Subject: Re: SATA-performance: Linux vs. FreeBSD

Hi Alan et al.

On Mon, 2007-02-12 at 19:08 +0000, Alan wrote:
> I'm not sure you'll get 50MB/sec sustained to work although you might
> with a good current drive used for nothing else, a linear stream of data
> (no seeking and file system overhead), and a non PCI controller (PCI
> Express, host chipset bus etc).

That's Suspend2's usage pattern when given a whole partition, so I can
state without reservation you can get maximum throughput under those
circumstances, even with a PCI controller. Swsusp should do about the
same too.

Nigel

2007-02-12 23:32:12

by Matthias Schniedermeyer

Subject: Re: SATA-performance: Linux vs. FreeBSD

Martin A. Fink wrote:
> I have to store big amounts of data coming from 2 digital cameras to disk.
> Thus I have to write blocks of around 1 MB at 30 to 50 frames per second for
> a long period of time. So it is important for me that the harddisk drive is
> reliable in the sense of "if it is capable of 50 MB/s then it should operate
> at this speed. Constantly."

The good old handful of suggestions:

- Use a dedicated disc for the task.
- Use an empty disc so there is no fragmentation.
- Buy a bigger disk, they have high bandwidths.
- Buy a more "specialized" disc.
e.g. Western Digital Raptor X(*), a 150GB, 10-KRPM S-ATA disc.
- Buy several discs and use RAID 0
or alternate between discs when writing.
- use XFS. AFAIK XFS has about the best "large file" and "high
bandwidth" characteristics.
- that with XFS you can preallocate the files doesn't seem relevant in
this case. It's more for the case that you write several files
simultaneously over a longer period of time.
- Write to one large file and separate the individual files later.

if you are sure that you don't get a power-failure:
- Disable Write-Barriers, especially on a logging-filesystem.
- Enable write-caching.
(hdparm doesn't appear to be able to do that with a SATA-disc, but
blktool appears to be able to)
The latter has a good chance of corrupting your filesystem when you do
get a power failure!!!



*:
I don't think you want something from the server line
(SCSI/FibreChannel/...)?
IIRC I read something about the first 100MB/s disc in the 15-KRPM
league.

See you

--
Real Programmers consider "what you see is what you get" to be just as
bad a concept in Text Editors as it is in women. No, the Real Programmer
wants a "you asked for it, you got it" text editor -- complicated,
cryptic, powerful, unforgiving, dangerous.

2007-02-13 09:25:20

by Martin A. Fink

Subject: Re: SATA-performance: Linux vs. FreeBSD

On Tuesday, 13 February 2007 00:31, you wrote:
> Martin A. Fink wrote:
> > I have to store big amounts of data coming from 2 digital cameras to disk.
> > Thus I have to write blocks of around 1 MB at 30 to 50 frames per second
> > for a long period of time. So it is important for me that the harddisk
> > drive is reliable in the sense of "if it is capable of 50 MB/s then it
> > should operate at this speed. Constantly."
>
> The good old handful of suggestions:
>
> - Use a dedicated disc for the task.

I used a dedicated disk for this task. Nothing besides the task itself writes
to it!

> - Use an empty disc so there is no fragmentation.

All tests were performed on an empty disk!

> - Buy a bigger disk, they have high bandwidths.

I have a flash disk whose manufacturer guarantees 48 MB/s, and FreeBSD as
well as Windows reach this value. Only Linux 2.6.18 falls well short of it (42
MB/s).

> - Buy a more "specialized" disc.

see above

> for e.x.: Western Digital Raptor X(*) a 150GB, 10-KRPM S-ATA disc.
> - Buy several discs and use RAID 0
> or alternate between discs when writing.

What I have to build is an application for the International Space Station
(ISS). I am limited in power and space. So if the disk is able to write
a constant 48 MB/s, then the operating system should deliver that!

> - use XFS. AFAIK XFS has about the best "large file" and "high
> bandwidth" characteristics.
> - that with XFS you can preallocate the files doesn't seem relevant in
> this case. It's more for the case that you write several files
> simultaneously over a longer period of time.
> - Write to one large file and separate the individual files later.
>
> if you are sure that you don't get a power-failure:
> - Disable Write-Barriers, especially on a logging-filesystem.
> - Enable write-caching.
> (hdparm doesn't appear to be able to do that with a SATA-disc, but
> blktool appears to be able to)
> The later has a good chance of corrupting your filesystem when you do
> get a power-failure!!!
>
>
>
> *:
> I don't think you want something from the server-line,
> SCSI/FibreChannel/...?
> IIRC i read a something about the first 100MB/s disc with in the 15-KRPM
> league.

Power consumption! See above.
>
> Bis denn
>
The problem is: FreeBSD is fast, but lacks some special drivers. Linux has
all the drivers, but access to the hard disk is unpredictable and thus
unreliable! What can I do??
> --
> Real Programmers consider "what you see is what you get" to be just as
> bad a concept in Text Editors as it is in women. No, the Real Programmer
> wants a "you asked for it, you got it" text editor -- complicated,
> cryptic, powerful, unforgiving, dangerous.
>
>

2007-02-13 09:34:30

by Martin A. Fink

Subject: Re: SATA-performance: Linux vs. FreeBSD

On Monday, 12 February 2007 20:08, you wrote:
> On Mon, 12 Feb 2007 18:56:29 +0100
> "Martin A. Fink" <[email protected]> wrote:
>
> > I have to store big amounts of data coming from 2 digital cameras to disk.
> > Thus I have to write blocks of around 1 MB at 30 to 50 frames per second
> > for a long period of time. So it is important for me that the harddisk
> > drive is reliable in the sense of "if it is capable of 50 MB/s then it
> > should operate at this speed. Constantly."
>
> Hard disks don't do this. They support operations/second based upon
> physical and rotational latency constraints, vibration levels, mechanism,
> internal layout policy and the need to do housekeeping.

Well, they do. The flash disk I have (SATA-I) is capable of 48 MB/s, and this
value is reached over the whole disk size by Windows as well as by FreeBSD.
See my test results in the first post of this thread.
My Seagate Barracuda hard disk drive (SATA-II) starts at 76 MB/s and
decreases linearly to 35 MB/s because it has to write to a
rotating disk. But on a flash disk nothing is rotating...

So what is the difference between SATA-I and SATA-II?
And why is FreeBSD able to write at a constant rate (the complete 25 GB, all
at 48+/-0.1 MB/s) but Linux 2.6.18 is not?

>
> If you have an ATA7 drive with suitable firmware sets you can talk to it
> directly via the SG_IO interface and use the streaming feature set which
> is quite different to filesystem type operations and lets you ask the
> drive to do this sort of stuff - if you can find any general PC firmware
> ones that support it anyway.
>
> I'm not sure you'll get 50MB/sec sustained to work although you might
> with a good current drive used for nothing else, a linear stream of data
> (no seeking and file system overhead), and a non PCI controller (PCI
> Express, host chipset bus etc).

With a dedicated (rotating) SATA-II device, using the first 70% of the disk space,
this is no problem; I have tested it. With the SATA-I device it is a problem only
under Linux 2.6.18.
>
> If you are using a file system then the more you fsync the more I'd
> expect you to see stalling as you keep draining whats effectively an 8MB
> plus pipeline on a modern drive precisely because fsync does "hitting
> disk" guarantees. You also want to be sure you are not journalling data.

That is true. Thus I sync only after every 1 GB of written data. That is
not too often in my eyes...
Journaling of data: you are right, ext2 performs better than ext3.
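
The measurement discussed in this thread (1 MB blocks, one fsync per chunk, throughput per chunk reported as mean +/- spread) can be sketched roughly as follows. This is only an illustration, not the program that produced the numbers above; the names `timed_write` and `summarize` and the small demo sizes are assumptions.

```python
import os
import statistics
import tempfile
import time

BLOCK_SIZE = 1024 * 1024  # 1 MB per write, as in the test described in this thread


def timed_write(path, blocks_per_chunk, chunks):
    """Write `chunks` chunks of `blocks_per_chunk` 1 MB blocks each.

    After every chunk the file is fsync'ed and the elapsed time is recorded.
    Returns the per-chunk throughput in MB/s.
    """
    block = b"\0" * BLOCK_SIZE
    rates = []
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        for _ in range(chunks):
            start = time.perf_counter()
            for _ in range(blocks_per_chunk):
                written = 0
                while written < BLOCK_SIZE:  # handle short writes
                    written += os.write(fd, block[written:])
            os.fsync(fd)
            elapsed = time.perf_counter() - start
            rates.append(blocks_per_chunk / elapsed)
    finally:
        os.close(fd)
    return rates


def summarize(rates):
    """Return (mean, standard deviation) of the per-chunk rates."""
    spread = statistics.stdev(rates) if len(rates) > 1 else 0.0
    return statistics.mean(rates), spread


if __name__ == "__main__":
    # Small demo run in a temporary directory; the test in this thread used
    # 1024 blocks (1 GB) per chunk on the device under test.
    with tempfile.TemporaryDirectory() as tmp:
        rates = timed_write(os.path.join(tmp, "bench.dat"), blocks_per_chunk=8, chunks=4)
        mean, spread = summarize(rates)
        print(f"{mean:.1f} +/- {spread:.1f} MB/s")
```

Pointing `path` at a file on the filesystem under test (or, with care, at a block device) and raising `blocks_per_chunk` to 1024 would approximate the setup described in the first mail.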


Martin
>
> Alan
>
>
>


2007-02-13 10:08:28

by Arjan van de Ven

[permalink] [raw]
Subject: Re: SATA-performance: Linux vs. FreeBSD

> >
> The problem is: FreeBSD is fast, but lacks of some special drivers. Linux has
> all drivers but access to harddisk is unpredictable and thus unreliable!
> What can I do??


There are several tunables you can try:
1) Increase /sys/block/<device>/queue/nr_requests;
   the Linux default is on the low side.
2) Investigate other elevators; cfq is great for interactive use but not
   so great for max throughput. You can switch by echoing "deadline"
   into /sys/block/<device>/queue/scheduler.
3) Make sure ext3 is mounted with "data=writeback"; the default journalling
   mode is very strict, which is fine for smallish files but starts to hurt
   for multi-gigabyte ones.
4) Use iostat -x /dev/<foo> 1 to see what the avgrq-sz and avgqu-sz values
   are; avgrq-sz should be at least several hundred, if not more.
5) Echo a larger value into /sys/block/<device>/queue/max_sectors_kb;
   the default seems to be 512, which is really low. The hardware maximum is
   in another file in that directory (max_hw_sectors_kb); if you want max
   throughput, set max_sectors_kb to the hardware maximum. (You pay in terms
   of fairness for this; it's the eternal fairness/latency versus throughput
   tradeoff.)
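
Before changing any of the queue tunables listed above, it can help to read the current values and see what would change. The following is a minimal sketch under stated assumptions: the helper names are hypothetical, the target values (`deadline`, 512 requests) are illustrative rather than recommendations, and nothing is written to sysfs.

```python
import os
import tempfile


def read_queue_settings(device, sysfs_root="/sys/block"):
    """Read the request-queue tunables for one block device from sysfs."""
    queue_dir = os.path.join(sysfs_root, device, "queue")
    settings = {}
    for name in ("nr_requests", "max_sectors_kb", "max_hw_sectors_kb", "scheduler"):
        with open(os.path.join(queue_dir, name)) as f:
            settings[name] = f.read().strip()
    return settings


def active_scheduler(scheduler_line):
    """The kernel marks the active elevator with brackets, e.g. 'noop [cfq]'."""
    for word in scheduler_line.split():
        if word.startswith("[") and word.endswith("]"):
            return word[1:-1]
    return None


def plan_tuning(settings, scheduler="deadline", nr_requests=512):
    """Return the (file, value) pairs one would write; nothing is written here.

    The default targets are illustrative assumptions, not recommendations.
    """
    plan = []
    if active_scheduler(settings["scheduler"]) != scheduler:
        plan.append(("scheduler", scheduler))
    if int(settings["nr_requests"]) < nr_requests:
        plan.append(("nr_requests", str(nr_requests)))
    if int(settings["max_sectors_kb"]) < int(settings["max_hw_sectors_kb"]):
        plan.append(("max_sectors_kb", settings["max_hw_sectors_kb"]))
    return plan


if __name__ == "__main__":
    # Demo against a fake sysfs tree so the sketch runs without root access.
    with tempfile.TemporaryDirectory() as root:
        queue = os.path.join(root, "sdX", "queue")
        os.makedirs(queue)
        fake = {
            "nr_requests": "128",
            "max_sectors_kb": "512",
            "max_hw_sectors_kb": "32767",
            "scheduler": "noop anticipatory deadline [cfq]",
        }
        for name, value in fake.items():
            with open(os.path.join(queue, name), "w") as f:
                f.write(value + "\n")
        for name, value in plan_tuning(read_queue_settings("sdX", sysfs_root=root)):
            print(f"echo {value} > /sys/block/sdX/queue/{name}")
```

Calling `read_queue_settings("sda")` with the default `sysfs_root` reads the real values on a Linux system; the values in the demo tree are made up.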



2007-02-13 10:17:06

by Matthias Schniedermeyer

[permalink] [raw]
Subject: Re: SATA-performance: Linux vs. FreeBSD

Martin A. Fink wrote:
> On Tuesday, 13 February 2007 00:31, you wrote:
>> Martin A. Fink wrote:
>>> I have to store big amounts of data coming from 2 digital cameras to disk.
>>> Thus I have to write blocks of around 1 MB at 30 to 50 frames per second for
>>> a long period of time. So it is important for me that the harddisk drive is
>>> reliable in the sense of "if it is capable of 50 MB/s then it should operate
>>> at this speed. Constantly."
>> The good old handful of suggestions:
>>
>> - Use a dedicated disc for the task.
>
> I used a dedicated disk for this task. No one else besides the task is writing
> to it!

OK.

>> - Use an empty disc so there is no fragmentation.
>
> All tests were performed on empty disk!

OK.

>> - Buy a bigger disk, they have high bandwidths.
>
> I have a flash disk from a manufacturer who grants me 48 MB/s. And FreeBSD as
> well as Windows reach this value. Only Linux 2.6.18 is far away from it (42
> MB/s)

Even 48MB/s is quite low.
I've reached up to 70MB/s with a single 500GB Seagate model, and even my older HDDs all reach 60MB/s (at least on the outer cylinders).
But I haven't tested any sync/fsync in between, only after.

>> - Buy a more "specialized" disc.
>
> see above
>
>> e.g.: Western Digital Raptor X(*), a 150GB, 10-KRPM S-ATA disc.
>> - Buy several discs and use RAID 0
>> or alternate between discs when writing.
>
> What I have to build is an application for the International Space Station
> ISS. I am limited with power and space. So If the disk is able to write
> constantly 48 MB/s then the Operating System should do this!

OK. That appears to be a serious constraint.
Do HDDs cope well with zero gravity?
At least the SSD won't have a problem with that. ;-)

> The problem is: FreeBSD is fast, but lacks of some special drivers. Linux has
> all drivers but access to harddisk is unpredictable and thus unreliable!
> What can I do??

Personally I haven't had such bad write speeds in years, USB-connected and/or encrypted partitions aside.
But on the other hand: I don't sync (fsync) until I have to.
And personally I have had good (and constant-bandwidth) experience using XFS as a filesystem.
(I have 41 HDDs with a total capacity of 10.5 TB; performance is quite important for me.)

Also, you have skipped the information on how the images "arrive" on the system (PCI(e) card?); that may be important for an "end to end" view of the problem.

Also missing: what is "a long period of time"?
Calculating the best case with the SSD:
27GB divided by 30MB/s gives only a bit more than 15 minutes.
And the worst case, at 50MB/s, is less than 10 minutes.





--
Real Programmers consider "what you see is what you get" to be just as
bad a concept in Text Editors as it is in women. No, the Real Programmer
wants a "you asked for it, you got it" text editor -- complicated,
cryptic, powerful, unforgiving, dangerous.

2007-02-13 10:18:24

by Andi Kleen

[permalink] [raw]
Subject: Re: SATA-performance: Linux vs. FreeBSD

Arjan van de Ven <[email protected]> writes:

> > >
> > The problem is: FreeBSD is fast, but lacks of some special drivers. Linux has
> > all drivers but access to harddisk is unpredictable and thus unreliable!
> > What can I do??
>
>
> there's several tunables you can do;

[...] Well Linux certainly should perform better out of the box
on such a simple configuration.

Something is wrong especially when the CPU usage is so high.

That is why I suggested oprofile. Perhaps contact [email protected]
(if the results show driver problems) and [email protected] (otherwise)
with the results.

-Andi

2007-02-13 10:25:11

by Arjan van de Ven

[permalink] [raw]
Subject: Re: SATA-performance: Linux vs. FreeBSD

On Tue, 2007-02-13 at 12:18 +0100, Andi Kleen wrote:
> Arjan van de Ven <[email protected]> writes:
>
> > > >
> > > The problem is: FreeBSD is fast, but lacks of some special drivers. Linux has
> > > all drivers but access to harddisk is unpredictable and thus unreliable!
> > > What can I do??
> >
> >
> > there's several tunables you can do;
>
> [...] Well Linux certainly should perform better out of the box
> on such a simple configuration.

no argument from me there; first need to find out which piece is wrong
>
> Something is wrong especially when the CPU usage is so high.

I'll buy that, yet there's plenty of cpu time available so that
shouldn't be all that much of a limit on the throughput... there's still
headroom

--
if you want to mail me at work (you don't), use arjan (at) linux.intel.com
Test the interaction between Linux and your BIOS via http://www.linuxfirmwarekit.org

2007-02-13 10:29:22

by Martin A. Fink

[permalink] [raw]
Subject: Re: SATA-performance: Linux vs. FreeBSD

On Tuesday, 13 February 2007 11:16, you wrote:
> Martin A. Fink wrote:
> > On Tuesday, 13 February 2007 00:31, you wrote:
> >> Martin A. Fink wrote:
> >>> I have to store big amounts of data coming from 2 digital cameras to disk.
> >>> Thus I have to write blocks of around 1 MB at 30 to 50 frames per second for
> >>> a long period of time. So it is important for me that the harddisk drive is
> >>> reliable in the sense of "if it is capable of 50 MB/s then it should operate
> >>> at this speed. Constantly."
> >> The good old handful of suggestions:
> >>
> >> - Use a dedicated disc for the task.
> >
> > I used a dedicated disk for this task. No one else besides the task is writing
> > to it!
>
> OK.
>
> >> - Use an empty disc so there is no fragmentation.
> >
> > All tests were performed on empty disk!
>
> OK.
>
> >> - Buy a bigger disk, they have high bandwidths.
> >
> > I have a flash disk from a manufacturer who grants me 48 MB/s. And FreeBSD as
> > well as Windows reach this value. Only Linux 2.6.18 is far away from it (42
> > MB/s)
>
> Even 48MB/s is quite low.
> I've reached up to 70MB/s with a single 500GB Seagate model and even my older HDDs all reach 60MB/s (at least on the outer cylinders)
> But i haven't tested any "sync/fsync" in between, only after.

Please read carefully! I am talking about flash disks, not normal hard disks. There
are no mechanical parts in flash disks, only flash memory. And therefore
48MB/s is excellent (compared to all other available disks).

>
> >> - Buy a more "specialized" disc.
> >
> > see above
> >
> >> for e.x.: Western Digital Raptor X(*) a 150GB, 10-KRPM S-ATA disc.
> >> - Buy several discs and use RAID 0
> >> or alternate between discs when writing.
> >
> > What I have to build is an application for the International Space Station
> > ISS. I am limited with power and space. So If the disk is able to write
> > constantly 48 MB/s then the Operating System should do this!
>
> OK. That appears to be a serious constraint.
> Do HDDs cope well with zero gravity?

Yes and no. Yes: standard desktop HDDs are unproblematic. Laptop HDDs have
g-force shock hardware that works on zero-g detection and thus Laptop HDDs
can't be used in space. At least modern ones can't...

> At least the SSD won't have a problem with that. ;-)
>
> > The problem is: FreeBSD is fast, but lacks of some special drivers. Linux has
> > all drivers but access to harddisk is unpredictable and thus unreliable!
> > What can I do??
>
> Personally i haven't had such bad write speeds in years. Taking USB connected and/or encrypted partitions aside.
> But on the other hand: I don't sync(fsync) until i have to.

If you don't have to, no problem. But if you use a filesystem, you do an fsync
every time you close the file (and the file size is less than 1-2 GB).
> And personally i have good (and constant bandwidth) experience using XFS as a filesystem.
> (I have 41 HDDs with a total capacity of 10.5 TB, performance is quite important for me.)
>
> Also you have skipped the information how the images "arrive" on the system (PCI(e) card?), that may be important for an "end to end" view of the problem.

Images arrive via Gigabit Ethernet. GigE Vision standard. (PCIe x4)
>
> And what's also missing. What is "a long period of time".
> Calculating best-case with the SSD:
> 27GB divided by 30MB/s only gives a bit more than 15 Minutes.
> And worst case with 50MB/s is less than 10 Minutes.

Well. The testdrive has 27GB. The final drive will have 225 GB. And there will
be 3 cameras and thus 3 disks. This means we talk about 140 MB/s for around
90 minutes.
For space applications with low power but high performance this is a long
time... ;-)
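
The recording times mentioned in this exchange follow from simple division. A small sketch, assuming decimal units (1 GB = 1000 MB) and a hypothetical helper name `fill_time_minutes`:

```python
def fill_time_minutes(capacity_gb, rate_mb_per_s):
    """Minutes needed to fill `capacity_gb` at a sustained `rate_mb_per_s`.

    Uses decimal units (1 GB = 1000 MB), which is an assumption here.
    """
    return capacity_gb * 1000 / rate_mb_per_s / 60


# Test drive from this thread: 27 GB at the two data rates discussed.
print(round(fill_time_minutes(27, 30), 1))
print(round(fill_time_minutes(27, 50), 1))

# Final setup: three 225 GB drives filled at 140 MB/s in total.
print(round(fill_time_minutes(3 * 225, 140), 1))
```

Under these assumptions the three drives would be full after roughly 80 minutes, which is in the same range as the "around 90 minutes" quoted above.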


2007-02-13 11:12:49

by Alan

[permalink] [raw]
Subject: Re: SATA-performance: Linux vs. FreeBSD

> Well they do. The Flash disk I have (SATA-I) is capable of 48 MB/s and this
> value is reached over the whole disk size by windows as well as by FreeBSD.
> See my test results in the first thread.

Ok a flash disk should be more stable

> My Seagate Barracuda Harddisk drive (SATA-II) starts with 76 MB/s and
> decreases linearly to 35 MB/s due to the fact that it has to write to a
> rotating disk. But on a flash disk there is nothing rotating...

The hard disk one isn't guaranteed or stable but the flash especially if
it is aimed at it ought to behave.

> So where is the difference between SATA-I and SATA-II ?

All on the physical side, if they are on the same controller when you do the
tests. Mostly latency.

> And why is FreeBSD able to write with constant rates (the complete 25 GB, all
> with 48+/-0.1 MB/s) but Linux 2.6.18 not ?

Does the FreeBSD fsync sync to media ? Also what controller is being used
here, and do you have EHCI USB support running ?

> With a dedicated (rotating) SATA II device, using the first 70% of disk space
> no problem -- tested ! With a SATA-I device only a problem with Linux 2.6.18

I suspect the SATA-1 itself may not be the decider but something else -
eg the hard disk using NCQ, which would cover up any latency related
problems.

> Journaling of data: you are right, ext2 performs better than ext3.

And ext3 in writeback mode ought in theory (but practice is always
harder ;)) be faster than ext2.

2007-02-13 11:15:54

by Alan

[permalink] [raw]
Subject: Re: SATA-performance: Linux vs. FreeBSD

> there's several tunables you can do;
> 1) increase /sys/block/<device>/queue/nr_requests
> the linux default is on the low side
> 5) echo a larger value into /sys/block/<device>/queue/max_sectors_kb
> the default seems to be 512 which is... really low. The hw max is in
> another file in that directory; if you want max throughput set the
> max_sectors_kb value to the hw max. (you pay in terms of fairness for


There are two more factors that play into #1 and #5. Firstly there is a
per command completion overhead in ATA without NCQ being active and that
isn't yet a heavily optimised libata path. Secondly erase block size
matters with flash drives so the bigger each I/O the better erase block
behaviour we should get.

2007-02-13 12:04:35

by Jörn Engel

[permalink] [raw]
Subject: Re: SATA-performance: Linux vs. FreeBSD

On Tue, 13 February 2007 11:27:58 +0000, Alan wrote:
>
> isn't yet a heavily optimised libata path. Secondly erase block size
> matters with flash drives so the bigger each I/O the better erase block
> behaviour we should get.

Although that should max out somewhere between 16KiB and 128KiB,
depending on the chips being used.
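
Why the benefit of larger I/Os levels off can be illustrated with a toy model. The assumption, which is a simplification and not a description of any particular drive, is that every erase block touched by a write is rewritten in full; the function names are hypothetical.

```python
def flash_bytes_programmed(offset, size, erase_block):
    """Bytes rewritten if every touched erase block is rewritten in full."""
    first = offset // erase_block
    last = (offset + size - 1) // erase_block
    return (last - first + 1) * erase_block


def write_amplification(offset, size, erase_block):
    """Ratio of bytes programmed on flash to bytes requested by the host."""
    return flash_bytes_programmed(offset, size, erase_block) / size


KIB = 1024
ERASE_BLOCK = 128 * KIB  # upper end of the range mentioned above

for io_kib in (4, 64, 128, 512, 1024):
    ratio = write_amplification(0, io_kib * KIB, ERASE_BLOCK)
    print(f"{io_kib:5d} KiB I/O -> amplification {ratio:.1f}")
```

In this model the ratio reaches 1.0 once an aligned I/O covers whole erase blocks, and making the I/O larger than that gains nothing further.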

Jörn

--
If you're willing to restrict the flexibility of your approach,
you can almost always do something better.
-- John Carmack

2007-02-13 12:08:33

by Jörn Engel

[permalink] [raw]
Subject: Re: SATA-performance: Linux vs. FreeBSD

On Tue, 13 February 2007 11:29:18 +0100, Martin A. Fink wrote:
>
> Please Read Carefully! I talk about flash disk, not normal harddisks. There
> are no mechanical parts in flash disks, only flash memory. And therefore
> 48MB/s is excellent (compared to all other available disks)
>
> [...]
>
> Well. The testdrive has 27GB. The final drive will have 225 GB. And there will
> be 3 cameras and thus 3 disks. This means we talk about 140 MB/s for around
> 90 minutes.

Do you have any numbers on the performance for the final drive? Single
flash chips are relatively slow, the high bandwidth is usually achieved
by writing in parallel to several of them. With the bigger drive you
get more chips and the manufacturer could run more of them in parallel.

Jörn

--
With a PC, I always felt limited by the software available. On Unix,
I am limited only by my knowledge.
-- Peter J. Schoenster

2007-02-13 12:25:11

by Matthias Schniedermeyer

[permalink] [raw]
Subject: Re: SATA-performance: Linux vs. FreeBSD

Martin A. Fink wrote:

>> Also you have skipped the information how the images "arrive" on the system
>> (PCI(e) card?), that may be important for an "end to end" view of the
>> problem.
>
> Images arrive via Gigabit Ethernet. GigE Vision standard. (PCIe x4)

Then the next question is: chipset / protocol used / jumbo frames / (NAPI) / ...

Have you already determined the load caused by this part?
Depending on the GigE chipset and on protocol, jumbo frames, (NAPI), ..., the overhead involved can be quite serious.

>> And what's also missing. What is "a long period of time".
>> Calculating best-case with the SSD:
>> 27GB divided by 30MB/s only gives a bit more than 15 Minutes.
>> And worst case with 50MB/s is less than 10 Minutes.
>
> Well. The testdrive has 27GB. The final drive will have 225 GB. And there will
> be 3 cameras and thus 3 disks. This means we talk about 140 MB/s for around
> 90 minutes.
> For space applications with low power but high performance this is a long
> time... ;-)

Will the MB/CPU/RAM be the ones specified in the first mail?
My gut feeling says: Forget it.

The total bandwidth needed may be too high, and at least the incoming part via GigE may have serious overhead.
150MB/s comes in via (at least 2) GigE links; without zero-copy there is another 150MB/s memory to memory.
Then there is the next 150MB/s from memory to the discs; without zero-copy there is also another 150MB/s memory to memory.
In total that's 300MB/s to 600MB/s without any processing.

But on the other hand, hdparm -T says my system (Core2Duo E6700, FSB1066, 2GB DDR2-800 RAM, 32Bit) has a buffer-cache bandwidth of around 4000MB/s.
As you didn't say which FSB and memory type you have, I would guess that your system should reach between 2000MB/s and 3500MB/s of LINEAR(!) memory bandwidth.
(Total usable memory bandwidth unfortunately also depends on the usage pattern. Large & linear is not as important as with a rotating HDD, but it factors in.)
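
The 300MB/s to 600MB/s estimate above can be reproduced with a small sketch. It assumes each stage moves the full stream through memory exactly once; the function name is hypothetical, and the memory bandwidth figures are the guesses from the paragraph above.

```python
def memory_traffic_mb_s(stream_mb_s, copy_on_receive, copy_on_write):
    """Memory traffic for one data stream passing from the NICs to the discs.

    Every stage counts once: NIC -> memory, an optional memory -> memory copy
    on receive, memory -> disc, and an optional memory -> memory copy on write.
    """
    stages = 2 + int(copy_on_receive) + int(copy_on_write)
    return stream_mb_s * stages


for copies in (False, True):
    traffic = memory_traffic_mb_s(150, copies, copies)
    for bandwidth in (2000, 3500):
        share = 100 * traffic / bandwidth
        print(f"{traffic} MB/s is {share:.0f}% of {bandwidth} MB/s")
```

Even the pessimistic case stays well below the guessed linear memory bandwidth, so by this estimate alone the raw copy traffic would not be the limiting factor; per-packet overhead is a separate question.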



Btw., on the topic of filesystem and Linux performance:
SGI did a "really big" test some time ago with a big-iron machine having 24 Itanium2 CPUs in 12 nodes, 12*2 GB of RAM and 256 discs, using XFS (which is from SGI!).
The PDF file is here:
http://oss.sgi.com/projects/xfs/papers/ols2006/ols-2006-paper.pdf

According to the paper, the system had a theoretical peak I/O performance of 11.5 GB/s and in practice peaked at 10.7GB/s reading and 8.9GB/s writing.
IOW Linux and XFS CAN perform quite well, but the system has to have enough muscle for the job.
And since the paper (and kernel 2.6.5) the development of Linux hasn't stopped.




2007-02-13 12:32:39

by Martin A. Fink

[permalink] [raw]
Subject: Re: SATA-performance: Linux vs. FreeBSD

On Tuesday, 13 February 2007 12:25, you wrote:
> > Well they do. The Flash disk I have (SATA-I) is capable of 48 MB/s and this
> > value is reached over the whole disk size by windows as well as by FreeBSD.
> > See my test results in the first thread.
>
> Ok a flash disk should be more stable
>
> > My Seagate Barracuda Harddisk drive (SATA-II) starts with 76 MB/s and
> > decreases linearly to 35 MB/s due to the fact that it has to write to a
> > rotating disk. But on a flash disk there is nothing rotating...
>
> The hard disk one isn't guaranteed or stable but the flash especially if
> it is aimed at it ought to behave.
>
> > So where is the difference between SATA-I and SATA-II ?
>
> All physical side if they are on the same controller when you do the
> tests. Mostly latency,
>
> > And why is FreeBSD able to write with constant rates (the complete 25 GB, all
> > with 48+/-0.1 MB/s) but Linux 2.6.18 not ?
>
> Does the FreeBSD fsync sync to media ? Also what controller is being used
> here, and do you have EHCI USB support running ?

The FreeBSD manual page for fsync says it syncs to media.

I used the same controller: same computer, same hard disk, with two partitions on
the system disk, one for Linux and one for FreeBSD.

EHCI:

ehci_hcd 0000:00:1d.7: EHCI Host Controller
ehci_hcd 0000:00:1d.7: USB 2.0 started, EHCI 1.00, driver 10 Dec 2004
usb usb1: Product: EHCI Host Controller

AHCI

ahci 0000:00:1f.2: AHCI 0001.0100 32 slots 4 ports 3 Gbps 0xf impl SATA mode

>
> > With a dedicated (rotating) SATA II device, using the first 70% of disk space
> > no problem -- tested ! With a SATA-I device only a problem with Linux 2.6.18
>
> I suspect the SATA-1 itself may not be the decider but something else -
> eg the hard disk using NCQ, which would cover up any latency related
> problems.
>
> > Journaling of data: you are right, ext2 performs better than ext3.
>
> And ext3 in writeback mode ought in theory (but practice is always
> harder ;)) be faster than ext2.
>
>


2007-02-13 12:49:10

by Martin A. Fink

[permalink] [raw]
Subject: Re: SATA-performance: Linux vs. FreeBSD

On Tuesday, 13 February 2007 13:24, you wrote:
> Martin A. Fink wrote:
>
> >> Also you have skipped the information how the images "arrive" on the system
> >> (PCI(e) card?), that may be important for an "end to end" view of the
> >> problem.
> >
> > Images arrive via Gigabit Ethernet. GigE Vision standard. (PCIe x4)
>
> Then the next question is: ChipSet/Used Protocol/JumboFrames/(NAPI)/... .
>
> Have you already determined the load caused by this part?
> Depending on the GigE-Chipset, and Protocol/JumboFrames/(NAPI)/..., the involved overhead can be quite serious.
>
> >> And what's also missing. What is "a long period of time".
> >> Calculating best-case with the SSD:
> >> 27GB divided by 30MB/s only gives a bit more than 15 Minutes.
> >> And worst case with 50MB/s is less than 10 Minutes.
> >
> > Well. The testdrive has 27GB. The final drive will have 225 GB. And there will
> > be 3 cameras and thus 3 disks. This means we talk about 140 MB/s for around
> > 90 minutes.
> > For space applications with low power but high performance this is a long
> > time... ;-)
>
> The MB/CPU/RAM will be the one specified in the first mail?
> My gut feeling says: Forget it.
>
> The needed total bandwidth may be too high and at least the incoming part via GigE may have serious overhead.
> 150MB/s in via (at least 2) GigE, without Zero-Copy there is another 150MB/s memory to memory.
> Then there is the next 150MB/s memory to the discs, without Zero-Copy there is also another 150MB/s memory to memory.
> In total that's 300MB/s to 600MB/s without any processing.

I don't understand your calculation: around 50 MB/s comes from each of the 3 GigE
ports. These 150 MB/s altogether have to be copied to memory. From there they will be
copied to disk. So we are talking about 2x150 MB/s running through my system. That
is less than 2 PCIe lanes can handle... And there are more than 2 lanes
between north and south bridge....
>
> But on the other hand, hdparm -T says my system (Core2Duo E6700, FSB1066, 2GB DDR2-800 RAM, 32Bit) has a buffer-cache bandwidth around 4000MB/s.
> As you didn't say which FSB and Memory-Type you have I would guess that your system should reach between 2000MB/s and 3500MB/s of LINEAR(!) memory bandwidth.
> (Total usable Memory-Bandwidth is unfortunately also dependent on usage pattern. Large & linear is not as important as with a rotating HDD, but it factors in)
>
>
>
> Btw. On the topic of filesystem and Linux performance:
> SGI did a "really big" test some time ago with a big iron having 24 Itanium2-CPUs in 12 nodes, and 12*2 GB of ram and having 256 discs using XFS(Which is from SGI!).
> The pdf-file is here:
> http://oss.sgi.com/projects/xfs/papers/ols2006/ols-2006-paper.pdf
>
> According to the paper the system had a theoretical peak IO-performance of 11.5 GB/s and practically peaked at 10.7GB/s reading and 8.9GB/s writing.
> IOW Linux and XFS CAN perform quite well, but the system has to have enough muscle for the job.
> And since the paper (and Kernel 2.6.5) the development of Linux hasn't stopped.


2007-02-13 13:53:58

by Matthias Schniedermeyer

[permalink] [raw]
Subject: Re: SATA-performance: Linux vs. FreeBSD

Martin A. Fink wrote:
>> The needed total bandwidth may be too high and at least the incoming part via GigE may have serious overhead.
>> 150MB/s in via (at least 2) GigE, without Zero-Copy there is another 150MB/s memory to memory.
>> Then there is the next 150MB/s memory to the discs, without Zero-Copy there is also another 150MB/s memory to memory.
>> In total that's 300MB/s to 600MB/s without any processing.
>
> I dont understand your calculation: from 3 GE ports come around 50 MB/each.
> These altogether 150MB/s have to be copied to memory. From there they will be
> copied to disk. So we talk about 2x150 MB/s running through my system. That
> is less than 2 PCIe lanes can handle... And there are more than 2 lanes
> between north and south bridge....

It may be that the TCP/IP stack has to copy the data around, but someone who knows the inner workings would have to answer this.
That may also depend on the NIC used.

Also, the data doesn't arrive 'en bloc' but over a period of time, so you have more or less big "gaps" in the processing.

Especially the "gaps" can considerably lower the total achievable bandwidth.

A little naive calculation (according to dict.leo.org, "naive fallacy" is a translation of Milchmädchenrechnung):
You get a package of work every (say) 1ms and you need (say) 0.2ms for processing, shoveling and writing to disc.
Then there is no way you can use more than 1/5 of the total theoretical bandwidth, because 80% of the time you are waiting for more work to come.
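
The naive calculation above, written out as a sketch. The function names and the 50 MB/s peak rate are assumptions for illustration, and the model ignores any buffering that would let work queue up.

```python
def busy_fraction(arrival_interval_ms, processing_ms):
    """Fraction of time spent working when one work item arrives per interval."""
    return min(1.0, processing_ms / arrival_interval_ms)


def achievable_rate(peak_mb_s, arrival_interval_ms, processing_ms):
    """Throughput if the pipeline can only work while an item is available."""
    return peak_mb_s * busy_fraction(arrival_interval_ms, processing_ms)


# One package of work every 1 ms, 0.2 ms to process it.
print(busy_fraction(1.0, 0.2))
print(achievable_rate(50.0, 1.0, 0.2))
```

The fraction only describes how much of the theoretical peak is used; whether the required data rate is met depends on how much data each package of work carries.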




2007-02-13 14:47:12

by Theodore Ts'o

[permalink] [raw]
Subject: Re: SATA-performance: Linux vs. FreeBSD

On Tue, Feb 13, 2007 at 01:32:34PM +0100, Martin A. Fink wrote:
> > Does the FreeBSD fsync sync to media ? Also what controller is being used
> > here, and do you have EHCI USB support running ?
>
> Manual of FreeBSD fsync says it syncs to media.

That didn't answer the question. With SATA in particular, just
because you flush it to the *disk* doesn't mean that you've flushed
it to the *media*, unless the OS explicitly gives a command to
the disk to do so. If you haven't done any tests where you sync a
huge amount of data on FreeBSD, then immediately kick the
power plug out of the wall, and then check that all of the
data actually did make it to the media, I wouldn't necessarily assume
that it has. Given that it sounds like you really care about this,
I'd suggest that you explicitly test this before making
assumptions.
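
One way such a pull-the-plug test could be structured is sketched below. The record layout and function names are assumptions, not an existing tool: the writer appends checksummed, numbered records and reports each number only after fsync() has returned; after the power cut, a verifier checks how far the file on the media actually reaches.

```python
import os
import struct
import tempfile
import zlib

PAYLOAD = 4096
HEADER = struct.Struct("<QI")  # sequence number, CRC32 of the payload
RECORD_SIZE = HEADER.size + PAYLOAD


def make_record(seq):
    """Build one record whose payload is derived from its sequence number."""
    payload = struct.pack("<Q", seq) * (PAYLOAD // 8)
    return HEADER.pack(seq, zlib.crc32(payload)) + payload


def append_and_sync(fd, seq):
    """Append one record and return only after fsync() has returned."""
    view = memoryview(make_record(seq))
    while view:  # handle short writes
        view = view[os.write(fd, view):]
    os.fsync(fd)
    return seq  # report this number to a second machine before continuing


def last_valid_sequence(path):
    """Scan the file after the crash; return the last intact sequence number or -1."""
    last = -1
    with open(path, "rb") as f:
        while True:
            record = f.read(RECORD_SIZE)
            if len(record) < RECORD_SIZE:
                break
            seq, crc = HEADER.unpack_from(record)
            if seq != last + 1 or zlib.crc32(record[HEADER.size:]) != crc:
                break
            last = seq
    return last


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "records.dat")
        fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
        try:
            for seq in range(5):
                acked = append_and_sync(fd, seq)
        finally:
            os.close(fd)
        # In the real test the comparison happens after power is restored.
        print(acked, last_valid_sequence(path))
```

If the last number reported before the power cut is larger than what `last_valid_sequence` finds afterwards, fsync returned before the data had reached the media.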

- Ted

2007-02-13 14:51:33

by Alan

[permalink] [raw]
Subject: Re: SATA-performance: Linux vs. FreeBSD

> data actually did make it to the media, I wouldn't necessary assume
> that it has. Given that it sounds like you really care about this,
> I'd suggest that you explicitly testing this before making
> assumptions.

FreeBSD 6.1 appears to get it right for some subsets of devices so it
seems a reasonable assumption at first glance - I did actually look the
BSD bits up to check.

2007-02-13 17:12:36

by Jeff Garzik

[permalink] [raw]
Subject: Re: SATA-performance: Linux vs. FreeBSD

On Tue, Feb 13, 2007 at 11:25:27AM +0000, Alan wrote:
> > So where is the difference between SATA-I and SATA-II ?
>
> All physical side if they are on the same controller when you do the
> tests. Mostly latency,

SATA-II is a highly confusing marketing term. It is /not/ a technical
term.

In some cases there are NO differences between SATA-I and SATA-II. You
can find 1.5Gbps non-NCQ-supporting devices claiming SATA-II.

Similarly, there is no "SATA version" word in the IDENTIFY DEVICE page,
like there are "ATA version" words.

Jeff



2007-02-13 19:01:07

by Jeff Carr

[permalink] [raw]
Subject: Re: SATA-performance: Linux vs. FreeBSD

On 02/12/07 08:37, Martin A. Fink wrote:

> :~> strace -c -T -o trace.out dd if=/dev/zero of=test.txt bs=10MB count=200
>
> 200+0 records in
> 200+0 records out
> 2000000000 bytes (2.0 GB) copied, 52.8632 seconds, 37.8 MB/s

You might want to check the raw write & read speed to the device
without a filesystem. Also, your previous email didn't include xfs.
xfs has very good sustained write performance.

dd if=/dev/zero of=/dev/sdX bs=10MB count=200
dd of=/dev/null if=/dev/sdX bs=10MB count=200

2007-02-13 20:23:28

by Jeffrey Hundstad

[permalink] [raw]
Subject: Re: SATA-performance: Linux vs. FreeBSD

Arjan van de Ven wrote:
>> The problem is: FreeBSD is fast, but lacks of some special drivers. Linux has
>> all drivers but access to harddisk is unpredictable and thus unreliable!
>> What can I do??
>>
>
>
> there's several tunables you can do;
> 1) increase /sys/block/<device>/queue/nr_requests
> the linux default is on the low side
> 2) investigate other elevators; cfq is great for interactive use but not
> so great for max throughput. you can do this by echo'ing "deadline"
> into /sys/block/<device>/scheduler
>

I'd suggest trying the noop scheduler with your ram based devices. I
don't see why these devices would need clever scheduling. ...but prove
me wrong if you will. I haven't tested this.

echo noop > /sys/block/<device>/queue/scheduler

If you don't need journaling, ext2 might be a good choice. But I'd also
like to reiterate the XFS filesystem recommendation given several times
now. There are many tunables that /may/ help during filesystem
creation. Block size (-b) set to its maximum would probably help.

If you're sure you can not encounter power issues:
mount -t xfs -o nobarrier /dev/<device> /mount-point

Here's some more general reading for ya:
Troubleshooting Linux Performance Issues:
http://www.phptr.com/articles/article.asp?p=481867&seqNum=2&rl=1

--
Jeffrey Hundstad

2007-02-15 16:03:24

by Tejun Heo

[permalink] [raw]
Subject: Re: SATA-performance: Linux vs. FreeBSD

Hello, Martin.

Martin A. Fink wrote:
> Test             OpenSuSE (AHCI)              FreeBSD (AHCI)
> ----------------------------------------------------------------------
> SSD(vfat 25GB)   41+/-2 MB/s at 4-10% CPU     15+/-0 MB/s at 2% CPU
> SSD(raw 25GB)    26+/-1 MB/s at 4-10% CPU     48+/-0 MB/s at 1% CPU
> SSD(ext3 25GB)   39+/-5 MB/s at 10-15% CPU    34+/-0 MB/s at 14% CPU
> SSD(ext2 25GB)   42+/-1 MB/s at 10-15% CPU    32+/-0 MB/s at 10% CPU
> ----------------------------------------------------------------------
>
> Test             OpenSuSE (AHCI off)          FreeBSD (AHCI off)
> ----------------------------------------------------------------------
> SSD(vfat 25GB)   22+/-4 MB/s at 6-19% CPU     --
> SSD(raw 25GB)    33+/-4 MB/s at 7-14% CPU     41+/-0 MB/s at 1% CPU
> SSD(ext2 25GB)   27+/-6 MB/s at 6-14% CPU     --
> ----------------------------------------------------------------------
>
> Question 1:
> Can anybody explain to me, why writing to a SATA-I device with AHCI consumes
> so much CPU time using Linux, while it takes almost no CPU time on FreeBSD
> 6.2 ? Especially comparing values of writing to the raw device?

Can't tell. AHCI needs very few MMIOs to perform each request. As Andi
suggested, please do oprofile. It's easy.

> Question 2:
> Can anybody explain to me, why writing to a solid state disk (a kind of memory
> that always has the same constant bandwidth) has such big standard errors in
> writing rate using Linux (between 1 to 6 MB/s error) while FreeBSD gives an
> almost constant writing rate (as one would expect it for a SSD) ?

The default iosched is heavily optimized for regular disks with a moving
head and for more usual workloads. Requests are sometimes paused to wait
for requests in an adjacent area. Use deadline or noop for SSDs.

Also, try turning off NCQ. Some early drives from major disk vendors
had all kinds of issues with their NCQ implementations. SSD firmware doesn't
tend to be of high quality.

> Question 3:
> Why is writing to a raw device in Linux slower than using e.g. ext2 ? And why
> is Linux writing rate much lower (-12.5 % for the best case) compared to
> writing rate of FreeBSD?

As written above, the first thing I can think of is interaction with
iosched. SSD and your workload are pretty unusual.

> Question 4:
> When writing to the SATA-II HDD Linux is around 10% slower than FreeBSD when
> using ext3, but around as fast as FreeBSD when writing raw. Why?

Dunno much about that. Where's the test result?

> How can I improve the speed of Linux,

As other people have pointed out, use /dev/sdX, not the raw devices. If you
use raw, you end up writing each chunk synchronously.
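
The difference between buffered and per-chunk synchronous writes can be tried out with a small sketch. It uses O_SYNC on an ordinary file as a stand-in for the synchronous behaviour described above, which is an assumption for illustration and not the raw-device driver itself; the function name is hypothetical.

```python
import os
import tempfile
import time

CHUNK = 1024 * 1024  # 1 MB per write


def write_chunks(path, count, synchronous):
    """Write `count` 1 MB chunks and return the elapsed time in seconds.

    With `synchronous` set, the file is opened with O_SYNC so that every
    write() returns only after the data has been handed to stable storage.
    """
    flags = os.O_WRONLY | os.O_CREAT | os.O_TRUNC
    if synchronous:
        flags |= os.O_SYNC
    data = b"\0" * CHUNK
    start = time.perf_counter()
    fd = os.open(path, flags, 0o600)
    try:
        for _ in range(count):
            view = memoryview(data)
            while view:  # handle short writes
                view = view[os.write(fd, view):]
        os.fsync(fd)  # one final flush so both variants end in the same state
    finally:
        os.close(fd)
    return time.perf_counter() - start


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        for synchronous in (False, True):
            path = os.path.join(tmp, f"sync_{synchronous}.dat")
            seconds = write_chunks(path, count=16, synchronous=synchronous)
            print(f"synchronous={synchronous}: {16 / seconds:.1f} MB/s")
```

How large the gap is depends on the device and filesystem; on a disk with a write pipeline the buffered variant can keep the queue full, while the synchronous one waits for each chunk.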

--
tejun