We're doing some mysql benchmarking. For some reason it seems that ide
drives are currently beating a scsi raid array, and it seems to be related
to fsync's. Bonnie stats show the scsi array to blow away ide as
expected, but mysql tests still have the ide beating on plain insert
speeds. Can anyone explain how this is possible, or perhaps explain how
our testing may be flawed?
Here's the bonnie stats:
IDE Drive:
Version 1.00g ------Sequential Output------ --Sequential Input- --Random-
-Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP /sec %CP
jeremy 300M 9026 94 17524 12 8173 9 7269 83 23678 7 102.9 0
------Sequential Create------ --------Random Create--------
-Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
files /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP
16 469 98 1476 98 16855 89 459 98 7132 99 688 25
SCSI Array:
Version 1.00g ------Sequential Output------ --Sequential Input- --Random-
-Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP /sec %CP
orville 300M 8433 100 134143 99 127982 99 8016 100 374457 99 1583.4 6
------Sequential Create------ --------Random Create--------
-Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
files /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP
16 503 13 +++++ +++ 538 13 490 13 +++++ +++ 428 11
So...obviously from the bonnie stats, the scsi array blows away the
ide...but here's what we get for fsync stats using the little c program
I've attached:
IDE Drive:
jeremy:~# time ./xlog file.out fsync
real 0m1.850s
user 0m0.000s
sys 0m0.220s
SCSI Array:
[root@orville mysql_data]# time /root/xlog file.out fsync
real 0m23.586s
user 0m0.010s
sys 0m0.110s
I would appreciate any help understanding what I'm seeing here and any
suggestions on how to improve the performance.
The SCSI adapter on the raid array is an Adaptec 39160, the raid
controller is a CMD-7040. Kernel 2.4.0 using XFS for the filesystem on
the raid array, kernel 2.2.18 on ext2 on the IDE drive. The filesystem is
not the problem, as I get almost the exact same results running this on
ext2 on the raid array.
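(The xlog attachment itself is missing from the archive. Judging from
details elsewhere in the thread -- 2000 records of 56 bytes each, a
112000-byte output file, an optional "fsync" argument -- a minimal
reconstruction might look like this; the record layout is a guess based
on the strace output quoted later.)

/*
 * xlog.c (reconstructed sketch, not the original attachment).
 * Writes NUM_RECS small records; if argv[2] is "fsync", follows
 * every write with an fdatasync().  Note there is no O_TRUNC, so a
 * second run over an existing file rewrites the same blocks in place.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <fcntl.h>

#define NUM_RECS 2000
#define REC_SIZE 56

int main(int argc, char *argv[])
{
    char buf[REC_SIZE];
    int fd, k, do_sync;

    if (argc < 3) {
        fprintf(stderr, "usage: %s <file> fsync|anything\n", argv[0]);
        exit(1);
    }
    do_sync = (strcmp(argv[2], "fsync") == 0);

    fd = open(argv[1], O_WRONLY | O_CREAT, 0644);
    if (fd < 0) {
        perror("open");
        exit(1);
    }
    memset(buf, 'x', sizeof(buf));
    for (k = 0; k < NUM_RECS; ++k) {
        memcpy(buf, &k, sizeof(k));     /* sequence number up front */
        if (write(fd, buf, sizeof(buf)) != sizeof(buf)) {
            perror("write");
            exit(1);
        }
        if (do_sync)
            fdatasync(fd);              /* fsync(fd) timed the same here */
    }
    close(fd);
    return 0;
}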
Thanks
-jeremy
--
this is my sig.
>
>
> We're doing some mysql benchmarking. For some reason it seems that ide
> drives are currently beating a scsi raid array and it seems to be related
> to fsync's. Bonnie stats show the scsi array to blow away ide as
> expected, but mysql tests still have the ide beating on plain insert
> speeds. Can anyone explain how this is possible, or perhaps explain how
> our testing may be flawed?
>
The fsync system call does this:
    down(&inode->i_sem);
    filemap_fdatasync(inode->i_mapping);
    err = file->f_op->fsync(file, dentry, 0);
    filemap_fdatawait(inode->i_mapping);
    up(&inode->i_sem);
the f_op->fsync part of this calls file_fsync() which does:
    sync_buffers(dev, 1);
So it looks like fsync is going to cost more for bigger devices. Given the
O_SYNC changes Stephen Tweedie did, couldn't fsync look more like this:
    down(&inode->i_sem);
    filemap_fdatasync(ip->i_mapping);
    fsync_inode_buffers(ip);
    filemap_fdatawait(ip->i_mapping);
    up(&inode->i_sem);
Steve Lord
On Friday, March 02, 2001 12:39:01 PM -0600 Steve Lord <[email protected]> wrote:
[ file_fsync syncs all dirty buffers on the FS ]
>
> So it looks like fsync is going to cost more for bigger devices. Given the
> O_SYNC changes Stephen Tweedie did, couldn't fsync look more like this:
>
> down(&inode->i_sem);
> filemap_fdatasync(ip->i_mapping);
> fsync_inode_buffers(ip);
> filemap_fdatawait(ip->i_mapping);
> up(&inode->i_sem);
>
reiserfs might need to trigger a commit on fsync, so the fs-specific fsync
op needs to be called. But you should not need to call file_fsync in the
XFS fsync call (check out ext2's).
As for why ide is beating scsi in this benchmark: make sure tagged queueing
is on (or increase the queue length?). For the xlog.c test posted, I would
expect scsi to get faster than ide as the size of the write increases.
-chris
On Fri, 2 Mar 2001, Steve Lord wrote:
> I think the issue is the call being used now is going to get slower the
> larger the device is, just from the point of view of how many buffers it
> has to scan.
Well, I tried making the device smaller, creating just a 9 gig partition on
the raid array, and this made no difference in the xlog results.
-jeremy
> reiserfs might need to trigger a commit on fsync, so the fs-specific fsync
> op needs to be called. But you should not need to call file_fsync in the
> XFS fsync call (check out ext2's)
Right, this was just a generic example, the fsync_inode_buffers would be in
the filesystem specific fsync callout - this was more of a logical
example of what ext2 could do. XFS does completely different stuff in there
anyway.
>
> As for why ide is beating scsi in this benchmark: make sure tagged queueing
> is on (or increase the queue length?). For the xlog.c test posted, I would
> expect scsi to get faster than ide as the size of the write increases.
I think the issue is the call being used now is going to get slower the
larger the device is, just from the point of view of how many buffers it
has to scan.
Steve
Okay, I now have to create TCQ for ATA because I am not going to lose again
now that I am winning ;-)
On Fri, 2 Mar 2001, Chris Mason wrote:
> As for why ide is beating scsi in this benchmark: make sure tagged queueing
> is on (or increase the queue length?). For the xlog.c test posted, I would
> expect scsi to get faster than ide as the size of the write increases.
Andre Hedrick
Linux ATA Development
ASL Kernel Development
-----------------------------------------------------------------------------
ASL, Inc. Toll free: 1-877-ASL-3535
1757 Houret Court Fax: 1-408-941-2071
Milpitas, CA 95035 Web: http://www.aslab.com
On Friday, March 02, 2001 01:25:25 PM -0600 Steve Lord <[email protected]> wrote:
>> As for why ide is beating scsi in this benchmark: make sure tagged queueing
>> is on (or increase the queue length?). For the xlog.c test posted, I
>> would expect scsi to get faster than ide as the size of the write
>> increases.
>
> I think the issue is the call being used now is going to get slower the
> larger the device is, just from the point of view of how many buffers it
> has to scan.
filemap_fdatawait, filemap_fdatasync, and fsync_inode_buffers all restrict
their scans to a list of dirty buffers for that specific file. Only
file_fsync goes through all the dirty buffers on the device, and the ext2
fsync path never calls file_fsync.
Or am I missing something?
-chris
> filemap_fdatawait, filemap_fdatasync, and fsync_inode_buffers all restrict
> their scans to a list of dirty buffers for that specific file. Only
> file_fsync goes through all the dirty buffers on the device, and the ext2
> fsync path never calls file_fsync.
>
> Or am I missing something?
>
> -chris
>
>
No you are not, I will now go put on the brown paper bag.....
The scsi thing is weird though, we have seen it here too.
Steve
In article <[email protected]>,
Jeremy Hansen <[email protected]> wrote:
>
>The SCSI adapter on the raid array is an Adaptec 39160, the raid
>controller is a CMD-7040. Kernel 2.4.0 using XFS for the filesystem on
>the raid array, kernel 2.2.18 on ext2 on the IDE drive. The filesystem is
>not the problem, as I get almost the exact same results running this on
>ext2 on the raid array.
Did you try a 2.4.x kernel on both?
2.4.0 has a bad elevator, which may show problems, so please check whether
the numbers change with 2.4.2. Also, "fsync()" is very different indeed on 2.2.x
and 2.4.x, and I would not be 100% surprised if your IDE drive does
asynchronous write caching and your RAID does not... That would not show
up in bonnie.
Also note how your bonnie file remove numbers for IDE seem to be much
better than for your RAID array, so it is not impossible that your RAID
unit just has a _huge_ setup overhead but good throughput, and that the
IDE numbers are better simply because your IDE setup is much lower
latency. Never mistake throughput for _speed_.
Linus
On Fri, 2 Mar 2001, Chris Mason wrote:
> As for why ide is beating scsi in this benchmark: make sure tagged queueing
> is on (or increase the queue length?). For the xlog.c test posted, I would
> expect scsi to get faster than ide as the size of the write increases.
I have seen that many drives either have a pathetically small queue or
have completely broken tagged queueing. I guess that's what happens when
most vendors target their hardware for micro$oft.
-Dan
There is definitely something strange going on here.
As the bonnie test below shows, the SCSI disk used
for my tests should vastly outperform the old IDE one:
-------Sequential Output-------- ---Sequential Input-- --Random--
Seagate -Per Char- --Block--- -Rewrite-- -Per Char- --Block--- --Seeks---
ST318451LW MB K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU /sec %CPU
SCSI 200 21544 96.8 51367 51.4 11141 16.3 17729 58.2 40968 40.4 602.9 5.4
Quantum -------Sequential Output-------- ---Sequential Input-- --Random--
Fireball -Per Char- --Block--- -Rewrite-- -Per Char- --Block--- --Seeks---
ST3.2A MB K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU /sec %CPU
IDE 200 3884 72.8 4513 86.0 1781 36.4 3144 89.9 4052 95.3 131.5 0.9
I used a program based on Mike Black's "Blah Blah" test
(shown below) in which 200 write()+fdatasync()s are
performed. Each write() outputs either 20 or 4096 bytes.
On my Celeron 533 Mhz 128 MB ram hardware with an ext2 fs,
the "block" size that is seen by the sd driver for each
fdatasync() is 4096 bytes. lk 2.4.2 is being used. The
fs/buffer.c __wait_on_buffer() routine waits for IO
completion in response to fdatasync(). Timings have been
done with Andrew Morton's timepegs (units are microseconds).
Here are the IDE results:
IDE 20*200 Destination Count Min Max Average Total
enter __wait_on_buffer:0 ->
leave __wait_on_buffer:0 200 1,037.23 6,487.72 1,252.19 250,439.80
leave __wait_on_buffer:0 ->
enter __wait_on_buffer:0 199 7.32 21.05 7.82 1,557.05
IDE 4096*200 Destination Count Min Max Average Total
enter __wait_on_buffer:0 ->
leave __wait_on_buffer:0 200 1,037.06 7,354.21 1,243.78 248,756.64
leave __wait_on_buffer:0 ->
enter __wait_on_buffer:0 199 23.01 67.32 37.03 7,370.51
So the size of each transfer doesn't matter to this IDE
disk. Now the same test for the SCSI disk:
SCSI(20*200) Destination Count Min Max Average Total
enter __wait_on_buffer:0 ->
enter sd_init_command:0 200 1.86 13.27 2.05 411.48
enter sd_init_command:0 ->
enter rw_intr:0 200 320.87 5,398.56 3,417.30 683,461.25
enter rw_intr:0 ->
leave __wait_on_buffer:0 200 4.04 15.81 4.42 885.73
leave __wait_on_buffer:0 ->
enter __wait_on_buffer:0 199 8.78 14.39 9.26 1,844.23
SCSI(4096*200) Destination Count Min Max Average Total
enter __wait_on_buffer:0 ->
enter sd_init_command:0 200 1.97 13.20 2.21 443.52
enter sd_init_command:0 ->
enter rw_intr:0 200 109.53 13,997.50 1,327.47 265,495.87
enter rw_intr:0 ->
leave __wait_on_buffer:0 200 4.37 22.50 4.75 951.44
leave __wait_on_buffer:0 ->
enter __wait_on_buffer:0 199 22.40 42.20 24.27 4,831.34
The extra timepegs inside the SCSI subsystem show that
the IO transaction to that disk really did take that
long. [Initially I suspected a "plugging" type
elevator bug, but that isn't supported by the above
and various other timepegs not shown.]
Since there is a wait on completion for every write,
tagged queuing should not be involved.
So writing more data to the SCSI disk speeds it up!
I suspect the critical point in the "20*200" test is
that the same sequence of eight 512-byte sectors is being
written to disk 200 times. BTW that disk spins at
15K rpm, so one rotation takes 4 ms, and it has a
4 MB cache.
Even though the SCSI disk's "cache" mode page indicates
that the write cache is on, it would seem that writing
the same sectors continually causes flushes to the medium
(and hence the associated delay). Here is scu's output
of the "cache" mode page:
$ scu -f /dev/sda show page cache
Cache Control Parameters (Page 0x8 - Current Values):
Mode Parameter Header:
Mode Data Length: 31
Medium Type: 0 (Default Medium Type)
Device Specific Parameter: 0x10 (Supports DPO & FUA bits)
Block Descriptor Length: 8
Mode Parameter Block Descriptor:
Density Code: 0x2
Number of Logical Blocks: 2289239 (1117.792 megabytes)
Logical Block Length: 512
Page Header / Data:
Page Code: 0x8
Parameters Savable: Yes
Page Length: 18
Read Cache Disable (RCD): No
Multiplication Factor (MF): Off
Write Cache Enable (WCE): Yes
Cache Segment Size Enable (SIZE): Off
Discontinuity (DISC): On
Caching Analysis Permitted (CAP): Disabled
Abort Pre-Fetch (ABPF): Off
Initiator Control Enable (IC): Off
Write Retention Priority: 0 (Not distiguished)
Demand Read Retention Priority: 0 (Not distiguished)
Disable Prefetch Transfer Length: 65535 blocks
Minimum Prefetch: 0 blocks
Maximum Prefetch: 65535 blocks
Maximum Prefetch Ceiling: 65535 blocks
Vendor Specific Cache Bits (VS): 0
Disable Read-Ahead (DRA): No
Logical Block Cache Segment Size: Off (Cache size interpreted as bytes)
Force Sequential Write (FSW): Yes (Blocks written in sequential order)
Number of Cache Segments: 20
Cache Segment Size: 0 bytes
Non-Cache Segment Size: 0 bytes
Perhaps someone has an idea of which of the above settings
can be tweaked to make the write caching work better in
this case.
Doug Gilbert
----------------------------------------------------------
Test program:
#include <stdio.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#define NUM_BLKS 200
#define BLK_SIZE 4096 /* use either 20 or 4096 */
char buff[BLK_SIZE * NUM_BLKS];
int main(int argc, char *argv[])
{
    int fd, k;

    /* NUM_BLKS write()+fdatasync() pairs; with BLK_SIZE 20 every
       sync rewrites the same few sectors, with 4096 each sync hits
       a fresh block */
    fd = open("tst.txt", O_WRONLY | O_CREAT, 0644);
    for (k = 0; k < NUM_BLKS; ++k) {
        write(fd, buff + (k * BLK_SIZE), BLK_SIZE);
        fdatasync(fd);
    }
    close(fd);
    return 0;
}
Mike Black wrote:
>
> Here's a strace -r on IDE:
> 0.001488 write(3, "\214\1\0\0Blah Blah Blah Blah Blah Bla"..., 56) = 56
> 0.000516 fdatasync(0x3) = 0
> 0.001530 write(3, "\215\1\0\0Blah Blah Blah Blah Blah Bla"..., 56) = 56
> 0.000513 fdatasync(0x3) = 0
> 0.001555 write(3, "\216\1\0\0Blah Blah Blah Blah Blah Bla"..., 56) = 56
> 0.000517 fdatasync(0x3) = 0
> 0.001494 write(3, "\217\1\0\0Blah Blah Blah Blah Blah Bla"..., 56) = 56
> 0.000515 fdatasync(0x3) = 0
> 0.001495 write(3, "\220\1\0\0Blah Blah Blah Blah Blah Bla"..., 56) = 56
> 0.000522 fdatasync(0x3) = 0
>
> Here it is on SCSI:
> 0.049285 write(3, "\3\0\0\0Blah Blah Blah Blah Blah Bla"..., 56) = 56
> 0.000689 fdatasync(0x3) = 0
> 0.049148 write(3, "\4\0\0\0Blah Blah Blah Blah Blah Bla"..., 56) = 56
> 0.000516 fdatasync(0x3) = 0
> 0.049318 write(3, "\5\0\0\0Blah Blah Blah Blah Blah Bla"..., 56) = 56
> 0.000516 fdatasync(0x3) = 0
> 0.049343 write(3, "\6\0\0\0Blah Blah Blah Blah Blah Bla"..., 56) = 56
>
> Looks like a constant 50ms delay on each fdatasync() on SCSI vs .5ms for
> IDE. Maybe IDE isn't really doing a sync?? I find .5ms to be a little too
> good.
>
> I did this on 4 different machines with different SCSI cards (include RAID5
> and non-RAID), disks, and IDE drives with the same behavior.
>
> ----- Original Message -----
> From: "Jeremy Hansen" <[email protected]>
> To: <[email protected]>
> Cc: <[email protected]>; <[email protected]>
> Sent: Friday, March 02, 2001 11:27 AM
> Subject: scsi vs ide performance on fsync's
>
> [ original post and bonnie stats snipped -- quoted in full at the top of
> this thread ]
Douglas Gilbert wrote:
> There is definitely something strange going on here.
> As the bonnie test below shows, the SCSI disk used
> for my tests should vastly outperform the old IDE one:
First, thank you and the others who helped with my clueless investigation
of module loading under Debian GNU/Linux. (I should have known
that Debian uses a very special module setup.)
Anyway, I used to think SCSI is better than IDE in general, and
the post was quite surprising.
So I ran the test on my PC.
On my systems too, IDE beats SCSI hands down with this test case.
BTW, has anyone noticed that
the elapsed time of the SCSI case is TWICE as long if
we let the previous output of the test program stay around before
running the second test? (I suspect fdatasync
takes time proportional to the then-current file size, but
why the SCSI case takes so long is still beyond me.)
Eg.
ishikawa@duron$ ls -l /tmp/t.out
ls: /tmp/t.out: No such file or directory
ishikawa@duron$ time ./xlog /tmp/t.out fsync
real 0m38.673s <=== my scsi disk is a slow one to begin with...
user 0m0.050s
sys 0m0.140s
ishikawa@duron$ ls -l /tmp/t.out
-rw-r--r-- 1 ishikawa users 112000 Mar 5 06:19 /tmp/t.out
ishikawa@duron$ time ./xlog /tmp/t.out fsync
real 1m16.928s <=== See TWICE as long!
user 0m0.060s
sys 0m0.160s
ishikawa@duron$ ls -l /tmp/t.out
-rw-r--r-- 1 ishikawa users 112000 Mar 5 06:20 /tmp/t.out
ishikawa@duron$ rm /tmp/t.out <==== REMOVE the file and try again.
ishikawa@duron$ time ./xlog /tmp/t.out fsync
real 0m40.667s <==== Half as long and back to original.
user 0m0.040s
sys 0m0.120s
ishikawa@duron$ time ./xlog /tmp/t.out xxx
real 0m0.012s <=== very fast without fdatasync as it should be.
user 0m0.010s
sys 0m0.010s
ishikawa@duron$
Chris Mason <[email protected]> writes:
> filemap_fdatawait, filemap_fdatasync, and fsync_inode_buffers all restrict
> their scans to a list of dirty buffers for that specific file. Only
> file_fsync goes through all the dirty buffers on the device, and the ext2
> fsync path never calls file_fsync.
>
> Or am I missing something?
If the filesystems tested had blocksize < PAGE_SIZE the fsync would try
to sync everything, not walk the dirty buffers directly.
So e.g. if one of the file systems tested was generated with old ext2 utils
that do not use 4K block size then some performance difference could be
explained.
-Andi
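(A quick way to check whether this blocksize < PAGE_SIZE case applies to
a given test filesystem is to compare statfs()'s f_bsize against the page
size. A small sketch; the mount point argument is whatever you are
testing.)

#include <stdio.h>
#include <unistd.h>
#include <sys/vfs.h>

int main(int argc, char *argv[])
{
    struct statfs sfs;

    if (argc < 2) {
        fprintf(stderr, "usage: %s <mount-point>\n", argv[0]);
        return 1;
    }
    if (statfs(argv[1], &sfs) < 0) {
        perror("statfs");
        return 1;
    }
    /* if f_bsize is smaller than the page size, fsync falls into the
       slow path described above */
    printf("fs block size %ld, page size %d\n",
           (long) sfs.f_bsize, getpagesize());
    return 0;
}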
Since the intention of fsync and fdatasync seems to be
to write dirty fs buffers to persistent storage (i.e.
the "oxide") then the best time is not necessarily
the objective. Given the IDE times that people have
been reporting, it is very unlikely that any of those
IDE disks were really doing 2000 discrete IO operations
involving waiting for those buffers to be written
to the "oxide". [Reason: it should take at least 2000
revolutions of the disk to do it, since most of the
4KB writes are going to the same disk address as the
prior write. At 7200 rpm that is 2000/120 = roughly 17
seconds, yet IDE times of a second or two were reported.]
As it stands, the Linux SCSI subsystem has no mechanism
to force a disk cache write through. The SCSI WRITE(10)
command has a Force Unit Access bit (FUA) to do exactly
that, but we don't use it. Do the fs/block layers flag
that they wish buffers written to the oxide??
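(For experimenting, the FUA bit can at least be poked from user space
through the sg driver. A rough sketch using the sg version 3 SG_IO ioctl
from lk 2.4; the device name and block address are made up, and this
writes straight to disk blocks, so don't point it at anything you care
about.)

#include <stdio.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <scsi/sg.h>

int main(void)
{
    unsigned char cdb[10], data[512], sense[32];
    sg_io_hdr_t io;
    unsigned int lba = 0x1000;          /* arbitrary test block */
    int fd = open("/dev/sg0", O_RDWR);  /* example sg device */

    if (fd < 0) { perror("open"); return 1; }
    memset(data, 0, sizeof(data));
    memset(cdb, 0, sizeof(cdb));
    cdb[0] = 0x2a;                      /* WRITE(10) */
    cdb[1] = 0x08;                      /* FUA: force write to the medium */
    cdb[2] = (lba >> 24) & 0xff;        /* logical block address, big-endian */
    cdb[3] = (lba >> 16) & 0xff;
    cdb[4] = (lba >> 8) & 0xff;
    cdb[5] = lba & 0xff;
    cdb[8] = 1;                         /* transfer length: 1 block */

    memset(&io, 0, sizeof(io));
    io.interface_id = 'S';
    io.dxfer_direction = SG_DXFER_TO_DEV;
    io.cmd_len = sizeof(cdb);
    io.cmdp = cdb;
    io.dxfer_len = sizeof(data);
    io.dxferp = data;
    io.mx_sb_len = sizeof(sense);
    io.sbp = sense;
    io.timeout = 20000;                 /* ms */

    if (ioctl(fd, SG_IO, &io) < 0)
        perror("SG_IO");
    else
        printf("status 0x%x, host 0x%x, driver 0x%x\n",
               io.status, io.host_status, io.driver_status);
    close(fd);
    return 0;
}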
The measurements that showed SCSI disks were taking a lot
longer with the "xlog" test were more luck than good
management.
Here are some tests that show an IDE versus SCSI "xlog"
comparison are very similar between FreeBSD 4.2 and
lk 2.4.2 on the same hardware:
# IBM DCHS04U SCSI disk 7200 rpm <<FreeBSD 4.2>>
[root@free /var]# time /root/xlog tst.txt
real 0m0.043s
[root@free /var]# time /root/xlog tst.txt fsync
real 0m33.131s
# Quantum Fireball ST3.2A IDE disk 3600 rpm <<FreeBSD 4.2>>
[root@free dos]# time /root/xlog tst.txt
real 0m0.034s
[root@free dos]# time /root/xlog tst.txt fsync
real 0m5.737s
# IBM DCHS04U SCSI disk 7200 rpm <<lk 2.4.2>>
[root@tvilling extra]# time /root/xlog tst.txt
0:00.00elapsed 125%CPU
[root@tvilling spare]# time /root/xlog tst.txt fsync
0:33.15elapsed 0%CPU
# Quantum Fireball ST3.2A IDE disk 3600 rpm <<lk 2.4.2>>
[root@tvilling /root]# time /root/xlog tst.txt
0:00.02elapsed 43%CPU
[root@tvilling /root]# time /root/xlog tst.txt fsync
0:05.99elapsed 69%CPU
Notes: FreeBSD doesn't have fdatasync() so I changed xlog
to use fsync(). Linux timings were the same with fsync()
and fdatasync(). The xlog program crashed immediately in
FreeBSD; it needed some sanity checks on its arguments.
One further note: I wrote:
> [snip]
> So writing more data to the SCSI disk speeds it up!
> I suspect the critical point in the "20*200" test is
> that the same sequence of 8 512 byte sectors are being
> written to disk 200 times. BTW That disk spins at
> 15K rpm so one rotation takes 4 ms and it has a
> 4 MB cache.
A clarification: by "same sequence" I meant written
to the same disk address. If the 4 KB lies on the same
track, then a delay of one disk revolution would be
expected before you could write the next 4 KB to the
"oxide" at the same address.
Doug Gilbert
On 2 Mar 2001, Linus Torvalds wrote:
> In article <[email protected]>,
> Jeremy Hansen <[email protected]> wrote:
> >
> >The SCSI adapter on the raid array is an Adaptec 39160, the raid
> >controller is a CMD-7040. Kernel 2.4.0 using XFS for the filesystem on
> >the raid array, kernel 2.2.18 on ext2 on the IDE drive. The filesystem is
> >not the problem, as I get almost the exact same results running this on
> >ext2 on the raid array.
>
> Did you try a 2.4.x kernel on both?
Finally got around to working on this.
Right now I'm running 2.4.2-ac11 on both machines and getting the same
results:
SCSI:
[root@orville /root]# time /root/xlog file.out fsync
real 0m21.266s
user 0m0.000s
sys 0m0.310s
IDE:
[root@kahlbi /root]# time /root/xlog file.out fsync
real 0m8.928s
user 0m0.000s
sys 0m6.700s
This behavior has been noticed by others, so I'm hoping I'm not just crazy
or that my test is somehow flawed.
We're using MySQL with Berkeley DB for transaction log support. It was
really confusing when a simple ide workstation was outperforming our
Ultra160 raid array.
Thanks
-jeremy
On Mon, 5 Mar 2001, Jeremy Hansen wrote:
> This behavior has been noticed by others, so I'm hoping I'm not just crazy
> or that my test is somehow flawed.
>
> We're using MySQL with Berkeley DB for transaction log support. It was
> really confusing when a simple ide workstation was outperforming our
> Ultra160 raid array.
Well, it's entirely possible that the mid-level SCSI layer is doing
something horribly stupid.
On the other hand, it's also entirely possible that IDE is just a lot
better than what the SCSI-bigots tend to claim. It's not all that
surprising, considering that the PC industry has pushed untold billions of
dollars into improving IDE, with SCSI as nary a consideration. The above
may just simply be the Truth, with a capital T.
(And "bonnie" is not a very good benchmark. It's not exactly mirroring any
real life access patterns. I would not be surprised if the SCSI driver
performance has been tuned by bonnie alone, and maybe it just sucks at
everything else)
Maybe we should ask whether somebody like lnz is interested in seeing what
SCSI does wrong here?
Linus
I've run the test on my own system and noted something interesting about
the results:
When the write() call extended the file (rather than just overwriting a
section of a file already long enough), the performance drop was seen, and
it was slower on SCSI than IDE - this is independent of whether IDE had
hardware write-caching on or off. Where the file already existed, from an
immediately-prior run of the same benchmark, both SCSI and IDE sped up to
the same, relatively fast speed.
These runs are for the following code, writing 2000 blocks of 4096 bytes each:
    fd = open("tst.txt", O_WRONLY | O_CREAT, 0644);
    for (k = 0; k < NUM_BLKS; ++k) {
        write(fd, buff + (k * BLK_SIZE), BLK_SIZE);
        fdatasync(fd);
    }
    close(fd);
IDE: Seagate Barracuda 7200rpm UDMA/66
first run: 1.98 elapsed
second and further runs: 0.50 elapsed
SCSI: IBM UltraStar 10000 rpm Ultra/160
first run: 23.57 elapsed
second and further runs: 0.55 elapsed
If the test file is removed between runs, all show the longer timings.
HOWEVER if I modify the benchmark to use 2000 blocks of *20* bytes each,
the timings change.
IDE: Seagate Barracuda 7200rpm UDMA/66
first run: 1.46 elapsed
second and further runs: 1.45 elapsed
SCSI: IBM UltraStar 10000 rpm Ultra/160
first run: 18.30 elapsed
second and further runs: 11.88 elapsed
Notice that the time for the second run of the SCSI drive is almost exactly
one-fifth of a minute, and remember that 2000 rotations / 10000 rpm = 1/5
minute. IOW, the SCSI drive is performing *correctly* on the second run of
the benchmark. The poorer performance on the first run *could* be
attributed to writing metadata interleaved with the data writes. The
better performance on the second run of the first benchmark can easily be
attributed to the fact that the drive does not need to wait an entire
revolution before writing the next block of a file, if that block arrives
quickly enough (this is a Duron, so it darn well arrives quickly).
It's pretty clear that the IDE drive(r) is *not* waiting for the physical
write to take place before returning control to the user program, whereas
the SCSI drive(r) is. Both devices appear to be performing the write
immediately, however (judging from the device activity lights). Whether
this is the correct behaviour or not, I leave up to you kernel hackers...
IMHO, if an application needs performance, it shouldn't be syncing disks
after every write. Syncing means, in my book, "wait for the data to be
committed to physical media" - note the *wait* involved there - so syncing
should only be used where data integrity in the event of a system failure
has a much higher importance than performance.
--------------------------------------------------------------
from: Jonathan "Chromatix" Morton
mail: [email protected] (not for attachments)
big-mail: [email protected]
uni-mail: [email protected]
The key to knowledge is not to rely on people to teach you it.
Get VNC Server for Macintosh from http://www.chromatix.uklinux.net/vnc/
-----BEGIN GEEK CODE BLOCK-----
Version 3.12
GCS$/E/S dpu(!) s:- a20 C+++ UL++ P L+++ E W+ N- o? K? w--- O-- M++$ V? PS
PE- Y+ PGP++ t- 5- X- R !tv b++ DI+++ D G e+ h+ r- y+
-----END GEEK CODE BLOCK-----
On Tue, 6 Mar 2001, Jonathan Morton wrote:
>
> It's pretty clear that the IDE drive(r) is *not* waiting for the physical
> write to take place before returning control to the user program, whereas
> the SCSI drive(r) is.
This would not be unexpected.
IDE drives generally always do write buffering. I don't even know if you
_can_ turn it off. So the drive claims to have written the data as soon as
it has made the write buffer.
It's definitely not the driver, but the actual drive.
Linus
Linus Torvalds wrote:
> Well, it's entirely possible that the mid-level SCSI layer is doing
> something horribly stupid.
Well it's in good company as FreeBSD 4.2 on the same hardware
returns the same result (including IDE timings that were too
fast). My timepeg analysis showed that the SCSI disk was consuming
the time, not any of the SCSI layers.
> On the other hand, it's also entirely possible that IDE is just a lot
> better than what the SCSI-bigots tend to claim. It's not all that
> surprising, considering that the PC industry has pushed untold billions of
> dollars into improving IDE, with SCSI as nary a consideration. The above
> may just simply be the Truth, with a capital T.
What exactly do you think fsync() and fdatasync() should
do? If they need to wait for dirty buffers to get flushed
to the disk oxide then multiple reported IDE results to
this thread are defying physics.
Doug Gilbert
On Tue, 6 Mar 2001, Douglas Gilbert wrote:
>
> > On the other hand, it's also entirely possible that IDE is just a lot
> > better than what the SCSI-bigots tend to claim. It's not all that
> > surprising, considering that the PC industry has pushed untold billions of
> > dollars into improving IDE, with SCSI as nary a consideration. The above
> > may just simply be the Truth, with a capital T.
>
> What exactly do you think fsync() and fdatasync() should
> do? If they need to wait for dirty buffers to get flushed
> to the disk oxide then multiple reported IDE results to
> this thread are defying physics.
Well, it's fairly hard for the kernel to do much about that - it's almost
certainly just IDE doing write buffering on the disk itself. No OS
involved.
The kernel VFS and controller layers certainly wait for the disk to tell
us that the data has been written, there's no question about that. But
it's also not at all unlikely that the disk itself just lies.
I don't know if there is any way to turn off a write buffer on an IDE disk.
I do remember that there were some reports of filesystem corruption with
some version of Windows that turned off the machine at shutdown (using
software power-off as supported by most modern motherboards), and shut
down so fast that the drives had not actually written out all data.
Whether the reports were true or not I do not know, but I think we can
take for granted that write buffers exist.
Now, if you really care about your data integrity with a write-buffering
disk, I suspect that you'd better have an UPS. At which point write
buffering is a valid optimization, as long as you trust the harddisk
itself not to crash even if the OS were to crash.
Of course, whether you should even trust the harddisk is another question.
Linus
>I don't know if there is any way to turn off a write buffer on an IDE disk.
hdparm has an option of this nature, but it makes no difference (as I
reported). It's worth noting that even turning off UDMA to the disk on my
machine doesn't help the situation - although it does slow things down a
little, it's not "slow enough" to indicate that the drive is behaving
properly. Might be worth running the test on some of my other machines,
with their diverse collection of IDE controllers (mostly non-UDMA) and
disks.
>Of course, whether you should even trust the harddisk is another question.
I think this result in itself would lead me *not* to trust the hard disk,
especially an IDE one. Has anybody tried running this test with a recent
IBM DeskStar - one of the ones that is the same mech as the equivalent
UltraStar but with IDE controller? I only have SCSI and laptop IBMs here -
all my desktop IDE drives are Seagate. However I do have one SCSI Seagate,
which might be worth firing up for the occasion...
On Tue, 6 Mar 2001, Jonathan Morton wrote:
> It's pretty clear that the IDE drive(r) is *not* waiting for the physical
> write to take place before returning control to the user program, whereas
> the SCSI drive(r) is. Both devices appear to be performing the write
Wrong, IDE does not unplug, thus the request is almost (I hate to admit it)
SYNC and not ASYNC :-( Thus if the drive acks that it has the data, then
the driver lets go.
> immediately, however (judging from the device activity lights). Whether
> this is the correct behaviour or not, I leave up to you kernel hackers...
Seagate has a better seek profile than IBM.
The second access is correct because the first one pushed the heads to the
pre-seek. Thus the question is: where is the drive leaving the heads when
not active? It does not appear to be in the zone 1 region.
> IMHO, if an application needs performance, it shouldn't be syncing disks
> after every write. Syncing means, in my book, "wait for the data to be
> committed to physical media" - note the *wait* involved there - so syncing
> should only be used where data integrity in the event of a system failure
> has a much higher importance than performance.
I have only gotten the drive makers in the past 6 months to commit to
actively updating the contents of the identify page to reflect reality.
Thus if your drive is one of those that does a stress test check that goes:
"this bozo did not really mean to turn off write caching, re-enabling <smirk>"
Cheers,
Andre Hedrick
Linux ATA Development
On Mon, 5 Mar 2001, Linus Torvalds wrote:
> Well, it's fairly hard for the kernel to do much about that - it's almost
> certainly just IDE doing write buffering on the disk itself. No OS
> involved.
I am pushing for WC to be defaulted in the off state, but as you know I
have a bigger fight than caching on my hands...
> I don't know if there is any way to turn off a write buffer on an IDE disk.
You want a forced set of commands to kill caching at init?
Andre Hedrick
Linux ATA Development
>> It's pretty clear that the IDE drive(r) is *not* waiting for the physical
>> write to take place before returning control to the user program, whereas
>> the SCSI drive(r) is. Both devices appear to be performing the write
>
>Wrong, IDE does not unplug, thus the request is almost (I hate to admit it)
>SYNC and not ASYNC :-( Thus if the drive acks that it has the data, then
>the driver lets go.
Uh, run that past me again? You are saying that because the IDE drive hogs
the bus until the write is complete or the driver forcibly disconnects, you
make the driver disconnect to save time? Or (more likely) have I totally
misread you...
>> immediately, however (judging from the device activity lights). Whether
>> this is the correct behaviour or not, I leave up to you kernel hackers...
>
>Seagate has a better seek profile than IBM.
>The second access is correct because the first one pushed the heads to the
>pre-seek. Thus the question is: where is the drive leaving the heads when
>not active? It does not appear to be in the zone 1 region.
Duh... I don't quite see what you're saying here, either. The test is a
continuous rewrite of the same sector of the disk, so the head shouldn't be
moving *at all* until it's all over. In addition, the drive can't start
writing the sector when it's just finished writing it, so it has to wait
for the rotation to bring it back round again. Under those circumstances,
I would expect my 7200rpm Seagate to perform slower than my 10000rpm IBM
*regardless* of seeking performance. Seeking doesn't come into it!
>> IMHO, if an application needs performance, it shouldn't be syncing disks
>> after every write. Syncing means, in my book, "wait for the data to be
>> committed to physical media" - note the *wait* involved there - so syncing
>> should only be used where data integrity in the event of a system failure
>> has a much higher importance than performance.
>
>I have only gotten the drive makers in the past 6 months to commit to
>actively updating the contents of the identify page to reflect reality.
>Thus if your drive is one of those that does a stress test check that goes:
>"this bozo did not really mean to turn off write caching, re-enabling <smirk>"
Why does this sound familiar?
Personally, I feel the bottom line is rapidly turning into "if you have
critical data, don't put it on an IDE disk". There are too many corners
cut when compared to ostensibly similar SCSI devices. Call me a SCSI bigot
if you like - I realise SCSI is more expensive, but you get what you pay
for.
Of course, under normal circumstances, you leave write-caching and UDMA on,
and you don't use a pathological stress-test like we've been doing. That
gives the best performance. But sometimes it's necessary to use these
"pathological" access patterns to achieve certain system functions.
Suppose, harking back to the Windows data-corruption scenario mentioned
earlier, that just before powering off you stuffed several MB of data,
scattered across the disk, into said disk and waited for said disk to say
"yup, i've got that", then powered down. Recent drives have very large
(2MB?) on-board caches, so how long does it take for a pathological pattern
of these to be committed to physical media? Can the drive sustain it's own
power long enough to do this (highly unlikely)? So the drive *must* be
able to tell the OS when it's actually committed the data to media, or risk
*serious* data corruption.
Pathological shutdown pattern: assuming scatter-gather is not allowed (for
IDE), and a 20ms full-stroke seek, write sectors at alternately opposite
ends of the disk, working inwards until the buffer is full. 512-byte
sectors, 2MB of them, is 4000 writes * 20ms = around 80 seconds (not
including rotational delay, either). Last time I checked, you'd need a
capacitor array the size of the entire computer case to store enough power
to allow the drive to do this after system shutdown, and I don't remember
seeing LiIon batteries strapped to the bottom of my HDs. Admittedly, any
sane OS doesn't actually use that kind of write pattern on shutdown, but
the drive can't assume that.
> > I don't know if there is any way to turn off a write buffer on an IDE disk.
> You want a forced set of commands to kill caching at init?
Wrong model.
You want a write barrier. Write buffering (at least for short intervals) in
the drive is very sensible. The kernel needs to be able to send drivers a write
barrier which will not be completed while commands issued before the
barrier are still outstanding.
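(A toy user-space model of that rule, just to pin the semantics down --
the queue and "drive" here are simulated, this is not kernel code: writes
before the barrier may complete in any order among themselves, but the
barrier completes only after all of them, and nothing submitted after it
may pass it.)

#include <stdio.h>

struct req { int id; int barrier; };

static int outstanding;     /* writes the simulated drive is caching */

static void drive_complete_all(void)
{
    while (outstanding > 0) {
        printf("  completed one of %d outstanding writes\n", outstanding);
        --outstanding;
    }
}

static void submit(struct req *r)
{
    if (r->barrier) {
        /* drain everything queued earlier before issuing the barrier */
        drive_complete_all();
        printf("req %d: BARRIER issued and completed\n", r->id);
    } else {
        ++outstanding;      /* drive may cache/reorder these freely */
        printf("req %d: write queued (%d outstanding)\n", r->id, outstanding);
    }
}

int main(void)
{
    struct req reqs[] = { {1, 0}, {2, 0}, {3, 1}, {4, 0} };
    int i;

    for (i = 0; i < 4; ++i)
        submit(&reqs[i]);
    drive_complete_all();
    return 0;
}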
On Tue, 6 Mar 2001, Jonathan Morton wrote:
> Pathological shutdown pattern: assuming scatter-gather is not allowed
> (for IDE), and a 20ms full-stroke seek, write sectors at alternately
> opposite ends of the disk, working inwards until the buffer is full.
> 512-byte sectors, 2MB of them, is 4000 writes * 20ms = around 80
> seconds
i don't understand why the disk couldn't elevator in this case and be done
in 20ms + rotational.
> >Of course, whether you should even trust the harddisk is another question.
>
> I think this result in itself would lead me *not* to trust the hard disk,
> especially an IDE one. Has anybody tried running this test with a recent
> IBM DeskStar - one of the ones that is the same mech as the equivalent
> UltraStar but with IDE controller?
i assume you meant to time the xlog.c program? (or did i miss another
program on the thread?)
i've an IBM-DJSA-210 (travelstar 10GB, 5411rpm) which appears to do
*something* with the write cache flag -- it gets 0.10s elapsed real time
in default config; and gets 2.91s if i do "hdparm -W 0".
ditto for an IBM-DTLA-307015 (deskstar 15GB 7200rpm) -- varies from .15s
with write-cache to 1.8s without.
and an IBM-DTLA-307075 (deskstar 75GB 7200rpm) varies from .03s to 1.67s.
of course 1.8s is nowhere near enough time for 200 writes to complete.
so who knows what that flag is doing.
-dean
On Tue, 6 Mar 2001, dean gaudet wrote:
> i assume you meant to time the xlog.c program? (or did i miss another
> program on the thread?)
>
> i've an IBM-DJSA-210 (travelstar 10GB, 5411rpm) which appears to do
> *something* with the write cache flag -- it gets 0.10s elapsed real time
> in default config; and gets 2.91s if i do "hdparm -W 0".
>
> ditto for an IBM-DTLA-307015 (deskstar 15GB 7200rpm) -- varies from .15s
> with write-cache to 1.8s without.
>
> and an IBM-DTLA-307075 (deskstar 75GB 7200rpm) varies from .03s to 1.67s.
>
> of course 1.8s is nowhere near enough time for 200 writes to complete.
hi, not enough sleep, can't do math. 1.67s is exactly the ballpark you'd
expect for 200 writes to a correctly functioning 7200rpm disk. and the
travelstar appears to be doing the right thing as well.
-dean
On Tue, 6 Mar 2001, Jonathan Morton wrote:
> Pathological shutdown pattern: assuming scatter-gather is not allowed (for
> IDE), and a 20ms full-stroke seek, write sectors at alternately opposite
> ends of the disk, working inwards until the buffer is full. 512-byte
> sectors, 2MB of them, is 4000 writes * 20ms = around 80 seconds (not
> including rotational delay, either). Last time I checked, you'd need a
> capacitor array the size of the entire computer case to store enough power
> to allow the drive to do this after system shutdown, and I don't remember
> seeing LiIon batteries strapped to the bottom of my HDs. Admittedly, any
> sane OS doesn't actually use that kind of write pattern on shutdown, but
> the drive can't assume that.
But since the drive has everything in cache, it can just write
out both bunches of sectors in an order which minimises disk
seek time ...
(yes, the drives don't guarantee write ordering either, but that
shouldn't come as a big surprise when they don't guarantee that
data makes it to disk ;))
regards,
Rik
--
Virtual memory is like a game you can't win;
However, without VM there's truly nothing to lose...
http://www.surriel.com/
http://www.conectiva.com/ http://distro.conectiva.com.br/
Write caching is the culprit for the performance diff:
On IDE:
time xlog /blah.dat fsync
0.000u 0.190s 0:01.72 11.0% 0+0k 0+0io 91pf+0w
# hdparm -W 0 /dev/hda
/dev/hda:
setting drive write-caching to 0 (off)
# time xlog /blah.dat fsync
0.000u 0.220s 0:50.60 0.4% 0+0k 0+0io 91pf+0w
# hdparm -W 1 /dev/hda
/dev/hda:
setting drive write-caching to 1 (on)
# time xlog /blah.dat fsync
0.010u 0.230s 0:01.88 12.7% 0+0k 0+0io 91pf+0w
On my SCSI setup:
# time xlog /usr5/blah.dat fsync
0.020u 0.230s 0:30.48 0.8% 0+0k 0+0io 91pf+0w
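(For reference, "hdparm -W" is just an ATA SETFEATURES command; a sketch
of roughly what it does through the 2.4 HDIO_DRIVE_CMD ioctl. The device
name is an example; feature 0x82 disables the write cache, 0x02
re-enables it.)

#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/hdreg.h>

int main(void)
{
    /* args[0]=command, args[1]=sector, args[2]=feature, args[3]=count */
    unsigned char args[4] = { WIN_SETFEATURES, 0, 0x82, 0 };
    int fd = open("/dev/hda", O_RDONLY);    /* example device */

    if (fd < 0) { perror("open"); return 1; }
    if (ioctl(fd, HDIO_DRIVE_CMD, args) < 0)
        perror("HDIO_DRIVE_CMD");
    close(fd);
    return 0;
}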
________________________________________
Michael D. Black Principal Engineer
[email protected] 321-676-2923,x203
http://www.csihq.com Computer Science Innovations
http://www.csihq.com/~mike My home page
FAX 321-676-2355
>> i assume you meant to time the xlog.c program? (or did i miss another
>> program on the thread?)
Yes.
>> i've an IBM-DJSA-210 (travelstar 10GB, 5411rpm) which appears to do
>> *something* with the write cache flag -- it gets 0.10s elapsed real time
>> in default config; and gets 2.91s if i do "hdparm -W 0".
>>
>> ditto for an IBM-DTLA-307015 (deskstar 15GB 7200rpm) -- varies from .15s
>> with write-cache to 1.8s without.
>>
>> and an IBM-DTLA-307075 (deskstar 75GB 7200rpm) varies from .03s to 1.67s.
>>
>> of course 1.8s is nowhere near enough time for 200 writes to complete.
>
>hi, not enough sleep, can't do math. 1.67s is exactly the ballpark you'd
>expect for 200 writes to a correctly functioning 7200rpm disk. and the
>travelstar appears to be doing the right thing as well.
I was just about to point that out. :) I ran the program with 2000
packets in order to magnify the difference.
So, it appears that the IBM IDE drives are doing the "right thing" when
write-caching is switched off, but the Seagate drive (at least the one I'm
using) appears not to turn the write-caching off at all. I want to try
this out with some other drives, including a Seagate SCSI drive and a
different Seagate IDE drive (attached to a non-UDMA controller), and
perhaps a couple of older drives which I just happen to have lying around
(particularly a Maxtor and an old TravelStar with very little cache).
That'll have to wait until later, though - university work beckons. :(
>> Pathological shutdown pattern: assuming scatter-gather is not allowed (for
>> IDE), and a 20ms full-stroke seek, write sectors at alternately opposite
>> ends of the disk, working inwards until the buffer is full. 512-byte
>> sectors, 2MB of them, is 4000 writes * 20ms = around 80 seconds (not
>> including rotational delay, either). Last time I checked, you'd need a
>> capacitor array the size of the entire computer case to store enough power
>> to allow the drive to do this after system shutdown, and I don't remember
>> seeing LiIon batteries strapped to the bottom of my HDs. Admittedly, any
>> sane OS doesn't actually use that kind of write pattern on shutdown, but
>> the drive can't assume that.
>
>But since the drive has everything in cache, it can just write
>out both bunches of sectors in an order which minimises disk
>seek time ...
>
>(yes, the drives don't guarantee write ordering either, but that
>shouldn't come as a big surprise when they don't guarantee that
>data makes it to disk ;))
That would be true for SCSI devices - I understand the controllers and/or
drives support "scatter-gather", which allows a drive to optimise its seek
pattern in the manner you describe. However, I'm not sure whether an IDE
drive is allowed to do this. I'm reasonably sure that I heard somewhere
that IDE drives have to complete transactions in the specified order as far
as the host is concerned - what I'm unsure of is whether this also applies
to mechanical head movement.
If not, then the drive could by all means optimise the access pattern
provided it acked the data or provided the results in the same order as the
instructions were given. This would probably shorten the time for a new
pathological set (distributed evenly across the disk surface, but all on
the worst-possible angular offset compared to the previous) to (8ms seek
time + 5ms rotational delay) * 4000 writes ~= 52 seconds (compared with
around 120 seconds for the previous set with rotational delay factored in).
Great, so you only need half as big a power store to guarantee writing that
much data, but it's still too much. Even with a 15000rpm drive and 5ms
seek times, it would still be too much.
The OS needs to know the physical act of writing data has finished before
it tells the m/board to cut the power - period. Pathological data sets
included - they are the worst case which every engineer must take into
account. Out of interest, does Linux guarantee this, in the light of what
we've uncovered? If so, perhaps it could use the same technique to fix
fdatasync() and family...
Ahh, now we're getting somewhere.
IDE:
jeremy:~# time ./xlog file.out fsync
real 0m33.739s
user 0m0.010s
sys 0m0.120s
so now this corresponds to the performance we're seeing on SCSI.
So I guess what I'm wondering now is can or should anything be done about
this on the SCSI side?
Thanks
-jeremy
On Tue, 6 Mar 2001, Mike Black wrote:
> Write caching is the culprit for the performance diff:
>
> On IDE:
> time xlog /blah.dat fsync
> 0.000u 0.190s 0:01.72 11.0% 0+0k 0+0io 91pf+0w
> # hdparm -W 0 /dev/hda
>
> /dev/hda:
> setting drive write-caching to 0 (off)
> # time xlog /blah.dat fsync
> 0.000u 0.220s 0:50.60 0.4% 0+0k 0+0io 91pf+0w
> # hdparm -W 1 /dev/hda
>
> /dev/hda:
> setting drive write-caching to 1 (on)
> # time xlog /blah.dat fsync
> 0.010u 0.230s 0:01.88 12.7% 0+0k 0+0io 91pf+0w
>
> On my SCSI setup:
> # time xlog /usr5/blah.dat fsync
> 0.020u 0.230s 0:30.48 0.8% 0+0k 0+0io 91pf+0w
>
>
> ________________________________________
> Michael D. Black Principal Engineer
> [email protected] 321-676-2923,x203
> http://www.csihq.com Computer Science Innovations
> http://www.csihq.com/~mike My home page
> FAX 321-676-2355
> ----- Original Message -----
> From: "Andre Hedrick" <[email protected]>
> To: "Linus Torvalds" <[email protected]>
> Cc: "Douglas Gilbert" <[email protected]>; <[email protected]>
> Sent: Tuesday, March 06, 2001 2:12 AM
> Subject: Re: scsi vs ide performance on fsync's
>
>
> On Mon, 5 Mar 2001, Linus Torvalds wrote:
>
> > Well, it's fairly hard for the kernel to do much about that - it's almost
> > certainly just IDE doing write buffering on the disk itself. No OS
> > involved.
>
> I am pushing for WC to be defaulted in the off state, but as you know I
> have a bigger fight than caching on my hands...
>
> I don't know if there is any way to turn off a write buffer on an IDE disk.
>
> You want a forced set of commands to kill caching at init?
>
> Andre Hedrick
> Linux ATA Development
> ASL Kernel Development
> ----------------------------------------------------------------------------
> -
> ASL, Inc. Toll free: 1-877-ASL-3535
> 1757 Houret Court Fax: 1-408-941-2071
> Milpitas, CA 95035 Web: http://www.aslab.com
>
--
this is my sig.
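The xlog test being timed throughout this thread is, per the discussion
that follows, a continuous rewrite-and-fsync of the same disk sector.
The original attachment is not reproduced here; below is a minimal
sketch of a loop of that shape, in which the iteration count and the
512-byte write size are assumptions for illustration only, not the
original program's values.

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    char buf[512];
    int fd, i;

    if (argc < 2) {
        fprintf(stderr, "usage: %s <file>\n", argv[0]);
        return 1;
    }
    fd = open(argv[1], O_WRONLY | O_CREAT, 0644);
    if (fd < 0) {
        perror("open");
        return 1;
    }
    memset(buf, 'x', sizeof(buf));

    /* Rewrite the same sector and force it out each time.  With drive
     * write caching on, fsync() returns once the drive has buffered
     * the data; with it off, each pass waits out a platter revolution. */
    for (i = 0; i < 1000; i++) {
        if (lseek(fd, 0, SEEK_SET) < 0 ||
            write(fd, buf, sizeof(buf)) != sizeof(buf) ||
            fsync(fd) < 0) {
            perror("xlog-sketch");
            return 1;
        }
    }
    close(fd);
    return 0;
}

Timed with "time ./xlog-sketch file.out", it should show the same
write-cache effect as the numbers quoted above.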
>On Tue, 6 Mar 2001, Mike Black wrote:
>
>> Write caching is the culprit for the performance diff:
Indeed, and my during-the-boring-lecture benchmark on my 18Gb IBM
TravelStar bears this out. I was confused earlier by the fact that one of
my Seagate drives blatantly ignores the no-write-caching request I sent it.
:P
At 4:02 pm +0000 6/3/2001, Jeremy Hansen wrote:
>Ahh, now we're getting somewhere.
>so now this corresponds to the performance we're seeing on SCSI.
>
>So I guess what I'm wondering now is can or should anything be done about
>this on the SCSI side?
Maybe; it depends on your perspective. In my personal opinion, the IDE
behaviour is incorrect and some way of dealing with it (while still
retaining the benefits of write-caching for normal applications) would be
highly desirable. However, some applications may like or partially rely on
that behaviour, to gain better on-disk data consistency while not suffering
too much in performance (eg. the transaction database mentioned by at least
one poster).
The way to make all parties happy is to fix the IDE driver (or drives!) and
make sure an *alternative* syscall is available which flushes the buffers
asynchronously, as per the current IDE behaviour. It shouldn't be too hard
to make the SCSI driver use that behaviour in the alternative syscall
(which may already exist, I don't know Linux well enough to say).
May this be a warning to all hardware manufacturers who "tweak" their
hardware to gain better benchmark results without actually increasing
performance - you *will* be found out!
(( please CC me, not subscribed, [email protected] ))
Jonathan Morton ([email protected]) wrote :
> The OS needs to know the physical act of writing data has finished before
> it tells the m/board to cut the power - period. Pathological data sets
> included - they are the worst case which every engineer must take into
> account. Out of interest, does Linux guarantee this, in the light of what
> we've uncovered? If so, perhaps it could use the same technique to fix
> fdatasync() and family...
Linux currently ignores write-cache, AFAICT.
Recently I asked a similar question , about flushing drive caches at shutdown :
Subject : "Flusing caches on shutdown"
message archived at :
http://boudicca.tux.org/hypermail/linux-kernel/2001week08/0157.html
Body attached at end of this message.
The answer ( and only reply ) was :
[ archived at : http://boudicca.tux.org/hypermail/linux-kernel/2001week08/0211.html ]
--- begin quote ---
From: Ingo Oeser ([email protected])
On Mon, Feb 19, 2001 at 01:45:57PM +0100, David Balazic wrote:
> It is a good idea IMO to flush the write cache of storage devices
> at shutdown and other critical moments.
Not needed. Device drivers should disable the write caches of any
device that needs a signal other than the power switch being turned
off in order to flush itself.
> Losing data at powerdown due to write caches has been reported,
> so this is not a theoretical problem. Also, journaled filesystems
> are safe only in theory unless the journal actually reaches
> non-volatile storage, which is not guaranteed in the current kernel.
Fine. If users/admins have write caching enabled, they either
know what they are doing, or should disable it (which is the default for
all mass storage drivers AFAIK).
Hardware-level caching is only good for OSes which have broken
drivers and broken caching (like plain old DOS).
Linux does a good job of caching and cache control at the software
level.
Regards
Ingo Oeser
--- end quote ---
My original mail :
--- begin quote ---
(( CC me the replies, as I'm not subscribed to LKML ))
Hi!
It is a good idea IMO to flush the write cache of storage devices
at shutdown and other critical moments.
I browsed through linux-2.4.1 and see no use of the SYNCHRONIZE CACHE
SCSI command ( curiously it is defined in several other files
besides include/scsi/scsi.h , grep returns :
drivers/scsi/pci2000.h:#define SCSIOP_SYNCHRONIZE_CACHE 0x35
drivers/scsi/psi_dale.h:#define SCSIOP_SYNCHRONIZE_CACHE 0x35
drivers/scsi/psi240i.h:#define SCSIOP_SYNCHRONIZE_CACHE 0x35
)
I couldn't find evidence of the use of the equivalent ATA command either
(FLUSH CACHE, command code E7h).
Also add ATAPI to the list (and all other interfaces; I checked just SCSI
and ATA).
Losing data at powerdown due to write caches has been reported,
so this is not a theoretical problem. Also, journaled filesystems are
safe only in theory unless the journal actually reaches non-volatile
storage, which is not guaranteed in the current kernel.
What is the official word on this issue?
I think this is important to the "enterprise" guys, at the least.
Sincerely,
david
PS: CC me , as I'm not subscribed to LKML
--- end quote ---
--
David Balazic
--------------
"Be excellent to each other." - Bill & Ted
- - - - - - - - - - - - - - - - - - - - - -
On Tue, Mar 06, 2001 at 06:14:15PM +0100, David Balazic wrote:
[snip]
> Hardware-level caching is only good for OSes which have broken
> drivers and broken caching (like plain old DOS).
>
> Linux does a good job of caching and cache control at the software
> level.
Read caching, yes. But for writes, the drive can often do a lot more
optimization because of its synchronous operation with the platter and
greater knowledge of internal disk geometry.
What would be useful, as Alan said, is a barrier operation.
>Jonathan Morton ([email protected]) wrote :
>
>> The OS needs to know the physical act of writing data has finished
>>before
>> it tells the m/board to cut the power - period. Pathological data sets
>> included - they are the worst case which every engineer must take into
>> account. Out of interest, does Linux guarantee this, in the light of what
>> we've uncovered? If so, perhaps it could use the same technique to fix
>> fdatasync() and family...
>
>Linux currently ignores write-cache, AFAICT.
>Recently I asked a similar question , about flushing drive caches at
>shutdown :
>On Mon, Feb 19, 2001 at 01:45:57PM +0100, David Balazic wrote:
>> It is a good idea IMO to flush the write cache of storage devices
>> at shutdown and other critical moments.
>
>Not needed. Device drivers should disable the write caches of any
>device that needs a signal other than the power switch being turned
>off in order to flush itself.
Sounds like a sensible place to implement it - in the device driver. I
also note the existence of an ATA flush-buffer command, which should
probably be used in sync() and family. The call(s) to the sync() family on
shutdown should probably be performed by the filesystem itself on unmount
(or remount read-only), and if journalled filesystems need synchronisation
they should use sync() (or a more fine-grained version) themselves as
necessary.
Doesn't sound like too much of a headache to implement, to me - unless some
drives ignore the ATA FLUSH command, in which case said drives can be
considered seriously broken. :P I don't agree that write-caching in
itself is a bad thing, particularly given the amount of CPU overhead that
IDE drives demand while attached to the controller (orders of magnitude
higher than a good SCSI controller) - the more overhead we can hand off to
dedicated hardware, the better. What does matter is that drives
implementing write-caching are handled in a safe and efficient manner,
especially in cases where they refuse to turn such caching off (eg. my
Seagate Barracuda *glares at drive*).
Recalling my recent comments on worst-case drive-shutdown timings, I also
remember seeing drives with 18ms *average* seek times quite recently - this
was a Quantum Bigfoot (yes, a 5.25" HD), found in a low-end Compaq desktop
- if anyone still believes Compaq makes high-quality machines for their
low-end market, they're totally mistaken. The machine sped up quite a lot
when a new 3.5" IBM DeskStar was installed, with an 8.5ms average seek and
an almost doubling in rotational speed. :)
On Tue, 6 Mar 2001, Alan Cox wrote:
>
> > > I don't know if there is any way to turn off a write buffer on an IDE disk.
> > You want a forced set of commands to kill caching at init?
>
> Wrong model
>
> You want a write barrier. Write buffering (at least for short intervals) in
> the drive is very sensible. The kernel needs to be able to send drivers a write
> barrier which will not be completed with outstanding commands before the
> barrier.
Agreed.
Write buffering is incredibly useful on a disk - for all the same reasons
that an OS wants to do it. The disk can use write buffering to speed up
writes a lot - not just lower the _perceived_ latency by the OS, but to
actually improve performance too.
But Alan is right - we need a "sync" command or something. I don't know
if IDE has one (it already might, for all I know).
Linus
Jonathan,
I am not going to bite on your flame bait, and you are free to waste your
money.
On Tue, 6 Mar 2001, Jonathan Morton wrote:
> >> It's pretty clear that the IDE drive(r) is *not* waiting for the physical
> >> write to take place before returning control to the user program, whereas
> >> the SCSI drive(r) is. Both devices appear to be performing the write
> >
> >Wrong; IDE does not unplug, thus the request is almost, I hate to admit
> >it, SYNC and not ASYNC :-( Thus if the drive acks that it has the data,
> >then the driver lets go.
>
> Uh, run that past me again? You are saying that because the IDE drive hogs
> the bus until the write is complete or the driver forcibly disconnects, you
> make the driver disconnect to save time? Or (more likely) have I totally
> misread you...
No, SCSI does, with queuing.
I am saying that the ata/ide driver holds the io_request_lock way too
darn long. This means that while a request is executing, virtually all
interrupts are whacked and the driver is dominating the system. Given
that IOs are limited to 128 sectors or one DMA PRD, this is vastly
smaller than the SCSI transfer limit.
Since you are not using the "Write Verify Read" test, all drives are going
to lie. Only that command will force the data to hit the platters and
return a read out of the dirty cache.
> >pre-seek. Thus the question is where is the drive leaving the heads when
> >not active? It does not appear to be in the zone 1 region.
>
> Duh... I don't quite see what you're saying here, either. The test is a
Okay, real short: limit it to two zones that are equal in size,
the inner and the outer; the latter will cover more physical media than
the former. A simple two-zone model.
> continuous rewrite of the same sector of the disk, so the head shouldn't be
> moving *at all* until it's all over. In addition, the drive can't start
True, and you slip a rev every time.
> writing the sector when it's just finished writing it, so it has to wait
> for the rotation to bring it back round again. Under those circumstances,
> I would expect my 7200rpm Seagate to perform slower than my 10000rpm IBM
> *regardless* of seeking performance. Seeking doesn't come into it!
It does, because more RPM means more air-flow and more work to keep the
position stable.
> >Thus if your drive is one of those that does a stress test check that goes:
> >"this bozo did not really mean to turn off write caching, renabling <smurk>"
>
> Why does this sound familiar?
Because of WinBench!
All the prefetch/caching is modeled to be optimal for that benchmark.
> Personally, I feel the bottom line is rapidly turning into "if you have
> critical data, don't put it on an IDE disk". There are too many corners
> cut when compared to ostensibly similar SCSI devices. Call me a SCSI bigot
> if you like - I realise SCSI is more expensive, but you get what you pay
> for.
Let me slap you in the face with a salami stick!
ATA 7200 RPM drives are using SCSI 7200 RPM drive HDAs.
So you say ATA is lame? Then so were your SCSI 7200s.
> Of course, under normal circumstances, you leave write-caching and UDMA on,
> and you don't use a pathological stress-test like we've been doing. That
> gives the best performance. But sometimes it's necessary to use these
> "pathological" access patterns to achieve certain system functions.
> Suppose, harking back to the Windows data-corruption scenario mentioned
> earlier, that just before powering off you stuffed several MB of data,
> scattered across the disk, into said disk and waited for said disk to say
> "yup, i've got that", then powered down. Recent drives have very large
> (2MB?) on-board caches, so how long does it take for a pathological pattern
> of these to be committed to physical media? Can the drive sustain it's own
> power long enough to do this (highly unlikely)? So the drive *must* be
> able to tell the OS when it's actually committed the data to media, or risk
> *serious* data corruption.
Oh... you are talking about the one IBM drive that is goat-screwed...
the one that is too stupid to use the energy of the platters to drop the
data in the vendor power-down strip... yet it dumps the buffer in a panic.
Erm, that is a bad drive, regardless of whether they publish an errata
stating that only good hosts which issue a flush-cache prior to power-off
are to be certified... and maybe, if they had not defaulted the WC to on,
the design error would have been a NOP, since all OSes that enable WC at
init will flush it at shutdown and do a periodic purge during inactivity.
> Pathological shutdown pattern: assuming scatter-gather is not allowed (for
> IDE), and a 20ms full-stroke seek, write sectors at alternately opposite
> ends of the disk, working inwards until the buffer is full. 512-byte
> sectors, 2MB of them, is 4000 writes * 20ms = around 80 seconds (not
> including rotational delay, either). Last time I checked, you'd need a
> capacitor array the size of the entire computer case to store enough power
> to allow the drive to do this after system shutdown, and I don't remember
> seeing LiIon batteries strapped to the bottom of my HDs. Admittedly, any
> sane OS doesn't actually use that kind of write pattern on shutdown, but
> the drive can't assume that.
Err, last time I checked, all good devices flush their write caches on
their own, to keep the maximum cache available for prefetching.
Andre Hedrick
Linux ATA Development
ASL Kernel Development
-----------------------------------------------------------------------------
ASL, Inc. Toll free: 1-877-ASL-3535
1757 Houret Court Fax: 1-408-941-2071
Milpitas, CA 95035 Web: http://www.aslab.com
Linus Torvalds himself wrote :
> On Tue, 6 Mar 2001, Alan Cox wrote:
> >
> > > > I don't know if there is any way to turn off a write buffer on an IDE disk.
> > > You want a forced set of commands to kill caching at init?
> >
> > Wrong model
> >
> > You want a write barrier. Write buffering (at least for short intervals) in
> > the drive is very sensible. The kernel needs to be able to send drivers a write
> > barrier which will not be completed with outstanding commands before the
> > barrier.
>
> Agreed.
>
> Write buffering is incredibly useful on a disk - for all the same reasons
> that an OS wants to do it. The disk can use write buffering to speed up
> writes a lot - not just lower the _perceived_ latency by the OS, but to
> actually improve performance too.
>
> But Alan is right - we need a "sync" command or something. I don't know
> if IDE has one (it already might, for all I know).
ATA , SCSI and ATAPI all have a FLUSH_CACHE command. (*)
Whether the drives implement it is another question ...
(*) references :
ATA-6 draft standard from http://www.t13.org
MtFuji document from ????????
--
David Balazic
--------------
"Be excellent to each other." - Bill & Ted
- - - - - - - - - - - - - - - - - - - - - -
On Tue, Mar 06 2001, David Balazic wrote:
> > > Wrong model
> > >
> > > You want a write barrier. Write buffering (at least for short intervals)
> > > in the drive is very sensible. The kernel needs to be able to send
> > > drivers a write barrier which will not be completed with outstanding
> > > commands before the
> > > barrier.
> >
> > Agreed.
> >
> > Write buffering is incredibly useful on a disk - for all the same reasons
> > that an OS wants to do it. The disk can use write buffering to speed up
> > writes a lot - not just lower the _perceived_ latency by the OS, but to
> > actually improve performance too.
> >
> > But Alan is right - we need a "sync" command or something. I don't know
> > if IDE has one (it already might, for all I know).
>
> ATA , SCSI and ATAPI all have a FLUSH_CACHE command. (*)
> Whether the drives implement it is another question ...
(Usually called SYNCHRONIZE_CACHE btw)
SCSI has ordered tags, which fit the model Alan described quite nicely.
I've been meaning to implement this for some time, it would be handy
for journalled fs to use such a barrier. Since ATA doesn't do queueing
(at least not in current Linux), a synchronize cache is probably the
only way to go there.
> (*) references :
> ATA-6 draft standard from http://www.t13.org
> MtFuji document from ????????
ftp.avc-pioneer.com
--
Jens Axboe
> itself is a bad thing, particularly given the amount of CPU overhead that
> IDE drives demand while attached to the controller (orders of magnitude
> higher than a good SCSI controller) - the more overhead we can hand off to
I know this is just a troll by a scsi-believer, but I'm biting anyway.
on current machines and disks, ide costs a few % CPU, depending on
which CPU, disk, kernel, the sustained bandwidth, etc. I've measured
this using the now-trendy method of noticing how much the IO costs
a separate, CPU-bound benchmark: load = 1 - (loadedPerf / unloadedPerf).
my cheesy duron/600 desktop typically shows ~2% actual cost when running
bonnie's block IO tests.
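A toy version of that measurement, taking "Perf" to be the throughput
reported by a fixed CPU-bound workload (the workload below is an
arbitrary assumption): run it once on an idle machine and once while
the disk benchmark runs, then plug the two numbers into the formula
above.

#include <stdio.h>
#include <sys/time.h>

/* volatile so the compiler cannot optimise the work loop away */
static volatile unsigned long sink;

int main(void)
{
    struct timeval t0, t1;
    unsigned long i, n = 200000000UL;
    double secs;

    gettimeofday(&t0, NULL);
    for (i = 0; i < n; i++)
        sink += i ^ (i >> 3);    /* trivial CPU-bound work */
    gettimeofday(&t1, NULL);

    secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6;
    printf("%.0f iterations/sec\n", n / secs);
    return 0;
}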
>I am not going to bite on your flame bait, and you are free to waste your money.
I don't flamebait. I was trying to clear up some confusion...
>No, SCSI does, with queuing.
>I am saying that the ata/ide driver holds the io_request_lock way too
>darn long. This means that while a request is executing, virtually all
>interrupts are whacked and the driver is dominating the system. Given
>that IOs are limited to 128 sectors or one DMA PRD, this is vastly
>smaller than the SCSI transfer limit.
Ah, so the ATA driver hogs interrupts. Nice. Kinda explains why I can't
use the mouse on some systems when I use cdparanoia.
>Okay, real short: limit it to two zones that are equal in size,
>the inner and the outer; the latter will cover more physical media than
>the former. A simple two-zone model.
Still doesn't make a difference - there is one revolution between writes,
no matter where on disk it is.
>> Under those circumstances,
>> I would expect my 7200rpm Seagate to perform slower than my 10000rpm IBM
>> *regardless* of seeking performance. Seeking doesn't come into it!
>
>It does, because more RPM means more air-flow and more work to keep the
>position stable.
That's the engineers' problem, not ours. In fact, it's not really a
problem because my IBM drive gave almost exactly the correct performance
result, even at 10000rpm, therefore it's managing to keep the position
stable regardless of airflow.
>> Why does this sound familiar?
>
>Because of WinBench!
>All the prefetch/caching is modeled to be optimal for that benchmark.
Lies, damn lies, statistics, benchmarks, delivery dates. Especially a
consumer-oriented benchmark like WinBench. It's perfectly natural to
optimise for particular access patterns, but IMHO that doesn't excuse
breaking the drive just to get a better benchmark score.
>> Personally, I feel the bottom line is rapidly turning into "if you have
>> critical data, don't put it on an IDE disk". There are too many corners
>> cut when compared to ostensibly similar SCSI devices. Call me a SCSI bigot
>> if you like - I realise SCSI is more expensive, but you get what you pay
>> for.
>
>Let me slap you in the face with a salami stick!
>ATA 7200 RPM drives are using SCSI 7200 RPM drive HDAs.
>So you say ATA is lame? Then so were your SCSI 7200s.
That isn't the point! I'm not talking about the physical mechanism, which
indeed is often the same between one generation of SCSI and the next
generation of IDE devices. I'm talking about the IDE controller which is
slapped on the bottom of said mechanism. The mech can be of world-class
quality, but if the controller is shot it doesn't cut the grain.
>Since all OSes that enable WC at init will flush
> >it at shutdown and do a periodic purge during inactivity.
But Linux doesn't, as has been pointed out earlier. We need to fix Linux.
Also, as I and someone else have also pointed out, there are drives in
circulation which refuse to turn off write caching, including one sitting
in my main workstation - the one which is rebooted the most often, simply
because I need to use Windoze 95 for a few onerous tasks. I haven't
suffered disk corruption yet, because Linux unmounts the filesystems and
flushes its own buffers several seconds before powering down, and uses a
non-pathological access pattern, but I sure don't want to see the first
time this doesn't work properly.
>Err, last time I checked, all good devices flush their write caches on
>their own, to keep the maximum cache available for prefetching.
Which doesn't work if the buffer is filled up by the OS 0.5 seconds before
the power goes.
I'm sorry if this looks like another troll, but I really do like to clear
up confusion. I do accept that IDE now has good enough real performance
for many purposes, but in terms of enforced quality it clearly lags behind
the entire SCSI field.
On Wed, 7 Mar 2001, Jonathan Morton wrote:
> Still doesn't make a difference - there is one revolution between writes,
> no matter where on disk it is.
Oh it does, because you are hitting the same sector with the same data.
Rotate your buffer and then you will see the difference.
> >Because of WinBench!
> >All the prefetch/caching is modeled to be optimal for that benchmark.
>
> Lies, damn lies, statistics, benchmarks, delivery dates. Especially a
> consumer-oriented benchmark like WinBench. It's perfectly natural to
> optimise for particular access patterns, but IMHO that doesn't excuse
> breaking the drive just to get a better benchmark score.
Obviously you have never been in the bowels of drive industry hell.
Why do you think there was a change the ATA-6 to require the
Write-Verify-Read to always return stuff from the platter?
Because the SOB's in storage LIE! A real wake-up call for you is that
everything about the world of storage is a big-fat-whopper of a LIE.
Storage devices are BLACK-BOXES with the standards/rules to communicate
being dictated by the device, not the host. Storage devices are no better
than a Coke(tm) vending machine. You push "Coke", it gives you "Coke".
You have not a clue to how it arrives or where it came from.
Same thing about reading from a drive.
> That isn't the point! I'm not talking about the physical mechanism, which
> indeed is often the same between one generation of SCSI and the next
> generation of IDE devices. I'm talking about the IDE controller which is
> slapped on the bottom of said mechanism. The mech can be of world-class
> quality, but if the controller is shot it doesn't cut the grain.
So there is a $5 difference in the cell-gates and the line drivers are
more powerful; 80GB ATA + $5 != 80GB SCSI.
> >Since all OSes that enable WC at init will flush
> >it at shutdown and do a periodic purge during inactivity.
>
> But Linux doesn't, as has been pointed out earlier. We need to fix Linux.
Friend, I fixed this some time ago, but it is bundled with TASKFILE,
which is not going to arrive until 2.5, because I need a way to execute
this and hold the driver until it is complete, regardless of the shutdown
method.
> >Err, last time I checked, all good devices flush their write caches on
> >their own, to keep the maximum cache available for prefetching.
>
> Which doesn't work if the buffer is filled up by the OS 0.5 seconds before
> the power goes.
Maybe that is why there is a vendor disk-cache dump zone on the edge of
the platters... just maybe you need to buy your drives from somebody that
does this and has a predictive sector stretcher, so that the energy from
the inertia of the DC three-phase motor executes the dump.
Ever wondered why modern drives have open collectors on the data bus?
Maybe to disconnect the power draw so that the motor, now a generator,
provides the needed power to complete the data dump...
> I'm sorry if this looks like another troll, but I really do like to clear
> up confusion. I do accept that IDE now has good enough real performance
> for many purposes, but in terms of enforced quality it clearly lags behind
> the entire SCSI field.
I have no desire to debate the merits, but when your onboard host for ATA
starts shipping with GigaBit-Copper speeds then we can have a pissing
contest.
Cheers,
Andre Hedrick
Linux ATA Development
ASL Kernel Development
-----------------------------------------------------------------------------
ASL, Inc. Toll free: 1-877-ASL-3535
1757 Houret Court Fax: 1-408-941-2071
Milpitas, CA 95035 Web: http://www.aslab.com
Andre Hedrick ([email protected]) wrote on Wed Mar 07 2001 - 01:58:44 EST :
> On Wed, 7 Mar 2001, Jonathan Morton wrote:
[ snip ]
> > >Since all OSes that enable WC at init will flush
> > >it at shutdown and do a periodic purge during inactivity.
> >
> > But Linux doesn't, as has been pointed out earlier. We need to fix Linux.
>
> Friend I have fixed this some time ago but it is bundled with TASKFILE
> that is not going to arrive until 2.5. Because I need a way to execute
> this and hold the driver until it is complete, regardless of the shutdown
> method.
I don't understand 100%.
Is TASKFILE required to do proper write cache flushing?
> > >Err, last time I checked, all good devices flush their write caches on
> > >their own, to keep the maximum cache available for prefetching.
> >
> > Which doesn't work if the buffer is filled up by the OS 0.5 seconds before
> > the power goes.
>
> Maybe that is why there is a vendor disk-cache dump zone on the edge of
> the platters... just maybe you need to buy your drives from somebody that
> does this and has a predictive sector stretcher, so that the energy from
> the inertia of the DC three-phase motor executes the dump.
So where is a list of drives that do this?
http://www.list-of-hardware-that-doesnt-suck.com is not responding ...
> Ever wondered why modern drives have open collectors on the data bus?
no :-)
--
David Balazic
--------------
"Be excellent to each other." - Bill & Ted
- - - - - - - - - - - - - - - - - - - - - -
Hi,
On Tue, Mar 06, 2001 at 09:37:20PM +0100, Jens Axboe wrote:
>
> SCSI has ordered tag, which fit the model Alan described quite nicely.
> I've been meaning to implement this for some time, it would be handy
> for journalled fs to use such a barrier. Since ATA doesn't do queueing
> (at least not in current Linux), a synchronize cache is probably the
> only way to go there.
Note that you also have to preserve the position of the barrier in the
elevator queue, and you need to prevent LVM and soft raid from
violating the barrier if different commands end up being sent to
different disks.
--Stephen
Hi,
On Tue, Mar 06, 2001 at 10:44:34AM -0800, Linus Torvalds wrote:
> On Tue, 6 Mar 2001, Alan Cox wrote:
> > You want a write barrier. Write buffering (at least for short intervals) in
> > the drive is very sensible. The kernel needs to be able to send drivers a write
> > barrier which will not be completed with outstanding commands before the
> > barrier.
>
> But Alan is right - we need a "sync" command or something. I don't know
> if IDE has one (it already might, for all I know).
Sync and barrier are very different models. With barriers we can
enforce some element of write ordering without actually waiting for the
IOs to complete; with sync, we're explicitly asking to be told when
the data has become persistent. We can make use of both of these.
SCSI certainly lets us do both of these operations independently. IDE
has the sync/flush command afaik, but I'm not sure whether the IDE
tagged command stuff has the equivalent of SCSI's ordered tag bits.
Andre?
--Stephen
On Wed, Mar 07 2001, Stephen C. Tweedie wrote:
> > SCSI has ordered tag, which fit the model Alan described quite nicely.
> > I've been meaning to implement this for some time, it would be handy
> > for journalled fs to use such a barrier. Since ATA doesn't do queueing
> > (at least not in current Linux), a synchronize cache is probably the
> > only way to go there.
>
> Note that you also have to preserve the position of the barrier in the
> elevator queue, and you need to prevent LVM and soft raid from
> violating the barrier if different commands end up being sent to
> different disks.
Yep, it's much harder than it seems. Especially because for the barrier
to be really useful, having inter-request dependencies becomes a
requirement. So you can say something like 'flush X and Y, but don't
flush Y before X is done'.
--
Jens Axboe
On Wed, Mar 07 2001, Stephen C. Tweedie wrote:
> SCSI certainly lets us do both of these operations independently. IDE
> has the sync/flush command afaik, but I'm not sure whether the IDE
> tagged command stuff has the equivalent of SCSI's ordered tag bits.
> Andre?
IDE has no concept of ordered tags...
--
Jens Axboe
Hi,
On Wed, Mar 07, 2001 at 03:12:41PM +0100, Jens Axboe wrote:
>
> Yep, it's much harder than it seems. Especially because for the barrier
> to be really useful, having inter-request dependencies becomes a
> requirement. So you can say something like 'flush X and Y, but don't
> flush Y before X is done'.
Yes. Fortunately, the simplest possible barrier is just a matter of
marking a request as non-reorderable, and then making sure that you
both flush the elevator queue before servicing that request, and defer
any subsequent requests until the barrier request has been satisfied.
Once it has gone through, you can let through the deferred requests (in
order, up to the point at which you encounter another barrier).
Only if the queue is empty can you give a barrier request directly to
the driver. The special optimisation you can do in this case with
SCSI is to continue to allow new requests through even before the
barrier has completed if the disk supports ordered queue tags.
--Stephen
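To make the queueing rule concrete, here is a user-space toy model of
that simple barrier: not the actual elevator code, just a sketch
assuming a crude sort-by-sector queue. Ordinary requests sort by
sector but may never hop over a barrier, and a barrier pins everything
queued before it.

#include <stdio.h>

#define MAXQ 64

struct request {
    long sector;
    int barrier;        /* 1 = barrier: may not be reordered across */
};

static struct request q[MAXQ];
static int qlen;

static void enqueue(long sector, int barrier)
{
    int pos = qlen, i;

    if (!barrier) {
        /* Elevator-style sorted insert, but never past a barrier. */
        while (pos > 0 && !q[pos - 1].barrier &&
               q[pos - 1].sector > sector)
            pos--;
    }
    /* A barrier always goes at the tail, pinning all earlier requests. */
    for (i = qlen; i > pos; i--)
        q[i] = q[i - 1];
    q[pos].sector = sector;
    q[pos].barrier = barrier;
    qlen++;
}

static void service_all(void)
{
    int i;

    for (i = 0; i < qlen; i++)
        printf("%s sector %ld\n",
               q[i].barrier ? "BARRIER at" : "write to", q[i].sector);
    qlen = 0;
}

int main(void)
{
    enqueue(900, 0);
    enqueue(100, 0);    /* sorted ahead of 900 */
    enqueue(500, 1);    /* barrier: 100 and 900 must be serviced first */
    enqueue(300, 0);    /* sorted, but cannot cross the barrier */
    enqueue(200, 0);
    service_all();
    return 0;
}

The ordered-tag optimisation described above would correspond to
handing requests past the barrier to the drive anyway and letting the
drive itself enforce the ordering.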
Hi!
> If not, then the drive could by all means optimise the access pattern
> provided it acked the data or provided the results in the same order as the
> instructions were given. This would probably shorten the time for a new
> pathological set (distributed evenly across the disk surface, but all on
> the worst-possible angular offset compared to the previous) to (8ms seek
> time + 5ms rotational delay) * 4000 writes ~= 52 seconds (compared with
> around 120 seconds for the previous set with rotational delay factored in).
> Great, so you only need half as big a power store to guarantee writing that
> much data, but it's still too much. Even with a 15000rpm drive and 5ms
> seek times, it would still be too much.
The drive can trivially seek to a reserved track and flush the data onto
it, all within 25 msec, then move the data to its proper location on the
next powerup.
Pavel
--
Philips Velo 1: 1"x4"x8", 300gram, 60, 12MB, 40bogomips, linux, mutt,
details at http://atrey.karlin.mff.cuni.cz/~pavel/velo/index.html.
So in the meantime as this gets worked out on a lower level, we've decided
to take the fsync() out of berkeley db for mysql transaction logs and
mount the filesystem -o sync.
Can anyone perhaps tell me why this may be a bad idea?
Thanks
-jeremy
--
this is my sig.
On Wed, 7 Mar 2001, Jeremy Hansen wrote:
>
> So in the meantime as this gets worked out on a lower level, we've decided
> to take the fsync() out of berkeley db for mysql transaction logs and
> mount the filesystem -o sync.
>
> Can anyone perhaps tell me why this may be a bad idea?
Two reasons:
- it doesn't help. The disk will _still_ do write buffering. It's the
DISK, not the OS. It doesn't matter what you do.
- your performance will suck.
Use fsync(). That's what it's there for.
Tell people who don't have a UPS to disable write caching. If they have
one of the (many, apparently) IDE disks that refuse to disable it, tell
them to either get a UPS, or to switch to another disk.
Linus
On Wed, Mar 07 2001, Stephen C. Tweedie wrote:
> > Yep, it's much harder than it seems. Especially because for the barrier
> > to be really useful, having inter-request dependencies becomes a
> > requirement. So you can say something like 'flush X and Y, but don't
> > flush Y before X is done'.
>
> Yes. Fortunately, the simplest possible barrier is just a matter of
> marking a request as non-reorderable, and then making sure that you
> both flush the elevator queue before servicing that request, and defer
> any subsequent requests until the barrier request has been satisfied.
> One it has gone through, you can let through the deferred requests (in
> order, up to the point at which you encounter another barrier).
The above should have been inter-queue dependencies. For one queue
it's not a big issue; you basically described the whole sequence
above. Either sequence it as zero for a non-empty queue and make
sure the low level driver orders or flushes, or just hand it directly
to the device.
My bigger concern is when the journalled fs has a log on a different
queue.
> Only if the queue is empty can you give a barrier request directly to
> the driver. The special optimisation you can do in this case with
> SCSI is to continue to allow new requests through even before the
> barrier has completed if the disk supports ordered queue tags.
Yep, IDE will have to pay the price of a flush.
--
Jens Axboe
Hi,
On Wed, Mar 07, 2001 at 07:51:52PM +0100, Jens Axboe wrote:
> On Wed, Mar 07 2001, Stephen C. Tweedie wrote:
>
> My bigger concern is when the journalled fs has a log on a different
> queue.
For most fs'es, that's not an issue. The fs won't start writeback on
the primary disk at all until the journal commit has been acknowledged
as firm on disk.
Certainly for ext3, synchronisation between the log and the primary
disk is no big thing. What really hurts is writing to the log, where
we have to wait for the log writes to complete before submitting the
commit write (which is sequentially allocated just after the rest of
the log blocks). Specifying a barrier on the commit block would allow
us to keep the log device streaming, and the fs can deal with
synchronising the primary disk quite happily by itself.
Cheers,
Stephen
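In user-space miniature, the stall described above looks like this
(the file name and block counts are assumptions). The fsync() sitting
between the log blocks and the commit record is exactly the wait that
a barrier on the commit block would remove:

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    char block[512];
    int fd, i;

    fd = open("journal.dat", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    /* 1. Write the transaction's log blocks. */
    memset(block, 'L', sizeof(block));
    for (i = 0; i < 8; i++)
        if (write(fd, block, sizeof(block)) != sizeof(block))
            perror("write");

    /* 2. Stall: the log blocks must be stable before the commit
     *    record is allowed out.  A write barrier would replace this. */
    if (fsync(fd) < 0)
        perror("fsync");

    /* 3. Only now may the commit record be written and synced. */
    memset(block, 'C', sizeof(block));
    if (write(fd, block, sizeof(block)) != sizeof(block))
        perror("write");
    if (fsync(fd) < 0)
        perror("fsync");

    close(fd);
    return 0;
}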
On Wed, Mar 07 2001, Stephen C. Tweedie wrote:
> On Wed, Mar 07, 2001 at 07:51:52PM +0100, Jens Axboe wrote:
> > On Wed, Mar 07 2001, Stephen C. Tweedie wrote:
> >
> > My bigger concern is when the journalled fs has a log on a different
> > queue.
>
> For most fs'es, that's not an issue. The fs won't start writeback on
> the primary disk at all until the journal commit has been acknowledged
> as firm on disk.
But do you then force wait on that journal commit?
> Certainly for ext3, synchronisation between the log and the primary
> disk is no big thing. What really hurts is writing to the log, where
> we have to wait for the log writes to complete before submitting the
> commit write (which is sequentially allocated just after the rest of
> the log blocks). Specifying a barrier on the commit block would allow
> us to keep the log device streaming, and the fs can deal with
> synchronising the primary disk quite happily by itself.
A barrier operation is sufficient then. So you're saying don't
over design, a simple barrier is all you need?
--
Jens Axboe
On Wed, Mar 07 2001, Stephen C. Tweedie wrote:
> On Wed, Mar 07, 2001 at 09:15:36PM +0100, Jens Axboe wrote:
> > On Wed, Mar 07 2001, Stephen C. Tweedie wrote:
> > >
> > > For most fs'es, that's not an issue. The fs won't start writeback on
> > > the primary disk at all until the journal commit has been acknowledged
> > > as firm on disk.
> >
> > But do you then force wait on that journal commit?
>
> It doesn't matter too much --- it's only the writeback which is doing
> this (ext3 uses a separate journal thread for it), so any sleep is
> only there to wait for the moment when writeback can safely begin:
> users of the filesystem won't see any stalls.
Ok, but even if this is true for ext3 it may not be true for other
journalled fs. AFAIR, reiser is doing an explicit wait_on_buffer
which would then amount to quite a performance hit (speculation,
haven't measured).
> > A barrier operation is sufficient then. So you're saying don't
> > over design, a simple barrier is all you need?
>
> Pretty much so. The simple barrier is the only thing which can be
> effectively optimised at the hardware level with SCSI anyway.
True
--
Jens Axboe
Hi,
On Wed, Mar 07, 2001 at 09:15:36PM +0100, Jens Axboe wrote:
> On Wed, Mar 07 2001, Stephen C. Tweedie wrote:
> >
> > For most fs'es, that's not an issue. The fs won't start writeback on
> > the primary disk at all until the journal commit has been acknowledged
> > as firm on disk.
>
> But do you then force wait on that journal commit?
It doesn't matter too much --- it's only the writeback which is doing
this (ext3 uses a separate journal thread for it), so any sleep is
only there to wait for the moment when writeback can safely begin:
users of the filesystem won't see any stalls.
> A barrier operation is sufficient then. So you're saying don't
> over design, a simple barrier is all you need?
Pretty much so. The simple barrier is the only thing which can be
effectively optimised at the hardware level with SCSI anyway.
Cheers,
Stephen
Hi,
On Wed, Mar 07, 2001 at 10:36:38AM -0800, Linus Torvalds wrote:
> On Wed, 7 Mar 2001, Jeremy Hansen wrote:
> >
> > So in the meantime as this gets worked out on a lower level, we've decided
> > to take the fsync() out of berkeley db for mysql transaction logs and
> > mount the filesystem -o sync.
> >
> > Can anyone perhaps tell me why this may be a bad idea?
>
> - it doesn't help. The disk will _still_ do write buffering. It's the
> DISK, not the OS. It doesn't matter what you do.
> - your performance will suck.
Added to which, "-o sync" only enables synchronous metadata updates. It
still doesn't force an fsync on data writes.
--Stephen
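So an application that wants its data, and not just its metadata, to be
durable has to ask per file. A sketch of the two usual ways, with an
assumed file name; either route is still subject to the drive's own
write cache, as discussed throughout this thread:

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    const char msg[] = "transaction record\n";
    int fd;

    /* Option 1: O_SYNC, so every write() blocks until the data has
     * been pushed to the device. */
    fd = open("log.dat", O_WRONLY | O_CREAT | O_APPEND | O_SYNC, 0644);
    if (fd < 0) {
        perror("open");
        return 1;
    }
    if (write(fd, msg, sizeof(msg) - 1) < 0)
        perror("write");
    close(fd);

    /* Option 2: buffered writes with explicit durability points
     * chosen by the application. */
    fd = open("log.dat", O_WRONLY | O_APPEND);
    if (fd < 0) {
        perror("open");
        return 1;
    }
    if (write(fd, msg, sizeof(msg) - 1) < 0)
        perror("write");
    if (fsync(fd) < 0)
        perror("fsync");
    close(fd);
    return 0;
}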
On Wednesday, March 07, 2001 08:56:59 PM +0000 "Stephen C. Tweedie"
<[email protected]> wrote:
> Hi,
>
> On Wed, Mar 07, 2001 at 09:15:36PM +0100, Jens Axboe wrote:
>> On Wed, Mar 07 2001, Stephen C. Tweedie wrote:
>> >
>> > For most fs'es, that's not an issue. The fs won't start writeback on
>> > the primary disk at all until the journal commit has been acknowledged
>> > as firm on disk.
>>
>> But do you then force wait on that journal commit?
>
> It doesn't matter too much --- it's only the writeback which is doing
> this (ext3 uses a separate journal thread for it), so any sleep is
> only there to wait for the moment when writeback can safely begin:
> users of the filesystem won't see any stalls.
It is similar under reiserfs unless the log is full and new transactions
have to wait for flushes to free up the log space. It is probably valid to
assume the dedicated log device will be large enough that this won't happen
very often, or fast enough (nvram) that it won't matter when it does happen.
>
>> A barrier operation is sufficient then. So you're saying don't
>> over design, a simple barrier is all you need?
>
> Pretty much so. The simple barrier is the only thing which can be
> effectively optimised at the hardware level with SCSI anyway.
>
The simple barrier is a good starting point regardless. If we can find
hardware where it makes sense to do cross queue barriers (big raid
controllers?), it might be worth trying.
-chris
Hi,
Matthias Urlichs:
> On Wed, Mar 07 2001, Stephen C. Tweedie wrote:
> > SCSI certainly lets us do both of these operations independently. IDE
> > has the sync/flush command afaik, but I'm not sure whether the IDE
> > tagged command stuff has the equivalent of SCSI's ordered tag bits.
> > Andre?
>
> IDE has no concept of ordered tags...
>
But most disks these days support IDE-SCSI, and SCSI does have ordered
tags, so...
Has anybody done speed comparisons between "native" IDE and IDE-SCSI?
--
Matthias Urlichs | noris network AG | http://smurf.noris.de/
--
Success is something I will dress for when I get there, and not until.
>> It's pretty clear that the IDE drive(r) is *not* waiting for the physical
>> write to take place before returning control to the user program, whereas
>> the SCSI drive(r) is.
>
>This would not be unexpected.
>
>IDE drives generally always do write buffering. I don't even know if you
>_can_ turn it off. So the drive claims to have written the data as soon as
>it has made the write buffer.
>
>It's definitely not the driver, but the actual drive.
As I suspected. However, testing shows that many drives, including most
IBMs, do respond to hdparm -W0 which turns write-caching off (some drives
don't, including some Seagates). There are also drives in existence that
have no cache at all (mostly old sub-1G drives) and some with too little
for this to make a significant difference (the old 1.2G TravelStar in one
of my PowerBooks is an example).
So, is there a way to force (the majority of, rather than all) IDE drives
to wait until it's been truly committed to media? If so, will this be
integrated into the appropriate parts of the kernel, particularly for
certain members of the sync() family and FS unmounting?
On Fri, Mar 09 2001, Matthias Urlichs wrote:
> Matthias Urlichs:
> > On Wed, Mar 07 2001, Stephen C. Tweedie wrote:
> > > SCSI certainly lets us do both of these operations independently. IDE
> > > has the sync/flush command afaik, but I'm not sure whether the IDE
> > > tagged command stuff has the equivalent of SCSI's ordered tag bits.
> > > Andre?
> >
> > IDE has no concept of ordered tags...
> >
> But most disks these days support IDE-SCSI, and SCSI does have ordered
> tags, so...
Any proof to back this up? To my knowledge, only some WDC ATA disks
can be ATAPI driven.
--
Jens Axboe
Hi,
Jens Axboe:
> > But most disks these days support IDE-SCSI, and SCSI does have ordered
> > tags, so...
>
> Any proof to back this up? To my knowledge, only some WDC ATA disks
> can be ATAPI driven.
>
Ummm, no, but that was my impression. If that's wrong, I apologize and
will state the opposite, next time.
--
Matthias Urlichs | noris network AG | http://smurf.noris.de/
--
You see things; and you say 'Why?'
But I dream things that never were; and I say 'Why not?'
--George Bernard Shaw [Back to Methuselah]
On Wed, 7 Mar 2001, Stephen C. Tweedie wrote:
> Hi,
>
> On Tue, Mar 06, 2001 at 10:44:34AM -0800, Linus Torvalds wrote:
>
> > On Tue, 6 Mar 2001, Alan Cox wrote:
> > > You want a write barrier. Write buffering (at least for short intervals) in
> > > the drive is very sensible. The kernel needs to be able to send drivers a write
> > > barrier which will not be completed with outstanding commands before the
> > > barrier.
> >
> > But Alan is right - we need a "sync" command or something. I don't know
> > if IDE has one (it already might, for all I know).
>
> Sync and barrier are very different models. With barriers we can
> enforce some elemnt of write ordering without actually waiting for the
> IOs to complete; with sync, we're explicitly asking to be told when
> the data has become persistant. We can make use of both of these.
>
> SCSI certainly lets us do both of these operations independently. IDE
> has the sync/flush command afaik, but I'm not sure whether the IDE
> tagged command stuff has the equivalent of SCSI's ordered tag bits.
> Andre?
ATA-TCQ sucks, to put it plain and simple. It really requires a special
host, and only the HPT366 series works. It is similar, but not clear as
to its nature. We are debating the usage of it now in T13.
Cheers,
Andre Hedrick
Linux ATA Development