2005-05-02 13:24:05

by Grzegorz Kulewski

Subject: How to flush data to disk reliably?

Hi,

I am writing an app that has log files. These log files must be reliably
and permanently flushed to disk at certain points. These log files can be:
a. normal files on some (local) filesystem,
b. some block device (disk/partition).

I am asking how to flush the data from these logs to disk. I know of
several methods:
1. open with O_SYNC,
2. sync(2),
3. fsync,
4. fdatasync,
5. msync (if they are mmaped).

Which of these are best and most reliable for (a/b) and for (IDE/SCSI)?
What are the differences between them? Is there some other method? Are
there any other precautions that I should be aware of? What about write
caches? Are write barriers implemented (on IDE and SATA/SCSI) or should I
turn caches off?

I am using Linux 2.6. (But I will be glad to hear about differences
between 2.4 and 2.6 in this aspect if there are any.)


Thanks in advance,

Grzegorz Kulewski


2005-05-02 17:57:44

by Alan

Subject: Re: How to flush data to disk reliably?

> I am asking how to flush the data from these logs to disk. I know of
> several methods:
> 1. open with O_SYNC,
> 2. sync(2),
> 3. fsync,
> 4. fdatasync,
> 5. msync (if they are mmaped).
>
> Which of these are best and most reliable for (a/b) and for (IDE/SCSI)?

For SCSI, the combination of O_SYNC and ext3, or fsync and ext3, should be
reliable. fdatasync doesn't write back all the metadata, so depending on
the use it may not be sufficient. For FAT-based filesystems I believe you
need a current kernel for full O_SYNC behaviour.
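
A minimal sketch of the "write, then fsync" pattern mentioned above, for
illustration only. The helper name log_write_durable is hypothetical, and
the sketch assumes the caller has already opened the log file. It only
shows the system-call sequence; whether the data has reached the platter
still depends on the filesystem and on the drive's write cache, as
discussed in this thread.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: write one log record to fd and return only after
 * fsync() has reported success. Returns 0 on success, -1 on error (errno set).
 */
static int log_write_durable(int fd, const void *rec, size_t len)
{
    const char *p = rec;

    /* write() may be interrupted or write fewer bytes than asked; loop. */
    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        p += n;
        len -= (size_t)n;
    }

    /* Ask the kernel to push file data and metadata to the device. */
    return fsync(fd);
}
```

Opening the file with O_SYNC instead would fold the flush into each
write() call, so the explicit fsync() would not be needed. Replacing
fsync() with fdatasync() is only appropriate if skipping some metadata
updates is acceptable for the log in question.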

> What are differences between them?

See the manual pages and/or standard.

> other precautions that I should be aware of? What about write caches? Are
> write barriers implemented (on IDE and SATA/SCSI) or should I turn caches
> off?

SCSI should be fine; for IDE, turn off the write cache.


2005-05-02 19:18:43

by Grzegorz Kulewski

Subject: Re: How to flush data to disk reliably?

Thanks for your fast response!

On Mon, 2 May 2005, Alan Cox wrote:
>> I am asking how to flush the data from these logs to disk. I know of
>> several methods:
>> 1. open with O_SYNC,
>> 2. sync(2),
>> 3. fsync,
>> 4. fdatasync,
>> 5. msync (if they are mmaped).
>>
>> Which of these are best and most reliable for (a/b) and for (IDE/SCSI)?
>
> For scsi the combination of O_SYNC and ext3 or fsync and ext3 should be
> reliable. fdatasync doesn't write back all the metadata so depending on
> the use may not be sufficient. FAT based fs's I believe you need a
> current kernel for full O_SYNC behaviour.

What about other filesystems? Does anybody know the answer for Reiserfs3,
Reiser4, JFS, XFS and any other popular server filesystems? I assume that
if the log file is a block device (like a partition) both O_SYNC and fsync
will work? What about ext2? What about some strange RAID/DM/NBD
configurations? (I do not know in advance what our customers will use so I
need a portable method.)


>> What are differences between them?
>
> See the manual pages and/or standard.

I already saw them. But I am asking about the current implementation status
on Linux. For example, does fsync differ from fdatasync if the file is a
block device? Does O_SYNC always equal "write; fsync" for all operations
that are not read-only? Is it faster because only one syscall is executed?

Also flushing should be slow (for example 50 flushes/s) because disk seeks
are slow. So if I need say 200 reliable writes to log per second may I put
4 independent disks into the server and use them as 4 independent log
files? But fsync operation blocks. Is there any "submit flush request and
get some info when done" command or should I use 4 threads / processes?
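
For reference, POSIX defines an asynchronous flush call, aio_fsync(),
which has the "submit now, check later" shape asked about here. The sketch
below is only an illustration of that interface: the helper names
flush_submit and flush_wait are hypothetical, and nothing in this thread
confirms how well the call performs on any particular kernel or libc.

```c
#include <aio.h>
#include <errno.h>
#include <fcntl.h>
#include <signal.h>
#include <string.h>

/* Queue an asynchronous flush of fd. Returns 0 if the request was accepted. */
static int flush_submit(struct aiocb *cb, int fd)
{
    memset(cb, 0, sizeof *cb);
    cb->aio_fildes = fd;
    cb->aio_sigevent.sigev_notify = SIGEV_NONE; /* no signal, we wait below */
    return aio_fsync(O_SYNC, cb);               /* O_SYNC: fsync semantics */
}

/* Wait for a queued flush. Returns 0 on success, -1 on failure (errno set). */
static int flush_wait(struct aiocb *cb)
{
    const struct aiocb *list[1] = { cb };
    int err;

    while ((err = aio_error(cb)) == EINPROGRESS) {
        /* Sleep until the request completes; retry if a signal wakes us. */
        if (aio_suspend(list, 1, NULL) < 0 && errno != EINTR)
            return -1;
    }
    if (err != 0) {
        if (err > 0)
            errno = err;    /* aio_error() returned the request's errno */
        return -1;
    }
    return aio_return(cb) == 0 ? 0 : -1;
}
```

With one struct aiocb per log file, a single thread could call
flush_submit() on each of the four descriptors and then flush_wait() on
each in turn. On older glibc versions the program may need to be linked
with -lrt.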


Thanks in advance,

Grzegorz Kulewski

2005-05-02 21:42:37

by Alan

Subject: Re: How to flush data to disk reliably?

On Llu, 2005-05-02 at 20:18, Grzegorz Kulewski wrote:
> What about other filesystems? Does anybody know the answer for Reiserfs3,
> Reiser4, JFS, XFS and any other popular server filesystems? I assume that
> if log file is some block device (like partition) both O_SYNC and fsync
> will work? What about ext2? What about some strange RAID/DM/NBD
> configurations? (I do not know in advance what our customers will use so I
> need portable method.)

RAID does stripe sized rewrites so you get into the same situation as
with actual disks - a physical media failure might lose you old data
(but then if the disk goes bang so does the data...)

> I already saw them. But I am asking about current implementation status on
> Linux. For example does fsync differ from fdatasync if file is block
> device? Does O_SYNC always equal "write; fsync" for all not read only
> operations? Is it faster because only one syscall is executed?

Benchmark it - it depends on your workload, I imagine. If you are able to
write a log entry well before it is needed, then the fsync can be a big
win because you can hope the data is already on or heading for disk when
you fsync.

> Also flushing should be slow (for example 50 flushes/s) because disk seeks
> are slow. So if I need say 200 reliable writes to log per second may I put
> 4 independent disks into the server and use them as 4 independent log
> files? But fsync operation blocks. Is there any "submit flush request and
> get some info when done" command or should I use 4 threads / processes?

If you are trying to write logs with that kind of simplistic "reliable"
approach then you probably want a RAID card with battery-backed RAM; at
that point fsync becomes very fast. Most transaction services tend to
use different algorithms to provide reliable service for exactly
these kinds of reasons.
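
One widely used example of such an algorithm is group commit: several log
records are collected and made durable with a single fsync, so the cost of
one flush is shared between them. Whether this is what was meant above is
an assumption. The sketch below is illustrative only; struct log_batch,
batch_add and batch_commit are hypothetical names, and the error handling
is kept minimal.

```c
#include <errno.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define BATCH_CAP 4096          /* arbitrary buffer size for this sketch */

struct log_batch {
    int    fd;                  /* log file, opened by the caller */
    size_t used;                /* bytes currently queued in buf */
    char   buf[BATCH_CAP];
};

/* Queue a record in memory. Returns -1 if it does not fit in the batch. */
static int batch_add(struct log_batch *b, const void *rec, size_t len)
{
    if (len > BATCH_CAP - b->used)
        return -1;
    memcpy(b->buf + b->used, rec, len);
    b->used += len;
    return 0;
}

/* Write every queued record, then make them all durable with one fsync(). */
static int batch_commit(struct log_batch *b)
{
    size_t off = 0;

    while (off < b->used) {
        ssize_t n = write(b->fd, b->buf + off, b->used - off);
        if (n < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        off += (size_t)n;
    }
    if (fsync(b->fd) < 0)
        return -1;
    b->used = 0;                /* batch is on disk, start a new one */
    return 0;
}
```

A writer would acknowledge its records to clients only after
batch_commit() returns 0. With 50 flushes per second and four records per
batch, this reaches the 200 records per second from the question on a
single disk, at the price of a little extra latency per record.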



2005-05-02 22:43:26

by Bill Davidsen

Subject: Re: How to flush data to disk reliably?

On Mon, 2 May 2005, Alan Cox wrote:

> On Llu, 2005-05-02 at 20:18, Grzegorz Kulewski wrote:
> > What about other filesystems? Does anybody know the answer for Reiserfs3,
> > Reiser4, JFS, XFS and any other popular server filesystems? I assume that
> > if log file is some block device (like partition) both O_SYNC and fsync
> > will work? What about ext2? What about some strange RAID/DM/NBD
> > configurations? (I do not know in advance what our customers will use so I
> > need portable method.)
>
> RAID does stripe sized rewrites so you get into the same situation as
> with actual disks - a physical media failure might lose you old data
> (but then if the disk goes bang so does the data...)

I hope I'm reading that wrong, and that rewriting a single sector of a
file doesn't result in r-a-w of the entire stripe. That would be a large
memory hit for filesystems with large stripes for mostly sequential i/o.

--
bill davidsen <[email protected]>
CTO, TMR Associates, Inc
Doing interesting things with little computers since 1979.

2005-05-02 23:18:30

by Arjan van de Ven

Subject: Re: How to flush data to disk reliably?

On Mon, 2005-05-02 at 18:30 -0400, Bill Davidsen wrote:
> On Mon, 2 May 2005, Alan Cox wrote:
>
> > On Llu, 2005-05-02 at 20:18, Grzegorz Kulewski wrote:
> > > What about other filesystems? Does anybody know the answer for Reiserfs3,
> > > Reiser4, JFS, XFS and any other popular server filesystems? I assume that
> > > if log file is some block device (like partition) both O_SYNC and fsync
> > > will work? What about ext2? What about some strange RAID/DM/NBD
> > > configurations? (I do not know in advance what our customers will use so I
> > > need portable method.)
> >
> > RAID does stripe sized rewrites so you get into the same situation as
> > with actual disks - a physical media failure might lose you old data
> > (but then if the disk goes bang so does the data...)
>
> I hope I'm reading that wrong, and that rewriting a single sector of a
> file doesn't result in r-a-w of the entire stripe. That would be a large
> memory hit for filesystems with large stripes for mostly sequential i/o.

It results in a read of the entire stripe and at least two writes (the
actual data and the new parity).

The alternative (and I don't think Linux does that) is to read the old
data sector and the old parity, and do a differential xor.
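
To make the two approaches concrete, here is a small sketch assuming
simple single-parity XOR (as in RAID 4/5) and a stripe stored as one
contiguous array. The function names parity_full and parity_delta are
hypothetical, and the code says nothing about what the Linux RAID code
actually does. It only shows that the differential update
new_parity = old_parity xor old_data xor new_data gives the same result
as recomputing parity over the whole stripe.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Full recompute: parity is the XOR of all data blocks in the stripe.
 * Block i occupies stripe[i * len] .. stripe[i * len + len - 1].
 */
static void parity_full(uint8_t *parity, const uint8_t *stripe,
                        size_t nblocks, size_t len)
{
    for (size_t j = 0; j < len; j++) {
        uint8_t p = 0;
        for (size_t i = 0; i < nblocks; i++)
            p ^= stripe[i * len + j];
        parity[j] = p;
    }
}

/*
 * Differential update: needs only the old and new contents of the block
 * being rewritten, plus the old parity (updated in place).
 */
static void parity_delta(uint8_t *parity, const uint8_t *old_data,
                         const uint8_t *new_data, size_t len)
{
    for (size_t j = 0; j < len; j++)
        parity[j] ^= old_data[j] ^ new_data[j];
}
```

The full recompute has to read every data block in the stripe, while the
differential form needs two reads (old data, old parity) and two writes
(new data, new parity) regardless of how wide the stripe is.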