2006-02-22 21:57:04

by Xin Zhao

[permalink] [raw]
Subject: question about possibility of data loss in Ext2/3 file system

As far as I know, in Ext2/3 file system, data blocks to be flushed to
disk are usually marked as dirty and wait for kernel thread to flush
them lazily. So data blocks of a file could be flushed even after this
file is closed.

Now consider this scenario: suppose data block 2,3 and 4 of file A are
marked to be flushed out. At time T1, block 2 and 3 are flushed, and
file A is closed. However, at time T2, system experiences power outage
and failed to flushed block 4. Does that mean we will end up with
getting a partially flushed file? Is there any way to provide better
guarantee on file integrity?


2006-02-22 22:01:00

by Arjan van de Ven

[permalink] [raw]
Subject: Re: question about possibility of data loss in Ext2/3 file system

On Wed, 2006-02-22 at 16:56 -0500, Xin Zhao wrote:
> As far as I know, in Ext2/3 file system, data blocks to be flushed to
> disk are usually marked as dirty and wait for kernel thread to flush
> them lazily. So data blocks of a file could be flushed even after this
> file is closed.
>
> Now consider this scenario: suppose data block 2,3 and 4 of file A are
> marked to be flushed out. At time T1, block 2 and 3 are flushed, and
> file A is closed. However, at time T2, system experiences power outage
> and failed to flushed block 4. Does that mean we will end up with
> getting a partially flushed file? Is there any way to provide better
> guarantee on file integrity?

On ext3 in its default (ordered) mode it works a bit differently.

If you write a NEW file, that is:

First the data gets written (within about 5 seconds, not waiting for
the lazy flush daemon). Only when that is done is the metadata (e.g.
the file size on disk) updated. So after the power comes back you don't
see a mixed thing; you see a file of a certain size, and all the data
up to that size is there.

If you need more guarantees you need to use fsync()/fdatasync() from
the application.
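(For illustration, a minimal POSIX sketch of the advice above: the application asks for durability with fsync() instead of relying on the background flush. The function name and path are invented for this example, and error handling is abbreviated.)

```c
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Append a record and force it to stable storage; 0 on success. */
int durable_append(const char *path, const char *msg)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (fd < 0)
        return -1;

    size_t len = strlen(msg);
    if (write(fd, msg, len) != (ssize_t)len) {
        close(fd);
        return -1;
    }

    /* fdatasync(fd) would skip non-essential metadata like mtime;
       fsync() flushes both the data and the inode metadata. */
    if (fsync(fd) < 0) {
        close(fd);
        return -1;
    }
    return close(fd);
}
```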


2006-02-22 22:34:35

by Xin Zhao

[permalink] [raw]
Subject: Re: question about possibility of data loss in Ext2/3 file system

The scheme you described apparently helps improve file integrity, but
it is still not good enough. For example, suppose all data blocks are
flushed and then you update the metadata: if you lose power right after
updating the block bitmap but before updating the inode, you end up
with some dead blocks, right? Do you know how ext2/3 deals with this
situation?

Also, the scheme you mentioned covers only new file creation. What
happens if I want to update an existing file? Say I open file A, seek
to offset 5000, write 4096 bytes, and then close it. Do you know how
ext2/3 handles this situation?

Many thanks for your kind help!

Xin

On 2/22/06, Arjan van de Ven <[email protected]> wrote:
> On Wed, 2006-02-22 at 16:56 -0500, Xin Zhao wrote:
> > As far as I know, in Ext2/3 file system, data blocks to be flushed to
> > disk are usually marked as dirty and wait for kernel thread to flush
> > them lazily. So data blocks of a file could be flushed even after this
> > file is closed.
> >
> > Now consider this scenario: suppose data block 2,3 and 4 of file A are
> > marked to be flushed out. At time T1, block 2 and 3 are flushed, and
> > file A is closed. However, at time T2, system experiences power outage
> > and failed to flushed block 4. Does that mean we will end up with
> > getting a partially flushed file? Is there any way to provide better
> > guarantee on file integrity?
>
> on ext3 in default mode it works a bit different
>
> if you write a NEW file that is
>
> then first the data gets written (within like 5 seconds, and not waiting
> for the lazy flush daemon). Only when that is done is the metadata (eg
> filesize on disk) updated. So after the power comes back you don't see a
> mixed thing; you see a file of a certain size, and all the data upto
> that size is there.
>
> If you need more guarantees you need to use fsync/fdata_sync from the
> application.
>
>
>

2006-02-22 23:07:13

by Andreas Dilger

[permalink] [raw]
Subject: Re: question about possibility of data loss in Ext2/3 file system

On Feb 22, 2006 17:34 -0500, Xin Zhao wrote:
> Apparently the scheme you described helps improve the file integrity.
> But still not good enough. For example, if all data blocks are
> flushed, then you will update the metadata. But right after you update
> the block bitmap and before you update the inode, you lose power. You
> will get some dead blocks. Right? Do you know how ext2/3 deal with
> this situation?

ext3 journals changes to the filesystem metadata, and if the journal
update is not fully written to disk (committed) then the change to
the filesystem _metadata_ is NOT actually performed. Only if the
metadata change is committed to the journal does it actually continue
and update the filesystem metadata. If that is interrupted, then
journal replay will re-do the operation.
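(The journal-first ordering described above can be sketched as a toy write-ahead journal: a change is written and synced to the journal BEFORE the real location is touched, so an interrupted update can be redone from the committed journal record. The record format and function names here are invented for illustration; the real JBD layer is far more involved.)

```c
#include <fcntl.h>
#include <unistd.h>

/* Append a record plus a commit marker to the journal and sync it. */
static int journal_commit(int jfd, const void *rec, size_t len)
{
    if (write(jfd, rec, len) != (ssize_t)len)
        return -1;
    const char commit = 'C';          /* commit marker */
    if (write(jfd, &commit, 1) != 1)
        return -1;
    return fsync(jfd);                /* journal must be durable FIRST */
}

/* Update offset `off` of the main file, journal-first. */
int journaled_write(int jfd, int fd, off_t off,
                    const void *buf, size_t len)
{
    if (journal_commit(jfd, buf, len) < 0)
        return -1;
    /* Only now touch the real location; if we crash here, replay of
       the committed journal record re-does this write. */
    if (pwrite(fd, buf, len, off) != (ssize_t)len)
        return -1;
    return fsync(fd);
}
```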

> Also, the scheme you mentioned is just for new file creation. What
> will happen if I want to update an existing file? Say, I open file A,
> seek to offset 5000, write 4096 bytes, and then close. Do you know how
> ext2/3 handle this situation?

The above is not relevant to any _data_ changes, just metadata, unless
the file(system) is running in data-journal mode. In that case the
write is written to the journal before being written into the
filesystem. There is a limit on how large such a write can be before
it is split into smaller (non-atomic) transactions in the journal.
Data-journal mode also writes twice as much data to disk, so it can
impact performance if you are already using more than half of your
disk bandwidth.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

2006-02-23 04:58:58

by Theodore Ts'o

[permalink] [raw]
Subject: Re: question about possibility of data loss in Ext2/3 file system

On Wed, Feb 22, 2006 at 05:34:33PM -0500, Xin Zhao wrote:
> Apparently the scheme you described helps improve the file integrity.
> But still not good enough. For example, if all data blocks are
> flushed, then you will update the metadata. But right after you update
> the block bitmap and before you update the inode, you lose power. You
> will get some dead blocks. Right? Do you know how ext2/3 deal with
> this situation?

Ext3 uses the journal to guarantee that the bitmap blocks are
consistent with the inode. Ext2 requires that e2fsck be run to fix
the consistency problem.

> Also, the scheme you mentioned is just for new file creation. What
> will happen if I want to update an existing file? Say, I open file A,
> seek to offset 5000, write 4096 bytes, and then close. Do you know how
> ext2/3 handle this situation?

If you have a power failure right after the close, the data could be
lost. This is true for pretty much all Unix filesystems, for
performance reasons. If you care about the data hitting disk, the
application must use fsync().

Regards,

- Ted

2006-02-23 12:53:06

by linux-os (Dick Johnson)

[permalink] [raw]
Subject: Re: question about possibility of data loss in Ext2/3 file system


On Wed, 22 Feb 2006, Xin Zhao wrote:

> Apparently the scheme you described helps improve the file integrity.
> But still not good enough. For example, if all data blocks are
> flushed, then you will update the metadata. But right after you update
> the block bitmap and before you update the inode, you lose power. You
> will get some dead blocks. Right? Do you know how ext2/3 deal with
> this situation?
>
> Also, the scheme you mentioned is just for new file creation. What
> will happen if I want to update an existing file? Say, I open file A,
> seek to offset 5000, write 4096 bytes, and then close. Do you know how
> ext2/3 handle this situation?
>
> Many thanks for your kind help!
>
> Xin

Don't "top-post" please.

File-systems are not reliable. None of them are. Some people make
their living designing databases and database software so that data
are secure on top of unreliable file-systems. You may want to study
how that is done. It's a multi-step process, so that if the system
crashes or the power fails at any instant, transactions can be
restarted from the last completed one.

The ext3 file-system is a journaling file-system in which some of
these database methods are embedded within the file-system itself.
This makes it more reliable, but not reliable in the absolute sense
of the word.

>
> On 2/22/06, Arjan van de Ven <[email protected]> wrote:
>> On Wed, 2006-02-22 at 16:56 -0500, Xin Zhao wrote:
>>> As far as I know, in Ext2/3 file system, data blocks to be flushed to
>>> disk are usually marked as dirty and wait for kernel thread to flush
>>> them lazily. So data blocks of a file could be flushed even after this
>>> file is closed.
>>>
>>> Now consider this scenario: suppose data block 2,3 and 4 of file A are
>>> marked to be flushed out. At time T1, block 2 and 3 are flushed, and
>>> file A is closed. However, at time T2, system experiences power outage
>>> and failed to flushed block 4. Does that mean we will end up with
>>> getting a partially flushed file? Is there any way to provide better
>>> guarantee on file integrity?
>>
>> on ext3 in default mode it works a bit different
>>
>> if you write a NEW file that is
>>
>> then first the data gets written (within like 5 seconds, and not waiting
>> for the lazy flush daemon). Only when that is done is the metadata (eg
>> filesize on disk) updated. So after the power comes back you don't see a
>> mixed thing; you see a file of a certain size, and all the data upto
>> that size is there.
>>
>> If you need more guarantees you need to use fsync/fdata_sync from the
>> application.
>>

Cheers,
Dick Johnson
Penguin : Linux version 2.6.15.4 on an i686 machine (5589.54 BogoMips).
Warning : 98.36% of all statistics are fiction.
_



2006-02-23 19:46:46

by Sam Vilain

[permalink] [raw]
Subject: Re: question about possibility of data loss in Ext2/3 file system

Theodore Ts'o wrote:
>>Also, the scheme you mentioned is just for new file creation. What
>>will happen if I want to update an existing file? Say, I open file A,
>>seek to offset 5000, write 4096 bytes, and then close. Do you know how
>>ext2/3 handle this situation?
> If you have a power failure right after the close, the data could be
> lost. This is true for pretty much all Unix filesystems, for
> performance reasons. If you care about the data hitting disk, the
> application must use fsync().

I always liked Sun's approach to this in Online Disk Suite - journal at
the block device level rather than the FS / application level.
Something I haven't seen from the Linux md-utils or DM.

Sam.

2006-02-24 16:30:11

by Theodore Ts'o

[permalink] [raw]
Subject: Re: question about possibility of data loss in Ext2/3 file system

On Fri, Feb 24, 2006 at 08:46:24AM +1300, Sam Vilain wrote:
> Theodore Ts'o wrote:
> >>Also, the scheme you mentioned is just for new file creation. What
> >>will happen if I want to update an existing file? Say, I open file A,
> >>seek to offset 5000, write 4096 bytes, and then close. Do you know how
> >>ext2/3 handle this situation?
> >If you have a power failure right after the close, the data could be
> >lost. This is true for pretty much all Unix filesystems, for
> >performance reasons. If you care about the data hitting disk, the
> >application must use fsync().
>
> I always liked Sun's approach to this in Online Disk Suite - journal at
> the block device level rather than the FS / application level.
> Something I haven't seen from the Linux md-utils or DM.

You can do data block journalling in ext3. But the performance impact
can be significant for some work loads. TNSFAAFL.

- Ted

2006-02-26 21:27:20

by Sam Vilain

[permalink] [raw]
Subject: Re: question about possibility of data loss in Ext2/3 file system

Theodore Ts'o wrote:
>>I always liked Sun's approach to this in Online Disk Suite - journal at
>>the block device level rather than the FS / application level.
>>Something I haven't seen from the Linux md-utils or DM.
> You can do data block journalling in ext3. But the performance impact
> can be significant for some work loads. TNSFAAFL.

Sure, but on a large system with a big array, you just move the journal
to a separate diskset. That can be a big speed improvement for those
types of update patterns where you care about always applying updates
sequentially, such as a filesystem or a database.

Sam.

2006-02-27 07:38:44

by Xin Zhao

[permalink] [raw]
Subject: Re: question about possibility of data loss in Ext2/3 file system

Many thanks for the responses above.

It sounds like ext3 uses a journal to protect data integrity. In
data-journal mode, ext3 writes data to the journal first, marks the
start of the commit, then marks the commit done after the data is
flushed to disk. If a power failure happens during the data flush, the
system will redo the data writes the next time it comes back up.

However, how is the integrity of the journal itself guaranteed? This
scheme rests on the assumption that the journal data has been flushed
to disk before the file data is flushed. Otherwise, consider this
scenario: process A writes a data block to file F. Ext3 first writes
this data block into the journal, puts in a "start to commit" notice,
and flags that journal page as dirty (note that the journal data is
not flushed to disk yet). Then ext3 flags the data page as dirty and
waits for the flush daemon to write it to disk. Say the power fails
just when the disk controller has written 2048 of the 4096 bytes to
disk. At this point the journal data has not been flushed to disk, so
there is not enough information to support a redo, and file F will end
up with some junk data. So flushing the journal data to disk before
starting to write the file data seems necessary. If so, how does ext3
guarantee that? Is it because the dirty pages are flushed in
first-come, first-served order?

Another concern is that data-journal mode requires writing twice as
much data, which could impact performance if disk bandwidth usage is
over 50%. For small files it may be rare to exceed 50%, but what about
large files? In a real-world system, what is the probability of using
over 50% of the disk bandwidth?

I am sorry to ask for such a high level of data integrity, but ext3 is
a famously stable file system, and I assume it has some good design to
protect data integrity.

One last question: does anyone know whether it is possible for ext3 to
produce junk data or leave the bitmap and inode inconsistent (under
any extreme condition)?

Again, thanks for your help!

Xin


On 2/26/06, Sam Vilain <[email protected]> wrote:
> Theodore Ts'o wrote:
> >>I always liked Sun's approach to this in Online Disk Suite - journal at
> >>the block device level rather than the FS / application level.
> >>Something I haven't seen from the Linux md-utils or DM.
> > You can do data block journalling in ext3. But the performance impact
> > can be significant for some work loads. TNSFAAFL.
>
> Sure, but on a large system with a big array, you just move the journal
> to a seperate diskset. That can make a big speed improvement for those
> types of update patterns where you care about always applying updates
> sequentially, such as a filesystem or a database.
>
> Sam.
>

2006-02-28 16:58:06

by Phillip Susi

[permalink] [raw]
Subject: Re: question about possibility of data loss in Ext2/3 file system

Xin Zhao wrote:
> Many thanks for above responses.
>
> Sounds like Ext3 uses journal to protect the data integrity. In data
> journal mode, ext 3 writes data to journal first, marks start to
> commit, then marks "done with commit" after data is flushed to disk.
> If power failure happen during data flush, the system will redo the
> data writes next time system is back.
>
> However, how to guarantee the integrity of journal? This solution
> works based on an assumption that the journal data has been flushed to
> disk before file data is flushed. Otherwise, consider this scenario:

The kernel flushes the writes to the journal before it starts writing to
the main area of the disk, then marks the transaction as complete only
after the actual updates have been flushed.

> process A wrote a data block to File F. Ext3 first writes this data
> block into journal, put a "start to commit" notice, flags that journal
> page as dirty. (note that the journal data is not flushed into disk
> yet). Then ext3 starts to flag data page as dirty and wait for flush
> daemon to write it to disk. Say just when the disk controller writes
> 2048 out of 4096 bytes into disk, power outage happens. At this time,
> journal data has not been flushed into disk, so no enough information
> to support redo. The file A will end up with some junk data. So
> flushing journal data to disk before starting to write file data to
> disk seems to be necessary. If so, how ext3 guarantees that? Is it
> because the dirty pages are flushed in a first come first serve
> fashion?
>
> Another concern is that the journal data mode requires twice as much
> as data to write, this could impact performance if disk bandwidth
> usage is over 50%. For small files, it could be rare to use 50%. But
> how about large files? In a real world system, what's the probablity
> of using over 50% disk bandwidth?
>

Depending on exactly what you are doing, it will be anywhere from 0 to
100%; it depends entirely on the hardware you have and the tasks you
are asking it to perform. For most people, though, the performance hit
is too high, and it doesn't prevent data loss 100%, because programs
writing to a file do not inform the kernel about when a transaction
should start or stop, so the kernel has to guess.

Mission-critical programs use their own mechanisms to prevent
corruption of their data files rather than relying on data journaling.
Usually this gives better safety and efficiency.
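(One common application-level mechanism of the kind alluded to above: write the new contents to a temporary file, fsync it, then rename() it over the original. rename() is atomic within a filesystem, so a reader sees either the old file or the new one, never a torn mix. Function and file names here are invented for this sketch.)

```c
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int atomic_replace(const char *path, const char *data, size_t len)
{
    char tmp[4096];
    snprintf(tmp, sizeof(tmp), "%s.tmp", path);

    int fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;
    /* New contents must be durable before the switch-over. */
    if (write(fd, data, len) != (ssize_t)len || fsync(fd) < 0) {
        close(fd);
        unlink(tmp);
        return -1;
    }
    close(fd);

    /* Atomic switch-over; a crash leaves either old or new file. */
    return rename(tmp, path);
}
```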

> I am sorry for ask for too high integrity on data. But I think ext3 is
> a famous stable file system, it should have some good design to
> protect data integrity.
>

Which is why it does.

> Last question, does anyone know whether it is possible that ext3
> creates some junk data or makes bitmap and inode inconsistent (under
> any extreme condition) ?
>

No.

> Again, thanks for your help!
>
> Xin
>