2004-06-16 07:34:59

by Oleg Drokin

Subject: Re: mode data=journal in ext3. Is it safe to use?

Hello!

Petter Larsen wrote:

PL> Can any of you confirm whether mode data=journal in ext3 is
PL> safe to use in Linux kernel 2.6.x?
PL> We need very strong consistency and integrity for our filesystem, so
PL> it would be desirable to journal both data and metadata.

Actually, data=journal mode would gain you almost no extra consistency compared
to data=ordered mode (the only extra bit of consistency that you get is a
correct mtime on files that have their pages overwritten, I think).
You have zero control over transaction boundaries in ext3, so you still need
to design your applications in such a way that they have their own
sort of transactions (if this is needed).

PL> Data integrity is much more important for us than speed.

It is not clear what sort of extra data integrity you expect from data
journaling mode, or why you think it is there.

Garbage in files should not happen in data=ordered mode, as data pages are
written before metadata updates are committed.

Bye,
Oleg


2004-06-17 08:27:26

by Petter Larsen

Subject: Re: mode data=journal in ext3. Is it safe to use?

Hello

My comments are inline.

> PL> Can any of you confirm whether mode data=journal in ext3 is
> PL> safe to use in Linux kernel 2.6.x?
> PL> We need very strong consistency and integrity for our filesystem, so
> PL> it would be desirable to journal both data and metadata.
>
> OLEG> Actually data=journal mode would gain you mostly zero extra consistency compared
> to data=ordered mode. (the only more consistency bit that you get is
> correct mtime on files that have their pages overwritten, I think).
> You have zero control over transaction boundaries in ext3, so you still need
> to design your applications in such a way that they have their own
> sort of transactions (if this is needed).

So your conclusion is that data=journal mode is useless if you do not
want a correct mtime?

There would be little sense in developing the data=journal mode if this
is the only benefit, don't you think?

From Linux/Documentation/filesystems/ext3.txt:

data=journal            All data are committed into the journal prior
                        to being written into the main file system.

data=ordered    (*)     All data are forced directly out to the main
                        file system prior to its metadata being
                        committed to the journal.
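
As a small illustration of the two documented modes, the sketch below
checks which data= mode each mounted ext3 filesystem reports. It assumes a
Linux system that exposes /proc/mounts; the function names are invented for
this example. A mount that lists no data= option is reported as None, in
which case the filesystem default applies (marked (*) in the excerpt above).

```python
def journal_data_mode(mount_options):
    """Return the value of data= in a comma-separated option string, or None."""
    for opt in mount_options.split(","):
        key, sep, value = opt.strip().partition("=")
        if key == "data" and sep:
            return value
    return None


def ext3_data_modes(mounts_path="/proc/mounts"):
    """Map mount point -> data mode (or None) for every ext3 entry.

    Assumes the usual /proc/mounts layout:
    device, mount point, fs type, options, dump, pass.
    """
    modes = {}
    with open(mounts_path) as mounts:
        for line in mounts:
            fields = line.split()
            if len(fields) >= 4 and fields[2] == "ext3":
                modes[fields[1]] = journal_data_mode(fields[3])
    return modes


if __name__ == "__main__":
    for mount_point, mode in ext3_data_modes().items():
        print(mount_point, mode or "(not listed, filesystem default)")
```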

My problem is that ext3 in the latest kernels, 2.6.x and the latest
2.4.x, is not well documented around the web. Whitepapers and the like are
pretty old. Much has changed in ext3, I believe, since it was first
introduced by Dr. Tweedie. The first release journaled both data
and metadata; see also the transcript of Dr. Tweedie's talk at the Ottawa
Linux Symposium, 20th July 2000.
http://olstrans.sourceforge.net/release/OLS2000-ext3/OLS2000-ext3.html

There he says that they are journaling both metadata and data, but that
the design goal is not to do that. So can this be interpreted to mean that
mode data=journal is only there for historic reasons?


> PL> Data integrity is much more important for us than speed.
>
> OLEG> It is not clear what sort of extra data integrity do you expect from data
> journaling mode and why do you think it is there.

I would believe that the goal of a mode like data=journal is to gain
extra data integrity, because it also journals data. Why should it not? I
would believe that it makes sense to have these different modes so people
can choose the best mode for their applications.

> OLEG> Garbage in files should not happen in data ordered mode as data pages are
> written first before metadata updates are committed.

Are you sure?


Petter


2004-06-17 17:10:23

by Oleg Drokin

Subject: Re: mode data=journal in ext3. Is it safe to use?

Hello!

On Thu, Jun 17, 2004 at 10:27:17AM +0200, Petter Larsen wrote:
> > PL> Can any of you confirm whether mode data=journal in ext3 is
> > PL> safe to use in Linux kernel 2.6.x?
> > PL> We need very strong consistency and integrity for our filesystem, so
> > PL> it would be desirable to journal both data and metadata.
> > OLEG> Actually data=journal mode would gain you mostly zero extra consistency compared
> > to data=ordered mode. (the only more consistency bit that you get is
> > correct mtime on files that have their pages overwritten, I think).
> > You have zero control over transaction boundaries in ext3, so you still need
> > to design your applications in such a way that they have their own
> > sort of transactions (if this is needed).
> So your conclusion is that data=journal mode is useless if you do not
> want a correct mtime?

Well, yes.

> There would be little sense in developing the data=journal mode if this
> is the only benefit, don't you think?
> From Linux/Documentation/filesystems/ext3.txt:
> data=journal All data are committed into the journal prior
> to being written into the main file system.
> data=ordered (*) All data are forced directly out to the main
> file system prior to its metadata being committed to
> the journal.
> My problem is that ext3 in the latest kernels, 2.6.x and the latest
> 2.4.x, is not well documented around the web. Whitepapers and the like are
> pretty old. Much has changed in ext3, I believe, since it was first
> introduced by Dr. Tweedie. The first release journaled both data
> and metadata; see also the transcript of Dr. Tweedie's talk at the Ottawa
> Linux Symposium, 20th July 2000.
> http://olstrans.sourceforge.net/release/OLS2000-ext3/OLS2000-ext3.html
> There he says that they are journaling both metadata and data, but that
> the design goal is not to do that. So can this be interpreted to mean that
> mode data=journal is only there for historic reasons?

Maybe so. Also, fsync-heavy loads on real disk devices with large journals
tend to benefit from journaled data mode.

> > PL> Data integrity is much more important for us than speed.
> >
> > OLEG> It is not clear what sort of extra data integrity do you expect from data
> > journaling mode and why do you think it is there.
> I would believe that the goal of a mode like data=journal is to gain
> extra data integrity, because it also journals data. Why should it not? I

Well, actually I bet you do not care whether the data goes through the
journal or not, as long as it is not lost.
In ordered journaling mode, data is written before the metadata updates.
Mostly the same happens in data journal mode, except that there the data is
written into the journal, and if the transaction was not committed, after
a reboot it won't be copied to where it should be. The same scenario in
ordered journal mode results in the data getting to where it should be, but
due to the lack of metadata updates, you won't see it. (This is the append
case; for overwrite it is a little bit different, but you still have no
control over how much will be overwritten.)

> would believe that it makes sense to have these different modes so people
> can choose the best mode for their applications.

True.

> > OLEG> Garbage in files should not happen in data ordered mode as data pages are
> > written first before metadata updates are committed.
> Are you sure?

If you can reproduce garbage in files in ordered journal mode, that would be a
bug that should be fixed.

Bye,
Oleg

2004-06-18 09:38:33

by Helge Hafting

Subject: Re: mode data=journal in ext3. Is it safe to use?

Oleg Drokin wrote:

>If you can reproduce a garbage in files in ordered journal mode, that would be a
>bug that should be fixed then.
>
>
Hard to _produce_, but consider:
1. Write data to an existing file.
2. Sync metadata.
3. Data is forced out because of ordered mode, and a power-loss crash
   happens in the middle of this. The file now has a block with a mix of
   new and old data; it may even be unreadable due to a bad sector checksum.

With data journalling you get either the old data (because the crash
happened during a write to the journal) or the new data (the crash happened
during the data write, and the data is restored from the good copy in the
journal).

Helge Hafting

2004-06-18 10:26:01

by Oleg Drokin

Subject: Re: mode data=journal in ext3. Is it safe to use?

Hello!

On Fri, Jun 18, 2004 at 11:41:23AM +0200, Helge Hafting wrote:

> >If you can reproduce a garbage in files in ordered journal mode, that
> >would be a
> >bug that should be fixed then.
> Hard to _produce_, but consider:
> 1. Write data to an existing file
> 2. Sync metadata
> 3. data is forced out because of ordered mode, a powerout crash happens
> in the middle of this. The file now has a block with a mix of new
> and old,

Well, this is not much worse than having two blocks, one from the old file
and one from the new, after a crash.

> it may even be unreadable due to a bad sector checksum.

Well, in data journaled mode you may get an unreadable journal; is this much
better? (Also, the original question was about CF flash media, so no
bad-sector problems, I presume.)

> With data journalling you either get the old data (because the crash
> happened
> during a write to the journal) or new data (crash happened during data
> write,

Well, with data journaling mode your granularity is one block, while
with data ordered it is one sector.

> the data is restored from the good copy in the journal.)

Bye,
Oleg

2004-06-18 11:31:05

by Paulo Marques

Subject: Re: mode data=journal in ext3. Is it safe to use?

On Fri, 2004-06-18 at 11:15, Oleg Drokin wrote:
> Hello!
>
> On Fri, Jun 18, 2004 at 11:41:23AM +0200, Helge Hafting wrote:
>
> > >If you can reproduce a garbage in files in ordered journal mode, that
> > >would be a
> > >bug that should be fixed then.
> > Hard to _produce_, but consider:
> > 1. Write data to an existing file
> > 2. Sync metadata
> > 3. data is forced out because of ordered mode, a powerout crash happens
> > in the middle of this. The file now has a block with a mix of new
> > and old,
>
> Well, this is not much worse than having two blocks, one from old file
> and one from new after a crash.

Agreed. If the application needs consistency it must do some journaling
itself. At least until the time when an application can say "start
transaction" and "commit transaction" to the file system itself.

> > it may even be unreadable due to a bad sector checksum.
>
> Well, in data journaled mode you may get unreadable journal, is this much
> better? (Also original question was about CF flash media, so no bad sector
> problems I presume).

You got it wrong here. The sentence was "bad sector checksum", not "bad
sector". If the sector was "half written", then the checksum would not
match.

If the journal is "half written" then it is just discarded (or at least
it should be).

> > With data journalling you either get the old data (because the crash
> > happened
> > during a write to the journal) or new data (crash happened during data
> > write,
>
> Well, while with data journaling mode your granularity is one block,
> with data ordered it is one sector.

Imagine that you request a 2 MB write to an ext3 filesystem with a 1 MB
journal. There is *no way* the filesystem can do the write as an atomic
operation. (There would be if the filesystem wrote the data to free
blocks and updated the metadata through the journal.)

The point is, there is no concept of "atomic operation" at the file
system level, so the application must do journaling itself if it wants
to have some concept of "transactions".
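
To make "the application must do journaling itself" concrete, here is a
minimal sketch of an application-level journal. It is an illustration under
stated assumptions, not a recommended design: the record format (length and
CRC32 header followed by the payload) and the function names are invented
for this example. A record that was only half written at crash time fails
its checksum and is discarded on recovery, together with anything after it.

```python
import os
import struct
import zlib

# Hypothetical record header: payload length and CRC32 of the payload.
HEADER = struct.Struct("<II")


def append_record(path, payload):
    """Append one checksummed record and force it to stable storage."""
    record = HEADER.pack(len(payload), zlib.crc32(payload)) + payload
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
    try:
        view = memoryview(record)
        while view:                      # os.write may write only part
            view = view[os.write(fd, view):]
        os.fsync(fd)                     # do not report success before this
    finally:
        os.close(fd)


def read_records(path):
    """Return all intact records, stopping at the first torn or corrupt one."""
    with open(path, "rb") as journal:
        data = journal.read()
    records = []
    pos = 0
    while pos + HEADER.size <= len(data):
        length, crc = HEADER.unpack_from(data, pos)
        start = pos + HEADER.size
        payload = data[start:start + length]
        if len(payload) < length or zlib.crc32(payload) != crc:
            break                        # half-written tail: discard it
        records.append(payload)
        pos = start + length
    return records
```

An application would append a record describing the intended change, apply
the change, and on startup replay whatever read_records() returns.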

From my experience with CF cards, there are some brands that do
wear-leveling (I know that at least the TwinMOS ones do, and probably
SanDisk too) and others that don't (Kingmax).

With a bad CF card and an ext3 filesystem you can get bad sectors in a
couple of hours of intensive writing.

A good CF card will sustain "normal use" (2 writes per minute on average)
and an ext3 filesystem for months (maybe years, I haven't gone that
far in time yet :)

Just my two cents,

--
Paulo Marques - http://www.grupopie.com
"In a world without walls and fences who needs windows and gates?"

2004-06-18 12:05:56

by Oleg Drokin

Subject: Re: mode data=journal in ext3. Is it safe to use?

Hello!

On Fri, Jun 18, 2004 at 12:30:55PM +0100, Paulo Marques wrote:
> > > Hard to _produce_, but consider:
> > > 1. Write data to an existing file
> > > 2. Sync metadata
> > > 3. data is forced out because of ordered mode, a powerout crash happens
> > > in the middle of this. The file now has a block with a mix of new
> > > and old,
> > Well, this is not much worse than having two blocks, one from old file
> > and one from new after a crash.
> Agree. If the application needs consistency it must do some journaling
> itself. At least, until the time when an application can say "start
> transaction" "commit transaction" to the file system itself.

Right, this is my point.

> > > it may even be unreadable due to a bad sector checksum.
> > Well, in data journaled mode you may get unreadable journal, is this much
> > better? (Also original question was about CF flash media, so no bad sector
> > problems I presume).
> You got it wrong here. The sentence was "bad sector checksum", not "bad
> sector". If the sector was "half written", then the checksum would not
> match.

In any case, a bad sector checksum is a hardware bug. A sector write is
supposed to be atomic: it either happens or it does not.

> If the journal is "half written" then it is just discarded (or at least
> it should be).

Well, if there is a bad sector checksum inside a journal block, ext3 won't be
all that happy about it for sure (and neither will most other journaling
filesystems, I am sure).

> > > With data journalling you either get the old data (because the crash
> > > happened
> > > during a write to the journal) or new data (crash happened during data
> > > write,
> > Well, while with data journaling mode your granularity is one block,
> > with data ordered it is one sector.
> Imagine that you request a 2Mb write to an ext3 filesystem with an 1Mb
> journal. There is *no way* the filesystem can do the write in an atomic
> operation. (there would be if the filesystem wrote the data to free
> blocks and updated the metadata through the journal)

True.
Even if you write 512K of data and have a 1 MB journal, there is still no
atomicity guarantee.

> The point is, there is no concept of "atomic operation" at the file
> system level, so the application must do journaling itself if it wants
> to have some concept of "transactions".

Well, if you stick to updates smaller than one block (that do not cross block
boundaries), this can be done atomically (with the help of fsync and such).
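
A rough sketch of that restriction, for illustration only: the helper below
refuses any update that would span two blocks, then writes and fsyncs it.
The function names and the 4096-byte default are assumptions made for this
example (use the block size the filesystem was actually created with), and
the code only enforces the precondition described above; whether the write
is then atomic depends on the filesystem and the hardware.

```python
import os


def fits_in_one_block(offset, length, block_size):
    """True if bytes [offset, offset + length) lie inside a single block."""
    if length <= 0 or length > block_size:
        return False
    return offset // block_size == (offset + length - 1) // block_size


def write_within_block(fd, offset, data, block_size=4096):
    """Write data at offset only if it stays inside one block, then fsync."""
    if not fits_in_one_block(offset, len(data), block_size):
        raise ValueError("update would cross a block boundary")
    written = os.pwrite(fd, data, offset)
    if written != len(data):
        raise OSError("short write")
    os.fsync(fd)  # force the update to disk before returning
```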

> >From my experience with CF cards, there are some brands that do
> wear-leveling (I know that at least the TwinMOS ones do, and probably
> SanDisk too) and others that don't (Kingmax).
> With a bad CF card and an ext3 filesystem you can get bad sectors in a
> couple of hours doing some intensive writing.

Well, for flash memory there is jffs2; it does (data) journalling and supports
compression. It can even work over conventional block devices via mtd block
emulation, I think. Basically, jffs2 is one large fs-sized journal, as I
understand it.

Bye,
Oleg

2004-06-19 19:16:57

by Bernd Eckenfels

Subject: Re: mode data=journal in ext3. Is it safe to use?

In article <1087558255.25904.14.camel@pmarqueslinux> you wrote:
> The point is, there is no concept of "atomic operation" at the file
> system level, so the application must do journaling itself if it wants
> to have some concept of "transactions".

Well, there can be rules like "writes after flush with size less than X are
atomic", with X being something between the sector size, the block size and
the data journal size.

However, most unix programs which do not do journalling and rely on some
stable atomic behaviour work by generating new files and renaming them.
And for this the metadata journalling in ordered mode is fine.
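
The "generate a new file and rename it" pattern can be sketched roughly as
follows. This is a minimal example, assuming a POSIX system where renaming
within one directory replaces the target atomically; the function name and
the ".tmp-" prefix are made up for this illustration.

```python
import os
import tempfile


def atomic_replace(path, data):
    """Replace the file at path with data via a temp file and rename."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file must live in the same directory (same filesystem),
    # otherwise the rename cannot be atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())   # data on disk before the rename
        os.replace(tmp_path, path)   # readers see the old or the new file
    except BaseException:
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
    # Also sync the directory so the rename itself survives a crash.
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

After a crash the path names either the complete old file or the complete
new one, never a mix of the two.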

So only the append-only logfiles may need some special treatment; this looks
like a common source of null bytes in a file. And it is only a problem if
the file is not a temp file (syslog).

Greetings
Bernd
--
eckes privat - http://www.eckes.org/
Project Freefire - http://www.freefire.org/

2004-06-21 17:42:36

by Petter Larsen

Subject: Re: mode data=journal in ext3. Is it safe to use? Conclusion


I will summarise this thread and try to set the picture of what has been
discussed and concluded.

1. ext3 with mode data=journal in kernel 2.6.x is probably working as
intended. One person has reported using this mode heavily on 2.6.6
without corruption related to the fs code. Since nobody has said that
they have seen faults, we should believe that it is safe. It is in a
stable kernel...

2. Mode data=journal will not gain much more than correct mtime compared
to mode data=ordered.

3. Applications that need very consistent behaviour from the filesystem,
e.g. consistent writes, need to implement their own
transaction/journaling system. Alberto Bertogli has written a library
that can assist with this; see
http://users.auriga.wearlab.de/~alb/libjio/. I have not used it, so I
cannot say for sure how good it is, but it seems like a nice start and
worth taking a look at.

4. Because mode data=journal does not gain much, it would be better to
use mode data=ordered and do any transaction/journaling in the application
itself. Mode data=ordered is the default in ext3 and probably the most used,
and therefore also the best tested.

5. If, and only if, your updates are smaller than one block (and do not
cross block boundaries), these write operations can be done
atomically (with the help of fsync and such; from Oleg and others).

6. Wear leveling on a Compact Flash card:
Wear leveling is an important task. SanDisk has Industrial Grade support
for some of their CF cards; see these links.
http://www.sandisk.com/pressrelease/020522_toughness.htm
http://www.sandisk.com/pressrelease/021112_igapps.htm
http://www.sandisk.com/pdf/oem/WPaperWearLevelv1.0.pdf
We are in the telecommunications and networking business and need this
kind of Compact Flash card. From their site:
* Enhanced error correction and sophisticated wear leveling technology
* Card level MTBF >3 million hours
* 2 million program/erase cycle endurance per block
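
To get a feeling for what the endurance figure means, here is a
back-of-the-envelope calculation. It is only a sketch under strong
assumptions: every write costs one program/erase cycle, and
blocks_sharing_wear is a hypothetical parameter for the number of blocks
the card spreads the wear over (1 means no wear leveling at all). The
"2 writes per minute" rate is the "normal use" figure mentioned earlier in
this thread.

```python
def block_lifetime_days(endurance_cycles, writes_per_minute,
                        blocks_sharing_wear=1):
    """Days until the per-block endurance is used up.

    Assumes each write costs one program/erase cycle and that the wear
    is spread evenly over blocks_sharing_wear blocks.
    """
    minutes = endurance_cycles * blocks_sharing_wear / writes_per_minute
    return minutes / (60 * 24)


# Worst case: every write hits the same block (no wear leveling).
days = block_lifetime_days(2_000_000, 2)
print(f"{days:.0f} days, about {days / 365:.1f} years")  # 694 days, about 1.9 years
```

With ideal wear leveling over N blocks the estimate simply scales by N,
which is why the question of how the card handles rarely written data
matters so much.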

We are not bound to SanDisk. We could use any supplier that meets these
criteria.

I do not know the wear leveling algorithm in detail, so how it shuffles
read-only data around the disk (or whether it does), and how it behaves
if we create partitions on this CF disk (partitions are probably
transparent to the wear leveling algorithm), are issues we need to
find out about.

Thanks for all your replies (there are 32 threads :-) spread across the
ext3 ML and the LKML, plus a couple of private ones). It has helped me a lot.

Best regards
--
Petter Larsen
cand. scient.
moreCom as
913 17 222