2009-03-29 10:43:59

by Graham Murray

Subject: Zero length files - an alternative approach?

Just a thought on the ongoing discussion of dataloss with ext4 vs ext3.

Taking the common scenario:
Read oldfile
create newfile
write newfile data
close newfile
rename newfile to oldfile

When using this scenario, the application writer wants to ensure that
either the old or the new content is present. With delayed allocation, a
crash shortly after the rename can instead leave a zero length file. Most
of the suggestions on how to address this have involved syncing the data,
either explicitly before the rename or by making the rename sync the data.
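
As a rough illustration of the scenario above (a sketch, not code from this
thread), the write-then-rename pattern with an optional sync before the
rename might look like this. The helper name replace_file and the ".new"
suffix are made up for the example.

```python
import os


def replace_file(path, data, sync=True):
    """Write data to a temporary file, then rename it over path.

    With sync=True the data is flushed with fsync() before the rename,
    i.e. the "sync before the rename" fix. With sync=False this is the
    plain pattern that can leave a zero length file after a crash on a
    filesystem with delayed allocation.
    """
    tmp = path + ".new"  # hypothetical temporary name
    fd = os.open(tmp, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        view = memoryview(data)
        while view:
            # os.write() may write fewer bytes than requested
            written = os.write(fd, view)
            view = view[written:]
        if sync:
            os.fsync(fd)  # force the data blocks to stable storage
    finally:
        os.close(fd)
    os.rename(tmp, path)  # atomically replaces the name "path"
```

Calling replace_file("oldfile", b"new contents") replaces the file; passing
sync=False skips the fsync() and relies on the filesystem to order the data
and the rename.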

Instead of 'bringing forward' the allocation and flushing of the data,
would it be possible to delay the rename until after the blocks for
newfile have been allocated and the data buffers flushed? This would keep
the performance benefits of delayed allocation etc. and also satisfy
application developers' apparent dislike of using fsync(). It would give
better performance than syncing the data at rename time (either using
fsync() or automatically) and satisfy the requirement that either the old
or the new content is present.

I am not a filesystem developer, so I do not know how feasible this would
be.


2009-03-29 11:22:29

by Måns Rullgård

Subject: Re: Zero length files - an alternative approach?

Graham Murray <[email protected]> writes:

> Just a thought on the ongoing discussion of dataloss with ext4 vs ext3.
>
> Taking the common scenario:
> Read oldfile
> create newfile file
> write newfile data
> close newfile
> rename newfile to oldfile
>
> When using this scenario, the application writer wants to ensure that
> either the old or new content are present. With delayed allocation, this
> can lead to zero length files. Most of the suggestions on how to address
> this have involved syncing the data either before the rename or making
> the rename sync the data.
>
> What about, instead of 'bringing forward' the allocation and flushing of
> the data, would it be possible to instead delay the rename until after
> the blocks for newfile have been allocated and the data buffers flushed?
> This would keep the performance benefits of delayed allocation etc and
> also satisfy the applications developers' apparent dislike of using
> fsync(). It would give better performance that syncing the data at
> rename time (either using fsync() or automatically) and satisfy the
> requirements that either the old or new content is present.

Consider this scenario:

1. Create/write/close newfile
2. Rename newfile to oldfile
3. Open/read oldfile. This must return the new contents.
4. System crash and reboot before delayed allocation/flush complete
5. Open/read oldfile. Old contents now returned.

It isn't obvious, to me at least, that this rollback is without problems
of its own.
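
The five steps can be replayed in a toy model (purely illustrative; the
class DelayedRenameFS and its methods are invented here and do not
correspond to any kernel interface). Running processes see the cached
state, while only flushed state survives a crash.

```python
class DelayedRenameFS:
    """Toy model: a rename is visible at once but committed only on flush."""

    def __init__(self, files):
        self.disk = dict(files)   # state that survives a crash
        self.cache = dict(files)  # state that running processes see

    def write(self, name, data):
        self.cache[name] = data

    def rename(self, src, dst):
        self.cache[dst] = self.cache.pop(src)

    def read(self, name):
        return self.cache[name]

    def flush(self):
        # data and rename reach the disk together
        self.disk = dict(self.cache)

    def crash(self):
        # everything that was not flushed is lost
        self.cache = dict(self.disk)


fs = DelayedRenameFS({"oldfile": "old contents"})
fs.write("newfile", "new contents")  # step 1
fs.rename("newfile", "oldfile")      # step 2
before_crash = fs.read("oldfile")    # step 3: "new contents"
fs.crash()                           # step 4: crash before the flush
after_crash = fs.read("oldfile")     # step 5: "old contents"
```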

--
Måns Rullgård
[email protected]

2009-03-29 12:02:50

by Andreas T.Auer

Subject: Re: Zero length files - an alternative approach?



On 29.03.2009 13:22 Måns Rullgård wrote:
> Consider this scenario:
>
> 1. Create/write/close newfile
> 2. Rename newfile to oldfile
> 3. Open/read oldfile. This must return the new contents.
> 4. System crash and reboot before delayed allocation/flush complete
> 5. Open/read oldfile. Old contents now returned.
>
> This rollback isn't obviously, to me at least, without problems of its
> own.
>
>
Having the old data in 5) is far better than having no data in 5).

2009-03-29 12:18:32

by Måns Rullgård

Subject: Re: Zero length files - an alternative approach?

"Andreas T.Auer" <[email protected]> writes:

> On 29.03.2009 13:22 Måns Rullgård wrote:
>> Consider this scenario:
>>
>> 1. Create/write/close newfile
>> 2. Rename newfile to oldfile
>> 3. Open/read oldfile. This must return the new contents.
>> 4. System crash and reboot before delayed allocation/flush complete
>> 5. Open/read oldfile. Old contents now returned.
>>
>> This rollback isn't obviously, to me at least, without problems of its
>> own.
>>
> Having the old data in 5) is far better than having no data in 5).

Of course having old data is better than no data. However, with fsync()
and similar approaches, a rollback to old data after new data has been
visible is impossible, or at least far less likely than with the
suggested approach. I'm not saying it's necessarily a problem, only that
it is a difference that should be taken into account.

--
Måns Rullgård
[email protected]

2009-03-29 13:49:09

by Pavel Machek

Subject: Re: Zero length files - an alternative approach?

On Sun 2009-03-29 13:10:23, Måns Rullgård wrote:
> "Andreas T.Auer" <[email protected]> writes:
>
> > On 29.03.2009 13:22 Måns Rullgård wrote:
> >> Consider this scenario:
> >>
> >> 1. Create/write/close newfile
> >> 2. Rename newfile to oldfile
> >> 3. Open/read oldfile. This must return the new contents.
> >> 4. System crash and reboot before delayed allocation/flush complete
> >> 5. Open/read oldfile. Old contents now returned.
> >>
> >> This rollback isn't obviously, to me at least, without problems of its
> >> own.
> >>
> > Having the old data in 5) is far better than having no data in 5).
>
> Of course having old data is better than no data. However, fsync()
> and similar approaches make a rollback to old data after new data has
> been visible impossible or far less likely than the suggested one.

Untrue. Unless you fsync after rename, you can get olddata.
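
One possible reading of "fsync after rename" is syncing the directory that
holds the renamed entry, so that the rename itself is on disk. That reading
is an assumption, and the helper name below is invented for this sketch.

```python
import os


def rename_and_sync_dir(src, dst):
    """Rename src over dst, then fsync the directory containing dst.

    Assumes src and dst are in the same directory and that the platform
    allows fsync() on a directory file descriptor (Linux does).
    """
    os.rename(src, dst)
    dirfd = os.open(os.path.dirname(os.path.abspath(dst)), os.O_RDONLY)
    try:
        os.fsync(dirfd)  # push the updated directory entry to disk
    finally:
        os.close(dirfd)
```

This only makes the rename durable; the file data still needs its own
fsync() before the rename if the new contents must survive a crash.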

fsync() is easy. But some people _want_ to have either newdata _or_
olddata, but don't care which one, and would prefer to avoid
fsync. That's where replace() should help...
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2009-03-29 16:49:18

by Avi Kivity

Subject: Re: Zero length files - an alternative approach?

Graham Murray wrote:
> Just a thought on the ongoing discussion of dataloss with ext4 vs ext3.
>
> Taking the common scenario:
> Read oldfile
> create newfile file
> write newfile data
> close newfile
> rename newfile to oldfile
>
> When using this scenario, the application writer wants to ensure that
> either the old or new content are present. With delayed allocation, this
> can lead to zero length files. Most of the suggestions on how to address
> this have involved syncing the data either before the rename or making
> the rename sync the data.
>
> What about, instead of 'bringing forward' the allocation and flushing of
> the data, would it be possible to instead delay the rename until after
> the blocks for newfile have been allocated and the data buffers flushed?
> This would keep the performance benefits of delayed allocation etc and
> also satisfy the applications developers' apparent dislike of using
> fsync(). It would give better performance that syncing the data at
> rename time (either using fsync() or automatically) and satisfy the
> requirements that either the old or new content is present.
>
> I am not a filesystem developer, so do not know how feasible this would
> be.
>

This has been suggested, I believe. In filesystem terms, it means
inserting a barrier before the rename operation, meaning that any write
operations needed to carry out the rename must not take place until all
write operations from the previous calls have completed.
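
To make the ordering argument concrete, here is a small toy model (not
filesystem code; the function names and the two-inode layout are invented
for illustration). It lists what a read of oldfile could return after a
crash, with and without a barrier between the data write and the rename.

```python
from itertools import combinations


def read_oldfile(ops):
    """Apply completed writes to a tiny disk image and read oldfile."""
    directory = {"oldfile": 1, "newfile": 2}  # name -> inode number
    contents = {1: "old", 2: ""}              # newfile exists but is empty
    for op in ops:
        if op[0] == "data":
            contents[op[1]] = op[2]
        else:  # ("rename", src, dst)
            directory[op[2]] = directory.pop(op[1])
    return contents[directory["oldfile"]]


def crash_states(batches):
    """Return every result a crash during writeback could leave behind.

    Writes inside one batch may complete in any order, so any subset of
    them may be on disk at crash time. A barrier separates batches: a
    later batch starts only after the earlier one has fully completed.
    """
    results = set()
    done = []
    for batch in batches:
        for k in range(len(batch) + 1):
            for subset in combinations(batch, k):
                results.add(read_oldfile(done + list(subset)))
        done += batch
    return results


data = ("data", 2, "new")
rename = ("rename", "newfile", "oldfile")

no_barrier = crash_states([[data, rename]])      # {"old", "", "new"}
with_barrier = crash_states([[data], [rename]])  # {"old", "new"}
```

Without the barrier the empty string is a possible result, which is the
zero length file; with the barrier only the old or the new contents remain.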


--
error compiling committee.c: too many arguments to function


2009-03-29 20:16:42

by David Newall

Subject: Re: Zero length files - an alternative approach?

Pavel Machek wrote:
> fsync() is easy. But some people _want_ to have either newdata _or_
> olddata, but don't care which one, and would prefer to avoid
> fsync. That's where replace() should help...

Most people, I wager, care more about their code being portable than
they do about leaping through a Linux-specific hoop. They're not going
to use replace; not ever; that's what link/unlink is for.

If you think it's reasonable to modify every instance in applications
where a sudden crash would cause data loss, why not instead make a
mount-time flag that does all of that in the filesystem? For the other
99% of users, who leave it off, the filesystem doesn't do that and runs
faster.

2009-03-30 12:41:45

by Chris Mason

Subject: Re: Zero length files - an alternative approach?

On Sun, 2009-03-29 at 12:22 +0100, Måns Rullgård wrote:
> Graham Murray <[email protected]> writes:
>
> > Just a thought on the ongoing discussion of dataloss with ext4 vs ext3.
> >
> > Taking the common scenario:
> > Read oldfile
> > create newfile file
> > write newfile data
> > close newfile
> > rename newfile to oldfile
> >
> > When using this scenario, the application writer wants to ensure that
> > either the old or new content are present. With delayed allocation, this
> > can lead to zero length files. Most of the suggestions on how to address
> > this have involved syncing the data either before the rename or making
> > the rename sync the data.
> >
> > What about, instead of 'bringing forward' the allocation and flushing of
> > the data, would it be possible to instead delay the rename until after
> > the blocks for newfile have been allocated and the data buffers flushed?
> > This would keep the performance benefits of delayed allocation etc and
> > also satisfy the applications developers' apparent dislike of using
> > fsync(). It would give better performance that syncing the data at
> > rename time (either using fsync() or automatically) and satisfy the
> > requirements that either the old or new content is present.
>
> Consider this scenario:
>
> 1. Create/write/close newfile
> 2. Rename newfile to oldfile

2a. create oldfile again
2b. fsync oldfile

> 3. Open/read oldfile. This must return the new contents.
> 4. System crash and reboot before delayed allocation/flush complete
> 5. Open/read oldfile. Old contents now returned.
>

What happens to the new generation of oldfile? We could insert
dependency tracking so that we know the fsync of oldfile is supposed to
also fsync the rename'd new file. But then picture a loop of operations
doing renames and creating files in the place of the old one...that
dependency tracking gets ugly in a hurry.

Databases know how to do all of this, but filesystems don't implement
most of the database transactional features.

-chris

2009-03-30 14:07:07

by Theodore Ts'o

Subject: Re: Zero length files - an alternative approach?

On Mon, Mar 30, 2009 at 08:41:26AM -0400, Chris Mason wrote:
> >
> > Consider this scenario:
> >
> > 1. Create/write/close newfile
> > 2. Rename newfile to oldfile
>
> 2a. create oldfile again
> 2b. fsync oldfile
>
> > 3. Open/read oldfile. This must return the new contents.
> > 4. System crash and reboot before delayed allocation/flush complete
> > 5. Open/read oldfile. Old contents now returned.
> >
>
> What happens to the new generation of oldfile? We could insert
> dependency tracking so that we know the fsync of oldfile is supposed to
> also fsync the rename'd new file. But then picture a loop of operations
> doing renames and creating files in the place of the old one...that
> dependency tracking gets ugly in a hurry.

If there are any calls to link(2) to create hard links to oldfile or
newfile intermingled in this sequence, life also gets very
entertaining.

> Databases know how to do all of this, but filesystems don't implement
> most of the database transactional features.

Yep, we'd have to implement a rollback log to get this right, which
would also impact performance. My guess is that just aggressively
forcing out the data write before the rename() is going to cost less
in performance, and is certainly much easier to implement.

- Ted

2009-03-30 21:10:20

by Bodo Eggert

Subject: Re: Zero length files - an alternative approach?

Chris Mason <[email protected]> wrote:
> On Sun, 2009-03-29 at 12:22 +0100, Måns Rullgård wrote:

>> Consider this scenario:
>>
>> 1. Create/write/close newfile
>> 2. Rename newfile to oldfile
>
> 2a. create oldfile again
> 2b. fsync oldfile
>
>> 3. Open/read oldfile. This must return the new contents.
>> 4. System crash and reboot before delayed allocation/flush complete
>> 5. Open/read oldfile. Old contents now returned.
>>
>
> What happens to the new generation of oldfile? We could insert
> dependency tracking so that we know the fsync of oldfile is supposed to
> also fsync the rename'd new file.

If rename() is BEFORE create(oldfile) and if create(oldfile) is committed,
oldfile will be the newly created file. If the sync() is interrupted by the
crash, any intermediate state may appear. If the system crashes before
create(), either the old oldfile or newfile should be visible.