2010-12-09 12:03:42

by Olaf van der Spek

[permalink] [raw]
Subject: Atomic non-durable file write API

Hi,

Since the introduction of ext4, some apps/users have had issues with
file corruption after a system crash. It's not a bug in the FS AFAIK
and it's not exclusive to ext4.
Writing a temp file, fsync, rename is often proposed. However, the
durable aspect of fsync isn't always required and this way has other
issues.
What is the recommended way for atomic non-durable (complete) file writes?

I'm also wondering why FSs commit after open/truncate but before
write/close. AFAIK this isn't necessary and thus suboptimal.

Greetings,

Olaf


2010-12-16 12:22:20

by Olaf van der Spek

[permalink] [raw]
Subject: Re: Atomic non-durable file write API

On Thu, Dec 9, 2010 at 1:03 PM, Olaf van der Spek <[email protected]> wrote:
> Hi,
>
> Since the introduction of ext4, some apps/users have had issues with
> file corruption after a system crash. It's not a bug in the FS AFAIK
> and it's not exclusive to ext4.
> Writing a temp file, fsync, rename is often proposed. However, the
> durable aspect of fsync isn't always required and this way has other
> issues.
> What is the recommended way for atomic non-durable (complete) file writes?
>
> I'm also wondering why FSs commit after open/truncate but before
> write/close. AFAIK this isn't necessary and thus suboptimal.

Somebody?

Olaf

2010-12-16 20:11:40

by Ric Wheeler

[permalink] [raw]
Subject: Re: Atomic non-durable file write API

On 12/16/2010 07:22 AM, Olaf van der Spek wrote:
> On Thu, Dec 9, 2010 at 1:03 PM, Olaf van der Spek<[email protected]> wrote:
>> Hi,
>>
>> Since the introduction of ext4, some apps/users have had issues with
>> file corruption after a system crash. It's not a bug in the FS AFAIK
>> and it's not exclusive to ext4.
>> Writing a temp file, fsync, rename is often proposed. However, the
>> durable aspect of fsync isn't always required and this way has other
>> issues.
>> What is the recommended way for atomic non-durable (complete) file writes?
>>
>> I'm also wondering why FSs commit after open/truncate but before
>> write/close. AFAIK this isn't necessary and thus suboptimal.
> Somebody?
>
> Olaf

Getting an atomic IO from user space down to storage is not really trivial.

What I think you would have to do is:

(1) understand the alignment and minimum IO size of your target storage device
which you can get from /sys/block (or libblkid)

(2) pre-allocate the file so that you do not need to update meta-data for your write

(3) use O_DIRECT write calls that are minimum IO sized requests
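
For concreteness, a rough, untested sketch of step (3) might look like the
following; the 4096-byte IO size and the file name are assumptions for
illustration, and a real program would take the size from step (1):

#define _GNU_SOURCE          /* for O_DIRECT */
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    const size_t blksz = 4096;   /* assumed minimum IO size from step (1) */
    void *buf;

    if (posix_memalign(&buf, blksz, blksz))  /* O_DIRECT needs aligned buffers */
        return 1;
    memset(buf, 0, blksz);

    int fd = open("data.bin", O_WRONLY | O_DIRECT);  /* pre-allocated file, step (2) */
    if (fd < 0)
        return 1;
    /* one aligned, minimum-IO-sized request, step (3) */
    ssize_t n = pwrite(fd, buf, blksz, 0);
    close(fd);
    free(buf);
    return n == (ssize_t)blksz ? 0 : 1;
}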

Note that there are still things that could break your atomic write - failures
in the storage device firmware, fragmentation in another layer (breaking up an
atomic write into transport-sized chunks), etc.

In practice, most applications that need to do atomic transactions use logging
(and fsync()) calls, I suspect....

Was this the kind of answer that you were looking for?

Ric


2010-12-18 22:15:47

by Calvin Walton

[permalink] [raw]
Subject: Re: Atomic non-durable file write API

On Thu, 2010-12-16 at 15:11 -0500, Ric Wheeler wrote:
> On 12/16/2010 07:22 AM, Olaf van der Spek wrote:
> > On Thu, Dec 9, 2010 at 1:03 PM, Olaf van der Spek<[email protected]> wrote:
> >> Hi,
> >>
> >> Since the introduction of ext4, some apps/users have had issues with
> >> file corruption after a system crash. It's not a bug in the FS AFAIK
> >> and it's not exclusive to ext4.
> >> Writing a temp file, fsync, rename is often proposed. However, the
> >> durable aspect of fsync isn't always required and this way has other
> >> issues.
> >> What is the recommended way for atomic non-durable (complete) file writes?
> >>
> >> I'm also wondering why FSs commit after open/truncate but before
> >> write/close. AFAIK this isn't necessary and thus suboptimal.
> > Somebody?
> >
> > Olaf
>
> Getting an atomic IO from user space down to storage is not really trivial.
>
> What I think you would have to do is:
>
> (1) understand the alignment and minimum IO size of your target storage device
> which you can get from /sys/block (or libblkid)

Hmm. I’m doing a little interpretation of what Olaf said here; but I
think you may have misunderstood the question?

He doesn’t care about whether or not the file is securely written to
disk (durable); however he doesn’t want to see any partially written
files. In other words, something like

1. Write to temp file
2. Rename temp file over original file

Where the rename is only committed to disk once the entire contents of
the file have been written securely – whenever that may eventually
happen.

He doesn’t want to synchronously wait for the file to be written,
because the new data isn’t particularly important. The only important
thing is that the file either contains the old or new data after a
filesystem crash; not incomplete data. So, it’s more of an ordering
problem, I think? (Analogous to putting some sort of barrier between the
file write/close and the file rename to maintain ordering.)
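
In code, the pattern being described looks roughly like the sketch below
(file names are made up, error handling omitted); the commented step in the
middle is exactly the piece that has no syscall today:

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

void atomic_nondurable_save(const char *path, const char *buf, size_t len)
{
    /* 1. write to a temp file on the same filesystem */
    int fd = open("file.tmp", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    write(fd, buf, len);
    close(fd);

    /* (wanted: an ordering barrier here that does NOT wait for the
     * data to reach disk -- no such primitive exists today) */

    /* 2. rename the temp file over the original */
    rename("file.tmp", path);
}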

Hopefully I’ve interpreted the original question correctly, because this
is something I would find interesting as well.

--
Calvin Walton <[email protected]>

2010-12-19 16:39:03

by Olaf van der Spek

[permalink] [raw]
Subject: Re: Atomic non-durable file write API

On Sat, Dec 18, 2010 at 11:15 PM, Calvin Walton <[email protected]> wrote:
> Hmm. I’m doing a little interpretation of what Olaf said here; but I
> think you may have misunderstood the question?
>
> He doesn’t care about whether or not the file is securely written to
> disk (durable); however he doesn’t want to see any partially written
> files. In other words, something like
>
>     1. Write to temp file
>     2. Rename temp file over original file

Meta data, including file owner, should be preserved.
Ideally, no temp files should be visible either.

> Where the rename is only committed to disk once the entire contents of
> the file have been written securely – whenever that may eventually
> happen.
>
> He doesn’t want to synchronously wait for the file to be written,
> because the new data isn’t particularly important. The only important
> thing is that the file either contains the old or new data after a
> filesystem crash; not incomplete data. So, it’s more of an ordering
> problem, I think? (Analogous to putting some sort of barrier between the
> file write/close and the file rename to maintain ordering.)
>
> Hopefully I’ve interpreted the original question correctly, because this
> is something I would find interesting as well.

Yes, you did.

Olaf

2010-12-23 15:49:54

by Olaf van der Spek

[permalink] [raw]
Subject: Re: Atomic non-durable file write API

On Sun, Dec 19, 2010 at 5:39 PM, Olaf van der Spek <[email protected]> wrote:
> On Sat, Dec 18, 2010 at 11:15 PM, Calvin Walton <[email protected]> wrote:
>> Hmm. I’m doing a little interpretation of what Olaf said here; but I
>> think you may have misunderstood the question?
>>
>> He doesn’t care about whether or not the file is securely written to
>> disk (durable); however he doesn’t want to see any partially written
>> files. In other words, something like
>>
>>     1. Write to temp file
>>     2. Rename temp file over original file
>
> Meta data, including file owner, should be preserved.
> Ideally, no temp files should be visible either.
>
>> Where the rename is only committed to disk once the entire contents of
>> the file have been written securely – whenever that may eventually
>> happen.
>>
>> He doesn’t want to synchronously wait for the file to be written,
>> because the new data isn’t particularly important. The only important
>> thing is that the file either contains the old or new data after a
>> filesystem crash; not incomplete data. So, it’s more of an ordering
>> problem, I think? (Analogous to putting some sort of barrier between the
>> file write/close and the file rename to maintain ordering.)
>>
>> Hopefully I’ve interpreted the original question correctly, because this
>> is something I would find interesting as well.
>
> Yes, you did.

Somebody?

Olaf

2010-12-23 21:51:39

by NeilBrown

[permalink] [raw]
Subject: Re: Atomic non-durable file write API

On Thu, 23 Dec 2010 16:49:53 +0100 Olaf van der Spek <[email protected]>
wrote:

> On Sun, Dec 19, 2010 at 5:39 PM, Olaf van der Spek <[email protected]> wrote:
> > On Sat, Dec 18, 2010 at 11:15 PM, Calvin Walton <[email protected]> wrote:
> >> Hmm. I’m doing a little interpretation of what Olaf said here; but I
> >> think you may have misunderstood the question?
> >>
> >> He doesn’t care about whether or not the file is securely written to
> >> disk (durable); however he doesn’t want to see any partially written
> >> files. In other words, something like
> >>
> >>     1. Write to temp file
> >>     2. Rename temp file over original file
> >
> > Meta data, including file owner, should be preserved.
> > Ideally, no temp files should be visible either.
> >
> >> Where the rename is only committed to disk once the entire contents of
> >> the file have been written securely – whenever that may eventually
> >> happen.
> >>
> >> He doesn’t want to synchronously wait for the file to be written,
> >> because the new data isn’t particularly important. The only important
> >> thing is that the file either contains the old or new data after a
> >> filesystem crash; not incomplete data. So, it’s more of an ordering
> >> problem, I think? (Analogous to putting some sort of barrier between the
> >> file write/close and the file rename to maintain ordering.)
> >>
> >> Hopefully I’ve interpreted the original question correctly, because this
> >> is something I would find interesting as well.
> >
> > Yes, you did.
>
> Somebody?
>

You are asking for something that doesn't exist, which is why no-one can tell
you what the answer is.

The only mechanism for synchronising different filesystem operations is
fsync. You should use that.

If it is too slow, use data journalling, and place your journal on a
small low-latency device (NVRAM??)

NeilBrown

2010-12-23 22:22:11

by Theodore Ts'o

[permalink] [raw]
Subject: Re: Atomic non-durable file write API

On Fri, Dec 24, 2010 at 08:51:26AM +1100, Neil Brown wrote:
> You are asking for something that doesn't exist, which is why no-one can tell
> you what the answer is.

Basically, file systems are not databases, and databases are not file
systems. There seems to be an unfortunate tendency for application
programmers to want to use file systems as databases, and they suffer
as a result.

Among other things, file systems have to be fast at a very wide variety
of operations, including compiles, and we don't have ways for people
to explicitly delineate transaction boundaries. And of course,
everyone else has different ideas of what kind of consistency
guarantees they want.

You may *say* that you don't care which version of the file you get
after a rename, but only that one or the other is valid, but what if
some other program reads from the file, gets the new version, and
sends out a network message saying the rename was successful, but then
a crash happens and the rename is undone? There's a reason why
databases block reads of a modified row until the transaction is
completed or rolled back.

> The only mechanism for synchronising different filesystem operations is
> fsync. You should use that.
>
> If it is too slow, use data journalling, and place your journal on a
> small low-latency device (NVRAM??)

Or use a real database, and don't try to assume you will get database
semantics when you try to write to multiple small files.

Or you can use various compromise solutions which provide lesser or
greater guarantees: for example:

1. sync_file_range(fd, 0, 0, SYNC_FILE_RANGE_WRITE);

2. sync_file_range(fd, 0, 0, SYNC_FILE_RANGE_WRITE |
SYNC_FILE_RANGE_WAIT_AFTER);

3. fdatasync(fd);

4. fsync(fd);

are four different things you can do, listed in order of increasing
cost, and also increasing guarantees that you will survive a system
crash, or a power cut (only the last two will guarantee data survival
after a power cut).

If you don't care about the mod-time, fdatasync() could be less costly
than fsync().
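
As a concrete, untested sketch of how these plug into the usual
write-temp-file-and-rename sequence (file names are illustrative), with
option #2 shown; swap the sync_file_range() call for fdatasync() or fsync()
to move down the list:

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int save_file(const char *path, const char *buf, size_t len)
{
    char tmp[4096];
    snprintf(tmp, sizeof(tmp), "%s.tmp", path);  /* temp name on the same fs */

    int fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;
    if (write(fd, buf, len) != (ssize_t)len)
        goto fail;
    /* option #2: initiate writeback and wait for it to complete */
    if (sync_file_range(fd, 0, 0, SYNC_FILE_RANGE_WRITE |
                                  SYNC_FILE_RANGE_WAIT_AFTER))
        goto fail;
    if (close(fd))
        return -1;
    return rename(tmp, path);    /* the atomic namespace switch */

fail:
    close(fd);
    unlink(tmp);
    return -1;
}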

If you only care about a 3D game crashing the system when it exits
(which some Ubuntu users using Nvidia drivers think is normal;
sigh...), but not what happens on a power cut, then maybe option #2 is
enough.

The implementors of a number of mainstream file systems (i.e., ext4,
btrfs, XFS) have agreed to do the equivalent of #1 (i.e., initiating
writeback, but not necessarily waiting for the writeback to complete)
in the case of a rename that replaces an existing file. Some file
systems may do chose to do more (i.e., either waiting for the
writeback to complete: #2) or actually issuing a barrier operation
(#3, which is way more expensive), but some of these will slow down
source tree builds, where in truth people *really* don't care if a
file is trashed on a crash or power failure, since you can always
regenerate a file by rerunning "make".

But for the crazy kids who want to write several hundred small files
when a GNOME or KDE application exits (one file for the X coordinate
for the window, another file for the Y coordinate, another file for
the height of the window, another file for the width of the window,
etc....) --- cut it out. That way lies insanity; use something
like sqlite instead and batch all of your updates into a single atomic
update. Or don't use crappy proprietary drivers that will crash your
system at arbitrary (and commonplace) times.
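
(Something along these lines with the sqlite3 C API -- table and file names
invented for the example -- turns all of those little settings files into
one atomic update:)

#include <stdio.h>
#include <sqlite3.h>

int save_window_geometry(int x, int y, int w, int h)
{
    sqlite3 *db;
    char sql[512];
    int rc;

    if (sqlite3_open("settings.db", &db) != SQLITE_OK)
        return -1;
    snprintf(sql, sizeof(sql),
             "CREATE TABLE IF NOT EXISTS settings(key TEXT PRIMARY KEY, value INT);"
             "BEGIN;"
             "REPLACE INTO settings VALUES('x',%d);"
             "REPLACE INTO settings VALUES('y',%d);"
             "REPLACE INTO settings VALUES('w',%d);"
             "REPLACE INTO settings VALUES('h',%d);"
             "COMMIT;", x, y, w, h);
    rc = sqlite3_exec(db, sql, NULL, NULL, NULL);   /* one atomic batch */
    sqlite3_close(db);
    return rc == SQLITE_OK ? 0 : -1;
}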

- Ted


2010-12-23 22:43:09

by Dave Chinner

[permalink] [raw]
Subject: Re: Atomic non-durable file write API

On Thu, Dec 23, 2010 at 04:49:53PM +0100, Olaf van der Spek wrote:
> On Sun, Dec 19, 2010 at 5:39 PM, Olaf van der Spek <[email protected]> wrote:
> > On Sat, Dec 18, 2010 at 11:15 PM, Calvin Walton <[email protected]> wrote:
> >> Hmm. I’m doing a little interpretation of what Olaf said here; but I
> >> think you may have misunderstood the question?
> >>
> >> He doesn’t care about whether or not the file is securely written to
> >> disk (durable); however he doesn’t want to see any partially written
> >> files. In other words, something like
> >>
> >>     1. Write to temp file
> >>     2. Rename temp file over original file
> >
> > Meta data, including file owner, should be preserved.
> > Ideally, no temp files should be visible either.
> >
> >> Where the rename is only committed to disk once the entire contents of
> >> the file have been written securely – whenever that may eventually
> >> happen.
> >>
> >> He doesn’t want to synchronously wait for the file to be written,
> >> because the new data isn’t particularly important. The only important
> >> thing is that the file either contains the old or new data after a
> >> filesystem crash; not incomplete data. So, it’s more of an ordering
> >> problem, I think? (Analogous to putting some sort of barrier between the
> >> file write/close and the file rename to maintain ordering.)
> >>
> >> Hopefully I’ve interpreted the original question correctly, because this
> >> is something I would find interesting as well.
> >
> > Yes, you did.
>
> Somebody?

So you are looking for something like:

http://developer.apple.com/library/mac/#DOCUMENTATION/Darwin/Reference/ManPages/man2/exchangedata.2.html

?
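
(For reference, the call documented there is used roughly as below -- a
Darwin-only sketch based on the man page, not something tested here:)

#include <unistd.h>   /* exchangedata() is declared here per the man page */

int swap_contents(void)
{
    /* atomically swap the data of "file.txt" with the freshly
     * written "file.txt.new"; each file keeps its own metadata */
    return exchangedata("file.txt", "file.txt.new", 0);
}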

Cheers,

Dave.
--
Dave Chinner
[email protected]

2010-12-23 22:47:45

by Theodore Ts'o

[permalink] [raw]
Subject: Re: Atomic non-durable file write API

On Fri, Dec 24, 2010 at 09:43:09AM +1100, Dave Chinner wrote:
>
> So you are looking for something like:
>
> http://developer.apple.com/library/mac/#DOCUMENTATION/Darwin/Reference/ManPages/man2/exchangedata.2.html
>

It doesn't look like the man page for exchangedata() states what
happens if the system crashes. It says "atomic" the same way the
rename() system call says it is "atomic".... i.e., processes running on
the system see either the pre-exchange or post-exchange state.

- Ted

2010-12-24 11:14:22

by Olaf van der Spek

[permalink] [raw]
Subject: Re: Atomic non-durable file write API

On Fri, Dec 24, 2010 at 10:51 AM, Ted Ts'o <[email protected]> wrote:
> On Fri, Dec 24, 2010 at 02:00:13AM +0100, Christian Stroetmann wrote:
>> I really do know what you want to say, despite that this example is
>> based on a bug in another system than the FS. But there will be
>> other examples, for sure.
>
> Sure, but this thread started because someone wanted an "atomic
> non-durable file write API", apparently because it was too slow to use
> fsync().  If people use databases, it's not a problem; databases use
> fsync(), but they use it properly and they provide the proper
> transactional interfaces that people want.
>
> The problem comes when people try to implement their own databases
> using small files for each row and column of the database, or for each
> registry variable.  Then they complain when fsync() is too expensive,
> because they need to use fsync() for every single 3 bytes of data they
> store in their badly implemented database.
>
> The bottom line is that if you want atomic updates of state
> information, you need to use fsync() or fdatasync().  If this is a
> performance bottleneck, then you're doing something wrong.  Maybe you
> shouldn't be writing a third of a megabyte on every URL click, on the
> main GUI thread; maybe the user doesn't need to remember every single
> URL that was visited even if the power suddenly fails (maybe it's
> enough if you write that information to disk every 3-5 minutes, and
> less if you're running on battery).  Or maybe you shouldn't be using
> hundreds of small state files, and screw up the dirty flag handling.
> But regardless, you're doing something wrong/stupid.

Hi Ted,

Thanks for taking the time to answer. The thread was started due to
the dpkg issue.
The questions were:
> What is the recommended way for atomic non-durable (complete) file writes?

It seems you're saying fsync is required, but why can't atomic be
provided without durable? Is it just an API issue?

If rename is recommended, how does one preserve meta-data including file owner?

> I'm also wondering why FSs commit after open/truncate but before
> write/close. AFAIK this isn't necessary and thus suboptimal.

Olaf

2010-12-24 11:17:48

by Olaf van der Spek

[permalink] [raw]
Subject: Re: Atomic non-durable file write API

On Thu, Dec 23, 2010 at 10:51 PM, Neil Brown <[email protected]> wrote:
> You are asking for something that doesn't exist, which is why no-one can tell
> you what the answer is.

It seems like a very common and basic operation. If it doesn't exist
IMO it should be created.

> The only mechanism for synchronising different filesystem operations is
> fsync.  You should use that.
>
> If it is too slow, use data journalling, and place your journal on a
> small low-latency device (NVRAM??)

This isn't about some DB-like app, it's about normal file writes, like
archive extractions, compiling, editors, etc.

Olaf

2010-12-25 03:15:37

by Theodore Ts'o

[permalink] [raw]
Subject: Re: Atomic non-durable file write API

On Fri, Dec 24, 2010 at 12:14:21PM +0100, Olaf van der Spek wrote:
>
> Thanks for taking the time to answer. The thread was started due to
> the dpkg issue.

I've talked to the dpkg folks and I believe they are squared away; for
their use case sync_file_range() combined with fsync() should solve
their reliability and performance problem.

- Ted

2010-12-25 10:41:48

by Olaf van der Spek

[permalink] [raw]
Subject: Re: Atomic non-durable file write API

On Sat, Dec 25, 2010 at 4:15 AM, Ted Ts'o <[email protected]> wrote:
> On Fri, Dec 24, 2010 at 12:14:21PM +0100, Olaf van der Spek wrote:
>>
>> Thanks for taking the time to answer. The thread was started due to
>> the dpkg issue.
>
> I've talked to the dpkg folks and I believe they are squared away; for
> their use case sync_file_range() combined with fsync() should solve
> their reliability and performance problem.

It's not just about dpkg, I'm still very interested in answers to my
original questions.

Olaf

2010-12-25 11:34:01

by Nicholas Piggin

[permalink] [raw]
Subject: Re: Atomic non-durable file write API

On Sat, Dec 25, 2010 at 9:41 PM, Olaf van der Spek <[email protected]> wrote:
> On Sat, Dec 25, 2010 at 4:15 AM, Ted Ts'o <[email protected]> wrote:
>> On Fri, Dec 24, 2010 at 12:14:21PM +0100, Olaf van der Spek wrote:
>>>
>>> Thanks for taking the time to answer. The thread was started due to
>>> the dpkg issue.
>>
>> I've talked to the dpkg folks and I believe they are squared away; for
>> their use case sync_file_range() combined with fsync() should solve
>> their reliability and performance problem.
>
> It's not just about dpkg, I'm still very interested in answers to my
> original questions.

Arbitrary atomic but non-durable file write operation? That's significantly
different to how any part of the pagecache or filesystem or syscall API
is set up. Writes are not atomic, and syncs are only for durability (not
atomicity); atomicity is typically built on top of these durable points.

That is quite fundamental functionality and suits simple
implementations of filesystems and writeback caches.

If you start building complex atomicity semantics, then you get APIs
which can't be supported by all filesystems, which add complexity from
the API through to the pagecache and to the filesystems, and which are
Linux specific.

Compare that to using cross-platform, mature and well-tested sqlite
or bdb: how much reason do we have for implementing such APIs?

It's not that it isn't possible, it's that there is no way we're adding
such a thing unless it really helps and is going to be widely used.

What exact use case do you have in mind, and what exact API
semantics do you want, anyway?

2010-12-25 15:24:22

by Olaf van der Spek

[permalink] [raw]
Subject: Re: Atomic non-durable file write API

On Sat, Dec 25, 2010 at 12:33 PM, Nick Piggin <[email protected]> wrote:
>> It's not just about dpkg, I'm still very interested in answers to my
>> original questions.
>
> Arbitrary atomic but non-durable file write operation?

No, not arbitrary writes. It's about complete file writes.
Also, don't forget my question about how to preserve meta-data
including file owner.

> That's significantly
> different to how any part of the pagecache or filesystem or syscall API
> is set up. Writes are not atomic, and syncs are only for durability (not
> atomicity), atomicity is typically built on top of these durable points.
>
> That is quite fundamental functionality and suits simple
> implementations of filesystems and writeback caches.
>
> If you start building complex atomicity semantics, then you get APIs

Atomic semantics are not (that) complex.

> which can't be supported by all filesystems, which add complexity from
> the API through to the pagecache and to the filesystems, and which are
> Linux specific.

> Compare that to using cross platform, mature and well tested sqlite
> or bdb, how much reason do we have for implementing such APIs?

Like I said before, it's not about DB-like functionality but about
complete file writes/updates. For example, I've got a file in an
editor and I want to save it.

> It's not that it isn't possible, it's that there is no way we're adding
> such a thing unless it really helps and is going to be widely used.
>
> What exact use case do you have in mind, and what exact API
> semantics do you want, anyway?

Let me copy the original post:
Writing a temp file, fsync, rename is often proposed. However, the
durable aspect of fsync isn't always required and this way has other
issues, like losing file meta-data.
What is the recommended way for atomic non-durable (complete) file writes?

I'm also wondering why FSs commit after open/truncate but before
write/close. AFAIK this isn't necessary and thus suboptimal.

Olaf

2010-12-25 17:25:28

by Nicholas Piggin

[permalink] [raw]
Subject: Re: Atomic non-durable file write API

On Sun, Dec 26, 2010 at 2:24 AM, Olaf van der Spek <[email protected]> wrote:
> On Sat, Dec 25, 2010 at 12:33 PM, Nick Piggin <[email protected]> wrote:
>>> It's not just about dpkg, I'm still very interested in answers to my
>>> original questions.
>>
>> Arbitrary atomic but non-durable file write operation?
>
> No, not arbitrary writes. It's about complete file writes.

You still haven't defined exactly what you want.


> Also, don't forget my question about how to preserve meta-data
> including file owner.
>
>> That's significantly
>> different to how any part of the pagecache or filesystem or syscall API
>> is set up. Writes are not atomic, and syncs are only for durability (not
>> atomicity), atomicity is typically built on top of these durable points.
>>
>> That is quite fundamental functionality and suits simple
>> implementations of filesystems and writeback caches.
>>
>> If you start building complex atomicity semantics, then you get APIs
>
> Atomic semantics are not (that) complex.

That is something to be argued over patches. What is not in question
is that an atomic API is more complex than none :)


>> which can't be supported by all filesystems, which add complexity from
>> the API through to the pagecache and to the filesystems, and which are
>> Linux specific.
>
>> Compare that to using cross platform, mature and well tested sqlite
>> or bdb, how much reason do we have for implementing such APIs?
>
> Like I said before, it's not about DB-like functionality but about
> complete file writes/updates. For example, I've got a file in an
> editor and I want to save it.

I don't understand your example, because in that case you surely
want durability.


>> It's not that it isn't possible, it's that there is no way we're adding
>> such a thing unless it really helps and is going to be widely used.
>>
>> What exact use case do you have in mind, and what exact API
>> semantics do you want, anyway?
>
> Let me copy the original post:
> Writing a temp file, fsync, rename is often proposed. However, the
> durable aspect of fsync isn't always required

So you want a way to atomically replace the contents of a file with
new contents, in a way which completes asynchronously and lazily,
and your new contents will eventually just appear sometime after
they are guaranteed to be on disk?

You would need to create an unlinked inode with dirty data, and then
have callbacks from pagecache writeback checking when the inode
is cleaned, and then call appropriate filesystem routines to sync and
issue barriers etc, and rename the old name to the new inode.

You will also need to have a chain of inodes representing ordering of
the updates so the renames can be performed in the right order. And
add some hooks to solve the metadata issue.

Then what happens when you fsync the original file? What if the
original file is renamed or unlinked? How do you sync the outstanding
queue of updates?

Once you solve all those problems, then people will ask you to now
solve them for multiple files at once because they also have some
great use-case that is surely nothing like databases.

Please tell us what for. If you have immediate need to replace the
name, then you need the durability of fsync. If you don't have
immediate need, then you can use another name, surely (until it
comes time you want to switch names, at that point you want
durability so you fsync then rename).


> and this way has other
> issues, like losing file meta-data.

Yes that's true, if you're not owner you may not be able to recreate
most of it. Did you need to?


> What is the recommended way for atomic non-durable (complete) file writes?

There really isn't one. Like I said, there is not much atomicity
semantics in the API, which works really well because it is simple
to implement and to use (although apparently still far too complex
for some programmers to get right).

If we start adding atomicity beyond fundamental requirement of
namespace operations, then where does it end? Why would it make
sense to add atomicity for writes to one file, but not writes to 2 files?
What if you require atomic multiple modifications to directory
structure as well as file updates? And why only writes? What about
atomic reads of several things? What isolation level should all of that
have, and how to solve deadlocks?


> I'm also wondering why FSs commit after open/truncate but before
> write/close. AFAIK this isn't necessary and thus suboptimal.

I don't know, can you expand on this? What fses are you talking
about, and what behaviour.

Thanks,
Nick

2010-12-25 21:40:07

by NeilBrown

[permalink] [raw]
Subject: Re: Atomic non-durable file write API

On Fri, 24 Dec 2010 12:17:46 +0100 Olaf van der Spek <[email protected]>
wrote:

> On Thu, Dec 23, 2010 at 10:51 PM, Neil Brown <[email protected]> wrote:
> > You are asking for something that doesn't exist, which is why no-one can tell
> > you what the answer is.
>
> It seems like a very common and basic operation. If it doesn't exist
> IMO it should be created.
>
> > The only mechanism for synchronising different filesystem operations is
> > fsync.  You should use that.
> >
> > If it is too slow, use data journalling, and place your journal on a
> > small low-latency device (NVRAM??)
>
> This isn't about some DB-like app, it's about normal file writes, like
> archive extractions, compiling, editors, etc.
>

Yes, it might be nice to have a very low cost way to make those safer against
corruption during a crash.
It would have to be *very* low cost as in most cases the cost of cleaning up
after the crash instead (e.g. 'make clean') is quite low. But people do
sometime edit /etc/init.d files with an ordinary editor and it would be
rather embarrassing if a crash just at the wrong time left some critical file
incomplete, and maybe it would be easier to teach editors to fsync before
rename for files in /etc .....

So what would this mechanism really look like? I think the proposal is to
delay committing the rename until the writeout of the file is complete,
without accelerating the writeout.
That would probably require delaying all updates to the directory until the
writeout was complete, as trying to reason about which changes were dependent
and which were independent is unlikely to be easy.

So as soon as you rename a file, you create a dependency between the file and
the directory such that no update for the directory may be written while any
page in the file is dirty. Conversely, any fsync of the directory would
fsync the file as well.

Any write to the file should probably break the dependency as you can no
longer be sure what exactly the rename was supposed to protect.

I suspect that much of the infrastructure for this could be implemented in
the VFS/VM. Certainly the dependency linkage between inodes, created on
rename, destroyed on write or fsync or when writeout on the inode completes,
and the fsync dependency could be common code. Preventing writeout of
directories with dependent files would need some fs interaction. You could
probably prototype this in ext2 quite easily to do some testing and collect
some numbers on overhead.

I think this would be an interesting project for someone to do and I would be
happy to review any patches. Whether it ever got further than an interesting
project would depend very much on how intrusive it was to other filesystems,
how much overhead it caused, and what actual benefits resulted.
If anyone wanted to pursue this idea, they would certainly need to address
each of those in their final proposal.

I think there could be room for improved transactional semantics in Linux
filesystems. This might be what they should look like ... don't know yet.

NeilBrown

2010-12-26 09:59:44

by Amir Goldstein

[permalink] [raw]
Subject: Re: Atomic non-durable file write API

On Fri, Dec 24, 2010 at 12:47 AM, Ted Ts'o <[email protected]> wrote:
> On Fri, Dec 24, 2010 at 09:43:09AM +1100, Dave Chinner wrote:
>>
>> So you are looking for something like:
>>
>> http://developer.apple.com/library/mac/#DOCUMENTATION/Darwin/Reference/ManPages/man2/exchangedata.2.html
>>
>
> It doesn't look like the man page for exchangedata() states what
> happens if the system crashes.  It says "atomic" the same way the
> rename() system call says it is "atomic".... i.e., processes running on
> the system see either the pre-exchange or post-exchange state.
>

Since we already have the EXT4_IOC_MOVE_EXT ioctl, it might as well be
used for the purpose of 'safe save', in the same manner that
exchangedata() works.

Whether or not the new data is durable is entirely up to the
programmer to decide.
Perhaps all that is missing is an fdatawait(fd) API or fdatawait_async(fd) API,
which should be very simple to implement (right?).

So an editor that doesn't want to be too pushy will just save a temp file,
wait for it to sync in the system's free time, and then swap the data with
the original file in an atomic manner, which preserves metadata.

If the system takes too long to sync, the editor can always issue
fdatasync(fd) when it is tired of waiting.

Will that make you happy, Olaf?
If you are not happy with the new copy not being available to all
system users until fdatasync is done, then we will call it an "atomic,
isolated, non-durable file write API", OK?

Amir.

2010-12-26 15:08:12

by Olaf van der Spek

[permalink] [raw]
Subject: Re: Atomic non-durable file write API

On Sat, Dec 25, 2010 at 6:25 PM, Nick Piggin <[email protected]> wrote:
>> No, not arbitrary writes. It's about complete file writes.
>
> You still haven't defined exactly what you want.

Do you not understand what is meant by a complete file write?

>> Atomic semantics are not (that) complex.
>
> That is something to be argued over patches. What is not in question
> is that an atomic API is more complex than none :)

That's implementation complexity, not concept/semantics complexity.

>> Like I said before, it's not about DB-like functionality but about
>> complete file writes/updates. For example, I've got a file in an
>> editor and I want to save it.
>
> I don't understand your example, because in that case you surely
> want durability.

Hmm, true, bad example, although it depends on editor/user.
Let's take archive extraction instead.

>> Let me copy the original post:
>> Writing a temp file, fsync, rename is often proposed. However, the
>> durable aspect of fsync isn't always required
>
> So you want a way to atomically replace the contents of a file with
> new contents, in a way which completes asynchronously and lazily,
> and your new contents will eventually just appear sometime after
> they are guaranteed to be on disk?

Almost. Visibility to other processes should be normal (I don't know the
exact rules), but commit to disk may be deferred.

> You would need to create an unlinked inode with dirty data, and then
> have callbacks from pagecache writeback checking when the inode
> is cleaned, and then call appropriate filesystem routines to sync and
> issue barriers etc, and rename the old name to the new inode.

That's an implementation detail, but yes, something like that.

> You will also need to have a chain of inodes representing ordering of
> the updates so the renames can be performed in the right order. And
> add some hooks to solve the metadata issue.
>
> Then what happens when you fsync the original file? What if the
> original file is renamed or unlinked? How do you sync the outstanding
> queue of updates?

Logically those actions would happen after the atomic data update.
The fsync would be done on a now unlinked file (if done via fd). The
rename would be done on the new file. Same for unlink.

> Once you solve all those problems, then people will ask you to now
> solve them for multiple files at once because they also have some
> great use-case that is surely nothing like databases.

I don't want to play the what if game.

> Please tell us what for. If you have immediate need to replace the
> name, then you need the durability of fsync. If you don't have
> immediate need, then you can use another name, surely (until it
> comes time you want to switch names, at that point you want
> durability so you fsync then rename).

Temp file, rename has issues with losing meta-data.

>
>> and this way has other
>> issues, like losing file meta-data.
>
> Yes that's true, if you're not owner you may not be able to recreate
> most of it. Did you need to?

Yes

>
>> What is the recommended way for atomic non-durable (complete) file writes?
>
> There really isn't one. Like I said, there is not much atomicity
> semantics in the API, which works really well because it is simple
> to implement and to use (although apparently still far too complex
> for some programmers to get right).

It's simple to implement but it's not simple to use right.

> If we start adding atomicity beyond fundamental requirement of
> namespace operations, then where does it end? Why would it make
> sense to add atomicity for writes to one file, but not writes to 2 files?
> What if you require atomic multiple modifications to directory
> structure as well as file updates? And why only writes? What about
> atomic reads of several things? What isolation level should all of that
> have, and how to solve deadlocks?
>
>
>> I'm also wondering why FSs commit after open/truncate but before
>> write/close. AFAIK this isn't necessary and thus suboptimal.
>
> I don't know, can you expand on this? What fses are you talking
> about, and what behaviour.

The zero size issues of ext4 (before some patch). Presumably because
some apps do open, truncate, write, close on a file. I'm wondering why
an FS commits between truncate and write.

Olaf

2010-12-26 15:23:08

by Olaf van der Spek

[permalink] [raw]
Subject: Re: Atomic non-durable file write API

On Sun, Dec 26, 2010 at 10:59 AM, Amir Goldstein <[email protected]> wrote:
> Whether or not the new data is durable is entirely up to the
> programmer to decide.

Right

> Perhaps all that is missing is an fdatawait(fd) API or fdatawait_async(fd) API,
> which should be very simple to implement (right?).
>
> So an editor that doesn't want to be too pushy, will just save a temp file,
> wait for it to sync on the system's free time and then swap the data with
> the original file in an atomic manner, which preserves metadata.
>
> If the system takes too long to sync, the editor can always issue
> fdatasync(fd) when it is tired of waiting.
>
> Will that make you happy, Olaf?
> If you are not happy from the new copy not being available to all
> system users until
> fdatasync is done, then we will call it "atomic, isolated non-durable
> file write API", OK?

No. Take the compiler case. Ideally you'd like file data updates to be
atomic, but waiting until an update hits disk before it's visible to
other processes is unacceptable.

IMO the use case for atomic non-durable file writes is very broad, so
you don't want to have these kinds of exceptions.

Olaf

2010-12-26 15:55:26

by Boaz Harrosh

[permalink] [raw]
Subject: Re: Atomic non-durable file write API

On 12/26/2010 05:08 PM, Olaf van der Spek wrote:
>> Please tell us what for. If you have immediate need to replace the
>> name, then you need the durability of fsync. If you don't have
>> immediate need, then you can use another name, surely (until it
>> comes time you want to switch names, at that point you want
>> durability so you fsync then rename).
>
> Temp file, rename has issues with losing meta-data.
>

What if you use a soft link? Wouldn't that solve all of your problems?

- Do your fsync/fdatasync of choice in a *background thread*; then, at the
  return point, set the new link and fsync the link (it's very small,
  therefore fast).
- Then delete the old source file.

You need a simple "name-version" scheme, and the "name" is kept soft-linked.
(You might even skip the last step above and implement an undo stack, with
some background management that caps the history size.)
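
A rough sketch of the idea (done synchronously here for brevity; a real
implementation would push the fsync to the background thread, and all names
are invented):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int publish_version(const char *name, const char *version_file)
{
    /* make sure the new version's data is on disk first */
    int fd = open(version_file, O_RDONLY);
    if (fd < 0)
        return -1;
    fsync(fd);                    /* would run on the background thread */
    close(fd);

    /* point a temp symlink at the new version, then atomically
     * rename it over the real "name" */
    char tmplink[4096];
    snprintf(tmplink, sizeof(tmplink), "%s.lnk", name);
    unlink(tmplink);              /* drop any stale temp link */
    if (symlink(version_file, tmplink))
        return -1;
    return rename(tmplink, name);
}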

>>
>>> and this way has other
>>> issues, like losing file meta-data.
>>

With soft links this is preserved?

The same scheme can be used with lots of files, where the final switch is
the update of a single soft link, say to a folder of related files.

Just my $0.017
Boaz

2010-12-26 16:02:34

by Olaf van der Spek

[permalink] [raw]
Subject: Re: Atomic non-durable file write API

On Sun, Dec 26, 2010 at 4:55 PM, Boaz Harrosh <[email protected]> wrote:
> What if you use a soft link? wouldn't that solve all of your problems?
>
> - do your fsync/fdatasync of choice in a *backend thread* then at the return
> - point set to the new link, fsync the link it's very small, therefore fast.
> - Then delete the old source file.
>
> You need a simple "name-version" schema and the "name" is kept soft linked.
> (You might even skip the last step above and implement an undo stack, some
>  background management caps on history size)
>
>>>
>>>> and this way has other
>>>> issues, like losing file meta-data.
>>>
>
> With soft links this is persevered?
>
> Same system can be used with lots of files. where the final switch is
> the set of a single soft-link say to a folder of related files.

Are you proposing to turn every single file into a symlink?
How would that solve the meta-data issue?

Olaf

2010-12-26 16:27:37

by Boaz Harrosh

[permalink] [raw]
Subject: Re: Atomic non-durable file write API

On 12/26/2010 06:02 PM, Olaf van der Spek wrote:
> On Sun, Dec 26, 2010 at 4:55 PM, Boaz Harrosh <[email protected]> wrote:
>> What if you use a soft link? wouldn't that solve all of your problems?
>>
>> - do your fsync/fdatasync of choice in a *backend thread* then at the return
>> - point set to the new link, fsync the link it's very small, therefore fast.
>> - Then delete the old source file.
>>
>> You need a simple "name-version" schema and the "name" is kept soft linked.
>> (You might even skip the last step above and implement an undo stack, some
>> background management caps on history size)
>>
>>>>
>>>>> and this way has other
>>>>> issues, like losing file meta-data.
>>>>
>>
>> With soft links this is persevered?
>>
>> Same system can be used with lots of files. where the final switch is
>> the set of a single soft-link say to a folder of related files.
>
> Are you proposing to turn every single file into a symlink?

Sure, a symlink and a "versioned" file for every object. Something similar
to the silly rename of nfs.

Even if you have 1000 files that need the same atomicity treatment,
that's not that bad. You should be able to devise a namespace policy
that keeps all this neat and tidy.

> How would that solve the meta-data issue?
>

That's what I asked. Do you want to preserve the original file's
meta-data, or the meta-data of the owner of the new content?
In the first case you'll need a meta-data copy like the one tar
uses.
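
(For the mode/owner/timestamps part, such a copy is only a few calls -- an
untested sketch below; the fchown() will normally fail with EPERM for a
non-root caller, which is the limitation being discussed, and ACLs/xattrs
would need extra handling:)

#define _GNU_SOURCE
#include <sys/stat.h>
#include <unistd.h>

int copy_metadata(int orig_fd, int tmp_fd)
{
    struct stat st;

    if (fstat(orig_fd, &st))
        return -1;
    if (fchmod(tmp_fd, st.st_mode & 07777))
        return -1;
    if (fchown(tmp_fd, st.st_uid, st.st_gid))
        return -1;                   /* EPERM without privilege */
    struct timespec times[2] = { st.st_atim, st.st_mtim };
    return futimens(tmp_fd, times);
}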

> Olaf

The point is to fsync/fdatasync on a background thread and continue
from there, where the application is free to go on to the next step,
as if you had a notification when the commit was done (in the background).
So you make it an async pipeline model. The version-naming scheme is so the
pipeline can get arbitrarily big.

Boaz

2010-12-26 16:43:00

by Nicholas Piggin

[permalink] [raw]
Subject: Re: Atomic non-durable file write API

On Mon, Dec 27, 2010 at 2:08 AM, Olaf van der Spek <[email protected]> wrote:
> On Sat, Dec 25, 2010 at 6:25 PM, Nick Piggin <[email protected]> wrote:
>>> No, not arbitrary writes. It's about complete file writes.
>>
>> You still haven't defined exactly what you want.
>
> Do you not understand what is meant by a complete file write?

It is not a rigorous definition. What I understand it to mean may be
different from what you understand it to mean. Particularly when you
consider what the actual API should look like and how it interacts with
the rest of the APIs.


>>> Atomic semantics are not (that) complex.
>>
>> That is something to be argued over patches. What is not in question
>> is that an atomic API is more complex than none :)
>
> That's implementation complexity, not concept/semantics complexity.

It is both. "atomic complete file write" is not sufficient at all.


>>> Like I said before, it's not about DB-like functionality but about
>>> complete file writes/updates. For example, I've got a file in an
>>> editor and I want to save it.
>>
>> I don't understand your example, because in that case you surely
>> want durability.
>
> Hmm, true, bad example, although it depends on editor/user.
> Let's take archive extraction instead.

OK, so please show how it helps.

If it is a multi-file/dir archive, then you could equally well come back in
an inconsistent state after crashing with some files extracted and
some not, without an atomic-write-multiple-files-and-directories API.


>>> Let me copy the original post:
>>> Writing a temp file, fsync, rename is often proposed. However, the
>>> durable aspect of fsync isn't always required
>>
>> So you want a way to atomically replace the contents of a file with
>> new contents, in a way which completes asynchronously and lazily,
>> and your new contents will eventually just appear sometime after
>> they are guaranteed to be on disk?
>
> Almost. Visibility to other processes should be normal (I don't know the
> exact rules), but commit to disk may be deferred.

That's a pretty important detail. What is "normal"? Will a process
see old or new data from the atomic write before the atomic write has
committed to disk?

Is the atomic write guaranteed to take an atomic snapshot of file
and only specified updates?

What happens to subsequent atomic and non atomic writes to the
file?


>> You would need to create an unlinked inode with dirty data, and then
>> have callbacks from pagecache writeback checking when the inode
>> is cleaned, and then call appropriate filesystem routines to sync and
>> issue barriers etc, and rename the old name to the new inode.
>
> That's an implementation detail, but yes, something like that.
>
>> You will also need to have a chain of inodes representing ordering of
>> the updates so the renames can be performed in the right order. And
>> add some hooks to solve the metadata issue.
>>
>> Then what happens when you fsync the original file? What if the
>> original file is renamed or unlinked? How do you sync the outstanding
>> queue of updates?
>
> Logically those actions would happen after the atomic data update.
> The fsync would be done on a now unlinked file (if done via fd). The
> rename would be done on the new file. Same for unlink.
>
>> Once you solve all those problems, then people will ask you to now
>> solve them for multiple files at once because they also have some
>> great use-case that is surely nothing like databases.
>
> I don't want to play the what if game.

You must if you want to design a sane API.


>> Please tell us what for. If you have immediate need to replace the
>> name, then you need the durability of fsync. If you don't have
>> immediate need, then you can use another name, surely (until it
>> comes time you want to switch names, at that point you want
>> durability so you fsync then rename).
>
> Temp file, rename has issues with losing meta-data.

How about solving that easier issue?


>>> and this way has other
>>> issues, like losing file meta-data.
>>
>> Yes that's true, if you're not owner you may not be able to recreate
>> most of it. Did you need to?
>
> Yes
>
>>
>>> What is the recommended way for atomic non-durable (complete) file writes?
>>
>> There really isn't one. Like I said, there is not much atomicity
>> semantics in the API, which works really well because it is simple
>> to implement and to use (although apparently still far too complex
>> for some programmers to get right).
>
> It's simple to implement but it's not simple to use right.

You do not have the ability to have arbitrary atomic transactions to the
filesystem. If you show a problem of a half completed write after crash,
then I can show you a problem of any half completed multi-syscall
operation after crash.

The simple thing is to properly clean up such things after a crash, and
just use an atomic commit somewhere to say whether the file operations
that just completed are now in a durable state. Either that, or use
existing code that does it right.


>> If we start adding atomicity beyond fundamental requirement of
>> namespace operations, then where does it end? Why would it make
>> sense to add atomicity for writes to one file, but not writes to 2 files?
>> What if you require atomic multiple modifications to directory
>> structure as well as file updates? And why only writes? What about
>> atomic reads of several things? What isolation level should all of that
>> have, and how to solve deadlocks?
>>
>>
>>> I'm also wondering why FSs commit after open/truncate but before
>>> write/close. AFAIK this isn't necessary and thus suboptimal.
>>
>> I don't know, can you expand on this? What fses are you talking
>> about, and what behaviour.
>
> The zero size issues of ext4 (before some patch). Presumably because
> some apps do open, truncate, write, close on a file. I'm wondering why
> an FS commits between truncate and write.

I'm still not clear what you mean. Filesystem state may get updated
between any 2 syscalls because the kernel has no oracle or infinite
storage.

2010-12-26 16:52:05

by Nicholas Piggin

[permalink] [raw]
Subject: Re: Atomic non-durable file write API

On Mon, Dec 27, 2010 at 2:23 AM, Olaf van der Spek <[email protected]> wrote:
> On Sun, Dec 26, 2010 at 10:59 AM, Amir Goldstein <[email protected]> wrote:
>> Whether or not the new data is durable is entirely up to the
>> programmer to decide.
>
> Right
>
>> Perhaps all that is missing is an fdatawait(fd) API or fdatawait_async(fd) API,
>> which should be very simple to implement (right?).
>>
>> So an editor that doesn't want to be too pushy, will just save a temp file,
>> wait for it to sync on the system's free time and then swap the data with
>> the original file in an atomic manner, which preserves metadata.
>>
>> If the system takes too long to sync, the editor can always issue
>> fdatasync(fd) when it is tired of waiting.
>>
>> Will that make you happy, Olaf?
>> If you are not happy from the new copy not being available to all
>> system users until
>> fdatasync is done, then we will call it "atomic, isolated non-durable
>> file write API", OK?
>
> No. Take the compiler case. Ideally you'd like file data updates to be
> atomic, but waiting until an update hits disk before it's visible to
> other processes is unacceptable.

How about sending output to an intermediate name, then, before it is
used in the next stage of processing, fsync+rename to the expected
name? make clean would delete any intermediate names.

2010-12-26 18:26:25

by Olaf van der Spek

[permalink] [raw]
Subject: Re: Atomic non-durable file write API

On Sun, Dec 26, 2010 at 5:27 PM, Boaz Harrosh <[email protected]> wrote:
>> Are you proposing to turn every single file into a symlink?
>
> Sure, a symlink and a "versioned" file for every object. Something similar
> to the silly rename of nfs.
>
> Even if you have 1000 files that need the same atomicity treatment
> that's not that bad. You should be able to devise a namespace policy
> that makes all this nit and tidy.

Nearly all files on the system need the atomicity treatment.

>> How would that solve the meta-data issue?
>>
>
> That's what I asked. Do you want to preserve the original's file
> metat-data, or the meta-data of the owner of the new content?
> In the first case you'll need a metat-data copy like tar is
> using.

The original meta-data, of course. Including file owner. AFAIK not
doable without root access.

>> Olaf
>
> The point is to fsync/fdatasync on a background thread and continue
> from there where the application is free to go on to the next step.

Not if it's the last thing a process does and another process is
waiting on it.

> As if you had a notification when the commit was done (in the background).
> So you make it an async pipeline model. The version-naming schem is so the
> pipeline can get arbitrary big.
>
> Boaz
>

2010-12-26 18:51:25

by Olaf van der Spek

[permalink] [raw]
Subject: Re: Atomic non-durable file write API

On Sun, Dec 26, 2010 at 5:43 PM, Nick Piggin <[email protected]> wrote:
>> Do you not understand what is meant by a complete file write?
>
> It is not a rigourous definition. What I understand it to mean may be
> different than what you understand it to mean. Particularly when you
> consider what the actual API should look like and interact with the rest
> of the apis.

f = open(..., O_ATOMIC | O_CREAT | O_TRUNC);
write(...); // 0+ times
abort/rollback(...); // optional
close(f);

> OK, so please show how it helps.
>
> If it is a multi-file/dir archive, then you could equally well come back in
> an inconsistent state after crashing with some files extracted and
> some not, without atomic-write-multiple-files-and-directories API.

True, but at least each file will be valid by itself. So no broken
executables, images or scripts.
Transactions involving multiple files are outside the scope of this discussion.

>> Almost. Visibility to other process should be normal (I don't know the
>> exact rules), but commit to disk may be deferred.
>
> That's pretty important detail. What is "normal"? Will a process
> see old or new data from the atomic write before atomic write has
> committed to disk?

New data. Isn't that the current rule?

> Is the atomic write guaranteed to take an atomic snapshot of file
> and only specified updates?
>
> What happens to subsequent atomic and non atomic writes to the
> file?

It's about an atomic replace of the entire file data. So it's not like
expecting a single write to be atomic.

>>> Once you solve all those problems, then people will ask you to now
>>> solve them for multiple files at once because they also have some
>>> great use-case that is surely nothing like databases.
>>
>> I don't want to play the what if game.
>
> You must if you want to design a sane API.

Providing transaction semantics for multiple files is a far broader
proposal and not necessary to implement this one.

>> Temp file, rename has issues with losing meta-data.
>
> How about solving that easier issue?

That would be nice, but it's not the only issue.
I'm not sure, but Ted appears to be saying temp file + rename (but no
fsync) isn't guaranteed to work either.
There's also the issue of not having permission to create the temp
file, having to ensure the temp file is on the same volume (so the
rename can work).

>> It's simple to implement but it's not simple to use right.
>
> You do not have the ability to have arbitrary atomic transactions to the
> filesystem. If you show a problem of a half completed write after crash,
> then I can show you a problem of any half completed multi-syscall
> operation after crash.

It's not about arbitrary transactions.

> The simple thing is to properly clean up such things after a crash, and
> just use an atomic commit somewhere to say whether the file operations
> that just completed are now in a durable state. Either that or use an
> existing code that does it right.

That's not simple if you're talking about arbitrary processes and files.
It's not even that simple if you're talking about DBs. They do
implement it, but obviously that's not usable for arbitrary files.

>> The zero size issues of ext4 (before some patch). Presumably because
>> some apps do open, truncate, write, close on a file. I'm wondering why
>> an FS commits between truncate and write.
>
> I'm still not clear what you mean. Filesystem state may get updated
> between any 2 syscalls because the kernel has no oracle or infinite
> storage.

It just seems quite suboptimal. There's no need for infinite storage
(or an oracle) to avoid this.

Olaf

2010-12-26 22:10:28

by Theodore Ts'o

[permalink] [raw]
Subject: Re: Atomic non-durable file write API

On Sun, Dec 26, 2010 at 07:51:23PM +0100, Olaf van der Spek wrote:
> f = open(..., O_ATOMIC | O_CREAT | O_TRUNC);

Great, let's rename O_ATOMIC to O_PONIES. :-)

> abort/rollback(...); // optional

As I said earlier, "file systems are not databases", and "databases
are not file systems". Oracle tried to foist their database as a file
system during the dot.com boom, and everyone laughed at them; the
performance was a nightmare. If Oracle wasn't able to make a
transaction engine that supports transactions and rollbacks
performant, do you really expect that you'll be able to do it?

> > If it is a multi-file/dir archive, then you could equally well come back in
> > an inconsistent state after crashing with some files extracted and
> > some not, without atomic-write-multiple-files-and-directories API.
>
> True, but at least each file will be valid by itself. So no broken
> executables, images or scripts.
> Transactions involving multiple files are outside the scope of this
> discussion.

But what's the use case where this is useful and/or interesting? It
certainly doesn't help in the case of dpkg, because you still have to
deal with shell scripts that depend on certain executables being
present, or executables depending on the new version of the shared
library being present. If we're going to give up huge amounts of file
system performance for some use case, it's nice to know what the
real-world use case would actually be. (And again, I believe the dpkg
folks are squared away at this point.)

If the use case is really one of replacing the data while maintaining
the metadata (i.e., ACL's, extended attributes, etc.), we've already
pointed out that in the case of a file editor, you had better have
durability. Keep in mind that if you don't eventually call fsync(),
you'll never know if the file system is full or the user has hit their
quota, and the data can't be lazily written out later. Or in the case
of a networked file system, what if the network connection disappears
before you have a chance to lazily update the data and do the rename?
So before the editor exits, and the last remaining copy of the new
data (in memory) disappears, you had better call fsync() and check to
make sure the write can and has succeeded.
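
To make that concrete, here is a minimal sketch (the helper name and the
simplified error handling are purely illustrative) of the checks an editor
needs to make before it throws away its only remaining copy of the new data:

#include <errno.h>
#include <unistd.h>

/* Write buf to fd and report whether the data really made it out.
 * ENOSPC/EDQUOT -- or a network error on NFS -- may only be reported
 * at fsync() or even close() time, so both return values have to be
 * checked before the in-memory copy is discarded. */
static int write_and_verify(int fd, const char *buf, size_t len)
{
	while (len > 0) {
		ssize_t n = write(fd, buf, len);
		if (n < 0) {
			if (errno == EINTR)
				continue;
			return -errno;
		}
		buf += n;
		len -= n;
	}
	if (fsync(fd) < 0)	/* quota/space/connectivity errors surface here */
		return -errno;
	if (close(fd) < 0)	/* ...or, on some filesystems, only here */
		return -errno;
	return 0;
}

Only once this has succeeded is it safe for the editor to exit, or to
rename the new file over the old one.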

So in the case of replacing the data, what's the use case if it's not
for a file editor? And note that you've said that you want atomicity
because you want to make sure that after a crash you don't lose data.
What about the case where the system doesn't crash, but the wireless
connection goes away, or the user has exceeded his/her quota and they
were trying to replace 4k worth of data fork with 12k worth of data?
I can certainly think of scenarios where wireless connection drops and
quota overruns are far more likely than system crashes. (i.e., when
you're not using proprietary video drivers. :-P)

> Providing transaction semantics for multiple files is a far broader
> proposal and not necessary for implement this proposal.

But providing magic transaction semantics for a single file in the
rename is not at all clearly useful. You need to justify all of this
hard effort, and performance loss. (Well, or if you're so smart you
can implement your own file system that does all of this work, and we
can benchmark it against a file system that doesn't do all of this
work....)

> I'm not sure, but Ted appears to be saying temp file + rename (but no
> fsync) isn't guaranteed to work either.

It won't work if you get really unlucky and your system takes a power
cut right at the wrong moment during or after the rename(). It could
be made to work, but at a performance cost. And the question is
whether the performance cost is worth it. At the end of the day it's
all between the tradeoff between performance cost, implementation
cost, and value to the user and the application programmer. Which is
why you need to articulate the use case where this makes sense.

It's not dpkg, and it's not file editors. What is it, specifically?
And why can it tolerate data loss in the case of quota overruns and
wireless connection hits, but not in the case of system crashes?

> It just seems quite suboptimal. There's no need for infinite storage
> (or an oracle) to avoid this.

If you're so smart, why don't you try implementing it? It's going to
be hard for us to convince you why it's going to be non-trivial and
have huge implementation *and* performance costs, so why don't you
produce the patches that make this all work?

- Ted

2010-12-27 00:29:48

by Christian Stroetmann

[permalink] [raw]
Subject: Re: Atomic non-durable file write API

On the 26.12.2010 23:10, Ted Ts'o wrote:
> On Sun, Dec 26, 2010 at 07:51:23PM +0100, Olaf van der Spek wrote:
>
<snip>
> As I said earlier, "file systems are not databases", and "databases
> are not file systems". Oracle tried to foist their database as a file
> system during the dot.com boom, and everyone laughed at them; the
> performance was a nightmare. If Oracle wasn't able to make a
> transaction engine that supports transactions and rollbacks
> performant, you really expect that you'll be able to do it?

An FS could easily have the rest of the functions of a database
management system (DBMS) as an FSDB, a hybrid if you wish. An example
of such a hybrid is the ext2/3-sqlite FS, and there are only two small
architectural problems: one is related to the structure and naming
scheme of the API, and the other is related to the handling of FS
caching by the programmer and the user due to the many different
options available.

Furthermore, the performance of Oracle's solutions was and still is so
low because they have a file system as a database that is managed by a
DBMS as a file that again is stored in an FS. Can you see now what
causes the loss of performance?
And Oracle fears FSs like R4 that have database(-like) functionalities,
so it took those technical features of R4 for BTRFS, which they
thought could stop its show.
And some months ago Oracle again started to promote its FS in a DB in
an FS concept.

So, there must be something highly interesting about the idea of using
an FS as a DBMS, not only for Oracle, but for at least the four
largest software companies.

<snip>
>
>> Providing transaction semantics for multiple files is a far broader
>> proposal and not necessary for implement this proposal.
> But providing magic transaction semantics for a single file in the
> rename is not at all clearly useful. You need to justify all of this
> hard effort, and performance loss. (Well, or if you're so smart you
> can implement your own file system that does all of this work, and we
> can benchmark it against a file system that doesn't do all of this
> work....)

But then the benchmark must be done correctly, which means that the FS
without transactions must be used with a transaction mechanism provided
by an additional software component. Otherwise the benchmarking would be
worth nothing.

>> I'm not sure, but Ted appears to be saying temp file + rename (but no
>> fsync) isn't guaranteed to work either.
> It won't work if you get really unlucky and your system takes a power
> cut right at the wrong moment during or after the rename(). It could
> be made to work, but at a performance cost. And the question is
> whether the performance cost is worth it. At the end of the day it's
> all between the tradeoff between performance cost, implementation
> cost, and value to the user and the application programmer. Which is
> why you need to articular the use case where this makes sense.

see above

> It's not dpkg, and it's not file editors. What is it, specifically?
> And why can it tolerate data loss in the case of quota overruns and
> wireless connection hits, but not in the case of system crashes?
>
>> It just seems quite suboptimal. There's no need for infinite storage
>> (or an oracle) to avoid this.
> If you're so smart, why don't you try implementing it? Itt's going to
> be hard for us to convince you why it's going to be non-trivial and
> have huge implementation *and* performance costs,

see above

> so why don't you
> produce the patches that makes this all work?
>
> - Ted
>

Christian Stroetmann


2010-12-27 01:04:34

by Theodore Ts'o

[permalink] [raw]
Subject: Re: Atomic non-durable file write API

On Mon, Dec 27, 2010 at 01:30:05AM +0100, Christian Stroetmann wrote:
> An FS could easily have the rest of the functions of a database
> management system (DBMS) as an FSDB, a hybrid if you wish. An
> example for such a hybrid is the ext2/3-sqlite FS...

What are you talking about? If you mean creating a sqlite database on
top of an existing file system, sure that works fine. That's the
right solution. But if you mean trying to access sqlite via a
file-system interface (i.e., via FUSE), I suspect the result will be a
disaster, precisely because the file system API isn't expressive
enough to handle database functionality, and so the result ends up
being a performance disaster. So the answer is "use a database, using
a database API, if you have database requirements".

> Furthermore, the performance of Oracle's solutions was and still is
> so low, because they have a file system as a database that is
> managed by a DBMS as a file that again is stored in an FS. Can you
> see now what does the loss of performance?

It was a disaster from a performance perspective even if the database
was run on top of a raw block device....

> And Oracle fears FSs like R4 that have database(-like)
> functionalities, so it took those technical features of R4 for the
> BTRFS, which they thought could stop its show.
> And also, Oracle has started some months ago again to promote its FS
> in a DB in an FS concept.

I've never heard of the R4 file system, and apparently Google hasn't
either. But if you think BTRFS is a database, you're fooling
yourself. There's a lot more to a database than just using a b-tree.

> So, there must be something that is highly interesting with the idea
> to use an FS as DBMS, not only for Oracle, but at least for the four
> largest software companies.

No, I think you're just utterly confused from a technical perspective.

- Ted

2010-12-27 01:29:55

by Christian Stroetmann

[permalink] [raw]
Subject: Re: Atomic non-durable file write API

On 27.12.2010 02:04, Ted Ts'o wrote:
> On Mon, Dec 27, 2010 at 01:30:05AM +0100, Christian Stroetmann wrote:
>> An FS could easily have the rest of the functions of a database
>> management system (DBMS) as an FSDB, a hybrid if you wish. An
>> example for such a hybrid is the ext2/3-sqlite FS...
> What are you talking about? If you mean creating a sqlite database on
> top of an existing file system, sure that works fine.

No, I don't mean this.

> That's the
> right solution. But if you mean trying to access sqllite via a
> file-system interface (i.e., via FUSE),

No, I don't mean this. I mean taking out the FUSE and do it directly.

> I suspect the result will be a
> disaster, precisely because the file system API isn't expressive
> enough to handle database functionality, and so the result ends up
> being a performance disaster.

Three times wrong:
Firstly, the result won't be a disaster.
Secondly, I already said in the previous e-mail that the file system API
would be extended by this additional database functionality, which is
only a small architectural problem.
Thirdly, it won't end up in a performance disaster.

> So the answer is "use a database, using
> a database API, if you have database requirements".

No, I won't.

>> Furthermore, the performance of Oracle's solutions was and still is
>> so low, because they have a file system as a database that is
>> managed by a DBMS as a file that again is stored in an FS. Can you
>> see now what does the loss of performance?
> It was a disaster from a performance perspective even if the database
> was run on top of a raw block device....

Yes, for sure. So what?

>> And Oracle fears FSs like R4 that have database(-like)
>> functionalities, so it took those technical features of R4 for the
>> BTRFS, which they thought could stop its show.
>> And also, Oracle has started some months ago again to promote its FS
>> in a DB in an FS concept.
> I've never heard of the R4 file system, and apparently Google hasn't
> either. But if you think BTRFS is a database, you're fooling
> yourself. There's a lot more to a database than just using a b-tree.

I'm sorry, because I was really thinking that you do know that R4 is
used as shorthand for the file system Reiser4.
And no, I'm not fooling myself, because I don't think that BTRFS is a
database. I only said that Oracle took technical parts of Reiser4, like a
b-tree data structure and some other parts, as a show stopper.

>> So, there must be something that is highly interesting with the idea
>> to use an FS as DBMS, not only for Oracle, but at least for the four
>> largest software companies.
> No, I think you're just utterly confused from a technical perspective.

No, I'm not utterly confused from a technical perspective. You really
have a wrong impression.
And if you read above again, then you will see that I already said that
Oracle has once again started promoting its concept of an FS in a DB in
an FS (the thing that you described as a performance disaster even
running on a raw block device). Do you claim that Oracle doesn't do
this?
I'm sorry, but I do believe Oracle, Microsoft and Apple more than you.

> - Ted
>

Christian Stroetmann

2010-12-27 02:53:16

by Theodore Ts'o

[permalink] [raw]
Subject: Re: Atomic non-durable file write API

On Mon, Dec 27, 2010 at 02:30:15AM +0100, Christian Stroetmann wrote:
> I'm sorry, because I was really thinking that you do know that R4 is
> used as the short term for the file system Reiser4.
> And no, I'm not fooling, because I don't think that BTRFS is a
> database. I only said that Oracle took technical parts of Reiser4
> like a b-tree datastructure and some other parts as a show stopper.

The fact that Reiser4 and BTRFS use a B-tree doesn't mean that they
have the double intent/rollback logs that a traditional database uses.
So mentioning them is irrelevant to the argument.

> And if you read above again, then you will see that I already said
> that Oracle has started once again the promotion of its concept with
> an FS in a DB in an FS (this thing that you described as a
> performance disaster even running on a raw block device). Do you
> claim that Oracle doesn't do this?

I haven't personally seen evidence of Oracle trying to make the claim
that it's sane to implement a file system, a web server and/or an IMAP
server using an Oracle DB as a backend since their last attempt at the
end of the dot-com era was greeted with near-universal ridicule and
amusement.

Even if they are trying to convince people to do this, I'm pretty
sure the response (and resulting performance) would be the same. It
would be like sending an armored Hummer H1 Humvee to try to do the
job of an Audi convertible. Sure, the Hummer may be more durable, and
maybe it can go everywhere an Audi can go --- but it's going to have
awful gas mileage compared to the convertible. Can I imagine a
Hummer dealership saying, "yes, you should use an H1 for your daily
15-minute commute from your suburb to the city?" Sure, but I don't
think many sane people will believe them.

> I'm sorry, but I do believe Oracle, Microsoft and Apple more than you.

You mean how Microsoft attempted to create a hybrid file system and
database solution called WinFS, which helped delay MS Vista by seven
years, and ultimately was abandoned by Microsoft?

And I'm not aware of any attempt by Apple to try to go down this
insane architectural direction.

But sure, if you're so smart, maybe you're smarter than me. Go ahead
and implement it, and send us the patches. I'll be happy to look them
over and benchmark them on common Linux workloads when you're done.

- Ted

2010-12-27 04:12:52

by Nicholas Piggin

[permalink] [raw]
Subject: Re: Atomic non-durable file write API

On Mon, Dec 27, 2010 at 5:51 AM, Olaf van der Spek <[email protected]> wrote:
> On Sun, Dec 26, 2010 at 5:43 PM, Nick Piggin <[email protected]> wrote:
>>> Do you not understand what is meant by a complete file write?
>>
>> It is not a rigourous definition. What I understand it to mean may be
>> different than what you understand it to mean. Particularly when you
>> consider what the actual API should look like and interact with the rest
>> of the apis.
>
> f = open(..., O_ATOMIC, O_CREAT, O_TRUNC);
> write(...); // 0+ times
> abort/rollback(...); // optional
> close(f);

Sorry, it's still not a rigorous definition, and what you have defined
indicates it is not atomic. You have not done *anything* to specify how
the API interacts with the rest of the system calls and other calls.

You have a circular definition -- "complete file write means you open the file
with O_ATOMIC, and O_ATOMIC means you want a complete file write". I'm
afraid you'll have to put in a bit more effort than that.


>> OK, so please show how it helps.
>>
>> If it is a multi-file/dir archive, then you could equally well come back in
>> an inconsistent state after crashing with some files extracted and
>> some not, without atomic-write-multiple-files-and-directories API.
>
> True, but at least each file will be valid by itself. So no broken
> executables, images or scripts.

So if a script depends on an executable that does not exist, or an
executable depends on a data file or library that does not exist, they're
effectively broken. So you need to be able to clean up properly anyway.


> Transactions involving multiple files are outside the scope of this discussion.

No they are not, because as I understand it you want atomicity of some
file operations so that partially visible error cases do not have to be dealt
with by userspace. The problem is exactly the same when dealing with
multiple files and directories.


>>> Almost. Visibility to other process should be normal (I don't know the
>>> exact rules), but commit to disk may be deferred.
>>
>> That's pretty important detail. What is "normal"? Will a process
>> see old or new data from the atomic write before atomic write has
>> committed to disk?
>
> New data.

What if the writer subsequently "aborts" or makes more writes to the file?


> Isn't that the current rule?

There are no atomic writes, so you can't just say "it's easy, just do writes
atomically and use 'current' rules for everything else"


>> Is the atomic write guaranteed to take an atomic snapshot of file
>> and only specified updates?
>>
>> What happens to subsequent atomic and non atomic writes to the
>> file?
>
> It's about an atomic replace of the entire file data. So it's not like
> expecting a single write to be atomic.

You didn't answer what happens. It's pretty important, because if those
writes from other processes join the new data from your atomic write,
and then you subsequently abort it, what happens? If writes are in progress
to the file when it is to be atomically written to, does the atomic write
"transaction" see parts of these writes? What sort of isolation level are
we talking about here? read uncommitted?

These are pretty important details when you're talking about transactions
and atomicity; you can't just say they aren't relevant, out of scope, or just
use "existing" semantics.


>>>> Once you solve all those problems, then people will ask you to now
>>>> solve them for multiple files at once because they also have some
>>>> great use-case that is surely nothing like databases.
>>>
>>> I don't want to play the what if game.
>>
>> You must if you want to design a sane API.
>
> Providing transaction semantics for multiple files is a far broader
> proposal and not necessary for implement this proposal.

The question is, if it makes sense to do it for 1, why does it not make sense
to do it for multiple? If you want to radically change the file
syscall APIs, you
need to explore all avenues and come up with something consistent that
makes sense.


>>> Temp file, rename has issues with losing meta-data.
>>
>> How about solving that easier issue?
>
> That would be nice, but it's not the only issue.
> I'm not sure, but Ted appears to be saying temp file + rename (but no
> fsync) isn't guaranteed to work either.

The rename obviously happens only *after* you fsync. Like I said,
at the point when you actually overwrite the old file with new, you do
really want durability.


> There's also the issue of not having permission to create the temp
> file, having to ensure the temp file is on the same volume (so the
> rename can work).

I don't see how those are problems. You can't do an atomic write to
a file if you don't have permissions to do it, either.


>>> It's simple to implement but it's not simple to use right.
>>
>> You do not have the ability to have arbitrary atomic transactions to the
>> filesystem. If you show a problem of a half completed write after crash,
>> then I can show you a problem of any half completed multi-syscall
>> operation after crash.
>
> It's not about arbitrary transactions.

That is my point. This "atomic write complete file" thing solves about 1% of
the problem that already has to be solved within the existing posix API
anyway.


>> The simple thing is to properly clean up such things after a crash, and
>> just use an atomic commit somewhere to say whether the file operations
>> that just completed are now in a durable state. Either that or use an
>> existing code that does it right.
>
> That's not simple if you're talking about arbitrary processes and files.
> It's not even that simple if you're talking about DBs. They do
> implement it, but obviously that's not usable for arbitrary files.

I don't see how you can just handwave that something is simple when
it suits your argument, and something else is not simple when that suits
your argument.

It seems pretty simple to me, when you have several ways to perform
a visible and durable atomic operation (such as a write+fdatasync on
file data), then you can use that to checkpoint state of your operations
at any point.
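
To illustrate that checkpoint idea, here is a rough sketch of the usual
write-temp-file, fdatasync, rename sequence (the names and the minimal
error handling are only for illustration). The temp file is created in the
target's own directory, both so that rename(2) cannot fail with EXDEV and
so that an fsync() of that directory makes the rename itself durable:

#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Durably replace dir/name with the given contents.  Sketch only:
 * short writes are not retried and the error paths are minimal. */
static int replace_file(const char *dir, const char *name,
			const char *data, size_t len)
{
	char tmp[4096], dst[4096];
	int fd, dfd;

	/* The temp file must be on the same volume as the target, or
	 * rename(2) fails with EXDEV; the same directory guarantees that.
	 * Note mkstemp() creates it with mode 0600, so mode/owner/xattrs
	 * of the old file are not preserved -- the meta-data issue raised
	 * elsewhere in this thread. */
	snprintf(tmp, sizeof(tmp), "%s/.%s.XXXXXX", dir, name);
	snprintf(dst, sizeof(dst), "%s/%s", dir, name);

	fd = mkstemp(tmp);
	if (fd < 0)
		return -errno;

	if (write(fd, data, len) != (ssize_t)len ||
	    fdatasync(fd) < 0) {	/* data durable before the rename */
		close(fd);
		unlink(tmp);
		return -1;
	}
	if (close(fd) < 0 || rename(tmp, dst) < 0) {	/* atomic old -> new */
		unlink(tmp);
		return -1;
	}

	/* Make the rename itself durable by syncing the directory. */
	dfd = open(dir, O_RDONLY | O_DIRECTORY);
	if (dfd >= 0) {
		fsync(dfd);
		close(dfd);
	}
	return 0;
}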


>>>> If we start adding atomicity beyond fundamental requirement of
>>> The zero size issues of ext4 (before some patch). Presumably because
>>> some apps do open, truncate, write, close on a file. I'm wondering why
>>> an FS commits between truncate and write.
>>
>> I'm still not clear what you mean. Filesystem state may get updated
>> between any 2 syscalls because the kernel has no oracle or infinite
>> storage.
>
> It just seems quite suboptimal. There's no need for infinite storage
> (or an oracle) to avoid this.

You do, because you can't guarantee to keep an arbitrary amount of dirty
data in memory or another location on disk for an indeterminate period
of time. What if you have a 1GB filesystem, 128MB memory, you open
an 800MB file on it, and write 800MB of data to that file before closing it?

If you have "atomic write of complete file", how would you save your
"abort/rollback" data on arbitrarily large file and for multiple concurrent
atomic transactions of indeterminate duration? For that matter, how
would you even handle the above situation which has no concurrency?

Anyway, it seems you'll just keep arguing about this, so I'm with Ted
now. It's pointless to keep going back and forth. You're certainly
welcome to post patches (or even prototypes, modifications to user
programs, numbers, etc.). Some of us are skeptics, but we'd all
welcome any work that improves the user API so significantly and
with such simplicity as you think it's possible.

2010-12-27 10:21:47

by Olaf van der Spek

[permalink] [raw]
Subject: Re: Atomic non-durable file write API

On Sun, Dec 26, 2010 at 11:10 PM, Ted Ts'o <[email protected]> wrote:
> On Sun, Dec 26, 2010 at 07:51:23PM +0100, Olaf van der Spek wrote:
>> f = open(..., O_ATOMIC, O_CREAT, O_TRUNC);
>
> Great, let's rename O_ATOMIC to O_PONIES.  :-)

If that makes you happy.

>> abort/rollback(...); // optional
>
> As I said earlier, "file systems are not databases", and "databases
> are not file systems".  Oracle tried to foist their database as a file
> system during the dot.com boom, and everyone laughed at them; the
> performance was a nightmare.  If Oracle wasn't able to make a
> transaction engine that supports transactions and rollbacks
> performant, you really expect that you'll be able to do it?

Like I've said dozens of times, this is not about full DB functionality.
Why do you keep making false analogies?

>> > If it is a multi-file/dir archive, then you could equally well come back in
>> > an inconsistent state after crashing with some files extracted and
>> > some not, without atomic-write-multiple-files-and-directories API.
>>
>> True, but at least each file will be valid by itself. So no broken
>> executables, images or scripts.
>> Transactions involving multiple files are outside the scope of this
>> discussion.
>
> But what's the use case where this is useful and/or interesting?  It
> certainly doesn't help in the case of dpkg, because you still have to
> deal with shell scripts that depend on certain executables being
> present, or executables depending on the new version of the shared
> library being present.  If we're going to give up huge amounts of file
> system performance for some use case, it's nice to know what the
> real-world use case would actually be.  (And again, I believe the dpkg
> folks are squared away at this point.)

Why would this require a huge performance hit? It's comparable to temp
file + rename which doesn't have this performance hit either AFAIK.

> If the use case is really one of replacing the data while maintaining
> the metadata (i.e., ACL's, extended attributes, etc.), we've already
> pointed out that in the case of a file editor, you had better have
> durability.  Keep in mind that if you don't eventually call fsync(),
> you'll never know if the file system is full or the user has hit their
> quota, and the data can't be lazily written out later.  Or in the case
> of a networked file system, what if the network connection disappears
> before you have a chance to lazily update the data and do the rename?
> So before the editor exits, and the last remaining copy of the new
> data (in memory) disappears, you had better call fsync() and check to
> make sure the write can and has succeeded.

Good point. So fsync is still needed in that case. What about the
meta-data though?

> So in the case of replacing the data, what's the use case if it's not
> for a file editor?  And note that you've said that you want atomicity
> because you want to make sure that after a crash you don't lose data.
> What about the case where the system doesn't crash, but the wireless
> connection goes away, or the user has exceeded his/her quota and they
> were trying to replace 4k worth of data fork with 12k worth of data?
> I can certainly think of scenarios where wireless connection drops and
> quota overruns are far more likely than system crashes.  (i.e., when
> you're not using proprietary video drivers.  :-P)
>
>> Providing transaction semantics for multiple files is a far broader
>> proposal and not necessary for implement this proposal.
>
> But providing magic transaction semantics for a single file in the
> rename is not at all clearly useful.  You need to justify all of this
> hard effort, and performance loss.  (Well, or if you're so smart you
> can implement your own file system that does all of this work, and we
> can benchmark it against a file system that doesn't do all of this
> work....)

Still waiting on any hint for why that performance loss would happen.

>> I'm not sure, but Ted appears to be saying temp file + rename (but no
>> fsync) isn't guaranteed to work either.
>
> It won't work if you get really unlucky and your system takes a power
> cut right at the wrong moment during or after the rename().  It could
> be made to work, but at a performance cost.  And the question is
> whether the performance cost is worth it.  At the end of the day it's
> all between the tradeoff between performance cost, implementation
> cost, and value to the user and the application programmer.  Which is
> why you need to articular the use case where this makes sense.
>
> It's not dpkg, and it's not file editors.  What is it, specifically?
> And why can it tolerate data loss in the case of quota overruns and
> wireless connection hits, but not in the case of system crashes?

There are two different kinds of losses here. One is losing the entire
file, the other is losing the update but still having the old file.

>> It just seems quite suboptimal. There's no need for infinite storage
>> (or an oracle) to avoid this.
>
> If you're so smart, why don't you try implementing it?  Itt's going to
> be hard for us to convince you why it's going to be non-trivial and
> have huge implementation *and* performance costs, so why don't you
> produce the patches that makes this all work?

Why is that so hard?
Should be a lot easier than me implementing an FS from scratch.

Olaf

2010-12-27 11:07:35

by Marco Stornelli

[permalink] [raw]
Subject: Re: Atomic non-durable file write API

On 27/12/2010 11:21, Olaf van der Spek wrote:
>
> Why is that so hard?
> Should be a lot easier then me implementing a FS from scratch.
>
> Olaf

We are impatient to read your patches; maybe that way your
concept/API will be clearer.

Marco

2010-12-27 11:48:14

by Olaf van der Spek

[permalink] [raw]
Subject: Re: Atomic non-durable file write API

On Mon, Dec 27, 2010 at 5:12 AM, Nick Piggin <[email protected]> wrote:
> On Mon, Dec 27, 2010 at 5:51 AM, Olaf van der Spek <[email protected]> wrote:
>> On Sun, Dec 26, 2010 at 5:43 PM, Nick Piggin <[email protected]> wrote:
>>>> Do you not understand what is meant by a complete file write?
>>>
>>> It is not a rigourous definition. What I understand it to mean may be
>>> different than what you understand it to mean. Particularly when you
>>> consider what the actual API should look like and interact with the rest
>>> of the apis.
>>
>> f = open(..., O_ATOMIC, O_CREAT, O_TRUNC);
>> write(...); // 0+ times
>> abort/rollback(...); // optional
>> close(f);
>
> Sorry, it's still not a rigourous definition, and what you have
> defined indicates it is
> not atomic. You have not done *anything* to specify how the API interacts with
> the rest of the system calls and other calls.
>
> You have a circular definition -- "complete file write means you open the file
> with O_ATOMIC, and O_ATOMIC means you want a complete file write". I'm
> afraid you'll have to put in a bit more effort than that.

Semantics:
Old state: data before open
New state: data after open
Others see either the old or the new state.
After close but before a crash, others see the new state.
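
To make the proposal easier to reason about, a sketch of how an
application might use it under those semantics. O_ATOMIC does not exist
in Linux; the flag, the placeholder value given to it here and the error
handling are purely illustrative of the proposal, not of any real kernel
interface:

#include <fcntl.h>
#include <unistd.h>

/* Hypothetical flag from this proposal -- it does NOT exist in Linux.
 * Defined as 0 only so the sketch compiles; a real implementation would
 * need a real flag value and kernel support behind it. */
#ifndef O_ATOMIC
#define O_ATOMIC 0
#endif

static int save_file(const char *path, const char *data, size_t len)
{
	/* Proposed semantics: other processes see either the complete old
	 * contents or the complete new contents, never a partial file; the
	 * new state becomes visible (though not yet durable) at close(). */
	int fd = open(path, O_ATOMIC | O_WRONLY | O_CREAT | O_TRUNC, 0644);
	if (fd < 0)
		return -1;

	if (write(fd, data, len) != (ssize_t)len) {
		/* The proposal also mentions an optional abort/rollback step,
		 * which would leave the old contents in place; its exact form
		 * was never pinned down in the thread, so this sketch just
		 * closes and reports the error. */
		close(fd);
		return -1;
	}
	return close(fd);	/* "commit": old data is atomically replaced */
}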

>> True, but at least each file will be valid by itself. So no broken
>> executables, images or scripts.
>
> So if a script depends on an executable or an executable depends on a
> data file or library that do not exist, they're effectively broken. So you
> need to be able to clean up properly anyway.

If those ifs are true, yes. Otherwise, no.

>
>> Transactions involving multiple files are outside the scope of this discussion.
>
> No they are not, because as I understand you want atomicity of some
> file operations so that partially visible error cases do not have to be dealt
> with by userspace. The problem is exactly the same when dealing with
> multiple files and directories.

Solving it for a single file does not require solving it for multiple files.

>>>> Almost. Visibility to other process should be normal (I don't know the
>>>> exact rules), but commit to disk may be deferred.
>>>
>>> That's pretty important detail. What is "normal"? Will a process
>>> see old or new data from the atomic write before atomic write has
>>> committed to disk?
>>
>> New data.
>
> What if the writer subsequently "aborts" or makes more writes to the file?

That's all part of the atomic transaction. New data is the state after close.

>
>> Isn't that the current rule?
>
> There are no atomic writes, so you can't just say "it's easy, just do writes
> atomically and use 'current' rules for everything else"

I mean the rules that apply to current (non-atomic) stuff.

>> It's about an atomic replace of the entire file data. So it's not like
>> expecting a single write to be atomic.
>
> You didn't answer what happens. It's pretty important, because if those
> writes from other processes join the new data from your atomic write,
> and then you subsequently abort it, what happens? If writes are in progress
> to the file when it is to be atomically written to, does the atomic write
> "transaction" see parts of these writes? What sort of isolation level are
> we talking about here? read uncommitted?
>
> It's pretty important details when you're talking about transactions and
> atomicity, you can't just say it isn't relevant, out of scope, or just use
> "existing" semantics.

Ah, yes, that's important. The transaction is defined as beginning
with open and ending with close. Others won't see inconsistent state.
If other (atomic or non-atomic) updates happen they happen either
before or after the transaction. Since this is about replacing the
entire file data, you don't depend on the previous data.

>> Providing transaction semantics for multiple files is a far broader
>> proposal and not necessary for implement this proposal.
>
> The question is, if it makes sense to do it for 1, why does it not make sense
> to do it for multiple? If you want to radically change the file
> syscall APIs, you
> need to explore all avenues and come up with something consistent that
> makes sense.

IMO the single-file case does not require radical changes.

>
>>>> Temp file, rename has issues with losing meta-data.
>>>
>>> How about solving that easier issue?
>>
>> That would be nice, but it's not the only issue.
>> I'm not sure, but Ted appears to be saying temp file + rename (but no
>> fsync) isn't guaranteed to work either.
>
> The rename obviously happens only *after* you fsync. Like I said,
> at the point when you actually overwrite the old file with new, you do
> really want durability.

There's still the meta-data issue.

>
>> There's also the issue of not having permission to create the temp
>> file, having to ensure the temp file is on the same volume (so the
>> rename can work).
>
> I don't see how those are problems. You can't do an atomic write to
> a file if you don't have permissions to do it, either.

Doh. This is about having permission to write to the file you want to
update but not to write to another file.

>
>
>>>> It's simple to implement but it's not simple to use right.
>>>
>>> You do not have the ability to have arbitrary atomic transactions to the
>>> filesystem. If you show a problem of a half completed write after crash,
>>> then I can show you a problem of any half completed multi-syscall
>>> operation after crash.
>>
>> It's not about arbitrary transactions.
>
> That is my point. This "atomic write complete file" thing solves about 1% of
> the problem that already has to be solved within the existing posix API
> anyway.
>
>
>>> The simple thing is to properly clean up such things after a crash, and
>>> just use an atomic commit somewhere to say whether the file operations
>>> that just completed are now in a durable state. Either that or use an
>>> existing code that does it right.
>>
>> That's not simple if you're talking about arbitrary processes and files.
>> It's not even that simple if you're talking about DBs. They do
>> implement it, but obviously that's not usable for arbitrary files.
>
> I don't see how you can just handwave that something is simple when
> it suits your argument, and something else is not simple when that suits
> your argument.

True

> It seems pretty simple to me, when you have several ways to perform
> a visible and durable atomic operation (such as a write+fdatasync on
> file data), then you can use that to checkpoint state of your operations
> at any point.

True

>>> I'm still not clear what you mean. Filesystem state may get updated
>>> between any 2 syscalls because the kernel has no oracle or infinite
>>> storage.
>>
>> It just seems quite suboptimal. There's no need for infinite storage
>> (or an oracle) to avoid this.
>
> You do, because you can't guarantee to keep arbitrary amount of  dirty
> data in memory or another location on disk for an indeterminate period
> of time. What if you have a 1GB filesystem, 128MB memory, you open
> an 800MB file on it, and write 800MB of data to that file before closing it?

This referred to committing between truncate and the first write.
You're right about not being able to delay writes in other cases.

> If you have "atomic write of complete file",  how would you save your
> "abort/rollback" data on arbitrarily large file and for multiple concurrent
> atomic transactions of indeterminate duration? For that matter, how
> would you even handle the above situation which has no concurrency?

Atomic writes, just like temp file + rename, would require more space.
If you don't have that space, your writes will fail.

> Anyway, it seems you'll just keep arguing about this, so I'm with Ted
> now. It's pointless to keep going back and forth. You're certainly
> welcome to post patches (or even prototypes, modifications to user
> programs, numbers, etc.). Some of us are skeptics, but we'd all
> welcome any work that improves the user API so significantly and
> with such simplicity as you think it's possible.

Let's drop the non-durable aspect and refocus then. I'll create a new thread.

Olaf

2010-12-27 12:43:05

by Olaf van der Spek

[permalink] [raw]
Subject: Re: Atomic non-durable file write API

On Mon, Dec 27, 2010 at 12:48 PM, Olaf van der Spek
<[email protected]> wrote:
> Semantics:
> Old state: data before open
> New state: data after open

Argh, this should read data after close.

> Others see either the old or the new state.
> After close but before a crash, others see the new state.

2010-12-27 15:29:46

by Christian Stroetmann

[permalink] [raw]
Subject: Re: Atomic non-durable file write API

On the 27.12.2010 11:21, Olaf van der Spek wrote:
> On Sun, Dec 26, 2010 at 11:10 PM, Ted Ts'o<[email protected]> wrote:
>> On Sun, Dec 26, 2010 at 07:51:23PM +0100, Olaf van der Spek wrote:
>>> f = open(..., O_ATOMIC, O_CREAT, O_TRUNC);
>> Great, let's rename O_ATOMIC to O_PONIES. :-)
> If that makes you happy.
>
>>> abort/rollback(...); // optional
>> As I said earlier, "file systems are not databases", and "databases
>> are not file systems". Oracle tried to foist their database as a file
>> system during the dot.com boom, and everyone laughed at them; the
>> performance was a nightmare. If Oracle wasn't able to make a
>> transaction engine that supports transactions and rollbacks
>> performant, you really expect that you'll be able to do it?
> Like I've said dozens of times, this is not about full DB functionality.
> Why do you keep making false analogies?

The analogy is not so wrong. The concepts of atomicity and abort/rollback
you are talking about are also concepts from the field of database
management systems (DBMSs). And once you have established the Atomicity,
which is the A of the ACID principle of DBMSs, you have the basis for
establishing the rest, the CID.

And you even went further in this DBMS direction by dropping the
requirement of non-durability.

<snip>

>>> Providing transaction semantics for multiple files is a far broader
>>> proposal and not necessary for implement this proposal.
>> But providing magic transaction semantics for a single file in the
>> rename is not at all clearly useful. You need to justify all of this
>> hard effort, and performance loss. (Well, or if you're so smart you
>> can implement your own file system that does all of this work, and we
>> can benchmark it against a file system that doesn't do all of this
>> work....)
> Still waiting on any hint for why that performance loss would happen.

From my point of view, the loss of performance depends on what is
benchmarked in which way.

<snip>

> Olaf
> --

Christian Stroetmann

2010-12-27 19:07:01

by Olaf van der Spek

[permalink] [raw]
Subject: Re: Atomic non-durable file write API

On Mon, Dec 27, 2010 at 4:30 PM, Christian Stroetmann
<[email protected]> wrote:
>> Like I've said dozens of times, this is not about full DB functionality.
>> Why do you keep making false analogies?
>
> The analogy is not so wrong. The concepts atomicity and abort/rollback your
> are talking about are also concepts of the field of database management
> systems (DBMSs). And once you have established the Atomicity, which is the A
> of the principle ACID of DBMSs, you have the basis for establishing the
> rest, the CID.
>
> And you even went further into this DBMS direction by letting down the
> requirement of non-durability.

Of course the concepts are the same. That doesn't mean the analogy is valid.

>> Still waiting on any hint for why that performance loss would happen.
>
> From my point of view, the loss of performance depends on what is
> benchmarked in which way.

Maybe, but still no indication of why.

Olaf

2010-12-27 19:30:20

by Christian Stroetmann

[permalink] [raw]
Subject: Re: Atomic non-durable file write API

On the 27.12.2010 20:07, Olaf van der Spek wrote:
> On Mon, Dec 27, 2010 at 4:30 PM, Christian Stroetmann
> <[email protected]> wrote:
>>> Like I've said dozens of times, this is not about full DB functionality.
>>> Why do you keep making false analogies?
>> The analogy is not so wrong. The concepts atomicity and abort/rollback your
>> are talking about are also concepts of the field of database management
>> systems (DBMSs). And once you have established the Atomicity, which is the A
>> of the principle ACID of DBMSs, you have the basis for establishing the
>> rest, the CID.
>>
>> And you even went further into this DBMS direction by letting down the
>> requirement of non-durability.
> Of course the concepts are the same. That doesn't mean the analogy is valid.

Btw.: there is not even an analogy here: "The concepts are the same".

>>> Still waiting on any hint for why that performance loss would happen.
>> > From my point of view, the loss of performance depends on what is
>> benchmarked in which way.
> Maybe, but still no indication of why.

If you have a solution, then you really should show other people the
working source code.
Speaking for myself: I like such technologies and I'm also interested in
your attempts.

> Olaf

Christian Stroetmann

2010-12-28 00:45:50

by Theodore Ts'o

[permalink] [raw]
Subject: Re: Atomic non-durable file write API

On Mon, Dec 27, 2010 at 12:48:12PM +0100, Olaf van der Spek wrote:
>
> Let's drop the non-durable aspect and refocus then. I'll create a new thread.
>

Why don't you send patches? We can revisit and refocus once you've
implemented it.

- Ted

2010-12-28 17:22:42

by Olaf van der Spek

[permalink] [raw]
Subject: Re: Atomic non-durable file write API

On Mon, Dec 27, 2010 at 8:30 PM, Christian Stroetmann
<[email protected]> wrote:
> Btw.: There is even no analogy: "The concepts are the same".

So? Doesn't mean you implement full ACID DB-like transactions.

>>>> Still waiting on any hint for why that performance loss would happen.
>>>
>>> > From my point of view, the loss of performance depends on what is
>>> benchmarked in which way.
>>
>> Maybe, but still no indication of why.
>
> If you have a solution, then you really should show other persons the
> working source code.

I don't have source code.
Are you not capable of reasoning about something without having
concrete source code?

> For me speaking: I like such technologies and I'm also interested in your
> attempts.

Writing code is a lot of work and one should have the design clear
before writing code, IMO.

Olaf

2010-12-28 20:59:28

by NeilBrown

[permalink] [raw]
Subject: Re: Atomic non-durable file write API

On Tue, 28 Dec 2010 18:22:42 +0100 Olaf van der Spek <[email protected]>
wrote:

> On Mon, Dec 27, 2010 at 8:30 PM, Christian Stroetmann
> <[email protected]> wrote:
> > Btw.: There is even no analogy: "The concepts are the same".
>
> So? Doesn't mean you implement full ACID DB-like transactions.
>
> >>>> Still waiting on any hint for why that performance loss would happen.
> >>>
> >>> > From my point of view, the loss of performance depends on what is
> >>> benchmarked in which way.
> >>
> >> Maybe, but still no indication of why.
> >
> > If you have a solution, then you really should show other persons the
> > working source code.
>
> I don't have source code.
> Are you not capable of reasoning about something without having
> concrete source code?
>
> > For me speaking: I like such technologies and I'm also interested in your
> > attempts.
>
> Writing code is a lot of work and one should have the design clear
> before writing code, IMO.

Yes and no.

Having some design is obviously important before starting to code.
However it is a common experience that once you start writing code, you start
to see all the holes in your design - all the corner cases that you didn't
think about. So sometimes writing some proof-of-concept code is a very
valuable step in the design process.
Then of course you need to do some testing to see if the code actually
performs as hoped or expected. That testing may cause the design to be
revised.
So asking for code early is not necessarily a bad thing.

I think the real disconnect here is that you haven't really established or
justified a need.

You seem to be asking for the ability to atomically change the data in a file
without changing the metadata. I cannot see why you would want this. Maybe
you could give an explicit use-case??

Another significant issue here is "how much atomicity can we justify".
One possibility is for the file system not to provide any atomicity, and so
require lots of repair after a crash: fsck for the filesystem, "make clean"
for your compile tree, removal of stray temp files etc for other subsystems.

On the other extreme we could allow full transactions encompassing
multiple changes to multiple files which are guaranteed to be either committed
completely or not at all after a crash.

We gave up on the first extreme about a decade ago when journalling
filesystems became available for Linux. There seems to be little desire to
pay the cost of ever implementing the other extreme in general purpose
filesystems.

So the important question is "Where on that spectrum of options should we be?"
The answer has to be based on cost/benefit. The cost of adding journalling
was quite high, but the benefit of not having to fsck an enormous filesystem
after a crash is enormous, so it is a cost we have chosen to pay.

If you want some extra level of atomicity, you need to demonstrate either a
high benefit or a low cost. There seems to be some scepticism as to whether
you can. A convincing use-case might demonstrate the high benefit. Working
code might demonstrate low cost. But you really need to provide at least one
(ideally both) or people are unlikely to take you seriously.

NeilBrown


2010-12-28 22:01:16

by Greg Freemyer

[permalink] [raw]
Subject: Re: Atomic non-durable file write API

On Tue, Dec 28, 2010 at 3:59 PM, Neil Brown <[email protected]> wrote:
> On Tue, 28 Dec 2010 18:22:42 +0100 Olaf van der Spek <[email protected]>
> wrote:
>
>> On Mon, Dec 27, 2010 at 8:30 PM, Christian Stroetmann
>> <[email protected]> wrote:
>> > Btw.: There is even no analogy: "The concepts are the same".
>>
>> So? Doesn't mean you implement full ACID DB-like transactions.
>>
>> >>>> Still waiting on any hint for why that performance loss would happen.
>> >>>
>> >>> > From my point of view, the loss of performance depends on what is
>> >>> benchmarked in which way.
>> >>
>> >> Maybe, but still no indication of why.
>> >
>> > If you have a solution, then you really should show other persons the
>> > working source code.
>>
>> I don't have source code.
>> Are you not capable of reasoning about something without having
>> concrete source code?
>>
>> > For me speaking: I like such technologies and I'm also interested in your
>> > attempts.
>>
>> Writing code is a lot of work and one should have the design clear
>> before writing code, IMO.
>
> Yes and no.
>
> Having some design is obviously important before starting to code.
> However it is a common experience that once you start writing code, you start
> to see all the holes in your design - all the corner cases that you didn't
> think about.  So sometimes writing some proof-of-concept code is a very
> valuable step in the design process.
> Then of course you need to do some testing to see if the code actually
> performs as hoped or expected.  That testing may cause the design to be
> revised.
> So asking for code early is not necessarily a bad thing.
>
> I think the real disconnect here is that you haven't really established or
> justified a need.
>
> You seem to be asking for the ability to atomically change the data in a file
> without changing the metadata.  I cannot see why you would want this.  Maybe
> you could give an explicit use-case??

==> joining the thread

I assumed the use case was that of a simple document editor.

If I am working on an office doc with oowriter, as an example, I don't
want a system crash or an out-of-diskspace error to kill my original doc. 7 or
8 years ago XFS used to zero out the file in situations like that.
Hopefully that's fixed.

What I don't understand is what the security impacts are of the file
owner changing, i.e. assume user 1 creates a word doc, then user 2
makes an edit. If the owner is changed to the second user, is that a
problem?

Same for the group?

Personally I'd rather see it stay with the original owner / group.

I assume the solution today that oowriter etc. use is:

===
create temp file
write out new data
delete old file
rename temp file to primary name
===

If so there is still a little window of vulnerability where the whole
file can be lost. (Or at least only the temp file is present).

Text editors, xml editors, etc. seem like they all have the same vulnerability.

If my assumptions are right, then a way to guarantee either the old or
new file contents are available after a crash would be useful.
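
One way to narrow the metadata gap with the temp-file approach is for the
application to copy what it can from the original file onto the temp file
before the rename. A rough sketch (the helper is invented for
illustration): mode bits can be copied, ownership generally cannot be
restored by a non-root process (fchown to another user fails with EPERM),
and ACLs / extended attributes would need extra copying not shown here:

#include <errno.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Copy mode (and, where permitted, ownership) from the file about to
 * be replaced onto the freshly written temp file, before the rename.
 * Sketch only: ACLs and extended attributes are not handled. */
static int copy_basic_metadata(const char *orig_path, int tmp_fd)
{
	struct stat st;

	if (stat(orig_path, &st) < 0)
		return -errno;	/* no original file: nothing to preserve */

	if (fchmod(tmp_fd, st.st_mode & 07777) < 0)
		return -errno;

	/* Only a privileged process (CAP_CHOWN) may hand the file back to
	 * its original owner; for ordinary users this fails with EPERM,
	 * which is why the file owner is effectively lost with the
	 * temp-file-plus-rename scheme. */
	if (fchown(tmp_fd, st.st_uid, st.st_gid) < 0 && errno != EPERM)
		return -errno;

	return 0;
}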

== EXT4_IOC_MOVE_EXT

There was mention of this ioctl earlier in the thread.

I don't think it guarantees all or nothing data replacement as
requested here. It is designed with defragment as the primary use
case.

As such it steps through data blocks and replaces them, but if only
half the blocks get replaced with new ones due to a crash, it is not a
big deal.

Greg

2010-12-28 22:06:16

by Olaf van der Spek

[permalink] [raw]
Subject: Re: Atomic non-durable file write API

On Tue, Dec 28, 2010 at 11:00 PM, Greg Freemyer <[email protected]> wrote:
> create temp file
> write out new data
> delete old file
> rename temp file to primary name
> ===
>
> If so there is still a little window of vulnerability where the whole
> file can be lost.  (Or at least only the temp file is present).

Delete isn't used, rename will overwrite the old file. So it's safe.
Meta-data is probably lost, file owner is certainly lost.

Olaf

2010-12-28 22:10:53

by Olaf van der Spek

[permalink] [raw]
Subject: Re: Atomic non-durable file write API

On Tue, Dec 28, 2010 at 9:59 PM, Neil Brown <[email protected]> wrote:
>> Writing code is a lot of work and one should have the design clear
>> before writing code, IMO.
>
> Yes and no.
>
> Having some design is obviously important before starting to code.
> However it is a common experience that once you start writing code, you start
> to see all the holes in your design - all the corner cases that you didn't
> think about.  So sometimes writing some proof-of-concept code is a very
> valuable step in the design process.

Sometimes, yes.

> I think the real disconnect here is that you haven't really established or
> justified a need.

True, but all those exceptions (IMO) should be (proven to be) no problem.
I'd prefer designs that don't have such exceptions. I may not be able
to think of a concrete problem right now, but that doesn't mean such
problems don't exist.

I also don't understand why providing this feature is such a
(performance) problem.
Surely the people that claim this should be able to explain why.

> You seem to be asking for the ability to atomically change the data in a file
> without changing the metadata.  I cannot see why you would want this.  Maybe
> you could give an explicit use-case??

Where losing meta-data is bad? That should be obvious.
Or where losing file owner is bad? Still thinking about that one.

> Another significant issue here is "how much atomicity can we justify".
> One possibility is for the file system not to provide any atomicity, and so
> require lots of repair after a crash:  fsck for the filesystem, "make clean"
> for your compile tree, removal of stray temp files etc for other subsystems.
>
> On the other extreme we could allow full transactions encompassing
> multiple changes to multiple files which a guarantee to be either committed
> completely or not at all after a crash.
>
> We gave up on the first extreme about a decade ago when journalling
> filesystems became available for Linux.  There seems to be little desire to
> pay the cost of ever implementing the other extreme in general purpose
> filesystems.

Note that I'm not asking for this other extreme.

> So the important question is "Where on that spectrum of options should we be?"
> The answer has to be based on cost/benefit.  The cost of adding journalling
> was quite high, but the benefit of not having to fsck an enormous filesystem
> after a crash is enormous, so it is a cost we have chosen to pay.
>
> If you want some extra level of atomicity, you need to demonstrate either a
> high benefit or a low cost.  There seems to be some scepticism as to whether
> you can.  A convincing use-case might demonstrate the high benefit.  Working
> code might demonstrate low cost.  But you really need to provide at least one
> (ideally both) or people are unlikely to take you seriously.

I understand.

Olaf

2010-12-28 22:16:18

by Greg Freemyer

[permalink] [raw]
Subject: Re: Atomic non-durable file write API

On Tue, Dec 28, 2010 at 5:06 PM, Olaf van der Spek <[email protected]> wrote:
> On Tue, Dec 28, 2010 at 11:00 PM, Greg Freemyer <[email protected]> wrote:
>> create temp file
>> write out new data
>> delete old file
>> rename temp file to primary name
>> ===
>>
>> If so there is still a little window of vulnerability where the whole
>> file can be lost.  (Or at least only the temp file is present).
>
> Delete isn't used, rename will overwrite the old file. So it's safe.
> Meta-data is probably lost, file owner is certainly lost.
>
> Olaf

So ACLs are lost?

That seems like a potentially bigger issue than losing the owner/group info.

And I assume if the owner changes, then the new owner has privileges
to modify ACLs he didn't have previously.

So if I want to instigate a simple denial of service in a multi-user
environment, I edit a few key docs that I have privileges to edit. By
doing so I take ownership. As owner I change the permissions and
ACLs so that no one but me can access them.

Seems like a security hole to me.

Greg

2010-12-28 22:28:33

by Olaf van der Spek

[permalink] [raw]
Subject: Re: Atomic non-durable file write API

On Tue, Dec 28, 2010 at 11:15 PM, Greg Freemyer <[email protected]> wrote:
> So ACLs are lost?

I'm not sure. Since preserving them might not be easy I think it's
likely they're lost in some cases.

> That seems like a potentially bigger issue than loosing the owner/group info.
>
> And I assume if the owner changes, then the new owner has privileges
> to modify ACLs he didn't have previously.
>
> So if I want to instigate a simple denial of service in a multi-user
> environment, I edit a few key docs that I have privileges to edit.  By
> doing so I take ownership.  As owner I  change the permissions and
> ACLs so that no one but me can access them.
>
> Seems like a security hole to me.

If you have write access you can clear the data as well, so
effectively the difference is small.

Olaf

2010-12-28 22:31:58

by NeilBrown

[permalink] [raw]
Subject: Re: Atomic non-durable file write API

On Tue, 28 Dec 2010 23:10:51 +0100 Olaf van der Spek <[email protected]>
wrote:

> On Tue, Dec 28, 2010 at 9:59 PM, Neil Brown <[email protected]> wrote:
> >> Writing code is a lot of work and one should have the design clear
> >> before writing code, IMO.
> >
> > Yes and no.
> >
> > Having some design is obviously important before starting to code.
> > However it is a common experience that once you start writing code, you start
> > to see all the holes in your design - all the corner cases that you didn't
> > think about.  So sometimes writing some proof-of-concept code is a very
> > valuable step in the design process.
>
> Sometimes, yes.
>
> > I think the real disconnect here is that you haven't really established or
> > justified a need.
>
> True, but all those exceptions (IMO) should be (proven to be) no problem.
> I'd prefer designs that don't have such exceptions. I may not be able
> to think of a concrete problem right now, but that doesn't mean such
> problems don't exist.

Very true. But until such problems are described and understood, there is not
a lot of point trying to implement a solution. Premature implementation,
like premature optimisation, is unlikely to be fruitful. I know this from
experience.

>
> I also don't understand why providing this feature is such a
> (performance) problem.
> Surely the people that claim this should be able to explain why.

Without a concrete design, it is hard to assess the performance impact. I
would guess that those who anticipate a significant performance impact are
assuming a more feature-full implementation than you are, and they are
probably doing that because they feel that you need the extra features to
meet the actual needs (and so suggest those needs are best met by a DBMS rather
than a file-system).
Of course this is just guesswork. Without concrete reference points it is
hard to be sure.

>
> > You seem to be asking for the ability to atomically change the data in a file
> > without changing the metadata. I cannot see why you would want this. Maybe
> > you could give an explicit use-case??
>
> Where losing meta-data is bad? That should be obvious.
> Or where losing file owner is bad? Still thinking about that one.

This is a bit left-field, but I think that losing metadata is always a good
thing. A file should contain data - nothing else. At all. Owner and access
permissions should be based on location as dictated by external policy....
but yeah - off topic.

Clearly maintaining metadata by creating a new file and renaming in-place is
easy for root (chown/chmod/etc). So you are presumably envisaging situations
where a non-root user has write access to a file that they don't own, and
they want to make an atomic data-update to that file.
Sorry, but I think that allowing non-owners to write to a file is a really
really bad idea and providing extra support for that use-case is completely
unjustifiable.

If you want multiple people to be able to update some data you should have
some way to ask the owner to make an update. That could be:
- setuid program
- daemon which authenticates requests
- distributed workflow tool like 'git' where you speak to the owner
and ask them to pull updates.

and there are probably other options. But un-mediated writes to a file you
don't own? Just say NO!

NeilBrown


>
> > Another significant issue here is "how much atomicity can we justify".
> > One possibility is for the file system not to provide any atomicity, and so
> > require lots of repair after a crash: fsck for the filesystem, "make clean"
> > for your compile tree, removal of stray temp files etc for other subsystems.
> >
> > On the other extreme we could allow full transactions encompassing
> > multiple changes to multiple files which are guaranteed to be either committed
> > completely or not at all after a crash.
> >
> > We gave up on the first extreme about a decade ago when journalling
> > filesystems became available for Linux. There seems to be little desire to
> > pay the cost of ever implementing the other extreme in general purpose
> > filesystems.
>
> Note that I'm not asking for this other extreme.
>
> > So the important question is "Where on that spectrum of options should we be?"
> > The answer has to be based on cost/benefit. The cost of adding journalling
> > was quite high, but the benefit of not having to fsck an enormous filesystem
> > after a crash is enormous, so it is a cost we have chosen to pay.
> >
> > If you want some extra level of atomicity, you need to demonstrate either a
> > high benefit or a low cost. There seems to be some scepticism as to whether
> > you can. A convincing use-case might demonstrate the high benefit. Working
> > code might demonstrate low cost. But you really need to provide at least one
> > (ideally both) or people are unlikely to take you seriously.
>
> I understand.
>
> Olaf

2010-12-28 22:35:49

by NeilBrown

[permalink] [raw]
Subject: Re: Atomic non-durable file write API

On Tue, 28 Dec 2010 17:15:57 -0500 Greg Freemyer <[email protected]>
wrote:

> On Tue, Dec 28, 2010 at 5:06 PM, Olaf van der Spek <[email protected]> wrote:
> > On Tue, Dec 28, 2010 at 11:00 PM, Greg Freemyer <[email protected]> wrote:
> >> create temp file
> >> write out new data
> >> delete old file
> >> rename temp file to primary name
> >> ===
> >>
> >> If so there is still a little window of vulnerability where the whole
> >> file can be lost. (Or at least only the temp file is present).
> >
> > Delete isn't used, rename will overwrite the old file. So it's safe.
> > Meta-data is probably lost, file owner is certainly lost.
> >
> > Olaf
>
> So ACLs are lost?
>
> That seems like a potentially bigger issue than losing the owner/group info.
>
> And I assume if the owner changes, then the new owner has privileges
> to modify ACLs he didn't have previously.
>
> So if I want to instigate a simple denial of service in a multi-user
> environment, I edit a few key docs that I have privileges to edit. By
> doing so I take ownership. As owner I change the permissions and
> ACLs so that no one but me can access them.
>
> Seems like a security hole to me.

Giving someone you don't trust uncontrolled write access to something you
value has always been a security issue - long before ACLs or editors or
computers.

NeilBrown

2010-12-28 22:54:33

by Olaf van der Spek

[permalink] [raw]
Subject: Re: Atomic non-durable file write API

On Tue, Dec 28, 2010 at 11:31 PM, Neil Brown <[email protected]> wrote:
>> True, but all those exceptions (IMO) should be (proven to be) no problem.
>> I'd prefer designs that don't have such exceptions. I may not be able
>> to think of a concrete problem right now, but that doesn't mean such
>> problems don't exist.
>
> Very true.  But until such problems are described and understood, there is not
> a lot of point trying to implement a solution.  Premature implementation,
> like premature optimisation, is unlikely to be fruitful.  I know this from
> experience.

The problems seem clear. The implications not yet.

>>
>> I also don't understand why providing this feature is such a
>> (performance) problem.
>> Surely the people that claim this should be able to explain why.
>
> Without a concrete design, it is hard to assess the performance impact.  I
> would guess that those who anticipate a significant performance impact are
> assuming a more feature-full implementation than you are, and they are
> probably doing that because they feel that you need the extra features to
> meet the actual needs (and so suggest those needs are best met by a DBMS rather
> than a file-system).
> Of course this is just guesswork.  Without concrete reference points it is
> hard to be sure.

True, I don't understand why people say it will cause a performance
hit but then don't want to tell why.

>>
>> > You seem to be asking for the ability to atomically change the data in a file
>> > without changing the metadata.  I cannot see why you would want this.  Maybe
>> > you could give an explicit use-case??
>>
>> Where losing meta-data is bad? That should be obvious.
>> Or where losing file owner is bad? Still thinking about that one.
>
> This is a bit left-field, but I think that losing metadata is always a good
> thing.  A file should contain data - nothing else.  At all.  Owner and access
> permissions should be based on location as dictated by external policy....
> but yeah - off topic.

In that case meta-data shouldn't be supported in the first place.

> Clearly maintaining metadata by creating a new file and renaming in-place is
> easy for root (chown/chmod/etc).  So you are presumably envisaging situations
> where a non-root user has write access to a file that they don't own, and
> they want to make an atomic data-update to that file.
> Sorry, but I think that allowing non-owners to write to a file is a really
> really bad idea and providing extra support for that use-case is completely
> unjustifiable.

Isn't it quite common?
Is preserving other meta-data really easy enough to be sure most apps do it?

> If you want multiple people to be able to update some data you should have
> some way to ask the owner to make an update.  That could be:
>  - setuid program
>  - daemon which authenticates requests
>  - distributed workflow tool like 'git' where you speak to the owner
>    and ask them to pull updates.
>
> and there are probably other options.  But un-mediated writes to a file you
> don't own?  Just say NO!

Wouldn't that make Linux user groups quite useless?

Olaf

2010-12-28 23:42:41

by Theodore Ts'o

[permalink] [raw]
Subject: Re: Atomic non-durable file write API

On Tue, Dec 28, 2010 at 11:54:33PM +0100, Olaf van der Spek wrote:

> > Very true. But until such problems are described and understood,
> > there is not a lot of point trying to implement a
> > solution. Premature implementation, like premature optimisation,
> > is unlikely to be fruitful. I know this from experience.
>
> The problems seem clear. The implications not yet.

I don't think there's even agreement that it is a problem. A problem
implies a use case where such a need is critical, and I haven't
seen it yet. I'd rather characterize it as a demand for a "solution"
for a problem that hasn't been proven to exist yet.

> >> I also don't understand why providing this feature is such a
> >> (performance) problem.
> >> Surely the people that claim this should be able to explain why.
> >
> > Without a concrete design, it is hard to assess the performance
> > impact. I would guess that those who anticipate a significant
> > performance impact are assuming a more feature-full implementation
> > than you are, and they are probably doing that because they feel
> > that you need the extra features to meet the actual needs (and so
> > suggest those needs are best met by a DBMS rather than a
> > file-system). Of course this is just guesswork. Without concrete
> > reference points it is hard to be sure.
>
> True, I don't understand why people say it will cause a performance
> hit but then don't want to tell why.

Because I don't want to waste time doing a hypothetical design when (a)
the specification space hasn't even been fully spec'ed out, and (b) no
compelling use case has been demonstrated, and (c) no one is paying
me.

The last point is a critical one; who's going to do the work? If you
are going to do the work, then implement it and send us the patches.
If you expect a technology expert to do the work, it's dirty pool to
try to force him or her to do a design to "prove" that it's not trivial.

If you're going to pay me $50,000 or $100,000, then it's on the golden
rule principle (the customer with the gold, makes the rules), and I'll
happily work on a design even if in my best judgment it's ill-advised,
and probably will be a waste of money, because, hey, it's the
customer's money. But if you're going to ask me to spend my time
working on something which in my professional opinion is a waste of
time, and do it pro bono, you must be smoking something really good,
and probably really illegal.

Here are some of the hints though about trouble spots.

1) What happens in disk full cases? Remember, we can't free the old
inode until writeback has happened. And if we haven't allocated space
yet for the file, and space is needed for the new file, what happens?
What if some other disk write needs the space?

2) How big are the files that you imagine should be supported with
such a scheme? If the file system is 1 GB, and the file is 600MB, and
you want to replace it with new contents which is 750MB long, what
happens? How does the system degrade gracefully in the case of larger
files? Does the user get any notification that maybe the magic
O_PONIES semantics might be changing?

3) What if the rename is still pending, but in the mean time, some
other process modifies the file? Do those writes also have to be
atomic vis-a-vis the rename?

4) What if the rename is still pending, but in the meantime, some
other process creates another new file, and renames it over the same
file name?

etc.

> >> Where losing meta-data is bad? That should be obvious.
>
> In that case meta-data shouldn't be supported in the first place.

Well, hold on a minute. It depends on what the meta-data means. If
the meta-data is supposed to be a secure indication of who created the
file, or more importantly if quotas are enforced, to whom the disk
usage quota should be charged, then it might not be allowable to
"preserve the metadata in some cases".

In general, you can always save the meta data, and restore the meta
data to the new file --- except when there are security reasons why
this isn't allowed. For example, file ownership is special, because
of (a) setuid bit considerations, and (b) file quota considerations.
If you don't have those issues, then allowing a non-privileged user to
use chown() is perfectly acceptable. But it's because of these issues
that chown() is special.

And if quota is enabled, replacing a 10MB file with a 6TB file, while
preserving the same file "owner", and therefore charging the 6TB to
the old owner, would be a total evasion of the quota system.

In any case, have fun trying to design this system for which you have
no use cases....

- Ted
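
To make the disk-full point above concrete: with the temp-file approach the
new contents are written and fsync()ed to a separate file first, so running
out of space fails before the old file is touched, and the partial temp file
can simply be unlinked. A minimal sketch (function name, temp-file naming
and error handling are illustrative only, not an existing API):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Sketch of "write temp file, fsync, rename".  On ENOSPC (or any other
 * write/fsync failure) the old file is never modified; the partly written
 * temp file is removed and the error reported to the caller. */
static int replace_file(const char *path, const char *tmp,
                        const void *buf, size_t len)
{
    int fd = open(tmp, O_WRONLY | O_CREAT | O_EXCL, 0600);
    if (fd < 0)
        return -1;

    const char *p = buf;
    size_t left = len;
    while (left > 0) {
        ssize_t n = write(fd, p, left);
        if (n <= 0) {                   /* e.g. out of space */
            close(fd);
            unlink(tmp);                /* old file is still intact */
            return -1;
        }
        p += n;
        left -= (size_t)n;
    }

    if (fsync(fd) != 0) {               /* data on disk before the rename */
        close(fd);
        unlink(tmp);
        return -1;
    }
    close(fd);

    return rename(tmp, path);           /* atomically switch to the new contents */
}

Note that this says nothing about the ownership and quota questions raised
above; it only shows how the old data survives an allocation failure.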

2010-12-29 09:09:50

by Olaf van der Spek

[permalink] [raw]
Subject: Re: Atomic non-durable file write API

On Wed, Dec 29, 2010 at 12:42 AM, Ted Ts'o <[email protected]> wrote:
> On Tue, Dec 28, 2010 at 11:54:33PM +0100, Olaf van der Spek wrote:
>
>> > Very true.  But until such problems are described and understood,
>> > there is not a lot of point trying to implement a
>> > solution.  Premature implementation, like premature optimisation,
>> > is unlikely to be fruitful.  I know this from experience.
>>
>> The problems seem clear. The implications not yet.
>
> I don't think there's even agreement that it is a problem.  A problem

Maybe problem isn't the right word, but it does seem like a corner case / exception.

> implies a use case where such a need is critical, and I haven't
> seen it yet.  I'd rather characterize it as a demand for a "solution"
> for a problem that hasn't been proven to exist yet.
>
>> True, I don't understand why people say it will cause a performance
>> hit but then don't want to tell why.
>
> Because I don't want to waste time doing a hypothetical design when (a)
> the specification space hasn't even been fully spec'ed out, and (b) no
> compelling use case has been demonstrated, and (c) no one is paying
> me.

> The last point is a critical one; who's going to do the work?  If you
> are going to do the work, then implement it and send us the patches.
> If you expect a technology expert to do the work, it's dirty pool to
> try to force him or her to do a design to "prove" that it's not trivial.
>
> If you're going to pay me $50,000 or $100,000, then it's on the golden
> rule principle (the customer with the gold, makes the rules), and I'll
> happily work on a design even if in my best judgment it's ill-advised,
> and probably will be a waste of money, because, hey, it's the
> customer's money.  But if you're going to ask me to spend my time
> working on something which in my professional opinion is a waste of
> time, and do it pro bono, you must be smoking something really good,
> and probably really illegal.

I don't want you to work on something you do not support.
I want to understand why you think it's a bad idea.

> Here are some of the hints though about trouble spots.
>
> 1) What happens in disk full cases?  Remember, we can't free the old
> inode until writeback has happened.  And if we haven't allocated space
> yet for the file, and space is needed for the new file, what happens?
> What if some other disk write needs the space?

I would expect a no space error.

> 2) How big are the files that you imagine should be supported with
> such a scheme?  If the file system is 1 GB, and the file is 600MB, and
> you want to replace it with new contents which is 750MB long, what
> happens?  How does the system degrade gracefully in the case of larger
> files?  Does the user get any notification that maybe the magic
> O_PONIES semantics might be changing?

No semantics will change; you'll get a no-space error,
just like you would if you used the temp file approach.

> 3) What if the rename is still pending, but in the mean time, some
> other process modifies the file?  Do those writes also have to be
> atomic vis-a-vis the rename?

So the rename has been executed already (but has not yet been committed
to disk) and then the file is modified? Those writes would apply to the new
file.

> 4) What if the rename is still pending, but in the meantime, some
> other process creates another new file, and renames it over the same
> file name?

The last update would win, if by pending you mean the rename has been
executed already but hasn't been written to disk yet.

> etc.
>
>> >> Where losing meta-data is bad? That should be obvious.
>>
>> In that case meta-data shouldn't be supported in the first place.
>
> Well, hold on a minute.  It depends on what the meta-data means.  If
> the meta-data is supposed to be a secure indication of who created the
> file, or more importantly if quotas are enforced, to whom the disk
> usage quota should be charged, then it might not be allowable to
> "preserve the metadata in some cases".

I understand you can't just allow chown, but ...

> In general, you can always save the meta data, and restore the meta
> data to the new file --- except when there are security reasons why
> this isn't allowed.  For example, file ownership is special, because
> of (a) setuid bit considerations, and (b) file quota considerations.
> If you don't have those issues, then allowing a non-privileged user to
> use chown() is perfectly acceptable.  But it's because of these issues
> that chown() is special.
>
> And if quota is enabled, replacing a 10MB file with a 6TB file, while
> preserving the same file "owner", and therefore charging the 6TB to
> the old owner, would be a total evasion of the quota system.

Isn't that already a problem if you have write access to a file you don't own?

Still waiting on an answer to:
> What is the recommended way for atomic (complete) file writes?

Given that (you say) so many get it wrong, it would be nice to know
the right way.

Olaf
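
One detail that often gets left out of the "temp file, fsync, rename"
recipe: the rename() is atomic with respect to a crash, but not durable by
itself. If the rename must also survive a crash, the usual advice is to
fsync() the parent directory afterwards; for the purely non-durable atomic
replace discussed in this thread, that extra step can be skipped. A small
sketch (the helper name and the /etc example are made up):

#define _GNU_SOURCE          /* for O_DIRECTORY on older glibc */
#include <fcntl.h>
#include <unistd.h>

/* Sketch: make a completed rename() durable by syncing the directory that
 * contains the target file.  Only needed when durability, not just
 * atomicity, is wanted. */
static int sync_parent_dir(const char *dirpath)
{
    int fd = open(dirpath, O_RDONLY | O_DIRECTORY);
    if (fd < 0)
        return -1;
    int ret = fsync(fd);
    close(fd);
    return ret;
}

/* e.g. after rename("/etc/foo.tmp", "/etc/foo"):  sync_parent_dir("/etc"); */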

2010-12-29 11:05:36

by Dave Chinner

[permalink] [raw]
Subject: Re: Atomic non-durable file write API

On Tue, Dec 28, 2010 at 05:00:55PM -0500, Greg Freemyer wrote:
> If I am working on an office doc with oowriter as an example, I don't
> want a system crash or running out of disk space to kill my original doc. 7 or
> 8 years ago XFS used to zero out the file in situations like that.

FUD. XFS has _never_ zeroed files during recovery. This gets repeated
often enough that we've even got a FAQ entry for it:

http://xfs.org/index.php/XFS_FAQ#Q:_Why_do_I_see_binary_NULLS_in_some_files_after_recovery_when_I_unplugged_the_power.3F

> Hopefully that's fixed.

4 years ago...

Cheers,

Dave.
--
Dave Chinner
[email protected]

2010-12-29 15:29:57

by Christian Stroetmann

[permalink] [raw]
Subject: Re: Atomic non-durable file write API

On the 28.12.2010 23:54, Olaf van der Spek wrote:
> On Tue, Dec 28, 2010 at 11:31 PM, Neil Brown<[email protected]> wrote:
>>> True, but all those exceptions (IMO) should be (proven to be) no problem.
>>> I'd prefer designs that don't have such exceptions. I may not be able
>>> to think of a concrete problem right now, but that doesn't mean such
>>> problems don't exist.
>> Very true. But until such problems are described and understood, there is not
>> a lot of point trying to implement a solution. Premature implementation,
>> like premature optimisation, is unlikely to be fruitful. I know this from
>> experience.
> The problems seem clear. The implications not yet.
>
>>> I also don't understand why providing this feature is such a
>>> (performance) problem.
>>> Surely the people that claim this should be able to explain why.
>> Without a concrete design, it is hard to assess the performance impact. I
>> would guess that those who anticipate a significant performance impact are
>> assuming a more feature-full implementation than you are, and they are
>> probably doing that because they feel that you need the extra features to
>> meet the actual needs (and so suggest those needs are best met by a DBMS rather
>> than a file-system).
>> Of course this is just guesswork. Without concrete reference points it is
>> hard to be sure.
> True, I don't understand why people say it will cause a performance
> hit but then don't want to tell why.

We are talking about atomicity. And it is a simple fact in the field of
information processing/informatics/computer science that if someone
wants to give/have the guarantee of atomicity, then she/he has to take
several additional steps, often by using an additional data structure. In
the end this all costs more time and/or space than doing it without
atomicity. At this point there is no discussion anymore, because this is
fully discussed to the maximum in subjects like Efficient Algorithms,
Special Problem Fields of Operating System Design and Fundamentals of
DBMS Design (e.g. the ACID principle, where the A stands for atomicity).
And such fundamental points do not need to be discussed here.

Furthermore, given their competence, FS gurus like Ted can estimate that
the additional steps would have to be taken by several functions of an FS,
which implies a performance loss. And because elementary FS functions are
involved, the performance loss could be, and in the past has been,
significant, though in nearly all cases I have seen the reason was a very
bad implementation. The only exception so far is the Reiser4 FS: all of its
file operations are atomic, but still at a small cost in performance in
most cases, plus the need for a repacker in a few cases, which do show a
significant loss of performance.

And the advice to use a well-known DBMS is simply based on the knowledge
that it has all the needed functionality already implemented in a highly
performant way, and on the knowledge that such a solution is often used
for comparable use cases due to the cost vs. benefit ratio.
Taking a look at the Reiser4 FS could also help.

<snip>

> Olaf

The bar is opened
Christian Stroetmann


2010-12-29 15:41:45

by Olaf van der Spek

[permalink] [raw]
Subject: Re: Atomic non-durable file write API

On Wed, Dec 29, 2010 at 4:30 PM, Christian Stroetmann
<[email protected]> wrote:
>> True, I don't understand why people say it will cause a performance
>> hit but then don't want to tell why.
>
> We are talking about atomicity. And it is a simple fact in the field of
> information processing/informatics/computer science that if someone wants to
> give/have the guarantee of atomicity, then she/he has to do several
> additional steps often by using an additional data structure. In the end

Additional steps compared to what? The temp file, fsync, rename case?

> this all costs more time and/or space than doing it without atomicity. At

Of course. But this should not affect the non-atomic usage.

> this point there is no discussion anymore, because this is fully discussed
> to the maximum in subjects like Efficient Algorithms, Special Problem Fields
> of Operating System Design and Fundamentals of DBMS Design (e.g. the ACID
> principle).
> And such fundamental points are not (needed to be) discussed here.
>
> Furthermore, due to the competence it is possible for FS gurus like Ted to
> estimate that the additional steps have to be done by several functions of
> an FS, which implies performance loss. And because elementary FS functions
> are involved the performance loss could be and in the past have been
> significant, though in nearly all cases I have seen the reason was a very
> bad implementation. The only exception so far is the Reiser4 FS: All of its
> file operations are atomic, but still to a little cost of performance in the
> most cases and the need of a repacker in some few cases which show a
> significant loss of performance.

So making all ops atomic can be done at a little performance hit, but
implementing one specific op costs a huge performance hit? That
doesn't make sense and seems to indicate those that say otherwise
aren't right.

> And the advice to use a well-known DBMS is simply based on the knowledge
> that it has all the needed functionality already implemented in a highly
> performant way, and on the knowledge that such a solution is used oftenly
> for comparable use cases due to the cost vs. benefit ratio.
> To take a look at the Reiser4 FS could also help.

I don't think storing all my conf files, executables, libraries etc in
a DBMS is a good idea...

Olaf

2010-12-29 16:30:19

by Christian Stroetmann

[permalink] [raw]
Subject: Re: Atomic non-durable file write API

On the 29.12.2010 16:41, Olaf van der Spek wrote:
> On Wed, Dec 29, 2010 at 4:30 PM, Christian Stroetmann
> <[email protected]> wrote:
>>> True, I don't understand why people say it will cause a performance
>>> hit but then don't want to tell why.
>> We are talking about atomicity. And it is a simple fact in the field of
>> information processing/informatics/computer science that if someone wants to
>> give/have the guarantee of atomicity, then she/he has to do several
>> additional steps often by using an additional data structure. In the end
> Additional steps compared to what? The temp file, fsync, rename case?

read the paragraphs as a whole

>> this all costs more time and/or space than doing it without atomicity. At
> Of course. But this should not affect the non-atomic usage.

read the whole paragraphs again

>> this point there is no discussion anymore, because this is fully discussed
>> to the maximum in subjects like Efficient Algorithms, Special Problem Fields
> >> of Operating System Design and Fundamentals of DBMS Design (e.g. the ACID
>> principle).
>> And such fundamental points are not (needed to be) discussed here.
>>
>> Furthermore, due to the competence it is possible for FS gurus like Ted to
>> estimate that the additional steps have to be done by several functions of
>> an FS, which implies performance loss. And because elementary FS functions
>> are involved the performance loss could be and in the past have been
>> significant, though in nearly all cases I have seen the reason was a very
>> bad implementation. The only exception so far is the Reiser4 FS: All of its
>> file operations are atomic, but still to a little cost of performance in the
>> most cases and the need of a repacker in some few cases which show a
>> significant loss of performance.
> So making all ops atomic can be done at a little performance hit, but
> implementing one specific op costs a huge performance hit? That
> doesn't make sense and seems to indicate those that say otherwise
> aren't right.

No, not in all cases, as was explained (read the second paragraph again).
And also, the Reiser4 FS does not use standard journaling to achieve this,
and for that reason had to change the whole design of the FS (read about
the design concepts of the different FSs).

>> And the advice to use a well-known DBMS is simply based on the knowledge
>> that it has all the needed functionality already implemented in a highly
>> performant way, and on the knowledge that such a solution is used oftenly
>> for comparable use cases due to the cost vs. benefit ratio.
>> To take a look at the Reiser4 FS could also help.
> I don't think storing all my conf files, executables, libraries etc in
> a DBMS is a good idea...

read both of the threads you started again, as a whole

> Olaf

Christian Stroetmann

2010-12-29 17:14:05

by Olaf van der Spek

[permalink] [raw]
Subject: Re: Atomic non-durable file write API

On Wed, Dec 29, 2010 at 5:30 PM, Christian Stroetmann
<[email protected]> wrote:
>> Additional steps compared to what? The temp file, fsync, rename case?
>
> read the paragraphs as a whole

Reading stuff again isn't going to change my question.

>>> And the advice to use a well-known DBMS is simply based on the knowledge
>>> that it has all the needed functionality already implemented in a highly
>>> performant way, and on the knowledge that such a solution is used oftenly
>>> for comparable use cases due to the cost vs. benefit ratio.
>>> To take a look at the Reiser4 FS could also help.
>>
>> I don't think storing all my conf files, executables, libraries etc in
>> a DBMS is a good idea...
>
> read the whole both threads started by you again

And then what?

Olaf

2010-12-30 00:50:12

by NeilBrown

[permalink] [raw]
Subject: Re: Atomic non-durable file write API

On Wed, 29 Dec 2010 18:14:04 +0100 Olaf van der Spek <[email protected]>
wrote:

> On Wed, Dec 29, 2010 at 5:30 PM, Christian Stroetmann
> <[email protected]> wrote:
> >> Additional steps compared to what? The temp file, fsync, rename case?
> >
> > read the paragraphs as a whole
>
> Reading stuff again isn't going to change my question.
>

OK, the fun is over.  I guess it is time to actually answer your question,
rather than just teasing you with partial answers and hints about performance
impact ....

Your question, as I understand it is:

You see a hypothetical problem for which you cannot see a solution in
Linux, but for which you also cannot present a concrete use-case where
this problem needs to be addressed.
You want to know what the recommended solution is.


The reality is that the solution was devised and implemented many years ago
and is deeply embedded in the core design principles of Unix and Linux.
The reason that you cannot present a use-case is that there isn't one.
Unix was designed so that this hypothetical need will never arise.

There is a strong parallel with computer viruses. You could say "viruses
could be a problem, and while I cannot actually present one that is a
problem, I want to know what the recommended solution to viruses is".
The answer is, of course, that Unix/Linux is largely immune to viruses,
not because of any specific anti-virus feature that was designed and
implemented, but because the over-all design approach of Unix makes
viruses hard to spread and rather ineffectual if one ever did take hold.

At least, I think that is the correct answer. However if you actually have
a concrete use-case, then maybe there is a better answer. I wouldn't know
without seeing the use-case.


(And I was joking about the teasing and the hints - it just seemed to make a
better story if I told it that way :-)

NeilBrown

2011-01-07 14:23:40

by Olaf van der Spek

[permalink] [raw]
Subject: Re: Atomic non-durable file write API

On Thu, Dec 30, 2010 at 1:50 AM, Neil Brown <[email protected]> wrote:
> On Wed, 29 Dec 2010 18:14:04 +0100 Olaf van der Spek <[email protected]>
> wrote:
>
>> On Wed, Dec 29, 2010 at 5:30 PM, Christian Stroetmann
>> <[email protected]> wrote:
>> >> Additional steps compared to what? The temp file, fsync, rename case?
>> >
>> > read the paragraphs as a whole
>>
>> Reading stuff again isn't going to change my question.
>>
>
> OK, the fun is over.  I guess it is time to actually answer your question,
> rather than just teasing you with partial answers and hints about performance
> impact ....
>
> Your question, as I understand it is:
>
>   You see a hypothetical problem for which you cannot see a solution in
>   Linux, but for which you also cannot present a concrete use-case where
>   this problem needs to be addressed.
>   You want to know what the recommended solution is.
>
>
>   The reality is that the solution was devised and implemented many years ago
>   and is deeply embedded in the core design principles of Unix and Linux.
>   The reason that you cannot present a use-case is that there isn't one.
>   Unix was designed so that this hypothetical need will never arise.

It's so hypothetical that a number of other comments on Ted's blog
about this ask the same question:
http://thunk.org/tytso/blog/2009/03/12/delayed-allocation-and-the-zero-length-file-problem/#comment-1979
http://thunk.org/tytso/blog/2009/03/12/delayed-allocation-and-the-zero-length-file-problem/#comment-1981
http://thunk.org/tytso/blog/2009/03/12/delayed-allocation-and-the-zero-length-file-problem/#comment-1990
http://thunk.org/tytso/blog/2009/03/12/delayed-allocation-and-the-zero-length-file-problem/#comment-1992
http://thunk.org/tytso/blog/2009/03/12/delayed-allocation-and-the-zero-length-file-problem/#comment-2095
http://thunk.org/tytso/blog/2009/03/12/delayed-allocation-and-the-zero-length-file-problem/#comment-2099

And many more.

http://thunk.org/tytso/blog/2009/03/12/delayed-allocation-and-the-zero-length-file-problem/

Olaf