2013-06-20 21:42:46

by Ryan Lortie

[permalink] [raw]
Subject: ext4 file replace guarantees

hi,

I recently read the kernel documentation on the topic of guarantees
provided by ext4 when renaming-over-existing. I found this:

(*) == default

auto_da_alloc(*) Many broken applications don't use fsync() when
noauto_da_alloc replacing existing files via patterns such
as
fd =
open("foo.new")/write(fd,..)/close(fd)/
rename("foo.new", "foo"), or
worse yet,
fd = open("foo",
O_TRUNC)/write(fd,..)/close(fd).
If auto_da_alloc is enabled,
ext4 will detect
the replace-via-rename and
replace-via-truncate
patterns and force that any
delayed allocation
blocks are allocated such that
at the next
journal commit, in the default
data=ordered
mode, the data blocks of the new
file are forced
to disk before the rename()
operation is
committed. This provides
roughly the same level
of guarantees as ext3, and
avoids the
"zero-length" problem that can
happen when a
system crashes before the
delayed allocation
blocks are forced to disk.


in https://www.kernel.org/doc/Documentation/filesystems/ext4.txt

which says to me "replace by rename is guaranteed safe in modern ext4,
under default mount options".

I understand that this was added after the "ext4 is eating my data"
panic in 2009.

Knowing that ext4 provides this guarantee caused me to modify GLib to
remove the fsync() that we used to do from g_file_set_contents(), if we
detect that we are on ext2/3/4:

https://git.gnome.org/browse/glib/commit/?id=9d0c17b50102267a5029b58b1f44efbad82d8f03

(we already skipped the fsync() on btrfs since this filesystem
guarantees that replace-by-rename is safe):

"""
What are the crash guarantees of overwrite-by-rename?

Overwriting an existing file using a rename is atomic. That means that
either the old content of the file is there or the new content. A
sequence like this:
"""

in
https://btrfs.wiki.kernel.org/index.php/FAQ#What_are_the_crash_guarantees_of_overwrite-by-rename.3F

We don't really care too much about ext2 (although it would be great if
there was a convenient API to detect the difference between
ext2/ext3/ext4 filesystems since they all share one magic number).

Anyway... by mistake, this patch (removing fsync on ext4) got backported
into one of our stable releases and landed in Debian and the Fedora 19
beta, where many users started reporting data loss.

So what's the story here? Is this safe or not?


The _only_ thing that I can think of is that GLib also does an
fallocate() before writing the data. Does doing fallocate() before
write() void the rename-is-safe guarantees or is this just a filesystem
bug?

In any case, we have reverted the patch for now to work around the
issue.

It would be great if I could find out some official word on what the
guaranteed behaviour of the filesystem is with respect to
replace-by-rename. Trying to dance around these issues is starting to
get a bit annoying...

Thanks in advance.


2013-06-21 00:59:41

by Theodore Ts'o

[permalink] [raw]
Subject: Re: ext4 file replace guarantees

On Thu, Jun 20, 2013 at 05:34:18PM -0400, Ryan Lortie wrote:
>
> in https://www.kernel.org/doc/Documentation/filesystems/ext4.txt
>
> which says to me "replace by rename is guaranteed safe in modern ext4,
> under default mount options".

It's not _guaranteed_ safe. It significantly reduces the chances of
data loss in case of a crash, but it's possible for the transaction
containing the rename to close before the blocks are written back. So
if the transaction is almost full, or there is a fsync() racing with
the rename(), such that the file system operation to allocate the
delayed allocation blocks ends up in a different transaction than the
transaction where the rename took place (race #1), and then you crash
before the second transaction completes (race #2), you could lose
data.

You'll have to make your own decision about how likely this
combination is to happen. The failure scenario would probably be
something like the user who plays tux racer all the time, and uses
crappy proprietary drivers that crash the system every single time an
OpenGL application exits. If they think that's normal, and are
willing to live with the crap proprietary drivers, and they are also
the sort of people who carefully position all of their windows to be
precisely just so, and if the [email protected]#!?! desktop libraries are still
bogusly rewriting the entire contents of every single registry file,
regardless of whether the application changed anything --- then
eventually, said user will whine about how the hours she spent
obsessively setting up their window layout got lost after Tux Racer
creashed their system *again*.

(Unfortunately, this example is not entirely hypothetical....)

Regards,

- Ted

2013-06-21 12:43:17

by Ryan Lortie

[permalink] [raw]
Subject: Re: ext4 file replace guarantees

hi Ted,

Thanks for the quick reply.

On Thu, Jun 20, 2013, at 20:59, Theodore Ts'o wrote:
> It's not _guaranteed_ safe. It significantly reduces the chances of
> data loss in case of a crash, but it's possible for the transaction
> containing the rename to close before the blocks are written back.

I think you need to update the documentation. Specifically, this:

"avoids the "zero-length" problem that can happen
when a system crashes before the delayed allocation
blocks are forced to disk"

makes it sound like the replace-by-rename-without-fsync problem has been
solved. This is the statement that caused me to remove the extra
fsync() we had in GLib.

Probably "avoids" should be replaced with "significantly reduces the
chances of", as you write in your mail here.

Note that "significantly reduces the chances of" is not good enough to
prevent about a dozen reports of lost data that I alone have heard about
in the past few days...

https://bugzilla.gnome.org/show_bug.cgi?id=701560#c30
https://plus.google.com/113426423446646532879/posts/Mqepf8STJEK?cfem=1
https://bugzilla.redhat.com/show_bug.cgi?id=975521
https://bugzilla.gnome.org/show_bug.cgi?id=702574
https://bugzilla.gnome.org/show_bug.cgi?id=702394

and many "me too" cases from IRC, including reports from two Debian
users.

So I think that the "significantly reduces the chances of" statement
itself is even suspect...

It would be great if the docs would just said "If you want safety with
ext4, it's going to be slow. Please always call fsync()." instead of
making it sound like I'll probably be mostly OKish if I don't.

> You'll have to make your own decision about how likely this
> combination is to happen. The failure scenario would probably be
> something like the user who plays tux racer all the time, and uses
> crappy proprietary drivers that crash the system every single time an
> OpenGL application exits. If they think that's normal, and are
> willing to live with the crap proprietary drivers, and they are also
> the sort of people who carefully position all of their windows to be
> precisely just so, and if the [email protected]#!?! desktop libraries are still
> bogusly rewriting the entire contents of every single registry file,
> regardless of whether the application changed anything --- then
> eventually, said user will whine about how the hours she spent
> obsessively setting up their window layout got lost after Tux Racer
> creashed their system *again*.

Dude....



Thanks again.

2013-06-21 13:15:25

by Theodore Ts'o

[permalink] [raw]
Subject: Re: ext4 file replace guarantees

On Fri, Jun 21, 2013 at 08:43:16AM -0400, Ryan Lortie wrote:
>
> On Thu, Jun 20, 2013, at 20:59, Theodore Ts'o wrote:
> > It's not _guaranteed_ safe. It significantly reduces the chances of
> > data loss in case of a crash, but it's possible for the transaction
> > containing the rename to close before the blocks are written back.
>
> I think you need to update the documentation. Specifically, this:
>
> "avoids the "zero-length" problem that can happen
> when a system crashes before the delayed allocation
> blocks are forced to disk"
>
> makes it sound like the replace-by-rename-without-fsync problem has been
> solved. This is the statement that caused me to remove the extra
> fsync() we had in GLib.

I agree it can be read that way, although we were very careful to
avoid the word "guarantee".

> Note that "significantly reduces the chances of" is not good enough to
> prevent about a dozen reports of lost data that I alone have heard about
> in the past few days...
>
> https://bugzilla.gnome.org/show_bug.cgi?id=701560#c30

So in at least a few of these bugs, people are failing to fsync()
after creatinig a new file. The implied flush *only* happens in two
cases:

#1) If an existing file was opened using O_TRUNC, an implied flush is
sent after the file is closed.

#2) If an existing file is removed via a rename, if there are any
delayed allocation blocks in the new file, they will be flushed
out.

One of the test cases created lots of new files, and that's not one of
the two cases shown above.

> It would be great if the docs would just said "If you want safety with
> ext4, it's going to be slow. Please always call fsync()." instead of
> making it sound like I'll probably be mostly OKish if I don't.

This is going to be true for all file systems. If application writers
are trying to surf the boundaries of what is safe or not, inevitably
they will eventually run into problems.

Also, although fsync() is not free, it's not the performance disaster
it was in ext3. Precisely because of delayed allocation, we don't
have to flush out pending writes of unrelated files when you do a
fsync().

Finally, I strongly encourage you to think very carefully about your
strategy for storing these sorts of registry data. Even if it is
"safe" for btrfs, if the desktop applications are constantly writing
back files for no good reason, it's going to burn battery, SSD write
cycles, and etc. And if said registry files are large XML files that
have to be complete rewritten every single time an applicatoin
modifies an element or two (and said application is doing this several
times a seocnd), the user is going to have a bad time --- in shortened
battery and SSD life if nothing else. And this is not a problem a
file system can protect you against, since while we can try to be more
clever about how we manage the metadata, the data blocks still have to
be written to disk.

The fact that you are trying to optimize out the fsync() makes me
wonder if there is something fundamentally flawed in the design of
either the application or its underlying libraries....

- Ted

2013-06-21 13:51:48

by Ryan Lortie

[permalink] [raw]
Subject: Re: ext4 file replace guarantees

hi,

On Fri, Jun 21, 2013, at 9:15, Theodore Ts'o wrote:
> I agree it can be read that way, although we were very careful to
> avoid the word "guarantee".

It would still be great if you could update the docs to say very clearly
that there is no guarantee. Using words other than 'guarantee' can
(clearly) still mislead people to believing that there is.

> #2) If an existing file is removed via a rename, if there are any
> delayed allocation blocks in the new file, they will be flushed
> out.

Can you describe to me exactly what this means in terms of the syscall
API? Do I void these "delayed allocation blocks" guarantees by using
fallocate() before-hand?

> > It would be great if the docs would just said "If you want safety with
> > ext4, it's going to be slow. Please always call fsync()." instead of
> > making it sound like I'll probably be mostly OKish if I don't.
>
> This is going to be true for all file systems. If application writers
> are trying to surf the boundaries of what is safe or not, inevitably
> they will eventually run into problems.

btrfs explicitly states in its documentation that replace-by-rename is
safe without fsync(). Although, I start to wonder about that guarantee
considering my reading of the ext4 documentation led me to believe the
same was true for ext4...

As an application developer, what I want is extremely simple: I want to
replace a file in such a way that if, at ANY time, the system crashes or
loses power, then when it comes back on (after replaying the journal), I
will have, at the filename of the file, either the old contents or the
new contents.

This is my primary concern. I also have a secondary concern:
performance.

I have a legitimate desire to optimise for my second concern without
sacrificing my first concern and I don't think that it's unreasonable
for me to ask where the line is. I don't really accept this "surf the
boundaries" business about fuzzy APIs that "should do the right thing
usually, but no guarantees if you get too close to the line". I just
want to know what I have to do in order to be safe.

If there are no guarantees then please say that, explicitly, in your
documentation. I'm fine calling fsync() constantly if that's what your
filesystem requires me to do.

> Finally, I strongly encourage you to think very carefully about your
> strategy for storing these sorts of registry data. Even if it is
> "safe" for btrfs, if the desktop applications are constantly writing
> back files for no good reason, it's going to burn battery, SSD write
> cycles, and etc.

I agree. A stated design goal of dconf has always been that reads are
insanely fast at the cost of extremely expensive writes. This is why it
is only for storing settings (which are rarely modified, but are read by
the thousands on login). I consider fsync() to be an acceptable level
of "expensive", even on spinning disks, and the system is designed to
hide the latency from applications.

I've gone on several campaigns of bug-filing against other people's
applications in order to find cases of writes-with-no-good-reason and
I've created tools for tracking these cases down and figuring out who is
at fault. The result is that on a properly-functioning system, the
dconf database is never written to except in response to explicit user
actions. This means no writes on login, no writes on merely starting
apps, etc.

Meanwhile, the file size is pretty small. 100k is really on the high
side, for someone who has aggressively customised a large number of
applications.

> modifies an element or two (and said application is doing this several
> times a seocnd), the user is going to have a bad time --- in shortened
> battery and SSD life if nothing else.

I agree. I would consider this an application bug.

But it brings me to another interesting point: I don't want to sync the
filesystem. I really don't care when the data makes it to disk. I
don't want to waste battery and SSD life. But I am forced to. I only
want one thing, which is what I said before: I want to know that either
the old data will be there, or the new data. The fact that I have to
force the hard drive to wake up in order to get this is pretty
unreasonable.

Why can't we get what we really need?

Honestly, what would be absolutely awesome is g_file_set_contents() in
the kernel, with the "old or new" guarantee. I have all of the data in
a single buffer and I know the size of it in advance. I just want you
to get it onto the disk, at some point that is convenient to you, in
such a way that I don't end up destroying the old data while still not
having the new.

If not that, then I'd accept some sort of barrier operation where I
could tell the kernel "do not commit this next operation to disk until
you commit the previous one". It's a bit sad that the only way that I
can get this to happen is to have the kernel do the first operation
*right now* and wait until it's done before telling the kernel to do the
second one.

> The fact that you are trying to optimize out the fsync() makes me
> wonder if there is something fundamentally flawed in the design of
> either the application or its underlying libraries....

I took out the fsync() because I thought it was no longer needed and
therefore pointless. As I mentioned above, with dconf, fsync() is not a
substantial problem (or a visible problem at all, really).

Thanks

2013-06-21 14:33:53

by Theodore Ts'o

[permalink] [raw]
Subject: Re: ext4 file replace guarantees

On Fri, Jun 21, 2013 at 09:51:47AM -0400, Ryan Lortie wrote:
>
> > #2) If an existing file is removed via a rename, if there are any
> > delayed allocation blocks in the new file, they will be flushed
> > out.
>
> Can you describe to me exactly what this means in terms of the syscall
> API? Do I void these "delayed allocation blocks" guarantees by using
> fallocate() before-hand?

Based on how the implementation is currently implemented, any modified
blocks belonging to the inode will be staged out to disk --- Although
with out an explicit CACHE FLUSH command, which is ***extremely***
expensive.

Why are you using fallocate, by the way? For small files, fallocate
is largely pointless. All of the modern file systems which use
delayed allocation can do the right thing without fallocate(2). It
won't hurt, but it won't help, either.

Fallocate is primarily useful for very large files, where the file
system has to start writing the data out to disk due to memory
pressure before the entire file has been written to the page cache.
And it sounds like multi-megabyte dconf files are a bit out of your
design scope. :-)

> I have a legitimate desire to optimise for my second concern without
> sacrificing my first concern and I don't think that it's unreasonable
> for me to ask where the line is. I don't really accept this "surf the
> boundaries" business about fuzzy APIs that "should do the right thing
> usually, but no guarantees if you get too close to the line". I just
> want to know what I have to do in order to be safe.

The POSIX API is pretty clear: if you care about data being on disk,
you have to use fsync().

As file system designers, we're generally rather hesitant to make
guarantees beyond that, since most users are hypersensitive about
performance, and every single time we make additional guarantees
beyond what is specified by POSIX, it further constrains us, and it
may be that one of these guarantees is one you don't care about, but
will impact your performance. Just as the cost of some guarantee you
*do* care about may impact the performance of some other application
which happens to be renaming files but who doesn't necessarily care
about making sure things are forced to disk atomically in that
particular instance.

> But it brings me to another interesting point: I don't want to sync the
> filesystem. I really don't care when the data makes it to disk. I
> don't want to waste battery and SSD life. But I am forced to. I only
> want one thing, which is what I said before: I want to know that either
> the old data will be there, or the new data. The fact that I have to
> force the hard drive to wake up in order to get this is pretty
> unreasonable.
>
> Why can't we get what we really need?
>
> Honestly, what would be absolutely awesome is g_file_set_contents() in
> the kernel, with the "old or new" guarantee. I have all of the data in
> a single buffer and I know the size of it in advance. I just want you
> to get it onto the disk, at some point that is convenient to you, in
> such a way that I don't end up destroying the old data while still not
> having the new.

There are all sorts of rather tricky impliciations with this. For
example, consider what happens if some disk editor does this with a
small text file. OK, fine. Future reads of this text file will get
the new contents, but if the system crashes, when the file read, they
will get the old value. Now suppose other files are created based on
that text file. For example, suppose the text file is a C source
file, and the compiler writes out an object file based on the source
file --- and then the system crashes. What guarantees do we have to
give about the state of the object file after the crash? What if the
object file contains the compiled version of the "new" source file,
but that source file hsa reverted to its original value. Can you
imagine how badly make would get confused with such a thing?

Beyond the semantic difficulties of such an interface, while I can
technically think of ways that this might be workable for small files,
the problem with the file system API is that it's highly generalized,
and while you might promise than you'd only use it for files less than
64k, say, inevitably someone would try to use the exact same interface
with a multi-megabyte file. And then complain when it didn't work,
didn't fulfill the guarantees, or OOM-killed their process, or trashed
their performance of their entire system....

By the way, if the old and new contents of the data you are replacing
is less than 4k, just open the file using O_DIRECT, and do a 4k
replacement write. Keep the unused portion of the file padded out
with zeros, so you don't have to worry about the inode's i_size
update, and just do a 4k write. *Boom*. This is the fast, guaranteed
to work on all file systems, and will result in the least amount of
SSD wear.


By the way, I think I know why people have been seeing more instnaces
of data loss after a power failure. There's been a bug that's been
introduced since the 3.0 kernel which disabled the writeback timer
under some circumstances. This is the timer which forces inodes (for
all file systems; this was generic writeback bug) to be written back
after /sys/proc/vm/dirty_writeback_centisecs. As a result, for files
that were written that did _not_ meet magic overwrite-via-rename
conditions and did not use fsync, they were more succeptible to data
loss after a crash. The patch to fix this was discovered a three
weeks ago, and is attached below.

- Ted

From: Jan Kara <[email protected]>
To: Jens Axboe <[email protected]>
Cc: Wu Fengguang <[email protected]>,
[email protected],
Bert De Jonghe <[email protected]>,
Jan Kara <"[email protected]>, stable"@vger.kernel.org#>
Subject: [PATCH] writeback: Fix periodic writeback after fs mount
Date: Thu, 30 May 2013 10:44:19 +0200
Message-Id: <[email protected]>
X-Mailer: git-send-email 1.8.1.4

Code in blkdev.c moves a device inode to default_backing_dev_info when
the last reference to the device is put and moves the device inode back
to its bdi when the first reference is acquired. This includes moving to
wb.b_dirty list if the device inode is dirty. The code however doesn't
setup timer to wake corresponding flusher thread and while wb.b_dirty
list is non-empty __mark_inode_dirty() will not set it up either. Thus
periodic writeback is effectively disabled until a sync(2) call which can
lead to unexpected data loss in case of crash or power failure.

Fix the problem by setting up a timer for periodic writeback in case we
add the first dirty inode to wb.b_dirty list in bdev_inode_switch_bdi().

Reported-by: Bert De Jonghe <[email protected]>
CC: [email protected] # >= 3.0
Signed-off-by: Jan Kara <[email protected]>
---
fs/block_dev.c | 9 ++++++++-
1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 2091db8..85f5c85 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -58,17 +58,24 @@ static void bdev_inode_switch_bdi(struct inode *inode,
struct backing_dev_info *dst)
{
struct backing_dev_info *old = inode->i_data.backing_dev_info;
+ bool wakeup_bdi = false;

if (unlikely(dst == old)) /* deadlock avoidance */
return;
bdi_lock_two(&old->wb, &dst->wb);
spin_lock(&inode->i_lock);
inode->i_data.backing_dev_info = dst;
- if (inode->i_state & I_DIRTY)
+ if (inode->i_state & I_DIRTY) {
+ if (bdi_cap_writeback_dirty(dst) && !wb_has_dirty_io(&dst->wb))
+ wakeup_bdi = true;
list_move(&inode->i_wb_list, &dst->wb.b_dirty);
+ }
spin_unlock(&inode->i_lock);
spin_unlock(&old->wb.list_lock);
spin_unlock(&dst->wb.list_lock);
+
+ if (wakeup_bdi)
+ bdi_wakeup_thread_delayed(dst);
}

/* Kill _all_ buffers and pagecache , dirty or not.. */
--
1.8.1.4


2013-06-21 15:24:46

by Ryan Lortie

[permalink] [raw]
Subject: Re: ext4 file replace guarantees

hi,

On Fri, Jun 21, 2013, at 10:33, Theodore Ts'o wrote:
> Based on how the implementation is currently implemented, any modified
> blocks belonging to the inode will be staged out to disk --- Although
> with out an explicit CACHE FLUSH command, which is ***extremely***
> expensive.

Okay -- so any modified blocks, not just unallocated ones, therefore
fallocate() doesn't affect us here.... Good.

So why are we seeing the problem happen so often? Do you really think
this is related to a bug that was introduced in the block layer in 3.0
and that once that bug is fixed replace-by-rename without fsync() will
become "relatively" safe again?

> Why are you using fallocate, by the way? For small files, fallocate
> is largely pointless. All of the modern file systems which use
> delayed allocation can do the right thing without fallocate(2). It
> won't hurt, but it won't help, either.

g_file_set_contents() is a very general purpose API used by dconf but
also many other things. It is being used to write all kinds of files,
large and small. I understand how delayed allocation on ext4 is
essentially giving me the same thing automatically for small files that
manage to be written out before the kernel decides to do the allocation
but doing this explicitly will mean that I'm always giving the kernel
the information it needs, up front, to avoid fragmentation to the
greatest extent possible. I see it as "won't hurt and may help" and
therefore I do it.

I'm happy to remove it on your (justified) advice, but keep in mind that
people are using this API for larger files as well...

> The POSIX API is pretty clear: if you care about data being on disk,
> you have to use fsync().

Well, in fairness, it's not even clear on this point. POSIX doesn't
really talk about any sort of guarantees across system crashes at all...
and I can easily imagine that fsync() still doesn't get me what I want
in some really bizarre cases (like an ecryptfs over NFS from a virtual
server using an lvm setup running inside of kvm on a machine with hard
dives that have buggy firmware).

I guess I'm trying to solve for the case of "normal ext4 on a normal
partition on real metal with properly working hardware". Subject to
those constraints, I'm happy to call fsync().

> As file system designers, we're generally rather hesitant to make
> guarantees beyond that, since most users are hypersensitive about
> performance, and every single time we make additional guarantees
> beyond what is specified by POSIX, it further constrains us, and it
> may be that one of these guarantees is one you don't care about, but
> will impact your performance. Just as the cost of some guarantee you
> *do* care about may impact the performance of some other application
> which happens to be renaming files but who doesn't necessarily care
> about making sure things are forced to disk atomically in that
> particular instance.

That's fair... and I do realise the pain in the ass that it is if I call
fsync() on a filesystem that has an ordered metadata guarantee. I'm
asking you to immediately write out my metadata changes that came after
other metadata, so you have to write all of it out first.

This is part of why I'd rather avoid the fsync entirely...

aside: what's your opinion on fdatasync()? Seems like it wouldn't be
good enough for my usecase because I'm changing the size of the file....

another aside: why do you make global guarantees about metadata changes
being well-ordered? It seems quite likely that what's going on on one
part of the disk by one user is totally unrelated to what's going on on
another other part by a different user... ((and I do appreciate the
irony that I am committing by complaining about "other guarantees that I
don't care about")).

> There are all sorts of rather tricky impliciations with this. For
> example, consider what happens if some disk editor does this with a
> small text file. OK, fine. Future reads of this text file will get
> the new contents, but if the system crashes, when the file read, they
> will get the old value. Now suppose other files are created based on
> that text file. For example, suppose the text file is a C source
> file, and the compiler writes out an object file based on the source
> file --- and then the system crashes. What guarantees do we have to
> give about the state of the object file after the crash? What if the
> object file contains the compiled version of the "new" source file,
> but that source file hsa reverted to its original value. Can you
> imagine how badly make would get confused with such a thing?

Ya... I can see this. I don't think it's important for normal users,
but this is an argument that goes to the heart of "what is a normal
user" and is honestly not a useful discussion to have here...

I guess in fact this answers my previous question about "why do you care
about metadata changes being well ordered?" The answer is "make".

In any case, I don't expect that you'd change your existing guarantees
about the filesystem. I'm suggesting that using this new 'replace file
with contents' API, however, would indicate that I am only interested in
this one thing happening, and I don't care how it relates to anything
else.

If we wanted a way to express that one file's contents should only be
replaced after another file's contents (which might be useful, but
doesn't concern me) then the API could be made more advanced...

> Beyond the semantic difficulties of such an interface, while I can
> technically think of ways that this might be workable for small files,
> the problem with the file system API is that it's highly generalized,
> and while you might promise than you'd only use it for files less than
> 64k, say, inevitably someone would try to use the exact same interface
> with a multi-megabyte file. And then complain when it didn't work,
> didn't fulfill the guarantees, or OOM-killed their process, or trashed
> their performance of their entire system....

I know essentially nothing about the block layer or filesystems, but I
don't understand why it shouldn't work for files larger than 64k. I
would expect this to be reasonably implementable for files up to a
non-trivial fraction of the available memory in the system (say 20%).
The user has successfully allocated a buffer of this size already, after
all... I would certainly not expect anything in the range of
"multi-megabyte" (or even up to 10s or 100s of megabytes) to be a
problem at all on a system with 4 to 8GB of RAM.

If memory pressure really became a big problem you would have two easy
outs before reaching for the OOM stick: force the cached data out to
disk in real time, or return ENOMEM from this new API (instructing
userspace to go about more traditional means of getting their data on
disk).


There are a few different threads in this discussion and I think we've
gotten away from the original point of my email (which wasn't addressed
in your most recent reply): I think you need to update the ext4
documentation to more clearly state that if you care about your data,
you really must call fsync().

Thanks

2013-06-21 16:24:14

by Joseph D. Wagner

[permalink] [raw]
Subject: Re: ext4 file replace guarantees

> I want to know that either the old data will be there, or the new
> data.

Ted,

Please forgive me for the stupid question. Would Ryan's concerns be
addressed by configuring ext4 to run in data=journal mode?

I realize that as a programmer he cannot guarantee that his application
will run on a file system in a particular configuration. I'm just
curious as to how this all fits together.

Joseph D. Wagner

2013-06-21 20:35:51

by Theodore Ts'o

[permalink] [raw]
Subject: Re: ext4 file replace guarantees

So I've been taking a closer look at the the rename code, and there's
something I can do which will improve the chances of avoiding data
loss on a crash after an application tries to replace file contents
via:

1) write foo.new
2) <omit fsync of foo.new>
3) rename foo.new to foo

Those are the kernel patches that I cc'ed you on.

The reason why it's still not a guarantee is because we are not doing
a file integrity writeback; this is not as important for small files,
but if foo.new is several megabytes, not all of the data blocks will
be flushed out before the rename, and this will kill performance, and
in somoe cases it might not be necessary.

Still, for small files ("most config files are smaller than 100k"),
this should serve you just fine. Of course, it's not going to be in
currently deployed kernels, so I don't know how much these proposed
patches will help you,. I'm doing mainly because it helps protects
users against (in my mind) unwise application programmers, and it
doesn't cost us any extra performance from what we are currently
doing, so why not improve things a little?


If you want better guarantees than that, this is the best you can do:

1) write foo.new using file descriptor fd
2) sync_file_range(fd, 0, 0, SYNC_FILE_RANGE_WRITE);
3) rename foo.new to foo

This will work on today's kernels, and it should be safe to do for all
file systems. What sync_file_range() will do is to force out all of
the data blocks (and if foo.new is a some gaurgantuan DVD ISO image,
you may stall for seconds or minutes while the data blocks are written
back). It does not force out any of the metadata blocks, or issue a
CACHE_FLUSH command. But that's OK, because after the rename()
operation, at the next journal commit the metdata blocks will be
flushed out, and the journal commit will issue a CACHE FLUSH command
to the disk.

So this is just as safe as using fsync(), and it will be more
performant. However, sync_file_range(2) is a Linux-specific system
call, so if you care about portability to other operating systems,
you'll have to use fsync() instead on legacy Unix systems. :-)


On Fri, Jun 21, 2013 at 11:24:45AM -0400, Ryan Lortie wrote:
>
> So why are we seeing the problem happen so often? Do you really think
> this is related to a bug that was introduced in the block layer in 3.0
> and that once that bug is fixed replace-by-rename without fsync() will
> become "relatively" safe again?

So there are a couple of explanations I can think of. As I said, at
least one of the test programs was not actually doing a rename()
operation to overwrite an existing file. So in that case, seeing a
zero-length file after a crash really isn't unexpected. The btrfs
wiki also makes it clear that if you aren't doing a rename which
deletes an old file, there are no guarantees.

For situations where the application really was doing rename() to
overwrite an existing file, we were indeed initiating a
non-data-integrity writeback after the rename. So most of the time,
users should have been OK. Now, if the application is still issuing
lots and lots of updates, say multiple times a second while a window
is being moved around, or even once for every single window
resize/move, it could just have been a case of bad luck.

Another example of bad luck might be the case of Tux Racer writing its
high score file, and then shutting down its Open GL context, which
promptly caused the Nvidia driver to crash the entire system. In that
case, the two events would be highly correlated together, so the
chances that the user would get screwed would thus be much, much
higher.

Yet another possible cause is crappy flash devices. Not all flash
devices are forcing out their internal flash translation layer (FTL)
metadata to stable store on a CACHE FLUSH command --- precisely
because if they did, it would trash their performance, and getting
good scores on AnandTech rankings might be more important to them than
the safety of their user's data.

As a result, even for an application which is properly calling
fsync(), you could see data loss or even file system corruption after
a power failure. I recently helped out an embedded systems engineer
who was trying to use ext4 in an appliance, and he was complaining
that with this Intel SSD things worked fine, but with his Brand X SSD
(name ommited to protect the guilty) he was seeing file system
corruption after a power plug pull test. I had to tell them that
there was nothing I could do. If the storage device isn't flushing
everything to stable store upon receipt of a CACHE FLUSH command, that
would be like a file system which didn't properly implement fsync().
If the application doesn't call fsync(), then it's on the application.
But if the application calls fsync(), and data loss still occurs, then
it's on either the file system or the storage device.

Similarly, if the file system doesn't send a CACHE FLUSH command, it's
on the file system (or the administrator, if he or she uses the
nobarrier mount option, which disables the CACHE FLUSH command). But
if the file system does send the CACHE FLUSH command, and the device
isn't guaranteeing that all data sent to the storage device can be
read after a power pull, then it's on the storage device, and it's the
storage device which is buggy.

> g_file_set_contents() is a very general purpose API used by dconf but
> also many other things. It is being used to write all kinds of files,
> large and small. I understand how delayed allocation on ext4 is
> essentially giving me the same thing automatically for small files that
> manage to be written out before the kernel decides to do the allocation
> but doing this explicitly will mean that I'm always giving the kernel
> the information it needs, up front, to avoid fragmentation to the
> greatest extent possible. I see it as "won't hurt and may help" and
> therefore I do it.

So that would be the problem if we defined some new interface which
implemented a replace data contents functionality. Inevitably it
would get used by some crazy application which tried to write a
multi-gigabyte file....

If I were to define such a new syscall, what I'd probably do is export
it a set_contents() type interface. So you would *not* use read or
write, but you would send down a the new contents in a single data
buffer, and if it is too big (where too big is completely at the
discretion of the kernel) you would get back an error, and it would be
up to the application to fall back to the traditional methods of
"write to foo.new, rename foo.new to foo" scheme.

I don't know if I could get the rest of the file system developers to
agree to such an interface, but if we were to do such a thing, that's
the proposal I would make. On the other hand, you keep telling me
that dconf() is only intended to be used for small config files, and
applications should only be calling it but rarely.

In that case, does it really matter if g_file_set_contents() takes a
tenth of a second or so? I can see needing to optimize things if
g_file_set_contents() is getting called several times a second as the
window is getting moved or resized, but I thought we've agreed that's
an abusive use of the interface, and hence not one we should be trying
to spend huge amounts of programming effort trying to optimize.

> > The POSIX API is pretty clear: if you care about data being on disk,
> > you have to use fsync().
>
> Well, in fairness, it's not even clear on this point. POSIX doesn't
> really talk about any sort of guarantees across system crashes at all...

Actually, POSIX does have clear words about this:

"If _POSIX_SYNCHRONIZED_IO is defined, the fsync() function shall
force all currently queued I/O operations associated with the file
indicated by file descriptor fildes to the synchronized I/O
completion state. All I/O operations shall be completed as defined
for synchronized I/O file integrity completion"

And yes, Linux is intended to be implemented with the
_POSIX_SYNCHRONIZED_IO defined. There are operating systems where
this might not be true, though. For example, Mac OS X.

Fun Trivia fact; on Mac OS, fsync doesn't result in a CACHE FLUSH
command, so the contents of your data writes are not guaranteed after
a power failure. If you really want fsync(2) to perform as specified
by POSIX, you must use fcntl and F_FULLFSYNC on Mac OS. The reason?
Surprise, surprise --- performance. And probably because there are
too many applications which were calling fsync() several times a
second during a window resize on the display thread, or some such, and
Steve Jobs wanted things to be silky smooth. :-)

Another fun fact: firefox used to call fsync() on its UI thread.
Guess who got blamed for the resulting UI Jank and got all of the hate
mail from the users? Hint: not the Firefox developers.... and this
sort of thing probably explains Mac OS's design decision vis-a-vis
fsync(). There are a lot of application programmers out there, and
they outnumber the file system developers --- and not all of them are
competent.

> and I can easily imagine that fsync() still doesn't get me what I want
> in some really bizarre cases (like an ecryptfs over NFS from a virtual
> server using an lvm setup running inside of kvm on a machine with hard
> dives that have buggy firmware).

Yes, there are file systems such as NFS which are not POSIX compliant.
And yes, you can always have buggy hardware, such as the crap SSD's
that are out there. But there's nothing anyone can do if the hardware
is crap.

(There's a reason why I tend to stick to Intel SSD's on my personal
laptops. More recently I've experimented with using a Samsung SSD on
one of my machines, but in general, I don't use SSD's from random
manufacturers precisely because I don't trust them.....)

>
> This is part of why I'd rather avoid the fsync entirely...

Well, for your use case, sync_file_range() should actually be
sufficient, and it's what I would recommend instead of fsync(), at
least for Linux which has this syscall.

> aside: what's your opinion on fdatasync()? Seems like it wouldn't be
> good enough for my usecase because I'm changing the size of the file....

fdatasync() is basically sync_file_range() plus a CACHE FLUSH command.
Like sync_file_range, it doesn't sync the metadata (and by the way,
this includes things like indirect blocks for ext2/3 or extent tree
blocks for ext4).

However, for the "write foo.new ; rename foo.new to foo" use case,
either sync_file_range() or fdatasync() is fine, since at the point
where the rename() is committed via the file system transaction, all
of the metadata will be forced out to disk, and there will also be a
CACHE FLUSH sent to the device. So if the desired guarantee is "file
foo contains either the old or new data", fsync(), fdatasync() or
sync_file_range() will all do, but the sync_file_range() will be the
best from a performance POV, since it eliminates a duplicate and
expensive CACHE FLUSH command from being sent to the disk.

> another aside: why do you make global guarantees about metadata changes
> being well-ordered? It seems quite likely that what's going on on one
> part of the disk by one user is totally unrelated to what's going on on
> another other part by a different user... ((and I do appreciate the
> irony that I am committing by complaining about "other guarantees that I
> don't care about")).

There are two answers. One is that very often there are dependencies
between files --- and at the file system level, we don't necessarily
know what these dependencies are (for example, between a .y and .c
file, and a .c and a .o files with respect to make). There may be
(many) files for which it doesn't matter, but how do you tell the file
system which file dependencies matters and which doesn't?

There's another reason why, which is that trying to do this would be
horrifically complicated and/or much, much slower. The problem is
entangled updates. If you modify an inode, you have to write back the
entire inode table block. There may be other inodes that are also in
the procss of being modified. Similarly, you might have a rename
operation, an unlink operation, and a file write operation that all
result in changes to a block allocation bitmap. If you want to keep
these operations separate, then you need to have a very complicated
transaction machinery, with intent logs, and rollback logs, and all of
the rest of the massive complexity that comes with a full-fledged
transactional database. There have been attempts to use a general
database engine to implement a generic file system, but they have
generally been a performance disaster. One such example happened in
the early 90's, when Oracle tried push this concept in order to sell
more OracleDB licenses, but that went over like a lead balloon, and
not just because of the pricing issue.

- Ted

2013-06-21 21:06:06

by Theodore Ts'o

[permalink] [raw]
Subject: Re: ext4 file replace guarantees

On Fri, Jun 21, 2013 at 09:25:13AM -0700, Joseph D. Wagner wrote:
>
> Please forgive me for the stupid question. Would Ryan's concerns be
> addressed by configuring ext4 to run in data=journal mode?

It would, but for most workloads, data=journal is much, much slower.

Another option is to use the nodelalloc mount option. The nodelalloc
option plus the default data=ordered mode will cause an implicit data
integrity writeback of all files that were written at every commit.
This is basically what ext3 did, but it makes fsync()'s painfully slow
since *all* files are getting flushed out, which is what trained so
many application programmers to think fsync() calls needed to be
avoided at all costs. In addition, it decreases performance both for
writes, and since without delalloc, block allocation tends not to be
as good, for future read operations as well.

We are keeping nodelalloc mode for now, basically so that when the
ext4 code is used to mount filesystems explicitly marked as ext3 in
/etc/fstab (via CONFIG_EXT4_USE_AS_EXT23) we are bug-for-bug
compatible with ext3 at least with respect to delayed allocation and
implied writeback on file system commits.

> I realize that as a programmer he cannot guarantee that his application
> will run on a file system in a particular configuration. I'm just
> curious as to how this all fits together.

Agreed, and in addition, this is not something we would want to
recommend to users, since they come with significant downsides.

- Ted

2013-06-21 23:18:33

by Sidorov, Andrei

[permalink] [raw]
Subject: RE: ext4 file replace guarantees

But there is no need to mount entire fs with data journalling mode.
In fact I find per-file data journalling extremely useful. It would be even more useful if it allowed regular users to set journalling mode on specific file and there was some way to designate rewrite transaction boundaries (even 128k would cover a lot of small-but-important-file use cases).

As for now it is a best choice for app running with root privileges for rewriting files <= page size.

Regards,
Andrey.

________________________________________
From: [email protected] [[email protected]] on behalf of Theodore Ts'o [[email protected]]
Sent: Saturday, June 22, 2013 01:05
To: Joseph D. Wagner
Cc: [email protected]; Ryan Lortie
Subject: Re: ext4 file replace guarantees

On Fri, Jun 21, 2013 at 09:25:13AM -0700, Joseph D. Wagner wrote:
>
> Please forgive me for the stupid question. Would Ryan's concerns be
> addressed by configuring ext4 to run in data=journal mode?

It would, but for most workloads, data=journal is much, much slower.

- Ted

2013-06-22 03:29:51

by Dave Chinner

[permalink] [raw]
Subject: Re: ext4 file replace guarantees

On Fri, Jun 21, 2013 at 04:35:47PM -0400, Theodore Ts'o wrote:
> So I've been taking a closer look at the the rename code, and there's
> something I can do which will improve the chances of avoiding data
> loss on a crash after an application tries to replace file contents
> via:
>
> 1) write foo.new
> 2) <omit fsync of foo.new>
> 3) rename foo.new to foo
>
> Those are the kernel patches that I cc'ed you on.
>
> The reason why it's still not a guarantee is because we are not doing
> a file integrity writeback; this is not as important for small files,
> but if foo.new is several megabytes, not all of the data blocks will
> be flushed out before the rename, and this will kill performance, and
> in somoe cases it might not be necessary.
>
> Still, for small files ("most config files are smaller than 100k"),
> this should serve you just fine. Of course, it's not going to be in
> currently deployed kernels, so I don't know how much these proposed
> patches will help you,. I'm doing mainly because it helps protects
> users against (in my mind) unwise application programmers, and it
> doesn't cost us any extra performance from what we are currently
> doing, so why not improve things a little?
>
>
> If you want better guarantees than that, this is the best you can do:
>
> 1) write foo.new using file descriptor fd
> 2) sync_file_range(fd, 0, 0, SYNC_FILE_RANGE_WRITE);
> 3) rename foo.new to foo
>
> This will work on today's kernels, and it should be safe to do for all
> file systems.

No, it's not. SYNC_FILE_RANGE_WRITE does not wait for IO completion,
and not all filesystems sychronise journal flushes with data write
IO completion.

Indeed, we have a current "data corruption after power failure"
problem found on Ceph storage clusters using XFS for the OSD storage
that is specifically triggered by the use of SYNC_FILE_RANGE_WRITE
rather than using fsync() to get data to disk.

http://oss.sgi.com/pipermail/xfs/2013-June/027390.html

The question was raised as to whether sync_file_range() was safe on
ext4 was asked and my response was:

http://oss.sgi.com/pipermail/xfs/2013-June/027452.html

"> Is sync_file_range(2) similarly problematic with ext4?

In data=writeback mode, most definitely. For data=ordered, I have no
idea - the writeack paths in ext4 are ... convoluted, and I hurt my
brain every time I look at them. I wouldn't be surprised if there
are problems, but they'll be different problems because ext4 doesn't
do speculative prealloc..."

.....

> > aside: what's your opinion on fdatasync()? Seems like it wouldn't be
> > good enough for my usecase because I'm changing the size of the file....
>
> fdatasync() is basically sync_file_range() plus a CACHE FLUSH command.
> Like sync_file_range, it doesn't sync the metadata (and by the way,
> this includes things like indirect blocks for ext2/3 or extent tree
> blocks for ext4).

If fdatasync() on ext4 doesn't sync metadata blocks required to
access the data that was just written by the fdatasync() call, then
it is broken.

fdatasync() is supposed to guarantee all the data in the file and
all the metadata *needed to access that data* is on stable storage
by the time the fdatasync() completes. i.e. fdatasync() might just
be a data write and cache flush, but in the case where allocation,
file size changes, etc have occurred, it is effectively the
equivalent of a full fsync().

So, fdatasync() will do what you want, but the performance overhead
will be no different to fsync() in the rename case because all the
metadata pointing to the tmp file needs to comitted as well...

----

But, let me make a very important point here. Nobody should be
trying to optimise a general purpose application for a specific
filesystem's data integrity behaviour. fsync() and fdatasync() are
the gold standards as it is consistently implemented across all
Linux filesystems.

The reason I say this is that we've been down this road before and
we shoul dhave learnt better from it. Ted, you should recognise this
because you were front and centre in the fallout of it:

http://tytso.livejournal.com/61989.html

".... Application writers had gotten lazy, because ext3 by default
has a commit interval of 5 seconds, and and uses a journalling mode
called data=ordered. What does this mean? ....

... Since ext3 became the dominant filesystem for Linux, application
writers and users have started depending on this, and so they become
shocked and angry when their system locks up and they lose data -
even though POSIX never really made any such guarantee. ..."

This discussion of "how can we abuse ext4 data=ordered sematics to
avoid using fsync()" is heading right going down this path again.
It is starting from "fsync on ext4 is too slow", and solutions are
being proposed that assume that either everyone is use ext4
(patently untrue) and that all filesystems behave like ext4 (also
patently untrue).

To all the application developers reading this: just use
fsync()/fdatasync() for operations that require data integrity. Your
responisbility is to your users: using methods that don't guarantee
data integrity and therefore will result in data loss is indicating
you don't place any value on your user's data what-so-ever. There is
no place for being fancy when it comes to data integrity - it needs
to be reliable and rock-solid.

If that means your application is slow, then you need to explain why
it is slow to your users and how they can change a knob to make it
fast by trading off data integrity. The user can make the choice at
that point, and they have no grounds to complain if they lose data
at that point because they made a conscious choice to configure
their system that way.

IOWs, the choice of whether data can be lost on a crash is one that
only the user can make. As such, applications need be
safe-by-default when it comes to data integrity.

Cheers,

Dave.
--
Dave Chinner
[email protected]

2013-06-22 04:47:23

by Theodore Ts'o

[permalink] [raw]
Subject: Re: ext4 file replace guarantees

On Sat, Jun 22, 2013 at 01:29:44PM +1000, Dave Chinner wrote:
> > This will work on today's kernels, and it should be safe to do for all
> > file systems.
>
> No, it's not. SYNC_FILE_RANGE_WRITE does not wait for IO completion,
> and not all filesystems sychronise journal flushes with data write
> IO completion.

Sorry, what I should have said is:

sync_file_range(fd, 0, 0, SYNC_FILE_RANGE_WRITE | SYNC_FILE_RANGE_WAIT_AFTER);

This *does* wait for I/O completion; the result is the equivalent
filemap_fdatawrite() followed by filemap_fdatawait().

> Indeed, we have a current "data corruption after power failure"
> problem found on Ceph storage clusters using XFS for the OSD storage
> that is specifically triggered by the use of SYNC_FILE_RANGE_WRITE
> rather than using fsync() to get data to disk.
>
> http://oss.sgi.com/pipermail/xfs/2013-June/027390.html

This woudn't be a problem in the sequence I suggested.

1) write foo.new
2) sync_file_range(...)
3) rename foo.new to foo

If the system crashes right after foo.new, yes, there's no guarantee
since the metadata blocks are written. (Although if XFS is exposing
stale data as a result of sync_file_range, that's arguably a security
bug, since sync_file_range doesn't require root to execute, and it
_has_ been part of the Linux interface for a long time.)

Unlike in the Ceph case, in this sequence as are calling
sync_file_range on the new file (foo.new). And if we haven't done the
rename yet, we don't care about the contents of foo.new. (Although if
the file system is causing stale data to be revealed, I'd claim that's
a fs bug.) If the rename has completed, the metadata blocks will
definitely be journalled and committed for both ext3 and ext4.

So for ext3 and ext4, this sequence will guarantee that the file named
"foo" will have either the old data or the new data --- and this is
true for either data=ordered, or data=writeback. You're the expert
for xfs, but I didn't think this sequence was depending on anything
file system specific, since filemap_fdatawrite() and
filemap_fdatawait() are part of the core Linux FS/MM layer.

> But, let me make a very important point here. Nobody should be
> trying to optimise a general purpose application for a specific
> filesystem's data integrity behaviour. fsync() and fdatasync() are
> the gold standards as it is consistently implemented across all
> Linux filesystems.

>From a philosophical point of view, I agree with you. As I wrote in
my earlier messages, assuming the applications aren't abusively
calling g_file_set_contents() several times per second, I don't
understand why Ryan is trying so hard to optimize it. The fact that
he's trying to optimize it at least to me seems to indicate a simple
admission that there *are* broken applications out there, some of
which may be calling it with high frequency, perhaps out of the UI
thread.

And having general applications or generic desktop libraries trying to
depend on specific implementation details of file systems is really
ugly. So it's not something I'm all that excited about.

However, regardless of whether wish it or not, abusive applications
written by incompetent application authors *will* exist, and whether
we like it or not, desktop library authors are going to try to coddle
said abusive applications by do these filesystem implemenatation
dependent optimization. GNOME is *already* detecting btrfs and has
made optimization decisions based on it in its libraries. Trying to
prevent this is (in my opinion) the equivalent of claiming that the
command of King Canute could hold back the tide.

Given that, my thinking was to try to suggest the least harmful way of
doing so, and so my eye fell on sync_file_range(2) as a generic system
call that is not file system specific, with some relatively well
defined semantics. And given that we don't care about foo.new until
after the rename operation has completed, it seemed to me that this
should be safe for more than just ext3 and ext4 (where I am quite sure
it is safe).

But if XFS is doing something sufficiently clever that
sync_file_range() isn't going to do the right thing, and if we presume
that abusive applications will always exist, then maybe it's time to
consider implementing a new system call which has very well defined
semantics, for those applications that insist on updating a file
hundreds of times an hour, demand good performance, and aren't picky
about consistency semantics, so long that some version of the file
contents is available after a crash.

This system call would of course have to be optional, and so for file
systems that don't support it, applications will have to fall back
more traditional approachses, whether that involves fsync() or perhaps
sync_file_range(), if that can be made safe and generic for all/most
file systems.

Personally, I think application programmers *shouldn't* need such a
facility, if their applications are competently designed and
implemented. But unfortunately, they outnumber us file system
developers, and apparently many of them seem to want to do things
their way, whether we like it or not.

Regards,

- Ted

2013-06-22 12:56:10

by Theodore Ts'o

[permalink] [raw]
Subject: Re: ext4 file replace guarantees

On Fri, Jun 21, 2013 at 09:49:26PM +0000, Sidorov, Andrei wrote:
> But there is no need to mount entire fs with data journalling mode.
> In fact I find per-file data journalling extremely useful. It would
> be even more useful if it allowed regular users to set journalling
> mode on specific file and there was some way to designate rewrite
> transaction boundaries (even 128k would cover a lot of
> small-but-important-file use cases).

Note that at the moment, the +j flag is only honored in nodelalloc
mode. Since delayed allocation is enabled by defalut the per-file
data journal flag is ignored. This is something that we could fix, in
theory. It would be possible to teach ext4_writepages how to allocate
the block(s) and write the data block(s) in the same journal
transaction --- but that functionality does not exist today.

So if you want to use the +j flag, you have to mount the file system
with the non-standard nodelalloc mount option. And that's actually
sufficient to be bug-for-bug compatible with ext3 in terms of the
commit of the transaction which contains the rename operation first
forcing the file out to disk first.

Although as both I and Dave Chinner have pointed out, it's a bad idea
for generic application to depend on file system implementation,
because we do reserve the right to change those implementation details
if it would help improve the file system's performance or reliability.

> As for now it is a best choice for app running with root privileges
> for rewriting files <= page size.

The best choice for an application rewriting files <= a single 4k
block is to use O_DIRECT to rewrite the contents of the file, using a
4k buffer which is zero padded. This is the most performant, uses the
fewest write cycles for a SSD, etc.

- Ted

2013-06-22 14:02:07

by Sidorov, Andrei

[permalink] [raw]
Subject: RE: ext4 file replace guarantees

> So if you want to use the +j flag, you have to mount the file system
> with the non-standard nodelalloc mount option. And that's actually
> sufficient to be bug-for-bug compatible with ext3 in terms of the
> commit of the transaction which contains the rename operation first
> forcing the file out to disk first.

Thanks for pointing this, I didn't know that. In my appliance I don't benefit
from delalloc, so that's not a problem.

> The best choice for an application rewriting files <= a single 4k
> block is to use O_DIRECT to rewrite the contents of the file, using a
> 4k buffer which is zero padded. This is the most performant, uses the
> fewest write cycles for a SSD, etc.

This doesn't work in power loss scenario.
First of all majority of hdd's still have 512b sectors, so it is possible that
hdd won't have a chance to write all 8 sectors.
This doesn't work even with 4k drives because they are susceptible to spliced
sector writes. Well, 512b are susceptible too, but 4k drives have wider
window.
That's why for rewrite to be completely safe data has to be written twice. And
that's where inode's data journalling is a win.
I guess this isn't an issue for SSD's provided they properly order remapping
(which is a bug otherwise).

Regards,
Andrey.

2013-06-22 14:06:19

by Theodore Ts'o

[permalink] [raw]
Subject: Re: ext4 file replace guarantees

On Sat, Jun 22, 2013 at 01:40:26PM +0000, Sidorov, Andrei wrote:
> Me too, but people have to do that because fs api is too generic and
> at the same time one has to account fs specifics in order to make
> their app take most advantage or at least to avoid inefficiencies.
> For example I have an app that constantly does appending writes to
> about 15 files and I must ensure that no more than 5 seconds will be
> lost in an event of system crash or power loss.

Stop right there. If you are doing lots of appending writes, and you
dont want to lose no more than 5 seconds, why do you have 15 files?
Why send your appending writes into a single file and then later on,
disaggregate out the logs when you need to read them? With 15 files,
no matter what you are doing, you will be forcing the heads to seek
all over the place, and since the 4k blocks won't be full when you
sync them, you're wasting write cycles and disk bandwidth.

See what I mean? There's a reason why I said the fault was
incompetent application design, not just incomplement implementation.

- Ted

2013-06-22 14:30:58

by Theodore Ts'o

[permalink] [raw]
Subject: Re: ext4 file replace guarantees

On Sat, Jun 22, 2013 at 02:01:39PM +0000, Sidorov, Andrei wrote:
>
> This doesn't work in power loss scenario.
> First of all majority of hdd's still have 512b sectors, so it is possible that
> hdd won't have a chance to write all 8 sectors.
> This doesn't work even with 4k drives because they are susceptible to spliced
> sector writes. Well, 512b are susceptible too, but 4k drives have wider
> window.

Torn writes can happen, yes, but they are relatively rare. Most file
systems don't protect against them, so if you're worried about that
sort of thing, you need to go beyond using fsync(). Even if you are
using a file system with metadata journalling, in the case of a torn
write, we'll detect the corrupt metadata, but at that point guarantees
about what files will be accessible are out the window. Fortunately,
this is not a common event.

There are techniques for protecting against torn writes, but they have
engineering tradeoffs, which you may or may not be willing to live
with. After all, if you're worried about these sorts of things,
hopefully you will have engineered your system to deal with other
events which are at the a similar or higher levels of probability ---
such as the hard drive developing bad sectors (which is generally how
most HDD's treat sectors that are incompletely written due to spliced
sector writes) or even dying catastrophically.

For many of the use cases that Ryan and GNOME have been dealing with,
which are desktop apps where the precious data at question are things
like the high score board for games, or the window position of desktop
applications, this is probably beyond what they need to be concerned
with.

(And at the industrial data center scale, you may use very different
techniques --- such as computer-level or rack-level battery backups,
diesel generators, cloud file systems which send the data to multiple
different servers on multiple different racks, etc. And at that
scale, you might not even use a file system journal or send CACHE
FLUSH commands, because you've engineered the entire system against
failure, and you accept the fact that having multiple levels of power
backup fails, or multiple HDD's all dying at the same time before the
cloud file system has a chance to rereplicate the data, is good
enough. Nothing is ever going to be 100% perfect; there's only a level
of data integrity which you are willing to pay for.)

- Ted

2013-06-22 14:41:30

by Sidorov, Andrei

[permalink] [raw]
Subject: RE: ext4 file replace guarantees

> Stop right there. If you are doing lots of appending writes, and you
> dont want to lose no more than 5 seconds, why do you have 15 files?
> Why send your appending writes into a single file and then later on,
> disaggregate out the logs when you need to read them?

There are 15 files because those are different entities. There might be less of
them, some of them can be removed. There is no reason to place those files
close to each other since those are just metadata related to mpeg stream
and in fact it is best to put stream and metadata close to each other.
If I wanted to localize those small files, I would just put them on a different
partition.

Anyway, if I have to implement a file system inside a file, why do I need
top layer file system? : )

> With 15 files,
> no matter what you are doing, you will be forcing the heads to seek
> all over the place, and since the 4k blocks won't be full when you
> sync them, you're wasting write cycles and disk bandwidth.

Yes, seeks are unavoidable, but that's something I have to account in worst
case scenario anyway.
Huh, yes. Another advantage of ext4 is that I can control writeback on a
specific file range. I sync only complete pages and ext4 is happy to commit
whatever was synced. Generic solution would be to cache pages in app and
submit only complete ones. Disadvantage of generic solution is that data
cached by app is invisible to others.

Regards,
Andrey.

2013-06-22 15:04:07

by Sidorov, Andrei

[permalink] [raw]
Subject: RE: ext4 file replace guarantees

> From a philosophical point of view, I agree with you. As I wrote in
> my earlier messages, assuming the applications aren't abusively
> calling g_file_set_contents() several times per second, I don't
> understand why Ryan is trying so hard to optimize it. The fact that
> he's trying to optimize it at least to me seems to indicate a simple
> admission that there *are* broken applications out there, some of
> which may be calling it with high frequency, perhaps out of the UI
> thread.

Well, one application calling fsync is almost nothing to care about.
On the other hand tens and hundreds of apps doing fsync's is a disaster.

> And having general applications or generic desktop libraries trying to
> depend on specific implementation details of file systems is really
> ugly. So it's not something I'm all that excited about.

Me too, but people have to do that because fs api is too generic and at the
same time one has to account fs specifics in order to make their app take most
advantage or at least to avoid inefficiencies.
For example I have an app that constantly does appending writes to about 15
files and I must ensure that no more than 5 seconds will be lost in an event
of system crash or power loss. How would you do that in generic way?
Generic and portable way to do it is to start 15 threads and call fsyncs on
those fds at the same time. That works fine with JFS since it doesn't do
flushes and it works fine with ext4 because all those fsync's are likely to
complete within single transaction.
However that doesn't scale well and it forces app to do bursts. Scalable, but
still bursty solution could be io_submit, but afaik no fs currently supports
async fsync.
What if you want to distribute the load? Single dedicated thread calling
fsync's works fine with JFS, but sucks with ext4. Ok, there is a
sync_file_range, let's try it out. Luckily I have control over commit=N option
to underlying ext4 fs which I leave at default 5s. Otherwise I would like to
have an ioctl to ext4 to force commit (I'm not sure if fsync on a single fd
will commit currently running transaction). Sync thread calls sync_file_range
evenly over 5s interval, ext4 does commits every 5s. Nice! But it doesn't work
with JFS. Therefore I have two implementations for different file systems.

> Personally, I think application programmers *shouldn't* need such a
> facility, if their applications are competently designed and
> implemented. But unfortunately, they outnumber us file system
> developers, and apparently many of them seem to want to do things
> their way, whether we like it or not.

I would argue : )
fsync is not the one to rule them all. It's semantics is clear: write all
those bytes NOW.
The fact fsync can be used as a barrier doesn't mean it's the best way to do
it. There are quite few cases where write-right-now semantics is
absolutely required. More often apps just want atomic file updates and
sort of writeback control which is available only as system-wide knob.

As for atomic updates, I'm thinking of something like io_exec() or
io_submit_atomic() or whatever name is best for it. Probably it shouldn't be
tied to kaio.
This syscall would accept an array of iocb's and guarantee atomicity of the
update. This shouldn't be a big deal for ext4 to support it because it already
supports data journalling, which is however only block/page-wise atomic.
Such a syscall wouldn't be undervalued if majority of file systems support it.

Regards,
Andrey.


2013-06-23 01:58:12

by Dave Chinner

[permalink] [raw]
Subject: Re: ext4 file replace guarantees

On Sat, Jun 22, 2013 at 12:47:18AM -0400, Theodore Ts'o wrote:
> On Sat, Jun 22, 2013 at 01:29:44PM +1000, Dave Chinner wrote:
> > > This will work on today's kernels, and it should be safe to do for all
> > > file systems.
> >
> > No, it's not. SYNC_FILE_RANGE_WRITE does not wait for IO completion,
> > and not all filesystems sychronise journal flushes with data write
> > IO completion.
>
> Sorry, what I should have said is:
>
> sync_file_range(fd, 0, 0, SYNC_FILE_RANGE_WRITE | SYNC_FILE_RANGE_WAIT_AFTER);
>
> This *does* wait for I/O completion; the result is the equivalent
> filemap_fdatawrite() followed by filemap_fdatawait().

Right, and now it is effectively identical to fsync() in terms of
speed. And when performance is the reason for avoiding fsync()....

> > Indeed, we have a current "data corruption after power failure"
> > problem found on Ceph storage clusters using XFS for the OSD storage
> > that is specifically triggered by the use of SYNC_FILE_RANGE_WRITE
> > rather than using fsync() to get data to disk.
> >
> > http://oss.sgi.com/pipermail/xfs/2013-June/027390.html
>
> This woudn't be a problem in the sequence I suggested.
>
> 1) write foo.new
> 2) sync_file_range(...)
> 3) rename foo.new to foo

You miss the point - it's not a data integrity operation, and to
present it as a way of obtaining data integrity across *all
filesystems* because it would work on ex4 is extremely dangerous.
You're relying on side effects of a specific filesystem implementation
to give you the desired result.

> If the system crashes right after foo.new, yes, there's no
> guarantee since the metadata blocks are written.

Sure, but focussing on ext4 behaviour misses the larger issue -
sync_file_range() has no well defined integrity rules and hence
exposes applications to unexpected behaviours on different
filesystems.

> (Although if XFS
> is exposing stale data as a result of sync_file_range, that's
> arguably a security bug, since sync_file_range doesn't require
> root to execute, and it _has_ been part of the Linux interface for
> a long time.)

Yes, it is.

But doesn't this point out one of the problems with
moving from a well known interface to something that almost nobody
uses and (obviously) has never tested in a data integrity test bench
previously? it's taken how many years for this problem to be
exposed? We test fsync crash integrity all the time, but do we test
sync_file_range()? No, we don't, because it doesn't have any data
integrity guarantees that we can validate.

So while the interface may have been around for years, the first
serious storage product I've heard of that has tried to use it to
optimise away fsync() calls to use it has found that:

a) it exposes strange behaviours on different filesystems;
b) doesn't provide data integrity guarantees worth a damn;
and
c) isn't any faster than using fsync/fdatasync() in the
first place.

And so has reverted back to using fsync()....

> So for ext3 and ext4, this sequence will guarantee that the file

Exactly my point. You're designing for ext3/4 data=ordered behaviour
and so it's not a replacement for fsync/fdatasync...

> > But, let me make a very important point here. Nobody should be
> > trying to optimise a general purpose application for a specific
> > filesystem's data integrity behaviour. fsync() and fdatasync()
> > are the gold standards as it is consistently implemented across
> > all Linux filesystems.
>
> From a philosophical point of view, I agree with you. As I wrote
> in my earlier messages, assuming the applications aren't abusively
> calling g_file_set_contents() several times per second, I don't
> understand why Ryan is trying so hard to optimize it. The fact
> that he's trying to optimize it at least to me seems to indicate a
> simple admission that there *are* broken applications out there,
> some of which may be calling it with high frequency, perhaps out
> of the UI thread.

That's not a problem that should be solved by trading off data
integrity and losing user's configurations. Have a backbone, Ted,
and tell people they are doing something stupid when they are doing
something stupid.

> And having general applications or generic desktop libraries
> trying to depend on specific implementation details of file
> systems is really ugly. So it's not something I'm all that
> excited about.

Not excited? That's an understatement - it's the sort of thing that
gives me nightmares.

We've *been here before*. We've been telling application developer's
to use fsync/fdatasync() since the O_PONIES saga so we don't end up
back in that world of hurt. Yet here you are advocating that
application developers go back down the rat hole of relying on
implicit side effects of the design of a specific filesystem
configuration to provide them with data integrity guarantees....

> But if XFS is doing something sufficiently clever that
> sync_file_range() isn't going to do the right thing, and if we
> presume that abusive applications will always exist, then maybe
> it's time to consider implementing a new system call which has
> very well defined semantics,

That's fine - go and define the requirements for an
rename_datasync() syscall, write a bunch of xfstests to ensure the
API is usable and provides the required behaviour. Then do a slow
generic implementation that you wire all the filesystems up to so
it's available to everyone (i.e. no fs probing needed).
Applications can start using it immediately, knowing it will work on
every filesystem. i.e:

generic_rename_datasync()
{
vfs_fsync(src);
->rename()
}

At that point you can implement a fast, optimal implementation in
ext3/4 that nobody outside ext3/4 needs to care about how it works.
The BTRFS guys can optimise it appropriately in their own time, as
can us XFS people. But the most important point is that we start
with on implementation that *works everywhere*...

> Personally, I think application programmers *shouldn't* need such
> a facility, if their applications are competently designed and
> implemented. But unfortunately, they outnumber us file system
> developers, and apparently many of them seem to want to do things
> their way, whether we like it or not.

And you're too afraid to call out someone when they are doing something
stupid. If you don't call them out, they'll keep doing stupid things
and the problems will never get solved.

Cheers,

Dave.
--
Dave Chinner
[email protected]