From: Ryan Lortie <desrt@desrt.ca>
Subject: Re: ext4 file replace guarantees
Date: Fri, 21 Jun 2013 09:51:47 -0400
Message-ID: <1371822707.3188.140661246795017.2D10645B@webmail.messagingengine.com>
References: <1371764058.18527.140661246414097.671B4999@webmail.messagingengine.com>
 <20130621005937.GB10730@thunk.org>
 <1371818596.20553.140661246775057.0F7160F3@webmail.messagingengine.com>
 <20130621131521.GE10730@thunk.org>
Mime-Version: 1.0
Content-Type: text/plain
Content-Transfer-Encoding: 7bit
Cc: linux-ext4@vger.kernel.org
To: "Theodore Ts'o" <tytso@mit.edu>
In-Reply-To: <20130621131521.GE10730@thunk.org>
Sender: linux-ext4-owner@vger.kernel.org

hi,

On Fri, Jun 21, 2013, at 9:15, Theodore Ts'o wrote:
> I agree it can be read that way, although we were very careful to
> avoid the word "guarantee".

It would still be great if you could update the docs to say very clearly
that there is no guarantee.  Using words other than 'guarantee' can
(clearly) still mislead people into believing that there is.

> #2) If an existing file is removed via a rename, if there are any
>     delayed allocation blocks in the new file, they will be flushed
>     out.

Can you describe to me exactly what this means in terms of the syscall
API?  Do I void these "delayed allocation blocks" guarantees by using
fallocate() beforehand?
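
To make the question concrete, the call sequence being asked about looks
roughly like the sketch below.  The helper name is hypothetical, and
posix_fallocate() is used as a portable stand-in for fallocate().
Whether the flush-on-rename behaviour still covers blocks that were
preallocated this way is exactly the open question, so this illustrates
the pattern and is not a recommendation.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical helper: preallocate, write, rename, and deliberately
 * no fsync().  Returns 0 on success, -1 on failure. */
static int replace_prealloc_nofsync(const char *path,
                                    const void *buf, size_t len)
{
    char tmp[4096];
    const char *p = buf;
    size_t left = len;
    int fd;

    int n = snprintf(tmp, sizeof tmp, "%s.XXXXXX", path);
    if (n < 0 || (size_t)n >= sizeof tmp)
        return -1;

    fd = mkstemp(tmp);                  /* temp file beside the target */
    if (fd < 0)
        return -1;

    /* Reserve the blocks up front.  posix_fallocate() returns an error
     * number rather than setting errno, and rejects a zero length. */
    if (len > 0 && posix_fallocate(fd, 0, (off_t)len) != 0)
        goto fail;

    while (left > 0) {
        ssize_t w = write(fd, p, left);
        if (w < 0) {
            if (errno == EINTR)
                continue;
            goto fail;
        }
        p += w;
        left -= (size_t)w;
    }

    if (close(fd) < 0) {
        unlink(tmp);
        return -1;
    }
    if (rename(tmp, path) < 0) {        /* note: no fsync() before this */
        unlink(tmp);
        return -1;
    }
    return 0;

fail:
    close(fd);
    unlink(tmp);
    return -1;
}
```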

> > It would be great if the docs would just say "If you want safety with
> > ext4, it's going to be slow.  Please always call fsync()." instead of
> > making it sound like I'll probably be mostly OKish if I don't.
> 
> This is going to be true for all file systems.  If application writers
> are trying to surf the boundaries of what is safe or not, inevitably
> they will eventually run into problems.

btrfs explicitly states in its documentation that replace-by-rename is
safe without fsync().  Although I am starting to wonder about that
guarantee, considering that my reading of the ext4 documentation led me
to believe the same was true for ext4...

As an application developer, what I want is extremely simple: I want to
replace a file in such a way that if, at ANY time, the system crashes or
loses power, then when it comes back on (after replaying the journal), I
will have, at the filename of the file, either the old contents or the
new contents.

This is my primary concern.  I also have a secondary concern:
performance.

I have a legitimate desire to optimise for my second concern without
sacrificing my first concern and I don't think that it's unreasonable
for me to ask where the line is.  I don't really accept this "surf the
boundaries" business about fuzzy APIs that "should do the right thing
usually, but no guarantees if you get too close to the line".  I just
want to know what I have to do in order to be safe.

If there are no guarantees then please say that, explicitly, in your
documentation.  I'm fine calling fsync() constantly if that's what your
filesystem requires me to do.
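
For reference, the conservative sequence implied here (write a temporary
file, fsync() it, then rename() it over the target) can be sketched as
follows.  The helper name and the ".XXXXXX" temp-file naming are
assumptions made for illustration; this is not dconf's actual code.
Note that mkstemp() creates the file with mode 0600, and that making the
rename itself durable would additionally need an fsync() on the
containing directory, which the old-or-new property alone does not
require.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical helper: replace `path` with `len` bytes from `buf`.
 * Returns 0 on success, -1 on failure with errno set. */
static int replace_file(const char *path, const void *buf, size_t len)
{
    char tmp[4096];
    const char *p = buf;
    size_t left = len;
    int fd, saved;

    /* The temp file lives in the same directory as the target, so the
     * final rename() stays within one filesystem. */
    int n = snprintf(tmp, sizeof tmp, "%s.XXXXXX", path);
    if (n < 0 || (size_t)n >= sizeof tmp) {
        errno = ENAMETOOLONG;
        return -1;
    }

    fd = mkstemp(tmp);
    if (fd < 0)
        return -1;

    /* write() may be short or interrupted, so loop until done. */
    while (left > 0) {
        ssize_t w = write(fd, p, left);
        if (w < 0) {
            if (errno == EINTR)
                continue;
            goto fail;
        }
        p += w;
        left -= (size_t)w;
    }

    /* Force the new contents to disk before the name is switched. */
    if (fsync(fd) < 0)
        goto fail;
    if (close(fd) < 0) {
        fd = -1;
        goto fail;
    }
    fd = -1;

    /* Atomically point `path` at the new file. */
    if (rename(tmp, path) < 0)
        goto fail;
    return 0;

fail:
    saved = errno;
    if (fd >= 0)
        close(fd);
    unlink(tmp);
    errno = saved;
    return -1;
}
```

The cost is the one under discussion: the fsync() has to complete before
rename() is even issued, so the caller waits on the disk.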

> Finally, I strongly encourage you to think very carefully about your
> strategy for storing these sorts of registry data.  Even if it is
> "safe" for btrfs, if the desktop applications are constantly writing
> back files for no good reason, it's going to burn battery, SSD write
> cycles, and etc.

I agree.  A stated design goal of dconf has always been that reads are
insanely fast at the cost of extremely expensive writes.  This is why it
is only for storing settings (which are rarely modified, but are read by
the thousands on login).  I consider fsync() to be an acceptable level
of "expensive", even on spinning disks, and the system is designed to
hide the latency from applications.

I've gone on several campaigns of bug-filing against other people's
applications in order to find cases of writes-with-no-good-reason and
I've created tools for tracking these cases down and figuring out who is
at fault.  The result is that on a properly-functioning system, the
dconf database is never written to except in response to explicit user
actions.  This means no writes on login, no writes on merely starting
apps, etc.

Meanwhile, the file size is pretty small.  100k is really on the high
side, for someone who has aggressively customised a large number of
applications.

> modifies an element or two (and said application is doing this several
> times a second), the user is going to have a bad time --- in shortened
> battery and SSD life if nothing else.

I agree.  I would consider this an application bug.

But it brings me to another interesting point: I don't want to sync the
filesystem.  I really don't care when the data makes it to disk.  I
don't want to waste battery and SSD life.  But I am forced to.  I only
want one thing, which is what I said before: I want to know that either
the old data will be there, or the new data.  The fact that I have to
force the hard drive to wake up in order to get this is pretty
unreasonable.

Why can't we get what we really need?

Honestly, what would be absolutely awesome is g_file_set_contents() in
the kernel, with the "old or new" guarantee.  I have all of the data in
a single buffer and I know the size of it in advance.  I just want you
to get it onto the disk, at some point that is convenient to you, in
such a way that I don't end up destroying the old data while still not
having the new.

If not that, then I'd accept some sort of barrier operation where I
could tell the kernel "do not commit this next operation to disk until
you commit the previous one".  It's a bit sad that the only way that I
can get this to happen is to have the kernel do the first operation
*right now* and wait until it's done before telling the kernel to do the
second one.

> The fact that you are trying to optimize out the fsync() makes me
> wonder if there is something fundamentally flawed in the design of
> either the application or its underlying libraries....

I took out the fsync() because I thought it was no longer needed and
therefore pointless.  As I mentioned above, with dconf, fsync() is not a
substantial problem (or a visible problem at all, really).

Thanks