From: Olaf van der Spek
Subject: Re: Atomic non-durable file write API
Date: Sun, 26 Dec 2010 16:08:12 +0100
Message-ID:
References: <1292710543.17128.14.camel@nayuki> <20101224085126.2a7ff187@notabene.brown> <20101223222206.GD12763@thunk.org> <4D13E98D.8070105@ontolinux.com> <20101224004825.GF12763@thunk.org> <4D13F09D.4010703@ontolinux.com> <20101224095105.GG12763@thunk.org> <20101225031529.GA2595@thunk.org>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Cc: "Ted Ts'o", linux-fsdevel, linux-ext4@vger.kernel.org
To: Nick Piggin
Return-path:
In-Reply-To:
Sender: linux-fsdevel-owner@vger.kernel.org
List-Id: linux-ext4.vger.kernel.org

On Sat, Dec 25, 2010 at 6:25 PM, Nick Piggin wrote:
>> No, not arbitrary writes. It's about complete file writes.
>
> You still haven't defined exactly what you want.

Do you not understand what is meant by a complete file write?

>> Atomic semantics are not (that) complex.
>
> That is something to be argued over patches. What is not in question
> is that an atomic API is more complex than none :)

That's implementation complexity, not concept/semantics complexity.

>> Like I said before, it's not about DB-like functionality but about
>> complete file writes/updates. For example, I've got a file in an
>> editor and I want to save it.
>
> I don't understand your example, because in that case you surely
> want durability.

Hmm, true, bad example, although it depends on editor/user.
Let's take archive extraction instead.

>> Let me copy the original post:
>> Writing a temp file, fsync, rename is often proposed. However, the
>> durable aspect of fsync isn't always required
>
> So you want a way to atomically replace the contents of a file with
> new contents, in a way which completes asynchronously and lazily,
> and your new contents will eventually just appear sometime after
> they are guaranteed to be on disk?

Almost. Visibility to other process should be normal (I don't know the
exact rules), but commit to disk may be deferred.

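For concreteness, the temp file, fsync, rename approach quoted from the
original post can be sketched in userspace roughly as follows. This is
only an illustration under stated assumptions, not an API anyone in the
thread proposed: the function name replace_file and its durable flag are
made up for the example, and it uses Python's os module as a thin
wrapper over the POSIX calls.

```python
import os
import tempfile


def replace_file(path, data, durable=False):
    """Replace the contents of `path` with `data` (bytes) via temp file + rename.

    Hypothetical helper for illustration only. `durable` controls whether
    fsync() is called before the rename.
    """
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file lives in the same directory so rename() never crosses
    # a filesystem boundary.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        try:
            try:
                # Best-effort copy of mode and ownership from the old file.
                old = os.stat(path)
                os.fchmod(fd, old.st_mode & 0o7777)
                try:
                    os.fchown(fd, old.st_uid, old.st_gid)
                except PermissionError:
                    pass  # not the owner: ownership cannot be recreated
            except FileNotFoundError:
                pass  # no old file, nothing to preserve
            offset = 0
            while offset < len(data):
                offset += os.write(fd, data[offset:])
            if durable:
                os.fsync(fd)  # the durable step that is not always wanted
        finally:
            os.close(fd)
        # Readers see either the old or the new contents, never a mix.
        os.rename(tmp_path, path)
    except BaseException:
        os.unlink(tmp_path)
        raise
```

With durable=False the fsync is skipped. Whether the new data is then
guaranteed to reach disk before the rename does depends on the
filesystem, which is the gap being discussed here. Note also that the
file ends up with a new inode, so anything not copied explicitly (here
only mode and, where permitted, ownership are copied) is lost.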
> You would need to create an unlinked inode with dirty data, and then
> have callbacks from pagecache writeback checking when the inode
> is cleaned, and then call appropriate filesystem routines to sync and
> issue barriers etc, and rename the old name to the new inode.

That's an implementation detail, but yes, something like that.

> You will also need to have a chain of inodes representing ordering of
> the updates so the renames can be performed in the right order. And
> add some hooks to solve the metadata issue.
>
> Then what happens when you fsync the original file? What if the
> original file is renamed or unlinked? How do you sync the outstanding
> queue of updates?

Logically those actions would happen after the atomic data update.
The fsync would be done on a now unlinked file (if done via fd). The
rename would be done on the new file. Same for unlink.

> Once you solve all those problems, then people will ask you to now
> solve them for multiple files at once because they also have some
> great use-case that is surely nothing like databases.

I don't want to play the what if game.

> Please tell us what for. If you have immediate need to replace the
> name, then you need the durability of fsync. If you don't have
> immediate need, then you can use another name, surely (until it
> comes time you want to switch names, at that point you want
> durability so you fsync then rename).

Temp file, rename has issues with losing meta-data.

>
>> and this way has other
>> issues, like losing file meta-data.
>
> Yes that's true, if you're not owner you may not be able to recreate
> most of it. Did you need to?

Yes

>
>> What is the recommended way for atomic non-durable (complete) file writes?
>
> There really isn't one. Like I said, there is not much atomicity
> semantics in the API, which works really well because it is simple
> to implement and to use (although apparently still far too complex
> for some programmers to get right).

It's simple to implement but it's not simple to use right.

> If we start adding atomicity beyond fundamental requirement of
> namespace operations, then where does it end? Why would it make
> sense to add atomicity for writes to one file, but not writes to 2 files?
> What if you require atomic multiple modifications to directory
> structure as well as file updates? And why only writes? What about
> atomic reads of several things? What isolation level should all of that
> have, and how to solve deadlocks?
>
>
>> I'm also wondering why FSs commit after open/truncate but before
>> write/close. AFAIK this isn't necessary and thus suboptimal.
>
> I don't know, can you expand on this? What fses are you talking
> about, and what behaviour.

The zero size issues of ext4 (before some patch). Presumably because
some apps do open, truncate, write, close on a file. I'm wondering why
an FS commits between truncate and write.

Olaf
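
For reference, the open, truncate, write, close sequence mentioned above
looks roughly like this from the application side. It is a minimal
sketch, with overwrite_in_place as a made-up name for the example. It
makes no claim about what any particular filesystem commits or when.

```python
import os


def overwrite_in_place(path, data):
    """Rewrite `path` in place: open, truncate, write, close.

    Hypothetical helper for illustration only. `data` is bytes.
    """
    # O_TRUNC drops the old contents as part of the open.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        # Between the truncate and the last write the file is empty or
        # partial. If that state is what gets committed, a crash leaves
        # a zero-size or short file behind.
        offset = 0
        while offset < len(data):
            offset += os.write(fd, data[offset:])
    finally:
        os.close(fd)
```

Unlike the temp file and rename approach, this keeps the inode and so
keeps the existing meta-data, but nothing in the sequence makes the
update atomic.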