2002-07-09 13:46:48

by Trond Myklebust

Subject: [PATCH] 2.4.19-rc1/2.5.25 provide dummy fsync() routine for directories on NFS mounts

Hi,

There was a bug reported on the 'exim' user list a couple of months ago:
the Linux NFS client reports -EINVAL if you try to fsync() a directory.

The correct response would be to return a dummy '0' for success, since all
NFS operations that change the directory are supposed to be performed
synchronously on the server anyway...

Cheers,
Trond


Attachments:
linux-2.4.19-fsync_dir.dif (1.05 kB)

2002-07-09 14:02:27

by Richard B. Johnson

Subject: Re: [PATCH] 2.4.19-rc1/2.5.25 provide dummy fsync() routine for directories on NFS mounts

On Tue, 9 Jul 2002, Trond Myklebust wrote:

> Hi,
>
> There was a bug reported on the 'exim' user list a couple of months ago:
> the Linux NFS client reports -EINVAL if you try to fsync() a directory.
>
> The correct response would be to return a dummy '0' for success, since all
> NFS operations that change the directory are supposed to be performed
> synchronously on the server anyway...
>
> Cheers,
> Trond
>
>

Isn't it supposed to return EINVAL if "fd is bound to a file which
doesn't support synchronization..."? That's what POSIX.4 says.

Errors:
  EBADF   fildes is not a valid file descriptor.
  EINVAL  The file descriptor is valid, but the system doesn't support
          fsync on this particular file.

I think code that opens a directory as a file is broken. We have
opendir() for that and it returns a DIR pointer, not a file descriptor.
If the directory was properly opened, one would never attempt to
fsync() it.

Cheers,
Dick Johnson

Penguin : Linux version 2.4.18 on an i686 machine (797.90 BogoMips).

Windows-2000/Professional isn't.

2002-07-09 14:06:21

by Trond Myklebust

Subject: Re: [PATCH] 2.4.19-rc1/2.5.25 provide dummy fsync() routine for directories on NFS mounts

>>>>> " " == Richard B Johnson <[email protected]> writes:

> I think code that opens a directory as a file is broken. We
> have opendir() for that and it returns a DIR pointer, not a
> file descriptor. If the directory was properly opened, one
> would never attempt to fsync() it.

fsync() is supported on directories on local filesystems as a way of
ensuring that changes (due to file creation etc) are committed to
disk. Where is the POSIX violation in that?

There is no reason why NFS, which ensures this anyway, should
not adhere to this convention.

Cheers,
Trond

2002-07-09 15:01:40

by Richard B. Johnson

Subject: Re: [PATCH] 2.4.19-rc1/2.5.25 provide dummy fsync() routine for directories on NFS mounts

On Tue, 9 Jul 2002, Trond Myklebust wrote:

> >>>>> " " == Richard B Johnson <[email protected]> writes:
>
> > I think code that opens a directory as a file is broken. We
> > have opendir() for that and it returns a DIR pointer, not a
> > file descriptor. If the directory was properly opened, one
> > would never attempt to fsync() it.
>
> fsync() is supported on directories on local filesystems as a way of
> ensuring that changes (due to file creation etc) are committed to
> disk. Where is the POSIX violation in that?
>
> There is no reason why NFS, which ensures this anyway, should
> not adhere to this convention.
>
> Cheers,
> Trond
> -

Well, no. It's not supported. You can't get a valid file-descriptor...

#include <stdio.h>
#include <unistd.h>
#include <fcntl.h>

int main(void)
{
    int fd;

    /* Opening a directory for writing fails with EISDIR, so fd is -1. */
    fd = open("/", O_RDWR, 0);
    fsync(fd);
    return 0;
}


execve("./xxx", ["xxx"], [/* 32 vars */]) = 0
brk(0) = 0x804966c
open("/etc/ld.so.preload", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/lib/libc.so.6", O_RDONLY) = 3
old_mmap(NULL, 4096, PROT_READ, MAP_PRIVATE, 3, 0) = 0x4000c000
munmap(0x4000c000, 4096) = 0
old_mmap(NULL, 644232, PROT_READ|PROT_EXEC, MAP_PRIVATE, 3, 0) = 0x4000c000
mprotect(0x40097000, 74888, PROT_NONE) = 0
old_mmap(0x40097000, 24576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED, 3, 0x8b000) = 0x40097000
old_mmap(0x4009d000, 50312, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x4009d000
close(3) = 0
mprotect(0x4000c000, 569344, PROT_READ|PROT_WRITE) = 0
mprotect(0x4000c000, 569344, PROT_READ|PROT_EXEC) = 0
personality(PER_LINUX) = 0
getpid() = 27544
open("/", O_RDWR) = -1 EISDIR (Is a directory)


There are ways to 'cheat' and obtain a file-descriptor that references
a directory, but cheating is against POSIX rules, also.

You can open it read-only. But, Read-Only means that you can't
update it, so fsync means nothing, will return 0 because it is
already "whatever it was" since you can't modify it...

getpid() = 27568
open("/", O_RDONLY) = 3
fsync(3) = 0
_exit(0) = ?


My reading is that you need to fsync() every file within a directory
to fsync() a directory. Playing tricks with a directory inode doesn't
do it.

Regardless, POSIX.4 declines to state exactly what "successfully
transferred" means when it states that fsync() doesn't return until
all data has been successfully transferred to the disk or underlying
hardware. This is a real problem for a network file-system where
data that will eventually get to a file-server in the Congo may be
en-route for several minutes.

If an application insists, it is up to the application to determine,
probably once upon startup, just what kind of file synchronization
is supported.
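
A startup probe of the kind suggested here could look roughly like the following. This is only a sketch: the helper name dir_fsync_supported() is hypothetical and appears nowhere in this thread, and it assumes that a filesystem which cannot sync a directory reports EINVAL, as the NFS client was said to do.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper (not from the thread): probe whether fsync() is
 * accepted on the directory at 'path'.
 *
 * Returns  1 if fsync() succeeded,
 *          0 if the filesystem answered EINVAL (no sync support),
 *         -1 on any other error, with errno left set.
 */
int dir_fsync_supported(const char *path)
{
    int fd = open(path, O_RDONLY);   /* directories only open read-only */
    if (fd < 0)
        return -1;

    int rc = fsync(fd);
    int saved = errno;               /* close() may overwrite errno */
    close(fd);

    if (rc == 0)
        return 1;
    if (saved == EINVAL)
        return 0;

    errno = saved;
    return -1;
}
```

An application could call dir_fsync_supported() once on its spool directory and pick a fallback (for example, synchronous mounts) when the answer is 0.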

Cheers,
Dick Johnson

Penguin : Linux version 2.4.18 on an i686 machine (797.90 BogoMips).

Windows-2000/Professional isn't.

2002-07-09 16:30:48

by Alan

Subject: Re: [PATCH] 2.4.19-rc1/2.5.25 provide dummy fsync() routine for directories on NFS mounts

> > not adhere to this convention.
>
> Well, no. It's not supported. You can't get a valid file-descriptor...

Wrong (as usual)

> If an application insists, it is up to the application to determine,
> probably once upon startup, just what kind of file synchronization
> is supported.

Linux defines fsync for directories

2002-07-09 17:17:27

by Richard B. Johnson

Subject: Re: [PATCH] 2.4.19-rc1/2.5.25 provide dummy fsync() routine for directories on NFS mounts

On Tue, 9 Jul 2002, Alan Cox wrote:

> > > not adhere to this convention.
> >
> > Well, no. It's not supported. You can't get a valid file-descriptor...
>
> Wrong (as usual)

Really? Then what is the meaning of fsync() on a read-only file-
descriptor? You can't update the information you can't change.

This is (as usual) just an example of your helpful responses.

Cheers,
Dick Johnson

Penguin : Linux version 2.4.18 on an i686 machine (797.90 BogoMips).

Windows-2000/Professional isn't.

2002-07-09 18:46:11

by Alan

Subject: Re: [PATCH] 2.4.19-rc1/2.5.25 provide dummy fsync() routine for directories on NFS mounts

> Really? Then what is the meaning of fsync() on a read-only file-
> descriptor? You can't update the information you can't change.

fsync ensures the data for that inode/file content is on stable storage - note
_the_ _data_ not only random things written by this specific file handle.

2002-07-09 18:56:21

by Bill Rugolsky Jr.

Subject: Re: [NFS] Re: [PATCH] 2.4.19-rc1/2.5.25 provide dummy fsync() routine for directories on NFS mounts

On Tue, Jul 09, 2002 at 01:22:29PM -0400, Richard B. Johnson wrote:
> Really? Then what is the meaning of fsync() on a read-only file-
> descriptor? You can't update the information you can't change.

Eh? I do an fchmod() on a readonly descriptor, then I call fsync()
on that descriptor. The inode gets sync'd to disk (with updated mode
and c_time). So no, I don't need a writable descriptor to call fsync().
The only question is *what* gets sync'd when I call fsync() on an O_RDONLY
file-descriptor.
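
The fchmod() case can be sketched in a few lines of C. The function name chmod_and_sync() is made up for illustration; the only point is that neither fchmod() nor fsync() needs a descriptor opened for writing.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Hypothetical helper: change the mode of 'path' through a read-only
 * descriptor, then ask the kernel to commit the inode with fsync().
 * Returns 0 on success, -1 on error with errno set.
 */
int chmod_and_sync(const char *path, mode_t mode)
{
    int fd = open(path, O_RDONLY);   /* no write access requested */
    if (fd < 0)
        return -1;

    int rc = fchmod(fd, mode);       /* metadata update via the fd */
    if (rc == 0)
        rc = fsync(fd);              /* flush the updated inode */

    int saved = errno;               /* preserve errno across close() */
    close(fd);
    errno = saved;
    return rc;
}
```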

SUSv2 (http://www.opengroup.org/onlinepubs/007908799/xsh/fsync.html)
says "The fsync() function forces all currently queued I/O
operations associated with the file indicated by file descriptor fildes
to the synchronised I/O completion state."

It appears from this wording that the file-descriptor is *merely* a
handle referring to the inode, and that *all* outstanding I/O on the
inode [within the "system"] is performed. In other words, if I had
several different file handles referring to the same inode (but
different kernel "struct file" objects), all inode data and meta-data
updates prior to the fsync() call would be synchronized. It doesn't
say that explicitly, but given the usual visibility rules regarding
writes, etc., that is the "natural" interpretation. [Caveat: mmap()]
To state it succinctly: if other (data or meta-data) writes are visible to
the process doing the fsync(), they need to be sync'd too.

In the case of directories, there is no file handle "doing the writing" --
the kernel does that, so absent the ability to call fsync() on a readonly
handle to a directory, i.e. fsync(dirfd(dir)), there is no convenient way to sync
the directory contents. Calling fsync() on every file in a directory
does not necessitate syncing the directory!
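
For reference, the fsync(dirfd(dir)) idiom looks roughly like this when written out. It is a minimal sketch; sync_directory() is a hypothetical name, and what actually reaches the disk depends on the filesystem.

```c
#define _POSIX_C_SOURCE 200809L
#include <dirent.h>
#include <errno.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: open a directory the "proper" way with opendir(),
 * then sync it through the descriptor that dirfd() exposes.
 * Returns 0 on success, -1 on error with errno set.
 */
int sync_directory(const char *path)
{
    DIR *dir = opendir(path);
    if (dir == NULL)
        return -1;

    /* dirfd() returns the read-only descriptor behind the DIR stream. */
    int rc = fsync(dirfd(dir));

    int saved = errno;               /* closedir() may clobber errno */
    closedir(dir);
    errno = saved;
    return rc;
}
```

On a filesystem without directory sync support, this would be expected to fail with EINVAL rather than return 0.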

Regards,

Bill Rugolsky

2002-07-09 19:10:43

by Richard B. Johnson

Subject: Re: [PATCH] 2.4.19-rc1/2.5.25 provide dummy fsync() routine for directories on NFS mounts

On Tue, 9 Jul 2002, Alan Cox wrote:

> > Really? Then what is the meaning of fsync() on a read-only file-
> > descriptor? You can't update the information you can't change.
>
> fsync ensures the data for that inode/file content is on stable storage - note
> _the_ _data_ not only random things written by this specific file handle.
>

That is what it's supposed to do with files. The attached code clearly
shows that it doesn't work with directories. The fsync() instantly
returns, even though there is buffered data still to be written.


#include <stdio.h>
#include <unistd.h>
#include <fcntl.h>
#define NR_WRITES 0x1000
int main(void)
{
    char foo[0x10000] = {0};    /* 64 kB of data for each write */

    int dirfd, outfd;
    int flags, i;

    outfd = open("/foo", O_WRONLY|O_TRUNC|O_CREAT, 0644);

    dirfd = open("/", O_RDONLY, 0);
    /* Try to turn the descriptor read-write. O_RDONLY is 0, so the mask
       changes nothing, and on Linux F_SETFL ignores access-mode bits. */
    flags = fcntl(dirfd, F_GETFL);
    flags &= ~O_RDONLY;
    flags |= O_RDWR;
    fcntl(dirfd, F_SETFL, flags);
    fprintf(stderr, "Write %zu bytes\n", sizeof(foo) * NR_WRITES);
    for (i = 0; i < NR_WRITES; i++)
        write(outfd, foo, sizeof(foo));
    fprintf(stderr, "Write complete\n");
    fprintf(stderr, "Sync the directory\n");
    fsync(dirfd);
    fprintf(stderr, "Done, returns immediately!\n");
    close(outfd);
    fprintf(stderr, "Now execute sync and see if your disk is active!\n");
    // unlink("/foo");
    return 0;
}


Again, to assure that file-data is written to storage, one must
execute fsync on files, not directories. The dummy return of 0
that Linux provides is a database bug waiting to happen.

Cheers,
Dick Johnson

Penguin : Linux version 2.4.18 on an i686 machine (797.90 BogoMips).

Windows-2000/Professional isn't.

2002-07-09 19:34:22

by Alan

Subject: Re: [PATCH] 2.4.19-rc1/2.5.25 provide dummy fsync() routine for directories on NFS mounts

> That is what it's supposed to do with files. The attached code clearly
> shows that it doesn't work with directories. The fsync() instantly
> returns, even though there is buffered data still to be written.

Your understanding or code is wrong. It's hard to tell which.

fsync on the directory syncs the directory metadata not the file metadata
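
Put as code, the distinction is that a newly created file needs two syncs: one on the file for its data, one on the containing directory for the new entry. The sketch below assumes exactly that split. The helper create_durably() and its parameters are hypothetical, and some filesystems may commit more than this on the first fsync() alone.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: create 'name' inside 'dirpath', write 'len' bytes
 * from 'buf', then sync the file and the directory separately.
 * Returns 0 on success, -1 on error with errno set.
 */
int create_durably(const char *dirpath, const char *name,
                   const void *buf, size_t len)
{
    int rc = -1;
    int saved;
    const char *p = buf;

    int dfd = open(dirpath, O_RDONLY);      /* directory descriptor */
    if (dfd < 0)
        return -1;

    int fd = openat(dfd, name, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) {
        saved = errno;
        close(dfd);
        errno = saved;
        return -1;
    }

    while (len > 0) {                       /* handle short writes */
        ssize_t n = write(fd, p, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;
            goto out;
        }
        p += n;
        len -= (size_t)n;
    }

    /* First the file's data and inode, then the directory entry. */
    if (fsync(fd) == 0 && fsync(dfd) == 0)
        rc = 0;

out:
    saved = errno;
    close(fd);
    close(dfd);
    errno = saved;
    return rc;
}
```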

2002-07-09 19:48:19

by Richard B. Johnson

Subject: Re: [PATCH] 2.4.19-rc1/2.5.25 provide dummy fsync() routine for directories on NFS mounts

On Tue, 9 Jul 2002, Alan Cox wrote:

> > That is what it's supposed to do with files. The attached code clearly
> > shows that it doesn't work with directories. The fsync() instantly
> > returns, even though there is buffered data still to be written.
>
> Your understanding or code is wrong. It's hard to tell which.
>
> fsync on the directory syncs the directory metadata not the file metadata
>

Well the original complaint was that Linux NFS didn't allow a directory to
be fsync()ed. I showed that POSIX.4 doesn't provide for fsync()ing
directories, only files, that you have to fsync() individual files, not
the directories that contain them. Others said that fsync()ing individual
files was not necessary, that you only have to fsync() the directory. I
explained that you have to cheat to even get a fd that can be used
to fsync() a directory. Then I showed that fsync()ing a directory in this
manner doesn't work, so we are actually in violent agreement.


Cheers,
Dick Johnson

Penguin : Linux version 2.4.18 on an i686 machine (797.90 BogoMips).

Windows-2000/Professional isn't.

2002-07-10 06:31:18

by Alex Riesen

Subject: Re: [PATCH] 2.4.19-rc1/2.5.25 provide dummy fsync() routine for directories on NFS mounts

On Tue, Jul 09, 2002 at 10:06:45AM -0400, Richard B. Johnson wrote:
> I think code that opens a directory as a file is broken. We have
> opendir() for that and it returns a DIR pointer, not a file descriptor.
> If the directory was properly opened, one would never attempt to
> fsync() it.

It's the libc which defines it. There is no syscall "opendir". How do you
think you can return what SUS defines as "DIR*" from the kernel?

offtopic: on aix you can do this: "cat ."

2002-07-10 11:16:21

by Richard B. Johnson

Subject: Re: [PATCH] 2.4.19-rc1/2.5.25 provide dummy fsync() routine for directories on NFS mounts

On Wed, 10 Jul 2002, Alex Riesen wrote:

> On Tue, Jul 09, 2002 at 10:06:45AM -0400, Richard B. Johnson wrote:
> > I think code that opens a directory as a file is broken. We have
> > opendir() for that and it returns a DIR pointer, not a file descriptor.
> > If the directory was properly opened, one would never attempt to
> > fsync() it.
>
> It's the libc which defines it. There is no syscall "opendir". How do you
> think you can return what SUS defines as "DIR*" from the kernel?
>
> offtopic: on aix you can do this: "cat ."
>

Any attempt to open a directory as a file and read it on Linux up to
version 2.4.18 (at least), or on Sun (up to) SunOS 5.5.1, returns
-1 with errno set to EISDIR (21). As mentioned several times, there
are ways to 'cheat', but I was (and have been) talking about POSIX
conformance.

Script started on Wed Jul 10 07:15:46 2002
# od .
od: .: Is a directory
0000000
# cat .
cat: .: Is a directory
# exit
exit
Script done on Wed Jul 10 07:15:58 2002


Cheers,
Dick Johnson

Penguin : Linux version 2.4.18 on an i686 machine (797.90 BogoMips).

Windows-2000/Professional isn't.

2002-07-15 12:40:54

by Richard B. Johnson

Subject: Re: [PATCH] 2.4.19-rc1/2.5.25 provide dummy fsync() routine for directories on NFS mounts

On Mon, 15 Jul 2002, Sean Hunter wrote:

> On Tue, Jul 09, 2002 at 03:50:17PM -0400, Richard B. Johnson wrote:
> > On Tue, 9 Jul 2002, Alan Cox wrote:
> >
> > > > That is what it's supposed to do with files. The attached code clearly
> > > > shows that it doesn't work with directories. The fsync() instantly
> > > > returns, even though there is buffered data still to be written.
> > >
> > > Your understanding or code is wrong. It's hard to tell which.
> > >
> > > fsync on the directory syncs the directory metadata not the file metadata
> > >
> >
> > Well the original complaint was that Linux NFS didn't allow a directory to
> > be fsync()ed. I showed that POSIX.4 doesn't provide for fsync()ing
> > directories, only files, that you have to fsync() individual files, not
> > the directories that contain them. Others said that fsync()ing individual
> > files was not necessary, that you only have to fsync() the directory. I
> > explained that you have to cheat to even get a fd that can be used
> > to fsync() a directory. Then I showed that fsync()ing a directory in this
> > manner doesn't work, so we are actually in violent agreement.
>
> I'm not sure whether or not you've got the gist with all the flamage and
> shrapnel flying about, however as I understand it, fsync on a directory fd
> ensures that all directory ops such as rename()s, unlink()s, link()s etc. are
> committed, not that all data pending to all files in that dir are flushed.
>
> To get all changes you need to fsync the dirfd and all the fds of the files as
> well.
>
> Because directory changes (such as renames, unlinks etc) are synchronous on NFS
> any way, fsync() on a dir fd on an NFS mount can simply return. There will
> never be any outstanding dir ops to flush. ergo: no bug.
>
> Hope that's clear.
>
> Sean
>


NFS has characteristics that seem to make it 'special'.
For instance, you have a server that performs local actions
on behalf of a remote client. As long as the local server
doesn't crash, everything it did for the remote client is
safe even if the remote client crashes and burns. From
the perspective of the remote client, it really doesn't make
much difference if it ever calls fsync() on anything as long
as the server doesn't crash. Therefore, for discussion I
will ignore NFS and other Client Server file access systems.
But just because they are special, it doesn't mean that they
should be treated specially.

Given the following:

/1/2/3/4/5/6/7/8/9/file

... I suggest that it MUST be sufficient to fsync() 'file' to
assure that file data can be recovered. That's what POSIX.4 states.
If the implementation doesn't allow this, i.e., 'file' will end up
in 'lost+found', then there is a problem that should be addressed.
This is because a local file user's program may not know the entire
directory tree. For example, in a chrooted environment. Also,
the task has no way of knowing what, if any, of these directory
entries have already been flushed to disk. A directory tree could,
in principle, be up to _POSIX_PATH_MAX entries in length.

In the beginning, when God created Unix, files and directories
were all the same. I could fix a bad directory entry with an
editor. Over the years, certain rules were established to prevent
users from accessing directories as files. They still are files,
but the Operating System(s) try their best to make sure you don't
muck with directories as files.

So now you have to read a directory with getdents(), actually that's
not even POSIX, you need to use readdir(). Also, the directory will
fail to be opened in other than read-only. These are all artificial
constraints, imposed to make sure you follow the rules.

So, you get a read-only file-descriptor and fsync() it! What does
that mean? Obviously, the file must have existed previously to open
it read-only. Since I can't change its contents, because I opened
it read-only, fsync() can't do anything because I could not have
altered its contents.

So, let's say two tasks open the same file. One opens it read-only
and the other read-write. The read-write task is happily writing
to the file. The read-only task executes fsync(). Does this cause
the writer to wait until the file has been flushed to disk? I don't
know, but if it does, we have a very broken system where an
unprivileged reader can severely affect the performance of a
file-server with a denial-of-service attack. So, I suggest that
a read-only file-descriptor CANNOT cause the contents of a file
to be written. If it does, it's broken. Given this, fsync() on
a directory entry, accessed by a read-only file-descriptor, can't
do anything.

These are things that should be addressed rather than flamed-
away. I think that the intent of fsync() on a file is to make
certain that it is on the physical media in a state from which
it can be accessed after a crash. If this is the intent, then
playing games with individual directories is not useful and
fsync() on the read/write file-descriptor actually updating the
file should be sufficient.


Cheers,
Dick Johnson

Penguin : Linux version 2.4.18 on an i686 machine (797.90 BogoMips).

Windows-2000/Professional isn't.

2002-07-15 13:32:24

by Matthias Andree

Subject: Re: [PATCH] 2.4.19-rc1/2.5.25 provide dummy fsync() routine for directories on NFS mounts

On Mon, 15 Jul 2002, Richard B. Johnson wrote:

> These are things that should be addressed rather than flamed-
> away. I think that the intent of fsync() on a file is to make
> certain that it is on the physical media in a state from which
> it can be accessed after a crash. If this is the intent, then
> playing games with individual directories is not useful and
> fsync() on the read/write file-descriptor actually updating the
> file should be sufficient.

We had a similar discussion along the lines of an MTA roughly a year
ago, but without your (unquoted) objection that fsync() on a file
without write permission should be impossible.

The essence was that Linux 2.4 ext3fs and reiserfs guarantee that on
fsync(), the file is recoverable from the place it was created; 2.2 was
halfway there. But beware: only data=ordered or data=journal (in ext3fs,
or as a beta patch for reiserfs from
ftp.suse.com:/pub/people/mason/patches/data-logging/, quoted from memory)
will guarantee that your file contents are recoverable.

This does not constitute any statement on JFS or XFS. I'm unaware of
their characteristics in fsync and directory update issues.

That aside, it would be really useful to get this "hog a writer" issue
ironed out either way, and that the illogical "fsync() a O_RDONLY" file
be resolved somehow.

For the data of users not acquainted with kernel intrinsics, the way
things are now is most dangerous, and I'd really ask that Andrew
Morton's dirsync() patches (where still necessary) and tool patches
(chattr, mount) be deployed NOW and that -o dirsync (call it noasync for
compatibility) be the default. A safety-speed tradeoff should only
sacrifice safety at the explicit request and mke2fs should be told to
generate ext3fs by default NOW.

The argumentation that Linux leaves the choice of when to sync directory
data to the application is nice, but not more, and having this as tuning
option is fine, but to quote Wietse Venema "it's interesting to see that
out of the box, Linux handles logging more securely (sync writes) than
email (async directory updates)". And right he is.

Is fsync()ing directories any portable?

-- archived at: http://groups.google.com/groups?selm=89uj5c%242h2s%241%40FreeBSD.csie.NCTU.edu.tw&oe=utf-8&output=gplain

--
Matthias Andree

2002-07-15 14:46:54

by Patrick J. LoPresti

Subject: Re: [PATCH] 2.4.19-rc1/2.5.25 provide dummy fsync() routine for directories on NFS mounts

Matthias Andree <[email protected]> writes:

> We had a similar discussion along the lines of an MTA roughly a year
> ago, but without your (unquoted) objection that fsync() on a file
> without write permission should be impossible.

It was a long thread:

http://groups.google.com/groups?threadm=linux.kernel.3B5FC7FB.D5AF0932%40zip.com.au

http://lists.insecure.org/linux-kernel/2001/Aug/index.html#39

> The essence was that Linux 2.4 ext3fs and reiserfs guarantee that on
> fsync(), the file is recoverable from the place it was created, 2.2 was
> halfway there; but beware: only data=ordered or data=journal (in ext3fs,
> as beta patch for reiserfs from
> ftp.suse.com:/pub/people/mason/patches/data-logging/ <- from memory))
> will guarantee that your file contents are recoverable.

I do not recall anything about data=ordered or data=journal mode being
required. I thought someone authoritative (Stephen Tweedie?) said
that ext3 happens to commit the journal on fsync(), independent of the
journaling mode, but that this behavior was an implementation
coincidence and not guaranteed. (Unfortunately, I am having trouble
finding that message... Can someone familiar with the source confirm
or deny this?)

I would love to know what IS guaranteed. This fsync() question keeps
cropping up, and as far as I know there is no authoritative statement
anywhere about what Linux promises. "Read the source code" is the
wrong answer; implementations can change at any time. This is a
question about the interface, not the implementation. "See post XXX
on linux-kernel" is almost as bad.

> That aside, it would be really useful to get this "hog a writer" issue
> ironed out either way, and that the illogical "fsync() a O_RDONLY"
> file be resolved somehow.

It is a non-issue; no resolution is necessary. If I can even read or
write a single file on the same DISK (or bus) that some server process
uses, I can "hog its resources" and slow it down. Horrors! Is there
any solution??? Oh yeah, don't let me do that.

The only interesting question here is what the relevant standards say.
And if they allow fsync() at all on a read-only descriptor, then there
is pretty clearly only one thing that can mean. If you have a problem
with this behavior, then configure your precious servers to keep their
data unreadable by untrusted parties.

> Is fsync()ing directories any portable?

No, but apparently it is what Linux supports. If this were documented
clearly somewhere, maybe application authors could be convinced to
support it.

- Pat

2002-07-15 15:04:03

by Alan

Subject: Re: [PATCH] 2.4.19-rc1/2.5.25 provide dummy fsync() routine for directories on NFS mounts

On Mon, 2002-07-15 at 15:49, Patrick J. LoPresti wrote:
> I would love to know what IS guaranteed. This fsync() question keeps
> cropping up, and as far as I know there is no authoritative statement

Linus has explicitly stated what fsync on a directory does, during
several of the thousands of cycling repeated flamewars generated by MTA
authors

If that isn't definitive I don't know what is

2002-07-15 15:15:51

by Matthias Andree

Subject: Re: [PATCH] 2.4.19-rc1/2.5.25 provide dummy fsync() routine for directories on NFS mounts

On Mon, 15 Jul 2002, Patrick J. LoPresti wrote:

> I do not recall anything about data=ordered or data=journal mode being
> required. I thought someone authoritative (Stephen Tweedie?) said
> that ext3 happens to commit the journal on fsync(), independent of the
> journaling mode, but that this behavior was an implementation
> coincidence and not guaranteed. (Unfortunately, I am having trouble
> finding that message... Can someone familiar with the source confirm
> or deny this?)

I know about the "happens to...", but I think after that discussion,
they'd keep it that way.

The data= mode was not part of the past discussion, that's why I brought
this up now. However, reiserfs or ext3fs with data=writeback only
journal the fsync() metadata involved, not the order of data (file
contents) versus directory contents, so you can end up with a "crash -
journal replay - file with bogus contents" scenario. I've seen this
happen on ReiserFS and I was not too fond of it, particularly not as I
don't have "fast-access" backups, I need to read a full file from SLR
tape up to the point where the desired file is stored.

> I would love to know what IS guaranteed. This fsync() question keeps
> cropping up, and as far as I know there is no authoritative statement
> anywhere about what Linux promises. "Read the source code" is the

Indeed not. A "file system codex" should document these guarantees
authoritatively with respect to path names (link, rename, directory
updates), and it should remain valid from one kernel revision until the
next version (i.e. if documented for 2.4.18+, it must not change
before 2.5.x).

> > That aside, it would be really useful to get this "hog a writer" issue
> > ironed out either way, and that the illogical "fsync() a O_RDONLY"
> > file be resolved somehow.
>
> It is a non-issue; no resolution is necessary. If I can even read or
> write a single file on the same DISK (or bus) that some server process
> uses, I can "hog its resources" and slow it down. Horrors! Is there
> any solution??? Oh yeah, don't let me do that.

[IRONY DETECTED]

Seriously: imagine another process that opens the file your process
is writing into, but it itself has no write permission -- and busy loops
on fsync(). Should this fsync process really trigger flushing your
blocks although it has no write permissions, this _is_ a problem unless
you have some decent tagged queueing in place.

fsync() as per open group base specs issue 6 is allowed to return EBADF,
EINTR, EINVAL, EIO. Returning EINVAL for fsync(fd) after fd =
open("blah", O_RDONLY) does not sound unreasonable. You have nothing to
write in O_RDONLY, use O_RDWR or O_WRONLY instead.

> The only interesting question here is what the relevant standards say.
> And if they allow fsync() at all on a read-only descriptor, then there
> is pretty clearly only one thing that can mean. If you have a problem
> with this behavior, then configure your precious servers to keep their
> data unreadable by untrusted parties.

Or make fsync() a no-op, meaning "your process (group) has no data to
write", or return error... EINVAL.

> > Is fsync()ing directories any portable?
>
> No, but apparently it is what Linux supports. If this were documented
> clearly somewhere, maybe application authors could be convinced to
> support it.

I don't think so. They'd rather declare ReiserFS unsupported and go with
chattr +S. Seen that.

New implementations (Courier's maildrop) still rely on BSD FFS
"synchronous directory" semantics.

--
Matthias Andree

2002-07-15 15:16:38

by Matthias Andree

Subject: Re: [PATCH] 2.4.19-rc1/2.5.25 provide dummy fsync() routine for directories on NFS mounts

On Mon, 15 Jul 2002, Alan Cox wrote:

> On Mon, 2002-07-15 at 15:49, Patrick J. LoPresti wrote:
> > I would love to know what IS guaranteed. This fsync() question keeps
> > cropping up, and as far as I know there is no authoritative statement
>
> Linus has explicitly stated what fsync on a directory does, during
> several of the thousands of cycling repeated flamewars generated by MTA
> authors

That requires explicitly porting applications to Linux and is
unreasonable to expect from a usability point of view.

--
Matthias Andree

2002-07-15 15:18:32

by Bill Rugolsky Jr.

Subject: Re: [PATCH] 2.4.19-rc1/2.5.25 provide dummy fsync() routine for directories on NFS mounts

On Mon, Jul 15, 2002 at 03:35:07PM +0200, Matthias Andree wrote:
> For the data of users not acquainted with kernel intrinsics, the way
> things are now is most dangerous, and I'd really ask that Andrew
> Morton's dirsync() patches (where still necessary) and tool patches
> (chattr, mount) be deployed NOW and that -o dirsync (call it noasync for
> compatibility) be the default. A safety-speed tradeoff should only
> sacrifice safety at the explicit request and mke2fs should be told to
> generate ext3fs by default NOW.

Put dirsync in 2.4? Sure, good idea. Dangerous without it? To whom?

Explain how it is dangerous? The journalling filesystems perform
directory updates as transactions. It's dangerous to your MTA
perhaps. Andrew Morton has bent over backwards to find and fix bugs in
the synchronous write logic and to provide what you wanted, i.e.,
dirsync. He and Chris Mason fixed performance problems in ext3 and
Reiserfs. Reread the thread -- you insisted repeatedly that you just
wanted dirsync. Or was that just the opening gambit?

> The argumentation that Linux leaves the choice of when to sync directory
> data to the application is nice, but not more, and having this as tuning
> option is fine, but to quote Wietse Venema "it's interesting to see that
> out of the box, Linux handles logging more securely (sync writes) than
> email (async directory updates)". And right he is.

With all due respect to Wietse, that's nonsense: synchronous write
in syslog or other logging facilities is a *userspace* policy issue.
Default synchronous directory updates is a *kernel* policy issue.

I don't have dirsync handy at the moment, so I can't test, but
I have to ask: have you tried the simple (and IMHO devastating) benchmark
that I posted back on 2001-08-02 comparing Linux to Solaris file creation,

http://marc.theaimsgroup.com/?l=linux-kernel&m=99678208121947&w=2

i.e., copy a file tree (XFree86-4.1, 33027 files) with hard links.

Recall:

Solaris: real 363.46s   user 0.84s     sys 10.13s
Ext2:    real 0m3.823s  user 0m0.240s  sys 0m3.570s
Ext3:    real 0m5.106s  user 0m0.200s  sys 0m3.700s

"dirsync" gives you what you want; please mount /var (or wherever)
-o dirsync and leave the kernel defaults as they are.

Regards,

Bill Rugolsky

2002-07-15 15:32:51

by Alan

Subject: Re: [PATCH] 2.4.19-rc1/2.5.25 provide dummy fsync() routine for directories on NFS mounts

On Mon, 2002-07-15 at 16:19, Matthias Andree wrote:
> > Linus has explicitly stated what fsync on a directory does, during
> > several of the thousands of cycling repeated flamewars generated by MTA
> > authors
>
> That requires explicitly porting applications to Linux and is
> unreasonable to expect from a usability point of view.

Well bad luck then. POSIX and SuS forgot to specify a standard on this.
I've pointed other people at the standards committee to go fix it but
heard a deafening silence. Until then I'll run nice fast Linux ported
mail apps

2002-07-15 15:35:11

by Patrick J. LoPresti

Subject: Re: [PATCH] 2.4.19-rc1/2.5.25 provide dummy fsync() routine for directories on NFS mounts

Alan Cox <[email protected]> writes:

> Linus has explicitly stated what fsync on a directory does, during
> several of the thousands of cycling repeated flamewars generated by MTA
> authors
>
> If that isn't definitive I don't know what is

Documentation/fsync.txt would be better.

I mean, suppose I write to some MTA's authors to inform them that
their product is "broken on Linux" and telling them how to fix it.
They might think I am nuts, or that this behavior is an implementation
coincidence. (Some of them even seem to think Linux is not complying
with the relevant standards. That there is even an argument here
means that the standards themselves are broken; standards are supposed
to be very clear.)

To where should I refer these authors to convince them that this
really is how Linux behaves, by definition, now and forever? Should I
point them at the flamewars in various mailing list archives? Should
I suggest they write to Linus personally?

I would rather refer them to Documentation/fsync.txt. Do you agree?
Would you accept a patch to add it?

- Pat

2002-07-15 15:33:08

by Matthias Andree

Subject: Re: [PATCH] 2.4.19-rc1/2.5.25 provide dummy fsync() routine for directories on NFS mounts

On Mon, 15 Jul 2002, Bill Rugolsky Jr. wrote:

> Put dirsync in 2.4? Sure, good idea. Dangerous without it? To whom?
>
> Explain how it is dangerous? The journalling filesystems perform
> directory updates as transactions. It's dangerous to your MTA
> perhaps. Andrew Morton has bent over backwards to find and fix bugs in
> the synchronous write logic and to provide what you wanted, i.e.,
> dirsync. He and Chris Mason fixed performance problems in ext3 and
> Reiserfs. Reread the thread -- you insisted repeatedly that you just
> wanted dirsync. Or was that just the opening gambit?

The code is there, for ext3, but not for reiserfs. A year has passed,
but still, dirsync is not the default. This is directed towards the
maintainers of the kernel, not towards Andrew Morton.

> With all due respect to Wietse, that's nonsense: synchronous write
> in syslog or other logging facilities is a *userspace* policy issue.
> Default synchronous directory updates is a *kernel* policy issue.

I'm well aware of this, and that _by_default_ user-space is more
cautious than kernel-space is beyond my horizon, I'm afraid. Of course,
these things are not really related, as syslog and Linux kernel are
separate projects, but still, it looks strange from the outside.

> I don't have dirsync handy at the moment, so I can't test, but
> I have to ask: have you tried the simple (and IMHO devastating) benchmark
> that I posted back on 2001-08-02 comparing Linux to Solaris file creation,
>
> http://marc.theaimsgroup.com/?l=linux-kernel&m=99678208121947&w=2
>
> i.e., copy a file tree (XFree86-4.1, 33027 files) with hard links.

Nope, I prefer not to play disk hogging games on my Solaris boxen, both
of which are in production :-)

> Recall:
>
> Solaris: 363.46s real 0.84s user 10.13s system
> Ext2: real 0m3.823s user 0m0.240s sys 0m3.570s
> Ext3: real 0m5.106s user 0m0.200s sys 0m3.700s
>
> "dirsync" gives you what you want; please mount /var (or wherever)
> -o dirsync and leave the kernel defaults as they are.

/var and /home, indeed.

So you prefer speed over safety. That's fine. But that's not sane for a
kernel to do. Cheating benchmarks is what others may call it. I just
call it sad.

--
Matthias Andree

2002-07-15 15:42:58

by Alan

Subject: Re: [PATCH] 2.4.19-rc1/2.5.25 provide dummy fsync() routine for directories on NFS mounts

On Mon, 2002-07-15 at 16:38, Patrick J. LoPresti wrote:
> Alan Cox <[email protected]> writes:
>
> > Linus has explicitly stated what fsync on a directory does, during
> > several of the thousands of cycling repeated flamewars generated by MTA
> > authors
> >
> > If that isn't definitive I don't know what is
>
> Documentation/fsync.txt would be better.

Documentation/fs/fsync.txt or similar sounds a good idea

2002-07-15 16:07:46

by Patrick J. LoPresti

Subject: Re: [PATCH] 2.4.19-rc1/2.5.25 provide dummy fsync() routine for directories on NFS mounts

Matthias Andree <[email protected]> writes:

> The data= mode was not part of the past discussion, that's why I
> brought this up now. However, reiserfs or ext3fs with data=writeback
> only journal the fsync() metadata involved, not the order of data
> (file contents) versus directory contents, so you can end up with a
> "crash - journal replay - file with bogus contents" scenario.

This should not happen with a properly written application. fsync()
flushes a bunch of stuff to disk, but it normally makes no promise
about the ORDER in which that stuff goes out. fsync() itself is how
application authors can enforce an ordering on disk operations.

For example, a typical MTA might follow this paradigm:

write temp file
fsync()
rename temp file to destination
fsync()
report success

(Yes, I know, "link/unlink" is more common in practice than rename().
But the principle is the same.)

Or, in the case of Postfix:

write message file
fsync()
chmod +x message file
fsync()
report success

The first paradigm uses the presence of a directory entry to represent
"committed" data. The second uses a mode bit on the file.

Both of these paradigms work fine with data=writeback. Yes, they
require calling fsync() twice, but that is exactly what you need to
enforce the ordering constraints!

An MTA has two ordering constraints:

1) Data must be flushed to disk before it is marked on disk as
"committed". This is to ensure that, after a crash, the MTA does
not read a corrupted mail file.

2) Data must be marked on disk as "committed" before a success code
is reported to the remote MTA. This is to ensure that no mail is
lost.

The ext3 data=ordered mode enforces the first constraint for mailers
using the "rename" paradigm, eliminating the need for the first
fsync() call. But any MTA which relies on data=ordered semantics is
not only Linux-specific, but ext3/reiserfs specific!

Synchronous directory updates, a la FFS, enforce the second constraint
(again for the "rename" paradigm), eliminating the need for the second
fsync().

But to be robust across platforms and file systems, a mailer needs
both fsync() calls. (On Linux, you actually need to fsync() the
*directory*, not the file, for the "rename" paradigm. It would be
nice if we could convince MTA authors to do this.)
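
A minimal sketch of the "rename" paradigm with both fsync() calls, the second one on the directory as Linux requires. The helper name commit_message() and the ".tmp." prefix for the temporary file are made up for illustration, and the sketch assumes the temporary file and the destination live in the same directory:

```c
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: durably store len bytes of data as dir/name.
 * Returns 0 on success, -1 on any failure.  Note that rename() will
 * silently replace an existing dir/name. */
int commit_message(const char *dir, const char *name,
                   const char *data, size_t len)
{
    char tmp[4096], dst[4096];
    size_t off = 0;
    int fd, dirfd;

    if (snprintf(tmp, sizeof tmp, "%s/.tmp.%s", dir, name) >= (int)sizeof tmp ||
        snprintf(dst, sizeof dst, "%s/%s", dir, name) >= (int)sizeof dst)
        return -1;

    fd = open(tmp, O_WRONLY | O_CREAT | O_EXCL, 0600);
    if (fd < 0)
        return -1;

    /* write() may accept fewer bytes than requested, so loop. */
    while (off < len) {
        ssize_t n = write(fd, data + off, len - off);
        if (n < 0) { close(fd); unlink(tmp); return -1; }
        off += (size_t)n;
    }

    /* First fsync(): file contents reach disk before the commit. */
    if (fsync(fd) < 0) { close(fd); unlink(tmp); return -1; }
    if (close(fd) < 0) { unlink(tmp); return -1; }

    /* The directory entry under the final name is the commit mark. */
    if (rename(tmp, dst) < 0) { unlink(tmp); return -1; }

    /* Second fsync(): on the directory, so the rename itself is durable. */
    dirfd = open(dir, O_RDONLY);
    if (dirfd < 0)
        return -1;
    if (fsync(dirfd) < 0) { close(dirfd); return -1; }
    return close(dirfd);
}
```

A caller would report success to the remote side only after commit_message() returns 0.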

> I don't think so. They'd rather declare ReiserFS unsupported and go with
> chattr +S. Seen that.
>
> New implementations (Courier's maildrop) still rely on BSD FFS
> "synchronous directory" semantics.

Are you sure? Because that is ridiculous... Modern BSDs like to use
"soft updates", which need that second fsync() to commit the metadata.
So as long as fsync() commits the journal, either paradigm above
should work fine under any journaling mode.

Summary: *All* MTAs should call fsync() twice. The only issue is what
descriptors they should call it on, exactly :-).

- Pat

2002-07-15 16:11:44

by Bill Rugolsky Jr.

Subject: Re: [PATCH] 2.4.19-rc1/2.5.25 provide dummy fsync() routine for directories on NFS mounts

On Mon, Jul 15, 2002 at 05:35:53PM +0200, Matthias Andree wrote:
> The code is there, for ext3, but not for reiserfs. A year has passed,
> but still, dirsync is not the default. This is directed towards the
> maintainers of the kernel, not towards Andrew Morton.

I'm in violent agreement that it should go into 2.4 *now that it is merged
in 2.5*. You may have noticed that Marcelo has been occupied with a few
other issues (VM, IDE).

> > I don't have dirsync handy at the moment, so I can't test, but
> > I have to ask: have you tried the simple (and IMHO devastating) benchmark
> > that I posted back on 2001-08-02 comparing Linux to Solaris file creation,
> >
> > http://marc.theaimsgroup.com/?l=linux-kernel&m=99678208121947&w=2
> >
> > i.e., copy a file tree (XFree86-4.1, 33027 files) with hard links.
>
> Nope, I prefer not to play disk hogging games on my Solaris boxen, both
> of which are in production :-)

I'm not asking you to do it on your Solaris boxen -- I couldn't give a
damn about slow, buggy Solaris. I'm asking whether you have tested this
on ext2/ext3 with/without dirsync. The gentlemanly thing to do when asking
for a change to the kernel is to (honestly) assess its impact.

> So you prefer speed over safety. That's fine. But that's not sane for a
> kernel to do. Cheating benchmarks is what others may call it. I just
> call it sad.

Cheating benchmarks -- bah! Safety for *one* (naive) application class!

dirsync buys me no useful safety on my build host, all it will do is
slow down things like rpmbuild --rebuild.

This is all rather silly. An MTA requires configuration, so what is
the difficulty in using -o dirsync or, alternatively and quite a bit
more simply, executing chattr +D on the spool directory? It's quite
simple: put dirsync in the kernel and tools, then add chattr +D to the
post-install scripts for your favorite package manager.

- Bill Rugolsky

2002-07-15 18:14:02

by Matthias Andree

Subject: Re: [PATCH] 2.4.19-rc1/2.5.25 provide dummy fsync() routine for directories on NFS mounts

On Mon, 15 Jul 2002, Patrick J. LoPresti wrote:

> Matthias Andree <[email protected]> writes:
>
> > The data= mode was not part of the past discussion, that's why I
> > brought this up now. However, reiserfs or ext3fs with data=writeback
> > only journal the fsync() metadata involved, not the order of data
> > (file contents) versus directory contents, so you can end up with a
> > "crash - journal replay - file with bogus contents" scenario.
>
> This should not happen with a properly written application. fsync()
> flushes a bunch of stuff to disk, but it normally makes no promise
> about the ORDER in which that stuff goes out. fsync() itself is how
> application authors can enforce an ordering on disk operations.

Well, to some extent.

> For example, a typical MTA might follow this paradigm:
>
> write temp file
> fsync()
> rename temp file to destination
> fsync()

So does fsync() guarantee rename() persistence across crash on all file
systems and kernel versions? IIRC, no.

We might want to fill in a table, on the rows kernel release and file
system, on the columns whether 1. fsync() syncs all directory updates up
to the root, 2. fsync() syncs rename properly, 3. fsync() syncs link, 4.
fsync() syncs unlink (not too important, at least not for an MTA, if you
ask me), 5. offers dirsync, 6. has dirsync on by default.

Very raw draft:

Linux 2.0
    ext2
    ufs

Linux 2.2
    ufs
    ext2
    ext3 0.0.7<mumble>
    reiserfs 3.5
    jfs
    xfs? don't think so.

Linux 2.4
    ufs
    ext2
    ext3 0.9.x       1. yes  2. yes  3. yes  4. ?  5. use patch, use sync, use chattr +S  6. no
    reiserfs 3.5
    reiserfs 3.6     1. yes  2. yes  3. yes  4. ?  5. no, use sync  6. no
    jfs 1.0
    xfs 1.0
    xfs 1.1
    you name it

And for completeness:
Free/Net/OpenBSD
    ffs              1. yes  2. yes  3. yes  4. yes  5. yes  6. yes
    ffs softupdates  1. yes  2. yes  3. yes  4. ?    5. no   6. no
    ext2
    ufs
    lfs

^ editor vacancy...


> report success
>
> (Yes, I know, "link/unlink" is more common in practice than rename().
> But the principle is the same.)

doesn't matter except that unlink over a crash is usually unsafe, the
file may reappear.

> Or, in the case of Postfix:
>
> write message file
> fsync()
> chmod +x message file
> fsync()
> report success

That'd be inefficient for the double fsync(). Postfix is ahead of that:
it omits the first fsync() you suggest, because the +x flag, while
necessary, is not sufficient to mark the mail as "complete, further
processing allowed". The "message file" is a structured file format that
has an "end" record at the end of the file. The +x flag must be set AND
this end marker must be present for Postfix to process the message file.
So the +x flag is just an accelerator for the concurrent reader that
won't even bother to look into the file that lacks the +x flag.

write - fchmod - fsync - close -> 250 Ok is therefore sound in Postfix.

(but beware of chmod, in publicly accessible places like /tmp, this can
be prone to races, use fchmod if you have an open file descriptor at
hand).
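
A minimal sketch of the write - fchmod - fsync - close sequence, using fchmod() on the open descriptor rather than chmod() on the path. The function name store_and_mark() and the 0600/0700 modes are assumptions made for illustration; this is not Postfix's actual code, and it leaves out the structured "end" record described above:

```c
#include <fcntl.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: write a message file, set the +x "ready" bit
 * through the descriptor, then flush.  Returns 0 on success, -1 on
 * failure. */
int store_and_mark(const char *path, const char *data, size_t len)
{
    size_t off = 0;
    int fd = open(path, O_WRONLY | O_CREAT | O_EXCL, 0600);

    if (fd < 0)
        return -1;

    /* write() may accept fewer bytes than requested, so loop. */
    while (off < len) {
        ssize_t n = write(fd, data + off, len - off);
        if (n < 0) { close(fd); return -1; }
        off += (size_t)n;
    }

    /* fchmod() acts on the file we actually opened; a chmod() by path
     * could be redirected if the name is swapped underneath us. */
    if (fchmod(fd, 0700) < 0) { close(fd); return -1; }

    /* A single fsync() covers both the data and the mode change, but
     * says nothing about the order in which they reach the disk. */
    if (fsync(fd) < 0) { close(fd); return -1; }
    return close(fd);
}
```

Note that this flushes the file only; whether the directory entry for a newly created file is durable is a separate question on Linux.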

> An MTA has two ordering constraints:
>
> 1) Data must be flushed to disk before it is marked on disk as
> "committed". This is to ensure that, after a crash, the MTA does
> not read a corrupted mail file.
>
> 2) Data must be marked on disk as "committed" before a success code
> is reported to the remote MTA. This is to ensure that no mail is
> lost.
>
> The ext3 data=ordered mode enforces the first constraint for mailers
> using the "rename" paradigm, eliminating the need for the first
> fsync() call. But any MTA which relies on data=ordered semantics is
> not only Linux-specific, but ext3/reiserfs specific!

You're right for the MTA AFAICT.

But let's keep this unspecific to the MTA. Unless fsync() is used to
enforce ordering, without data=ordered, journalled file systems can
"recreate" files that are not there. Undead you may call them if you so
like...

Let me claim that fsync() is beyond the common hobbyist hacker. Yes, I
have just put Asbestos underwear on :-)

> Synchronous directory updates, a la FFS, enforce the second constraint
> (again for the "rename" paradigm), eliminating the need for the second
> fsync().

...or for systems that don't sync the "new" path name created with
rename(2) from an open file descriptor...

> But to be robust across platforms and file systems, a mailer needs
> both fsync() calls. (On Linux, you actually need to fsync() the
> *directory*, not the file, for the "rename" paradigm. It would be
> nice if we could convince MTA authors to do this.)

...and this will not likely happen with Postfix. Wietse uses chattr +S,
and the Postfix queue only works reliably on systems that either (any
one alone is sufficient):

1. mount the file system containing /var/spool/postfix with -o sync
2. support chattr +S /var/spool/postfix
3. behave the way BSD softdeps do, where fsync() also syncs all
directory changes involved in a rename(2), all the way up to the mount
point.

Postfix' local(8) daemon additionally relies on rename(2) being
synchronous (in Maildir delivery), it does not fsync() after rename.
OTOH, the file is completely in Maildir/tmp/somename, so it's not really
lost, just invisible.

It'd be interesting if chattr +S Maildir/tmp/ would be sufficient to
make the rename ("tmp/somefile", "cur/somefile") persistent.

> > New implementations (Courier's maildrop) still rely on BSD FFS
> > "synchronous directory" semantics.
>
> Are you sure? Because that is ridiculous... Modern BSDs like to use
> "soft updates", which need that second fsync() to commit the metadata.

Unless I misread maildrop, yes. Anyone is free to show otherwise, and I
will apologize for this false claim.

> Summary: *All* MTAs should call fsync() twice. The only issue is what
> descriptors they should call it on, exactly :-).

See above. Before that, we must know that fsync() syncs all directory
and file data and metadata (that makes four) all the way up to the mount
point. For Linux 2.0, 2.2, 2.4. For any file system and any mount
option. See the table project above ;-)

--
Matthias Andree

2002-07-15 18:53:46

by Patrick J. LoPresti

Subject: Re: [PATCH] 2.4.19-rc1/2.5.25 provide dummy fsync() routine for directories on NFS mounts

Matthias Andree <[email protected]> writes:

> > For example, a typical MTA might follow this paradigm:
> >
> > write temp file
> > fsync()
> > rename temp file to destination
> > fsync()
>
> So does fsync() guarantee rename() persistence across crash on all file
> systems and kernel versions? IIRC, no.

It depends on what you fsync() :-).

On BSD, fsync() of a file's descriptor will commit the rename of that
file to disk.

On Linux, fsync() of the *directory's* descriptor is required. And
yes, this will work across file systems and Linux versions, according
to Linus/Alan/etc.

> That'd be inefficient for the double fsync().

But it is necessary. See below.

> Postfix is ahead of that: it omits the first fsync() you suggest,
> because the +x flag, while necessary, is not sufficient to mark the
> mail as "complete, further processing allowed". The "message file"
> is a structured file format that has an "end" record at the end of
> the file.

This is not sufficient! Data writes are NOT guaranteed to be ordered.
It is permissible for the file system to flush the first and last
block of a file to disk BEFORE flushing the middle. You either need
the double fsync() or you need a checksum in the file; simple markers
are not enough to make a real guarantee. And MTAs should be making
real guarantees!
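
To illustrate the checksum alternative: a rough sketch in which the writer appends a checksum over the payload and the reader rejects any file whose contents do not match it, so a missing middle block is detected even when the first and last blocks made it to disk. The trailer layout (4 bytes, big-endian) and the function names are invented for this example, and Adler-32 is used only because it is short; a real MTA would probably want something stronger:

```c
#include <stddef.h>
#include <stdint.h>

/* Adler-32 over buf[0..len). */
uint32_t adler32(const unsigned char *buf, size_t len)
{
    uint32_t a = 1, b = 0;

    for (size_t i = 0; i < len; i++) {
        a = (a + buf[i]) % 65521u;
        b = (b + a) % 65521u;
    }
    return (b << 16) | a;
}

/* Hypothetical layout: payload followed by a 4-byte big-endian checksum.
 * buf must have room for payload_len + 4 bytes. */
void seal_message(unsigned char *buf, size_t payload_len)
{
    uint32_t sum = adler32(buf, payload_len);

    buf[payload_len + 0] = (unsigned char)(sum >> 24);
    buf[payload_len + 1] = (unsigned char)(sum >> 16);
    buf[payload_len + 2] = (unsigned char)(sum >> 8);
    buf[payload_len + 3] = (unsigned char)sum;
}

/* Returns 1 if the trailer matches the payload, 0 otherwise. */
int message_is_intact(const unsigned char *buf, size_t total_len)
{
    size_t payload_len;
    uint32_t stored;

    if (total_len < 4)
        return 0;
    payload_len = total_len - 4;
    stored = ((uint32_t)buf[payload_len + 0] << 24) |
             ((uint32_t)buf[payload_len + 1] << 16) |
             ((uint32_t)buf[payload_len + 2] << 8) |
              (uint32_t)buf[payload_len + 3];
    return stored == adler32(buf, payload_len);
}
```

After a crash, the queue scanner would call message_is_intact() on each file and discard or quarantine those that fail, instead of trusting a bare end marker.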

> But let's keep this unspecific to the MTA. Unless fsync() is used to
> enforce ordering, without data=ordered, journalled file systems can
> "recreate" files that are not there. Undead you may call them if you
> so like...

No, data=ordered has nothing to do with recreating dead files. What
data=ordered does is make sure bogus blocks do not appear in new files
(or in new extents of old files).

Failing to call fsync() at all (i.e., failing to commit metadata
updates) is what can recreate dead files.

> Postfix' local(8) daemon additionally relies on rename(2) being
> synchronous (in Maildir delivery), it does not fsync() after rename.
> OTOH, the file is completely in Maildir/tmp/somename, so it's not
> really lost, just invisible.

No, it is lost, because the file's creation is not guaranteed to have
happened at all! (Well, depending on the file system and the
semantics. I think I need to write this up more clearly.)

> > Summary: *All* MTAs should call fsync() twice. The only issue is what
> > descriptors they should call it on, exactly :-).
>
> See above. Before that, we must know that fsync() syncs all directory
> and file data and metadata (that makes four) all the way up to the mount
> point. For Linux 2.0, 2.2, 2.4. For any file system and any mount
> option. See the table project above ;-)

As I said, the issue is what descriptors they should call fsync() on.
On Linux, fsync() on a file's descriptor will commit the file's
contents; a second fsync() on the containing directory's descriptor
will commit the rename()/link().
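
A sketch of that descriptor choice for the link()/unlink() case: fsync() the file, create the new name, fsync() the directory that received it, and only then drop the old name. The helper names fsync_dir() and publish_by_link() are invented here, and the code assumes Linux, where fsync() on a descriptor opened O_RDONLY is accepted:

```c
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: fsync() a directory through its own descriptor. */
int fsync_dir(const char *dir)
{
    int dirfd = open(dir, O_RDONLY);

    if (dirfd < 0)
        return -1;
    if (fsync(dirfd) < 0) { close(dirfd); return -1; }
    return close(dirfd);
}

/* Hypothetical helper: publish an already-written file src_path (inside
 * src_dir) under dst_path (inside dst_dir).  Returns 0 on success. */
int publish_by_link(const char *src_path, const char *src_dir,
                    const char *dst_path, const char *dst_dir)
{
    int fd = open(src_path, O_RDONLY);

    if (fd < 0)
        return -1;

    /* 1. fsync() on the file's descriptor commits its contents. */
    if (fsync(fd) < 0) { close(fd); return -1; }
    if (close(fd) < 0)
        return -1;

    /* 2. Create the new name, then fsync() the directory holding it. */
    if (link(src_path, dst_path) < 0)
        return -1;
    if (fsync_dir(dst_dir) < 0)
        return -1;

    /* 3. Remove the old name only once the new one is durable. */
    if (unlink(src_path) < 0)
        return -1;
    return fsync_dir(src_dir);
}
```

For a Maildir-style delivery, src_dir and dst_dir would be the tmp/ and new/ directories; both arguments may also name the same directory.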

- Pat

2002-07-16 01:37:43

by jw schultz

Subject: Re: [PATCH] 2.4.19-rc1/2.5.25 provide dummy fsync() routine for directories on NFS mounts

On Mon, Jul 15, 2002 at 04:17:01PM -0400, Patrick J. LoPresti wrote:
> Alan Cox <[email protected]> writes:
>
> > Documentation/fs/fsync.txt or similar sounds a good idea
>
> OK, attached is my first attempt at such a document.
>
> What do you think?
>
> - Pat
>
>

Nice and clear. I expect it also applies to unlink(2) and
rename(2).

A simplified version of this with a list of popular "broken"
MTAs and other spooling utilities might also go into the faq
with a strong emphasis on the chattr and mount options.



Content-Description: fsync.txt
> Linux fsync() semantics
> (or, "How to create a file reliably")
>
>
> Introduction
> ============
>
> Consider the following C program:
>
>   #include <unistd.h>
>   #include <stdio.h>
>   #include <fcntl.h>
>   #include <string.h>
>
>   int
>   main (int argc, char *argv[]) {
>     int fd;
>     char *s = "Hello, world!\n";
>
>     /* O_CREAT requires a mode argument. */
>     fd = open ("/tmp/foo", O_WRONLY|O_CREAT|O_EXCL, 0644);
>     if (fd < 0) return 1;
>
>     if (write (fd, s, strlen(s)) < 0) return 3;
>     if (fsync (fd) < 0) return 4;
>     if (close (fd) < 0) return 5;
>
>     return 0;
>   }
>
> Question: If you compile and run this program, and it exits zero
> (success), and your machine then crashes, is it guaranteed that the
> file /tmp/foo will exist upon reboot?
>
> Answer: On many Unices, including *BSD, yes.
> On Linux, NO.
>
> How could this be? And what can you do about it?
>
>
> History
> =======
>
> In the beginning was BSD with its Fast File System (FFS). Under FFS,
> changes to directories were "synchronous", meaning they were committed
> to disk before the system call (open/link/rename/etc.) returned.
> Changes to files (write()) were asynchronous. The fsync() system call
> allowed an application to force a file's pending writes to be
> committed to persistent media.
>
> In general, disks have reasonable throughput but horrible latency, so
> it is much faster to write many things all at once rather than one at
> a time. In other words, synchronous operations are slow.
>
> Enter Linux. By default, Linux makes all operations, including
> directory updates, asynchronous. Early file system benchmarks showed
> Linux beating the pants off of BSD, especially when lots of directory
> operations were involved. This annoyed the BSD folks, who claimed
> that synchronous directory updates are required for reliable
> operation. (As with most points of contention between Linux and BSD,
> this is both true and false... See below.)
>
> The problem with making directory operations asynchronous is that you
> then need to provide a way for the application to commit those changes
> to disk. Otherwise, it is impossible to write reliable applications.
>
>
> BSD softupdates
> ===============
>
> Sometime during the 90s, the BSD developers introduced "soft updates"
> to improve performance. These do two things. First, they make all
> file system operations asynchronous (like Linux). Second, they extend
> the fsync() system call so that it commits to disk BOTH the file's
> data AND any directories via which the file might be accessed.
>
> In other words, BSD with soft updates requires that you call fsync()
> on a file to commit any changes to its containing directory. This is
> why the program above "works" on BSD.
>
> Many programs are written these days to expect soft update semantics,
> because such algorithms will also work correctly under traditional
> FFS.
>
> The problem with the softupdates approach is that finding all paths to
> a file is complex, and the Linux developers hate complexity. Linux
> does NOT support this behavior for fsync() and probably never will.
>
>
> Standards
> =========
>
> Quick aside: What do the relevant standards (POSIX, SuS) say? Is
> Linux violating some standard here?
>
> Well, different people, having read the standards, disagree on this
> point. This itself means the standards are not clear (which is a bad
> thing for a standard). This is probably because the standards were
> written when synchronous directory updates were the norm, and the
> authors did not even consider asynchronous directory updates.
>
>
> The Linux Solution
> ==================
>
> The Linux answer is simple: If you want to flush a modified directory
> to disk, call fsync() on the directory.
>
> In other words, to reliably create a file on Linux, you need to do
> something like this:
>
>   #include <unistd.h>
>   #include <stdio.h>
>   #include <fcntl.h>
>   #include <string.h>
>
>   int
>   main (int argc, char *argv[]) {
>     int fd, dirfd;
>     char *s = "Hello, world!\n";
>
>     /* O_CREAT requires a mode argument. */
>     fd = open ("/tmp/foo", O_WRONLY|O_CREAT|O_EXCL, 0644);
>     if (fd < 0) return 1;
>
>     /* Descriptor for the containing directory, used for fsync() below. */
>     dirfd = open ("/tmp", O_RDONLY);
>     if (dirfd < 0) return 2;
>
>     if (write (fd, s, strlen(s)) < 0) return 3;
>     if (fsync (fd) < 0) return 4;
>     if (close (fd) < 0) return 5;
>     if (fsync (dirfd) < 0) return 6;
>     if (close (dirfd) < 0) return 7;
>
>     return 0;
>   }
>
> If this program exits zero, the file /tmp/foo is guaranteed to be on
> disk and to have the correct contents. This is true for ALL versions
> of the Linux kernel and ALL file systems.
>
>
> Other choices
> =============
>
> So you have written to the authors of your favorite MTA asking them to
> support Linux properly by using fsync() on directories. They have
> responded saying that "Linux is broken". (Be sure to ask them to
> justify this claim with chapter and verse from a standard. It is sure
> to be interesting.) What can you do?
>
> If the application does all its work in one directory, or a few
> directories, you can do "chattr +S" on the directory. This will cause
> all operations on that directory to be synchronous.
>
> You can use the "-o sync" mount option. This will cause ALL
> operations on that partition to be synchronous. This solves the
> problem, but is likely to be slow.
>
> In the current version of Linux, you can use the ext3 or ReiserFS file
> systems. These happen to commit their journals to disk whenever
> fsync() is called, which has the side-effect of providing semantics
> like BSD's soft updates. But note: This behavior is not guaranteed,
> and may change in future releases!
>
> But really, the best idea is to convince application authors to
> support the "Linux way" for committing directory updates. The
> semantics are simple, clear, and extremely efficient. So go bug those
> MTA authors until they listen :-).
>
>
> - Patrick LoPresti <[email protected]>
> July 2002


--
________________________________________________________________
J.W. Schultz Pegasystems Technologies
email address: [email protected]

Remember Cernan and Schmitt