2009-04-23 00:13:25

by Valerie Aurora

[permalink] [raw]
Subject: [RFC PATCH] fpathconf() for fsync() behavior

In the default mode for ext3 and btrfs, fsync() is both slow and
unnecessary for some important application use cases - at the same
time that it is absolutely required for correctness for other modes of
ext3, ext4, XFS, etc. If applications could easilyl distinguish
between the two cases, they would be more likely to be correct and
fast.

How about an fpathconf() variable, something like _PC_ORDERED? E.g.:

/* Unoptimized example optional fsync() demo */
write(fd);
/* Only fsync() if we need it */
if (fpath_conf(fd, _PC_ORDERED) != 1)
fsync(fd);
rename(tmp_path, new_path);

I know of two specific real-world cases in which this would
significantly improve performance: (a) fsync() before rename(), (b)
fsync() of the parent directory of a newly created file. Case (b) is
particularly nasty when you have multiple threads creating files in
the same directory because the dir's i_mutex is held across fsync() -
file creates become limited to the speed of sequential fsync()s.

Conceptual libc patch below.

-VAL

diff --git a/sysdeps/unix/sysv/linux/pathconf.c b/sysdeps/unix/sysv/linux/pathconf.c
index db03529..5b64939 100644
--- a/sysdeps/unix/sysv/linux/pathconf.c
+++ b/sysdeps/unix/sysv/linux/pathconf.c
@@ -51,6 +51,9 @@ __pathconf (const char *file, int name)
case _PC_CHOWN_RESTRICTED:
return __statfs_chown_restricted (__statfs (file, &fsbuf), &fsbuf);

+ case _PC_ORDERED:
+ return __statfs_ordered (__statfs (file, &fsbuf), &fsbuf);
+
default:
return posix_pathconf (file, name);
}
@@ -225,3 +228,44 @@ __statfs_chown_restricted (int result, const struct statfs *fsbuf)

return retval;
}
+
+
+/* Tells us if write operations are ordered with respect to each
+ * other. Useful for skipping fsync in some cases. Default is 0 -
+ * not ordered. */
+
+/* Used like: return statfs_ordered (__statfs (name, &buf), &buf); */
+long int
+__statfs_ordered (int result, const struct statfs *fsbuf)
+{
+ if (result < 0)
+ {
+ if (errno == ENOSYS)
+ /* Not possible, return the default value. */
+ return 0;
+
+ /* Some error occured. */
+ return -1;
+ }
+
+#define BTRFS_SUPER_MAGIC 0x9123683E
+ switch (fsbuf->f_type)
+ {
+ case BTRFS_SUPER_MAGIC:
+ case EXT2_SUPER_MAGIC:
+ /* XXX Must distinguish between 2, 3, and 4 */
+ case REISERFS_SUPER_MAGIC:
+ /* XXX Nasty hacking needed here to determine exact
+ * journaling mode. Options include parsing /proc/mounts,
+ * defining an ioctl(), creating a generic VFS interface.
+ * For demonstration purposes, assume the default mode,
+ * which is ordered for each of these file systems.
+ */
+ return 1;
+ case XFS_SUPER_MAGIC:
+ /* XXX XFS has a trillion options, is there one to do ordered mode? */
+ return 0;
+ default:
+ return 0;
+ }
+}
diff --git a/bits/confname.h b/bits/confname.h
index 80b51ac..3d19902 100644
--- a/bits/confname.h
+++ b/bits/confname.h
@@ -39,6 +39,8 @@ enum
#define _PC_PIPE_BUF _PC_PIPE_BUF
_PC_CHOWN_RESTRICTED,
#define _PC_CHOWN_RESTRICTED _PC_CHOWN_RESTRICTED
+ _PC_ORDERED,
+#define _PC_ORDERED _PC_ORDERED
_PC_NO_TRUNC,
#define _PC_NO_TRUNC _PC_NO_TRUNC
_PC_VDISABLE,
diff --git a/conform/data/unistd.h-data b/conform/data/unistd.h-data
index b6effa0..7325ff5 100644
--- a/conform/data/unistd.h-data
+++ b/conform/data/unistd.h-data
@@ -248,6 +248,7 @@ constant _PC_MAX_CANON
constant _PC_MAX_INPUT
constant _PC_NAME_MAX
constant _PC_NO_TRUNC
+constant _PC_ORDERED
constant _PC_PATH_MAX
constant _PC_PIPE_BUF
constant _PC_PRIO_IO
diff --git a/posix/annexc.c b/posix/annexc.c
index df5913a..658bdc1 100644
--- a/posix/annexc.c
+++ b/posix/annexc.c
@@ -501,7 +501,7 @@ static const char *const unistd_syms[] =
"F_OK", "NULL", "R_OK", "SEEK_CUR", "SEEK_END", "SEEK_SET", "STDERR_FILENO",
"STDIN_FILENO", "STDOUT_FILENO", "W_OK", "X_OK",
"_PC_ASYNC_IO", "_PC_CHOWN_RESTRICTED", "_PC_LINK_MAX", "_PC_MAX_CANON",
- "_PC_MAX_INPUT", "_PC_NAME_MAX", "_PC_NO_TRUNC", "_PC_PATH_MAX",
+ "_PC_MAX_INPUT", "_PC_NAME_MAX", "_PC_NO_TRUNC", "_PC_PATH_MAX", "_PC_ORDERED",
"_PC_PIPE_BUF", "_PC_PRIO_IO", "_PC_SYNC_IO", "_PC_VDISABLE",
"_SC_AIO_LISTIO_MAX", "_SC_AIO_MAX", "_SC_AIO_PRIO_DELTA_MAX",
"_SC_ARG_MAX", "_SC_ASYNCHRONOUS_IO", "_SC_CHILD_MAX", "_SC_CLK_TCK",
diff --git a/posix/fpathconf.c b/posix/fpathconf.c
index 840460b..d7f9a89 100644
--- a/posix/fpathconf.c
+++ b/posix/fpathconf.c
@@ -47,6 +47,7 @@ __fpathconf (fd, name)
case _PC_PIPE_BUF:
case _PC_SOCK_MAXBUF:
case _PC_CHOWN_RESTRICTED:
+ case _PC_ORDERED:
case _PC_NO_TRUNC:
case _PC_VDISABLE:
break;
diff --git a/posix/getconf.c b/posix/getconf.c
index 6184292..5995d60 100644
--- a/posix/getconf.c
+++ b/posix/getconf.c
@@ -81,6 +81,9 @@ static const struct conf vars[] =
#ifdef _PC_CHOWN_RESTRICTED
{ "_POSIX_CHOWN_RESTRICTED", _PC_CHOWN_RESTRICTED, PATHCONF },
#endif
+#ifdef _PC_ORDERED
+ { "_POSIX_ORDERED", _PC_ORDERED, PATHCONF },
+#endif
#ifdef _PC_NO_TRUNC
{ "_POSIX_NO_TRUNC", _PC_NO_TRUNC, PATHCONF },
#endif
diff --git a/sysdeps/posix/fpathconf.c b/sysdeps/posix/fpathconf.c
index 605cd17..c29fa6f 100644
--- a/sysdeps/posix/fpathconf.c
+++ b/sysdeps/posix/fpathconf.c
@@ -121,6 +121,13 @@ __fpathconf (fd, name)
return -1;
#endif

+ case _PC_ORDERED:
+#ifdef _POSIX_ORDERED
+ return _POSIX_ORDERED;
+#else
+ return -1;
+#endif
+
case _PC_NO_TRUNC:
#ifdef _POSIX_NO_TRUNC
return _POSIX_NO_TRUNC;
diff --git a/sysdeps/posix/pathconf.c b/sysdeps/posix/pathconf.c
index 75c99ee..f9d84ab 100644
--- a/sysdeps/posix/pathconf.c
+++ b/sysdeps/posix/pathconf.c
@@ -117,6 +117,13 @@ __pathconf (const char *path, int name)
return -1;
#endif

+ case _PC_ORDERED:
+#ifdef _POSIX_ORDERED
+ return _POSIX_ORDERED;
+#else
+ return -1;
+#endif
+
case _PC_NO_TRUNC:
#ifdef _POSIX_NO_TRUNC
return _POSIX_NO_TRUNC;
diff --git a/sysdeps/unix/sysv/linux/fpathconf.c b/sysdeps/unix/sysv/linux/fpathconf.c
index 2701c9e..51c43c4 100644
--- a/sysdeps/unix/sysv/linux/fpathconf.c
+++ b/sysdeps/unix/sysv/linux/fpathconf.c
@@ -48,6 +48,9 @@ __fpathconf (fd, name)
case _PC_CHOWN_RESTRICTED:
return __statfs_chown_restricted (__fstatfs (fd, &fsbuf), &fsbuf);

+ case _PC_ORDERED:
+ return __statfs_ordered (__fstatfs (fd, &fsbuf), &fsbuf);
+
default:
return posix_fpathconf (fd, name);
}
diff --git a/sysdeps/unix/sysv/linux/pathconf.h b/sysdeps/unix/sysv/linux/pathconf.h
index 806adcc..1c0b513 100644
--- a/sysdeps/unix/sysv/linux/pathconf.h
+++ b/sysdeps/unix/sysv/linux/pathconf.h
@@ -37,3 +37,6 @@ extern long int __statfs_symlinks (int result, const struct statfs *fsbuf);
/* Used like: return __statfs_chown_restricted (__statfs (name, &buf), &buf);*/
extern long int __statfs_chown_restricted (int result,
const struct statfs *fsbuf);
+
+/* Used like: return statfs_ordered (__statfs (name, &buf), &buf); */
+extern long int __statfs_ordered (int result, const struct statfs *fsbuf);


2009-04-23 05:23:00

by Andrew Morton

[permalink] [raw]
Subject: Re: [RFC PATCH] fpathconf() for fsync() behavior

On Wed, 22 Apr 2009 20:12:57 -0400 Valerie Aurora Henson <[email protected]> wrote:

> In the default mode for ext3 and btrfs, fsync() is both slow and
> unnecessary for some important application use cases - at the same
> time that it is absolutely required for correctness for other modes of
> ext3, ext4, XFS, etc. If applications could easilyl distinguish
> between the two cases, they would be more likely to be correct and
> fast.
>
> How about an fpathconf() variable, something like _PC_ORDERED? E.g.:
>
> /* Unoptimized example optional fsync() demo */
> write(fd);
> /* Only fsync() if we need it */
> if (fpath_conf(fd, _PC_ORDERED) != 1)
> fsync(fd);
> rename(tmp_path, new_path);
>
> I know of two specific real-world cases in which this would
> significantly improve performance: (a) fsync() before rename(), (b)
> fsync() of the parent directory of a newly created file. Case (b) is
> particularly nasty when you have multiple threads creating files in
> the same directory because the dir's i_mutex is held across fsync() -
> file creates become limited to the speed of sequential fsync()s.
>
> Conceptual libc patch below.

Would it be better to implement new syscall(s) with finer-grained control
and better semantics? Then userspace would just need to to:

fsync_on_steroids(fd, FSYNC_BEFORE_RENAME);

and that all gets down into the filesystem which can then work out what
it needs to do to implement the command.

2009-04-23 11:11:59

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [RFC PATCH] fpathconf() for fsync() behavior

On Wed, Apr 22, 2009 at 08:12:57PM -0400, Valerie Aurora Henson wrote:
> In the default mode for ext3 and btrfs, fsync() is both slow and
> unnecessary for some important application use cases - at the same
> time that it is absolutely required for correctness for other modes of
> ext3, ext4, XFS, etc. If applications could easilyl distinguish
> between the two cases, they would be more likely to be correct and
> fast.
>
> How about an fpathconf() variable, something like _PC_ORDERED? E.g.:

Before we add any new fpathconf varibale we need a reall (f)pathconf(at)
syscall so that the fs driver can exposed it's characteristics, having
to replicate that information to glibc especially for something required
for data integrity is a receipe for a desaster.

2009-04-23 11:21:23

by Jamie Lokier

[permalink] [raw]
Subject: Re: [RFC PATCH] fpathconf() for fsync() behavior

Andrew Morton wrote:
> Would it be better to implement new syscall(s) with finer-grained control
> and better semantics? Then userspace would just need to to:
>
> fsync_on_steroids(fd, FSYNC_BEFORE_RENAME);
>
> and that all gets down into the filesystem which can then work out what
> it needs to do to implement the command.

+1 from me. Several flags come to mind for discussion.
FSYNC_HARDWARE. FSYNC_ORDER_ONLY, FSYNC_FLUSH.
FSYNC_DATA_BEFORE_SIZE. FSYNC_BEFORE_NEW_FILE.

Nick Piggin was working on fsync_range().

Maybe it's time to do fsync properly?

-- Jamie

2009-04-23 12:43:30

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [RFC PATCH] fpathconf() for fsync() behavior

On Thu, Apr 23, 2009 at 12:21:05PM +0100, Jamie Lokier wrote:
> Andrew Morton wrote:
> > Would it be better to implement new syscall(s) with finer-grained control
> > and better semantics? Then userspace would just need to to:
> >
> > fsync_on_steroids(fd, FSYNC_BEFORE_RENAME);
> >
> > and that all gets down into the filesystem which can then work out what
> > it needs to do to implement the command.
>
> +1 from me. Several flags come to mind for discussion.
> FSYNC_HARDWARE. FSYNC_ORDER_ONLY, FSYNC_FLUSH.
> FSYNC_DATA_BEFORE_SIZE. FSYNC_BEFORE_NEW_FILE.
>
> Nick Piggin was working on fsync_range().
>
> Maybe it's time to do fsync properly?

We could create such a thing, but how many application programmers
will actually *use* them? People need to check out my blog, where my
competence, my judgement, even my paternity was questioned about this
issue.

Application writers don't care about OS portability (it only has to
work on Linux), or working on multiple filesystems (it only has work
on ext3, and any filesystems which doesn't do automagic fsync's at the
right magic times automagically is broken by design). This includes
many GNOME and KDE developers. So as we concluded at the filesystem
and storage workshop, we probably will have to keep automagic
hueristics out there, for all of the broken applications. Heck, Linus
even refused to call those applications "broken".

So we can create a more finer-grained controlled system call ---
although I would suggest that we just add some extra flags to
sync_file_range() --- but it's doubtful that many application
programmers will use it.

- Ted

2009-04-23 12:48:54

by Jeff Garzik

[permalink] [raw]
Subject: Re: [RFC PATCH] fpathconf() for fsync() behavior

Theodore Tso wrote:
> So we can create a more finer-grained controlled system call ---
> although I would suggest that we just add some extra flags to
> sync_file_range() --- but it's doubtful that many application
> programmers will use it.

sync_file_range() seems the obvious avenue for new fsync flags.

I even explored what it would take to add a "flush storage dev writeback
cache, for this file" flag to sync_file_range(), rather unfortunately
non-trivial given the current implementation's close ties to MM.

But yeah... how many people will use these fancy new flags and features?

Jeff


2009-04-23 14:11:25

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [RFC PATCH] fpathconf() for fsync() behavior

On Thu, Apr 23, 2009 at 08:48:01AM -0400, Jeff Garzik wrote:
> Theodore Tso wrote:
>> So we can create a more finer-grained controlled system call ---
>> although I would suggest that we just add some extra flags to
>> sync_file_range() --- but it's doubtful that many application
>> programmers will use it.
>
> sync_file_range() seems the obvious avenue for new fsync flags.
>
> I even explored what it would take to add a "flush storage dev writeback
> cache, for this file" flag to sync_file_range(), rather unfortunately
> non-trivial given the current implementation's close ties to MM.

What I had roughly in mind was some (optional) calls to the filesystem
before and after the current implementations MM magic, but I haven't
thought very deeply on the subject yet, mainly because...

> But yeah... how many people will use these fancy new flags and features?
>

Yeah. That issue.

It would be nice to have some additional semantics, but in terms of
priorities, it's not the highest thing on my list in terms of itches
to scratch.

- Ted

2009-04-23 16:13:05

by Valerie Aurora

[permalink] [raw]
Subject: Re: [RFC PATCH] fpathconf() for fsync() behavior

On Wed, Apr 22, 2009 at 10:17:48PM -0700, Andrew Morton wrote:
> On Wed, 22 Apr 2009 20:12:57 -0400 Valerie Aurora Henson <[email protected]> wrote:
>
> > In the default mode for ext3 and btrfs, fsync() is both slow and
> > unnecessary for some important application use cases - at the same
> > time that it is absolutely required for correctness for other modes of
> > ext3, ext4, XFS, etc. If applications could easilyl distinguish
> > between the two cases, they would be more likely to be correct and
> > fast.
> >
> > How about an fpathconf() variable, something like _PC_ORDERED? E.g.:
> >
> > /* Unoptimized example optional fsync() demo */
> > write(fd);
> > /* Only fsync() if we need it */
> > if (fpath_conf(fd, _PC_ORDERED) != 1)
> > fsync(fd);
> > rename(tmp_path, new_path);
> >
> > I know of two specific real-world cases in which this would
> > significantly improve performance: (a) fsync() before rename(), (b)
> > fsync() of the parent directory of a newly created file. Case (b) is
> > particularly nasty when you have multiple threads creating files in
> > the same directory because the dir's i_mutex is held across fsync() -
> > file creates become limited to the speed of sequential fsync()s.
> >
> > Conceptual libc patch below.
>
> Would it be better to implement new syscall(s) with finer-grained control
> and better semantics? Then userspace would just need to to:
>
> fsync_on_steroids(fd, FSYNC_BEFORE_RENAME);
>
> and that all gets down into the filesystem which can then work out what
> it needs to do to implement the command.

You and Jamie have a good point: fsync() is a very big hammer used for
many different purposes, and it would be nice to have finer-grained
tools. There are distinct limits to what you can do to optimize a
full fsync(); we should be thrilled to get fewer of them from userspace.

Like others, I am concerned about the complexity for the programmer.
Perhaps in addition to the various fine-grained options, there is a:

fsync_on_steroids(fd, FSYNC_DO_WHAT_ORDERED_WOULD_DO);

The idea is that we've currently got a lot of code that assumes ext3
data=ordered semantics (btrfs will fulfill these assumptions too). It
would be nice if we had one simple drop-in test to distinguish between
ext3-ordered/btrfs/reiserfs and all other fs's; I think we'd get a lot
more adoption that way.

All that being said, I'd be thrilled to have fine-grained fsync().

-VAL

2009-04-23 16:16:20

by Ric Wheeler

[permalink] [raw]
Subject: Re: [RFC PATCH] fpathconf() for fsync() behavior

Valerie Aurora Henson wrote:
> On Wed, Apr 22, 2009 at 10:17:48PM -0700, Andrew Morton wrote:
>> On Wed, 22 Apr 2009 20:12:57 -0400 Valerie Aurora Henson <[email protected]> wrote:
>>
>>> In the default mode for ext3 and btrfs, fsync() is both slow and
>>> unnecessary for some important application use cases - at the same
>>> time that it is absolutely required for correctness for other modes of
>>> ext3, ext4, XFS, etc. If applications could easilyl distinguish
>>> between the two cases, they would be more likely to be correct and
>>> fast.
>>>
>>> How about an fpathconf() variable, something like _PC_ORDERED? E.g.:
>>>
>>> /* Unoptimized example optional fsync() demo */
>>> write(fd);
>>> /* Only fsync() if we need it */
>>> if (fpath_conf(fd, _PC_ORDERED) != 1)
>>> fsync(fd);
>>> rename(tmp_path, new_path);
>>>
>>> I know of two specific real-world cases in which this would
>>> significantly improve performance: (a) fsync() before rename(), (b)
>>> fsync() of the parent directory of a newly created file. Case (b) is
>>> particularly nasty when you have multiple threads creating files in
>>> the same directory because the dir's i_mutex is held across fsync() -
>>> file creates become limited to the speed of sequential fsync()s.
>>>
>>> Conceptual libc patch below.
>> Would it be better to implement new syscall(s) with finer-grained control
>> and better semantics? Then userspace would just need to to:
>>
>> fsync_on_steroids(fd, FSYNC_BEFORE_RENAME);
>>
>> and that all gets down into the filesystem which can then work out what
>> it needs to do to implement the command.
>
> You and Jamie have a good point: fsync() is a very big hammer used for
> many different purposes, and it would be nice to have finer-grained
> tools. There are distinct limits to what you can do to optimize a
> full fsync(); we should be thrilled to get fewer of them from userspace.
>
> Like others, I am concerned about the complexity for the programmer.
> Perhaps in addition to the various fine-grained options, there is a:
>
> fsync_on_steroids(fd, FSYNC_DO_WHAT_ORDERED_WOULD_DO);
>
> The idea is that we've currently got a lot of code that assumes ext3
> data=ordered semantics (btrfs will fulfill these assumptions too). It
> would be nice if we had one simple drop-in test to distinguish between
> ext3-ordered/btrfs/reiserfs and all other fs's; I think we'd get a lot
> more adoption that way.
>
> All that being said, I'd be thrilled to have fine-grained fsync().
>
> -VAL

I like the fine grained fsync variation as well. We could reimplement the
standard fsync to be safe, boring and relatively slow while allowing the few
really sophisticated users the extra options.

It would also make it easier to insure that the traditional fsync() semantics
are not weakened in unexpected ways for apps that care.

ric

2009-04-23 16:24:09

by Valerie Aurora

[permalink] [raw]
Subject: Re: [RFC PATCH] fpathconf() for fsync() behavior

On Thu, Apr 23, 2009 at 08:42:30AM -0400, Theodore Tso wrote:
> On Thu, Apr 23, 2009 at 12:21:05PM +0100, Jamie Lokier wrote:
> > Andrew Morton wrote:
> > > Would it be better to implement new syscall(s) with finer-grained control
> > > and better semantics? Then userspace would just need to to:
> > >
> > > fsync_on_steroids(fd, FSYNC_BEFORE_RENAME);
> > >
> > > and that all gets down into the filesystem which can then work out what
> > > it needs to do to implement the command.
> >
> > +1 from me. Several flags come to mind for discussion.
> > FSYNC_HARDWARE. FSYNC_ORDER_ONLY, FSYNC_FLUSH.
> > FSYNC_DATA_BEFORE_SIZE. FSYNC_BEFORE_NEW_FILE.
> >
> > Nick Piggin was working on fsync_range().
> >
> > Maybe it's time to do fsync properly?
>
> We could create such a thing, but how many application programmers
> will actually *use* them? People need to check out my blog, where my
> competence, my judgement, even my paternity was questioned about this
> issue.

I am sorry you were personally attacked over this issue - that was
uncalled for and unproductive. I wish personal attacks were less
common in open source.

> Application writers don't care about OS portability (it only has to
> work on Linux), or working on multiple filesystems (it only has work
> on ext3, and any filesystems which doesn't do automagic fsync's at the
> right magic times automagically is broken by design). This includes
> many GNOME and KDE developers. So as we concluded at the filesystem
> and storage workshop, we probably will have to keep automagic
> hueristics out there, for all of the broken applications. Heck, Linus
> even refused to call those applications "broken".
>
> So we can create a more finer-grained controlled system call ---
> although I would suggest that we just add some extra flags to
> sync_file_range() --- but it's doubtful that many application
> programmers will use it.

I remain hopeful. :) Application developers *want* to do the right
thing in general; they are just facing a hopeless catch-22 right now.
The POSIX-ly correct use of fsync() exposes them to potential
multi-second delays on 95% of file systems currently in existence -
and the fsync() isn't even needed in many cases!

For example, Red Hat is beginning to support XFS officially, and I
would be happy to fix any bugs we receive about applications failing
to do fsync() before rename() - if I was sure I wasn't introducing a
performance regression.

-VAL

2009-04-23 16:28:10

by Valerie Aurora

[permalink] [raw]
Subject: Re: [RFC PATCH] fpathconf() for fsync() behavior

On Thu, Apr 23, 2009 at 07:11:47AM -0400, Christoph Hellwig wrote:
> On Wed, Apr 22, 2009 at 08:12:57PM -0400, Valerie Aurora Henson wrote:
> > In the default mode for ext3 and btrfs, fsync() is both slow and
> > unnecessary for some important application use cases - at the same
> > time that it is absolutely required for correctness for other modes of
> > ext3, ext4, XFS, etc. If applications could easilyl distinguish
> > between the two cases, they would be more likely to be correct and
> > fast.
> >
> > How about an fpathconf() variable, something like _PC_ORDERED? E.g.:
>
> Before we add any new fpathconf varibale we need a reall (f)pathconf(at)
> syscall so that the fs driver can exposed it's characteristics, having
> to replicate that information to glibc especially for something required
> for data integrity is a receipe for a desaster.

I think a real pathconf() is a great idea, regardless of the solution
for this exact problem. Anyone want to beat me to the patch? I
really need to be working on union mounts right now.

-VAL

2009-04-23 16:43:53

by Jamie Lokier

[permalink] [raw]
Subject: Re: [RFC PATCH] fpathconf() for fsync() behavior

Theodore Tso wrote:
> On Thu, Apr 23, 2009 at 12:21:05PM +0100, Jamie Lokier wrote:
> > Maybe it's time to do fsync properly?
>
> Application writers don't care about OS portability (it only has to
> work on Linux), or working on multiple filesystems (it only has work
> on ext3, and any filesystems which doesn't do automagic fsync's at the
> right magic times automagically is broken by design). This includes
> many GNOME and KDE developers. So as we concluded at the filesystem
> and storage workshop, we probably will have to keep automagic
> hueristics out there, for all of the broken applications. Heck, Linus
> even refused to call those applications "broken".

Sure, most apps are low quality in all respects. Many don't care about
a bit of corruption when the battery runs out. There's no pressure to
get that right, and it's quite hard to get right without good practice
to follow, and good APIs which encourage good practice naturally.

Imho, the rename-automagic-safety rule now in ext3/4 is _better_ than
requiring apps to call fsync, because it doesn't require an immediate,
synchronous disk flush and hardware cache flush. Fsync requires those
things, to be useful for databases and mail servers. If you're
renaming a lot of files, 1000s of explicit fsyncs serialises badly on
rotating media.

> So we can create a more finer-grained controlled system call ---
> although I would suggest that we just add some extra flags to
> sync_file_range() --- but it's doubtful that many application
> programmers will use it.

I proposed some flags to sync_file_range() last year, and got very
little response. Mind you there's been a lot of fsync issues coming
up since then, so maybe it stirred something :-)

sync_file_range() itself is just too weird to use. Reading the man
page many times, I still couldn't be sure what it does or is meant to
do until asking on l-k a few years ago. My guess, from reading the
man page, turned out to be wrong. The recommended way to use it for a
database-like application was quite convoluted and required the app to
apply its own set of mm-style heuristics. I never did find out if it
commits data-locating metadata and file size after extending a file or
filling a hole. It never seemed to emit I/O barriers.

Does anything at all use it? Maybe sync_file_range() can be improved
though.

I hold more hope for Nick Piggins work on fsync_range() - which at
least is comprehensible :-)

It says something that instead of writing a small wrapper around
sync_file_range() which is _supposed_ to be usable as range fsync, and
fixing sync_file_range() to behave properly, Nick found it easier to
start a separate implementation :-)

-- Jamie

2009-04-23 17:24:15

by Jamie Lokier

[permalink] [raw]
Subject: Re: [RFC PATCH] fpathconf() for fsync() behavior

Valerie Aurora Henson wrote:
> All that being said, I'd be thrilled to have fine-grained fsync().

Me too.

Ted raises a very good point that fine-grained fsync will not be used
by most applications, and they expect good behaviour automatically on
crashes without it (which is imho reasonable to ask for).

A lot of apps and scripts go wrong if the disk is full too. I've seen
more truncated files from that than from system crashes.

I think both events are so rare that most of the time nobody cares.
They're corner cases. Let's face it, nearly every shell script which
edits files in a specific order (see also "make") will see
inconsistencies following a system crashes.

But the thing is: certain core packages where reliability is a
requirement will use whatever mechanisms are available. Every mail
transport and database engine seems to get this right - or try their
best given limitations of the OS - it's their job to care. Those are
widely used by other apps.

Let's face it, like most other authors, if powerfail-robustness were
that important to us on linux-kernel, barriers would never have been
off by default on ext3, and fsync would always have emitted barriers.

The thing is: you _can_ expect certain core packages, used by a large
number of apps, to make use of whatever features work well. Make
those reliable and you've solved a big chunk of the problem. Make the
core packages able to perform well at the same time, and a few more
apps will use them instead of their own implementations.

-- Jamie

2009-04-23 17:30:29

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [RFC PATCH] fpathconf() for fsync() behavior

On Thu, Apr 23, 2009 at 05:43:30PM +0100, Jamie Lokier wrote:
> Sure, most apps are low quality in all respects. Many don't care about
> a bit of corruption when the battery runs out. There's no pressure to
> get that right, and it's quite hard to get right without good practice
> to follow, and good APIs which encourage good practice naturally.
>
> Imho, the rename-automagic-safety rule now in ext3/4 is _better_ than
> requiring apps to call fsync, because it doesn't require an immediate,
> synchronous disk flush and hardware cache flush. Fsync requires those
> things, to be useful for databases and mail servers. If you're
> renaming a lot of files, 1000s of explicit fsyncs serialises badly on
> rotating media.

Well, some KDE deskops rewrites *hundreds* of files on on startup, and
users get very cranky when their window layouts, for which they have
spent *hours* optimizing just so, get lost on application crash. (It
doesn't help that KDE was rewriting files even though nothing had
changed.... I can't remember if it was via rename or truncate, but I
have a bad feeling it was via truncate.)

> sync_file_range() itself is just too weird to use. Reading the man
> page many times, I still couldn't be sure what it does or is meant to
> do until asking on l-k a few years ago. My guess, from reading the
> man page, turned out to be wrong. The recommended way to use it for a
> database-like application was quite convoluted and required the app to
> apply its own set of mm-style heuristics. I never did find out if it
> commits data-locating metadata and file size after extending a file or
> filling a hole. It never seemed to emit I/O barriers.

Have you looked at the man page for sync_file_range()? It's gotten a
lot better. My version says it was last updated 2008-05-27, and it
now answers your question about whether it commits data-locating
metadata (it doesn't). It now has a bunch of examples how how to use
the flags in combination.

In terms of making it easier to use, some predefined bitfield
combinations is all that's necessary.

As far as extending the implementation so it calls into filesystem to
commit data-locating metadata, and other semantics such as "flush on
next commit, or "flush-when-upcoming-metadata-changes-such-as-a-rename",
we might need to change the implementation somewhat (or a lot).

But the interface does make a lot of sense. (But maybe that's because
I've spent too much time staring at all of the page writeback call
paths, and compared to that even string theory is pretty simple. :-)

- Ted

2009-04-23 20:44:31

by Jamie Lokier

[permalink] [raw]
Subject: fsync_range_with_flags() - improving sync_file_range()

Theodore Tso wrote:
> > sync_file_range() itself is just too weird to use. Reading the man
> > page many times, I still couldn't be sure what it does or is meant to
> > do until asking on l-k a few years ago. My guess, from reading the
> > man page, turned out to be wrong. The recommended way to use it for a
> > database-like application was quite convoluted and required the app to
> > apply its own set of mm-style heuristics. I never did find out if it
> > commits data-locating metadata and file size after extending a file or
> > filling a hole. It never seemed to emit I/O barriers.
>
> Have you looked at the man page for sync_file_range()? It's gotten a
> lot better. My version says it was last updated 2008-05-27, and it
> now answers your question about whether it commits data-locating
> metadata (it doesn't). It now has a bunch of examples how how to use
> the flags in combination.

Yes that's the page I've read and didn't find useful :-)
The data-locating metadata is explained thus:

None of these operations write out the file’s metadata. Therefore,
unless the application is strictly performing overwrites of already-
instantiated disk blocks, there are no guarantees that the data will be
available after a crash.

First, "the file's metadata". On many OSes, fdatasync() and O_DSYNC
are documented to not write out "the file's metadata" _but_ that often
means inode attributes such as time and mode, not data-locating
metadata which is written, including file size if it increases. Some
are explicit about that (e.g. VxFS), some don't make it clear.
Clearly that is a useful behaviour for fdatasync(), and not writing
data-locating metadata is a lot less useful.

So given what I know of fdatasync() in other OSes documentation, does
"metadata" mean fdatasync() and/or sync_file_range() exclude all
metadata, or just non-data-locating metadata, and what about size changes?

But it's not that bad. sync_file_range() says a bit more, about
overwrites to instantiated data blocks.

Or does it? What about filesystems which don't overwrite all
instantiated data in place? There are a few of those. ext3 with
data=journalling. All flash filesystems. nilfs. Btrfs I'm not sure
about, seems likely. They all involve _some_ kind of metadata just to
update instantiated data.

Does the text mean sync_file_range() might be unreliable for
crash-resistant commits on those filesystems, or do _they_ have
another kind of metadata that is not excluded "the file's metadata"?

I can't tell from the man page what happens on those filesystems.

But a kernel thread from Feb 2008 revealed the truth:
sync_file_range() _doesn't_ commit data on such filesystems.

So sync_file_range() is basically useless as a data integrity
operation. It's not a substitute for fdatasync(). Therefore why
would you ever use it?

> In terms of making it easier to use, some predefined bitfield
> combinations is all that's necessary.

See above. It doesn't work on some filesystems for integrity. It may
still be useful for application-directed writeout scheduling, but I
don't imagine many apps using it for that when we have
fadvise/madvise.

The flag examples aren't as helpful as they first look. Unless you
already know about Linux dirty page writeouts, it's far from clear why
SYNC_FILE_RANGE_WRITE|SYNC_FILE_RANGE_WAIT_AFTER is not a data
integrity sync.

(After all, an aio_fsync_range + aio_suspend _would_ be a data
integrity sync.)

It's obvious when you know about the kernel's queued writeout concept
and pages dirtied after beginning writeout, but that's missing from
the man page.

Also it does too much writing, if a page has been queued for writeout,
then dirtied, _but_ that queued writeout had not reached the I/O
driver when the page was re-dirtied. In that case, it will write the
page twice but once is enough. With long writeout queues in the
kernel and a small file being regularly updated, or simply with normal
background writeout occurring as well, those unnecessary double writes
are a realistic scenario.

So in a nutshell:

The man page could do with an explanation of the writeout queue,
and why SYNC_FILE_RANGE_WAIT_BEFORE is needed for
dirtied-after-queued-for-writeout pages.

The man page could do with defining "metadata", and the
implemention should follow.

Preferably change it to behave like useful fdatasync()
implementations claim to: Exclude inode metadata such as times and
permissions, and include all relevant data-locating metadata
including file size (if in the range), indirection blocks in simple
filesystems, and tree/journal updates in modern ones.

Until the implementation follows, the man page should note that on
COW filesystems it currently guarantees nothing.

The implemention blocks too much when sYNC_FILE_RANGE_WAIT_BEFORE
has to wait for a single page, while it could be queuing others.

With a large range and SYNC_FILE_RANGE_WRITE only - which looks
like it could be asynchronous - does it block because the writeout
queue is full? Or is there no limit on the writeout queue size?
If it can block arbitrarily - how is that any better than
fsync_range() with obvious semantics?

It might be less efficient than fdatasync(), if setting all three
flags means it writes a page twice which was dirtied since the last
writeout was queued but has not hit the I/O driver. The
implementation _might_ be smarter than that (or fdatasync() have
the same inefficiency), but the optimisation is not allowed if you
treat the man page as a specification.

I'll be more than happy to submit man page improvements if we can
agree what it should really do :-)

Btw, I've Cc'd Nick Piggin who introduced a good thread proposing
fsync_range a few months ago. Imho, fsync_range with a flags
argument, and its AIO equivalent, would be great.

> As far as extending the implementation so it calls into filesystem to
> commit data-locating metadata, and other semantics such as "flush on
> next commit, or "flush-when-upcoming-metadata-changes-such-as-a-rename",
> we might need to change the implementation somewhat (or a lot).

Yes I agree. It must be allowed to return ENOTSUP for unsupportable
or unimplemented combinations - after doing all that it can anyway.

For example, SYNC_HARD (disk cache barrier) won't be supportable if
the disk doesn't do barriers, or if it's some device-mapper devices,
or if it's NFS, for example. Unless you configured the filesystem to
lie, or told it the NFS server does hardware-level commit, say.

> But the interface does make a lot of sense. (But maybe that's because
> I've spent too much time staring at all of the page writeback call
> paths, and compared to that even string theory is pretty simple. :-)

Yeah, sounds like you have studied both and gained the proper perspective :-)

I suspect all the fsync-related uncertainty about whether it really
works, including interactions with filesystem quirks, reliable and
potential bugs in filesystems, would be much easier to get right if we
only had a way to repeatably test it.

Just like other filesystem stress/regression tests.

I'm thinking running a kernel inside a VM invoked and
stopped/killed/branched is the only realistic way to test that all
data is committed properly, with/without necessary I/O barriers, and
recovers properly after a crash and resume. Fortunately we have good
VMs now, such a test seems very doable. It would help with testing
journalling & recovery behaviour too.

Is there such a test or related tool already?

-- Jamie

2009-04-23 21:13:42

by Theodore Ts'o

[permalink] [raw]
Subject: Re: fsync_range_with_flags() - improving sync_file_range()

On Thu, Apr 23, 2009 at 09:44:11PM +0100, Jamie Lokier wrote:
> Yes that's the page I've read and didn't find useful :-)
> The data-locating metadata is explained thus:
>
> None of these operations write out the file’s metadata. Therefore,
> unless the application is strictly performing overwrites of already-
> instantiated disk blocks, there are no guarantees that the data will be
> available after a crash.

Well, I thought that was clear. Today, sync_file_range(2) only works
if the data-localting metadata is already on the disk. This is useful
for databases where the tablespace is allocated ahead of time, but not
much else.

> But a kernel thread from Feb 2008 revealed the truth:
> sync_file_range() _doesn't_ commit data on such filesystems.

Because we could very easily add a flag which would cause it to commit
the data-locating metadata blocks --- or maybe we change it so that it
does commit the data-locating metadata, on the assumption that if the
data-locating metadata is already committed, which would be true for
all of its existing users, it's a no-op, and if it isn't, we should
just comit the data-locating metadata and add a call from the existing
implementation to a filesystem-provided method function.

> So sync_file_range() is basically useless as a data integrity
> operation. It's not a substitute for fdatasync(). Therefore why
> would you ever use it?

It's not useful *today*. But we could make it useful. The power of
the existing bit flags is useful, although granted it can be confusing
for the users who aren't haven't meditated deeply upon the writeback
code paths. I thought it was clear, but if it isn't we can improve
the documentation.

More to the point, given that we already have sync_file_range(2), I
would argue that it would be unfortunate to create a new system call
that has overlapping functionality but which is not a superset of
sync_file_range(2). Maybe Nick has a good reason for starting with an
entirely new system call, but if so, it would be nice if it at least
have the power of sync_file_range(2), in addition to having new
functionality.

> > But the interface does make a lot of sense. (But maybe that's because
> > I've spent too much time staring at all of the page writeback call
> > paths, and compared to that even string theory is pretty simple. :-)
>
> Yeah, sounds like you have studied both and gained the proper perspective :-)
>
> I suspect all the fsync-related uncertainty about whether it really
> works, including interactions with filesystem quirks, reliable and
> potential bugs in filesystems, would be much easier to get right if we
> only had a way to repeatably test it.

The answer today is sync_file_range(2) is purely a creature of the MM
subsystem, and doesn't do anything with respect to filesystem metadata
or barriers. Once you understand that, the rest of the man page is
pretty simple, I think. :-)

Whether or not it should *continue* to be that way in the future is a
different discussion, of course.

> I'm thinking running a kernel inside a VM invoked and
> stopped/killed/branched is the only realistic way to test that all
> data is committed properly, with/without necessary I/O barriers, and
> recovers properly after a crash and resume. Fortunately we have good
> VMs now, such a test seems very doable. It would help with testing
> journalling & recovery behaviour too.
>
> Is there such a test or related tool already?

I don't know of one. I agree it would be a useful thing to have. It
won't test barriers at the driver level, but it would be good for
testing the everything above that.

- Ted

2009-04-23 22:03:36

by Jamie Lokier

[permalink] [raw]
Subject: Re: fsync_range_with_flags() - improving sync_file_range()

Theodore Tso wrote:
> On Thu, Apr 23, 2009 at 09:44:11PM +0100, Jamie Lokier wrote:
> > Yes that's the page I've read and didn't find useful :-)
> > The data-locating metadata is explained thus:
> >
> > None of these operations write out the file’s metadata.
> > Therefore, unless the application is strictly performing
> > overwrites of already- instantiated disk blocks, there are no
> > guarantees that the data will be available after a crash.
>
> Well, I thought that was clear. Today, sync_file_range(2) only works
> if the data-localting metadata is already on the disk. This is useful
> for databases where the tablespace is allocated ahead of time, but not
> much else.

The text is clear to me, except for the ambiguity about whether
"strictly performing overwrites of already-instantiated disk blocks"
means in-place file overwrites on all filesystems, or strictly depends
on the filesystem's disk format.

It turns out "already-instantiated disk blocks" really does mean disk
blocks and not file blocks, and that low-level dependency looks so out
of place as generic file system call that I wondered if it was just a
poorly worded reference to file blocks.

> > But a kernel thread from Feb 2008 revealed the truth:
> > sync_file_range() _doesn't_ commit data on such filesystems.
>
> Because we could very easily add a flag which would cause it to commit
> the data-locating metadata blocks --- or maybe we change it so that it
> does commit the data-locating metadata, on the assumption that if the
> data-locating metadata is already committed, which would be true for
> all of its existing users, it's a no-op, and if it isn't, we should
> just comit the data-locating metadata and add a call from the existing
> implementation to a filesystem-provided method function.

I can't think of any reason why you'd not want to commit the
data-locating metadata, except for performance tweaking. And that
lives better in the kernel than the app, because it's filesystem
dependent anyway. And there are better ways to tweak performance - a
good implementation of aio_fdatasync springs to mind :-)

> > So sync_file_range() is basically useless as a data integrity
> > operation. It's not a substitute for fdatasync(). Therefore why
> > would you ever use it?
>
> It's not useful *today*. But we could make it useful. The power of
> the existing bit flags is useful, although granted it can be confusing
> for the users who aren't haven't meditated deeply upon the writeback
> code paths. I thought it was clear, but if it isn't we can improve
> the documentation.
>
> More to the point, given that we already have sync_file_range(2), I
> would argue that it would be unfortunate to create a new system call
> that has overlapping functionality but which is not a superset of
> sync_file_range(2). Maybe Nick has a good reason for starting with an
> entirely new system call, but if so, it would be nice if it at least
> have the power of sync_file_range(2), in addition to having new
> functionality.

I agree on all these points. There might be slight improvements
possible to the API, such as multiple ranges, but most things should
be doable with new flags.

I'm not sure what "power" sync_file_range has over fsync_range. I
used to think it must be better because you can separate out the
phases and let the application be clever. But since it can block
arbitrarily with none of the WAIT flags set, and since you can request
range writeout with fadvise() anyway, I'm not seeing any extra
expressive power in its current form.

> > I suspect all the fsync-related uncertainty about whether it really
> > works, including interactions with filesystem quirks, reliable and
> > potential bugs in filesystems, would be much easier to get right if we
> > only had a way to repeatably test it.
>
> The answer today is sync_file_range(2) is purely a creature of the MM
> subsystem, and doesn't do anything with respect to filesystem metadata
> or barriers. Once you understand that, the rest of the man page is
> pretty simple, I think. :-)

Heh. If only I knew the MM subsystem well enough to understand it
then :-) Right now, with all those writeback hooks, I'm not sure if
sync_file_range results in data-locating metadata being sync'd on, say,
Btrfs or not :-)

> > I'm thinking running a kernel inside a VM invoked and
> > stopped/killed/branched is the only realistic way to test that all
> > data is committed properly, with/without necessary I/O barriers, and
> > recovers properly after a crash and resume. Fortunately we have good
> > VMs now, such a test seems very doable. It would help with testing
> > journalling & recovery behaviour too.
> >
> > Is there such a test or related tool already?
>
> I don't know of one. I agree it would be a useful thing to have. It
> won't test barriers at the driver level, but it would be good for
> testing the everything above that.

With a VM like QEMU/KVM it would be quite easy to test barriers at the
driver level too, including SCSI and ATA. To thoroughly test, you
could simulate an infinitely large drive cache, flushed to the backing
file only on barrier requests. To avoid lots of reboots, just take
lots of snapshots without stopping the guest and check the snapshots
while the test continues. Nowadays you can make snapshots appear as
block devices in the host kernel for fscking and looking at the
contents.

Hmm. Looks like that harness could also test your rename-over logic
in ext3, test that journalling works in general, and to test userspace
such as databases always call fsync (or sync_file_range) at the right
times. Hmm.

-- Jamie

2009-04-28 20:28:29

by Pavel Machek

[permalink] [raw]
Subject: Re: [RFC PATCH] fpathconf() for fsync() behavior

Hi!

> > Application writers don't care about OS portability (it only has to
> > work on Linux), or working on multiple filesystems (it only has work
> > on ext3, and any filesystems which doesn't do automagic fsync's at the
> > right magic times automagically is broken by design). This includes
> > many GNOME and KDE developers. So as we concluded at the filesystem
> > and storage workshop, we probably will have to keep automagic
> > hueristics out there, for all of the broken applications. Heck, Linus
> > even refused to call those applications "broken".
> >
> > So we can create a more finer-grained controlled system call ---
> > although I would suggest that we just add some extra flags to
> > sync_file_range() --- but it's doubtful that many application
> > programmers will use it.
>
> I remain hopeful. :) Application developers *want* to do the right
> thing in general; they are just facing a hopeless catch-22 right now.
> The POSIX-ly correct use of fsync() exposes them to potential
> multi-second delays on 95% of file systems currently in existence -
> and the fsync() isn't even needed in many cases!

Perhaps we should create documentation file explaining what kernel
guarantees and what works mostly by mistake?

For example (and I tried to watch the discussion...):

On ext3 data=ordered, is "echo new > bar.new; mv -f bar.new bar "
actually safe?

Is it still safe on ext4 with workarounds, or is it that you just
shrank the window?
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html