Linux is not syncing write() data for files on synchronously mounted
filesystems, and it isn't syncing write() data for ext2/3 files which
are operating under `chattr +S'.
Is this deliberate, or a bug?
This patch makes write()s data-synchronous. There's a fix-for-the-fix
here for ext3. Other filesystems which are second-guessing the generic
layer's behaviour may also end up doing a double sync with this patch.
Really, generic_osync_inode() needs to be front-ended by
an inode_operations method.
--- linux-2.4.15-pre4/mm/filemap.c Mon Nov 12 11:16:12 2001
+++ linux-akpm/mm/filemap.c Wed Nov 14 22:50:25 2001
@@ -2916,8 +2916,10 @@ unlock:
/* For now, when the user asks for O_SYNC, we'll actually
* provide O_DSYNC. */
- if ((status >= 0) && (file->f_flags & O_SYNC))
- status = generic_osync_inode(inode, OSYNC_METADATA|OSYNC_DATA);
+ if (status >= 0) {
+ if ((file->f_flags & O_SYNC) || IS_SYNC(inode))
+ status = generic_osync_inode(inode, OSYNC_METADATA|OSYNC_DATA);
+ }
err = written ? written : status;
out:
--- linux-2.4.15-pre4/fs/ext3/file.c Mon Nov 12 11:16:12 2001
+++ linux-akpm/fs/ext3/file.c Wed Nov 14 23:52:08 2001
@@ -61,22 +61,19 @@ static int ext3_open_file (struct inode
static ssize_t
ext3_file_write(struct file *file, const char *buf, size_t count, loff_t *ppos)
{
- int ret;
struct inode *inode = file->f_dentry->d_inode;
- ret = generic_file_write(file, buf, count, ppos);
- if ((ret >= 0) && IS_SYNC(inode)) {
- if (file->f_flags & O_SYNC) {
- /*
- * generic_osync_inode() has already done the sync
- */
- } else {
- int ret2 = ext3_force_commit(inode->i_sb);
- if (ret2)
- ret = ret2;
- }
- }
- return ret;
+ /*
+ * Nasty: if the file is subject to synchronous writes then we need
+ * to force generic_osync_inode() to call ext3_write_inode().
+ * We do that by marking the inode dirty. This adds much more
+ * computational expense than we need, but we're going to sync
+ * anyway.
+ */
+ if (IS_SYNC(inode) || (file->f_flags & O_SYNC))
+ mark_inode_dirty(inode);
+
+ return generic_file_write(file, buf, count, ppos);
}
struct file_operations ext3_file_operations = {
Hi,
On Thu, Nov 15, 2001 at 12:03:57AM -0800, Andrew Morton wrote:
> Linux is not syncing write() data for files on synchronously mounted
> filesystems, and it isn't syncing write() data for ext2/3 files which
> are operating under `chattr +S'.
In the past, chattr +S and mount -o sync always resulted in sync
metadata with no guarantees about data.
I'm not sure this makes much sense, but it's what has always happened.
For directories, the behaviour is fine, in particular as it gives us
the same directory sync consistency semantics as synchronous BSD UFS.
It's not clear to me that chattr/mount sync options make _any_ sense
for regular file metadata. Rather than tightening up the semantics,
I'd actually prefer to restrict them so that they only apply to
directories. Users who set the sync bits are usually doing so for
applications like MTAs where it's directory syncing which is
what matters: the apps typically fsync the files themselves, anyway.
> Really, generic_osync_inode() needs to be front-ended by
> an inode_operations method.
Yes.
Cheers,
Stephen
"Stephen C. Tweedie" wrote:
>
> Hi,
>
> On Thu, Nov 15, 2001 at 12:03:57AM -0800, Andrew Morton wrote:
> > Linux is not syncing write() data for files on synchronously mounted
> > filesystems, and it isn't syncing write() data for ext2/3 files which
> > are operating under `chattr +S'.
>
> In the past, chattr +S and mount -o sync always resulted in sync
> metadata with no guarantees about data.
>
> I'm not sure this makes much sense, but it's what has always happened.
> For directories, the behaviour is fine, in particular as it gives us
> the same directory sync consistency semantics as synchronous BSD UFS.
>
> It's not clear to me that chattr/mount sync options make _any_ sense
> for regular file metadata. Rather than tightening up the semantics,
> I'd actually prefer to restrict them so that they only apply to
> directories. Users who set the sync bits are usually doing so for
> applications like MTAs where it's directory syncing which is
> what matters: the apps typically fsync the files themselves, anyway.
>
OK, that makes sense. Thanks. The `mount' and `chattr' manpages
need updating...
So shall we try to nail this down? Synchronous mounts and chattr +S
provide synchronous semantics for directory contents, diretory metadata
and directory inodes only. And fsync() will write out a file's data,
metadata and inode?
If this is correct then there are a few places where ext2 is
syncing stuff unnecessarily - file indirect blocks, etc. Not
very important at this stage I guess.
-
Hi,
On Thu, Nov 15, 2001 at 02:02:54PM -0800, Andrew Morton wrote:
>
> So shall we try to nail this down? Synchronous mounts and chattr +S
> provide synchronous semantics for directory contents, diretory metadata
> and directory inodes only. And fsync() will write out a file's data,
> metadata and inode?
Sounds sane. UFS sync mounts also update the inode bitmaps
synchronously, and in order, so that they can tell exactly when an
inode is valid. I guess that we don't need that, at least on ext2,
since the i_dtime gives us the necessary information automatically.
> If this is correct then there are a few places where ext2 is
> syncing stuff unnecessarily - file indirect blocks, etc. Not
> very important at this stage I guess.
It's important for people running MTAs which expect dir sync
behaviour: admins have to pay the performance cost of all file updates
being sync as well as the directory updates if they enable chattr +S
on the dir.
--Stephen
"Stephen C. Tweedie" wrote:
> It's not clear to me that chattr/mount sync options make _any_ sense
> for regular file metadata. Rather than tightening up the semantics,
> I'd actually prefer to restrict them so that they only apply to
> directories. Users who set the sync bits are usually doing so for
> applications like MTAs where it's directory syncing which is
> what matters: the apps typically fsync the files themselves, anyway.
I don't think it is correct to make that assumption.
When working on something likely to crash, I always remount my
filesystems 'sync' with the intention to have the kernel immediately
sync to disk anything and everything it is coded to do. Since the
kernel is responsible to flushing data to disk, it makes perfect sense
to have an option to sync not only metadata but data to disk
immediately, if the user desires such.
Further, expecting all apps to fsync(2) files under the right
circumstances is not reasonable. There are "normal" circumstances where
someone expects non-syncing behavior of "cat foo bar > foobar", and then
there are extentuating circumstances where another expects the shell to
sync that command immediately. Should we rewrite cat/bash/apps to all
fsync, depending on an option? Should we expect people to modify all
their shell scripts to include "/bin/sync" for those times when they
want data-sync? Such is not scalable at all.
It would make more sense to me to implement 'sync' as sync-everything,
and add 'dirsync' or 'metasync' for syncing only directories or only
metadata.
As it stands, it seems like redefining 'sync' to sync less data than is
currently done is not only changing current behavior, but providing less
to users overall.
Jeff
--
Jeff Garzik | Only so many songs can be sung
Building 1024 | with two lips, two lungs, and one tongue.
MandrakeSoft | - nomeansno
On Thursday November 15, [email protected] wrote:
> Hi,
>
> On Thu, Nov 15, 2001 at 12:03:57AM -0800, Andrew Morton wrote:
> > Linux is not syncing write() data for files on synchronously mounted
> > filesystems, and it isn't syncing write() data for ext2/3 files which
> > are operating under `chattr +S'.
>
> In the past, chattr +S and mount -o sync always resulted in sync
> metadata with no guarantees about data.
>
> I'm not sure this makes much sense, but it's what has always happened.
> For directories, the behaviour is fine, in particular as it gives us
> the same directory sync consistency semantics as synchronous BSD UFS.
>
> It's not clear to me that chattr/mount sync options make _any_ sense
> for regular file metadata. Rather than tightening up the semantics,
> I'd actually prefer to restrict them so that they only apply to
> directories. Users who set the sync bits are usually doing so for
> applications like MTAs where it's directory syncing which is
> what matters: the apps typically fsync the files themselves, anyway.
Ok, but what about openning with O_SYNC? Surely that should sync the data.
This issue came up because I was looking at how much disc activity the
various sync options causes on an ext3 filesystem, and on the journal
(which I had on a separate drive so I could differentiate journal IO
and filesystem IO).
You've probably already seen them on ext3-users, but I will repeat the
numbers below for linux-kernel readers.
If we forget about the "sync" mounts, as there is some question over
their intended semantics, the issue becomes that ext3 does not appear
to properly honour "fsync" in data=writeback mode (unless the data is
going to the journal on fsync) and does not properly honour O_SYNC
*except* in data=writeback mode.
Andrew's patch fixes O_SYNC in data=journal and data=ordered modes,
but does not fix fsync in data=writeback mode.
NeilBrown
-------------------------------------
Results:
For each test I wrote the first 40000 bytes of a pre-existing file 100
times and counted the nubmer of scsi operations (which may be
multi-block operations) on the filesytem device (first column) and the
journal device (second column).
For "async" I open, write, close 100 times.
For "fsync" I open, write, fsync, close 100 times.
For "osync" I open(,O_SYNC), write, close 100 times.
ext2 sync
async 0 0
fsync 200 0
osync 101 0
ext2 async
async 0 0
fsync 200 0
osync 101 0
ext3 sync,data=journal
async 0 200
fsync 0 300
osync 0 2
ext3 async,data=journal
async 0 0
fsync 0 200
osync 0 2
ext3 sync,data=ordered
async 100 200
fsync 100 300
osync 0 0
ext3 async,data=ordered
async 0 0
fsync 100 200
osync 1 2
ext3 sync,data=writeback
async 0 200
fsync 0 300
osync 100 2
ext3 async,data=writeback
async 0 0
fsync 0 200
osync 100 2
Jeff Garzik wrote:
>
> As it stands, it seems like redefining 'sync' to sync less data than is
> currently done is not only changing current behavior, but providing less
> to users overall.
>
Persuasively argued. You appear to have your wish, as this
patch was merged in -pre5.
A `dirsync' option does make sense though, for the reasons which
Stephen outlined.
The whole handling of synchronous operations needs a rip-up-and-rewrite
anyway. We're currently holding onto a stack of locks while waiting
for the disk to spin round and round. It's a great scalability bottleneck
for multiple threads doing things in the same directory. This is
something I shall look at when the kernel versions turn odd.
-
On Fri, 16 Nov 2001, Andrew Morton wrote:
> A `dirsync' option does make sense though, for the reasons which
> Stephen outlined.
So we could then have:
- (no option) == async, which only syncs file + data on fsync() or
O_SYNC (BSD calls this async, it may corrupt file systems because
writes are out-of-order)
- dirsync, which syncs directories and metadata and causes ordered
writes thereof (BSD calls this noasync), no chance of corrupting
on-disk structure unrecoverably.
- sync, which syncs all filesystem operations (BSD calls this sync
also), will have at most 1 dirty block at a time on non-journaled file
systems(?)
I expect sync to be faster on journalled file systems in that case,
because "in-order execution" to journal will probably cause linear
writes, while on ext2, it will involve seeking.
--
Matthias Andree
"They that can give up essential liberty to obtain a little temporary
safety deserve neither liberty nor safety." Benjamin Franklin
Hi,
On Thu, Nov 15, 2001 at 07:19:43PM -0500, Jeff Garzik wrote:
> When working on something likely to crash, I always remount my
> filesystems 'sync' with the intention to have the kernel immediately
> sync to disk anything and everything it is coded to do.
The kernel has, in my memory, never behaved like that on sync mounts.
mount -o sync was always intended just to give people the BSD-style
sync metadata updates that some users expected.
The "mount" man page is wrong on this one.
> Since the
> kernel is responsible to flushing data to disk, it makes perfect sense
> to have an option to sync not only metadata but data to disk
> immediately, if the user desires such.
If you want to sync _everything_, it's at least 5 seeks per write
syscall when you're writing a new file: superblock, group descriptor,
block bitmap, inode, data, and potentially inode indirect.
There's no point doing all that, especially since some of that data is
redundant and will be rebuilt by e2fsck anyway after a crash.
Is it really such an important feature that we're willing to suffer a
factor-of-100 or more slowdown for it?
> Further, expecting all apps to fsync(2) files under the right
> circumstances is not reasonable. There are "normal" circumstances where
> someone expects non-syncing behavior of "cat foo bar > foobar", and then
> there are extentuating circumstances where another expects the shell to
> sync that command immediately. Should we rewrite cat/bash/apps to all
> fsync, depending on an option? Should we expect people to modify all
> their shell scripts to include "/bin/sync" for those times when they
> want data-sync? Such is not scalable at all.
Not-scalable is doing 5000 seeks to write a 4MB file.
The behaviour you are talking about now, "cat foo bar > foobar" and
expecting it to be intact on return, is *not the same thing*. The
sync mount option is there to order metadata writes for predictable
recovery of the directory structure. In the "cat" case, nobody cares
what the inode is like during the write. All that is desired in that
example is fsync-on-close, and it is insane to implement
fsync-on-close by writing every single block of the file
synchronously.
At ALS, an ext3 user asked why ext3 performance was entirely unusable
under mount -o sync (he had a broken config which accidentally set an
ext3 mount synchronous), whereas ext2 was OK. I only realised
afterwards that this was because of ext3's ordered data writes:
whereas ext2 was just syncing the inodes and indirect blocks on write,
ext3 was syncing the data too as part of the ordered data guarantees,
and performance was totally destroyed by the extra seeks.
"sync to keep the fs structures intact" and "sync to keep this file
intact" are two totally different things. In the latter case, we only
care about the file contents as a whole, so fsync-on-close is far more
appropriate. If we want that, lets add it as a new option, but I
don't see the benefit in making o- sync do all file data writes 100%
synchronously.
Cheers,
Stephen
"Stephen C. Tweedie" wrote:
> On Thu, Nov 15, 2001 at 07:19:43PM -0500, Jeff Garzik wrote:
> > When working on something likely to crash, I always remount my
> > filesystems 'sync' with the intention to have the kernel immediately
> > sync to disk anything and everything it is coded to do.
>
> The kernel has, in my memory, never behaved like that on sync mounts.
> mount -o sync was always intended just to give people the BSD-style
> sync metadata updates that some users expected.
>
> The "mount" man page is wrong on this one.
No, that's always been a bug. Occasionally it gets brought up on lkml
and people have talked about fixing it.
> > Since the
> > kernel is responsible to flushing data to disk, it makes perfect sense
> > to have an option to sync not only metadata but data to disk
> > immediately, if the user desires such.
>
> If you want to sync _everything_, it's at least 5 seeks per write
> syscall when you're writing a new file: superblock, group descriptor,
> block bitmap, inode, data, and potentially inode indirect.
>
> There's no point doing all that, especially since some of that data is
> redundant and will be rebuilt by e2fsck anyway after a crash.
>
> Is it really such an important feature that we're willing to suffer a
> factor-of-100 or more slowdown for it?
mount -o dirsync, if you don't want the slowdown.
I can write a fast, incorrect implementation of 'sync' too. Let's try
for a correct implementation.
> > Further, expecting all apps to fsync(2) files under the right
> > circumstances is not reasonable. There are "normal" circumstances where
> > someone expects non-syncing behavior of "cat foo bar > foobar", and then
> > there are extentuating circumstances where another expects the shell to
> > sync that command immediately. Should we rewrite cat/bash/apps to all
> > fsync, depending on an option? Should we expect people to modify all
> > their shell scripts to include "/bin/sync" for those times when they
> > want data-sync? Such is not scalable at all.
>
> Not-scalable is doing 5000 seeks to write a 4MB file.
>
> The behaviour you are talking about now, "cat foo bar > foobar" and
> expecting it to be intact on return, is *not the same thing*. The
> sync mount option is there to order metadata writes for predictable
> recovery of the directory structure. In the "cat" case, nobody cares
> what the inode is like during the write. All that is desired in that
> example is fsync-on-close, and it is insane to implement
> fsync-on-close by writing every single block of the file
> synchronously.
The "sync" mount option's purpose is and should be this simple:
"All I/O to the file system should be done synchronously."
If you want different behavior, use a different option.
I still do not understand why you seem to think modifying tons of
programs and shell scripts is reasonable, in order to get "true" sync
behavior.
> At ALS, an ext3 user asked why ext3 performance was entirely unusable
> under mount -o sync (he had a broken config which accidentally set an
> ext3 mount synchronous), whereas ext2 was OK. I only realised
> afterwards that this was because of ext3's ordered data writes:
> whereas ext2 was just syncing the inodes and indirect blocks on write,
> ext3 was syncing the data too as part of the ordered data guarantees,
> and performance was totally destroyed by the extra seeks.
Sure. Seems perfectly normal and expected for an fs mounted 'sync'.
> "sync to keep the fs structures intact" and "sync to keep this file
> intact" are two totally different things. In the latter case, we only
> care about the file contents as a whole, so fsync-on-close is far more
> appropriate. If we want that, lets add it as a new option, but I
> don't see the benefit in making o- sync do all file data writes 100%
> synchronously.
The benefit is a correct implementation..
Jeff
--
Jeff Garzik | Only so many songs can be sung
Building 1024 | with two lips, two lungs, and one tongue.
MandrakeSoft | - nomeansno
May I suggest another way of thinking about this issue. ext3 allows one
to journal everything, or just metadata. Why -avoid- this distrinction
when it comes to the 'sync' mount option (and 'dirsync' or 'metasync')?
--
Jeff Garzik | Only so many songs can be sung
Building 1024 | with two lips, two lungs, and one tongue.
MandrakeSoft | - nomeansno
On Fri, 16 Nov 2001, Stephen Tweedie wrote:
> If you want to sync _everything_, it's at least 5 seeks per write
> syscall when you're writing a new file: superblock, group descriptor,
> block bitmap, inode, data, and potentially inode indirect.
>
> There's no point doing all that, especially since some of that data is
> redundant and will be rebuilt by e2fsck anyway after a crash.
The file system cannot judge on what's needed, neither can the file
system developer.
The developer however can offer the administrator
and/or application developer the corresponding interfaces. In the sense
of portability, however, it may be rather useful to correspond to BSD
semantics, so I vote to change sync and offer dirsync -- it might be
useful to use BSD nomenclature though and offer a synonymous noasync
option (which also has dirsync semantics as laid out in this thread
before).
It does not matter whether you look at BSD or Linux man pages, either
one documents "sync - All I/O to the file system should be done
synchronously." It should behave like documented. Let's not introduce
compatibility problems by fixing the documentation.
> Is it really such an important feature that we're willing to suffer a
> factor-of-100 or more slowdown for it?
It depends on the task. If the machine regularly reads data from an
external interface and records it to disk, like in a log, synchronous
data writes may be important.
BTW, Wietse Venema was dazzled when he got to know that Linux values its
syslog (defaults to all-synchronous writes) more than its mail (defaults
to all-asynchronous writes that not even a file fsync() rectifies).
> Not-scalable is doing 5000 seeks to write a 4MB file.
If that's what the admin wants, let him do so.
> what the inode is like during the write. All that is desired in that
> example is fsync-on-close, and it is insane to implement
> fsync-on-close by writing every single block of the file
> synchronously.
Yes, but you cannot generalize this example.
> appropriate. If we want that, lets add it as a new option, but I
> don't see the benefit in making o- sync do all file data writes 100%
> synchronously.
Compatibility and reliability are good points.
--
Matthias Andree
"They that can give up essential liberty to obtain a little temporary
safety deserve neither liberty nor safety." Benjamin Franklin
Neil Brown wrote:
>
> Andrew's patch fixes O_SYNC in data=journal and data=ordered modes,
> but does not fix fsync in data=writeback mode.
>
Neil, thanks for spotting this.
patch-2.4.10-pre11 introduced a new per-inode dirty buffer queue,
inode.i_dirty_data_buffers.
generic_commit_write() was changed so that it placed buffers on that
queue, rather than i_dirty_buffers. A new function fsync_inode_data_buffers()
was added for writing the new buffer list out.
That patch changed ext2 and reiserfs to use the new function. The patch was
not copied to any mailing list. There was no announcement and no discussion.
There are no comments in the code indicating the distinction between these
lists and why we have two such.
This change broke fsync() for three in-kernel filesystems as well
as the then-out-of-kernel ext3 and presumably any other buffer-based
Linux filesystems.
Ah. google finds this, from April:
http://www.uwsg.iu.edu/hypermail/linux/kernel/0104.1/0810.html
which attempts to explain why the separate list was added. It
appears to be bogus. There is no reason why, if correctly done,
syncing a list of ext2 data buffers and indirects should be
significantly slower than syncing just the data.
Here's a fix. A better fix would be to back out the unnecesary
list and its support functions.
--- linux-2.4.15-pre5/fs/minix/file.c Mon Sep 10 07:31:30 2001
+++ linux-akpm/fs/minix/file.c Fri Nov 16 22:07:53 2001
@@ -30,8 +30,10 @@ struct inode_operations minix_file_inode
int minix_sync_file(struct file * file, struct dentry *dentry, int datasync)
{
struct inode *inode = dentry->d_inode;
- int err = fsync_inode_buffers(inode);
+ int err;
+ err = fsync_inode_buffers(inode);
+ err |= fsync_inode_data_buffers(inode);
if (!(inode->i_state & I_DIRTY))
return err;
if (datasync && !(inode->i_state & I_DIRTY_DATASYNC))
--- linux-2.4.15-pre5/fs/sysv/file.c Thu Nov 15 23:44:29 2001
+++ linux-akpm/fs/sysv/file.c Fri Nov 16 22:09:20 2001
@@ -35,8 +35,10 @@ struct inode_operations sysv_file_inode_
int sysv_sync_file(struct file * file, struct dentry *dentry, int datasync)
{
struct inode *inode = dentry->d_inode;
- int err = fsync_inode_buffers(inode);
+ int err;
+ err = fsync_inode_buffers(inode);
+ err |= fsync_inode_data_buffers(inode);
if (!(inode->i_state & I_DIRTY))
return err;
if (datasync && !(inode->i_state & I_DIRTY_DATASYNC))
--- linux-2.4.15-pre5/fs/udf/fsync.c Mon Jun 11 19:15:27 2001
+++ linux-akpm/fs/udf/fsync.c Fri Nov 16 22:11:03 2001
@@ -45,6 +45,7 @@ int udf_fsync_inode(struct inode *inode,
int err;
err = fsync_inode_buffers(inode);
+ err |= fsync_inode_data_buffers(inode);
if (!(inode->i_state & I_DIRTY))
return err;
if (datasync && !(inode->i_state & I_DIRTY_DATASYNC))
--- linux-2.4.15-pre5/fs/ext3/fsync.c Thu Nov 15 23:44:29 2001
+++ linux-akpm/fs/ext3/fsync.c Fri Nov 16 22:09:34 2001
@@ -62,6 +62,7 @@ int ext3_sync_file(struct file * file, s
* we'll end up waiting on them in commit.
*/
ret = fsync_inode_buffers(inode);
+ ret |= fsync_inode_data_buffers(inode);
ext3_force_commit(inode->i_sb);